Cleaning HTML in a content editor with JavaScript using jQuery

A few days ago I looked at a site that friends of mine had created. They use a free content management system for it, and they had pasted content from Word. The resulting HTML was quite dirty...

Years ago I wrote a JavaScript routine for cleaning up pasted HTML (for instance the content of a Word document) in a web-based content editor, purely for IE 5/6. It worked quite well, but it mixed blacklisting and whitelisting. Now that I use jQuery, it seemed much easier to write such a script for multiple browsers, this time based purely on whitelisting.

The script I created is not meant to secure the content that will be submitted (think of cross-site scripting). It runs in the browser, so you can never be sure that only cleaned HTML will be submitted. It is only meant to help content editors who are unaware of these pasting issues.

I will show you how I built the script, and hopefully it will be of some use to you. I started with a function that loops through all elements of the HTML and removes every element that is not in the whitelist. Elements that are not whitelisted are replaced by their own content, so the text inside them is kept. The script looks like this:

var tagsAllowed = "|h1|h2|h3|p|span|div|a|b|strong|br|hr|";

//Extension for getting the tagName 
$.fn.tagName = function() {
    return this.get(0).tagName.toLowerCase();
}

function clearUnsupportedTagsAndAttributes(obj) {
    $(obj).children().each(function() {
        //recursively down the tree
        clearUnsupportedTagsAndAttributes($(this));
        var tag = $(this).tagName();
        if(tagsAllowed.indexOf("|" + tag + "|") < 0) {
            $(this).replaceWith($(this).html());
        }
    });
}
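
The pipe characters around each tag name are what make the indexOf lookup safe: searching for "|i|" cannot accidentally match the "i" inside "div". Here is a minimal sketch of just that check, runnable without jQuery or a browser. The helper names isTagAllowed and naiveIsTagAllowed are mine and are not part of the script:

```javascript
// Same whitelist format as above: every tag name is wrapped in pipes.
var tagsAllowed = "|h1|h2|h3|p|span|div|a|b|strong|br|hr|";

// Hypothetical helper, for illustration only.
function isTagAllowed(tag) {
    return tagsAllowed.indexOf("|" + tag.toLowerCase() + "|") >= 0;
}

// Without the delimiters, a plain substring search gives false positives:
// "i" is not in the whitelist, but it is a substring of "div".
function naiveIsTagAllowed(tag) {
    return tagsAllowed.indexOf(tag.toLowerCase()) >= 0;
}

console.log(isTagAllowed("B"));      // true
console.log(isTagAllowed("font"));   // false
console.log(isTagAllowed("i"));      // false
console.log(naiveIsTagAllowed("i")); // true, a false positive
```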

Then I realized that when a script or style tag is not whitelisted, you do not want to keep the content of that tag in the HTML either. So I changed the following part of the script:

...
if(tagsAllowed.indexOf("|" + tag + "|") < 0) {
    if(tag == "style" || tag == "script")
        $(this).remove();
    else
        $(this).replaceWith($(this).html());
}
...
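
Each element now has three possible outcomes: keep it, unwrap it (drop the tag but keep its content), or remove it together with its content. Here is a small sketch of that decision as a plain function. The names actionForTag and tagsRemovedWithContent are illustrative and do not appear in the script:

```javascript
var tagsAllowed = "|h1|h2|h3|p|span|div|a|b|strong|br|hr|";
// Tags whose content should disappear together with the tag itself.
var tagsRemovedWithContent = "|script|style|";

// Hypothetical helper: returns "keep", "remove" or "unwrap" for a tag name.
function actionForTag(tag) {
    var key = "|" + tag.toLowerCase() + "|";
    if (tagsAllowed.indexOf(key) >= 0) return "keep";
    if (tagsRemovedWithContent.indexOf(key) >= 0) return "remove";
    return "unwrap";
}

console.log(actionForTag("p"));      // keep
console.log(actionForTag("font"));   // unwrap
console.log(actionForTag("SCRIPT")); // remove
```

The whitelist is checked first, so a script or style tag that has been whitelisted on purpose is still kept.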

Now that all non-whitelisted tags are removed, I would also like to remove unwanted attributes. JavaScript can be executed from attributes such as onmouseover or onclick, and you may not want those in the HTML. For this I added the following code:

var attributesAllowed = {}; //plain object, because the tag name is used as key
attributesAllowed["span"] = "|id|class|";
attributesAllowed["div"] = "|id|class|onclick|style|";
attributesAllowed["a"] = "|id|class|href|name|";

I then added an else block to the clearUnsupportedTagsAndAttributes function. In IE I needed to catch errors, because some of the attributes IE loops through are not supported:

    ...
    else {
        var attrs = $(this).get(0).attributes;
        //Loop backwards: attributes is a live collection, so removing an item shifts the ones after it
        for(var i = attrs.length - 1; i >= 0; i--) {
            try {
                if(attributesAllowed[tag] == null || 
                attributesAllowed[tag].indexOf("|" + attrs[i].name.toLowerCase() + "|") < 0) {
                    $(this).removeAttr(attrs[i].name);
                }
            }
            catch(e) {} //Fix for IE, catch unsupported attributes like contenteditable and dataFormatAs
        }
    }
    ...
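
One thing to keep in mind when removing attributes inside a loop: in most browsers attributes is a live collection, so removing an attribute shifts the remaining ones down, and a loop that counts upwards can skip the attribute right after a removed one. The effect can be reproduced without a browser by using a plain array as a stand-in for the live collection (an assumption made only to keep the example runnable). Iterating from the end avoids the problem:

```javascript
// A plain array stands in for a live attribute collection:
// removing an item shrinks the list and shifts later items down.
function removeDisallowedForward(names, allowed) {
    var list = names.slice();
    for (var i = 0; i < list.length; i++) {
        if (allowed.indexOf("|" + list[i] + "|") < 0) {
            list.splice(i, 1); // the next item slides into index i and is skipped
        }
    }
    return list;
}

function removeDisallowedBackward(names, allowed) {
    var list = names.slice();
    for (var i = list.length - 1; i >= 0; i--) {
        if (allowed.indexOf("|" + list[i] + "|") < 0) {
            list.splice(i, 1); // only indexes that were already visited shift
        }
    }
    return list;
}

var names = ["onclick", "onmouseover", "class"];
console.log(removeDisallowedForward(names, "|id|class|"));  // [ 'onmouseover', 'class' ]
console.log(removeDisallowedBackward(names, "|id|class|")); // [ 'class' ]
```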

Next I wanted to remove empty tags. Some tags have to stay in the content even when they are empty, for example the br tag, so I added the following variable and updated the else block above:

var emptyTagsAllowed = "|br|hr|";

    ...
    else {
        if($(this).html().replace(/^\s+|\s+$/g, '') == "" && emptyTagsAllowed.indexOf("|" + tag + "|") < 0)
            $(this).remove();
        else
        {
            var attrs = $(this).get(0).attributes;
    ...
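
The regular expression in the condition trims leading and trailing whitespace before the comparison with an empty string. Content pasted from Word tends to contain paragraphs that hold only &nbsp;, which html() returns as the literal entity text, so that test does not treat them as empty. Here is a sketch of the check as a standalone function, plus a variant that also counts &nbsp; as whitespace. Both function names and the variant itself are my additions and are not part of the script:

```javascript
// Same test as in the condition above: nothing left after trimming whitespace.
function isEmptyHtml(html) {
    return html.replace(/^\s+|\s+$/g, "") === "";
}

// Variant (assumption): treat &nbsp; entities as whitespace too.
function isEmptyHtmlLoose(html) {
    return isEmptyHtml(html.replace(/&nbsp;/gi, " "));
}

console.log(isEmptyHtml("  \n\t "));       // true
console.log(isEmptyHtml(" text "));        // false
console.log(isEmptyHtml("&nbsp;"));        // false
console.log(isEmptyHtmlLoose(" &nbsp; ")); // true
```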

The last hurdle was comments in the HTML. Content pasted from Word also contains comments, and these were not yet removed. I created an extension for this that I can call after cleaning the HTML with the clearUnsupportedTagsAndAttributes function:

//Extension for removing comments
$.fn.removeComments = function() {
    this.each(
        function(i, objNode){
            var objChildNode = objNode.firstChild;
            while (objChildNode) {
                if (objChildNode.nodeType === 8) {
                    var next = objChildNode.nextSibling;
                    objNode.removeChild(objChildNode);
                    objChildNode = next;
                }
                else
                {
                    if (objChildNode.nodeType === 1) {
                        //recursively down the tree
                        $(objChildNode).removeComments();
                    }
                    objChildNode = objChildNode.nextSibling;
                }
            }
        }
    );
}
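
The numbers in the extension are DOM node types: 1 is an element node and 8 is a comment node. To show the traversal without a browser, here is a sketch of the same idea on a tree of plain objects. The objects with nodeType and childNodes properties are stand-ins for real DOM nodes, and removeCommentNodes is a hypothetical name:

```javascript
var ELEMENT_NODE = 1, TEXT_NODE = 3, COMMENT_NODE = 8;

// Drops comment children and recurses into element children.
function removeCommentNodes(node) {
    node.childNodes = node.childNodes.filter(function (child) {
        return child.nodeType !== COMMENT_NODE;
    });
    node.childNodes.forEach(function (child) {
        if (child.nodeType === ELEMENT_NODE) removeCommentNodes(child);
    });
    return node;
}

// Stand-in tree for: <div><!--[if gte mso 9]--><p>text<!--x--></p></div>
var tree = {
    nodeType: ELEMENT_NODE, name: "div", childNodes: [
        { nodeType: COMMENT_NODE, data: "[if gte mso 9]" },
        { nodeType: ELEMENT_NODE, name: "p", childNodes: [
            { nodeType: TEXT_NODE, data: "text" },
            { nodeType: COMMENT_NODE, data: "x" }
        ] }
    ]
};

removeCommentNodes(tree);
console.log(tree.childNodes.length);               // 1
console.log(tree.childNodes[0].childNodes.length); // 1
```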

Then, to my shock, pasting code from Visual Studio caused an error in the clearUnsupportedTagsAndAttributes function. The HTML was not well formed, and looping through it failed. So I added a variable someError, initially false, and put a try/catch in the clearUnsupportedTagsAndAttributes function. In the catch block I set someError to true. I now show an error message, but what you probably want in case of an error is to paste only the text. You can see the full JavaScript here: CleanContent.js
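
The error handling is only described in words here, so the following is a sketch of the general shape and not the code from CleanContent.js. It assumes the cleaning step is passed in as a function and that, on failure, falling back to plain text means stripping tags with a simple regular expression. The names cleanOrFallback and stripTags are hypothetical, and the regular expression is a crude approximation that is acceptable for helping editors but not for security:

```javascript
// Crude tag stripper (assumption): good enough as a paste fallback only.
function stripTags(html) {
    return html.replace(/<[^>]*>/g, "");
}

// Runs the cleaner; if it throws, flags the error and returns the text only.
function cleanOrFallback(html, cleaner) {
    var someError = false;
    var result;
    try {
        result = cleaner(html);
    } catch (e) {
        someError = true;
        result = stripTags(html);
    }
    return { html: result, someError: someError };
}

// A cleaner that always fails, standing in for a failure on malformed input.
var failingCleaner = function () { throw new Error("malformed HTML"); };

console.log(cleanOrFallback("<b>bold</b> text", failingCleaner));
// { html: 'bold text', someError: true }
```

The caller can inspect someError to decide whether to show a message to the editor.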

A working example

You can try it in the example underneath that is using a simple contenteditable div area. Click on the Clean HTML button to clean up the HTML in the content editor. I have already inserted some 'unclean' HTML, but you can also try pasting from Word.

[Interactive demo: a rich text editor pre-filled with sample content (a supported heading, a non-supported heading, red, bold and italic text, and sample MS Word content), followed by a view of the underlying HTML.]

You can tweak the allowed tags and attributes so that the script suits your needs, or extend it further. Happy scripting...