De-purification


Hi,

H5P automatically purifies text fields, which is a problem for me because the value is not used for display.

According to the documentation:

« Type: text
Description: Defines all forms of text for display, from simple strings to HTML blocks. Several widgets are available for use in the editor for text fields. Strings will be purified and can contain HTML character references such as &#39; for an apostrophe. If not used for displaying HTML, strings might need de-purification. »

However, I can't find where the purify function is, so I don't know exactly what I need to do to de-purify and be sure I don't miss any characters.

Any help with that?

Edit: Okay, the purification is not done in JavaScript but in PHP: https://github.com/h5p/h5p-php-library/blob/fe3f4a504d27218b7ac47662e492...

However, this is complex. Is there an easy way to de-purify or to disable purification? Thanks!

Edit 2: I think I will encode the text as base64, but that is inefficient in both execution time and storage :/

Edit 3: Hmm, it seems that if I don't set the tags attribute, the purification is simple: https://github.com/h5p/h5p-php-library/blob/fe3f4a504d27218b7ac47662e492... so it should be easy to de-purify now that I know that.

Edit 4: Because double_encode is set to false, the function can't be reversed (two different strings can be encoded to the same output). So I am forced to use the base64 trick :(
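
To make the collision concrete, here is a minimal sketch that imitates the PHP call in JavaScript. The function name purifyNoDoubleEncode is made up for illustration, and the regular expression is only an approximation of how PHP decides that something is "already an entity" (assumed here: anything shaped like &name;, &#123; or &#x1F;, with ENT_QUOTES-style quote handling). It is not the H5P code itself.

```javascript
// Rough imitation of htmlspecialchars(..., double_encode = false).
// Assumption: an existing entity is anything shaped like &name; &#123; or &#x1F;
function purifyNoDoubleEncode(str) {
  return str
    // Encode only ampersands that do not already start an entity.
    .replace(/&(?!(?:#\d+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);)/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/'/g, '&#039;')
    .replace(/"/g, '&quot;');
}

const a = purifyNoDoubleEncode('&');     // bare ampersand gets encoded
const b = purifyNoDoubleEncode('&amp;'); // already an entity, left alone
console.log(a === b); // true: two different inputs, one output
```

Since both inputs end up as the same string, no decoder can tell which one was originally entered.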

Edit 5: I have an idea: I can emulate htmlspecialchars with double encoding in JavaScript. That way, when PHP applies htmlspecialchars without double encoding, it will not change the string. Two different strings will then always be encoded differently, and I can reverse the htmlspecialchars!

Thanks !

Here's my solution; it seems to work:

CodeMirror.H5P = {
    // Always encodes, including ampersands that already start an entity (double encoding).
    encode: function (str) { // adapted from https://github.com/locutusjs/locutus/blob/master/src/php/strings/htmlspe...
        return str.replace(/&/g, '&amp;') // must run first so later entities are not re-encoded
            .replace(/</g, '&lt;')
            .replace(/>/g, '&gt;')
            .replace(/'/g, '&#039;')
            .replace(/"/g, '&quot;');
    },
    // Exact inverse of encode.
    decode: function (str) { // adapted from https://github.com/locutusjs/locutus/blob/master/src/php/strings/htmlspe...
        return str.replace(/&lt;/g, '<')
            .replace(/&gt;/g, '>')
            .replace(/&#0*39;/g, "'")
            .replace(/&quot;/g, '"')
            .replace(/&amp;/g, '&'); // must run last so "&amp;lt;" becomes "&lt;", not "<"
    }
};

 


Hi!

Is something like this what you're looking for?

static htmlDecode(input) {
  const dparser = new DOMParser().parseFromString(input, 'text/html');
  return dparser.documentElement.textContent;
}

And have you looked at well-established solutions for this common problem, such as "he"? https://www.npmjs.com/package/he

Best,

Oliver 

This would not work if the value is "& abcdef &amp;". H5P would encode it to "&amp; abcdef &amp;", which would then be decoded to "& abcdef &" instead of "& abcdef &amp;".

However, I finally found a solution (see my other comment): to prevent this situation, I save the value already encoded. That way "& abcdef &amp;" becomes "&amp; abcdef &amp;amp;", and H5P leaves it as "&amp; abcdef &amp;amp;" (unchanged, because double_encode is disabled and the string is already encoded, so nothing is added). I can then decode it back to "& abcdef &amp;".
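
For anyone who wants to check the round trip, here is a small self-contained sketch using the example value above. The encode and decode functions follow the same idea as the pair posted earlier in the thread; simulatedPurify is a hypothetical stand-in for what H5P does on the PHP side (assumed to behave like htmlspecialchars with double encoding disabled), not the real implementation.

```javascript
// Full encode: every ampersand is encoded, even inside existing entities.
function encode(str) {
  return str.replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/'/g, '&#039;')
    .replace(/"/g, '&quot;');
}

// Inverse of encode; &amp; is handled last on purpose.
function decode(str) {
  return str.replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&#0*39;/g, "'")
    .replace(/&quot;/g, '"')
    .replace(/&amp;/g, '&');
}

// Hypothetical stand-in for the server-side purification (an assumption).
function simulatedPurify(str) {
  return str
    .replace(/&(?!(?:#\d+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);)/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/'/g, '&#039;')
    .replace(/"/g, '&quot;');
}

const original = '& abcdef &amp;';
const stored = encode(original);          // '&amp; abcdef &amp;amp;'
const purified = simulatedPurify(stored); // nothing left to encode
console.log(purified === stored);           // true
console.log(decode(purified) === original); // true
```

The point of encoding first is that the output contains no bare ampersands, angle brackets, or quotes, so the purification step has nothing left to change.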