I have a site where users can post stuff (as in forums, comments, etc) using a customised implementation of TinyMCE. A lot of them like to copy & paste from Word, which means their input often comes with a plethora of associated MS inline formatting.
I can't just get rid of <span whatever> as TinyMCE relies on the span tag for some of its formatting, and I can't (and don't want to) force said users to use TinyMCE's "Paste From Word" feature (which doesn't seem to work that well anyway).
Anyone know of a library/class/function that would take care of this for me? It must be a common problem, though I can't find anything definitive. I've been thinking recently that a series of brute-force regexes looking for MS-specific patterns might do the trick, but I don't want to re-write something that may already be available unless I must.
Also, fixing of curly quotes, em-dashes, etc would be good. I have my own stuff to do this now, but I'd really just like to find one MS-conversion filter to rule them all.
HTML Purifier will create standards compliant markup and filter out many possible attacks (such as XSS).
For faster cleanups that don't require XSS filtering, I use the PECL extension Tidy which is a binding for the Tidy HTML utility.
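For reference, a minimal sketch of that Tidy-based cleanup (the option names bare, word-2000, clean and show-body-only are standard HTML Tidy options; the helper name and variable are just illustrative):

// Illustrative helper: clean pasted Word markup with the Tidy extension.
function clean_word_html(string $html): string
{
    $config = [
        'bare'           => true,  // strip Microsoft-specific markup (<o:p>, smart quotes, ...)
        'word-2000'      => true,  // aggressively remove surplus Word cruft
        'clean'          => true,  // replace presentational markup with CSS where possible
        'show-body-only' => true,  // return only the fragment, not a full document
        'wrap'           => 0,     // don't hard-wrap the output
    ];
    return tidy_repair_string($html, $config, 'utf8') ?: $html;
}

echo clean_word_html($pastedHtml); // $pastedHtml holds the submitted editor content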
If those don't help you, I suggest you switch to FCKEditor which has this feature built-in.
In my case, this worked just fine:
$text = strip_tags($text, '<p><a><em><span>');
Rather than trying to pull out stuff you don't want, such as embedded Word XML, you can just specify which tags are allowed.
The website http://word2cleanhtml.com/ does a good job of converting from Word. I'm using it in PHP via scraping to process some legacy HTML, and so far it's working pretty well (the result is very clean <p>, <b> code). Of course, being an external service, it's not a good fit for online processing like your case.
If you try it and it returns a lot of 400 errors, try filtering the HTML with Tidy first.
In my case, there was a pattern. The unwanted part always started with
<!-- [if gte mso 9]>
and ended by an
<![endif]-->
So my solution was to cut out everything before and after this block:
// Keep everything before the first "<!-" (the start of the Word conditional comment).
$array = explode("<!-", $string, 2);
$begin = $array[0];
// strrchr() only uses the first character of its needle, so this finds the last '['
// in the string; substr(..., 10) then skips past the "[endif]-->" marker.
$end = substr(strrchr($string, '[endif]-->'), 10);
echo $begin . $end;
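If the blocks really are that regular, the same cut can be done in one step with a regex; this is just a sketch of that idea, not a general-purpose Word cleaner:

// Remove every Word conditional-comment block of the form
// <!--[if ...]> ... <![endif]--> (/s lets . span newlines, ? keeps the match non-greedy).
$clean = preg_replace('/<!--\s*\[if[^\]]*\]>.*?<!\[endif\]-->/is', '', $string);
echo $clean;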
I'm trying to secure HTML coming from external sources, for display on my own web control panel (to load in my browser, read, and delete).
strip_tags is completely unsafe and useless.
I went through a ton of trouble to make my own DOMDocument-based HTML securer, removing unsafe elements and attributes. Then I got linked to this nightmare of a webpage: https://owasp.org/www-community/xss-filter-evasion-cheatsheet
That document convinced me that not only is my "clever" HTML securer not enough -- there are far more things that can be done to inject malicious code into HTML than I ever could imagine. That list of things gives me the creeps for real. What a cold shower.
Anyway, looking for a (non-Google-infested) HTML securer for PHP, I found this: http://htmlpurifier.org/
While it seems OK at first glance, some signs point toward sloppiness, which is the last thing you want in a security context:
On http://htmlpurifier.org/download , it claims that this is the official repository: https://repo.or.cz/w/htmlpurifier.git
But that page was last updated on 2018-02-23, with the label "Whoops, forgot to edit WHATSNEW".
The same page as in #1 calls the Github link the "Regular old mirror", but that repository has current (2020) updates... So is that actually the one used? Huh? https://github.com/ezyang/htmlpurifier/tree/master
At https://github.com/ezyang/htmlpurifier/blob/v4.13.0/NEWS , it says: "Further improvements to PHP 6.4 support". There never existed a PHP 6.4...
My perception of that project is that it's run by very sloppy and careless people. Can people who make so many mistakes and take so little care to keep their website correct really be trusted to write secure code to purify HTML?
I wish I had never been linked to that page with exploits. I was proud of my own code, and I spent a lot of time on it even though it's not many lines.
This really makes me wonder what everyone else is using (not made by Google). strip_tags is obviously a complete "no-no", but so is my DOMDocument code. For example, it checks if any href begins with (case insensitively) "javascript:", but the nightmare page shows that you can inject "invisible" tabs such as "ja vascript:" and add encoded characters and everything to break my code and allow the "javascript:" href after all. And numerous other things which would simply be impossible for me to sit and address in my own code.
Is there really no real_strip_tags or something built into PHP for this crucial and common task?
HTML Purifier is a pretty good, established and tested library, although I understand why the lack of clarity as to which repository is the right one really isn't very inspiring. :) It's not as actively worked on as it was in the past, but that's not a bad thing in this case, because it has a whitelist approach. New and exciting HTML that might break your page just isn't known to the whitelist and is stripped out; if you want HTML Purifier to know about these tags and attributes, you need to teach it about how they work before they become a threat.
That said, your DOMDocument-based code needn't be the wrong approach, but if you do it properly you'll probably end up with HTML Purifier again, which essentially parses the HTML, applies a standards-aware whitelist to the tags, attributes and their values, and then reassembles the HTML.
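As a rough sketch of what that looks like in practice, using HTML Purifier's documented configuration API (the allowed-tag list here is only an example, not a recommendation):

require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
// Whitelist of tags and attributes; everything else is stripped.
$config->set('HTML.Allowed', 'p,br,em,strong,a[href],span[style]');
$purifier = new HTMLPurifier($config);

$clean = $purifier->purify($untrustedHtml); // $untrustedHtml is the external input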
(Sidenote: since this is more a question of best practices, you might get better answers on the Software Engineering Stack Exchange site than on Stack Overflow.)
I am building a small/test CMS using PHP and MySQL.
Everything works well for adding, editing, deleting and displaying, but after finishing my code I wanted to add a WYSIWYG editor to the admin back end.
My problem is that I am using an escape method to make my form a bit more secure and prevent injections, so when I add styled text, an image or any other HTML in my editor, it gets printed as literal code on my page (which is exactly what escaping is supposed to do to avoid attacks).
MY ESCAPE METHOD:
function e($text) {
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}
Is there any way to work around my escape method (which I think should not be done, because if I can do it, every attacker could)?
Or should I change my escape method to another one?
If I understand you correctly, you are going to allow your users to put some formatting into the text they create, and for that you are going to add a WYSIWYG editor. The question then is how to distinguish the formatting and special characters that are allowed from those that are not: you need to clean up the text, keep only the valid, allowed formatting (HTML tags) and remove all malicious JavaScript or HTML.
This is not as easy a task as it might sound at first. I can see several approaches here.
The easiest solution is to use strip_tags and specify which tags are allowed.
But please keep in mind that strip_tags is not perfect. Let me quote the manual here.
Because strip_tags() does not actually validate the HTML, partial or broken tags can result in the removal of more text/data than expected.
This function does not modify any attributes on the tags that you allow using allowable_tags, including the style and onmouseover attributes that a mischievous user may abuse when posting text that will be shown to other users.
This is a known issue, and there are libraries which do a better job of cleaning up HTML and JS to prevent such breakage.
A somewhat more involved solution would be to use a more advanced library to clean up the HTML code, for example HTML Purifier.
Quote from the documentation
HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.
Other libraries exist which solve the same task. You can check, for example, this article where the libraries are compared, and then choose the one that fits you best.
A completely different approach is to keep users from writing HTML tags at all. Ask them to write some other markup instead, as is done on Stack Overflow, Basecamp or GitHub. Markdown might be a good choice.
Using a simple markup language lets you completely avoid issues with broken HTML and JavaScript, because you can escape everything and build the HTML markup yourself.
The editor might look like the one I'm using to write this message :)
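As a sketch of that Markdown approach, assuming you pick Parsedown as the library (safe mode escapes any raw HTML the user typed):

require 'Parsedown.php';

$parsedown = new Parsedown();
$parsedown->setSafeMode(true); // escape raw HTML instead of passing it through

echo $parsedown->text($userMarkdown); // $userMarkdown is the submitted text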
You can use strip_tags() to remove the unwanted tags. Read about it in the manual:
http://php.net/manual/en/function.strip-tags.php
Example 1 (Based on the manual)
<?php
$text = '<p>Test paragraph, <a href="#">with link</a>.</p>';

# Output: Test paragraph, with link. (all tags are stripped)
echo strip_tags($text);
echo "\n";

# Allow <p> and <a>
# Output: <p>Test paragraph, <a href="#">with link</a>.</p>
echo strip_tags($text, '<p><a>');
?>
I hope this will help you!
I'm building a WYSIWYG editor with HTML5 and JavaScript.
I'll allow users to post pure HTML via the WYSIWYG editor, so it has to be sanitized.
Even a basic task like protecting the site from cross-site scripting (XSS) is becoming difficult, because there isn't an up-to-date purify & filter library for PHP.
HTML Purifier doesn't support HTML5 at the moment and its overall status looks bad (HTML5 support isn't coming anytime soon).
So how should I sanitize untrusted HTML5 with PHP (on the backend)?
Options so far...
HTML Purifier (lack of new HTML5 tags, data-attributes etc.)
Implementing own purifier with strip_tags() and Tidy or PHP's DOM classes/functions
Using some "random" Tidy implementations like http://eksith.wordpress.com/2013/11/23/whitelist-html-sanitizing-with-php/
Google Caja (Javascript / Cloud)
htmLawed (there's beta for HTML5 support)
Are there any other options out there? Is PHP dying? ;)
PHP offers escaping functions to protect against SQL injection (e.g. mysql_real_escape_string()). This is not the case for HTML/CSS/JavaScript. Why is that?
First: HTML/CSS/JavaScript's sole purpose is to display information. It is pretty much up to you to accept or reject certain elements of HTML depending on your requirements.
Second: due to the very high number of HTML/CSS/JS elements (and the list keeps growing), it is impossible to control HTML exhaustively; you cannot expect a complete solution.
This is why I would suggest a top-down solution: start by restricting everything and then allow only a limited number of tags. A good base is probably BBCode, which is quite popular. If you want to "unlock" additional specific tags beyond BBCode, you can always add some.
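A toy sketch of that idea: escape everything first, then translate only the BBCode tags you have decided to support (the two tags below are purely illustrative):

$text = htmlspecialchars($input, ENT_QUOTES, 'UTF-8'); // neutralise all raw HTML
$text = preg_replace('/\[b\](.*?)\[\/b\]/is', '<strong>$1</strong>', $text);
$text = preg_replace('/\[i\](.*?)\[\/i\]/is', '<em>$1</em>', $text);
echo $text;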
This is the reason BBCode-like markup is popular on forums and websites (including Stack Overflow). WYSIWYG editors are designed for admin/internal use, because you don't expect your website administrator to inject bad content.
Bottom-up approaches are doomed to fail: HTML sanitizers face ever-growing complexity and do not guarantee anything.
EDIT 1
You say it is a sanitization problem, not a front-end issue. I disagree: since you cannot handle all present and future HTML elements, you are better off restricting input at the front-end level to be 100% sure.
That said, perhaps the following is a workable solution for you:
You can do a fair bit of sanitizing by stripping all tags except those in a whitelist using PHP's strip_tags(). You can also remove all remaining tag attributes (properties) using PHP's preg_replace() with a regular expression.
$string = "put some very dirty HTML here.";
$string = strip_tags($string, '<p><a><span><h1><li><ul><br>');
$string = preg_replace("/<([b-z][b-z0-9]*)[^>]*?(\/?)>/i",'<$1$2>', $string);
echo $string;
This will return your sanitized text.
Note: I have excluded <a> tags from the attribute removal because you may still want to keep href="" properties; hence the [b-z] character class (the /i modifier makes it case-insensitive).
I believe the ideal is to use a combination:
mysql_real_escape_string(addslashes($_REQUEST['data']));
on write, and
stripslashes($data)
on read. That has always done the trick for me, and I think it is better than
htmlentities($data) on write
and
html_entity_decode($data) on read.
I've seen this question asked a few times on stackoverflow, with no resoundingly wonderful answer.
The answer always seems to be "don't use regex," without any examples of a better alternative.
For my purposes this will not be done for validation, but after the fact stripping.
I need to strip out all script tags including any content that may be between them.
Any suggestions on the best REGEX way to do this?
EDIT: PREEMPTIVE RESPONSE: I can't use HTML Purifier nor the DOMXPath feature of PHP.
The reason regex for HTML is considered evil is that it can (usually) be broken easily, forcing you to repeatedly rethink your pattern. If, for instance, you're matching
<script>.+</script>
It could be broken easily with
<script type="text/javascript">
If you use
<script.+/script>
It can also be easily broken with
< script>...
There's no end to this. If you can't use any of the methods you've stated, you could try strip_tags, but it takes a whitelist as a parameter, not a blacklist, meaning you'll need to explicitly list every single tag you want to allow.
If all else fails, you could resort to regex; what I came up with is this:
<\s*script.*/script>
But I bet someone around here could probably come and break that too.
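If you do go that route, applying it with preg_replace (case-insensitive, with . allowed to span newlines) might look like the sketch below, with the usual caveat that a determined attacker can probably still get around it:

// Strip <script>...</script> blocks, tolerating whitespace and attributes.
$clean = preg_replace('#<\s*script[^>]*>.*?<\s*/\s*script\s*>#is', '', $html);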
I have built a number of solutions in the past in which people enter data via a webform, validation checks are applied, regex in some cases and everything gets stored in a database. This data is then used to drive output on other pages.
I have a special case here where a user wants to copy/paste HUGE amounts of text (multiple paragraphs with various headers, links, etc. throughout). What is the best way to handle this before it goes into a database, so that it produces the best output when it needs to come back out?
So far the best I have come up with is wrapping all the output from these fields in PRE tags and using regex to add links where appropriate. I have a database with a list of special keywords that need to be bold or have other styles applied to them, which works fine. So I can make this work using these approaches, but it seems to me that there is probably a much more graceful way of doing it.
Nicholas
There are a lot of ways you could format the text for output. You could simply use pre tags as you mentioned (if you are worried about wrapping, the CSS white-space property does also support the pre-wrap value, but browser support for this is currently sketchy at best).
There are also a large number of markup languages you could use for more advanced formatting options (some of which are listed here). Stack Overflow itself uses Markdown, which I personally enjoy using very much.
However, as the data is being pasted from another source, a markup language may interfere with the formatting of the text - in which case you could roll your own solution, perhaps using regular expressions and functions like htmlentities and nl2br.
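A minimal version of that roll-your-own route (assuming the pasted text is in $rawInput) could be as simple as escaping everything and then restoring the line breaks:

$safe = htmlentities($rawInput, ENT_QUOTES, 'UTF-8'); // escape all markup
echo nl2br($safe); // turn newlines back into <br /> for display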
Whatever you decide, I would recommend storing the input in its original form in the database so you can retroactively amend your formatting routines at any time.
If you're expecting a good deal of formatting, you should probably go with a WYSIWYG editor. These editors provide Word-like toolbars and produce (hopefully) valid (X)HTML markup which can be stored directly in a text field in your database. Here are a couple of examples:
FCKeditor - Massive amount of options/tools
TinyMCE - A nice alternative.
Markdown - What stackoverflow.com uses
Both FCKeditor and TinyMCE have been thoroughly tested and have proven reliable. I don't have any experience with Markdown, but it seems solid.
I've always hated 'forum' formatting tags like [code], [link], etc. Stack Overflow and others have shown that providing an open WYSIWYG editor is safe, reliable, and very easy to use. Just take the output it gives you, run it through some sort of sanitizing function to check for any kind of injection, XSS, etc., and store it in a text field.