Best way to handle LARGE strings of text going into a database?

Best way to handle LARGE strings of text going into a database? - php

I have built a number of solutions in the past in which people enter data via a webform, validation checks are applied, regex in some cases and everything gets stored in a database. This data is then used to drive output on other pages.
I have a special case here where a user wants to copy/paste HUGE amounts of text (multiple paragraphs with various headers, links and etc throughout) -- what is the best way to handle this before it goes into a database to provide the best output when it needs to come back out?
So far the best I have come up with is sticking all the output from these fields in PRE tags and using regex to add links where appropriate. I have a database put together with a list of special keywords that need to be bold or have other styles applied to them which works fine. So I can make this work using these approaches but it just seems to me that there is probably a much more graceful way of doing it.
Nicholas

There are a lot of ways you could format the text for output. You could simply use pre tags as you mentioned (if you are worried about wrapping, the CSS white-space property does also support the pre-wrap value, but browser support for this is currently sketchy at best).
There are also a large number of markup languages you could use for more advanced formatting options (some of which are listed here). Stack Overflow itself uses Markdown, which I personally enjoy using very much.
However, as the data is being pasted from another source, a markup language may interfere with the formatting of the text - in which case you could roll your own solution, perhaps using regular expressions and functions like htmlentities and nl2br.
Whatever you decide, I would recommend storing the input in its original form in the database so you can retroactively amend your formatting routines at any time.

If you're expecting a good deal of formatting, you should probably go with a WYSIWYG editor. These editors produce word-like toolbars which product (hopefully) valid (x)html-markup which can be directly stored into a text field in your database. Here are a couple examples:
FCKeditor - Massive amount of options/tools
Tinymce - A nice alternative.
Markdown - What stackoverflow.com uses
Both FCKeditor and Tinymce have been thoroughly tested and have proven to be reliable. I don't have any experience with markdown but it seems solid.
I've always hated 'forum' formatting tags like [code], [link], etc. Stackoverflow and others have shown that providing an open wysisyg editor is safe, reliable, and very easy to use. Just take the output it gives you, run it through some sort of escape function to check for any kind of injection, xss, etc and store in a text field.

Related

Sanitize HTML5 with PHP (prevent XSS)

I'm building WYSIWYG editor with HTML5 and Javascript.
I'll allow users post pure HTML via WYSIWYG, so it have to be sanitized.
Basic task like protecting site from cross site scripting (XSS) is coming difficult task, because there isn't up-to-date purify & filter -software for PHP.
HTML Purifier isn't support HTML5 at the moment and overall status looks very bad (HTML5 support isn't coming anytime soon).
So how should I sanitize untrusted HTML5 with PHP (backend) ?
Options so far...
HTML Purifier (lack of new HTML5 tags, data-attributes etc.)
Implementing own purifier with strip_tags() and Tidy or PHP's DOM classes/functions
Using some "random" Tidy implementations like http://eksith.wordpress.com/2013/11/23/whitelist-html-sanitizing-with-php/
Google Caja (Javascript / Cloud)
htmLawed (there's beta for HTML5 support)
Is there any other options out there? Is PHP dying? ;)

PHP offers parsing methods to protect from code PHP/SQL injections (i.e. mysql_real_escape_string()). This is not the case for HTML/CSS/JavaScript. Why that?
First: HTML/CSS/Javascript sole purpose is to display information. It is pretty much up to you to accept certain elements of HTML or reject them depending of your requirements.
Secondly: due to the very high number of HTML/CSS/JS elements (also increasing constantly), it is impossible to try to control HTML. you cannot expect a functional solution.
This is why I would suggest a top-down solution. I suggest to start restricting everything and then only allowing a certain number of tags. One good base is probably to use BBCdode, pretty popular. If you want to "unlock" additional specific tags beyond BBCode, you can always add some.
This is the reason BBCode-like scripts are popular on forums and websites (including stack overflow). WISIGIG editors are designed for admin/internal use, because you don't expect your website administrator to inject bad content.
bottom-top approaches are vowed to fail. HTML sanitizers are exposed to exponential complexity and do not guarantee anything.
EDIT 1
You say it is a sanitation problem, not a front end issue. I disagree, because as you cannot handle all present and future HTML entities you would better restrict it at a front end level to be 100% sure.
This said, perhaps the below is a working solution for you:
you can do a bit to sanitize your code by striping all entities
except those in a white list using PHP's strip_tags().
You can also remove all remaining tags attributes (properties)
by using PHP's preg_replace() with some regular expression.
$string = "put some very dirty HTML here.";
$string = strip_tags($string, '<p><a><span><h1><li><ul><br>');
$string = preg_replace("/<([b-z][b-z0-9]*)[^>]*?(\/?)>/i",'<$1$2>', $string);
echo $string;
This will return your sanitized text.
note : I have excluded attributes removal for tags because you may still want to keep href="" properties. hence the [b-z][B-Z] regex.

I Believe the ideal is to use a combination :
mysql_real_escape_string(addslashes($_REQUEST['data']));
On Write
and
stripslashes($data)
on read always did the trick for me, I think it is better than
htmentities($data) on write
and
html_entity_decode($data) on read

Displaying foreign HTML safely

I have an application that needs to display foreign HTML data (e.g. HTML-encoded email texts, though not only) safely - i.e., remove XSS attempts and other nasty stuff. But still be able to display HTML as it should look like. Solutions considered so far aren't ideal:
Clean HTML with something like HTMLPurifier. Works fine, but once email size goes over 100K it becomes very slow - tens of seconds per email. I suspect any secure enough parser would be as slow in PHP - some emails are really bad HTML, I've seen some that generate 150K HTML for one page of text.
Display HTML in an iframe - here the problem is that iframe needs then to be in another origin to be safe from XSS AFAIK, and this would require different domain for the same app. Setting up application with two domains is much more work and may be very hard in some setups (such as hosting that gives only one domain name).
Any other solutions that can achieve this result?

From my understanding, I don't believe so.
The trouble is that you can only safely remove HTML tags if you understand its structure, and 'understanding its structure' is exactly what parsing is. Even if you find a different way to analyse the structure of HTML and don't call it parsing, that's what you're doing, and it's bound to be some form of slow (or unsafe).
What you could do is play around with a few preliminary filters (e.g. strip_tags, which is generally a good prelim' (if certainly nothing else)) to give the parser less work to do, but whether that's viable depends on the size of your tag whitelist - a small whitelist will probably yield better benchmark results, since a large chunk of HTML would be filtered out by strip_tags before the parser got to it.
Additionally, different parsers work in different ways, and the sort of HTML you deal with frequently may be best suited to one sort of parser over another - HTML Purifier itself even has different parsers at its disposal that you can switch between to see if that results in a better benchmark for you (though I suspect the differences are negligible).
Whether such juggling works for your usecases is something you'll probably have to benchmark yourself, though.
Word of caution: If you decide to pursue it, know I wouldn't go with the iframe approach. If you don't filter HTML, you also allow forms, and it becomes (IMO) trivial in combination with scripts and CSS to set up extremely convincing phishing, e.g. using tricks such as "this e-mail is password protected, to proceed, please enter your password".

One possible solution (and the one that SO uses!) is to only allow certain types of tags. <p> and <br /> are fine, but <script> is right out.

How to properly format text retrieved from a website?

I'm building an application for a company that, unfortunately, has a very poorly designed website. Most of the HTML tags are wrongly and sometimes randomly placed, there is excessive use of no-break-spaces, p tags are randomly assigned, they don't follow any rule and so on...
I'm retrieving data from their website by using a crawler and then feeding the resulted strings to my application through my own web-service. The problem is that once displaying it into the android textview, the text is formatted all wrong, spread and uneven, very dissorderly.
Also, worth mentioning that I can not suggest to the company for various reasons to modify their website...
I've tried
String text = Html.fromHtml(myString).toString();
and other variations, I've even tried formatting it manually but it's been a pain.
My question is:
Is there an easy, elegant way to re-format all this text, either with PHP on my web-service or with Java, directly in my Android application?
Thanks to anyone who will take the time to answer...

You can use Tidy with PHP to clean up the code if you're keeping it in place. Otherwise stripping the HTML would probably make working with it a lot easier.

I would so: no, there is no easy, elegant way. HTML combines data and visual representation, they are inherently linked. To understand the data you must look at the tags. Tags like <h1> and <a> carry meaning.
If the HTML is structured enough to break it down into meaningful blocks: header, body and unrelated/unimportant stuff. Then you could apply restyling principles to those. A simple solution is to just strip all the tags, get only the textNodes and stitch them together. If the HTML is exceptionally poorly formatted you might get sentences that are out of order, but if the HTML isn't too contrived I expect this approach should work.
To give you an indication of the complexity involved: You could have <span>s that have styling applied to them, for instance display: block. This changes the way the span is displayed, from inline to block, so it behaves more like a <div> would. This means that each <span> will likely be on it's own line, it will seem to force a line break. Detecting these situations isn't impossible but it is quite complex. Who knows what happens when you've got list elements, tables or even floating elements; they might be completely out of order.

Probably not the most elegant solution, but I managed to get the best results by stripping some tags according to what I needed with php (that was really easy to do) and then displaying the retrieved strings into formatted WebViews.
As I said, probably not the most elegant solution but in this case it worked best for me.

TinyMCE, PHP and MySQL: security and escaping questions

I'm implementing TinyMCE for a client so they can edit front-end content via a simple, familiar interface in their site's admin panel.
I have never used TinyMCE before but notice that you are able to insert whatever markup you want and it will be happily saved off to the MySQL database, assuming you don't escape the contents of the TinyMCE before running it through your query.
You can even insert single quotes and have it break your SQL query entirely.
But of course, when I do escape the contents, benign presentational stuff like paragraph tags get converted to HTML entities and so the whole point of the WYSIWYG editor is defeated, because the entities are spat back out when it comes to displaying the stored content on the front-end.
So is there a way I can "selectively escape" content from TinyMCE, to keep the innocent tags like P and BR but get rid of dangerous ones like SCRIPT, IFRAME, etc.? I really don't want to have to manually encode and decode them using str_replace() or whatever, but I'd rather not give my client a gaping security hole either.
Thanks.

Have you tried htmlpurifier? works wonders. Its caveats; big and slow, but the best you can have.
http://htmlpurifier.org .

Sorry Dude, I'd say this a question for the authors of TinyMCE, so I suggest you ask at: http://tinymce.moxiecode.com/enterprise/support.php ... I'm sure they'll be only to happy to answer (for a small fee), and I suspect this may even be one of there FAQ's.
It's just that I'd guess you'd be very lucky if you hit another TinyMCE-user (let alone an authorative one) on stack-overflow, a "general programming forum"... although I notice there are currently 837 questions tagged "tinymce" on this forum; have you tried searching through them? Maybe there's a pointer in one of those?
Cheers. Keith.
EDIT: Yep, Making user-made HTML templates safe is more or less the same question posed in different words, and it has (what looks to ignorant me) a couple of answers which posit practical solutions. I just searched stack overflow for "Tiny MCE html security".

That's like complaining that you can write naughty words in Microsoft Word, and that Word should filter them for you. Or complain to GM that they build cars that then get used as escape vehicles in bank robberies. TinyMCE's job is to be an online editor, not to be the content police.
If you need to ban certain tags, then remove them when the document's submitted by using strip_tags(). Or better yet, HTMLpurifier for a more bullet-proof sanitization. If embedded quotes are breaking your SQL, then why weren't you passing the submitted document through mysql_real_escape_string() or using PDO prepared queries first? MCE has no idea what the server-side handling is going to be, nor should it care at all. It's up to you to decide how to handle the data, because only you know what its ultimate purpose is going to be.
In any case, remember that all those editors work on the client side. You can make TinyMCE as bulletproof and as strict an editor as you want, but it's still running on the client. Nothing says a malicious user can't bypass it entirely and submit all the embedded quotes and bad tags they want. The ultimate responsibility for cleaning the data HAS to fall on your code running on the server, as it's the last line of defense, and the only one that can ensure the database remains pristine. Anything else is lipstick on a pig.

How to store Markdown comments

I want to use Markdown for my website's commenting system but I have stumbled upon the following problem: What should I store in the database - the original comment in Markdown, the parsed comment in HTML, or both?
I need the HTML version for viewing and the Markdown version if the user needs to edit his comment. If I store the Markdown version, I have to parse the comments at runtime. If I store the HTML version, I need to convert the comment back to Markdown when the user needs to edit it (I found Markdownify for this but it isn't flawless). If I store both versions, I'm doubling the used space.
What would be the best option? Also, how does Stack Overflow handle this?

Store both. It goes against the rules for database normalization, but I think it's worth it for the speed optimisation in this case - parsing large amounts of text is a very slow operation.
You only need to store it twice, but you might need to serve it thousands of times, so it's a space-time trade-off.

Store the original markdown and parse at runtime. There are a few problems with storing the converted version in the database.
If user wants to edit their comment, you have to backwards convert parsed into original markdown
Space in database (always go by the rule that if you don't need to store it, don't)
Changes made to markdown parser would have to be run on every comment in the database, instead of just showing up at runtime.

Just render the Markdown to HTML at runtime.
If your site runs into performance issues, the Markdown will be one of the last things you'll look into tweaking. And even then, I doubt it'll make sense.
Just take a look at the realtime JavaScript renderer that SO uses. It's fast.
Edit:
Sorry, I should've been more clear. I meant just render in PHP. You'll save yourself a lot of headache -- and you probably have more important things to worry about.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.