exploding <table>, <div> from sql content - php

I have a line in my template that I need help fixing.
If a user posts opening <div> tags or closing </div>, or any other structure related HTML tags into the content, it will result in a template mess on the page.
I'm using htmlentities on titles, and other forms. Unfortunately, I can't do that here, because the content field has a rich editor, and I need to keep the text styling tags intact (<b>, <u>, colors, <i> and such).
Right now, it's very easy for users to mess up the template on purpose and I want to prevent that.

The best way is to throw the text into a DOM parser and have it sort out any mess. Such a tool will probably be more robust than anything you can put together, including solving problems you didn't know you might have.

Now that you've clarified in the comments what the actual problem is, I may be able to help. The issue is basically that you don't want long words entered by the user to break the page layout.
The PHP wordwrap solution you've come up with already has numerous problems, of which the one you've found (breaking your HTML) is the most obvious.
However, since the issue is purely a question of not wanting to allow long words to break your page layout, there are several other solutions that could be used instead.
Do something specific to any excessively long words, rather than to the whole text. You could add manual breaks to them, spaces, or the HTML <wbr> tag (which is an optional line-break, ie for hyphenation purposes). Or you could even just block users from entering crazy long words in the first place.
CSS overflow-x:hidden
Using this, any text overflowing out of the side of the box will simply be hidden, rather than be printed on top of other parts of the page.
CSS word-wrap
There are several ways of doing this, and it gets a little tricky because of varying browser support. But here's a link that explains it all: http://blog.kenneth.io/blog/2012/03/04/word-wrapping-hypernation-using-css/
Hope that helps.

Related

Formatting HTML correctly from cURL requests

I am working on an applet that allows the user to input a URL to a news article or other webpage (in Japanese) and view the contents of that page within an iFrame in my page. The idea is that once the content is loaded into the page, the user can highlight words using their cursor, which stores the selected text in an array (for translating/adding to a personal dictionary of terms) and surrounds the text in a red box (div) according to a stylesheet defined on my domain. To do this, I use cURL to retrieve the HTML of the external page and dump it into the source of the iFrame.
However, I keep running into major formatting problems with the retrieved HTML. The big problem is preserving style sheets, and to fix this, I've used DOMDocument to add tags to the section of the retrieved HTML. This works for some pages/URLs, but there are still lots of style problems with the output HTML for many others. For example, div layers crash into each other, alignments are off, and backgrounds are missing. This is made a bit more problematic as I need to embed the output HTML into a new in order to make the onClick javascript function for passing text selections in the embedded content to work, which means the resulting source ends up looking like this:
<div onclick="parent.selectionFunction()" id ="studyContentn">
<!-- HTML of output from cURL, including doctype declarations and <html>,<head> tags -->
</div>
It seems like for the most part a lot of the formatting issues I keep running into are largely arbitrary. I've tried using php Tidy to clean output from HTML, but that also only works for some pages but not many others. I've got a slight suspicion it may have to do with CDATA declarations that get parsed oddly when working with DOMDocument, but I am not certain.
Is there a way I can guarantee that HTML output from cURL will be rendered correctly and faithfully in all instances? Or is there perhaps a better way of going about doing this? I've tried a bunch of different ways of approaching this issue, and each gets closer to a solution but brings its own new problems as well.
Thanks -- let me know if I can clarify anything.
If I understand correctly you are trying to pull the html of a complete web page and display it under your domain, in your html. This is always going to be tricky, a lot of java script will break, relative url's will be wrong and as you mentioned, styles as well. Your probably also changing the dimensions that the page is displayed in. These can all be worked around but your going to be fighting an uphill battle with each new site, or if a current site change design
I'd probably take a different approach to the problem. You might want to write a browser plugin as the interface to the external web site instead. Then your applet can sit on top of the functional and tested (hopefully) site. Then you can focus on what you need to do for your applet rather than a never ending list of fiddly html issues.
I am trying to do a similar thing. It is very difficult to conserve the formatting, and the JS scripts in webpage complicated the thing. I finally gave up the complete the idea of completely displaying the original format, but do it with a workaround:
Select only headers, links, lists, paragraph which you are interested at.
Add the domain path of your ownsite to links.
You may wrap the headers, links etc. items by your own class.
Display it
in your case you want to select text and store it, which is another topic. What I did is to parse the HTMl in two levels, and then it is easy to do the selection. Keep in mind IE and Firefox/Chrome needs to be dealt with separately.

How to properly format text retrieved from a website?

I'm building an application for a company that, unfortunately, has a very poorly designed website. Most of the HTML tags are wrongly and sometimes randomly placed, there is excessive use of no-break-spaces, p tags are randomly assigned, they don't follow any rule and so on...
I'm retrieving data from their website by using a crawler and then feeding the resulted strings to my application through my own web-service. The problem is that once displaying it into the android textview, the text is formatted all wrong, spread and uneven, very dissorderly.
Also, worth mentioning that I can not suggest to the company for various reasons to modify their website...
I've tried
String text = Html.fromHtml(myString).toString();
and other variations, I've even tried formatting it manually but it's been a pain.
My question is:
Is there an easy, elegant way to re-format all this text, either with PHP on my web-service or with Java, directly in my Android application?
Thanks to anyone who will take the time to answer...
You can use Tidy with PHP to clean up the code if you're keeping it in place. Otherwise stripping the HTML would probably make working with it a lot easier.
I would so: no, there is no easy, elegant way. HTML combines data and visual representation, they are inherently linked. To understand the data you must look at the tags. Tags like <h1> and <a> carry meaning.
If the HTML is structured enough to break it down into meaningful blocks: header, body and unrelated/unimportant stuff. Then you could apply restyling principles to those. A simple solution is to just strip all the tags, get only the textNodes and stitch them together. If the HTML is exceptionally poorly formatted you might get sentences that are out of order, but if the HTML isn't too contrived I expect this approach should work.
To give you an indication of the complexity involved: You could have <span>s that have styling applied to them, for instance display: block. This changes the way the span is displayed, from inline to block, so it behaves more like a <div> would. This means that each <span> will likely be on it's own line, it will seem to force a line break. Detecting these situations isn't impossible but it is quite complex. Who knows what happens when you've got list elements, tables or even floating elements; they might be completely out of order.
Probably not the most elegant solution, but I managed to get the best results by stripping some tags according to what I needed with php (that was really easy to do) and then displaying the retrieved strings into formatted WebViews.
As I said, probably not the most elegant solution but in this case it worked best for me.

PHP databases - don't want to show javascript code

I have problem with PHP and JavaScript/CSS.
I have database with table. The table has a descriptions of articles. I want to echo the descriptions of the articles from database. Unfortunately many of them has a JavaScript or CSS included ( Then some article text), so when I use echo, it shows all of that code (and after that text). Is there any way to not show the JavaScript/CSS part and show only the text? For example with str_replace and regular expression? If yes, can somebody write me how it should look like?
Thanks for help and let me know if u need more info (code etc.)
Use HTMLPurifier - it will remove the scripts, css and any harmfull content from your articles. Since it is a CPU-intensive operations, it's better to run article trough HTMLPurifer before saving in the database, then to run it each time you are showing the article.
If you're trying to remove tags from a user's post, you can call strip_tags. This will get rid of css links, script tags, etc. It will not get rid of the style attribute, but if you get rid of div, span, p, etc. that won't matter -- there will be no tag for it to reside on.
As has been stated by others, it is generally best to sanitize your input (data from user before it goes into the DB), than it is to sanitize your output.
If you're trying to simply hide the JS and CSS from users, you can use Packer to obfusicate Javascript from less-savvy users, use Packer and use base 62 encoding. The JS will still work but will look like jiberish. Be aware that more knowledgeable users can attempt to unobfusicate the code, so any critical security risks in the JS still exists. Don't think any JS that accesses your databases directly will be safe; instead remove database access from the Javascript for security. If the JS is just to do fancy things like move elements around the page it's probably fine to just obfuscate it.
Only consider this if YOU have complete control and awareness of all JS included with the articles. If this is something your anonmous or otherwise not 120% trusted users can upload, you need to kill that functionality and use HTML Purifier to remove any JS they might add. It is not safe to output user entered JS, for you or your users.
For the CSS, I'm not sure why you want to hide it, and CSS can't be obfuscated quite like JS can; the styles will still be in plain English, best you can do is butcher the class/id names and whitespace; outputting CSS that YOU generated isn't a real security risk though, and even if people reverse engineer it I wouldn't be that afraid.
Again, if this is something anonymous/non trusted users can ADD to your site on their own, you don't want this at all, so remove the ability to upload CSS with an article using the HTML Purifier Darhazer mentioned.
You can try the following regex to remove the script and css:
"<script[\d\D]*?>[\d\D]*?</script>"
"<style[\d\D]*?>[\d\D]*?</style>"
It should help, but it cannot remove all the scripts. Like onclick="javascript:alert(1)".

TinyMCE, PHP and MySQL: security and escaping questions

I'm implementing TinyMCE for a client so they can edit front-end content via a simple, familiar interface in their site's admin panel.
I have never used TinyMCE before but notice that you are able to insert whatever markup you want and it will be happily saved off to the MySQL database, assuming you don't escape the contents of the TinyMCE before running it through your query.
You can even insert single quotes and have it break your SQL query entirely.
But of course, when I do escape the contents, benign presentational stuff like paragraph tags get converted to HTML entities and so the whole point of the WYSIWYG editor is defeated, because the entities are spat back out when it comes to displaying the stored content on the front-end.
So is there a way I can "selectively escape" content from TinyMCE, to keep the innocent tags like P and BR but get rid of dangerous ones like SCRIPT, IFRAME, etc.? I really don't want to have to manually encode and decode them using str_replace() or whatever, but I'd rather not give my client a gaping security hole either.
Thanks.
Have you tried htmlpurifier? works wonders. Its caveats; big and slow, but the best you can have.
http://htmlpurifier.org .
Sorry Dude, I'd say this a question for the authors of TinyMCE, so I suggest you ask at: http://tinymce.moxiecode.com/enterprise/support.php ... I'm sure they'll be only to happy to answer (for a small fee), and I suspect this may even be one of there FAQ's.
It's just that I'd guess you'd be very lucky if you hit another TinyMCE-user (let alone an authorative one) on stack-overflow, a "general programming forum"... although I notice there are currently 837 questions tagged "tinymce" on this forum; have you tried searching through them? Maybe there's a pointer in one of those?
Cheers. Keith.
EDIT: Yep, Making user-made HTML templates safe is more or less the same question posed in different words, and it has (what looks to ignorant me) a couple of answers which posit practical solutions. I just searched stack overflow for "Tiny MCE html security".
That's like complaining that you can write naughty words in Microsoft Word, and that Word should filter them for you. Or complain to GM that they build cars that then get used as escape vehicles in bank robberies. TinyMCE's job is to be an online editor, not to be the content police.
If you need to ban certain tags, then remove them when the document's submitted by using strip_tags(). Or better yet, HTMLpurifier for a more bullet-proof sanitization. If embedded quotes are breaking your SQL, then why weren't you passing the submitted document through mysql_real_escape_string() or using PDO prepared queries first? MCE has no idea what the server-side handling is going to be, nor should it care at all. It's up to you to decide how to handle the data, because only you know what its ultimate purpose is going to be.
In any case, remember that all those editors work on the client side. You can make TinyMCE as bulletproof and as strict an editor as you want, but it's still running on the client. Nothing says a malicious user can't bypass it entirely and submit all the embedded quotes and bad tags they want. The ultimate responsibility for cleaning the data HAS to fall on your code running on the server, as it's the last line of defense, and the only one that can ensure the database remains pristine. Anything else is lipstick on a pig.

FCKeditor is leaving behind a lot of open tags - how to remove them

So I am using FCKeditor and the problem I am having is that when the user writes the document sometimes info is copied from Word, other times it is written directly in the editor and other times it could be done both ways. What this leaves me with in the DB is a lot of tags that are open and never closed. This is throwing my layout off dramatically and I am trying to find a solution.
I changed the config file to paste as plain text, which I assumed would stop Word formatting from transferring over, and it is still doing it.
So now I am trying to figure out a way to search for the opening tags and delete them before the info is sent to the DB is possible. Or is there some FCKeditor function/config option I am missing to aid me?
Any suggestions on how I should proceed?
Thanks
Levi
Just as a security precaution that will address both security-related problems (like <script> tags being inserted by users, for instance -- which you probably don't want) and presentation-related problems (such as not-closed tags), you could use a tool like HTMLPurifier on your server, on what you are receiving from the browser.
Of course, this will not solve the first problem, the fact that users can input whatever they want in FCKEditor ; but it will ensure your HTML is both valid and secure.
Actually, even if FCKEditor wasn't getting you not-valid HTML, you still could use HTMLPurifier, just for security.
The idea is that you provide a list of :
allowed tags
allowed attributes to those tags
And, in return, HTMLPurifier gives you clean and valid HTML.
Edit: Sounds like you're running into a bug in the editor. You might try a different one, and/or use a server-side script that goes through and strips unmatched div tags.
Html allows most tags to be left open. If it's leaving tags open that should be left closed, you could white or blacklist to search through and strip those out serverside. Otherwise, you're pretty much stuck with understanding that HTML is not XML, FCKeditor generates HTML, and HTML won't validate as XML. If it's throwing your printed output off, try wrapping the FCKeditor output in a div.
Otherwise, please include concrete examples of input and output that is messing up your page layout.

Categories