How to properly format text retrieved from a website? - php

I'm building an application for a company that, unfortunately, has a very poorly designed website. Most of the HTML tags are wrongly and sometimes randomly placed, there is excessive use of non-breaking spaces, <p> tags are assigned at random without following any rule, and so on...
I'm retrieving data from their website with a crawler and then feeding the resulting strings to my application through my own web service. The problem is that once the text is displayed in the Android TextView, it is formatted all wrong: spread out, uneven, very disorderly.
It's also worth mentioning that, for various reasons, I cannot suggest that the company modify their website...
I've tried
String text = Html.fromHtml(myString).toString();
and other variations. I've even tried formatting it manually, but it's been a pain.
My question is:
Is there an easy, elegant way to re-format all this text, either with PHP on my web service or with Java directly in my Android application?
Thanks to anyone who will take the time to answer...

You can use Tidy with PHP to clean up the code if you're keeping it in place. Otherwise stripping the HTML would probably make working with it a lot easier.
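For instance, with the Tidy extension enabled, a minimal cleanup pass might look like the following sketch ($dirtyHtml stands in for whatever the crawler fetched, and the options shown are just one reasonable set):

<?php
// Hypothetical cleanup pass using PHP's Tidy extension (requires ext-tidy).
$config = [
    'clean'          => true, // replace presentational markup where possible
    'output-xhtml'   => true, // emit well-formed XHTML
    'show-body-only' => true, // drop the <html>/<head> wrapper
    'wrap'           => 0,    // don't hard-wrap long lines
];

$tidy = new tidy();
$tidy->parseString($dirtyHtml, $config, 'utf8');
$tidy->cleanRepair();
$cleanHtml = tidy_get_output($tidy);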

I would say: no, there is no easy, elegant way. HTML combines data and visual representation; they are inherently linked. To understand the data you must look at the tags, because tags like <h1> and <a> carry meaning.
If the HTML is structured enough, you could break it down into meaningful blocks (header, body, and unrelated/unimportant stuff) and apply restyling principles to those. A simpler solution is to just strip all the tags, get only the text nodes, and stitch them together (see the sketch below). If the HTML is exceptionally poorly formatted you might get sentences that are out of order, but if it isn't too contrived I expect this approach should work.
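A minimal sketch of that strip-everything approach with DOMDocument (the helper name extractText is made up; DOMDocument is usefully tolerant of broken markup):

<?php
function extractText(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // silence warnings about the bad markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $parts = [];
    // Collect every text node, skipping script/style bodies and pure whitespace.
    foreach ($xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]') as $node) {
        $text = trim($node->nodeValue);
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode(' ', $parts);
}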
To give you an indication of the complexity involved: you could have <span>s that have styling applied to them, for instance display: block. This changes the way the span is displayed, from inline to block, so it behaves more like a <div> would. This means each such <span> will likely be on its own line; it will seem to force a line break. Detecting these situations isn't impossible, but it is quite complex. Who knows what happens when you've got list elements, tables or even floating elements; they might be completely out of order.

Probably not the most elegant solution, but I managed to get the best results by stripping some tags according to what I needed with PHP (that was really easy to do) and then displaying the retrieved strings in formatted WebViews.
As I said, probably not the most elegant solution but in this case it worked best for me.

Related

Is "HTML Purifier" really trustworthy? Is there a better way to secure an untrusted/unsafe HTML code string in PHP?

I'm trying to secure HTML coming from external sources, for display on my own web control panel (to load in my browser, read, and delete).
strip_tags is completely unsafe and useless.
I went through a ton of trouble to make my own DOMDocument-based HTML securer, removing unsafe elements and attributes. Then I got linked to this nightmare of a webpage: https://owasp.org/www-community/xss-filter-evasion-cheatsheet
That document convinced me that not only is my "clever" HTML securer not enough -- there are far more things that can be done to inject malicious code into HTML than I ever could imagine. That list of things gives me the creeps for real. What a cold shower.
Anyway, looking for a (non-Google-infested) HTML securer for PHP, I found this: http://htmlpurifier.org/
While it seems OK at first glance, some signs point toward sloppiness which is the last thing you want in a security context:
1. On http://htmlpurifier.org/download it claims that this is the official repository: https://repo.or.cz/w/htmlpurifier.git - but that page was last updated on 2018-02-23, with the label "Whoops, forgot to edit WHATSNEW".
2. The same page as in #1 calls the GitHub link the "Regular old mirror", but that repository has current (2020) updates... So is that actually the one used? Huh? https://github.com/ezyang/htmlpurifier/tree/master
3. At https://github.com/ezyang/htmlpurifier/blob/v4.13.0/NEWS it says: "Further improvements to PHP 6.4 support". There never existed a PHP 6.4...
My perception of that project is that it's run by very sloppy and careless people. Can people who make so many mistakes and take so little care to keep their website correct really be trusted to write secure code to purify HTML?
I wish I had never been linked to that page with exploits. I was proud of my own code, and I spent a lot of time on it even though it's not many lines.
This really makes me wonder what everyone else is using (not made by Google). strip_tags is obviously a complete "no-no", but so is my DOMDocument code. For example, it checks if any href begins with (case insensitively) "javascript:", but the nightmare page shows that you can inject "invisible" tabs such as "ja vascript:" and add encoded characters and everything to break my code and allow the "javascript:" href after all. And numerous other things which would simply be impossible for me to sit and address in my own code.
Is there really no real_strip_tags or something built into PHP for this crucial and common task?
HTML Purifier is a pretty good, established and tested library, although I understand why the lack of clarity as to which repository is the right one really isn't very inspiring. :) It's not as actively worked on as it was in the past, but that's not a bad thing in this case, because it takes a whitelist approach: new and exciting HTML that might break your page just isn't known to the whitelist and is stripped out. If you want HTML Purifier to know about such tags and attributes, you need to teach it how they work before they become a threat.
That said, your DOMDocument-based code needn't be the wrong approach, but if you do it properly you'll probably end up with HTML Purifier again, which essentially parses the HTML, applies a standards-aware whitelist to the tags, attributes and their values, and then reassembles the HTML.
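For reference, typical usage looks roughly like this (assuming a Composer install of ezyang/htmlpurifier; the allowed-tag list is just an example policy, and $untrustedHtml is assumed input):

<?php
require_once 'vendor/autoload.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.Allowed', 'p,br,b,i,u,a[href],ul,ol,li');
$config->set('Cache.SerializerPath', '/tmp'); // a writable cache dir speeds up repeat runs

$purifier = new HTMLPurifier($config);
$safeHtml = $purifier->purify($untrustedHtml);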
(Side note: since this is more a question of best practices, you might get better answers on the Software Engineering Stack Exchange site rather than Stack Overflow.)

Exploding <table>, <div> from SQL content

I have a line in my template that I need help fixing.
If a user posts opening <div> tags or closing </div>, or any other structure-related HTML tags, into the content, it will result in a template mess on the page.
I'm using htmlentities on titles and other form fields. Unfortunately, I can't do that here, because the content field has a rich editor, and I need to keep the text-styling tags intact (<b>, <u>, colors, <i> and such).
Right now, it's very easy for users to mess up the template on purpose and I want to prevent that.
The best way is to throw the text into a DOM parser and have it sort out any mess. Such a tool will probably be more robust than anything you can put together, including solving problems you didn't know you might have.
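As a rough sketch of that idea: DOMDocument balances stray open/close tags on load, so an unclosed <div> in user content can no longer swallow the rest of the template ($userContent is assumed input):

<?php
$doc = new DOMDocument();
libxml_use_internal_errors(true); // the input is expected to be messy
// Wrapping in a <div> avoids quirks when the fragment has no single root;
// the two flags suppress the implicit <html><body> wrapper and doctype.
$doc->loadHTML(
    '<div>' . $userContent . '</div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);
libxml_clear_errors();
$repaired = $doc->saveHTML();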
Now that you've clarified in the comments what the actual problem is, I may be able to help. The issue is basically that you don't want long words entered by the user to break the page layout.
The PHP wordwrap solution you've come up with already has numerous problems, of which the one you've found (breaking your HTML) is the most obvious.
However, since the issue is purely a question of not wanting to allow long words to break your page layout, there are several other solutions that could be used instead.
Do something specific to any excessively long words, rather than to the whole text. You could add manual breaks to them, spaces, or the HTML <wbr> tag (an optional line-break point, i.e. for hyphenation purposes); there's a PHP sketch of this below, after the CSS options. Or you could even just block users from entering crazy long words in the first place.
CSS overflow-x:hidden
Using this, any text overflowing out of the side of the box will simply be hidden, rather than be printed on top of other parts of the page.
CSS word-wrap
There are several ways of doing this, and it gets a little tricky because of varying browser support. But here's a link that explains it all: http://blog.kenneth.io/blog/2012/03/04/word-wrapping-hypernation-using-css/
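Here's the promised sketch of option 1 (PHP 7.4+ for mb_str_split; the 30-character threshold is arbitrary, and this should run on the user's plain text before any rich-editor HTML is involved, or you risk splitting tags):

<?php
function softBreakLongWords(string $text, int $limit = 30): string
{
    // Break any run of $limit+ non-space characters with <wbr>,
    // giving the browser an optional wrap point inside long "words".
    return preg_replace_callback(
        '/\S{' . $limit . ',}/u',
        fn (array $m) => implode('<wbr>', mb_str_split($m[0], $limit)),
        $text
    );
}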
Hope that helps.

Displaying foreign HTML safely

I have an application that needs to safely display foreign HTML data (e.g. HTML-encoded email texts, though not only those) - i.e., remove XSS attempts and other nasty stuff while still displaying the HTML as it is meant to look. The solutions considered so far aren't ideal:
Clean the HTML with something like HTMLPurifier. This works fine, but once email size goes over 100K it becomes very slow - tens of seconds per email. I suspect any secure-enough parser would be as slow in PHP; some emails are really bad HTML, and I've seen some that generate 150K of HTML for one page of text.
Display the HTML in an iframe - the problem here is that, AFAIK, the iframe then needs to be on another origin to be safe from XSS, and that would require a different domain for the same app. Setting up an application with two domains is much more work and may be very hard in some setups (such as hosting that gives only one domain name).
Any other solutions that can achieve this result?
From my understanding, I don't believe so.
The trouble is that you can only safely remove HTML tags if you understand the structure, and 'understanding the structure' is exactly what parsing is. Even if you find a different way to analyse the structure of HTML and don't call it parsing, that's still what you're doing, and it's bound to be some form of slow (or unsafe).
What you could do is play around with a few preliminary filters (e.g. strip_tags, which is generally a good prelim, if nothing else) to give the parser less work to do. Whether that's viable depends on the size of your tag whitelist: a small whitelist will probably yield better benchmark results, since a large chunk of HTML would be filtered out by strip_tags before the parser got to it.
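A sketch of that prefilter idea ($rawHtml and $purifier are assumed from context; note strip_tags leaves attributes like onclick= and javascript: hrefs untouched, so on its own this is not safe output):

<?php
$allowlist = '<p><br><b><i><u><a><ul><ol><li>';
$prefiltered = strip_tags($rawHtml, $allowlist);

// The heavy, slow purifier pass now sees far less markup.
$safeHtml = $purifier->purify($prefiltered);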
Additionally, different parsers work in different ways, and the sort of HTML you deal with frequently may be best suited to one sort of parser over another - HTML Purifier itself even has different parsers at its disposal that you can switch between to see if that results in a better benchmark for you (though I suspect the differences are negligible).
Whether such juggling works for your usecases is something you'll probably have to benchmark yourself, though.
A word of caution if you decide to pursue the iframe approach: I wouldn't go with it. If you don't filter the HTML, you also allow forms, and in combination with scripts and CSS it becomes (IMO) trivial to set up extremely convincing phishing, e.g. using tricks such as "this e-mail is password protected; to proceed, please enter your password".
One possible solution (and the one that SO uses!) is to only allow certain types of tags. <p> and <br /> are fine, but <script> is right out.

How to know if the website being scraped has changed?

I'm using PHP to scrape a website and collect some data. It's all done without using regex; I'm using PHP's explode() function to find particular HTML tags instead.
It is possible that if the structure of the website changes (CSS, HTML), then the scraper may collect wrong data. So the question is: how do I know if the HTML structure has changed? How can I identify this before storing any data to my database, to avoid wrong data being stored?
I don't think you have any clean solutions if you are scraping a page whose content changes.
I have developed several Python scrapers, and I know how frustrating it can be when a site makes even a subtle change to its layout.
You could try a solution à la mechanize (I don't know the PHP counterpart), and if you are lucky you could isolate the content you need to extract (links?).
Another possible approach would be to code some constraints and check them before storing to the database.
For example, if you are scraping URLs, you will need to verify that what the scraper has parsed is formally a valid URL; the same goes for an integer ID or whatever else you want to scrape that can be recognized as valid.
If you are scraping plain text, it will be more difficult to check.
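A minimal sketch of such sanity checks (the field names are made up):

<?php
function looksValid(array $row): bool
{
    return filter_var($row['url'] ?? '', FILTER_VALIDATE_URL) !== false
        && filter_var($row['id'] ?? '', FILTER_VALIDATE_INT) !== false
        && trim($row['title'] ?? '') !== '';
}

if (!looksValid($scrapedRow)) {
    // The structure probably changed: skip the insert and flag for review.
    error_log('Scraper sanity check failed: ' . json_encode($scrapedRow));
}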
Depending on the site, you could count the number of page elements in the scraped page (div, class and style tags, for example) and then, by comparing these totals against those of later scrapes, detect whether the page structure has been changed.
A similar process could be used for the CSS file, where the names of each class or id could be extracted using a simple regex, stored, and checked as needed. If this list has new additions, then the page structure has almost certainly changed somewhere on the site being scraped.
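As a sketch of the element-count idea (the tag list is arbitrary, and $previousCensus is assumed to be stored from an earlier scrape):

<?php
function tagCensus(string $html, array $tags = ['div', 'table', 'span']): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $counts = [];
    foreach ($tags as $tag) {
        $counts[$tag] = $doc->getElementsByTagName($tag)->length;
    }
    return $counts;
}

// If the census differs from the last scrape, the structure may have changed.
if (tagCensus($html) !== $previousCensus) {
    // ...hold off on storing the data and investigate.
}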
Speaking out of my ass here, but it's possible you might want to look at some of PHP's Document Object Model methods.
http://php.net/manual/en/book.dom.php
If my very, very limited understanding of the DOM is correct, a change in the HTML site structure would change the Document Object Model, but a simple content change within a fixed structure wouldn't. So, if you could capture the DOM state and then compare it at each scrape, couldn't you in theory determine that such a change has been made?
(By the way, when I was trying to get an email notification whenever the bar exam results were posted on a particular page, I just compared file_get_contents() values. Surprisingly, it worked flawlessly: no false positives, and it emailed me as soon as the site posted the content.)
If you want to know about changes with respect to structure, I think the best way is to store the DOM structure of your first page and then compare it with the new one.
There are a lot of ways you can do it:
SAX parser
DOM parser, etc.
I have a small blog post which gives some pointers to what I mean:
http://let-them-c.blogspot.com/2009/04/xml-as-objects-in-oops.html
or you can use SAX (http://en.wikipedia.org/wiki/Simple_API_for_XML) or a DOM utility parser.
First, in some cases you may want to compare hashes of the original and the new HTML. MD5 and SHA1 are two popular hashes. This may or may not be valid in all circumstances, but it is something you should be familiar with. It will tell you if something has changed - content, tags, or anything.
To understand whether the structure has changed, you would need to capture a histogram of the tag occurrences and then compare those. If you care about tags being out of order, then you would have to capture a tree of the tags and compare it to see whether the tags occur in the same order. This is going to be very specific to what you want to achieve.
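A rough illustration of both checks ($storedHash and $storedHistogram are assumed to come from the previous scrape; the regex tag count is crude and ignores comments/CDATA):

<?php
$hash = sha1($html); // or md5($html)
if ($hash !== $storedHash) {
    // Something changed: content, tags, whitespace, anything.
}

// Order-insensitive histogram of opening-tag occurrences.
preg_match_all('/<([a-z][a-z0-9]*)\b/i', $html, $m);
$histogram = array_count_values(array_map('strtolower', $m[1]));
if ($histogram !== $storedHistogram) {
    // The set or counts of tags differ: likely a structural change.
}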
PHP Simple HTML DOM Parser is a tool which will help you parse the HTML.
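Its API looks roughly like this, if memory serves (after including the library's simple_html_dom.php):

<?php
require_once 'simple_html_dom.php';

$dom = str_get_html($html); // parse a string into a DOM-like object
foreach ($dom->find('a') as $anchor) {
    echo $anchor->href, "\n"; // attributes are exposed as properties
}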
explode() is not an HTML parser, but you want to know about changes in the HTML structure. That's going to be tricky. Try using an HTML parser; nothing else will be able to do this properly.

Best way to handle LARGE strings of text going into a database?

I have built a number of solutions in the past in which people enter data via a webform, validation checks are applied, regex in some cases and everything gets stored in a database. This data is then used to drive output on other pages.
I have a special case here where a user wants to copy/paste HUGE amounts of text (multiple paragraphs with various headers, links, etc. throughout) - what is the best way to handle this before it goes into a database, to provide the best output when it needs to come back out?
So far the best I have come up with is sticking all the output from these fields in PRE tags and using regex to add links where appropriate. I have a database with a list of special keywords that need to be bold or have other styles applied to them, which works fine. So I can make this work using these approaches, but it just seems to me that there is probably a much more graceful way of doing it.
Nicholas
There are a lot of ways you could format the text for output. You could simply use pre tags as you mentioned (if you are worried about wrapping, the CSS white-space property also supports the pre-wrap value, but browser support for this is currently sketchy at best).
There are also a large number of markup languages you could use for more advanced formatting options (some of which are listed here). Stack Overflow itself uses Markdown, which I personally enjoy using very much.
However, as the data is being pasted from another source, a markup language may interfere with the formatting of the text - in which case you could roll your own solution, perhaps using regular expressions and functions like htmlentities and nl2br.
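For instance, a minimal roll-your-own baseline might look like this (the keyword and link values are placeholders):

<?php
// Escape everything, then turn newlines into <br /> so pasted paragraphs survive.
$safe = nl2br(htmlentities($rawText, ENT_QUOTES, 'UTF-8'));

// Styling known keywords can then be a second pass, e.g. (crude):
$safe = preg_replace(
    '/\bFooBar\b/',                      // placeholder keyword
    '<a href="/wiki/FooBar">FooBar</a>', // placeholder link
    $safe
);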
Whatever you decide, I would recommend storing the input in its original form in the database so you can retroactively amend your formatting routines at any time.
If you're expecting a good deal of formatting, you should probably go with a WYSIWYG editor. These editors provide Word-like toolbars and produce (hopefully) valid (X)HTML markup which can be stored directly in a text field in your database. Here are a couple of examples:
FCKeditor - massive number of options/tools
TinyMCE - a nice alternative
Markdown - what stackoverflow.com uses
Both FCKeditor and TinyMCE have been thoroughly tested and have proven to be reliable. I don't have any experience with Markdown, but it seems solid.
I've always hated 'forum' formatting tags like [code], [link], etc. Stack Overflow and others have shown that providing an open WYSIWYG editor is safe, reliable, and very easy to use. Just take the output it gives you, run it through some sort of escape function to check for any kind of injection, XSS, etc., and store it in a text field.
