How to know if the website being scraped has changed? - php

I'm using PHP to scrape a website and collect some data. It's all done without using regex. Instead of a parser, I'm using PHP's explode() function to find particular HTML tags.
It is possible that if the structure of the website changes (CSS, HTML), then wrong data may be collected by the scraper. So the question is: how do I know if the HTML structure has changed? How can I identify this before storing any data in my database, so that wrong data never gets stored?

I don't think you have any clean solutions when you are scraping a page whose content can change.
I have developed several Python scrapers, and I know how frustrating it can be when a site makes even a subtle change to its layout.
You could try a solution along the lines of mechanize (I don't know the PHP counterpart), and if you are lucky you might be able to isolate the content you need to extract (links?).
Another possible approach would be to code some constraints and check them before storing to the database, as sketched below.
For example, if you are scraping URLs, you will need to verify that what the scraper has parsed is formally a valid URL; the same goes for integer IDs or anything else you scrape that can be recognized as valid.
If you are scraping plain text, it will be more difficult to check.
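A minimal sketch of such checks in PHP (the field values here are made up; in practice they would come from your explode() logic):
<?php
// Hypothetical scraped values; in practice these come from your scraping code.
$scrapedUrl = 'https://example.com/article/123';
$scrapedId  = '123';

$errors = [];

// A URL field must at least be a syntactically valid URL.
if (filter_var($scrapedUrl, FILTER_VALIDATE_URL) === false) {
    $errors[] = 'url is not a valid URL';
}

// An ID field must be a positive integer.
if (filter_var($scrapedId, FILTER_VALIDATE_INT, ['options' => ['min_range' => 1]]) === false) {
    $errors[] = 'id is not a positive integer';
}

if ($errors) {
    // The layout probably changed: skip the insert and log instead of storing bad data.
    error_log('Scraper sanity check failed: ' . implode(', ', $errors));
} else {
    // Safe to insert into the database here.
}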

Depending on the site, you could count the number of page elements in the scraped page, such as div tags, class attributes and style tags, and then compare these totals against those of later scrapes to detect whether the page structure has changed.
A similar process could be used for the CSS file, where the name of each class or id could be extracted using a simple regex, stored, and checked as needed. If this list has new additions, then the page structure has almost certainly changed somewhere on the site being scraped.
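A rough sketch of that element-counting idea (the baseline file name and the markers counted are just examples):
<?php
$html = file_get_contents('http://www.example.com/'); // placeholder URL

// Count a few structural markers in the raw HTML.
$counts = [
    'div'   => substr_count($html, '<div'),
    'class' => substr_count($html, 'class='),
    'style' => substr_count($html, '<style'),
];

$baselineFile = 'baseline_counts.json'; // hypothetical location

if (is_file($baselineFile)) {
    $baseline = json_decode(file_get_contents($baselineFile), true);
    if ($counts !== $baseline) {
        // Totals differ from the last known-good scrape: the structure may have changed.
        error_log('Page structure appears to have changed; skipping this scrape.');
        exit;
    }
} else {
    // First run: store the current counts as the baseline.
    file_put_contents($baselineFile, json_encode($counts));
}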

Speaking out of my ass here, but it's possible you might want to look at some of PHP's Document Object Model (DOM) methods.
http://php.net/manual/en/book.dom.php
If my very, very limited understanding of the DOM is correct, a change in the site's HTML structure would change the Document Object Model, but a simple content change within a fixed structure wouldn't. So, if you could capture the DOM state and then compare it on each scrape, couldn't you, in theory, determine that such a change has been made?
(By the way, when I did this to get an email notification when the bar exam results were posted on a particular page, I just compared file_get_contents() values. Surprisingly, it worked flawlessly: no false positives, and it emailed me as soon as the site posted the content.)
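A minimal sketch of that kind of whole-page comparison, assuming you are happy to be notified about any change at all (the URL and file names are placeholders):
<?php
$url  = 'http://www.example.com/results'; // placeholder URL
$html = file_get_contents($url);

$snapshotFile = 'last_snapshot.md5';      // hypothetical location
$currentHash  = md5($html);

$previousHash = is_file($snapshotFile) ? trim(file_get_contents($snapshotFile)) : null;

if ($previousHash !== null && $previousHash !== $currentHash) {
    // Anything at all changed: content, markup, whitespace, ads...
    mail('you@example.com', 'Page changed', 'The page at ' . $url . ' has changed.');
}

file_put_contents($snapshotFile, $currentHash);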

If you want to detect changes with respect to structure, I think the best way is to store the DOM structure of your first page and then compare it with the new one.
There are a lot of ways you can do it:
a SAX parser
a DOM parser, etc.
I have a small blog post which gives some pointers to what I mean:
http://let-them-c.blogspot.com/2009/04/xml-as-objects-in-oops.html
or you can use http://en.wikipedia.org/wiki/Simple_API_for_XML or a DOM utility parser.

First, in some cases you may want to compare hashes of the original HTML to the new HTML. MD5 and SHA-1 are two popular hashes. This may or may not be valid in all circumstances, but it is something you should be familiar with. It will tell you if anything has changed: content, tags, or anything else.
To understand whether the structure has changed, you would need to capture a histogram of the tag occurrences and then compare those. If you care about tags being out of order, you would have to capture a tree of the tags and compare it to see whether the tags occur in the same order. This is going to be very specific to what you want to achieve.
PHP Simple HTML DOM Parser is a tool that will help you parse the HTML.
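For the histogram part, a rough sketch using PHP's built-in DOM extension could look something like this (the baseline handling is only indicative):
<?php
$html = file_get_contents('http://www.example.com/'); // placeholder URL

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings from messy real-world markup

// Build a histogram of tag name => occurrence count.
$histogram = [];
foreach ($doc->getElementsByTagName('*') as $node) {
    $name = $node->nodeName;
    $histogram[$name] = ($histogram[$name] ?? 0) + 1;
}

// Compare against the histogram stored on the previous run (hypothetical file).
$histogramFile = 'tag_histogram.json';
$previous = is_file($histogramFile)
    ? json_decode(file_get_contents($histogramFile), true)
    : null;

if ($previous !== null && $previous != $histogram) {
    error_log('Tag histogram changed: the page structure was probably modified.');
}

file_put_contents($histogramFile, json_encode($histogram));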

explode() is not an HTML parser, yet you want to know about changes in the HTML structure; that is going to be tricky with plain string functions. Try using an HTML parser. Nothing else will be able to do this properly.

Related

Formatting HTML correctly from cURL requests

I am working on an applet that allows the user to input a URL to a news article or other webpage (in Japanese) and view the contents of that page within an iFrame in my page. The idea is that once the content is loaded into the page, the user can highlight words using their cursor, which stores the selected text in an array (for translating/adding to a personal dictionary of terms) and surrounds the text in a red box (div) according to a stylesheet defined on my domain. To do this, I use cURL to retrieve the HTML of the external page and dump it into the source of the iFrame.
However, I keep running into major formatting problems with the retrieved HTML. The big problem is preserving style sheets, and to fix this I've used DOMDocument to add tags to the <head> section of the retrieved HTML. This works for some pages/URLs, but there are still lots of style problems in the output HTML for many others. For example, div layers crash into each other, alignments are off, and backgrounds are missing. This is made a bit more problematic as I need to embed the output HTML into a new <div> in order to make the onClick JavaScript function for passing text selections in the embedded content work, which means the resulting source ends up looking like this:
<div onclick="parent.selectionFunction()" id ="studyContentn">
<!-- HTML of output from cURL, including doctype declarations and <html>,<head> tags -->
</div>
It seems like, for the most part, the formatting issues I keep running into are largely arbitrary. I've tried using PHP Tidy to clean the output HTML, but that also only works for some pages and not for many others. I have a slight suspicion it may have to do with CDATA declarations that get parsed oddly when working with DOMDocument, but I am not certain.
Is there a way I can guarantee that HTML output from cURL will be rendered correctly and faithfully in all instances? Or is there perhaps a better way of going about doing this? I've tried a bunch of different ways of approaching this issue, and each gets closer to a solution but brings its own new problems as well.
Thanks -- let me know if I can clarify anything.
If I understand correctly, you are trying to pull the HTML of a complete web page and display it under your domain, in your own HTML. This is always going to be tricky: a lot of JavaScript will break, relative URLs will be wrong and, as you mentioned, styles as well. You're probably also changing the dimensions that the page is displayed in. These can all be worked around, but you're going to be fighting an uphill battle with each new site, or whenever a current site changes its design.
I'd probably take a different approach to the problem. You might want to write a browser plugin as the interface to the external web site instead. Then your applet can sit on top of the functional and tested (hopefully) site, and you can focus on what you need to do for your applet rather than a never-ending list of fiddly HTML issues.
I am trying to do a similar thing. It is very difficult to preserve the formatting, and the JS scripts in the web page complicate things further. I finally gave up on the idea of completely reproducing the original format and settled for a workaround:
Select only the headers, links, lists and paragraphs you are interested in.
Add your own site's domain path to the links.
You may wrap the headers, links and other items in your own classes.
Display it.
In your case you also want to select text and store it, which is another topic. What I did was parse the HTML at two levels, which makes the selection easy. Keep in mind that IE and Firefox/Chrome need to be dealt with separately.
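As a rough illustration of the first part (selecting elements and fixing links) on the PHP side; the source URL and the selected tags are only examples:
<?php
// Placeholders: the source URL and the selected tags are only examples.
$sourceBase = 'http://www.othersite.com';
$html = file_get_contents($sourceBase . '/article');

$doc = new DOMDocument();
@$doc->loadHTML($html); // silence warnings from messy real-world markup
$xpath = new DOMXPath($doc);

// 1. Make every relative link absolute by prefixing the source domain.
foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if ($href !== '' && strpos($href, 'http') !== 0) {
        $a->setAttribute('href', $sourceBase . '/' . ltrim($href, '/'));
    }
}

// 2. Keep only the headers, paragraphs and lists we care about.
$fragments = [];
foreach ($xpath->query('//h1 | //h2 | //h3 | //p | //ul') as $node) {
    $fragments[] = $doc->saveHTML($node);
}

// 3. Wrap the result in our own class so our stylesheet can take over.
echo '<div class="studyContent">' . implode("\n", $fragments) . '</div>';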

How to properly format text retrieved from a website?

I'm building an application for a company that, unfortunately, has a very poorly designed website. Most of the HTML tags are wrongly and sometimes randomly placed, there is excessive use of non-breaking spaces, p tags are randomly assigned, nothing follows any rules, and so on...
I'm retrieving data from their website using a crawler and then feeding the resulting strings to my application through my own web service. The problem is that once the text is displayed in the Android TextView, it is formatted all wrong: spread out, uneven, very disorderly.
It's also worth mentioning that, for various reasons, I cannot suggest that the company modify their website...
I've tried
String text = Html.fromHtml(myString).toString();
and other variations, I've even tried formatting it manually but it's been a pain.
My question is:
Is there an easy, elegant way to re-format all this text, either with PHP on my web-service or with Java, directly in my Android application?
Thanks to anyone who will take the time to answer...
You can use Tidy with PHP to clean up the code if you're keeping the markup in place. Otherwise, stripping the HTML would probably make working with it a lot easier.
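A minimal sketch of what that could look like with PHP's tidy extension (the configuration shown is just a reasonable starting point, not a prescription):
<?php
$dirtyHtml = file_get_contents('http://www.example-company.com/'); // placeholder URL

$config = [
    'clean'          => true,  // replace presentational markup where possible
    'output-xhtml'   => true,  // produce well-formed XHTML
    'indent'         => true,
    'show-body-only' => true,  // drop the surrounding <html>/<head> wrapper
];

$tidy = new tidy();
$tidy->parseString($dirtyHtml, $config, 'utf8');
$tidy->cleanRepair();

$cleanHtml = (string) $tidy; // repaired markup, ready to send to the app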
I would say: no, there is no easy, elegant way. HTML combines data and visual representation; they are inherently linked. To understand the data you must look at the tags. Tags like <h1> and <a> carry meaning.
If the HTML is structured enough, you could break it down into meaningful blocks (header, body, and unrelated/unimportant stuff) and then apply restyling principles to those. A simpler solution is to just strip all the tags, get only the text nodes and stitch them together. If the HTML is exceptionally poorly formatted you might get sentences that are out of order, but if the HTML isn't too contrived I expect this approach would work.
To give you an indication of the complexity involved: you could have <span>s that have styling applied to them, for instance display: block. This changes the way the span is displayed, from inline to block, so it behaves more like a <div> would. This means that each <span> will likely be on its own line; it will seem to force a line break. Detecting these situations isn't impossible, but it is quite complex. Who knows what happens when you've got list elements, tables or even floating elements; they might be completely out of order.
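Still, as a rough sketch of the simpler "strip the tags, keep the text nodes" approach mentioned above, on the PHP side (the URL is a placeholder):
<?php
$html = file_get_contents('http://www.example-company.com/page'); // placeholder URL

$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$pieces = [];
// Walk every text node in the body, in document order, skipping script/style contents.
foreach ($xpath->query('//body//text()[not(ancestor::script) and not(ancestor::style)]') as $textNode) {
    $text = trim($textNode->nodeValue);
    if ($text !== '') {
        $pieces[] = $text;
    }
}

// Stitch the fragments back together as plain paragraphs for the app.
echo implode("\n\n", $pieces);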
Probably not the most elegant solution, but I managed to get the best results by stripping some tags according to what I needed with PHP (that was really easy to do) and then displaying the retrieved strings in formatted WebViews.
As I said, probably not the most elegant solution but in this case it worked best for me.

Serializing html and css, and storing in mysql

I am in the process of developing an HTML template generator. I would like to store the templates in a MySQL database (or in text files, if that would be more efficient). I am looking for the best way to serialize the HTML and CSS, and then reproduce the original efficiently.
JavaScript and PHP will be used to create, edit and remove elements, from both the browser and the database, for later reproduction.
Elements such as div, p, a, etc. can be added, and nesting should not be an issue. The main problem I am having is this: if a div is nested within a div, and a paragraph element is nested within the second div at a later time, how would all of this be stored in a database? Elements will be deleted and new elements will be added in random order.
I hope this was somewhat clear. I am not really looking for code, but for suggestions on how all of this would work together. Any help would be greatly appreciated. Thanks in advance.
Store the entire thing as a blob in a single field in the database. Unless you're doing something quite novel, there's no reason the database layer needs to have an in-depth understanding of DOM structure.
I want to first point out that what you need is a parser, not a serialization method. You can simply store the template, and then when you read it out, parse each individual element and build the editing form. Sort of like xtgem does. There is no need to serialize, unless you have already parsed the data into an array.
If each user has their own template, then the chance of multiple concurrent reads is very low, almost nil. Flat files will do very well in this situation: no need to sanitise, and better performance. Do not forget file locking, though.
If you are returning data in text input or textarea elements, there is no problem with XSS. You don't need to sanitise output either.
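A minimal sketch of the flat-file idea with locking, assuming one template file per user (the path scheme is made up):
<?php
// Hypothetical helpers: one template file per user under templates/.
function save_template(int $userId, string $html): void
{
    // LOCK_EX takes an exclusive lock for the duration of the write.
    file_put_contents("templates/{$userId}.html", $html, LOCK_EX);
}

function load_template(int $userId): string
{
    $path = "templates/{$userId}.html";
    if (!is_file($path)) {
        return '';
    }

    // Shared lock while reading, so a concurrent write can't hand us half a file.
    $handle = fopen($path, 'r');
    flock($handle, LOCK_SH);
    $html = stream_get_contents($handle);
    flock($handle, LOCK_UN);
    fclose($handle);

    return $html;
}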
I agree with #Rex. When I was developing a forum, I found that storing the original was better; it allowed for easy editing.
It would be best to separate the files and parse them separately. Keep presentation separate from markup.
Parsers have never been my forte, however, so I can't help in that aspect.
You can take a look at pcltemplate from http://phpconcept.net to see how the parser works. Maybe it will give you some ideas.

Using PHP to retrieve information from a different site

I was wondering if there's a way to use PHP (or any other server-side, or even client-side, language if possible) to obtain certain pieces of information from a different website (NOT a local file like include 'nav.php').
What I mean is this: say I have a blog at www.blog.com and I have another website at www.mysite.com.
Is there a way to gather ALL of the h2 links from www.blog.com and put them in a div on www.mysite.com?
Also, is there a way I could grab all of the information inside a DIV (with an ID, of course) from blog.com and insert it into mysite.com?
Thanks,
Amit
First of all, if you want to retrieve content from a blog, check whether the blog engine (i.e. Blogger, WordPress) has an API, thanks to which you won't have to reinvent the wheel. Usually, good APIs come with good documentation (meaning that probably 5% of all APIs are good APIs), and that documentation should come with code examples for top languages such as PHP, JavaScript, Java, etc... Once again, if it is to retrieve content from a blog, there should be tons of frameworks out there for you.
Check out the PHP Simple HTML DOM library
Can be as easy as:
// Create DOM from URL or file
$html = file_get_html('http://www.otherwebsite.com/');
// Find all h2 headings
foreach($html->find('h2') as $element)
    echo $element->plaintext;
This can be done by opening the remote website as a file, then taking the HTML and using the DOM parser to manipulate it.
$site_html = file_get_contents('http://www.example.com/');
$document = new DOMDocument();
$document->loadHTML($site_html);
$all_of_the_h2_tags = $document->getElementsByTagName('h2');
Read more about PHP's DOM functions for what to do from here, such as grabbing other tags, creating new HTML out of bits and pieces of the DOM, and displaying that on your own site.
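For the second part of the question (grabbing a whole div by its ID), here is a minimal continuation of the code above; the "content" id is just an example:
// Continuing from $document above: pull out a div by its id attribute.
$xpath = new DOMXPath($document);
$nodes = $xpath->query('//div[@id="content"]'); // "content" is a placeholder id

if ($nodes->length > 0) {
    // saveHTML() with a node argument returns just that element's markup.
    echo $document->saveHTML($nodes->item(0));
}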
Your first step would be to use cURL to make a request to the other site and bring down the HTML from the page you want to access. Then comes the part of parsing the HTML to find the content you're looking for. You could use a bunch of regular expressions and probably get the job done, but the Stack Overflow crowd might frown at you. You could also take the resulting HTML, use the DOMDocument object, and loadHTML() to parse the HTML and pull out the content you want.
Also, if you control both sites, you can set up a special page on the first site (www.blog.com) with exactly the information you need, properly formatted either in HTML you can output directly, or XML that you can manipulate more easily from www.mysite.com.

How to protect yourself from XSS when you allow people to post RAW embed codes?

Tumblr and other blogging websites allow people to post embed codes for videos from YouTube and other video networks.
But how do they filter out only the Flash object code and remove any other HTML or scripts? They even have automated code that informs you when it is not a valid video code.
Is this done using regex expressions? And is there a PHP class to do that?
Thanks
Generally speaking, using regex is not a good way to deal with HTML: HTML is not regular enough for regular expressions; there are too many variations permitted in the standards... and browsers even accept HTML that isn't valid!
In PHP, as your question is tagged php, a great solution for filtering user input is the HTMLPurifier tool.
A couple of interesting things about it:
It allows you to specify which specific tags are allowed
For each tag, you can define which specific attributes are allowed
Basically, the idea is to only keep what you specify (a white-list), instead of trying to remove bad stuff using a black-list (which will never be quite complete).
And if you only specify a list of tags and attributes that can do no harm, only those will be kept -- and the risk of injections is lowered a lot.
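A minimal sketch of what that white-list configuration can look like (the exact directives depend on your HTML Purifier version and on what you actually want to allow; this is only an illustration):
<?php
require_once 'HTMLPurifier.auto.php'; // path depends on how you installed the library

$config = HTMLPurifier_Config::createDefault();
// White-list of ordinary markup: only these tags/attributes survive.
$config->set('HTML.Allowed', 'p,b,i,a[href],img[src|alt]');
// Opt-in support for (purified) object/embed video markup.
$config->set('HTML.SafeObject', true);
$config->set('HTML.SafeEmbed', true);

$purifier = new HTMLPurifier($config);
$clean = $purifier->purify($userSubmittedHtml); // raw user input in, filtered HTML out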
Quoting HTML Purifier's home page:
HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.
Yes, another great thing is that the code you get as output is valid.
Of course, this will only allow you to clean / filter / purify the HTML input; it will not allow you to validate that the URL used by the user is both:
correct, i.e. pointing to real content
"OK" as defined by your website, i.e. for example no nudity, ...
About the second point, there's not much you can do: the best solution will be to either:
have a moderator accept / reject the content before it's put online, or
give the website's users a way to flag content as inappropriate, so a moderator can take action.
Basically, to check the content of the video itself, there is not much choice but to have a human being say "OK" or "not OK".
About the first point, though, there's hope: some services that host content have APIs that you might want / be able to use.
For instance, YouTube provides an API -- see Developer's Guide: PHP.
In your case, the Retrieving a specific video entry section looks promising: if you send an HTTP request to a URL that looks like this:
http://gdata.youtube.com/feeds/api/videos/videoID
(Replacing "videoID" by the ID of the video, of course)
You'll get some ATOM feed if the video is valid ; and "Invalid id" if it's not
This might help you validate at least some URL to contents -- even if you'll have to develop some specific code for each possible content-hosting service that your users like...
Now, to extract the identifier of the video from your HTML string... if you're thinking about using regex, you are wrong ;-)
The best solution for extracting a portion of data from an HTML string is generally to:
Load the HTML using a DOM parser; DOMDocument::loadHTML is generally pretty helpful here
Go through the document using DOM methods, either of these depending on your situation:
DOMDocument::getElementsByTagName, if you need to iterate over all elements that have a specific tag name; this might be great for iterating over all <object> or <embed> tags, for instance
Or, if you need something more complex, you could run an XPath query, using the DOMXPath class and its DOMXPath::query method.
Using DOM will also allow you to modify the HTML document with a standard API, which might help if you want to add some message next to the video, or anything like that.
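As a rough sketch of that DOM-based extraction, pulling YouTube video IDs out of pasted embed markup (the URL patterns handled here are deliberately simplistic):
<?php
function extract_youtube_ids(string $userHtml): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($userHtml); // pasted embed code is rarely well-formed

    $ids = [];
    // Old-style embeds use <embed src="...">, newer ones <iframe src="...">.
    foreach (['embed', 'iframe'] as $tag) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $src = $node->getAttribute('src');
            // Very naive pattern: grab the 11-character ID after /v/ or /embed/.
            if (preg_match('#youtube\.com/(?:v|embed)/([A-Za-z0-9_-]{11})#', $src, $m)) {
                $ids[] = $m[1];
            }
        }
    }
    return $ids;
}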
Take a look at htmlpurifier to start.
http://htmlpurifier.org/
I have implemented an algorithm for this at the company I work for. It works just fine. BUT, it was quite complicated to implement.
I would definitely check out HTMLPurifier to see whether it works in an easy way for you. If you insist on doing it the old-school way like I did, these are the basic steps:
1. First of all, get friendly with stripos().
2. You have to write a recursive function to identify the start and stop tags for the widget, covering all combinations of <embed></embed> or <embed/> (self-closing) or <object></object> ... or <object><params>...<embed/></object>.
3. After this, you have to parse out all the attributes and params.
4. Now, all <object> tags should have <param> tags as child elements. You have to parse all of these to get all the data you need for finally generating a new embed or object tag. Especially the params and attributes that hold width, height and the data source are important.
5. You don't know whether the attributes are enclosed in single or double quotes, so your code has to be lenient about that. You also don't know whether the code is valid or well formed, so it should be able to handle nested embed/object tags, embed tags that are not closed correctly, etc. As it is user-generated content, you can't really know or trust the input. You will see that there are lots of combinations.
6. If you manage to parse the embed element with all its attributes (or the object element and its child params), the whitelisting of domains is easy...
My code ended up being about 800 lines, which is quite large, and it was filled with recursive methods finding the correct start and end tags, etc. My algorithm also removed all the SEO text that is often included in the cut-and-paste embed code, like links back to the site hosting the widget.
It's a good exercise, but if I were you... don't start walking down this road.
Recommendation: try to find something ready-made and open source!
This will never be safe. Browsers have those funny little features that help people display the content of their pages even if the HTML is messy. There are endless opportunities to get something through :)
Check here to see the tip of the iceberg.
What you need to do is use a single input for just a link, plus additional inputs for width and height, filter those, and THEN generate the object tag yourself.
This might be safe.
http://php.net/manual/en/function.strip-tags.php
and allow certain tags?
The simplest and most elegant solution: Allowing HTML and Preventing XSS # shiflett.org.
Using all sorts of "HTML purifiers" is more than pointless. Sorry, but I don't get people who like to use these bloated libraries when a much simpler solution is at hand.
If you're looking to make your site "safe" from vulnerabilities, a white-list approach is the (only) way to go. I would recommend safely escaping all user-generated content, and white-listing only markup you know is safe and works on your site. This means not only <b> tags, but also the Flash embeds.
For example, if you want to allow any YouTube video to be embedded, write a validation regex that looks for the embed code YouTube generates. Refuse to accept any others (or simply display them as escaped markup). This is testable. Forget all this parsing nonsense.
If you also want to add Vimeo videos, then look at the embed code they provide and accept that as well.
Ugh? I know this seems like a pain, but in reality it's much easier to write than some algorithm that tries to detect "bad" content in some sort of generic fashion.
After getting the simple version working, you could go back and make it nicer. You could "provisionally" accept content with URLs, scripts, etc. that don't pass your white-list, and have an admin process for adding approved regexes to your output escaping routine. That way legitimate users aren't left out in the cold, but you don't open yourself up to attacks of this nature.
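A rough sketch of that white-list idea; the regex and the generated markup are only illustrative and would need to be kept in sync with whatever embed format YouTube currently uses:
<?php
// Accept nothing but a plain YouTube watch URL, then build the markup ourselves.
function render_youtube_embed(string $url, int $width = 640, int $height = 360): ?string
{
    // Hypothetical white-list pattern: only the canonical watch URL is accepted.
    if (!preg_match('#^https?://(?:www\.)?youtube\.com/watch\?v=([A-Za-z0-9_-]{11})$#', $url, $m)) {
        return null; // refuse anything that doesn't match the white-list
    }

    $id = htmlspecialchars($m[1], ENT_QUOTES);
    $w  = (int) $width;
    $h  = (int) $height;

    // We generate the embed tag ourselves, so no user-supplied HTML ever reaches the page.
    return "<iframe width=\"$w\" height=\"$h\" src=\"https://www.youtube.com/embed/$id\" frameborder=\"0\" allowfullscreen></iframe>";
}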
