How can I scrape a website with invalid HTML [duplicate] - php

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 4 years ago.
I'm trying to scrape data from a website that has invalid HTML. Simple HTML DOM Parser parses it but loses some info because of how it's handling the invalid HTML. The built-in DOM parser with DOMXPath isn't working; it returns a blank result set. I was able to get it (DOMDocument and DOMXPath) working locally after running the fetched HTML through PHP Tidy, but PHP Tidy isn't installed on the server, and it's a shared hosting server, so I have no control over that. I tried HTMLPurifier, but that just seems to be for securing user input, since it completely removes the doctype, head, and body tags.
Is there any kind of standalone alternative to PHP Tidy? I would really prefer to use DOMXPath to navigate around and grab what I need, it just seems to need some help cleaning the HTML up before it can parse it.
Edit: I'm scraping this website: http://courseschedules.njit.edu/index.aspx?semester=2010f. For now I'm just trying to get all the course links.

DOM handles broken HTML fine if you use loadHTML or loadHTMLFile:
$dom = new DOMDocument;
libxml_use_internal_errors(TRUE);
$dom->loadHTMLFile('http://courseschedules.njit.edu/index.aspx?semester=2010f');
libxml_clear_errors();
$xPath = new DOMXPath($dom);
$links = $xPath->query('//div[@class="courseList_section"]//a');
foreach ($links as $link) {
    printf("%s (%s)\n", $link->nodeValue, $link->getAttribute('href'));
}
will output
ACCT - Accounting (index.aspx?semester=2010f&subjectID=ACCT)
AD - Art and Design (index.aspx?semester=2010f&subjectID=AD )
ARCH - Architecture (index.aspx?semester=2010f&subjectID=ARCH)
... many more ...
TRAN - Transportation Engr (index.aspx?semester=2010f&subjectID=TRAN)
TUTR - Tutoring (index.aspx?semester=2010f&subjectID=TUTR)
URB - Urban Systems (index.aspx?semester=2010f&subjectID=URB )
Using
echo $dom->saveXML($link), PHP_EOL;
in the foreach loop will output the full outerHTML of the links.

If you know the errors, you might apply some regular expressions to fix them specifically. While this ad-hoc solution might seem dirty, it may actually be better: if the HTML is indeed malformed, it can be complex to infer a correct interpretation automatically.
EDIT: Actually, it might be better to simply extract the needed information with regular expressions, as the page has many errors which would be hard, or at least tedious, to fix.
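A minimal, hedged sketch of that idea; the two "fixes" below are invented examples, not actual errors found on the NJIT page, so the patterns would have to be adapted to whatever is really broken:
$html = file_get_contents('http://courseschedules.njit.edu/index.aspx?semester=2010f');
// Hypothetical fix 1: quote bare numeric attribute values, e.g. width=100 -> width="100"
$html = preg_replace('/(\w+)=(\d+)([\s>])/', '$1="$2"$3', $html);
// Hypothetical fix 2: strip a tag that is known to be left unclosed and to confuse the parser
$html = preg_replace('#</?font[^>]*>#i', '', $html);
$dom = new DOMDocument;
libxml_use_internal_errors(true); // silence warnings about whatever errors remain
$dom->loadHTML($html);
libxml_clear_errors();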

Is there a web service that will run your content through Tidy? Could you write one? Tidy is the only sane way I know of fixing broken markup.
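For reference, where the tidy extension is available (locally, or on a box you could turn into a small cleanup service), the repair step is essentially one call; this is a sketch of that setup, not something that will run on the shared host in question:
$dirty = file_get_contents('http://courseschedules.njit.edu/index.aspx?semester=2010f');
// tidy_repair_string() returns a cleaned-up document string
$clean = tidy_repair_string($dirty, array('output-xhtml' => true, 'wrap' => 0), 'utf8');
$dom = new DOMDocument;
$dom->loadHTML($clean);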

Another simple way to solve the problem could be to pass the site you are trying to scrape through a mobile adapter such as Google's Mobilizer, which handles complicated websites. This will correct the invalid HTML and enable you to use the Simple HTML DOM Parser package, but it might not work if you need some of the information that gets stripped out of the site. The link to this adapter is below. I use this for sites where the information is poorly formatted, or when I need a way to simplify the formatting so that it is easy to parse. The HTML returned by the Google Mobilizer is simpler and much easier to process.
http://www.google.com/gwt/n

Related

Robust way to parse text content from HTML in PHP?

I'm trying to find a robust way to parse ALL of the text (i.e. non-HTML/non-code/non-script content) from an HTML document. I'm talking specifically about extracting keywords from any input web page on the internet. I'm writing a keyword spider that tracks keyword trends on web pages using PHP, and although I've found a number of great ways to actually read in the content (like DOMDocument and cURL), I'm having a hard time finding any robust solutions for actually parsing out all of the word content separate from the HTML/JavaScript/CSS/etc. on any old random page on the Internet.
I first tried using strip_tags(), but it has lots of artifacts of javascript and other xml that might be on the page. I've also tried Simple HTML DOM, but it seems to have problems with punctuation and whitespace handling. I finally tried building a library from tutorials on nadeausoftware, and while it works phenomenally well on most pages, on some pages it doesn't return any content at all (I guess the curse of trying to use regex for parsing).
I'm just wondering if there aren't any php libraries that provide the specific capability of grabbing all of the non-html/non-javascript/non-xml/non-code words from an HTML document. I know that might sound like a tall order, and I'm not looking for perfection, but if there's a solution that's 80% reliable on most web-pages, I'd be happy.
Thanks for any help anyone can provide!
You could load the document, get rid of the tags you don't want and then query the textContent property:
$html = '<html><head><style type="text/css">hola</style></head><body><script>tada</script>hello <span>world</span></body></html>';
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('*') as $node) {
    if (in_array($node->nodeName, array('script', 'style'))) {
        $node->parentNode->removeChild($node);
    }
}
echo $dom->documentElement->textContent;
// hello world
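If the end goal is keyword extraction, a minimal follow-up sketch (assuming the page text is UTF-8) is to split that textContent into words and count them; the splitting regex is a rough default, not a canonical recipe:
$text = $dom->documentElement->textContent; // continues from the example above
// Split on anything that is not a letter or digit
$words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
$counts = array_count_values(array_map('mb_strtolower', $words));
arsort($counts);
print_r(array_slice($counts, 0, 10, true)); // ten most frequent words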
As it turns out, the PHP parsing code from Nadeau Software is actually more robust than I originally gave it credit for; on additional tinkering, I discovered that the problems I was encountering were due to my feeding the parser HTML content that wasn't properly encoded as UTF-8.
It's unfortunate that there don't seem to be any existing libraries for handling such a complex use case, but at least I was able to get the tutorial code to work on a broad number of test cases.
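Since encoding turned out to be the culprit, here is a hedged sketch of normalising a fetched page to UTF-8 before parsing; the candidate encodings listed are just common guesses:
$raw = file_get_contents('http://www.example.com/');
// Try to detect the source encoding; fall back to ISO-8859-1 if detection fails
$from = mb_detect_encoding($raw, array('UTF-8', 'ISO-8859-1', 'Windows-1251'), true);
$utf8 = mb_convert_encoding($raw, 'UTF-8', $from ?: 'ISO-8859-1');
// $utf8 can now be handed to DOMDocument or any other parser that expects UTF-8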

Parsing HTML with PHP to get data for several articles of the same kind

I'm working on a website which parses coupon sites and lists those coupons. There are some sites which provide their listings as an XML file; no problem with those. But there are also some sites which do not provide XML. I'm thinking of parsing those sites and getting the coupon information from the page content, i.e. grabbing that data from the HTML with PHP. As an example, you can see the following site:
http://www.biglion.ru/moscow/
I'm working with PHP. So, my question is - is there a relatively easy way to parse HTML and get the data for each coupon listed on that site just like I get while parsing XML?
Thanks for the help.
You can always use a DOM parser, but scraping content from sites is unreliable at best.
If their layout changes ever so slightly, your app could fail. Oh, and in most cases it's also against most sites' terms of service to do so.
While using a DOM parser might seem a good idea, I usually prefer good old regular expressions for scraping. It's much less work, and if the site changes its layout you're screwed anyway, whatever your approach is. But with a smart enough regex, your code should be immune to changes that do not directly impact the part you're interested in.
One thing to remember is to include class names in the regex when they're provided, but to assume anything can appear between the pieces of info you need. E.g.
preg_match_all('#class="actionsItemHeadding".*?<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>#s', file_get_contents('http://www.biglion.ru/moscow/'), $matches, PREG_SET_ORDER);
print_r($matches);
The most reliable method is the PHP Simple HTML DOM Parser, if you prefer working with PHP.
Here is an example of parsing only the <a> elements.
// Include the library
include('simple_html_dom.php');
// Retrieve the DOM from a given URL
$html = file_get_html('http://mypage.com/');
// Find all "A" tags and print their HREFs
foreach ($html->find('a') as $e)
    echo $e->href . '<br>';
The same find() calls give you access to the other HTML elements too.
I hope that will be useful to you.
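As a sketch of that, Simple HTML DOM accepts CSS-like selectors, so picking out individual coupon blocks could look roughly like this; the class and tag names are made up for illustration and would need to match the real markup of the target page:
include('simple_html_dom.php');
$html = file_get_html('http://www.biglion.ru/moscow/');
// Hypothetical structure: each coupon is assumed to live in <div class="coupon">
foreach ($html->find('div.coupon') as $coupon) {
    $title = $coupon->find('h2', 0);         // first <h2> inside the block
    $price = $coupon->find('span.price', 0); // hypothetical price element
    echo ($title ? trim($title->plaintext) : '?'), ' - ',
         ($price ? trim($price->plaintext) : '?'), "\n";
}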

PHP Dom document html is faster or preg_match_all function is faster?

I have a doubt about which one is faster in processing.
Is DOMDocument or preg_match_all (together with cURL) faster for parsing an HTML page? And will the DOM functions leave a trace on the other server the way the cURL functions do? For example, with cURL we use a user agent to define who is accessing the page, but with DOMDocument there is nothing like that.
Does it matter which is faster if one gives you incorrect results?
Matching with regular expressions to get a single bit of data out of the document will be faster than parsing an entire HTML document. But regular expressions cannot parse HTML correctly in all cases.
See http://htmlparsing.com/regexes.html, which I started in order to address this common question. (And for the rest of you reading this, I can use help. The source is on GitHub, and I need examples for many different languages.)
Regular expressions will likely be faster, but they are also likely the worse choice. Unless you have benchmarked and profiled your application and found nothing else to optimize, you should look into a proper existing parser.
While regular expressions can be used to match HTML, it takes a thorough effort to come up with a reliable parser. PHP offers a bunch of native extensions to work with XML (and HTML) reliably. There are also a number of third-party libraries. See my answer to
Best Methods to parse HTML
As for sending a custom user agent, this is possible with DOM too. You have to create a custom stream context and attach it to the underlying libxml functions. You can supply any of the available HTTP stream context options this way. See my answer to
DOMDocument::validate() problem
for an example of how to supply a custom UserAgent.
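A minimal sketch of that technique; the user agent string and the URL are just examples:
// Build a stream context carrying the custom User-Agent (other HTTP options work too)
$context = stream_context_create(array(
    'http' => array('user_agent' => 'MyScraper/1.0 (+http://www.mysite.com/bot)'),
));
// Tell libxml (and therefore DOMDocument::loadHTMLFile) to use that context
libxml_set_streams_context($context);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('http://www.example.com/');
libxml_clear_errors();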
DOM functions don't have anything to do with HTML fetching.
However, there are load functions that can be used to fetch HTTP resources directly.
They will show the same behaviour as file_get_contents without context params.
As to the other part of your question: the preg functions are faster. However, they are not intended for that use, and you will probably regret using them for this purpose very soon.
If you are parsing HTML with regular expressions, you are either completely, insanely nuts-awesome, or you just don't get the concept of HTML.
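If speed really matters, measure both approaches on your own pages; a rough benchmarking sketch, with the URL and the regex as placeholders:
$html = file_get_contents('http://www.example.com/');
// Time the regex approach
$t0 = microtime(true);
preg_match_all('#<a[^>]+href="([^"]*)"#i', $html, $m);
$regexTime = microtime(true) - $t0;
// Time the DOM approach
$t0 = microtime(true);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$links = $dom->getElementsByTagName('a');
$domTime = microtime(true) - $t0;
printf("regex: %.4fs (%d matches), DOM: %.4fs (%d links)\n",
       $regexTime, count($m[1]), $domTime, $links->length);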

Using PHP to retrieve information from a different site

I was wondering if there's a way to use PHP (or any other server-side, or even client-side [if possible], language) to obtain certain pieces of information from a different website (NOT a local file like include 'nav.php').
What I mean is that...Say I have a blog at www.blog.com and I have another website at www.mysite.com
Is there a way to gather ALL of the h2 links from www.blog.com and put them in a div in www.mysite.com?
Also, is there a way I could grab the entire information inside a DIV (with an ID of-course) from blog.com and insert it in mysite.com?
Thanks,
Amit
First of all, if you want to retrieve content from a blog, check whether the blog engine (i.e. Blogger, WordPress) has an API, so that you won't have to reinvent the wheel. Usually, good APIs come with good documentation (meaning that probably only 5% of all APIs are good APIs), and that documentation should come with code examples for the major languages such as PHP, JavaScript, Java, etc. Once again, if it is to retrieve content from a blog, there should be tons of frameworks that can do this for you.
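For instance, most blog engines expose an RSS or Atom feed; a hedged sketch of pulling the latest post titles and links with SimpleXML (the /feed/ URL is the common WordPress convention, not something confirmed for this particular blog):
$xml = simplexml_load_file('http://www.blog.com/feed/');
if ($xml !== false) {
    foreach ($xml->channel->item as $item) {
        echo $item->title, ' - ', $item->link, "\n";
    }
}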
Check out the PHP Simple HTML DOM library
Can be as easy as:
// Include the library
include('simple_html_dom.php');
// Create DOM from URL or file
$html = file_get_html('http://www.otherwebsite.com/');
// Find all h2 elements and print their text
foreach ($html->find('h2') as $element)
    echo $element->plaintext;
This can be done by opening the remote website as a file, then taking the HTML and using the DOM parser to manipulate it.
$site_html = file_get_contents('http://www.example.com/');
$document = new DOMDocument();
$document->loadHTML($site_html);
$all_of_the_h2_tags = $document->getElementsByTagName('h2');
Read more about PHP's DOM functions for what to do from here, such as grabbing other tags, creating new HTML out of bits and pieces of the DOM, and displaying that on your own site.
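For the second part of the question (grabbing a whole div by its ID), a sketch along the same lines; 'content' is a placeholder ID, not the real one on blog.com:
// Continues from the snippet above: $document already holds the remote page
$xpath = new DOMXPath($document);
$div = $xpath->query('//div[@id="content"]')->item(0); // hypothetical ID
if ($div !== null) {
    // saveHTML() with a node argument returns just that element's markup (PHP 5.3.6+)
    echo $document->saveHTML($div);
}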
Your first step would be to use cURL to make a request to the other site and bring down the HTML from the page you want to access. Then comes the part of parsing the HTML to find all the content you're looking for. You could use a bunch of regular expressions and probably get the job done, but the Stack Overflow crowd might frown at you. You could also take the resulting HTML and use the DOMDocument object with loadHTML to parse it and pull out the content you want.
Also, if you control both sites, you can set up a special page on the first site (www.blog.com) with exactly the information you need, properly formatted either in HTML you can output directly, or XML that you can manipulate more easily from www.mysite.com.

HTML Scraping in Php [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 9 years ago.
I've been doing some HTML scraping in PHP using regular expressions. This works, but the result is finicky and fragile. Has anyone used any packages that provide a more robust solution? A config driven solution would be ideal, but I'm not picky.
I would recommend PHP Simple HTML DOM Parser after you have scraped the HTML from the page. It supports invalid HTML and provides a very easy way to handle HTML elements.
If the page you're scraping is valid X(HT)ML, then any of PHP's built-in XML parsers will do.
I haven't had much success with PHP libraries for scraping. If you're adventurous though, you can try simplehtmldom. I'd recommend Hpricot for Ruby or Beautiful Soup for Python, which are both excellent parsers for HTML.
I would also recommend 'Simple HTML DOM Parser.' It is a good option; particularly if you're familiar with jQuery or JavaScript selectors, you will find yourself at home.
I have even blogged about it in the past.
I had some fun working with htmlSQL, which is not so much a high-end solution, but really simple to work with.
Using PHP for HTML scraping, I'd recommend cURL + regexp or cURL + some DOM parser, though I personally use cURL + regexp. If you have a profound taste for regexp, it's actually more accurate sometimes.
I've had very good results with the Simple HTML DOM Parser mentioned above as well. And then there's the Tidy extension for PHP, which works really well too.
I had to use curl on my host 1and1.
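For hosts where allow_url_fopen is disabled (which appears to have been the issue here), a basic cURL fetch before parsing looks roughly like this:
$ch = curl_init('http://www.example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
curl_setopt($ch, CURLOPT_USERAGENT, 'MyScraper/1.0');
$html = curl_exec($ch);
curl_close($ch);
// $html can now be handed to str_get_html() (Simple HTML DOM) or DOMDocument::loadHTML()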
http://www.quickscrape.com/ is what I came up with using the Simple DOM class!
