HTML Scraping in PHP [duplicate] - php

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 9 years ago.
I've been doing some HTML scraping in PHP using regular expressions. This works, but the result is finicky and fragile. Has anyone used any packages that provide a more robust solution? A config driven solution would be ideal, but I'm not picky.

I would recommend PHP Simple HTML DOM Parser after you have scraped the HTML from the page. It supports invalid HTML, and provides a very easy way to handle HTML elements.

If the page you're scraping is valid X(HT)ML, then any of PHP's built-in XML parsers will do.
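As a minimal sketch of the built-in XML route, assuming the input really is well-formed, SimpleXML can load the markup and query it with XPath. The fragment and the variable names below are made up for illustration.

```php
<?php
// Hypothetical, well-formed XHTML fragment used only for illustration.
$xhtml = <<<XML
<html>
  <body>
    <ul>
      <li><a href="/first">First</a></li>
      <li><a href="/second">Second</a></li>
    </ul>
  </body>
</html>
XML;

// simplexml_load_string() returns false when the markup is not well-formed XML.
$doc = simplexml_load_string($xhtml);
if ($doc === false) {
    throw new RuntimeException('Input is not well-formed XML');
}

// Map each anchor's href attribute to its link text.
$links = [];
foreach ($doc->xpath('//a') as $a) {
    $links[(string) $a['href']] = (string) $a;
}

print_r($links);
```

Real XHTML pages usually declare a default namespace, in which case the XPath query needs a registered prefix (see registerXPathNamespace) before '//a' will match anything.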
I haven't had much success with PHP libraries for scraping. If you're adventurous though, you can try simplehtmldom. I'd recommend Hpricot for Ruby or Beautiful Soup for Python, which are both excellent parsers for HTML.

I would also recommend 'Simple HTML DOM Parser.' It is a good option, particularly if you're familiar with jQuery or JavaScript selectors; you will find yourself at home.
I have even blogged about it in the past.

I had some fun working with htmlSQL, which is not so much a high end solution, but really simple to work with.

Using PHP for HTML scraping, I'd recommend cURL + regexp or cURL + a DOM parser, though I personally use cURL + regexp. If you are comfortable with regular expressions, that approach is sometimes more accurate.

I've had very good results with the Simple HTML DOM Parser mentioned above as well. There's also the Tidy extension for PHP, which works really well too.

I had to use curl on my host 1and1.
http://www.quickscrape.com/ is what I came up with using the Simple DOM class!

Related

Parsing tag’s properties from HTML with php

There are 365 tags of this form:
<sometag class="day" date="yyyy-mm-dd" count="some Int"></sometag>
I have to parse the date and count with PHP, Beautiful Soup, or any parsing library that can be used with PHP, and build a JSON string.
How can I parse this? Every tag has the same class name, "day".
I will be waiting for more answers for more wide info. Thank you.
Agree with @Scuzzy. SimpleXML would be the easiest and cleanest way for this very specific example. The problem with SimpleXML is that I have found it to be very slow when parsing larger documents.
There's also this library for parsing html which I think you will find very useful if you're screen scraping with php which it looks like you are doing.
http://www.schrenk.com/nostarch/webbots/DSP_download.php
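For the date/count question above, one possible approach (a sketch, not the only way) is to load the markup with DOMDocument, select every element whose class is "day" with DOMXPath, and pass the collected attributes to json_encode. The sample input and its values are invented for illustration.

```php
<?php
// Hypothetical sample input modelled on the markup in the question.
$html = '<sometag class="day" date="2020-01-01" count="3"></sometag>'
      . '<sometag class="day" date="2020-01-02" count="0"></sometag>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // <sometag> is not a known HTML element
$dom->loadHTML($html);
libxml_clear_errors();

// Select every element, whatever its tag name, whose class is exactly "day".
$xpath = new DOMXPath($dom);
$days = [];
foreach ($xpath->query('//*[@class="day"]') as $node) {
    $days[] = [
        'date'  => $node->getAttribute('date'),
        'count' => (int) $node->getAttribute('count'),
    ];
}

$json = json_encode($days);
echo $json, PHP_EOL;
```

Casting count to int makes it a JSON number rather than a string; drop the cast if the consumer expects strings.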

Getting specific HTML from a webpage with PHP [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to parse and process HTML with PHP?
I am learning PHP and when I have to extract (parse) some data from a webpage that does not have an available API, I use regular expressions or a function which takes the string that is between two strings.
I would like to know if there is a more "professional", easier way to do this, since regexp are resource consuming and not the easiest thing to write right now for me.
You should never try to parse XML (HTML) using regular expressions; instead, get yourself a proper parser library for XML and do it the correct way. It might sound like a harder task, but you'll thank yourself in the end.
Parsing could be done using one of the below, or similar resources.
php.net - PHP: DOM - Manual
simplehtmldom.sourceforge.net - PHP Simple HTML DOM Parser
The popular and legendary answer regarding html and regular-expressions, poetry worth reading:
stackoverflow.com - The legendary HTML+RegExp answer!
PHP comes with a default XML parsing library for you to use in this specific case. Use file_get_contents in order to retrieve the HTML page and parse accordingly.
XML: http://php.net/manual/en/book.xml.php
file_get_contents: http://php.net/manual/en/function.file-get-contents.php
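A rough sketch of the parser-based approach, using the DOM extension rather than regular expressions or "text between two strings" helpers. The function name extractText and the sample markup are hypothetical; in real use the HTML string would come from file_get_contents or cURL.

```php
<?php
/**
 * Return the trimmed text of every element matched by an XPath expression.
 * Helper name and signature are illustrative, not part of any library.
 */
function extractText(string $html, string $query): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);  // collect parse warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $results = [];
    foreach ((new DOMXPath($dom))->query($query) as $node) {
        $results[] = trim($node->textContent);
    }
    return $results;
}

// In real use the markup would come from something like
// file_get_contents('https://example.com/'); a literal keeps the sketch offline.
$html = '<html><body><h1>Prices</h1>'
      . '<span class="price"> 10.50 </span><span class="price">7.25</span>'
      . '</body></html>';

print_r(extractText($html, '//span[@class="price"]'));
```

Changing what gets extracted then means changing the XPath expression, not rewriting a pattern that depends on exact whitespace and attribute order.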

PHP Tidy alternative to only tab-indent output [duplicate]

This question already has answers here:
PHP function/class that formats/indents my HTML code? [duplicate]
(3 answers)
Closed 9 years ago.
Is there any PHP Tidy alternative to only tab-indent HTML output? I need the latter for development/debug purposes only to go through the generated output code. Though, as much as I tried to configure Tidy for this simple task, I couldn't without preventing other changes.
Two years later and there is still no library to achieve HTML output indentation without using implementations that rely on DOM API (ie. Tidy and alike).
I've developed a library that tokenises HTML input using regular expressions. None of the HTML is changed beyond adding the required spacing for indentation.
https://github.com/gajus/dindent
I always use jsbeautifier. Though it doesn't follow my standards with javascript, the html indentation is awesome.
EDIT: Before you downvote, notice that jsbeautifier is open source, and has ports in several languages, all serverside: https://github.com/einars/js-beautify
You can try the htmLawed library. It's a Tidy alternative for PHP. If you just need an indenting function, you can use the code for the hl_tidy function of the library.
// indent using one tab per indent, with all HTML being within an imaginary div
$out = hl_tidy($in, 't', 'div');
I use LogicHammers HTMLFormatter which you need to pay for but is worth every penny. Use it to format the html before you look at it and it makes it much easier.
Though this is not an exact answer, see if it helps you. I use NetBeans, and to indent code I simply right-click and choose Format. If you are using another IDE, look for similar functionality, or you may be able to add it with a third-party plugin.

Alternative to regexes for parsing html tags - How do you do it in pure code?

Recently read this SO Post ...first answer is nutz. Basically it is theoretically impossible for large models because of Chomsky Grammars Types.
What is the alternative? I don't want to use a library object like DOMDocument; I want to understand the correct way to do this with pure code.
If you don't want to use DOMDocument (though I'd urge you to look into it again, it's not that bad - especially combined with DOMXPath), you can also use PHPQuery or Simple HTML DOM Parser.

How can I scrape a website with invalid HTML [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 4 years ago.
I'm trying to scrape data from a website that has invalid HTML. Simple HTML DOM Parser parses it but loses some info because of how it's handling the invalid HTML. The built-in DOM parser with DOMXPath isn't working; it returns a blank result set. I was able to get it (DOMDocument and DOMXPath) working locally after running the fetched HTML through PHP Tidy, but PHP Tidy isn't installed on the server and it's a shared hosting server, so I have no control over that. I tried HTMLPurifier, but that just seems to be for securing user input, since it completely removes the doctype, head, and body tags.
Is there any kind of standalone alternative to PHP Tidy? I would really prefer to use DOMXPath to navigate around and grab what I need, it just seems to need some help cleaning the HTML up before it can parse it.
Edit: I'm scraping this website: http://courseschedules.njit.edu/index.aspx?semester=2010f. For now I'm just trying to get all the course links.
DOM handles broken HTML fine if you use loadHTML or loadHTMLFile:
$dom = new DOMDocument;
libxml_use_internal_errors(TRUE);
$dom->loadHTMLFile('http://courseschedules.njit.edu/index.aspx?semester=2010f');
libxml_clear_errors();
$xPath = new DOMXPath($dom);
$links = $xPath->query('//div[@class="courseList_section"]//a');
foreach($links as $link) {
    printf("%s (%s)\n", $link->nodeValue, $link->getAttribute('href'));
}
will output
ACCT - Accounting (index.aspx?semester=2010f&subjectID=ACCT)
AD - Art and Design (index.aspx?semester=2010f&subjectID=AD )
ARCH - Architecture (index.aspx?semester=2010f&subjectID=ARCH)
... many more ...
TRAN - Transportation Engr (index.aspx?semester=2010f&subjectID=TRAN)
TUTR - Tutoring (index.aspx?semester=2010f&subjectID=TUTR)
URB - Urban Systems (index.aspx?semester=2010f&subjectID=URB )
Using
echo $dom->saveXML($link), PHP_EOL;
in the foreach loop will output the full outerHTML of the links.
If you know the errors, you might apply some regular expressions to fix them specifically. While this ad-hoc solution might seem dirty, it may actually be better: if the HTML is indeed malformed, inferring a correct interpretation automatically can be complex.
EDIT: Actually it might be better to simply extract the needed information through regular expressions as the page has many errors which would be hard or at least tedious to fix.
Is there a web service that will run your content through Tidy? Could you write one? Tidy is the only sane way I know of fixing broken markup.
Another simple way to solve the problem could be passing the site you are trying to scrape through a mobile browser adapter such as Google's mobilizer for complicated websites. This will correct the invalid HTML and enable you to use the Simple HTML DOM Parser package, but it might not work if you need some of the information that is stripped out of the site. The link to this adapter is below. I use this for sites on which the information is poorly formatted or if I need a way to simplify the formatting so that it is easy to parse. The HTML returned by the Google mobilizer is simpler and much easier to process.
http://www.google.com/gwt/n
