This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to parse and process HTML with PHP?
I am learning PHP and when I have to extract (parse) some data from a webpage that does not have an available API, I use regular expressions or a function which takes the string that is between two strings.
I would like to know if there is a more "professional", easier way to do this, since regular expressions are resource-intensive and I still find them hard to write.
You should never try to parse XML (or HTML) with regular expressions. Instead, get yourself a proper XML parser library and do it the correct way. It might sound like a harder task, but you'll thank yourself in the end.
Parsing can be done with one of the resources below, or something similar.
php.net - PHP: DOM - Manual
simplehtmldom.sourceforge.net - PHP Simple HTML DOM Parser
The popular and legendary answer regarding html and regular-expressions, poetry worth reading:
stackoverflow.com - The legendary HTML+RegExp answer!
PHP comes with a default XML parsing library for you to use in this specific case. Use file_get_contents in order to retrieve the HTML page and parse accordingly.
XML: http://php.net/manual/en/book.xml.php
file_get_contents: http://php.net/manual/en/function.file-get-contents.php
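As a rough illustration of the fetch-then-parse approach described above, here is a minimal sketch. It swaps in PHP's DOM extension (DOMDocument) rather than the XML Parser linked above, on the assumption that the page is ordinary HTML that may not be well-formed XML. The function name extractLinks and the sample markup are hypothetical, chosen only for this example.

```php
<?php
// Hypothetical helper: returns the href of every <a> element in an HTML string.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely well-formed; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $links[] = $anchor->getAttribute('href');
        }
    }
    return $links;
}

// Inline sample so the sketch runs offline. For a live page you would fetch the
// markup first, e.g. $html = file_get_contents('http://example.com/'); (placeholder URL)
$html = '<html><body><p>Hi <a href="/a">one</a> and <a href="/b">two</a></body></html>';
print_r(extractLinks($html));
```

The same function works on a string returned by file_get_contents, provided the fetch succeeded and did not return false.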
There are 365 tags like this:
<sometag class="day" date="yyyy-mm-dd" count="some Int"></sometag>
I have to parse the date and count with PHP, Beautiful Soup, or any parsing library that can be used with PHP, and build a JSON string.
How can I parse them? Every tag has the same class name, "day".
I'll wait for more answers with broader information. Thank you.
Agree with @Scuzzy. SimpleXML would be the easiest and cleanest way for this very specific example. The problem with SimpleXML is that I have found it to be very slow when parsing larger documents.
There's also this library for parsing HTML, which I think you will find very useful if you're screen scraping with PHP, as it looks like you are.
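For the date/count tags in the question, the SimpleXML route might look like the sketch below. It assumes the tags sit inside well-formed XML; the wrapping <root> element, the sample dates, and the function name daysToJson are all hypothetical.

```php
<?php
// Hypothetical sample input; <sometag> comes from the question, <root> is an assumed wrapper.
$xml = <<<XML
<root>
  <sometag class="day" date="2020-01-01" count="3"></sometag>
  <sometag class="day" date="2020-01-02" count="0"></sometag>
</root>
XML;

// Hypothetical helper: collects date/count from every element with class="day".
function daysToJson(string $xml): string
{
    $root = simplexml_load_string($xml);
    if ($root === false) {
        throw new InvalidArgumentException('Input is not well-formed XML');
    }

    $days = [];
    foreach ($root->xpath('//*[@class="day"]') as $node) {
        $days[] = [
            'date'  => (string) $node['date'], // attributes are read with array syntax
            'count' => (int) $node['count'],
        ];
    }
    return json_encode($days);
}

echo daysToJson($xml), "\n";
```

If the real page is HTML rather than clean XML, simplexml_load_string will likely fail, and DOMDocument::loadHTML with the same XPath query is the safer choice.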
http://www.schrenk.com/nostarch/webbots/DSP_download.php
This question already has answers here:
PHP function/class that formats/indents my HTML code? [duplicate]
(3 answers)
Closed 9 years ago.
Is there any PHP Tidy alternative that only tab-indents HTML output? I need it for development/debug purposes only, to read through the generated output code. However much I tried to configure Tidy for this simple task, I couldn't stop it from making other changes.
Two years later and there is still no library to achieve HTML output indentation without using implementations that rely on the DOM API (i.e. Tidy and the like).
I've developed a library that tokenises HTML input using regular expressions. None of the HTML is changed beyond adding the required spacing for indentation.
https://github.com/gajus/dindent
I always use jsbeautifier. Though it doesn't follow my standards for JavaScript, the HTML indentation is awesome.
EDIT: Before you downvote, notice that jsbeautifier is open source and has server-side ports in several languages: https://github.com/einars/js-beautify
You can try the htmLawed library. It's a Tidy alternative for PHP. If you just need an indenting function, you can use the code for the hl_tidy function of the library.
// indent using one tab per indent, with all HTML being within an imaginary div
$out = hl_tidy($in, 't', 'div');
I use LogicHammers HTMLFormatter, which you need to pay for but which is worth every penny. Use it to format the HTML before you look at it and it becomes much easier to read.
Though this is not an exact answer, see if it helps you. I use NetBeans, and to indent code I simply right-click and choose Format. If you are using another IDE, search for similar functionality, or maybe you can add it with a third-party plugin.
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to parse and process HTML/XML with PHP?
I want to grab anything that's inside the code below on a website.
<table class="eventResults byEvent vevent"></table>
How can I accomplish this?
Thanks
If you want to grab HTML from a site, you will want to use a DOM parser. PHP has several XML processing packages to help you with this, be it DOM, SimpleXML or XMLReader. An often suggested alternative at SO is SimpleHtmlDom.
Since one of the classes on the table is vevent, the content inside the table could be an hCalendar microformat (can't tell for sure without seeing the content). If so, you can also use a microformat parser, preferably Transformr, to save you the work of manually parsing the event data.
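A minimal sketch of the DOM route for the table in the question, using DOMDocument and DOMXPath. The function name innerHtmlByClass and the sample row are hypothetical; the class names come from the question.

```php
<?php
// Hypothetical helper: inner HTML of the first <$tag> carrying the given class, or null.
// $tag and $class are assumed to be trusted literals; they are placed in the query as-is.
function innerHtmlByClass(string $html, string $tag, string $class): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup quietly
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word inside a space-separated class attribute.
    $query = sprintf(
        '//%s[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $tag,
        $class
    );
    $node = $xpath->query($query)->item(0);
    if ($node === null) {
        return null;
    }

    // Serialize each child so the result excludes the outer tag itself.
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Inline sample so the sketch runs offline; a live page would be fetched first.
$html = '<div><table class="eventResults byEvent vevent"><tr><td>Final</td></tr></table></div>';
echo innerHtmlByClass($html, 'table', 'vevent'), "\n";
```

Matching on a single class such as vevent, rather than the full attribute string, keeps the query working if the site reorders or adds classes.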
You can use the file_get_contents function or the curl extension for that.
First fetch the content of the whole page from that website as a string, then use a regular expression to extract the substring inside the table tags.
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 3 years ago.
I need to extract all the tags from an HTML file, in such a way that I end up with either an array containing key=value for each of the attributes, or at least the raw text that makes up the tag.
I don't get along well with regex, much less in PHP, so I would really appreciate some help with this.
PS: Some of the tags may span several lines and be indented with tabs and spaces on the subsequent lines.
Thanks.
You can use the DOM functions to parse an XML/XHTML document into a DOM Tree. From there it's not too hard to traverse the nodes you wish, extracting the data you're looking for.
Some people prefer the SimpleXML functions which might work equally well for you. I personally have issues with SimpleXML and prefer the more verbose, but more powerful DOM functions.
Yes, it's easy. Use PHP's DOM functions and find the nodes with XPath.
That should be the painless way.
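The DOM-plus-XPath suggestion could be sketched roughly as follows for the "every tag with its attributes" case. The function name listTags and the sample markup are hypothetical. Because the parser works on the document tree, tags that span several lines or are indented with tabs need no special handling.

```php
<?php
// Hypothetical helper: one entry per element, with its attributes as key => value.
function listTags(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tags = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*') as $element) { // '//*' selects every element node
        $attributes = [];
        foreach ($element->attributes as $attr) {
            $attributes[$attr->name] = $attr->value;
        }
        $tags[] = ['tag' => $element->nodeName, 'attributes' => $attributes];
    }
    return $tags;
}

// The opening <p> tag spans two lines and uses a tab, as in the question.
$html = "<p id=\"intro\"\n\tclass=\"lead\">Hello <a href=\"/x\">link</a></p>";
print_r(listTags($html));
```

Note that loadHTML wraps a fragment in html and body elements, so those appear in the result too; filter them out if only the original tags matter.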
Another option is the simplehtmldom library.
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 9 years ago.
I've been doing some HTML scraping in PHP using regular expressions. This works, but the result is finicky and fragile. Has anyone used any packages that provide a more robust solution? A config driven solution would be ideal, but I'm not picky.
I would recommend PHP Simple HTML DOM Parser once you have fetched the HTML from the page. It supports invalid HTML and provides a very easy way to handle HTML elements.
If the page you're scraping is valid X(HT)ML, then any of PHP's built-in XML parsers will do.
I haven't had much success with PHP libraries for scraping. If you're adventurous though, you can try simplehtmldom. I'd recommend Hpricot for Ruby or Beautiful Soup for Python, which are both excellent parsers for HTML.
I would also recommend 'Simple HTML DOM Parser.' It is a good option, particularly if you're familiar with jQuery or JavaScript selectors, in which case you will find yourself at home.
I have even blogged about it in the past.
I had some fun working with htmlSQL, which is not so much a high end solution, but really simple to work with.
For HTML scraping in PHP, I'd recommend cURL + regexp or cURL + a DOM parser, though I personally use cURL + regexp. If you are comfortable with regexp, it's actually more accurate sometimes.
I've had very good results with the Simple HTML DOM Parser mentioned above as well. And then there's the tidy extension for PHP, which works really well too.
I had to use curl on my host 1and1.
http://www.quickscrape.com/ is what I came up with using the Simple DOM class!