I'm trying to fetch data from a div (based on its id) using PHP's PCRE. The goal is to fetch the div's contents based on its id, using recursion / depth to get everything inside it. The main problem is getting the other divs inside the "main div", because the regex stops at the first </div> it finds after the initial <div id="test">.
I've tried many different approaches, and none of them worked. The best solution, in my opinion, is to use the R parameter (recursion), but I never got it to work properly.
Any ideas?
Thanks in advance :D
You'd be much better off using some form of DOM parser; regex really isn't suited to this problem. If all you want is basic HTML DOM parsing, something like simplehtmldom would be right up your alley. It's trivial to install (just include a single PHP file) and trivial to use (2-3 lines will do what you need).
include('simple_html_dom.php');
$dom = str_get_html($bunchofhtmlcode);
$testdiv = $dom->find('div#test',0); // 0 for the first occurrence
$testdiv_contents = $testdiv->innertext;
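If adding a library is not an option, roughly the same thing can be sketched with PHP's built-in DOM extension. This is a minimal sketch, not the simplehtmldom API: the helper name innerHtmlById and the sample markup are made up for illustration, and the id is assumed to contain no double quotes.

```php
<?php
// Hypothetical sample input; in practice this would be the fetched page.
$html = '<html><body><div id="test">Outer <div class="inner">nested <b>text</b></div> tail</div>'
      . '<div id="other">skip</div></body></html>';

// Hypothetical helper: inner HTML of the first div with the given id, or null if none.
function innerHtmlById(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep warnings about sloppy markup quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Assumes $id contains no double quotes.
    $node = $xpath->query('//div[@id="' . $id . '"]')->item(0);
    if ($node === null) {
        return null;
    }

    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= $doc->saveHTML($child);   // serialise each child, nested divs included
    }
    return $inner;
}

$testdivContents = innerHtmlById($html, 'test');
```

Because the parser builds a tree, nested divs inside the target come along automatically, which is the part the regex approach struggles with.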
Related
I’m writing a custom parser/data extractor for some pretty shitty HTML.
Changing the HTML is out of the question.
I will spare you the details of the hoops I’ve had to jump through, but I’ve now come pretty close to my original goal. I’m using a combination of DOMDocument getElementsByTagName, regular expression replace (I know, I know...), and XPath queries.
I need to get all the text out of the body of the document. I would like for the navigation to remain a separate entity, at least in the abstract. Here’s what I’m doing now:
$contentnodes = $xpath->query("//body//*[not(self::a)]/text()|//body//ul/li/a");
$output = array();
foreach ($contentnodes as $contentnode) {
    $type = $contentnode->nodeName;
    $content = $contentnode->nodeValue;
    $output[] = array($type, $content);
}
This works, except that of course it treats all of the links on the page differently, and I only want it to do that to the navigation.
What XPath syntax can I use so that, in the first part of that query (before the |), I tell it to get all the text nodes of body’s descendants except those in ul > li > a?
Please note that I cannot rely on the presence of p tags or h1 tags or anything sensible like that to make educated guesses about content.
Thanks
Update: @hr_117’s answer below works. I’ve also found that you can use multiple not statements like so:
//body//text()[not(parent::a/parent::li/parent::ul)][not(parent::h1)]
You may try something like this:
//body//text()[not(parent::a/parent::li/parent::ul)]|//body//ul/li/a
//body//*[not(self::a/parent::li/parent::ul)]/text()[normalize-space()]|//body//ul/li/a
(test)
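To see what the first expression selects, here is a small sketch run against made-up markup (the sample HTML is an assumption, not the asker's page). A normalize-space() predicate is added to drop whitespace-only text nodes.

```php
<?php
// Hypothetical page: one navigation link and one inline link in the content.
$html = '<body><h1>Title</h1><ul><li><a href="/a">Nav A</a></li></ul>'
      . '<div>Body text <a href="/x">inline link</a> more.</div></body>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate imperfect markup without warnings
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Text nodes anywhere in body, except those whose parent is ul > li > a,
// plus the navigation links themselves as elements.
$query = '//body//text()[not(parent::a/parent::li/parent::ul)][normalize-space()]'
       . '|//body//ul/li/a';

$output = array();
foreach ($xpath->query($query) as $node) {
    $output[] = array($node->nodeName, trim($node->nodeValue));
}
```

The navigation link arrives as an a element, while the inline link only contributes its text, so the two kinds of link can be told apart by nodeName.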
I'm currently using this code:
$blog= file_get_contents("http://powback.tumblr.com/post/" . $post);
echo $blog;
And it works. But tumblr has added a script that activates each time you enter a password-field. So my question is:
Can I remove certain parts with file_get_contents? Or just remove everything above the <html> tag? Could I possibly kill a whole div so it won't load at all? And if so, how?
edit:
I managed to do it the simple way, by skipping the first 766 characters. The script now works as intended!
$blog = file_get_contents("http://powback.tumblr.com/post/" . $post, false, null, 766);
After file_get_contents returns, you have in your hands a string. You can do anything you want to it, including cutting out parts of it.
There are two ways to actually do the cutting:
1. Using string functions like str_replace, preg_replace and others; the exact recipe depends on what you need to do. This approach is somewhat frowned upon because you are working at the wrong level of abstraction, but in some cases it has an unmatched performance to time spent ratio.
2. Parsing the HTML into a DOM tree, modifying it appropriately (this time working at the appropriate level of abstraction), then turning it back into a string and echoing it. This can be more convenient to work with if your requirements are not dead simple, and it is easier to maintain, but it typically requires more code.
If you want to do something that's most naturally expressed in HTML document terms ("cutting out this <div>"), don't be tempted by string functions; go with the second approach.
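A minimal sketch of the second approach, using PHP's built-in DOM extension. The helper name stripNodes, the sample markup and the id "unwanted" are assumptions for illustration; the real page will need its own XPath expression.

```php
<?php
// Hypothetical stand-in for the string returned by file_get_contents().
$blog = '<html><head><script>alert(1)</script></head>'
      . '<body><div id="unwanted">bad</div><p>Post text</p></body></html>';

// Hypothetical helper: remove every node matched by $xpathExpr and return the HTML.
function stripNodes(string $html, string $xpathExpr): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep warnings about sloppy markup quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Copy the matches into an array first so removals cannot affect the iteration.
    foreach (iterator_to_array($xpath->query($xpathExpr)) as $node) {
        $node->parentNode->removeChild($node);
    }
    return $doc->saveHTML();
}

$clean = stripNodes($blog, '//script|//div[@id="unwanted"]');
echo $clean;
```

Unlike skipping a fixed number of characters, this keeps working if the page's leading markup changes length.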
At that point, $blog is just a string, so you can use normal PHP functions to alter it. Look into these 2:
http://php.net/manual/en/function.str-replace.php
http://us2.php.net/manual/en/function.preg-replace.php
You can parse your output using Simple HTML DOM Parser and display only the contents that you really want to display.
How do I go about pulling specific content from a given live online HTML page?
For example: http://www.gumtree.com/p/for-sale/ovation-semi-acoustic-guitar/93991967
I want to retrieve the text description, the path to the main image and the price only. So basically, I want to retrieve content which is inside specific divs with maybe specific IDs or classes inside a html page.
Pseudocode
$page = load_html_contents('http://www.gumtr..');
$price = getPrice($page);
$description = getDescription($page);
$title = getTitle($page);
Please note I do not intend to steal any content from gumtree, or anywhere else for that matter, I am just providing an example.
First of all, what you want to do is called web scraping.
Basically, you load the HTML content into a variable, and then you need to use regexps to search for specific ids, etc.
Search for "web scraping".
HERE is a basic tutorial
THIS book should be useful too.
Something like this would be a good starting point if you wanted tabular output:
$raw = file_get_contents($url) or die('could not select');
// strip tabs, line breaks, double spaces and <br/> so the markup sits on one line
$newlines = array("\t", "\n", "\r", "\x20\x20", "\0", "\x0B", "<br/>");
$content = str_replace($newlines, "", html_entity_decode($raw));
// placeholders: replace with the markers that surround the table on your page
$start = strpos($content, '<some id> ');
$end = strpos($content, '</ending id>');
$table = substr($content, $start, $end - $start);
preg_match_all("|<tr(.*)</tr>|U", $table, $rows);
foreach ($rows[0] as $row) {
    // skip header rows
    if (strpos($row, '<th') === false) {
        // array to vars
        preg_match_all("|<td(.*)</td>|U", $row, $cells);
        $var1 = strip_tags($cells[0][0]);
        $var2 = strip_tags($cells[0][1]);
        // etc. for the remaining cells
    }
}
The tutorial Easy web scraping with PHP recommended by robotrobert is a good place to start; I have left several comments on it. For better performance, use curl. Among other things, it handles HTTP headers, SSL, cookies, proxies, etc. Cookies are something you must pay attention to.
I just found HTML Parsing and Screen Scraping with the Simple HTML DOM Library. It is more advanced, and it makes page parsing easier and faster by using a DOM parser instead of regular expressions, which are hard to master and resource-hungry. I recommend this last one 100%.
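The pseudocode in the question could be fleshed out roughly as follows with PHP's built-in DOM extension. Everything about the markup here is an assumption: the sample page, the class "price" and the ids "description" and "main-image" are invented, so a real listing page will need its own XPath expressions. The helper firstValue is hypothetical too.

```php
<?php
// Hypothetical markup standing in for the downloaded page.
$page = '<html><head><title>Ovation guitar</title></head><body>'
      . '<span class="price">150 GBP</span>'
      . '<div id="description">Semi-acoustic, good condition.</div>'
      . '<img id="main-image" src="/img/guitar.jpg">'
      . '</body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world pages are rarely valid HTML
$doc->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Hypothetical helper: trimmed value of the first match, or null if nothing matches.
function firstValue(DOMXPath $xpath, string $expr): ?string
{
    $node = $xpath->query($expr)->item(0);
    return $node === null ? null : trim($node->nodeValue);
}

$title       = firstValue($xpath, '//title');
$price       = firstValue($xpath, '//*[@class="price"]');
$description = firstValue($xpath, '//div[@id="description"]');
$image       = firstValue($xpath, '//img[@id="main-image"]/@src');
```

Returning null for a missing element makes it easy to notice when the site changes its layout.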
G'day dear community, hello all!
I am trying to select either a class or an id using PHP Simple HTML DOM Parser, with absolutely no luck. Perhaps I have to study the man pages again.
The DOM technique somewhat goes over my head.
My example is very simple and seems to comply with the examples given in the manual (simplehtmldom.sourceforge AT net/manual.htm), but it just won't work, and it's driving me up the wall. The other example scripts shipped with Simple HTML DOM work fine.
See the example: http://www.aktive-buergerschaft.de/buergerstiftungsfinder
This is the easiest example I have found. The question is: how do I parse it?
Should I do it with Perl? The example HTML page is invalid HTML.
I do not know if the Simple HTML DOM Parser is able to handle badly malformed HTML
(probably not). And then I am lost.
It is pretty hard to believe, but you can get the content with file_get_contents. Afterwards, though, you have to do the parsing job yourself, and that is where I have some missing parts.
Finally, if I cannot get it to run, I can try out some Perl parsers, e.g. HTML::TreeBuilder::XPath.
1: Check whether file_get_contents is working.
2: If not, use curl, fopen or telnet to read the data.
Simple HTML DOM filters out the noise and can also process malformed tags.
The problem might be with your data retrieval.
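As a rough illustration of both points, the sketch below first checks that retrieval produced something, then parses deliberately broken markup. It uses PHP's built-in DOM extension rather than Simple HTML DOM, so it only shows that a lenient parser copes with unclosed tags; the helper name parseLenient and the sample markup are assumptions.

```php
<?php
// Hypothetical broken markup: the <p> tags are never closed.
$raw = '<div id="list"><p>one<p>two</div>';

// Hypothetical helper: returns a parsed document, or null if there is nothing to parse.
function parseLenient($raw): ?DOMDocument
{
    // Step 1: file_get_contents() returns false on failure, so check before parsing.
    if ($raw === false || $raw === '') {
        return null;
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);   // collect parse errors instead of emitting warnings
    $doc->loadHTML($raw);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);          // restore the earlier setting
    return $doc;
}

$doc = parseLenient($raw);
$texts = array();
foreach ($doc->getElementsByTagName('p') as $p) {
    $texts[] = trim($p->textContent);
}
```

If parseLenient returns null, the problem is in fetching the page, not in parsing it.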
Are there built-in functions in the latest versions of PHP specially designed to aid in this task?
Use a DOM parser like SimpleXML to split the HTML code into nodes, and walk through the nodes to build the array.
For broken/invalid HTML, SimpleHTMLDOM is more lenient (but it's not built in).
String replace and explode would work if the HTML code is clean and always the same, but as soon as you have new attributes it will break.
So the only dependable solution would be using regular expressions or an XML/HTML parser.
Check http://php.net/manual/en/book.dom.php
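A minimal sketch of the parser route with the built-in DOM extension, turning a table into a nested array. The helper name tableToArray and the sample table are made up for illustration, and the sketch assumes the input contains a single, non-nested table.

```php
<?php
// Hypothetical input table.
$html = '<table class="example">'
      . '<tr><th>Name</th><th>Age</th></tr>'
      . '<tr><td>Ann</td><td>30</td></tr>'
      . '<tr><td>Bob</td><td>25</td></tr>'
      . '</table>';

// Hypothetical helper: one sub-array per row, one string per cell.
function tableToArray(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep warnings about sloppy markup quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $rows = array();
    // Assumes a single, non-nested table: every <tr> in the input is collected.
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = array();
        foreach ($tr->childNodes as $cell) {
            if ($cell->nodeName === 'td' || $cell->nodeName === 'th') {
                $cells[] = trim($cell->textContent);
            }
        }
        $rows[] = $cells;
    }
    return $rows;
}

$rows = tableToArray($html);
```

New attributes on the cells or rows do not affect the result, which is where the string replace and explode approach tends to fail.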
An alternative to using a native DOM parser could be using YQL. This way you don't have to do the actual parsing yourself. The YQL Web Service enables applications to query, filter, and combine data from different sources across the Internet.
For instance, to grab the HTML table with the class example given at
http://www.w3schools.com/html/html_tables.asp
you can do
$yql = 'http://tinyurl.com/yql-table-grab';
$yql = json_decode(file_get_contents($yql));
print_r( $yql->query->results );
I've deliberately shortened the URL so it does not mess up the answer. $yql actually links to the YQL API, adds some options and contains the query:
select * from html
where xpath="//table[@class='example']"
and url="http://www.w3schools.com/html/html_tables.asp"
YQL can return JSON and XML. I've made it return JSON and then decoded it, which results in a nested structure of stdClass objects and arrays (so it's not all arrays). You have to see if that fits your needs.
You can try out the interactive YQL console to see how it works.
I don't know if this is the fastest way, but you can check this class (using preg_replace):
http://wonshik.com/snippet/Convert-HTML-Table-into-a-PHP-Array
If you want to convert the html-description of a table, here's how I would do it:
remove all closing tags (</...>) ( http://php.net/manual/de/function.str-replace.php)
split string at opening tags (<...>) using a regular expression ( http://php.net/manual/en/function.split.php; note that split() was removed in PHP 7, where preg_split does this job)
You have to work out the details on your own, since I do not know if you want to handle different lines as subarrays or you want to merge all lines into one big array or something else.
You could use the explode function to turn the table columns and rows into arrays.
see: php explode