Using DOMXpath to find data in not so nice html

I am trying to get some data from a plant list site. This proves to be a bit problematic because their html isn't really well-formed. These are two lines from the search result (disclaimer: I am not responsible for this code):
<tr>
<td>
<i class="glyphicons-icon leaf"></i>
</td>
<td>
<a title="Cimicifuga simplex" href="/taxon/wfo-0000604773" class="result">
<h4 class="h4Results"><em>Cimicifuga simplex</em>(DC.) Wormsk. ex Turcz.</h4>
</a>
Bull. Soc. Imp. Naturalistes Moscou<br/>
<div>
<em>Status:</em><span id="entryStatus">Synonym of </span>
<em>Actaea simplex</em>(DC.) Wormsk. ex Prantl
</div>
<div>
<em>Rank:</em><span id="entryRank">Species</span>
</div>
<div>
<em>Family:</em> Ranunculaceae
</div>
</td>
<td>
<img title="No Image Available" src="/css/images/no_image.jpg" class="thumbnail pull-right"/>
</td>
</tr>
<tr>
<td>
<i class="glyphicons-icon leaf"></i>
</td>
<td>
<a title="Actaea simplex" href="/taxon/wfo-0000519124" class="result">
<h4 class="h4Results"><strong><em>Actaea simplex</em>(DC.) Wormsk. ex Prantl</strong></h4>
</a>
Bot. Jahrb. Syst.<br/>
<div>
<em>Status:</em><span id="entryStatus">Accepted Name</span>
</div>
<div>
<em>Rank:</em><span id="entryRank">Species</span>
</div>
<div>
<em>Family:</em> Ranunculaceae</div>
<div>
<em>Order:</em> Ranunculales
</div>
</td>
<td>
<img title="No Image Available" src="/css/images/no_image.jpg" class="thumbnail pull-right"/>
</td>
</tr>
I added some layout myself, otherwise it wasn't readable.
Anyway, I loaded the page in PHP with DOMXPath, and now I want to do two things:
1) Select the row that has Accepted Name in it
2) Get the species name and the corresponding link from it
In this case the result would be "Actaea simplex" and "/taxon/wfo-0000519124". Mind that there will be more results resembling the first row, and that the position of the row that I am looking for doesn't have to be the second one.
Normally I just try, use Google, and try some more, and in the end I get there. But in this case ids are used like classes and are not unique. This makes it impossible to use an XPath tester, and perhaps makes DOMXPath useless as well.
So, is it possible to get my data with DOMXpath, and if yes - what query do I use?

Try something like:
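// Real-world pages are rarely valid XML, so parse as HTML and keep parser warnings quiet.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html); // $html holds the downloaded page
$xpath = new DOMXPath($dom);
// Find the cell whose status <span> reads "Accepted Name", then take its link.
$target = $xpath->query("//td[.//span[.='Accepted Name']]/a");
$link = $target->item(0)->getAttribute('href');
$title = $target->item(0)->getAttribute('title');
echo $title, " ", $link;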
Output
Actaea simplex /taxon/wfo-0000519124
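A slightly more defensive variant of the same idea, as a sketch only: it matches on the whole row, ignores stray whitespace in the status text, and collects every accepted name rather than just the first. The sample markup is trimmed from the question; the $accepted array, the normalize-space() call, and the class='result' filter are assumptions added for illustration.

```php
<?php
// Trimmed sample based on the two rows in the question.
$html = <<<'HTML'
<table>
<tr><td>
<a title="Cimicifuga simplex" href="/taxon/wfo-0000604773" class="result">
<h4 class="h4Results"><em>Cimicifuga simplex</em>(DC.) Wormsk. ex Turcz.</h4>
</a>
<div><em>Status:</em><span id="entryStatus">Synonym of </span></div>
</td></tr>
<tr><td>
<a title="Actaea simplex" href="/taxon/wfo-0000519124" class="result">
<h4 class="h4Results"><strong><em>Actaea simplex</em>(DC.) Wormsk. ex Prantl</strong></h4>
</a>
<div><em>Status:</em><span id="entryStatus">Accepted Name</span></div>
</td></tr>
</table>
HTML;

// Collect parser warnings (for example about duplicate ids) instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Rows whose status span reads "Accepted Name", then the result link inside that row.
$query = "//tr[.//span[normalize-space(.)='Accepted Name']]//a[@class='result']";

$accepted = [];
foreach ($xpath->query($query) as $a) {
    $accepted[] = [
        'name' => $a->getAttribute('title'),
        'link' => $a->getAttribute('href'),
    ];
}

foreach ($accepted as $row) {
    echo $row['name'], ' ', $row['link'], PHP_EOL;
}
```

Reading the name from the title attribute avoids having to untangle the author citation that follows the <em> inside the <h4>.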

Related

Extract links from specific table

I have HTML code with many tables. I want to extract links from one specific table, identified by a specific div above it.
Here's my sample code:
<div class="boxuniwersal_header">Table 1</div>
<img src="img/boxuniwersal_top.gif" width="210" height="18" alt="" style="margin-top: 5px" />
<div class="boxuniwersal_content">
<div class="boxuniwersal_subcontent">
<div class='menu_m1'><table cellpadding="3"><tr><td><img src="some.jpg" width="45" /></td><td>Some text</td></tr></table></div>
<br />
</div>
</div>
<!-- /box -->
<!-- box -->
<div class="boxuniwersal_header">Table 2</div>
<img src="img/boxuniwersal_top.gif" width="210" height="18" alt="" style="margin-top: 5px" />
<div class="boxuniwersal_content">
<div class="boxuniwersal_subcontent">
<div class='menu_m1'><table cellpadding="3"><tr><td><img src="some2.jpg" width="45" /></td><td>Some text2</td></tr></table></div>
<br />
</div>
</div>
$domXPath = new DOMXPath($domDocument);
$results = $domXPath->query("//div/div/table/tr/td/a|//table//tr/td//a"); //querying domdocument
foreach($results as $result)
{
$links[]=$result->getAttribute("href");
}
This code returns all links. I want to grab only the links from Table 1. Is it possible?

Your main problem is just tuning the XPath expression so that it selects the right part of the document.
Change your XPath to:
//div[text()="Table 1"]/following-sibling::div[1]//table//a
This first finds the <div> element whose text is the one you're after.
The following-sibling::div[1] part then selects the first <div> element that follows at the same level as the <div> already selected (this is the one that contains the <table>).
The last part selects all <a> elements within the enclosing <table>.
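To see the expression in action, here is a minimal sketch. The question's sample contains no <a> elements, so the links and their href values below are made up for illustration; the class names are taken from the question.

```php
<?php
// Sample markup modelled on the question; the <a> elements are hypothetical.
$html = '
<div class="boxuniwersal_header">Table 1</div>
<img src="img/boxuniwersal_top.gif" alt="" />
<div class="boxuniwersal_content">
<div class="boxuniwersal_subcontent">
<div class="menu_m1"><table><tr>
<td><a href="/table1/first">First</a></td>
<td><a href="/table1/second">Second</a></td>
</tr></table></div>
</div>
</div>
<div class="boxuniwersal_header">Table 2</div>
<div class="boxuniwersal_content">
<div class="boxuniwersal_subcontent">
<div class="menu_m1"><table><tr><td><a href="/table2/first">Other</a></td></tr></table></div>
</div>
</div>';

libxml_use_internal_errors(true);
$domDocument = new DOMDocument();
$domDocument->loadHTML($html);
libxml_clear_errors();

$domXPath = new DOMXPath($domDocument);
// Header div with the wanted text -> first following sibling div -> links in its table.
$results = $domXPath->query('//div[text()="Table 1"]/following-sibling::div[1]//table//a');

$links = [];
foreach ($results as $result) {
    $links[] = $result->getAttribute('href');
}
print_r($links);
```

Note that following-sibling::div[1] skips the <img> between the header and the content box, because only <div> siblings are counted.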

Get all strings between two other strings in html document in PHP

I'm creating a kind of crawler/proxy at the moment. It lets users navigate another website while remaining on my website. While it loads a page, I would like it to collect all the links and data at the same time.
The website contains many <tr> elements, each of which contains a lot of other markup.
Here is one example of the many on the website:
<tr>
<td class="vertTh">
<center>
Other
<br>
Document
</center>
</td>
<td>
<div class="Name">
Document Title Info
</div>
<a href="http://example.com/source/to/document/which%20can%20be%20very%20long%20and%20have%20weird%20characters" title="Source">
<img src="/static/img/icon-source.png" alt="Source">
</a>
<font class="Desc">Uploaded 03-24 14:02, Size 267.35 KB, ULed by <a class="Desc" href="/s/user/username/" title="Browse username">username</a></font>
</td>
<td align="right">67</td>
<td align="right">9</td>
</tr>
Users browse the proxy site, and while they do, it catches info from the original website.
I figured out how to get a string between two other strings, but I don't know how to turn this into a foreach loop or something similar.
So let's say I want to get the source link. Then I would do something like this:
$url = $_GET['url'];
$str = file_get_contents('https://database.com/' . $url);
$source = 'http://example.com/source/to/' . getStringBetween($str,'example.com/source/to/','" title="Source">'); // Output looking like this: http://example.com/source/to/document/which%20can%20be%20very%20long%20and%20have%20weird%20characters
function getStringBetween($str,$from,$to)
{
$sub = substr($str, strpos($str,$from)+strlen($from),strlen($str));
return substr($sub,0,strpos($sub,$to));
}
But this only finds the first match, and there are many of these rows. Is there a way to get the source, name, and size from all of them?

You might want to use preg_match_all so that you get a list of many matches. Then you can loop over it.
http://php.net/manual/en/function.preg-match-all.php
$html = '<tr>
<td class="vertTh">
<center>
Other
<br>
Document
</center>
</td>
<td>
<div class="Name">
Document Title Info
</div>
<a href="http://another-example.com/source/to/document/which%20can%20be%20very%20long%20and%20have%20weird%20characters" title="Source">
<img src="/static/img/icon-source.png" alt="Source">
</a>
<a href="http://example.com/source/to/document/which%20can%20be%20very%20long%20and%20have%20weird%20characters" title="Source">
<img src="/static/img/icon-source.png" alt="Source">
</a>
<font class="Desc">Uploaded 03-24 14:02, Size 267.35 KB, ULed by <a class="Desc" href="/s/user/username/" title="Browse username">username</a></font>
</td>
<td align="right">67</td>
<td align="right">9</td>
</tr>';
// use | as delimiter for pattern to make it a little cleaner
preg_match_all('|href="(http://.+?)" title="Source"|', $html, $matches);
// loop over $matches
var_dump($matches);
foreach ($matches[1] as $match) {
// $match == http://example.com/source/to/document/which%20can%20be%20very%20long%20and%20have%20weird%20characters
}
You can try this example at... http://phpfiddle.org/ or run it in a .php file locally. Good luck.
FYI: I added an extra anchor tag to illustrate finding another source.
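The regex above only collects the source links. If the name and size are needed per row as well, one alternative (not what this answer proposes, just a sketch) is to parse the page with DOMDocument and read each <tr> separately. The two sample rows, the $rows array, and the size regex are assumptions for illustration.

```php
<?php
// Two sample rows modelled on the question; URLs and titles are made up.
$html = <<<'HTML'
<table>
<tr>
<td class="vertTh"><center>Other<br>Document</center></td>
<td>
<div class="Name">
Document Title Info
</div>
<a href="http://example.com/source/to/document/one" title="Source"><img src="/static/img/icon-source.png" alt="Source"></a>
<font class="Desc">Uploaded 03-24 14:02, Size 267.35 KB, ULed by <a class="Desc" href="/s/user/username/" title="Browse username">username</a></font>
</td>
<td align="right">67</td>
<td align="right">9</td>
</tr>
<tr>
<td class="vertTh"><center>Other<br>Document</center></td>
<td>
<div class="Name">Second Document</div>
<a href="http://another-example.com/source/to/document/two" title="Source"><img src="/static/img/icon-source.png" alt="Source"></a>
<font class="Desc">Uploaded 03-25 09:10, Size 1.2 MB, ULed by <a class="Desc" href="/s/user/other/" title="Browse other">other</a></font>
</td>
<td align="right">3</td>
<td align="right">1</td>
</tr>
</table>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$rows = [];
// Visit every row that contains a "Source" link and read its fields relative to the row.
foreach ($xpath->query('//tr[.//a[@title="Source"]]') as $tr) {
    $desc = $xpath->evaluate('string(.//font[@class="Desc"])', $tr);
    $rows[] = [
        'name'   => trim($xpath->evaluate('string(.//div[@class="Name"])', $tr)),
        'source' => $xpath->evaluate('string(.//a[@title="Source"]/@href)', $tr),
        // The size is embedded in free text, so a small regex picks it out.
        'size'   => preg_match('/Size\s+([\d.]+\s*\w+)/', $desc, $m) ? $m[1] : null,
    ];
}
print_r($rows);
```

Querying relative to each <tr> keeps the three values of one row together, which is harder to guarantee with three independent preg_match_all calls.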

Getting div value (content/text) using XPath

I have the following HTML structure:
<li id="REQUIRED_ITEM_1" class="listing-post">
<a class="listing-thumb" href="blah" title="blah" data-palette-listing-image="">
<img src="REQUIRED_ITEM_2" width="75" height="75" alt="blah"> </a>
<div class="listing-detail ">
<div class="listing-title">
<div class="listing-icon hidden"></div>
blah
<div class="listing-maker">
<span class="name wrap">blah</span>
</div>
</div>
<div class="listing-date">
REQUIRED_ITEM_6
</div>
<div class="listing-price">
Sold
</div>
</div>
</li>
There are a few dozen of these <li> elements on the same page, all with different ids and content. The content that I need is marked REQUIRED_ITEM_1 to REQUIRED_ITEM_6.
I am collecting the data from these <li> elements with the help of XPath.
Here is the code I use:
foreach($xpath->query("//li[@class='listing-post']") as $link) {
$REQUIRED_ITEM_1 = $link->getAttribute('id');
$REQUIRED_ITEM_2 = $xpath->query(".//img", $link)->item(0)->getAttribute('src');
$REQUIRED_ITEM_3 = $xpath->query(".//a", $link)->item(1)->getAttribute('href');
$REQUIRED_ITEM_4 = $xpath->query(".//a", $link)->item(1)->getAttribute('title');
$REQUIRED_ITEM_5 = $xpath->query(".//a", $link)->item(2)->getAttribute('href');
$REQUIRED_ITEM_6 = $xpath->query("./div/text", $link)->item(4);
}
It works as intended for the first 5 REQUIRED_ITEMs, however it seems the code to get text contained within listing-date div (REQUIRED_ITEM_6) is wrong.
Also, is this the best way to parse my html and collect data, or is there a better approach?

Here is the XPath to get REQUIRED_ITEM_6:
//li[@class='listing-post']//div[@class='listing-date']/text()
The following variant should be a little bit faster (but the first version may be safer, since it depends less on the document structure):
//li[@class='listing-post']/div/div[@class='listing-date']/text()
So your code should look something like this (you may need to adjust it a little for your PHP; I'm not sure why you used item(4)):
$REQUIRED_ITEM_6 = $xpath->query(".//div[@class='listing-date']/text()", $link)->item(0)->textContent;
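Put together in a loop, that could look roughly like the sketch below. The ids, image paths, and dates in the sample markup are invented stand-ins for the REQUIRED_ITEM placeholders, and the $items array is an assumption for illustration. The text node includes the surrounding line breaks, hence the trim().

```php
<?php
// Sample listing modelled on the question; all values are hypothetical.
$html = <<<'HTML'
<ul>
<li id="item-1" class="listing-post">
<a class="listing-thumb" href="/listing/1" title="First"><img src="/img/1.jpg" width="75" height="75" alt="First"></a>
<div class="listing-detail ">
<div class="listing-title">First</div>
<div class="listing-date">
Jan 5, 2014
</div>
<div class="listing-price">Sold</div>
</div>
</li>
<li id="item-2" class="listing-post">
<a class="listing-thumb" href="/listing/2" title="Second"><img src="/img/2.jpg" width="75" height="75" alt="Second"></a>
<div class="listing-detail ">
<div class="listing-title">Second</div>
<div class="listing-date">
Feb 9, 2014
</div>
<div class="listing-price">Sold</div>
</div>
</li>
</ul>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$items = [];
foreach ($xpath->query("//li[@class='listing-post']") as $li) {
    // Relative query: the leading dot restricts the search to the current <li>.
    $dateNode = $xpath->query(".//div[@class='listing-date']/text()", $li)->item(0);
    $items[] = [
        'id'    => $li->getAttribute('id'),
        'image' => $xpath->evaluate('string(.//img/@src)', $li),
        // Guard against a missing date div instead of calling a method on null.
        'date'  => $dateNode ? trim($dateNode->textContent) : null,
    ];
}
print_r($items);
```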

regular expression to remove a div

I have a file like:
<div clas='dsfdsf'> this is first div </div>
<div clas='dsfdsf'> this is second div </div>
<div class="remove">
<table>
<thead>
<tr>
<th colspan="2">Mehr zum Thema</th>
</tr>
</thead>
<tbody>
<tr> this is tr</tr>
<tr> this row no 2 </tr>
</tbody>
</table>
</div>
<div clas='sasas'> this is last div </div>
I read the file content into a variable like this:
$Cont = file_get_contents('myfile');
Now I want to replace div with class name 'remove' by preg_replace. I have tried this:
$patterns = "%<div class='remove'>(.+?)</div>%";
$strPageSource = preg_replace($patterns, '', $Cont);
It did not work. What should be the correct regular expression for this replace?

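Try this code. The s modifier lets . match across newlines; note that the pattern stops at the first </div>, so it only works while the div contains no nested <div>.
$strPageSource = preg_replace('%<div class=["\']remove["\']>.*?</div>%is', '<div class="newClass">Newthings</div>', $Cont);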

As has been stated in the comments, you should not be using regex to parse HTML, because there's no sane way to extract that <div> if there are other nested <div>s inside it. For example:
<div clas='dsfdsf'> this is second div </div>
<div class="remove">
some text <div>nested div</div> more text and some elements<br />
</div>
What you want to do is find the location of your <div class="remove"> and then advance through the HTML (parse it) in the following manner:
1) set $nesting_counter = 0
2) proceed through HTML until you encounter either <div> or </div>
   a) if found <div>
      $nesting_counter++ and go to point 2)
   b) if found </div>
      if $nesting_counter > 0
         $nesting_counter-- and go to point 2)
      else
         you've found the closing tag for your `<div class="remove">`. Remember the current position and just remove that substring.
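The steps above can be sketched in PHP as follows. The function name removeDivByClass and its parameters are hypothetical, and the sketch assumes the class attribute holds exactly one class name and that no <div> tags appear inside comments or scripts.

```php
<?php
// Hypothetical helper: removes the first <div class="..."> block with the given
// class, counting nested <div> tags to find the matching closing tag.
function removeDivByClass(string $html, string $class): string
{
    $open = '%<div\b[^>]*\bclass=["\']' . preg_quote($class, '%') . '["\'][^>]*>%i';
    if (!preg_match($open, $html, $m, PREG_OFFSET_CAPTURE)) {
        return $html; // no such div: nothing to remove
    }

    $start = $m[0][1];                      // where the block begins
    $pos = $start + strlen($m[0][0]);       // continue right after the opening tag
    $nesting = 0;                           // step 1

    // Step 2: jump from one <div ...> or </div> tag to the next.
    while (preg_match('%<(/?)div\b[^>]*>%i', $html, $tag, PREG_OFFSET_CAPTURE, $pos)) {
        $pos = $tag[0][1] + strlen($tag[0][0]);
        if ($tag[1][0] === '') {
            $nesting++;                     // a) nested opening tag
        } elseif ($nesting > 0) {
            $nesting--;                     // b) closes a nested div
        } else {
            // b) matching closing tag found: cut out the whole block
            return substr($html, 0, $start) . substr($html, $pos);
        }
    }

    return $html; // unbalanced markup: leave the input untouched
}

$html = "<div>first</div>\n"
      . "<div class=\"remove\">some text <div>nested div</div> more text<br /></div>\n"
      . "<div>last</div>";
$out = removeDivByClass($html, 'remove');
echo $out, PHP_EOL;
```

A DOM parser would handle more edge cases, but this shows that the counter is all that is needed to get past nested divs.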

Find and separate the HTML blocks to an array

First of all, I want to describe the idea. Almost any CMS or simple website has repeating blocks, for example the list of articles on the main page of a WordPress site, where each article is shown as a block of information: title, author, content, date, etc.
So the main idea is to find and separate such blocks of HTML and append each of them to an array.
I think I first need to strip them of classes, ids, and styles.
step1:
<div id="box1">
<h3 class="title_style">Title1</h3>
<p>content for box1</p>
<div class="author">Author Name1<span class="style_date">date1<span>any text</div>
</div>
<div id="box2">
<h3 class="title_style">Title2</h3>
<p>content for box2</p>
<div class="author">Author Name2<span class="style_date">date2<span>any text2</div>
</div>
to
<div>
<h3>Title1</h3>
<p>content for box1</p>
<div>Author Name1<span>date1<span>any text</div>
</div>
<div>
<h3>Title2</h3>
<p>content for box2</p>
<div>Author Name2<span>date2<span>any text2</div>
</div>
Step2:
I need to find each block and write it to an array so that I can put each block into a table row like this (note that these blocks are present on almost any site, so it doesn't matter which tags they use; they just repeat with different content and attributes, and only the structure is the same):
<table>
<tr id="block1">
<td>Title1</td>
<td>content for box1</td>
<td>Author Name1</td>
<td>date1</td>
<td>any text</td>
</tr>
<tr id="block2">
<td>Title2</td>
<td>content for box2</td>
<td>Author Name2</td>
<td>date2</td>
<td>any text</td>
</tr>
</table>
Any ideas? I need the logic for how to do this, not the code itself.

You can walk the DOM of the document using PHP's DOMDocument class.
So you can do something like this:
$str = <<<STR
<div id="box1">
<h3 class="title_style">Title1</h3>
<p>content for box1</p>
<div class="author">Author Name1<span class="style_date">date1</span>any text</div>
</div>
<div id="box2">
<h3 class="title_style">Title2</h3>
<p>content for box2</p>
<div class="author">Author Name2<span class="style_date">date2</span>any text2</div>
</div>
STR;
$dom = new DOMDocument();
$dom->loadHTML($str);
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
//read child elements
}
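One way to fill in the "read child elements" step, as a rough sketch: treat every <div> that is not nested in another <div> as a block, and turn each non-empty text node inside it into one cell. Both rules are assumptions that happen to fit this sample and would need adjusting for other sites; the $blocks array is likewise illustrative.

```php
<?php
$str = <<<'STR'
<div id="box1">
<h3 class="title_style">Title1</h3>
<p>content for box1</p>
<div class="author">Author Name1<span class="style_date">date1</span>any text</div>
</div>
<div id="box2">
<h3 class="title_style">Title2</h3>
<p>content for box2</p>
<div class="author">Author Name2<span class="style_date">date2</span>any text2</div>
</div>
STR;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($str);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$blocks = [];
// Assumption: a block is any <div> that has no <div> ancestor.
foreach ($xpath->query('//div[not(ancestor::div)]') as $div) {
    $cells = [];
    // Every text node that is not just whitespace becomes one cell, in document order.
    foreach ($xpath->query('.//text()[normalize-space()]', $div) as $text) {
        $cells[] = trim($text->nodeValue);
    }
    $blocks[] = $cells;
}
print_r($blocks);
```

Because only text nodes are read, classes, ids, and styles never need to be stripped first; each inner array maps directly onto the <td> cells of one table row.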

Try this library Simple HTML Dom Parser.
