Extracting HTML fields through XPath

I have this query, which extracts the posts that have been "liked" more than 5 times.
//div[@class="pin"]
[.//span[@class = "LikesCount"]
[substring-before(normalize-space(text())," ") > 5]]
I'd like to extract and store additional information such as the title, image URL, like count, repin count, ...
How can I extract them all?
Multiple XPath queries?
Digging into the nodes of the resulting posts while iterating with PHP and PHP functions?
...
A markup example follows:
<div class="pin">
<p class="description">gorgeous couch #modern</p>
[...]
<div class="PinHolder">
<a href="/pin/56787645270909880/" class="PinImage ImgLink">
<img src="http://media-cache-ec3.pinterest.com/upload/56787645270909880_d7AaHYHA_b.jpg"
alt="Krizia"
data-componenttype="MODAL_PIN"
class="PinImageImg"
style="height: 288px;">
</a>
</div>
<p class="stats colorless">
<span class="LikesCount">
22 likes
</span>
<span class="RepinsCount">
6 repins
</span>
</p>
[...]
</div>

As you are already using XPath in your code, I would suggest extracting that information with XPath too. Here is an example of how to extract the description.
<?php
// will store the posts as assoc arrays
$mostLikedPostsArr = array();
// call your fictional load function
$doc = load_html('whatever');
// create an XPath selector
$selector = new DOMXPath($doc);
// this is your query from above (note the closing bracket of the outer predicate)
$query = '//div[@class="pin"][.//span[@class = "LikesCount"][substring-before(normalize-space(text())," ") > 5]]';
// getting the most liked posts
$mostLikedPosts = $selector->query($query);
// now iterate through the post nodes
foreach ($mostLikedPosts as $post) {
    // assoc array for a post
    $postArr = array();
    // you can do 'relative' queries once you have a reference to $post
    // note $post as the second parameter to $selector->query()
    // let's extract the description, for example
    $result = $selector->query('p[@class = "description"]', $post);
    // just using nodeValue might be ok for text-only nodes.
    // properly flattening the <a> tags inside the descriptions
    // will take further attention.
    $postArr['description'] = $result->item(0)->nodeValue;
    // ...
    $mostLikedPostsArr[] = $postArr;
}
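The same relative-query idea can be extended to the other fields. The following is only a sketch under stated assumptions: the markup is a condensed copy of the example from the question, the second pin is invented to show the filter at work, and the array keys (description, image, likes, repins) are illustrative names, not part of the original code. It uses DOMXPath::evaluate() so that each field comes back as a string instead of a node list.

```php
<?php
// Condensed sample markup; the second pin is invented for contrast.
$html = <<<'HTML'
<div class="pin">
  <p class="description">gorgeous couch #modern</p>
  <div class="PinHolder">
    <a href="/pin/56787645270909880/" class="PinImage ImgLink">
      <img src="http://media-cache-ec3.pinterest.com/upload/56787645270909880_d7AaHYHA_b.jpg" alt="Krizia" class="PinImageImg">
    </a>
  </div>
  <p class="stats colorless">
    <span class="LikesCount"> 22 likes </span>
    <span class="RepinsCount"> 6 repins </span>
  </p>
</div>
<div class="pin">
  <p class="description">plain chair</p>
  <p class="stats colorless">
    <span class="LikesCount"> 3 likes </span>
  </p>
</div>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Pins whose like count is greater than 5.
$query = '//div[@class="pin"][.//span[@class="LikesCount"]'
       . '[substring-before(normalize-space(text()), " ") > 5]]';

$posts = array();
foreach ($xpath->query($query) as $pin) {
    // Every expression below is evaluated relative to the current pin.
    $posts[] = array(
        'description' => $xpath->evaluate('normalize-space(p[@class="description"])', $pin),
        'image'       => $xpath->evaluate('string(.//img[@class="PinImageImg"]/@src)', $pin),
        'likes'       => (int) $xpath->evaluate(
            'substring-before(normalize-space(.//span[@class="LikesCount"]), " ")', $pin),
        'repins'      => (int) $xpath->evaluate(
            'substring-before(normalize-space(.//span[@class="RepinsCount"]), " ")', $pin),
    );
}

print_r($posts);
```

Only the first pin passes the filter here, because the second one has 3 likes.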

Related

How to save xpath query data to saveHTML with HTML tags?

I'm trying to understand how I can save the HTML string found by the query so that I can access its elements.
I'm using the following query to find the below ul list.
$data = $xpath->query('//h2[contains(.,"Hurricane Data")]/following-sibling::ul/li');
<h2>Hurricane Data</h2>
<ul>
<li><strong>12 items</strong> found, see <a href="...">here</a> for more information</li>
<li><strong>19 items</strong> found, see <a href="...">here</a> for more information</li>
<li><strong>13 items</strong> found, see <a href="...">here</a> for more information</li>
</ul>
If I print_r($data), I get DOMNodeList Object ( [length] => 3 ), which refers to the 3 elements found.
If I foreach() over $data, I get a DOMElement object for each of the 3 li elements.
What I'm trying to accomplish is to put each li's content into an accessible array, while keeping the HTML strong and a tags inside it.
I've already done everything I want to do, except that the strong and a tags aren't being inserted into the array. Here is what I've come up with.
$string = [];
$query = $xpath->query('//h2[contains(.,"Hurricane Data")]/following-sibling::ul/li');
foreach ($query as $values) {
    $try = new \DOMDocument;
    $try->loadHTML(mb_convert_encoding($values->textContent, 'HTML-ENTITIES', 'UTF-8'));
    $string[] = $try->saveHTML();
}
echo $string[0];
// outputs = 12 items found, see here for more information
// no strong tags, no hyperlinks

You don't need to reprocess the data; you can just ask the document to save that particular node:
foreach ($query as $values) {
    $string[] = $doc->saveHTML($values);
}
Here $doc is the document used as the basis for your XPath query.
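For reference, a self-contained sketch of this approach might look as follows. The sample list is shortened, and the /more link target is a placeholder assumed for illustration, since the real href is not shown in the question.

```php
<?php
// Shortened sample; the "/more" href is a placeholder, not from the question.
$html = '<h2>Hurricane Data</h2>
<ul>
<li><strong>12 items</strong> found, see <a href="/more">here</a> for more information</li>
<li><strong>19 items</strong> found, see <a href="/more">here</a> for more information</li>
</ul>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$items = array();
$nodes = $xpath->query('//h2[contains(., "Hurricane Data")]/following-sibling::ul/li');
foreach ($nodes as $li) {
    // Passing a node to saveHTML() serializes that node with its markup intact.
    $items[] = $doc->saveHTML($li);
}

echo $items[0], "\n";
```

Each array entry keeps the strong and a tags, because the node is serialized instead of being reduced to its textContent.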

PHP Xpath: Get node value by class name

I'm using xpath to pull data out of a piece of HTML code and I've been able to pull out most data except for one piece.
The HTML is structured like below, but there might only be one li or two or all three so I need to be able to target it by classname.
<li>
Product URL
</li>
<li>
<ul>
<li class="itemone">1</li>
<li class="itemtwo">2</li>
<li class="itemthree">3</li>
</ul>
</li>
This markup is already retrieved using an XPath query, and further data is then pulled out of the results of that query with the PHP snippet below.
$rawData = $xpath->query('//div[@id=\'products\']/ul/li[contains(@class, \'product\')]');
foreach ($rawData as $data) {
    $productRaw = $data->getElementsByTagName('li');
    $productTitle = $productRaw[0]->getElementsByTagName('a')[0]->nodeValue;
    $productRefCode = $productRaw[0]->getElementsByTagName('span')[0]->nodeValue;
    $productPrice = $productRaw[1]->getElementsByTagName('li');
}
The problem is $productPrice: the line above pulls out the node list below.
DOMNodeList Object
(
    [length] => 3
)
I'm looking for anything in the above node list that has a class name of itemtwo. I've tried using $xpath->query on $productRaw[1] and also tried getElementsByClassName, with no luck. These are the two snippets I tried:
$productPrice = $productRaw[1]->getElementsByTagName('li')->getElementsByClassName('itemtwo');
...
$productPrice = $productRaw[1]->query('//li[contains(@class, \'itemtwo\')]');
The snippets fail with Fatal error: Call to undefined method DOMNodeList::getElementsByClassName() and Fatal error: Call to undefined method DOMNodeList::query(), respectively.

Use DOMXPath::query, passing the XPath string as the first parameter and a DOMNode as the second, to execute the XPath relative to that DOMNode as context. For example:
foreach ($rawData as $data) {
    $productRaw = $data->getElementsByTagName('li');
    // ...
    $productPrice = $xpath->query('.//li[contains(@class, "itemtwo")]', $productRaw->item(1));
}
Also use . at the beginning of your XPath expression to state explicitly that the expression is relative to the current context node.
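To illustrate why the leading . matters, here is a small sketch. The product markup is a hypothetical reconstruction (the question does not show the full structure), with two products that each contain an itemtwo entry.

```php
<?php
// Hypothetical markup: two products, each with a nested price list.
$html = '<div id="products"><ul>
<li class="product"><ul>
  <li><a href="#">Widget</a></li>
  <li><ul>
    <li class="itemone">1</li>
    <li class="itemtwo">2</li>
    <li class="itemthree">3</li>
  </ul></li>
</ul></li>
<li class="product"><ul>
  <li><a href="#">Gadget</a></li>
  <li><ul>
    <li class="itemtwo">20</li>
  </ul></li>
</ul></li>
</ul></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$products = $xpath->query('//div[@id="products"]/ul/li[contains(@class, "product")]');
$first = $products->item(0);

// ".//" searches only below the context node passed as second argument.
$relative = $xpath->query('.//li[contains(@class, "itemtwo")]', $first);

// "//" always starts at the document root, even when a context node is given.
$absolute = $xpath->query('//li[contains(@class, "itemtwo")]', $first);

echo $relative->length, "\n";          // matches inside the first product only
echo $absolute->length, "\n";          // matches in the whole document
echo $relative->item(0)->nodeValue, "\n";
```

With this markup the relative query finds one node and the absolute query finds two, which is the difference the answer above points out.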

Something like this?
$str = '<li>
Product URL</li>
<li>
<ul>
<li class="itemone">1</li>
<li class="itemtwo">2</li>
<li class="itemthree">3</li>
</ul>
</li>';
$doc = new DOMDocument;
$doc->loadHTML($str);
$xpath = new DOMXPath($doc);
$productPrices = $xpath->query("//li[@class='itemtwo']");
foreach ($productPrices as $productPrice) {
    print $productPrice->nodeValue."\n";
}

har07's answer was on the right track, but it only returned the node list with length set to 3, like I was already getting with my existing code.
Original code:
$productPrice = $productRaw[1]->getElementsByTagName('li');
har07's suggestion:
$productPrice = $xpath->query('.//li[contains(@class, "itemtwo")]', $productRaw->item(1));
Solution, which returns the node value where an element's class name is equal to itemtwo:
$productPrice = $xpath->query('.//li[contains(@class, \'itemtwo\')]', $productRaw[1])->item(1)->nodeValue;

How to get ID using a specific word in regex?

My string:
<div class="sect1" id="s9781473910270.i101">
<div class="sect2" id="s9781473910270.i102">
<h1 class="title">1.2 Summations and Products[label*summation]</h1>
<p>text</p>
</div>
</div>
<div class="sect1" id="s9781473910270.i103">
<p>sometext [ref*summation]</p>
</div>
<div class="figure" id="s9781473910270.i220">
<div class="metadata" id="s9781473910270.i221">
</div>
<p>fig1.2 [label*somefigure]</p>
<p>sometext [ref*somefigure]</p>
</div>
Objective: In the string above, label*string and ref*string are cross-references. Each [ref*string] needs to be replaced with an a element that has class and href attributes: href is the id of the div where the related label* resides, and the class of the a element is derived from the class of that div.
As mentioned above, the a element's class and href come from the class name and id of the enclosing div. However, a div with class="metadata" must be ignored: its class name and id should not be used.
Expected output:
<div class="sect1" id="s9781473910270.i101">
<div class="sect2" id="s9781473910270.i102">
<h1 class="title">1.2 Summations and Products[label*summation]</h1>
<p>text</p>
</div>
</div>
<div class="sect1" id="s9781473910270.i103">
<p>sometext <a class="section-ref" href="s9781473910270.i102">1.2</a></p>
</div>
<div class="figure" id="s9781473910270.i220">
<div class="metadata" id="s9781473910270.i221">
<p>fig1.2 [label*somefigure]</p>
</div>
<p>sometext <a class="fig-ref" href="s9781473910270.i220">fig 1.2</a></p>
</div>
How can I do this in a simpler way, without using a DOM parser?
My idea is to store each label* string and its id in an array, then loop over the ref* strings and match them against the label* strings; when a string matches, the related id and class are substituted in place of the ref* string.
So I have tried this regex to get label*string and their related id and class name.

This approach uses the HTML structure to retrieve the needed elements with DOMXPath. Regexes are then used in a second step to extract information from text nodes or attributes:
$classRel = ['sect2' => 'section-ref',
             'figure' => 'fig-ref'];
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html); // or $dom->loadHTMLFile($url);
$xp = new DOMXPath($dom);
// make a custom php function available for the XPath query
// (it isn't really necessary, but it is more rigorous than writing
// "contains(@class, 'myClass')" )
$xp->registerNamespace("php", "http://php.net/xpath");
function hasClass($classNode, $className) {
    if (!empty($classNode))
        return in_array($className, preg_split('~\s+~', $classNode[0]->value, -1, PREG_SPLIT_NO_EMPTY));
    return false;
}
$xp->registerPHPFunctions('hasClass');
// The XPath query will find the first ancestor of a text node with '[label*'
// that is a div tag with an id and a class attribute,
// if the class attribute doesn't contain the "metadata" class.
$labelQuery = <<<'EOD'
//text()[contains(., 'label*')]
/ancestor::div
[@id and @class and not(php:function('hasClass', @class, 'metadata'))][1]
EOD;
$idNodeList = $xp->query($labelQuery);
$links = [];
// For each div node, a new link node is created in the associative array $links.
// The keys are labels.
foreach ($idNodeList as $divNode) {
    // The pattern extracts the first text part in group 1 and the label in group 2
    if (preg_match('~(\S+) .*? \[label\* ([^]]+) ]~x', $divNode->textContent, $m)) {
        $links[$m[2]] = $dom->createElement('a');
        $links[$m[2]]->setAttribute('href', $divNode->getAttribute('id'));
        $links[$m[2]]->setAttribute('class', $classRel[$divNode->getAttribute('class')]);
        $links[$m[2]]->nodeValue = $m[1];
    }
}
if ($links) { // if $links is empty no need to do anything
    $refNodeList = $xp->query("//text()[contains(., '[ref*')]");
    foreach ($refNodeList as $refNode) {
        // split the text on the square-bracket parts; the reference name is preserved in a capture
        $parts = preg_split('~\[ref\*([^]]+)]~', $refNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        // create a fragment to receive text parts and links
        $frag = $dom->createDocumentFragment();
        foreach ($parts as $k => $part) {
            if ($k%2 && isset($links[$part])) { // delimiters are always odd items
                $clone = $links[$part]->cloneNode(true);
                $frag->appendChild($clone);
            } elseif ($part !== '') {
                $frag->appendChild($dom->createTextNode($part));
            }
        }
        $refNode->parentNode->replaceChild($frag, $refNode);
    }
}
$result = '';
$childNodes = $dom->getElementsByTagName('body')->item(0)->childNodes;
foreach ($childNodes as $childNode) {
    $result .= $dom->saveXML($childNode);
}
echo $result;

This is not a task for regular expressions. Regular expressions are (usually) for regular languages, and what you want to do is work on a context-sensitive language (referencing an identifier that has been declared before).
So you should definitely go with a DOM parser. The algorithm for this would be very easy, because you can operate on one node and its children.
So the theoretical answer to your question is: you can't. Though it might work out with the many regex extensions in some crappy way.

How can I parse HTML in batches using xpath [PHP]?

I tried all sorts of things but couldn't find a solution.
I want to retrieve elements from html code using xpath in php.
Ex:
<div class='student'>
<div class='name'>Michael</div>
<div class='age'>26</div>
</div>
<div class='student'>
<div class='name'>Joseph</div>
<div class='age'>27</div>
</div>
I want to retrieve the information and put it in an array as follows:
$student[0]['name'] = 'Michael';
$student[0]['age'] = 26;
$student[1]['name'] = 'Joseph';
$student[1]['age'] = 27;
In other words, I want the matching ages to stay with the names.
I tried the following:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpathDom = new DOMXPath($dom);
$homepostcontentNodes = $xpathDom->query("//*[contains(@class, 'student')]//*[contains(@class, 'name')]");
However, this only gives me the 'name' nodes.
How can I get the matching age nodes?

Of course it is only grabbing the name nodes - you are telling it to!
What you need to do takes two steps:
Pick out all the student nodes
For each student node, pick out the columns
This is a pretty standard step in linearization of data, and the XPath queries are simple:
Step 1
You pretty much have it:
$studentNodes = $xpathDom->query("//div[contains(@class, 'student')]");
This will return all your student nodes.
Step 2
This is where the magic happens. We have our nodes, and we can loop through them (DOMNodeList is Traversable, so we can foreach-loop over it). What we need to figure out is how to find each node's children...
...Oh wait. DOMNode implements a method called getNodePath, which returns the full, direct XPath path to the node. This allows us to simply append /div to get all the direct child div elements of the node!
Another quick foreach, and we get this code:
$studentNodes = $xpathDom->query("//div[contains(@class, 'student')]");
$result = array();
foreach ($studentNodes as $v) {
    // Child nodes: student
    $r = array();
    $columns = $xpathDom->query($v->getNodePath()."/div");
    foreach ($columns as $v2) {
        // Use the 'class' attribute of the child node as the array key
        $r[$v2->getAttribute("class")] = $v2->textContent;
    }
    $result[] = $r;
}
var_dump($result);
Full fiddle: http://codepad.viper-7.com/t868Wh
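As an alternative to building paths with getNodePath, the student node can be passed as the context node of a relative query. This is only a sketch of that variation, using the sample markup from the question; the array keys name and age are chosen to match the desired output.

```php
<?php
$html = "<div class='student'>
  <div class='name'>Michael</div>
  <div class='age'>26</div>
</div>
<div class='student'>
  <div class='name'>Joseph</div>
  <div class='age'>27</div>
</div>";

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$students = array();
foreach ($xpath->query("//div[contains(@class, 'student')]") as $node) {
    // The second argument makes each expression relative to this student.
    $students[] = array(
        'name' => $xpath->evaluate("string(div[contains(@class, 'name')])", $node),
        'age'  => (int) $xpath->evaluate("string(div[contains(@class, 'age')])", $node),
    );
}

print_r($students);
```

Because both values are read from the same context node, each age stays paired with its name.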

XPath scraping two nodes values from HTML only if both exist

I am using Curl, XPath and PHP in order to scrape product names and prices from HTML source code. Here is a sample similar to the source code I am examining:
<div class="Gamesdb">
<p class="media-title">
Bluetooth Headset
</p>
<p class="sub-title"> Console </p>
<p class="rating star-50">
(1)
</p>
<p class="mt5">
<span class="price-preffix">
1 New
from
</span>
<a class="wt-link" href="/Games/Console/4-/105/Bluetooth-Headset/">
<span class="price">
<em>£34</em>
.99
</span>
<span class="free-delivery"> FREE delivery</span>
</a>
</p>
<p class="mt10">
<a class="primary button" href="/Games/Console/4-/105/Bluetooth-Headset/">
Product Details
<span style="color: rgb(255, 255, 255); margin-left: 6px; font-size: 16px;">»</span>
</a>
</p>
</div>
I want to extract the media title i.e:
<p class="media-title">
Bluetooth Headset
</p>
Only when the following price class is also present:
<span class="price">
<em>£34</em>
.99
</span>
Many of the other products listed don't include it.
I need to extract both the product name and price or nothing at all and move on to the next product.
Here is a sample of the code I am currently using, which gets all the results regardless of any other conditions:
$results = file_get_contents('SCRAPEDHTML.txt');
$html = new DOMDocument();
@$html->loadHTML($results);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query('//p[@class="media-title"]|//span[@class="price"]');
foreach ($nodelist as $n) {
    $results2[] = $n->nodeValue;
}
I believe this is possible using the correct XPath query, but I have so far been unable to achieve it. Many thanks in advance.

I am assuming there is only one "item" per div.Gamesdb. If not, there may not be enough structure in the source HTML to use XPath alone. You will probably have to index product names and look for prices near matching product names.
You can do this with a single giant XPath, but I recommend you use multiple XPaths. I'll show both ways.
First create your DOMXPath and register a helper to match class names.
// This helper is the equivalent to the XPath:
// contains(concat(' ',normalize-space(@attr),' '), ' $token ')
// It's not necessary, but it's a bit easier to read and more
// bulletproof than @attr="token"
function has_token($attr, $token)
{
    // the attribute node set is empty when the element lacks the attribute
    if (empty($attr)) {
        return false;
    }
    $attr = $attr[0];
    $regex = '/(?:^|\s)'.preg_quote($token,'/').'(?:\s|$)/Su';
    return (bool) preg_match($regex, $attr->value);
}
$xp = new DOMXPath($d); // $d is the DOMDocument holding the parsed HTML
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPHPFunctions("has_token");
You can then use a giant XPath:
$xp_container = '/html/body//div[php:function("has_token", @class, "Gamesdb")]';
$xp_title = 'p[php:function("has_token", @class, "media-title")]';
// the leading "." keeps the search inside the current container
$xp_price = './/span[php:function("has_token", @class, "price")]';
$xp_titles_prices = "{$xp_container}[{$xp_title}][{$xp_price}]/{$xp_title} | {$xp_container}[{$xp_title}][{$xp_price}]/{$xp_price}";
$nodes = $xp->query($xp_titles_prices);
$items = array();
$i = 0; // enumerator
foreach ($nodes as $node) {
    $key = ($node->nodeName==='p') ? 'title' : 'price';
    $value = '';
    switch ($key) {
        case 'price':
            // remove inner whitespace
            $value = preg_replace('/\s+/Su', '', trim($node->textContent));
            break;
        case 'title':
            $value = preg_replace('/\s+/Su', ' ', trim($node->textContent));
            break;
    }
    $items[(int) floor($i/2)][$key] = $value;
    $i += 1;
}
However, the overall code is brittle and unclear. The XPath union operator (|) returns nodes in document order, so we can't bisect the list. The PHP code must walk through every item in the node list and use the DOM to determine which field each node corresponds to. Think about the changes you would have to make if you wanted to extend the code to collect a third item (e.g., rating). Now imagine making those changes three months from now, when this code is no longer fresh in your mind.
I recommend you use multiple XPath calls instead and do the "do we have data for both price and title" check in PHP rather than XPath:
$xpitems = '/html/body//div[php:function("has_token", @class, "Gamesdb")]';
// the expressions below use each $xpitems node as context:
$xptitle = 'normalize-space(p[php:function("has_token", @class, "media-title")])';
$xpprice = 'normalize-space(.//span[php:function("has_token", @class, "price")])';
$nodeitems = $xp->query($xpitems);
$items = array();
foreach ($nodeitems as $nodeitem) {
    $item = array(
        'title' => $xp->evaluate($xptitle, $nodeitem),
        'price' => str_replace(' ', '', $xp->evaluate($xpprice, $nodeitem)),
    );
    // Only add this item if we have data for *all* fields:
    if (count(array_filter($item)) === count($item)) {
        $items[] = $item;
    }
}
This is much easier to read and understand, and much easier to extend in the future.

You cannot have a single XPath that returns both the name of the product and its price and nothing else. My suggestion would be to first get all the div nodes that contain both pieces of information:
//div[p[@class='media-title'] and .//span[@class='price']]
('all div nodes that have a p child node with class media-title and a span descendant node with class price'); then loop over the returned nodes and extract the product name and price using two other XPath expressions, evaluated relative to each div:
p[@class='media-title']
and
.//span[@class='price']
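Put together, the two-step approach could be sketched like this. The markup is a trimmed version of the sample from the question plus one invented product without a price, and class_token() is a hypothetical helper introduced here; it builds the plain XPath 1.0 class test, so no PHP function needs to be registered.

```php
<?php
// Trimmed sample; the second product is invented and has no price.
$html = '<div class="Gamesdb">
  <p class="media-title"> Bluetooth Headset </p>
  <p class="mt5">
    <a class="wt-link" href="#"><span class="price"><em>&pound;34</em> .99 </span></a>
  </p>
</div>
<div class="Gamesdb">
  <p class="media-title"> Unpriced Item </p>
</div>';

// Hypothetical helper: XPath 1.0 test for one whitespace-separated class token.
function class_token($token)
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $token ')";
}

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$title = 'p[' . class_token('media-title') . ']';
$price = './/span[' . class_token('price') . ']';

// Step 1: only containers that have both a title and a price.
$containers = $xpath->query(
    '//div[' . class_token('Gamesdb') . "][$title][$price]"
);

// Step 2: read both fields relative to each container.
$items = array();
foreach ($containers as $div) {
    $items[] = array(
        'title' => $xpath->evaluate("normalize-space($title)", $div),
        'price' => str_replace(' ', '', $xpath->evaluate("normalize-space($price)", $div)),
    );
}

print_r($items);
```

The product without a price never reaches the loop, so every entry in $items has both fields.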
