XPath scraping two nodes values from HTML only if both exist


I am using Curl, XPath and PHP in order to scrape product names and prices from HTML source code. Here is a sample similar to the source code I am examining:
<div class="Gamesdb">
<p class="media-title">
Bluetooth Headset
</p>
<p class="sub-title"> Console </p>
<p class="rating star-50">
(1)
</p>
<p class="mt5">
<span class="price-preffix">
1 New
from
</span>
<a class="wt-link" href="/Games/Console/4-/105/Bluetooth-Headset/">
<span class="price">
<em>£34</em>
.99
</span>
<span class="free-delivery"> FREE delivery</span>
</a>
</p>
<p class="mt10">
<a class="primary button" href="/Games/Console/4-/105/Bluetooth-Headset/">
Product Details
<span style="color: rgb(255, 255, 255); margin-left: 6px; font-size: 16px;">»</span>
</a>
</p>
</div>
I want to extract the media title i.e:
<p class="media-title">
Bluetooth Headset
</p>
Only when the following price class is also present:
<span class="price">
<em>£34</em>
.99
</span>
Many of the other products listed don't include it.
I need to extract both the product name and price or nothing at all and move on to the next product.
Here is a sample of the code I am currently using, which gets all the results regardless of any other conditions:
$results=file_get_contents('SCRAPEDHTML.txt');
$html = new DOMDocument();
@$html->loadHtml($results);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query('//p[@class="media-title"]|//span[@class="price"]');
foreach ($nodelist as $n){
$results2[]=$n->nodeValue;
}
I believe this is possible using the correct xpath query but have so far been unable to achieve it. Many thanks in advance.

I am assuming there is only one "item" per div.Gamesdb. If not, there may not be enough structure in the source html to use xpath alone. You will probably have to index product names and look for prices near matching product names.
You can do this with a single giant XPath, but I recommend you use multiple XPaths. I'll show both ways.
First create your DOMXPath and register a helper to match class names.
// This helper is the equivalent to the XPath:
// contains(concat(' ',normalize-space(@attr),' '), ' $token ')
// It's not necessary, but it's a bit easier to read and more
// bulletproof than @ATTR="TOKEN"
function has_token($attr, $token)
{
$attr = $attr[0];
$regex = '/(?:^|\s)'.preg_quote($token,'/').'(?:\s|$)/Su';
return (bool) preg_match($regex, $attr->value);
}
// $d is the DOMDocument that holds the scraped HTML
$xp = new DOMXPath($d);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPHPFunctions("has_token");
You can then use a giant XPath:
$xp_container = '/html/body//div[php:function("has_token", @class, "Gamesdb")]';
$xp_title = 'p[php:function("has_token", @class, "media-title")]';
// descendant:: keeps the price lookup relative to the container;
// a leading // inside a predicate would search the whole document
$xp_price = 'descendant::span[php:function("has_token", @class, "price")]';
$xp_titles_prices = "{$xp_container}[{$xp_title}][{$xp_price}]/{$xp_title} | {$xp_container}[{$xp_title}][{$xp_price}]/{$xp_price}";
$nodes = $xp->query($xp_titles_prices);
$items = array();
$i = 0; // enumerator
foreach ($nodes as $node) {
$key = ($node->nodeName==='p') ? 'title' : 'price';
$value = '';
switch ($key) {
case 'price':
// remove inner whitespace
$value = preg_replace('/\s+/Su', '', trim($node->textContent));
break;
case 'title':
$value = preg_replace('/\s+/Su', ' ', trim($node->textContent));
break;
}
$items[(int) floor($i/2)][$key] = $value;
$i += 1;
}
However, the overall code is brittle and unclear. The XPath union operator (|) returns nodes in document order, so we can't simply split the list in half. The PHP code must walk through every item in the nodelist and use the DOM to determine which field each node corresponds to. Think about the changes you would have to make if you wanted to extend the code to collect a third field (e.g., the rating). Now imagine making those changes three months from now, when this code is no longer fresh in your mind.
I recommend you use multiple XPath calls instead and do the "do we have data for both price and title" check in PHP rather than XPath:
$xpitems = '/html/body//div[php:function("has_token", @class, "Gamesdb")]';
// below use $xpitems context:
$xptitle = 'normalize-space(p[php:function("has_token", @class, "media-title")])';
// the leading dot keeps the search inside the current item
$xpprice = 'normalize-space(.//span[php:function("has_token", @class, "price")])';
$nodeitems = $xp->query($xpitems);
$items = array();
foreach ($nodeitems as $nodeitem) {
$item = array(
'title' => $xp->evaluate($xptitle, $nodeitem),
'price' => str_replace(' ', '', $xp->evaluate($xpprice, $nodeitem)),
);
// Only add this item if we have data for *all* fields:
if (count(array_filter($item)) === count($item)) {
$items[] = $item;
}
}
This is much easier to read and understand, and much easier to extend in the future.
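The helper's comment mentions a pure XPath 1.0 equivalent of the token match. A minimal sketch of that variant follows, for cases where registering a PHP function is not wanted. The function name xpath_has_token and the sample markup are hypothetical, and the sketch assumes the token contains no single quote.

```php
<?php
// Hypothetical helper: builds an XPath 1.0 predicate body that matches one
// whitespace-separated token inside an attribute such as class.
// Assumes $token contains no single quote.
function xpath_has_token(string $attr, string $token): string
{
    return "contains(concat(' ', normalize-space(@$attr), ' '), ' $token ')";
}

$dom = new DOMDocument();
// Hypothetical markup: only the first div carries the exact token "Gamesdb".
$dom->loadHTML(
    '<div class="box Gamesdb"><p class="media-title">A</p></div>'
    . '<div class="Gamesdb-old"><p class="media-title">B</p></div>'
);
$xp = new DOMXPath($dom);

$query = '//div[' . xpath_has_token('class', 'Gamesdb') . ']';
$matches = $xp->query($query);

echo $matches->length, "\n"; // number of div elements whose class list includes Gamesdb
```

Unlike @class="Gamesdb", this predicate still matches when the element has additional classes, and unlike contains(@class, "Gamesdb") it does not match longer names such as Gamesdb-old.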

You cannot have a single XPath that returns both the name of the product and its price and nothing else. My suggestion would be to first get all the div nodes that contain both pieces of information:
//div[p[@class='media-title'] and .//span[@class='price']]
('all div nodes that have a p child node with class media-title and a span descendant node with class price'); then loop over the returned nodes and extract the product name and price using two other XPath expressions, evaluated relative to each div:
p[@class='media-title']
and
.//span[@class='price']
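The two-step approach described above could be sketched as follows. This is only an illustration under assumptions: the sample markup is a shortened, hypothetical version of the question's HTML (with a second product that has no price), and the variable names are made up.

```php
<?php
// Hypothetical sample: the second product has no price and should be skipped.
$html = <<<'HTML'
<div class="Gamesdb">
  <p class="media-title"> Bluetooth Headset </p>
  <p class="mt5"><a class="wt-link" href="/x/"><span class="price"><em>&pound;34</em>
  .99 </span></a></p>
</div>
<div class="Gamesdb">
  <p class="media-title"> Unpriced Game </p>
</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$products = [];
// Step 1: only the containers that have both a title and a price.
$containers = $xpath->query("//div[p[@class='media-title'] and .//span[@class='price']]");
foreach ($containers as $div) {
    // Step 2: relative queries, evaluated with the container as context node.
    $title = $xpath->evaluate("normalize-space(p[@class='media-title'])", $div);
    $price = $xpath->evaluate("normalize-space(.//span[@class='price'])", $div);
    $products[] = [
        'title' => $title,
        'price' => str_replace(' ', '', $price), // drop the gap between "£34" and ".99"
    ];
}

print_r($products);
```

Passing the container as the second argument to evaluate() is what makes the relative paths resolve inside that one product rather than across the whole document.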

Related

How to strip a HTML element from a text file with PHP?

I am cleaning up a mess created by Adobe InDesign export feature of ePub files.
MY GOAL:
OPTION 1. I want to remove all span elements with class attribute CharOverride-7 but leave the other span elements.
OPTION 2. In some cases I want to replace the span.CharOverride-7 with a new element, such as i.
Note: my current manual and time-consuming approach is a mass search and replace, but the input text file is inconsistent (extra spaces and other artifacts).
The input text contains hundreds of p paragraphs which look like this:
<p class="2"><span class="CharOverride-7">A book title</span><span class="CharOverride-8">https://aaa.net</span><span class="CharOverride-7">.</span></p>
<p class="2"><span class="CharOverride-7">Another book title</span><span class="CharOverride-8">https://aaa.net/</span><span class="CharOverride-7">.</span></p>
The desired output should look like this:
OPTION ONE (removal of the element)
<p class="2">A book title<span class="CharOverride-8">https://aaa.net/</span>.</p>
OPTION TWO (replace span.CharOverride with i element)
<p class="2"><i>A book title</i><span class="CharOverride-8">https://aaa.net</span><i>.</i></p>

For option one, this approach using DOMDocument works: https://www.php.net/manual/de/class.domdocument.php
<?php
$yourHTML = '<p class="2"><span class="CharOverride-7">A book title</span><span class="CharOverride-8">https://aaa.net</span><span class="CharOverride-7">.</span></p>';
$dom = new DOMDocument();
$dom->loadHTML($yourHTML, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED );
// getElementsByTagName() returns a live list, so copy it before replacing nodes
foreach (iterator_to_array($dom->getElementsByTagName('span')) as $span) {
if ($span->getAttribute('class') === 'CharOverride-7') {
$newelement = $dom->createTextNode($span->textContent);
$span->parentNode->replaceChild($newelement, $span);
}
}
$ret = $dom->saveHTML();
// <p class="2">A book title<span class="CharOverride-8">https://aaa.net</span>.</p>
echo $ret;
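Option two (replacing the span with an i element) can be handled with the same DOM technique. The following is a minimal sketch under the same assumptions as the snippet above (one paragraph held in a string); the node list is copied with iterator_to_array() first, because replacing nodes while iterating a live list can skip elements.

```php
<?php
$yourHTML = '<p class="2"><span class="CharOverride-7">A book title</span><span class="CharOverride-8">https://aaa.net</span><span class="CharOverride-7">.</span></p>';

$dom = new DOMDocument();
$dom->loadHTML($yourHTML, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);

// Copy the live node list before modifying the tree.
foreach (iterator_to_array($dom->getElementsByTagName('span')) as $span) {
    if ($span->getAttribute('class') === 'CharOverride-7') {
        $i = $dom->createElement('i');
        // Move the children across so any nested markup is preserved.
        while ($span->firstChild) {
            $i->appendChild($span->firstChild);
        }
        $span->parentNode->replaceChild($i, $span);
    }
}

$out = trim($dom->saveHTML());
echo $out, "\n";
```

Moving child nodes instead of copying textContent keeps any inline elements that happen to sit inside the span.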

Here's a simple approach for you using preg_replace()...
<?php
$data = file_get_contents('[YOUR FILENAME HERE]');
$result1 = preg_replace('/<span class="CharOverride-7">(.*)<\/span>/U', '$1', $data);
//$result2 = preg_replace('/<span class="CharOverride-7">(.*)<\/span>/U', '<i>$1</i>', $data);
echo $result1;
// echo $result2;
// Overwrite your file here... (Beyond scope of this question)
Just use $result1 or $result2 at your leisure.

How to get ID using a specific word in regex?

My string:
<div class="sect1" id="s9781473910270.i101">
<div class="sect2" id="s9781473910270.i102">
<h1 class="title">1.2 Summations and Products[label*summation]</h1>
<p>text</p>
</div>
</div>
<div class="sect1" id="s9781473910270.i103">
<p>sometext [ref*summation]</p>
</div>
<div class="figure" id="s9781473910270.i220">
<div class="metadata" id="s9781473910270.i221">
</div>
<p>fig1.2 [label*somefigure]</p>
<p>sometext [ref*somefigure]</p>
</div>
Objective: In the string above, [label*string] and [ref*string] are cross-references. I need to replace each [ref*string] with an a element that has class and href attributes. The href is the id of the div in which the related label* resides, and the class of the a element is derived from the class of that div.
In other words, the class and href of the a element come from the class name and id of the enclosing div. However, a div with class="metadata" must be ignored: its class name and id should not be used.
Expected output:
<div class="sect1" id="s9781473910270.i101">
<div class="sect2" id="s9781473910270.i102">
<h1 class="title">1.2 Summations and Products[label*summation]</h1>
<p>text</p>
</div>
</div>
<div class="sect1" id="s9781473910270.i103">
<p>sometext <a class="section-ref" href="s9781473910270.i102">1.2</a></p>
</div>
<div class="figure" id="s9781473910270.i220">
<div class="metadata" id="s9781473910270.i221">
<p>fig1.2 [label*somefigure]</p>
</div>
<p>sometext <a class="fig-ref" href="s9781473910270.i220">fig 1.2</a></p>
</div>
How can I do this in a simpler way, without using a DOM parser?
My idea is to store each label* string and its id in an array, then loop over the ref* strings; when a ref* string matches a label* string, the related id and class replace the ref* string.
So I have tried this regex to get label*string and their related id and class name.

This approach uses the HTML structure to retrieve the needed elements with DOMXPath. Regular expressions are then used in a second step to extract information from text nodes or attributes:
$classRel = ['sect2' => 'section-ref',
'figure' => 'fig-ref'];
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html); // or $dom->loadHTMLFile($url);
$xp = new DOMXPath($dom);
// make a custom php function available for the XPath query
// (it isn't really necessary, but it is more rigorous than writing
// "contains(@class, 'myClass')" )
$xp->registerNamespace("php", "http://php.net/xpath");
function hasClass($classNode, $className) {
if (!empty($classNode))
return in_array($className, preg_split('~\s+~', $classNode[0]->value, -1, PREG_SPLIT_NO_EMPTY));
return false;
}
$xp->registerPHPFunctions('hasClass');
// The XPath query will find the first ancestor of a text node with '[label*'
// that is a div tag with an id and a class attribute,
// if the class attribute doesn't contain the "metadata" class.
$labelQuery = <<<'EOD'
//text()[contains(., 'label*')]
/ancestor::div
[@id and @class and not(php:function('hasClass', @class, 'metadata'))][1]
EOD;
$idNodeList = $xp->query($labelQuery);
$links = [];
// For each div node, a new link node is created in the associative array $links.
// The keys are labels.
foreach($idNodeList as $divNode) {
// The pattern extract the first text part in group 1 and the label in group 2
if (preg_match('~(\S+) .*? \[label\* ([^]]+) ]~x', $divNode->textContent, $m)) {
$links[$m[2]] = $dom->createElement('a');
$links[$m[2]]->setAttribute('href', $divNode->getAttribute('id'));
$links[$m[2]]->setAttribute('class', $classRel[$divNode->getAttribute('class')]);
$links[$m[2]]->nodeValue = $m[1];
}
}
if ($links) { // if $links is empty no need to do anything
$refNodeList = $xp->query("//text()[contains(., '[ref*')]");
foreach ($refNodeList as $refNode) {
// split the text with square brackets parts, the reference name is preserved in a capture
$parts = preg_split('~\[ref\*([^]]+)]~', $refNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
// create a fragment to receive text parts and links
$frag = $dom->createDocumentFragment();
foreach ($parts as $k=>$part) {
if ($k%2 && isset($links[$part])) { // delimiters are always odd items
$clone = $links[$part]->cloneNode(true);
$frag->appendChild($clone);
} elseif ($part !== '') {
$frag->appendChild($dom->createTextNode($part));
}
}
$refNode->parentNode->replaceChild($frag, $refNode);
}
}
$result = '';
$childNodes = $dom->getElementsByTagName('body')->item(0)->childNodes;
foreach ($childNodes as $childNode) {
$result .= $dom->saveXML($childNode);
}
echo $result;

This is not a task for regular expressions. Regular expressions are (usually) for regular languages, and what you want to do is work on a context-sensitive language (referencing an identifier that has been declared before).
So you should definitely go with a DOM parser. The algorithm would be very easy, because you can operate on one node and its children.
So the theoretical answer to your question is: you can't. It might work with the many regex extensions, but only in some crappy way.

Traverse DOM find id backwards

I can't find out how to solve this
<div>
<p id="p1"> Price is <span>$ 25</span></p>
<p id='p2'> But this price is $ <span id="s1">50,23</span> </p>
<p id='p3'> This one : $ 14540.12 dollar</p>
</div>
What I'm trying to do is find each element with a price in it, and the shortest path to it.
This is what I have so far.
$elements = $dom->getElementsByTagName('*');
foreach($elements as $child)
{
if (preg_match("/.$regex./",$child->nodeValue)){
echo $child->getNodePath(). "<br />";
}
}
This results in
/html
/html/body
/html/body/div
/html/body/div/p[1]
/html/body/div/p[1]/span
/html/body/div/p[2]
/html/body/div/p[2]/span
/html/body/div/p[3]
These are the paths to the elements I want, so that's OK in this test HTML. But in real webpages these paths get very long and are error-prone.
What I'd like to do is find the closest element with an ID attribute and refer to that.
So once I have found an element that matches the $regex, I need to travel up the DOM, find the first element with an ID attribute, and create the new, shorter path from that.
In the HTML example above, there are 3 prices matching the $regex. The prices are in:
//p[@id="p1"]/span
//span[@id="s1"]
//p[@id="p3"]
So that is what I'd like to have returned from my function. This means I also need to get rid of all the other paths, because those elements don't themselves contain a match for $regex.
Any help on this?

You could use XPath to follow the ancestor path to the first node containing an @id attribute and then cut its path off. I did not clean up the code, but something like this:
// snip
$xpath = new DomXPath($doc);
foreach($elements as $child)
{
$textValue = '';
foreach ($xpath->query('text()', $child) as $text)
$textValue .= $text->nodeValue;
if (preg_match("/.$regex./", $textValue)) {
$path = $child->getNodePath();
$id = $xpath->query('ancestor-or-self::*[@id][1]', $child)->item(0);
$idpath = '';
if ($id) {
$idpath = $id->getNodePath();
$path = '//'.$id->nodeName.'[@id="'.$id->attributes->getNamedItem('id')->value.'"]'.substr($path, strlen($idpath));
}
echo $path."\n";
}
}
Printing something like
/html
/html/body
/html/body/div
//p[@id="p1"]
//p[@id="p1"]/span
//p[@id="p2"]
//span[@id="s1"]
//p[@id="p3"]
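For reference, here is a self-contained sketch of the same idea run against the question's sample HTML. The $regex value is a hypothetical stand-in (the question does not show the real pattern), and the loop only tests each element's own text nodes, so containers such as html, body and div drop out of the result.

```php
<?php
$html = <<<'HTML'
<div>
<p id="p1"> Price is <span>$ 25</span></p>
<p id="p2"> But this price is $ <span id="s1">50,23</span> </p>
<p id="p3"> This one : $ 14540.12 dollar</p>
</div>
HTML;

// Hypothetical pattern: any number, optionally with a decimal part.
$regex = '/\d+(?:[.,]\d+)?/';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$paths = [];
foreach ($xpath->query('//*[text()]') as $el) {
    // Concatenate only the element's direct text nodes.
    $own = '';
    foreach ($xpath->query('text()', $el) as $text) {
        $own .= $text->nodeValue;
    }
    if (!preg_match($regex, $own)) {
        continue;
    }
    $path = $el->getNodePath();
    // Nearest element (self or ancestor) that carries an id attribute.
    $anchor = $xpath->query('ancestor-or-self::*[@id][1]', $el)->item(0);
    if ($anchor) {
        $path = sprintf('//%s[@id="%s"]', $anchor->nodeName, $anchor->getAttribute('id'))
              . substr($path, strlen($anchor->getNodePath()));
    }
    $paths[] = $path;
}

print_r($paths);
```

Because ancestor-or-self is a reverse axis, the [1] position picks the closest element with an id, not the outermost one.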

How can I parse HTML in batches using xpath [PHP]?

I tried all sorts of things but couldn't find a solution.
I want to retrieve elements from html code using xpath in php.
Ex:
<div class='student'>
<div class='name'>Michael</div>
<div class='age'>26</div>
</div>
<div class='student'>
<div class='name'>Joseph</div>
<div class='age'>27</div>
</div>
I want to retrieve the information and put them in an array as follows:
$student[0]['name'] = 'Michael';
$student[0]['age'] = 26;
$student[1]['name'] = 'Joseph';
$student[1]['age'] = 27;
In other words, I want the matching ages to stay with the names.
I tried the following:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpathDom = new DomXPath($dom);
$homepostcontentNodes = $xpathDom->query("//*[contains(@class, 'student')]//*[contains(@class, 'name')]");
However, this only grabs the 'name' nodes.
How can I get the matching age nodes?

Of course it is only grabbing the name nodes - you are telling it to!
What you will need to do is in two steps:
1. Pick out all the student nodes
2. For each student node, pick out the columns
This is a pretty standard step in linearization of data, and the XPath queries are simple:
Step 1
You pretty much have it:
$studentNodes = $xpathDom->query("//div[contains(@class, 'student')]");
This will return all your student nodes.
Step 2
This is where the magic happens. We have our nodes, we can loop through them (DOMNodeList implements Iterator, so we can foreach-loop through them). What we need to figure out is how to find its children...
...Oh wait. DOMNode implements a method called getNodePath which returns the full, direct XPath path to the node. This allows us to then simply append /div to get all the div elements that are direct children of the node!
Another quick foreach, and we get this code:
$studentNodes = $xpathDom->query("//div[contains(@class, 'student')]");
$result = array();
foreach ($studentNodes as $v) {
// Child nodes: student
$r = array();
$columns = $xpathDom->query($v->getNodePath()."/div");
foreach ($columns as $v2) {
// getAttribute() reads the 'class' attribute of the element; it becomes the array key
$r[$v2->getAttribute("class")] = $v2->textContent;
}
$result[] = $r;
}
var_dump($result);
Full fiddle: http://codepad.viper-7.com/t868Wh
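As an alternative to building paths with getNodePath(), DOMXPath accepts a context node as its second argument, so the per-student lookups can be written as relative expressions. A minimal sketch, assuming the exact class names from the question and that each student has one name and one age:

```php
<?php
$html = "<div class='student'><div class='name'>Michael</div><div class='age'>26</div></div>"
      . "<div class='student'><div class='name'>Joseph</div><div class='age'>27</div></div>";

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpathDom = new DOMXPath($dom);

$students = [];
foreach ($xpathDom->query("//div[@class='student']") as $studentNode) {
    // Relative expressions are resolved against $studentNode.
    $students[] = [
        'name' => $xpathDom->evaluate("string(div[@class='name'])", $studentNode),
        'age'  => (int) $xpathDom->evaluate("string(div[@class='age'])", $studentNode),
    ];
}

print_r($students);
```

Each name stays paired with its age because both values are read from the same student node.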

Extracting HTML fields through XPath

I have this query which extracts the posts that have been "liked" more than 5 times.
//div[@class="pin"]
[.//span[@class = "LikesCount"]
[substring-before(normalize-space(text())," ") > 5]]
I'd like to extract and store additional information such as titles, image URL, like count, repin count, ...
How to extract them all ?
Multiple XPath queries?
Digging into the nodes of the resulted posts while iterating with php and php functions?
...
Follows a Markup example:
<div class="pin">
<p class="description">gorgeous couch #modern</p>
[...]
<div class="PinHolder">
<a href="/pin/56787645270909880/" class="PinImage ImgLink">
<img src="http://media-cache-ec3.pinterest.com/upload/56787645270909880_d7AaHYHA_b.jpg"
alt="Krizia"
data-componenttype="MODAL_PIN"
class="PinImageImg"
style="height: 288px;">
</a>
</div>
<p class="stats colorless">
<span class="LikesCount">
22 likes
</span>
<span class="RepinsCount">
6 repins
</span>
</p>
[...]
</div>

As you are already using XPath in your code I would suggest to extract that information using XPath too. Here comes an example on how to extract the description.
<?php
// will store the posts as assoc arrays
$mostLikedPostsArr = array();
// call your fictional load function
$doc = load_html('whatever');
// create a XPath selector
$selector = new DOMXPath($doc);
// this is your query from above
$query = '//div[@class="pin"][.//span[@class = "LikesCount"][substring-before(normalize-space(text())," ") > 5]]';
// getting the most liked posts
$mostLikedPosts = $selector->query($query);
// now iterate through the post nodes
foreach($mostLikedPosts as $post) {
// assoc array for a post
$postArr = array();
// you can do 'relative' queries once having a reference to $post
// note $post as the second parameter to $selector->query()
// lets extract the description for example
$result = $selector->query('p[@class = "description"]', $post);
// just using nodeValue might be ok for text only nodes.
// to properly flatten the <a> tags inside the descriptions
// it will take further attention.
$postArr['description'] = $result->item(0)->nodeValue;
// ...
$mostLikedPostsArr []= $postArr;
}
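The same relative-query pattern extends to the other fields. The sketch below is an illustration under assumptions: the sample markup is a trimmed, hypothetical version of the question's HTML (the image URL is a placeholder), and evaluate() is used so that each field comes back as a plain string.

```php
<?php
// Hypothetical sample markup: one pin above the threshold, one below it.
$html = <<<'HTML'
<div class="pin">
  <p class="description">gorgeous couch #modern</p>
  <div class="PinHolder">
    <a href="/pin/1/" class="PinImage ImgLink"><img src="http://example.com/couch.jpg" alt="Krizia" class="PinImageImg"></a>
  </div>
  <p class="stats colorless">
    <span class="LikesCount"> 22 likes </span>
    <span class="RepinsCount"> 6 repins </span>
  </p>
</div>
<div class="pin">
  <p class="description">plain chair</p>
  <p class="stats colorless">
    <span class="LikesCount"> 3 likes </span>
    <span class="RepinsCount"> 1 repins </span>
  </p>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXPath($doc);

$query = '//div[@class="pin"][.//span[@class = "LikesCount"][substring-before(normalize-space(text())," ") > 5]]';

$mostLikedPostsArr = [];
foreach ($selector->query($query) as $post) {
    // Every expression below is relative to the current $post node.
    $mostLikedPostsArr[] = [
        'description' => $selector->evaluate('normalize-space(p[@class = "description"])', $post),
        'image'       => $selector->evaluate('string(.//img[@class = "PinImageImg"]/@src)', $post),
        'likes'       => (int) $selector->evaluate('substring-before(normalize-space(.//span[@class = "LikesCount"]), " ")', $post),
        'repins'      => (int) $selector->evaluate('substring-before(normalize-space(.//span[@class = "RepinsCount"]), " ")', $post),
    ];
}

print_r($mostLikedPostsArr);
```

A field that is missing from a post simply yields an empty string (or 0 after the integer cast), which makes it easy to validate each record afterwards.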
