Using PHP X-Path to extract specific parts of a webpage


I am after a specific value from a webpage: the product name that is in the h1 tag:
<div id="extendinfo_container">
<h1><strong>Product Name</strong></h1>
<div style="font-size:0;height:4px;"></div>
<p class="text_breadcrumbs">
<img src="arrow_091.gif" align="absmiddle"/>
Product Name<img src="arrow_091.gif" align="absmiddle"/>
<strong>Product Name</strong>
<div class="dotted_line_blue">
<img src="theme_shim.gif" height="1" width="100%" alt=" " />
</div>
</div>
This is a poorly structured website with more than one h1, so I cannot simply do getElementsByTagName('h1').
I want to be as specific as possible in which element I get and this is the code I have:
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents('http://url/to/website'));
// locate <div id="extendinfo_container"><a><h1><strong>(.*)</strong></h1></a> as product name
$x = new DOMXPath($doc);
$pName = $x->query('//div[@id="extendinfo_container"]/a/h1/strong');
var_dump($pName->nodeValue);
This returns null. What query do I need to use to get the content I want?

query() returns a DOMNodeList, which doesn't have a nodeValue property. You have to select one element (i.e. the first):
$pName = $x->query('//div[@id="extendinfo_container"]/a/h1/strong')->item(0);
Or iterate over it:
foreach( $pName as $el) {
var_dump( $el->nodeValue);
}
Either one of these will give you access to a DOMNode, which is what you're looking for.
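A minimal, self-contained sketch of the first option. It assumes a cut-down copy of the markup from the question; since that snippet shows no <a> wrapper around the h1, the path here omits /a. The null check is an addition for the case where nothing matches.

```php
<?php
// Cut-down copy of the question's markup (assumed sample, no <a> wrapper).
$html = <<<'HTML'
<div id="extendinfo_container">
<h1><strong>Product Name</strong></h1>
<p class="text_breadcrumbs">Product Name</p>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$x = new DOMXPath($doc);
$nodes = $x->query('//div[@id="extendinfo_container"]/h1/strong');

// query() returns a DOMNodeList; item(0) is null when nothing matched.
$first = $nodes->item(0);
$productName = $first !== null ? trim($first->nodeValue) : null;

var_dump($productName);
```

Checking item(0) for null before reading nodeValue avoids a warning when the path does not match the real page.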

PHP's DOM is VERY picky about the html you load into it. It will barf and refuse to load even slightly malformed documents.
Turn off error suppression (@$doc->loadHTML, remove the @) and make sure that it's not puking on this page you're trying to analyze. Otherwise, your XPath query looks fine, and if the document does get loaded/parsed properly, it SHOULD work.
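One way to see what the parser complains about without the @ operator is libxml's internal error buffer. The sketch below is only an illustration: the markup string is a made-up sample, and which messages appear (if any) depends on the installed libxml version.

```php
<?php
// Made-up, slightly malformed sample markup.
$html = '<div id="a"><p>unclosed<span></div><foo>unknown tag</foo>';

libxml_use_internal_errors(true); // buffer parser messages instead of emitting warnings
$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Each entry is a LibXMLError object with message, line and level properties.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ')', PHP_EOL;
}
libxml_clear_errors();

var_dump($loaded);
```

If $loaded is true, the document was parsed and any XPath problem lies elsewhere.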

The query works fine. I was accessing the value wrong. Here is the correct way to access the value:
var_dump($pName->item(0)->nodeValue);

Related

How to get xPath nodeValue USD dollar amount

I'm trying to start from the <span> element that has text Value when transacted
Then get its parent <div> and get following sibling which is a <div> and from that <div> get the text of the child <span>.
From what I can tell, the code is correct and should echo $1,034.29.
It echoes $0.00 instead.
What am I missing here?
php code:
$a = new DOMXPath($doc);
$dep_val_txt = $a->query("//span[contains(text(), 'Value when transacted')]");
$dep_val_nxt_elem = $a->query("parent::div", $dep_val_txt[0]);
$dep_val_elem = $a->query("following-sibling::*[1]", $dep_val_nxt_elem[0]);
$dep_val = $dep_val_elem->item(0)->childNodes->item(0)->nodeValue;
echo $dep_val;
html code:
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh sc-1n72lkw-0 bKaZjn" opacity="1">Value when transacted</span>
</div>
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh u3ufsr-0 gXDEBk" opacity="1">$1,034.29</span>
</div>

In case someone else stumbles upon this question in the future, here is a summary of the solution reached in conversation with the OP in the comments:
The issue here is not with the DOM selectors: the output is $0.00 even though the code does no currency formatting, so that string must come from the page itself. This suggests that the website being scraped uses placeholder values which are updated on the client side by JavaScript. Selectors cannot fix this, because the DOM received by PHP is the initial render, which does not contain the values we wish to scrape.
So the solution is to examine the website being scraped to determine where and how the values are being fetched before being added to the DOM on the client side. For example, if the website is using an API call to fetch the values, one can simply use the same API to fetch the intended data without having to scrape the HTML DOM at all.

If you follow the OP's question literally:
start from the <span> element that has text "Value when transacted"
get its parent <div>
get following sibling which is a <div>
get the text of the child <span>
then the xpath expression should be
//span[text()='Value when transacted']/parent::div/following-sibling::div/span
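As a sketch, the expression can be run against a shortened copy of the question's markup (class attributes trimmed for brevity). The [1] predicate is an addition here, to take only the nearest following div in case the real page has more siblings.

```php
<?php
// Shortened copy of the markup from the question; nowdoc keeps "$1,034.29" literal.
$html = <<<'HTML'
<div class="sc-8sty72-0 cyLejs">
<span opacity="1">Value when transacted</span>
</div>
<div class="sc-8sty72-0 cyLejs">
<span opacity="1">$1,034.29</span>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // silence warnings about non-standard attributes
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query(
    "//span[text()='Value when transacted']/parent::div/following-sibling::div[1]/span"
);

// Guard against an empty result before reading nodeValue.
$value = $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;
echo $value, PHP_EOL;
```

This only helps when the value is present in the HTML that PHP receives; it cannot recover values filled in later by client-side scripts.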

You might find it easier and faster to use a regex to match the price. Here's a quick example in PHP:
<?php
// Your input HTML (as per your example)
$inputHtml = <<<HTML
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh sc-1n72lkw-0 bKaZjn" opacity="1">Value when transacted</span>
</div>
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh u3ufsr-0 gXDEBk" opacity="1">$1,034.29</span>
</div>
HTML;
$matches = [];
// Look for any div > span element which contains a string starting with $ and then match a number (allowing for a , or . within the price matched).
if (preg_match_all('#<div.*>\s*<span.*?>\$([0-9.,]+)</span>\s*</div>#mis', $inputHtml, $matches)) {
echo 'Price found: ' . $matches[1][0] . PHP_EOL;
}
Console output from this:
Price found: 1,034.29

Get contents of element from external page PHP

I'd like to get the content (CSS, children, etc.) of an element to display on an HTML page, but this element is on an external page. When I use:
$page = new DOMDocument();
$page->loadHTMLFile('about.php');
$text = $page->getElementById('text');
echo $text->nodeValue;
I only get the text, but #text also has an image as a child and some CSS. Can I get (and echo) those too, kind of like with an iframe, but with an element? If so, how?
Thanks a lot.

Maybe what you're looking for is DOMDocument::saveHTML().
If you pass the optional node argument, it outputs only that particular node (including its children).
$elm = $page->getElementById('text');
echo $elm->ownerDocument->saveHTML($elm);
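A self-contained sketch of this, assuming a small inline stand-in for about.php (the markup string is invented for illustration):

```php
<?php
// Invented stand-in for the external page.
$html = '<html><body><div id="text">'
      . '<img src="img/dummy.png" alt="dummy"><p>text</p>'
      . '</div><p>outside</p></body></html>';

$page = new DOMDocument();
libxml_use_internal_errors(true);
$page->loadHTML($html);
libxml_clear_errors();

$elm = $page->getElementById('text');

// saveHTML() with a node argument serializes that node and its descendants only.
$markup = $elm !== null ? $page->saveHTML($elm) : '';
echo $markup, PHP_EOL;
```

Styles defined in the external page's stylesheets are not part of the serialized markup, so they would still have to be loaded separately.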

I have found a solution, although it doesn't retrieve the CSS, but if you only need the element and its children, this is my best bet.
Use simple_html_dom.php to do all the hard stuff.
My external page:
<div id='text'>
<img src='img/dummy.png' align='left' alt='Image not available. Our apologies.'/>
<span>text</span><br/>
<p>
text
</p>
<p>
text
</p>
<p>
text
</p>
</div>
Now, my page that I'd like to show the contents of my external page:
<?php include('../includes/simple_html_dom.php'); ?>
....
<?php
$html = file_get_html('about.php');
$ret = $html->find('div#text', 0);
echo $ret;
?>
This echoes the element together with its children, unfortunately without the CSS.

How to parse multiple elements in portions for html via Simple Html Dom

I am attempting to get various elements inside of an li as shown below. I am pretty new to this so I may not be using the most efficient methods but this is where I have started...
EXAMPLE CODE SIMPLIFIED....
<li id='entry_0' title='09879879'>
<div ....>
<h2> The title text would go here </h2>
<span class='entrySize' ....> 20oz </span>
<span class='entryPrice' ....> $32.09 </span>
<span class='anotherEntry' ....> More Data I need To Grab </span>
.......
</div>
</li>
<li> .... With same structure as above .... 100's of entries like this </li>
I know how to pull individual parts separately but having trouble grasping how to do it grouped within a portion of the html.
$filename = "directory/file.html";
$html = file_get_html($filename);
for($i=0; $i<=count(entryNumber);$i++)
{
$li_id = "entry_".$i;
foreach($html->find('li[id='.$li_id.']') as $li) {
echo $li->innertext;
}
}
So this gets me the content of the list item tag, with the id number as the unique attribute. I would like to grab the h2 text, entrySize, entryPrice etc. as I iterate through the list item tags. What I don't understand is how, once I have the list item content, I can parse its inner tags and attributes. Other parts of the full HTML document may have tags with the same id or class as these, so I am breaking the document down into portions and then looking to parse each section one at a time.
I would also like to pull the title attribute out of the li tag.
I hope my explanation makes sense.

You should probably use a DOM parser. PHP comes bundled with one, and there are many others you could use.
http://php.net/dom
PHP Simple HTML DOM Parser
<?php
$html = file_get_contents($page);
$doc = new DOMDocument();
$doc->loadHTML($html);
// now find what you need
$items = $doc->getElementsByTagName('li');
foreach ($items as $item) {
    $id = $item->getAttribute('id');
    // the question's list items use ids like entry_0, entry_1, ...
    if (strpos($id, 'entry_') !== false) {
        // found matching li, grab its children
    }
}
Use this as a baseline; we can't write all the code for you. From here, follow the PHP docs to grab the child values and handle them. :)
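For the "grab its children" step, one possible approach is a relative XPath query scoped to each li. This is a sketch with PHP's built-in DOM rather than Simple HTML DOM; the second list item and its values are invented sample data, and the wrapping ul is assumed.

```php
<?php
// Sample markup modeled on the question; the second entry is invented.
$html = <<<'HTML'
<ul>
<li id="entry_0" title="09879879">
<div>
<h2> The title text would go here </h2>
<span class="entrySize"> 20oz </span>
<span class="entryPrice"> $32.09 </span>
</div>
</li>
<li id="entry_1" title="12345678">
<div>
<h2> Second title </h2>
<span class="entrySize"> 12oz </span>
<span class="entryPrice"> $8.50 </span>
</div>
</li>
</ul>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$entries = [];
foreach ($xpath->query('//li[starts-with(@id, "entry_")]') as $li) {
    // Passing $li as the context node limits each lookup to that list item.
    // string(...) yields the text of the first match, or '' when there is none.
    $entries[] = [
        'id'    => $li->getAttribute('id'),
        'title' => $li->getAttribute('title'),
        'name'  => trim($xpath->evaluate('string(.//h2)', $li)),
        'size'  => trim($xpath->evaluate('string(.//span[@class="entrySize"])', $li)),
        'price' => trim($xpath->evaluate('string(.//span[@class="entryPrice"])', $li)),
    ];
}

print_r($entries);
```

Because every lookup starts from the current li, elements elsewhere in the document with the same class names are not picked up.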

DOMXPath union extract with PHP

I'm trying to get img and the div which is coming after the div which contains that img, all in one query.
So I did this:
$nodes = $xpath->query('//div[starts-with(@id, "someid")]/img |
//div[starts-with(@id, "someid")]/following-sibling::div[@class="spec_class"][1]/text()');
Now, I'm able to get the attributes of img tag, but I can't get the text of the following sibling. If I separate the query (two queries - first for the img and second query for the sibling) it works. But how can I do this with only one query? By the way, there is no error in the syntax. But somehow the union doesn't work or maybe I'm not extracting the sibling content right.
Here's the markup (which repeats many times with different text and id="someid_%randomNumber%"):
<div id="someid_1">
<img src="link_to_image.png" />
...some text...
</div>
<div>...another text...</div>
<div class="spec_class">
...Important text...
</div>
I want to get in one query both link_to_image.png and ...Important text...

Your query seems correct.
Example XML:
<div>
<div id="someid-1"><img src="foo"/></div>
<div class="spec_class">bar</div>
<div class="spec_class">baz</div>
</div>
Example PHP Code:
$dom = new DOMDocument;
$dom->loadXml($xhtml);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//div…') as $node) {
echo $dom->saveXML($node);
}
Outputs:
<img src="foo"/>bar
Note that you will have to iterate the DOMNodeList returned by the XPath query.
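A sketch of that iteration, using the example XML above and the union expression from the question. Telling the two node types apart with instanceof is one possible approach; the variable names are illustrative.

```php
<?php
$xhtml = <<<'XML'
<div>
<div id="someid-1"><img src="foo"/></div>
<div class="spec_class">bar</div>
<div class="spec_class">baz</div>
</div>
XML;

$dom = new DOMDocument;
$dom->loadXML($xhtml);
$xpath = new DOMXPath($dom);

// Union of the img element and the text node of the first matching sibling div.
$query = '//div[starts-with(@id, "someid")]/img'
       . ' | //div[starts-with(@id, "someid")]'
       . '/following-sibling::div[@class="spec_class"][1]/text()';

$src = null;
$text = null;
foreach ($xpath->query($query) as $node) {
    if ($node instanceof DOMElement) {
        $src = $node->getAttribute('src');   // the <img> element
    } elseif ($node instanceof DOMText) {
        $text = trim($node->nodeValue);      // the sibling's text node
    }
}

echo $src, ' / ', $text, PHP_EOL;
```

With markup that repeats many times, the result list holds an element and a text node for each block, so pairing them up needs extra bookkeeping.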

Parse HTML with PHP to get sibling elements grouped by class

I have a HUGE HTML document that I need to parse.
The document is a list of <p> elements all (direct) children of the body tag.
The difference is the class name. The structure is like this:
<p class="first-level"></p>
<p class="second-level"></p>
<p class="third-level"></p>
<p class="third-level"></p>
<p class="nth-levels just-for-demo-1"></p>
<p class="nth-levels just-for-demo-1"></p>
<p class="third-level"></p>
<p class="second-level"></p>
<p class="third-level"></p>
<p class="nth-levels just-for-demo-2"></p>
<p class="first-level"></p>
<p class="second-level"></p>
<p class="second-level"></p>
<p class="third-level"></p>
And so on. nth-level can be any class name that isn't first-level, second-level or third-level.
Basically it's a multi-level <ul> element very poorly marked-up.
What I want to do is parse it and obtain all <p> elements (including tag, not just innerHTML) that are between one of the class names above.
In the example above, I want to get:
<p class="nth-levels just-for-demo-1"></p>
<p class="nth-levels just-for-demo-1"></p>
and
<p class="nth-levels just-for-demo-2"></p>
How the heck can I do that please?
Thank you.

Using XPath:
//p[not(@class='first-level')][not(@class='second-level')][not(@class='third-level')]
to get the non-matching nodes; then you can use this answer to get the outerHTML of the nodes.
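A minimal sketch of this, assuming a shortened copy of the question's markup and using DOMDocument::saveHTML() with a node argument as one way to obtain each element's outer HTML:

```php
<?php
// Shortened copy of the structure from the question.
$html = <<<'HTML'
<body>
<p class="first-level"></p>
<p class="second-level"></p>
<p class="third-level"></p>
<p class="nth-levels just-for-demo-1"></p>
<p class="nth-levels just-for-demo-1"></p>
<p class="third-level"></p>
<p class="nth-levels just-for-demo-2"></p>
</body>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$query = "//p[not(@class='first-level')]"
       . "[not(@class='second-level')]"
       . "[not(@class='third-level')]";

$outer = [];
foreach ($xpath->query($query) as $p) {
    // Serializes the <p> tag itself together with its contents.
    $outer[] = $doc->saveHTML($p);
}

print_r($outer);
```

The comparison is against the whole class attribute, so an element such as class="third-level extra" would not be excluded by this expression.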

Additionally, if you're familiar with jQuery, try a jQuery port to PHP. It gives you a powerful set of tools for matching elements in a document (selectors), just as in jQuery, along with hierarchy, attribute filters, child filters, etc.

$doc = new DOMDocument;
$doc->loadHTML(...);
$query = '//p[contains(@class, "just-for-demo-")]';
$xpath = new DOMXPath($doc);
$entries = $xpath->query($query);
foreach ($entries as $entry)
{
    // not the best solution yet: rebuilds the tag by hand
    $attribute = '';
    foreach ($entry->attributes as $attr)
    {
        // leading space keeps each attribute separate from the tag name
        $attribute .= " {$attr->name}=\"{$attr->value}\"";
    }
    echo "<{$entry->nodeName}{$attribute}>{$entry->nodeValue}</{$entry->nodeName}>";
}

You could open the file (with fopen or something similar) and read one line at a time. Then just check if the required string is in the line (for example with strstr) and if yes, then add it to an array or do what you need with the line.
Note: this only works if the paragraphs are on different lines each.
fopen documentation
strstr documentation
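A rough sketch of the line-by-line idea. It assumes each paragraph sits on its own line and uses an in-memory stream with made-up sample lines in place of the real file; the search string 'just-for-demo-' is taken from the question's example.

```php
<?php
// Sample input held in memory; a real script would fopen() the HTML file instead.
$handle = fopen('php://memory', 'w+');
fwrite($handle, implode("\n", [
    '<p class="first-level"></p>',
    '<p class="nth-levels just-for-demo-1"></p>',
    '<p class="third-level"></p>',
    '<p class="nth-levels just-for-demo-2"></p>',
]));
rewind($handle);

$matches = [];
while (($line = fgets($handle)) !== false) {
    // strstr() returns false when the needle is absent from the line.
    if (strstr($line, 'just-for-demo-') !== false) {
        $matches[] = trim($line);
    }
}
fclose($handle);

print_r($matches);
```

Reading one line at a time keeps memory use low for a huge file, at the cost of relying on the line layout of the markup.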
