How to get xPath nodeValue USD dollar amount - php

I'm trying to start from the <span> element that has the text "Value when transacted",
then get its parent <div>, get the following sibling (also a <div>), and from that <div> get the text of the child <span>.
From what I can tell, the code is correct and should echo $1,034.29.
It echoes $0.00 instead.
What am I missing here?
php code:
$a = new DOMXPath($doc);
$dep_val_txt = $a->query("//span[contains(text(), 'Value when transacted')]");
$dep_val_nxt_elem = $a->query("parent::div", $dep_val_txt[0]);
$dep_val_elem = $a->query("following-sibling::*[1]", $dep_val_nxt_elem[0]);
$dep_val = $dep_val_elem->item(0)->childNodes->item(0)->nodeValue;
echo $dep_val;
html code:
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh sc-1n72lkw-0 bKaZjn" opacity="1">Value when transacted</span>
</div>
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh u3ufsr-0 gXDEBk" opacity="1">$1,034.29</span>
</div>

In case someone else stumbles upon this question in the future, here is a summary of the solution reached in the comments with the OP:
The issue here is not with the DOM selectors: the output is $0.00 even though the code never formats the value as a currency, so that string must come from the page itself. This suggests that the website being scraped serves placeholder values which are updated on the client side using JavaScript. Selectors cannot fix this, because the DOM received by PHP is the initial render, which does not contain the values we wish to scrape.
So the solution is to examine the website being scraped to determine where and how the values are fetched before being added to the DOM on the client side. For example, if the website uses an API call to fetch the values, one can simply use the same API to fetch the intended data without having to scrape the HTML DOM at all.
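As a rough sketch of what consuming such an API response could look like: the payload, its field names, and its values below are entirely hypothetical (the real site's API is not known from this question), and the JSON is inlined so the snippet runs without a network call.

```php
<?php
// HYPOTHETICAL payload: the structure and field names are assumptions
// made up for illustration, not the real site's API.
$json = '{"transaction": {"value_when_transacted": 1034.29, "currency": "USD"}}';

// Decode to an associative array; throw on malformed JSON instead of returning null.
$data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

// Read the raw number and format it the way the page displays it.
$amount = $data['transaction']['value_when_transacted'];
$formatted = '$' . number_format($amount, 2);

echo $formatted, PHP_EOL;
```

In practice the JSON string would come from an HTTP request to whatever endpoint the browser's network tab shows the page calling.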

If you follow the OP's question literally:
start from the <span> element that has text "Value when transacted"
get its parent <div>
get following sibling which is a <div>
get the text of the child <span>
then the xpath expression should be
//span[text()='Value when transacted']/parent::div/following-sibling::div/span
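A minimal sketch of running that expression against the markup from the question, assuming the value is actually present in the HTML that PHP receives. The class attributes are shortened here, and `[1]` is added so only the nearest following `<div>` is used.

```php
<?php
// Sample markup based on the question (nowdoc, so "$1,034.29" stays literal).
$html = <<<'HTML'
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC" opacity="1">Value when transacted</span>
</div>
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC" opacity="1">$1,034.29</span>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$expr = "//span[text()='Value when transacted']/parent::div/following-sibling::div[1]/span";
$nodes = $xpath->query($expr);

// query() returns a DOMNodeList, so pick the first node explicitly.
$value = $nodes->length > 0 ? trim($nodes->item(0)->nodeValue) : null;
echo $value, PHP_EOL;
```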

You might find it easier and faster to match the price with a regex. Here's a quick example in PHP:
<?php
// Your input HTML (as per your example)
$inputHtml = <<<HTML
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh sc-1n72lkw-0 bKaZjn" opacity="1">Value when transacted</span>
</div>
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh u3ufsr-0 gXDEBk" opacity="1">$1,034.29</span>
</div>
HTML;
$matches = [];
// Look for any div > span element which contains a string starting with $ and then match a number (allowing for a , or . within the price matched).
if (preg_match_all('#<div.*>\s*<span.*?>\$([0-9.,]+)</span>\s*</div>#mis', $inputHtml, $matches)) {
echo 'Price found: ' . $matches[1][0] . PHP_EOL;
}
Console output from this:
Price found: 1,034.29

Related

Fetch Content from A Div id using PHP Simple HTML DOM Parser

I have following html code
<div id="b_changetext" class="FL gL_13 PT15"> <span class="gr_15 uparw_pc"><strong>5.80</strong></span> (+2.28%)</div>
I wanted to extract content (+2.28%)
Tried following code
foreach($html->find('div[n_changetext]') as $e){
echo $e->innertext . '<br>';
echo "wwwwww";
}
When I run it, the loop body is never entered ("wwwwww" is not displayed).
Can anyone please suggest a solution
div[n_changetext] finds elements with an n_changetext attribute (which is not valid in HTML).
To find an element with a given id you must specify that the name of the attribute is id and specify the value.
The value, in your example, starts with a b not an n:
find('div[id=b_changetext]')

PHP Regex replace link if it does not have data attribute

I need to loop through a bunch of HTML code and remove the <a> </a> tags from all links which don't include the data attribute data-link="keepLink"
Here is an example of body value I need to modify:
<p><a data-link=\"keepLink\" href=\"[1|9999|16|191967|256]\">Daily Racing Link</a></p>\r\n<br>\n <strong>OFFER – Get up to a £400 deposit bonus when you sign up with Fanduel.</strong>
After the modification I need it to look like (so the offer link is removed):
<p><a data-link=\"keepLink\" href=\"[1|9999|16|191967|256]\">Daily Racing Link</a></p>\r\n<br>\n <strong>OFFER – Get up to a £400 deposit bonus when you sign up with Fanduel.</strong>
So far I have managed to get the first half of the link removing if it doesn't include a data-link="keepLink" attribute. But the closing </a> is still present.
Here is the regex I have used:
$result["body_value"] = preg_replace('/<a (?![^>]*data-link="keepLink").*?>/i', '', $result["body_value"]);
So the new body value looks like:
<p><a data-link=\"keepLink\" href=\"[1|9999|16|191967|256]\">Daily Racing Link</a></p>\r\n<br>\n <strong>OFFER – Get up to a £400 deposit bonus when you sign up with Fanduel</a>.</strong>
The DOMDocument extension is available by default in PHP. It is presumably faster and is designed exactly for what you are trying to achieve. You can use it to load your document and search for any links without a data-link attribute like this:
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.example.com'); // load the file
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a[not(@data-link=\'keepLink\')]'); // search for links that do not have the 'data-link' attribute set to 'keepLink'
foreach($nodes as $element){
$textInside = $element->nodeValue; // get the text inside the link
$parentNode = $element->parentNode; // save parent node
$parentNode->replaceChild(new DOMText($textInside), $element); // remove the element
}
$myNewHTML = $dom->saveHTML(); // see http://php.net/manual/ro/domdocument.savehtml.php for limitations such as auto-adding of doc-type
echo $myNewHTML;
Proof of concept: https://3v4l.org/ejatQ.
Please bear in mind that this will take only the text values inside the elements without a data-link='keepLink' attribute value.
If you are set on regex and don't want to use a parser, try this:
<a (?!data-link=)[^>]*>((?!<\/a>).*?)<\/a>
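Replace it with $1 to keep the link text. Note that (?!data-link=) only inspects the first attribute after <a, so this assumes data-link comes first.
See https://regex101.com/r/wKQk4p/2
Please say if you need any further explanation.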
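If you do stay with a regex, here is a hedged sketch that combines the lookahead from the question with a capture group, so both the opening and the closing tag are removed. The sample string is an assumption: it is modelled on the question's body value with an unmarked offer link added, and it assumes links are not nested.

```php
<?php
// Sample input (assumed): one link to keep, one link to unwrap.
$html = '<p><a data-link="keepLink" href="/racing">Daily Racing Link</a></p> '
      . '<strong>OFFER: sign up with <a href="/offer">Fanduel</a>.</strong>';

// Match a whole <a ...>...</a> pair unless the opening tag carries
// data-link="keepLink" anywhere among its attributes.
$pattern = '~<a\b(?![^>]*\bdata-link="keepLink")[^>]*>(.*?)</a>~is';

// '$1' (single-quoted, so PHP does not interpolate it) keeps only the link text.
$result = preg_replace($pattern, '$1', $html);

echo $result, PHP_EOL;
```

Unlike the pattern above, the lookahead scans the whole opening tag, so the position of data-link among the attributes does not matter.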

How to parse multiple elements in portions for html via Simple Html Dom

I am attempting to get various elements inside of an li as shown below. I am pretty new to this so I may not be using the most efficient methods but this is where I have started...
EXAMPLE CODE SIMPLIFIED....
<li id='entry_0' title='09879879'>
<div ....>
<h2> The title text would go here </h2>
<span class='entrySize' ....> 20oz </span>
<span class='entryPrice' ....> $32.09 </span>
<span class='anotherEntry' ....> More Data I need To Grab </span>
.......
</div>
</li>
<li> .... With same structure as above .... 100's of entries like this </li>
I know how to pull individual parts separately but having trouble grasping how to do it grouped within a portion of the html.
$filename = "directory/file.html";
$html = file_get_html($filename);
for($i=0; $i<=count(entryNumber);$i++)
{
$li_id = "entry_".$i;
foreach($html->find('li[id='.$li_id.']') as $li) {
echo $li->innertext;
}
}
So this gets me the content of the list item whose id is the unique attribute. As I iterate through the list items, I would like to grab the h2 text, entrySize, entryPrice, etc. What I don't understand is how to parse the inner tags and attributes once I have a list item's content. Other parts of the full HTML document may contain tags with the same id or class, so I am breaking the document into portions and then parsing one section at a time.
I would also like to pull the title attribute out of the li tag.
I hope my explanation makes sense.
You should probably use a DOM parser. PHP comes bundled with one, and there are many others you could use.
http://php.net/dom
PHP Simple HTML DOM Parser
<?php
$html = file_get_contents($page);
$doc = new DOMDocument();
$doc->loadHTML($html);
// now find what you need
$items = $doc->getElementsByTagName('li');
foreach ($items as $item) {
$id = $item->getAttribute('id');
if (strpos($id, 'entry_') === 0) {
// found matching li (id starts with "entry_"), grab its children
}
}
Use this as a baseline, we can't write all the code for you. Check out the PHP docs to finish this :) From what I have so far, you need to follow the docs to make it grab the child values, and handle them.
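One way to finish that baseline, sketched with DOMXPath queries scoped to each li. The sample markup follows the simplified example from the question; the second entry and its values are made up for illustration, and the class names are assumed to match exactly.

```php
<?php
// Sample markup (second entry is hypothetical).
$html = <<<'HTML'
<ul>
<li id="entry_0" title="09879879">
<div>
<h2> The title text would go here </h2>
<span class="entrySize"> 20oz </span>
<span class="entryPrice"> $32.09 </span>
</div>
</li>
<li id="entry_1" title="09879880">
<div>
<h2> Another title </h2>
<span class="entrySize"> 12oz </span>
<span class="entryPrice"> $18.50 </span>
</div>
</li>
</ul>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$entries = [];
foreach ($xpath->query('//li[starts-with(@id, "entry_")]') as $li) {
    // The second argument makes each expression relative to the current <li>,
    // so identical class names elsewhere in the document are ignored.
    $entries[] = [
        'id'    => $li->getAttribute('id'),
        'title' => $li->getAttribute('title'),
        'name'  => trim($xpath->evaluate('string(.//h2)', $li)),
        'size'  => trim($xpath->evaluate('string(.//span[@class="entrySize"])', $li)),
        'price' => trim($xpath->evaluate('string(.//span[@class="entryPrice"])', $li)),
    ];
}

print_r($entries);
```

Selecting every li whose id starts with "entry_" also removes the need for a counter loop over entry numbers.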

Simple HTML DOM Parser - Get all plaintext rather than text of certain element

I tried all the solutions posted on this question. Although it is similar to my question, its solutions aren't working for me.
I am trying to get the plain text that is outside of <b> but inside <div id="maindiv">.
<div id=maindiv>
<b>I don't want this text</b>
I want this text
</div>
$part is the object that contains <div id="maindiv">.
Now I tried this:
$part->find('!b')->innertext;
The code above is not working. When I tried this
$part->plaintext;
it returned all of the plain text like this
I don't want this text I want this text
I read the official documentation, but I didn't find anything to resolve this:
Query:
$selector->query('//div[@id="maindiv"]/text()[2]')
Explanation:
// - selects nodes regardless of their position in tree
div - selects elements whose node name is 'div'
[@id="maindiv"] - selects only those divs having the attribute id="maindiv"
/ - sets focus to the div element
text() - selects only text nodes
[2] - selects the second text node (the first is whitespace)
Note! The actual position of the text element may depend on
your preserveWhitespace setting.
Manual: http://www.php.net/manual/de/class.domdocument.php#domdocument.props.preservewhitespace
Example:
$html = <<<EOF
<div id="maindiv">
<b>I dont want this text</b>
I want this text
</div>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXpath($doc);
$node = $selector->query('//div[@id="maindiv"]/text()[2]')->item(0);
echo trim($node->nodeValue); // I want this text
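Because the positional index depends on how whitespace is handled, a variant that does not rely on it may be more robust. The following is only a sketch: it collects every non-blank text node that is a direct child of the div, which assumes the wanted text is never wrapped in another element.

```php
<?php
$html = <<<'HTML'
<div id="maindiv">
<b>I dont want this text</b>
I want this text
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// text()[normalize-space()] keeps only direct text children that contain
// something other than whitespace; text inside <b> is not a direct child.
$parts = [];
foreach ($xpath->query('//div[@id="maindiv"]/text()[normalize-space()]') as $textNode) {
    $parts[] = trim($textNode->nodeValue);
}
$text = implode(' ', $parts);

echo $text, PHP_EOL;
```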
remove the <b> first:
$part->find('b', 0)->outertext = '';
echo $part->innertext; // I want this text

Using PHP X-Path to extract specific parts of a webpage

I am after a specific value from a webpage: the product name that is in the h1 tag:
<div id="extendinfo_container">
<h1><strong>Product Name</strong></h1>
<div style="font-size:0;height:4px;"></div>
<p class="text_breadcrumbs">
<img src="arrow_091.gif" align="absmiddle"/>
Product Name<img src="arrow_091.gif" align="absmiddle"/>
<strong>Product Name</strong>
<div class="dotted_line_blue">
<img src="theme_shim.gif" height="1" width="100%" alt=" " />
</div>
</div>
This is a poorly structured website with more than one h1 so I cannot simply do getElementById('h1').
I want to be as specific as possible in which element I get and this is the code I have:
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents('http://url/to/website'));
// locate <div id="extendinfo_container"><a><h1><strong>(.*)</strong></h1></a> as product name
$x = new DOMXPath($doc);
$pName = $x->query('//div[@id="extendinfo_container"]/a/h1/strong');
var_dump($pName->nodeValue);
This returns null. What query do I need to use to get the content I want?
query() returns a DOMNodeList, which doesn't have a nodeValue property. You have to select one element (i.e. the first):
$pName = $x->query('//div[@id="extendinfo_container"]/a/h1/strong')->item(0);
Or iterate over it:
foreach( $pName as $el) {
var_dump( $el->nodeValue);
}
Either one of these will give you access to a DOMNode, which is what you're looking for.
PHP's DOM is VERY picky about the html you load into it. It will barf and refuse to load even slightly malformed documents.
Turn off error suppression (remove the @ from @$doc->loadHTML) and make sure that it's not puking on this page you're trying to analyze. Otherwise, your XPath query looks fine, and if the document does get loaded/parsed properly, it SHOULD work.
The query works fine. I was accessing the value wrong. Here is the correct way to access the value:
var_dump($pName->item(0)->nodeValue);
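Putting the answers together, a self-contained sketch might look like this. It runs on a trimmed copy of the markup shown in the question, which has no `<a>` wrapper, so the `/a` step is dropped from the query here. It also uses libxml_use_internal_errors() in place of the @ operator so parser warnings can be inspected rather than silenced; that swap is a choice made for this sketch, not something the answers prescribe.

```php
<?php
// Trimmed sample based on the question's markup.
$html = <<<'HTML'
<div id="extendinfo_container">
<h1><strong>Product Name</strong></h1>
<p class="text_breadcrumbs">Product Name</p>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect warnings instead of emitting them
$doc->loadHTML($html);
$parseErrors = libxml_get_errors(); // inspect these if the query finds nothing
libxml_clear_errors();

$x = new DOMXPath($doc);
$list = $x->query('//div[@id="extendinfo_container"]/h1/strong');

// $list is a DOMNodeList: check that it is non-empty before reading item(0).
$pName = $list->length > 0 ? $list->item(0)->nodeValue : null;

echo $pName, PHP_EOL;
```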
