Getting text content with xpath

I have some HTML like this:
<dd class="price">
<sup class="symbol">$</sup><span class="dollars">58</span><sup class="cents">.00</sup>
</dd>
What's the xpath to get $58.00 back as one string?
I'm using PHP:
$xpath = '?????';
$result = $xml->xpath($xpath);
echo $result[0]; // want this to show $58.00, possible?

Either of these works in your case; see the links below for more detail:
$html = '<dd class="price">
<sup class="symbol">$</sup><span class="dollars">58</span><sup class="cents">.00</sup>
</dd>';
$dom = new DOMDocument();
$dom->loadXML($html);
$xpt = new DOMXpath($dom);
foreach ($xpt->query('//dd[@class="price"]') as $node) {
// outputs: $58.00
echo trim($node->nodeValue);
}
// or
$xml = new SimpleXMLElement($html);
$res1 = $xml->xpath('//dd[@class="price"]/sup');
$res2 = $xml->xpath('//dd[@class="price"]/span');
// outputs: $58.00
printf('%s%s%s', (string) $res1[0], (string) $res2[0], (string) $res1[1]);
DOMDocument
DOMXPath
SimpleXMLElement

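data() will return all contents inside the current context. Try
//dd/data()
Note that data() is an XPath 2.0 function. PHP's DOMXPath and SimpleXMLElement::xpath() are built on libxml2, which implements XPath 1.0 only, so this expression needs an XPath 2.0 processor.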

You haven't shown us your code so I don't know what platform you're using. If you have something that can evaluate non-node XPath expressions, then you can use this:
string(//dd[@class = 'price'])
If not, you can select the node,
//dd[@class = 'price']
and the API you're using should have a way of getting the inner text value of the selected node.
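For PHP specifically, a minimal sketch of the string() approach could look like the following. It assumes DOMXPath::evaluate() is used (query() only returns node lists), and it wraps the expression in normalize-space() to drop the line breaks around the inner elements; the variable names are only illustrative.

```php
<?php
// The markup from the question, loaded as XML because it is well-formed.
$html = '<dd class="price">
<sup class="symbol">$</sup><span class="dollars">58</span><sup class="cents">.00</sup>
</dd>';

$dom = new DOMDocument();
$dom->loadXML($html);
$xpath = new DOMXPath($dom);

// evaluate() can return a scalar; string() takes the string value of the
// first matching node and normalize-space() trims the surrounding whitespace.
$price = $xpath->evaluate('normalize-space(string(//dd[@class="price"]))');

echo $price, "\n";
```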

Related

php read html and handle double id-appearance

For my project I'm reading an external website that uses the same ID twice. I can't change that.
I need the content from the second occurrence of that ID, but my code only returns the first one and does not see the second.
Also, count($data) returns 1, not 2.
I'm desperate. Does anyone have an idea how to access the second ID 'hours'?
<?PHP
$url = 'myurl';
$contents = file_get_contents($url);
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile($url);
$data = $dom->getElementById("hours");
echo $data->nodeValue."\n";
echo count($data);
?>

As @rickdenhaan points out, getElementById always returns a single element, which is the first element that has that specific id value. However, you can use DOMXPath to find all nodes which have a given id value and then pick out the one you want (in this code it will find the second one):
$url = 'myurl';
$contents = file_get_contents($url);
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$count = 0;
foreach ($xpath->query("//*[@id='hours']") as $node) {
if ($count == 1) echo $node->nodeValue;
$count++;
}
As @NigelRen points out in the comments, you can simplify this further by directly selecting the second match in the XPath, i.e.
$node = $xpath->query("(//*[@id='hours'])[2]")[0];
echo $node->nodeValue;
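To try the positional predicate without fetching a remote page, here is a small self-contained sketch. The markup is hypothetical (made up to stand in for the external site), and it uses item(0) with a null check rather than array access.

```php
<?php
// Hypothetical markup with a duplicated id, standing in for the external page.
$html = '<div><p id="hours">Mon-Fri 9-17</p><p id="hours">Sat 10-14</p></div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// All elements carrying the id, in document order.
$matches = $xpath->query("//*[@id='hours']");
echo $matches->length, "\n";

// The parentheses make [2] apply to the whole result set, not per parent.
$second = $xpath->query("(//*[@id='hours'])[2]")->item(0);
echo $second !== null ? $second->nodeValue : 'not found', "\n";
```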

Fetching value of specific text node using DOMXPath

From the following structure:
I'm trying to fetch the marked text with the following code:
$price_new='div/div[@class="cat_price"]/text()';
if ($price_new!=null && $node = $Website_Xpath->query ($price_new, $row )) {
$result [$value] ['Price'] = $node->item( 0 )->nodeValue;
} else {
$result [$value] ['Price'] = "";
}
but the node value is NULL. How do I fetch the number correctly?

You should provide the actual snippet, not just a screenshot of it. If I interpreted the screenshot correctly the snippet is something like:
$xml = <<<'XML'
<body>
<div class="cat_price">
<div class="was">67,000 - PKR</div>
"
64,9999"<span> - PKR</span>
</div>
</body>
XML;
The text node with the price is the following sibling of the div with the class was. So it is possible to fetch it using that axis:
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
$expression = 'string(//div[@class="cat_price"]
/div[@class="was"]/following-sibling::text()[1])';
var_dump($xpath->evaluate($expression));
Unlike DOMXpath::query(), DOMXpath::evaluate() can return scalar values depending on the expression. A string cast or a string function will return a string.
string(25) "
"
64,9999""
However, the result contains not only the number but also the quotes and some whitespace. translate() and normalize-space() can be used to clean it up:
$expression = 'normalize-space(
translate(//div[@class="cat_price"]
/div[@class="was"]/following-sibling::text()[1], \'"\', " ")
)';
)';
var_dump($xpath->evaluate($expression));
Output:
string(7) "64,9999"
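To make the difference between query() and evaluate() concrete, here is a short sketch. It assumes a simplified version of the snippet (no stray quotes, no whitespace between tags), so the values differ from the output above.

```php
<?php
// Simplified, assumed markup: the price is the first text node of the outer div.
$document = new DOMDocument();
$document->loadXML(
    '<body><div class="cat_price"><div class="was">67,000 - PKR</div>'
    . '64,9999<span> - PKR</span></div></body>'
);
$xpath = new DOMXPath($document);

// query() always yields a node list for a location path.
$nodes = $xpath->query('//div[@class="cat_price"]/text()');

// evaluate() returns a typed scalar when the expression is not a node set.
$count = $xpath->evaluate('count(//div[@class="cat_price"]/div)');
$price = $xpath->evaluate('string(//div[@class="cat_price"]/text()[1])');

var_dump($nodes instanceof DOMNodeList, $count, $price);
```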

Your $Website_Xpath looks like a DOMXPath object. In that case the main issue with your code is the XPath expression 'div/div[@class="cat_price"]/text()': you are trying to fetch a div from nowhere. Either provide the full path from the root node (e.g. /html/body/div), or select all divs with the // prefix.
Example
$xml = <<<'XML'
<body>
<div class="cat_price">
<div class="was">67,000 - PKR</div>
64,9999<span> - PKR</span>
</div>
</body>
XML;
$doc = new DOMDocument();
$doc->loadXML($xml);
$text = '';
$xpath = new DOMXPath($doc);
// Select all text nodes within a <div> having class="cat_price"
if ($nodes = $xpath->query('//div[@class="cat_price"]/text()')) {
// Search for a node with some content, except spaces
foreach ($nodes as $n) {
if ($text = trim($n->nodeValue))
break;
}
}
var_dump($text);
Output
string(7) "64,9999"

Trying to retrieve text only from a div with xpath

I'm trying to write a script that will go through a poorly coded webpage and return the title element. However, the person who made the website I plan on scraping did not use ANY classes, just plain div elements. Here's the source of the webpage I'm trying to scrape:
<tbody>
<tr>
<td style = "...">
<div style = "...">
<div style = "...">TEXT I WANT</div>
</div>
</td>
</tr>
</tbody>
and when I copy the xpath in chrome I get this string:
/html/body/table/tbody/tr[2]/td[3]/table/tbody/tr[1]/td/div/div[3]
I'm having trouble figuring out where I put that string in an xpath query.
If not an xpath query maybe I should do a preg_match?
I tried this:
$location = '/html/body/table/tbody/tr[2]/td[3]/table/tbody/tr[1]/td/div/div[3]';
$html = file_get_contents($URL);
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
// Now query the document:
foreach ($xpath->query($location) as $node) {
echo $node, "\n";
}
but nothing is printed to the page.
Thanks.
EDIT: Full source code here:
http://pastebin.com/K5tZ4dFH
EDIT2: Cleaner code screen shot: http://i.imgur.com/lWKheBy.png

From looking at your source, try the following:
$html = file_get_contents($URL);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query("//div[contains(@style, 'left:20px')]");
foreach ($nodes as $node) {
echo $node->textContent;
}
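Here is a self-contained version you can run without the real page. The inline styles are hypothetical (the question elides them as "..."), so adjust the contains() argument to whatever actually distinguishes the target div. Note also that a DOMElement cannot be echoed directly; read its textContent (or nodeValue) instead.

```php
<?php
// Hypothetical stand-in for the scraped page; the style values are assumptions.
$html = '<table><tr><td style="width:100%">
<div style="position:relative">
<div style="position:absolute; left:20px">TEXT I WANT</div>
</div>
</td></tr></table>';

libxml_use_internal_errors(true); // real-world HTML usually triggers warnings
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Match on a fragment of the style attribute instead of a long absolute path.
$texts = [];
foreach ($xpath->query("//div[contains(@style, 'left:20px')]") as $node) {
    $texts[] = trim($node->textContent);
}

echo implode("\n", $texts), "\n";
```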

It looks like you want the text just before the first </div>, so this regex will find that:
[^<>]+(?=<\/div>)

How to cut off a portion of a html inside <div> and store it as html string by using xpath and domdocument?

I would like to cut out a portion of some HTML. I can select it using XPath and DOMDocument, but the problem is that I need the result as an HTML code string. Normally I would use a regular expression for that, but I'd rather not write a complicated search pattern that matches the beginning and the end of the tag.
That's the example input:
some html code before
<div>this <b>is</b> what I want</div>
some html after
and the output:
<div>this <b>is</b> what I want</div>
I tried something like this:
$subject = 'some html code before
<div>this <b>is</b> what I want</div>
some html after';
$doc = new DOMDocument();
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div/*");
echo $result->saveHTML();
but i got only error:
Call to undefined method DOMNodeList::saveHTML()
Does anyone know how to get the result as a html string by using DomDocument and XPath?

Thank you, gentlemen, for pointing out my misunderstanding about calling methods that are not available on a child object. But the line:
echo $doc->saveHTML($result->item(0));
generates only a warning (without the HTML string I want). Luckily I found another solution, and here it is:
<?php
$subject = '<html>
<head>
<title>A very short ebook</title>
<meta name="charset" value="utf-8" />
</head>
<body>
<h1 class="bookTitle">A very short ebook</h1>
<p style="text-align:right">Written by Kovid Goyal</p>
<div class="introduction">
<p>A very short ebook to demonstrate the use of XPath.</p>
</div>
<h2 class="chapter">Chapter One</h2>
<p>This is a truly fascinating chapter.</p>
<h2 class="chapter">Chapter Two</h2>
<p>A worthy continuation of a fine tradition.</p>
</body>
</html>';
$doc = new DOMDocument();
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div");
//echo $doc->saveHTML($result->item(0));
echo domNodeList_to_string($result);
function domNodeList_to_string($DomNodeList) {
$output = '';
$i = 0; // index into the node list; it was previously used without being initialized
$doc = new DOMDocument;
while ( $node = $DomNodeList->item($i) ) {
// import node
$domNode = $doc->importNode($node, true);
// append node
$doc->appendChild($domNode);
$i++;
}
$output = $doc->saveHTML();
$output = print_r($output, 1);
// I added this because xml output and ajax do not like each others
//$output = htmlspecialchars($output);
return $output;
}
?>
so if one has a query like that:
$result = $xpath->query("//div");
then will get the raw html string output:
<div class="introduction">
<p>A very short ebook to demonstrate the use of XPath.</p>
</div>
if the query is:
$result = $xpath->query("//p");
then output will be:
<p style="text-align:right">Written by Kovid Goyal</p><p>A very short ebook to demonstrate the use of XPath.</p><p>This is a truly fascinating chapter.</p><p>A worthy continuation of a fine tradition.</p>
Does anyone know a simpler (built into PHP) method to get the same result?

Try this:
$subject = 'some html code before
<div>this <b>is</b> what I want</div>
some html after';
$doc = new DOMDocument();
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div");
echo $doc->saveHTML($result->item(0)); //echoes what you want :)
The saveHTML function belongs to the DOMDocument object, you can't call it directly on the node (much less on the NodeList, which is what the query returns), but what you can do is pass it the node as a param.
Also, your query was wrong: what you want is the div element (i.e. //div), not its children (//div/*).

As per the PHP manual's documentation for DOMXPath::query, the function:
Returns a DOMNodeList containing all nodes matching the given XPath
expression. Any expression which does not return nodes will return an
empty DOMNodeList.
This means that $result in the following code will be a DOMNodeList object. So if you want to get individual HTML code out of it, you'll need to use the methods available on a DOMNodeList object. In this case, the item method:
$result = $xpath->query("//div");
echo $doc->saveHTML($result->item(0));
$result->item(0) returns the first DOMNode in the DOMNodeList created by your xpath query.
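If the query can match several nodes, the same idea extends to a loop. The helper below is a sketch, not a built-in: nodeListToHtml is a hypothetical name, and it assumes a PHP version where saveHTML() accepts a node argument (5.3.6 or later).

```php
<?php
// Hypothetical helper: serialize every node in a list with the owning document.
function nodeListToHtml(DOMDocument $doc, DOMNodeList $nodes): string
{
    $html = '';
    foreach ($nodes as $node) {
        // saveHTML($node) serializes just that node and its subtree.
        $html .= $doc->saveHTML($node);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<p>before</p><div>this <b>is</b> what I want</div><p>after</p>');
$xpath = new DOMXPath($doc);

$html = nodeListToHtml($doc, $xpath->query('//div'));
echo $html, "\n";
```

This avoids importing the nodes into a second DOMDocument just to serialize them.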

Try this :
$subject = 'some html code before<div>this <b>is</b> what I want</div>some html after';
$doc = new DOMDocument('1.0');
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div");
$docSave = new DOMDocument('1.0');
foreach ( $result as $node ) {
$domNode = $docSave->importNode($node, true);
$docSave->appendChild($domNode);
}
echo $docSave->saveHTML();

extract value from web page

Hi I have a website's home page that I am reading in using Curl and I need to grab the number of pages that the site has.
The information is in a div:-
<div class="pager">
<span class="page-numbers current">1</span>
<span class="page-numbers">2</span>
<span class="page-numbers">3</span>
<span class="page-numbers">4</span>
<span class="page-numbers">5</span>
<span class="page-numbers dots">…</span>
<span class="page-numbers">15</span>
<span class="page-numbers next"> next</span>
</div>
The value I need is 15 but this could be any number depending on the site but will always be in the same position.
How could I read this value easily and assign it to a variable in PHP.
Thanks
Jonathan

You can use PHP's DOM module for that. Read the page with DOMDocument::loadhtmlfile(), then create a DOMXPath object and query all span elements within the document having the class="page-numbers" attribute.
(edit: oops, that's not what you're looking for, see second code snippet)
$html = '<html><head><title>:::</title></head><body>
<div class="pager">
<span class="page-numbers current">1</span>
<span class="page-numbers">2</span>
<span class="page-numbers">3</span>
<span class="page-numbers">4</span>
<span class="page-numbers">5</span>
<span class="page-numbers dots">…</span>
<span class="page-numbers">15</span>
<span class="page-numbers next"> next</span>
</div>
</body></html>';
$doc = new DOMDocument;
// since the content "is already here" we use loadhtml(content)
// instead of loadhtmlfile(url)
$doc->loadhtml($html);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//span[@class="page-numbers"]');
echo 'there are ', $nodelist->length, ' span elements having class="page-numbers"';
edit: does this
<span class="page-numbers">15</span>
(the second last a element) always point to the last page, i.e. does this link contain the value you're looking for?
Then you can use a XPath expression that selects the second but last a element and from there its child span element.
//div[@class="pager"] <- select each <div> where the attribute class equals "pager"
//div[@class="pager"]/a <- select each <a> that is a direct child of the pager div
//div[@class="pager"]/a[position()=last()-1] <- select the <a> that is second but last
//div[@class="pager"]/a[position()=last()-1]/span <- select the direct child <span> of that second but last <a> element in the pager <div>
( you might want to fetch a good XPath tutorial ;-) )
$doc->loadhtml($html);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="pager"]/a[position()=last()-1]/span');
if ( 0 < $nodelist->length ) {
echo $nodelist->item(0)->nodeValue;
}
else {
echo 'not found';
}
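The expressions above assume each span sits inside an <a> element. If the markup is exactly as posted in the question (spans directly inside the pager div, no links), a sketch along these lines might work instead. It relies on the assumption that the last span whose class is exactly "page-numbers" holds the final page number.

```php
<?php
// Shortened copy of the pager from the question, with spans as direct children.
$html = '<div class="pager">
<span class="page-numbers current">1</span>
<span class="page-numbers">2</span>
<span class="page-numbers dots">...</span>
<span class="page-numbers">15</span>
<span class="page-numbers next"> next</span>
</div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// The exact-match predicate skips "current", "dots" and "next";
// [last()] then picks the final remaining span.
$lastPage = (int) $xpath->evaluate(
    'string(//div[@class="pager"]/span[@class="page-numbers"][last()])'
);

echo $lastPage, "\n";
```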

There is no direct function or easy way to do that. You need to build or use an existing HTML parser to do that.

You can parse it with a regular expression. First find all occurrences of <span class="page-numbers">, then select the last one:
// div html code should be in $div_html
preg_match_all('#<span class="page-numbers">(\d+)#', $div_html, $page_numbers);
print_r(end($page_numbers[1])); // prints 15

This is something you might want to use XPath for, which requires loading the page as a DOM document object:
$domDoc = new DOMDocument();
$domDoc->loadHTMLFile("http://path/to/yourfile.html");
$xp = new DOMXPath($domDoc);
$nodes = $xp->query("//xpath/to/relevant/node");
$value = $nodes->item(0)->nodeValue;
I haven't written a good xpath in a while, so you should do some reading to figure out that part, but it shouldn't be too difficult.

perhaps
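$nodes = $dom->getElementsByTagName("span");
$maxPageNum = 0;
foreach ($nodes as $node)
{
// getAttribute() reads the class; nodeValue holds the inner text
if ($node->getAttribute("class") == "page-numbers" && (int) $node->nodeValue > $maxPageNum)
{
$maxPageNum = (int) $node->nodeValue;
}
}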
I don't know PHP, so maybe it's not that easy to access the class/inner text of a dom node, but there must be some way to get that info and the pseudocode here should work.

Just wanted to say a huge thank you to Volkerk for helping out - it worked really well. I had to make a few slight changes and ended up with this:-
function getusers($userurl)
{
$sSourceData = file_get_contents($userurl);
$doc = new DOMDocument();
@$doc->loadHTML($sSourceData);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="pager"]/a[position()=last()-1]/span');
if ( 0 < $nodelist->length ) {
$lastpage = $nodelist->item(0)->nodeValue;
$users = $lastpage * 35;
$userurl = $userurl.'?page='.$lastpage;
$sSourceData = file_get_contents($userurl);
$doc = new DOMDocument();
@$doc->loadHTML($sSourceData);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="user-details"]');
$users = $users + $nodelist->length;
echo 'there are ', $users , ' users';
}
else {
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="user-details"]');
echo 'there are ', $nodelist->length, ' users';
}
}
