extract value from web page

extract value from web page - php

Hi, I have a website's home page that I am reading in using cURL, and I need to grab the number of pages that the site has.
The information is in a div:
<div class="pager">
<span class="page-numbers current">1</span>
<span class="page-numbers">2</span>
<span class="page-numbers">3</span>
<span class="page-numbers">4</span>
<span class="page-numbers">5</span>
<span class="page-numbers dots">…</span>
<span class="page-numbers">15</span>
<span class="page-numbers next"> next</span>
</div>
The value I need is 15. It could be any number depending on the site, but it will always be in the same position.
How could I read this value easily and assign it to a variable in PHP?
Thanks
Jonathan

You can use PHP's DOM extension for that. Read the page with DOMDocument::loadHTMLFile(), then create a DOMXPath object and query all span elements in the document whose class attribute is "page-numbers".
(edit: oops, that's not what you're looking for, see second code snippet)
$html = '<html><head><title>:::</title></head><body>
<div class="pager">
<span class="page-numbers current">1</span>
<span class="page-numbers">2</span>
<span class="page-numbers">3</span>
<span class="page-numbers">4</span>
<span class="page-numbers">5</span>
<span class="page-numbers dots">…</span>
<span class="page-numbers">15</span>
<span class="page-numbers next"> next</span>
</div>
</body></html>';
$doc = new DOMDocument;
// since the content "is already here" we use loadHTML(content)
// instead of loadHTMLFile(url)
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//span[@class="page-numbers"]');
echo 'there are ', $nodelist->length, ' span elements having class="page-numbers"';
edit: does this
<span class="page-numbers">15</span>
(the second-to-last a element) always point to the last page, i.e. does this link contain the value you're looking for?
Then you can use an XPath expression that selects the second-to-last a element and, from there, its child span element.
//div[@class="pager"] <- select each <div> where the attribute class equals "pager"
//div[@class="pager"]/a <- select each <a> that is a direct child of the pager div
//div[@class="pager"]/a[position()=last()-1] <- select the <a> that is second-to-last
//div[@class="pager"]/a[position()=last()-1]/span <- select the direct child <span> of that second-to-last <a> element in the pager <div>
( you might want to fetch a good XPath tutorial ;-) )
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="pager"]/a[position()=last()-1]/span');
if ( 0 < $nodelist->length ) {
echo $nodelist->item(0)->nodeValue;
}
else {
echo 'not found';
}
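The sample markup as posted shows the spans sitting directly inside the pager div, with no <a> wrappers (the links may have been lost when the markup was pasted). A minimal sketch under that assumption: select the last span whose class attribute is exactly "page-numbers". The variable names and the shortened sample markup are illustrative only.

```php
<?php
// Assumption: the spans are direct children of the pager div, as in the posted sample.
$html = '<html><body>
<div class="pager">
<span class="page-numbers current">1</span>
<span class="page-numbers">2</span>
<span class="page-numbers">3</span>
<span class="page-numbers dots">...</span>
<span class="page-numbers">15</span>
<span class="page-numbers next"> next</span>
</div>
</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// [@class="page-numbers"] compares the whole attribute value, so the
// "current", "dots" and "next" spans are skipped; [last()] keeps the final match.
$nodes = $xpath->query('//div[@class="pager"]/span[@class="page-numbers"][last()]');
$lastPage = $nodes->length > 0 ? (int) trim($nodes->item(0)->nodeValue) : null;

var_dump($lastPage); // int(15)
```

If the real page does wrap each span in a link, the a-based expression above is the one to use instead.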

There is no direct function or easy way to do that. You need to build or use an existing HTML parser to do that.

You can parse it with a regular expression. First find all occurrences of <span class="page-numbers">, then select the last one:
// div html code should be in $div_html
preg_match_all('#<span class="page-numbers">(\d+)#', $div_html, $page_numbers);
print_r(end($page_numbers[1])); // prints 15

This is something you might want to use XPath for, which requires loading the page as a DOMDocument object:
$domDoc = new DOMDocument();
$domDoc->loadHTMLFile("http://path/to/yourfile.html");
$xp = new DOMXPath($domDoc);
$nodes = $xp->query("//xpath/to/relevant/node");
$value = $nodes->item(0)->nodeValue;
I haven't written a good xpath in a while, so you should do some reading to figure out that part, but it shouldn't be too difficult.

Perhaps:
$nodes = $dom->getElementsByTagName("span");
$maxPageNum = 0;
foreach ($nodes as $node)
{
// compare the class attribute and the numeric value of the span's text
if ($node->getAttribute("class") == "page-numbers" && (int) $node->nodeValue > $maxPageNum)
{
$maxPageNum = (int) $node->nodeValue;
}
}
I don't know PHP well, but a DOM node exposes its class through getAttribute("class") and its inner text through nodeValue, so the loop above should work.

Just wanted to say a huge thank you to Volkerk for helping out - it worked really well. I had to make a few slight changes and ended up with this:-
function getusers($userurl)
{
$sSourceData = file_get_contents($userurl);
$doc = new DOMDocument();
@$doc->loadHTML($sSourceData);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="pager"]/a[position()=last()-1]/span');
if ( 0 < $nodelist->length ) {
$lastpage = $nodelist->item(0)->nodeValue;
$users = $lastpage * 35;
$userurl = $userurl.'?page='.$lastpage;
$sSourceData = file_get_contents($userurl);
$doc = new DOMDocument();
@$doc->loadHTML($sSourceData);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="user-details"]');
$users = $users + $nodelist->length;
echo 'there are ', $users , ' users';
}
else {
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="user-details"]');
echo 'there are ', $nodelist->length, ' users';
}
}
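Real-world pages are rarely valid HTML, so loadHTML() tends to emit warnings. A minimal sketch of buffering those warnings with libxml_use_internal_errors() rather than suppressing them at the call site; the helper name loadHtmlQuietly and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: parse possibly malformed HTML without emitting warnings.
function loadHtmlQuietly(string $source): DOMDocument
{
    $previous = libxml_use_internal_errors(true); // buffer libxml warnings
    $doc = new DOMDocument();
    $doc->loadHTML($source);
    libxml_clear_errors();                        // discard the buffered warnings
    libxml_use_internal_errors($previous);        // restore the earlier setting
    return $doc;
}

// <foo> is not an HTML element, which libxml may report as a parser warning.
$doc = loadHtmlQuietly('<div class="user-details"><foo>a</foo></div><div class="user-details">b</div>');
$xpath = new DOMXPath($doc);
$count = $xpath->query('//div[@class="user-details"]')->length;
echo $count, "\n";
```

Restoring the previous setting keeps the helper from changing error handling for the rest of the script.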

Related

Get content and class attribute value from child nodes of the DOM

First of all, sorry for my bad English.
This is my simple cURL result:
<li class="result">
<div class="song_info">
<span class="artist_name">art1</span>
<span class="song_name">name1</span>
<span class="views">100 time</span>
</div>
</li>
//again
<li class="result">
<div class="song_info">
<span class="artist_name">art2</span>
<span class="song_name">name2</span>
<span class="views">200 time</span>
</div>
</li>
and many more like that.
I used this code to extract the values from the HTML:
$classname = 'song_info';
$dom = new DOMDocument;
$dom->loadHTML($html); // my html result .
$xpath = new DOMXPath($dom);
$get = $xpath->query("//*[@class='" . $classname . "']");
$text = $get->item(0)->nodeValue;
echo $text;
This code gives me only the first result:
art1
name1
100time
I want to get all the results (preferably as JSON).
Can anyone help me?

The DOMXPath::query() method returns a DOMNodeList. It implements the Traversable interface, so you can loop through it with foreach. Rename the $get variable to $nodes, so that the name says what is stored in it. Then:
foreach ($nodes as $curNode) {
$childNodes = $curNode->childNodes;
foreach ($childNodes as $curChildNode) {
// use $curChildNode->textContent to get content
// and $curChildNode->getAttribute('class') to get class name
}
}
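The loop above can be extended to build the JSON the question asks for. A minimal sketch, assuming each song_info div holds only the three spans shown in the question; the <ul> wrapper and the $songs and $json names are illustrative.

```php
<?php
$html = '<ul>
<li class="result"><div class="song_info">
<span class="artist_name">art1</span>
<span class="song_name">name1</span>
<span class="views">100 time</span>
</div></li>
<li class="result"><div class="song_info">
<span class="artist_name">art2</span>
<span class="song_name">name2</span>
<span class="views">200 time</span>
</div></li>
</ul>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$songs = [];
foreach ($xpath->query('//div[@class="song_info"]') as $info) {
    $song = [];
    // "span" is evaluated relative to $info because it is passed as the context node
    foreach ($xpath->query('span', $info) as $span) {
        // use the class name as the key and the trimmed text as the value
        $song[$span->getAttribute('class')] = trim($span->textContent);
    }
    $songs[] = $song;
}

$json = json_encode($songs);
echo $json, "\n";
```

Querying the span elements directly avoids the whitespace text nodes that childNodes would also return.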

I found my answer:
$text = $get->item(0)->nodeValue; // gives the first result
$text = $get->item(1)->nodeValue; // gives the second result
I wrote a loop and received all the results.

Fetching value of specific text node using DOMXPath

From the following structure:
I'm trying to fetch the marked text with the following code:
$price_new='div/div[@class="cat_price"]/text()';
if ($price_new!=null && $node = $Website_Xpath->query ($price_new, $row )) {
$result [$value] ['Price'] = $node->item( 0 )->nodeValue;
} else {
$result [$value] ['Price'] = "";
}
but the node value is NULL. How do I fetch the number correctly?

You should provide the actual snippet, not just a screenshot of it. If I interpreted the screenshot correctly the snippet is something like:
$xml = <<<'XML'
<body>
<div class="cat_price">
<div class="was">67,000 - PKR</div>
"
64,9999"<span> - PKR</span>
</div>
</body>
XML;
The text node with the price is the following sibling of the div with the class was. So it is possible to fetch it using that axis:
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
$expression = 'string(//div[@class="cat_price"]
/div[@class="was"]/following-sibling::text()[1])';
var_dump($xpath->evaluate($expression));
Unlike DOMXpath::query(), DOMXpath::evaluate() can return scalar values depending on the expression. A string cast or a string function will return a string.
string(25) "
"
64,9999""
However, the result contains not only the number but also the quotes and some whitespace. translate() and normalize-space() can be used to clean it up:
$expression = 'normalize-space(
translate(//div[@class="cat_price"]
/div[@class="was"]/following-sibling::text()[1], \'"\', " ")
)';
var_dump($xpath->evaluate($expression));
Output:
string(7) "64,9999"

Your $Website_Xpath looks like a DOMXPath object. The main issue with your code is then the XPath expression: 'div/div[@class="cat_price"]/text()'. You are trying to fetch a div from nowhere. Either provide the full path from the root node (e.g. /html/body/div), or select all divs with the // prefix.
Example
$xml = <<<'XML'
<body>
<div class="cat_price">
<div class="was">67,000 - PKR</div>
64,9999<span> - PKR</span>
</div>
</body>
XML;
$doc = new DOMDocument();
$doc->loadXML($xml);
$text = '';
$xpath = new DOMXPath($doc);
// Select all text nodes within a <div> having class="cat_price"
if ($nodes = $xpath->query('//div[@class="cat_price"]/text()')) {
// Search for a node with some content, except spaces
foreach ($nodes as $n) {
if ($text = trim($n->nodeValue))
break;
}
}
var_dump($text);
Output
string(7) "64,9999"

creating preg_match using xpath in php

I am trying to get the contents using XPATH in php.
<div class='post-body entry-content' id='post-body-37'>
<div style="text-align: left;">
<div style="text-align: center;">
Hi
</div></div></div>
I am using below php code to get the output.
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$xpath->registerPhpFunctions('preg_match');
$regex = 'post-(content|[a-z]+)';
$items = $xpath->query("div[ php:functionString('preg_match', '$regex', @class) > 0]");
dd($items);
It returns output as below
DOMNodeList {#580
+length: 0
}

Here is a working version that applies the advice given in the comments:
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// you need to register the namespace "php" to make it available in the query
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPhpFunctions('preg_match');
// add delimiters to your pattern
$regex = '~post-(content|[a-z]+)~';
// search your node anywhere in the DOM tree with "//"
$items = $xpath->query("//div[php:functionString('preg_match', '$regex', @class)>0]");
var_dump($items);
Obviously, this kind of pattern is pointless, since you can get the same result with built-in XPath string functions such as contains().

For a simple task like this (getting the div nodes whose class attribute starts with post- and contains content), you should use plain XPath queries:
$xp->query('//div[starts-with(@class,"post-") and contains(@class, "content")]');
Here,
- //div - get all divs that...
- starts-with(@class,"post-") - have a "class" attribute starting with "post-"
- and - and...
- contains(@class, "content") - contain the "content" substring in the class attribute value.
To use php:functionString you need to register the php namespace (with $xpath->registerNamespace("php", "http://php.net/xpath");) and the PHP functions (to register them all, use $xp->registerPHPFunctions();).
For complex scenarios, when you need to analyze the values even deeper, you may want to create and register your own functions:
function example($attr) {
return preg_match('/post-(content|[a-z]+)/i', $attr) > 0;
}
and then inside XPath:
$divs = $xp->query("//div[php:functionString('example', @class)]");
Here, functionString passes the string contents of the @class attribute to the example function, not the object (as would be the case with php:function).
See IDEONE demo:
function example($attr) {
return preg_match('/post-(content|[a-z]+)/i', $attr) > 0;
}
$html = <<<HTML
<body>
<div class='post-body entry-content' id='post-body-37'>
<div style="text-align: left;">
<div style="text-align: center;">
Hi
</div></div></div>
</body>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPHPFunctions('example');
$divs = $xp->query("//div[php:functionString('example', @class)]");
foreach ($divs as $div) {
echo $div->nodeValue;
}
See also a nice article about using PHP functions inside XPath: Using PHP Functions in XPath Expressions.
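The difference between php:functionString and php:function mentioned above can be sketched as follows. With php:function, a node-set argument reaches the callback as an array of DOM nodes instead of a string. The callback name classStartsWithPost and the sample markup are hypothetical.

```php
<?php
// Hypothetical callback: with php:function, @class arrives as an array of DOMAttr nodes.
function classStartsWithPost($attrs)
{
    // The array is empty when the div has no class attribute.
    return count($attrs) > 0 && strpos($attrs[0]->value, 'post-') === 0;
}

$html = '<body><div class="post-body entry-content">Hi</div><div class="sidebar">Other</div></body>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$xp->registerNamespace('php', 'http://php.net/xpath');
$xp->registerPhpFunctions('classStartsWithPost');

$divs = $xp->query("//div[php:function('classStartsWithPost', @class)]");
$texts = [];
foreach ($divs as $div) {
    $texts[] = $div->nodeValue;
}
print_r($texts);
```

Receiving the node itself is useful when the callback needs more than the text, for example the owning element or sibling attributes.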

Getting text content with xpath

I have some HTML like this:
<dd class="price">
<sup class="symbol">$</sup><span class="dollars">58</span><sup class="cents">.00</sup>
</dd>
What's the xpath to get $58.00 back as one string?
I'm using PHP:
$xpath = '?????';
$result = $xml->xpath($xpath);
echo $result[0]; // want this to show $58.00, possible?

These are valid in your case; for more detail, check the links below:
$html = '<dd class="price">
<sup class="symbol">$</sup><span class="dollars">58</span><sup class="cents">.00</sup>
</dd>';
$dom = new DOMDocument();
$dom->loadXML($html);
$xpt = new DOMXpath($dom);
foreach ($xpt->query('//dd[@class="price"]') as $node) {
// outputs: $58.00
echo trim($node->nodeValue);
}
// or
$xml = new SimpleXMLElement($html);
$res1 = $xml->xpath('//dd[@class="price"]/sup');
$res2 = $xml->xpath('//dd[@class="price"]/span');
// outputs: $58.00
printf('%s%s%s', (string) $res1[0], (string) $res2[0], (string) $res1[1]);
DOMDocument
DOMXPath
SimpleXMLElement

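data() will return all contents inside the current context (note that data() is an XPath 2.0 function, while PHP's DOMXPath and SimpleXML support XPath 1.0 only). Try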
//dd/data()

You haven't shown us your code so I don't know what platform you're using. If you have something that can evaluate non-node XPath expressions, then you can use this:
string(//dd[@class = 'price'])
if not, you can select the node,
//dd[@class = 'price']
and the API you're using should have a way of getting the inner text value of the selected node.
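In PHP, DOMXPath::evaluate() is one way to run such a non-node expression. A minimal sketch using the markup from the question; wrapping the call in normalize-space() is an added assumption to trim the line breaks inside the dd element.

```php
<?php
$html = '<dd class="price">
<sup class="symbol">$</sup><span class="dollars">58</span><sup class="cents">.00</sup>
</dd>';

$dom = new DOMDocument();
$dom->loadXML($html);
$xpath = new DOMXPath($dom);

// string() yields the concatenated text of the first matching node;
// normalize-space() strips the leading and trailing line breaks.
$price = $xpath->evaluate('normalize-space(string(//dd[@class="price"]))');
var_dump($price); // string(6) "$58.00"
```

SimpleXMLElement::xpath() only returns node lists, so a string() expression like this needs DOMXPath.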

Traversing child nodes with PHP DOMXpath?

I'm having some trouble understanding what exactly is stored in childNodes. Ideally I'd like to do another xquery on each of the child nodes, but can't seem to get it straight. Here's my scenario:
Data:
<div class="something">
<h3>
Link text 1
</h3>
<div class="somethingelse">Something else text 1</div>
</div>
<div class="something">
<h3>
Link text 2
</h3>
<div class="somethingelse">Something else text 2</div>
</div>
<div class="something">
<h3>
Link text 3
</h3>
<div class="somethingelse">Something else text 3</div>
</div>
And the code:
$html = new DOMDocument();
$html->loadHtmlFile($local_file);
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@class='something']");
foreach ($nodelist as $n) {
// Can I run another query here?
}
For each element of "something" (i.e., $n) I want to access the values of the two pieces of text and the href. I tried using childNode and another xquery but couldn't get anything to work. Any help would be greatly appreciated!

Yes, you can run another XPath query, something like this:
foreach ($nodelist as $n)
{
$other_nodes = $xpath->query('div[@class="somethingelse"]', $n);
echo $other_nodes->length;
}
This will get you the inner div with the class somethingelse. The second argument of $xpath->query() tells the query to use that node as its context; see http://fr2.php.net/manual/en/domxpath.query.php for more.
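The effect of the context node can be sketched by comparing a relative expression with one that starts with //. The markup below is a shortened version of the question's data and the $rows array is illustrative. An expression beginning with // searches the whole document even when a context node is passed.

```php
<?php
$html = '<body>
<div class="something"><h3>Link text 1</h3><div class="somethingelse">Something else text 1</div></div>
<div class="something"><h3>Link text 2</h3><div class="somethingelse">Something else text 2</div></div>
</body>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$rows = [];
foreach ($xpath->query('//div[@class="something"]') as $n) {
    // Relative path: only children of $n are considered.
    $relative = $xpath->query('div[@class="somethingelse"]', $n);
    // Leading //: starts at the document root, so every matching div is found.
    $absolute = $xpath->query('//div[@class="somethingelse"]', $n);
    $rows[] = [
        'title'    => trim($xpath->query('h3', $n)->item(0)->nodeValue),
        'relative' => $relative->length,
        'absolute' => $absolute->length,
    ];
}
print_r($rows);
```

To search all descendants of the context node rather than just its children, start the expression with .// or descendant::.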

If I understand your question correctly, it worked when I used the descendant:: expression. Try this:
foreach ($nodelist as $n) {
$other_nodes = $xpath->query('descendant::div[@class="some-descendant"]', $n);
echo $other_nodes->length;
echo $other_nodes->item(0)->nodeValue;
}
Sometimes, though, it is enough to combine queries using the // path expression to narrow your search. At the start of an expression, // selects matching nodes anywhere in the document; between two steps, it selects matching descendants of the preceding step.
$nodes = $xpath->query('//div[@class="some-descendant"]//div[@class="some-descendant-of-that-descendant"]');
Then loop through those for the stuff you need. Hope this helps.

Trexx had it but he missed the last sentence of the question:
foreach ($nodelist as $n){
$href = $xpath->query('h3/a', $n)->item(0)->getAttribute('href');
$a_text = $xpath->query('h3/a', $n)->item(0)->nodeValue;
$div_text = $xpath->query('div', $n)->item(0)->nodeValue;
}

Here is a code snippet that allows you to access the information contained within each of the nodes with class attribute "something":
$nodes_tracker = 0;
$nodes_array = array();
foreach($nodelist as $n){
$info = $xpath->query('//h3//a', $n)->item($nodes_tracker)->nodeValue;
$extra_info = $xpath->query('//div[@class="somethingelse"]', $n)->item($nodes_tracker)->nodeValue;
array_push($nodes_array, $info. ' - '. $extra_info . '<br>'); //Add each info to array
$nodes_tracker++;
}
print_r($nodes_array);
