why are these spans not getting treated as nodes by domdocument()? - php

The result of the following DOMDocument code
<?php
$html = <<<EOT
<div class="list_item">
<div class="list_item_content">
<div class="list_item_title">
<a href="/link/goes/here">
INFO<br />
<span class="part2">More Info</span><br />
<span class="part3">Etc.</span>
</a>
</div>
</div>
EOT;
libxml_use_internal_errors(false);
$dom = new DOMDocument();
$dom->loadhtml($html);
$xpath = new DOMXPath($dom);
$titles_nodeList = $xpath->query('//div[@class="list_item"]/div[@class="list_item_content"]/div[@class="list_item_title"]/a');
foreach ($titles_nodeList as $title) {
$titles[] = $title->nodeValue;
}
echo("<pre>");
print_r($titles);
echo("</pre>");
?>
is
Array
(
[0] =>
INFOMore InfoEtc.
)
Why are data in these two spans inside the a element included in the result, when I am not specifying these spans in the path? I am interested only in retrieving data contained in the a element directly, not information contained in the spans inside the a element. I am wondering what I am doing wrong.

Try this xpath:
//div[@class="list_item"]/div[@class="list_item_content"]/div[@class="list_item_title"]/a/child::text()
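A minimal sketch of that expression in use, assuming a shortened, single-line version of the markup from the question. The `trim()` filter is an addition here, to skip whitespace-only text nodes that appear when the markup is spread over several lines:

```php
<?php
// Shortened copy of the question's markup (assumption: collapsed onto one line).
$html = '<div class="list_item_title"><a href="/link/goes/here">INFO<br />'
      . '<span class="part2">More Info</span><br />'
      . '<span class="part3">Etc.</span></a></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// child::text() selects only the text nodes that are direct children of <a>,
// so the text inside the nested <span> elements is not included.
$direct = [];
foreach ($xpath->query('//div[@class="list_item_title"]/a/child::text()') as $textNode) {
    $text = trim($textNode->nodeValue);
    if ($text !== '') {
        $direct[] = $text;
    }
}

print_r($direct);
```

By contrast, `nodeValue` on the a element itself returns the concatenated text of all its descendants, which is why the span text showed up in the original result.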

The nodes are there, but you are viewing them in HTML mode in a browser. Try viewing the page source, and/or doing:
echo("<pre>");
echo htmlspecialchars(print_r($titles, true));
echo("</pre>");
instead, which will encode the <> into &lt;&gt; and make them "visible".
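For comparison, a small sketch of the difference between a node's text and its markup, using a shortened version of the question's markup (an assumption). `nodeValue` yields only the concatenated text, while `DOMDocument::saveHTML()` called with a node yields that node's markup, which can then be escaped for display:

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<a href="/link/goes/here">INFO<br><span class="part2">More Info</span></a>');

$a = $dom->getElementsByTagName('a')->item(0);

// Text content of <a> and all of its descendants, with the tags dropped.
$textOnly = $a->nodeValue;

// Serialized markup of the <a> element, including the nested <span>.
$markup = $dom->saveHTML($a);

echo "<pre>";
echo htmlspecialchars($textOnly), "\n";
echo htmlspecialchars($markup);
echo "</pre>";
```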

Related

How can I get all <a> elements with a class name in PHP using xpath and add a <p> tag under each of those elements?

I tried to use xpath to get all a elements with the class name 'post-title' in PHP, but it's not working. I should mention that the a elements are being outputted by a shortcode, so I'm not even sure if xpath would work in this scenario. I'm not sure why but it's returning a DOMNodeList object of length 0 in $items. I also want to get the post_title which is the text within the a element so I'm using $post_title = $item->nodeValue, but I'm not sure if that would work. Any help would be most appreciated.
function add_rating_below_search_result_post_title(){
$dom = new DomDocument;
$dom->loadHTMLFile("example.com");
$xpath = new DomXPath($dom);
$items = $xpath->query("//a[@class='post-title']");
foreach ($items as $item) {
$post_title = $item->nodeValue;
$post_id = get_page_by_title($post_title, OBJECT, 'post');
$rating_content = get_rating($post_id);
$doc = new DOMDocument();
$rating = $doc->createElement("p", $rating_content);
$item->appendChild($rating);
}
}
This is the sample HTML for one item. There are multiple items like this in the HTML. The a element I want is on the 6th line from the bottom. I need to get all a elements that match this pattern for each of the items.
<div class="cl-layout__item cl-layout__item--id-1234">
<div class="cl-layout__item-spacing">
<div class="cl-template cl-template--post cl-template--id-1234 cl-template--image-top">
<div class="cl-element cl-element-featured_media cl-element--instance-1234 cl-element-featured_media--sizing-natural">
<a class="cl-element-featured_media__anchor" href="https://www.example.com/item1/"><img data-pin-title="Item Search" class="cl-element-featured_media__image" src="https://cdn.shortpixel.ai/spai/w_300+q_lossless+ret_img+to_webp/https://www.example.com/wp-content/uploads/2020/07/Item1-300x300.jpg" data-spai="1" alt="Item 1" data-pin-nopin="true" data-spai-upd="339">
<noscript data-spai="1"><img data-pin-title="Item Search" class="cl-element-featured_media__image" src="https://cdn.shortpixel.ai/spai/q_lossless+ret_img/https://www.example.com/wp-content/uploads/2020/07/Item1-300x300.jpg" data-spai-egr="1" alt="Item 1" />
</noscript>
</a>
</div>
<div class="cl-element cl-element-section cl-element--instance-1222 ">
<h4 class="cl-element cl-element-title cl-element--instance-1333 ">
<a class="post-title" href="https://www.example.com/item1/">Item 1 post title</a>
</h4>
</div>
</div>
</div>
</div>
If you run the following "as-is" first you ought to see a representation of the modified document in the textarea and then, if you enable the second URL and run it ( without scheme ) there is an issue.
This is substantially the same as your code - without seeing the proper url it is hard to say where the issue lies as you have two undeclared and unknown methods get_page_by_title and get_rating
There is, as you can see from the below code, no need to invoke a new DOMDocument object in the foreach loop to simply add a new node
<?php
$url='https://www.php.net/manual/en/class.domdocument.php';
#$url='php.net/manual/en/class.domdocument.php';
libxml_use_internal_errors( true );
$dom=new DOMDocument;
$dom->validateOnParse=false;
$dom->recover=true;
$dom->strictErrorChecking=false;
$dom->loadHTMLFile( $url );
libxml_clear_errors();
$xp=new DOMXPath( $dom );
$expr='//a[@class="methodname"]';
$col=$xp->query( $expr );
if( $col->length > 0 ){
foreach( $col as $node ){
$linktext=$node->nodeValue;
$p=$dom->createElement('p','------BANANA BANANA BANANA-------');
$node->appendChild($p);
}
printf('<textarea cols=100 rows=20>%s</textarea>',$dom->saveHTML() );
}
?>
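The same idea can be tried without a network request. This is only a sketch against one line of the sample HTML from the question: the rating text is a placeholder (get_rating is not available here), and placing the p as the next sibling of the a element, rather than inside it, is one reading of "under each of those elements":

```php
<?php
// One <a class="post-title"> taken from the sample HTML in the question.
$html = '<h4 class="cl-element"><a class="post-title" href="https://www.example.com/item1/">Item 1 post title</a></h4>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);
foreach ($xp->query('//a[@class="post-title"]') as $a) {
    // Create the new node with the SAME document that owns $a.
    // A node created by a different DOMDocument cannot be appended directly.
    $p = $dom->createElement('p');
    $p->appendChild($dom->createTextNode('rating placeholder'));

    // Insert <p> right after the <a>; a null reference node appends at the end.
    $a->parentNode->insertBefore($p, $a->nextSibling);
}

echo $dom->saveHTML();
```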

PHP to get html source, then parse values within certain DIV tags

I can get the source code fine, but I now want to be able to get the data from within a specific div:
$html = file_get_contents('http://www.website.com');
say $html contains:
<div class="productData">
<div class="productDescription">Here is the product description</div>
<div class="productPrice">1.99</div>
</div>
I want to be able to return the data within the productData div, and do this for all occurrences?
Thank you.
Use the DOMDocument class, combined with DOMXPath, something like this:
$url = 'http://www.website.com/';
$dom = new DOMDocument();
// loadHTMLFile() uses the HTML parser; load() expects well-formed XML
$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//*[contains(@class, 'productData')]");
foreach ($nodes as $node) {
    // do something
}
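As a sketch of what the "do something" step might look like, the following runs against the sample markup from the question instead of a live URL. The array keys `description` and `price` are names chosen here for illustration:

```php
<?php
$html = '<div class="productData">
<div class="productDescription">Here is the product description</div>
<div class="productPrice">1.99</div>
</div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$products = [];

foreach ($xpath->query('//div[@class="productData"]') as $product) {
    // The second argument makes the query relative to the current product node.
    $desc  = $xpath->query('./div[@class="productDescription"]', $product)->item(0);
    $price = $xpath->query('./div[@class="productPrice"]', $product)->item(0);

    $products[] = [
        'description' => $desc ? trim($desc->nodeValue) : null,
        'price'       => $price ? (float) $price->nodeValue : null,
    ];
}

print_r($products);
```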

How to cut off a portion of a html inside <div> and store it as html string by using xpath and domdocument?

I would like to cut out a portion of HTML. I can select it using XPath and DOMDocument, but the problem is that I need the result as an HTML code string. Normally I would use a regular expression for that, but I would rather not write a complicated search pattern that matches the beginning and the end of the tag.
That's the example input:
some html code before
<div>this <b>is</b> what I want</div>
some html after
and the output:
<div>this <b>is</b> what I want</div>
I tried something like this:
$subject = 'some html code before
<div>this <b>is</b> what I want</div>
some html after';
$doc = new DOMDocument();
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div/*");
echo $result->saveHTML();
but I only got an error:
Call to undefined method DOMNodeList::saveHTML()
Does anyone know how to get the result as a html string by using DomDocument and XPath?
Thank you, gentlemen, for pointing out my misunderstanding about calling methods that are not available on a child object. But the line:
echo $doc->saveHTML($result->item(0));
generates only a warning (without the HTML string I want to have). Luckily I found another solution, and here it is:
<?php
$subject = '<html>
<head>
<title>A very short ebook</title>
<meta name="charset" value="utf-8" />
</head>
<body>
<h1 class="bookTitle">A very short ebook</h1>
<p style="text-align:right">Written by Kovid Goyal</p>
<div class="introduction">
<p>A very short ebook to demonstrate the use of XPath.</p>
</div>
<h2 class="chapter">Chapter One</h2>
<p>This is a truly fascinating chapter.</p>
<h2 class="chapter">Chapter Two</h2>
<p>A worthy continuation of a fine tradition.</p>
</body>
</html>';
$doc = new DOMDocument();
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div");
//echo $doc->saveHTML($result->item(0));
echo domNodeList_to_string($result);
function domNodeList_to_string($DomNodeList) {
$output = '';
$i = 0; // index into the node list; must be initialised before the loop
$doc = new DOMDocument;
while ( $node = $DomNodeList->item($i) ) {
// import node
$domNode = $doc->importNode($node, true);
// append node
$doc->appendChild($domNode);
$i++;
}
$output = $doc->saveHTML();
$output = print_r($output, 1);
// I added this because xml output and ajax do not like each others
//$output = htmlspecialchars($output);
return $output;
}
?>
so if one has a query like that:
$result = $xpath->query("//div");
then will get the raw html string output:
<div class="introduction">
<p>A very short ebook to demonstrate the use of XPath.</p>
</div>
if the query is:
$result = $xpath->query("//p");
then output will be:
<p style="text-align:right">Written by Kovid Goyal</p><p>A very short ebook to demonstrate the use of XPath.</p><p>This is a truly fascinating chapter.</p><p>A worthy continuation of a fine tradition.</p>
Does anyone know a simpler method (built into PHP) to get the same result?
Try this:
$subject = 'some html code before
<div>this <b>is</b> what I want</div>
some html after';
$doc = new DOMDocument();
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div");
echo $doc->saveHTML($result->item(0)); //echoes what you want :)
The saveHTML function belongs to the DOMDocument object, you can't call it directly on the node (much less on the NodeList, which is what the query returns), but what you can do is pass it the node as a param.
Also, your query was wrong: what you want is the div element (i.e. //div), not its children (//div/*).
As per the PHP manual docs on DOMXPath::query, the function:
Returns a DOMNodeList containing all nodes matching the given XPath
expression. Any expression which does not return nodes will return an
empty DOMNodeList.
This means that the $result in the following code will be a DOMNodeList object. So if you want to get individual HTML code out from inside it you'll need to use methods available with a DOMNodeList object. In this case, the item method:
$result = $xpath->query("//div");
echo $doc->saveHTML($result->item(0));
$result->item(0) returns the first DOMNode in the DOMNodeList created by your xpath query.
Try this :
$subject = 'some html code before<div>this <b>is</b> what I want</div>some html after';
$doc = new DOMDocument('1.0');
$doc->loadHTML($subject);
$xpath = new DOMXpath($doc);
$result = $xpath->query("//div");
$docSave = new DOMDocument('1.0');
foreach ( $result as $node ) {
$domNode = $docSave->importNode($node, true);
$docSave->appendChild($domNode);
}
echo $docSave->saveHTML();
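The answers above can be combined into one small helper. This is a sketch rather than a built-in: the function name nodeListToHtml is made up here, and it relies on DOMDocument::saveHTML() accepting a node argument (available since PHP 5.3.6):

```php
<?php
/**
 * Concatenate the outer HTML of every node in a DOMNodeList.
 * (Hypothetical helper name; not part of the DOM extension.)
 */
function nodeListToHtml(DOMNodeList $nodes): string
{
    $html = '';
    foreach ($nodes as $node) {
        // Each node is serialized by the document that owns it.
        $html .= $node->ownerDocument->saveHTML($node);
    }
    return $html;
}

$subject = 'some html code before<div>this <b>is</b> what I want</div>some html after';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($subject);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$out = nodeListToHtml($xpath->query('//div'));

echo $out;
```

Unlike the importNode approach, this does not create a second document, so no copies of the nodes are made.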

why does a <script> tag stop domdocument() parsing?

In the following code, the seemingly innocuous introduction of a script tag containing an empty div causes parsing to fail. (Using an empty script tag causes no problem.) $html1 gets parsed properly, retrieving the values of the two spans:
Array
(
[0] => test1
[1] => test2
)
whereas $html2 does not get parsed properly, retrieving only the span preceding the script tag:
Array
(
[0] => test1
)
Why does this happen? With errors turned on I get two errors, "Unexpected end tag : script" and "Unexpected end tag : div" but I do not know why these are unexpected.
<?php
$html1 = <<<EOT
<div class="productList">
<span>test1</span>
<div></div>
<span>test2</span>
</div>
EOT;
$html2 = <<<EOT
<div class="productList">
<span>test1</span>
<script>
<div></div>
</script>
<span>test2</span>
</div>
EOT;
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadhtml($html1);
$xpath = new DOMXPath($dom);
$titles_nodeList = $xpath->query('//div[@class="productList"]/span');
foreach ($titles_nodeList as $title) {
$titles[] = $title->nodeValue;
}
echo("<p>titles without script tag and div</p>");
echo("<pre>");
print_r($titles);
echo("</pre>");
unset($titles);
$dom->loadhtml($html2);
$xpath = new DOMXPath($dom);
$titles_nodeList = $xpath->query('//div[@class="productList"]/span');
foreach ($titles_nodeList as $title) {
$titles[] = $title->nodeValue;
}
echo("<p>titles with script tag and div</p>");
echo("<pre>");
print_r($titles);
echo("</pre>");
?>
A div doesn't belong inside a script tag. Javascript belongs inside a script tag.
Take the div out of the script tag and it should be fine.
The trick is simple: change loadHTML to loadXML, with one condition:
the HTML string must always be well-formed.
$dom->loadXML($html2);
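A further workaround, not taken from the answers above and offered only as a sketch: if the script contents are not needed, the script blocks can be removed before the string is handed to loadHTML. The regular expression is a simplification and assumes script blocks are not nested and do not contain the literal text </script> inside strings:

```php
<?php
$html2 = '<div class="productList">
<span>test1</span>
<script>
<div></div>
</script>
<span>test2</span>
</div>';

// Drop every <script>...</script> block (case-insensitive, across newlines).
$withoutScripts = preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html2);

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($withoutScripts);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$titles = [];
foreach ($xpath->query('//div[@class="productList"]/span') as $span) {
    $titles[] = $span->nodeValue;
}

print_r($titles);
```

With the script block gone, both spans remain direct children of the productList div, so the query finds them.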

extract value from web page

Hi I have a website's home page that I am reading in using Curl and I need to grab the number of pages that the site has.
The information is in a div:-
<div class="pager">
<span class="page-numbers current">1</span>
<span class="page-numbers">2</span>
<span class="page-numbers">3</span>
<span class="page-numbers">4</span>
<span class="page-numbers">5</span>
<span class="page-numbers dots">…</span>
<span class="page-numbers">15</span>
<span class="page-numbers next"> next</span>
</div>
The value I need is 15 but this could be any number depending on the site but will always be in the same position.
How could I read this value easily and assign it to a variable in PHP.
Thanks
Jonathan
You can use PHP's DOM module for that. Read the page with DOMDocument::loadhtmlfile(), then create a DOMXPath object and query all span elements within the document having the class="page-numbers" attribute.
(edit: oops, that's not what you're looking for, see second code snippet)
$html = '<html><head><title>:::</title></head><body>
<div class="pager">
<span class="page-numbers current">1</span>
<span class="page-numbers">2</span>
<span class="page-numbers">3</span>
<span class="page-numbers">4</span>
<span class="page-numbers">5</span>
<span class="page-numbers dots">…</span>
<span class="page-numbers">15</span>
<span class="page-numbers next"> next</span>
</div>
</body></html>';
$doc = new DOMDocument;
// since the content "is already here" we use loadhtml(content)
// instead of loadhtmlfile(url)
$doc->loadhtml($html);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//span[@class="page-numbers"]');
echo 'there are ', $nodelist->length, ' span elements having class="page-numbers"';
edit: does this
<span class="page-numbers">15</span>
(the second last a element) always point to the last page, i.e. does this link contain the value you're looking for?
Then you can use an XPath expression that selects the second but last a element and from there its child span element.
//div[@class="pager"] <- select each <div> where the attribute class equals "pager"
//div[@class="pager"]/a <- select each <a> that is a direct child of the pager div
//div[@class="pager"]/a[position()=last()-1] <- select the <a> that is second but last
//div[@class="pager"]/a[position()=last()-1]/span <- select the direct child <span> of that second but last <a> element in the pager <div>
( you might want to fetch a good XPath tutorial ;-) )
$doc->loadhtml($html);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="pager"]/a[position()=last()-1]/span');
if ( 0 < $nodelist->length ) {
echo $nodelist->item(0)->nodeValue;
}
else {
echo 'not found';
}
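If the pager contains the span elements directly, as in the snippet shown in the question, a variant of the expression can pick the last span whose class attribute is exactly "page-numbers". This is a sketch under that assumption. The markup below is shortened and the dots entry uses plain ASCII:

```php
<?php
$html = '<div class="pager">
<span class="page-numbers current">1</span>
<span class="page-numbers">2</span>
<span class="page-numbers dots">...</span>
<span class="page-numbers">15</span>
<span class="page-numbers next"> next</span>
</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// The parentheses make last() apply to the whole matched set.
// "current", "dots" and "next" spans are skipped because their class
// attribute is not exactly "page-numbers".
$nodes = $xpath->query('(//div[@class="pager"]/span[@class="page-numbers"])[last()]');

$lastPage = $nodes->length > 0 ? (int) trim($nodes->item(0)->nodeValue) : null;

echo $lastPage === null ? 'not found' : $lastPage;
```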
There is no direct function or easy way to do that. You need to build or use an existing HTML parser to do that.
You can parse it with a regular expression. First find all occurrences of <span class="page-numbers">, then select the last one:
// div html code should be in $div_html
preg_match_all('#<span class="page-numbers">(\d+)#', $div_html, $page_numbers);
print_r(end($page_numbers[1])); // prints 15
This is something you might want to use XPath for, which requires loading the page as a DOM document object:
$domDoc = new DOMDocument();
$domDoc->loadHTMLFile("http://path/to/yourfile.html");
$xp = new DOMXPath($domDoc);
$nodes = $xp->query("//xpath/to/relevant/node");
$value = $nodes->item(0)->nodeValue; // text of the first matching node
I haven't written a good xpath in a while, so you should do some reading to figure out that part, but it shouldn't be too difficult.
perhaps
$nodes = $dom->getElementsByTagName("span");
$maxPageNum = 0;
foreach($nodes as $node)
{
if( $node.class == "page-numbers" && $node.value > $maxPageNum )
{
$maxPageNum = $node.value;
}
}
I don't know PHP, so maybe it's not that easy to access the class/inner text of a dom node, but there must be some way to get that info and the pseudocode here should work.
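A runnable PHP rendering of that pseudocode might look like the sketch below. It assumes the pager markup from the question (shortened, with ASCII dots), splits the class attribute into tokens so that "page-numbers current" also counts, and ignores non-numeric entries such as the dots and "next":

```php
<?php
$html = '<div class="pager">
<span class="page-numbers current">1</span>
<span class="page-numbers">2</span>
<span class="page-numbers dots">...</span>
<span class="page-numbers">15</span>
<span class="page-numbers next"> next</span>
</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$maxPageNum = 0;
foreach ($dom->getElementsByTagName('span') as $span) {
    // The class attribute may hold several space-separated tokens.
    $classes = preg_split('/\s+/', trim($span->getAttribute('class')));
    $value   = trim($span->nodeValue);

    // Only consider spans tagged "page-numbers" whose text is all digits.
    if (in_array('page-numbers', $classes, true) && ctype_digit($value)) {
        $maxPageNum = max($maxPageNum, (int) $value);
    }
}

echo $maxPageNum;
```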
Just wanted to say a huge thank you to Volkerk for helping out - it worked really well. I had to make a few slight changes and ended up with this:-
function getusers($userurl)
{
$sSourceData = file_get_contents($userurl);
$doc = new DOMDocument();
@$doc->loadHTML($sSourceData);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="pager"]/a[position()=last()-1]/span');
if ( 0 < $nodelist->length ) {
$lastpage = $nodelist->item(0)->nodeValue;
$users = $lastpage * 35;
$userurl = $userurl.'?page='.$lastpage;
$sSourceData = file_get_contents($userurl);
$doc = new DOMDocument();
@$doc->loadHTML($sSourceData);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="user-details"]');
$users = $users + $nodelist->length;
echo 'there are ', $users , ' users';
}
else {
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="user-details"]');
echo 'there are ', $nodelist->length, ' users';
}
}
