PHP Xpath: Get node value by class name - php

I'm using xpath to pull data out of a piece of HTML code and I've been able to pull out most data except for one piece.
The HTML is structured like below, but there might only be one li or two or all three so I need to be able to target it by classname.
<li>
Product URL
</li>
<li>
<ul>
<li class="itemone">1</li>
<li class="itemtwo">2</li>
<li class="itemthree">3</li>
</ul>
</li>
This code is already retrieved using an xpath query and then further data is pulled out of the results of the xpath query with the below PHP snippet.
$rawData = $xpath->query('//div[@id=\'products\']/ul/li[contains(@class, \'product\')]');
foreach($rawData as $data) {
$productRaw = $data->getElementsByTagName('li');
$productTitle = $productRaw[0]->getElementsByTagName('a')[0]->nodeValue;
$productRefCode = $productRaw[0]->getElementsByTagName('span')[0]->nodeValue;
$productPrice = $productRaw[1]->getElementsByTagName('li');
}
The problem is $productPrice, the line above is pulling out the below node list.
DOMNodeList Object
(
[length] => 3
)
I'm looking to find anything in the above node list that has a class name of itemtwo. I've tried using $xpath->query on $productRaw[1] and also tried getElementsByClassName, but neither of the two snippets below worked.
$productPrice = $productRaw[1]->getElementsByTagName('li')->getElementsByClassName('itemtwo');
...
$productPrice = $productRaw[1]->query('//li[contains(@class, \'itemtwo\')]');
The first snippet gives Fatal error: Call to undefined method DOMNodeList::getElementsByClassName() and the second gives Fatal error: Call to undefined method DOMNodeList::query().

Use DOMXPath::query, passing the XPath string as the first parameter and a DOMNode as the second, to evaluate the XPath relative to that DOMNode, for example:
foreach($rawData as $data) {
$productRaw = $data->getElementsByTagName('li');
.....
$productPrice = $xpath->query('.//li[contains(@class, "itemtwo")]', $productRaw->item(1));
}
Also use . at the beginning of your XPath expression to state explicitly that the expression is relative to the current context node.
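A minimal sketch of the context-node approach described above. The sample markup is an assumption modelled loosely on the question (the wrapper div, the nested ul and the link text are invented for illustration), and item(0) is used because only one li matches in this sample.

```php
<?php
// Hypothetical markup shaped like the question's product list.
$html = '<div id="products"><ul><li class="product"><ul>'
      . '<li><a href="#">Product URL</a></li>'
      . '<li><ul>'
      . '<li class="itemone">1</li>'
      . '<li class="itemtwo">2</li>'
      . '<li class="itemthree">3</li>'
      . '</ul></li>'
      . '</ul></li></ul></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$prices = [];
foreach ($xpath->query('//div[@id="products"]/ul/li[contains(@class, "product")]') as $product) {
    // The leading "." makes the expression relative to $product,
    // which is passed as the context node in the second argument.
    $match = $xpath->query('.//li[contains(@class, "itemtwo")]', $product)->item(0);
    if ($match !== null) {
        $prices[] = trim($match->nodeValue);
    }
}

print_r($prices);
```

Without the leading dot, an expression starting with // searches the whole document even when a context node is supplied.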

Something like this?
$str = '<li>
Product URL</li>
<li>
<ul>
<li class="itemone">1</li>
<li class="itemtwo">2</li>
<li class="itemthree">3</li>
</ul>
</li>';
$doc = new DOMDocument;
$doc->loadHTML($str);
$xpath = new DOMXPath($doc);
$productPrices = $xpath->query("//li[@class='itemtwo']");
foreach ($productPrices as $productPrice) {
print $productPrice->nodeValue."\n";
}
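As a side note, the three common ways of matching a class in XPath 1.0 behave differently once an element carries more than one class. The sketch below compares them on made-up markup (the extra "sale" class and the "itemtwofold" element are assumptions added to show the difference); the concat/normalize-space form is a widely used idiom for matching a whole class token, not something taken from the answers here.

```php
<?php
// Hypothetical list: one element has two classes, one has a look-alike class.
$html = '<ul>'
      . '<li class="itemone">1</li>'
      . '<li class="itemtwo sale">2</li>'
      . '<li class="itemtwofold">3</li>'
      . '</ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Collect the text of every node in a result list.
function nodeValues(DOMNodeList $nodes): array
{
    $values = [];
    foreach ($nodes as $node) {
        $values[] = $node->nodeValue;
    }
    return $values;
}

// Exact attribute comparison: fails as soon as a second class is present.
$exact = nodeValues($xpath->query('//li[@class="itemtwo"]'));

// Substring match: also catches "itemtwofold".
$loose = nodeValues($xpath->query('//li[contains(@class, "itemtwo")]'));

// Token match: pads the class list with spaces and looks for " itemtwo ".
$token = nodeValues($xpath->query(
    '//li[contains(concat(" ", normalize-space(@class), " "), " itemtwo ")]'
));

print_r(['exact' => $exact, 'loose' => $loose, 'token' => $token]);
```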

har07's answer was on the right track, but it only returned the node list with length set to 3 like I was already receiving with my existing code.
Original code:
$productPrice = $productRaw[1]->getElementsByTagName('li');
har07's suggestion:
$productPrice = $xpath->query('.//li[contains(@class, "itemtwo")]', $productRaw->item(1));
Solution, which returns the node value where an element's class name is equal to itemtwo:
$productPrice = $xpath->query('.//li[contains(@class, \'itemtwo\')]', $productRaw[1])->item(1)->nodeValue;

Related

Using xpath to exclude inner classes that contains any of 2 certain words

I am using the following to return an array of div class names and their inner Li class names. It works great, but I would like to exclude Li classes that contain any of 2 certain words (wishlist and/or compare). I have tried combinations of "not contains" but can't get the syntax right.
function GetInnerDivClasses(){
$dom = new DOMDocument();
@$dom->loadHTML(SourceHtml());
$xpath = new DOMXpath($dom);
$classNames = [];
foreach ($xpath->query(".//div[contains(@class, ' top-menu-')]") as $div) {
foreach ($xpath->query('.//li/@class', $div) as $li) {
$classNames[$div->getAttribute('class')][] = $li->textContent;
}
}
return $classNames;
}
For example, if I wanted to exclude li classes that contain the text "wishlist" or "compare" in the following example, then "top-menu-item-6" and "top-menu-item-7" should not be returned in the array.
<div class="top-menu top-menu-2">
<ul class="j-menu">
<li class="menu-item top-menu-item top-menu-item-1">
Contact</span>
</li>
<li class="top-menu-item-2">
FAQ</span>
</li>
<li class="top-menu-item-6">
Wishlist</span><span class="count-badge wishlist-badge">1</span>
</li>
<li class="top-menu-item-7">
Compare</span><span class="count-badge compare-badge count-zero">0</span>
</li>
</ul>
The following line needs the correct syntax to not include if contains either "wishlist" and/or "compare", but I can't get it:
foreach ($xpath->query('.//li/@class', $div) as $li) {
If someone could please help Thanks.
Aside from what to do with them, if I understand you correctly, this should extract your target attribute values from the html in the question:
echo $xpath->query('.//div/@class')[0]->textContent . PHP_EOL;
foreach ($xpath->query('//li[a[not(span[(contains(@class,"compare"))])][not(span[(contains(@class,"wishlist"))])]]/@class') as $target) {
echo($target->textContent . PHP_EOL);
}
Output:
top-menu top-menu-2
menu-item top-menu-item top-menu-item-1
top-menu-item-2
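One way the filter might be folded back into the question's own function is sketched below. It assumes the li elements wrap their spans in an a element, as the answer's XPath implies (the href values are invented), and it takes the HTML as a parameter instead of calling SourceHtml(); the function name innerLiClasses is hypothetical.

```php
<?php
// Hypothetical reconstruction of the menu markup from the question.
$html = <<<'HTML'
<div class="top-menu top-menu-2">
<ul class="j-menu">
<li class="menu-item top-menu-item top-menu-item-1"><a href="#"><span>Contact</span></a></li>
<li class="top-menu-item-2"><a href="#"><span>FAQ</span></a></li>
<li class="top-menu-item-6"><a href="#"><span>Wishlist</span><span class="count-badge wishlist-badge">1</span></a></li>
<li class="top-menu-item-7"><a href="#"><span>Compare</span><span class="count-badge compare-badge count-zero">0</span></a></li>
</ul>
</div>
HTML;

// Returns [div class => [li classes]], skipping any li that contains a span
// whose class mentions "wishlist" or "compare".
function innerLiClasses(string $html): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $filter = './/li[not(.//span[contains(@class, "wishlist") or contains(@class, "compare")])]/@class';

    $classNames = [];
    foreach ($xpath->query('//div[contains(@class, " top-menu-")]') as $div) {
        foreach ($xpath->query($filter, $div) as $attr) {
            $classNames[$div->getAttribute('class')][] = $attr->nodeValue;
        }
    }
    return $classNames;
}

$classes = innerLiClasses($html);
print_r($classes);
```

If the words to exclude appeared in the li class itself, the same not(contains(...) or contains(...)) predicate could be applied to @class directly on the li.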

How to save xpath query data to saveHTML with HTML tags?

I'm trying to understand how I can save the HTML string found by the query so that I can access its elements.
I'm using the following query to find the below ul list.
$data = $xpath->query('//h2[contains(.,"Hurricane Data")]/following-sibling::ul/li');
<h2>Hurricane Data</h2>
<ul>
<li><strong>12 items</strong> found, see here for more information</li>
<li><strong>19 items</strong> found, see here for more information</li>
<li><strong>13 items</strong> found, see here for more information</li>
</ul>
If I print_r($data), I get the following DOMNodeList Object ( [length] => 3 ) which refers to the 3 elements found.
If I foreach() into the $data I get a DOMElement Object with all 3 li data.
What I'm trying to accomplish is to put each li data into an accessible array, but I want to parse the html strong & a tags inside too.
Now, I've already done everything I want to do, except the strong and a tags aren't being inserted into the arrays. Here is what I've come up with.
$string = [];
$query = $xpath->query('//h2[contains(.,"Hurricane Data")]/following-sibling::ul/li');
foreach($query as $values){
$try = new \DOMDocument;
$try->loadHTML(mb_convert_encoding($values->textContent, 'HTML-ENTITIES', 'UTF-8'));
$string[] = $try->saveHTML();
}
echo $string[0];
// outputs = 12 items found, see here for more information
// no strong tags, no hyperlinks
You don't need to reprocess the data, you can just say to save this particular node...
foreach($query as $values){
$string[] = $doc->saveHTML($values);
}
Where $doc is the document used as the basis for your XPath query.
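A self-contained sketch of that idea follows. The markup is an assumption based on the question, with a placeholder href on the link since the real URLs are not shown.

```php
<?php
// Hypothetical markup; the href values are placeholders.
$html = '<h2>Hurricane Data</h2><ul>'
      . '<li><strong>12 items</strong> found, see <a href="#">here</a> for more information</li>'
      . '<li><strong>19 items</strong> found, see <a href="#">here</a> for more information</li>'
      . '<li><strong>13 items</strong> found, see <a href="#">here</a> for more information</li>'
      . '</ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$string = [];
$query = $xpath->query('//h2[contains(.,"Hurricane Data")]/following-sibling::ul/li');
foreach ($query as $li) {
    // Passing a node to saveHTML() serializes that node with its markup,
    // so the strong and a tags survive, unlike with textContent.
    $string[] = $doc->saveHTML($li);
}

echo $string[0], PHP_EOL;
```

Note that the serialized string includes the outer li tag itself; if only the inner markup is wanted, the child nodes could be serialized one by one and concatenated.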

How can I parse HTML in batches using xpath [PHP]?

I tried all sorts of things but couldn't find a solution.
I want to retrieve elements from html code using xpath in php.
Ex:
<div class='student'>
<div class='name'>Michael</div>
<div class='age'>26</div>
</div>
<div class='student'>
<div class='name'>Joseph</div>
<div class='age'>27</div>
</div>
I want to retrieve the information and put them in an array as follows:
$student[0][name] = Michael;
$student[0][age] = 26;
$student[1][name] = Joseph;
$student[1][age] = 27;
In other words, I want the matching ages to stay with the names.
I tried the following:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpathDom = new DomXPath($dom);
$homepostcontentNodes = $xpathDom->query("//*[contains(@class, 'student')]//*[contains(@class, 'name')]");
However, this only grabs the 'name' nodes.
How can I get the matching age nodes?
Of course it is only grabbing the nodes name - you are telling it to!
What you will need to do is in two steps:
Pick out all the student nodes
For each student node, pick out the columns
This is a pretty standard step in linearization of data, and the XPath queries are simple:
Step 1
You pretty much have it:
$studentNodes = $xpathDom->query("//div[contains(@class, 'student')]");
This will return all your student nodes.
Step 2
This is where the magic happens. We have our nodes, we can loop through them (DOMNodeList implements Iterator, so we can foreach-loop through them). What we need to figure out is how to find its children...
...Oh wait. DOMNode implements a method called getNodePath, which returns the full, direct XPath path to the node. This allows us to simply append /div to get all the div elements that are direct children of the node!
Another quick foreach, and we get this code:
$studentNodes = $xpathDom->query("//div[contains(@class, 'student')]");
$result = array();
foreach ($studentNodes as $v) {
// Child nodes: student
$r = array();
$columns = $xpathDom->query($v->getNodePath()."/div");
foreach ($columns as $v2) {
// Attributes allows me to get the 'class' property of the node. Bit clunky, but there's no alternative
$r[$v2->attributes->getNamedItem("class")->textContent] = $v2->textContent;
}
$result[] = $r;
}
var_dump($result);
Full fiddle: http://codepad.viper-7.com/t868Wh
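The same two-step idea can also be sketched without getNodePath, by passing each student node as the context node of a relative query and reading the class with getAttribute. This is an alternative under the assumption that the markup looks exactly like the question's example, not a replacement for the code above.

```php
<?php
$html = "<div class='student'>"
      . "<div class='name'>Michael</div>"
      . "<div class='age'>26</div>"
      . "</div>"
      . "<div class='student'>"
      . "<div class='name'>Joseph</div>"
      . "<div class='age'>27</div>"
      . "</div>";

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpathDom = new DOMXPath($dom);

$students = [];
// Step 1: pick out every student node.
foreach ($xpathDom->query("//div[contains(@class, 'student')]") as $studentNode) {
    $row = [];
    // Step 2: "./div" selects the direct div children of the current student.
    foreach ($xpathDom->query('./div', $studentNode) as $column) {
        $row[$column->getAttribute('class')] = trim($column->textContent);
    }
    $students[] = $row;
}

var_dump($students);
```

Because each row is built inside the loop over one student, the name and age always stay paired.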

xpath cannot get attribute value

Well, the HTML code is something like this:
<a href="javascript:my_win('.......'">
<img src="...." border=0>
<font color="red">title
</a>
</font>
and I want to identify the color only for those a's whose href contains the word:
javascript:my_win
This is my query:
$xpath->query('//a[contains(@href,"javascript:my_win")]/font');
but I get nothing.
If I change my query to this, I get all the hrefs as expected, so there is no chance of a misspelling.
$elements = $xpath->query('//a');
If I change my query to this, every color is printed out.
$elements = $xpath->query('//a/font');
Whole code is here:
$elements = $xpath->query('//a[contains(@href,"javascript:my_win")]/font');
foreach ( $elements as $element ) {
$str1=$element->getAttribute('color');
}
Use the @ character to refer to attributes in XPath:
//a[contains(@href, 'some required text')]
When just writing href, the XPath processor will look for a child element named <href> whose text contents includes the specified string.
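The difference can be seen in a small sketch. The markup below is a well-formed assumption (the question's own snippet closes font after a, and the second link is invented for contrast), so it only illustrates the attribute-versus-child-element distinction.

```php
<?php
// Hypothetical, well-formed version of the markup from the question.
$html = <<<'HTML'
<a href="javascript:my_win('x')"><font color="red">title</font></a>
<a href="/plain"><font color="blue">other</font></a>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Without "@", href names a child element <href>, which does not exist here.
$asElement = $xpath->query('//a[contains(href, "javascript:my_win")]/font');

// With "@", href names the attribute on the <a> element.
$asAttribute = $xpath->query('//a[contains(@href, "javascript:my_win")]/font');

$colors = [];
foreach ($asAttribute as $font) {
    $colors[] = $font->getAttribute('color');
}

echo 'element form matched: ', $asElement->length, PHP_EOL;
echo 'attribute form matched: ', $asAttribute->length, PHP_EOL;
print_r($colors);
```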

Using PHP to get DOM Element

I'm struggling big time understanding how to use the DOMElement object in PHP. I found this code, but I'm not really sure it's applicable to me:
$dom = new DOMDocument();
$dom->loadHTML("index.php");
$div = $dom->getElementsByTagName('div');
foreach ($div->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
echo "Attribute '$name' :: '$value'<br />";
}
Basically what I need is to search the DOM for an element with a particular id, after which point I need to extract a non-standard attribute (i.e. one that I made up and put on with JS) so I can see the value of that. The reason is I need one piece from the $_GET and one piece that is in the HTML based from a redirect. If someone could just explain how I use DOMDocument for this purpose, that would be helpful. I'm really struggling understanding what's going on and how to properly implement it, because I clearly am not doing it right.
EDIT (Where I'm at based on comment):
This is my code lines 4-26 for reference:
<div id="column_profile">
<?php
require_once($_SERVER["DOCUMENT_ROOT"] . "/peripheral/profile.php");
$searchResults = isset($_GET["s"]) ? performSearch($_GET["s"]) : "";
$dom = new DOMDocument();
$dom->load("index.php");
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
foreach ($div->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
echo "Attribute '$name' :: '$value'<br />";
}
}
$div = $dom->getElementById('currentLocation');
$attr = $div->getAttribute('srckey');
echo "<h1>{$attr}</a>";
?>
</div>
<div id="column_main">
Here is the error message I'm getting:
Warning: DOMDocument::load() [domdocument.load]: Extra content at the end of the document in ../public_html/index.php, line: 26 in ../public_html/index.php on line 10
Fatal error: Call to a member function getAttribute() on a non-object in ../public_html/index.php on line 21
getElementsByTagName returns you a list of elements, so first you need to loop through the elements, then through their attributes.
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
foreach ($div->attributes as $attr) {
$name = $attr->nodeName;
$value = $attr->nodeValue;
echo "Attribute '$name' :: '$value'<br />";
}
}
In your case, you said you needed a specific ID. Those are supposed to be unique, so to do that, you can use (note getElementById might not work unless you call $dom->validate() first):
$div = $dom->getElementById('divID');
Then to get your attribute:
$attr = $div->getAttribute('customAttr');
EDIT: Loading index.php into DOMDocument just reads the contents of the file; it doesn't execute them. index.php won't be run this way. You might have to do something like:
$dom->loadHTML(file_get_contents('http://localhost/index.php'));
You won't have access to the HTML if the redirect is from an external server. Let me put it this way: the DOM does not exist at the point you are trying to parse it. What you can do is pass the text to a DOM parser and then manipulate the elements that way. Or the better way would be to add it as another GET variable.
EDIT: Are you also aware that the client can change the HTML and have it pass whatever they want? (Using a tool like Firebug)
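To tie the pieces of this answer together, here is a rough sketch that looks up an element by id in an HTML string and reads a custom attribute from it. The markup and the value abc123 are made up; the id currentLocation and the attribute srckey come from the question. The XPath fallback is an assumption added in case getElementById returns null for the loaded document.

```php
<?php
// Hypothetical HTML that has already been rendered into a string.
$html = '<div id="column_profile">'
      . '<div id="currentLocation" srckey="abc123">Here</div>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Try the direct lookup first, then fall back to an XPath query on @id.
$div = $dom->getElementById('currentLocation');
if ($div === null) {
    $div = (new DOMXPath($dom))->query('//*[@id="currentLocation"]')->item(0);
}

// Guard against a missing element, which is what caused the
// "Call to a member function getAttribute() on a non-object" error.
$srckey = null;
if ($div instanceof DOMElement && $div->hasAttribute('srckey')) {
    $srckey = $div->getAttribute('srckey');
}

echo htmlspecialchars((string) $srckey), PHP_EOL;
```

This only works on markup that exists as a string on the server; attributes added later by JavaScript in the browser are not visible to PHP.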
