Xpath DOMDocument not returning data from node - php

I am scraping a site using DOMDocument and trying to retrieve the @src attribute from the inner img element.
The html is:
<div class="content">
<div class="post">
<img src="abc"/>
</div>
<div class="post">
<img src="abc1"/>
</div>
<div class="post">
</div>
</div>
Note: I am specifically not using @src in my XPath query. Also, this is precisely the way I need it to be coded.
Here is my php:
$document = new DOMDocument();
$document->loadXML($file);
$XPath = new DOMXPath($document);
$Query = '//div[@class="content"]//div[@class="post"]';
$posts = $XPath->query($Query);
foreach ($posts as $post) {
    if($XPath->evaluate("@src", $post))
    {
        $return[] = $XPath->evaluate("@src", $post)->item(0);
    }else{
        $return[] = "";
    }
}
It adds entries to the array $return, but they are all empty.
My question is: how do I make this line output the data?
$return[] = $XPath->evaluate("@src", $post)->item(0);
This doesn't work either:
$return[] = $XPath->evaluate("@src", $post)->item(0)->nodeValue;

Your expression @src looks for a src attribute on the div.post itself, not on the img inside it. Use .//@src[1] instead:
. => relative to the context node
//@ => any descendant's attribute
@src => the src attribute
[1] => the first one.
You can even use string(nodeset) if you only care about the value (of course, leave that out if you need to be able to manipulate the attributes, and use your ->item(0) solution).
foreach ($posts as $post){
$return[] = $XPath->evaluate("string(.//#src[1])",$post);
}
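A minimal, self-contained sketch of the approach above, using the markup from the question. The variable names are kept from the question where possible; the inline $file string is an assumption standing in for the scraped page.

```php
<?php
// Sample markup copied from the question (assumed to be what $file holds).
$file = '<div class="content">
<div class="post"><img src="abc"/></div>
<div class="post"><img src="abc1"/></div>
<div class="post"></div>
</div>';

$document = new DOMDocument();
$document->loadXML($file);
$XPath = new DOMXPath($document);

$return = [];
foreach ($XPath->query('//div[@class="content"]//div[@class="post"]') as $post) {
    // string() turns the first matching attribute node into its value,
    // or into an empty string when the post contains no src attribute.
    $return[] = $XPath->evaluate('string(.//@src[1])', $post);
}

print_r($return);
```

Because string() yields an empty string for an empty node set, the third post needs no separate else branch.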

Related

Need to get divs from string based on matching class

I have a variable $company_id = 8; and a block of HTML content stored as a string called all_content:
<div class="company-id-8">
Content One
</div>
<div class="company-id-9">
Content Two
</div>
<div class="company-id-8">
Content Three
</div>
<div class="company-id-3">
Content Four
</div>
I need to remove all of the divs from all_content that don't match the current company ID class. So, once filtered, the above html should become:
<div class="company-id-8">
Content One
</div>
<div class="company-id-8">
Content Three
</div>
I have the following code to filter out divs that don't belong to the current company:
$dom = new DomDocument();
$dom->loadHTML( $full_message );
$finder = new DomXPath($dom);
$classname = "company-id-" . $company_id;
$nodes = $finder->query("//div[contains(@class, '$classname')]");
foreach ( $nodes as $node ) {
$filtered_content .= ;
}
I can't seem to work out how to get my filtered div nodes back into the filtered_content string though?
How can I tidy this up and get it working?
The solution is to import each matching node into a temporary document and serialize that:
$filtered_content = "";
foreach ( $nodes as $node ) {
$tmp_doc = new DOMDocument();
$tmp_doc->appendChild($tmp_doc->importNode($node,true));
$filtered_content .= $tmp_doc->saveHTML();
}
filtered_content ends up being a usable HTML string with the correct content.
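A shorter variant, offered as a sketch rather than as the original answer's method: since PHP 5.3.6, DOMDocument::saveHTML() accepts a node argument, so the temporary document can be skipped. The stricter class test below (the concat/normalize-space idiom) is also an addition; it keeps company-id-8 from matching a class such as company-id-80, which a plain contains() would allow. The inline $all_content string is assumed sample data based on the question.

```php
<?php
$company_id = 8;
$all_content = '<div class="company-id-8">Content One</div>'
    . '<div class="company-id-9">Content Two</div>'
    . '<div class="company-id-8">Content Three</div>'
    . '<div class="company-id-3">Content Four</div>';

$dom = new DOMDocument();
$dom->loadHTML($all_content);
$finder = new DOMXPath($dom);
$classname = 'company-id-' . $company_id;

// Match the class as a whole space-separated token, not as a substring.
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]";

$filtered_content = '';
foreach ($finder->query($query) as $node) {
    // Serialize the element itself, including its opening and closing tags.
    $filtered_content .= $dom->saveHTML($node);
}

echo $filtered_content;
```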

DomDocument get all divs and put inside an array

I have some divs with the same id and the same class, as you can see below:
<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>
<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>
....
I want to save all of them in an array for later use, in this format:
[0] => '<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>',
[1] => '<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>',
....
For that I'm using this code below:
$dom = new DOMDocument(); // Create DOMDocument object.
$dom->loadHTMLFile($htmlOut); // Load target file.
$div =$dom->getElementById('results_information'); // Take all div elements.
But it doesn't work. How can I solve this problem and put my divs inside an array?
To solve your problem, follow the steps below.
First of all, you should select by class and not by id, because an id is supposed to be unique within a document.
In this situation we assume that you have the following html inside a variable called $htmlOut:
<div id="results_information" class="control_results">
<span style="background:black; color:white">
hellow world
</span>
<strong>2</strong>
</div>
<div id="results_information" class="control_results">
<strong>2</strong>
<img src="hello.png" />
</div>
We need to extract all the HTML inside these two elements with the class control_results and put it in an array. For this we work with DomDocument and DomXPath:
$array = array();
$dom = new DomDocument();
$dom->loadHtml($htmlOut);
$finder = new DomXPath($dom);
$classname = "control_results";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
With that code we select all the elements with the class control_results and store them in the variable $nodes.
Now we need to loop over $nodes (a DOMNodeList, which can be iterated like an array) and extract the HTML of those two elements. For this I created a helper function:
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
This function serializes every child node (all the HTML inside the control_results element) and returns the result as a string.
Now you only need a foreach over $nodes that calls this function, like this:
foreach ($nodes as $rowNode) {
$array[] = get_inner_html($rowNode);
}
var_dump($array);
Below is the complete code:
$htmlOut = '
<div id="results_information" class="control_results">
<span style="background:black; color:white">
hellow world
</span>
<strong>2</strong>
</div>
<div id="results_information" class="control_results">
<strong>2</strong>
<img src="hello.png" />
</div>
';
$array = array();
$dom = new DomDocument();
$dom->loadHtml($htmlOut);
$finder = new DomXPath($dom);
$classname = "control_results";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
foreach ($nodes as $rowNode) {
$array[] = get_inner_html($rowNode);
}
var_dump($array);
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
But this code has one small problem. If you check the results, the array contains:
0 => string '<span style="background:black; color:white">hellow world</span><strong>2</strong>',
1 => string '<strong>2</strong><img src="hello.png"/>'
instead of:
0 => string '<div id="results_information" class="control_results"><span style="background:black; color:white">hellow world</span><strong>2</strong></div>',
1 => string '<div id="results_information" class="control_results"><strong>2</strong><img src="hello.png"/></div>'
In this case you can loop over the array, prepend the opening div tag to each entry, append the closing div tag, and save the result back into the array.
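As an alternative to re-wrapping each entry by hand, here is a sketch that relies on DOMDocument::saveHTML() accepting a node argument (PHP 5.3.6 and later), which serializes the element together with its own tags. The libxml_use_internal_errors() call is an addition to keep parser warnings about the duplicated id quiet; it is not part of the answer above.

```php
<?php
$htmlOut = '
<div id="results_information" class="control_results">
<span style="background:black; color:white">hellow world</span>
<strong>2</strong>
</div>
<div id="results_information" class="control_results">
<strong>2</strong>
<img src="hello.png" />
</div>';

// Duplicate ids can make libxml emit warnings; collect them silently instead.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($htmlOut);
libxml_clear_errors();

$finder = new DOMXPath($dom);
$array = [];
foreach ($finder->query("//*[contains(@class, 'control_results')]") as $rowNode) {
    // Passing a node to saveHTML() returns that element's outer HTML.
    $array[] = $dom->saveHTML($rowNode);
}

var_dump($array);
```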
You will need to use XPath and select the elements by class name:
$dom = new DOMDocument();
$dom->loadHTMLFile($htmlOut); // load the markup before querying it
$xpath = new DOMXPath($dom);
$divs = $xpath->query('//div[contains(@class, "control_results")]');

PHP get custom attribute value with DOM parser

I use Simple HTML DOM Parser to do some scraping but failed to get the custom attribute (color). I was able to get other values, like the h3's inner text.
My DOM is simple; it looks like this:
<article data-color="red">
<h1>Hi </h1>
</article>
<article data-color="blue">
<h1>Hi 2</h1>
</article>
<article data-color="gold">
<h1>Hi 3</h1>
</article>
My code so far
$dom = $html->find('article');
$arr = array();
foreach ($dom as $data) {
if(isset($data->find('h3',0)->plaintext)){
$h3 = $data->find('h3',0)->plaintext;
}
}
$arr[] = array(
"title" => $h3,
/* "color" => $color */
);
echo json_encode(array_values($arr));
If you're after the data attribute: element attributes are exposed as properties of the simple-html-dom object, so just access the hyphenated property the usual way, with curly braces:
$object->{'property-with-a-hyphen'}
So when you apply this in your code:
foreach($dom as $data) {
$color = '';
if(isset($data->{'data-color'})) {
$color = $data->{'data-color'};
}
// array declarations below
$arr[] = array(
'color' => $color,
);
}
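For comparison, a sketch of the same lookup with PHP's built-in DOMDocument instead of simple-html-dom (a swapped-in technique, not what the answer above uses). It reads the h1 shown in the question's markup rather than the h3 used in the question's code, which is an assumption about the intent. Here getAttribute() takes the hyphenated name as an ordinary string.

```php
<?php
$html = '<article data-color="red"><h1>Hi </h1></article>'
      . '<article data-color="blue"><h1>Hi 2</h1></article>'
      . '<article data-color="gold"><h1>Hi 3</h1></article>';

// <article> is an HTML5 element; some libxml versions report it as an
// unknown tag, so parser messages are collected silently here.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$arr = [];
foreach ($dom->getElementsByTagName('article') as $article) {
    $h1 = $article->getElementsByTagName('h1')->item(0);
    $arr[] = [
        'title' => $h1 ? trim($h1->textContent) : '',
        // getAttribute() returns an empty string when the attribute is missing.
        'color' => $article->getAttribute('data-color'),
    ];
}

echo json_encode($arr);
```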

php DOMDocument - List child elements to array

For the following HTML:
<html>
<body>
<div whatever></div>
<div id="archive-wrapper">
<ul class="archive-list">
<li><div>A</div></li>
<li><div>B</div></li>
<li><div>C</div></li>
</ul>
</div>
</body>
How could I retrieve, with PHP DOMDocument (http://php.net/manual/es/class.domdocument.php), an array containing (#1,#2,#3) in the most effective way? It's not that I did not try anything or that I want an already done code, I just need to know some guidelines to do it and understand it on my own. Thanks :)
A simple example using php DOMDocument -
<?php
$html = <<<HTML
<html>
<body>
<div whatever></div>
<div id="archive-wrapper">
<ul class="archive-list">
<li><div>A</div></li>
<li><div>B</div></li>
<li><div>C</div></li>
</ul>
</div>
</body>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
//get all links
$links = $dom->getElementsByTagName('a');
$linkArray = array();
//loop through each link
foreach ($links as $link){
$linkArray[] = $link->getAttribute('href');
}
edit
to get only the links inside ul->li, you could do something like -
$dom = new DOMDocument();
$dom->loadHTML($html);
$linkArray = array();
foreach ($dom->getElementsByTagName('ul') as $ul){
    foreach ($ul->getElementsByTagName('li') as $li){
        foreach ($li->getElementsByTagName('a') as $a){
            $linkArray[] = $a->getAttribute('href');
        }
    }
}
or if you just want the 1st ul you could simplify to
//get 1st ul using ->item(0)
$ul = $dom->getElementsByTagName('ul')->item(0);
foreach ($ul->getElementsByTagName('li') as $li){
foreach ($li->getElementsByTagName('a') as $a){
$linkArray[] = $a->getAttribute('href');
}
}
What do you mean by PHP DOM? Do you mean PHP and jQuery?
You can put all of that in a form and post it to a script. You can also wrap it in a select, which will only store the selected data.
A better idea would be to use jQuery to post the items as an array to the same page and use PHP as the processor for server-side manipulation. This is better in the long run, since it is the most up-to-date way of interacting with HTML and server-side scripts.
For example, you could try something like this:
$("#form").submit(function(){ // "form" being the form's id
    var items = [];
    // archive-list is a class in the question's HTML, so select it with a dot
    $(".archive-list li").each(function(n){
        items[n] = $(this).html();
    });
    $.post(
        "munipilate-data.php",
        {items: items},
        function(data){
            $("#result").html(data);
        });
});
I suggest using a regex to parse it.
$html = '<html>
<body>
<div whatever></div>
<div id="archive-wrapper">
<ul class="archive-list">
<li><div>A</div></li>
<li><div>B</div></li>
<li><div>C</div></li>
</ul>
</div>
</body>';
$reg = '/a href=["\']?([^"\' ]*)["\' ]/';
preg_match_all($reg, $html, $m);
$arr = array_map(function($v){
return trim(str_replace('a href=', '', $v), '"');
}, $m[0]);
print '<pre>';
print_r($arr);
print '</pre>';
Output:
Array
(
[0] => #1
[1] => #2
[2] => #3
)

XPATH Get Attribute of Current Node

Having trouble getting the attribute of the current node in PHP and making a condition based on that attribute...
Example XML
<div class='parent'>
<div class='title'>A Title</div>
<div class='child'>some text</div>
<div class='child'>some text</div>
<div class='title'>A Title</div>
<div class='child'>some text</div>
<div class='child'>some text</div>
</div>
What I am trying to do is traverse the XML in PHP and do different things based on the class of the element/node
Eg.
$doc->loadHTML($xml_string);
$xpath = new DOMXpath($doc);
$nodeLIST = $xpath->query("//div[@class='parent']/div");
foreach ($nodeLIST as $node) {
if (CURRENT DIV NODE ATTRIBUTE EQUALS TITLE) {
SET $TITLE VARIABLE TO THE TEXT() OF THE CURRENT NODE
}
ELSEIF(CURRENT DIV NODE ATTRIBUTE EQUALS CHILD){
SET $CHILD VARIABLE TO THE TEXT() OF THE CURRENT NODE
}
}
I've tried all kinds of things, like the following...
if ($xpath->query("./[@class='title']/text()",$node)->length > 0) { }
But all I keep getting is PHP errors saying that my XPath syntax is not valid. Can anyone help me?
You can achieve this by using getAttribute() method. Example:
foreach($nodeLIST as $node) {
$attribute = $node->getAttribute('class');
if($attribute == 'title') {
// do something
} elseif ($attribute == 'child') {
// do something
}
}
$node->getAttribute('class') gives you the attribute value, $node->textContent the string contents of the node. I wouldn't dive into XPath to read out the string value.
You can filter the 'title' and 'child' sets in different nodelists:
$titles = $xpath->query("//div[@class='parent']/div[@class='title']");
$children = $xpath->query("//div[@class='parent']/div[@class='child']");
And then process them separately:
foreach ($titles as $title) {
echo $title->textContent."\n";
}
foreach ($children as $child) {
echo $child->textContent."\n";
}
See: http://codepad.viper-7.com/x4LA50
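If you do want to test the current node from inside the loop with XPath, the self:: axis is the valid way to write what "./[@class='title']" was attempting. The sketch below combines it with getAttribute() and groups each child under the title that precedes it; the $sections structure is a hypothetical choice made for illustration, not something the question specified.

```php
<?php
$xml_string = "<div class='parent'>"
    . "<div class='title'>A Title</div>"
    . "<div class='child'>some text</div>"
    . "<div class='child'>some text</div>"
    . "<div class='title'>A Title</div>"
    . "<div class='child'>some text</div>"
    . "<div class='child'>some text</div>"
    . "</div>";

$doc = new DOMDocument();
$doc->loadHTML($xml_string);
$xpath = new DOMXPath($doc);

$sections = [];
foreach ($xpath->query("//div[@class='parent']/div") as $node) {
    // self::div[...] tests the context node itself; boolean() makes
    // evaluate() return a PHP bool instead of a node list.
    if ($xpath->evaluate("boolean(self::div[@class='title'])", $node)) {
        $sections[] = ['title' => $node->textContent, 'children' => []];
    } elseif ($node->getAttribute('class') === 'child' && $sections) {
        // Attach the child to the most recently seen title.
        $sections[count($sections) - 1]['children'][] = $node->textContent;
    }
}

print_r($sections);
```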
