PHP get custom attribute value with DOM parser - php

I'm using Simple HTML DOM parser to do some scraping, but I can't get the custom attribute (color). I am able to get other values, such as the h3's inner text.
My DOM is simple; it looks like this:
<article data-color="red">
<h1>Hi </h1>
</article>
<article data-color="blue">
<h1>Hi 2</h1>
</article>
<article data-color="gold">
<h1>Hi 3</h1>
</article>
My code so far
$dom = $html->find('article');
$arr = array();
foreach ($dom as $data) {
if(isset($data->find('h3',0)->plaintext)){
$h3 = $data->find('h3',0)->plaintext;
}
}
$arr[] = array(
"title" => $h3,
/* "color" => $color */
);
echo json_encode(array_values($arr));

If you're after the data attribute: simple-html-dom exposes an element's attributes as properties of the element object, so a hyphenated attribute name can be read like any other hyphenated property, using curly braces:
$object->{'property-with-a-hyphen'}
So when you apply this in your code:
foreach($dom as $data) {
$color = '';
if(isset($data->{'data-color'})) {
$color = $data->{'data-color'};
}
// array declarations below
$arr[] = array(
'color' => $color,
);
}
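For comparison, here is a minimal sketch of the same extraction using PHP's built-in DOM extension instead of simple-html-dom (a swapped-in technique, not the answer's method). The $markup string is an assumption copied from the question, and the title is read from the h1 that the sample markup actually contains.

```php
<?php
// Assumed sample markup, copied from the question.
$markup = '<article data-color="red"><h1>Hi </h1></article>'
        . '<article data-color="blue"><h1>Hi 2</h1></article>'
        . '<article data-color="gold"><h1>Hi 3</h1></article>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // <article> is HTML5; keep libxml warnings quiet
$doc->loadHTML($markup);
libxml_clear_errors();

$arr = array();
foreach ($doc->getElementsByTagName('article') as $article) {
    $heading = $article->getElementsByTagName('h1')->item(0);
    $arr[] = array(
        'title' => $heading ? trim($heading->textContent) : '',
        // getAttribute() returns an empty string when the attribute is absent
        'color' => $article->getAttribute('data-color'),
    );
}
echo json_encode($arr);
```

With the built-in extension the hyphen is not an issue, because the attribute name is passed to getAttribute() as an ordinary string.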

Related

Fetch content of all div with same class using PHP Simple HTML DOM Parser

I am new to HTML DOM parsing with PHP. There is a page with several divs that have different content but the same class. When I try to fetch the content, I only get the content of the last div. Is it possible to get the content of all the divs that share the same class? Please have a look at my code:
<?php
include(__DIR__."/simple_html_dom.php");
$html = file_get_html('http://campaignstudio.in/');
echo $x = $html->find('h2[class="section-heading"]',1)->outertext;
?>
In your example code, you have
echo $x = $html->find('h2[class="section-heading"]',1)->outertext;
Because you call find() with a second parameter of 1, it returns only a single element (the one at index 1). If you instead find all of them, you can do whatever you need with them...
$list = $html->find('h2[class="section-heading"]');
foreach ( $list as $item ) {
echo $item->outertext . PHP_EOL;
}
The full code I've just tested is...
include(__DIR__."/simple_html_dom.php");
$html = file_get_html('http://campaignstudio.in/');
$list = $html->find('h2[class="section-heading"]');
foreach ( $list as $item ) {
echo $item->outertext . PHP_EOL;
}
which gives the output...
<h2 class="section-heading text-white">We've got what you need!</h2>
<h2 class="section-heading">At Your Service</h2>
<h2 class="section-heading">Let's Get In Touch!</h2>

turn HTML into a PHP array

I have a string that also contains HTML in a $html variable:
'Here is some text which I do not need to extract but then there are
<figure class="class-one">
<img src="/example.jpg" alt="example alt" class="some-image-class">
<figcaption>example caption</figcaption>
</figure>
And another one (and many more)
<figure class="class-one some-other-class">
<img src="/example2.jpg" alt="example2 alt">
</figure>'
I want to extract all <figure> elements and everything they contain including their attributes and other html-elements and put this in an array in PHP so I would get something like:
$figures = [
0 => [
"class" => "class-one",
"img" => [
"src" => "/example.jpg",
"alt" => "example alt",
"class" => "some-image-class"
],
"figcaption" => "example caption"
],
1 => [
"class" => "class-one some-other-class",
"img" => [
"src" => "/example2.jpg",
"alt" => "example2 alt",
"class" => null
],
"figcaption" => null
]];
So far I have tried:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$figures = array();
foreach ($figures as $figure) {
$figures['class'] = $figure->getAttribute('class');
// here I tried to create the whole array but I can't seem to get the values from the HTML
// also I'm not sure how to get all html-elements within <figure>
}
Here is a Demo.
Here is the code that should get you where you want to be. I have added comments where I felt they would be helpful:
<?php
$htmlString = 'Here is some text which I do not need to extract but then there are <figure class="class-one"><img src="/example.jpg" alt="example alt" class="some-image-class"><figcaption>example caption</figcaption></figure>And another one (and many more)<figure class="class-one some-other-class"><img src="/example2.jpg" alt="example2 alt"></figure>';
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ suppresses warnings about malformed markup.
@$dom->loadHTML($htmlString);
//Create a new DOMXPath object for querying the document
$xp = new DOMXpath($dom);
//Create empty figures array that will hold all of our parsed HTML data
$figures = array();
//Get all <figure> elements
$figureElements = $xp->query('//figure');
//Create number variable to keep track of our $figures array index
$figureCount = 0;
//Loop through each <figure> element
foreach ($figureElements as $figureElement) {
$figures[$figureCount]["class"] = trim($figureElement->getAttribute('class'));
//Use a relative path (.//) so the query only searches inside the current <figure>
$img = $xp->query('.//img', $figureElement)->item(0);
$figures[$figureCount]["img"]["src"] = $img ? $img->getAttribute('src') : null;
$figures[$figureCount]["img"]["alt"] = $img ? $img->getAttribute('alt') : null;
//getAttribute() returns an empty string for a missing attribute, so check hasAttribute() to get null instead
if ($img && $img->hasAttribute('class')) {
$figures[$figureCount]["img"]["class"] = $img->getAttribute('class');
} else {
$figures[$figureCount]["img"]["class"] = null;
}
//Check that a <figcaption> element exists, otherwise set the value to null
$figcaption = $xp->query('.//figcaption', $figureElement)->item(0);
if ($figcaption) {
$figures[$figureCount]["figcaption"] = $figcaption->nodeValue;
} else {
$figures[$figureCount]["figcaption"] = null;
}
//Increment our $figureCount so that we know we can create a new array index.
$figureCount++;
}
print_r($figures);
?>
$doc = new \DOMDocument();
$doc->loadHTML($html);
$figure = $doc->getElementsByTagName("figure"); // DOMNodeList Object
//Create an array to hold the values of every DOMElement
$figures = array();
$i= 0;
foreach($figure as $item) { // DOMElement Object
$figures[$i]['class']= $item->getAttribute('class');
//DOMElement::getElementsByTagName— Returns html tag
$img = $item->getElementsByTagName('img')[0];
if($img){
//DOMElement::getAttribute — Returns value of attribute
$figures[$i]['img']['src'] = $img->getAttribute('src');
$figures[$i]['img']['alt'] = $img->getAttribute('alt');
$figures[$i]['img']['class'] = $img->getAttribute('class');
}
//textContent - use to get the text of tag
if($item->getElementsByTagName('figcaption')[0]){
$figures[$i]['figcaption'] = $item->getElementsByTagName('figcaption')[0]->textContent;
}
$i++;
}
echo "<pre>";
print_r($figures);
echo "</pre>";

DomDocument get all divs and put inside an array

I have some divs with the same id and the same class, as you can see below:
<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>
<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>
....
In my case I want to save all of them inside an array to be used later, I want to save in this format:
[0] => '<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>',
[1] => '<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>',
....
For that I'm using this code below:
$dom = new DOMDocument(); // Create DOMDocument object.
$dom->loadHTMLFile($htmlOut); // Load target file.
$div =$dom->getElementById('results_information'); // Take all div elements.
But it doesn't work. How can I solve this problem and put my divs inside an array?
To solve your problem, follow the steps below.
First of all, you should select by class and not by id (because an id is supposed to be unique in a document).
Assume you have the following HTML in a variable called $htmlOut:
<div id="results_information" class="control_results">
<span style="background:black; color:white">
hellow world
</span>
<strong>2</strong>
</div>
<div id="results_information" class="control_results">
<strong>2</strong>
<img src="hello.png" />
</div>
We need to extract all the HTML inside the two elements with class control_results and put it in an array. For this we work with DomDocument and DomXPath:
$array = array();
$dom = new DomDocument();
$dom->loadHtml($htmlOut);
$finder = new DomXPath($dom);
$classname = "control_results";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
With that code we select all the elements whose class contains control_results and store them in the variable $nodes.
Now we need to loop over $nodes (a DOMNodeList) and extract the HTML of those two elements. For this I created a helper function:
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
This function serializes every child node (all the HTML inside the control_results element) and returns the result as a string.
Now you only need a foreach over $nodes that calls the function, like this:
foreach ($nodes as $rowNode) {
$array[] = get_inner_html($rowNode);
}
var_dump($array);
Below is the complete code:
$htmlOut = '
<div id="results_information" class="control_results">
<span style="background:black; color:white">
hellow world
</span>
<strong>2</strong>
</div>
<div id="results_information" class="control_results">
<strong>2</strong>
<img src="hello.png" />
</div>
';
$array = array();
$dom = new DomDocument();
$dom->loadHtml($htmlOut);
$finder = new DomXPath($dom);
$classname = "control_results";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
foreach ($nodes as $rowNode) {
$array[] = get_inner_html($rowNode);
}
var_dump($array);
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
But this code has a small problem. If you check the results, the array contains:
0 => string '<span style="background:black; color:white">hellow world</span><strong>2</strong>',
1 => string '<strong>2</strong><img src="hello.png"/>'
instead of:
0 => string '<div id="results_information" class="control_results"><span style="background:black; color:white">hellow world</span><strong>2</strong></div>',
1 => string '<div id="results_information" class="control_results"><strong>2</strong><img src="hello.png"/></div>'
In that case you can loop over the array, prepend the opening div tag to each entry, append the closing tag, and save the result back into the array.
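An alternative to re-wrapping by hand: DOMDocument::saveHTML() accepts a node argument and serializes that node including its own tag, so the outer div comes along for free. A minimal sketch, assuming a shortened version of the markup ($htmlOut here is illustrative, not the original page):

```php
<?php
// Assumed, shortened sample markup.
$htmlOut = '<div id="results_information" class="control_results"><strong>2</strong></div>'
         . '<div id="results_information" class="control_results"><em>3</em></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // the duplicate ids make libxml complain
$dom->loadHTML($htmlOut);
libxml_clear_errors();

$finder = new DOMXPath($dom);
$array = array();
foreach ($finder->query('//div[contains(@class, "control_results")]') as $node) {
    // saveHTML($node) serializes the node itself plus everything inside it
    $array[] = $dom->saveHTML($node);
}
var_dump($array);
```

Each entry should begin with the opening div tag. libxml may add a trailing newline to some entries, so trim() them if exact strings matter.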
You will need to use XPath and select the elements by class name.
$dom = new DOMDocument();
$dom->loadHTMLFile($htmlOut); // load the document before querying it
$xpath = new DOMXpath($dom);
$div = $xpath->query('//div[contains(@class, "control_results")]');
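One caveat with contains(@class, ...) is that it is a substring test, so it also matches class names that merely include the text. A common XPath 1.0 idiom pads the attribute with spaces and looks for the padded token instead. The sketch below compares the two; the sample markup and the class name control_results_extra are made up for illustration.

```php
<?php
// Hypothetical markup: the second div only contains the class name as a substring.
$html = '<div class="control_results">a</div>'
      . '<div class="control_results_extra">b</div>'
      . '<div class="box control_results">c</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Substring match: also picks up control_results_extra
$loose = $xpath->query('//div[contains(@class, "control_results")]');

// Token match: pad the class list with spaces and search for " control_results "
$exact = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " control_results ")]'
);

echo $loose->length, ' ', $exact->length, PHP_EOL; // prints "3 2"
```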

php DOMDocument - List child elements to array

For the following HTML:
<html>
<body>
<div whatever></div>
<div id="archive-wrapper">
<ul class="archive-list">
<li><a href="#1">A</a></li>
<li><a href="#2">B</a></li>
<li><a href="#3">C</a></li>
</ul>
</div>
</body>
How could I retrieve, with PHP DOMDocument (http://php.net/manual/es/class.domdocument.php), an array containing (#1,#2,#3) in the most efficient way? It's not that I haven't tried anything or that I want ready-made code; I just need some guidelines so I can do it and understand it on my own. Thanks :)
A simple example using php DOMDocument -
<?php
$html = <<<HTML
<html>
<body>
<div whatever></div>
<div id="archive-wrapper">
<ul class="archive-list">
<li><a href="#1">A</a></li>
<li><a href="#2">B</a></li>
<li><a href="#3">C</a></li>
</ul>
</div>
</body>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
//get all links
$links = $dom->getElementsByTagName('a');
$linkArray = array();
//loop through each link
foreach ($links as $link){
$linkArray[] = $link->getAttribute('href');
}
edit
to get only the links inside ul->li, you could do something like -
$dom = new DOMDocument();
$dom->loadHTML($html);
$linkArray = array();
foreach ($dom->getElementsByTagName('ul') as $ul){
foreach ($ul->getElementsByTagName('li') as $li){
foreach ($li->getElementsByTagName('a') as $a){
$linkArray[] = $a->getAttribute('href');
}
}
}
or if you just want the 1st ul you could simplify to
//get 1st ul using ->item(0)
$ul = $dom->getElementsByTagName('ul')->item(0);
foreach ($ul->getElementsByTagName('li') as $li){
foreach ($li->getElementsByTagName('a') as $a){
$linkArray[] = $a->getAttribute('href');
}
}
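The nested loops can also be collapsed into a single XPath query that selects the href attributes directly. This is only a sketch: it assumes each list item wraps a link such as <a href="#1">A</a>, which is what the expected result (#1, #2, #3) suggests, and the compact $html string is written here for illustration.

```php
<?php
// Assumed markup: each <li> contains a link whose href is #1, #2 or #3.
$html = '<html><body><div id="archive-wrapper"><ul class="archive-list">'
      . '<li><a href="#1">A</a></li>'
      . '<li><a href="#2">B</a></li>'
      . '<li><a href="#3">C</a></li>'
      . '</ul></div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$linkArray = array();
// Select the href attribute nodes of links directly under the list items
foreach ($xpath->query('//ul[@class="archive-list"]/li/a/@href') as $href) {
    $linkArray[] = $href->nodeValue;
}
print_r($linkArray);
```

Restricting the path to ul[@class="archive-list"] keeps links elsewhere on the page out of the result.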
What do you mean by PHP DOM? Do you mean PHP with jQuery? There are a few options:
you can put all of that in a form and post it to a script;
you can also wrap it in a select, which will only submit the selected data.
A better idea would be to use jQuery to post the items as an array from the same page and use PHP as the server-side processor for the manipulation. This is better in the long run, since it is the most up-to-date way of interacting with HTML and server-side scripts.
For example, you can try something like this:
$("#form").submit(function(){ //form being the #form id
var items = [];
$(".archive-list li").each(function(n){ // archive-list is a class, so select it with a dot
items[n] = $(this).html();
});
$.post(
"munipilate-data.php",
{items: items},
function(data){
$("#result").html(data);
});
});
I suggest using a regex to parse it.
$html = '<html>
<body>
<div whatever></div>
<div id="archive-wrapper">
<ul class="archive-list">
<li><a href="#1">A</a></li>
<li><a href="#2">B</a></li>
<li><a href="#3">C</a></li>
</ul>
</div>
</body>';
$reg = '/a href=["\']?([^"\' ]*)["\' ]/';
preg_match_all($reg, $html, $m);
$arr = array_map(function($v){
return trim(str_replace('a href=', '', $v), '"');
}, $m[0]);
print '<pre>';
print_r($arr);
print '</pre>';
Output:
Array
(
[0] => #1
[1] => #2
[2] => #3
)
Regex Demo

Xpath DOMDocument not returning data from node

I am scanning a site using a DOMDocument and retrieving the @src attribute from the inner img element.
The html is:
<div class="content">
<div class="post">
<img src="abc"/>
</div>
<div class="post">
<img src="abc1"/>
</div>
<div class="post">
</div>
</div>
Note: I am specifically not using @src in my XPath query. Also, this is precisely the way I need it to be coded.
Here is my php:
$document = new DOMDocument();
$document->loadXML($file);
$XPath = new DOMXPath($document);
$Query = '//div[@class="content"]//div[@class="post"]';
$posts = $XPath->query($Query);
foreach ($posts as $post) {
if($XPath->evaluate("@src", $post))
{
$return[] = $XPath->evaluate("@src", $post)->item(0);
}else{
$return[] = "";
}
}
It's adding positions to the array $return, but they are all empty.
My question is: how do I make this line of the PHP code output the data?
$return[] = $XPath->evaluate("@src", $post)->item(0);
This doesn't work either:
$return[] = $XPath->evaluate("@src", $post)->item(0)->nodeValue;
Use .//@src[1]:
. => relative to node
//@ => descendant
@src => the src attribute
[1] => the first one.
You can even use string(nodeset) if you only care about the value (of course, leave that out if you need to be able to manipulate the attributes and use your ->item(0) solution).
foreach ($posts as $post){
$return[] = $XPath->evaluate("string(.//@src[1])",$post);
}
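Put together with the markup from the question, the approach can be sketched as follows. The $file string is an assumption that simply inlines the sample HTML. The original expression found nothing because @src was evaluated against the div itself, which has no src attribute; the relative descendant path reaches the img inside it.

```php
<?php
// Assumed input: the sample markup from the question, inlined as a string.
$file = '<div class="content">'
      . '<div class="post"><img src="abc"/></div>'
      . '<div class="post"><img src="abc1"/></div>'
      . '<div class="post"></div>'
      . '</div>';

$document = new DOMDocument();
$document->loadXML($file);
$XPath = new DOMXPath($document);

$return = array();
foreach ($XPath->query('//div[@class="content"]//div[@class="post"]') as $post) {
    // string() turns the first matching attribute into a plain PHP string,
    // or an empty string when the post has no src attribute at all
    $return[] = $XPath->evaluate('string(.//@src[1])', $post);
}
var_dump($return); // "abc", "abc1", ""
```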
