For the following HTML:
<html>
<body>
<div whatever></div>
<div id="archive-wrapper">
<ul class="archive-list">
<li><div><a href="#1">A</a></div></li>
<li><div><a href="#2">B</a></div></li>
<li><div><a href="#3">C</a></div></li>
</ul>
</div>
</body>
How could I retrieve, with PHP DOMDocument (http://php.net/manual/es/class.domdocument.php), an array containing (#1,#2,#3) in the most efficient way? I have tried things myself and I am not asking for ready-made code; I just need some guidelines so that I can do it and understand it on my own. Thanks :)
A simple example using PHP DOMDocument:
<?php
$html = <<<HTML
<html>
<body>
<div whatever></div>
<div id="archive-wrapper">
<ul class="archive-list">
<li><div><a href="#1">A</a></div></li>
<li><div><a href="#2">B</a></div></li>
<li><div><a href="#3">C</a></div></li>
</ul>
</div>
</body>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
//get all links
$links = $dom->getElementsByTagName('a');
$linkArray = array();
//loop through each link
foreach ($links as $link){
$linkArray[] = $link->getAttribute('href');
}
Edit: to get only the links inside ul->li, you could do something like:
$dom = new DOMDocument();
$dom->loadHTML($html);
$linkArray = array();
// loop variables are named after the element they actually hold
foreach ($dom->getElementsByTagName('ul') as $ul){
    foreach ($ul->getElementsByTagName('li') as $li){
        foreach ($li->getElementsByTagName('a') as $link){
            $linkArray[] = $link->getAttribute('href');
        }
    }
}
Or, if you just want the first ul, you could simplify to:
//get 1st ul using ->item(0)
$ul = $dom->getElementsByTagName('ul')->item(0);
foreach ($ul->getElementsByTagName('li') as $li){
foreach ($li->getElementsByTagName('a') as $a){
$linkArray[] = $a->getAttribute('href');
}
}
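The nested loops above can also be collapsed into a single XPath query. This is only a sketch: the `<a href>` values in the sample markup are assumed from the `(#1,#2,#3)` the question asks for, and the `archive-list` class comes from the question's HTML.

```php
<?php
// Sample markup; the <a href> values are an assumption based on the question.
$html = <<<HTML
<html>
<body>
<div id="archive-wrapper">
<ul class="archive-list">
<li><div><a href="#1">A</a></div></li>
<li><div><a href="#2">B</a></div></li>
<li><div><a href="#3">C</a></div></li>
</ul>
</div>
</body>
</html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select every <a> that has an href and sits below an <li> of the target list.
$linkArray = array();
foreach ($xpath->query('//ul[@class="archive-list"]/li//a[@href]') as $a) {
    $linkArray[] = $a->getAttribute('href');
}
print_r($linkArray);
```

The query restricts the search to the one list you care about, so links elsewhere on the page are not picked up.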
What do you mean by PHP DOM? Do you mean PHP and jQuery? You could set it up in a few ways:
you can put all of that in a form and post it to a script;
you can also wrap it in a select, which will only submit the selected data;
a better idea would be to use jQuery to post the items as an array and use PHP as the server-side processor for the manipulation. This is better in the long run, since it is the more current way of interacting with HTML and server-side scripts.
For example, you can try something like this:
$("#form").submit(function(){ // "form" being the id of the form
    var items = [];
    // archive-list is a class in the markup above, so select it with "."
    $(".archive-list li").each(function(n){
        items[n] = $(this).html();
    });
    $.post(
        "munipilate-data.php",
        {items: items},
        function(data){
            $("#result").html(data);
        });
});
I suggest using a regex to parse it.
$html = '<html>
<body>
<div whatever></div>
<div id="archive-wrapper">
<ul class="archive-list">
<li><div><a href="#1">A</a></div></li>
<li><div><a href="#2">B</a></div></li>
<li><div><a href="#3">C</a></div></li>
</ul>
</div>
</body>';
$reg = '/a href=["\']?([^"\' ]*)["\' ]/';
preg_match_all($reg, $html, $m);
$arr = array_map(function($v){
return trim(str_replace('a href=', '', $v), '"');
}, $m[0]);
print '<pre>';
print_r($arr);
print '</pre>';
Output:
Array
(
[0] => #1
[1] => #2
[2] => #3
)
I have some divs with the same id and the same class, as you can see below:
<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>
<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>
....
In my case I want to save all of them in an array to be used later, in this format:
[0] => '<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>',
[1] => '<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>',
....
For that I'm using this code below:
$dom = new DOMDocument(); // Create DOMDocument object.
$dom->loadHTMLFile($htmlOut); // Load target file.
$div =$dom->getElementById('results_information'); // Take all div elements.
But it doesn't work. How can I solve this problem and put my divs inside an array?
To solve your problem, follow the steps below.
First of all, you should select by class and not by id (an id is supposed to be unique within a document).
Assume you have the following HTML inside a variable called $htmlOut:
<div id="results_information" class="control_results">
<span style="background:black; color:white">
hellow world
</span>
<strong>2</strong>
</div>
<div id="results_information" class="control_results">
<strong>2</strong>
<img src="hello.png" />
</div>
We need to extract all the HTML inside the two elements with class control_results and put it in an array. For this we need DOMDocument and DOMXPath:
$array = array();
$dom = new DomDocument();
$dom->loadHtml($htmlOut);
$finder = new DomXPath($dom);
$classname = "control_results";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
With that code we select all the elements with class name control_results and store them in the variable $nodes.
Now we need to walk through $nodes (a DOMNodeList) and extract the HTML of those two elements. For this I created a helper function:
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
This function serializes every child node (all the HTML inside the control_results element) and returns the result.
Now you only need a foreach over $nodes that calls that function, like this:
foreach ($nodes as $rowNode) {
$array[] = get_inner_html($rowNode);
}
var_dump($array);
Below is the complete code:
$htmlOut = '
<div id="results_information" class="control_results">
<span style="background:black; color:white">
hellow world
</span>
<strong>2</strong>
</div>
<div id="results_information" class="control_results">
<strong>2</strong>
<img src="hello.png" />
</div>
';
$array = array();
$dom = new DomDocument();
$dom->loadHtml($htmlOut);
$finder = new DomXPath($dom);
$classname = "control_results";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
foreach ($nodes as $rowNode) {
$array[] = get_inner_html($rowNode);
}
var_dump($array);
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
But this code has a small problem. If you check the results, the array is:
0 => string '<span style="background:black; color:white">hellow world</span><strong>2</strong>',
1 => string '<strong>2</strong><img src="hello.png"/>'
instead of:
0 => string '<div id="results_information" class="control_results"><span style="background:black; color:white">hellow world</span><strong>2</strong></div>',
1 => string '<div id="results_information" class="control_results"><strong>2</strong><img src="hello.png"/></div>'
In this case you can loop over this array, prepend the opening div tag to each entry, append the closing div tag, and save the array again.
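Another option, instead of re-wrapping the strings by hand, is to serialize each matched node itself rather than its children. The sketch below assumes PHP 5.3.6 or later, where DOMDocument::saveHTML() accepts a node argument; the shortened sample markup is my own.

```php
<?php
// Shortened sample markup (assumption), with the duplicate ids from the question.
$htmlOut = '
<div id="results_information" class="control_results"><strong>1</strong></div>
<div id="results_information" class="control_results"><strong>2</strong></div>
';

$dom = new DOMDocument();
// Duplicate ids can trigger libxml warnings; collect them instead of printing.
libxml_use_internal_errors(true);
$dom->loadHTML($htmlOut);
libxml_clear_errors();

$finder = new DOMXPath($dom);
$array = array();
foreach ($finder->query("//*[contains(@class, 'control_results')]") as $node) {
    // Serializing the node itself keeps the wrapping <div> in the result.
    $array[] = $dom->saveHTML($node);
}
var_dump($array);
```

Each entry then already contains the outer div, so no second pass over the array is needed.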
You will need to use XPath and select the elements by class name.
$dom = new DOMDocument();
$dom->loadHTMLFile($htmlOut); // load the target file, as in the question
$xpath = new DOMXpath($dom);
$div = $xpath->query('//div[contains(@class, "control_results")]');
Suppose I have an element like this:
<h1 class="first second last">
<p>Paragraph</p>
</h1>
I want to use the find method to match only elements that have these three classes. I tried this:
$html->find('first.second.last',0)->plaintext;
But it's not working. Is there a way to use the find method for this kind of condition?
Try this code :
<script type="text/javascript">
alert(document.getElementsByClassName("first second last"));
</script>
With this code you get your dom element.
test.html :
<h1 class="first second last">
<p>Paragraph</p>
</h1>
Since I had to do it with PHP Simple HTML DOM Parser, I found an alternative solution. It does not pass the classes to the find method; instead it checks the class attribute like this:
include "simple_html_dom.php";
$html = file_get_html('test.html');
$headings = $html->find('h1');
foreach ($headings as $h1) {
$h1Class = ($h1->class);
if($h1Class == 'first second last'){
$item['test'] = 'success';
}else{
$item['test'] = 'fail';
}
$ar[] = $item;
}
echo "<pre>";
print_r($ar);
getElementsByClassName() is your solution.
<script type="text/javascript">
var yourDomElement = document.getElementsByClassName("first second last");
</script>
If you are using jQuery, you can also select the same element by chaining the class names with the (.) operator:
<script type="text/javascript">
var yourDomElement = $(".first.second.last");
</script>
If you want to get the element without JavaScript or jQuery, use this code:
$dom_element_class = 'first second last';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$results = $xpath->query("//*[@class='" . $dom_element_class . "']");
if ($results->length > 0) {
echo $review = $results->item(0)->nodeValue;
}
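Note that comparing the whole class attribute only matches when the classes appear in exactly that order. A common XPath idiom is to test each class as a separate whitespace-delimited token. The sketch below illustrates that idiom; the helper name classQuery() and the sample markup are my own and not part of any library.

```php
<?php
// Hypothetical helper: builds an XPath that requires every given class,
// in any order, as a whole token of the class attribute.
function classQuery(array $classes) {
    $conditions = array();
    foreach ($classes as $class) {
        $conditions[] = "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
    }
    return '//*[' . implode(' and ', $conditions) . ']';
}

// Sample markup (assumption): same classes in a different order, plus a near miss.
$html = '<html><body>'
      . '<h1 class="last first second">Heading</h1>'
      . '<h2 class="first secondary last">Other</h2>'
      . '</body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$results = $xpath->query(classQuery(array('first', 'second', 'last')));
if ($results->length > 0) {
    echo $results->item(0)->nodeValue;
}
```

Padding the attribute with spaces is what keeps "second" from matching inside "secondary".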
I have the following code....
<div class="outer">
<div>
<h1>Christmas</h1>
<ul>
<li>Holiday</li>
<li>Fun</li>
<li>Joy</li>
</ul>
<h1>4th July</h1>
<ul>
<li>Fireworks</li>
<li>Happy</li>
<li>Spectral</li>
</ul>
</div>
</div>
<div class="outer">
<div>
<h1>Christmas2</h1>
<ul>
<li>Holiday</li>
<li>Fun</li>
<li>Joy</li>
</ul>
<h1>4th July</h1>
<ul>
<li>Fireworks2</li>
<li>Happy</li>
<li>Spectral</li>
</ul>
</div>
</div>
I already know that I can find the DIV and then look inside the DIV for the elements etc by doing...
$doc->loadHTML($output); //$output being the text above
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//div[@class="outer"]'); //Check outer
I know the three lines above will get the elements within the listed DIVs, but what I really want is to get the text of each [H1] and then display the [li] values next to it.
The output I'm looking for is:
Christmas - Holiday, Fun, Joy
4th July - Fireworks, Happy, Spectral
Christmas2 - Holiday, Fun, Joy
4th July2 - Fireworks, Happy, Spectral
Yes, you can continue to use XPath: iterate over the headers and, for each one, get its following sibling, the list. Example:
$doc = new DOMDocument();
$doc->loadHTML($output);
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//div[@class="outer"]/div');
if($elements->length > 0) {
foreach($elements as $div) {
foreach ($xpath->query('./h1', $div) as $e) {
$header = $e->nodeValue;
$list = array();
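// [1] restricts the match to the first <ul> after this header;
// without it, the first header would also collect the items of later lists
foreach ($xpath->query('./following-sibling::ul[1]/li', $e) as $li) {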
$list[] = $li->nodeValue;
}
echo $header . ' - ' . implode(', ', $list) . '<br/>';
}
echo '<hr/>';
}
}
I've used phpQuery for this type of issue in the past:
// include phpquery
require('phpQuery/phpQuery.php');
// initialize
$doc = phpQuery::newDocumentHTML($markup);
// get the text from the various elements
$h1Value = $doc['h1:first']->text(); // Christmas
// ... etc.
(untested)
My PHP script can fetch content from a div by its id, but how can I filter the fetched data and exclude the part inside <div id="navbar" class="n">? I have tried this code, but it's not working:
$regex = '#\<div id="navbar"\>(.+?)\<\/div\>#s';
preg_match($regex, $displaybody, $matches);
$match = $matches[0];
echo "$match";
To fetch the content I am using HTML DOM Parser.
Using regexes to parse HTML is usually a bad idea. You can select nodes with the DOM just fine:
$input = '<html> <body> some content <span class="a">b</span> <div id="navbar" class="n">find me <span class="a">b</span></div> </html>';
$doc = new DOMDocument;
$doc->loadHTML($input);
$navbar = $doc->getElementById('navbar');
$innerhtml = '';
foreach ($navbar->childNodes as $cn) {
$innerhtml .= $doc->saveHTML($cn);
}
print $innerhtml;
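If the goal is to exclude the navbar rather than print it, the same lookup can be followed by removing the node from the tree. This is a sketch under that reading of the question; the sample input is my own.

```php
<?php
// Sample input (assumption): content before and after the navbar.
$input = '<html><body><p>some content</p>'
       . '<div id="navbar" class="n">find me</div>'
       . '<p>more content</p></body></html>';

$doc = new DOMDocument;
$doc->loadHTML($input);

// Detach the navbar from the tree, if it exists.
$navbar = $doc->getElementById('navbar');
if ($navbar !== null) {
    $navbar->parentNode->removeChild($navbar);
}

// Serialize whatever is left inside <body>.
$body = $doc->getElementsByTagName('body')->item(0);
$remaining = '';
foreach ($body->childNodes as $cn) {
    $remaining .= $doc->saveHTML($cn);
}
print $remaining;
```

Removing the node works on the parsed tree, so nested divs inside the navbar cannot break it the way they break a regex.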
I'm new to DOM parsing in PHP:
I have a HTML file that I'm trying to parse. It has a bunch of DIVs like this:
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox">
......
I'm trying to get the contents of the many div boxes using php.
How can I use the DOM parser to do this?
Thanks!
First, I have to tell you that you can't use the same id on two different divs; that is what classes are for. Every element should have a unique id.
Code to get the contents of the div with id="interestingbox":
$html = '
<html>
<head></head>
<body>
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox2">a link</div>
</body>
</html>';
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom_document);
// if you want to get the div with id=interestingbox
$elements = $dom_xpath->query("//div[@id='interestingbox']");
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "\n[". $element->nodeName. "]";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
//OUTPUT
[div] {
Content1
Content2
}
Example with classes:
$html = '
<html>
<head></head>
<body>
<div class="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div class="interestingbox">a link</div>
</body>
</html>';
//the same as before.. just change the xpath
[...]
$elements = $dom_xpath->query("//div[@class='interestingbox']");
[...]
//OUTPUT
[div] {
Content1
Content2
}
[div] {
a link
}
Refer to the DOMXPath page for more details.
I got this to work using simplehtmldom as a start:
$html = file_get_html('http://example.com/');
foreach ($html->find('div[id=interestingbox]') as $result)
{
echo $result->innertext;
}
Very nice function from http://www.sitepoint.com/forums/showthread.php?611393-php5-need-something-like-innerHTML-instead-of-nodeValue
function innerXML($node)
{
$doc = $node->ownerDocument;
$frag = $doc->createDocumentFragment();
foreach ($node->childNodes as $child)
{
$frag->appendChild($child->cloneNode(TRUE));
}
return $doc->saveXML($frag);
}
$dom = new DOMDocument();
$dom->loadXML('
<html>
<body>
<table>
<tr>
<td id="foo">
The first bit of Data I want
<br />The second bit of Data I want
<br />The third bit of Data I want
</td>
</tr>
</table>
</body>
</html>
');
$xpath = new DOMXPath($dom);
$node = $xpath->evaluate("/html/body//td[@id='foo']");
$dataString = innerXML($node->item(0));
// saveXML() serializes the empty element as <br/>, without the space
$dataArr = explode("<br/>", $dataString);
$dataUno = $dataArr[0];
$dataDos = $dataArr[1];
$dataTres = $dataArr[2];
echo "firstdata = $dataUno<br />seconddata = $dataDos<br />thirddata = $dataTres<br />";
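Splitting the serialized string depends on exactly how the br elements are written out. An alternative sketch, using the same kind of markup, is to walk the child nodes of the cell and keep only the text nodes. The variable names here are my own.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<html><body><table><tr><td id="foo">
The first bit of Data I want
<br />The second bit of Data I want
<br />The third bit of Data I want
</td></tr></table></body></html>');

$xpath = new DOMXPath($dom);
$td = $xpath->query("/html/body//td[@id='foo']")->item(0);

// Keep the text nodes and skip the <br/> elements between them.
$dataArr = array();
foreach ($td->childNodes as $child) {
    if ($child->nodeType === XML_TEXT_NODE) {
        $text = trim($child->nodeValue);
        if ($text !== '') {
            $dataArr[] = $text;
        }
    }
}
print_r($dataArr);
```

This also trims the surrounding newlines, which the explode() version leaves in place.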
WebExtractor: https://github.com/knyga/webextractor
It can parse a page with CSS, regex, and XPath selectors.
See the package and its tests for examples:
use WebExtractor\DataExtractor\DataExtractorFactory;
use WebExtractor\DataExtractor\DataExtractorTypes;
use WebExtractor\Client\Client;

$factory = DataExtractorFactory::getFactory();
$extractor = $factory->createDataExtractor(DataExtractorTypes::CSS);
$client = new Client;
$content = $client->get('https://en.wikipedia.org/wiki/2014_Winter_Olympics');
$extractor->setContent($content);
$h1 = $extractor->setSelector('h1')->extract();