How to get parent and nested elements by DOMDocument?


Given typical HTML such as
<ol>
<li>
<span>parent</span>
<ul>
<li><span>nested 1</span></li>
<li><span>nested 2</span></li>
</ul>
</li>
</ol>
I am trying to get the contents of the <li> elements, but I need to get the parent item and the items nested under <ul> separately.
If I do this:
$ols = $doc->getElementsByTagName('ol');
foreach($ols as $ol){
$lis = $ol->getElementsByTagName('li');
// here I need li immediately under <ol>
}
$lis contains all li elements, both the parent and the nested ones.
How can I get only the li elements one level under ol, ignoring deeper levels?

There are two approaches. The first keeps your getElementsByTagName() code and simply picks the first <li> tag, assuming that it is the correct one:
$ols = $doc->getElementsByTagName('ol');
foreach($ols as $ol){
$lis = $ol->getElementsByTagName('li')[0];
echo $doc->saveHTML($lis).PHP_EOL;
}
This echoes...
<li>
<span>parent</span>
<ul>
<li><span>nested 1</span></li>
<li><span>nested 2</span></li>
</ul>
</li>
This works for the sample, but it is not always precise enough: it returns only the first <li> in document order.
The other method is to use XPath, where you can specify the exact level of the elements you want to retrieve. This uses //ol/li, which selects every <li> element that is a direct child of an <ol> element.
$xp = new DOMXPath($doc);
$lis = $xp->query("//ol/li");
foreach ( $lis as $li ) {
echo $doc->saveHTML($li);
}
This also gives...
<li>
<span>parent</span>
<ul>
<li><span>nested 1</span></li>
<li><span>nested 2</span></li>
</ul>
</li>
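To read the parent label and the nested items separately, one option is to run further XPath queries relative to each top-level <li>. The following is a minimal sketch, assuming the sample markup from the question; the $result array layout is only an illustration and not part of the original answer.

```php
<?php
$html = '<ol><li><span>parent</span><ul>'
      . '<li><span>nested 1</span></li>'
      . '<li><span>nested 2</span></li>'
      . '</ul></li></ol>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

$result = [];
foreach ($xp->query('//ol/li') as $li) {
    // A path starting with "./" is evaluated relative to the context
    // node passed as the second argument of query().
    $label = $xp->query('./span', $li)->item(0);

    $nested = [];
    foreach ($xp->query('./ul/li', $li) as $child) {
        $nested[] = trim($child->textContent);
    }

    $result[] = [
        'parent' => $label ? trim($label->textContent) : '',
        'nested' => $nested,
    ];
}

print_r($result);
```

Here each top-level item carries its own list of nested texts, so the two levels never get mixed.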

Related

Getting href-attributes using XPath in PHP

I am new to PHP and trying to write a scraper for a website.
I am trying to get an element with the class name categories. I have used:
$showPage = '<li class="categories">Categories<ul> <li class="cat-item cat-item-940"><a href="http://www.desitvbox.me/category/star-plus/amul-taste-of-india/" >Amul Taste of India</a>
</li>
<li class="cat-item cat-item-942"><a href="http://www.desitvbox.me/category/star-plus/dance-plus/" >Dance Plus</a>
</li>
<li class="cat-item cat-item-239"><a href="http://www.desitvbox.me/category/star-plus/diya-aur-baati-hum-star/" >Diya Aur Baati Hum</a>
</li>
<li class="cat-item cat-item-745"><a href="http://www.desitvbox.me/category/star-plus/suhani-si-ek-ladki/" >Suhani Si Ek Ladki</a>
</li>
<li class="cat-item cat-item-147"><a href="http://www.desitvbox.me/category/star-plus/star-plus-completed-shows/" >Star Plus Completed Shows</a>
<ul class="children">
<li class="cat-item cat-item-772"><a href="http://www.desitvbox.me/category/star-plus/star-plus-completed-shows/airlines/" >Airlines</a>
</li>
<li class="cat-item cat-item-518"><a href="http://www.desitvbox.me/category/star-plus/star-plus-completed-shows/arjun/" >Arjun</a>
</li>
<li class="cat-item cat-item-237"><a href="http://www.desitvbox.me/category/star-plus/star-plus-completed-shows/chef-pankaj-ka-zayka/" >Chef Pankaj Ka Zayka</a>
</li>
</ul>
</li>
</ul></li>';
$dom = new DOMDocument();
$dom->validateOnParse = true;
$dom->loadHTML($showPage);
$dom->preserveWhiteSpace = false;
$allShowsList = new DOMXPath($dom);
$allShowsTableHTML = $allShowsList->query('//li[contains(@class, "categories")]');
However, I now want to read the values of all the a href attributes in $allShowsTableHTML.
Can you please advise how I can do that?
As you can see, one of the records also has a ul with class 'children', which I also want to read.
I need to get the href and the title.
I have tried the code below, but got no result.
$allShowTableDom = new DOMDocument();
foreach ($allShowTableHTML as $showLink)
{
$allShowTableDom->appendChild($allShowTableDom->importNode($showLink,true));
}
$showsArray = $allShowsTableHTML->getElementsByTagName('a');
I think it is not entering the foreach loop.

To get all href attributes of the hyperlinks, add some more axis steps to the query, then loop over the result list; the ->value property of each result holds the URI.
If you can simply take every href attribute inside the whole <li> element, extend your query with //a/@href:
$document = new DOMXPath($dom);
$hrefs = $document->query('//li[contains(@class, "categories")]//a/@href');
foreach ($hrefs as $href) {
echo $href->value;
}
If this returns nodes you don't want, you can instead descend into the contained unordered list and select with a more specific query:
//li[contains(@class, "categories")]/ul/li/a/@href
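Since the question asks for both the href and the title, selecting the <a> elements themselves (rather than their href attributes) gives access to both. This is a rough sketch on a shortened copy of the markup; it assumes that "title" means the link text, and the $shows array is a hypothetical container introduced for illustration.

```php
<?php
$showPage = '<li class="categories">Categories<ul>'
    . '<li class="cat-item cat-item-940"><a href="http://www.desitvbox.me/category/star-plus/amul-taste-of-india/">Amul Taste of India</a></li>'
    . '<li class="cat-item cat-item-147"><a href="http://www.desitvbox.me/category/star-plus/star-plus-completed-shows/">Star Plus Completed Shows</a>'
    . '<ul class="children">'
    . '<li class="cat-item cat-item-772"><a href="http://www.desitvbox.me/category/star-plus/star-plus-completed-shows/airlines/">Airlines</a></li>'
    . '</ul></li>'
    . '</ul></li>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($showPage);
$xpath = new DOMXPath($dom);

$shows = [];
// "//a[@href]" matches links at any depth, including the "children" list.
foreach ($xpath->query('//li[contains(@class, "categories")]//a[@href]') as $a) {
    $shows[] = [
        'href'  => $a->getAttribute('href'),
        'title' => trim($a->textContent),
    ];
}

print_r($shows);
```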

PHP Simple HTML DOM Parser find direct LI elements

HTML:
<ul>
<li><a></a>
<ul>
<li></li>
<li></li>
</ul>
</li>
<li>
...
</li>
</ul>
For parent ul:first-of-type, what would be the selector for its (direct) child li elements, in order to parse the descendant li elements separately?

In jQuery you can simply use this selector: ul > li
Update:
Using Simple HTML DOM:
<ul class="listitems">
<li><a></a>
<ul>
<li></li>
<li></li>
</ul>
</li>
<li>
...
</li>
</ul>
Simple HTML Dom code to get just the first level li items:
$html = file_get_html( $url );
$first_level_items = $html->find( '.listitems', 0)->children();
foreach ( $first_level_items as $item ) {
// ... do stuff ...
}
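If a third-party parser is not available, roughly the same selection can be made with PHP's built-in DOM extension. The sketch below is an assumption-laden alternative, not part of the answer above: the sample text inside the list items is invented, and (//ul)[1]/li is used to mean "direct li children of the first ul in the document".

```php
<?php
$html = '<ul class="listitems">'
      . '<li><a>first</a><ul><li>a</li><li>b</li></ul></li>'
      . '<li>second</li>'
      . '</ul>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Direct <li> children of the first <ul> only.
$topLevel = $xpath->query('(//ul)[1]/li');
$topCount = $topLevel->length;

// Descendant <li> elements, queried relative to each top-level item.
$nestedTexts = [];
foreach ($topLevel as $item) {
    foreach ($xpath->query('./ul/li', $item) as $nested) {
        $nestedTexts[] = $nested->textContent;
    }
}

echo $topCount, PHP_EOL;
print_r($nestedTexts);
```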

Remove unnecessary li

echo $nav gives code like this:
<ul>
<li class="someclass">sometext
<ul>
<li class="someclass">sometext</li>
<li class="spacer"></li>
<li class="someclass">sometext</li>
<li class="spacer"></li>
<li class="someclass">sometext</li>
<li class="spacer"></li>
<li class="someclass">sometext</li>
<li class="spacer"></li>
</ul>
</li>
<li class="spacer"></li>
<li class="someclass">sometext</li>
<li class="spacer"></li>
</ul>
There are list items with class spacer inside each child ul, after each normal list item.
How do I remove the spacer list items which are grandchildren of the main list, using PHP?
Example: <ul> <li> <ul> <li class="spacer">
I'm searching for a regular expression, which should erase <li class="spacer"></li> only in a child <ul> element.

If you don't have access to the $nav variable to remove it (which you likely do) then I'd just use CSS to hide it, something like this should work:
li ul li.spacer {
display:none;
}
If however you have access to $nav - delete that spacer li from the code. Simples.
Also, on a side note: having empty elements like that on the page as "spacers" is semantically bad. Spacing should be handled via CSS by adding margins or padding to other elements on the page. If you use a class of spacer, you may as well go back to using stray <br /> tags everywhere to create spaces.

$xml = new SimpleXMLElement($nav);
$spacers = $xml->xpath('li//li[@class="spacer"]');
foreach($spacers as $i => $n) {
unset($spacers[$i][0]);
}
echo $xml->asXML();
This converts the markup to XML (to export HTML instead, use a recent PHP 5.3 version and DOMDocument). Output:
<?xml version="1.0"?>
<ul>
<li class="someclass">sometext
<ul>
<li class="someclass">sometext</li>
<li class="someclass">sometext</li>
<li class="someclass">sometext</li>
<li class="someclass">sometext</li>
</ul>
</li>
<li class="spacer"/>
<li class="someclass">sometext</li>
<li class="spacer"/>
</ul>
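The DOMDocument variant mentioned above could look roughly like the following. It is a sketch on a shortened copy of the $nav markup, assuming the spacers to remove are exactly the li.spacer elements inside a second-level ul, and it relies on saveHTML() accepting a node argument, which needs PHP 5.3.6 or later.

```php
<?php
$nav = '<ul>'
     . '<li class="someclass">sometext<ul>'
     . '<li class="someclass">sometext</li>'
     . '<li class="spacer"></li>'
     . '</ul></li>'
     . '<li class="spacer"></li>'
     . '<li class="someclass">sometext</li>'
     . '<li class="spacer"></li>'
     . '</ul>';

$dom = new DOMDocument();
$dom->loadHTML($nav);
$xpath = new DOMXPath($dom);

// Only spacers that are grandchildren of the main list match this path;
// spacers directly under the outer <ul> are left alone.
$spacers = $xpath->query('//ul/li/ul/li[@class="spacer"]');
foreach (iterator_to_array($spacers) as $spacer) {
    $spacer->parentNode->removeChild($spacer);
}

// Serialize just the outer <ul>, without the html/body wrapper
// that loadHTML() adds around the fragment.
$root  = $dom->getElementsByTagName('ul')->item(0);
$clean = $dom->saveHTML($root);

echo $clean;
```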

How about str_replace?
$nav = str_replace('<li class="spacer"></li>','',$nav);
edited code below
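Based on the new requirement this code works. I know it's hacky and sloppy, but it works:
$temp = explode("\n",$nav);
$nested_ul = 0; // initialise the flag so the first comparison does not hit an undefined variable
$new_nav = ''; // initialise the output string before appending to it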
for ($i=0;$i<count($temp);$i++) {
if (strstr($temp[$i],"<ul>")) {
$nested_ul = 1;
}
if (strstr($temp[$i],"</ul>")) {
$nested_ul = 0;
}
if ($nested_ul==0) {
if (!strstr($temp[$i],"spacer")) {
$new_nav .= $temp[$i]."\n";
}
} else {
$new_nav .= $temp[$i]."\n";
}
}
echo $new_nav;

"Easily" is relative. It depends on a few things. If you want, modify where the $nav is getting generated from.

Use preg_replace to remove the spacer li tags. The pattern uses # as its delimiter because the markup itself contains a slash:
$new_nav = preg_replace('#<li class="spacer"></li>#', '', $nav);
echo $new_nav;

There are multiple ways:
Do not create it. It will be easier if you do not create something you do not want. It will be easier to maintain. So if you have any control over what is generated into $var string, just change it.
Simply replace it like this: str_replace('<li class="spacer"></li>', '', $var).
Use some HTML parser and remove the nodes.
Use JavaScript to remove <li class="spacer"></li> on client side.

Use substr_replace and strpos instead of str_replace, and specify an offset just after the first spacer.
http://www.php.net/manual/en/function.substr-replace.php
http://www.php.net/manual/en/function.strpos.php

Add the following CSS
ul ul li.spacer { display: none; }

Try this:
$nav = str_replace('<li class="spacer"></li>', '', $nav);

Replace text in PHP

We have a variable with HTML code inside.
<?php echo $list; ?>
This will give something like:
<li><a href='http://site.com/2010/' title='2010'>2010</a></li>
<li><a href='http://site.com/2009/' title='2009'>2009</a></li>
<li><a href='http://site.com/2008/' title='2008'>2008</a></li>
I want to add a class to each <li>, taken from the title attribute:
<li class="y2010"><a href='http://site.com/2010/' title='2010'>2010</a></li>
<li class="y2009"><a href='http://site.com/2009/' title='2009'>2009</a></li>
<li class="y2008"><a href='http://site.com/2008/' title='2008'>2008</a></li>
We have to work with the variable $list.
Tentative scheme:
search for the title attribute in each
<li>....</li>
copy its value into the class that we add to the opening <li>
A PHP solution is wanted.
Thanks.

Parsing the DOM sounds like overkill to me, if I understand the problem you're facing. Assuming that you know for sure that the entire contents of the $list variable will be structured as <li><a href='foo' title='bar'>bar</a></li> then you can do what you're asking pretty easily by combining regular expressions with a loop:
$list = "<li><a href='http://site.com/2010/' title='2010'>2010</a></li>
<li><a href='http://site.com/2009/' title='2009'>2009</a></li>
<li><a href='http://site.com/2008/' title='2008'>2008</a></li>";
preg_match_all("/title='([^']*)'/s",$list,$matches); //this gets all titles
$output=$list;
foreach($matches[1] as $match) { //this applies the titles to the li elements
$location = strpos($output,"<li>");
$output = substr($output,0,$location)."<li class=\"y".$match."\">".substr($output,$location+4); // prefix the title with "y" to build the class
}
If you echo $output:
<li class="y2010"><a href='http://site.com/2010/' title='2010'>2010</a></li>
<li class="y2009"><a href='http://site.com/2009/' title='2009'>2009</a></li>
<li class="y2008"><a href='http://site.com/2008/' title='2008'>2008</a></li>

I accomplished this by splitting the text into an array, and performing a search/replace once the year is obtained.
$carrReturn="\r\n"; //Set the Newline and Return string to search for
$arr = explode($carrReturn, $list); //Break the text into an array
$list=""; //clear $list
for ($x=0; $x<count($arr); $x++){
$current=$arr[$x];
$year= strip_tags($current); //Get the year by stripping the HTML tags.
$list.=str_replace("<li", "<li class=\"y".$year."\"",$current)."\r\n";
//Reconstruct $list
}
Output
<li class="y2010"><a href='http://site.com/2010/' title='2010'>2010</a></li>
<li class="y2009"><a href='http://site.com/2009/' title='2009'>2009</a></li>
<li class="y2008"><a href='http://site.com/2008/' title='2008'>2008</a></li>

I don't know why you guys are so obsessed with Regex. DOM is clean and readable:
$dom = new DOMDocument;
$dom->loadXML("<ul>$list</ul>");
$xPath = new DOMXPath($dom);
foreach($xPath->query('//li/a/@title') as $node) {
$node->parentNode->parentNode->setAttribute('class', $node->nodeValue);
}
echo $dom->saveXML($dom->documentElement);
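Outputs (line breaks added for readability):
<ul>
<li class="2010"><a href="http://site.com/2010/" title="2010">2010</a></li>
<li class="2009"><a href="http://site.com/2009/" title="2009">2009</a></li>
<li class="2008"><a href="http://site.com/2008/" title="2008">2008</a></li>
</ul>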
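The question asks for a y prefix on the class and for the bare <li> items rather than a wrapping <ul>. A possible variation of the DOM approach is sketched below; it assumes $list is well-formed enough to be parsed as XML once wrapped in a temporary <ul>, and the $output variable is introduced here only for illustration.

```php
<?php
$list = "<li><a href='http://site.com/2010/' title='2010'>2010</a></li>\n"
      . "<li><a href='http://site.com/2009/' title='2009'>2009</a></li>\n"
      . "<li><a href='http://site.com/2008/' title='2008'>2008</a></li>";

$dom = new DOMDocument();
$dom->loadXML("<ul>$list</ul>"); // temporary wrapper so there is a single root
$xpath = new DOMXPath($dom);

// Visit every <li> whose child <a> has a title attribute.
foreach ($xpath->query('//li[a/@title]') as $li) {
    $title = $xpath->evaluate('string(a/@title)', $li);
    $li->setAttribute('class', 'y' . $title);
}

// Serialize the children of the wrapper, dropping the wrapper itself.
$output = '';
foreach ($dom->documentElement->childNodes as $child) {
    $output .= $dom->saveXML($child);
}

echo $output;
```

Note that the serializer writes attributes with double quotes, so the single quotes of the input are not preserved.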

RegEx:
preg_replace("/<li>(<a .+ title=')(\d{4})'/", "<li class='y$2'>$1$2'", $string);
This really depends on every li and anchor being formatted the same exact way each time though.
