Getting href-attributes using XPath in PHP - php

I am new to PHP and trying to write a scraper for a website.
I am trying to get an element with the class name categories. I have used:
$showPage = '<li class="categories">Categories<ul> <li class="cat-item cat-item-940"><a href="http://www.desitvbox.me/category/star-plus/amul-taste-of-india/" >Amul Taste of India</a>
</li>
<li class="cat-item cat-item-942"><a href="http://www.desitvbox.me/category/star-plus/dance-plus/" >Dance Plus</a>
</li>
<li class="cat-item cat-item-239"><a href="http://www.desitvbox.me/category/star-plus/diya-aur-baati-hum-star/" >Diya Aur Baati Hum</a>
</li>
<li class="cat-item cat-item-745"><a href="http://www.desitvbox.me/category/star-plus/suhani-si-ek-ladki/" >Suhani Si Ek Ladki</a>
</li>
<li class="cat-item cat-item-147"><a href="http://www.desitvbox.me/category/star-plus/star-plus-completed-shows/" >Star Plus Completed Shows</a>
<ul class="children">
<li class="cat-item cat-item-772"><a href="http://www.desitvbox.me/category/star-plus/star-plus-completed-shows/airlines/" >Airlines</a>
</li>
<li class="cat-item cat-item-518"><a href="http://www.desitvbox.me/category/star-plus/star-plus-completed-shows/arjun/" >Arjun</a>
</li>
<li class="cat-item cat-item-237"><a href="http://www.desitvbox.me/category/star-plus/star-plus-completed-shows/chef-pankaj-ka-zayka/" >Chef Pankaj Ka Zayka</a>
</li>
</ul>
</li>
</ul></li>';
$dom = new DOMDocument();
$dom->validateOnParse = true;
$dom->loadHTML($showPage);
$dom->preserveWhiteSpace = false;
$allShowsList = new DOMXPath($dom);
$allShowsTableHTML = $allShowsList->query('//li[contains(@class, "categories")]');
Now I want to read the values of all a href attributes inside $allShowsTableHTML.
Can you please advise how I can do that?
As you can see, one of the records also contains a ul with class 'children', which I also want to read.
I need to get the href and the title.
I have tried the code below, but got no result.
$allShowTableDom = new DOMDocument();
foreach ($allShowTableHTML as $showLink)
{
$allShowTableDom->appendChild($allShowTableDom->importNode($showLink,true));
}
$showsArray = $allShowsTableHTML->getElementsByTagName('a');
I think it is not going in foreach loop.

To get all href attributes of the hyperlinks, add a few more axis steps to the query, then loop over the result list; the ->value property of each result holds the URI.
If you can simply take every href attribute inside the whole <li> element, extend your query with //a/@href:
$document = new DOMXPath($dom);
$hrefs = $document->query('//li[contains(@class, "categories")]//a/@href');
foreach ($hrefs as $href) {
echo $href->value;
}
If this returns nodes you don't want, you can instead descend into the contained unordered list and select with a more specific query:
//li[contains(@class, "categories")]/ul/li/a/@href
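Since the question asks for both the href and the title, here is a minimal sketch that selects the <a> elements themselves rather than their href attributes, so both values can be read from the same node. The sample markup and the example.com URLs are assumptions made for illustration, as is the $links array.

```php
<?php
// Shortened stand-in for the markup in the question (URLs are placeholders).
$html = '<ul><li class="categories">Categories<ul>'
      . '<li class="cat-item"><a href="http://example.com/a/">Show A</a>'
      . '<ul class="children"><li class="cat-item">'
      . '<a href="http://example.com/a/b/">Show B</a></li></ul>'
      . '</li></ul></li></ul>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select the anchors (not the attributes) so href and text are both available.
$links = [];
foreach ($xpath->query('//li[contains(@class, "categories")]//a') as $a) {
    $links[] = [
        'href'  => $a->getAttribute('href'),
        'title' => trim($a->textContent),
    ];
}

print_r($links);
```

Because the query uses //a below the categories item, links inside the nested ul with class children are included as well.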

Related

How to get parent and nested elements by DOMDocument?

Given typical HTML such as
<ol>
<li>
<span>parent</span>
<ul>
<li><span>nested 1</span></li>
<li><span>nested 2</span></li>
</ul>
</li>
</ol>
I am trying to get the contents of the <li> elements, but I need to get the parent and those nested under the ul separately.
If I do this:
$ols = $doc->getElementsByTagName('ol');
foreach($ols as $ol){
$lis = $ol->getElementsByTagName('li');
// here I need li immediately under <ol>
}
$lis contains all li elements, both the parent and the nested ones.
How can I get only the li elements one level under the ol, ignoring deeper levels?
There are two approaches. The first keeps your getElementsByTagName() approach: pick out the first <li> tag and assume that it is the correct one...
$ols = $doc->getElementsByTagName('ol');
foreach($ols as $ol){
$lis = $ol->getElementsByTagName('li')[0];
echo $doc->saveHTML($lis).PHP_EOL;
}
This echoes...
<li>
<span>parent</span>
<ul>
<li><span>nested 1</span></li>
<li><span>nested 2</span></li>
</ul>
</li>
This should work, but it is not always exact enough: it returns only the first <li>, so an <ol> with several top-level items loses the rest.
The other method is to use XPath, where you can specify the levels of the document tags you want to retrieve. This uses //ol/li, which selects every <li> that is a direct child of an <ol>.
$xp = new DOMXPath($doc);
$lis = $xp->query("//ol/li");
foreach ( $lis as $li ) {
echo $doc->saveHTML($li);
}
this also gives...
<li>
<span>parent</span>
<ul>
<li><span>nested 1</span></li>
<li><span>nested 2</span></li>
</ul>
</li>
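To get the parent and the nested items separately, as the question asks, one option is to run a second, relative query with each top-level <li> as the context node. This is only a sketch: the $result structure is an assumption made for illustration, and it assumes each item keeps its label in a <span> as in the sample.

```php
<?php
$html = '<ol><li><span>parent</span><ul>'
      . '<li><span>nested 1</span></li>'
      . '<li><span>nested 2</span></li>'
      . '</ul></li></ol>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

$result = [];
foreach ($xp->query('//ol/li') as $li) {
    // Label of the top-level item: the <span> that is a direct child of this <li>.
    $parent = $xp->evaluate('string(./span)', $li);

    // Items one level down: <li> children of a <ul> directly inside this <li>.
    $nested = [];
    foreach ($xp->query('./ul/li', $li) as $child) {
        $nested[] = trim($child->textContent);
    }

    $result[] = ['parent' => $parent, 'nested' => $nested];
}

print_r($result);
```

Passing $li as the second argument makes the paths starting with ./ relative to that item, so deeper levels never leak into the top-level list.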

PHP DOMXPath->query()/->evaluate() not matching inner text

I am currently trying to create a pure PHP menu traversal system - it's because I'm doing an impromptu project for some people but they want as little JS as possible (i.e: none) and ideally pure PHP.
I have a menu which looks like this:
ul {
list-style-type: none;
}
nav > ul.sidebar-list ul.sub {
display: none;
}
nav > ul.sidebar-list ul.sub.active {
display: block;
}
<nav class="sidebar" aria-labelledby="primary-navigation">
<ul class="sidebar-list">
<!--each element has a sub-menu which is initially hidden by css when the page is loaded. Via php the appropriate path the current page and top-level links will be visible only-->
<li>Home</li>
<!--sub-items-->
<ul class="sub active">
<li>Barn</li>
<li>Activities</li>
<ul class="sub active">
<li>News</li>
<li>Movements</li>
<li>Reviews</li>
<li>About Us</li>
<li>Terms of Use</li>
</ul>
</ul>
<li>Events</li>
<ul class="sub">
<li>Overview</li>
<li>Farming</li>
<li>Practises</li>
<li>Links</li>
<ul class="sub">
<li>Another Farm</li>
<li>24m</li>
</ul>
</ul>
</ul>
</nav>
In order to attempt to match the title inner-text of the page to a menu-item innertext (probably not the best way of doing things but I'm still learning php) I run:
$menu = new DOMDocument();
assert($menu->loadHTMLFile($menu_path), "Loading nav.html (menu file) failed");
//show content to log of the html document
error_log("HTML file: \n\n".$menu->textContent);
//set up a query to find an element matching the title string found
$xpath = new DOMXPath($menu);
$menu_query = "//a/li[matches(text(), '$title_text', 'i')]";
$elements = $xpath->query($menu_query);
error_log($elements ? ("Result of xpath query is: ".print_r($elements, TRUE)): "The xpath query for searching the menu is incorrect and will not find you anything!\ntype of return: ".gettype($elements));
I get the correct return at: https://www.freeformatter.com/xpath-tester.html but in the script I don't. I have tried many different combinations of the text matching such as: //x:a/x:li[lower-case(text())='$title_text'] but always an empty node list.
PHP's DOM extension is built on libxml2, which implements only XPath 1.0. matches() and lower-case() are XPath 2.0 functions, so you would have seen warnings in your error log if you were looking for them:
PHP Warning: DOMXPath::query(): xmlXPathCompOpEval: function matches not found in php shell code on line 1
PHP Stack trace:
PHP 1. {main}() php shell code:0
PHP 2. DOMXPath->query() php shell code:1
A simple case-sensitive match can be done with an equality check.
$title_text = "Farming";
$menu_query = "//a/li[. = '$title_text']";
But the case-insensitive search involves translating the characters from upper to lower case:
$title_text = "FaRmInG";
$title_text = strtolower($title_text);
$menu_query = "//a/li[translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz') = '$title_text']";
In either case we end up with a DOMNodeList that can be iterated through:
$html = <<< HTML
<nav class="sidebar" aria-labelledby="primary-navigation">
<ul class="sidebar-list">
<!--each element has a sub-menu which is initially hidden by css when the page is loaded. Via php the appropriate path the current page and top-level links will be visible only-->
<li>Home</li>
<!--sub-items-->
<ul class="sub active">
<li>Barn</li>
<li>Activities</li>
<ul class="sub active">
<li>News</li>
<li>Movements</li>
<li>Reviews</li>
<li>About Us</li>
<li>Terms of Use</li>
</ul>
</ul>
<li>Events</li>
<ul class="sub">
<li>Overview</li>
<li>Farming</li>
<li>Practises</li>
<li>Links</li>
<ul class="sub">
<li>Another Farm</li>
<li>24m</li>
</ul>
</ul>
</ul>
</nav>
HTML;
$menu = new DOMDocument();
$menu->loadHTML($html);
$xpath = new DOMXPath($menu);
$elements = $xpath->query($menu_query);
foreach ($elements as $element) {
print_r($element);
}
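As a compact, self-contained illustration of the translate() technique, the sketch below matches plain <li> items in a three-item list. The shortened markup and the $matches array are assumptions made for the example; normalize-space() is added so that surrounding whitespace does not break the comparison.

```php
<?php
// Shortened stand-in for the menu (plain <li> items only).
$html = '<ul><li>Overview</li><li>Farming</li><li>Links</li></ul>';

$menu = new DOMDocument();
$menu->loadHTML($html);
$xpath = new DOMXPath($menu);

// Lower-case the search term in PHP, and the node text in XPath 1.0 via translate().
$title_text = strtolower('FaRmInG');
$query = "//li[translate(normalize-space(.), "
       . "'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz') = '$title_text']";

$matches = [];
foreach ($xpath->query($query) as $li) {
    $matches[] = $li->textContent;
}

print_r($matches);
```

Note that the title is interpolated directly into the query string, so a title containing an apostrophe would break the expression and would need separate handling.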

Read next html tag using PHP

This is my HTML part of code:
<ul>
<li> something,,,,... </li>
<li> something,,,,... </li>
<li> something,,,,... </li>
<li> something,,,,... </li>
<li>
<h5>Price</h5>
<span>100$</span>
</li>
</ul>
In my php I am using php-simple-dom for finding tags. So php part looks something like this:
foreach($html->find("li") as $li)
{
if(strpos($li->plaintext,"<h5>Price</h5>") !== false)
{
var_dump($li->plaintext); // result: string("<h5>Price</h5><span>100$</span>")
}
}
I have some other idea:
foreach($html->find("h5") as $h5)
{
if(strpos($h5->plaintext,"Price") !== false)
{
// finding some way to read next tag
}
}
What do I need?
I need to get the <span> value. This is an example; in the real code there are more tags and multiple spans in one <li>. The point is that the next tag contains the wanted information.
I'm not sure how many tags there could be in one <li>, but I believe the <span> you are looking for always comes right after the <h5>. You can use the $e->next_sibling() method as follows:
foreach ($html->find('li h5') as $h5) {
$price = $h5->next_sibling();
echo $price->plaintext;
}
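The same "next tag after the heading" idea can be sketched with PHP's built-in DOM classes instead of the simple DOM library, using the XPath following-sibling axis. The shortened markup and the $prices array are assumptions made for illustration.

```php
<?php
// Single-quoted string, so the $ in 100$ is taken literally.
$html = '<ul><li> something </li><li><h5>Price</h5><span>100$</span></li></ul>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// For every <h5> whose text is "Price", take the first element that follows it
// among its siblings, whatever tag that element happens to be.
$query = '//h5[normalize-space(.) = "Price"]/following-sibling::*[1]';

$prices = [];
foreach ($xpath->query($query) as $next) {
    $prices[] = trim($next->textContent);
}

print_r($prices);
```

Using following-sibling::*[1] rather than a specific tag name keeps the query valid if the element after the heading is not a <span>.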
If you want to get the value of a specific tag, you may find DOMDocument::getElementsByTagName useful.
Return Values
A new DOMNodeList object containing all the matched elements.
Here is how you would use it:
$html = <<< HTML
<ul>
<li> something,,,,... </li>
<li> something,,,,... </li>
<li> something,,,,... </li>
<li> something,,,,... </li>
<li>
<h5>Price</h5>
<span>100$</span>
</li>
</ul>
HTML;
$dom = new DOMDocument;
$dom->loadXML($html);
$prices = $dom->getElementsByTagName('span');
foreach ($prices as $price) {
echo $price->nodeValue, PHP_EOL;
}
The above example will output: 100$
Go ahead and try it with several prices. It works as expected.
You might also find the DOM documentation useful.

Remove unnecessary li

echo $nav gives code like this:
<ul>
<li class="someclass">sometext
<ul>
<li class="someclass">sometext</li>
<li class="spacer"></li>
<li class="someclass">sometext</li>
<li class="spacer"></li>
<li class="someclass">sometext</li>
<li class="spacer"></li>
<li class="someclass">sometext</li>
<li class="spacer"></li>
</ul>
</li>
<li class="spacer"></li>
<li class="someclass">sometext</li>
<li class="spacer"></li>
</ul>
There are list items with class spacer inside each child ul, after each normal list item.
How do I remove the spacer list items which are grandchildren of the main list, using PHP?
Example: <ul> <li> <ul> <li class="spacer">
I'm searching for a regular expression, which should erase <li class="spacer"></li> only in a child <ul> element.
If you don't have access to the $nav variable to remove it (which you likely do) then I'd just use CSS to hide it, something like this should work:
li ul li.spacer {
display:none;
}
If however you have access to $nav - delete that spacer li from the code. Simples.
Also, on a side note: having empty elements like that on the page as "spacers" is semantically bad. Spacing should be handled via CSS by adding margins or padding to other elements on the page. Don't use a class of spacer; if you do, you may as well go back to using stray <br /> tags everywhere to create spaces.
$xml = new SimpleXMLElement($nav);
$spacers = $xml->xpath('li//li[@class="spacer"]');
foreach($spacers as $i => $n) {
unset($spacers[$i][0]);
}
echo $xml->asXML();
This is converting to XML (use a recent PHP 5.3 version and DOMDocument to export to HTML). Output:
<?xml version="1.0"?>
<ul>
<li class="someclass">sometext
<ul>
<li class="someclass">sometext</li>
<li class="someclass">sometext</li>
<li class="someclass">sometext</li>
<li class="someclass">sometext</li>
</ul>
</li>
<li class="spacer"/>
<li class="someclass">sometext</li>
<li class="spacer"/>
</ul>
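If HTML output is preferred over XML, a similar removal can be sketched with DOMDocument and DOMXPath. The shortened $nav value and the $result variable are assumptions made for illustration; the XPath only targets spacers that sit in a <ul> nested inside a top-level <li>.

```php
<?php
// Shortened stand-in for $nav: one nested spacer and one top-level spacer.
$nav = '<ul><li class="someclass">sometext<ul>'
     . '<li class="someclass">sometext</li><li class="spacer"></li>'
     . '</ul></li><li class="spacer"></li></ul>';

$dom = new DOMDocument();
$dom->loadHTML($nav);
$xpath = new DOMXPath($dom);

// Spacers that are grandchildren of a list: ul > li > ul > li.spacer
foreach ($xpath->query('//ul/li/ul/li[@class="spacer"]') as $spacer) {
    $spacer->parentNode->removeChild($spacer);
}

// Serialise only the outer <ul>, without the html/body wrapper loadHTML adds.
$result = $dom->saveHTML($dom->getElementsByTagName('ul')->item(0));

echo $result;
```

The top-level spacer is left in place because its parent <ul> is not nested inside another <li>.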
How about str_replace?
$nav = str_replace('<li class="spacer"></li>','',$nav);
edited code below
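Based on the new requirement this code works. I know it's hacky and sloppy but it works:
$nested_ul = 0; // set to 1 after a line containing <ul>, back to 0 after </ul>
$new_nav = ''; // output buffer
$temp = explode("\n",$nav);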
for ($i=0;$i<count($temp);$i++) {
if (strstr($temp[$i],"<ul>")) {
$nested_ul = 1;
}
if (strstr($temp[$i],"</ul>")) {
$nested_ul = 0;
}
if ($nested_ul==0) {
if (!strstr($temp[$i],"spacer")) {
$new_nav .= $temp[$i]."\n";
}
} else {
$new_nav .= $temp[$i]."\n";
}
}
echo $new_nav;
"Easily" is relative. It depends on a few things. If you want, modify where the $nav is getting generated from.
Use preg_replace to replace the li tags (the pattern uses # as the delimiter so the / in </li> needs no escaping):
$new_nav = preg_replace('#<li class="spacer"></li>#', '', $nav);
echo $new_nav;
There are multiple ways:
Do not create it. It will be easier if you do not create something you do not want. It will be easier to maintain. So if you have any control over what is generated into $var string, just change it.
Simply replace it like this: str_replace('<li class="spacer"></li>', '', $var).
Use some HTML parser and remove the nodes.
Use JavaScript to remove <li class="spacer"></li> on client side.
Use substr_replace and strpos instead of str_replace, and specify an offset just after the first spacer.
http://www.php.net/manual/en/function.substr-replace.php
http://www.php.net/manual/en/function.strpos.php
Add the following CSS
ul ul li.spacer { display: none; }
Try this:
$nav = str_replace('<li class="spacer"></li>', '', $nav);

Replace text in PHP

We have a variable with HTML code inside.
<?php echo $list; ?>
This will give something like:
<li><a href='http://site.com/2010/' title='2010'>2010</a></li>
<li><a href='http://site.com/2009/' title='2009'>2009</a></li>
<li><a href='http://site.com/2008/' title='2008'>2008</a></li>
We want to add a class to each <li>, taken from the title attribute and prefixed with y:
<li class="y2010"><a href='http://site.com/2010/' title='2010'>2010</a></li>
<li class="y2009"><a href='http://site.com/2009/' title='2009'>2009</a></li>
<li class="y2008"><a href='http://site.com/2008/' title='2008'>2008</a></li>
We should work with variable $list.
Tentative scheme:
search for title attribute in each
<li>....</li>
throw its value to the class, which we add for opening <li>
PHP solution wanted.
Thanks.
Parsing the DOM sounds like overkill to me, if I understand the problem you're facing. Assuming that you know for sure that the entire contents of the $list variable will be structured as <li><a href='foo' title='bar'>bar</a></li> then you can do what you're asking pretty easily by combining regular expressions with a loop:
$list = "<li><a href='http://site.com/2010/' title='2010'>2010</a></li>
<li><a href='http://site.com/2009/' title='2009'>2009</a></li>
<li><a href='http://site.com/2008/' title='2008'>2008</a></li>";
preg_match_all("/title='([^']*)'/s",$list,$matches); //this gets all titles
$output=$list;
foreach($matches[1] as $match) { //this applies the titles to the li elements
$location = strpos($output,"<li>");
$output = substr($output,0,$location).'<li class="y'.$match.'">'.substr($output,$location+4);
}
If you echo $output:
<li class="y2010"><a href='http://site.com/2010/' title='2010'>2010</a></li>
<li class="y2009"><a href='http://site.com/2009/' title='2009'>2009</a></li>
<li class="y2008"><a href='http://site.com/2008/' title='2008'>2008</a></li>
I accomplished this by splitting the text into an array, and performing a search/replace once the year is obtained.
$carrReturn="\r\n"; //Set the Newline and Return string to search for
$arr = explode($carrReturn, $list); //Break the text into an array
$list=""; //clear $list
for ($x=0; $x<count($arr); $x++){
$current=$arr[$x];
$year= strip_tags($current); //Get the year by stripping the HTML tags.
$list.=str_replace("<li", "<li class=\"y".$year."\"",$current)."\r\n";
//Reconstruct $list
}
Output
<li class="y2010"><a href='http://site.com/2010/' title='2010'>2010</a></li>
<li class="y2009"><a href='http://site.com/2009/' title='2009'>2009</a></li>
<li class="y2008"><a href='http://site.com/2008/' title='2008'>2008</a></li>
I don't know why you guys are so obsessed with regex. DOM is clean and readable:
$dom = new DOMDocument;
$dom->loadXML("<ul>$list</ul>");
$xPath = new DOMXPath($dom);
foreach($xPath->query('//li/a/@title') as $node) {
$node->parentNode->parentNode->setAttribute('class', $node->nodeValue);
}
echo $dom->saveXML($dom->documentElement);
Outputs:
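<ul>
<li class="2010"><a href="http://site.com/2010/" title="2010">2010</a></li>
<li class="2009"><a href="http://site.com/2009/" title="2009">2009</a></li>
<li class="2008"><a href="http://site.com/2008/" title="2008">2008</a></li>
</ul>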
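The question asks for class names with a y prefix and for the bare list items without a wrapper. The following sketch adapts the DOM approach accordingly: it prepends y to the title value and serialises only the children of the temporary <ul>. The two-item $list and the $output variable are assumptions made for illustration.

```php
<?php
$list = "<li><a href='http://site.com/2010/' title='2010'>2010</a></li>\n"
      . "<li><a href='http://site.com/2009/' title='2009'>2009</a></li>";

$dom = new DOMDocument();
// Wrap the fragment in a temporary root so it parses as well-formed XML.
$dom->loadXML("<ul>$list</ul>");
$xpath = new DOMXPath($dom);

// Every <li> that has an <a> child carrying a title attribute.
foreach ($xpath->query('//li[a/@title]') as $li) {
    $title = $xpath->evaluate('string(a/@title)', $li);
    $li->setAttribute('class', 'y' . $title);
}

// Serialise the children of the temporary <ul>, dropping the wrapper itself.
$output = '';
foreach ($dom->documentElement->childNodes as $child) {
    $output .= $dom->saveXML($child);
}

echo $output;
```

loadXML requires the fragment to be well-formed XML; for looser markup, loadHTML would be the safer choice.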
RegEx:
$list = preg_replace("/<li>(<a .+? title='(\d{4})')/", '<li class="y$2">$1', $list);
This really depends on every li and anchor being formatted the same exact way each time though.
