Fellas!
I have one nasty page to parse but can't figure out how to extract correct data blocks from it using Simple HTML DOM, because it has no CSS child selector support.
HTML:
<ul class="ul-block">
<li>xxx</li>
<li>xxx</li>
<li>
<ul>
<li>xxx2</li>
</ul>
</li>
</ul>
How would I extract (direct) child li elements of parent ul.ul-block?
The $node->find('ul[class=ul-block] > li'); doesn't work, and $node->find('ul[class=ul-block] li'); of course also finds the nested descendant li elements :(
I had the same issue, and used the children method to grab just the first level items.
<ul class="my-list">
<li>
Some Text
<ul>
<li>Some Inner Text</li>
<li>Some Inner Text</li>
<li>Some Inner Text</li>
<li>Some Inner Text</li>
</ul>
</li>
<li>
Some Text
<ul>
<li>Some Inner Text</li>
<li>Some Inner Text</li>
<li>Some Inner Text</li>
<li>Some Inner Text</li>
</ul>
</li>
</ul>
And here's the Simple HTML Dom code to get just the first level li items:
$html = file_get_html( $url );
$first_level_items = $html->find( '.my-list', 0)->children();
foreach ( $first_level_items as $item ) {
// ... do stuff ...
}
A simple example with PHP's DOM extension:
$dom = new DomDocument;
$dom->loadHtml('
<ul class="ul-block">
<li>a</li>
<li>b</li>
<li>
<ul>
<li>c</li>
</ul>
</li>
</ul>
');
$xpath = new DomXpath($dom);
foreach ($xpath->query('//ul[@class="ul-block"]/li') as $liNode) {
echo $liNode->nodeValue, '<br />';
}
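One caveat with @class="ul-block" is that it only matches when the attribute is exactly that string. If the real page puts several classes on the list, a common XPath 1.0 idiom is to pad the attribute with spaces and look for the padded token. The following is a minimal sketch under that assumption; the extra "list" class and the second "ul-blocked" list are made up for illustration.

```php
<?php
// Hypothetical markup: the target list carries two classes, and a second
// list has a similar-looking class that must NOT match.
$dom = new DOMDocument();
$dom->loadHTML('
<ul class="list ul-block">
<li>a</li>
<li>b</li>
<li>c-parent<ul><li>c</li></ul></li>
</ul>
<ul class="ul-blocked"><li>x</li></ul>
');

$xpath = new DOMXPath($dom);

// Pad @class with spaces so "ul-block" is matched as a whole token,
// then step to direct <li> children only (single slash).
$query = '//ul[contains(concat(" ", normalize-space(@class), " "), " ul-block ")]/li';

$directItems = [];
foreach ($xpath->query($query) as $liNode) {
    // In this sample every <li> starts with a text node, so firstChild
    // gives the item's own label without the nested list's text.
    $directItems[] = trim($liNode->firstChild->nodeValue);
}
print_r($directItems);
```

The nested "c" item and the "x" item of the other list are skipped; only the three first-level items are collected.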
In a typical HTML as
<ol>
<li>
<span>parent</span>
<ul>
<li><span>nested 1</span></li>
<li><span>nested 2</span></li>
</ul>
</li>
</ol>
I try to get the contents of <li> elements but I need to get the parent and those nested under ul separately.
If I go as
$ols = $doc->getElementsByTagName('ol');
foreach($ols as $ol){
$lis = $ol->getElementsByTagName('li');
// here I need li immediately under <ol>
}
$lis is all li elements including both parent and nested ones.
How can I get li elements one level under ol by ignoring deeper levels?
There are two approaches to this. The first sticks with getElementsByTagName(), as in your code: the idea is just to pick out the first <li> tag it returns and assume that it is the correct one...
$ols = $doc->getElementsByTagName('ol');
foreach($ols as $ol){
$lis = $ol->getElementsByTagName('li')[0];
echo $doc->saveHTML($lis).PHP_EOL;
}
This echoes...
<li>
<span>parent</span>
<ul>
<li><span>nested 1</span></li>
<li><span>nested 2</span></li>
</ul>
</li>
which should work - BUT is not exact enough at times.
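If you would rather stay with plain DOM calls and still be exact, one option is to walk childNodes and keep only element children with the wanted tag name. This is a rough sketch, not part of the original answer; directChildrenByTag is a hypothetical helper name.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<ol><li><span>parent</span><ul><li><span>nested 1</span></li><li><span>nested 2</span></li></ul></li></ol>');

/**
 * Hypothetical helper: return only the direct element children of $parent
 * whose tag name equals $tagName (no deeper descendants).
 */
function directChildrenByTag(DOMNode $parent, string $tagName): array
{
    $matches = [];
    foreach ($parent->childNodes as $child) {
        // childNodes also contains text and comment nodes, so filter by type.
        if ($child instanceof DOMElement && $child->tagName === $tagName) {
            $matches[] = $child;
        }
    }
    return $matches;
}

$topLevelCounts = [];
foreach ($doc->getElementsByTagName('ol') as $ol) {
    $topLevelCounts[] = count(directChildrenByTag($ol, 'li'));
}
print_r($topLevelCounts);
```

For the sample markup this finds a single top-level li per ol, whereas getElementsByTagName('li') on the same ol would return all three.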
The other method would be to use XPath, where you can specify the levels of the document tags you want to retrieve. This uses //ol/li, which selects every <li> that is a direct child of an <ol>, wherever that <ol> sits in the document.
$xp = new DOMXPath($doc);
$lis = $xp->query("//ol/li");
foreach ( $lis as $li ) {
echo $doc->saveHTML($li);
}
this also gives...
<li>
<span>parent</span>
<ul>
<li><span>nested 1</span></li>
<li><span>nested 2</span></li>
</ul>
</li>
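To get the parent and the nested items separately, as the question asks, the second argument of DOMXPath::query() and DOMXPath::evaluate() can be used to run a relative expression from each top-level li. A minimal sketch, assuming the markup from the question (label in a span, nested items in a ul):

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<ol><li><span>parent</span><ul><li><span>nested 1</span></li><li><span>nested 2</span></li></ul></li></ol>');
$xp = new DOMXPath($doc);

$result = [];
foreach ($xp->query('//ol/li') as $li) {
    // "./span" is evaluated relative to the current top-level <li>.
    $label = $xp->evaluate('string(./span)', $li);

    // "./ul/li" reaches only the items one list level below this <li>.
    $nested = [];
    foreach ($xp->query('./ul/li', $li) as $nestedLi) {
        $nested[] = trim($nestedLi->textContent);
    }

    $result[] = ['label' => $label, 'nested' => $nested];
}
print_r($result);
```

Each entry of $result then holds the parent's own label and the list of its nested labels.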
I am currently trying to create a pure PHP menu traversal system - it's because I'm doing an impromptu project for some people but they want as little JS as possible (i.e: none) and ideally pure PHP.
I have a menu which looks like this:
ul {
list-style-type: none;
}
nav > ul.sidebar-list ul.sub {
display: none;
}
nav > ul.sidebar-list ul.sub.active {
display: block;
}
<nav class="sidebar" aria-labelledby="primary-navigation">
<ul class="sidebar-list">
<!--each element has a sub-menu which is initially hidden by css when the page is loaded. Via php the appropriate path the current page and top-level links will be visible only-->
<li>Home</li>
<!--sub-items-->
<ul class="sub active">
<li>Barn</li>
<li>Activities</li>
<ul class="sub active">
<li>News</li>
<li>Movements</li>
<li>Reviews</li>
<li>About Us</li>
<li>Terms of Use</li>
</ul>
</ul>
<li>Events</li>
<ul class="sub">
<li>Overview</li>
<li>Farming</li>
<li>Practises</li>
<li>Links</li>
<ul class="sub">
<li>Another Farm</li>
<li>24m</li>
</ul>
</ul>
</ul>
</nav>
In order to attempt to match the title inner-text of the page to a menu-item innertext (probably not the best way of doing things but I'm still learning php) I run:
$menu = new DOMDocument();
assert($menu->loadHTMLFile($menu_path), "Loading nav.html (menu file) failed");
//show content to log of the html document
error_log("HTML file: \n\n".$menu->textContent);
//set up a query to find an element matching the title string found
$xpath = new DOMXPath($menu);
$menu_query = "//a/li[matches(text(), '$title_text', 'i')]";
$elements = $xpath->query($menu_query);
error_log($elements ? ("Result of xpath query is: ".print_r($elements, TRUE)): "The xpath query for searching the menu is incorrect and will not find you anything!\ntype of return: ".gettype($elements));
I get the correct return at: https://www.freeformatter.com/xpath-tester.html but in the script I don't. I have tried many different combinations of the text matching such as: //x:a/x:li[lower-case(text())='$title_text'] but always an empty node list.
PHP uses XPath 1.0. matches (like lower-case) is an XPath 2.0 function, so you would have seen warnings in your error log if you were looking for them.
PHP Warning: DOMXPath::query(): xmlXPathCompOpEval: function matches not found in php shell code on line 1
PHP Stack trace:
PHP 1. {main}() php shell code:0
PHP 2. DOMXPath->query() php shell code:1
A simple case-sensitive match can be done with an equality check.
$title_text = "Farming";
$menu_query = "//a/li[. = '$title_text']";
But the case-insensitive search involves translating the characters from upper to lower case:
$title_text = "FaRmInG";
$title_text = strtolower($title_text);
$menu_query = "//a/li[translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz') = '$title_text']";
In either case we end up with a NodeList that can be iterated through:
$html = <<< HTML
<nav class="sidebar" aria-labelledby="primary-navigation">
<ul class="sidebar-list">
<!--each element has a sub-menu which is initially hidden by css when the page is loaded. Via php the appropriate path the current page and top-level links will be visible only-->
<li>Home</li>
<!--sub-items-->
<ul class="sub active">
<li>Barn</li>
<li>Activities</li>
<ul class="sub active">
<li>News</li>
<li>Movements</li>
<li>Reviews</li>
<li>About Us</li>
<li>Terms of Use</li>
</ul>
</ul>
<li>Events</li>
<ul class="sub">
<li>Overview</li>
<li>Farming</li>
<li>Practises</li>
<li>Links</li>
<ul class="sub">
<li>Another Farm</li>
<li>24m</li>
</ul>
</ul>
</ul>
</nav>
HTML;
$menu = new DOMDocument();
$menu->loadHTML($html);
$xpath = new DOMXPath($menu);
$elements = $xpath->query($menu_query);
foreach ($elements as $element) {
print_r($element);
}
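For a compact check of the case-insensitive approach, here is a small sketch on a trimmed-down list. It selects //li rather than //a/li, because the menu markup as posted contains no a elements; adjust the path to your real markup. It also shows an alternative to translate(): DOMXPath::registerPhpFunctions() lets an XPath 1.0 expression call PHP's strtolower(). Both variants assume $title_text contains no single quote, since it is interpolated into the expression.

```php
<?php
$menu = new DOMDocument();
$menu->loadHTML('<ul><li>Overview</li><li>Farming</li><li>Practises</li></ul>');
$xpath = new DOMXPath($menu);

$title_text = strtolower('FaRmInG');

// Option 1: pure XPath 1.0, lower-casing ASCII letters with translate().
$upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
$lower = 'abcdefghijklmnopqrstuvwxyz';
$byTranslate = $xpath->query(
    "//li[translate(normalize-space(.), '$upper', '$lower') = '$title_text']"
);

// Option 2: expose PHP's strtolower() to XPath through the php: namespace.
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPhpFunctions('strtolower');
$byPhpFunction = $xpath->query(
    "//li[php:functionString('strtolower', normalize-space(.)) = '$title_text']"
);

foreach ($byTranslate as $element) {
    echo $element->textContent, PHP_EOL;
}
```

translate() only covers the characters you list, so titles with non-ASCII letters would need those added to both strings or a different approach.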
I am new to PHP and trying to write a scraper for a website.
I am trying to get an element with the class name categories. I have used:
$showPage = '<li class="categories">Categories<ul> <li class="cat-item cat-item-940"><a href="http://www.desitvbox.me/category/star-plus/amul-taste-of-india/" >Amul Taste of India</a>
</li>
<li class="cat-item cat-item-942"><a href="http://www.desitvbox.me/category/star-plus/dance-plus/" >Dance Plus</a>
</li>
<li class="cat-item cat-item-239"><a href="http://www.desitvbox.me/category/star-plus/diya-aur-baati-hum-star/" >Diya Aur Baati Hum</a>
</li>
<li class="cat-item cat-item-745"><a href="http://www.desitvbox.me/category/star-plus/suhani-si-ek-ladki/" >Suhani Si Ek Ladki</a>
</li>
<li class="cat-item cat-item-147"><a href="http://www.desitvbox.me/category/star-plus/star-plus-completed-shows/" >Star Plus Completed Shows</a>
<ul class="children">
<li class="cat-item cat-item-772"><a href="http://www.desitvbox.me/category/star-plus/star-plus-completed-shows/airlines/" >Airlines</a>
</li>
<li class="cat-item cat-item-518"><a href="http://www.desitvbox.me/category/star-plus/star-plus-completed-shows/arjun/" >Arjun</a>
</li>
<li class="cat-item cat-item-237"><a href="http://www.desitvbox.me/category/star-plus/star-plus-completed-shows/chef-pankaj-ka-zayka/" >Chef Pankaj Ka Zayka</a>
</li>
</ul>
</li>
</ul></li>';
$dom = new DOMDocument();
$dom->validateOnParse = true;
$dom->loadHTML($showPage);
$dom->preserveWhiteSpace = false;
$allShowsList = new DOMXPath($dom);
$allShowsTableHTML = $allShowsList->query('//li[contains(@class, "categories")]');
However, I now want to read the values of all a href attributes found in $allShowsTableHTML.
Can you please advise how I can do that?
As you can see, one of the records also has a ul with class = 'children', which I also want to read.
I need to get the href and the title.
I have tried below but no result.
$allShowTableDom = new DOMDocument();
foreach ($allShowTableHTML as $showLink)
{
$allShowTableDom->appendChild($allShowTableDom->importNode($showLink,true));
}
$showsArray = $allShowsTableHTML->getElementsByTagName('a');
I think it is not going in foreach loop.
To get all href attributes of the hyperlinks, add some more axis steps, then loop over the result list, where the ->value property will contain the URIs.
Given that you can just dump all href attributes inside the whole <li> element, simply extend your query by //a/@href:
$document = new DOMXPath($dom);
$hrefs = $document->query('//li[contains(@class, "categories")]//a/@href');
foreach ($hrefs as $href) {
echo $href->value;
}
If this contains nodes you don't want to get, you could also descend into the contained unordered list and select with a more specific query:
//li[contains(@class, "categories")]/ul/li/a/@href
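Since the question also asks for the title, one way is to select the a elements themselves instead of their href attributes and read both values from each node. The sketch below uses a shortened copy of the markup from the question and treats the link text as the title, which is an assumption.

```php
<?php
// Shortened sample of the markup from the question.
$showPage = '<ul><li class="categories">Categories<ul>
<li class="cat-item"><a href="http://www.desitvbox.me/category/star-plus/dance-plus/">Dance Plus</a></li>
<li class="cat-item"><a href="http://www.desitvbox.me/category/star-plus/star-plus-completed-shows/">Star Plus Completed Shows</a>
<ul class="children">
<li class="cat-item"><a href="http://www.desitvbox.me/category/star-plus/star-plus-completed-shows/arjun/">Arjun</a></li>
</ul></li>
</ul></li></ul>';

$dom = new DOMDocument();
$dom->loadHTML($showPage);
$xpath = new DOMXPath($dom);

$shows = [];
// "//a" after the categories <li> reaches links at any depth,
// including those inside ul.children.
foreach ($xpath->query('//li[contains(@class, "categories")]//a') as $a) {
    $shows[] = [
        'href'  => $a->getAttribute('href'),
        'title' => trim($a->textContent),
    ];
}
print_r($shows);
```

Replacing //a with /ul/li/a in the query would restrict the result to the first-level shows only.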
HTML:
<ul>
<li><a></a>
<ul>
<li></li>
<li></li>
</ul>
</li>
<li>
...
</li>
</ul>
For parent ul:first-of-type, what would be the selector for its (direct) child li elements, in order to parse the descendant li elements separately?
In jQuery you can simply use this selector: ul > li
Update:-
Using Simple DOM:-
<ul class="listitems">
<li><a></a>
<ul>
<li></li>
<li></li>
</ul>
</li>
<li>
...
</li>
</ul>
Simple HTML Dom code to get just the first level li items:
$html = file_get_html( $url );
$first_level_items = $html->find( '.listitems', 0)->children();
foreach ( $first_level_items as $item ) {
// ... do stuff ...
}
I am trying to generate the HTML for a tree (like jsTree) out of a 2D array, with no success.
I have the following array: My Array
and from this array I would like to create an HTML tree structure (ul and li) like:
<ul id="ParentId-0">
<li id="categoryID-1" data-parentid="1">
bla bla
<ul id="ParentId-1">
<li id="categoryID-20" data-parentid="20">
some Title
<ul id="ParentId-20">......</ul>
</li>
</ul>
</li>
<li id="categoryID-2" data-parentid="2">
second li Title
<ul id="ParentId-2">
<li id="categoryID-46" data-parentid="46">
Another Title
<ul id="ParentId-46">
<li id="categoryID-300" data-parentid="30">
And another Category
<ul id="ParentId-300"></ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
Anyone with an idea?
Edit
I've tried using DOMDocument to create the tree and it worked, but it took about 40 - 50 seconds to load, so I am trying to find a faster way.
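One approach that is usually much faster than building DOM nodes is to index the rows by parent once and then build the markup as a plain string with a recursive function. The array itself is not shown in the question, so the sketch below assumes a flat list of rows with hypothetical keys id, parent and title, with 0 as the root parent and no cyclic parent links. It mirrors the desired markup, including data-parentid carrying the item's own id and an empty ul for leaf items.

```php
<?php
// Hypothetical input shape: one row per category, pointing at its parent.
$rows = [
    ['id' => 1,  'parent' => 0, 'title' => 'bla bla'],
    ['id' => 2,  'parent' => 0, 'title' => 'second li Title'],
    ['id' => 20, 'parent' => 1, 'title' => 'some Title'],
];

// Group rows by parent id in a single pass, so each recursion step is a
// direct array lookup instead of a scan over all rows.
$childrenByParent = [];
foreach ($rows as $row) {
    $childrenByParent[$row['parent']][] = $row;
}

/** Render the <ul> for $parentId and, recursively, all lists below it. */
function renderTree(array $childrenByParent, int $parentId): string
{
    $html = '<ul id="ParentId-' . $parentId . '">';
    foreach ($childrenByParent[$parentId] ?? [] as $row) {
        $html .= '<li id="categoryID-' . $row['id'] . '" data-parentid="' . $row['id'] . '">'
            . htmlspecialchars($row['title'])
            . renderTree($childrenByParent, $row['id'])
            . '</li>';
    }
    return $html . '</ul>';
}

$treeHtml = renderTree($childrenByParent, 0);
echo $treeHtml;
```

Each row is visited once, so the cost grows linearly with the number of categories; whether this removes the 40 - 50 second delay depends on where the time is actually spent.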