I want to scrape an HTML list structure so that I can save parent and child categories separately.
Here's the HTML source:
<ul class="categories_list">
<li>Sports Nutrition
<ul class="categories_list">
<li>Protein
<ul class="categories_list">
<li>Protein Powder
<ul class="categories_list">
<li>Whey Protein
<ul class="categories_list">
<li>Whey Protein Isolate</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<ul class="categories_list">
<li>Pre Workout Supplements</li>
</ul>
<ul class="categories_list">
<li>Creatine
<ul class="categories_list">
<li>Creatine Monohydrate</li>
</ul>
</li>
</ul>
<ul class="categories_list">
<li>Amino Acids
<ul class="categories_list">
<li>Essential Amino Acids
<ul class="categories_list">
<li>BCAA</li>
</ul>
</li>
</ul>
</li>
</ul>
<ul class="categories_list">
<li>Joint Supplements
<ul class="categories_list">
<li>Curcumin
<ul class="categories_list">
<li>Curcumin Phytosome</li>
</ul>
</li>
</ul>
</li>
</ul>
<ul class="categories_list">
<li>Energy & Endurance
<ul class="categories_list">
<li>Stimulants</li>
</ul>
</li>
</ul>
</li>
</ul>
I am using Simple HTML DOM for scraping. I am able to get all categories, but I cannot get them in the proper hierarchy.
I also tried the children approach, but that didn't work.
So I am looking for some help to make my existing code work.
Here's my existing code:
$html= file_get_html($url);
foreach ($html->find('ul.categories_list li') as $link) {
echo $link->plaintext.'<br>';
}
Here is a script that tries to get all elements. It still needs to be improved:
<?php
require_once("simple_html_dom.php");
$dom = file_get_html("source.php");
getCategory($dom);
print_r($categoryList);
function getCategory(simple_html_dom $dom){
global $categoryList;
foreach($dom->find('ul.categories_list li') as $ul){
//extract the a tag if found; skip items that have no link
$anchor = $ul->find('a',0);
if($anchor === null){
continue;
}
$categoryName = $anchor->href;
$categoryLabel = $anchor->innertext;
$categoryList[] = array(
"categoryName" => $categoryName,
"categoryLabel" => $categoryLabel,
);
//remove a node
$anchor->outertext = '';
$string = $ul->innertext;
if(trim($string) == ''){
continue;
}else{
// die($string);
$dom2 = str_get_html($string);
getCategory($dom2);
}
}
}
It basically does recursion filling the $categoryList on each call.
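For comparison, here is a minimal sketch of the same recursion using PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. It assumes the category label is plain text directly inside each li, as in the source shown in the question; if the real page wraps labels in a tags, the label lookup would need adjusting. The helper name collectCategories() and the shortened sample markup are made up for illustration.

```php
<?php
// Sketch: collect (parent, label) rows from nested <ul class="categories_list"> markup.
// collectCategories() is a hypothetical helper, not part of any library.

function collectCategories(DOMXPath $xp, DOMNode $ul, ?string $parent, array &$rows): void
{
    // "li" (no leading //) matches only direct children of the context node.
    foreach ($xp->query('li', $ul) as $li) {
        // Label = the li's own text nodes, ignoring text inside nested lists.
        $label = '';
        foreach ($xp->query('text()', $li) as $text) {
            $label .= $text->nodeValue;
        }
        $label = trim($label);
        $rows[] = ['parent' => $parent, 'label' => $label];

        // Recurse into lists nested directly under this li.
        foreach ($xp->query('ul', $li) as $childUl) {
            collectCategories($xp, $childUl, $label, $rows);
        }
    }
}

// Shortened sample in the shape of the question's markup.
$html = '<ul class="categories_list"><li>Creatine'
      . '<ul class="categories_list"><li>Creatine Monohydrate</li></ul>'
      . '</li></ul>'
      . '<ul class="categories_list"><li>Pre Workout Supplements</li></ul>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world pages are rarely valid HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);
$rows = [];
// Start from lists that are not themselves inside an li (the top level).
foreach ($xp->query('//ul[not(ancestor::li)]') as $topUl) {
    collectCategories($xp, $topUl, null, $rows);
}
print_r($rows);
```

Each row carries its parent's label (null for top-level categories), so parents and children can be saved separately.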
I'm trying to get all items and sub-items with their anchor tags from the following menu:
<nav class="header-nav" id="headerLara">
<div class="menu-hauptmenu-container">
<ul id="head_nav_ul" class="menu">
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-has-children menu-item-4">
<a>First Menu</a>
<ul class="sub-menu">
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-14002">
F menu 1
</li>
<li class="menu-item menu-item-type-post_type menu-item-object-post menu-item-12718">
F menu 2
</li>
</ul>
</li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-has-children menu-item-6">
<a>Second Menu</a>
<ul class="sub-menu">
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-1257">
S menu 1
</li>
<li class="menu-item menu-item-type-post_type menu-item-object-page menu-item-5420">
S menu 2
</li>
</ul>
</li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-12821">
Third Menu
</li>
</ul>
</div>
</nav>
Now I want output like this:
<nav class="header-nav" id="headerLara">
<div class="menu-hauptmenu-container">
<ul>
<li>
<a class="has-child">First Menu</a>
<ul>
<li>
F menu 1
</li>
<li>
F menu 2
</li>
</ul>
</li>
<li>
<a class="has-child">Second Menu</a>
<ul>
<li>
S menu 1
</li>
<li>
S menu 2
</li>
</ul>
</li>
<li>
Third Menu
</li>
</ul>
</div>
</nav>
I've done some research and tried the following PHP code:
<?php
$doc = new DomDocument;
$doc->validateOnParse = true;
$doc->loadHtml(file_get_contents('http://example.com/blabla.php'));
$header = $doc->getElementById('headerLara');
$mainUls = $header->getElementsByTagName('ul');
foreach ($mainUls as $mainUl) {
echo '<ul>';
$mainLis = $mainUl->getElementsByTagName('li');
foreach ($mainLis as $mainLi) {
echo '<li>';
$mainAnc = $mainLi->getElementsByTagName('a');
$href = $mainAnc->item(0)->getAttribute('href');
echo '<a class="has-child" href="'.$href.'">'.$mainAnc->item(0)->nodeValue.'</a>';
$secUls = $mainLi->getElementsByTagName('ul');
if($secUls->length < 2){
foreach ($secUls as $secUl) {
echo '<ul>';
$secLis = $secUl->getElementsByTagName('li');
foreach ($secLis as $secLi) {
echo '<li>';
$secAnc = $mainLi->getElementsByTagName('a');
$shref = $secAnc->item(0)->getAttribute('href');
echo ''.$secAnc->item(0)->nodeValue.'';
echo '</li>';
}
echo '</ul>';
}
}
echo '</li>';
}
echo '</ul>';
}
?>
But this does not work the way I want and returns output like this:
<ul>
<li>
<a class="has-child" href="">First Menu</a>
<ul>
<li>
First Menu
</li>
<li>
First Menu
</li>
</ul>
</li>
<li>
<a class="has-child" href="http://example.com/fm1">F menu 1</a>
</li>
<li>
<a class="has-child" href="http://example.com/fm2">F menu 2</a>
</li>
<li>
<a class="has-child" href="">Second Menu</a>
<ul>
<li>
Second Menu
</li>
<li>
Second Menu
</li>
</ul>
</li>
<li>
<a class="has-child" href="http://example.com/sm1">S menu 1</a>
</li>
<li>
<a class="has-child" href="http://example.com/sm2">S menu 2</a>
</li>
</ul>
I've checked many links that seem similar to my problem but found nothing helpful.
How can I get the proper output? Thanks in advance.
There are a few minor errors (such as reading from the wrong node), but there are two main problems.
The first is that getElementsByTagName() selects all descendant elements with that tag name, not just immediate children, so each call returns more elements than you are expecting. DOMDocument has no convenient method for fetching only immediate child nodes, so this code uses XPath instead: the query starts from the context node, and a relative expression such as a matches only <a> elements that are direct children of that node.
The other (main) problem is that you are building the output using echo statements. That can work, but it is prone to typos, invalid structure and so on. This code uses DOM API calls to create the output instead.
$doc = new DomDocument;
$doc->validateOnParse = true;
$doc->loadHtml($html);
$xp = new DOMXPath($doc);
$header = $doc->getElementById('headerLara');
$mainUls = $xp->query('div/ul', $header);
foreach ($mainUls as $mainUl) {
$mainULE = $doc->createElement("ul");
$mainLis = $xp->query('li', $mainUl);
foreach ($mainLis as $mainLi) {
$li = $doc->createElement("li");
$mainAnc = $xp->query('a', $mainLi)[0];
$a = $doc->createElement("a", htmlspecialchars($mainAnc->nodeValue));
$href = $mainAnc->getAttribute('href');
if ( !empty($href) ) {
$a->setAttribute("href", $href);
}
$li->appendChild($a);
$secUls = $xp->query('ul', $mainLi);
if($secUls->length < 2){
foreach ($secUls as $secUl) {
$a->setAttribute("class", "has-child");
$secULE = $doc->createElement("ul");
$secLis = $xp->query('li', $secUl);
foreach ($secLis as $secLi) {
$secLIE = $doc->createElement("li");
$secAnc = $xp->query('a', $secLi);
$shref = $secAnc[0]->getAttribute('href');
$secA = $doc->createElement("a", htmlspecialchars($secAnc[0]->nodeValue));
$secA->setAttribute("href", $shref);
$secLIE->appendChild($secA);
$secULE->appendChild($secLIE);
}
$li->appendChild($secULE);
}
}
$mainULE->appendChild($li);
}
echo PHP_EOL.PHP_EOL.">>>>".$doc->saveHTML($mainULE);
// Next line replaces existing HTML
//$mainUl->parentNode->replaceChild($mainULE,$mainUl);
}
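To make the first point concrete, here is a small sketch that runs both lookups against the same context node. The markup is a cut-down stand-in for the menu in the question, and the href values are invented for illustration.

```php
<?php
// Sketch: descendant lookup vs. direct-child lookup on the same context node.
$html = '<ul id="menu">'
      . '<li><a href="/first">First Menu</a>'
      . '<ul><li><a href="/f1">F menu 1</a></li><li><a href="/f2">F menu 2</a></li></ul>'
      . '</li>'
      . '<li><a href="/third">Third Menu</a></li>'
      . '</ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

$menu = $xp->query('//ul[@id="menu"]')->item(0);

// getElementsByTagName() walks the whole subtree: 2 top-level + 2 nested items.
$allLis = $menu->getElementsByTagName('li');

// A relative XPath step only looks at direct children of the context node.
$directLis = $xp->query('li', $menu);

echo $allLis->length, ' descendants, ', $directLis->length, " direct children\n";
// prints "4 descendants, 2 direct children"
```

This is why the nested items showed up again as top-level entries in the question's output: the outer loop was iterating over all four li elements, not just the two top-level ones.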
I get the data from the database, but I cannot output it correctly between the li and ul HTML tags. There is a problem in the loop. How do I output categories and subcategories inside nested ul and li elements? I have this PDO MySQL category function:
<?php
...
function kategoriVer($ustid=0){
$result = DB::get('SELECT * FROM urunler_kategori WHERE kategori_k='.$ustid.'');
echo '<ul class="drop-down">';
foreach($result as $kategori){
echo '<li class="drop">
'.$kategori->baslik_k.'';
kategoriVer($kategori->id_k);
}
echo '</li></ul>';
}
kategoriVer();
?>
Output HTML:
<ul class="sub-menu">
<li class="menuparent">
Çadırlar
<ul class="sub-menu">
<li class="menuparent">Hi-Tech Çadırlar<ul class="sub-menu"></li></ul>
<li class="menuparent">Çelik Konstrüksiyon<ul class="sub-menu"></li></ul>
<li class="menuparent">Tribün Çadırlar<ul class="sub-menu"></li></ul>
<li class="menuparent">Yürüyüş Yolları<ul class="sub-menu"></li></ul>
</li>
</ul>
<li class="menuparent">Şemsiyeler<ul class="sub-menu"></li></ul>
<li class="menuparent">İklimlendirme<ul class="sub-menu"></li></ul>
</li>
</ul>
I want this output:
<ul class="sub-menu">
<li class="menuparent">
Headers
<ul class="sub-menu">
<li>Standard</li>
<li>No Topbar</li>
<li>Social Icons</li>
<li>Minimal</li>
<li>Classic</li>
</ul>
</li>
</ul>
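One possible way to get that shape, as a rough sketch: build the markup as a string, close each li inside the loop, and only emit a ul when a category actually has children. The array $rows stands in for the database result and its contents are made up; the column names id_k, kategori_k and baslik_k are taken from the function above, and renderCategories() is a hypothetical name.

```php
<?php
// Sketch: render nested categories, opening a <ul> only when children exist.
// $rows stands in for the database result; the data itself is invented.

function renderCategories(array $rows, int $parentId = 0): string
{
    // Keep only the rows whose parent is $parentId.
    $children = array_filter($rows, function (array $row) use ($parentId) {
        return $row['kategori_k'] === $parentId;
    });
    if (!$children) {
        return ''; // no children: emit no empty <ul></ul>
    }

    $out = '<ul class="sub-menu">';
    foreach ($children as $row) {
        $sub = renderCategories($rows, $row['id_k']);
        $class = $sub === '' ? '' : ' class="menuparent"';
        // Each <li> is closed inside the loop, after its nested list.
        $out .= '<li' . $class . '>' . htmlspecialchars($row['baslik_k']) . $sub . '</li>';
    }
    return $out . '</ul>';
}

$rows = [
    ['id_k' => 1, 'kategori_k' => 0, 'baslik_k' => 'Headers'],
    ['id_k' => 2, 'kategori_k' => 1, 'baslik_k' => 'Standard'],
    ['id_k' => 3, 'kategori_k' => 1, 'baslik_k' => 'Minimal'],
];

echo renderCategories($rows), "\n";
```

With a real database, the array_filter() step would be replaced by the per-parent query from the original function; the important parts are the early return for childless categories and closing li inside the loop.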
<ul>
<li>
<a>name1</a>
<div>
<ul>
<li>
<a>name</a>
<ul>
<li>
<a>name</a>
</li>
<li>
<a>name</a>
</li>
</ul>
</li>
<li>
<a>name</a>
<ul>
<li>
<a>name</a>
</li>
<li>
<a>name</a>
</li>
</ul>
</li>
</ul>
</div>
</li>
<li>
<a>name2</a>
<div>
<ul>
<li>
<a>name</a>
<ul>
<li>
<a>name</a>
</li>
<li>
<a>name</a>
</li>
</ul>
</li>
<li>
<a>name</a>
<ul>
<li>
<a>name</a>
</li>
<li>
<a>name</a>
</li>
</ul>
</li>
</ul>
</div>
</li>
</ul>
As we can see, we have several nested <ul> elements.
We would like to use PHP Simple HTML DOM to get an array with the data from these lists.
We want to use this code:
foreach($html->find('li') as $li) {
}
But with this code we get every li on the page:
http://prntscr.com/5390mf
http://prntscr.com/5390pj
But we don't know how to get only the parent li elements:
1.
<li>
<a>name1</a>
</li>
2.
<li>
<a>name2</a>
</li>
And only then get the child ul elements inside each parent li, and all the child li elements inside them.
Please tell me how to do this.
P.S.: If I have explained badly what I would like to receive, please let me know.
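First (find('li', 0) returns only the first li element, so there is nothing to loop over):
$line = $html->find('li', 0);
Second (if we can add a class to the parent ul), loop over its direct children:
foreach($html->find('ul.parrent_ul', 0)->children() as $line) {
}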
Enjoy!
foreach($html->find('ul > li') as $li) {
}
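This selects li elements that are direct children of a ul. Note that nested li elements are also direct children of their own ul, so to get only the top-level items the selector has to start from the outermost ul (for example via a class or id on it).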
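If switching libraries is an option, here is a small sketch with PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM. It assumes the structure from the question (top-level li elements, each with an a and a nested list); the sample markup is shortened.

```php
<?php
// Sketch using PHP's built-in DOM extension instead of Simple HTML DOM.
$html = '<ul>'
      . '<li><a>name1</a><div><ul><li><a>name</a></li></ul></div></li>'
      . '<li><a>name2</a><div><ul><li><a>name</a></li></ul></div></li>'
      . '</ul>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);

$names = [];
// li elements with no li ancestor are the top-level (parent) items.
foreach ($xp->query('//li[not(ancestor::li)]') as $li) {
    // "a" is relative to $li, so only the item's own link is read.
    $name = $xp->query('a', $li)->item(0)->nodeValue;
    $names[] = $name;

    // Nested items of this parent: ".//li" searches below $li only.
    $nested = $xp->query('.//li', $li);
    echo $name, ' has ', $nested->length, " nested item(s)\n";
}
```

The same pattern can be repeated one level down by running the relative queries against each nested li.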
I've a dynamic menu which looks like
<li class='has-sub'> cat1</li>
<ul>
<li> test5</li>
<li class='has-sub'> cat2</li>
<ul>
<li> cat9</li>
<li class='has-sub'> cat7</li>
<ul>
<li> cat8</li>
<li> cat10</li>
<li> cat1 cat2</li>
</ul>
</ul>
</ul>
<li class='has-sub'> cat3</li>
<ul>
<li> cat5</li>
</ul>
I want to change that to a properly nested navigation menu like
<li class='has-sub'> <a href='#'><span>cat1</span></a>
<ul>
<li><a href='#'><span> test5</span></a></li>
<li class='has-sub'><a href='#'><span> cat2</span></a>
<ul>
<li> <a href='#'><span>cat9</span></a></li>
<li class='has-sub'> <a href='#'><span>cat7</span></a>
<ul>
<li> <a href='#'><span>cat8</span></a></li>
<li> <a href='#'><span>cat10</span></a></li>
<li> <a href='#'><span>cat1 cat2</span></a></li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li class='has-sub'> <a href='#'><span>cat3</span></a>
<ul>
<li> <a href='#'><span>cat5</span></a></li>
</ul>
</li>
I tried a few str_replace calls, but since the list is dynamic it won't work.
I'm new to regex and am not sure how to turn this dynamic menu into a properly nested and formatted menu.
Thanks in advance!
It's an answer that has been linked to more than it probably has been read, but still: You can't parse markup using regex. Not reliably anyway.
Instead, you should use a parser like the DOMDocument class. Basic usage here would be:
$dom = new DOMDocument();
$dom->loadHTML($theMarkupString);
//get the list:
$list = $dom->getElementById('navContainerID');
$navItems = $list->getElementsByTagName('li');
foreach($navItems as $item)
{
//add spans, links, classes... how to do so is all in the doc pages
}
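As a rough sketch of what the body of that loop could do: move each ul that directly follows an li into that li, then wrap every li's text in a link and span. The wrapper div with id nav is an assumption added so the fragment has a single root, and the sample is only one level deep; deeper menus should follow the same pattern but are not covered here.

```php
<?php
// Sketch: repair a flat menu with DOMDocument.
// The <div id='nav'> wrapper is an assumption, not part of the question's markup.
$html = "<div id='nav'>"
      . "<li class='has-sub'> cat1</li>"
      . "<ul><li> test5</li></ul>"
      . "<li class='has-sub'> cat3</li>"
      . "<ul><li> cat5</li></ul>"
      . "</div>";

$doc = new DOMDocument();
libxml_use_internal_errors(true); // the input is not valid HTML
$doc->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($doc);

// 1. Move every <ul> that directly follows an <li> into that <li>.
//    iterator_to_array() takes a snapshot before any node is moved.
$uls = iterator_to_array($xp->query('//ul[preceding-sibling::*[1][self::li]]'));
foreach ($uls as $ul) {
    $li = $xp->query('preceding-sibling::*[1]', $ul)->item(0);
    $li->appendChild($ul); // appendChild() moves an existing node
}

// 2. Wrap the text of every <li> in <a href="#"><span>...</span></a>.
foreach (iterator_to_array($xp->query('//li/text()[normalize-space()]')) as $text) {
    $a = $doc->createElement('a');
    $a->setAttribute('href', '#');
    $span = $doc->createElement('span');
    $a->appendChild($span);
    $text->parentNode->replaceChild($a, $text); // put the link where the text was
    $span->appendChild($text);                  // then move the text into the span
}

// Serialize the repaired children of the wrapper.
$nav = $xp->query("//div[@id='nav']")->item(0);
$result = '';
foreach ($nav->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}
echo $result, "\n";
```

Working on the tree rather than on the string means the nesting stays well-formed however many items the dynamic menu contains.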
A simple DOM Parser and an strtr() will solve this...
$dom = new DOMDocument;
$dom->loadHTML($html);
$arrLi = array();
foreach ($dom->getElementsByTagName('li') as $tag) {
$arrLi[$tag->nodeValue]="<a href='#'><span>$tag->nodeValue</span></a>";
}
echo $html = strtr($html,$arrLi);
I have a nested unordered list like this (simplified version; the depth is variable):
<ul>
<li>
Root
<ul>
<li>
Page A
<ul>
<li>
Page 1 2
</li>
</ul>
</li>
</ul>
</li>
</ul>
Using PHP, is there a nice way to "explode" this nested list into (for this example) 3 lists?
Thanks for your help
Edit:
The expected output will be:
<ul>
<li>
Root
</li>
</ul>
<ul>
<li>
Page A
</li>
</ul>
<ul>
<li>
Page 1 2
</li>
</ul>
Try:
$lists = '<ul>
<li>
Root
<ul>
<li>
Page A
<ul>
<li>
Page 1 2
</li>
</ul>
</li>
</ul>
</li>
</ul>';
$list_array = explode('<ul>', $lists);
foreach($list_array as $list){
// Now you have a single list, but missing the <ul>
// at the start. replace that and assign to a variable or whatever.
}
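As an alternative to splitting the string, here is a minimal sketch that walks the list with DOMDocument and DOMXPath and builds one single-item list per li. It assumes each label is plain text directly inside its li, as in the example, and it works for any depth.

```php
<?php
// Sketch: turn each <li> of a nested list into its own one-item <ul>.
$lists = '<ul><li>Root<ul><li>Page A<ul><li>Page 1 2</li></ul></li></ul></li></ul>';

$doc = new DOMDocument();
$doc->loadHTML($lists);
$xp = new DOMXPath($doc);

$flat = [];
// "//li" returns the items in document order, outermost first.
foreach ($xp->query('//li') as $li) {
    // Use only the li's own text nodes, not the text of nested lists.
    $label = '';
    foreach ($xp->query('text()', $li) as $text) {
        $label .= $text->nodeValue;
    }
    $flat[] = '<ul><li>' . htmlspecialchars(trim($label)) . '</li></ul>';
}

echo implode("\n", $flat), "\n";
```

For the example this yields three lists, one each for Root, Page A and Page 1 2, without having to re-add the missing <ul> tags by hand.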