DOMXpath & PHP: how to wrap a bunch of <li> inside an <ul> - php

I have an HTML document with this not-so-nice markup, without the 'ul':
<p>Lorem</p>
<p>Ipsum...</p>
<li class='item'>...</li>
<li class='item'>...</li>
<li class='item'>...</li>
<div>...</div>
I am now trying to "grab" all li elements and wrap them inside a ul list, which I'd like to place in the same spot, using PHP and DOMXPath. I manage to find and "remove" the li elements:
$elements = $xpath->query('//li[@class="item"]');
$wrapper = $document->createElement('ul');
foreach ($elements as $child) {
    $wrapper->appendChild($child);
}

Maybe you can get the parentNode of the first <li> and then use the insertBefore method:
$html = <<<HTML
<p>Lorem</p>
<p>Ipsum...</p>
<li class='item'>...</li>
<li class='item'>...</li>
<li class='item'>...</li>
<div>...</div>
HTML;
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//li[@class="item"]');
$wrapper = $doc->createElement('ul');
$elements->item(0)->parentNode->insertBefore(
    $wrapper, $elements->item(0)
);
foreach ($elements as $child) {
    $wrapper->appendChild($child);
}
echo $doc->saveHTML();

Here's what you need. You may need to tweak the XPath query for your real HTML.
$document = new DOMDocument;
// We don't want to bother with white spaces
$document->preserveWhiteSpace = false;
$html = <<<EOT
<p>Lorem</p>
<p>Ipsum...</p>
<li class='item'>...</li>
<li class='item'>...</li>
<li class='item'>last...</li>
<div>...</div>
EOT;
$document->loadHTML($html);
$xpath = new DOMXPath($document);
$elements = $xpath->query('//li[@class="item"]');
// Saves a reference to the node that is positioned right after our li's
$ref = $xpath->query('//li[@class="item"][last()]')->item(0)->nextSibling;
$wrapper = $document->createElement('ul');
foreach ($elements as $child) {
    $wrapper->appendChild($child);
}
$ref->parentNode->insertBefore($wrapper, $ref);
echo $document->saveHTML();
Running example: https://repl.it/B3UO/24
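One edge case with the approach above: if nothing follows the last li, nextSibling is null and $ref->parentNode fails. Here is a minimal sketch of one way around that, assuming a shortened sample document (the $html below is hypothetical, not the asker's real page). It keeps the parent in its own variable and relies on insertBefore() appending the new node when the reference node is null.

```php
<?php
// Hypothetical sample: the <li> elements are the last children of <body>.
$html = "<p>Lorem</p><li class='item'>a</li><li class='item'>b</li>";

libxml_use_internal_errors(true); // keep parser warnings about odd markup quiet
$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

$elements = $xpath->query('//li[@class="item"]');
$last     = $elements->item($elements->length - 1);
$parent   = $last->parentNode;
$ref      = $last->nextSibling; // null here, because nothing follows the last <li>

$wrapper = $document->createElement('ul');
foreach ($elements as $child) {
    $wrapper->appendChild($child); // moves the <li> out of its old position
}
// With a null reference node, insertBefore() appends to the parent.
$parent->insertBefore($wrapper, $ref);

echo $document->saveHTML();
```

When a sibling does follow the last li, the same code inserts the ul right before it, so both cases go through one code path.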

Related

Want to get specific data from a webpage

I am trying to get data from the following portion of a web page:
<div id="menu_pannel">
<ul class="sf-menu" id="nav">
<li class="current"><a href="/" class="current" >Home</a></li>
<li class="">Schedule</li>
<li class="">All Channels</li>
<li class="">Sports Channels
<ul id="submenu">
<li>Sky Sports 1</li>
<li>Sky Sports 2</li>
<li><a href="http://www.time4tv.com/2011/03/sky-sports-3.php">Sky Sports
I want to get the data from the <ul id="nav"> element. For that I am using
$pattern = '|<ul id="nav" class="sf-menu">(.*?)</ul>|';
preg_match($pattern, $html, $data);
but I am getting an empty array.
If strip_tags($html) doesn't return what you want, you can use this example to get an array of text:
function getTextBetweenTags($string, $tagname) {
    preg_match_all("#<$tagname.*?>([^<]+)</$tagname>#", $string, $matches);
    return $matches[1];
}
$values = getTextBetweenTags($html, 'a');
foreach ($values as $value) {
    echo $value . '<br>';
}
where $html is a variable containing your HTML.
If you decide to use a DOM parser:
$doc = new DOMDocument();
$doc->loadHTML($str);
$x = new DomXpath($doc);
$ul = $x->query('//ul[@id="nav"]'); // 'id' is a unique identifier!
// Echo outerHTML of ul[@id="nav"]
echo $doc->saveHTML($ul->item(0));
Use the DOMDocument class for manipulating HTML content:
// $html_str is your HTML fragment
$doc = new DOMDocument();
$doc->loadHTML($html_str);
$ul_content = "";
$ul = $doc->getElementsByTagName("ul")->item(0);
if ($ul && $ul->getAttribute('class') == 'sf-menu') {
    foreach ($ul->childNodes as $n) {
        $ul_content .= $doc->saveHTML($n);
    }
}
echo $ul_content;
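For context on why the regex in the question comes back empty: the pattern expects id before class, while the markup has class first, and without the s modifier the dot does not match the newlines inside the list. A DOM query does not depend on either. The following is only a sketch on a shortened, hypothetical version of the menu (the href values are made up for illustration); it collects the link text and targets under ul#nav.

```php
<?php
// Shortened, hypothetical version of the menu markup from the question.
$html = '<div id="menu_pannel"><ul class="sf-menu" id="nav">'
      . '<li class="current"><a href="/" class="current">Home</a></li>'
      . '<li class=""><a href="/schedule">Schedule</a></li>'
      . '</ul></div>';

libxml_use_internal_errors(true); // real-world pages often trigger parser warnings
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$links = array();
// Attribute order in the source does not matter to the XPath predicate.
foreach ($xpath->query('//ul[@id="nav"]//a') as $a) {
    $links[trim($a->textContent)] = $a->getAttribute('href');
}
print_r($links);
```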

How to get all child nodes from DOMDocument?

I have the following
$string = '<html><head></head><body><ul id="mainmenu">
<li id="1">Hallo</li>
<li id="2">Welt
<ul>
<li id="3">Sub Hallo</li>
<li id="4">Sub Welt</li>
</ul>
</li>
</ul></body></html>';
$dom = new DOMDocument;
$dom->loadHTML($string);
Now I want to have all li IDs inside one array.
I tried the following:
$all_li_ids = array();
$menu_nodes = $dom->getElementById('mainmenu')->childNodes;
foreach ($menu_nodes as $li_node) {
    if ($li_node->nodeName == 'li') {
        $all_li_ids[] = $li_node->getAttribute('id');
    }
}
print_r($all_li_ids);
As you might see, this will print out [1,2]
How do I get all children (the subchildren as well [1,2,3,4])?
In my test, $dom->getElementById('mainmenu') did not return the element, so I used XPath instead. If it works for you, you don't need XPath:
$xpath = new DOMXPath($dom);
$ul = $xpath->query("//*[@id='mainmenu']")->item(0);
$all_li_ids = array();
// Find all inner li tags
$menu_nodes = $ul->getElementsByTagName('li');
foreach ($menu_nodes as $li_node) {
    $all_li_ids[] = $li_node->getAttribute('id');
}
print_r($all_li_ids); // 1, 2, 3, 4
One way to do it would be to add another foreach loop, i.e.:
foreach ($menu_nodes as $node) {
    if ($node->nodeName == 'li') {
        $all_li_ids[] = $node->getAttribute('id');
        // A DOMElement cannot be iterated directly, so ask it for its nested <li> elements
        foreach ($node->getElementsByTagName('li') as $sub_node) {
            $all_li_ids[] = $sub_node->getAttribute('id');
        }
    }
}
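Both answers walk the tree by hand. As an alternative sketch (not taken from either answer), a single XPath expression can select the id attributes of every li below the menu at any depth. The markup is the sample from the question.

```php
<?php
// Sample markup copied from the question above.
$string = '<html><head></head><body><ul id="mainmenu">
<li id="1">Hallo</li>
<li id="2">Welt
<ul>
<li id="3">Sub Hallo</li>
<li id="4">Sub Welt</li>
</ul>
</li>
</ul></body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($string);
$xpath = new DOMXPath($dom);

$all_li_ids = array();
// "//li" below the ul matches <li> descendants at any depth; "/@id" selects their id attributes.
foreach ($xpath->query('//ul[@id="mainmenu"]//li/@id') as $attr) {
    $all_li_ids[] = $attr->value; // $attr is a DOMAttr
}
print_r($all_li_ids);
```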

DOM xpath isn't working correctly

I don't quite understand what's wrong with my XPath code. It's not returning any results. First, here's the code:
$url = 'http://someurl.com';
$dom = new DOMDocument('1.0');
$dom->preserveWhiteSpace = false;
$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$query = '//div[@id="main"]//ul[@id="tags"]//li//a';
$result = $xpath->query($query);
foreach($result as $node){
$val = $node->getAttribute('href');
echo $val."<br/>";
}
Here is the HTML:
<div id="main">
<ul id="tags">
<li class="tag_col_0">somevalue</li>
<li class="tag_col_0">somevalue1</li>
</ul>
</div>
I'm not quite sure what's wrong here.

Php DOM and Xpath - Replace node but keep children of old node

Consider the following html:
<html>
<title>Xyz</title>
<body>
<div>
<div class='mycls'>
<div>1 Books</div>
<div>2 Papers</div>
<div>3 Pencils</div>
</div>
</div>
<body>
</html>
$dom = new DOMDocument();
$dom->loadHTML([loaded html of remote url through curl]);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('html/body/div[@class="mycls"]');
Up to here it's working fine. I need to replace the node to get the following:
<body>
<div>
<span>
<div>1 Books</div>
<div>2 Papers</div>
<div>3 Pencils</div>
</span>
</div>
<body>
Something like the following should work for you:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$oldNode = $xpath->query('//div[@class="mycls"]')->item(0);
$span = $dom->createElement('span');
if ($oldNode->hasChildNodes()) {
    // Copy the children into an array first, because childNodes is a live list
    $children = [];
    foreach ($oldNode->childNodes as $child) {
        $children[] = $child;
    }
    foreach ($children as $child) {
        $span->appendChild($child->parentNode->removeChild($child));
    }
}
$oldNode->parentNode->replaceChild($span, $oldNode);
echo htmlspecialchars($dom->saveHTML());
Demo: http://codepad.viper-7.com/WNTrR5
Note that in the demo I also have fixed your HTML which was utterly broken :-)
If your demo is really the HTML you are getting back from the cURL call and you cannot change it (no control over it), you can do:
$libxmlErrors = libxml_use_internal_errors(true); // at the start
and
libxml_use_internal_errors($libxmlErrors); // at the end
to prevent errors from popping up.
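Put together, the two calls bracket the parsing step. The following is only a sketch: the $html string is a hypothetical, shortened stand-in for the broken markup returned by the cURL call, and the libxml_clear_errors() call is an addition that discards the collected errors so they do not pile up.

```php
<?php
// Hypothetical stand-in for the broken markup (note the unclosed <body>).
$html = "<html><title>Xyz</title><body><div><div class='mycls'>"
      . "<div>1 Books</div></div></div><body></html>";

// Collect parse errors internally instead of emitting PHP warnings.
$libxmlErrors = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

libxml_clear_errors();                     // drop the errors collected while parsing
libxml_use_internal_errors($libxmlErrors); // restore the previous setting

$xpath = new DOMXPath($dom);
$count = $xpath->query('//div[@class="mycls"]')->length;
echo $count, "\n";
```

If you want to inspect what the parser complained about, libxml_get_errors() returns the collected errors before they are cleared.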

How do I extract this value using PHP Dom

I have an HTML file; this is just a part of it though...
<div id="result" >
<div class="res_item" id="1" h="63c2c439b62a096eb3387f88465d36d0">
<div class="res_main">
<h2 class="res_main_top">
<img
src="/ff/gigablast.com.png"
alt="favicon for gigablast.com"
width=16
height=16
/>
<a
href="http://www.gigablast.com/"
rel="nofollow"
>
Gigablast
</a>
<div class="res_main">
<h2 class="res_main_top">
<img
src="/ff/ask.com.png"
alt="favicon for ask.com"
width=16
height=16
/>
<a
href="http://ask.com/" rel="nofollow"
>
Ask.com - What's Your Question?
</a>....
I want to extract only the URL addresses (for example: http://www.gigablast.com and http://ask.com/ - there are at least 10 URLs in that HTML) from the above using PHP DOMDocument. I know this much, but I don't know how to move ahead:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$data = $doc->getElementById('result');
Then what? The URL is inside an <a> tag, hence I can't use $data->getElementsByTagName() here!
Use XPath to narrow down the field to <a> elements inside the <div class="res_main"> element:
$doc = new DomDocument();
$doc->loadHTMLFile('urllist.html');
$xpath = new DomXpath($doc);
$query = '//div[@class="res_main"]//a';
$nodes = $xpath->query($query);
$urls = array();
foreach ($nodes as $node) {
    $href = $node->getAttribute('href');
    if (!empty($href)) {
        $urls[] = $href;
    }
}
This solves the problem of picking up all the <a> elements inside of the document, since it allows you to filter only the ones you want (since you don't care about navigation links, etc)...
You can call getElementsByTagName on a DOMElement object:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$result = $doc->getElementById('result');
$anchors = $result->getElementsByTagName('a');
$urls = array();
foreach ($anchors as $a) {
$urls[] = $a->getAttribute('href');
}
If you want to get image sources as well, that would be easy to add.
If you are just trying to extract the href attribute of all a tags in the document (and the <div id="result"> doesn't matter), you could use this:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$anchors = $doc->getElementsByTagName('a');
$urls = array();
foreach ($anchors as $anchor) {
    // attributes is a DOMNamedNodeMap, so read the value with getAttribute()
    $urls[] = $anchor->getAttribute('href');
}
// $urls is your collection of urls in the original document.
