PHP - Get links from within an element after element has been found - php

I have the following code....
<div class="outer">
<div>
<h1>Christmas</h1>
<ul>
<li>Holiday</li>
<li>Fun</li>
<li>Joy</li>
</ul>
<h1>4th July</h1>
<ul>
<li>Fireworks</li>
<li>Happy</li>
<li>Spectral</li>
</ul>
</div>
</div>
<div class="outer">
<div>
<h1>Christmas2</h1>
<ul>
<li>Holiday</li>
<li>Fun</li>
<li>Joy</li>
</ul>
<h1>4th July</h1>
<ul>
<li>Fireworks2</li>
<li>Happy</li>
<li>Spectral</li>
</ul>
</div>
</div>
I already know that I can find the DIV and then look inside the DIV for the elements etc by doing...
$doc->loadHTML($output); //$output being the text above
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//div[@class="outer"]'); //Check outer
I know the three lines above will get the elements inside the listed DIV, but what I really want is to get the text of each [H1] and then display the [li] values next to it.
The output I'm looking for is...
Christmas - Holiday, Fun, Joy
4th July - Fireworks, Happy, Spectral
Christmas2 - Holiday, Fun, Joy
4th July2 - Fireworks, Happy, Spectral

Yes, you can keep using XPath: select each header, then take the list that follows it as a sibling. Example:
$doc = new DOMDocument();
$doc->loadHTML($output);
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//div[@class="outer"]/div');
if($elements->length > 0) {
foreach($elements as $div) {
foreach ($xpath->query('./h1', $div) as $e) {
$header = $e->nodeValue;
$list = array();
// ul[1] picks only the nearest following <ul>, so each header gets its own list
foreach ($xpath->query('./following-sibling::ul[1]/li', $e) as $li) {
$list[] = $li->nodeValue;
}
echo $header . ' - ' . implode(', ', $list) . '<br/>';
}
echo '<hr/>';
}
}
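A self-contained sketch of the same traversal, collecting the lines into an array instead of echoing them. The markup is a shortened copy of the question's first block, and `$lines` is a name introduced here for illustration. It assumes each h1 should be paired only with the first ul that follows it, hence `ul[1]`.

```php
<?php
// Assumed sample markup, shortened from the question.
$output = '<div class="outer"><div>'
    . '<h1>Christmas</h1><ul><li>Holiday</li><li>Fun</li><li>Joy</li></ul>'
    . '<h1>4th July</h1><ul><li>Fireworks</li><li>Happy</li><li>Spectral</li></ul>'
    . '</div></div>';

$doc = new DOMDocument();
$doc->loadHTML($output);
$xpath = new DOMXPath($doc);

$lines = array();
foreach ($xpath->query('//div[@class="outer"]/div/h1') as $h1) {
    $items = array();
    // ul[1] limits the match to the nearest following <ul> sibling.
    foreach ($xpath->query('./following-sibling::ul[1]/li', $h1) as $li) {
        $items[] = trim($li->nodeValue);
    }
    $lines[] = trim($h1->nodeValue) . ' - ' . implode(', ', $items);
}

echo implode("\n", $lines), "\n";
```

Without the `[1]` predicate, the first header would also pick up the items of every later list in the same div.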

I've used phpQuery for this type of issue in the past:
// include phpquery
require('phpQuery/phpQuery.php');
// initialize
$doc = phpQuery::newDocumentHTML($markup);
// get the text from the various elements
$h1Value = $doc['h1:first']->text(); // Christmas
// ... etc.
(untested)


DOMDocument: problems with replaceChild()

I'm attempting to find the first <p> in a <div>:
<div class="embed-left">
<h4>Bookmarks</h4>
<p>Something goes here.</p>
<p>Read more...</p>
</div>
Which I've done.
Now, however, I need to replace the found text with a link, as assigned to the <span> before then being used in the $url createElement() method:
$results_links = $this->data_migration->process_embed_find_links();
$dom = new DOMDocument();
foreach ($results_links as $notes):
$dom->loadHTML($notes['note']);
$x = $dom->getElementsByTagName('div')->length;
// Loop through the <div> elements found in the HTML...
for ($i = 0; $i < $x; $i++):
$parentNode = $dom->getElementsByTagName('div')->item($i);
// Here's a <h4> element.
$childNodeHeading = $dom->getElementsByTagName('div')->item($i)->childNodes->item(1);
// If the <h4> element is "Bookmarks"...
if ( $childNodeHeading->nodeValue == "Bookmarks" ):
// ... then grab the first <p> element.
$childNodeTitle = $dom->getElementsByTagName('div')->item($i)->childNodes->item(3);
// Create the appropriate <p> element.
$title = $dom->createElement('p', $childNodeTitle->nodeValue);
echo "<p>" . $title->nodeValue . "</p>";
// Find the `notes_links.from-asset` rows.
$results_bookmarks_links = $this->data_migration->process_embed_find_links_bookmarks_links(array(
'note_id' => $notes['note_id'],
// Send the first <p> tag in the <div> element.
'title' => htmlentities($childNodeTitle->nodeValue)
));
// Loop through the data (one row returned, but it's more neat to run it through a foreach() function)...
foreach ($results_bookmarks_links as $index => $link):
// Assuming there are values (which there has to be, by virtue of the fact that we found the <div> elements in the first place...
if ( isset($results_bookmarks_links) && ( count($results_bookmarks_links) > 0 ) ):
// Create the <span> element for the link item, according to Sina's design.
$span = '<span>[#' . $notes['note_id'] . ']</span>';
**$url = $dom->createElement('span', $span);**
**$parentNode->replaceChild(
$url,
$title
);**
endif;
endforeach;
endif;
endfor;
endforeach;
Which I've had no success with.
I'm unable to figure out either the parent element, or the proper parameters to use in the replaceChild() method.
I've emboldened the main bits that I'm having trouble with, if that helps.
The important thing is to replace the existing p with a newly-created p that contains the child nodes.
Here's an example, using XPath to select the nodes to be replaced:
<?php
$html = <<<END
<div class="embed-left">
<h4>Bookmarks</h4>
<p>Something goes here.</p>
<p>Read more...</p>
</div>
END;
$doc = new DOMDocument;
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[h4[text()="Bookmarks"]]/p[1]');
foreach ($nodes as $oldnode) {
$note = 'TODO'; // build `$note` somewhere
$link = $doc->createElement('a');
$link->setAttribute('href', '#');
$link->textContent = sprintf('[#%s]', $note);
$span = $doc->createElement('span');
$span->appendChild($link);
$newnode = $doc->createElement('p');
$newnode->appendChild($span);
$oldnode->parentNode->replaceChild($newnode, $oldnode);
}
print $doc->saveHTML($doc->documentElement);

How to get list sub-item nodes separately using PHP DOM

I was looking at this tip:
PHP DOM get items from first ul element
But in this case:
<li>First item
<ul>
<li>
First SubItem
</li>
<li>
Second SubItem
</li>
</ul>
</li>
PHP Code:
$DOM = new DOMDocument;
libxml_use_internal_errors(true);
$DOM->loadHTML( $output);
$items = $DOM->getElementsByTagName('ul');
echo '<ul>';
foreach ($items->item(3)->getElementsByTagName('li') as $li) {
var_dump($li);die();
echo '<li>'.$li->nodeValue;
$ul = $li->getElementsByTagName('ul');
echo '<ul>';
echo '--->'.$ul->length.'<br>';
for($u=0;$u<$ul->length;$u++){
foreach ($ul->item($u)->getElementsByTagName('li') as $lii) {
echo '<li>'.$lii->nodeValue.'</li>';
}
}
echo '</ul>';
echo '</li>';
}
echo '</ul>';
The problem is:
$li->nodeValue gives me "First itemFirst SubItemSecond SubItem" for the first node.
I need to get these items (the sub-items) separately.
I'm assuming you just want to retrieve the text values from those <li> tags.
You can greatly simplify this with DOMXPath: ->query('//li') fetches every <li> tag in your code snippet, and appending /text() returns only the text nodes that sit directly inside each one.
$DOM = new DOMDocument();
$DOM->loadHTML($output);
$xPath = new DOMXPath($DOM);
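// [normalize-space()] skips whitespace-only text nodes, which would otherwise print empty <li> items
if($xpResponse = $xPath->query('//li/text()[normalize-space()]')) {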
echo "<ul>\n";
foreach($xpResponse as $xNode) {
echo "<li>" . trim($xNode->nodeValue) . "</li>\n";
}
echo "</ul>\n";
}
This will simply output (as HTML):
First item
First SubItem
Second SubItem
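An alternative sketch that keeps one entry per `<li>` by joining only the text nodes that are its direct children. The wrapping `<ul>` around the question's snippet and the `$labels` array are assumptions made for this illustration.

```php
<?php
// Assumed markup, taken from the question's snippet and wrapped in a <ul>.
$output = '<ul><li>First item
<ul>
<li>
First SubItem
</li>
<li>
Second SubItem
</li>
</ul>
</li></ul>';

$DOM = new DOMDocument();
$DOM->loadHTML($output);
$xPath = new DOMXPath($DOM);

$labels = array();
foreach ($xPath->query('//li') as $li) {
    $own = '';
    // Concatenate only the text nodes that are direct children of this <li>.
    foreach ($li->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $own .= $child->nodeValue;
        }
    }
    $labels[] = trim($own);
}

print_r($labels);
```

Because nested `<ul>` elements are skipped rather than flattened, the parent item's label no longer includes the text of its sub-items.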

How to get all child nodes from DOMDocument?

I have the following
$string = '<html><head></head><body><ul id="mainmenu">
<li id="1">Hallo</li>
<li id="2">Welt
<ul>
<li id="3">Sub Hallo</li>
<li id="4">Sub Welt</li>
</ul>
</li>
</ul></body></html>';
$dom = new DOMDocument;
$dom->loadHTML($string);
now I want to have all li IDs inside one array.
I tried the following:
$all_li_ids = array();
$menu_nodes = $dom->getElementById('mainmenu')->childNodes;
foreach($menu_nodes as $li_node){
if($li_node->nodeName=='li'){
$all_li_ids[]=$li_node->getAttribute('id');
}
}
print_r($all_li_ids);
As you might see, this will print out [1,2]
How do I get all children (the subchildren as well [1,2,3,4])?
In my test, $dom->getElementById('mainmenu') did not return the element, so I used XPath instead. If it works for you, you do not need XPath:
$xpath = new DOMXPath($dom);
$ul = $xpath->query("//*[@id='mainmenu']")->item(0);
$all_li_ids = array();
// Find all inner li tags
$menu_nodes = $ul->getElementsByTagName('li');
foreach($menu_nodes as $li_node){
$all_li_ids[]=$li_node->getAttribute('id');
}
print_r($all_li_ids); // 1, 2, 3, 4
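The same list can probably be built with a single XPath expression that selects the id attributes directly. This is a sketch reusing the question's `$string` and the `$all_li_ids` name; nothing else is assumed.

```php
<?php
$string = '<html><head></head><body><ul id="mainmenu">
<li id="1">Hallo</li>
<li id="2">Welt
<ul>
<li id="3">Sub Hallo</li>
<li id="4">Sub Welt</li>
</ul>
</li>
</ul></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($string);
$xpath = new DOMXPath($dom);

$all_li_ids = array();
// "//li" below the menu matches <li> elements at any depth; "/@id" selects their id attributes.
foreach ($xpath->query("//ul[@id='mainmenu']//li/@id") as $attr) {
    $all_li_ids[] = $attr->nodeValue;
}

print_r($all_li_ids);
```

The values come back as strings, in document order.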
One way to do it would be to add another foreach loop, ie:
foreach($menu_nodes as $node){
if($node->nodeName=='li'){
$all_li_ids[]=$node->getAttribute('id');
// getElementsByTagName() searches all descendants, so nested <li> elements are found too
foreach($node->getElementsByTagName('li') as $sub_node){
$all_li_ids[]=$sub_node->getAttribute('id');
}
}
}

php DOMDocument - List child elements to array

For the following HTML:
<html>
<body>
<div whatever></div>
<div id="archive-wrapper">
<ul class="archive-list">
<li><div>A</div></li>
<li><div>B</div></li>
<li><div>C</div></li>
</ul>
</div>
</body>
How could I retrieve, with PHP DOMDocument (http://php.net/manual/es/class.domdocument.php), an array containing (#1,#2,#3) in the most effective way? It's not that I did not try anything or that I want an already done code, I just need to know some guidelines to do it and understand it on my own. Thanks :)
A simple example using php DOMDocument -
<?php
$html = <<<HTML
<html>
<body>
<div whatever></div>
<div id="archive-wrapper">
<ul class="archive-list">
<li><div>A</div></li>
<li><div>B</div></li>
<li><div>C</div></li>
</ul>
</div>
</body>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
//get all links
$links = $dom->getElementsByTagName('a');
$linkArray = array();
//loop through each link
foreach ($links as $link){
$linkArray[] = $link->getAttribute('href');
}
edit
to get only the links inside ul->li, you could do something like -
$dom = new DOMDocument();
$dom->loadHTML($html);
$linkArray = array();
foreach ($dom->getElementsByTagName('ul') as $ul){
foreach ($ul->getElementsByTagName('li') as $li){
foreach ($li->getElementsByTagName('a') as $link){
$linkArray[] = $link->getAttribute('href');
}
}
}
or if you just want the 1st ul you could simplify to
//get 1st ul using ->item(0)
$ul = $dom->getElementsByTagName('ul')->item(0);
foreach ($ul->getElementsByTagName('li') as $li){
foreach ($li->getElementsByTagName('a') as $a){
$linkArray[] = $a->getAttribute('href');
}
}
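The nested loops can also be collapsed into one XPath query. This is a sketch under a stated assumption: each `<li>` contains a link such as `<a href="#1">`, which the markup shown in the question does not include, so the anchors and href values below are illustrative.

```php
<?php
// Assumption: each <li> wraps a link; the href values below are illustrative.
$html = '<html><body>
<div id="archive-wrapper">
<ul class="archive-list">
<li><div><a href="#1">A</a></div></li>
<li><div><a href="#2">B</a></div></li>
<li><div><a href="#3">C</a></div></li>
</ul>
</div>
</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$linkArray = array();
// Match links at any depth inside the list items of the archive list only.
foreach ($xpath->query('//ul[@class="archive-list"]/li//a[@href]') as $a) {
    $linkArray[] = $a->getAttribute('href');
}

print_r($linkArray);
```

Restricting the query to the `archive-list` class keeps links from other lists on the page out of the result.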
What do you mean by PHP DOM? Do you mean PHP together with jQuery? There are a few options:
you can put all of that in a form and post it to a script;
you can also wrap it in a select, which will only submit the selected data.
A better idea would be to use jQuery to post the items as an array and let PHP handle the server-side manipulation. That holds up better in the long run, since it is the more current way of connecting HTML to server-side scripts.
For example:
$("#form").submit(function(){ //form being the #form id
var items = [];
$(".archive-list li").each(function(n){ // archive-list is a class in the markup, not an id
items[n] = $(this).html();
});
$.post(
"munipilate-data.php",
{items: items},
function(data){
$("#result").html(data);
});
});
I suggest using a regex to parse it.
$html = '<html>
<body>
<div whatever></div>
<div id="archive-wrapper">
<ul class="archive-list">
<li><div>A</div></li>
<li><div>B</div></li>
<li><div>C</div></li>
</ul>
</div>
</body>';
$reg = '/a href=["\']?([^"\' ]*)["\' ]/';
preg_match_all($reg, $html, $m);
$arr = array_map(function($v){
return trim(str_replace('a href=', '', $v), '"');
}, $m[0]);
print '<pre>';
print_r($arr);
print '</pre>';
Output:
Array
(
[0] => #1
[1] => #2
[2] => #3
)

how to use dom php parser

I'm new to DOM parsing in PHP:
I have a HTML file that I'm trying to parse. It has a bunch of DIVs like this:
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox">
......
I'm trying to get the contents of the many div boxes using php.
How can I use the DOM parser to do this?
Thanks!
First, I have to point out that you can't use the same id on two different divs; that is what classes are for. Every element should have a unique id.
Code to get the contents of the div with id="interestingbox":
$html = '
<html>
<head></head>
<body>
<div id="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div id="interestingbox2">a link</div>
</body>
</html>';
$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
//use DOMXpath to navigate the html with the DOM
$dom_xpath = new DOMXpath($dom_document);
// if you want to get the div with id=interestingbox
$elements = $dom_xpath->query("//div[@id='interestingbox']");
if (!is_null($elements)) {
foreach ($elements as $element) {
echo "\n[". $element->nodeName. "]";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
//OUTPUT
[div] {
Content1
Content2
}
Example with classes:
$html = '
<html>
<head></head>
<body>
<div class="interestingbox">
<div id="interestingdetails" class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div class="interestingbox">a link</div>
</body>
</html>';
//the same as before.. just change the xpath
[...]
$elements = $dom_xpath->query("//div[@class='interestingbox']");
[...]
//OUTPUT
[div] {
Content1
Content2
}
[div] {
a link
}
Refer to the DOMXPath page for more details.
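One caveat with `@class='interestingbox'` is that it only matches when the attribute is exactly that string. A minimal sketch of the common XPath 1.0 workaround for elements carrying several classes follows; the extra `featured` and `notinterestingbox` classes and the `$contents` array are made up for the illustration.

```php
<?php
$html = '<html><head></head><body>
<div class="interestingbox featured">
<div class="txtnormal">
<div>Content1</div>
<div>Content2</div>
</div>
</div>
<div class="interestingbox">a link</div>
<div class="notinterestingbox">skip me</div>
</body></html>';

$dom_document = new DOMDocument();
$dom_document->loadHTML($html);
$dom_xpath = new DOMXPath($dom_document);

// Pad the class attribute with spaces so that only the whole token "interestingbox" matches.
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' interestingbox ')]";

$contents = array();
foreach ($dom_xpath->query($query) as $element) {
    // Collapse whitespace runs so nested markup yields one string per box.
    $contents[] = trim(preg_replace('/\s+/', ' ', $element->textContent));
}

print_r($contents);
```

A plain `contains(@class, 'interestingbox')` would also match `notinterestingbox`, which is why the value is padded with spaces first.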
I got this to work using simplehtmldom as a start:
$html = file_get_html('example.com');
foreach ($html->find('div[id=interestingbox]') as $result)
{
echo $result->innertext;
}
Very nice function from http://www.sitepoint.com/forums/showthread.php?611393-php5-need-something-like-innerHTML-instead-of-nodeValue
function innerXML($node)
{
$doc = $node->ownerDocument;
$frag = $doc->createDocumentFragment();
foreach ($node->childNodes as $child)
{
$frag->appendChild($child->cloneNode(TRUE));
}
return $doc->saveXML($frag);
}
$dom = new DOMDocument();
$dom->loadXML('
<html>
<body>
<table>
<tr>
<td id="foo">
The first bit of Data I want
<br />The second bit of Data I want
<br />The third bit of Data I want
</td>
</tr>
</table>
</body>
</html>
');
$xpath = new DOMXPath($dom);
$node = $xpath->evaluate("/html/body//td[@id='foo']");
$dataString = innerXML($node->item(0));
// saveXML() serializes empty elements as <br/>, without the space
$dataArr = explode("<br/>", $dataString);
$dataUno = $dataArr[0];
$dataDos = $dataArr[1];
$dataTres = $dataArr[2];
echo "firstdata = $dataUno<br />seconddata = $dataDos<br />thirddata = $dataTres<br />";
WebExtractor: https://github.com/knyga/webextractor
It can parse pages with CSS, regex, and XPath selectors.
See the package and its tests for examples:
use WebExtractor\DataExtractor\DataExtractorFactory;
use WebExtractor\DataExtractor\DataExtractorTypes;
use WebExtractor\Client\Client;
$factory = DataExtractorFactory::getFactory();
$extractor = $factory->createDataExtractor(DataExtractorTypes::CSS);
$client = new Client;
$content = $client->get('https://en.wikipedia.org/wiki/2014_Winter_Olympics');
$extractor->setContent($content);
$h1 = $extractor->setSelector('h1')->extract();
