How can I get the following information in xpath?
Text 01 - link_1.com
Text 02 - link_2.com
$page = '
<div class="news">
<div class="content">
<div>
<span class="title">Text 01</span>
<span class="link">link_1.com</span>
</div>
</div>
<div class="content">
<div>
<span class="title">Text 02</span>
<span class="link">link_2.com</span>
</div>
</div>
</div>';
@$this->dom->loadHTML($page);
$xpath = new DOMXPath($this->dom);
// perform step #1
$childElements = $xpath->query("//*[@class='content']");
$lista = '';
foreach ($childElements as $child) {
// perform step #2
$textChildren = $xpath->query("//*[@class='title']", $child);
foreach ($textChildren as $n) {
echo $n->nodeValue.'<br>';
}
$linkChildren = $xpath->query("//*[@class='link']", $child);
foreach ($linkChildren as $n) {
echo $n->nodeValue.'<br>';
}
}
My result is:
Text 01
Text 02
link_1.com
link_2.com
Text 01
Text 02
link_1.com
link_2.com
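Replace // with descendant:: in the second and third XPath expressions. An expression that starts with // is absolute: it searches the whole document from the root, even when you pass a context node, and $child is not a separate XML document. descendant:: is relative to the context node and matches any node nested inside it, at any depth.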
@$this->dom->loadHTML($page);
$xpath = new DOMXPath($this->dom);
// perform step #1
$childElements = $xpath->query("//*[@class='content']");
$lista = '';
foreach ($childElements as $child) {
// perform step #2
$textChildren = $xpath->query("descendant::*[@class='title']", $child);
foreach ($textChildren as $n) {
echo $n->nodeValue.'<br>';
}
$linkChildren = $xpath->query("descendant::*[@class='link']", $child);
foreach ($linkChildren as $n) {
echo $n->nodeValue.'<br>';
}
}
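To see the difference in isolation, here is a minimal self-contained sketch. It assumes a standalone DOMDocument instead of $this->dom, and it uses the abbreviated form .// (which selects the same nodes here as descendant::) together with evaluate() and string(). Those are choices made for this illustration, not part of the answer above:

```php
<?php
$page = '<div class="news">
  <div class="content"><div><span class="title">Text 01</span><span class="link">link_1.com</span></div></div>
  <div class="content"><div><span class="title">Text 02</span><span class="link">link_2.com</span></div></div>
</div>';

$dom = new DOMDocument();
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);

$rows = [];
foreach ($xpath->query("//*[@class='content']") as $content) {
    // The leading "." keeps the search inside $content;
    // string(...) returns the text of the first matching node.
    $title = $xpath->evaluate("string(.//*[@class='title'])", $content);
    $link  = $xpath->evaluate("string(.//*[@class='link'])", $content);
    $rows[] = $title . ' - ' . $link;
}

echo implode("\n", $rows), "\n";
```

Each entry in $rows pairs one title with the link from the same content block, which is the format asked for in the question.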
Related
I have HTML code:
<div>
<h1>Header</h1>
<code><p>First code</p></code>
<p>Next example</p>
<code><b>Second example</b></code>
</div>
Using PHP, I want to replace all < symbols located inside code elements. For example, the code above should be converted to:
<div>
<h1>Header</h1>
<code>&lt;p>First code&lt;/p></code>
<p>Next example</p>
<code>&lt;b>Second example&lt;/b></code>
</div>
I tried using PHP's DOMDocument class, but my attempt did not work. Below is my code:
$dom = new DOMDocument();
$dom->loadHTML($content);
$innerHTML= '';
$tmp = '';
if(count($dom->getElementsByTagName('*'))){
foreach ($dom->getElementsByTagName('*') as $child) {
if($child->tagName == 'code'){
$tmp = $child->ownerDocument->saveXML( $child);
$innerHTML .= htmlentities($tmp);
}
else{
$innerHTML .= $child->ownerDocument->saveXML($child);
}
}
}
So, you're iterating over the markup properly, and your use of saveXML() was close to what you want, but nowhere in your code do you try to actually change the contents of the element. This should work:
<?php
$content='<div>
<h1>Header</h1>
<code><p>First code</p></code>
<p>Next example</p>
<code><b>Second example</b></code>
</div>';
$dom = new DOMDocument();
$dom->loadHTML($content, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
foreach ($dom->getElementsByTagName('code') as $child) {
// get the markup of the children
$html = implode(array_map([$child->ownerDocument,"saveHTML"], iterator_to_array($child->childNodes)));
// create a node from the string
$text = $dom->createTextNode($html);
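// remove existing child nodes; childNodes is a live list, so remove from the front
// until it is empty instead of removing inside a foreach over it
while ($child->firstChild) {
$child->removeChild($child->firstChild);
}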
// append the new text node - escaping is done automatically
$child->appendChild($text);
}
echo $dom->saveHTML();
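One detail worth keeping in mind when emptying an element: childNodes is a live list, so removing nodes inside a foreach over it can skip some of them once an element has more than one child. The sketch below is a small illustration under that assumption, using made-up markup with three children; it serializes the children first and then removes them from the front:

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML(
    '<code><b>one</b><i>two</i><u>three</u></code>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);
$code = $dom->getElementsByTagName('code')->item(0);

// Serialize the children while they are still attached.
$markup = '';
foreach ($code->childNodes as $node) {
    $markup .= $dom->saveHTML($node);
}

// Remove from the front until nothing is left; safe with a live list.
while ($code->firstChild !== null) {
    $code->removeChild($code->firstChild);
}

// A text node is escaped automatically when the document is serialized.
$code->appendChild($dom->createTextNode($markup));
echo $dom->saveHTML($code), "\n";
```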
I have some divs with the same id and the same class, as you can see below:
<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>
<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>
....
I want to save all of them in an array for later use, in this format:
[0] => '<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>',
[1] => '<div id="results_information" class="control_results">
<!-- I have divs, subDivs, span, images inside -->
</div>',
....
For that I'm using this code below:
$dom = new DOMDocument(); // Create DOMDocument object.
$dom->loadHTMLFile($htmlOut); // Load target file.
$div =$dom->getElementById('results_information'); // Take all div elements.
But it doesn't work. How can I solve this problem and put my divs inside an array?
To solve your problem, follow the steps below:
First of all, select the elements by class rather than by ID, because an id is supposed to be unique within a document and getElementById() returns at most one element.
Assume you have the following HTML in a variable called $htmlOut:
<div id="results_information" class="control_results">
<span style="background:black; color:white">
hellow world
</span>
<strong>2</strong>
</div>
<div id="results_information" class="control_results">
<strong>2</strong>
<img src="hello.png" />
</div>
We need to extract the HTML inside the two elements with the class control_results and put it in an array. For this we can use DOMDocument and DOMXPath:
$array = array();
$dom = new DomDocument();
$dom->loadHtml($htmlOut);
$finder = new DomXPath($dom);
$classname = "control_results";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
That code selects every element whose class contains control_results and stores the result in $nodes.
Next we need to loop over $nodes (a DOMNodeList, not an array) and extract the HTML of each match. For this I wrote a helper function:
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
This function serializes every child node of the given element (all the HTML inside the control_results element) and returns it as a string.
Now you only need a foreach over $nodes that calls the function, like this:
foreach ($nodes as $rowNode) {
$array[] = get_inner_html($rowNode);
}
var_dump($array);
Below is the complete code:
$htmlOut = '
<div id="results_information" class="control_results">
<span style="background:black; color:white">
hellow world
</span>
<strong>2</strong>
</div>
<div id="results_information" class="control_results">
<strong>2</strong>
<img src="hello.png" />
</div>
';
$array = array();
$dom = new DomDocument();
$dom->loadHtml($htmlOut);
$finder = new DomXPath($dom);
$classname = "control_results";
$nodes = $finder->query("//*[contains(@class, '$classname')]");
foreach ($nodes as $rowNode) {
$array[] = get_inner_html($rowNode);
}
var_dump($array);
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveXML( $child );
}
return $innerHTML;
}
But this code has a small problem. If you check the result, the array contains:
0 => string '<span style="background:black; color:white">hellow world</span><strong>2</strong>',
1 => string '<strong>2</strong><img src="hello.png"/>'
instead of:
0 => string '<div id="results_information" class="control_results"><span style="background:black; color:white">hellow world</span><strong>2</strong></div>',
1 => string '<div id="results_information" class="control_results"><strong>2</strong><img src="hello.png"/></div>'
In that case you can loop over the array, prepend the opening div tag and append the closing tag to each entry, and store the results back in the array.
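Another way to keep the wrapping div is to serialize the matched node itself instead of its children. The following is a minimal sketch under that assumption, with shortened sample markup made up for illustration. It uses saveHTML() with a node argument and collects parser errors internally, since duplicate ids can trigger warnings:

```php
<?php
$htmlOut = '
<div id="results_information" class="control_results">
<strong>1</strong>
</div>
<div id="results_information" class="control_results">
<strong>2</strong>
</div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($htmlOut);
libxml_clear_errors();

$finder = new DOMXPath($dom);
$array = [];
foreach ($finder->query("//div[contains(@class, 'control_results')]") as $div) {
    // Passing the node itself serializes its own tag as well as its children.
    $array[] = $dom->saveHTML($div);
}

var_dump($array);
```

Because the node is serialized as a whole, no string concatenation of opening and closing tags is needed afterwards.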
You will need to use XPath and select the elements by class name.
$dom = new DOMDocument();
$dom->loadHTMLFile($htmlOut); // load the markup before querying it
$xpath = new DOMXpath($dom);
$div = $xpath->query('//div[contains(@class, "control_results")]');
I'm attempting to find the first <p> in a <div>:
<div class="embed-left">
<h4>Bookmarks</h4>
<p>Something goes here.</p>
<p>Read more...</p>
</div>
Which I've done.
Now, however, I need to replace the found text with a link. The link is built as a <span> string and then passed to createElement() to create $url:
$results_links = $this->data_migration->process_embed_find_links();
$dom = new DOMDocument();
foreach ($results_links as $notes):
$dom->loadHTML($notes['note']);
$x = $dom->getElementsByTagName('div')->length;
// Loop through the <div> elements found in the HTML...
for ($i = 0; $i < $x; $i++):
$parentNode = $dom->getElementsByTagName('div')->item($i);
// Here's a <h4> element.
$childNodeHeading = $dom->getElementsByTagName('div')->item($i)->childNodes->item(1);
// If the <h4> element is "Bookmarks"...
if ( $childNodeHeading->nodeValue == "Bookmarks" ):
// ... then grab the first <p> element.
$childNodeTitle = $dom->getElementsByTagName('div')->item($i)->childNodes->item(3);
// Create the appropriate <p> element.
$title = $dom->createElement('p', $childNodeTitle->nodeValue);
echo "<p>" . $title->nodeValue . "</p>";
// Find the `notes_links.from-asset` rows.
$results_bookmarks_links = $this->data_migration->process_embed_find_links_bookmarks_links(array(
'note_id' => $notes['note_id'],
// Send the first <p> tag in the <div> element.
'title' => htmlentities($childNodeTitle->nodeValue)
));
// Loop through the data (one row returned, but it's more neat to run it through a foreach() function)...
foreach ($results_bookmarks_links as $index => $link):
// Assuming there are values (which there has to be, by virtue of the fact that we found the <div> elements in the first place...
if ( isset($results_bookmarks_links) && ( count($results_bookmarks_links) > 0 ) ):
// Create the <span> element for the link item, according to Sina's design.
$span = '<span>[#' . $notes['note_id'] . ']</span>';
**$url = $dom->createElement('span', $span);**
**$parentNode->replaceChild(
$url,
$title
);**
endif;
endforeach;
endif;
endfor;
endforeach;
Which I've had no success with.
I'm unable to figure out either the parent element, or the proper parameters to use in the replaceChild() method.
I've emboldened the main bits that I'm having trouble with, if that helps.
The important thing is to replace the existing p with a newly-created p that contains the child nodes.
Here's an example, using XPath to select the nodes to be replaced:
<?php
$html = <<<END
<div class="embed-left">
<h4>Bookmarks</h4>
<p>Something goes here.</p>
<p>Read more...</p>
</div>
END;
$doc = new DOMDocument;
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[h4[text()="Bookmarks"]]/p[1]');
foreach ($nodes as $oldnode) {
$note = 'TODO'; // build `$note` somewhere
$link = $doc->createElement('a');
$link->setAttribute('href', '#');
$link->textContent = sprintf('[#%s]', $note);
$span = $doc->createElement('span');
$span->appendChild($link);
$newnode = $doc->createElement('p');
$newnode->appendChild($span);
$oldnode->parentNode->replaceChild($newnode, $oldnode);
}
print $doc->saveHTML($doc->documentElement);
I am using PHP to parse HTML provided to me by Wordpress.
This is a post's HTML as returned by WordPress:
<p>Test</p>
<p>
<img class="alignnone size-thumbnail wp-image-39" src="img.png"/>
</p>
<p>Ok.</p>
This is my parsing function (with debugging left in):
function get_parsed_blog_post()
{
$html = ob_wp_content(false);
print_r(htmlspecialchars($html));
echo '<hr/><hr/><hr/>';
$parse = new DOMDocument();
$parse->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXpath($parse);
$ps = $xpath->query('//p');
foreach ($ps as $p)
{
$imgs = $p->getElementsByTagName('img');
print($imgs->length);
echo '<br/>';
if ($imgs->length > 0)
{
$p->setAttribute('class', 'image-content');
foreach ($imgs as $img)
{
$img->removeAttribute('class');
}
}
}
$htmlFinal = $parse->saveHTML();
print_r(htmlspecialchars($htmlFinal));
echo '<hr/><hr/><hr/>';
return $htmlFinal;
}
The purpose of this code is to remove the classes Wordpress adds to the <img>s, and to set any <p> that contains an image to be a class of image-content.
And this returns:
1
1
0
<p class="image-content">Test
<p class="image-content">
<img src="img.png">
</p>
<p>Ok.</p></p>
Somehow, it has wrapped the first occurrence of <p> around my entire parsed post, causing the first <p> to have the image-content class incorrectly applied. Why is this happening? How do I stop it?
METHOD 1
To keep your code as close to the original as possible, I made a few changes to get it working.
If you print each $p, you will see that the first element contains all of your HTML. The simplest fix is to prepend an empty <p> to your HTML and skip it in the foreach loop.
function get_parsed_blog_post()
{
$page_content_html = ob_wp_content(false);
$html = "<p></p>".$page_content_html;
print_r(htmlspecialchars($html));
echo '<hr/><hr/><hr/>';
$parse = new DOMDocument();
$parse->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXpath($parse);
$ps = $xpath->query('//p');
$i = 0;
foreach ($ps as $p)
{
if($i != 0) {
$imgs = $p->getElementsByTagName('img');
print($imgs->length);
echo '<br/>';
if ($imgs->length > 0)
{
$p->setAttribute('class', 'image-content');
foreach ($imgs as $img)
{
$img->removeAttribute('class');
}
}
}
$i++;
}
$htmlFinal = $parse->saveHTML();
print_r(htmlspecialchars($htmlFinal));
echo '<hr/><hr/><hr/>';
return $htmlFinal;
}
Total execution time in seconds: 0.00034999847412109
METHOD 2
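The problem is caused by the LIBXML_HTML_NOIMPLIED flag in LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD: without the implied <html> and <body> wrappers, the parser treats the first <p> as the root and nests everything else inside it. You can strip the document tags without those flags, like this: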
function get_parsed_blog_post()
{
$page_content_html = ob_wp_content(false);
$doc = new DOMDocument();
$doc->loadHTML($page_content_html);
foreach($doc->getElementsByTagName('p') as $paragraph) {
$imgs = $paragraph->getElementsByTagName('img');
if ($imgs->length > 0)
{
$paragraph->setAttribute('class', 'image-content');
foreach ($imgs as $img)
{
$img->removeAttribute('class');
}
}
}
/* REMOVING DOCTYPE, HTML AND BODY TAGS */
// Removing DOCTYPE
$doc->removeChild($doc->doctype);
// Removing HTML tag
$doc->replaceChild($doc->firstChild->firstChild, $doc->firstChild);
// Removing Body Tag
$html = $doc->getElementsByTagName("body")->item(0);
$fragment = $doc->createDocumentFragment();
while ($html->childNodes->length > 0) {
$fragment->appendChild($html->childNodes->item(0));
}
$html->parentNode->replaceChild($fragment, $html);
$htmlFinal = $doc->saveHTML();
print_r(htmlspecialchars($htmlFinal));
echo '<hr/><hr/><hr/>';
return $htmlFinal;
}
Total execution time in seconds: 0.00026822090148926
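A third option that is sometimes used, sketched here as an assumption rather than something taken from the two methods above: keep the flags, but wrap the markup in a single container element so the parser has one root, then serialize only the children of that container. The function name add_image_paragraph_class and the sample markup are made up for this illustration:

```php
<?php
function add_image_paragraph_class(string $html): string
{
    $doc = new DOMDocument();
    // The wrapper <div> becomes the single root element of the fragment.
    $doc->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );

    foreach ($doc->getElementsByTagName('p') as $p) {
        $imgs = $p->getElementsByTagName('img');
        if ($imgs->length > 0) {
            $p->setAttribute('class', 'image-content');
            foreach ($imgs as $img) {
                $img->removeAttribute('class');
            }
        }
    }

    // Serialize the wrapper's children only, so the wrapper is not returned.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo add_image_paragraph_class(
    '<p>Test</p><p><img class="alignnone" src="img.png"/></p><p>Ok.</p>'
), "\n";
```

This avoids both the dummy <p> of Method 1 and the manual removal of the doctype, html, and body nodes in Method 2, at the cost of one extra element during parsing.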
I have the following code....
<div class="outer">
<div>
<h1>Christmas</h1>
<ul>
<li>Holiday</li>
<li>Fun</li>
<li>Joy</li>
</ul>
<h1>4th July</h1>
<ul>
<li>Fireworks</li>
<li>Happy</li>
<li>Spectral</li>
</ul>
</div>
</div>
<div class="outer">
<div>
<h1>Christmas2</h1>
<ul>
<li>Holiday</li>
<li>Fun</li>
<li>Joy</li>
</ul>
<h1>4th July</h1>
<ul>
<li>Fireworks2</li>
<li>Happy</li>
<li>Spectral</li>
</ul>
</div>
</div>
I already know that I can find the DIV and then look inside the DIV for the elements etc by doing...
$doc->loadHTML($output); //$output being the text above
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//div[@class="outer"]'); //Check outer
I know the three lines above will get the elements within each matching DIV, but what I really want is to get the text of each H1 and display the LI values next to it.
The output I'm looking for is:
Christmas - Holiday, Fun, Joy
4th July - Fireworks, Happy, Spectral
Christmas2 - Holiday, Fun, Joy
4th July2 - Fireworks, Happy, Spectral
Yes, you can keep using XPath: select each header, then get the list that immediately follows it as a sibling. Example:
$doc = new DOMDocument();
$doc->loadHTML($output);
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//div[@class="outer"]/div');
if($elements->length > 0) {
foreach($elements as $div) {
foreach ($xpath->query('./h1', $div) as $e) {
$header = $e->nodeValue;
$list = array();
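// [1] limits the match to the first <ul> after this <h1>; without it, the items of every later list would be included too
foreach ($xpath->query('./following-sibling::ul[1]/li', $e) as $li) {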
$list[] = $li->nodeValue;
}
echo $header . ' - ' . implode(', ', $list) . '<br/>';
}
echo '<hr/>';
}
}
I've used phpQuery for this type of issue in the past:
// include phpquery
require('phpQuery/phpQuery.php');
// initialize
$doc = phpQuery::newDocumentHTML($markup);
// get the text from the various elements
$h1Value = $doc['h1:first']->text(); // Christmas
// ... etc.
(untested)