How to exclude HTML comments from a text node XPath?

I have the following HTML structure:
<a>
<div>
<div>
<span>
text node 1<br>
text node 2 <!--//comments-->
</span>
</div>
</div>
</a>
With the following query I get the second text node, but how do I get that node without the comment?
$spanx = $xpath->query('//a/div/div/span/text()[2]');
$span = $spanx->item($l)->nodeValue;
echo "<td>".$span."</td></tr>";
I get this result:
text node 2 //comments
I want:
text node 2

I've tested the following on my localhost. I've created the file named DOM_with_comment.html containing:
<a>
<div>
<div>
<span>
text node 1<br>
text node 2 <!--//comments-->
</span>
</div>
</div>
</a>
When I run:
<?php
$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->preserveWhiteSpace = false;
$doc->loadHTMLFile('DOM_with_comment.html');
$xpath = new DOMXPath($doc);
echo "<pre>";
foreach ($xpath->query('//a/div/div/span/text()') as $item) {
var_dump($item->nodeValue);
}
The output is:
string(29) "
text node 1"
string(31) "
text node 2 "
string(14) "
"
The comment is a separate comment node, so text() never returns it. Taking the first result ([0]) of your XPath query and trim()ing its ->nodeValue property leaves only the targeted substring, with no comment or whitespace on either side, as var_export() shows:
var_export(trim($xpath->query('//a/div/div/span/text()[2]')[0]->nodeValue));
// outputs: 'text node 2'
p.s. If your input is not coming from a file, but a variable, this works the same way:
$html = <<<HTML
<a>
<div>
<div>
<span>
text node 1<br>
text node 2 <!--//comments-->
</span>
</div>
</div>
</a>
HTML;
$doc->loadHTML($html);
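
To make the point explicit, here is a minimal sketch (not part of the tested code above) that lists the text nodes and the comment nodes of the span separately. It assumes a compacted version of the same markup; the variable names $texts, $comments and $second are only illustrative.

```php
<?php
// Same structure as in the question, written on one line so that
// indentation does not add extra whitespace to the text nodes.
$html = '<a><div><div><span>text node 1<br>text node 2 <!--//comments--></span></div></div></a>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// text() matches text nodes only; the comment is never part of them.
$texts = [];
foreach ($xpath->query('//a/div/div/span/text()') as $node) {
    $texts[] = trim($node->nodeValue);
}

// comment() matches comment nodes only.
$comments = [];
foreach ($xpath->query('//a/div/div/span/comment()') as $node) {
    $comments[] = $node->nodeValue;
}

// normalize-space() does the trimming inside XPath itself.
$second = $xpath->evaluate('normalize-space(//a/div/div/span/text()[2])');

var_export($texts);
var_export($comments);
var_export($second);
```

If a comment still shows up in your output, it is probably coming from somewhere other than text(), for example from reading the nodeValue of the whole span.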

Related

How replace text with DOMDocument

I need to change the letter "a" to "1" and "e" to "2".
This is roughly the HTML; in reality it is more deeply nested:
<body>
<p>
<span>sppan</span>
link
some text
</p>
<p>
another text
</p>
</body>
expected output
<body>
<p>
<span>spp1n</span>
link
some t2xt
</p>
<p>
anoth2r t2xt
</p>
</body>

I believe your expected output has an error (given your conditions), but generally speaking, it can be done using xpath:
$html= '
[your html above]
';
$HTMLDoc = new DOMDocument();
$HTMLDoc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
$xpath = new DOMXPath($HTMLDoc);
# locate all the text nodes in the html:
$targets = $xpath->query('//*//text()');
# get the text from each node
foreach ($targets as $target) {
$current = $target->nodeValue;
# make the required changes
$new = str_replace(["a", "e"], ["1", "2"], $current);
# replace the old text with the new
$target->nodeValue = $new;
}
echo $HTMLDoc->saveHTML();
Output:
<body>
<p>
<span>spp1n</span>
link
som2 t2xt
</p>
<p>
1noth2r t2xt
</p>
</body>

Get contents of div up to a certain point

I'm grabbing all the paragraph tags using the PHP Simple HTML DOM Parser with the following code:
// Product Description
$html = file_get_html('http://domain.local/index.html');
$contents = strip_tags($html->find('div[class=product-details] p'));
How can I grab only the paragraphs that come before the first ul?
<p>
Paragraph 1
</p>
<p>
Paragraph 2
</p>
<p>
Paragraph 3
</p>
<ul>
<li>
List item 1
</li>
<li>
List item 2
</li>
</ul>
<blockquote>
Quote 1
</blockquote>
<blockquote>
Quote 2
</blockquote>
<blockquote>
Quote 3
</blockquote>
<p>
Paragraph 4
</p>
<p>
Paragraph 5
</p>

You can use the following code to meet that requirement:
<?php
$html = file_get_html('http://domain.local/index.html');
$detailTags = $html->find('div[class=product-details] *');
$contents = "";
foreach ($detailTags as $detailTag){
// stop as soon as the first <ul> is reached
if ($detailTag->tag === 'ul') {
break;
}
$contents .= strip_tags($detailTag);
}
// contents will contain the output required.
echo $contents;
?>
OUTPUT:-
Paragraph 1 Paragraph 2 Paragraph 3

EDIT: Nandal's code will work for you because it will not force you to change the library.
If you don't want to depend on a third-party library, you can use PHP's DOMDocument, which requires the DOM extension to be enabled.
The code below prints the paragraphs until it hits any other tag:
<?php
$html = new DOMDocument();
$html->loadHTML("<html><body><p>Paragraph 1</p><p> Paragraph 2</p><p> Paragraph 3</p><ul> <li> List item 1 </li> <li> List item 2 </li> </ul><blockquote> Quote 1</blockquote><blockquote> Quote 2</blockquote><blockquote> Quote 3</blockquote><p> Paragraph 4</p><p> Paragraph 5</p></body></html>");
$xpath = new DOMXPath($html);
$nodes = $xpath->query('/html/body//*');
foreach($nodes as $node) {
if($node->nodeName != "p") {
break;
}
print $node -> nodeValue . "\n";
}
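
As a variation on the loop above, the stop condition can also go into the XPath expression itself. This is only a sketch: the markup is a shortened stand-in for the page in the question, and it assumes the paragraphs are direct children of the product-details div.

```php
<?php
// Hypothetical sample markup mirroring the structure in the question.
$html = '<div class="product-details">'
      . '<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>'
      . '<ul><li>List item 1</li><li>List item 2</li></ul>'
      . '<blockquote>Quote 1</blockquote>'
      . '<p>Paragraph 4</p><p>Paragraph 5</p>'
      . '</div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Keep only the <p> children that have no <ul> sibling before them.
$query = '//div[@class="product-details"]/p[not(preceding-sibling::ul)]';

$paragraphs = [];
foreach ($xpath->query($query) as $p) {
    $paragraphs[] = trim($p->textContent);
}

echo implode(' ', $paragraphs), PHP_EOL;
```

Unlike the loop, this stops only at a ul, not at any non-p tag, which is closer to what the question asked for.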

Add attributes to outer tags of html fragments

I'm trying to add attributes to the outer tags of HTML code fragments. I prepared some code, but it behaves strangely.
The test string has two outer tags: a div and a paragraph. But only the div gets the new attribute,
and the paragraph is moved into the div. What is wrong with the code?
Thanks
https://ideone.com/6Fu2zy
<?php
$html = '
<div>
<a>
<h1>Article 02</h1>
</a>
<img src="abc.jpg">
</div>
<p>
<span>dsaf</span>
</p>';
$dom = new DOMDocument();
@$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$x = new DOMXPath($dom);
foreach ($x->query("/*") as $node) {
$node->setAttribute("style", "xxxx");
}
$newHtml = $dom->saveHtml();
echo $newHtml;
edit:
So I could put the nodes inside <root> tags and then add the attributes. But I did not know how to do that, so I simply left the outer <html> and <body> tags in place.
Adding the attributes succeeded, but then I did not know how to remove the outer <html> and <body> tags from the result.
I tried the same approach as before, but it did not work.
https://ideone.com/6Fu2zy
<?php
$html = '
<div>
<a>
<h1>Article 02</h1>
</a>
<img src="abc.jpg">
</div>
<p>
<span>dsaf</span>
</p>';
$dom = new DOMDocument();
@$dom->loadHTML($html, LIBXML_HTML_NODEFDTD);
$x = new DOMXPath($dom);
foreach ($x->query("/html/body/*") as $node) {
$node->setAttribute("style", "xxxx");
}
$newHtml = @$dom->saveHtml();
@$dom->loadHTML($newHtml, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$newHtml2 = @$dom->saveHtml();
echo $newHtml2;

The problem is that your HTML has no single root element, so with LIBXML_HTML_NOIMPLIED DOMDocument turns the first element (<div>) into a wrapper for all the other nodes.
Your HTML:
<div>
<a><h1>Article 02</h1></a>
<img src="abc.jpg">
</div>
<p><span>dsaf</span></p>
once loaded by DOMDocument, becomes:
<div>
<a><h1>Article 02</h1></a>
<img src="abc.jpg">
<p><span>dsaf</span></p>
</div>
Consequently, the /* pattern returns only one node.
Add a root element to your HTML:
<root>
<div>
<a><h1>Article 02</h1></a>
<img src="abc.jpg">
</div>
<p><span>dsaf</span></p>
</root>
then use this path:
/root/*
After the transformation, if you need to output only the inner HTML, DOMDocument unfortunately has no built-in way to do that. You can do something like this:
$innerHTML = "";
foreach( $dom->getElementsByTagName( 'root' )->item(0)->childNodes as $child )
{
$innerHTML .= $dom->saveHTML( $child );
}
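
Putting the pieces together, a minimal end-to-end sketch could look like this. It assumes the fragment from the question, a temporary wrapper named root as suggested above, and libxml_use_internal_errors() to keep the parser quiet about the non-HTML tag; it uses documentElement instead of getElementsByTagName() to reach the wrapper.

```php
<?php
// Fragment from the question, wrapped in a temporary <root> element.
$fragment = '<div><a><h1>Article 02</h1></a><img src="abc.jpg"></div><p><span>dsaf</span></p>';

libxml_use_internal_errors(true); // <root> is not an HTML tag; collect the warning silently
$dom = new DOMDocument();
$dom->loadHTML('<root>' . $fragment . '</root>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$root = $dom->documentElement; // the temporary wrapper

$innerHTML = '';
foreach ($root->childNodes as $child) {
    // Only element nodes can carry attributes; text nodes are copied as they are.
    if ($child instanceof DOMElement) {
        $child->setAttribute('style', 'xxxx');
    }
    // Serialize each child on its own so the wrapper is left out.
    $innerHTML .= $dom->saveHTML($child);
}

echo $innerHTML, PHP_EOL;
```

Both outer tags should now carry the style attribute, and the wrapper itself never appears in $innerHTML.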

php - Simple HTML dom - elements between other elements

I'm trying to write a PHP script to crawl a website and store some elements in a database.
Here is my problem: a web page is written like this:
<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<p class="one_class"> Some different text </p>
<p> Some other interesting text </p>
<h2>The title 3</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
I want to get only the h2 and p with interesting text, not the p class="one_class".
I tried this php code :
<?php
$numberP = 0;
foreach($html->find('p') as $p)
{
$pIsOneClass = PIsOneClass($html, $p);
if($pIsOneClass == false)
{
echo $p->outertext;
$h2 = $html->find("h2", $numberP);
echo $h2->outertext;
$numberP++;
}
}
?>
the function PIsOneClass($html, $p) is :
<?php
function PIsOneClass($html, $p)
{
foreach($html->find("p.one_class") as $p_one_class)
{
if($p == $p_one_class)
{
return true;
}
}
return false;
}
?>
It doesn't work. I understand why, but I don't know how to fix it.
How can I say "I want every p without a class that sits between two h2 elements"?
Thanks a lot!

This task is easier with XPath, since you're scraping more than one element and you want to keep the source in order. You can use PHP's DOM library, which includes DOMXPath, to find and filter the elements you want:
$html = '<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<p class="one_class"> Some different text </p>
<p> Some other interesting text </p>
<h2>The title 3</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>';
# create a new DOM document and load the html
$dom = new DOMDocument;
$dom->loadHTML($html);
# create a new DOMXPath object
$xp = new DOMXPath($dom);
# search for all h2 elements and all p elements that do not have the class 'one_class'
$interest = $xp->query('//h2 | //p[not(@class="one_class")]');
# iterate through the list of search results (h2 and p elements), printing out node
# names and values
foreach ($interest as $i) {
echo "node " . $i->nodeName . ", value: " . $i->nodeValue . PHP_EOL;
}
Output:
node h2, value: The title 1
node p, value: Some interesting text
node h2, value: The title 2
node p, value: Some interesting text
node p, value: Some other interesting text
node h2, value: The title 3
node p, value: Some interesting text
As you can see, the source text stays in order, and it's easy to eliminate the nodes you don't want.
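
If you also need to know which paragraphs belong to which title, the same query can be used to group them. The following is a rough sketch, assuming a shortened copy of the sample page and using not(@class) to match "p without class" literally; $sections is just an illustrative name.

```php
<?php
// Shortened copy of the sample page from the question.
$html = '<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<p class="one_class"> Some different text </p>
<p> Some other interesting text </p>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Results come back in document order, so every <p> belongs to the
// most recent <h2> seen before it.
$sections = [];
$current = null;
foreach ($xp->query('//h2 | //p[not(@class)]') as $node) {
    $text = trim($node->nodeValue);
    if ($node->nodeName === 'h2') {
        $current = $text;
        $sections[$current] = [];
    } elseif ($current !== null) {
        $sections[$current][] = $text;
    }
}

print_r($sections);
```

Keying the array by title text is a simplification; if two titles can be identical, a numeric index per h2 would be safer.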

From the simpleHTML dom manual
[attribute=value]
Matches elements that have the specified attribute with a certain value.
or
[!attribute]
Matches elements that don't have the specified attribute.

Retrieve a text node with Simple HTML DOM Parser

I'm quite new to Simple HTML DOM Parser. I want to get a child element from the following HTML:
<div class="article">
<div style="text-align:justify">
<img src="image.jpg" title="image">
<br>
<br>
"Text to grab"
<div>......</div>
<br></br>
................
................
</div>
</div>
I'm trying to get the text "Text to grab"
So far I've tried the following query:
$html->find('div[class=article] div')->children(3);
But it's not working. Any idea how to solve this ?

You don't need simple_html_dom here. It can be done with DOMDocument and DOMXPath. Both come with PHP's bundled DOM extension.
Example:
// your sample data
$html = <<<EOF
<div class="article">
<div style="text-align:justify">
<img src="image.jpg" title="image">
<br>
<br>
"Text to grab"
<div>......</div>
<br></br>
................
................
</div>
</div>
EOF;
// create a document from the above snippet
// if you are loading from a remote url use:
// $doc->loadHTMLFile($url);
$doc = new DOMDocument();
$doc->loadHTML($html);
// initialize an XPath selector
$selector = new DOMXPath($doc);
// get the text node (text in XML/HTML is also represented as nodes)
$query = '//div[@class="article"]/div/br[2]/following-sibling::text()[1]';
$textToGrab = $selector->query($query)->item(0);
// remove newlines on start and end using trim() and output the text
echo trim($textToGrab->nodeValue);
Output:
"Text to grab"

If it's always in the same place you can do:
$html->find('.article text', 4);
