I have a markup HTML as below:
<body>
<div>......</div>
............
<div class="entry-content">
<div class="code1 code2">(ads.....);</div>
<p><img src="https://www..."></img></p>
<h2> title </h2>
<div class="code1-block code2">(ads.....);</div>
<div class="data1 dta-ta1">
<ul><li><p> text</p></li>
<li><span> text2 </span></li>
<li><span> text3 </span></li>
<div class="codex1 code-block"><span>(ads ....); </span></div>
<li><span> text4 </span></li>
<div class="codex1 code-block"><span>(ads ....); </span></div>
</ul>
</div>
<div class="codex2-block code2">(ads.....);</div>
<div class="data2-entry dta-ta2">
<p>
<span> text5</span>
</p>
<p> text6 </p>
<p> text7 </p
<div class="codex1 code-block"><span>(ads ....); </span></div>
<li><span> text8 </span></li>
<div class="codex1 code-block"><span>(ads ....); </span></div>
</div>
</div>
</body>
I want to go into the div with class="entry-content" and get all the text from its child nodes, excluding child nodes whose class contains "code1", "code2", "codex1", or "codex2".
My code below just goes to the div and gets all the text from its child nodes. However, I cannot exclude the text of the child nodes with the code1 and code2 classes. I'd appreciate your help. Thanks.
$classname = 'entry-content';
$a = new DOMXPath($dom);
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]";
$list = $a->query($query);
$bodyContent = '';
if ($list->length > 0) {
    foreach ($list as $element) {
        $nodes = $element->childNodes;
        // Iterate over the child nodes, not over the element itself.
        foreach ($nodes as $node) {
            $bodytext = trim(preg_replace('/[\r\n]+/', ' ', $node->nodeValue));
            $bodyContent .= '<p>' . $bodytext . '</p>';
        }
    }
}
My expected output:
https://www...
title
text2
text3
text4
text5
text6
text7
text8
Your input document is not well-formed, a > is missing for </p, and one div is not closed properly. With the input document fixed, a working path expression is
XPath expression
//div[@class='entry-content']//text()[not(ancestor::div/@class[contains(., 'code')])][normalize-space()]
It selects all text nodes, but only if they do not have an ancestor div element that has a class attribute whose value contains "code", and also, the text nodes selected cannot be whitespace-only.
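As a rough illustration of how that expression could be evaluated from PHP, here is a minimal sketch. The markup is a cut-down, well-formed stand-in for the question's document (an assumption, not the original page), and the variable names are only illustrative. Note that contains(., 'code') is a substring test, so an unrelated class such as "barcode" would be excluded as well.

```php
<?php
// Cut-down, well-formed stand-in for the question's markup (illustrative only).
$html = <<<'HTML'
<div class="entry-content">
  <div class="code1 code2">(ads);</div>
  <h2> title </h2>
  <div class="data1 dta-ta1">
    <ul><li><span> text2 </span></li><li><span> text3 </span></li></ul>
    <div class="codex1 code-block"><span>(ads); </span></div>
  </div>
</div>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Text nodes under entry-content that have no ancestor div whose class
// contains "code", and that are not whitespace-only.
$query = "//div[@class='entry-content']//text()"
       . "[not(ancestor::div/@class[contains(., 'code')])]"
       . "[normalize-space()]";

$texts = [];
foreach ($xpath->query($query) as $textNode) {
    $texts[] = trim($textNode->nodeValue);
}
print_r($texts); // title, text2, text3
```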
Output
Individual results are separated by ------:
title
-----------------------
text
-----------------------
text2
-----------------------
text3
-----------------------
text4
-----------------------
text5
-----------------------
text6
-----------------------
text7
-----------------------
text8
Update
I tried your answer. It works, but I still need the source from the img tag. How can I get it?
It's possible to also select the src attribute of an img element, but this would make the XPath expression even more complicated. You should just add another line of PHP to evaluate a separate path expression, such as:
//div[@class='entry-content']/p/img/@src
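A minimal sketch of that extra query in PHP might look like the following. The image URL is a placeholder and the markup is reduced to the relevant part, so treat both as assumptions. The attribute is addressed as @src, since src is the HTML attribute name on img.

```php
<?php
// Reduced markup with a placeholder image URL (assumption for illustration).
$html = '<div class="entry-content">'
      . '<p><img src="https://example.com/a.png"></p><h2>title</h2></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// The query returns attribute nodes; nodeValue holds the attribute's value.
$sources = [];
foreach ($xpath->query("//div[@class='entry-content']/p/img/@src") as $attr) {
    $sources[] = $attr->nodeValue;
}
print_r($sources);
```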
Update 2
While I absolutely do not recommend using this expression (because it obfuscates your code), here is how to combine both expressions into a single one with a union operator:
//div[@class='entry-content']//text()[not(ancestor::div/@class[contains(., 'code')])][normalize-space()] | //div[@class='entry-content']//p/img/@src
Related
I have the following HTML structure:
<a>
<div>
<div>
<span>
text node 1<br>
text node 2 <!--//comments-->
</span>
</div>
</div>
</a>
With the following query I get the second text node, but how can I get that node without the comment?
$spanx = $xpath->query('//a/div/div/span/text()[2]');
$span = $spanx->item($l)->nodeValue;
echo "<td>".$span."</td></tr>";
I get this result:
text node 2 //comments
What I want is:
text node 2
I've tested the following on my localhost. I've created the file named DOM_with_comment.html containing:
<a>
<div>
<div>
<span>
text node 1<br>
text node 2 <!--//comments-->
</span>
</div>
</div>
</a>
When I run:
<?php
$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->preserveWhiteSpace = false;
$doc->loadHTMLFile('DOM_with_comment.html');
$xpath = new DOMXPath($doc);
echo "<pre>";
foreach ($xpath->query('//a/div/div/span/text()') as $item) {
    var_dump($item->nodeValue);
}
The output is:
string(29) "
text node 1"
string(31) "
text node 2 "
string(14) "
"
So, if you take the first result ([0]) of your XPath query and print its trim()ed nodeValue property with var_export(), you can see that the targeted substring carries neither the comment nor any surrounding whitespace.
var_export(trim($xpath->query('//a/div/div/span/text()[2]')[0]->nodeValue));
// outputs: 'text node 2'
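To make the distinction explicit: in XPath, text() matches only text nodes and comment() matches only comment nodes, so the comment never ends up in a text node's value. The following sketch, using a one-line version of the same markup (an assumption for brevity), collects both kinds separately.

```php
<?php
// One-line version of the markup from the question (whitespace removed for brevity).
$html = '<a><div><div><span>text node 1<br>text node 2 <!--//comments--></span></div></div></a>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// text() selects text nodes only.
$texts = [];
foreach ($xpath->query('//span/text()') as $node) {
    $texts[] = trim($node->nodeValue);
}

// comment() selects comment nodes only.
$comments = [];
foreach ($xpath->query('//span/comment()') as $node) {
    $comments[] = $node->nodeValue;
}

var_dump($texts, $comments);
```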
p.s. If your input is not coming from a file, but a variable, this works the same way:
$html = <<<HTML
<a>
<div>
<div>
<span>
text node 1<br>
text node 2 <!--//comments-->
</span>
</div>
</div>
</a>
HTML;
$doc->loadHTML($html);
I'm grabbing all the paragraph tags using the PHP Simple HTML DOM Parser with the following code:
// Product Description
$html = file_get_html('http://domain.local/index.html');
$contents = strip_tags($html->find('div[class=product-details] p'));
How can I grab only the paragraphs that come before the first ul?
<p>
Paragraph 1
</p>
<p>
Paragraph 2
</p>
<p>
Paragraph 3
</p>
<ul>
<li>
List item 1
</li>
<li>
List item 2
</li>
</ul>
<blockquote>
Quote 1
</blockquote>
<blockquote>
Quote 2
</blockquote>
<blockquote>
Quote 3
</blockquote>
<p>
Paragraph 4
</p>
<p>
Paragraph 5
</p>
You can use the following code for the requirement described:
<?php
// Assumes the Simple HTML DOM library (simple_html_dom.php) has already been included.
$html = file_get_html('http://domain.local/index.html');
$detailTags = $html->find('div[class=product-details] *');
$contents = "";
foreach ($detailTags as $detailTag) {
    // Stop at the first <ul>; everything before it has already been collected.
    if ($detailTag->tag === 'ul') {
        break;
    }
    $contents .= strip_tags($detailTag);
}
// $contents now holds the text of the paragraphs before the first <ul>.
echo $contents;
?>
OUTPUT:-
Paragraph 1 Paragraph 2 Paragraph 3
EDIT: Nandal's code will work for you because it will not force you to change the library.
If you don't want to depend on a third-party library, you can use PHP's DOMDocument class, for which the DOM extension must be enabled.
The code below prints the paragraphs until it hits any other tag:
<?php
$html = new DOMDocument();
$html->loadHTML("<html><body><p>Paragraph 1</p><p> Paragraph 2</p><p> Paragraph 3</p><ul> <li> List item 1 </li> <li> List item 2 </li> </ul><blockquote> Quote 1</blockquote><blockquote> Quote 2</blockquote><blockquote> Quote 3</blockquote><p> Paragraph 4</p><p> Paragraph 5</p></body></html>");
$xpath = new DOMXPath($html);
$nodes = $xpath->query('/html/body//*');
foreach ($nodes as $node) {
    // Stop at the first element that is not a <p>.
    if ($node->nodeName != "p") {
        break;
    }
    print $node->nodeValue . "\n";
}
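As an alternative sketch (not what the loop above does, which stops at any non-p element), the cut-off at the first ul can also be written as a single XPath predicate: select the p elements that have no ul among their preceding siblings. The wrapper div and its class are taken from the question; the shortened markup is an assumption.

```php
<?php
// Shortened version of the question's markup (assumption for illustration).
$html = '<div class="product-details">'
      . '<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>'
      . '<ul><li>List item 1</li></ul><blockquote>Quote 1</blockquote>'
      . '<p>Paragraph 4</p></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Keep only the paragraphs that appear before the first <ul> sibling.
$query = "//div[@class='product-details']/p[not(preceding-sibling::ul)]";

$paragraphs = [];
foreach ($xpath->query($query) as $p) {
    $paragraphs[] = trim($p->nodeValue);
}
print_r($paragraphs); // Paragraph 1, Paragraph 2, Paragraph 3
```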
Example file:
<p>
some content
<sup>3</sup>
some content</p>
<p>
some content
<sup>4</sup>
some content<sup>5</sup></p>
<div class="footnote">
<li id="fn3">
<p>
content3
↩
</p>
</li>
<li id="fn4">
<p>
content4
↩
</p>
</li>
<li id="fn5">
<p>
content5
↩
</p>
</li>
</div>
I need to place each referenced footnote at the bottom of the paragraph that references it. That is, if the content of a p tag has an a element with class fn-ref (one or many a tags per paragraph), I need to place the related footnote at the bottom of that paragraph. The related footnote content can be found in the div class="footnotes".
I should search every p tag for a class="fn-ref". If I find one, I should create a div class="footnote" and place the related footnote content in it. If there is more than one reference, all the footnote contents should be placed one after another within that same div.
Expected output:
<p>
some content
<sup>3</sup>
some content</p>
<div class=footnote>
<p>
<span class="label-fn">
3
</span>
content3
</p>
</div>
<p>
some content
<sup>4</sup>
some content<sup>5</sup></p>
<div class=footnote>
<p>
<span class="label-fn">
4
</span>
content4
</p>
<p>
<span class="label-fn">
5
</span>
content5
</p>
</div>
I think I should try something like parent().clone().html() and then add content before and after, but I don't know where to start, as I am new to the DOM parser classes.
Tried so far:
$dom = new DOMDocument;
$dom->loadHTMLFile("test.html", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$pElement = $xp->query('//*[contains(@class, "fn-ref")]');
foreach($pElement as $pNode) {
if ($pNode->nodeName[0] === 'p') {
//??
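One possible way to approach this with DOMDocument is sketched below. It rests on several assumptions that the example file does not confirm: that each reference is an a element with class fn-ref whose href points at the footnote's id (for example href="#fn3"), that the footnote list lives in a div with class footnotes, and that the link text is the label to print. Treat it as a starting point rather than a finished solution.

```php
<?php
// Hypothetical markup: the fn-ref links and their href values are assumptions.
$html = <<<'HTML'
<div id="body">
<p>some content <sup><a class="fn-ref" href="#fn3">3</a></sup> some content</p>
<p>some content <sup><a class="fn-ref" href="#fn4">4</a></sup> some content <sup><a class="fn-ref" href="#fn5">5</a></sup></p>
</div>
<div class="footnotes"><ol>
<li id="fn3"><p>content3</p></li>
<li id="fn4"><p>content4</p></li>
<li id="fn5"><p>content5</p></li>
</ol></div>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Every paragraph that contains at least one footnote reference.
foreach ($xp->query('//p[.//a[contains(@class, "fn-ref")]]') as $p) {
    $box = $dom->createElement('div');
    $box->setAttribute('class', 'footnote');

    foreach ($xp->query('.//a[contains(@class, "fn-ref")]', $p) as $ref) {
        // "#fn3" -> "fn3"; assumes simple ids without quotes.
        $id = ltrim($ref->getAttribute('href'), '#');
        $source = $xp->query('//li[@id="' . $id . '"]/p')->item(0);
        if ($source === null) {
            continue; // no matching footnote, skip this reference
        }
        $label = $dom->createElement('span');
        $label->setAttribute('class', 'label-fn');
        $label->appendChild($dom->createTextNode(trim($ref->textContent)));

        $note = $dom->createElement('p');
        $note->appendChild($label);
        $note->appendChild($dom->createTextNode(' ' . trim($source->textContent)));
        $box->appendChild($note);
    }
    // Insert the footnote box directly after the paragraph.
    $p->parentNode->insertBefore($box, $p->nextSibling);
}

echo $dom->saveHTML();
```

The original footnote list is left in place here; removing it afterwards would be a separate removeChild() call on that div.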
I'm trying to write a PHP script to crawl a website and store some elements in a database.
Here is my problem. A web page is written like this:
<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<p class="one_class"> Some different text </p>
<p> Some other interesting text </p>
<h2>The title 3</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
I want to get only the h2 elements and the p elements with interesting text, not the p elements with class="one_class".
I tried this PHP code:
<?php
$numberP = 0;
foreach ($html->find('p') as $p) {
    $pIsOneClass = PIsOneClass($html, $p);
    if ($pIsOneClass == false) {
        echo $p->outertext;
        $h2 = $html->find("h2", $numberP);
        echo $h2->outertext;
        $numberP++;
    }
}
?>
The function PIsOneClass($html, $p) is:
<?php
function PIsOneClass($html, $p)
{
    foreach ($html->find("p.one_class") as $p_one_class) {
        if ($p == $p_one_class) {
            return true;
        }
    }
    return false;
}
?>
It doesn't work. I understand why, but I don't know how to fix it.
How can I say "I want every p without a class that sits between two h2 elements"?
Thanks a lot!
This task is easier with XPath, since you're scraping more than one element and you want to keep the source in order. You can use PHP's DOM library, which includes DOMXPath, to find and filter the elements you want:
$html = '<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<p class="one_class"> Some different text </p>
<p> Some other interesting text </p>
<h2>The title 3</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>';
# create a new DOM document and load the html
$dom = new DOMDocument;
$dom->loadHTML($html);
# create a new DOMXPath object
$xp = new DOMXPath($dom);
# search for all h2 elements and all p elements that do not have the class 'one_class'
$interest = $xp->query('//h2 | //p[not(@class="one_class")]');
# iterate through the list of search results (h2 and p elements), printing out node
# names and values
foreach ($interest as $i) {
    echo "node " . $i->nodeName . ", value: " . $i->nodeValue . PHP_EOL;
}
Output:
node h2, value: The title 1
node p, value: Some interesting text
node h2, value: The title 2
node p, value: Some interesting text
node p, value: Some other interesting text
node h2, value: The title 3
node p, value: Some interesting text
As you can see, the source text stays in order, and it's easy to eliminate the nodes you don't want.
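Because the results come back in document order, they can also be folded into one entry per heading, which may be convenient before writing to a database. The sketch below is only one way to do that; the array layout and variable names are assumptions, and the markup is a shortened copy of the sample page.

```php
<?php
// Shortened copy of the sample page (two sections only).
$html = '<h2>The title 1</h2><p class="one_class"> Some text </p><p> Some interesting text </p>'
      . '<h2>The title 2</h2><p class="one_class"> Some text </p><p> Some interesting text </p>'
      . '<p class="one_class"> Some different text </p><p> Some other interesting text </p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$sections = [];   // heading text => list of paragraph texts
$current = null;  // heading the following paragraphs belong to
foreach ($xp->query('//h2 | //p[not(@class="one_class")]') as $node) {
    $text = trim($node->nodeValue);
    if ($node->nodeName === 'h2') {
        $current = $text;
        $sections[$current] = [];
    } elseif ($current !== null) {
        $sections[$current][] = $text;
    }
}
print_r($sections);
```

Using the heading text as the array key assumes headings are unique on the page; a list of heading/paragraph pairs would avoid that assumption.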
From the Simple HTML DOM manual:
[attribute=value]
Matches elements that have the specified attribute with a certain value.
or
[!attribute]
Matches elements that don't have the specified attribute.
I want to get all the text inside the <p> and <h3> tags in the following HTML:
<div class="bodyText">
<p>
<div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">
<div class="one">
<img src="url" alt="bar" class="img" width="80" height="60" />
</div>
<div class="two">
<h4 class="preTitle">QIEZ-Lieblinge</h4>
<h3 class="title"><a href="url" title="ABC" onclick="cmsTracking.trackClickOut({element:this, channel : 32333770, channelname : 'top_listen', content : 14832081, callTemplate : '_htmltagging.Text', action : 'click', mouseevent : event});">
Prominente Gastronomen </a></h3>
<span class="postTitle"></span>
<span class="district">Berlin</span> </div>
<div class="clear"></div>
</div>
I want this TEXT</p>
<h3>I want this TEXT</h3>
<p>I want this TEXT</p>
<p>
<div class="inlineImage alignLeft">
<div class="medium">
<img src="http://images03.qiez.de/Restaurant+%C3%96_QIEZ.jpg/280x210/0/167.231.886/167.231.798" width="280" height="210" alt="Schöne Lage: das Restaurant Ø. (c)QIEZ"/>
<span class="caption">
Schöne Lage: das Restaurant Ø. (c)QIEZ </span>
</div>
</div>I want this TEXT</p>
<p>I want this TEXT</p>
<p>I want this TEXT<br /> </p>
<blockquote><img src="url" alt="" width="68" height="68" />
"Eigentlich nur drei Worte: Ich komme wieder."<span class="author">Tina Gerstung</span></blockquote>
<div class="clear"></div>
</div>
I want every "I want this TEXT" string. I used this XPath query:
//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']
but it does not return the text when the <p> tag is followed by another tag.
It looks like you have div elements contained within your p element, which is not valid and is messing things up. If you use a var_dump in the loop you can see that it does actually pick up the node, but the nodeValue is empty.
A quick and dirty fix to your html would be to wrap the first div that is contained in the p element in a span.
<span><div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">...</div></span>
A better fix would be to put the div element outside the paragraph.
If you use the dirty workaround you will need to change your query like so:
$xpath->query("//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']/text()");
If you do not have control of the source HTML, you can make a copy of the HTML and remove the offending divs:
$nodes = $xpath->query("//div[contains(@class,'articleBox')]");
$node = $nodes->item(0);
$node->parentNode->removeChild($node);
It might be easier to work with simple_html_dom. Maybe you can try this:
include('simple_html_dom.php');
$dom = new simple_html_dom();
$dom->load($html);
foreach ($dom->find("div[class=bodyText]") as $parent) {
    foreach ($parent->children() as $child) {
        if ($child->tag == 'p' || $child->tag == 'h3') {
            // remove the inner text of divs contained within this p or h3 element
            foreach ($child->find('div') as $e) {
                $e->innertext = '';
            }
            echo $child->plaintext . '<br>';
        }
    }
}
This is mixed content. Depending on what defines the position of the element, you can use a number of factors. In this case, simply selecting all the text nodes will probably be sufficient:
//div[contains(@class, 'bodyText')]/(p | h3)/text()
If the union operator within a location path is not allowed in your processor, then you can use your syntax as before or, a little simpler in my opinion:
//div[contains(@class, 'bodyText')]/*[local-name() = ('p', 'h3')]/text()
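One caveat for PHP users: a parenthesised step such as (p | h3) and a sequence comparison such as = ('p', 'h3') are XPath 2.0 syntax, while PHP's DOMXPath is built on libxml2 and supports XPath 1.0 only. An XPath 1.0 form of the same idea uses the self axis, as in this sketch. The markup is a simplified, valid stand-in (no div inside p), which is an assumption.

```php
<?php
// Simplified, valid stand-in for the question's markup.
$html = '<div class="bodyText"><p>first</p><h3>second</h3>'
      . '<p>third <span>nested</span></p><blockquote>quote</blockquote></div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// XPath 1.0: direct text children of p or h3 elements, skipping whitespace-only nodes.
$query = "//div[contains(@class, 'bodyText')]/*[self::p or self::h3]"
       . "/text()[normalize-space()]";

$texts = [];
foreach ($xpath->query($query) as $node) {
    $texts[] = trim($node->nodeValue);
}
print_r($texts); // first, second, third
```

Because /text() selects only direct text children, the text inside the nested span is not returned.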