Get contents of div up to a certain point - php

I'm grabbing all the paragraph tags using the PHP Simple HTML DOM Parser with the following code:
// Product Description
$html = file_get_html('http://domain.local/index.html');
$contents = strip_tags($html->find('div[class=product-details] p'));
How can I say grab X amount of paragraphs until it hits the first ul?
<p>
Paragraph 1
</p>
<p>
Paragraph 2
</p>
<p>
Paragraph 3
</p>
<ul>
<li>
List item 1
</li>
<li>
List item 2
</li>
</ul>
<blockquote>
Quote 1
</blockquote>
<blockquote>
Quote 2
</blockquote>
<blockquote>
Quote 3
</blockquote>
<p>
Paragraph 4
</p>
<p>
Paragraph 5
</p>

The following code collects the text of each element and stops at the first <ul>:
<?php
$html = file_get_html('http://domain.local/index.html');
$detailTags = $html->find('div[class=product-details] *');
$contents = "";
foreach ($detailTags as $detailTag) {
    // Stop as soon as the first <ul> is reached; everything before it is kept.
    if ($detailTag->tag === 'ul') {
        break;
    }
    $contents .= strip_tags($detailTag);
}
// contents will contain the output required.
echo $contents;
?>
Output:
Paragraph 1 Paragraph 2 Paragraph 3

EDIT: Nandal's code will work for you because it does not force you to change the library.
If you don't want to depend on a third-party library, you can use PHP's DOMDocument instead; the DOM extension must be enabled for it.
The code below prints the paragraphs until it hits any other tag:
<?php
$html = new DOMDocument();
$html->loadHTML("<html><body><p>Paragraph 1</p><p> Paragraph 2</p><p> Paragraph 3</p><ul> <li> List item 1 </li> <li> List item 2 </li> </ul><blockquote> Quote 1</blockquote><blockquote> Quote 2</blockquote><blockquote> Quote 3</blockquote><p> Paragraph 4</p><p> Paragraph 5</p></body></html>");
$xpath = new DOMXPath($html);
$nodes = $xpath->query('/html/body//*');
foreach ($nodes as $node) {
    if ($node->nodeName !== "p") {
        break;
    }
    print $node->nodeValue . "\n";
}
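The same "paragraphs before the first ul" rule can also be pushed into the XPath expression itself, so no loop-and-break is needed. This is only a sketch: the helper name paragraphsBeforeFirstList and the sample markup are made up for illustration, and it assumes the paragraphs and the ul are direct children of the div.

```php
<?php
// Hypothetical helper (not from the answers above): returns the trimmed text
// of every <p> child of the given div that has no <ul> sibling before it.
function paragraphsBeforeFirstList(string $html, string $class): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $query = sprintf('//div[@class="%s"]/p[not(preceding-sibling::ul)]', $class);

    $texts = [];
    foreach ($xpath->query($query) as $p) {
        $texts[] = trim($p->textContent);
    }
    return $texts;
}

// Assumed sample input, shortened from the question.
$sample = '<div class="product-details"><p>Paragraph 1</p><p>Paragraph 2</p>'
        . '<ul><li>List item 1</li></ul><blockquote>Quote 1</blockquote>'
        . '<p>Paragraph 3</p></div>';

print_r(paragraphsBeforeFirstList($sample, 'product-details'));
```

The predicate not(preceding-sibling::ul) drops every paragraph that comes after a ul, which is why "Paragraph 3" in the sample is not returned.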

Related

Complex Xpath get all values excluding some specific class attributes

I have a markup HTML as below:
<body>
<div>......</div>
............
<div class="entry-content">
<div class="code1 code2">(ads.....);</div>
<p><img src="https://www..."></img></p>
<h2> title </h2>
<div class="code1-block code2">(ads.....);</div>
<div class="data1 dta-ta1">
<ul><li><p> text</p></li>
<li><span> text2 </span></li>
<li><span> text3 </span></li>
<div class="codex1 code-block"><span>(ads ....); </span></div>
<li><span> text4 </span></li>
<div class="codex1 code-block"><span>(ads ....); </span></div>
</ul>
</div>
<div class="codex2-block code2">(ads.....);</div>
<div class="data2-entry dta-ta2">
<p>
<span> text5</span>
</p>
<p> text6 </p>
<p> text7 </p
<div class="codex1 code-block"><span>(ads ....); </span></div>
<li><span> text8 </span></li>
<div class="codex1 code-block"><span>(ads ....); </span></div>
</div>
</div>
</body>
I'm trying to go into the div with class="entry-content" and get all the text from its child nodes, excluding child nodes whose class is "code1", "code2", "codex1" or "codex2".
My code below just goes to the div and gets all the text from the child nodes; I cannot remove the text of the child nodes with code1 and code2. I'd appreciate your help. Thanks.
$classname='entry-content';
$a = new DOMXPath($dom);
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]";
$list = $a->query($query);
if ($list->length > 0) {
foreach ($list as $element) {
$nodes = $element->childNodes;
foreach ($element as $node) {
$bodytext = trim(preg_replace('/[\r\n]+/', ' ', $node->nodeValue));
$bodyContent .= '<p>' . $bodytext . '</p>';
}
}
}
My expected output:
https://www...
title
text2
text3
text4
text5
text6
text7
text8
Your input document is not well-formed, a > is missing for </p, and one div is not closed properly. With the input document fixed, a working path expression is
XPath expression
//div[@class='entry-content']//text()[not(ancestor::div/@class[contains(., 'code')])][normalize-space()]
It selects all text nodes, but only if they do not have an ancestor div element that has a class attribute whose value contains "code", and also, the text nodes selected cannot be whitespace-only.
Output
Individual results are separated by ------:
title
-----------------------
text
-----------------------
text2
-----------------------
text3
-----------------------
text4
-----------------------
text5
-----------------------
text6
-----------------------
text7
-----------------------
text8
Update
I tried with your answer. It works however I still need a source from img tag. How can I get it?
It's possible to also select the src attribute of an img element, but this would make the XPath expression even more complicated. You should just add another line of PHP to evaluate a separate path expression, such as:
//div[@class='entry-content']/p/img/@src
Update 2
While I absolutely do not recommend using this expression (because it obfuscates your code), here is how to combine both expressions into a single one with a union operator:
//div[@class='entry-content']//text()[not(ancestor::div/@class[contains(., 'code')])][normalize-space()] | //div[@class='entry-content']//p/img/@src
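For reference, here is a minimal sketch of how the two separate expressions could be evaluated from PHP with DOMXPath. The function name extractEntryContent, the shortened markup and the example.test URL are assumptions made for illustration; they are not part of the question.

```php
<?php
// Hypothetical helper: collects the visible texts and the image sources
// inside div.entry-content, skipping any div whose class contains "code".
function extractEntryContent(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    $textQuery = "//div[@class='entry-content']//text()"
               . "[not(ancestor::div/@class[contains(., 'code')])][normalize-space()]";
    $srcQuery  = "//div[@class='entry-content']//p/img/@src";

    $texts = [];
    foreach ($xpath->query($textQuery) as $node) {
        $texts[] = trim($node->nodeValue);
    }
    $sources = [];
    foreach ($xpath->query($srcQuery) as $attr) {
        $sources[] = $attr->nodeValue;
    }
    return ['texts' => $texts, 'sources' => $sources];
}

// Assumed, well-formed sample input loosely based on the question's markup.
$sampleHtml = '<div class="entry-content">'
    . '<div class="code1 code2">(ads);</div>'
    . '<p><img src="https://example.test/a.png"></p>'
    . '<h2> title </h2>'
    . '<div class="data1 dta-ta1"><ul><li><span> text2 </span></li></ul></div>'
    . '<div class="codex1 code-block"><span>(ads);</span></div>'
    . '<p> text6 </p>'
    . '</div>';

print_r(extractEntryContent($sampleHtml));
```

Keeping the two queries apart means the texts and the image sources arrive in separate lists, which is usually easier to handle than the mixed node list produced by the union expression.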

How exclude html comments from text node xpath?

I have the following HTML structure:
<a>
<div>
<div>
<span>
text node 1<br>
text node 2 <!--//comments-->
</span>
</div>
</div>
</a>
With the following query I get the second text node, but how do I get that node without the comment?
$spanx = $xpath->query('//a/div/div/span/text()[2]');
$span = $spanx->item($l)->nodeValue;
echo "<td>".$span."</td></tr>";
I get this result:
text node 2 //comments
What I want is:
text node 2
I've tested the following on my localhost. I've created the file named DOM_with_comment.html containing:
<a>
<div>
<div>
<span>
text node 1<br>
text node 2 <!--//comments-->
</span>
</div>
</div>
</a>
When I run:
<?php
$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->preserveWhiteSpace = false;
$doc->loadHTMLFile('DOM_with_comment.html');
$xpath = new DOMXPath($doc);
echo "<pre>";
foreach ($xpath->query('//a/div/div/span/text()') as $item) {
var_dump($item->nodeValue);
}
The output is:
string(29) "
text node 1"
string(31) "
text node 2 "
string(14) "
"
So, if you take the first match ([0]) of your XPath query, trim() its nodeValue and display it with var_export(), you can see that the targeted substring carries neither the comment nor any surrounding whitespace.
var_export(trim($xpath->query('//a/div/div/span/text()[2]')[0]->nodeValue));
// outputs: 'text node 2'
p.s. If your input is not coming from a file, but a variable, this works the same way:
$html = <<<HTML
<a>
<div>
<div>
<span>
text node 1<br>
text node 2 <!--//comments-->
</span>
</div>
</div>
</a>
HTML;
$doc->loadHTML($html);
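
The reason this works is that a comment is a node of its own in the DOM, so text() never returns it; it can be selected separately with comment(). A small sketch to illustrate, using a shortened version of the markup (the variable names are only for this example):

```php
<?php
$doc = new DOMDocument();
libxml_use_internal_errors(true);
// Shortened sample: one span holding two text nodes, a <br> and a comment.
$doc->loadHTML('<div><span>text node 1<br>text node 2 <!--//comments--></span></div>');
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Second text node of the span; the comment is not part of it.
$secondText = trim($xpath->query('//span/text()[2]')->item(0)->nodeValue);

// The comment is a separate node, reachable through comment().
$commentText = $xpath->query('//span/comment()')->item(0)->nodeValue;

var_dump($secondText, $commentText);
```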

php - Simple HTML dom - elements between other elements

I'm trying to write a PHP script to crawl a website and store some elements in a database.
Here is my problem. A web page is written like this:
<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<p class="one_class"> Some different text </p>
<p> Some other interesting text </p>
<h2>The title 3</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
I want to get only the h2 elements and the p elements with interesting text, not the p elements with class="one_class".
I tried this PHP code:
<?php
$numberP = 0;
foreach($html->find('p') as $p)
{
$pIsOneClass = PIsOneClass($html, $p);
if($pIsOneClass == false)
{
echo $p->outertext;
$h2 = $html->find("h2", $numberP);
echo $h2->outertext;
$numberP++;
}
}
?>
the function PIsOneClass($html, $p) is :
<?php
function PIsOneClass($html, $p)
{
foreach($html->find("p.one_class") as $p_one_class)
{
if($p == $p_one_class)
{
return true;
}
}
return false;
}
?>
It doesn't work. I understand why, but I don't know how to fix it.
How can we say "I want every p without a class that sits between two h2 elements"?
Thanks a lot!
This task is easier with XPath, since you're scraping more than one element and you want to keep the source in order. You can use PHP's DOM library, which includes DOMXPath, to find and filter the elements you want:
$html = '<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<p class="one_class"> Some different text </p>
<p> Some other interesting text </p>
<h2>The title 3</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>';
# create a new DOM document and load the html
$dom = new DOMDocument;
$dom->loadHTML($html);
# create a new DOMXPath object
$xp = new DOMXPath($dom);
# search for all h2 elements and all p elements that do not have the class 'one_class'
$interest = $xp->query('//h2 | //p[not(@class="one_class")]');
# iterate through the array of search results (h2 and p elements), printing out node
# names and values
foreach ($interest as $i) {
echo "node " . $i->nodeName . ", value: " . $i->nodeValue . PHP_EOL;
}
Output:
node h2, value: The title 1
node p, value: Some interesting text
node h2, value: The title 2
node p, value: Some interesting text
node p, value: Some other interesting text
node h2, value: The title 3
node p, value: Some interesting text
As you can see, the source text stays in order, and it's easy to eliminate the nodes you don't want.
From the Simple HTML DOM manual:
[attribute=value]
Matches elements that have the specified attribute with a certain value.
or
[!attribute]
Matches elements that don't have the specified attribute.

Adding an id attribute to paragraph elements

Suppose that a variable contains HTML markup like this:
<p> paragraph 1 </p>
<p> paragraph 2 </p>
...
How can I turn it to something like this:
<p id="1" data-pic="someStaticText"> paragraph 1 </p>
<p id="2" data-pic="someStaticText"> paragraph 2 </p>
...
Of course, it's not just composed of paragraph elements.
Well, I figured it out while doing some more research:
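$html_string = preg_replace_callback(
    // \b keeps the pattern from matching other tags that start with "p", such as <pre>
    '(<p\b(.*?)>)is',
    function ($m) {
        static $id = 0;
        $id++;
        // $m[1] holds any attributes the original tag already had
        return '<p id="p' . $id . '" data-pic="someStaticText"' . $m[1] . '>';
    },
    $html_string);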
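You could use a while loop in which each iteration replaces only the first <p> of the string. Note that str_replace() cannot do this: its fourth argument is a by-reference counter, not a limit. preg_replace() does accept a limit:
$id = 1;
while (strpos($html_string, '<p>') !== false) {
    // The final argument (1) limits the replacement to the first match.
    $html_string = preg_replace('/<p>/', '<p id="' . $id . '" data-pic="someStaticText">', $html_string, 1);
    $id++;
}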
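Since the variable is "not just composed of paragraph elements", a DOM-based variant may be more robust than string replacement. The following is only a sketch under a few assumptions: the function name addParagraphIds is invented here, the input is a plain ASCII or entity-encoded HTML fragment, and the exact whitespace of the serialized output may differ slightly from the input.

```php
<?php
// Hypothetical helper: numbers every <p> in an HTML fragment and adds a
// static data-pic attribute, leaving all other elements untouched.
function addParagraphIds(string $fragment, string $pic = 'someStaticText'): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    // Wrap the fragment so that its nodes end up as children of <body>.
    $doc->loadHTML('<html><body>' . $fragment . '</body></html>');
    libxml_clear_errors();

    $id = 0;
    foreach ($doc->getElementsByTagName('p') as $p) {
        $p->setAttribute('id', (string) ++$id);
        $p->setAttribute('data-pic', $pic);
    }

    // Serialize only the children of <body>, i.e. the original fragment.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo addParagraphIds('<p> paragraph 1 </p><h2>heading</h2><p> paragraph 2 </p>');
```

Unlike the regular expression, this approach never touches tags that merely start with the letter p, and it keeps working when a paragraph already has attributes.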

PHP Xpath - Parsing flat HTML structure

I am trying to parse some fairly flat HTML and group everything from one h1 tag to the next. For example, I have the following HTML:
<h1> Heading 1 </h1>
<p> Paragraph 1.1 </p>
<p> Paragraph 1.2 </p>
<p> Paragraph 1.3 </p>
<h1> Heading 2 </h1>
<p> Paragraph 2.1 </p>
<p> Paragraph 2.2 </p>
<h1> Heading 3 </h1>
<p> Paragraph 3.1 </p>
<p> Paragraph 3.2 </p>
<p> Paragraph 3.3 </p>
I basically want it to look like:
<div id='1'>
<h1> Heading 1 </h1>
<p> Paragraph 1.1 </p>
<p> Paragraph 1.2 </p>
<p> Paragraph 1.3 </p>
</div>
<div id='2'>
<h1> Heading 2 </h1>
<p> Paragraph 2.1 </p>
<p> Paragraph 2.2 </p>
</div>
<div id='3'>
<h1> Heading 3 </h1>
<p> Paragraph 3.1 </p>
<p> Paragraph 3.2 </p>
<p> Paragraph 3.3 </p>
</div>
It is probably not even worth posting the code I have written so far, as it just turned into a mess. Basically, I was attempting to run an XPath query for '//h1', create new div tags as parent nodes, copy each h1 DOM node into its div, and then loop over nextSibling until I hit another h1 tag. As mentioned, it got messy.
Could someone point me in a better direction here?
Iterate over all nodes that are on the same level (I wrapped them in a hint node called platau in my example). Whenever you run across an <h1>, insert a new div before it and keep a reference to that div.
Then, for the <h1> and every following node, if the reference exists, remove the node from its parent and append it as a child of the referenced div.
Example:
$doc = new DOMDocument;
$doc->loadXML($xml); // $xml holds the flat markup wrapped in the <platau> hint node
$xp = new DOMXPath($doc);
$current = NULL;
$id = 0;
foreach($xp->query('/platau/node()') as $i => $sort)
{
if (isset($sort->tagName) && $sort->tagName === 'h1')
{
$current = $doc->createElement('div');
$current->setAttribute('id', ++$id);
$current = $sort->parentNode->insertBefore($current, $sort);
}
if (!$current) continue;
$sort->parentNode->removeChild($sort);
$current->appendChild($sort);
}
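For completeness, here is a self-contained sketch of the same idea with sample input included. The function name groupByHeading and the wrapper element name root are assumptions chosen for this example, and the input is assumed to be well-formed XML.

```php
<?php
// Hypothetical helper: wraps each <h1> and the siblings that follow it
// (up to the next <h1>) in a numbered <div>.
function groupByHeading(string $xml): DOMDocument
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xp = new DOMXPath($doc);

    $current = null;
    $id = 0;
    // The XPath result is a snapshot, so moving nodes while iterating is safe.
    foreach ($xp->query('/root/*') as $node) {
        if ($node->nodeName === 'h1') {
            $current = $doc->createElement('div');
            $current->setAttribute('id', (string) ++$id);
            $node->parentNode->insertBefore($current, $node);
        }
        if ($current !== null) {
            // appendChild() moves an existing node to its new parent.
            $current->appendChild($node);
        }
    }
    return $doc;
}

// Assumed sample input: the flat markup wrapped in a <root> element.
$flat = '<root><h1>Heading 1</h1><p>Paragraph 1.1</p><p>Paragraph 1.2</p>'
      . '<h1>Heading 2</h1><p>Paragraph 2.1</p><p>Paragraph 2.2</p>'
      . '<h1>Heading 3</h1><p>Paragraph 3.1</p></root>';

echo groupByHeading($flat)->saveXML();
```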