I am trying to parse some fairly flat HTML and group everything from one h1 tag to the next. For example, I have the following HTML:
<h1> Heading 1 </h1>
<p> Paragraph 1.1 </p>
<p> Paragraph 1.2 </p>
<p> Paragraph 1.3 </p>
<h1> Heading 2 </h1>
<p> Paragraph 2.1 </p>
<p> Paragraph 2.2 </p>
<h1> Heading 3 </h1>
<p> Paragraph 3.1 </p>
<p> Paragraph 3.2 </p>
<p> Paragraph 3.3 </p>
I basically want it to look like:
<div id='1'>
<h1> Heading 1 </h1>
<p> Paragraph 1.1 </p>
<p> Paragraph 1.2 </p>
<p> Paragraph 1.3 </p>
</div>
<div id='2'>
<h1> Heading 2 </h1>
<p> Paragraph 2.1 </p>
<p> Paragraph 2.2 </p>
</div>
<div id='3'>
<h1> Heading 3 </h1>
<p> Paragraph 3.1 </p>
<p> Paragraph 3.2 </p>
<p> Paragraph 3.3 </p>
</div>
It is probably not worth posting the code I have so far, as it just turned into a mess. Basically, I was attempting to run an XPath query for '//h1', create new div tags as parent nodes, copy each h1 DOM node into its div, and then loop over nextSibling until I hit another h1 tag. As mentioned, it got messy.
Could someone point me in a better direction here?
Iterate over all nodes that are on the same level (I added a root node called platau in my example as a hint). Whenever you run across an <h1>, insert a new div before it and keep a reference to that div.
For the <h1> and every node after it, if the reference exists, remove the node from its parent and append it as a child of the referenced div.
Example:
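// $xml is assumed to hold the markup above wrapped in a single <platau> root element
$doc = new DOMDocument();
$doc->loadXML($xml);
$xp = new DOMXPath($doc);
$current = NULL;
$id = 0;
foreach ($xp->query('/platau/node()') as $i => $sort)
{
    if (isset($sort->tagName) && $sort->tagName === 'h1')
    {
        // start a new group: create the div and insert it before the heading
        $current = $doc->createElement('div');
        $current->setAttribute('id', ++$id);
        $current = $sort->parentNode->insertBefore($current, $sort);
    }
    if (!$current) continue; // skip nodes that precede the first h1
    // move the node into the current div
    $sort->parentNode->removeChild($sort);
    $current->appendChild($sort);
}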
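For reference, here is a minimal self-contained sketch of the same idea with sample input included. The function name groupByHeading and the shortened sample markup are only illustrative assumptions; the platau root element is the hint node from the answer above.

```php
<?php
// Hypothetical helper: wraps each <h1> and the siblings that follow it in <div id="N">.
function groupByHeading(string $xml): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xp = new DOMXPath($doc);

    $current = null;
    $id = 0;
    // The XPath result is a fixed list, so moving nodes inside the loop does not disturb it.
    foreach ($xp->query('/platau/node()') as $node) {
        if ($node instanceof DOMElement && $node->tagName === 'h1') {
            $current = $doc->createElement('div');
            $current->setAttribute('id', (string) ++$id);
            $node->parentNode->insertBefore($current, $node);
        }
        if ($current === null) {
            continue; // nothing to attach to before the first <h1>
        }
        $current->appendChild($node); // appendChild moves the node out of its old position
    }
    return $doc->saveXML($doc->documentElement);
}

// Sample input (assumed): the flat markup wrapped in a single root element.
$xml = '<platau><h1>Heading 1</h1><p>Paragraph 1.1</p>'
     . '<h1>Heading 2</h1><p>Paragraph 2.1</p><p>Paragraph 2.2</p></platau>';
$result = groupByHeading($xml);
echo $result, PHP_EOL;
```

Because appendChild moves a node that is already in the document, the explicit removeChild call is not strictly required in this sketch.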
I'm grabbing all the paragraph tags using the PHP Simple HTML DOM Parser with the following code:
// Product Description
$html = file_get_html('http://domain.local/index.html');
$contents = strip_tags($html->find('div[class=product-details] p'));
How can I grab only the paragraphs that come before the first ul?
<p>
Paragraph 1
</p>
<p>
Paragraph 2
</p>
<p>
Paragraph 3
</p>
<ul>
<li>
List item 1
</li>
<li>
List item 2
</li>
</ul>
<blockquote>
Quote 1
</blockquote>
<blockquote>
Quote 2
</blockquote>
<blockquote>
Quote 3
</blockquote>
<p>
Paragraph 4
</p>
<p>
Paragraph 5
</p>
You can use the following code to meet the requirements mentioned:
<?php
$html = file_get_html('http://domain.local/index.html');
$detailTags = $html->find('div[class=product-details] *');
$contents = "";
foreach ($detailTags as $detailTag) {
    // Stop collecting as soon as the first <ul> is reached.
    if ($detailTag->tag === 'ul') {
        break;
    }
    $contents .= strip_tags($detailTag);
}
// contents will contain the output required.
echo $contents;
?>
Output:
Paragraph 1 Paragraph 2 Paragraph 3
EDIT: Nandal's code will work for you because it will not force you to change the library.
If you don't want to depend on a third-party library, you can use PHP's DOMDocument class, for which the DOM extension must be enabled.
The code below prints the paragraphs until it hits any other tag:
<?php
$html = new DOMDocument();
$html->loadHTML("<html><body><p>Paragraph 1</p><p> Paragraph 2</p><p> Paragraph 3</p><ul> <li> List item 1 </li> <li> List item 2 </li> </ul><blockquote> Quote 1</blockquote><blockquote> Quote 2</blockquote><blockquote> Quote 3</blockquote><p> Paragraph 4</p><p> Paragraph 5</p></body></html>");
$xpath = new DOMXPath($html);
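// Select only the direct children of <body>, so an element nested inside a paragraph does not end the loop early.
$nodes = $xpath->query('/html/body/*');
foreach ($nodes as $node) {
    // Stop at the first element that is not a paragraph.
    if ($node->nodeName != "p") {
        break;
    }
    print $node->nodeValue . "\n";
}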
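As an alternative to breaking out of a loop, the same selection can be pushed into the XPath expression itself. The sketch below is not part of the answer above; it assumes the paragraphs and the ul are siblings directly under body and uses the preceding-sibling axis to keep only the paragraphs that have no ul before them.

```php
<?php
// Shortened sample markup (assumed) with the same shape as the question's HTML.
$html = '<html><body><p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>'
      . '<ul><li>List item 1</li></ul><blockquote>Quote 1</blockquote>'
      . '<p>Paragraph 4</p></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Keep only <p> elements that have no <ul> among their preceding siblings.
$leading = [];
foreach ($xpath->query('/html/body/p[not(preceding-sibling::ul)]') as $p) {
    $leading[] = trim($p->textContent);
}
print_r($leading);
```

Paragraph 4 is excluded because the ul appears before it among its siblings.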
Note: this differs from the following question in that here we have values appearing within a node and within a childnode of that same node:
XPath contains(text(),'some string') doesn't work when used with node with more than one Text subnode
Given the following html:
$content =
'<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
<div>
<p>During the <a>interim</a> there shall be interim nourishment supplied</p>
</div>
<div>
<ul><li>During the interim there shall be nourishment supplied</li></ul>
</div>
</body>
</html>';
And the following xpath:
//*[contains(text(),'interim')]
... returns only three matches, whereas I want four. As per the comments, the four elements I'm expecting are P P A LI.
This works exactly as expected. The query below selects the matching text nodes directly rather than their parent elements:
<?php
$html = <<<HTML
<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
<div>
<p>During the <a>interim</a> there shall be interim nourishment supplied</p>
</div>
<div>
<ul><li>During the interim there shall be nourishment supplied</li></ul>
</div>
</body>
</html>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//*/text()[contains(.,"interim")]') as $n) {
    var_dump($n->getNodePath());
}
You will get four matches:
/html/body/div[1]/p/text()
/html/body/div[2]/p/a/text()
/html/body/div[2]/p/text()[2]
/html/body/div[3]/ul/li/text()
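The reason the original expression misses a match is that, in XPath 1.0, passing a node-set such as text() to contains() converts only the first node of the set to a string. The sketch below compares that expression with a variant that tests every text child and returns the parent elements. It assumes, as the expected matches and the node paths above imply, that the second paragraph wraps its first "interim" in an a element; the helper name names is hypothetical.

```php
<?php
$html = '<html><body>'
      . '<div><p>During the interim there shall be nourishment supplied</p></div>'
      . '<div><p>During the <a>interim</a> there shall be interim nourishment supplied</p></div>'
      . '<div><ul><li>During the interim there shall be nourishment supplied</li></ul></div>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Hypothetical helper: collect the element names of a query result.
function names(DOMNodeList $nodes): array
{
    $out = [];
    foreach ($nodes as $node) {
        $out[] = $node->nodeName;
    }
    return $out;
}

// Only the first text child of each element is tested.
$firstTextOnly = names($xpath->query('//*[contains(text(), "interim")]'));
// Every text child is tested, and the owning element is returned.
$anyText = names($xpath->query('//*[text()[contains(., "interim")]]'));

echo implode(' ', $firstTextOnly), PHP_EOL;
echo implode(' ', $anyText), PHP_EOL;
```

The second query returns elements rather than text nodes, which matches the P P A LI result the question asks for.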
Example file:
<p>
some content
<sup>3</sup>
some content</p>
<p>
some content
<sup>4</sup>
some content<sup>5</sup></p>
<div class="footnote">
<li id="fn3">
<p>
content3
↩
</p>
</li>
<li id="fn4">
<p>
content4
↩
</p>
</li>
<li id="fn5">
<p>
content5
↩
</p>
</li>
</div>
I need to place each referenced footnote at the bottom of the paragraph that references it. That is, if a p tag contains an a element with class fn-ref (there may be one or several such a tags in a paragraph), I need to place the related footnote directly below that paragraph. The related footnote content can be found in the div with class="footnotes".
I should search every p tag for an a with class="fn-ref". If I find one, I should create a div with class="footnote" and place the related footnote content in it. If the paragraph has more than one reference, the footnote contents should be placed one after another inside that same div.
Expected output:
<p>
some content
<sup>3</sup>
some content</p>
<div class=footnote>
<p>
<span class="label-fn">
3
</span>
content3
</p>
</div>
<p>
some content
<sup>4</sup>
some content<sup>5</sup></p>
<div class=footnote>
<p>
<span class="label-fn">
4
</span>
content4
</p>
<p>
<span class="label-fn">
5
</span>
content5
</p>
</div>
I think I should try something like parent().clone().html() and then add content before and after, but I don't know where to get started, as I am new to the DOM parser classes.
Tried so far:
$dom = new DOMDocument;
$dom->loadHTMLFile("test.html", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$pElement = $xp->query('//*[contains(@class, "fn-ref")]');
foreach ($pElement as $pNode) {
    if ($pNode->nodeName[0] === 'p') {
        // ??
    }
}
I'm trying to write a PHP script to crawl a website and store some elements in a database.
Here is my problem: a web page is written like this:
<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<p class="one_class"> Some different text </p>
<p> Some other interesting text </p>
<h2>The title 3</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
I want to get only the h2 elements and the p elements with interesting text, not the p elements with class="one_class".
I tried this PHP code:
<?php
$numberP = 0;
foreach($html->find('p') as $p)
{
$pIsOneClass = PIsOneClass($html, $p);
if($pIsOneClass == false)
{
echo $p->outertext;
$h2 = $html->find("h2", $numberP);
echo $h2->outertext;
$numberP++;
}
}
?>
The function PIsOneClass($html, $p) is:
<?php
function PIsOneClass($html, $p)
{
foreach($html->find("p.one_class") as $p_one_class)
{
if($p == $p_one_class)
{
return true;
}
}
return false;
}
?>
It doesn't work. I understand why, but I don't know how to fix it.
How can I say "I want every p without a class that sits between two h2 elements"?
Thanks a lot!
This task is easier with XPath, since you're scraping more than one element and you want to keep the source in order. You can use PHP's DOM library, which includes DOMXPath, to find and filter the elements you want:
$html = '<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<p class="one_class"> Some different text </p>
<p> Some other interesting text </p>
<h2>The title 3</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>';
# create a new DOM document and load the html
$dom = new DOMDocument;
$dom->loadHTML($html);
# create a new DOMXPath object
$xp = new DOMXPath($dom);
# search for all h2 elements and all p elements that do not have the class 'one_class'
$interest = $xp->query('//h2 | //p[not(@class="one_class")]');
# iterate through the array of search results (h2 and p elements), printing out node
# names and values
foreach ($interest as $i) {
echo "node " . $i->nodeName . ", value: " . $i->nodeValue . PHP_EOL;
}
Output:
node h2, value: The title 1
node p, value: Some interesting text
node h2, value: The title 2
node p, value: Some interesting text
node p, value: Some other interesting text
node h2, value: The title 3
node p, value: Some interesting text
As you can see, the source text stays in order, and it's easy to eliminate the nodes you don't want.
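If the paragraphs also need to be associated with the heading they follow, the ordered result can be folded into an array keyed by title. This is only a sketch under some assumptions: the variable names are illustrative, the sample markup is shortened, and the predicate not(@class) is swapped in to match "every p without a class" literally instead of excluding one specific class value.

```php
<?php
$html = '<h2>The title 1</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<h2>The title 2</h2>
<p class="one_class"> Some text </p>
<p> Some interesting text </p>
<p class="one_class"> Some different text </p>
<p> Some other interesting text </p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$sections = [];
$title = null;
// The union is returned in document order, so each p follows its own h2.
foreach ($xp->query('//h2 | //p[not(@class)]') as $node) {
    if ($node->nodeName === 'h2') {
        $title = trim($node->textContent);
        $sections[$title] = [];
    } elseif ($title !== null) {
        $sections[$title][] = trim($node->textContent);
    }
}
print_r($sections);
```

Paragraphs that appear before the first h2 are skipped, since there is no heading to file them under.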
From the Simple HTML DOM manual:
[attribute=value]
Matches elements that have the specified attribute with a certain value.
or
[!attribute]
Matches elements that don't have the specified attribute.
I've just started tooling around with XPath recently.
Currently I'm just parsing some pages line by line and taking the relevant text.
What I'd like to do is exclude a div at the top and its child elements.
Basically I'm looking at this :
<html>
<head> Foo </head>
<body>
<div id='header'>
<ul id='menu'> <li> Bar </li> <li> FooBar </li> <li> BarFoo </li> </ul>
</div>
<table> <tr> <td>data</td><td>data</td> </tr> </table>
<div>
<p>Lorem Ipsum</p>
<p>dolor sit amet</p>
</div>
</body>
</html>
Except much more content.
Currently I loop through every node with :
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.test.com/test.htm');
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('/html/body//*');
foreach($nodes as $node) {
echo $node->nodeValue;
}
I want to ignore the entire header node.
Is there a simple way to just do that?
This would work:
/html/body//*[not(ancestor-or-self::div[@id="header"])]
The XPath selects all element nodes below the body element unless they are a descendant of the div whose id attribute is "header", or that div itself.
Check http://schlitt.info/opensource/blog/0704_xpath.html for an XPath tutorial.
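To see the filter in action, here is a small sketch that runs the expression against a shortened version of the page from the question and collects the names of the elements that remain. The inline markup and variable names are assumptions made for illustration.

```php
<?php
$html = "<html><body>"
      . "<div id='header'><ul id='menu'><li>Bar</li><li>FooBar</li></ul></div>"
      . "<table><tr><td>data</td><td>data</td></tr></table>"
      . "<div><p>Lorem Ipsum</p><p>dolor sit amet</p></div>"
      . "</body></html>";

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Collect every element under <body> that is neither the header div nor inside it.
$names = [];
$query = '/html/body//*[not(ancestor-or-self::div[@id="header"])]';
foreach ($xpath->query($query) as $node) {
    $names[] = $node->nodeName;
}
echo implode(' ', $names), PHP_EOL;
```

The menu's ul and li elements do not appear in the result, and only the second, unnamed div survives.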