XPath, PHP and how to skip a specific node (and its children) - php

I've just started tooling around with XPath recently.
Currently I'm just parsing some pages line by line and taking the relevant text.
What I'd like to do is exclude a div at the top and its child elements.
Basically I'm looking at this :
<html>
<head> Foo </head>
<body>
<div id='header'>
<ul id='menu'> <li> Bar </li> <li> FooBar </li> <li> BarFoo </li> </ul>
</div>
<table> <tr> <td>data</td><td>data</td> </tr> </table>
<div>
<p>Lorem Ipsum</p>
<p>dolor sit amet</p>
</div>
</body>
</html>
Except much more content.
Currently I loop through every node with :
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.test.com/test.htm');
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('/html/body//*');
foreach($nodes as $node) {
echo $node->nodeValue;
}
I want to ignore the entire header node.
Is there a simple way to just do that?

This would work:
/html/body//*[not(ancestor-or-self::div[@id="header"])]
The XPath selects all elements below the body element unless they are a descendant of a div with the id attribute value "header", or that div itself.
Check http://schlitt.info/opensource/blog/0704_xpath.html for an XPath tutorial.
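As a minimal sketch of how that expression could be plugged into the loop from the question: the inline markup below is a shortened stand-in for the real page (an assumption for illustration), and the collected element names are only there to show that nothing from the header survives.

```php
<?php
// Shortened stand-in for the page in the question (assumed sample data).
$html = <<<'HTML'
<html>
<head><title>Foo</title></head>
<body>
<div id="header">
<ul id="menu"><li>Bar</li><li>FooBar</li></ul>
</div>
<table><tr><td>data</td><td>data</td></tr></table>
<div>
<p>Lorem Ipsum</p>
<p>dolor sit amet</p>
</div>
</body>
</html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Keep every element under body that is neither the header div nor inside it.
$query = '/html/body//*[not(ancestor-or-self::div[@id="header"])]';

$names = [];
foreach ($xpath->query($query) as $node) {
    $names[] = $node->nodeName;
}

// The list contains table cells and paragraphs, but no ul or li from the menu.
echo implode(' ', $names), "\n";
```

Note that echoing nodeValue for every matched element, as the original loop does, prints nested text more than once (for example once for the div and again for each p inside it), so selecting text nodes or only leaf elements may be preferable.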

Related

XPath - How to extract element following by the parent parent H1 tag

I am trying to extract a blog post from web pages. Different pages have different structures, so it is very difficult to extract what I need. There is also some CSS and JS code in the HTML, which I have to avoid.
I know the <h1> a dummy title </h1> text in advance, so it can help to identify the right section.
I do not know any id or class attributes.
<body>
<div>
<h1> a dummy title </h1>
</div>
<script> function loadDoc() {const xhttp = new XMLHttpRequest();} </script>
<div class="subtitle">
<p>...</p>
</div>
<div class="blog-post">
<p>...</p>
<div class="clear-fix">...</div>
<p>...</p>
<div class="clear-fix">...</div>
<p>...</p>
<p>...</p>
</div>
<div class="another-section">
<p>...</p>
<p>...</p>
</div>
<div class="another-another-section">
<p>...</p>
<p>...</p>
<div class="clear-fix">...</div>
<p>...</p>
<p>...</p>
<p>...</p>
</div>
</body>
What I have tried:
I have tried to find the <div> with the most <p> elements, but sometimes other <div> elements have just as many or more; I have to rule those out by taking the one nearest to the <h1>.
$html=
'[My html above]
';
$HTMLDoc = new DOMDocument();
$HTMLDoc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
$xpath = new DOMXPath($HTMLDoc);
#locate the 3 divs
$pees = $xpath->query('//div[.//p]');
$pchilds = [];
#get the number of p children in each div
foreach ($pees as $pee) {
$childs = $pee->childElementCount;
array_push($pchilds,$childs);}
#now find the div with the max number of p children
foreach ($pees as $pee) {
$childs = $pee->childElementCount;
if ($childs == max($pchilds))
echo ($pee->nodeValue);
#or do whatever
}
Find all divs with p elements, then count the p elements inside each, and finally take the first one with the max() count:
$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXpath($document);
$dcnt = array();
// Find all divs following an H1
$divs = $xpath->query('//h1/following-sibling::div');
// Count `p` inside them
foreach($divs as $idx=>$d) {
$cnt = (int) $xpath->evaluate('count(.//p)', $d);
$dcnt[$idx] = $cnt;
}
// show content of div with max() count
foreach($divs as $idx=>$d) {
if( $dcnt[$idx] == max($dcnt) ){
print $idx . ': ' . $divs[$idx]->nodeName . ': ' . $divs[$idx]->nodeValue;
break;
}
}
XPath 2.0 solution (see SaxonC for PHP support). Find the first div after the <h1> that has the maximum number of <p> children:
//h1/following-sibling::div[p[max(//h1/following-sibling::div/count(p))]][1]
Output :
<div class="blog-post">
<p>...</p>
<div class="clear-fix">...</div>
<p>...</p>
<div class="clear-fix">...</div>
<p>...</p>
<p>...</p>
</div>
XPath 1.0 approximate solution (could return the wrong div) :
//h1/following-sibling::div[count(./p)>1][count(./p)>count(./preceding-sibling::div[./p][1]/p)][count(./p)>count(./following-sibling::div[./p][1]/p)][1]
In a comment you added,
want to find the first <h1> then I want to find the most nearest
<div> having max <p>.
There can be another tags in that <div> but I want to print
<p> tags only.
If your PHP processor has exslt support
something like this should be possible:
< file xmlstarlet select --template \
--var T='//div[p][contains(preceding::h1[1],"my title")]' \
--copy-of '($T[count(p) = math:max(dyn:map($T,"count(p)"))])[1]/p'
where
the T variable selects the div nodes of interest, assuming
you know, or can extract, the h1 section header text
dyn:map
maps each div to the count of its p children,
math:max
picks the maximum count
($T[…])[1]/p selects the p children of the first of possibly
more divs with a maximum p count
The command above uses xmlstarlet syntax; to make a single XPath
expression replace $T (2 places) with T contents inside parentheses.
It executes the following XSLT stylesheet (add -C before --template
to list it):
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:math="http://exslt.org/math" xmlns:dyn="http://exslt.org/dynamic" version="1.0" extension-element-prefixes="math dyn">
<xsl:output omit-xml-declaration="yes" indent="no"/>
<xsl:template match="/">
<xsl:variable select="//div[p][contains(preceding::h1[1],'my title')]" name="T"/>
<xsl:copy-of select="($T[count(p) = math:max(dyn:map($T,'count(p)'))])[1]/p"/>
</xsl:template>
</xsl:stylesheet>
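If EXSLT is not available, the same selection can be sketched with plain DOMXPath and a loop. This is only an illustration under assumptions: the markup is a shortened version of the question's page, the title text "a dummy title" is taken from the question, and the variable names are made up for this example.

```php
<?php
// Shortened version of the question's markup (assumed sample data).
$html = <<<'HTML'
<body>
<div><h1> a dummy title </h1></div>
<div class="subtitle"><p>one</p></div>
<div class="blog-post"><p>a</p><div class="clear-fix">x</div><p>b</p><p>c</p></div>
<div class="another-section"><p>d</p><p>e</p></div>
</body>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Candidate divs: they have p children and their nearest preceding h1 is the known title.
$candidates = $xpath->query('//div[p][contains(preceding::h1[1], "a dummy title")]');

$best = null;
$bestCount = 0;
foreach ($candidates as $div) {
    $count = (int) $xpath->evaluate('count(p)', $div);
    if ($count > $bestCount) { // strict ">" keeps the first div when counts tie
        $best = $div;
        $bestCount = $count;
    }
}

// Collect only the p children of the winning div, ignoring other tags inside it.
$paragraphs = [];
if ($best !== null) {
    foreach ($xpath->query('p', $best) as $p) {
        $paragraphs[] = $doc->saveHTML($p);
    }
}
echo implode("\n", $paragraphs), "\n";
```

The loop does in PHP what dyn:map and math:max do inside the single XPath expression above, which keeps the query itself within XPath 1.0.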

How exclude html comments from text node xpath?

I have the following HTML structure:
<a>
<div>
<div>
<span>
text node 1<br>
text node 2 <!--//comments-->
</span>
</div>
</div>
</a>
With the following query I get the second text node, but how do I get that node without the comment?
$spanx = $xpath->query('//a/div/div/span/text()[2]');
$span = $spanx->item($l)->nodeValue;
echo "<td>".$span."</td></tr>";
I get this result:
text node 2 //comments
What I want is:
text node 2
I've tested the following on my localhost. I've created the file named DOM_with_comment.html containing:
<a>
<div>
<div>
<span>
text node 1<br>
text node 2 <!--//comments-->
</span>
</div>
</div>
</a>
When I run:
<?php
$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->preserveWhiteSpace = false;
$doc->loadHTMLFile('DOM_with_comment.html');
$xpath = new DOMXPath($doc);
echo "<pre>";
foreach ($xpath->query('//a/div/div/span/text()') as $item) {
var_dump($item->nodeValue);
}
The output is:
string(29) "
text node 1"
string(31) "
text node 2 "
string(14) "
"
So, by taking the first matching result [0] from your XPath query and displaying the trim()ed ->nodeValue with var_export(), you can see that there is no comment or whitespace on either side of the targeted substring.
var_export(trim($xpath->query('//a/div/div/span/text()[2]')[0]->nodeValue));
// outputs: 'text node 2'
p.s. If your input is not coming from a file, but a variable, this works the same way:
$html = <<<HTML
<a>
<div>
<div>
<span>
text node 1<br>
text node 2 <!--//comments-->
</span>
</div>
</div>
</a>
HTML;
$doc->loadHTML($html);
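As an alternative sketch, the trimming can be left to XPath itself: DOMXPath::evaluate() can return a string, and normalize-space() strips the surrounding whitespace. The snippet assumes the same markup as above; text() never matches comment nodes, so the comment is not part of the result.

```php
<?php
$html = <<<'HTML'
<a>
<div>
<div>
<span>
text node 1<br>
text node 2 <!--//comments-->
</span>
</div>
</div>
</a>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // silence parser warnings for the HTML fragment
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// evaluate() returns a plain string here because normalize-space() yields a string.
$text = $xpath->evaluate('normalize-space(//a/div/div/span/text()[2])');
var_dump($text);
```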

How can I select only the immediate parent node of a text string using xpath for every match

Note: this differs from the following question in that here we have values appearing within a node and within a childnode of that same node:
XPath contains(text(),'some string') doesn't work when used with node with more than one Text subnode
Given the following html:
$content =
'<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
<div>
<p>During the <a>interim</a> there shall be interim nourishment supplied</p>
</div>
<div>
<ul><li>During the interim there shall be nourishment supplied</li></ul>
</div>
</body>
</html>';
And the following xpath:
//*[contains(text(),'interim')]
... only provides 3 matches, whereas I want four matches. As per comments, the four elements I'm expecting are P P A LI.
This works exactly as expected. See this glot.io link.
<?php
$html = <<<HTML
<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
<div>
<p>During the <a>interim</a> there shall be interim nourishment supplied</p>
</div>
<div>
<ul><li>During the interim there shall be nourishment supplied</li></ul>
</div>
</body>
</html>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//*/text()[contains(.,"interim")]') as $n) var_dump($n->getNodePath());
You will get four matches:
/html/body/div[1]/p/text()
/html/body/div[2]/p/a/text()
/html/body/div[2]/p/text()[2]
/html/body/div[3]/ul/li/text()
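To get the parent elements rather than the text nodes, the text test can be moved into a predicate on the element. The sketch below assumes the second paragraph wraps its first "interim" in an <a> element, as the node paths above imply; the attributes of that element are unknown and left out.

```php
<?php
// Sample markup; the bare <a> element is an assumption based on the node paths in the answer.
$html = <<<'HTML'
<html>
<body>
<div><p>During the interim there shall be nourishment supplied</p></div>
<div><p>During the <a>interim</a> there shall be interim nourishment supplied</p></div>
<div><ul><li>During the interim there shall be nourishment supplied</li></ul></div>
</body>
</html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select every element that has at least one direct text child containing the word.
$names = [];
foreach ($xpath->query('//*[text()[contains(., "interim")]]') as $element) {
    $names[] = $element->nodeName;
}
echo implode(' ', $names), "\n";
```

The original expression contains(text(), 'interim') only inspects the first text child of each element, which is why the second p was missed; testing each text node individually avoids that.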

Transform complex and variable xml

I have a complex XML document that I want to transform into HTML. Some tags need to be replaced with HTML tags.
The XML is this:
<root>
<div>
<p>
<em>bol text</em>, some normale text
</p>
</div>
<list>
<listitem>
normal text inside list <em>bold inside list</em>
</listitem>
<listitem>
another text in list...
</listitem>
</list>
<p>
A sample paragraph
</p>
</root>
The text inside the elements is variable, so other XML files that I parse can look completely different.
The output I want is this (for this scenario):
<root>
<div>
<p>
<strong>bol text</strong>, some normale text
</p>
</div>
<ul>
<li>
normal text inside list <strong>bold inside list</strong>
</li>
<li>
another text in list...
</li>
</ul>
<p>
A sample paragraph
</p>
</root>
I wrote a recursive function to visit every node of the XML and replace it with the corresponding HTML tag (but it doesn't work):
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->load('section.xml');
echo $doc->saveHTML();
function printHtml(DOMNode $node)
{
if ($node->hasChildNodes())
{
foreach ($node->childNodes as $child)
{
printHtml($child);
}
}
if ($node->nodeName == 'em')
{
$newNode = $node->ownerDocument->createElement('strong', $node->nodeValue);
$node->parentNode->replaceChild($newNode, $node);
}
if ($node->nodeName == 'listitem')
{
$newNode = $node->ownerDocument->createElement('li', $node->nodeValue);
$node->parentNode->replaceChild($newNode, $node);
}
}
Can anyone help me?
This is an example of a complete xml:
<root>
<div>
<p>
<em>bol text</em>, some normale text
</p>
</div>
<list>
<listitem>
normal text inside list <em>bold inside list</em>
</listitem>
<listitem>
another text in list...
</listitem>
</list>
<media>
<info isVisible="false">
<title>
<p>Image title <em>in bold</em> not in bold</p>
</title>
</info>
<file isVisible="true">
<href>
"path/to/file.jpg"
</href>
</file>
</media>
<p>
A sample paragraph
</p>
</root>
Which has to be transformed into:
<root>
<div>
<p>
<strong>bol text</strong>, some normale text
</p>
</div>
<ul>
<li>
normal text inside list <strong>bold inside list</strong>
</li>
<li>
another text in list...
</li>
</ul>
<!-- the media tag can be presented in two mode: with title visible, and title hidden -->
<!-- this is the case when the title is hidden -->
<img src="path/to/file.jpg" />
<!-- this is the case when the title is visible -->
<!-- the info tag (inside media tag) has an attribute isVisible="false" which means it doesn't have to be shown. -->
<!-- if the info tag has isVisible="true", the media tag must be translated into
<div>
<img src="path/to/file.jpg" />
<p>Image title <strong>in bold</strong> not in bold</p>
</div>
-->
<p>
A sample paragraph
</p>
</root>
There's a language specially designed for this task: it's called XSLT, and you can easily express your desired transformation in XSLT and invoke it from your PHP program. There's a learning curve, of course, but it's a much better solution than writing low-level DOM code.
In XSLT you write a set of template rules saying how individual elements should be handled. Many elements in your example are copied through unchanged, so you can start with a default rule that does this:
<xsl:template match="*">
<xsl:copy><xsl:apply-templates/></xsl:copy>
</xsl:template>
The "match" part says what part of the input you are matching; the body of the rule says what output to produce. The xsl:apply-templates does a recursive descent to process the children of the current element.
Some of your elements are simply renamed, for example
<xsl:template match="listitem">
<li><xsl:apply-templates/></li>
</xsl:template>
Some of the rules are a little bit more complex, but still easily expressed:
<xsl:template match="media/file[@isVisible='true']">
<img src="{href}"/>
</xsl:template>
I hope you agree that this declarative rule-based approach is much clearer than your procedural code; it's also much easier for someone else to change the rules in six months' time.
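For illustration, here is one way such a stylesheet might be invoked from PHP. This is a sketch under assumptions: it requires the xsl extension (XSLTProcessor), the input is a cut-down piece of the question's XML, and the em and list rules are added here to match the desired output rather than taken from the answer.

```php
<?php
// Cut-down input based on the question's XML (assumed sample data).
$xml = <<<'XML'
<root>
<list>
<listitem>normal text inside list <em>bold inside list</em></listitem>
</list>
</root>
XML;

// Default copy rule plus renaming rules; the em and list rules are assumptions.
$xsl = <<<'XSL'
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="xml" omit-xml-declaration="yes"/>
<xsl:template match="*">
<xsl:copy><xsl:apply-templates/></xsl:copy>
</xsl:template>
<xsl:template match="em">
<strong><xsl:apply-templates/></strong>
</xsl:template>
<xsl:template match="list">
<ul><xsl:apply-templates/></ul>
</xsl:template>
<xsl:template match="listitem">
<li><xsl:apply-templates/></li>
</xsl:template>
</xsl:stylesheet>
XSL;

$source = new DOMDocument();
$source->loadXML($xml);

$sheet = new DOMDocument();
$sheet->loadXML($xsl);

$processor = new XSLTProcessor();
$processor->importStylesheet($sheet);

// transformToXml() returns the serialized result as a string.
$result = $processor->transformToXml($source);
echo $result;
```

In a real project the stylesheet would more likely live in its own .xsl file and be loaded with DOMDocument::load(), so the rules can be edited without touching the PHP code.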
Well, maybe it's not the most correct idea, but why not just use str_replace? That way you can clearly see the list of changes to apply and easily add or remove entries.
// file_get_contents: read the source XML
$file = file_get_contents('file.xml');
// str_replace: swap the opening and closing tags together
$file = str_replace(['<em>', '</em>'], ['<strong>', '</strong>'], $file);
// file_put_contents: write the result
file_put_contents('file.html', $file);
UPDATE (Some more ideas regarding the changes in the question)
Using PHP + DOM here seems a little tricky (at least to me right now). It may be more reasonable to use XSL / XSLT (Extensible Stylesheet Language Transformations). In that case, something similar can be found here: How to replace a node-name with another in Xslt?
XSLT is designed specifically for this kind of transformation: http://en.wikipedia.org/wiki/XSLT

PHP Xpath - Parsing flat HTML structure

I am trying to parse some fairly flat HTML and group everything from one h1 tag to the next. For example, I have the following HTML:
<h1> Heading 1 </h1>
<p> Paragraph 1.1 </p>
<p> Paragraph 1.2 </p>
<p> Paragraph 1.3 </p>
<h1> Heading 2 </h1>
<p> Paragraph 2.1 </p>
<p> Paragraph 2.2 </p>
<h1> Heading 3 </h1>
<p> Paragraph 3.1 </p>
<p> Paragraph 3.2 </p>
<p> Paragraph 3.3 </p>
I basically want it to look like:
<div id='1'>
<h1> Heading 1 </h1>
<p> Paragraph 1.1 </p>
<p> Paragraph 1.2 </p>
<p> Paragraph 1.3 </p>
</div>
<div id='2'>
<h1> Heading 2 </h1>
<p> Paragraph 2.1 </p>
<p> Paragraph 2.2 </p>
</div>
<div id='3'>
<h1> Heading 3 </h1>
<p> Paragraph 3.1 </p>
<p> Paragraph 3.2 </p>
<p> Paragraph 3.3 </p>
</div>
It is probably not even worth posting the code I have written so far, as it just turned into a mess. Basically I was attempting to do an XPath query for '//h1', create new DIV tags as parent nodes, then copy the h1 DOM node into the first DIV, and then loop over nextSibling until I hit another h1 tag. As mentioned, it got messy.
Could someone point me in a better direction here?
Iterate over all nodes that are on the same level (I created a wrapper node called platau in my example). Whenever you run across an <h1>, insert a new div before it and keep a reference to that div.
Then, for the <h1> and every node after it, as long as the reference exists, remove the node from its parent and append it as a child of the referenced div.
Example:
$doc = new DOMDocument();
$doc->loadXML($xml);
$xp = new DOMXPath($doc);
$current = NULL;
$id = 0;
foreach($xp->query('/platau/node()') as $i => $sort)
{
if (isset($sort->tagName) && $sort->tagName === 'h1')
{
$current = $doc->createElement('div');
$current->setAttribute('id', ++$id);
$current = $sort->parentNode->insertBefore($current, $sort);
}
if (!$current) continue;
$sort->parentNode->removeChild($sort);
$current->appendChild($sort);
}
