XPath to query multiple selectors - php

I want to get values and attributes from a selector, and then get the attributes and values of its children, based on a single query.
Allow me to give an example.
This is the structure:
<div class='message'>
<div>
<a href='http://www.whatever.com'>Text</a>
</div>
<div>
<img src='image_link.jpg' />
</div>
</div>
<div class='message'>
<div>
<a href='http://www.whatever2.com'>Text2</a>
</div>
<div>
<img src='image_link2.jpg' />
</div>
</div>
So I would like to make a query that matches all of those at once.
Something like this:
//$dom is the DomDocument() set up after loaded HTML with $dom->loadHTML($html);
$dom_xpath = new DOMXpath($dom);
$elements = $dom_xpath->query('//div[@class="message"], //div[@class="message"] //a, //div[@class="message"] //img');
foreach($elements as $ele){
echo $ele[0]->getAttribute('class'); //it should return 'message'
echo $ele[1]->getAttribute('href'); //it should return 'http://www.whatever.com' in the 1st loop, and 'http://www.whatever2.com' in the second loop
echo $ele[2]->getAttribute('src'); //it should return image_link.jpg in the 1st loop and 'image_link2.jpg' in the second loop
}
Is there some way of doing that using multiple XPath selectors, like I did in the example, to avoid making repeated queries and save some CPU?

Use the union operator (|) in a single expression like this:
//div[@class="message"]|//div[@class="message"]//a|//div[@class="message"]//img
Note that this will return a flattened result set (so to speak). In other words, you won't access the elements in groups of three like your example shows. Instead, you'll just iterate everything the expressions matched (in document order). For this reason, it might be even smarter to simply iterate the nodes returned by //div[@class="message"] and use DOM methods to access their children (for the other elements).
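The second suggestion (one query for the containers, then scoped lookups per container) could be sketched as follows. This is a minimal sketch, not the only way to do it: the $html string simply repeats the markup from the question, and the $rows array is a name introduced here for illustration.

```php
<?php
// Sample markup copied from the question; $html stands in for the real page.
$html = <<<'HTML'
<div class='message'>
<div><a href='http://www.whatever.com'>Text</a></div>
<div><img src='image_link.jpg' /></div>
</div>
<div class='message'>
<div><a href='http://www.whatever2.com'>Text2</a></div>
<div><img src='image_link2.jpg' /></div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$rows = []; // hypothetical name: one entry per message div
// One absolute query for the containers...
foreach ($xpath->query('//div[@class="message"]') as $message) {
    // ...then relative queries (note the leading dot) scoped to that container.
    $link  = $xpath->query('.//a', $message)->item(0);
    $image = $xpath->query('.//img', $message)->item(0);
    $rows[] = [
        'class' => $message->getAttribute('class'),
        'href'  => $link ? $link->getAttribute('href') : null,
        'src'   => $image ? $image->getAttribute('src') : null,
    ];
}

print_r($rows);
```

Passing the container as the second argument to query() keeps each lookup local to that div, so the results stay grouped the way the question's loop expects.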

Use:
(//div[@class='message'])[$k]//@*
This selects all three attributes that belong to the $k-th div in the document whose class attribute has the string value "message", and to any of that div's descendants.
You can evaluate N such XPath expressions, for $k from 1 to N, where N is the total count of //div[@class='message'].
XSLT-based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:for-each select="//div[@class='message']">
<xsl:variable name="vPos" select="position()"/>
<xsl:apply-templates select=
"(//div[@class='message'])[0+$vPos]//@*"/>
================
</xsl:for-each>
</xsl:template>
<xsl:template match="@*">
<xsl:value-of select=
"concat('name = ', name(), ' value = ', ., '&#xA;')"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied to the provided XML document (wrapped in a single top element to become well-formed):
<html>
<div class='message'>
<div>
<a href='http://www.whatever.com'>Text</a>
</div>
<div>
<img src='image_link.jpg' />
</div>
</div>
<div class='message'>
<div>
<a href='http://www.whatever2.com'>Text2</a>
</div>
<div>
<img src='image_link2.jpg' />
</div>
</div>
</html>
The XPath expression is evaluated twice and the selected attributes are formatted and output:
name = class value = message
name = href value = http://www.whatever.com
name = src value = image_link.jpg
================
name = class value = message
name = href value = http://www.whatever2.com
name = src value = image_link2.jpg
================
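The same positional expression can also be evaluated from PHP rather than XSLT. The following is only a sketch under the assumption that the markup is well-formed XML wrapped in a single root, as above; $groups is a name made up for this example.

```php
<?php
// The question's markup wrapped in one root element so it loads as XML.
$xml = <<<'XML'
<html>
<div class='message'>
<div><a href='http://www.whatever.com'>Text</a></div>
<div><img src='image_link.jpg' /></div>
</div>
<div class='message'>
<div><a href='http://www.whatever2.com'>Text2</a></div>
<div><img src='image_link2.jpg' /></div>
</div>
</html>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// N = total number of message divs (evaluate() returns a float for count()).
$n = (int) $xpath->evaluate("count(//div[@class='message'])");

$groups = []; // hypothetical name: attribute name => value, one array per div
for ($k = 1; $k <= $n; $k++) {
    $attrs = [];
    // All attributes of the $k-th message div and of its descendants.
    foreach ($xpath->query("(//div[@class='message'])[$k]//@*") as $attr) {
        $attrs[$attr->nodeName] = $attr->nodeValue;
    }
    $groups[] = $attrs;
}

print_r($groups);
```

Note that this runs N + 1 XPath evaluations, so it trades the single union query for grouped results.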

Related

XPath - How to extract element following by the parent parent H1 tag

I am trying to extract blog posts from web pages. Different pages have different structures, so it is very difficult to extract what I need. There is also some CSS and JS code in the HTML, which I have to avoid.
I know the <h1> a dummy title </h1> text in advance, so it can help to identify the right section.
I do not know any id or class attributes.
<body>
<div>
<h1> a dummy title </h1>
</div>
<script> function loadDoc() {const xhttp = new XMLHttpRequest();} </script>
<div class="subtitle">
<p>...</p>
</div>
<div class="blog-post">
<p>...</p>
<div class="clear-fix">...</div>
<p>...</p>
<div class="clear-fix">...</div>
<p>...</p>
<p>...</p>
</div>
<div class="another-section">
<p>...</p>
<p>...</p>
</div>
<div class="another-another-section">
<p>...</p>
<p>...</p>
<div class="clear-fix">...</div>
<p>...</p>
<p>...</p>
<p>...</p>
</div>
</body>
What I have tried:
I tried to find the <div> with the most <p> elements, but sometimes other <div>s elsewhere on the page have even more <p> elements. I have to avoid those by finding the one nearest to the <h1>.
$html=
'[My html above]
';
$HTMLDoc = new DOMDocument();
$HTMLDoc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
$xpath = new DOMXPath($HTMLDoc);
#locate the 3 divs
$pees = $xpath->query('//div[.//p]');
$pchilds = [];
#get the number of p children in each div
foreach ($pees as $pee) {
$childs = $pee->childElementCount;
array_push($pchilds,$childs);}
#now find the div with the max number of p children
foreach ($pees as $pee) {
$childs = $pee->childElementCount;
if ($childs == max($pchilds))
echo ($pee->nodeValue);
#or do whatever
}
Find all divs with p elements, then count the p elements inside each, and finally take the first div with the max() count:
$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXpath($document);
$dcnt = array();
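// Find all divs after the H1. The H1 in the sample sits inside its own div,
// so following:: is used here; following-sibling:: would match nothing.
$divs = $xpath->query('//h1/following::div');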
// Count `p` inside them
foreach($divs as $idx=>$d) {
$cnt = (int) $xpath->evaluate('count(.//p)', $d);
$dcnt[$idx] = $cnt;
}
// show content of the first div with the max() count
$max = $dcnt ? max($dcnt) : 0; // max() must not be called on an empty array
foreach($divs as $idx=>$d) {
if( $dcnt[$idx] == $max ){
print $idx . ': ' . $d->nodeName . ': ' . $d->nodeValue;
break;
}
}
XPath 2.0 solution (see SaxonC for PHP support). Find the first div after the <h1> that contains the maximum number of <p> children:
//h1/following-sibling::div[p[max(//h1/following-sibling::div/count(p))]][1]
Output :
<div class="blog-post">
<p>...</p>
<div class="clear-fix">...</div>
<p>...</p>
<div class="clear-fix">...</div>
<p>...</p>
<p>...</p>
</div>
XPath 1.0 approximate solution (could return the wrong div) :
//h1/following-sibling::div[count(./p)>1][count(./p)>count(./preceding-sibling::div[./p][1]/p)][count(./p)>count(./following-sibling::div[./p][1]/p)][1]
In a comment you added: "I want to find the first <h1>, then I want to find the nearest <div> having the most <p>. There can be other tags in that <div>, but I want to print the <p> tags only."
If your PHP processor has EXSLT support, something like this should be possible:
< file xmlstarlet select --template \
--var T='//div[p][contains(preceding::h1[1],"my title")]' \
--copy-of '($T[count(p) = math:max(dyn:map($T,"count(p)"))])[1]/p'
where
the T variable selects the div nodes of interest, assuming you know, or can extract, the h1 section header text
dyn:map maps each div to the count of its p children
math:max picks the maximum count
($T[…])[1]/p selects the p children of the first of possibly several divs with the maximum p count
The command above uses xmlstarlet syntax; to make a single XPath expression, replace $T (2 places) with the contents of T inside parentheses.
It executes the following XSLT stylesheet (add -C before --template to list it):
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:math="http://exslt.org/math" xmlns:dyn="http://exslt.org/dynamic" version="1.0" extension-element-prefixes="math dyn">
<xsl:output omit-xml-declaration="yes" indent="no"/>
<xsl:template match="/">
<xsl:variable select="//div[p][contains(preceding::h1[1],&quot;my title&quot;)]" name="T"/>
<xsl:copy-of select="($T[count(p) = math:max(dyn:map($T,&quot;count(p)&quot;))])[1]/p"/>
</xsl:template>
</xsl:stylesheet>
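Where EXSLT is not available, the same selection can be approximated in plain PHP DOM. This is a hedged sketch rather than a drop-in solution: the shortened sample markup, the title text and the variable names ($best, $bestCount, $paragraphs) are assumptions made for illustration, and it uses the following:: axis because the h1 in the question sits inside its own div.

```php
<?php
// Shortened stand-in for the page from the question.
$html = <<<'HTML'
<body>
<div><h1> a dummy title </h1></div>
<div class="subtitle"><p>s1</p></div>
<div class="blog-post"><p>b1</p><div class="clear-fix">x</div><p>b2</p><p>b3</p><p>b4</p></div>
<div class="another-section"><p>a1</p><p>a2</p></div>
</body>
HTML;

libxml_use_internal_errors(true); // real-world HTML is rarely clean
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Every div after the known h1 that has at least one p child.
$divs = $xpath->query('//h1[contains(., "a dummy title")]/following::div[p]');

$best = null;
$bestCount = 0;
foreach ($divs as $div) {
    $count = (int) $xpath->evaluate('count(p)', $div);
    if ($count > $bestCount) { // strict comparison: the earliest div wins ties
        $best = $div;
        $bestCount = $count;
    }
}

// Collect only the p children of the winning div.
$paragraphs = [];
if ($best !== null) {
    foreach ($xpath->query('p', $best) as $p) {
        $paragraphs[] = trim($doc->saveHTML($p));
    }
}

echo implode("\n", $paragraphs), "\n";
```

Because ties go to the earliest div in document order, the div nearest the h1 is preferred, which matches the requirement quoted above.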

Replace span's in PHP but keep content inside

I have the following string:
<span style="font-size: 13px;">
<span style="">
<span style="">
<span style="font-family: Roboto, sans-serif;">
<span style="">
Some text content
</span>
</span>
</span>
</span>
</span>
and I want to change this string to the following using PHP:
<span style="font-size: 13px;">
<span style="font-family: Roboto, sans-serif;">
Some text content
</span>
</span>
I have no idea how to do that. When I try to use str_replace to replace the <span style="">, I don't know how to replace the matching </span> and keep the content inside. My next problem is that I don't know exactly how many <span style=""> elements I have in my string, and there is more than one of these blocks in the string.
Thanks in advance for your help, and sorry if this is a stupid question; I'm still learning.
This is easily done with a proper HTML parser. PHP has DOMDocument which can parse X/HTML into the Document Object Model which can then be manipulated how you want.
The trick to solving this problem is to recursively traverse the DOM tree, visit each node, and replace the ones you don't want. To do this, I've written a short helper method by extending DOMDocument:
$html = <<<'HTML'
<span style="font-size: 13px;">
<span style="">
<span style="">
<span style="font-family: Roboto, sans-serif;">
<span style="">
Some text content
</span>
</span>
</span>
</span>
</span>
HTML;
class MyDOMDocument extends DOMDocument {
public function walk(DOMNode $node, $skipParent = false) {
if (!$skipParent) {
yield $node;
}
if ($node->hasChildNodes()) {
foreach ($node->childNodes as $n) {
yield from $this->walk($n);
}
}
}
}
libxml_use_internal_errors(true);
$dom = new MyDOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$keep = $remove = [];
foreach ($dom->walk($dom->childNodes->item(0)) as $node) {
if ($node->nodeName !== "span") { // we only care about span nodes
continue;
}
// we'll get rid of all span nodes whose style attribute is missing or empty
if (!$node->hasAttribute("style") || !strlen($node->getAttribute("style"))) {
$remove[] = $node;
foreach($node->childNodes as $child) {
$keep[] = [$child, $node];
}
}
}
// first move every child in front of the span it belongs to, then remove the emptied spans
foreach($keep as [$a, $b]) {
$b->parentNode->insertBefore($a, $b);
}
foreach($remove as $a) {
if ($a->parentNode) {
$a->parentNode->removeChild($a);
}
}
// Now we should have a rebuilt DOM tree with what we expect:
echo $dom->saveHTML();
Output:
<span style="font-size: 13px;">
<span style="font-family: Roboto, sans-serif;">
Some text content
</span>
</span>
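A shorter variant of the same idea lets XPath find the spans to unwrap instead of walking the tree by hand. This is only a sketch: it assumes the outermost element keeps its style (so every removed span has an element parent), and the compact one-line $html is a stand-in for the string in the question.

```php
<?php
// Same nesting as in the question, written on one line for brevity.
$html = <<<'HTML'
<span style="font-size: 13px;"><span style=""><span style=""><span style="font-family: Roboto, sans-serif;"><span style="">Some text content</span></span></span></span></span>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

// Take a snapshot of the matches first, because the tree is modified below.
$emptySpans = iterator_to_array(
    $xpath->query('//span[not(@style) or normalize-space(@style) = ""]')
);

foreach ($emptySpans as $span) {
    // Move the children up, in order, to just before the span...
    while ($span->firstChild) {
        $span->parentNode->insertBefore($span->firstChild, $span);
    }
    // ...then drop the now-empty span itself.
    $span->parentNode->removeChild($span);
}

$result = trim($dom->saveHTML());
echo $result, "\n";
```

The predicate also treats whitespace-only style values as empty, which goes slightly beyond the strlen() check used above.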
For a more general way to modify an HTML document, take a look at XSLT (Extensible Stylesheet Language Transformations). PHP has an XSLT library.
You then have an XML document with your transform rules in place:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="html" indent="yes"/>
<!-- remove spans with empty styles -->
<xsl:template match="*[@style and string-length(./@style) = 0]">
<xsl:apply-templates />
</xsl:template>
<!-- catch all to copy any elements that aren't matched in other templates -->
<xsl:template match="*">
<xsl:copy>
<!-- copy the attributes of the element -->
<xsl:copy-of select="@*" />
<!-- continue applying templates to this element's children, text nodes included -->
<xsl:apply-templates />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Then your PHP:
$sourceHtml = new DOMDocument();
$sourceHtml->load('source.html');
$xsl = new DOMDocument();
$xsl->load('transform.xsl');
$xsltProcessor = new XSLTProcessor;
$xsltProcessor->importStyleSheet($xsl); // attach the xsl rules
echo $xsltProcessor->transformToXML($sourceHtml);
$transformedHtml = $xsltProcessor->transformToDoc($sourceHtml);
$transformedHtml->saveHTMLFile('transformed.html');
XSLT is superpowerful for this kind of thing, and you can set all sorts of rules for parent/sibling relationships, and modify attributes and content accordingly.

Transform complex and variable xml

I have a complex XML document that I want to transform into HTML. Some tags need to be replaced with HTML tags.
The XML is this:
<root>
<div>
<p>
<em>bol text</em>, some normale text
</p>
</div>
<list>
<listitem>
normal text inside list <em>bold inside list</em>
</listitem>
<listitem>
another text in list...
</listitem>
</list>
<p>
A sample paragraph
</p>
</root>
The content of the elements is variable, which means that other XML documents I parse can be completely different.
The output I want is this (for this scenario):
<root>
<div>
<p>
<strong>bol text</strong>, some normale text
</p>
</div>
<ul>
<li>
normal text inside list <strong>bold inside list</strong>
</li>
<li>
another text in list...
</li>
</ul>
<p>
A sample paragraph
</p>
</root>
I wrote a recursive function that parses every node of the XML and replaces it with an HTML tag, but it doesn't work:
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->load('section.xml');
echo $doc->saveHTML();
function printHtml(DOMNode $node)
{
if ($node->hasChildNodes())
{
foreach ($node->childNodes as $child)
{
printHtml($child);
}
}
if ($node->nodeName == 'em')
{
$newNode = $node->ownerDocument->createElement('strong', $node->nodeValue);
$node->parentNode->replaceChild($newNode, $node);
}
if ($node->nodeName == 'listitem')
{
$newNode = $node->ownerDocument->createElement('li', $node->nodeValue);
$node->parentNode->replaceChild($newNode, $node);
}
}
Can anyone help me?
This is an example of a complete xml:
<root>
<div>
<p>
<em>bol text</em>, some normale text
</p>
</div>
<list>
<listitem>
normal text inside list <em>bold inside list</em>
</listitem>
<listitem>
another text in list...
</listitem>
</list>
<media>
<info isVisible="false">
<title>
<p>Image title <em>in bold</em> not in bold</p>
</title>
</info>
<file isVisible="true">
<href>
"path/to/file.jpg"
</href>
</file>
</media>
<p>
A sample paragraph
</p>
</root>
Which has to be transformed into:
<root>
<div>
<p>
<strong>bol text</strong>, some normale text
</p>
</div>
<ul>
<li>
normal text inside list <em>bold inside list</em>
</li>
<li>
another text in list...
</li>
</ul>
<!-- the media tag can be presented in two mode: with title visible, and title hidden -->
<!-- this is the case when the title is hidden -->
<img src="path/to/file.jpg" />
<!-- this is the case when the title is visible -->
<!-- the info tag (inside media tag) has an attribute isVisible="false" which means it doesn't have to be shown. -->
<!-- if the info tag has isVisible="true", the media tag must be translated into
<div>
<img src="path/to/file.jpg" />
<p>Image title <strong>in bold</strong> not in bold</p>
</div>
-->
<p>
A sample paragraph
</p>
</root>
There's a language specially designed for this task: it's called XSLT, and you can easily express your desired transformation in XSLT and invoke it from your PHP program. There's a learning curve, of course, but it's a much better solution than writing low-level DOM code.
In XSLT you write a set of template rules saying how individual elements should be handled. Many elements in your example are copied through unchanged, so you can start with a default rule that does this:
<xsl:template match="*">
<xsl:copy><xsl:apply-templates/></xsl:copy>
</xsl:template>
The "match" part says what part of the input you are matching; the body of the rule says what output to produce. The xsl:apply-templates does a recursive descent to process the children of the current element.
Some of your elements are simply renamed, for example
<xsl:template match="listitem">
<li><xsl:apply-templates/></li>
</xsl:template>
Some of the rules are a little bit more complex, but still easily expressed:
<xsl:template match="media/file[@isVisible='true']">
<img src="{href}"/>
</xsl:template>
I hope you agree that this declarative rule-based approach is much clearer than your procedural code; it's also much easier for someone else to change the rules in six months' time.
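To show how such rules fit together and are invoked from PHP, here is a minimal sketch. It assumes the xsl extension (XSLTProcessor) is enabled, covers only the em, list and listitem renames, and uses a shortened input document; the variable names are illustrative and the media rule is left out.

```php
<?php
// Stylesheet: a copy-through default rule plus three rename rules.
$xsl = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="yes"/>
<!-- default: copy the element with its attributes and recurse into its children -->
<xsl:template match="*">
<xsl:copy><xsl:copy-of select="@*"/><xsl:apply-templates/></xsl:copy>
</xsl:template>
<!-- renames -->
<xsl:template match="em"><strong><xsl:apply-templates/></strong></xsl:template>
<xsl:template match="list"><ul><xsl:apply-templates/></ul></xsl:template>
<xsl:template match="listitem"><li><xsl:apply-templates/></li></xsl:template>
</xsl:stylesheet>
XSL;

// Shortened stand-in for section.xml from the question.
$xml = '<root><p><em>bold</em>, normal</p>'
     . '<list><listitem>one <em>two</em></listitem></list></root>';

$style = new DOMDocument();
$style->loadXML($xsl);

$source = new DOMDocument();
$source->loadXML($xml);

$processor = new XSLTProcessor();
$processor->importStylesheet($style);

$output = trim($processor->transformToXml($source));
echo $output, "\n";
```

The more specific rules win over the match="*" rule because a name test has a higher default priority than the wildcard, so no explicit priority attributes are needed here.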
Well, maybe it's not the most correct idea, but why not just use str_replace? That way you will clearly see the list of changes to apply, and you can add or remove entries easily.
$file = file_get_contents('file.xml');
// replace both the opening and the closing tag
$file = str_replace(["<em>", "</em>"], ["<strong>", "</strong>"], $file);
file_put_contents('file.html', $file);
UPDATE (some more ideas regarding the changes in the question)
Using PHP + DOM here seems a little tricky (at least for me now). It may be more reasonable to use XSL / XSLT (Extensible Stylesheet Language Transformations). Something similar can be found here: How to replace a node-name with another in Xslt?
XSLT is designed specifically for this kind of transformation: http://en.wikipedia.org/wiki/XSLT

PHP function to compare equality of XML elements

I have an XML file which is supposed to be a backup of my phone's contacts, and I am trying to create a PHP file to retrieve only the contacts that have a phone number assigned to them. The file contains contacts from different applications.
The XML has these elements:
<Contact>
<Id>5238</Id>
<GivenName>friend1</GivenName>
<FullName>friendA</FullName>
<CreateTime>0001-01-01T00:00:00+00:00</CreateTime>
<ModifyTime>0001-01-01T00:00:00+00:00</ModifyTime>
<Starred>false</Starred>
<AccountName>SIM</AccountName>
<AccountType>com.anddroid.contacts.sim</AccountType>
</Contact>
<PhoneNumbers>
<Id>53</Id>
<ContactId>1380</ContactId>
<Name>2</Name>
<Value>07123456789</Value>
<Primary>2</Primary>
</PhoneNumbers>
<Contact>
<Id>328</Id>
<FamilyName>tee</FamilyName>
<GivenName>friend2</GivenName>
<FullName>friend2 tee</FullName>
<CreateTime>0001-01-01T00:00:00+00:00</CreateTime>
<ModifyTime>0001-01-01T00:00:00+00:00</ModifyTime>
<Picture>18948</Picture>
<Starred>false</Starred>
<AccountName>xxxxxxx@hotmail.com</AccountName>
<AccountType>com.htc.socialnetwork.facebook</AccountType>
</Contact>
And I want to make a php file that will retrieve the FullName from Contact and the Value from PhoneNumbers where the Contact/Id matches the PhoneNumbers/ContactId.
I created this code:
<?php
$xml = simplexml_load_file("Contact.xml");
$i=0;
$k=0;
foreach ($xml->Contact as $contact) {
if ($contact->AccountName == "SIM"){
echo "Contact: " . $k . "<br /> "; echo $contact->nodeValue[$k] . "<br /> " . $contact->FullName . "<br /> ";
$k++;
}
}
foreach ($xml->PhoneNumbers as $number) {
echo "Contact: " . $i . "<br /> "; echo $number->Value . "<br /> ";
$i++;
}
?>
It outputs 53 contacts and 173 numbers. If I leave out the if ($contact->AccountName == "SIM") check, it outputs the same numbers but 700+ contacts. I just want some help writing a function that outputs only the contacts for which I have a phone number.
Any help is appreciated.
Thank you
I would suggest using an XSL stylesheet:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
<xsl:template match="/">
<ul><xsl:apply-templates/></ul>
</xsl:template>
<xsl:template match="Contact">
<!-- select phonenumbers with the matching ContactId -->
<xsl:variable name="numbers" select="//PhoneNumbers[ContactId=current()/Id]"/>
<!-- when any matching PhoneNumber has been found, continue -->
<xsl:if test="count($numbers) > 0">
<li>
<xsl:value-of select="FullName"/>
<ul>
<!-- call a named template with the matching PhoneNumbers as param -->
<xsl:call-template name="printNumbers">
<xsl:with-param name="numbers" select="$numbers" />
</xsl:call-template>
</ul>
</li>
</xsl:if>
</xsl:template>
<xsl:template name="printNumbers">
<xsl:param name="numbers" />
<!-- loop through PhoneNumbers and print the Value -->
<xsl:for-each select="$numbers">
<li><xsl:value-of select="Value" /></li>
</xsl:for-each>
</xsl:template>
<xsl:template match="PhoneNumbers"/>
</xsl:stylesheet>
How to use the stylesheet:
<?php
$doc = new DOMDocument();
$xsl = new XSLTProcessor();
$doc->load('path/to/stylesheet.xsl');
$xsl->importStyleSheet($doc);
$doc->load('Contact.xml');
echo $xsl->transformToXML($doc);
?>
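If a stylesheet is more than you need, the same join can be sketched directly with SimpleXML. The root element name <Contacts>, the function name contactsWithNumbers and the trimmed sample data (with an Id chosen so that one contact actually matches a number) are all assumptions made for this example, since the real file's root is not shown.

```php
<?php
// Trimmed sample; <Contacts> is a hypothetical root element.
$xmlString = <<<'XML'
<Contacts>
<Contact><Id>1380</Id><FullName>friendA</FullName></Contact>
<Contact><Id>328</Id><FullName>friend2 tee</FullName></Contact>
<PhoneNumbers><Id>53</Id><ContactId>1380</ContactId><Value>07123456789</Value></PhoneNumbers>
</Contacts>
XML;

/**
 * Returns one entry per contact that has at least one phone number.
 * (Hypothetical helper, not part of any library.)
 */
function contactsWithNumbers(SimpleXMLElement $xml): array
{
    // First pass: index the numbers by the contact id they point to.
    $numbersByContact = [];
    foreach ($xml->PhoneNumbers as $number) {
        $numbersByContact[(string) $number->ContactId][] = (string) $number->Value;
    }

    // Second pass: keep only the contacts that appear in the index.
    $result = [];
    foreach ($xml->Contact as $contact) {
        $id = (string) $contact->Id;
        if (isset($numbersByContact[$id])) {
            $result[] = [
                'name'    => (string) $contact->FullName,
                'numbers' => $numbersByContact[$id],
            ];
        }
    }
    return $result;
}

$matches = contactsWithNumbers(simplexml_load_string($xmlString));
print_r($matches);
```

Indexing the numbers once avoids scanning every PhoneNumbers element for every contact, which matters with 700+ contacts.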

Why is my recursive loop creating too many children?

I'm using a PHP recursive loop to parse through an XML document to create a nested list, however for some reason the loop is broken and creating duplicates of elements within the list, as well as blank elements.
The XML (a list of family tree data) is structured as follows:
<?xml version="1.0" encoding="UTF-8"?>
<family>
<indi>
<id>id1</id>
<fn>Thomas</fn>
<bday></bday>
<dday></dday>
<spouse></spouse>
<family>
<indi>
<id>id1</id>
<fn>Alexander</fn>
<bday></bday>
<dday></dday>
<spouse></spouse>
<family>
</family>
</indi>
<indi>
<id>id1</id>
<fn>John</fn>
<bday></bday>
<dday></dday>
<spouse></spouse>
<family>
<indi>
<id>id1</id>
<fn>George</fn>
<bday></bday>
<dday></dday>
<spouse></spouse>
<family>
</family>
</indi>
</family>
</indi>
</family>
</indi>
</family>
And here's my PHP loop, which loads the XML file then loops through it to create a nested ul:
<?php
function outputIndi($indi) {
echo '<li>';
$id = $indi->getElementsByTagName('id')->item(0)->nodeValue;
echo '<span class="vcard person" id="' . $id . '">';
$fn = $indi->getElementsByTagName('fn')->item(0)->nodeValue;
$bday = $indi->getElementsByTagName('bday')->item(0)->nodeValue;
echo '<span class="edit fn">' . $fn . '</span>';
echo '<span class="edit bday">' . $bday . '</span>';
// ...
echo '</span>';
echo '<ul>';
$family = $indi->getElementsByTagName('family');
foreach ($family as $subIndi) {
outputIndi($subIndi);
}
echo '</ul></li>';
}
$doc = new DOMDocument();
$doc->load('armstrong.xml');
outputIndi($doc);
?>
EDIT: here's the desired outcome (nested lists, with ul's signifying families and li's signifying individuals):
<ul>
<li>
<span class="vcard">
<span class="fn">Thomas</span>
<span class="bday"></span>
<span class="dday"></span>
<ul>
... repeat for all ancestors ...
</ul>
<li>
<ul>
You can see the output at http://chris-armstrong.com/gortin . Any ideas where I'm going wrong? I think it's something to do with the $subIndi value, but anytime I try and change it I get an error. Would really appreciate any help!
Sounds perfect! Could you give me an example? Does this mean I can save the data as XML, then load it in as nested ul's?
Yes, you can do exactly that. Here's an XSL which renders nested UL's:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<html>
<body>
<h2>Family tree</h2>
<ul>
<li><xsl:value-of select="indi/fn" /></li>
<!-- apply-templates will select all the indi/family nodes -->
<xsl:apply-templates select="indi/family" />
</ul>
</body>
</html>
</xsl:template>
<xsl:template match="family">
<ul>
<li>
<div>
<xsl:value-of select="id" />: <xsl:value-of select="fn" />
(<xsl:value-of select="bday" />
to
<xsl:value-of select="dday" />)
</div>
</li>
<!-- This node matches the 'family' nodes, and we're going to apply-templates on the inner 'family' node,
so this is the same thing as recursion. -->
<xsl:apply-templates select="family" />
</ul>
</xsl:template>
</xsl:stylesheet>
I don't know php, but this article will show you how to transform XML using the style sheet above.
You can also link your style sheet by adding a stylesheet directive at the top of your XML file.
getElementsByTagName will give you all descendant nodes with that tag name, not just immediate children:
$family = $indi->getElementsByTagName('family');
foreach ($family as $subIndi) {
outputIndi($subIndi);
}
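As a result, outputIndi() is called again for grandchildren and deeper descendants, which have already been reached through recursion.
Here is an example (from another Stack Overflow question) that visits only direct children: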
for ($n = $indi->firstChild; $n !== null; $n = $n->nextSibling) {
if ($n instanceof DOMElement && $n->tagName == "family") {
outputIndi($n);
}
}
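Putting the direct-children idea into a complete recursion might look like the sketch below. The helper names childElements and renderFamily are hypothetical, the sample XML is a shortened version of the family tree with distinct ids, and only the fn field is rendered.

```php
<?php
// Shortened family tree; ids are made distinct for the example.
$xml = <<<'XML'
<family>
<indi><id>id1</id><fn>Thomas</fn>
<family>
<indi><id>id2</id><fn>Alexander</fn><family></family></indi>
<indi><id>id3</id><fn>John</fn>
<family><indi><id>id4</id><fn>George</fn><family></family></indi></family>
</indi>
</family>
</indi>
</family>
XML;

// Direct child elements with the given tag name (no deeper descendants).
function childElements(DOMNode $parent, string $tag): array
{
    $found = [];
    for ($n = $parent->firstChild; $n !== null; $n = $n->nextSibling) {
        if ($n instanceof DOMElement && $n->tagName === $tag) {
            $found[] = $n;
        }
    }
    return $found;
}

// One <ul> per family, one <li> per individual; empty families render nothing.
function renderFamily(DOMElement $family): string
{
    $items = '';
    foreach (childElements($family, 'indi') as $indi) {
        $fn = childElements($indi, 'fn');
        $name = $fn ? htmlspecialchars($fn[0]->textContent) : '';
        $items .= '<li><span class="fn">' . $name . '</span>';
        foreach (childElements($indi, 'family') as $subFamily) {
            $items .= renderFamily($subFamily);
        }
        $items .= '</li>';
    }
    return $items === '' ? '' : '<ul>' . $items . '</ul>';
}

$doc = new DOMDocument();
$doc->loadXML($xml);
$listHtml = renderFamily($doc->documentElement);
echo $listHtml, "\n";
```

Each individual is visited exactly once, because every level only looks at its own children and leaves deeper levels to the recursive call.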
Replace this
$family = $indi->getElementsByTagName('family');
foreach ($family as $subIndi) {
outputIndi($subIndi);
}
by this
if(!empty($indi))
foreach($indi as $subIndi){
outputIndi($subIndi);
}
I realize
if($indi->hasChildNodes())
is better than
if(!empty($indi))
