Transform complex and variable xml - php

I have a complex XML document that I want to transform into HTML. Some tags need to be replaced with HTML tags.
The XML is this:
<root>
<div>
<p>
<em>bol text</em>, some normale text
</p>
</div>
<list>
<listitem>
normal text inside list <em>bold inside list</em>
</listitem>
<listitem>
another text in list...
</listitem>
</list>
<p>
A sample paragraph
</p>
</root>
The text inside the elements is variable, which means that other XML documents I parse can be completely different.
The output I want is this (for this scenario):
<root>
<div>
<p>
<strong>bol text</strong>, some normale text
</p>
</div>
<ul>
<li>
normal text inside list <strong>bold inside list</strong>
</li>
<li>
another text in list...
</li>
</ul>
<p>
A sample paragraph
</p>
</root>
I wrote a recursive function that visits every node of the XML and replaces it with the corresponding HTML tag, but it doesn't work:
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->load('section.xml');
echo $doc->saveHTML();
function printHtml(DOMNode $node)
{
if ($node->hasChildNodes())
{
foreach ($node->childNodes as $child)
{
printHtml($child);
}
}
if ($node->nodeName == 'em')
{
$newNode = $node->ownerDocument->createElement('strong', $node->nodeValue);
$node->parentNode->replaceChild($newNode, $node);
}
if ($node->nodeName == 'listitem')
{
$newNode = $node->ownerDocument->createElement('li', $node->nodeValue);
$node->parentNode->replaceChild($newNode, $node);
}
}
Can anyone help me?
This is an example of a complete xml:
<root>
<div>
<p>
<em>bol text</em>, some normale text
</p>
</div>
<list>
<listitem>
normal text inside list <em>bold inside list</em>
</listitem>
<listitem>
another text in list...
</listitem>
</list>
<media>
<info isVisible="false">
<title>
<p>Image title <em>in bold</em> not in bold</p>
</title>
</info>
<file isVisible="true">
<href>
"path/to/file.jpg"
</href>
</file>
</media>
<p>
A sample paragraph
</p>
</root>
Which has to be transformed into:
<root>
<div>
<p>
<strong>bol text</strong>, some normale text
</p>
</div>
<ul>
<li>
normal text inside list <strong>bold inside list</strong>
</li>
<li>
another text in list...
</li>
</ul>
<!-- the media tag can be rendered in two modes: with the title visible, and with the title hidden -->
<!-- this is the case when the title is hidden -->
<img src="path/to/file.jpg" />
<!-- this is the case when the title is visible -->
<!-- the info tag (inside the media tag) has an attribute isVisible="false", which means it must not be shown. -->
<!-- if the info tag has isVisible="true", the media tag must be translated into
<div>
<img src="path/to/file.jpg" />
<p>Image title <strong>in bold</strong> not in bold</p>
</div>
-->
<p>
A sample paragraph
</p>
</root>

There's a language specially designed for this task: it's called XSLT, and you can easily express your desired transformation in XSLT and invoke it from your PHP program. There's a learning curve, of course, but it's a much better solution than writing low-level DOM code.
In XSLT you write a set of template rules saying how individual elements should be handled. Many elements in your example are copied through unchanged, so you can start with a default rule that does this:
<xsl:template match="*">
<xsl:copy><xsl:apply-templates/></xsl:copy>
</xsl:template>
The "match" part says what part of the input you are matching; the body of the rule says what output to produce. The xsl:apply-templates does a recursive descent to process the children of the current element.
Some of your elements are simply renamed, for example
<xsl:template match="listitem">
<li><xsl:apply-templates/></li>
</xsl:template>
Some of the rules are a little bit more complex, but still easily expressed:
<xsl:template match="media/file[@isVisible='true']">
<img src="{href}"/>
</xsl:template>
I hope you agree that this declarative rule-based approach is much clearer than your procedural code; it's also much easier for someone else to change the rules in six months' time.
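To show how such rules can be invoked from PHP, here is a minimal sketch using the XSLTProcessor class, which needs PHP's xsl extension. The em to strong and list to ul rules are assumptions based on the desired output in the question, the inline sample input is made up for illustration, and the media handling is left out.

```php
<?php
// Sketch only: the stylesheet extends the rules above with assumed mappings
// (em -> strong, list -> ul, listitem -> li).
$xsl = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <!-- Default rule: copy the element and its attributes, then recurse. -->
  <xsl:template match="*">
    <xsl:copy><xsl:copy-of select="@*"/><xsl:apply-templates/></xsl:copy>
  </xsl:template>
  <xsl:template match="em"><strong><xsl:apply-templates/></strong></xsl:template>
  <xsl:template match="list"><ul><xsl:apply-templates/></ul></xsl:template>
  <xsl:template match="listitem"><li><xsl:apply-templates/></li></xsl:template>
</xsl:stylesheet>
XSL;

// Made-up sample input in the shape of the question's XML.
$xml = '<root><list><listitem>normal text <em>bold inside list</em></listitem></list></root>';

$source = new DOMDocument();
$source->loadXML($xml);

$stylesheet = new DOMDocument();
$stylesheet->loadXML($xsl);

$processor = new XSLTProcessor();
$processor->importStylesheet($stylesheet);

// transformToXml() returns the result as a string.
$html = $processor->transformToXml($source);
echo $html;
```

Adding a new mapping is then a matter of adding one more template rule to the stylesheet; the PHP side stays the same.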

Well, maybe it's not the most correct idea, but why not just use str_replace? That way you can clearly see the list of changes to apply, and add or remove entries easily.
$file = file_get_contents('file.xml');
$file = str_replace(['<em>', '</em>'], ['<strong>', '</strong>'], $file);
file_put_contents('file.html', $file);
UPDATE (Some more ideas regarding the changes in the question)
This seems a little bit tricky (at least for me now) to do with PHP + DOM. Maybe it would be more reasonable to use XSL / XSLT (Extensible Stylesheet Language Transformations). In that case, something similar can be found here: How to replace a node-name with another in Xslt?
XSLT is designed specifically for this kind of transformation: http://en.wikipedia.org/wiki/XSLT
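If you prefer to stay with DOM, note that createElement('li', $node->nodeValue) in the question copies only the text, so nested elements are lost. A minimal sketch of a rename that moves the children instead could look like this; renameElement and the $map array are hypothetical names, and the inline sample is made up for illustration.

```php
<?php
// Hypothetical helper: rename an element while keeping its attributes and children.
function renameElement(DOMElement $old, string $newName): DOMElement
{
    $new = $old->ownerDocument->createElement($newName);

    foreach (iterator_to_array($old->attributes) as $attr) {
        $new->setAttribute($attr->nodeName, $attr->nodeValue);
    }

    // Move (not copy) every child so nested markup survives the rename.
    while ($old->firstChild !== null) {
        $new->appendChild($old->firstChild);
    }

    $old->parentNode->replaceChild($new, $old);
    return $new;
}

// Assumed mapping, taken from the desired output in the question.
$map = ['em' => 'strong', 'list' => 'ul', 'listitem' => 'li'];

$doc = new DOMDocument();
$doc->loadXML('<root><list><listitem>text <em>bold</em></listitem></list></root>');
$xpath = new DOMXPath($doc);

foreach ($map as $from => $to) {
    // Snapshot the matches before changing the tree.
    foreach (iterator_to_array($xpath->query('//' . $from)) as $node) {
        renameElement($node, $to);
    }
}

$result = $doc->saveXML($doc->documentElement);
echo $result;
```

This covers plain renames only; the conditional media handling would still need its own logic.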

Related

XPath - How to extract element following by the parent parent H1 tag

I am trying to extract the blog post from web pages. Different pages have different structures, so it is difficult to extract what I need. There is also some CSS and JS code in the HTML, which I have to skip.
I know the <h1> a dummy title </h1> in advance, so it can help to identify the right section.
I do not know any id or class attributes.
<body>
<div>
<h1> a dummy title </h1>
</div>
<script> function loadDoc() {const xhttp = new XMLHttpRequest();} </script>
<div class="subtitle">
<p>...</p>
</div>
<div class="blog-post">
<p>...</p>
<div class="clear-fix">...</div>
<p>...</p>
<div class="clear-fix">...</div>
<p>...</p>
<p>...</p>
</div>
<div class="another-section">
<p>...</p>
<p>...</p>
</div>
<div class="another-another-section">
<p>...</p>
<p>...</p>
<div class="clear-fix">...</div>
<p>...</p>
<p>...</p>
<p>...</p>
</div>
</body>
What I have tried with:
I have tried to find the <div> with the most <p> elements, but sometimes other <div>s contain just as many; I have to rule those out by picking the one nearest to the <h1>.
$html=
'[My html above]
';
$HTMLDoc = new DOMDocument();
$HTMLDoc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
$xpath = new DOMXPath($HTMLDoc);
// locate the divs that contain p elements
$pees = $xpath->query('//div[.//p]');
$pchilds = [];
// get the number of p children in each div
foreach ($pees as $pee) {
$pchilds[] = (int) $xpath->evaluate('count(p)', $pee);
}
// now find the div with the max number of p children
foreach ($pees as $pee) {
if ((int) $xpath->evaluate('count(p)', $pee) === max($pchilds)) {
echo $pee->nodeValue;
// or do whatever
}
}
Find all divs with p elements, then count the p elements inside each, and finally take the first one with the max() count:
$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXpath($document);
$dcnt = array();
// Find all divs following an H1
$divs = $xpath->query('//h1/following-sibling::div');
// Count `p` inside them
foreach($divs as $idx=>$d) {
$cnt = (int) $xpath->evaluate('count(.//p)', $d);
$dcnt[$idx] = $cnt;
}
// show content of div with max() count
foreach($divs as $idx=>$d) {
if( $dcnt[$idx] == max($dcnt) ){
print $idx . ': ' . $divs[$idx]->nodeName . ': ' . $divs[$idx]->nodeValue;
break;
}
}
XPath 2.0 solution (see SaxonC for PHP support). Find the first div after the <h1> that contains the maximum number of <p> elements:
//h1/following-sibling::div[p[max(//h1/following-sibling::div/count(p))]][1]
Output :
<div class="blog-post">
<p>...</p>
<div class="clear-fix">...</div>
<p>...</p>
<div class="clear-fix">...</div>
<p>...</p>
<p>...</p>
</div>
XPath 1.0 approximate solution (could return the wrong div) :
//h1/following-sibling::div[count(./p)>1][count(./p)>count(./preceding-sibling::div[./p][1]/p)][count(./p)>count(./following-sibling::div[./p][1]/p)][1]
In a comment you added,
want to find the first <h1> then I want to find the most nearest
<div> having max <p>.
There can be another tags in that <div> but I want to print
<p> tags only.
If your PHP processor has exslt support
something like this should be possible:
< file xmlstarlet select --template \
--var T='//div[p][contains(preceding::h1[1],"my title")]' \
--copy-of '($T[count(p) = math:max(dyn:map($T,"count(p)"))])[1]/p'
where
the T variable selects the div nodes of interest, assuming
you know, or can extract, the h1 section header text
dyn:map
maps each div to the count of its p children,
math:max
picks the maximum count
($T[…])[1]/p selects the p children of the first of possibly
more divs with a maximum p count
The command above uses xmlstarlet syntax; to make a single XPath
expression replace $T (2 places) with T contents inside parentheses.
It executes the following XSLT stylesheet (add -C before --template
to list it):
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:math="http://exslt.org/math" xmlns:dyn="http://exslt.org/dynamic" version="1.0" extension-element-prefixes="math dyn">
<xsl:output omit-xml-declaration="yes" indent="no"/>
<xsl:template match="/">
<xsl:variable select="//div[p][contains(preceding::h1[1],&quot;my title&quot;)]" name="T"/>
<xsl:copy-of select="($T[count(p) = math:max(dyn:map($T,&quot;count(p)&quot;))])[1]/p"/>
</xsl:template>
</xsl:stylesheet>

Regex - Replacing content - eZ Publish XML field

I have XML content that I want to modify before using the eZ Publish 5 API to create it.
I am trying to write a regex to modify the content.
Here is the XML code that I have (with HTML entities):
Print of Xml code http://img15.hostingpics.net/pics/453268xmlcode.jpg
I want to be able to catch empty.jpg in :
<img alt="" src="http://www.asite.org/empty.jpg" />
And replace the whole line for each occurrence by :
<custom name="my_checkbox"></custom>
Problem :
The img tag can sometimes contain other attributes like : height="15" width="12"
<img height="15" alt="" width="12" src="http://www.asite.org/empty.jpg" />
And sometimes the attributes are after the src attribute in a different order.
The aim would be :
Xml code - Aim http://img15.hostingpics.net/pics/318980xmlcodeaim.jpg
I've tried many things so far but nothing worked.
Thanks in advance for helping.
Cheers !
EDIT :
Here is an example of what I've tried so far:
/(<img [a-z = ""]* src="http:\/\/www\.asite\.org\/empty\.jpg" \/&gt)/g
Since this is XML, I used an XML parser to reach the desired section.
Then we can apply a regex (~<img.*?>(?=</span)~) to select the image tag and replace it with your custom tag (note that in the object returned by the XML parser, the HTML entities are replaced with their equivalent characters).
This piece of code emulates and handles your situation:
<?php
$xmlstr = <<<XML
<sections>
<section>
<paragraph>
<literal class="html">
<img alt="" src="http://asite.org/empty.png" /></span></span> Yes/no&nbsp;<br />
<img alt="" src="http://asite.org/empty.png" /></span></span> Other text/no&nbsp;<br />
</literal>
</paragraph>
</section>
</sections>
XML;
$sections = new SimpleXMLElement($xmlstr);
foreach ($sections->section->paragraph as $paragraph) {
$re = "~<img.*?>(?=</span)~";
$subst = "<custom name=\"my_checkbox\"></custom>";
$paragraph->literal = preg_replace($re, $subst, $paragraph->literal);
}
echo $sections->asXML();
?>
The output is:
<?xml version="1.0"?>
<sections>
<section>
<paragraph>
<literal class="html">
<custom name="my_checkbox"></custom></span></span> Yes/no&nbsp;<br />
<custom name="my_checkbox"></custom></span></span> Other text/no&nbsp;<br />
</literal>
</paragraph>
</section>
</sections>
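The question also mentions that the img tag may carry extra attributes in any order. As a rough sketch, a pattern keyed on the src value rather than on the attribute order could be used on the decoded text; it assumes double-quoted attributes and no '>' inside attribute values, and the sample string is made up for illustration.

```php
<?php
// Sketch: match any <img ...> whose src ends in /empty.jpg, whatever the
// attribute order. Assumes double-quoted attributes without '>' in them.
$pattern = '~<img\b[^>]*\bsrc="[^"]*/empty\.jpg"[^>]*>~i';
$replacement = '<custom name="my_checkbox"></custom>';

// Made-up sample: two placeholder images and one real image.
$html = '<img height="15" alt="" width="12" src="http://www.asite.org/empty.jpg" /> Yes'
      . '<img src="http://www.asite.org/empty.jpg" alt="" /> No'
      . '<img src="http://www.asite.org/photo.jpg" alt="" />';

$result = preg_replace($pattern, $replacement, $html);
echo $result;
```

Only the two empty.jpg images are replaced; the third image is left untouched because its src does not end in /empty.jpg.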

PHP DOM - How to insert element in a specific position?

In a nutshell, this is what I'm trying to do:
Get all <img> tags from a document
Set a data-src attribute (for lazy loading)
Empty their sources (for lazy loading)
Inject a <noscript> tag after this image
1-3 are fine. I just can't get the created <noscript> tag to be beside the image correctly.
I'm trying with insertBefore but I'm open for suggestions:
// Create a DOMDocument instance
$dom = new DOMDocument;
$dom->formatOutput = true;
$dom->preserveWhiteSpace = false;
// Loads our content as HTML
$dom->loadHTML($content);
// Get all of our img tags
$images = $dom->getElementsByTagName('img');
// How many of them
$len = count($images);
// Loop through all the images in this content
for ($i = 0; $i < $len; $i++) {
// Reference this current image
$image = $images->item($i);
// Create our fallback image before changing this node
$fallback_image = $image->cloneNode();
// Add the src as a data-src attribute instead
$image->setAttribute('data-src', $src);
// Empty the src of this img
$image->setAttribute('src', '');
// Now prepare our <noscript> markup
// E.g <noscript><img src="foobar.jpg" /></noscript>
$noscript = $dom->createElement("noscript");
$noscript->appendChild( $fallback_image );
$image->parentNode->insertBefore( $noscript, $image );
}
return $dom->saveHTML();
Having two images in the page, this is the result (abbreviated for clarity's sake):
Before:
<div>
<img />
<p />
</div>
<p>
<img />
</p>
After:
<div>
<img /> <!-- this should be the fallback wrapped in <noscript> that is missing -->
<p>
<img />
</p>
</div>
<p>
<img /> <!-- nothing happened here -->
</p>
Using $dom->appendChild works but the <noscript> tag should be beside the image and not at the end of the document.
My PHP skills are very rusty so I'd appreciate any clarification or suggestions.
UPDATE
Just realised saveHTML() was also adding <DOCTYPE><html><body> tags, so I've added a preg_replace (until I find a better solution) to take care of removing that.
Also, the output I pasted earlier was based on the inspector in Chrome's Developer Tools.
I checked View Source to see what was really going on (and thus found out about the tag).
This is what's really happening:
https://eval.in/114620
<div>
<img /> </noscript> <!-- wha? just a closing noscript tag -->
<p />
</div>
<p>
<img /> <!-- nothing happened here -->
</p>
SOLVED
So this is how I fixed it:
https://eval.in/117959
I think it's a good idea to work with new nodes only after they have been inserted into the DOM:
$noscript = $dom->createElement("noscript");
$noscriptnode = $image->parentNode->insertBefore( $noscript, $image );
// Only now work with noscript by adding its contents etc...
Also, when it's inserted with insertBefore, it's a good idea to save its reference.
$noscriptnode = $image->parentNode->insertBefore( $noscript, $image );
And another thing: I was running this code within WordPress. Some hooks were being run afterwards, which was messing up my markup.
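Putting these points together, a minimal sketch of the whole routine might look like the following. lazyLoadImages is a hypothetical name, the sample markup is made up, and the snapshot via iterator_to_array is an addition here: getElementsByTagName() returns a live list, so the cloned fallback images would otherwise show up in the loop.

```php
<?php
// Sketch only; lazyLoadImages is a hypothetical function name.
function lazyLoadImages(string $content): string
{
    $dom = new DOMDocument();
    // Suppress parser warnings from imperfect markup and skip the
    // automatic doctype/html/body wrappers.
    @$dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    // Snapshot the live node list before adding the cloned images.
    $images = iterator_to_array($dom->getElementsByTagName('img'));

    foreach ($images as $image) {
        // Clone first, while the original src is still in place.
        $fallback = $image->cloneNode(true);

        $image->setAttribute('data-src', $image->getAttribute('src'));
        $image->setAttribute('src', '');

        // Insert the empty <noscript> right after the image, keep the
        // returned reference, and only then fill it.
        $noscript = $image->parentNode->insertBefore(
            $dom->createElement('noscript'),
            $image->nextSibling // null means "append at the end of the parent"
        );
        $noscript->appendChild($fallback);
    }

    return $dom->saveHTML();
}

$out = lazyLoadImages('<div><img src="a.jpg"><p>text</p></div>');
echo $out;
```

Passing $image->nextSibling as the reference node is what places the noscript element after the image rather than before it.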

Xpath, php and how to skip specific node (and its children)

I've just started tooling around with XPath recently.
Currently I'm just parsing some pages line by line and taking the relevant text.
What I'd like to do is exclude a div at the top and its child elements.
Basically I'm looking at this :
<html>
<head> Foo </head>
<body>
<div id='header'>
<ul id='menu'> <li> Bar </li> <li> FooBar </li> <li> BarFoo </li> </ul>
</div>
<table> <tr> <td>data</td><td>data</td> </tr> </table>
<div>
<p>Lorem Ipsum</p>
<p>dolor sit amet</p>
</div>
</body>
</html>
Except much more content.
Currently I loop through every node with :
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.test.com/test.htm');
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('/html/body//*');
foreach($nodes as $node) {
echo $node->nodeValue;
}
I want to ignore the entire header node.
Is there a simple way to just do that?
This would work:
/html/body//*[not(ancestor-or-self::div[@id="header"])]
The XPath selects all elements below the body element, except the div whose id attribute value is "header" and any of its descendants.
Check http://schlitt.info/opensource/blog/0704_xpath.html for an XPath tutorial.
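As a rough sketch of how this fits into the PHP loop, the variant below selects text nodes instead of elements, because echoing nodeValue for every element would repeat nested text. That change, the inline sample markup, and the $texts array are assumptions made for illustration.

```php
<?php
// Made-up sample in the shape of the question's page.
$html = <<<'HTML'
<html><body>
<div id="header"><ul id="menu"><li>Bar</li><li>FooBar</li></ul></div>
<table><tr><td>data</td><td>data</td></tr></table>
<div><p>Lorem Ipsum</p><p>dolor sit amet</p></div>
</body></html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Non-empty text nodes under <body> that are not inside the header div.
$nodes = $xpath->query(
    '/html/body//text()[normalize-space()][not(ancestor::div[@id="header"])]'
);

$texts = [];
foreach ($nodes as $node) {
    $texts[] = trim($node->nodeValue);
}

print_r($texts);
```

The menu entries Bar and FooBar do not appear in $texts, while the table cells and paragraphs do.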

XPath to query multiple selectors

I want to get values and attributes from a selector,
and then get the attributes and values of its children based on a query.
Allow me to give an example.
This is the structure:
<div class='message'>
<div>
<a href='http://www.whatever.com'>Text</a>
</div>
<div>
<img src='image_link.jpg' />
</div>
</div>
<div class='message'>
<div>
<a href='http://www.whatever2.com'>Text2</a>
</div>
<div>
<img src='image_link2.jpg' />
</div>
</div>
So I would like to make a query to match all of those at once.
Something like this:
//$dom is the DomDocument() set up after loaded HTML with $dom->loadHTML($html);
$dom_xpath = new DOMXpath($dom);
$elements = $dom_xpath->query('//div[@class="message"], //div[@class="message"] //a, //div[@class="message"] //img');
foreach($elements as $ele){
echo $ele[0]->getAttribute('class'); //it should return 'message'
echo $ele[1]->getAttribute('href'); //it should return 'http://www.whatever.com' in the 1st loop, and 'http://www.whatever2.com' in the second loop
echo $ele[2]->getAttribute('src'); //it should return image_link.jpg in the 1st loop and 'image_link2.jpg' in the second loop
}
Is there some way of doing that using multiple XPath selectors like I did in the example, to avoid running separate queries and save some CPU?
Use the union operator (|) in a single expression like this:
//div[@class="message"]|//div[@class="message"]//a|//div[@class="message"]//img
Note that this will return a flattened result set (so to speak). In other words, you won't access the elements in groups of three like your example shows. Instead, you'll just iterate everything the expressions matched (in document order). For this reason, it might be even smarter to simply iterate the nodes returned by //div[#class="message"] and use DOM methods to access their children (for the other elements).
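A minimal sketch of that second approach, using relative XPath queries with each message div as the context node, might look like this. The $messages array and the inline sample markup are assumptions for illustration.

```php
<?php
// Made-up sample in the shape of the question's markup.
$html = <<<'HTML'
<div class="message"><div><a href="http://www.whatever.com">Text</a></div><div><img src="image_link.jpg"></div></div>
<div class="message"><div><a href="http://www.whatever2.com">Text2</a></div><div><img src="image_link2.jpg"></div></div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$messages = [];
foreach ($xpath->query('//div[@class="message"]') as $div) {
    // Expressions starting with "." are evaluated relative to the current div.
    $messages[] = [
        'class' => $div->getAttribute('class'),
        'href'  => $xpath->evaluate('string(.//a/@href)', $div),
        'src'   => $xpath->evaluate('string(.//img/@src)', $div),
    ];
}

print_r($messages);
```

Each entry of $messages then holds the three values for one message, which is the grouping the question's loop was trying to get.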
Use:
(//div[@class='message'])[$k]//@*
This selects all three attributes that belong to the $k-th div (and any of its descendants) in the document whose class attribute has string value "message"
You can evaluate N such XPath expressions -- for $k from 1 to N, where N is the total count of //div[@class='message']
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:for-each select="//div[@class='message']">
<xsl:variable name="vPos" select="position()"/>
<xsl:apply-templates select=
"(//div[@class='message'])[0+$vPos]//@*"/>
================
</xsl:for-each>
</xsl:template>
<xsl:template match="@*">
<xsl:value-of select=
"concat('name = ', name(), ' value = ', ., '&#xA;')"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied to the provided XML document (wrapped in a single top element to become well-formed):
<html>
<div class='message'>
<div>
<a href='http://www.whatever.com'>Text</a>
</div>
<div>
<img src='image_link.jpg' />
</div>
</div>
<div class='message'>
<div>
<a href='http://www.whatever2.com'>Text2</a>
</div>
<div>
<img src='image_link2.jpg' />
</div>
</div>
</html>
The XPath expression is evaluated twice and the selected attributes are formatted and output:
name = class value = message
name = href value = http://www.whatever.com
name = src value = image_link.jpg
================
name = class value = message
name = href value = http://www.whatever2.com
name = src value = image_link2.jpg
================
