Replace span's in PHP but keep content inside - php

I have the following string:
<span style="font-size: 13px;">
<span style="">
<span style="">
<span style="font-family: Roboto, sans-serif;">
<span style="">
Some text content
</span>
</span>
</span>
</span>
</span>
and I want to change this string to the following using PHP:
<span style="font-size: 13px;">
<span style="font-family: Roboto, sans-serif;">
Some text content
</span>
</span>
I dont have any idea, how to do that, because when I try to use str_replace to replace the <span style=""> I dont know, how to replace the </span> and keep the content inside. My next problem is, that I dont know exactly, how much <span style=""> I have in my string. I also have not only 1 of this blocks in my string.
Thanks in advance for your help, and maybe sorry for my stupid question - I'm still learning.

This is easily done with a proper HTML parser. PHP has DOMDocument which can parse X/HTML into the Document Object Model which can then be manipulated how you want.
The trick to solving this problem is being able to recursively traverse the DOM tree, seeking out each node, and replacing the ones you don't want. To this I've written a short helper method by extending DOMDocument here...
$html = <<<'HTML'
<span style="font-size: 13px;">
<span style="">
<span style="">
<span style="font-family: Roboto, sans-serif;">
<span style="">
Some text content
</span>
</span>
</span>
</span>
</span>
HTML;
class MyDOMDocument extends DOMDocument {
public function walk(DOMNode $node, $skipParent = false) {
if (!$skipParent) {
yield $node;
}
if ($node->hasChildNodes()) {
foreach ($node->childNodes as $n) {
yield from $this->walk($n);
}
}
}
}
libxml_use_internal_errors(true);
$dom = new MyDOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$keep = $remove = [];
foreach ($dom->walk($dom->childNodes->item(0)) as $node) {
if ($node->nodeName !== "span") { // we only care about span nodes
continue;
}
// we'll get rid of all span nodes that don't have the style attribute
if (!$node->hasAttribute("style") || !strlen($node->getAttribute("style"))) {
$remove[] = $node;
foreach($node->childNodes as $child) {
$keep[] = [$child, $node];
}
}
}
// you have to modify them one by one in reverse order to keep the inner nodes
foreach($keep as [$a, $b]) {
$b->parentNode->insertBefore($a, $b);
}
foreach($remove as $a) {
if ($a->parentNode) {
$a->parentNode->removeChild($a);
}
}
// Now we should have a rebuilt DOM tree with what we expect:
echo $dom->saveHTML();
Output:
<span style="font-size: 13px;">
<span style="font-family: Roboto, sans-serif;">
Some text content
</span>
</span>

For a more general way to modify HTML document, take a look at XSLT (Extensible Stylesheet Language Transformations). PHP has a XSLT library.
You then have an XML document with your transform rules in place:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="html" indent="yes"/>
<!-- remove spans with empty styles -->
<xsl:template match="*[#style and string-length(./#style) = 0]">
<xsl:apply-templates />
</xsl:template>
<!-- catch all to copy any elements that aren't matched in other templates -->
<xsl:template match="*">
<xsl:copy select=".">
<!-- copy the attributes of the element -->
<xsl:copy-of select="#*" />
<!-- continue applying templates to this element's children -->
<xsl:apply-templates select="*" />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
Then your PHP:
$sourceHtml = new DOMDocument();
$sourceHtml->load('source.html');
$xsl = new DOMDocument();
$xsl->load('transform.xsl');
$xsltProcessor = new XSLTProcessor;
$xsltProcessor->importStyleSheet($xsl); // attach the xsl rules
echo $xsltProcessor->transformToXML($sourceHtml);
$transformedHtml = $xsltProcessor->transformToDoc($sourceHtml);
$transformedHtml->saveHTMLFile('transformed.html');
XSLT is superpowerful for this kind of thing, and you can set all sorts of rules for parent/sibling relationships, and modify attributes and content accordingly.

Related

how can i retrieve data from nested xml node using php?

I am new in xml and data retrieve and i have problem with this code.
XML code:
<?xml version="1.0" encoding="UTF-8"?>
<site>
<page>
<content>
<P>
<FONT size="2" face="Tahoma">
<STRONG>text...</STRONG>
</FONT>
</P>
<P>
<FONT size="2" face="Tahoma">text....</FONT>
</P>
<P align="center">
<IMG style="WIDTH: 530px" border="1" alt="" src="http://www.alkul.com/online/2014/5/6/child%20disorder.jpg">
</P>
<P>
<STRONG>
<FONT size="2" face="Tahoma">text3</FONT>
</STRONG>
</P>
<P>
<STRONG>
<FONT size="2" face="Tahoma">text1</FONT>
</STRONG>
</P>
</content>
</page>
</site>
php code:
<?php
$html = "";
$url = "Data.xml";
$xml = simplexml_load_file($url);
for ($i = 0; $i<10; $i++) {
$title = $xml->page[$i]->content->P->FONT;
$html .= "<p>$title</p>";
}
echo $html;
I just need to display the content of content node but the output is empty
First of all, the provided XML is not valid as you should receive the following error:
Warning: simplexml_load_string(): Entity: line 8: parser error : Opening and ending tag mismatch: IMG line 8 and P
In XML the IMG element needs to be closed like this:
<IMG style="WIDTH: 530px" border="1" alt="" src="http://www.alkul.com/online/2014/5/6/child%20disorder.jpg"/>
Note the forward slash at the end of the element.
If you do not see that error, please look in your error log or enable error reporting in PHP.
Now the XML can be parsed by SimpleXML. I ended up with this:
$pList = $xml->xpath('./page/content/P');
foreach ($pList as $pElement) {
$text = strip_tags($pElement->asXML());
echo $text . "<br>";
}
It selects all the P elements into $pList and iterates over the list. For each element it takes the XML and strips all tags from it, leaving you with just the "inner text" for each element.
Lastly, I'd suggest you use the PHP Simple HTML DOM Parser as it is quite easy to use and more tailored towards scraping data from HTML.
If you only want to display what is in the content node so here is your code
<?php
$html = "";
$url = "data.xml";
$xml = simplexml_load_file($url);
$title = $xml->page->content->asXML();
$html .= "<p>$title</p>";
echo $html;
You have HTML inside an XML node. This needs XML encoding, normally done with a CDATA block. You then can just use the $xml->page->content element with echo or by casting it to string.
XML (take note of the <![CDATA[ ... ]]> part):
<?xml version="1.0" encoding="UTF-8"?>
<site>
<page>
<content><![CDATA[
<P>
<FONT size="2" face="Tahoma">
<STRONG>text...</STRONG>
</FONT>
</P>
<P>
<FONT size="2" face="Tahoma">text....</FONT>
</P>
<P align="center">
<IMG style="WIDTH: 530px" border="1" alt="" src="http://www.alkul.com/online/2014/5/6/child%20disorder.jpg">
</P>
<P>
<STRONG>
<FONT size="2" face="Tahoma">text3</FONT>
</STRONG>
</P>
<P>
<STRONG>
<FONT size="2" face="Tahoma">text1</FONT>
</STRONG>
</P>
]]></content>
</page>
</site>
PHP:
$xml = simplexml_load_file($url);
$firstTenPages = new LimitIterator(new IteratorIterator($xml->page), 0, 10);
foreach ($firstTenPages as $page)
{
echo $page->content;
}

Preg replace divs with li but keep class active

Trying to get my head around how to create a PHP preg replace for a string that will convert
<div class="active make_link">1</div>
<div class="make_link digit">2</div>
<div class="make_link digit">3</div>
etc
to
<li class="active">1</li>
<li>2</li>
<li>3</li>
etc
Figured out how to replace the elements but not how to keep the class active.
$new_pagination = preg_replace('/<div[^>]*>(.*)<\/div>/U', '<li>$1</li>', $old_pagination);
Any ideas?
Try this..You can do this using str_ireplace too
<?php
$html='<div class="active make_link">1</div>
<div class="make_link digit">2</div>
<div class="make_link digit">3</div>';
echo str_ireplace(array('<div','</div','class="active make_link"','class="make_link digit"'),array('<li','</li','active',''),$html);
Or simple html dom:
require_once('simple_html_dom.php');
$doc = str_get_html($string);
foreach($doc->find('div') as $div){
$div->tag = 'li';
preg_match('/active/', $div->class, $m);
$div->class = #$m[0];
}
echo $doc;
This may seem a bit excessive, but it's a good use-case for XSLT:
$xslt = <<<XML
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="#*|node()">
<xsl:copy><xsl:apply-templates select="#*|node()" /></xsl:copy>
</xsl:template>
<xsl:template match="div">
<li>
<xsl:if test="#*[name()='class' and contains(., 'active')]">
<xsl:attribute name="class">active</xsl:attribute>
</xsl:if>
<xsl:apply-templates select="node()" />
</li>
</xsl:template>
</xsl:stylesheet>
XML;
It uses the identity rule and then overrides handling for <div>, adding a class="active" for nodes that have such a class name.
$xsl = new XSLTProcessor;
$doc = new DOMDocument;
$doc->loadXML($xslt);
$xsl->importStyleSheet($doc);
$doc = new DOMDocument;
$html = <<<HTML
<div class="active make_link">1</div>
<div class="make_link digit">2<div>test</div></div>
<div class="make_link digit">3</div>
HTML;
$doc->loadHTML($html);
echo $xsl->transformToDoc($doc)->saveHTML();

Transform complex and variable xml

I've a complex XML that I want to transform in HTML. Some tags need to be replaced in html tags.
The XML is this:
<root>
<div>
<p>
<em>bol text</em>, some normale text
</p>
</div>
<list>
<listitem>
normal text inside list <em>bold inside list</em>
</listitem>
<listitem>
another text in list...
</listitem>
</list>
<p>
A sample paragraph
</p>
The text inside the element is variable, which means that the other xml that I parse can completely change.
The output I want is this (for this scenario):
<root>
<div>
<p>
<strong>bol text</strong>, some normale text
</p>
</div>
<ul>
<li>
normal text inside list <strong>bold inside list</strong>
</li>
<li>
another text in list...
</li>
</ul>
<p>
A sample paragraph
</p>
</root>
I make a recursive function for parse any single node of xml and replace it in HTML tag (but doesn't work):
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->load('section.xml');
echo $doc->saveHTML();
function printHtml(DOMNode $node)
{
if ($node->hasChildNodes())
{
foreach ($node->childNodes as $child)
{
printHtml($child);
}
}
if ($node->nodeName == 'em')
{
$newNode = $node->ownerDocument->createElement('strong', $node->nodeValue);
$node->parentNode->replaceChild($newNode, $node);
}
if ($node->nodeName == 'listitem')
{
$newNode = $node->ownerDocument->createElement('li', $node->nodeValue);
$node->parentNode->replaceChild($newNode, $node);
}
}
Can anyone help me?
This is an example of a complete xml:
<root>
<div>
<p>
<em>bol text</em>, some normale text
</p>
</div>
<list>
<listitem>
normal text inside list <em>bold inside list</em>
</listitem>
<listitem>
another text in list...
</listitem>
</list>
<media>
<info isVisible="false">
<title>
<p>Image title <em>in bold</em> not in bold</p>
</title>
</info>
<file isVisible="true">
<href>
"path/to/file.jpg"
</href>
</file>
</media>
<p>
A sample paragraph
</p>
</root>
Which has to be transformed into:
<root>
<div>
<p>
<strong>bol text</strong>, some normale text
</p>
</div>
<ul>
<li>
normal text inside list <em>bold inside list</em>
</li>
<li>
another text in list...
</li>
</ul>
<!-- the media tag can be presented in two mode: with title visible, and title hidden -->
<!-- this is the case when the title is hidden -->
<img src="path/to/file.jpg" />
<!-- this is the case when the title is visible -->
<!-- the info tag (inside media tag) has an attribute isVisible="false" which means it doesn't have to be shown. -->
<!-- if the info tag has visible=true, the media tag must be translated into
<div>
<img src="path/to/file.jpg" />
<p>Image title <strong>in bold</strong> not in bold</p>
<div>
-->
<p>
A sample paragraph
</p>
</root>
There's a language specially designed for this task: it's called XSLT, and you can easily express your desired transformation in XSLT and invoke it from your PHP program. There's a learning curve, of course, but it's a much better solution than writing low-level DOM code.
In XSLT you write a set of template rules saying how individual elements should be handled. Many elements in your example are copied through unchanged, so you can start with a default rule that does this:
<xsl:template match="*">
<xsl:copy><xsl:apply-templates/></xsl:copy>
</xsl:template>
The "match" part says what part of the input you are matching; the body of the rule says what output to produce. The xsl:apply-templates does a recursive descent to process the children of the current element.
Some of your elements are simply renamed, for example
<xsl:template match="listitem">
<li><xsl:apply-templates/></li>
</xsl:template>
Some of the rules are a little bit more complex, but still easily expressed:
<xsl:tempate match="media/file[#isVisible='true']">
<img src="{href}"/>
</xsl:template>
I hope you agree that this declarative rule-based approach is much clearer than your procedural code; it's also much easier for someone else to change the rules in six months' time.
Well, maybe, it's not the most correct idea, but why not just to use str_replace? That way You will see clearly the list of changes to apply and add / remove new ones easily.
file_get_contents $file = file_get_contents('file.xml');
str_replace $file = str_replace("<em>", "<strong>", $file);
file_put_contents file_put_contents('file.html', $file);
UPDATE (Some more ideas regarding the changes in the question)
This seems a little bit tricky (at least for me now) to use PHP + DOM here. Maybe, it would be more reasonable to use XSL / XSLT (Extensible Stylesheet Language Transformations). In that case, smth. similar can be found here: How to replace a node-name with another in Xslt?
XSLT specifically used for Language Transformations http://en.wikipedia.org/wiki/XSLT

XPath to query multiple selectors

I want to get values and attributes from a selector
and then get attributes and values of its children based on a query.
allow me to give an example.
this is the structure
<div class='message'>
<div>
<a href='http://www.whatever.com'>Text</a>
</div>
<div>
<img src='image_link.jpg' />
</div>
</div>
<div class='message'>
<div>
<a href='http://www.whatever2.com'>Text2</a>
</div>
<div>
<img src='image_link2.jpg' />
</div>
</div>
So I would like to make a query to match all of those once.
Something like this:
//$dom is the DomDocument() set up after loaded HTML with $dom->loadHTML($html);
$dom_xpath = new DOMXpath($dom);
$elements = $dom_xpath->query('//div[#class="message"], //div[#class="message"] //a, //div[#class="message"] //img');
foreach($elements as $ele){
echo $ele[0]->getAttribute('class'); //it should return 'message'
echo $ele[1]->getAttribute('href'); //it should return 'http://www.whatever.com' in the 1st loop, and 'http://www.whatever2.com' in the second loop
echo $ele[2]->getAttribute('src'); //it should return image_link.jpg in the 1st loop and 'image_link2.jpg' in the second loop
}
Is there some way of doing that using multiple xpath selectors like I did in the example? to avoid making queries all the time and save some CPU.
Use the union operator (|) in a single expression like this:
//div[#class="message"]|//div[#class="message"]//a|//div[#class="message"]//img
Note that this will return a flattened result set (so to speak). In other words, you won't access the elements in groups of three like your example shows. Instead, you'll just iterate everything the expressions matched (in document order). For this reason, it might be even smarter to simply iterate the nodes returned by //div[#class="message"] and use DOM methods to access their children (for the other elements).
Use:
(//div[#class='message'])[$k]//#*
This selects all three attributes that belong to the $k-th div (and any of its descendants) in the document whose class attribute has string value "message"
You can evaluate N such XPath expressions -- for $k from 1 to N, where N is the total count of //div[#class='message']
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
<xsl:for-each select="//div[#class='message']">
<xsl:variable name="vPos" select="position()"/>
<xsl:apply-templates select=
"(//div[#class='message'])[0+$vPos]//#*"/>
================
</xsl:for-each>
</xsl:template>
<xsl:template match="#*">
<xsl:value-of select=
"concat('name = ', name(), ' value = ', ., '
')"/>
</xsl:template>
</xsl:stylesheet>
when this transformation is applied on the provided XML document (wrapped in a single top element to become well-formed):
<html>
<div class='message'>
<div>
<a href='http://www.whatever.com'>Text</a>
</div>
<div>
<img src='image_link.jpg' />
</div>
</div>
<div class='message'>
<div>
<a href='http://www.whatever2.com'>Text2</a>
</div>
<div>
<img src='image_link2.jpg' />
</div>
</div>
</html>
The XPath expression is evaluated twice and the selected attributes are formatted and output:
name = class value = message
name = href value = http://www.whatever.com
name = src value = image_link.jpg
================
name = class value = message
name = href value = http://www.whatever2.com
name = src value = image_link2.jpg
================

Why is my recursive loop creating too many children?

I'm using a PHP recursive loop to parse through an XML document to create a nested list, however for some reason the loop is broken and creating duplicates of elements within the list, as well as blank elements.
The XML (a list of family tree data) is structured as follows:
<?xml version="1.0" encoding="UTF-8"?>
<family>
<indi>
<id>id1</id>
<fn>Thomas</fn>
<bday></bday>
<dday></dday>
<spouse></spouse>
<family>
<indi>
<id>id1</id>
<fn>Alexander</fn>
<bday></bday>
<dday></dday>
<spouse></spouse>
<family>
</family>
</indi>
<indi>
<id>id1</id>
<fn>John</fn>
<bday></bday>
<dday></dday>
<spouse></spouse>
<family>
<indi>
<id>id1</id>
<fn>George</fn>
<bday></bday>
<dday></dday>
<spouse></spouse>
<family>
</family>
</indi>
</family>
</indi>
</family>
</indi>
</family>
And here's my PHP loop, which loads the XML file then loops through it to create a nested ul:
<?php
function outputIndi($indi) {
echo '<li>';
$id = $indi->getElementsByTagName('id')->item(0)->nodeValue;
echo '<span class="vcard person" id="' . $id . '">';
$fn = $indi->getElementsByTagName('fn')->item(0)->nodeValue;
$bday = $indi->getElementsByTagName('bday')->item(0)->nodeValue;
echo '<span class="edit fn">' . $fn . '</span>';
echo '<span class="edit bday">' . $bday . '</span>';
// ...
echo '</span>';
echo '<ul>';
$family = $indi->getElementsByTagName('family');
foreach ($family as $subIndi) {
outputIndi($subIndi);
}
echo '</ul></li>';
}
$doc = new DOMDocument();
$doc->load('armstrong.xml');
outputIndi($doc);
?>
EDIT here's the desired outcome (nested lists, with ul's signifying families and li's signifying individuals)
<ul>
<li>
<span class="vcard">
<span class="fn">Thomas</span>
<span class="bday"></span>
<span class="dday"></span>
<ul>
... repeat for all ancestors ...
</ul>
<li>
<ul>
You can see the output at http://chris-armstrong.com/gortin . Any ideas where I'm going wrong? I think it's something to do with the $subIndi value, but anytime I try and change it I get an error. Would really appreciate any help!
Sounds perfect! Could you give me an
example? Does this mean I can save the
data as XML, then load it in as nested
ul's?
Yes, you can do exactly that. Here's an XSL which renders nested UL's:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="/">
<html>
<body>
<h2>Family tree</h2>
<ul>
<li><xsl:value-of select="indi/fn" /></li>
<!-- apply-templates will select all the indi/family nodes -->
<xsl:apply-templates select="indi/family" />
</ul>
</body>
</html>
</xsl:template>
<xsl:template match="family">
<ul>
<li>
<div>
<xsl:value-of select="id" />: <xsl:value-of select="fn" />
(<xsl:variable name="bday" select="bday" />
to
<xsl:variable name="dday" select="dday" />)
</div>
</li>
<!-- This node matches the 'family' nodes, and we're going to apply-templates on the inner 'family' node,
so this is the same thing as recursion. -->
<xsl:apply-templates select="family" />
</ul>
</xsl:template>
</xsl:stylesheet>
I don't know php, but this article will show you how to transform XML using the style sheet above.
You can also link your style sheet by adding a stylesheet directive at the top of your XML file (see for an example).
getElementsByTagName will give you all nodes, not just immediate children:
$family = $indi->getElementsByTagName('family');
foreach ($family as $subIndi) {
outputIndi($subIndi);
}
You will call outputIndi() for grand children, etc repeatedly.
Here is an example (from another stackoverflow question):
for ($n = $indi->firstChild; $n !== null; $n = $n->nextSibling) {
if ($n instanceof DOMElement && $n->tagName == "family") {
outputIndi($n);
}
}
Replace this
$family = $indi->getElementsByTagName('family');
foreach ($family as $subIndi) {
outputIndi($subIndi);
}
by this
if(!empty($indi))
foreach($indi as $subIndi){
outputIndi($subIndi);
}
I realize
if($indi->hasChildNodes())
is better than
if(!empty($indi))

Categories