I'm trying to read XML which has HTML inside an element. It is NOT enclosed in CDATA tags, which is the problem because any XML parser I use tries to parse it as XML.
The point in the XML where it dies:
<item>
<title>Title text <img src="https://abs.twimg.com/emoji/v1/72x72/1f525.png" draggable="false" alt="🔥" aria-label="Emoji: Fire"></title>
</item>
Error message:
Warning: XMLReader::readOuterXml(): (xml file here) parser error : Opening and ending tag mismatch: img line 1 and title in (php file here)
I know how to get HTML out of an XML element but the parser doesn't like the fact that it's an open tag and it can't find the closing tag so it dies and I can't get any further.
Now, I don't actually need the <title> element so if there is a way to ignore it, that would work as the information I need is in only two child nodes of the <item> parent.
If anyone can see a workaround to this, that would be great.
Update
Using Christian Gollhardt's suggestions, I've managed to load the XML into an object but I get the same problem I did before where I have issues getting the CDATA from the <description> element.
This is the CDATA I should get:
<description>
<![CDATA[<a href="https://twitter.com/menomatters" >#menomatters</a> <a href="https://twitter.com/physicool1" >#physicool1</a> will chill my own "personal summer". <img src="https://abs.twimg.com/emoji/v1/72x72/1f525.png" draggable="false" alt="🔥" aria-label="Emoji: Fire"><img src="https://abs.twimg.com/emoji/v1/72x72/2600.png" draggable="false" alt="☀️" aria-label="Emoji: Black sun with rays">]]>
</description>
This is what I end up with:
["description"]=>
string(54) "#menomatters will chill my own "personal summer". ]]>"
Looks like an issue with closing tags again?
Take a look at DOMDocument. You can either work with it directly, or you can write a function which gives you a cleaned document.
Clean Methods:
function tidyXml($xml) {
$doc = new DOMDocument();
if (@$doc->loadHTML($xml)) {
$output = '';
//Dom Document creates <html><body><myxml></body></html>, so we need to remove it
foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
$output .= $doc->saveXML($child);
}
return $output;
} else {
throw new Exception('Document can not be cleaned');
}
}
function getSimpleXml($xml) {
return new SimpleXMLElement(tidyXml($xml));
}
Implementation
$xml= '<item><title>Title text <img src="https://abs.twimg.com/emoji/v1/72x72/1f525.png" draggable="false" alt="🔥" aria-label="Emoji: Fire"></title></item>';
$myxml = getSimpleXml($xml);
$titleNodeCollection =$myxml->xpath('/item/title');
foreach ($titleNodeCollection as $titleNode) {
$titleText = (string)$titleNode;
$imageUrl = (string)$titleNode->img['src'];
$innerContent = str_replace(['<title>', '</title>'], '', $titleNode->asXML());
var_dump($titleText, $imageUrl, $innerContent);
}
Enjoy!
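Since the question notes that the <title> element is not actually needed, another option is to drop it before parsing, so the remaining document is well-formed XML and the CDATA in <description> survives untouched. The following is only a minimal sketch: the helper name stripTitles, the shortened sample input and the example.com URL are assumptions for illustration, and the regex assumes <title> elements are never nested.

```php
<?php
// Hypothetical helper: removes every <title>...</title> span so the rest parses as XML.
function stripTitles(string $xml): string
{
    // Non-greedy match across lines; assumes <title> elements are never nested.
    return preg_replace('~<title\b[^>]*>.*?</title>~s', '', $xml) ?? $xml;
}

// Shortened stand-in for the feed item from the question.
$raw = '<item><title>Title text <img src="x.png" alt="fire"></title>'
     . '<description><![CDATA[<a href="https://example.com">link</a> text <img src="y.png">]]></description></item>';

// LIBXML_NOCDATA merges CDATA sections into plain text nodes.
$item = simplexml_load_string(stripTitles($raw), 'SimpleXMLElement', LIBXML_NOCDATA);

// The cast returns the raw HTML string that was inside the CDATA section.
$description = (string) $item->description;
echo $description, "\n";
```

The HTML in $description can then be handed to DOMDocument::loadHTML() separately if its tags need to be inspected.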
Related
I have some <img> tags like these:
<img alt="" src="{assets_8170:{filedir_14}test.png}" style="width: 700px; height: 181px;" />
<img src="{filedir_14}test.png" alt="" />
And I need to update the src value, extracting the filename and adding it inside a WordPress shortcode:
<img src="[my-shortcode file='test.png']" ... />
The regex to extract the filename is this one: [a-zA-Z_0-9-()]+\.[a-zA-Z]{2,4}, but I am not able to create the complete regex, considering that the image tag attributes do not follow the same order in all instances.
PHP - Parsing html contents, making transforms and returning the resulting html
The answer grew bigger during its lifecycle trying to address the issue.
Several attempts were made but the latest one (loadXML/saveXML) nailed it.
DOMDocument - loadHTML and saveHTML
If you need to parse an html string in php so that you can later fetch and modify its content in a structured and safe manner without breaking the encoding, you can use DOMDocument::loadHTML():
https://www.php.net/manual/en/domdocument.loadhtml.php
Here I show how to parse your html string, fetch all its <img> elements and for each of them how to retrieve their src attribute and set it with an arbitrary value.
At the end to return the html string of the transformed document, you can use DOMDocument::saveHTML:
https://www.php.net/manual/en/domdocument.savehtml
Keep in mind that by default the document will contain the basic html frame wrapping your original content. So to be sure the resulting html is limited to your content only, here I show how to fetch the body content and loop through its children to return the final composition:
https://onlinephp.io/c/157de
<?php
$html = "
<img alt=\"\" src=\"{assets_8170:{filedir_14}test.png}\" style=\"width: 700px; height: 181px;\" />
<img src=\"{filedir_14}test.png\" alt=\"\" />
";
$transformed = processImages($html);
echo $transformed;
function processImages($html){
//parse the html fragment
$dom = new DOMDocument();
$dom->loadHTML($html);
//fetch the <img> elements
$images = $dom->getElementsByTagName('img');
//for each <img>
foreach ($images as $img) {
//get the src attribute
$src = $img->getAttribute('src');
//set the src attribute
$img->setAttribute('src', 'bogus');
}
//return the html modified so far (body content only)
$body = $dom->getElementsByTagName('body')->item(0);
$bodyChildren = $body->childNodes;
$bodyContent = '';
foreach ($bodyChildren as $child) {
$bodyContent .= $dom->saveHTML($child);
}
return $bodyContent;
}
Problems with src attribute value restrictions
After you pointed out in the comments that saveHTML was returning html in which the image src attribute value had its special characters escaped, I did some more research...
This happens because DOMDocument wants to make sure that the src attribute contains a valid url, and { and } are not valid characters in one.
Evidence that it doesn't happen with custom data attributes
For example, if I add an attribute like data-test="mycustomcontent: {wildlyusingwhatever}", it is returned untouched because it is not required to adhere to such strict rules.
Quick fix to make it work (defeating the parser as a whole)
Now, to fix that, all I could come up with so far was this:
https://onlinephp.io/c/0e334
//VERY UNSAFE -- in $bodyContent, replace %7B with { and %7D with }
$bodyContent = str_replace("%7B", "{", $bodyContent);
$bodyContent = str_replace("%7D", "}", $bodyContent);
return $bodyContent;
But of course it's neither safe nor smart, and not a very good solution: first, because it defeats the whole purpose of using a parser instead of regex, and second, because it could seriously damage the result.
A better approach using loadXML and saveXML
To prevent the html rules from kicking in, you can try parsing the text as XML instead of HTML. The input must still adhere to the nested markup syntax (difficult or impossible to deal with using regex), but the restrictions on attribute contents no longer apply.
I modified the core logic by doing this:
//loads the html content as xml wrapping it with a root element
$dom->loadXML("<root>{$html}</root>");
//...
//returns the xml content of each children in <root> as processed so far
$rootNode = $dom->childNodes[0];
$children = $rootNode->childNodes;
$content = '';
foreach ($children as $child) {
$content .= $dom->saveXML($child);
}
return $content;
And this is the working demo: https://onlinephp.io/c/f9de1
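To connect this back to the original goal (putting the extracted filename into a shortcode instead of the placeholder value 'bogus'), the loadXML/saveXML route could be combined with the filename regex from the question. This is a sketch under assumptions: the function name shortcodeImages is made up for illustration, the hyphen in the character class was moved to the end so it is unambiguously literal, and the input must be well-formed XML (self-closed <img /> tags).

```php
<?php
// Hypothetical helper: rewrites each <img> src into the shortcode form from the question.
function shortcodeImages(string $html): string
{
    $dom = new DOMDocument();
    // Wrap the fragment in a root element so it parses as a single XML document.
    $dom->loadXML("<root>{$html}</root>");

    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        // Filename pattern from the question, with the hyphen moved to the end of the class.
        if (preg_match('/[a-zA-Z_0-9()-]+\.[a-zA-Z]{2,4}/', $src, $match)) {
            $img->setAttribute('src', "[my-shortcode file='{$match[0]}']");
        }
    }

    // Serialize only the children of <root>, not the wrapper itself.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveXML($child);
    }
    return $out;
}

$result = shortcodeImages(
    '<img alt="" src="{assets_8170:{filedir_14}test.png}" style="width: 700px; height: 181px;" />'
    . '<img src="{filedir_14}test.png" alt="" />'
);
echo $result, "\n";
```

Because the attribute is located through the DOM rather than by pattern matching on the tag, the order of the attributes on each <img> no longer matters.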
We are upgrading our Software to PHP 7.2.3 and I have the following code snippet which worked fine in previous versions:
$doc = new DOMDocument();
$doc->loadHTML("<html><body>".($_POST['reportForm_structure'])."</body></html>");
$root = $doc->documentElement->firstChild->firstChild->firstChild;
file_put_contents('D:\testoutput.txt', print_r($root ,true));
foreach($root->childNodes as $child) {
if ($child->nodeName == "ul") {
foreach($child->childNodes as $ulChild) {
$this->loadNodes($ulChild, $this->report);
}
}
}
The file_put_contents is just for error research.
I get the following error: Invalid argument supplied for foreach(). The message refers to the line with the first foreach loop, so the data structure is not initialized correctly. It seems the conversion from HTML to DOMDocument no longer works properly. When I check the output of file_put_contents, I can see that $root is a DOMText object instead of a DOMElement object, but why? When I pass the argument of loadHTML directly to file_put_contents,
file_put_contents('D:\testoutput.txt', print_r("<html><body>".($_POST['reportForm_structure'])."</body></html>", true));
the output looks like proper HTML, which is why I am confused that it does not work anymore.
<html><body><ul class="ltr">
<li class="open last" id="root" rel="root">
<ins> </ins>HeaderText
<ul><li class="open last" id="id1" rel="header"><ins> </ins>Test123
<ul><li class="open leaf last" id="id2" rel="header"><a class="clicked" href="#"><ins> </ins>Test456</a></li></ul></li></ul></li>
Does anyone know how to solve this issue? Did I miss something in the configuration here?
I couldn't reproduce the DOMText node with the code you show. But my guess is that you are preserving whitespace and then fetch the whitespace node between the ul element and the li element.
v-------- whitespace node
<html><body><ul class="ltr">
<li class="open last" id="root" rel="root">
In any case, if you want the element with the ID "root", use a more precise query, e.g. use
$root = $doc->getElementById("root");
You can also set $doc->preserveWhiteSpace = false, but it's better to query for the node by ID instead of traversing down three children and assuming it's that node.
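As a rough illustration of querying for the node instead of walking firstChild three times, the lookup could be done with DOMXPath and the loop could skip anything that is not an element. The markup below is a shortened stand-in for the posted structure, and the variable names are assumptions.

```php
<?php
// Shortened stand-in for the posted markup, including the line breaks and tabs.
$html = "<html><body><ul class=\"ltr\">\n\t<li class=\"open last\" id=\"root\" rel=\"root\">HeaderText\n"
      . "\t<ul><li class=\"open last\" id=\"id1\" rel=\"header\">Test123</li></ul></li>\n</ul></body></html>";

$previous = libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Select the <li id="root"> directly, wherever the whitespace nodes happen to be.
$xpath = new DOMXPath($doc);
$root = $xpath->query('//li[@id="root"]')->item(0);

$childIds = [];
foreach ($root->childNodes as $child) {
    // Text nodes (whitespace, "HeaderText") are skipped explicitly.
    if ($child instanceof DOMElement && $child->nodeName === 'ul') {
        foreach ($child->childNodes as $ulChild) {
            if ($ulChild instanceof DOMElement) {
                $childIds[] = $ulChild->getAttribute('id');
            }
        }
    }
}
print_r($childIds);
```

The instanceof checks make the traversal independent of how many whitespace text nodes the parser keeps.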
Thanks @Gordon and @DarsVaeda for pointing me in the right direction. DOMDocument interprets carriage returns and tabs as text nodes. I had to remove those to make it work again. Changed
$doc->loadHTML("<html><body>".$_POST['reportForm_structure']."</body></html>");
to
$doc = new DOMDocument();
$string = trim(preg_replace('/\t+/', '', $_POST['reportForm_structure']));
$string = preg_replace( "/\r|\n/", "", $string );
$doc->loadHTML("<html><body>".$string."</body></html>");
Is there a way to do this? I would like to replace one element with another but somehow it isn't possible in PHP. Got the following code (the $content is valid html5 in my real code but took off some stuff to make the code shorter.):
$content='<!DOCTYPE html>
<content></content>
</html>';
$with='<img class="fullsize" src="/slide-01.jpg" />';
function replaceCustom($content,$with) {
$document = new DOMDocument();
@$document->loadHTML($content);
$source = $document->getElementsByTagName("content")->item(0);
if(!$source){
return $content;
}
$fragment = $document->createDocumentFragment();
$document->validate();
$fragment->appendXML($with);
$source->parentNode->replaceChild($fragment, $source);
$document->formatOutput = TRUE;
$content = $document->saveHTML();
return $content;
}
echo replaceCustom($content,$with);
If I replace the <img class="fullsize" src="/slide-01.jpg" /> with <img class="fullsize" src="/slide-01.jpg"> then the content tag gets replaced with an empty string. Even though the img without closing tag is perfectly valid html it won't work because PHP only seems to support xml. All example code I've seen make use of the appendXML to create a documentFragment from a string but there is no HTML equivalent.
Is there a way to do this so it won't fail with valid HTML but invalid XML?
DOMDocumentFragment::appendXML indeed requires XML in my version (5.4.20, libxml2 Version 2.8.0). You mainly have two options:
Provide valid XML to the function (so a self-closing tag like <img />).
Go 'the long way around', as suggested by the manual:
If you want to stick to the standards, you will have to create a temporary DOMDocument with a dummy root and then loop through the child nodes of the root of your XML data to append them.
$tempDoc = new DOMDocument();
$tempDoc->loadHTML('<html><body>'.$with.'</body></html>');
$body = $tempDoc->getElementsByTagName('body')->item(0);
foreach($body->childNodes as $node){
$newNode = $document->importNode($node, true);
$source->parentNode->insertBefore($newNode,$source);
}
$source->parentNode->removeChild($source);
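Put together as one function, the temporary-document approach above might look like the following sketch. The function name replaceWithHtml and the simplified input document are assumptions, and libxml_use_internal_errors() is used here instead of @ to silence the warning about the unknown <content> tag.

```php
<?php
// Hypothetical helper: replaces the first <$tagName> element with an HTML fragment.
function replaceWithHtml(string $content, string $tagName, string $with): string
{
    $previous = libxml_use_internal_errors(true); // unknown tags would otherwise raise warnings

    $document = new DOMDocument();
    $document->loadHTML($content);
    $source = $document->getElementsByTagName($tagName)->item(0);

    if ($source !== null) {
        // Parse the fragment with the HTML parser, so unclosed tags like <img> are fine.
        $tempDoc = new DOMDocument();
        $tempDoc->loadHTML('<html><body>' . $with . '</body></html>');
        $body = $tempDoc->getElementsByTagName('body')->item(0);

        foreach ($body->childNodes as $node) {
            // importNode copies the node into the target document.
            $source->parentNode->insertBefore($document->importNode($node, true), $source);
        }
        $source->parentNode->removeChild($source);
        $content = $document->saveHTML();
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $content;
}

$result = replaceWithHtml(
    '<!DOCTYPE html><html><body><content></content></body></html>',
    'content',
    '<img class="fullsize" src="/slide-01.jpg">'
);
echo $result;
```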
I'm using SimpleXML & PHP to parse an XML element in the following form:
<element>
random text with <inlinetag src="http://url.com/">inline</inlinetag> XML to parse
</element>
I know I can reach inlinetag using $element->inlinetag, but I don't know how to reach it in such a way that I can replace the inlinetag with a link to the attribute source without using its location in the text. The result would basically have to look like this:
here is a random text with <a href="http://url.com/">inline</a> XML
This may be a stupid question, I hope someone here can help! :)
I found a way to do this using DOMElement.
One way to replace the element is by cloning it with a different name/attributes. Here is a way to do this, using the accepted answer given on How do you rename a tag in SimpleXML through a DOM object?
function clonishNode(DOMNode $oldNode, $newName, $replaceAttrs = [])
{
$newNode = $oldNode->ownerDocument->createElement($newName);
foreach ($oldNode->attributes as $attr)
{
if (isset($replaceAttrs[$attr->name]))
$newNode->setAttribute($replaceAttrs[$attr->name], $attr->value);
else
$newNode->appendChild($attr->cloneNode());
}
foreach ($oldNode->childNodes as $child)
$newNode->appendChild($child->cloneNode(true));
$oldNode->parentNode->replaceChild($newNode, $oldNode);
}
Now, we use this function to clone the inline element with a new element and attribute name. Here comes the tricky part: iterating over all the nodes will not work as expected. The length of the selected nodes will change as you clone them, as the original node is removed. Therefore, we only select the first element until there are no elements left to clone.
$xml = '<element>
random text with <inlinetag src="http://url.com/">inline</inlinetag> XML to parse
</element>';
$dom = new DOMDocument;
$dom->loadXML($xml);
$nodes= $dom->getElementsByTagName('inlinetag');
echo $dom->saveXML(); //<element>random text with <inlinetag src="http://url.com/">inline</inlinetag> XML to parse</element>
while($nodes->length > 0) {
clonishNode($nodes->item(0), 'a', ['src' => 'href']);
}
echo $dom->saveXML(); //<element>random text with <a href="http://url.com/">inline</a> XML to parse</element>
That's it! All that's left to do is getting the content of the element tag.
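For that last step, one possibility is to serialize the child nodes of the element one by one. This is a small sketch that starts from the already converted markup; the helper name innerXml is an assumption.

```php
<?php
// Hypothetical helper: concatenates the serialized child nodes of an element.
function innerXml(DOMElement $element): string
{
    $inner = '';
    foreach ($element->childNodes as $child) {
        // saveXML() with a node argument serializes just that node.
        $inner .= $element->ownerDocument->saveXML($child);
    }
    return $inner;
}

$dom = new DOMDocument();
$dom->loadXML('<element>random text with <a href="http://url.com/">inline</a> XML to parse</element>');

$content = innerXml($dom->documentElement);
echo $content, "\n";
```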
Is this the result you want to achieve?
<?php
$data = '<element>
random text with
<inlinetag src="http://url.com/">inline
</inlinetag> XML to parse
</element>';
$xml = simplexml_load_string($data);
foreach($xml->inlinetag as $resource)
{
echo 'Your SRC attribute = '. $resource->attributes()->src; // e.g. name, price, symbol
}
?>
First, I am a php newbie. I have looked at the question and solution here. For my needs however, the parsing does not go deep enough into the various articles.
A small sampling of my rss feed reads like this:
<channel>
<atom:link href="http://mywebsite.com/rss" rel="self" type="application/rss+xml" />
<title>My Web Site</title>
<description>My Feed</description>
<link>http://mywebsite.com/</link>
<image>
<url>http://mywebsite.com/views/images/banner.jpg</url>
<title>My Title</title>
<link>http://mywebsite.com/</link>
<description>Visit My Site</description>
</image>
<item>
<title>Article One</title>
<guid isPermaLink="true">http://mywebsite.com/details/e8c5106</guid>
<link>http://mywebsite.com/geturl/e8c5106</link>
<comments>http://mywebsite.com/details/e8c5106#comments</comments>
<pubDate>Wed, 09 Jan 2013 02:59:45 -0500</pubDate>
<category>Category 1</category>
<description>
<![CDATA[<div>
<img src="http://mywebsite.com/myimages/1521197-main.jpg" width="120" border="0" />
<ul><li>Poster: someone's name;</li>
<li>PostDate: Tue, 08 Jan 2013 21:49:35 -0500</li>
<li>Rating: 5</li>
<li>Summary:Lorem ipsum dolor </li></ul></div><div style="clear:both;">]]>
</description>
</item>
<item>..
The image links that I want to parse out are the ones way inside each Item > Description
The code in my php file reads:
<?php
$xml = simplexml_load_file('http://mywebsite.com/rss?t=2040&dl=1&i=1&r=ceddfb43483437b1ed08ab8a72cbc3d5');
$imgs = $xml->xpath('/item/description/img');
foreach($imgs as $image) {
echo $image->src;
}
?>
Can someone please help me figure out how to configure the php code above?
Also a very newbie question... once I get the resulting image urls, how can I display the images in a row on my html?
Many thanks!!!
Hernando
The <img> tags inside that RSS feed are not actually elements of the XML document, contrary to the syntax highlighting on this site - they are just text inside the <description> element which happen to contain the characters < and >.
The string <![CDATA[ tells the XML parser that everything from there until it encounters ]]> is to be treated as a raw string, regardless of what it contains. This is useful for embedding HTML inside XML, since the HTML tags wouldn't necessarily be valid XML. It is equivalent to escaping the whole HTML (e.g. with htmlspecialchars) so that the <img> tags would look like &lt;img&gt;. (I went into more technical details on another answer.)
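A quick way to convince yourself of that equivalence is to parse both forms and compare the resulting text. The element name d and the sample image path are made up for this sketch.

```php
<?php
// The same HTML, once wrapped in CDATA and once entity-escaped.
$withCdata   = simplexml_load_string('<d><![CDATA[<img src="a.png">]]></d>');
$withEscapes = simplexml_load_string('<d>&lt;img src="a.png"&gt;</d>');

// Both yield the identical text content.
$same = ((string) $withCdata === (string) $withEscapes);
var_dump($same);

// Neither document contains an actual <img> element, so XPath finds nothing.
$imgs = $withCdata->xpath('//img');
var_dump(empty($imgs));
```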
So to extract the images from the RSS requires two steps: first, get the text of each <description>, and second, find all the <img> tags in that text.
$xml = simplexml_load_file('http://mywebsite.com/rss?t=2040&dl=1&i=1&r=ceddfb43483437b1ed08ab8a72cbc3d5');
$descriptions = $xml->xpath('//item/description');
foreach ( $descriptions as $description_node ) {
// The description may not be valid XML, so use a more forgiving HTML parser mode
$description_dom = new DOMDocument();
$description_dom->loadHTML( (string)$description_node );
// Switch back to SimpleXML for readability
$description_sxml = simplexml_import_dom( $description_dom );
// Find all images, and extract their 'src' param
$imgs = $description_sxml->xpath('//img');
foreach($imgs as $image) {
echo (string)$image['src'];
}
}
I don't have much experience with xPath, but you could try the following:
$imgs = $xml->xpath('//item//img');
This selects all img elements which are inside item elements, regardless of whether there are other elements in between. The leading double slash searches for item anywhere in the document, not just at the root. Otherwise, you'd need something like /rss/channel/item....
As for displaying the images: Just output <img>-tags followed by line-breaks, like so:
foreach($imgs as $image) {
echo '<img src="' . $image['src'] . '" /><br />';
}
The preferred way would be to use CSS instead of <br>-tags, but I think they are simpler for a start.
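For the CSS route, a minimal sketch could wrap the images in a flex container instead of emitting line breaks. The function name renderImageRow, the inline style and the sample URL are assumptions; the URLs are escaped with htmlspecialchars() because they come from an external feed.

```php
<?php
// Hypothetical helper: renders a list of image URLs as one horizontal row.
function renderImageRow(array $urls): string
{
    // display: flex lays the images out side by side.
    $html = '<div style="display: flex; gap: 8px;">';
    foreach ($urls as $url) {
        // Escape the URL so characters like & and " cannot break the attribute.
        $html .= '<img src="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '" alt="">';
    }
    return $html . '</div>';
}

$row = renderImageRow(['http://mywebsite.com/myimages/1521197-main.jpg?w=120&h=80']);
echo $row, "\n";
```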