How to handle special HTML characters in DOMDocument? - php

Let's say I build an HTML fragment using the following code:
$dom = new DOMDocument();
$header = $dom->createElement("h2", "Lorem & Ipsum");
$dom->appendChild($header);
print($dom->saveHTML());
The raw HTML code printed contains the unescaped & symbol instead of the necessary HTML entity &amp;. The code also throws the following PHP warning:
Warning: DOMDocument::createElement(): unterminated entity reference
What's the best way to handle this?

It appears that the PHP team is not willing to change this behavior (source), so we have to find a workaround instead.
One way is to simply do the encoding yourself in the PHP code, as such:
$header = $dom->createElement("h2", "Lorem &amp; Ipsum");
However, this isn't always convenient, as the text may be stored in a variable or contain other special characters besides &. In that case, you can use the htmlentities function:
$text = "Lorem & Ipsum";
$header = $dom->createElement("h2", htmlentities($text));
If this still is not an ideal solution, another workaround is to use the textContent property instead of the second argument in createElement.
In the code below, I've implemented this in a DOMDocument subclass, so you just have to use the BetterDOM subclass instead to fix this strange bug.
class BetterDOM extends DOMDocument {
    public function createElement($tag, $text = null) {
        $base = parent::createElement($tag);
        $base->textContent = $text;
        return $base;
    }
}
// Correctly prints "<h2>Lorem &amp; Ipsum</h2>" with no errors
$dom = new BetterDOM();
$header = $dom->createElement("h2", "Lorem & Ipsum");
$dom->appendChild($header);
print($dom->saveHTML());
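If subclassing is more than you need, the same effect can be sketched with DOMDocument::createTextNode(), which stores the value as a text node so that it is escaped on output. This is only a minimal sketch of that alternative, not a replacement for the approaches above; the variable names are illustrative.

```php
<?php
$dom = new DOMDocument();

// Create the element without a value, then attach the text as a text node.
$header = $dom->createElement('h2');
$header->appendChild($dom->createTextNode('Lorem & Ipsum'));
$dom->appendChild($header);

// saveHTML() ends its output with a newline, hence the trim().
$html = trim($dom->saveHTML());
print($html); // <h2>Lorem &amp; Ipsum</h2>
```

No warning is raised here because the ampersand never passes through createElement()'s second argument.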

Related

PHP decoding square brackets href attr to html file

Saving an HTML file with DOMDocument percent-encodes the square brackets in the href attribute.
// My string
$teaserTest = "<a href='[CLICK_URL]'><strong>testgerr</strong></a>";
//Calling save function
saveFile($teaserTest);
// Save function
function saveFile($stringToAdd){
    $doc = new DOMDocument();
    $doc->formatOutput = true;
    $doc->loadHTML('<html><head><title>Test</title></head><body>'.$stringToAdd.'</body></html>');
    $doc->saveHTMLFile("Campaigns/test.html");
}
File result: <a href="%5BCLICK_URL%5D">
I'm trying to keep the "[" unencoded.
Square brackets are reserved characters in URLs, as specified in the relevant RFC. They matter for IP addresses, for example: http://[::1]/example/
That is why encoding them is the correct behavior. If you need a placeholder in the URL, use a different pattern for it.
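If the bracketed placeholder has to survive as-is, one possible workaround is to serialize to a string and undo the percent-encoding before writing the file. The sketch below assumes that approach; renderWithPlaceholders is a hypothetical helper name, and note that it decodes every %5B and %5D in the output, not only those belonging to the placeholder.

```php
<?php
// Hypothetical variant of saveFile() that returns the markup instead of writing it.
function renderWithPlaceholders(string $fragment): string
{
    $doc = new DOMDocument();
    $doc->loadHTML('<html><head><title>Test</title></head><body>' . $fragment . '</body></html>');
    $html = $doc->saveHTML();

    // Turn percent-encoded brackets back into literal brackets (applies to the whole output).
    return str_replace(['%5B', '%5D'], ['[', ']'], $html);
}

$html = renderWithPlaceholders("<a href='[CLICK_URL]'><strong>testgerr</strong></a>");
```

The result can then be written with file_put_contents() in place of saveHTMLFile().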

PHP: Converting xml to array

I have an xml string. That xml string has to be converted into PHP array in order to be processed by other parts of software my team is working on.
For the XML -> array conversion I'm using something like this:
if(get_class($xmlString) != 'SimpleXMLElement') {
    $xml = simplexml_load_string($xmlString);
}
if(!$xml) {
    return false;
}
It works fine - most of the time :) The problem arises when my "xmlString" contains something like this:
<Line0 User="-5" ID="7436194"><Node0 Key="<1" Value="0"></Node0></Line0>
Then, simplexml_load_string won't do its job (and I know that's because of the character "<").
As I can't influence any other part of the code (I can't open up the module that's generating the XML string and tell it "encode special characters, please!"), I need your suggestions on how to fix that problem BEFORE calling "simplexml_load_string".
Do you have some ideas? I've tried
str_replace("<","&lt;",$xmlString)
but that simply ruins the entire "xmlString"... :(
Well, then you can just replace the special characters inside the quoted attribute values of $xmlString with their HTML entity counterparts, using htmlspecialchars() and preg_replace_callback().
I know this is not performance friendly, but it does the job :)
<?php
$xmlString = '<Line0 User="-5" ID="7436194"><Node0 Key="<1" Value="0"></Node0></Line0>';
$xmlString = preg_replace_callback('~(?:").*?(?:")~',
    function ($matches) {
        return htmlspecialchars($matches[0], ENT_NOQUOTES);
    },
    $xmlString
);
header('Content-Type: text/plain');
echo $xmlString; // you will see the special characters are converted to HTML entities :)
echo PHP_EOL . PHP_EOL; // tidy :)
$xmlobj = simplexml_load_string($xmlString);
var_dump($xmlobj);
?>
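A slightly more defensive variant can be sketched by passing double_encode = false to htmlspecialchars(), so that attribute values which already contain entities such as &amp; are not escaped twice. escapeAttributeValues is a hypothetical helper name, and the sketch assumes that every attribute value is wrapped in double quotes and contains no double quote itself.

```php
<?php
// Hypothetical helper: escapes <, > and & inside double-quoted attribute values.
function escapeAttributeValues(string $xml): string
{
    return preg_replace_callback(
        '~"[^"]*"~',
        function (array $m): string {
            // The fourth argument (double_encode = false) leaves existing entities alone.
            return htmlspecialchars($m[0], ENT_NOQUOTES, 'UTF-8', false);
        },
        $xml
    );
}

$raw = '<Line0 User="-5" ID="7436194"><Node0 Key="<1" Value="0"></Node0></Line0>';
$xml = simplexml_load_string(escapeAttributeValues($raw));

// The parser decodes the entity again, so the attribute reads as the original value.
$key = (string) $xml->Node0['Key'];
```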

PHP DOMDocument does not keep numeric presentation of HTML special characters

I have a DOMDocument and would like to append some nodes.
In one of the nodes, I would like to put:
$copyrightStatementText = "&#169; This is the CopyRight";
The problem is that the function:
$copyrightStatement = $dom_output->createElement('copyright-statement', $copyrightStatementText);
Is converting the &#169; immediately to ©.
My goal is to keep the &#169;
Any idea how could I do that?
From DOMDocument::createElement():
Note:
The value will not be escaped. Use DOMDocument::createTextNode() to create a text node with escaping support.
So use DOMDocument::createTextNode() instead:
$copyrightString = "&#169; This is the Copyright";
$copyrightNode = $dom_output->createTextNode($copyrightString);
$copyrightContainer = $dom_output->createElement('copyright-statement');
$copyrightContainer->appendChild($copyrightNode);

get wrapping element using preg_match php

I want a preg_match code that will detect a given string and get its wrapping element.
I have a string and a html code like:
$string = "My text";
$html = "<div><p class='text'>My text</p><span>My text</span></div>";
So I need to create a function that will return the element wrapping the string, like:
$element = get_wrapper($string, $html);
function get_wrapper($str, $code){
    // code here that uses preg_match and returns the wrapper element
}
The returned value will be an array, since there are 2 possible wrappers here: <p class='text'></p> and <span></span>.
Can anyone give me a regex pattern to get the HTML element that wraps the given string?
Thanks! Answers are greatly appreciated.
It's a bad idea to use regex for this task. You can use DOMDocument instead:
$oDom = new DOMDocument('1.0', 'UTF-8');
$oDom->loadXML("<div>" . $sHtml ."</div>");
get_wrapper($s, $oDom);
Then walk the tree recursively:
function get_wrapper($s, $oDom) {
    foreach ($oDom->childNodes as $oItem) {
        if ($oItem->nodeValue == $s) {
            // wrapper found: its tag name is $oItem->nodeName
        }
        else {
            get_wrapper($s, $oItem);
        }
    }
}
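The same idea can be sketched end to end with DOMXPath instead of manual recursion. This is only an illustration under stated assumptions: find_wrappers is a hypothetical name, the fragment is assumed to be well-formed XML (as loadXML requires), and the function returns tag names rather than full elements.

```php
<?php
// Hypothetical helper: returns the tag names of elements whose own text equals $needle.
function find_wrappers(string $needle, string $html): array
{
    $dom = new DOMDocument();
    // Wrap the fragment in a root element so that it has a single document node.
    $dom->loadXML('<root>' . $html . '</root>');

    $names = [];
    $xpath = new DOMXPath($dom);
    // //*[text()] selects every element that has at least one direct text node child.
    foreach ($xpath->query('//*[text()]') as $element) {
        foreach ($element->childNodes as $child) {
            if ($child instanceof DOMText && trim($child->nodeValue) === $needle) {
                $names[] = $element->nodeName;
                break;
            }
        }
    }
    return $names;
}

$html = "<div><p class='text'>My text</p><span>My text</span></div>";
$wrappers = find_wrappers('My text', $html); // ['p', 'span']
```

Comparing only direct text nodes keeps ancestors such as the outer div out of the result, even when their combined text happens to match.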
The simple pattern would be the following, but it assumes a lot of things. Regexes shouldn't be used with these. You should look at something like the Simple HTML DOM parser which is more intelligent.
Anyway, the regex that would match the wrapper tags and surrounding html elements is as follows.
/[A-Za-z'= <]*>My text<[A-Za-z\/>]*/g
Even if regex is never the correct answer in the domain of DOM parsing, I came up with another (quite simple) solution:
<[^>/]+?>My String</.+?>
This works if the HTML is good (i.e. it has closing tags, < is replaced with &lt;, and so on). This way you have in the first regex group the opening tag and in the second the closing one.

How to Ignore Whitespaces using preg_match()

I have a string that looks like:
">ANY CONTENT</span>(<a id="show
I need to fetch ANY CONTENT. However, there are spaces in between
</span> and (<a id="show
Here is my preg_match:
$success = preg_match('#">(.*?)</span>\s*\(<a id="show#s', $basicPage, $content);
\s* represents spaces. I get an empty array!
Any idea how to fetch CONTENT?
Use a real HTML parser. Regular expressions are not really suitable for the job. See this answer for more detail.
You can use DOMDocument::loadHTML() to parse into a structured DOM object that you can then query, like this very basic example (you need to do error checking though):
$dom = new DOMDocument;
$dom->loadHTML($data);
$span = $dom->getElementsByTagName('span');
$content = $span->item(0)->textContent;
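The error checking mentioned above could look roughly like this. It is a minimal sketch: the $data string is made-up sample input, and libxml_use_internal_errors() is used so that warnings about imperfect markup are collected instead of printed.

```php
<?php
// Made-up sample input resembling the string from the question.
$data = '<span class="name">ANY CONTENT</span>   (<a id="show">more</a>)';

// Collect parser warnings instead of emitting them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$content = null;
if ($dom->loadHTML($data)) {
    $span = $dom->getElementsByTagName('span')->item(0);
    // item(0) is null when the markup contains no <span> at all.
    if ($span !== null) {
        $content = $span->textContent;
    }
}
libxml_clear_errors();
```

Because the whitespace between </span> and ( lives outside the span element, it never reaches $content.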
I just had to define this part more precisely:
">
There were too many occurrences of "> in the page, so the pattern didn't know which one to choose specifically. Therefore, it returned everything from an earlier "> until it hit (
Solution:
.">
Sample:
$success = preg_match('#\.">(.*?)</span>\s*\(<a id="show#s', $basicPage, $content);
