SimpleXML with decoded entities - php

How can I make SimpleXML replace HTML/XML entities with their respective characters in PHP?
Assume having this XML document, in a string:
$data = '<?xml version="1.0" encoding="ISO-8859-1"?><example>Tom &amp; Jerry</example>';
Obviously, I want SimpleXML to decode &amp; to &. It does not do it by default. I have tried these two ways, neither of which worked:
$xml = new SimpleXMLElement($data);
$xml = new SimpleXMLElement($data, LIBXML_NOENT);
What's the best way to get XML entities decoded? I guess the XML parser should do it; I would like to avoid running html_entity_decode before parsing (actually, it won't work either). Could this be a problem with the encoding of the string? If so, how could I track it down and fix it?

I don't know if this is going to work in some cases but maybe...
$xml = new SimpleXMLElement(html_entity_decode($data));
http://www.php.net/manual/en/function.html-entity-decode.php
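As a hedged aside: when the input is well-formed (the ampersand written as &amp;), SimpleXML already hands back decoded text once a node is read as a string; the escaped form only reappears when the tree is serialized again. A minimal sketch, assuming the sample document from the question with the ampersand escaped:

```php
<?php
// Assumption: the ampersand is escaped as &amp;, as well-formed XML requires.
$data = '<?xml version="1.0" encoding="ISO-8859-1"?><example>Tom &amp; Jerry</example>';

$xml = new SimpleXMLElement($data);

// Casting a node to string returns its text with the entity decoded.
$text = (string) $xml;

// Serializing the tree escapes the ampersand again.
$markup = $xml->asXML();

echo $text, PHP_EOL;
```

If the decoded value is what you need, read it through the string cast rather than inspecting the serialized markup.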

Related

How to get DOMDocument from HTML with cyrillic letters in PHP 8 [duplicate]

I'm trying to parse some HTML using DOMDocument, but when I do, I suddenly lose my encoding (at least that is how it appears to me).
$profile = "<div><p>various japanese characters</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);
$divs = $dom->getElementsByTagName('div');
foreach ($divs as $div) {
echo $dom->saveHTML($div);
}
The result of this code is that I get a bunch of characters that are not Japanese. However, if I do:
echo $profile;
it displays correctly. I've tried saveHTML and saveXML, and neither display correctly. I am using PHP 5.3.
What I see:
ã¤ãªãã¤å·ã·ã«ã´ã«ã¦ãã¢ã¤ã«ã©ã³ãç³»ã®å®¶åº­ã«ã9人åå¼ã®5çªç®ã¨ãã¦çã¾ãããå½¼ãå«ãã¦4人ã俳åªã«ãªã£ããç¶è¦ªã¯æ¨æã®ã»ã¼ã«ã¹ãã³ã§ãæ¯è¦ªã¯éµä¾¿å±ã®å®¢å®¤ä¿ã ã£ããé«æ ¡æ代ã¯ã­ã£ãã£ã®ã¢ã«ãã¤ãã«å¤ãã¿ãæè²è³éãåããªããã«ããªãã¯ç³»ã®é«æ ¡ã¸é²å­¦ã
What should be shown:
イリノイ州シカゴにて、アイルランド系の家庭に、9人兄弟の5番目として生まれる。彼を含めて4人が俳優になった。父親は木材のセールスマンで、母親は郵便局の客室係だった。高校時代はキャディのアルバイトに勤しみ、教育資金を受けながらカトリック系の高校へ進学
EDIT: I've simplified the code down to five lines so you can test it yourself.
$profile = "<div lang=ja><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);
echo $dom->saveHTML();
echo $profile;
Here is the html that is returned:
<div lang="ja"><p>イリノイ州シカゴã«ã¦ã€ã‚¢ã‚¤ãƒ«ãƒ©ãƒ³ãƒ‰ç³»ã®å®¶åº­ã«ã€</p></div>
<div lang="ja"><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>
DOMDocument::loadHTML will treat your string as being in ISO-8859-1 (the HTTP/1.1 default character set) unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.
If your string doesn't contain an XML encoding declaration, you can prepend one to cause the string to be treated as UTF-8:
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();
If you cannot know if the string will contain such a declaration already, there's a workaround in SmartDOMDocument which should help you:
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();
This is not a great workaround, but since not all characters can be represented in ISO-8859-1 (like these katakana), it's the safest alternative.
The problem is with saveHTML() and saveXML(): neither works correctly on Unix. They do not save UTF-8 characters correctly on Unix, but they work on Windows.
The workaround is very simple:
If you try the default, you will get the error you described
$str = $dom->saveHTML(); // saves incorrectly
All you have to do is save as follows:
$str = $dom->saveHTML($dom->documentElement); // saves correctly
This line of code will get your UTF-8 characters to be saved correctly. Use the same workaround if you are using saveXML().
Update
As suggested by "Jack M" in the comments section below, and verified by "Pamela" and "Marco Aurélio Deleu", the following variation might work in your case:
$str = utf8_decode($dom->saveHTML($dom->documentElement));
Note
English characters do not cause any problem when you use saveHTML() without parameters (because English characters are saved as single byte characters in UTF-8)
The problem happens when you have multi-byte characters (such as Chinese, Russian, Arabic, Hebrew, ...etc.)
I recommend reading this article: http://coding.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/. You will understand how UTF-8 works and why you have this problem. It will take you about 30 minutes, but it is time well spent.
Make sure the real source file is saved as UTF-8 (You may even want to try the non-recommended BOM Chars with UTF-8 to make sure).
Also in case of HTML, make sure you have declared the correct encoding using meta tags:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
If it's a CMS (as you've tagged your question with Joomla) you may need to configure appropriate settings for the encoding.
This took me a while to figure out but here's my answer.
Before using DomDocument I would use file_get_contents to retrieve URLs and then process them with string functions. Perhaps not the best way but quick. After being convinced Dom was just as quick I first tried the following:
$dom = new DomDocument('1.0', 'UTF-8');
if ($dom->loadHTMLFile($url) == false) { // read the url
// error message
}
else {
// process
}
This failed spectacularly in preserving UTF-8 encoding despite the proper meta tags, PHP settings, and all the rest of the remedies offered here and elsewhere. Here's what works:
$dom = new DomDocument('1.0', 'UTF-8');
$str = file_get_contents($url);
if ($dom->loadHTML(mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8')) == false) {
}
etc. Now everything's right with the world.
You could prefix a line enforcing utf-8 encoding, like this:
@$doc->loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . "\n" . $profile);
And you can then continue with the code you already have, like:
$doc->saveXML()
Use correct header for UTF-8
Don't get satisfied by "it works".
@cmbuckley, in his accepted answer, advised adding <?xml encoding="utf-8" ?> to the document. However, using an XML declaration in an HTML document is a bit weird. HTML is not XML (unless it is XHTML), and it can confuse browsers and other software on the way to the client (which may be the source of the failures reported by others).
I successfully used HTML5 declaration:
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<!DOCTYPE html><meta charset="UTF-8">' . $profile);
echo $dom->saveHTML();
If you use another standard, use the correct header. DOMDocument follows the standards quite pedantically and seems to support HTML5, too (if not in your case, try updating the libxml extension).
You must feed DOMDocument a version of your HTML with a header that makes sense.
Just like HTML5.
$profile ='<?xml version="1.0" encoding="'.$_encoding.'"?>'. $html;
It may be a good idea to keep your HTML as valid as you can, so you don't run into issues when you start querying around :-) And stay away from htmlentities! That's an unnecessary back and forth that wastes resources.
Keep your code sane!
Use it for correct result
$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $profile);
echo $dom->saveHTML();
echo $profile;
This operation
mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8');
This is a bad approach, because $profile may already contain entities such as &lt; and &gt;, and they will not be escaped a second time by mb_convert_encoding. That opens a hole for XSS and produces incorrect HTML.
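The meta-prefix approach from this answer could be wrapped in a small helper. This is only a sketch under assumptions: the function name loadUtf8Fragment is made up for illustration, and the check reads textContent rather than serialized output, since serialization details can differ between libxml versions.

```php
<?php
// Hypothetical helper (name is an assumption): parse a UTF-8 HTML fragment.
function loadUtf8Fragment(string $html): DOMDocument
{
    $dom = new DOMDocument();
    // Collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The meta element tells the HTML parser that the bytes which follow are UTF-8.
    $dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

$profile = '<p>イリノイ州シカゴにて</p>';
$dom = loadUtf8Fragment($profile);

// textContent is returned as UTF-8, so it should match the input text.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, PHP_EOL;
```

Unlike the HTML-ENTITIES conversion, this leaves the fragment's own bytes untouched, so existing entities in $profile are not affected.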
Works fine for me:
$dom = new \DOMDocument;
$dom->loadHTML(utf8_decode($html));
...
return utf8_encode( $dom->saveHTML());
The only thing that worked for me was the accepted answer of
$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();
HOWEVER
This brought about new issues, of having <?xml encoding="utf-8" ?> in the output of the document.
The solution for me was then to do
foreach ($doc->childNodes as $xx) {
if ($xx instanceof \DOMProcessingInstruction) {
$xx->parentNode->removeChild($xx);
}
}
Some solutions told me that, to remove the XML header, I had to perform
$dom->saveXML($dom->documentElement);
This didn't work for me because, for a partial document (e.g. a doc with two <p> tags), only one of the <p> tags was being returned.
The problem is that when you add a parameter to the DOMDocument::saveHTML() function, you lose the encoding. In a few cases, you'll need to avoid the parameter and use old string functions to find what you are looking for.
I think the previous answer works for you, but since this workaround didn't work for me, I'm adding that answer to help people who may be in my case.

saveHTML() doesn't output special characters properly

I've looked at other answers (php: using DomDocument whenever I try to write UTF-8 it writes the hexadecimal notation of it, DOMDocument->saveHTML() converting to space) and either they don't apply to my situation, or I'm not understanding them.
I'm feeding some HTML into $dom like this...
$dom = new DOMDocument;
$dom->loadHTML($table_data_for_db);
I then do some stuff with it, then output it like this..
$table_data_for_db = $dom->saveHTML();
echo $table_data_for_db;
The problem is that special characters such as → end up like this: â†’.
1.) Is there a way around this?
2.) Is there another way in PHP other than using DOMDocument, loadHTML, etc. to strip out sections of HTML? Like, if I want to remove <style id="fraction_class"> and all of its contents, is there another way?
Thank you.

Issue parsing XML, unknown encoding

I'm trying to read an XML feed. I'm not sure the encoding is proper, but it's set to UTF-8, and when I try to parse it in PHP via SimpleXML, it errors on "B&ouml;&eth;Var" (note the special "o" characters).
libxml_use_internal_errors(TRUE);
$XMLOutputXMLObj = simplexml_load_string($xml_string);
if($XMLOutputXMLObj !== FALSE)
{
//do stuff
}
This is all I get for an error:
Entity 'ouml' not defined
Entity 'eth' not defined
I tried using "mb_convert_encoding", in various ways, but that failed.
How can I resolve this issue for any character? I.e. WITHOUT manually replacing &ouml; with &214; (with # of course)?
Even better... is there a way to make it so SimpleXML doesn't care what it is parsing, as long as the tags are intact?
Thanks
Have you tried to escape the XML data in the node using the <![CDATA[ and ]]> tags before and after the node's text/value? E.g.
<?xml version="1.0" encoding="UTF-8"?>
<fmsdata>
<result><![CDATA[Success !@#$%^&*()]]></result>
</fmsdata>

load DOMDocument with HTML Special Characters (php)

i have a problem to load a xml-file with php. I use DOMDocument, because i need the function getElementsByTagName.
I use this code.
$dom = new DomDocument('1.0', 'UTF-8');
$dom->resolveExternals = false;
$dom->load($_FILES["file"]["tmp_name"]);
<?xml version="1.0" encoding="UTF-8"?>
<Data>
<value>1796563</value>
<value>Verliebt! &rsquo;</value>
</Data>
ErrorMessage:
Warning: DOMDocument::load() [domdocument.load]: Entity 'rsquo' not defined in /tmp/php1VRb3N, line: 4 in /www/htdocs/bla/upload.php on line 51
Your XML parser is not lying. That's an invalid (not even well-formed) document that you won't be able to load with anything.
rsquo is a predefined entity in HTML, but not in XML. In an XML document if you want to use anything but the most basic built-in entities (amp, lt, gt, quot and apos) you must define them in a DTD referenced by a <!DOCTYPE> declaration. (This is how XHTML does it.)
You need to find out where the input came from and fix whatever was responsible for creating it, because at the moment it's simply not XML. Use a character reference (&#8217;) or just the plain literal character ’ in UTF-8 encoding.
As a last resort if you really have to accept this malformed nonsense for input you could do nasty string replacements over the file:
$xml= file_get_contents($_FILES['file']['tmp_name']);
$xml= str_replace('&rsquo;', '&#8217;', $xml);
$dom->loadXML($xml);
If you need to do this for all the non-XML HTML entities and not just rsquo that's a bit more tricky. You could do a regex replacement:
function only_html_entity_decode($match) {
if (in_array($match[1], array('amp', 'lt', 'gt', 'quot', 'apos')))
return $match[0];
else
return html_entity_decode($match[0], ENT_COMPAT, 'UTF-8');
}
$xml= preg_replace_callback('/&(\w+);/', 'only_html_entity_decode', $xml);
This still isn't great as it's going to be mauling any sequences of &\w+; characters inside places like comments, CDATA sections and PIs, where this doesn't actually mean an entity reference. But it's probably about the best you can do given this broken input.
It's certainly better than calling html_entity_decode over the whole document, which will also mess up any XML entity references, causing the document to break whenever there's an existing &amp; or &lt;.
Another hack, questionable in different ways, would be to load the document using loadHTML().
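To see the regex-based replacement in action, here is a small self-contained sketch. The helper name decodeHtmlOnlyEntities and the sample input are assumptions made for illustration; the callback logic follows the approach described in this answer.

```php
<?php
// Hypothetical helper name; decodes HTML-only named entities, keeps XML's own.
function decodeHtmlOnlyEntities(string $xml): string
{
    $xmlEntities = ['amp', 'lt', 'gt', 'quot', 'apos'];

    return preg_replace_callback(
        '/&(\w+);/',
        function (array $match) use ($xmlEntities): string {
            // Leave the five predefined XML entities alone; decode the rest.
            return in_array($match[1], $xmlEntities, true)
                ? $match[0]
                : html_entity_decode($match[0], ENT_COMPAT, 'UTF-8');
        },
        $xml
    );
}

// Assumed sample input, modelled on the uploaded file in the question.
$broken = '<Data><value>Verliebt! &rsquo; &amp; mehr</value></Data>';

$dom = new DOMDocument();
$dom->loadXML(decodeHtmlOnlyEntities($broken));

$value = $dom->getElementsByTagName('value')->item(0)->textContent;
echo $value, PHP_EOL;
```

The &amp; survives the replacement untouched and is decoded by the XML parser itself, while &rsquo; is turned into the literal character before parsing.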
In order to use that Entity, it must be defined in a DTD. Otherwise it's invalid XML. If you don't have a DTD, you should decode the entity prior to loading the XML with DOM:
$dom->loadXML(
html_entity_decode(
file_get_contents($_FILES["file"]["tmp_name"]),
ENT_COMPAT, 'UTF-8'));
My solution with help from bobince is:
$xml= file_get_contents($_FILES['file']['tmp_name']);
$xml= preg_replace('/&(\w+);/', '', $xml);
$dom = new DomDocument();
$dom->loadXML($xml);

php output xml produces parse error "&rsquo;"

Is there any function that I can use to parse any string to ensure it won't cause xml parsing problems? I have a php script outputting a xml file with content obtained from forms.
The thing is, apart from the usual string checks from a php form, some of the user text causes xml parsing errors. I'm facing this "&rsquo;" in particular. This is the error I'm getting: Entity 'rsquo' not defined
Does anyone have any experience in encoding text for xml output?
Thank you!
Some clarification:
I'm outputting content from forms in a xml file, which is subsequently parsed by javascript.
I process all form inputs with: htmlentities(trim($_POST['content']), ENT_QUOTES, 'UTF-8');
When I want to output this content into a xml file, how should I encode it such that it won't throw up xml parsing errors?
So far the following 2 solutions work:
1) echo '<content><![CDATA['.$content.']]></content>';
2) echo '<content>'.htmlspecialchars(html_entity_decode($content, ENT_QUOTES, 'UTF-8'),ENT_QUOTES, 'UTF-8').'</content>'."\n";
Are the above 2 solutions safe? Which is better?
Thanks, sorry for not providing this information earlier.
You take it the wrong way - don't look for a parser which doesn't give you errors. Instead try to have a well-formed xml.
How did you get &rsquo; from the user? If they literally typed it in, you are not processing the input correctly: for example, you should escape & to &amp;. If it is you who put the entity there (perhaps in place of some apostrophe), either define it in a DTD (<!ENTITY rsquo "&#x2019;">) or write it using numeric notation (&#8217;), because almost all of the named entities are part of HTML. XML defines only a few basic ones, as Gumbo pointed out.
EDIT based on additions to the question:
In #1, you escape the content in the way that if user types in ]]> <°)))><, you have a problem.
In #2, you are encoding and then decoding, which results in the original value of $content. The decoding should not be necessary (as long as you don't expect users to post values like &amp; that should be interpreted as &).
If you use htmlspecialchars() with ENT_QUOTES, it should be ok, but see how Drupal does it.
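If you do want to keep the CDATA variant, one common way around the ]]> problem is to split that sequence across two CDATA sections. A minimal sketch, assuming a made-up helper name wrapInCdata:

```php
<?php
// Hypothetical helper: wrap text in CDATA, splitting any "]]>" so that
// user input cannot close the section early.
function wrapInCdata(string $text): string
{
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $text) . ']]>';
}

// Assumed sample input containing the problematic sequence.
$content = 'fish: ]]> <°)))><';
$xml = '<content>' . wrapInCdata($content) . '</content>';

$dom = new DOMDocument();
$dom->loadXML($xml);

// textContent concatenates both CDATA sections back into the original text.
$roundTrip = $dom->documentElement->textContent;
echo $roundTrip, PHP_EOL;
```

The parser sees two adjacent CDATA sections, and their combined text equals the original input.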
html_entity_decode($string, ENT_QUOTES, 'UTF-8')
Enclose the value within CDATA tags.
<message><![CDATA[’]]></message>
From the w3schools site:
Characters like "<" and "&" are illegal in XML elements.
"<" will generate an error because the parser interprets it as the start of a new element.
"&" will generate an error because the parser interprets it as the start of an character entity.
Some text, like JavaScript code, contains a lot of "<" or "&" characters. To avoid errors script code can be defined as CDATA.
Everything inside a CDATA section is ignored by the parser.
The problem is that your htmlentities function is doing what it should - generating HTML entities from characters. You're then inserting these into an XML document which doesn't have the HTML entities defined (things like &rsquo; are HTML-specific).
The easiest way to handle this is keep all input raw (i.e. don't parse with htmlentities), then generate your XML using PHP's XML functions.
This will ensure that all text is properly encoded, and your XML is well-formed.
Example:
$user_input = "...<>&'";
$doc = new DOMDocument('1.0','utf-8');
$element = $doc->createElement("content");
$element->appendChild($doc->createTextNode($user_input));
$doc->appendChild($element);
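To check what the DOM approach produces, the document can be serialized and parsed back. This is just a sketch: the sample input is an assumption, and the round trip through simplexml_load_string is only there to confirm that the text survives.

```php
<?php
// Assumed raw user input containing characters that are special in XML.
$user_input = "...<>&' Tom & Jerry";

$doc = new DOMDocument('1.0', 'utf-8');
$element = $doc->createElement('content');
// createTextNode takes the raw string; escaping happens on serialization.
$element->appendChild($doc->createTextNode($user_input));
$doc->appendChild($element);

$xml = $doc->saveXML();
echo $xml;

// Parsing the output again should give back the original text.
$roundTrip = (string) simplexml_load_string($xml);
```

The markup characters come out escaped in the serialized string, and reading the element back yields the input unchanged.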
I had a similar problem that the data i needed to add to the XML was already being returned by my code as htmlentities() (not in the database like this).
i used:
$doc = new DOMDocument('1.0','utf-8');
$element = $doc->createElement("content");
$element->appendChild($doc->createElement('string', htmlspecialchars(html_entity_decode($string, ENT_QUOTES, 'UTF-8'), ENT_XML1, 'UTF-8')));
$doc->appendChild($element);
or if it was not already in htmlentities()
just the below should work
$doc = new DOMDocument('1.0','utf-8');
$element = $doc->createElement("content");
$element->appendChild($doc->createElement('string', htmlspecialchars($string, ENT_XML1, 'UTF-8')));
$doc->appendChild($element);
Basically, using htmlspecialchars with ENT_XML1 should turn user-supplied data into XML-safe data (and it works fine for me):
htmlspecialchars($string, ENT_XML1, 'UTF-8');
Using htmlspecialchars() will solve your problem. See the post below.
PHP - Is htmlentities() sufficient for creating xml-safe values?
This worked for me. Someone facing the same issue can try this.
htmlentities($string, ENT_XML1)
With special characters conversion.
htmlspecialchars(htmlentities($string, ENT_XML1))
htmlspecialchars(trim($_POST['content']), ENT_XML1, 'UTF-8');
Should do it.
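A quick way to convince yourself that the escaped value is XML-safe is to embed it in an element and parse it back. A minimal sketch with an assumed sample string; ENT_QUOTES is added here (an extra choice, not part of the answers above) so the value would also be safe inside attributes:

```php
<?php
// Assumed sample input: raw text, not previously run through htmlentities().
$input = "Tom & Jerry\u{2019}s <b>show</b>";

// ENT_XML1 selects XML 1.0 escaping rules; only markup characters are escaped.
$escaped = htmlspecialchars($input, ENT_XML1 | ENT_QUOTES, 'UTF-8');
echo $escaped, PHP_EOL;

// The escaped value can be embedded in an element and parsed back unchanged.
$parsed = simplexml_load_string('<content>' . $escaped . '</content>');
$roundTrip = (string) $parsed;
```

Note that the right single quote stays a literal UTF-8 character instead of becoming &rsquo;, which is what keeps the XML parser happy.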
