PHP htmlentities or htmlspecialchars

I'm using fwrite to create an XML file, but I'm losing the special characters.
Example:
$message0 = htmlspecialchars('<?xml version="1.0" encoding="UTF-8"?>');
$file = fopen("test.xml","w");
echo fwrite($file,"$message0");
fclose($file);
The above code gives me the following output
&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;&lt;JobTemplates&gt;
I need the special characters in order for the XML file to work. If I echo the variables, the special characters appear on the page.

I don't understand why you're encoding HTML characters for this. It's a trusted string, so just put it in single quotes and write it. If any characters are giving you trouble, escape them instead of encoding them.
If there's a reason you must do it this way, then decode inline. But it all seems a bit messy to me.
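If you do end up with an already-encoded string, a minimal sketch of decoding it before writing might look like this. The temporary-directory path is an assumption for illustration only; the question writes to test.xml in the working directory.

```php
<?php
// The string as encoded in the question.
$encoded = htmlspecialchars('<?xml version="1.0" encoding="UTF-8"?>');

// Reverse the encoding so the angle brackets and quotes are literal again.
$decoded = htmlspecialchars_decode($encoded);

// Assumed location for illustration; the question uses "test.xml".
$path = sys_get_temp_dir() . '/test.xml';

$file = fopen($path, 'w');
fwrite($file, $decoded);
fclose($file);
```

Writing the literal string directly, without the encode/decode round trip, gives the same file with less code.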

Here is a tested example. You should not use htmlspecialchars:
$message0 = '<?xml version="1.0" encoding="UTF-8"?><contact><name>foo</name><phone>123456</phone></contact>';
$file = fopen("test.xml","w");
fwrite($file,$message0);
fclose($file);

Related

Encode ’ to be XML safe

I have a string that contains a right single quotation mark:
$str = "David’s Spade";
I am sending the string via XML and need to encode it. I have read that I should encode the string using htmlspecialchars, but I have found that the XML request still fails, whereas htmlentities works.
When I error_log $str:
$str; // David\xe2\x80\x99s Spade
htmlspecialchars($str); // David\xe2\x80\x99s Spade
htmlspecialchars($str, ENT_QUOTES, 'UTF-8'); // David\xe2\x80\x99s Spade
htmlentities($str); // David&rsquo;s Spade
Would it be better to str_replace ’ and then use htmlentities? Are there any other chars htmlentities may miss?
"I am sending the string via XML and need to encode it."
No, you don't. If the XML is UTF-8 encoded (which is the default) and your $str is UTF-8 encoded (as the byte sequences in your question show), you do not need to encode it.
This is by the book. Given the technical information you have provided about your data, this is clear and fine.
You then write that some things work and others don't. Whatever you are doing there, the problem lies in the things you've left out of your question.
To make this more explicit:
$str = "David’s Spade"; // "David\xE2\x80\x99s Spade"
is a perfectly valid string. For example, you can use it with an XML library like SimpleXML to add it to an XML document:
$xml = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><doc/>');
$xml->element = $str;
$xml->asXML('php://output');
Output:
<?xml version="1.0" encoding="UTF-8"?>
<doc><element>David’s Spade</element></doc>
As you can see, the string has been written into the XML without changing its byte sequence, because the document is UTF-8.
Now take a document that declares no encoding, so the output falls back to ASCII:
$xml = new SimpleXMLElement('<doc/>');
$xml->element = $str;
$xml->asXML('php://output');
Output:
<?xml version="1.0"?>
<doc><element>David&#x2019;s Spade</element></doc>
As this example shows, the result depends on the document encoding. The second example is a fallback in SimpleXML to make the output more robust; it isn't strictly necessary, because UTF-8 is the default encoding of XML.
In any case, you should not need to worry about the encoding yourself if you use a library that specializes in creating XML documents. PHP has several for exactly that. Use one of them.
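As a rough sketch of the same idea with another of PHP's XML libraries, here is DOMDocument building the document. The " & Co." suffix is an addition made up for illustration, to show that the library also escapes markup characters by itself.

```php
<?php
// The string from the question, plus a made-up "& Co." suffix for illustration.
$str = "David\u{2019}s Spade & Co.";

// Declare the document as UTF-8 so non-ASCII characters are written as-is.
$doc = new DOMDocument('1.0', 'UTF-8');
$root = $doc->appendChild($doc->createElement('doc'));
$element = $root->appendChild($doc->createElement('element'));

// A text node is escaped by the library when the document is serialized.
$element->appendChild($doc->createTextNode($str));

$xml = $doc->saveXML();
echo $xml;
```

The quotation mark stays as its UTF-8 bytes, while the ampersand is written as &amp;amp; without any manual call to htmlspecialchars or htmlentities.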

Outputting UTF-8 with PHP SimpleXML

I'm trying to parse an XML file generated by WordPress' export function. I've grabbed the text from the <content:encoded> block, but when I echo the text it comes out garbled, converted into ASCII I think.
<?php
header("Content-Type: text/plain; charset: UTF-8;");
$source = file_get_contents("blog.wordpress.2013-10-31.xml");
$xml = simplexml_load_string($source);
$items = $xml->channel->item;
foreach($items as $item) {
    $namepsaces = $item->getNameSpaces(true);
    $content = $item->children($namepsaces['content']);
    if($content != '') {
        echo '#' . $item->title . "#\n";
        echo $content->encoded;
        echo "\n\n\n";
    }
}
So "As the BBC’s" would become "As the BBCâ€™s". Any way I can stop this?
Edit: I've appended echo '“Test”'; to just after the header and I'm seeing â€œTestâ€ in my browser, so this doesn't appear to be a SimpleXML issue.
Since the UTF-8 encoding of ’ (0xE2 0x80 0x99) reads as â€™ in Windows-1252, which is exactly what you describe, it seems that your UTF-8 encoded strings are being interpreted as Windows-1252.
The output of SimpleXML when you read from elements or attributes is always UTF-8 encoded, therefore about that part I see no problem with your code.
So it's more likely that the XML file has the wrong encoding hinted. Fix that and you should be fine (as you have not shown that file, it's hard to say what exactly needs to be changed and why the encoding got mixed-up in the first place, perhaps some transfer issue).
You perhaps need to re-encode the XML file before you send it to the parser. If so, XMLRecoder might be helpful.
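The misreading described above can be reproduced in a few lines. This is only a sketch and assumes the mbstring extension is available; the variable names are illustrative.

```php
<?php
// The right single quotation mark as UTF-8 bytes.
$utf8 = "\u{2019}";
echo bin2hex($utf8), "\n"; // e28099

// Pretend those bytes are Windows-1252 and convert them to UTF-8 for display:
// each of the three bytes becomes its own character.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');
echo $misread, "\n";

// Reversing the conversion recovers the original bytes.
$restored = mb_convert_encoding($misread, 'Windows-1252', 'UTF-8');
```

If the round trip back to Windows-1252 restores readable text, the data itself is intact and only the declared or assumed charset is wrong.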
You are using a colon here: charset: UTF-8
The correct code is
header('Content-Type: text/html; charset=utf-8');
Also check that your XML file starts with:
<?xml version="1.0" encoding="UTF-8"?>

Parsing xml with PHP what to do with characters like these

I'm parsing an xml document using php.
When I see the result in my browser I get the following characters:
Ã± instead of Spanish ñ
Ã­ instead of í
Ã¡ instead of á
Ã³ instead of ó
Ã© instead of é
I was going to use str_replace and replace every odd character with the correct one, but sadly the pattern above happens only sometimes, and in general I have a wide collection of odd characters :(
The xml heading is:
<?xml version="1.0" encoding="iso-8859-1"?>
But if I change it to utf-8, nothing is printed at all.
I load the xml as a string with simplexml_load_string (comes from database like that)
Can you please give me any ideas on how to solve this?
Thanks a lot
You have 2 options:
a) include a header('Content-Type: text/html; charset=iso-8859-1'); before any output in your php file.
b) convert the output to utf-8 with $str = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
Both should do the trick.
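SimpleXML uses UTF-8 to encode stored strings. You can use an XML file encoded as iso-8859-1, but if you want to print XML values in that encoding, you have to convert them first with utf8_decode (deprecated as of PHP 8.2; mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8') does the same job).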
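A minimal sketch of that behaviour, assuming the mbstring extension is available. The sample document and the word "España" are made up for illustration.

```php
<?php
// Hypothetical ISO-8859-1 document: \xF1 is the single Latin-1 byte for "ñ".
$source = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><doc><name>Espa\xF1a</name></doc>";

$xml = simplexml_load_string($source);

// SimpleXML hands values back as UTF-8, whatever the file's encoding was.
$utf8 = (string) $xml->name;

// Convert only if the page itself is served as ISO-8859-1.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($utf8), "\n";
echo bin2hex($latin1), "\n";
```

Either print $utf8 on a page declared as UTF-8, or print $latin1 on a page declared as ISO-8859-1; mixing the two is what produces the doubled characters.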
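// Strip control characters and all non-ASCII bytes (this also removes accented letters)
$string = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);
// Load new.xml; the third argument tells SimpleXMLElement that the first is a path, not XML data
$xml = new SimpleXMLElement('new.xml', 0, true);
// Display the XML in textual form
echo $xml->asXML();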

PHP: htmlentities and htmlspecialchars not converting some characters

I'm trying to convert all special chars into HTML-safe entities on their way into my database, but I can't seem to get PHP to handle certain characters. For example, if my string contains any of the following: ¡£¢∞§¶ it gets turned into an empty string.
So for example, the following string:
Hello£
gets turned into an empty string after it's POSTed and processed by the following code:
$workDetails["copy"] = htmlentities($workDetails["copy"], ENT_QUOTES, "UTF-8");
I presume I'm doing something wrong? :(
Maybe it will be enough to change the encoding of your website to UTF-8 via the header() function:
header("Content-Type: text/html; charset=utf-8"); in PHP
or
<?xml version="1.0" encoding="utf-8" ?>; at the top of your HTML template if you use one.
But if you definitely need to convert those characters to their specific HTML codes, you could write your own function to replace the symbols that htmlspecialchars() does not cover.
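One possible cause, offered as an assumption because the question does not show how the form is submitted: htmlentities returns an empty string when its input is not valid in the charset you name, unless ENT_IGNORE or ENT_SUBSTITUTE is set. The sketch below assumes the POSTed text arrives as ISO-8859-1 and that the mbstring extension is available.

```php
<?php
// "Hello£" as ISO-8859-1 bytes; the lone \xA3 byte is not valid UTF-8.
$latin1 = "Hello\xA3";

// Invalid input for the stated charset yields an empty string.
$direct = htmlentities($latin1, ENT_QUOTES, 'UTF-8');

// Convert to UTF-8 first (assumed source encoding), then encode.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$converted = htmlentities($utf8, ENT_QUOTES, 'UTF-8');

var_dump($direct);
var_dump($converted);
```

If this matches your situation, making the form page and the PHP side agree on UTF-8 removes the need for the conversion step.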

Formatting string for xml attribute in php

I have some strings that are valid in my database but when I include them in an attribute of a UTF-8 XML output they give me the following error:
XML Parsing Error: not well-formed
My current code (simplified):
header('Content-Type: text/xml');
echo '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>';
echo '<root attribute="' . htmlentities($string_from_hell) . '">';
How should I format these strings before including them in XML attributes?
A possible value for $string_from_hell:  (don't know if it will show up properly)
Try
htmlspecialchars($string_from_hell, ENT_QUOTES, "UTF-8")
htmlentities won't do because it creates HTML entities that are recognized only in HTML, not in XML. You should also specify the charset explicitly: before PHP 5.4 the default was ISO-8859-1, not UTF-8.
You're also missing the quotes (") around the attribute value.
There are also better ways to create XML files that handle escaping for you. See e.g. XMLWriter.
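A short sketch of the XMLWriter approach follows. The sample value assigned to $string_from_hell is made up for illustration, since the question's actual value is not shown.

```php
<?php
// Made-up sample value containing characters that break hand-built XML.
$string_from_hell = "Tom & \"Jerry\" <caf\u{00E9}>";

$writer = new XMLWriter();
$writer->openMemory();
$writer->startDocument('1.0', 'UTF-8', 'yes');
$writer->startElement('root');

// XMLWriter escapes the attribute value; no manual encoding is needed.
$writer->writeAttribute('attribute', $string_from_hell);

$writer->endElement();
$writer->endDocument();

$xml = $writer->outputMemory();
echo $xml;
```

Parsing the result back, for example with simplexml_load_string, should return the original attribute value unchanged.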
