Problems with php xml characters

Problems with php xml characters - php

Hello friends i have a problem with some characters reading a xml file from php i am using this source code:
$file = 'test.xml';
$xml_1 = simplexml_load_file($file);
echo ($xml_1->content);
its work ok but when the content is a special character like ñ ó it show a rarer character like this Ã± i tried to include in html head utf8 charset but its the same

SimpleXML emits UTF-8 output by design. If you application does not support UTF-8 you'll have to convert with the usual tools (e.g. mb_convert_encoding()) but you need to take this into account:
You need to know for sure the encoding your app is using.
UTF-8 can hold the complete Unicode catalogue thus some characters may not have an equivalent in your target encoding.
Whatever, in 2016 there's no reason to use anything else than UTF-8 unless your maintaining legacy code.

Finally i find the solution i must to use utf8_decode php function to convert the characters it is not enought with put utf8 charset in the head page you must to convert using php before

Related

simplexml_load_string strange characters

I am having difficulty with non-standard characters using simplexml_load_string.
I have loaded an newspaper xml feed using file_get_contents. If I print to screen the contents I get a title for one of the articles as :
<title>‘If Legault were running in Alberta, he’d be more popular’: How right-wing is the CAQ?</title>
If I then do this:
$feed = #simplexml_load_string($xml);
And print the results of $feed, the title has changed to:
[title] => â��If Legault were running in Alberta, heâ��d be more popularâ��: How right-wing is the CAQ?
Any advice on how to stop these characters being displayed like this?

This looks SimpleXML is creating a UTF-8 string, which is then rendered in ISO-8859-1 (latin-1) or something close like CP-1252.
When you save the result to a file and serve that file via a web server, the browser will use the encoding declared in the file.
Including in a web page
Since your web page encoding is not UTF-8, you need to convert the string to whatever encoding you are using, eg ISO-8859-1 (latin-1).
This is easily done with iconv():
$xmlout = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $xmlout);
Saving to database
You database column is not using UTF-8 collation, so you should use iconv to convert the string to the charset that your database uses.
Assuming your database collation is the same as the encoding that you render in, you will not have to do anything when reading from the database.
Explanation
In UTF-8, a 0xc2 prefix byte is used to access the top half of the "Latin-1 Supplement" block which includes characters such as accented letters, currency symbols, fractions, superscript 2 and 3, the copyright and registered trademark symbols, and the non-breaking space.
However in ISO-8859-1, the byte 0xC2 represents an Â. So when your UTF-8 string is misinterpreted as one of those, then you get Â followed by some other nonsense character.

This is a charset issue. it needs to be utf8, you can run utf8_decode on the content, but its better to fix this issue by matching charsets from your input (feed) to your output (html page i presume).

PHP urlencode for chinese characters

I'm creating a php application that involves sending chinese characters as url parameters.
I have to send query like :
http://xyz.com/?q=新
But the script at xyz.com won't automatically encode the chinese character. So, I need to explicitly send an encoded string as the paramter. It becomes:
http://xyz.com/?q=%E6%96%B0
The problem is, PHP won't encode the chinese character properly.
I've tried urlencode() and rawurlencode(). But they give %D0%C2 (doesn't work for my purpose) instead of %E6%96%B0 (works well with xyz.com) as the output.
I'm using this website to create the latter encoded string.
I've also defined header('Content-Type: text/html; charset=gb2312'); to display chinese characters properly.
Is there anything I can do to urlencode the chinese character properly?
Thanks!
PS: I'm a relatively new programmer and don't understand chinese.

You're URLencoding using the charset you specify in your header. %D0%C2 is 新 in gb2312; %E6%96%B0 is 新 in UTF-8. Switch your charset over to UTF-8 and you should fix this issue and still be able to display Simplified Chinese Han.

In order to reproduce your problem I created a simple PHP file:
<?php
var_dump(urlencode('新'));
?>
First I used UTF8 encoding and got %E6%96%B0. Afterwards I changed to GB2312 and got %D0%C2.
At http://meyerweb.com/eric/tools/dencoder/ they seem to use JavaScript, that's UTF8 capable and therefore returns %E6%96%B0, too.
PS: When changing from GB2312 to UTF8 some editors might break code some internationalized code. So please make sure to have a copy of your file before converting!

Strange behaviour when encoding cURL response as UTF-8

I'm making a cURL request to a third party website which returns a text file on which I need to do a few string replacements to replace certain characters by their html entity equivalents e.g I need to replace í by í.
Using string_replace/preg_replace_callback on the response directly didn't result in matches (whether searching for í directly or using its hex code \x00\xED), so I used utf8_encode() before carrying out the replacement. But utf8_encode replaces all the í characters by Ã.
Why is this happening, and what's the correct approach to carrying out UTF-8 replacements on an arbitrary piece of text using php?
*edit - some further research reveals
utf8_decode("í") == í;
utf8_encode("í") == ÃƒÂ;
utf8_encode("\xc3\xad") == ÃƒÂ;

utf8_encode is definitely not the way to go here (you're double-encoding if you do that).
Re. searching for the character directly or using its hex code, did you make sure to add the u modifier at the end of the regex? e.g. /\x00\xED/u?

You're probably specify the characters/strings you want replaced via string literals in the php source code? If you do, then the values of those string literals depends on the encoding you save your php file in. So while you see the character í, maybe the literal value is a latin encoded í, like maybe 8859-1 encoding, or maybe its windows cp1252 í, or maybe its utf8 í, or maybe even utf32 í...i dont know off hand how many of those are different, but i know at least some have different byte representations, and so wont match in a php string comparison.
my point is, you need to specify the correct character that will match whatever encoding your incoming text is in.
heres an example without using literals
$iso8859_1 = chr(236);
$utf8 = utf8_encode(chr(236));
be warned, text editors may or may not convert the existing characters when you change the encoding, if you decide to change the file encoding to utf8. I've seen editors do really bizarre things when changing the encoding. Start with a fresh file.
also-just because the other server claims its utf8, doesn't mean it really is.

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medÃºlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt

To transform an UTF-8 string into an URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $strt);
The IGNORE part tells iconv() not to raise an exception when facing a character it can't manage, and the TRANSLIT part converts an UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
Next step is to preg_replace() spaces into underscores and substitute or drop any character which is unsafe within an URL, either with preg_replace() or urlencode().
As for the database stuff, you really should have done all this setting stuff before INSERTing UTF-8 content. Changing charset to an existing table is somewhat like changing a file extension in Windows - it doesn't convert a JPEG into a GIF. But don't worry and remember that the database will return you byte by byte exactly what you've stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.

I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters. eg.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should UTF-8 and %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (except sometimes not IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write '\xc3\xba' to avoid the problem.
Running SET NAMES utf8 on the database before querying
Use mysql_set_charset() in preference.

Parse XML with special characters (UTF-8)

I'm starting out with some XML that looks like this (simplified):
<?xml version="1.0" encoding="UTF-8"?>
<alldata>
<data name="Forsetì" />
</alldata>
</xml>
But after I've parsed it with simplexml_load_string the special character (the i) becomes: Ã¬ which is obviously pretty mangled.
Is there a way to prevent this from happening?
I know for a fact the XML is fine, when saved as .txt and viewed in the browser the characters are fine. When I use simplexml_load_string on the XML and then save values as a text file, or to the database, its mangled.

This looks SimpleXML is creating a UTF-8 string, which is then rendered in ISO-8859-1 (latin-1) or something close like CP-1252.
When you save the result to a file and serve that file via a web server, the browser will use the encoding declared in the file.
Including in a web page
Since your web page encoding is not UTF-8, you need to convert the string to whatever encoding you are using, eg ISO-8859-1 (latin-1).
This is easily done with iconv():
$xmlout = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $xmlout);
Saving to database
You database column is not using UTF-8 collation, so you should use iconv to convert the string to the charset that your database uses.
Assuming your database collation is the same as the encoding that you render in, you will not have to do anything when reading from the database.
Explanation
In UTF-8, a 0xc2 prefix byte is used to access the top half of the "Latin-1 Supplement" block which includes characters such as accented letters, currency symbols, fractions, superscript 2 and 3, the copyright and registered trademark symbols, and the non-breaking space.
However in ISO-8859-1, the byte 0xC2 represents an Â. So when your UTF-8 string is misinterpreted as one of those, then you get Â followed by some other nonsense character.

It's very likely that the XML is fine, but the character gets mangled when stored or output.
If you're outputting data on a HTML page: Make sure it's encoded in UTF-8 as well. If your HTML page is in ISO-8859-1, you can use utf8_decode as a quick fix; using UTF-8 is the better option in the long run.
If you're storing the data in a mySQL, you need to have UTF8 selected as the encoding all the way through: As the connection's encoding, in the table, and in the column(s) you insert the data into.

I've also had some problems with this, and it came from the PHP script encoding. Make sure it's set to UTF-8.
If it's still not good, try printing the variable using uft8_encode or utf8_decode.

XML is strict when it comes to entities, like & should be &amp; and ì should &igrave;
So you will need a translation table.
function xml_entity_decode($_string) {
// Set up XML translation table
$_xml=array();
$_xl8=get_html_translation_table(HTML_ENTITIES,ENT_COMPAT);
while (list($_key,)=each($_xl8))
$_xml['&#'.ord($_key).';']=$_key;
return strtr($_string,$_xml);
}

Late to the party... But I've faced this and solved like below.
You have declared encoding in XML so if you load xml file using DOMDocument it won't cause any issue.
But in case it happens in other use case, you can use html_entity_decode like below:
html_entity_decode($xml->saveXML());

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.