How to parse special characters with PHP? - php

I'm using PHP and cURL to access a remote API. It returns a JSON result. The API returns user-posted content, so I expected some odd characters here and there. However, very simple characters such as – or ’ are being echoed out via PHP as Chinese characters (I'm aware those aren't true dashes or apostrophes, but rather some equivalent). Nonetheless, other websites manage to display them fine, so I'm not sure why they're echoed out as Chinese characters in my case.
For example: the character ’ echoes out as 鈥檙.
I've tried various PHP methods at my disposal to get them to encode or display correctly, including:
htmlentities()
utf8_encode()
htmlspecialchars()
and none make a difference.
Additionally, I've checked and my page does have
<meta charset="utf-8">
in the <head> element.
Am I missing an obvious solution? I feel like I must be.

鈥檙 is not a special character; it's Unicode. ASCII characters fit in a single byte (7 bits),
whereas a non-ASCII Unicode character such as ’ takes two to four bytes in UTF-8.
Have you tried removing
<meta charset="utf-8">

The API's HTTP Content-Type should give you an idea of the character encoding. You need to view the headers returned by your curl request to see what encoding you're receiving. Running curl from the command line will show you:
curl -v http://...
For example, curl -v google.com shows:
Content-Type: text/html; charset=UTF-8
Then you need to be sure that you are respecting that character encoding in your database, and in your HTML meta tag.
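The same check can be done from PHP rather than the command line. The following is a minimal sketch, not the asker's code: `charset_from_content_type` is a hypothetical helper, the UTF-8 fallback is an assumption, and the sample body stands in for a real API response. With cURL, the header value would come from `curl_getinfo($ch, CURLINFO_CONTENT_TYPE)`.

```php
<?php
// Hypothetical helper: pull the charset parameter out of a Content-Type value.
function charset_from_content_type(?string $contentType, string $default = 'UTF-8'): string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return $default; // assumption: fall back to UTF-8 when nothing is declared
}

// With cURL, the header value would come from:
//   $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
$charset = charset_from_content_type('application/json; charset=ISO-8859-1');

// Sample ISO-8859-1 bytes ("café") standing in for the API response body.
$body = "caf\xE9";

// Convert to UTF-8 only if the API declared something else.
$utf8 = ($charset === 'UTF-8') ? $body : iconv($charset, 'UTF-8', $body);

echo $charset, ' ', bin2hex($utf8), PHP_EOL; // ISO-8859-1 636166c3a9
```

Once the body is known to be UTF-8, the page that echoes it has to declare UTF-8 as well, and only once.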

So, I was just an idiot. I failed to notice that there was a conflicting meta tag on MY page adding the WRONG charset. Thanks to all who took time to try and help.

Related

How to decode UTF-8 characters in CodeIgniter?

I'm developing a site with CodeIgniter that supports multiple languages. When a user searches in their native language, the first page of results is fine, but when I paginate the results the character is not decoded.
This is the URL which is used to paginate.
When I print the URI segment I get %E0%B4%AE
I tried urlencode and urldecode, and then I got different characters like à´®
Can anyone tell me how I can decode this kind of character set?
While urldecode is what you should be using, the reason that you are getting the wrong output printed is probably because the output page's encoding hasn't been set to UTF-8, and is thus defaulting to ISO-8859-1. Hence, while the characters have been decoded correctly by PHP, the browser then interprets the characters in the wrong encoding, resulting in incorrect display.
To fix the problem, send a charset in the Content-type header before any output like so:
header('Content-type: <type>; charset=utf-8');
If your output page is HTML, you could alternatively use this tag in the head:
<meta charset="utf-8">
If you take the second option, be sure to place the tag as early as possible in the head, as browsers do not scan past the first 1024 bytes of the page for this declaration.
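A short sketch of what is going on with the segment from the question: `urldecode` yields the right bytes, and the odd output only appears when those bytes are read as ISO-8859-1. The `iconv` call below merely simulates that misreading; it is not something to add to the application.

```php
<?php
// The URI segment from the question, percent-decoded to raw bytes.
$segment = urldecode('%E0%B4%AE');

echo bin2hex($segment), PHP_EOL;        // e0b4ae: three bytes, one UTF-8 character
var_dump(preg_match('//u', $segment));  // int(1): the bytes are valid UTF-8

// Simulate a browser that assumes ISO-8859-1: each byte becomes its own character.
$misread = iconv('ISO-8859-1', 'UTF-8', $segment);
echo $misread, PHP_EOL;                 // à´®
```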

Japanese and Russian characters - web encoding?

I have a Zope/Plone web service (WS) that calls some functions written in Python.
The WS is called by PHP pages (with UTF-8 declared in the header), but the characters aren't displayed correctly.
I've tried converting special chars to entities where possible (in Python) and that works, but not all chars have corresponding HTML entities.
I've tried saving the original Python file in UTF-8 format, but I didn't think that was the right way.
Can someone help?
Note: I pass through some PHP includes, in case that is a hint...
Edit: it's weird, because if I log all the "pieces" individually, the chars are encoded correctly. When I go up to the "main php page" (where I include all the pieces), everything gets messed up.
Obviously, the "main php page" has that:
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
496e73e972657220646174652064926172726976e9652065742064652064e970617274
That string is encoded in ISO-8859-1 (strictly speaking Windows-1252, since the byte 0x92 is the ’ character there), not UTF-8.
Somewhere you're converting your strings to ISO-8859-1, which means they're not interpreted correctly when trying to interpret them as UTF-8, and all non-European characters will be discarded since ISO-8859-1 can't encode anything but a handful of European characters.
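To see this for yourself, the hex dump can be turned back into bytes and tested. This is a sketch under one assumption: the source encoding is taken to be Windows-1252, because the byte 0x92 has no printable meaning in ISO-8859-1. The `preg_match('//u', ...)` call is used here only as a UTF-8 validity check.

```php
<?php
// Hex dump quoted in the thread, turned back into raw bytes.
$hex   = '496e73e972657220646174652064926172726976e9652065742064652064e970617274';
$bytes = hex2bin($hex);

// preg_match with the /u flag returns 1 only when the subject is valid UTF-8.
$isUtf8 = preg_match('//u', $bytes) === 1; // false: 0xE9 followed by "r" is not valid UTF-8

// Assumption: the bytes are Windows-1252, so convert from that when they are not UTF-8.
$text = $isUtf8 ? $bytes : iconv('Windows-1252', 'UTF-8', $bytes);

echo $text, PHP_EOL; // Insérer date d’arrivée et de départ
```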
I just edited Python's site.py file.
I followed that guide (click here) and everything is OK now.
Thank you all for help.

weird characters

So I'm parsing a web page with a parser that I created, and when I parse the page and echo the content out I get characters like †. Why is it doing that? It's supposed to be ... or some other character like -- instead.
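The weird characters are caused by encoding problems. Your best bet is to convert the text to UTF-8 (and make sure your page is also served as UTF-8) before you echo it.
You can use the function utf8_encode for that, but only if the source is ISO-8859-1, since that is the only conversion it performs; it is also deprecated as of PHP 8.2, so mb_convert_encoding or iconv are the safer choices.
Here is a very complete answer on how to do that successfully:
Detect encoding and make everything UTF-8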
Usually those types of characters come from bad character encoding. Off the top of my head, your best solution is to check the web page that you created for the meta tag that declares its character encoding. Something like this:
<meta content='text/html; charset=UTF-8' http-equiv='Content-Type'/>
And making sure you supply the same character encoding on your end.
I got this solved with iconv("UTF-8", "ISO-8859-1", $string). It does the job, thanks guys.
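One plausible reason an iconv call in that direction helps is that the text had been encoded to UTF-8 twice. The sketch below reproduces the effect with an ellipsis. It uses Windows-1252 rather than ISO-8859-1, which is an assumption made for the demo, because the € that shows up in the garbled form does not exist in ISO-8859-1.

```php
<?php
$ellipsis = "\u{2026}"; // "…" as UTF-8: bytes E2 80 A6

// Mistake: treat the UTF-8 bytes as Windows-1252 and convert them "to UTF-8" again.
$mojibake = iconv('Windows-1252', 'UTF-8', $ellipsis);
echo $mojibake, PHP_EOL; // â€¦

// Converting in the opposite direction restores the original bytes.
$repaired = iconv('UTF-8', 'Windows-1252', $mojibake);
var_dump($repaired === $ellipsis); // bool(true)
```

This only repairs text that really was double encoded; applied to correct UTF-8 it would damage it, so it is worth checking a sample first.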

Browser displays � instead of ´

I have a PHP file which has the following text:
<div class="small_italic">This is what you´ll use</div>
On one server, it appears as:
This is what you´ll use
And on another, as:
This is what you�ll use
Why would there be a difference and what can I do to make it appear properly (as an apostrophe)?
Note to all (for future reference)
I implemented Gordon's / Gumbo's suggestion, except I implemented it on a server level rather than the application level. Note that (a) I had to restart the Apache server and more importantly, (b) I had to replace the existing "bad data" with the corrected data in the right encoding.
/etc/php.ini
default_charset = "iso-8859-1"
You have to make sure the content is served with the proper character set:
Either send the content with a header that includes
<?php header("Content-Type: text/html; charset=[your charset]"); ?>
or - if the HTTP charset headers don't exist - insert a <META> element into the <head>:
<meta http-equiv="Content-Type" content="text/html; charset=[your charset]" />
Like the attribute name suggests, http-equiv is the equivalent of an HTTP response header and user agents should use them in case the corresponding HTTP headers are not set.
Like Hannes already suggested in the comments to the question, you can look at the headers returned by your webserver to see which encoding it serves. There is likely a discrepancy between the two servers. So change the [your charset] part above to that of the "working" server.
For a more elaborate explanation about the why, see Gumbo's answer.
The display of the REPLACEMENT CHARACTER � (U+FFFD) most likely means that you’re specifying your output to be Unicode but your data isn’t.
In this case, if the ACUTE ACCENT ´ is for example encoded using ISO 8859-1, it’s encoded with the byte sequence 0xB4 as that’s the code point of that character in ISO 8859-1. But that byte sequence is illegal in a Unicode encoding like UTF-8. In that case the replacement character U+FFFD is shown.
So to fix this, make sure that you’re specifying the character encoding properly according to your actual one (or vice versa).
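The invalid byte can be observed from PHP as well. A small sketch, assuming the text is stored as ISO-8859-1 as in the example above: `htmlspecialchars` rejects the lone 0xB4 byte when told the input is UTF-8, and converting the string first produces the valid two-byte form.

```php
<?php
$latin1 = "you\xB4ll"; // "you´ll" with ´ stored as the single ISO-8859-1 byte 0xB4

// Without ENT_SUBSTITUTE, invalid UTF-8 input makes htmlspecialchars return "".
$strict = htmlspecialchars($latin1, ENT_QUOTES, 'UTF-8');

// With ENT_SUBSTITUTE, the bad byte becomes U+FFFD, the same replacement character browsers display.
$substituted = htmlspecialchars($latin1, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

// Converting first yields valid UTF-8, where ´ is the two bytes C2 B4.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

var_dump($strict === '');            // bool(true)
echo bin2hex($substituted), PHP_EOL; // 796f75efbfbd6c6c
echo bin2hex($utf8), PHP_EOL;        // 796f75c2b46c6c
```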
To sum it up a little:
Make sure the FILE saved on the web server has the right encoding
Make sure the web server also delivers it with the right encoding
Make sure the HTML meta tag is set to the right encoding
Make sure to use "standard" special chars, i.e. use ' instead of ´ if you want to write something like "Luke Skywalker's code"
For encoding, UTF-8 might be good for you.
If this answer helps, please mark as correct or vote for it. THX
The simple solution is to use an HTML numeric character reference for special characters.
The reference for the typographic apostrophe (’) is &#8217;. Try putting this value in your HTML, and it should work properly for you.
Set your browser's character set to a defined value:
For example,
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
This is probably being caused by the data you're inserting into the page with PHP being in a different character encoding from the page itself (the most common combination is one being Latin-1 and the other UTF-8).
Check the encoding being used for the page, and for your database. Chances are there will be a mismatch.
Create an .htaccess file in the root directory:
AddDefaultCharset utf-8
AddCharset utf-8 *
<IfModule mod_charset.c>
CharsetSourceEnc utf-8
CharsetDefault utf-8
</IfModule>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Unicode and PHP - am I doing something wrong?

I'm using Kohana 3, which has full support for Unicode.
I have this as the first child of my <head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The Unicode character I am inserting is é as in Café.
However, I am getting the triangle with a ? (as in could not decode character).
As far as I can tell in my own code, I am not doing any string manipulation on the text.
In fact, I have placed the accent straight into a view's PHP file and it is still not working.
I copied the character from this page: http://www.fileformat.info/info/unicode/char/00e9/index.htm
I've only just started examining PHP's Unicode limitations, so I could be doing something horribly wrong.
So, how do I display this character? Do I need to resort to the HTML entity?
Update
So this works
Caf<?php echo html_entity_decode('&eacute;', ENT_NOQUOTES, 'UTF-8'); ?>
Why does that work? If I copy the output accented e from that script and insert it into my document, it doesn't work.
View the http headers. You should see something like
Content-Type: text/html; charset=UTF-8
Browsers ignore the meta tag if a real HTTP header states a different encoding.
update
Whatcha get from this?
echo bin2hex('é');
echo chr(0xc3) . chr(0xa9);
You should get c3a9é, otherwise I'd say file encoding issue.
I guess, you see �, the replacement character for invalid UTF-8 byte sequences. Your text is not UTF-8 encoded. Check your editor’s settings to control the encoding of the PHP file.
If you’re not sure about the encoding of your sources, you can enforce UTF-8 compatibility as described here (German text): Force UTF-8.
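If the editor's settings are unclear, the file's bytes can be checked directly. The helper below is hypothetical (`inspect_file_encoding` is not a PHP built-in) and is demonstrated on two temporary files holding "Café" in different encodings; in practice you would pass the path of the view file instead.

```php
<?php
// Hypothetical helper: report whether a file's bytes are valid UTF-8
// and whether the file starts with a UTF-8 byte order mark.
function inspect_file_encoding(string $path): array
{
    $bytes = file_get_contents($path);
    return [
        'valid_utf8' => preg_match('//u', $bytes) === 1,
        'has_bom'    => strncmp($bytes, "\xEF\xBB\xBF", 3) === 0,
    ];
}

// Demo: the same word saved as UTF-8 and as ISO-8859-1.
$utf8File   = tempnam(sys_get_temp_dir(), 'enc');
$latin1File = tempnam(sys_get_temp_dir(), 'enc');
file_put_contents($utf8File, "Caf\xC3\xA9");
file_put_contents($latin1File, "Caf\xE9");

$utf8Report   = inspect_file_encoding($utf8File);
$latin1Report = inspect_file_encoding($latin1File);
var_dump($utf8Report, $latin1Report);

unlink($utf8File);
unlink($latin1File);
```

A file that reports `valid_utf8` as false needs to be re-saved as UTF-8 in the editor; changing headers or meta tags alone will not fix it.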
You should never need entities except the basic ones.
