json_decode fails with German characters (even though the encoding is UTF-8)

I'm using the Google Maps API to retrieve location information. The result is fetched via cURL, and the fetched string should be converted to a JSON object using json_decode.
For many locations (for example, in the Netherlands) this works like a charm. But for many German locations (and probably for other countries such as Austria and Switzerland) it doesn't work as expected.
I believe this is because of 'special' characters like ß, but also ü, ë, ä, ï and so on.
For example: this is the string fetched via cURL (http://maps.googleapis.com/maps/api/geocode/json?address=Stoltenkampstra%C3%9Fe%2011,Bad%20Bentheim&sensor=false&language=nl)
In the following, $sResponse is the result fetched by cURL.
When I call json_decode($sResponse), its value is null. json_last_error() then returns 5 (which means JSON_ERROR_UTF8), yet mb_detect_encoding($sResponse) reports UTF-8.
Any suggestions?

If you encounter this problem as well, make sure your document declares the correct charset. In my case I forgot to include <meta charset='utf-8'> in my index.php file. That was what I had overlooked... Dumb... but maybe it helps you in the future ;)
As correctly mentioned by Gumbo, this wasn't the only fix for the problem (it only fixed how the data was presented in my browser). I was also using the Encoding library and its Encoding::toUTF8() method, a very neat and helpful class I found while searching for a solution. You can read about it here: Detect encoding and make everything UTF-8
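
A minimal sketch of the same idea without the Encoding library, assuming the response sometimes arrives as ISO-8859-1 rather than UTF-8 (that fallback encoding is an assumption, not something the question confirms). The helper name decodeJsonUtf8() is hypothetical. It validates the bytes with mb_check_encoding(), which actually checks them, whereas mb_detect_encoding() only makes a guess.

```php
<?php
// Hypothetical helper: decodeJsonUtf8() is not part of PHP or of the Encoding library.
function decodeJsonUtf8(string $raw, string $fallbackEncoding = 'ISO-8859-1')
{
    // Convert only when the bytes are not valid UTF-8, to avoid double encoding.
    if (!mb_check_encoding($raw, 'UTF-8')) {
        $raw = mb_convert_encoding($raw, 'UTF-8', $fallbackEncoding);
    }

    $decoded = json_decode($raw, true);
    if (json_last_error() !== JSON_ERROR_NONE) {
        throw new RuntimeException('json_decode failed: ' . json_last_error_msg());
    }
    return $decoded;
}

// "Stoltenkampstraße" with ß as the single ISO-8859-1 byte 0xDF, which is invalid UTF-8.
$latin1 = "{\"route\":\"Stoltenkampstra\xDFe\"}";

// Decoding the raw bytes directly reproduces the error from the question.
$direct = json_decode($latin1, true);
$directError = json_last_error();

$result = decodeJsonUtf8($latin1);
echo $result['route'], PHP_EOL;
```

If the real source encoding is something other than ISO-8859-1, the fallback argument has to be changed accordingly; converting from the wrong encoding produces valid but wrong characters.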

Related

Obtaining correct UTF-8 characters from a Percent Encoded URL parameter

I'm having some trouble with the dreaded UTF-8 Character Encoding! It's driving me insane, no matter which way I approach it or how many online guides I follow, I can never get it to return the desired results. Here's what's going on:
My whole website uses a simple text-file database that is UTF-8 encoded, and it correctly shows all manner of special characters (Latin, Arabic, Japanese, you name it), with one exception:
When the user uses the "Search" input box on my website, I use $search = $_REQUEST['search']; to get the input data on the results page and show results accordingly. When a user enters special characters in the search box, they get percent-encoded in the URL (for example, "ï" becomes "%C3%AF"). When I show $search on the actual page, any special character appears as � (black diamond with question mark).
I have tried everything it says here http://malevolent.com/weblog/archive/2007/03/12/unicode-utf8-php-mysql/ with the exception of the header(). I have set the charset to UTF-8 in my head section with an http-equiv meta tag, but for some reason whenever I set it with header() my PHP stylesheet stops working (and the character problem remains). Maybe this is a clue?
I have tried urldecode and rawurldecode too, but they don't change anything.
Keep in mind special characters appear correctly elsewhere on the site, it's only with the $search string where this problem appears. As a side-note, even though the characters are not visualizing correctly, my search engine does actually interpret the special characters correctly when filtering the results. This makes me understand that the special character is actually there and correctly encoded, but it's just a matter of making it visualize correctly with the correct charset. However... everything appears to be UTF-8.
To be honest I'm so confused about this that this question might also appear to be confusing and the information I'm giving you might not be very well structured either, so I apologize and will try to provide more detailed information for any questions.
Thank you!

Make sure not to have any function which alters your $_REQUEST. Some functions are not aware of special encodings.
The best way to investigate is checking the state of the variables before and after they are altered.
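
One way to do that check, sketched under the assumption that the search term is UTF-8: dump the raw bytes with bin2hex() at each step. The literal '%C3%AF' below only simulates the query string; in a real request PHP has already percent-decoded $_REQUEST once.

```php
<?php
// Simulated percent-encoded value for "ï" in UTF-8 (illustrative input only).
$encoded = '%C3%AF';

// Step 1: decode and look at the actual bytes.
$search = rawurldecode($encoded);
$hexAfterDecode = bin2hex($search);
echo $hexAfterDecode, PHP_EOL;

// Step 2: confirm the bytes form valid UTF-8.
$isUtf8 = mb_check_encoding($search, 'UTF-8');

// Step 3: escape for HTML while stating the charset explicitly,
// then compare the bytes again to see whether this step altered them.
$safe = htmlspecialchars($search, ENT_QUOTES, 'UTF-8');
$hexAfterEscape = bin2hex($safe);
echo $hexAfterEscape, PHP_EOL;
```

If the hex dump changes between two steps, the function called in between is the one that damages the string.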

I would like to add one more point regarding UTF-8 string manipulation.
When manipulating UTF-8 strings, always use the multibyte string functions.
For example, use mb_strtolower() in place of strtolower().
See http://php.net/manual/en/ref.mbstring.php.
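
A short sketch of the difference, assuming the mbstring extension is available. The sample word is written with byte escapes so the example does not depend on the encoding of the source file.

```php
<?php
// "ÄPFEL" as UTF-8 bytes: Ä is the two-byte sequence C3 84.
$word = "\xC3\x84PFEL";

// Byte-oriented functions count bytes, not characters.
$byteLength = strlen($word);
// Multibyte functions count characters when told the encoding.
$charLength = mb_strlen($word, 'UTF-8');

// mb_strtolower() lowercases the non-ASCII letter as well.
$lower = mb_strtolower($word, 'UTF-8');

echo $byteLength, ' bytes, ', $charLength, ' characters', PHP_EOL;
echo $lower, PHP_EOL;
```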

Curl: get UTF-8 data from site with incorrect charset

I scrape some sites that occasionally have UTF-8 characters in the title, but that don't specify UTF-8 as the charset (qq.com is an example). When I look at the website in my browser, the data I want to copy (i.e. the title) looks correct (Japanese or Chinese, I'm not too sure). I can copy the title and paste it into the terminal and it looks exactly the same. I can even write it to the DB, and when I retrieve it from the DB it still looks the same, and correct.
However, when I use cURL, the data that gets printed is wrong. I can run cURL from the command line or use PHP .. when it's printed to the terminal it's clearly incorrect, and it remains that way when I store it to the DB (remember: the terminal can display these characters properly). I've tried all eligible combinations of the following:
Setting CURLOPT_BINARYTRANSFER to true
mb_convert_encoding($html, 'UTF-8')
utf8_encode($html)
utf8_decode($html)
None of these display the characters as expected. This is very frustrating since I can get the right characters so easily just by visiting the site, but cURL can't. I've read a lot of suggestions such as this one: How to get web-page-title with CURL in PHP from web-sites of different CHARSET?
The solution in general seems to be "convert the data to UTF-8." To be honest, I don't actually know what that means. Don't the above functions convert the data to UTF-8? Why isn't it already UTF-8? What is it, and why does it display properly in some circumstances, but not for cURL?

Have you tried:
$html = iconv("gb2312","utf-8",$html);
The gb2312 charset was taken from the qq.com headers.
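
A rough sketch of generalizing this, assuming the Content-Type header of the response has already been read (for example via curl_getinfo() with CURLINFO_CONTENT_TYPE). The helper name convertToUtf8() and the sample header are hypothetical; the sample bytes are the GB2312 encoding of "中文".

```php
<?php
// Hypothetical helper: converts a fetched body to UTF-8 based on the declared charset.
function convertToUtf8(string $body, string $contentType): string
{
    // Pull the charset out of a header such as "text/html; charset=GB2312".
    if (!preg_match('/charset=([\w-]+)/i', $contentType, $m)) {
        return $body; // No declared charset: leave the bytes untouched.
    }

    $charset = strtoupper($m[1]);
    if ($charset === 'UTF-8') {
        return $body;
    }

    $converted = iconv($charset, 'UTF-8', $body);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $charset");
    }
    return $converted;
}

// "中文" encoded as GB2312 bytes (illustrative sample, not real qq.com output).
$gbBody = "\xD6\xD0\xCE\xC4";
$utf8Body = convertToUtf8($gbBody, 'text/html; charset=GB2312');
echo $utf8Body, PHP_EOL;
```

This also explains why mb_convert_encoding($html, 'UTF-8') alone did not help: without the correct source encoding, the conversion starts from the wrong assumption about what the bytes mean.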

How to get rid of � using php

I am pulling comments out of the database and this character, �, shows up... how do I get rid of it? Is it because of what's in the database or how I'm showing it? I've tried using htmlspecialchars but it doesn't work.
Please help

The problem lies with character encoding. If the character shows up fine in the database but not on the page, your page needs to be set to the same character encoding as the database. And vice versa: if the character encoding of the page that posts to the database does not match, the data comes out weird.
I generally set my character encoding to UTF-8 for any type of posting fields, such as Comments / Posts. Most MySQL databases default to the latin charset. So you will need to modify that: http://yoonkit.blogspot.com/2006/03/mysql-charset-from-latin1-to-utf8.html
The HTML part can be done with a META tag: <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
or with PHP: header('Content-type: text/html; charset=utf-8'); (must be placed before any output.)
Hopefully that gets the ball rolling for you.

That happens when you have a character that your font doesn't know how to display. It shows up differently in every program: many Windows programs show it as a box, Firefox shows it as a question mark in a diamond, and other programs just use a plain question mark.
So you can use a newer display system, install a missing font (for example, if the characters are Asian), or check whether only one or two characters do this and just replace them with something visible.

It might be a problem with the way you are storing the information in the database. If the encoding you were using didn't accept accents (à, ñ, î, ç...), then it stores them using weird symbols. The same happens to other language-specific symbols. There is probably no fix for what's already in the database, but you can still save future inserts by changing the encoding type in MySQL.
Cheers

Make sure your database is UTF-8 (if that doesn't solve the problem, make sure you specify your charset when connecting to the database).
You can also encode / decode before entering data to your database.
I would suggest to go with htmlspecialchars() for encoding and htmlspecialchars_decode() for decoding.

Are you setting your charset with mysql_set_charset() after mysql_connect()?
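
The mysql_* functions were removed in PHP 7.0, so here is a sketch of the same idea with PDO, which accepts the connection charset in the DSN. The helper name buildUtf8Dsn(), the host, the database name and the credentials are all placeholders.

```php
<?php
// Hypothetical helper: buildUtf8Dsn() and its arguments are illustrative only.
function buildUtf8Dsn(string $host, string $dbName): string
{
    // utf8mb4 is MySQL's full UTF-8 character set; the charset option
    // sets the connection encoding as part of connecting.
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
}

$dsn = buildUtf8Dsn('localhost', 'comments_db');
echo $dsn, PHP_EOL;

// Connecting is left commented out because it needs a real server and credentials:
// $pdo = new PDO($dsn, 'user', 'secret');
// With mysqli, the equivalent step would be $mysqli->set_charset('utf8mb4');
```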

As others have said, check what your database encoding is. You could try using utf8_encode() or iconv() to convert your character encoding.

Check your code for errors. That's all one can really say considering that you have given us absolutely no details as to what you're doing.
Encoding problems are usually what cause that (are you converting from integers to characters?), so, you fix it by checking if you're converting things properly.

Why is PHP's utf8_encode breaking my utf-8 string?

I'm doing a kind of roundabout experiment thing where I'm pulling data from tables in a remote page to turn it into an ICS so that I can find out when this sports team is playing (because I can't find anywhere that the information is more readily available than in this table), but that's just to give you some context.
I pull this data using cURL and parse it using domDocument. Then I take it and parse it for the info I need. What's giving me trouble is the opposing team. When I display the data on the initial PHP page, it's correct. But when I write to an ICS file, special UTF-8 characters get messed up. I thought utf8_encode would solve that problem, but it actually seems to have the opposite effect: when I run the function on my data, even the stuff displayed on the page (which had been displaying correctly), not in the separate ICS file (which was writing incorrectly), is incorrect. As an example: it turns "Inđija" to "InÄija."
Any tips or resources as far as dealing with UTF-8 strings in PHP? My server (a remote host) doesn't have mbstring installed either, which is a pain.

utf8_encode encodes a string in ISO 8859-1 as UTF-8. If you put UTF-8 into it, it's going to interpret it as if it was ISO 8859-1, and hence produce mojibake.
To help with your first problem, before this, I'd want to know what sort of "special" characters are being messed up in the original problem, and what way are they being messed up?
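
A small sketch of that effect. utf8_encode() is deprecated as of PHP 8.2, so the example uses iconv() from ISO-8859-1 to UTF-8, which performs the same conversion for this input; the team name is the one from the question.

```php
<?php
// "Inđija" as UTF-8 bytes: đ (U+0111) is the two-byte sequence C4 91.
$utf8 = "In\xC4\x91ija";

// Treating those bytes as ISO-8859-1 turns each byte into its own character:
// C4 becomes "Ä" (C3 84) and 91 becomes the control character U+0091 (C2 91).
$doubleEncoded = iconv('ISO-8859-1', 'UTF-8', $utf8);
echo bin2hex($doubleEncoded), PHP_EOL;

// The damage is reversible as long as nothing else touched the string.
$restored = iconv('UTF-8', 'ISO-8859-1', $doubleEncoded);
echo $restored, PHP_EOL;
```

The takeaway is that the conversion should be applied only to data that really is ISO-8859-1; data that is already UTF-8 should be left alone.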

Anyone have issues going from ColdFusion's serializeJSON method to PHP's json_decode?

The Interwebs are no help on this one. We're encoding data in ColdFusion using serializeJSON and trying to decode it in PHP using json_decode. Most of the time, this is working fine, but in some cases, json_decode returns NULL. We've looked for the obvious culprits, but serializeJSON seems to be formatting things as expected. What else could be the problem?
UPDATE: A couple of people (wisely) asked me to post the output that is causing the problem. I would, except we just discovered that the result set is all of our data (listing information for 2300+ rental properties for a total of 565,135 ASCII characters)! That could be a problem, though I didn't see anything in the PHP docs about a max size for the string. What would be the limiting factor there? RAM?
UPDATE II: It looks like the problem was that a couple of our users had copied and pasted Microsoft Word text with "smart" quotes. Those pesky users...

You could try operating in UTF-8 and also letting PHP know that fact.
I had an issue with PHP's json_decode not being able to decode a UTF-8 JSON string (with some "weird" characters other than the curly quotes that you have). My solution was to hint PHP that I was working in UTF-8 mode by inserting a Content-Type meta tag in the HTML page that was doing the submit to the PHP. That way the content type of the submitted data, which is the JSON string, would also be UTF-8:
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
After that, PHP's json_decode was able to properly decode the string.

Can you replicate this issue reliably? And if so, can you post sample data that returns null? I'm sure you know this, but for the sake of others stumbling on this who may not: RFC 4627 describes JSON, and it's a common mistake to assume valid JavaScript is valid JSON. It's better to think of JSON as a subset of JavaScript.
In response to the edit:
I'd suggest checking that your information is being populated in your PHP script (before it's passed off to json_decode), and also validating that information (especially if you can reliably reproduce the error). You can try an online validator for convenience. Based on the very limited information, it sounds like perhaps it's timing out and not grabbing all the data? Is there a need for such a large dataset?

I had this exact problem, and it turned out to be caused by ColdFusion putting non-printable characters into the JSON packets (these characters did actually exist in our data), but they can't go into JSON.
Two questions on this site fixed this problem for me, although I went for the PHP solution rather than the ColdFusion solution as I felt it was the more elegant of the two.
PHP solution
Fix the string before you pass it to json_decode()
$string = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);
ColdFusion solution
Use the cleanXmlString() function in that SO question after using serializeJSON()
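
Note that the PHP pattern above also removes every byte from \x80 to \xFF, which deletes all non-ASCII characters from a UTF-8 string. A possible variant, assuming the input is valid UTF-8, strips only control characters by using the /u modifier. The helper name stripControlChars() and the sample string are hypothetical.

```php
<?php
// Hypothetical helper: removes control characters but keeps multibyte UTF-8 text.
function stripControlChars(string $json): string
{
    // With /u the pattern works on code points, so bytes of ß, ü, etc. are untouched.
    $clean = preg_replace('/[\x00-\x1F\x7F]/u', '', $json);
    if ($clean === null) {
        throw new RuntimeException('Input is not valid UTF-8');
    }
    return $clean;
}

// Sample JSON with a stray control byte (0x01) after "Straße" (ß as UTF-8 bytes C3 9F).
$raw = "{\"street\":\"Stra\xC3\x9Fe\x01 11\"}";

// The original pattern drops the ß together with the control byte.
$lossy = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $raw);

// The variant keeps the ß and removes only the control byte.
$data = json_decode(stripControlChars($raw), true);
echo $lossy, PHP_EOL;
echo $data['street'], PHP_EOL;
```

This variant also removes tabs and newlines, which is harmless for JSON structure; any that were meant to appear inside string values are lost along with the other control characters.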

You could try parsing it with another parser, and looking for an error -- I know Python's JSON parsers are very high quality. If you have Python installed it's easy enough to run the text through demjson's syntax checker. If it's a very large dataset you can use my library jsonlib -- memory use will be higher than with demjson, but it will run faster because it's written in C.
