I've run into an interesting behavior in the native PHP 5 implementation of json_encode(). Apparently when serializing an object to a json string, the encoder will null out any properties that are strings containing "curly" quotes, the kind that would potentially be copy-pasted out of MS Word documents with the auto conversion enabled.
Is this expected behavior for the function? What can I do to force these kinds of characters to convert to their basic equivalents? I've checked for character encoding mismatches between the database returning the data and the administration page that inserts it, and everything is set up correctly; it definitely seems like the encoder just refuses these values because of these characters. Has anyone else encountered this behavior?
EDIT:
To clarify:
MS Word will take standard quotation marks and apostrophes and convert them to more aesthetic "fancy" or "curly" quotes. These characters can cause problems when placed in content managers that have charset mismatches between their editing interface (in the HTML) and the database encoding.
That's not the problem here, though. For example, I have a JSON object representing a person's profile that contains the string:
Jim O’Shea
The Unicode code point for that apostrophe is U+2019 (\u2019). The value will come out null in the JSON object when it is fetched from the database and directly json_encoded:
{"model_name":"Bio","logged":true,"BioID":"17","Name":null,"Body":"Profile stuff!","Image":"","Timestamp":"2011-09-23 11:15:24","CategoryID":"1"}
Never had this specific problem (i.e. with json_encode()), but a simple, albeit a bit ugly, solution I have used in other places is to loop through your data and pass it through this function I got from somewhere (will credit it when I find out where I got it):
function convert_fancy_quotes ($str) {
    // Replace the Windows-1252 "smart" punctuation bytes:
    // chr(145)/chr(146) are left/right single quotes, chr(147)/chr(148)
    // are left/right double quotes, and chr(151) is an em dash.
    return str_replace(
        array(chr(145), chr(146), chr(147), chr(148), chr(151)),
        array("'", "'", '"', '"', '-'),
        $str
    );
}
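A quick sketch of the "loop through your data" step, assuming PHP 5.3+ for the closure and a hypothetical $row array:
// Clean every string value in the row before encoding it.
array_walk_recursive($row, function (&$value) {
    if (is_string($value)) {
        $value = convert_fancy_quotes($value);
    }
});
echo json_encode($row);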
json_encode has the nasty habit of silently dropping strings in which it finds invalid (i.e. non-UTF-8) characters. (See here for background: How to keep json_encode() from dropping strings with invalid characters.)
My guess is the curly quotes are in the wrong character set, or get converted along the way. For example, it could be that your database connection is ISO-8859-1 encoded.
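If that turns out to be the case, a minimal sketch of forcing the connection itself to UTF-8 (assuming mysqli and placeholder credentials):
$db = new mysqli('localhost', 'user', 'pass', 'mydb'); // hypothetical credentials
$db->set_charset('utf8'); // make the connection return UTF-8 bytes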
Can you clarify where the data comes from in what format?
If I ever need to do that, I first copy the text into Notepad and then copy it from there. Notepad forces it to be normal quotes. Never had to do it through code though...
I'm grabbing a bunch of data from a database and putting it into a PHP array. I'm then looking to json_encode that array using $output = json_encode($out).
My issue is that from time to time, something in the array is not able to be read by json_encode and the whole thing fails. If I use print_r($out) to have a look, I can clearly see where it's failing, because the character that is screwing things up always appears as a question mark inside of a black diamond �.
First - what are these characters?
Second - is there a function I can pass the elements through prior to adding them to the array that would strip these out, or replace them with blanks?
I found the answer to this. Since the data coming FROM the database was stored with the "black diamond" character, I needed to fix this AFTER grabbing it from the database.
$x[4] = utf8_encode(odbc_result($query, 'B'));
By passing the result through utf8_encode, the ISO-8859-1 string is converted to UTF-8, so the byte that json_encode considered illegal is gone.
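Note that utf8_encode() blindly assumes the input is ISO-8859-1. If your source data might be in another single-byte charset, mb_convert_encoding() lets you name it explicitly; a sketch, where the Windows-1252 source charset is an assumption:
// Assuming the ODBC source hands back Windows-1252 bytes.
$x[4] = mb_convert_encoding(odbc_result($query, 'B'), 'UTF-8', 'Windows-1252');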
Use echo json_encode($out);
This should solve your issue.
Black diamonds are a browser issue; the database itself stores plain question marks.
It seems you are already getting wrong data from the database. It's actually quite tricky to end up with incorrect UTF-8 given your settings, so you need to check everything:
whether your table is marked with the utf8 charset
whether your data is indeed encoded in UTF-8 (not just marked as such, but actually encoded)
whether your server is sending the correct charset in the Content-Type header.
It is also useful to view the page while choosing different charsets from your browser menu.
But first of all you have to wipe away every trace of all the random fixes you have tried, all these various encode/decode calls and the like, and produce plain, direct output from the database. Otherwise you will never get to the root of the problem.
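A minimal sketch of such a bare-bones check (table, column and credentials are placeholders):
header('Content-Type: text/html; charset=utf-8'); // tell the browser the real charset
$db = new mysqli('localhost', 'user', 'pass', 'mydb'); // hypothetical credentials
$db->set_charset('utf8'); // tell MySQL which charset the connection uses
$row = $db->query('SELECT Name FROM bios LIMIT 1')->fetch_assoc();
echo $row['Name']; // plain, direct output - no encode/decode anywhere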
If I have
<p id='test'>TEST&trade;</p>
and I use
document.getElementById('test').innerHTML;
to pass the HTML to a PHP function, which extracts all of the text nodes using DOMDocument and XPath.
When the PHP gets the content, the &trade; gets converted to the raw ™ character. I run it through XPath and the text node comes back as:
TESTâ„¢
I am not sure what is going wrong, or whether there is a way to fix it on the JavaScript side so that it passes &trade; rather than the raw ™.
Any help is appreciated.
Your variable is being passed with the raw ™ character, not with &trade;; running it through htmlentities() in PHP should take care of it.
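A minimal sketch, assuming the posted markup arrives in a hypothetical $_POST['html'] parameter:
// Convert the raw ™ (and any other non-ASCII characters) into entities
// so the bytes survive any later charset mishandling.
$html = htmlentities($_POST['html'], ENT_QUOTES, 'UTF-8');
// "TEST™" becomes "TEST&trade;"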
You could try using the numeric HTML entity form, e.g.
<p id='test'>&#8482;</p>
Read this page for more examples of the Unicode TM character:
http://www.fileformat.info/info/unicode/char/2122/index.htm
Hope this helps.
You need to be more precise than saying it "comes back as". The ™ appears to have been written somewhere in UTF-8 encoding, and the same bytes have then been read by something that doesn't realise they are in UTF-8 encoding, and is assuming they are Latin-1 or similar. To solve the problem you will need to look very carefully at the configuration of the software that wrote the character and the software that read it.
What Michael said is true; in addition you should be aware that XML processors are basically required to convert character entities (like &trade;) to their actual character values, and will (almost) always produce output with those characters encoded in some prevailing character set. It takes heroic measures to prevent this, and it is usually not a good idea. So you should abandon attempts to do that; my guess is that you would be better served by making sure the function you are passing the HTML to is told to interpret it as UTF-8 rather than some other charset (which may just be the system default).
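For instance, DOMDocument::loadHTML() assumes ISO-8859-1 unless the markup says otherwise, which produces exactly this kind of mojibake. A common workaround is to hint the encoding before parsing; a sketch, where the XML-declaration prefix is a well-known trick rather than an official switch:
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $html); // hint that the fragment is UTF-8
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//text()') as $node) {
    echo $node->nodeValue; // "TEST™", not "TESTâ„¢"
}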
I am using json_encode in PHP to output data.
When it gets to this word: Æther it outputs \u00c6ther.
Does anyone know of a way to make json_encode output that character, or am I going to have to change the text to not have that character in it?
That's the Unicode-escaped version of the character. JavaScript will handle it properly. You'll notice the backslash before it, which means it's an escape sequence; the u indicates a Unicode code point, and the hex digits represent the actual character.
See here for some more info.
That is working as specified. The RFC (http://www.ietf.org/rfc/rfc4627.txt) indicates that any character may be escaped, and your average printable character can be written in the \uXXXX format.
Any JSON parser that cannot understand a character escaped in that way is not compliant with the standard. Work on resolving that problem rather than trying to coax PHP into misbehaving as well.
(It is legal to put UTF-8 characters into JSON strings without escaping them as well, with a few exceptions, but the safe approach of escaping anything questionable is wise.)
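For completeness: since PHP 5.4, json_encode() accepts a flag that leaves such characters unescaped, should you control both ends and prefer raw UTF-8:
echo json_encode(array('word' => 'Æther')); // {"word":"\u00c6ther"}
echo json_encode(array('word' => 'Æther'), JSON_UNESCAPED_UNICODE); // {"word":"Æther"}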
I seem to be completely unable to get around utf-8 character encoding.
So I'm exporting content from a database as a utf-8 xml file.
The software I am importing into is quite strict about character encoding, so I can't just put everything in CDATA tags.
There's a whole bunch of HTML entities, e.g. &rsquo;, &mdash;, &hellip;, already in the data.
These aren't working in the XML and need to be replaced (normally with just a plain ' quote).
Ideally, I'd like to decode all the characters, and then use htmlspecialchars($text, ENT_COMPAT, 'UTF-8', FALSE) to encode them back again. But I can't seem to find a function that will decode them. Is there one?
I've started to manually go through each entity with a str_replace() but it's turning into a much bigger job than I anticipated.
Any help would be a lifesaver.
Thanks
html_entity_decode() perhaps?
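A minimal sketch of the decode-then-re-encode round trip the question describes, using a hypothetical $text variable:
// Decode every HTML entity into its real UTF-8 character...
$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
// ...then re-encode only what XML actually requires (& < > ").
$text = htmlspecialchars($text, ENT_COMPAT, 'UTF-8', false);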
In some cases of character conversion issues in PHP, it is important to have a locale set. It doesn't matter much which one, e.g.
setlocale(LC_CTYPE, 'en_US.utf8');
But I would advise that any time invested in getting the encoding right from the beginning, without resorting to entities if at all possible, is worth it.
I'm working on a project in PHP (5.3.1) where I need to send a JSON string to a webservice (in Python), but the result I get from json_encode does not pass as valid JSON (I'm using JSLint to check validity).
I should add that the structure I'm trying to encode is fairly big (13K encoded) and consists partially of UTF-8 data, and while json_encode does handle it, I get spaces in weird places in the result. For example, I could get {"hello":tru e} or {"hell o":true}, which results in an error from the webservice since the JSON is invalid (or the data is, as in the second example).
I've also tried using the Zend Framework for JSON encoding, but that didn't make much difference.
Is there a known issue with JSON in PHP? Did anyone encounter that behavior and found a solution?
You state that "the structure I'm trying to encode ... consists partially of UTF8 data." This implies that it also partially consists of non-UTF-8 data. The json_encode documentation has a comment at the bottom noting that
json_encode() expects strings to be encoded to be in UTF8 format, while by default PHP strings are ISO-8859-1 encoded.
This means that
json_encode(array('àü'));
will produce a json representation of an empty string, while
json_encode(array(utf8_encode('àü')));
will work.
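A minimal sketch of applying that to a whole structure before encoding (using a hypothetical $data array, and assuming the non-UTF-8 strings really are ISO-8859-1, which is all utf8_encode() handles):
array_walk_recursive($data, function (&$value) {
    if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
        $value = utf8_encode($value); // ISO-8859-1 -> UTF-8
    }
});
echo json_encode($data);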
Are the failing segments of the JSON due to non-UTF8 input?
To be clear about why those results are invalid: an unquoted token can only be true, false, null, or a number, so {"hello":tru e} fails outright, and strings, including keys, must always be quoted ({"hell o":true} parses, but silently corrupts your key name).
Also, I would recommend adding the correct header before your JSON output:
if (!headers_sent()) {
    header('Content-Type: application/json; charset=utf-8', true, 200);
}
Can you also post the array or object that you're passing to json_encode?
I was handling some automatically generated emails the other day and noticed the same weird behavior (spaces were inserted into the email body), so I started inspecting the raw message and found the culprit:
From the SMTP RFC2821:
The maximum total length of a text line including the <CRLF> is 1000 characters (not counting the leading dot duplicated for transparency).
My email body was indeed in one line, so breaking it with \n's fixed the spaces issue.
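A minimal sketch of that fix (998 leaves room for the CRLF within the 1000-character limit):
// Break overly long lines so none exceeds the SMTP line-length limit.
$body = wordwrap($body, 998, "\r\n", true);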
After scratching my head for nearly a day, I've come to the conclusion that the problem was not in the json_encode function. It was with my post function.
Basically, the json_encode was preparing the data to be sent to another service. Until today I had used stream_context_create and fopen to POST the data to the external service, but now I use fsockopen and fputs, and it seems to be working.
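For reference, a minimal sketch of that kind of raw POST; the host and path are placeholders, not the actual service:
$json = json_encode($out);
$fp = fsockopen('webservice.example.com', 80, $errno, $errstr, 30); // hypothetical host
if ($fp) {
    $request  = "POST /endpoint HTTP/1.1\r\n";
    $request .= "Host: webservice.example.com\r\n";
    $request .= "Content-Type: application/json; charset=utf-8\r\n";
    $request .= "Content-Length: " . strlen($json) . "\r\n"; // strlen() counts bytes
    $request .= "Connection: close\r\n\r\n";
    $request .= $json;
    fputs($fp, $request);
    $response = stream_get_contents($fp);
    fclose($fp);
}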
Although I'm unsure as to the nature of the problem, I'm happy it works now :)
BTW: after this process I mail myself the input and output (both in JSON), and that is how I noticed there was a problem in the first place. That problem still persists, but I guess it's related to the encoding of the mail or something of that sort.