I'm implementing communication between two servers using JSON and cURL. The problem is that sometimes there's a BOM (byte order mark) prepended to the opening bracket in the JSON reply. I've managed to trim it and successfully parse the JSON string, but since the JSON is generated by my own code, I have no idea where that BOM comes from.
I'm using json_encode() to generate the reply and header() + echo to print it, and as far as I can tell, json_encode() does not produce any BOMs. The corresponding .php files are encoded in UTF-8 and have no BOM in them (according to Notepad++). Apart from cURL, I've also tried to perform requests using Chrome and Python (urllib2). While Chrome does not register any BOM at all, Python regularly fails to parse the incoming JSON because of it.
So, is there some nuance in using echo that somehow produces such a result? Where should I start looking for the source of the problem, and what might the solution be?
I had the same problem. I was outputting json from PHP and there were other class files included at the top of the page. These files output nothing, but when they were included I was getting as many Byte Order Marks as I had included files. So if I had 4 includes, I also had 4 BOMs at the start of my json.
I made sure the includes were not printing any data and there were no stray carriage returns outside the PHP tags. I tried headers such as "Content-Type: application/json", etc., but nothing worked.
In the end, I simply opened each PHP file in Notepad++, went to "Encoding" and changed it from UTF-8 to ANSI, then saved. That was all it took to get it working and returning valid JSON. I made no code changes to the PHP at all.
This solution still feels less than ideal. Since we are not outputting anything from those included files, nothing should have been affected.
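Rather than re-saving every file blindly, one way to find the offender is to scan the includes for a leading BOM. A minimal diagnostic sketch, assuming the files live under the script's own directory (adjust the path as needed):

```php
<?php
// Diagnostic sketch: walk the project directory and report every .php file
// that starts with the UTF-8 BOM bytes EF BB BF.
$files = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator(__DIR__, FilesystemIterator::SKIP_DOTS)
);

foreach ($files as $file) {
    if ($file->getExtension() !== 'php') {
        continue;
    }
    // Read only the first three bytes of each file.
    $head = file_get_contents($file->getPathname(), false, null, 0, 3);
    if ($head === "\xEF\xBB\xBF") {
        echo "BOM found in {$file->getPathname()}\n";
    }
}
```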
I am trying to work with the Amazon Associates API, but whenever I try to get the information of a product, the characters come out garbled.
Example
Text on the Amazon page:
🔥 【23800 mAh
Output of the JSON from the API: 🔥 ã€23800 mAh
Just like this, more weird characters are appearing, such as a dash being transformed into a question mark.
I've used a code snippet in PHP that was provided by them, which contained the following line setting the charset:
$awsv4->addHeader('content-type', 'application/json; charset=utf-8');
Does anyone have a pointer where I might be going wrong here, and what I could do to fix this weird conversion?
I don't know what you are actually going to do with this data. Assuming you are going to keep it in a database, here is a suggestion.
You said these characters are turning into weird symbols. What is happening is that the emojis/symbols are UTF-8 encoded data, and the symbols you are seeing are that UTF-8 data rendered with the wrong encoding.
If you want to keep the data, you have to store it with a known text encoding. It doesn't have to be UTF-8; there are many encodings available.
To read it back correctly, you have to decode it with the same encoding it was stored in. For example, I stored emojis in my database as UTF-8 years ago: the phone I was using encoded an emoji one way, and when I later viewed the data in my PC browser I saw different symbols. The decoding side must match when you want to view the data again.
The point is that you cannot simply save the data exactly as you see it on Amazon.
This line
$awsv4->addHeader('content-type', 'application/json; charset=utf-8');
tells the server that the request you are sending is UTF-8 encoded JSON data. It does not ask for a specific encoding for the response.
However, the response you are getting is UTF-8 encoded.
You read 🔥 ã€23800 mAh because you are displaying the received data using an encoding different from UTF-8.
A more detailed answer could be given if we could see how you are writing the output and what the context is (web page, terminal, ...).
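For a web page context, for example, the fix is usually just to declare the output encoding. A minimal sketch, where $title is only a placeholder for whatever string you pull out of the decoded API response (it is not a field of the Amazon SDK):

```php
<?php
// Minimal sketch: declare the output encoding so the browser does not fall
// back to ISO-8859-1. $title is a placeholder, not an Amazon SDK field.
header('Content-Type: text/html; charset=utf-8');

$title = '🔥 【23800 mAh';   // UTF-8 text as returned by the API
echo htmlspecialchars($title, ENT_QUOTES, 'UTF-8');
```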
Example:
$fire = '🔥';
I know PHP 5+ supports this natively, but is it best practice, or should I be storing emojis using their codepoints instead? And if so, why?
As far as your editor and the PHP compiler are concerned, it's all just text, and '🔥' is no different from 'fire' or 'Φωτιά'.
When PHP runs, it will read the bytes in from the file and put them in memory, without caring what they mean. This leads to the most likely problem you'll have: if you save the file in your text editor as UTF-16, and then echo the string to a browser telling it that it's UTF-8, the browser won't show the right thing. But that's easily avoided by making sure your editor always uses UTF-8, and your output headers tell the browser that's what you're using.
If you don't trust your editor to do that, and you're running PHP7, you could write it in the escaped notation "\u{1f525}", but when it runs, the same bytes will end up in memory.
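For instance, assuming the source file is saved as UTF-8, the literal and the PHP 7 escape compare as the same byte sequence:

```php
<?php
// Assuming this file is saved as UTF-8: both spellings end up as the same
// four bytes (F0 9F 94 A5) in memory.
$literal = '🔥';
$escaped = "\u{1F525}";   // PHP 7+ Unicode codepoint escape

var_dump($literal === $escaped);   // bool(true)
var_dump(bin2hex($literal));       // string(8) "f09f94a5"
```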
You might have similar problems if you send the text elsewhere - to a database, for instance - and that somewhere else doesn't know to handle it as UTF-8. How you write the string in your source file won't make any difference to that, though; it's just a case of making sure everything is configured to match.
Note: you don't actually have to use UTF-8 for this, you could use UTF-16, or some other encoding, as long as you're consistent; but UTF-8 is by far the most common these days, particularly on the web.
I'm making multiple AJAX calls that return XML data. When I get the data back, my success function (in jQuery) tries to turn the XML into JSON (using a plugin). I was quickly reminded why I can't assume I will get VALID XML back from my AJAX request: it turns out a few of the XML responses were invalid, causing the JSON conversion to fail, the script to fail, etc.
My questions are:
1. What is the best way to check for valid XML in an AJAX response?
2. Or should I just attempt the JSON conversion, then do a quick check whether the resulting JSON object is valid?
In troubleshooting the XML, I found that there are a few strange characters at the VERY beginning of the XML response. Here's an image from my Firebug:
Should I try to detect and strip the response of those chars or could there possibly be something wrong with my encoding?
Any help is appreciated! Let me know if more info is needed!
It's the UTF-8 byte-order mark when incorrectly interpreted as ISO-8859-1.
You can't safely strip this because it's just a symptom of a larger problem. Your content is encoded as UTF-8. Somewhere along the way you are decoding it as ISO-8859-1 instead. If you try to hide the problem by stripping the BOM, you're only setting yourself up for more problems down the line as soon as you start using non-ASCII characters. The only reason things are even looking sort-of right is because ASCII is a common subset of both UTF-8 and ISO-8859-1.
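As a quick illustration of that mismatch (not a fix), the three UTF-8 BOM bytes decoded as ISO-8859-1 are exactly the "ï»¿" characters that typically show up at the start of such a response:

```php
<?php
// Illustration only: take the UTF-8 BOM bytes and decode them as if they
// were ISO-8859-1 characters - the result is the familiar "ï»¿".
$bom = "\xEF\xBB\xBF";
echo mb_convert_encoding($bom, 'UTF-8', 'ISO-8859-1');   // prints: ï»¿
```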
The strange characters are the byte order mark and are actually valid in XML; you can most likely just strip them without risk in most circumstances.
I'm doing a kind of roundabout experiment thing where I'm pulling data from tables in a remote page to turn it into an ICS so that I can find out when this sports team is playing (because I can't find anywhere that the information is more readily available than in this table), but that's just to give you some context.
I pull this data using cURL and parse it using DOMDocument. Then I take it and parse it for the info I need. What's giving me trouble is the opposing team. When I display the data on the initial PHP page, it's correct. But when I write to an ICS file, special UTF-8 characters get messed up. I thought utf8_encode would solve that problem, but it actually seems to have the opposite effect: when I run the function on my data, even the output displayed on the page (which had been displaying correctly) becomes incorrect, not just the separate ICS file (which was already writing incorrectly). As an example: it turns "Inđija" into "InÄija."
Any tips or resources as far as dealing with UTF-8 strings in PHP? My server (a remote host) doesn't have mbstring installed either, which is a pain.
utf8_encode encodes a string in ISO 8859-1 as UTF-8. If you put UTF-8 into it, it's going to interpret it as if it was ISO 8859-1, and hence produce mojibake.
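A small sketch of both the symptom and one possible mbstring-free workaround (the helper name ensure_utf8 is made up for illustration):

```php
<?php
// 'Inđija' is already UTF-8; utf8_encode() treats its bytes as ISO 8859-1
// and re-encodes them, which is what produces the mojibake.
$city   = 'Inđija';
$broken = utf8_encode($city);   // double-encoded garbage

// Without mbstring, PCRE's 'u' modifier can check whether a string is
// already valid UTF-8, so we only re-encode when it genuinely isn't.
function ensure_utf8($s)
{
    return preg_match('//u', $s) === 1 ? $s : utf8_encode($s);
}

echo ensure_utf8($city);   // "Inđija" - left untouched
```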
To help with your first problem, I'd want to know what sort of "special" characters are being messed up in the original output, and in what way they are being messed up.
I'm writing my first little AJAX-enabled Joomla component. I'm using MooTools. I have an XMLHttpRequest that contacts my Joomla component, and the component returns a response - just plain text echoed by PHP, like
echo 'Hello World!';
It's all working fine, except Wireshark tells me that the response is prepended with \357\273\277\357\273\277 by the time it gets read by the JavaScript on the client side. This shows up as a little square before the response in an alert box that the script shows.
I don't explicitly set the encoding on the XMLHttpRequest; the MooTools docs say that it defaults to UTF-8.
What's the right way to handle this? Should I be setting the encoding on the request? Mime type? Should the javascript get rid of it? I'm not planning to have any characters requiring UTF8 in the response, so using plain old ascii would be ok for me too.
Thanks
A UTF-8 BOM is generally not recommended. UTF-8 has no byte-order ambiguity, so the BOM serves little purpose other than informing the consumer that the following content is, indeed, UTF-8 encoded.
I'd strip it either on the Joomla end (preferred) or with JavaScript.
Also, for whatever reason, it looks like you have a double BOM there.
This related question might help as well.
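If you strip it on the Joomla/PHP end, one minimal sketch is to buffer the component's output and remove any leading BOMs before the response is sent (the include name below is hypothetical):

```php
<?php
// Sketch: buffer everything the component outputs and strip one or more
// leading UTF-8 BOMs (EF BB BF) before the response leaves PHP.
ob_start(function ($buffer) {
    return preg_replace('/^(\xEF\xBB\xBF)+/', '', $buffer);
});

header('Content-Type: text/plain; charset=utf-8');

require 'my_component_helper.php';   // hypothetical include that carries a BOM

echo 'Hello World!';

ob_end_flush();
```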
I'm using Microsoft Expression Web 3, and even though it was set to not add a BOM for PHP files, there was indeed a BOM at the beginning of the PHP files. I used a hex editor to remove the BOM, and now Expression doesn't add a BOM anymore while saving.
I don't know why there were 2 BOMs in the XMLHttpRequest response, but now they're both gone.
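For anyone who would rather not reach for a hex editor, a small script can do the same cleanup (the filename is just an example):

```php
<?php
// Sketch: strip a leading UTF-8 BOM from a file in place, as an alternative
// to editing it with a hex editor. The path is only an example.
$path = 'index.php';
$contents = file_get_contents($path);

if (strncmp($contents, "\xEF\xBB\xBF", 3) === 0) {
    file_put_contents($path, substr($contents, 3));
    echo "Removed BOM from $path\n";
}
```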