I'm making multiple AJAX calls that return XML data. When I get the data back, my success function (in jQuery) tries to convert the XML to JSON (using a plugin). I was quickly reminded why I can't assume I'll get VALID XML back from my AJAX request -- it turns out a few of the XML responses were invalid, causing the JSON conversion to fail, the script to fail, and so on...
My questions are:
What is the best way to check for valid XML on an AJAX response? Or should I just attempt the JSON conversion, then do a quick check that the resulting JSON object is valid?
In troubleshooting the XML, I found that there are a few strange characters at the VERY beginning of the XML response; in Firebug they show up as ï»¿ before the first tag.
Should I try to detect these characters and strip them from the response, or could there be something wrong with my encoding?
Any help is appreciated! Let me know if more info is needed!
It's the UTF-8 byte-order mark when incorrectly interpreted as ISO-8859-1.
You can't safely strip this because it's just a symptom of a larger problem. Your content is encoded as UTF-8. Somewhere along the way you are decoding it as ISO-8859-1 instead. If you try to hide the problem by stripping the BOM, you're only setting yourself up for more problems down the line as soon as you start using non-ASCII characters. The only reason things are even looking sort-of right is because ASCII is a common subset of both UTF-8 and ISO-8859-1.
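To make the symptom concrete, here is a small PHP sketch (my addition, assuming the mbstring extension is available) showing that the three BOM bytes, reinterpreted as ISO-8859-1, are exactly the junk characters that appear:

<?php
// The UTF-8 BOM is the three bytes EF BB BF. Read as ISO-8859-1 and
// re-encoded, those same bytes become the characters ï, », ¿.
$bom = "\xEF\xBB\xBF";
echo bin2hex($bom), "\n";                                    // efbbbf
echo mb_convert_encoding($bom, 'UTF-8', 'ISO-8859-1'), "\n"; // ï»¿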
The strange characters are the byte order mark, and a leading BOM is actually valid in XML; in most circumstances you can strip it without risk.
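If you do decide to strip it, a minimal PHP sketch covering both steps the question asks about - removing a leading BOM and checking that the XML is well-formed - might look like this ($xml is a hypothetical variable holding the raw response body):

<?php
// Strip a leading UTF-8 BOM, then test well-formedness before any
// further conversion.
if (substr($xml, 0, 3) === "\xEF\xBB\xBF") {
    $xml = substr($xml, 3);
}
libxml_use_internal_errors(true);    // collect parse errors quietly
$doc = simplexml_load_string($xml);  // returns false on invalid XML
if ($doc === false) {
    foreach (libxml_get_errors() as $error) {
        echo trim($error->message), "\n";
    }
}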
Related
I am trying to work with the Amazon Associates API, but whenever I fetch a product's information, some of the characters come out garbled.
Example
Text on the Amazon page:
🔥 【23800 mAh
Output of the JSON from the API: ðŸ”¥ ã€23800 mAh
Other characters are garbled in the same way; for example, a dash turns into a question mark.
I've used a PHP code snippet that was provided by them, which contained the following line setting the charset:
$awsv4->addHeader('content-type', 'application/json; charset=utf-8');
Does anyone have a pointer to where I might be going wrong here, and what I could do to fix this weird conversion?
I'm not sure what you're actually going to do with this data. Assuming you're going to keep it in a database, here is my take.
You said these characters are turning into weird symbols. What's happening is that the emojis/symbols are UTF-8 encoded, and you are seeing the raw UTF-8 bytes rendered as individual characters.
If you want to keep the data, you have to store it in some declared text encoding. It doesn't have to be UTF-8; there are many encodings available.
To display it correctly later, it has to be read back by a system that decodes it the same way. For example, I stored emojis in my database as UTF-8 many years ago; the mobile phone I was using produced the encoding for one emoji, I saved it, and the next time I viewed the data in my PC browser I saw different symbols. The matching decoding has to be in place wherever you want to view the data again.
The point is that you cannot save the data and expect it to look exactly as it does on Amazon unless the encoding matches end to end.
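Since the answer above assumes a database, here is a hedged sketch of the one storage detail that commonly bites with emoji. It assumes MySQL via mysqli; the credentials and table are hypothetical, none of this is from the question:

<?php
// MySQL's legacy 'utf8' charset stores at most 3 bytes per character,
// so a 4-byte code point like 🔥 gets mangled; utf8mb4 handles it.
$db = new mysqli('localhost', 'user', 'pass', 'products');
$db->set_charset('utf8mb4');
$stmt = $db->prepare('INSERT INTO titles (title) VALUES (?)');
$title = "🔥 【23800 mAh";
$stmt->bind_param('s', $title);
$stmt->execute();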
This line
$awsv4->addHeader('content-type', 'application/json; charset=utf-8');
says that the request you are sending is UTF-8 encoded JSON data. It does not request a specific encoding for the response.
However, the response you are getting is in fact UTF-8 encoded.
You read ðŸ”¥ ã€23800 mAh because you are displaying the received data using an encoding different from UTF-8.
A more detailed answer could be given if we could see how you are writing the output and in what context (web page, terminal, ...).
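For the common case of echoing the API data into a web page from PHP, a minimal sketch of "declare the encoding you actually have" could look like this ($json and the 'title' key are hypothetical placeholders):

<?php
// JSON is UTF-8 by specification, and json_decode() expects UTF-8.
// Declaring UTF-8 in the response header stops the browser from
// guessing Latin-1/Windows-1252 and producing ã€-style mojibake.
header('Content-Type: text/html; charset=utf-8');
$data = json_decode($json, true);
echo htmlspecialchars($data['title'] ?? '', ENT_QUOTES, 'UTF-8');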
I am using Flash to read content from a UTF-8 page, which has Unicode characters in it.
The problem is that when Flash loads the data, it displays ???????? instead of the Unicode characters.
What could be the problem?
By default Flash treats strings as if they are encoded in UTF-8. One reason you might be seeing substitution characters for non-printable or invalid/missing glyphs is that you set System.useCodepage to true - and if that's what happened, why did you do that?
Otherwise, the font that is used to display the characters may be missing glyphs for the characters you need. You can check that by using Font.hasGlyphs("string with the glyphs"); to make sure the text can be displayed. This would normally only apply to embedded fonts.
Yet another possibility is that the source text you are trying to display is not a UTF-8 encoded string. Some particularly popular file formats, such as XML and HTML, sometimes carry an encoding declaration that does not correspond to the actual payload (for example, the XML declaration <?xml encoding="utf-8" ?> can be attached to any XML document regardless of its real encoding). One way to check with a ByteArray: if the first (most significant) bit of every byte is 0, the text is plain ASCII and therefore valid UTF-8, whereas single-byte encodings that use national characters set that bit for their extra characters. Note that UTF-8 does set the high bit in its multi-byte sequences, so for non-ASCII content you have to validate the full UTF-8 byte-sequence structure rather than individual bits.
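A quick server-side counterpart in PHP (my addition, not part of the Flash answer; it assumes the mbstring extension and a hypothetical file name) that applies the same idea before the bytes ever reach Flash:

<?php
// mb_check_encoding() returns true only for well-formed UTF-8.
// (Without mbstring, preg_match('//u', $bytes) !== false is a common
// fallback, since the 'u' modifier rejects invalid UTF-8.)
$bytes = file_get_contents('payload.xml');
if (!mb_check_encoding($bytes, 'UTF-8')) {
    // not valid UTF-8: transcode from the real source encoding first
    $bytes = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}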
Flash internally uses UTF-8 to represent strings, so there should not be a problem if the entire stack uses UTF-8 encoding.
You probably have an implicit decode/encode step somewhere along the way.
This could really be a million things, unfortunately. Start from the ground up: insert traces and/or log messages to see where the conversion fails. Make sure your XML content uses UTF-8, and especially if you're using PHP, make sure that all the PHP source files are saved in UTF-8 encoding - editing PHP files in simple text editors often leaves the files in the system's legacy (Windows/Mac) encoding, which will then break your character encoding. Also verify the HTTP request/response headers to see if there is an encoding mismatch.
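One concrete item from that checklist, sketched in PHP (the XML content is just an illustration): make the HTTP header and the XML prolog agree, so the client never has to guess.

<?php
// Declare UTF-8 in both the HTTP header and the XML declaration.
header('Content-Type: application/xml; charset=utf-8');
echo "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
echo "<greeting>Grüße</greeting>";  // non-ASCII survives end to end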
Whenever XMLReader tries to parse the XML file I'm feeding it, it breaks on "½" and on a period that looks like this ".".
Both are characters that, when I try to delete them from the XML feed, make the editor delete the characters in front of them first. So they act like characters from a foreign/different encoding.
What are my options to fix this? I can't edit the XML file every time. Thanks a lot
You have to fix the program or process that creates the "XML" file. (I put "XML" in quotes, because actually, you would like it to be an XML file, but it isn't one.) You might be able to patch or repair or recover the data, but that's not a long-term solution.
The anecdotal evidence suggests that the "½" character is encoded as two bytes, i.e. as UTF-8, while the "é" character is encoded as one byte, i.e. as ISO 8859-1. That means two different processes have written to the file using different encodings. (Perhaps it was originally created in one encoding and then modified with an editor that didn't know what the original encoding was.) That isn't going to work.
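If you need to locate the damage while waiting for the upstream fix, a diagnostic sketch like the following (the file name is hypothetical) walks the bytes and reports the first spot that is not well-formed UTF-8 - in a mixed-encoding file that lands on the stray single-byte character:

<?php
// Consume one well-formed UTF-8 sequence at a time; stop at the first
// byte that doesn't fit. The ranges are slightly simplified (E0/ED/F0/F4
// allow narrower continuations), but catch real-world mixed encodings.
$data = file_get_contents('feed.xml');
for ($i = 0, $len = strlen($data); $i < $len; ) {
    $ok = preg_match('/\A(?:[\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|' .
                     '[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3})/',
                     substr($data, $i, 4), $m);
    if (!$ok) {
        printf("invalid UTF-8 byte 0x%02X at offset %d\n", ord($data[$i]), $i);
        break;
    }
    $i += strlen($m[0]);
}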
I'm implementing communication between two servers using JSON and cURL. The problem is that sometimes there's a BOM (byte order mark) appended before the opening bracket in the JSON reply. I've managed to trim it and successfully parse the JSON string, but considering that the JSON is generated by my own code, I have no idea where that BOM comes from.
I'm using json_encode() to generate the reply and header() + echo to print it, and as far as I can tell, json_encode() does not produce any BOMs. The corresponding .php files are encoded in UTF-8 and have no BOM in them (according to Notepad++). Apart from cURL, I've also tried to perform requests using Chrome and Python (urllib2). While Chrome does not register any BOM at all, Python regularly fails to parse the incoming JSON because of it.
So, is there some nuance in using echo that somehow produces such a result? Where should I start looking for the source of the problem, and what might the solution be?
I had the same problem. I was outputting json from PHP and there were other class files included at the top of the page. These files output nothing, but when they were included I was getting as many Byte Order Marks as I had included files. So if I had 4 includes, I also had 4 BOMs at the start of my json.
I made sure the includes were not printing any data and there were no stray carriage returns outside the PHP tags. I tried headers such as "application-json", etc., but nothing worked.
In the end, I simply opened each PHP file in Notepad++, went to "Encoding" and changed it from UTF-8 to ANSI, then saved. That was all it took to get it working and returning valid JSON. I made no code changes to the PHP at all.
This solution still feels less than ideal: since we are not outputting anything from those included files, nothing should be affected.
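A less invasive alternative to re-saving files by hand is to scan the project for files that carry a BOM - PHP emits the BOM of every included file verbatim, since it sits outside the <?php ?> tags. A hedged sketch, assuming the script runs from the project root:

<?php
// Report every .php file that starts with the UTF-8 BOM (EF BB BF).
$it = new RecursiveIteratorIterator(new RecursiveDirectoryIterator('.'));
foreach ($it as $file) {
    if ($file->isFile() && $file->getExtension() === 'php') {
        $head = file_get_contents($file->getPathname(), false, null, 0, 3);
        if ($head === "\xEF\xBB\xBF") {
            echo "BOM found: ", $file->getPathname(), "\n";
        }
    }
}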
I'm doing a kind of roundabout experiment where I'm pulling data from tables in a remote page to turn it into an ICS, so that I can find out when this sports team is playing (because I can't find the information available anywhere more readily than in this table) - but that's just to give you some context.
I pull this data using cURL and parse it with DOMDocument, then extract the info I need. What's giving me trouble is the opposing team's name. When I display the data on the initial PHP page, it's correct. But when I write it to an ICS file, special UTF-8 characters get messed up. I thought utf8_encode would solve that problem, but it actually seems to have the opposite effect: when I run the function on my data, even the output on the page (which had been displaying correctly) becomes incorrect, not just the separate ICS file (which was already being written incorrectly). As an example: it turns "Inđija" into "InÄija."
Any tips or resources for dealing with UTF-8 strings in PHP? My server (a remote host) doesn't have mbstring installed either, which is a pain.
utf8_encode encodes a string in ISO 8859-1 as UTF-8. If you put UTF-8 into it, it's going to interpret it as if it was ISO 8859-1, and hence produce mojibake.
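A short demonstration of that failure mode (my addition; note utf8_encode is deprecated as of PHP 8.2):

<?php
// utf8_encode() assumes ISO-8859-1 input. Feeding it text that is
// already UTF-8 re-encodes each byte, double-encoding đ (C4 91).
echo utf8_encode("Inđija"), "\n";
// bytes are now: In C3 84 C2 91 ija -> displays as "InÄija"
// (C2 91 decodes to an invisible control character)
// If the input is already UTF-8, simply don't re-encode it.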
As for your first problem: before any of that, I'd want to know what sort of "special" characters are being messed up, and in what way they are being messed up.