I've just launched a small application I've been working on. Nothing major, but something I would like to get working properly. It's at www.wedrapp.com.
Most of the time it works perfectly fine. Enter a city, XML is returned, parsed and the data returned is shown to the user.
Unfortunately, an error is returned when certain cities are searched, such as Marseille. If you search for Marseille you will see what I mean. I have a feeling it has to do with special characters, as a search for Marseille actually returns Marseilles, Provence-Alpes-Côte d'Azur in the XML. Similarly, Paris gives an error, as it actually returns Paris, Île-de-France.
Can anyone shed some light on how to strip these strange characters out, or at least stop them from causing an error before they hit the screen? It is XML parsed with PHP.
Find out which encoding the XML returned by Google is in, then re-encode it from that encoding to UTF-8. After that you can load the XML with SimpleXML.
The Google Weather API XML has an encoding based on the language that is specified when it's requested (it is also possible to specify the encoding you want; I will come to that shortly).
For example, it can be ISO-8859-2 as a related question PHP XML — Google Weather API - parsing and modifying data (Language, UTF-8, and F to Celsius) shows.
You can find out which one by looking into the HTTP Response Header Content-Type:
Content-Type: text/xml; charset=ISO-8859-1
You used utf8_encode to change the encoding; it converts an ISO-8859-1 (also referred to as Latin-1) encoded string to UTF-8. It looks like standard queries to the secret Google Weather API return ISO-8859-1 by default.
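The header-based approach described above could be sketched roughly as follows. This is only an illustration: `charsetFromContentType` and `bodyToUtf8` are hypothetical helper names, the fallback to ISO-8859-1 is an assumption based on the default observed above, and the sample XML fragment is made up.

```php
<?php
// Hypothetical helper: pull the charset out of a Content-Type header value.
function charsetFromContentType(string $contentType): ?string
{
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $contentType, $m) === 1) {
        return strtoupper($m[1]);
    }
    return null; // the header did not declare a charset
}

// Hypothetical helper: re-encode a response body to UTF-8 based on its header.
// The ISO-8859-1 default is an assumption, not something the API guarantees.
function bodyToUtf8(string $body, string $contentType, string $default = 'ISO-8859-1'): string
{
    $charset = charsetFromContentType($contentType) ?? $default;
    if ($charset === 'UTF-8') {
        return $body; // already in the target encoding
    }
    $converted = iconv($charset, 'UTF-8', $body);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $charset to UTF-8");
    }
    return $converted;
}

// Made-up fragment containing "Île" as ISO-8859-1 bytes (0xCE is Î in Latin-1).
$latin1 = "<city data=\"Paris, \xCEle-de-France\"/>";
$utf8   = bodyToUtf8($latin1, 'text/xml; charset=ISO-8859-1');
// $utf8 is now UTF-8 and can be handed to simplexml_load_string().
```

Reading the charset from the header avoids hard-coding utf8_encode, which only helps when the source really is ISO-8859-1.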
You can specify the encoding you'd like to have by adding a oe parameter to the query. For example to get it directly as UTF-8:
http://www.google.com/ig/api?weather=Mountain+View&oe=utf-8
Doing this ensures you always get a specific encoding, instead of having to guess or parse response headers.
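As a small sketch of building such a request URL in PHP, the query string can be assembled with http_build_query so that the values are escaped consistently. The function name `weatherUrl` is hypothetical; the endpoint and the `weather` and `oe` parameters are the ones shown in the URL above.

```php
<?php
// Hypothetical helper: build the request URL with an explicit output encoding.
function weatherUrl(string $city, string $outputEncoding = 'utf-8'): string
{
    // http_build_query() percent-encodes the values (spaces become "+").
    $query = http_build_query([
        'weather' => $city,
        'oe'      => $outputEncoding,
    ]);
    return 'http://www.google.com/ig/api?' . $query;
}

echo weatherUrl('Mountain View'), "\n";
```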
Related
I am trying to work with the Amazon Associates API, but whenever trying to get the information of a product, the characters come out in a weird way.
Example
Text on the Amazon page:
🔥 【23800 mAh
Output of the JSON from the API: 🔥 ã€23800 mAh
More weird characters like this appear as well, such as a dash turning into a question mark.
I've used a code snippet in PHP that was provided by them, which contained the following line which determined the charset:
$awsv4->addHeader('content-type', 'application/json; charset=utf-8');
Does anyone have a pointer where I might be going wrong here, and what I could do to fix this weird conversion?
I don't get what you are actually going to do with this. Assuming you are going to keep it in a database, here is a solution.
You said these characters are changing into weird symbols. What is happening here is that the emojis/symbols are being converted into UTF-8 encoded data, and that is why you are seeing those symbols: they are the raw UTF-8 form.
If you want to keep the data, you have to store it in some text encoding. It doesn't have to be UTF-8; there are many encodings available.
If you want to decode the data, you have to display it in a system that can render it normally. For example, I stored emojis in my database as UTF-8 many years ago. The mobile phone I was using gave me the encoding for one emoji. I saved it, and the next time I viewed the data in my PC browser I saw some other symbols. The decoding system must be in place wherever you want to view them.
The point is that you cannot save the data exactly as you see it on the Amazon page.
This line
$awsv4->addHeader('content-type', 'application/json; charset=utf-8');
declares that the request you are sending is JSON data encoded as UTF-8. It does not ask for a specific encoding for the response.
However, the response you are getting is UTF-8 encoded.
You read 🔥 ã€23800 mAh because you are displaying the received data using an encoding different from UTF-8.
A more detailed answer could be given if we saw how you are writing the output and in which context (web page, terminal...).
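To illustrate the misreading described above, here is a minimal sketch that simulates a viewer treating UTF-8 bytes as ISO-8859-1. The function names are hypothetical, and ISO-8859-1 is only an assumed example of a wrong encoding; the actual one in the question may be different (for instance Windows-1252).

```php
<?php
// Simulate a viewer that wrongly treats UTF-8 bytes as ISO-8859-1:
// every single byte is turned into its own character.
function misreadAsLatin1(string $utf8Bytes): string
{
    return iconv('ISO-8859-1', 'UTF-8', $utf8Bytes);
}

// Reverse the misreading, recovering the original UTF-8 bytes.
function undoLatin1Misread(string $mojibake): string
{
    return iconv('UTF-8', 'ISO-8859-1', $mojibake);
}

$original = "caf\xC3\xA9";                 // "café" as UTF-8 bytes
$garbled  = misreadAsLatin1($original);    // renders as "cafÃ©"
$restored = undoLatin1Misread($garbled);   // back to "café"
```

In a web page context, one way to make the browser decode the output as UTF-8 is to send `header('Content-Type: text/html; charset=utf-8');` before any output.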
I'm using GoogleMaps API to retrieve location information. The result is fetched via cURL and the fetched string should be converted to a JSON-object using json_decode.
For many locations (for example in the Netherlands) this works like a charm. But for many German locations (and probably those in other countries such as Austria and Switzerland) it doesn't work as expected.
I believe this is because of the 'special' characters like ß, but also ü, ë, ä, ï and so on.
For example: this is the string fetched via cURL (http://maps.googleapis.com/maps/api/geocode/json?address=Stoltenkampstra%C3%9Fe%2011,Bad%20Bentheim&sensor=false&language=nl)
In the following $sResponse is the result fetched by cURL.
When I call json_decode($sResponse), the result is null. json_last_error() returns 5 (which means JSON_ERROR_UTF8), while mb_detect_encoding($sResponse) reports UTF-8.
Any suggestions?
If you encounter this problem as well, make sure you've set your document to have the correct charset. In my case I forgot to include <meta charset='utf-8'> in my index.php file. This was what I had overlooked... Dumb... but maybe it helps you in the future ;)
As correctly mentioned by Gumbo, this wasn't the only fix to the problem. (It only fixed how the data was presented in my browser). I was also playing with the Encoding-library, using Encoding::toUTF8(). This is a very neat and helpful class I've found during my search for a solution. You can read about it here: Detect encoding and make everything UTF-8
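For the JSON_ERROR_UTF8 case, a minimal sketch of a defensive decode is shown below. It is not the Encoding library mentioned above: `isValidUtf8` and `decodeJsonLenient` are hypothetical helpers, and treating invalid input as ISO-8859-1 is an assumption that has to match where the bytes really came from.

```php
<?php
// With the /u modifier, PCRE rejects subjects that are not valid UTF-8.
function isValidUtf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

// Hypothetical helper: convert to UTF-8 only when the input is not already
// valid UTF-8, then decode. The source charset is an assumption.
function decodeJsonLenient(string $raw, string $assumedCharset = 'ISO-8859-1'): array
{
    if (!isValidUtf8($raw)) {
        $converted = iconv($assumedCharset, 'UTF-8', $raw);
        if ($converted === false) {
            throw new RuntimeException("Cannot convert from $assumedCharset");
        }
        $raw = $converted;
    }
    $data = json_decode($raw, true);
    if (json_last_error() !== JSON_ERROR_NONE || !is_array($data)) {
        throw new RuntimeException('JSON error: ' . json_last_error_msg());
    }
    return $data;
}

// "Straße" with ß as a single ISO-8859-1 byte (0xDF): plain json_decode()
// fails on this with JSON_ERROR_UTF8.
$latin1Json = "{\"street\":\"Stra\xDFe\"}";
$decoded    = decodeJsonLenient($latin1Json);
```

Checking validity first avoids double-encoding a response that was already UTF-8.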
I'm making multiple AJAX calls that return XML data. When I get the data back, my success function (in jQuery) tries to turn the XML into JSON (using a plugin). I was quickly reminded why I can't assume I would be getting VALID XML back from my AJAX request: it turns out a few of the XML responses were invalid, causing the JSON conversion to fail, the script to fail, etc...
My questions are:
What is the best way to check for valid XML on an AJAX response? Or, should I just attempt the JSON conversion, then do a quick check if the JSON object is valid?
In troubleshooting the XML, I found that there are a few strange characters at the VERY beginning of the XML response. Here's an image from my Firebug:
Should I try to detect and strip the response of those chars or could there possibly be something wrong with my encoding?
Any help is appreciated! Let me know if more info is needed!
It's the UTF-8 byte-order mark when incorrectly interpreted as ISO-8859-1.
You can't safely strip this because it's just a symptom of a larger problem. Your content is encoded as UTF-8. Somewhere along the way you are decoding it as ISO-8859-1 instead. If you try to hide the problem by stripping the BOM, you're only setting yourself up for more problems down the line as soon as you start using non-ASCII characters. The only reason things are even looking sort-of right is because ASCII is a common subset of both UTF-8 and ISO-8859-1.
The strange characters are the byte order mark, which is actually valid in XML; in most circumstances you can most likely just strip them without risk.
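Both answers can be checked at the byte level. The sketch below, written in PHP since that is what the other questions in this thread use, detects the UTF-8 byte order mark (bytes EF BB BF) and shows what those three bytes turn into when decoded as ISO-8859-1. The constant and function names are hypothetical, and the sample response is made up.

```php
<?php
const UTF8_BOM = "\xEF\xBB\xBF"; // the UTF-8 byte order mark

function hasUtf8Bom(string $bytes): bool
{
    return strncmp($bytes, UTF8_BOM, 3) === 0;
}

// Stripping only hides the symptom if the body is still decoded with the
// wrong charset afterwards; fix the decoding as well.
function stripUtf8Bom(string $bytes): string
{
    return hasUtf8Bom($bytes) ? substr($bytes, 3) : $bytes;
}

$response = UTF8_BOM . '<root>ok</root>'; // made-up response body

// What the first three bytes look like when misread as ISO-8859-1: "ï»¿".
$seenAsLatin1 = iconv('ISO-8859-1', 'UTF-8', substr($response, 0, 3));
```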
I'm doing a kind of roundabout experiment thing where I'm pulling data from tables in a remote page to turn it into an ICS so that I can find out when this sports team is playing (because I can't find anywhere that the information is more readily available than in this table), but that's just to give you some context.
I pull this data using cURL and parse it using DOMDocument. Then I take it and parse it for the info I need. What's giving me trouble is the opposing team. When I display the data on the initial PHP page, it's correct. But when I write to an ICS file, special UTF-8 characters get messed up. I thought utf8_encode would solve that problem, but it actually seems to have the opposite effect: when I run the function on my data, even the output displayed on the page (which had been displaying correctly), and not just the separate ICS file (which was being written incorrectly), is incorrect. As an example: it turns "Inđija" to "InÄija."
Any tips or resources as far as dealing with UTF-8 strings in PHP? My server (a remote host) doesn't have mbstring installed either, which is a pain.
utf8_encode encodes a string in ISO 8859-1 as UTF-8. If you put UTF-8 into it, it's going to interpret it as if it was ISO 8859-1, and hence produce mojibake.
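The double-encoding effect can be reproduced with a short sketch. It uses iconv from ISO-8859-1 to UTF-8 as a stand-in for utf8_encode (which is deprecated in recent PHP versions), and neither iconv nor preg_match needs mbstring. The function names `latin1ToUtf8` and `toUtf8Once` are hypothetical.

```php
<?php
// Stand-in for utf8_encode(): treat every input byte as an ISO-8859-1
// character and emit it as UTF-8.
function latin1ToUtf8(string $s): string
{
    return iconv('ISO-8859-1', 'UTF-8', $s);
}

// Hypothetical guard: convert only if the input is not already valid UTF-8.
// The /u modifier makes preg_match() fail on invalid UTF-8 subjects.
function toUtf8Once(string $s): string
{
    return preg_match('//u', $s) === 1 ? $s : latin1ToUtf8($s);
}

$name = "In\xC4\x91ija";               // "Inđija" as UTF-8 bytes

// Feeding UTF-8 into the Latin-1 converter encodes each byte again,
// which renders as "InÄija" (plus an invisible control character).
$doubleEncoded = latin1ToUtf8($name);

$safe = toUtf8Once($name);             // left untouched, still "Inđija"
```

The guard is a heuristic: a string that happens to be valid UTF-8 is assumed to be UTF-8, which is usually but not always what was intended.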
To help with your original problem (the ICS file, before utf8_encode was involved), I'd want to know which "special" characters are being messed up, and in what way they are being messed up.
I'm working with UK address data and also International address data.
I need to geocode the address data for use on a Google map. I'm doing this using the HTTP service, i.e. constructing a query string and passing it to file_get_contents($THEURL).
I've managed to geocode 80% of the address data perfectly; however, addresses in countries like Norway and Sweden that contain special characters will not return a geocode. The code returned is 602 (cannot find an address).
Looking into the documentation I can see that the string sent to Google must be UTF-8 encoded.
I've tried the following to ensure the string is UTF-8 encoded or to remove the special characters.
1) UTF-8 encoding the query string - this often results in malformed characters being displayed on the screen.
2) mb_check_encoding reports the string is correctly encoded.
3) Using a function to substitute special characters with their European equivalents (in the hope that the Google API will compensate).
Can anyone suggest a reason why my method isn't working (whether to do with encoding or not?).
You need to systematically go through every encoding aspect in your system and define what encoding it is in. mb_detect_encoding and guesswork are not a good approach here.
You need to check the encoding of:
incoming data
pages
GET parameters
database connection
database table collations
the script files you work with
If malformed characters occur, chances are you are using ISO-8859-1 or some other non-UTF-8 encoding somewhere. When everything is clean UTF-8, the request should go through.
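For the GET parameter part of that checklist, a minimal sketch of building the query string is given below. The idea is to state the source encoding explicitly, convert to UTF-8 once, and only then percent-encode. `geocodeQuery` is a hypothetical helper, and the `address` and `sensor` parameter names are taken from the geocoding URL quoted in one of the questions above, not from the service this asker is using.

```php
<?php
// Hypothetical helper: build a geocoding query string from an address whose
// encoding is known, rather than guessed.
function geocodeQuery(string $address, string $sourceCharset = 'UTF-8'): string
{
    if (strcasecmp($sourceCharset, 'UTF-8') !== 0) {
        $converted = iconv($sourceCharset, 'UTF-8', $address);
        if ($converted === false) {
            throw new RuntimeException("Cannot convert from $sourceCharset");
        }
        $address = $converted;
    }
    // PHP_QUERY_RFC3986 percent-encodes the UTF-8 bytes and encodes
    // spaces as %20.
    return http_build_query(
        ['address' => $address, 'sensor' => 'false'],
        '',
        '&',
        PHP_QUERY_RFC3986
    );
}

// "Straße 11" stored as ISO-8859-1 (ß is the single byte 0xDF).
$query = geocodeQuery("Stra\xDFe 11", 'ISO-8859-1');
```

Percent-encoding the raw ISO-8859-1 byte instead would produce %DF, which is not valid UTF-8 on the receiving side.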
A very good article on the basics is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).