Characters appearing different from API - php

I am trying to work with the Amazon Associates API, but whenever trying to get the information of a product, the characters come out in a weird way.
Example
Text on the Amazon page:
🔥 【23800 mAh
Output of the JSON from the API: 🔥 ã€23800 mAh
Other characters are garbled in the same way, for example a dash turning into a question mark.
I've used a code snippet in PHP that was provided by them, which contained the following line which determined the charset:
$awsv4->addHeader('content-type', 'application/json; charset=utf-8');
Does anyone have a pointer where I might be going wrong here, and what I could do to fix this weird conversion?

I don't know what you ultimately want to do with this data, so I will assume you are going to store it in a database.
You said the characters turn into weird symbols. What is happening is that the emojis and symbols arrive as UTF-8 encoded bytes, and those bytes are then shown as if they were in a different encoding. That is why you see those symbols.
If you want to keep the data, you have to store it in a text encoding that can represent these characters. It does not have to be UTF-8; other encodings are available.
To decode the characters, you have to view them on a system that can display them properly. For example, I stored emojis in my database as UTF-8 many years ago. The mobile phone I was using at the time gave me an encoding for one emoji, which I saved; later, when I viewed the data in my PC browser, I saw different symbols. The system you view the data on must be able to decode it.
The point is that you do not save the data as you see it on the Amazon page: you save encoded bytes, and whatever displays them later has to decode them the same way.

This line
$awsv4->addHeader('content-type', 'application/json; charset=utf-8');
tells the server that the request you are sending is JSON encoded as UTF-8. It does not ask for a specific encoding for the response.
However, the response you are getting is UTF-8 encoded.
You see 🔥 ã€23800 mAh because you are displaying the received data using an encoding other than UTF-8.
A more detailed answer is possible if you show how you are writing the output and in which context (web page, terminal, ...).
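As a rough illustration of the point above (not the asker's actual code), the sketch below shows how valid UTF-8 bytes turn into mojibake when a consumer assumes a single-byte encoding. The sample string and variable names are made up, ISO-8859-1 is assumed as the wrong encoding purely for illustration, and the mbstring extension is assumed to be available.

```php
<?php
// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9 (sample data, not from the API).
$utf8 = "Caf\xC3\xA9";

// The bytes themselves are valid UTF-8.
$isUtf8 = mb_check_encoding($utf8, 'UTF-8');

// A consumer that assumes ISO-8859-1 reads each byte as one character.
// Converting "from ISO-8859-1 to UTF-8" reproduces what such a consumer shows.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo $misread, PHP_EOL; // CafÃ©
```

If the output is a web page, declaring the encoding (for example with header('Content-Type: text/html; charset=utf-8') before any output) is the usual way to make the browser decode the bytes as UTF-8.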

Related

PHP parser grabbing ISO website, to be converted to UTF8

I am parsing some content from another website which, as I can see in the header, is ISO-8859-1. But the CMS that is pulling in the content uses UTF-8.
It gets most characters right, but things like " turn into weird characters. I'm not sure how to convert this content properly.
Can anyone help, please?
You want to use utf8_encode($content) and utf8_decode($content) to convert your content on its way into and out of the database.
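A minimal sketch of an alternative, assuming the mbstring extension is available: utf8_encode() only understands ISO-8859-1 and is deprecated as of PHP 8.2, while mb_convert_encoding() lets you name the source encoding. Typographic quotes are a common casualty because many pages labelled ISO-8859-1 are really Windows-1252; that is an assumption about the scraped site, and the sample bytes below are hypothetical.

```php
<?php
// Hypothetical scraped bytes: Windows-1252 curly quotes (0x93, 0x94) around a word.
$scraped = "\x93Hello\x94";

// Name the source encoding explicitly; Windows-1252 is an assumption here.
$utf8 = mb_convert_encoding($scraped, 'UTF-8', 'Windows-1252');

echo $utf8, PHP_EOL; // “Hello”
```

With utf8_encode() the same two quote bytes would be treated as ISO-8859-1 control characters, which is one way such "weird characters" can arise.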

Strange Characters In XML Response From Google Weather API Error

I've just launched a small application I've been working on. It's nothing major, but I would like to get it working properly. It's at www.wedrapp.com.
Most of the time it works perfectly fine. Enter a city, XML is returned, parsed and the data returned is shown to the user.
Unfortunately however, an error is returned when certain cities are searched such as Marseille. If you search Marseille you will see what I mean. I have a feeling it is to do with special characters, as Marseille searched actually returns Marseilles, Provence-Alpes-Côte d'Azur in the XML. Similarly Paris gives an error as it actually returns Paris, Île-de-France.
Can anyone shed some light on how to strip these strange characters out, or at least stop them providing an error before hitting the screen? It is XML parsed with PHP.
Find out which encoding the XML returned by Google uses. Then re-encode it from that encoding to UTF-8, and then load the XML with SimpleXML.
The Google Weather API XML uses an encoding that depends on the language specified in the request (you can also specify the encoding you want, which I cover below).
For example, it can be ISO-8859-2 as a related question PHP XML — Google Weather API - parsing and modifying data (Language, UTF-8, and F to Celsius) shows.
You can find out which one by looking into the HTTP Response Header Content-Type:
Content-Type: text/xml; charset=ISO-8859-1
You used utf8_encode to change the encoding; it converts an ISO-8859-1 (also referred to as Latin-1) encoded string to UTF-8. It looks like standard queries to the secret Google Weather API return ISO-8859-1 by default.
You can specify the encoding you would like by adding an oe parameter to the query. For example, to get UTF-8 directly:
http://www.google.com/ig/api?weather=Mountain+View&oe=utf-8
Doing this ensures you always get a specific encoding, instead of having to guess or parse the response headers.
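When the encoding cannot be requested, the header-based approach described above could look roughly like this. It is only a sketch: charsetFromContentType() is a hypothetical helper, the XML fragment is invented sample data without an XML declaration, and the mbstring and SimpleXML extensions are assumed to be available.

```php
<?php
// Hypothetical helper: pull the charset out of a Content-Type header line.
function charsetFromContentType(string $header, string $default = 'ISO-8859-1'): string
{
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $header, $m)) {
        return $m[1];
    }
    return $default; // assumption: fall back to Latin-1 when no charset is given
}

$contentType = 'Content-Type: text/xml; charset=ISO-8859-1';

// Invented sample body: the byte 0xCE is the letter Î in ISO-8859-1.
$rawXml = "<city data=\"Paris, \xCEle-de-France\"/>";

// Re-encode to UTF-8 before handing the document to SimpleXML.
$utf8Xml = mb_convert_encoding($rawXml, 'UTF-8', charsetFromContentType($contentType));
$xml = simplexml_load_string($utf8Xml);

echo (string) $xml['data'], PHP_EOL; // Paris, Île-de-France
```

If the document carried its own XML declaration naming a different encoding, that declaration would have to be adjusted as well; the sample sidesteps this by omitting it.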

Best practices about parsing multi language feed

I'm having a problem parsing data from different feeds, some of them in English, others in Italian and others in Spanish. I'm parsing using a PHP script and saving the parsed data into my MySQL database.
The problem is that when I parse items that contain "non-common" characters, such as "Strage di Viareggio Più", the phrase shows up garbled when I look in my database, rather than as "Strage di Viareggio Più".
My database can store that kind of character, because when I input it manually it works fine; the phrase is also fine in the original feed (RSS file). I think it is my PHP server that is changing the letter. How can I solve this? Thanks!
Make sure that the database uses UTF-8 (as you say it does) and that the PHP script has its internal encoding set to UTF-8, which you can achieve with iconv_set_encoding. If you're reading data from an HTTP request that should be all you need, as long as the request tags its own encoding correctly.
It looks like the input data is UTF-8 but the charset/collation of the DB table is ASCII. I would suggest using UTF-8 everywhere.
What you need to apply before saving to MySQL is htmlentities():
http://php.net/manual/en/function.htmlentities.php
Check these different threads for more information
Best practices in PHP and MySQL with international strings
htmlentities() makes Chinese characters unusable
What I find incredible is that this question has received -2 in the past 24 hours without any comments.
From the question posted:
I'm parsing using a PHP script and saving the parsed data into my MySQL database.
and
I think it is my PHP server that is changing the letter. How can I solve this? Thanks!
The answers posted so far relate to the encoding and settings of MySQL. The person asking the question has clearly stated that he can insert special characters manually without any problems:
My database can store that kind of character, because when I input it manually it works fine
My answer was meant to help him convert the characters into HTML entities, which circumvents the problem he is having with the RSS feed and answers the question posted.
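For reference, a small sketch of what the htmlentities() suggestion does to the phrase from the question. The variable names are illustrative, and passing the charset explicitly is a choice made here so the input is treated as UTF-8 regardless of configuration.

```php
<?php
// The phrase from the question, with "ù" written as its UTF-8 bytes.
$title = "Strage di Viareggio Pi\xC3\xB9";

// Convert non-ASCII characters to named HTML entities.
$encoded = htmlentities($title, ENT_QUOTES | ENT_HTML401, 'UTF-8');
echo $encoded, PHP_EOL; // Strage di Viareggio Pi&ugrave;

// The original text can be recovered when reading the value back.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML401, 'UTF-8');
```

The stored value becomes plain ASCII, which sidesteps the table charset, but it is no longer the original text: searching, sorting, and length limits then operate on the entity form.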

Strange characters at beginning of XML AJAX response?

I'm making multiple AJAX calls that return XML data. When I get the data back, my success function (in jQuery) tries to turn the XML into JSON (using a plugin). I was quickly reminded why I can't assume I will get VALID XML back from my AJAX request: it turns out a few of the XML responses were invalid, causing the JSON conversion to fail, the script to fail, and so on.
My questions are:
What is the best way to check for valid XML on an AJAX response? Or, should I just attempt the JSON conversion, then do a quick check if the JSON object is valid?
In troubleshooting the XML, I found that there are a few strange characters at the VERY beginning of the XML response. Here's an image from my Firebug:
Should I try to detect and strip the response of those chars or could there possibly be something wrong with my encoding?
Any help is appreciated! Let me know if more info is needed!
It's the UTF-8 byte-order mark when incorrectly interpreted as ISO-8859-1.
You can't safely strip this because it's just a symptom of a larger problem. Your content is encoded as UTF-8. Somewhere along the way you are decoding it as ISO-8859-1 instead. If you try to hide the problem by stripping the BOM, you're only setting yourself up for more problems down the line as soon as you start using non-ASCII characters. The only reason things are even looking sort-of right is because ASCII is a common subset of both UTF-8 and ISO-8859-1.
The strange characters are the Byte Order Mark and are actually valid XML; in most circumstances you can just strip them without risk.
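Both answers refer to the same three bytes. The sketch below, with hypothetical helper names and assuming the mbstring extension is available, shows how the UTF-8 byte-order mark can be detected on the server side, what it looks like when read as ISO-8859-1, and how it would be stripped if you decide that is acceptable.

```php
<?php
// The UTF-8 byte-order mark is the byte sequence EF BB BF.
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: does the body start with a UTF-8 BOM?
function hasUtf8Bom(string $body): bool
{
    return strncmp($body, UTF8_BOM, 3) === 0;
}

// Hypothetical helper: drop the BOM if present, leave the body alone otherwise.
function stripUtf8Bom(string $body): string
{
    return hasUtf8Bom($body) ? substr($body, 3) : $body;
}

// Invented sample response body.
$response = UTF8_BOM . '<root/>';

// What the three BOM bytes look like to a reader that assumes ISO-8859-1.
$asLatin1 = mb_convert_encoding(UTF8_BOM, 'UTF-8', 'ISO-8859-1');

echo $asLatin1, PHP_EOL;               // ï»¿
echo stripUtf8Bom($response), PHP_EOL; // <root/>
```

Seeing the characters at all means something decoded the response with the wrong encoding, which is the concern raised in the first answer; stripping only hides that symptom.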

Why is PHP's utf8_encode breaking my utf-8 string?

I'm doing a kind of roundabout experiment thing where I'm pulling data from tables in a remote page to turn it into an ICS so that I can find out when this sports team is playing (because I can't find anywhere that the information is more readily available than in this table), but that's just to give you some context.
I pull this data using cURL and parse it using domDocument. Then I take it and parse it for the info I need. What's giving me trouble is the opposing team. When I display the data on the initial PHP page, it's correct. But when I write to an ICS file, special UTF-8 characters get messed up. I thought utf8_encode would solve that problem, but it actually seems to have the opposite effect: when I run the function on my data, even the stuff displayed on the page (which had been displaying correctly), not in the separate ICS file (which was writing incorrectly), is incorrect. As an example: it turns "Inđija" to "InÄija."
Any tips or resources as far as dealing with UTF-8 strings in PHP? My server (a remote host) doesn't have mbstring installed either, which is a pain.
utf8_encode encodes a string in ISO 8859-1 as UTF-8. If you put UTF-8 into it, it's going to interpret it as if it was ISO 8859-1, and hence produce mojibake.
To help with your first problem, before this, I'd want to know what sort of "special" characters are being messed up in the original problem, and what way are they being messed up?
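To make the double-encoding concrete, here is a sketch that avoids mbstring, since the asker's host lacks it. It relies on iconv() and on preg_match() with the u modifier, which fails on invalid UTF-8. toUtf8Once() is a hypothetical helper, and ISO-8859-1 as the fallback source encoding is an assumption.

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the input is not already valid UTF-8.
function toUtf8Once(string $s, string $assumedSource = 'ISO-8859-1'): string
{
    // With the u modifier, preg_match() returns false for invalid UTF-8 subjects.
    if (preg_match('//u', $s) === 1) {
        return $s; // already UTF-8: converting again would produce mojibake
    }
    return iconv($assumedSource, 'UTF-8', $s);
}

// "Inđija" as UTF-8: đ is the two bytes 0xC4 0x91.
$alreadyUtf8 = "In\xC4\x91ija";

// Treating those bytes as ISO-8859-1, which is what utf8_encode() does,
// turns each byte into its own character: Ä plus an invisible control character.
$doubleEncoded = iconv('ISO-8859-1', 'UTF-8', $alreadyUtf8);

// The guarded version leaves the string untouched.
$safe = toUtf8Once($alreadyUtf8);
```

The validity check is a heuristic: a short ISO-8859-1 string can happen to be valid UTF-8, so knowing the real source encoding is still preferable.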
