Why is PHP's utf8_encode breaking my utf-8 string? - php

I'm doing a roundabout experiment where I pull data from tables on a remote page and turn it into an ICS file, so that I can find out when a sports team is playing (I can't find the information anywhere more readily available than in this table). That's just to give you some context.
I pull this data using cURL and load it with DOMDocument, then parse out the info I need. What's giving me trouble is the opposing team's name. When I display the data on the initial PHP page, it's correct, but when I write it to an ICS file, special UTF-8 characters get messed up. I thought utf8_encode would solve that problem, but it seems to have the opposite effect: when I run the function on my data, even the text displayed on the page (which had been correct) comes out wrong, not just the text in the separate ICS file (which was already wrong). As an example, it turns "Inđija" into "InÄija".
Any tips or resources as far as dealing with UTF-8 strings in PHP? My server (a remote host) doesn't have mbstring installed either, which is a pain.

utf8_encode encodes a string in ISO 8859-1 as UTF-8. If you put UTF-8 into it, it's going to interpret it as if it was ISO 8859-1, and hence produce mojibake.
To help with your original problem (the one you had before adding utf8_encode), I'd want to know which "special" characters are being messed up in the ICS file, and in what way.
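The double encoding described in this answer can be reproduced in a few lines. This is only a sketch: latin1_to_utf8() and looks_like_utf8() are hypothetical helpers written for this example. The first is meant to mimic what utf8_encode does (treat every byte as one ISO 8859-1 character) without relying on any extension, and the second uses PCRE's /u modifier as a rough UTF-8 validity check, since mbstring is not available on the asker's host.

```php
<?php
// Hypothetical helper that mimics utf8_encode(): every input byte is treated
// as one ISO 8859-1 character and re-encoded as UTF-8.
function latin1_to_utf8(string $bytes): string
{
    $out = '';
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            $out .= chr($b); // ASCII passes through unchanged
        } else {
            // Bytes 0x80-0xFF become a two-byte UTF-8 sequence.
            $out .= chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
        }
    }
    return $out;
}

// Hypothetical guard: PCRE's /u modifier rejects subjects that are not valid UTF-8.
function looks_like_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

$alreadyUtf8 = "In\xC4\x91ija";              // "Inđija", already UTF-8 (đ is C4 91)
$mangled     = latin1_to_utf8($alreadyUtf8); // đ is now C3 84 C2 91: mojibake

// Only convert input that is not valid UTF-8 yet.
$safe = looks_like_utf8($alreadyUtf8) ? $alreadyUtf8 : latin1_to_utf8($alreadyUtf8);
```

With a guard like this, data that is already UTF-8 is left alone, and only genuinely ISO 8859-1 input gets converted.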

Related

Characters appearing different from API

I am trying to work with the Amazon Associates API, but whenever trying to get the information of a product, the characters come out in a weird way.
Example
Text on the Amazon page:
🔥 【23800 mAh
Output of the JSON from the API: 🔥 ã€23800 mAh
More weird characters like this appear elsewhere, such as a dash turning into a question mark.
I've used a code snippet in PHP that was provided by them, which contained the following line which determined the charset:
$awsv4->addHeader('content-type', 'application/json; charset=utf-8');
Does anyone have a pointer where I might be going wrong here, and what I could do to fix this weird conversion?
I don't quite understand what you are going to do with this data. Assuming you are going to keep it in a database, here is a solution.
You said these characters are changing into weird symbols. What is happening here is that the emojis/symbols arrive as UTF-8 encoded bytes, and the symbols you are seeing are those raw UTF-8 bytes shown as individual characters.
If you want to keep that data, you have to store it in some text encoding. It doesn't have to be UTF-8; there are many encodings available, as long as the one you pick can represent these characters.
If you want to decode it, you have to display it in a system that can show it properly. For example, I stored emojis in my database as UTF-8 many years ago. The mobile phone I was using gave me an encoding for one emoji. I saved it, and the next time I looked at the data in my PC browser I saw some other symbols. The right decoding must be in place wherever you want to view them.
The point is that you cannot save the data exactly as you see it on the Amazon page.
This line
$awsv4->addHeader('content-type', 'application/json; charset=utf-8');
tells the server that the request you are sending is JSON, encoded as UTF-8. It does not ask for a specific encoding for the response.
However, the response you are getting is UTF-8 encoded.
You read 🔥 ã€23800 mAh because you are displaying the received data using an encoding other than UTF-8.
A more detailed answer may be given if we see how you are writing the output and which is the context (web page, terminal...)
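If the output context turns out to be a web page, one way to apply this answer is to declare UTF-8 explicitly when rendering. The sketch below assumes exactly that; render_utf8_page() is a hypothetical helper, and the JSON string stands in for the real API response, whose actual structure is not shown in the question.

```php
<?php
// Hypothetical helper: wrap UTF-8 text in a page that declares its encoding.
function render_utf8_page(string $utf8Text): string
{
    // Tell the browser the bytes are UTF-8; headers must go out before any body output.
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=utf-8');
    }
    // Escape for HTML while stating that the input is UTF-8.
    $safe = htmlspecialchars($utf8Text, ENT_QUOTES, 'UTF-8');
    return "<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\"></head>"
         . "<body><p>{$safe}</p></body></html>";
}

// Assumed stand-in for the API response body; json_decode() expects and returns UTF-8.
$json = '{"title": "' . "\u{1F525} \u{3010}23800 mAh" . '"}';
$data = json_decode($json, true);
$page = render_utf8_page($data['title']);
```

The decoded string is passed through untouched; the only change is that the consumer is told which encoding the bytes are in.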

Odd encoding issue after UTF-8 straightens "most" things out

OK, so we have a script that takes emails sent to Thunderbird, converts part of the message to HTML and saves it to a MySQL database. Every file, every part written is set to UTF-8. Finally, on my end of the work, in the CRM (written in PHP 5.3, with Chrome and Firefox as the expected browsers), I pull the message along with other info and display something resembling Gmail, but as a "task list" for our employees.
The problem I'm having, if you haven't guessed already, is that some customer emails are obviously using different encodings. Thus, some (not all, and certainly not the majority) of the emails don't show all characters correctly.
At first I made use of utf8_encode to get the email messages to look right, and this helps with most email messages coming from the database, however, a few slip by with bad characters.
In the DB these "bad apostrophes" appear as ’, but after utf8_encode they come through as �??. I've tried various encoding things to guess and change as needed, however, this tends to hurt the vast majority of the other emails.
Any suggestions, on one end of the pipe or the other, for how I might get these few emails to match everything else, or how I might at least create a preg_replace filter at the end or something?
Update:
It seems even the emails with bad characters reach the final PHP stage as UTF-8 according to mb_detect_encoding. This is before any extra encoding. iconv does detect the ones that have problems, but this gives me no way to fix them, and it puts a PHP error box up on the screen instead of the simple FALSE return that the documentation says it should give, so this too seems to be no solution.
The problem is that you don't know the encoding of the mail. utf8_encode encodes only from ISO-8859-1 to UTF-8. So you could try to get the encoding with mb_detect_encoding and then convert to UTF-8 with iconv.
EDIT: You could also try to read the Content-Type's charset of the mail.
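Reading the charset from the mail's Content-Type header and converting with iconv could look roughly like this. It is a sketch under assumptions: both helper names are hypothetical, the iconv extension is assumed to be available, and falling back to ISO-8859-1 for unlabeled mail is a guess rather than a rule. The @ operator is used so that a failed conversion shows up as a null return instead of a notice on screen.

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type header value.
function charset_from_content_type(string $header, string $fallback = 'ISO-8859-1'): string
{
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $header, $m) === 1) {
        return strtolower($m[1]);
    }
    return $fallback; // assumption: treat unlabeled mail as ISO-8859-1
}

// Hypothetical helper: convert to UTF-8 and return null on failure.
function mail_body_to_utf8(string $body, string $charset): ?string
{
    // @ silences the notice iconv raises on bad input; false still signals failure.
    $converted = @iconv($charset, 'UTF-8', $body);
    return $converted === false ? null : $converted;
}

$charset = charset_from_content_type('text/plain; charset="ISO-8859-1"');
$utf8    = mail_body_to_utf8("caf\xE9", $charset); // E9 is "é" in ISO-8859-1
```

A null result marks a message whose declared charset does not match its bytes, which can then be logged or handled separately instead of breaking the page.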
Found My Answer!
Let me start by saying thanks to Sebastián Grignoli for creating this VERY handy class. I ended up working it into my final solution.
Second, I added the class to CodeIgniter. For any of you using CI, this is an easy implementation. Simply create a file in application/libraries named Encoding.php (yes, with the capital E). Then copy the code into that file, but comment out (or remove) namespace ForceUTF8 on line 40.
My end result looks something like:
echo(Encoding::fixUTF8(utf8_decode($msgHTML)));
I'm still double checking, but thus far, I've yet to find one single error!
If I do find another encoding issue after this, I'll make sure to update.
SO Question I found that helped.

Best practices about parsing multi language feed

I'm having a problem parsing data from different feeds, some of them in English, others in Italian and others in Spanish. I'm parsing using a PHP script and saving the parsed data into my MySQL database.
The problem is that when I parse items that contain "uncommon" characters, like "Strage di Viareggio Più", and then look in my database, the phrase is stored like this: "Strage di Viareggio PiÃ¹".
My database can handle that kind of character, because when I input it manually it works fine. In the original feed (RSS file) the phrase is also fine, so I think it is my PHP server that is changing the letter. How can I solve this? Thanks!
Make sure that the database uses UTF-8 (as you say it does) and that the PHP script has its internal encoding set to UTF-8, which you can achieve with iconv_set_encoding (deprecated since PHP 5.6 in favor of the default_charset ini setting). If you're reading data from an HTTP request, that should be all you need, as long as the request tags its own encoding correctly.
It looks like the input data is in UTF-8, but the charset/collation of the DB table is ASCII. I would suggest using UTF-8 everywhere.
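"UTF-8 everywhere" includes the database connection itself, not just the tables. As a small sketch, assuming PDO with the MySQL driver is used (the question does not say which API the script uses), the charset can be pinned in the DSN. mysql_utf8_dsn() is a hypothetical helper, and the host and database name are placeholders.

```php
<?php
// Hypothetical helper: build a PDO MySQL DSN that pins the connection charset.
function mysql_utf8_dsn(string $host, string $dbname): string
{
    // utf8mb4 is MySQL's full UTF-8 charset; MySQL's plain "utf8" stops at 3-byte sequences.
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

$dsn = mysql_utf8_dsn('localhost', 'feeds'); // placeholder host and database name
// $pdo = new PDO($dsn, $user, $password);   // credentials omitted; needs the pdo_mysql driver
```

The tables and columns still need a UTF-8 charset of their own; the DSN only controls how bytes are interpreted on the wire.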
What you need to implement, before saving to MySQL is:
http://php.net/manual/en/function.htmlentities.php
Check these different threads for more information
Best practices in PHP and MySQL with international strings
htmlentities() makes Chinese characters unusable
What I find incredible is that this question has received -2 in the past 24 hours without any comments.
From the question posted:
I'm parsing using a PHP script and saving the parsed data into my MySQL database.
and
I think it is my PHP server that is changing the letter. How can I solve this? Thanks!
The answers posted so far are related to the encoding and settings of MySQL. The person asking the question has clearly stated that he can insert special characters manually and is having no problems:
My database can handle that kind of character, because when I input it manually it works fine
My answer was meant to help him convert the characters into HTML entities, which circumvents the problem he is having with the RSS feed and answers the question posted.
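For reference, this is roughly what the entity approach does to the phrase from the question. It is a minimal sketch, assuming the feed text really is UTF-8; passing the charset explicitly avoids depending on the PHP version's default.

```php
<?php
$title   = "Strage di Viareggio Pi\u{F9}";                     // "Più" as UTF-8
$encoded = htmlentities($title, ENT_QUOTES, 'UTF-8');          // state the input charset
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');  // reverses the conversion
```

Here $encoded holds "Strage di Viareggio Pi&ugrave;", which is pure ASCII and so survives a mismatched database charset. The trade-off is that the stored value is only directly usable as HTML; any other consumer has to decode it first.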

php INSERT through Flash AS3 sometimes inserts weird Strings

I haven't got a clue whether this is a normal issue or not, but I have a small Flash application that handles management for my company. It's a small company, so it's not a big deal; it's just a bunch of INSERTs, SELECTs, UPDATEs and other stuff to manage clients, addresses, phone numbers, etc.
The Flash app (in AS3) sends the variables through a URLRequest to several PHP pages, and the PHP handles the requests to MySQL.
My problem is that sometimes, instead of inserting the String I sent, it inserts a weird string made mostly, but not only, of numbers (and it happens in about 1 column out of 10 per INSERT, so it's fairly common).
Is this a known issue? Could it be because of the encoding (I used UTF-8, which I believe is the one that we use here in Portugal, due to special characters like ã, à, á, etc.)?
Thank you for your time.
Marco Fox.
After connecting to the DB, try the following query: "SET CHARACTER SET utf8;".
Make sure every PHP page is saved as UTF-8.
To do that, open the file in Notepad++ and use the menu Encoding -> Convert to UTF-8 without BOM, or open the file in Notepad, choose Save As and look at the Encoding dropdown below the file name (this will save the BOM, which is not good).
Some IDEs can save in ANSI, UTF-8 and more, or have a conversion option.
In Flash, use encodeURI() on your URLLoader data if you are passing it by GET.
Hope this solves your problem (if it is, in fact, an encoding issue).
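To see what the encodeURI() step changes on the PHP side, here is a small sketch. The query string is an assumed example of what Flash would send for the value "João"; parse_str() is used because it decodes a query string the same way PHP does when it fills $_GET.

```php
<?php
// Assumed example of what encodeURI("João") yields: UTF-8 bytes, percent-encoded.
$query = 'name=Jo%C3%A3o';

// Percent-decoding gives back the raw UTF-8 bytes.
parse_str($query, $params);

$name      = $params['name'];     // "João" as UTF-8 bytes
$roundTrip = rawurlencode($name); // encoding again reproduces the original escapes
```

Because the bytes arrive percent-encoded, nothing between Flash and PHP gets a chance to reinterpret them in another charset.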

Converting unicode for MySQL and JSON

I have some HTML that was inserted into a MySQL database from a CSV file, which in turn was exported from an Access MDB file. The MDB file was exported as Unicode, and indeed is Unicode. I am, however, unsure what encoding the MySQL database has.
When I try to echo out the HTML stored in a field, however, there is no Unicode. This is a direct retrieval of one of the HTML fields in the database.
http://www.yousendit.com/download/TTZueEVYQzMrV3hMWEE9PQ
It says utf-8 in the source. The actual page code generated from echoing out article_desc is here:
http://www.nomorepasting.com/getpaste.php?pasteid=22566
I need to use this html with JSON, and I am wondering what I should do. I can not use any other frameworks or libraries. Should I convert the data before inserting it into the MySQL DB, or something else?
The MDB file was exported as Unicode, and indeed is Unicode.
That makes no sense. A file cannot be Unicode. It can be encoded with a Unicode-compatible encoding, such as UTF-8, UTF-16, UTF-8 with BOM, and so on.
Charset issues are a very common problem, and they have their root in ignorance. I don't say this to offend you, but you really need to know the difference between code points (strings) and encodings (byte streams). If you don't know which one you're dealing with at all times throughout your entire application, you will run into problems eventually. The curse of these issues is that they only happen in edge cases, so it's easy to overlook them for a long time, and when you finally realise something is wrong, it may be triggered in a completely unrelated part of your application. This makes it almost impossible to debug.
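The difference between code points and encodings, and why it matters for the JSON step in the question, can be shown with a short sketch. The example word is arbitrary; the point is that the same three code points have different byte representations, and json_encode() only accepts the UTF-8 one.

```php
<?php
$utf8   = "Pi\u{F9}"; // "Più" encoded as UTF-8: bytes 50 69 C3 B9
$latin1 = "Pi\xF9";   // the same three code points as ISO-8859-1: bytes 50 69 F9

$byteCount      = strlen($utf8);                  // strlen() counts bytes: 4
$codePointCount = preg_match_all('/./su', $utf8); // PCRE in UTF-8 mode counts code points: 3

$jsonOk  = json_encode($utf8);   // works, json_encode() requires UTF-8 input
$jsonBad = json_encode($latin1); // returns false, the input is not valid UTF-8
$reason  = json_last_error();    // JSON_ERROR_UTF8
```

So for the JSON use case, the data has to be known to be UTF-8 (or converted to it) before it reaches json_encode(), whatever encoding it had in the CSV or the database.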
