Weird character encoding in Firefox - php

I have a problem with character encoding in Firefox. When I copy/paste a paragraph from Microsoft Word (2007), it could contains special character like this (dots/squares to make a list or quote) :
 Te’st
 Ze’f
• Gzg’a
The quote ’ is different compared to this quote ' (typed directly using keyboard). So I paste this in a textarea and save (using AJAX in some case). In the database (which has a collation latin1_swedish_ci) it shows perfectly fine. But when getting these data to edit again using Firefox, it shows weird binary symbols. Works fine in Chrome and IE.
I don't want to modify the charset of the database. Is there any way to solve this problem?
Note: you can also test by viewing this post in Chrome and FF

The characters you copypasted (assuming they got transmitted correctly into this forum) contain, in addition to letters, three occurrences of U+2019 RIGHT SINGLE QUOTATION MARK, which is the correct punctuation apostrophe in English and many other languages, one occurrence of U+2022 BULLET, which sounds ok, and two occurrences of U+F0A7, which is in the Private Use (PU) range and should not be used public information exchange, only for special purposes by mutual agreements between interested parties.
It is possible that some notations in Word 2007 documents get converted to PU characters in copy and paste, but at least normal list bullet normally becomes U+2022 BULLET. So it is a bit of a mystery where the PU characters come from.
Regarding single quotes, they are representable in windows-1252 too, and latin1_swedish_ci seems to cover it (though it is, as far as I understand, just the definition of collating order, rather than a character encoding). And as you are saying that the data looks fine in the database, it seems that problem is in the way in which the data is written in an HTML document served to the browser.
In particular, if the encoding of the page in which the data is then presented is UTF-8 and the actual data is there in windows-1252 encoding, problems arise. It would mean a problem like the one you describe, as U+2019 is encoded as 0x92 in windows-1252, and this causes a character-level data error when interpreted as UTF-8.
You can check the situation by using View→Encoding in Firefox when viewing the result page. If my hypothesis is correct, you will see UTF-8 selected there, and changing it to “West European (windows-1252)” makes the single quote appear (and may mess up other things on the page thoroughly).

Related

Obtaining correct UTF-8 characters from a Percent Encoded URL parameter

I'm having some trouble with the dreaded UTF-8 Character Encoding! It's driving me insane, no matter which way I approach it or how many online guides I follow, I can never get it to return the desired results. Here's what's going on:
My whole website uses a simple text-file database that is UTF-8 encoded, and it correctly shows all manner of special characters, latin, arabic, japanese, you name it, they all show correctly, with one exception:
When the user uses the "Search" input box I have on my website, I use $search = $_REQUEST['search']; to get the input data on the results page and show results accordingly. When a user inserts special characters in the search box, they get "Percent Encoded" in the URL (for example, "ï" becomes "%E3%AF"). When showing $string in the actual website, any special character appears as � (black diamond with question mark).
I have tried everthing it says here http://malevolent.com/weblog/archive/2007/03/12/unicode-utf8-php-mysql/ with the exception of the header(). I have set the charset as UTF-8 in my head section with an http-equiv meta but for some reason whenever I set it as a header() my PHP stylesheet stops working (and the character problem remains). Maybe this is a clue?
I have tried urldecode and rawurldecode too, but they don't change anything.
Keep in mind special characters appear correctly elsewhere on the site, it's only with the $search string where this problem appears. As a side-note, even though the characters are not visualizing correctly, my search engine does actually interpret the special characters correctly when filtering the results. This makes me understand that the special character is actually there and correctly encoded, but it's just a matter of making it visualize correctly with the correct charset. However... everything appears to be UTF-8.
To be honest I'm so confused about this that this question might also appear to be confusing and the information I'm giving you might not be very well structured either, so I apologize and will try to provide more detailed information for any questions.
Thank you!
Make sure not to have any function which alters your $_REQUEST. Some functions are not aware of special encodings.
The best way to investigate is checking the state of the variables before and after they are altered.
I would like to add one thing more point regarding utf-8 string manipulation.
When manipulating utf-8 strings always use multibyte string functions.
use mb_strtolower in place of strtolower()
http://php.net/manual/en/ref.mbstring.php.

PHP - Can't remove strange character

I'd really appreciate some help with this. I've wasted days on this problem and none of the suggestions I have found online seem to give me a fix.
I have a CSV file from a supplier. It appears to have been exported from an Microsoft system.
I'm using PHP to import the data into MySQL (both latest versions).
I have one particular record which contains a strange character that I can't get rid of. Manual editing to remove the character is possible, but I would prefer an automated solution as this will happen multiple times a day.
The character appears to be an interpretation of a “smart quote”. A hex editor tells me that the character codes are C2 and 92. In the hex editor it looks like a weird A followed by a smart quote. In other editors and Calc, Writer etc it just appears as a box. メ
I'm using mb_detect_encoding to determine the encoding. All records in the CSV file are returned as ASCII, except the one with the strange character, which is returned as UTF-8.
I can insert the offending record into MySQL and it just appears in Workbench as a square.
MySQL tables are configured to utf-8 – utf8_unicode_ci and other unusual UTF characters (eg fractions) are ok.
I've tried lots of solutions to this...
How to detect malformed utf-8 string in PHP?
Remove non-utf8 characters from string
Removing invalid/incomplete multibyte characters
How to detect malformed utf-8 string in PHP?
How to replace Microsoft-encoded quotes in PHP
etc etc but none of them have worked for me.
All I really want to do is remove or replace the offending character, ideally with a search and replace for the hex values but none of the examples I have tried have worked.
Can anyone help me move forward with this one please?
EDIT:
Can't post answer as not enough reputation:
Thanks for your input. Much appreciated.
I'm just going to go with the hex search and replace:
$DodgyText = preg_replace("/\xEF\xBE\x92/", "" ,$DodgyText);
I know it's not the elegant solution, but I need a quick fix and this works for me.
Another solution is:
$contents = iconv('UTF-8', 'Windows-1251//IGNORE',$contents);
$contents = iconv('Windows-1251', 'UTF-8//IGNORE',$contents);
Where you can replace Windows-1251 to your local encoding.
At a quick glance, this looks like a UTF-8 file. (UTF-8 is identical with the first 128 characters in the ASCII table, hence everything is detected as ASCII except for the special character.)
It should work if your database connection is also UTF-8 encoded (which it may not be by default).
How to do that depends on your database library, let us know which one you're using if you need help setting the connection encoding.
updated code based on established findings
You can do search & replace on strings using hexadecimal notation:
str_replace("\xEF\xBE\x92", '', $value);
This would return the value with the special code removed
That said, if your database table is UTF-8, you shouldn't need that conversion; instead you could look at the connection (or session) character set (i.e. SET NAMES utf8;). Configuring this depends on what library you use to connect to your database.
To debug the value you could use bin2hex(); this usually helps in doing searches online.

Accents, urls and Firefox

I'm having some problems and I was wondering if any of you could help me.
I have my site & DB set to utf8. I have a problem when I type in accents in the query strings section ã turns to %E3, but if i use links or forms within the page it gives %C3%A3 in the url.
What can I do?
EDIT: Let me try to clarify this a bit:
I'm trying to use accented characters in my URLs (query strings) but I'm having somewhat of a hard time getting this to work across multiple browsers. Some browsers like Firefox and IE output a different percent encoded string depending on whether I'm using a form within the page or typing the accented character in the address bar. Like I said in my original question, ã inputed in a form turns to %C3%A3 in the url but if I type ã in the address bar, the browser changes that to %E3 in the url.
This complicates things for me because if I get %E3, then in php/html I get an unknown character (that is the diamond question mark, correct?)
Hopefully this helps - let me know otherwise.
ã inputed in a form turns to %C3%A3 in the url
Depends on the form encoding, which is usually taken from the encoding of the page that contains the form. %C3%A9 is the correct UTF-8 URL-encoded form of ã.
if I type ã in the address bar, the browser changes that to %E3 in the url.
This is browser-dependent. When you put non-ASCII characters in the a URL in location bar:
http://www.example.com/test.p/café?café
WebKit browsers encode them all as UTF-8:
http://www.example.com/test.p/caf%C3%A9?caf%C3%A9
which is IMO most correct, as per IRI. However, IE and Opera, for historical reasons, use the OS's default system encoding to encode text entered into the query string only. So on a Western European Windows installation (using code page 1252), you get:
http://www.example.com/test.p/caf%C3%A9?caf%E9
For characters that aren't available in the system encoding, IE and Opera replaces them with a ?. Firefox will use the system encoding when all the characters in the query string, or UTF-8 otherwise.
Horrible and inconsistent, but then it's pretty rare for users to manually type out query strings.

Special characters in Flex

I am working on a Flex app that has a MySQL database. Data is retrieved from the DB using PHP then I am using AMFPHP to pass the data on to Flex
The problem that I am having is that the data is being copied from Word documents which sometimes result in some of the more unusual characters are not displaying properly. For example, Word uses different characters for starting and ending double quotes instead of just " (the standard double quotes). Another example is the long dash instead of -.
All of these characters result in one or more accented capital A characters appearing instead. Not only that, each time the document is saved, the characters are replaced again resulting in an ever-increasing number of these accented A's appearing.
Doing a search and replace for each troublesome character to swap it for one of the none characters seems to work but obviously this requires compiling a list of all the characters that may appear and means there is scope for this continuing as new characters are used for the first time. It also seems like a bit of a brute force way of getting round the problem rather than a proper solution.
Does anyone know what causes this and have any good workarounds / fixes? I have had similar problems when using utf-8 characters in html documents that aren't set to use utf-8. Is this the same thing and if so, how do I get flex to use utf-8?
Many thanks
Adam
It is the same thing, and smart quotes aren't special as such: you will in fact be failing for every non-ASCII character. As such a trivial ad-hoc replace for the smart quote characters will be pointless.
At some point, someone is mis-decoding a sequence of bytes as ISO-8859-1 or Windows code page 1252 when it should have been UTF-8. Difficult to say where without detail/code.
What is “the document”? What format is it? Does that format support UTF-8 content? If it does not, you will need to encode output you put into it at the document-creation phase to the encoding the consumer of that document expects, eg. using iconv.

Questions about iPhone emoji and web pages


Okay, so emoji basically shows the above on a computer. Is that another programming language? So how do I put those little boxes into a php file? When I put it into a php file, it turns into question marks and what not. Also, how can I store these in a MySQL without it turning into question marks and other weird things?
how do I put those little boxes into a php file?
Same way as any other Unicode character. Just paste them and make sure you're saving the PHP file and serving the PHP page as UTF-8.
When I put it into a php file, it turns into question marks and what not
Then you have an encoding problem. Work it out with Unicode characters you can actually see properly first, for example ąαд™日本, before worrying about the emoji.
Your PHP file should be saved as UTF-8; the page it produces should be served as Content-Type: text/html;charset:UTF-8 (or with similar meta tag); the MySQL database should be using a UTF-8 collation to store data and PHP should be talking to MySQL using UTF-8.
However. Even handling everything correctly like this, PCs will still not show the emoji. That's because:
they don't have fonts that include shapes for those characters, and
emoji are still completely unstandardised. Those characters you posted are in the Unicode Private Use Area, which means they don't have any official meaning at all.
Each network in Japan uses different character codes for their emoji, mapped to different areas in the PUA. So even on another mobile phone, it probably won't display the correct character, unless you spend ages manually converting emoji codes for different networks. I'm guessing the ones you posted above are from SoftBank (iPhone?).
There is an ongoing proposal led by Google and Apple to collate the different networks' emoji and give them a proper standardised place in Unicode. Until then, getting emoji to display consistently across networks is an exercise in unhappiness. See the character overview from the standardisation work to see how much converting you would have to do.
God, I hate emoji. All that pain for such a load of useless twee rubbish.
This has nothing to do with programming languages, just with encoding and fonts. As a very brief overview: Every character is stored by its character code (e.g.: 0x41 = A, 0x42 = B, etc), which is rendered as a meaningful character on your screen using a font (which says "the character with the code 0x41 should look like this ...").
These emoji occupy the "private use area" of the Unicode table, which is a range of codes that are undefined and free for anyone to use. That makes them perfectly valid character codes, it's just that no standard font has an appropriate character to display for them, since they are undefined. Only the iPhone and other handhelds, mostly in Japan, have appropriate icons for these codes. This is done to save bandwidth; instead of transmitting relatively large image files back and forth, emoji can be transmitted using a single character code.
As for how to store them: They should be storable as is, as long as you don't try to convert them to another encoding, in which case they may get lost. Just be aware that they only make sense on the iPhone and other SoftBank phones in Japan.
Character Viewer http://img.skitch.com/20091110-e7nkuqbjrisabrdipk96p4yt59.png
If you're on OSX you can copy and paste the character into the Character Viewer to find out what it is. I think there's a similar Character Map on Windows (albeit inferior ;-P). You could put it through PHP's ord(), but that only works on ASCII characters. See the discussion on the ord page for UTF8 functions.
BTW, just for the fun of it, these characters display fine on the iPhone as is, because the iPhone has a font which has icons for them:
iPhone http://img.skitch.com/20091110-bjt3tutjxad1kw4p9uhem5jhnk.png
I'm using FF3.5 and WinXP. I see little boxes in my browser, too.
This tells me the string requires a character set not installed on my computer.
When you put the string into a PHP file, the question marks tell you the same thing: your computer doesn't know how to display the characters.
You could store these emoji characters in MySQL if you encoded them differently, probably using UTF-8.
Do a web search for character encoding, as it relates to MySQL.

Categories