Can anyone tell me what this ascii character is? - php

I have this character showing up occasionally and I can't find it in the ASCII table. I'd like to run a filter on the data before it's sent to the database, but I have to know what it is first. Maybe someone can clue me in. I am using a WYSIWYG editor and this is where it's coming from. The character appears very sporadically, but seems to appear more often than not when I do two \r or a backspace.
Here is the character
Â
OK, it was suggested that I change the content-type to utf8 in the head of the document but I am still getting these characters in the database. Here is a test after I added the content-type
adf af  aafd a a
aa a  afa aÂ
adf

It is highly likely that this character is related to UTF-8 encoding issues. Joel's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) is definitely recommended reading in this instance.
Filtering these characters out before sending to the database is almost certainly the wrong thing to do here.
In the case that you mention, you are probably dealing with the character U+00A0, which is the Unicode character for non-break space. The bit pattern for this character is:
1010 0000
UTF-8 encodes code points in this range as two bytes of the form
110x xxxx 10xx xxxx
where each 'x' is a bit of the Unicode code point value, so U+00A0 is encoded as:
1100 0010 1010 0000
which is 0xC2 0xA0. Coincidentally, the second byte has the same value as the code point you were encoding (U+00A0), while the first byte, 0xC2, is the Â you are seeing when the data is read as ISO-8859-1.
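To make the bit shuffling above concrete, here is a minimal PHP sketch. The helper name utf8_two_byte is made up for illustration, and it only handles the two-byte range (U+0080 to U+07FF).

```php
<?php
// Encode a code point in the range U+0080..U+07FF using the two-byte
// UTF-8 pattern 110x xxxx 10xx xxxx. (Hypothetical helper for illustration.)
function utf8_two_byte(int $codePoint): string
{
    $lead = 0xC0 | ($codePoint >> 6);   // 110x xxxx: the upper bits
    $cont = 0x80 | ($codePoint & 0x3F); // 10xx xxxx: the low 6 bits
    return chr($lead) . chr($cont);
}

$nbsp = utf8_two_byte(0x00A0);
echo bin2hex($nbsp), "\n"; // c2a0

// A reader that assumes ISO-8859-1 treats each byte as its own character:
// 0xC2 is U+00C2 (A with circumflex) and 0xA0 is U+00A0 (no-break space).
foreach (str_split($nbsp) as $byte) {
    printf("byte 0x%02X -> U+%04X\n", ord($byte), ord($byte));
}
```

The loop relies on the fact that ISO-8859-1 maps every byte value directly to the code point with the same number, which is why one UTF-8 character shows up as two.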

It is a "Latin Capital Letter A with Circumflex", HTML code &Acirc;, Unicode U+00C2
Wikipage: http://en.wikipedia.org/wiki/%C3%82

When I have this issue, the fix that works for me is, based on @Greg's answer, given that:
0xC2=194, 0xA0=160,
In php:
$output=str_replace(chr(194).chr(160), " ", $html);
That replaces each Â sequence with the space it was supposed to be.

I am the OP. I am not logged in anymore but I came back to share the solution. The issue was in fact an encoding problem. I added:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
After I did this, I noticed that I was still getting these funky characters in my database. I then changed the encoding on the database table and that did nothing either. That only left the browser... I checked the encoding in the browser and noticed that it was using ISO-8859-1. I changed the encoding on the browser to utf-8 and it is working fine now. :)
Thanks to everyone that contributed.
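For anyone who wants a server-side safety net against a browser submitting in the wrong encoding, here is a rough sketch of a check that rejects input that is not well-formed UTF-8. The function name is_valid_utf8 is an assumption for illustration. Note that it only catches non-UTF-8 bytes arriving at the server; it cannot tell that valid UTF-8 is later being displayed with the wrong charset.

```php
<?php
// Returns true when $text is well-formed UTF-8.
// preg_match() with the /u modifier returns false on malformed UTF-8.
function is_valid_utf8(string $text): bool
{
    return preg_match('//u', $text) === 1;
}

$fromUtf8Browser   = "caf\xC3\xA9"; // "café" submitted as UTF-8
$fromLatin1Browser = "caf\xE9";     // "café" submitted as ISO-8859-1

var_dump(is_valid_utf8($fromUtf8Browser));   // bool(true)
var_dump(is_valid_utf8($fromLatin1Browser)); // bool(false)
```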

I think you are seeing a bug that I once experienced. ISO-8859-1 is actually a subset of Windows-1252 for Western European languages. The problem is that browsers gladly submit Windows-1252 characters when the web server accepts ISO-8859-1. That means that the browser sends data that is invalid ISO-8859-1. That is what happened with my Windows installation at least. I have seen this behaviour in both IE and Firefox.
I had the problem with a wysiwyg editor where the users would paste data in from a Word document. This document would contain both hyphens and dashes. One of the characters would get submitted fine. The other would be garbage because that character doesn't exist in ISO-8859-1 (I can never remember which is which).
The .NET framework that we were using didn't help either, as it did not complain about an invalid ISO character when converting to Unicode.
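A small PHP sketch of the hyphen versus dash case, assuming the dash in question was an en dash (byte 0x96 in Windows-1252) and that the iconv extension is available:

```php
<?php
$hyphen = "\x2D"; // ASCII hyphen-minus, identical in both encodings
$enDash = "\x96"; // en dash in Windows-1252; a C1 control code in ISO-8859-1

// Decoding as Windows-1252 gives U+2013 (UTF-8: e2 80 93).
echo bin2hex(iconv('Windows-1252', 'UTF-8', $enDash)), "\n"; // e28093
// Decoding as ISO-8859-1 gives the invisible control U+0096 (UTF-8: c2 96).
echo bin2hex(iconv('ISO-8859-1', 'UTF-8', $enDash)), "\n";   // c296
// The hyphen survives either way.
echo bin2hex(iconv('ISO-8859-1', 'UTF-8', $hyphen)), "\n";   // 2d
```

So a converter that is told the input is ISO-8859-1 has no reason to complain: 0x96 is a legal byte there, it just is not a dash.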

Related

Display chinese characters WITHOUT using utf8 encoding?

I'm fetching rows from a MySQL database with a unicode_general_ci collation. Columns contain Chinese characters such as 格拉巴酒和蒸馏物 and I need to display those characters.
I know that I should work in utf-8 encoding:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
but I can't: I'm working on a legacy application where most of the .php files are saved as ANSI and the whole site is using:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
Is there any way to display them?
Bonus question: I've tried to manually change the encoding in Chrome (Tool -> Encoding -> UTF-8) and it doesn't seem to work: the page is reloaded but ???? are displayed instead of Chinese characters.
You can display 格 using the numeric entity reference &#x683c;, etc. The encoding of the page should not matter in this case; HTML entity references always refer to Unicode code points.
PHP has a function htmlentities for this purpose, but it appears that you will need workarounds for handling numeric entities. This json_encode hack is fairly obscure, but is probably programmatically the simplest.
// Strip the quotes json_encode() adds, then turn each \uXXXX escape into a numeric entity.
echo preg_replace('/\\\\u([0-9a-f]{4})/', '&#x$1;',
    preg_replace('/^"(.*)"$/', '$1', json_encode($s)));
This leverages the fact that json_encode will coincidentally do the conversion for you; the rest is all mechanics. (I guess that's PHP for you.)
IDEone demo
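If the json_encode trick feels too obscure, here is an alternative sketch that computes the code point directly. It assumes the input is valid UTF-8 and that the iconv extension is available; html_numeric_entities is a made-up name, not a built-in.

```php
<?php
// Replace every non-ASCII character in a valid UTF-8 string with a
// hexadecimal numeric character reference. (Hypothetical helper name.)
function html_numeric_entities(string $utf8): string
{
    return preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function (array $match): string {
            // UCS-4BE is the code point as a 4-byte big-endian integer.
            $codePoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $match[0]))[1];
            return '&#x' . dechex($codePoint) . ';';
        },
        $utf8
    );
}

// The first two characters of the question's example, as UTF-8 bytes.
echo html_numeric_entities("\xE6\xA0\xBC\xE6\x8B\x89"), "\n"; // &#x683c;&#x62c9;
```

The output is pure ASCII, so it renders the same whether the page is declared as iso-8859-1 or utf-8.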
Your "bonus question" isn't really a question, but of course, that's how it works; raw bytes in the range 128-255 are only rarely valid UTF-8 sequences, so unless what you have on the page is valid UTF-8, you are likely to get the "invalid character" replacement glyph for those bytes.
For the record, the first two Chinese Han glyphs in your text in UTF-8 would display as æ ¼æ‹‰ if mistakenly displayed in Windows code page 1252 (what you, and oftentimes Microsoft, carelessly refer to as "ANSI") -- if you have those bytes on the page then forcing the browser to display it in UTF-8 should actually work as a workaround as well.
For additional background I recommend @deceze's What Every Programmer Absolutely, Positively Needs to Know About Encodings and Character Sets to Work With Text.
I'm not sure that you can. iso-8859-1 is commonly called "Latin 1". There's no support for any Asian kanji-type languages at all.
http://en.wikipedia.org/wiki/ISO/IEC_8859-1
ISO 8859-1 encodes what it refers to as "Latin alphabet no. 1," consisting of 191 characters from the Latin script. This character-encoding scheme is used throughout the Americas, Western Europe, Oceania, and much of Africa. It is also commonly used in most standard romanizations of East-Asian languages.

Form saves special latin characters as symbols

My PHP form is submitting special latin characters as symbols.
So, Québec turns into QuÃ©bec
My form is set to UTF-8 and my database table has latin1_swedish_ci collation.
PHP: $db = new PDO('mysql:host=localhost;dbname=x;charset=utf8', 'x', 'x');
A bindParam: $sql->bindParam(":x", $_POST['x'],PDO::PARAM_STR);
I am new to PDO so I am not sure what the problem is. Thank you
*I am using phpMyAdmin
To expand a little bit more on the encoding problem...
Any time you see one character in a source turn into two (or more) characters, you should immediately suspect an encoding issue, especially if UTF-8 is involved. Here's why. (I apologize if you already know some of this, but I hope to help some future SO'ers as well.)
All characters are stored in your computer not as characters, but as bytes. Back in the olden days, space and transmission time were much more limited than now, so people tried to save every byte possible, even down to not using a full byte to store a character. Now, because we realize that we need to communicate with the whole world, we've decided it's more important to be able to represent every character in every language. That transition hasn't always been smooth, and that's what you're running up against.
Latin-1 (in various flavors) is an encoding that always uses a single 8-bit byte for a character. Which means it can only have 256 possible characters. Plenty if you only want to write English or Swedish, but not enough to add Russian and Chinese. (background on Latin-1)
UTF-8 encodes the first half of Latin-1 in exactly the same way, which is why you see most of the characters looking the same. But it doesn't always use a single byte for a character -- it can use up to four bytes on one character. (utf-8) As you discovered, it uses 2 bytes for é. But Latin-1 doesn't know that, and is doing its best to display those two bytes.
The trick is to always specify your encoding for byte streams (like info from a file, a URL, or a database), and to make sure that encoding is correct. (Sometimes that's a pain to find out, for sure.) Most modern languages, like Java and PHP, do a good job of handling all the translation issues between different encodings, as long as you've correctly specified what you're dealing with.
You've pretty much answered your own question: you're receiving UTF-8 from the form but trying to store it in a Latin-1 column. You can either change the encoding on the column in MySQL or use the iconv function to translate between the two encodings.
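A minimal sketch of the iconv option, assuming the form really does deliver UTF-8 and the target column is latin1. The variable names are placeholders, not taken from the question's code.

```php
<?php
$submitted = "Qu\xC3\xA9bec"; // "Québec" as UTF-8 bytes, as a UTF-8 form sends it

// Translate to ISO-8859-1 before writing to a latin1 column.
// iconv() returns false if a character has no Latin-1 equivalent.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $submitted);

echo strlen($submitted), "\n"; // 7 (é takes two bytes in UTF-8)
echo strlen($latin1), "\n";    // 6 (é takes one byte in Latin-1)
echo bin2hex($latin1), "\n";   // 5175e9626563
```

Changing the column to UTF-8 is usually the less fragile route, since this conversion fails for anything outside Latin-1.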
Change your database table and column to utf8_unicode_ci.
Make sure you are saving the file with UTF-8 encoding (this is often overlooked)
Set headers:
<?php header("Content-type: text/html; charset=utf-8"); ?>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

any way to detect and remove (or fix) bad characters resulting from bad encoding conversions

I am writing a parser. I have taken care of all the encoding conversion to output UTF-8 correctly, but sometimes the source material itself is incorrect, containing strings such as ☐ or â€tm that are the results of bad encoding conversion.
I know this is a long shot - but does anyone know of a list of common strings resulting from bad character conversions, or anything so I don't have to build my own list.
Yes I know I am being lazy, but I read somewhere that makes me a good programmer?
tl;dr: See last two paragraphs.
I hate/love encoding problems.
We're looking at a mutated copy of Unicode Character 'RIGHT SINGLE QUOTATION MARK' (U+2019). The byte sequence for that character is 0xE2 0x80 0x99. In Windows-1252, that corresponds to a+hat, Euro, and the trademark symbol (™). The 'tm' we see is a further transliteration of that trademark symbol into ASCII t and ASCII m, 0x74 0x6D, making our final corrupted sequence of bytes 0xE2 0x80 0x74 0x6D.
Chances are that the actual representation of a+hat-euro-t-m is already in UTF-8. That is, that a+hat is a UTF-8 sequence and the Euro symbol is also a UTF-8 sequence, because someone Copied from a Windows-1252 document that was already improperly encoded, and Pasted into a UTF-8 document. You'll find it's plenty more bytes than just the four from the original corruption.
One way to solve this would be first turning the UTF-8 encoding of those characters back into Windows-1252, then treat that Windows-1252 string as UTF-8 when writing it back out.
You can use iconv with the //TRANSLIT flag for this purpose:
$less_bad = iconv('UTF-8', 'Windows-1252//TRANSLIT', $bad);
This tells iconv to try turning any characters that can't be represented in Windows-1252 into something similar. This translation is imperfect and will destroy any legitimate UTF-8 characters that aren't representable in Windows-1252.
Once you have the Windows-1252 string, save it back out and serve it up as UTF-8. If all went well, the corruption should be gone, and you shouldn't have any problems.
Yeah, right.
In this specific case, the final byte of the proper sequence, 0x99, has been munged into two bytes by a bad Copy/Paste. You aren't going to get it back through character set encoding hoop jumping.
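Here is a sketch of both outcomes, using plain iconv without //TRANSLIT so the result does not depend on the iconv implementation's transliteration rules. The is_valid_utf8 helper is a made-up name for illustration.

```php
<?php
function is_valid_utf8(string $text): bool
{
    // preg_match() with /u returns false on malformed UTF-8.
    return preg_match('//u', $text) === 1;
}

// U+2019 double-encoded: its bytes E2 80 99 were read as Windows-1252
// ("â€™") and then saved again as UTF-8.
$doubleEncoded = "\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2";
$repaired = iconv('UTF-8', 'Windows-1252', $doubleEncoded);
echo bin2hex($repaired), "\n";      // e28099
var_dump(is_valid_utf8($repaired)); // bool(true)

// The same sequence after the trademark sign was flattened to "tm".
$munged = "\xC3\xA2\xE2\x82\xACtm";
$attempt = iconv('UTF-8', 'Windows-1252', $munged);
echo bin2hex($attempt), "\n";       // e280746d
var_dump(is_valid_utf8($attempt));  // bool(false)
```

The intact sequence round-trips back to a proper U+2019, while the munged one ends up as bytes that are not valid UTF-8 at all.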
While the hoop jumping could work for some documents, you will surely find many things that are even more poorly re-encoded. Your best bet is going to be conducting a byte-level search and replace operation, looking for incorrectly encoded sequences and replacing them with a plain-ASCII or properly UTF-8 encoded alternative. There are lots of ways that the encoding would be wrong. For example, if the corruption source was in the ISO-8859 family, the final corrupted sequence would have been different, or perhaps the final ™ might not be munched into t and m in certain places.
A byte-level search and replace is guaranteed only to impact incorrectly re-encoded sequences, and will not leave the risk of munching on single-encoded UTF-8 characters that can't be represented in inferior character sets. It's safer and faster.
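A rough sketch of what such a byte-level search and replace could look like in PHP. The two entries in the table are only the cases discussed above; the table and the function name repair_mojibake are illustrative assumptions, not a published list.

```php
<?php
// Hypothetical repair table: each key is a corrupted byte sequence, each
// value the properly encoded UTF-8 replacement. Extend it as cases turn up.
$repairs = [
    "\xC3\xA2\xE2\x82\xAC\xE2\x84\xA2" => "\xE2\x80\x99", // "â€™"  -> U+2019
    "\xC3\xA2\xE2\x82\xACtm"           => "\xE2\x80\x99", // "â€tm" -> U+2019
];

function repair_mojibake(string $text, array $repairs): string
{
    // strtr() works on raw bytes and tries the longest keys first.
    return strtr($text, $repairs);
}

$fixed = repair_mojibake("It\xC3\xA2\xE2\x82\xACtms fine", $repairs);
echo $fixed, "\n"; // "It’s fine" when viewed as UTF-8
```

Because only exact corrupted sequences are matched, correctly encoded text elsewhere in the string passes through unchanged.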
edit: I totally didn't actually catch that you were already planning on doing this. ;) Unfortunately I've never seen such a handy list. Perhaps you should publish and publicize your work so that others may benefit. yourcharacterencodingsucks.com is available!

Strange characters appearing after copy/pasting in forms/emails

I have users who sometimes paste things into my site's forms after copying something from their Gmail. The characters look normal when they paste it, but in the database they have extra special characters that appear.
Here is an example of the text with the special characters.
It originally happened on this page:
http://www.hikingsanfrancisco.com/hiker_community/scheduled_hike_event.php?hike_event_id=91
But it looks like the person who made it has cleaned up the strange characters.
Does anyone know how to stop this from happening in the future?
Thanks,
Alex
I use PHP and MySQL
I'd guess that you're getting UTF-8 encoded text but your database is configured for ISO-8859-1 (AKA Latin-1). The page you reference says:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
so it is claiming to be encoded as UTF-8. A form on a UTF-8 page will be sent back to the server in UTF-8. Then you send that UTF-8 data into your database where it is stored as Latin-1 encoded text. If you're not handling the UTF-8 to Latin-1 change yourself then you'll get "funny" characters when you send the data back to a browser. As long as the text only uses standard ASCII characters then everything will be okay as UTF-8 and Latin-1 overlap on the ASCII characters.
The solution is to pick a character encoding and use it everywhere. I'd recommend UTF-8 everywhere. However, if your database is already in Latin-1 then you'll have to go with Latin-1 or change the encoding in the database and re-encode all the data. But, if all the text in your database is simple ASCII then no re-encoding will be needed.
Hard to say what's going on without examples, but a character encoding mismatch is the usual problem when funny (funny peculiar, not funny ha-ha) characters appear only when text is sent back to the browser.
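If you do go the re-encoding route, a sketch like the following shows the ASCII shortcut mentioned above. It assumes the stored bytes really are Latin-1, and both function names are made up for illustration.

```php
<?php
// True when $text contains only 7-bit ASCII bytes, which are encoded
// identically in Latin-1 and UTF-8. (Hypothetical helper name.)
function is_ascii(string $text): bool
{
    return preg_match('/^[\x00-\x7F]*$/D', $text) === 1;
}

// Re-encode a Latin-1 value as UTF-8 only when it needs it.
function latin1_to_utf8(string $text): string
{
    return is_ascii($text) ? $text : iconv('ISO-8859-1', 'UTF-8', $text);
}

echo bin2hex(latin1_to_utf8("hike")), "\n";    // 68696b65
echo bin2hex(latin1_to_utf8("caf\xE9")), "\n"; // 636166c3a9
```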

Why are extended ASCII characters (â, é, etc.) getting replaced with <?> characters?

I attached a pic... but I am using PHP to pull the data from MySQL, and some of these locations have extended characters... I am using the Font Arial.
You can see the screen shot here: http://img269.imageshack.us/i/funnychar.png/
Still happening after the suggestions, here is what I did:
My firefox (view->encoding) is set to UTF-8 after adding the line, however, the text inside the option tags is still showing the funny character instead of the actual accented one. What should I look for now?
UPDATE:
I have the following in the PHP program that is giving me those <?> characters...
ini_set( 'default_charset', 'UTF-8' );
And right after my zend db object creation, I am setting the following query:
$db->query("SET NAMES utf8;");
I changed all my tables over to UTF-8 and reinserted all the data (waste of time) as it never helped. It was latin1 prior.
Also STATUS is reporting:
Connection: Localhost via UNIX socket
Server characterset: latin1
Db characterset: latin1
Client characterset: utf8
Conn. characterset: utf8
UNIX socket: /var/run/mysqld/mysqld.sock
Uptime: 4 days 20 hours 59 min 41 sec
Looking at the source of the page, I see
<option value="Br�l� Lake"> Br�l� Lake
OK- NEW UPDATE-
I Changed everything in my PHP and HTML to:
and
header('Content-Type: text/html; charset=latin1');
Now it works, what gives?? How do I convert it all to UTF-8?
That's what the browser does when it doesn't know the encoding to use for a character. Make sure you specify the encoding type of the text you send to the client either in headers or markup meta.
In HTML:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
In PHP (before any other content is sent to the client):
header('Content-Type: text/html; charset=utf-8');
I'm assuming you'll want UTF-8 encoding. If your site uses another encoding for text, then you should replace UTF-8 with the encoding you're using.
One thing to note about using HTML to specify the encoding is that the browser will restart rendering a page once it sees the Content-Type meta tag, so you should include the <meta /> tag immediately after the <head /> tag in your page so the browser doesn't do any more extra processing than it needs.
Another common charset is "iso-8859-1" (Latin-1), which you may want to use instead of UTF-8. You can find more detailed info from this awesome article on character encodings and the web. You can also get an exhaustive list of character encodings here if you need a specific type.
If nothing else works, another (rare) possibility is that you may not have a font installed on your computer with the characters needed to display the page. I've tried repeating your results on my own server and had no luck, possibly because I have a lot of fonts installed on my machine so the browser can always substitute unavailable characters from one font with another font.
What I did notice by investigating further is that if text is sent in an encoding different than the encoding the browser reports as, Unicode characters can render unexpectedly. To work around this, I used the HTML character entity representation of special characters, so â becomes &acirc; in my HTML and é becomes &eacute;. Once I did this, no matter what encoding I reported as, my characters rendered correctly.
Obviously you don't want to modify your database to HTML encode Unicode characters. Your best option if you must do this is to use a PHP function, htmlentities(). You should use this function on any data-driven text you expect to have Unicode characters in. This may be annoying to do, but if specifying the encoding doesn't help, this is a good last resort for forcing Unicode characters to work.
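A short example of that last resort, assuming the data coming out of the database is UTF-8. The sample value is taken from the question's option list.

```php
<?php
$place = "Br\xC3\xBBl\xC3\xA9 Lake"; // "Brûlé Lake" as UTF-8 bytes

// The third argument tells htmlentities() how the input is encoded.
$escaped = htmlentities($place, ENT_QUOTES, 'UTF-8');
echo $escaped, "\n"; // Br&ucirc;l&eacute; Lake
```

If the third argument does not match the actual encoding of the input, the result will be wrong or empty, so this does not remove the need to know what encoding your data is in.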
There is no such standard called "extended ASCII", just a bunch of proprietary extensions.
Anyway, there are a variety of possible causes, but it's not your font. You can start by checking the character set in MySQL, and then see what PHP is doing. As Dan said, you need to make sure PHP is specifying the character encoding it's actually using.
As others have mentioned, this is a character-encoding question. You should read Joel Spolsky's article about character encoding.
Setting
header('Content-Type: text/html; charset=utf-8');
will fix your problem if your php page is writing UTF-8 characters to the browser. If the text is still garbled, it's possible your text is not UTF-8; in that case you need to use the correct encoding name in the Content-Type header. If you have a choice, always use UTF-8 or some other Unicode encoding.
Simplest fix
ini_set( 'default_charset', 'UTF-8' );
this way you don't have to worry about manually sending the Content-Type header yourself.
EDIT
Make sure you are actually storing data as UTF-8 - sending non-UTF-8 data to the browser as UTF-8 is just as likely to cause problems as sending UTF-8 data as some other character set.
SELECT table_collation
FROM information_schema.`TABLES` T
WHERE table_name=[Table Name];
SELECT default_character_set_name
, default_collation_name
FROM information_schema.`SCHEMATA` S
WHERE schema_name=[Schema Name];
Check those values
There are two transmission encodings, PHP<->browser and Mysql<->PHP, and they need to be consistent with each other. Setting up the encoding for Mysql<->PHP is dealt with in the answers to the questions below:
Special characters in PHP / MySQL
How to make MySQL handle UTF-8 properly
php mysql character set: storing html of international content
The quick answer is "SET NAMES UTF8".
The slow answer is to read the articles recommended in the other answers - it's a lot better to understand what's going on and make one precise change than to apply trial and error until things seem to work. This isn't just a cosmetic UI issue, bad encoding configurations can mess up your data very badly. Think about the Simpsons episode where Lisa gets chewing gum in her hair, which Marge tries to get out by putting peanut butter on.
You should encode all special chars into HTML entities instead of depending on the charset.
htmlentities() will do the work for you.
I changed all my tables over to UTF-8 and reinserted all the data (waste of time) as it never helped. It was latin1 prior.
If your original data was latin1, then inserting it into a UTF-8 database won't convert it to UTF-8. AFAIK, it will insert the same bytes but now believe they are UTF-8, thus breaking.
If you've got a SQL dump, I'd suggest running it through a tool to convert to UTF-8. Notepad++ does this pretty well - simply open the file, check that the accented characters are displaying correctly, then find "convert to UTF-8" in the menu.
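If you would rather script the conversion, here is a sketch in PHP. It rests on a loud assumption: anything that is not already valid UTF-8 is treated as ISO-8859-1, which will be wrong if the data is in some other encoding. ensure_utf8 is a made-up name.

```php
<?php
// Assumption: anything that is not already valid UTF-8 is ISO-8859-1.
function ensure_utf8(string $text): string
{
    if (preg_match('//u', $text) === 1) {
        return $text; // already valid UTF-8, leave untouched
    }
    return iconv('ISO-8859-1', 'UTF-8', $text);
}

$latin1 = "Br\xFBl\xE9 Lake"; // "Brûlé Lake" in Latin-1
echo ensure_utf8($latin1), "\n"; // "Brûlé Lake", now as UTF-8 bytes
```

Checking validity first keeps the function from double-encoding rows that were already stored correctly.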
These special characters generally appear due to the extensions. We can eliminate them by declaring charset=utf-8, adding:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
to your meta tags
