I have users who sometimes paste things into my site's forms after copying something from their Gmail. The characters look normal when they paste it, but in the database they have extra special characters that appear.
Here is an example of the text with the special characters.
It originally happened on this page:
http://www.hikingsanfrancisco.com/hiker_community/scheduled_hike_event.php?hike_event_id=91
But it looks like the person who made it has cleaned up the strange characters.
Does anyone know how to stop this from happening in the future?
Thanks,
Alex
I use PHP and MySQL
I'd guess that you're getting UTF-8 encoded text but your database is configured for ISO-8859-1 (AKA Latin-1). The page you reference says:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
so it is claiming to be encoded as UTF-8. A form on a UTF-8 page will be sent back to the server in UTF-8. Then you send that UTF-8 data into your database where it is stored as Latin-1 encoded text. If you're not handling the UTF-8 to Latin-1 change yourself then you'll get "funny" characters when you send the data back to a browser. As long as the text only uses standard ASCII characters then everything will be okay as UTF-8 and Latin-1 overlap on the ASCII characters.
The solution is to pick a character encoding and use it everywhere. I'd recommend UTF-8 everywhere. However, if your database is already in Latin-1 then you'll have to go with Latin-1 or change the encoding in the database and re-encode all the data. But, if all the text in your database is simple ASCII then no re-encoding will be needed.
Hard to say what's going on without examples, but a character encoding mismatch is the usual problem when funny (funny peculiar, not funny ha-ha) characters appear only when text is sent back to the browser.
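To make the mismatch concrete, here is a minimal PHP sketch. It assumes the mbstring extension is available; the sample string and the helper name `misreadAsLatin1` are made up for illustration. It takes UTF-8 bytes, treats them as if they were Latin-1, and shows the kind of "funny" characters that result, while plain ASCII comes through unchanged.

```php
<?php
// Hypothetical helper: reinterpret UTF-8 bytes as if they were Latin-1,
// then re-encode the result as UTF-8 so the damage is visible on a UTF-8 page.
function misreadAsLatin1(string $utf8Bytes): string
{
    return mb_convert_encoding($utf8Bytes, 'UTF-8', 'ISO-8859-1');
}

// "café" as UTF-8: the é is the two bytes C3 A9.
echo misreadAsLatin1("caf\u{E9}"), "\n";    // cafÃ©  (each byte became its own character)
echo misreadAsLatin1("plain ASCII"), "\n";  // plain ASCII  (unchanged)
```

The same thing happens to curly quotes and dashes copied out of an email client, which is why only some pasted text looks broken.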
Related
I apologize if this question is not directly related to programming. I'm having an issue, of which I have two examples;
I have a website, where I store Arabic words in a DB, and then retrieve it, and display it on a page, using php. (Here's the link to my page, that is displaying Arabic incorrectly.)
I visit any random website, where the majority of the content is supposed to be in Arabic. (An example of a random website that gives me this issue.)
In both these cases, the Arabic text is displayed as 'ÇáÔíÎ: ÇáÓáÝ ãÚäÇå ÇáãÊÞÏãæä Ýßá'... or such weird characters. Do note that, in the first case, I may be able to correct it, since I control the content. So, I can set the encoding.
But what about the second case [this is where I want to apologize, since it isn't directly related to programming (the code) from my end] - what do I do for random websites I visit, where the text (Arabic) is displayed incorrectly? Any help would really be appreciated.
For the second case:
This website is encoded with Windows-1256 (Arabic), however, it wrongly declares to be encoded with ISO 8859-1 (Latin/Western European). If you look at the source, you can see that it declares <meta ... charset=ISO-8859-1" /> in its header.
So, what happens is that the server sends to your browser an HTML file that is encoded with Windows-1256, but your browser decodes this file with ISO 8859-1 (because that's what the file claims to be).
For the ASCII characters, this is no problem as they are encoded identically in both encodings. However, not so for the Arabic characters: each code byte corresponding to an Arabic character (as encoded by Windows-1256) maps to some Latin character of the ISO 8859-1 encoding, and these garbled Latin characters are what you see in place of the Arabic text.
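The byte-level effect described above can be reproduced in PHP. This is only a sketch: the helper name `decodeAs` is made up, and it assumes the iconv extension on your system knows the WINDOWS-1256 charset. The five bytes are the first word of the garbled sample from the question.

```php
<?php
// Hypothetical helper: decode raw bytes from the given charset into UTF-8.
function decodeAs(string $bytes, string $charset): string
{
    $decoded = iconv($charset, 'UTF-8', $bytes);
    if ($decoded === false) {
        throw new RuntimeException("Cannot decode bytes as {$charset}");
    }
    return $decoded;
}

$bytes = "\xC7\xE1\xD4\xED\xCE"; // five bytes as sent by the server

// Decoded the way the page's (wrong) declaration tells the browser to:
echo decodeAs($bytes, 'ISO-8859-1'), "\n";   // ÇáÔíÎ

// Decoded with the encoding the page was actually written in:
echo decodeAs($bytes, 'WINDOWS-1256'), "\n"; // the Arabic word
```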
If you want to display all the text of this website correctly, you can manually set the character encoding that your browser uses to decode this website.
You can do this, for example, with Chrome by installing the Set Character Encoding extension, and then right-click on the website and select:
Set Character Encoding > Arabic (Windows-1256)
In Safari, you can do it simply by selecting:
View > Text Encoding > Arabic (Windows).
The same should be possible with other browsers, such as Firefox or Internet Explorer, too...
For the first case:
Your website (the HTML file that your server sends to the browser) is encoded with UTF-8. However, this HTML file doesn't contain any encoding declaration, so the browser doesn't know with which encoding this file has been encoded.
In this case, the browser is likely to use a default encoding to decode the file, which typically is ISO 8859-1/Windows-1252 (Latin/Western European). The result is the same as in the above case: all the Arabic characters are decoded to garbled Latin characters.
To solve this problem, you have to declare that your HTML file is encoded with UTF-8 by adding the following tag in the header of your file:
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
I am having an issue where some titles for radio buttons are showing blank due to non-English characters in the title.
Here is example of a string from the database that is showing blank.
WATERTOWN/171 Watertown St/Rte 16/Newtonÿ
The last character in the string is ÿ, that is what is causing the problem in this case.
How can I correct this problem?
On the very top of my page I have this code
<meta charset="utf-8">
I am not sure whether the character ÿ is valid UTF-8 or not.
I tried using the method utf8_encode() on the data before storing it into the SQL Server database but that did not work.
What is the problem here and how do I fix it?
PHP is a little bit relaxed about everything, including encoding. There is no real string "encoding" of the kind you probably know from other languages.
Actually it treats the string byte by byte and passes them to the output.
What you can do (if not already done): set the database connection encoding to UTF-8.
And double check, if this character is not already in the database ;)
(And yes, ÿ is of course a character that UTF-8 can encode ;) but it looks a bit as if you are using a different encoding for the database connection than what is actually stored in the database.)
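One way to check for this kind of mismatch on the PHP side is sketched below. It assumes the mbstring extension, and it assumes that anything which is not valid UTF-8 is really ISO-8859-1, which is a guess about your data rather than something the question confirms. The helper name `ensureUtf8` is made up. A lone byte FF (ÿ in Latin-1) is never valid UTF-8, so the check catches it.

```php
<?php
// Hypothetical helper: return $text as valid UTF-8.
// Assumption: input that is not already valid UTF-8 is ISO-8859-1.
function ensureUtf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid, leave it untouched
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

// The string from the question, with ÿ stored as the single Latin-1 byte FF.
$fromDb = "WATERTOWN/171 Watertown St/Rte 16/Newton\xFF";

var_dump(mb_check_encoding($fromDb, 'UTF-8')); // bool(false)
echo ensureUtf8($fromDb), "\n";                // now ends in a proper UTF-8 ÿ
```

If the page escapes titles with htmlspecialchars() and does not pass ENT_SUBSTITUTE, invalid UTF-8 input makes that function return an empty string, which would be one plausible explanation for the blank title.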
I'm fetching rows from a MySQL database with a unicode_general_ci collation. Columns contain Chinese characters such as 格拉巴酒和蒸馏物 and I need to display those characters.
I know that I should work in utf-8 encoding:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
but I can't: I'm working on a legacy application where most of the .php files are saved as ANSI and the whole site is using:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
Is there any way to display them?
Bonus question: I've tried to manually change the encoding in Chrome (Tools -> Encoding -> UTF-8) and it doesn't seem to work: the page is reloaded but ???? are displayed instead of Chinese characters.
You can display 格 using the numeric entity reference &#x683C;, etc. The encoding of the page should not matter in this case; HTML entity references always refer to Unicode code points.
PHP has a function htmlentities for this purpose, but it appears that you will need workarounds for handling numeric entities. This json_encode hack is fairly obscure, but is probably programmatically the simplest.
// Turn json_encode's \uXXXX escapes into &#xXXXX; after stripping the surrounding quotes
echo preg_replace('/\\\\u([0-9a-f]{4})/', '&#x$1;',
    preg_replace('/^"(.*)"$/', '$1', json_encode($s)));
This leverages the fact that json_encode will coincidentally do the conversion for you; the rest is all mechanics. (I guess that's PHP for you.)
IDEone demo
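If the mbstring extension is available, a less roundabout alternative to the json_encode trick is mb_encode_numericentity(). This is a different technique from the one above, shown only as a sketch; the helper name `toNumericEntities` and the sample text are made up. It assumes `$s` is a valid UTF-8 string.

```php
<?php
// Hypothetical helper: replace every non-ASCII character in a UTF-8 string
// with a decimal numeric character reference, leaving ASCII untouched.
function toNumericEntities(string $utf8): string
{
    // Map format: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

echo toNumericEntities("Grappa: \u{683C}\u{62C9}"), "\n"; // Grappa: &#26684;&#25289;
```

The output is pure ASCII, so it displays the same whether the page is declared as iso-8859-1 or utf-8.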
Your "bonus question" isn't really a question, but of course, that's how it works; raw bytes in the range 128-255 are only rarely valid UTF-8 sequences, so unless what you have on the page is valid UTF-8, you are likely to get the "invalid character" replacement glyph for those bytes.
For the record, the first two Chinese Han glyphs in your text in UTF-8 would display as æ ¼æ‹‰ if mistakenly displayed in Windows code page 1252 (what you, and oftentimes Microsoft, carelessly refer to as "ANSI") -- if you have those bytes on the page then forcing the browser to display it in UTF-8 should actually work as a workaround as well.
For additional background I recommend @deceze's What Every Programmer Absolutely, Positively Needs to Know About Encodings and Character Sets to Work With Text.
I'm not sure that you can. iso-8859-1 is commonly called "Latin 1". There's no support for any Asian kanji-type languages at all.
http://en.wikipedia.org/wiki/ISO/IEC_8859-1
ISO 8859-1 encodes what it refers to as "Latin alphabet no. 1," consisting of 191 characters from the Latin script. This character-encoding scheme is used throughout the Americas, Western Europe, Oceania, and much of Africa. It is also commonly used in most standard romanizations of East-Asian languages.
I'm making a KSSN (Korean ID Number) checker in PHP using a MySQL database.
I check if it is working by using a file_get_contents call to an external site.
The problem is that the requests (with Hangul/Korean characters in them) are using the wrong charset.
When I echo the string, the Korean characters just get replaced by question marks.
How can I make it to use Korean? Should I change anything in the database too?
What should be the charset?
PHP Source and SQL Dump: http://www.multiupload.com/RJ93RASZ31
NOTE: I'm using Apache (HTML), not CLI.
You need to:
tell the browser what encoding you wish to receive in the form submission, by setting Content-Type by header or <meta> as in aviv's answer.
tell the database what encoding you're sending it bytes in, using mysql_set_charset().
Currently you are using EUC-KR in the database so presumably you want to use that encoding in both the above points. In this century I would suggest instead using UTF-8 throughout for all web apps/databases, as the East Asian multibyte encodings are an anachronistic unpleasantness. (With potential security implications, as if mysql_real_escape_string doesn't know the correct encoding, a multibyte sequence containing ' or \ can sneak through an SQL injection.)
However, if enpang.com are using EUC-KR for the encoding of the Name URL parameter you would need either to stick with EUC-KR, or to transcode the name value from UTF-8 to EUC-KR for that purpose using iconv(). (It's not clear to me what encoding enpang.com are using for URL parameters to their name check service; I always get the same results anyway.)
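For the transcoding step, a minimal sketch with iconv() could look like this. It assumes the remote service expects EUC-KR bytes in the query string, which, as said above, is not confirmed. The helper name `eucKrQueryValue` is made up.

```php
<?php
// Hypothetical helper: transcode a UTF-8 value to EUC-KR and percent-encode
// the resulting bytes for use in a query string.
function eucKrQueryValue(string $utf8Value): string
{
    $eucKr = iconv('UTF-8', 'EUC-KR', $utf8Value);
    if ($eucKr === false) {
        throw new InvalidArgumentException('Value cannot be represented in EUC-KR');
    }
    return rawurlencode($eucKr);
}

// "한글" (U+D55C U+AE00) becomes the EUC-KR bytes C7 D1 B1 DB.
echo eucKrQueryValue("\u{D55C}\u{AE00}"), "\n"; // %C7%D1%B1%DB
```

Note that mysql_set_charset() belongs to the old mysql extension; with mysqli the equivalent is mysqli_set_charset().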
I don't know the charset, but if you are using HTML to show the results you should set the charset of the html
<META http-equiv="Content-Type" content="text/html; charset=EUC-KR">
You can also use iconv (php function) to convert the charset to a different charset
http://php.net/manual/en/book.iconv.php
And last but not least, check your database encoding for the tables.
But I guess that in your case you will only have to change the meta tag.
Basically all charset problems stem from the fact that they're being mixed and/or misinterpreted.
A string (text) is a sequence of bytes in a specific order. The string is encoded using some specific charset, that in itself is neither right nor wrong nor anything else. The problem is when you try to read the string, the sequence of bytes, assuming the wrong charset. Bytes encoded using, for example, KS X 1001 just don't make sense when you read them assuming they're UTF-8, that's where the question marks come from.
The site you're getting the text from sends it to you in some specific character set, let's assume KS X 1001. Let's assume your own site uses UTF-8. Embedding a stream of bytes representing KS X 1001 encoded text in the middle of UTF-8 encoded text and telling the browser to interpret the whole site as UTF-8 leads to the KS X 1001 encoded text not making sense to the UTF-8 parser.
UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
KSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKSKS
UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
will be rendered as
Hey, this is UTF-8 encoded text, awesome!
???????I?have?no?idea?what?this?is???????
Hey, this is UTF-8 encoded text, awesome!
To solve this problem, convert the fetched text into UTF-8 (or whatever encoding you're using on your site). Look at the Content-Type header of that other site, it should tell you what encoding the site is in. If it doesn't, take a guess.
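As a rough sketch of that last step, the following reads the charset from a Content-Type header value and converts the fetched bytes to UTF-8. It assumes the mbstring extension and a charset name that mbstring recognizes. Both helper names are made up, and the EUC-KR fallback is just the "take a guess" part.

```php
<?php
// Hypothetical helper: pull the charset out of a Content-Type header value,
// falling back to a guess when the header does not name one.
function charsetFromContentType(string $contentType, string $fallback): string
{
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', $contentType, $m)) {
        return $m[1];
    }
    return $fallback;
}

// Hypothetical helper: convert a fetched body to UTF-8 using the declared charset.
function fetchedBodyToUtf8(string $body, string $contentType, string $fallback = 'EUC-KR'): string
{
    $from = charsetFromContentType($contentType, $fallback);
    return mb_convert_encoding($body, 'UTF-8', $from);
}

// Two EUC-KR bytes as they might arrive from the other site:
echo fetchedBodyToUtf8("\xC7\xD1", 'text/html; charset=EUC-KR'), "\n"; // the syllable U+D55C
```

After a file_get_contents() call on an http:// URL, the response headers are available in $http_response_header, so the Content-Type line can be taken from there.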
Why are extended ASCII characters (â, é, etc.) getting replaced with <?> characters?
I attached a pic... but I am using PHP to pull the data from MySQL, and some of these locations have extended characters... I am using the Font Arial.
You can see the screen shot here: http://img269.imageshack.us/i/funnychar.png/
Still happening after the suggestions, here is what I did:
My firefox (view->encoding) is set to UTF-8 after adding the line, however, the text inside the option tags is still showing the funny character instead of the actual accented one. What should I look for now?
UPDATE:
I have the following in the PHP program that is giving me those <?> characters...
ini_set( 'default_charset', 'UTF-8' );
And right after my zend db object creation, I am setting the following query:
$db->query("SET NAMES utf8;");
I changed all my tables over to UTF-8 and reinserted all the data (waste of time) as it never helped. It was latin1 prior.
Also STATUS is reporting:
Connection: Localhost via UNIX socket
Server characterset: latin1
Db characterset: latin1
Client characterset: utf8
Conn. characterset: utf8
UNIX socket: /var/run/mysqld/mysqld.sock
Uptime: 4 days 20 hours 59 min 41 sec
Looking at the source of the page, I see
<option value="Br�l� Lake"> Br�l� Lake
OK- NEW UPDATE-
I changed everything in my PHP and HTML to:
and
header('Content-Type: text/html; charset=latin1');
Now it works, what gives?? How do I convert it all to UTF-8?
That's what the browser does when it doesn't know the encoding to use for a character. Make sure you specify the encoding type of the text you send to the client either in headers or markup meta.
In HTML:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
In PHP (before any other content is sent to the client):
header('Content-Type: text/html; charset=utf-8');
I'm assuming you'll want UTF-8 encoding. If your site uses another encoding for text, then you should replace UTF-8 with the encoding you're using.
One thing to note about using HTML to specify the encoding is that the browser will restart rendering a page once it sees the Content-Type meta tag, so you should include the <meta /> tag immediately after the opening <head> tag in your page so the browser doesn't do any more extra processing than it needs.
Another common charset is "iso-8859-1" (Latin-1), which you may want to use instead of UTF-8. You can find more detailed info from this awesome article on character encodings and the web. You can also get an exhaustive list of character encodings here if you need a specific type.
If nothing else works, another (rare) possibility is that you may not have a font installed on your computer with the characters needed to display the page. I've tried repeating your results on my own server and had no luck, possibly because I have a lot of fonts installed on my machine so the browser can always substitute unavailable characters from one font with another font.
What I did notice by investigating further is that if text is sent in an encoding different from the encoding the page declares, non-ASCII characters can render unexpectedly. To work around this, I used the HTML character entity representation of special characters, so â becomes &acirc; in my HTML and é becomes &eacute;. Once I did this, no matter what encoding I reported as, my characters rendered correctly.
Obviously you don't want to modify your database to HTML encode Unicode characters. Your best option if you must do this is to use a PHP function, htmlentities(). You should use this function on any data-driven text you expect to have Unicode characters in. This may be annoying to do, but if specifying the encoding doesn't help, this is a good last resort for forcing Unicode characters to work.
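A minimal sketch of that last-resort approach, assuming the text coming out of the database is valid UTF-8 (the helper name `asciiSafeHtml` and the sample value are made up):

```php
<?php
// Hypothetical helper: escape a UTF-8 string so that every character with a
// named HTML entity is written as ASCII-only markup.
function asciiSafeHtml(string $utf8): string
{
    return htmlentities($utf8, ENT_QUOTES, 'UTF-8');
}

echo asciiSafeHtml("Br\u{FB}l\u{E9} Lake"), "\n"; // Br&ucirc;l&eacute; Lake
```

The third argument has to match the real encoding of the input. If the input is not valid in that encoding, htmlentities() with these flags returns an empty string, so this does not remove the need to know what encoding your data is in.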
There is no such standard called "extended ASCII", just a bunch of proprietary extensions.
Anyway, there are a variety of possible causes, but it's not your font. You can start by checking the character set in MySQL, and then see what PHP is doing. As Dan said, you need to make sure PHP is specifying the character encoding it's actually using.
As others have mentioned, this is a character-encoding question. You should read Joel Spolsky's article about character encoding.
Setting
header('Content-Type: text/html; charset=utf-8');
will fix your problem if your php page is writing UTF-8 characters to the browser. If the text is still garbled, it's possible your text is not UTF-8; in that case you need to use the correct encoding name in the Content-Type header. If you have a choice, always use UTF-8 or some other Unicode encoding.
Simplest fix
ini_set( 'default_charset', 'UTF-8' );
this way you don't have to worry about manually sending the Content-Type header yourself.
EDIT
Make sure you are actually storing data as UTF-8 - sending non-UTF-8 data to the browser as UTF-8 is just as likely to cause problems as sending UTF-8 data as some other character set.
SELECT table_collation
FROM information_schema.`TABLES` T
WHERE table_name=[Table Name];
SELECT default_character_set_name
, default_collation_name
FROM information_schema.`SCHEMATA` S
WHERE schema_name=[Schema Name];
Check those values
There are two transmission encodings, PHP<->browser and Mysql<->PHP, and they need to be consistent with each other. Setting up the encoding for Mysql<->PHP is dealt with in the answers to the questions below:
Special characters in PHP / MySQL
How to make MySQL handle UTF-8 properly
php mysql character set: storing html of international content
The quick answer is "SET NAMES UTF8".
The slow answer is to read the articles recommended in the other answers - it's a lot better to understand what's going on and make one precise change than to apply trial and error until things seem to work. This isn't just a cosmetic UI issue, bad encoding configurations can mess up your data very badly. Think about the Simpsons episode where Lisa gets chewing gum in her hair, which Marge tries to get out by putting peanut butter on.
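For the MySQL<->PHP side, the connection charset can also be requested when connecting instead of with a separate SET NAMES query. The sketch below uses PDO and builds a DSN with a charset parameter. The helper name, host, and database name are placeholders, and it names utf8mb4 (MySQL's full UTF-8) rather than the utf8 mentioned above, which is my substitution.

```php
<?php
// Hypothetical helper: build a PDO DSN that asks MySQL to use utf8mb4
// for the connection.
function utf8Dsn(string $host, string $dbName): string
{
    return "mysql:host={$host};dbname={$dbName};charset=utf8mb4";
}

// Usage (host, database, and credentials are placeholders):
//   $pdo = new PDO(utf8Dsn('localhost', 'example_db'), 'user', 'secret');
// With mysqli the equivalent call is $mysqli->set_charset('utf8mb4').
echo utf8Dsn('localhost', 'example_db'), "\n";
```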
You should encode all special chars into HTML entities instead of depending on the charset.
htmlentities() will do the work for you.
I changed all my tables over to UTF-8 and reinserted all the data (waste of time) as it never helped. It was latin1 prior.
If your original data was latin1, then inserting it into a UTF-8 database won't convert it to UTF-8, AFAIK, it will insert the same data but now believe it's UTF-8, thus breaking.
If you've got a SQL dump, I'd suggest running it through a tool to convert to UTF-8. Notepad++ does this pretty well - simply open the file, check that the accented characters are displaying correctly, then find "convert to UTF-8" in the menu.
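If you would rather script the conversion than use an editor, something along these lines should work. It is only a sketch: it assumes the mbstring extension, assumes the dump really is ISO-8859-1 (use 'Windows-1252' as the source if it contains curly quotes and similar characters), and the helper name and file paths are made up.

```php
<?php
// Hypothetical helper: copy a dump file, converting each line to UTF-8.
// Assumption: the input really is encoded in $from.
function convertDumpToUtf8(string $inPath, string $outPath, string $from = 'ISO-8859-1'): void
{
    $in = fopen($inPath, 'rb');
    $out = fopen($outPath, 'wb');
    if ($in === false || $out === false) {
        throw new RuntimeException('Could not open dump files');
    }
    // Reading line by line is safe here because the source is a single-byte encoding.
    while (($line = fgets($in)) !== false) {
        fwrite($out, mb_convert_encoding($line, 'UTF-8', $from));
    }
    fclose($in);
    fclose($out);
}

// Usage (paths are placeholders):
//   convertDumpToUtf8('dump_latin1.sql', 'dump_utf8.sql');
```

Any charset statements inside the dump itself (for example a SET NAMES latin1 line) still need to be changed by hand before re-importing.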
These special characters generally appear due to the extended (non-ASCII) characters. If we provide a meta tag with charset=utf-8 we can eliminate them by adding:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
to your meta tags