Okay, so emoji basically shows the above on a computer. Is that another programming language? So how do I put those little boxes into a php file? When I put it into a php file, it turns into question marks and what not. Also, how can I store these in MySQL without them turning into question marks and other weird things?
how do I put those little boxes into a php file?
Same way as any other Unicode character. Just paste them and make sure you're saving the PHP file and serving the PHP page as UTF-8.
When I put it into a php file, it turns into question marks and what not
Then you have an encoding problem. Work it out with Unicode characters you can actually see properly first, for example ąαд™日本, before worrying about the emoji.
Your PHP file should be saved as UTF-8; the page it produces should be served as Content-Type: text/html; charset=UTF-8 (or with a similar meta tag); the MySQL database should be using a UTF-8 character set and collation to store data; and PHP should be talking to MySQL using UTF-8.
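As a rough sketch of the "served as UTF-8" and "talking to MySQL using UTF-8" parts: the two helpers below are hypothetical (not from this thread), and the host and database names are placeholders. It assumes PDO with the MySQL driver, whose DSN accepts a charset parameter. utf8mb4 is used because MySQL's older utf8 character set stores at most 3 bytes per character.

```php
<?php
// Hypothetical helper: build a PDO DSN that asks MySQL to talk UTF-8.
// Host and database name are placeholders, not values from this thread.
function utf8Dsn(string $host, string $dbName): string
{
    // utf8mb4 is MySQL's full UTF-8; the older "utf8" stops at 3 bytes per character.
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
}

// Hypothetical helper: the Content-Type header line for a UTF-8 HTML page.
function utf8ContentType(): string
{
    return 'Content-Type: text/html; charset=UTF-8';
}

// In a real page you would send the header before any output, then connect:
//   header(utf8ContentType());
//   $pdo = new PDO(utf8Dsn('localhost', 'example_db'), $user, $password);
echo utf8ContentType(), "\n";
echo utf8Dsn('localhost', 'example_db'), "\n";
```

The table and column character sets still have to be UTF-8 as well; the DSN only covers the connection.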
However. Even handling everything correctly like this, PCs will still not show the emoji. That's because:
they don't have fonts that include shapes for those characters, and
emoji are still completely unstandardised. Those characters you posted are in the Unicode Private Use Area, which means they don't have any official meaning at all.
Each network in Japan uses different character codes for their emoji, mapped to different areas in the PUA. So even on another mobile phone, it probably won't display the correct character, unless you spend ages manually converting emoji codes for different networks. I'm guessing the ones you posted above are from SoftBank (iPhone?).
There is an ongoing proposal led by Google and Apple to collate the different networks' emoji and give them a proper standardised place in Unicode. Until then, getting emoji to display consistently across networks is an exercise in unhappiness. See the character overview from the standardisation work to see how much converting you would have to do.
God, I hate emoji. All that pain for such a load of useless twee rubbish.
This has nothing to do with programming languages, just with encoding and fonts. As a very brief overview: Every character is stored by its character code (e.g.: 0x41 = A, 0x42 = B, etc), which is rendered as a meaningful character on your screen using a font (which says "the character with the code 0x41 should look like this ...").
These emoji occupy the "private use area" of the Unicode table, which is a range of codes that are undefined and free for anyone to use. That makes them perfectly valid character codes, it's just that no standard font has an appropriate character to display for them, since they are undefined. Only the iPhone and other handhelds, mostly in Japan, have appropriate icons for these codes. This is done to save bandwidth; instead of transmitting relatively large image files back and forth, emoji can be transmitted using a single character code.
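To check whether a string contains private-use characters, something along these lines should do. It is only a sketch: both function names are made up for illustration, it only looks at the Private Use Area in the Basic Multilingual Plane (U+E000 to U+F8FF), and it assumes PHP 7.2+ with mbstring for mb_ord().

```php
<?php
// Hypothetical helper: true if the code point lies in the BMP Private Use Area.
function isPrivateUse(int $codePoint): bool
{
    return $codePoint >= 0xE000 && $codePoint <= 0xF8FF;
}

// Hypothetical helper: list the private-use code points found in a UTF-8 string.
function privateUseCodePoints(string $utf8): array
{
    $found = [];
    // The /u modifier makes preg_split() cut between characters, not bytes.
    foreach (preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $codePoint = mb_ord($char, 'UTF-8');
        if (isPrivateUse($codePoint)) {
            $found[] = sprintf('U+%04X', $codePoint);
        }
    }
    return $found;
}

// "\u{...}" is PHP 7 syntax for writing a code point as UTF-8 bytes.
print_r(privateUseCodePoints("A\u{E001}B\u{F0A7}"));
```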
As for how to store them: They should be storable as is, as long as you don't try to convert them to another encoding, in which case they may get lost. Just be aware that they only make sense on the iPhone and other SoftBank phones in Japan.
Character Viewer http://img.skitch.com/20091110-e7nkuqbjrisabrdipk96p4yt59.png
If you're on OS X you can copy and paste the character into the Character Viewer to find out what it is. I think there's a similar Character Map on Windows (albeit inferior ;-P). You could put it through PHP's ord(), but that only looks at a single byte, so it only gives a meaningful answer for ASCII characters. See the discussion on the ord page for UTF-8 functions.
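To illustrate the ord() limitation, here is a small comparison with mb_ord(), which decodes the whole character. This assumes PHP 7.2+ with the mbstring extension; U+E001 is just an arbitrary private-use code point picked for the example.

```php
<?php
// A private-use character, written with PHP 7's Unicode escape syntax.
$char = "\u{E001}";

// ord() only looks at the first byte of the string.
printf("ord:    0x%X\n", ord($char));               // 0xEE, the first UTF-8 byte

// mb_ord() decodes the full character under the given encoding.
printf("mb_ord: U+%04X\n", mb_ord($char, 'UTF-8')); // U+E001

// The raw UTF-8 bytes, for comparison.
echo 'bytes:  ', bin2hex($char), "\n";              // ee8081
```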
BTW, just for the fun of it, these characters display fine on the iPhone as is, because the iPhone has a font which has icons for them:
iPhone http://img.skitch.com/20091110-bjt3tutjxad1kw4p9uhem5jhnk.png
I'm using FF3.5 and WinXP. I see little boxes in my browser, too.
This tells me the string requires a character set not installed on my computer.
When you put the string into a PHP file, the question marks tell you the same thing: your computer doesn't know how to display the characters.
You could store these emoji characters in MySQL if you encoded them differently, probably using UTF-8.
Do a web search for character encoding, as it relates to MySQL.
I have a problem with character encoding in Firefox. When I copy/paste a paragraph from Microsoft Word (2007), it could contains special character like this (dots/squares to make a list or quote) :
Te’st
Ze’f
• Gzg’a
The quote ’ is different compared to this quote ' (typed directly using keyboard). So I paste this in a textarea and save (using AJAX in some case). In the database (which has a collation latin1_swedish_ci) it shows perfectly fine. But when getting these data to edit again using Firefox, it shows weird binary symbols. Works fine in Chrome and IE.
I don't want to modify the charset of the database. Is there any way to solve this problem?
Note: you can also test by viewing this post in Chrome and FF
The characters you copy-pasted (assuming they got transmitted correctly into this forum) contain, in addition to letters, three occurrences of U+2019 RIGHT SINGLE QUOTATION MARK, which is the correct punctuation apostrophe in English and many other languages, one occurrence of U+2022 BULLET, which sounds OK, and two occurrences of U+F0A7, which is in the Private Use (PU) range and should not be used in public information exchange, only for special purposes by mutual agreement between interested parties.
It is possible that some notations in Word 2007 documents get converted to PU characters in copy and paste, but at least normal list bullet normally becomes U+2022 BULLET. So it is a bit of a mystery where the PU characters come from.
Regarding single quotes, they are representable in windows-1252 too, and latin1_swedish_ci seems to cover it (though it is, as far as I understand, just the definition of collating order, rather than a character encoding). And as you are saying that the data looks fine in the database, it seems that the problem is in the way in which the data is written into the HTML document served to the browser.
In particular, if the encoding of the page in which the data is then presented is UTF-8 and the actual data is there in windows-1252 encoding, problems arise. It would mean a problem like the one you describe, as U+2019 is encoded as 0x92 in windows-1252, and this causes a character-level data error when interpreted as UTF-8.
You can check the situation by using View→Encoding in Firefox when viewing the result page. If my hypothesis is correct, you will see UTF-8 selected there, and changing it to “West European (windows-1252)” makes the single quote appear (and may mess up other things on the page thoroughly).
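The 0x92 case can be reproduced in a few lines of PHP. This is only a sketch of one possible workaround, transcoding on output instead of touching the database; it assumes the mbstring extension and that the stored bytes really are windows-1252.

```php
<?php
// U+2019 as it is stored in windows-1252: the single byte 0x92.
$cp1252Quote = "\x92";

// A lone 0x92 is a UTF-8 continuation byte, so it is not valid UTF-8 by itself.
var_dump(mb_check_encoding($cp1252Quote, 'UTF-8')); // bool(false)

// Transcode on the way out so a page declared as UTF-8 receives valid UTF-8.
$utf8Quote = mb_convert_encoding($cp1252Quote, 'UTF-8', 'Windows-1252');
echo bin2hex($utf8Quote), "\n";                     // e28099, i.e. U+2019 in UTF-8
```

The alternative is to leave the bytes alone and declare the page itself as windows-1252, so that the declared encoding matches the data.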
I haven't got a clue if this is a normal issue or not, but I have a small Flash application that handles management for my company. It's a small company, so it's not a big deal; it's just a bunch of INSERTs, SELECTs, UPDATEs and other stuff to manage their clients, addresses, phone numbers, etc.
The flash (in AS3) sends the variables through a URLRequest to several php pages and the php handles the request to mySQL.
My problem is that, sometimes, instead of inserting the String I sent, it instead gets a weird string, made mostly, but not only, of numbers (and it happens like 1 column out of about 10 per INSERT, so its fairly common).
Is this a known issue? Could it be because of the encoding (I used UTF-8, which I believe is the one that we use here in Portugal, due to special characters, like ã, à, á, etc)?
Thank you for your time.
Marco Fox.
After connecting to the DB, try the following query "SET CHARACTER SET utf8;".
Make sure every PHP page is saved as UTF-8.
To do that, open the file in Notepad++ and use the menu Encoding -> Convert to UTF-8 without BOM, or open the file in Notepad, choose Save As and look at the encoding dropdown below the file name (this will save the BOM, which is not good).
Some IDEs can save as ANSI, UTF-8 and more, or have a conversion option.
In Flash, use encodeURI() in your URLLoader data if you are passing it by GET.
Hope this solves your problem (if it is, in fact, an encoding issue).
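If a file or an incoming string may already carry a BOM, it can be detected and dropped in PHP. A minimal sketch (the function name is made up for this example); the UTF-8 BOM is the three bytes EF BB BF at the very start of the data.

```php
<?php
// Hypothetical helper: remove a leading UTF-8 byte order mark, if present.
function stripUtf8Bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    // strncmp() compares only the first three bytes of the input.
    return strncmp($text, $bom, 3) === 0 ? substr($text, 3) : $text;
}

echo stripUtf8Bom("\xEF\xBB\xBFname=Marco"), "\n"; // name=Marco
echo stripUtf8Bom("name=Marco"), "\n";             // unchanged: name=Marco
```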
I've looked across the web, I've looked through SO, through PHP documentation and more.
It seems like a ridiculous problem not to have a standard solution to. If you get an unknown character set, and it has strange characters (like English quotes), is there a standard way to convert them to UTF-8?
I've seen many messy solutions using a plethora of functions and checking and none of them are definitely going to work.
Has anyone come up with their own function or a solution that always works?
EDIT
Many people have answered saying "it is not solvable" or something of that nature. I understand that now, but none have given any sort of solution that has worked besides utf8_encode which is very limited. What methods ARE out there to deal with this? What is the best method?
No. One should always know what character set a string is in. Guessing the character set by using a sniffing function is unreliable (although in most situations, in the western world, it's usually a mix-up between ISO-8859-1 and UTF-8).
But why do you have to deal with unknown character sets? There is no general solution for this because the general problem shouldn't exist in the first place. Every web page and data source can and should have a character set definition, and if one doesn't, one should request the administrator of that resource to add one.
(Not to sound like a smartass, but that is the only way to deal with this well.)
The reason why you saw so many complicated solutions for this problem is that, by definition, it is not solvable. Encoding a string of text is not reversible unless you know which encoding was used: the bytes alone do not say how they should be read.
It is possible to construct different combinations of text and encodings that result in the same byte stream. Therefore, it is not possible, strictly logically speaking, to determine the encoding, character set, and the text from a byte stream.
In reality, it is possible to achieve results that are "close enough" using heuristic methods, because there is a finite set of encodings that you'll encounter in the wild, and with a large enough sample a program can determine the most likely encoding. Whether the results are good enough depends on the application.
I do want to comment on the question of user-generated data. All data posted from a web page has a known encoding (the POST comes with an encoding that the developer has defined for the page). If a user pastes text into a form field, the browser will interpret the text based on encoding of the source data (as known by the operating system) and the page encoding, and transcode it if necessary. It is too late to detect the encoding on the server - because the browser may have modified the byte stream based on the assumed encoding.
For instance, if I type the letter Ä on my German keyboard and post it on a UTF-8 encoded page, there will be 2 bytes (xC3 x84) that are sent to the server. This is a valid EBCDIC string that represents the letter C and d. This is also a valid ANSI string that represents the 2 characters à and „. It is, however, not possible, no matter what I try, to paste an ANSI-encoded string into a browser form and expect it to be interpreted as UTF-8 - because the operating system knows that I am pasting ANSI (I copied the text from Textpad where I created an ANSI-encoded text file) and will transcode it to UTF-8, resulting in the byte stream xC3 x83 xE2 x80 x9E.
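The byte sequences in this example can be checked with a few lines of PHP. This assumes the mbstring extension; "ANSI" is taken to mean windows-1252 here, and the EBCDIC reading is left out because mbstring does not cover it.

```php
<?php
// The letter Ä (U+00C4) as UTF-8.
$utf8 = "\u{C4}";
echo bin2hex($utf8), "\n";          // c384

// Misread those two bytes as windows-1252 (Ã and „) and transcode "to UTF-8":
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');
echo bin2hex($doubleEncoded), "\n"; // c383e2809e

// Both byte streams are valid UTF-8, so a validity check cannot tell them apart.
var_dump(mb_check_encoding($utf8, 'UTF-8'));          // bool(true)
var_dump(mb_check_encoding($doubleEncoded, 'UTF-8')); // bool(true)
```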
My point is that if a user manages to post garbage, it is arguably because it was already garbage at the time it was pasted into a browser form, because the client did not have the proper support for the character set, the encoding, whatever.
Because the bytes alone do not identify their encoding, you cannot expect there to be a trivial method to recover from such a situation.
Unfortunately, for uploaded files the problem remains. The only reliable solution that I see is to show the user a section of the file and ask if it was interpreted correctly, and cycle through a bunch of different encodings until this is the case.
Or we could develop a heuristic method that looks at the occurrence of certain characters in various languages. Say I uploaded my text file that contains the two bytes xC3 x84. There is no other information - just two bytes in the file. This method could find out that the letter Ä is fairly common in German text, but the letters Ã and „ together are uncommon in any language, and thus determine that the encoding of my file is indeed UTF-8. This is roughly the level of complexity that such a heuristic method has to deal with, and the more statistical and linguistic facts it can use, the more reliable its results will be.
Pekka is right about the unreliability, but if you need a solution and are willing to take the risk, and you have the mbstring library available, this snippet should work:
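function forceToUtf8($string) {
    // Candidates are tried in order; ISO-8859-1 accepts any byte, so it must come last.
    $candidates = ['UTF-8', 'ISO-8859-1'];
    // Strict mode rejects byte sequences that are invalid in a candidate encoding.
    $detected = mb_detect_encoding($string, $candidates, true);
    if ($detected === false) {
        return false;
    }
    return mb_convert_encoding($string, 'UTF-8', $detected);
}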
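If I'm not wrong, there is something called utf8_encode. It converts ISO-8859-1 to UTF-8, so it works well EXCEPT if the string is already in UTF-8, in which case it gets encoded twice. Note that it is deprecated as of PHP 8.2.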
http://php.net/manual/en/function.utf8-encode.php
This is probably a problem many of you have encountered before, but I'm having problems with the rendering of special characters in Flash (AS2 and AS3).
So my question is: What is the proper and fool-proof way to display characters like ', ", ë, ä, etc in a flash textfield? The data is collected from a php generated xml file, with content retrieved from a SQL database.
I believe it has something to do with UTF-8 encoding of the retrieved database data (which I've tried already) but I have yet to find a solid solution.
Just setting the header to UTF-8 won't work; it's a bit like changing the covers on a book from English to French and expecting the contents to change with it.
What you need to do is make sure your text is UTF-8 from beginning to end and store it as that in the database. If you can't do that, make sure you encode your output properly.
If you get all those steps down it should all work just fine in flash, assuming you've got the proper glyphs embedded unless you're using a system font.
AS2 has a setting called System.useCodepage. This may seem to solve the problem, but it will likely make things break even more for users on different codepages. Try to avoid it unless you're really sure of what you're doing.
Sometimes having those extra letters in your language actually helps ;)
I think that it's enough for you to put this in the xml head
<?xml version="1.0" encoding="UTF-8"?>
If your special characters are a part of Unicode set (and they should be, otherwise you're basically on your own), you just need to ensure that the font you're using to render the text has all of the necessary glyphs, and that the database output produces proper unicode text.
Some fonts don't necessarily include all the Unicode glyphs, but only a subset of them (usually dropping international glyphs and special characters). Make sure the font has them (test the font out in a word processor, for example). Also, if you're using embedded fonts, be sure to embed all the characters you need to use.
I have a php script which accesses a MSSQL2005 database, reads some data from it and sends the results in a mail.
There are special characters in both some column names and in the fields itself.
When I access the script through my browser (webserver iis), the query is executed correctly and the contents of the mail are correctly (for my audience) encoded.
However, when I execute php from the console, the query fails (due to the special characters in the column names). If I replace the special characters in the query with calls to chr() and the character code in latin-1, the query gets executed correctly, but the results are also encoded in latin-1 and therefore not displayed correctly in the mail.
Why is PHP/the MSSQL driver/… using a different encoding in the two scenarios? Is there a way around it?
If you wonder, I need the console because I want to schedule the script using SQLAgent (or taskmanager or whatever).
Depending on the type of characters you have in your database, it might be a console limitation I guess. If you type chcp in the console, you'll see the active code page, which might be something like CP437, also known as Extended ASCII. If you have characters outside this code page, as you would with UTF-8, you might run into problems. You can change the active code page by typing chcp 65001 to switch to UTF-8.
You might also want to change the default Raster font to Lucida Console depending on the required characters as not all fonts support extended characters (right click on command prompt window's title, properties, font).
As already said, PHP's Unicode support is not ideal, but you can manage to do it in PHP 5 with a few well-placed calls to utf8_decode. The secret of character encoding is to understand well what the current encoding is at every step: database, database connection, current bytes in your PHP variable, your output to the console screen, your email's body encoding, your email client, and so on...
For everything that has special characters, something like UTF-8 is often recommended these days. Make sure everything along the way is set to UTF-8 and convert only where necessary.
PHP's poor support for the non English world is well known. I've never used a database with characters outside the basic ASCII realm, but obviously you already have a work around and it seems you just have to live with it.
If you wanted to take it a step further, you could:
1. Write an array that contains all the special chars and their CHR equivalents
2. foreach the array and str_replace on the query
But if the query is hardcoded, I guess what you have is fine. Also, make sure you are using the latest PHP, at least 4.4.x. There's always a chance this was fixed, but I skimmed the 4.x.x release notes and I don't see anything that relates to your problem.
The thing to remember about PHP strings is that they are streams of bytes. If you want to get the data in the correct character set (for whatever you are doing), you have to do this explicitly through some kind of function or filter. It's all pretty low-level.
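A quick way to see the "streams of bytes" point is to compare byte counts with character counts. The sample text is made up for this example, and the mb_* calls assume the mbstring extension.

```php
<?php
// "São" written with a Unicode escape, so the bytes are UTF-8 whatever the file encoding.
$text = "S\u{E3}o";

echo strlen($text), "\n";             // 4: strlen() counts bytes
echo mb_strlen($text, 'UTF-8'), "\n"; // 3: mb_strlen() counts characters

// The same text as latin-1 is a different byte stream.
$latin1 = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
echo strlen($latin1), "\n";           // 3
echo bin2hex($latin1), "\n";          // 53e36f
```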
Depending on your setup, you may need to know the internal character set of the strings in the database, but at the very least you need to know what character set the database is sending to PHP (because, remember, to PHP it's just a stream of bytes).
Then you have to know the target character set (and possibly specify it, which you really should anyway). For example, say that you are getting utf-8 from the database, but wish to send latin-1 in a mail (and therefore base64 or q-printable encoded, as declared in 'Content-Transfer-Encoding'):
$send_string = base64_encode(utf8_decode($database_string));
Of course in this case, you'd have to know that all the utf-8 characters exist in the latin-1 character set, and you probably wouldn't really want base64 (older PHP versions had a q-printable function only for decoding; quoted_printable_encode() was added in PHP 5.3), and if you aren't talking about utf-8 <=> latin-1 you'll want to whip out the mbstring functions instead.
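For comparison, here is a sketch of the same idea using mbstring and quoted-printable. The sample string and the header lines are assumptions for illustration; it needs PHP 5.3+ for quoted_printable_encode() and PHP 7+ for the \u{...} escapes.

```php
<?php
// Stand-in for a UTF-8 value read from the database ("Grüße").
$database_string = "Gr\u{FC}\u{DF}e";

// Convert to latin-1. Characters that latin-1 lacks would be replaced by a substitute.
$latin1 = mb_convert_encoding($database_string, 'ISO-8859-1', 'UTF-8');

// Quoted-printable keeps the ASCII part readable, unlike base64.
$send_string = quoted_printable_encode($latin1);
echo $send_string, "\n"; // Gr=FC=DFe

// Headers that would have to accompany such a mail body:
$headers = "Content-Type: text/plain; charset=ISO-8859-1\r\n"
         . "Content-Transfer-Encoding: quoted-printable\r\n";
```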
As for the console, you'd have to know what PHP is getting when you type special characters into the console, which probably depends on the shell and/or PHP settings. But remember that PHP only understands strings as byte byte byte and you should be able to work it out.