PHP for Python Programmers: UTF-8 Issues

PHP for Python Programmers: UTF-8 Issues - php

I have an open source PHP website and I intend to modify/translate (mostly constant strings) it so it can be used by Japanese users.
The original code is PHP+MySQL+Apache and written in English with charset=utf-8
I want to change, for example, the word "login" into Japanese counterpart "ログイン" etc
I am not sure whether I have to save the PHP code in utf-8 format (just like Python)?
I only have experience with Python, so what other issues I should take care of?

If it's in the file, then yes, you will need to save the file as UTF-8.
If it's is in the database, you do not need to save the PHP file as UTF-8.
In PHP, strings are basically just binary blobs. You will need to save the file as UTF-8 so the correct bytes are read in. In theory, if you saved the raw bytes in an ANSI file, it would still be output to the browser correctly, just your editor would not display it correctly, and you would run the risk of your editor manipulating it incorrectly.
Also, when handling non-ANSI strings, you'll need to be careful to use the multi-byte versions of string manipulation functions (str_replace will likely botch a utf-8 string for example).

If the file contains UTF-8 characters then save it with UTF-8. Otherwise you can save it in any format. One thing you should be aware of is that the PHP interpreter does not support the UTF-8 byte order mark so make sure you save it without that.

I'm sorry you have to use PHP after using Python.
PHP has no concept of character sets: all strings are binary, even in parsed php code, so if you include a UTF-8 multibyte character in a php string, make sure the bytes in the code file are UTF-8 bytes.
You will need to be extremely careful with the use of string functions at all levels of your application. You also need to make sure your MySQL connection is set to use UTF-8 (using SET NAMES or the charset dsn parameter in later versions of PDO), and that your mysql string column datatypes use utf-8 storage.

Related

Get file encoding of a large csv

I need to determine the character encoding of the contents of a .csv file.
Every snippet that I have seen do this uses file_get_contents(), however I can't use that because the file is too large to store in a variable (server memory limit exhausted).
How can I determine the character encoding of a file? Can I just get the first x characters and check them? Would that guarantee that my whole file is that encoding?
Alternatively, can I simply convert the entire csv to UTF-8 without knowing the current file encoding?

No, you can't determine the encoding with just the first x characters. You can guess it, and the guess may be wrong. The file may be UTF-8 but not contain UTF-8 before x characters. If may contain another encoding that is compatible with ASCII, bot only after character x.
No, you can't convert a file without knowing the current file encoding.

You can go straight to the conversion, as you said, using iconv (http://php.net/manual/en/function.iconv.php#49434)

'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
—Charles Babbage, 1864.
You have missing metadata and are proposing to put in values whether they are right or not.
Only the author/sender can tell you, perhaps via some standard, specification, convention, agreement or communication. A common method of communication when transferring data via HTTP is the Content-Type header.
Unfortunately, inadequate communication of metadata for text files and streams is too common in our industry. It stems from the 1970s and 80s when text files were converted to the local character encoding upon receipt. That doesn't apply anymore and nothing really took its place.
Non-answer:
Conversion from ISO-8859-1 will never fail during conversion because it uses all 256 bytes values in any sequence.
Conversion to any current Unicode encoding (including UTF-8) will never fail because all of them support the whole Unicode character set, and Unicode includes every computerized character you are likely to see today.
But wait, there is more needed metadata in the case of CSV:
line ending (arguably detectable)
field separator (arguably detectable)
quoting scheme, including escaping
presence of header row
and, finally, the datatype of each column.
And, keep in mind, if you were to guess any of this, and the data source is updatable, today's guess might not work tomorrow.

Is it safe to use raw emojis in PHP source code?

Example :
$fire = '🔥';
I know PHP 5+ supports this functionality natively but is it best practice or should I be storing them using their codepoints instead and if so, why?

As far as your editor and the PHP compiler are concerned, it's all just text, and '🔥' is no different from 'fire' or 'Φωτιά'.
When PHP runs, it will read the bytes in from the file and put them in memory, without caring what they mean. This leads to the most likely problem you'll have: if you save the file in your text editor as UTF-16, and then echo the string to a browser telling it that it's UTF-8, the browser won't show the right thing. But that's easily avoided by making sure your editor always uses UTF-8, and your output headers tell the browser that's what you're using.
If you don't trust your editor to do that, and you're running PHP7, you could write it in the escaped notation "\u{1f525}", but when it runs, the same bytes will end up in memory.
You might have similar problems if you send the text elsewhere - to a database, for instance - and that somewhere else doesn't know to handle it as UTF-8. How you write the string in your source file won't make any difference to that, though, that's just a case of making sure everything is configured to match.
Note: you don't actually have to use UTF-8 for this, you could use UTF-16, or some other encoding, as long as you're consistent; but UTF-8 is by far the most common these days, particularly on the web.

Detecting char encoding on direct file uploads PHP

on my site I allow for direct text file uploads. These files are then stored on the server, and displayed on the website. I use UTF-8 on the site.
Now I run into trouble when people upload non-UTF-8 files which contain special chars, such as é.
I've been doing some testing. Made 2 text files, both containing the same word fiancée. One encoded UTF-8 and one encoded ISO 8859-2.
The UTF-8 one uploads fine, and shows the text correct, but the ISO 8859-2 shows as fianc�e.
Now I've tried to detect the uploaded file content with mb_detect_encoding, but whatever file I throw at it, it always detect UTF-8.
I noticed that I can use utf8_encode to convert the ISO 8859-2 files to valid UTF-8, but this only works on non-UTF files. And as I currently cannot detect non-UTF files, I cannot use the utf8_encode function, as it messes up valid UTF-8 files.
Hope this makes sense :)
So my question is, how can I detect files that are for sure not UTF-8 encoded to start with, so that I can use the utf8_encode function on them.

You cannot. Welcome to encodings.
Seriously though, files are just binary blobs. The bits and bytes in the file could mean anything at all; it could be images, CAD data or, perhaps, text. It depends on how you interpret the bytes. For text files that specifically means with which encoding you interpret them. There's nothing in the files themselves that tells you the correct encoding, you have to know it. Typically you want to know it from metadata accompanying the file. In the case of random user uploads though, there is no metadata, and/or it wouldn't be reliable. So you cannot "know".
The next step would be to guess, but that is obviously not foolproof. You can rule out certain encodings, for example if a file does not validate as UTF-8 (mb_check_encoding($data, 'UTF-8') == false), then it cannot be UTF-8. However, any single byte encoding will validate as any other single byte encoding. It's impossible to distinguish ISO-8859-1 from ISO-8859-2 this way, the bytes are equally valid in both. It's just that the characters that show up may not be the ones you want. To detect that automatically you need a statistical language analyser which can tell you that this character probably shouldn't show up in that word for it to be grammatical. Obviously for that to work you need to know the language used in the file, or you need to detect that first… And even then this is hardly foolproof.
The sanest way is to ask the user. Accept the upload, perhaps do some upfront testing on which encodings can be ruled out, then ask the user which of a bunch of possible encodings the file is in. Present them the result, what the file looks like when interpreted as the chosen encoding, let the user confirm that it looks alright. Many decent text editors do this when you open a file with an ambiguous encoding.

Is it possible to write a great PHP app which uses Unicode?

My next web application project will make extensive use of Unicode. I usually use PHP and CodeIgniter however Unicode is not one of PHP's strong points.
Is there a PHP tool out there that can help me get Unicode working well in PHP?
Or should I take the opportunity to look into alternatives such as Python?

PHP can handle unicode fine once you make sure to encode and decode on entry and exit. If you are storing in a database, ensure that the language encodings and charset mappings match up between the html pages, web server, your editor, and the database.
If the whole application uses UTF-8 everywhere, decoding is not necessary. The only time you need to decode is when you are outputting data in another charset that isn't on the web. When outputting html, you can use
htmlentities($var, ENT_QUOTES, 'UTF-8');
to get the correct output. The standard function will destroy the string in most cases. Same goes for mail functions too.
http://developer.loftdigital.com/blog/php-utf-8-cheatsheet is a very good resource for working in UTF-8

One of the Major feature of PHP 6 will be tightly integrated with UNICODE support.
Implementing UTF-8 in PHP 5.
Since PHP strings are byte-oriented, the only practical encoding scheme for Unicode text is UTF-8. Tricks are [Got it from PHp Architect Magazine]:
Present HTML pages in UTF-8
Convert PHP scripts to UTF-8
Convert the site content, back-end databases and the like to UTF-8
Ensure that no PHP functions corrupt the UTF-8 text
Check out http://www.gravitonic.com/talks/ PHP UTF 8 Cheat Sheet

PHP is mostly unaware of chrasets and treats strings as bytestreams. That's not much of a problem really, but you'll have to do a bit of work your self.
The general rule of thumb is that you should use the same charset everywhere. If you use UTF-8 everywhere, then you're 99% there. Just make sure that you don't mix charsets, because then it gets really complicated. The only thing that won't work correct with UTF-8, is string manipulation, which needs to operate on a character level. Eg. strlen, substr etc. You should use UTF-8-aware versions in place of those. The multibyte-string extension gives you just that.
For a checklist of places where you need to make sure the charset is set correct, look at:
http://developer.loftdigital.com/blog/php-utf-8-cheatsheet
For more information, look at:
http://www.phpwact.org/php/i18n/utf-8

PHP, MSSQL2005 and Codepages

I have a php script which accesses a MSSQL2005 database, reads some data from it and sends the results in a mail.
There are special characters in both some column names and in the fields itself.
When I access the script through my browser (webserver iis), the query is executed correctly and the contents of the mail are correctly (for my audience) encoded.
However, when I execute php from the console, the query fails (due to the special characters in the column names). If I replace the special characters in the query with calls to chr() and the character code in latin-1, the query gets executed correctly, but the results are also encoded in latin-1 and therefore not displayed correctly in the mail.
Why is PHP/the MSSQL driver/… using a different encoding in the two scenarios? Is there a way around it?
If you wonder, I need the console because I want to schedule the script using SQLAgent (or taskmanager or whatever).

Depending on the type of characters you have in your database, it might be a console limitation I guess. If you type chcp in the console, you'll see what is the active code page, which might something like CP437 also known as Extended ASCII. If you have characters out of this code page, like in UTF8, you might run into problems. You can change the current active code page by typing chcp 65001 to switch to UTF8.
You might also want to change the default Raster font to Lucida Console depending on the required characters as not all fonts support extended characters (right click on command prompt window's title, properties, font).
As already said, PHP's unicode support is not ideal, but you can manage to do it in PHP5 with a few well placed function call of utf8_decode. The secret of character encoding is to understand well what is the current encoding of all the tools you are using: database, database connection, current bytes in your PHP variable, your output to the console screen, your email's body encoding, your email client, and so on...
For everything that have special characters, in our modern days, something like UTF8 is often recommended. Make sure everything along the way is set to UTF8 and convert only where necessary.

PHP's poor support for the non English world is well known. I've never used a database with characters outside the basic ASCII realm, but obviously you already have a work around and it seems you just have to live with it.
If you wanted to take it a step further, you could:
1. Write an array that contains all the special chars and their CHR equivalents
2. foreach the array and str_replace on the query
But if the query is hardcoded, I guess what you have is fine. Also, make sure you are using the latest PHP, at least 4.4.x, there's always a change this was fixed but I skimmed the 4.x.x release notes and I don't see anything that relates to your problem.

The thing to remember about PHP strings is that they are streams of bytes. If you want to get the data in the correct character set (for whatever you are doing), you have to do this explicitly through some kind of function or filter. It's all pretty low-level.
Depending on your setup, you may need to know the internal character set of the strings in the database, but at the very least you need to know what character set the database is sending to PHP (because, remember, to PHP it's just a stream of bytes).
Then you have to know the target character set (and possibly specify it, which you really should anyway). For example, say that you are getting utf-8 from the database, but wish to send a latin-1 (and therefore base64 or q-printable encoded as 'Content-transfer-encoding'):
$send_string = base64_encode(utf8_decode($database_string));
Of course in this case, you'd have to know that all the utf-8 characters exist in the latin-1 character set, and you probably wouldn't really want base64 (PHP unfortunately does not have a good q-printable encoding function, though curiously, it does for decoding), and if you aren't talking about utf-8 <=> latin-1 you'll want to whip out the mbstring functions instead.
As far as the console, you'd have to know what PHP is getting when you are typing in special characters from the console, which probably depends on the shell and/or PHP settings. But remember that PHP only understands strings as byte byte byte and you should be able to work it out.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.