PHP, MSSQL2005 and Codepages


I have a php script which accesses a MSSQL2005 database, reads some data from it and sends the results in a mail.
There are special characters both in some column names and in the field values themselves.
When I access the script through my browser (webserver IIS), the query is executed correctly and the contents of the mail are encoded correctly (for my audience).
However, when I execute php from the console, the query fails (due to the special characters in the column names). If I replace the special characters in the query with calls to chr() and the character code in latin-1, the query gets executed correctly, but the results are also encoded in latin-1 and therefore not displayed correctly in the mail.
Why is PHP/the MSSQL driver/… using a different encoding in the two scenarios? Is there a way around it?
If you wonder, I need the console because I want to schedule the script using SQLAgent (or taskmanager or whatever).

Depending on the type of characters you have in your database, it might be a console limitation. If you type chcp in the console, you'll see the active code page, which might be something like CP437 (an OEM code page, often loosely called Extended ASCII). If you have characters outside this code page, as you would with UTF-8 data, you might run into problems. You can change the active code page by typing chcp 65001 to switch to UTF-8.
You might also want to change the default Raster font to Lucida Console, depending on the characters you need, as not all fonts support extended characters (right-click the command prompt window's title bar, then Properties, Font).
As already said, PHP's Unicode support is not ideal, but you can manage in PHP5 with a few well-placed calls to utf8_decode. The secret of character encoding is to understand exactly which encoding each of the tools you are using works with: database, database connection, current bytes in your PHP variable, your output to the console screen, your email's body encoding, your email client, and so on.
For anything that contains special characters, UTF-8 is the usual recommendation these days. Make sure everything along the way is set to UTF-8 and convert only where necessary.

PHP's poor support for the non-English world is well known. I've never used a database with characters outside the basic ASCII realm, but obviously you already have a workaround and it seems you just have to live with it.
If you wanted to take it a step further, you could:
1. Write an array that contains all the special chars and their CHR equivalents
2. foreach the array and str_replace on the query
But if the query is hardcoded, I guess what you have is fine. Also, make sure you are using the latest PHP, at least 4.4.x. There's always a chance this was fixed, but I skimmed the 4.x.x release notes and I don't see anything that relates to your problem.
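The two steps above could be sketched roughly as follows. The table name, column name, and character map are made up for illustration, and the sketch assumes the query text starts out as UTF-8 while the driver wants latin-1 bytes. strtr() with an array performs all replacements in one pass, so it stands in for the foreach/str_replace loop.

```php
<?php
// Hypothetical map from UTF-8 byte sequences to their single-byte
// latin-1 equivalents, built with chr().
$latin1Map = array(
    "\xC3\xA4" => chr(0xE4), // ä
    "\xC3\xB6" => chr(0xF6), // ö
    "\xC3\xBC" => chr(0xFC), // ü
    "\xC3\x9F" => chr(0xDF), // ß
);

// Hypothetical query with special characters in a column name ("Größe"),
// written with escapes so the example does not depend on the file encoding.
$query = "SELECT [Gr\xC3\xB6\xC3\x9Fe] FROM [Artikel]";

// Replace every mapped sequence in one pass.
$latin1Query = strtr($query, $latin1Map);

echo bin2hex($latin1Query), "\n";
```

Only the characters listed in the map are converted, so anything missing from it passes through unchanged.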

The thing to remember about PHP strings is that they are streams of bytes. If you want to get the data in the correct character set (for whatever you are doing), you have to do this explicitly through some kind of function or filter. It's all pretty low-level.
Depending on your setup, you may need to know the internal character set of the strings in the database, but at the very least you need to know what character set the database is sending to PHP (because, remember, to PHP it's just a stream of bytes).
Then you have to know the target character set (and possibly specify it, which you really should anyway). For example, say that you are getting utf-8 from the database, but wish to send latin-1 in a mail (which then has to be base64 or quoted-printable encoded, as declared in the 'Content-Transfer-Encoding' header):
$send_string = base64_encode(utf8_decode($database_string));
Of course in this case, you'd have to know that all the utf-8 characters exist in the latin-1 character set, and you probably wouldn't really want base64. (Older PHP versions had no quoted-printable encoding function, though curiously they did have one for decoding; quoted_printable_encode() was added in PHP 5.3.) If you aren't talking about utf-8 <=> latin-1, you'll want to whip out the mbstring functions instead.
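A slightly fuller sketch of the same idea, assuming the mbstring extension and PHP 5.3 or later (for quoted_printable_encode()). The sample string and the header values are illustrative, not taken from the original script.

```php
<?php
// Sample value as it might arrive from the database: "Grüße" in UTF-8.
$database_string = "Gr\xC3\xBC\xC3\x9Fe";

// Convert UTF-8 -> ISO-8859-1 (latin-1). Unlike utf8_decode(),
// mb_convert_encoding() also handles other charset pairs.
$latin1 = mb_convert_encoding($database_string, 'ISO-8859-1', 'UTF-8');

// Quoted-printable keeps ASCII readable and escapes other bytes as =XX.
$body = quoted_printable_encode($latin1);

// Headers that tell the mail client how to undo both steps.
$headers = "MIME-Version: 1.0\r\n"
         . "Content-Type: text/plain; charset=ISO-8859-1\r\n"
         . "Content-Transfer-Encoding: quoted-printable\r\n";

echo $body, "\n";
```

The declared charset and transfer encoding have to match what was actually done to the bytes; the mail client relies on the headers alone.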
As for the console, you'd have to know what PHP is getting when you type special characters there, which probably depends on the shell and/or PHP settings. But remember that PHP only understands strings as byte byte byte and you should be able to work it out.

Related

Is it safe to use raw emojis in PHP source code?

Example :
$fire = '🔥';
I know PHP 5+ supports this functionality natively but is it best practice or should I be storing them using their codepoints instead and if so, why?

As far as your editor and the PHP compiler are concerned, it's all just text, and '🔥' is no different from 'fire' or 'Φωτιά'.
When PHP runs, it will read the bytes in from the file and put them in memory, without caring what they mean. This leads to the most likely problem you'll have: if you save the file in your text editor as UTF-16, and then echo the string to a browser telling it that it's UTF-8, the browser won't show the right thing. But that's easily avoided by making sure your editor always uses UTF-8, and your output headers tell the browser that's what you're using.
If you don't trust your editor to do that, and you're running PHP7, you could write it in the escaped notation "\u{1f525}", but when it runs, the same bytes will end up in memory.
You might have similar problems if you send the text elsewhere - to a database, for instance - and that somewhere else doesn't know to handle it as UTF-8. How you write the string in your source file won't make any difference to that, though, that's just a case of making sure everything is configured to match.
Note: you don't actually have to use UTF-8 for this, you could use UTF-16, or some other encoding, as long as you're consistent; but UTF-8 is by far the most common these days, particularly on the web.
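A quick way to check the claim that the literal and the escaped notation end up as the same bytes. This is a minimal sketch assuming PHP 7 or later (for the \u{...} escape) and a source file saved as UTF-8; the variable names are illustrative.

```php
<?php
$literal  = '🔥';                 // bytes come straight from the source file
$escaped  = "\u{1f525}";          // PHP 7+ code point escape
$rawBytes = "\xF0\x9F\x94\xA5";   // the UTF-8 encoding of U+1F525, spelled out

// All three hold the same four bytes when the file is saved as UTF-8.
echo bin2hex($escaped), "\n";
var_dump($literal === $escaped);
var_dump($escaped === $rawBytes);
```

If the first comparison comes out false, the source file was saved in some encoding other than UTF-8.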

PHP for Python Programmers: UTF-8 Issues

I have an open source PHP website and I intend to modify/translate (mostly constant strings) it so it can be used by Japanese users.
The original code is PHP+MySQL+Apache and written in English with charset=utf-8
I want to change, for example, the word "login" into Japanese counterpart "ログイン" etc
I am not sure whether I have to save the PHP code in utf-8 format (just like in Python).
I only have experience with Python, so what other issues should I take care of?

If it's in the file, then yes, you will need to save the file as UTF-8.
If it's in the database, you do not need to save the PHP file as UTF-8.
In PHP, strings are basically just binary blobs. You will need to save the file as UTF-8 so the correct bytes are read in. In theory, if you saved the raw bytes in an ANSI file, they would still be output to the browser correctly; your editor just would not display them correctly, and you would run the risk of your editor manipulating them incorrectly.
Also, when handling non-ANSI strings, you'll need to be careful to use the multi-byte versions of string manipulation functions. Byte-oriented functions such as strlen, substr, or strrev count and cut bytes rather than characters and can split a utf-8 character in half, whereas str_replace with a valid utf-8 search string and subject is safe.
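The difference between byte-oriented and multi-byte functions can be seen in a short sketch. It assumes the mbstring extension and a source file saved as UTF-8; the variable names are illustrative.

```php
<?php
$label = 'ログイン'; // 4 characters, 3 bytes each in UTF-8

$byteCount = strlen($label);             // counts bytes
$charCount = mb_strlen($label, 'UTF-8'); // counts characters

// Cutting at byte 4 lands in the middle of the second character.
$brokenCut = substr($label, 0, 4);
$cleanCut  = mb_substr($label, 0, 2, 'UTF-8');

// Replacing a whole, valid UTF-8 substring works byte-wise as well.
$replaced = str_replace('ログイン', 'Login', 'ログイン page');

var_dump($byteCount, $charCount);
var_dump(mb_check_encoding($brokenCut, 'UTF-8'));
```

In this sketch the functions that count or cut by position are the ones that go wrong; a search and replace over whole valid sequences does not.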

If the file contains UTF-8 characters then save it as UTF-8. Otherwise you can save it in any format. One thing you should be aware of is that the PHP interpreter normally does not handle the UTF-8 byte order mark: the BOM bytes sit in front of the opening <?php tag and are sent as output, so make sure you save the file without one.
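If text with a BOM does reach your code (for example from an uploaded or included text file), it can be removed by hand. The helper below is a hypothetical sketch, not a built-in function.

```php
<?php
// Hypothetical helper: drop a leading UTF-8 BOM (bytes EF BB BF) if present.
function stripUtf8Bom($text)
{
    $bom = "\xEF\xBB\xBF";
    if (strncmp($text, $bom, 3) === 0) {
        return substr($text, 3);
    }
    return $text;
}

// A string that starts with a BOM, as an editor might have saved it.
$withBom = "\xEF\xBB\xBFlogin";
echo bin2hex(stripUtf8Bom($withBom)), "\n";
```

This only helps for data read at runtime; a BOM in the PHP source file itself still has to be removed in the editor.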

I'm sorry you have to use PHP after using Python.
PHP has no concept of character sets: all strings are binary, even in parsed php code, so if you include a UTF-8 multibyte character in a php string, make sure the bytes in the code file are UTF-8 bytes.
You will need to be extremely careful with the use of string functions at all levels of your application. You also need to make sure your MySQL connection is set to use UTF-8 (using SET NAMES or the charset dsn parameter in later versions of PDO), and that your mysql string column datatypes use utf-8 storage.
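A minimal sketch of the connection setting mentioned above. The helper name, host, database name, and credentials are placeholders, and utf8mb4 (MySQL's full 4-byte UTF-8) is my assumption rather than something the answer specifies. The charset DSN parameter is honoured from PHP 5.3.6 on; older versions need SET NAMES.

```php
<?php
// Hypothetical helper that builds a PDO MySQL DSN with an explicit charset.
function buildMysqlDsn($host, $dbname, $charset = 'utf8mb4')
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

$dsn = buildMysqlDsn('localhost', 'example_db');

// Connecting is left commented out because it needs a real server:
// $pdo = new PDO($dsn, 'example_user', 'example_password');
// Where the DSN charset is ignored (PHP before 5.3.6), run this instead:
// $pdo->exec('SET NAMES utf8mb4');

echo $dsn, "\n";
```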

php INSERT through Flash AS3 sometimes inserts weird Strings

I haven't got a clue whether this is a normal issue or not, but I have a small Flash application that handles management for my company. It's a small company, so it's not a big deal; it's just a bunch of INSERTs, SELECTs, UPDATEs and other stuff to manage their clients, addresses, phone numbers, etc.
The Flash application (in AS3) sends the variables through a URLRequest to several PHP pages, and the PHP handles the request to MySQL.
My problem is that, sometimes, instead of inserting the String I sent, it inserts a weird string, made mostly, but not only, of numbers (and it happens in about 1 column out of 10 per INSERT, so it's fairly common).
Is this a known issue? Could it be because of the encoding (I used UTF-8, which I believe is the one that we use here in Portugal, due to special characters, like ã, à, á, etc)?
Thank you for your time.
Marco Fox.

After connecting to the DB, try the following query "SET CHARACTER SET utf8;".
Make sure every PHP page is saved as UTF-8.
To do that, open the file in Notepad++ and use the menu Encoding -> Convert to UTF-8 without BOM. Alternatively, open the file in Notepad, choose Save As, and pick the encoding from the dropdown below the file name (but this will save the BOM, which is not good).
Some IDEs can save in ANSI, UTF-8 and more, or have a conversion option.
In Flash, use encodeURI() on your URLLoader data if you are passing it by GET.
I hope this solves your problem (if it is, in fact, an encoding issue).
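On the PHP side, it can help to confirm that what arrives really is valid UTF-8 before it goes into an INSERT. The helper and sample value below are hypothetical; PHP decodes $_GET values automatically, so rawurldecode() is called here only to show what encodeURI() output turns back into.

```php
<?php
// Hypothetical check: return the value if it is valid UTF-8, null otherwise.
function cleanUtf8Param($value)
{
    return mb_check_encoding($value, 'UTF-8') ? $value : null;
}

// encodeURI() would send "ã" as %C3%A3; decoding restores the two UTF-8 bytes.
$decoded = rawurldecode('Jo%C3%A3o');

// The same name in latin-1 is a single byte E3, which is not valid UTF-8.
$latin1Name = "Jo\xE3o";

var_dump(cleanUtf8Param($decoded) !== null);
var_dump(cleanUtf8Param($latin1Name) !== null);
```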

___ encoding to UTF-8 - is there an end-all solution?

I've looked across the web, I've looked through SO, through PHP documentation and more.
It seems like a ridiculous problem not to have a standard solution to. If you get text in an unknown character set, and it has strange characters (like English quotes), is there a standard way to convert it to UTF-8?
I've seen many messy solutions using a plethora of functions and checking and none of them are definitely going to work.
Has anyone come up with their own function or a solution that always works?
EDIT
Many people have answered saying "it is not solvable" or something of that nature. I understand that now, but none have given any sort of solution that has worked besides utf8_encode which is very limited. What methods ARE out there to deal with this? What is the best method?

No. One should always know what character set a string is in. Guessing the character set by using a sniffing function is unreliable (although in most situations, in the western world, it's usually a mix-up between ISO-8859-1 and UTF-8).
But why do you have to deal with unknown character sets? There is no general solution for this because the general problem shouldn't exist in the first place. Every web page and data source can and should have a character set definition, and if one doesn't, one should request the administrator of that resource to add one.
(Not to sound like a smartass, but that is the only way to deal with this well.)

The reason why you saw so many complicated solutions for this problem is that by definition it is not solvable. Decoding a byte stream whose encoding is unknown is ambiguous.
It is possible to construct different combinations of text and encodings that result in the same byte stream. Therefore, it is not possible, strictly logically speaking, to determine the encoding, character set, and the text from a byte stream.
In reality, it is possible to achieve results that are "close enough" using heuristic methods, because there is a finite set of encodings that you'll encounter in the wild, and with a large enough sample a program can determine the most likely encoding. Whether the results are good enough depends on the application.
I do want to comment on the question of user-generated data. All data posted from a web page has a known encoding (the POST comes with an encoding that the developer has defined for the page). If a user pastes text into a form field, the browser will interpret the text based on encoding of the source data (as known by the operating system) and the page encoding, and transcode it if necessary. It is too late to detect the encoding on the server - because the browser may have modified the byte stream based on the assumed encoding.
For instance, if I type the letter Ä on my German keyboard and post it on a UTF-8 encoded page, 2 bytes (xC3 x84) are sent to the server. These bytes are also a valid EBCDIC string that represents the letters C and d, and a valid ANSI (Windows-1252) string that represents the 2 characters Ã and „. It is, however, not possible, no matter what I try, to paste an ANSI-encoded string into a browser form and have it interpreted as UTF-8, because the operating system knows that I am pasting ANSI (I copied the text from Textpad, where I created an ANSI-encoded text file) and will transcode it to UTF-8, resulting in the byte stream xC3 x83 xE2 x80 x9E.
My point is that if a user manages to post garbage, it is arguably because it was already garbage at the time it was pasted into a browser form, because the client did not have the proper support for the character set, the encoding, whatever.
Because a byte stream can be decoded in more than one way, you cannot expect a trivial method to exist for recovering from such a situation.
Unfortunately, for uploaded files the problem remains. The only reliable solution that I see is to show the user a section of the file and ask if it was interpreted correctly, and cycle through a bunch of different encodings until this is the case.
Or we could develop a heuristic method that looks at the occurrence of certain characters in various languages. Say I uploaded my text file that contains the two bytes xC3 x84. There is no other information - just two bytes in the file. This method could find out that the letter Ä is fairly common in German text, but the letters Ã and „ together are uncommon in any language, and thus determine that the encoding of my file is indeed UTF-8. This is roughly the level of complexity that such a heuristic method has to deal with, and the more statistical and linguistic facts it can use, the more reliable its results will be.
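The two-byte example can be reproduced in PHP. This is a sketch assuming the mbstring extension, with Windows-1252 standing in for "ANSI"; it shows that both readings of the bytes are legal, which is why the bytes alone cannot settle the question.

```php
<?php
// The same two bytes, read under two different assumptions.
$bytes = "\xC3\x84";

// Assumption 1: the bytes are already UTF-8 (the single letter Ä).
$isValidUtf8 = mb_check_encoding($bytes, 'UTF-8');

// Assumption 2: the bytes are Windows-1252 (Ã and „) and get transcoded to UTF-8.
$asAnsi = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');

var_dump($isValidUtf8);
echo bin2hex($asAnsi), "\n";
```

Deciding between the two readings is the job of the language statistics described above.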

Pekka is right about the unreliability, but if you need a solution and are willing to take the risk, and you have the mbstring library available, this snippet should work:
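function forceToUtf8($string, $candidates = array('UTF-8', 'ISO-8859-1')) {
    // Strict detection: pick a candidate encoding the bytes are valid in.
    // The candidate list is a guess; adjust it to the sources you expect.
    $from = mb_detect_encoding($string, $candidates, true);
    if ($from === false) {
        // None of the candidates fit, so give up instead of guessing.
        return false;
    }
    return mb_convert_encoding($string, 'UTF-8', $from);
}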

If I'm not wrong, there is a function called utf8_encode. It assumes the input is ISO-8859-1, so it works well EXCEPT if your string is already in UTF-8, in which case it gets encoded a second time.
http://php.net/manual/en/function.utf8-encode.php
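The "already in UTF-8" problem can be sketched like this. mb_convert_encoding() from ISO-8859-1 stands in for utf8_encode() here (my substitution, since utf8_encode() is deprecated as of PHP 8.2), and toUtf8Once() is a hypothetical guard, not a library function.

```php
<?php
// Hypothetical guard: convert from ISO-8859-1 only when the input is not
// already valid UTF-8.
function toUtf8Once($string)
{
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string;
    }
    return mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "\xC4";     // the letter Ä in ISO-8859-1
$utf8   = "\xC3\x84"; // the letter Ä in UTF-8

// Converting text that is already UTF-8 encodes it a second time.
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($doubleEncoded), "\n";
echo bin2hex(toUtf8Once($latin1)), "\n";
```

The guard is still a heuristic: ISO-8859-1 text whose bytes happen to form valid UTF-8 would be left unconverted.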

Questions about iPhone emoji and web pages


Okay, so emoji basically shows the above on a computer. Is that another programming language? So how do I put those little boxes into a php file? When I put it into a php file, it turns into question marks and what not. Also, how can I store these in a MySQL without it turning into question marks and other weird things?

how do I put those little boxes into a php file?
Same way as any other Unicode character. Just paste them and make sure you're saving the PHP file and serving the PHP page as UTF-8.
When I put it into a php file, it turns into question marks and what not
Then you have an encoding problem. Work it out with Unicode characters you can actually see properly first, for example ąαд™日本, before worrying about the emoji.
Your PHP file should be saved as UTF-8; the page it produces should be served as Content-Type: text/html; charset=UTF-8 (or with an equivalent meta tag); the MySQL database should be using a UTF-8 collation to store data and PHP should be talking to MySQL using UTF-8.
However. Even handling everything correctly like this, PCs will still not show the emoji. That's because:
they don't have fonts that include shapes for those characters, and
emoji are still completely unstandardised. Those characters you posted are in the Unicode Private Use Area, which means they don't have any official meaning at all.
Each network in Japan uses different character codes for their emoji, mapped to different areas in the PUA. So even on another mobile phone, it probably won't display the correct character, unless you spend ages manually converting emoji codes for different networks. I'm guessing the ones you posted above are from SoftBank (iPhone?).
There is an ongoing proposal led by Google and Apple to collate the different networks' emoji and give them a proper standardised place in Unicode. Until then, getting emoji to display consistently across networks is an exercise in unhappiness. See the character overview from the standardisation work to see how much converting you would have to do.
God, I hate emoji. All that pain for such a load of useless twee rubbish.

This has nothing to do with programming languages, just with encoding and fonts. As a very brief overview: Every character is stored by its character code (e.g.: 0x41 = A, 0x42 = B, etc), which is rendered as a meaningful character on your screen using a font (which says "the character with the code 0x41 should look like this ...").
These emoji occupy the "private use area" of the Unicode table, which is a range of codes that are undefined and free for anyone to use. That makes them perfectly valid character codes, it's just that no standard font has an appropriate character to display for them, since they are undefined. Only the iPhone and other handhelds, mostly in Japan, have appropriate icons for these codes. This is done to save bandwidth; instead of transmitting relatively large image files back and forth, emoji can be transmitted using a single character code.
As for how to store them: They should be storable as is, as long as you don't try to convert them to another encoding, in which case they may get lost. Just be aware that they only make sense on the iPhone and other SoftBank phones in Japan.
Character Viewer http://img.skitch.com/20091110-e7nkuqbjrisabrdipk96p4yt59.png
If you're on OSX you can copy and paste the character into the Character Viewer to find out what it is. I think there's a similar Character Map on Windows (albeit inferior ;-P). You could put it through PHP's ord(), but that only returns the value of a single byte, so on its own it is only useful for ASCII characters. See the discussion on the ord page for UTF8 functions.
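On newer PHP versions the code point can be read directly. A minimal sketch, assuming PHP 7.2 or later (for mb_ord()) with mbstring; U+E001 is an arbitrary Private Use Area code point picked for illustration, not a known emoji mapping, and isPrivateUse() is a hypothetical helper.

```php
<?php
$pua = "\u{e001}"; // arbitrary Private Use Area character, 3 bytes in UTF-8

$firstByte = ord($pua);             // value of the first byte only
$codePoint = mb_ord($pua, 'UTF-8'); // the full code point

// Hypothetical helper: is the first character in the BMP Private Use Area
// (U+E000 to U+F8FF)?
function isPrivateUse($char)
{
    $cp = mb_ord($char, 'UTF-8');
    return $cp >= 0xE000 && $cp <= 0xF8FF;
}

printf("first byte: 0x%X, code point: U+%04X\n", $firstByte, $codePoint);
var_dump(isPrivateUse($pua), isPrivateUse('A'));
```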
BTW, just for the fun of it, these characters display fine on the iPhone as is, because the iPhone has a font which has icons for them:
iPhone http://img.skitch.com/20091110-bjt3tutjxad1kw4p9uhem5jhnk.png

I'm using FF3.5 and WinXP. I see little boxes in my browser, too.
This tells me the string requires a character set not installed on my computer.
When you put the string into a PHP file, the question marks tell you the same thing: your computer doesn't know how to display the characters.
You could store these emoji characters in MySQL if you encoded them differently, probably using UTF-8.
Do a web search for character encoding, as it relates to MySQL.
