Handling Multibyte characters in php

Handling Multibyte characters in php - php

Am working on php based mime parser. If the body contains string like Iñtërnâtiônàlizætiøn we see that It is getting converted into IÃ±tÃ«rnÃ¢tiÃ´nÃ lizÃ¦tiÃ¸n. Can somebody suggest how to handle (what functions) for such string ?
So we are doing the following
Using Zend Library connecting to the IMAP server
mail = new Zend_Mail_Storage_Imap($params);
Read the message using
$message = $mail->getMessage($i);
in the loop.
When we print the $message we see the string e.g. Iñtërnâtiônàlizætiøn printed as IÃ±tÃ«rnÃ¢tiÃ´nÃ lizÃ¦tiÃ¸n.
What I need is if there is someway by which we can retain the original string? And this is just one example we may run into other multi-byte characters, so what to know how we handle this generically?

There's no specific function for that, you simply need to treat the string in the encoding it's in. A string is just a blob of bytes, it gets turned into characters by whatever is interpreting those bytes as text. And that something needs to use the correct encoding for that, otherwise those bytes are not interpreted as the characters they were supposed to be. See Handling Unicode Front To Back In A Web App for a rundown of the common pitfalls.

as mentioned in the comment, you can use php mb_* functions to work with multibyte characters. Here is just an example to detect the encoding of a string:
$s="Iñtërnâtiônàlizætiøn";
echo mb_detect_encoding($s); //UTF-8
then you can work with this, use utf8_decode($s) or any mb_ functions to convert the string to your wished encoding.

Related

Why is php converting certain characters to '?'

Everything in my code is running my database(Postgresql) is using utf8 encoding, I've checked the php.ini file its encoding is utf8, I tried debugging to see if it was any of the functions I used that were doing this, but nothing everything is running as expected, however after my frontend sends a post request to backend server through curl for some text to be inserted in the database, some characters like 'da' are converted to '?' in postgre and in memcached, I think php is converting them to Latin-1 again after the request reaches the other side for some reason becuase I use utf8_encode before the request and utf8_decode on the other side
this is the code to send the request
$pre_opp->
Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST",str_replace(" ","%",utf8_encode($bio)));
this is how the backend system receives this
$data= str_replace("%"," ",utf8_decode($_POST["Data"]));

Don't replace " " with "%".
Use urlencode and urldecode instead of utf8_encode and utf8_decode - It will give you a clean alphanumeric representation of any character to easily transport your data.
If everything in your environment defaults to UTF-8, you shouldn't need utf_encode and utf_decode anyways, I guess. But if you still do, you could try combining both like this:
Send_Request_To_BackEnd("/Settings",$school_name,$uuid,"Upload_Bio","POST", urlencode(utf8_encode($bio)));
and
$data= str_replace("%"," ",utf8_decode(urldecode($_POST["Data"])));

You say this like it's a mystery:
I think php is converting them to Latin-1 again after the request reaches the other side for some reason
But then you give the reason yourself:
because I use utf8_encode before the request and utf8_decode on the other side
That is exactly what uf8_decode does: it converts UTF-8 to Latin-1.
As the manual explains, this is also where your '?' replacements come from:
This function converts the string string from the UTF-8 encoding to ISO-8859-1. Bytes in the string which are not valid UTF-8, and UTF-8 characters which do not exist in ISO-8859-1 (that is, characters above U+00FF) are replaced with ?.
Since you'd picked the unfortunate replacement of % for space, sequences like "%da" were being interpreted as URL percent escapes, and generating invalid UTF-8 strings. You then asked PHP to convert them to Latin-1, and it couldn't, so it substituted "?".
The simple solution is: don't do that. If your data is already in UTF-8, neither of those functions will do anything but mess it up; if it's not already in UTF-8, then work out what encoding it's in and use iconv or mb_convert_encoding to convert it, once. See also "UTF-8 all the way through".
Since we can't see your Send_Request_To_BackEnd function, it's hard to know why you thought you needed it. If you're constructing a URL with that string, you should use urlencode inside your request sending code; you shouldn't need to decode it the other end, PHP will do that for you.

How to convert binary string to normal string in php

Description of the problem
I am trying to import email content into a database table. Sometimes, I get an SQL error while inserting a message. I found that it fails when the message content is a binary string instead of a string.
For exemple, I get this in the console if I print a message that is imported successfully (Truncated)
However, I get this with problematic import:
I found out that if I use the function utf8_encode, I am successfully able to import it into SQL. The problem is that it "breaks" previously successfull imports accented characters:
What I have tried
Detect if the string was a binary string with ctype_print, returned false for both non binary and binary string. I would have then be able to call utf8_encode only if it was binary
Use of unpack, did not work
Detect string encoding with mb_detect_encoding, return UTF-8 for both
use iconv , failed with iconv(): Detected an illegal character in input string
Cast the content as string using (string) / settype($html, 'string')
Question
How can I transform the binary string in a normal string so I can then import it in my database without breaking accented characters in other imports?

This is pretty late, but for anyone else reading... Apparently the b prefix is meaningless in PHP, it's a bit of a red herring. See: https://stackoverflow.com/a/51537602/6111743
What encodings did you pass to iconv()? This is the correct solution but you have to give it the correct first argument, which depends on your input. In my example I use "LATIN1" because that turned out to be the correct way to interpret my input but your use case may vary.
You can use mb_check_encoding() to check if it is valid UTF-8 or not. This returns a boolean.
Assuming the question is really something like "how to convert extended ascii string to valid utf-8 string in PHP" - Here is how I did it in my application:
if(!mb_check_encoding($string)) {
$string = iconv("LATIN1", "UTF-8//TRANSLIT//IGNORE", $string);
}
The "TRANSLIT" part tells it to attempt transliteration, that's optional for you. The "IGNORE" will prevent it from throwing Detected an illegal character in input string if it does detect one; instead the character will just get ignored, meaning, removed. Your use case may not need either of these.
When you're debugging, I recommend just using "UTF-8" as the second argument so you can see what it's doing. It's useful to see if it throws an error. For me, I had given it the wrong first argument at first (I wrote "ASCII" instead of "LATIN-1") and it threw the illegal character error on an accented character. That error went away once I passed it the correct encoding.
By the way, mb_detect_encoding() was no help to me in figuring out that Latin-1 was what I needed. What helped was dumping the contents of unpack("C*", $string) to see what exact bytes were in there. That's more debugging advice than solution but worth mentioning in case it helps.

Why is my PHP urlencode not functioning as examples on internet?

Why does my urlencode() produce something different than I expected?
This might be my expectations being wrong but then I would be even more puzzled.
example
urlencode("ä");
expectations = returns %C3%A4
reality = returns %E4
Where have I gone wrong in my expections? It seems to be linked to encoding. But I'm not very familiar in what I should do/use.
Should I change something on my server to that the function uses the right encoding?

urlencode encodes the raw bytes in your string into a percent-encoded representation. If you expect %C3%A4 that means you expect the UTF-8 byte representation of "ä". If you get %E4 that means your string is actually encoded in ISO-8859-1 instead.
Encode your string in UTF-8 to get the expected result. How to do this depends on where this string comes from. If it's a string literal in your source code file, save the file as UTF-8 in your text editor. If it comes from a database, see UTF-8 all the way through.
For more background information, see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

Is there any downside to save all my source code files in UTF-8?

If that's relevant (it very well could be), they are PHP source code files.

There are a few pitfalls to take care of:
PHP is not aware of the BOM character certain editors or IDEs like to put at the very beginning of UTF-8 files. This character indicates the file is UTF-8, but it is not necessary, and it is invisible. This can cause "headers already sent out" warnings from functions that deal with HTTP headers because PHP will output the BOM to the browser if it sees one, and that will prevent you from sending any header. Make sure your text editor has a UTF-8 (No BOM) encoding; if you're not sure, simply do the test. If <?php header('Content-Type: text/html') ?> at the beginning of an otherwise empty file doesn't trigger a warning, you're fine.
Default string functions are not multibyte encodings-aware. This means that strlen really returns the number of bytes in the string, not the actual number of characters. This isn't too much of a problem until you start splicing strings of non-ASCII characters with functions like substr: when you do, indices you pass to it refer to byte indices rather than character indices, and this can cause your script to break non-ASCII characters in two. For instance, echo substr("é", 0, 1) will return an invalid UTF-8 character because in UTF-8, é actually takes two bytes and substr will return only the first one. (The solution is to use the mb_ string functions, which are aware of multibyte encodings.)
You must ensure that your data sources (like external text files or databases) return UTF-8 strings too, because PHP makes no automagic conversion. To that end, you may use implementation-specific means (for instance, MySQL has a special query that lets you specify in which encoding you expect the result: SET CHARACTER SET UTF8 or something along these lines), or if you couldn't find a better way, mb_convert_encoding or iconv will convert one string into another encoding.

It's actually usually recommended that you keep all sources in UTF8. It won't matter size of regular code with latin characters at all, but will prevent glitches with any special characters.

If you are using any special chars in e.g string values, the size is a little bit bigger, but that shouldn't matter.
Nevertheless my suggestion is, to always leave the default format. I spent so many hours because there was an error with the format saving and all characters changed.
From a technical point of few, there isn't a difference!

Very relevant, the PHP parser may start to output spurious characters, like a funky unside-down questionmark. Just stick to the norm, much preferred.

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medÃºlla.
I've tried:
Setting the column as utf_8 encoding
Setting the table as utf_8 encoding
Setting the entire database as utf_8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt

To transform an UTF-8 string into an URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $strt);
The IGNORE part tells iconv() not to raise an exception when facing a character it can't manage, and the TRANSLIT part converts an UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
Next step is to preg_replace() spaces into underscores and substitute or drop any character which is unsafe within an URL, either with preg_replace() or urlencode().
As for the database stuff, you really should have done all this setting stuff before INSERTing UTF-8 content. Changing charset to an existing table is somewhat like changing a file extension in Windows - it doesn't convert a JPEG into a GIF. But don't worry and remember that the database will return you byte by byte exactly what you've stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.

I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only ‘slugs’ in URLs, it is actually possible to have web addresses including non-ASCII characters. eg.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should UTF-8 and %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (except sometimes not IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) “ANSI” code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write '\xc3\xba' to avoid the problem.
Running SET NAMES utf8 on the database before querying
Use mysql_set_charset() in preference.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.