I have a set of strings, some of which contain non-ASCII characters.
How do I get strings with only ASCII characters using a PHP script?
Thanks a lot in advance for any guidance.
<?php
// Remove every run of characters outside the printable ASCII range (space through ~)
echo preg_replace('/[^\x20-\x7E]+/', '', 'Standard ASCII and some gärbägè');
?>
Probably the easiest option is to use the iconv function (if the iconv extension is available), using either the //IGNORE or //TRANSLIT option (see the documentation), if the behavior suits your needs.
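As a rough sketch of the iconv approach: the helper name to_ascii is made up for illustration, and the code assumes the input is UTF-8. What //TRANSLIT produces for a given character depends on the iconv implementation and the locale, so the sketch also strips any leftover non-ASCII bytes and falls back to plain stripping if iconv fails or is missing.

```php
<?php
// Hypothetical helper, not part of PHP: reduce a UTF-8 string to ASCII.
function to_ascii(string $text): string
{
    $converted = false;
    if (function_exists('iconv')) {
        // @ hides the notice iconv raises when it cannot convert a character.
        $converted = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);
    }
    if ($converted === false) {
        // iconv is unavailable or failed: work on the original string.
        $converted = $text;
    }
    // Drop any remaining byte outside the 7-bit ASCII range.
    return preg_replace('/[^\x00-\x7F]+/', '', $converted) ?? '';
}

echo to_ascii('Standard ASCII and some gärbägè'), "\n";
```

The exact replacement characters vary between systems, but the result should contain only ASCII bytes.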
I know that if I use multibyte (UTF-8) characters in the pattern, I have to use the mb_ functions or add the u option to the pattern of the preg_ functions.
But when multibyte (UTF-8) characters appear only in the subject and the pattern contains only ASCII characters, do the preg_ functions (without the u option) work correctly?
I know that in this case I have to use an mb_ function or add the u option to the pattern:
$str = preg_replace("/$utf8_multibyte_pattern/", '', $str);
I want to know whether this code (the u option is not used) is safe:
$ascii_pattern = "[a-zA-Z0-9'$#\\\"%&()\-~|~=!#`{}\[\]:;+*\/.,_<>?_\n\t\r]";
$multibyte_str = preg_replace("/$ascii_pattern/", '', $utf8_multibyte_str);
I may have found the answer myself.
But if you know character encodings well, please comment on this answer or post another one.
According to Wikipedia, the bytes that make up multibyte UTF-8 characters never overlap with ASCII byte values.
http://en.wikipedia.org/wiki/UTF-8#Advantages
The ASCII characters are represented by themselves as single bytes that do not appear anywhere else, which makes UTF-8 work with the majority of existing APIs that take bytes strings but only treat a small number of ASCII codes specially. This removes the need to write a new Unicode version of every API, and makes it much easier to convert existing systems to UTF-8 than any other Unicode encoding.
I think this means that the preg_ functions with an ASCII-only pattern and no u option are safe for a multibyte (UTF-8) subject.
So this code (without the u option)
$multibyte_str = preg_replace("/$ascii_pattern/", '', $utf8_multibyte_str);
and this code (with the u option)
$multibyte_str = preg_replace("/$ascii_pattern/u", '', $utf8_multibyte_str);
behave the same.
Both work correctly.
Am I correct?
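One way to check this claim is to run the same ASCII-only character class with and without the u option on a valid UTF-8 subject. The sample subject and pattern below are made up for illustration, and the file is assumed to be saved as UTF-8. The sketch also shows one difference that is independent of the pattern: with the u option, an invalid UTF-8 subject makes preg_replace return null.

```php
<?php
// Illustrative subject and pattern; the pattern contains ASCII characters only.
$subject = 'abc ふー 123 šal!';
$pattern = '[a-z0-9!]';

$without_u = preg_replace("/$pattern/", '', $subject);
$with_u    = preg_replace("/$pattern/u", '', $subject);

// Both calls remove the same ASCII bytes and leave the multibyte characters intact.
var_dump($without_u === $with_u);

// The u option validates the subject: invalid UTF-8 yields null.
$invalid = "abc\xFF";
var_dump(preg_replace("/$pattern/u", '', $invalid));
// Without it, the stray byte is simply passed through.
var_dump(bin2hex(preg_replace("/$pattern/", '', $invalid)));
```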
As far as I know, it is safe as long as you use the Unicode modifier (/u), like so:
$ascii_pattern = "[a-zA-Z0-9'$#\\\"%&()\-~|~=!#`{}\[\]:;+*\/.,_<>?_\n\t\r]";
$multibyte_str = preg_replace("/$ascii_pattern/u", '', $utf8_multibyte_str);
To see more information on unicode characters, see here
Has anyone here used php_writeexcel (http://www.bettina-attack.de/jonny/view.php/projects/php_writeexcel/)?
I would like to know if there is an easy way to enable UTF-8 support. php_writeexcel exports HTML to Microsoft Excel documents, yet it can't display certain characters:
http://pastebin.com/AgVpph7F
Perhaps I could solve this with some php functions?
Thanks for your help!
For fields with special characters (e.g. French), I use utf8_decode() to get the special characters to show up correctly.
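For illustration, utf8_decode() converts UTF-8 to ISO-8859-1, and it is deprecated as of PHP 8.2. A minimal sketch of the same conversion with mb_convert_encoding, assuming the mbstring extension is loaded (the sample string is made up):

```php
<?php
// "é" written as explicit UTF-8 bytes so the example does not depend on the file encoding.
$utf8 = "caf\xC3\xA9";

// Convert UTF-8 to ISO-8859-1, which is what utf8_decode() does.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($latin1), "\n"; // 636166e9: "é" is now the single byte 0xE9
```

Characters that have no ISO-8859-1 equivalent cannot survive this conversion.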
php_writeexcel is a port of the Perl module Spreadsheet::WriteExcel. However, the port dates from a time when Unicode strings weren't supported in the underlying Excel file format.
Later (2.xx) versions of Spreadsheet::WriteExcel have native support for Unicode but they haven't been ported to PHP.
As such you won't be able to handle Unicode strings with php_writeexcel.
It isn't a perfect solution, but iconv will convert some of those characters.
http://www.php.net/manual/en/function.iconv.php
Depending on how you want the unsupported characters to be handled:
iconv('UTF-8', 'ISO-8859-1//IGNORE','ėčščįęščūųüó');
output: üó
iconv('UTF-8', 'ISO-8859-1//TRANSLIT','ėčščįęščūųüó');
output: ??????????üó
Looking at all those string functions, I sometimes get confused. Some people use the mb_ functions all the time, others use the plain ones, so the question is simple...
When should I use mb_strpos(), and when should I go with the plain one (strpos())?
And yes, I'm aware that mb_ stands for multi-byte, but does that really mean that if I'm working only with UTF-8 encoded strings, I should stick with the mb_ functions?
Thanks in advance!
You should use the mb_ functions whenever you expect to work with text that's not pure ASCII. I.e. you can work with the regular string functions, even if you're using UTF-8, as long as all the strings you're using them on only contain ASCII characters.
strpos('foobar', 'foo') // fine in any (ASCII-compatible) encoding, including UTF-8
strpos('ふーばー', 'ふー') // won't work as expected, use mb_strpos instead
Yes, if you are working with UTF-8 (a multi-byte encoding: one character can use more than one byte), you should use the mb_* functions.
The non-mb functions work on bytes, not characters, which is fine when 1 character == 1 byte; but that's not the case with (for example) UTF-8.
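To make the byte-versus-character difference concrete, here is a small sketch. It assumes the file is saved as UTF-8 and that the mbstring extension is loaded; the sample string is made up.

```php
<?php
$haystack = 'あいふー'; // four characters, three bytes each in UTF-8
$needle   = 'ふー';

// The plain functions count bytes.
echo strlen($haystack), "\n";          // 12
echo strpos($haystack, $needle), "\n"; // 6 (byte offset)

// The mb_ functions count characters.
echo mb_strlen($haystack, 'UTF-8'), "\n";             // 4
echo mb_strpos($haystack, $needle, 0, 'UTF-8'), "\n"; // 2 (character offset)
```

The byte offset is still usable with other byte-based functions such as substr(), but mixing byte offsets with mb_ functions (or the reverse) gives wrong results.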
I'd say yes, here's the description from the php documentation:
mbstring provides multibyte specific string functions that help you deal with multibyte encodings in PHP. In addition to that, mbstring handles character encoding conversion between the possible encoding pairs. mbstring is designed to handle Unicode-based encodings such as UTF-8 and UCS-2 and many single-byte encodings for convenience....
If you're not sure that the mbstring extension is loaded, you should check first, because mbstring is a non-default extension.
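A minimal sketch of such a check follows. The helper name utf8_length is hypothetical, and the fallback assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: count characters in a UTF-8 string with or without mbstring.
function utf8_length(string $text): int
{
    if (extension_loaded('mbstring')) {
        return mb_strlen($text, 'UTF-8');
    }
    // Fallback: with the u modifier, "." matches one code point at a time
    // (the s modifier lets it match newlines as well).
    return (int) preg_match_all('/./su', $text);
}

echo utf8_length('moj šal'), "\n"; // 7 characters, although strlen() reports 8 bytes
```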
Should I be using mb_convert_case with MB_CASE_TITLE or ucwords? Or something else? What will the differences be?
It depends.
mb_convert_case() is multibyte safe. ucwords() is not.
mb_convert_case() requires an extension that is not always available. ucwords() is always available.
So if your application will only ever use single-byte encodings then ucwords() gives you better portability.
But if your application might need to process multi-byte encodings then ucwords() will fail you.
function uc_words($string){
return mb_convert_case($string, MB_CASE_TITLE, "UTF-8");
}
MB means multi-byte, so mb_convert_case can convert non-ASCII characters, while ucwords can convert only ASCII.
If you use ucwords on "moj šal", you will get "Moj šal"; if you use the multi-byte conversion, you will get "Moj Šal". That's it.
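The difference can be checked with a short sketch, assuming the file is saved as UTF-8 and the mbstring extension is loaded:

```php
<?php
$text = 'moj šal';

// ucwords works on bytes, so the leading byte of "š" is not recognized as a letter.
echo ucwords($text), "\n";

// mb_convert_case decodes the string as UTF-8 and title-cases every word.
echo mb_convert_case($text, MB_CASE_TITLE, 'UTF-8'), "\n"; // Moj Šal
```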
Hey, I'm using PHP 5 and need to communicate with another server that runs completely in Unicode. I need to convert every string to Unicode before sending it over. This seems like an easy task, but I haven't been able to find a way to do it yet. Is there a simple function that returns a Unicode string? i.e. convert_to_unicode("the string i'm sending")
You can use the utf8_encode and utf8_decode functions, which convert between ISO-8859-1 and UTF-8. You may also need the Multibyte String (mb_*) functions to deal with other specific encodings.
You can use any of:
utf8_encode / utf8_decode
The mb_* Multibyte String functions; in your case, see mb_convert_encoding
The iconv extension and its iconv function.
You can use the function utf8_encode
OK, iconv worked. The trouble is that this is a Windows server, so I had to do it in little-endian. UTF-16LE works. Here's the working code:
iconv("UTF-8", "UTF-16LE", "data to send")
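To see what that conversion actually produces, here is a small sketch with a made-up sample string. It assumes the iconv extension is available.

```php
<?php
$utf8 = 'Hi';

// Convert to UTF-16 little-endian: each ASCII character becomes
// its byte followed by a zero byte, and no byte order mark is added.
$utf16le = iconv('UTF-8', 'UTF-16LE', $utf8);
echo bin2hex($utf16le), "\n"; // 48006900

// Converting back restores the original string.
$roundtrip = iconv('UTF-16LE', 'UTF-8', $utf16le);
var_dump($roundtrip === $utf8); // bool(true)
```

If the iconv extension is not available, mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8') should perform the same conversion.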