Converting to and from Unicode in PHP - php

Hey, I'm using PHP 5 and need to communicate with another server that runs completely in Unicode. I need to convert every string to Unicode before sending it over. This seems like an easy task, but I haven't been able to find a way to do it yet. Is there a simple function that returns a Unicode string, e.g. convert_to_unicode("the string i'm sending")?

You can use the utf8_encode and utf8_decode functions, but note that they only convert between ISO-8859-1 and UTF-8. For any other encoding, use the Multibyte String (mb_*) functions.

You can use any of:
utf8_encode / utf8_decode
The mb_* Multibyte String functions; in your case, see mb_convert_encoding
The iconv extension and its iconv function.

You can use the function utf8_encode

OK, iconv worked. The trouble is that the other machine is a Windows server, so I had to use little-endian byte order. UTF-16LE works. Here's the working code:
iconv("UTF-8", "UTF-16LE", "data to send")
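For reference, the conversion above can be wrapped in both directions. This is a minimal sketch, assuming the input is valid UTF-8, that the remote side expects UTF-16LE without a byte order mark, and that PHP 7 or later is available (for the type declarations). The helper names to_utf16le and from_utf16le are made up for illustration.

```php
<?php
// Hypothetical helpers; the names are illustrative, not part of PHP.

// Convert a UTF-8 string to UTF-16LE bytes (no BOM is added).
function to_utf16le(string $utf8): string
{
    $out = iconv('UTF-8', 'UTF-16LE', $utf8);
    if ($out === false) {
        throw new RuntimeException('Input is not valid UTF-8');
    }
    return $out;
}

// Convert UTF-16LE bytes received from the other server back to UTF-8.
function from_utf16le(string $utf16): string
{
    $out = iconv('UTF-16LE', 'UTF-8', $utf16);
    if ($out === false) {
        throw new RuntimeException('Input is not valid UTF-16LE');
    }
    return $out;
}

$wire = to_utf16le('data to send');
echo strlen($wire), "\n";       // byte length of the UTF-16LE payload
echo from_utf16le($wire), "\n"; // round trip back to UTF-8
```

Every ASCII character takes two bytes in UTF-16LE, so the payload is twice as long as the original string here.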

Related

PHP greek url convert

I have a URL like: domain.tld/Σχετικά_με_μας
[edit]
Reading the $_SERVER['REQUEST_URI'] I get to work with:
%CE%A3%CF%87%CE%B5%CF%84%CE%B9%CE%BA%CE%AC_%CE%BC%CE%B5_%CE%BC%CE%B1%CF%82
[/edit]
In PHP I need to convert it to HTML entities. I get pretty far with:
htmlentities(urldecode($navstring), ENT_QUOTES, 'UTF-8');
It results in:
&Sigma;&chi;&epsilon;&tau;&iota;&kappa;ά_&mu;&epsilon;_&mu;&alpha;&sigmaf;
but the 'ά' stays a literal 'ά'. I need it converted to:
&#940;
I'd really appreciate help. I need a universal solution, not a "string replace".
I have been playing around a little, and the following worked. Use mb_convert_encoding instead of htmlentities:
mb_convert_encoding(urldecode($navstring), 'HTML-ENTITIES', 'UTF-8');
// string(90) "domain.tld/&Sigma;&chi;&epsilon;&tau;&iota;&kappa;&#940;_&mu;&epsilon;_&mu;&alpha;&sigmaf;"
See the mb_convert_encoding documentation.
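A possible alternative, sketched here with some assumptions: mbstring's 'HTML-ENTITIES' pseudo-encoding is deprecated as of PHP 8.2, so on newer versions mb_encode_numericentity() may be a better fit. The sketch assumes numeric entities are acceptable for every non-ASCII character (it emits &#931; rather than &Sigma;). The helper name to_numeric_entities and the conversion map values are illustrative choices, not taken from the question.

```php
<?php
// Percent-encoded request path, as read from $_SERVER['REQUEST_URI'] in the question.
$navstring = '%CE%A3%CF%87%CE%B5%CF%84%CE%B9%CE%BA%CE%AC_%CE%BC%CE%B5_%CE%BC%CE%B1%CF%82';

// Conversion map: [first code point, last code point, offset, mask].
// This range covers everything above 7-bit ASCII (an assumption for this sketch).
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

// Hypothetical helper: replace each non-ASCII character with &#NNN;.
function to_numeric_entities(string $utf8, array $convmap): string
{
    return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
}

echo to_numeric_entities(urldecode($navstring), $convmap), "\n";
```

ASCII characters such as the underscores pass through unchanged, so the result can be placed directly into an HTML page.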
Information
All modern web browsers understand UTF-8 character encoding.
My advice would be:
Always know the character encoding of the data you are using.
Store your data as UTF-8.
Output data as UTF-8.
The mbstring PHP extension doesn't just manipulate Unicode strings. It also converts multibyte strings between various character encodings.
Use the mb_detect_encoding() and mb_convert_encoding() functions to convert strings from one character encoding to another. Note that mb_detect_encoding() can only guess, so prefer a known source encoding where you have one.
PHP needs to know!
You also need to tell PHP that you are working with UTF-8. To set the default, edit your php.ini file:
default_charset = "UTF-8"
That default value is added to the Content-Type header returned by PHP unless you set the header yourself with the header() function:
header('Content-Type: application/json;charset=utf-8');
Keep in mind
The default character set is used by a lot of functions in PHP, such as:
htmlentities()
htmlspecialchars()
all the mbstring functions
...
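As a rough sketch of the advice above, the same defaults can also be set at runtime and the encoding passed explicitly. It assumes the mbstring extension is loaded and that php.ini cannot be edited; the helper name escape_html is made up for illustration. Naming the encoding in each call avoids depending on whatever default the server happens to have.

```php
<?php
// Runtime equivalent of the php.ini setting; do this early in the request.
ini_set('default_charset', 'UTF-8');
mb_internal_encoding('UTF-8');

// Hypothetical helper: escape text for HTML, naming the encoding explicitly.
function escape_html(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

echo escape_html('<b>Σχετικά</b>'), "\n";
```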

How can I encode from unicode to GB18030 in PHP?

I know how to do this in Python:
In [1]: u'中华人民共和国'.encode('GB18030').encode('base64')
Out[1]: '1tC7qsjLw/G5srrNufo=\n'
But I need to do this in PHP, and I'm not sure how to do it.
You can use iconv to convert the string from UTF-8 (or whatever your initial encoding is) to GB18030, then base64_encode the result. E.g.:
echo base64_encode(iconv('UTF-8', 'GB18030', '中华人民共和国'));
outputs:
1tC7qsjLw/G5srrNufo=
Note that PHP doesn't have native Unicode strings - they're just a bunch of bytes, so you'll need to specify the encoding the string is in. If it's a string literal in your PHP it'll be whatever encoding you've used for the file.
echo mb_convert_encoding($str, "gb18030", "UTF-8");
Replace UTF-8 with whatever encoding the text is already in. The third parameter is optional; if you omit it, the mbstring internal encoding (see mb_internal_encoding) is assumed.
See the mb_convert_encoding documentation.
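The two answers can be compared side by side, and the conversion reversed, roughly as follows. This is a sketch that assumes the source file is saved as UTF-8 and that the local iconv and mbstring builds both support GB18030, which is not guaranteed on every platform.

```php
<?php
// Assumes this file is saved as UTF-8.
$text = '中华人民共和国';

// Encode with each extension, then Base64 the resulting bytes.
$viaIconv = base64_encode(iconv('UTF-8', 'GB18030', $text));
$viaMb    = base64_encode(mb_convert_encoding($text, 'GB18030', 'UTF-8'));

echo $viaIconv, "\n";
var_dump($viaIconv === $viaMb); // do the two extensions agree?

// Decoding reverses the two steps.
$decoded = iconv('GB18030', 'UTF-8', base64_decode($viaIconv));
var_dump($decoded === $text);
```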

When should I use mb_strpos(); over strpos();?

Looking at all those string functions, I sometimes get confused. Some people use the mb_ functions all the time, others the plain ones, so the question is simple...
When should I use mb_strpos(); and when should I go with the plain one (strpos();)?
And yes, I'm aware that mb_ stands for multi-byte, but does that really mean that if I'm working only with UTF-8 encoded strings, I should stick with the mb_ functions?
Thanks in advance!
You should use the mb_ functions whenever you expect to work with text that's not pure ASCII. I.e. you can work with the regular string functions, even if you're using UTF-8, as long as all the strings you're using them on only contain ASCII characters.
strpos('foobar', 'foo'); // fine in any (ASCII-compatible) encoding, including UTF-8
strpos('ふーばー', 'ふー'); // offsets and lengths are counted in bytes, not characters; use mb_strpos instead
Yes, if working with UTF-8 (which is a multi-byte encoding: one character can use more than one byte), you should use the mb_* functions.
The non-mb functions work on bytes, not characters, which is fine when 1 character == 1 byte; but that's not the case with (for example) UTF-8.
I'd say yes, here's the description from the php documentation:
mbstring provides multibyte specific string functions that help you deal with multibyte encodings in PHP. In addition to that, mbstring handles character encoding conversion between the possible encoding pairs. mbstring is designed to handle Unicode-based encodings such as UTF-8 and UCS-2 and many single-byte encodings for convenience....
If you're not sure that the mbstring extension is loaded, check first (for example with extension_loaded('mbstring')), because mbstring is a non-default extension.
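The byte-versus-character difference can be seen in a short sketch. It assumes PHP 7 or later (for the \u{...} escapes, used so the example does not depend on how the file is saved) and falls back to a message if mbstring is missing.

```php
<?php
// 'ふーばー' written with code point escapes: 4 characters, 3 bytes each in UTF-8.
$haystack = "\u{3075}\u{30FC}\u{3070}\u{30FC}";
$needle   = "\u{3070}\u{30FC}"; // 'ばー'

if (!extension_loaded('mbstring')) {
    exit("mbstring is not available\n");
}

var_dump(strlen($haystack));                          // int(12): bytes
var_dump(mb_strlen($haystack, 'UTF-8'));              // int(4): characters
var_dump(strpos($haystack, $needle));                 // int(6): byte offset
var_dump(mb_strpos($haystack, $needle, 0, 'UTF-8'));  // int(2): character offset
```

Both functions find the needle; what differs is the unit of the returned offset, which matters as soon as the result is fed into substr() or mb_substr().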

Weird char (�) appears after doing html_entity_decode

In a separate YML file I have:
flags: [<img src="/images/cms_bo/icons/english.png" alt="English"/>]
When I call this in my code, it's not interpreted, so I used html_entity_decode.
It works, but there is one strange character just before my image: �
<?php echo html_entity_decode($form['lang']->render()); ?>
All my files are UTF-8 encoded. Do you have any idea what I've missed?
PS:
public static function getI18nCulturesForChoice()
{
return array_combine(self::getI18nCultures(), self::getI18nCulturesFlags());
}
Try using html_entity_decode($form['lang']->render(),ENT_QUOTES, "UTF-8");
Prior to PHP 5.4.0, the default character set for html_entity_decode was ISO-8859-1! If you're working with UTF-8, you will need to use the third argument to the function to tell it to deal with UTF-8 instead of assuming ISO-8859-1.
This is blindly assuming you're using an older version of PHP.
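One way the stray character can arise, sketched under the assumption that the rendered markup contains an entity such as &nbsp; (the question does not show the rendered output, so the sample string below is invented): decoded as ISO-8859-1 it becomes the single byte 0xA0, which is not valid UTF-8 on its own and is displayed as � on a UTF-8 page.

```php
<?php
// Invented sample input; the real output of render() is not shown in the question.
$html = '&nbsp;&lt;img alt=&quot;English&quot;/&gt;';

$latin1 = html_entity_decode($html, ENT_QUOTES, 'ISO-8859-1');
$utf8   = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

echo bin2hex(substr($latin1, 0, 1)), "\n"; // a0   (one byte, invalid as UTF-8)
echo bin2hex(substr($utf8, 0, 2)), "\n";   // c2a0 (U+00A0 encoded as UTF-8)

var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
```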
If you are using a newer version of PHP, consider using iconv with the //IGNORE//TRANSLIT flags to try and remove any bad UTF-8 sequences before passing the string into html_entity_decode.
Maybe your file has a Byte Order Mark (BOM) set.
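If a BOM is the cause, it can be stripped before the content is used. A minimal sketch, with the hypothetical helper name strip_utf8_bom and an invented sample string standing in for the YML file contents:

```php
<?php
// Hypothetical helper: drop a leading UTF-8 byte order mark, if present.
function strip_utf8_bom(string $s): string
{
    if (strncmp($s, "\xEF\xBB\xBF", 3) === 0) {
        return (string) substr($s, 3);
    }
    return $s;
}

// Invented sample: file contents that start with a BOM.
$raw = "\xEF\xBB\xBF" . 'flags: [english]';
echo strip_utf8_bom($raw), "\n";
```

Saving the file as "UTF-8 without BOM" in the editor avoids the problem at the source.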

strings with only ascii characters php

I have a set of strings, some of which contain non-ASCII characters.
How do I get strings with only ASCII characters using a PHP script?
Thanks a lot in advance for any guidance..
<?php
// Remove every character outside printable ASCII (space through tilde).
echo preg_replace('/[^\x20-\x7E]/', '', 'Standard ASCII and some gärbägè');
?>
Probably the easiest option is to use the iconv function (if the iconv extension is available), using either the //IGNORE or //TRANSLIT option (see the documentation), if the behavior suits your needs.
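If the goal is to keep only those strings in the set that are already pure ASCII, rather than to strip characters, a sketch along these lines may help. The helper names is_ascii and ascii_only are made up for illustration, and the iconv call at the end is only an example of the transliteration approach; its output depends on the iconv implementation and locale.

```php
<?php
// Hypothetical helpers; the names are illustrative.

// True when every byte is in the 7-bit ASCII range.
function is_ascii(string $s): bool
{
    return preg_match('/^[\x00-\x7F]*$/D', $s) === 1;
}

// Keep only the strings that are pure ASCII.
function ascii_only(array $strings): array
{
    return array_values(array_filter($strings, 'is_ascii'));
}

$strings = ['plain', 'gärbägè', 'also plain'];
print_r(ascii_only($strings));

// Alternative: approximate non-ASCII characters instead of dropping them.
// The result varies by platform, so no expected output is shown.
if (function_exists('iconv')) {
    var_dump(iconv('UTF-8', 'ASCII//TRANSLIT', 'gärbägè'));
}
```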
