PHP String Function with non-English languages

PHP String Function with non-English languages - php

I was trying range(); function with non-English language. It is not working.
$i =0
foreach(range('क', 'म') as $ab) {
++$i;
$alphabets[$ab] = $i;
}
Output: à =1
It was Hindi (India) alphabets. It is only iterating only once (Output shows).
For this, I am not getting what to do!
So, if possible, please tell me what to do for this and what should I do first before thinking of working with non-English text with any PHP functions.

Short answer: it's not possible to use range like that.
Explanation
You are passing the string 'क' as the start of the range and 'म' as the end. You are getting only one character back, and that character is à.
You are getting back à because your source file is encoded (saved) in UTF-8. One can tell this by the fact that à is code point U+00E0, while 0xE0 is also the first byte of the UTF-8 encoded form of 'क' (which is 0xE0 0xA4 0x95). Sadly, PHP has no notion of encodings so it just takes the first byte it sees in the string and uses that as the "start" character.
You are getting back only à because the UTF-8 encoded form of 'म' also starts with 0xE0 (so PHP also thinks that the "end character" is 0xE0 or à).
Solution
You can write range as a for loop yourself, as long as there is some function that returns the Unicode code point of an UTF-8 character (and one that does the reverse). So I googled and found these here:
// Returns the UTF-8 character with code point $intval
function unichr($intval) {
return mb_convert_encoding(pack('n', $intval), 'UTF-8', 'UTF-16BE');
}
// Returns the code point for a UTF-8 character
function uniord($u) {
$k = mb_convert_encoding($u, 'UCS-2LE', 'UTF-8');
$k1 = ord(substr($k, 0, 1));
$k2 = ord(substr($k, 1, 1));
return $k2 * 256 + $k1;
}
With the above, you can now write:
for($char = uniord('क'); $char <= uniord('म'); ++$char) {
$alphabet[] = unichr($char);
}
print_r($alphabet);
See it in action.

The lazy solution would be to use html_entity_decode() and range() only for the numeric ranges it was originally intended (that it works with ASCII is a bit silly anyway):
foreach (range(0x0915, 0x092E) as $char) {
$char = html_entity_decode("&#$char;", ENT_COMPAT, "UTF-8");
$alphabets[$char] = ++$i;
}

Another solution would be translating and getting the range then translate back again.
$first = file_get_contents("http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&langpair=|en&q=क");
$second = file_get_contents("http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&langpair=|en&q=म"); //not real value
$jsonfirst = json_decode($first);
$jsonsecond = json_decode($second);
$f = $jsonfirst->responseData->translatedText;
$l = $jsonsecond->responseData->translatedText;
foreach(range($f, $l) as $ab) {
echo $ab;
}
Outputs
ABCDEFGHI
To translate back use an arraymap and a callback function that translates each of the English values back to hindi.

Related

UTF-8 char is not showing well in <td> elements

I have a strange problem...
I have the following string:
$sString = "This is my encoded string é à";
First, I remove html entities:
$sString = html_entity_decode($sString, ENT_COMPAT, 'UTF-8');
What I want is to split this string properly to show each char in a different column of the same table's line.
Well, logically, I used:
$aString = str_split($sString) // Fill an array with each char
It doesn't work. It show in box the char as I didn't used html_entity_decode...
So, I decided to try the following:
for($i = 0; $i < 16; $i++) {
echo "<td>";
echo $sLine1[$i];
echo "</td>";
}
It works BUT special chars as showed as a ? in a black box (encoding problem).
Where it's really strange, it's that when I don't put it in <td> elements, it shows well and there's no encoding problems !
My HTML page contains the charset to UTF-8 and is correctly formated (with doctype, html, body, etc...)
I have to admit that at this point, I've no idea from where this problem comes...
UPDATE
I just realized that when I show char by char outside the <td>, it doesn't work either. The encoded char needs to be by pair to work !
It's a problem for me because the string comes from a database, and special chars won't always be at the same place !
Exemple:
This will show the encoding problem char:
$sString = "Paëlla";
echo $sString[3];
But in this way, it will show the ë:
$sString = "Paëlla";
echo $sString[3];
echo $sString[4];

str_split split the string on bytes. But in UTF-8, characters like é and à are encoded on a sequence of 2 bytes. You need to use mbstring to be UTF-8 aware.
mb_internal_encoding('UTF-8');
function mb_str_split($string, $length = 1) {
$ret = array();
$l = mb_strlen($string);
for ($i = 0; $i < $l; $i += $length) {
$ret[] = mb_substr($string, $i, $length);
}
return $ret;
}
Same if you apply [offset] to a string: you get a byte, not a character if the charset of the string may encode a character on more than a byte. In this case, use mb_substr.
mb_internal_encoding('UTF-8');
echo mb_substr("Paëlla", 2, 1);

Some adding to dinesh123 answer:
Try to trim html strip_tags before you get a string ($sString)
Check a file encoding
Try to set header("Content-Type:text/html; charset=UTF-8") in start of file

PHP: UTF-8 character gets messy in function which takes the first letter from each word of a sentence

I have this function which when executed it returns the first letters of each word of a string.
function initials($stringsoftext) {
$retturns = '';
foreach (explode(' ', $stringsoftext) as $word)
$retturns .= ($word[0]);
return $retturns;
}
Everything works fine. The only problem is that when the words begin with special characters it starts to get messy.
For example "test økonomi" become "t�" instead of "tø"
How can i correct this?

That happens because $word[0] takes the first byte of a string, whereas you are using a multi-bye encoding. So a character may consist of multiple bytes. In case of a ø character it consists of 2 bytes: 0xC3 0xB8
That is how you would extract the first character instead:
mb_substr($word, 0, 1, 'utf8')
Working demo: http://ideone.com/XVnC87

You should use mb_substr with mb_internal_encoding as in example:
<?php
header('Content-Type: text/html; charset=UTF-8');
mb_internal_encoding('UTF-8');
echo initials('ąęść óęłęł');
function initials($stringsoftext) {
$retturns = '';
foreach (explode(' ', $stringsoftext) as $word) {
$retturns .= mb_substr($word,0,1);
}
return $retturns;
}

Complementing various answers above, you could convert utf-8 (to be precise, assumed as utf-8) encoded character to its ISO 8859 counterpart.
No multibyte support required, as it's not enabled by default in many PHP configurations.
Use utf8_encode() in order to do so
<?php
function initials($stringsoftext) {
$retturns = '';
foreach (explode(' ', utf8_decode($stringsoftext)) as $word)
$retturns .= ($word[0]);
return $retturns;
}
echo initials("test økonomi");
//return tø
?>
Edit: This approach could break if the characters being converted is not defined on ISO 8859 charset (e.g non latin symbols). Just to reiterate if PHP multi byte support is turned on, mb_substr() solutions is certainly the most appropriate as it is able to properly process the string in utf8 encoding.

Character replacement encoding php

I have a string that I want to replace all 'a' characters to the greek 'α' character. I don't want to convert the html elements inside the string ie text.
The function:
function grstrletter($string){
$skip = false;
$str_length = strlen($string);
for ($i=0; $i < $str_length; $i++){
if($string[$i] == '<'){
$skip = true;
}
if($string[$i] == '>'){
$skip = false;
}
if ($string[$i]=='a' && !$skip){
$string[$i] = 'α';
}
}
return $string;
}
Another function I have made works perfectly but it doesn't take in account the hmtl elements.
function grstrletter_no_html($string){
return strtr($string, array('a' => 'α'));
}
I also tried a lot of encoding functions that php offers with no luck.
When I echo the greek letter the browser output it without a problem. When I return the string the browser outputs the classic strange question mark inside a triangle whenever the replace was occured.
My header has <meta http-equiv="content-type" content="text/html; charset=UTF-8"> and I also tried it with php header('Content-Type: text/html; charset=utf-8'); but again with no luck.
The string comes from a database in UTF-8 and the site is in wordpress so I just use the wordpress functions to get the content I want. I don't think is a db problem because when I use my function grstrletter_no_html() everything works fine.
The problem seems to happen when I iterate the string character by character.
The file is saved as UTF-8 without BOM (notepad++). I tried also to change the encoding of the file with no luck again.
I also tried to replace the greek letter with the corresponding html entity α and α but again same results.
I haven't tried yet any regex.
I would appreciate any help and thanks in advance.
Tried: Greek characters encoding works in HTML but not in PHP
EDIT
The solution based on deceze brilliant answer:
function grstrletter($string){
$skip = false;
$str_length = strlen($string);
for ($i=0; $i < $str_length; $i++){
if($string[$i] == '<'){
$skip = true;
}
if($string[$i] == '>'){
$skip = false;
}
if ($string[$i]=='a' && !$skip){
$part1 = substr($string, 0, $i);
$part1 = $part1 . 'α';
$string = $part1 . substr($string, $i+1);
}
}
return $string;
}

The problem is that you're setting only a single byte of your string. Example:
$str = "\x00\x00\x00";
var_dump(bin2hex($str));
$str[1] = "\xff\xff";
var_dump(bin2hex($str));
Output:
string(6) "000000"
string(6) "00ff00"
You're setting a two-byte character, but only one byte of it is actually pushed into the string. The second result here would have to be 00ffff for your code to work.
What you need is to cut the string from 0 to $i - 1, concatenate the 'α' into it, then concatenate the rest of the string $i + 1 to end onto it if you want to insert a multibyte character. That, or work with characters instead of bytes using the mbstring functions.
For more background information, see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

php true multi-byte string shuffle function?

I have a unique problem with multibyte character strings and need to be able to shuffle, with some fair degree of randomness, a long UTF-8 encoded multibyte string in PHP without dropping or losing or repeating any of the characters.
In the PHP manual under str_shuffle there is a multi-byte function (the first user submitted one) that doesn't work: If I use a string with for example all the Japanese hiragana and katakana of string length (ex) 120 chars, I am returned a string that's 119 chars or 118 chars. Sometimes I've seen duplicate chars even though the original string doesn't have them. So that's not functional.
To make this more complex, I also need to include if possible Japanese UTF-8 newlines and line feeds and punctuation.
Can anyone with experience dealing in multiple languages with UTF-8 mb strings help? Does PHP have any built in functions to do this? str_shuffle is EXACTLY what I want. I just need it to also work on multibyte chars.
Thanks very much!

Try splitting the string using mb_strlen and mb_substr to create an array, then using shuffle before joining it back together again. (Edit: As also demonstrated in #Frosty Z's answer.)
An example from the PHP interactive prompt:
php > $string = "Pretend I'm multibyte!";
php > $len = mb_strlen($string);
php > $sploded = array();
php > while($len-- > 0) { $sploded[] = mb_substr($string, $len, 1); }
php > shuffle($sploded);
php > echo join('', $sploded);
rmedt tmu nIb'lyi!eteP
You'll want to be sure to specify the encoding, where appropriate.

This should do the trick, too. I hope.
class String
{
public function mbStrShuffle($string)
{
$chars = $this->mbGetChars($string);
shuffle($chars);
return implode('', $chars);
}
public function mbGetChars($string)
{
$chars = [];
for($i = 0, $length = mb_strlen($string); $i < $length; ++$i)
{
$chars[] = mb_substr($string, $i, 1, 'UTF-8');
}
return $chars;
}
}

I like to use this function:
function mb_str_shuffle($multibyte_string = "abcčćdđefghijklmnopqrsštuvwxyzžß,.-+'*?=)(/&%$#!~ˇ^˘°˛`˙´˝") {
$characters_array = mb_str_split($multibyte_string);
shuffle($characters_array);
return implode('', $characters_array); // or join('', $characters_array); if you have a death wish (JK)
}
Split string into an array of multibyte characters
Shuffle the good guy array who doesn't care about his residents being multibyte
Join the shuffled array together into a string
Of course I normally wouldn't have a default value for function's parameter.

php outputting strange character

I have the following code to generate a random password string:
<?php
$password = '';
for($i=0; $i<10; $i++) {
$chars = array('lower' => array('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'), 'upper' => array('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'), 'num' => array('1','2','3','4','5','6','7','8','9','0'), 'sym' => array('!','£','$','%','^','&','*','(',')','-','=','+','{','}','[',']',':','#','~',';','#','<','>','?',',','.','/'));
$set = rand(1, 4);
switch($set) {
case 1:
$set = 'lower';
break;
case 2:
$set = 'upper';
break;
case 3:
$set = 'num';
break;
case 4:
$set = 'sym';
break;
}
$count = count($chars[$set]);
$digit = rand(0, ($count-1));
$output = $chars[$set][$digit];
$password.= $output;
}
echo $password;
?>
However every now and then one of the characters it outputs will be a capital a with a ^ above it. French or something. How is this possible? it can only pick whats it my arrays!

The only non-ascii character is the pound character, so my guess is that it has to do with this.
First off, it's probably a good idea to avoid that one, as not many people will be able to easily type it.
Good chance that the encoding of your php file (or the encoding set by your editor) is not the same as your output encoding.

Are you sure it is indeed a character not in your array, or is the browser just unable to output? For example your monetary pound sign. Ensure that both PHP, DB, and HTML output all use the same encoding.
On a separate note, your loop is slightly more complicated than it needs to be. I typically see password generators randomize a string versus several arrays. A quick example:
$chars = "abcdefghijkABCDEFG1289398$%#^&";
$pos = rand(0, strlen($chars) - 1);
$password .= $chars[$pos];

i think you generate special HTML characters
for example here and iso8859-1 table

You may be seeing the byte sequence C2 A3, appearing as your capital A with a circumflex followed by a pound symbol. This is because C2A3 is the UTF-8 sequence for a pound sign. As such, if you've managed to enter the UTF-8 character in your PHP file (possibly without noticing it, depending on your editor and environment) you'd see the separate byte sequence as output if your environment is then ASCII / ISO8859-1 or similar.

As per Jason McCreary, I use this function for such Password Creation
function randomString($length) {
$characters = "0123456789abcdefghijklmnopqrstuvwxyz" .
"ABCDEFGHIJKLMNOPQRSTUVWXYZ$%#^&";
$string = '';
for ($p = 0; $p < $length; $p++)
$string .= $characters[mt_rand(0, strlen($characters))];
return $string;
}

The pound symbol (£) is what is breaking, since it is not part of the basic ASCII character set.
You need to do one of the following:
Drop the pound symbol (this will also help people using non-UK keyboards!)
Convert the pound symbol to an HTML entity when outputting it to the site (&#pound;)
Set your site's character set encoding to UTF-8, which will allow extended characters to be displayed. This is probably the best option in the long run, and should be fairly quick and easy to achieve.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

PHP String Function with non-English languages - php

Related

UTF-8 char is not showing well in <td> elements

PHP: UTF-8 character gets messy in function which takes the first letter from each word of a sentence

Character replacement encoding php

php true multi-byte string shuffle function?

php outputting strange character

Categories

Resources