Arabic not urlencoding correctly - php

I have the string:
$str = 'ماجد';
This needs to be encoded as:
'%E3%C7%CC%CF'
But I cannot figure out how to produce this encoded string. I believe the encoding is Windows-1256; the string above is what a program I have produces.
Does anyone know how to produce this string?

If you know you want to use Windows-1256 then all you have to do is to change the encoding of the input string (which is UTF-8) to Windows-1256. Then you apply urlencode() to the returned string and that's all.
There are several ways to change the encoding of a string in PHP. One of them (that I tested and provides the result you expect) is using iconv():
$str = 'ماجد';
$conv = iconv('utf-8', 'windows-1256', $str);
echo(urlencode($conv));
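The same idea can be wrapped in a small helper that also checks for conversion failures. This is only a sketch: the name urlencode_as is made up for illustration, it assumes the source file is saved as UTF-8, and it assumes your iconv build knows the Windows-1256 charset name.

```php
<?php
// Hypothetical helper (not a PHP built-in): convert UTF-8 text to a legacy
// charset, then percent-encode the resulting bytes.
function urlencode_as(string $text, string $charset = 'Windows-1256'): string
{
    // iconv() returns false if a character cannot be represented in $charset.
    $converted = iconv('UTF-8', $charset, $text);
    if ($converted === false) {
        throw new RuntimeException("Cannot represent the text in $charset");
    }
    // urlencode() encodes a space as "+"; use rawurlencode() for "%20".
    return urlencode($converted);
}

echo urlencode_as('ماجد'), "\n"; // %E3%C7%CC%CF
```

Each Arabic letter becomes a single byte in Windows-1256, which is why the result has four %XX groups instead of the eight you would get from the UTF-8 bytes.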

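You need to convert the string to Windows-1256 first, then split its bytes into their hexadecimal representation and put a % sign in front of each pair of hex digits.
<?php
// Convert from UTF-8 (the source file's encoding) to Windows-1256 before
// hex-encoding; otherwise the UTF-8 bytes get encoded instead.
$hexString = strtoupper(bin2hex(iconv('UTF-8', 'Windows-1256', "ماجد")));
for ($i = 0; $i < strlen($hexString); $i += 2) {
    echo "%" . substr($hexString, $i, 2);
}
?>
This will do the trick, but I'm sure there is a more elegant way.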

Related

how to change ascii alphabet to utf-8 in php

I have an ASCII string and would like to change its encoding to UTF-8.
But I could not find a simple function to change ASCII to UTF-8 in PHP.
And vice versa, I would like to change a UTF-8 alphabet to ASCII.
Please advise.
I have tried:
<?php
// utf-8
$str = "ＣＨＯＮＫＩＯＫ";
// I don't even know how to print these UTF-8 characters in PHP. I just copied/pasted the string.
// strlen($str) => 24 bytes
// mb_detect_encoding($str) => utf-8
$str2 = "CHONKIOK";
// strlen($str2) => 8 bytes
// mb_detect_encoding($str2) => ascii
// change ascii to utf-8
$str = mb_convert_encoding($str2, "UTF-8");
echo mb_detect_encoding($str);
// returns ascii
What you are doing is correct.
According to its documentation, mb_detect_encoding detects the most likely character encoding.
As the entire ASCII set is contained within UTF-8 at the exact same character positions, this function is telling you that it's an ASCII string because it technically is. The bytes of this string are identical whether it is encoded as ASCII or as UTF-8.
As you've found, when you include some characters outside of the ASCII set, it gives you the next most probable encoding.
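A quick way to convince yourself of this is to compare the bytes before and after the conversion. The following is just a sketch; the variable names are made up for illustration.

```php
<?php
$ascii = "CHONKIOK";

// Converting pure ASCII to UTF-8 changes nothing: the bytes are the same.
$converted = mb_convert_encoding($ascii, 'UTF-8', 'ASCII');
var_dump($converted === $ascii);                // bool(true)

// The same bytes are valid in both encodings.
var_dump(mb_check_encoding($ascii, 'ASCII'));   // bool(true)
var_dump(mb_check_encoding($ascii, 'UTF-8'));   // bool(true)

// Once a non-ASCII character is present, only UTF-8 still fits.
$mixed = "CHONKIOK \u{00E9}";                   // ends with é
var_dump(mb_check_encoding($mixed, 'ASCII'));   // bool(false)
var_dump(mb_check_encoding($mixed, 'UTF-8'));   // bool(true)
```

So there is nothing to "convert" for a pure ASCII string: it already is valid UTF-8.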
What exactly should I do to obtain this string: "ＣＨＯＮＫＩＯＫ" from "CHONKIOK"?
The characters you're after are called "Fullwidth Latin" characters.
Given that the Ｃ character provided is code point 65,315 and a regular C is code point 67, you could possibly obtain the strings you're after by adding the difference of 65,248. This is only possible because the alphabet tends to repeat in the same order in different parts of the character charts.
You can get the code point of a character using mb_ord and convert it back to a character using mb_chr, after adding 65,248.
That might look something like:
$str_input = "ABC abc 123";
$convertable = "ABCDEFG12349abcdefg";
$str_output = "";
for ($i = 0; $i < strlen($str_input); $i++) {
$char = mb_ord($str_input[$i], "UTF-8");
if(str_contains($convertable, $str_input[$i])) $char += 65248;
$str_output .= mb_chr($char, "UTF-8");
}
echo $str_output; // outputs "ＡＢＣ ａｂｃ １２３"
Just be sure to include the whole alphabet in $convertable.
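Instead of listing every convertible character, you could test the code point range directly, since the fullwidth forms U+FF01 to U+FF5E mirror the printable ASCII range 0x21 to 0x7E. A rough sketch, assuming PHP 7.4+ (for mb_str_split); the function and constant names are made up for illustration:

```php
<?php
// Distance between an ASCII character and its fullwidth form (65,248).
const FULLWIDTH_OFFSET = 0xFEE0;

// Hypothetical helper: shift printable ASCII (except space) to fullwidth forms.
function to_fullwidth(string $input): string
{
    $out = '';
    foreach (mb_str_split($input, 1, 'UTF-8') as $char) {
        $cp = mb_ord($char, 'UTF-8');
        if ($cp >= 0x21 && $cp <= 0x7E) {
            $cp += FULLWIDTH_OFFSET;
        }
        $out .= mb_chr($cp, 'UTF-8');
    }
    return $out;
}

// Hypothetical helper: the reverse direction, fullwidth forms back to ASCII.
function from_fullwidth(string $input): string
{
    $out = '';
    foreach (mb_str_split($input, 1, 'UTF-8') as $char) {
        $cp = mb_ord($char, 'UTF-8');
        if ($cp >= 0xFF01 && $cp <= 0xFF5E) {
            $cp -= FULLWIDTH_OFFSET;
        }
        $out .= mb_chr($cp, 'UTF-8');
    }
    return $out;
}

echo to_fullwidth('CHONKIOK'), "\n";                 // ＣＨＯＮＫＩＯＫ
echo from_fullwidth(to_fullwidth('CHONKIOK')), "\n"; // CHONKIOK
```

The ordinary space is left alone here; its fullwidth counterpart (U+3000) does not follow the same offset.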
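Try this to convert to UTF-8 (it assumes the input is ISO-8859-1, of which ASCII is a subset):
utf8_encode(string $string): string
Try this to convert back (characters outside ISO-8859-1 become ?):
utf8_decode(string $string): string
Note that both functions are deprecated as of PHP 8.2.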

UTF-8 char is not showing well in <td> elements

I have a strange problem...
I have the following string:
$sString = "This is my encoded string &eacute; &agrave;";
First, I remove html entities:
$sString = html_entity_decode($sString, ENT_COMPAT, 'UTF-8');
What I want is to split this string properly to show each char in a different column of the same table's line.
Well, logically, I used:
$aString = str_split($sString); // Fill an array with each char
It doesn't work. It shows the special chars in a box, as if I hadn't used html_entity_decode...
So, I decided to try the following:
for($i = 0; $i < 16; $i++) {
echo "<td>";
echo $sLine1[$i];
echo "</td>";
}
It works, BUT special chars are shown as a ? in a black box (encoding problem).
What's really strange is that when I don't put them in <td> elements, they show up fine and there are no encoding problems!
My HTML page sets the charset to UTF-8 and is correctly formatted (with doctype, html, body, etc.).
I have to admit that at this point, I have no idea where this problem comes from...
UPDATE
I just realized that when I show the string char by char outside the <td>, it doesn't work either. The bytes of an encoded char need to be output as a pair to work!
It's a problem for me because the string comes from a database, and special chars won't always be in the same place!
Example:
This will show the encoding problem char:
$sString = "Paëlla";
echo $sString[2];
But this way, it will show the ë:
$sString = "Paëlla";
echo $sString[2];
echo $sString[3];
str_split splits the string into bytes. But in UTF-8, characters like é and à are encoded as a sequence of 2 bytes. You need to use mbstring to be UTF-8 aware.
mb_internal_encoding('UTF-8');
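// PHP 7.4+ ships a built-in mb_str_split(); define this fallback only on older versions.
if (!function_exists('mb_str_split')) {
    function mb_str_split($string, $length = 1) {
        $ret = array();
        $l = mb_strlen($string);
        for ($i = 0; $i < $l; $i += $length) {
            $ret[] = mb_substr($string, $i, $length);
        }
        return $ret;
    }
}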
The same applies if you use [offset] on a string: you get a byte, not a character, whenever the string's charset may encode a character in more than one byte. In this case, use mb_substr.
mb_internal_encoding('UTF-8');
echo mb_substr("Paëlla", 2, 1);
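Applied to the original table problem, a character-aware split could look like the sketch below. It assumes PHP 7.4+ (for the built-in mb_str_split) and a page served as UTF-8; the helper name chars_to_cells is made up for illustration.

```php
<?php
// Hypothetical helper: render every character of $text in its own <td>.
function chars_to_cells(string $text): string
{
    $cells = '';
    // Split on characters, not bytes, so "ë" stays in one piece.
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        // Escape each character so "<" or "&" in the data cannot break the markup.
        $cells .= '<td>' . htmlspecialchars($char, ENT_QUOTES, 'UTF-8') . '</td>';
    }
    return $cells;
}

echo '<tr>', chars_to_cells('Paëlla'), '</tr>', "\n";
```

Because the split never cuts through a multi-byte sequence, it does not matter where in the string the special characters appear.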
Some additions to dinesh123's answer:
Try stripping the HTML with strip_tags before you use the string ($sString).
Check the file's encoding.
Try sending header("Content-Type: text/html; charset=UTF-8") at the start of the file.

php multibyte string accessing via key [$i]

There is a string $string = "öşğüçı"; pay attention to the last character, which is not i.
When I want to print the first char with echo $string[0], it prints nothing. I know they are multibyte characters. Printing the first character can be accomplished with
echo $string[0].$string[1], but that is not what I want. The question is:
how can I solve the above-mentioned issue so that I can program it the way shown below
for($i = 0; $i < sizeof($string); $i++)
echo $string[$i] . " ";
and it will print the following
ö ş ğ ü ç ı
masters of php please help...
To split a string into characters:
$string = "öşğüçı";
preg_match_all('/./u', $string, $m);
$chars = $m[0];
Note the "u" flag in the regular expression.
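One caveat with the pattern above: . does not match a newline unless you also add the s flag, so newlines would be dropped from the result. A variant that keeps every character is to split on the empty pattern instead. This is a sketch; the function name utf8_chars is made up for illustration.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters.
function utf8_chars(string $text): array
{
    // The empty pattern with the "u" flag matches between every two characters;
    // PREG_SPLIT_NO_EMPTY drops the empty pieces at both ends.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        // preg_split() fails when the input is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $chars;
}

echo implode(' ', utf8_chars('öşğüçı')), "\n"; // ö ş ğ ü ç ı
```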
<?php
// inform the browser you are sending text encoded with utf-8
header("Content-type: text/plain; charset=utf-8");
// if you're using a literal string make sure the file
// is saved using utf-8 as encoding
// or if you're getting it from another source make sure
// you get it in utf-8
$string = "öşğüçı";
// if you do not have your string in utf-8
// you need to find out the actual encoding
// and use "iconv" to convert it to utf-8
// process the string using the mb_* functions
// knowing that it is encoded in utf-8 at this point
$encoding = "UTF-8";
for($i = 0; $i < mb_strlen($string, $encoding); $i++) {
echo mb_substr($string, $i, 1, $encoding);
}
Of course, if you prefer another encoding (though I don't see why; maybe just UTF-16), you can substitute each instance of "utf-8" above with your desired encoding and read and use it accordingly.
Example for UTF-16 output (file/input is encoded in UTF-8):
<?php
header("Content-type: text/plain; charset=utf-16");
$string = "öşğüçı";
$string = iconv("UTF-8", "UTF-16", $string);
$encoding = "UTF-16";
for($i = 0; $i < mb_strlen($string, $encoding); $i++) {
echo mb_substr($string, $i, 1, $encoding);
}
You cannot handle multi-byte strings in this way in PHP. If it's a fixed-length encoding, where every character takes up, say, two bytes, you can simply take two bytes at a time. If it's a variable-length encoding like UTF-8 though, you will need to use mb_substr and mb_strlen.
May I recommend What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text, which explains this in more detail.
Use iconv_substr or mb_substr to get a character, and iconv_strlen or mb_strlen to get the length of the string.
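For completeness, the iconv flavour of the loop might look like this. It is a sketch that assumes the iconv extension is available and the string is valid UTF-8.

```php
<?php
$string = 'öşğüçı';

// iconv_strlen() counts characters; strlen() counts bytes.
$length = iconv_strlen($string, 'UTF-8');

$chars = [];
for ($i = 0; $i < $length; $i++) {
    // Offset and length are measured in characters, not bytes.
    $chars[] = iconv_substr($string, $i, 1, 'UTF-8');
}

echo strlen($string), " bytes, ", $length, " characters\n"; // 12 bytes, 6 characters
echo implode(' ', $chars), "\n";                            // ö ş ğ ü ç ı
```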

PHP method for stripping duplicate chars from a multibyte string?

Arrrgh. Does anyone know how to create a function that's the multibyte character equivalent of the PHP count_chars($string, 3) command?
Such that it will return a list of ONLY ONE INSTANCE of each unique character. If that was English and we had
"aaabggxxyxzxxgggghq xcccxxxzxxyx"
It would return "abcgh qxyz" (Note the space IS counted).
(The order isn't important in this case, can be anything).
If Japanese kanji (not sure browsers will all support this):
漢漢漢字漢字私私字私字漢字私漢字漢字私
And it will return just the 3 kanji used:
漢字私
It needs to work on any UTF-8 encoded string.
Hey Dave, you're never going to see this one coming.
php > $kanji = '漢漢漢字漢字私私字私字漢字私漢字漢字私';
php > $not_kanji = 'aaabcccbbc';
php > $pattern = '/(.)\1+/u';
php > echo preg_replace($pattern, '$1', $kanji);
漢字漢字私字私字漢字私漢字漢字私
php > echo preg_replace($pattern, '$1', $not_kanji);
abcbc
What, you thought I was going to use mb_substr again?
In regex-speak, it's looking for any one character, then one or more instances of that same character. The matched region is then replaced with the one character that matched.
The u modifier turns on UTF-8 mode in PCRE, in which it deals with UTF-8 sequences instead of 8-bit characters. As long as the string being processed is UTF-8 already and PCRE was compiled with Unicode support, this should work fine for you.
Hey, guess what!
$not_kanji = 'aaabbbbcdddbbbbccgggcdddeeedddaaaffff';
$l = mb_strlen($not_kanji);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($not_kanji, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
echo join('', array_keys($unique));
This uses the same general trick as the shuffle code. We grab the length of the string, then use mb_substr to extract it one character at a time. We then use that character as a key in an array. We're taking advantage of PHP's ordered arrays: keys are kept in the order in which they were first defined. Once we've gone through the string and identified all of the characters, we grab the keys and join 'em back together in the same order that they first appeared in the string. You also get a per-character count from this technique.
This would have been much easier if there was such a thing as mb_str_split to go along with str_split.
(No Kanji example here, I'm experiencing a copy/paste bug.)
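On PHP 7.4 and later there is a built-in mb_str_split, so the counting loop above can be collapsed considerably. Here is a rough sketch using the kanji string from the question; the function name mb_char_counts is made up for illustration.

```php
<?php
// Hypothetical helper: map each distinct character to its number of
// occurrences, in order of first appearance.
function mb_char_counts(string $text): array
{
    return array_count_values(mb_str_split($text, 1, 'UTF-8'));
}

$kanji  = '漢漢漢字漢字私私字私字漢字私漢字漢字私';
$counts = mb_char_counts($kanji);

// The keys are the unique characters.
echo implode('', array_keys($counts)), "\n"; // 漢字私
print_r($counts);
```

One thing to keep in mind is that PHP turns numeric string keys such as "1" into integers, which is harmless when you join the keys back into a string.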
Here, try this on for size:
function mb_count_chars_kinda($input) {
$l = mb_strlen($input);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($input, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
return $unique;
}
function mb_string_chars_diff($one, $two) {
$left = array_keys(mb_count_chars_kinda($one));
$right = array_keys(mb_count_chars_kinda($two));
return array_diff($left, $right);
}
print_r(mb_string_chars_diff('aabbccddeeffgg', 'abcde'));
/* =>
Array
(
[5] => f
[6] => g
)
*/
You'll want to call this twice, the second time with the left string on the right, and the right string on the left. The output will be different -- array_diff just gives you the stuff in the left side that's missing from the right, so you have to do it twice to get the whole story.
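If you would rather get both directions from one call, something along these lines might work. It is a sketch assuming PHP 7.4+ (for mb_str_split); the function names and array keys are made up for illustration.

```php
<?php
// Hypothetical helper: the distinct characters of a string, in order of first appearance.
function mb_unique_chars(string $text): array
{
    return array_values(array_unique(mb_str_split($text, 1, 'UTF-8')));
}

// Hypothetical helper: characters that appear in only one of the two strings.
function mb_chars_diff_both(string $one, string $two): array
{
    $left  = mb_unique_chars($one);
    $right = mb_unique_chars($two);
    return [
        // array_values() reindexes, since array_diff() preserves the original keys.
        'only_in_first'  => array_values(array_diff($left, $right)),
        'only_in_second' => array_values(array_diff($right, $left)),
    ];
}

print_r(mb_chars_diff_both('aabbccddeeffgg', 'abcdez'));
```

Here only_in_first holds f and g, and only_in_second holds z.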
Please check out the iconv_strlen function from the PHP standard library. I can't speak for East Asian encodings, but it works fine for Western and Eastern European languages. In any case, it gives you some freedom!
$name = "My string";
$name_array = mb_str_split($name); // PHP 7.4+; plain str_split() would split on bytes
$name_array_uniqued = array_unique($name_array);
print_r($name_array_uniqued);
Much easier. Use mb_str_split to turn the phrase into an array with each character as an element (plain str_split splits on bytes and would break multibyte characters). Then use array_unique to remove duplicates. Pretty simple. Nothing complicated. I like it that way.

php true multi-byte string shuffle function?

I have a unique problem with multibyte character strings and need to be able to shuffle, with some fair degree of randomness, a long UTF-8 encoded multibyte string in PHP without dropping or losing or repeating any of the characters.
In the PHP manual under str_shuffle there is a multi-byte function (the first user submitted one) that doesn't work: If I use a string with for example all the Japanese hiragana and katakana of string length (ex) 120 chars, I am returned a string that's 119 chars or 118 chars. Sometimes I've seen duplicate chars even though the original string doesn't have them. So that's not functional.
To make this more complex, I also need to include if possible Japanese UTF-8 newlines and line feeds and punctuation.
Can anyone with experience dealing in multiple languages with UTF-8 mb strings help? Does PHP have any built in functions to do this? str_shuffle is EXACTLY what I want. I just need it to also work on multibyte chars.
Thanks very much!
Try splitting the string using mb_strlen and mb_substr to create an array, then using shuffle before joining it back together again. (Edit: As also demonstrated in @Frosty Z's answer.)
An example from the PHP interactive prompt:
php > $string = "Pretend I'm multibyte!";
php > $len = mb_strlen($string);
php > $sploded = array();
php > while($len-- > 0) { $sploded[] = mb_substr($string, $len, 1); }
php > shuffle($sploded);
php > echo join('', $sploded);
rmedt tmu nIb'lyi!eteP
You'll want to be sure to specify the encoding, where appropriate.
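With the encoding passed explicitly, and a check that nothing was dropped or duplicated, the idea could be sketched like this. It assumes PHP 7.4+ (for mb_str_split); the function names are made up for illustration.

```php
<?php
// Hypothetical helper: shuffle the characters (not the bytes) of a string.
function mb_shuffle_chars(string $text, string $encoding = 'UTF-8'): string
{
    $chars = mb_str_split($text, 1, $encoding);
    shuffle($chars);
    return implode('', $chars);
}

// Hypothetical helper: true if both strings contain exactly the same
// characters the same number of times, in any order.
function same_characters(string $a, string $b, string $encoding = 'UTF-8'): bool
{
    $x = mb_str_split($a, 1, $encoding);
    $y = mb_str_split($b, 1, $encoding);
    sort($x, SORT_STRING);
    sort($y, SORT_STRING);
    return $x === $y;
}

// Kana, kanji, Japanese punctuation and a newline all count as one character each.
$original = "あいうえお、カキクケコ。\n漢字";
$shuffled = mb_shuffle_chars($original);

var_dump(same_characters($original, $shuffled)); // bool(true)
```

Note that shuffle is not meant for cryptographic purposes, which should be fine for "some fair degree of randomness".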
This should do the trick, too. I hope.
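// "String" is a reserved class name since PHP 7, so the class needs another name.
class MbString
{
    public function mbStrShuffle($string)
    {
        $chars = $this->mbGetChars($string);
        shuffle($chars);
        return implode('', $chars);
    }
    public function mbGetChars($string)
    {
        $chars = [];
        // Pass the encoding to mb_strlen too, so it matches mb_substr below.
        for ($i = 0, $length = mb_strlen($string, 'UTF-8'); $i < $length; ++$i)
        {
            $chars[] = mb_substr($string, $i, 1, 'UTF-8');
        }
        return $chars;
    }
}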
I like to use this function:
function mb_str_shuffle($multibyte_string = "abcčćdđefghijklmnopqrsštuvwxyzžß,.-+'*?=)(/&%$#!~ˇ^˘°˛`˙´˝") {
$characters_array = mb_str_split($multibyte_string);
shuffle($characters_array);
return implode('', $characters_array); // or join('', $characters_array); if you have a death wish (JK)
}
Split string into an array of multibyte characters
Shuffle the good guy array who doesn't care about his residents being multibyte
Join the shuffled array together into a string
Of course, I normally wouldn't have a default value for the function's parameter.
