Non ASCII Characters being converted to squares

I've got the following code, which searches a string for non-ASCII characters and returns them via an AJAX request.
$asciistring = $strDescription;
for ($i = 0; $i < strlen($asciistring); $i++) {
    if (ord($asciistring[$i]) > 127) {
        $display_string .= $asciistring[$i];
    }
}
If $strDescription contains £ (character # 156), the above code works fine. However, I want to separate each non-ASCII character found with a comma. When I modify my code as shown below, the £ character is turned into squares.
$asciistring = $strDescription;
for ($i = 0; $i < strlen($asciistring); $i++) {
    if (ord($asciistring[$i]) > 127) {
        $display_string .= $asciistring[$i] . ", ";
    }
}
What am I doing wrong and how do I fix it?

You assume that 1 character = 1 byte.
That assumption is wrong for UTF-8, UTF-16 and similar encodings.
UTF-8 uses multi-byte characters: 1 character = 1 to 4 bytes.
So your loop over 8-bit bytes cannot handle UTF-8 characters. If the string is UTF-8, £ is stored as the two bytes 0xC2 0xA3: without the separator those bytes stay next to each other and still form a valid character, but appending ", " after each byte splits the pair, and each orphaned byte is displayed as a square.
Use the mb_... functions (the multibyte string functions) instead.
Additionally, converting between ASCII and UTF-8:
is in general not needed
can leave some characters unavailable in the target encoding (the € sign, for example)
will be a maintenance nightmare in the long run
My recommendation: it is worth the effort to switch everything, from development to production, to UTF-8. These problems go away afterwards.
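As a minimal sketch of the multibyte-aware approach, assuming the input is valid UTF-8 and PCRE was built with Unicode support: the helper name nonAsciiChars is hypothetical and not from the question. The /u modifier makes the character class match whole characters rather than single bytes, so the separator can never land inside a multi-byte sequence.

```php
<?php
// Hypothetical helper: returns every non-ASCII character of a UTF-8 string.
function nonAsciiChars(string $text): array
{
    // With /u, [^\x00-\x7F] matches one whole character above U+007F,
    // not one byte of it.
    if (preg_match_all('/[^\x00-\x7F]/u', $text, $matches) === false) {
        return []; // e.g. the input was not valid UTF-8
    }
    return $matches[0];
}

// "\u{A3}" is the pound sign and "\u{E9}" is e-acute (PHP 7+ escape syntax).
$strDescription = "Price: \u{A3}5 at the caf\u{E9}";
$display_string = implode(', ', nonAsciiChars($strDescription));
echo $display_string, PHP_EOL;
```

The response sent back to the browser should also be declared as UTF-8, otherwise the characters can still be displayed incorrectly.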

Here are two ways. In both cases, first convert the string with utf8_decode so that each character becomes a single byte:
$asciistring = 'a£bÂc£d';
$asciistring = utf8_decode($asciistring);
First way: preg_match_all
if (preg_match_all('/[\x80-\xFF]/', $asciistring, $matches)) {
    $display_string = implode(',', $matches[0]);
}
Second way: a loop, as you wrote
$display_string = array();
for ($i = 0; $i < strlen($asciistring); $i++) {
    if (ord($asciistring[$i]) > 127) {
        $display_string[] = $asciistring[$i];
    }
}
$display_string = implode(',', $display_string);
Both give me the same output:
£,Â,£
I hope this helps!
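Note that utf8_decode is deprecated as of PHP 8.2. The following is a sketch of the same byte loop without it, assuming the mbstring extension is available and that every character in the input exists in ISO-8859-1. The helper name highBytesAsUtf8 is hypothetical.

```php
<?php
// Hypothetical helper: the byte loop from above, without utf8_decode.
function highBytesAsUtf8(string $utf8): array
{
    // After this conversion each character occupies exactly one byte.
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
    $found = [];
    for ($i = 0, $n = strlen($latin1); $i < $n; $i++) {
        if (ord($latin1[$i]) > 127) {
            // Convert the single byte back to UTF-8 so it can be displayed.
            $found[] = mb_convert_encoding($latin1[$i], 'UTF-8', 'ISO-8859-1');
        }
    }
    return $found;
}

// Same test string as above, written with PHP 7+ Unicode escapes.
echo implode(',', highBytesAsUtf8("a\u{A3}b\u{C2}c\u{A3}d")), PHP_EOL;
```

Unlike the original, this returns the matches as UTF-8 again, which suits a page that is served as UTF-8.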

Related

Arabic not urlencoding correctly

I have the string:
$str = 'ماجد';
This needs to be encoded as:
'%E3%C7%CC%CF'
But I cannot figure out how to reach this encoded string. I believe it is Windows-1256. The above encoded string is how it is being encoded by a program I have.
Does anyone know how to reach this string?

If you know you want to use Windows-1256 then all you have to do is to change the encoding of the input string (which is UTF-8) to Windows-1256. Then you apply urlencode() to the returned string and that's all.
There are several ways to change the encoding of a string in PHP. One of them (that I tested and provides the result you expect) is using iconv():
$str = 'ماجد';
$conv = iconv('utf-8', 'windows-1256', $str);
echo(urlencode($conv));
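The same steps can be wrapped in a small function with an error check. This is only a sketch: the name urlencodeAs is hypothetical, and it assumes the iconv build on the system knows the WINDOWS-1256 charset.

```php
<?php
// Hypothetical helper: convert a UTF-8 string to another charset, then
// percent-encode the resulting bytes.
function urlencodeAs(string $utf8, string $charset): string
{
    $converted = iconv('UTF-8', $charset, $utf8);
    if ($converted === false) {
        // iconv() returns false when the input cannot be converted.
        throw new InvalidArgumentException("Cannot represent input in $charset");
    }
    return urlencode($converted);
}

// The source file itself must be saved as UTF-8 for this literal to work.
echo urlencodeAs('ماجد', 'WINDOWS-1256'), PHP_EOL; // %E3%C7%CC%CF
```

Checking the return value matters because a character with no equivalent in the target charset makes the conversion fail rather than produce a partial string you can rely on.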

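You need to split the string into its hexadecimal representation and then put a % sign in front of each pair of hex digits.
<?php
$hexString = bin2hex("ماجد");
for ($i = 0; $i < strlen($hexString); $i += 2) {
    echo "%" . substr($hexString, $i, 2);
}
?>
This will do the trick, but I'm sure there is a more elegant way. Note that bin2hex works on the bytes as they are stored, so the string must already be in Windows-1256 to give the expected result; on a UTF-8 string it outputs the UTF-8 byte sequence instead.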

UTF-8 char is not showing well in <td> elements

I have a strange problem...
I have the following string:
$sString = "This is my encoded string é à";
First, I remove html entities:
$sString = html_entity_decode($sString, ENT_COMPAT, 'UTF-8');
What I want is to split this string properly so that each character is shown in its own column of the same table row.
So, logically, I used:
$aString = str_split($sString); // Fill an array with each char
It doesn't work. The special characters show up as boxes, as if I had not used html_entity_decode...
So, I decided to try the following:
for($i = 0; $i < 16; $i++) {
echo "<td>";
echo $sLine1[$i];
echo "</td>";
}
It works, BUT special characters are shown as a ? in a black box (an encoding problem).
What's really strange is that when I don't put them in <td> elements, they display fine and there are no encoding problems!
My HTML page declares the UTF-8 charset and is correctly formatted (with doctype, html, body, etc.).
I have to admit that at this point I have no idea where this problem comes from...
UPDATE
I just realized that when I show the string char by char outside the <td>, it doesn't work either. The bytes of an encoded char need to be output as a pair to work!
It's a problem for me because the string comes from a database, and special chars won't always be in the same place!
Example:
This will show the broken character:
$sString = "Paëlla";
echo $sString[3];
But in this way, it will show the ë:
$sString = "Paëlla";
echo $sString[3];
echo $sString[4];

str_split splits the string into bytes. But in UTF-8, characters like é and à are encoded as a sequence of 2 bytes. You need to use mbstring to be UTF-8 aware.
mb_internal_encoding('UTF-8');
// PHP 7.4+ ships a built-in mb_str_split; only define this fallback on older versions.
if (!function_exists('mb_str_split')) {
    function mb_str_split($string, $length = 1) {
        $ret = array();
        $l = mb_strlen($string);
        for ($i = 0; $i < $l; $i += $length) {
            $ret[] = mb_substr($string, $i, $length);
        }
        return $ret;
    }
}
The same applies if you use [offset] on a string: you get a byte, not a character, whenever the string's charset can encode a character in more than one byte. In that case, use mb_substr.
mb_internal_encoding('UTF-8');
echo mb_substr("Paëlla", 2, 1);
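Applied to the table from the question, this could look like the sketch below. It assumes PHP 7.4 or later (for the built-in mb_str_split) and a UTF-8 string. The helper name charsToCells is hypothetical.

```php
<?php
// Hypothetical helper: one <td> per character (not per byte) of a UTF-8 string.
function charsToCells(string $text): string
{
    $cells = '';
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        // Escape each character so that <, > and & cannot break the markup.
        $cells .= '<td>' . htmlspecialchars($char, ENT_QUOTES, 'UTF-8') . '</td>';
    }
    return $cells;
}

// "\u{EB}" is e-diaeresis, so this is the "Paëlla" example from the question.
echo '<tr>', charsToCells("Pa\u{EB}lla"), '</tr>', PHP_EOL;
```

Because the split happens on character boundaries, it does not matter where in the database string the special characters occur.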

Some additions to dinesh123's answer:
Try stripping HTML with strip_tags before you build the string ($sString)
Check the encoding of the file
Try sending header("Content-Type: text/html; charset=UTF-8") at the start of the file

PHP String Function with non-English languages

I was trying the range() function with a non-English language. It is not working.
$i = 0;
foreach (range('क', 'म') as $ab) {
    ++$i;
    $alphabets[$ab] = $i;
}
Output: à =1
These are Hindi (India) letters. The loop iterates only once, as the output shows.
I don't know what to do about this!
So, if possible, please tell me how to fix this, and what I should do first before working with non-English text in PHP functions.

Short answer: it's not possible to use range like that.
Explanation
You are passing the string 'क' as the start of the range and 'म' as the end. You are getting only one character back, and that character is à.
You are getting back à because your source file is encoded (saved) in UTF-8. One can tell this by the fact that à is code point U+00E0, while 0xE0 is also the first byte of the UTF-8 encoded form of 'क' (which is 0xE0 0xA4 0x95). Sadly, PHP has no notion of encodings so it just takes the first byte it sees in the string and uses that as the "start" character.
You are getting back only à because the UTF-8 encoded form of 'म' also starts with 0xE0 (so PHP also thinks that the "end character" is 0xE0 or à).
Solution
You can write range as a for loop yourself, as long as you have a function that returns the Unicode code point of a UTF-8 character and one that does the reverse. I found these two:
// Returns the UTF-8 character with code point $intval
function unichr($intval) {
    return mb_convert_encoding(pack('n', $intval), 'UTF-8', 'UTF-16BE');
}
// Returns the code point for a UTF-8 character
function uniord($u) {
    $k = mb_convert_encoding($u, 'UCS-2LE', 'UTF-8');
    $k1 = ord(substr($k, 0, 1));
    $k2 = ord(substr($k, 1, 1));
    return $k2 * 256 + $k1;
}
With the above, you can now write:
for ($char = uniord('क'); $char <= uniord('म'); ++$char) {
    $alphabet[] = unichr($char);
}
print_r($alphabet);
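On PHP 7.2 and later, mbstring provides mb_ord and mb_chr, which do the same job as the two helper functions. A minimal sketch under that assumption follows. The function name unicodeRange is hypothetical.

```php
<?php
// Hypothetical helper: a range() for single UTF-8 characters,
// built on mb_ord()/mb_chr() (PHP 7.2+).
function unicodeRange(string $first, string $last): array
{
    $start = mb_ord($first, 'UTF-8');
    $end = mb_ord($last, 'UTF-8');
    $chars = [];
    for ($codePoint = $start; $codePoint <= $end; $codePoint++) {
        $chars[] = mb_chr($codePoint, 'UTF-8');
    }
    return $chars;
}

// "\u{0915}" and "\u{092E}" are the two letters from the question.
$alphabets = array_flip(unicodeRange("\u{0915}", "\u{092E}"));
print_r($alphabets);
```

array_flip turns the list into a character-to-index map. Its indexes start at 0, whereas the question's loop numbers the letters from 1.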

The lazy solution would be to use html_entity_decode(), and to use range() only for the numeric ranges it was originally intended for (that it works with ASCII letters is a bit silly anyway):
$i = 0;
foreach (range(0x0915, 0x092E) as $char) {
    $char = html_entity_decode("&#$char;", ENT_COMPAT, "UTF-8");
    $alphabets[$char] = ++$i;
}

Another solution would be to translate the characters, get the range, and then translate back again.
$first = file_get_contents("http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&langpair=|en&q=क");
$second = file_get_contents("http://ajax.googleapis.com/ajax/services/language/translate?v=1.0&langpair=|en&q=म"); //not real value
$jsonfirst = json_decode($first);
$jsonsecond = json_decode($second);
$f = $jsonfirst->responseData->translatedText;
$l = $jsonsecond->responseData->translatedText;
foreach(range($f, $l) as $ab) {
echo $ab;
}
Outputs
ABCDEFGHI
To translate back, use array_map with a callback function that translates each of the English values back to Hindi.

PHP method for stripping duplicate chars from a multibyte string?

Arrrgh. Does anyone know how to create a function that's the multibyte character equivalent of the PHP count_chars($string, 3) command?
Such that it will return a list of ONLY ONE INSTANCE of each unique character. If that was English and we had
"aaabggxxyxzxxgggghq xcccxxxzxxyx"
It would return "abgh qxyz" (Note the space IS counted).
(The order isn't important in this case, can be anything).
If Japanese kanji (not sure browsers will all support this):
漢漢漢字漢字私私字私字漢字私漢字漢字私
And it will return just the 3 kanji used:
漢字私
It needs to work on any UTF-8 encoded string.

Hey Dave, you're never going to see this one coming.
php > $kanji = '漢漢漢字漢字私私字私字漢字私漢字漢字私';
php > $not_kanji = 'aaabcccbbc';
php > $pattern = '/(.)\1+/u';
php > echo preg_replace($pattern, '$1', $kanji);
漢字漢字私字私字漢字私漢字漢字私
php > echo preg_replace($pattern, '$1', $not_kanji);
abcbc
What, you thought I was going to use mb_substr again?
In regex-speak, it's looking for any one character, then one or more instances of that same character. The matched region is then replaced with the one character that matched.
The u modifier turns on UTF-8 mode in PCRE, in which it deals with UTF-8 sequences instead of 8-bit characters. As long as the string being processed is UTF-8 already and PCRE was compiled with Unicode support, this should work fine for you.
Hey, guess what!
$not_kanji = 'aaabbbbcdddbbbbccgggcdddeeedddaaaffff';
$l = mb_strlen($not_kanji);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($not_kanji, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
echo join('', array_keys($unique));
This uses the same general trick as the shuffle code. We grab the length of the string, then use mb_substr to extract it one character at a time. We then use that character as a key in an array. We're taking advantage of the fact that PHP arrays are ordered: keys keep the order in which they were first defined. Once we've gone through the string and identified all of the characters, we grab the keys and join them back together in the same order that they appeared in the string. You also get a per-character count from this technique.
This would have been much easier if there was such a thing as mb_str_split to go along with str_split. (PHP 7.4 and later do provide one.)
(No Kanji example here, I'm experiencing a copy/paste bug.)
Here, try this on for size:
function mb_count_chars_kinda($input) {
$l = mb_strlen($input);
$unique = array();
for($i = 0; $i < $l; $i++) {
$char = mb_substr($input, $i, 1);
if(!array_key_exists($char, $unique))
$unique[$char] = 0;
$unique[$char]++;
}
return $unique;
}
function mb_string_chars_diff($one, $two) {
$left = array_keys(mb_count_chars_kinda($one));
$right = array_keys(mb_count_chars_kinda($two));
return array_diff($left, $right);
}
print_r(mb_string_chars_diff('aabbccddeeffgg', 'abcde'));
/* =>
Array
(
[5] => f
[6] => g
)
*/
You'll want to call this twice, the second time with the left string on the right, and the right string on the left. The output will be different -- array_diff just gives you the stuff in the left side that's missing from the right, so you have to do it twice to get the whole story.
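Calling it in both directions can be wrapped up as in the sketch below. It does not reuse the functions above: it assumes PHP 7.4 or later for the built-in mb_str_split, and the names uniqueCharList and charDiffBothWays are hypothetical.

```php
<?php
// Hypothetical helper: the distinct characters of a UTF-8 string, in order.
function uniqueCharList(string $text): array
{
    return array_values(array_unique(mb_str_split($text, 1, 'UTF-8')));
}

// Hypothetical helper: runs array_diff in both directions.
function charDiffBothWays(string $one, string $two): array
{
    $left = uniqueCharList($one);
    $right = uniqueCharList($two);
    return [
        // array_values() renumbers the keys that array_diff() preserves.
        'only_in_first' => array_values(array_diff($left, $right)),
        'only_in_second' => array_values(array_diff($right, $left)),
    ];
}

print_r(charDiffBothWays('aabbccddeeffgg', 'abcdexyz'));
```

For these inputs, only_in_first holds f and g, and only_in_second holds x, y and z.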

Have a look at the iconv_strlen function from the PHP standard library. I can't speak for Asian encodings, but it works fine for Western and Eastern European languages. In any case, it gives you some freedom!

$name = "My string";
$name_array = str_split($name);
$name_array_uniqued = array_unique($name_array);
print_r($name_array_uniqued);
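Much easier. Use str_split to turn the phrase into an array with each character as an element. Then use array_unique to remove duplicates. Pretty simple. Nothing complicated. I like it that way. Be aware that str_split works on bytes, so as written this only handles single-byte characters, not a multibyte UTF-8 string.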
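The same split-then-deduplicate idea can be made multibyte-safe. This is a sketch assuming PHP 7.4 or later, where mb_str_split is built in, and a UTF-8 input. The function name uniqueChars is hypothetical.

```php
<?php
// Hypothetical helper: keeps one instance of each character of a UTF-8 string.
function uniqueChars(string $text): string
{
    // Split on character boundaries instead of bytes.
    $chars = mb_str_split($text, 1, 'UTF-8');
    // array_unique() keeps the first occurrence of each value, in order.
    return implode('', array_unique($chars));
}

echo uniqueChars('漢漢漢字漢字私私字私字漢字私漢字漢字私'), PHP_EOL; // 漢字私
echo uniqueChars('aaabggxxyxzxxgggghq xcccxxxzxxyx'), PHP_EOL;
```

Unlike count_chars($string, 3), the result follows the order of first appearance rather than byte order, which the question says is acceptable.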

php outputting strange character

I have the following code to generate a random password string:
<?php
$password = '';
for($i=0; $i<10; $i++) {
$chars = array('lower' => array('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'), 'upper' => array('A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'), 'num' => array('1','2','3','4','5','6','7','8','9','0'), 'sym' => array('!','£','$','%','^','&','*','(',')','-','=','+','{','}','[',']',':','#','~',';','#','<','>','?',',','.','/'));
$set = rand(1, 4);
switch($set) {
case 1:
$set = 'lower';
break;
case 2:
$set = 'upper';
break;
case 3:
$set = 'num';
break;
case 4:
$set = 'sym';
break;
}
$count = count($chars[$set]);
$digit = rand(0, ($count-1));
$output = $chars[$set][$digit];
$password.= $output;
}
echo $password;
?>
However, every now and then one of the characters it outputs is a capital A with a ^ above it. French or something. How is this possible? It can only pick what's in my arrays!

The only non-ascii character is the pound character, so my guess is that it has to do with this.
First off, it's probably a good idea to avoid that one, as not many people will be able to easily type it.
There is a good chance that the encoding of your PHP file (the encoding set by your editor) is not the same as your output encoding.

Are you sure it is really a character that is not in your array, or is the browser just unable to display one that is? Your pound sign, for example. Ensure that PHP, the DB, and the HTML output all use the same encoding.
On a separate note, your loop is slightly more complicated than it needs to be. Password generators typically pick from a single string rather than from several arrays. A quick example:
$chars = "abcdefghijkABCDEFG1289398$%#^&";
$pos = rand(0, strlen($chars) - 1);
$password .= $chars[$pos];

I think you are generating special HTML characters.
For example, compare your output with an ISO 8859-1 table.

You may be seeing the byte sequence C2 A3, appearing as your capital A with a circumflex followed by a pound symbol. This is because C2A3 is the UTF-8 sequence for a pound sign. As such, if you've managed to enter the UTF-8 character in your PHP file (possibly without noticing it, depending on your editor and environment) you'd see the separate byte sequence as output if your environment is then ASCII / ISO8859-1 or similar.
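This can be reproduced in a few lines. The sketch below assumes the mbstring extension: it stores the pound sign as UTF-8, prints its bytes, and then deliberately misreads those bytes as ISO-8859-1, which is what a wrongly configured environment does. The variable names are only illustrative.

```php
<?php
// "\u{A3}" is the pound sign, stored by PHP as UTF-8 (PHP 7+ escape syntax).
$pound = "\u{A3}";

// The two bytes that make up the UTF-8 encoded pound sign.
echo bin2hex($pound), PHP_EOL; // c2a3

// Treat those two bytes as two ISO-8859-1 characters and re-encode them as
// UTF-8 for display: a capital A with circumflex, then a pound sign.
$misread = mb_convert_encoding($pound, 'UTF-8', 'ISO-8859-1');
echo $misread, PHP_EOL;
echo mb_strlen($misread, 'UTF-8'), PHP_EOL; // 2
```

One character in the source file thus becomes two characters on screen as soon as the two sides disagree about the encoding.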

As per Jason McCreary, I use this function for password creation:
function randomString($length) {
    $characters = "0123456789abcdefghijklmnopqrstuvwxyz" .
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ$%#^&";
    $string = '';
    // The last valid index is strlen() - 1; mt_rand() includes both bounds.
    $max = strlen($characters) - 1;
    for ($p = 0; $p < $length; $p++) {
        $string .= $characters[mt_rand(0, $max)];
    }
    return $string;
}
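For passwords, a cryptographically secure source is usually preferable to rand or mt_rand. The following sketch assumes PHP 7 or later (for random_int) and sticks to ASCII symbols, so the encoding problem from the question cannot occur. The function name randomPassword and the default alphabet are hypothetical choices.

```php
<?php
// Hypothetical helper: builds a password from an ASCII-only alphabet.
function randomPassword(
    int $length,
    string $alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!$%^&*()-=+{}[]:#~;<>?,./'
): string {
    // Byte indexing is safe here because every character is a single byte.
    $max = strlen($alphabet) - 1;
    $password = '';
    for ($i = 0; $i < $length; $i++) {
        // random_int() includes both bounds and uses a secure random source.
        $password .= $alphabet[random_int(0, $max)];
    }
    return $password;
}

echo randomPassword(10), PHP_EOL;
```

If a multi-byte character such as £ were added to the alphabet, the string indexing would have to be replaced with mb_substr.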

The pound symbol (£) is what is breaking, since it is not part of the basic ASCII character set.
You need to do one of the following:
Drop the pound symbol (this will also help people using non-UK keyboards!)
Convert the pound symbol to an HTML entity when outputting it to the site (&pound; or &#163;)
Set your site's character set encoding to UTF-8, which will allow extended characters to be displayed. This is probably the best option in the long run, and should be fairly quick and easy to achieve.
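The second and third options can be sketched as follows, assuming the PHP source file is saved as UTF-8. The helper name escapeForUtf8Page and the sample password are hypothetical.

```php
<?php
// Hypothetical helper: escape text for an HTML page that is served as UTF-8.
function escapeForUtf8Page(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Option three: declare the encoding before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// "\u{A3}" is the pound sign (PHP 7+ escape syntax).
$password = "a\u{A3}<9";

// With a UTF-8 page the pound sign can be sent as-is; only <, >, & and
// quotes are escaped.
echo escapeForUtf8Page($password), PHP_EOL;

// Option two: htmlentities() turns the pound sign into a named entity.
echo htmlentities($password, ENT_QUOTES, 'UTF-8'), PHP_EOL; // a&pound;&lt;9
```

Escaping is worth doing in either case, since the symbol set in the question also contains <, > and &.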
