PHP Turkish Characters to ASCII Giving Same Output


ord('Ö') returns 195, and ord('Ç') returns 195 too. I don't understand what the error is. Can you help me?

ord — Convert the first byte of a string to a value between 0 and 255
https://www.php.net/manual/en/function.ord.php
The question is: what is the charset of the source file?
Since 'Ö' and 'Ç' are not ASCII characters, each is represented as two bytes in UTF-8:
Ö - 0xC3 0x96
Ç - 0xC3 0x87
As you can see, both characters have the same first byte, 0xC3 (195 decimal), and ord() only looks at that first byte.
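A minimal sketch of that observation, assuming UTF-8 input. The bytes are written as escapes so the result does not depend on how the script file itself is saved:

```php
<?php
// UTF-8 bytes written as escapes so the file's own encoding does not matter.
$chars = ["\xC3\x96", "\xC3\x87"]; // Ö and Ç

foreach ($chars as $ch) {
    // bin2hex() shows every byte of the string; ord() only ever sees the first one.
    printf("%s: bytes=%s first byte=%d\n", $ch, bin2hex($ch), ord($ch));
}
```

Both lines report the same first byte even though the second bytes differ.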
So you need to decide which code you actually want to get.
For example, you can convert the UTF-8 string into Windows-1254:
print ord(iconv('UTF-8', 'Windows-1254', 'Ö')); // 214
print ord(iconv('UTF-8', 'Windows-1254', 'Ç')); // 199
Or you may want the Unicode code point. To get it, first convert the string to UTF-32 and then decode the resulting 32-bit number:
function get_codepoint($utf8char) {
$bin = iconv('UTF-8', 'UTF-32BE', $utf8char); // convert to UTF-32 big endian
$a = unpack('Ncp', $bin); // unpack binary data
return $a['cp']; // get the code point
}
print get_codepoint('Ö'); // 214
print get_codepoint('Ç'); // 199
Or, in PHP 7.2 and later, you can simply use mb_ord:
print mb_ord('Ö'); // 214
print mb_ord('Ç'); // 199

Related

How do I display extended ascii characters in my php code?

I'm trying to decode a text that contains extended ASCII characters but when I try to convert the character I get the wrong value. Like this:
echo "“<br>";
echo ord("“")."<br>";
echo chr(ord("“"))."<br>";
And this is my output:
“
226
�
The ASCII value of the character "“" is 147, not 226. And instead of the � symbol, I want to get "“" character back.
I'm using UTF-8
<meta charset="utf-8">
I have tried changing to different charsets but it didn't work.

First, U+201C (Left Double Quotation Mark) is encoded in UTF-8 as the byte sequence E2 80 9C (hexadecimal), i.e. 226 128 156 in decimal.
Second, ord converts only the first byte of a string to a value between 0 and 255.
Result: ord("“") returns 226.
Instead of the ord and chr pair, use mb_ord and its complement mb_chr, for example:
<?php
echo "“<br>";
echo mb_ord("“")."<br>";
echo mb_chr(mb_ord("“"))."<br>";
?>
Result: .\SO\74045685.php
“8220“
Edit: you can get the Windows-1251 code (147) for the character “ (U+201C, Left Double Quotation Mark) as follows:
echo ord(mb_convert_encoding("“","Windows-1251","UTF-8")); //147

You're incorrect about the “ character: 147 is the code point of a different character, U+0093, whose UTF-8 encoding is two bytes: c2 93.
See: SET TRANSMIT STATE.
In the manual for ord() it says:
However, note that this function is not aware of any string encoding,
and in particular will never identify a Unicode code point in a
multi-byte encoding such as UTF-8 or UTF-16.
On top of this, if I actually convert the '“' character to hexadecimal, I get e2809c, so it's three bytes. Never trust what you read online. 😏
See: https://3v4l.org/57UV8
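The hexadecimal check can be reproduced locally with a small sketch like the following. The character's UTF-8 bytes are written as escapes, so it does not matter how the script file is saved:

```php
<?php
$quote = "\xE2\x80\x9C"; // “ (U+201C) as UTF-8 bytes

echo bin2hex($quote), "\n"; // every byte of the string, as hex
echo strlen($quote), "\n";  // strlen() counts bytes, not characters
echo ord($quote), "\n";     // ord() reports the first byte only
```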

There is no ASCII representation for “. As has already been said, it is a multibyte character, encoded in UTF-8 to be precise:
echo mb_detect_encoding("“"); // UTF-8
ord() and chr() don't support this: ord() only looks at the first byte of the up to four bytes needed for a particular character. Fortunately there are functions that do:
echo "“\n"; // “
echo mb_ord("“")."\n"; // 8220
echo mb_chr(mb_ord("“")); // “
But why do you need to transform it back and forth? It seems you already have the character in your code :), not as a value but as the actual visual representation.

Russian characters from hex to string utf8 - getting the wrong characters

I am trying to pass hex-encoded parameters to an image-creating script. All documents are in utf8. Everything is fine until I go through the string in a loop. See the minimized example:
$string="ABCDЯ";
for($i=0;$i<strlen($string);$i++) {
echo $string[$i]."<br>";
}
gives the output:
A
B
C
D
�
instead of
A
B
C
D
Я
Why is that? Since I want to analyze the characters in the string, it fails at this point, because all Russian characters end up as �.

From the manual:
The string in PHP is implemented as an array of bytes and an integer
indicating the length of the buffer. It has no information about how
those bytes translate to characters, leaving that task to the
programmer.
So you're iterating over $string byte by byte. If a character is encoded with more than one byte, each step yields only a fragment of it, which is displayed as �.
Given that PHP does not dictate a specific encoding for strings, one
might wonder how string literals are encoded. For instance, is the
string "á" equivalent to "\xE1" (ISO-8859-1), "\xC3\xA1" (UTF-8, C
form), "\x61\xCC\x81" (UTF-8, D form) or any other possible
representation? The answer is that string will be encoded in whatever
fashion it is encoded in the script file.
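You can use mb_substr to get a character when iterating over $string, with mb_strlen as the loop bound so that the loop counts characters rather than bytes.
<?php
$string = 'ABCDЯ';
for($i = 0; $i < mb_strlen($string, 'UTF-8'); $i++) {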
echo mb_substr($string, $i, 1, 'UTF-8') . '<br>';
}
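As an alternative to indexing with mb_substr, the string can be split into characters once and then looped over. This is only a sketch: utf8_chars is a hypothetical helper name, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: split a UTF-8 string into its characters (code points).
function utf8_chars(string $s): array
{
    // The /u modifier makes PCRE treat the subject as UTF-8, so the empty
    // pattern matches between code points instead of between bytes.
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
}

foreach (utf8_chars("ABCD\xD0\xAF") as $char) { // \xD0\xAF is Я in UTF-8
    echo $char, "<br>";
}
```

On PHP 7.4 and later, mb_str_split($string, 1, 'UTF-8') should give the same list without a regular expression.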

ord() doesn't work with utf-8

According to ISO 8859-1, the € symbol has decimal value 128.
My default PHP script encoding is:
echo mb_internal_encoding(); //ISO-8859-1
So in PHP:
echo chr(128); //Output exactly what i want '€'
But:
echo ord('€'); //opposite it returns 226, it should be 128
Why is that?

This works only in PHP 7.2.0 and later.
mb_ord()
Now you can use mb_ord().
Example: echo mb_ord('€','UTF-8');
See also mb_chr(), which returns the character for a given code point. Example: echo mb_chr(2048,'UTF-8');
The best practice is to be universal: save all your PHP scripts as UTF-8 (see @deceze).

According to Wikipedia and FileFormat,
ISO-8859-1 doesn't have the Euro symbol at all
ISO-8859-15 has it at codepoint 164 (0xA4)
Windows-1252 has it at codepoint 128 (0x80)
Unicode has the Euro symbol at codepoint 8364 (0x20AC)
UTF-8 encodes that as 0xE2 0x82 0xAC. The first byte E2 is 226 in decimal.
So it seems your source file is encoded in UTF-8 (and ord() only returns the first byte), whereas your output is in Windows-1252.
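The values in this list can be checked with a short sketch, assuming the mbstring extension is available. The Euro sign is written as UTF-8 byte escapes so the script's own encoding plays no role:

```php
<?php
$euro = "\xE2\x82\xAC"; // € (U+20AC) as UTF-8 bytes

// Unicode code point of the character.
echo mb_ord($euro, 'UTF-8'), "\n";

// The single byte that represents € after converting to each legacy charset.
echo ord(mb_convert_encoding($euro, 'Windows-1252', 'UTF-8')), "\n";
echo ord(mb_convert_encoding($euro, 'ISO-8859-15', 'UTF-8')), "\n";

// Without any conversion, ord() just reports the first UTF-8 byte.
echo ord($euro), "\n";
```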

echo ord('€'); //opposite it returns 226, it should be 128
Your .php file is saved as UTF-8 (you saved it as UTF-8 in your text editor when you saved the file to disk). The string literal in there contains the bytes E2 82 AC; visualised it's something like this:
echo ord("\xE2\x82\xAC");
Open the file in a hex editor for real clarity.
ord only returns a single integer in the range of 0 - 255. Your string literal contains three bytes, for which ord would need to return three integers, which it won't. It returns only the first one, which is 226.
Save the file in different encodings in your text editor and you'll see different results.
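If a hex editor is not at hand, the bytes of a string literal can also be listed from PHP itself. A minimal sketch, where string_bytes is a hypothetical helper name:

```php
<?php
// Hypothetical helper: return every byte of a string as an integer 0-255.
function string_bytes(string $s): array
{
    // unpack('C*') yields one unsigned byte per element, keyed from 1;
    // array_values() re-indexes the result from 0.
    return array_values(unpack('C*', $s));
}

// The UTF-8 bytes of €, written as escapes.
echo implode(' ', string_bytes("\xE2\x82\xAC")), "\n";
```

Passing a literal typed directly in the script, such as '€', shows whatever bytes the editor wrote to disk for it.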

This PHP function returns the decimal code point of the first character in a string.
If the code point is lower than 128, the character is encoded in 1 octet.
Otherwise, if it is lower than 2048, the character is encoded in 2 octets.
Otherwise, if it is lower than 65536, the character is encoded in 3 octets.
Otherwise, if it is lower than 1114112, the character is encoded in 4 octets.
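function ord_utf8($s){
    // Take at most four bytes and pad with zeros, so short strings raise no warnings
    $b = array_pad(array_values(unpack('C*', substr($s, 0, 4))), 4, 0);
    if ($b[0] < (1<<7)) return $b[0]; // 1 octet
    if ($b[0] > 239 && $b[1] > 127 && $b[2] > 127 && $b[3] > 127) // 4 octets
        return (7&$b[0])<<18 | (63&$b[1])<<12 | (63&$b[2])<<6 | 63&$b[3];
    if ($b[0] > 223 && $b[1] > 127 && $b[2] > 127) // 3 octets
        return (15&$b[0])<<12 | (63&$b[1])<<6 | 63&$b[2];
    if ($b[0] > 193 && $b[1] > 127) return (31&$b[0])<<6 | 63&$b[1]; // 2 octets
    return 0; // not a valid leading byte
}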
echo ord_utf8('€');
// Output: 8364, so this character is encoded in 3 octets
You can check the result in https://eval.in/748181 …
The ord_utf8 function is the inverse of chr_utf8 (which returns one UTF-8 character for a decimal code point)
function chr_utf8($n,$f='C*'){
return $n<(1<<7)?chr($n):($n<1<<11?pack($f,192|$n>>6,1<<7|191&$n):
($n<(1<<16)?pack($f,224|$n>>12,1<<7|63&$n>>6,1<<7|63&$n):
($n<(1<<20|1<<16)?pack($f,240|$n>>18,1<<7|63&$n>>12,1<<7|63&$n>>6,1<<7|63&$n):'')));
}
for($test=1;$test<1114111;$test++)
if (ord_utf8(chr_utf8($test))!==$test)
die('Error found');
echo 'No error';
// Output No error

printf() Extended Unicode Characters?

$formatthis = 219;
$printthis = 98;
// %c - the argument is treated as an integer, and presented as the character with that ASCII value.
$string = 'There are %c treated as integer %c';
echo printf($string, $formatthis, $printthis);
I'm attempting to understand printf().
I don't quite understand the parameters.
I can see that the first parameter seems to be the string that the formatting will be applied to.
The second is the first variable to format, and the third seems to be the second variable to format.
What I don't understand is how to get it to print special Unicode characters,
e.g. anything beyond a-z, A-Z, !@#$%^&*(){}" etc.
Also, why does the output end with the position of the last quote in the string?
OUTPUT:
There are � treated as integer �32
How could I encode this in to UTF-16 (Dec) // Snowman = 9,731 DEC UTF 16?
UTF-8 'LATIN CAPITAL LETTER A' (U+0041) = 41, but if I write in PHP 41 I will get ')' I googled an ASCII table and it's showing that the number for A is 065...
ASCII is a subset of UTF-8, so if a document is ASCII then it is already UTF-8
If it's already in UTF-8, why are those two numbers different? The outputs are different as well.
EDIT: OK, so the chart I'm looking at is displaying the values in hex, which I didn't immediately notice; 41 in hex is ASCII 65.

%c takes an integer and outputs the single byte with that value. This goes up to the decimal number 255, which is output as the byte 0xFF.
To output, say, the snowman character ☃, you'd need to output the exact bytes necessary to represent it in your encoding of choice. If you chose UTF-8 to encode it, the necessary bytes are E2 98 83:
printf('%c%c%c', 226, 152, 131); // ☃
// or
printf('%c%c%c', 0xE2, 0x98, 0x83); // ☃
The problem in your case is 1) that the bytes you're outputting don't form valid text in the encoding you're interpreting the result as (the byte 219, 0xDB, is not a complete sequence on its own in UTF-8, which is why you're seeing a "�") and 2) that you're echoing the return value of printf, which is 32 (printf returns the number of bytes it output).
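Both points can be seen in a short sketch. It assumes the output is viewed as UTF-8, and the last step assumes the mbstring extension on PHP 7.2 or later:

```php
<?php
// sprintf() builds the string without printing it.
$snowman = sprintf('%c%c%c', 0xE2, 0x98, 0x83); // UTF-8 bytes of ☃ (U+2603)

// printf() prints the string and returns the number of bytes written.
// Store that number if it is needed; echoing it appends it to the output.
$written = printf("%s\n", $snowman);

// With mbstring, the code point (9731 decimal) can be used directly.
$same = mb_chr(9731, 'UTF-8') === $snowman;
```

Here $written counts bytes, not characters: three bytes for the snowman plus one for the newline.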

Convert two string to the same byte length

I have 2 strings in my PHP code, 1 is a parameter to my method and 1 is a string from an ini file.
The problem is that they are not equal, although they have the same content, probably due to encoding issues. When using var_dump, the first string's length is reported as 23 and the second string's length as 47 (see the end of my question for the reason behind this).
How can I make sure they are both encoded the same way and have the same length in the end, so that the comparison won't fail? Preferably, I would like them to be UTF-8 encoded.
For reference, this is an excerpt from the code:
static function getString($keyword,$file) {
$lang_handle = parse_ini_file($file, true);
var_dump($keyword);
foreach ($lang_handle as $key => $value) {
var_dump($key);
if ($key == $keyword) {
foreach ($value as $subkey => $subvalue) {
var_dump("\t" . $subkey . " => " . $subvalue);
}
}
}
}
with the following ini:
[clientcockpit/login.php]
header = "Kunden Login"
username = "Benutzername"
password = "Passwort"
forgot = "Passwort vergessen"
login = "Login"
When calling the method with getString("clientcockpit/login.php", "inifile.ini") the output is:
string 'clientcockpit/login.php' (length=23)
string '�c�l�i�e�n�t�c�o�c�k�p�i�t�/�l�o�g�i�n�.�p�h�p�' (length=47)

Your INI file seems to be in UTF-16 encoding or similar, using two bytes to represent a single character. I guess that the strange characters in your string are actually NULL bytes (\0).
PHP's Unicode support is quite poor and I guess that parse_ini_file() does not support multibyte encodings properly. It will treat the file as if it were encoded in an ASCII-compatible single-byte encoding, just looking for the special characters [ and ] to detect sections. As a result, the section keys will be corrupted: one byte that actually belongs to [ or ] becomes part of the section key:
UTF-16: [c] (3 characters, 6 bytes)
For UTF-16BE (big endian):
Bytes: 00 5B 00 63 00 5D (6 bytes)
ASCII: \0 [ \0 c \0 ] (6 characters)
For UTF-16LE (little endian):
Bytes: 5B 00 63 00 5D 00 (6 bytes)
ASCII: [ \0 c \0 ] \0 (6 characters)
Assuming ASCII, parse_ini_file() will read \0c\0 instead of c if the source file is encoded in UTF-16.
If you can control the format of your INI file, make sure to save it in UTF-8 or ISO-8859-1 encoding, using your favorite text editor.
Otherwise you will have to read in the file contents using file_get_contents(), do the encoding conversion (e.g. using iconv()) and pass the result to parse_ini_string(). The drawback here is that you will have to detect or hardcode the original file encoding.
If the mbstring extension is available on your PHP installation, you can use mb_detect_encoding() and mb_convert_encoding() to do the conversion dynamically.
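The read, convert and parse steps could be sketched as follows. Both function names are hypothetical, mbstring is assumed to be available, and the detection relies on the file starting with a byte order mark; for a UTF-16 file without one, the encoding has to be passed explicitly.

```php
<?php
// Hypothetical helper: convert raw INI bytes to UTF-8.
// Assumption: a UTF-16 file starts with a byte order mark (BOM);
// if it does not, pass the source encoding explicitly.
function ini_bytes_to_utf8(string $raw, ?string $encoding = null): string
{
    if ($encoding !== null) {
        return mb_convert_encoding($raw, 'UTF-8', $encoding);
    }
    if (strncmp($raw, "\xFF\xFE", 2) === 0) { // UTF-16 little endian BOM
        return mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16LE');
    }
    if (strncmp($raw, "\xFE\xFF", 2) === 0) { // UTF-16 big endian BOM
        return mb_convert_encoding(substr($raw, 2), 'UTF-8', 'UTF-16BE');
    }
    if (strncmp($raw, "\xEF\xBB\xBF", 3) === 0) { // UTF-8 BOM: just drop it
        return substr($raw, 3);
    }
    return $raw; // assume UTF-8 or another ASCII-compatible encoding
}

// Hypothetical wrapper with the same result shape as parse_ini_file($file, true).
function parse_ini_file_any_utf(string $file, ?string $encoding = null)
{
    $raw = file_get_contents($file);
    if ($raw === false) {
        return false;
    }
    return parse_ini_string(ini_bytes_to_utf8($raw, $encoding), true);
}
```

In getString() from the question, this would replace the parse_ini_file($file, true) call, after which the section keys should compare equal to the plain keyword.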

Try this:
$lang_handle = parse_ini_string(file_get_contents($file), true);
