ord() doesn't work with utf-8


According to ISO 8859-1, the € symbol has decimal value 128.
My default PHP script encoding is:
echo mb_internal_encoding(); // ISO-8859-1
So in PHP:
echo chr(128); // outputs exactly what I want: '€'
But:
echo ord('€'); // the opposite direction returns 226; it should be 128
Why is that?

This works only in PHP 7.2.0 and later.
mb_ord()
Now you can use mb_ord().
Example: echo mb_ord('€', 'UTF-8'); // 8364
See also mb_chr(), which returns the character for a decimal code point. Example: echo mb_chr(2048, 'UTF-8');
The best practice is to be universal: save all your PHP scripts as UTF-8 (see @deceze's answer).

According to Wikipedia and FileFormat,
ISO-8859-1 doesn't have the Euro symbol at all
ISO-8859-15 has it at codepoint 164 (0xA4)
Windows-1252 has it at codepoint 128 (0x80)
Unicode has the Euro symbol at codepoint 8364 (0x20AC)
UTF-8 encodes that as 0xE2 0x82 0xAC. The first byte E2 is 226 in decimal.
So it seems your source file is encoded in UTF-8 (and ord() only returns the first byte), whereas your output is in Windows-1252.
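The values listed above can be checked directly. This is a minimal sketch, assuming the mbstring extension is loaded and PHP 7.2+ (for mb_ord()); the variable names are only illustrative.

```php
<?php
// '€' written as explicit UTF-8 bytes, so the result does not depend on how this file is saved.
$euro = "\xE2\x82\xAC";

$firstUtf8Byte = ord($euro);                                               // 226 (0xE2)
$win1252Byte   = ord(mb_convert_encoding($euro, 'Windows-1252', 'UTF-8')); // 128 (0x80)
$latin9Byte    = ord(mb_convert_encoding($euro, 'ISO-8859-15', 'UTF-8'));  // 164 (0xA4)
$codePoint     = mb_ord($euro, 'UTF-8');                                   // 8364 (0x20AC)

printf("%d %d %d %d\n", $firstUtf8Byte, $win1252Byte, $latin9Byte, $codePoint);
// prints: 226 128 164 8364
```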

echo ord('€'); //opposite it returns 226, it should be 128
Your .php file is saved as UTF-8 (you saved it as UTF-8 in your text editor when you saved the file to disk). The string literal in there contains the bytes E2 82 AC; visualised it's something like this:
echo ord("\xE2\x82\xAC");
Open the file in a hex editor for real clarity.
ord only returns a single integer in the range of 0 - 255. Your string literal contains three bytes, for which ord would need to return three integers, which it won't. It returns only the first one, which is 226.
Save the file in different encodings in your text editor and you'll see different results.
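If no hex editor is at hand, the bytes can also be listed from PHP itself. A small sketch, using an explicit byte string in place of the literal from the question:

```php
<?php
$s = "\xE2\x82\xAC"; // the three bytes a UTF-8 editor stores for '€'

$bytes = array_map('ord', str_split($s)); // one integer per byte

echo strlen($s), "\n";           // 3
echo implode(' ', $bytes), "\n"; // 226 130 172
echo ord($s), "\n";              // 226: ord() looks at the first byte only
```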

This PHP function returns the decimal code point of the first character in a string.
If the code point is lower than 128, the character is encoded in 1 octet.
Else if it is lower than 2048, the character is encoded in 2 octets.
Else if it is lower than 65536, the character is encoded in 3 octets.
Else if it is lower than 1114112, the character is encoded in 4 octets.
function ord_utf8($s){
return (int) ($s=unpack('C*',substr($s,0,4)))&&$s[1]<(1<<7)?$s[1]:
($s[1]>239&&$s[2]>127&&$s[3]>127&&$s[4]>127?(7&$s[1])<<18|(63&$s[2])<<12|(63&$s[3])<<6|63&$s[4]:
($s[1]>223&&$s[2]>127&&$s[3]>127?(15&$s[1])<<12|(63&$s[2])<<6|63&$s[3]:
($s[1]>193&&$s[2]>127?(31&$s[1])<<6|63&$s[2]:0)));
}
echo ord_utf8('€');
// Outputs 8364, so this character is encoded in 3 octets
You can check the result at https://eval.in/748181 …
The ord_utf8 function is the inverse of chr_utf8 (which returns one UTF-8 character for a decimal code point):
function chr_utf8($n,$f='C*'){
return $n<(1<<7)?chr($n):($n<1<<11?pack($f,192|$n>>6,1<<7|191&$n):
($n<(1<<16)?pack($f,224|$n>>12,1<<7|63&$n>>6,1<<7|63&$n):
($n<(1<<20|1<<16)?pack($f,240|$n>>18,1<<7|63&$n>>12,1<<7|63&$n>>6,1<<7|63&$n):'')));
}
for($test=1;$test<1114111;$test++)
if (ord_utf8(chr_utf8($test))!==$test)
die('Error found');
echo 'No error';
// Output No error
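For readers who find the nested ternaries hard to follow, here is the same decoding idea spelled out step by step. It is only a sketch: utf8_first_codepoint is a hypothetical name, not part of the answer above, and unlike ord_utf8 it throws on malformed input instead of returning 0. It does not reject surrogates or overlong 3- and 4-byte forms.

```php
<?php
// Hypothetical helper: decode the first UTF-8 sequence of $s into a code point.
function utf8_first_codepoint(string $s): int
{
    if ($s === '') {
        throw new InvalidArgumentException('Empty string');
    }
    $lead = ord($s[0]);
    if ($lead < 0x80) {
        return $lead;                       // 0xxxxxxx: plain ASCII, one byte
    }
    if ($lead >= 0xC2 && $lead <= 0xDF) {   // 110xxxxx: one continuation byte
        $extra = 1;
        $cp = $lead & 0x1F;
    } elseif ($lead >= 0xE0 && $lead <= 0xEF) { // 1110xxxx: two continuation bytes
        $extra = 2;
        $cp = $lead & 0x0F;
    } elseif ($lead >= 0xF0 && $lead <= 0xF4) { // 11110xxx: three continuation bytes
        $extra = 3;
        $cp = $lead & 0x07;
    } else {
        throw new InvalidArgumentException('Invalid UTF-8 lead byte');
    }
    if (strlen($s) <= $extra) {
        throw new InvalidArgumentException('Truncated UTF-8 sequence');
    }
    for ($i = 1; $i <= $extra; $i++) {
        $byte = ord($s[$i]);
        if (($byte & 0xC0) !== 0x80) {      // continuation bytes look like 10xxxxxx
            throw new InvalidArgumentException('Invalid continuation byte');
        }
        $cp = ($cp << 6) | ($byte & 0x3F);  // append the six payload bits
    }
    return $cp;
}

echo utf8_first_codepoint("\xE2\x82\xAC"), "\n"; // 8364
```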

Related

How do I display extended ascii characters in my php code?

I'm trying to decode a text that contains extended ASCII characters but when I try to convert the character I get the wrong value. Like this:
echo "“<br>";
echo ord("“")."<br>";
echo chr(ord("“"))."<br>";
And this is my output:
“
226
�
The ASCII value of the character "“" is 147, not 226. And instead of the � symbol, I want to get "“" character back.
I'm using UTF-8
<meta charset="utf-8">
I have tried changing to different charsets but it didn't work.

1st U+201C Left Double Quotation Mark is UTF-8 byte sequence E2 80 9C (hexadecimal) i.e. decimal 226 128 156
2nd ord — Convert the first byte of a string to a value between 0 and 255
Result: ord("“") returns 226…
Instead of ord and chr pair, use mb_ord and its complement mb_chr, e.g. as follows:
<?php
echo "“<br>";
echo mb_ord("“")."<br>";
echo mb_chr(mb_ord("“"))."<br>";
?>
Result: .\SO\74045685.php
“8220“
Edit: you can get the Windows-1251 code (147) for the character “ (U+201C, Left Double Quotation Mark) as follows:
echo ord(mb_convert_encoding("“","Windows-1251","UTF-8")); //147
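The reverse direction, from the single-byte code 147 back to the character, works the same way. A minimal sketch, assuming mbstring and PHP 7.2+; it treats the byte as Windows-1251, as in the line above:

```php
<?php
$byte  = chr(147);                                            // the single byte 0x93
$quote = mb_convert_encoding($byte, 'UTF-8', 'Windows-1251'); // decode it into UTF-8

echo bin2hex($quote), "\n";         // e2809c
echo mb_ord($quote, 'UTF-8'), "\n"; // 8220
```

Echoing $quote on a UTF-8 page then shows “ instead of the replacement character.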

You're incorrect about the “ character: code 147 is U+0093, whose UTF-8 encoding is two bytes: c2 93.
See: SET TRANSMIT STATE.
In the manual for ord() it says:
However, note that this function is not aware of any string encoding,
and in particular will never identify a Unicode code point in a
multi-byte encoding such as UTF-8 or UTF-16.
On top of this, if I actually convert the '“' character to hexadecimal, I get: e2809c. So it's a triplet. Never trust what you read online. 😏
See: https://3v4l.org/57UV8

There is no ASCII representation for “. As has already been said, it is multibyte, UTF-8 to be precise:
echo mb_detect_encoding("“"); // UTF-8
ord() and chr() don't support this: you're only looking at the first of up to four bytes needed for a particular character. Fortunately, there are functions that do:
echo "“\n"; // “
echo mb_ord("“")."\n"; // 8220
echo mb_chr(mb_ord("“")); // “
But why do you need to transform it back and forth? It seems you already have the character in your code :), not as a value but as the actual visual representation.

PHP html_entity_decode is not working for UTF-8 characters? [duplicate]

Assuming UTF-8 encoding, and strlen() in PHP, is it possible that this string has a length of 4?
I'm only interested to know about strlen(), not other functions
This is the string:
$1ï¿½2
I have tested it on my own computer, and I have verified UTF-8 encoding, and the answer I get is 6.
I don't see anything in the manual for strlen or anything I've read on UTF-8 that would explain why some of the characters above would count for less than one.
PS: This question and answer (4) comes from a mock test for ZCE I bought on Ebay.

How about using mb_strlen()?
http://lt.php.net/manual/en/function.mb-strlen.php
But if you need to use strlen(), it's possible to set the mbstring.func_overload directive to 2 in your PHP configuration, so that calls to strlen() in your scripts are automatically replaced with mb_strlen(). Note that this directive was deprecated in PHP 7.2 and removed in PHP 8.0.

The string you posted is six characters long: $1ï¿½2 (dollar sign, digit one, lowercase i with diaeresis, inverted question mark, one half fraction, digit two)
If strlen() was called with a UTF-8 representation of that string, you would get a result of nine (probably, though there are multiple representations with different lengths).
However, if we were to store that string as ISO 8859-1 or CP1252 we would have a six byte long sequence that would be legal as UTF-8. Reinterpreting those 6 bytes as UTF-8 would then result in 4 characters: $1�2 (dollar sign, digit one, Unicode Replacement Character, digit 2). That is, the UTF-8 encoding of the single character '�' is identical to the ISO-8859-1 encoding of the three characters "ï¿½".
The replacement character often gets inserted when a UTF-8 decoder reads data that's not valid UTF-8 data.
It appears that the original string was processed through multiple layers of misinterpretation; by the use of a UTF-8 decoder on non-UTF-8 data (producing $1�2), and then by whatever you used to analyze that data (producing $1ï¿½2).

need to use Multibyte String Function mb_strlen() like:
mb_strlen($string, 'UTF-8');

It's likely that at some point between the preparation of the question and your reading of it some process has mangled non-ASCII characters in it, so the question was originally about some string with 4 characters in it.
The sequence ï¿½ is obtained when you encode the replacement character U+FFFD (�) in UTF-8 and interpret the result in latin1. This character is used as a replacement for byte sequences that don't encode any character when reading text from a file, for example. What has happened is likely this:
The original question, stored in a latin1 text file, had: $1¢2 (you can replace ¢ with any non-ASCII character)
The file was read by a program that used UTF-8. Since the byte corresponding to ¢ could not be interpreted, the program substituted it and read the text $1�2. This text was then written out using UTF-8, resulting in $1\xEF\xBF\xBD2 in the file.
Then some third program comes that reads the file in latin1, and shows $1ï¿½2.
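The last two steps of that chain can be reproduced as follows. This is a sketch, assuming the mbstring extension; the bytes are written explicitly so the result does not depend on the encoding of the script file.

```php
<?php
// "$1�2": the replacement character U+FFFD is EF BF BD in UTF-8.
$utf8 = '$1' . "\xEF\xBF\xBD" . '2';

echo strlen($utf8), "\n";             // 6 (bytes)
echo mb_strlen($utf8, 'UTF-8'), "\n"; // 4 (characters)

// Misread the same six bytes as ISO-8859-1 (latin1), then re-encode as UTF-8 for display.
$mojibake = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo $mojibake, "\n";                     // $1ï¿½2
echo mb_strlen($mojibake, 'UTF-8'), "\n"; // 6 (characters)
echo strlen($mojibake), "\n";             // 9 (bytes)
```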

No.
I'll use a proof by contradiction.
strlen counts bytes, so with a strlen of 4, there would need to be exactly 4 bytes in that string.
UTF8 encoding needs at least 1 byte per character.
We have established that:
there are 4 bytes
a character is represented by no less than 1 byte
...yet, we have 6 characters....which is a contradiction. So, no.
However, what's not totally clear is which character set the displaying software (e.g. the web browser) is using to interpret the string. It could use some uncommon encoding scheme where a character can be represented by fewer than 8 bits. If this were the case, then 4 bytes could display as 6 characters. So, the string could be UTF-8, but the browser could decide to interpret it as, say, some 5-bit character set.

Many UTF-8 characters take several bytes instead of one. That's how UTF-8 is constructed (That's how you can have so many characters in a single set).
Try mb_strlen() instead.

PHP Turkish Characters to ASCII Giving Same Output

ord('Ö') returns 195, and ord('Ç') returns 195 too. I don't understand what the error is. Can you help me?

ord — Convert the first byte of a string to a value between 0 and 255
https://www.php.net/manual/en/function.ord.php
The question is: what is the charset of the source file?
Since 'Ö' and 'Ç' are both non-ASCII symbols, each is represented as two bytes in UTF-8 encoding:
Ö - 0xC3 0x96
Ç - 0xC3 0x87
As you can see, both characters have the same first byte, 0xC3 (= 195 decimal).
So you need to decide which code you want to get.
For example, you can convert the UTF-8 string into Windows-1254:
print ord(iconv('UTF-8', 'Windows-1254', 'Ö')); // 214
print ord(iconv('UTF-8', 'Windows-1254', 'Ç')); // 199
Or you may want to get the Unicode code point. To do that, you can first convert the string into UTF-32 and then decode it as a 32-bit number:
function get_codepoint($utf8char) {
$bin = iconv('UTF-8', 'UTF-32BE', $utf8char); // convert to UTF-32 big endian
$a = unpack('Ncp', $bin); // unpack binary data
return $a['cp']; // get the code point
}
print get_codepoint('Ö'); // 214
print get_codepoint('Ç'); // 199
Or, in PHP 7.2 and later, you can simply use mb_ord():
print mb_ord('Ö'); // 214
print mb_ord('Ç'); // 199

printf() Extended Unicode Characters?

$formatthis = 219;
$printthis = 98;
// %c - the argument is treated as an integer, and presented as the character with that ASCII value.
$string = 'There are %c treated as integer %c';
echo printf($string, $formatthis, $printthis);
I'm attempting to understand printf().
I don't quite understand the parameters.
I can see that the first parameter seems to be the string that the formatting will be applied to.
The second is the first variable to format, and the third seems to be the second variable to format.
What I don't understand is how to get it to print special Unicode characters.
E.g. beyond a-z, A-Z, !@#$%^&*(){}" etc.
Also, why does it output a number right after the end of the string?
OUTPUT:
There are � treated as integer �32
How could I encode this into UTF-16 (decimal)? For example, snowman = 9,731 in decimal UTF-16.
UTF-8 'LATIN CAPITAL LETTER A' (U+0041) = 41, but if I write 41 in PHP I get ')'. I googled an ASCII table and it shows that the number for A is 065...
ASCII is a subset of UTF-8, so if a document is ASCII then it is already UTF-8
If it's already in UTF-8, why are those two numbers different? The outputs are different too.
EDIT: Okay, so the chart I'm looking at is obviously displaying the digits as hex values, which I didn't immediately notice. 41 in hex is ASCII 065.

%c is basically an int2bin function, meaning it formats a number into its binary representation. This goes up to the decimal number 255, which will be output as the byte 0xFF.
To output, say, the snowman character ☃, you'd need to output the exact bytes necessary to represent it in your encoding of choice. If you chose UTF-8 to encode it, the necessary bytes are E2 98 83:
printf('%c%c%c', 226, 152, 131); // ☃
// or
printf('%c%c%c', 0xE2, 0x98, 0x83); // ☃
The problem in your case is 1) that the bytes you're outputting don't mean anything in the encoding you're interpreting the result as (a lone byte 219, 0xDB, is not a valid sequence in UTF-8, which is why you're seeing a "�"; 98 is simply the ASCII letter b) and 2) that you're echoing the result of printf, which outputs 32 (printf returns the number of bytes it output).
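As an alternative to spelling out the bytes by hand, mb_chr() can build the character from its code point. A minimal sketch, assuming the mbstring extension and PHP 7.2+; it also shows printf's return value without echoing it by accident:

```php
<?php
$snowman   = mb_chr(9731, 'UTF-8');               // U+2603 from its decimal code point
$fromBytes = sprintf('%c%c%c', 0xE2, 0x98, 0x83); // the same three bytes via %c

var_dump($snowman === $fromBytes); // bool(true)
echo bin2hex($snowman), "\n";      // e29883

$written = printf("%s\n", $snowman); // prints the snowman and a newline
echo $written, "\n";                 // 4: three bytes for the character plus the newline
```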

Iconv byte length

I am using iconv to convert a string from CP1251 to UTF-8.
The problem is that the string length is 4 bytes before conversion and 8 bytes after.
After converting, I send the message to Apple's servers, where the length is limited.
How can I do the conversion and keep the same length?

There is no way you can do it. In UTF-8 one-byte codes are used only for the ASCII values 0 through 127. In this case the UTF-8 code has the same value as the ASCII code. The high-order bit of these codes is always 0.
As you are trying to encode non-ASCII characters, you'll get more than 1 byte per character.
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
http://en.wikipedia.org/wiki/UTF-8#Overlong_encodings
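The doubling is easy to reproduce. The sketch below assumes the iconv and mbstring extensions and uses four Cyrillic letters as sample input (the original message is not shown in the question). If the byte limit is hard, one option is to shorten the text at a character boundary with mb_strcut(); the limit of 5 bytes is only an illustration.

```php
<?php
$cp1251 = "\xCF\xF0\xE8\xE2";                   // "Прив" in CP1251: one byte per letter
$utf8   = iconv('CP1251', 'UTF-8', $cp1251);    // the same text in UTF-8

echo strlen($cp1251), "\n";           // 4 (bytes)
echo strlen($utf8), "\n";             // 8 (bytes): each Cyrillic letter needs two
echo mb_strlen($utf8, 'UTF-8'), "\n"; // 4 (characters)

// Cut to at most 5 bytes without splitting a multi-byte character.
$limited = mb_strcut($utf8, 0, 5, 'UTF-8');
echo strlen($limited), "\n";          // at most 5
```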
