Get ASCII value of extended ASCII characters in php - php

I am trying to get the ASCII value of some character that are in extend ASCII character set.
Like:
echo ord('„');
Its output is: 226
But actual ASCII value is : 132.
My question is how to get the actual ASCII value of those character that are greater than 1 byte size?

ord simply takes the first byte of the given string and returns its numerical value in decimal form. If it doesn't give you what you expect, likely your input is not what you expect. If you want the byte value of Extended ASCII, then your input string must be encoded in Extended ASCII. Currently you're likely getting the value of the first byte of E2 80 9E, the UTF-8 encoding for "„", because your input is actually UTF-8 encoded, because your source code file is saved as UTF-8.

I found the solution here.
Your character is 8222 in utf8 encoding and it is called multibyte character (mb) or html special entity.
function mb_ord($string)
{
if (extension_loaded('mbstring') === true)
{
mb_language('Neutral');
mb_internal_encoding('UTF-8');
mb_detect_order(array('UTF-8', 'ISO-8859-15', 'ISO-8859-1', 'ASCII'));
$result = unpack('N', mb_convert_encoding($string, 'UCS-4BE', 'UTF-8'));
if (is_array($result) === true)
{
return $result[1];
}
}
return ord($string);
}
echo mb_ord('„');

Related

What encoding is the resulting string if I concatenate a UTF-8 encoded string with an ASCII string in PHP?

If I use the function mb_convert_encoding() to convert an ASCII encoded string in PHP to a UTF-8 string, then concatenate it with an ASCII encoded string, what encoding is it? Are there any negative consequences for doing this?
It would depend firstly on whether you mean strict ASCII, which only includes 128 characters. Every single one of these characters has the exact same encoding in the ASCII encoding scheme as it does in the UTF-8 encoding scheme. For these characters, the mb_convert_encoding function will have no effect. You can easily verify this yourself with this script:
/* Convert ASCII to UTF-8 */
for ($i=0; $i<128; $i++) {
$str1 = chr($i);
$str2 = mb_convert_encoding($str1, "UTF-8", "ASCII");
echo $str1 . " - " . $str2 . " - ";
if ($str1 !== $str2) {
echo " - DIFFERENT!";
} else {
echo " - same";
}
echo "\n";
}
For all of these true ASCII characters, there's no point in transcoding them.
HOWEVER, if by "ASCII" you mean extended ASCII (see here) and are talking about characters with accents and stuff, then you are getting into trouble because there is no definitive character set described by this term. You'll notice that in the list of supported character encodings for php's Multibyte String extension there is only one occurrence of the acronym ASCII and that is for ASCII itself.
To answer your questions more precisely:
If I use the function mb_convert_encoding() to convert an ASCII encoded string in PHP to a UTF-8 string, then concatenate it with an ASCII encoded string, what encoding is it?
The resulting string is both ASCII and UTF-8 because both encoding schemes use identical byte encodings for those 128 characters.
Are there any negative consequences for doing this?
There should be no negative consequences under any circumstance if the characters are in fact true ASCII characters.
If, on the other hand, the strings include some accented character like Å or õ and some sloppy coder is calling this "extended ASCII" then you might have problems. Those characters have different encodings in the latin-1 and UTF-8 encoding schemes, for instance.
Consider taking a peek at this php function and it may shake loose some understanding. Ask yourself what it means to convert a character which is NOT ASCII from ASCII to UTF-8. It is not a meaningful conversion but it does result in a change in this particular script:
$chars = array("Å", "õ");
foreach ($chars as $char) {
echo $char . " : ";
$str1 = mb_convert_encoding($str1, "UTF-8", "ASCII");
$str2 = mb_convert_encoding($str1, "UTF-8", "ISO-8859-1");
echo $str1 . " - " . $str2 . " - ";
if ($char !== $str1) {
echo " - ASCII DIFFERENT";
}
if ($char !== $str2) {
echo " - LATIN 1 DIFFERENT";
}
echo "\n";
}
You might start to get confused at this point. It might help for you to know that my PHP code in that last function has its own character encoding which on my workstation happens to be utf-8. These transformations I've performed are therefore pretty stupid. I'm lying to PHP, saying that these UTF-8 strings are ASCII or Latin-1 and asking PHP to transform them to UTF-8. It performs a transformation as best it can but we all know that transformation isn't meaningful.
I hope you can appreciate what I'm getting at here. Every time you see a character on a computer, it has some encoding. Whether or not there are any negative consequences will depend on how you treat the data that comes to you, what transformations you perform on it, and what you intend to do with it later.
It's helpful to think of a chain of custody. Where did your data come from? What encoding did they use? Is that what I'm using on my system? Where am I sending this data? Does it need to be converted? You should also be careful to specify character sets for all these things:
data you receive from clients
form submissions to your website
display of html on your website
operations on text strings in your applications
character encoding of your connection to a database, character encoding of the tables in your db and encodings of the columns in those tables
character encoding of stored data
email character encoding
character encoding of data submitted to an API
And so on.
General rule of thumb: use utf-8 for everything you possibly can.
ASCII is a subset of UTF-8, so an ASCII string is a valid UTF-8 string. Concatenating two UTF-8 strings is unambiguous.

ord() doesn't work with utf-8

according to ISO 8859-1
€ Symbol has decimal value 128
My default php script encoding is
echo mb_internal_encoding(); //ISO-8859-1
So now as PHP
echo chr(128); //Output exactly what i want '€'
But
echo ord('€'); //opposite it returns 226, it should be 128
why it is so?
It is only for 2018's PHP v7.2.0+.
mb_ord()
Now you can use mb_ord().
Example echo mb_ord('€','UTF-8');
See also mb_chr(), to get the UTF-8 representation of a decimal code. Example echo mb_chr(2048,'UTF-8');.
The best practice is to be universal, save all your PHP scripts as UTF-8 (see #deceze).
According to Wikipedia and FileFormat,
ISO-8859-1 doesn't have the Euro symbol at all
ISO-8859-15 has it at codepoint 164 (0xA4)
Windows-1252 has it at codepoint 128 (0x80)
Unicode has the Euro symbol at codepoint 8364 (0x20AC)
UTF-8 encodes that as 0xE2 0x82 0xAC. The first byte E2 is 226 in decimal.
So it seems your source file is encoded in UTF-8 (and ord() only returns the first byte), whereas your output is in Windows-1252.
echo ord('€'); //opposite it returns 226, it should be 128
Your .php file is saved as UTF-8 (you saved it as UTF-8 in your text editor when you saved the file to disk). The string literal in there contains the bytes E2 82 AC; visualised it's something like this:
echo ord('\xE2\x82\xAC');
Open the file in a hex editor for real clarity.
ord only returns a single integer in the range of 0 - 255. Your string literal contains three bytes, for which ord would need to return three integers, which it won't. It returns only the first one, which is 226.
Save the file in different encodings in your text editor and you'll see different results.
This PHP function return the decimal number of the first character in string.
If the number is lower than 128 then the character is encoded in 1 octet.
Elseif the number is lower than 2048 then the character is encoded in 2 octets.
Elseif the number is lower than 65536 then the character is encoded in 3 octets.
Elseif the number is lower than 1114112 then the character is encoded in 4 octets.
function ord_utf8($s){
return (int) ($s=unpack('C*',$s[0].$s[1].$s[2].$s[3]))&&$s[1]<(1<<7)?$s[1]:
($s[1]>239&&$s[2]>127&&$s[3]>127&&$s[4]>127?(7&$s[1])<<18|(63&$s[2])<<12|(63&$s[3])<<6|63&$s[4]:
($s[1]>223&&$s[2]>127&&$s[3]>127?(15&$s[1])<<12|(63&$s[2])<<6|63&$s[3]:
($s[1]>193&&$s[2]>127?(31&$s[1])<<6|63&$s[2]:0)));
}
echo ord_utf8('€');
// Output 8364 then this character is encoded in 3 octets
You can check the result in https://eval.in/748181 …
The ord_utf8 function is the reciprocal of chr_utf8 (print one utf8 character from decimal number)
function chr_utf8($n,$f='C*'){
return $n<(1<<7)?chr($n):($n<1<<11?pack($f,192|$n>>6,1<<7|191&$n):
($n<(1<<16)?pack($f,224|$n>>12,1<<7|63&$n>>6,1<<7|63&$n):
($n<(1<<20|1<<16)?pack($f,240|$n>>18,1<<7|63&$n>>12,1<<7|63&$n>>6,1<<7|63&$n):'')));
}
for($test=1;$test<1114111;$test++)
if (ord_utf8(chr_utf8($test))!==$test)
die('Error found');
echo 'No error';
// Output No error

What changes my UTF-8 string to ASCII?

I have the following code:
$string = $this->getTextFromHTML($html);
echo mb_detect_encoding($string, 'ASCII,UTF-8,ISO-8859-1');
$stringArray = mb_split('\W+', $string);
$cleaned = array();
foreach($stringArray as $v) {
$string = trim($v);
if(!empty($string))
array_push($cleaned, $string);
}
echo mb_detect_encoding($stringArray[752], 'ASCII,UTF-8,ISO-8859-1');
The above returns:
// UTF-8
// ASCII
What part of my code is turning my string into ASCII? Or am I detecting the encoding incorrectly?
Strings have no actual associated encoding, they're merely byte arrays. mb_detect_encoding doesn't tell you what encoding the string has, it merely tries to detect it. That means it takes a few guesses (your second argument) and tells you the first that is valid.
Your original string probably contains some non-ASCII characters, so ASCII isn't a valid encoding for it, but UTF-8 is. When you're later testing a substring of the original, that substring probably contains only characters which are valid in ASCII, and since ASCII is the first encoding that's tested, that's the guessed result. Any ASCII string is also valid UTF-8, so there's no actual problem or "conversion" which happened.
As #Phylogenesis mentioned in the comments, ASCII characters under 0x7F are valid UTF-8. Unless you have a byte order mark in your data, the text is both valid ASCII and UTF-8. You've specified that ASCII is an option before UTF-8, so it is returned.
For example: https://ideone.com/DupS4A
<?php
$str = "apple";
// Returns ASCII
var_dump(mb_detect_encoding($str, "ASCII, UTF-8"));
// 0xEFBBBF is the byte order mark in UTF-8
$str_with_bom = chr(0xEF) . chr(0xBB) . chr(0xBF) . "apple";
// Returns UTF-8
var_dump(mb_detect_encoding($str_with_bom, "ASCII, UTF-8"));

PHP Utf8 Decoding Issue

I have the following address line: Praha 5, Staré Město,
I need to use utf8_decode() function on this string before I can write it to a PDF file (using domPDF lib).
However, the php utf8 decode function for the above address line appears incorrect (or rather, incomplete).
The following code:
<?php echo utf8_decode('Praha 5, Staré Město,'); ?>
Produces this:
Praha 5, Staré M?sto,
Any idea why ě is not getting decoded?
utf8_decode converts the string from a UTF-8 encoding to ISO-8859-1, a.k.a. "Latin-1".
The Latin-1 encoding cannot represent the letter "ě". It's that simple.
"Decode" is a total misnomer, it does the same as iconv('UTF-8', 'ISO-8859-1', $string).
See What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
I wound up using a home-grown UTF-8 / UTF-16 decoding function (convert to &#number; representations), I haven't found any pattern to why UTF-8 isn't detected, I suspect it's because the "encoded-as" sequence isn't always exactly in the same position in the string returned. You might do some additional checking on that.
Three-character UTF-8 indicator: $startutf8 = chr(0xEF).chr(187).chr(191); (if you see this ANYWHERE, not just first three characters, the string is UTF-8 encoded)
Decode according to UTF-8 rules; this replaced an earlier version which chugged through byte by byte:using
function charset_decode_utf_8 ($string) {
/* Only do the slow convert if there are 8-bit characters */
/* avoid using 0xA0 (\240) in ereg ranges. RH73 does not like that */
if (! ereg("[\200-\237]", $string) and ! ereg("[\241-\377]", $string))
return $string;
// decode three byte unicode characters
$string = preg_replace("/([\340-\357])([\200-\277])([\200-\277])/e",
"'&#'.((ord('\\1')-224)*4096 + (ord('\\2')-128)*64 + (ord('\\3')-128)).';'",
$string);
// decode two byte unicode characters
$string = preg_replace("/([\300-\337])([\200-\277])/e",
"'&#'.((ord('\\1')-192)*64+(ord('\\2')-128)).';'",
$string);
return $string;
}
Problem is in your PHP file encoding , save your file in UTF-8 encoding , then even no need to use utf8_decode , if you get these data 'Praha 5, Staré Město,' from database , better change it charset to UTF-8
you don't need that (#Rajeev :this string is automatically detected as utf-8 encoded :
echo mb_detect_encoding('Praha 5, Staré Město,');
will always return UTF-8.).
You'd rather see :
https://code.google.com/p/dompdf/wiki/CPDFUnicode

Unicode unknown "�" character detection in PHP

Is there any way in PHP of detecting the following character �?
I'm currently fixing a number of UTF-8 encoding issues with a few different algorithms and need to be able to detect if � is present in a string. How do I do so with strpos?
Simply pasting the character into my codebase does not seem to work.
if (strpos($names['decode'], '?') !== false || strpos($names['decode'], '�') !== false)
Converting a UTF-8 string into UTF-8 using iconv() using the //IGNORE parameter produces a result where invalid UTF-8 characters are dropped.
Therefore, you can detect a broken character by comparing the length of the string before and after the iconv operation. If they differ, they contained a broken character.
Test case (make sure you save the file as UTF-8):
<?php
header("Content-type: text/html; charset=utf-8");
$teststring = "Düsseldorf";
// Deliberately create broken string
// by encoding the original string as ISO-8859-1
$teststring_broken = utf8_decode($teststring);
echo "Broken string: ".$teststring_broken ;
echo "<br>";
$teststring_converted = iconv("UTF-8", "UTF-8//IGNORE", $teststring_broken );
echo $teststring_converted;
echo "<br>";
if (strlen($teststring_converted) != strlen($teststring_broken ))
echo "The string contained an invalid character";
in theory, you could drop //IGNORE and simply test for a failed (empty) iconv operation, but there might be other reasons for a iconv to fail than just invalid characters... I don't know. I would use the comparison method.
Here is what I do to detect and correct the encoding of strings not encoded in UTF-8 when that is what I am expecting:
$encoding = mb_detect_encoding($str, 'utf-8, iso-8859-1, ascii', true);
if (strcasecmp($encoding, 'UTF-8') !== 0) {
$str = iconv($encoding, 'utf-8', $str);
}
As far as I know, that question mark symbol is not a single character. There are many different character codes in the standard font sets that are not mapped to a symbol, and that is the default symbol that is used. To do detection in PHP, you would first need to know what font it is that you're using. Then you need to look at the font implementation and see what ranges of codes map to the "?" symbol, and then see if the given character is in one of those ranges.
I use the CUSTOM method (using str_replace) to sanitize undefined characters:
$input='a³';
$text=str_replace("\n\n", "sample000" ,$text);
$text=str_replace("\n", "sample111" ,$text);
$text=filter_var($text,FILTER_SANITIZE_SPECIAL_CHARS, FILTER_FLAG_STRIP_LOW);
$text=str_replace("sample000", "<br/><br/>" ,$text);
$text=str_replace("sample111", "<br/>" ,$text);
echo $text; //outputs ------------> a3

Categories