Escaped characters - \xFF - php

What are these strings called, and how can they be decoded? From what I found, these are UTF-8 multi-byte characters, though somewhere I also saw them described as UTF-32.
\x4c\x4f\x42\x41\x4c\x53

The byte sequence \x4c\x4f\x42\x41\x4c\x53 is written with hexadecimal escape sequences, and every byte in it is in the ASCII range. So you can treat it either as a single-byte encoded string or as UTF-8.
$s = "\x4c\x4f\x42\x41\x4c\x53";
echo $s; // outputs LOBALS
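If the escapes reach you as literal text (for instance read from a file, where the backslashes are real characters rather than PHP syntax), they have to be decoded explicitly. A minimal sketch, assuming only two-digit \xHH escapes need handling; decode_hex_escapes is a hypothetical helper name, not a built-in:

```php
<?php
// Hypothetical helper: turn literal "\xHH" text into the raw bytes it names.
function decode_hex_escapes(string $text): string
{
    return preg_replace_callback(
        '/\\\\x([0-9A-Fa-f]{2})/',                         // a backslash, "x", two hex digits
        static fn(array $m): string => chr(hexdec($m[1])), // hex digits -> one byte
        $text
    );
}

// Single quotes: PHP leaves the backslashes in the string untouched.
$literal = '\x4c\x4f\x42\x41\x4c\x53';
echo decode_hex_escapes($literal), "\n"; // LOBALS
```

Text without any escapes passes through unchanged, so the helper is safe to apply to mixed input.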

Related

How do I display extended ascii characters in my php code?

I'm trying to decode a text that contains extended ASCII characters but when I try to convert the character I get the wrong value. Like this:
echo "“<br>";
echo ord("“")."<br>";
echo chr(ord("“"))."<br>";
And this is my output:
“
226
�
The ASCII value of the character "“" is 147, not 226. And instead of the � symbol, I want to get "“" character back.
I'm using UTF-8
<meta charset="utf-8">
I have tried changing to different charsets but it didn't work.
1st: U+201C Left Double Quotation Mark is the UTF-8 byte sequence E2 80 9C (hexadecimal), i.e. decimal 226 128 156.
2nd: ord converts the first byte of a string to a value between 0 and 255.
Result: ord("“") returns 226.
Instead of the ord and chr pair, use mb_ord and its complement mb_chr, e.g. as follows:
<?php
echo "“<br>";
echo mb_ord("“")."<br>";
echo mb_chr(mb_ord("“"))."<br>";
?>
Result: .\SO\74045685.php
“8220“
Edit: you can get the Windows-1251 code (147) for the character “ (U+201C, Left Double Quotation Mark) as follows:
echo ord(mb_convert_encoding("“","Windows-1251","UTF-8")); //147
You're incorrect about the “ character: 147 (U+0093) is not “ but a control character, whose UTF-8 encoding is two bytes: c293.
See: SET TRANSMIT STATE.
In the manual for ord() it says:
However, note that this function is not aware of any string encoding,
and in particular will never identify a Unicode code point in a
multi-byte encoding such as UTF-8 or UTF-16.
On top of this, if I actually convert the '“' character to hexadecimal, I get: e2809c. So it's a triplet. Never trust what you read online. 😏
See: https://3v4l.org/57UV8
There is no ASCII representation for “, as has already been said it is multibyte, UTF-8 to be precise:
echo mb_detect_encoding("“"); // UTF-8
ord() and chr() don't support this; you're only looking at the first byte of up to four needed for a particular character. Fortunately there are functions that do:
echo "“\n"; // “
echo mb_ord("“")."\n"; // 8220
echo mb_chr(mb_ord("“")); // “
But why do you need to transform it back and forth? It seems you already have the character in your code :), not as a value but as the actual visual representation.
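To see where 226 and 8220 both come from, it can help to print the bytes next to the code point. A small sketch, assuming the mbstring extension is loaded and PHP 7.2 or later (for mb_ord and mb_chr); the variable names are only illustrative:

```php
<?php
// "\u{201C}" (PHP 7.0+) avoids depending on the source file's encoding.
$quote = "\u{201C}"; // “

$hex       = bin2hex($quote);                    // all bytes, as hex
$bytes     = array_values(unpack('C*', $quote)); // each byte as an integer
$first     = ord($quote);                        // first byte only
$codePoint = mb_ord($quote, 'UTF-8');            // the whole code point

echo $hex, "\n";                 // e2809c
echo implode(' ', $bytes), "\n"; // 226 128 156
echo $first, "\n";               // 226
echo $codePoint, "\n";           // 8220

// Round trip: code point back to the same three bytes.
var_dump(mb_chr($codePoint, 'UTF-8') === $quote); // bool(true)
```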

What encoding is the resulting string if I concatenate a UTF-8 encoded string with an ASCII string in PHP?

If I use the function mb_convert_encoding() to convert an ASCII encoded string in PHP to a UTF-8 string, then concatenate it with an ASCII encoded string, what encoding is it? Are there any negative consequences for doing this?
It would depend firstly on whether you mean strict ASCII, which only includes 128 characters. Every single one of these characters has the exact same encoding in the ASCII encoding scheme as it does in the UTF-8 encoding scheme. For these characters, the mb_convert_encoding function will have no effect. You can easily verify this yourself with this script:
/* Convert ASCII to UTF-8 */
for ($i = 0; $i < 128; $i++) {
    $str1 = chr($i);
    $str2 = mb_convert_encoding($str1, "UTF-8", "ASCII");
    echo $str1 . " - " . $str2;
    if ($str1 !== $str2) {
        echo " - DIFFERENT!";
    } else {
        echo " - same";
    }
    echo "\n";
}
For all of these true ASCII characters, there's no point in transcoding them.
HOWEVER, if by "ASCII" you mean extended ASCII and are talking about characters with accents and stuff, then you are getting into trouble because there is no definitive character set described by this term. You'll notice that in the list of supported character encodings for PHP's Multibyte String extension there is only one occurrence of the acronym ASCII and that is for ASCII itself.
To answer your questions more precisely:
If I use the function mb_convert_encoding() to convert an ASCII encoded string in PHP to a UTF-8 string, then concatenate it with an ASCII encoded string, what encoding is it?
The resulting string is both ASCII and UTF-8 because both encoding schemes use identical byte encodings for those 128 characters.
Are there any negative consequences for doing this?
There should be no negative consequences under any circumstance if the characters are in fact true ASCII characters.
If, on the other hand, the strings include some accented character like Å or õ and some sloppy coder is calling this "extended ASCII" then you might have problems. Those characters have different encodings in the latin-1 and UTF-8 encoding schemes, for instance.
Consider taking a peek at this php function and it may shake loose some understanding. Ask yourself what it means to convert a character which is NOT ASCII from ASCII to UTF-8. It is not a meaningful conversion but it does result in a change in this particular script:
$chars = array("Å", "õ");
foreach ($chars as $char) {
    echo $char . " : ";
    // Deliberately mislabel the input as ASCII, then as Latin-1
    $str1 = mb_convert_encoding($char, "UTF-8", "ASCII");
    $str2 = mb_convert_encoding($char, "UTF-8", "ISO-8859-1");
    echo $str1 . " - " . $str2;
    if ($char !== $str1) {
        echo " - ASCII DIFFERENT";
    }
    if ($char !== $str2) {
        echo " - LATIN 1 DIFFERENT";
    }
    echo "\n";
}
You might start to get confused at this point. It might help for you to know that my PHP code in that last function has its own character encoding which on my workstation happens to be utf-8. These transformations I've performed are therefore pretty stupid. I'm lying to PHP, saying that these UTF-8 strings are ASCII or Latin-1 and asking PHP to transform them to UTF-8. It performs a transformation as best it can but we all know that transformation isn't meaningful.
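The effect of a wrong source label is easier to see at the byte level. This is only a sketch, assuming the mbstring extension; it writes the bytes with escapes so the result does not depend on how the .php file itself is saved:

```php
<?php
$utf8   = "\u{00C5}"; // Å encoded as UTF-8: bytes C3 85
$latin1 = "\xC5";     // Å encoded as ISO-8859-1: single byte C5

// Correct label: Latin-1 bytes converted to UTF-8.
$good = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// Wrong label: UTF-8 bytes declared as Latin-1, so each byte is
// re-encoded as a character of its own (double encoding).
$bad = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n"; // c385
echo bin2hex($good), "\n"; // c385
echo bin2hex($bad), "\n";  // c383c285
```

The last line is the familiar mojibake: two bytes became four because the input was described incorrectly, not because the conversion itself misbehaved.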
I hope you can appreciate what I'm getting at here. Every time you see a character on a computer, it has some encoding. Whether or not there are any negative consequences will depend on how you treat the data that comes to you, what transformations you perform on it, and what you intend to do with it later.
It's helpful to think of a chain of custody. Where did your data come from? What encoding did they use? Is that what I'm using on my system? Where am I sending this data? Does it need to be converted? You should also be careful to specify character sets for all these things:
data you receive from clients
form submissions to your website
display of html on your website
operations on text strings in your applications
character encoding of your connection to a database, character encoding of the tables in your db and encodings of the columns in those tables
character encoding of stored data
email character encoding
character encoding of data submitted to an API
And so on.
General rule of thumb: use utf-8 for everything you possibly can.
ASCII is a subset of UTF-8, so an ASCII string is a valid UTF-8 string. Concatenating two UTF-8 strings is unambiguous.
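One way to check this claim on your own data is mb_check_encoding. The strings below are made up for illustration, and the sketch assumes the mbstring extension is available:

```php
<?php
$ascii = "Price: ";
$utf8  = "10 \u{20AC}"; // "10 €", non-ASCII, encoded as UTF-8
$latin = "caf\xE9";     // "café" in ISO-8859-1, which is NOT valid UTF-8

$ok  = $ascii . $utf8;  // ASCII + UTF-8
$mix = $ascii . $latin; // ASCII + Latin-1 bytes

var_dump(mb_check_encoding($ascii, 'ASCII')); // bool(true)
var_dump(mb_check_encoding($ascii, 'UTF-8')); // bool(true)
var_dump(mb_check_encoding($ok, 'UTF-8'));    // bool(true)
var_dump(mb_check_encoding($ok, 'ASCII'));    // bool(false)
var_dump(mb_check_encoding($mix, 'UTF-8'));   // bool(false)
```

Concatenation itself never converts anything; it just joins bytes. The result is valid UTF-8 only if every piece was valid UTF-8 to begin with.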

PHP intval of multibyte strings

How does intval() change when using UTF-8 multibyte strings as opposed to regular one byte per character strings? Is it the same?
PHP doesn't distinguish string encodings internally. A string is simply an array of bytes. If you pass a UTF-8 string to intval, the function only sees the bytes of the encoded UTF-8 string. Given the nature of the UTF-8 encoding, intval will treat any non-ASCII character as a non-digit. So it doesn't make a difference whether you pass an ASCII, Latin1, or UTF-8 string.
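A few quick cases illustrate this. The inputs are invented for the example and written with \u{...} escapes (PHP 7.0+) so they are UTF-8 regardless of the file's encoding:

```php
<?php
$plain    = intval("42");               // ordinary ASCII digits
$trailing = intval("42\u{00E9}");       // "42é": parsing stops at the first byte of é
$leading  = intval("\u{00E9}42");       // "é42": does not start with a digit
$arabic   = intval("\u{0664}\u{0662}"); // "٤٢": Arabic-Indic digits are not ASCII digits

var_dump($plain);    // int(42)
var_dump($trailing); // int(42)
var_dump($leading);  // int(0)
var_dump($arabic);   // int(0)
```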

how to get unicode character from a unicode string in php

I want to get a single Unicode character from a Unicode string.
For example:
$str = "पर्वत निर्माणों में कोनसा संचलन कार्य करता है";
echo $str[0];
The output is: �
But I want to get the char 'प' at index 0 of the string.
Please help me get the char 'प' instead of �.
As @deceze writes, you need to use mb_substr in order to get a character, instead of just a byte. In addition, you need to set the internal encoding with mb_internal_encoding. Assuming that the encoding of your .php file is UTF-8, the following should work:
mb_internal_encoding('utf-8');
$str = "पर्वत निर्माणों में कोनसा संचलन कार्य करता है";
echo mb_substr($str, 0, 1);
PHP's default $str[x] notation operates on bytes, so you're just getting the first part of a multibyte character. To extract the complete byte sequence of a whole character in an encoding-aware way, you need to use mb_substr.
Also see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
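The difference between bytes and characters can be made visible with a short sketch like the one below. It assumes the mbstring extension and passes the encoding explicitly to each call, so it does not rely on mb_internal_encoding; the word is the first one from the question, written with escapes:

```php
<?php
$str = "\u{092A}\u{0930}\u{094D}\u{0935}\u{0924}"; // पर्वत

$byteCount = strlen($str);                  // counts bytes
$charCount = mb_strlen($str, 'UTF-8');      // counts code points
$first     = mb_substr($str, 0, 1, 'UTF-8'); // first code point, all its bytes

echo $byteCount, "\n";       // 15
echo $charCount, "\n";       // 5
echo $first, "\n";           // प
echo bin2hex($str[0]), "\n"; // e0 (only the first of three bytes)
```

Note that mb_substr counts code points, not what a reader perceives as one character: a consonant followed by a combining sign, as in र्, is still two code points.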

php unicode 16 bit

How can I append a 16-bit Unicode character to a string in PHP?
$test = "testing" . (U + 199F);
From what I see, \x only takes 8 bit characters aka ascii
From the manual:
PHP only supports a 256-character set, and hence does not offer native Unicode support.
You could enter a manually-encoded UTF-8 sequence, I suppose.
You can also type out UCS4 as byte sequence and use iconv("UTF-32LE", "UTF-8", $str); to convert it into UTF-8 for further processing. You just can't input the codepoint as a 32-bit code unit in one go.
Unicode characters don't directly exist in PHP(*), but you can deal with strings containing bytes that represent characters in UTF-8 encoding. Here's one way of converting a numeric character code point to UTF-8:
function unichr($i) {
    // Pack the code point as a 32-bit little-endian integer, then convert to UTF-8
    return iconv('UCS-4LE', 'UTF-8', pack('V', $i));
}
$test = 'testing' . unichr(0x199F);
(*: and ‘16-bit’ Unicode characters don't exist at all; Unicode has code points way beyond U+FFFF. There are 16-bit ‘code units’ in UTF-16, but that's an ugly encoding you're unlikely to meet in PHP.)
Because UTF-8 is a multibyte encoding and PHP strings are just bytes, you can build a multibyte character by writing out its UTF-8 bytes one at a time :) For U+199F those bytes are E1 A6 9F:
$test = "testing\xE1\xA6\x9F";
Try (PHP 7.0 or later):
$test = "testing" . "\u{199F}";
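For comparison, here is a sketch of three ways to get the UTF-8 bytes for U+199F on current PHP versions. It assumes PHP 7.2 or later with the mbstring extension; the variable names are only for illustration:

```php
<?php
$cp = 0x199F;

$a = "\u{199F}";                                               // escape syntax, PHP 7.0+
$b = mb_chr($cp, 'UTF-8');                                     // mbstring, PHP 7.2+
$c = mb_convert_encoding(pack('N', $cp), 'UTF-8', 'UTF-32BE'); // via a UTF-32 code unit

echo bin2hex($a), "\n";           // e1a69f
var_dump($a === $b && $b === $c); // bool(true)

$test = "testing" . $a;
echo strlen($test), "\n";         // 10 (7 ASCII bytes + 3 bytes for U+199F)
```

All three produce the same three bytes, so the choice is mostly about which PHP version and extensions you can count on.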
