I have an ASCII string. I like to change its encoding to utf-8.
But I found there's a simple function to change ascii to utf-8 in php.
and vice verse, I like to change utf-8 alphabet to ascii.
Please advise.
I have tried:
<?php
// utf-8
$str = "CHONKIOK";
// I can't even how to print these utf-8 characters in php. I just copied/pasted the string.
// strlen($str) => 24 bytes
// mb_detect_encoding($str) => utf-8
$str2 = "CHONKIOK";
// strlen($str2) => 8 bytes
// mb_detect_encoding($str2) => ascii
// change ascii to utf-8
$str = mb_convert_encoding($str2, "UTF-8");
echo mb_detect_encoding($str);
// returns ascii
What you are doing is correct.
As per mb_detect_encoding it states that it detects the most likely character encoding.
As the entire ASCII set is contained within UTF-8 at the exact same character positions, this function is telling you that it's an ASCII string because it technically is. The bytes of this string when encoded in both ASCII and UFT-8 are identical.
As you've found, when you include some characters outside of the ASCII set then it will give you the next probable encoding.
What exactly should I do to obtain this string: "CHONKIOK" from "CHONKIOK"?
The characters you're after are called "Fullwidth Latin" characters.
Given the C character provided is character 65,315 and a regular C is character 67, you could possible obtain the strings you're after by adding the difference of 65,248. This is only possible because the alphabet tends to repeat in the same order throughout different parts of the character charts.
You can get the code point of a character using mb_ord and convert it back to a character using mb_chr, after adding 65,248.
That might look something like:
$str_input = "ABC abc 123";
$convertable = "ABCDEFG12349abcdefg";
$str_output = "";
for ($i = 0; $i < strlen($str_input); $i++) {
$char = mb_ord($str_input[$i], "UTF-8");
if(str_contains($convertable, $str_input[$i])) $char += 65248;
$str_output .= mb_chr($char, "UTF-8");
}
echo $str_output; // outputs "ABC abc 123"
Just be sure to include the whole alphabet in $convertable
try this to convert to utf-8:
utf8_encode(string $string): string
try this to convert to ASCII:
utf8_decode(string $string): string
I have this PHP code:
$tagId = 1; // the original value of tag
$tagIdAsHex = sprintf("%02X", $tagId); // the tag value in hex format
$tagAsHexBytes = pack('H*', $tagIdAsHex); // the packed hex value of tag packed into string as a conversion
How can I translate that to C++?
byte tagId = 1;
auto hexedTag = IntToHex(tagId); //C++ Builder
??
The PHP code shown is simply converting the integer 1 into a hex-encoded string containing "01", and is then parsing that hex string into a binary string holding a single byte 0x01.
In C, you can use sscanf() in a loop to parse a hex string.
In standard C++, you can use std::hex and std::setw() to parse a hex string from any std::istream, such as std::istringstream, using operator>> in a loop.
In C++Builder specifically, you can use its RTL's HexToBin() function to parse a hex string into a pre-allocated byte array.
This question looks embarrassingly simple, but I haven't been able to find an answer.
What is the PHP equivalent to the following C# line of code?
string str = "\u1000";
This sample creates a string with a single Unicode character whose "Unicode numeric value" is 1000 in hexadecimal (4096 in decimal).
That is, in PHP, how can I create a string with a single Unicode character whose "Unicode numeric value" is known?
PHP 7.0.0 has introduced the "Unicode codepoint escape" syntax.
It's now possible to write Unicode characters easily by using a double-quoted or a heredoc string, without calling any function.
$unicodeChar = "\u{1000}";
Because JSON directly supports the \uxxxx syntax the first thing that comes into my mind is:
$unicodeChar = '\u1000';
echo json_decode('"'.$unicodeChar.'"');
Another option would be to use mb_convert_encoding()
echo mb_convert_encoding('က', 'UTF-8', 'HTML-ENTITIES');
or make use of the direct mapping between UTF-16BE (big endian) and the Unicode codepoint:
echo mb_convert_encoding("\x10\x00", 'UTF-8', 'UTF-16BE');
I wonder why no one has mentioned this yet, but you can do an almost equivalent version using escape sequences in double quoted strings:
\x[0-9A-Fa-f]{1,2}
The sequence of characters matching the regular expression is a
character in hexadecimal notation.
ASCII example:
<?php
echo("\x48\x65\x6C\x6C\x6F\x20\x57\x6F\x72\x6C\x64\x21");
?>
Hello World!
So for your case, all you need to do is $str = "\x30\xA2";. But these are bytes, not characters. The byte representation of the Unicode codepoint coincides with UTF-16 big endian, so we could print it out directly as such:
<?php
header('content-type:text/html;charset=utf-16be');
echo("\x30\xA2");
?>
ア
If you are using a different encoding, you'll need alter the bytes accordingly (mostly done with a library, though possible by hand too).
UTF-16 little endian example:
<?php
header('content-type:text/html;charset=utf-16le');
echo("\xA2\x30");
?>
ア
UTF-8 example:
<?php
header('content-type:text/html;charset=utf-8');
echo("\xE3\x82\xA2");
?>
ア
There is also the pack function, but you can expect it to be slow.
PHP does not know these Unicode escape sequences. But as unknown escape sequences remain unaffected, you can write your own function that converts such Unicode escape sequences:
function unicodeString($str, $encoding=null) {
if (is_null($encoding)) $encoding = ini_get('mbstring.internal_encoding');
return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', create_function('$match', 'return mb_convert_encoding(pack("H*", $match[1]), '.var_export($encoding, true).', "UTF-16BE");'), $str);
}
Or with an anonymous function expression instead of create_function:
function unicodeString($str, $encoding=null) {
if (is_null($encoding)) $encoding = ini_get('mbstring.internal_encoding');
return preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/u', function($match) use ($encoding) {
return mb_convert_encoding(pack('H*', $match[1]), $encoding, 'UTF-16BE');
}, $str);
}
Its usage:
$str = unicodeString("\u1000");
html_entity_decode('エ', 0, 'UTF-8');
This works too. However the json_decode() solution is a lot faster (around 50 times).
Try Portable UTF-8:
$str = utf8_chr( 0x1000 );
$str = utf8_chr( '\u1000' );
$str = utf8_chr( 4096 );
All work exactly the same way. You can get the codepoint of a character with utf8_ord(). Read more about Portable UTF-8.
As mentioned by others, PHP 7 introduces support for the \u Unicode syntax directly.
As also mentioned by others, the only way to obtain a string value from any sensible Unicode character description in PHP, is by converting it from something else (e.g. JSON parsing, HTML parsing or some other form). But this comes at a run-time performance cost.
However, there is one other option. You can encode the character directly in PHP with \x binary escaping. The \x escape syntax is also supported in PHP 5.
This is especially useful if you prefer not to enter the character directly in a string through its natural form. For example, if it is an invisible control character, or other hard to detect whitespace.
First, a proof example:
// Unicode Character 'HAIR SPACE' (U+200A)
$htmlEntityChar = " ";
$realChar = html_entity_decode($htmlEntityChar);
$phpChar = "\xE2\x80\x8A";
echo 'Proof: ';
var_dump($realChar === $phpChar); // bool(true)
Note that, as mentioned by Pacerier in another answer, this binary code is unique to a specific character encoding. In the above example, \xE2\x80\x8A is the binary coding for U+200A in UTF-8.
The next question is, how do you get from U+200A to \xE2\x80\x8A?
Below is a PHP script to generate the escape sequence for any character, based on either a JSON string, HTML entity, or any other method once you have it as a native string.
function str_encode_utf8binary($str) {
/** #author Krinkle 2018 */
$output = '';
foreach (str_split($str) as $octet) {
$ordInt = ord($octet);
// Convert from int (base 10) to hex (base 16), for PHP \x syntax
$ordHex = base_convert($ordInt, 10, 16);
$output .= '\x' . $ordHex;
}
return $output;
}
function str_convert_html_to_utf8binary($str) {
return str_encode_utf8binary(html_entity_decode($str));
}
function str_convert_json_to_utf8binary($str) {
return str_encode_utf8binary(json_decode($str));
}
// Example for raw string: Unicode Character 'INFINITY' (U+221E)
echo str_encode_utf8binary('∞') . "\n";
// \xe2\x88\x9e
// Example for HTML: Unicode Character 'HAIR SPACE' (U+200A)
echo str_convert_html_to_utf8binary(' ') . "\n";
// \xe2\x80\x8a
// Example for JSON: Unicode Character 'HAIR SPACE' (U+200A)
echo str_convert_json_to_utf8binary('"\u200a"') . "\n";
// \xe2\x80\x8a
function unicode_to_textstring($str){
$rawstr = pack('H*', $str);
$newstr = iconv('UTF-16BE', 'UTF-8', $rawstr);
return $newstr;
}
$msg = '67714eac99c500200054006f006b0079006f002000530074006100740069006f006e003a0020';
echo unicode_to_textstring($str);
I have the hex representation and I'm trying to convert this to the type of hex. For example if you execute:
echo "\xFF\xAA";
echo '<br>';
$first = "FF";
$second = "AA";
echo "\x$first\x$second";
die();
The result you get is:
So the first line is weird symbols. This indicates to me this is actually of type hex. Where the second result although is valid representations of hex, they are in fact not actually hex.
So my question is how can I make the actual result hex? I want to be able to use a variable as the representation but actually convert it to hex. (I want line two to show up as symbols)
hex2bin — Decodes a hexadecimally encoded binary string
<?php
$hex = hex2bin("6578616d706c65206865782064617461");
echo $hex;
?>
Output:
example hex data
Source: http://php.net/manual/en/function.hex2bin.php
So just use
echo hex2bin("$first$second");
In PHP when I use the ord function in order to catch the ASCII code of my character I get this behavior:
ord("a") // return 97
chr(97) // return a
But when I use a special character like Œ the returns are different:
ord("Œ") // return 197
chr(197) // return �
All of my pages are encoded in utf8. This behaviour is the same for most of the special characters.
Has somebody seen this problem in the past? How can I fix it?
ord() and chr() both use the ASCII values of characters, which is a single byte encoding. Œ is not a valid character in ASCII.
You can get each byte of a multi-byte character by specifying the byte offset, as follows:
$oethel = "Œ";
$firstByte = ord($oethel[0]); // 197
$secondByte = ord($oethel[1]); // 146
Reversing the process, however, does not work, because assigning to a string byte offset converts that string to an array:
$newOethel = "";
$newOethel[0] = chr(197);
$newOethel[1] = chr(146);
echo $newOethel;
// Output is as follows:
// PHP Notice: Array to string conversion
// Array
The black diamond with a question mark is a display problem.
Review the details of black diamond in https://stackoverflow.com/a/38363567/1766831 . There are two cases; see which one fits.