I find that PHP's string to float conversion is not locale aware. If I setlocale() to a locale where the decimal point is a comma, floatval fails to parse "3,14". I find this surprising especially since the opposite conversion - float to string - is locale aware and outputs the comma.
<?php
setlocale(LC_ALL, "Norwegian", "no");
$localeconv = localeconv();
echo "decimal_point is `" . $localeconv['decimal_point'] . "'<br/>";
print "float to string: " . 3.14 . "<br/>"; // <-- Outputs "3,14" CORRECT
print "string to float: " . floatval("3,14"); // <-- Outputs "3" INCORRECT
?>
The output I get is the following:
decimal_point is `,'
float to string: 3,14
string to float: 3
This is with PHP 5.3.6 on Windows. Is this the intended behaviour? Does PHP on Unix give the same result?
There is a locale-aware function in the PHP manual:
<?php
function ParseFloat($floatString){
$LocaleInfo = localeconv();
$floatString = str_replace($LocaleInfo["mon_thousands_sep"] , "", $floatString);
$floatString = str_replace($LocaleInfo["mon_decimal_point"] , ".", $floatString);
return floatval($floatString);
}
?>
This is safer than simply replacing commas with dots as that would break things for some locales.
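Note that the manual's function reads the monetary keys (mon_decimal_point, mon_thousands_sep). For plain numbers, localeconv() also exposes decimal_point and thousands_sep. Below is a minimal sketch along the same lines. parseLocalizedFloat is a hypothetical name, not a PHP built-in, and the separators can be passed explicitly so the result does not depend on which locales are installed:

```php
<?php
// Hypothetical helper, not part of PHP: strips the grouping separator and
// normalises the decimal separator before handing the string to floatval().
// When a separator is not passed in, it is taken from the current locale.
function parseLocalizedFloat(string $s, ?string $decimalPoint = null, ?string $thousandsSep = null): float
{
    $info = localeconv();
    $decimalPoint = $decimalPoint ?? $info['decimal_point'];
    $thousandsSep = $thousandsSep ?? $info['thousands_sep'];

    // Remove grouping first, so "1.234,56" becomes "1234,56".
    if ($thousandsSep !== '') {
        $s = str_replace($thousandsSep, '', $s);
    }
    // Then turn the locale's decimal separator into the "." floatval() expects.
    if ($decimalPoint !== '' && $decimalPoint !== '.') {
        $s = str_replace($decimalPoint, '.', $s);
    }
    return floatval($s);
}

$a = parseLocalizedFloat('3,14', ',', '.');     // 3.14
$b = parseLocalizedFloat('1.234,56', ',', '.'); // 1234.56
```

Calling parseLocalizedFloat("3,14") with no separators falls back to whatever setlocale() has configured.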
I'm trying to write my own mb_ucwords() function as a quick wrapper around mb_convert_case so that it works with multibyte strings, since the base ucwords() function does not.
I have run into an issue where a string that starts with the µ character (U+00B5 MICRO SIGN) comes back as "Μ" (U+039C GREEK CAPITAL LETTER MU) instead of being left alone, as I assumed would happen.
I wrote a quick test script to verify some information:
function testUtf8($letter) {
echo "CHAR: " . $letter . "\n";
echo "Detected Encoding: " . mb_detect_encoding($letter) . "\n";
echo "IS VALID UTF-8? " . (mb_check_encoding($letter, 'UTF-8') ? 'YES' : 'NO') . "\n";
$lower = mb_strtolower($letter);
$upper = mb_strtoupper($letter);
$conv = mb_convert_case($letter, MB_CASE_TITLE, 'UTF-8');
echo "mb_strtolower(): " . $lower . "(" . mb_ord($lower) . ")\n";
echo "mb_strtoupper(): " . $upper . "(" . mb_ord($upper) . ")\n";
echo "mb_convert_case(): " . $conv . "(" . mb_ord($conv) . ")\n";
echo "\n";
echo "Matches RegEx /\p{L}/u: " . (preg_match('/\p{L}/u', $letter) ? 'YES' : 'NO') . "\n";
echo "Matches RegEx /\p{N}/u: " . (preg_match('/\p{N}/u', $letter) ? 'YES' : 'NO') . "\n";
echo "Matches RegEx /\p{Xan}/u: " . (preg_match('/\p{Xan}/u', $letter) ? 'YES' : 'NO') . "\n";
}
testUtf8('µ');
And the output I get is:
CHAR: µ
Detected Encoding: UTF-8
IS VALID UTF-8? YES
mb_strtolower(): µ(181)
mb_strtoupper(): Μ(924)
mb_convert_case(): Μ(924)
Matches RegEx /\p{L}/u: YES
Matches RegEx /\p{N}/u: NO
Matches RegEx /\p{Xan}/u: YES
Can someone explain to me why PHP thinks µ is a "letter" and why the MB uppercase version is "Μ"? I was going to work around this by testing the first letter of each word and verifying that it was a valid Unicode "letter" before running the conversion, but as you can see that won't work for this character, since /\p{L}/u matches it :(
Any idea how I can work around this?
Here is the rough draft of my function:
/**
* @param string $string The string to convert
* @param string $encoding Default is UTF-8
* @param string $delim_pattern Pattern used to break $string into words
* @return string
*/
public static function mb_ucwords(
string $string,
string $encoding = 'UTF-8',
string $delim_pattern = '/([\/\-\s\v"\'\\\]+)/u'
): string {
$words = preg_split($delim_pattern, $string, -1, PREG_SPLIT_DELIM_CAPTURE);
$output = "";
foreach($words as $word) {
$output .= mb_convert_case($word, MB_CASE_TITLE, $encoding);
}
return $output;
}
Currently testing this code against PHP 7.4.
EDIT:
Apparently this is a Greek letter as well as the symbol for micro, and Μ is the capital version of said Greek letter. I'm not sure how to handle this...
In Unicode 2, µ (U+00B5 MICRO SIGN) was changed to have a compatibility decomposition of μ (U+03BC GREEK SMALL LETTER MU). At the same time, its category was changed from symbol to letter, to match μ (U+03BC GREEK SMALL LETTER MU). This means that U+00B5 should not be used in new text; it is only to be used for compatibility with non-Unicode character sets. Under certain normalization forms, these are considered to be the same character.
In Unicode 3.0, it was updated to have Μ (U+039C GREEK CAPITAL LETTER MU) as its uppercase mapping, giving the result that you see now.
Unfortunately, since µ (U+00B5 MICRO SIGN) is basically deprecated, you're on your own if you use it. You could compare the first character of the string with µ (U+00B5 MICRO SIGN) before calling mb_convert_case. However, there's no guarantee that some system won't silently convert it to μ (U+03BC GREEK SMALL LETTER MU), for example if it normalizes the string. If you will never otherwise use μ (U+03BC GREEK SMALL LETTER MU), you could special-case that character as well.
The fail-safe way to handle this without breaking support for Greek text would be to use some sort of markup language or rich text to indicate that the character is used as a symbol instead of a letter, and then parse that when performing the case conversion. But that would obviously be a larger undertaking.
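The first suggestion above (checking for U+00B5 before converting) might look like the following sketch. titleCaseKeepingMicro is a hypothetical helper name, the input is assumed to be UTF-8, and leaving the whole word untouched is a simplification chosen here rather than the only option:

```php
<?php
// Hypothetical helper: title-cases a word unless it starts with
// U+00B5 MICRO SIGN. Assumes UTF-8 input and PHP 7+ (for the "\u{...}" escape).
function titleCaseKeepingMicro(string $word): string
{
    $micro = "\u{00B5}";
    if (mb_substr($word, 0, 1, 'UTF-8') === $micro) {
        return $word; // leave words such as "µm" exactly as written
    }
    return mb_convert_case($word, MB_CASE_TITLE, 'UTF-8');
}

echo titleCaseKeepingMicro("\u{00B5}m"), "\n"; // µm
echo titleCaseKeepingMicro("hello"), "\n";     // Hello
```

Inside mb_ucwords() this would stand in for the mb_convert_case() call in the foreach loop. It does not catch μ (U+03BC), so text that has been normalized would still be upper-cased.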
You could go as simple as this
function mb_ucfirst($string)
{
$main_encoding = "cp1250";
$inner_encoding = "utf-8";
$string = iconv($main_encoding, $inner_encoding, $string);
$strlen = mb_strlen($string);
$firstChar = mb_substr($string, 0, 1, $inner_encoding);
$then = mb_substr($string, 1, $strlen - 1, $inner_encoding);
return iconv($inner_encoding, $main_encoding , mb_strtoupper($firstChar, $inner_encoding) . $then );
}
This kept the µ intact when I tested it.
I am generating a Unicode emoji dynamically in PHP from its code point. I want the UTF-8 bytes of that emoji.
So far I have done:
$value = "1F600";
$emoIcon = "\u{$value}";
$emoIcon = preg_replace("/\\\\u([0-9A-F]{2,5})/i", "&#x$1;", $emoIcon);
echo $emoIcon; //output 😀
$hex=bin2hex($emoIcon);
echo $hex; // output 26237831463630303b
$hexVal=chunk_split($hex,2,"\\x");
var_dump($hexVal); // output 26\x23\x78\x31\x46\x36\x30\x30\x3b\x
$result= "\\x" . substr($hexVal,0,-2);
var_dump($result); // output \x26\x23\x78\x31\x46\x36\x30\x30\x3b
But when I put the value directly, it prints the correct data:
$emoIcon = "\u{1F600}";
$emoIcon = preg_replace("/\\\\u([0-9A-F]{2,5})/i", "&#x$1;", $emoIcon);
echo $emoIcon; //output 😀
$hex=bin2hex($emoIcon);
echo $hex; // output f09f9880
$hexVal=chunk_split($hex,2,"\\x");
var_dump($hexVal); // output f0\x9f\x98\x80\x
$result= "\\x" . substr($hexVal,0,-2);
var_dump($result); // output \xf0\x9f\x98\x80
\u{1F600} is a Unicode escape sequence for double-quoted strings, and it must contain a literal code point. Trying to use "\u{$value}", as you've seen, doesn't work (for a couple of reasons, but they don't matter much here).
If you want to start with "1F600" and end up with 😀, use hexdec to turn it into an integer and feed that to IntlChar::chr (part of the intl extension) to encode that code point as UTF-8. E.g.:
$value = "1F600";
echo IntlChar::chr(hexdec($value));
Outputs:
😀
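If the intl extension is not available, mb_chr() from mbstring (PHP 7.2+) should do the same job. Here is a sketch, with utf8EscapedBytes as a hypothetical helper that prints the UTF-8 bytes in the \x form the question was after:

```php
<?php
// Hypothetical helper: formats the raw bytes of a string as \xNN escapes.
// Single quotes keep '\x' as a literal backslash followed by "x".
function utf8EscapedBytes(string $utf8): string
{
    return '\x' . implode('\x', str_split(bin2hex($utf8), 2));
}

$value = "1F600";
$emoji = mb_chr(hexdec($value), 'UTF-8'); // code point -> UTF-8 string

echo bin2hex($emoji), "\n";          // f09f9880
echo utf8EscapedBytes($emoji), "\n"; // \xf0\x9f\x98\x80
```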
I use this table of Emoji and try this code:
<?php print json_decode('"\u2600"'); // This converts to ☀ (black sun with rays) ?>
If I try to convert \u1F600 (grinning face) through json_decode, I see this instead: ὠ0.
What's wrong? How do I get the right emoji?
PHP 5
JSON's \u can only handle one UTF-16 code unit at a time, so you need to write the surrogate pair instead. For U+1F600 this is \uD83D\uDE00, which works:
echo json_decode('"\uD83D\uDE00"');
😀
PHP 7
You no longer need json_decode and can use the \u{...} escape with the code point directly in a double-quoted string:
echo "\u{1F30F}";
🌏
In addition to Tino's answer, here is code that converts a hexadecimal code point such as 0x1F63C to a Unicode symbol in PHP 5 by splitting it into a surrogate pair:
function codeToSymbol($em) {
    if ($em >= 0x10000) {
        // Code points outside the BMP need a UTF-16 surrogate pair in JSON.
        $first = (($em - 0x10000) >> 10) + 0xD800;
        $second = (($em - 0x10000) % 0x400) + 0xDC00;
        return json_decode('"' . sprintf("\\u%04X\\u%04X", $first, $second) . '"');
    }
    // BMP code points fit in one \uXXXX escape; pad to four hex digits.
    return json_decode('"' . sprintf("\\u%04X", $em) . '"');
}
echo codeToSymbol(0x1F63C); // outputs 😼
Example of code that parses a string containing emoji in the \UXXXXXXXX escape format:
$str = 'Test emoji \U0001F607 \U0001F63C';
echo preg_replace_callback(
'/\\\U([A-F0-9]+)/',
function ($matches) {
return mb_convert_encoding(hex2bin($matches[1]), 'UTF-8', 'UTF-32');
},
$str
);
Output: Test emoji 😇 😼
https://3v4l.org/63dUR
I am passing my message to an SMS API.
This is what its documentation says:
Normally Unicode Messages are Arabic and Chinese Message, which are
defined by GSM Standards. Unicode messages are nothing but normal text
type messages but it has to be submitted in HEX form. To submit
Unicode messages following Url to be used.
I tried bin2hex(), but it does not produce the expected output.
$str = '人';
//$str = 'a';
$output = bin2hex($str);
echo $output;
//output
//人 = e4baba ; I would expect '4EBA'
I found a similar solution, but it is in VB.NET. Can anyone convert it?
http://www.supportchain.com/index.php?/Knowledgebase/Article/View/28/7/unable-to-send-sms-with-chinese-character-using-api
The sample I tried, which works:
Example of conversion: a converted to hexadecimal is 0061, 人 converted to hexadecimal is 4EBA.
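The issue you are facing has to do with encoding. bin2hex() dumps the raw bytes of the string, and your string is UTF-8, where 人 is the three bytes e4 ba ba. The API expects the UTF-16 (UCS-2) code units instead, so you need to convert the encoding before converting to hex.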
Each of these outputs exactly what you were looking for when I run them:
echo bin2hex(iconv('UTF-8', 'ISO-10646-UCS-2', '人')) . PHP_EOL;
//Outputs 4eba
echo bin2hex(iconv('UTF-8', 'UNICODE-1-1', '人')) . PHP_EOL;
//Outputs 4eba
echo bin2hex(iconv('UTF-8', 'UTF-16BE', '人')) . PHP_EOL;
//Outputs 4eba
Pick whichever one you fancy.
If you want to convert back:
echo iconv('UTF-16BE', 'UTF-8', hex2bin('4eba')) . PHP_EOL;
//outputs 人
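Since the question expects upper-case, four-digit hex such as 0061 and 4EBA, the UTF-16BE conversion above can be wrapped in a small helper. This is a sketch: toSmsHex is a hypothetical name, and it assumes the API wants UTF-16BE without a byte order mark, which the question's examples suggest but the quoted documentation does not state.

```php
<?php
// Hypothetical helper: UTF-8 text -> upper-case hex of its UTF-16BE code units.
function toSmsHex(string $utf8): string
{
    $utf16 = iconv('UTF-8', 'UTF-16BE', $utf8);
    if ($utf16 === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return strtoupper(bin2hex($utf16));
}

echo toSmsHex('a'), "\n";        // 0061
echo toSmsHex("\u{4EBA}"), "\n"; // 4EBA (the character 人)
```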
I want to remove ZERO WIDTH NON-JOINER character from a string but using str_replace wasn't useful.
str_replace should solve this, as long as you're careful with what you're replacing.
// \xE2\x80\x8C is ZERO WIDTH NON-JOINER
$foo = "foo\xE2\x80\x8Cbar";
print($foo . " - " . strlen($foo) . "\n");
$foo = str_replace("\xE2\x80\x8C", "", $foo);
print($foo . " - " . strlen($foo) . "\n");
Outputs as expected:
foobar - 9
foobar - 6
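If you would rather name the code point than spell out its UTF-8 bytes, a regular expression with the /u modifier should work too. A sketch, where stripZwnj is a hypothetical helper name:

```php
<?php
// Hypothetical helper: removes every U+200C ZERO WIDTH NON-JOINER.
function stripZwnj(string $s): string
{
    // \x{200C} names the code point; /u makes PCRE treat pattern and subject
    // as UTF-8. preg_replace() returns null on invalid UTF-8, in which case
    // the input is returned unchanged.
    return preg_replace('/\x{200C}/u', '', $s) ?? $s;
}

$foo = "foo\xE2\x80\x8Cbar";
echo strlen($foo), "\n";            // 9
echo strlen(stripZwnj($foo)), "\n"; // 6
```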
str_replace will do what you want, but PHP does not have very good native support for Unicode. The following will do what you ask. json_decode is used to get the Unicode character, since PHP before 7.0 does not support the \u{...} escape syntax.
<?php
$unicodeChar = json_decode('"\u200c"');
$string = 'blah'.$unicodeChar.'blah';
echo str_replace($unicodeChar, '', $string);
?>
edit: While my method works, I would suggest you use fiskfisk's solution. It is less hacky than using json_decode.