PHP regex issue: cannot find $C

I'm trying to parse dollar amounts from text in mixed (Canadian) French and English. The text is in UTF-8, and it uses $C to denote the currency. For some reason, when I use preg_match, neither the '$' nor the 'C' can be found. Everything else works fine. Any ideas?
For example, using
preg_match_all('/\$C/u', $match)
on "Thanks for a payment of 46,00 $C" returns empty.

I think the regex can't find those characters because they aren't there. If you initialize the string like this:
$source = "Thanks for a payment of 46,00 $C";
...(i.e., as a double-quoted string literal), $C gets interpreted as a variable name. Since you never initialized that variable, it gets replaced with nothing in the actual string. You should either use single-quotes to initialize the string, or escape the dollar sign with a backslash like you did in the regex.
By the way, this couldn't be an encoding problem, because (in the example, at least), all the characters are from the ASCII character set. Whether it was encoded as UTF-8, ISO-8859-1 or ASCII, the binary representation of the string would be identical.
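As a minimal sketch of the point above (the variable names $escaped, $single and $count are only illustrative), both safe ways of writing the literal produce the same bytes, and the pattern then finds $C:

```php
<?php
// Escaped dollar sign inside double quotes: no variable interpolation happens.
$escaped = "Thanks for a payment of 46,00 \$C";
// Single quotes never interpolate variables.
$single = 'Thanks for a payment of 46,00 $C';

// Both literals hold exactly the same bytes.
var_dump($escaped === $single);

// The pattern from the question now finds the currency marker.
$count = preg_match_all('/\$C/u', $single, $matches);
var_dump($count);
print_r($matches);
```

With an unescaped "$C" in double quotes, PHP would instead try to substitute an (undefined) variable named $C, and the marker would be missing from the string before the regex ever ran.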

preg_match_all('/\$C/u', 'Thanks for a payment of 46,00 $C', $matches);
print_r($matches);
works fine for me:
Array
(
[0] => Array
(
[0] => $C
)
)

Maybe this helps:
// assuming $text is the input string
$matches = array();
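// preg_match_all() returns the number of full matches (0 if there are none),
// whereas $matches itself is never empty, so test the return value instead
if (preg_match_all('/([0-9,\\.]+)\\s*\\$C/u', $text, $matches) > 0) {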
$price = floatval(str_replace(',', '.', $matches[1][0]));
printf("%.2f\n", $price);
} else {
printf("No price found\n");
}
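Just make sure the input string ($text) is valid UTF-8, because the /u modifier treats both the pattern and the subject as UTF-8. Do not run it through utf8_decode first: that function converts the string to ISO-8859-1, and it is deprecated as of PHP 8.2.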
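If the text can contain several amounts, the same idea can be wrapped in a small helper. This is only a sketch: the function name extract_amounts and the stricter number pattern are assumptions, not part of the answer above.

```php
<?php
/**
 * Hypothetical helper: returns every amount that is followed by "$C" as a float.
 * Accepts either a comma or a dot as the decimal separator.
 */
function extract_amounts(string $text): array
{
    $amounts = [];
    // Digits, an optional decimal part, optional whitespace, then a literal "$C".
    if (preg_match_all('/(\d+(?:[.,]\d+)?)\s*\$C/u', $text, $m) > 0) {
        foreach ($m[1] as $raw) {
            // Normalize the French decimal comma before casting.
            $amounts[] = (float) str_replace(',', '.', $raw);
        }
    }
    return $amounts;
}

$amounts = extract_amounts('Thanks for a payment of 46,00 $C and 12.50 $C');
foreach ($amounts as $amount) {
    printf("%.2f\n", $amount);
}
```

Note that this sketch does not handle thousands separators such as "1 234,56"; that would need a wider pattern.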

Related

PHP: Encode UTF8-Characters to html entities

I want to encode normal characters to html-entities like
a => &#97;
A => &#65;
b => &#98;
B => &#66;
but
echo htmlentities("a");
doesn't work. It outputs the normal characters (a A b B) in the HTML source code instead of the HTML entities.
How can I convert them?
You can build a function for this fairly easily using mb_ord or IntlChar::ord, either of which will give you the numeric value for a Unicode Code Point.
You can then convert that to a hexadecimal string using base_convert, and add the '&#x' and ';' around it to give an HTML entity:
function make_entity(string $char) {
$codePoint = mb_ord($char, 'UTF-8'); // or IntlChar::ord($char);
$hex = base_convert($codePoint, 10, 16);
return '&#x' . $hex . ';';
}
echo make_entity('a');
echo make_entity('€');
echo make_entity('🐘');
You then need to run that for each code point in your UTF-8 string. It is not enough to loop over the string using something like substr, because PHP's string functions work with individual bytes, and each UTF-8 code point may be multiple bytes.
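To see the byte-versus-code-point difference concretely, here is a small sketch (it assumes the source file is saved as UTF-8 and the mbstring extension is loaded; the variable names are illustrative):

```php
<?php
$euro = '€'; // U+20AC, three bytes in UTF-8

// strlen() counts bytes, mb_strlen() counts code points.
$byteLength = strlen($euro);
$charLength = mb_strlen($euro, 'UTF-8');
printf("%d bytes, %d code point\n", $byteLength, $charLength);

// substr() cuts in the middle of the character: only its first byte is returned.
$firstByte = bin2hex(substr($euro, 0, 1));
echo $firstByte, "\n";

// Splitting with the /u modifier yields whole code points instead.
$codePoints = preg_split('//u', 'a€🐘', -1, PREG_SPLIT_NO_EMPTY);
print_r($codePoints);
```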
One approach would be to use a regular expression replacement with a pattern of /./u:
The . matches each single "character" (except a newline, unless you also add the s modifier)
The /u modifier turns on Unicode mode, so that each "character" matched by the . is a whole code point
You can then run the above make_entity function for each match (i.e. each code point) with preg_replace_callback.
Since preg_replace_callback will pass your callback an array of matches, not just a string, you can make an arrow function which takes the array and passes element 0 to the real function:
$callback = fn($matches) => make_entity($matches[0]);
So putting it together, you have this:
echo preg_replace_callback('/./u', fn($m) => make_entity($m[0]), 'a€🐘');
Arrow functions were introduced in PHP 7.4, so if you're stuck on an older version, you can write the same thing as a regular anonymous function:
echo preg_replace_callback('/./u', function($m) { return make_entity($m[0]); }, 'a€🐘');
Or of course, just a regular named function (or a method on a class or object; see the "callable" page in the manual for the different syntax options):
function make_entity_from_array_item(array $matches) {
return make_entity($matches[0]);
}
echo preg_replace_callback('/./u', 'make_entity_from_array_item', 'a€🐘');

PHP convert double quoted string to single quoted string

I know this question has been asked here many times, but those solutions are not useful for me. I have been struggling badly with this problem today.
// Case 1
$str = 'Test \300'; // Single Quoted String
echo json_encode(utf8_encode($str)); // output: Test \\300
// Case 2
$str = "Test \300"; // Double Quoted String
echo json_encode(utf8_encode($str)); // output: Test \u00c0
I want case 2's output, but my $str variable holds the single-quoted form. The variable is filled by parsing an XML string, and that XML string is saved in a txt file.
(Here \300 is the encoding of the Latin character À, and I can't control it.)
Please don't give me a solution that only works for the static string above.
Thanks in advance.
This'll do:
$string = '\300';
$string = preg_replace_callback('/\\\\([0-7]{1,3})/', function (array $match) {
// $match[1] holds only the octal digits, without the leading backslash
return pack('C', octdec($match[1]));
}, $string);
It matches any sequence of a backslash followed by one to three octal digits and converts that number from octal to a single byte, which is the same result as what "\300" produces.
Note that this will not work exactly the same for escaped escapes; i.e. "\\300" will result in a literal \300 while the above code will convert it.
If you want all the possible rules of double quoted strings followed without reimplementing them by hand, your best bet is to simply eval("return \"$string\""), but that has a number of caveats too.
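Putting the replacement together with the JSON step from the question might look like the sketch below. The function name decode_octal_escapes is hypothetical, and the input bytes are assumed to be ISO-8859-1 (which is what utf8_encode assumes as well); mb_convert_encoding is used here because utf8_encode is deprecated as of PHP 8.2.

```php
<?php
/**
 * Hypothetical helper: turns literal "\NNN" octal escapes into the single
 * byte they denote, as a double-quoted PHP string literal would.
 */
function decode_octal_escapes(string $s): string
{
    return preg_replace_callback('/\\\\([0-7]{1,3})/', function (array $m): string {
        // Mask to one byte, since values above \377 cannot fit in a single byte.
        return chr(octdec($m[1]) & 0xFF);
    }, $s);
}

$raw   = 'Test \300';                 // literal backslash, as read from the file
$bytes = decode_octal_escapes($raw);  // "Test " followed by the byte 0xC0
$utf8  = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

$json = json_encode($utf8);
echo $json, "\n";
```

Like the snippet above, this sketch also converts escaped escapes such as \\300, so it is not a full reimplementation of PHP's double-quote rules.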
Maybe you are looking for this:
$str = 'Test \300'; // Single Quoted String
echo json_encode(stripslashes($str)); // output: "Test 300" (stripslashes() removes the backslash instead of decoding \300)

Normalize Name-Surname strings: PHP+REGEX (Spanish chars- UTF8)

I have strings containing a name and surname, which I need to normalize with a function so that they look like:
Name Surname (I can receive strings like NAME SURNAME, Name SURNAME, etc.)
I've found this snippet:
echo nameize("HÉCTOR MAÑAÇ");
function nameize($str,$a_char = array("'","-"," ")){
//$str contains the complete raw name string
//$a_char is an array containing the characters we use as separators for capitalization. If you don't pass anything, there are three in there as default.
$string = strtolower($str);
foreach ($a_char as $temp){
$pos = strpos($string,$temp);
if ($pos){
//we are in the loop because we found one of the special characters in the array, so lets split it up into chunks and capitalize each one.
$mend = '';
$a_split = explode($temp,$string);
foreach ($a_split as $temp2){
//capitalize each portion of the string which was separated at a special character
$mend .= ucfirst($temp2).$temp;
}
$string = substr($mend,0,-1);
}
}
return ucfirst($string);
}
Which works pretty well but, as you can see by testing this exact example, doesn't handle Spanish characters (UTF-8). I've tried mb_regex_encoding("UTF-8"); mb_internal_encoding("UTF-8");, UTF-8 headers, etc., but I can't make it work properly with "special" Spanish characters.
Any suggestion?
I can't see where you use the multibyte string functions.
Maybe this would be convenient for your needs:
echo mb_convert_case("HÉCTOR MAÑAÇ", MB_CASE_TITLE, "UTF-8");
output:
Héctor Mañaç
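For comparison, a short sketch of the multibyte-aware calls on the same input (it assumes the mbstring extension is loaded and the source file is saved as UTF-8; the variable names are illustrative). The byte-oriented strtolower() and ucfirst() used in the question's function do not reliably handle multibyte characters such as É or Ñ.

```php
<?php
$name = 'HÉCTOR MAÑAÇ';

// Multibyte-aware lowercasing handles the accented capitals too.
$lower = mb_strtolower($name, 'UTF-8');
echo $lower, "\n";

// Title case in one step, as in the answer above.
$title = mb_convert_case($name, MB_CASE_TITLE, 'UTF-8');
echo $title, "\n";
```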
Your function also works fine for the given example. Please check your file's encoding; it must be UTF-8. You can check it in Notepad++.

How to check single byte katakana in a string

I am working on a website that uses double-byte Japanese characters, and I need to check whether the user has entered single-byte katakana. The site is developed in PHP.
This is the preg_match pattern that I used for checking:
'/[\x{3040}-\x{309F}]/u'
I'm not 100% sure whether the test string I use here is a legal $string, so I'll remove (or try to update) this answer if it works out differently. The string below is typed in manually, with the backslashes escaped, rather than taken from raw input:
$string = "\\xe3\\x80\\x85"; // RAW input might still be '\xe3\x80\x85' here
$result = preg_match_all("/\\\\xe3\\\\x8[0-3]\\\\x[8-9a-b][0-9a-f]/u", $string, $matches);
echo $string;
echo '<pre>';
print_r($matches);
echo '</pre>';
This prints out;
\xe3\x80\x85
Array
(
[0] => Array
(
[0] => \xe3\x80\x85
)
)
Thus; 々
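If the input is an actual UTF-8 string rather than escaped text, one possible check is a Unicode range test. This is only a sketch under two assumptions: that "single-byte katakana" means the half-width katakana forms, which Unicode places at U+FF65 to U+FF9F (the range U+3040 to U+309F used in the question is the Hiragana block), and that the hypothetical function name contains_halfwidth_katakana suits your code.

```php
<?php
/**
 * Hypothetical helper: true if $text contains at least one half-width
 * katakana character (U+FF65..U+FF9F). $text must be valid UTF-8.
 */
function contains_halfwidth_katakana(string $text): bool
{
    return preg_match('/[\x{FF65}-\x{FF9F}]/u', $text) === 1;
}

// "\u{...}" escapes (PHP 7+) avoid depending on the encoding of this file.
$halfwidth = "\u{FF76}\u{FF80}\u{FF76}\u{FF85}"; // half-width ka-ta-ka-na
$fullwidth = "\u{30AB}\u{30BF}\u{30AB}\u{30CA}"; // full-width ka-ta-ka-na

var_dump(contains_halfwidth_katakana($halfwidth));
var_dump(contains_halfwidth_katakana($fullwidth));
```

In UTF-8 these characters take three bytes each; they are single-byte only in legacy encodings such as Shift_JIS, so the check has to be done on code points, not on byte length.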

get http url parameter without auto decoding using PHP

I have a url like
test.php?x=hello+world&y=%00h%00e%00l%00l%00o
When I write the parameters to a file, they are decoded:
file_put_contents('x.txt', $_GET['x']); // -->hello world
file_put_contents('y.txt', $_GET['y']); // -->\0h\0e\0l\0l\0o
But I need to write them without decoding:
file_put_contents('x.txt', ????); // -->hello+world
file_put_contents('y.txt', ????); // -->%00h%00e%00l%00l%00o
How can I do this?
Thanks
You can get unencoded values from the $_SERVER["QUERY_STRING"] variable.
function getNonDecodedParameters() {
$a = array();
foreach (explode ("&", $_SERVER["QUERY_STRING"]) as $q) {
$p = explode ('=', $q, 2);
$a[$p[0]] = isset ($p[1]) ? $p[1] : '';
}
return $a;
}
$input = getNonDecodedParameters();
file_put_contents('x.txt', $input['x']);
Because the $_GET and $_REQUEST superglobals are automatically run through a decoding function (equivalent to urldecode()), you simply need to re-urlencode() the data to get it to match the characters passed in the URL string:
file_put_contents('x.txt', urlencode($_GET['x'])); // -->hello+world
file_put_contents('y.txt', urlencode($_GET['y'])); // -->%00h%00e%00l%00l%00o
I've tested this out locally and it's working perfectly. However, from your comments, you might want to look at your encoding settings as well. If the result of urlencode($_GET['y']) is %5C0h%5C0e%5C0l%5C0l%5C0o then it appears that the null character that you're passing in (%00) is being interpreted as a literal string "\0" (like a \ character concatenated to a 0 character) instead of correctly interpreting the \0 as a single null character.
You should have a look at the PHP documentation on string encoding and ASCII device control characters.
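One caveat when comparing the two approaches: re-encoding reproduces the original text only when the URL was encoded the way urlencode() would encode it. The sketch below uses urldecode() as a stand-in for the decoding PHP applies to $_GET, and the variable names are illustrative.

```php
<?php
// The two values from the question survive a decode/encode round trip.
$x = urlencode(urldecode('hello+world'));
$y = urlencode(urldecode('%00h%00e%00l%00l%00o'));
echo $x, "\n", $y, "\n";

// A space sent as %20 does not: urlencode() always writes it back as "+".
$original  = 'hello%20world';
$roundTrip = urlencode(urldecode($original));
var_dump($roundTrip === $original);
```

If the exact original characters matter, reading them from $_SERVER["QUERY_STRING"] avoids this ambiguity.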
I think you can use urlencode() to pass the value in the URL and urldecode() to get the value.
