My preg_match only works with utf8_encode [duplicate] - php

My PHP code receives a $request from an AJAX call. I am able to extract the $name from this parameter. As this name is in German, the allowed characters also include ä, ö and ü.
I want to validate $name = "Bär" via preg_match. I am sure that the ä arrives in my PHP code as part of a UTF-8 encoded string. But if I do this
preg_match('/^[a-zA-ZäöüÄÖÜ]*$/', $name);
I get false, although it should be true. I only receive true in case I do
preg_match(utf8_encode('/^[a-zA-ZäöüÄÖÜ]*$/'), $name);
Can someone explain this to me, and also how I can set PHP to globally encode every string as UTF-8?

PHP strings do not have any specific character encoding. String literals contain the bytes that the interpreter finds between the quotes in the source file.
You have to make sure that the text editor or IDE that you are using is saving files in UTF-8. You'll typically find the character encoding in the settings menu.

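Your regular expression also has a flaw: * matches zero or more characters, so an empty $name passes. Use +, which stands for one or more characters. If your PHP code is saved as UTF-8 (without BOM), the u flag is required so that PCRE treats the pattern and the subject as UTF-8 instead of as single bytes.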
$name = "Bär";
$result = preg_match('/^[a-zA-ZäöüÄÖÜ]+$/u', $name);
var_dump($result); //int(1)
To cover all German special letters, the ß is still missing from the list.
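A minimal sketch of the same check that does not depend on how the source file is saved: the letters are written as code point escapes, which PCRE understands when the u flag is set. The function name is_german_name is hypothetical and not part of the answers above.

```php
<?php
// Hypothetical helper, not part of the original answers.
function is_german_name(string $name): bool
{
    // With the u flag, \x{..} denotes a code point: ä ö ü Ä Ö Ü ß.
    $pattern = '/^[a-zA-Z\x{E4}\x{F6}\x{FC}\x{C4}\x{D6}\x{DC}\x{DF}]+$/u';
    // preg_match returns 1, 0, or false (for example when the subject
    // is not valid UTF-8), so compare against 1 explicitly.
    return preg_match($pattern, $name) === 1;
}

var_dump(is_german_name("B\u{E4}r"));  // bool(true): UTF-8 bytes 42 c3 a4 72
var_dump(is_german_name("B\xE4r"));    // bool(false): ISO-8859-1 byte, invalid as UTF-8
var_dump(is_german_name(""));          // bool(false): + needs at least one character
```

Because the pattern contains only ASCII, it behaves the same whether the file is saved as UTF-8 or ISO-8859-1.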

Related

List of known troublesome characters that causes PHP to fail to detect the proper character encoding before converting to UTF-8 resulting in lost data

PHP isn't always correct, but what I write always has to be. In this case an email subject contains an en dash character. This thread is about detecting oddball characters that, when they appear alone (say, among otherwise purely ASCII text), cause PHP to detect the wrong encoding. I've already determined one static example, though my goal here is to create a definitive thread containing something as close to drop-in code as we can possibly create.
Here is my starting string from the subject header of an email:
<?php
//This is AFTER exploding the : of the header and using trim on $p[1]:
$s = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';
//orkut – convite enviado por Lais Piccirillo
?>
Typically the next step is to do the following:
$s = imap_mime_header_decode($s);//orkut � convite enviado por Lais Piccirillo
Typically past that point I'd do the following:
$s = mb_convert_encoding($subject, 'UTF-8', mb_detect_encoding($s));//en dash missing!
Now I received a static answer for an earlier static question. Eventually I was able to put this working set of code together:
<?php
$s1 = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';
//Attempt to determine the character set:
$en = mb_detect_encoding($s1);//ASCII; wrong!!!
$p = explode('?', $s1, 3)[1];//ISO-8859-1; wrong!!!
//Necessary to decode the q-encoded header text any way FIRST:
$s2 = imap_mime_header_decode($s1);
//Now scan for character exceptions in the original text to compensate for PHP:
if (strpos($s1, '=96') !== false) {$s2 = mb_convert_encoding($s2[0]->text, 'UTF-8', 'CP1252');}
else {$s2 = mb_convert_encoding($s2[0]->text, 'UTF-8');}
//String is finally ready for client output:
echo '<pre>'.print_r($s2,1).'</pre>';//orkut – convite enviado por Lais Piccirillo
?>
Now either I've still programmed this incorrectly and there is something in PHP I'm missing (I tried many combinations of html_entity_decode, iconv, mb_convert_encoding and utf8_encode) or, at least for the moment with PHP 8, we'll be forced to detect specific characters and manually override the encoding, as I've done with the strpos check for '=96'. In the latter case a bug report needs to be created, or more likely updated if one specific to this issue already exists.
So technically the question is:
How do we properly detect all character encodings to prevent any characters from being lost during the conversion of strings to UTF-8?
If no such answer exists, valid answers include characters that, when they appear among otherwise purely ASCII text, cause PHP to fail to detect the correct character encoding and so produce an incorrect UTF-8 string. Presuming this issue is fixed in the future and the fix can be validated against all the oddball characters listed in the other relevant answers, a proper answer can then be accepted.

You are blaming PHP for something that PHP could not possibly solve:
$s1 is an ASCII string; just as the string "smiling face emoji" is ASCII, even though it describes the string "🙂".
$s2 is decoded according to the information you were sent. In fact, it's decoded into a raw sequence of bytes, and a label which was provided in the input.
Your actual problem is that the information you were sent was wrong - the system that sent it to you has made the common mistake of mislabelling Windows-1252 as ISO-8859-1.
Those two encodings agree on the meanings of 224 out of the 256 possible 8-bit values. They disagree on the values from 0x80 to 0x9F: those are control characters in ISO 8859 and (mostly) assigned to printable characters in Windows-1252.
Note that there is no way for any system to automatically tell you which interpretation was intended - either way, there is simply a byte in memory containing (for instance) 0x96. However, the extra control characters from ISO 8859 are very rarely used, so if the string claims to be ISO-8859-1 but contains bytes in that range, it's almost certainly in some other encoding. Since Windows-1252 is very widely used (and often mislabelled in this way), a common solution is simply to assume that any data labelled ISO-8859-1 is actually Windows-1252.
That makes the solution really very simple:
// $input is the ASCII string you've received
$input = '=?ISO-8859-1?Q?orkut=20=96=20convite=20enviado=20por=20Lais=20Piccirillo?=';
// Decode the string into its labelled encoding, and string of bytes
$mime_decoded = imap_mime_header_decode($input);
$input_encoding = $mime_decoded[0]->charset;
$raw_bytes = $mime_decoded[0]->text;
// If it claims to be ISO-8859-1, assume it's lying
if ( $input_encoding === 'ISO-8859-1' ) {
    $input_encoding = 'Windows-1252';
}
// Now convert from a known encoding to UTF-8 for the use of your application
$utf8_string = mb_convert_encoding($raw_bytes, 'UTF-8', $input_encoding);
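
To make the disagreement on byte 0x96 concrete, here is a small sketch that needs only mbstring, not the imap extension. The helper assumed_charset is hypothetical; it also compares the label case-insensitively, on the assumption that senders may write it in lower case.

```php
<?php
// Hypothetical helper: treat an ISO-8859-1 label as Windows-1252.
function assumed_charset(string $label): string
{
    return strcasecmp($label, 'ISO-8859-1') === 0 ? 'Windows-1252' : $label;
}

// The byte 0x96 corresponds to the "=96" in the header above.
$raw_bytes = "orkut \x96 convite";

$as_latin1 = mb_convert_encoding($raw_bytes, 'UTF-8', 'ISO-8859-1');
$as_cp1252 = mb_convert_encoding($raw_bytes, 'UTF-8', assumed_charset('iso-8859-1'));

// ISO-8859-1 maps 0x96 to the control character U+0096.
var_dump($as_latin1 === "orkut \u{0096} convite");  // bool(true)
// Windows-1252 maps 0x96 to the en dash U+2013.
var_dump($as_cp1252 === "orkut \u{2013} convite");  // bool(true)
```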

character encoding for mixed data [duplicate]

I'm having an issue with getting the correct character encoding for data being POSTed which is built up from multiple sources (I get the data as a single POST variable). I think they're not in the same character encoding...
For instance, take the symbol £. If I do nothing to the character encoding I get two results:
a = £ and b = £
I've tried using various configurations of iconv() like so;
$data = iconv('UTF-8', 'windows-1252//TRANSLIT', $_POST['data']);
The above results in a = £ and b = �
I've also tried utf8_encode/decode as well as html_entity_decode, as I think there's a possibility that one of the pound symbols is being generated using html_entities.
I've tried setting the character encoding in the header which didn't work. I just can't get both instances to work at the same time.
I'm not sure what to try next, any ideas?

I've managed to work around this issue by finding the content that was not in UTF-8 (everything else was) and running only that content through utf8_encode().
This appears to work for the £ symbol. I've not found any other characters causing an issue so far.
Note, I am still using iconv() in conjunction with this.
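
The workaround can be sketched roughly as follows, assuming the pieces can still be handled separately and that anything that is not valid UTF-8 is Windows-1252 (both are assumptions, and to_utf8 is a hypothetical name). It uses mb_convert_encoding rather than utf8_encode, which is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, convert everything else.
function to_utf8(string $piece): string
{
    if (mb_check_encoding($piece, 'UTF-8')) {
        return $piece;
    }
    // Assumption: non-UTF-8 input is Windows-1252.
    return mb_convert_encoding($piece, 'UTF-8', 'Windows-1252');
}

$a = to_utf8("\xC2\xA3 5");  // pound sign already encoded as UTF-8
$b = to_utf8("\xA3 5");      // pound sign as a single Windows-1252 byte
var_dump($a === $b);         // bool(true)
```

This only works per piece: once differently encoded pieces have been concatenated into one string, the validity check can no longer tell them apart.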

I'm trying to make sure my string has only valid UTF-8 characters in PHP. How can I do that? [duplicate]

I have a string that has an invalid character in it (it's not UTF-8) such as the following displaying SUB:
I think it's some kind of foreign invalid character.
Is there a way in PHP to take a string and use preg_replace or something else to ensure that I am only using valid UTF-8 characters in my strings, and anything else just gets removed?
Thanks.

First of all, there are no invalid UTF-8 characters. There are invalid UTF-8 bytes and byte sequences, which can mean that someone is trying to pull off an encoding attack on your server. Incoming data can be validated with mb_check_encoding, and you can fail immediately with 400 Bad Request if you don't get valid UTF-8.
What you have is just the SUBSTITUTE control character, a valid character but unprintable.
Originally intended for use as a transmission control character to
indicate that garbled or invalid characters had been received. It has
often been put to use for other purposes when the in-band signaling of
errors it provides is unneeded, especially where robust methods of
error detection and correction are used, or where errors are expected
to be rare enough to make using the character for other purposes
advisable.
You can use this regex to get rid of it (and a few others):
$reg = '/(?![\r\n\t])[\p{Cc}]/u';
$str = preg_replace( $reg, "", $str ); // preg_replace returns the result; it does not modify $str in place

The mb_check_encoding function should be able to do this.
mb_check_encoding("Jetzt gibts mehr Kanonen", "UTF-8");
Note: I haven't tested this.
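
Both answers can be combined into one small sketch: reject input that is not valid UTF-8, then strip unprintable control characters such as SUB (0x1A). The function name keep_valid_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: returns null for invalid UTF-8, otherwise the
// input without control characters (CR, LF and TAB are kept).
function keep_valid_utf8(string $input): ?string
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        return null;
    }
    return preg_replace('/(?![\r\n\t])\p{Cc}/u', '', $input);
}

var_dump(keep_valid_utf8("Kanonen\x1A!"));  // string(8) "Kanonen!"
var_dump(keep_valid_utf8("Kanonen\xFF!"));  // NULL
```

Returning null leaves it to the caller to decide whether to answer with 400 Bad Request or to fall back to some other handling.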

Php regular expressions character encoding issue

My regular expression won't consider accented characters, so it finds no matches when I search for words containing the characters ü, õ, ö or ä.
$data is HTML data stripped from HTML tags using strip_tags and containing words with ü, õ, ö and ä characters loaded via CURL from website with character encoding UTF-8 (as returned headers tell me);
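$ch = curl_init('my_website_url');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$data = strip_tags( curl_exec($ch) );
$match = preg_match( '/ü/' , $data , $matches );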
I have tried using following (also with 'ISO-8859-1'):
mb_internal_encoding("UTF-8");
mb_regex_encoding('UTF-8');
or:
$data = utf8_decode($data);
No success yet.

Make sure your PHP source file is UTF-8 encoded as well.
If it's for example ISO-8859-1, the ü in your preg_match directive will be a different character from the üs in your UTF-8 data.

You should tell PCRE that you are using UTF-8, which is done by adding the u modifier -> '/ü/u'. But if possible, do not put these characters directly into source code. If you (or your editor) change the encoding of the file, your code will stop working, and tracing this down would be quite a PITA. I'd suggest, instead of using '/ü/' directly, replacing the character in question with its code point: '/\x{fc}/u'. U+00FC is your letter (its UTF-8 encoding is the two bytes 0xC3 0xBC).
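
A short sketch of the difference between code point and byte matching; the sample string is made up for illustration.

```php
<?php
$data = "M\u{FC}ller";  // UTF-8 bytes: 4d c3 bc 6c 6c 65 72

// With the u flag, \x{FC} means the code point U+00FC.
var_dump(preg_match('/\x{FC}/u', $data));   // int(1)
// Without the u flag, the pattern is matched byte by byte.
var_dump(preg_match('/\xC3\xBC/', $data));  // int(1)
// The single ISO-8859-1 byte 0xFC does not occur in the UTF-8 data.
var_dump(preg_match('/\xFC/', $data));      // int(0)
```

The last line is what happens when the source file is ISO-8859-1 and the data is UTF-8: the pattern and the data contain different bytes for the same letter.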

Hex to Unicode in PHP ( \u014D to ō) [duplicate]

How can I convert \u014D to ō in PHP?
Thank You

It's not immediately clear what you mean when you say "to ō". If you're asking how to convert it into a different encoding, then a general approach is to use the iconv function. 014D is the UCS-2 (Unicode) code for your desired character, so if you have a string containing the two bytes 0x01 0x4D, you could use the following (UCS-2BE states the big-endian byte order explicitly):
iconv('UCS-2BE', 'UTF-8', $s)
to convert from UCS-2 to UTF-8. Similarly if you want to convert to a different encoding - although you need to be aware that not all encodings will include the character you are using. You'll see from the iconv documentation that the //TRANSLIT option may help in that case.
Note that iconv is taking a byte sequence so, if you actually have a string containing a slash, then a u, then a 0 etc... you'll need to convert that into the byte sequence first.
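
The two steps described above (turn the hex digits into bytes, then convert the bytes) can be sketched like this. The function name decode_unicode_escapes is hypothetical, and the sketch assumes four-digit escapes only; surrogate pairs for characters outside the Basic Multilingual Plane are not handled.

```php
<?php
// Hypothetical helper: replaces every \uXXXX escape with its UTF-8 form.
function decode_unicode_escapes(string $text): string
{
    $decoded = preg_replace_callback(
        '/\\\\u([0-9a-fA-F]{4})/',
        static function (array $m): string {
            // hex2bin turns "014D" into the two bytes 0x01 0x4D,
            // which iconv then reads as one big-endian UCS-2 unit.
            $utf8 = iconv('UCS-2BE', 'UTF-8', hex2bin($m[1]));
            // If iconv fails, keep the original escape text.
            return $utf8 === false ? $m[0] : $utf8;
        },
        $text
    );
    return $decoded ?? $text;
}

var_dump(decode_unicode_escapes('\u014D') === "\u{014D}");  // bool(true)
```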
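
If you have the escape sequence in the string, you could use a messy eval statement (eval, not exec: exec runs a shell command). Note that PHP only interprets Unicode escapes inside double-quoted strings, only in the braced form \u{014D}, and only from PHP 7.0 on.
$string = '\\u{014D}';
eval("\$string = \"$string\";");
This way, the Unicode escape sequence is recognized and interpreted as a Unicode character when the string is parsed.
Of course, you should never use eval unless absolutely necessary, and never on untrusted input.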
