Would this regex be multibyte safe? - php

I'm using the following regex to check an image filename only contains alphanumeric, underscore, hyphen, decimal point:
preg_match('!^[\w.-]*$!',$filename)
This works OK, but I have concerns about multibyte characters. Should I handle them specifically to prevent unexpected errors, or will this regex reject multibyte filenames anyway?

PHP does not have "native" support for multibyte characters; you need to use the mbstring extension (which may or may not be available). Furthermore, there is no way to create a "multibyte-character string" as such; rather, one chooses to treat a native string as a multibyte-character string by using the special mbstring functions. In other words, a PHP string does not know its own character encoding -- you have to keep track of it manually.
You may be able to get away with it as long as you use UTF-8 (or a similar) encoding. UTF-8 always encodes multibyte characters using "high" bytes (for instance, ß is encoded as 0xc3 0x9f), so PHP will probably treat them just like any other characters. You would not be able to use an encoding that might encode part of a multibyte character into a "special" PHP byte, such as 0x22, the double-quote symbol.
The only regular expression functions in PHP that know how to deal with multibyte characters across a range of character sets are mb_ereg, mb_eregi, mb_ereg_replace and mb_eregi_replace.
PCRE-based regular expression functions like preg_match support UTF-8 via the u modifier (PCRE_UTF8).
But of course, as described above, PHP strings don't know their own encoding, so for the mb_ereg functions you first need to declare it using mb_regex_encoding(). Note that this function specifies the encoding of the strings you're matching against, not the encoding of the regular expression itself.
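By way of illustration, a minimal sketch of both routes, assuming UTF-8 input (note that without the u modifier the original pattern simply rejects any filename containing high bytes, which may already be the behaviour you want):
<?php
// Hypothetical multibyte filename for illustration (source saved as UTF-8).
$filename = "weiß.png";

// PCRE route: /u validates the subject as UTF-8 and makes \w match
// Unicode word characters rather than only ASCII.
var_dump((bool) preg_match('!^[\w.-]*$!u', $filename)); // true

// mbstring route: declare the subject encoding, then use mb_ereg().
mb_regex_encoding('UTF-8');
var_dump((bool) mb_ereg('^[\w.-]*$', $filename));       // true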

Related

Detect encoding in PHP without multibyte extension?

Is there a way to detect the encoding of a string in PHP without having the mbstring extension loaded? I know it is possible to do so with mb_detect_encoding(), but is there an equivalent, non-multibyte function?
If not, what would it take to implement a detect_encoding() function that would at least detect UTF-8?
Strings in PHP are just byte sequences, they carry no encoding information with them. mb_detect_encoding doesn't actually detect the string's encoding, it tries to make an educated guess by running the byte sequence against a series of identification functions, one per encoding (by default those given by mb_detect_order), and returns the first one in which the sequence matches. These functions are very basic and don't even exist for many popular encodings.
There is no way, with or without the mbstring extension, to ascertain the encoding of a string - only to maybe rule some out, which you could only do if the string happens to contain byte sequences that would be invalid in those particular encodings.
You will never know whether "\xC2\xA4" is supposed to be the UTF-8 ¤ or the ISO-8859-1 Â¤ just by looking at it, because they're the exact same bytes.
For more information see: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
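As a sketch of what a validation-only detector could look like without mbstring (is_valid_utf8 is a hypothetical helper name): with the u modifier, PCRE refuses subjects that are not well-formed UTF-8, so matching an empty pattern doubles as a validity check. Again, this can only rule UTF-8 out, not prove the string was meant as UTF-8.
<?php
function is_valid_utf8(string $s): bool
{
    // preg_match() returns false (not 0) when the subject is invalid UTF-8.
    return preg_match('//u', $s) === 1;
}

var_dump(is_valid_utf8("med\xC3\xBAlla")); // true  - valid UTF-8 "medúlla"
var_dump(is_valid_utf8("med\xFAlla"));     // false - e.g. ISO-8859-1 bytes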
There's always iconv, which is generally enabled in PHP by default:
<?php
// Note: iconv_set_encoding() is deprecated as of PHP 5.6 in favour of
// the default_charset / iconv.* ini settings.
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "ISO-8859-1");
var_dump(iconv_get_encoding('all'));
?>

Google can't read a sitemap with special characters in URLs

I have a big sitemap created dynamically with PHP: a sitemap index with some 230 separate sitemaps, each containing between 3,000 and 15,000 URLs.
In most of those 230 sitemaps everything is fine, but in some of them certain URLs contain special characters, and Google returns an error and does not accept the sitemap. An example of a normal, accepted URL:
http://www.site.com/Gentofte-Greve/Denmark 1 Badmintonligaen/12-fe-juice_a-1091627-1-33-1-odds/
An example of a URL which corrupts the entire sitemap file for Google:
http://www.site.com/Team%20%C5rhus%20Elite-Solr%F8d%20Strand/Denmark 1 Badmintonligaen/12-fe-juice_a-1091631-1-33-1-odds/
Any special character, for example the Nordic ones, will wreck the sitemap. Here is an example of Nordic characters: http://www.borgos.nndata.no/alfabet.htm
My question is: HOW do I encode those special characters (and other similar ones) so the sitemap still checks out fine? Which PHP function do I use, if that's a solution? Is the only solution to use str_replace() and swap those characters for plain ones? That wouldn't be an issue, since the URL works no matter what you write in the first part (it's there for SEO only), but it would be time-consuming. I'd prefer to be able to write those special characters in a way which doesn't wreck the sitemap for Google.
Everything else regarding my sitemaps is fine, they're coded in UTF-8 or at least they should be with this line:
<?xml version='1.0' encoding='UTF-8'?>
Are the %C5 and %F8 sequences meant to represent the characters U+00C5 (Å) and U+00F8 (ø)? If so, you need to use their UTF-8 encodings, not their raw Unicode codepoint numbers. 'Å' should be %C3%85, and 'ø' should be %C3%B8.
For more information about URI encoding, see RFC 3986.
Doing this in PHP is complicated by the fact that PHP strings are really byte strings, not Unicode character strings. They can't store abstract Unicode characters; they can only store the encoded representation of those characters, in a particular encoding such as UTF-8 or UTF-16. You can use the mbstring extension to work with encoded Unicode strings, but doing this correctly will probably mean using the mbstring functions for all handling of Unicode text throughout your application.
You should be looking to fix this encoding problem at the source: how did your program get a string that contains the byte 0xC5 to represent the character U+00C5? Something, somewhere, must've assumed that Unicode codepoint numbers translate directly into bytes, which is wrong. Find and fix that, so that your data is read into the PHP string in UTF-8 form to begin with, and then use the mbstring functions for any manipulation of the string afterward.
Once you have a string that contains the UTF-8 representation of your URL, rawurlencode() should give you the correct percent-escaped result.
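For instance, assuming the team name is already UTF-8 in your PHP string (and that this source file is saved as UTF-8), a sketch of building the path segment:
<?php
$team = "Team Århus Elite-Solrød Strand"; // hypothetical team name

// rawurlencode() percent-escapes byte by byte per RFC 3986, so UTF-8
// input yields the correct %C3%85 / %C3%B8 sequences.
echo rawurlencode($team), "\n";
// Team%20%C3%85rhus%20Elite-Solr%C3%B8d%20Strand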

Why use multibyte string functions in PHP?

At the moment, I don't understand why it is really important to use the mbstring functions in PHP when dealing with UTF-8. My locale under Linux is already set to UTF-8, so why don't functions like strlen and preg_replace work properly by default?
None of PHP's standard string functions handle multibyte strings, regardless of your operating system's locale. That is why you need to use the multibyte string functions.
From the Multibyte String Introduction:
When you manipulate (trim, split, splice, etc.) strings encoded in a multibyte encoding, you need to use special functions since two or more consecutive bytes may represent a single character in such encoding schemes. Otherwise, if you apply a non-multibyte-aware string function to the string, it probably fails to detect the beginning or ending of the multibyte character and ends up with a corrupted garbage string that most likely loses its original meaning.
Here is my answer in plain English.
A single Japanese, Chinese or Korean character takes more than a single byte. A typical character, say "x", takes 1 byte in English text, but a Japanese, Chinese or Korean character takes more than one. PHP's standard string functions treat one character as one byte, so if you try to compare two Japanese, Chinese or Korean strings they will not work as expected. For example, the equivalent of "Hello World!" in Japanese, Chinese or Korean will occupy more than 12 bytes.
Read http://www.php.net/manual/en/intro.mbstring.php
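A quick illustration of the byte/character mismatch (assuming this snippet is saved as UTF-8):
<?php
// "Hello" in Japanese: 5 characters, 3 bytes each in UTF-8.
$greeting = "こんにちは";

echo strlen($greeting), "\n";             // 15 - counts bytes
echo mb_strlen($greeting, 'UTF-8'), "\n"; // 5  - counts characters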
You do not need to use UTF-8 aware code to process UTF-8. For the most part.
I've even written a Unicode uppercaser/lowercaser, and NFC and NFD transforms, using only byte-aware functions. It's hard to think of anything more complicated than that, that needs such delicate and detailed treatment of UTF-8. And yet it still works with byte-only functions.
It's very rare that you need UTF-8 aware code. Maybe to count the number of characters, or to move an insertion point forward by 1 character. But actually, even then your code won't work ;) because of decomposed characters.
But if all you are doing is replacements, finding stuff, or even parsing syntax, you just need byte-aware functions.
I'll explain why.
It's because no UTF-8 character can be found inside any other UTF-8 character. That's how it is designed.
Try to explain to me how you can get text-processing errors in a multi-byte system where no character can be found inside another character. Just one example case! The simplest you can think of.
PHP strings are just plain byte sequences. They have no meaning by themselves. And they do not use any particular character encoding either.
So if you read a file using file_get_contents() you get a binary-safe representation of the file. May it be the (binary) representation of an image or a human-readable text file - PHP doesn't care.
Now, as long as you just need to do basic processing of the string, you do not need to know the character encoding at all. So if you want to store the string back into a file using file_put_contents() or want to get its length (not the number of characters) using strlen(), you're fine.
However, as soon as you start doing more fancy string manipulation, you need to know the character encoding! There is no way to store it as part of the string, so you either have to track it separately, or, what most people do, use the convention of having all (text) strings in a common character encoding, like US-ASCII or nowadays UTF-8.
So because there is no way to set a character encoding for a string, PHP has no idea which character encoding the string is using. Due to that, the only sane thing for strlen() to do is to return the number of bytes, as this is the only thing PHP knows for sure.
If you supply the character encoding as additional information, you need to use a different function; in this case it's called mb_strlen().
The same applies to preg_replace(): If you want to replace umlaut-a, or match three identical characters in a row, you need to know how umlaut-a is encoded, and in general, how characters are encoded.
So if you have a hypothetical character encoding, which encodes a lower-case a as a1 and an upper-case A as a2, a b as b1 and B as b2 (and so on), you can have an (encoded) string a1a1a1 which consists of three identical characters in a row. However, without knowing the encoding and by just looking at the byte sequence, there is no way to detect this.
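Here is the same situation in a real encoding; with UTF-8, the u modifier is what tells PCRE where the character boundaries are:
<?php
// "ééé" is the UTF-8 byte sequence C3 A9 C3 A9 C3 A9.
$s = "\xC3\xA9\xC3\xA9\xC3\xA9";

// Byte-oriented: '.' is one byte, and no byte occurs three times
// in a row here, so nothing matches.
var_dump(preg_match('/(.)\1\1/', $s));  // int(0)

// UTF-8-aware: '.' is one character, so the repetition is found.
var_dump(preg_match('/(.)\1\1/u', $s)); // int(1)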
Summary:
No sane 'default' is possible because PHP strings do not carry their character encoding. And even if they did, a single function like strlen() cannot return both the length of the byte sequence (as required for the Content-Length HTTP header) and the number of characters (as useful to denote the length of a blog article).
That's why the Function Overloading Feature is inherently broken and even if it looks nice at first, will break your code in a hard-to-debug way.
multibyte => multi + byte.
1) It is used to work with strings in languages other than English.
2) PHP's default string functions only work properly with English (and closely related) text.
3) If you want to use strlen(), strpos(), strtoupper() or str_replace() on text with special characters, you need the multibyte variants.
Suppose we need to apply string functions to "Hello" in Chinese (你好), Arabic (مرحبا), Japanese (こんにちは), Hindi (नमस्ते) or Gujarati (હેલો).
Each language can have its own character set, which is why mbstring was introduced: to work with the various languages (Chinese, Japanese, etc.).
The name Raul González is a perfect example of why. Say we are shortening over-long user names for a MySQL database, with a 10-character limit, and the name is Raul González.
The unit test below is an example of how you can get an error like this
General error: 1366 Incorrect string value: '\xC3' for column 'name' at row 1 (SQL: update users set name = Raul Gonz▒, updated_at = 2019-03-04 04:28:46 where id = 793)
and how you can avoid it:
public function test_substr(): void
{
    $name = 'Raul González';
    $user = factory(User::class)->create(['name' => $name]);

    try {
        // substr() counts bytes, so it cuts the 'á' (0xC3 0xA1) in half,
        // leaving a dangling 0xC3 byte that MySQL rejects.
        $name1 = substr($name, 0, 10);
        $user->name = $name1;
        $user->save();
    } catch (Exception $ex) {
    }
    $this->assertTrue(isset($ex));

    // mb_substr() counts characters, so the result stays valid UTF-8.
    $name2 = mb_substr($name, 0, 10);
    $user->name = $name2;
    $user->save();
    $this->assertTrue(true);
}
Laravel and PHPUnit were used for illustration.

PHP: Fixing encoding issues with database content - removing accents from characters

I'm trying to make a URL-safe version of a string.
In my database I have a value medúlla - I want to turn this into medulla.
I've found plenty of functions to do this, but when I retrieve the value from the database it comes back as medúlla.
I've tried:
Setting the column to utf8 encoding
Setting the table to utf8 encoding
Setting the entire database to utf8 encoding
Running `SET NAMES utf8` on the database before querying
When I echo the value onto the screen it displays as I want it to, but the conversion function doesn't see the ú character (even a simple str_replace() doesn't work either).
Does anybody know how I can force the system to recognise this as UTF-8 and allow me to run the conversion?
Thanks,
Matt
To transform a UTF-8 string into a URL-safe string you should use:
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $str);
The IGNORE part tells iconv() not to fail on a character it can't manage, and the TRANSLIT part converts a UTF-8 character into its nearest ASCII equivalent ('ú' into 'u' and such).
The next step is to preg_replace() spaces into underscores and substitute or drop any character which is unsafe within a URL, either with preg_replace() or urlencode().
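Putting both steps together, a hedged sketch of the whole pipeline (slugify is a hypothetical helper name; the //TRANSLIT//IGNORE suffix order and the setlocale() call matter on some iconv implementations, and transliteration results vary between them):
<?php
setlocale(LC_CTYPE, 'en_US.UTF-8'); // glibc iconv consults the locale for //TRANSLIT

function slugify(string $str): string
{
    // Strip accents: 'ú' -> 'u' and so on.
    $str = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $str);
    // Collapse anything that is not URL-safe into single hyphens.
    $str = preg_replace('/[^A-Za-z0-9]+/', '-', $str);
    return strtolower(trim($str, '-'));
}

echo slugify('medúlla'), "\n"; // medulla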
As for the database, you really should have done all this charset configuration before INSERTing UTF-8 content. Changing the charset of an existing table is somewhat like changing a file extension in Windows - it doesn't convert a JPEG into a GIF. But don't worry, and remember that the database will return byte for byte exactly what you stored in it, no matter which charset has been declared. Just keep the settings you used when INSERTing and treat the returned strings as UTF-8.
I'm trying to make a URL-safe version of a string.
Whilst it is common to use ASCII-only 'slugs' in URLs, it is actually possible to have web addresses including non-ASCII characters, e.g.:
http://en.wikipedia.org/wiki/Medúlla
This is a valid IRI. For inclusion in a URI, you should UTF-8- and %-encode it:
http://en.wikipedia.org/wiki/Med%C3%BAlla
Either way, most browsers (though sometimes not IE) will display the IRI version in the address bar. Sites such as Wikipedia use this to get pretty addresses.
the conversion function doesn't see the ú character
What conversion function? rawurlencode() will correctly spit out %C3%BA for ú, if, as presumably you do, you have it in UTF-8 encoding. This is the correct way to include text in a URL's path component. (urlencode() also gives the same results, but it should only be used for query components.)
If you mean htmlentities()... do not use this function. It converts all non-ASCII characters to HTML character references, which makes your output unnecessarily larger, and means it has to know what encoding the string you pass in is. Unless you give it a UTF-8 $charset argument it will use ISO-8859-1, and consequently screw up all your non-ASCII characters.
Unless you are specifically authoring for an environment which mangles non-ASCII characters, it is better to use htmlspecialchars(). This gives smaller output, and it doesn't matter(*) if you forget to include the $charset argument, since all it changes is a couple of characters like < and &.
(Actually it could matter for some East Asian multibyte character sets where < could be part of a multibyte sequence and so shouldn't be escaped. But in general you'd want to avoid these legacy encodings, as UTF-8 is less horrific.)
(even a simple str_replace() doesn't work either).
If you wrote str_replace(..., 'ú', ...) in the PHP source code, you would have to be sure that you saved the source code in the same encoding as you'll be handling, otherwise it won't match.
It is unfortunate that most Windows text editors still save in the (misleadingly-named) "ANSI" code page, which is locale-specific, instead of just using UTF-8. But it should be possible to save the file as UTF-8, and then the replace should work. Alternatively, write "\xc3\xba" (in double quotes, so the \x escapes are interpreted) to avoid the problem.
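A tiny demonstration of the escape-sequence route; note the double quotes, since single-quoted PHP strings do not interpret \x escapes:
<?php
// '\xc3\xba' (single-quoted) would be the literal eight characters
// backslash-x-c-3-backslash-x-b-a; "\xc3\xba" is the UTF-8 'ú'.
$subject = "med\xc3\xballa";                        // UTF-8 "medúlla"
echo str_replace("\xc3\xba", 'u', $subject), "\n";  // medulla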
Running SET NAMES utf8 on the database before querying
Use mysql_set_charset() in preference.

Can str_replace be safely used on a UTF-8 encoded string if it's only given valid UTF-8 encoded strings as arguments?

PHP's str_replace() was intended only for ANSI strings and as such can mangle UTF-8 strings. However, given that it's binary-safe, would it work properly if it were only given valid UTF-8 strings as arguments?
Edit: I'm not looking for a replacement function, I would just like to know if this hypothesis is correct.
Yes. UTF-8 is deliberately designed to allow this and other similar non-Unicode-aware processing.
In UTF-8, any non-ASCII byte sequence representing a valid character always begins with a byte in the range \xC0-\xFF. This byte may not appear anywhere else in the sequence, so you can't make a valid UTF-8 sequence that matches part of a character.
This is not the case for older multibyte encodings, where different parts of a byte sequence are indistinguishable. This caused a lot of problems, for example trying to replace an ASCII backslash in a Shift-JIS string (where byte \x5C might be the second byte of a character sequence representing something else).
It's correct because UTF-8 multibyte characters consist exclusively of non-ASCII bytes (values 128 and above), beginning with a lead byte that defines how many bytes follow, so you can't accidentally end up matching part of one UTF-8 multibyte character against another.
To visualise (abstractly):
a for an ASCII character
2x for a 2-byte character
3xx for a 3-byte character
4xxx for a 4-byte character
If you're matching, say, a2x3xx (where a is a byte in the ASCII range), then since a < x, and 2x cannot be a subset of 3xx or 4xxx, et cetera, you can be sure that your UTF-8 will match correctly, given the prerequisite that all strings are definitely valid UTF-8.
Edit: See bobince's answer for a less abstract explanation.
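A quick sanity check of the hypothesis, and of why the valid-UTF-8 prerequisite matters (source file saved as UTF-8):
<?php
// Both needle and haystack are valid UTF-8, as the hypothesis requires.
$haystack = "Grüße";                           // ü = C3 BC, ß = C3 9F
var_dump(str_replace("ü", "ue", $haystack));   // string(7) "Grueße"

// Drop the prerequisite and the guarantee disappears: "\xBC" alone is
// not valid UTF-8, and byte-based matching finds it inside "ü".
var_dump(str_replace("\xBC", "?", $haystack)); // corrupted string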
Well, I do have a counter-example: I have a UTF-8 encoded settings '.ini' file specifying application settings like the email sender name. It says something like:
email_from = Märta
and I read it from there into the variable $sender_name. Now I substitute it into the (again UTF-8) message body, which ends with:
regards
{sender}
$message = str_replace("{sender}", $sender_name, $message);
The email is absolutely correct in every respect, but the sender name is totally broken. There are other cases (like explode()) where something goes wrong with a UTF string: it is healthy before the conversion but not after it. Sorry to say, there seems to be no way of correcting this behaviour.
Edit: Actually, explode() is involved in parsing the .ini file, so the problem may well lie in that very function and str_replace() may well be innocent.
No, you cannot.
From practice I can tell you: if you have some multibyte symbols like ◊ mixed with single-byte ones, it won't always work correctly, because some symbols take 2-4 bytes to store.
str_replace() takes fixed bytes and replaces them; as a result you can end up with something that isn't any symbol at all, just trash.
Yes, I think this is correct, at least I couldn't find any counter-example.
