I had criticized an answer that suggested preg_match over === when finding substring offsets in order to avoid type mismatch.
However, the answer's author later discovered that preg_match is actually significantly faster than the multibyte-aware mb_strpos. Plain strpos is faster than both but, of course, cannot deal with multibyte strings.
I understand that mb_strpos needs to do something more than strpos. However, if regex can do it almost as fast as strpos, what is it that mb_strpos does that takes so much time?
I have a strong suspicion that it's an optimization error. Could, for example, PHP extensions be slower than its native functions?
mb_strpos($str, "颜色", 0 ,"GBK"): 15.988190889 (89%)
preg_match("/颜色/", $str): 1.022506952 (6%)
strpos($str, "dh"): 0.934401989 (5%)
Functions were run 10^6 times. The absolute times are the total over all 10^6 runs of each function, not the average for a single run.
The test string is $str = "代码dhgd颜色代码";. The test can be seen here (scroll down to skip the testing class).
Note: According to one of the commentators (and common sense), preg_match also does not use multi-byte when comparing, being subject to same risk of errors as strpos.
To understand why the functions have a different runtime you need to understand what they actually do. Because summing them up as ‘they search for needle in haystack’ isn’t enough.
strpos
If you look at the implementation of strpos, it uses zend_memstr internally, which implements a pretty naive algorithm for searching for needle in haystack: Basically, it uses memchr to find the first byte of needle in haystack and then uses memcmp to check whether the whole needle begins at that position. If not, it repeats the search for the first byte of needle from the position of the previous match of the first byte.
Knowing this, we can say that strpos only searches for a byte sequence within a byte sequence, using a naive search algorithm.
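To make the description above concrete, here is a minimal sketch of the same naive strategy written in PHP itself. It is only an illustration: naiveBytePos() is a hypothetical helper name, and the real work in PHP is done in C by zend_memstr with memchr and memcmp.

```php
<?php
// Illustrative re-implementation of the naive search described above.
// naiveBytePos() is a hypothetical helper, not part of PHP.
function naiveBytePos(string $haystack, string $needle, int $offset = 0)
{
    $needleLen = strlen($needle);
    $last = strlen($haystack) - $needleLen;
    if ($needleLen === 0 || $offset < 0) {
        return false;
    }
    for ($i = $offset; $i <= $last; $i++) {
        // Step 1 (memchr in C): skip ahead until the first byte of the needle matches.
        if ($haystack[$i] !== $needle[0]) {
            continue;
        }
        // Step 2 (memcmp in C): compare the whole needle at this position.
        if (substr_compare($haystack, $needle, $i, $needleLen) === 0) {
            return $i;
        }
    }
    return false;
}

var_dump(naiveBytePos('abcabd', 'abd'));
var_dump(naiveBytePos('abcabd', 'xyz'));
```

Like strpos, this works purely on bytes and returns a byte offset, or false when the needle is absent.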
mb_strpos
This function is the multi-byte counterpart to strpos. This makes searching a little more complex, as you can’t just look at the bytes without knowing which character they belong to.
mb_strpos uses mbfl_strpos, which does a lot more than the simple algorithm of zend_memstr: it’s around 200 lines of complex code (mbfl_strpos) compared to 30 lines of slick code (zend_memstr).
We can skip the part where both needle and haystack are converted to UTF-8 if necessary, and come to the major chunk of code.
First we have two setup loops, and then there is the loop that advances the pointer according to the given offset. Here you can see that the code is aware of actual characters and skips whole encoded UTF-8 characters: UTF-8 is a variable-width character encoding in which the first byte of each encoded character denotes the length of the whole encoded character. This information is stored in the u8_tbl array.
Finally, the loop where the actual search happens. And here we have something interesting, because the test for needle at a certain position in haystack is tried in reverse. And if one byte did not match, the jump table jtbl is used to find the next possible position for needle in haystack. This is actually an implementation of the Boyer–Moore string search algorithm.
So now we know that mb_strpos …
converts the strings to UTF-8, if necessary
is aware of actual characters
uses the Boyer–Moore search algorithm
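The "compare in reverse, then jump using a table" idea can be sketched in a few lines. What follows is the simplified Boyer-Moore-Horspool variant operating on plain bytes, written for illustration only; it is not a transcription of mbfl_strpos, and horspoolPos() is a hypothetical name.

```php
<?php
// Simplified Boyer-Moore-Horspool search over bytes: an illustration of the
// "compare backwards, then jump using a table" idea, not libmbfl's code.
function horspoolPos(string $haystack, string $needle)
{
    $n = strlen($haystack);
    $m = strlen($needle);
    if ($m === 0 || $m > $n) {
        return false;
    }
    // Jump table: for each byte of the needle except the last, how far the
    // needle may shift when that byte sits under the needle's last position.
    $jump = [];
    for ($i = 0; $i < $m - 1; $i++) {
        $jump[$needle[$i]] = $m - 1 - $i;
    }
    $pos = 0;
    while ($pos <= $n - $m) {
        // Compare from the end of the needle towards its start.
        $j = $m - 1;
        while ($j >= 0 && $haystack[$pos + $j] === $needle[$j]) {
            $j--;
        }
        if ($j < 0) {
            return $pos;
        }
        // Bytes that never occur in the needle allow a shift by its full length.
        $pos += $jump[$haystack[$pos + $m - 1]] ?? $m;
    }
    return false;
}

var_dump(horspoolPos('hello world', 'world'));
```

The benefit of the jump table grows with the needle length; for a two-character needle like the one in the benchmark there is little to skip.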
preg_match
As for preg_match, it uses the PCRE library. Its standard matching algorithm uses a nondeterministic finite automaton (NFA) to find a match conducting a depth-first search of the pattern tree. This is basically a naive search approach.
I am leaving out preg_match to keep the analysis focused.
Given your observation that mb_strpos is slower than strpos, you assume that, because of the time consumed, mb_strpos does more than strpos.
I think this observation is correct.
You then asked what is that "more" that is causing the time difference.
I'll try to give a simple answer: that "more" is because strpos operates on binary strings (one character = 8 bits = 1 octet = 1 byte), whereas mb_strpos operates on encoded character sequences (as nearly all of the mb_* functions do), in which a character can be X bits wide, possibly with a different width for each character.
As this is always about a specific character encoding, both the haystack as well as the needle string (probably) need to be first validated for that encoding, and then the whole operation to find the string position needs to be done in that specific character encoding.
That is translation work and — depending on encoding — also requires a specific search algorithm.
Next to that the mb extension also needs to have some structures in memory to organize the different character encodings, be it translation tables and/or specific algorithms. See the extra parameter you inject — the name of the encoding for example.
That is by far more work than just doing simple byte-by-byte comparisons.
For example the GBK character encoding is pretty interesting when you need to encode or decode a certain character. The mb string function in this case needs to take all these specifics into account to find out if and at which position the character is. As PHP only has binary strings in the userland from which you would call that function, the whole work needs to be done on each single function call.
To illustrate this even more, if you look through the list of supported encodings (mb_list_encodings), you can also find some like BASE64, UUENCODE, HTML-ENTITIES and Quoted-Printable. As you might imagine, all these are handled differently.
For example a single numeric HTML entity can be up to 1024 bytes large, if not even larger. An extreme example I know and love is this one. However, for that encoding, it has to be handled by the mb_strpos algorithm.
Reason of slowness
Taking a look at the 5.5.6 PHP source files, the delay seems to arise for the most part in the mbfilter.c, where - as hakre surmised - both haystack and needle need to be validated and converted, every time mb_strpos (or, I guess, most of the mb_* family) gets called:
Unless haystack is in the default format, encode it to the default format:
if (haystack->no_encoding != mbfl_no_encoding_utf8) {
    mbfl_string_init(&_haystack_u8);
    haystack_u8 = mbfl_convert_encoding(haystack, &_haystack_u8, mbfl_no_encoding_utf8);
    if (haystack_u8 == NULL) {
        result = -4;
        goto out;
    }
} else {
    haystack_u8 = haystack;
}
Unless needle is in the default format, encode it to the default format:
if (needle->no_encoding != mbfl_no_encoding_utf8) {
    mbfl_string_init(&_needle_u8);
    needle_u8 = mbfl_convert_encoding(needle, &_needle_u8, mbfl_no_encoding_utf8);
    if (needle_u8 == NULL) {
        result = -4;
        goto out;
    }
} else {
    needle_u8 = needle;
}
According to a quick check with valgrind, the encoding conversion accounts for a huge part of mb_strpos's runtime, about 84% of the total, or five-sixths:
218,552,085 ext/mbstring/libmbfl/mbfl/mbfilter.c:mbfl_strpos [/usr/src/php-5.5.6/sapi/cli/php]
183,812,085 ext/mbstring/libmbfl/mbfl/mbfilter.c:mbfl_convert_encoding [/usr/src/php-5.5.6/sapi/cli/php]
which appears to be consistent with the OP's timings of mb_strpos versus strpos.
Encoding aside, running mb_strpos on a string is much the same as running strpos on a slightly longer string. Okay, a string up to four times as long if you have really awkward strings, but even then you would get a delay by a factor of four, not by a factor of twenty. The additional 5-6X slowdown arises from encoding times.
Accelerating mb_strpos...
So what can you do? You can skip those two steps by ensuring that you have internally the strings already in the "basic" format in which mbfl* do conversion and compare, which is mbfl_no_encoding_utf8 (UTF-8):
Keep your data in UTF-8.
Convert user input to UTF-8 as soon as practical.
Convert back to the client encoding only if needed.
Then your pseudo-code:
$haystack = "...";
$needle = "...";
$res = mb_strpos($haystack, $needle, 0, $Encoding);
becomes:
$haystack = "...";
$needle = "...";
// $SourceEncoding is the encoding the data arrived in, e.g. 'GBK'.
mb_internal_encoding('UTF-8') or die("Cannot set encoding");
$haystack = mb_convert_encoding($haystack, 'UTF-8', $SourceEncoding);
$needle = mb_convert_encoding($needle, 'UTF-8', $SourceEncoding);
$res = mb_strpos($haystack, $needle, 0);
...when it's worth it
Of course this is only convenient if the "setup time" and maintenance of a whole UTF-8 base is appreciably smaller than the "run time" of doing conversions implicitly in every mb_* function.
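As a rough sketch of the "convert once, search many times" idea, assuming the mbstring extension is loaded and this file is saved as UTF-8: the GBK input below is manufactured from a UTF-8 literal purely for the demonstration, and the variable names are illustrative. Whether it pays off depends on the PHP version and on how many searches follow the conversion.

```php
<?php
// Sketch: convert once, then search many times in UTF-8.
// $gbkHaystack stands in for data arriving in a legacy encoding; it is built
// from a UTF-8 literal here (this file is assumed to be saved as UTF-8).
$gbkHaystack = mb_convert_encoding('代码dhgd颜色代码', 'GBK', 'UTF-8');

// One explicit conversion up front ...
$haystack = mb_convert_encoding($gbkHaystack, 'UTF-8', 'GBK');

// ... so that later searches operate on UTF-8 data directly.
$positions = [];
foreach (['颜色', 'dh', '代码'] as $needle) {
    $positions[$needle] = mb_strpos($haystack, $needle, 0, 'UTF-8');
}
print_r($positions);
```

The positions are character offsets, so they are the same whether the search runs on the GBK original or on the UTF-8 copy.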
The problems with mb_ performance may also be caused by a broken php-mbstring package installation (on Linux). Installing it explicitly for the exact version of my PHP installation helped me.
sudo apt-get install php7.1-mbstring
...
Before: Time: 16.17 seconds, Memory: 36.00MB OK (3093 tests, 40272 assertions)
After: Time: 1.81 seconds, Memory: 36.00MB OK (3093 tests, 40272 assertions)
I need to check the length of a textarea's value. The textarea might contain 5000 characters, but I only need to know whether it contains more than 20 characters. I can do that with the mb_strlen() function, something like this:
$content = $_POST['textarea_content'];
$content_length = mb_strlen($content, 'utf8');
if ( $content_length > 20 ) {
// do stuff
}
But my approach isn't optimized at all. It counts all the characters and then compares the result. As I said, sometimes there are lots of characters, like 5000. So is there any approach that stops counting after 20 characters?
Strings in PHP have an internal variable that stores the length of the string, so the runtime of strlen($str) does not depend on the length of the string at all.
Your problem is that you want to use mb_strlen in order to get the number of characters in the string (and not the number of bytes). In other words, you want to know the length of the string even if it contains non-ASCII Unicode characters.
If you know that your string is UTF-8, that can be used for optimization. UTF-8 uses at most 4 bytes per char, so if isset($str[80]) is true, the string is at least 81 bytes long and therefore certainly more than 20 chars (and probably many more). If not, you will still have to use the mb_ functions to get the information you need.
The reason for the usage of isset instead of strlen is because you asked about the optimized way. You can read more in this question regarding the two.
To sum it up - your optimized code would probably be:
if (isset($str[80]) || mb_strlen(mb_substr($str, 0, 21, 'utf-8'), 'utf-8') > 20) {
....
}
In PHP, the code will first check the isset part, and if it returns true the other part will not run (so you get the optimization here from both the isset and the fact that you don't need to run the mb_ functions).
If you have more information about the characters in your string, you can use it for more optimization (if, for example, you know that all of the chars in your string are from the lower range of UTF-8 and need at most 2 bytes each, you don't have to use $str[80]; you might as well use $str[40]).
You can use the UTF-8 table from Wikipedia, together with the information from the utf8-chartable website, to help work out the number of bytes you might need for each char in your string.
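The byte-length shortcut can be wrapped in a small helper. This is a sketch under the assumption that the input is valid UTF-8; hasMoreThanChars() is a hypothetical name, and the second early exit (a string of at most $limit bytes cannot hold more than $limit characters) is an addition of this sketch that follows from the same reasoning.

```php
<?php
// Hypothetical helper: does a valid UTF-8 string hold more than $limit characters?
function hasMoreThanChars(string $utf8, int $limit): bool
{
    // More than 4 * $limit bytes: even at 4 bytes per character the limit is exceeded.
    if (isset($utf8[$limit * 4])) {
        return true;
    }
    // At most $limit bytes: every character needs at least one byte, so the limit holds.
    if (!isset($utf8[$limit])) {
        return false;
    }
    // In between, count characters, but only over the first $limit + 1 of them.
    return mb_strlen(mb_substr($utf8, 0, $limit + 1, 'UTF-8'), 'UTF-8') > $limit;
}

var_dump(hasMoreThanChars(str_repeat('a', 5000), 20));
```

Only strings between 21 and 80 bytes ever reach the mb_ functions, so long inputs such as a 5000-character textarea are decided by a single isset check.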
I have encountered something bizarre in the mb_strwidth function; it may be a bug but I thought it better to ask here first in case I'm missing something.
Context
A class is being used to represent a generic string and is both iterable and seekable; with both iterations and seeks applying to the character within the string. The string has full multi-byte support, so when a new position is sought, it not only stores the character position, but recalculates the byte position in the string; like so:
$this->posByte = mb_strwidth(
    mb_substr($this->value, 0, $pos, $this->charEncoding),
    $this->charEncoding
);
Perceived Error
However, when a multi-byte character is introduced, this is returning an incorrect value. The test case is this:
$str = string('The simple sentence of the simple man; here are some multi-byte chars: Øðćă.', 'UTF-8');
$str->seek(72);
This seeks to the second multi-byte character 'ð', but the byte calculation given above returns 72, the same as the character position; whereas it should be 73 since the preceding character 'Ø' has a code point of U+00D8; which is 216 in decimal and firmly in the two-byte character range.
This is confirmed by using the multi-byte unaware function strlen() (since I have not enabled mb overloading); which simply counts the number of bytes in a string. This:
$bytePos = strlen(mb_substr($this->value, 0, $pos, $this->charEncoding));
returns 73 as expected.
Is this a known problem?
I can use strlen() for now as a workaround, but I don't particularly like doing so since enabling multi-byte overloading in the PHP config would then cause the errors to reappear; does anyone have any experience of a similar issue? Is PHP just using an out-of-date character mapping?
For the record, this is from a PHPUnit test run on a PHP 5.6.3 windows environment.
You appear to be misinterpreting the function of mb_strwidth. Its purpose has nothing to do with bytes; it merely gives you the visual width of a string according to a fixed table. This is purely interesting for Asian character sets with appropriate monospaced fonts, where Latin characters, commas and other punctuation are half-width and "regular" characters are full-width. Everything up to and including U+1FFF is 1.
You need to use strlen and other encoding-unaware functions to operate on strings in bytes, and mb_ functions to operate on them on a character level, to figure out your byte/character relationships.
If you're worried about the barbaric mb-overloading, either check the ini setting and refuse operation on insane systems, or use mb_strlen with a single-byte encoding set.
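A short comparison of the three measurements on the string from the question may help, assuming mbstring is available. The non-ASCII characters are written as "\u{...}" escapes (PHP 7+) so the example does not depend on the file's encoding; '8bit' is the single-byte encoding name mbstring accepts for counting raw bytes.

```php
<?php
// The test string from the question, with the four non-ASCII characters escaped.
$text = 'The simple sentence of the simple man; here are some multi-byte chars: '
      . "\u{D8}\u{F0}\u{107}\u{103}.";

// Everything before the second multi-byte character.
$prefix = mb_substr($text, 0, 72, 'UTF-8');

$chars  = mb_strlen($prefix, 'UTF-8');   // characters
$width  = mb_strwidth($prefix, 'UTF-8'); // display columns, not bytes
$bytes  = strlen($prefix);               // bytes, unless mbstring.func_overload is active
$bytes8 = mb_strlen($prefix, '8bit');    // bytes, regardless of overloading

printf("chars=%d width=%d bytes=%d bytes8=%d\n", $chars, $width, $bytes, $bytes8);
```

The width equals the character count here only because every character in the prefix happens to be one column wide; the byte count is the value the seek logic actually needs.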
I have a large string $string that, when passed to md5(), gives me
c4ca4238a0b923820dcc509a6f75849b
The length is 32, I want to reduce it, so
base64_encode(md5($string, true));
xMpCOKC5I4INzFCab3WEmw==
Removing the trailing == gives me a string of length 22.
Are there any other better algorithms?
I am not sure you realised that md5 is a hash function, and therefore irreversible. If you do not care about reversibility, you could just as well trim the md5 hash (or any hash of your liking*) down to an arbitrary number of characters. All this would do is increase the likelihood of collision (I feel this does not produce a uniform distribution though).
If you are looking for a reversible (ie. non-destructive) compression, then do not reinvent the wheel. Use the built-in functions, such as gzdeflate() or gzcompress(), or other similar functions.
*Here is a list of hash functions (wikipedia) along with the size of their output.
I suppose the smallest possible "hash function" would be a parity bit :)
Another way to look at it: rather than base64-encoding the 32-character hexadecimal string that md5 returns by default, convert the hexadecimal digits directly to base64.
Since a hexadecimal character carries 4 bits (16 values) and a base64 character carries 6 bits (64 values), every 3 hexadecimal characters make up 2 base64 characters.
To perform the conversion, you can do the following:
Split the string into groups of 3 hexadecimal characters (ten full groups, plus a final group of 2 that is padded with four zero bits).
Compute each group's 12-bit value: multiply the first character by 256, the second by 16, and add the third (keeping in mind that A-F = 10-15).
Split that value into two 6-bit numbers and match each to the base64 scheme using the table from here: https://en.wikipedia.org/wiki/Base64
This will result in a 22 character base64 string with the same value as the hexadecimal representation of the md5 string, which is also what base64_encode(md5($string, true)) gives once the == padding is removed.
Theoretically, you could do the same for any base. If we had a way to encode base128 strings in ASCII, we could end up with a 19 character string. However, as the character set is limited, I think base64 is the highest base that is commonly used.
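For illustration, here is a minimal sketch that regroups the bits of a hexadecimal digest into base64 characters directly: 4 bits per hexadecimal digit, 6 bits per base64 character. hexToBase64() is a hypothetical helper; in practice base64_encode(md5($string, true)) produces the same characters plus padding.

```php
<?php
// Hypothetical helper: regroup hex digits (4 bits each) into unpadded base64 (6 bits each).
function hexToBase64(string $hex): string
{
    $alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
    $out = '';
    // 3 hex digits = 12 bits = exactly 2 base64 characters.
    foreach (str_split($hex, 3) as $group) {
        $bits = strlen($group) * 4;
        $value = hexdec($group);
        // A short final group is padded with zero bits up to a multiple of 6.
        $pad = (6 - $bits % 6) % 6;
        $value <<= $pad;
        $bits += $pad;
        for ($shift = $bits - 6; $shift >= 0; $shift -= 6) {
            $out .= $alphabet[($value >> $shift) & 0x3F];
        }
    }
    return $out;
}

$hex = 'c4ca4238a0b923820dcc509a6f75849b';
echo hexToBase64($hex), "\n";
```

For a 128-bit digest the result is 22 characters long, because 128 / 6 rounds up to 22.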
The shorter the string you want, the smaller the number of possible combinations.
Total number of possibilities with repetition
Total possibilities = n^r
Since we are dealing with base64 as the printable output, we only have 64 characters
n = 64
If you are looking at 22 letters in length
n^r = 64^22 = 5,444,517,870,735,015,415,413,993,718,908,291,383,296 possibilities
Back to your question : Are there any better algorithm?
Truncate the output of a good hash to the length you want, since the total number of possibilities (and with it the collision risk) is fixed by the length
$string = "the fox jumps over the lazy brown dog";
echo truncateHash($string, 8);
Output
9TWbFjOl
Function Used
function truncateHash($str, $length) {
$hash = hash("sha256", $str, true);
return substr(base64_encode($hash), 0, $length);
}
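To get a feeling for what a given truncation length costs, the usual birthday approximation p ≈ 1 - e^(-k(k-1)/(2N)) can be evaluated, where N = n^r is the number of possible values and k the number of hashed items. This is a rough sketch that assumes uniformly distributed hash output; collisionProbability() is a hypothetical helper.

```php
<?php
// Hypothetical helper: approximate probability of at least one collision among
// $items values drawn uniformly from $alphabetSize ^ $length possibilities.
function collisionProbability(int $alphabetSize, int $length, float $items): float
{
    $space = pow($alphabetSize, $length);          // n^r; becomes a float when too large for an int
    $x = $items * ($items - 1) / (2.0 * $space);
    return -expm1(-$x);                            // 1 - e^(-x), accurate for tiny x
}

printf("8 chars, 1e6 items:  %.6f\n", collisionProbability(64, 8, 1e6));
printf("22 chars, 1e6 items: %.3e\n", collisionProbability(64, 22, 1e6));
```

With 8 base64 characters a million items already give a collision chance of roughly 0.2%, while at 22 characters it is negligible.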
This encoding generates shorter string,
print base64_encode(hash("crc32b",$string,1));
output
qfQIdw==
Not sure if MD5 is the right choice for you, but I will assume that you have reason to stick with this algorithm and are looking for a shorter representation. There are several possibilities to generate a shorter string with different alphabets:
Option 1: Binary string
The shortest possible form of an MD5 is its binary representation. To get such a string you can simply call:
$binaryMd5 = md5($input, true);
You can store this string like any other string in a database; it needs only 16 bytes. Just make sure you do proper escaping, either with mysqli_real_escape_string() or with parametrized queries (PDO).
Option 2: Base64 encoding
Base64 encoding will produce a string with this alphabet: [0-9 A-Z a-z + /] and uses '=' as padding. This encoding is very fast, but includes the sometimes unwanted characters '+/='.
$base64Md5 = base64_encode(md5($input, true));
The output length will always be 24 characters for the MD5 hash (22 plus two '=' padding characters).
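If the '+', '/' and '=' characters are the only obstacle, a common workaround is to swap them for URL-safe ones and drop the padding. The sketch below shows one way to do that; shortMd5() is a hypothetical helper name and the '-' and '_' substitutes follow the usual "base64url" convention.

```php
<?php
// Hypothetical helper: MD5 as a 22-character, URL-safe base64 string.
function shortMd5(string $input): string
{
    $b64 = base64_encode(md5($input, true));      // 24 chars, always ending in "=="
    return rtrim(strtr($b64, '+/', '-_'), '=');   // 22 chars from [0-9A-Za-z_-]
}

echo shortMd5('1'), "\n";
```

Because a 16-byte input always yields exactly two padding characters, stripping them loses no information.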
Option 3: Base62 encoding
The base62 encoding only uses the alphabet [0-9 A-Z a-z]. Such strings can be safely used for any purpose like tokens in an URL, and they are very compact. I wrote a base62 encoder, which is able to convert binary strings to the base62 alphabet. It may not be the fastest possible implementation, but it was my goal to write understandable code. The same class could be easily adapted to different alphabets.
$base62Md5 = StoBase62Encoder::base62encode(md5($input, true));
The output length will vary from 16 to 22 characters for the MD5 hash.
Base 91 looks like the most space efficient binary to ASCII printable encoding algorithm (which is what it seems you want).
I've not seen the PHP implementation, but if your software has to work with others I'd stick to Base 64; it's well-known, lightning fast, and available everywhere.
Firstly, to answer your question: Yes, there is a better algorithm (if with "better" you mean "shorter").
Use the hash() function (which has been part of the PHP core and enabled by default since PHP 5.1.2) with any of the adler32, crc32, crc32b, fnv132 or joaat algorithms.
Without a more in-depth knowledge of your current situation, you might as well just pick whichever one you think sounds the coolest.
Here is an example:
hash('crc32b', $string)
I set up an online example you can play around with.
Secondly, I would like to point out that what you are asking is an almost exact duplicate of another question here on stackoverflow.
I read from your post that you are searching for a hashing algorithm and not compression.
There are various standard hashing algorithms in php out there. Have a look at PHP hashing functions.
Depending on what you want to hash there are different approaches. Be careful and calculate the average collision probability.
However it seems you are searching for a 'compression' which outputs the minimum possible size of chars for a given string. If you do, then have a look at Lempel–Ziv–Welch (php implementation) or others.
At the moment, I don't understand why it is really important to use mbstring functions in PHP when dealing with UTF-8. My locale under Linux is already set to UTF-8, so why don't functions like strlen, preg_replace and so on work properly by default?
None of the standard PHP string functions handle multibyte strings, regardless of your operating system's locale. That is why you need to use the multibyte string functions.
From the Multibyte String Introduction:
When you manipulate (trim, split, splice, etc.) strings encoded in a
multibyte encoding, you need to use special functions since two or
more consecutive bytes may represent a single character in such
encoding schemes. Otherwise, if you apply a non-multibyte-aware string
function to the string, it probably fails to detect the beginning or
ending of the multibyte character and ends up with a corrupted garbage
string that most likely loses its original meaning.
Here is my answer in plain English.
A single Japanese, Chinese, or Korean character takes more than a single byte. For example, a typical English character, say x, takes 1 byte, while a Japanese, Chinese, or Korean character takes more than 1 byte. PHP's standard string functions treat every byte as one character, so comparing or measuring Japanese, Chinese, or Korean text with them will not work as expected. For example, "Hello World!" is 12 characters and 12 bytes in English, whereas text of the same number of characters in Japanese, Chinese, or Korean takes more than 12 bytes.
Read http://www.php.net/manual/en/intro.mbstring.php
You do not need to use UTF-8 aware code to process UTF-8. For the most part.
I've even written a Unicode uppercaser/lowercaser, and NFC and NFD transforms, using only byte-aware functions. It's hard to think of anything more complicated than that, that needs such delicate and detailed treatment of UTF-8. And yet it still works with byte-only functions.
It's very rare that you need UTF-8 aware code. Maybe to count the number of characters, or to move an insertion point forward by 1 character. But actually, even then your code won't work ;) because of decomposed characters.
But if all you are doing is replacements, finding stuff, or even parsing syntax, you just need byte-aware functions.
I'll explain why.
It's because no UTF-8 character can be found inside any other UTF-8 character. That's how it is designed.
Try to explain to me how you can get text processing errors, in terms of a multi-byte system where no character can be found inside another character? Just one example case! The simplest you can think of.
PHP strings are just plain byte sequences. They have no meaning by themselves. And they do not use any particular character encoding either.
So if you read a file using file_get_contents() you get a binary-safe representation of the file. May it be the (binary) representation of an image or a human-readable text file - PHP doesn't care.
Now, as long as you just need to do basic processing of the string, you do not need to know the character encoding at all. So if you want to store the string back into a file using file_put_contents() or want to get its length (not the number of characters) using strlen(), you're fine.
However, as soon as you start doing more fancy string manipulation, you need to know the character encoding! There is no way to store it as part of the string, so you either have to track it separately, or, what most people do, use the convention of having all (text) strings in a common character encoding, like US-ASCII or nowadays UTF-8.
So because there is no way to set a character encoding for a string, PHP has no idea which character encoding the string is using. Due to that, the only sane thing for strlen() to do is to return the number of bytes, as this is the only thing PHP knows for sure.
If you provide the additional information of the used character encoding, you need to use another function - the function is called mb_strlen() in this case.
The same applies to preg_replace(): If you want to replace umlaut-a, or match three identical characters in a row, you need to know how umlaut-a is encoded, and in general, how characters are encoded.
So if you have a hypothetical character encoding, which encodes a lower-case a as a1 and an upper-case A as a2, a b as b1 and B as b2 (and so on), you can have an (encoded) string a1a1a1 which consists of three identical characters in a row. However, without knowing the encoding and by just looking at the byte sequence, there is no way to detect this.
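The same effect can be shown with a real encoding. In this sketch the pattern asks for "one character repeated three times"; without the u modifier PCRE treats each byte as a character, with it the subject is read as UTF-8. The "\u{E4}" escape (PHP 7+) stands for umlaut-a, and a PCRE build with UTF-8 support is assumed.

```php
<?php
$a = "\u{E4}";          // umlaut-a, two bytes in UTF-8 (0xC3 0xA4)
$text = $a . $a . $a;   // three identical characters, six bytes

// Byte mode: "." matches a single byte, and no byte occurs three times in a row.
$byteMode = preg_match('/(.)\1\1/', $text);

// UTF-8 mode: "." matches a whole character, so the repetition is found.
$utf8Mode = preg_match('/(.)\1\1/u', $text);

var_dump($byteMode, $utf8Mode);
```

The u modifier is how the encoding is communicated to preg_match, in the same way the encoding argument communicates it to the mb_* functions.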
Summary:
No sane 'default' is possible, as PHP strings do not contain the character encoding. And even if they did, a single function like strlen() could not return both the length of the byte sequence, as required for the Content-Length HTTP header, and the number of characters, as useful to denote the length of a blog article.
That's why the Function Overloading Feature is inherently broken and even if it looks nice at first, will break your code in a hard-to-debug way.
multibyte => multi + byte.
1) It is used to work with strings in other languages (that is, not English).
2) The default PHP string functions only work properly with English (or closely related) text.
3) If you want the equivalent of strlen(), strpos(), strtoupper() or str_replace() for text with special characters, you need multibyte-aware functions.
Suppose we need to apply string functions to "Hello".
In Chinese (你好), Arabic (مرحبا), Japanese (こんにちは), Hindi (नमस्ते), Gujarati (હેલો).
Different languages can have their own character sets,
so mbstring was introduced to work with various languages (Chinese, Japanese, etc.).
Raul González is a perfect example of why:
It is about shortening too long user names for MySQL database, say we have 10 character limit and Raul González.
The unit test below is an example how you can get an error like this
General error: 1366 Incorrect string value: '\xC3' for column 'name' at row 1 (SQL: update users set name = Raul Gonz▒, updated_at = 2019-03-04 04:28:46 where id = 793)
and how you can avoid it
public function test_substr(): void
{
    $name = 'Raul González';
    $user = factory(User::class)->create(['name' => $name]);
    try {
        $name1 = substr($name, 0, 10);
        $user->name = $name1;
        $user->save();
    } catch (Exception $ex) {
    }
    $this->assertTrue(isset($ex));
    $name2 = mb_substr($name, 0, 10);
    $user->name = $name2;
    $user->save();
    $this->assertTrue(true);
}
PHP, Laravel and PHPUnit were used for illustration.
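The same failure can be reproduced without a database or framework. This minimal sketch, assuming mbstring is loaded, shows that the byte-based cut leaves a dangling lead byte (the '\xC3' from the error message) while the character-based cut stays valid UTF-8; the accented letter is written as "\u{E1}" (PHP 7+) to keep the example independent of the file's encoding.

```php
<?php
$name = "Raul Gonz\u{E1}lez";                   // "Raul González", 14 bytes, 13 characters

$cutBytes = substr($name, 0, 10);               // 10 bytes: cuts the 2-byte "á" in half
$cutChars = mb_substr($name, 0, 10, 'UTF-8');   // 10 characters: 11 bytes, "á" intact

var_dump(bin2hex(substr($cutBytes, -1)));          // the dangling lead byte
var_dump(mb_check_encoding($cutBytes, 'UTF-8'));   // is the byte-based cut valid UTF-8?
var_dump(mb_check_encoding($cutChars, 'UTF-8'));   // is the character-based cut valid UTF-8?
```

A strict MySQL connection rejects the first value for exactly this reason: it is no longer a valid UTF-8 string.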
Is there a native or inexpensive way to check for the length of a string in bytes in PHP?
See http://bytes.com/topic/php/answers/653733-binary-string-length
Relevant part:
"In PHP, like in C, the string ends with a zero-character, '\0', (char)
0, null-terminator, null-byte or whatever you like to call it."
No, that's not the case - PHP strings are stored with both the length and the
data, unlike C strings that just has one pointer and uses a terminator. They're
"binary-safe" - NUL doesn't terminate the string.
See the definition of zvalue_value in zend.h; the string part has both a "char
*val" and "int len".
Problems would start if you're using the mbstring.func_overload, which changes
how strlen() and the other functions work, and does try and treat strings as
strings of characters in a specific encoding rather than a string of bytes.
This is not the normal PHP behaviour.
The answer is that strlen should return the number of bytes regardless of the content of the string. For multi-byte character strings, you get the wrong number of characters, but the right number of bytes. However, you need to be certain you're not using the mbstring overload, which changes how strlen behaves.
In the event that you have the mbstring overload set, or you are developing for platforms where you are unsure about this setting, you can do the following:
$len=strlen(bin2hex($data))/2;
The reason why this works is that in hex you are guaranteed to get 2 characters for every byte that comes from bin2hex (it returns two characters even for a leading binary 0).
Note that it will use significantly more resources than a normal strlen, so you should definitely not do that with large amounts of data if it's not absolutely necessary.
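A cheaper alternative, sketched here under the assumption that mbstring may or may not be loaded, is to ask mbstring itself for the length in the single-byte '8bit' encoding, which counts bytes whether or not overloading is active. byteLength() is a hypothetical helper name; note also that mbstring.func_overload was removed in PHP 8.0, so on current versions plain strlen() is always a byte count.

```php
<?php
// Hypothetical helper: length in bytes, independent of mbstring.func_overload.
function byteLength(string $data): int
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($data, '8bit');   // '8bit' treats every byte as one character
    }
    return strlen($data);                  // without mbstring there is nothing to overload strlen()
}

var_dump(byteLength("a\0b"));      // NUL bytes are counted, PHP strings are binary-safe
var_dump(byteLength("\u{E1}"));    // one character, two bytes in UTF-8
```

Unlike the bin2hex approach, this does not allocate a second string twice the size of the input.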
On php.org, someone was nice enough to create this function. Just multiply by 8 and you've got however many bits were in that string, as the function returns bytes.
The length of a string (textual data) is determined by the position of the NULL character which marks the end.
In case of binary data, NULL can be and often is in the middle of data.
You don't check the length of binary data. You have to know it beforehand. In your case, the length is 16 (bytes, not bits, if it is UUID).
As far as UUID validity is concerned, any 16-byte value is a valid UUID, so you are out of luck there.