The PHP manual on String conversion to numbers says:
The value is given by the initial portion of the string. If the string starts with valid numeric data, this will be the value used. Otherwise, the value will be 0 (zero).
This means that anything other than a number, plus or minus at the beginning of a string should return 0 when the string is converted to a number. Yet, (some) whitespace at the beginning of a string is ignored:
echo intval(" 3"); // 3
echo intval("
3"); // 3
Is there any kind of whitespace that intval() and (int) do not strip?
Where is this behavior documented?
The observed behavior is largely undocumented. Probably the space stripping depends on strtod() in C, which should use isspace().
intval() manual says:
The common rules of integer casting apply.
Following the link (links removed):
To explicitly convert a value to integer, use either the (int) or (integer) casts. However, in most cases the cast is not needed, since a value will be automatically converted if an operator, function or control structure requires an integer argument. A value can also be converted to integer with the intval() function.
So it looks like casting and intval() are equivalent. And a little below the quote above:
From strings
See String conversion to numbers
OK. Nothing really helpful there for us, except this little note:
For more information on this conversion, see the Unix manual page for strtod(3).
Following the links once more, this time to the version of strtod() specified by POSIX, the API standard for UNIX:
These functions shall convert the initial portion of the string pointed to by nptr to double, float, and long double representation, respectively. First, they decompose the input string into three parts:
1. An initial, possibly empty, sequence of white-space characters (as specified by isspace())
…
Common sense says this should apply to floats only, because strtod() returns a floating-point number in C, but number parsing and internal representation are quirky in PHP, as is almost everything in this language. Who knows how it really works under the hood. Better not to know.
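Since the documentation does not settle the question, one way to find out is to probe intval() directly. The following is a minimal sketch: the helper name probe_leading_whitespace() and the list of candidate characters are my own choices, and the results may differ between PHP versions, so no expected output is given beyond the two cases already shown in the question.

```php
<?php
// Hypothetical helper: reports what intval() returns when a given
// character sequence is placed in front of the digit 3.
function probe_leading_whitespace(): array
{
    $candidates = [
        'space'           => " ",
        'tab'             => "\t",
        'newline'         => "\n",
        'carriage return' => "\r",
        'vertical tab'    => "\x0B",
        'form feed'       => "\x0C",
        'NUL byte'        => "\0",
        'no-break space'  => "\xC2\xA0", // U+00A0 encoded as UTF-8
    ];

    $results = [];
    foreach ($candidates as $label => $prefix) {
        $results[$label] = intval($prefix . "3");
    }
    return $results;
}

// A result of 3 means the prefix was skipped; 0 means it was not.
foreach (probe_leading_whitespace() as $label => $value) {
    printf("%-16s => %d\n", $label, $value);
}
```

Running this on the PHP version you actually deploy is more reliable than reasoning from strtod(), since PHP is free to use its own parser.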
Related
I had criticized an answer that suggested preg_match over === when finding substring offsets in order to avoid type mismatch.
However, the answer's author later discovered that preg_match is actually significantly faster than the multibyte-aware mb_strpos. Plain strpos is faster than both functions but, of course, cannot deal with multibyte strings.
I understand that mb_strpos needs to do something more than strpos. However, if regex can do it almost as fast as strpos, what is it that mb_strpos does that takes so much time?
I have a strong suspicion that it's an optimization error. Could, for example, PHP extensions be slower than PHP's native functions?
mb_strpos($str, "颜色", 0 ,"GBK"): 15.988190889 (89%)
preg_match("/颜色/", $str): 1.022506952 (6%)
strpos($str, "dh"): 0.934401989 (5%)
Each function was run 10^6 times. The absolute times are the sums over all 10^6 runs of a function, not the average for a single run.
The test string is $str = "代码dhgd颜色代码";. The test can be seen here (scroll down to skip the testing class).
Note: According to one of the commentators (and common sense), preg_match also does not use multi-byte when comparing, being subject to same risk of errors as strpos.
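The original test class is not reproduced here, so the following is only a rough sketch of how such a comparison could be timed. The helper time_calls() is hypothetical, the run count is reduced, and the string is searched as UTF-8 rather than GBK, so the numbers it prints are not comparable with the ones above.

```php
<?php
// Hypothetical timing helper: total wall-clock seconds for $runs calls of $fn.
function time_calls(callable $fn, int $runs): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

$str  = "代码dhgd颜色代码"; // assumes this file is saved as UTF-8
$runs = 100000;            // fewer runs than in the question, to keep it quick

$timings = [
    'strpos'     => time_calls(function () use ($str) { return strpos($str, "dh"); }, $runs),
    'preg_match' => time_calls(function () use ($str) { return preg_match("/颜色/", $str); }, $runs),
];

// mb_strpos() is only available when the mbstring extension is loaded.
if (function_exists('mb_strpos')) {
    $timings['mb_strpos'] = time_calls(
        function () use ($str) { return mb_strpos($str, "颜色", 0, "UTF-8"); },
        $runs
    );
}

foreach ($timings as $name => $seconds) {
    printf("%-10s %.6f s\n", $name, $seconds);
}
```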
To understand why the functions have a different runtime you need to understand what they actually do. Because summing them up as ‘they search for needle in haystack’ isn’t enough.
strpos
If you look at the implementation of strpos, it uses zend_memstr internally, which implements a pretty naive algorithm for searching for needle in haystack: Basically, it uses memchr to find the first byte of needle in haystack and then uses memcmp to check whether the whole needle begins at that position. If not, it repeats the search for the first byte of needle from the position of the previous match of the first byte.
Knowing this, we can say that strpos does only search for a byte sequence in a byte sequence using a naive search algorithm.
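The two-step strategy described above can be sketched in PHP itself. This is an illustration of the idea, not the Zend source: strpos() with a single byte stands in for memchr, substr_compare() stands in for memcmp, and the name naive_strpos() is made up.

```php
<?php
// Byte-oriented sketch of the "find first byte, then compare" strategy.
// Returns the byte offset of $needle in $haystack, or false if absent.
function naive_strpos(string $haystack, string $needle)
{
    $len = strlen($needle);
    if ($len === 0) {
        return 0; // assumption: an empty needle matches at offset 0
    }

    $pos = 0;
    // Step 1 (memchr analogue): find the next occurrence of the first byte.
    while (($pos = strpos($haystack, $needle[0], $pos)) !== false) {
        // Step 2 (memcmp analogue): compare the whole needle at that offset.
        if (substr_compare($haystack, $needle, $pos, $len) === 0) {
            return $pos;
        }
        $pos++; // resume the search one byte further on
    }
    return false;
}

var_dump(naive_strpos("代码dhgd颜色代码", "dh")); // int(6), a byte offset in UTF-8
```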
mb_strpos
This function is the multi-byte counterpart to strpos. This makes searching a little more complex, as you can’t just look at the bytes without knowing which character they belong to.
mb_strpos uses mbfl_strpos, which does a lot more than the simple algorithm of zend_memstr: it’s about 200 lines of complex code (mbfl_strpos) compared to 30 lines of slick code (zend_memstr).
We can skip the part where both needle and haystack are converted to UTF-8 if necessary, and come to the major chunk of code.
First we have two setup loops, and then there is the loop that advances the pointer according to the given offset. Here you can see that the code is aware of actual characters and skips whole encoded UTF-8 characters: UTF-8 is a variable-width encoding in which the first byte of each encoded character determines the length of the whole character. This information is stored in the u8_tbl array.
Finally, the loop where the actual search happens. And here we have something interesting, because the test for needle at a certain position in haystack is tried in reverse. And if one byte did not match, the jump table jtbl is used to find the next possible position for needle in haystack. This is actually an implementation of the Boyer–Moore string search algorithm.
So now we know that mb_strpos …
converts the strings to UTF-8, if necessary
is aware of actual characters
uses the Boyer–Moore search algorithm
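The reverse comparison and the jump table can be illustrated with the Horspool simplification of Boyer–Moore, working on raw bytes. This is a sketch of the general technique under that simplification, not a transcription of mbfl_strpos; the function name horspool_strpos() and the variable $jump are made up for the example.

```php
<?php
// Boyer-Moore-Horspool search on bytes.
// Returns the byte offset of $needle in $haystack, or false if absent.
function horspool_strpos(string $haystack, string $needle)
{
    $n = strlen($haystack);
    $m = strlen($needle);
    if ($m === 0) {
        return 0; // assumption: an empty needle matches at offset 0
    }
    if ($m > $n) {
        return false;
    }

    // Jump table: how far the window may shift, keyed by the haystack byte
    // aligned with the last needle position. Bytes that do not occur in the
    // needle allow a shift by the full needle length.
    $jump = array_fill(0, 256, $m);
    for ($i = 0; $i < $m - 1; $i++) {
        $jump[ord($needle[$i])] = $m - 1 - $i;
    }

    $pos = 0;
    while ($pos <= $n - $m) {
        // Compare in reverse, from the last needle byte to the first.
        $j = $m - 1;
        while ($j >= 0 && $haystack[$pos + $j] === $needle[$j]) {
            $j--;
        }
        if ($j < 0) {
            return $pos; // every byte matched
        }
        $pos += $jump[ord($haystack[$pos + $m - 1])];
    }
    return false;
}
```

The payoff over the naive search is that a mismatch can skip several positions at once, which matters more as the needle gets longer.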
preg_match
As for preg_match, it uses the PCRE library. Its standard matching algorithm uses a nondeterministic finite automaton (NFA) to find a match conducting a depth-first search of the pattern tree. This is basically a naive search approach.
I am leaving out preg_match to keep the analysis focused.
Given your observation that mb_strpos is slow compared to strpos, you assume that, because of the time consumed, mb_strpos does more than strpos.
I think this observation is correct.
You then asked what is that "more" that is causing the time difference.
I'll try to give a simple answer: that "more" is because strpos operates on binary strings (one character = 8 bits = 1 octet = 1 byte), whereas mb_strpos operates on encoded character sequences (as nearly all of the mb_* functions do), in which a character can take X bits, possibly a variable number per character.
As this is always about a specific character encoding, both the haystack as well as the needle string (probably) need to be first validated for that encoding, and then the whole operation to find the string position needs to be done in that specific character encoding.
That is translation work and — depending on encoding — also requires a specific search algorithm.
Besides that, the mb extension also needs to keep some structures in memory to organize the different character encodings, be it translation tables and/or specific algorithms. Consider, for example, the extra parameter you pass in: the name of the encoding.
That is by far more work than just doing simple byte-by-byte comparisons.
For example the GBK character encoding is pretty interesting when you need to encode or decode a certain character. The mb string function in this case needs to take all these specifics into account to find out if and at which position the character is. As PHP only has binary strings in the userland from which you would call that function, the whole work needs to be done on each single function call.
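The difference between counting bytes and counting characters shows up directly in the return values. The sketch below uses the test string from the question, but assumes it is stored as UTF-8 (the question used GBK) and that the mbstring extension is loaded.

```php
<?php
$str = "代码dhgd颜色代码"; // assumes this file is saved as UTF-8

// strlen() and strpos() count bytes: each of these CJK characters
// takes three bytes in UTF-8.
var_dump(strlen($str));          // int(22)
var_dump(strpos($str, "颜色"));   // int(10)

// mb_strlen() and mb_strpos() count characters in the given encoding.
if (function_exists('mb_strpos')) {
    var_dump(mb_strlen($str, 'UTF-8'));            // int(10)
    var_dump(mb_strpos($str, "颜色", 0, 'UTF-8'));  // int(6)
}
```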
To illustrate this even more, if you look through the list of supported encodings (mb_list_encodings), you can also find some like BASE64, UUENCODE, HTML-ENTITIES and Quoted-Printable. As you might imagine, all these are handled differently.
For example a single numeric HTML entity can be up to 1024 bytes large, if not even larger. An extreme example I know and love is this one. However, for that encoding, it has to be handled by the mb_strpos algorithm.
Reason of slowness
Taking a look at the 5.5.6 PHP source files, the delay seems to arise for the most part in the mbfilter.c, where - as hakre surmised - both haystack and needle need to be validated and converted, every time mb_strpos (or, I guess, most of the mb_* family) gets called:
Unless haystack is in the default format, encode it to the default format:
if (haystack->no_encoding != mbfl_no_encoding_utf8) {
    mbfl_string_init(&_haystack_u8);
    haystack_u8 = mbfl_convert_encoding(haystack, &_haystack_u8, mbfl_no_encoding_utf8);
    if (haystack_u8 == NULL) {
        result = -4;
        goto out;
    }
} else {
    haystack_u8 = haystack;
}
Unless needle is in the default format, encode it to the default format:
if (needle->no_encoding != mbfl_no_encoding_utf8) {
    mbfl_string_init(&_needle_u8);
    needle_u8 = mbfl_convert_encoding(needle, &_needle_u8, mbfl_no_encoding_utf8);
    if (needle_u8 == NULL) {
        result = -4;
        goto out;
    }
} else {
    needle_u8 = needle;
}
According to a quick check with valgrind, the encoding conversion accounts for a huge part of mb_strpos's runtime, about 84% of the total, or five-sixths:
218,552,085 ext/mbstring/libmbfl/mbfl/mbfilter.c:mbfl_strpos [/usr/src/php-5.5.6/sapi/cli/php]
183,812,085 ext/mbstring/libmbfl/mbfl/mbfilter.c:mbfl_convert_encoding [/usr/src/php-5.5.6/sapi/cli/php]
which appears to be consistent with the OP's timings of mb_strpos versus strpos.
Encoding aside, mb_strpos'ing a string is exactly the same as strpos'ing a slightly longer string. Okay, a string up to four times as long if you have really awkward strings, but even then you would get a slowdown by a factor of four, not by a factor of twenty. The additional 5-6X slowdown comes from the encoding conversion.
Accelerating mb_strpos...
So what can you do? You can skip those two steps by ensuring that you have internally the strings already in the "basic" format in which mbfl* do conversion and compare, which is mbfl_no_encoding_utf8 (UTF-8):
Keep your data in UTF-8.
Convert user input to UTF-8 as soon as practical.
Convert back to the client encoding only if needed.
Then your pseudo-code:
$haystack = "...";
$needle = "...";
$res = mb_strpos($haystack, $needle, 0, $Encoding);
becomes:
$haystack = "...";
$needle = "...";
mb_internal_encoding('UTF-8') or die("Cannot set encoding");
// $SourceEncoding is the encoding the data arrived in (the $Encoding above)
$haystack = mb_convert_encoding($haystack, 'UTF-8', $SourceEncoding);
$needle = mb_convert_encoding($needle, 'UTF-8', $SourceEncoding);
$res = mb_strpos($haystack, $needle, 0);
...when it's worth it
Of course this is only convenient if the "setup time" and maintenance of a whole UTF-8 base is appreciably smaller than the "run time" of doing conversions implicitly in every mb_* function.
Problems with mb_* performance may also be caused by a broken php-mbstring package installation (on Linux). Installing the package explicitly for the exact version of the PHP installation helped me.
sudo apt-get install php7.1-mbstring
...
Before: Time: 16.17 seconds, Memory: 36.00MB OK (3093 tests, 40272 assertions)
After: Time: 1.81 seconds, Memory: 36.00MB OK (3093 tests, 40272 assertions)
I have a basic PHP question.
Let's say I have a string "02/03/2013". How is it represented internally in PHP? Is it converted to integers or to a hexadecimal equivalent?
When comparing two strings, how does PHP compare them internally?
Thanks for the answer in advance
PHP is written in C. All variables are ZVAL structs.
Please read these tutorials to learn more about the PHP internals and get started with writing extensions.
Extension Writing Part I: Introduction to PHP and Zend
Extension Writing Part II: Parameters, Arrays, and ZVALs
Extension Writing Part III: Resources
Table 1 shows the various types, and their corresponding letter codes
and C types which can be used with zend_parse_parameters():
Type      Code  Variable Type
Boolean   b     zend_bool
Long      l     long
Double    d     double
String    s     char*, int
Resource  r     zval*
Array     a     zval*
Object    o     zval*
zval      z     zval*
A PHP string is just a sequence of bytes, with no encoding tagged to it. Visit here for additional info.
Strings are strings. No conversion takes place; your string just happens to contain some digits, which is fine, but PHP doesn't treat it any differently than any other string.
PHP compares strings the same way most other languages do: it goes through the two strings byte by byte and looks for the first pair that differs. The string whose byte at that position has the lower value (as returned by ord()) is considered "less" than the other. One exception: when both operands are numeric strings, operators such as == and < compare them as numbers; strcmp() and === always compare the strings themselves.
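A few comparisons make this concrete. This is a small sketch using the date string from the question plus some values of my own; the numeric-string cases show where the comparison operators stop looking at individual bytes.

```php
<?php
// Byte-by-byte comparison: the first differing byte decides.
var_dump(strcmp("02/03/2013", "02/04/2013") < 0); // bool(true): "3" sorts before "4"
var_dump("abc" < "abd");                          // bool(true): "c" sorts before "d"

// Numeric strings are treated as numbers by == and <.
var_dump("1e3" == "1000");       // bool(true): both are numeric, so 1000 == 1000
var_dump("1e3" === "1000");      // bool(false): the strings themselves differ
var_dump("10" < "9");            // bool(false): compared as the numbers 10 and 9
var_dump(strcmp("10", "9") < 0); // bool(true): "1" sorts before "9"
```

For data such as dates in "dd/mm/yyyy" form, neither kind of comparison gives chronological order, so parsing the value first is the safer route.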
I'm trying to use the API of a web service provider. They don't have an example in Ruby, but they do have one for PHP, and I'm trying to interpret between the two. The API examples always use "true" on PHP's hash_hmac() call, which produces a binary output. The difference seems to be that Ruby's OpenSSL::HMAC.hexdigest() function always returns text. (If I change the PHP call to "false" they return the same value.) Does anyone know of a way to "encode" the text returned from OpenSSL::HMAC.hexdigest() to get the same thing as returned from a hash_hmac('sha256', $text, $key, true)?
Use OpenSSL::HMAC.digest to get the binary output.
You'll need to convert each pair of hex digits into a byte with the same value. I don't know any Ruby, but this is similar to how it would be handled in PHP.
First, take your string of hex digits and split it into an array in which each element is two characters long. Convert each element from a string of two hex digits to an integer. It looks like you can do this by calling the hex method on each string.
Next, call pack on the converted array using the parameter c*, to convert each integer into a one-byte character. You should get the correct string of bytes as the result.
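On the PHP side, the relationship between the two output forms can be checked directly. The sketch below is PHP rather than Ruby, and the key and text are placeholders; it shows that decoding the hex digest pair by pair yields exactly the raw digest.

```php
<?php
$key  = "secret";        // placeholder key
$text = "message body";  // placeholder text

$hex = hash_hmac('sha256', $text, $key, false); // 64 hex digits
$raw = hash_hmac('sha256', $text, $key, true);  // 32 raw bytes

// Two equivalent ways of turning each pair of hex digits into one byte.
$fromHex  = hex2bin($hex);
$fromPack = pack('H*', $hex);

var_dump($fromHex === $raw);  // bool(true)
var_dump($fromPack === $raw); // bool(true)
var_dump(strlen($raw));       // int(32)
```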
Is there a native or inexpensive way to check for the length of a string in bytes in PHP?
See http://bytes.com/topic/php/answers/653733-binary-string-length
Relevant part:
"In PHP, like in C, the string ends with a zero-character, '\0', (char)
0, null-terminator, null-byte or whatever you like to call it."
No, that's not the case - PHP strings are stored with both the length and the
data, unlike C strings that just has one pointer and uses a terminator. They're
"binary-safe" - NUL doesn't terminate the string.
See the definition of zvalue_value in zend.h; the string part has both a "char
*val" and "int len".
Problems would start if you're using the mbstring.func_overload, which changes
how strlen() and the other functions work, and does try and treat strings as
strings of characters in a specific encoding rather than a string of bytes.
This is not the normal PHP behaviour.
The answer is that strlen should return the number of bytes regardless of the content of the string. For multi-byte character strings, you get the wrong number of characters, but the right number of bytes. However, you need to be certain you're not using the mbstring overload, which changes how strlen behaves.
In the event that you have the mbstring overload set, or you are developing for platforms where you are unsure about this setting, you can do the following:
$len=strlen(bin2hex($data))/2;
The reason why this works is that in Hex you are guaranteed to get 2 characters for all bytes that come from bin2hex (it returns two chars even for the initial binary 0).
Note that it will use significantly more resources than a normal strlen, so you should definitely not do that to large amounts of data unless it's absolutely necessary.
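A cheaper alternative, assuming the mbstring extension is loaded, is to ask mb_strlen() for the length in the '8bit' encoding, which counts every byte as one character. The helper name byte_length() below is made up. As far as I know, mbstring.func_overload was removed in PHP 8.0, so on current versions plain strlen() is enough.

```php
<?php
// Hypothetical helper: byte length that does not depend on how strlen()
// itself is configured.
function byte_length(string $data): int
{
    if (function_exists('mb_strlen')) {
        // The '8bit' encoding treats every byte as one character.
        return mb_strlen($data, '8bit');
    }
    return strlen($data);
}

$data = "a\0b\xC3\xA9"; // 'a', a NUL byte, 'b', then "é" as two UTF-8 bytes

var_dump(byte_length($data));                // int(5)
var_dump(intdiv(strlen(bin2hex($data)), 2)); // int(5), the bin2hex approach
```

The embedded NUL byte is counted like any other byte, which is what "binary-safe" means here.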
On php.net, someone was nice enough to create this function. Just multiply by 8 and you've got however many bits were in that string, as the function returns bytes.
The length of a string (textual data) is determined by the position of the NULL character which marks the end.
In case of binary data, NULL can be and often is in the middle of data.
You don't check the length of binary data. You have to know it beforehand. In your case, the length is 16 (bytes, not bits, if it is UUID).
As far as UUID validity is concerned, any 16-byte value is a valid UUID, so you are out of luck there.
Is there any PHP function that encodes a string to an int value, which I can later decode back to a string without any key?
Sure, you can convert strings to numbers and vice versa. Consider:
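$a = "41" + 1;     // the numeric string is converted: $a is int(42)
echo gettype($a);  // integer
$b = "$a";         // interpolation converts it back to a string
echo gettype($b);  // string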
You can also do type casting with settype().
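If I misunderstood you and you want to encode arbitrary strings, consider using base64_encode() and base64_decode(). Note that base_convert() only handles bases 2 to 36, so it cannot turn the Base64 representation into a base 10 integer directly; going through bin2hex() and hexdec() works, but only for strings short enough to fit in an integer.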
An int has 4 or 8 bytes depending on the platform, and each character in a string is one byte (or more depending on encoding). So, you can only encode very small strings to integers, which basically makes the answer to your question: no.
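To show how small "very small" is, here is a sketch that packs the bytes of a string into one integer and unpacks them again. The function names are made up, and the length limit of PHP_INT_SIZE - 1 bytes is my own conservative choice to keep the result positive.

```php
<?php
// Packs the bytes of a short string into a single non-negative integer.
function string_to_int(string $s): int
{
    if (strlen($s) >= PHP_INT_SIZE) {
        throw new InvalidArgumentException("string too long to fit in an int");
    }
    $n = 0;
    for ($i = 0; $i < strlen($s); $i++) {
        $n = ($n << 8) | ord($s[$i]);
    }
    return $n;
}

// Reverses string_to_int(). Leading NUL bytes cannot be recovered,
// because they do not change the integer value.
function int_to_string(int $n): string
{
    $s = '';
    while ($n > 0) {
        $s = chr($n & 0xFF) . $s;
        $n >>= 8;
    }
    return $s;
}

var_dump(string_to_int("abc"));                // int(6382179), i.e. 0x616263
var_dump(int_to_string(string_to_int("abc"))); // string(3) "abc"
```

On a 64-bit build this tops out at seven bytes, which is why the practical answer remains "no" for general strings.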
What do you want to accomplish?
I would suspect not, since there are far more possible strings than there are integers up to PHP_INT_MAX.
Does it have to be an integer?
I'm convinced that what you think you want to do is not really what you want to do. :-) This just sounds like a silly idea. As another user has already asked: what do you need this for? What are your intentions?
Well, now that you mention that digits and the letters a-z are acceptable, I have one suggestion: you could loop through the individual characters and output each one's ordinal value as a two-digit hexadecimal number. You can then convert those hexadecimal numbers back to the ordinal values of the individual characters. I don't know what kind of characters you are about to encode; possibly you will need four hex digits per character (e.g. the string Peter would become 00500065007400650072). Well... have fun with that; I still don't really see the rationale for doing what you're doing.
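Both variants of this suggestion are short to write. In the sketch below, bin2hex() and hex2bin() cover the two-digit form; to_hex4() and from_hex4() are hypothetical helpers for the four-digit form, and they work byte by byte, so they assume single-byte characters.

```php
<?php
// Four hex digits per byte, e.g. "P" (0x50) becomes "0050".
function to_hex4(string $text): string
{
    $out = '';
    for ($i = 0; $i < strlen($text); $i++) {
        $out .= sprintf('%04x', ord($text[$i]));
    }
    return $out;
}

// Reverses to_hex4(); assumes every group of four digits is below 0x100.
function from_hex4(string $hex): string
{
    $out = '';
    for ($i = 0; $i + 4 <= strlen($hex); $i += 4) {
        $out .= chr(hexdec(substr($hex, $i, 4)));
    }
    return $out;
}

var_dump(bin2hex("Peter"));                        // string(10) "5065746572"
var_dump(hex2bin("5065746572"));                   // string(5) "Peter"
var_dump(to_hex4("Peter"));                        // string(20) "00500065007400650072"
var_dump(from_hex4(to_hex4("Peter")) === "Peter"); // bool(true)
```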
There is no function for PHP but I recently wrote a class to encrypt and decrypt a string in PHP. You can look at it at: https://github.com/Lars-/PHP-Security-class