I have two strings with seemingly the same values. One is stored as a key in an array, the other a value in another different array. I compare the two using ==, ===, and strcmp. All treat them as different strings. I do a var_dump and this is what I get.
string(17) "Valentine’s Day"
string(15) "Valentine's Day"
Does anyone have any idea why the first string would be 17 characters and the second 15?
Update: This became slightly more obvious when I pasted it out of my editor, whose font made the two different apostrophes almost indistinguishable.
The first string contains a Unicode character for the apostrophe while the second string just has a regular ASCII ' character.
The Unicode character takes up more space.
If you run the PHP ord() function on each of those characters you'll see that you get different values for each:
echo ord("’"); //226 This is just the first 2 bytes (see comments below for details from ircmaxell)
echo ord("'"); //27
As a complement to @Mark's answer above, which is right (the ’ is a multi-byte character, most probably UTF-8, while ' is not): you can easily convert it to ASCII (or ISO-8859-1) using iconv, for example:
echo iconv('utf-8', 'ascii//TRANSLIT', $str);
Note: Not all characters can be transformed from multi-byte to ASCII or latin1. You can use //IGNORE to have them removed from the resulting string.
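For example, a quick sketch with the string from the question (the exact //TRANSLIT result depends on the iconv implementation; glibc usually maps ’ to a plain apostrophe):
$str = "Valentine’s Day";
echo iconv('utf-8', 'ascii//TRANSLIT', $str); // Valentine's Day
echo iconv('utf-8', 'ascii//IGNORE', $str);   // Valentines Day (the ’ is dropped)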
’ != '
mainly. If you want this not to be an issue, you could do something like this:
if (str_replace('’', '\'', "Valentine’s Day") == "Valentine's Day") { /* the strings now match */ }
Related
Consider the following code:
$str = '';
for ($i = 0x0; $i <= 0x7f; $i++) {
    $str .= chr($i);
}
echo json_encode($str);
The result is:
"\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007\b\t\n\u000b\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&'()*+,-.\/0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
All ASCII characters are there except the last one: 127 (0x7f).
Is there a way to show that character? (for instance: "\u007f")
Delete (DEL) is a control character in the ASCII character set with the coding 0x7f or 127 decimal. This character is saved as ASCII in the JSON string. This character can be made visible by outputting the JSON string in hexadecimal format.
$jsonStr = json_encode(chr(0x7f));
echo bin2hex($jsonStr); //227f22
22 is the encoding for a double quotation mark ("). echo is not suitable for checking what is in a string; it easily leads to misunderstandings. Control characters (including DEL) are displayed only as spaces in the browser. If you look closely at the result of your code example, you will see the space at the end.
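If you want the escaped \u007f form rather than a hex dump of the whole string, one possible sketch is to post-process the JSON and escape the remaining control bytes yourself (json_encode only escapes characters below 0x20 by default):
$jsonStr = json_encode(chr(0x7f));
$visible = preg_replace_callback('/[\x00-\x1f\x7f]/', function ($m) {
    return sprintf('\u%04x', ord($m[0]));
}, $jsonStr);
echo $visible; // "\u007f"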
I am not sure, but I think this is what you are searching for:
U+007F
Also see this answer: Why no symbols defined for ascii values from 127 to 159
floatval('19500.00');
returns 19500;
however
echo floatval('19,500.00');
returns 19;
this could've really given me a big problem, so it was good that I noticed :D ... is there some reason for that behavior, or is it just a bug? ... should all values be number_format()ed before output?
You put that value in single quotes, so it's not treated as a numerical value, but as a string.
Here's php.net's explanation of what happens to strings with floatval (from http://php.net/manual/en/function.floatval.php):
Strings will most likely return 0 although this depends on the
leftmost characters of the string.
Meaning: the leftmost characters of the string in your case are 19 - that's a numeric value again, so the output is 19.
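A few more examples of that "leftmost characters" rule, easy to verify yourself:
echo floatval('19,500.00'); // 19   - parsing stops at the comma
echo floatval('19.5abc');   // 19.5 - trailing non-numeric characters are ignored
echo floatval('abc19.5');   // 0    - no leading numeric characters at all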
Decimal and thousands separator symbols are defined by the current locale used in your script.
You can set default_locale in php.ini globally, or change the locale in your script on the fly: http://php.net/manual/en/function.setlocale.php
Different locales have different separators. To check which separator symbols are currently in use, you can use the localeconv function: http://us2.php.net/manual/en/function.localeconv.php
$test = localeconv();
echo $test['decimal_point'];
#.
echo $test['thousands_sep'];
#,
But in practice nobody can make this function work properly with all these commas and dots, so the only reliable solution is to clean the input, removing everything except digits and "." with a regexp or str_replace:
echo floatval(preg_replace("/[^0-9\.]/", "", '19,500.00'));
#19500
echo floatval(str_replace(",", "", '19,500.00'));
#19500
I have this code:
$string = 'علی';
echo strlen($string);
Since $string has 3 Persian characters, the output should be 3, but I get 6.
علی has 3 characters. Why is my output 6?
How can I use strlen() in PHP for Persian and get the real character count?
Use mb_strlen
Returns the number of characters in string str, using the character encoding given as the second parameter. A multi-byte character is counted as 1.
Since your 3 characters are all multi-byte, you get 6 returned with strlen, but this returns 3 as expected.
echo mb_strlen($string,'utf-8');
Note
It's important not to underestimate the power of this method and of similar alternatives. For example, one could be inclined to say: OK, if the characters are multi-byte, just get the length with strlen and divide it by 2. But that only works if all characters of your string are multi-byte; even a period . will invalidate the count. For example, this
echo mb_strlen('علی.','utf-8');
Returns 4, which is correct. So this function is not simply taking the whole byte length and dividing it by 2; it counts 1 for every multi-byte character and 1 for every single-byte character.
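A quick side-by-side comparison makes the byte-versus-character difference obvious:
$str = 'علی.'; // three 2-byte Arabic letters plus a 1-byte period
echo strlen($str);             // 7 (bytes)
echo mb_strlen($str, 'utf-8'); // 4 (characters)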
Note 2:
It looks like you decided not to use this method because the mbstring extension is not enabled by default for old PHP versions and you might have decided not to try enabling it :) For future readers, though, it is not difficult, and it's advisable to enable it if you are dealing with multi-byte characters, as the length is not the only thing you might need to deal with. See the Manual.
try this:
function ustrlen($text)
{
    // Prefer mb_strlen when the mbstring extension is available.
    if (function_exists('mb_strlen'))
        return mb_strlen($text, 'utf-8');

    // Fallback: split at every UTF-8 character boundary; preg_split
    // yields one leading and one trailing empty element, hence the -2.
    return count(preg_split('//u', $text)) - 2;
}
It will work with any PHP version.
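For instance, a quick sanity check (assuming the function above is defined):
echo ustrlen('علی');  // 3, with or without the mbstring extension
echo ustrlen('abc.'); // 4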
mb_strlen function is your friend
$string = 'علی';
echo mb_strlen($string, 'utf8');
As of PHP5, iconv_strlen() can be used (as described in php.net, it returns the character count of a string, so probably it's the best choice):
iconv_strlen("علی");
// 3
Based on this answer by chernyshevsky@hotmail.com, you can try this:
function string_length (string $string) : int {
    return strlen(utf8_decode($string));
}
string_length("علی");
// 3
Also, as others answered, you can use mb_strlen():
mb_strlen("علی");
// 3
Notes
There is a small difference between them (for invalid Latin characters):
iconv_strlen("a\xCC\r"); // A notice
string_length("a\xCC\r"); // 3
mb_strlen("a\xCC\r"); // 2
Performance: mb_strlen() is the fastest. Overall, there is no performance difference between iconv_strlen() and string_length(). But, amazingly, mb_strlen() is faster than both by about 9 times (as I tested)!
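For reference, a rough micro-benchmark sketch along the lines of that test (illustrative only; the absolute numbers depend on your PHP version and build, and string_length() is the helper defined above):
$input = str_repeat('علی ', 1000);

foreach (array('mb_strlen', 'iconv_strlen', 'string_length') as $fn) {
    $start = microtime(true);
    for ($i = 0; $i < 10000; $i++) {
        $fn($input);
    }
    printf("%-14s %.4f s\n", $fn, microtime(true) - $start);
}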
I want to use str_word_count() on a UTF-8 string.
Is this safe in PHP? It seems to me that it should be (especially considering that there is no mb_str_word_count()).
But on php.net there are a lot of people muddying the water by presenting their own 'multibyte compatible' versions of the function.
So I guess I want to know...
Given that str_word_count simply counts all character sequences delimited by " " (space), it should be safe on multibyte strings, even though it's not necessarily aware of the character sequences, right?
Are there any equivalent 'space' characters in UTF-8 which are not ASCII " " (space)?
This is where the problem might lie I guess.
I'd say you guess right. And indeed there are space characters in UTF-8 which are not part of US-ASCII. To give you an example of such spaces:
Unicode Character 'NO-BREAK SPACE' (U+00A0): 2 Bytes in UTF-8: 0xC2 0xA0 (c2a0)
And perhaps as well:
Unicode Character 'NEXT LINE (NEL)' (U+0085): 2 Bytes in UTF-8: 0xC2 0x85 (c285)
Unicode Character 'LINE SEPARATOR' (U+2028): 3 Bytes in UTF-8: 0xE2 0x80 0xA8 (e280a8)
Unicode Character 'PARAGRAPH SEPARATOR' (U+2029): 3 Bytes in UTF-8: 0xE2 0x80 0xA9 (e280a9)
Anyway, the first one - the 'NO-BREAK SPACE' (U+00A0) - is a good example as it is also part of Latin-X charsets. And the PHP manual already provides a hint that str_word_count would be locale dependent.
If we want to put this to a test, we can set the locale to UTF-8, pass in an invalid string containing a \xA0 sequence, and see: if this still counts as a word-breaking character, the function is clearly not UTF-8 safe, hence not multibyte safe (in the loosely defined sense of the question):
<?php
/**
* is PHP str_word_count() multibyte safe?
* @link https://stackoverflow.com/q/8290537/367456
*/
echo 'New Locale: ', setlocale(LC_ALL, 'en_US.utf8'), "\n\n";
$test = "aword\xA0bword aword";
$result = str_word_count($test, 2);
var_dump($result);
Output:
New Locale: en_US.utf8
array(3) {
[0]=>
string(5) "aword"
[6]=>
string(5) "bword"
[12]=>
string(5) "aword"
}
As this demo shows, the function totally fails on the locale promise it gives on the manual page (I neither wonder nor moan about this; most often, if you read that a function is locale specific in PHP, run for your life and find one that is not), which I exploit here to demonstrate that it does nothing at all with respect to the UTF-8 character encoding.
Instead for UTF-8 you should take a look into the PCRE extension:
Matching Unicode letter characters in PCRE/PHP
PCRE has a good understanding of Unicode and UTF-8 in PHP in specific. It can also be quite fast if you craft the regular expression pattern carefully.
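As a minimal illustration of what the u modifier and Unicode property classes give you (a hypothetical sample string; the counts are easy to verify):
preg_match_all('~\p{L}+~u', "Valentine’s Day – علی", $matches);
print_r($matches[0]); // Valentine, s, Day, علی - four runs of Unicode letters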
About the "template answer" - I don't get the demand "working faster". We're not talking about long times or lot of counts here, so who cares if it takes some milliseconds longer or not?
However, here is a str_word_count that works with the soft hyphen:
function my_word_count($str) {
    return str_word_count(str_replace("\xC2\xAD", '', $str));
}
and here is a function that complies with the asserts (but is probably not faster than str_word_count):
function my_word_count($str) {
    $mystr = str_replace("\xC2\xAD", '', $str);        // soft hyphen encoded in UTF-8
    return preg_match_all('~[\p{L}\'\-]+~u', $mystr);  // regex expecting UTF-8
}
The preg function is essentially the same as what's already proposed, except that a) it already returns a count, so there is no need to supply matches, which should make it faster, and b) there really should not be an iconv fallback, IMO.
About a comment:
I can see that your PCRE function is worse (performance-wise) than my
preg_word_count() because it needs a str_replace that mine does not:
'~[^\p{L}\'-\xC2\xAD]+~u' works fine (!).
I considered that a different thing: str_replace will only remove the exact multi-byte sequence, but your regex will treat \xC2 and \xAD individually, in any order they might appear, which is wrong. Consider the registered sign, which is \xC2\xAE.
However, now that I think about it, due to the way valid UTF-8 works it wouldn't really matter, so that should be equally usable. So we can just have the function
function my_word_count($str) {
    return preg_match_all('~[\p{L}\'\-\xC2\xAD]+~u', $str); // regex expecting UTF-8
}
without any need for matches or other replacements.
About str_word_count(str_replace("\xC2\xAD",'', $str)); - if it is stable
with UTF-8, it is good, but it seems it is not.
If you read this thread, you'll know str_replace is safe if you stick to valid UTF-8 strings. I didn't see any evidence in your link of the contrary.
EDITED (to show new clues): there is a possible solution using str_word_count() with PHP v5.1!
function my_word_count($str, $myLangChars="àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ") {
return str_word_count($str, 0, $myLangChars);
}
but it is not 100%, because when I tried to add \xC2\xAD (the SHY - SOFT HYPHEN character) to $myLangChars, which should be a word component in any language, it did not work.
Another solution, not as fast but complete and flexible (extracted from here), is based on the PCRE library, with an option to mimic the str_word_count() behaviour on non-valid UTF-8:
/**
 * Like str_word_count() but showing how preg can do the same.
 * This function is most flexible but not faster than str_word_count.
 * @param $s the input string.
 * @param $wRgx the "word regular expression" as defined by user.
 * @param $triggError changes behaviour causing error event.
 * @param $OnBadUtfTryAgain when true mimic the str_word_count behaviour.
 * @return 0 or positive integer as word-count, negative as PCRE error.
 */
function preg_word_count($s, $wRgx='/[-\'\p{L}\xC2\xAD]+/u', $triggError=true,
                         $OnBadUtfTryAgain=true) {
    if (preg_match_all($wRgx, $s, $m) !== false)
        return count($m[0]);
    else {
        $lastError = preg_last_error();
        $chkUtf8 = ($lastError == PREG_BAD_UTF8_ERROR);
        if ($OnBadUtfTryAgain && $chkUtf8)
            return preg_word_count(
                iconv('CP1252', 'UTF-8', $s), $wRgx, $triggError, false
            );
        elseif ($triggError) trigger_error(
            $chkUtf8 ? 'non-UTF8 input!' : "error PCRE_code-$lastError",
            E_USER_NOTICE
        );
        return -$lastError;
    }
}
(TEMPLATE ANSWER) help for bounty!
(this is not an answer; it is help for the bounty, because I can not edit nor duplicate the question)
We want to count "real-world words" in a UTF-8 Latin text.
FOR BOUNTY, WE NEED:
a function that complies with the asserts below and is faster than str_word_count;
or str_word_count working with the SHY character (how?);
or preg_word_count working faster (using preg_replace? a word-separator regular expression?).
ASSERTS
Suppose that a "multibyte safe" function my_word_count() exists; then the following asserts must be true:
assert_options(ASSERT_ACTIVE, 1);
$text = "1,2,3,4=0 (1 2 3 4)=0 (... ,.)=0 (2.5±0.1; 0.5±0.2)=0";
assert( my_word_count($text)==0 ); // no word there
$text = "(one two,three;four)=4 (five-six se\xC2\xADven)=2";
assert( my_word_count($text)==6 ); // hyphen merges two words
$text = "(um±dois três)=3 (àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ)=1";
assert( my_word_count($text)==4 ); // a UTF8 case
$text = "(ÍSÔ9000-X, ISÔ 9000-X, ÍSÔ-9000-X)=6"; //Codes are words?
assert( my_word_count($text)==6 ); // suppose no: X is another word
All it does is count the number of spaces, or rather the words in between them. If you're curious, you can just make your own counting function using explode and count.
Any time the ASCII space byte is found, it splits, and that's all there really is to it.
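A minimal sketch of what such a hand-rolled counter could look like (naive_word_count is just an illustrative name; it splits only on the ASCII space byte, exactly as described above):
function naive_word_count($str) {
    // Split on the ASCII space byte and drop empty pieces caused by
    // leading, trailing, or repeated spaces.
    return count(array_filter(explode(' ', $str), 'strlen'));
}

echo naive_word_count("Valentine’s Day is fun"); // 4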
I would like to write an (HTML) parser based on a state machine, but I have doubts about how to actually read/use the input. I decided to load the whole input into one string and then work with it as with an array, holding an index as the current parsing position.
There would be no problem with a single-byte encoding, but in a multi-byte encoding each index does not address a character, but one byte of a character.
Example:
$mb_string = 'žščř'; //4 multi-byte characters in UTF-8
for ($i = 0; $i < 4; $i++)
{
    echo $mb_string[$i], PHP_EOL;
}
Outputs:
Ĺ
ž
Ĺ
Ą
This means I cannot iterate through the string in a loop to check single characters, because I never know whether I am in the middle of a character or not.
So the questions are:
How do I read a single character from a string in a multi-byte safe and performance-friendly way?
Is it a good idea to work with the string as if it were an array in this case?
How would you read the input?
http://php.net/mb_string is the thing you're looking for -
just mb_substr characters one by one.
Not until PHP 6.
What input exactly? The usual way, in general.
mb_internal_encoding("UTF-8");
$mb_string = 'žščř';
$l = mb_strlen($mb_string);
for ($i = 0; $i < $l; $i++) {
    print(mb_substr($mb_string, $i, 1) . "<br/>");
}
Without using the mb_* functions, and with multi-byte encoded strings, you can use the standard substring functions and read in multiples of the bytes used for encoding.
For example, for a UTF-8 encoded string whose characters each take 2 bytes, if you need the first character from the string
$string = 'žščř'; //4 multi-byte characters in UTF-8
you have to get both the $string[0] AND $string[1] values, so you are actually looking for the substring between byte indexes 0 and 1 (for the first character).
Note that $string[0] or $string[N] will reference the first (or Nth) byte of the multi-byte string.
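A short sketch of both approaches, assuming (as above) a string whose characters are all 2 bytes wide; mb_substr works regardless of the byte width:
$string = 'žščř'; // each character happens to be 2 bytes in UTF-8

echo $string[0] . $string[1];           // ž - first character as two raw bytes
echo substr($string, 0, 2);             // ž - same thing via substr
echo mb_substr($string, 0, 1, 'UTF-8'); // ž - encoding-aware, no byte math needed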