I want to check the size of a string that can contain any type of data.
I have checked strlen and mb_strlen but I am unsure about the differences relating to different data contents.
Some background: what I ultimately need to do is cut the string into chunks so that I can serialize it, store it in pieces, and restore it afterwards. Chunks always have the same size (32 KB) and contain a serialized object holding some data plus the part of the string that I cut, so I need the exact size of the string to do that.
From PHP's manual:
Note:
strlen() returns the number of bytes rather than the number of characters in a string.
By contrast, mb_strlen() takes the character encoding into account and returns the number of characters as defined by that encoding. For multibyte or variable-width encodings, strlen() returns a value greater than or equal to what mb_strlen() returns.
mb_strlen() fails if the encoding name you pass is not one it recognizes: it returns FALSE before PHP 8.0 and throws a ValueError from 8.0 on.
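For the chunking use case in the question, the byte count is what matters. A minimal sketch, assuming a hypothetical per-chunk payload budget $payloadSize (in practice the 32 KB chunk minus whatever the serialized wrapper needs; it is tiny here so the effect is visible):

```php
<?php
// "café" with the é written as its two UTF-8 bytes, so the example
// does not depend on how this file is saved.
$text = "caf\xC3\xA9";

$byteCount = strlen($text);             // 5 bytes
$charCount = mb_strlen($text, 'UTF-8'); // 4 characters

// Hypothetical payload budget per chunk, in bytes.
$payloadSize = 2;

// str_split() cuts on byte boundaries. A chunk may end in the middle of a
// character, which is harmless as long as chunks are only stored and rejoined.
$chunks = str_split($text, $payloadSize);
$restored = implode('', $chunks);

var_dump($byteCount, $charCount, count($chunks), $restored === $text);
```

Individual chunks are not guaranteed to be valid UTF-8 on their own, so they should be treated as binary data until they are joined again.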
I'm trying to convert a string to UTF8, on both obj-c and php.
I get different results:
"\xd7\x91\xd7\x93\xd7\x99\xd7\xa7\xd7\x94" //Obj-C
"\u05d1\u05d3\u05d9\u05e7\u05d4" //PHP
Obj-C code:
const char *cData = [@"בדיקה" cStringUsingEncoding:NSUTF8StringEncoding];
PHP code:
utf8_encode('בדיקה')
This difference breaks my hash algorithm that follows.
How can I make the two strings encoded the same way? Should I change the Obj-C or the PHP?
Go to http://www.utf8-chartable.de/unicode-utf8-table.pl
In the combo box switch to “U+0590 … U+05FF Hebrew”
Scroll down to “U+05D1” which is the rightmost character of your input string.
The third column shows the two UTF-8 bytes: “d7 91”
If you keep looking you will see that the PHP and the Objective-C strings are actually the same. The “problem” you are seeing is that PHP shows a Unicode escape (\u) while Objective-C shows direct hexadecimal byte escapes (\x). Those are only visual representations of the strings; the bytes in memory are the same.
If your hash algorithm deals with bytes correctly, you should not see differences.
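One way to check this from PHP is to compare the raw bytes. The sketch below assumes the \u form came from something like JSON output (the question does not say), so it decodes that form with json_decode() and compares it with the \x form:

```php
<?php
// The byte-escaped form shown for Objective-C.
$fromObjC = "\xd7\x91\xd7\x93\xd7\x99\xd7\xa7\xd7\x94";

// The \u form shown for PHP, decoded as a JSON string literal.
// Single quotes keep the backslashes literal for json_decode().
$fromPhp = json_decode('"\u05d1\u05d3\u05d9\u05e7\u05d4"');

echo bin2hex($fromObjC), "\n";
echo bin2hex($fromPhp), "\n";

// Identical bytes give identical hashes.
var_dump($fromObjC === $fromPhp);
var_dump(md5($fromObjC) === md5($fromPhp));
```

If the hashes still differ in the real application, the place to look is whatever produces the escaped representation, not the UTF-8 encoding itself.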
What are you using to do the encoding on PHP? It looks like you're generating a UTF-16 string.
Try utf8_encode() and see if that gives better results.
I am working on a PHP-based MIME parser. If the body contains a string like Iñtërnâtiônàlizætiøn, we see it getting converted into Iñtërnâtiônà lizætiøn. Can somebody suggest how (with which functions) to handle such strings?
So we are doing the following
Using Zend Library connecting to the IMAP server
$mail = new Zend_Mail_Storage_Imap($params);
Read the message using
$message = $mail->getMessage($i);
in the loop.
When we print the $message we see the string e.g. Iñtërnâtiônàlizætiøn printed as Iñtërnâtiônà lizætiøn.
What I need to know is whether there is some way to retain the original string. This is just one example; we may run into other multibyte characters, so we want to know how to handle this generically.
There's no specific function for that, you simply need to treat the string in the encoding it's in. A string is just a blob of bytes, it gets turned into characters by whatever is interpreting those bytes as text. And that something needs to use the correct encoding for that, otherwise those bytes are not interpreted as the characters they were supposed to be. See Handling Unicode Front To Back In A Web App for a rundown of the common pitfalls.
As mentioned in the comments, you can use PHP's mb_* functions to work with multibyte characters. Here is an example that detects the encoding of a string:
$s="Iñtërnâtiônàlizætiøn";
echo mb_detect_encoding($s); //UTF-8
You can then work with that: use utf8_decode($s) (which converts UTF-8 to ISO-8859-1) or one of the mb_* functions, such as mb_convert_encoding(), to convert the string to the encoding you want.
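To make the "wrong encoding" effect concrete, here is a small sketch. It assumes the mail body is UTF-8 and that something along the way treated it as ISO-8859-1; that is a guess about the cause, not something the question confirms. Note also that mb_detect_encoding() only makes an educated guess, so a charset declared in the message headers is more reliable when available.

```php
<?php
// "ñ" as UTF-8 bytes, written with escapes so the file encoding is irrelevant.
$utf8 = "\xC3\xB1";

// Validate rather than guess: is this byte sequence well-formed UTF-8?
$isUtf8 = mb_check_encoding($utf8, 'UTF-8');

// Misreading UTF-8 bytes as ISO-8859-1 and converting again double-encodes
// them, which is the classic source of "Ã±"-style garbage.
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// The correct direction: a real ISO-8859-1 "ñ" (single byte F1) to UTF-8.
$latin1 = "\xF1";
$converted = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump($isUtf8, bin2hex($doubleEncoded), $converted === $utf8);
```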
At the moment, I don't understand why it is so important to use the mbstring functions in PHP when dealing with UTF-8. My locale under Linux is already set to UTF-8, so why don't functions like strlen, preg_replace, and so on work properly by default?
None of PHP's standard string functions handle multibyte strings, regardless of your operating system's locale. That is why you need to use the multibyte string functions.
From the Multibyte String Introduction:
When you manipulate (trim, split, splice, etc.) strings encoded in a
multibyte encoding, you need to use special functions since two or
more consecutive bytes may represent a single character in such
encoding schemes. Otherwise, if you apply a non-multibyte-aware string
function to the string, it probably fails to detect the beginning or
ending of the multibyte character and ends up with a corrupted garbage
string that most likely loses its original meaning.
Here is my answer in plain English.
A single Japanese, Chinese, or Korean character takes more than one byte. For example, a typical English character such as x takes 1 byte, whereas a Japanese, Chinese, or Korean character takes more than 1 byte. PHP's standard string functions treat every byte as one character, so if you use them to compare or measure Japanese, Chinese, or Korean text, they will not work as expected. For example, "Hello World!" is 12 bytes in English, but a 12-character string written in Japanese, Chinese, or Korean occupies more than 12 bytes.
Read http://www.php.net/manual/en/intro.mbstring.php
You do not need to use UTF-8 aware code to process UTF-8. For the most part.
I've even written a Unicode uppercaser/lowercaser, and NFC and NFD transforms, using only byte-aware functions. It's hard to think of anything more complicated than that, that needs such delicate and detailed treatment of UTF-8. And yet it still works with byte-only functions.
It's very rare that you need UTF-8 aware code. Maybe to count the number of characters, or to move an insertion point forward by 1 character. But actually, even then your code won't work ;) because of decomposed characters.
But if all you are doing is replacements, finding stuff, or even parsing syntax, you just need byte-aware functions.
I'll explain why.
It's because no UTF-8 character can be found inside any other UTF-8 character. That's how it is designed.
Try to explain to me how you can get text processing errors, in terms of a multi-byte system where no character can be found inside another character? Just one example case! The simplest you can think of.
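The claim can be tried out with plain byte-oriented functions. The sketch below uses "naïve café" written with \u{...} escapes (PHP 7+). It only covers searching for and replacing whole, valid UTF-8 sequences; it says nothing about cutting a string at an arbitrary byte offset.

```php
<?php
$text = "na\u{EF}ve caf\u{E9}"; // "naïve café" as UTF-8
$needle = "\u{E9}";             // "é" is the two bytes C3 A9

// Byte-aware search still finds the character; the offset is in bytes.
$bytePos = strpos($text, $needle);
$charPos = mb_strpos($text, $needle, 0, 'UTF-8');

// Byte-aware replacement and splitting leave the rest of the text intact.
$replaced = str_replace($needle, 'e', $text);
$words = explode(' ', $text);

var_dump($bytePos, $charPos, $replaced, $words);
var_dump(mb_check_encoding($replaced, 'UTF-8'));
```

The only visible difference is the unit of the offsets: strpos() reports bytes while mb_strpos() reports characters, so offsets from one family should not be fed to the other.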
PHP strings are just plain byte sequences. They have no meaning by themselves. And they do not use any particular character encoding either.
So if you read a file using file_get_contents() you get a binary-safe representation of the file. May it be the (binary) representation of an image or a human-readable text file - PHP doesn't care.
Now, as long as you just need to do basic processing of the string, you do not need to know the character encoding at all. So if you want to store the string back into a file using file_put_contents() or want to get its length (not the number of characters) using strlen(), you're fine.
However, as soon as you start doing more fancy string manipulation, you need to know the character encoding! There is no way to store it as part of the string, so you either have to track it separately, or, what most people do, use the convention of having all (text) strings in a common character encoding, like US-ASCII or nowadays UTF-8.
So because there is no way to set a character encoding for a string, PHP has no idea which character encoding the string is using. Due to that, the only sane thing for strlen() to do is to return the number of bytes, as this is the only thing PHP knows for sure.
If you provide the additional information of the used character encoding, you need to use another function - the function is called mb_strlen() in this case.
The same applies to preg_replace(): If you want to replace umlaut-a, or match three identical characters in a row, you need to know how umlaut-a is encoded, and in general, how characters are encoded.
So if you have a hypothetical character encoding, which encodes a lower-case a as a1 and an upper-case A as a2, a b as b1 and B as b2 (and so on), you can have an (encoded) string a1a1a1 which consists of three identical characters in a row. However, without knowing the encoding and by just looking at the byte sequence, there is no way to detect this.
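With preg_* functions, the way to supply the encoding is the u pattern modifier, which tells PCRE to treat both pattern and subject as UTF-8. A short sketch of the umlaut-a and "three identical characters" cases:

```php
<?php
$a = "\u{E4}";                   // "ä" is the two bytes C3 A4
$text = 'x' . $a . $a . $a . 'y';

// Without /u, "." matches one byte, so a two-byte character is not "one thing".
$byteMode = preg_match('/^.$/', $a);
// With /u, "." matches one UTF-8 character.
$utf8Mode = preg_match('/^.$/u', $a);

// Three identical characters in a row: only detectable in UTF-8 mode here,
// because no single byte repeats three times in a row.
$tripleBytes = preg_match('/(.)\1{2}/', $text);
$tripleChars = preg_match('/(.)\1{2}/u', $text, $m);

$replaced = preg_replace('/' . $a . '/u', 'ae', $text);

var_dump($byteMode, $utf8Mode, $tripleBytes, $tripleChars, $replaced);
```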
Summary:
No sane 'default' is possible, because PHP strings do not carry their character encoding. And even if they did, a single function like strlen() could not return both the length of the byte sequence, as required for the Content-Length HTTP header, and the number of characters, as is useful for stating the length of a blog article.
That's why the function overloading feature of mbstring (mbstring.func_overload) is inherently broken: it looks nice at first, but it will break your code in hard-to-debug ways.
multibyte => multi + byte.
1) The mbstring functions are for working with strings in languages other than English.
2) PHP's default string functions only work properly with English (or closely related) text.
3) If you want to use strlen(), strpos(), strtoupper(), or str_replace() on such characters, use the mbstring counterpart where one exists (for example mb_strlen(), mb_strpos(), mb_strtoupper()).
Suppose we need to apply string functions to "Hello" in Chinese (你好), Arabic (مرحبا), Japanese (こんにちは), Hindi (नमस्ते), or Gujarati (હેલો).
Each language can have its own character set, which is why mbstring was introduced: to handle text in languages such as Chinese and Japanese.
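A small illustration of the difference, using the Chinese "Hello" from above and the French word "été" as sample inputs (written with \u{...} escapes, PHP 7+, so the result does not depend on how the file is saved):

```php
<?php
$hello = "\u{4F60}\u{597D}";     // 你好

$bytes = strlen($hello);              // counts bytes
$chars = mb_strlen($hello, 'UTF-8');  // counts characters

$word = "\u{E9}t\u{E9}";         // "été"
// mb_strtoupper() knows that é has an upper-case form.
$upper = mb_strtoupper($word, 'UTF-8');
// strtoupper() is byte-oriented; on PHP 8 it only touches ASCII letters.
$asciiOnly = strtoupper($word);

var_dump($bytes, $chars, $upper, $asciiOnly);
```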
Raul González is a perfect example of why:
The task is to shorten overly long user names for a MySQL database. Say we have a 10-character limit and the name Raul González.
The unit test below shows how you can get an error like this
General error: 1366 Incorrect string value: '\xC3' for column 'name' at row 1 (SQL: update users set name = Raul Gonz▒, updated_at = 2019-03-04 04:28:46 where id = 793)
and how you can avoid it
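public function test_substr(): void
{
    $name = 'Raul González';
    $user = factory(User::class)->create(['name' => $name]);

    // substr() counts bytes: the cut lands inside the two-byte "á",
    // leaving a dangling \xC3 that MySQL rejects.
    try {
        $name1 = substr($name, 0, 10);
        $user->name = $name1;
        $user->save();
    } catch (Exception $ex) {
        // Expected: "Incorrect string value" from the database.
    }
    $this->assertTrue(isset($ex));

    // mb_substr() counts characters, so the result is valid UTF-8.
    $name2 = mb_substr($name, 0, 10);
    $user->name = $name2;
    $user->save();
    $this->assertTrue(true);
}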
Laravel and PHPUnit were used for illustration.
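The same effect can be reproduced without a database or framework. This is a standalone sketch; mb_strcut() is included as an option for cases where the limit is in bytes rather than characters, which goes beyond what the test above covers.

```php
<?php
$name = "Raul Gonz\u{E1}lez";    // "Raul González" as UTF-8

// Byte-based cut: byte 10 is the first half of "á".
$byteCut = substr($name, 0, 10);
// Character-based cut: ten whole characters.
$charCut = mb_substr($name, 0, 10, 'UTF-8');
// Byte budget, but never splitting a character.
$safeByteCut = mb_strcut($name, 0, 10, 'UTF-8');

var_dump(mb_check_encoding($byteCut, 'UTF-8'));     // malformed
var_dump($charCut);
var_dump(mb_check_encoding($safeByteCut, 'UTF-8'), strlen($safeByteCut));
```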
I need to encrypt a string using MySQL's AES_ENCRYPT function, then attach that encrypted string to the end of a URL, such that it can then be decrypted and used by a PHP script on the other end.
Basically, I am encrypting the string (using MySQL's AES_ENCRYPT), I am then using PHP's rawurlencode() function to make it "URL safe". I then pass the encrypted string in a URL, which is then retrieved by the PHP script on the other end where it gets successfully decrypted... about 95% of the time.
Seems as though about 5% of strings are encrypting in such a way that they are getting corrupted somewhere in the process, and can't be decoded on the other end after being passed by a URL. Can anyone help me out here? Is there a 100% fool-proof way to do this? I have also tried using urlencode() as well as base64_encode() in varying combinations.
Thanks.
Solved.
Once I have encrypted the string using MySQL's AES_ENCRYPT function, I use PHP's bin2hex() function to convert the encrypted data (which is in binary form) into hexadecimal. I then pass the hexadecimal as a string on the end of the URL. Once the URL is received on the other end, I use this custom PHP function to revert the hex string back to binary:
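// PHP 5.4+ ships a built-in hex2bin(); only define a fallback on older versions.
if (!function_exists('hex2bin')) {
    function hex2bin($data) {
        // "H*" converts the whole hex string, high nibble first.
        return pack("H*", $data);
    }
}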
From there, all that's left to do is decrypt the data using MySQL's AES_DECRYPT function, and voilà: the original string is successfully restored.
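A round trip of this approach can be sketched without MySQL. Here random_bytes() stands in for the AES_ENCRYPT output, and the URL and the parameter name token are made up for illustration. A URL-safe Base64 variant is shown as a shorter alternative; it is not part of the original solution.

```php
<?php
// Stand-in for the binary ciphertext returned by AES_ENCRYPT (assumption).
$ciphertext = random_bytes(32);

// Hex uses only 0-9 and a-f, so nothing in it needs URL escaping.
$hex = bin2hex($ciphertext);
$url = 'https://example.com/receive.php?token=' . $hex; // hypothetical URL

// Receiving side: pull the parameter back out and convert to binary.
parse_str(parse_url($url, PHP_URL_QUERY), $query);
$fromHex = hex2bin($query['token']);

// Alternative: Base64 with the URL-hostile characters swapped out.
$b64url = rtrim(strtr(base64_encode($ciphertext), '+/', '-_'), '=');
$fromB64 = base64_decode(strtr($b64url, '-_', '+/'));

var_dump($fromHex === $ciphertext, $fromB64 === $ciphertext);
```

Plain base64_encode() output can contain "+", "/" and "=", and a "+" in a query string is decoded as a space, which is one plausible reason the earlier attempts failed for a fraction of inputs.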
URLs have a finite maximum length. AES-encrypted strings do not.
URLs are not an appropriate vector for passing arbitrary information. Using an HTTP POST is a much better way, if you must communicate over HTTP.
About why you are having problems: quoting from the PHP manual page on urlencode:
Note: Be careful about variables that
may match HTML entities. Things like
&amp, &copy and &pound are parsed by
the browser and the actual entity is
used instead of the desired variable
name. This is an obvious hassle that
the W3C has been telling people about
for years. The reference is here:
http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.2.
PHP supports changing the argument
separator to the W3C-suggested
semi-colon through the arg_separator
.ini directive. Unfortunately most
user agents do not send form data in
this semi-colon separated format. A
more portable way around this is to
use &amp; instead of & as the
separator. You don't need to change
PHP's arg_separator for this. Leave it
as &, but simply encode your URLs
using htmlentities() or
htmlspecialchars().
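What the quoted note describes can be sketched as follows. The parameter names and view.php are invented for the example; "copy" is chosen deliberately because "&copy" is the kind of sequence a browser may read as an entity.

```php
<?php
// Hypothetical query parameters.
$params = ['id' => 42, 'copy' => 'yes'];

// http_build_query() URL-encodes values and joins them with "&" by default.
$query = http_build_query($params);

// When the URL is written into HTML, escape it so "&copy" cannot be
// mistaken for the copyright entity.
$href = htmlspecialchars('view.php?' . $query, ENT_QUOTES, 'UTF-8');

echo '<a href="' . $href . '">link</a>', "\n";
```

This escaping only applies when the URL is embedded in HTML; a URL used in a redirect header or an HTTP client call keeps the plain "&".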