This question already has answers here:
Enclosing the string with double quotes
(3 answers)
Closed 8 years ago.
I am trying to build a function (unless there is already one, I was not able to find one) that satisfies:
being saved in a MySQL database → mysqli_real_escape_string
being saved in a serialized array in a MySQL database (I had issues when unserialize failed)
as for output:
doesn't interfer with HTML → utf8_encode(htmlentities($source, ENT_QUOTES | ENT_HTML401, 'UTF-8'));
doesn't interfer with it being a query in an URL, thus encoding the '&','%'
Please give me any advice if there is an idea on how to improve secure encoding.
And I am not sure about the functions give, whether they are the best to be used.
I also had issues with non-printable characters and tried
PHP: How to remove all non printable characters in a string?
$s = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x80-\x9F]/u', '', $s);
EDIT
Because of the diversity of this question, I want to substantiate the question on how to clean a string that is an element of an array that is put with serialize() in a database ´?
For instance, I had a failure when trying to unserialize after having put a string containing a newline (\n or \r) into an string element of an array that has been serialized successfully...
EDIT_2
The reason for why I have tried to issue encoding HTML entities before saving them into the DB using mysqli_real_escape_string() is that when recalling/loading this object from the DB, the data has changed. For example a user wants to put the string test'test into the database that is encoded by mysqli_real_escape_string() to test\'test and then when loaded from the DB it's still test\'test whcih is NOT what the user wants to have neither what he has sent . Please if you could find a solution for this -- mine was to apply sth. like where mysqli_real_escape_string() had no effect as the quotes have already been HTML encoded.
From the top of my head, I feel you should try json_encode and json_decode
This question already has answers here:
htmlentities() vs. htmlspecialchars()
(12 answers)
Closed 8 years ago.
I have read their documentation, but I still don't get when to use each of them and their difference.
Let's consider the situation of having a general string in a variable and needing to echo it inside HTML code. If it has any HTML markup in it, I want it converted to HTML code (< replaced by <, & replaced by &. If it has UTF special chars that aren't available in HTML code, it's replaced by HTML number (• replaced by •).
What's the best function for that?
A harder need: unprintable chars, like \n, char(10), char(13), etc, be replaced by their number code, in the case the string is printed inside <pre> or any special textarea so that the string be dumped.
htmlentities is a workaround for not having set the character type of the document properly. htmlspecialchars is the correct function to use for merely writing text into an HTML document.
As to your second question, I think you're looking for addcslashes.
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
PHP: replace invalid characters in utf-8 string in
I have a string that has an invalid character in it (it's not UTF-8) such as the following displaying SUB:
I think it's some kind of foreign invalid character.
Is there a way in PHP to take a string and use preg_replace or something else to ensure that I am only using valid UTF-8 characters in my strings, and anything else just gets removed?
Thanks.
First of all, there is no invalid UTF-8 characters. There are invalid UTF-8 bytes and byte sequences, which means someone is trying to pull off an encoding attack on your server. These can be validated with mb_check_encoding on the coming input data, and immediately failing with 400 Bad Request if you don't get valid UTF-8.
What you have is just the SUBSTITUTE control character, a valid character but unprintable.
Originally intended for use as a transmission control character to
indicate that garbled or invalid characters had been received. It has
often been put to use for other purposes when the in-band signaling of
errors it provides is unneeded, especially where robust methods of
error detection and correction are used, or where errors are expected
to be rare enough to make using the character for other purposes
advisable.
You can use this regex to get rid of it (and a few others):
$reg = '/(?![\r\n\t])[\p{Cc}]/u';
preg_replace( $reg, "", $str );
The mb_check_encoding function should be able to do this.
mb_check_encoding("Jetzt gibts mehr Kanonen", "UTF-8");
Note: I haven't tested this.
At the moment, I don't understand why it is really important to use mbstring functions in PHP when dealing with UTF-8? My locale under linux is already set to UTF-8, so why doesn't functions like strlen, preg_replace and so on don't work properly by default?
All of the PHP string functions do not handle multibyte strings regardless of your operating system's locale. That is why you need to use the multibyte string functions.
From the Multibyte String Introduction:
When you manipulate (trim, split, splice, etc.) strings encoded in a
multibyte encoding, you need to use special functions since two or
more consecutive bytes may represent a single character in such
encoding schemes. Otherwise, if you apply a non-multibyte-aware string
function to the string, it probably fails to detect the beginning or
ending of the multibyte character and ends up with a corrupted garbage
string that most likely loses its original meaning.
Here is my answer in plain English.
A single Japanese and Chinese and Korean character take more than a single byte. Eg., a typical charactert say x is takes 1 byte in English it will take more than 1 byte in Japanese and Chinese and Korean. Now PHP's standard string functions are meant to treat a single character as 1 byte. So in case you are trying to do compare two Japanese or Chinese or Korean characters they will not work as expected. For example the length of "Hello World!" in Japanese or Chinese or Korean will have more than 12 bytes.
Read http://www.php.net/manual/en/intro.mbstring.php
You do not need to use UTF-8 aware code to process UTF-8. For the most part.
I've even written a Unicode uppercaser/lowercaser, and NFC and NFD transforms, using only byte-aware functions. It's hard to think of anything more complicated than that, that needs such delicate and detailed treatment of UTF-8. And yet it still works with byte-only functions.
It's very rare that you need UTF-8 aware code. Maybe to count the number of characters, or to move an insertion point forward by 1 character. But actually, even then your code won't work ;) because of decomposed characters.
But if all you are doing is replacements, finding stuff, or even parsing syntax, you just need byte-aware functions.
I'll explain why.
It's because no UTF-8 character can be found inside any other UTF-8 character. That's how it is designed.
Try to explain to me how you can get text processing errors, in terms of a multi-byte system where no character can be found inside another character? Just one example case! The simplest you can think of.
PHP strings are just plain byte sequences. They have no meaning by themselves. And they do not use any particular character encoding either.
So if you read a file using file_get_contents() you get a binary-safe representation of the file. May it be the (binary) representation of an image or a human-readable text file - PHP doesn't care.
Now, as long as you just need to do basic processing of the string, you do not need to know the character encoding at all. So if you want to store the string back into a file using file_put_contents() or want to get its length (not the number of characters) using strlen(), you're fine.
However, as soon as you start doing more fancy string manipulation, you need to know the character encoding! There is no way to store it as part of the string, so you either have to track it separately, or, what most people do, use the convention of having all (text) strings in a common character encoding, like US-ASCII or nowadays UTF-8.
So because there is no way to set a character encoding for a string, PHP has no idea which character encoding the string is using. Due to that, the only sane thing for strlen() to do is to return the number of bytes, as this is the only thing PHP knows for sure.
If you provide the additional information of the used character encoding, you need to use another function - the function is called mb_strlen() in this case.
The same applies to preg_replace(): If you want to replace umlaut-a, or match three identical characters in a row, you need to know how umlaut-a is encoded, and in general, how characters are encoded.
So if you have a hypothetical character encoding, which encodes a lower-case a as a1 and an upper-case A as a2, a b as b1 and B as b2 (and so on), you can have an (encoded) string a1a1a1 which consists of three identical characters in a row. However, without knowing the encoding and by just looking at the byte sequence, there is no way to detect this.
Summary:
No sane 'default' is possible as PHP strings do not contain the character encoding. And even if, a single function like strlen() cannot return the length of the byte sequence as required for Content-Length HTTP header and at the same time the number of characters as useful to denote the length of a blog article.
That's why the Function Overloading Feature is inherently broken and even if it looks nice at first, will break your code in a hard-to-debug way.
multibyte => multi + byte.
1) It is use to work with string which is in other language(means not in English) format.
2) Default PHP string functions only work proper with English (or releted to it) language.
3) If you want to use strlen() or strpos() or uppercase() or strreplace() for special character,
Suppose We need to apply string functions on "Hello".
In chines (你好), Arabic (مرحبا), Japanese (こんにちは), Hindi (
नमस्ते), Gujarati (હેલો).
Different language can it's own character sets
so that mbstring introduced for communicate with various languages like (chines,Japanese etc).
Raul González is a perfect example of why:
It is about shortening too long user names for MySQL database, say we have 10 character limit and Raul González.
The unit test below is an example how you can get an error like this
General error: 1366 Incorrect string value: '\xC3' for column 'name' at row 1 (SQL: update users set name = Raul Gonz▒, updated_at = 2019-03-04 04:28:46 where id = 793)
and how you can avoid it
public function test_substr(): void
{
$name = 'Raul González';
$user = factory(User::class)->create(['name' => $name]);
try {
$name1 = substr($name, 0, 10);
$user->name = $name1;
$user->save();
} catch (Exception $ex) {
}
$this->assertTrue(isset($ex));
$name2 = mb_substr($name, 0, 10);
$user->name = $name2;
$user->save();
$this->assertTrue(true);
}
PHP Laravel and PhpUnit was used for illustration.
This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
How to decode Unicode escape sequences like “\u00ed” to proper UTF-8 encoded characters?
How can I convert \u014D to ō in PHP?
Thank You
It's not immediate clear what you mean when you say "to ō". If you're asking how to convert it into a different encoding then a general approach is to use the iconv function. 014D is the UCS-2 (unicode) for your desired function so, if you have a string containing the bytes 014D you could use
iconv('UCS-2', 'UTF-8', $s)
to convert from UCS-2 to UTF-8. Similarly if you want to convert to a different encoding - although you need to be aware that not all encodings will include the character you are using. You'll see from the iconv documentation that the //TRANSLIT option may help in that case.
Note that iconv is taking a byte sequence so, if you actually have a string containing a slash, then a u, then a 0 etc... you'll need to convert that into the byte sequence first.
If you have the escape characters in the string you could use a messy exec statement.
$string = '\\u014D';
exec("\$string = '$string'");
This way, the Unicode escape sequence should be recognized and interpreted as a unicode character When the string is parsed.
Of course, you should never use exec unless absolutely necessary.