I'm using PHP and mysqli and I have a problem with an insert query, which looks like this:
mysqli_query($connection, "SET NAMES 'utf8'");
$text = mysqli_real_escape_string($connection, $text);
mysqli_query($connection, "insert into table values('', '".$text."')");
Pages are encoded as UTF-8 without BOM and the MySQL collation is utf8_general_ci.
The problem is that when I use phpMyAdmin the query works fine, but when I use the website interface and type a text containing the character "+", it is replaced with a space " " in MySQL, while all other characters like ', ", accents, \, /, % are inserted correctly...
It worked before, so I probably made a mistake.
Thank you in advance, and sorry for my poor English.
It is neither MySQL, nor mysqli, nor PHP.
None of them gives this character any special meaning.
If you verify your inserts by simply echoing $text out before the insert, you will see that the + sign has already been stripped. So you have to find the code that strips that symbol out.
A program is not a "black box" that you feed with data and that returns some unexpected output, but rather a set of operators, each performing some data manipulation.
So you have to debug your code, which means echoing your $text variable out at various points in your code to see where it gets changed. Most likely it is getting some unnecessary treatment. After finding that code, you can either remove it or ask here whether it is OK or not.
The only case in which a + character would be replaced automatically is if you type your text directly into the browser's address bar. In that case + can be replaced with a space, because PHP decodes urlencoded text and + is used to represent the space character in a URL.
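To illustrate how + behaves in urlencoded data, here is a minimal sketch. The sample value 'a+b' and the parameter name text are made up for illustration; parse_str() is used here only because it applies the same urlencoded decoding to a query string that PHP applies to request data.

```php
<?php
// Hypothetical text typed by a user.
$typed = 'a+b';

// Sent raw inside urlencoded data, the "+" is decoded as a space.
parse_str('text=' . $typed, $raw);

// Sent through urlencode() first, the "+" travels as %2B and survives.
parse_str('text=' . urlencode($typed), $encoded);

echo $raw['text'], "\n";      // a b
echo $encoded['text'], "\n";  // a+b
```

If the + disappears only for text submitted through the website, checking whether the value is urlencoded before it is sent (for example in hand-built request strings) is a reasonable first step.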
I'm working on a script that builds an XML feed using strings from the database. The strings are user-entered image captions from the Facebook Open Graph API, and they are supposed to be all UTF-8 according to Facebook. So I import the captions into the database and store them as utf8-unicode (I also tried utf8-bin).
But I always get the same error when trying to display the output XML feed, because one of the captions has a weird whitespace character:
This page contains the following errors:
error on line 63466 at column 14: Input is not proper UTF-8, indicate encoding !
Bytes: 0x0B 0x54 0x68 0x6F
Below is a rendering of the page up to the first error.
In the database (phpMyAdmin) and in the page source code (using Chrome), the problematic character appears as an empty square symbol.
Now if I copy and paste the problematic character into a converter, it gives me hexadecimal 000B.
What's the easiest way to fix this?
I'd also like to understand why the Facebook Graph API is giving me non-UTF-8 characters in the first place, when it's not supposed to.
Failed attempts:
utf8_encode() isn't working because the rest of the strings are valid UTF-8.
I also tried multiple different ways of stripping out all non-UTF-8 characters, but they don't filter out this specific character. Same when trying to filter out all non-Latin characters.
htmlentities(), htmlspecialchars() and similar functions don't encode the problematic character.
iconv(mb_detect_encoding()) will not detect the string as invalid UTF-8.
str_replace() or preg_replace() is of no help: if I try to copy and paste the character into Visual Studio Code, nothing is pasted, not even a whitespace.
str_replace("\0", "", ) ...nope
Here is a list of what we have found and/or worked through with the original poster:
MySQL's utf8 character set is not a complete implementation of UTF-8 (utf8mb4 is);
additional information on character sets and collation differences;
changes that happen to existing data if collation is changed.
We have checked the above and discovered that the initial problem was caused by vertical tab characters (0x0B) creeping into the text fields. A good way to remove them is to run $str = str_replace("\x0b", "", $str);, where $str is the string that is going to be inserted into the text field. Be careful not to replace \v in a regular expression instead: in PCRE, \v matches every vertical whitespace character, including line feeds, which might not be desired.
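A minimal sketch of that cleanup, assuming a made-up caption that starts with the same bytes as the error message (0x0B 0x54 0x68 0x6F). bin2hex() is used so the invisible character can be seen before and after.

```php
<?php
// Hypothetical caption: a vertical tab (0x0B) followed by "Tho".
$str = "\x0bTho";

// Remove only the vertical tab byte.
$clean = str_replace("\x0b", "", $str);

echo bin2hex($str), "\n";    // 0b54686f
echo bin2hex($clean), "\n";  // 54686f

// For comparison: the regex class \v also matches line feeds,
// so it removes the "\n" here as well as the vertical tab.
$multi = "a\n\x0bb";
echo bin2hex(str_replace("\x0b", "", $multi)), "\n";   // 610a62
echo bin2hex(preg_replace('/\v/', '', $multi)), "\n";  // 6162
```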
If the 0B is always at the beginning of a string, then trace the strings back to their source and see if they are "BOM" encoded (see Wikipedia on BOM).
At least come back with the various steps the data takes, so we can help with deducing the source of the problem.
Note: although needed for Emoji and Chinese, switching to utf8mb4 will not deal with BOM if that is the 'real' problem.
(Using str_replace is just a band-aid.)
I am trying to sanitise database input and found a problem with the Ⓡ character.
Ⓡ converts to
&#9415;
Even with html_entity_decode around the variable.
This is a problem because the field is only meant to allow 4 characters in the database.
® actually works, though, and is treated as a single character.
I have the same problem with Ⓒ vs ©.
As far as I know they are just HTML entities, so they should be decoded. However, they aren't even encoded with htmlspecialchars(). It just echoes out the code
&#9415;
Does PHP have any built-in functions to solve this? Thanks
Edit just to say what I am trying to do:
I have text fields to input and add to a database which displays in a table below.
When I enter any other character like < > &, it enters straight into the database as one character.
I am trying to make Ⓡ and Ⓒ always go in as one character as well (instead of 6).
I am only encoding on output in the table so certain characters don't break the website.
The reason the entity doesn't decode when using html_entity_decode is likely that the target character set given to html_entity_decode is still the default ISO-8859-1 (the default before PHP 5.4). ISO-8859-1 cannot encode "Ⓡ" (CIRCLED LATIN CAPITAL LETTER R), but it can encode "®" (REGISTERED SIGN).
So, first, to decode it correctly:
html_entity_decode('&#9415;', ENT_COMPAT, 'UTF-8')
But secondly, "Ⓡ" and "®" are not the same character, and you probably don't want "Ⓡ".
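A minimal sketch of the difference between the two characters. It assumes the stored text is the numeric entity &#9415; (the decimal reference for U+24C7, the circled R); the regex with the u modifier is used only to count characters without needing an extra extension.

```php
<?php
// Decode with an explicit UTF-8 target charset.
$circledR   = html_entity_decode('&#9415;', ENT_COMPAT, 'UTF-8');
$registered = html_entity_decode('&reg;', ENT_COMPAT, 'UTF-8');

// Different code points, hence different UTF-8 bytes.
echo bin2hex($circledR), "\n";    // e29387 (U+24C7)
echo bin2hex($registered), "\n";  // c2ae   (U+00AE)

// Each decoded value is a single character.
echo preg_match_all('/./su', $circledR), "\n";    // 1
echo preg_match_all('/./su', $registered), "\n";  // 1
```

Once decoded, either value counts as one character, so the length limit on the field is no longer the obstacle; which of the two symbols to store is a separate decision.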
I am having a problem with the Â character on my website.
I have a website where users can use a WYSIWYG editor (ckeditor) to fill out their profile. The content is run through htmlpurify before being put into a database (for security reasons).
The database has all tables set up with the UTF-8 charset. I also call 'SET NAMES utf-8' at the beginning of script execution to prevent problems (which has worked for years, as I haven't had this problem in a long time). The webpage the text is displayed on has a content-type of utf-8, and I also use the header() function to set the content-type and charset.
When displaying the text, all seemed fine until I tried running a regular expression on the content. html_entity_decode (called with the encoding param of 'utf-8') is removing/not showing the Â character for some reason, and it leaves behind something which is causing all of my regexes to fail (it seems there is a character there but I cannot view it in the source).
How can I prevent and/or remove this character so I can run the regular expression?
EDIT: I have decided to abandon ckeditor and go with the markdown format like this site uses to have more flexibility. I have hated wysiwyg editors for as long as I remember. Updating all the profiles to the new format will give me a chance to remove all of the offending text and give the site a clean start. Thanks for all the input.
You are probably facing the situation that the string is not actually properly UTF-8 encoded (you wrote that it is, but it isn't). html_entity_decode might then replace any invalid UTF-8 byte sequences (e.g. a single-byte-charset encoding of Â) with a substitution character.
Depending on the PHP version you're using, you have more control over how to deal with this by making use of the flags.
Additionally, to find the character you can't see, create a hex dump of the string.
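One way to produce such a hex dump is sketched below. The helper name hex_dump_string() and the sample text are made up for illustration; only bin2hex(), str_split() and implode() are built-in PHP functions.

```php
<?php
// Hypothetical helper (not part of PHP): shows each byte as two hex digits.
function hex_dump_string(string $s): string
{
    return implode(' ', str_split(bin2hex($s), 2));
}

// Sample text with an invisible UTF-8 non-breaking space (C2 A0) in it.
$text = "caf\xC3\xA9\xC2\xA0bar";

echo hex_dump_string($text), "\n";
// 63 61 66 c3 a9 c2 a0 62 61 72
```

In the output, any byte sequence that does not belong to the visible text stands out and can be looked up or targeted directly.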
Since the character you are talking about exists in ISO-8859-1 (often loosely called the ANSI charset), you can do this:
utf8_encode(preg_replace($match, $replace, utf8_decode($utf8_text)));
This will, however, destroy any Unicode character that does not exist in that charset. To avoid this, you can try mb_ereg_replace, which has multibyte (Unicode) support:
string mb_ereg_replace ( string $pattern , string $replacement , string $string [, string $option = "msr" ] )
How do I enter the plus (+) and minus (-) signs into a MySQL database and output them as normal + and - characters, all while still using mysqli_real_escape_string()? If there is a better way, please let me know.
What input/output are you actually getting? What's the column type in the database for the column you are trying to insert these into?
When looking at the mysqli_real_escape_string() documentation, I do not think that the + and - characters are escaped. Only NUL (ASCII 0), \n, \r, \, ', ", and Control-Z are escaped according to that page.
So I suppose you need to provide more information about your problem.
I use mysql_real_escape_string() and set the charset for the table or column to latin1; utf8 might be a better choice, though...
You could also play around with the different encoding functions, urlencode() and such.
//encode the update in case of & and other characters that will break twitter API
$update = '%E2%99%AB'.urlencode($update);
%E2%99%AB creates the 'musical note' symbol (♫).
http://www.eskimo.com/~bloo/indexdot/html/topics/urlencoding.htm
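A minimal sketch of what the urlencode() call above does to plus and minus signs. The sample text in $update is made up; the %E2%99%AB prefix is the one from the snippet above.

```php
<?php
// Hypothetical status text containing +, - and &.
$update = 'Plus + minus - & more';

// Same pattern as above: pre-encoded note symbol plus the encoded text.
$encoded = '%E2%99%AB' . urlencode($update);

echo $encoded, "\n";
// %E2%99%ABPlus+%2B+minus+-+%26+more

// Decoding restores the original text, including the literal + and -.
echo urldecode($encoded), "\n";
```

Spaces become +, the literal + becomes %2B, and - is left alone, so both signs come back unchanged after decoding.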
I sometimes import data from CSV files that were provided to me into a MySQL table.
In the last one I did, some of the entries have a weird bad character in front of the actual data, and it got imported into my database. Now I'm looking for a way to clean it up.
The bad data is in the mysql column 'email', it seems to be always right in front of the actual data. When trying to print it on my screen using PHP, it shows up as �. When exporting it to a CSV file, it looks like  , and if I SET CHARACTER SET utf8 before printing it on the screen using PHP, it looks like a normal space ' '.
I was thinking of writing a PHP script that goes over all my rows one at a time, fixes the email address field, and updates the row. However, I'm not quite sure about the "fix the email" part!
I was thinking maybe to do a "explode" and use the bad character as a delimiter, but I don't know how to type that character into my code.
Is there maybe a way to find the underlying value/utf8/hex or whatever of that character, then find it in the string?
I hope it's clear enough.
Thanks
EDIT:
In Hex, it looks like it's A0. What can I do to search and delete a character by its hex value? Either in PHP or directly in MySQL I guess ...
SELECT HEX(field) FROM table; should help determine the character.
As an alternative solution, it might actually be easier to fix the issue at the source. I've encountered similar problems with CSV files exported from Excel and have generally found that using something along the lines of...
$correctedLine = mb_convert_encoding($sourceLine, 'UTF-8', 'Windows-1252');
...tends to rectify the issue. (That said, you'll need to ensure that you have the multibyte string (mbstring) extension compiled in/enabled.)
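A minimal sketch of that conversion, assuming the CSV line is Windows-1252 and starts with a single 0xA0 byte. It uses mb_convert_encoding(), which returns the converted string, whereas mb_convert_variables() converts its arguments in place and returns the name of the source encoding. The email address is made up.

```php
<?php
// Hypothetical CSV value: a Windows-1252 non-breaking space (0xA0)
// in front of the real data.
$sourceLine = "\xA0user@example.com";

// Requires the mbstring extension.
$correctedLine = mb_convert_encoding($sourceLine, 'UTF-8', 'Windows-1252');

// The lone 0xA0 byte becomes the valid UTF-8 sequence C2 A0.
echo bin2hex(substr($sourceLine, 0, 1)), "\n";     // a0
echo bin2hex(substr($correctedLine, 0, 2)), "\n";  // c2a0
```

The conversion makes the data valid UTF-8 but does not remove the non-breaking space; trimming it is still a separate step.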
You can trim any leading unprintable ASCII character with something like:
update t set email = substr(email, 2) where ascii(email) not between 32 and 126
You can get the ASCII value of the offending character with this:
select ascii(email) as first_char from t
I think I found a PHP answer that seems to work more reliably:
$newemail = preg_replace('/\xA0/', '', $row['oldemail']);
And then I'm going to update the row with the new email.
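One caveat worth checking with a hex dump first: the pattern above removes the single byte A0, which fits single-byte data such as Latin-1 or Windows-1252. If the column actually holds UTF-8, a non-breaking space is stored as the two bytes C2 A0, and removing only A0 leaves a stray C2 behind. A minimal sketch with made-up addresses:

```php
<?php
// Hypothetical values: the same address with a single-byte NBSP
// and with a UTF-8 encoded NBSP in front.
$latin1Email = "\xA0user@example.com";
$utf8Email   = "\xC2\xA0user@example.com";

// Byte-wise pattern from above: fine for the single-byte case.
echo bin2hex(preg_replace('/\xA0/', '', $latin1Email)), "\n";
// 75736572406578616d706c652e636f6d

// On UTF-8 data it leaves the lead byte C2 in place.
echo bin2hex(substr(preg_replace('/\xA0/', '', $utf8Email), 0, 1)), "\n";  // c2

// Removing the whole two-byte sequence handles the UTF-8 case.
$cleanUtf8 = str_replace("\xC2\xA0", '', $utf8Email);
echo $cleanUtf8, "\n";  // user@example.com
```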