U+0008 Character added to string in certain browsers - php

I have a comments section on my website. These comments are stored in a MySQL database in a field with a 'text'-datatype and the latin1_swedish_ci collation. When I echo the result of a query to display the comments, the character U+0008 (backspace) is displayed in Firefox, Opera and other browsers. Chrome ignores this and just displays white space. Is there any way to remove this character?
Edit:
I have two sections on the website, one where I post notifications and one where users post comments. The 'message' fields that contain the content are identically configured and echoed. As I said, it doesn't make a difference whether I post comments via the site or via a direct query. Remarkably, one comment that has an -tag at the end doesn't have the U+0008 character appended to it.

Try changing the collation in MySQL from latin1_swedish_ci to a Unicode one such as utf8_general_ci. It could be that the data is encoded in Unicode when it is entered but loses that encoding when it is stored in the DB, so the backspace shows up in the Unicode string. However, you will still need to find out where it comes from, as @WebnetMobile.com suggested.

I replaced a space in my echo function. This apparently solved the issue. I really don't know how that could fix it, but the U+0008 characters are all gone...
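If the source of the character cannot be found, stripping it before output is one workaround. Here is a minimal sketch in Python rather than PHP; the function name strip_control_chars is hypothetical, and it assumes the comment text has already been decoded to a string. It removes the C0 control characters (including U+0008) but keeps tab, line feed and carriage return.

```python
import re

# C0 control characters (U+0000-U+001F) plus DEL (U+007F), excluding
# tab (U+0009), line feed (U+000A) and carriage return (U+000D).
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_control_chars(text: str) -> str:
    """Return text with stray control characters (e.g. U+0008) removed."""
    return _CONTROL_CHARS.sub("", text)


comment = "Nice post!\x08"  # a comment with a trailing backspace character
cleaned = strip_control_chars(comment)
print(repr(cleaned))
```

This only hides the symptom; whatever appends the character is still there.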

Related

Can't get my unicode from MySQL to print to browser correctly with PHP

I have been poring over Stack Overflow all night looking for a way to solve my issues, but I absolutely cannot get the browser to display my Unicode characters correctly when pulling them from my database. In particular, I am trying to use the "combining macron" character (U+0304), added after a character to put a macron over it. I want the user to have the option to turn them on and off, and having one character to look for and ignore seems easier than converting between individual macroned letters and their non-macroned counterparts (e.g. Ā -> A).
It would be trivial to use the HTML entity (&#772;) to accomplish this, but if I were to use the MySQL database for something other than making a webpage I want it to be easily transferable. I have tested with the HTML entity and I can get it to successfully add a macron to the previous character.
However, when using the Unicode character in my MySQL table, I absolutely cannot get it to print anything other than question marks (?) in the browser. In the table itself, the entry is a VARCHAR(64) and looks like 'word¯' with the macron appearing afterwards, but I assume that's just a limitation of the cmd environment that it doesn't put the macron over the d. The column Collation is latin1_swedish_ci, if that makes a difference. Here is what I have tried to get the entry to print correctly:
Changing my php.ini to have a default charset of utf-8
Making the top of my php file read:
<?php
header('Content-Type: text/html; charset=utf-8');
?>
And setting the first parameter of my database PDO as 'mysql:dbname=NAME;host=localhost;charset=utf8'
When I simply make the php file echo the character I want, it prints to the page correctly. So I'm thinking the problem isn't with the encoding? Or maybe the encoding of the database and the server aren't the same and that is creating the ??
EDIT:
I can get it to display correctly if I insert the value from phpMyAdmin, but not when I enter it through the cmd. In both cases I am pasting the same word with an ending character of 'U+0304'. Is there a reason that it works with phpMyAdmin and not through a direct query, and what can I do so it works with both?
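One possible explanation for the question marks, sketched in Python rather than PHP: U+0304 has no representation in latin1, so any step that converts the text to latin1 has to substitute something for it. Python's errors="replace" happens to emit "?" as well. This only illustrates the kind of lossy conversion that may be involved; it is not a trace of what MySQL or the cmd client actually does.

```python
word = "word\u0304"  # "word" followed by COMBINING MACRON (U+0304)

# latin-1 cannot represent U+0304, so a strict conversion fails ...
try:
    word.encode("latin-1")
    representable = True
except UnicodeEncodeError:
    representable = False

# ... and a lossy conversion substitutes a question mark.
lossy = word.encode("latin-1", errors="replace")

# UTF-8 stores the macron as two bytes, so nothing is lost.
utf8_bytes = word.encode("utf-8")

print(representable, lossy, utf8_bytes)
```

If this is what happens, the column (or the connection used by the cmd client) would need to be UTF-8 for the character to survive.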

Weird character encoding in Firefox

I have a problem with character encoding in Firefox. When I copy/paste a paragraph from Microsoft Word (2007), it can contain special characters like these (dots/squares to make a list, or quotes):
 Te’st
 Ze’f
• Gzg’a
The quote ’ is different compared to this quote ' (typed directly using keyboard). So I paste this in a textarea and save (using AJAX in some case). In the database (which has a collation latin1_swedish_ci) it shows perfectly fine. But when getting these data to edit again using Firefox, it shows weird binary symbols. Works fine in Chrome and IE.
I don't want to modify the charset of the database. Is there any way to solve this problem?
Note: you can also test by viewing this post in Chrome and FF
The characters you copy-pasted (assuming they got transmitted correctly into this forum) contain, in addition to letters, three occurrences of U+2019 RIGHT SINGLE QUOTATION MARK, which is the correct punctuation apostrophe in English and many other languages, one occurrence of U+2022 BULLET, which sounds ok, and two occurrences of U+F0A7, which is in the Private Use (PU) range and should not be used in public information exchange, only for special purposes by mutual agreement between interested parties.
It is possible that some notations in Word 2007 documents get converted to PU characters in copy and paste, but at least normal list bullet normally becomes U+2022 BULLET. So it is a bit of a mystery where the PU characters come from.
Regarding single quotes, they are representable in windows-1252 too, and latin1_swedish_ci seems to cover it (though it is, as far as I understand, just the definition of collating order, rather than a character encoding). And as you are saying that the data looks fine in the database, it seems that problem is in the way in which the data is written in an HTML document served to the browser.
In particular, if the encoding of the page in which the data is then presented is UTF-8 and the actual data is there in windows-1252 encoding, problems arise. It would mean a problem like the one you describe, as U+2019 is encoded as 0x92 in windows-1252, and this causes a character-level data error when interpreted as UTF-8.
You can check the situation by using View→Encoding in Firefox when viewing the result page. If my hypothesis is correct, you will see UTF-8 selected there, and changing it to “West European (windows-1252)” makes the single quote appear (and may mess up other things on the page thoroughly).
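The byte-level claim can be checked with a short Python sketch. It assumes Python's "cp1252" codec as a stand-in for windows-1252 and uses the sample word from the question.

```python
quote = "\u2019"  # RIGHT SINGLE QUOTATION MARK

# windows-1252 stores the quote as the single byte 0x92,
# while UTF-8 needs three bytes for the same character.
cp1252_bytes = quote.encode("cp1252")
utf8_bytes = quote.encode("utf-8")

# A lone 0x92 is not valid UTF-8, so a strict decode fails.
try:
    cp1252_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# A lenient decoder substitutes U+FFFD, the replacement character
# that browsers typically draw as a diamond or box.
shown = b"Te\x92st".decode("utf-8", errors="replace")

print(cp1252_bytes, utf8_bytes, valid_utf8, ascii(shown))
```

How a given browser renders the invalid byte can differ, which would fit the data looking fine in one browser and broken in another.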

What would cause an &nbsp; to turn into a unicode character?

I've got some documents on my website which users can edit via a rich text editor and then save them (to the DB) and print them. Some users are experiencing an issue (only happening on the live site) where some of the characters are getting screwed up. I've checked the DB, and the funny characters are in the DB, so it's not a display issue. It either happens when they save the document (submit the form on the site) or they've put something weird in there or their browser changed some of the characters.
The character that keeps appearing everywhere is Â . It's an accented A followed by a space. Looking at the source HTML, it appears that the affected documents had all their &nbsp;'s converted. But whenever I try it, they come out fine.
What would cause an &nbsp; to turn into a unicode character, but only in limited cases?
Misinterpreting the UTF-8 encoding as Latin-1 will cause this.
>>> u'\xa0'.encode('utf-8').decode('latin-1')
u'\xc2\xa0'
>>> print u'\xa0*'.encode('utf-8').decode('latin-1')
Â *
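The session above is Python 2. A Python 3 version of the same check follows, with one extra step showing that the damage is reversible as long as the mis-decoded text has not been altered further. The variable names are only illustrative.

```python
nbsp = "\xa0"  # NO-BREAK SPACE (U+00A0)

# UTF-8 encodes U+00A0 as the two bytes C2 A0.
utf8_bytes = nbsp.encode("utf-8")

# Reading those two bytes as Latin-1 yields two characters:
# U+00C2 (A with circumflex) followed by U+00A0.
mojibake = utf8_bytes.decode("latin-1")

# Reversing the mistake recovers the original single character.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(ascii(mojibake), ascii(repaired))
```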

Special Characters Problem

When I display contents from the database, I get this:
��Some will have a job. Others will want one. They are my people, they are my clients and they are being denied their rights.
This text had been entered by the user via textarea with tinyMCE. How can I replace special characters (using preg_replace()) from the sentence to ' ' except for the characters: <>?
This article is totally worth a read. Dealing with UTF-8 characters is something that we all go through at some point. The trick seems to be to catch them before they go into the database or to fix the database so that when they're going in they aren't broken. Once they're in there though it's slightly more difficult.
As Chuck mentioned above, it is a database problem. If you only wish to display non-Unicode (i.e. Latin) characters, then yes, preg_replace is the way to go. You will need to know the character sets well enough to filter out what you don't want.
But if you just want everything to display nicely, i.e. no garbage characters, then change the corresponding parts of the db to accept UTF-8.
E.g. if you are using MySQL, try changing the field and table encoding so they can accept UTF-8. The default collation is latin1_swedish_ci; try changing it to utf8_general_ci. Hope that explains my point.

Black Diamonds that are Fixing themselves in MySQL

I am running into a very strange issue with a site that I am working on. The site is basically a job board where the owner or users can create job listings, including a description that is stored in a MySQL text field. Whenever listings from certain sources are entered, they initially end up with the "black diamond with a question mark inside" character in place of apostrophes and double spaces. This part I know is an encoding issue and can correct. The real question is this: the black diamonds show when the record is displayed in a MySQL admin tool and when the job listing is viewed in a web browser (a simple select statement displays the listing in a PHP app), but after the first time it is viewed, the problem somehow fixes itself. It is as if running the select and then displaying the record updates the job description field and fixes the encoding issues. How could this be? Has anyone ever heard of this or anything similar? I cannot understand how a database field would change without running an update statement...
How are the job listings entered? Are they entered via a web page? If so, what character encoding does the web page use? (This should determine the character encoding of the submitted data AFAIK.) What character set is the connection used to communicate with MySQL? What is the character set of the column the data is stored in? Finally, what is the character encoding of the web page(s) on which the entered data is reviewed?
Here is what I do: I declare all of my pages as UTF-8 encoded, using the following tag at the start of the <head> section:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
I issue the following command immediately when I connect to MySQL, so as to make sure that MySQL understands the data I send to it will be UTF-8 encoded:
SET NAMES utf8
(Depending on the database abstraction method you use, a special function might be recommended in order to set the connection character set, like mysqli's mysqli_set_charset().)
I also make sure that those columns in which I intend to store UTF-8 data are declared to be UTF-8. You can find out what the character set of a column is by issuing SHOW CREATE TABLE table_name. The character set of the table (which by default is the character set for any column in the table) is displayed at the end. If the character set for the column is different to the default character set for the table then it is displayed as part of the column definition. If you wish to change the character set of a column then you can do so using ALTER TABLE.
If you have not previously taken the steps to handle character sets in your app then you may find that the tables are all using the latin1 character set. If you naively store UTF-8-encoded data (for example) into these columns, you may run into character encoding issues. Changing the column character set using ALTER TABLE does not necessarily fix your old data, because MySQL reads your old data assuming it to be valid latin1-encoded text and converts it to the equivalent UTF-8 (correctly converting what it has read, but not giving the result you want).
The above steps would hopefully mean that future data will be correctly encoded and correctly displayed, but you may have data already mis-encoded in your database, so be aware that if you follow the above steps and still see older data displaying incorrectly, this may be why. Good luck.
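The double-encoding effect described above can be simulated outside MySQL. This is a rough sketch in Python that assumes Python's "latin-1" codec as a stand-in for MySQL's latin1 (which is closer to windows-1252) and uses a sample word that both treat identically.

```python
original = "café"

# A UTF-8 client writes these bytes into a column declared as latin1.
stored = original.encode("utf-8")

# Converting the column reads the bytes as latin1 and re-encodes them
# as UTF-8: a faithful conversion of the wrong interpretation.
misread = stored.decode("latin-1")
double_encoded = misread.encode("utf-8")

# The old data can be repaired by undoing the extra layer:
# decode UTF-8, map back to the raw bytes, decode UTF-8 again.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")

print(ascii(misread), double_encoded, repaired)
```

The repair only works while every intermediate character still maps back to a single byte, so it is worth trying on a copy of the data first.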
I ran into this problem a few years ago... I remember finding those notorious characters and replacing them in PHP with a single quote or a double quote... Of course with escaping... A simple preg_replace for those characters will do the trick... It's just an encoding issue...
This page, though geared toward WordPress, might help:
http://codex.wordpress.org/Converting_Database_Character_Sets
I had the same issue (MySQL encoding and webpage encoding set to UTF-8, but black diamonds showing up in my query results). I found this snippet while googling but cannot for the life of me find its source to give proper attribution:
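// Note: the mysql_* extension was removed in PHP 7;
// mysqli_set_charset() is the mysqli equivalent.
if (function_exists('mysql_set_charset')) {
    // Preferred: tell the client library about the charset.
    mysql_set_charset('utf8', $db_connection);
} else {
    // Fallback for older PHP versions without mysql_set_charset().
    mysql_query("SET NAMES 'utf8'", $db_connection);
}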
Anyway, it cleared up the issue for me.
