PHP htmlspecialchars and character sets issue - php

I've got a page that loads content from a database. I have the page set to use UTF-8, like so:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
I've set the collation on the db to utf8_unicode_ci, and I access the field from mysql using this:
$desc = htmlspecialchars($row['description'],ENT_QUOTES,'UTF-8');
I've done this to fix smart quotes and em-dashes. However, any field that contains those characters is being returned blank now. If I take away the 'UTF-8' parameter in my htmlspecialchars call I get the full text just with questionmark-in-a-diamond characters where the quotes should be. Is there something I'm missing?

If you're getting an empty string back, it probably means there's an invalid UTF-8 character in the content from the database. You could also set the ENT_IGNORE flag, but I'd recommend doing a binary search with the offending string to try to figure out exactly what the offending character is.
http://www.php.net/manual/en/function.htmlspecialchars.php. See the section under return value.
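To make the suggestion above concrete, here is a minimal sketch, assuming PHP 7+ with the mbstring extension. The helper names firstInvalidUtf8Offset() and escapeForHtml() are made up for illustration and are not part of PHP. The first reports where the first invalid byte sits (instead of bisecting the string by hand); the second uses ENT_SUBSTITUTE so that a bad byte becomes U+FFFD rather than blanking the whole field.

```php
<?php
// Hypothetical helper: byte offset of the first invalid UTF-8 sequence, or null if the string is clean.
function firstInvalidUtf8Offset(string $bytes): ?int
{
    // Byte-wise pattern (no /u flag) that matches the longest well-formed UTF-8 prefix.
    $wellFormed = '/^(?:[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2})*+/';

    if (preg_match($wellFormed, $bytes, $m) !== 1) {
        throw new RuntimeException('preg_match failed: ' . preg_last_error());
    }

    $validLength = strlen($m[0]);
    return $validLength === strlen($bytes) ? null : $validLength;
}

// Hypothetical helper: escape for HTML without returning '' on invalid input.
function escapeForHtml(string $value): string
{
    // ENT_SUBSTITUTE replaces each invalid sequence with U+FFFD instead of failing.
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// "caf\xE9" is "café" in ISO-8859-1; the lone 0xE9 byte is not valid UTF-8.
$row = ['description' => "caf\xE9 ok"];
var_dump(firstInvalidUtf8Offset($row['description']));
var_dump(escapeForHtml($row['description']));
```

If the offset points at a smart quote or dash, the stored bytes are most likely Windows-1252 or Latin-1 rather than UTF-8, and the real fix is upstream in how the data was inserted.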

Related

UTF-8 strings won't convert similarly. I want all the scraped text to become the same for saving in the database

I have huge problems with encodings. I'm scraping text from some other sites with file_get_contents(), and the quotes become odd special characters or question marks. The strange thing is that the text from different sites IS UTF-8, yet the quotes turn into different things when I receive it. When I run utf8_decode(), a quote from one UTF-8 text becomes a quote, but in another UTF-8 text from another site it becomes a question mark.
Is there any way to fix this so that all the text looks right when I save it to the db?
The collation of the database table is latin1_swedish_ci; I have tried changing it to utf8_unicode_ci, but it made no difference.
Edit:
I have now tried a little more. Each of these two works for different texts. This one works for one text:
$source = utf8_encode($source);
And this one works for the others:
$source = mb_convert_encoding($source, 'HTML-ENTITIES', 'utf-8');
But you can't put the string through both. They don't work together; each one breaks the texts that the other one fixes.
Printscreen without any encoding (text is in Swedish):
Edit:
FYI: I have now changed the table to utf8_unicode_ci. It is still not working, however. Here are all the functions I've tried:
Actually, if I just leave it like this, most of the texts are output with the right characters. It's just some where " becomes ”.
Can you please dump the code you grabbed using print_r?
Note: your HTML page must declare the correct meta charset to display Unicode characters correctly.
<head>
<meta charset="UTF-8">
</head>

UTF-8 page to UTF-8 database table showing incorrect

I'm hoping for an explanation of why some UTF-8 text is being saved to a database table incorrectly...
I created an HTML form and the page's meta content is set to UTF-8:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
The PHP and template files are all Unicode/UTF-8.
The form field data is submitted to a utf8_unicode_ci encoded database table.
If I submit the form with characters such as "éçä" (which I created from Windows' Character Map program set to Unicode character set) they show up incorrectly in the database ("Ã©Ã§Ã¤"). I'm viewing the database via phpMyAdmin (which is also set to UTF-8 character encoding).
However, if I run iconv() on the string to convert to ISO-8859-1 before inserting it into the database, then the characters show correctly:
$input = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $input);
What is going on? Shouldn't the fact that everything is UTF-8/Unicode from beginning to end result in it being correct in the DB? What am I doing wrong, and why did converting the data to ISO-8859-1 work?
The only other thing done to the data is a FILTER_SANITIZE_MAGIC_QUOTES:
$input = filter_var($input,FILTER_SANITIZE_MAGIC_QUOTES);
Thank you for your time and input.
Two steps you haven't mentioned:
Specify UTF-8 in HTTP Content-Type header
Specify UTF-8 when connecting to MySQL, e.g. specifying charset in PDO
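A rough sketch of those two steps, assuming PDO with the MySQL driver. The helper names utf8Dsn() and sendUtf8Header(), the host, the database name, and the credentials are all placeholders, not anything from the question. utf8mb4 is used here as the connection charset; plain utf8 is the alternative if the server is too old for it.

```php
<?php
// Hypothetical helper: build a MySQL DSN that pins the connection charset.
function utf8Dsn(string $host, string $dbName): string
{
    // The charset=... DSN key tells PDO_MYSQL which encoding the client speaks,
    // so MySQL does not assume latin1 and re-encode UTF-8 bytes a second time.
    return "mysql:host={$host};dbname={$dbName};charset=utf8mb4";
}

// Step 1: declare UTF-8 in the real HTTP header, before any output is sent.
function sendUtf8Header(): void
{
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=utf-8');
    }
}

// Step 2 needs a running MySQL server, so it is shown as a comment only:
// $pdo = new PDO(utf8Dsn('localhost', 'mydb'), 'user', 'secret', [
//     PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
// ]);

echo utf8Dsn('localhost', 'mydb'), "\n";
```

With the connection charset pinned, the iconv() workaround from the question should no longer be needed, because the bytes PHP sends are then labelled as what they actually are.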

What's the cause of these characters when I pull content from the database?

I'm pulling some content from my database and when I display it, I am getting some random characters occasionally dispersed throughout the content. I am seeing a lot of Â where spaces were/are. I'm also getting â€™ in some places.
The characters don't appear when I view in phpMyAdmin. How do I encode the content correctly? Is it something I should do BEFORE I insert the content or is it something I do when I am displaying?
What character set is the data stored in?
For example, if the data is stored as UTF-8, then when displaying the data, you need to make sure the page encoding is set to UTF-8 as well.
If it is stored in some other character set, then set the page encoding as appropriate.
You can do this by passing appropriate headers:
Content-Type: text/html; charset=utf-8
Or letting the browser know in your document:
<META http-equiv="Content-Type" content="text/html; charset=utf-8">
And in HTML5:
<meta charset="utf-8" />
That's UTF-8 being misinterpreted as CP1252. Make sure all the appropriate headers are in place.
>>> print u'â€™'.encode('cp1252').decode('utf-8')
’
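The same round trip can be sketched in PHP, assuming the mbstring extension is available. repairCp1252Mojibake() is a hypothetical name, and the function only handles the single case described above (UTF-8 bytes that were read as CP1252 and then re-encoded as UTF-8).

```php
<?php
// Hypothetical helper: undo one round of "UTF-8 read as Windows-1252" mojibake.
function repairCp1252Mojibake(string $garbled): string
{
    // Map each character back to the single Windows-1252 byte it was decoded from;
    // those bytes are the original UTF-8 sequence.
    $bytes = mb_convert_encoding($garbled, 'Windows-1252', 'UTF-8');

    // Only accept the result if it is well-formed UTF-8; otherwise leave the input alone.
    return mb_check_encoding($bytes, 'UTF-8') ? $bytes : $garbled;
}

// U+00E2 U+20AC U+2122 is how the bytes E2 80 99 (a right single quote) look in CP1252.
$garbled = "\u{00E2}\u{20AC}\u{2122}";
var_dump(repairCp1252Mojibake($garbled) === "\u{2019}");
```

This is a way to confirm the diagnosis or to clean up rows that were already stored wrongly. It is not a substitute for sending the correct headers in the first place.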
IMO, the best thing would be to work on utf-8 on your files/database (or at least the same encoding in all places).
Please check what you have under $db['default']['char_set'] and $db['default']['dbcollat'] in your application/config/database.php and what encoding you are using in your views/HTML. If you see the data correctly in PMA, then maybe the problem is in your views.
Try to use utf8_encode or utf8_decode when you print your text.

Strange characters from PHP form. Character set?

I have a form on my site where users can submit text as part of a product review. The review goes to a MySQL database, where I can review it before approving it so it appears on my site. I received a review today that was filled with strange characters. For example, I think the below was supposed to come out as "fun" but instead it showed up in my MySQL DB as:
“funâ€Â
I'm pretty sure this is a character encoding issue, and I've read a few entries on stackoverflow about such issues, but I'm just not sure how to implement a fix. I'm guessing I need to change the php function I use to do data cleaning from the form, which is below:
function cleanDataForDB($data) {
$data = trim(htmlentities(strip_tags(nl2br($data),'<br><br />')));
if (get_magic_quotes_gpc())
$data = stripslashes($data);
$data = mysql_real_escape_string($data);
return $data;
}
The html for my site is encoded in UTF-8. I have this tag at the top of every page:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Do I need to use a php encoding function, such as utf8_encode() on data entry and utf8_decode() when I'm displaying in a browser?
Any help is greatly appreciated. Thanks!
Chris
It's also good to make sure that the web server is advertising UTF-8, but that's not the culprit here. I use the Live HTTP Headers extension in Firefox to test. MySQL connections default to the latin1 character set on many installations, so you must explicitly set it otherwise with mysql_set_charset() (mysqli_set_charset() on current PHP versions, since the mysql_* functions were removed in PHP 7). PHP itself is not very good at multi-byte character sets like UTF-8, but as long as it doesn't need to understand those characters (as it would for regular expression matching, for example) you are safe. You just need to make sure all input and output to the user (via the meta tag) and to the database are aware of the character encoding.

PHP Strange character before £ sign?

For some reason I get a weird Â character, as in Â£76756687, when I type a £ into a text field on my form?
As you suspect, it's a character encoding issue - is the page set to use a charset of UTF-8? (You can't go wrong with this encoding really.) Also, you'll probably want to entity encode the pound symbol on the way out (&pound;)
As an example character set (for both the form page and HTML email) you could use:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
That said, is there a good reason for the user to have to enter the currency symbol? Would it be a better idea to have it as either a static text item or a drop down to the left of the text field? (Feel free to ignore if I'm talking arse and you're using a freeform textarea or summat.)
You're probably using UTF-8 as the character encoding but not declaring your output correctly. The £ character (U+00A3) is encoded in UTF-8 as 0xC2A3, and that byte sequence represents the two characters Â and £ when interpreted as ISO 8859-1.
So you just need to specify your character encoding correctly. In PHP you can use the header function to set the proper value for Content-Type header field like:
header('Content-Type: text/html;charset=utf-8');
But make sure that you call this function before any output. Otherwise the HTTP header is already sent and you cannot modify it.
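The byte-level claim can be checked with a short sketch, assuming PHP 7+ with mbstring. The variable names are illustrative only. It prints the UTF-8 bytes of £ and then simulates a reader that wrongly treats those bytes as ISO 8859-1.

```php
<?php
// "£" (U+00A3) written with an escape so the example does not depend on this file's encoding.
$pound = "\u{00A3}";

// The UTF-8 encoding of U+00A3 is the two bytes C2 A3.
$hex = strtoupper(bin2hex($pound));

// Simulate the misreading: treat each UTF-8 byte as one ISO 8859-1 character.
$misread = mb_convert_encoding($pound, 'UTF-8', 'ISO-8859-1');

echo $hex, "\n";      // the raw bytes in hex
echo $misread, "\n";  // two characters: U+00C2 followed by U+00A3
```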
