Special Characters Problem

Special Characters Problem - php

When I display contents from the database, I get this:
��Some will have a job. Others will want one. They are my people, they are my clients and they are being denied their rights.
This text had been entered by the user via textarea with tinyMCE. How can I replace special characters (using preg_replace()) from the sentence to ' ' except for the characters: <>?

This article is totally worth a read. Dealing with UTF-8 characters is something that we all go through at some point. The trick seems to be to catch them before they go into the database or to fix the database so that when they're going in they aren't broken. Once they're in there though it's slightly more difficult.

As Chuck mentioned above, it is the database problem. Unless you only wish to display non-Unicode, ie Latin characters, then yes, preg_replace is the way to go. You will need to know the character sets well enough to filter out what you don't want.
But if you just want everything to display nicely, ie no garbage characters, then change the corresponding parts of the db to accept utf-8.
e.g. If you are using mySQL, try changing the field and table encoding to be able to accept UTF-8. The default is latin1_general_ci - try changing it to utf8_general_ci. Hope that explains my point.

Related

Can't get my unicode from MySQL to print to browser correctly with PHP

I have been poring over stackoverflow all night looking for a way to solve my issues, but I absolutely cannot get the browser to display my Unicode characters correctly when pulling them from my database. In particular, I am trying to use the "combining macron" character (U+0304), added after a character to put a macron over it. I want the user to have the option to turn them on and off, and having one character to look for and ignore seems easier to accomplish this than instead of making conversions between individual macroned letters and their non-macroned counterpart (e.g. Ā -> A).
It would be trivial to use the HTML entity (& #772;) to accomplish this, but if I were to use the MySQL database for something other than making a webpage I want it to be easily transferable. I have tested with the HTML entity and I can get it to successfully add a macron to the previous character.
However, when using the Unicode character in my MySQL table, I absolutely cannot get it to print anything other than question marks (?) in the browser. In the table itself, the entry is a VARCHAR(64) and looks like 'word¯' with the macron appearing afterwards, but I assume that's just a limitation of the cmd environment that it doesn't put the macron over the d. The column Collation is latin1_swedish_ci, if that makes a difference. Here is what I have tried to get the entry to print correctly:
Changing my php.ini to have a default charset of utf-8
Making the top of my php file read:
<?php
header('Content-Type: text/html; charset=utf-8');
?>
And setting the first parameter of my database PDO as mysql:dbname=NAME;host=localhost;charset=utf8'
When I simply make the php file echo the character I want, it prints to the page correctly. So I'm thinking the problem isn't with the encoding? Or maybe the encoding of the database and the server aren't the same and that is creating the ??
EDIT:
I can get it to correctly display if I insert the value from PhPMyAdmin, but not when I enter it through the cmd. In both cases I am pasting the same word with an ending character of 'U+0304'. Is there a reason that it works with PHPMyAdmin and not through a direct query, and what can I do so it works with both?

Obtaining correct UTF-8 characters from a Percent Encoded URL parameter

I'm having some trouble with the dreaded UTF-8 Character Encoding! It's driving me insane, no matter which way I approach it or how many online guides I follow, I can never get it to return the desired results. Here's what's going on:
My whole website uses a simple text-file database that is UTF-8 encoded, and it correctly shows all manner of special characters, latin, arabic, japanese, you name it, they all show correctly, with one exception:
When the user uses the "Search" input box I have on my website, I use $search = $_REQUEST['search']; to get the input data on the results page and show results accordingly. When a user inserts special characters in the search box, they get "Percent Encoded" in the URL (for example, "ï" becomes "%E3%AF"). When showing $string in the actual website, any special character appears as � (black diamond with question mark).
I have tried everthing it says here http://malevolent.com/weblog/archive/2007/03/12/unicode-utf8-php-mysql/ with the exception of the header(). I have set the charset as UTF-8 in my head section with an http-equiv meta but for some reason whenever I set it as a header() my PHP stylesheet stops working (and the character problem remains). Maybe this is a clue?
I have tried urldecode and rawurldecode too, but they don't change anything.
Keep in mind special characters appear correctly elsewhere on the site, it's only with the $search string where this problem appears. As a side-note, even though the characters are not visualizing correctly, my search engine does actually interpret the special characters correctly when filtering the results. This makes me understand that the special character is actually there and correctly encoded, but it's just a matter of making it visualize correctly with the correct charset. However... everything appears to be UTF-8.
To be honest I'm so confused about this that this question might also appear to be confusing and the information I'm giving you might not be very well structured either, so I apologize and will try to provide more detailed information for any questions.
Thank you!

Make sure not to have any function which alters your $_REQUEST. Some functions are not aware of special encodings.
The best way to investigate is checking the state of the variables before and after they are altered.

I would like to add one thing more point regarding utf-8 string manipulation.
When manipulating utf-8 strings always use multibyte string functions.
use mb_strtolower in place of strtolower()
http://php.net/manual/en/ref.mbstring.php.

(?) marks in HTML. Encoding issue from content from the database?

Any idea why this is happening?
It looks to be happening mainly with apostrophes and hyphens. Any ideas if I can fix this? I pull the data from my database and print it to the page like:
<div class="block">
<?=$details['agenda'] ?>
</div>

As other commenters may have mentioned, this is a character encoding problem. If you're lucky, you can force your HTML page to render in UTF-8 and that will resolve it.
Unfortunately, if you're not lucky, you'll discover that the characters are stored in the database in the wrong encoding. Or maybe the database converts them. Or maybe the character encoding data has been destroyed along the path! There's no way of knowing in advance where those characters have been damaged.
The best way I know to fix problems like this is to force every step along your path to follow UTF-8 content encoding. For example, you probably go through steps like this:
Content author writes a document in Microsoft Word containing "SmartQuotes"
Content author copies-and-pastes into the edit box of a content management system.
Content management system saves to the database.
Database may or may not store data in Unicode internally - make sure you use nvarchar (or whatever unicode type your database supports).
Reading from the database may need to scan for characters.
However, it's very tricky to fix this! A long time ago, I used to have a habit of writing "detect-and-fix" routines like this:
$smartquotes = array("”", "“");
str_replace($smartquotes, '"', $mytext);
Of course you know what the problem is - I'd keep discovering new characters I had to fix. Microsoft Word likes to do tons of unusual characters - copyright, registration marks, apostrophes, hyphens, and so on. I'd keep adding to this function, over and over, until I went crazy. So nowadays I just go through my entire content delivery path and force everything to obey UTF-8 rules; that seems to resolve it in most cases.
Good luck!

PHP - Can't remove strange character

I'd really appreciate some help with this. I've wasted days on this problem and none of the suggestions I have found online seem to give me a fix.
I have a CSV file from a supplier. It appears to have been exported from an Microsoft system.
I'm using PHP to import the data into MySQL (both latest versions).
I have one particular record which contains a strange character that I can't get rid of. Manual editing to remove the character is possible, but I would prefer an automated solution as this will happen multiple times a day.
The character appears to be an interpretation of a “smart quote”. A hex editor tells me that the character codes are C2 and 92. In the hex editor it looks like a weird A followed by a smart quote. In other editors and Calc, Writer etc it just appears as a box. ﾒ
I'm using mb_detect_encoding to determine the encoding. All records in the CSV file are returned as ASCII, except the one with the strange character, which is returned as UTF-8.
I can insert the offending record into MySQL and it just appears in Workbench as a square.
MySQL tables are configured to utf-8 – utf8_unicode_ci and other unusual UTF characters (eg fractions) are ok.
I've tried lots of solutions to this...
How to detect malformed utf-8 string in PHP?
Remove non-utf8 characters from string
Removing invalid/incomplete multibyte characters
How to detect malformed utf-8 string in PHP?
How to replace Microsoft-encoded quotes in PHP
etc etc but none of them have worked for me.
All I really want to do is remove or replace the offending character, ideally with a search and replace for the hex values but none of the examples I have tried have worked.
Can anyone help me move forward with this one please?
EDIT:
Can't post answer as not enough reputation:
Thanks for your input. Much appreciated.
I'm just going to go with the hex search and replace:
$DodgyText = preg_replace("/\xEF\xBE\x92/", "" ,$DodgyText);
I know it's not the elegant solution, but I need a quick fix and this works for me.

Another solution is:
$contents = iconv('UTF-8', 'Windows-1251//IGNORE',$contents);
$contents = iconv('Windows-1251', 'UTF-8//IGNORE',$contents);
Where you can replace Windows-1251 to your local encoding.

At a quick glance, this looks like a UTF-8 file. (UTF-8 is identical with the first 128 characters in the ASCII table, hence everything is detected as ASCII except for the special character.)
It should work if your database connection is also UTF-8 encoded (which it may not be by default).
How to do that depends on your database library, let us know which one you're using if you need help setting the connection encoding.

updated code based on established findings
You can do search & replace on strings using hexadecimal notation:
str_replace("\xEF\xBE\x92", '', $value);
This would return the value with the special code removed
That said, if your database table is UTF-8, you shouldn't need that conversion; instead you could look at the connection (or session) character set (i.e. SET NAMES utf8;). Configuring this depends on what library you use to connect to your database.
To debug the value you could use bin2hex(); this usually helps in doing searches online.

List of 'messed up characters' in utf8

one of my clients has a website which has been totally messed up by the hosting companie forcing a characterset on the complete database. We've had troubles before with character sets but now it's just straight forward a drama!
So far I've added the charset=utf-8 to the page content type and set the charset for the mysql connection to utf8. And now it's time to replace all characters. So far what I've found is:
Ã¶ = ö
Ã« = ë
Ã© = é
The data inside the database is being updated like so:
UPDATE table SET `fieldname` = REPLACE(`fieldname`, 'Ã¶', 'ö');
Now I just need to find a complete list of alle characters that are messed up. I tried a MySQL query searching for field LIKE '%Ã%' but this returns me all records inside the database.
Google also just displays a couple of characters (mostly the 3 above) in some topics of other people that have had troubles, however it seems there's nowhere a complete list of these characters (or at least the most common) which I can use to find and replace all data for my client.
If anyone perhaps knows such location or is able to complete my list I will, in return, create a page containing these characters to help others (unless there's a list already which I'm not aware of somewhere ofcourse).
// EDIT:
it would be for the most common european characters such as é è ë, á à ä, ö ó ò, ï, ü and perhaps the ringel-S (German double S). Not so much for the spaning signs like ñ or ã, but if they are in a list somewhere that would be much appreciated aswel.
// EDIT 2:
I updated the MySQL database and tables using the 2 ALTER queries from the 1st part of this article: http://developer.loftdigital.com/blog/php-utf-8-cheatsheet. I DID NOT make use of the mb_ functions so far and didn't do any MB configuration as it seems.
The headers are all set to utf-8 in the files (I still have to check the headers for some ajax scripts tho, not sure if that's needed but it won't be harmfull doing so). And the files are all saved as UTF8 without BOM. Also PHPFreakMailer is updated by setting the charset to utf-8.
Bad enough, I'm still having these weird characters. I wasn't thinking they'd go away by theirself, but at least it was worth hoping so :-) So what's the final step I should take? Continuïng using the REPLACE query and changing all wierd characters manually?
Thanks in advance!

This is a bit crazy; what character set do you think "Ã¶" is in?
It looks like that's actually a correct UTF-8 sequence (since it's two bytes), you're just displaying it as ISO-8559-1.
Edit:
Based on your comment I think the following is going on:
I think (but really not 100% sure) that the correct UTF-8 binary sequence is stored in the database. But since the table is marked as ISO-8559-1, and you requested to automatically convert character set. So it thinks it's ISO-8559-1 (which looks like Ã¶), but then tries to convert that to UTF-8.
You should be able to verify this, if strlen('Ã¶') is 4, and not 2. If the length is indeed 2, your browser encoding somehow screws up.
To fix this, don't set the MySQL to encode the characters.
Option 2
The data could also be 'double encoded' in the table. To check this, simply also check the string length on the database. If the 'Ã¶' is 4 bytes long, this is the issue.
My advice in this case is to not try to make a big 'messed up character'-map. You should simply be able to 'utf8_decode' the string. Normally this function will output a ISO-8559-1 string, but in your case.. it should turn out to be the original valid UTF-8 string.
I hope this works!
Edit2
Ok so effectively what I believe has happened is Option 2. To put it in simple (php) terms:
$output = utf8_encode(utf8_encode('string'));
So one utf8_decode() should be enough.
Do test this before you run your migration scripts though :)

If they forced a character change, why is your database not converted? Are your tables still the old character set (see your phpMyAdmin on table information).
Is the data wrong if it shows up in your phpMyAdmin or only on your webpage? -> your names and collation should change, as well as headers and filetype (safe file as utf-8).
Or try:
ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
I would start replacing characters only if there are no options from within MySQL left.

Since you've tagged this question with "php", I assume you read the database and it's values with PHP? If so, please have a look at mb_convert_encoding if you no longer have control over the database.
The better solution would be to fix the inconsistency between the data and the tables characterset. Backup the database (just in case), and alter all tables and columns to UTF-8. Note: when using MySQL, it is not enough to alter the table's charset, you'll have to do this per column.

Why don't you use: ä = ä, ö = ö,...
Do htmlentities(); in php and it will convert all special characters into Entitys. I think this would be the easiest way to do it.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.