There are symbols like Â and so on in database, what to do?

I have a few symbols in my description like Â â € and so on. Can I do anything about it? Or, if they are already in the database, is there nothing I can do now?

It sort of depends what the problem actually is...
If it's that those characters are supposed to be there (such as "Mañana" in Spanish) then you'll need to ensure everything is in UTF-8... the best way is to:
1: check the database tables are in "utf-8" encoding (if not convert them to utf-8)
2: as Martin noted, ensure the database connector is utf-8 using something like:
mysql_set_charset('utf8'); //note that MySQL uses no hyphen here
3: ensure the document is utf-8 (you can add a header at the top)
<?php header('Content-type:text/html;charset=utf-8'); ?>
4: just to be on the safe side, add it in as a meta tag as well
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
HOWEVER
It's quite possible you've got some duff characters in the database where something like ISO-8859-1 has been juggled to UTF-8, badly. In this case you'll notice things like Â£ where what you actually want is £ (UTF-8 encodes £ as two bytes, 0xC2 0xA3, and if those bytes are read as ISO-8859-1, each one becomes a separate character: Â and £).
In which case your best bet is to clean the database (you could probably do something like UPDATE table SET field = REPLACE(field, 'Â£', '£') for common "errors") and then convert the whole kaboodle to UTF-8 (as outlined above) to avoid the problem recurring.
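To see how a single £ turns into Â£, here is a minimal PHP sketch, assuming the mbstring extension is available (the variable names are only illustrative). It misreads the UTF-8 bytes of £ as ISO-8859-1 and then reverses the mistake. It undoes exactly one round of mis-decoding, so treat it as an illustration rather than a general repair tool.

```php
<?php
// UTF-8 bytes of the pound sign, written as escapes so the example
// does not depend on the encoding of this source file.
$pound = "\xC2\xA3";

// Simulate the bad conversion: treat the two UTF-8 bytes as two
// ISO-8859-1 characters and encode each of them as UTF-8 again.
$mojibake = mb_convert_encoding($pound, 'UTF-8', 'ISO-8859-1');

// Undo it: going from UTF-8 back to ISO-8859-1 restores the original bytes.
$repaired = mb_convert_encoding($mojibake, 'ISO-8859-1', 'UTF-8');

echo bin2hex($mojibake), "\n"; // bytes of "Â£" in UTF-8
echo bin2hex($repaired), "\n"; // bytes of "£" in UTF-8
```

Running the reverse step on data that was never double-encoded would damage it, so it is worth checking a few rows by hand first.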

To avoid having such characters,
Set the charset for your form. HTML forms have an accept-charset attribute; you can set it to UTF-8
Set Charset for the Document, via PHP or using META tags ( but this only works on the output )
set Charset for the db table
get a class/function to do ascii character conversion as part of your data filtering and escaping

Related

$_POST will convert from utf-8 to Ã¤ Ã¶ Ã¼ etc

I am new here, so I apologize if I am doing anything wrong.
I have a form which submits user input onto another page. User is expected to type ä, ö, é, etc... I have placed all of the following in the document:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
header('Content-Type:text/html; charset=UTF-8');
<form action="whatever.php" accept-charset="UTF-8">
I even tried:
ini_set('default_charset', 'UTF-8');
When the other page loads, I need to check what the user input with something like:
if ( $_POST['field'] == $check ) {
...
}
But if he inputs something like 'München', PHP will compare 'MÃ¼nchen' with 'München' and will never trigger TRUE even though it should. Since UTF-8 is specified everywhere, I am guessing that the server is converting to something else (Windows-1252, as I read on another thread) because it does not support, or is not configured for, UTF-8. I am using Apache on a local server before I load into production; I have not changed (and don't know how to change) any of the default settings. I've been working on Windows 7, editing with Notepad++ and encoding my files in ANSI. If I bin2hex('München') I get '4dc3bc6e6368656e'.
If I echo $_POST['field']; it displays 'München' correctly.
I have researched everywhere for an explanation, all I find is that I should include those tags/headings I already have.
Any help is much appreciated.

You are facing many different problems at the same time, so let's start with the simplest one.
Problem 1) You say that echo $_POST['field']; will display it correctly? What do you mean with "display"? It can be displayed correctly in two cases:
either the field is in UTF-8 and your page has been declared as UTF-8 and the browser is displaying it as UTF-8 or,
the field is in Latin-1 and the browser has decided (through the auto-detection heuristics) that your page is in Latin-1.
So, the fact that echo $_POST['field']; is correct tells you nothing.
Problem 2) You are using
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
header('Content-Type:text/html; charset=UTF-8');
Is this PHP code? If it is, it is an error, because headers must be set before any byte of output is sent. Called after the <meta> tag has already been output, header() will not set the Content-Type header, and PHP should generate a warning.
Problem 3) You are using
<form action="whatever.php" accept-charset="UTF-8">
Some browsers (IE, mostly) ignore accept-charset if they can coerce the data to be sent in ASCII or ISO Latin-1. So the data will either be in UTF-8 but declared as ISO Latin-1, or be declared as ISO Latin-1 and sent as ISO Latin-1 (but this second case is not your case).
Have a look at https://stackoverflow.com/a/8547004/449288 to see how to solve this problem.
Problem 4) Which strings are you comparing? For example, if you have
$city = "München";
$_POST['city'] == $city;
The result of this code will depend on the encoding of the PHP file. If the file is encoded in ISO Latin-1 and the $_POST correctly contains UTF-8 data, the == will compare different bytes and will return false.
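The byte-level difference behind Problem 4 can be sketched as follows. This is a minimal example, assuming mbstring is available and that the "ANSI" source file is Windows-1252; the strings are written as byte escapes so the result does not depend on how this snippet itself is saved, and the variable names are illustrative.

```php
<?php
// "München" as UTF-8 bytes (what a UTF-8 form submission would contain).
$fromPost = "M\xC3\xBCnchen";
// "München" as Windows-1252 bytes (what an ANSI-encoded source file would contain).
$fromSource = "M\xFCnchen";

var_dump($fromPost == $fromSource); // the byte sequences differ

// Convert the hardcoded literal to UTF-8 before comparing.
$normalized = mb_convert_encoding($fromSource, 'UTF-8', 'Windows-1252');
var_dump($fromPost === $normalized);
```

Saving the source file itself as UTF-8 makes the conversion step unnecessary.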

Another solution that may be helpful: in Apache, you can place a directive called AddDefaultCharset in your configuration file (httpd.conf) or in .htaccess. It looks like this:
AddDefaultCharset utf-8
http://httpd.apache.org/docs/2.0/mod/core.html#adddefaultcharset
That will override any other default charsets.

I changed the setting to "mbstring.detect_order = pass" in my php.ini file and it worked.

I've used Unicode characters in my forms and files many times, and I have not had any problems so far.
Try to do these steps and check the result:
Remove header('Content-Type:text/html; charset=UTF-8'); from your HTML form codes.
Use your form just like <form action="whatever.php"> without accept-charset="UTF-8". (It's better to insert the method of sending data in your form tag).
In target page (whatever.php), insert again <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> in a <head> tag.
I have always done my projects the way I described here, and I have not had any problems with Unicode strings.

This is due to the character encoding of the PHP file(s).
The hardcoded München is stored with the character encoding of the source file(s), in this case ANSI and when that value is compared to the UTF-8 encoded value provided in the $_POST variable, the two will, quite naturally, differ.
The solution to your problem is one of:
1: Serve and process content with the same encoding as that of the source file(s), in this case likely to be windows-1252.
This would, for starters, include changing the content="text/html; charset=UTF-8" to content="text/html; charset=windows-1252" whenever serving HTML data.
2: Avoid all hardcoded values that could be affected by character encoding issues between UTF-8 and windows-1252; more or less, only hardcode values that contain nothing but English letters and numbers.
Any UTF-8 values would have to be read from a source that ensures they are UTF-8 encoded (for instance a database set to use UTF-8 as storage encoding as well as connection encoding).
3: Wrap all hardcoded assignments in utf8_encode(), for instance $value = utf8_encode('München');
4: Change the encoding of the source file(s) to UTF-8.
This can be accomplished in any number of ways: a decent text editor will be able to do it, or the outstanding libiconv can be used, especially for batch processing.
Either solution 1 or 4 would be my preferred solution, especially if multiple people are involved in the project.
As a side-note, some text editors (notably Notepad++) have the option of using either UTF-8 or UTF-8 without BOM. The BOM (Byte Order Mark) is pointless in UTF-8 and will cause problems when writing headers in PHP (most often when doing a redirect). This is because the BOM sits right in front of the initial <?php, causing the server to send the BOM just as it would if there were any other character in front. The difference is that you'd notice a character in front, but the BOM isn't displayed.
Rule of thumb: Always use UTF-8 without BOM.
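For text that has already been saved with a BOM, a small check can detect and remove it. This is a sketch only; `UTF8_BOM` and `stripUtf8Bom()` are hypothetical names made up for this example.

```php
<?php
// The UTF-8 byte order mark is the three bytes EF BB BF.
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: returns the text without a leading UTF-8 BOM.
function stripUtf8Bom(string $text): string
{
    if (strncmp($text, UTF8_BOM, 3) === 0) {
        return substr($text, 3);
    }
    return $text;
}

$withBom = UTF8_BOM . "hello";
$clean = stripUtf8Bom($withBom);
echo bin2hex($clean), "\n"; // hex of "hello" without the BOM bytes
```

This helps with data files read at runtime; for PHP source files the BOM still has to be removed in the editor, because it is sent before any code runs.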

Arabic Character Encoding Issue: UTF-8 versus Windows-1256

Quick Background: I inherited a large SQL dump file containing a combination of English and Arabic text and (I think) it was originally exported using 'latin1'. I changed all occurrences of 'latin1' to 'utf8' prior to importing the file. The Arabic text didn't appear correctly in phpMyAdmin (which I guess is normal), but when I loaded the text to a web page with the following...
<meta http-equiv='Content-Type' content='text/html; charset=windows-1256'/>
...everything looked good and the arabic text displayed perfectly.
Problem: My client is really really really picky and doesn't want to change his...
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
...to the 'Windows-1256' equivalent. I didn't think this would be a problem, but when I changed the charset value to 'UTF-8', all of the arabic characters appeared as diamonds with question marks. Shouldn't UTF-8 display arabic text correctly?
Here are a few notes about my database configuration:
Database charset is 'utf8'
Database connection collation is 'utf8_general_ci'
All databases, tables, and applicable fields have been collated as 'utf8_general_ci'
I've been scouring Stack Overflow and other forums for anything that relates to my issue. I've found similar problems, but none of the solutions seem to work for my specific situation. Hope someone can help!

If the document looks right when declared as windows-1256 encoded, then it most probably is windows-1256 encoded. So it was apparently not exported using latin1—which would have been impossible, since latin1 has no Arabic letters.
If this is just about a single file, then the simplest way is to convert it from windows-1256 encoding to utf-8 encoding, using e.g. Notepad++. (Open the file in it, change the encoding, via File format menu, to Arabic, windows-1256. Then select Convert to UTF-8 in the File format menu and do File → Save.)
Windows-1256 and UTF-8 are completely different encodings, so data gets all messed up if you declare windows-1256 data as UTF-8 or vice versa. Only ASCII characters, such as English letters, have the same representation in both encodings.

We can't find the error in your code if you don't show us your code, so we're very limited in how we can help you.
You told the browser to interpret the document as being UTF-8 rather than Windows-1256, but did you actually change the encoding used from Windows-1256 to UTF-8?
For example,
$ cat a.pl
use strict;
use warnings;
use feature qw( say );
use charnames ':full';
my $enc = $ARGV[0] or die;
binmode STDOUT, ":encoding($enc)";
print <<"__EOI__";
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=$enc">
<title>Foo!</title>
</head>
<body dir="rtl">
\N{ARABIC LETTER ALEF}\N{ARABIC LETTER LAM}\N{ARABIC LETTER AIN}\N{ARABIC LETTER REH}\N{ARABIC LETTER BEH}\N{ARABIC LETTER YEH}\N{ARABIC LETTER TEH MARBUTA}
</body>
</html>
__EOI__
$ perl a.pl UTF-8 > utf8.html
$ perl a.pl Windows-1256 > cp1256.html

I think you need to go back to square one. It sounds like you have a database dump in Win-1256 encoding and you want to work with it in UTF-8 from now on. It also sounds like you are using PHP but you have lots of irrelevant tags on your question and are missing the most important one, PHP.
First, you need to convert the text dump into UTF-8 and you should be able to do that with PHP. Chances are that your conversion script will have two steps, first read the Win-1256 bytes and decode them into internal Unicode text strings, then encode the Unicode text strings into UTF-8 bytes for output to a new text file.
Once you have done that, redo the database import as you did before, but now you have correctly encoded the input data as UTF-8.
After that it should be as simple as reading the database and rendering a web page with the correct UTF-8 encoding.
P.S. It is actually possible to reencode the data every time you display it, but that does not solve the problem of having a database full of incorrectly encoded data.
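The conversion step described above could look roughly like this in PHP. It is a sketch under assumptions: it uses iconv() (whether the WINDOWS-1256 name is accepted depends on the iconv library PHP was built against), and `cp1256ToUtf8()` and the file names in the comment are placeholders, not anything from the question.

```php
<?php
// Hypothetical helper: decode Windows-1256 bytes and re-encode them as UTF-8.
function cp1256ToUtf8(string $bytes): string
{
    $utf8 = iconv('WINDOWS-1256', 'UTF-8', $bytes);
    if ($utf8 === false) {
        throw new RuntimeException('Input could not be converted from Windows-1256');
    }
    return $utf8;
}

// 0xC7 0xE1 are ALEF and LAM in Windows-1256.
$sample = cp1256ToUtf8("\xC7\xE1");
echo bin2hex($sample), "\n";

// Usage on a dump file (the file names are placeholders):
// file_put_contents('dump-utf8.sql', cp1256ToUtf8(file_get_contents('dump.sql')));
```

ASCII parts of the dump, such as the SQL keywords, pass through unchanged because they have the same bytes in both encodings.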

In order to display Arabic characters correctly, you need to convert your PHP file to UTF-8 without BOM.
This happened to me: Arabic characters were displayed as diamonds, and converting the file to UTF-8 without BOM solved the problem.

It seems that the db is configured as UTF8, but the data itself is in an extended-ASCII (single-byte) encoding. If the data is converted to UTF8, it will display correctly on a page whose content type is set to UTF8.

Problems with utf-8 encoding in php

Another utf-8 related problem I believe...
I am using php to update data in a mysql db and then display that data elsewhere in the site. I have run into utf-8 problems before, where special characters are displayed as question marks when viewed in a browser, but this one seems slightly different.
I have a number of records to enter that contain the è character. If I enter this directly in the db then it appears correctly on the page so I take this to mean that utf-8 content is being output correctly.
However when I try and update the values in the db through php, then the è character is replaced. What appears instead is & Atilde ; & uml ; (without the spaces) which appears in the browser as Ã¨
I have the tables in the database set to use UTF-8. I believe this is correct because, as mentioned, if I update the db through phpMyAdmin, it's all ok. Similarly I have set the character encoding for the page, which seems to be correct. I am also running the sql statement "SET NAMES 'utf8';" before trying to update the db.
Anyone have any other ideas as to where the problem may lie?
Many thanks

Yup.
The character you have is LATIN SMALL LETTER E WITH GRAVE. As you can see, in UTF-8 that character is encoded into two bytes 0xC3 and 0xA8.
But in many default, western encodings (such as ISO-8859-1) which are single-byte only, this multi-byte character is decoded as two separate characters, LATIN CAPITAL LETTER A WITH TILDE and DIAERESIS. Notice how they are both encoded as C3 and A8 in ISO-8859-1?
Furthermore, it looks like PHP is processing these characters through htmlentities(), which results in &Atilde; and &uml; respectively.
So, where exactly is the problem in your code? Well, htmlentities() could be doing it all by itself, since its 3rd argument is an encoding name, which you may not have properly set to 'UTF-8'. But it could be some other string processing function as well. (Note: As a general rule, it's a bad idea to store HTML entities in the database. This step should be reserved for time of display.)
There are a bunch of other ways to trip yourself up with UTF-8 in php - I suggest hitting up the cheatsheet and make sure you're in good shape.
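The effect of the encoding argument can be reproduced in a few lines. This is a minimal sketch; the è is written as byte escapes so the result does not depend on how the file is saved.

```php
<?php
// UTF-8 bytes of è (LATIN SMALL LETTER E WITH GRAVE).
$e = "\xC3\xA8";

// Wrong: the bytes are treated as two ISO-8859-1 characters.
$wrong = htmlentities($e, ENT_QUOTES, 'ISO-8859-1');

// Right: the bytes are treated as one UTF-8 character.
$right = htmlentities($e, ENT_QUOTES, 'UTF-8');

echo $wrong, "\n"; // &Atilde;&uml;
echo $right, "\n"; // &egrave;
```

Passing the encoding explicitly avoids depending on the PHP version's default for that argument.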

Well, it is your own code that converts the characters into entities.
To make it right:
Ban the htmlentities function from your scripts forever.
Use htmlspecialchars, not on insert but when displaying data.
Repair the existing data in the database using html_entity_decode.

I suppose you're taking the results of some form submission and inserting the results in the database. If so, you must ensure that you instruct the browser to send UTF-8 data and you should validate the user input for a valid UTF-8 stream.
Change your form element to include accept-charset:
<form accept-charset="utf-8" method="post" ... >
<input type="text" name="field" />
...
</form>
Validate the data with:
$valid = array_key_exists("field", $_POST) && !is_array($_POST['field']) &&
preg_match('//u', $_POST['field']) && ...; //check length with mb_strlen etc.
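The same checks can be wrapped in a small function. This is a sketch that swaps preg_match('//u', ...) for mb_check_encoding(), assuming mbstring is available; `validUtf8Field()` and its parameters are hypothetical names for this example.

```php
<?php
// Hypothetical helper: returns the field if it is a valid UTF-8 string of
// at most $maxChars characters, otherwise null.
function validUtf8Field(array $input, string $name, int $maxChars): ?string
{
    if (!array_key_exists($name, $input) || !is_string($input[$name])) {
        return null;
    }
    $value = $input[$name];
    if (!mb_check_encoding($value, 'UTF-8')) {
        return null;
    }
    // Count characters, not bytes.
    return mb_strlen($value, 'UTF-8') <= $maxChars ? $value : null;
}

// Usage with a form submission: $field = validUtf8Field($_POST, 'field', 100);
```

Returning null for every failure keeps the calling code to a single check.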

I think you miss Content-Type declaration on the html page:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
If you don't have it, the browser will guess the encoding, and convert any characters outside of that encoding to entities when posting a form.

PHP Strange character before £ sign?

For some reason i get a Â£76756687 weird character when i type a £ into a text field on my form?

As you suspect, it's a character encoding issue - is the page set to use a charset of UTF-8? (You can't go wrong with this encoding really.) Also, you'll probably want to entity encode the pound symbol on the way out (&pound;)
As an example character set (for both the form page and HTML email) you could use:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
That said, is there a good reason for the user to have to enter the currency symbol? Would it be a better idea to have it as either a static text item or a drop down to the left of the text field? (Feel free to ignore if I'm talking arse and you're using a freeform textarea or summat.)

You’re probably using UTF-8 as character encoding but don’t declare your output correctly: the £ character (U+00A3) is encoded in UTF-8 as the bytes 0xC2 0xA3, and that byte sequence represents the two characters Â and £ when interpreted as ISO 8859-1.
So you just need to specify your character encoding correctly. In PHP you can use the header function to set the proper value for Content-Type header field like:
header('Content-Type: text/html;charset=utf-8');
But make sure that you call this function before any output. Otherwise the HTTP header is already sent and you cannot modify it.

PHP output showing little black diamonds with a question mark

I'm writing a php program that pulls from a database source. Some of the varchars have quotes that are displaying as black diamonds with a question mark in them (�, REPLACEMENT CHARACTER, I assume from Microsoft Word text).
How can I use php to strip these characters out?

If you see that character (� U+FFFD "REPLACEMENT CHARACTER") it usually means that the text itself is encoded in some form of single byte encoding but interpreted in one of the unicode encodings (UTF8 or UTF16).
If it were the other way around it would (usually) look something like this: Ã¤.
Probably the original encoding is ISO-8859-1, also known as Latin-1. You can check this without having to change your script: Browsers give you the option to re-interpret a page in a different encoding -- in Firefox use "View" -> "Character Encoding".
To make the browser use the correct encoding, add an HTTP header like this:
header("Content-Type: text/html; charset=ISO-8859-1");
or put the encoding in a meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
Alternatively you could try to read from the database in another encoding (UTF-8, preferably) or convert the text with iconv().

I also faced this � issue. Meanwhile I ran into three cases where it happened:
substr()
I was using substr() on a UTF8 string which cut UTF8 characters, thus the cut chars could not be displayed correctly. Use mb_substr($utfstring, 0, 10, 'utf-8'); instead.
htmlspecialchars()
Another problem was using htmlspecialchars() on a UTF8 string. The fix is to use: htmlspecialchars($utfstring, ENT_QUOTES, 'UTF-8');
preg_replace()
Lastly I found out that preg_replace() can lead to problems with UTF. The code $string = preg_replace('/[^A-Za-z0-9ÄäÜüÖöß]/', ' ', $string); for example transformed the UTF string "F(×)=2×-3" into "F � 2� ". The fix is to use mb_ereg_replace() instead.
I hope this additional information will help to get rid of such problems.
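Two of these cases can be reproduced with a short sketch, assuming mbstring and PCRE with UTF-8 support. For preg_replace() it uses the /u pattern modifier, which is an alternative to the mb_ereg_replace() fix named above, and a simplified character class without the umlauts so the snippet does not depend on the encoding of the source file.

```php
<?php
$city = "M\xC3\xBCnchen"; // "München" in UTF-8

$cut = substr($city, 0, 2);              // splits the two-byte ü
$safe = mb_substr($city, 0, 2, 'UTF-8'); // keeps whole characters

var_dump(mb_check_encoding($cut, 'UTF-8'));  // false
var_dump(mb_check_encoding($safe, 'UTF-8')); // true

// "F(×)=2×-3" in UTF-8; the /u modifier makes PCRE match whole characters
// instead of single bytes, so each × is replaced by exactly one space.
$formula = "F(\xC3\x97)=2\xC3\x97-3";
$replaced = preg_replace('/[^A-Za-z0-9]/u', ' ', $formula);
var_dump(mb_check_encoding($replaced, 'UTF-8')); // true
```

Without /u, each byte of × is replaced separately, which is harmless here but corrupts the string as soon as the character class itself contains multi-byte characters.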

This is a charset issue. As such, it can have gone wrong on many different levels, but most likely, the strings in your database are utf-8 encoded, and you are presenting them as iso-8859-1. Or the other way around.
The proper way to fix this problem, is to get your character-sets straight. The simplest strategy, since you're using PHP, is to use iso-8859-1 throughout your application. To do this, you must ensure that:
All PHP source-files are saved as iso-8859-1 (Not to be confused with cp-1252).
Your web-server is configured to serve files with charset=iso-8859-1
Alternatively, you can override the webservers settings from within the PHP-document, using header.
In addition, you may insert a meta-tag in your HTML, that specifies the same thing, but this isn't strictly needed.
You may also specify the accept-charset attribute on your <form> elements.
Database tables are defined with encoding as latin1
The database connection between PHP and the database is set to latin1
If you already have data in your database, you should be aware that they are probably messed up already. If you are not already in production phase, just wipe it all and start over. Otherwise you'll have to do some data cleanup.
A note on meta-tags, since everybody misunderstands what they are:
When a web-server serves a file (A HTML-document), it sends some information, that isn't presented directly in the browser. This is known as HTTP-headers. One such header, is the Content-Type header, which specifies the mimetype of the file (Eg. text/html) as well as the encoding (aka charset).
While most webservers will send a Content-Type header with charset info, it's optional. If it isn't present, the browser will instead interpret any meta-tags with http-equiv="Content-Type". It's important to realise that the meta-tag is only interpreted if the webserver doesn't send the header. In practice this means that it's only used if the page is saved to disk and then opened from there.
This page has a very good explanation of these things.

As mentioned in earlier answers, it is happening because your text has been written to the database in iso-8859-1 encoding, or any other format.
So you just need to convert the data to utf8 before outputting it.
$text = "string from database";
$text = utf8_encode($text);
echo $text;

To make sure your MySQL connection is set to UTF-8 (or latin1, depending on what you're using), you can do this:
$con = mysql_connect("localhost","username","password");
mysql_set_charset('utf8',$con);
or use this to check what charset you are using:
$con = mysql_connect("localhost","username","password");
$charset = mysql_client_encoding($con);
echo "The current character set is: $charset\n";
More info here: http://php.net/manual/en/function.mysql-set-charset.php

I chose to strip these characters out of the string by doing this -
ini_set('mbstring.substitute_character', "none");
$text= mb_convert_encoding($text, 'UTF-8', 'UTF-8');

Just paste this code at the top of the page.
<?php
header("Content-Type: text/html; charset=ISO-8859-1");
?>

Based on your description of the problem, the data in your database is almost certainly encoded as Windows-1252, and your page is almost certainly being served as ISO-8859-1. These two character sets are equivalent except that Windows-1252 assigns 27 printable characters, including left and right curly quotes, to the 0x80-0x9F range, where ISO-8859-1 has only control codes.
Assuming my analysis is correct, the simplest solution is to serve your page as Windows-1252. This will work because all characters that are in ISO-8859-1 are also in Windows-1252. In PHP you can change the encoding as follows:
header('Content-Type: text/html; charset=Windows-1252');
However, you really should check what character encoding you are using in your HTML files and the contents of your database, and take care to be consistent, or convert properly where this is not possible.
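The difference between the two encodings shows up as soon as the curly-quote bytes are decoded. The sketch below assumes mbstring is available and converts to UTF-8 purely so the two results can be compared; the sample string is made up.

```php
<?php
// 0x93 and 0x94 are the left and right curly double quotes in Windows-1252.
$word = "\x93quoted\x94";

// Decoding as Windows-1252 yields the real quotation marks (U+201C, U+201D).
$asCp1252 = mb_convert_encoding($word, 'UTF-8', 'Windows-1252');

// Decoding as ISO-8859-1 yields invisible C1 control characters instead.
$asLatin1 = mb_convert_encoding($word, 'UTF-8', 'ISO-8859-1');

echo bin2hex($asCp1252), "\n";
echo bin2hex($asLatin1), "\n";
```

Converting the stored text to UTF-8 this way and serving the page as UTF-8 is an alternative to changing the header to Windows-1252.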

Apply this function to your variables
utf8_encode($your_variable);

Try This Please
mb_substr($description, 0, 490, "UTF-8");

This will help you. Put this inside <head> tag
<meta charset="iso-8859-1">

That can be caused by a unicode or other charset mismatch. Try changing the charset in your browser; in one of the settings the text will look OK. Then it's a question of how to convert your database contents to the charset you use for displaying. (Which can actually be just adding a utf-8 charset statement to your output.)

What I ended up doing, after I fixed my tables, was to back up the database and change the settings back to utf-8. Then I altered my dump file so that DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci are my character set entries.
Now I don't have character set issues anymore, because the database and browser are both utf8.
I figured out what caused it: the effect of the web page and browser on the DB. On the Linux terminals (Ubuntu + Firefox), entries were encoded in latin1, which is what the tables were set to. But on the Windows 10 + Edge terminals, the entries were forced into utf8. I also noticed that Windows 10 has issues staying with latin1, so I decided to bend with the wind and convert everything to utf8.
I figured it was a Windows 10 issue because the problem started when we began using Win 10 terminals.
So yet again Microsoft bugs cause issues. I still don't know why the encoding changes on the forms: the browser in Windows 10 shows the latin1 character set, but the data goes in utf8 encoded and I get the data anomaly. In Linux + Firefox it doesn't do that.

This happened to work in my case:
$text = utf8_decode($text);
It turns the black diamond character into a question mark, so you can do:
$text = str_replace('?', '', utf8_decode($text));

Just add these lines before headers.
Accurate format of .doc/docx files will be retrieved:
if(ini_get('zlib.output_compression'))
ini_set('zlib.output_compression', 'Off');
ob_clean();

When you extract data from anywhere you should use the functions with the mb_ prefix (mb_FUNC_NAME).
Had the same problem it helped me out.
Or you can find the code of this symbol and use regexp to delete these symbols.

You can also change the character set in your browser, just for debugging.

Using the same charset (as suggested here) in both the database and the HTML has not worked for me... So, remembering that the code is generated as HTML, I chose to use &quot; (HTML code) or &#34; (ISO Latin-1 code) in my database text where quotes were used. This solved the problem while providing me a quotation mark. It is odd to note that prior to this solution, only some of the quotation marks and apostrophes did not display correctly while others did; however, the special code did work in all instances.

I ran the "detect encoding" code after my collation change in phpmyadmin and now it comes up as Latin_1.
But here is something I came across while looking at a different data anomaly in my application, and how I fixed it:
I just imported a table that has mixed encoding (with diamond question marks in some lines, and all were in the same column.) so here is my fix code. I used utf8_decode process that takes the undefined placeholder and assigns a plain question mark in the place of the "diamond question mark " then I used str_replace to replace the question mark with a space between quotes.
Here is the code:
<?php
include 'dbconnectfile.php';
// the variable $db comes from my db connect file
// inx is my auto-increment column
// broke_column is the column I need to fix
$qwy = "select inx,broke_column from Table ";
$res = $db->query($qwy);
while ($data = $res->fetch_row()) {
    for ($m = 0; $m < $res->field_count; $m++) {
        if ($m == 0) {
            $id = $data[$m];
            echo $id;
        } else if ($m == 1) {
            $fix = $data[$m];
            // utf8_decode turns the undefined placeholder into a plain "?"
            $fix = utf8_decode($fix);
            $fixx = str_replace("?", " ", $fix);
            // I echoed the data to the screen because I like to see something as I execute it :)
            echo $fixx;
        }
    }
    // escape the value so quotes in the text do not break the query (assumes $db is a mysqli object)
    $insert = "UPDATE Table SET broke_column='" . $db->real_escape_string($fixx) . "' where inx='" . $id . "'";
    $insresult = $db->query($insert);
    echo "<br>";
}
?>

For global purposes.
Instead of converting, encoding, and decoding each text, I prefer to leave them as they are and change the server's PHP settings instead.
So:
Leave the diamonds as they are.
In the browser's View menu, select "Text Encoding" and find the encoding that lets you see your text correctly.
Edit your php.ini and add:
default_charset = "ISO-8859-1"
or, instead of ISO-8859-1, the encoding that fits your text.

Go to your phpmyadmin and select your database and just increase the length/value of that table's field to 500 or 1000 it will solve your problem.
