I'm stuck on this. I was previously using PHP 5 and have now moved to PHP 7.
The problem is that when I echo a value from the database, it comes back with weird special characters, whereas on my previous site it displayed normally. Is it a UTF-8 problem? I tried the UTF-8 meta tag and changed the SQL collation to utf8_unicode_ci, but that didn't help at all.
It returns this:
 — 
What I want it to return:
—
What you get from the database is a UTF8 encoded string.
The characters you see are that UTF-8 string interpreted with a Western (Windows Latin 1) encoding.
If you include that string in a web page whose character set is Latin 1 then you'll see the string you posted; if the character set is UTF-8 then you should see the correct characters (without need to convert them into HTML entities).
As the latter is not your case you can proceed as follows:
Suppose the characters you see are stored in the variable $string. You can get HTML entities with mb_convert_encoding:
$html = mb_convert_encoding( $string, 'HTML-ENTITIES', 'UTF-8' );
This will result in:
&mdash;
Because the conversion leaves only characters in the ASCII range, the resulting string is suitable for any destination character encoding.
Note that, according to the above, even the dash — is converted (into &mdash;).
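On PHP 8.2 and later the 'HTML-ENTITIES' pseudo-encoding of mbstring is deprecated, so here is a rough alternative using mb_encode_numericentity(). It is only a sketch: the helper name toNumericEntities is made up for illustration, and it emits numeric entities (such as &#8212;) rather than named ones.

```php
<?php
// Hypothetical helper (not from the answer above): replaces every
// non-ASCII character in a UTF-8 string with a decimal numeric entity.
function toNumericEntities(string $utf8): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

// "\u{2014}" is the em dash written as an escape, so this file's own
// encoding does not matter.
$html = toNumericEntities("a \u{2014} b");

echo $html, "\n"; // a &#8212; b
```

The output is pure ASCII, so like the original one-liner it survives whatever character set the page declares.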
This is just a quick solution to the problem you faced.
I think the comment from Machavity:
"Take a minute and read stackoverflow.com/questions/279170/utf-8-all-the-way-through"
is good advice.
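To make the "UTF-8 read as Windows Latin 1" explanation concrete, the misreading can be reproduced in a few lines. This is only an illustration under the assumption that the affected character is the em dash; the variable names are mine.

```php
<?php
// UTF-8 bytes of the em dash (U+2014).
$utf8 = "\xE2\x80\x94";

// Simulate the misreading: treat each byte as a Windows-1252 character
// and re-encode the result as UTF-8 so it can be printed.
$mojibake = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');

// Applying the opposite conversion recovers the original bytes.
$restored = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');

echo bin2hex($utf8), "\n";     // e28094
echo $mojibake, "\n";          // three characters instead of one dash
echo bin2hex($restored), "\n"; // e28094
```

If the page really is served as UTF-8 and the connection to the database is UTF-8 as well, none of this conversion should be needed.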
I've searched around for a while and haven't yet found something that'll work for me. I am using a PHP form to submit data into SAP using the SAP DI API. I need to figure out which character set will actually allow me to store and work with Vietnamese characters.
UTF8 seems to work for a lot of the characters, but ô becomes ô. More importantly, there are character limits, and UTF-8 breaks character limits. If I have a string of 30 characters, it tells the API that it's more than 50. The same is true for storing in MySQL: if there's a varchar character limit, UTF-8 causes the string to go above it.
Unfortunately, when I search, UTF-8 seems to be the only thing people suggest for Vietnamese characters. If I don't encode the characters at all, they get stored as their HTML character codes. I've also tried ISO-8859-1 and converting into UCS-2 or UCS-4... I'm really at a loss. If anyone has experience working with Vietnamese characters, your help would be greatly appreciated.
UPDATE
It appears the issue may be with my WampServer on Windows. Here's a bit of code that is confusing me:
$str = 'VậTCôNG';
$str1 = utf8_encode($str);
if (mb_detect_encoding($str,"UTF-8",true) == true) {
    print_r('yes');
    if ($str1 == $str) {
        print_r('yes2');
    }
}
echo $str . $str1;
This prints "yes" but not "yes2", and $str . $str1 shows up as "VậTCôNGVáºTCôNG" in the browser.
I have my php.ini file with:
default_charset = "utf-8"
and my httpd.conf file with:
AddDefaultCharset UTF-8
and my php file I'm running has:
header("Content-type: text/html; charset=utf-8");
So I'm now wondering: if the original string was UTF-8, why wouldn't it equal a UTF-8 encoding of itself? And why is the UTF-8 encoding returning the wrong characters? Is something wrong in the WampServer configuration?
ô is the "Mojibake" for ô. That is, you do have UTF-8, but something in the code mangled it.
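In the question's code, the thing doing the mangling is most likely utf8_encode(), which assumes its input is ISO-8859-1 and re-encodes every byte. A small sketch of that effect, using mb_convert_encoding() as a stand-in for utf8_encode() (which is deprecated as of PHP 8.2); the variable names are illustrative:

```php
<?php
// "ô" as UTF-8, written as an escape: bytes c3 b4.
$original = "\u{F4}";

// Same effect as utf8_encode(): each byte is assumed to be an
// ISO-8859-1 character and is encoded to UTF-8 again.
$mangled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// The opposite conversion undoes one round of the damage.
$repaired = mb_convert_encoding($mangled, 'ISO-8859-1', 'UTF-8');

echo bin2hex($original), "\n"; // c3b4
echo bin2hex($mangled), "\n";  // c383c2b4
echo bin2hex($repaired), "\n"; // c3b4
```

That is why $str1 == $str is false: the "encoded" copy is a different, longer byte sequence.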
See Trouble with utf8 characters; what I see is not what I stored and search for Mojibake. It says to check these:
The bytes to be stored need to be UTF-8-encoded. Fix this.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
HTML should start with <meta charset=UTF-8>.
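For the connection item in that checklist, this is roughly what it looks like from PHP with PDO. Treat it as a sketch: the helper name, host, and database name are placeholders, and the actual connection is commented out because it needs a real server and credentials.

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that asks for utf8mb4
// on the connection. Host and database names are placeholders.
function utf8mb4Dsn(string $host, string $dbname): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

$dsn = utf8mb4Dsn('localhost', 'example_db');
echo $dsn, "\n"; // mysql:host=localhost;dbname=example_db;charset=utf8mb4

// Connecting (requires a running server, so left commented out):
// $pdo = new PDO($dsn, $user, $password, [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
//
// With mysqli the equivalent step is:
// $mysqli->set_charset('utf8mb4');
```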
It is possible to recover the data in the database, but it depends on details not yet provided.
http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
Each Vietnamese character takes 2-3 bytes when encoded in UTF-8. It is unclear whether the "hard 50" is really a character limit or a byte limit.
If you happen to have Mojibake's sibling "double encoding", then a Vietnamese character will take 4-6 bytes and feel like 2-3 characters. See "Test the data" in the first link.
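A quick way to see the character/byte difference from PHP, using the string from the question. This is just a sketch; the double-encoded value is simulated with mb_convert_encoding() rather than taken from a real table.

```php
<?php
// "VậTCôNG" written with escapes so the file encoding does not matter.
$name = "V\u{1EAD}TC\u{F4}NG";

$chars = mb_strlen($name, 'UTF-8'); // characters
$bytes = strlen($name);             // bytes

// Simulated "double encoding": the UTF-8 bytes are encoded to UTF-8 again
// as if they had been ISO-8859-1.
$double      = mb_convert_encoding($name, 'UTF-8', 'ISO-8859-1');
$doubleChars = mb_strlen($double, 'UTF-8');
$doubleBytes = strlen($double);

echo "$chars characters, $bytes bytes\n";                       // 7 characters, 10 bytes
echo "$doubleChars characters, $doubleBytes bytes (double)\n"; // 10 characters, 15 bytes
```

So a limit that is counted in bytes, or data that has been double-encoded, can make a 7-character name look considerably longer.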
An example of how to 'undo' Mojibake in MySQL:
CONVERT(BINARY(CONVERT('VáºTCôNG' USING latin1)) USING utf8mb4) --> 'VậTCôNG'
"Double encoding" is sort of like Mojibake twice. That is, one side treats the text as latin1 and the other as UTF-8, but twice.
VậTCôNG, as UTF-8, is hex 56e1baad5443c3b44e47. If that hex is treated as character set cp850 or keybcs2, the string is Vß║¡TC├┤NG.
Change it to VISCII.
Input: ô
Output: ô
You can test it at Charset converter.
I have a database in German. I fetch data from it and build an array. The problem is that some German characters are turned into special characters, so the array cannot be encoded as JSON. I also tried sending a header for German, but that only affects display, so it does not help.
Please check the array created below.
If I copy and paste the same string manually and create the array, it is created perfectly. Can you please help solve this?
After using
htmlentities($str, ENT_QUOTES, 'UTF-8', false);
both strings are displayed: one fetched from the database and one encoded using htmlentities. The last index of the array is empty, so htmlentities could not encode it.
You need to escape your string. Try this:
// $str, encode quotes, UTF-8 encoding, false (do not double encode).
echo htmlentities($str, ENT_QUOTES, 'UTF-8', false);
Replace $str with your string. This should fix your issue.
The question mark in a black diamond is often caused by
latin1 encoding of text in the client, and
SET NAMES latin1 or otherwise picking (or defaulting to) latin1 when connecting, and
(It does not matter what CHARACTER SET the table/columns is.), and
Saying <meta charset=UTF-8> in the html.
A simple fix is to change the last step to be <meta charset=ISO-8859-1>, thereby working only with latin1, not utf8. That's "ok" for Western Europe, but won't work for Asia.
The "right" fix is to go with utf8mb4 at all of these steps.
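As for the json_encode failure itself: if the connection is handing back latin1 bytes (an assumption, but it matches the symptoms above), json_encode() refuses them because they are not valid UTF-8. A minimal sketch with a made-up sample value:

```php
<?php
// Hypothetical value as it might arrive over a latin1 connection:
// "Müller" with ü as the single byte 0xFC.
$fromDb = "M\xFCller";

// Not valid UTF-8, so json_encode() gives up and returns false.
$failed = json_encode([$fromDb]);

// Convert to UTF-8 first (or, better, make the connection utf8mb4).
$asUtf8 = mb_convert_encoding($fromDb, 'UTF-8', 'ISO-8859-1');
$json   = json_encode([$asUtf8], JSON_UNESCAPED_UNICODE);

var_dump($failed);  // bool(false)
echo $json, "\n";   // ["Müller"]
```

Converting in PHP is only a stopgap; once the connection itself uses utf8mb4 the conversion step should go away.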
First of all, I have to say that I am new to multilingual conversions.
I have strings that I want to lowercase in UTF-8 form if possible (something like a clean URL), and I use
$str = iconv("UTF-8", "ASCII//TRANSLIT", utf8_encode($str));
$str = preg_replace("/[^a-zA-Z0-9_]/", "", $str);
$str = mb_strtolower($str);
to achieve my requirements (a UTF-8, lowercase string).
However, when I stress that function with "çokGüŞelLl" using CocoaRestClient, I get à as $str (thanks to my client?), and iconv triggers an error complaining about an illegal character in the input string (Ã).
What is the problem with iconv? The string is already encoded as UTF-8 by utf8_encode($str). How can it contain an illegal character?
Notes:
I read about #iconv questions here, but I think it is not a good solution to have empty database entries.
Thanks for all the answers; I will read and try to understand each of them.
The PHP function utf8_encode() expects your string to be ISO-8859-1 encoded. If it isn’t, well, you get funny results.
Ensure that your data is proper UTF-8 before saving it to your database:
// Validate that the input string is valid UTF-8
if (preg_match("//u", $string) === false) {
    throw new \InvalidArgumentException("String contains invalid UTF-8 characters.");
}
// Normalize to Unicode NFC form (recommended by W3C)
$string = \Normalizer::normalize($string);
Now everything is stored the same way in our database and we don't have to care about this problem anymore when receiving data from our database.
$string = $database->getSomeRecordWithUnicode();
echo mb_strtolower($string);
Done!
PS: If you want to ensure that your database is using the exact same encoding as PHP either use utf8mb4 as character set (and utf8mb4_unicode_ci as default collation for perfect sorting) or a BLOB (binary) data type.
PPS: Use your database configuration file to force proper encoding of all strings instead of using e.g. $mysqli->set_charset("utf8") or similar.
About HTML forms
Since you asked in the comments of your question: how data is sent to your server has nothing to do with the locale the user has set in their operating system. It has to do with the client's browser. All modern browsers default to UTF-8 when sending form data. If you are afraid that some of your clients might be using totally broken browsers, simply tell them that you only accept UTF-8. Drupal does that on all its forms.
<!doctype html>
<html>
<body>
<form accept-charset="UTF-8">
Now all browsers should encode the data they submit in utf-8.
If you encode çokGüŞelLl as UTF-8 you should get the following bytes:
var_dump( bin2hex('çokGüŞelLl') );
string(26) "c3a76f6b47c3bcc59e656c4c6c"
That's a check you must do. You also have this:
utf8_encode($str)
Your string contains Ş, which cannot be represented in ISO-8859-1 to begin with.
So, whatever reason you have to convert your original UTF-8 (as stored in DB) to ISO-8859-1, I'm afraid that it's corrupting your data.
You're double encoding. First you set your database to UTF-8, which means your data is already UTF-8 encoded. Then you call utf8_encode inside the iconv call, but your input is already UTF-8. Try removing the utf8_encode call from the iconv statement.
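Putting the answers together, a clean-URL helper without the utf8_encode() call could look roughly like this. It is a sketch under assumptions: the function name asciiSlug is made up, and what //TRANSLIT produces for characters such as Ş depends on the iconv implementation and locale, so no particular output is promised for non-ASCII input.

```php
<?php
// Hypothetical helper: reduces a UTF-8 string to lowercase [a-z0-9_].
function asciiSlug(string $utf8): string
{
    // Refuse input that is not valid UTF-8 instead of guessing.
    if (!mb_check_encoding($utf8, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }

    // No utf8_encode() here: the input is already UTF-8.
    // Transliteration results vary between iconv implementations.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $utf8);
    if ($ascii === false) {
        // Fall back to the raw input; the filter below then simply
        // drops every non-ASCII byte.
        $ascii = $utf8;
    }

    // Keep only letters, digits and underscore, then lowercase.
    $ascii = preg_replace('/[^a-zA-Z0-9_]/', '', $ascii);

    return strtolower((string) $ascii);
}

echo asciiSlug('Abc DEF_9'), "\n";                  // abcdef_9
echo asciiSlug("\u{E7}okG\u{FC}\u{15E}elLl"), "\n"; // result depends on iconv
```

Lowercasing after the ASCII conversion sidesteps multibyte case rules entirely; if the lowercase form must stay in UTF-8 instead, mb_strtolower($str, 'UTF-8') on the validated input is the other route.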
I am trying to sanitise database input and found a problem with the Ⓡ character.
Ⓡ converts to
&#9415;
Even with html_entity_decode around the variable.
This is a problem because the field is only meant to allow 4 characters in the database.
® Actually works though and is treated as a single character.
I have the same problem with Ⓒ vs ©.
As far as I know they are just html entities so should be decoded. However they aren't even encoded with htmlspecialchars(). It just echoes out the code
&#9415;
Does PHP have any built-in functions to solve this? Thanks
Edit just to say what I am trying to do:
I have text fields to input and add to a database which displays in a table below.
When I enter any other character like < > &, it enters straight into the database as one character.
I am trying to make Ⓡ and Ⓒ always go in as one character as well (instead of 6).
I am only encoding on output in the table so certain characters don't break the website.
The likely reason the entity doesn't decode with html_entity_decode is that the target character set passed to html_entity_decode is still ISO-8859-1 (the default before PHP 5.4). ISO-8859-1 cannot encode "Ⓡ" (the CIRCLED LETTER R), but it can encode "®" (the REGISTERED MARK).
So, first, to decode it correctly:
html_entity_decode('&#9415;', ENT_COMPAT, 'UTF-8')
But secondly, "Ⓡ" and "®" are not the same character, and you probably don't want "Ⓡ".
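To check the "one character in the database" requirement, something like the following can be used. It is a sketch that assumes the input arrives as the decimal entity &#9415; (code point U+24C7); the variable names are illustrative.

```php
<?php
// Assumed input: the numeric entity for the circled R (U+24C7).
$entity = '&#9415;';

// Decode with an explicit UTF-8 target so the character can be represented.
$decoded = html_entity_decode($entity, ENT_COMPAT, 'UTF-8');

$chars = mb_strlen($decoded, 'UTF-8'); // characters
$bytes = strlen($decoded);             // bytes

// htmlspecialchars() only touches & < > and quotes, so on output the
// decoded character passes through unchanged.
$escaped = htmlspecialchars($decoded, ENT_QUOTES, 'UTF-8');

echo "$chars character, $bytes bytes\n"; // 1 character, 3 bytes
echo bin2hex($escaped), "\n";            // e29387
```

Whether the 4-character column limit is met then depends on the column being a utf8mb4 (or utf8) character column, where length is counted in characters rather than bytes.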
I'm having some trouble with a string that comes from a webpage having foreign characters in it.
The string is generated by parsing the webpage using str_get_html(), followed by $htmldom->innertext; (simple_html_dom class library).
When I output the string using htmlentities() it is displayed fine; but using explode() on the string and printing the parts, I get a tilted block with a question mark in it for each foreign character.
I need to store the string in a utf8 MySQL database, so I need the right foreign characters.
My page has a header with utf8 character set.
I have already tried mb_split() and preg_split(), but those have the same problem.
I solved the issue with:
https://github.com/neitanod/forceutf8
It has a great function that just converts anything to utf-8, no matter what source it's from (as long as it comes in Latin1 (iso 8859-1), Windows-1252 or UTF8 already, or a mix of them).
Many thanks go to Sebastian Grignoli.
PHP and UTF-8 aren't a very good combination. Some functions work fine with UTF-8, others don't, and the worst are those that are documented to work but in fact do not (such as DOMDocument).
You can use mb_convert_encoding() to convert multibyte characters to HTML entities, which usually provides an acceptable workaround:
$string = mb_convert_encoding($string, 'HTML-ENTITIES', 'UTF-8');