PHP: utf-8 encode, htmlentities giving weird results - php

I'm trying to get data from a POST form. When the user inputs "habláis", it shows up in view source as just "habláis". I want to convert this to "habláis" for purposes of string comparison, but both utf8_encode() and htmlentities() are outputting habláis, and htmlspecialchars() does nothing. I would use str_replace but it won't recognize the á when it searches the string.
I'm using a charset of utf-8 consistently across pages. Any idea what's going on?

You are probably not specifying UTF-8 as the character set for the htmlentities() operation.

I'm not sure if this is your problem, but are you calling htmlentities with the UTF-8 parameter? I ask because that's not its default:
Like htmlspecialchars(), it takes an
optional third argument charset which
defines character set used in
conversion. Presently, the ISO-8859-1
character set is used as the default.
So you might want to try calling your function like this:
$output = htmlentities($input, ENT_COMPAT, 'UTF-8');
Does this solve your problem?

Related

iconv with ascii // transit triggers ErrorException: "iconv(): Detected an illegal character in input string"

First of all, I have to say that; I am a stranger of multilingual conversions.
I have strings that i want to mb_lowercase in UTF-8 form if possible (sth like clean url), and I use
$str = iconv("UTF-8", "ASCII//TRANSLIT", utf8_encode($str));
$str = preg_replace("/[^a-zA-Z0-9_]/", "", $str);
$str = mb_strtolower($str);
to achive my requirements (an UTF8, lowercase string)
However, when I stress that function with "çokGüŞelLl" using CocoaRestClient; I get à as $str (thanks to my client?) and iconv triggers an error complaining about an illegal character in input string (Ã).
What is the problem with iconv? the str is encoded as utf8 by utf8_encode($str) already. How can it be an illegal character?
Notes:
I read about #iconv questions here, but I think it is not a good solution to have empty database entries.
Thanks to all answers, I will read and try to understand each of them.
The PHP function utf8_encode() expects your string to be ISO-8859-1 encoded. If it isn’t, well, you get funny results.
Ensure that your data is proper UTF-8 before saving it to your database:
// Validate that the input string is valid UTF-8
if (preg_match("//u", $string) === false) {
throw new \InvalidArgumentException("String contains invalid UTF-8 characters.");
}
// Normalize to Unicode NFC form (recommended by W3C)
$string = \Normalizer::normalize($string);
Now everything is stored the same way in our database and we don't have to care about this problem anymore when receiving data from our database.
$string = $database->getSomeRecordWithUnicode();
echo mb_strtolower($string);
Done!
PS: If you want to ensure that your database is using the exact same encoding as PHP either use utf8mb4 as character set (and utf8mb4_unicode_ci as default collation for perfect sorting) or a BLOB (binary) data type.
PPS: Use your database configuration file to force proper encoding of all strings instead of using e.g. $mysqli->set_charset("utf8") or similar.
About HTML forms
Because you asked in the comments of your question. How data is sent to your server has nothing to do with the locale the user has set in his operating system. It has to do with the client's browser. All modern browsers default to utf-8 when sending form data. If you are afraid that some of your clients might be using totally broken browsers, simply tell them that you only accept utf-8. Drupal is doing that on all their forms.
<!doctype html>
<html>
<body>
<form accept-charset="UTF-8">
Now all browsers should encode the data they submit in utf-8.
If you encode çokGüŞelLl as UTF-8 you should get the following bytes:
var_dump( bin2hex('çokGüŞelLl') );
string(26) "c3a76f6b47c3bcc59e656c4c6c"
That's a check you must do. You also have this:
utf8_encode($str)
Your string contains Ş, which cannot be represented in ISO-8859-1 to begin with.
So, whatever reason you have to convert your original UTF-8 (as stored in DB) to ISO-8859-1, I'm afraid that it's corrupting your data.
You're double encoding. First you set your database to UTF-8. That means your data is now UTF-8 encoded. Then you use utf8_encode on the iconv-function. But your input is already UTF-8. Try removing your utf8_encode statement from iconv.

decoding ISO characters

I got Chinese characters encoded in ISO-8859-1, for example 兼 = 兼
Those characters are taken form the database using AJAX and sent by Json using json_encode.
I then use the template Handlebars to set the data on the page.
When I look at the ajax page the characters are displayed correctly, the source is still encoded.
But the final result displays the encrypted characters.
I tried to decode on the javascript part with unescape but there is no foreach with the template that gives me the possibility to decode the specific variable, so it crashes.
I tried to decode on the PHP side with htmlspecialchars_decode but without success.
Both pages are encoded in ISO-8859-1, but I can change them in UTF8 if necessary, but the data in the database remains encoded in ISO-8859-1.
Thank you for your help.
You're simply representing your characters in HTML entities. If you want them as "actual characters", you'll need to use an encoding that can represent those characters, ISO-8859 won't do. htmlspecialchars_decode doesn't work because it only decodes a handful of characters that are special in HTML and leaves other characters alone. You'll need html_entity_decode to decode all entities, and you'll need to provide it with a character set to decode to which can handle Chinese characters, UTF-8 being the obvious best choice:
$str = html_entity_decode($str, ENT_COMPAT, 'UTF-8');
You'll then need to make sure the browser knows that you're sending it UTF-8. If you want to store the text in the database in UTF-8 as well (which you really should), best follow the guide How to handle UTF-8 in a web app which explains all the pitfalls.
Are you including your text with the "double-stache" Handlebars syntax?
{{your expression}}
As the Handlebars documentation mentions, that syntax HTML-escapes its output, which would cause the results you're mentioning, where you're seeing the entity 兼 instead of 兼.
Using three braces instead ("triple-stache") won't escape the output and will let the browser correctly interpet those numeric entities:
{{{your expression}}}

How to use PHP htmlentities()?

In my project I currently use htmlentities() to filter data coming from the database:
echo htmlentities($variable_name);
I am in the USA and this works fine for me. My friend is in Brazil and for him some text characters don't show up correctly.
How can I use htmlentities() so it internationalizes properly?
The problem could be that the output is not encoded in UTF-8. According to the php docs for htmlentities, the function
takes an optional third argument
charset which defines character set
used in conversion. Presently, the
ISO-8859-1 character set is used as
the default.
So you can try calling
htmlentities($string, ENT_COMPAT, 'UTF-8');
instead, and that might fix the problem, since it's not the default character encoding.
While I suspect Keoki has it correct, another possible problem could be the font. If using a special character where your friend's font doesn't contain that character, they'll just see the missing character sign. In the webpage or whatever medium you are using to post the character, be sure that a font is set, as there's no guarentees on the default font working.
If neither of these be the case though, what is an example character that isn't showing up? Can you post the full code you are using?
You can also try iconv to Convert string to requested character encoding
http://www.php.net/manual/en/function.iconv.php

Convert foreign characters with accents

I'm trying to compare some text to the text in a database. In the database any text with an accent is encoded like in HTML (i.e. é) when I compare the database text to my string it doesn't match because my string just shows é. When I use the PHP function htmlentities to encode the string first the é turns into é weird? Using htmlspecialchars doesn't encode the é at all.
How would you suggest I compare é to é as well as all the other accented characters?
You need to send in the correct charset to htmlentities. It looks like you're using UTF-8, but the default is ISO-8859-1. Change it like this:
$encoded = htmlentities($text, ENT_COMPAT, 'UTF-8');
Another solution is to convert the text to ISO-8859-1 before encoding, but that may destroy information (ISO-8859-1 does not contain nearly as many characters as UTF-8). If you want to try that instead, do like this:
$encoded = htmlentities(utf8_decode($text));
I'm working on french site, and I also had same problem. This is the function that I use.
function convert_accent($string)
{
return htmlspecialchars_decode(htmlentities(utf8_decode($string)));
}
What it does it decodes your string to utf8, than converts everything HTML entities. even tags. But we want to convert tags back to normal, than htmlspecialchars_decode will convert them back. So in the end you will get a string with converted accents without touching tags.
You can use pass through this function your email content before sending it to recipent.
Another issue you might face is that, sometimes with this function the content from database converts to ? . In this case you should do this before running your query:
mysql_query("SET NAMES `utf8`");
But you might need to do it, it depends on encoding in your table. I hope it helps.
The comparing task is related to the charset and the collation you selected when you create the database or the tables. If you are saving strings with a lot of accents like spanish I sugget you to use charset uft8 and the collation could be the more accurate to the language(english, french or whatever) you're using.
The best thing of using the correct charset in the database is that you can save the string in natural way e.g: my name I can store it as is "Mario Juárez" and I have no need of doing some weird conversions.
Ran into similar issues recently. Followed Emil's answer and it worked fine locally but not on our dev/stage environments. I ended up using this and it worked all around:
$title = html_entity_decode(utf8_decode($item));
Thanks for leading me in the right direction!

utf-8 and htmlentities in RSS feeds

I'm writing some RSS feeds in PHP and stuggling with character-encoding issues. Should I utf8_encode() before or after htmlentities() encoding? For example, I've got both ampersands and Chinese characters in a description element, and I'm not sure which of these is proper:
$output = utf8_encode(htmlentities($source)); or
$output = htmlentities(utf8_encode($source));
And why?
It's important to pass the character set to the htmlentities function, as the default is ISO-8859-1:
utf8_encode(htmlentities($source,ENT_COMPAT,'utf-8'));
You should apply htmlentities first as to allow utf8_encode to encode the entities properly.
(EDIT: I changed from my opinion before that the order didn't matter based on the comments. This code is tested and works well).
First: The utf8_encode function converts from ISO 8859-1 to UTF-8. So you only need this function, if your input encoding/charset is ISO 8859-1. But why don’t you use UTF-8 in the first place?
Second: You don’t need htmlentities. You just need htmlspecialchars to replace the special characters by character references. htmlentities would replace “too much” characters that can be encoded directly using UTF-8. Important is that you use the ENT_QUOTES quote style to replace the single quotes as well.
So my proposal:
// if your input encoding is ISO 8859-1
htmlspecialchars(utf8_encode($string), ENT_QUOTES)
// if your input encoding is UTF-8
htmlspecialchars($string, ENT_QUOTES, 'UTF-8')
Don't use htmlentities()!
Simply use UTF-8 characters. Just make sure you declare encoding of the feed in HTTP headers (Content-Type:application/xml;charset=UTF-8) or failing that, in the feed itself using <?xml version="1.0" encoding="UTF-8"?> on the first line.
It might be easier to forget htmlentities and use a CDATA section. It works for the title section, which doesn't seem support encoded HTML characters in Firefox's RSS viewer:
<title><![CDATA[News & Updates " > » ☂ ☺ ☹ ☃ Test!]]></title>
You want to do $output = htmlentities(utf8_encode($source));. This is because you want to convert your international characters into proper UTF8 first, and then have ampersands (and possibly some of the UTF-8 characters as well) turned in to HTML entities. If you do the entities first, then some of the international characters may not be handled properly.
If none of your international characters are going to be changed by utf8_encode, then it doesn't matter which order you call them in.
After much trial & error, I finally found a way to properly display a string from a utf8-encoded database value, through an xml file, to an html page:
$output = '<![CDATA['.utf8_encode(htmlentities($string)).']]>';
I hope this helps someone.

Categories