htmlentities and é (e acute) - php

I'm having a problem with PHP's htmlentities and the é character. I know it's some sort of encoding issue I'm just overlooking, so hopefully someone can see what I'm doing wrong.
Running a straight htmlentities("é") does not return the correct code as expected (either é or é. I've tried forced the charset to be 'UTF-8' (using the charset parameter of htmlentities) but the same thing.
The ultimate goal is to have this character sent in an HTML email encoded in 'ISO-8859-1'. When I try to force it into that encoding, same issue. In the source of the email, you see é, and in the HTML view é.
Who can shed some light on my mistake?

// I assume that your page is utf-8 encoded
header("Content-type: text/html;charset=UTF-8");
$in_utf8encoded = "é à ù è ò";
// first you need the convert the string to the charset you want...
$in_iso8859encoded = iconv("UTF-8", "ISO-8859-1", $in_utf8encoded);
// ...in order to make htmlentities work with the same charset
$out_iso8859= htmlentities($in_iso8859encoded, ENT_COMPAT, "ISO-8859-1");
// then only to display in your page, revert it back to utf-8
echo iconv("ISO-8859-1", "UTF-8", $out_iso8859);

I have added htmlspecialchars for you to see that it is really encoded
http://sandbox.phpcode.eu/g/11ce7/4
<?PHP
echo htmlspecialchars(htmlentities("é", ENT_COMPAT | ENT_HTML401, "UTF-8"));

I suggest you take a look at http://php.net/html_entity_decode . You can use this in the following way:
$eacute = html_entity_decode('é',ENT_COMPAT,'iso-8859-1');
This way you don't have to care about the encoding of the php file.
edit: typo

I fixed with
$string = htmlentities($string,ENT_QUOTES | ENT_SUBSTITUTE,"ISO-8859-1");

If you have stored the special characters as é, then you could use the following soon after making connection to the database.
mysql_set_charset('utf8', $dbHandler);
With this, you now don't need to use htmlentities while displaying data.

Related

Problems with special chars encoding with an access mdb database using php

My boss is forcing me to use an access mdb database (yes, I'm serious) in a php server.
I can connect it and retrieve data from it, but as you could imagine, I have problems with encodings because I want to work using utf8.
The thing is that now I have two "solutions" to translate Windows-1252 to UTF-8
This is the first way:
mb_convert_encoding($string, "UTF-8", "Windows-1252").
It works, but the problem is that special chars are not properly converted, for example char º is converted to \u00ba and char Ó is converted to \u00d3.
My second way is doing this:
mb_convert_encoding(mb_convert_encoding($string, "UTF-8", "Windows-1252"), "HTML-ENTITIES", "UTF-8")
It works too, but it happens the same, special chars are not correctly converted. Char º is converted to º
Does anybody know how to properly change encoding including special chars?
Or does anybody know how to convert from º and \u00ba to something readable?
I did simple test to convert codepoint to letters
<?php
function codepoint_decode($str) {
return json_decode(sprintf('"%s"', $str));
}
$string_with_codepoint = "Ahed \u00d3\u00ba\u00d3";
// $string_with_codepoint = mb_convert_encoding($string, "UTF-8", "Windows-1252");
$output = codepoint_decode($string_with_codepoint);
echo $output; // Ahed ÓºÓ
Credit go for this answer
I finally found the solution.
I had the solution from the beginning but I was doing my tests wrong.
My bad.
The right way to do it for me is mb_convert_encoding($string, "UTF-8", "Windows-1252")
But i was checking the result like this:
$stringUTF8 = mb_convert_encoding($string, "UTF-8", "Windows-1252");
echo json_encode($stringUTF8);
that's why it was returning unicode chars like \u20ac, if I would have done:
$stringUTF8 = mb_convert_encoding($string, "UTF-8", "Windows-1252");
echo $stringUTF8;
I should have seen the solution from the beginning but I was wrong. It was json_encode() what was turning special chars into unicode chars.
Thanks everybody for your help!!

PHP decode UTF-8 in URL ie &title=%C5%8Cyu to Ōyu not ÅŒyu

Please can you help me decode this URL so that it displays properly using PHP to output
This is the link
http://www.megalithic.co.uk/visits.php?op=site&sid=18341&title=Ōyu
I think it's actually coming through as UTF-8 - ie
&title=%C5%8Cyu
$title displays as ÅŒyu
How do I convert this in PHP? I need to use ISO-8859-1 on the page
None of these work
$title=iconv("UTF-8","ISO-8859-1",$title);
$title=iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $title);
$title = utf8_decode($title);
$title = urldecode($title);
Do I need to use the Multibyte MB extension and if so how?
Many thanks in advance
Andy
If that link is to your PHP page, and you get the value via $_GET['title'], then it's already decoded from the URL encoding and $_GET['title'] holds a UTF-8 encoded string with the character Ō. This character cannot be encoded in ISO-8859-1. If that is a strict requirement, you'll have to encode the character as HTML entity in order to express it in a strictly ISO-8859-1 encoded page:
echo htmlentities('Ō', ENT_COMPAT | ENT_HTML5, 'UTF-8');
The character "Ō" is not there in ISO-8859-1, so it is not possible to convert it from UTF-8 with any of the standard charset conversion functions.
It might, however, be possible to write a function that converts to numerical HTML encodings, like Ō for "Ō".

PHP, HTML and character encodings

I actually have a fairly simple question but I'm unable to find an answer anywhere. The PHP function html_entity_decode is supposed to "converts all HTML entities to their applicable characters from string."
So, since Ω is the HTML encoding for the Greek captical letter Omega, I'd expect that echo html_entity_decode('Ω', ENT_COMPAT, 'UTF-8'); would output Ω. But instaid, it outputs some strange characters which my browser can't recongize. Why is this?
Thanks,
Martijn
When you convert entities into UTF-8 characters like your last parameter specifies, your output encoding must be UTF-8 as well. Otherwise, in a single-byte encoding like ISO-8859-1, you will see double-byte characters as two broken single ones.
It's works fine:
http://codepad.viper-7.com/tb2LaW
Make sure your webpage encoding is UTF-8
If you have different encoding on webpage change this:
html_entity_decode('Ω', ENT_COMPAT, 'UTF-8');
^^^^^
header('Content-type: text/html;charset=utf-8');
mysql_set_charset("utf8", $conn);
Refer this URL:-
http://www.phpwact.org/php/i18n/charsets
php mysql character set: storing html of international content

Convert foreign characters with accents

I'm trying to compare some text to the text in a database. In the database any text with an accent is encoded like in HTML (i.e. é) when I compare the database text to my string it doesn't match because my string just shows é. When I use the PHP function htmlentities to encode the string first the é turns into é weird? Using htmlspecialchars doesn't encode the é at all.
How would you suggest I compare é to é as well as all the other accented characters?
You need to send in the correct charset to htmlentities. It looks like you're using UTF-8, but the default is ISO-8859-1. Change it like this:
$encoded = htmlentities($text, ENT_COMPAT, 'UTF-8');
Another solution is to convert the text to ISO-8859-1 before encoding, but that may destroy information (ISO-8859-1 does not contain nearly as many characters as UTF-8). If you want to try that instead, do like this:
$encoded = htmlentities(utf8_decode($text));
I'm working on french site, and I also had same problem. This is the function that I use.
function convert_accent($string)
{
return htmlspecialchars_decode(htmlentities(utf8_decode($string)));
}
What it does it decodes your string to utf8, than converts everything HTML entities. even tags. But we want to convert tags back to normal, than htmlspecialchars_decode will convert them back. So in the end you will get a string with converted accents without touching tags.
You can use pass through this function your email content before sending it to recipent.
Another issue you might face is that, sometimes with this function the content from database converts to ? . In this case you should do this before running your query:
mysql_query("SET NAMES `utf8`");
But you might need to do it, it depends on encoding in your table. I hope it helps.
The comparing task is related to the charset and the collation you selected when you create the database or the tables. If you are saving strings with a lot of accents like spanish I sugget you to use charset uft8 and the collation could be the more accurate to the language(english, french or whatever) you're using.
The best thing of using the correct charset in the database is that you can save the string in natural way e.g: my name I can store it as is "Mario Juárez" and I have no need of doing some weird conversions.
Ran into similar issues recently. Followed Emil's answer and it worked fine locally but not on our dev/stage environments. I ended up using this and it worked all around:
$title = html_entity_decode(utf8_decode($item));
Thanks for leading me in the right direction!

utf-8 and htmlentities in RSS feeds

I'm writing some RSS feeds in PHP and stuggling with character-encoding issues. Should I utf8_encode() before or after htmlentities() encoding? For example, I've got both ampersands and Chinese characters in a description element, and I'm not sure which of these is proper:
$output = utf8_encode(htmlentities($source)); or
$output = htmlentities(utf8_encode($source));
And why?
It's important to pass the character set to the htmlentities function, as the default is ISO-8859-1:
utf8_encode(htmlentities($source,ENT_COMPAT,'utf-8'));
You should apply htmlentities first as to allow utf8_encode to encode the entities properly.
(EDIT: I changed from my opinion before that the order didn't matter based on the comments. This code is tested and works well).
First: The utf8_encode function converts from ISO 8859-1 to UTF-8. So you only need this function, if your input encoding/charset is ISO 8859-1. But why don’t you use UTF-8 in the first place?
Second: You don’t need htmlentities. You just need htmlspecialchars to replace the special characters by character references. htmlentities would replace “too much” characters that can be encoded directly using UTF-8. Important is that you use the ENT_QUOTES quote style to replace the single quotes as well.
So my proposal:
// if your input encoding is ISO 8859-1
htmlspecialchars(utf8_encode($string), ENT_QUOTES)
// if your input encoding is UTF-8
htmlspecialchars($string, ENT_QUOTES, 'UTF-8')
Don't use htmlentities()!
Simply use UTF-8 characters. Just make sure you declare encoding of the feed in HTTP headers (Content-Type:application/xml;charset=UTF-8) or failing that, in the feed itself using <?xml version="1.0" encoding="UTF-8"?> on the first line.
It might be easier to forget htmlentities and use a CDATA section. It works for the title section, which doesn't seem support encoded HTML characters in Firefox's RSS viewer:
<title><![CDATA[News & Updates " > » ☂ ☺ ☹ ☃ Test!]]></title>
You want to do $output = htmlentities(utf8_encode($source));. This is because you want to convert your international characters into proper UTF8 first, and then have ampersands (and possibly some of the UTF-8 characters as well) turned in to HTML entities. If you do the entities first, then some of the international characters may not be handled properly.
If none of your international characters are going to be changed by utf8_encode, then it doesn't matter which order you call them in.
After much trial & error, I finally found a way to properly display a string from a utf8-encoded database value, through an xml file, to an html page:
$output = '<![CDATA['.utf8_encode(htmlentities($string)).']]>';
I hope this helps someone.

Categories