In my project I currently use htmlentities() to filter data coming from the database:
echo htmlentities($variable_name);
I am in the USA and this works fine for me. My friend is in Brazil and for him some text characters don't show up correctly.
How can I use htmlentities() so it internationalizes properly?
The problem could be that the output is not encoded in UTF-8. According to the php docs for htmlentities, the function
takes an optional third argument
charset which defines character set
used in conversion. Presently, the
ISO-8859-1 character set is used as
the default.
So you can try calling
htmlentities($string, ENT_COMPAT, 'UTF-8');
instead, and that might fix the problem, since it's not the default character encoding.
While I suspect Keoki has it correct, another possible problem could be the font. If using a special character where your friend's font doesn't contain that character, they'll just see the missing character sign. In the webpage or whatever medium you are using to post the character, be sure that a font is set, as there's no guarentees on the default font working.
If neither of these be the case though, what is an example character that isn't showing up? Can you post the full code you are using?
You can also try iconv to Convert string to requested character encoding
http://www.php.net/manual/en/function.iconv.php
Related
I have a URL like: domain.tld/Σχετικά_με_μας
[edit]
Reading the $_SERVER['REQUEST_URI'] I get to work with:
%CE%A3%CF%87%CE%B5%CF%84%CE%B9%CE%BA%CE%AC_%CE%BC%CE%B5_%CE%BC%CE%B1%CF%82
[/edit]
In PHP I need to convert it to HTML, I get pretty far with:
htmlentities(urldecode($navstring), ENT_QUOTES, 'UTF-8');
It results in:
Σχετικά_με_μας
but the 'ά' becomes 'ά' But I need it converted to
ά
I'dd really appreciate help. I need a universal solution, not a "string replace"
I have been playing around a little, and the following worked. Use mb-convert-encoding instead of htmlentities.:
mb_convert_encoding(urldecode($navstring),'HTML-ENTITIES','UTF-8');
//string(90) "domain.tld/Σχετικά_με_μας"
See mb-convert-encoding
Information
All modern web browsers understand UTF-8 character encoding.
My advice would be :
Always know the character encoding of the data you are using.
Store your data with UTF-8.
Output data with UTF-8
The mbstring php extension doesn't just manipulate Unicode strings. It also converts multibyte strings between various character encodings.
Use the mb_detect_encoding() (ref) and mb_convert_encoding() (ref 2) functions to convert Unicode strings from one character encoding to another.
PHP Needs to know !
You also need to tell PHP that you are working with UTF-8, to tell him the default value, you can do it in your php.ini file :
default_charset = "UTF-8";
That default value is added to the default Content-Type header returned by PHP unless you specified it with the header() function :
header('Content-Type: application/json;charset=utf-8');
Keep in mind
The default character set is used by a lot of functions in PHP such as :
htmlentities()
htmlspecialchars()
all the mbstring functions
...
Everything is set to UTF-8 (file encoding, MySQL [however I don't use it], Apache, meta, mbstring etc...) but check this out:
$s="áéőúöüóűí";
echo $s; //works perfectly
echo $s[0] // doesn't work. Prints out a single '?'.
I have tried almost everything. Any ideas? Thanks in advance!
It is absolutely correct behavior.
if you want to get a first letter from a multi-byte string, not first byte from binary string, you have to use mb_substr():
mb_internal_encoding("UTF-8");
echo mb_substr($s,0,1);
You should use mb_* functions for multibyte strings. mb_substr() in your case.
And if you define $s[0]="á", does it work ? I believe that when encoded in UTF-8, those special chars are stored over two UTF-chars.
If you display in ANSI some UTF-8 text, it is rendered like this :
áéoúöüóuÃ
You see that á becomes á
So rendering the first char ($s[0]) would only display the "Ã", which is an incomplete character
you have to make some changes in database go to the the table structure
you can find a column "Collation"
which column you want to change click edit on right side menu
the default Collation is - 'latin1_general_ci' change it to 'utf8_general_ci'
I'm creating a php application that involves sending chinese characters as url parameters.
I have to send query like :
http://xyz.com/?q=新
But the script at xyz.com won't automatically encode the chinese character. So, I need to explicitly send an encoded string as the paramter. It becomes:
http://xyz.com/?q=%E6%96%B0
The problem is, PHP won't encode the chinese character properly.
I've tried urlencode() and rawurlencode(). But they give %D0%C2 (doesn't work for my purpose) instead of %E6%96%B0 (works well with xyz.com) as the output.
I'm using this website to create the latter encoded string.
I've also defined header('Content-Type: text/html; charset=gb2312'); to display chinese characters properly.
Is there anything I can do to urlencode the chinese character properly?
Thanks!
PS: I'm a relatively new programmer and don't understand chinese.
You're URLencoding using the charset you specify in your header. %D0%C2 is 新 in gb2312; %E6%96%B0 is 新 in UTF-8. Switch your charset over to UTF-8 and you should fix this issue and still be able to display Simplified Chinese Han.
In order to reproduce your problem I created a simple PHP file:
<?php
var_dump(urlencode('新'));
?>
First I used UTF8 encoding and got %E6%96%B0. Afterwards I changed to GB2312 and got %D0%C2.
At http://meyerweb.com/eric/tools/dencoder/ they seem to use JavaScript, that's UTF8 capable and therefore returns %E6%96%B0, too.
PS: When changing from GB2312 to UTF8 some editors might break code some internationalized code. So please make sure to have a copy of your file before converting!
I am having a problem with  character on my website.
I have a website where users can use a wysiwyg editor (ckeditor) to fill out their profile. The content is ran through htmlpurify before being put into a database (for security reasons).
The database has all tables setup with UTF-8 charset. I also call 'SET NAMES utf-8' at the beginning of script execution to prevent problems (which has worked for years, as I haven't had this problem in a long time). The webpage the text is displayed on has a content-type of utf-8 and I also use the header() function to set the content-type and charset as well.
When displaying the text all seemed fine until I tried running a regular expression on the content. html_entity_decode (called with the encoding param of 'utf-8') is removing/not showing the  character for some reason and it leaves behind something which is causing all of my regexes to fail (it seems there is a character there but I cannot view it in the source).
How can I prevent and/or remove this character so I can run the regular expression?
EDIT: I have decided to abandon ckeditor and go with the markdown format like this site uses to have more flexibility. I have hated wysiwyg editors for as long as I remember. Updating all the profiles to the new format will give me a chance to remove all of the offending text and give the site a clean start. Thanks for all the input.
You are probably facing the situation that the string actually is not properly UTF-8 encoded (as you wrote it is, but it ain't). html_entity_decode might then remove any invalid UTF-8 byte sequences (e.g. single-byte-charset encoding of Â) with a substitution character.
Depending on the PHP version you're using you've got more control how to deal with this by making use of the flags.
Additionally to find the character you can't see, create a hexdump of the string.
Since the character you are talking about exists within the ANSI charset, you can do this:
utf8_encode( preg_replace($match, $replace, utf8_decode($utf8_text));
This will however destroy any unicode character not existing within the ANSI charset. To avoid this you can always try using mb_ereg_replace which has multibyte (unicode) support:
string mb_ereg_replace ( string $pattern , string $replacement , string $string [, string $option = "msr" ] )
I'm trying to get data from a POST form. When the user inputs "habláis", it shows up in view source as just "habláis". I want to convert this to "habláis" for purposes of string comparison, but both utf8_encode() and htmlentities() are outputting habláis, and htmlspecialchars() does nothing. I would use str_replace but it won't recognize the á when it searches the string.
I'm using a charset of utf-8 consistently across pages. Any idea what's going on?
You are probably not specifying UTF-8 as the character set for the htmlentities() operation.
I'm not sure if this is your problem, but are you calling htmlentities with the UTF-8 parameter? I ask because that's not its default:
Like htmlspecialchars(), it takes an
optional third argument charset which
defines character set used in
conversion. Presently, the ISO-8859-1
character set is used as the default.
So you might want to try calling your function like this:
$output = htmlentities($input, ENT_COMPAT, 'UTF-8');
Does this solve your problem?