How to display UTF-8 Chinese in HTML with PHP

I have Chinese characters stored in my MySQL database in UTF-8, but I need to show them on a web page that has to be output as charset=ISO-8859-1.
When rendered as Latin-1 my test string looks like this: "dsfsdfsdf åšä¸€ä¸ªæµ‹è¯•"
I have tried using htmlentities in the following two ways, because I can't tell from the PHP docs whether $encoding refers to the encoding of the input string or of the desired output string.
$row['admin_comment']=htmlentities( $row['admin_comment'] ,
ENT_COMPAT | ENT_HTML401 ,
'ISO-8859-1' ,
false );
$row['admin_comment']=htmlentities( $row['admin_comment'] ,
ENT_COMPAT | ENT_HTML401 ,
'UTF-8' ,
false );
But both leave the output string unchanged.

You cannot encode Chinese characters directly in the ISO-8859-1 charset; it simply has no code points for them.
You have 2 possibilities:
stick to UTF-8 (recommended)
pick another Chinese-compatible charset (Big5, if my memory serves me right)
Why MUST your page be rendered as Latin-1? I find this requirement very strange. My suggestion is to use the UTF-8 charset EVERYWHERE (from database encoding to HTML rendering). It will save you A LOT of pain in the future.
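As a rough illustration of the "UTF-8 everywhere" advice on the output side, here is a minimal sketch. The helper name renderUtf8Page and the sample text are assumptions made up for this example; the point is only that the HTTP header, the meta tag and the escaping call all name the same charset.

```php
<?php
// Hypothetical helper: wraps text that is already UTF-8 in a minimal HTML page.
function renderUtf8Page(string $text): string
{
    // Escape markup characters; the third argument says the input is UTF-8.
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n"
         . "<html><head><meta charset=\"UTF-8\"><title>Demo</title></head>\n"
         . "<body><p>{$safe}</p></body></html>\n";
}

// Declare the encoding in the HTTP header as well as in the markup.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

// Sample text (an assumption, not the asker's data).
$page = renderUtf8Page('dsfsdfsdf 测试 <b>');
```

In a real page you would echo the result, for example echo renderUtf8Page($row['admin_comment']); the database connection charset has to be UTF-8 as well for this to hold end to end.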

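The htmlentities function only emits named entities, and Chinese characters have none; it does not convert characters into numeric character references. For that you can use the mb_encode_numericentity function. The conversion map below covers every code point from 0x80 upwards, so nothing non-ASCII is left in the output:
$row['admin_comment'] = mb_encode_numericentity($row['admin_comment'],
array(0x80, 0x10FFFF, 0, 0x1FFFFF), "UTF-8");
You should probably look into migrating to UTF-8 though.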
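A small sketch of how this could be wrapped for reuse, assuming the text comes out of the database as UTF-8. The function name toAsciiHtml and the sample string are hypothetical; htmlspecialchars and mb_encode_numericentity are the real PHP functions.

```php
<?php
// Hypothetical helper: make UTF-8 text safe to embed in an ISO-8859-1 page.
function toAsciiHtml(string $utf8): string
{
    // First escape <, >, & and quotes so the text cannot inject markup.
    $escaped = htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');

    // Conversion map: [first code point, last code point, offset, mask].
    // This range covers every non-ASCII Unicode code point.
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    // Each matching character becomes a decimal reference such as &#233;.
    return mb_encode_numericentity($escaped, $convmap, 'UTF-8');
}

// Sample input (an assumption for illustration).
$html = toAsciiHtml('test <测试> é');
```

Because the result contains only ASCII bytes, it renders the same whichever charset the page declares.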

It turns out you can set an iframe in your page to a different encoding.

Related

mb_detect_encoding returns both ASCII and UTF8 [duplicate]

I'm trying to automatically convert imported IPTC metadata from images to UTF-8 for storage in a database based on the PHP mb_ functions.
Currently it looks like this:
$val = mb_convert_encoding($val, 'UTF-8', mb_detect_encoding($val));
However, when mb_detect_encoding() is supplied an ASCII string (special characters in the Latin1-fields from 192-255) it detects it as UTF-8, hence in the following attempt to convert everything to proper UTF-8 all special characters are removed.
I tried writing my own method by looking for Latin1 values, and if none occurred I would go on to let mb_detect_encoding decide what it is. But I stopped midway when I realized that I can't be sure that other encodings don't use the same byte values for other things.
So, is there a way to properly detect ASCII to feed to mb_convert_encoding as the source encoding?

Specifying a custom order, where ASCII is detected first, works.
mb_detect_encoding($val, 'ASCII,UTF-8,ISO-8859-15');
For completeness, the list of available encodings is at http://www.php.net/manual/en/mbstring.supported-encodings.php
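A hedged sketch of the same idea with strict mode switched on (the third argument), so that candidates in which the bytes are not fully valid are rejected. The helper name detectFromCandidates and the sample bytes are assumptions for illustration.

```php
<?php
// Hypothetical helper around mb_detect_encoding with an explicit candidate list.
function detectFromCandidates(string $bytes)
{
    // Strict mode (third argument) rejects candidates in which the
    // byte sequence is not fully valid.
    $candidates = ['ASCII', 'UTF-8', 'ISO-8859-15'];

    return mb_detect_encoding($bytes, $candidates, true); // string or false
}

// Sample: "été" as single-byte Latin-9, which is neither ASCII nor valid UTF-8.
$latin = "\xE9t\xE9";
$enc   = detectFromCandidates($latin);
$utf8  = mb_convert_encoding($latin, 'UTF-8', $enc);
```

Real code should check for a false return before converting. Detection remains a guess whenever several candidates accept the same bytes.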

You can specify the source encoding explicitly:
$val = mb_convert_encoding($val, 'UTF-8', 'ASCII');
EDIT:
$val = mb_convert_encoding($val, 'UTF-8', 'auto');

If you do not want to worry about which encodings to allow, you can add them all:
$encoding = mb_detect_encoding($val, implode(',', mb_list_encodings()));
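An alternative that sidesteps detection altogether is to validate rather than guess: keep the value if it is already valid UTF-8, otherwise convert from one assumed legacy encoding. This is a different technique from the answers above, sketched under the assumption that the non-UTF-8 fields are Latin-1; the name ensureUtf8 is hypothetical.

```php
<?php
// Hypothetical helper: keep valid UTF-8 as is, otherwise convert from an
// assumed single-byte source encoding (the fallback is a guess, not detection).
function ensureUtf8(string $bytes, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        // Pure ASCII also lands here, since ASCII is valid UTF-8.
        return $bytes;
    }
    return mb_convert_encoding($bytes, 'UTF-8', $fallback);
}

$fromAscii  = ensureUtf8('plain');     // unchanged
$fromUtf8   = ensureUtf8("\xC3\xA9");  // already UTF-8, unchanged
$fromLatin1 = ensureUtf8("\xE9");      // converted from the fallback encoding
```

The trade-off is that a Latin-1 string which happens to form valid UTF-8 byte sequences would be kept unconverted, which is rare in ordinary text but possible.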

PHP decode UTF-8 in URL ie &title=%C5%8Cyu to Ōyu not ÅŒyu

Can you please help me decode this URL so that it displays properly when output with PHP?
This is the link:
http://www.megalithic.co.uk/visits.php?op=site&sid=18341&title=Ōyu
I think it's actually coming through as UTF-8, i.e.
&title=%C5%8Cyu
$title displays as ÅŒyu
How do I convert this in PHP? I need to use ISO-8859-1 on the page.
None of these work:
$title=iconv("UTF-8","ISO-8859-1",$title);
$title=iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $title);
$title = utf8_decode($title);
$title = urldecode($title);
Do I need to use the Multibyte MB extension and if so how?
Many thanks in advance
Andy

If that link is to your PHP page, and you get the value via $_GET['title'], then it's already decoded from the URL encoding and $_GET['title'] holds a UTF-8 encoded string with the character Ō. This character cannot be encoded in ISO-8859-1. If that is a strict requirement, you'll have to encode the character as an HTML entity in order to express it in a strictly ISO-8859-1 encoded page:
echo htmlentities('Ō', ENT_COMPAT | ENT_HTML5, 'UTF-8');

The character "Ō" does not exist in ISO-8859-1, so it is not possible to convert it from UTF-8 with any of the standard charset conversion functions.
It is, however, possible to write a function that converts to numeric character references, like &#332; for "Ō".
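To make the byte-level picture concrete, here is a short sketch that reproduces the garbled output and then the numeric reference. It assumes the browser treats an ISO-8859-1 page as Windows-1252, which is how the Œ in the garbled text arises; rawurldecode only imitates what PHP has already done when filling $_GET.

```php
<?php
// The query string value as it appears in the URL (percent-encoded UTF-8).
$raw = '%C5%8Cyu';

// PHP fills $_GET with already-decoded values; rawurldecode() imitates that.
$title = rawurldecode($raw);          // bytes C5 8C 79 75, "Ōyu" in UTF-8

// Reading those bytes as Windows-1252 reproduces the garbled text
// from the question.
$garbled = mb_convert_encoding($title, 'UTF-8', 'Windows-1252');

// Numeric character references keep the page itself pure ASCII.
$entity = mb_encode_numericentity($title, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
```

Echoing $entity on the ISO-8859-1 page lets the browser render the macron letter without changing the page charset.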

Corrupted data using UTF-8 and mb_substr

I'm getting data from a MySQL db, from a varchar(255) utf8_general_ci field, and trying to write the text to a PDF with PHP. I need to determine the string length in the PDF to limit the output of the text in a table. But I noticed that the output of mb_substr/substr is really strange.
For example:
mb_internal_encoding("UTF-8");
$_tmpStr = $vfrow['title'];
$_tmpStrLen = mb_strlen($vfrow['title']);
for($i=$_tmpStrLen; $i >= 0; $i--){
file_put_contents('cutoffattributes.txt',$vfrow['field']." ".$_tmpStr."\n",FILE_APPEND);
file_put_contents('cutoffattributes.txt',$vfrow['field']." ".mb_substr($_tmpStr, 0, $i)."\n",FILE_APPEND);
}
outputs this:
[image: npp file link]
Database:
[image not preserved]
My question is where does the extra character come from?

You need to ensure you're actually getting the data from the database in UTF-8 encoding by setting your connection encoding appropriately. This depends on your database adapter, see UTF-8 all the way through for details.
You need to tell your mb_ functions that the data is in UTF-8 so they can treat it correctly. Either set this globally for all functions using mb_internal_encoding, or pass the $encoding parameter to your function when you call it:
mb_substr($_tmpStr, 0, $i, 'UTF-8')

The extra character is the first byte of a two-byte UTF-8 sequence. You may have a problem with the internal encoding of the multibyte string functions: your code treats the text as a fixed-width, 1-byte encoding. The ń in UTF-8, hex C5 84, is read as Ĺ„ in CP-1250 and as Ĺ[IND] in ISO-8859-2, i.e. two characters.
Try executing this at the top of your script:
mb_internal_encoding("UTF-8");
http://php.net/manual/en/function.mb-internal-encoding.php
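The difference between byte-based and character-based cutting can be sketched as follows. The sample word is an assumption chosen because it ends in ń; the encoding is passed explicitly so the result does not depend on mb_internal_encoding.

```php
<?php
// Sample word "koń" in UTF-8: the last letter is the two bytes C5 84.
$word = "ko\xC5\x84";

$byteCount = strlen($word);                 // counts bytes
$charCount = mb_strlen($word, 'UTF-8');     // counts characters

$byteCut = substr($word, 0, 3);             // stops after C5, mid-character
$charCut = mb_substr($word, 0, 3, 'UTF-8'); // keeps the whole letter

// The byte-based cut is no longer valid UTF-8; the stray C5 is the
// "extra character" seen in the output file.
$byteCutIsValid = mb_check_encoding($byteCut, 'UTF-8');
```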

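Aside from the table and field being set to UTF-8, you also need to set the connection charset, e.g. mysqli_set_charset($link, 'utf8') where $link is your connection (if you are using mysqli).
Also, did you try this? Note that utf8_encode converts ISO-8859-1 to UTF-8, so it only helps if the data arrives as Latin-1.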
$_tmpStr = utf8_encode( $vfrow['title'] );

PHP: how do I convert foreign characters from simple_html_dom to UTF8?

I'm having some trouble with a string that comes from a webpage having foreign characters in it.
The string is generated by parsing the webpage using str_get_html(), followed by $htmldom->innertext; (simple_html_dom class library).
When I output the string using htmlentities() it is displayed fine; but using explode() on the string and printing the parts, I get a tilted block with a question mark in it for each foreign character.
I need to store the string in a utf8 MySQL database, so I need the right foreign characters.
My page has a header with utf8 character set.
I have already tried mb_split() and preg_split(), but those have the same problem.

I solved the issue with:
https://github.com/neitanod/forceutf8
It has a great function that just converts anything to UTF-8, no matter what source it's from (as long as it comes in Latin1 (ISO 8859-1), Windows-1252 or UTF-8 already, or a mix of them).
Many thanks go to Sebastian Grignoli.

PHP and UTF-8 aren't a very good combination. Some functions work fine with UTF-8, others don't, and the worst are those that are documented to work but in fact do not (such as DOMDocument).
You can use mb_convert_encoding() to convert multibyte characters to HTML entities, which usually provides an acceptable workaround:
$string = mb_convert_encoding($string, 'HTML-ENTITIES', 'UTF-8');
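One possible explanation for the symptom in the question is that the scraped string is Latin-1 rather than UTF-8. The sketch below works under that assumption, with made-up sample bytes: it shows why the htmlentities output looks right while the raw parts do not, and converts once before splitting.

```php
<?php
// Sample bytes as a Latin-1 page would deliver them (an assumption).
$scraped = "caf\xE9 cr\xE8me";

// Told that the input is ISO-8859-1, htmlentities() emits ASCII-only entities,
// which is why that output looks right on any page.
$viaEntities = htmlentities($scraped, ENT_QUOTES, 'ISO-8859-1');

// The raw bytes are not valid UTF-8, so a UTF-8 page shows replacement glyphs.
$isUtf8 = mb_check_encoding($scraped, 'UTF-8');

// Convert once, then split; explode() works on bytes, which is safe for
// UTF-8 when the delimiter is ASCII.
$parts = explode(' ', mb_convert_encoding($scraped, 'UTF-8', 'ISO-8859-1'));
```

The converted parts can then be stored in the utf8 MySQL table as they are, provided the connection charset is UTF-8 too.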
