Converting non-Unicode, non-English content to Unicode - PHP

I have text content in the "xyz" language:
<p style="font-family:xyz;"> eWvS³: kmwkMnsâ kq¸Àt^mWmb KmeIvkn kocoknsâ aq¶mw]Xn¸v </p>
// It will not display correctly because the font is not embedded.
Here the font xyz (xyz.ttf) is non-Unicode.
Now I want to convert that "XYZ" (xyz.ttf) font text to the Unicode "PQR" (pqr.ttf) font.
Put simply: from a non-Unicode Chinese font (non_uni_chinese.ttf) to a Unicode Chinese font (uni_chinese.ttf).
How can I do this in PHP? Any help?

You must do this character by character.
That means replacing every character written for the non-Unicode Chinese font with its Unicode equivalent.
I don't know much about Chinese, but in Vietnam it is done this way:
Display the string written for the non-Unicode font using a Unicode font; the characters will not display correctly. For example: Ñaây laø Tieáng Vieät <- this is non-Unicode Vietnamese text shown in a Unicode font.
Replace character by character. For example: Ñ = Đ; aâ = â; aø = à; ...
Then we get this result: Đây là Tiếng Việt.
Of course we don't do it by hand; we use a program called "Unikey" for this.
I'm sure there is similar software for Chinese. The point is that you have to reimplement that mapping in PHP.
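The character-by-character replacement described above can be sketched in PHP with strtr() and a lookup table. This is only a minimal sketch: the table holds just the pairs that can be read off the example sentence, the names LEGACY_TO_UNICODE and legacyToUnicode are made up for illustration, and the source file is assumed to be saved as UTF-8. A real table would have to cover every character of the legacy font.

```php
<?php
// Illustrative mapping only, derived from the example sentence above.
// A real converter needs the complete table for the legacy font.
const LEGACY_TO_UNICODE = [
    'Ñ'  => 'Đ',
    'aâ' => 'â',
    'aø' => 'à',
    'eá' => 'ế',
    'eä' => 'ệ',
];

// Hypothetical helper: strtr() with an array tries the longest keys first
// and never rescans text it has already replaced.
function legacyToUnicode(string $text): string
{
    return strtr($text, LEGACY_TO_UNICODE);
}

echo legacyToUnicode('Ñaây laø Tieáng Vieät'), "\n"; // Đây là Tiếng Việt
```

strtr() is used instead of chained str_replace() calls because one replacement can never be re-matched by a later rule.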
Something that may help you: http://www.pinyin.info/tools/converter/chars2uninumbers.html
Good luck.

Generated output should use a single encoding. This is not a proper solution, but to convert a string to a different encoding you can use the iconv function: http://www.php.net/manual/en/function.iconv.php
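As a small sketch of what iconv does, assuming the input really is in a standard encoding (ISO-8859-1 here, chosen only for illustration):

```php
<?php
// "Jos\xE9" is "José" in ISO-8859-1: the é is the single byte 0xE9.
$latin1 = "Jos\xE9";

// iconv(from, to, string) re-encodes the bytes; é becomes 0xC3 0xA9 in UTF-8.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

echo bin2hex($utf8), "\n"; // 4a6f73c3a9
```

Note that iconv only knows standard encodings. Text typed for a font that assigns its own glyphs to Latin code points is still "valid" Latin text as far as iconv is concerned, so it would most likely still need a custom mapping table.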

Related

How do I use Extended ASCII characters in a PHP/PDF document generated by FPDF?

I am trying to create a document that contains Extended ASCII characters. For text coming from the client the following works:
// Convert from UTF-8 to ISO-8859-1 - Deal with Spanish characters
setlocale(LC_ALL, 'en_US.UTF-8');
foreach ($_POST as $key => $value){
    $post[$key] = iconv("UTF-8", "ISO-8859-1", $value);
}
$pdf->Cell(0, 0, $post["Name"], 0, 1);
However, I can't get text in the PHP file to work. For example:
$name = "José";
I don't know what encoding the variable uses. As a result, I can't convert it to ISO-8859-1. The é gets mangled.
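A string literal simply holds whatever bytes the PHP file was saved with, so one way to find out is to look at the bytes. The following is a minimal sketch; describeBytes is a hypothetical helper name, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical helper: show the raw bytes and whether they form valid UTF-8.
function describeBytes(string $s): string
{
    $kind = mb_check_encoding($s, 'UTF-8') ? 'valid UTF-8' : 'not valid UTF-8';
    return bin2hex($s) . ' (' . $kind . ')';
}

echo describeBytes("Jos\xC3\xA9"), "\n"; // 4a6f73c3a9 (valid UTF-8)
echo describeBytes("Jos\xE9"), "\n";     // 4a6f73e9 (not valid UTF-8)
```

In real code you would pass the variable itself, for example describeBytes($name).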
Edit:
I am rewriting a program that generates PDF documents (some in Spanish). If I copy text from the existing PDFs, I get the following, which looks normal in the PDF document and in the IDE but can't be printed with FPDF using either CP1252 or ISO-8859-1 fonts:
$Name = "José"; // Jos\x65\xcc\x81 - I have no idea what encoding is used for the é
Changing the extended characters to UTF-8 solves the problem:
$Name = "José"; // Jos\xC3\xA9 - UTF-8
Does anyone know what kind of encoding I am copying from the existing PDFs?
Is there a way to convert it to UTF-8?
Can users enter this stuff into a browser?
When I convert the UTF-8 encoded characters to ISO-8859-1 for output to FPDF, the PDF contains the three character encoded version of the é.
2nd Edit: Unicode equivalence from Wikipedia
Unicode provides two notions, canonical equivalence and
compatibility. Code point sequences that are defined as canonically
equivalent are assumed to have the same appearance and meaning when
printed or displayed. For example, the code point U+006E (the Latin
lowercase "n") followed by U+0303 (the combining tilde "◌̃") is
defined by Unicode to be canonically equivalent to the single code
point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet).
Therefore, those sequences should be displayed in the same manner,
should be treated in the same way by applications such as
alphabetizing names or searching, and may be substituted for each
other.
Which is a long way of paraphrasing #smith's comment that I just need to get TCPDF or something else that properly handles UTF-8. It should be noted that I am getting the error in PHP's iconv, so I am not entirely sure that it can be made to go away by switching to TCPDF.
It turns out that to use extended ASCII characters one needs to pick an encoding and use it throughout. In my case, I went with UTF-8 encoded characters and used them everywhere. My original problem stemmed from my mistake in copying text from a PDF document, which was encoded in the canonically equivalent (decomposed) form. Once I used UTF-8 encoded characters everywhere, my problems went away.
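For text that has already been copied in the decomposed form, one option (not what was done above, just a sketch) is to compose it with the Normalizer class. This assumes the intl extension is installed.

```php
<?php
// "e" followed by U+0301 (combining acute accent), as copied from the PDF.
$decomposed = "Jos\x65\xCC\x81";

// NFC composes the pair into the single code point U+00E9.
$composed = Normalizer::normalize($decomposed, Normalizer::FORM_C);

// The composed form has an ISO-8859-1 equivalent, so iconv can convert it.
$latin1 = iconv('UTF-8', 'ISO-8859-1', $composed);

echo bin2hex($composed), "\n"; // 4a6f73c3a9
echo bin2hex($latin1), "\n";   // 4a6f73e9
```

The decomposed string fails in iconv because U+0301 on its own has no ISO-8859-1 counterpart.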

PHP Uploaded file name: Japanese character encoding

When uploading a file with a Japanese name, some characters cause problems.
On a Windows system, I want to save the file under its uploaded name, so I have to use
mb_convert_encoding($name, "SJIS", "AUTO");
which works fine in most cases.
However, some characters, such as ① in 0423図表①, disappear entirely. It seems that the file name is already wrong when it is uploaded:
it looks like "0423å³è¡¨â .pptx" in UTF-8 and if I change the header charset with
header('Content-Type: text/html; charset=SJIS');
it looks like
"0423テ・ツ崢ウティツ。ツィテ「ツ堕.pptx"
I am not sure what I can do in this case. I tried to replace the ① character but I cannot even find it with strpos() before or after the encoding conversion.
To qualify my answer (to the downvoter):
Q: I have heard that UTF-8 does not support some Japanese characters. Is this correct?
A: There is a lot of misinformation floating around about the support
of Chinese, Japanese and Korean (CJK) characters. The Unicode Standard
supports all of the CJK characters from JIS X 0208, JIS X 0212, JIS X
0221, or JIS X 0213, for example, and many more. This is true no
matter which encoding form of Unicode is used: UTF-8, UTF-16, or
UTF-32.
Unicode supports over 80,000 CJK characters right now, and work is
underway to encode further additions. The International Standard
ISO/IEC 10646 and the Unicode Standard are completely synchronized in
repertoire and content. And that means that Unicode has the same
repertoire as GB 18030, since that also is synchronized with ISO 10646
— although with a different ordering and byte format.
From: The Unicode Consortium.
My Answer:
Rather than strpos, use mb_stripos from the PHP multibyte string functions to find and replace characters. This should help your script detect and translate the non-Latin characters.
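A minimal sketch of the difference, using mb_strpos (the case-sensitive sibling of mb_stripos) and assuming both the PHP file and the file name are UTF-8:

```php
<?php
$name = '0423図表①.pptx'; // assumed to be UTF-8

// strpos() counts bytes: 4 ASCII bytes + 2 kanji of 3 bytes each.
$bytePos = strpos($name, '①');

// mb_strpos() counts characters in the given encoding.
$charPos = mb_strpos($name, '①', 0, 'UTF-8');

// str_replace() is safe on UTF-8 as long as both strings share the encoding.
$clean = str_replace('①', '(1)', $name);

echo $bytePos, ' ', $charPos, ' ', $clean, "\n"; // 10 6 0423図表(1).pptx
```

If neither function finds the character, the name and the search string are probably in different encodings, which points back at how the form was submitted.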
If the uploaded file name ($_FILES['var']['name']) is already incorrect in the PHP script (from output such as print_r($_FILES)) then you need to ensure you are correctly encoding the HTML form with accept-charset='UTF-8' (or SJIS, etc.). I would hope you're already well ahead of me on this.
It may also be advisable to set a few defaults at the top of your PHP page, again using the PHP mb_ functions:
mb_internal_encoding('UTF-8'); //or whatever character set works for you
mb_http_output('SJIS');
// mb_http_input() only reports the detected input encoding; it cannot set it
mb_regex_encoding('UTF-8');
Out of interest:
http://www.unicode.org/reports/tr37/
and
http://david.latapie.name/blog/shift-jis-utf-8/
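One more thing that may be worth checking, as an assumption rather than a confirmed diagnosis: ① (U+2460) is not part of JIS X 0208, but it is part of the Windows variant of Shift JIS (code page 932), which mbstring exposes as "SJIS-win" (alias "CP932"). A sketch of converting with that variant:

```php
<?php
$utf8Name = '0423図表①.pptx'; // assumed to arrive as UTF-8

// Convert to the Windows flavour of Shift JIS instead of plain "SJIS".
$winName = mb_convert_encoding($utf8Name, 'SJIS-win', 'UTF-8');

// The circled digit alone maps to the two bytes 0x87 0x40 in CP932.
$circled = mb_convert_encoding('①', 'SJIS-win', 'UTF-8');
echo bin2hex($circled), "\n"; // 8740

// Converting back recovers the original name.
$roundTrip = mb_convert_encoding($winName, 'UTF-8', 'SJIS-win');
```

Giving the source encoding explicitly instead of "AUTO" also removes one source of guessing.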

Processing Arabic text for transliteration

I used the http://www.ar-php.org/en_index-php-arabic.html library for Arabic-to-English and English-to-Arabic transliteration.
For simple English or Arabic text copied from the web it works fine.
But for English text written using the robert_bold or robert_regular_0 fonts,
the conversion gives me unsupported text like:
ال ‘؟ س[
كير[ ’[ ت
شو ’\ ن
به ’; س
؟ م[ن
س ال#اناه
When I convert plain English text, it gives only supported Arabic characters.
I do not live in an Arabic-speaking country.
Any suggestions to improve my system would be appreciated.
I believe your problem lies in the encoding of your text in this 'robert_bold' font.
It seems to use characters other than the standard ones, so you will need to add those characters to your transliteration library as well.
Look at one of the words you mentioned: Shu'un. The second 'u' in the picture has a line above it. So it's outside the normal range of characters, and as such there is no transliteration for it in that library.

Convert the Chinese Characters From ISO-8859-1 To UTF-8

I have a system whose HTML encoding was previously set to ISO-8859-1, which caused all the Chinese characters to be stored in the format "&#36830;&#34915;&#35033;".
So my question is: how can I convert the format above back into Chinese characters in UTF-8?
For your information, I tried utf8_decode and iconv, but neither of them worked. :(
Thank you very much.
The current text encoding of that string is largely irrelevant. What you have there are HTML entities; they have little to do with the underlying "physical" encoding like ISO-8859 or UTF-8. What you want is to decode those HTML entities into a byte representation of the characters in a specific encoding, in this case UTF-8. Therefore:
echo html_entity_decode('&#36830;&#34915;&#35033;', ENT_COMPAT, 'UTF-8');
// 连衣裙
You need to use:
utf8_encode($data);
and not decode, to convert your current ISO-8859-1 to UTF-8.
Some native PHP functions such as strtolower(), strtoupper() and ucfirst() do not always work correctly with UTF-8 strings. Possible solutions: convert to Latin first, or add the following line to your code:
setlocale(LC_CTYPE, 'C');
Make sure not to save your PHP files using a BOM (Byte-Order Marker) UTF-8 file marker (your browser might show these BOM characters between PHP pages on your site).
Just for your reference:
ISO-8859-1 => Albanian, Brazilian, Catalan, Danish, Dutch, English, Finnish, French, German, Portuguese, Norwegian, Spanish, Swedish
UTF-8 => Chinese (simplified), Chinese (traditional), Japanese, Persian
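A different route from the two solutions named above, shown here only as a sketch, is to use the mbstring counterparts of those functions. mbUcfirst is a hypothetical helper name (PHP versions before 8.4 have no built-in mb_ucfirst), and the strings are assumed to be UTF-8.

```php
<?php
// mb_strtolower()/mb_strtoupper() take the encoding explicitly.
$lower = mb_strtolower('JOSÉ', 'UTF-8');
$upper = mb_strtoupper('josé', 'UTF-8');

// Hypothetical helper: upper-case only the first character.
function mbUcfirst(string $s, string $encoding = 'UTF-8'): string
{
    $first = mb_substr($s, 0, 1, $encoding);
    $rest  = mb_substr($s, 1, null, $encoding);
    return mb_strtoupper($first, $encoding) . $rest;
}

echo $lower, ' ', $upper, ' ', mbUcfirst('élan'), "\n"; // josé JOSÉ Élan
```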
There are many tools that can convert character references to characters, and writing such a tool is rather straightforward, especially if you know the references are all decimal. So the answer really depends on the software environment.
For example, to do such a conversion for an individual HTML document, you could use the BabelPad editor: command Convert → Numeric Character References (NCR) → NCR to Unicode, and save the result as UTF-8.
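For the PHP case, a hand-written converter for decimal references could look roughly like this. It is a sketch: decodeDecimalNcr is a made-up name, mb_chr() requires PHP 7.2 or later, and html_entity_decode() already covers this case without custom code.

```php
<?php
// Hypothetical helper: replace each decimal reference such as &#36830;
// with the UTF-8 encoding of that code point.
function decodeDecimalNcr(string $html): string
{
    return preg_replace_callback(
        '/&#(\d+);/',
        function (array $m): string {
            $char = mb_chr((int) $m[1], 'UTF-8');
            // Leave the reference untouched if the code point is invalid.
            return $char === false ? $m[0] : $char;
        },
        $html
    );
}

echo decodeDecimalNcr('&#36830;&#34915;&#35033;'), "\n"; // 连衣裙
```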

PHP GD Text and Special Characters / Encoding?

I'm generating text in PHP using imagettftext. The text is being pulled from a MySQL database. Some characters are not appearing in the rendered text despite being in the character map for the font and appearing in the database, for example em dashes (—) and smart quotes/apostrophes (“”’).
The characters either don't appear or are replaced by question marks.
I suspect this has to do with encoding, but I don't know enough about encoding to know where to start. Any help would be much appreciated.
Try using htmlentities() on the text before you pass it to the function.
The text string in UTF-8 encoding.
May include decimal numeric character references (of the form: &#8364;) to access characters in a font beyond position 127. The hexadecimal format (like &#xA9;) is supported. Strings in UTF-8 encoding can be passed directly.
Named entities, such as &copy;, are not supported. Consider using html_entity_decode() to decode these named entities into UTF-8 strings (html_entity_decode() supports this as of PHP 5.0.0).
If a character is used in the string which is not supported by the font, a hollow rectangle will replace the character.
Source: http://www.php.net/manual/en/function.imagettftext.php
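Following the manual text quoted above, the string can be turned into plain UTF-8 before it reaches imagettftext(). This is a minimal sketch; prepareForTtf is a hypothetical helper name, and it assumes the database text is either already UTF-8 or contains HTML entities.

```php
<?php
// Hypothetical helper: decode named and numeric entities into UTF-8
// so the result can be passed directly to imagettftext().
function prepareForTtf(string $text): string
{
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

$text = prepareForTtf('Wait &mdash; &ldquo;really&rdquo;?');
echo bin2hex("\u{2014}"), "\n"; // e28094, the UTF-8 bytes of the em dash
```

The result would then be passed as the last argument of imagettftext(). If question marks still appear, the text is probably not UTF-8 when it leaves the database.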
