PHP GD Text and Special Characters / Encoding?

I'm generating text in PHP using imagettftext(). The text is pulled from a MySQL database. Some characters do not appear in the rendered text even though they are in the font's character map and are stored in the database, for example em dashes (—) and smart quotes/apostrophes (“”’).
The characters either don't appear or are replaced by question marks.
I suspect this has to do with encoding, but I don't know enough about encoding to know where to start. Any help would be much appreciated.

Try converting the text before you pass it to the function. Note that PHP has no htmlentityencode() function, and that htmlentities() produces named entities, which imagettftext() does not accept; pass the text as UTF-8 or as numeric character references instead. The manual says of the text parameter:
The text string in UTF-8 encoding.
May include decimal numeric character references (of the form: &#8364;) to access characters in a font beyond position 127. The hexadecimal format (like &#xA9;) is supported. Strings in UTF-8 encoding can be passed directly.
Named entities, such as &copy;, are not supported. Consider using html_entity_decode() to decode these named entities into UTF-8 strings (html_entity_decode() supports this as of PHP 5.0.0).
If a character is used in the string which is not supported by the font, a hollow rectangle will replace the character.
Source: http://www.php.net/manual/en/function.imagettftext.php
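A minimal sketch of getting database text into UTF-8 before it reaches imagettftext(). The helper name to_utf8_for_gd() is hypothetical, and the code assumes that any value which is not valid UTF-8 was stored as Windows-1252 (a common source of lost smart quotes and em dashes); that assumption may not hold for your data.

```php
<?php
// Hypothetical helper (not from the original posts): return $raw as UTF-8.
// Assumption: anything that is not valid UTF-8 is Windows-1252.
function to_utf8_for_gd(string $raw): string
{
    // An empty pattern with the u modifier only matches valid UTF-8 subjects.
    if (preg_match('//u', $raw) === 1) {
        return $raw;
    }
    $converted = iconv('CP1252', 'UTF-8', $raw);
    return $converted === false ? $raw : $converted;
}

// Curly quotes, an em dash and an apostrophe as Windows-1252 bytes.
$fromDatabase = "\x93Hi\x94 \x97 it\x92s";
$text = to_utf8_for_gd($fromDatabase);

// $text can now be passed as the last argument, for example:
// imagettftext($image, 20, 0, 10, 40, $color, $fontFile, $text);
// ($image, $color and $fontFile are placeholders you would create yourself.)
```

If the connection itself is asked for UTF-8, for instance with mysqli's set_charset('utf8mb4'), the conversion step is usually unnecessary.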

Related

How do I use Extended ASCII characters in a PHP/PDF document generated by FPDF?

I am trying to create a document that contains Extended ASCII characters. For text coming from the client the following works:
// Convert from UTF-8 to ISO-8859-1 - Deal with Spanish characters
setlocale(LC_ALL, 'en_US.UTF-8');
foreach ($_POST as $key => $value){
$post[$key] = iconv("UTF-8", "ISO-8859-1", $value);
}
$pdf->Cell(0, 0, $post["Name"], 0, 1);
However, I can't get text in the PHP file to work. For example:
$name = "José";
I don't know what encoding the variable uses. As a result, I can't convert it to ISO-8859-1. The é gets mangled.
Edit:
I am rewriting a program that generates PDF documents (some in Spanish). If I copy text from the existing PDFs, I get the following: (which looks normal in the PDF document and in the IDE but can't be printed with FPDF using either CP1252 or ISO-8859-1 fonts).
$Name = "José"; // Jos\x65\xcc\x81 - I have no idea what encoding is used for the é
Changing the extended characters to UTF-8 solves the problem:
$Name = "José"; // Jos\xC3\xA9 - UTF-8
Does anyone know what kind of encoding I am copying from the existing PDFs?
Is there a way to convert it to UTF-8?
Can users enter this stuff into a browser?
When I convert the UTF-8 encoded characters to ISO-8859-1 for output to FPDF, the PDF contains the three character encoded version of the é.
2nd Edit: Unicode equivalence from Wikipedia
Unicode provides two notions, canonical equivalence and
compatibility. Code point sequences that are defined as canonically
equivalent are assumed to have the same appearance and meaning when
printed or displayed. For example, the code point U+006E (the Latin
lowercase "n") followed by U+0303 (the combining tilde "◌̃") is
defined by Unicode to be canonically equivalent to the single code
point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet).
Therefore, those sequences should be displayed in the same manner,
should be treated in the same way by applications such as
alphabetizing names or searching, and may be substituted for each
other.
Which is the long way of paraphrasing @smith's comment that I just need to get TCPDF or something else that properly handles UTF-8. It should be noted that I am getting the error in PHP's iconv, so I am not entirely sure that it can be made to go away by switching to TCPDF.

It turns out that to use extended ASCII characters, one needs to pick an encoding and use it throughout. In my case, I went with UTF-8 encoded characters and used them everywhere. My original problem stemmed from copying text from a PDF document in which the é was stored in its decomposed, canonically equivalent form (e followed by a combining accent). Once I used the precomposed UTF-8 characters everywhere, my problems went away.
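The decomposed and precomposed forms can also be reconciled in code instead of by retyping. The sketch below is not from the original posts: to_latin1() is a hypothetical helper that composes the text with the intl extension's Normalizer class when it is available, and otherwise falls back to replacing only the single sequence seen in the question.

```php
<?php
// Hypothetical helper: compose decomposed sequences, then convert for FPDF.
function to_latin1(string $utf8): string
{
    if (class_exists('Normalizer')) {
        // intl extension: NFC turns "e" + U+0301 into the single code point U+00E9.
        $composed = Normalizer::normalize($utf8, Normalizer::FORM_C);
        $utf8 = $composed === false ? $utf8 : $composed;
    } else {
        // Fallback that covers only the one sequence from the question.
        $utf8 = str_replace("e\u{0301}", "\u{00E9}", $utf8);
    }
    $latin1 = iconv('UTF-8', 'ISO-8859-1', $utf8);
    if ($latin1 === false) {
        throw new RuntimeException('Text cannot be represented in ISO-8859-1');
    }
    return $latin1;
}

$copiedFromPdf = "Jos\x65\xCC\x81"; // decomposed: e + combining acute accent
$typed         = "Jos\xC3\xA9";     // precomposed é in UTF-8

$a = to_latin1($copiedFromPdf);
$b = to_latin1($typed);
```

Both inputs should end up as the same four ISO-8859-1 bytes, which is what a CP1252 or ISO-8859-1 font in FPDF expects.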

Display \u1F603 (emoji icon) in web page

I store codes like "\u1F603" within messages in my database, and now I need to display the corresponding emoji on my web page.
How can I convert \u1F603 to \xF0\x9F\x98\x83 using PHP for displaying emoji icons in a web page?

You don't need to convert emoji character codes to UTF-8 sequences; you can simply use the original 21-bit Unicode value as a numeric character reference in HTML, like this: &#x1F603; which renders as: 😃.
The Wikipedia article "Unicode and HTML" explains:
In order to work around the limitations of legacy encodings, HTML is designed such that it is possible to represent characters from the whole of Unicode inside an HTML document by using a numeric character reference: a sequence of characters that explicitly spell out the Unicode code point of the character being represented. A character reference takes the form &#N;, where N is either a decimal number for the Unicode code point, or a hexadecimal number, in which case it must be prefixed by x. The characters that compose the numeric character reference are universally representable in every encoding approved for use on the Internet.
For example, a Unicode code point like U+5408, which corresponds to a particular Chinese character, has to be converted to a decimal number, preceded by &# and followed by ;, like this: &#21512;, which produces this: 合.
So if in your PHP code you have a string containing '\u1F603', then you can create the corresponding HTML string using preg_replace, as in following example:
$text = "This is fun \\u1F603!"; // this has just one backslash, it had to be escaped
echo "Database has: $text<br>";
$html = preg_replace("/\\\\u([0-9A-F]{2,5})/i", "&#x$1;", $text);
echo "Browser shows: $html<br>";
This outputs:
Database has: This is fun \u1F603!
Browser shows: This is fun 😃!
Note that if your data also uses the literal \u notation for lower-range Unicode characters, i.e. with hex numbers of 2 to 4 digits, you must make sure the character that follows the escape in the user's text is not itself a hex digit, because it would then be read as part of the \u escape sequence. In that case I would suggest always left-padding these hex numbers with zeroes in your data so they are always 5 digits long.
To ensure your browser uses the correct character encoding, do the following:
Specify the UTF-8 character encoding in the HTML head section:
<meta charset="utf-8">
Save your PHP file in UTF-8 encoding. Depending on your editor, you may need to use a "Save As" option, or find such a setting in the editor's "Preferences" or "Options" menu.
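If the UTF-8 bytes themselves are wanted, as the question literally asks, one possible sketch is to build the numeric character reference and decode it straight away. The function name unicode_escapes_to_utf8() is hypothetical, and the pattern is the same one used above, so the same caveat about a hex digit following the escape applies.

```php
<?php
// Hypothetical helper: turn literal \uXXXXX escapes into UTF-8 characters.
function unicode_escapes_to_utf8(string $text): string
{
    return preg_replace_callback(
        '/\\\\u([0-9A-F]{2,5})/i',
        function (array $m): string {
            // Build a hexadecimal character reference and decode it to UTF-8.
            return html_entity_decode('&#x' . $m[1] . ';', ENT_QUOTES | ENT_HTML5, 'UTF-8');
        },
        $text
    );
}

// Single quotes keep the backslash literal, as it would arrive from the database.
$utf8 = unicode_escapes_to_utf8('This is fun \u1F603!');
```

The result contains the four bytes \xF0\x9F\x98\x83 in place of the escape, so it can be echoed directly on a page served as UTF-8.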

Hello everyone,
after many attempts I found a solution.
I used the code below:
https://github.com/BriquzStudio/php-emoji
include 'Emoji.php';
$message = Emoji::Decode($message);
This works fine for me. :)

Which middot character is this?

$string = 'Single · Female';
I copied it from Facebook.
In the HTML source it's just that dot. How did they type it?
When I echo it in PHP, I get an A with circumflex (Â) followed by that same dot.
How can I explode this string on that dot?

It is U+00B7 MIDDLE DOT, a character used for many purposes, e.g. as a separator between links, alternatives, or other items.
If your code displays it as Â·, then the reason is that the UTF-8 encoded form of U+00B7, namely 0xC2 0xB7, is being misinterpreted as being ISO-8859-1 or Windows-1252 encoded. You should fix this basic problem (instead of trying to deal with some of its symptoms). See UTF-8 all the way through.
Regarding the question “how did they type it?”, we cannot really know, and we need not know. There are zillions of ways to type characters, and anyone can invent a few more. (On my keyboard, I use AltGr Shift X. If I needed to type “·” on a Windows computer with vanilla settings, I would use Alt 0183.)

I believe this is an interpunct. It can be used through the HTML entities &middot; or &#183; and in PHP with the Unicode value U+00B7.
If you want to echo the unicode character without HTML entities, you can set the character encoding to UTF-8. Splitting is done through explode("·", $textToSplit) given that your PHP file is using UTF-8 as character encoding.
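A small sketch of the split, assuming the string is UTF-8. Writing the separator as the escape "\u{00B7}" (available since PHP 7) sidesteps the question of which encoding the PHP file itself is saved in.

```php
<?php
// U+00B7 MIDDLE DOT written as an escape, so the file encoding does not matter.
$string = "Single \u{00B7} Female";

// Split on the middle dot and strip the surrounding spaces.
$parts = array_map('trim', explode("\u{00B7}", $string));
```

With the sample string this gives the two parts 'Single' and 'Female'.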

Euro (€) in imagettftext

How can I create a € sign with imagettftext()?
I'm using the font 'Lucida Grande', which contains the euro sign.
&euro; does not work either.

See the documentation (emphasis mine):
The text string in UTF-8 encoding.
May include decimal numeric character references (of the form:
&#8364;) to access characters in a font beyond position 127. The
hexadecimal format (like &#xA9;) is supported. Strings in UTF-8
encoding can be passed directly.
Named entities, such as &copy;, are not supported. Consider using
html_entity_decode() to decode these named entities into UTF-8 strings
(html_entity_decode() supports this as of PHP 5.0.0).
If a character is used in the string which is not supported by the
font, a hollow rectangle will replace the character.
You can try using the actual € character, or decoding the named entity, or using the decimal character reference. See the entry in the FileFormat.Info Unicode Lookup, which has all the code formats: http://www.fileformat.info/info/unicode/char/20ac/index.htm. In this case it would be &#8364;.
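The three options can be sketched as follows. The font path is hypothetical, and the drawing part only runs when GD and the font file are actually present, so treat it as an illustration, not a tested rendering setup.

```php
<?php
// Three ways to get a euro sign into the text argument.
$direct  = "\u{20AC}";   // the character itself, UTF-8 bytes E2 82 AC
$numeric = '&#8364;';    // decimal reference, left for imagettftext() to resolve
$decoded = html_entity_decode('&euro;', ENT_QUOTES | ENT_HTML5, 'UTF-8');

// Hypothetical font path; adjust to wherever the .ttf file actually lives.
$fontFile = __DIR__ . '/LucidaGrande.ttf';

if (function_exists('imagettftext') && is_file($fontFile)) {
    $image = imagecreatetruecolor(200, 60);
    $white = imagecolorallocate($image, 255, 255, 255);
    // Arguments: image, size, angle, x, y, color, font file, text.
    imagettftext($image, 24, 0, 10, 40, $white, $fontFile, $direct . ' 9.99');
    imagepng($image, __DIR__ . '/euro.png');
}
```

Decoding the named entity yields exactly the same bytes as typing the character, which is why the named form only works after html_entity_decode().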

Strange behaviour when encoding cURL response as UTF-8

I'm making a cURL request to a third-party website which returns a text file. I need to do a few string replacements on it to replace certain characters with their HTML entity equivalents, e.g. I need to replace í with &iacute;.
Using str_replace/preg_replace_callback on the response directly didn't result in matches (whether searching for í directly or using its hex code \x00\xED), so I used utf8_encode() before carrying out the replacement. But utf8_encode replaces all the í characters with Ã.
Why is this happening, and what's the correct approach to carrying out UTF-8 replacements on an arbitrary piece of text using PHP?
*edit - some further research reveals
utf8_decode("í") == í;
utf8_encode("í") == ÃƒÂ;
utf8_encode("\xc3\xad") == ÃƒÂ;

utf8_encode is definitely not the way to go here (you're double-encoding if you do that).
Regarding searching for the character directly or by its hex code: did you make sure to add the u modifier at the end of the regex? With that modifier the code point is written as \x{ED}, e.g. /\x{ED}/u.

You're probably specifying the characters/strings you want replaced via string literals in the PHP source code? If so, the values of those string literals depend on the encoding you save your PHP file in. So while you see the character í, the literal value may be an ISO-8859-1 í, a Windows CP1252 í, a UTF-8 í, or even a UTF-32 í. I don't know offhand how many of those differ, but at least some have different byte representations, and so won't match in a PHP string comparison.
My point is, you need to specify the character in whatever encoding your incoming text is in.
Here's an example without using literals:
$iso8859_1 = chr(237); // í is byte 0xED (237) in ISO-8859-1
$utf8 = utf8_encode(chr(237)); // the same character as UTF-8 (utf8_encode() is deprecated as of PHP 8.2)
Be warned: if you decide to change the file encoding to UTF-8, text editors may or may not convert the existing characters. I've seen editors do really bizarre things when changing the encoding. Start with a fresh file.
Also, just because the other server claims the text is UTF-8 doesn't mean it really is.
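Putting the two answers together, one possible sketch is to check the response first and convert only when it is not already UTF-8. The function name replace_i_acute() is hypothetical, and the code assumes that a response which is not valid UTF-8 is ISO-8859-1, which you would need to confirm for the actual server.

```php
<?php
// Hypothetical helper: replace í with &iacute; in a fetched response body.
function replace_i_acute(string $body): string
{
    // An empty pattern with the u modifier only matches valid UTF-8 subjects.
    if (preg_match('//u', $body) !== 1) {
        // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
        $converted = iconv('ISO-8859-1', 'UTF-8', $body);
        $body = $converted === false ? $body : $converted;
    }
    // "\u{00ED}" is í as UTF-8, independent of the source file's encoding.
    return str_replace("\u{00ED}", '&iacute;', $body);
}

$fromUtf8Server   = replace_i_acute("caf\xC3\xAD"); // already UTF-8
$fromLatin1Server = replace_i_acute("caf\xED");     // ISO-8859-1 bytes
```

Because the conversion happens at most once, already-valid UTF-8 is never double-encoded, which is the problem the question ran into with utf8_encode().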
