PHP-GD: Dealing with Unicode characters

I am developing a web service that renders characters using the PHP GD extension, using a user-selected TTF font.
This works fine in ASCII-land, but there are a few problems:
The string to be rendered comes in as UTF-8. I would like to limit the list of user-selectable fonts to only those which can render the string properly, as some fonts only have glyphs for ASCII characters, ISO 8859-1, etc.
In the case where some decorative characters are included, it would be fine to render the majority of characters in the selected font and render the decorative characters in Arial (or whatever font contains the extended glyphs).
It does not seem like PHP-GD has support for querying the font metadata sufficiently to figure out if a character can be rendered in a given font. What is a good way to get font metrics into PHP? Is there a command-line utility that can dump in XML or other parsable format?
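For the second point (main font plus a fallback for the decorative characters), one possible approach is to split the string into runs, each tagged with the font that should draw it. This is only a minimal sketch: it assumes the set of code points covered by the selected font has already been obtained somehow (for example from the font's cmap table), that PHP >= 7.2 with mbstring is available, and `splitIntoFontRuns` and `$primaryCoverage` are hypothetical names, not part of GD.

```php
<?php
/**
 * Split a UTF-8 string into runs the primary font can draw and runs that
 * need the fallback font.
 *
 * $primaryCoverage is a hypothetical lookup table: code point (int) => true.
 * Returns a list of ['font' => 'primary'|'fallback', 'text' => string].
 */
function splitIntoFontRuns(string $utf8Text, array $primaryCoverage): array
{
    $runs = [];
    // An empty pattern with the /u flag yields one element per code point.
    foreach (preg_split('//u', $utf8Text, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $font = isset($primaryCoverage[mb_ord($char, 'UTF-8')]) ? 'primary' : 'fallback';
        $last = count($runs) - 1;
        if ($last >= 0 && $runs[$last]['font'] === $font) {
            $runs[$last]['text'] .= $char;                  // extend the current run
        } else {
            $runs[] = ['font' => $font, 'text' => $char];   // start a new run
        }
    }
    return $runs;
}
```

Each run could then be drawn with imagettftext() using the matching font file, advancing the x position by the width reported by imagettfbbox() for that run.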

PHP-Cairo built against Pango and fontconfig should have enough brains to do font substitution when appropriate.

You can try to use pdf_info_font() from the PDFlib extension. A good example is here: http://www.pdflib.com/pdflib-cookbook/fonts/font-metrics-info/php-font-metrics-info/

If you don't have a unicode font, you'll need to try something like
<?php
$trans = new Latin1UTF8();
$mixed = "MIXED TEXT INPUT";
print "Original: ".$mixed;
print "Latin1: ".$trans->mixed_to_latin1($mixed);
print "UTF-8: ".$trans->mixed_to_utf8($mixed);
class Latin1UTF8 {
private $latin1_to_utf8;
private $utf8_to_latin1;
public function __construct() {
for($i=32; $i<=255; $i++) {
$this->latin1_to_utf8[chr($i)] = utf8_encode(chr($i));
$this->utf8_to_latin1[utf8_encode(chr($i))] = chr($i);
}
}
public function mixed_to_latin1($text) {
foreach( $this->utf8_to_latin1 as $key => $val ) {
$text = str_replace($key, $val, $text);
}
return $text;
}
public function mixed_to_utf8($text) {
return utf8_encode($this->mixed_to_latin1($text));
}
}
?>
Taken from http://php.net/manual/en/function.utf8-decode.php
If the mixed and UTF-8 outputs are equal, then you can use it. If not, then you can't.

I ended up using the TTX utility to dump font metrics. I could then inspect the resulting .ttx files and look at the character->glyph map. I did this manually, but automatic parsing of the XML files is possible.
An example GNU Makefile which generates the .ttx files from a set of TrueType fonts in the same directory:
all: fontmetrics
fontmetrics: $(patsubst %.ttf,%.ttx,$(wildcard *.ttf))
.PHONY: fontmetrics
clean:
	rm -f *.ttx
%.ttx: %.ttf
	ttx -t cmap $<
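The automatic parsing mentioned above could look roughly like the following. It is a sketch under assumptions: it expects the cmap dump to contain `<map code="0x41" name="A"/>` elements somewhere below a `<cmap>` element, it needs the SimpleXML and mbstring extensions (PHP >= 7.2 for mb_ord), and the function names `loadCmapCodePoints` and `missingCodePoints` are made up for this example.

```php
<?php
/**
 * Collect the code points listed in a TTX cmap dump.
 * Returns a lookup table: code point (int) => true.
 */
function loadCmapCodePoints(string $ttxXml): array
{
    $xml = simplexml_load_string($ttxXml);
    if ($xml === false) {
        throw new RuntimeException('Could not parse TTX XML');
    }
    $codePoints = [];
    // Every cmap subtable is scanned; duplicates simply overwrite each other.
    foreach ($xml->xpath('//cmap//map[@code]') as $map) {
        // intval() with base 16 accepts the "0x" prefix used in the dump.
        $codePoints[intval((string) $map['code'], 16)] = true;
    }
    return $codePoints;
}

/**
 * Return the characters of $utf8Text that the font has no glyph mapping for,
 * keyed by code point. An empty array means the font covers the string.
 */
function missingCodePoints(string $utf8Text, array $codePoints): array
{
    $missing = [];
    foreach (preg_split('//u', $utf8Text, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $cp = mb_ord($char, 'UTF-8');
        if (!isset($codePoints[$cp])) {
            $missing[$cp] = $char;
        }
    }
    return $missing;
}
```

With something like `missingCodePoints($input, loadCmapCodePoints(file_get_contents('SomeFont.ttx')))` (the file name is a placeholder), the list of selectable fonts could be filtered to those that return an empty array.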

Related

Magento 1.9 invoice PDF print issue: Arabic text appears as crossed boxes

Description of issue
I am facing an issue with printing invoices, shipments and packing slips when a customer enters billing/shipping information in Arabic. The downloaded PDF looks like the screenshot below, with the Arabic characters not rendered.
[Screenshot: PDF invoice shows crossed boxes in place of Arabic characters]
Expected result: invoices, shipments, and packing slips should display all information properly, whether in English and/or Arabic, without using third-party extensions.
System information
CentOS 7
Magento 1.9.2.4
Tried solutions
I have exhausted the internet references, nothing really worked, the best I managed to do is the following.
1- Font forcing
$font = Zend_Pdf_Font::fontWithPath(Mage::getBaseDir() . '/lib/LinLibertineFont/AraJozoor-Regular.ttf');
The PDF document came out with separated characters, similar to the attached image.
[Screenshot: Arabic text with separated and reversed characters]
When I print an invoice PDF containing Arabic words, I get separate characters instead of complete Arabic words. The characters are shown left to right instead of right to left, and each one is rendered in isolation. E.g.: مدرسة (school) is shown like ة س ر د م
This differs from other right-to-left languages such as Hebrew, which have the same ordering issue but can be fixed by just reversing the characters. In Arabic, the characters must also be connected to each other.
I have partially solved the character-reversal issue, while keeping English words intact, using the following piece of code, but it still prints each character separately.
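function fixText($text){
    if(preg_match("/\p{Arabic}/u", $text ) ){
        // Reverse the whole string character by character (UTF-8 aware).
        preg_match_all('/./us', $text, $ar);
        $text = join('',array_reverse($ar[0]));
        // Reverse the non-Arabic words back so they stay readable.
        $words = explode( ' ', $text );
        foreach( $words as $i => $word ){
            if( !preg_match( "/\p{Arabic}/u", $word ) ){
                $words[$i] = implode( '', array_reverse( str_split( $word ) ) );
            }
        }
        $text = implode( ' ', $words );
    }
    // Return the text in every case, not only when it contains Arabic.
    return $text;
}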
2- Font-family server update
If you send the invoice via email, there is no Arabic text issue, even though it is generated by the same function. So my attention turned towards the server, and I tried updating its fonts with the following commands.
yum install linux-libertine-fonts
yum clean all
rm -rf /var/cache/yum
This is the font package used by Magento.
3- Using TCPDF library instead of Zend pdf library
This may help you:
1) Download a font that supports Arabic text and symbols.
2) Place the font in the lib directory.
3) Override the files app/code/core/Mage/Sales/Model/Order/Pdf/Abstract.php and app/code/core/Mage/Sales/Model/Order/Pdf/Items/Abstract.php
and replace
$font = Zend_Pdf_Font::fontWithPath(Mage::getBaseDir() . '/lib/LinLibertineFont/LinLibertine_Re-4.4.1.ttf');
with
$font = Zend_Pdf_Font::fontWithPath(Mage::getBaseDir() . '/lib/dejavu-sans/DejaVuSans.ttf');
I used it on Magento 1.9.
Please don't use this font "dejavu-sans/DejaVuSans.ttf".
You should use a font that supports your language and symbols.

how to use imagick annotateImage for chinese text?

I need to annotate an image with Chinese Text and I am using Imagick library right now.
An example of a Chinese Text is
这是中文
The Chinese Font file used is this
The file originally is named 华文黑体.ttf
it can also be found in Mac OSX under /Library/Font
I have renamed it to the English name STHeiTi.ttf to make it easier to refer to the file in PHP code.
In particular, I am using the Imagick::annotateImage function.
I also am using the answer from "How can I draw wrapped text using Imagick in PHP?".
The reason why I am using it is because it is successful for English text and application needs to annotate both English and Chinese, though not at the same time.
The problem is that when I run the annotateImage using Chinese text, I get annotation that looks like 罍
Code included here
The problem is that you are feeding ImageMagick the output of a "line splitter" (wordWrapAnnotation) to which you pass the text after running it through utf8_decode. This is certainly wrong if you are dealing with Chinese text: utf8_decode can only handle UTF-8 text that CAN be converted to ISO-8859-1 (the most common 8-bit extension of ASCII).
Now, I hope that your text is UTF-8 encoded. If it is not, you might be able to convert it like this:
$text = mb_convert_encoding($text, 'UTF-8', 'BIG-5');
or like this
$text = mb_convert_encoding($text, 'UTF-8', 'GB18030'); // only PHP >= 5.4.0
(in your code $text is rather $text1 and $text2).
Then there are (at least) two things to fix in your code:
pass the text "as is" (without utf8_decode) to wordWrapAnnotation,
change the argument of setTextEncoding from "utf-8" to "UTF-8"
as per specs
I hope that all variables in your code are initialized in some missing part of it. With the two changes above (the second one might not be necessary, but you never know...), and with the missing parts in place, I see no reason why your code should not work, unless your TTF file is broken or the Imagick library is broken (imagemagick, on which Imagick is based, is a great library, so I consider this last possibility rather unlikely).
EDIT:
Following your request, I update my answer with
a) the fact that setting mb_internal_encoding('utf-8') is very important for the solution, as you say in your answer, and
b) my proposal for a better line splitter, that works acceptably for western languages and for Chinese, and that is probably a good starting point for other languages using Han logograms (Japanese kanji and Korean hanja):
function wordWrapAnnotation(&$image, &$draw, $text, $maxWidth)
{
$regex = '/( |(?=\p{Han})(?<!\p{Pi})(?<!\p{Ps})|(?=\p{Pi})|(?=\p{Ps}))/u';
$cleanText = trim(preg_replace('/[\s\v]+/', ' ', $text));
$strArr = preg_split($regex, $cleanText, -1, PREG_SPLIT_DELIM_CAPTURE |
PREG_SPLIT_NO_EMPTY);
$linesArr = array();
$lineHeight = 0;
$goodLine = '';
$spacePending = false;
foreach ($strArr as $str) {
if ($str == ' ') {
$spacePending = true;
} else {
if ($spacePending) {
$spacePending = false;
$line = $goodLine.' '.$str;
} else {
$line = $goodLine.$str;
}
$metrics = $image->queryFontMetrics($draw, $line);
if ($metrics['textWidth'] > $maxWidth) {
if ($goodLine != '') {
$linesArr[] = $goodLine;
}
$goodLine = $str;
} else {
$goodLine = $line;
}
if ($metrics['textHeight'] > $lineHeight) {
$lineHeight = $metrics['textHeight'];
}
}
}
if ($goodLine != '') {
$linesArr[] = $goodLine;
}
return array($linesArr, $lineHeight);
}
In words: the input is first cleaned up by replacing all runs of whitespace, including newlines, with a single space, except for leading and trailing whitespace, which is removed. Then it is split either at spaces, or right before Han characters not preceded by "leading" characters (like opening parentheses or opening quotes), or right before "leading" characters. Lines are assembled in order not to be rendered in more than $maxWidth pixels horizontally, except when this is not possible by the splitting rules (in which case the final rendering will probably overflow). A modification in order to force splitting in overflow cases is not difficult. Note that, e.g., Chinese punctuation is not classified as Han in Unicode, so that, except for "leading" punctuation, no linebreak can be inserted before it by the algorithm.
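The forced splitting for overflow cases mentioned above might be sketched as follows. This is not part of the original function: `forceSplitToken` and the `$measure` callable are hypothetical, with `$measure` standing in for something like `fn($s) => $image->queryFontMetrics($draw, $s)['textWidth']`. The idea is to break a single token that is too wide into pieces, one character at a time.

```php
<?php
/**
 * Break one overlong token into pieces that each fit within $maxWidth.
 *
 * $measure is a hypothetical callable that returns the rendered width of a
 * string in pixels.
 */
function forceSplitToken(string $token, float $maxWidth, callable $measure): array
{
    $pieces = [];
    $current = '';
    // Iterate over code points so multi-byte characters are never cut in half.
    foreach (preg_split('//u', $token, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        if ($current !== '' && $measure($current . $char) > $maxWidth) {
            $pieces[] = $current;   // the current piece is full, start a new one
            $current = $char;
        } else {
            $current .= $char;      // a single character is always accepted
        }
    }
    if ($current !== '') {
        $pieces[] = $current;
    }
    return $pieces;
}
```

A caller could apply this to any `$str` whose own width already exceeds `$maxWidth` and feed the resulting pieces through the normal line-assembly loop. A piece consisting of one character wider than `$maxWidth` will still overflow.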
I'm afraid you will have to choose a TTF that can support Chinese code points. There are many sources for this, here are two:
http://www.wazu.jp/gallery/Fonts_ChineseTraditional.html
http://wildboar.net/multilingual/asian/chinese/language/fonts/unicode/non-microsoft/non-microsoft.html
Full solution here:
https://gist.github.com/2971092/232adc3ebfc4b45f0e6e8bb5934308d9051450a4
Key ideas:
Must set the html charset and internal encoding on the form and on the processing page
header('Content-Type: text/html; charset=utf-8');
mb_internal_encoding('utf-8');
These lines must be at the top of the PHP files.
Use this function to determine if text is Chinese and use the right font file
function isThisChineseText($text) {
return preg_match("/\p{Han}+/u", $text);
}
For more details check out https://stackoverflow.com/a/11219301/80353
Set TextEncoding properly in ImagickDraw object
$draw = new ImagickDraw();
// set utf 8 format
$draw->setTextEncoding('UTF-8');
Note the capitalized UTF. This was helpfully pointed out to me by Walter Tross in his answer here: https://stackoverflow.com/a/11207521/80353
Use preg_match_all to explode English words, Chinese Words and spaces
// separate the text by chinese characters or words or spaces
preg_match_all('/([\w]+)|(.)/u', $text, $matches);
$words = $matches[0];
Inspired by this answer https://stackoverflow.com/a/4113903/80353
Works just as well for english text

Fpdf and special characters

I am trying to write out some special characters with built in fonts, is there any way to do this?
$str = 'ščťžýáíéäúň§ôúőűáéóüöűú';
$str = iconv('UTF-8', 'windows-1252', $str);
the result is one letter Š, not too good. :)
I know it's an old thread, but I faced this issue this weekend and spent a long time Googling and playing, so here's a time saver.
http://fpdf.org/en/script/script92.php is the way to go to use diacritics (accented characters). But you need to add some code to it...
Slot this in at line 617
/* Modified by Vinod Patidar due to font key does not match in dejavu bold.*/
if ( $family == 'dejavu' && !empty($style) && ($style == 'B' || $style == 'b') ) {
$fontkey = $family.' '.strtolower($style);
} else {
$fontkey = $family.$style;
}
/* Modified end here*/
Then change
if($family=='arial')
$family = 'helvetica';
To
if($family=='arial'||$family=='dejavu')
$family = 'helvetica';
Then don't use the font from the example, "DejaVu Sans Condensed", because the Bold version of Condensed doesn't seem to contain all the characters.
You may also need to add the getPageWidth and getPageHeight methods from the normal fpdf.php script, as it is newer than tfpdf.php!
With the changes above
$pdf->AddFont('DejaVu','','DejaVuSans.ttf',true);
$pdf->AddFont('DejaVu','B','DejaVuSans-Bold.ttf',true);
Works a good 'un with European languages
You'll need to use tFPDF, a derivative of FPDF. tFPDF uses the PHP multi-byte string functions and generates its output encoded with UTF-8. FPDF does not. You'll also need to use a font that supports all the Unicode characters you want to use. Most commonly, I'll use Arial.
See: http://fpdf.org/en/script/script92.php

How to Convert Arabic Characters to Unicode Using PHP

I want to know how I can convert a word into Unicode exactly like this site does:
http://www.arabunic.free.fr/
Does anyone know how to do that using PHP, considering that Arabic text may contain ligatures?
Thanks
Edit
I'm not sure what that "unicode" is, but I need to get each Arabic character as its equivalent machine number, considering that Arabic characters have different contextual forms depending on their position - see here:
http://en.wikipedia.org/wiki/Arabic_alphabet#Table_of_basic_letters
the same character in different position:
ب‎ | ـب‎ | ـبـ‎ | بـ‎
I think there must be a way to convert each Arabic character into its equivalent number, but how?
Edit
I still believe there's a way to convert each character to its form depending on position.
Any idea is appreciated.
All you need is a function called utf8Glyphs, which you can find in ArGlyphs.class.php; download it from ar-php,
and visit Ar-PHP for more information about the project and its classes.
This will reverse the word while keeping the correct forms (glyphs) of its characters.
Example of usage:
<?php
include('Arabic.php');
$Arabic = new Arabic('ArGlyphs');
$text = 'بسم الله الرحمن الرحيم';
$text = $Arabic->utf8Glyphs($text);
echo $text;
?>
I assume you want to convert بهروز to \u0628\u0647\u0631\u0648\u0632. Take a look at http://hsivonen.iki.fi/php-utf8/: after calling utf8ToUnicode('بهروز'), all you have to do is convert the integers in the returned array to hex, make sure they have 4 digits, and prefix them with \u. You can also get the same result using json_encode:
json_encode('بهروز') // returns "\u0628\u0647\u0631\u0648\u0632"
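The manual conversion described above (code point to 4-digit hex with a \u prefix) can also be done without the external library. The following is a minimal sketch assuming PHP >= 7.2 with mbstring; `toUnicodeEscapes` is a made-up helper name.

```php
<?php
/**
 * Convert every character of a UTF-8 string to a \uXXXX escape.
 * Unlike json_encode(), this also escapes ASCII characters, and code points
 * above U+FFFF are written with more than four hex digits instead of as
 * surrogate pairs.
 */
function toUnicodeEscapes(string $utf8Text): string
{
    $out = '';
    foreach (preg_split('//u', $utf8Text, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        // Single quotes keep '\u' as a literal backslash followed by "u".
        $out .= sprintf('\u%04x', mb_ord($char, 'UTF-8'));
    }
    return $out;
}
```

Calling `toUnicodeEscapes('بهروز')` gives `\u0628\u0647\u0631\u0648\u0632`, the same escapes json_encode produces for this word, minus the surrounding quotes.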
EDIT:
It seems you want the character codes of بب, where the form of the first character differs from that of the second. All you have to do is apply the bidi algorithm to your text using fribidi_log2vis, then get the character codes in one of the ways I described before.
Here's an example:
$string = 'بب'; // \u0628\u0628
$bidiString = fribidi_log2vis($string, FRIBIDI_LTR, FRIBIDI_CHARSET_UTF8);
json_encode($bidiString); // \ufe90\ufe91
EDIT:
I just remembered that TCPDF has a bidi algorithm implemented in pure PHP, so if you cannot get the fribidi PHP extension to work, you can use TCPDF (utf8Bidi is protected by default, so you need to make it public):
require_once('utf8.inc'); // http://hsivonen.iki.fi/php-utf8/
require_once('tcpdf.php'); // http://www.tcpdf.org/
$t = new TCPDF();
$text = 'بب';
$t->utf8Bidi(utf8ToUnicode($text)); // will return an array like array(0 => 65168, 1 => 65169)
Just set the element containing the arabic text to "rtl" (right to left), then input correctly spelled arabic and the text will flow with all ligatures looked for.
div {
direction:rtl;
}
On a side note, don't forget to read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
Think about that : The "ba" (ب) arabic letter is a "ba" no matter where it appears in the sentence.
Try this:
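<?php
$string = 'a';
// UTF-32LE avoids the byte order mark that plain 'UTF-32' may prepend.
$expanded = iconv('UTF-8', 'UTF-32LE', $string);
// 'V*' reads unsigned 32-bit little-endian integers, one per code point.
$arr = unpack('V*', $expanded);
print_r($arr);
?>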
I totally agree with FloatBird about using Arabic.php, which you will find at ar-php as he said. The thing is, they changed the class name from Arabic to I18N_Arabic after version 4, so for the code to work with Arabic.php version 4.0 you need to change it to
<?php
include('Arabic.php');
$Arabic = new I18N_Arabic('ArGlyphs');
$text = 'بسم الله الرحمن الرحيم';
$text = $Arabic->utf8Glyphs($text);
echo $text;
?>
Also notice that you need to put the php code file inside the I18N folder.
Anyway it is working fantastically, Thanks again FloatBird
I had a similar problem when I wanted to store an object that had values in Arabic: the Arabic text was stored as Unicode escape sequences. The solution was as follows.
$detailsLog = $product->only(['name', 'unit', 'quantity']);
$detailsLog = json_encode($detailsLog, JSON_UNESCAPED_UNICODE);
$log->details = $detailsLog;
$log->save();
When you pass JSON_UNESCAPED_UNICODE as the second parameter of json_encode, the Arabic words are returned without escaping.
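To make the difference concrete, here is a small standalone comparison of the two calls. The `$data` array is just illustrative sample data, not the model from the snippet above.

```php
<?php
// Illustrative sample data containing Arabic text.
$data = ['name' => 'بب'];

// Default behaviour: non-ASCII characters become \uXXXX escapes.
$escaped = json_encode($data);

// With the flag: the characters are written out as UTF-8.
$readable = json_encode($data, JSON_UNESCAPED_UNICODE);

// Both forms decode back to the same value.
$sameAfterDecode = json_decode($escaped, true) === json_decode($readable, true);
```

Here `$escaped` holds `{"name":"\u0628\u0628"}` and `$readable` holds `{"name":"بب"}`. Both are valid JSON, so the flag mainly matters when the stored text has to be readable or searchable as is.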
i think you could try:
<meta charset="utf-8" />
if this does not work use FloatBird Answer

shorten a word in a block of text

I have a block of text which occasionally has a really long word/web address which breaks out of my site's layout.
What is the best way to go through this block of text and shorten the words?
EXAMPLE:
this is some text and this a long word appears like this
fkdfjdksodifjdisosdidjsosdifosdfiosdfoisjdfoijsdfoijsdfoijsdfoijsdfoijsdfoisjdfoisdjfoisdfjosdifjosdifjosdifjosdifjosdifjsodifjosdifjosidjfosdifjsdoiofsij and i need that to either wrap in ALL browsers or trim the word.
You need the wordwrap function, I suppose.
You could truncate the string so it appears with an ellipsis in the middle or at the end of the string. However, this would be independent of the actual rendering in a web browser. There is no way for PHP to determine the actual width a string will have with a certain font when rendered in a browser, especially if you have defined fallback fonts and don't know which font is used in the browser, e.g.
font-family: Verdana, Arial, sans-serif;
Compare the following:
I am 23 characters long
I am 23 characters long
Both strings have the same length, but since one is monospaced and the other isn't, the actual rendered width differs. PHP cannot determine this. You'd have to find a client-side technology, probably JavaScript, to solve this for you.
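The truncation with an ellipsis in the middle mentioned above could be sketched like this. It counts characters, not rendered width, so the caveat about fonts still applies. `truncateMiddle` is a hypothetical helper and mbstring is assumed to be available.

```php
<?php
/**
 * Shorten $text to at most $maxChars characters by replacing the middle
 * with an ellipsis. Lengths are counted in characters, not pixels.
 */
function truncateMiddle(string $text, int $maxChars, string $ellipsis = '…'): string
{
    if (mb_strlen($text, 'UTF-8') <= $maxChars) {
        return $text;   // already short enough
    }
    $keep = $maxChars - mb_strlen($ellipsis, 'UTF-8');
    if ($keep <= 0) {
        // No room for any original text: return as much ellipsis as fits.
        return mb_substr($ellipsis, 0, $maxChars, 'UTF-8');
    }
    $head = intdiv($keep + 1, 2);   // the head gets the extra character if odd
    $tail = $keep - $head;
    return mb_substr($text, 0, $head, 'UTF-8')
        . $ellipsis
        . ($tail > 0 ? mb_substr($text, -$tail, null, 'UTF-8') : '');
}
```

For long URLs this keeps both the host part and the end of the path visible, which is often more useful than cutting off the tail.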
You could also wrap the text into an element with the CSS property overflow:hidden to make the text disappear after a fixed length.
Look around SO. I'm pretty sure this was asked more than once before.
You could use the word-wrap: break-word CSS property to wrap the text that breaks your layout.
Check out the Mozilla Developer Center examples which demonstrate its use.
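function fixlongwords($string) {
    $exploded = explode(' ', $string);
    $result = '';
    foreach($exploded as $curr) {
        if(strlen($curr) > 20) {
            // The fourth argument (true) forces a cut inside words without spaces;
            // double quotes make \n a real newline.
            $curr = wordwrap($curr, 20, "<br/>\n", true);
        }
        $result .= $curr.' ';
    }
    return $result;
}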
This should do the job.
You could do something like this:
preg_replace("/(\\S{20})/", '$1&zwnj;', $text);
It should* add a zero-width non-joiner character into every word after each run of 20 characters. This means they will word-wrap.
* (untested)
Based on @JonnyLitt's answer, here's my take on the problem:
<?php
function insertSoftBreak($string, $interval=20, $breakChr='&shy;') {
$splitString = explode(' ', $string);
foreach($splitString as $key => $val) {
if(strlen($val)>$interval) {
$splitString[$key] = wordwrap($val, $interval, $breakChr, true);
}
}
return implode(' ', $splitString);
}
$string = 'Hello, My name is fwwfdfhhhfhhhfrhgrhffwfweronwefbwuecfbryhfbqpibcqpbfefpibcyhpihbasdcbiasdfayifvbpbfawfgawg, because that is my name.';
echo insertSoftBreak($string);
?>
This breaks the string up into space-separated values and checks the length of each individual 'word' (words include symbols like dot, comma, or question mark). If a word is longer than $interval characters, a &shy; (soft hyphen) is inserted after every $interval characters.
I've chosen soft hyphens because they seem to be relatively well-supported across browsers, and they usually don't show unless the word actually wraps at that position.
I'm not aware of any other usable (and well supported) HTML entities that could be used instead (&zwnj; does not seem to work in FF 3.6, at least), so if cross-browser support for &shy; turns out to be lacking, a pure CSS or JavaScript-based solution would be best.
