Fpdf and special characters

I am trying to write out some special characters with the built-in fonts. Is there any way to do this?
$str = 'ščťžýáíéäúň§ôúőűáéóüöűú';
$str = iconv('UTF-8', 'windows-1252', $str);
The result is a single letter, Š, which is not too good. :)

I know it's an old thread, but I faced the issue this weekend and spent a long time Googling and playing, so here's a time saver.
http://fpdf.org/en/script/script92.php is the way to go to use diacritics (accented characters). But you need to add some code to it...
Slot this in at line 617
/* Modified by Vinod Patidar due to font key does not match in dejavu bold.*/
if ( $family == 'dejavu' && !empty($style) && ($style == 'B' || $style == 'b') ) {
    $fontkey = $family.' '.strtolower($style);
} else {
    $fontkey = $family.$style;
}
/* Modified end here*/
Then change
if($family=='arial')
$family = 'helvetica';
To
if($family=='arial'||$family=='dejavu')
$family = 'helvetica';
Then don't use the font from the example, "DejaVu Sans Condensed", because with Condensed the Bold version doesn't seem to contain all the characters.
You may also need to add the getPageWidth and getPageHeight methods from the normal fpdf.php script as it is newer than tfpdf.php!
With the changes above
$pdf->AddFont('DejaVu','','DejaVuSans.ttf',true);
$pdf->AddFont('DejaVu','B','DejaVuSans-Bold.ttf',true);
Works a good 'un with European languages

You'll need to use tFPDF, a derivative of FPDF. tFPDF uses the PHP multi-byte string functions and generates its output encoded in UTF-8; FPDF does not. You'll also need to use a font that supports all the Unicode characters you want to use. Most commonly, I use Arial.
See: http://fpdf.org/en/script/script92.php
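To see why the iconv call in the question cannot work, it may help to check which characters Windows-1252 can represent at all. The sketch below is only an illustration: charsMissingFromCp1252 is a made-up helper name, and it assumes the mbstring extension is loaded and the source file is saved as UTF-8. It converts each character to Windows-1252 and back, and reports the ones that do not survive the round trip.

```php
<?php
// Hypothetical helper: list the characters of a UTF-8 string that
// Windows-1252 cannot represent (they do not survive a round trip).
function charsMissingFromCp1252(string $utf8): array
{
    $missing = [];
    // Split into single UTF-8 characters.
    foreach (preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        $roundTrip = mb_convert_encoding(
            mb_convert_encoding($ch, 'Windows-1252', 'UTF-8'),
            'UTF-8',
            'Windows-1252'
        );
        if ($roundTrip !== $ch) {
            $missing[$ch] = true; // use keys to avoid duplicates
        }
    }
    return array_keys($missing);
}

$str = 'ščťžýáíéäúň§ôúőűáéóüöűú';
echo implode(' ', charsMissingFromCp1252($str)), "\n";
```

Any character this reports has no slot in the single-byte encoding, so no amount of converting will make a Windows-1252 built-in font show it; a Unicode font through tFPDF is the way around that.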

Related

how to use imagick annotateImage for chinese text?

I need to annotate an image with Chinese Text and I am using Imagick library right now.
An example of a Chinese Text is
这是中文
The Chinese Font file used is this
The file is originally named 华文黑体.ttf.
It can also be found in Mac OS X under /Library/Font.
I have renamed it to the English STHeiTi.ttf to make it easier to refer to the file in PHP code.
In particular the Imagick::annotateImage function
I also am using the answer from "How can I draw wrapped text using Imagick in PHP?".
The reason why I am using it is because it is successful for English text and application needs to annotate both English and Chinese, though not at the same time.
The problem is that when I run the annotateImage using Chinese text, I get annotation that looks like 罍
Code included here

The problem is that you are feeding ImageMagick the output of a "line splitter" (wordWrapAnnotation) whose text input you have passed through utf8_decode. This is certainly wrong if you are dealing with Chinese text. utf8_decode can only deal with UTF-8 text that CAN be converted to ISO-8859-1 (the most common 8-bit extension of ASCII).
Now, I hope that your text is UTF-8 encoded. If it is not, you might be able to convert it like this:
$text = mb_convert_encoding($text, 'UTF-8', 'BIG-5');
or like this
$text = mb_convert_encoding($text, 'UTF-8', 'GB18030'); // only PHP >= 5.4.0
(in your code $text is rather $text1 and $text2).
Then there are (at least) two things to fix in your code:
pass the text "as is" (without utf8_decode) to wordWrapAnnotation,
change the argument of setTextEncoding from "utf-8" to "UTF-8", as per the specs.
I hope that all variables in your code are initialized in some missing part of it. With the two changes above (the second one might not be necessary, but you never know...), and with the missing parts in place, I see no reason why your code should not work, unless your TTF file is broken or the Imagick library is broken (imagemagick, on which Imagick is based, is a great library, so I consider this last possibility rather unlikely).
EDIT:
Following your request, I update my answer with
a) the fact that setting mb_internal_encoding('utf-8') is very important for the solution, as you say in your answer, and
b) my proposal for a better line splitter, that works acceptably for western languages and for Chinese, and that is probably a good starting point for other languages using Han logograms (Japanese kanji and Korean hanja):
function wordWrapAnnotation(&$image, &$draw, $text, $maxWidth)
{
    $regex = '/( |(?=\p{Han})(?<!\p{Pi})(?<!\p{Ps})|(?=\p{Pi})|(?=\p{Ps}))/u';
    // The u flag matters here: without it, \v also matches the single byte 0x85,
    // which occurs inside the UTF-8 encoding of many characters.
    $cleanText = trim(preg_replace('/[\s\v]+/u', ' ', $text));
    $strArr = preg_split($regex, $cleanText, -1, PREG_SPLIT_DELIM_CAPTURE |
                                                  PREG_SPLIT_NO_EMPTY);
    $linesArr = array();
    $lineHeight = 0;
    $goodLine = '';
    $spacePending = false;
    foreach ($strArr as $str) {
        if ($str == ' ') {
            $spacePending = true;
        } else {
            if ($spacePending) {
                $spacePending = false;
                $line = $goodLine.' '.$str;
            } else {
                $line = $goodLine.$str;
            }
            $metrics = $image->queryFontMetrics($draw, $line);
            if ($metrics['textWidth'] > $maxWidth) {
                if ($goodLine != '') {
                    $linesArr[] = $goodLine;
                }
                $goodLine = $str;
            } else {
                $goodLine = $line;
            }
            if ($metrics['textHeight'] > $lineHeight) {
                $lineHeight = $metrics['textHeight'];
            }
        }
    }
    if ($goodLine != '') {
        $linesArr[] = $goodLine;
    }
    return array($linesArr, $lineHeight);
}
In words: the input is first cleaned up by replacing all runs of whitespace, including newlines, with a single space, except for leading and trailing whitespace, which is removed. Then it is split either at spaces, or right before Han characters not preceded by "leading" characters (like opening parentheses or opening quotes), or right before "leading" characters. Lines are assembled in order not to be rendered in more than $maxWidth pixels horizontally, except when this is not possible by the splitting rules (in which case the final rendering will probably overflow). A modification in order to force splitting in overflow cases is not difficult. Note that, e.g., Chinese punctuation is not classified as Han in Unicode, so that, except for "leading" punctuation, no linebreak can be inserted before it by the algorithm.
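The splitting step can be tried on its own, without Imagick. The snippet below is a small sketch that reuses the same regular expression; splitForWrapping is a made-up name, and the sample string is only an illustration. Because the delimiters are captured, gluing the pieces back together gives the cleaned-up input again.

```php
<?php
// Hypothetical helper: only the cleanup and splitting steps of the
// line splitter, so the pieces can be inspected without Imagick.
function splitForWrapping(string $text): array
{
    $regex = '/( |(?=\p{Han})(?<!\p{Pi})(?<!\p{Ps})|(?=\p{Pi})|(?=\p{Ps}))/u';
    // Collapse all runs of whitespace (including newlines) into one space.
    $clean = trim(preg_replace('/[\s\v]+/u', ' ', $text));
    return preg_split($regex, $clean, -1,
                      PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}

$pieces = splitForWrapping("Hello   world\n这是（中文）");
foreach ($pieces as $piece) {
    echo '[', $piece, ']', "\n";
}
```

Each Han character becomes its own piece, while an opening parenthesis stays attached to the character that follows it, so a line can never end on the opening parenthesis.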

I'm afraid you will have to choose a TTF that can support Chinese code points. There are many sources for this, here are two:
http://www.wazu.jp/gallery/Fonts_ChineseTraditional.html
http://wildboar.net/multilingual/asian/chinese/language/fonts/unicode/non-microsoft/non-microsoft.html

Full solution here:
https://gist.github.com/2971092/232adc3ebfc4b45f0e6e8bb5934308d9051450a4
Key ideas:
Must set the html charset and internal encoding on the form and on the processing page
header('Content-Type: text/html; charset=utf-8');
mb_internal_encoding('utf-8');
These lines must be at the top lines of the php files.
Use this function to determine if text is Chinese and use the right font file
function isThisChineseText($text) {
return preg_match("/\p{Han}+/u", $text);
}
For more details check out https://stackoverflow.com/a/11219301/80353
Set TextEncoding properly in ImagickDraw object
$draw = new ImagickDraw();
// set utf 8 format
$draw->setTextEncoding('UTF-8');
Note the capitalized UTF. This was helpfully pointed out to me by Walter Tross in his answer here: https://stackoverflow.com/a/11207521/80353
Use preg_match_all to explode English words, Chinese Words and spaces
// separate the text by chinese characters or words or spaces
preg_match_all('/([\w]+)|(.)/u', $text, $matches);
$words = $matches[0];
Inspired by this answer https://stackoverflow.com/a/4113903/80353
Works just as well for English text.

How to Convert Arabic Characters to Unicode Using PHP

I want to know how I can convert a word into Unicode exactly like:
http://www.arabunic.free.fr/
Does anyone know how to do that using PHP, considering that Arabic text may contain ligatures?
thanks
Edit
I'm not sure what that "unicode" is, but I need to get each Arabic character as its equivalent machine number, considering that Arabic characters have different contextual forms depending on their position - see here:
http://en.wikipedia.org/wiki/Arabic_alphabet#Table_of_basic_letters
the same character in different position:
ب‎ | ـب‎ | ـبـ‎ | بـ‎
I think there must be a way to convert each Arabic character into its equivalent number, but how?
Edit
I still believe there's a way to convert each character to its form depending on its position.
Any idea is appreciated.

All you need is a function called utf8Glyphs, which you can find in ArGlyphs.class.php; download it from ar-php,
and visit Ar-PHP for more information about the project and its classes.
This will reverse the word while giving each of its characters the right form (glyph).
Example of usage:
<?php
include('Arabic.php');
$Arabic = new Arabic('ArGlyphs');
$text = 'بسم الله الرحمن الرحيم';
$text = $Arabic->utf8Glyphs($text);
echo $text;
?>

I assume you want to convert بهروز to \u0628\u0647\u0631\u0648\u0632. Take a look at http://hsivonen.iki.fi/php-utf8/. All you have to do after calling utf8ToUnicode('بهروز') is convert the integers you got in the array to hex, make sure they have 4 digits, and prefix them with \u, and you're done. You can also get the same result using json_encode:
json_encode('بهروز') // returns "\u0628\u0647\u0631\u0648\u0632"
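If neither the library nor json_encode is convenient, the conversion described above can also be written by hand. This is just a sketch: toUnicodeEscapes is a made-up name, and it assumes PHP 7.2 or later with mbstring (for mb_ord) and a source file saved as UTF-8. Unlike json_encode, it does not emit surrogate pairs for characters outside the Basic Multilingual Plane; it simply prints their code point in hex.

```php
<?php
// Hypothetical helper: turn each character of a UTF-8 string into \uXXXX,
// with the code point zero-padded to at least 4 hex digits.
function toUnicodeEscapes(string $utf8): string
{
    $out = '';
    foreach (preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        $out .= sprintf('\\u%04x', mb_ord($ch, 'UTF-8'));
    }
    return $out;
}

echo toUnicodeEscapes('بهروز'), "\n"; // \u0628\u0647\u0631\u0648\u0632
```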
EDIT:
It seems you want the character codes of بب, where the first character differs from the second. All you have to do is apply the bidi algorithm to your text using fribidi_log2vis, then get the character codes in one of the ways I described before.
here's example:
$string = 'بب'; // \u0628\u0628
$bidiString = fribidi_log2vis($string, FRIBIDI_LTR, FRIBIDI_CHARSET_UTF8);
json_encode($bidiString); // \ufe90\ufe91
EDIT:
I just remembered that TCPDF has a bidi algorithm implemented in pure PHP, so if you cannot get the fribidi extension of PHP to work, you can use TCPDF (utf8Bidi is protected by default, so you need to make it public).
require_once('utf8.inc'); // http://hsivonen.iki.fi/php-utf8/
require_once('tcpdf.php'); // http://www.tcpdf.org/
$t = new TCPDF();
$text = 'بب';
$t->utf8Bidi(utf8ToUnicode($text)); // will return an array like array(0 => 65168, 1 => 65169)

Just set the element containing the Arabic text to "rtl" (right to left), then input correctly spelled Arabic, and the text will flow with all the ligatures you are looking for.
div {
direction:rtl;
}
On a side note, don't forget to read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
Think about that : The "ba" (ب) arabic letter is a "ba" no matter where it appears in the sentence.

Try this:
<?php
$string = 'a';
$expanded = iconv('UTF-8', 'UTF-32', $string);
$arr = unpack('L*', $expanded);
print_r($arr);
?>

I totally agree with FloatBird about using Arabic.php, which you will find, as he said, at ar-php. The thing is, they changed the class name after version 4 from Arabic to I18N_Arabic, so in order for the code to work with Arabic.php version 4.0 you need to change it to
<?php
include('Arabic.php');
$Arabic = new I18N_Arabic('ArGlyphs');
$text = 'بسم الله الرحمن الرحيم';
$text = $Arabic->utf8Glyphs($text);
echo $text;
?>
Also notice that you need to put the php code file inside the I18N folder.
Anyway it is working fantastically, Thanks again FloatBird

I had a similar problem when I wanted to store an object that had values in Arabic: the Arabic text was stored as Unicode escape sequences. The solution was as follows.
$detailsLog = $product->only(['name', 'unit', 'quantity']);
$detailsLog = json_encode($detailsLog, JSON_UNESCAPED_UNICODE);
$log->details = $detailsLog;
$log->save();
When you pass JSON_UNESCAPED_UNICODE as the second parameter of json_encode, the Arabic words are returned without being escaped.
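The difference is easy to see without a database. A minimal sketch, with a made-up $details array standing in for the product fields above (the source file is assumed to be saved as UTF-8):

```php
<?php
// Made-up sample data standing in for $product->only([...]).
$details = ['name' => 'قلم', 'quantity' => 3];

// Default behaviour: non-ASCII characters become \uXXXX escapes.
$escaped = json_encode($details);

// With the flag: the Arabic text is kept as UTF-8.
$readable = json_encode($details, JSON_UNESCAPED_UNICODE);

echo $escaped, "\n";  // {"name":"\u0642\u0644\u0645","quantity":3}
echo $readable, "\n"; // {"name":"قلم","quantity":3}
```

Both strings decode to the same data with json_decode; the flag only changes how the text looks when stored.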

I think you could try:
<meta charset="utf-8" />
If this does not work, use FloatBird's answer.

PHP-GD: Dealing with Unicode characters

I am developing a web service that renders characters using the PHP GD extension, using a user-selected TTF font.
This works fine in ASCII-land, but there are a few problems:
The string to be rendered comes in as UTF-8. I would like to limit the list of user-selectable fonts to only those which can render the string properly, as some fonts only have glyphs for ASCII characters, ISO 8859-1, etc.
In the case where some decorative characters are included, it would be fine to render the majority of characters in the selected font and render the decorative characters in Arial (or whatever font contains the extended glyphs).
It does not seem like PHP-GD has support for querying the font metadata sufficiently to figure out if a character can be rendered in a given font. What is a good way to get font metrics into PHP? Is there a command-line utility that can dump in XML or other parsable format?

PHP-Cairo built against Pango and fontconfig should have enough brains to do font substitution when appropriate.

You can try to use pdf_info_font() from pdflib extension. Good example is there: http://www.pdflib.com/pdflib-cookbook/fonts/font-metrics-info/php-font-metrics-info/

If you don't have a unicode font, you'll need to try something like
<?php
$trans = new Latin1UTF8();
$mixed = "MIXED TEXT INPUT";
print "Original: ".$mixed;
print "Latin1: ".$trans->mixed_to_latin1($mixed);
print "UTF-8: ".$trans->mixed_to_utf8($mixed);
class Latin1UTF8 {
    private $latin1_to_utf8;
    private $utf8_to_latin1;
    public function __construct() {
        for($i=32; $i<=255; $i++) {
            $this->latin1_to_utf8[chr($i)] = utf8_encode(chr($i));
            $this->utf8_to_latin1[utf8_encode(chr($i))] = chr($i);
        }
    }
    public function mixed_to_latin1($text) {
        foreach( $this->utf8_to_latin1 as $key => $val ) {
            $text = str_replace($key, $val, $text);
        }
        return $text;
    }
    public function mixed_to_utf8($text) {
        return utf8_encode($this->mixed_to_latin1($text));
    }
}
?>
Taken from http://php.net/manual/en/function.utf8-decode.php
If the mixed and utf-8 characters are equal then you can use it. If not, then you can't.

I ended up using the TTX utility to dump font metrics. I could then inspect the resulting .ttx files and look at the character->glyph map. I did this manually, but automatic parsing of the XML files is possible.
An example GNU Makefile which generates the .ttx files from a set of TrueType fonts in the same directory:
all: fontmetrics
fontmetrics: $(patsubst %.ttf,%.ttx,$(wildcard *.ttf))
.PHONY: fontmetrics
clean:
	rm -f *.ttx
%.ttx: %.ttf
	ttx -t cmap $<
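For the automatic parsing, something along these lines might work. It is a sketch under a few assumptions: the cmap dump lists its entries as map elements with a hex code attribute (as in the hand-written sample below), the SimpleXML and mbstring extensions are available, and the function names are made up for this example.

```php
<?php
// Collect the code points listed in a ttx cmap dump,
// e.g. <map code="0x41" name="A"/>.
function codePointsInTtx(string $ttxXml): array
{
    $xml = simplexml_load_string($ttxXml);
    $codePoints = [];
    foreach ($xml->xpath('//map') as $map) {
        $codePoints[intval((string) $map['code'], 16)] = true;
    }
    return $codePoints;
}

// True if every character of $text has an entry in the dumped cmap.
function fontCoversText(array $codePoints, string $text): bool
{
    foreach (preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        if (!isset($codePoints[mb_ord($ch, 'UTF-8')])) {
            return false;
        }
    }
    return true;
}

// Tiny hand-written sample standing in for a real .ttx file.
$sample = <<<XML
<ttFont>
  <cmap>
    <cmap_format_4 platformID="3" platEncID="1" language="0">
      <map code="0x41" name="A"/>
      <map code="0xef" name="idieresis"/>
      <map code="0x74" name="t"/>
    </cmap_format_4>
  </cmap>
</ttFont>
XML;

$codePoints = codePointsInTtx($sample);
var_dump(fontCoversText($codePoints, 'Aït'));     // this sample font has all three
var_dump(fontCoversText($codePoints, 'Aït Ben')); // space, B, e, n are missing
```

With a real font, file_get_contents on the generated .ttx file would replace $sample, and the list of user-selectable fonts could be filtered with fontCoversText.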

shorten a word in a block of text

I have a block of text which occasionally has a really long word/web address which breaks out of my site's layout.
What is the best way to go through this block of text and shorten the words?
EXAMPLE:
this is some text and this a long word appears like this
fkdfjdksodifjdisosdidjsosdifosdfiosdfoisjdfoijsdfoijsdfoijsdfoijsdfoijsdfoisjdfoisdjfoisdfjosdifjosdifjosdifjosdifjosdifjsodifjosdifjosidjfosdifjsdoiofsij and i need that to either wrap in ALL browsers or trim the word.

You need the wordwrap function, I suppose.

You could truncate the string so it appears with an ellipsis in the middle or the end of the string. However, this would be independent from the actual rendering in a webbrowser. There is no way for PHP to determine the actual length a string will have with a certain font when rendered in a browser, especially if you have defined fallback fonts and don't know which font is used in the browser, e.g.
font-family: Verdana, Arial, sans-serif;
Compare the following:
I am 23 characters long
I am 23 characters long
Both strings have the same length, but since one is monospaced and the other isn't, the actual width they will have is different. PHP cannot determine this. You'd have to find a client-side technology, probably JavaScript, to solve this for you.
You could also wrap the text into an element with the CSS property overflow:hidden to make the text disappear after a fixed length.
Look around SO. I'm pretty sure this was asked more than once before.

You could use the word-wrap: break-word CSS property to wrap the text that breaks your layout.
Check out the Mozilla Developer Center examples which demonstrate its use.

function fixlongwords($string) {
    $exploded = explode(' ', $string);
    $result = '';
    foreach($exploded as $curr) {
        if(strlen($curr) > 20) {
            // double quotes so that \n is a real newline; true forces a cut inside long words
            $curr = wordwrap($curr, 20, "<br/>\n", true);
        }
        $result .= $curr.' ';
    }
    return $result;
}
This should do the job.

You could do something like this:
preg_replace("/(\\S{20})/", '$1&zwnj;', $text);
It should* add a zero-width non-joiner character into all words every 20 characters. This means they will word-wrap.
* (untested)

Based on @JonnyLitt's answer, here's my take on the problem:
<?php
function insertSoftBreak($string, $interval=20, $breakChr='&shy;') {
    $splitString = explode(' ', $string);
    foreach($splitString as $key => $val) {
        if(strlen($val)>$interval) {
            $splitString[$key] = wordwrap($val, $interval, $breakChr, true);
        }
    }
    return implode(' ', $splitString);
}
$string = 'Hello, My name is fwwfdfhhhfhhhfrhgrhffwfweronwefbwuecfbryhfbqpibcqpbfefpibcyhpihbasdcbiasdfayifvbpbfawfgawg, because that is my name.';
echo insertSoftBreak($string);
?>
This breaks the string up into space-separated values and checks the length of each individual 'word' (words include symbols like dot, comma, or question mark). If a word is longer than $interval characters, it inserts a &shy; (soft hyphen) after every $interval-th character.
I've chosen soft hyphens because they seem to be relatively well-supported across browsers, and they usually don't show unless the word actually wraps at that position.
I'm not aware of any other usable (and well supported) HTML entities that could be used instead (&zwnj; does not seem to work in FF 3.6, at least), so if cross-browser support for &shy; turns out lacking, a pure CSS or JavaScript-based solution would be best.
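One caveat: strlen and wordwrap count bytes, so a long word with accented or other multi-byte UTF-8 characters can be cut in the middle of a character. A possible variant that counts characters instead is sketched below. insertSoftHyphens is a made-up name, PHP 7 or later is assumed for the "\u{00AD}" syntax, and it inserts the soft hyphen as a real character rather than as an HTML entity.

```php
<?php
// Hypothetical multi-byte-safe variant: put a soft hyphen (U+00AD) after
// every $interval non-space characters, but only inside longer runs.
function insertSoftHyphens(string $text, int $interval = 20): string
{
    // The u flag makes \S match whole UTF-8 characters, not bytes.
    // The lookahead avoids adding a hyphen at the very end of a word.
    $pattern = '/(\S{' . $interval . '})(?=\S)/u';
    return preg_replace($pattern, '$1' . "\u{00AD}", $text);
}

$long = 'fwwfdfhhhfhhhfrhgrhffwfweronwefbwuecfbryhfbqpibcqpbfefpibcyhpihb';
echo insertSoftHyphens('Hello, my name is ' . $long . ', honestly.'), "\n";
```

Words of $interval characters or fewer are left untouched, so ordinary text passes through unchanged.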

How to handle diacritics (accents) when rewriting 'pretty URLs'

I rewrite URLs to include the title of user generated travelblogs.
I do this for both readability of URLs and SEO purposes.
http://www.example.com/gallery/280-Gorges_du_Todra/
The first integer is the id, the rest is for us humans (but is irrelevant for requesting the resource).
Now people can write titles containing any UTF-8 character, but most are not allowed in the URL.
My audience is generally English speaking, but since they travel, they like to include names like
Aït Ben Haddou
What is the proper way to translate this for display in a URL, using PHP on Linux?
So far I've seen several solutions:
just strip all non allowed characters, replace spaces
this has strange results:
'Aït Ben Haddou' → /gallery/280-At_Ben_Haddou/
Not really helpful.
just strip all non allowed characters, replace spaces, leave charcode (stackoverflow.com) most likely because of the 'regex-hammer' used
this gives strange results:
'tést tést' → /questions/0000/t233st-t233st
translate to 'nearest equivalent'
'Aït Ben Haddou' → /gallery/280-Ait_Ben_Haddou/
But this goes wrong for German; for example 'ü' should be transliterated as 'ue'.
For me, as a Dutch person, the 3rd result 'looks' the best.
I'm quite sure, however, that (1) many people will have a different opinion and (2) it is just plain wrong in the German example.
Another problem with the 3rd option is: how to find all possible characters that can be converted to a 7bit equivalent?
So the question is:
what, in your opinion, is the most desirable result. (within tech-limits)
How to technically solve it. (reach the desired result) with PHP.

Ultimately, you're going to have to give up on the idea of "correct", for this problem. Translating the string, no matter how you do it, destroys accuracy in the name of compatibility and readability. All three options are equally compatible, but #1 and #2 suffer in terms of readability. So just run with it and go for whatever looks best — option #3.
Yes, the translations are wrong for German, but unless you start requiring your users to specify what language their titles are in (and restricting them to only one), you're not going to solve that problem without far more effort than it's worth. (For example, running each word in the title through dictionaries for each known language and translating that word's diacritics according to the rules of its language would work, but it's excessive.)
Alternatively, if German is a higher concern than other languages, make your translation always use the German version when one exists: ä→ae, ë→e, ï→i, ö→oe, ü→ue.
Edit:
Oh, and as for the actual method, I'd translate the special cases, if any, via str_replace, then use iconv for the rest:
$text = str_replace(array("ä", "ö", "ü", "ß"), array("ae", "oe", "ue", "ss"), $text);
$text = iconv('UTF-8', 'US-ASCII//TRANSLIT', $text);
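What //TRANSLIT produces depends on the iconv implementation and the current locale, so results can differ between servers. If predictable output matters more than coverage, the same idea can be sketched with a hand-written map instead of iconv. Everything here is an illustration: slugify is a made-up name, the map is deliberately tiny and would need extending, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: special cases first (German convention), then a small
// hand-written "nearest equivalent" map, then strip whatever is left.
function slugify(string $title): string
{
    $map = [
        'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss',
        'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue',
        'ï' => 'i',  'é' => 'e',  'è' => 'e',  'à' => 'a', 'ç' => 'c',
    ];
    $ascii = strtr($title, $map);
    // Drop anything that is not an ASCII letter, digit, space, underscore or hyphen.
    $ascii = preg_replace('/[^A-Za-z0-9 _-]/', '', $ascii);
    // Collapse runs of spaces into a single underscore.
    return preg_replace('/ +/', '_', trim($ascii));
}

echo '/gallery/280-' . slugify('Aït Ben Haddou') . '/', "\n"; // /gallery/280-Ait_Ben_Haddou/
```

Characters that are not in the map are simply dropped, which is the first option from the question, so the map decides how good the result looks.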

To me the third is most readable.
You could use a little dictionary, e.g. ï -> i and ü -> ue, to specify how you'd like various characters to be translated.

As an interesting side note, on SO nothing seems to really matter after the ID -- this is a link to this page:
How to handle diacritics (accents) when rewriting 'pretty URLs'
Obviously the motivation is to allow title changes without breaking links, and you may want to consider that feature as well.

Nice topic, I had the same problem a while ago.
Here's how I fixed it:
function title2url($string=null){
    // return if empty
    if(empty($string)) return false;
    // replace spaces by "-"
    // convert accents to html entities (after utf8_decode the input is Latin-1, so say so explicitly)
    $string=htmlentities(utf8_decode(str_replace(' ', '-', $string)), ENT_COMPAT, 'ISO-8859-1');
    // remove the accent from the letter
    $string=preg_replace(array('#&([a-zA-Z]){1,2}(acute|grave|circ|tilde|uml|ring|elig|zlig|slash|cedil|strok|lig){1};#', '#&[euro]{1};#'), array('${1}', 'E'), $string);
    // now, everything but alphanumeric and -_ can be removed
    // also remove double dashes
    $string=preg_replace(array('#[^a-zA-Z0-9\-_]#', '#[\-]{2,}#'), array('', '-'), html_entity_decode($string));
    // hand the result back to the caller
    return $string;
}
Here's how my function works:
Convert it to html entities
Strip the accents
Remove all remaining weird chars

Now people can write titles containing any UTF-8 character, but most are not allowed in the URL.
On the contrary, most are allowed. See for example Wikipedia's URLs - things like http://en.wikipedia.org/wiki/Café (aka http://en.wikipedia.org/wiki/Caf%C3%A9) display nicely - even if StackOverflow's highlighter doesn't pick them out correctly :-)
The trick is reading them reliably across any hosting environment; there are problems with CGI and Windows servers, particularly IIS, for example.
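In PHP the percent-encoded form from the Wikipedia example can be produced and read back with rawurlencode and rawurldecode. A small sketch, assuming the source file is saved as UTF-8:

```php
<?php
$title = 'Café';

// Percent-encode the UTF-8 bytes of the title for use in a URL path.
$path = '/wiki/' . rawurlencode($title);
echo $path, "\n";               // /wiki/Caf%C3%A9

// Decoding gives the original UTF-8 string back.
echo rawurldecode($path), "\n"; // /wiki/Café
```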

This is a good function:
function friendlyURL($string) {
    setlocale(LC_CTYPE, 'en_US.UTF8');
    $string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
    $string = str_replace(' ', '-', $string);
    $string = preg_replace('/\\s+/', '-', $string);
    $string = strtolower($string);
    return $string;
}
