I need to annotate an image with Chinese text, and I am currently using the Imagick library.
An example of Chinese text is
这是中文
The Chinese font file I am using was originally named 华文黑体.ttf.
It can also be found on Mac OS X under /Library/Fonts.
I have renamed it to STHeiTi.ttf to make it easier to reference the file in PHP code.
In particular, I am using the Imagick::annotateImage function.
I am also using the answer from "How can I draw wrapped text using Imagick in PHP?".
I am using it because it works for English text, and the application needs to annotate both English and Chinese, though not at the same time.
The problem is that when I run annotateImage with Chinese text, I get an annotation that looks like 罍
Code included here
The problem is that you are feeding ImageMagick the output of a "line splitter" (wordWrapAnnotation), and you apply utf8_decode to the text before passing it in. This is certainly wrong for Chinese text: utf8_decode can only handle UTF-8 text that can be converted to ISO-8859-1 (the most common 8-bit extension of ASCII).
Now, I hope that your text is UTF-8 encoded. If it is not, you might be able to convert it like this:
$text = mb_convert_encoding($text, 'UTF-8', 'BIG-5');
or like this
$text = mb_convert_encoding($text, 'UTF-8', 'GB18030'); // only PHP >= 5.4.0
(in your code $text is rather $text1 and $text2).
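A minimal sketch of that conversion step, assuming the only non-UTF-8 input you expect arrives in one known legacy encoding. The helper name ensureUtf8 and its $fallbackEncoding parameter are hypothetical and not part of the original code:

```php
<?php
// Hypothetical helper: returns $text unchanged when it is already valid
// UTF-8, otherwise converts it from an assumed legacy encoding.
function ensureUtf8(string $text, string $fallbackEncoding = 'GB18030'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $fallbackEncoding);
}

$utf8   = '这是中文';
// Simulate legacy input by converting the sample away from UTF-8 first.
$legacy = mb_convert_encoding($utf8, 'GB18030', 'UTF-8');

var_dump(ensureUtf8($utf8) === $utf8);
var_dump(ensureUtf8($legacy) === $utf8);
```

The check only tells you whether the bytes form valid UTF-8. It cannot tell you which legacy encoding was used, so the fallback has to come from knowledge of your data source.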
Then there are (at least) two things to fix in your code:
pass the text "as is" (without utf8_decode) to wordWrapAnnotation,
change the argument of setTextEncoding from "utf-8" to "UTF-8", as per the specs.
I hope that all variables in your code are initialized in some missing part of it. With the two changes above (the second one might not be necessary, but you never know...), and with the missing parts in place, I see no reason why your code should not work, unless your TTF file is broken or the Imagick library is broken (ImageMagick, on which Imagick is based, is a great library, so I consider this last possibility rather unlikely).
EDIT:
Following your request, I update my answer with
a) the fact that setting mb_internal_encoding('utf-8') is very important for the solution, as you say in your answer, and
b) my proposal for a better line splitter, that works acceptably for western languages and for Chinese, and that is probably a good starting point for other languages using Han logograms (Japanese kanji and Korean hanja):
function wordWrapAnnotation(&$image, &$draw, $text, $maxWidth)
{
    $regex = '/( |(?=\p{Han})(?<!\p{Pi})(?<!\p{Ps})|(?=\p{Pi})|(?=\p{Ps}))/u';
    // the /u flag keeps \s and \v from matching single bytes inside UTF-8 characters
    $cleanText = trim(preg_replace('/[\s\v]+/u', ' ', $text));
    $strArr = preg_split($regex, $cleanText, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $linesArr = array();
    $lineHeight = 0;
    $goodLine = '';
    $spacePending = false;
    foreach ($strArr as $str) {
        if ($str == ' ') {
            // remember the space; it is only emitted if the line continues
            $spacePending = true;
        } else {
            if ($spacePending) {
                $spacePending = false;
                $line = $goodLine.' '.$str;
            } else {
                $line = $goodLine.$str;
            }
            $metrics = $image->queryFontMetrics($draw, $line);
            if ($metrics['textWidth'] > $maxWidth) {
                // the candidate line is too wide: flush the last line that fit
                if ($goodLine != '') {
                    $linesArr[] = $goodLine;
                }
                $goodLine = $str;
            } else {
                $goodLine = $line;
            }
            if ($metrics['textHeight'] > $lineHeight) {
                $lineHeight = $metrics['textHeight'];
            }
        }
    }
    if ($goodLine != '') {
        $linesArr[] = $goodLine;
    }
    return array($linesArr, $lineHeight);
}
In words: the input is first cleaned up by replacing all runs of whitespace, including newlines, with a single space, except for leading and trailing whitespace, which is removed. Then it is split either at spaces, or right before Han characters not preceded by "leading" characters (like opening parentheses or opening quotes), or right before "leading" characters. Lines are assembled in order not to be rendered in more than $maxWidth pixels horizontally, except when this is not possible by the splitting rules (in which case the final rendering will probably overflow). A modification in order to force splitting in overflow cases is not difficult. Note that, e.g., Chinese punctuation is not classified as Han in Unicode, so that, except for "leading" punctuation, no linebreak can be inserted before it by the algorithm.
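The clean-up and splitting steps described above can be tried without Imagick. The following is only a sketch: the helper name splitIntoPieces is hypothetical, and it covers just the first two steps (no width measurement or line assembly):

```php
<?php
// Same splitting regex as in wordWrapAnnotation.
$regex = '/( |(?=\p{Han})(?<!\p{Pi})(?<!\p{Ps})|(?=\p{Pi})|(?=\p{Ps}))/u';

// Hypothetical helper that performs only the clean-up and split steps.
function splitIntoPieces(string $text, string $regex): array
{
    // Collapse whitespace runs (including newlines) and trim both ends.
    $cleanText = trim(preg_replace('/[\s\v]+/u', ' ', $text));
    // Split at spaces and before Han or "leading" characters,
    // keeping the spaces as separate pieces.
    return preg_split($regex, $cleanText, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}

$pieces = splitIntoPieces("Hello  world\n这是中文", $regex);
print_r($pieces);
```

Because the delimiters are captured, joining the pieces back together reproduces the cleaned-up text, which is a handy sanity check when adapting the regex to other scripts.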
I'm afraid you will have to choose a TTF that can support Chinese code points. There are many sources for this, here are two:
http://www.wazu.jp/gallery/Fonts_ChineseTraditional.html
http://wildboar.net/multilingual/asian/chinese/language/fonts/unicode/non-microsoft/non-microsoft.html
Full solution here:
https://gist.github.com/2971092/232adc3ebfc4b45f0e6e8bb5934308d9051450a4
Key ideas:
You must set the HTML charset and the internal encoding on both the form page and the processing page:
header('Content-Type: text/html; charset=utf-8');
mb_internal_encoding('utf-8');
These lines must be at the top of the PHP files.
Use this function to determine if text is Chinese and use the right font file
function isThisChineseText($text) {
return preg_match("/\p{Han}+/u", $text);
}
For more details check out https://stackoverflow.com/a/11219301/80353
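A small sketch of how that check could drive the font choice. The helper pickFontFile is hypothetical, and Arial.ttf is only a placeholder for whatever Latin font you actually use:

```php
<?php
function isThisChineseText($text) {
    return preg_match("/\p{Han}+/u", $text);
}

// Hypothetical helper: the Latin font file name is a placeholder.
function pickFontFile(string $text): string
{
    // Any Han character anywhere in the text selects the Chinese font.
    return isThisChineseText($text) ? 'STHeiTi.ttf' : 'Arial.ttf';
}

echo pickFontFile('这是中文'), "\n";
echo pickFontFile('This is English'), "\n";
```

Note that mixed text containing even one Han character selects the Chinese font, which fits the stated requirement that English and Chinese are not annotated at the same time.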
Set TextEncoding properly in ImagickDraw object
$draw = new ImagickDraw();
// set utf 8 format
$draw->setTextEncoding('UTF-8');
Note the capitalized UTF. This was helpfully pointed out to me by Walter Tross in his answer here: https://stackoverflow.com/a/11207521/80353
Use preg_match_all to split the text into English words, Chinese words and spaces
// separate the text by chinese characters or words or spaces
preg_match_all('/([\w]+)|(.)/u', $text, $matches);
$words = $matches[0];
Inspired by this answer https://stackoverflow.com/a/4113903/80353
This works just as well for English text.
I am trying to write out some special characters with built in fonts, is there any way to do this?
$str = 'ščťžýáíéäúň§ôúőűáéóüöűú';
$str = iconv('UTF-8', 'windows-1252', $str);
The result is a single letter, Š, which is not much use. :)
I know it's an old thread, but I faced this issue this weekend and spent a long time Googling and experimenting, so here's a time saver.
http://fpdf.org/en/script/script92.php is the way to go to use diacritics (accented characters). But you need to add some code to it...
Slot this in at line 617
/* Modified by Vinod Patidar due to font key does not match in dejavu bold.*/
if ( $family == 'dejavu' && !empty($style) && ($style == 'B' || $style == 'b') ) {
$fontkey = $family.' '.strtolower($style);
} else {
$fontkey = $family.$style;
}
/* Modified end here*/
Then change
if($family=='arial')
$family = 'helvetica';
To
if($family=='arial'||$family=='dejavu')
$family = 'helvetica';
Then don't use the font from the example, "DejaVu Sans Condensed", because the Condensed Bold version does not seem to contain all the characters.
You may also need to add the getPageWidth and getPageHeight methods from the normal fpdf.php script as it is newer than tfpdf.php!
With the changes above, the following works well with European languages:
$pdf->AddFont('DejaVu','','DejaVuSans.ttf',true);
$pdf->AddFont('DejaVu','B','DejaVuSans-Bold.ttf',true);
You'll need to use tFPDF, a derivative of FPDF. tFPDF uses the PHP multi-byte string functions and generates its output encoded as UTF-8. FPDF does not. You'll also need to use a font that supports all the Unicode characters you want to use. Most commonly, I use Arial.
See: http://fpdf.org/en/script/script92.php
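To see why converting to windows-1252 cannot work for every character, you can test whether a string survives a round trip through that encoding. This is only an illustrative sketch; the helper name fitsWindows1252 is hypothetical and is not part of FPDF or tFPDF:

```php
<?php
// Hypothetical helper: true when every character of a UTF-8 string
// can be represented in windows-1252.
function fitsWindows1252(string $utf8Text): bool
{
    $legacy    = mb_convert_encoding($utf8Text, 'Windows-1252', 'UTF-8');
    $roundTrip = mb_convert_encoding($legacy, 'UTF-8', 'Windows-1252');
    // Unrepresentable characters are substituted during the first
    // conversion, so the round trip no longer matches the input.
    return $roundTrip === $utf8Text;
}

var_dump(fitsWindows1252('áíéäúôöü§'));
var_dump(fitsWindows1252('čťňőű'));
```

Characters such as č, ť, ň, ő and ű have no slot in windows-1252, so text containing them needs a UTF-8 capable library and a Unicode font.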
I want to know how I can convert a word into Unicode exactly like this site does:
http://www.arabunic.free.fr/
Does anyone know how to do that using PHP, considering that Arabic text may contain ligatures?
Thanks
Edit
I'm not sure what exactly that "Unicode" is, but I need each Arabic character as its equivalent machine number, considering that Arabic characters have different contextual forms depending on their position; see here:
http://en.wikipedia.org/wiki/Arabic_alphabet#Table_of_basic_letters
the same character in different position:
ب | ـب | ـبـ | بـ
I think there must be a way to convert each Arabic character into its equivalent number, but how?
Edit
I still believe there's a way to convert each character to its form depending on position.
Any ideas are appreciated.
All you need is a function called utf8Glyphs, which you can find in ArGlyphs.class.php; download it from ar-php,
and visit Ar-PHP for more information about the project and its classes.
This will reverse the word while keeping the same character glyphs.
Example of usage:
<?php
include('Arabic.php');
$Arabic = new Arabic('ArGlyphs');
$text = 'بسم الله الرحمن الرحيم';
$text = $Arabic->utf8Glyphs($text);
echo $text;
?>
I assume you want to convert بهروز to \u0628\u0647\u0631\u0648\u0632. Take a look at http://hsivonen.iki.fi/php-utf8/: after calling utf8ToUnicode('بهروز'), all you have to do is convert the integers in the returned array to hex, pad them to 4 digits and prefix them with \u, and you're done. You can also get the same result using json_encode:
json_encode('بهروز') // returns "\u0628\u0647\u0631\u0648\u0632"
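If the mbstring extension is available (mb_ord requires PHP 7.2 or later), the same conversion can be sketched without an external library. The helper name codePoints is hypothetical:

```php
<?php
// Hypothetical helper: lists the code point of every character
// in a UTF-8 string, formatted as U+XXXX.
function codePoints(string $text): array
{
    // Split the string into individual UTF-8 characters.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_map(function ($char) {
        return sprintf('U+%04X', mb_ord($char, 'UTF-8'));
    }, $chars);
}

print_r(codePoints('بهروز'));
```

This reports the code point stored for each letter, which is the same wherever the letter sits in the word. Position-dependent presentation forms only appear after a shaping step.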
EDIT:
It seems you want the character codes of بب, where the form of the first character differs from the second. All you have to do is apply the bidi algorithm to your text using fribidi_log2vis, then get the character codes in one of the ways I described above.
Here's an example:
$string = 'بب'; // \u0628\u0628
$bidiString = fribidi_log2vis($string, FRIBIDI_LTR, FRIBIDI_CHARSET_UTF8);
json_encode($bidiString); // \ufe90\ufe91
EDIT:
I just remembered that TCPDF has a bidi algorithm implemented in pure PHP, so if you cannot get the fribidi PHP extension to work, you can use TCPDF (utf8Bidi is protected by default, so you need to make it public):
require_once('utf8.inc'); // http://hsivonen.iki.fi/php-utf8/
require_once('tcpdf.php'); // http://www.tcpdf.org/
$t = new TCPDF();
$text = 'بب';
$t->utf8Bidi(utf8ToUnicode($text)); // will return an array like array(0 => 65168, 1 => 65169)
Just set the direction of the element containing the Arabic text to "rtl" (right to left), then input correctly spelled Arabic, and the text will flow with all ligatures taken care of.
div {
direction:rtl;
}
On a side note, don't forget to read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
Think about this: the Arabic letter "ba" (ب) is a "ba" no matter where it appears in the sentence.
Try this:
<?php
$string = 'a';
// use an explicit byte order so that no BOM is prepended and unpack reads it correctly
$expanded = iconv('UTF-8', 'UTF-32LE', $string);
$arr = unpack('V*', $expanded);
print_r($arr);
?>
I totally agree with FloatBird about using Arabic.php, which you will find at ar-php, as he said. The thing is, they changed the class name from Arabic to I18N_Arabic after version 4, so for the code to work with Arabic.php version 4.0 you need to change it to
<?php
include('Arabic.php');
$Arabic = new I18N_Arabic('ArGlyphs');
$text = 'بسم الله الرحمن الرحيم';
$text = $Arabic->utf8Glyphs($text);
echo $text;
?>
Also note that you need to put the PHP code file inside the I18N folder.
Anyway, it works fantastically. Thanks again, FloatBird.
I had a similar problem when I wanted to store an object with Arabic values: the Arabic text was stored as Unicode escape sequences. The solution was as follows.
$detailsLog = $product->only(['name', 'unit', 'quantity']);
$detailsLog = json_encode($detailsLog, JSON_UNESCAPED_UNICODE);
$log->details = $detailsLog;
$log->save();
When you pass JSON_UNESCAPED_UNICODE as the second parameter of json_encode, the Arabic words are returned without escaping.
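The difference is easy to see on a small array. The data below is made up purely for illustration:

```php
<?php
// Illustrative data only.
$details = ['name' => 'قلم', 'quantity' => 3];

$escaped  = json_encode($details);                         // default: \uXXXX escapes
$readable = json_encode($details, JSON_UNESCAPED_UNICODE); // keeps the Arabic text

echo $escaped, "\n";
echo $readable, "\n";
```

Both strings decode to the same data, so the flag only affects how readable the stored JSON is.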
I think you could try:
<meta charset="utf-8" />
If this does not work, use FloatBird's answer.
Can you post a regex search and replace in PHP for minifying/compressing JavaScript?
For example, here's a simple one for CSS
header('Content-type: text/css');
ob_start("compress");
function compress($buffer) {
/* remove comments */
$buffer = preg_replace('!/\*[^*]*\*+([^/][^*]*\*+)*/!', '', $buffer);
/* remove tabs, spaces, newlines, etc. */
$buffer = str_replace(array("\r\n", "\r", "\n", "\t", '  ', '    ', '    '), '', $buffer);
return $buffer;
}
/* put CSS here */
ob_end_flush();
And here's one for html:
<?php
/* Minify All Output - based on the search and replace regexes. */
function sanitize_output($buffer)
{
$search = array(
'/\>[^\S ]+/s', //strip whitespaces after tags, except space
'/[^\S ]+\</s', //strip whitespaces before tags, except space
'/(\s)+/s' // shorten multiple whitespace sequences
);
$replace = array(
'>',
'<',
'\\1'
);
$buffer = preg_replace($search, $replace, $buffer);
return $buffer;
}
ob_start("sanitize_output");
?>
<html>...</html>
But what about one for javascript?
A simple regex for minifying/compressing javascript is unlikely to exist anywhere. There are probably several good reasons for this, but here are a couple of these reasons:
Line breaks and semicolons
Good javascript minifiers remove all extra line breaks, but because javascript engines will work without semicolons at the end of each statement, a minifier could easily break this code unless it is sophisticated enough to watch for and handle different coding styles.
Dynamic Language Constructs
Many of the good JavaScript minifiers available will also change the names of your variables and functions to minify the code. For instance, a function named 'strip_white_space' that is called 12 times in your file might be renamed simply 'a', for a savings of 192 characters in your minified code. Unless your file has a lot of comments and/or whitespace, optimizations like these are where the majority of your file size savings will come from.
Unfortunately, this is much more complicated than a simple regex should try to handle. Say you do something as simple as:
var length = 12, height = 15;
// other code that uses these length and height values
var arr = [1, 2, 3, 4];
for (i = (arr.length - 1); i >= 0; --i) {
//loop code
}
This is all valid code. BUT, how does the minifier know what to replace? The first "length" has "var" before it (but it doesn't have to), but "height" just has a comma before it. And if the minifier is smart enough to replace the first "length" properly, how smart does it have to be to know NOT to change the word "length" when used as a property of the array? It would get even more complicated if you defined a JavaScript object where you specifically defined a "length" property and referred to it with the same dot-notation.
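A quick way to see the problem is to run a naive whole-word rename over a similar snippet. This is a deliberately simplistic sketch of what a regex-only "minifier" might do, not how any real minifier works:

```php
<?php
$js = 'var length = 12; var arr = [1, 2, 3, 4]; for (i = arr.length - 1; i >= 0; --i) {}';

// Naive rename: replace every whole word "length" with "a".
$renamed = preg_replace('/\blength\b/', 'a', $js);

echo $renamed, "\n";
```

The variable is renamed as intended, but so is the array property, so the loop now reads arr.a and the script is broken.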
Non-regex Options
Several projects exist to solve this problem using more complex solutions than just a simple regex, but many of them don't make any attempt to change variable names, so I still stick with Dean Edwards' packer or Douglas Crockford's JSMin or something like the YUI Compressor.
PHP implementation of Douglas Crockford's JSMin
https://github.com/mrclay/minify
I had a better shot at this Gist by orangeexception than Jan or B.F's answers.
preg_replace('#(?s)\s|/\*.*?\*/|//[^\r\n]*#', '', $javascript);
https://gist.github.com/orangexception/1301150/ed16505e2cb200dee0b0ab582ebbc67d5f060fe8
I'm writing my own minifier because I have some PHP inside.
There is still one unsolved problem: preg_replace cannot treat quotes as boundaries, or rather, it cannot tell whether it has seen an even or an odd number of quotes. On top of that, there are double quotes, escaped double quotes, single quotes and escaped single quotes.
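The quote problem shows up as soon as a string literal contains something that looks like a comment. A small illustration, using a made-up line of JavaScript:

```php
<?php
// Made-up JavaScript line: the string contains "//" as part of a URL.
$js = 'var url = "http://example.com"; // homepage';

// Comment-stripping regex with no awareness of string literals.
$stripped = preg_replace('#//.*#', '', $js);

echo $stripped, "\n";
```

The regex cuts the line at the first "//", which is inside the string, so the statement is truncated in the middle of the URL instead of just losing its trailing comment.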
Here are just some interesting preg-functions.
$str=preg_replace('#//.*#','',$str);//delete comments
$str=preg_replace('#\s*/>#','>',$str);//delete xhtml tag slash ( />)
$str=str_replace(array("\n","\r","\t"),"",$str);//delete escaped white spaces
$str=preg_replace("/<\?(.*\[\'(\w+)\'\].*)\?>/","?>$1<?",$str);//rewrite associated array to object
$str=preg_replace("/\s*([\{\[\]\}\(\)\|&;]+)\s*/","$1",$str);//delete white spaces between brackets
$count=preg_match_all("/(\Wvar (\w{3,})[ =])/", $str, $matches);//find var names
$x=65;$y=64;
for($i=0;$i<$count;$i++){
if($y+1>90){$y=65;$x++;}//count upper case alphabetic ascii code
else $y++;
$str=preg_replace("/(\W)(".$matches[2][$i]."=".$matches[2][$i]."\+)(\W)/","$1".chr($x).chr($y)."+=$3",$str);//replace 'longvar=longvar+'blabla' to AA+='blabla'
$str=preg_replace("/(\W)(".$matches[2][$i].")(\W)/","$1".chr($x).chr($y)."$3",$str);//replace all other vars ($matches[2] holds the captured names)
}
//echo or save $str.
?>
You may do similarly with function names:
$count= preg_match_all("/function (\w{3,})/", $str, $matches);
If you want to see the replaced vars, put the following code in the for-loop:
echo chr($x).chr($y)."=".$matches[2][$i]."<br>";
Separate PHP from JS with:
$jsphp=(array)preg_split("/<\?php|\?>/",$str);
for($i=0;$i<count($jsphp);$i++){
if($i%2==0){/* do something with the JS part */}
else {/* do something with the PHP part */}
}
This is only a draft. I'm always happy to receive suggestions.
Hope it was Englisch...
Adapting B.F.'s answer, with some other searching and testing, I arrived at this. It works for my needs, is fast enough, and (finally) leaves my quoted text alone.
<?php $str=preg_replace('#//.*#','',$someScriptInPhpVar);//delete comments
$count=preg_match_all("/(\Wvar (\w{3,})[ =])/", $str, $matches);//find var names
$x=65;$y=96;
for($i=0;$i<$count;$i++){if($y+1>122){$y=97;$x++;} else $y++; //count upper lower case alphabetic ascii code
$str=preg_replace("/([^\"a-zA-Z])(".$matches[2][$i]."=".$matches[2][$i]."\+)(\W)/","$1".chr($x).chr($y)."+=$3",$str);//replace 'longvar=longvar+'blabla' to AA+='blabla'
$str=preg_replace("/(\b)(".$matches[2][$i].")(\b)(?![\"\'\w :])/","$1".chr($x).chr($y)."$3",$str);//replace all other vars
}
$str=preg_replace("/\s+/"," ",$str);
$someScriptInPhpVar=str_replace(array("\n","\r","\t","; ","} "),array("","","",";","}"),$str);//delete escaped white space and other space ?>
I have a block of text which occasionally has a really long word/web address which breaks out of my site's layout.
What is the best way to go through this block of text and shorten the words?
EXAMPLE:
this is some text and this a long word appears like this
fkdfjdksodifjdisosdidjsosdifosdfiosdfoisjdfoijsdfoijsdfoijsdfoijsdfoijsdfoisjdfoisdjfoisdfjosdifjosdifjosdifjosdifjosdifjsodifjosdifjosidjfosdifjsdoiofsij and i need that to either wrap in ALL browsers or trim the word.
You need the wordwrap function, I suppose.
You could truncate the string so it appears with an ellipsis in the middle or the end of the string. However, this would be independent from the actual rendering in a webbrowser. There is no way for PHP to determine the actual length a string will have with a certain font when rendered in a browser, especially if you have defined fallback fonts and don't know which font is used in the browser, e.g.
font-family: Verdana, Arial, sans-serif;
Compare the following:
I am 23 characters long
I am 23 characters long
Both strings have the same number of characters, but since one is set in a monospaced font and the other isn't, their rendered widths differ. PHP cannot determine this. You'd have to find a client-side technology, probably JavaScript, to solve this for you.
You could also wrap the text into an element with the CSS property overflow:hidden to make the text disappear after a fixed length.
Look around SO. I'm pretty sure this was asked more than once before.
You could use the word-wrap: break-word CSS property to wrap the text that breaks your layout.
Check out the Mozilla Developer Center examples which demonstrate its use.
function fixlongwords($string) {
$exploded = explode(' ', $string);
$result = '';
foreach($exploded as $curr) {
if(strlen($curr) > 20) {
// double quotes so that \n is a real newline; true forces a cut inside long words
$curr = wordwrap($curr, 20, "<br/>\n", true);
}
$result .= $curr.' ';
}
return $result;
}
This should do the job.
You could do something like this:
preg_replace("/(\\S{20})/", '$1&zwnj;', $text);
It should* add a zero-width non-joiner character into every word after each run of 20 characters. This means long words can word-wrap.
* (untested)
Based on @JonnyLitt's answer, here's my take on the problem:
<?php
function insertSoftBreak($string, $interval=20, $breakChr='&shy;') {
$splitString = explode(' ', $string);
foreach($splitString as $key => $val) {
if(strlen($val)>$interval) {
$splitString[$key] = wordwrap($val, $interval, $breakChr, true);
}
}
return implode(' ', $splitString);
}
$string = 'Hello, My name is fwwfdfhhhfhhhfrhgrhffwfweronwefbwuecfbryhfbqpibcqpbfefpibcyhpihbasdcbiasdfayifvbpbfawfgawg, because that is my name.';
echo insertSoftBreak($string);
?>
This breaks the string up into space-separated values and checks the length of each individual 'word' (words include symbols like dot, comma, or question mark). If a word is longer than $interval characters, a &shy; (soft hyphen) is inserted after every $interval-th character.
I've chosen soft hyphens because they seem to be relatively well supported across browsers, and they usually don't show unless the word actually wraps at that position.
I'm not aware of any other usable (and well-supported) HTML entities that could be used instead (the zero-width non-joiner does not seem to work in FF 3.6, at least), so if cross-browser support for &shy; turns out to be lacking, a pure CSS or JavaScript-based solution would be best.
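Since strlen and wordwrap count bytes rather than characters, a variant for UTF-8 text could use a regex with the /u flag instead. This is a sketch of a swapped-in technique, not the code above; the function name insertSoftHyphens is hypothetical:

```php
<?php
// Hypothetical variant: inserts &shy; after every $interval
// non-space characters, counting characters rather than bytes.
function insertSoftHyphens(string $text, int $interval = 20): string
{
    // \S{n} matches n consecutive non-space characters; the lookahead
    // only allows a break when the word continues afterwards.
    $pattern = '/(\S{' . $interval . '})(?=\S)/u';
    return preg_replace($pattern, '$1&shy;', $text);
}

echo insertSoftHyphens('Hello, my name is ' . str_repeat('abc', 15) . ', really.'), "\n";
```

The result contains an HTML entity, so any escaping of user input (for example with htmlspecialchars) has to happen before this step rather than after it.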
I rewrite URLs to include the title of user generated travelblogs.
I do this for both readability of URLs and SEO purposes.
http://www.example.com/gallery/280-Gorges_du_Todra/
The first integer is the id, the rest is for us humans (but is irrelevant for requesting the resource).
Now people can write titles containing any UTF-8 character, but most are not allowed in the URL.
My audience is generally English speaking, but since they travel, they like to include names like
Aït Ben Haddou
What is the proper way to translate this for display in a URL, using PHP on Linux?
So far I've seen several solutions:
just strip all non allowed characters, replace spaces
this has strange results:
'Aït Ben Haddou' → /gallery/280-At_Ben_Haddou/
Not really helpful.
just strip all non allowed characters, replace spaces, leave charcode (stackoverflow.com) most likely because of the 'regex-hammer' used
this gives strange results:
'tést tést' → /questions/0000/t233st-t233st
translate to 'nearest equivalent'
'Aït Ben Haddou' → /gallery/280-Ait_Ben_Haddou/
But this goes wrong for German; for example, 'ü' should be transliterated as 'ue'.
For me, as a Dutch person, the 3rd result 'looks' the best.
I'm quite sure, however, that (1) many people will have a different opinion and (2) it is just plain wrong in the German example.
Another problem with the 3rd option is: how to find all possible characters that can be converted to a 7bit equivalent?
So the question is:
what, in your opinion, is the most desirable result. (within tech-limits)
How to technically solve it. (reach the desired result) with PHP.
Ultimately, you're going to have to give up on the idea of "correct", for this problem. Translating the string, no matter how you do it, destroys accuracy in the name of compatibility and readability. All three options are equally compatible, but #1 and #2 suffer in terms of readability. So just run with it and go for whatever looks best — option #3.
Yes, the translations are wrong for German, but unless you start requiring your users to specify what language their titles are in (and restricting them to only one), you're not going to solve that problem without far more effort than it's worth. (For example, running each word in the title through dictionaries for each known language and translating that word's diacritics according to the rules of its language would work, but it's excessive.)
Alternatively, if German is a higher concern than other languages, make your translation always use the German version when one exists: ä→ae, ë→e, ï→i, ö→oe, ü→ue.
Edit:
Oh, and as for the actual method, I'd translate the special cases, if any, via str_replace, then use iconv for the rest:
$text = str_replace(array("ä", "ö", "ü", "ß"), array("ae", "oe", "ue", "ss"), $text);
$text = iconv('UTF-8', 'US-ASCII//TRANSLIT', $text);
To me the third is most readable.
You could use a little dictionary, e.g. ï -> i and ü -> ue, to specify how you'd like various characters to be translated.
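A minimal sketch of such a dictionary, assuming UTF-8 input with precomposed characters. The function name slugFromTitle and the contents of the map are illustrative only:

```php
<?php
// Hypothetical helper built around a small transliteration dictionary.
function slugFromTitle(string $title): string
{
    // Deliberately tiny map; extend it to suit your audience.
    $map = [
        'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss',
        'ï' => 'i',  'é' => 'e',
    ];
    $ascii = strtr($title, $map);
    // Collapse every run of other characters into a single underscore.
    $slug = preg_replace('/[^A-Za-z0-9]+/', '_', $ascii);
    return trim($slug, '_');
}

echo slugFromTitle('Aït Ben Haddou'), "\n";
```

Characters missing from the map are dropped along with the punctuation, so the map decides how readable the result is for a given language.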
As an interesting side note, on SO nothing seems to really matter after the ID -- this is a link to this page:
How to handle diacritics (accents) when rewriting 'pretty URLs'
Obviously the motivation is to allow title changes without breaking links, and you may want to consider that feature as well.
Nice topic, I had the same problem a while ago.
Here's how I fixed it:
function title2url($string=null){
// return if empty
if(empty($string)) return false;
// replace spaces by "-"
// convert accents to html entities
$string=htmlentities(utf8_decode(str_replace(' ', '-', $string)));
// remove the accent from the letter
$string=preg_replace(array('#&([a-zA-Z]){1,2}(acute|grave|circ|tilde|uml|ring|elig|zlig|slash|cedil|strok|lig){1};#', '#&euro;#'), array('${1}', 'E'), $string);
// now, everything but alphanumeric and -_ can be removed
// also collapse repeated dashes
$string=preg_replace(array('#[^a-zA-Z0-9\-_]#', '#[\-]{2,}#'), array('', '-'), html_entity_decode($string));
// hand the finished slug back to the caller
return $string;
}
Here's how my function works:
Convert it to html entities
Strip the accents
Remove all remaining weird chars
Now people can write titles containing any UTF-8 character, but most are not allowed in the URL.
On the contrary, most are allowed. See for example Wikipedia's URLs - things like http://en.wikipedia.org/wiki/Café (aka http://en.wikipedia.org/wiki/Caf%C3%A9) display nicely - even if StackOverflow's highlighter doesn't pick them out correctly :-)
The trick is reading them reliably across any hosting environment; there are problems with CGI and Windows servers, particularly IIS, for example.
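For reference, the percent-encoded form shown above is what rawurlencode produces for a UTF-8 string. A small sketch, using the Wikipedia title from this answer:

```php
<?php
$title = 'Café';

// Percent-encode the UTF-8 bytes of the title for use in a URL path.
$path = '/wiki/' . rawurlencode($title);

echo $path, "\n";
echo rawurldecode('Caf%C3%A9'), "\n";
```

Browsers typically display such URLs with the original characters, while the encoded form is what is sent over the wire.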
This is a good function:
function friendlyURL($string) {
setlocale(LC_CTYPE, 'en_US.UTF8');
$string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
$string = str_replace(' ', '-', $string);
$string = preg_replace('/\\s+/', '-', $string);
$string = strtolower($string);
return $string;
}