How to Convert Arabic Characters to Unicode Using PHP


I want to know how I can convert a word into Unicode, exactly like this site does:
http://www.arabunic.free.fr/
Does anyone know how to do that using PHP, considering that Arabic text may contain ligatures?
Thanks
Edit
I'm not sure exactly what that "unicode" is, but I need to get each Arabic character as its equivalent machine number, considering that Arabic characters have different contextual forms depending on their position - see here:
http://en.wikipedia.org/wiki/Arabic_alphabet#Table_of_basic_letters
the same character in different position:
ب‎ | ـب‎ | ـبـ‎ | بـ‎
I think there must be a way to convert each Arabic character into its equivalent number, but how?
Edit
I still believe there's a way to convert each character to its form depending on its position.
Any ideas are appreciated.

All you need is a function called utf8Glyphs, which you can find in ArGlyphs.class.php; download it from ar-php,
and visit Ar-PHP for more information about the project and its classes.
It reverses the word and replaces each character with its contextual glyph.
Example of usage:
<?php
include('Arabic.php');
$Arabic = new Arabic('ArGlyphs');
$text = 'بسم الله الرحمن الرحيم';
$text = $Arabic->utf8Glyphs($text);
echo $text;
?>

I assume you want to convert بهروز to \u0628\u0647\u0631\u0648\u0632. Take a look at http://hsivonen.iki.fi/php-utf8/: after calling utf8ToUnicode('بهروز'), all you have to do is convert the integers in the returned array to hex, make sure they have 4 digits, and prefix them with \u. You can also get the same result using json_encode:
json_encode('بهروز') // returns "\u0628\u0647\u0631\u0648\u0632"
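The manual route mentioned in this answer (code points to four-digit hex with a \u prefix) might look roughly like this. It is only a sketch: toUnicodeEscapes is a made-up helper name, the input is assumed to be valid UTF-8, the mbstring extension is assumed to be loaded, and unlike json_encode it does not emit surrogate pairs for characters outside the Basic Multilingual Plane.

```php
<?php
// Hypothetical helper (not part of any library): turn each code point of a
// UTF-8 string into a \uXXXX escape.
function toUnicodeEscapes(string $text): string
{
    $out = '';
    // The /u flag makes preg_split() cut between code points, not bytes.
    foreach (preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        // mb_ord() returns the code point as an integer; %04x pads it to
        // at least four lowercase hex digits.
        $out .= sprintf('\u%04x', mb_ord($char, 'UTF-8'));
    }
    return $out;
}

echo toUnicodeEscapes('بهروز'), PHP_EOL; // \u0628\u0647\u0631\u0648\u0632
```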
EDIT:
It seems you want the character codes of بب, where the first one differs from the second. All you have to do is apply the bidi algorithm to your text using fribidi_log2vis, then get the character codes by one of the ways I described before.
Here's an example:
$string = 'بب'; // \u0628\u0628
$bidiString = fribidi_log2vis($string, FRIBIDI_LTR, FRIBIDI_CHARSET_UTF8);
json_encode($bidiString); // \ufe90\ufe91
EDIT:
I just remembered that TCPDF has a bidi algorithm implemented in pure PHP, so if you cannot get the fribidi extension for PHP to work, you can use TCPDF (utf8Bidi is protected by default, so you need to make it public):
require_once('utf8.inc'); // http://hsivonen.iki.fi/php-utf8/
require_once('tcpdf.php'); // http://www.tcpdf.org/
$t = new TCPDF();
$text = 'بب';
$t->utf8Bidi(utf8ToUnicode($text)); // will return an array like array(0 => 65168, 1 => 65169)

Just set the direction of the element containing the Arabic text to "rtl" (right to left), then input correctly spelled Arabic, and the text will flow with all ligatures taken care of.
div {
direction:rtl;
}
On a side note, don't forget to read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
Think about it: the "ba" (ب) Arabic letter is a "ba" no matter where it appears in the sentence.

Try this:
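<?php
$string = 'a';
// UTF-32BE has a fixed byte order and no BOM, so every character becomes
// exactly one 32-bit big-endian integer
$expanded = iconv('UTF-8', 'UTF-32BE', $string);
// 'N*' unpacks unsigned 32-bit big-endian integers, i.e. the code points
$arr = unpack('N*', $expanded);
print_r($arr);
?>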

I totally agree with FloatBird about using arabic.php, which you will find at ar-php, as he said. The thing is, they changed the class name from Arabic to I18N_Arabic as of version 4, so for the code to work with arabic.php version 4.0 you need to change it to:
<?php
include('Arabic.php');
$Arabic = new I18N_Arabic('ArGlyphs');
$text = 'بسم الله الرحمن الرحيم';
$text = $Arabic->utf8Glyphs($text);
echo $text;
?>
Also note that you need to put the PHP code file inside the I18N folder.
Anyway, it works fantastically. Thanks again, FloatBird.

I had a similar problem when I wanted to store an object that had values in Arabic: the Arabic text was stored as Unicode escape sequences, so the solution was as follows.
$detailsLog = $product->only(['name', 'unit', 'quantity']);
$detailsLog = json_encode($detailsLog, JSON_UNESCAPED_UNICODE);
$log->details = $detailsLog;
$log->save();
When you pass JSON_UNESCAPED_UNICODE as the second parameter of json_encode, the Arabic words are returned without being escaped.
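To make the difference visible, here is a small sketch comparing the two calls. The $details array is made-up sample data standing in for the $product fields above.

```php
<?php
// Made-up sample data standing in for the $product fields in the answer.
$details = ['name' => 'قلم', 'unit' => 'قطعة', 'quantity' => 3];

// Default behaviour: non-ASCII characters become \uXXXX escapes.
$escaped = json_encode($details);

// With the flag: the Arabic text is written as-is (still valid UTF-8 JSON).
$readable = json_encode($details, JSON_UNESCAPED_UNICODE);

echo $escaped, PHP_EOL;
echo $readable, PHP_EOL;

// Both strings decode back to the same array.
var_dump(json_decode($escaped, true) === json_decode($readable, true)); // bool(true)
```

Either form is valid JSON; the flag only affects how readable the stored string is.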

I think you could try:
<meta charset="utf-8" />
If this does not work, use FloatBird's answer.

Related

Split a string of emojis in PHP

I have a string:
$string = '😂🧜‍♂️';
And I want to split it into:
$array = ['1F602', '1F9DCU-200D-2642-FE0F'];
How can I do it?
I have already tried some functions, but they don't work because they don't properly split emojis made up of more than one Unicode code point.
Thank you in advance!

I was about to write the code for splitting emojis using an emoji-unicode dictionary but fortunately the code already exists.
This repo contains everything you need.
You can either use it directly or explore the code and take what you want.
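If pulling in a library is not an option, PCRE's \X (extended grapheme cluster) can serve as a rough splitter. This is a sketch only: clusterToHex and splitEmojis are made-up names, mbstring is assumed to be loaded, and whether \X keeps a ZWJ sequence such as the merman emoji together depends on the PCRE2 version PHP was built with, so the result should be checked on the target system.

```php
<?php
// Hypothetical helper: hex code points of one grapheme cluster, joined by "-".
function clusterToHex(string $cluster): string
{
    $hex = [];
    foreach (preg_split('//u', $cluster, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $hex[] = strtoupper(dechex(mb_ord($char, 'UTF-8')));
    }
    return implode('-', $hex);
}

// Hypothetical helper: split on extended grapheme clusters (\X), then convert
// each cluster to its hex representation.
function splitEmojis(string $text): array
{
    preg_match_all('/\X/u', $text, $matches);
    return array_map('clusterToHex', $matches[0]);
}

// The string from the question, written with escapes so the invisible
// ZWJ (U+200D) and variation selector (U+FE0F) are explicit.
print_r(splitEmojis("\u{1F602}\u{1F9DC}\u{200D}\u{2642}\u{FE0F}"));
```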

How do I strip out in PHP everything but printing characters?

I am working with this daily data feed. To my surprise, one of the fields didn't look right after it was in MySQL. (I have no control over who provides the feed.)
So I did a mysqldump and discovered the zip code and the city for this record contained a non-printing char. It displayed it in 'vi' as this:
<200e>
I'm working in PHP and I parse this data and put it into the MySQL database. I have used the trim function on this, but that doesn't get rid of it. The problem is, if you do a query on a zipcode in the MySQL database, it doesn't find the record with the non-printing character.
I'd like to clean this up before it's put into the MySQL database.
What can I do in PHP? At first I thought regular expression to only allow a-z,A-Z, and 0-9, but that's not good for addresses. Addresses use periods, commas, hyphens and perhaps other things I'm not thinking of at the moment.
What's the best approach? I don't know exactly what to call it, other than that only printing characters should be allowed. Is there another PHP function like trim that does this job? Or a regular expression? If so, I'd like an example. Thanks!
I have looked into using the PHP filter_var function, and saw this posted at PHP.NET:
<?php
$a = "\tcafé\n";
//This will remove the tab and the line break
echo filter_var($a, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_LOW);
//This will remove the é.
echo filter_var($a, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);
?>
While using FILTER_FLAG_STRIP_HIGH does indeed strip out the <200e> I mentioned seen in 'vi', I'm concerned that it would strip out the letter's accent in a name such as André.
Maybe a regular expression is the solution?

You can use PHP filters: http://www.php.net/manual/en/function.filter-var.php
I would recommend using the FILTER_SANITIZE_STRING filter, or whichever filter fits what you need.

I think you could use this little regex replace:
preg_replace( '/[^[:print:]]+/', '', $your_value);
It basically strips out all non-printing characters from $your_value.

I tried this:
<?php
$string = "\tabcde éç ÉäÄéöÖüÜß.,!-\n";
$string = preg_replace('/[^a-z0-9\!\.\, \-éâëïüÿçêîôûéäöüß]/iu', '', $string);
print "[$string]";
It gave:
[abcde éç ÉäÄéöÖüÜß.,!-]
Add all the special characters you need to the regexp.

If you work in English and do not need to support unicode characters, then allow just [\x20-\x7E]
...and remove all others:
$s = preg_replace('/[^\x20-\x7E]+/', '', $s);
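For the U+200E case from the question (vi shows it as <200e>), one option that keeps accented letters is to remove only control and format characters by Unicode category. This is a minimal sketch, assuming the input is valid UTF-8; stripInvisible is a made-up name. Note that the Cf category also covers characters such as the zero-width joiner, which some scripts and emoji sequences rely on.

```php
<?php
// Hypothetical helper: drop control (Cc) and invisible format (Cf) characters
// such as U+200E, but keep letters, accents, digits and punctuation.
function stripInvisible(string $value): string
{
    // /u is required so that PCRE treats the input as UTF-8 code points.
    $clean = preg_replace('/[\p{Cc}\p{Cf}]+/u', '', $value);
    // preg_replace() returns null on invalid UTF-8; return the input unchanged
    // in that case (an arbitrary choice for this sketch).
    return $clean === null ? $value : $clean;
}

var_dump(stripInvisible("90210\u{200E}")); // string(5) "90210"
var_dump(stripInvisible("\tAndré\n"));     // string(6) "André"
```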

How to use Imagick annotateImage for Chinese text?

I need to annotate an image with Chinese text, and I am using the Imagick library right now.
An example of Chinese text is
这是中文
The Chinese font file used is this
The file is originally named 华文黑体.ttf
It can also be found in Mac OS X under /Library/Fonts
I have renamed it to the English name STHeiTi.ttf to make it easier to reference the file in PHP code,
in particular in the Imagick::annotateImage function.
I am also using the answer from "How can I draw wrapped text using Imagick in PHP?".
I am using it because it works for English text, and the application needs to annotate both English and Chinese, though not at the same time.
The problem is that when I run annotateImage with Chinese text, I get an annotation that looks like 罍
Code included here

The problem is that you are feeding ImageMagick the output of a "line splitter" (wordWrapAnnotation), to which you pass the text input after running utf8_decode on it. This is certainly wrong if you are dealing with Chinese text: utf8_decode can only deal with UTF-8 text that CAN be converted to ISO-8859-1 (the most common 8-bit extension of ASCII).
Now, I hope that your text is UTF-8 encoded. If it is not, you might be able to convert it like this:
$text = mb_convert_encoding($text, 'UTF-8', 'BIG-5');
or like this
$text = mb_convert_encoding($text, 'UTF-8', 'GB18030'); // only PHP >= 5.4.0
(in your code $text is rather $text1 and $text2).
Then there are (at least) two things to fix in your code:
pass the text "as is" (without utf8_decode) to wordWrapAnnotation,
change the argument of setTextEncoding from "utf-8" to "UTF-8"
as per specs
I hope that all variables in your code are initialized in some missing part of it. With the two changes above (the second one might not be necessary, but you never know...), and with the missing parts in place, I see no reason why your code should not work, unless your TTF file is broken or the Imagick library is broken (imagemagick, on which Imagick is based, is a great library, so I consider this last possibility rather unlikely).
EDIT:
Following your request, I update my answer with
a) the fact that setting mb_internal_encoding('utf-8') is very important for the solution, as you say in your answer, and
b) my proposal for a better line splitter, that works acceptably for western languages and for Chinese, and that is probably a good starting point for other languages using Han logograms (Japanese kanji and Korean hanja):
function wordWrapAnnotation(&$image, &$draw, $text, $maxWidth)
{
    $regex = '/( |(?=\p{Han})(?<!\p{Pi})(?<!\p{Ps})|(?=\p{Pi})|(?=\p{Ps}))/u';
    $cleanText = trim(preg_replace('/[\s\v]+/', ' ', $text));
    $strArr = preg_split($regex, $cleanText, -1, PREG_SPLIT_DELIM_CAPTURE |
                                                 PREG_SPLIT_NO_EMPTY);
    $linesArr = array();
    $lineHeight = 0;
    $goodLine = '';
    $spacePending = false;
    foreach ($strArr as $str) {
        if ($str == ' ') {
            $spacePending = true;
        } else {
            if ($spacePending) {
                $spacePending = false;
                $line = $goodLine.' '.$str;
            } else {
                $line = $goodLine.$str;
            }
            $metrics = $image->queryFontMetrics($draw, $line);
            if ($metrics['textWidth'] > $maxWidth) {
                if ($goodLine != '') {
                    $linesArr[] = $goodLine;
                }
                $goodLine = $str;
            } else {
                $goodLine = $line;
            }
            if ($metrics['textHeight'] > $lineHeight) {
                $lineHeight = $metrics['textHeight'];
            }
        }
    }
    if ($goodLine != '') {
        $linesArr[] = $goodLine;
    }
    return array($linesArr, $lineHeight);
}
In words: the input is first cleaned up by replacing all runs of whitespace, including newlines, with a single space, except for leading and trailing whitespace, which is removed. Then it is split either at spaces, or right before Han characters not preceded by "leading" characters (like opening parentheses or opening quotes), or right before "leading" characters. Lines are assembled in order not to be rendered in more than $maxWidth pixels horizontally, except when this is not possible by the splitting rules (in which case the final rendering will probably overflow). A modification in order to force splitting in overflow cases is not difficult. Note that, e.g., Chinese punctuation is not classified as Han in Unicode, so that, except for "leading" punctuation, no linebreak can be inserted before it by the algorithm.
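The splitting step can be tried on its own, without Imagick, to see which pieces the line assembler receives. The sample sentence below is made up for illustration; the pattern is the one from wordWrapAnnotation.

```php
<?php
// Same pattern as in wordWrapAnnotation().
$regex = '/( |(?=\p{Han})(?<!\p{Pi})(?<!\p{Ps})|(?=\p{Pi})|(?=\p{Ps}))/u';

// Made-up sample input mixing English words and Han characters.
$tokens = preg_split($regex, 'Hello world 这是中文', -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

// English words stay whole, spaces come back as separate tokens (because of
// the capture group), and every Han character becomes its own token.
print_r($tokens);
```

Because each Han character is a separate token, a line can break between any two of them, while an English word is never broken in the middle.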

I'm afraid you will have to choose a TTF that can support Chinese code points. There are many sources for this, here are two:
http://www.wazu.jp/gallery/Fonts_ChineseTraditional.html
http://wildboar.net/multilingual/asian/chinese/language/fonts/unicode/non-microsoft/non-microsoft.html

Full solution here:
https://gist.github.com/2971092/232adc3ebfc4b45f0e6e8bb5934308d9051450a4
Key ideas:
You must set the HTML charset and the internal encoding on both the form page and the processing page:
header('Content-Type: text/html; charset=utf-8');
mb_internal_encoding('utf-8');
These lines must be at the top of the PHP files.
Use this function to determine whether the text is Chinese, and use the right font file:
function isThisChineseText($text) {
return preg_match("/\p{Han}+/u", $text);
}
For more details check out https://stackoverflow.com/a/11219301/80353
Set TextEncoding properly in ImagickDraw object
$draw = new ImagickDraw();
// set utf 8 format
$draw->setTextEncoding('UTF-8');
Note the capitalized UTF. This was helpfully pointed out to me by Walter Tross in his answer here: https://stackoverflow.com/a/11207521/80353
Use preg_match_all to explode English words, Chinese words and spaces
// separate the text by chinese characters or words or spaces
preg_match_all('/([\w]+)|(.)/u', $text, $matches);
$words = $matches[0];
Inspired by this answer https://stackoverflow.com/a/4113903/80353
Works just as well for English text

str_replace doesn't seem to work with the implode function

After imploding an array:
$in_list = "'".implode("','",$array)."'";
$in_list content is :
'Robert','Emmanuel','José','Alexander'
Now when I try to replace the word José with another string,
str_replace("José","J",$in_list);
it doesn't get the new value; José is still there. Am I missing something? Thanks in advance.

How exactly do you try to replace the string?
When trying it this way:
$in_list = str_replace("José","J",$in_list);
echo $in_list;
everything should work fine.
Remember, the function is returning a value. So it returns a new string.

This should work. It depends on your array.
$str = array('Robert','Emmanuel','José','Alexander');
$str = implode(",", $str);
print str_replace('José', 'J', $str);

I'm not sure what's going on, It seems to work for me. What version of PHP are you using?
$in_list = "'".implode("','", array('Robert', 'Emmanuel', 'José', 'Alexander'))."'";
$replaced = str_replace("José", "J", $in_list);
//prints 'Robert','Emmanuel','J','Alexander'
echo $replaced;
See: http://codepad.viper-7.com/24qutm

Try $in_list = html_entity_decode(str_replace(htmlentities("José"), "J", htmlentities($in_list)));

Have you tried on a word without accents? I would say you have a character set mismatch, for example 'José' in $in_list is in latin1 character set and your PHP source file in UTF8.
If this is the case, you should first convert either your PHP file or the variable to the character set you want to work with.

Spontaneous guess: Those two strings are not the same. I suppose one "José" is a string hardcoded in your source code and the other is received from the database or the browser or so. If the encoding of the two strings is not the same, PHP won't identify them as identical and not replace the character. Make sure your source code file is saved in the same encoding as the data you're working on, preferably both being UTF-8.
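The mismatch is easy to reproduce. The sketch below simulates data that arrived as ISO-8859-1 while the source file is saved as UTF-8; the choice of ISO-8859-1 as the incoming encoding is an assumption for illustration, and mbstring is assumed to be loaded.

```php
<?php
// Simulate data that arrived as ISO-8859-1 (Latin-1) while this source file
// is saved as UTF-8. "\xE9" is the single Latin-1 byte for é.
$in_list = "'Robert','Emmanuel','Jos\xE9','Alexander'";

// The UTF-8 needle "José" ends in the two bytes C3 A9, so nothing matches.
$unchanged = str_replace('José', 'J', $in_list);
var_dump($unchanged === $in_list); // bool(true)

// Convert the data to UTF-8 first (the source encoding is assumed here).
if (!mb_check_encoding($in_list, 'UTF-8')) {
    $in_list = mb_convert_encoding($in_list, 'UTF-8', 'ISO-8859-1');
}
echo str_replace('José', 'J', $in_list), PHP_EOL; // 'Robert','Emmanuel','J','Alexander'
```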

This worked for me, but it doesn't look like I'm doing anything notably different from you?
$array = array('Robert', 'Emmanuel', 'José', 'Alexander');
$in_list = "'".implode("','",$array)."'";
echo $in_list.PHP_EOL;
echo str_replace("José","J",$in_list).PHP_EOL;
Output:
'Robert','Emmanuel','José','Alexander'
'Robert','Emmanuel','J','Alexander'
Keep in mind that str_replace will not perform the replacement on $in_list itself, but rather return a string containing the replacement.
Hope this helps!

How to handle diacritics (accents) when rewriting 'pretty URLs'

I rewrite URLs to include the title of user generated travelblogs.
I do this for both readability of URLs and SEO purposes.
http://www.example.com/gallery/280-Gorges_du_Todra/
The first integer is the id, the rest is for us humans (but is irrelevant for requesting the resource).
Now people can write titles containing any UTF-8 character, but most are not allowed in the URL.
My audience is generally English speaking, but since they travel, they like to include names like
Aït Ben Haddou
What is the proper way to translate this for display in a URL, using PHP on Linux?
So far I've seen several solutions:
1. Just strip all non-allowed characters and replace spaces.
This has strange results:
'Aït Ben Haddou' → /gallery/280-At_Ben_Haddou/
Not really helpful.
2. Just strip all non-allowed characters, replace spaces, and leave the charcode (stackoverflow.com), most likely because of the 'regex-hammer' used.
This gives strange results:
'tést tést' → /questions/0000/t233st-t233st
3. Translate to the 'nearest equivalent':
'Aït Ben Haddou' → /gallery/280-Ait_Ben_Haddou/
But this goes wrong for German; for example, 'ü' should be transliterated as 'ue'.
For me, as a Dutch person, the 3rd result 'looks' the best.
I'm quite sure, however, that (1) many people will have a different opinion and (2) it is just plain wrong in the German example.
Another problem with the 3rd option is: how do I find all possible characters that can be converted to a 7-bit equivalent?
So the question is:
1. What, in your opinion, is the most desirable result (within technical limits)?
2. How can it be solved technically (to reach the desired result) with PHP?

Ultimately, you're going to have to give up on the idea of "correct", for this problem. Translating the string, no matter how you do it, destroys accuracy in the name of compatibility and readability. All three options are equally compatible, but #1 and #2 suffer in terms of readability. So just run with it and go for whatever looks best — option #3.
Yes, the translations are wrong for German, but unless you start requiring your users to specify what language their titles are in (and restricting them to only one), you're not going to solve that problem without far more effort than it's worth. (For example, running each word in the title through dictionaries for each known language and translating that word's diacritics according to the rules of its language would work, but it's excessive.)
Alternatively, if German is a higher concern than other languages, make your translation always use the German version when one exists: ä→ae, ë→e, ï→i, ö→oe, ü→ue.
Edit:
Oh, and as for the actual method, I'd translate the special cases, if any, via str_replace, then use iconv for the rest:
$text = str_replace(array("ä", "ö", "ü", "ß"), array("ae", "oe", "ue", "ss"), $text);
$text = iconv('UTF-8', 'US-ASCII//TRANSLIT', $text);

To me the third is most readable.
You could use a little dictionary, e.g. ï -> i and ü -> ue, to specify how you'd like various characters to be translated.
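A dictionary like that could be applied with strtr. The following is a rough sketch: slugify is a made-up name, the map is a tiny illustrative sample rather than a complete table, and the source file and input are assumed to be UTF-8.

```php
<?php
// Hypothetical helper; the map is a tiny illustrative sample, not a full table.
function slugify(string $title): string
{
    $map = [
        'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss', // German-style
        'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue',
        'ï' => 'i',  'é' => 'e',  'è' => 'e',  'ç' => 'c',
    ];
    // strtr() with an array replaces each key wherever it occurs.
    $ascii = strtr($title, $map);
    // Runs of whitespace become underscores, as in the question's URLs.
    $ascii = preg_replace('/\s+/', '_', trim($ascii));
    // Drop anything that is still not URL-safe ASCII.
    return preg_replace('/[^A-Za-z0-9_\-]/', '', $ascii);
}

echo slugify('Aït Ben Haddou'), PHP_EOL;  // Ait_Ben_Haddou
echo slugify('Müller Straße'), PHP_EOL;   // Mueller_Strasse
```

Characters missing from the map are simply dropped by the last step, so the table decides how good the result looks.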

As an interesting side note, on SO nothing seems to really matter after the ID -- this is a link to this page:
How to handle diacritics (accents) when rewriting 'pretty URLs'
Obviously the motivation is to allow title changes without breaking links, and you may want to consider that feature as well.

Nice topic, I had the same problem a while ago.
Here's how I fixed it:
function title2url($string = null) {
    // return if empty
    if (empty($string)) return false;
    // replace spaces by "-"
    // convert accents to html entities (the input is expected to be UTF-8)
    $string = htmlentities(str_replace(' ', '-', $string), ENT_QUOTES, 'UTF-8');
    // remove the accent from the letter
    $string = preg_replace(array('#&([a-zA-Z]){1,2}(acute|grave|circ|tilde|uml|ring|elig|zlig|slash|cedil|strok|lig){1};#', '#&euro;#'), array('${1}', 'E'), $string);
    // now, everything but alphanumeric and -_ can be removed
    // also remove double dashes
    $string = preg_replace(array('#[^a-zA-Z0-9\-_]#', '#[\-]{2,}#'), array('', '-'), html_entity_decode($string, ENT_QUOTES, 'UTF-8'));
    // return the finished slug
    return $string;
}
Here's how my function works:
Convert it to html entities
Strip the accents
Remove all remaining weird chars

Now people can write titles containing any UTF-8 character, but most are not allowed in the URL.
On the contrary, most are allowed. See for example Wikipedia's URLs - things like http://en.wikipedia.org/wiki/Café (aka http://en.wikipedia.org/wiki/Caf%C3%A9) display nicely - even if StackOverflow's highlighter doesn't pick them out correctly :-)
The trick is reading them reliably across any hosting environment; there are problems with CGI and Windows servers, particularly IIS, for example.

This is a good function:
function friendlyURL($string) {
setlocale(LC_CTYPE, 'en_US.UTF8');
$string = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $string);
$string = str_replace(' ', '-', $string);
$string = preg_replace('/\\s+/', '-', $string);
$string = strtolower($string);
return $string;
}
