convert arabic string to utf8 encoded url - php

Let's assume that I have a string like the following:
إصلاح إصلاح
and I want to convert it to an SEO-friendly URL, removing slashes and special characters, with the following function calls:
$title = trim(strtolower($str));
$title = preg_replace('#[^a-z0-9\s-]#',null, $title);
$title = preg_replace('#[\s-]+#','-', $title);
In English it works fine and gives correct results, but in Arabic it gives the following result:
15731589160415751581-15731589160415751581
Thanks in advance

I'd suggest urlencode() with unique post id, like
/blog/12345-<?= urlencode('إصلاح إصلاح') ?>

This is still an unsolved problem. What you basically have to do is transliterate any given character (whether Arabic, Chinese, Japanese or anything else) to a Latin transcription and then perform the URI generation on it.
There is some basic(!) support in iconv for this; have a look at http://ch.php.net/manual/de/function.iconv.php. You have to use iconv('UTF-8', 'ISO-8859-1//TRANSLIT//IGNORE', $text), but as I said, support is limited.
If I were you I would just remove spaces and such and then call urlencode() on it:
$url = urlencode(mb_ereg_replace('\s+', '-', $url));
I'm using mb_ereg_replace() because it is Unicode-aware and thus replaces Unicode whitespace as well.

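The Unicode script property for Arabic letters is \p{Arabic}. Change the second line (the first preg_replace call) to the following; note the u modifier, without which PCRE treats the subject as single bytes and the property will not match UTF-8 text:
$title = preg_replace('#[^\p{Arabic}\s-]#u', '', $title);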
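The digits in the question's output (1573, 1589, 1604, 1575, 1581) are the decimal code points of the Arabic letters, which suggests the string reached the function as numeric HTML entities such as &#1573; and the regex then stripped the &, # and ; around them. A minimal sketch under that assumption, which decodes the entities first and then applies a Unicode-aware pattern (the function name arabicSlug is hypothetical, and mbstring is assumed to be available):

```php
<?php
// Hypothetical helper, not from the original thread.
// Assumes the input may contain numeric HTML entities and is otherwise UTF-8.
function arabicSlug(string $str): string
{
    // Turn entities such as &#1573; back into real UTF-8 characters.
    $title = html_entity_decode($str, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // mb_strtolower() is multibyte-aware, unlike strtolower().
    $title = mb_strtolower(trim($title), 'UTF-8');
    // Keep Arabic letters, ASCII letters and digits, whitespace and hyphens.
    $title = preg_replace('#[^\p{Arabic}a-z0-9\s-]#u', '', $title);
    // Collapse runs of whitespace and hyphens into a single hyphen.
    $title = preg_replace('#[\s-]+#u', '-', $title);
    return trim($title, '-');
}

echo arabicSlug('&#1573;&#1589;&#1604;&#1575;&#1581; &#1573;&#1589;&#1604;&#1575;&#1581;'), "\n";
```

If the resulting slug is placed in a link, passing it through rawurlencode() keeps the URL valid for clients that do not accept raw non-ASCII characters.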

Try This function. I always use it and it works perfectly!
function SafeUrl3($str) {
$friendlyURL = htmlentities($str, ENT_COMPAT, "UTF-8", false) ;
$friendlyURL = preg_replace ( "/[^أ-يa-zA-Z0-9_.-]/u", "-", $friendlyURL ) ;
$friendlyURL = html_entity_decode($friendlyURL,ENT_COMPAT, "UTF-8") ;
$friendlyURL = trim($friendlyURL, '-') ;
return $friendlyURL ;
}

Related

PHP str_replace unintentionally removing Chinese characters

I have a PHP script that removes special characters, but unfortunately some Chinese characters are also removed.
<?php
function removeSpecialCharactersFromString($inputString){
$inputString = str_replace(str_split('#/\\:*?\"<>|[]\'_+(),{}’! &'), "", $inputString);
return $inputString;
}
$test = '赵景然 赵景然';
print(removeSpecialCharactersFromString($test));
?>
Oddly, the output is 赵然 赵然. The character 景 is removed.
In addition, 陈 and 一 are also removed. What might be the possible cause?
The string you're using as the list of things to replace doesn't work well with the mixed encoding. What I've done is convert this string to UTF-16 and then split it.
function removeSpecialCharactersFromString($inputString){
$inputString = str_replace(str_split(
mb_convert_encoding('#/\\:*?\"<>|[]\'_+(),{}’! &', 'UTF16')), "", $inputString);
return $inputString;
}
$test = '#赵景然 赵景然';
print(removeSpecialCharactersFromString($test));
Which gives...
赵景然赵景然
BTW, str_replace is multibyte-safe (I sort of recognised the poster): http://php.net/manual/en/ref.mbstring.php#109937
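The underlying cause is that str_split() splits by bytes, not characters. The ’ in the list is three bytes in UTF-8 (E2 80 99), so each of those bytes is removed on its own; 景 is E6 99 AF and 陈 is E9 99 88, so both lose their 0x99 byte, and 一 (E4 B8 80) loses its 0x80. A minimal sketch of an alternative that splits the list into whole UTF-8 characters instead (the function name is hypothetical, and the source file is assumed to be saved as UTF-8):

```php
<?php
// Hypothetical variant of the question's function.
function removeSpecialCharacters(string $input): string
{
    $chars = '#/\\:*?"<>|[]\'_+(),{}’! &';
    // The u modifier makes preg_split() cut between UTF-8 characters, not bytes.
    $list = preg_split('//u', $chars, -1, PREG_SPLIT_NO_EMPTY);
    // str_replace() compares byte sequences, so whole UTF-8 characters are matched safely.
    return str_replace($list, '', $input);
}

echo removeSpecialCharacters('#赵景然 赵景然'), "\n";
```

On PHP 7.4 and later, mb_str_split($chars) could replace the preg_split() call.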

substr_replace() when used with special characters (äöå) replaces with a? [duplicate]

I'm trying to do accented character replacement in PHP but get funky results. My guess is that it's because I'm using a UTF-8 string and str_replace can't properly handle multi-byte strings.
$accents_search = array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ');
$accents_replace = array('a','a','a','a','a','a','a','A','A','A','A','A','e','e',
'e','e','E','E','E','E','i','i','i','i','I','I','I','I','oe','o','o','o','o','o','o',
'O','O','O','O','O','u','u','u','U','U','U','c','C','N','n');
$str = str_replace($accents_search, $accents_replace, $str);
Results I get:
Ørjan Nilsen -> �orjan Nilsen
Expected Result:
Ørjan Nilsen -> Orjan Nilsen
Edit: I've got my internal character handler set to UTF-8 (according to mb_internal_encoding()), also the value of $str is UTF-8, so from what I can tell, all the strings involved are UTF-8. Does str_replace() detect char sets and use them properly?
According to the PHP documentation, the str_replace function is binary-safe, which means that it can handle UTF-8 encoded text without any data loss.
It looks like the string was not replaced correctly because your input encoding and the file encoding don't match.
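To illustrate the mismatch, here is a minimal sketch that checks whether the input is valid UTF-8 before replacing. The function name and the short map are hypothetical, and the fallback encoding ISO-8859-1 is purely an assumption; the real source encoding would have to be known. The script file itself is assumed to be saved as UTF-8, and mbstring is required.

```php
<?php
// Hypothetical helper: bring the input to UTF-8 first, then replace.
function replaceAccents(string $str, array $map): string
{
    // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
    if (!mb_check_encoding($str, 'UTF-8')) {
        $str = mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
    }
    // strtr() with an array replaces whole byte sequences, longest match first.
    return strtr($str, $map);
}

// Shortened map for illustration only.
$map = ['Ø' => 'O', 'ø' => 'o', 'é' => 'e', 'ñ' => 'n'];

echo replaceAccents('Ørjan Nilsen', $map), "\n";        // UTF-8 input
echo replaceAccents("\xD8rjan Nilsen", $map), "\n";     // same name as ISO-8859-1 bytes
```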
It's possible to remove diacritics using Unicode normalization form D (NFD) and Unicode character properties.
NFD converts something like the "ü" umlaut from "LATIN SMALL LETTER U WITH DIAERESIS" (which is a letter) to "LATIN SMALL LETTER U" (letter) and "COMBINING DIAERESIS" (not a letter).
header('Content-Type: text/plain; charset=utf-8');
$test = implode('', array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ'));
$test = Normalizer::normalize($test, Normalizer::FORM_D);
// Remove everything that's not a "letter" or a space (e.g. diacritics)
// (see http://de2.php.net/manual/en/regexp.reference.unicode.php)
$pattern = '/[^\pL ]/u';
echo preg_replace($pattern, '', $test);
Output:
aaaaªaaAAAAAeeeeEEEEiiiiIIIIœooooºøØOOOOuuuUUUcCNn
The Normalizer class is part of the PECL intl package. (The algorithm itself isn't very complicated but needs to load a lot of character mappings afaik. I wrote a PHP implementation a while ago.)
(I'm adding this two months late because I think it's a nice technique that's not known widely enough.)
Try this function definition:
if (!function_exists('mb_str_replace')) {
    function mb_str_replace($search, $replace, $subject) {
        if (is_array($subject)) {
            foreach ($subject as $key => $val) {
                // Pass $search through unchanged; casting an array to string would yield "Array"
                $subject[$key] = mb_str_replace($search, $replace, $val);
            }
            return $subject;
        }
        // Quote each whole search string (not just its first byte) for use in the pattern
        $quoted = array_map(function ($s) { return preg_quote($s, '/'); }, (array)$search);
        $pattern = '/(?:' . implode('|', $quoted) . ')/u';
        if (is_array($search) && is_array($replace)) {
            $len = min(count($search), count($replace));
            $table = array_combine(array_slice($search, 0, $len), array_slice($replace, 0, $len));
            // create_function() was removed in PHP 8.0, so a closure is used instead
            return preg_replace_callback($pattern, function ($match) use ($table) {
                return array_key_exists($match[0], $table) ? $table[$match[0]] : $match[0];
            }, $subject);
        }
        return preg_replace($pattern, (string)$replace, $subject);
    }
}

php file_put_contents asian character filename encoding

I'm trying to get this script to scrape images off of Wikipedia. What good is free licensed media if you can't get it? Original script is here.
If you put this
http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png
in firefox, it will immediately be transformed into
http://upload.wikimedia.org/wikipedia/commons/2/26/的-bw.png
so that when you save the image, it's saved as 的-bw.png
Simple enough, eh? Now how do I get PHP to do that? Just guessing, I tried utf8_decode($fileName), but I'm getting the wrong Chinese characters.
$src= "http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png";
$pngData = file_get_contents($src);
$fileName = basename($src);
file_put_contents($fileName, $pngData);
Any help appreciated, as I really have no idea where to go from here.
Have you tried urldecode()?
<?php
$url = 'http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png';
$parts = explode('/', $url);
$title = $parts[count($parts)-1]; //get last section
$title = urldecode($title);
?>
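The filename part of the question's URL is percent-encoded UTF-8 (%E7%9A%84 is the three-byte encoding of 的), which is why utf8_decode() is the wrong tool and a URL decode is needed. A minimal sketch of deriving the local filename; the function name is hypothetical, and the download step is left out so the snippet runs without network access:

```php
<?php
// Hypothetical helper: local filename for a percent-encoded URL.
function localFileNameFromUrl(string $url): string
{
    // Use only the path, so a query string never ends up in the name.
    $path = parse_url($url, PHP_URL_PATH);
    // Take the last segment while it is still ASCII, then decode %XX escapes.
    // rawurldecode() leaves "+" alone, unlike urldecode().
    return rawurldecode(basename($path));
}

$src = 'http://upload.wikimedia.org/wikipedia/commons/2/26/%E7%9A%84-bw.png';
echo localFileNameFromUrl($src), "\n";
```

The result can then be passed to file_put_contents() in place of basename($src). Whether the name shows up correctly on disk also depends on the filesystem's own encoding.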
Squirrelmail contains a nice function in the sources to convert unicode to entities:
<?php
function charset_decode_utf_8($string) {
    /* Only do the slow convert if there are 8-bit characters */
    /* ereg() was removed in PHP 7, so preg_match() is used here */
    if (!preg_match('/[\x80-\xFF]/', $string)) {
        return $string;
    }
    // decode three byte unicode characters
    // (the /e modifier was removed in PHP 7; a callback does the same job)
    $string = preg_replace_callback('/([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])/',
        function ($m) {
            return '&#' . ((ord($m[1]) - 224) * 4096 + (ord($m[2]) - 128) * 64 + (ord($m[3]) - 128)) . ';';
        },
        $string);
    // decode two byte unicode characters
    $string = preg_replace_callback('/([\xC0-\xDF])([\x80-\xBF])/',
        function ($m) {
            return '&#' . ((ord($m[1]) - 192) * 64 + (ord($m[2]) - 128)) . ';';
        },
        $string);
    return $string;
}
?>

How to convert a nice page title into a valid URL string?

Imagine a page title string in any given language (English, Arabic, Japanese, etc.) containing several words in UTF-8. Example:
$stringRAW = "Blues & μπλουζ Bliss's ブルース Schön";
Now this needs to be converted into something that's a valid portion of a URL for that page:
$stringURL = "blues-μπλουζ-bliss-ブルース-schön"
just check out this link
This works on my server too!
Q1. What characters are allowed in a valid URL these days? I remember having seen whole Arabic strings sitting in the browser's address bar, and I tested it on my Apache 2 and it all worked fine.
I guess it must become: $stringURL = "blues-blows-bliss-black"
Q2. What existing PHP functions do you know that encode/convert these UTF-8 strings correctly for a URL, stripping them of any invalid chars?
I guess that at least:
1. spaces should be converted into dashes -
2. invalid characters should be deleted; which are they? # and '&'?
3. all letters should be converted to lower case (or are capital letters valid in URLs?)
Thanks: your suggestions are much appreciated!
This is the solution I use:
$text = 'Nevalidní Český text';
$text = preg_replace('/[^\\pL0-9]+/u', '-', $text);
$text = trim($text, "-");
$text = iconv("utf-8", "us-ascii//TRANSLIT", $text);
$text = preg_replace('/[^-a-z0-9]+/i', '', $text);
Capitals in URLs are not a problem, but if you want the text to be lowercase then simply add $text = strtolower($text); at the end :-).
I would use:
$stringURL = str_replace(' ', '-', $stringURL); // Converts spaces to dashes
$stringURL = urlencode($stringURL);
$stringURL = preg_replace('~[^a-z ]~', '', str_replace(' ', '-', $stringRAW));
Check this method: http://www.whatstyle.net/articles/52/generate_unique_slugs_in_cakephp
pick the title of your webpage
$title = "mytitle#$3%#$5345";
simply urlencode it
$url = urlencode($title);
You don't need to worry about small details, but to identify the URL request it's best to use a unique ID prefix in the URL, such as /389894/sdojfsodjf; during routing you can use the ID 389894 to look up the topic sdojfsodjf.
Here is a short & handy one that does the trick for me
$title = trim(strtolower($title)); // lower string, removes white spaces and linebreaks at the start/end
$title = preg_replace('#[^a-z0-9\s-]#',null, $title); // remove all unwanted chars
$title = preg_replace('#[\s-]+#','-', $title); // replace white spaces and - with - (otherwise you end up with ---)
And of course you need to handle umlauts, currency signs and so forth, depending on the possible input.
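The snippets above either transliterate to ASCII or drop non-Latin letters, while the question's example keeps them. A minimal sketch of a slug function that keeps letters of any script, using the Unicode properties \pL (letters), \pM (combining marks) and \pN (numbers); the function name is hypothetical and mbstring is assumed:

```php
<?php
// Hypothetical helper: slug that keeps letters from any script.
function unicodeSlug(string $title): string
{
    // Multibyte-aware lowercasing; scripts without case are left unchanged.
    $slug = mb_strtolower($title, 'UTF-8');
    // Replace every run of characters that are not letters, marks or numbers with one dash.
    $slug = preg_replace('/[^\pL\pM\pN]+/u', '-', $slug);
    return trim($slug, '-');
}

echo unicodeSlug("Blues & μπλουζ Bliss's ブルース Schön"), "\n";
// blues-μπλουζ-bliss-s-ブルース-schön
```

Note that the apostrophe splits "Bliss's" into "bliss-s"; stripping apostrophes before the replacement would be one way to avoid that. When the slug is written into an href, rawurlencode() produces the percent-encoded form that browsers display as the readable text.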

converting & to &amp; for XML in PHP

I am building an XML RSS feed for my page and running into this error:
error on line 39 at column 46: xmlParseEntityRef: no name
Apparently this is because I can't have a bare & in XML, which I do in my last field row.
What is the best way to clean all my $row['field'] values in PHP so that &'s turn into &amp;?
Use htmlspecialchars to encode just the HTML special characters &, <, >, " and optionally ' (see second parameter $quote_style).
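As a minimal sketch of that suggestion for an RSS feed: the ENT_XML1 flag selects XML entity names, so a single quote becomes &apos;. The wrapper name xmlEscape is hypothetical, and the input is assumed to be UTF-8.

```php
<?php
// Hypothetical wrapper around htmlspecialchars() for XML text and attribute values.
function xmlEscape(string $value): string
{
    // ENT_QUOTES escapes both quote styles; ENT_XML1 uses XML 1.0 entity names.
    return htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

echo '<title>' . xmlEscape("Tom & Jerry's <show>") . '</title>', "\n";
// <title>Tom &amp; Jerry&apos;s &lt;show&gt;</title>
```

If the stored fields may already contain entities such as &amp;, passing false as the fourth argument (double_encode) of htmlspecialchars() avoids encoding them a second time.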
It's called htmlentities() and html_entity_decode()
You really should look into the DOM XML functions in PHP. It's a bit of work to figure out, but you avoid problems like this.
Convert Reserved XML characters to Entities
function xml_convert($str, $protect_all = FALSE)
{
$temp = '__TEMP_AMPERSANDS__';
// Replace entities to temporary markers so that
// ampersands won't get messed up
$str = preg_replace("/&#(\d+);/", "$temp\\1;", $str);
if ($protect_all === TRUE)
{
$str = preg_replace("/&(\w+);/", "$temp\\1;", $str);
}
$str = str_replace(array("&","<",">","\"", "'", "-"),
array("&amp;", "&lt;", "&gt;", "&quot;", "&apos;", "&#45;"),
$str);
// Decode the temp markers back to entities
$str = preg_replace("/$temp(\d+);/","&#\\1;",$str);
if ($protect_all === TRUE)
{
$str = preg_replace("/$temp(\w+);/","&\\1;", $str);
}
return $str;
}
Use
html_entity_decode($row['field']);
This will convert &amp; back to &; also, if you have &nbsp; it will change that to a non-breaking space.
http://us.php.net/html_entity_decode
Cheers
