This question already has answers here:
PHP: Replace umlauts with closest 7-bit ASCII equivalent in an UTF-8 string
(7 answers)
Closed 9 years ago.
What is the most efficient way to remove accents from a string e.g. ÈâuÑ becomes Eaun?
Is there a simple, built in way that I'm missing or a regular expression?
If you have iconv installed, try this (the example assumes your input string is in UTF-8):
echo iconv('UTF-8', 'ASCII//TRANSLIT', $string);
(iconv is a library to convert between all kinds of encodings; it's efficient and included with many PHP distributions by default. Most of all, it's definitely easier and more error-proof than trying to roll your own solution (did you know that there's a "Latin letter N with a curl"? Me neither.))
I found a solution, that worked in all my test-cases (copied from http://php.net/manual/en/transliterator.transliterate.php):
var_dump(transliterator_transliterate('Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove',
"A æ Übérmensch på høyeste nivå! И я люблю PHP! есть. fi ¦"));
// string(50) "A ae Ubermensch pa hoyeste niva! I a lublu PHP! est. fi "
see: http://www.php.net/normalizer
EDIT: This solution is independent of the locale set using setlocale(). Another benefit over iconv() is, that even non-latin characters are not ignored.
EDIT2: I discovered, that there are some characters, that are not covered by the transliteration I posted originally. Any-Latin translates the cyrillic character ь to a character, that doesn't fit into a latin character-set: ʹ (http://en.wikipedia.org/wiki/Prime_%28symbol%29). I've added [\u0100-\u7fff] remove to remove all these non-latin characters. I also added a test to the text ;)
I suggest, that they mean the latin alphabet and not one of the latin character-sets by Latin here. But anyways - in my opinion, they should transliterate it to something ASCII then in Latin-ASCII ...
EDIT3: Sorry for another change here. I had to take the characters down to u0080 instead of u0100, to get only ASCII characters as output. The test above is updated.
Reposting this on request of #palantir ...
I find iconv completely unreliable, and I dislike preg_replace solutions and big arrays ... so my favorite way (and the only reliable method I've found) is ...
function toASCII( $str )
{
return strtr(utf8_decode($str),
utf8_decode(
'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
}
You can use iconv to transliterate the characters to plain US-ASCII and then use a regular expression to remove non-alphabetic characters:
preg_replace('/[^a-z]/i', '', iconv("UTF-8", "US-ASCII//TRANSLIT", $text))
Another way would be using the Normalizer to normalize to the Normalization Form KD (NFKD) and then remove the mark characters:
preg_replace('/\p{Mn}/u', '', Normalizer::normalize($text, Normalizer::FORM_KD))
Note: I'm reposting this from another similar question in the hope that it's helpful to others.
I ended up writing a PHP library based on URLify.js from the Django project, since I found iconv() to be too incomplete. You can find it here:
https://github.com/jbroadway/urlify
Handles Latin characters as well as Greek, Turkish, Russian, Ukrainian, Czech, Polish, and Latvian.
Related
I am trying to write some code to take UTF-8 text and create a slug that contains some UTF-8 characters. So this is not about transliterating UTF-8 into ASCII.
So basically I want to replace any UTF-8 character that is whitespace, a control character, punctuation, or a symbol with a dash. There exist Unicode categories that I should be able to use: \p{Z}, \p{C}, \p{P}, or \p{S}, respectively.
So I could do something as simple as this:
preg_replace("#(\p{P}|\p{C}|\p{S}|\p{Z})+#", "-", "This. test? has an ö in it");
but it results in this:
This-test-has-an-√-in-it
(and I'd want This-test-has-an-ö-in-it)
It butchers the umlaut o, possibly because in Unicode it is comprised of two bytes c3b6 of which the b6 seems to be recognized as a punctuation character (so the \p{P} matches here). The c3 remains in the text. This is strange because AFAIK a single byte b6 doesn't exist in UTF-8.
I also tried the same thing in Perl in order to make sure it is not a PHP problem, but the code
$s = 'This. test? has an ö in it';
$s =~ s/(\p{P}|\p{C}|\p{S}|\p{Z})+/-/g;
also produces:
This-test-has-an-√-in-it
(which probably makes sense as PHP's PCRE are Perl Compatible Regular Expressions)
While when I do this in Python
import regex as re
text=u"This. test? has an ö in it";
print re.sub(ur"(\p{P}|\p{C}|\p{S}|\p{Z})+", "-", text)
it produces my desired
This-test-has-an-ö-in-it
What to do?
The solution was to use the "Unicode modifier" u:
preg_replace("#(\p{P}|\p{C}|\p{S}|\p{Z})+#u", "-", "This. test? has an ö in it");
correctly produces
This-test-has-an-ö-in-it
So: using Unicode Categories without the Unicode modifier produces strange results without any warning.
I need to "flatten out" a number of Unicode strings for the purposes of indexing and searching. For example, I need to convert GötheФ€ into ASCII. The last two characters have no close representations in ASCII so it's Ok to discard them completely. So what I expect from
echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", "GötheФ€");
is Gothe but instead it outputs Gothe?EUR.
In addition to letters, I'd also like all the variety of Unicode numerals and punctuation marks, such as periods, commas, dashes, slashes etc. to be replaced by their closest ASCII counterparts, which is something ASCII//TRANSLIT//IGNORE in iconv function does already but not without producing some garbage output for the Unicode characters for which it's not able to find any ASCII replacements. I'd like such characters to be totally ignored.
How do get the expected result? Is there a better way, perhaps using intl library?
You've picked a hard problem. It is better to tell the user entering Unicode characters to transliterate ASCII themselves. Doing it for them will only upset them when they disagree with your transliteration.
Anything you do will likely be jarring and offensive to people who place great meaning on Diacritics: http://en.wikipedia.org/wiki/Diacritic
No matter what transliteration strategy you use, you will not please everyone, since different people prescribe different meanings to different characters. A transliteration that delights one person will enrage another. You won't make everyone happy unless you let everyone use whatever character they want in Unicode.
But life is jarring and offensive, so off we go:
This PHP Code:
function toASCII( $str )
{
return strtr(utf8_decode($str),
utf8_decode(
'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
}
What the above PHP function does is replace each Unicode character in the first parameter of utf8_decode and replaces it with the corresponding character in the second parameter of utf8_decode.
For example the Unicode À is transliterated to ASCII A, and the å is converted to a. You'll have to specify this for every single Unicode character that you believe transliterates to an ASCII character. For the others, remove them or run them through another transliteration algorithm.
There are 95,221 other characters that you will have to look at which might transliterate to ASCII. It becomes an existential game of "When is an A no longer an A?". What about the Klingon characters and the road-map signs that kind of look like an A? The fish character kind of looks like an a. Who is to say what is what?
This is a lot of work, but if you are cleaning database input, you have to create a white list of characters and block out the other barbarians, keeping them out at the moat, it's the only reliable way.
I have the the problem described in title.
If I use
preg_match_all('/\pL+/u', $_POST['word'], $new_word);
and I type hello à and ì the new_word returned is *hello and *
Why?
Someone advised me to specify all characters I want to convert in this way
preg_match_all('/\pL+/u', $_POST['word'], 'aäeëioöuáéíóú');
, but I want my application works with all existing accents (for a multilanguage website).
Can you help me?
Thanks.
EDIT: I specify that I utilise this regex to purify punctuation. It well purify all punctuation but unicode characters are wrong returned, in fact are not even returned.
EDIT 2: I am sorry, but I very badly explained.
The problem is not in preg_match_all but in
str_word_count($my_key, 2, 'aäáàeëéèiíìoöóòuúù');
I had to manually specify accented characters but I think there are many others. Right?
\pL should match all utf8 characters and spaces. Be sure, that $_POST['word'] is a string encoded with utf8. If not, try utf8_encode() before matching or check the encoding of your HTML form. In my tests, your example works like a charm.
You may use this together with count() to get the number of words. Then you need not care about the possible characters. \pL will do this for you. This should do the trick:
$string = "áll thât words wíth ìntérnâtiønal çhårs";
preg_match_all('/\pL+/u', $string, $words);
echo count($words[0]); // returns: 6
Try using mb_ereg_match() (instead of preg_match()) from Multibyte String PHP library. It is specially made for working with multibyte strings.
This question already has answers here:
PHP: Replace umlauts with closest 7-bit ASCII equivalent in an UTF-8 string
(7 answers)
Closed 9 years ago.
What is the most efficient way to remove accents from a string e.g. ÈâuÑ becomes Eaun?
Is there a simple, built in way that I'm missing or a regular expression?
If you have iconv installed, try this (the example assumes your input string is in UTF-8):
echo iconv('UTF-8', 'ASCII//TRANSLIT', $string);
(iconv is a library to convert between all kinds of encodings; it's efficient and included with many PHP distributions by default. Most of all, it's definitely easier and more error-proof than trying to roll your own solution (did you know that there's a "Latin letter N with a curl"? Me neither.))
I found a solution, that worked in all my test-cases (copied from http://php.net/manual/en/transliterator.transliterate.php):
var_dump(transliterator_transliterate('Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove',
"A æ Übérmensch på høyeste nivå! И я люблю PHP! есть. fi ¦"));
// string(50) "A ae Ubermensch pa hoyeste niva! I a lublu PHP! est. fi "
see: http://www.php.net/normalizer
EDIT: This solution is independent of the locale set using setlocale(). Another benefit over iconv() is, that even non-latin characters are not ignored.
EDIT2: I discovered, that there are some characters, that are not covered by the transliteration I posted originally. Any-Latin translates the cyrillic character ь to a character, that doesn't fit into a latin character-set: ʹ (http://en.wikipedia.org/wiki/Prime_%28symbol%29). I've added [\u0100-\u7fff] remove to remove all these non-latin characters. I also added a test to the text ;)
I suggest, that they mean the latin alphabet and not one of the latin character-sets by Latin here. But anyways - in my opinion, they should transliterate it to something ASCII then in Latin-ASCII ...
EDIT3: Sorry for another change here. I had to take the characters down to u0080 instead of u0100, to get only ASCII characters as output. The test above is updated.
Reposting this on request of #palantir ...
I find iconv completely unreliable, and I dislike preg_replace solutions and big arrays ... so my favorite way (and the only reliable method I've found) is ...
function toASCII( $str )
{
return strtr(utf8_decode($str),
utf8_decode(
'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
}
You can use iconv to transliterate the characters to plain US-ASCII and then use a regular expression to remove non-alphabetic characters:
preg_replace('/[^a-z]/i', '', iconv("UTF-8", "US-ASCII//TRANSLIT", $text))
Another way would be using the Normalizer to normalize to the Normalization Form KD (NFKD) and then remove the mark characters:
preg_replace('/\p{Mn}/u', '', Normalizer::normalize($text, Normalizer::FORM_KD))
Note: I'm reposting this from another similar question in the hope that it's helpful to others.
I ended up writing a PHP library based on URLify.js from the Django project, since I found iconv() to be too incomplete. You can find it here:
https://github.com/jbroadway/urlify
Handles Latin characters as well as Greek, Turkish, Russian, Ukrainian, Czech, Polish, and Latvian.
Today I stumbled about a Problem which seems to be a bug in the Zend-Framework. Given the following route:
<test>
<route>citytest/:city</route>
<defaults>
<controller>result</controller>
<action>test</action>
</defaults>
<reqs>
<city>.+</city>
</reqs>
</test>
and three Urls:
mysite.local/citytest/Berlin
mysite.local/citytest/Hamburg
mysite.local/citytest/M%FCnchen
the last Url does not match and thus the correct controller is not called. Anybody got a clue why?
Fyi, where are using Zend-Framework 1.0 ( Yeah, I know that's ancient but I am not in charge to change that :-/ )
Edit: From what I hear, we are going to upgrade to Zend 1.5.6 soon, but I don't know when, so a Patch would be great.
Edit: I've tracked it down to the following line (Zend/Controller/Router/Route.php:170):
$regex = $this->_regexDelimiter . '^' .
$part['regex'] . '$' .
$this->_regexDelimiter . 'iu';
If I change that to
$this->_regexDelimiter . 'i';
it works. From what I understand, the u-modifier is for working with asian characters. As I don't use them, I'm fine with that patch for know. Thanks for reading.
Please its working perfect for me
/^[\p{L}-. ]*$/u
^ Start of the string
[ ... ]* Zero or more of the following:
\p{L} Unicode letter characters
– dashes
. periods
spaces
$ End of the string
/u Enable Unicode mode in PHP
EXAMPLE:
$str= ‘Füße’;
if (!preg_match(“/^[\p{L}-. ]*$/u”, $str))
{
echo ‘error’;
}
else
{
echo “success”;
}
The problem is the following:
Using the /u pattern modifier prevents
words from being mangled but instead
PCRE skips strings of characters with
code values greater than 127.
Therefore, \w will not match a
multibyte (non-lower ascii) word at
all (but also won’t return portions of
it). From the pcrepattern man page;
In UTF-8 mode, characters with values
greater than 128 never match \d, \s,
or \w, and always match \D, \S, and
\W. This is true even when Unicode
character property support is
available.
From Handling UTF-8 with PHP.
Therefore it's actually irrelevant if your URL is ISO-8859-1 encoded (mysite.local/citytest/M%FCnchen) or UTF-8 encoded (mysite.local/citytest/M%C3%BCnchen), the default regex won't match.
I also made experiments with umlauts in URLs in Zend Framework and came to the conclusion that you wouldn't really want umlauts in your URLs. The problem is, that you cannot rely on the encoding used by the browser for the URL. Firefox (prior to 3.0) for example does not UTF-8 encode URLs entered into the address textbox (if not specified in about:config) and IE does have a checkbox within its options to choose between regular and UTF-8 encoding for its URLs. But if you click on links within a page both browsers use the URL in the given encoding (UTF-8 on an UTF-8 page). Therefore you cannot be sure in which encoding the URLs are sent to your application - and detecting the encoding used is not that trivial to do.
Perhaps it's better to use transliterated parameters in your URLs (e.g. change Ä to Ae and so on). There is a really simple way to this (I don't know if this works with every language but I'm using it with German strings and it works quite well):
function createUrlFriendlyName($name) // $name must be an UTF-8 encoded string
{
$name=mb_convert_encoding(trim($name), 'HTML-ENTITIES', 'UTF-8');
$name=preg_replace(
array('/ß/', '/&(..)lig;/', '/&([aouAOU])uml;/', '/&(.)[^;]*;/', '/\W/'),
array('ss', '$1', '$1e', '$1', '-'),
$name);
$name=preg_replace('/-{2,}/', '-', $name);
return trim($name, '-');
}
The u modifier makes the regexp expect utf-8 input. This would suggest that ZF expects utf-8 encoded input, and not ISO-8859-1 (I'm not too familiar with ZF, so I'm just guessing here).
If that's the case, you'll have to utf-8 encode the ü before using it in a URL. It would then become: mysite.local/citytest/M%C3%BCnchen
Note that since the rest of your application probably speaks ISO-8859-1 (Which is default for PHP <= 5), you will have to explicitly decode the variable with utf8_decode, before you can use it.