I have a huge number of files, and many of them contain Polish letters such as 'ł, ż, ź, ó, ń' in their filenames.
What I want is to replace each Polish letter with its plain ASCII counterpart (for example ż => z, ń => n).
The files are located on a server running Debian Linux (Squeeze).
What should I use, and how do I achieve this?
You tagged your question with PHP, so my answer assumes PHP.
There is a question similar to yours:
Convert national chars into their latin equivalents in PHP
In short, use the Normalizer class from the PHP intl extension:
http://www.php.net/manual/en/class.normalizer.php
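<?php
$string = 'ł ż ź ó ń';
// Decompose (NFD), then drop the combining marks. The default form (NFC)
// would return this string unchanged.
$ascii = preg_replace('/\p{Mn}/u', '', Normalizer::normalize($string, Normalizer::FORM_D));
// 'ł' has no Unicode decomposition, so it needs an explicit replacement.
echo str_replace(['ł', 'Ł'], ['l', 'L'], $ascii); // l z z o n
?>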
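For the actual renaming step, here is a minimal sketch that does not depend on the intl extension. It assumes the filenames are stored as UTF-8 with precomposed letters; polishToAscii and renamePolishFiles are hypothetical helper names, and the map covers only the nine Polish letters in both cases.

```php
<?php
// Hypothetical helper: maps Polish diacritic letters to their closest ASCII letters.
// Assumption: the input is UTF-8 with precomposed characters.
function polishToAscii(string $name): string
{
    static $map = [
        'ą' => 'a', 'ć' => 'c', 'ę' => 'e', 'ł' => 'l', 'ń' => 'n',
        'ó' => 'o', 'ś' => 's', 'ź' => 'z', 'ż' => 'z',
        'Ą' => 'A', 'Ć' => 'C', 'Ę' => 'E', 'Ł' => 'L', 'Ń' => 'N',
        'Ó' => 'O', 'Ś' => 'S', 'Ź' => 'Z', 'Ż' => 'Z',
    ];
    // With an array argument, strtr() replaces whole (multi-byte) substrings.
    return strtr($name, $map);
}

// Hypothetical helper: renames regular files in $dir whose names change under the mapping.
// Returns an array of old name => new name for the files that were renamed.
function renamePolishFiles(string $dir): array
{
    $renamed = [];
    foreach (scandir($dir) as $entry) {
        $old = $dir . DIRECTORY_SEPARATOR . $entry;
        if (!is_file($old)) {
            continue; // skips ".", ".." and subdirectories
        }
        $newName = polishToAscii($entry);
        $new = $dir . DIRECTORY_SEPARATOR . $newName;
        if ($newName === $entry || file_exists($new)) {
            continue; // nothing to change, or the target name is already taken
        }
        if (rename($old, $new)) {
            $renamed[$entry] = $newName;
        }
    }
    return $renamed;
}

echo polishToAscii('zażółć gęślą jaźń.txt'), "\n"; // zazolc gesla jazn.txt
```

Collisions are skipped rather than overwritten, so two files that map to the same ASCII name need manual attention. Try it on a copy of the directory first.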
Related
I need help changing the encoding of a string copied and pasted from the clipboard.
The curious string is "español":
$problematicString = "español"; //copied and pasted from a filename
$okString = "español"; //typed
echo md5($problematicString)."<br>";
echo md5($okString)."<br>";
This is the output:
c9ae1d88242473e112ede8df2bdd6802
5d971adb0ba260af6a126a2ade4dd133
Why are the md5() outputs different for the same strings?
I've tried changing both strings using: mb_convert_encoding($string, "ISO-8859-1", "UTF-8") but the output is still different.
I need to fix $problematicString programmatically so that it produces the same hash as the other string.
Why are the md5 hashes different for the same strings?
They are not the same string. In the first case the tilde is on the 'o':
$problematicString = "español"
In the second case, the tilde is on the 'n':
$okString = "español";
That's why the hashes don't match.
The reason is that the first string contains a hidden Unicode character:
̃
Pulled from my editor:
$problematicString = "español"; which is what it's actually showing.
It is actually a combining tilde (U+0303, decimal 771), not the standalone tilde ~.
Pulled from http://courses.washington.edu/hypertxt/unicode/unidec1.html
These symbols, which are most of the non-ascii symbols useful for standard phonetic transcription of English, are drawn from several regions of the Unicode chart: from Latin-1 Supplement, Latin Extended-A and B,IPA Extensions, Combining Diacritical Mark, and Greek (for the theta). All of these pages are supported by lucida sans unicode, a TrueType font that Microsoft has bundled with recent products. Sadly, Bitstream's mother-of-all-TTFs Cyberbit does not support the IPA Extensions. These values can be entered manually as character entities or assigned to hot keys, buttons, or whatever the browser allows. Word97 can access the font via the symbol table under Insert.
Another way to write this font is to use Wincalis uniedit, which will write the Unicode values directly into the file. Then "This is phonetically transcribed" is represented in strange alphabet soup which is converted by the browser into [ðɪs ɪz fɘnɛɾɘkli trænskraibd] (look at this in a plain text editor to see the soup). For any serious or extensive transcription work, an editor like Wincalis would prove handy--you can even customize the IPA keyboard supplied.
If you want the file to trigger Unicode UTF-8 decoding in the browser, you must preface this META tag:
with the following under "Diacritics":
̃ #771 nasalized
As @BeetleJuice said, they are not the same string. Here's another way to understand this: reduce the data to just these two strings:
"español";
"español";
Then run the od command against them. Observe that the hex values differ (od prints 16-bit words here, so each pair of bytes appears swapped):
0000000 6522 7073 6e61 83cc 6c6f 3b22 220a 7365
" e s p a n ̃ ** o l " ; \n " e s
0000020 6170 b1c3 6c6f 3b22 0a20
p a ñ ** o l " ; \n
0000032
In the first string, the ñ is actually an n followed by a combining tilde (http://www.fileformat.info/info/unicode/char/0303/index.htm). In the second string it is ñ (http://www.fileformat.info/info/unicode/char/f1/index.htm), a single character. You can see this by deleting characters with backspace: in the first string it takes two presses, one to delete the tilde and another for the 'n'.
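To fix the string programmatically, as the question asks, one option is Unicode normalization. A minimal sketch, assuming the intl extension is loaded: NFC composes n + U+0303 into the single character ñ, after which the bytes (and therefore the hashes) match. The variable names follow the question; the \u{...} escapes keep the two forms distinguishable in source and need PHP 7 or later.

```php
<?php
// "n" followed by U+0303 COMBINING TILDE (decomposed), as in the pasted filename.
$problematicString = "espan\u{0303}ol";
// U+00F1 LATIN SMALL LETTER N WITH TILDE (precomposed), as typed.
$okString = "espa\u{00F1}ol";

// The byte sequences differ, which is why md5() differs.
echo bin2hex($problematicString), "\n"; // 657370616ecc836f6c
echo bin2hex($okString), "\n";          // 65737061c3b16f6c

// Normalizer comes from the intl extension; the guard keeps the script runnable without it.
if (class_exists('Normalizer')) {
    $fixed = Normalizer::normalize($problematicString, Normalizer::FORM_C);
    var_dump($fixed === $okString);           // bool(true)
    var_dump(md5($fixed) === md5($okString)); // bool(true)
}
```

Normalizing both sides before hashing or comparing is the safer habit, since you rarely know which form a filename arrives in.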
This question already has answers here:
PHP: Replace umlauts with closest 7-bit ASCII equivalent in an UTF-8 string
(7 answers)
Closed 9 years ago.
What is the most efficient way to remove accents from a string e.g. ÈâuÑ becomes Eaun?
Is there a simple, built in way that I'm missing or a regular expression?
If you have iconv installed, try this (the example assumes your input string is in UTF-8):
echo iconv('UTF-8', 'ASCII//TRANSLIT', $string);
(iconv is a library to convert between all kinds of encodings; it's efficient and included with many PHP distributions by default. Most of all, it's definitely easier and more error-proof than trying to roll your own solution (did you know that there's a "Latin letter N with a curl"? Me neither.))
I found a solution that worked in all my test cases (copied from http://php.net/manual/en/transliterator.transliterate.php):
var_dump(transliterator_transliterate('Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove',
"A æ Übérmensch på høyeste nivå! И я люблю PHP! есть. fi ¦"));
// string(50) "A ae Ubermensch pa hoyeste niva! I a lublu PHP! est. fi "
see: http://www.php.net/normalizer
EDIT: This solution is independent of the locale set with setlocale(). Another benefit over iconv() is that even non-Latin characters are not ignored.
EDIT2: I discovered that some characters are not covered by the transliteration I originally posted. Any-Latin translates the Cyrillic character ь to a character that doesn't fit into a Latin character set: ʹ (http://en.wikipedia.org/wiki/Prime_%28symbol%29). I've added [\u0100-\u7fff] remove to remove all these non-Latin characters. I also added a test to the text ;)
I suppose that by Latin they mean the Latin alphabet and not one of the Latin character sets. Either way, in my opinion Latin-ASCII should then transliterate it to something in ASCII ...
EDIT3: Sorry for another change here. I had to lower the start of the range to u0080 instead of u0100 to get only ASCII characters as output. The test above is updated.
Reposting this on request of @palantir ...
I find iconv completely unreliable, and I dislike preg_replace solutions and big arrays ... so my favorite way (and the only reliable method I've found) is ...
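function toASCII( $str )
{
    // utf8_decode() converts UTF-8 to ISO-8859-1 so that strtr() can map single bytes.
    // Caveats: it is deprecated as of PHP 8.2, and characters outside ISO-8859-1
    // (here Š, Œ, Ž, š, œ, ž, Ÿ) are turned into '?' before the mapping is applied.
    return strtr(utf8_decode($str),
        utf8_decode(
            'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
        'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
}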
You can use iconv to transliterate the characters to plain US-ASCII and then use a regular expression to remove non-alphabetic characters:
preg_replace('/[^a-z]/i', '', iconv("UTF-8", "US-ASCII//TRANSLIT", $text))
Another way would be using the Normalizer to normalize to the Normalization Form KD (NFKD) and then remove the mark characters:
preg_replace('/\p{Mn}/u', '', Normalizer::normalize($text, Normalizer::FORM_KD))
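One caveat with the normalization approach: it only strips marks from letters that Unicode can decompose. The sketch below wraps the expression above in a hypothetical helper, stripMarks (the name is an assumption, not part of any library), and only runs the demo when the intl extension is loaded.

```php
<?php
// Hypothetical helper: decomposes to NFKD, then removes nonspacing marks (\p{Mn}).
// Requires the intl extension for the Normalizer class.
function stripMarks(string $text): string
{
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_KD);
    if ($decomposed === false) {
        return $text; // normalization failed (e.g. invalid UTF-8); leave the input untouched
    }
    return preg_replace('/\p{Mn}/u', '', $decomposed) ?? $text;
}

if (class_exists('Normalizer')) {
    echo stripMarks('ÈâuÑ'), "\n"; // EauN
    echo stripMarks('żółć'), "\n"; // zołc ('ł' has no decomposition, so it survives)
}
```

Letters such as ł, ø or đ carry their stroke as part of the base character, so they need an explicit mapping (or a transliterator) on top of this.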
Note: I'm reposting this from another similar question in the hope that it's helpful to others.
I ended up writing a PHP library based on URLify.js from the Django project, since I found iconv() to be too incomplete. You can find it here:
https://github.com/jbroadway/urlify
Handles Latin characters as well as Greek, Turkish, Russian, Ukrainian, Czech, Polish, and Latvian.
I need to "flatten out" a number of Unicode strings for the purposes of indexing and searching. For example, I need to convert GötheФ€ into ASCII. The last two characters have no close representations in ASCII, so it's OK to discard them completely. So what I expect from
echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", "GötheФ€");
is Gothe but instead it outputs Gothe?EUR.
In addition to letters, I'd like the whole variety of Unicode numerals and punctuation marks (periods, commas, dashes, slashes, etc.) to be replaced by their closest ASCII counterparts. ASCII//TRANSLIT//IGNORE in iconv already does this, but it also produces garbage output for the Unicode characters it cannot find an ASCII replacement for. I'd like such characters to be dropped entirely.
How do I get the expected result? Is there a better way, perhaps using the intl library?
You've picked a hard problem. It is better to ask the users entering Unicode characters to transliterate to ASCII themselves. Doing it for them will only upset them when they disagree with your transliteration.
Anything you do will likely be jarring and offensive to people who place great meaning on diacritics: http://en.wikipedia.org/wiki/Diacritic
No matter what transliteration strategy you use, you will not please everyone, since different people ascribe different meanings to different characters. A transliteration that delights one person will enrage another. You won't make everyone happy unless you let everyone use whatever Unicode character they want.
But life is jarring and offensive, so off we go:
This PHP Code:
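function toASCII( $str )
{
    // utf8_decode() converts UTF-8 to ISO-8859-1 so that strtr() can map single bytes.
    // Caveats: it is deprecated as of PHP 8.2, and characters outside ISO-8859-1
    // (here Š, Œ, Ž, š, œ, ž, Ÿ) are turned into '?' before the mapping is applied.
    return strtr(utf8_decode($str),
        utf8_decode(
            'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
        'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
}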
The function above converts the input to a single-byte encoding with utf8_decode, then uses strtr to replace each character listed in its second argument with the character at the same position in its third argument.
For example, À is transliterated to the ASCII A, and å is converted to a. You'll have to specify this for every single Unicode character that you believe transliterates to an ASCII character. For the others, remove them or run them through another transliteration algorithm.
There are 95,221 other characters that you will have to look at which might transliterate to ASCII. It becomes an existential game of "When is an A no longer an A?". What about the Klingon characters and the road-map signs that kind of look like an A? The fish character kind of looks like an a. Who is to say what is what?
This is a lot of work, but if you are cleaning database input, you have to create a whitelist of characters and keep the other barbarians out at the moat. It's the only reliable way.
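The whitelist step can be sketched roughly like this. keepWhitelisted and its default character set are assumptions for illustration, not a recommendation for what your application should allow; the one-entry strtr map stands in for whatever transliteration you apply first.

```php
<?php
// Hypothetical helper: drops every byte that is not on an explicit ASCII whitelist.
// $allowed is inserted into a regex character class, so it must be valid there.
function keepWhitelisted(string $str, string $allowed = 'A-Za-z0-9 ._-'): string
{
    // No /u flag: the pattern works on bytes, and every byte of a multi-byte
    // UTF-8 sequence is >= 0x80, so non-ASCII characters are removed whole.
    return preg_replace('/[^' . $allowed . ']/', '', $str) ?? '';
}

// Transliterate what you have a mapping for first, then filter out the rest.
$mapped = strtr('GötheФ€', ['ö' => 'o']);
echo keepWhitelisted($mapped), "\n"; // Gothe
```

Because the filter runs last, anything the transliteration step missed is dropped rather than turned into a stray '?'.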
Is there a way to convert accented Latin letters to plain English letters with PHP?
Such as: āáǎà converted to a,
ēéěè converted to e,
īíǐì converted to i,
... // there may be dozens more, mainly used in German, French, Italian, Spanish...
PS: How do I convert punctuation marks with PHP? I also want to convert %20 to a space and %27 to '. Thank you.
iconv can usually do this for you:
iconv("utf-8", "ascii//TRANSLIT//IGNORE", $string);
Adjust source encoding to preference. The //TRANSLIT//IGNORE part tells iconv to transliterate (replace with "similar" characters) whatever it can and ignore (leave out or replace with "?", can't remember) what it can't.
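As for the postscript about %20 and %27: those are URL percent-escapes rather than accented letters, and PHP has a built-in for them. A small sketch (the variable names are only illustrative):

```php
<?php
$encoded = 'It%27s%20a%20test';

// rawurldecode() turns %XX escapes back into the bytes they stand for.
$decoded = rawurldecode($encoded);
echo $decoded, "\n"; // It's a test

// urldecode() does the same but additionally treats '+' as a space,
// which is only what you want for form-encoded data.
echo rawurldecode('a+b'), "\n"; // a+b
echo urldecode('a+b'), "\n";    // a b
```

Decode first and transliterate afterwards, since a percent-escaped accented letter is just ASCII until it has been decoded.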
Have a look at How to change diacritic characters to non-diacritic ones