Is there a way to convert accented Latin letters to plain English letters with PHP?
For example: āáǎà should become a,
ēéěè should become e,
īíǐì should become i,
... // there may be dozens more, mainly from German, French, Italian, Spanish...
PS: How do I convert punctuation marks with PHP? I also want to convert %20 to a space and %27 to '. Thank you.
iconv can usually do this for you:
iconv("utf-8", "ascii//TRANSLIT//IGNORE", $string);
Adjust the source encoding as needed. The //TRANSLIT//IGNORE part tells iconv to transliterate (replace with "similar" characters) whatever it can and to ignore what it can't (depending on the iconv implementation, such characters are either left out or replaced with "?").
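The PS is a separate problem: %20 and %27 are URL percent-escapes, not accented letters, so iconv will not touch them. A minimal sketch, assuming the input is a percent-encoded string (the sample value is made up for illustration), would decode it first with rawurldecode():

```php
<?php
// Percent-escapes such as %20 and %27 are URL encoding, so decode them first.
$encoded = 'it%27s%20here';          // hypothetical sample input
$decoded = rawurldecode($encoded);   // %27 -> ', %20 -> space
echo $decoded, "\n";                 // it's here
```

urldecode() does the same but additionally turns "+" into a space, which is what you want for form data and usually not for anything else.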
Have a look at How to change diacritic characters to non-diacritic ones
Basically, I have a huge number of files, and many of them contain Polish letters like 'ł, ż, ź, ó, ń' etc. in their filenames.
What I want to achieve is to change these Polish letters to standard ASCII characters (so, for example, ż => z, ń => n).
The files are located on a server running Debian Squeeze Linux.
What should I use, and how do I achieve this?
You tagged your question with PHP, so my answer will assume that.
There is a question similar to yours:
Convert national chars into their latin equivalents in PHP
Basically
Use Normalizer PHP extension.
http://www.php.net/manual/en/class.normalizer.php
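<?php
$string = 'ł ż ź ó ń';
// normalize() alone keeps the accents: decompose to NFD first, then strip the combining marks.
$decomposed = Normalizer::normalize($string, Normalizer::FORM_D);
$stripped = preg_replace('/\p{Mn}/u', '', $decomposed);
// 'ł' has no Unicode decomposition, so it has to be mapped by hand.
echo strtr($stripped, ['ł' => 'l', 'Ł' => 'L']);
?>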
I am not sure where to start with this, but here is what I want to do:
Users have a text field where they need to enter a few words. The problem is that the page will be used by people from different countries, and they will enter "weird" Latin characters like: ž, Ä, Ü, đ, Ť, Á etc.
Before saving to the database I want to convert them to z, a, u, d, t, a... Is there a way to do this without writing something like this (I think there are too many characters to cover):
$string = str_replace(array('Č','Ä','Á','đ'), array('C','A','A','d'), $string);
And yes, I know that I can save UTF-8 in the database, but the problem is that this string will later be sent by SMS, and because of the nature of the SMS protocol, these "special" characters take up more space in a message than regular English alphabet characters (I am limited to 120 characters, and if I put "Ä" in a message, it will take up more than one character's worth of space).
First of all, I would still store the original characters in utf-8 in the database. You can always "translate" them to ASCII characters upon retrieval. This is good because if, say, in the future SMS adds UTF-8 support (or you want to use user data for something else), you'll have the original characters intact.
That said, you can use iconv to do this:
iconv('utf-8', 'ascii//TRANSLIT', $input); //where $input contains "weird" characters
See this thread for more info, including some caveats of this approach: PHP: Replace umlauts with closest 7-bit ASCII equivalent in an UTF-8 string
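A rough sketch of the "store UTF-8, convert on retrieval" idea follows. The function name toSmsAscii and the fallback of dropping non-ASCII bytes are my own assumptions, not part of iconv; what iconv actually emits for a given character depends on the iconv implementation and locale, so no exact output is shown.

```php
<?php
// Hypothetical helper: convert a stored UTF-8 string to ASCII just before sending.
function toSmsAscii(string $utf8): string
{
    $ascii = false;
    if (function_exists('iconv')) {
        // @ hides the notice iconv raises when it cannot convert the input.
        $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);
    }
    if ($ascii === false) {
        // Assumed fallback: drop every non-ASCII byte rather than fail.
        $ascii = preg_replace('/[^\x00-\x7F]/', '', $utf8);
    }
    return $ascii;
}

// The database keeps the original; only the outgoing copy is flattened.
$stored = "\u{017D}lu\u{0165}ou\u{010D}k\u{00FD} \u{00C4}"; // "Žluťoučký Ä"
$sms = toSmsAscii($stored);
```

Whatever the platform produces, the result contains only 7-bit characters, so each one counts as a single character in the message.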
Close, but not perfect, because it converts the accents into separate characters instead of dropping them.
http://www.php.net/manual/en/function.iconv.php
echo iconv("ISO-8859-1", "ASCII//TRANSLIT", 'Martín');
//output: Mart'in
echo iconv("ISO-8859-1", "ASCII//TRANSLIT", "ÆÇÈÊÈÒÐÑÕ");
//output: AEC`E^E`E`OD~N~O
Using UTF-8 as the source encoding instead:
echo iconv('utf-8', 'ascii//TRANSLIT', 'Martín');
//output: Mart
If the accented character is not actually encoded as UTF-8, iconv just cuts off the string from the special character onwards.
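One way to avoid the truncation is to check the encoding before calling iconv. This is only a sketch: the helper name ensureUtf8 is made up, it needs the mbstring extension, and the assumption that non-UTF-8 input is ISO-8859-1 is a guess you would have to confirm for your own data.

```php
<?php
// Hypothetical helper: return $s as UTF-8, converting it if it is not valid UTF-8 already.
function ensureUtf8(string $s, string $assumedEncoding = 'ISO-8859-1'): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8, leave it alone
    }
    // Assumption: anything that is not valid UTF-8 is in $assumedEncoding.
    return mb_convert_encoding($s, 'UTF-8', $assumedEncoding);
}

$latin1 = "Mart\xEDn";          // "Martín" as ISO-8859-1 bytes
$utf8   = ensureUtf8($latin1);  // now safe to pass to iconv('UTF-8', ...)
echo bin2hex($utf8), "\n";      // 4d617274c3ad6e
```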
I need to "flatten out" a number of Unicode strings for the purposes of indexing and searching. For example, I need to convert GötheФ€ into ASCII. The last two characters have no close representations in ASCII so it's Ok to discard them completely. So what I expect from
echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", "GötheФ€");
is Gothe but instead it outputs Gothe?EUR.
In addition to letters, I'd also like all the variety of Unicode numerals and punctuation marks, such as periods, commas, dashes, slashes etc. to be replaced by their closest ASCII counterparts, which is something ASCII//TRANSLIT//IGNORE in iconv function does already but not without producing some garbage output for the Unicode characters for which it's not able to find any ASCII replacements. I'd like such characters to be totally ignored.
How do I get the expected result? Is there a better way, perhaps using the intl library?
You've picked a hard problem. It is better to tell the user entering Unicode characters to transliterate ASCII themselves. Doing it for them will only upset them when they disagree with your transliteration.
Anything you do will likely be jarring and offensive to people who place great meaning on Diacritics: http://en.wikipedia.org/wiki/Diacritic
No matter what transliteration strategy you use, you will not please everyone, since different people ascribe different meanings to different characters. A transliteration that delights one person will enrage another. You won't make everyone happy unless you let everyone use whatever character they want in Unicode.
But life is jarring and offensive, so off we go:
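This PHP code:
function toASCII( $str )
{
    // Split the UTF-8 "from" list into characters and pair each with its ASCII replacement.
    // The array form of strtr() works on UTF-8 directly, so utf8_decode() (deprecated since
    // PHP 8.2, and unable to represent Š, Œ, Ž, š, œ, ž, Ÿ in ISO-8859-1) is not needed.
    $from = preg_split('//u',
        'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ',
        -1, PREG_SPLIT_NO_EMPTY);
    $to = str_split('SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
    // Characters that are not in the list are returned unchanged.
    return strtr($str, array_combine($from, $to));
}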
What the above PHP function does is replace each character from the first list (the accented characters) with the character at the same position in the second list (the ASCII replacements).
For example the Unicode À is transliterated to ASCII A, and the å is converted to a. You'll have to specify this for every single Unicode character that you believe transliterates to an ASCII character. For the others, remove them or run them through another transliteration algorithm.
There are 95,221 other characters that you will have to look at which might transliterate to ASCII. It becomes an existential game of "When is an A no longer an A?". What about the Klingon characters and the road-map signs that kind of look like an A? The fish character kind of looks like an a. Who is to say what is what?
This is a lot of work, but if you are cleaning database input, you have to create a whitelist of characters and block out the other barbarians, keeping them out at the moat. It's the only reliable way.
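The whitelist step could look something like this. It is a sketch under my own assumptions: the helper name keepWhitelisted is made up, the whitelist is "printable ASCII", and the one-entry map stands in for whatever transliteration table you end up with.

```php
<?php
// Hypothetical helper: keep only printable ASCII (space through tilde), drop every other byte.
function keepWhitelisted(string $s): string
{
    return preg_replace('/[^\x20-\x7E]/', '', $s);
}

// Transliterate what you have a mapping for, then let the whitelist discard the rest.
$map = ["\u{00F6}" => 'o'];                    // ö -> o (stand-in for a full table)
$input = "G\u{00F6}the\u{0424}\u{20AC}";       // "GötheФ€"
$flat = keepWhitelisted(strtr($input, $map));
echo $flat, "\n";                              // Gothe
```

Because the filter runs last, anything the mapping did not cover is removed outright instead of showing up as "?" or "EUR".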
I am trying to replace a certain character in a string with another. They are quite obscure Latin characters. I want to replace the character U+0259 (hex 259) with U+04D9 (hex 4d9), so I tried this:
str_replace("\x02\x59","\x04\xd9",$string);
This didn't work. How do I go about this?
EDIT: Additional information.
Thanks bobince, that has done the trick. Although, I want to replace the uppercase schwa also and it is not working for some reason. I calculated U+018F (Ə) as UTF-8 0xC68F and this is to be replaced with U+04D8 (0xD398):
$string = str_replace("\xC9\x99", "\xD3\x99", $_POST['string_with_schwa']); //lc 259->4d9
$string = str_replace( "\xC6\8F", "\xD3\x98" , $string); //uc 18f->4d8
I am copying the 'Ə' into a textbox and posting it. The first str_replace works fine on the lowercase, but does not detect the uppercase in the second str_replace, strange. It remains as U+018F. Guess I could run the string through strtolower but this should work though.
U+0259 Latin Small Letter Schwa is only encoded as the byte sequence 0x02,0x59 in the UTF-16BE encoding. It is very unlikely you will be working with byte strings in the UTF-16BE encoding as it's not an ASCII-compatible encoding and almost no-one uses it.
The encoding you want to be working with (the only ASCII-superset encoding to support both Latin Schwa and Cyrillic Schwa, as it supports all Unicode characters) is UTF-8. Ensure your input is in UTF-8 format (if it is coming from form data, serve the page containing the form as UTF-8). Then, in UTF-8, the character U+0259 is represented using the byte sequence 0xC9,0x99.
str_replace("\xC9\x99", "\xD3\x99", $string);
If you make sure to save your .php file as UTF-8-no-BOM in the text editor, you can skip the escaping and just directly say:
str_replace('ə', 'ә', $string);
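If you would rather not work out the UTF-8 bytes by hand, PHP 7.0+ accepts code point escapes of the form "\u{...}" in double-quoted strings. Here is a sketch (the function name latinSchwaToCyrillic is made up) that covers both the lowercase and the uppercase schwa. Incidentally, the "\xC6\8F" in the follow-up above is missing an "x": PHP reads it as the byte 0xC6 followed by a literal backslash, "8" and "F", which is why that pattern never matches.

```php
<?php
// Hypothetical helper: map Latin schwa to Cyrillic schwa, both cases.
function latinSchwaToCyrillic(string $s): string
{
    return strtr($s, [
        "\u{0259}" => "\u{04D9}", // ə (UTF-8 C9 99) -> ә (UTF-8 D3 99)
        "\u{018F}" => "\u{04D8}", // Ə (UTF-8 C6 8F) -> Ә (UTF-8 D3 98)
    ]);
}

echo bin2hex("\u{018F}"), "\n";                       // c68f
echo bin2hex(latinSchwaToCyrillic("\u{018F}")), "\n"; // d398
```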
A couple of possible suggestions. Firstly, remember that you need to assign the new value to $string, i.e.:
$string = str_replace("\x02\x59","\x04\xd9",$string);
Secondly, verify that your byte stream occurs in the $string. I mention this because your hex string begins with a low-byte, so you'll need to make sure your $string is not UTF8 encoded.
What is the most efficient way to remove accents from a string e.g. ÈâuÑ becomes Eaun?
Is there a simple, built in way that I'm missing or a regular expression?
If you have iconv installed, try this (the example assumes your input string is in UTF-8):
echo iconv('UTF-8', 'ASCII//TRANSLIT', $string);
(iconv is a library to convert between all kinds of encodings; it's efficient and included with many PHP distributions by default. Most of all, it's definitely easier and more error-proof than trying to roll your own solution (did you know that there's a "Latin letter N with a curl"? Me neither.))
I found a solution, that worked in all my test-cases (copied from http://php.net/manual/en/transliterator.transliterate.php):
var_dump(transliterator_transliterate('Any-Latin; Latin-ASCII; [\u0080-\u7fff] remove',
"A æ Übérmensch på høyeste nivå! И я люблю PHP! есть. fi ¦"));
// string(50) "A ae Ubermensch pa hoyeste niva! I a lublu PHP! est. fi "
see: http://www.php.net/normalizer
EDIT: This solution is independent of the locale set using setlocale(). Another benefit over iconv() is that even non-Latin characters are not ignored.
EDIT2: I discovered that some characters are not covered by the transliteration I posted originally. Any-Latin translates the Cyrillic character ь to a character that doesn't fit into a Latin character set: ʹ (http://en.wikipedia.org/wiki/Prime_%28symbol%29). I've added [\u0100-\u7fff] remove to remove all these non-Latin characters. I also added a test to the text ;)
I assume that by "Latin" they mean the Latin alphabet here, not one of the Latin character sets. But anyway, in my opinion, Latin-ASCII should then transliterate it to something in ASCII ...
EDIT3: Sorry for another change here. I had to lower the start of the range to \u0080 instead of \u0100 to get only ASCII characters as output. The test above is updated.
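For reuse, the same idea can be wrapped in a function. This is only a sketch: the name asciiFold is made up, and both the check for the intl extension and the final byte-level strip (used here in place of the "remove" rule, so that characters above \u7fff are caught too) are my additions. Exact transliterations depend on the ICU version, so no output is shown.

```php
<?php
// Hypothetical wrapper around the transliterator approach described above.
function asciiFold(string $text): string
{
    $out = false;
    if (function_exists('transliterator_transliterate')) {
        // Any-Latin: other scripts -> Latin letters; Latin-ASCII: Latin letters -> ASCII.
        $out = transliterator_transliterate('Any-Latin; Latin-ASCII', $text);
    }
    if ($out === false) {
        $out = $text; // intl missing or transliteration failed: fall through to the strip
    }
    // Remove whatever is still outside ASCII (for example the prime sign mentioned above).
    return preg_replace('/[^\x00-\x7F]/', '', $out);
}

$folded = asciiFold("\u{00DC}b\u{00E9}rmensch p\u{00E5} h\u{00F8}yeste niv\u{00E5}!");
```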
Reposting this at the request of @palantir ...
I find iconv completely unreliable, and I dislike preg_replace solutions and big arrays ... so my favorite way (and the only reliable method I've found) is ...
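function toASCII( $str )
{
    // Split the UTF-8 "from" list into characters and pair each with its ASCII replacement.
    // The array form of strtr() works on UTF-8 directly, so utf8_decode() (deprecated since
    // PHP 8.2, and unable to represent Š, Œ, Ž, š, œ, ž, Ÿ in ISO-8859-1) is not needed.
    $from = preg_split('//u',
        'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ',
        -1, PREG_SPLIT_NO_EMPTY);
    $to = str_split('SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
    // Characters that are not in the list are returned unchanged.
    return strtr($str, array_combine($from, $to));
}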
You can use iconv to transliterate the characters to plain US-ASCII and then use a regular expression to remove non-alphabetic characters:
preg_replace('/[^a-z]/i', '', iconv("UTF-8", "US-ASCII//TRANSLIT", $text))
Another way would be using the Normalizer to normalize to the Normalization Form KD (NFKD) and then remove the mark characters:
preg_replace('/\p{Mn}/u', '', Normalizer::normalize($text, Normalizer::FORM_KD))
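The Normalizer variant can be spelled out step by step. A sketch, assuming the intl extension for Normalizer (the function name stripMarks and the guard for a missing extension are mine). Note that letters without a Unicode decomposition, such as ł or ø, pass through unchanged with this method.

```php
<?php
// Hypothetical helper: decompose to NFKD, then delete the combining marks.
function stripMarks(string $text): string
{
    if (class_exists('Normalizer')) {
        $decomposed = Normalizer::normalize($text, Normalizer::FORM_KD);
        if ($decomposed !== false) {
            $text = $decomposed; // e.g. "é" becomes "e" followed by U+0301
        }
    }
    // \p{Mn} matches non-spacing marks such as the combining acute accent U+0301.
    return preg_replace('/\p{Mn}/u', '', $text);
}

// Input that is already decomposed works even without intl.
echo stripMarks("e\u{0301}"), "\n"; // e
```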
Note: I'm reposting this from another similar question in the hope that it's helpful to others.
I ended up writing a PHP library based on URLify.js from the Django project, since I found iconv() to be too incomplete. You can find it here:
https://github.com/jbroadway/urlify
Handles Latin characters as well as Greek, Turkish, Russian, Ukrainian, Czech, Polish, and Latvian.