Normalize Turkish in PHP? - php

Is there a simple way to normalize Turkish characters like Ç, Ğ, İ, Ö, Ş, Ü and ı?
Right now I'm using str_replace, but that doesn't seem like the right way to go, because it's easy to forget a character. Is there a more standard way? I tried the normalize method in the PHP internationalization (intl) module, but the Turkish characters stay Turkish. I would like to replace them with plain ASCII characters for use in a URL, so Ç becomes C, Ş becomes S, and so on.

What do you mean by normalization? Just take the characters as they come in, but make sure your scripts, database connection and HTML all use the correct encoding.
UTF-8 is suggested; for an explanation see: UTF-8 vs. Unicode
If you only want ASCII chars, you can test for them with something like ord($char) < 128.
For conversion look into these functions:
http://php.net/iconv
http://php.net/utf8_encode
http://php.net/mb_convert_encoding
A call similar to
$str = iconv('UTF-8', 'ASCII//TRANSLIT', $str);
would do the trick, although the exact replacements depend on the iconv implementation and the current locale.
Another preg_replace way: Convert special characters to normal characters using PHP, like ã, é, ç to a, e, c
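Since the question mentions the intl module: besides Normalizer, intl also provides transliterator_transliterate(), which may be closer to what is wanted here. The following is a minimal sketch, not a definitive implementation. The helper name turkishToAscii is hypothetical, and it assumes the intl extension is loaded and that the underlying ICU library ships the "Latin-ASCII" transform.

```php
<?php
// Hypothetical helper (not from the original answer).
// Assumes ext-intl is available and ICU knows the "Latin-ASCII" transform.
function turkishToAscii(string $s): string
{
    // "Any-Latin" converts other scripts to Latin first,
    // "Latin-ASCII" then folds accented Latin letters to plain ASCII.
    $result = transliterator_transliterate('Any-Latin; Latin-ASCII', $s);

    // On failure (for example an unknown transform ID) return the input unchanged.
    return $result === false ? $s : $result;
}
```

Calling turkishToAscii('Çağrı Şükrü') should then return an ASCII-only string that is safe to feed into a URL slug routine.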

Related

Emojis not correctly encoded into hexadecimal

$message = "Spanish Language
á, é, í, ó, ú, ñ, ü
😃 😄 😅 😆 😉 😊 😋 😎";
$hex = '#U' . strtoupper(bin2hex(mb_convert_encoding($message, 'UCS-2','auto')));
When I send $hex to the following API, everything is fine except the emojis: instead of the emojis, a ? symbol appears on the phone.
https://api.txtlocal.com/docs/encodingdecodingunicode
Please tell me what I'm doing wrong.
These emoji are not representable in UCS-2. In UTF-16, they are represented using surrogate pairs, which UCS-2 does not support. For example, 😋 (U+1F60B) is encoded in little-endian UTF-16 as this:
0x3d 0xd8 0x0b 0xde
This is four bytes, even though it is a single character. UCS-2 guarantees that every character it contains takes exactly two bytes, so 😋 is not included.
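To see where those four bytes come from, here is a small sketch of the surrogate-pair arithmetic. The function name surrogatePair is hypothetical, and the sketch only applies to code points above U+FFFF.

```php
<?php
// Sketch of the UTF-16 surrogate-pair arithmetic for one code point.
// Only meaningful for code points above U+FFFF.
function surrogatePair(int $codePoint): array
{
    $v    = $codePoint - 0x10000;      // leaves a 20-bit value
    $high = 0xD800 + ($v >> 10);       // top 10 bits go into the high surrogate
    $low  = 0xDC00 + ($v & 0x3FF);     // bottom 10 bits go into the low surrogate
    return [$high, $low];
}

[$high, $low] = surrogatePair(0x1F60B);   // U+1F60B is the emoji from the answer
printf("%04X %04X\n", $high, $low);       // prints "D83D DE0B"
```

Written out byte by byte in little-endian order, D83D DE0B becomes 3d d8 0b de, which matches the bytes quoted above.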
I fixed this issue by changing the line of code to the following:
return '#U' . strtoupper(bin2hex(mb_convert_encoding($message, 'UTF-16','UTF-8')));
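Wrapped into a function, the corrected line might look like the sketch below. The name toTxtlocalHex is hypothetical; the sketch assumes the mbstring extension is loaded, that $message is UTF-8, and that the API wants big-endian UTF-16, which is spelled out here as 'UTF-16BE' rather than left to the default.

```php
<?php
// Hypothetical wrapper around the corrected line.
// Assumes ext-mbstring and UTF-8 input; big-endian byte order is an assumption.
function toTxtlocalHex(string $message): string
{
    // UTF-16 can represent emoji through surrogate pairs; UCS-2 cannot.
    $utf16 = mb_convert_encoding($message, 'UTF-16BE', 'UTF-8');

    // Hex-encode the bytes and add the "#U" prefix used in the question.
    return '#U' . strtoupper(bin2hex($utf16));
}
```

With this, an accented letter such as á still takes one 16-bit unit (00E1), while an emoji takes two.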

Replace Polish characters with standard ASCII equivalents

Basically, I have a huge number of files, and many of them contain Polish letters like 'ł, ż, ź, ó, ń' etc. in their filenames.
What I want is to change each Polish letter to a standard ASCII character (so for example ż => z, ń => n).
The files are located on a server running Debian Linux (Squeeze).
What should I use, and how do I achieve this?
You tagged your question with PHP, so my answer will assume that.
There is a question similar to yours:
Convert national chars into their latin equivalents in PHP
Basically
Use the Normalizer class from the PHP intl extension.
http://www.php.net/manual/en/class.normalizer.php
<?php
$string = 'ł ż ź ó ń';
echo Normalizer::normalize($string);
?>
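One caveat worth checking: Normalizer::normalize() defaults to form C, which composes characters rather than removing marks, so on its own it leaves these letters as they are. A minimal sketch of a variant that does strip the diacritics follows. The helper name polishToAscii is hypothetical, and the sketch assumes the intl extension is loaded.

```php
<?php
// Hypothetical helper (not from the original answer); requires ext-intl.
function polishToAscii(string $s): string
{
    // Form D splits letters such as "ż" into "z" plus a combining mark.
    $decomposed = Normalizer::normalize($s, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $s; // input was not valid UTF-8; leave it untouched
    }

    // Remove the combining marks (Unicode category Mn).
    $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);

    // "ł" and "Ł" have no decomposition, so they are mapped explicitly.
    return strtr($stripped, ['ł' => 'l', 'Ł' => 'L']);
}
```

For the string from the answer, polishToAscii('ł ż ź ó ń') gives "l z z o n". The result could then be passed to rename() for each file.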

Creating a slug with UTF-8 in it

I am trying to write some code to take UTF-8 text and create a slug that contains some UTF-8 characters. So this is not about transliterating UTF-8 into ASCII.
So basically I want to replace any UTF-8 character that is whitespace, a control character, punctuation, or a symbol with a dash. There exist Unicode categories that I should be able to use: \p{Z}, \p{C}, \p{P}, or \p{S}, respectively.
So I could do something as simple as this:
preg_replace("#(\p{P}|\p{C}|\p{S}|\p{Z})+#", "-", "This. test? has an ö in it");
but it results in this:
This-test-has-an-√-in-it
(and I'd want This-test-has-an-ö-in-it)
It butchers the umlaut o, possibly because in UTF-8 it is encoded as the two bytes c3 b6, of which the b6 seems to be recognized as a punctuation character (so the \p{P} matches here). The c3 remains in the text. This is strange because AFAIK a single byte b6 doesn't exist in UTF-8.
I also tried the same thing in Perl in order to make sure it is not a PHP problem, but the code
$s = 'This. test? has an ö in it';
$s =~ s/(\p{P}|\p{C}|\p{S}|\p{Z})+/-/g;
also produces:
This-test-has-an-√-in-it
(which probably makes sense as PHP's PCRE are Perl Compatible Regular Expressions)
While when I do this in Python
import regex as re  # third-party "regex" module, which supports \p{...} classes
text = "This. test? has an ö in it"
print(re.sub(r"(\p{P}|\p{C}|\p{S}|\p{Z})+", "-", text))
it produces my desired
This-test-has-an-ö-in-it
What to do?
The solution was to use the "Unicode modifier" u:
preg_replace("#(\p{P}|\p{C}|\p{S}|\p{Z})+#u", "-", "This. test? has an ö in it");
correctly produces
This-test-has-an-ö-in-it
So: using Unicode Categories without the Unicode modifier produces strange results without any warning.
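Put together as a reusable function, the fix might look like the sketch below. The name unicodeSlug is hypothetical; the sketch assumes the source file and the input are UTF-8, and it additionally trims leading and trailing dashes, which the original one-liner does not do.

```php
<?php
// Hypothetical helper built around the pattern from the answer above.
function unicodeSlug(string $text): string
{
    // The trailing "u" makes PCRE treat both pattern and subject as UTF-8,
    // so \p{...} classes match whole characters instead of single bytes.
    $slug = preg_replace('/[\p{P}\p{C}\p{S}\p{Z}]+/u', '-', $text);

    if ($slug === null) {
        // With the "u" modifier, preg_replace returns null on invalid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // Trimming the edges is an addition, not part of the original answer.
    return trim($slug, '-');
}

echo unicodeSlug("This. test? has an ö in it"), "\n"; // prints "This-test-has-an-ö-in-it"
```

Checking for null is the one warning sign PHP does give here: it appears only for invalid input, not for a missing modifier.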

Language specific characters to regular English chars

I am not sure where to start with this, but here is what I want to do:
Users have a text field where they need to enter a few words. The problem is that people from different countries will use the page, and they will enter "weird" Latin characters like: ž, Ä, Ü, đ, Ť, Á etc.
Before saving to the database I want to convert them to z, a, u, d, t, a... Is there a way to do this without writing something like this (I think there are too many characters to cover):
$string = str_replace(array('Č','Ä','Á','đ'), array('C','A','A','d'), $string);
And yes, I know that I can save UTF-8 in the database, but the problem is that this string will later be sent by SMS. Because of the nature of the SMS protocol, these "special" chars take more space in a message than regular English alphabet characters (I am limited to 120 chars, and if I put "Ä" in the message, it takes more than one character's worth of space).
First of all, I would still store the original characters in utf-8 in the database. You can always "translate" them to ASCII characters upon retrieval. This is good because if, say, in the future SMS adds UTF-8 support (or you want to use user data for something else), you'll have the original characters intact.
That said, you can use iconv to do this:
iconv('utf-8', 'ascii//TRANSLIT', $input); //where $input contains "weird" characters
See this thread for more info, including some caveats of this approach: PHP: Replace umlauts with closest 7-bit ASCII equivalent in an UTF-8 string
Close, but not perfect, because it turns the accents into separate ASCII characters:
http://www.php.net/manual/en/function.iconv.php
echo iconv("ISO-8859-1", "ASCII//TRANSLIT", 'Martín');
//output: Mart'in
echo iconv("ISO-8859-1", "ASCII//TRANSLIT", "ÆÇÈÊÈÒÐÑÕ");
//output: AEC`E^E`E`OD~N~O
Using
echo iconv('utf-8', 'ascii//TRANSLIT', 'Martín');
//output: Mart
If the accented character is not encoded as UTF-8, the string is simply cut off from the special character onwards.
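Given these caveats, it may be worth wrapping the iconv call so that a failed conversion does not silently truncate the text. The following is a sketch under stated assumptions: the name toSmsAscii and the fallback policy (dropping all non-ASCII bytes) are hypothetical choices, and what //TRANSLIT produces still varies between iconv implementations and locales.

```php
<?php
// Hypothetical helper; the name and the fallback policy are assumptions.
function toSmsAscii(string $input): string
{
    // "@" silences the notice iconv raises on bad input or when
    // //TRANSLIT is not supported by the platform's iconv.
    $ascii = function_exists('iconv')
        ? @iconv('UTF-8', 'ASCII//TRANSLIT', $input)
        : false;

    if ($ascii === false) {
        // Fallback: drop every byte outside the 7-bit ASCII range.
        $ascii = preg_replace('/[^\x00-\x7F]/', '', $input);
    }

    return $ascii;
}
```

The original UTF-8 value can still be stored in the database, with toSmsAscii() applied only when the SMS is built.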

Converting Unicode characters into the equivalent ASCII ones

I need to "flatten out" a number of Unicode strings for the purposes of indexing and searching. For example, I need to convert GötheФ€ into ASCII. The last two characters have no close representations in ASCII so it's Ok to discard them completely. So what I expect from
echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", "GötheФ€");
is Gothe but instead it outputs Gothe?EUR.
In addition to letters, I'd also like all the variety of Unicode numerals and punctuation marks, such as periods, commas, dashes, slashes etc. to be replaced by their closest ASCII counterparts, which is something ASCII//TRANSLIT//IGNORE in iconv function does already but not without producing some garbage output for the Unicode characters for which it's not able to find any ASCII replacements. I'd like such characters to be totally ignored.
How do I get the expected result? Is there a better way, perhaps using the intl library?
You've picked a hard problem. It is better to ask the users who enter Unicode characters to transliterate them to ASCII themselves. Doing it for them will only upset them when they disagree with your transliteration.
Anything you do will likely be jarring and offensive to people who place great meaning on diacritics: http://en.wikipedia.org/wiki/Diacritic
No matter what transliteration strategy you use, you will not please everyone, since different people ascribe different meanings to different characters. A transliteration that delights one person will enrage another. You won't make everyone happy unless you let everyone use whatever Unicode characters they want.
But life is jarring and offensive, so off we go:
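This PHP code:
function toASCII( $str )
{
    // Split the UTF-8 source characters one by one (not byte by byte).
    $from = preg_split('//u',
        'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ', -1, PREG_SPLIT_NO_EMPTY);
    // The replacements are plain ASCII, one byte each.
    $to = 'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy';
    $map = array();
    foreach ($from as $i => $char) {
        $map[$char] = $to[$i]; // pair the characters by position
    }
    return strtr($str, $map);
}
What the above PHP function does is pair each character of the first string with the character at the same position in the second string, and then let strtr replace every occurrence of the former with the latter. The input and the result both stay UTF-8, so characters that are not in the list pass through unchanged.
For example the Unicode À is transliterated to ASCII A, and the å is converted to a. You'll have to specify this for every single Unicode character that you believe transliterates to an ASCII character. For the others, remove them or run them through another transliteration algorithm.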
There are 95,221 other characters that you will have to look at which might transliterate to ASCII. It becomes an existential game of "When is an A no longer an A?". What about the Klingon characters and the road-map signs that kind of look like an A? The fish character kind of looks like an a. Who is to say what is what?
This is a lot of work, but if you are cleaning database input, you have to create a whitelist of characters and block out the other barbarians, keeping them out at the moat. It's the only reliable way.
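A minimal sketch of that whitelist idea, applied to the string from the question, could look like this. The name flattenToAscii is hypothetical, the map is only a tiny sample of transliterations you might decide to accept, and the set of allowed punctuation is an assumption.

```php
<?php
// Hypothetical whitelist-based helper; the map below is only a small sample.
function flattenToAscii(string $s): string
{
    // Explicit transliterations you have decided to accept.
    $map = ['ö' => 'o', 'ä' => 'a', 'ü' => 'u', 'ß' => 'ss', '–' => '-'];
    $s = strtr($s, $map);

    // Whitelist: keep ASCII letters, digits, space and basic punctuation.
    // Every other byte, including all bytes of non-ASCII characters, is dropped.
    return preg_replace('/[^A-Za-z0-9 .,\/-]/', '', $s);
}

echo flattenToAscii('GötheФ€'), "\n"; // prints "Gothe"
```

Anything that is neither mapped nor whitelisted simply disappears, which is the behaviour the question asks for instead of the "?EUR" that iconv produced.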
