Replace diacritic characters with "equivalent" ASCII in PHP? - php

Related questions:
How to replace characters in a java String?
How to replace special characters with their equivalent (such as " á " for " a") in C#?
As in the questions above, I'm looking for a reliable, robust way to reduce any unicode character to near-equivalent ASCII using PHP. I really want to avoid rolling my own look up table.
For example (stolen from 1st referenced question): Gračišće becomes Gracisce

The iconv module can do this, more specifically, the iconv() function:
$str = iconv('Windows-1252', 'ASCII//TRANSLIT//IGNORE', "Gracišce");
echo $str;
//outputs "Gracisce"
The main hassle with iconv is that you have to watch your encodings, but it's definitely the right tool for the job. (I used 'Windows-1252' for the example because of limitations of the text editor I was working with.) The feature of iconv that you definitely want is the //TRANSLIT flag, which tells iconv to transliterate any character that has no ASCII match into the closest approximation.

I found another solution, based on @zombat's answer.
The issue with his answer was that I was getting:
Notice: iconv() [function.iconv]: Wrong charset, conversion from `UTF-8' to `ASCII//TRANSLIT//IGNORE' is not allowed in D:\www\phpcommand.php(11) : eval()'d code on line 3
And after removing //IGNORE from the function, I got:
Gr'a'e~a~o^O"ucisce
So, the š character was translated correctly, but the other characters weren't.
The solution that worked for me combines preg_replace (to remove everything except [a-zA-Z0-9.], spaces included) with @zombat's solution:
preg_replace('/[^a-zA-Z0-9.]/','',iconv('UTF-8', 'ASCII//TRANSLIT', "GráéãõÔücišce"));
Output:
GraeaoOucisce
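Because iconv() can refuse the conversion outright on some platforms (as the notice above shows), it may help to check its return value before cleaning up the result. A minimal sketch, assuming the iconv extension is loaded; to_ascii_slug() is a hypothetical helper name, not something from the answers above:

```php
<?php
// Hypothetical helper: returns null when iconv() cannot transliterate
// on this platform, instead of silently producing an empty string.
function to_ascii_slug(string $utf8): ?string
{
    // @ hides the notice/warning; the false return value is checked instead.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);
    if ($ascii === false) {
        return null;
    }
    // Drop the quote, tilde and caret characters that some iconv
    // builds emit in place of accents, along with spaces.
    return preg_replace('/[^a-zA-Z0-9.]/', '', $ascii);
}

$slug = to_ascii_slug('GráéãõÔücišce');
echo $slug === null ? 'iconv could not transliterate this string' : $slug;
```

What the transliteration step produces depends on the iconv implementation and the current locale, so the exact output is not guaranteed to match across systems.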

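My solution is to create two strings: the first holds the unwanted letters and the second holds the letters that replace them, in the same order.
$from = 'čšć';
$to = 'csc';
$text = 'Gračišće';
// mb_str_split() (PHP 7.4+) splits by character; str_split() would cut the multibyte UTF-8 letters into single bytes.
$result = str_replace(mb_str_split($from), str_split($to), $text);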
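The same lookup idea can be written with an associative array and strtr(), which avoids splitting strings altogether. This is only a sketch: replace_with_map() is a hypothetical name, the map covers just the three letters from the example, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: replace each key of $map with its value.
// strtr() with an array matches whole keys, so multibyte UTF-8
// letters are handled as complete units rather than single bytes.
function replace_with_map(string $text, array $map): string
{
    return strtr($text, $map);
}

// Illustrative map only; a real one would need many more entries.
$map = ['č' => 'c', 'š' => 's', 'ć' => 'c'];

echo replace_with_map('Gračišće', $map); // Gracisce
```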

Try this:
function normal_chars($string)
{
    $string = htmlentities($string, ENT_QUOTES, 'UTF-8');
    $string = preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', $string);
    $string = preg_replace(array('~[^0-9a-z]~i', '~-+~'), ' ', $string);
    return trim($string);
}
Examples:
echo normal_chars('Álix----_Ãxel!?!?'); // Alix Axel
echo normal_chars('áéíóúÁÉÍÓÚ'); // aeiouAEIOU
echo normal_chars('üÿÄËÏÖÜŸåÅ'); // uyAEIOUYaA
Based on the selected answer in this thread: URL Friendly Username in PHP?

You should also try:
transliterator_transliterate('Any-Latin; Latin-ASCII; Lower()', "ÀÖØöøįĴőŔžǍǰǴǵǸțȞȟȤȳɃɆɏ");
//Will output
aooooijorzajggnthhzybey
I found this from here:
https://www.php.net/manual/en/transliterator.transliterate.php#111939
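Since transliterator_transliterate() only exists when the intl extension is loaded, a small guard can make the dependency explicit. A sketch under that assumption; ascii_fold() is a hypothetical wrapper name, and the Lower() step is left out here so the original case is kept:

```php
<?php
// Hypothetical wrapper: returns null if intl is missing or the
// transliteration fails, otherwise the ASCII-folded string.
function ascii_fold(string $text): ?string
{
    if (!function_exists('transliterator_transliterate')) {
        return null; // intl extension is not available
    }
    $result = transliterator_transliterate('Any-Latin; Latin-ASCII', $text);
    return $result === false ? null : $result;
}

var_dump(ascii_fold('Gračišće'));
```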

Related

Trouble decoding some special characters &#146; &#147; &#148;

I'm trying to decode some special characters in php and can't seem to find a way to do it.
$str = 'Thi&#146;s i"s a&#146;n e&#148;xa&#147;mple';
This just returns some dots.
$str = preg_replace_callback("/(&#[0-9]+;)/", function($m) {
    return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}, $str);
Some other tests just return the same string.
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
$str = htmlspecialchars_decode($str, ENT_QUOTES);
Anyway, I've been trying all sorts of combinations but really no idea how to convert this to UTF-8 characters.
What I'm expecting to see is this:
Thi’s i"s a’n e”xa“mple
And actually if I take this directly and use htmlentities to encode it I see different characters to begin with.
Thi&rsquo;s i&quot;s a&rsquo;n e&rdquo;xa&ldquo;mple
Unfortunately I don't have control of the source and I'm stuck dealing with those characters.
Are they non standard, do I need to replace them manually with my own lookup table?
EDIT
Looking at this table here: https://brajeshwar.github.io/entities/
I see the characters I'm looking for are not listed. When I test a few characters from this table they decode just fine. I guess the list in PHP is incomplete by default?
If you check the unicode standard for the characters you're referring to: http://www.unicode.org/charts/PDF/U0080.pdf
You would see that all the codepoints you have in your string do not have representable glyphs and are control characters.
Which means that it is expected that they are rendered as empty squares (or dots, depending on how your renderer treats those).
If it works for someone somewhere - it's a non-standard behaviour, which one must not rely on, since it is, well, non-standard.
Apparently the text you have was originally encoded as cp1250, so you should either treat it accordingly or re-encode the entities manually:
$str = 'Thi&#146;s i"s a&#146;n e&#148;xa&#147;mple';
$str = preg_replace_callback("/&#([0-9]+);/u", function($m) {
    return iconv('cp1250', 'utf-8', chr($m[1]));
}, $str);
echo $str;
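A slightly more defensive variant of the same idea could limit the re-encoding to the 128-159 range, which is where the control-character code points sit, and leave every other entity alone. This is a sketch, not the answer's code: decode_cp1252_entities() is a hypothetical name, and Windows-1252 is assumed as the source code page (it agrees with cp1250 for these three quote characters).

```php
<?php
// Hypothetical helper: reinterpret numeric entities 128-159 as
// Windows-1252 bytes and convert them to UTF-8.
function decode_cp1252_entities(string $html): string
{
    return preg_replace_callback('/&#(\d+);/', function (array $m): string {
        $code = (int) $m[1];
        if ($code < 128 || $code > 159) {
            return $m[0]; // not in the affected range, keep as-is
        }
        // Some bytes in this range are undefined in Windows-1252;
        // iconv() returns false for those, so the entity is kept.
        $char = @iconv('Windows-1252', 'UTF-8', chr($code));
        return $char === false ? $m[0] : $char;
    }, $html);
}

echo decode_cp1252_entities('Thi&#146;s a&#147;n e&#148;xample');
```

Entities outside that range can still be passed through html_entity_decode() afterwards.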

mb_strtoupper displaying question mark

Hi I'm having a problem converting special characters to upper case.
With regular strtoupper I get something like DANIëL and when applying mb_strtoupper I get DANI?L.
Here's the code:
mb_strtoupper(rtrim($pieces[1], ","), 'UTF-8')
Mind you, I already have this running on the input:
iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $tr->TD[0])
Could this be the reason? Or is there something else?
This is a typical case of uppercasing a Latin1 string when the converter expects UTF-8.
Be sure to check your string source. This sample will work if your text editor saves the file in the Latin1 code page rather than UTF-8:
$str = "daniël"; //or your rtrim($pieces[1],",")
$str = mb_convert_encoding($str,'UTF-8','Latin1');
echo mb_strtoupper($str, 'UTF-8');
//will echo DANIËL
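When it is not known in advance whether the input is Latin1 or UTF-8, one option is to test the bytes first. A minimal sketch, assuming the mbstring extension and assuming that anything that is not valid UTF-8 is ISO-8859-1; upper_any() is a hypothetical helper name:

```php
<?php
// Hypothetical helper: upper-case text that may arrive as Latin1 or UTF-8.
function upper_any(string $text): string
{
    // Assumption: bytes that do not form valid UTF-8 are ISO-8859-1.
    if (!mb_check_encoding($text, 'UTF-8')) {
        $text = mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
    }
    return mb_strtoupper($text, 'UTF-8');
}

echo upper_any("dani\xEBl"), "\n";     // Latin1 bytes in, DANIËL out
echo upper_any("dani\u{00EB}l"), "\n"; // UTF-8 bytes in, DANIËL out
```

In the question's case, dropping the earlier iconv() call to ISO-8859-1 would have the same effect, since the data was already UTF-8 before that step.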

How to filter a Font Character in php

I have an arial character giving me a headache. U+02DD turns into a question mark after I turn its document into a phpquery object. What is an efficient method for removing the character in php by referring to it as 'U+02DD'?
You can use iconv() to convert character sets and strip invalid characters.
<?PHP
/* This will convert ISO-8859-1 input to UTF-8 output and
* strip invalid characters
*/
$output = iconv("ISO-8859-1", "UTF-8//IGNORE", $input);
/* This will attempt to convert invalid characters to something
* that looks approximately correct.
*/
$output = iconv("ISO-8859-1", "UTF-8//TRANSLIT", $input);
?>
See the iconv() documentation at http://php.net/manual/en/function.iconv.php
Use preg_replace and do it like this:
$str = "your text with that character";
echo preg_replace("#\x{02DD}#u", "", $str); //EDIT: inserted the u tag for unicode
To refer to characters anywhere in the large Unicode range, you can use preg_replace and specify the character with the \x{abcd} pattern. The second parameter is an empty string, so preg_replace replaces your character with nothing, effectively removing it.
[EDIT] Another way:
Did you try running htmlentities on it? Since its HTML entity is &#733;, doing that, or replacing the character with &#733;, may solve your issue too. Like this:
echo preg_replace("#\x{02DD}#u", "&#733;", $str);
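To refer to the character literally as 'U+02DD', as the question asks, the hex digits can be pulled out of that notation and dropped into the \x{...} pattern. A sketch only; remove_code_point() is a hypothetical helper name:

```php
<?php
// Hypothetical helper: remove every occurrence of the code point
// given in "U+XXXX" notation from a UTF-8 string.
function remove_code_point(string $text, string $notation): string
{
    if (!preg_match('/^U\+([0-9A-Fa-f]{1,6})$/', $notation, $m)) {
        throw new InvalidArgumentException("Not a U+XXXX code point: $notation");
    }
    // Builds a pattern such as /\x{02DD}/u from the captured hex digits.
    $result = preg_replace('/\x{' . $m[1] . '}/u', '', $text);
    // preg_replace() returns null on error (e.g. invalid UTF-8 input);
    // in that case the text is returned unchanged.
    return $result ?? $text;
}

echo remove_code_point("a\u{02DD}b", 'U+02DD'); // ab
```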

How to convert HTML character NUMBERS to plain characters in PHP?

I have some HTML data (over which I have no control, can only read it) that contains a lot of Scandinavian characters (å, ä, ö, æ, ø, etc.). These "special" chars are stored as HTML character numbers (æ = &#230;). I need to convert these to the corresponding actual character in PHP (or JavaScript but I guess PHP is better here...). Seems like html_entity_decode() only handles the "other" kind of entities, where æ = &aelig;. The only solution I've come up with so far is to make a conversion table and map each character number to a real character, but that's not really super smart...
So, any ideas? ;)
Cheers,
Christofer
&#NUMBER;
refers to the unicode value of that char.
so you could use some regex like:
/&#(\d+);/g
to grab the numbers. I don't know PHP, but I'm sure you can google how to turn a number into its Unicode equivalent char.
Then simply replace your regex match with the char.
Edit: Actually it looks like you can use this:
mb_convert_encoding('&#230;', 'UTF-8', 'HTML-ENTITIES');
I think html_entity_decode() should work just fine. What happens when you try:
echo html_entity_decode('&#230;', ENT_COMPAT, 'UTF-8');
On the PHP manual page on html_entity_decode(), it gives the following code for decoding numeric entities in versions of PHP prior to 4.3.0:
$string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
$string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);
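As someone noted in the comments, you should probably replace chr() with unichr() to deal with non-ASCII characters. (unichr() is not built into PHP, so you would have to define it yourself; on PHP 7.2+ mb_chr() does the same job. The /e modifier used above was also removed in PHP 7, so on current versions this needs preg_replace_callback() instead.)
However, it looks like html_entity_decode() really should deal with numeric as well as literal entities. Are you specifying an appropriate charset (e.g., UTF-8)?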
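A quick way to check that claim is to feed html_entity_decode() the decimal, hexadecimal and named forms of the same character and compare the results. A minimal sketch; decode_entities() is a hypothetical wrapper that just fixes the flags and charset:

```php
<?php
// Hypothetical wrapper: decode entities to UTF-8 with fixed flags.
function decode_entities(string $html): string
{
    return html_entity_decode($html, ENT_QUOTES, 'UTF-8');
}

// All three forms refer to U+00E6 (æ).
var_dump(decode_entities('&#230;') === "\u{00E6}");  // decimal
var_dump(decode_entities('&#xE6;') === "\u{00E6}");  // hexadecimal
var_dump(decode_entities('&aelig;') === "\u{00E6}"); // named
```

If any of these prints false, the charset argument or the input itself is the first thing to look at.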
If you haven't got the luxury of having multibyte string functions installed, you can use something like this:
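<?php
$string = 'Here is a special char &#230;';
// create_function() was removed in PHP 8, so an anonymous function is used instead.
$list = preg_replace_callback('/&#([0-9]+);/', function ($matches) {
    return decode((int) $matches[1]);
}, $string);
echo '<p>', $string, '</p>';
echo '<p>', $list, '</p>';
// Only correct for code points 0-255: chr() produces one Latin-1 byte,
// which utf8_encode() (deprecated as of PHP 8.2) converts to UTF-8.
function decode($codePoint)
{
    return utf8_encode(chr($codePoint));
}
?>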

PHP: Replace umlauts with closest 7-bit ASCII equivalent in an UTF-8 string

What I want to do is to remove all accents and umlauts from a string, turning "lärm" into "larm" or "andré" into "andre". What I tried to do was to utf8_decode the string and then use strtr on it, but since my source file is saved as UTF-8 file, I can't enter the ISO-8859-15 characters for all umlauts - the editor inserts the UTF-8 characters.
Obviously a solution for this would be to have an include that's an ISO-8859-15 file, but there must be a better way than to have another required include?
echo strtr(utf8_decode($input),
'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ',
'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
UPDATE: Maybe I was a bit inaccurate with what I try to do: I do not actually want to remove the umlauts, but to replace them with their closest "one character ASCII" equivalent.
iconv("utf-8","ascii//TRANSLIT",$input);
Extended example
A little trick that doesn't require setting locales or having huge translation tables:
function Unaccent($string)
{
    if (strpos($string = htmlentities($string, ENT_QUOTES, 'UTF-8'), '&') !== false)
    {
        $string = html_entity_decode(preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);~i', '$1', $string), ENT_QUOTES, 'UTF-8');
    }
    return $string;
}
The only requirement for it to work properly is to save your files in UTF-8 (as you should already).
You can also try this:
$string = "Fóø Bår";
$transliterator = Transliterator::createFromRules(':: Any-Latin; :: Latin-ASCII; :: NFD; :: [:Nonspacing Mark:] Remove; :: Lower(); :: NFC;', Transliterator::FORWARD);
echo $normalized = $transliterator->transliterate($string);
but you need to have the intl extension available: http://php.net/manual/en/book.intl.php
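The NFD and Nonspacing Mark steps in that rule chain can also be reproduced by hand, which shows what they contribute. A sketch assuming the intl extension for the Normalizer class; strip_marks() is a hypothetical helper name:

```php
<?php
// Hypothetical helper: decompose characters (NFD), then delete the
// combining marks, leaving the base letters behind.
function strip_marks(string $text): ?string
{
    if (!class_exists('Normalizer')) {
        return null; // intl extension is not available
    }
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return null; // input was not valid UTF-8
    }
    // \p{Mn} matches nonspacing marks such as the combining diaeresis.
    return preg_replace('/\p{Mn}+/u', '', $decomposed);
}

var_dump(strip_marks('lärm'));  // larm when intl is present
var_dump(strip_marks('andré')); // andre when intl is present
```

Letters such as ø and ß have no decomposition, so this step alone leaves them untouched; that is the part the Latin-ASCII rule takes care of.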
Okay, found an obvious solution myself, but it's not the best concerning performance...
echo strtr(utf8_decode($input),
utf8_decode('ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
If you are using WordPress, you can use the built-in function remove_accents( $string )
https://codex.wordpress.org/Function_Reference/remove_accents
However, I noticed a bug: it doesn’t work on a string with a single character.
For Arabic and Persian users, I recommend this way to remove diacritics:
$diacritics = array('َ','ِ','ً','ٌ','ٍ','ّ','ْ','ـ');
$search_txt = str_replace($diacritics, '', $search_txt);
To type diacritics with an Arabic keyboard in Windows editors, you can either type them directly or hold Alt and type the code of the diacritic character. (These are Windows-1256 code page values entered as Alt codes, not Unicode code points.)
These are the codes:
ـَ(0243) ـِ(0246) ـُ(0245) ـً(0240) ـٍ(0242) ـٌ(0241) ـْ(0250) ـّ(0248) ـ
ـ(0220)
I found that this one gives the most consistent results in French and German.
With the meta tag set to utf-8, I placed it in a function that returns a line from an array of words, and it works perfectly.
htmlentities ( $line, ENT_SUBSTITUTE , 'utf-8' )
