Issue with Spanish string encoding

I need help changing the encoding of a string that was copied and pasted from the clipboard.
The curious string is "español":
$problematicString = "español"; //copied and pasted from a filename
$okString = "español"; //typed
echo md5($problematicString)."<br>";
echo md5($okString)."<br>";
This is the output:
c9ae1d88242473e112ede8df2bdd6802
5d971adb0ba260af6a126a2ade4dd133
Why are the md5() outputs different for the same strings?
I've tried changing both strings using: mb_convert_encoding($string, "ISO-8859-1", "UTF-8") but the output is still different.
I need to fix $problematicString programmatically so that it produces the same hash as the other string.

Why are the md5 different for the same strings?
They are not the same string. In the first case the tilde is on the 'o':
$problematicString = "español"
In the second case, the tilde is on the 'n':
$okString = "español";
That's why the hashes don't match.

The reason is that the first string contains a hidden Unicode character:
̃
Pulled from my editor:
$problematicString = "español"; which is what it's actually showing.
It's actually a combining tilde (U+0303), not the ASCII tilde ~.
Pulled from http://courses.washington.edu/hypertxt/unicode/unidec1.html
These symbols, which are most of the non-ascii symbols useful for standard phonetic transcription of English, are drawn from several regions of the Unicode chart: from Latin-1 Supplement, Latin Extended-A and B,IPA Extensions, Combining Diacritical Mark, and Greek (for the theta). All of these pages are supported by lucida sans unicode, a TrueType font that Microsoft has bundled with recent products. Sadly, Bitstream's mother-of-all-TTFs Cyberbit does not support the IPA Extensions. These values can be entered manually as character entities or assigned to hot keys, buttons, or whatever the browser allows. Word97 can access the font via the symbol table under Insert.
Another way to write this font is to use Wincalis uniedit, which will write the Unicode values directly into the file. Then "This is phonetically transcribed" is represented in strange alphabet soup which is converted by the browser into [ðɪs ɪz fɘnɛɾɘkli trænskraibd] (look at this in a plain text editor to see the soup). For any serious or extensive transcription work, an editor like Wincalis would prove handy--you can even customize the IPA keyboard supplied.
If you want the file to trigger Unicode UTF-8 decoding in the browser, you must preface this META tag:
with the following under "Diacritics":
̃ #771 nasalized

As @BeetleJuice said, they are not the same string. Here's another way to understand this: reduce the data to just these two strings:
"español";
"español";
Then run the od command against them. Observe that the bytes are different:
0000000 6522 7073 6e61 83cc 6c6f 3b22 220a 7365
" e s p a n ̃ ** o l " ; \n " e s
0000020 6170 b1c3 6c6f 3b22 0a20
p a ñ ** o l " ; \n
0000032

In the first string the ñ is actually an n followed by a combining tilde, U+0303 (http://www.fileformat.info/info/unicode/char/0303/index.htm). In the second string it's a single precomposed character, ñ, U+00F1 (http://www.fileformat.info/info/unicode/char/f1/index.htm). You can check this by deleting characters with backspace: in the first string it takes two presses, one to delete the tilde and another for the 'n'.
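To make the two strings hash identically, one option is to bring both to the same Unicode normalization form before hashing. The following is a minimal sketch, assuming PHP 7 or later (for the "\u{...}" escapes) and the intl extension for the Normalizer class. The helper name toNfc is made up for this example, and its fallback branch is only a stand-in for this single character sequence, not a general normalizer.

```php
<?php
// The two strings from the question, rebuilt with explicit escapes so the
// difference survives copy and paste.
$problematicString = "espan\u{0303}ol"; // 'n' + U+0303 COMBINING TILDE
$okString          = "espa\u{00F1}ol";  // U+00F1, precomposed

// Hypothetical helper: compose to NFC so equivalent text gets identical bytes.
function toNfc(string $s): string
{
    if (class_exists('Normalizer')) {
        return Normalizer::normalize($s, Normalizer::FORM_C);
    }
    // Stand-in when intl is missing: handles only this one sequence.
    return str_replace("n\u{0303}", "\u{00F1}", $s);
}

echo bin2hex($problematicString), "\n"; // 657370616ecc836f6c
echo bin2hex($okString), "\n";          // 65737061c3b16f6c

$fixed = toNfc($problematicString);
var_dump(md5($fixed) === md5($okString)); // bool(true)
```

The bin2hex output shows the same bytes that od reported: 6e cc 83 for the decomposed form and c3 b1 for the precomposed one.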

Related

Display \u1F603 (emoji icon) in web page

I store codes like "\u1F603" within messages in my database, and now I need to display the corresponding emoji on my web page.
How can I convert \u1F603 to \xF0\x9F\x98\x83 using PHP for displaying emoji icons in a web page?

You don't need to convert emoji character codes to UTF-8 sequences; you can simply use the original 21-bit Unicode value as a numeric character reference in HTML, like this: &#x1F603; which renders as: 😃.
The Wikipedia article "Unicode and HTML" explains:
In order to work around the limitations of legacy encodings, HTML is designed such that it is possible to represent characters from the whole of Unicode inside an HTML document by using a numeric character reference: a sequence of characters that explicitly spell out the Unicode code point of the character being represented. A character reference takes the form &#N;, where N is either a decimal number for the Unicode code point, or a hexadecimal number, in which case it must be prefixed by x. The characters that compose the numeric character reference are universally representable in every encoding approved for use on the Internet.
For example, a Unicode code point like U+5408, which corresponds to a particular Chinese character, has to be converted to a decimal number, preceded by &# and followed by ;, like this: &#21512;, which produces this: 合.
So if in your PHP code you have a string containing '\u1F603', then you can create the corresponding HTML string using preg_replace, as in the following example:
$text = "This is fun \\u1F603!"; // this has just one backslash, it had to be escaped
echo "Database has: $text<br>";
$html = preg_replace("/\\\\u([0-9A-F]{2,5})/i", "&#x$1;", $text);
echo "Browser shows: $html<br>";
This outputs:
Database has: This is fun \u1F603!
Browser shows: This is fun 😃!
Note that if your data also uses the literal \u notation for lower-range Unicode characters, i.e. with hex numbers of 2 to 4 digits, you must make sure that the character following the escape sequence is not itself a hex digit, as that would lead to a wrong interpretation of where the \u escape sequence stops. In that case I would suggest always left-padding these hex numbers with zeroes in your data so that they are always 5 digits long.
To ensure your browser uses the correct character encoding, do the following:
Specify the UTF-8 character encoding in the HTML head section:
<meta charset="utf-8">
Save your PHP file in UTF-8 encoding. Depending on your editor, you may need to use a "Save As" option, or find such a setting in the editor's "Preferences" or "Options" menu.
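If you do want the actual UTF-8 bytes (\xF0\x9F\x98\x83) rather than an HTML reference, the same regular expression can feed a callback that decodes each escape. This is only a sketch: the function name unicodeEscapesToUtf8 is invented for this example, and it reuses html_entity_decode to turn the code point into UTF-8.

```php
<?php
// Hypothetical helper: replace each literal \uXXXXX escape with its UTF-8 bytes.
function unicodeEscapesToUtf8(string $text): string
{
    return preg_replace_callback(
        '/\\\\u([0-9A-F]{2,5})/i',
        function (array $m): string {
            // Build a numeric character reference and let PHP decode it.
            return html_entity_decode('&#x' . $m[1] . ';', ENT_QUOTES | ENT_HTML5, 'UTF-8');
        },
        $text
    );
}

$text = 'This is fun \u1F603!';           // single quotes: the backslash is literal
$utf8 = unicodeEscapesToUtf8($text);

echo bin2hex(substr($utf8, 12, 4)), "\n"; // f09f9883
```

The caveat about escapes followed by another hex digit applies here as well, since the pattern is the same.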

Hello everyone,
After many tries I found a solution.
I used the code below:
https://github.com/BriquzStudio/php-emoji
include 'Emoji.php';
$message = Emoji::Decode($message);
This one works fine for me! :)

Which middot character is this?

$string = 'Single · Female'
I copied it from Facebook.
In the HTML source it's just that dot. How did they type it?
When echoed in PHP, it shows up as an A with circumflex (Â) followed by that same dot.
How can I explode this string on that dot?

It is U+00B7 MIDDLE DOT, a character used for many purposes, e.g. as a separator between links, alternatives, or other items.
If your code displays it as Â·, then the reason is that the UTF-8 encoded form of U+00B7, namely 0xC2 0xB7, is being misinterpreted as being ISO-8859-1 or Windows-1252 encoded. You should fix this basic problem (instead of trying to deal with some of its symptoms). See UTF-8 all the way through.
Regarding the question “how did they type it?”, we cannot really know, and we need not know. There are zillions of ways to type characters, and anyone can invent a few more. (On my keyboard, I use AltGr Shift X. If I needed to type “·” on a Windows computer with vanilla settings, I would use Alt 0183.)

I believe this is an interpunct. It can be written with the HTML entities &middot; or &#183;, and in PHP with the Unicode code point U+00B7.
If you want to echo the unicode character without HTML entities, you can set the character encoding to UTF-8. Splitting is done through explode("·", $textToSplit) given that your PHP file is using UTF-8 as character encoding.
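A small sketch of the split, assuming PHP 7 or later. Writing the separator as the escape "\u{00B7}" means the result does not depend on the encoding the PHP file was saved in; the variable names are only illustrative.

```php
<?php
// U+00B7 MIDDLE DOT; its UTF-8 encoding is the two bytes C2 B7.
$middot = "\u{00B7}";
$string = "Single {$middot} Female";

// Split on the dot and strip the surrounding spaces from each part.
$parts = array_map('trim', explode($middot, $string));
print_r($parts); // "Single" and "Female"

echo bin2hex($middot), "\n"; // c2b7
```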

Converting Unicode characters into the equivalent ASCII ones

I need to "flatten out" a number of Unicode strings for the purposes of indexing and searching. For example, I need to convert GötheФ€ into ASCII. The last two characters have no close representations in ASCII so it's Ok to discard them completely. So what I expect from
echo iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", "GötheФ€");
is Gothe but instead it outputs Gothe?EUR.
In addition to letters, I'd also like all the variety of Unicode numerals and punctuation marks, such as periods, commas, dashes, slashes etc. to be replaced by their closest ASCII counterparts, which is something ASCII//TRANSLIT//IGNORE in iconv function does already but not without producing some garbage output for the Unicode characters for which it's not able to find any ASCII replacements. I'd like such characters to be totally ignored.
How do I get the expected result? Is there a better way, perhaps using the intl library?

You've picked a hard problem. It is better to ask the users entering Unicode characters to transliterate to ASCII themselves. Doing it for them will only upset them when they disagree with your transliteration.
Anything you do will likely be jarring and offensive to people who place great meaning on diacritics: http://en.wikipedia.org/wiki/Diacritic
No matter what transliteration strategy you use, you will not please everyone, since different people ascribe different meanings to different characters. A transliteration that delights one person will enrage another. You won't make everyone happy unless you let everyone use whatever Unicode character they want.
But life is jarring and offensive, so off we go:
This PHP Code:
function toASCII( $str )
{
return strtr(utf8_decode($str),
utf8_decode(
'ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ'),
'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
}
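The function converts both the input and the list of accented characters from UTF-8 to ISO-8859-1 with utf8_decode, so that every character is a single byte, and then uses strtr to replace each character of the second argument with the character at the same position in the third argument. Note that the first seven characters of the list (Š, Œ, Ž, š, œ, ž, Ÿ) do not exist in ISO-8859-1, so utf8_decode turns them into ?, and that utf8_decode is deprecated as of PHP 8.2.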
For example the Unicode À is transliterated to ASCII A, and the å is converted to a. You'll have to specify this for every single Unicode character that you believe transliterates to an ASCII character. For the others, remove them or run them through another transliteration algorithm.
There are 95,221 other characters that you will have to look at which might transliterate to ASCII. It becomes an existential game of "When is an A no longer an A?". What about the Klingon characters and the road-map signs that kind of look like an A? The fish character kind of looks like an a. Who is to say what is what?
This is a lot of work, but if you are cleaning database input, you have to create a whitelist of characters and block out the other barbarians, keeping them out at the moat. It's the only reliable way.
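The whitelist idea can also be sketched with the array form of strtr, which works directly on UTF-8 strings and so avoids utf8_decode. The function name flattenToAscii and the tiny map below are illustrative assumptions, not a complete transliteration table; anything not in the map and not ASCII is simply dropped, which is what the question asked for.

```php
<?php
// Hypothetical helper: map a whitelist of characters, then drop other non-ASCII.
function flattenToAscii(string $str): string
{
    // Deliberately tiny, illustrative map; extend it for your own data.
    $map = [
        "\u{00C0}" => 'A', // À
        "\u{00E0}" => 'a', // à
        "\u{00E5}" => 'a', // å
        "\u{00E9}" => 'e', // é
        "\u{00F1}" => 'n', // ñ
        "\u{00F6}" => 'o', // ö
        "\u{00FC}" => 'u', // ü
        "\u{2013}" => '-', // en dash
        "\u{201C}" => '"', // left double quotation mark
        "\u{201D}" => '"', // right double quotation mark
    ];
    $mapped = strtr($str, $map);

    // Remove every remaining byte >= 0x80. In UTF-8 all bytes of a non-ASCII
    // character are in that range, so whole characters disappear.
    return preg_replace('/[^\x00-\x7F]/', '', $mapped);
}

// "Göthe" followed by U+0424 (Cyrillic Ef) and U+20AC (euro sign).
echo flattenToAscii("G\u{00F6}the\u{0424}\u{20AC}"), "\n"; // Gothe
```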

How to replace umlaut characters or Unaccent in PHP?

I have a name "GÃ¶ran" and I want it to be converted to "Goran", which means I need to unaccent the word. But what I have tried doesn't seem to unaccent all of the words.
This is the code I've used to unaccent:
private function Unaccent($string)
{
return preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);~i', '$1', htmlentities($string, ENT_COMPAT, 'UTF-8'));
}
The cases where it does not work (the expected result is on the right-hand side):
JÃƒÅ’rgen => Juergen
InÃƒÅ¡s => Ines
The cases where it works:
GÃ¶ran => Goran
JÃ¸rgen Ole => Jorgen
JÃ©rÃ´me => Jerome
What could be the reason? How can I fix it? Do you have a better approach that handles all cases?

This might be what you are looking for
How to convert special characters to normal characters?
but use "utf-8" instead.
$text = iconv('utf-8', 'ascii//TRANSLIT', $text);
http://us2.php.net/manual/en/function.iconv.php

Short answer
You have two problems:
First, these names are not accented. They are badly formatted.
It seems that you had UTF-8 data but were working with it as ISO-8859-1, for example by telling your editor to use ISO-8859-1 and then copy-pasting the text into a text area in a browser that uses UTF-8. Then you saved the badly formatted names in the database. I have seen many such problems arising from copy-paste.
Second, once the names are correctly formatted, you can solve your second problem: unaccenting them. There is already a question treating this: How to convert special characters to normal characters?
Long answer (focuses on the badly formatted accented letters only)
Why do you get GÃ¶ran when you want Göran?
Let's begin with Unicode: The letter ö is in Unicode LATIN SMALL LETTER O WITH DIAERESIS. Its Unicode code point is F6 hexadecimal or, respectively, 246 decimal. See this link to the Unicode database.
In ISO-8859-1 code points from 0 to 255 are left as is. The small letter o with diaeresis is saved as only one byte: 246.
UTF-8 and ISO-8859-1 treat the code points 0 to 127 (aka ASCII) the same. They are left as is and saved as only one byte. They differ in the treatment of the code points 128 to 255. UTF-8 can encode the whole Unicode code point set, while ISO-8859-1 can only cope with the first 256 code points.
So, what does UTF-8 do with code points from 128 upwards? There is a staggered set of encoding possibilities for code points as they get bigger and bigger. For code points up to 2047 two bytes suffice. They are encoded like this: (see this bit schema)
xxx xxxx xxxx => 110xxxxx 10xxxxxx
Let's encode small letter o with diaeresis in UTF-8. The bits are 000 1111 0110, and they get encoded to 11000011 10110110. This is nice.
However, these two bytes can be misunderstood as two valid (!) ISO-8859-1 bytes. What are 11000011 (C3 hex) and 10110110 (B6 hex)? Let's consult an ISO-8859-1 table. C3 is capital A with tilde, and B6 is the paragraph sign. Both signs are valid and no software can detect this misunderstanding by just looking at the bits.
It definitely needs people who know what names look like. GÃ¶ran is just not a name. There is an uppercase letter smack in the middle of the name and the paragraph sign is not a letter at all. Sadly, this misunderstanding does not stop here. Because all characters are valid, they can be copy-pasted and re-rendered. In this process the misunderstanding can be repeated again. Let's do this with Göran. We already misunderstood it once and got a badly formatted GÃ¶ran. Capital A with tilde and the paragraph sign each take two bytes in UTF-8 (!), and those are interpreted as four bytes of gobbledygook, something like GÃƒÅ.ran.
Poor Jürgen! The umlaut ü got mistreated twice and we have JÃƒÅ’rgen.
We have a terrible mess with the umlauts here. It's even possible that the OP got this data as is from his customer. This happened to me once: I got mixed data: well formatted, badly formatted once, twice and thrice in the same file. It's extremely frustrating.
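The misreading described above can be reproduced in a few lines. This is a sketch that assumes the iconv extension and strict ISO-8859-1 (real-world data often involves Windows-1252 instead, which differs in the 0x80 to 0x9F range); the helper name misreadAsLatin1 is made up for the example.

```php
<?php
$good = "G\u{00F6}ran"; // "Göran" in UTF-8

// Hypothetical helper: treat UTF-8 bytes as ISO-8859-1 and save them as UTF-8 again.
function misreadAsLatin1(string $utf8): string
{
    return iconv('ISO-8859-1', 'UTF-8', $utf8);
}

$once  = misreadAsLatin1($good); // the "GÃ¶ran" form
$twice = misreadAsLatin1($once); // misread a second time

echo bin2hex($good), "\n"; // 47c3b672616e
echo bin2hex($once), "\n"; // 47c383c2b672616e
echo strlen($twice), "\n"; // 12

// One misreading can be undone by converting in the opposite direction.
$repaired = iconv('UTF-8', 'ISO-8859-1', $once);
var_dump($repaired === $good); // bool(true)
```

Each round doubles the number of bytes used for the non-ASCII part, which is why data that has been misread several times needs the reverse conversion applied the same number of times.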

Strange UTF8 string comparison

I'm having a problem with UTF-8 string comparison that I really can't figure out, and it is starting to give me a headache. Please help me out.
Basically I have this string from a xml document encoded in UTF8: 'Mina Tidigare anställningar'
And when I compare that string with the exactly the same string which I typed myself: 'Mina Tidigare anställningar' (also in UTF8). And the result is FALSE!!!
I have no idea why. It is so strange. Can someone help me out?

This seems somewhat relevant. To simplify, there are several ways to get the same text in Unicode (and therefore UTF8): for example, this: ř can be written as one character ř or as two characters: r and the combining ˇ.
Your best bet would be the Normalizer class: normalize both strings to the same normalization form and compare the results.
In one of the comments, you show these hex representations of the strings:
4d696e61205469646967617265 20 616e7374 c3a4 6c6c6e696e676172 // from XML
4d696e61205469646967617265 c2a0 616e7374 61cc88 6c6c6e696e676172 // typed
^^-----------------^^^^1 ^^^^^^2
Note the parts I marked, apparently there are two parts to this problem.
For the first, observe this question on the meaning of byte sequence "c2a0" - for some reason, your typing is translated to a non-breakable space where the XML file has a normal space. Note that there's a normal space in both cases after "Mina". Not sure what to do about that in PHP, except to replace all whitespace with a normal space.
As to the second, that is the case I outlined above: c3a4 is ä (U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS", one character, two bytes), whereas 61 is a (U+0061 "LATIN SMALL LETTER A", one character, one byte) and cc88 is the combining diaeresis (U+0308 "COMBINING DIAERESIS"), which together make two characters in three bytes. Here, the normalization library should be useful.
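Both fixes can be combined into one helper that is applied to each string before comparing. The sketch below rebuilds the two strings from the hex dumps above and assumes PHP 7 or later plus the intl extension; the name canonicalize is invented here, and the fallback branch is only a stand-in for this one character pair.

```php
<?php
// The two byte strings from the hex dumps.
$fromXml = hex2bin('4d696e61205469646967617265' . '20'   . '616e7374' . 'c3a4'   . '6c6c6e696e676172');
$typed   = hex2bin('4d696e61205469646967617265' . 'c2a0' . '616e7374' . '61cc88' . '6c6c6e696e676172');

// Hypothetical helper: unify spaces, then compose to NFC.
function canonicalize(string $s): string
{
    // Treat U+00A0 NO-BREAK SPACE as an ordinary space.
    $s = str_replace("\u{00A0}", ' ', $s);
    if (class_exists('Normalizer')) {
        return Normalizer::normalize($s, Normalizer::FORM_C);
    }
    // Stand-in when intl is missing: composes only a + U+0308.
    return str_replace("a\u{0308}", "\u{00E4}", $s);
}

var_dump($fromXml === $typed);                             // bool(false)
var_dump(canonicalize($fromXml) === canonicalize($typed)); // bool(true)
```

Whether a no-break space should count as equal to a normal space depends on the application, so that replacement is a choice made for this comparison rather than part of Unicode normalization.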

Let's try a blind guess: maybe the two UTF-8 strings do not have the same underlying representation (you can get accented characters either as a sequence or as a single character). You should give us a hex dump of both UTF-8 strings so that someone may be able to help.

mb_detect_encoding($s, "UTF-8") == "UTF-8" ? : $s = utf8_encode($s);
