A simple comparison in utf8, wrong result?


This code prints "no", but it should print "ok". The utf8_encode() outputs of the two strings are also different:
$a="کیهان";
$b="كيهان";
echo utf8_encode($a)."==".utf8_encode($b)."<br>";
if(utf8_encode($a)==utf8_encode($b))
echo "ok";
else
echo "no";
And the result:
Ú©ÛÙØ§Ù==ÙÙÙØ§Ù
no
What's that ©?
Edit: $a was copied and pasted; $b was typed.

Your Unicode strings are different to begin with. They are shown here with spaces to highlight the point:
$a="ک ی ه ن";
$b="ك ي ه ن";
EDIT: for curiosity's sake...
It seems that they display identically in the tab at the top of the file, which must use font features that combine the characters, but they display differently in the body of the code, where the text is actually rendered back to front.
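
One way to confirm the difference is to dump the code points of each string. The following is a minimal sketch, assuming PHP 7.4+ with the mbstring extension; the helper name codepoints() is made up for illustration, and the strings are written with \u{...} escapes so the result does not depend on how the source file is saved.

```php
<?php
// The two strings from the question, spelled out as explicit code points.
$a = "\u{06A9}\u{06CC}\u{0647}\u{0627}\u{0646}"; // starts with ARABIC LETTER KEHEH, FARSI YEH
$b = "\u{0643}\u{064A}\u{0647}\u{0627}\u{0646}"; // starts with ARABIC LETTER KAF, ARABIC YEH

// Hypothetical helper: list the Unicode code points of a UTF-8 string.
function codepoints(string $s): array
{
    return array_map(
        fn(string $ch): string => sprintf('U+%04X', mb_ord($ch, 'UTF-8')),
        mb_str_split($s, 1, 'UTF-8')
    );
}

echo implode(' ', codepoints($a)), "\n"; // U+06A9 U+06CC U+0647 U+0627 U+0646
echo implode(' ', codepoints($b)), "\n"; // U+0643 U+064A U+0647 U+0627 U+0646
var_dump($a === $b);                     // bool(false)
```

The first two letters look alike in many fonts but are distinct characters, so a byte-wise comparison can never succeed without normalizing one form to the other first.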

EDIT:
Billy's completely right (+1) about why the strings are not equal. This answer may explain why you see garbage text after the conversion.
I'm guessing that your original encoding is not ISO-8859-1.
See the first comment in the docs.
Please note that utf8_encode only converts a string encoded in
ISO-8859-1 to UTF-8. A more appropriate name for it would be
"iso88591_to_utf8". If your text is not encoded in ISO-8859-1, you do
not need this function. If your text is already in UTF-8, you do not
need this function. In fact, applying this function to text that is
not encoded in ISO-8859-1 will most likely simply garble that text.
You may want iconv instead.
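
As for the ©: the sketch below shows where it probably comes from, assuming the source string was already UTF-8. It uses mb_convert_encoding() from ISO-8859-1 to UTF-8 as a stand-in for utf8_encode(), since that is the same conversion and utf8_encode() is deprecated in recent PHP versions.

```php
<?php
// The first letter of $a, already encoded as UTF-8.
$keheh = "\u{06A9}";
echo bin2hex($keheh), "\n"; // daa9 (two bytes: 0xDA 0xA9)

// Same conversion utf8_encode() performs: every input byte is treated as
// one ISO-8859-1 character and re-encoded as UTF-8.
$garbled = mb_convert_encoding($keheh, 'UTF-8', 'ISO-8859-1');
echo bin2hex($garbled), "\n"; // c39ac2a9

// 0xDA is "Ú" and 0xA9 is "©" in ISO-8859-1, hence the "Ú©" in the output.
var_dump($garbled === "\u{00DA}\u{00A9}"); // bool(true)
```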

Related

Emoji (Unicode) to UTF-8 ampersand hash (?) encoding

To maintain compatibility with a pre-existing PHP solution, I require:
input: 😁 // emoji character,
output: &#xF0;&#x9F;&#x98;&#x81;
I believe this is 'ampersand hash' encoding (I'm not sure that's what it's called). I can't find any resources that explain how to arrive at this format, or what this encoding is suitable for.
I can get the bytes by URL-encoding the Unicode...
<?php print urlencode("😁"); /* Output: %F0%9F%98%81 */ ?>
...and I can use a Regex to convert this to the format I need... but I don't like this solution. It's very hacky and very prone to accidentally encoding non-encoded strings...
<?php
$enc = urlencode("😁");
print $enc; // %F0%9F%98%81
$find = '/(%)([0-9a-fA-F][0-9a-fA-F])/i';
$replacement = '&#x$2;';
print preg_replace($find,$replacement,$enc);
?>
Result: &#xF0;&#x9F;&#x98;&#x81;
Is there a better approach?
What is this encoding known as, and how do I arrive at it (via PHP)?
Many thanks!
Edit: Turns out this approach is unsuitable after all. urlencode converts all the spaces into + characters. There must be a correct approach to arrive at this format?

&#xF0;&#x9F;&#x98;&#x81; is a sequence of HTML entities (numeric character references); it represents the four bytes F0 9F 98 81, which is the UTF-8 encoding of that emoji. I suspect it is HTML, not PHP, that you are trying to appease?
http://unicode.scarfboy.com/?s=%F0%9F%98%81 -- go part way down the page to "string stuff" to see how to encode it for HTML, utf8, python, javascript, etc.
One way in PHP is:
echo bin2hex('😁'); // f09f9881
Then break it into groups of 2 hex digits.
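
The "break it into groups" step can be sketched without urlencode(), which avoids the problem of spaces turning into +. This is only an illustration: the function name bytes_to_hex_entities() is hypothetical, and it assumes that only non-ASCII bytes should be rewritten.

```php
<?php
// Hypothetical helper: turn every non-ASCII byte into a "&#xHH;" token and
// leave ASCII (including spaces) untouched.
function bytes_to_hex_entities(string $s): string
{
    return preg_replace_callback(
        '/[\x80-\xFF]/', // no /u flag, so the pattern matches single bytes
        fn(array $m): string => sprintf('&#x%02X;', ord($m[0])),
        $s
    );
}

echo bin2hex("\u{1F601}"), "\n";                     // f09f9881
echo bytes_to_hex_entities("hi \u{1F601}"), "\n";    // hi &#xF0;&#x9F;&#x98;&#x81;
```

Note that a browser reads &#xF0; as the character ð, not as a raw byte, so this per-byte format only makes sense if the consuming code expects it.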

PHP string array UTF-8 encoding fails

Everything is set to UTF-8 (file encoding, MySQL [however I don't use it], Apache, meta, mbstring etc...) but check this out:
$s="áéőúöüóűí";
echo $s; //works perfectly
echo $s[0]; // doesn't work. Prints out a single '?'.
I have tried almost everything. Any ideas? Thanks in advance!

It is absolutely correct behavior.
If you want the first letter of a multi-byte string, rather than the first byte of a binary string, you have to use mb_substr():
mb_internal_encoding("UTF-8");
echo mb_substr($s,0,1);
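
The difference between bytes and characters can be made visible with a short sketch like this one, assuming the mbstring extension is loaded. The string is the one from the question, written with escapes so the example does not depend on the file encoding.

```php
<?php
// "áéőúöüóűí" as explicit code points; each one takes two bytes in UTF-8.
$s = "\u{E1}\u{E9}\u{151}\u{FA}\u{F6}\u{FC}\u{F3}\u{171}\u{ED}";

echo strlen($s), "\n";             // 18 (bytes)
echo mb_strlen($s, 'UTF-8'), "\n"; // 9  (characters)

// $s[0] is only the first byte of "á", which is not valid UTF-8 on its own.
echo bin2hex($s[0]), "\n";                        // c3
echo bin2hex(mb_substr($s, 0, 1, 'UTF-8')), "\n"; // c3a1, the complete "á"
```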

You should use mb_* functions for multibyte strings. mb_substr() in your case.

And if you define $s[0]="á", does it work? I believe that, when encoded in UTF-8, each of those special characters is stored as two bytes.
If you display UTF-8 text as ANSI, it is rendered like this:
Ã¡Ã©oÃºÃ¶Ã¼Ã³uÃ
You can see that á becomes Ã¡.
So rendering the first char ($s[0]) would only display the "Ã", which is an incomplete character.

You have to make some changes in the database. Go to the table structure,
where you will find a "Collation" column.
For the column you want to change, click Edit in the right-side menu.
The default collation is 'latin1_general_ci'; change it to 'utf8_general_ci'.

PHP strlen and mb_strlen not working as expected

PHP functions strlen() and mb_strlen() both are returning the wrong number of characters when I run them on a string.
Here is a piece of the code I'm using...
$foo = mb_strlen($itemDetails['ITEMDESC'], 'UTF-8');
echo $foo;
It is telling me this string - "4Â½" Straight Iris Scissors" is 45 characters long. It's 27.
It also tells me that this string - "Infant Heel Warmer, No Adhesive Attachment Pad, 100/cs" is 54, which is correct.
I assume it's some issue with character encoding; everything should be UTF-8, I think. I've tried feeding mb_strlen() several different character encodings, and they all return this oddball count for the string that has those non-standard characters.
I've no idea why this is happening.

Double-check whether your text really is UTF-8 or not. That "Â" character makes it look like a classic character encoding problem to me. You should check the entire path from the origin of the text through the point in your code that you quoted above, because there are a lot of places where the encodings can get munged.
Did the text originate from an HTML form? Ensure your <form> element includes the accept-charset="UTF-8" attribute.
Did the text get stored in a database along the way? Make sure the database stores and returns the data in UTF-8. This means checking the server's global defaults, the defaults for the database or schema, and the table itself.
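
To check what the string actually contains at a given point in the path, it can help to look at the raw bytes. The sketch below is a hypothetical helper (the name describe() is made up) and assumes the mbstring extension. It reproduces the "Â" symptom by running a UTF-8 string through an ISO-8859-1 to UTF-8 conversion a second time; it does not attempt to explain the exact count of 45 from the question.

```php
<?php
// Hypothetical diagnostic helper: summarize what a string really contains.
function describe(string $s): array
{
    return [
        'valid_utf8' => mb_check_encoding($s, 'UTF-8'),
        'bytes'      => strlen($s),
        'chars'      => mb_strlen($s, 'UTF-8'),
        'hex'        => bin2hex($s),
    ];
}

$clean  = "4\u{00BD}";                                          // "4½"
$double = mb_convert_encoding($clean, 'UTF-8', 'ISO-8859-1');   // "4Â½"

print_r(describe($clean));  // bytes 3, chars 2, hex 34c2bd
print_r(describe($double)); // bytes 5, chars 3, hex 34c382c2bd
```

If the hex dump shows c382c2bd where c2bd is expected, the text was converted to UTF-8 one time too many somewhere upstream.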

It is very likely that your input is encoded in UTF-16.
You may convert it to UTF-8:
$foo = mb_strlen(mb_convert_encoding($itemDetails['ITEMDESC'], "UTF-8", "UTF-16"));
or, if you use mb_strlen() directly, be sure to pass the proper encoding as the second parameter:
$foo = mb_strlen($itemDetails['ITEMDESC'], "UTF-16");
Without the correct encoding, mb_strlen() will always return wrong results. It's easy to get into trouble when you're dealing with UTF-8/16/32 encoded strings. mb_detect_encoding() will not solve this problem.
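
As a small illustration of why the encoding argument matters, here is a sketch that builds a UTF-16 string on purpose and measures it. It assumes the mbstring extension and uses little-endian UTF-16 explicitly; whether the data in the question is really UTF-16 is not known.

```php
<?php
$utf8  = "4\u{00BD}";                                       // "4½" in UTF-8
$utf16 = mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');   // same text, UTF-16LE

echo bin2hex($utf16), "\n";               // 3400bd00
echo strlen($utf16), "\n";                // 4 (bytes)
echo mb_strlen($utf16, 'UTF-16LE'), "\n"; // 2 (characters)

// Converting back to UTF-8 first gives the same character count.
$back = mb_convert_encoding($utf16, 'UTF-8', 'UTF-16LE');
echo mb_strlen($back, 'UTF-8'), "\n";     // 2
```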

Convert foreign characters with accents

I'm trying to compare some text to the text in a database. In the database, any text with an accent is encoded as in HTML (i.e. &eacute;). When I compare the database text to my string, it doesn't match, because my string just contains é. When I use the PHP function htmlentities to encode the string first, the é turns into Ã©. Weird? Using htmlspecialchars doesn't encode the é at all.
How would you suggest I compare é to &eacute;, as well as all the other accented characters?

You need to send in the correct charset to htmlentities. It looks like you're using UTF-8, but the default is ISO-8859-1. Change it like this:
$encoded = htmlentities($text, ENT_COMPAT, 'UTF-8');
Another solution is to convert the text to ISO-8859-1 before encoding, but that may destroy information (ISO-8859-1 does not contain nearly as many characters as UTF-8). If you want to try that instead, do like this:
$encoded = htmlentities(utf8_decode($text));
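
The comparison itself could then look like the sketch below. The sample values are made up for illustration: $fromDb stands for an entity-encoded value read from the database and $input for a UTF-8 string from the application.

```php
<?php
$fromDb = 'caf&eacute;';  // hypothetical entity-encoded value from the database
$input  = "caf\u{00E9}";  // "café" as UTF-8

// Option 1: encode the input the same way the database text was encoded.
var_dump(htmlentities($input, ENT_COMPAT, 'UTF-8') === $fromDb);       // bool(true)

// Option 2: decode the database text and compare plain UTF-8 strings.
var_dump(html_entity_decode($fromDb, ENT_QUOTES, 'UTF-8') === $input); // bool(true)
```

Decoding the stored value is often the easier direction, because it does not depend on the input being encoded with exactly the same flags as the stored text.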

I'm working on a French site, and I had the same problem. This is the function that I use.
function convert_accent($string)
{
return htmlspecialchars_decode(htmlentities(utf8_decode($string)));
}
What it does: it converts your string from UTF-8 to ISO-8859-1, then converts everything to HTML entities, even tags. But we want the tags back to normal, so htmlspecialchars_decode converts them back. In the end you get a string with converted accents and untouched tags.
You can pass your email content through this function before sending it to the recipient.
Another issue you might face is that, sometimes with this function, the content from the database is converted to ?. In this case you should do this before running your query:
mysql_query("SET NAMES `utf8`");
But you might not need to do it; it depends on the encoding of your table. I hope it helps.

The comparison depends on the charset and the collation you selected when you created the database or the tables. If you are saving strings with a lot of accents, as in Spanish, I suggest you use the utf8 charset and the collation that best fits the language (English, French or whatever) you're using.
The best thing about using the correct charset in the database is that you can save the string in a natural way, e.g. I can store my name as is, "Mario Juárez", with no need for weird conversions.

Ran into similar issues recently. Followed Emil's answer and it worked fine locally but not on our dev/stage environments. I ended up using this and it worked all around:
$title = html_entity_decode(utf8_decode($item));
Thanks for leading me in the right direction!

Read ansi file and convert to UTF-8 string

Is there any way to do that with PHP?
The data to be inserted looks fine when I print it out.
But when I insert it in the database the field becomes empty.

$tmp = iconv('YOUR CURRENT CHARSET', 'UTF-8', $string);
or
$tmp = utf8_encode($string);
The strange thing is that you end up with an empty string in your DB. I could understand ending up with some garbage in your DB, but nothing at all (an empty string) is strange.
I just typed this in my console:
iconv -l | grep -i ansi
It showed me:
ANSI_X3.4-1968
ANSI_X3.4-1986
ANSI_X3.4
ANSI_X3.110-1983
ANSI_X3.110
MS-ANSI
These are possible values for YOUR CURRENT CHARSET
As pointed out before, when your input string is already valid UTF-8, you don't need to convert anything.
Change UTF-8 to UTF-8//TRANSLIT when you don't want to omit characters but replace them with a look-alike (when they cannot be represented in the target charset).

"ANSI" is not really a charset. It's a short way of saying "whatever charset is the default in the computer that creates the data". So you have a double task:
1. Find out what charset the data is using.
2. Use an appropriate function to convert it into UTF-8.
For #2, I'm normally happy with iconv() but utf8_encode() can also do the job if source data happens to use ISO-8859-1.
Update
It looks like you don't know what charset your data is using. In some cases, you can figure it out if you know the country and language of the user (e.g., Spain/Spanish) through the default encoding used by Microsoft Windows in such territory.
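
Putting the two steps together, a conversion routine might look like the sketch below. It is only a sketch under stated assumptions: the function name to_utf8() is hypothetical, the default source charset Windows-1252 is a guess that fits many Western European Windows machines, and the "already valid UTF-8" check is a heuristic rather than real charset detection. It requires the iconv and mbstring extensions.

```php
<?php
// Hypothetical helper: convert "ANSI" data to UTF-8, failing loudly instead of
// silently passing an empty or false value on to the database.
function to_utf8(string $raw, string $sourceCharset = 'Windows-1252'): string
{
    // Heuristic: input that is already valid UTF-8 is returned unchanged.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }

    // iconv() returns false when the conversion fails; @ suppresses its notice.
    $converted = @iconv($sourceCharset, 'UTF-8', $raw);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from {$sourceCharset} to UTF-8");
    }

    return $converted;
}

// "café" with é stored as the single byte 0xE9.
echo bin2hex(to_utf8("caf\xE9", 'ISO-8859-1')), "\n"; // 636166c3a9
```

Checking the return value before the insert makes it clear whether an empty field comes from the conversion or from the database layer.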

Be careful: iconv() can return false if the conversion fails.
I am also having a somewhat similar problem: some characters from the Chinese alphabet are mistaken for \n if the file is encoded in UNICODE, but not if it is UTF-8.
To get back to your problem, make sure the encoding of your file is the same as that of your database. Also, using utf8_encode() on text that is already UTF-8 can have unpleasant results. Try using mb_detect_encoding() to see the encoding of the file, but unfortunately this doesn't always work. There is no easy fix for character encoding from what I can see :(
