PHP and character encoding problem with Â character - php

I'm having a problem where PHP (5.2) cannot find the character 'Â' in a string, though it is clearly there.
I realize the underlying problem has to do with character encoding, but unfortunately I have no control over the source content. I receive it as UTF-8, with those characters already in the string.
I would simply like to remove it from the string, but strpos(), str_replace(), preg_replace(), trim(), etc. cannot correctly identify it.
My string is this:
"Â Â Â A lot of couples throughout the World "
If I do this:
$string = str_replace('Â','',$string);
I get this:
"� � � A lot of couples throughout the World"
I even tried utf8_encode() and utf8_decode() before the str_replace, with no luck.
What's the solution? I've been throwing everything I can find at it...

$string = str_replace('Â','',$string);
How is this 'Â' encoded? If your script file is saved as iso-8859-1 the string 'Â' is encoded as the one byte sequence xC2 while the (/one) utf-8 representation is xC3 x82. php's str_replace() works on the byte level, i.e. it only "knows" single-byte characters.
see http://docs.php.net/intro.mbstring
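One way to sidestep the script-file-encoding question is to write the search bytes as escape sequences. The following is a minimal sketch, assuming the incoming data really is UTF-8 (so 'Â' arrives as the bytes C3 82); the sample string is the one from the question and the variable names are only illustrative.

```php
<?php
// Assumption: the incoming data is UTF-8, so 'Â' (U+00C2) arrives as the bytes C3 82.
$string = "\xC3\x82 \xC3\x82 \xC3\x82 A lot of couples throughout the World ";

// Inspect the raw bytes first to confirm what is really in the string.
$firstBytes = bin2hex(substr($string, 0, 3));
echo $firstBytes, "\n";

// Byte escapes make the search string independent of the script file's encoding.
$clean = str_replace("\xC3\x82", '', $string);
echo $clean, "\n";
```

If the hex dump shows c2a0 instead, the string contains UTF-8 non-breaking spaces rather than 'Â', and the search string would be "\xC2\xA0".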

I use this:
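function replaceSpecial($str){
    // Split the string into single bytes (not characters).
    $chunked = str_split($str,1);
    $str = "";
    foreach($chunked as $chunk){
        $num = ord($chunk);
        // Keep only bytes 32..123 (space through '{'). This drops control
        // characters and every non-ASCII byte, and also '|', '}' and '~'.
        if ($num >= 32 && $num <= 123){
            $str.=$chunk;
        }
    }
    return $str;
}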

From the PHP Manual Comment Page:
http://www.php.net/manual/en/function.preg-replace.php#96847
And from StackOverflow:
Remove accents without using iconv

Related

Problems with special chars encoding with an access mdb database using php

My boss is forcing me to use an access mdb database (yes, I'm serious) in a php server.
I can connect it and retrieve data from it, but as you could imagine, I have problems with encodings because I want to work using utf8.
The thing is that now I have two "solutions" to translate Windows-1252 to UTF-8
This is the first way:
mb_convert_encoding($string, "UTF-8", "Windows-1252").
It works, but the problem is that special chars are not properly converted, for example char º is converted to \u00ba and char Ó is converted to \u00d3.
My second way is doing this:
mb_convert_encoding(mb_convert_encoding($string, "UTF-8", "Windows-1252"), "HTML-ENTITIES", "UTF-8")
It works too, but the same thing happens: special chars are not correctly converted. Char º is converted to &ordm;
Does anybody know how to properly change the encoding, including special chars?
Or does anybody know how to convert from &ordm; and \u00ba to something readable?
I did a simple test to convert code points to letters:
<?php
function codepoint_decode($str) {
return json_decode(sprintf('"%s"', $str));
}
$string_with_codepoint = "Ahed \u00d3\u00ba\u00d3";
// $string_with_codepoint = mb_convert_encoding($string, "UTF-8", "Windows-1252");
$output = codepoint_decode($string_with_codepoint);
echo $output; // Ahed ÓºÓ
Credit goes to this answer.
I finally found the solution.
I had the solution from the beginning but I was doing my tests wrong.
My bad.
The right way to do it for me is mb_convert_encoding($string, "UTF-8", "Windows-1252")
But i was checking the result like this:
$stringUTF8 = mb_convert_encoding($string, "UTF-8", "Windows-1252");
echo json_encode($stringUTF8);
That's why it was returning Unicode escapes like \u20ac. If I had done this instead:
$stringUTF8 = mb_convert_encoding($string, "UTF-8", "Windows-1252");
echo $stringUTF8;
I would have seen the solution from the beginning. It was json_encode() that was turning special chars into Unicode escape sequences.
Thanks everybody for your help!!
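To make the difference visible, here is a small sketch of the same check done three ways. It assumes the mbstring extension is loaded; the byte "\xBA" stands in for a value read from the database, and JSON_UNESCAPED_UNICODE (available since PHP 5.4) is an addition not mentioned in the thread.

```php
<?php
// Assumption: "\xBA" stands in for a Windows-1252 byte read from the database (º).
$raw = "\xBA";

// The conversion itself is correct: º becomes the two UTF-8 bytes C2 BA.
$utf8 = mb_convert_encoding($raw, "UTF-8", "Windows-1252");

// json_encode() escapes non-ASCII characters by default...
$escaped = json_encode($utf8);

// ...unless it is asked to leave them as-is.
$unescaped = json_encode($utf8, JSON_UNESCAPED_UNICODE);

echo $utf8, "\n", $escaped, "\n", $unescaped, "\n";
```

The escaped form is still valid JSON and decodes back to the same character, so it only matters when the JSON text itself is meant to be read by a person.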

how to get unicode character from a unicode string in php

I want to get a single Unicode character from a Unicode string.
for example:-
$str = "पर्वत निर्माणों में कोनसा संचलन कार्य करता है";
echo $str[0];
output is:- �
But I want to get the char 'प' at index 0 of the string.
Please help me get the char 'प' instead of �.
As @deceze writes, you need to use mb_substr in order to get a character, instead of just a byte. In addition, you need to set the internal encoding with mb_internal_encoding. Assuming that the encoding of your .php file is UTF-8, the following should work:
mb_internal_encoding('utf-8');
$str = "पर्वत निर्माणों में कोनसा संचलन कार्य करता है";
echo mb_substr($str, 0, 1);
PHP's default $str[x] notation operates on bytes, so you're just getting the first byte of a multibyte character. To extract whole characters in an encoding-aware way, you need to use mb_substr.
Also see What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
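The byte-versus-character difference can be seen directly with a short sketch like the one below. It assumes the script file is saved as UTF-8 and that mbstring is available; the sample is just the first word of the question's string, and preg_split() with the /u modifier is one possible way (not the only one) to get an array of code points.

```php
<?php
// Assumption: this file is saved as UTF-8.
$str = "पर्वत";

$bytes = strlen($str);                  // counts bytes
$codePoints = mb_strlen($str, 'UTF-8'); // counts Unicode code points

// First character, passing the encoding explicitly instead of relying on
// mb_internal_encoding().
$first = mb_substr($str, 0, 1, 'UTF-8');

// Split into code points using a UTF-8 aware regex.
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

echo $bytes, " bytes, ", $codePoints, " code points, first: ", $first, "\n";
```

Note that this splits on code points, so combining marks such as the virama in "र्" come out as separate array entries.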

PHP convert characters applicable for title tag

In my page I convert a lowercase string to uppercase and output it in the title tag. At first I had the issue that &NBSP; is not accepted, so I had to preserve entities.
So I decoded them to Unicode, converted to uppercase, and then converted back with htmlentities:
echo htmlentities(strtoupper(html_entity_decode(ob_get_clean())));
Now I have the problem that I recognized related to a "right single quote". I'm getting this character as ’ in the title.
It seems that either of the two functions I'm using does not convert them correctly. Is there any better function that I can use or is there something especially for the title tag?
Edit: Here is a var_dump of the original data which I don't have influence to:
string(74) "Example example example » John Doe- Who’s That? "
Edit II: This is what my code above results in:
This would happen, if I would just use strtoupper:
Your problem is that strtoupper will destroy your UTF-8 entity-decoded input because it is not multibyte aware. In this instance, &rsquo; decodes to the hex-encoded UTF-8 sequence e2 80 99. But in strtoupper's single-byte world, the character with code \xe2 is â, which is converted to Â (\xc2) -- which makes your text an invalid UTF-8 sequence.
Simply use mb_strtoupper instead.
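A minimal sketch of that approach, assuming the mbstring extension is available and using a shortened, made-up title in place of the output buffer; passing the charset explicitly to each call is a precaution added here, not something the answer prescribes.

```php
<?php
// Hypothetical sample title containing a named entity.
$title = 'Who&rsquo;s That?';

// Decode entities to real UTF-8 characters first.
$decoded = html_entity_decode($title, ENT_QUOTES, 'UTF-8');

// Uppercase with a multibyte-aware function so the bytes e2 80 99 stay intact.
$upper = mb_strtoupper($decoded, 'UTF-8');

// Escape only the HTML-special characters for output.
$safe = htmlspecialchars($upper, ENT_QUOTES, 'UTF-8');

echo $safe, "\n";
```

This assumes the page itself is served as UTF-8, so the quote can be emitted as a literal character rather than as an entity.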
It's ugly, but it might work for you (although I would certainly suggest Jon's solution):
After your strtoupper(), you can replace all uppercased HTML entities this way:
$entity_table = get_html_translation_table(HTML_ENTITIES);
$entity_table_uc = array_map('strtoupper', $entity_table);
$string = str_replace($entity_table_uc, $entity_table, $string);
This should remove the need for htmlentities() / html_entity_decode().

Charset problems with PHP

I have a problem with PHP code that transforms accented characters into unaccented characters. This code was working a year ago, but now I can't get it to work: the translation is not done correctly.
Here is the code:
<?php
echo accentdestroyer('azeméis');
/**
 * Transforms accented characters into unaccented characters.
 *
 * @param string $string
 */
function accentdestroyer($string) {
$string=strtr($string,
"()!$?: ,&+-/.ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ"
,
"-------------SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy");
return $string;
}
?>
I have tried saving the document as UTF-8, but that gives me something like this: "azemy�is"
Any clues on what I can do to get this working correctly?
Best Regards,
A better solution may be to transliterate those characters automatically using iconv().
As for the reason your function doesn't work, it may have something to do with the fact that echo strlen('Š'); outputs 2. The documentation explicitly refers to single byte characters.
Also,
$a = 'Š';
var_dump(strtr($a, 'Š', '!')); // string(2) "!�"
So the first byte has been matched, but the leftover second byte is not a valid UTF-8 sequence on its own.
Update
Here is a working example using iconv().
$str = 'ŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚ';
$str = iconv("utf-8", "us-ascii//TRANSLIT", $str);
var_dump($str); // string(37) "OEZsoezY?uAAAAAAAECEEEEIIII?NOOOOO?UU"
Some characters didn't quite translate, such as ¥ and Ø, but most did. You can append //IGNORE to the output character set to silently discard the ones which don't transliterate.
You could also drop all non-word characters using a Unicode regex with \pL.
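If iconv's transliteration is not available or behaves differently on a given platform, the array form of strtr() is another option, since it replaces whole substrings rather than single bytes. The sketch below assumes the script file is saved as UTF-8; strip_accents() is a hypothetical name and the map is deliberately tiny.

```php
<?php
// Hypothetical helper: maps multibyte characters via strtr()'s array form,
// which matches whole substrings and is therefore safe for UTF-8 input.
function strip_accents($string)
{
    // Illustrative subset only; extend the map as needed.
    $map = array(
        'é' => 'e', 'è' => 'e', 'á' => 'a', 'à' => 'a',
        'ç' => 'c', 'ñ' => 'n', 'Š' => 'S', 'œ' => 'oe',
    );
    return strtr($string, $map);
}

echo strip_accents('azeméis'), "\n";
```

Unlike the two-string form used in the question, the array form also allows one character to map to several, as with 'œ' here.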

How to convert HTML character NUMBERS to plain characters in PHP?

I have some HTML data (over which I have no control, can only read it) that contains a lot of Scandinavian characters (å, ä, ö, æ, ø, etc.). These "special" chars are stored as HTML character numbers (æ = &#230;). I need to convert these to the corresponding actual character in PHP (or JavaScript, but I guess PHP is better here...). It seems that html_entity_decode() only handles the "other" kind of entities, where æ = &aelig;. The only solution I've come up with so far is to make a conversion table and map each character number to a real character, but that's not really super smart...
So, any ideas? ;)
Cheers,
Christofer
&#NUMBER;
refers to the unicode value of that char.
so you could use some regex like:
/&#(\d+);/g
to grab the numbers. I don't know PHP, but I'm sure you can find out how to turn a number into its Unicode equivalent char.
Then simply replace your regex match with the char.
Edit: Actually it looks like you can use this:
mb_convert_encoding('&#230;', 'UTF-8', 'HTML-ENTITIES');
I think html_entity_decode() should work just fine. What happens when you try:
echo html_entity_decode('&#230;', ENT_COMPAT, 'UTF-8');
On the PHP manual page on html_entity_decode(), it gives the following code for decoding numeric entities in versions of PHP prior to 4.3.0:
$string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
$string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);
As someone noted in the comments, you should probably replace chr() with a code-point-aware helper such as unichr() (a user-defined function, not a PHP built-in) to deal with non-ASCII characters.
However, it looks like html_entity_decode() really should deal with numeric as well as named entities. Are you specifying an appropriate charset (e.g., UTF-8)?
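As a quick check of that claim, the sketch below feeds decimal, hexadecimal and named forms of the same character through html_entity_decode() with an explicit UTF-8 charset. The input string is made up for illustration.

```php
<?php
// Hypothetical input mixing decimal, hexadecimal and named entities for 'æ'.
$html = 'Scandinavian: &#230; &#xE6; &aelig;';

// With an explicit UTF-8 charset, all three forms decode to the same character.
$text = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

echo $text, "\n";
```

If the output still shows entities or question marks, the likely culprits are a missing or mismatched charset argument, or a page served in a different encoding than the one the string was decoded to.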
If you haven't got the luxury of having multibyte string functions installed, you can use something like this:
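<?php
// Decode decimal numeric entities (&#NNN;) without the mbstring extension.
// Note: utf8_encode(chr()) treats the number as an ISO-8859-1 byte, so this
// only handles code points 0-255.
function decode(array $list)
{
    foreach ($list as $key=>$value) {
        return utf8_encode(chr((int) $value));
    }
}
$string = 'Here is a special char &#230;';
// An anonymous function replaces create_function(), which newer PHP versions no longer provide.
$list = preg_replace_callback('/(&#([0-9]+);)/', function ($matches) {
    return decode(array($matches[2]));
}, $string);
echo '<p>', $string, '</p>';
echo '<p>', $list, '</p>';
?>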
