Charset problems with PHP - php

I have a problem with some PHP code that converts accented characters into unaccented ones. It worked a year ago, but now I can't get it to work: the translation is not done correctly.
Here is the code:
<?php
echo accentdestroyer('azeméis');
/**
 * Transforms accented characters into unaccented characters.
 *
 * @param string $string
 */
function accentdestroyer($string) {
$string=strtr($string,
"()!$?: ,&+-/.ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ"
,
"-------------SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy");
return $string;
}
?>
I have tried saving the document as UTF-8, but that gives me something like this: "azemy�is"
Any clues on what I can do to get this working correctly?

A better solution may be to transliterate those characters automatically using iconv().
As for why your function doesn't work: in a UTF-8 file each accented character takes more than one byte (echo strlen('Š'); outputs 2), while the three-argument form of strtr() maps individual bytes. The documentation explicitly refers to single-byte characters.
Also,
$a = 'Š';
var_dump(strtr('Š', 'Š', '!')); // string(2) "!�"
So the first byte was matched and replaced, but the leftover second byte is not a valid UTF-8 sequence on its own.
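If you want to keep the lookup-table approach, the two-argument form of strtr() takes an array of string pairs and replaces whole substrings instead of single bytes, so multi-byte characters are handled as units. A minimal sketch, assuming PHP 7 or later and that both this file and the input are UTF-8; the function name removeAccents and the deliberately short map are illustrative only:

```php
<?php
// Hypothetical helper: maps accented UTF-8 characters to ASCII look-alikes.
// Assumes this file is saved as UTF-8 and that $string is UTF-8 too.
function removeAccents(string $string): string
{
    // Illustrative subset of the original table; extend as needed.
    $map = [
        'Š' => 'S', 'š' => 's', 'Ž' => 'Z', 'ž' => 'z',
        'À' => 'A', 'Á' => 'A', 'Ç' => 'C', 'É' => 'E',
        'à' => 'a', 'á' => 'a', 'ç' => 'c', 'é' => 'e',
        'í' => 'i', 'ñ' => 'n', 'ó' => 'o', 'ú' => 'u',
    ];

    // With an array argument, strtr() matches whole keys (longest first),
    // so each multi-byte character is replaced as a unit.
    return strtr($string, $map);
}

echo removeAccents('azeméis'), "\n"; // azemeis
```

The trade-off is that every character has to be listed by hand, which is why iconv() is usually less work.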
Update
Here is a working example using iconv().
$str = 'ŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚ';
$str = iconv("utf-8", "us-ascii//TRANSLIT", $str);
var_dump($str); // string(37) "OEZsoezY?uAAAAAAAECEEEEIIII?NOOOOO?UU"
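Some characters didn't transliterate and came out as ?, namely ¥, Ð and Ø, but most did. The exact result depends on the iconv implementation and the current locale. You can append //IGNORE to the output character set to silently discard the characters that can't be transliterated.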
You could also drop all non word characters too using a Unicode regex with \pL.
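The \pL idea could look roughly like this. It is only a sketch, assuming PHP 7 or later and valid UTF-8 input; the sample string and variable names are made up for illustration:

```php
<?php
// Keep only Unicode letters. The u modifier makes PCRE treat both the
// pattern and the subject as UTF-8, and \pL matches any letter.
$str = "S\u{00E3}o Paulo, 2024!"; // "São Paulo, 2024!"

// [^\pL]+ matches runs of anything that is not a letter.
$lettersOnly = preg_replace('/[^\pL]+/u', '', $str);

echo $lettersOnly, "\n"; // SãoPaulo
```

Note that this removes spaces and digits as well, so widen the character class (for example [^\pL\pN\s]) if you want to keep them.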

Related

Trouble decoding some special characters &#146; &#147; &#148;

I'm trying to decode some special characters in php and can't seem to find a way to do it.
$str = 'Thi&#146;s i"s a&#146;n e&#148;xa&#147;mple';
This just returns some dots.
$str = preg_replace_callback("/(&#[0-9]+;)/", function($m) {
return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES");
}, $str);
Some other tests just return the same string.
$str = html_entity_decode($str, ENT_QUOTES, 'UTF-8');
$str = htmlspecialchars_decode($str, ENT_QUOTES);
Anyway, I've been trying all sorts of combinations but really no idea how to convert this to UTF-8 characters.
What I'm expecting to see is this:
Thi’s i"s a’n e”xa“mple
And actually, if I take the expected text directly and encode it with htmlentities, I get different entities to begin with:
Thi&rsquo;s i&quot;s a&rsquo;n e&rdquo;xa&ldquo;mple
Unfortunately I don't have control of the source and I'm stuck dealing with those characters.
Are they non standard, do I need to replace them manually with my own lookup table?
EDIT
Looking at this table here: https://brajeshwar.github.io/entities/
I see the characters I'm looking after are not listed. When I test a few characters from this table they decode just fine. I guess the list in php is incomplete by default?
If you check the Unicode chart for the characters you're referring to (http://www.unicode.org/charts/PDF/U0080.pdf), you will see that the code points referenced by the entities in your string have no representable glyphs: they are control characters.
That means they are expected to be rendered as empty squares (or dots, depending on how your renderer treats them).
If it works for someone somewhere, that is non-standard behaviour, and you must not rely on it.
Apparently the text was originally encoded as cp1250, so you should either treat it accordingly or re-encode the entities manually:
$str = 'Thi&#146;s i"s a&#146;n e&#148;xa&#147;mple';
$str = preg_replace_callback("/&#([0-9]+);/u", function($m) {
    // chr() yields the single cp1250 byte; iconv() converts it to UTF-8
    return iconv('cp1250', 'utf-8', chr((int) $m[1]));
}, $str);
echo $str;
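A slightly more defensive variant is sketched below. It reinterprets only the entities in the 128 to 159 range (the control-character block mentioned above) as cp1250 bytes and leaves everything else to html_entity_decode(). The function name decodeLegacyEntities is hypothetical, and the cp1250 assumption is taken from the answer rather than verified against the real data source:

```php
<?php
// Hypothetical helper: decodes numeric entities that were really written
// as cp1250 byte values, then decodes the remaining standard entities.
function decodeLegacyEntities(string $html): string
{
    $fixed = preg_replace_callback('/&#([0-9]+);/', function (array $m): string {
        $code = (int) $m[1];

        // Only 128..159 are ambiguous; leave all other entities untouched.
        if ($code < 128 || $code > 159) {
            return $m[0];
        }

        // Some bytes in this range are undefined in cp1250; iconv() then
        // fails, and the entity is kept as it was.
        $char = @iconv('CP1250', 'UTF-8', chr($code));
        return $char === false ? $m[0] : $char;
    }, $html);

    return html_entity_decode($fixed, ENT_QUOTES, 'UTF-8');
}

echo decodeLegacyEntities('Thi&#146;s i&#34;s a&#146;n e&#148;xa&#147;mple'), "\n";
// Thi’s i"s a’n e”xa“mple
```

Restricting the range avoids passing values above 255 to chr(), which would silently wrap around.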

Language specific characters to regular English chars

I am not sure where to start with this, but here is what I want to do:
Users have a text field where they need to enter a few words. The problem is that people from different countries will use the page, and they will enter "weird" Latin characters like: ž, Ä, Ü, đ, Ť, Á etc.
Before saving to the database I want to convert them to z, a, u, d, t, a... Is there a way to do this without something like the following (I think there are too many characters to cover):
$string = str_replace(array('Č','Ä','Á','đ'), array('C','A','A','d'), $string);
And yes, I know that I can store UTF-8 in the database, but the problem is that this string will later be sent by SMS. Because of the nature of the SMS protocol, these "special" characters take more space in a message than regular English alphabet characters (I am limited to 120 characters, and if I put "Ä" in a message, it takes up more than one character's worth of space).
First of all, I would still store the original characters in utf-8 in the database. You can always "translate" them to ASCII characters upon retrieval. This is good because if, say, in the future SMS adds UTF-8 support (or you want to use user data for something else), you'll have the original characters intact.
That said, you can use iconv to do this:
iconv('utf-8', 'ascii//TRANSLIT', $input); //where $input contains "weird" characters
See this thread for more info, including some caveats of this approach: PHP: Replace umlauts with closest 7-bit ASCII equivalent in an UTF-8 string
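Put together, "store UTF-8, convert on retrieval" might look like the sketch below. The helper name toAscii and the fallback strategy are assumptions for illustration; the exact //TRANSLIT output differs between iconv implementations and locales, so no particular result is shown:

```php
<?php
// Hypothetical helper: best-effort ASCII version of a UTF-8 string.
function toAscii(string $input): string
{
    // @ suppresses the notice/warning iconv() raises when it cannot
    // convert the input; in that case it returns false.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $input);

    if ($ascii === false) {
        // Fallback: drop every byte outside the 7-bit ASCII range.
        $ascii = preg_replace('/[^\x00-\x7F]/', '', $input);
    }

    return $ascii;
}

$stored = "\u{017E}\u{00C4}\u{00DC}"; // "žÄÜ", kept as UTF-8 in the database
$sms    = toAscii($stored);           // converted only when sending the SMS
var_dump($sms);                       // result depends on the iconv library
```

Whatever the implementation returns, the result contains only 7-bit ASCII bytes, which is what matters for the SMS length limit.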
Close, but not perfect: with some iconv implementations the accents are converted into separate ASCII characters.
http://www.php.net/manual/en/function.iconv.php
echo iconv("ISO-8859-1", "ASCII//TRANSLIT", 'Martín');
//output: Mart'in
echo iconv("ISO-8859-1", "ASCII//TRANSLIT", "ÆÇÈÊÈÒÐÑÕ");
//output: AEC`E^E`E`OD~N~O
Using UTF-8 as the input charset instead:
echo iconv('utf-8', 'ascii//TRANSLIT', 'Martín');
//output: Mart
If the accented character is not valid UTF-8, the string is simply cut off from that character onwards.
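One way to avoid the truncation is to make sure the input really is UTF-8 before transliterating it. The sketch below assumes that anything which is not valid UTF-8 is ISO-8859-1; that is a guess you would need to adapt to your own data, and the helper name ensureUtf8 is hypothetical:

```php
<?php
// Hypothetical helper: returns $input as UTF-8. Anything that is not
// already valid UTF-8 is ASSUMED to be ISO-8859-1.
function ensureUtf8(string $input): string
{
    // preg_match() with the u modifier fails on invalid UTF-8 subjects,
    // which makes an empty pattern a cheap validity check.
    if (preg_match('//u', $input) === 1) {
        return $input;
    }

    return iconv('ISO-8859-1', 'UTF-8', $input);
}

$latin1 = "Mart\xEDn";        // "Martín" encoded as ISO-8859-1
$utf8   = ensureUtf8($latin1);

echo bin2hex($utf8), "\n";    // 4d617274c3ad6e
```

If the mbstring extension is available, mb_check_encoding($input, 'UTF-8') is an alternative to the preg_match() check.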

Replace unicode character

I am trying to replace a certain character in a string with another. They are quite obscure Latin characters. I want to replace the character U+0259 with U+04D9, so I tried this:
str_replace("\x02\x59","\x04\xd9",$string);
This didn't work. How do I go about this?
EDIT: Additional information.
Thanks bobince, that has done the trick. Although, I want to replace the uppercase schwa also and it is not working for some reason. I calculated U+018F (Ə) as UTF-8 0xC68F and this is to be replaced with U+04D8 (0xD398):
$string = str_replace("\xC9\x99", "\xD3\x99", $_POST['string_with_schwa']); //lc 259->4d9
$string = str_replace( "\xC6\8F", "\xD3\x98" , $string); //uc 18f->4d8
I am copying the 'Ə' into a textbox and posting it. The first str_replace works fine on the lowercase, but does not detect the uppercase in the second str_replace, strange. It remains as U+018F. Guess I could run the string through strtolower but this should work though.
U+0259 Latin Small Letter Schwa is only encoded as the byte sequence 0x02,0x59 in the UTF-16BE encoding. It is very unlikely you will be working with byte strings in the UTF-16BE encoding as it's not an ASCII-compatible encoding and almost no-one uses it.
The encoding you want to be working with (the only ASCII-superset encoding to support both Latin Schwa and Cyrillic Schwa, as it supports all Unicode characters) is UTF-8. Ensure your input is in UTF-8 format (if it is coming from form data, serve the page containing the form as UTF-8). Then, in UTF-8, the character U+0259 is represented using the byte sequence 0xC9,0x99.
str_replace("\xC9\x99", "\xD3\x99", $string);
If you make sure to save your .php file as UTF-8-no-BOM in the text editor, you can skip the escaping and just directly say:
str_replace('ə', 'ә', $string);
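Rather than calculating the UTF-8 bytes by hand, PHP 7 and later can expand a code point directly with the "\u{...}" escape. The sketch below uses it to check the four byte sequences from this thread and to replace both cases in one call; the function name latinSchwaToCyrillic is made up for illustration. Incidentally, the second str_replace() in the question's edit uses "\xC6\8F", which lacks the x in the second escape, so PHP keeps \8F as literal characters and the search string never matches:

```php
<?php
// "\u{...}" (PHP 7+) expands a code point to its UTF-8 bytes.
echo bin2hex("\u{0259}"), "\n"; // c999 (ə, Latin small letter schwa)
echo bin2hex("\u{04D9}"), "\n"; // d399 (ә, Cyrillic small letter schwa)
echo bin2hex("\u{018F}"), "\n"; // c68f (Ə, Latin capital letter schwa)
echo bin2hex("\u{04D8}"), "\n"; // d398 (Ә, Cyrillic capital letter schwa)

// Hypothetical helper covering lowercase and uppercase in one call.
// Assumes $string is UTF-8.
function latinSchwaToCyrillic(string $string): string
{
    return str_replace(
        ["\u{0259}", "\u{018F}"],
        ["\u{04D9}", "\u{04D8}"],
        $string
    );
}
```

Because the escapes are plain ASCII in the source file, this works regardless of the encoding the .php file is saved in.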
A couple of possible suggestions. Firstly, remember that you need to assign the new value to $string, i.e.:
$string = str_replace("\x02\x59","\x04\xd9",$string);
Secondly, verify that your byte stream occurs in the $string. I mention this because your hex string begins with a low-byte, so you'll need to make sure your $string is not UTF8 encoded.

How to filter a Font Character in php

I have an arial character giving me a headache. U+02DD turns into a question mark after I turn its document into a phpquery object. What is an efficient method for removing the character in php by referring to it as 'U+02DD'?
You can use iconv() to convert character sets and strip invalid characters.
<?php
/* This will convert ISO-8859-1 input to UTF-8 output and
* strip invalid characters
*/
$output = iconv("ISO-8859-1", "UTF-8//IGNORE", $input);
/* This will attempt to convert invalid characters to something
* that looks approximately correct.
*/
$output = iconv("ISO-8859-1", "UTF-8//TRANSLIT", $input);
?>
See the iconv() documentation at http://php.net/manual/en/function.iconv.php
Use preg_replace and do it like this:
$str = "your text with that character";
echo preg_replace('#\x{02DD}#u', "", $str); // the u modifier enables Unicode (UTF-8) matching
To refer to a specific Unicode code point in preg_replace, use the \x{abcd} syntax together with the u modifier. The second parameter is an empty string, so preg_replace replaces the character with nothing, effectively removing it.
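Since the question asks about referring to the character as 'U+02DD', the pattern can also be built from that notation. This is only a sketch: the function name removeCodePoint is hypothetical, and it assumes valid UTF-8 input and a valid, non-surrogate code point:

```php
<?php
// Hypothetical helper: removes every occurrence of one code point,
// given in "U+XXXX" notation, from a UTF-8 string.
function removeCodePoint(string $str, string $notation): string
{
    if (!preg_match('/^U\+([0-9A-Fa-f]{4,6})$/', $notation, $m)) {
        throw new InvalidArgumentException("Not a U+XXXX code point: $notation");
    }

    // Builds e.g. /\x{02DD}/u; the u modifier enables UTF-8 matching.
    $pattern = '/\x{' . $m[1] . '}/u';

    // preg_replace() returns null on error (for example invalid UTF-8);
    // in that case the input is returned unchanged.
    return preg_replace($pattern, '', $str) ?? $str;
}

echo removeCodePoint("a\u{02DD}b", 'U+02DD'), "\n"; // ab
```

The pattern is written in single quotes so that PHP passes \x{...} through to PCRE untouched.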
[EDIT] Another way:
Did you try running htmlentities() on it? Its numeric HTML entity is &#733;, so doing that, or replacing the character with &#733;, may solve your issue too. Like this:
echo preg_replace('#\x{02DD}#u', '&#733;', $str);

PHP and character encoding problem with Â character

I'm having a problem where PHP (5.2) cannot find the character 'Â' in a string, though it is clearly there.
I realize the underlying problem has to do with character encoding, but unfortunately I have no control over the source content. I receive it as UTF-8, with those characters already in the string.
I would simply like to remove it from the string, but strpos(), str_replace(), preg_replace(), trim(), etc. cannot correctly identify it.
My string is this:
"Â Â Â A lot of couples throughout the World "
If I do this:
$string = str_replace('Â','',$string);
I get this:
"� � � A lot of couples throughout the World"
I even tried utf8_encode() and utf8_decode() before the str_replace, with no luck.
What's the solution? I've been throwing everything I can find at it...
$string = str_replace('Â','',$string);
How is this 'Â' encoded? If your script file is saved as ISO-8859-1, the string 'Â' is the single byte xC2, while its UTF-8 representation is the two bytes xC3 x82. PHP's str_replace() works at the byte level, i.e. it knows nothing about multi-byte characters, so the needle and the haystack must use the same encoding.
see http://docs.php.net/intro.mbstring
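The mismatch can be made visible by writing the bytes out explicitly. The following is a small sketch with a made-up sample string, assuming the data contains the UTF-8 encoding of 'Â' while the script's literal is the ISO-8859-1 byte:

```php
<?php
// The same visible character has different bytes in different encodings.
$utf8   = "\xC3\x82"; // "Â" in UTF-8 (two bytes)
$latin1 = "\xC2";     // "Â" in ISO-8859-1 (one byte)

$string = $utf8 . ' A lot of couples'; // UTF-8 input data

// bin2hex() shows which bytes are really in the string.
echo bin2hex($string), "\n"; // starts with c38220

// The Latin-1 byte does not occur in the UTF-8 data at all.
var_dump(strpos($string, $latin1)); // bool(false)

// Searching for the full UTF-8 sequence removes the character cleanly.
$clean = str_replace($utf8, '', $string);
echo $clean, "\n"; // " A lot of couples"
```

Dumping the real input with bin2hex() first is a quick way to find out which byte sequence actually needs to be replaced.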
I use this:
function replaceSpecial($str){
    // Split the string into single bytes
    $chunked = str_split($str,1);
    $str = "";
    foreach($chunked as $chunk){
        $num = ord($chunk);
        // Keep only printable ASCII bytes from space (32) through '{' (123)
        if ($num >= 32 && $num <= 123){
            $str.=$chunk;
        }
    }
    return $str;
}
From the PHP Manual Comment Page:
http://www.php.net/manual/en/function.preg-replace.php#96847
And from StackOverflow:
Remove accents without using iconv
