PHP: Dealing with special characters with iconv

I still don't understand how iconv works.
For instance,
$string = "Löic & René";
$output = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string);
I get,
Notice: iconv() [function.iconv]:
Detected an illegal character in input
string in...
With $string = "Löic"; or $string = "René";
I get,
Notice: iconv() [function.iconv]: Detected an incomplete multibyte character in input string in.
I get nothing with $string = "&";
I need two different outputs, stored in two different columns of my database table:
1. Löic & René converted to Loic & Rene, for clean URL purposes.
2. Löic & René kept as they are, and only converted with htmlentities($string, ENT_QUOTES); when displaying them on my HTML page.
I tried some of the suggestions from php.net below, but they still don't work,
I had a situation where I needed some characters transliterated, but the others ignored (for weird diacritics like ayn or hamza). Adding //TRANSLIT//IGNORE seemed to do the trick for me. It transliterates everything that is able to be transliterated, but then throws out stuff that can't be.
So:
$string = "ʿABBĀSĀBĀD";
echo iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $string);
// output: [nothing, and you get a notice]
echo iconv('UTF-8', 'ISO-8859-1//IGNORE', $string);
// output: ABBSBD
echo iconv('UTF-8', 'ISO-8859-1//TRANSLIT//IGNORE', $string);
// output: ABBASABAD
// Yay! That's what I wanted!
and another,
Andries Seutens 07-Nov-2009 07:38
When doing transliteration, you have to make sure that your LC_CTYPE is properly set, otherwise the default POSIX will be used.
To transform "rené" into "rene" we could use the following code snippet:
setlocale(LC_CTYPE, 'nl_BE.utf8');
$string = 'rené';
$string = iconv('UTF-8', 'ASCII//TRANSLIT', $string);
echo $string; // outputs rene
How can I actually work them out?
Thanks.
EDIT:
This is the source file I use to test the code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" class="no-js">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<?php
$string = "Löic & René";
$output = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string);
?>
</html>

$clean = iconv('UTF-8', 'ASCII//TRANSLIT', utf8_encode($s));

And did you save your source file in UTF-8 encoding? If not (and I guess you didn't since that will produce the "incomplete multibyte character" error), then try that first.
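
For the clean-URL half of the question, here is a minimal sketch, assuming the input really is UTF-8 (an input that is not is what triggers the "incomplete multibyte character" notice). The function name slugify and the 'n-a' fallback are hypothetical. The exact transliteration of ö or é depends on the iconv implementation and the LC_CTYPE locale, so no output is shown for accented input.

```php
<?php
// Hypothetical helper: builds a URL slug from a UTF-8 string.
function slugify(string $text): string
{
    // preg_match with the /u modifier fails on invalid UTF-8, so this
    // catches a source file or input that is not actually UTF-8.
    if (preg_match('//u', $text) !== 1) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // Transliterate to ASCII. The @ suppresses the notice discussed above;
    // a failed conversion is handled explicitly instead.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $text);
    if ($ascii === false) {
        $ascii = '';
    }

    // Collapse every run of non-alphanumeric characters into one hyphen.
    $slug = preg_replace('/[^a-z0-9]+/', '-', strtolower($ascii));
    $slug = trim($slug, '-');

    return $slug !== '' ? $slug : 'n-a';
}

echo slugify('Hello World & Co'), "\n"; // hello-world-co
```

The original string can go into the second column unchanged and be escaped only when it is printed.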

Related

Decoding String from UTF-8 to Windows1256

I have the code below, which tries to convert a string from UTF-8 to CP1256. I want to decode the string to Arabic, and the page encoding is fixed to UTF-8.
<?php
$string = "ãÍãÏ Úæäí ãÍãæÏ Úáí";
$string = iconv("UTF-8//TRANSLIT//IGNORE", "Windows-1252//TRANSLIT//IGNORE", $string);
echo $string;
?>
So your Arabic text was encoded in Windows-1256 and then incorrectly decoded as Windows-1252.
If your source file is UTF-8 encoded, the answer is:
<?php
$string = "ãÍãÏ Úæäí ãÍãæÏ Úáí";
$string = iconv("UTF-8//TRANSLIT//IGNORE", "Windows-1252//TRANSLIT//IGNORE", $string);
# $string is now back to its 1256 encoding. Encode to UTF-8 for web page
$string = iconv("Windows-1256//TRANSLIT//IGNORE", "UTF-8//TRANSLIT//IGNORE", $string);
echo $string;
?>
If your source file is "windows-1252" encoded, then you must use:
<?php
$string = "ãÍãÏ Úæäí ãÍãæÏ Úáí";
# Interpret the windows-1252 string as if it were windows-1256. Encode to UTF-8 for web page
$string = iconv("Windows-1256//TRANSLIT//IGNORE", "UTF-8//TRANSLIT//IGNORE", $string);
echo $string;
?>
If your $string actually comes from a database or file, then you have to determine the encoding of the source before applying any conversion.
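
The UTF-8 case can be wrapped in a small function. This is only a sketch under the assumption described above (Windows-1256 bytes that were read as Windows-1252 and are now stored as UTF-8). The name repairArabicMojibake is hypothetical, and the charset names must be ones your iconv build knows.

```php
<?php
// Hypothetical helper: undoes "Windows-1256 bytes read as Windows-1252"
// for a string that is currently stored as UTF-8.
function repairArabicMojibake(string $garbled): ?string
{
    // Step 1: map each character back to the single byte it came from.
    $bytes = @iconv('UTF-8', 'Windows-1252', $garbled);
    if ($bytes === false) {
        return null; // the text was not produced by this kind of mix-up
    }

    // Step 2: read those bytes with the encoding they were written in.
    $fixed = @iconv('Windows-1256', 'UTF-8', $bytes);

    return $fixed === false ? null : $fixed;
}

// "ãÍãÏ" are the bytes E3 CD E3 CF, which spell the first word in Windows-1256.
var_dump(repairArabicMojibake('ãÍãÏ'));
```

Returning null instead of using //IGNORE keeps a failed repair visible rather than silently dropping characters.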
$strings = "ãÍãÏ Úæäí ãÍãæÏ Úáí";
setlocale(LC_CTYPE, 'nl_NL.UTF-8');
$convert = iconv('UTF-8', 'windows-1251//TRANSLIT//IGNORE', $strings);
var_dump($convert);

Show spanish characters in a form

I have a form, and in a textarea I want to display some text that has some Spanish characters encoded as HTML. The problem is that instead of the Spanish character it displays the HTML code. I'm using htmlentities to display it in the form. My code to display it is:
<?php echo htmlentities($string, ENT_QUOTES, "UTF-8") ?>
Any ideas, or should I just not use htmlentities in a form? Thanks!
EDIT
Let's say $string = '&aacute;'
When I just do <?php echo $string ;?> I get á
If I do <?php echo htmlentities($string, ENT_QUOTES, "UTF-8") ?> I get &aacute;
I'm so confused!
You can try explicitly adding content type at the top of your file as below
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
If it's already encoded as HTML, then you need to decode it. You can use html_entity_decode($string);
The string fetched from the database is already entity-encoded (&aacute;), so it must not be encoded a second time.
$string = '&aacute;'; // your string as fetched from database
echo html_entity_decode($string);// this will display á in the textarea
and before saving to the database you need to call
htmlentities($_POST['txtAreaName'], ENT_QUOTES, "UTF-8"); // returns `&aacute;`
If I understand you correctly, you need to use...
<meta charset="utf-8">
in your page header, and then...
<?php echo html_entity_decode($string, ENT_QUOTES); ?>
This will convert your HTML entities back to their proper characters
You might be looking for htmlspecialchars.
echo htmlspecialchars('<á>', ENT_COMPAT | ENT_HTML5, "UTF-8");
outputs &lt;á&gt;.
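
A minimal sketch of how the two suggestions fit together, assuming the page is served as UTF-8 and the text is stored raw (unescaped) and escaped only on output. The helper name escapeForHtml is hypothetical.

```php
<?php
// Hypothetical helper: escapes a raw UTF-8 value for use inside HTML,
// for example between <textarea> tags or in an attribute value.
function escapeForHtml(string $value): string
{
    // Only & < > " ' are converted; accented letters pass through untouched.
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

$stored = 'á <b>'; // raw text as fetched from storage
echo '<textarea>', escapeForHtml($stored), '</textarea>', "\n";
// <textarea>á &lt;b&gt;</textarea>

// Rows that already contain entities can be decoded once before editing:
echo html_entity_decode('&aacute;', ENT_QUOTES, 'UTF-8'), "\n"; // á
```

Escaping once, at output time, avoids the double encoding that makes &aacute; show up literally in the form.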

Write a PHP function which works for any language

I'm writing a function to clean text, which should work with or without UTF-8 characters.
I keep getting text like this.
Coventry Salary - �25,000 - �35,000
With this function the � is removed, but other stray characters are left behind.
I want to know if anyone wrote a function which cleans the text.
function convertHTMLSpecialChars ( $str='' )
{
$str = htmlspecialchars ( $str );
$str = mb_convert_encoding($str, 'UTF-8', mb_detect_encoding($str));
$str = htmlspecialchars($str, ENT_NOQUOTES, 'UTF-8');
return $str;
}
This line:
$str = mb_convert_encoding($str, 'UTF-8', mb_detect_encoding($str));
only tries to detect the character set of $str. If it decides that $str contains UTF-8 characters, it returns "UTF-8", so the call effectively becomes:
$str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
which doesn't help much.
In my opinion you should specify the character set of your string by hand,
for example ISO-8859-9 if it's Turkish, ISO-8859-7 if it's Greek, and so on.
Make sure the server outputs your page as UTF-8.
You can force it by using:
header ('Content-type: text/html; charset=utf-8');
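
Passing the source encoding explicitly could look like the sketch below. It assumes the � characters in the salary text were pound signs stored as ISO-8859-1, which is a guess about the data, and the function name toUtf8 is hypothetical.

```php
<?php
// Hypothetical helper: returns $text as UTF-8, given the encoding the
// caller knows (or assumes) the data was written in.
function toUtf8(string $text, string $sourceEncoding): string
{
    // Already valid UTF-8: converting again would double-encode it.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    return mb_convert_encoding($text, 'UTF-8', $sourceEncoding);
}

// "\xA3" is the pound sign in ISO-8859-1 (an assumption about the data).
$pound = "\xA3";
$raw = "Coventry Salary - {$pound}25,000 - {$pound}35,000";
echo toUtf8($raw, 'ISO-8859-1'), "\n"; // Coventry Salary - £25,000 - £35,000
```

The validity check is a heuristic: a short single-byte string can happen to be valid UTF-8, so knowing the real source encoding is still the reliable fix.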

Problem with function removing accents and other characters in PHP

I found a simple function to remove some undesired characters from a string.
function strClean($input){
$input = strtolower($input);
$b = array("á","é","í","ó","ú", "ñ", " "); //etc...
$c = array("a","e","i","o","u","n", "-"); //etc...
$input = str_replace($b, $c, $input);
return $input;
}
When I use it on accents or other characters, like this word 'á é ñ í' it prints out those question marks or weird characters, like:
output http://img217.imageshack.us/img217/6794/59472278.jpg
Note: I'm using strclean.php (which contains this function) and index.php, both in UTF-8. index.php looks as follows:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title></title>
</head>
<body>
<?php
include('strclean.php');
echo 'óóóáà';
echo strClean('óóóáà');
?>
</body>
</html>
What am I doing wrong?
Use
iconv('UTF-8', 'ASCII//TRANSLIT', $input);
You may want to try iconv.
Does a replacement happen at all, i.e. do you get the same weird characters when you print $input beforehand? If so, the character sets of your PHP source code file and the input do not match and you might need to use iconv() on the input before replacing.
edit: I took both of your files, uploaded them to my webserver and printing and cleaning works fine (see http://www.tag-am-meer.com/test1/). This is on PHP 4.4.9 and Firefox 3.0.6. More potential problems that come to my mind:
Does it work for you on Firefox? I remember vaguely that IE6 (and probably later versions as well) expect the charset in the HTML head section to be written in lowercase ("utf-8")
Does your editor include byte order marks (BOM) in the code files? Mine does not, maybe PHP chokes on those.
Can you look at the HTTP headers to see if there's something unusual going on, like a bad MIME type? The Tamper Data add-on for Firefox can help with this.
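The BOM question can be checked from PHP itself. A small sketch, with a hypothetical helper name, that detects and strips a leading UTF-8 byte order mark:

```php
<?php
// Hypothetical helper: removes a leading UTF-8 byte order mark, if any.
function stripUtf8Bom(string $contents): string
{
    $bom = "\xEF\xBB\xBF";

    return strncmp($contents, $bom, 3) === 0 ? substr($contents, 3) : $contents;
}

var_dump(stripUtf8Bom("\xEF\xBB\xBFabc") === 'abc'); // bool(true)
```

Running it over file_get_contents('strclean.php') and comparing lengths would show whether the editor added a BOM.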
I have tested your code, and the error is in the strtolower function, which works byte by byte and is not aware of multibyte UTF-8 characters.
Replace it with mb_strtolower, like below:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title></title>
</head>
<body>
<?php
function strClean($input) {
// mb_strtolower is multibyte-aware, so accented capitals are lowercased correctly
$input = mb_strtolower($input, 'UTF-8');
$b = array("á","é","í","ó","ú", "ñ", " ");
$c = array("a","e","i","o","u","n", "-");
return str_replace($b, $c, $input);
}
$string = 'á é í ó ú ñ abcdef ghij';
echo $string ."<br />". strClean($string);
?>
</body>
</html>
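The same idea can be written with strtr() and a single map, which keeps each character next to its replacement. This is a sketch with a hypothetical function name, and it assumes the file is saved as UTF-8 and the mbstring extension is loaded.

```php
<?php
// Hypothetical variant of strClean(); this file must be saved as UTF-8.
function strCleanUtf8(string $input): string
{
    // Multibyte-aware lowercasing, so "Á" becomes "á" instead of being mangled.
    $input = mb_strtolower($input, 'UTF-8');

    // strtr() with an array replaces whole byte sequences, so multi-byte
    // keys are safe. Extend the map for the characters you expect.
    $map = array(
        'á' => 'a', 'à' => 'a', 'é' => 'e', 'í' => 'i',
        'ó' => 'o', 'ú' => 'u', 'ñ' => 'n', ' ' => '-',
    );

    return strtr($input, $map);
}

echo strCleanUtf8('Óóó Áà ñ'), "\n"; // ooo-aa-n
```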
Why do you want to remove accents? Is it possible that you just want to ignore them? If so, this answer has a Perl solution that demonstrates how to do that. Note that the Perl is in a foreign language. :)
I ran into this problem before. I tried to follow the leads in this post and others I found along the way, but there was no simple solution, because you have to know the charset that your system uses (ISO-8859-1 in my case). This is what I did:
function quit_accenture($str){
// Assumes this file is saved as ISO-8859-1, so each accented letter is one byte.
// Character classes list the letters directly; a | inside [] would match a literal pipe.
$pattern = array();
$pattern[0] = '/[ÁÂÀÅÄ]/';
$pattern[1] = '/[ÉÊÈ]/';
$pattern[2] = '/[ÍÎÌÏ]/';
$pattern[3] = '/[ÓÔÒÖ]/';
$pattern[4] = '/[ÚÛÙÜ]/';
$pattern[5] = '/[áâàåä]/';
$pattern[6] = '/[ðéêèë]/';
$pattern[7] = '/[íîìï]/';
$pattern[8] = '/[óôòøõö]/';
$pattern[9] = '/[úûùü]/';
$replacement = array();
$replacement[0] = 'A';
$replacement[1] = 'E';
$replacement[2] = 'I';
$replacement[3] = 'O';
$replacement[4] = 'U';
$replacement[5] = 'a';
$replacement[6] = 'e';
$replacement[7] = 'i';
$replacement[8] = 'o';
$replacement[9] = 'u';
return preg_replace($pattern, $replacement, $str);
}
$txt = $_POST['your_htmled_text'];
//Convert to your system's charset. I checked this on the php.ini
$txt = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $txt);
//Apply your function
$txt = quit_accenture($txt);
//output
print_r($txt);
This worked for me, but I also think is the right way :)
