I still don't understand how iconv works.
For instance,
$string = "Löic & René";
$output = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string);
I get,
Notice: iconv() [function.iconv]:
Detected an illegal character in input
string in...
$string = "Löic"; or $string = "René";
I get,
Notice: iconv() [function.iconv]: Detected an incomplete multibyte character in input string in.
I get nothing with $string = "&";
There are two sets of different outputs I need store them in the two different columns inside the table of my database,
I need to convert Löic & René to Loic & Rene for clean url purposes.
I need to keep them as they are - Löic & René as Löic & René then only convert them with htmlentities($string, ENT_QUOTES); when displaying them on my html page.
I tried with some of the suggestions in php.net below, but still don't work,
I had a situation where I needed some characters transliterated, but the others ignored (for weird diacritics like ayn or hamza). Adding //TRANSLIT//IGNORE seemed to do the trick for me. It transliterates everything that is able to be transliterated, but then throws out stuff that can't be.
So:
$string = "ʿABBĀSĀBĀD";
echo iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $string);
// output: [nothing, and you get a notice]
echo iconv('UTF-8', 'ISO-8859-1//IGNORE', $string);
// output: ABBSBD
echo iconv('UTF-8', 'ISO-8859-1//TRANSLIT//IGNORE', $string);
// output: ABBASABAD
// Yay! That's what I wanted!
and another,
Andries Seutens 07-Nov-2009 07:38
When doing transliteration, you have to make sure that your LC_COLLATE is properly set, otherwise the default POSIX will be used.
To transform "rené" into "rene" we could use the following code snippet:
setlocale(LC_CTYPE, 'nl_BE.utf8');
$string = 'rené';
$string = iconv('UTF-8', 'ASCII//TRANSLIT', $string);
echo $string; // outputs rene
How can I actually work them out?
Thanks.
EDIT:
This is the source file I test the code,
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" class="no-js">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<?php
$string = "Löic & René";
$output = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string);
?>
</html>
$clean = iconv('UTF-8', 'ASCII//TRANSLIT', utf8_encode($s));
And did you save your source file in UTF-8 encoding? If not (and I guess you didn't since that will produce the "incomplete multibyte character" error), then try that first.
Related
I have the code below, which tries to convert a string from UTF to CP1256. I want to decode the string to arabic, and the page encryption is fixed to UTF8
<?php
$string = "ãÍãÏ Úæäí ãÍãæÏ Úáí";
$string = iconv("UTF-8//TRANSLIT//IGNORE", "Windows-1252//TRANSLIT//IGNORE", $string);
echo $string;
?>
So your Arabic text, has been encoded in Windows-1256 and then incorrectly encoded to Windows-1252.
If your source file is UTF-8 encoded, the answer is:
<?php
$string = "ãÍãÏ Úæäí ãÍãæÏ Úáí";
$string = iconv("UTF-8//TRANSLIT//IGNORE", "Windows-1252//TRANSLIT//IGNORE", $string);
# $string is now back to its 1256 encoding. Encode to UTF-8 for web page
$string = iconv("Windows-1256//TRANSLIT//IGNORE", "UTF-8//TRANSLIT//IGNORE", $string);
echo $string;
?>
If your source file is "windows-1252" encoded, then you must use:
<?php
$string = "ãÍãÏ Úæäí ãÍãæÏ Úáí";
# Interperate windows-1252 string as if it were windows-1256. Encode to UTF-8 for web page
$string = iconv("Windows-1256//TRANSLIT//IGNORE", "UTF-8//TRANSLIT//IGNORE", $string);
echo $string;
?>
If you $string actually comes from a database or file, then you have to determine the encoding of the source before applying any conversion.
$strings = "ãÍãÏ Úæäí ãÍãæÏ Úáí";
setlocale(LC_CTYPE, 'nl_NL.UTF-8');
$convert = iconv('UTF-8', 'windows-1251//TRANSLIT//IGNORE', $strings);
echo var_dump($convert);
I have a form and in a textarea I want to display some text that have some spanish characters but encoded as html. The problem is that instead of the spanish character it displays the html code. I'm using htmlentities to display it in the form. my code to display is:
<?php echo htmlentities($string, ENT_QUOTES, "UTF-8") ?>
Any idea or I just shouldnt use htmlentities in a form? Thanks!
EDIT
Lets say $string = 'á'
When I just do <?php echo $string ;?> I get á
If I do <?php echo htmlentities($string, ENT_QUOTES, "UTF-8") ?> I get á
I'm so confused!
You can try explicitly adding content type at the top of your file as below
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
if it's already encoded as html then you need to decode it now..you can use html_entity_decode($string);
Your string to be echoed in the form should be á as returned from database and not á
$string = 'á'; // your string as fetched from database
echo html_entity_decode($string);// this will display á in the textarea
and before saving to database you need to
htmlentities($_POST['txtAreaName'], ENT_QUOTES, "UTF-8"); // return `á`
If I understand you correctly, you need to use...
<meta charset="utf-8">
in your page header, and then...
<?php echo html_entity_decode($string, ENT_QUOTES); ?>
This will convert your HTML entities back to their proper characters
You might be looking for htmlspecialchars.
echo htmlspecialchars('<á>', ENT_COMPAT | ENT_HTML5, "UTF-8");
outputs <á>.
Im trying to save some data into a xml file using the following PHP script:
<?php
$string = 'Go to google maps and some special characters ë è & ä etc.';
$string = htmlentities($string, ENT_QUOTES, 'UTF-8');
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->preserveWhiteSpace = false;
$doc->formatOutput = true;
$root = $doc->createElement('top');
$root = $doc->appendChild($root);
$title = $doc->createElement('title');
$title = $root->appendChild($title);
$id = $doc->createAttribute('id');
$id->value = '1';
$text = $title->appendChild($id);
$text = $doc->createTextNode($string);
$text = $title->appendChild($text);
$doc->save('data.xml');
echo 'data saved!';
?>
I'm using htmlentities to translate all of the string into an html format, if I leave this out the special characters won't be translated to html format. this is the output:
<?xml version="1.0" encoding="UTF-8"?>
<top>
<title id="1"><a href="google.com/maps">Go to google maps</a> and some special characters ë è & ä etc.</title>
</top>
The ampersand of the html tags get a double html code: < and an ampersand becomes: &
Is this normal behavior? Or how can I prevent this from happening? Looks like a double encoding.
Try to remove the line:
$string = htmlentities($string, ENT_QUOTES, 'UTF-8');
Because the text passed to createTextNode() is escaped anyway.
Update:
If you want the utf-8 characters to be escaped. You could leave that line and try to add the $string directly in createElement().
For example:
$title = $doc->createElement('title', $string);
$title = $root->appendChild($title);
In PHP documentation it says that $string will not be escaped. I haven't tried it, but it should work.
It is the htmlentities that turns a & into &
When working with xml data you should not use htmlentities, as the DOMDocument will handle a & and not &.
As of php 5.3 the default encoding is UTF-8, so there is no need to convert to UTF-8.
This line:
$string = htmlentities($string, ENT_QUOTES, 'UTF-8');
… encodes a string as HTML.
This line:
$text = $doc->createTextNode($string);
… encodes your string of HTML as XML.
This gives you an XML representation of an HTML string. When the XML is parsed you get the HTML back.
how can I prevent this from happening?
If your goal is to store some text in an XML document. Remove the line that encodes it as HTML.
Looks like a double encoding.
Pretty much. It is encoded twice, it just uses different (albeit very similar) encoding methods for each of the two passes.
I still don't understand how iconv works.
For instance,
$string = "Löic & René";
$output = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string);
I get,
Notice: iconv() [function.iconv]:
Detected an illegal character in input
string in...
$string = "Löic"; or $string = "René";
I get,
Notice: iconv() [function.iconv]: Detected an incomplete multibyte character in input string in.
I get nothing with $string = "&";
There are two sets of different outputs I need store them in the two different columns inside the table of my database,
I need to convert Löic & René to Loic & Rene for clean url purposes.
I need to keep them as they are - Löic & René as Löic & René then only convert them with htmlentities($string, ENT_QUOTES); when displaying them on my html page.
I tried with some of the suggestions in php.net below, but still don't work,
I had a situation where I needed some characters transliterated, but the others ignored (for weird diacritics like ayn or hamza). Adding //TRANSLIT//IGNORE seemed to do the trick for me. It transliterates everything that is able to be transliterated, but then throws out stuff that can't be.
So:
$string = "ʿABBĀSĀBĀD";
echo iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $string);
// output: [nothing, and you get a notice]
echo iconv('UTF-8', 'ISO-8859-1//IGNORE', $string);
// output: ABBSBD
echo iconv('UTF-8', 'ISO-8859-1//TRANSLIT//IGNORE', $string);
// output: ABBASABAD
// Yay! That's what I wanted!
and another,
Andries Seutens 07-Nov-2009 07:38
When doing transliteration, you have to make sure that your LC_COLLATE is properly set, otherwise the default POSIX will be used.
To transform "rené" into "rene" we could use the following code snippet:
setlocale(LC_CTYPE, 'nl_BE.utf8');
$string = 'rené';
$string = iconv('UTF-8', 'ASCII//TRANSLIT', $string);
echo $string; // outputs rene
How can I actually work them out?
Thanks.
EDIT:
This is the source file I test the code,
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" class="no-js">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<?php
$string = "Löic & René";
$output = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string);
?>
</html>
$clean = iconv('UTF-8', 'ASCII//TRANSLIT', utf8_encode($s));
And did you save your source file in UTF-8 encoding? If not (and I guess you didn't since that will produce the "incomplete multibyte character" error), then try that first.
I found a simple function to remove some undesired characters from a string.
function strClean($input){
$input = strtolower($input);
$b = array("á","é","í","ó","ú", "ñ", " "); //etc...
$c = array("a","e","i","o","u","n", "-"); //etc...
$input = str_replace($b, $c, $input);
return $input;
}
When I use it on accents or other characters, like this word 'á é ñ í' it prints out those question marks or weird characters, like:
output http://img217.imageshack.us/img217/6794/59472278.jpg
Note: I'm using strclean.php (which contains this function) and index.php, both in UTF-8. index.php looks as follows:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title></title>
</head>
<body>
<?php
include('strclean.php');
echo 'óóóáà';
echo strClean('óóóáà');
?>
</body>
</html>
What am I doing wrong?
Use
iconv('UTF-8', 'ASCII//TRANSLIT', $input);
You may want to try iconv.
Does a replacement happen at all, i.e. do you get the same weird characters when you print $input beforehand? If so, the character sets of your PHP source code file and the input do not match and you might need to use iconv() on the input before replacing.
edit: I took both of your files, uploaded them to my webserver and printing and cleaning works fine (see http://www.tag-am-meer.com/test1/). This is on PHP 4.4.9 and Firefox 3.0.6. More potential problems that come to my mind:
Does it work for you on Firefox? I remember vaguely that IE6 (and probably later versions as well) expect the charset in the HTML head section to be written in lowercase ("utf-8")
Does your editor include byte order marks (BOM) in the code files? Mine does not, maybe PHP chokes on those.
Can you look at the HTTP headers to see if there's something unusual going on, like a bad MIME type? The Tamper Data add-on for Firefox can help with this.
I have tested your code, and error is in strtolower function...
Replace it with mb_strtolower, like bellow
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title></title>
</head>
<body>
<?php
function strClean($input) {
$input = mb_strtolower($input, 'UTF-8');
$b = array("á","é","í","ó","ú", "n", " ");
$c = array("a","e","i","o","u","n", "-");
return str_replace($b, $c, $input);
}
$string = 'á é í ó ú n abcdef ghij';
echo $string ."<br />". strClean($string);
?>
</body>
</html>
Why do you want to remove accents? Is it possible that you just want to ignore them? If so, this answer has a Perl solution that demonstrates how to do that. Note that the Perl is in a foreign language. :)
I found myself with this trouble before, and I tried to follow the leads of this post and others I found on the way and there was no simple solution, cause you have to know the charset that your system uses (in my case ISO-8859-1) and this is what I did:
function quit_accenture($str){
$pattern = array();
$pattern[0] = '/[Á|Â|À|Å|Ä]/';
$pattern[1] = '/[É|Ê|È]/';
$pattern[2] = '/[Í|Î|Ì|Ï]/';
$pattern[3] = '/[Ó|Ô|Ò|Ö]/';
$pattern[4] = '/[Ú|Û|Ù|Ü]/';
$pattern[5] = '/[á|â|à|å|ä]/';
$pattern[6] = '/[ð|é|ê|è|ë]/';
$pattern[7] = '/[í|î|ì|ï]/';
$pattern[8] = '/[ó|ô|ò|ø|õ|ö]/';
$pattern[9] = '/[ú|û|ù|ü]/';
$replacement = array();
$replacement[0] = 'A';
$replacement[1] = 'E';
$replacement[2] = 'I';
$replacement[3] = 'O';
$replacement[4] = 'U';
$replacement[5] = 'a';
$replacement[6] = 'e';
$replacement[7] = 'i';
$replacement[8] = 'o';
$replacement[9] = 'u';
return preg_replace($pattern, $replacement, $str);
}
$txt = $_POST['your_htmled_text'];
//Convert to your system's charset. I checked this on the php.ini
$txt = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $txt);
//Apply your function
$txt = quit_accenture($txt);
//output
print_r($txt);
This worked for me, but I also think is the right way :)