Problem with function removing accents and other characters in PHP - php

I found a simple function to remove some undesired characters from a string.
function strClean($input){
$input = strtolower($input);
$b = array("á","é","í","ó","ú", "ñ", " "); //etc...
$c = array("a","e","i","o","u","n", "-"); //etc...
$input = str_replace($b, $c, $input);
return $input;
}
When I use it on accented or other special characters, such as the string 'á é ñ í', it prints question marks or other weird characters, like:
output http://img217.imageshack.us/img217/6794/59472278.jpg
Note: I'm using strclean.php (which contains this function) and index.php, both in UTF-8. index.php looks as follows:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title></title>
</head>
<body>
<?php
include('strclean.php');
echo 'óóóáà';
echo strClean('óóóáà');
?>
</body>
</html>
What am I doing wrong?

Use
iconv('UTF-8', 'ASCII//TRANSLIT', $input);

You may want to try iconv.

Does a replacement happen at all, i.e. do you get the same weird characters when you print $input beforehand? If so, the character sets of your PHP source code file and the input do not match and you might need to use iconv() on the input before replacing.
Edit: I took both of your files and uploaded them to my web server, and printing and cleaning both work fine (see http://www.tag-am-meer.com/test1/). This is on PHP 4.4.9 and Firefox 3.0.6. More potential problems that come to mind:
Does it work for you in Firefox? I vaguely remember that IE6 (and probably later versions as well) expects the charset in the HTML head section to be written in lowercase ("utf-8").
Does your editor include a byte order mark (BOM) in the code files? Mine does not; maybe PHP chokes on those.
Can you look at the HTTP headers to see if there's something unusual going on, like a bad MIME type? The Tamper Data add-on for Firefox can help with this.
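The BOM question above can be checked from PHP itself. This is a minimal sketch, assuming the files are meant to be UTF-8; the helper name hasUtf8Bom is made up for illustration and is not part of PHP.

```php
<?php
// Hypothetical helper: reports whether a byte string starts with the
// UTF-8 byte order mark (the three bytes EF BB BF).
function hasUtf8Bom(string $bytes): bool
{
    return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
}

// Check the two files from the question, if they exist in the current directory.
foreach (array('strclean.php', 'index.php') as $file) {
    if (is_readable($file)) {
        // Read only the first three bytes of the file.
        $head = (string) file_get_contents($file, false, null, 0, 3);
        echo $file, ': ', hasUtf8Bom($head) ? 'BOM found' : 'no BOM', "\n";
    }
}
```

If a BOM is reported, re-saving the file as "UTF-8 without BOM" in the editor is the usual remedy.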

I have tested your code, and the error is in the strtolower function: it is not multibyte-aware and, depending on the locale, can corrupt UTF-8 input.
Replace it with mb_strtolower, as below:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title></title>
</head>
<body>
<?php
function strClean($input) {
$input = mb_strtolower($input, 'UTF-8');
$b = array("á","é","í","ó","ú", "ñ", " ");
$c = array("a","e","i","o","u","n", "-");
return str_replace($b, $c, $input);
}
$string = 'á é í ó ú ñ abcdef ghij';
echo $string ."<br />". strClean($string);
?>
</body>
</html>
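The same idea can be written as a standalone function. This is a sketch rather than the answerer's exact code: it assumes the mbstring extension is loaded and that both the source file and the input are UTF-8, and the name strCleanUtf8 is chosen here for illustration.

```php
<?php
// Sketch of a UTF-8-safe variant of strClean(); requires mbstring.
function strCleanUtf8(string $input): string
{
    // Lowercase with a multibyte-aware function so "Á" becomes "á"
    // instead of having its individual bytes altered.
    $input = mb_strtolower($input, 'UTF-8');

    // strtr() with an array replaces whole byte sequences in a single pass.
    $map = array(
        'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u',
        'ñ' => 'n', ' ' => '-',
    );
    return strtr($input, $map);
}

echo strCleanUtf8('Á é Ñ í'), "\n"; // a-e-n-i
```

The map is deliberately short; extend it with whatever characters the application needs.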

Why do you want to remove accents? Is it possible that you just want to ignore them? If so, this answer has a Perl solution that demonstrates how to do that. Note that the Perl is in a foreign language. :)

I ran into this problem before. I tried to follow the leads in this post and in others I found along the way, but there was no simple solution, because you have to know the charset that your system uses (in my case ISO-8859-1). This is what I did:
function quit_accenture($str){
$pattern = array();
// Inside a character class "|" is a literal character, so the letters are
// listed without it. These single-byte classes assume that this file and
// the input are both ISO-8859-1.
$pattern[0] = '/[ÁÂÀÅÄ]/';
$pattern[1] = '/[ÉÊÈ]/';
$pattern[2] = '/[ÍÎÌÏ]/';
$pattern[3] = '/[ÓÔÒÖ]/';
$pattern[4] = '/[ÚÛÙÜ]/';
$pattern[5] = '/[áâàåä]/';
$pattern[6] = '/[ðéêèë]/';
$pattern[7] = '/[íîìï]/';
$pattern[8] = '/[óôòøõö]/';
$pattern[9] = '/[úûùü]/';
$replacement = array();
$replacement[0] = 'A';
$replacement[1] = 'E';
$replacement[2] = 'I';
$replacement[3] = 'O';
$replacement[4] = 'U';
$replacement[5] = 'a';
$replacement[6] = 'e';
$replacement[7] = 'i';
$replacement[8] = 'o';
$replacement[9] = 'u';
return preg_replace($pattern, $replacement, $str);
}
$txt = $_POST['your_htmled_text'];
//Convert to your system's charset. I checked this on the php.ini
$txt = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $txt);
//Apply your function
$txt = quit_accenture($txt);
//output
print_r($txt);
This worked for me, but I also think it is the right way :)
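If the text is kept in UTF-8 instead of being converted to ISO-8859-1, character classes like the ones above need the u modifier; without it PCRE treats each byte of a multibyte character separately. The following is a sketch under that assumption. The function name stripAccentsUtf8 is illustrative and the map is intentionally partial.

```php
<?php
// Sketch: accent stripping directly on UTF-8 input using the /u modifier.
// Assumes this source file is saved as UTF-8.
function stripAccentsUtf8(string $str): string
{
    $map = array(
        '/[ÁÂÀÅÄ]/u' => 'A',
        '/[ÉÊÈË]/u'  => 'E',
        '/[ÍÎÌÏ]/u'  => 'I',
        '/[ÓÔÒÖ]/u'  => 'O',
        '/[ÚÛÙÜ]/u'  => 'U',
        '/[áâàåä]/u' => 'a',
        '/[éêèë]/u'  => 'e',
        '/[íîìï]/u'  => 'i',
        '/[óôòõö]/u' => 'o',
        '/[úûùü]/u'  => 'u',
    );
    $result = preg_replace(array_keys($map), array_values($map), $str);

    // preg_replace() returns null on error (for example invalid UTF-8);
    // in that case hand back the input unchanged.
    return $result === null ? $str : $result;
}

echo stripAccentsUtf8('Ángel Òscar ü'), "\n"; // Angel Oscar u
```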

Related

Strpos unexpected result with special characters

I'm trying to find the position of an HTML element in an HTML document.
So I do this:
$filestring = file_get_contents($filename); //get the raw file
$filestring = htmlspecialchars($filestring);
$pos = strpos($filestring, "<head>"); //find the position of <head>
print_r($pos); //print the position
But print_r shows nothing. I think it is due to the special characters, but I do not understand how to fix it.
There is no such thing as <head> in $filestring.
When you use htmlspecialchars, the < and > get replaced:
http://php.net/manual/en/function.htmlspecialchars.php
$pos = strpos($filestring, "&lt;head&gt;");
Or don't use htmlspecialchars when searching for the string.
Why do you use htmlspecialchars?
Do you understand that this function causes characters like > or < to be replaced by their entity representations, &gt; and &lt;?
So, the solutions are:
either do not use htmlspecialchars,
or search not for <head> but for &lt;head&gt;.
I think you should remove this line from your code:
$filestring = htmlspecialchars($filestring);
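The two suggestions can be compared side by side. This sketch uses a small inline string instead of the asker's file, which is an assumption made to keep it self-contained. It also shows why nothing was printed: strpos returns false when there is no match, and print_r(false) outputs an empty string.

```php
<?php
// Stand-in for file_get_contents($filename).
$html = '<html><head><title>t</title></head></html>';

// Search the raw markup first; escape only when printing it.
$pos = strpos($html, '<head>');

// strpos() returns false when nothing is found and 0 for a match at the
// very start, so compare with === instead of relying on truthiness.
if ($pos === false) {
    echo "not found\n";
} else {
    echo "found at offset $pos\n";
}

// After escaping, the literal "<head>" no longer occurs in the string;
// only the escaped form matches.
$escaped      = htmlspecialchars($html);
$posLiteral   = strpos($escaped, '<head>');
$posEscaped   = strpos($escaped, '&lt;head&gt;');
var_dump($posLiteral, $posEscaped);
```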

HTML Special Characters (foreign languages)

Basically I have this string:
Český, Deutsch, English (US), Español (ES), Français (France), Italiano, 日本語, 한국어, Polski, 中文(繁體)
And I want to convert it into all possible HTML entities (there might be Russian characters too!).
I've tried the htmlspecialchars and htmlentities functions with different charsets, but they return empty strings...
$l = htmlentities("Český, Deutsch, English (US), Español (ES), Français (France), Italiano, 日本語, 한국어, Polski, 中文(繁體) €", ENT_COMPAT, "BIG5-HKSCS");
$l = htmlentities($l, ENT_COMPAT, "KOI8-R");
$l = htmlentities($l, ENT_COMPAT, "EUC-JP");
$l = htmlentities($l, ENT_COMPAT, "Shift_JIS");
$l = htmlentities($l, ENT_COMPAT, "Shift_JIS");
echo $l;
returns an empty string.
Any help?
Here's my "unutf8" function, which converts all UTF8 characters into HTML entities of the form &#12345;
function unutf8($str) {
return preg_replace_callback("([\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3}|[\xF8-\xFB][\x80-\xBF]{4}|[\xFC-\xFD][\x80-\xBF]{5})",
function($m) {
$c = $m[0];
$out = bindec(ltrim(decbin(ord($c[0])),"1"));
$l = strlen($c);
for( $i=1; $i<$l; $i++) {
$out = ($out<<6) | bindec(ltrim(decbin(ord($c[$i])),"1"));
}
if( $out < 256) return chr($out);
return "&#".$out.";";
},$str);
}
It parses the string for valid UTF8 character sequences and converts the multi-byte sequence into the ordinal value of the character. It's very messy and I don't expect to win any awards for good coding with this, but it works.
Please note, however, that if you have unencoded characters then you WILL run into problems. For example, if for some reason you have é©© then the result will be 驩. Please make sure your string is valid UTF8 before passing it to the function.
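For comparison, the mbstring extension offers mb_encode_numericentity, which can do a similar conversion, and mb_check_encoding, which covers the validity check mentioned above. The following is a sketch assuming mbstring is available; the wrapper name toNumericEntities is made up here. Unlike unutf8, it also emits entities for characters below 256 instead of returning them as single bytes.

```php
<?php
// Sketch: numeric HTML entities for every non-ASCII character (mbstring required).
function toNumericEntities(string $utf8): string
{
    // Refuse input that is not valid UTF-8 rather than emit wrong entities.
    if (!mb_check_encoding($utf8, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    // Conversion map: code points U+0080..U+10FFFF, offset 0, mask keeps all bits.
    $convmap = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    return mb_encode_numericentity($utf8, $convmap, 'UTF-8');
}

echo toNumericEntities('Español, 日本語'), "\n";
```

Plain ASCII passes through unchanged, so the result can be placed in a page served in any ASCII-compatible charset.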
Use header to modify the HTTP header to utf-8:
header('Content-Type: text/html; charset=utf-8');
Also, make sure your HTML document declares utf-8 as well:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Don't go for tough solutions; just follow these small and simple steps:
1) mysql_set_charset("utf8", $conn); set this with your config connection code.
or
2) mysql_query("SET NAMES 'UTF8'");
enter your query here........
mysql_set_charset("UTF8", queryResult);

Is there any way to add a line number beside each DOM tag?

My code, shown below, accesses all the DOM elements. Now I want to add line numbers beside them. Can anyone please help? Thanks.
<?php
$dom = new domDocument;
// load the html into the object
$dom->loadHTMLFile('$url');
// keep white space
$dom->preserveWhiteSpace = true;
// nicely format output
$dom->formatOutput = true;
//get element by tag name
$htmlRootElement = $dom->getElementsByTagName('html');
$new = htmlspecialchars($dom->saveHTML(), ENT_QUOTES);
echo '<pre>' .$new. '</pre>';
?>
The code above does not output any line numbers. I want to do something like this:
<!DOCTYPE html>
1.<html>
2.<head>
3. <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
4. <title></title>
5.</head>
6.<body>
7.</html>
Any suggestions, please? Thanks.
Just split the string on newlines and print each line number followed by the line:
$lines = preg_split('/\r\n|\r|\n/', $new); // split the string on newlines
echo "<pre>";
foreach ($lines as $lineNumber => $line) {
    // array keys start at 0, so add 1 to number the lines from 1
    echo ($lineNumber + 1) . ". " . $line . "\n";
}
echo "</pre>";
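The splitting approach can also be wrapped in a function that returns the numbered text, which makes it easier to reuse. This is a sketch; the name numberLines and the right-aligned padding are choices made here, not part of the answer.

```php
<?php
// Sketch: prefix every line with a 1-based, right-aligned line number.
function numberLines(string $text): string
{
    $lines = preg_split('/\r\n|\r|\n/', $text);

    // Pad numbers to the width of the largest one so the columns line up.
    $width = strlen((string) count($lines));

    $out = array();
    foreach ($lines as $index => $line) {
        $out[] = str_pad((string) ($index + 1), $width, ' ', STR_PAD_LEFT) . '. ' . $line;
    }
    return implode("\n", $out);
}

// Escape after numbering, then print inside <pre> as in the question.
$markup = "<html>\r\n<head>\r\n</head>\r\n</html>";
echo '<pre>' . htmlspecialchars(numberLines($markup), ENT_QUOTES) . '</pre>';
```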
$lines = explode(PHP_EOL, $html);
There's my hint, I'm sure you can figure the rest out.
There are already PHP libs for this btw, like GeSHi
Oh, and your code has a small error.
Where you're loading the HTML, you pass '$url', which is literally the string $url, not whatever your variable holds. Use double quotes or none at all.
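The quoting difference is easy to verify. In this short sketch the URL is a placeholder value chosen for illustration:

```php
<?php
$url = 'http://example.com/page.html'; // placeholder value

// Single quotes keep the characters as written, so this is the text "$url".
$single = '$url';

// Double quotes interpolate the variable's value.
$double = "$url";

var_dump($single === $url); // bool(false)
var_dump($double === $url); // bool(true)
```

For loadHTMLFile, passing the variable directly, without any quotes, is the simplest form.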

PHP: Dealing special characters with iconv

I still don't understand how iconv works.
For instance,
$string = "Löic & René";
$output = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string);
I get,
Notice: iconv() [function.iconv]:
Detected an illegal character in input
string in...
With $string = "Löic"; or $string = "René";
I get,
Notice: iconv() [function.iconv]: Detected an incomplete multibyte character in input string in.
I get nothing with $string = "&";
There are two different outputs that I need to store in two different columns of my database table:
I need to convert Löic & René to Loic & Rene for clean URL purposes.
I need to keep them as they are (Löic & René stays Löic & René) and only convert them with htmlentities($string, ENT_QUOTES); when displaying them on my HTML page.
I tried some of the suggestions from php.net below, but they still don't work:
I had a situation where I needed some characters transliterated, but the others ignored (for weird diacritics like ayn or hamza). Adding //TRANSLIT//IGNORE seemed to do the trick for me. It transliterates everything that is able to be transliterated, but then throws out stuff that can't be.
So:
$string = "ʿABBĀSĀBĀD";
echo iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $string);
// output: [nothing, and you get a notice]
echo iconv('UTF-8', 'ISO-8859-1//IGNORE', $string);
// output: ABBSBD
echo iconv('UTF-8', 'ISO-8859-1//TRANSLIT//IGNORE', $string);
// output: ABBASABAD
// Yay! That's what I wanted!
and another,
Andries Seutens 07-Nov-2009 07:38
When doing transliteration, you have to make sure that your LC_COLLATE is properly set, otherwise the default POSIX will be used.
To transform "rené" into "rene" we could use the following code snippet:
setlocale(LC_CTYPE, 'nl_BE.utf8');
$string = 'rené';
$string = iconv('UTF-8', 'ASCII//TRANSLIT', $string);
echo $string; // outputs rene
How can I actually work them out?
Thanks.
EDIT:
This is the source file I use to test the code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" class="no-js">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<?php
$string = "Löic & René";
$output = iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string);
?>
</html>
$clean = iconv('UTF-8', 'ASCII//TRANSLIT', utf8_encode($s));
And did you save your source file in UTF-8 encoding? If not (and I guess you didn't since that will produce the "incomplete multibyte character" error), then try that first.
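For the clean-URL column, transliteration can be combined with a cleanup step. This is only a sketch: the function name slugify and the locale name en_US.UTF-8 are assumptions, and what iconv produces for accented letters depends on the platform's iconv implementation and on the LC_CTYPE locale, so the result for "Löic & René" is not guaranteed to be the same everywhere.

```php
<?php
// Sketch of a URL-slug helper built on iconv transliteration.
function slugify(string $utf8): string
{
    // Assumed locale name; setlocale() returns false if it is not installed.
    setlocale(LC_CTYPE, 'en_US.UTF-8');

    // @ suppresses the notice shown in the question; failure is handled below.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $utf8);
    if ($ascii === false) {
        $ascii = $utf8; // fall back to the raw input
    }

    // Collapse every run of characters other than a-z and 0-9 into one hyphen.
    $slug = (string) preg_replace('/[^a-z0-9]+/', '-', strtolower($ascii));
    return trim($slug, '-');
}

echo slugify('Hello World & Co'), "\n";
```

The original string would be stored untouched in the second column and passed through htmlentities($string, ENT_QUOTES, 'UTF-8') only at display time.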
