I can't get valid UTF-8 as output, no matter what I try...
$fx = file_get_contents("Extended Ascii file.txt"); // example only has chr(129), but could be mixed Extended Ascii + UTF8
// not working:
//$fx = html_entity_decode($fx, ENT_QUOTES, "UTF-8");
//$fx = mb_convert_encoding($fx, 'UTF-8', 'ASCII');
//$fx = utf8_encode($fx);
//$fx = iconv('ASCII', 'UTF-8//IGNORE', $fx);
echo '"chr('.ord($fx[0]).')"=>"'.$fx[0].'"<br><br>'; // result: "chr(129)"=>"�"
$fx = strtr($fx, [chr(128)=>'Ç',chr(129)=>'ü',chr(130)=>'é',chr(131)=>'â',chr(132)=>'ä',chr(133)=>'à',chr(134)=>'å',chr(135)=>'ç',chr(136)=>'ê',chr(137)=>'ë',chr(138)=>'è',chr(139)=>'ï',chr(140)=>'î',chr(141)=>'ì',chr(142)=>'Ä',chr(143)=>'Å',chr(144)=>'É',chr(145)=>'æ',chr(146)=>'Æ',chr(147)=>'ô',chr(148)=>'ö',chr(149)=>'ò',chr(150)=>'û',chr(151)=>'ù',chr(152)=>'ÿ',chr(153)=>'Ö',chr(154)=>'Ü',chr(155)=>'ø',chr(156)=>'£',chr(157)=>'Ø',chr(158)=>'×',chr(159)=>'ƒ',chr(160)=>'á',chr(161)=>'í',chr(162)=>'ó',chr(163)=>'ú',chr(164)=>'ñ',chr(165)=>'Ñ',chr(166)=>'ª',chr(167)=>'º',chr(168)=>'¿',chr(169)=>'®',chr(170)=>'¬',chr(171)=>'½',chr(172)=>'¼',chr(173)=>'¡',chr(174)=>'«',chr(175)=>'»',chr(176)=>'░',chr(177)=>'▒',chr(178)=>'▓',chr(179)=>'│',chr(180)=>'┤',chr(181)=>'Á',chr(182)=>'Â',chr(183)=>'À',chr(184)=>'©',chr(185)=>'╣',chr(186)=>'║',chr(187)=>'╗',chr(188)=>'╝',chr(189)=>'¢',chr(190)=>'¥',chr(191)=>'┐',chr(192)=>'└',chr(193)=>'┴',chr(194)=>'┬',chr(195)=>'├',chr(196)=>'─',chr(197)=>'┼',chr(198)=>'ã',chr(199)=>'Ã',chr(200)=>'╚',chr(201)=>'╔',chr(202)=>'╩',chr(203)=>'╦',chr(204)=>'╠',chr(205)=>'═',chr(206)=>'╬',chr(207)=>'¤',chr(208)=>'ð',chr(209)=>'Ð',chr(210)=>'Ê',chr(211)=>'Ë',chr(212)=>'È',chr(213)=>'ı',chr(214)=>'Í',chr(215)=>'Î',chr(216)=>'Ï',chr(217)=>'┘',chr(218)=>'┌',chr(219)=>'█',chr(220)=>'▄',chr(221)=>'¦',chr(222)=>'Ì',chr(223)=>'▀',chr(224)=>'Ó',chr(225)=>'ß',chr(226)=>'Ô',chr(227)=>'Ò',chr(228)=>'õ',chr(229)=>'Õ',chr(230)=>'µ',chr(231)=>'þ',chr(232)=>'Þ',chr(233)=>'Ú',chr(234)=>'Û',chr(235)=>'Ù',chr(236)=>'ý',chr(237)=>'Ý',chr(238)=>'¯',chr(239)=>'´',chr(240)=>'≡',chr(241)=>'±',chr(242)=>'‗',chr(243)=>'¾',chr(244)=>'¶',chr(245)=>'§',chr(246)=>'÷',chr(247)=>'¸',chr(248)=>'°',chr(249)=>'¨',chr(250)=>'·',chr(251)=>'¹',chr(252)=>'³',chr(253)=>'²',chr(254)=>'■',chr(255)=>'nbsp']);
echo '"chr('.ord($fx[0]).')"=>"'.$fx[0].'"<br><br>'; // result: "chr(195)"=>"�"
How can I convert or remove the � character?
28.05.2020 Update: Solution found, thanks to Andrea Pollini!
Some notes:
iconv('UTF-8', 'UTF-8//IGNORE', $fx); // //IGNORE is reported as broken in PHP (see https://www.php.net/manual/en/function.iconv.php#108643); use mb_convert_encoding instead
Here was my real problem (I figured it out later, after many tests):
$P["T"] .= $text; // here was the problem, array is converting strings... (don't know why?)
changed to:
ini_set('mbstring.substitute_character', "none"); // make mb_convert_encoding drop invalid characters instead of substituting them
$P["T"] .= mb_convert_encoding($text, 'UTF-8', 'UTF-8');
Now it's working. But if somebody knows why arrays appear to convert strings and how to disable that, it would be great to hear. :)
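The UTF-8 to UTF-8 trick from the update can be wrapped in a small helper. This is only a sketch: the function name stripInvalidUtf8 is made up for illustration, and it uses mb_substitute_character() rather than ini_set() so the previous setting can be restored afterwards. As far as PHP itself is concerned, arrays store strings as raw bytes and do not re-encode them, so the invalid bytes most likely came in with $text.

```php
<?php
// Hypothetical helper: drop byte sequences that are not valid UTF-8.
function stripInvalidUtf8(string $text): string
{
    $previous = mb_substitute_character();   // remember the current setting
    mb_substitute_character('none');         // drop invalid sequences silently
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);      // restore the previous setting
    return $clean;
}

// Appending to an array element keeps the bytes exactly as they are.
$P = ['T' => ''];
$P['T'] .= "a\x81b";                          // 0x81 is not valid UTF-8 on its own
$P['T'] = stripInvalidUtf8($P['T']);
```

The helper leaves already valid UTF-8 untouched, so it can be applied to every string before appending.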
First, configure mbstring to discard characters it cannot convert:
<?php
ini_set('mbstring.substitute_character', "none");
?>
Next, use mb_convert_encoding:
$fx = mb_convert_encoding($fx, "UTF-8", mb_detect_encoding($fx, "UTF-8, ISO-8859-1, ISO-8859-15", true));
Add any other encodings you need to the mb_detect_encoding list.
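When the source encoding is known, naming it explicitly is usually more reliable than detection. The lookup table in the question matches DOS code page 850 (chr(129) is "ü" there), so the sketch below assumes CP850 as the fallback. The function name toUtf8 and the CP850 default are assumptions for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper: keep valid UTF-8 as is, otherwise convert from a
// known single-byte encoding (assumed here to be CP850).
function toUtf8(string $bytes, string $fallbackEncoding = 'CP850'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;                        // already valid UTF-8 (or plain ASCII)
    }
    return mb_convert_encoding($bytes, 'UTF-8', $fallbackEncoding);
}

$fx = toUtf8(chr(129));                       // the byte from the question
```

Note that $fx[0] returns a single byte, so after conversion it yields only the first byte of a multi-byte character; mb_substr($fx, 0, 1, 'UTF-8') returns the whole first character.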
I'm currently trying to figure out how to convert an ASCII encoded string to ISO-8859-1 encoding to be used for utf8_encode() to display special characters like "ñ" but I can't seem to make it work. In need of help.
I've already tried this iconv(mb_detect_encoding($text, mb_detect_order(), true), "ISO-8859-1", $text); and this mb_convert_encoding($text, "ISO-8859-1"); and also this mb_convert_encoding($text, "ASCII", "ISO-8859-1"); but it doesn't work, the string is still ASCII encoded.
As a temporary solution, I built a lookup table and translate the string character by character. But I would prefer to use PHP's built-in functions. Is this possible?
Here is my code:
<?php
function convertString($text) {
    $text = iconv(mb_detect_encoding($text, mb_detect_order(), true), "ISO-8859-1", $text);
    echo mb_detect_encoding($text) .'<br/>'; // to check what encoding the string is in, displays ASCII
    return utf8_encode($text);
}
echo convertString('\xc3\xb1');
?>
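One likely reason the string is reported as ASCII: PHP does not interpret \x escapes inside single quotes, so '\xc3\xb1' is eight plain ASCII characters rather than the two bytes of "ñ". The sketch below illustrates the difference; the variable names are illustrative only, and it uses mb_convert_encoding because utf8_encode() is deprecated as of PHP 8.2.

```php
<?php
$single = '\xc3\xb1';   // literal backslash, x, c, 3, ... (plain ASCII)
$double = "\xc3\xb1";   // two bytes: the UTF-8 encoding of "ñ"

// UTF-8 to ISO-8859-1 gives the single byte 0xF1.
$latin1 = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');

// ISO-8859-1 back to UTF-8, which is what utf8_encode() used to do.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
```

With the double-quoted literal, the input is already UTF-8, so no conversion is needed just to display "ñ" on a UTF-8 page.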
How to convert utf8 strings to iso 8859-1?
Why doesn't imap_mime_header_decode detect the utf8 coded string?
I need to remove all 4-byte Unicode characters so the string fits in a MySQL utf8 column.
I have tried this, but it doesn't work:
$text = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
Code:
$input = '=?UTF-8?Q?=c3=b8en?=';
echo "$input\n";
$output = '';
foreach(imap_mime_header_decode($input) as $element){
    if($element->charset == 'utf-8'){
        echo "utf8 charset = $element->text\n";
        $output .= $element->text;
    }
    else{
        echo "default charset = $element->text\n";
        $output .= $element->text;
    }
}
// Here output should be iso 8859-1
echo "$output\n";
$string = preg_replace('/[^a-zæøåA-ZÆØÅ0-9 \-\.,:]/', '', $output);
// Back to utf8
$string = utf8_encode($string);
echo "$string\n";
Output:
=?UTF-8?Q?=c3=b8en?=
default charset = øen
øen
en
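The "default charset" line in the output suggests that the == 'utf-8' comparison failed, most likely because the charset is reported with the same capitalisation as in the header ("UTF-8") and PHP string comparison is case-sensitive. The sketch below uses a stand-in object instead of calling imap_mime_header_decode(), so both the object and the helper name isUtf8Charset are assumptions for illustration.

```php
<?php
// Hypothetical helper: compare a charset label without regard to case.
function isUtf8Charset(string $charset): bool
{
    return strcasecmp($charset, 'utf-8') === 0;
}

// Stand-in for one element of imap_mime_header_decode()'s result (assumed shape).
$element = new stdClass();
$element->charset = 'UTF-8';
$element->text = "\xC3\xB8en";                // "øen" as UTF-8 bytes

$matchesStrict = ($element->charset == 'utf-8');     // case-sensitive comparison
$matchesLoose  = isUtf8Charset($element->charset);   // case-insensitive comparison
```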
I came up with this solution. It first converts to UTF-8 (including 4-byte Unicode characters), then converts to ISO-8859-1, strips the unwanted characters, and finally encodes back to UTF-8.
:D
private function strip_non_ascii($string){
    $return = '';
    if(preg_match('/^=\?(iso-8859-1|utf-8)\?q\?/i', $string)){
        $return = str_replace('_',' ', mb_decode_mimeheader($string));
    }
    elseif(preg_match('/^(iso-8859-1\'\')(.*)$/i', $string, $matches)){
        $return = utf8_encode(rawurldecode($matches[2]));
    }
    else{
        $return = imap_utf8($string);
    }
    return utf8_encode(preg_replace('/[^a-zæøåA-ZÆØÅ0-9 \-\.,:]/', '', utf8_decode($return)));
}
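If the only goal is to drop characters that need four bytes in UTF-8 (code points above U+FFFF), a narrower alternative to the ISO-8859-1 round trip is a Unicode-aware regular expression. This is a sketch of a different technique, not the method above, and the name stripFourByteChars is hypothetical. It assumes the input is already valid UTF-8.

```php
<?php
// Hypothetical helper: remove code points U+10000..U+10FFFF, which are the
// ones encoded as four bytes in UTF-8.
function stripFourByteChars(string $text): string
{
    $result = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '', $text);
    // preg_replace returns null on error (for example invalid UTF-8 input);
    // in that case this sketch returns the text unchanged.
    return $result ?? $text;
}

$clean = stripFourByteChars("a\xF0\x9F\x98\x80b");   // U+1F600 between "a" and "b"
```

Unlike the whitelist approach, this keeps every character of three bytes or fewer, including non-Latin scripts. Where changing the schema is an option, a utf8mb4 column stores 4-byte characters without any stripping.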
Use htmlentities() to convert the special characters to HTML entities. You can optionally specify the encoding of the source string, and doing so is encouraged; in your case, this would be 'UTF-8'. The HTML entities are safe to store in a database and safe to output in their escaped form, although you may choose to use html_entity_decode() to convert as many characters as possible back to an encoding of your choice.
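A minimal round trip with the encoding stated explicitly might look like this; the variable names are illustrative only.

```php
<?php
$utf8 = "\xC3\xB1";                                   // "ñ" encoded as UTF-8

// Escape to a named HTML entity, stating the source encoding.
$escaped = htmlentities($utf8, ENT_QUOTES, 'UTF-8');

// Decode the entity back into UTF-8 bytes.
$restored = html_entity_decode($escaped, ENT_QUOTES, 'UTF-8');
```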
I'm trying to detect the character encoding of a string but I can't get the right result.
For example:
$str = "€ ‚ ƒ „ …" ;
$str = mb_convert_encoding($str, 'Windows-1252' ,'HTML-ENTITIES') ;
// Now $str should be a Windows-1252-encoded string.
// Let's detect its encoding:
echo mb_detect_encoding($str,'Windows-1252, ISO-8859-1, UTF-8') ;
That code outputs ISO-8859-1 but it should be Windows-1252.
What's wrong with this?
EDIT:
Updated example, in response to #raina77ow.
$str = "€‚ƒ„…" ; // no white-spaces
$str = mb_convert_encoding($str, 'Windows-1252' ,'HTML-ENTITIES') ;
$str = "Hello $str" ; // let's add some ascii characters
echo mb_detect_encoding($str,'Windows-1252, ISO-8859-1, UTF-8') ;
I get the wrong result again.
The problem with Windows-1252 in PHP is that it will almost never be detected, because as soon as your text contains any characters outside of 0x80 to 0x9f, it will not be detected as Windows-1252.
This means that if your string contains a normal ASCII letter like "A", or even a space character, PHP will say that this is not valid Windows-1252 and, in your case, fall back to the next possible encoding, which is ISO 8859-1. This is a PHP bug, see https://bugs.php.net/bug.php?id=64667.
Strings encoded as ISO-8859-1 and as CP-1252 do have different byte representations, as this script shows:
<?php
$str = "€ ‚ ƒ „ …" ;
foreach (array('Windows-1252', 'ISO-8859-1') as $encoding)
{
    $new = mb_convert_encoding($str, $encoding, 'HTML-ENTITIES');
    printf('%15s: %s detected: %10s explicitly: %10s',
        $encoding,
        implode('', array_map(function($x) { return dechex(ord($x)); }, str_split($new))),
        mb_detect_encoding($new),
        mb_detect_encoding($new, array('ISO-8859-1', 'Windows-1252'))
    );
    echo PHP_EOL;
}
Results:
Windows-1252: 802082208320842085 detected: explicitly: ISO-8859-1
ISO-8859-1: 3f203f203f203f203f detected: ASCII explicitly: ISO-8859-1
From these results, it looks like there is a problem with the second parameter of mb_detect_encoding. Using mb_detect_order instead of the parameter yields very similar results.
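Since mb_detect_encoding cannot be relied on to separate these two encodings, one possible workaround is a byte-level heuristic: ISO-8859-1 reserves 0x80 to 0x9F for control codes that rarely occur in real text, while Windows-1252 uses most of that range for printable characters such as "€". The sketch below rests on that assumption; the name guessEncoding is hypothetical, and the result is a guess rather than a reliable detection.

```php
<?php
// Hypothetical heuristic: choose between UTF-8, Windows-1252 and ISO-8859-1.
function guessEncoding(string $bytes): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return 'UTF-8';                       // valid UTF-8, which includes plain ASCII
    }
    if (preg_match('/[\x80-\x9F]/', $bytes) === 1) {
        return 'Windows-1252';                // bytes that are printable only in CP-1252
    }
    return 'ISO-8859-1';
}

$guess = guessEncoding("Hello \x80\x82");     // bytes 0x80 and 0x82 after ASCII text
```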
I'm writing a function to clean text, which should work with or without UTF-8 characters.
I keep getting text like this:
Coventry Salary - �25,000 - �35,000
This function removes the � but leaves others behind.
Has anyone written a function that cleans up text like this?
function convertHTMLSpecialChars ( $str='' )
{
    $str = htmlspecialchars ( $str );
    $str = mb_convert_encoding($str, 'UTF-8', mb_detect_encoding($str));
    $str = htmlspecialchars($str, ENT_NOQUOTES, 'UTF-8');
    return $str;
}
This call:
$str = mb_convert_encoding($str, 'UTF-8', mb_detect_encoding($str));
only tries to detect the character set from $str. If it decides that $str contains UTF-8 characters, it returns "UTF-8", so the call effectively becomes:
$str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
which doesn't help much.
In my opinion, you should specify the character set of your string by hand: for example, ISO-8859-9 if it's Turkish, ISO-8859-7 if it's Greek, and so on.
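Passing the source encoding explicitly could look like the sketch below. It assumes the salary text arrives as ISO-8859-1, where the pound sign is the single byte 0xA3, which would explain the � in the output. The function name cleanForHtml and the default encoding are assumptions for illustration.

```php
<?php
// Hypothetical helper: convert from a known encoding, then escape once.
function cleanForHtml(string $raw, string $sourceEncoding = 'ISO-8859-1'): string
{
    $utf8 = mb_convert_encoding($raw, 'UTF-8', $sourceEncoding);
    return htmlspecialchars($utf8, ENT_NOQUOTES, 'UTF-8');
}

$clean = cleanForHtml("Salary - \xA325,000 & up");    // 0xA3 is "£" in ISO-8859-1
```

Converting before escaping avoids running htmlspecialchars() on bytes that are not valid in the assumed encoding, and escaping only once avoids double-encoded entities such as &amp;amp;.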
Make sure the server outputs your page as UTF-8.
You can force it by using:
header ('Content-type: text/html; charset=utf-8');