I can't get valid UTF-8 as output, no matter what I try...
$fx = file_get_contents("Extended Ascii file.txt"); // example only has chr(129), but could be mixed Extended Ascii + UTF8
// not working:
//$fx = html_entity_decode($fx, ENT_QUOTES, "UTF-8");
//$fx = mb_convert_encoding($fx, 'UTF-8', 'ASCII');
//$fx = utf8_encode($fx);
//$fx = iconv('ASCII', 'UTF-8//IGNORE', $fx);
echo '"chr('.ord($fx[0]).')"=>"'.$fx[0].'"<br><br>'; // result: "chr(129)"=>"�"
$fx = strtr($fx, [chr(128)=>'Ç',chr(129)=>'ü',chr(130)=>'é',chr(131)=>'â',chr(132)=>'ä',chr(133)=>'à',chr(134)=>'å',chr(135)=>'ç',chr(136)=>'ê',chr(137)=>'ë',chr(138)=>'è',chr(139)=>'ï',chr(140)=>'î',chr(141)=>'ì',chr(142)=>'Ä',chr(143)=>'Å',chr(144)=>'É',chr(145)=>'æ',chr(146)=>'Æ',chr(147)=>'ô',chr(148)=>'ö',chr(149)=>'ò',chr(150)=>'û',chr(151)=>'ù',chr(152)=>'ÿ',chr(153)=>'Ö',chr(154)=>'Ü',chr(155)=>'ø',chr(156)=>'£',chr(157)=>'Ø',chr(158)=>'×',chr(159)=>'ƒ',chr(160)=>'á',chr(161)=>'í',chr(162)=>'ó',chr(163)=>'ú',chr(164)=>'ñ',chr(165)=>'Ñ',chr(166)=>'ª',chr(167)=>'º',chr(168)=>'¿',chr(169)=>'®',chr(170)=>'¬',chr(171)=>'½',chr(172)=>'¼',chr(173)=>'¡',chr(174)=>'«',chr(175)=>'»',chr(176)=>'░',chr(177)=>'▒',chr(178)=>'▓',chr(179)=>'│',chr(180)=>'┤',chr(181)=>'Á',chr(182)=>'Â',chr(183)=>'À',chr(184)=>'©',chr(185)=>'╣',chr(186)=>'║',chr(187)=>'╗',chr(188)=>'╝',chr(189)=>'¢',chr(190)=>'¥',chr(191)=>'┐',chr(192)=>'└',chr(193)=>'┴',chr(194)=>'┬',chr(195)=>'├',chr(196)=>'─',chr(197)=>'┼',chr(198)=>'ã',chr(199)=>'Ã',chr(200)=>'╚',chr(201)=>'╔',chr(202)=>'╩',chr(203)=>'╦',chr(204)=>'╠',chr(205)=>'═',chr(206)=>'╬',chr(207)=>'¤',chr(208)=>'ð',chr(209)=>'Ð',chr(210)=>'Ê',chr(211)=>'Ë',chr(212)=>'È',chr(213)=>'ı',chr(214)=>'Í',chr(215)=>'Î',chr(216)=>'Ï',chr(217)=>'┘',chr(218)=>'┌',chr(219)=>'█',chr(220)=>'▄',chr(221)=>'¦',chr(222)=>'Ì',chr(223)=>'▀',chr(224)=>'Ó',chr(225)=>'ß',chr(226)=>'Ô',chr(227)=>'Ò',chr(228)=>'õ',chr(229)=>'Õ',chr(230)=>'µ',chr(231)=>'þ',chr(232)=>'Þ',chr(233)=>'Ú',chr(234)=>'Û',chr(235)=>'Ù',chr(236)=>'ý',chr(237)=>'Ý',chr(238)=>'¯',chr(239)=>'´',chr(240)=>'≡',chr(241)=>'±',chr(242)=>'‗',chr(243)=>'¾',chr(244)=>'¶',chr(245)=>'§',chr(246)=>'÷',chr(247)=>'¸',chr(248)=>'°',chr(249)=>'¨',chr(250)=>'·',chr(251)=>'¹',chr(252)=>'³',chr(253)=>'²',chr(254)=>'■',chr(255)=>'nbsp']);
echo '"chr('.ord($fx[0]).')"=>"'.$fx[0].'"<br><br>'; // result: "chr(195)"=>"�"
How can I convert or remove the �?
28.05.2020 Update: Solution found, thanks to Andrea Pollini!
Some notes:
iconv('UTF-8', 'UTF-8//IGNORE', $fx); // //IGNORE is reported as broken in PHP (see https://www.php.net/manual/en/function.iconv.php#108643); use mb_convert_encoding instead
Here was my real problem (I figured it out later, after many tests):
$P["T"] .= $text; // here was the problem, array is converting strings... (don't know why?)
changed to:
ini_set('mbstring.substitute_character', "none"); // make mb_convert_encoding drop characters it cannot convert
$P["T"] .= mb_convert_encoding($text, 'UTF-8', 'UTF-8');
Now it's working. But if somebody knows why arrays are converting strings and how to disable that, it would be great to hear. :)
First, configure mbstring to discard characters it cannot convert:
<?php
ini_set('mbstring.substitute_character', "none");
?>
Next, you can use mb_convert_encoding:
mb_convert_encoding($fx, "UTF-8", mb_detect_encoding($fx, "UTF-8, ISO-8859-1, ISO-8859-15", true));
You can add any other encodings you need to the list passed to mb_detect_encoding.
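If the source encoding is known, naming it explicitly tends to be more predictable than detection. The lookup table in the question appears to match DOS code page 850, so here is a minimal sketch, assuming the file really is CP850 and that the installed mbstring accepts that label. It does not cover the mixed Extended ASCII + UTF-8 case. It also shows why the second echo in the question prints chr(195): `$fx[0]` reads one byte, and that byte is only the first half of the two-byte UTF-8 sequence for ü.

```php
<?php
// Assumption: the input bytes are DOS code page 850 (not confirmed by the question).
$raw = "\x81" . "ber";   // chr(129) followed by ASCII, i.e. "über" in CP850

// Convert from the named source encoding instead of guessing it.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'CP850');

// Index by character, not by byte: $utf8[0] would return only the first
// byte (0xC3) of the two-byte UTF-8 sequence for "ü".
$firstChar = mb_substr($utf8, 0, 1, 'UTF-8');

echo bin2hex($firstChar), PHP_EOL; // hex of the first character's UTF-8 bytes
```

If the data can also contain genuine UTF-8, checking each string with mb_check_encoding($s, 'UTF-8') before converting is one possible way to avoid converting it twice.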
I have created a function to convert the following text to UTF-8, as it appeared to be in Windows-1252 format, due to being copied to a database table from a Word Document.
Testing weird character’s correction
This seems to fix the dodgy ’ character. However, I'm now getting � in the following:
Devon�s most prominent dealerships
When passing the following through the same function:
Devon's most prominent dealerships
Below is the code which does the converting:
function Windows1252ToUTF8($text) {
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
Edit:
The database can't be changed because it holds thousands of custom records. I tried the code below, but mb_detect_encoding thinks character’s correction is UTF-8.
function Windows1252ToUTF8($text) {
if (mb_detect_encoding($text) == "UTF-8") {
return $text;
}
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
Edit 2:
Just tried the example from the PHP Documentation:
$str = 'áéóú'; // ISO-8859-1
echo "<pre>";
var_dump(mb_detect_encoding($str, 'UTF-8')); // 'UTF-8'
var_dump(mb_detect_encoding($str, 'UTF-8', true)); // false
echo "</pre>";
die();
but this simply outputs:
string(5) "UTF-8"
string(5) "UTF-8"
So I can't even detect the encoding of the string :S
Edit 3:
This seems to do the trick:
function Windows1252ToUTF8($text) {
$badChars = [ "â", "á", "ú", "é", "ó" ];
$match = preg_match("/[".join("",$badChars)."]/", $text);
if ($match) {
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
return $text;
}
Edit 4:
I have matched the hex values to their corresponding values. However when I get to the weird characters they don't appear to match.
Converting Testing weird character’s correction using bin2hex
gives me
54657374696e6720776569726420636861726163746572c3a2e282ace284a27320636f7272656374696f6e
This means the "’" is actually the bytes \xc3\xa2\xe2\x82\xac\xe2\x84\xa2. This is a typical sign of a UTF-8 string having been interpreted as Windows Latin-1/1252, and then re-encoded to UTF-8.
’ (UTF-8 \xe2\x80\x99)
→ bytes interpreted as Windows-1252 equal the string ’
→ characters encoded to UTF-8 result in \xc3\xa2\xe2\x82\xac\xe2\x84\xa2
To restore the original, you need to reverse that chain of mis-encodings:
$s = "\xc3\xa2\xe2\x82\xac\xe2\x84\xa2";
echo mb_convert_encoding($s, 'Windows-1252', 'UTF-8');
This interprets the string as UTF-8 and converts it to its Windows-1252 equivalent; the resulting bytes are the valid UTF-8 representation of ’.
Preferably you figure out at what point the encoding screwed up like this and you stop that from happening in the future. If it happened by "copy and pasting from Word", then basically somebody pasted garbage into your database and you need to fix the workflow with Word somehow. Otherwise there may be an incorrect encoding-conversion step somewhere in your code which you need to fix.
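The chain of mis-encodings described above can be reproduced and reversed in a few lines. This is only an illustrative sketch; the variable names are made up for the example, and the bytes are written as escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// U+2019 RIGHT SINGLE QUOTATION MARK as UTF-8 bytes.
$original = "\xe2\x80\x99";

// Forward (the accident): treat each UTF-8 byte as a Windows-1252 character
// and encode those characters as UTF-8.
$mojibake = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');
echo bin2hex($mojibake), PHP_EOL;

// Reverse (the repair): decode as UTF-8 and write each character back
// as a single Windows-1252 byte.
$restored = mb_convert_encoding($mojibake, 'Windows-1252', 'UTF-8');
var_dump($restored === $original);
```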
The following seems to do the trick. Not the way I wanted it to work by checking for specific characters, but it does the trick.
function Windows1252ToUTF8($text) {
$badChars = [ "â", "á", "ú", "é", "ó" ];
$match = preg_match("/[".join("",$badChars)."]/", $text);
if ($match) {
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
return $text;
}
Edit:
function Windows1252ToUTF8($text) {
// http://www.fileformat.info/info/charset/UTF-8/list.htm
$illegal_hex = [ "c3a2", "c3a1", "c3ba", "c3a9", "c3b3" ];
$match = preg_match("/".join("|",$illegal_hex)."/", bin2hex($text));
if ($match) {
return mb_convert_encoding($text, "Windows-1252", "UTF-8");
}
return $text;
}
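An alternative to listing suspicious characters or hex pairs is to attempt the conversion and keep the result only if it is lossless and still valid UTF-8. The sketch below is a heuristic, not a guaranteed detector; the function name repairDoubleEncodedUtf8 is hypothetical and it assumes the damage is exactly one round of "UTF-8 read as Windows-1252".

```php
<?php
// Hypothetical helper: undo one round of UTF-8 -> Windows-1252 -> UTF-8 damage.
function repairDoubleEncodedUtf8(string $text): string
{
    // Input that is not valid UTF-8 is outside the scope of this heuristic.
    if (!mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // Candidate repair: every character becomes one Windows-1252 byte.
    $candidate = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');

    // Lossless means converting back yields the input again, so no character
    // was replaced by a substitute such as "?".
    $lossless = mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252') === $text;

    // Accept only a changed, lossless result that is itself valid UTF-8.
    if ($lossless && $candidate !== $text && mb_check_encoding($candidate, 'UTF-8')) {
        return $candidate;
    }

    return $text;
}

$damaged = "character\xc3\xa2\xe2\x82\xac\xe2\x84\xa2s"; // "character’s"
$clean   = "Devon\xe2\x80\x99s";                         // already correct UTF-8
echo bin2hex(repairDoubleEncodedUtf8($damaged)), PHP_EOL;
echo bin2hex(repairDoubleEncodedUtf8($clean)), PHP_EOL;
```

Correctly encoded text such as "Devon’s" is left alone because its Windows-1252 form contains a lone byte that is not valid UTF-8. Plain ASCII is left alone because the conversion does not change it.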
My WordPress blog uses UTF-8. Now I want to display just one title as ASCII because I have to send it to a payment provider.
The following PHP Snippets are not working:
$utf8 = 'ÄÖÜ';
$iso88591_1 = utf8_decode($utf8);
$iso88591_2 = iconv('UTF-8', 'ISO-8859-1', $utf8);
$iso88591_2 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
The result is: �.
How could I display single words as ASCII (like %E4 instead of ä for example) within my utf-8 encoded Blog?
You need to combine conversion to ISO-8859-1 and url encoding, like this:
$utf8 = 'ÄÖÜ';
echo urlencode(utf8_decode($utf8));
output:
%C4%D6%DC
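Since utf8_decode is deprecated as of PHP 8.2, the same idea can be sketched with mb_convert_encoding. The input is written as explicit UTF-8 bytes here so the example does not depend on the encoding of the script file; the variable names are only illustrative.

```php
<?php
// "ÄÖÜ" as explicit UTF-8 bytes.
$utf8 = "\xC3\x84\xC3\x96\xC3\x9C";

// Step 1: convert the UTF-8 string to single-byte ISO-8859-1.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// Step 2: percent-encode the resulting bytes so only ASCII is sent.
$encoded = urlencode($latin1);

echo $encoded, PHP_EOL;
```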
I have a string that looks like "v\u00e4lkommen till mig", which I get after calling utf8_encode() on the string.
I would like that string to become
välkommen till mig
where the character
\u00e4 = ä = &auml;
How can I achieve this in PHP?
Do not use utf8_(de|en)code. These functions only convert between UTF-8 and ISO-8859-1. ISO-8859-1 does not provide the same characters as ISO-8859-15 or Windows-1252, which are the most widely used encodings besides UTF-8. It is better to use mb_convert_encoding.
"v\u00e4lkommen till mig" looks like a JSON-encoded string, and once decoded it is already UTF-8. The Unicode code point of "ä" is U+00E4, hence \u00e4.
Example
<?php
header('Content-Type: text/html; charset=utf-8');
$json = '"v\u00e4lkommen till mig"';
var_dump(json_decode($json)); //It will return a utf8 encoded string "välkommen till mig"
What is the source of this string?
There is no need to replace the ä with its HTML representation &auml; if you print it in a UTF-8 encoded document and tell the browser which encoding is used. If it is necessary, use htmlentities:
<?php
$json = '"v\u00e4lkommen till mig"';
$string = json_decode($json);
echo htmlentities($string, ENT_COMPAT, 'UTF-8');
Edit: Since you want to keep HTML characters, and I now think your source string isn't quite what you posted (I think it is actual unicode, rather than containing \unnnn as a string), I think your best option is this:
$html = str_replace( '&amp;', '&', str_replace( '&gt;', '>', str_replace( '&lt;', '<', htmlentities( $whatever ) ) ) );
(note: no call to utf8_decode)
Original answer:
There is no direct conversion. First, decode it again:
$decoded = utf8_decode( $whatever );
then encode as HTML:
$html = htmlentities( $decoded );
and of course you can do it without a variable:
$html = htmlentities( utf8_decode( $whatever ) );
http://php.net/manual/en/function.utf8-decode.php
http://php.net/manual/en/function.htmlentities.php
To do this by regular expression (not recommended, likely slower, less reliable), you can use the fact that HTML supports &#xnnnn; constructs, where the nnnn is the same as your existing \unnnn values. So you can say:
$html = preg_replace( '/\\\\u([0-9a-f]{4})/i', '&#x$1;', $whatever );
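As a quick check of the regular-expression approach, the sketch below runs it on the string from the question and then decodes the numeric entity back into a UTF-8 character. The single-quoted literal keeps the backslash sequence as plain text, which is the case this approach assumes.

```php
<?php
// Single quotes: the string contains a literal backslash, "u", and four hex digits.
$whatever = 'v\u00e4lkommen till mig';

// Turn each \unnnn into the equivalent HTML numeric entity &#xnnnn;.
$html = preg_replace('/\\\\u([0-9a-f]{4})/i', '&#x$1;', $whatever);
echo $html, PHP_EOL;

// Optional: decode the entity to get the actual UTF-8 character.
$text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo $text, PHP_EOL;
```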
The html_entity_decode worked for me.
$json = '"v\u00e4lkommen till mig"';
echo $decoded = html_entity_decode( json_decode($json) );
I'm writing a function to clean text that should work with or without UTF-8 characters.
I keep getting text like this:
Coventry Salary - �25,000 - �35,000
The function below removes the �, but leaves other broken characters behind.
Has anyone written a function that cleans up text like this?
function convertHTMLSpecialChars ( $str='' )
{
$str = htmlspecialchars ( $str );
$str = mb_convert_encoding($str, 'UTF-8', mb_detect_encoding($str));
$str = htmlspecialchars($str, ENT_NOQUOTES, 'UTF-8');
return $str;
}
This line:
$str = mb_convert_encoding($str, 'UTF-8', mb_detect_encoding($str));
only tries to detect the character set of $str. If it decides that $str contains UTF-8 characters, it returns "UTF-8", so the call effectively becomes:
$str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
which doesn't help much.
In my opinion, you should specify the character set of your string by hand, for example ISO-8859-9 for Turkish or ISO-8859-7 for Greek, and so on.
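A minimal sketch of naming the source encoding by hand follows. The helper name toUtf8 is hypothetical, and the example assumes the � in the question was a pound sign stored as the single byte 0xA3 (ISO-8859-1), which the question does not confirm.

```php
<?php
// Hypothetical helper: convert to UTF-8 from an encoding the caller names.
function toUtf8(string $str, string $sourceEncoding): string
{
    // Heuristic: a string that is already valid UTF-8 is returned unchanged,
    // so it is not converted a second time.
    if (mb_check_encoding($str, 'UTF-8')) {
        return $str;
    }

    return mb_convert_encoding($str, 'UTF-8', $sourceEncoding);
}

// Assumption: 0xA3 is a pound sign in ISO-8859-1.
$raw  = "Coventry Salary - \xA3" . "25,000";
$utf8 = toUtf8($raw, 'ISO-8859-1');

echo bin2hex($utf8), PHP_EOL;
```

The escaping step (htmlspecialchars with 'UTF-8') would then run once, after the conversion, instead of before and after it.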
Make sure the server outputs your page as UTF-8.
You can force it by using:
header ('Content-type: text/html; charset=utf-8');