I can't find any way to get valid UTF-8 as output...
$fx = file_get_contents("Extended Ascii file.txt"); // example only has chr(129), but could be mixed Extended Ascii + UTF8
// not working:
//$fx = html_entity_decode($fx, ENT_QUOTES, "UTF-8");
//$fx = mb_convert_encoding($fx, 'UTF-8', 'ASCII');
//$fx = utf8_encode($fx);
//$fx = iconv('ASCII', 'UTF-8//IGNORE', $fx);
echo '"chr('.ord($fx[0]).')"=>"'.$fx[0].'"<br><br>'; // result: "chr(129)"=>"�"
$fx = strtr($fx, [chr(128)=>'Ç',chr(129)=>'ü',chr(130)=>'é',chr(131)=>'â',chr(132)=>'ä',chr(133)=>'à',chr(134)=>'å',chr(135)=>'ç',chr(136)=>'ê',chr(137)=>'ë',chr(138)=>'è',chr(139)=>'ï',chr(140)=>'î',chr(141)=>'ì',chr(142)=>'Ä',chr(143)=>'Å',chr(144)=>'É',chr(145)=>'æ',chr(146)=>'Æ',chr(147)=>'ô',chr(148)=>'ö',chr(149)=>'ò',chr(150)=>'û',chr(151)=>'ù',chr(152)=>'ÿ',chr(153)=>'Ö',chr(154)=>'Ü',chr(155)=>'ø',chr(156)=>'£',chr(157)=>'Ø',chr(158)=>'×',chr(159)=>'ƒ',chr(160)=>'á',chr(161)=>'í',chr(162)=>'ó',chr(163)=>'ú',chr(164)=>'ñ',chr(165)=>'Ñ',chr(166)=>'ª',chr(167)=>'º',chr(168)=>'¿',chr(169)=>'®',chr(170)=>'¬',chr(171)=>'½',chr(172)=>'¼',chr(173)=>'¡',chr(174)=>'«',chr(175)=>'»',chr(176)=>'░',chr(177)=>'▒',chr(178)=>'▓',chr(179)=>'│',chr(180)=>'┤',chr(181)=>'Á',chr(182)=>'Â',chr(183)=>'À',chr(184)=>'©',chr(185)=>'╣',chr(186)=>'║',chr(187)=>'╗',chr(188)=>'╝',chr(189)=>'¢',chr(190)=>'¥',chr(191)=>'┐',chr(192)=>'└',chr(193)=>'┴',chr(194)=>'┬',chr(195)=>'├',chr(196)=>'─',chr(197)=>'┼',chr(198)=>'ã',chr(199)=>'Ã',chr(200)=>'╚',chr(201)=>'╔',chr(202)=>'╩',chr(203)=>'╦',chr(204)=>'╠',chr(205)=>'═',chr(206)=>'╬',chr(207)=>'¤',chr(208)=>'ð',chr(209)=>'Ð',chr(210)=>'Ê',chr(211)=>'Ë',chr(212)=>'È',chr(213)=>'ı',chr(214)=>'Í',chr(215)=>'Î',chr(216)=>'Ï',chr(217)=>'┘',chr(218)=>'┌',chr(219)=>'█',chr(220)=>'▄',chr(221)=>'¦',chr(222)=>'Ì',chr(223)=>'▀',chr(224)=>'Ó',chr(225)=>'ß',chr(226)=>'Ô',chr(227)=>'Ò',chr(228)=>'õ',chr(229)=>'Õ',chr(230)=>'µ',chr(231)=>'þ',chr(232)=>'Þ',chr(233)=>'Ú',chr(234)=>'Û',chr(235)=>'Ù',chr(236)=>'ý',chr(237)=>'Ý',chr(238)=>'¯',chr(239)=>'´',chr(240)=>'≡',chr(241)=>'±',chr(242)=>'‗',chr(243)=>'¾',chr(244)=>'¶',chr(245)=>'§',chr(246)=>'÷',chr(247)=>'¸',chr(248)=>'°',chr(249)=>'¨',chr(250)=>'·',chr(251)=>'¹',chr(252)=>'³',chr(253)=>'²',chr(254)=>'■',chr(255)=>'nbsp']);
echo '"chr('.ord($fx[0]).')"=>"'.$fx[0].'"<br><br>'; // result: "chr(195)"=>"�"
How to convert or remove � ?
28.05.2020 Update: Solution found, thanks to Andrea Pollini!
Some notes:
iconv('UTF-8', 'UTF-8//IGNORE', $fx); // IGNORE is broken in PHP since - https://www.php.net/manual/en/function.iconv.php#108643 - use mb_convert_encoding
Here was my real problem (I figured it out later, after many tests):
$P["T"] .= $text; // here was the problem, array is converting strings... (don't know why?)
changed to:
ini_set('mbstring.substitute_character', "none"); // make mb_convert_encoding drop invalid bytes instead of substituting "?"
$P["T"] .= mb_convert_encoding($text, 'UTF-8', 'UTF-8');
Now it's working. But if somebody knows why arrays are converting strings and how to disable that, would be great. :)
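For readers who want to try the "convert UTF-8 to UTF-8" cleanup in isolation, here is a minimal sketch. The helper name scrub_utf8 and the sample string are made up for illustration; the sketch uses mb_substitute_character() at runtime rather than ini_set(), which should have the same effect, and it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (not from the original post): drop bytes that are
// not valid UTF-8 instead of replacing them with "?".
function scrub_utf8(string $text): string
{
    $previous = mb_substitute_character();   // remember the current setting
    mb_substitute_character('none');         // invalid bytes are discarded
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);      // restore the old setting
    return $clean;
}

// Valid UTF-8 "ü" (0xC3 0xBC) plus one stray byte 0x81.
$dirty = "gr\xC3\xBCn \x81 ok";
$clean = scrub_utf8($dirty);

var_dump(mb_check_encoding($dirty, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($clean, 'UTF-8')); // bool(true)
```

Note that this only removes invalid bytes. It does not recover the characters they were meant to be.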
First, configure mbstring to discard characters it cannot convert:
<?php
ini_set('mbstring.substitute_character', "none");
?>
Next, you can use mb_convert_encoding:
mb_convert_encoding($fx, "UTF-8", mb_detect_encoding($fx, "UTF-8, ISO-8859-1, ISO-8859-15", true));
You can add any other encodings you need to the list passed to mb_detect_encoding.
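The lookup table in the question appears to match code page 850 (DOS Latin-1), so if that assumption holds, the conversion can be left to mbstring instead of strtr. The sketch below also shows why the second echo in the question prints chr(195): $fx[0] is one byte, and "ü" takes two bytes in UTF-8. The function name cp850_to_utf8 is hypothetical.

```php
<?php
// Assumption: bytes that are not already valid UTF-8 are CP850.
function cp850_to_utf8(string $bytes): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;                       // leave valid UTF-8 untouched
    }
    return mb_convert_encoding($bytes, 'UTF-8', 'CP850');
}

$raw  = "\x81ber";                // chr(129) . 'ber'
$utf8 = cp850_to_utf8($raw);      // "über"

// String offsets address bytes, not characters.
echo ord($utf8[0]), "\n";                    // 195 (0xC3, first byte of "ü")
echo mb_substr($utf8, 0, 1, 'UTF-8'), "\n";  // ü
```

This checks the whole string at once, so a file that mixes CP850 and UTF-8 within the same text would still need to be split and handled piece by piece.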
I'm currently trying to figure out how to convert an ASCII encoded string to ISO-8859-1 encoding to be used for utf8_encode() to display special characters like "ñ" but I can't seem to make it work. In need of help.
I've already tried this iconv(mb_detect_encoding($text, mb_detect_order(), true), "ISO-8859-1", $text); and this mb_convert_encoding($text, "ISO-8859-1"); and also this mb_convert_encoding($text, "ASCII", "ISO-8859-1"); but it doesn't work, the string is still ASCII encoded.
I've created a temporary solution for this by creating a lookup table using the string provided by reading each character of the string. But I want to use the php built-in functions, is this possible?
Here is my code:
<?php
function convertString($text) {
$text = iconv(mb_detect_encoding($text, mb_detect_order(), true), "ISO-8859-1", $text);
echo mb_detect_encoding($text) .'<br/>'; // to check what encoding the string is in, displays ASCII
return utf8_encode($text);
}
echo convertString('\xc3\xb1');
?>
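One thing worth checking in the code above is the quoting: PHP only interprets \x escapes inside double-quoted strings, so '\xc3\xb1' in single quotes is eight plain ASCII characters, which would explain why the string is detected as ASCII. A small sketch of the difference (variable names are illustrative):

```php
<?php
$single = '\xc3\xb1';   // eight ASCII characters: \ x c 3 \ x b 1
$double = "\xc3\xb1";   // two bytes: the UTF-8 encoding of "ñ"

var_dump(strlen($single));                      // int(8)
var_dump(strlen($double));                      // int(2)
var_dump(mb_check_encoding($double, 'UTF-8'));  // bool(true)

// Converting UTF-8 "ñ" to ISO-8859-1 gives the single byte 0xF1.
$latin1 = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');
var_dump(bin2hex($latin1));                     // string(2) "f1"
```

If the input is already UTF-8, as "\xc3\xb1" is, calling utf8_encode on it afterwards would encode it a second time.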
Could someone explain why the output is ASCII in the last three tests below?
I get the same results on my own system, PHPTester.net, and PhpFiddle.org.
echo mb_internal_encoding(); // UTF-8
$str = 'foobar';
echo mb_check_encoding($str, 'UTF-8'); // true
echo mb_detect_encoding($str); // ASCII
$encoded = utf8_encode($str);
echo mb_detect_encoding($encoded); // ASCII
$converted = mb_convert_encoding($str, 'UTF-8');
echo mb_detect_encoding($converted); // ASCII
That would be because there are no characters in foobar that cannot be represented in ASCII.
mb_check_encoding($str, 'UTF-8') works because ASCII text is innately compatible with UTF-8 (deliberately so)
But in the absence of multi-byte characters, there's no discernible difference between the two. Proof of this: 'foobar' === utf8_encode('foobar') // true
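A short sketch of the same point, assuming the mbstring extension is available. It uses mb_convert_encoding with an explicit ISO-8859-1 source in place of utf8_encode, which does the same conversion; the sample strings are made up.

```php
<?php
$ascii = 'foobar';
$mixed = "f\xC3\xB6\xC3\xB6bar";   // "fööbar" encoded as UTF-8

// Pure ASCII text is valid in both encodings and is unchanged by conversion.
var_dump(mb_check_encoding($ascii, 'ASCII'));   // bool(true)
var_dump(mb_check_encoding($ascii, 'UTF-8'));   // bool(true)
var_dump($ascii === mb_convert_encoding($ascii, 'UTF-8', 'ISO-8859-1')); // bool(true)

// Once a multi-byte character appears, ASCII is ruled out.
var_dump(mb_check_encoding($mixed, 'ASCII'));   // bool(false)
var_dump(mb_detect_encoding($mixed, ['ASCII', 'UTF-8'], true)); // string(5) "UTF-8"
```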
I am new to encoding so please be patient.
I am working on a system where a user uploads a CSV; what I need to do is display the content and then save it in the database (UTF-8 encoding).
I have been asked to fix an issue with some French characters that weren't displayed correctly. I have almost solved the problem; I am now displaying characters such as
ÀàÂâÆÄäÇçÉéÈèÊêËëÎîÏïÔôœÖöÙùÛûÜüÿ
However the two mentioned in the title Ÿ Œ are not displayed correctly yet on the webpage.
Here is my php code so far:
// say in the csv we have "ÖüÜߟÀàÂ"
$content = file_get_contents(addslashes($file_name));
var_dump($content); // output: string(54) "���ߟ��� "
if(!mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true)){
$data = iconv('macintosh', 'UTF-8', $content);
}
// deal with known encoding types
else if(mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true) == 'ISO-8859-1'){
//$data = mb_convert_encoding($content, 'UTF-8', mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true)); // does not work
$data = iconv('ISO-8859-1', 'UTF-8', $content); //does not work
}else if(mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true) == 'UTF-8'){
$data = $content;
}
//if i print $data "Ÿ Œ " are not printed out... they got lost somewhere
//do more stuff here
The file I am dealing with is detected as ISO-8859-1 (printing mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true) displays ISO-8859-1).
Does anyone have an idea how to deal with these special cases?
The characters Ÿ and Œ are not representable in ISO-8859-1. It seems that the incoming data is actually windows-1252 (Windows Latin 1) encoded, since windows-1252 has graphic characters, including Ÿ and Œ, in some code positions that are reserved for control characters in ISO-8859-1.
So you should probably add windows-1252 to the list of recognized encodings and treat recognized ISO-8859-1 as windows-1252, i.e. use iconv('windows-1252', 'UTF-8', $content) even when ISO-8859-1 has been recognized. Windows-1252 data mislabeled as ISO-8859-1 is very common.
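To see the difference concretely, here is a small sketch that converts the same two bytes under both assumptions. It uses mb_convert_encoding rather than iconv so the example depends only on mbstring; the byte values 0x9F and 0x8C are where Windows-1252 places Ÿ and Œ.

```php
<?php
$bytes = "\x9F \x8C";   // "Ÿ Œ" as stored in a Windows-1252 file

$asLatin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
$asCp1252 = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');

// Read as ISO-8859-1, the bytes become the control characters U+009F and U+008C.
echo bin2hex($asLatin1), "\n";   // c29f20c28c

// Read as Windows-1252, they become U+0178 (Ÿ) and U+0152 (Œ).
echo bin2hex($asCp1252), "\n";   // c5b820c592
```

The control characters are invisible in a browser, which would match the symptom of the two letters getting lost.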
I have a string that looks like this "v\u00e4lkommen till mig" that I get after doing utf8_encode() on the string.
I would like that string to become
välkommen till mig
where the character
\u00e4 = ä = &auml;
How can I achieve this in PHP?
Do not use utf8_(de|en)code. It just converts from UTF-8 to ISO-8859-1 and back. ISO-8859-1 does not provide the same characters as ISO-8859-15 or Windows-1252, which are the most used encodings (besides UTF-8). Better use mb_convert_encoding.
"v\u00e4lkommen till mig" > This string looks like a JSON-encoded string, which IS already UTF-8 encoded. The Unicode code point of "ä" is U+00E4 >> \u00e4.
Example
<?php
header('Content-Type: text/html; charset=utf-8');
$json = '"v\u00e4lkommen till mig"';
var_dump(json_decode($json)); //It will return a utf8 encoded string "välkommen till mig"
What is the source of this string?
There is no need to replace the ä with its HTML representation &auml; if you print it in a UTF-8 encoded document and tell the browser which encoding is used. If it is necessary, use htmlentities:
<?php
$json = '"v\u00e4lkommen till mig"';
$string = json_decode($json);
echo htmlentities($string, ENT_COMPAT, 'UTF-8');
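If the string does come from json_encode (an assumption, since the question does not say where it originates), the round trip can be sketched like this. JSON_UNESCAPED_UNICODE is a standard json_encode flag that keeps non-ASCII characters as-is instead of writing \uXXXX escapes.

```php
<?php
$text = "v\u{00E4}lkommen till mig";   // UTF-8 string, PHP 7+ escape syntax

// json_encode escapes non-ASCII characters by default.
echo json_encode($text), "\n";                          // "v\u00e4lkommen till mig"
echo json_encode($text, JSON_UNESCAPED_UNICODE), "\n";  // "välkommen till mig"

// json_decode turns the escape back into the UTF-8 character.
echo json_decode('"v\u00e4lkommen till mig"'), "\n";   // välkommen till mig

// Only needed if the output document is not served as UTF-8.
echo htmlentities($text, ENT_COMPAT, 'UTF-8'), "\n";    // v&auml;lkommen till mig
```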
Edit: Since you want to keep HTML characters, and I now think your source string isn't quite what you posted (I think it is actual unicode, rather than containing \unnnn as a string), I think your best option is this:
$html = str_replace( '&amp;', '&', str_replace( '&gt;', '>', str_replace( '&lt;', '<', htmlentities( $whatever ) ) ) );
(note: no call to utf8_decode)
Original answer:
There is no direct conversion. First, decode it again:
$decoded = utf8_decode( $whatever );
then encode as HTML:
$html = htmlentities( $decoded );
and of course you can do it without a variable:
$html = htmlentities( utf8_decode( $whatever ) );
http://php.net/manual/en/function.utf8-decode.php
http://php.net/manual/en/function.htmlentities.php
To do this by regular expression (not recommended, likely slower, less reliable), you can use the fact that HTML supports &#xnnnn; constructs, where the nnnn is the same as your existing \unnnn values. So you can say:
$html = preg_replace( '/\\\\u([0-9a-f]{4})/i', '&#x$1;', $whatever );
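As a quick check of the regular expression above, the sketch below applies it to a sample string that contains a literal \u00e4 and then decodes the resulting numeric entity. The variable names are illustrative, and the pattern only handles four-digit escapes as written.

```php
<?php
// Single quotes keep the backslash, so this really contains "\u00e4" as text.
$whatever = 'v\u00e4lkommen till mig';

// Rewrite each \uXXXX escape as the numeric HTML entity &#xXXXX;
$html = preg_replace('/\\\\u([0-9a-f]{4})/i', '&#x$1;', $whatever);
echo $html, "\n";   // v&#x00e4;lkommen till mig

// Decoding the entity gives the UTF-8 character back.
$plain = html_entity_decode($html, ENT_QUOTES, 'UTF-8');
echo $plain, "\n";  // välkommen till mig
```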
The html_entity_decode worked for me.
$json = '"v\u00e4lkommen till mig"';
echo $decoded = html_entity_decode( json_decode($json) );