Detect file encoding in PHP

I have a script which combines a number of files into one, and it breaks when one of the files has UTF8 encoding. I figure that I should be using the utf8_decode() function when reading the files, but I don't know how to tell which need decoding.
My code is basically:
$output = '';
foreach ($files as $filename) {
$output .= file_get_contents($filename) . "\n";
}
file_put_contents('combined.txt', $output);
Currently, at the start of a UTF8 file, it adds these characters in the output: ï»¿

Try using the mb_detect_encoding function. This function will examine your string and attempt to "guess" what its encoding is. You can then convert it as desired. As brulak suggested, however, you're probably better off converting to UTF-8 rather than from, to preserve the data you're transmitting.
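A minimal sketch of that idea applied to the script in the question, assuming the mbstring extension is loaded. The helper names toUtf8 and combineFiles and the fallback list are illustrative assumptions, not part of the original post. The sketch checks for valid UTF-8 first and only guesses among the fallbacks when that check fails, because guessing between UTF-8 and a single-byte encoding can differ across PHP versions.

```php
<?php
// Hypothetical helper: normalise one file's contents to UTF-8 without a BOM.
function toUtf8(string $data, array $fallbacks = ['Windows-1252', 'ISO-8859-1']): string
{
    // Drop a leading UTF-8 byte order mark (EF BB BF) if present.
    if (strncmp($data, "\xEF\xBB\xBF", 3) === 0) {
        $data = substr($data, 3);
    }
    // Valid UTF-8 (including plain ASCII) is returned unchanged.
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data;
    }
    // Otherwise guess among the fallback candidates (strict mode).
    $enc = mb_detect_encoding($data, $fallbacks, true);
    if ($enc === false) {
        return $data; // unknown encoding: leave the bytes untouched
    }
    return mb_convert_encoding($data, 'UTF-8', $enc);
}

// Combine the files as in the question, converting each one first.
function combineFiles(array $files): string
{
    $output = '';
    foreach ($files as $filename) {
        $output .= toUtf8(file_get_contents($filename)) . "\n";
    }
    return $output;
}
```

With these in place, the last line of the original script would become file_put_contents('combined.txt', combineFiles($files));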

To make sure that the output is UTF-8, no matter what kind of input it was, I use this check:
if (!mb_check_encoding($output, 'UTF-8')
    || $output !== mb_convert_encoding(mb_convert_encoding($output, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32')) {
    $output = mb_convert_encoding($output, 'UTF-8', 'pass');
}
// $output is now safely converted to UTF-8!

The mb_detect_encoding function should be your last choice, since it can return the wrong encoding. The Linux command file -i /path/myfile.txt works well. In PHP you could use:
function _detectFileEncoding($filepath) {
    // VALIDATE $filepath !!!
    $output = array();
    // escapeshellarg() keeps a hostile file name from injecting shell commands
    exec('file -i ' . escapeshellarg($filepath), $output);
    if (isset($output[0])) {
        $ex = explode('charset=', $output[0]);
        return isset($ex[1]) ? $ex[1] : null;
    }
    return null;
}

This is my solution which worked like a charm:
//check string strict for encoding out of list of supported encodings
$enc = mb_detect_encoding($str, mb_list_encodings(), true);
if ($enc===false){
//could not detect encoding
}
else if ($enc!=="UTF-8"){
$str = mb_convert_encoding($str, "UTF-8", $enc);
}
else {
//UTF-8 detected
}

For Linux servers, I use this command:
$file = 'your/file.ext';
exec( "from=`file -bi $file | awk -F'=' '{print $2 }'` && iconv -f \$from -t utf-8 $file -o $file" );

This scans the whole file and can find any encoding from mb_list_encodings(), with good performance:
function detectFileEncoding($filePath) {
    $handle = fopen($filePath, 'r');
    if ($handle === false) {
        return false;
    }
    $row = fgets($handle);
    $encodings = mb_list_encodings();
    // Try my favorite encodings first
    $encoding = mb_detect_encoding($row, "UTF-8, ASCII, Windows-1252, Windows-1254");
    if ($encoding !== false) {
        // Remove the first guess from the list of remaining candidates
        $key = array_search($encoding, $encodings);
        if ($key !== false) {
            unset($encodings[$key]);
        }
        $encodings = array_values($encodings);
    }
    $encKey = 0;
    while (($row = fgets($handle)) !== false) {
        if ($encoding === false) {
            if (!isset($encodings[$encKey])) {
                break; // no candidates left
            }
            $encoding = $encodings[$encKey++];
        }
        // On a mismatch, drop this candidate and start over from the top of the file
        if (!mb_check_encoding($row, $encoding)) {
            $encoding = false;
            rewind($handle);
        }
    }
    fclose($handle);
    return $encoding;
}

How are you going to handle the non-ASCII characters from the UTF-8 or 16 or 32 file?
I ask because I think you may have a design issue here.
I would convert your output file into UTF-8 (or 16 or 32) instead of the other way around.
Then you won't have this problem.
Have you also considered the security issues that may arise from converting an escaped UTF-8 code? See this comment:
Detecting multi-byte encoding
Figure out what encoding your source file is in, then convert it to UTF-8, and you should be good to go.
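One cheap check to run before any guessing is to look for a byte order mark at the start of the file, since a BOM states the Unicode encoding outright. The sketch below assumes the file contents are already in a string; the function name detectBom is hypothetical.

```php
<?php
// Hypothetical helper: return the encoding implied by a leading BOM, or null.
function detectBom(string $data): ?string
{
    // Longer marks first: the UTF-32LE mark begins with the UTF-16LE one.
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        if (strncmp($data, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null; // no BOM: the encoding has to be determined another way
}
```

A null result does not mean the file is not UTF-8; many UTF-8 files are written without a BOM.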

I recently encountered this issue and the mb_convert_encoding() function output was UTF-8.
After taking a look at the response headers, there wasn't anything mentioning the encoding type, so I found Set HTTP header to UTF-8 using PHP, which proposes the following:
<?php
header('Content-Type: text/html; charset=utf-8');
After adding that to the top of the PHP file, all of the funky characters went away and the page rendered as it should. I am not sure if that's the issue the original poster was asking about, but I found this while trying to solve the issue myself and figured I'd share.

<?php
$file = 'myfile.csv';
function detect_encoding($file){
return mb_detect_encoding(file_get_contents($file), mb_list_encodings());
}
if ( detect_encoding($file) == 'ISO-8859-1' ) {
echo "ISO-8859-1 detected";
}

You can try this. I use the method below to check whether the text is ISO-8859-2; it works by looking for Polish characters.
public static function findEncoding($text)
{
$plUTF8 = array("ą","ę","ć","ż","ź","ł","ó","ń");
// (A commented-out list of several hundred iconv encoding names stood here; only the candidates below are tried.)
$lista = array('WINDOWS-1250',"CP852","CP850","ISO-8859-2","ISO-8859-1","UTF-8");
$wyniki = array();
foreach($lista as $ixL => $code)
{
$wyniki[] = array('code'=>$code, 'result'=>0, 'text' => iconv( $code, 'UTF-8//IGNORE', $text) );
}
foreach($plUTF8 as $ixxx => $char)
{
foreach ($wyniki as $wX => $wRes)
{
if(is_numeric(strpos($wRes['text'], $char) ))
{
$wyniki[$wX]['result']++;
}
}
}
$findInx = 0;
$max = 0;
foreach ($wyniki as $wX => $wRes)
{
if($wyniki[$wX]['result'] > $max)
{
$max = $wyniki[$wX]['result'];
$findInx = $wX;
}
}
$encodingIn =$wyniki[$findInx]['code'];
$encodingOut ='UTF-8';
// $ret = iconv( $encodingIn, $encodingOut, $text);
// return $ret;
return $encodingIn;
}

Related

Getting "Â" symbol from email message body from "&nbsp" but still encoding as UTF-8 [duplicate]

I'm reading out lots of texts from various RSS feeds and inserting them into my database.
Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO 8859-1.
Unfortunately, there are sometimes problems with the encodings of the texts. Example:
The "ß" in "Fußball" should look like this in my database: "ÃŸ". If it is a "ÃŸ", it is displayed correctly.
Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃƒÅ¸". Then it is displayed wrongly, of course.
In other cases, the "ß" is saved as a "ß" - so without any change. Then it is also displayed wrongly.
What can I do to avoid the cases 2 and 3?
How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is but when must I use the functions?) and when must I do nothing with the input?
How do I make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:
How do I find out what encoding the text uses?
How do I convert it to UTF-8 - whatever the old encoding is?
Would a function like this work?
function correct_encoding($text) {
$current_encoding = mb_detect_encoding($text, 'auto');
$text = iconv($current_encoding, 'UTF-8', $text);
return $text;
}
I've tested it, but it doesn't work. What's wrong with it?
If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output.
I made a function that addresses all these issues. It's called Encoding::toUTF8().
You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.
I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
Download:
https://github.com/neitanod/forceutf8
I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
Examples:
echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃƒÂ©dÃƒÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃƒÆ’Ã‚Â©dÃƒÆ’Ã‚Â©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃƒÆ’Ã†â€™Ãƒâ€šÃ‚Â©dÃƒÆ’Ã†â€™Ãƒâ€šÃ‚Â©ration Camerounaise de Football");
will output:
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().
You first have to detect what encoding has been used. As you’re parsing RSS feeds (probably via HTTP), you should read the encoding from the charset parameter of the Content-Type HTTP header field. If it is not present, read the encoding from the encoding attribute of the XML processing instruction. If that’s missing too, use UTF-8 as defined in the specification.
Here is what I probably would do:
I’d use cURL to send and fetch the response. That allows you to set specific header fields and fetch the response header as well. After fetching the response, you have to parse the HTTP response and split it into header and body. The header should then contain the Content-Type header field that contains the MIME type and (hopefully) the charset parameter with the encoding/charset too. If not, we’ll analyse the XML PI for the presence of the encoding attribute and get the encoding from there. If that’s also missing, the XML specs define to use UTF-8 as encoding.
$url = 'http://www.lr-online.de/storage/rss/rss/sport.xml';
$accept = array(
'type' => array('application/rss+xml', 'application/xml', 'application/rdf+xml', 'text/xml'),
'charset' => array_diff(mb_list_encodings(), array('pass', 'auto', 'wchar', 'byte2be', 'byte2le', 'byte4be', 'byte4le', 'BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'))
);
$header = array(
'Accept: '.implode(', ', $accept['type']),
'Accept-Charset: '.implode(', ', $accept['charset']),
);
$encoding = null;
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
$response = curl_exec($curl);
if (!$response) {
    // error fetching the response
} else {
    $offset = strpos($response, "\r\n\r\n");
    $header = substr($response, 0, $offset);
    // Split off the body up front; it is needed whether or not the header names a charset
    $body = substr($response, $offset + 4);
    if (!$header || !preg_match('/^Content-Type:\s+([^;\r\n]+)(?:;\s*charset=([^\s;]+))?/im', $header, $match)) {
        // error parsing the response
    } else {
        if (!in_array(strtolower(trim($match[1])), array_map('strtolower', $accept['type']))) {
            // type not accepted
        }
        // $match[2] is absent when the header carries no charset parameter
        $encoding = isset($match[2]) ? trim($match[2], '"\'') : null;
    }
    if (!$encoding) {
        // Fall back to the encoding attribute of the XML declaration
        if (preg_match('/^<\?xml\s+version=(?:"[^"]*"|\'[^\']*\')\s+encoding=("[^"]*"|\'[^\']*\')/s', $body, $match)) {
            $encoding = trim($match[1], '"\'');
        }
    }
    if (!$encoding) {
        $encoding = 'utf-8';
    } else {
        // Compare case-insensitively: servers send both "UTF-8" and "utf-8"
        $encoding = strtolower($encoding);
        if (!in_array($encoding, array_map('strtolower', $accept['charset']))) {
            // encoding not accepted
        }
        if ($encoding != 'utf-8') {
            $body = mb_convert_encoding($body, 'utf-8', $encoding);
        }
    }
    $simpleXML = simplexml_load_string($body, null, LIBXML_NOERROR);
    if (!$simpleXML) {
        // parse error
    } else {
        echo $simpleXML->asXML();
    }
}
Detecting the encoding is hard.
mb_detect_encoding works by guessing, based on a number of candidates that you pass it. In some encodings, certain byte sequences are invalid, and therefore it can distinguish between various candidates. Unfortunately, there are a lot of encodings where the same bytes are valid (but mean different things). In these cases, there is no way to determine the encoding; you can implement your own logic to make guesses in these cases. For example, data coming from a Japanese site might be more likely to have a Japanese encoding.
As long as you only deal with Western European languages, the three major encodings to consider are utf-8, iso-8859-1 and cp-1252. Since these are the defaults on many platforms, they are also the ones most likely to be reported wrongly. If people use a different encoding, they are likely to be frank about it, since otherwise their software would break very often. Therefore, a good strategy is to trust the provider, unless the encoding is reported as one of those three. You should still double-check that the data is indeed valid, using mb_check_encoding (note that being valid is not the same as being correct: the same input may be valid in many encodings). If the reported encoding is one of those three, you can then use mb_detect_encoding to distinguish between them. Luckily that is fairly deterministic; you just need to use the proper detect sequence, which is UTF-8,ISO-8859-1,WINDOWS-1252.
Once you've detected the encoding, you need to convert the data to your internal representation (UTF-8 is the only sane choice). The function utf8_encode transforms ISO-8859-1 to UTF-8, so it can only be used for that particular input type. For other encodings, use mb_convert_encoding.
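The strategy above could be sketched roughly as follows. The function name resolveEncoding and its parameters are assumptions made for illustration. The sketch also deviates slightly from the detect sequence given above: it tests UTF-8 validity explicitly first and only then guesses between the single-byte candidates, to keep the result predictable across PHP versions.

```php
<?php
// Hypothetical helper: decide which encoding to convert from.
// $declared is whatever the provider claimed (HTTP header, XML declaration), or null.
function resolveEncoding(string $data, ?string $declared): ?string
{
    // Declarations that are often platform defaults and therefore often wrong.
    $suspect = ['UTF-8', 'ISO-8859-1', 'WINDOWS-1252', 'CP1252'];
    if ($declared !== null && !in_array(strtoupper($declared), $suspect, true)) {
        try {
            // Trust an unusual declaration as long as the bytes are valid for it.
            if (mb_check_encoding($data, $declared)) {
                return $declared;
            }
        } catch (\ValueError $e) {
            // mbstring does not know this encoding name; fall through to guessing.
        }
    }
    if (mb_check_encoding($data, 'UTF-8')) {
        return 'UTF-8';
    }
    // Not valid UTF-8: guess between the two Western single-byte encodings.
    $guess = mb_detect_encoding($data, ['ISO-8859-1', 'Windows-1252'], true);
    return $guess === false ? null : $guess;
}
```

The returned name can be passed as the source encoding to mb_convert_encoding($data, 'UTF-8', $name); a null result means no candidate fit and the data should be handled separately.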
This cheatsheet lists some common caveats related to UTF-8 handling in PHP:
http://developer.loftdigital.com/blog/php-utf-8-cheatsheet
This function detecting multibyte characters in a string might also prove helpful (source):
function detectUTF8($string)
{
return preg_match('%(?:
[\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
|\xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
|\xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
|\xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
|[\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
|\xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)+%xs',
$string);
}
A little heads up: you said that the "ß" should be displayed as "ÃŸ" in your database.
This is probably because you're using a database with Latin-1 character encoding, or because your PHP-MySQL connection is set up wrongly. That is, PHP believes MySQL expects UTF-8 and so sends the data as UTF-8, but MySQL believes PHP is sending data encoded as ISO 8859-1, so it encodes the data to UTF-8 once more, causing this kind of trouble.
Take a look at mysql_set_charset (mysqli_set_charset in the mysqli extension). It may help you.
Your encoding looks like you encoded into UTF-8 twice; that is, from some other encoding, into UTF-8, and again into UTF-8. As if you had ISO 8859-1, converted from ISO 8859-1 to UTF-8, and treated the new string as ISO 8859-1 for another conversion into UTF-8.
Here's some pseudocode of what you did:
$inputstring = getFromUser();
$utf8string = iconv($current_encoding, 'utf-8', $inputstring);
$flawedstring = iconv($current_encoding, 'utf-8', $utf8string);
You should try:
1. Detect the encoding using mb_detect_encoding() or whatever you like to use.
2. If it's UTF-8, convert into ISO 8859-1, and repeat step 1.
3. Finally, convert back into UTF-8.
That is presuming that in the "middle" conversion you used ISO 8859-1. If you used Windows-1252, then convert into Windows-1252 (latin1). The original source encoding is not important; the one you used in the flawed second conversion is.
This is my guess at what happened; there's very little else you could have done to get four bytes in place of one extended ASCII byte.
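A small sketch of that double conversion and its reversal, assuming ISO 8859-1 was the encoding wrongly assumed in the second step. The variable names are illustrative, and mb_convert_encoding stands in for iconv.

```php
<?php
// "ß" as correct UTF-8 is the two bytes C3 9F.
$original = "\xC3\x9F";

// Flawed second conversion: the UTF-8 bytes are treated as ISO 8859-1,
// so each of the two bytes is encoded again and becomes two bytes.
$double = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Undo it by converting back "into" ISO 8859-1, which yields the
// original UTF-8 bytes.
$repaired = mb_convert_encoding($double, 'ISO-8859-1', 'UTF-8');

echo strlen($double), "\n";   // 4
echo bin2hex($double), "\n";  // c383c29f
```

If the flawed step had assumed Windows-1252 instead, the same reversal would need 'Windows-1252' as the target encoding.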
The German language also uses ISO 8859-2 and Windows-1250 (Latin-2).
A really nice way to implement an isUTF8-function can be found on php.net:
function isUTF8($string) {
return (utf8_encode(utf8_decode($string)) == $string);
}
The interesting thing about mb_detect_encoding and mb_convert_encoding is that the order of the encodings you suggest does matter:
// $input is actually UTF-8
mb_detect_encoding($input, "ISO-8859-9, UTF-8");
// ISO-8859-9 (WRONG!)
mb_detect_encoding($input, "UTF-8, ISO-8859-9");
// UTF-8 (OK)
So you might want to use a specific order when specifying expected encodings. Still, keep in mind that this is not foolproof.
Working out the character encoding of RSS feeds seems to be complicated. Even normal web pages often omit, or lie about, their encoding.
So you could try to use the correct way to detect the encoding and then fall back to some form of auto-detection (guessing).
You need to test the character set on input since responses can come coded with different encodings.
I force all incoming content into UTF-8 by doing detection and conversion with the following function:
function fixRequestCharset()
{
$ref = array(&$_GET, &$_POST, &$_REQUEST);
foreach ($ref as &$var)
{
foreach ($var as $key => $val)
{
$encoding = mb_detect_encoding($var[$key], mb_detect_order(), true);
if (!$encoding)
continue;
if (strcasecmp($encoding, 'UTF-8') != 0)
{
$encoding = iconv($encoding, 'UTF-8', $var[$key]);
if ($encoding === false)
continue;
$var[$key] = $encoding;
}
}
}
}
That routine will turn all PHP variables that come from the remote host into UTF-8.
Or ignore the value if the encoding could not be detected or converted.
You can customize it to your needs.
Just invoke it before using the variables.
mb_detect_encoding:
echo mb_detect_encoding($str, "auto");
Or
echo mb_detect_encoding($str, "UTF-8, ASCII, ISO-8859-1");
I really don't know what the results are, but I'd suggest you just take some of your feeds with different encodings and try if mb_detect_encoding works or not.
auto is short for "ASCII,JIS,UTF-8,EUC-JP,SJIS". It returns the detected charset, which you can use to convert the string to UTF-8 with iconv.
<?php
function convertToUTF8($str) {
$enc = mb_detect_encoding($str);
if ($enc && $enc != 'UTF-8') {
return iconv($enc, 'UTF-8', $str);
} else {
return $str;
}
}
?>
I haven't tested it, so no guarantee. And maybe there's a simpler way.
I know this is an older question, but I figure a useful answer never hurts. I was having issues with my encoding between a desktop application, SQLite, and GET/POST variables. Some would be in UTF-8, some would be in ASCII, and basically everything would get screwed up when foreign characters got involved.
Here is my solution. It scrubs your GET/POST/REQUEST (I omitted cookies, but you could add them if desired) on each page load before processing. It works well in a header. PHP will throw warnings if it can't detect the source encoding automatically, so these warnings are suppressed with @'s.
//Convert everything in our vars to UTF-8 for playing nice with the database...
//Use some auto detection here to help us not double-encode...
//Suppress possible warnings with @'s for when encoding cannot be detected
try
{
    $process = array(&$_GET, &$_POST, &$_REQUEST);
    // An index loop replaces each(), which was removed in PHP 8;
    // nested arrays are appended to $process and visited later
    for ($key = 0; $key < count($process); $key++) {
        $val = $process[$key];
        foreach ($val as $k => $v) {
            unset($process[$key][$k]);
            $newKey = @mb_convert_encoding($k, 'UTF-8', 'auto');
            if (is_array($v)) {
                $process[$key][$newKey] = $v;
                $process[] = &$process[$key][$newKey];
            } else {
                $process[$key][$newKey] = @mb_convert_encoding($v, 'UTF-8', 'auto');
            }
        }
    }
    unset($process);
}
catch(Exception $ex){}
It's simple: when you get something that's not UTF-8, you must encode that into UTF-8.
So, when you're fetching a certain feed that's ISO 8859-1, pass it through utf8_encode.
However, if you're fetching a UTF-8 feed, you don't need to do anything.
harpax' answer worked for me. In my case, this is good enough:
if (isUTF8($str)) {
echo $str;
}
else
{
echo iconv("ISO-8859-1", "UTF-8//TRANSLIT", $str);
}
I have been looking for solutions to encoding problems for ages, and this page is probably the conclusion of years of searching! I tested some of the suggestions you mentioned and here are my notes:
This is my test string:
this is a "wròng wrìtten" string bùt I nèed to pù 'sòme' special
chàrs to see thèm, convertèd by fùnctìon!! & that's it!
I do an INSERT to save this string on a database in a field that is set as utf8_general_ci
The character set of my page is UTF-8.
If I do an INSERT just like that, in my database, I have some characters probably coming from Mars...
So I need to convert them into some "sane" UTF-8. I tried utf8_encode(), but still aliens chars were invading my database...
So I tried to use the function forceUTF8 posted on number 8, but in the database the string saved looks like this:
this is a "wrÃ²ng wrÃ¬tten" string bÃ¹t I nÃ¨ed to pÃ¹ 'sÃ²me' special
chÃ rs to see thÃ¨m, convertÃ¨d by fÃ¹nctÃ¬on!! & that's it!
So collecting some more information on this page and merging them with other information on other pages I solved my problem with this solution:
$finallyIDidIt = mb_convert_encoding(
$string,
mysql_client_encoding($resourceID),
mb_detect_encoding($string)
);
Now in my database I have my string with correct encoding.
NOTE:
The only thing to watch out for is the mysql_client_encoding function!
You need to be connected to the database, because this function takes a resource ID as a parameter.
But I do that re-encoding just before my INSERT anyway, so for me it is not a problem.
After sorting out your PHP scripts, don't forget to tell MySQL what charset you are passing and would like to receive.
Example: set the character set to UTF-8
Passing UTF-8 data to a Latin 1 table in a Latin 1 I/O session gives those nasty bird's feet. I see this every other day in OsCommerce shops. Going back and forth, it might seem right, but phpMyAdmin will show the truth. By telling MySQL what charset you are passing, it will handle the conversion of MySQL data for you.
How to recover existing scrambled MySQL data is another question. :)
Get the encoding from headers and convert it to UTF-8.
$post_url = 'http://website.domain';
/// Get headers ///////////////////////////////////////////////
function get_headers_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
$r = curl_exec($ch);
return $r;
}
$the_header = get_headers_curl($post_url);
/// Check for redirect ////////////////////////////////////////
if (preg_match("/Location:/i", $the_header)) {
$arr = explode('Location:', $the_header);
$location = $arr[1];
$location = explode(chr(10), $location);
$location = $location[0];
$the_header = get_headers_curl(trim($location));
}
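/// Get charset ///////////////////////////////////////////////
$charset = null; // stays null when the headers do not declare a charset
if (preg_match("/charset=/i", $the_header)) {
// split case-insensitively, to match the check above
$arr = preg_split('/charset=/i', $the_header);
$charset = $arr[1];
$charset = explode(chr(10), $charset);
$charset = trim($charset[0]); // strip the "\r" left over from the header line
}
///////////////////////////////////////////////////////////////////
// echo $charset;
// $html is assumed to already hold the fetched page body; this snippet never sets it
if ($charset && strcasecmp($charset, 'UTF-8') != 0) {
$html = iconv($charset, "UTF-8", $html);
}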
ÃŸ is Mojibake for ß. In your database, you may have one of the following hex values (use SELECT HEX(col) ... to find out):
DF if the column is "latin1",
C39F if the column is utf8 -- OR -- it is latin1, but "double-encoded"
C383C5B8 if double-encoded into a utf8 column
You should not use any encoding/decoding functions in PHP; instead, you should set up the database and the connection to it correctly.
If MySQL is involved, see: Trouble with UTF-8 characters; what I see is not what I stored
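The hex values listed above can be reproduced in PHP, which may help when comparing against the output of SELECT HEX(col). This is only an illustrative sketch; it assumes the flawed second conversion treated the bytes as Windows-1252, which is close to what MySQL calls latin1.

```php
<?php
// Start from the single Latin-1 byte for "ß".
$latin1 = "\xDF";

// Correct conversion: one layer of UTF-8.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// Flawed second conversion: the UTF-8 bytes are mistaken for Windows-1252
// and encoded to UTF-8 again ("double encoding").
$double = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');

echo strtoupper(bin2hex($latin1)), "\n"; // DF
echo strtoupper(bin2hex($utf8)), "\n";   // C39F
echo strtoupper(bin2hex($double)), "\n"; // C383C5B8
```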
if(!mb_check_encoding($str)){
$str = iconv("windows-1251", "UTF-8", $str);
}
It helped for me
Try without 'auto'
That is:
mb_detect_encoding($text)
instead of:
mb_detect_encoding($text, 'auto')
More information can be found here: mb_detect_encoding
Try to use this... every text that is not UTF-8 will be translated.
function is_utf8($str) {
return (bool) preg_match('//u', $str);
}
$myString = "Fußball";
if(!is_utf8($myString)){
$myString = utf8_encode($myString);
}
// or 1 line version ;)
$myString = !is_utf8($myString) ? utf8_encode($myString) : trim($myString);
I found a solution at http://deer.org.ua/2009/10/06/1/:
class Encoding
{
/**
* http://deer.org.ua/2009/10/06/1/
* @param $string
* @return null
*/
public static function detect_encoding($string)
{
static $list = ['utf-8', 'windows-1251'];
foreach ($list as $item) {
try {
$sample = iconv($item, $item, $string);
} catch (\Exception $e) {
continue;
}
if (md5($sample) == md5($string)) {
return $item;
}
}
return null;
}
}
$content = file_get_contents($file['tmp_name']);
$encoding = Encoding::detect_encoding($content);
if ($encoding != 'utf-8') {
$result = iconv($encoding, 'utf-8', $content);
} else {
$result = $content;
}
I think that @ is a bad decision and made some changes to the solution from deer.org.ua.
When you try to handle multi languages, like Japanese and Korean, you might get in trouble.
mb_convert_encoding with the 'auto' parameter doesn't work well. Setting mb_detect_order('ASCII,UTF-8,JIS,EUC-JP,SJIS,EUC-KR,UHC') doesn't help since it will detect EUC-* wrongly.
I concluded that as long as the input strings come from HTML, the 'charset' in a meta element should be used. I use Simple HTML DOM Parser because it supports invalid HTML.
The below snippet extracts the title element from a web page. If you would like to convert the entire page, then you may want to remove some lines.
<?php
require_once 'simple_html_dom.php';
echo convert_title_to_utf8(file_get_contents($argv[1])), PHP_EOL;
function convert_title_to_utf8($contents)
{
$dom = str_get_html($contents);
$title = $dom->find('title', 0);
if (empty($title)) {
return null;
}
$title = $title->plaintext;
$metas = $dom->find('meta');
$charset = 'auto';
foreach ($metas as $meta) {
if (!empty($meta->charset)) { // HTML5
$charset = $meta->charset;
} else if (preg_match('#charset=(.+)#', $meta->content, $match)) {
$charset = $match[1];
}
}
if (!in_array(strtolower($charset), array_map('strtolower', mb_list_encodings()))) {
$charset = 'auto';
}
return mb_convert_encoding($title, 'UTF-8', $charset);
}
This version is for the German language, but you can modify the $CHARSETS and the $TESTCHARS.
class CharsetDetector
{
private static $CHARSETS = array(
"ISO_8859-1",
"ISO_8859-15",
"CP850"
);
private static $TESTCHARS = array(
"€",
"ä",
"Ä",
"ö",
"Ö",
"ü",
"Ü",
"ß"
);
public static function convert($string)
{
return self::__iconv($string, self::getCharset($string));
}
public static function getCharset($string)
{
$normalized = self::__normalize($string);
if(!strlen($normalized))
return "UTF-8";
$best = "UTF-8";
$charcountbest = 0;
foreach (self::$CHARSETS as $charset)
{
$str = self::__iconv($normalized, $charset);
$charcount = 0;
$stop = mb_strlen($str, "UTF-8");
for($idx = 0; $idx < $stop; $idx++)
{
$char = mb_substr($str, $idx, 1, "UTF-8");
foreach (self::$TESTCHARS as $testchar)
{
if($char == $testchar)
{
$charcount++;
break;
}
}
}
if($charcount > $charcountbest)
{
$charcountbest = $charcount;
$best = $charset;
}
//echo $text . "<br />";
}
return $best;
}
private static function __normalize($str)
{
$len = strlen($str);
$ret = "";
for($i = 0; $i < $len; $i++)
{
$c = ord($str[$i]);
if ($c > 128) {
if (($c > 247))
$ret .= $str[$i];
elseif
($c > 239) $bytes = 4;
elseif
($c > 223) $bytes = 3;
elseif
($c > 191) $bytes = 2;
else
$ret .= $str[$i];
if (($i + $bytes) > $len)
$ret .= $str[$i];
$ret2 = $str[$i];
while ($bytes > 1)
{
$i++;
$b = ord($str[$i]);
if ($b < 128 || $b > 191)
{
$ret .= $ret2;
$ret2 = "";
$i += $bytes-1;
$bytes = 1;
break;
}
else
$ret2 .= $str[$i];
$bytes--;
}
}
}
return $ret;
}
private static function __iconv($string, $charset)
{
return iconv ($charset, "UTF-8", $string);
}
}
I had the same issue with phpQuery (ISO-8859-1 instead of UTF-8) and this hack helped me:
$html = '<?xml version="1.0" encoding="UTF-8" ?>' . $html;
mb_internal_encoding('UTF-8'), phpQuery::newDocumentHTML($html, 'utf-8'), mbstring.internal_encoding and other manipulations didn't take any effect.
Chinese text is commonly encoded in GBK. In addition, when I tested it, the most-voted answer didn't work. Here is a simple fix that makes it work as well:
function toUTF8($raw) {
try{
return mb_convert_encoding($raw, "UTF-8", "auto");
}catch(\Exception $e){
return mb_convert_encoding($raw, "UTF-8", "GBK");
}
}
Remark: This solution was written in 2017 and should fix problems for PHP in those days. I have not tested whether the latest PHP already understands auto correctly.
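As far as I can tell, mb_convert_encoding() reports a failed detection through a warning and a false return value rather than by throwing \Exception, so the catch branch above may never run. A hedged alternative is to test for valid UTF-8 first and only then fall back to GBK. The helper name below is made up for illustration, and treating every non-UTF-8 input as GBK is an assumption.

```php
<?php
// Hypothetical helper: assumes anything that is not valid UTF-8 is GBK.
function gbk_or_utf8_to_utf8(string $raw): string
{
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw; // already UTF-8, leave untouched
    }
    return mb_convert_encoding($raw, 'UTF-8', 'GBK');
}
```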

How to Find the Encoding of a file using php

I am trying to find the encoding of a file using PHP, but I can't seem to find a solution. Is there one?
I used the following code to detect it from a given list of encodings:
public function detect($filePath)
{
$fopen=fopen($filePath,'r');
$row = fgets($fopen);
$encodings = mb_list_encodings();
$encoding = mb_detect_encoding( $row, "UTF-8, ASCII, Windows-1252, Windows-1254, Windows-1255" );//these are my favorite encodings
if($encoding !== false) {
$key = array_search($encoding, $encodings); // keep the index itself, not a boolean
if ($key !== false)
unset($encodings[$key]);
$encodings = array_values($encodings);
}
$encKey = 0;
while ($row = fgets($fopen)) {
if($encoding == false){
$encoding = $encodings[$encKey++];
}
if(!mb_check_encoding($row, $encoding)){
$encoding =false;
rewind($fopen);
}
}
return $encoding;
}
For the Polish language, I solved the problem like this:
public static function findEncoding($text)
{
$plUTF8 = array("ą","ę","ć","ż","ź","ł","ó","ń");
// $lista = '437, 500, 500V1, 850, ..., WINSAMI2, WS2, YU'; // commented-out list of several hundred iconv encoding names, abbreviated here; only the candidates below are tried
$lista = array('WINDOWS-1250',"CP852","CP850","ISO-8859-2","ISO-8859-1","UTF-8");
$wyniki = array();
foreach($lista as $ixL => $code)
{
$wyniki[] = array('code'=>$code, 'result'=>0, 'text' => iconv( $code, 'UTF-8//IGNORE', $text) );
}
foreach($plUTF8 as $ixxx => $char)
{
foreach ($wyniki as $wX => $wRes)
{
if(strpos($wRes['text'], $char) !== false)
{
$wyniki[$wX]['result']++;
}
}
}
$findInx = 0;
$max = 0;
foreach ($wyniki as $wX => $wRes)
{
if($wyniki[$wX]['result'] > $max)
{
$max = $wyniki[$wX]['result'];
$findInx = $wX;
}
}
$encodingIn =$wyniki[$findInx]['code'];
$encodingOut ='UTF-8';
// $ret = iconv( $encodingIn, $encodingOut, $text);
// return $ret;
return $encodingIn;
}
Windows-1254 and Windows-1255 aren't supported in PHP; Windows-1251 and Windows-1252 are.
Check http://php.net/manual/en/mbstring.supported-encodings.php

How to decode UTF-8 only if the string has not been decoded? [duplicate]

I'm reading out lots of texts from various RSS feeds and inserting them into my database.
Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO 8859-1.
Unfortunately, there are sometimes problems with the encodings of the texts. Example:
The "ß" in "Fußball" should look like this in my database: "ÃŸ". If it is a "ÃŸ", it is displayed correctly.
Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃƒÅ¸". Then it is displayed wrongly, of course.
In other cases, the "ß" is saved as a "ß" - so without any change. Then it is also displayed wrongly.
What can I do to avoid cases 2 and 3?
How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is but when must I use the functions?) and when must I do nothing with the input?
How do I make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:
How do I find out what encoding the text uses?
How do I convert it to UTF-8 - whatever the old encoding is?
Would a function like this work?
function correct_encoding($text) {
$current_encoding = mb_detect_encoding($text, 'auto');
$text = iconv($current_encoding, 'UTF-8', $text);
return $text;
}
I've tested it, but it doesn't work. What's wrong with it?
If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output.
I made a function that addresses all these issues. It's called Encoding::toUTF8().
You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.
I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
Download:
https://github.com/neitanod/forceutf8
I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
Examples:
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
will output:
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().
You first have to detect what encoding has been used. As you’re parsing RSS feeds (probably via HTTP), you should read the encoding from the charset parameter of the Content-Type HTTP header field. If it is not present, read the encoding from the encoding attribute of the XML processing instruction. If that’s missing too, use UTF-8 as defined in the specification.
Here is what I probably would do:
I’d use cURL to send the request and fetch the response. That allows you to set specific header fields and fetch the response header as well. After fetching the response, you have to parse the HTTP response and split it into header and body. The header should then contain the Content-Type header field that contains the MIME type and (hopefully) the charset parameter with the encoding/charset too. If not, we’ll analyse the XML PI for the presence of the encoding attribute and get the encoding from there. If that’s also missing, the XML specification defines UTF-8 as the encoding to use.
$url = 'http://www.lr-online.de/storage/rss/rss/sport.xml';
$accept = array(
'type' => array('application/rss+xml', 'application/xml', 'application/rdf+xml', 'text/xml'),
'charset' => array_diff(mb_list_encodings(), array('pass', 'auto', 'wchar', 'byte2be', 'byte2le', 'byte4be', 'byte4le', 'BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'))
);
$header = array(
'Accept: '.implode(', ', $accept['type']),
'Accept-Charset: '.implode(', ', $accept['charset']),
);
$encoding = null;
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
$response = curl_exec($curl);
if (!$response) {
// error fetching the response
} else {
$offset = strpos($response, "\r\n\r\n");
$header = substr($response, 0, $offset);
// split off the body up front so it is defined on every path below
$body = substr($response, $offset + 4);
if (!$header || !preg_match('/^Content-Type:\s+([^;\r\n]+)(?:;\s*charset=([^\r\n]*))?/im', $header, $match)) {
// error parsing the response
} else {
if (!in_array(strtolower($match[1]), array_map('strtolower', $accept['type']))) {
// type not accepted
}
// the charset group is optional, so $match[2] may be missing
$encoding = isset($match[2]) ? strtolower(trim($match[2], "\"' ")) : null;
}
if (!$encoding) {
if (preg_match('/^<\?xml\s+version=(?:"[^"]*"|\'[^\']*\')\s+encoding=("[^"]*"|\'[^\']*\')/s', $body, $match)) {
$encoding = strtolower(trim($match[1], '"\''));
}
}
if (!$encoding) {
$encoding = 'utf-8';
} else {
// $encoding was lowercased above, so compare against a lowercased list
if (!in_array($encoding, array_map('strtolower', $accept['charset']))) {
// encoding not accepted
}
if ($encoding != 'utf-8') {
$body = mb_convert_encoding($body, 'utf-8', $encoding);
}
}
$simpleXML = simplexml_load_string($body, null, LIBXML_NOERROR);
if (!$simpleXML) {
// parse error
} else {
echo $simpleXML->asXML();
}
}
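The fallback chain described above (Content-Type header first, then the XML declaration, then UTF-8) can also be isolated as a pure function that is easy to test without a network call. This is a minimal sketch; the function name declared_charset() and its simplified regular expressions are assumptions, not part of the answer's code.

```php
<?php
// Hypothetical helper: returns the charset a feed declares, lowercased.
// $headers is the raw HTTP header block, $body the raw response body.
function declared_charset(string $headers, string $body): string
{
    // 1. charset parameter of the Content-Type header field
    if (preg_match('/^Content-Type:[^\r\n]*?charset\s*=\s*["\']?([^"\';\s]+)/im', $headers, $m)) {
        return strtolower($m[1]);
    }
    // 2. encoding attribute of the XML declaration
    if (preg_match('/^\s*<\?xml[^>]*?encoding\s*=\s*["\']([^"\']+)["\']/i', $body, $m)) {
        return strtolower($m[1]);
    }
    // 3. nothing declared: fall back to UTF-8
    return 'utf-8';
}
```

The returned name can then be passed to mb_convert_encoding() as the source encoding, after checking that mbstring actually supports it.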
Detecting the encoding is hard.
mb_detect_encoding works by guessing, based on a number of candidates that you pass it. In some encodings, certain byte sequences are invalid, and therefore it can distinguish between various candidates. Unfortunately, there are a lot of encodings where the same bytes are valid (but mean different things). In these cases, there is no way to determine the encoding; you can implement your own logic to make guesses. For example, data coming from a Japanese site might be more likely to have a Japanese encoding.
As long as you only deal with Western European languages, the three major encodings to consider are utf-8, iso-8859-1 and cp-1252. Since these are defaults for many platforms, they are also the most likely to be reported incorrectly. That is, people who use some other encoding are likely to declare it honestly, since otherwise their software would break very often. Therefore, a good strategy is to trust the provider, unless the encoding is reported as one of those three. You should still double-check that the data is indeed valid, using mb_check_encoding (note that valid is not the same as correct: the same input may be valid for many encodings). If it is one of those three, you can then use mb_detect_encoding to distinguish between them. Luckily that is fairly deterministic; you just need to use the proper detect sequence, which is UTF-8,ISO-8859-1,WINDOWS-1252.
Once you've detected the encoding you need to convert it to your internal representation (UTF-8 is the only sane choice). The function utf8_encode transforms ISO-8859-1 to UTF-8, so it can only be used for that particular input type. For other encodings, use mb_convert_encoding.
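A rough sketch of that strategy follows. It is not the answer author's code: the function name normalize_to_utf8() is made up, and for the three ambiguous encodings it swaps mb_detect_encoding for a plain mb_check_encoding test (valid UTF-8 wins, everything else is assumed to be Windows-1252), because that behaves the same across PHP versions.

```php
<?php
// Hypothetical helper. $declared is the charset the provider reported
// (HTTP header, XML declaration, ...) or null if none was given.
function normalize_to_utf8(string $data, ?string $declared = null): string
{
    $ambiguous = ['utf-8', 'iso-8859-1', 'windows-1252', 'cp1252'];
    $declared = $declared === null ? '' : strtolower(trim($declared));

    // Trust a declared encoding outside the three "default" ones,
    // provided mbstring knows it and the data is valid in it.
    if ($declared !== '' && !in_array($declared, $ambiguous, true)) {
        $known = array_map('strtolower', mb_list_encodings());
        if (in_array($declared, $known, true) && mb_check_encoding($data, $declared)) {
            return mb_convert_encoding($data, 'UTF-8', $declared);
        }
    }

    // Ambiguous or missing declaration: valid UTF-8 is kept as-is,
    // anything else is ASSUMED to be Windows-1252.
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data;
    }
    return mb_convert_encoding($data, 'UTF-8', 'Windows-1252');
}
```

Windows-1252 is used as the last resort because it is a superset of the printable part of ISO-8859-1, so Latin-1 input converts the same way.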
This cheatsheet lists some common caveats related to UTF-8 handling in PHP:
http://developer.loftdigital.com/blog/php-utf-8-cheatsheet
This function detecting multibyte characters in a string might also prove helpful (source):
function detectUTF8($string)
{
return preg_match('%(?:
[\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
|\xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
|\xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
|\xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
|[\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
|\xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)+%xs',
$string);
}
A little heads up. You said that the "ß" should be displayed as "ÃŸ" in your database.
This is probably because you're using a database with Latin-1 character encoding, or possibly your PHP-MySQL connection is set up wrong. That is, PHP believes your MySQL is set to use UTF-8, so it sends data as UTF-8, but your MySQL believes PHP is sending data encoded as ISO 8859-1, so it may once again try to encode your sent data as UTF-8, causing this kind of trouble.
Take a look at mysql_set_charset. It may help you.
Your encoding looks like you encoded into UTF-8 twice; that is, from some other encoding, into UTF-8, and again into UTF-8. As if you had ISO 8859-1, converted from ISO 8859-1 to UTF-8, and treated the new string as ISO 8859-1 for another conversion into UTF-8.
Here's some pseudocode of what you did:
$inputstring = getFromUser();
$utf8string = iconv($current_encoding, 'utf-8', $inputstring);
$flawedstring = iconv($current_encoding, 'utf-8', $utf8string);
You should try:
detect encoding using mb_detect_encoding() or whatever you like to use
if it's UTF-8, convert into ISO 8859-1, and repeat step 1
finally, convert back into UTF-8
That is presuming that in the "middle" conversion you used ISO 8859-1. If you used Windows-1252, then convert into Windows-1252 (latin1). The original source encoding is not important; the one you used in the flawed second conversion is.
This is my guess at what happened; there's very little else you could have done to get four bytes in place of one extended ASCII byte.
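The repair steps above can be sketched as a small function. This is only an illustration under stated assumptions: the name undo_double_utf8() is hypothetical, the flawed middle conversion is assumed to have used Windows-1252, and only one extra layer is removed per call.

```php
<?php
// Hypothetical helper: undoes ONE layer of "encoded to UTF-8 twice" damage,
// assuming the flawed conversion treated the bytes as Windows-1252.
function undo_double_utf8(string $s): string
{
    if (!mb_check_encoding($s, 'UTF-8')) {
        return $s; // not UTF-8 at all, nothing to undo
    }
    // Map each character back to the single byte it was mistaken for.
    $candidate = mb_convert_encoding($s, 'Windows-1252', 'UTF-8');

    // Accept the result only if no character was lost (exact round trip)
    // and the recovered bytes are themselves valid UTF-8.
    $lossless = mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252') === $s;
    if ($lossless && $candidate !== $s && mb_check_encoding($candidate, 'UTF-8')) {
        return $candidate;
    }
    return $s; // looks like ordinary, singly encoded text
}
```

Correctly encoded text is returned unchanged, because converting it to Windows-1252 yields bytes that are no longer valid UTF-8.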
The German language also uses ISO 8859-2 and Windows-1250 (Latin-2).
A really nice way to implement an isUTF8-function can be found on php.net:
function isUTF8($string) {
return (utf8_encode(utf8_decode($string)) == $string);
}
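Two caveats about the round-trip check above: utf8_encode() and utf8_decode() are deprecated as of PHP 8.2, and utf8_decode() replaces every character outside ISO 8859-1 with "?", so valid UTF-8 text such as Polish "ą" would be reported as not UTF-8. A minimal alternative sketch, assuming the mbstring extension is available (the function name is made up for illustration):

```php
<?php
// Hypothetical replacement for isUTF8(): true when the bytes form
// well-formed UTF-8, regardless of which characters they encode.
function is_valid_utf8(string $string): bool
{
    return mb_check_encoding($string, 'UTF-8');
}
```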
The interesting thing about mb_detect_encoding and mb_convert_encoding is that the order of the encodings you suggest does matter:
// $input is actually UTF-8
mb_detect_encoding($input, "ISO-8859-9, UTF-8");
// ISO-8859-9 (WRONG!)
mb_detect_encoding($input, "UTF-8, ISO-8859-9");
// UTF-8 (OK)
So you might want to use a specific order when specifying expected encodings. Still, keep in mind that this is not foolproof.
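The effect of candidate order can be shown deterministically with a small helper that returns the first candidate in which the bytes are valid. This is a sketch built on mb_check_encoding, not a description of how mb_detect_encoding works internally; the helper name is hypothetical, and ISO-8859-1 stands in for any single-byte encoding that accepts every byte.

```php
<?php
// Hypothetical helper: first encoding from $candidates in which $input
// is valid, or null if none matches.
function first_valid_encoding(string $input, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($input, $encoding)) {
            return $encoding;
        }
    }
    return null;
}

$input = "Fu\xC3\x9Fball"; // "Fußball" encoded as UTF-8

// Strict encoding first: the UTF-8 bytes are recognised as UTF-8.
echo first_valid_encoding($input, ['UTF-8', 'ISO-8859-1']), "\n";

// Permissive encoding first: it accepts any byte sequence and wins.
echo first_valid_encoding($input, ['ISO-8859-1', 'UTF-8']), "\n";
```

Putting the strictest encoding (usually UTF-8) first keeps a permissive single-byte encoding from shadowing it.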
Working out the character encoding of RSS feeds seems to be complicated. Even normal web pages often omit, or lie about, their encoding.
So you could try to use the correct way to detect the encoding and then fall back to some form of auto-detection (guessing).
You need to test the character set on input since responses can come coded with different encodings.
I force all incoming content into UTF-8 by doing detection and translation using the following function:
function fixRequestCharset()
{
$ref = array(&$_GET, &$_POST, &$_REQUEST);
foreach ($ref as &$var)
{
foreach ($var as $key => $val)
{
$encoding = mb_detect_encoding($var[$key], mb_detect_order(), true);
if (!$encoding)
continue;
if (strcasecmp($encoding, 'UTF-8') != 0)
{
$encoding = iconv($encoding, 'UTF-8', $var[$key]);
if ($encoding === false)
continue;
$var[$key] = $encoding;
}
}
}
}
That routine will turn all PHP variables that come from the remote host into UTF-8.
Or ignore the value if the encoding could not be detected or converted.
You can customize it to your needs.
Just invoke it before using the variables.
mb_detect_encoding:
echo mb_detect_encoding($str, "auto");
Or
echo mb_detect_encoding($str, "UTF-8, ASCII, ISO-8859-1");
I really don't know what the results are, but I'd suggest you just take some of your feeds with different encodings and try if mb_detect_encoding works or not.
auto is short for "ASCII,JIS,UTF-8,EUC-JP,SJIS". It returns the detected charset, which you can use to convert the string to UTF-8 with iconv.
<?php
function convertToUTF8($str) {
$enc = mb_detect_encoding($str);
if ($enc && $enc != 'UTF-8') {
return iconv($enc, 'UTF-8', $str);
} else {
return $str;
}
}
?>
I haven't tested it, so no guarantee. And maybe there's a simpler way.
I know this is an older question, but I figure a useful answer never hurts. I was having issues with my encoding between a desktop application, SQLite, and GET/POST variables. Some would be in UTF-8, some would be in ASCII, and basically everything would get screwed up when foreign characters got involved.
Here is my solution. It scrubs your GET/POST/REQUEST (I omitted cookies, but you could add them if desired) on each page load before processing. It works well in a header. PHP will throw warnings if it can't detect the source encoding automatically, so these warnings are suppressed with @'s.
//Convert everything in our vars to UTF-8 for playing nice with the database...
//Use some auto detection here to help us not double-encode...
//Suppress possible warnings with @'s for when encoding cannot be detected
//Note: each() was removed in PHP 8.0, so this loop needs PHP 7 or earlier
try
{
$process = array(&$_GET, &$_POST, &$_REQUEST);
while (list($key, $val) = each($process)) {
foreach ($val as $k => $v) {
unset($process[$key][$k]);
if (is_array($v)) {
$process[$key][@mb_convert_encoding($k,'UTF-8','auto')] = $v;
$process[] = &$process[$key][@mb_convert_encoding($k,'UTF-8','auto')];
} else {
$process[$key][@mb_convert_encoding($k,'UTF-8','auto')] = @mb_convert_encoding($v,'UTF-8','auto');
}
}
}
}
unset($process);
}
catch(Exception $ex){}
It's simple: when you get something that's not UTF-8, you must encode that into UTF-8.
So, when you're fetching a certain feed that's ISO 8859-1 parse it through utf8_encode.
However, if you're fetching an UTF-8 feed, you don't need to do anything.
harpax' answer worked for me. In my case, this is good enough:
if (isUTF8($str)) {
echo $str;
}
else
{
echo iconv("ISO-8859-1", "UTF-8//TRANSLIT", $str);
}
I had been looking for solutions to encoding problems for ages, and this page is probably the conclusion of years of searching! I tested some of the suggestions you mentioned and here are my notes:
This is my test string:
this is a "wròng wrìtten" string bùt I nèed to pù 'sòme' special
chàrs to see thèm, convertèd by fùnctìon!! & that's it!
I do an INSERT to save this string on a database in a field that is set as utf8_general_ci
The character set of my page is UTF-8.
If I do an INSERT just like that, in my database, I have some characters probably coming from Mars...
So I need to convert them into some "sane" UTF-8. I tried utf8_encode(), but still aliens chars were invading my database...
So I tried to use the function forceUTF8 posted on number 8, but in the database the string saved looks like this:
this is a "wròng wrìtten" string bùt I nèed to pù 'sòme' special
chà rs to see thèm, convertèd by fùnctìon!! & that's it!
So collecting some more information on this page and merging them with other information on other pages I solved my problem with this solution:
$finallyIDidIt = mb_convert_encoding(
$string,
mysql_client_encoding($resourceID),
mb_detect_encoding($string)
);
Now in my database I have my string with correct encoding.
NOTE:
Only note to take care of is in function mysql_client_encoding!
You need to be connected to the database, because this function wants a resource ID as a parameter.
But well, I just do that re-encoding before my INSERT so for me it is not a problem.
After sorting out your PHP scripts, don't forget to tell MySQL what charset you are passing and would like to receive.
Example: set the character to UTF-8
Passing UTF-8 data to a Latin 1 table in a Latin 1 I/O session gives those nasty birdfeets. I see this every other day in OsCommerce shops. Back and forth it might seem right. But phpMyAdmin will show the truth. By telling MySQL what charset you are passing, it will handle the conversion of MySQL data for you.
How to recover existing scrambled MySQL data is another question. :)
Get the encoding from headers and convert it to UTF-8.
$post_url = 'http://website.domain';
/// Get headers ///////////////////////////////////////////////
function get_headers_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
$r = curl_exec($ch);
return $r;
}
$the_header = get_headers_curl($post_url);
/// Check for redirect ////////////////////////////////////////
if (preg_match("/Location:/i", $the_header)) {
$arr = explode('Location:', $the_header);
$location = $arr[1];
$location = explode(chr(10), $location);
$location = $location[0];
$the_header = get_headers_curl(trim($location));
}
/// Get charset ///////////////////////////////////////////////
if (preg_match("/charset=/i", $the_header)) {
$arr = explode('charset=', $the_header);
$charset = $arr[1];
$charset = explode(chr(10), $charset);
$charset = $charset[0];
}
///////////////////////////////////////////////////////////////////
// echo $charset;
if($charset && $charset != 'UTF-8') {
$html = iconv($charset, "UTF-8", $html);
}
ÃŸ is mojibake for ß. In your database, you may have one of the following hex values (use SELECT HEX(col) ... to find out):
DF if the column is "latin1",
C39F if the column is utf8 -- OR -- it is latin1, but "double-encoded"
C383C5B8 if double-encoded into a utf8 column
You should not use any encoding/decoding functions in PHP; instead, you should set up the database and the connection to it correctly.
If MySQL is involved, see: Trouble with UTF-8 characters; what I see is not what I stored
if(!mb_check_encoding($str)){
$str = iconv("windows-1251", "UTF-8", $str);
}
It helped for me
Try without 'auto'
That is:
mb_detect_encoding($text)
instead of:
mb_detect_encoding($text, 'auto')
More information can be found here: mb_detect_encoding
Try to use this... every text that is not UTF-8 will be translated.
function is_utf8($str) {
return (bool) preg_match('//u', $str);
}
$myString = "Fußball";
if(!is_utf8($myString)){
$myString = utf8_encode($myString);
}
// or 1 line version ;)
$myString = !is_utf8($myString) ? utf8_encode($myString) : trim($myString);
I found a solution at http://deer.org.ua/2009/10/06/1/:
class Encoding
{
/**
* http://deer.org.ua/2009/10/06/1/
* @param $string
* @return null
*/
public static function detect_encoding($string)
{
static $list = ['utf-8', 'windows-1251'];
foreach ($list as $item) {
try {
$sample = iconv($item, $item, $string);
} catch (\Exception $e) {
continue;
}
if (md5($sample) == md5($string)) {
return $item;
}
}
return null;
}
}
$content = file_get_contents($file['tmp_name']);
$encoding = Encoding::detect_encoding($content);
if ($encoding != 'utf-8') {
$result = iconv($encoding, 'utf-8', $content);
} else {
$result = $content;
}
I think that @ is a bad decision and made some changes to the solution from deer.org.ua.
When you try to handle multi languages, like Japanese and Korean, you might get in trouble.
mb_convert_encoding with the 'auto' parameter doesn't work well. Setting mb_detect_order('ASCII,UTF-8,JIS,EUC-JP,SJIS,EUC-KR,UHC') doesn't help since it will detect EUC-* wrongly.
I concluded that as long as input strings come from HTML, it should use 'charset' in a meta element. I use Simple HTML DOM Parser because it supports invalid HTML.
The below snippet extracts the title element from a web page. If you would like to convert the entire page, then you may want to remove some lines.
<?php
require_once 'simple_html_dom.php';
echo convert_title_to_utf8(file_get_contents($argv[1])), PHP_EOL;
function convert_title_to_utf8($contents)
{
$dom = str_get_html($contents);
$title = $dom->find('title', 0);
if (empty($title)) {
return null;
}
$title = $title->plaintext;
$metas = $dom->find('meta');
$charset = 'auto';
foreach ($metas as $meta) {
if (!empty($meta->charset)) { // HTML5
$charset = $meta->charset;
} else if (preg_match('#charset=(.+)#', $meta->content, $match)) {
$charset = $match[1];
}
}
if (!in_array(strtolower($charset), array_map('strtolower', mb_list_encodings()))) {
$charset = 'auto';
}
return mb_convert_encoding($title, 'UTF-8', $charset);
}
This version is for the German language, but you can modify the $CHARSETS and the $TESTCHARS.
class CharsetDetector
{
private static $CHARSETS = array(
"ISO_8859-1",
"ISO_8859-15",
"CP850"
);
private static $TESTCHARS = array(
"€",
"ä",
"Ä",
"ö",
"Ö",
"ü",
"Ü",
"ß"
);
public static function convert($string)
{
return self::__iconv($string, self::getCharset($string));
}
public static function getCharset($string)
{
$normalized = self::__normalize($string);
if(!strlen($normalized))
return "UTF-8";
$best = "UTF-8";
$charcountbest = 0;
foreach (self::$CHARSETS as $charset)
{
$str = self::__iconv($normalized, $charset);
$charcount = 0;
$stop = mb_strlen($str, "UTF-8");
for($idx = 0; $idx < $stop; $idx++)
{
$char = mb_substr($str, $idx, 1, "UTF-8");
foreach (self::$TESTCHARS as $testchar)
{
if($char == $testchar)
{
$charcount++;
break;
}
}
}
if($charcount > $charcountbest)
{
$charcountbest = $charcount;
$best = $charset;
}
//echo $text . "<br />";
}
return $best;
}
private static function __normalize($str)
{
$len = strlen($str);
$ret = "";
for($i = 0; $i < $len; $i++)
{
$c = ord($str[$i]);
if ($c > 128) {
if (($c > 247))
$ret .= $str[$i];
elseif
($c > 239) $bytes = 4;
elseif
($c > 223) $bytes = 3;
elseif
($c > 191) $bytes = 2;
else
$ret .= $str[$i];
if (($i + $bytes) > $len)
$ret .= $str[$i];
$ret2 = $str[$i];
while ($bytes > 1)
{
$i++;
$b = ord($str[$i]);
if ($b < 128 || $b > 191)
{
$ret .= $ret2;
$ret2 = "";
$i += $bytes-1;
$bytes = 1;
break;
}
else
$ret2 .= $str[$i];
$bytes--;
}
}
}
return $ret;
}
private static function __iconv($string, $charset)
{
return iconv ($charset, "UTF-8", $string);
}
}
I had the same issue with phpQuery (ISO-8859-1 instead of UTF-8) and this hack helped me:
$html = '<?xml version="1.0" encoding="UTF-8" ?>' . $html;
mb_internal_encoding('UTF-8'), phpQuery::newDocumentHTML($html, 'utf-8'), mbstring.internal_encoding and other manipulations didn't take any effect.
For Chinese characters, it is common to be encoded in the GBK encoding. In addition, when tested, the most voted answer doesn't work. Here is a simple fix that makes it work as well:
function toUTF8($raw) {
try{
return mb_convert_encoding($raw, "UTF-8", "auto");
}catch(\Exception $e){
return mb_convert_encoding($raw, "UTF-8", "GBK");
}
}
Remark: This solution was written in 2017 and should fix problems for PHP in those days. I have not tested whether latest PHP already understands auto correctly.

PHP arabic character encoding issue [duplicate]

I'm reading out lots of texts from various RSS feeds and inserting them into my database.
Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO 8859-1.
Unfortunately, there are sometimes problems with the encodings of the texts. Example:
1. The "ß" in "Fußball" should look like this in my database: "ÃŸ". If it is a "ÃŸ", it is displayed correctly.
2. Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃƒÅ¸". Then it is displayed wrongly, of course.
3. In other cases, the "ß" is saved as a "ß" - so without any change. Then it is also displayed wrongly.
What can I do to avoid the cases 2 and 3?
How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is but when must I use the functions?) and when must I do nothing with the input?
How do I make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:
How do I find out what encoding the text uses?
How do I convert it to UTF-8 - whatever the old encoding is?
Would a function like this work?
function correct_encoding($text) {
$current_encoding = mb_detect_encoding($text, 'auto');
$text = iconv($current_encoding, 'UTF-8', $text);
return $text;
}
I've tested it, but it doesn't work. What's wrong with it?
If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output.
I made a function that addresses all these issues. It's called Encoding::toUTF8().
You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.
I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
Download:
https://github.com/neitanod/forceutf8
I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
Examples:
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
will output:
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().
You first have to detect what encoding has been used. As you’re parsing RSS feeds (probably via HTTP), you should read the encoding from the charset parameter of the Content-Type HTTP header field. If it is not present, read the encoding from the encoding attribute of the XML processing instruction. If that’s missing too, use UTF-8 as defined in the specification.
Here is what I probably would do:
I’d use cURL to send the request and fetch the response. That allows you to set specific header fields and fetch the response header as well. After fetching the response, you have to parse the HTTP response and split it into header and body. The header should then contain the Content-Type header field that contains the MIME type and (hopefully) the charset parameter with the encoding/charset too. If not, we’ll analyse the XML PI for the presence of the encoding attribute and get the encoding from there. If that’s also missing, the XML specification defines UTF-8 as the default encoding.
$url = 'http://www.lr-online.de/storage/rss/rss/sport.xml';
$accept = array(
'type' => array('application/rss+xml', 'application/xml', 'application/rdf+xml', 'text/xml'),
'charset' => array_diff(mb_list_encodings(), array('pass', 'auto', 'wchar', 'byte2be', 'byte2le', 'byte4be', 'byte4le', 'BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'))
);
$header = array(
'Accept: '.implode(', ', $accept['type']),
'Accept-Charset: '.implode(', ', $accept['charset']),
);
$encoding = null;
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
$response = curl_exec($curl);
if (!$response) {
// error fetching the response
} else {
$offset = strpos($response, "\r\n\r\n");
$header = substr($response, 0, $offset);
// Split off the body here so it is defined whichever branch sets $encoding
$body = substr($response, $offset + 4);
if (!$header || !preg_match('/^Content-Type:\s+([^;]+)(?:;\s*charset=(.*))?/im', $header, $match)) {
// error parsing the response
} else {
if (!in_array(strtolower($match[1]), array_map('strtolower', $accept['type']))) {
// type not accepted
}
// The charset group is optional, so it may be missing from $match
$encoding = isset($match[2]) ? trim($match[2], "\"' \r\n") : null;
}
if (!$encoding) {
if (preg_match('/^<\?xml\s+version=(?:"[^"]*"|\'[^\']*\')\s+encoding=("[^"]*"|\'[^\']*\')/s', $body, $match)) {
$encoding = trim($match[1], '"\'');
}
}
if (!$encoding) {
$encoding = 'utf-8';
} else {
// Encoding names are case-insensitive, so compare in lowercase
$encoding = strtolower($encoding);
if (!in_array($encoding, array_map('strtolower', $accept['charset']))) {
// encoding not accepted
}
if ($encoding != 'utf-8') {
$body = mb_convert_encoding($body, 'utf-8', $encoding);
}
}
$simpleXML = simplexml_load_string($body, null, LIBXML_NOERROR);
if (!$simpleXML) {
// parse error
} else {
echo $simpleXML->asXML();
}
}
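The lookup order described above (Content-Type charset, then the XML declaration, then the UTF-8 default) can also be isolated into one small function, which makes it easier to test without a network call. This is only a sketch: declaredEncoding() is a made-up name, and it expects the Content-Type header value (not the whole header block) plus the raw body.

```php
<?php
// Hypothetical helper: returns the encoding a feed *claims* to use.
// $contentType is the value of the Content-Type header, or null if absent.
function declaredEncoding(?string $contentType, string $body): string
{
    // 1. charset parameter of the Content-Type header
    if ($contentType !== null
        && preg_match('/;\s*charset\s*=\s*"?([^";\s]+)/i', $contentType, $m)) {
        return $m[1];
    }
    // 2. encoding attribute of the XML declaration at the start of the body
    if (preg_match('/^\s*<\?xml[^>]*?\bencoding\s*=\s*["\']([^"\']+)["\']/i', $body, $m)) {
        return $m[1];
    }
    // 3. default when nothing is declared
    return 'UTF-8';
}

echo declaredEncoding('application/rss+xml; charset="ISO-8859-1"', '<rss/>'), PHP_EOL;
```

Keep in mind this only reports what the feed declares; whether the bytes really are in that encoding is a separate check.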
Detecting the encoding is hard.
mb_detect_encoding works by guessing, based on a number of candidates that you pass it. In some encodings, certain byte sequences are invalid, and therefore it can distinguish between various candidates. Unfortunately, there are a lot of encodings where the same bytes are valid (but different). In these cases, there is no way to determine the encoding; you can implement your own logic to make guesses in these cases. For example, data coming from a Japanese site might be more likely to have a Japanese encoding.
As long as you only deal with Western European languages, the three major encodings to consider are utf-8, iso-8859-1 and cp-1252. Since these are defaults for many platforms, they are also the most likely to be reported wrongly. E.g. if people use different encodings, they are likely to be frank about it, since otherwise their software would break very often. Therefore, a good strategy is to trust the provider, unless the encoding is reported as one of those three. You should still double-check that it is indeed valid, using mb_check_encoding (note that valid is not the same as correct; the same input may be valid for many encodings). If it is one of those, you can then use mb_detect_encoding to distinguish between them. Luckily that is fairly deterministic; you just need to use the proper detect-sequence, which is UTF-8,ISO-8859-1,WINDOWS-1252.
Once you've detected the encoding you need to convert it to your internal representation (UTF-8 is the only sane choice). The function utf8_encode transforms ISO-8859-1 to UTF-8, so it can only be used for that particular input type. For other encodings, use mb_convert_encoding.
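A rough sketch of that strategy, assuming the mbstring extension is loaded. normalizeToUtf8() is a made-up name. Note one deliberate swap: instead of calling mb_detect_encoding for the three ambiguous encodings, it checks UTF-8 validity with mb_check_encoding and otherwise assumes Windows-1252 (which covers the printable range of ISO-8859-1), because that behaves the same across PHP versions.

```php
<?php
// Hypothetical helper: $declared is whatever the provider reported (or null).
function normalizeToUtf8(string $data, ?string $declared = null): string
{
    $ambiguous = ['utf-8', 'iso-8859-1', 'windows-1252'];
    $known = array_map('strtolower', mb_list_encodings());
    $enc = $declared === null ? null : strtolower(trim($declared));

    // Trust the provider only for a known encoding outside the three
    // common defaults, and only if the bytes are actually valid in it.
    if ($enc !== null
        && in_array($enc, $known, true)
        && !in_array($enc, $ambiguous, true)
        && mb_check_encoding($data, $enc)) {
        return mb_convert_encoding($data, 'UTF-8', $enc);
    }

    // Otherwise: valid UTF-8 is kept as-is ...
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data;
    }
    // ... and anything else is assumed to be Windows-1252 (an assumption).
    return mb_convert_encoding($data, 'UTF-8', 'Windows-1252');
}

// Feed claims UTF-8 but actually sends Latin-1 bytes:
echo normalizeToUtf8("Fu\xDFball", 'UTF-8'), PHP_EOL;
```

The Windows-1252 fallback is a guess that suits Western European feeds; for other regions you would swap in a different last resort.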
This cheatsheet lists some common caveats related to UTF-8 handling in PHP:
http://developer.loftdigital.com/blog/php-utf-8-cheatsheet
This function detecting multibyte characters in a string might also prove helpful (source):
function detectUTF8($string)
{
return preg_match('%(?:
[\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
|\xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
|\xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
|\xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
|[\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
|\xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)+%xs',
$string);
}
A little heads up. You said that the "ß" should be displayed as "ÃŸ" in your database.
This is probably because you're using a database with Latin-1 character encoding, or possibly your PHP-MySQL connection is set up wrong. That is, PHP believes your MySQL is set to use UTF-8, so it sends data as UTF-8, but your MySQL believes PHP is sending data encoded as ISO 8859-1, so it may once again try to encode your sent data as UTF-8, causing this kind of trouble.
Take a look at mysql_set_charset. It may help you.
Your encoding looks like you encoded into UTF-8 twice; that is, from some other encoding, into UTF-8, and again into UTF-8. As if you had ISO 8859-1, converted from ISO 8859-1 to UTF-8, and treated the new string as ISO 8859-1 for another conversion into UTF-8.
Here's some pseudocode of what you did:
$inputstring = getFromUser();
$utf8string = iconv($current_encoding, 'utf-8', $inputstring);
$flawedstring = iconv($current_encoding, 'utf-8', $utf8string);
You should try:
1. Detect the encoding using mb_detect_encoding() or whatever you like to use.
2. If it's UTF-8, convert into ISO 8859-1, and repeat step 1.
3. Finally, convert back into UTF-8.
That is presuming that in the "middle" conversion you used ISO 8859-1. If you used Windows-1252, then convert into Windows-1252 (latin1). The original source encoding is not important; the one you used in the flawed second conversion is.
This is my guess at what happened; there's very little else you could have done to get four bytes in place of one extended ASCII byte.
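For what it's worth, undoing one layer of double encoding could look roughly like this. It is a sketch, not a drop-in fix: undoDoubleUtf8() is a made-up name, it assumes mbstring is available, and it assumes the "middle" encoding was Windows-1252 unless you pass something else. It only accepts the result if it is valid UTF-8 and converts back to exactly the input, so normal UTF-8 text is left alone.

```php
<?php
// Hypothetical helper: reverses a single UTF-8 -> UTF-8 double encoding.
function undoDoubleUtf8(string $s, string $middle = 'Windows-1252'): string
{
    // Step back once: read the string as UTF-8, write it in the middle encoding.
    $once = mb_convert_encoding($s, $middle, 'UTF-8');

    // Accept only if the result is valid UTF-8 and the round trip is lossless.
    if (mb_check_encoding($once, 'UTF-8')
        && mb_convert_encoding($once, 'UTF-8', $middle) === $s) {
        return $once;
    }
    return $s;
}

// "ß" double-encoded via Windows-1252 is the four bytes C3 83 C5 B8.
echo undoDoubleUtf8("Fu\xC3\x83\xC5\xB8ball"), PHP_EOL;
```

This is a heuristic: text that legitimately contains a sequence like "Ã©" would also be "repaired", so run it only on data you already know is double-encoded.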
The German language also uses ISO 8859-2 and Windows-1250 (Latin-2).
A really nice way to implement an isUTF8-function can be found on php.net:
function isUTF8($string) {
return (utf8_encode(utf8_decode($string)) == $string);
}
The interesting thing about mb_detect_encoding and mb_convert_encoding is that the order of the encodings you suggest does matter:
// $input is actually UTF-8
mb_detect_encoding($input, "ISO-8859-9, UTF-8");
// ISO-8859-9 (WRONG!)
mb_detect_encoding($input, "UTF-8, ISO-8859-9");
// UTF-8 (OK)
So you might want to use a specific order when specifying expected encodings. Still, keep in mind that this is not foolproof.
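To see why the order matters at all, it can help to list every candidate in which the bytes are valid. A small sketch (validEncodings() is a made-up name; needs mbstring and PHP 7.4+ for the arrow function):

```php
<?php
// Hypothetical helper: returns all candidates in which $data is a valid string.
function validEncodings(string $data, array $candidates): array
{
    return array_values(array_filter(
        $candidates,
        fn (string $enc): bool => mb_check_encoding($data, $enc)
    ));
}

// UTF-8 bytes for "Fußball" are also a valid ISO-8859-1 string ...
print_r(validEncodings("Fu\xC3\x9Fball", ['UTF-8', 'ISO-8859-1']));
// ... while the Latin-1 bytes are not valid UTF-8.
print_r(validEncodings("Fu\xDFball", ['UTF-8', 'ISO-8859-1']));
```

When more than one candidate comes back, detection has to pick one by order or by heuristic, which is exactly where the wrong guesses come from.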
Working out the character encoding of RSS feeds seems to be complicated. Even normal web pages often omit, or lie about, their encoding.
So you could try to use the correct way to detect the encoding and then fall back to some form of auto-detection (guessing).
You need to test the character set on input since responses can come coded with different encodings.
I force all incoming content into UTF-8 by doing detection and translation using the following function:
function fixRequestCharset()
{
$ref = array(&$_GET, &$_POST, &$_REQUEST);
foreach ($ref as &$var)
{
foreach ($var as $key => $val)
{
$encoding = mb_detect_encoding($var[$key], mb_detect_order(), true);
if (!$encoding)
continue;
if (strcasecmp($encoding, 'UTF-8') != 0)
{
$encoding = iconv($encoding, 'UTF-8', $var[$key]);
if ($encoding === false)
continue;
$var[$key] = $encoding;
}
}
}
}
That routine will turn all PHP variables that come from the remote host into UTF-8.
Or ignore the value if the encoding could not be detected or converted.
You can customize it to your needs.
Just invoke it before using the variables.
mb_detect_encoding:
echo mb_detect_encoding($str, "auto");
Or
echo mb_detect_encoding($str, "UTF-8, ASCII, ISO-8859-1");
I really don't know what the results are, but I'd suggest you just take some of your feeds with different encodings and try if mb_detect_encoding works or not.
auto is short for "ASCII,JIS,UTF-8,EUC-JP,SJIS". It returns the detected charset, which you can use to convert the string to UTF-8 with iconv.
<?php
function convertToUTF8($str) {
$enc = mb_detect_encoding($str);
if ($enc && $enc != 'UTF-8') {
return iconv($enc, 'UTF-8', $str);
} else {
return $str;
}
}
?>
I haven't tested it, so no guarantee. And maybe there's a simpler way.
I know this is an older question, but I figure a useful answer never hurts. I was having issues with my encoding between a desktop application, SQLite, and GET/POST variables. Some would be in UTF-8, some would be in ASCII, and basically everything would get screwed up when foreign characters got involved.
Here is my solution. It scrubs your GET/POST/REQUEST (I omitted cookies, but you could add them if desired) on each page load before processing. It works well in a header. PHP will throw warnings if it can't detect the source encoding automatically, so these warnings are suppressed with @'s.
//Convert everything in our vars to UTF-8 for playing nice with the database...
//Use some auto detection here to help us not double-encode...
//Suppress possible warnings with @'s for when encoding cannot be detected
//Note: each() was removed in PHP 8.0, so this loop needs PHP 7 or earlier
try
{
$process = array(&$_GET, &$_POST, &$_REQUEST);
while (list($key, $val) = each($process)) {
foreach ($val as $k => $v) {
unset($process[$key][$k]);
if (is_array($v)) {
$process[$key][@mb_convert_encoding($k,'UTF-8','auto')] = $v;
$process[] = &$process[$key][@mb_convert_encoding($k,'UTF-8','auto')];
} else {
$process[$key][@mb_convert_encoding($k,'UTF-8','auto')] = @mb_convert_encoding($v,'UTF-8','auto');
}
}
}
}
unset($process);
}
catch(Exception $ex){}
It's simple: when you get something that's not UTF-8, you must encode that into UTF-8.
So, when you're fetching a certain feed that's ISO 8859-1 parse it through utf8_encode.
However, if you're fetching an UTF-8 feed, you don't need to do anything.
harpax' answer worked for me. In my case, this is good enough:
if (isUTF8($str)) {
echo $str;
}
else
{
echo iconv("ISO-8859-1", "UTF-8//TRANSLIT", $str);
}
I had been looking for solutions to encoding problems for ages, and this page is probably the conclusion of years of searching! I tested some of the suggestions you mentioned and here are my notes:
This is my test string:
this is a "wròng wrìtten" string bùt I nèed to pù 'sòme' special
chàrs to see thèm, convertèd by fùnctìon!! & that's it!
I do an INSERT to save this string on a database in a field that is set as utf8_general_ci
The character set of my page is UTF-8.
If I do an INSERT just like that, in my database, I have some characters probably coming from Mars...
So I need to convert them into some "sane" UTF-8. I tried utf8_encode(), but still aliens chars were invading my database...
So I tried to use the function forceUTF8 posted on number 8, but in the database the string saved looks like this:
this is a "wròng wrìtten" string bùt I nèed to pù 'sòme' special
chà rs to see thèm, convertèd by fùnctìon!! & that's it!
So collecting some more information on this page and merging them with other information on other pages I solved my problem with this solution:
$finallyIDidIt = mb_convert_encoding(
$string,
mysql_client_encoding($resourceID),
mb_detect_encoding($string)
);
Now in my database I have my string with correct encoding.
NOTE:
Only note to take care of is in function mysql_client_encoding!
You need to be connected to the database, because this function wants a resource ID as a parameter.
But well, I just do that re-encoding before my INSERT so for me it is not a problem.
After sorting out your PHP scripts, don't forget to tell MySQL what charset you are passing and would like to receive.
Example: set the character to UTF-8
Passing UTF-8 data to a Latin 1 table in a Latin 1 I/O session gives those nasty birdfeets. I see this every other day in OsCommerce shops. Back and forth it might seem right. But phpMyAdmin will show the truth. By telling MySQL what charset you are passing, it will handle the conversion of MySQL data for you.
How to recover existing scrambled MySQL data is another question. :)
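One way to tell MySQL the connection charset, sketched here with PDO rather than the old mysql_* functions: put it in the DSN. mysqlUtf8Dsn() is a made-up helper, the host, database name and credentials are placeholders, and utf8mb4 is my assumption for "UTF-8" on a reasonably recent MySQL.

```php
<?php
// Hypothetical helper: builds a pdo_mysql DSN that requests a utf8mb4 connection.
function mysqlUtf8Dsn(string $host, string $dbname): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

// Usage with placeholder credentials (not executed here):
// $pdo = new PDO(mysqlUtf8Dsn('localhost', 'feeds'), 'user', 'secret');

echo mysqlUtf8Dsn('localhost', 'feeds'), PHP_EOL;
```

With mysqli the equivalent is calling set_charset('utf8mb4') on the connection right after connecting.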
Get the encoding from headers and convert it to UTF-8.
$post_url = 'http://website.domain';
/// Get headers ///////////////////////////////////////////////
function get_headers_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
$r = curl_exec($ch);
return $r;
}
$the_header = get_headers_curl($post_url);
/// Check for redirect ////////////////////////////////////////
if (preg_match("/Location:/i", $the_header)) {
$arr = explode('Location:', $the_header);
$location = $arr[1];
$location = explode(chr(10), $location);
$location = $location[0];
$the_header = get_headers_curl(trim($location));
}
/// Get charset ///////////////////////////////////////////////
if (preg_match("/charset=/i", $the_header)) {
$arr = explode('charset=', $the_header);
$charset = $arr[1];
$charset = explode(chr(10), $charset);
$charset = $charset[0];
}
///////////////////////////////////////////////////////////////////
// echo $charset;
if($charset && $charset != 'UTF-8') {
$html = iconv($charset, "UTF-8", $html);
}
ÃŸ is mojibake for ß. In your database, you may have one of the following hex values (use SELECT HEX(col) ... to find out):
DF if the column is "latin1",
C39F if the column is utf8 -- OR -- it is latin1, but "double-encoded"
C383C5B8 if double-encoded into a utf8 column
You should not use any encoding/decoding functions in PHP; instead, you should set up the database and the connection to it correctly.
If MySQL is involved, see: Trouble with UTF-8 characters; what I see is not what I stored
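If you want to reproduce those three hex values yourself, something like the following should do it. It is only an illustration (hexForms() is a made-up name, mbstring is assumed), and it uses Windows-1252 for the second, wrong conversion because that is what produces the C5B8 bytes.

```php
<?php
// Hypothetical helper: hex of a Latin-1 byte as stored in latin1, in UTF-8,
// and after being UTF-8 encoded twice (second pass misread as Windows-1252).
function hexForms(string $latin1Byte): array
{
    $utf8   = mb_convert_encoding($latin1Byte, 'UTF-8', 'ISO-8859-1');
    $double = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');

    return [
        strtoupper(bin2hex($latin1Byte)),
        strtoupper(bin2hex($utf8)),
        strtoupper(bin2hex($double)),
    ];
}

// "ß" is the single byte DF in Latin-1.
echo implode(' ', hexForms("\xDF")), PHP_EOL; // DF C39F C383C5B8
```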
if(!mb_check_encoding($str)){
$str = iconv("windows-1251", "UTF-8", $str);
}
It helped for me
Try without 'auto'
That is:
mb_detect_encoding($text)
instead of:
mb_detect_encoding($text, 'auto')
More information can be found here: mb_detect_encoding
Try to use this... every text that is not UTF-8 will be translated.
function is_utf8($str) {
return (bool) preg_match('//u', $str);
}
$myString = "Fußball";
if(!is_utf8($myString)){
$myString = utf8_encode($myString);
}
// or 1 line version ;)
$myString = !is_utf8($myString) ? utf8_encode($myString) : trim($myString);
I found a solution at http://deer.org.ua/2009/10/06/1/:
class Encoding
{
/**
* http://deer.org.ua/2009/10/06/1/
* @param $string
* @return null
*/
public static function detect_encoding($string)
{
static $list = ['utf-8', 'windows-1251'];
foreach ($list as $item) {
try {
$sample = iconv($item, $item, $string);
} catch (\Exception $e) {
continue;
}
if (md5($sample) == md5($string)) {
return $item;
}
}
return null;
}
}
$content = file_get_contents($file['tmp_name']);
$encoding = Encoding::detect_encoding($content);
if ($encoding != 'utf-8') {
$result = iconv($encoding, 'utf-8', $content);
} else {
$result = $content;
}
I think that @ is a bad decision and made some changes to the solution from deer.org.ua.
When you try to handle multi languages, like Japanese and Korean, you might get in trouble.
mb_convert_encoding with the 'auto' parameter doesn't work well. Setting mb_detect_order('ASCII,UTF-8,JIS,EUC-JP,SJIS,EUC-KR,UHC') doesn't help since it will detect EUC-* wrongly.
I concluded that as long as input strings come from HTML, it should use 'charset' in a meta element. I use Simple HTML DOM Parser because it supports invalid HTML.
The below snippet extracts the title element from a web page. If you would like to convert the entire page, then you may want to remove some lines.
<?php
require_once 'simple_html_dom.php';
echo convert_title_to_utf8(file_get_contents($argv[1])), PHP_EOL;
function convert_title_to_utf8($contents)
{
$dom = str_get_html($contents);
$title = $dom->find('title', 0);
if (empty($title)) {
return null;
}
$title = $title->plaintext;
$metas = $dom->find('meta');
$charset = 'auto';
foreach ($metas as $meta) {
if (!empty($meta->charset)) { // HTML5
$charset = $meta->charset;
} else if (preg_match('#charset=(.+)#', $meta->content, $match)) {
$charset = $match[1];
}
}
if (!in_array(strtolower($charset), array_map('strtolower', mb_list_encodings()))) {
$charset = 'auto';
}
return mb_convert_encoding($title, 'UTF-8', $charset);
}
This version is for the German language, but you can modify the $CHARSETS and the $TESTCHARS.
class CharsetDetector
{
private static $CHARSETS = array(
"ISO_8859-1",
"ISO_8859-15",
"CP850"
);
private static $TESTCHARS = array(
"€",
"ä",
"Ä",
"ö",
"Ö",
"ü",
"Ü",
"ß"
);
public static function convert($string)
{
return self::__iconv($string, self::getCharset($string));
}
public static function getCharset($string)
{
$normalized = self::__normalize($string);
if(!strlen($normalized))
return "UTF-8";
$best = "UTF-8";
$charcountbest = 0;
foreach (self::$CHARSETS as $charset)
{
$str = self::__iconv($normalized, $charset);
$charcount = 0;
$stop = mb_strlen($str, "UTF-8");
for($idx = 0; $idx < $stop; $idx++)
{
$char = mb_substr($str, $idx, 1, "UTF-8");
foreach (self::$TESTCHARS as $testchar)
{
if($char == $testchar)
{
$charcount++;
break;
}
}
}
if($charcount > $charcountbest)
{
$charcountbest = $charcount;
$best = $charset;
}
//echo $text . "<br />";
}
return $best;
}
private static function __normalize($str)
{
$len = strlen($str);
$ret = "";
for($i = 0; $i < $len; $i++)
{
$c = ord($str[$i]);
if ($c > 128) {
if (($c > 247))
$ret .= $str[$i];
elseif
($c > 239) $bytes = 4;
elseif
($c > 223) $bytes = 3;
elseif
($c > 191) $bytes = 2;
else
$ret .= $str[$i];
if (($i + $bytes) > $len)
$ret .= $str[$i];
$ret2 = $str[$i];
while ($bytes > 1)
{
$i++;
$b = ord($str[$i]);
if ($b < 128 || $b > 191)
{
$ret .= $ret2;
$ret2 = "";
$i += $bytes-1;
$bytes = 1;
break;
}
else
$ret2 .= $str[$i];
$bytes--;
}
}
}
return $ret;
}
private static function __iconv($string, $charset)
{
return iconv ($charset, "UTF-8", $string);
}
}
I had the same issue with phpQuery (ISO-8859-1 instead of UTF-8) and this hack helped me:
$html = '<?xml version="1.0" encoding="UTF-8" ?>' . $html;
mb_internal_encoding('UTF-8'), phpQuery::newDocumentHTML($html, 'utf-8'), mbstring.internal_encoding and other manipulations had no effect.
Chinese text is commonly encoded in GBK. In addition, when I tested it, the most-voted answer didn't work. Here is a simple fix that makes it work for GBK as well:
function toUTF8($raw) {
try{
return mb_convert_encoding($raw, "UTF-8", "auto");
}catch(\Exception $e){
return mb_convert_encoding($raw, "UTF-8", "GBK");
}
}
Remark: This solution was written in 2017 and was meant to fix problems with PHP at that time. I have not tested whether the latest PHP already handles auto correctly.
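As far as I know, mb_convert_encoding reports problems through warnings or a false return value rather than by throwing \Exception, so the catch branch above may never run. Here is a minimal sketch of the same fallback idea that checks validity explicitly. It assumes the input is either UTF-8 or GBK and that the mbstring extension is loaded; the function name is hypothetical.

```php
<?php
// Hypothetical helper: assumes $raw is either valid UTF-8 or GBK-encoded text.
function to_utf8_with_gbk_fallback(string $raw): string
{
    // Valid UTF-8 (including plain ASCII) is returned unchanged.
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;
    }
    // Otherwise treat the bytes as GBK, as the answer above does.
    return mb_convert_encoding($raw, 'UTF-8', 'GBK');
}

// "\xD6\xD0\xCE\xC4" is the GBK byte sequence for two Chinese characters.
echo to_utf8_with_gbk_fallback("\xD6\xD0\xCE\xC4"), PHP_EOL;
```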

How to remove multiple UTF-8 BOM sequences

I'm using PHP5 (CGI) to output template files from the filesystem, and I'm having issues spitting out raw HTML.
private function fetch($name) {
$path = $this->j->config['template_path'] . $name . '.html';
if (!file_exists($path)) {
dbgerror('Could not find the template "' . $name . '" in ' . $path);
}
$f = fopen($path, 'r');
$t = fread($f, filesize($path));
fclose($f);
if (substr($t, 0, 3) == b'\xef\xbb\xbf') {
$t = substr($t, 3);
}
return $t;
}
Even though I've added the BOM fix I'm still having problems with Firefox accepting it. You can see a live copy here: http://ircb.in/jisti/ (and the template file I threw at http://ircb.in/jisti/home.html if you want to check it out)
Any idea how to fix this? o_o
You can use the following code to remove the UTF-8 BOM:
//Remove UTF8 Bom
function remove_utf8_bom($text)
{
$bom = pack('H*','EFBBBF');
$text = preg_replace("/^$bom/", '', $text);
return $text;
}
try:
// -------- read the file-content ----
$str = file_get_contents($source_file);
// -------- remove the utf-8 BOM ----
$str = str_replace("\xEF\xBB\xBF",'',$str);
// -------- get the Object from JSON ----
$obj = json_decode($str);
:)
Another way to remove the BOM, which is Unicode code point U+FEFF:
$str = preg_replace('/\x{FEFF}/u', '', $file);
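Note that the pattern above removes U+FEFF anywhere in the string, and that with the /u modifier preg_replace returns null when the subject is not valid UTF-8. Here is a minimal sketch of a variant that only touches the start of the string and keeps the input when the match fails; the function name is hypothetical.

```php
<?php
// Hypothetical helper: strips U+FEFF only at the very start of $text.
function strip_bom_at_start(string $text): string
{
    $result = preg_replace('/^\x{FEFF}/u', '', $text);
    // With /u, preg_replace returns null if $text is not valid UTF-8;
    // in that case the original bytes are returned untouched.
    return $result === null ? $text : $result;
}

var_dump(strip_bom_at_start("\xEF\xBB\xBFabc") === 'abc');
```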
b'\xef\xbb\xbf' stands for the literal string "\xef\xbb\xbf". If you want to check for a BOM, you need to use double quotes, so the \x sequences are actually interpreted into bytes:
"\xef\xbb\xbf"
Your files also seem to contain a lot more garbage than just a single leading BOM:
$ curl http://ircb.in/jisti/ | xxd
0000000: efbb bfef bbbf efbb bfef bbbf efbb bfef ................
0000010: bbbf efbb bf3c 2144 4f43 5459 5045 2068 .....<!DOCTYPE h
0000020: 746d 6c3e 0a3c 6874 6d6c 3e0a 3c68 6561 tml>.<html>.<hea
...
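The quoting difference described above is easy to verify. A short sketch (variable names are just for illustration) that compares the two literals by length:

```php
<?php
$single = '\xef\xbb\xbf'; // 12 literal characters: backslash, x, e, f, ...
$double = "\xef\xbb\xbf"; // 3 bytes: 0xEF 0xBB 0xBF

var_dump(strlen($single));                  // int(12)
var_dump(strlen($double));                  // int(3)
var_dump($double === pack('H*', 'EFBBBF')); // bool(true)
```

So a substr($t, 0, 3) comparison against the single-quoted form can never match a real BOM.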
If anybody is importing CSV files, the code below is useful:
$header = fgetcsv($handle);
foreach($header as $key=> $val) {
$bom = pack('H*','EFBBBF');
$val = preg_replace("/^$bom/", '', $val);
$header[$key] = $val;
}
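Since a BOM can only sit at the very start of the file, only the first header cell needs checking. Here is a minimal sketch under that assumption; the helper name read_csv_header is hypothetical, and an in-memory stream stands in for a real file handle.

```php
<?php
// Hypothetical helper: reads the header row and strips a BOM from the first cell only.
function read_csv_header($handle): array
{
    // All fgetcsv parameters are passed explicitly to avoid relying on defaults.
    $header = fgetcsv($handle, 0, ',', '"', '\\');
    if ($header === false || $header === [null]) {
        return []; // end of file or blank line
    }
    if (strncmp((string) $header[0], "\xEF\xBB\xBF", 3) === 0) {
        $header[0] = substr($header[0], 3);
    }
    return $header;
}

// Demo with an in-memory stream instead of a real file.
$handle = fopen('php://memory', 'w+');
fwrite($handle, "\xEF\xBB\xBFid,name\n1,Alice\n");
rewind($handle);
print_r(read_csv_header($handle));
fclose($handle);
```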
This global function resolves it for systems with a UTF-8 base charset. Thanks!
function prepareCharset($str) {
// set default encode
mb_internal_encoding('UTF-8');
// pre filter
if (empty($str)) {
return $str;
}
// get charset
$charset = mb_detect_encoding($str, array('ISO-8859-1', 'UTF-8', 'ASCII'));
if (stristr($charset, 'utf') || stristr($charset, 'iso')) {
$str = iconv('ISO-8859-1', 'UTF-8//TRANSLIT', utf8_decode($str));
} else {
$str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}
// remove BOM
$str = urldecode(str_replace("%C2%81", '', urlencode($str)));
// prepare string
return $str;
}
An extra method to do the same job:
function remove_utf8_bom_head($text) {
if(substr(bin2hex($text), 0, 6) === 'efbbbf') {
$text = substr($text, 3);
}
return $text;
}
The other methods I found did not work in my case.
Hope it helps in some special cases.
A solution without the pack function:
$a = "\xEF\xBB\xBF1";
var_dump($a); // string(4) "1"
function deleteBom($text)
{
return preg_replace("/^\xEF\xBB\xBF/", '', $text);
}
var_dump(deleteBom($a)); // string(1) "1"
I'm not so fond of using preg_replace or preg_match for simple tasks. What about this alternative method of detecting and removing the BOM?
function remove_utf8_bom(string $text): string
{
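// Pass the encoding explicitly so the result does not depend on mb_internal_encoding().
$bomStart = mb_substr($text, 0, 1, 'UTF-8');
return ($bomStart === pack('H*','EFBBBF')) ?
mb_substr($text, 1, null, 'UTF-8') :
$text;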
}
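The mb_substr approach above needs the mbstring extension and removes only one BOM. Since the question is about several BOMs in a row, here is a minimal byte-oriented sketch without regular expressions that strips all of them; the function name is hypothetical.

```php
<?php
// Hypothetical helper: removes every consecutive UTF-8 BOM at the start of $text.
function strip_leading_boms(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    $offset = 0;
    // Advance past each BOM found at the current offset.
    while (substr($text, $offset, 3) === $bom) {
        $offset += 3;
    }
    return $offset === 0 ? $text : (string) substr($text, $offset);
}

echo strip_leading_boms("\xEF\xBB\xBF\xEF\xBB\xBF<!DOCTYPE html>"), PHP_EOL;
```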
If you are reading from an API using file_get_contents and get an inexplicable NULL from json_decode, check the value of json_last_error(): sometimes the value returned from file_get_contents has an extraneous BOM that is almost invisible when you inspect the string but makes json_last_error() return JSON_ERROR_SYNTAX (4).
>>> $json = file_get_contents("http://api-guiaserv.seade.gov.br/v1/orgao/all");
=> "\t{"orgao":[{"Nome":"Tribunal de Justi\u00e7a","ID_Orgao":"59","Condicao":"1"}, ...]}"
>>> json_decode($json);
=> null
>>>
In this case, check the first 3 bytes - echoing them is not very useful because the BOM is invisible on most settings:
>>> substr($json, 0, 3)
=> " "
>>> substr($json, 0, 3) == pack('H*','EFBBBF');
=> true
>>>
If the line above returns TRUE for you, then a simple test may fix the problem:
>>> json_decode($json[0] == "{" ? $json : substr($json, 3))
=> {#204
+"orgao": [
{#203
+"Nome": "Tribunal de Justiça",
+"ID_Orgao": "59",
+"Condicao": "1",
},
],
...
}
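The same check can be wrapped in a small function that also surfaces the JSON error instead of silently returning NULL. This is a minimal sketch; the function name decode_json_ignoring_bom is hypothetical, and it decodes to associative arrays rather than objects.

```php
<?php
// Hypothetical helper: strips one leading UTF-8 BOM, then decodes the JSON.
function decode_json_ignoring_bom(string $json)
{
    if (strncmp($json, "\xEF\xBB\xBF", 3) === 0) {
        $json = substr($json, 3);
    }
    $data = json_decode($json, true);
    if (json_last_error() !== JSON_ERROR_NONE) {
        throw new RuntimeException('Invalid JSON: ' . json_last_error_msg());
    }
    return $data;
}

$withBom = "\xEF\xBB\xBF" . '{"orgao":[{"ID_Orgao":"59"}]}';
var_dump(json_decode($withBom));               // NULL, because of the BOM
print_r(decode_json_ignoring_bom($withBom));
```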
When working with faulty software, the BOM can get duplicated every time the file is saved.
So I use this to get rid of all of them:
function remove_utf8_bom($text) {
$bom = pack('H*','EFBBBF');
// Strip every consecutive BOM at the start of the string in a single pass.
return preg_replace("/^(?:$bom)+/", '', $text);
}
How about this:
function removeUTF8BomHeader($data) {
if (substr($data, 0, 3) == pack('CCC', 0xef, 0xbb, 0xbf)) {
$data = substr($data, 3);
}
return $data;
}
I tested it a lot, and it works perfectly without any issues.
