CakePHP 1.3 warning array_merge sanitize - php

I'm using CakePHP 1.3.7 and ran into a very specific issue.
The Sanitize core class method used in my application is the one of version 1.2. When I want to save particular data, it gives me a warning :
Warning: array_merge(): Argument #2 is not an array in
/usr/share/php/cake/libs/sanitize.php on line 113
but it does save, and with the right encoding/format.
Here's the method who causes this warning (version 1.2, which is NOT on line 113, but I'll come to that later)
function html($string, $remove = false) {
if ($remove) {
$string = strip_tags($string);
} else {
$patterns = array("/\&/", "/%/", "/</", "/>/", '/"/', "/'/", "/\(/", "/\)/", "/\+/", "/-/");
$replacements = array("&", "%", "<", ">", """, "'", "(", ")", "+", "-");
$string = preg_replace($patterns, $replacements, $string);
}
return $string;
}
And here's how this method is called
$value = Sanitize::html($value,true);
Now as you can see, array_merge() is not called in this method, but if I replace the html() method by the 1.3 version
function html($string, $options = array()) {
static $defaultCharset = false;
if ($defaultCharset === false) {
$defaultCharset = Configure::read('App.encoding');
if ($defaultCharset === null) {
$defaultCharset = 'UTF-8';
}
}
$default = array(
'remove' => false,
'charset' => $defaultCharset,
'quotes' => ENT_QUOTES
);
$options = array_merge($default, $options);
if ($options['remove']) {
$string = strip_tags($string);
}
return htmlentities($string, $options['quotes'], $options['charset']);
}
array_merge() falls exactly on line 113.
If I now call html() this way
$value = Sanitize::html($value,array('remove' => true));
I don't get the warning anymore. However, my data doesn't save with the right encoding/format anymore.
Here's an example of text I need to save (it is french and needs UTF-8 encoding)
L'envoi d'une communication & à la fenêtre
I can't overcome this doing
$value = Sanitize::html($value,array('remove' => true, 'quotes' => ENT_HTML401));
because I'm using PHP 5.3.6 thus I can't use the constant ENT_HTML401
If I use another constant like ENT_NOQUOTES, it ignores the quotes (obviously) but not the french accents and other special chars, which is intented to work this way but I want to save the text exactly like I quoted (or at least read it).
I'm guessing I wouldn't need to use htmlentities, but I think it is safer to and updating the core method is the only way I found to not get the warning. I also suppose I should not really modify these files other than for updating them?
So, briefly, I want to :
Get rid of the warning
Save/read data in the right format
I might have forgotten some infos, thanks

I ended up updating the html() method of the Sanitize class to match version 1.3 as follow
function html($string, $options = array()) {
static $defaultCharset = false;
if ($defaultCharset === false) {
$defaultCharset = Configure::read('App.encoding');
if ($defaultCharset === null) {
$defaultCharset = 'UTF-8';
}
}
$default = array(
'remove' => false,
'charset' => $defaultCharset,
'quotes' => ENT_QUOTES
);
$options = array_merge($default, $options);
if ($options['remove']) {
$string = strip_tags($string);
}
return htmlentities($string, $options['quotes'], $options['charset']);
}
I call it like this
$value = Sanitize::html($value, array('remove'=>true,'quotes'=>ENT_NOQUOTES));
And I simply decode the text fields this way whenever I read their value from database
$data['Model']['field'] = html_entity_decode($data['Model']['field'], ENT_NOQUOTES, "UTF-8");
EDIT : I had to undo what I described above because the way data was encoded in the 1.3 version of the function made it so we had to decode the data in the whole application when reading it.
Also, I am NOT using CakePHP 1.3.7 (got confused with cake console); I'm using 1.2.4 so updating the function was not appropriate afterall.
I kept the version 1.2 and this time I simply changed the second parameter to an array as follow and it seemed to do the trick as I'm not getting the warning anymore.
function html($string, $options = array()) {
if ($options['remove']) {
$string = strip_tags($string);
} else {
$patterns = array("/\&/", "/%/", "/</", "/>/", '/"/', "/'/", "/\(/", "/\)/", "/\+/", "/-/");
$replacements = array("&", "%", "<", ">", """, "'", "(", ")", "+", "-");
$string = preg_replace($patterns, $replacements, $string);
}
return $string;
}

Related

PHP - After getting a csv file parsed into an array, can not match two exact strings [duplicate]

Using PHP5 (cgi) to output template files from the filesystem and having issues spitting out raw HTML.
private function fetch($name) {
$path = $this->j->config['template_path'] . $name . '.html';
if (!file_exists($path)) {
dbgerror('Could not find the template "' . $name . '" in ' . $path);
}
$f = fopen($path, 'r');
$t = fread($f, filesize($path));
fclose($f);
if (substr($t, 0, 3) == b'\xef\xbb\xbf') {
$t = substr($t, 3);
}
return $t;
}
Even though I've added the BOM fix I'm still having problems with Firefox accepting it. You can see a live copy here: http://ircb.in/jisti/ (and the template file I threw at http://ircb.in/jisti/home.html if you want to check it out)
Any idea how to fix this? o_o
you would use the following code to remove utf8 bom
//Remove UTF8 Bom
function remove_utf8_bom($text)
{
$bom = pack('H*','EFBBBF');
$text = preg_replace("/^$bom/", '', $text);
return $text;
}
try:
// -------- read the file-content ----
$str = file_get_contents($source_file);
// -------- remove the utf-8 BOM ----
$str = str_replace("\xEF\xBB\xBF",'',$str);
// -------- get the Object from JSON ----
$obj = json_decode($str);
:)
Another way to remove the BOM which is Unicode code point U+FEFF
$str = preg_replace('/\x{FEFF}/u', '', $file);
b'\xef\xbb\xbf' stands for the literal string "\xef\xbb\xbf". If you want to check for a BOM, you need to use double quotes, so the \x sequences are actually interpreted into bytes:
"\xef\xbb\xbf"
Your files also seem to contain a lot more garbage than just a single leading BOM:
$ curl http://ircb.in/jisti/ | xxd
0000000: efbb bfef bbbf efbb bfef bbbf efbb bfef ................
0000010: bbbf efbb bf3c 2144 4f43 5459 5045 2068 .....<!DOCTYPE h
0000020: 746d 6c3e 0a3c 6874 6d6c 3e0a 3c68 6561 tml>.<html>.<hea
...
if anybody using csv import then below code useful
$header = fgetcsv($handle);
foreach($header as $key=> $val) {
$bom = pack('H*','EFBBBF');
$val = preg_replace("/^$bom/", '', $val);
$header[$key] = $val;
}
This global funtion resolve for UTF-8 system base charset. Tanks!
function prepareCharset($str) {
// set default encode
mb_internal_encoding('UTF-8');
// pre filter
if (empty($str)) {
return $str;
}
// get charset
$charset = mb_detect_encoding($str, array('ISO-8859-1', 'UTF-8', 'ASCII'));
if (stristr($charset, 'utf') || stristr($charset, 'iso')) {
$str = iconv('ISO-8859-1', 'UTF-8//TRANSLIT', utf8_decode($str));
} else {
$str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}
// remove BOM
$str = urldecode(str_replace("%C2%81", '', urlencode($str)));
// prepare string
return $str;
}
An extra method to do the same job:
function remove_utf8_bom_head($text) {
if(substr(bin2hex($text), 0, 6) === 'efbbbf') {
$text = substr($text, 3);
}
return $text;
}
The other methods I found cannot work in my case.
Hope it helps in some special case.
A solution without pack function:
$a = "1";
var_dump($a); // string(4) "1"
function deleteBom($text)
{
return preg_replace("/^\xEF\xBB\xBF/", '', $text);
}
var_dump(deleteBom($a)); // string(1) "1"
I'm not so fond of using preg_replace or preg_match for simple tasks. What about this alternative method of detecting and removing the BOM?
function remove_utf8_bom(string $text): string
{
$bomStart = mb_substr($text, 0, 1);
return ($bomStart == pack('H*','EFBBBF')) ?
mb_substr($text, 1) :
$text;
}
If you are reading some API using file_get_contents and got an inexplicable NULL from json_decode, check the value of json_last_error(): sometimes the value returned from file_get_contents will have an extraneous BOM that is almost invisible when you inspect the string, but will make json_last_error() to return JSON_ERROR_SYNTAX (4).
>>> $json = file_get_contents("http://api-guiaserv.seade.gov.br/v1/orgao/all");
=> "\t{"orgao":[{"Nome":"Tribunal de Justi\u00e7a","ID_Orgao":"59","Condicao":"1"}, ...]}"
>>> json_decode($json);
=> null
>>>
In this case, check the first 3 bytes - echoing them is not very useful because the BOM is invisible on most settings:
>>> substr($json, 0, 3)
=> " "
>>> substr($json, 0, 3) == pack('H*','EFBBBF');
=> true
>>>
If the line above returns TRUE for you, then a simple test may fix the problem:
>>> json_decode($json[0] == "{" ? $json : substr($json, 3))
=> {#204
+"orgao": [
{#203
+"Nome": "Tribunal de Justiça",
+"ID_Orgao": "59",
+"Condicao": "1",
},
],
...
}
When working with faulty software it happens that the BOM part gets multiplied with every saving.
So I am using this to get rid of it.
function remove_utf8_bom($text) {
$bom = pack('H*','EFBBBF');
while (preg_match("/^$bom/", $text)) {
$text = preg_replace("/^$bom/", '', $text);
}
return $text;
}
How about this:
function removeUTF8BomHeader($data) {
if (substr($data, 0, 3) == pack('CCC', 0xef, 0xbb, 0xbf)) {
$data = substr($data, 3);
}
return $data;
}
tested a lot and it works perfect without any issue

convert encoded html entities to utf-8

How do I convert this string into UTF-8:
&#0000106&#0000097&#0000118&#0000097&#0000115&#0000099&#0000114&#0000105&#0000112&#0000116&#0000058&#0000097&#0000108&#0000101&#0000114&#0000116&#0000040&#0000039&#0000088&#0000083&#0000083&#0000039&#0000041
I want to also convert this one:
&#x6A&#x61&#x76&#x61&#x73&#x63&#x72&#x69&#x70&#x74&#x3A&#x61&#x6C&#x65&#x72&#x74&#x28&#x27&#x58&#x53&#x53&#x27&#x29
I want to prevent XSS attacks and I am using this article as a cheat sheet https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet
My strategy is to convert the above string to UTF-8 and check if it contains javascript.
I made a simple functions to get the possible HTML, check:
$decimalHTML = '&#0000106&#0000097&#0000118&#0000097&#0000115&#0000099&#0000114&#0000105&#0000112&#0000116&#0000058&#0000097&#0000108&#0000101&#0000114&#0000116&#0000040&#0000039&#0000088&#0000083&#0000083&#0000039&#0000041';
$hexHTML = '&#x6A&#x61&#x76&#x61&#x73&#x63&#x72&#x69&#x70&#x74&#x3A&#x61&#x6C&#x65&#x72&#x74&#x28&#x27&#x58&#x53&#x53&#x27&#x29';
function getDecimalHTML($str) {
return str_replace(
'&#',
'',
preg_replace_callback(
'/\d+/',
function($v) {
return str_replace(';', '', implode(array_map('chr', $v)));
}, $str
)
);
}
function getHexDecimalHTML($str) {
return str_replace(
array('&#', 'x'),
'',
preg_replace_callback(
'/(?<=x)\w+/',
function($v) {
return str_replace(';', '', implode(array_map('hex2bin', $v)));
},
$str
)
);
}
echo getDecimalHTML($decimalHTML) . "\n";
echo getHexDecimalHTML($hexHTML);
Show me:
javascript:alert('XSS')
javascript:alert('XSS')
I used chr to get de char from ASCII and hex2bin to get the string from hexadecimal code....
I recommend not reinvent the wheel and use libraries that work for you and they cover all aspects of this problem, like AntiXSS

php - json_decode returning NULL [duplicate]

Using PHP5 (cgi) to output template files from the filesystem and having issues spitting out raw HTML.
private function fetch($name) {
$path = $this->j->config['template_path'] . $name . '.html';
if (!file_exists($path)) {
dbgerror('Could not find the template "' . $name . '" in ' . $path);
}
$f = fopen($path, 'r');
$t = fread($f, filesize($path));
fclose($f);
if (substr($t, 0, 3) == b'\xef\xbb\xbf') {
$t = substr($t, 3);
}
return $t;
}
Even though I've added the BOM fix I'm still having problems with Firefox accepting it. You can see a live copy here: http://ircb.in/jisti/ (and the template file I threw at http://ircb.in/jisti/home.html if you want to check it out)
Any idea how to fix this? o_o
you would use the following code to remove utf8 bom
//Remove UTF8 Bom
function remove_utf8_bom($text)
{
$bom = pack('H*','EFBBBF');
$text = preg_replace("/^$bom/", '', $text);
return $text;
}
try:
// -------- read the file-content ----
$str = file_get_contents($source_file);
// -------- remove the utf-8 BOM ----
$str = str_replace("\xEF\xBB\xBF",'',$str);
// -------- get the Object from JSON ----
$obj = json_decode($str);
:)
Another way to remove the BOM which is Unicode code point U+FEFF
$str = preg_replace('/\x{FEFF}/u', '', $file);
b'\xef\xbb\xbf' stands for the literal string "\xef\xbb\xbf". If you want to check for a BOM, you need to use double quotes, so the \x sequences are actually interpreted into bytes:
"\xef\xbb\xbf"
Your files also seem to contain a lot more garbage than just a single leading BOM:
$ curl http://ircb.in/jisti/ | xxd
0000000: efbb bfef bbbf efbb bfef bbbf efbb bfef ................
0000010: bbbf efbb bf3c 2144 4f43 5459 5045 2068 .....<!DOCTYPE h
0000020: 746d 6c3e 0a3c 6874 6d6c 3e0a 3c68 6561 tml>.<html>.<hea
...
if anybody using csv import then below code useful
$header = fgetcsv($handle);
foreach($header as $key=> $val) {
$bom = pack('H*','EFBBBF');
$val = preg_replace("/^$bom/", '', $val);
$header[$key] = $val;
}
This global funtion resolve for UTF-8 system base charset. Tanks!
function prepareCharset($str) {
// set default encode
mb_internal_encoding('UTF-8');
// pre filter
if (empty($str)) {
return $str;
}
// get charset
$charset = mb_detect_encoding($str, array('ISO-8859-1', 'UTF-8', 'ASCII'));
if (stristr($charset, 'utf') || stristr($charset, 'iso')) {
$str = iconv('ISO-8859-1', 'UTF-8//TRANSLIT', utf8_decode($str));
} else {
$str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}
// remove BOM
$str = urldecode(str_replace("%C2%81", '', urlencode($str)));
// prepare string
return $str;
}
An extra method to do the same job:
function remove_utf8_bom_head($text) {
if(substr(bin2hex($text), 0, 6) === 'efbbbf') {
$text = substr($text, 3);
}
return $text;
}
The other methods I found cannot work in my case.
Hope it helps in some special case.
A solution without pack function:
$a = "1";
var_dump($a); // string(4) "1"
function deleteBom($text)
{
return preg_replace("/^\xEF\xBB\xBF/", '', $text);
}
var_dump(deleteBom($a)); // string(1) "1"
I'm not so fond of using preg_replace or preg_match for simple tasks. What about this alternative method of detecting and removing the BOM?
function remove_utf8_bom(string $text): string
{
$bomStart = mb_substr($text, 0, 1);
return ($bomStart == pack('H*','EFBBBF')) ?
mb_substr($text, 1) :
$text;
}
If you are reading some API using file_get_contents and got an inexplicable NULL from json_decode, check the value of json_last_error(): sometimes the value returned from file_get_contents will have an extraneous BOM that is almost invisible when you inspect the string, but will make json_last_error() to return JSON_ERROR_SYNTAX (4).
>>> $json = file_get_contents("http://api-guiaserv.seade.gov.br/v1/orgao/all");
=> "\t{"orgao":[{"Nome":"Tribunal de Justi\u00e7a","ID_Orgao":"59","Condicao":"1"}, ...]}"
>>> json_decode($json);
=> null
>>>
In this case, check the first 3 bytes - echoing them is not very useful because the BOM is invisible on most settings:
>>> substr($json, 0, 3)
=> " "
>>> substr($json, 0, 3) == pack('H*','EFBBBF');
=> true
>>>
If the line above returns TRUE for you, then a simple test may fix the problem:
>>> json_decode($json[0] == "{" ? $json : substr($json, 3))
=> {#204
+"orgao": [
{#203
+"Nome": "Tribunal de Justiça",
+"ID_Orgao": "59",
+"Condicao": "1",
},
],
...
}
When working with faulty software it happens that the BOM part gets multiplied with every saving.
So I am using this to get rid of it.
function remove_utf8_bom($text) {
$bom = pack('H*','EFBBBF');
while (preg_match("/^$bom/", $text)) {
$text = preg_replace("/^$bom/", '', $text);
}
return $text;
}
How about this:
function removeUTF8BomHeader($data) {
if (substr($data, 0, 3) == pack('CCC', 0xef, 0xbb, 0xbf)) {
$data = substr($data, 3);
}
return $data;
}
tested a lot and it works perfect without any issue

PHP: json_decode not working

This does not work:
$jsonDecode = json_decode($jsonData, TRUE);
However if I copy the string from $jsonData and put it inside the decode function manually it does work.
This works:
$jsonDecode = json_decode('{"id":"0","bid":"918","url":"http:\/\/www.google.com","md5":"6361fbfbee69f444c394f3d2fa062f79","time":"2014-06-02 14:20:21"}', TRUE);
I did output $jsonData copied it and put in like above in the decode function. Then it worked. However if I put $jsonData directly in the decode function it does not.
var_dump($jsonData) shows:
string(144) "{"id":"0","bid":"918","url":"http:\/\/www.google.com","md5":"6361fbfbee69f444c394f3d2fa062f79","time":"2014-06-02 14:20:21"}"
The $jsonData comes from a encrypted $_GET variable. To encrypt it I use this:
$key = "SOME KEY";
$iv_size = mcrypt_get_iv_size(MCRYPT_BLOWFISH, MCRYPT_MODE_ECB);
$iv = mcrypt_create_iv($iv_size, MCRYPT_RAND);
$enc = mcrypt_encrypt(MCRYPT_BLOWFISH, $key, $data, MCRYPT_MODE_ECB, $iv);
$iv = rawurlencode(base64_encode($iv));
$enc = rawurlencode(base64_encode($enc));
//To Decrypt
$iv = base64_decode(rawurldecode($_GET['i']));
$enc = base64_decode(rawurldecode($_GET['e']));
$data = mcrypt_decrypt(MCRYPT_BLOWFISH, $key, $enc, MCRYPT_MODE_ECB, $iv);
some time there is issue of html entities, for example \" it will represent like this \&quot, so you must need to parse the html entites to real text, that you can do using
html_entity_decode()
method of php.
$jsonData = stripslashes(html_entity_decode($jsonData));
$k=json_decode($jsonData,true);
print_r($k);
You have to use preg_replace for avoiding the null results from json_decode
here is the example code
$json_string = stripslashes(html_entity_decode($json_string));
$bookingdata = json_decode( preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $json_string), true );
Most likely you need to strip off the padding from your decrypted data. There are 124 visible characters in your string but var_dump reports 144. Which means 20 characters of padding needs to be removed (a series of "\0" bytes at the end of your string).
Probably that's 4 "\0" bytes at the end of a block + an empty 16-bytes block (to mark the end of the data).
How are you currently decrypting/encrypting your string?
Edit:
You need to add this to trim the zero bytes at the end of the string:
$jsonData = rtrim($jsonData, "\0");
Judging from the other comments, you could use,
$jsonDecode = json_decode(trim($jsonData), TRUE);
While moving on php 7.1 I encountered with json_decode error number 4 (json syntex error). None of the above solution on this page worked for me.
After doing some more searching i found solution at https://stackoverflow.com/a/15423899/1545384 and its working for me.
//Remove UTF8 Bom
function remove_utf8_bom($text)
{
$bom = pack('H*','EFBBBF');
$text = preg_replace("/^$bom/", '', $text);
return $text;
}
Be sure to set header to JSON
header('Content-type: application/json;');
str_replace("\t", " ", str_replace("\n", " ", $string))
because json_decode does not work with special characters. And no error will be displayed. Make sure you remove tab spaces and new lines.
Depending on the source you get your data, you might need also:
stripslashes(html_entity_decode($string))
Works for me:
<?php
$sql = <<<EOT
SELECT *
FROM `students`;
EOT;
$string = '{ "query" : "' . str_replace("\t", " ", str_replace("\n", " ", $sql)).'" }';
print_r(json_decode($string));
?>
output:
stdClass Object
(
[query] => SELECT * FROM `students`;
)
I had problem that json_decode did not work, solution was to change string encoding to utf-8. This is important in case you have non-latin characters.
Interestingly mcrypt_decrypt seem to add control characters other than \0 at the end of the resulting text because of its padding algorithm. Therefore instead of rtrim($jsonData, "\0")
it is recommended to use
preg_replace( "/\p{Cc}*$/u", "", $data)
on the result $data of mcrypt_decrypt. json_decode will work if all trailing control characters are removed. Pl refer to the comment by Peter Bailey at http://php.net/manual/en/function.mdecrypt-generic.php .
USE THIS CODE
<?php
$json = preg_replace('/[[:cntrl:]]/', '', $json_data);
$json_array = json_decode($json, true);
echo json_last_error();
echo json_last_error_msg();
print_r($json_array);
?>
Make sure your JSON is actually valid. For some reason I was convinced that this was valid JSON:
{ type: "block" }
While it is not. Point being, make sure to validate your string with a linter if you find json_decode not te be working.
Try the JSON validator.
The problem in my case was it used ' not ", so I had to replace it to make it working.
In notepad+ I changed encoding of json file on: "UTF-8 without BOM".
JSON started to work
TL;DR Be sure that your JSON not containing comments :)
I've taken a JSON structure from API reference and tested request using Postman. I've just copy-pasted the JSON and didn't pay attention that there was a comment inside it:
...
"payMethod": {
"type": "PBL" //or "CARD_TOKEN", "INSTALLMENTS"
},
...
Of course after deletion the comment json_decode() started working like a charm :)
Use following function:
If JSON_ERROR_UTF8 occurred :
$encoded = json_encode( utf_convert( $responseForJS ) );
Below function is used to encode Array data recursively
/* Use it for json_encode some corrupt UTF-8 chars
* useful for = malformed utf-8 characters possibly incorrectly encoded by json_encode
*/
function utf_convert( $mixed ) {
if (is_array($mixed)) {
foreach ($mixed as $key => $value) {
$mixed[$key] = utf8ize($value);
}
} elseif (is_string($mixed)) {
return mb_convert_encoding($mixed, "UTF-8", "UTF-8");
}
return $mixed;
}
Maybe it helps someone, check in your json string if you have any NULL values, json_decode will not work if a NULL is present as a value.
This super basic function may help you. I made the NULL in an array just in case I need to add more stuff in the future.
function jsonValueFix($json){
$json = str_replace( array('NULL'),'""',$json );
return $json;
}
I just used json_decode twice and it worked for me
$response = json_decode($apiResponse, true);
$response = json_decode($response, true);

How to remove multiple UTF-8 BOM sequences

Using PHP5 (cgi) to output template files from the filesystem and having issues spitting out raw HTML.
private function fetch($name) {
$path = $this->j->config['template_path'] . $name . '.html';
if (!file_exists($path)) {
dbgerror('Could not find the template "' . $name . '" in ' . $path);
}
$f = fopen($path, 'r');
$t = fread($f, filesize($path));
fclose($f);
if (substr($t, 0, 3) == b'\xef\xbb\xbf') {
$t = substr($t, 3);
}
return $t;
}
Even though I've added the BOM fix I'm still having problems with Firefox accepting it. You can see a live copy here: http://ircb.in/jisti/ (and the template file I threw at http://ircb.in/jisti/home.html if you want to check it out)
Any idea how to fix this? o_o
you would use the following code to remove utf8 bom
//Remove UTF8 Bom
function remove_utf8_bom($text)
{
$bom = pack('H*','EFBBBF');
$text = preg_replace("/^$bom/", '', $text);
return $text;
}
try:
// -------- read the file-content ----
$str = file_get_contents($source_file);
// -------- remove the utf-8 BOM ----
$str = str_replace("\xEF\xBB\xBF",'',$str);
// -------- get the Object from JSON ----
$obj = json_decode($str);
:)
Another way to remove the BOM which is Unicode code point U+FEFF
$str = preg_replace('/\x{FEFF}/u', '', $file);
b'\xef\xbb\xbf' stands for the literal string "\xef\xbb\xbf". If you want to check for a BOM, you need to use double quotes, so the \x sequences are actually interpreted into bytes:
"\xef\xbb\xbf"
Your files also seem to contain a lot more garbage than just a single leading BOM:
$ curl http://ircb.in/jisti/ | xxd
0000000: efbb bfef bbbf efbb bfef bbbf efbb bfef ................
0000010: bbbf efbb bf3c 2144 4f43 5459 5045 2068 .....<!DOCTYPE h
0000020: 746d 6c3e 0a3c 6874 6d6c 3e0a 3c68 6561 tml>.<html>.<hea
...
if anybody using csv import then below code useful
$header = fgetcsv($handle);
foreach($header as $key=> $val) {
$bom = pack('H*','EFBBBF');
$val = preg_replace("/^$bom/", '', $val);
$header[$key] = $val;
}
This global funtion resolve for UTF-8 system base charset. Tanks!
function prepareCharset($str) {
// set default encode
mb_internal_encoding('UTF-8');
// pre filter
if (empty($str)) {
return $str;
}
// get charset
$charset = mb_detect_encoding($str, array('ISO-8859-1', 'UTF-8', 'ASCII'));
if (stristr($charset, 'utf') || stristr($charset, 'iso')) {
$str = iconv('ISO-8859-1', 'UTF-8//TRANSLIT', utf8_decode($str));
} else {
$str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}
// remove BOM
$str = urldecode(str_replace("%C2%81", '', urlencode($str)));
// prepare string
return $str;
}
An extra method to do the same job:
function remove_utf8_bom_head($text) {
if(substr(bin2hex($text), 0, 6) === 'efbbbf') {
$text = substr($text, 3);
}
return $text;
}
The other methods I found cannot work in my case.
Hope it helps in some special case.
A solution without pack function:
$a = "1";
var_dump($a); // string(4) "1"
function deleteBom($text)
{
return preg_replace("/^\xEF\xBB\xBF/", '', $text);
}
var_dump(deleteBom($a)); // string(1) "1"
I'm not so fond of using preg_replace or preg_match for simple tasks. What about this alternative method of detecting and removing the BOM?
function remove_utf8_bom(string $text): string
{
$bomStart = mb_substr($text, 0, 1);
return ($bomStart == pack('H*','EFBBBF')) ?
mb_substr($text, 1) :
$text;
}
If you are reading some API using file_get_contents and got an inexplicable NULL from json_decode, check the value of json_last_error(): sometimes the value returned from file_get_contents will have an extraneous BOM that is almost invisible when you inspect the string, but will make json_last_error() to return JSON_ERROR_SYNTAX (4).
>>> $json = file_get_contents("http://api-guiaserv.seade.gov.br/v1/orgao/all");
=> "\t{"orgao":[{"Nome":"Tribunal de Justi\u00e7a","ID_Orgao":"59","Condicao":"1"}, ...]}"
>>> json_decode($json);
=> null
>>>
In this case, check the first 3 bytes - echoing them is not very useful because the BOM is invisible on most settings:
>>> substr($json, 0, 3)
=> " "
>>> substr($json, 0, 3) == pack('H*','EFBBBF');
=> true
>>>
If the line above returns TRUE for you, then a simple test may fix the problem:
>>> json_decode($json[0] == "{" ? $json : substr($json, 3))
=> {#204
+"orgao": [
{#203
+"Nome": "Tribunal de Justiça",
+"ID_Orgao": "59",
+"Condicao": "1",
},
],
...
}
When working with faulty software it happens that the BOM part gets multiplied with every saving.
So I am using this to get rid of it.
function remove_utf8_bom($text) {
$bom = pack('H*','EFBBBF');
while (preg_match("/^$bom/", $text)) {
$text = preg_replace("/^$bom/", '', $text);
}
return $text;
}
How about this:
function removeUTF8BomHeader($data) {
if (substr($data, 0, 3) == pack('CCC', 0xef, 0xbb, 0xbf)) {
$data = substr($data, 3);
}
return $data;
}
tested a lot and it works perfect without any issue

Categories