fgetcsv encoding issue (PHP)

I am being sent a csv file that is tab delimited. Here is a sample of what I see:
Invoice: Invoice Date Account: Name Bill To: First Name Bill To: Last Name Bill To: Work Email Rate Plan Charge: Name Subscription: Device Serial Number
2021-03-10 Test Company Wally Kolcz test@test.com Sample plan A0H1234567890A
I wrote a script to open, read and loop over the values but I get weird stuff after:
if (($handle = fopen($user_file, "r")) !== FALSE) {
while (($data = fgetcsv($handle, 1000, "\t")) !== FALSE) {
if($line >1 && isset($data[1])){
$user = [
'EmailAddress' => $data[4],
'Name' => $data[2].' '.$data[3],
];
}
$line++;
}
fclose($handle);
}
Here is what I get when I dump the first line.
array:7 [▼
0 => b"ÿþI\x00n\x00v\x00o\x00i\x00c\x00e\x00:\x00 \x00I\x00n\x00v\x00o\x00i\x00c\x00e\x00 \x00D\x00a\x00t\x00e\x00"
1 => "\x00A\x00c\x00c\x00o\x00u\x00n\x00t\x00:\x00 \x00N\x00a\x00m\x00e\x00"
2 => "\x00B\x00i\x00l\x00l\x00 \x00T\x00o\x00:\x00 \x00F\x00i\x00r\x00s\x00t\x00 \x00N\x00a\x00m\x00e\x00"
3 => "\x00B\x00i\x00l\x00l\x00 \x00T\x00o\x00:\x00 \x00L\x00a\x00s\x00t\x00 \x00N\x00a\x00m\x00e\x00"
4 => "\x00B\x00i\x00l\x00l\x00 \x00T\x00o\x00:\x00 \x00W\x00o\x00r\x00k\x00 \x00E\x00m\x00a\x00i\x00l\x00"
5 => "\x00R\x00a\x00t\x00e\x00 \x00P\x00l\x00a\x00n\x00 \x00C\x00h\x00a\x00r\x00g\x00e\x00:\x00 \x00N\x00a\x00m\x00e\x00"
6 => "\x00S\x00u\x00b\x00s\x00c\x00r\x00i\x00p\x00t\x00i\x00o\x00n\x00:\x00 \x00D\x00e\x00v\x00i\x00c\x00e\x00 \x00S\x00e\x00r\x00i\x00a\x00l\x00 \x00N\x00u\x00m\x00b\x00e\x00r\x00 ◀"
]
I tried adding:
header('Content-Type: text/html; charset=UTF-8');
$data = array_map("utf8_encode", $data);
setlocale(LC_ALL, 'en_US.UTF-8');
And when I dump mb_detect_encoding($data[2]), I get 'ASCII'...
Any way to fix this so I don't have to manually update the file each time I receive it? Thanks!

Looks like the file is in UTF-16 (every other byte is null).
You probably need to convert the whole file with something like mb_convert_encoding($data, "UTF-8", "UTF-16");
But you can't really use fgetcsv() in that case…
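One way around that limitation, as a rough sketch and assuming the file is small enough to hold in memory: convert the whole string first, then hand the UTF-8 result to fgetcsv() through a php://memory stream. The function name utf16le_tsv_rows() is made up for this illustration and is not part of PHP.

```php
<?php
// Hypothetical helper (not from the original posts): decode a UTF-16LE,
// tab-delimited string into an array of UTF-8 rows.
function utf16le_tsv_rows(string $raw): array
{
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'UTF-16LE');

    // A leading BOM (U+FEFF) may survive the conversion as EF BB BF; drop it.
    if (strncmp($utf8, "\xEF\xBB\xBF", 3) === 0) {
        $utf8 = substr($utf8, 3);
    }

    // fgetcsv() needs a stream, so park the converted text in memory.
    $fh = fopen('php://memory', 'w+b');
    fwrite($fh, $utf8);
    rewind($fh);

    $rows = [];
    while (($row = fgetcsv($fh, 0, "\t", '"', '\\')) !== false) {
        if ($row === [null]) {
            continue; // fgetcsv() reports a blank line as [null]
        }
        $rows[] = $row;
    }
    fclose($fh);

    return $rows;
}
```

Usage would look like $rows = utf16le_tsv_rows(file_get_contents($user_file)); for large files a stream-based approach avoids holding the whole file in memory.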

As @Andrea already mentioned, your data is encoded as UTF-16LE and you need to convert it to an encoding compatible with what you want to do. That said, it is possible to do this in-flight with PHP stream filters.
abstract class TranslateCharset extends php_user_filter {
protected $in_charset, $out_charset;
private $buffer = '';
private $total_consumed = 0;
public function filter($in, $out, &$consumed, $closing) {
$output = '';
while ($bucket = stream_bucket_make_writeable($in)) {
$input = $this->buffer . $bucket->data;
for( $i=0, $p=0; ($c=mb_substr($input, $i, 1, $this->in_charset)) !== ""; ++$i, $p+=strlen($c) ) {
$output .= mb_convert_encoding($c, $this->out_charset, $this->in_charset);
}
$this->buffer = substr($input, $p);
$consumed += $p;
}
// this means that there's unconverted data at the end of the bucket brigade.
if( $closing && strlen($this->buffer) > 0 ) {
$this->raise_error( sprintf(
"Likely encoding error at offset %d in input stream, subsequent data may be malformed or missing.",
// report the offset without mutating the counter; it is updated once below
$this->total_consumed + $consumed)
);
$consumed += strlen($this->buffer);
// give it the ol' college try
$output .= mb_convert_encoding($this->buffer, $this->out_charset, $this->in_charset);
}
$this->total_consumed += $consumed;
if ( ! isset($bucket) ) {
$bucket = stream_bucket_new($this->stream, $output);
} else {
$bucket->data = $output;
}
stream_bucket_append($out, $bucket);
return PSFS_PASS_ON;
}
protected function raise_error($message) {
user_error( sprintf(
"%s[%s]: %s",
__CLASS__, get_class($this), $message
), E_USER_WARNING);
}
}
class UTF16LEtoUTF8 extends TranslateCharset {
protected $in_charset = 'UTF-16LE';
protected $out_charset = 'UTF-8';
}
stream_filter_register('UTF16LEtoUTF8', 'UTF16LEtoUTF8');
// properly-encoded UTF-16LE example input "Invoice:,a"
$in = "\xFF\xFEI\x00n\x00v\x00o\x00i\x00c\x00e\x00:\x00,\x00a\x00";
// prep example pipe, in practice this would simply be your fopen() call.
$fh = fopen('php://memory', 'rwb+');
fwrite($fh, $in);
rewind($fh);
// skip BOM
fseek($fh, 2);
stream_filter_append($fh, 'UTF16LEtoUTF8', STREAM_FILTER_READ);
var_dump(fgetcsv($fh, 4096));
Output:
array(2) {
[0]=>
string(8) "Invoice:"
[1]=>
string(1) "a"
}
In practice there is no "magic bullet" for detecting the encoding of an input file or string. In this case there is a Byte Order Mark [BOM] of 0xFF 0xFE, which denotes UTF-16LE, but a BOM is an unreliable signal: it is frequently omitted, the same bytes may occur naturally at the beginning of any arbitrary string, most encodings do not require one, and whoever encoded the data may simply not have used one.
That last bit is the exact reason why everyone should avoid the utf8_encode() and utf8_decode() functions like the plague, because they simply assume that you only ever want to go between UTF-8 and ISO-8859-1 [western european], and make no effort to avoid corrupting your data when used incorrectly because they can't possibly know any better.
TLDR: You must explicitly know the encoding of your input data, or you're going to have a bad time.
Edit: Since I've gone and put a proper spitshine on this I've put it up as a Composer package, in case anyone else needs something like this.
https://packagist.org/packages/wrossmann/costrenc
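When a BOM does happen to be present, checking for it is cheap. A minimal sketch of that check follows; sniff_bom() is a hypothetical name, and a null result only means "no known BOM", not "no encoding problem".

```php
<?php
// Hypothetical helper: map a leading byte order mark to an encoding name.
// Returns null when no known BOM is present, which says nothing about the
// actual encoding; the caller still has to know it or be told.
function sniff_bom(string $bytes): ?string
{
    // Longer signatures first: the UTF-32LE BOM starts with the UTF-16LE one.
    $signatures = [
        "\xFF\xFE\x00\x00" => 'UTF-32LE',
        "\x00\x00\xFE\xFF" => 'UTF-32BE',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFF\xFE"         => 'UTF-16LE',
        "\xFE\xFF"         => 'UTF-16BE',
    ];
    foreach ($signatures as $bom => $encoding) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}
```

The result can be passed as the source encoding to mb_convert_encoding(), with an explicit fallback for the null case rather than a guess.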

I ended up with this as working code:
$f = file_get_contents($user_file);
$f = mb_convert_encoding($f, 'UTF-8', 'UTF-16LE');
$f = preg_split("/\R/", $f);
$f = array_map('str_getcsv', $f);
$line = 0;
foreach($f as $record){
// skip the header row (line 0)
if($line !== 0 && isset($record[0])){
$pieces = preg_split('/[\t]/',$record[0]);
//My work here
}
$line++;
}
Thank you everyone for your examples and suggestions!

Related

PHP - After getting a csv file parsed into an array, can not match two exact strings [duplicate]

Using PHP5 (cgi) to output template files from the filesystem and having issues spitting out raw HTML.
private function fetch($name) {
$path = $this->j->config['template_path'] . $name . '.html';
if (!file_exists($path)) {
dbgerror('Could not find the template "' . $name . '" in ' . $path);
}
$f = fopen($path, 'r');
$t = fread($f, filesize($path));
fclose($f);
if (substr($t, 0, 3) == b'\xef\xbb\xbf') {
$t = substr($t, 3);
}
return $t;
}
Even though I've added the BOM fix I'm still having problems with Firefox accepting it. You can see a live copy here: http://ircb.in/jisti/ (and the template file I threw at http://ircb.in/jisti/home.html if you want to check it out)
Any idea how to fix this? o_o

You would use the following code to remove the UTF-8 BOM:
//Remove UTF8 Bom
function remove_utf8_bom($text)
{
$bom = pack('H*','EFBBBF');
$text = preg_replace("/^$bom/", '', $text);
return $text;
}

try:
// -------- read the file-content ----
$str = file_get_contents($source_file);
// -------- remove the utf-8 BOM ----
$str = str_replace("\xEF\xBB\xBF",'',$str);
// -------- get the Object from JSON ----
$obj = json_decode($str);
:)

Another way to remove the BOM which is Unicode code point U+FEFF
$str = preg_replace('/\x{FEFF}/u', '', $file);

b'\xef\xbb\xbf' stands for the literal string "\xef\xbb\xbf". If you want to check for a BOM, you need to use double quotes, so the \x sequences are actually interpreted into bytes:
"\xef\xbb\xbf"
Your files also seem to contain a lot more garbage than just a single leading BOM:
$ curl http://ircb.in/jisti/ | xxd
0000000: efbb bfef bbbf efbb bfef bbbf efbb bfef ................
0000010: bbbf efbb bf3c 2144 4f43 5459 5045 2068 .....<!DOCTYPE h
0000020: 746d 6c3e 0a3c 6874 6d6c 3e0a 3c68 6561 tml>.<html>.<hea
...
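The quoting difference is easy to check directly, and the hex dump above suggests stripping more than one BOM. A small sketch of both points; strip_leading_boms() is a hypothetical helper name, not something from the question.

```php
<?php
// Single quotes keep the backslashes: 12 literal characters.
$literal = '\xef\xbb\xbf';
// Double quotes interpret each \x escape: 3 raw bytes.
$bom = "\xef\xbb\xbf";

var_dump(strlen($literal)); // int(12)
var_dump(strlen($bom));     // int(3)

// Hypothetical helper: strip every BOM stacked at the start of $text.
function strip_leading_boms(string $text): string
{
    while (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        $text = substr($text, 3);
    }
    return $text;
}
```

Only leading BOMs are removed here; a U+FEFF in the middle of the text is left alone on purpose.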

If anybody is doing a CSV import, the code below is useful:
$header = fgetcsv($handle);
foreach($header as $key=> $val) {
$bom = pack('H*','EFBBBF');
$val = preg_replace("/^$bom/", '', $val);
$header[$key] = $val;
}

This global function resolves it for a system whose base charset is UTF-8. Thanks!
function prepareCharset($str) {
// set default encode
mb_internal_encoding('UTF-8');
// pre filter
if (empty($str)) {
return $str;
}
// get charset
$charset = mb_detect_encoding($str, array('ISO-8859-1', 'UTF-8', 'ASCII'));
if (stristr($charset, 'utf') || stristr($charset, 'iso')) {
$str = iconv('ISO-8859-1', 'UTF-8//TRANSLIT', utf8_decode($str));
} else {
$str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}
// remove BOM
$str = urldecode(str_replace("%C2%81", '', urlencode($str)));
// prepare string
return $str;
}

An extra method to do the same job:
function remove_utf8_bom_head($text) {
if(substr(bin2hex($text), 0, 6) === 'efbbbf') {
$text = substr($text, 3);
}
return $text;
}
The other methods I found did not work in my case.
Hope it helps in some special case.

A solution without pack function:
$a = "\xEF\xBB\xBF1"; // "1" preceded by an (otherwise invisible) UTF-8 BOM
var_dump($a); // string(4) "1"
function deleteBom($text)
{
return preg_replace("/^\xEF\xBB\xBF/", '', $text);
}
var_dump(deleteBom($a)); // string(1) "1"

I'm not so fond of using preg_replace or preg_match for simple tasks. What about this alternative method of detecting and removing the BOM?
function remove_utf8_bom(string $text): string
{
$bomStart = mb_substr($text, 0, 1);
return ($bomStart == pack('H*','EFBBBF')) ?
mb_substr($text, 1) :
$text;
}

If you are reading from an API using file_get_contents and get an inexplicable NULL from json_decode, check the value of json_last_error(): sometimes the value returned from file_get_contents has an extraneous BOM that is almost invisible when you inspect the string, but makes json_last_error() return JSON_ERROR_SYNTAX (4).
>>> $json = file_get_contents("http://api-guiaserv.seade.gov.br/v1/orgao/all");
=> "\t{"orgao":[{"Nome":"Tribunal de Justi\u00e7a","ID_Orgao":"59","Condicao":"1"}, ...]}"
>>> json_decode($json);
=> null
>>>
In this case, check the first 3 bytes - echoing them is not very useful because the BOM is invisible on most settings:
>>> substr($json, 0, 3)
=> " "
>>> substr($json, 0, 3) == pack('H*','EFBBBF');
=> true
>>>
If the line above returns TRUE for you, then a simple test may fix the problem:
>>> json_decode($json[0] == "{" ? $json : substr($json, 3))
=> {#204
+"orgao": [
{#203
+"Nome": "Tribunal de Justiça",
+"ID_Orgao": "59",
+"Condicao": "1",
},
],
...
}
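The same check can be wrapped up so a bad payload fails loudly instead of returning a silent NULL. This is only a sketch; json_decode_skip_bom() is a hypothetical name, and it tests for the BOM bytes rather than for a leading "{" so that JSON arrays and scalars also work.

```php
<?php
// Hypothetical wrapper: strip one leading UTF-8 BOM, decode, and turn a
// decoding failure into an exception instead of a silent null.
function json_decode_skip_bom(string $json)
{
    if (strncmp($json, "\xEF\xBB\xBF", 3) === 0) {
        $json = substr($json, 3);
    }

    $value = json_decode($json);
    if (json_last_error() !== JSON_ERROR_NONE) {
        throw new RuntimeException('JSON decode failed: ' . json_last_error_msg());
    }

    return $value;
}
```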

With faulty software, the BOM can get duplicated every time the file is saved.
So I am using this to get rid of all of them.
function remove_utf8_bom($text) {
$bom = pack('H*','EFBBBF');
while (preg_match("/^$bom/", $text)) {
$text = preg_replace("/^$bom/", '', $text);
}
return $text;
}

How about this:
function removeUTF8BomHeader($data) {
if (substr($data, 0, 3) == pack('CCC', 0xef, 0xbb, 0xbf)) {
$data = substr($data, 3);
}
return $data;
}
tested a lot and it works perfect without any issue

making print_r use PHP_EOL

My PHP_EOL is "\r\n", however, when I do print_r on an array each new line has a "\n" - not a "\r\n" - placed after it.
Any idea if it's possible to change this behavior?

If you look the source code of print_r you'll find:
PHP_FUNCTION(print_r)
{
zval *var;
zend_bool do_return = 0;
if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "z|b", &var, &do_return) == FAILURE) {
RETURN_FALSE;
}
if (do_return) {
php_output_start_default(TSRMLS_C);
}
zend_print_zval_r(var, 0 TSRMLS_CC);
if (do_return) {
php_output_get_contents(return_value TSRMLS_CC);
php_output_discard(TSRMLS_C);
} else {
RETURN_TRUE;
}
}
Ultimately you can ignore the stuff around zend_print_zval_r(var, 0 TSRMLS_CC); for your question.
If you follow the stacktrace, you'll find:
ZEND_API void zend_print_zval_r(zval *expr, int indent TSRMLS_DC) /* {{{ */
{
zend_print_zval_r_ex(zend_write, expr, indent TSRMLS_CC);
}
which leads to
ZEND_API void zend_print_zval_r_ex(zend_write_func_t write_func, zval *expr, int indent TSRMLS_DC) /* {{{ */
{
switch (Z_TYPE_P(expr)) {
case IS_ARRAY:
ZEND_PUTS_EX("Array\n");
if (++Z_ARRVAL_P(expr)->nApplyCount>1) {
ZEND_PUTS_EX(" *RECURSION*");
Z_ARRVAL_P(expr)->nApplyCount--;
return;
}
print_hash(write_func, Z_ARRVAL_P(expr), indent, 0 TSRMLS_CC);
Z_ARRVAL_P(expr)->nApplyCount--;
break;
From this point on, you could continue to find the relevant line, but since there is already a hardcoded "Array\n", I'll assume the rest of the print_r implementation uses the same hardcoded \n line break.
So, to answer your question: You cannot change it to use \r\n.
Use one of the provided workarounds.
Side note: Since print_r is mainly used for debugging, this will do the job as well:
echo "<pre>";
print_r($object);
echo "</pre>";
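The hardcoded line ending is also easy to confirm from userland, without reading the C source. A quick check, which should behave the same regardless of the platform's PHP_EOL:

```php
<?php
$out = print_r(['a' => 1], true);

// No carriage return anywhere, even on platforms where PHP_EOL is "\r\n".
var_dump(strpos($out, "\r") === false); // bool(true)

// Normalizing afterwards is the only lever available from PHP code.
$crlf = str_replace("\n", "\r\n", $out);
```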

Use the second parameter of print_r (set it to true); see the documentation:
http://www.php.net/manual/en/function.print-r.php
See:
mixed print_r ( mixed $expression [, bool $return = false ] );
Example:
$eol = chr(10); // Unix-style line break
$weol = chr(13) . $eol; // line break with "carriage return" (required by some text editors)
$data = print_r(array(...), true);
$data = str_replace($eol, $weol, $data);
echo $data;

As pointed out elsewhere on this page, the newlines are hardcoded in the PHP source, so you have to replace them manually.
You could use your own version of print_r like this:
namespace My;
function print_r($expression, $return = false)
{
$out = \print_r($expression, true);
$out = \preg_replace("#(?<!\r)\n#", PHP_EOL, $out);
if ($return) {
return $out;
}
echo $out;
return true;
}
Whenever you want to use it, you just import it with
// importing a function (PHP 5.6+)
use function My\print_r;
print_r("A string with \r\n is not replaced");
print_r("A string with \n is replaced");
This will then use PHP_EOL for newlines. Note that it will only substitute bare newlines, i.e. \n, but not any \r\n you might have in the $expression. This prevents any \r\n from becoming \r\r\n.
The benefit of doing it this way is that it will work as a drop-in replacement of the native function. So any code that already uses the native print_r can be replaced simply by adding the use statement.

This may not be the most elegant solution, but you could capture the print_r() output using output buffering, then use str_replace() to replace occurrences of \n with your PHP_EOL. In this example I've replaced it with x followed by PHP_EOL to show that it's working...
ob_start();
$test_array = range('A', 'Z');
print_r($test_array);
$dump = ob_get_contents();
ob_end_clean();
As pointed out by dognose, since PHP 4.3 you can return the result of print_r() into a string (more elegant):
$dump = print_r($test_array, true);
Then replace line endings:
$dump = str_replace("\n", "x" . PHP_EOL, $dump);
echo $dump;
Output:
Arrayx
(x
[0] => Ax
[1] => Bx
[2] => Cx
[3] => Dx
[4] => Ex
[5] => Fx
[6] => Gx
... etc
[25] => Zx
)x

The question "Is it possible to change the behavior of PHP's print_r function" was marked as a duplicate of this one. I'd like to say more about how the behavior of print_r can be changed. My proposal is to write another function, under another name, that produces the customized print_r output. Then we just need to replace calls to print_r with print_r_pretty ...
function print_r_pretty($in, $saveToString = false) {
$out = print_r($in, true);
$out = str_replace("\n", "\r\n", $out);
switch ($saveToString) {
case true: return $out;
default: echo $out;
}
}
But the line:
$out = str_replace("\n", "\r\n", $out);
can be replaced by another line that makes a different change to the print_r output, like this:
$out = explode("\n", $out, 2)[1];

PHP multipart form data PUT request?

I'm writing a RESTful API. I'm having trouble with uploading images using the different verbs.
Consider:
I have an object which can be created/modified/deleted/viewed via a post/put/delete/get request to a URL. The request is multi part form when there is a file to upload, or application/xml when there's just text to process.
To handle the image uploads which are associated with the object I am doing something like:
if(isset($_FILES['userfile'])) {
$data = $this->image_model->upload_image();
if($data['error']){
$this->response(array('error' => $error['error']));
}
$xml_data = (array)simplexml_load_string( urldecode($_POST['xml']) );
$object = (array)$xml_data['object'];
} else {
$object = $this->body('object');
}
The major problem here is when trying to handle a put request, obviously $_POST doesn't contain the put data (as far as I can tell!).
For reference this is how I'm building the requests:
curl -F userfile=@./image.png -F xml="<xml><object>stuff to edit</object></xml>" http://example.com/object -X PUT
Does anyone have any ideas how I can access the xml variable in my PUT request?

First of all, $_FILES is not populated when handling PUT requests. It is only populated by PHP when handling POST requests.
You need to parse it manually. That goes for "regular" fields as well:
// Fetch content and determine boundary
$raw_data = file_get_contents('php://input');
$boundary = substr($raw_data, 0, strpos($raw_data, "\r\n"));
// Fetch each part
$parts = array_slice(explode($boundary, $raw_data), 1);
$data = array();
foreach ($parts as $part) {
// If this is the last part, break
if ($part == "--\r\n") break;
// Separate content from headers
$part = ltrim($part, "\r\n");
list($raw_headers, $body) = explode("\r\n\r\n", $part, 2);
// Parse the headers list
$raw_headers = explode("\r\n", $raw_headers);
$headers = array();
foreach ($raw_headers as $header) {
list($name, $value) = explode(':', $header, 2); // limit 2: header values may contain ':'
$headers[strtolower($name)] = ltrim($value, ' ');
}
// Parse the Content-Disposition to get the field name, etc.
if (isset($headers['content-disposition'])) {
$filename = null;
preg_match(
'/^(.+); *name="([^"]+)"(; *filename="([^"]+)")?/',
$headers['content-disposition'],
$matches
);
list(, $type, $name) = $matches;
isset($matches[4]) and $filename = $matches[4];
// handle your fields here
switch ($name) {
// this is a file upload
case 'userfile':
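// basename() guards against path traversal in the client-supplied name;
// substr() drops the trailing CRLF that precedes the next boundary
file_put_contents(basename($filename), substr($body, 0, -2));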
break;
// default for all other files is to populate $data
default:
$data[$name] = substr($body, 0, strlen($body) - 2);
break;
}
}
}
At each iteration, the $data array will be populated with your parameters, and the $headers array will be populated with the headers for each part (e.g.: Content-Type, etc.), and $filename will contain the original filename, if supplied in the request and is applicable to the field.
Take note the above will only work for multipart content types. Make sure to check the request Content-Type header before using the above to parse the body.
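That Content-Type check could be sketched as below. The helper name multipart_boundary() is hypothetical; it reads the boundary from the header value instead of from the first line of the body. Note that the delimiter lines in the body are the boundary prefixed with "--", which is what the parsing code above obtains by reading the first line.

```php
<?php
// Hypothetical helper: return the multipart boundary from a Content-Type
// header value, or null when the request is not multipart/form-data.
function multipart_boundary(string $contentType): ?string
{
    if (stripos($contentType, 'multipart/form-data') !== 0) {
        return null;
    }
    if (!preg_match('/boundary=(?:"([^"]+)"|([^;\s]+))/i', $contentType, $m)) {
        return null;
    }
    // The quoted form lands in group 1, the bare form in group 2.
    return $m[1] !== '' ? $m[1] : $m[2];
}
```

In a request handler this would typically be called with the incoming header, for example multipart_boundary(isset($_SERVER['CONTENT_TYPE']) ? $_SERVER['CONTENT_TYPE'] : ''), and the multipart parser run only when the result is not null.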

Please don't delete this again, it's helpful to a majority of people coming here! All previous answers were partial answers that don't cover the solution the way most people asking this question would want.
This takes what has been said above and additionally handles multiple file uploads and places them in $_FILES as one would expect. To get this to work, you have to add 'Script PUT /put.php' to your Virtual Host for the project, per the documentation. I also suspect I'll have to set up a cron job to clean up any '.tmp' files.
private function _parsePut( )
{
global $_PUT;
/* PUT data comes in on the stdin stream */
$putdata = fopen("php://input", "r");
/* Open a file for writing */
// $fp = fopen("myputfile.ext", "w");
$raw_data = '';
/* Read the data 1 KB at a time
and write to the file */
while ($chunk = fread($putdata, 1024))
$raw_data .= $chunk;
/* Close the streams */
fclose($putdata);
// Fetch content and determine boundary
$boundary = substr($raw_data, 0, strpos($raw_data, "\r\n"));
if(empty($boundary)){
parse_str($raw_data,$data);
$GLOBALS[ '_PUT' ] = $data;
return;
}
// Fetch each part
$parts = array_slice(explode($boundary, $raw_data), 1);
$data = array();
foreach ($parts as $part) {
// If this is the last part, break
if ($part == "--\r\n") break;
// Separate content from headers
$part = ltrim($part, "\r\n");
list($raw_headers, $body) = explode("\r\n\r\n", $part, 2);
// Parse the headers list
$raw_headers = explode("\r\n", $raw_headers);
$headers = array();
foreach ($raw_headers as $header) {
list($name, $value) = explode(':', $header, 2); // limit 2: header values may contain ':'
$headers[strtolower($name)] = ltrim($value, ' ');
}
// Parse the Content-Disposition to get the field name, etc.
if (isset($headers['content-disposition'])) {
$filename = null;
$tmp_name = null;
preg_match(
'/^(.+); *name="([^"]+)"(; *filename="([^"]+)")?/',
$headers['content-disposition'],
$matches
);
list(, $type, $name) = $matches;
//Parse File
if( isset($matches[4]) )
{
//if labeled the same as previous, skip
if( isset( $_FILES[ $matches[ 2 ] ] ) )
{
continue;
}
//get filename
$filename = $matches[4];
//get tmp name
$filename_parts = pathinfo( $filename );
$tmp_name = tempnam( ini_get('upload_tmp_dir'), $filename_parts['filename']);
//populate $_FILES with information, size may be off in multibyte situation
$_FILES[ $matches[ 2 ] ] = array(
'error'=>0,
'name'=>$filename,
'tmp_name'=>$tmp_name,
'size'=>strlen( $body ),
'type'=>isset($headers['content-type']) ? $headers['content-type'] : ''
);
//place in temporary directory
file_put_contents($tmp_name, $body);
}
//Parse Field
else
{
$data[$name] = substr($body, 0, strlen($body) - 2);
}
}
}
$GLOBALS[ '_PUT' ] = $data;
return;
}

For anyone using the Apiato (Laravel) framework:
Create a new middleware like the file below, then declare it in your Laravel kernel file within the protected $middlewareGroups variable (inside web or api, whichever you want) like this:
protected $middlewareGroups = [
'web' => [],
'api' => [HandlePutFormData::class],
];
<?php
namespace App\Ship\Middlewares\Http;
use Closure;
use Symfony\Component\HttpFoundation\ParameterBag;
/**
* @author Quang Pham
*/
class HandlePutFormData
{
/**
* Handle an incoming request.
*
* @param \Illuminate\Http\Request $request
* @param \Closure $next
*
* @return mixed
*/
public function handle($request, Closure $next)
{
if ($request->method() == 'POST' or $request->method() == 'GET') {
return $next($request);
}
if (preg_match('/multipart\/form-data/', $request->headers->get('Content-Type')) or
preg_match('/multipart\/form-data/', $request->headers->get('content-type'))) {
$parameters = $this->decode();
$request->merge($parameters['inputs']);
$request->files->add($parameters['files']);
}
return $next($request);
}
public function decode()
{
$files = [];
$data = [];
// Fetch content and determine boundary
$rawData = file_get_contents('php://input');
$boundary = substr($rawData, 0, strpos($rawData, "\r\n"));
// Fetch and process each part
$parts = $rawData ? array_slice(explode($boundary, $rawData), 1) : [];
foreach ($parts as $part) {
// If this is the last part, break
if ($part == "--\r\n") {
break;
}
// Separate content from headers
$part = ltrim($part, "\r\n");
list($rawHeaders, $content) = explode("\r\n\r\n", $part, 2);
$content = substr($content, 0, strlen($content) - 2);
// Parse the headers list
$rawHeaders = explode("\r\n", $rawHeaders);
$headers = array();
foreach ($rawHeaders as $header) {
// Limit to 2 pieces so a colon inside the header value is kept
list($name, $value) = explode(':', $header, 2);
$headers[strtolower($name)] = ltrim($value, ' ');
}
// Parse the Content-Disposition to get the field name, etc.
if (isset($headers['content-disposition'])) {
$filename = null;
preg_match(
'/^form-data; *name="([^"]+)"(; *filename="([^"]+)")?/',
$headers['content-disposition'],
$matches
);
$fieldName = $matches[1];
$fileName = (isset($matches[3]) ? $matches[3] : null);
// If we have a file, save it. Otherwise, save the data.
if ($fileName !== null) {
$localFileName = tempnam(sys_get_temp_dir(), 'sfy');
file_put_contents($localFileName, $content);
$files = $this->transformData($files, $fieldName, [
'name' => $fileName,
'type' => $headers['content-type'],
'tmp_name' => $localFileName,
'error' => 0,
'size' => filesize($localFileName)
]);
// register a shutdown function to cleanup the temporary file
register_shutdown_function(function () use ($localFileName) {
// The file may already have been moved by the application
if (is_file($localFileName)) {
unlink($localFileName);
}
});
} else {
$data = $this->transformData($data, $fieldName, $content);
}
}
}
$fields = new ParameterBag($data);
return ["inputs" => $fields->all(), "files" => $files];
}
private function transformData($data, $name, $value)
{
$isArray = strpos($name, '[]');
if ($isArray && (($isArray + 2) == strlen($name))) {
$name = str_replace('[]', '', $name);
$data[$name][]= $value;
} else {
$data[$name] = $value;
}
return $data;
}
}
Please note: not all of the code above is mine; some comes from the answers above and some was modified by me.
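To try the boundary/header/content splitting without sending a real PUT request, the same steps can be run against a string. This is only a minimal sketch: the function name parseMultipartFields and the sample body are made up for illustration, and file parts are skipped rather than written to a temporary file.

```php
<?php
// Hypothetical helper (not part of the middleware above): extracts the plain
// text fields from a multipart/form-data body that is already held in a string.
function parseMultipartFields(string $rawData): array
{
    $fields = [];
    $lineEnd = strpos($rawData, "\r\n");
    if ($lineEnd === false || $lineEnd === 0) {
        return $fields; // no boundary line, so this is not a multipart body
    }
    $boundary = substr($rawData, 0, $lineEnd);

    // Element 0 is the empty string before the first boundary, so skip it
    foreach (array_slice(explode($boundary, $rawData), 1) as $part) {
        if (strncmp($part, '--', 2) === 0) {
            break; // closing delimiter ("--" right after the boundary)
        }
        $pieces = explode("\r\n\r\n", ltrim($part, "\r\n"), 2);
        if (count($pieces) !== 2) {
            continue; // malformed part without a header/content separator
        }
        [$rawHeaders, $content] = $pieces;
        $content = substr($content, 0, -2); // drop the CRLF before the next boundary

        $pattern = '/^content-disposition: *form-data; *name="([^"]+)"(; *filename="([^"]+)")?/im';
        if (!preg_match($pattern, $rawHeaders, $matches) || isset($matches[3])) {
            continue; // no usable field name, or a file part (ignored in this sketch)
        }
        $name = $matches[1];
        if (substr($name, -2) === '[]') {
            $fields[substr($name, 0, -2)][] = $content; // "tags[]" style array field
        } else {
            $fields[$name] = $content;
        }
    }
    return $fields;
}

// Made-up sample body with the boundary "--demo"
$sampleBody = "--demo\r\n"
    . "Content-Disposition: form-data; name=\"title\"\r\n\r\n"
    . "Hello\r\n"
    . "--demo\r\n"
    . "Content-Disposition: form-data; name=\"tags[]\"\r\n\r\n"
    . "php\r\n"
    . "--demo--\r\n";

$parsedFields = parseMultipartFields($sampleBody);
```

With this sample, $parsedFields should hold the title field as a string and tags as a one-element array.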

Quoting netcoder's reply: "Take note the above will only work for multipart content types".
To also handle bodies that are not multipart (they are parsed as URL-encoded form data via parse_str), I added the following lines to netcoder's solution:
// Fetch content and determine boundary
$raw_data = file_get_contents('php://input');
$boundary = substr($raw_data, 0, strpos($raw_data, "\r\n"));
/* ---- My edit starts ---- */
if(empty($boundary)){
parse_str($raw_data,$data);
return $data;
}
/* ---- My edit ends ---- */
// Fetch each part
$parts = array_slice(explode($boundary, $raw_data), 1);
$data = array();
............
...............
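As a rough illustration of what that fallback does with a non-multipart body, here is a small sketch. The function name decodeSimpleBody is hypothetical, and the JSON branch is my own assumption about what "any content type" might need; the original edit only calls parse_str.

```php
<?php
// Hypothetical helper: decodes a non-multipart request body into an array.
function decodeSimpleBody(string $rawData, string $contentType): array
{
    // Compare only the media type, ignoring parameters such as "; charset=UTF-8"
    $mediaType = strtolower(trim(explode(';', $contentType, 2)[0]));

    if ($mediaType === 'application/json') {
        // Assumption: JSON bodies should become arrays too (not in the original edit)
        $decoded = json_decode($rawData, true);
        return is_array($decoded) ? $decoded : [];
    }

    // Same call as in the edit above: treats the body as URL-encoded form data
    $data = [];
    parse_str($rawData, $data);
    return $data;
}

$formFields = decodeSimpleBody('name=Ann&tags[]=a&tags[]=b', 'application/x-www-form-urlencoded');
$jsonFields = decodeSimpleBody('{"name":"Ann"}', 'application/json; charset=UTF-8');
```

Note that parse_str understands the bracket syntax, so tags[] ends up as a nested array, which the multipart code has to handle by hand.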

I've been trying to figure out how to work around this issue without breaking RESTful convention and, boy howdy, what a rabbit hole, let me tell you.
I'm adding this anywhere I can find in the hope that it will help somebody out in the future.
I've just lost a day of development, first figuring out that this was an issue, then figuring out where the issue lay.
As mentioned, this isn't a Symfony (or Laravel, or any other framework) issue; it's a limitation of PHP.
After trawling through a good few RFCs for PHP core, my impression is that the core development team is somewhat resistant to implementing anything to do with modernising the handling of HTTP requests. The issue was first reported in 2011, and it doesn't look any closer to having a native solution.
That said, I managed to find a PECL extension called Always Populate Form Data. I'm not very familiar with PECL and couldn't get it working using pear, but I'm using CentOS and Remi PHP, which has a yum package.
I ran yum install php-pecl-apfd and it literally fixed the issue straight away (well, I had to restart my Docker containers, but that was a given).
I believe there are packages for other flavours of Linux, and I'm sure anybody with more knowledge of pear/PECL/PHP extensions in general could get it running on Windows or Mac with no issue.
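If the code has to run both with and without the extension, something like the following sketch could pick the data source. It assumes the extension registers itself under the name apfd (taken from the package name, so please verify on your system), and the helper name readFormData is made up. The fallback only understands URL-encoded bodies, not multipart ones.

```php
<?php
// Hypothetical helper: returns the form data for a non-POST request.
// $apfdLoaded - whether the extension is available (assumed name: "apfd")
// $post       - the $_POST superglobal, which the extension is expected to populate
// $rawBody    - the raw request body, used only when the extension is missing
function readFormData(bool $apfdLoaded, array $post, string $rawBody): array
{
    if ($apfdLoaded) {
        return $post; // PHP has already parsed the body for us
    }
    // Fallback: URL-encoded bodies only; multipart needs a manual parser
    $data = [];
    parse_str($rawBody, $data);
    return $data;
}

// Intended call inside a request handler:
// $fields = readFormData(extension_loaded('apfd'), $_POST, file_get_contents('php://input'));
```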

I know this question is old.
Unfortunately, PHP still does not parse form-data for methods other than POST.
Thanks to the friends (@netcoder, @greendot, @pham-quang) who suggested the solutions above.
Using those solutions, I wrote a library for this purpose:
composer require alireaza/php-form-data
In Laravel you can also use composer require alireaza/laravel-form-data.
