I'm working on a CSV import script in PHP. It works fine, except for non-ASCII characters at the beginning of a field.
The code looks like this:
if (($handle = fopen($filename, "r")) !== FALSE)
{
    while (($data = fgetcsv($handle, 1000, ",")) !== FALSE)
        $teljing[] = $data;
    fclose($handle);
}
Here is a data example showing my issue
føroyskir stavir, "Kr. 201,50"
óvirkin ting, "Kr. 100,00"
This will result in the following
array
(
    [0] => array
    (
        [0] => 'føroyskir stavir',
        [1] => 'Kr. 201,50'
    )
    [1] => array
    (
        [0] => 'virkin ting', <--- Should be 'óvirkin ting'
        [1] => 'Kr. 100,00'
    )
)
I have seen this behavior documented in some comments on php.net, and I have tried ini_set('auto_detect_line_endings', TRUE); to detect line endings, without success.
Is anyone familiar with this issue?
Edit:
Thank you, AJ, this issue is now solved.
setlocale(LC_ALL, 'en_US.UTF-8');
was the solution.
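A minimal sketch of the locale fix, assuming the CSV file is UTF-8. The helper name read_csv_utf8 and the list of candidate locale names are illustrative only: which UTF-8 locales are installed depends on the system, and setlocale() returns false when none of the candidates exists.

```php
<?php
// Hypothetical helper (not from the original post): read a UTF-8 CSV file
// after selecting a UTF-8 character-type locale.
function read_csv_utf8(string $filename, string $delimiter = ','): array
{
    // Passing '0' only queries the current setting so it can be restored later.
    $previous = setlocale(LC_CTYPE, '0');
    // setlocale() tries each candidate in turn; the names are assumptions.
    setlocale(LC_CTYPE, 'en_US.UTF-8', 'C.UTF-8', 'en_US.utf8');

    $rows = [];
    if (($handle = fopen($filename, 'r')) !== false) {
        // A length of 0 means "no limit on the line length".
        while (($data = fgetcsv($handle, 0, $delimiter, '"', '\\')) !== false) {
            $rows[] = $data;
        }
        fclose($handle);
    }

    // Put the previous locale back so other code is not affected.
    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }
    return $rows;
}
```

Only LC_CTYPE is changed here because character classification is the part of the locale that matters for this problem; LC_ALL, as used in the fix above, would also change number and date formatting.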
From the PHP manual for fgetcsv():
"Note: Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function."
Copied from the PHP.net/fgetcsv comments:
kent at marketruler dot com
04-Feb-2010 11:18
Note that fgetcsv, at least in PHP 5.3 or previous, will NOT work with UTF-16 encoded files. Your options are to convert the entire file to ISO-8859-1 (or latin1), or to convert line by line into ISO-8859-1 encoding and then use str_getcsv (or a backwards-compatible implementation). If you need to read non-Latin alphabets, it is probably best to convert to UTF-8.
See str_getcsv for a backwards-compatible version of it with PHP < 5.3, and see utf8_decode for a function written by Rasmus Andersson which provides utf16_decode. The modification I added was that the BOM appears at the top of the file, then not on subsequent lines. So you need to store the endianness, and then re-send it upon each subsequent line decoding. This modified version returns the endianness, if it's not available:
<?php
/**
 * Decode UTF-16 encoded strings.
 *
 * Can handle both BOM'ed data and un-BOM'ed data.
 * Assumes Big-Endian byte order if no BOM is available.
 * From: http://php.net/manual/en/function.utf8-decode.php
 *
 * @param string $str UTF-16 encoded data to decode.
 * @param bool|null $be Byte order (true = big-endian); set from the BOM when one is present.
 * @return string Decoded single-byte (ISO-8859-1) data.
 * @access public
 * @version 0.1 / 2005-01-19
 * @author Rasmus Andersson {@link http://rasmusandersson.se/}
 * @package Groupies
 */
function utf16_decode($str, &$be=null) {
    if (strlen($str) < 2) {
        return $str;
    }
    $c0 = ord($str[0]);
    $c1 = ord($str[1]);
    $start = 0;
    if ($c0 == 0xFE && $c1 == 0xFF) {
        $be = true;
        $start = 2;
    } else if ($c0 == 0xFF && $c1 == 0xFE) {
        $start = 2;
        $be = false;
    }
    if ($be === null) {
        $be = true;
    }
    $len = strlen($str);
    $newstr = '';
    // Combine each byte pair into one 16-bit code unit (high byte shifted by 8 bits).
    for ($i = $start; $i + 1 < $len; $i += 2) {
        if ($be) {
            $val = ord($str[$i]) << 8;
            $val += ord($str[$i+1]);
        } else {
            $val = ord($str[$i+1]) << 8;
            $val += ord($str[$i]);
        }
        // U+2028 (line separator) becomes "\n"; chr() only keeps the low 8 bits,
        // so code units above 0xFF are not preserved.
        $newstr .= ($val == 0x2028) ? "\n" : chr($val);
    }
    return $newstr;
}
?>
Trying the "setlocale" trick did not work for me, e.g.
<?php
setlocale(LC_CTYPE, "en.UTF16");
$line = fgetcsv($file, ...)
?>
But that's perhaps because my platform didn't support it. However, fgetcsv only supports single characters for the delimiter, etc. and complains if you pass in a UTF-16 version of said character, so I gave up on that rather quickly.
Hope this is helpful to someone out there.
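An alternative to decoding by hand, sketched under the assumption that the mbstring extension is available and that the file fits in memory: convert the whole file from UTF-16 to UTF-8 first and let fgetcsv() parse the converted text. The helper name read_utf16_csv is made up for this example, and defaulting to big-endian without a BOM simply mirrors the function above.

```php
<?php
// Hypothetical helper: parse a UTF-16 CSV file by converting it to UTF-8 first.
function read_utf16_csv(string $path, string $delimiter = ','): array
{
    $raw = file_get_contents($path);
    if ($raw === false) {
        return [];
    }

    // Pick the byte order from the BOM; assume big-endian when there is none.
    $encoding = 'UTF-16BE';
    $bom = substr($raw, 0, 2);
    if ($bom === "\xFF\xFE") {
        $encoding = 'UTF-16LE';
        $raw = substr($raw, 2);
    } elseif ($bom === "\xFE\xFF") {
        $raw = substr($raw, 2);
    }
    $utf8 = mb_convert_encoding($raw, 'UTF-8', $encoding);

    // Feed the converted text to fgetcsv() through an in-memory stream,
    // so quoted fields and embedded delimiters are handled as usual.
    $stream = fopen('php://temp', 'w+b');
    fwrite($stream, $utf8);
    rewind($stream);

    $rows = [];
    while (($row = fgetcsv($stream, 0, $delimiter, '"', '\\')) !== false) {
        if ($row !== [null]) {   // fgetcsv() returns [null] for a blank line
            $rows[] = $row;
        }
    }
    fclose($stream);
    return $rows;
}
```

Because the delimiter is only applied after the conversion, it stays a plain single-byte character, which avoids the complaint about a UTF-16 delimiter mentioned above.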
I use PHP.
The function below loads part of a big, newline-separated CSV file containing multibyte characters and returns a pointer (the end position) and the content in an array. With the pointer I can later do another run. It works:
function part($path, $offset, $rows) {
    $buffer = array();
    $buffer['content'] = '';
    $buffer['pointer'] = array();
    $handle = fopen($path, "r");
    fseek($handle, $offset);
    if( $handle ) {
        for( $i = 0; $i < $rows; $i++ ) {
            $buffer['content'] .= fgets($handle);
            $buffer['pointer'] = mb_strlen($buffer['content']);
        }
    }
    fclose($handle);
    return($buffer);
}
// Buffer first part
$buffer = part($path_to_file, 0, 100);
// Buffer second part
$buffer = part($path_to_file, $buffer['pointer'], 100);
print_r($buffer);
If I change the $buffer['pointer'] line to:
$buffer['pointer'] = mb_strlen($buffer['content'], "UTF-8");
...it does not work anymore... I understand that it uses the different encoding when I use UTF-8 instead of the default, but why doesn't it work with UTF-8?
Shouldn't UTF-8 be compatible with foreign characters?
Because the function above works when I use it without "UTF-8" I guess I could just use it without UTF-8.
I'm still worried that in some cases it can give the wrong pointer?
Is there a safer way to get the correct pointer?
Encoding test
When I do this I get UTF-8:
echo mb_detect_encoding($buffer['content']);
This has little to do with UTF-8. Filesystem functions (like fseek(), fread(), etc.) operate on individual bytes. They don't care about the encoding at all (you could just as well be reading or writing binary data). A character count from mb_strlen() is therefore not a valid file offset as soon as the content contains multi-byte characters.
If you want to store a position to fseek() to at a later time, use ftell() to find out the current position:
$buffer['pointer'] = ftell($handle);
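Put together, the function from the question could look roughly like this. It is only a sketch: the name read_part and the early exit at end of file are additions for this example, not part of the original code.

```php
<?php
// Hypothetical rewrite of part(): the pointer is a byte offset taken from ftell().
function read_part(string $path, int $offset, int $rows): array
{
    $buffer = ['content' => '', 'pointer' => $offset];
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return $buffer;
    }
    fseek($handle, $offset);
    for ($i = 0; $i < $rows; $i++) {
        $line = fgets($handle);
        if ($line === false) {   // end of file reached
            break;
        }
        $buffer['content'] .= $line;
    }
    // ftell() reports the position in bytes, which is what fseek() expects.
    $buffer['pointer'] = ftell($handle);
    fclose($handle);
    return $buffer;
}
```

Calling read_part($path_to_file, 0, 100) and then read_part($path_to_file, $buffer['pointer'], 100) continues exactly where the first call stopped, regardless of how many bytes each character occupies.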
I have a strange problem...
I would like to search in a logfile.
$lines = file($file);
$sampleName = "T3173sGas";
foreach ($lines as &$line) {
    if (strpos($line, $sampleName) !== false) {
        echo "yes";
    }
}
This code is not working, although $sampleName is definitely in the log file. The search only works for single characters, for example "T" or "3", but not for "T3".
Do you have an idea why it's not working? Is the encoding of the log file wrong?
Thanks a lot for your help!
If you can only find single characters, I would assume that your log file is in some multi-byte character set like UTF-16. Since you already suspect something similar, the next step is to consult the documentation or specification of the log file to find out its character encoding.
You can then use encoding-aware string functions; the extension is called mbstring (http://php.net/mbstring).
$encoding = ... ; // encoding of logfile
if (mb_strpos($line, $sampleName, 0, $encoding) !== false) {
echo "yes";
}
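One thing to watch out for: the search string has to be in the same encoding as the text being searched, and file() splits on "\n" bytes, which in UTF-16 can leave stray NUL bytes at the line boundaries. A sketch that sidesteps both issues by converting the whole log to UTF-8 once is shown below. The function name log_contains and the default of UTF-16LE are assumptions for this example, and the approach reads the whole file into memory.

```php
<?php
// Hypothetical helper: search a log file that is stored in a multi-byte encoding.
// $encoding is an assumption; take the real value from the log's documentation.
function log_contains(string $file, string $sampleName, string $encoding = 'UTF-16LE'): bool
{
    $raw = file_get_contents($file);
    if ($raw === false) {
        return false;
    }
    // Convert once to UTF-8 so that haystack and needle share one encoding.
    // $sampleName is assumed to be UTF-8 (or plain ASCII) already.
    $text = mb_convert_encoding($raw, 'UTF-8', $encoding);
    return strpos($text, $sampleName) !== false;
}
```

After the conversion, the ordinary byte-based strpos() is sufficient, because a UTF-8 needle can only match at character boundaries of valid UTF-8 text.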
This may work, it searches for the entire string
<?php
$filename = 'test.php';
$file = file_get_contents($filename);
$sampleName = "T3173sGas";
if (strlen(strstr($file, $sampleName)) > 0)
{
    echo "yes";
}
?>
I've searched for an answer for quite a while, and haven't found anything that works correctly.
I have log files, some reaching 100MB in size, around 140,000 lines of text.
With PHP, I am trying to get the last 500 lines of the file.
How would I get the 500 lines? With most functions, the file is read into memory, and that isn't a plausible case for this matter. I would preferably stay away from executing system commands.
If you are on a *nix machine, you should be able to use shell escaping and the tool 'tail'.
It's been a while, but something like this:
$path = escapeshellarg($file);
$lastLines = `tail -n 500 $path`;
Note the backticks, which execute the string in Bash or a similar shell and return its output. $file stands for the path of your log file.
I wrote this function which seems to work quite nicely to me. It returns an array of lines just like file. If you want it to return a string like file_get_contents, then just change the return statement to return implode('', array_reverse($lines));:
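function file_get_tail($filename, $num_lines = 10){
    $file = fopen($filename, "r");
    if ($file === false) {
        return array();
    }
    $lines = array();
    $current = '';   // line being assembled, built from back to front
    fseek($file, 0, SEEK_END);
    $pos = ftell($file);
    // walk backwards one byte at a time until enough lines are collected
    // or the start of the file is reached
    while ($pos > 0 && count($lines) < $num_lines) {
        fseek($file, --$pos);
        $char = fgetc($file);
        // a newline terminates the line before it, so flush the line collected
        // so far (unless it is the newline that ends the file itself)
        if ($char === "\n" && $current !== '') {
            $lines[] = $current;
            $current = '';
        }
        $current = $char . $current;
    }
    fclose($file);
    // the first line of the file has no newline in front of it
    if ($current !== '' && count($lines) < $num_lines) {
        $lines[] = $current;
    }
    return array_reverse($lines);
}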
Example:
file_get_tail('filename.txt', 500);
If you want to do it in PHP:
<?php
/**
 * Read last N lines from file.
 *
 * @param $filename string path to file. must support seeking
 * @param $n int number of lines to get.
 * @return array up to $n lines of text
 */
function tail($filename, $n)
{
    $buffer_size = 1024;
    $fp = fopen($filename, 'r');
    if (!$fp) return array();
    fseek($fp, 0, SEEK_END);
    $pos = ftell($fp);
    $input = '';
    $line_count = 0;
    // stop when enough lines were found or the start of the file is reached
    while ($pos > 0 && $line_count < $n + 1)
    {
        // read the previous block of input
        $read_size = $pos >= $buffer_size ? $buffer_size : $pos;
        fseek($fp, $pos - $read_size, SEEK_SET);
        // prepend the current block, and count the new lines
        $input = fread($fp, $read_size).$input;
        $line_count = substr_count(ltrim($input), "\n");
        $pos -= $read_size;
    }
    fclose($fp);
    // return the last $n lines found
    return array_slice(explode("\n", rtrim($input)), -$n);
}
var_dump(tail('/var/log/syslog', 50));
This is largely untested, but should be enough for you to get a fully working solution.
The buffer size is 1024, but can be changed to be bigger or smaller. (You could even set it dynamically based on $n * an estimate of the line length.) This should be better than seeking character by character, although it does mean we need to call substr_count() to look for newlines.
I want to convert all the characters in a file to ASCII codes in PHP. I know of the ord function, but is there any function that will do this for the entire file?
iconv may do the job:
http://php.net/manual/de/function.iconv.php
It converts the characters of a string from one specified charset to another. Look at the //TRANSLIT and //IGNORE options for characters that cannot be converted 1:1.
To get the file into a string you can use file_get_contents, and after applying iconv etc. you can save it with file_put_contents.
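A rough sketch of that approach, assuming the input file is UTF-8. The function name file_to_ascii and the fallback that simply drops non-ASCII bytes are choices made for this example. What //TRANSLIT produces (for instance "a", "?" or something else for "ä") depends on the iconv implementation and the current locale.

```php
<?php
// Hypothetical helper: write an ASCII-only copy of a text file.
// $fromCharset is an assumption about the encoding of the input file.
function file_to_ascii(string $inputPath, string $outputPath, string $fromCharset = 'UTF-8'): bool
{
    $text = file_get_contents($inputPath);
    if ($text === false) {
        return false;
    }
    // //TRANSLIT asks iconv to approximate characters that ASCII cannot represent.
    // The @ hides the notice or warning that iconv() raises when it fails.
    $ascii = @iconv($fromCharset, 'ASCII//TRANSLIT', $text);
    if ($ascii === false) {
        // Fallback: drop every byte outside the ASCII range.
        $ascii = preg_replace('/[^\x00-\x7F]/', '', $text);
    }
    return file_put_contents($outputPath, $ascii) !== false;
}
```

For example, file_to_ascii('input.txt', 'output.txt') leaves the original untouched and writes the converted text to a second file.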
// Write the hexadecimal code of every byte of input.txt to output.txt
$inputFile = fopen("input.txt", "rb");
$outputFile = fopen("output.txt", "w+");
while (!feof($inputFile)) {
    $inputBlock = fread($inputFile, 8192);
    $outputBlock = '';
    $inputLength = strlen($inputBlock);
    for ($i = 0; $i < $inputLength; ++$i) {
        // two hex digits per byte, e.g. "A" becomes "41"
        $outputBlock .= str_pad(dechex(ord($inputBlock[$i])), 2, '0', STR_PAD_LEFT);
    }
    fwrite($outputFile, $outputBlock);
}
fclose($inputFile);
fclose($outputFile);
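The inner loop produces the same result as bin2hex($inputBlock). If decimal codes are wanted instead of hex, unpack() can return them for a whole block at once. This is only a sketch with a made-up function name, and it collects all codes in memory, so it suits small files. For multi-byte encodings such as UTF-8 the values are byte values, not code points.

```php
<?php
// Hypothetical helper: return the numeric value of every byte in a file.
function file_byte_codes(string $path, int $chunkSize = 8192): array
{
    $codes = [];
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return $codes;
    }
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false || $chunk === '') {
            break;
        }
        // unpack('C*') yields one unsigned byte value per array element.
        foreach (unpack('C*', $chunk) as $code) {
            $codes[] = $code;
        }
    }
    fclose($handle);
    return $codes;
}
```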
I am trying to read a CSV file and echo its content, but the special characters are displayed incorrectly.
Mäx Müstermänn -> MÃ¤x MÃ¼stermÃ¤nn
Encoding of the CSV file is UTF-8 without BOM (checked with Notepad++).
This is the content of the CSV file:
"Mäx";"Müstermänn"
My PHP script
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<?php
$handle = fopen ("specialchars.csv","r");
echo '<table border="1"><tr><td>First name</td><td>Last name</td></tr><tr>';
while ($data = fgetcsv ($handle, 1000, ";")) {
$num = count ($data);
for ($c=0; $c < $num; $c++) {
// output data
echo "<td>$data[$c]</td>";
}
echo "</tr><tr>";
}
?>
</body>
</html>
I tried to use setlocale(LC_ALL, 'de_DE.utf8'); as suggested here, without success. The content is still displayed incorrectly.
What am I missing?
Edit:
An echo mb_detect_encoding($data[$c],'UTF-8'); gives me UTF-8 UTF-8.
echo file_get_contents("specialchars.csv"); gives me "Mäx";"Müstermänn".
And
print_r(str_getcsv(reset(explode("\n", file_get_contents("specialchars.csv"))), ';'))
gives me
Array ( [0] => Mäx [1] => Müstermänn )
What does it mean?
Try this (note that utf8_encode() assumes the input is ISO-8859-1 and is deprecated as of PHP 8.2):
<?php
$handle = fopen ("specialchars.csv","r");
echo '<table border="1"><tr><td>First name</td><td>Last name</td></tr><tr>';
while ($data = fgetcsv ($handle, 1000, ";")) {
$data = array_map("utf8_encode", $data); //added
$num = count ($data);
for ($c=0; $c < $num; $c++) {
// output data
echo "<td>$data[$c]</td>";
}
echo "</tr><tr>";
}
?>
I encountered a similar problem: parsing a CSV file with special characters like é, è, ö etc.
The following worked fine for me:
To display the characters correctly on the HTML page, this header was needed:
header('Content-Type: text/html; charset=UTF-8');
In order to parse every character correctly, I used:
utf8_encode(fgets($file));
Don't forget to use the 'Multibyte String Functions' in all following string operations, like:
mb_strtolower($value, 'UTF-8');
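Since utf8_encode() always treats its input as ISO-8859-1, applying it to data that is already UTF-8 encodes it twice. A small sketch of a more defensive conversion with mbstring follows. The function name to_utf8 and the default source encoding are assumptions for this example, and a few ISO-8859-1 byte sequences happen to be valid UTF-8 as well, so the check is a heuristic.

```php
<?php
// Hypothetical helper: return $value as UTF-8, converting only when needed.
// $assumedSource is an assumption about the file's legacy encoding.
function to_utf8(string $value, string $assumedSource = 'ISO-8859-1'): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;   // already valid UTF-8; converting again would double-encode
    }
    return mb_convert_encoding($value, 'UTF-8', $assumedSource);
}
```

It can be applied to a whole row returned by fgetcsv() with array_map('to_utf8', $data).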
In my case the source file has windows-1250 encoding and iconv prints tons of notices about illegal characters in input string...
So this solution helped me a lot:
/**
 * getting CSV array with UTF-8 encoding
 *
 * @param   resource    &$handle
 * @param   integer     $length
 * @param   string      $separator
 *
 * @return  array|false
 */
private function fgetcsvUTF8(&$handle, $length, $separator = ';')
{
    if (($buffer = fgets($handle, $length)) !== false)
    {
        $buffer = $this->autoUTF($buffer);
        return str_getcsv($buffer, $separator);
    }
    return false;
}
/**
 * automatic conversion of windows-1250 and iso-8859-2 strings into utf-8
 *
 * @param   string  $s
 *
 * @return  string
 */
private function autoUTF($s)
{
    // detect UTF-8
    if (preg_match('#[\x80-\x{1FF}\x{2000}-\x{3FFF}]#u', $s))
        return $s;
    // detect WINDOWS-1250
    if (preg_match('#[\x7F-\x9F\xBC]#', $s))
        return iconv('WINDOWS-1250', 'UTF-8', $s);
    // assume ISO-8859-2
    return iconv('ISO-8859-2', 'UTF-8', $s);
}
Response to @manvel's answer: use str_getcsv instead of explode, because of cases like this:
some;nice;value;"and;here;comes;combinated;value";and;some;others
explode will explode string into parts:
some
nice
value
"and
here
comes
combinated
value"
and
some
others
but str_getcsv will explode string into parts:
some
nice
value
and;here;comes;combinated;value
and
some
others
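The difference is easy to check with the sample line itself. The wrapper function compare_splitters below is only there for illustration:

```php
<?php
// Hypothetical helper: split the same line with explode() and with str_getcsv().
function compare_splitters(string $line, string $separator = ';'): array
{
    return [
        // explode() cuts at every separator, even inside quotes.
        'explode'    => explode($separator, $line),
        // str_getcsv() keeps a quoted field together and removes the quotes.
        'str_getcsv' => str_getcsv($line, $separator, '"', '\\'),
    ];
}

$result = compare_splitters('some;nice;value;"and;here;comes;combinated;value";and;some;others');
echo count($result['explode']), "\n";      // 11
echo count($result['str_getcsv']), "\n";   // 7
echo $result['str_getcsv'][3], "\n";       // and;here;comes;combinated;value
```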
Try putting this into the top of your file (before any other output):
<?php
header('Content-Type: text/html; charset=UTF-8');
?>
The problem is that the function reports the data as UTF-8 (you can check this with mb_detect_encoding) but does not actually convert it, so the characters are treated as UTF-8 even though they are not. It is therefore necessary to convert from the original encoding (Windows-1251, also known as CP1251) using iconv. Since fgetcsv returns an array, I suggest writing a custom function:
[Sorry for my English]
function customfgetcsv(&$handle, $length, $separator = ';'){
    if (($buffer = fgets($handle, $length)) !== false) {
        return explode($separator, iconv("CP1251", "UTF-8", $buffer));
    }
    return false;
}
Now I got it working (after removing the header command). I think the problem was that the PHP file itself was encoded in ISO-8859-1. I set it to UTF-8 without BOM. I thought I had already done that, but perhaps I accidentally undid it.
Furthermore, I used SET NAMES 'utf8' for the database connection. Now the data is also correct in the database.