I use PHP.
The function below loads part of a big multibyte, newline-separated CSV file and returns a pointer (the end position) together with the content in an array. With the pointer I can later do another run. It works:
function part($path, $offset, $rows) {
    $buffer = array();
    $buffer['content'] = '';
    $buffer['pointer'] = array();
    $handle = fopen($path, "r");
    fseek($handle, $offset);
    if( $handle ) {
        for( $i = 0; $i < $rows; $i++ ) {
            $buffer['content'] .= fgets($handle);
            $buffer['pointer'] = mb_strlen($buffer['content']);
        }
    }
    fclose($handle);
    return($buffer);
}
// Buffer first part
$buffer = part($path_to_file, 0, 100);
// Buffer second part
$buffer = part($path_to_file, $buffer['pointer'], 100);
print_r($buffer);
If I change the $buffer['pointer'] line to:
$buffer['pointer'] = mb_strlen($buffer['content'], "UTF-8");
...it does not work anymore. I understand that a different encoding is used when I specify UTF-8 instead of the default, but why doesn't it work with UTF-8?
Shouldn't UTF-8 be compatible with foreign characters?
Since the function above works when I use it without "UTF-8", I guess I could just leave it out.
But I'm still worried that in some cases it could give the wrong pointer.
Is there a safer way to get the correct pointer?
Encoding test
When I do this I get UTF-8:
echo mb_detect_encoding($buffer['content']);
This has little to do with UTF-8. Filesystem functions (like fseek(), fread(), etc.) operate on individual bytes. They don't care about the encoding at all. (You could be writing / reading binary data).
If you want to store a pointer to fseek() to at a later time, use ftell() to find out the current position:
$buffer['pointer'] = ftell($handle);
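For illustration, a rough, untested sketch of the part() function from the question with that ftell() change applied (lightly tidied, otherwise the same logic):
function part($path, $offset, $rows) {
    $buffer = array('content' => '', 'pointer' => 0);
    $handle = fopen($path, "r");
    if ($handle) {
        fseek($handle, $offset);
        for ($i = 0; $i < $rows; $i++) {
            $buffer['content'] .= fgets($handle);
        }
        // ftell() reports the current byte position, which is exactly
        // what fseek() expects as $offset on the next run
        $buffer['pointer'] = ftell($handle);
        fclose($handle);
    }
    return $buffer;
}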
Related
I have a large CSV file. Because of memory concerns (with MySQL), I would like to only read a part of it at a time, if possible.
That it's CSV might not be important. The important thing is that it needs to be cut at a line break.
Example content:
Some CSV content
that will break
on a line break
This could be my path:
$path = 'path/to/my.csv';
A solution for it could in my mind look like this:
$csv_content1 = read_csv_file($path, 0, 100);
$csv_content2 = read_csv_file($path, 101, 200);
The first call reads the raw content of lines 0-100.
The second call reads the raw content of lines 101-200.
Information
No parsing is needed (just splitting the content).
The file exists on my own server.
Don't read the whole file into memory.
I want to be able to do the second read at another time, not in the same run. I'm fine with saving temporary values like pointers if needed.
I've been trying to read other topics but did not find an exact match to this problem.
Maybe some of these could somehow work?
SplFileObject
fgetcsv
Maybe I can't use $csv_content2 before I've used $csv_content1, because I need to save some kind of a pointer? In that case it's fine. I will read them in order anyway.
After much thinking and reading, I think I finally found the solution to my problem. Correct me if this is a bad solution because of memory usage or from some other perspective.
First run
$buffer = part($path_to_file, 0, 100);
Next run
$buffer = part($path_to_file, $buffer['pointer'], 100);
Function
function part($path, $offset, $rows) {
    $buffer = array();
    $buffer['content'] = '';
    $buffer['pointer'] = array();
    $handle = fopen($path, "r");
    fseek($handle, $offset);
    if( $handle ) {
        for( $i = 0; $i < $rows; $i++ ) {
            $buffer['content'] .= fgets($handle);
            $buffer['pointer'] = mb_strlen($buffer['content']);
        }
    }
    fclose($handle);
    return($buffer);
}
In my more object oriented environment it looks more like this:
function part() {
    $handle = fopen($this->path, "r");
    fseek($handle, $this->pointer);
    if( $handle ) {
        for( $i = 0; $i < 2; $i++ ) {
            if( $this->pointer != $this->filesize ) {
                $this->content .= fgets($handle);
            }
        }
        $this->pointer += mb_strlen($this->content);
    }
    fclose($handle);
}
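As a side note, the SplFileObject class mentioned in the question could also read a range of lines. A rough, untested sketch (the helper name lines_with_spl is made up; also bear in mind that seek() has to read through the preceding lines, so a stored byte offset with ftell()/fseek() stays cheaper for very large files):
function lines_with_spl($path, $startLine, $rows) {
    $file = new SplFileObject($path, 'r');
    $file->seek($startLine);              // zero-based line number
    $content = '';
    for ($i = 0; $i < $rows && !$file->eof(); $i++) {
        $content .= $file->current();     // current line, including the newline
        $file->next();
    }
    return $content;
}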
I have a system that saves files to the server hard drive, base64 encoded, after stripping them from email files.
I would like to convert the files back to their original format again. How can I do that in PHP?
This is how I tried to save the files, but it does not seem to create valid files:
$start = $part['starting-pos-body'];
$end = $part['ending-pos-body'];
$len = $end-$start;
$written = 0;
$write = 2028;
$body = '';
while ($start <= $end) {
    fseek($this->stream, $start, SEEK_SET);
    $my = fread($this->stream, $write);
    fwrite($temp_fp, base64_decode($my));
    $start += $write;
}
fclose($temp_fp);
@traylz makes the point clear about why it may fail when it shouldn't. However, base64_decode() may fail even for large images. I have worked with 6 to 7 MB files fine; I haven't gone over that size, so for me it should be as simple as:
$dir = dirname(__FILE__);
// get the base64 encoded file and decode it
$o_en = file_get_contents($dir . '/base64.txt');
$d = base64_decode($o_en);
// put decoded string into tmp file
file_put_contents($dir . '/base64_d', $d);
// get mime type (note: mime_content_type() is deprecated in favour
// of the fileinfo functions)
$mime = str_replace('image/', '.', mime_content_type($dir . '/base64_d'));
rename($dir . '/base64_d', $dir . '/base64_d' . $mime);
If the following fails try adding chunk_split() function to decode operation:
$d = base64_decode(chunk_split($o_en));
So what am I saying... forget the loop unless there is a need for it; keep the original file extension if you don't trust PHP's mime detection; use chunk_split() on the base64_decode() operation if working with large files.
NOTE: all theory so untested
EDIT: for large files that most likely will freeze file_get_contents(), read in what you need and output it to a file, so that little RAM is used:
$chunkSize = 1024;
$src = fopen('base64.txt', 'rb');
$dst = fopen('binary.mime', 'wb');
while (!feof($src)) {
    fwrite($dst, base64_decode(fread($src, $chunkSize)));
}
fclose($src);
fclose($dst);
Your problem is that you read 2028-byte chunks but check start <= end AFTER you read the chunk, so you read beyond the end pointer. You should also check < instead of <= (to avoid reading 0 bytes).
Also, you don't need to fseek() on every iteration, because fread() reads from the current position. You can take the fseek() out of the loop (before the while). Why 2028, by the way?
Try this out:
fseek($this->stream, $start, SEEK_SET);
while ($start < $end) {
    $write = min($end - $start, 2048);
    $my = fread($this->stream, $write);
    fwrite($temp_fp, base64_decode($my));
    $start += $write;
}
fclose($temp_fp);
Thank you all for the help.
I finally ended up with:
shell_exec('/usr/bin/base64 -d '.$temp_file.' > '.$temp_file.'_s');
I want to convert all the characters in a file to ASCII codes in PHP. I know of the ord() function, but is there a function that will do it for the entire file?
iconv may do the job:
http://php.net/manual/de/function.iconv.php
It converts characters of a specified charset in a string to another one. Look at the //TRANSLIT and //IGNORE specials for characters that cannot be converted 1:1.
To get the file into a string you can use file_get_contents(), and after iconv etc. is applied, save it with file_put_contents().
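A short sketch of that approach (the file names and the UTF-8 source charset are assumptions, so adjust them to your setup):
// read, transliterate to ASCII, write back out
$text  = file_get_contents('input.txt');
$ascii = iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $text);
file_put_contents('output.txt', $ascii);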
$inputFile = fopen("input.txt", "rb");
$outputFile = fopen("output.txt", "w+");
while (!feof($inputFile)) {
    $inputBlock = fread($inputFile, 8192);
    $outputBlock = '';
    $inputLength = strlen($inputBlock);
    for ($i = 0; $i < $inputLength; ++$i) {
        // write each byte's code as a two-digit hex value
        $outputBlock .= str_pad(dechex(ord($inputBlock[$i])), 2, '0', STR_PAD_LEFT);
    }
    fwrite($outputFile, $outputBlock);
}
fclose($inputFile);
fclose($outputFile);
I seem to be in a catch-22 with a small app I'm developing in PHP on Google App Engine using Quercus;
I have a remote csv-file which I can download & store in a string
To parse that string I'd ideally use str_getcsv, but Quercus doesn't have that function yet
Quercus does seem to know fgetcsv, but that function expects a file handle which I don't have (and I can't make a new one as GAE doesn't allow files to be created)
Anyone got an idea of how to solve this without having to dismiss the built-in PHP csv-parser functions and write my own parser instead?
I think the simplest solution really is to write your own parser. It's a piece of cake anyway and will get you to learn more regex. It makes no sense that there is no CSV-string-to-array parser in PHP, so it's totally justified to write your own. Just make sure it's not too slow ;)
You might be able to create a new stream wrapper using stream_wrapper_register.
Here's an example from the manual which reads global variables: http://www.php.net/manual/en/stream.streamwrapper.example-1.php
You could then use it like a normal file handle:
$csvStr = '...';
$fp = fopen('var://csvStr', 'r+');
while ($row = fgetcsv($fp)) {
// ...
}
fclose($fp);
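A minimal sketch of such a wrapper, loosely based on the linked manual example (the class name, the var:// scheme and reading from a global variable all come from that example; untested under Quercus):
class VariableStream {
    private $position;
    private $varname;

    public function stream_open($path, $mode, $options, &$opened_path) {
        // var://csvStr -> read from $GLOBALS['csvStr']
        $this->varname  = parse_url($path, PHP_URL_HOST);
        $this->position = 0;
        return isset($GLOBALS[$this->varname]);
    }

    public function stream_read($count) {
        $ret = substr($GLOBALS[$this->varname], $this->position, $count);
        $this->position += strlen($ret);
        return $ret;
    }

    public function stream_eof() {
        return $this->position >= strlen($GLOBALS[$this->varname]);
    }

    public function stream_stat() {
        return array();
    }
}

stream_wrapper_register('var', 'VariableStream');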
This shows a simple manual parser I wrote, with example input covering qualified, non-qualified and escaped values. It can be used for the header and data rows, and includes an assoc-array function to turn your data into a key/value style array.
//example data
$fields = strparser('"first","second","third","fourth","fifth","sixth","seventh"');
print_r(makeAssocArray($fields, strparser('"asdf","bla\"1","bl,ah2","bl,ah\"3",123,34.234,"k;jsdfj ;alsjf;"')));
//do something like this
$fields = strparser(<csvfirstline>);
foreach ($lines as $line)
$data = makeAssocArray($fields, strparser($line));
function strparser($string, $div = ",", $qual = "\"", $esc = "\\") {
    $buff = "";
    $data = array();
    $isQual = false; // the result will be a qualifier
    $inQual = false; // currently parsing inside a qualifier
    // iterate through the string byte by byte
    for ($i = 0; $i < strlen($string); $i++) {
        switch ($string[$i]) {
            case $esc:
                // add the next byte to the buffer and skip it
                $buff .= $string[$i + 1];
                $i++;
                break;
            case $qual:
                // toggle qualifier state
                if (!$inQual) {
                    $isQual = true;
                    $inQual = true;
                    break;
                } else {
                    $inQual = false; // done parsing the qualifier
                    break;
                }
            case $div:
                if (!$inQual) {
                    $data[] = $buff; // add value to data
                    $buff = "";      // reset buffer
                    break;
                }
                // inside a qualifier: fall through and keep the divider
            default:
                $buff .= $string[$i];
        }
    }
    // get the last item, as it doesn't have a trailing divider
    $data[] = $buff;
    return $data;
}

function makeAssocArray($fields, $data) {
    $array = array();
    foreach ($fields as $key => $field)
        $array[$field] = $data[$key];
    return $array;
}
If it can be dirty and quick, I would just use
http://php.net/manual/en/function.exec.php
to pass it in and use sed and awk (http://shop.oreilly.com/product/9781565922259.do) to parse it. I know you wanted to use the PHP parser; I've tried it before and failed, simply because it's not vocal about its errors.
Hope this helps.
Good luck.
You might be able to use fopen with php://temp or php://memory (php.net) to get it to work. What you would do is open either php://temp or php://memory, write to it, then rewind it (php.net), and then pass it to fgetcsv. I didn't test this, but it might work.
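A quick, untested sketch of that idea ($csvStr stands in for the CSV string you downloaded):
$csvStr = "a,b,c\n1,2,3\n";          // the CSV data held in a string
$fp = fopen('php://temp', 'r+');      // or 'php://memory'
fwrite($fp, $csvStr);
rewind($fp);                          // back to the start before reading
while (($row = fgetcsv($fp)) !== false) {
    // work with $row, an array of fields
}
fclose($fp);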
I have no experience dealing with large files, so I am not sure what to do about this. I have attempted to read several large files using file_get_contents; the task is to clean and munge them using preg_replace().
My code runs fine on small files; however, large files (40 MB) trigger a memory exhausted error:
PHP Fatal error: Allowed memory size of 16777216 bytes exhausted (tried to allocate 41390283 bytes)
I was thinking of using fread() instead but I am not sure that'll work either. Is there a workaround for this problem?
Thanks for your input.
This is my code:
<?php
error_reporting(E_ALL);
##get find() results and remove DOS carriage returns.
##The error is thrown on the next line for large files!
$myData = file_get_contents("tmp11");
$newData = str_replace("^M", "", $myData);
##cleanup Model-Manufacturer field.
$pattern = '/(Model-Manufacturer:)(\n)(\w+)/i';
$replacement = '$1$3';
$newData = preg_replace($pattern, $replacement, $newData);
##cleanup Test_Version field and create comma delimited layout.
$pattern = '/(Test_Version=)(\d).(\d).(\d)(\n+)/';
$replacement = '$1$2.$3.$4 ';
$newData = preg_replace($pattern, $replacement, $newData);
##cleanup occasional empty Model-Manufacturer field.
$pattern = '/(Test_Version=)(\d).(\d).(\d) (Test_Version=)/';
$replacement = '$1$2.$3.$4 Model-Manufacturer:N/A--$5';
$newData = preg_replace($pattern, $replacement, $newData);
##fix occasional Model-Manufacturer being incorrectly wrapped.
$newData = str_replace("--","\n",$newData);
##fix 'Binary file' message when find() utility cannot id file.
$pattern = '/(Binary file).*/';
$replacement = '';
$newData = preg_replace($pattern, $replacement, $newData);
$newData = removeEmptyLines($newData);
##replace colon with equal sign
$newData = str_replace("Model-Manufacturer:","Model-Manufacturer=",$newData);
##file stuff
$fh2 = fopen("tmp2","w");
fwrite($fh2, $newData);
fclose($fh2);
### Functions.
##Data cleanup
function removeEmptyLines($string)
{
    return preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $string);
}
?>
Firstly you should understand that when using file_get_contents you're fetching the entire string of data into a variable, and that variable is stored in the host's memory.
If that string is greater than the size dedicated to the PHP process, then PHP will halt and display the error message above.
The way around this is to open the file as a pointer, and then take a chunk at a time. This way, if you had a 500MB file you could read the first 1MB of data, do what you will with it, delete that 1MB from the system's memory and replace it with the next MB. This allows you to manage how much data you're putting in memory.
An example of this can be seen below; I will create a function that acts like Node.js.
function file_get_contents_chunked($file, $chunk_size, $callback)
{
    try
    {
        $handle = fopen($file, "r");
        $i = 0;
        while (!feof($handle))
        {
            call_user_func_array($callback, array(fread($handle, $chunk_size), &$handle, $i));
            $i++;
        }
        fclose($handle);
    }
    catch (Exception $e)
    {
        trigger_error("file_get_contents_chunked::" . $e->getMessage(), E_USER_NOTICE);
        return false;
    }
    return true;
}
and then use like so:
$success = file_get_contents_chunked("my/large/file", 4096, function($chunk, &$handle, $iteration) {
    /*
     * Do what you will with the {$chunk} here.
     * {$handle} is passed in case you want to seek
     * to different parts of the file.
     * {$iteration} is the section of the file that has been read, so
     * ($iteration * 4096) is your current offset within the file.
     */
});
if (!$success) {
    // it failed
}
One of the problems you will find is that you're trying to perform regex several times on an extremely large chunk of data. Not only that, but your regex is built for matching the entire file.
With the above method your regex could become useless, as you may only be matching against half a set of data. What you should do is revert to the native string functions such as
strpos
substr
trim
explode
for matching the strings. I have added support in the callback so that the handle and current iteration are passed; this allows you to work with the file directly within your callback, using functions like fseek, ftruncate and fwrite, for instance (see the sketch below).
The way you're building your string manipulation is not efficient whatsoever, and using the proposed method above is by far a much better way.
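To illustrate one way of keeping chunk boundaries from splitting your matches, here is a rough, untested sketch that buffers the trailing partial line and only processes complete lines (the $remainder carry-over and the 4096 chunk size are assumptions, not part of the function above):
$remainder = '';
$success = file_get_contents_chunked("my/large/file", 4096, function($chunk, &$handle, $iteration) use (&$remainder) {
    $chunk = $remainder . $chunk;
    $lastNewline = strrpos($chunk, "\n");
    if ($lastNewline === false) {
        // no complete line in this chunk yet, keep buffering
        $remainder = $chunk;
        return;
    }
    // keep the partial last line for the next chunk
    $remainder = substr($chunk, $lastNewline + 1);
    foreach (explode("\n", substr($chunk, 0, $lastNewline)) as $line) {
        // run your str_replace()/preg_replace() cleanup on one complete line here
    }
});
// any $remainder left afterwards is the file's final, unterminated line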
A pretty ugly solution to adjust your memory limit depending on file size:
$filename = "yourfile.txt";
ini_set ('memory_limit', filesize ($filename) + 4000000);
$contents = file_get_contents ($filename);
The right solution would be to think about whether you can process the file in smaller chunks, or use command line tools from PHP.
If your file is line-based you can also use fgets to process it line-by-line.
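A bare-bones, untested sketch of that line-by-line approach, reusing the tmp11/tmp2 file names from the question:
$in  = fopen('tmp11', 'r');
$out = fopen('tmp2', 'w');
while (($line = fgets($in)) !== false) {
    // apply the str_replace()/preg_replace() cleanup to one line at a time
    fwrite($out, $line);
}
fclose($in);
fclose($out);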
For processing just n rows at a time (here n = 1000), we can use generators in PHP.
This is how it works: read n lines, process them, come back at line n+1, then read the next n lines, process them, and so on.
Here's the code for doing so.
<?php
class readLargeCSV {
    private $file;
    private $delimiter;
    private $iterator;
    private $header;

    public function __construct($filename, $delimiter = "\t") {
        $this->file = fopen($filename, 'r');
        $this->delimiter = $delimiter;
        $this->iterator = 0;
        $this->header = null;
    }

    public function csvToArray()
    {
        $data = array();
        $is_mul_1000 = false;
        while (($row = fgetcsv($this->file, 1000, $this->delimiter)) !== false) {
            $is_mul_1000 = false;
            if (!$this->header) {
                $this->header = $row;
            } else {
                $this->iterator++;
                $data[] = array_combine($this->header, $row);
                if ($this->iterator != 0 && $this->iterator % 1000 == 0) {
                    $is_mul_1000 = true;
                    $chunk = $data;
                    $data = array();
                    yield $chunk;
                }
            }
        }
        fclose($this->file);
        if (!$is_mul_1000) {
            yield $data;
        }
        return;
    }
}
And for reading it, you can use this.
$file = database_path('path/to/csvfile/XYZ.csv');
$csv_reader = new readLargeCSV($file, ",");
foreach ($csv_reader->csvToArray() as $data) {
    // you can do whatever you want with the $data.
}
Here $data contains 1000 entries from the CSV, or fewer (the remainder) for the last batch.
A detailed explanation for this can be found here https://medium.com/#aashish.gaba097/database-seeding-with-large-files-in-laravel-be5b2aceaa0b
My advice would be to use fread. It may be a little slower, but you won't have to use all your memory...
For instance:
// This uses filesize($oldFile) bytes of memory
file_put_contents($newFile, file_get_contents($oldFile));

// And this only 8192 bytes at a time
$pNew = fopen($newFile, 'w');
$pOld = fopen($oldFile, 'r');
while (!feof($pOld)) {
    fwrite($pNew, fread($pOld, 8192));
}
fclose($pNew);
fclose($pOld);