Parse tabulation-based data file php - php

I have several files to parse (with PHP) in order to insert their respective content in different database tables.
First point: the client gave me 6 files. 5 are CSV with values separated by commas; the last one does not come from the same database and its content is tab-delimited.
I built a FileParser that uses SplFileObject to execute a method on each line of the file-content (basically, create an Entity with each dataset and persist it to the database, with Symfony2 and Doctrine2).
But I cannot manage to parse the tabulation-based text file with SplFileObject, it does not split the content in lines as I expect it to do...
// In my controller context
$parser = new MyAmazingFileParser();
$parser->parse($filename, $delimitor, function ($data) use ($em) {
$e = new Entity();
$e->setSomething($data[0]);
// [...]
$em->persist($e);
});
// In my parser
public function parse($filename, $delimitor = ',', $run = null) {
if (is_callable($run)) {
$handle = new SplFileObject($filename);
$infos = new SplFileInfo($filename);
if ($infos->getExtension() === 'csv') {
// Everything is going well here
$handle->setCsvControl(',');
$handle->setFlags(SplFileObject::DROP_NEW_LINE + SplFileObject::READ_AHEAD + SplFileObject::SKIP_EMPTY + SplFileObject::READ_CSV);
foreach (new LimitIterator($handle, 1) as $data) {
$result = $run($data);
}
} else {
// Why does the iterator approach not work?
$handle->setCsvControl("\t");
// I have tried with all the possible flags combinations, without success...
foreach (new LimitIterator($handle, 1) as $data) {
// It always only gets the first line...
$result = $run($data);
}
// And the old-memory-killing-dirty-way works ?
$fd = fopen($filename, 'r');
$contents = fread($fd, filesize($filename));
foreach (explode("\t", $contents) as $line) {
// Get all the line as I want... But it's dirty and memory-expensive !
$result = $run($line);
}
}
}
}
It is probably related to the horrible formatting of my client's file, but after a long discussion with them, they really cannot provide another format, for some acceptable reasons (constraints on their side), unfortunately.
The file is currently 49459 lines long, so I really think memory matters at this step. So I have to make the SplFileObject approach work, but I do not know how.
An extract of the file can be found here :
Data-extract-hosted

Related

Best way to read a large file in php [duplicate]

This question already has answers here:
Reading very large files in PHP
(8 answers)
Closed 1 year ago.
I have a file with around 100 records for now.
The file has users in json format per line.
Eg
{"user_id" : 1,"user_name": "Alex"}
{"user_id" : 2,"user_name": "Bob"}
{"user_id" : 3,"user_name": "Mark"}
Note : This is a just very simple example, I have more complex json values per line in the file.
I am reading the file line by line and store that in an array which obviously will be big if there are a lot of items in the file.
public function read(string $file) : array
{
//Open the file in "reading only" mode.
$fileHandle = fopen($file, "r");
//If we failed to get a file handle, throw an Exception.
if ($fileHandle === false) {
throw new Exception('Could not get file handle for: ' . $file);
}
$lines = [];
//While we haven't reached the end of the file.
while (!feof($fileHandle)) {
//Read the current line in.
$lines[] = json_decode(fgets($fileHandle));
}
//Finally, close the file handle.
fclose($fileHandle);
return $lines;
}
Next, I'll process this array and only take the parameters I need (some parameters might be further processed), and then I'll export this array to CSV.
public function processInput($users){
$data = [];
foreach ($users as $key => $user)
{
$data[$key]['user_id'] = $user->user_id;
$data[$key]['user_name'] = strtoupper($user->user_name);
}
// Call export to csv $data.
}
What is the best way to read the file (in case we have a big file)?
I know file_get_contents is not an optimized way and that fgets is a better approach.
Is there a much better way, considering a big file has to be read and then written to CSV?
You need to modify your reader to make it more "lazy" in some sense. For example consider this:
public function read(string $file, callable $rowProcessor) : void
{
//Open the file in "reading only" mode.
$fileHandle = fopen($file, "r");
//If we failed to get a file handle, throw an Exception.
if ($fileHandle === false) {
throw new Exception('Could not get file handle for: ' . $file);
}
//Read until fgets() reports the end of the file by returning false.
while (($raw = fgets($fileHandle)) !== false) {
//Decode the current line and hand it to the callback.
$rowProcessor(json_decode($raw));
}
//Finally, close the file handle. Nothing is returned: rows go to the callback.
fclose($fileHandle);
}
Then you will need different code that works with this:
function processAndWriteJson($filename) { //Names are hard
$writer = fopen('output.csv', 'w');
read($filename, function ($row) use ($writer) {
// Do processing of the single row here, e.g. with the fields from the question
$processedRow = [$row->user_id, strtoupper($row->user_name)];
fputcsv($writer, $processedRow);
});
fclose($writer);
}
If you want to get the same result as before with your read method you can do:
$lines = [];
//$lines must be captured by reference, otherwise the outer array stays empty.
read($filename, function ($row) use (&$lines) {
$lines[] = $row;
});
It does provide some more flexibility. Unfortunately, it does mean you can only process one line at a time, and scanning up and down the file is harder.
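A possible variation on the callback approach is a generator, which keeps the one-row-at-a-time memory profile while letting the caller use a plain foreach. This is only a sketch: readJsonLines() and exportUsers() are hypothetical names, and the two exported fields are taken from the question's example.

```php
<?php
// Sketch only: readJsonLines() and exportUsers() are hypothetical names.

/** Yield one decoded JSON value per non-empty line of $file. */
function readJsonLines(string $file): Generator
{
    $handle = fopen($file, 'r');
    if ($handle === false) {
        throw new Exception('Could not get file handle for: ' . $file);
    }
    try {
        // fgets() returns false at the end of the file, which ends the loop.
        while (($line = fgets($handle)) !== false) {
            $line = trim($line);
            if ($line === '') {
                continue; // skip blank lines
            }
            yield json_decode($line);
        }
    } finally {
        fclose($handle);
    }
}

/** Stream users from $input to $output as CSV; returns the number of rows written. */
function exportUsers(string $input, string $output): int
{
    $writer = fopen($output, 'w');
    if ($writer === false) {
        throw new Exception('Could not open for writing: ' . $output);
    }
    $count = 0;
    foreach (readJsonLines($input) as $user) {
        // Delimiter, enclosure and escape are passed explicitly.
        fputcsv($writer, [$user->user_id, strtoupper($user->user_name)], ',', '"', '\\');
        $count++;
    }
    fclose($writer);
    return $count;
}
```

Only one decoded row is held in memory at a time, and the caller decides whether to collect rows, filter them, or write them straight out.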

How to assign SplFileObject::fpassthru output to variable

I'm currently writing some data to an SplFileObject like this:
$fileObj = new SplFileObject('php://text/plain,', "w+");
foreach($data as $row) {
$fileObj->fputcsv($row);
}
Now, I want to dump the whole output (string) to a variable.
I know that SplFileObject::fgets gets the output line by line (which requires a loop) but I want to get it in one go, ideally something like this:
$fileObj->rewind();
$output = $fileObj->fpassthru();
However, this does not work as it simply prints to standard output.
There's a solution for what I'm trying to achieve using stream_get_contents():
pass fpassthru contents to variable
However, that method requires you to have direct access to the file handle.
SplFileObject hides the file handle in a private property and therefore not accessible.
Is there anything else I can try?
After writing, do a rewind() then you can read everything. The example is for understanding:
$fileObj = new SplFileObject('php://memory', "w+");
$row = [1,2,'test']; //Test Data
$fileObj->fputcsv($row);
$fileObj->rewind();
//now Read
$rowCopy = $fileObj->fgetcsv();
var_dump($row == $rowCopy);//bool(true)
$fileObj->rewind();
$strLine = $fileObj->fgets(); //read as string
$expected = "1,2,test\n";
var_dump($strLine === $expected); //bool(true)
//several lines
$fileObj->rewind();
$fileObj->fputcsv(['test2',3,4]);
$fileObj->fputcsv(['test3',5,6]);
$fileObj->rewind();
for($content = ""; $row = $fileObj->fgets(); $content .= $row);
var_dump($content === "test2,3,4\ntest3,5,6\n"); //bool(true)
If you absolutely have to fetch your content with only one command then you can do this too
// :
$length = $fileObj->ftell();
$fileObj->rewind();
$content = $fileObj->fread($length);
getSize() doesn't work here.
In the absence of an inbuilt function I've decided to use PHP output buffering, as @CBroe had suggested.
...
$fileObj->rewind();
ob_start();
$fileObj->fpassthru();
$buffer = ob_get_clean();
See @jsplit's answer for a better method using SplFileObject's inbuilt functions.

PHP: Differences in parsing static file and php array file

I wanted to find the best way to build a translation framework (gettext has some imperfections).
So I ran two tests. The first parses a file containing static text with the code below:
function parseLine($line) {
if(!strlen($line) || $line[0] == '#') // check the length first so $line[0] is never read on an empty line
return array();
$eq = strpos($line, '=');
$key = trim(substr($line, 0, $eq));
$value = trim(substr($line, $eq+1));
$value = trim($value, '"');
return array($key => $value);
}
$table = array();
$fp = fopen('lang.lng', 'r');
while(!feof($fp)) {
$table += parseLine(fgets($fp, 4096));
}
fclose($fp);
and secondly including array
$table = include('lang.php');
Of course lang.lng and lang.php hold the same data (1000 records), just represented in different ways.
I was surprised when I saw the results...
First method: ~0.01 s
Second: ~0.001 s
Before the test I was sure that including an array would take more memory and time than parsing a file.
Could somebody explain where my mistake is?
I would have thought that was a no-brainer. Which is faster: reading in a file that contains an array, or reading in a file, processing it line by line, and for each line looking for various tokens and components and piecing them all together in a complex manner to produce the same array as the include method?
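If the editable lang.lng format is worth keeping, one option is to convert it once into an includable array file, so the per-line string work is not repeated on every request. A minimal sketch, assuming the key = "value" layout implied by parseLine(); compileLangFile() is a hypothetical helper name.

```php
<?php
// Sketch only: compileLangFile() is a hypothetical helper, not part of the question.

/** Parse key = "value" lines from $source and write them to $target as a PHP array file. */
function compileLangFile(string $source, string $target): int
{
    $lines = file($source, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        throw new RuntimeException('Could not read: ' . $source);
    }
    $table = [];
    foreach ($lines as $line) {
        $line = trim($line);
        $eq = strpos($line, '=');
        // Skip blank lines, comments, and lines without a key/value separator.
        if ($line === '' || $line[0] === '#' || $eq === false) {
            continue;
        }
        $key = trim(substr($line, 0, $eq));
        $value = trim(trim(substr($line, $eq + 1)), '"');
        $table += [$key => $value]; // like the question's code, the first definition of a key wins
    }
    // var_export() emits valid PHP source for the array.
    file_put_contents($target, '<?php return ' . var_export($table, true) . ";\n");
    return count($table);
}
```

After running it once, `$table = include 'lang.php';` loads the generated file as in the second test.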

parse remote csv-file with PHP on GAE

I seem to be in a catch-22 with a small app I'm developing in PHP on Google App Engine using Quercus;
I have a remote csv-file which I can download & store in a string
To parse that string I'd ideally use str_getcsv, but Quercus doesn't have that function yet
Quercus does seem to know fgetcsv, but that function expects a file handle which I don't have (and I can't make a new one as GAE doesn't allow files to be created)
Anyone got an idea of how to solve this without having to dismiss the built-in PHP csv-parser functions and write my own parser instead?
I think the simplest solution really is to write your own parser. It's a piece of cake anyway and will get you to learn more regex. It makes no sense that there is no CSV-string-to-array parser in PHP, so it's totally justified to write your own. Just make sure it's not too slow ;)
You might be able to create a new stream wrapper using stream_wrapper_register.
Here's an example from the manual which reads global variables: http://www.php.net/manual/en/stream.streamwrapper.example-1.php
You could then use it like a normal file handle:
$csvStr = '...';
$fp = fopen('var://csvStr', 'r+');
while ($row = fgetcsv($fp)) {
// ...
}
fclose($fp);
This shows a simple manual parser I wrote, with example input covering qualified and non-qualified fields and the escape feature. It can be used for the header and data rows, and includes an associative-array function to turn your data into a key-value-pair style array.
//example data
$fields = strparser('"first","second","third","fourth","fifth","sixth","seventh"');
print_r(makeAssocArray($fields, strparser('"asdf","bla\"1","bl,ah2","bl,ah\"3",123,34.234,"k;jsdfj ;alsjf;"')));
//do something like this (assuming $lines holds the CSV lines, header line first)
$fields = strparser(array_shift($lines));
foreach ($lines as $line)
$data = makeAssocArray($fields, strparser($line));
function strparser($string, $div = ",", $qual = "\"", $esc = "\\") {
$buff = "";
$data = array();
$isQual = false; //the result will be a qualifier
$inQual = false; //currently parsing inside qualifier
//iterate through the string, one byte at a time
for ($i = 0; $i < strlen($string); $i++) {
switch ($string[$i]) {
case $esc:
//add next byte to buffer and skip it
$buff .= $string[$i+1];
$i++;
break;
case $qual:
//see if this is escaped qualifier
if (!$inQual) {
$isQual = true;
$inQual = true;
break;
} else {
$inQual = false; //done parseing qualifier
break;
}
case $div:
if (!$inQual) {
$data[] = $buff; //add value to data
$buff = ""; //reset buffer
break;
}
default:
$buff .= $string[$i];
}
}
//get last item as it doesnt have a divider
$data[] = $buff;
return $data;
}
function makeAssocArray($fields, $data) {
$array = array(); //start empty so an empty header still returns an array
foreach ($fields as $key => $field)
$array[$field] = $data[$key];
return $array;
}
If it can be quick and dirty, I would just use exec (http://php.net/manual/en/function.exec.php) to pass it in and use sed and awk (http://shop.oreilly.com/product/9781565922259.do) to parse it. I know you wanted to use the PHP parser. I've tried it before and failed, simply because it's not vocal about its errors.
Hope this helps.
Good luck.
You might be able to use fopen with php://temp or php://memory (php.net) to get it to work. What you would do is open either php://temp or php://memory, write to it, then rewind it (php.net), and then pass it to fgetcsv. I didn't test this, but it might work.
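The php://memory idea above could look roughly like the following. This is a sketch under assumptions: parseCsvString() is a hypothetical helper name, and whether Quercus on GAE supports php://memory has not been verified here.

```php
<?php
// Sketch only: parseCsvString() is a hypothetical helper name.
// Availability of php://memory on Quercus/GAE is an assumption.

/** Parse a CSV string into an array of rows using the built-in fgetcsv(). */
function parseCsvString(string $csv, string $delimiter = ','): array
{
    // An in-memory stream gives fgetcsv() the handle it needs without creating a file.
    $fp = fopen('php://memory', 'r+');
    if ($fp === false) {
        throw new RuntimeException('Could not open php://memory');
    }
    fwrite($fp, $csv);
    rewind($fp); // go back to the start before reading

    $rows = [];
    while (($row = fgetcsv($fp, 0, $delimiter, '"', '\\')) !== false) {
        if ($row === [null]) {
            continue; // fgetcsv() reports a blank line as an array holding a single null
        }
        $rows[] = $row;
    }
    fclose($fp);
    return $rows;
}
```

For example, `parseCsvString($csvStr)` would take the downloaded string directly, and a tab-separated string could be handled by passing "\t" as the second argument.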

file_get_contents => PHP Fatal error: Allowed memory exhausted

I have no experience when dealing with large files so I am not sure what to do about this. I have attempted to read several large files using file_get_contents ; the task is to clean and munge them using preg_replace().
My code runs fine on small files; however, the large files (40 MB) trigger a memory exhausted error:
PHP Fatal error: Allowed memory size of 16777216 bytes exhausted (tried to allocate 41390283 bytes)
I was thinking of using fread() instead but I am not sure that'll work either. Is there a workaround for this problem?
Thanks for your input.
This is my code:
<?php
error_reporting(E_ALL);
##get find() results and remove DOS carriage returns.
##The error is thrown on the next line for large files!
$myData = file_get_contents("tmp11");
$newData = str_replace("^M", "", $myData);
##cleanup Model-Manufacturer field.
$pattern = '/(Model-Manufacturer:)(\n)(\w+)/i';
$replacement = '$1$3';
$newData = preg_replace($pattern, $replacement, $newData);
##cleanup Test_Version field and create comma delimited layout.
$pattern = '/(Test_Version=)(\d).(\d).(\d)(\n+)/';
$replacement = '$1$2.$3.$4 ';
$newData = preg_replace($pattern, $replacement, $newData);
##cleanup occasional empty Model-Manufacturer field.
$pattern = '/(Test_Version=)(\d).(\d).(\d) (Test_Version=)/';
$replacement = '$1$2.$3.$4 Model-Manufacturer:N/A--$5';
$newData = preg_replace($pattern, $replacement, $newData);
##fix occasional Model-Manufacturer being incorrectly wrapped.
$newData = str_replace("--","\n",$newData);
##fix 'Binary file' message when find() utility cannot id file.
$pattern = '/(Binary file).*/';
$replacement = '';
$newData = preg_replace($pattern, $replacement, $newData);
$newData = removeEmptyLines($newData);
##replace colon with equal sign
$newData = str_replace("Model-Manufacturer:","Model-Manufacturer=",$newData);
##file stuff
$fh2 = fopen("tmp2","w");
fwrite($fh2, $newData);
fclose($fh2);
### Functions.
##Data cleanup
function removeEmptyLines($string)
{
return preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $string);
}
?>
First, you should understand that file_get_contents fetches the entire file into a single string variable, and that variable is stored in the host's memory.
If that string is larger than the memory allotted to the PHP process, PHP will halt and display the error message above.
The way around this is to open the file as a pointer and then take a chunk at a time. This way, if you had a 500MB file, you could read the first 1MB of data, do what you will with it, release that 1MB from memory and replace it with the next MB. This allows you to manage how much data you're putting in memory.
An example of this can be seen below. I will create a function that works in a callback style, like node.js:
function file_get_contents_chunked($file,$chunk_size,$callback)
{
try
{
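$handle = fopen($file, "r");
//fopen() returns false instead of throwing, so turn a failure into an exception.
if ($handle === false)
{
throw new Exception("could not open " . $file);
}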
$i = 0;
while (!feof($handle))
{
call_user_func_array($callback,array(fread($handle,$chunk_size),&$handle,$i));
$i++;
}
fclose($handle);
}
catch(Exception $e)
{
trigger_error("file_get_contents_chunked::" . $e->getMessage(),E_USER_NOTICE);
return false;
}
return true;
}
and then use like so:
$success = file_get_contents_chunked("my/large/file",4096,function($chunk,&$handle,$iteration){
/*
* Do what you will with the {$chunk} here
* {$handle} is passed in case you want to seek
** to different parts of the file
* {$iteration} is the index of the section of the file that has been read, so
* ($iteration * 4096) is your current offset within the file.
*/
});
if(!$success)
{
//It Failed
}
One of the problems you will find is that you're running regex several times over an extremely large chunk of data. Not only that, but your regex is built to match against the entire file.
With the above method your regex could become useless, as a chunk may contain only half of a record. What you should do is revert to the native string functions such as
strpos
substr
trim
explode
for matching the strings. I have added support in the callback so that the handle and current iteration are passed. This lets you work with the file directly within your callback, using functions like fseek, ftruncate and fwrite, for instance.
The way you're building your string manipulation is not efficient whatsoever, and using the proposed method above is by far a much better way.
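The half-record problem mentioned above is often handled by holding back the incomplete tail of each chunk and prepending it to the next one. A minimal sketch, assuming records end with "\n" (optionally preceded by a DOS "\r"); eachLineFromChunks() is a hypothetical helper name.

```php
<?php
// Sketch only: eachLineFromChunks() is a hypothetical helper name.

/** Read $file in chunks of $chunkSize bytes and call $onLine once per complete line. */
function eachLineFromChunks(string $file, int $chunkSize, callable $onLine): void
{
    $handle = fopen($file, 'r');
    if ($handle === false) {
        throw new RuntimeException('Could not open: ' . $file);
    }
    $carry = ''; // incomplete line left over from the previous chunk
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false) {
            break;
        }
        $parts = explode("\n", $carry . $chunk);
        // The last element is either '' or a partial line: hold it back for the next chunk.
        $carry = array_pop($parts);
        foreach ($parts as $line) {
            $onLine(rtrim($line, "\r")); // also drops a trailing DOS carriage return
        }
    }
    fclose($handle);
    if ($carry !== '') {
        $onLine(rtrim($carry, "\r")); // final line without a trailing newline
    }
}
```

Memory use stays around one chunk plus one partial line, whatever the file size, and the callback only ever sees whole lines.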
A pretty ugly solution to adjust your memory limit depending on file size:
$filename = "yourfile.txt";
ini_set ('memory_limit', filesize ($filename) + 4000000);
$contents = file_get_contents ($filename);
The right solution would be to consider whether you can process the file in smaller chunks, or use command-line tools from PHP.
If your file is line-based you can also use fgets to process it line-by-line.
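As a rough illustration of the fgets route, the sketch below streams a file to a new one while removing carriage returns, holding only one line in memory at a time. stripCarriageReturns() is a hypothetical helper name, and it covers only single-line transformations; the question's regexes that span several lines would need extra state carried between lines.

```php
<?php
// Sketch only: stripCarriageReturns() is a hypothetical helper name.

/** Copy $source to $target line by line, removing "\r"; returns the number of lines copied. */
function stripCarriageReturns(string $source, string $target): int
{
    $in = fopen($source, 'r');
    $out = fopen($target, 'w');
    if ($in === false || $out === false) {
        throw new RuntimeException('Could not open the source or target file');
    }
    $count = 0;
    // fgets() returns one line (including its newline) or false at the end of the file.
    while (($line = fgets($in)) !== false) {
        fwrite($out, str_replace("\r", '', $line));
        $count++;
    }
    fclose($in);
    fclose($out);
    return $count;
}
```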
To process just n rows at a time, we can use generators in PHP (here n = 1000).
This is how it works:
Read n lines, process them, come back at line n+1, read the next n lines, process them, and so on.
Here's the code for doing so.
<?php
class readLargeCSV{
public function __construct($filename, $delimiter = "\t"){
$this->file = fopen($filename, 'r');
$this->delimiter = $delimiter;
$this->iterator = 0;
$this->header = null;
}
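public function csvToArray()
{
$data = array();
// Length 0 lets fgetcsv() read lines of any length instead of splitting lines longer than 1000 characters.
while (($row = fgetcsv($this->file, 0, $this->delimiter)) !== false)
{
if($row === array(null)){
continue; // skip blank lines
}
if(!$this->header){
// The first non-blank row provides the column names.
$this->header = $row;
continue;
}
$this->iterator++;
$data[] = array_combine($this->header, $row);
if($this->iterator % 1000 == 0){
// A full batch of 1000 rows is ready: hand it out and start a new one.
yield $data;
$data = array();
}
}
fclose($this->file);
// Yield the final partial batch, if any rows are left over.
if($data){
yield $data;
}
}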
}
And for reading it, you can use this.
$file = database_path('path/to/csvfile/XYZ.csv');
$csv_reader = new readLargeCSV($file, ",");
foreach($csv_reader->csvToArray() as $data){
// you can do whatever you want with the $data.
}
Here $data contains 1000 entries from the CSV, or the remaining (total rows mod 1000) entries for the last batch.
A detailed explanation for this can be found here: https://medium.com/@aashish.gaba097/database-seeding-with-large-files-in-laravel-be5b2aceaa0b
My advice would be to use fread. It may be a little slower, but you won't have to use all your memory...
For instance :
//This uses filesize($oldFile) bytes of memory
file_put_contents($newFile, file_get_contents($oldFile));
//And this uses only 8192 bytes at a time
$pNew=fopen($newFile, 'w');
$pOld=fopen($oldFile, 'r');
while(!feof($pOld)){
fwrite($pNew, fread($pOld, 8192));
}
fclose($pOld);
fclose($pNew);
