How to parse a large CSV file without timing out? - php

I'm trying to parse a 50 MB .csv file. The file itself is fine, but I can't get past the massive timeout issues involved. Everything is set correctly on the upload side: I can easily upload and re-open the file, but after the browser times out I receive a 500 Internal Server Error.
My guess is that I can save the file onto the server, open it, and keep a session value of the last line I dealt with. After a certain number of lines I reset the connection via a refresh and reopen the file at the line where I left off. Is this a doable idea? The previous developer wrote a very inefficient MySQL class that controls the entire site, so I don't want to write my own class if I don't have to, and I don't want to mess with his.
TL;DR version: Is it efficient to save the last line I'm on in a CSV file that has 38K lines of products and then, after X number of rows, reset the connection and continue from where I left off (rough sketch of what I mean below)? Or is there another way to parse a large CSV file without timeouts?
NOTE: It's the PHP script execution time that's the issue. At 38K lines it currently takes about 46 minutes and 5 seconds to run via the command line. It works correctly 100% of the time when I take the browser out of the equation, which suggests a browser timeout. Chrome's timeout is not editable as far as Google has told me, and raising Firefox's timeout rarely works.
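Rough sketch of what I have in mind (untested; process_row() stands in for the real import logic and the batch size is arbitrary):
<?php
session_start();

$batchSize = 1000; // rows to handle per request (arbitrary)
$offset    = isset($_SESSION['csv_offset']) ? $_SESSION['csv_offset'] : 0;

$handle = fopen('/path/on/server/products.csv', 'r');
fseek($handle, $offset); // jump back to where the last request stopped

$rows = 0;
while ($rows < $batchSize && ($line = fgetcsv($handle)) !== false) {
    process_row($line); // placeholder for the real per-row import work
    $rows++;
}

if (feof($handle)) {
    unset($_SESSION['csv_offset']); // finished, clear the saved position
    echo 'Import finished.';
} else {
    $_SESSION['csv_offset'] = ftell($handle); // remember the byte offset for the next request
    header('Refresh: 1'); // reload the page to process the next batch
    echo $rows . ' more rows imported...';
}
fclose($handle);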

You could do something like this:
<?php
namespace database;

class importcsv
{
    private $crud;

    public function __construct($dbh, $table)
    {
        $this->crud = new \database\crud($dbh, $table);
    }

    public function import($columnNames, $csv, $separator)
    {
        $x = 0; // number of rows imported
        $lines = explode("\n", $csv);
        foreach($lines as $line)
        {
            // reset the execution-time counter for every row so the script never hits the limit
            \set_time_limit(30);

            $line = explode($separator, $line);
            $data = new \stdClass();
            foreach($line as $i => $item)
            {
                if(isset($columnNames[$i]) && !empty($columnNames[$i]))
                {
                    // braces are needed so PHP resolves the array index before the property name
                    $data->{$columnNames[$i]} = $item;
                }
            }
            $this->crud->create($data);
            $x++;
        }
        return $x;
    }

    public function importFile($columnNames, $csvPath, $separator)
    {
        if(file_exists($csvPath))
        {
            $content = file_get_contents($csvPath);
            return $this->import($columnNames, $content, $separator);
        }
        else
        {
            // Error: file not found
            return false;
        }
    }
}
TL;DR: Calling \set_time_limit(30); every time you loop through a line might fix your timeout issues.
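For example, it could be used like this (the PDO connection, table name and column names are just examples; it assumes the \database\crud class from above):
<?php
$dbh = new \PDO('mysql:host=localhost;dbname=shop', 'user', 'pass'); // example connection

$importer = new \database\importcsv($dbh, 'products');
$rows = $importer->importFile(array('sku', 'name', 'price'), '/path/to/products.csv', ',');

echo $rows . ' rows imported';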

I suggest running PHP from the command line and setting it up as a cron job. This way you don't have to modify your code, there is no timeout issue, and you can easily parse large CSV files.
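For example, something along these lines in the crontab would run the import script every night (the paths are placeholders):
# run the importer at 02:00 every day; adjust the php binary and script paths to your setup
0 2 * * * /usr/bin/php /var/www/site/scripts/import_csv.php >> /var/log/csv_import.log 2>&1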

Your post is a little unclear due to the typos and grammar; could you please edit it?
If you are saying that the upload itself is okay, but the delay is in processing the file, then the easiest thing to do is to parse the file in parallel using multiple threads. You can use Java's built-in Executor class, or Quartz or Jetlang, to do this:
Find the size of the file or number of lines.
Select a Thread load (Say 1000 lines per thread)
Start an Executor
Read the file in a loop.
For each 1000 lines, create a Runnable and load it into the Executor
Start the Executor
Wait till all threads are finished
Each runnable does this:
Fetch a connection
Insert the 1000 lines
Log the results
Close the connection

Related

Codeigniter Allowed memory size exhausted while processing large files

I'm posting this in case someone else is looking for the same solution, seeing as I just wasted two days on this bullshit.
I have a cron job that updates the database using a very large file once a day, using the following code:
if (($handle = fopen(dirname(__FILE__) . '/uncompressed', "r")) !== FALSE)
{
    // read one line at a time so the whole 300MB file is never held in memory
    while (($data = fgets($handle)) !== FALSE)
    {
        $thisline = json_decode($data, true);
        $this->regen($thisline);
    }
    fclose($handle);
}
This is in a Codeigniter controller that's only used for cron jobs. The $this->regen function runs through a bunch of different checks and stores the right information from the line in the database. The file itself is over 300MB of JSONs separated by newlines.
The problem: it would only process about 20,000 lines before the whole thing ran out of memory.
I spent hours troubleshooting this and got nothing obvious. I'm using fgets, I have $query->free_result() in the right places. It didn't help. So then I started checking a loop of about 100 lines, and watched the output of memory_get_usage(). I finally narrowed it down to the Codeigniter Active Record class - every call to the class caused the memory usage to increase by a tiny amount.
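The check was roughly this ($line_number is just the counter I kept in the loop; the interval is arbitrary):
// every 100 lines, log how much memory the script is holding on to
if ($line_number % 100 === 0) {
    log_message('debug', 'memory after ' . $line_number . ' lines: ' . memory_get_usage(true) . ' bytes');
}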
Then I found this thread on Ellislabs and I got the answer. CI Active Record saves queries so that if you want to, you can build a query in multiple functions. (I am not even going to go into how dumb it is to have that switched on by default.)
Go to /config/database.php and add
$db['default']['save_queries'] = FALSE;
to the end of the file. Then make sure you build and execute queries using Active Record in a single function. If you need to switch it off just for one case, use
$this->db->save_queries = FALSE;
in the constructor or wherever you need to put it.
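For example, in a cron-only controller it could look like this (the controller name is just an example, and it assumes the database library is already loaded):
class Cron_import extends CI_Controller
{
    public function __construct()
    {
        parent::__construct();
        // stop CI Active Record from keeping a copy of every query in memory for this controller
        $this->db->save_queries = FALSE;
    }
}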

PHP server file download cutoff unexpectedly

I have a web interface that I built into the admin section of a WordPress site. It scrapes a few tables in my database and just displays a big list of data row by row. There are about 30,000 rows of this data, displayed with a basic echo in a for loop. Displaying all 30,000 rows on a page works fine.
Additionally, I include an option to download a CSV file of the complete rows of data. I use fopen and then fputcsv to build the CSV file for download from the result of the data query. This feature used to work, but now that the dataset is at 30,000 rows, the CSV no longer generates correctly. What happens is that only the first 200-1000 rows are written to the CSV file, leaving out the majority of the data. I estimate that the full CSV, if it were generated properly, would be about 10 MB. The file then downloads with just those first 200-1000 rows as though everything were working correctly.
Here is the code:
// This gets a huge list of data from a SP I built. This data is well formed
$data = $this->run_stats_stored_procedure($job_to_report);

// This is where the data is converted into a csv file. This part is broken
// the file may already exist at that location burn it down if it does
if(file_exists(ABSPATH . "some/path/to/my/file/csv_export.csv")) {
    unlink(ABSPATH . "some/path/to/my/file/csv_export.csv");
}
$csv_file_handler = fopen(ABSPATH . "some/path/to/my/file/candidate_export.csv", 'w');
if(!empty($csv_file_handler)) {
    $title_array = array(
        "ID",
        "other_feild"
    );
    fputcsv($csv_file_handler, $title_array, ",");
    if(!empty($data)) {
        foreach($data as $data_piece) {
            $array_as_csv_line = array();
            foreach($data_piece as $object_property) {
                $array_as_csv_line[] = (string)$object_property;
            }
            fputcsv($csv_file_handler, $array_as_csv_line, ",");
            unset($array_as_csv_line);
        }
    } else {
        fputcsv($csv_file_handler, array("empty"), ",");
    }
    // pros clean everything up when they are done
    fclose($csv_file_handler);
}
I'm not sure what I need to change to get the entire CSV file to download. I believe this could be a configuration issue, but I'm not sure. I'm led to believe this because the function used to work with 20,000 CSV rows; now, at 30,000, it breaks. Please let me know if additional info would help. Has anyone bumped into issues with huge CSV files before? Thank you to anyone who can help.
Is the "download" taking more than say a minute, two minutes, or three minutes? If so, the webserver could be closing the connection. For example, if you're using the Apache FCGI module, it has this directive:
FcgidBusyTimeout
which defaults to 300 seconds.
This is the maximum time limit for request handling. If a FastCGI request does not complete within FcgidBusyTimeout seconds, it will be subject to termination.
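If that turns out to be the limit, raising it in the mod_fcgid configuration might look something like this (the value is just an example):
# httpd.conf / vhost configuration for mod_fcgid: allow requests to run for up to an hour
FcgidBusyTimeout 3600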
Hope this helps you solve your problem.
The answer that I am currently implementing is to allow the script to use more time. To do this, I simply run the following code at the start of the script:
set_time_limit ( 3600 );
I am doing further research because this is not a sustainable solution. Any further advice would be greatly appreciated.

PHP, check if the file is being written to/updated by PHP script?

I have a script that re-writes a file every few hours. This file is inserted into end users' HTML via a PHP include.
How can I check whether my script is, at this exact moment, working on (i.e. re-writing) the file while it is being pulled in for display to a user? Is it even an issue? What will happen if they access the file at the same time, what are the odds of that, and will the user just have to wait until the script has finished its work?
Thanks in advance!
More on the subject...
Is this a way forward using file_put_contents and LOCK_EX?
When the script saves its data every now and then:
file_put_contents("text", $content, LOCK_EX);
and when the user opens the page:
if (file_exists("text")) {
    function include_file() {
        $file = fopen("text", "r");
        if (flock($file, LOCK_EX)) {
            include_file();
        }
        else {
            echo file_get_contents("text");
        }
    }
} else {
    echo 'no such file';
}
Could anyone advise me on the syntax? Is this a proper way to call include_file() after the condition, and how can I limit the number of such calls?
I guess this solution might also work, apart from the same recursive call to include_file(); would it even work?
function include_file() {
    $time = time();
    $file = filectime("text");
    if ($file + 1 < $time) {
        echo "good to read";
    } else {
        echo "have to wait";
        include_file();
    }
}
To check whether the file is currently being written to, you can use the filectime() function to get the time the file was last changed.
You can store the current timestamp in a variable at the top of your script, and whenever you need to access the file, compare that timestamp with the filectime() of the file. If the file's change time is newer, you are in the scenario where you have to wait for the file to finish being written, and you can log that in the database or in another file.
To prevent this scenario from happening, you can change the script that writes the file so that it first creates a temporary file and, once it's done, replaces (moves or renames) the original file with the temporary one. The replace takes far less time than writing the whole file, which makes the conflict a very rare possibility.
Even if a read and a replace do happen at the same time, the time the reading script has to wait will be very short.
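A minimal sketch of that temp-file-and-rename approach (the file name is an example, and $newContent stands for whatever your script generates):
<?php
// write the new content to a temporary file first, then swap it in
$target = 'text';               // the file the users' pages read
$tmp    = $target . '.tmp';

file_put_contents($tmp, $newContent, LOCK_EX);
rename($tmp, $target);          // on the same filesystem the replacement happens in one step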
Depending on the size of the file, this might be an issue of concurrency. But you might solve it quite easily: before starting to write the file, you could create a kind of "lock file", i.e. if your file is named "incfile.php" you create an "incfile.php.lock". Once you're done with writing, you remove this file.
On the include side, you can check for the existence of "incfile.php.lock" and wait until it has disappeared; that needs some looping and sleeping in the unlikely case of concurrent access.
Basically, you should also consider another solution: write the data that is rendered into that file to a database instead (locks etc. are available there) and render it in a module that then gets included in your page. Solutions like yours are hard to maintain in the long run...
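A rough sketch of that lock-file idea (untested; the file names, retry count and sleep interval are arbitrary, and $newContent stands for the generated data):
<?php
// writer side: create the lock file, rewrite the include, then remove the lock
touch('incfile.php.lock');
file_put_contents('incfile.php', $newContent);
unlink('incfile.php.lock');

// include side: wait briefly while the lock file exists, then include as usual
$tries = 0;
while (file_exists('incfile.php.lock') && $tries++ < 50) {
    usleep(100000);   // wait 100 ms between checks
    clearstatcache(); // so file_exists() sees the current state
}
include 'incfile.php';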
This question is old, but I'm adding this answer because the others don't show a complete function.
function write_to_file(string $fp, string $string) : bool {
    // remember the current Unix timestamp before touching the file
    $timestamp_before_fwrite = time();

    $stream = fopen($fp, "w");
    fwrite($stream, $string);
    fclose($stream);

    // compare the file's modification time against the timestamp taken above
    clearstatcache(); // make sure filemtime() is not served from PHP's stat cache
    $file_last_changed = filemtime($fp);
    if ($file_last_changed < $timestamp_before_fwrite) {
        // file was not changed by the fwrite() above
        return false;
    }
    return true;
}
This is the function I use to write to a file. It first takes the current timestamp before making changes to the file, and afterwards compares that timestamp against the time the file was last changed.
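Example call (the path and content are placeholders):
if (!write_to_file('/tmp/example.txt', "some content\n")) {
    // the modification time did not move forward, so treat the write as failed
    error_log('write_to_file: the file does not appear to have been updated');
}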

PHP queue file implementation

For a project I'm working on I need a queue that will be too large to hold in memory. I had been implementing it as a simple file: read the whole file, take the first few (~100) lines, process them, then write the queue back with the new instructions added and the old ones removed. However, since the queue has become too large to hold in memory this way, I need something different. Ideally someone can tell me a way to peel off just the first few lines of a file without having to look at the rest of the data. I have thought about using a database (probably MySQL with sorted insert timestamps), but I would strongly prefer to avoid one for load and bandwidth reasons (several servers would all have to be sending and receiving a lot of data to and from the DB). The language I'm working in is PHP, but really this question is more about unix files, I suppose. Any help would be appreciated.
Sucking out the first line of a file is pretty trivial (fopen() followed by an fgets()). Re-writing the file to remove completed jobs would be very painful, especially if you've got multiple concurrent servers working off the same queue file.
One alternative would be to use a separate file for each job. If you have some concurrency-safe method of generating an incrementing ID for these files, then it'd be a simple matter of picking out the file with the lowest ID for the oldest job, and generating a new ID for each new job. You'd have to figure out some file locking, though, to keep two or more servers from grabbing the same file at the same time.
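A rough sketch of that approach (untested; the directory, the naming scheme and the locking details are my own assumptions):
<?php
// enqueue: each job goes into its own file, named so that older jobs sort first
function enqueue_job($dir, $payload) {
    $name = sprintf('%.6f', microtime(true)) . '-' . uniqid('', true) . '.job';
    file_put_contents($dir . '/' . $name, $payload, LOCK_EX);
}

// dequeue: walk the job files in name order; flock() keeps two servers off the same job
function dequeue_job($dir) {
    foreach (glob($dir . '/*.job') as $file) {
        $fh = fopen($file, 'r');
        if ($fh === false) {
            continue;
        }
        if (flock($fh, LOCK_EX | LOCK_NB)) {
            $payload = stream_get_contents($fh);
            unlink($file);        // claim the job by removing its file
            flock($fh, LOCK_UN);
            fclose($fh);
            return $payload;
        }
        fclose($fh);              // someone else already has this job
    }
    return null;                  // queue empty (or every job is currently locked)
}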
I had the same problems while I was working on the enqueue/fs transport. I could not find a way to modify a small portion at the beginning of the file without copying it into memory and saving it back. It is possible, however, to do that with the end of the file: you can read a portion and then truncate it. That's not really a queue but a stack, so if you rely on message ordering this would not be a solution. In my case, I lock the file while it is being read; once the portion has been read, the lock is released.
This is how you could write messages to a queue file:
<?php
// a message is '|' followed by a JSON body, so the reader below can find where it starts
$rawMessage = '|'.json_encode(array('body' => 'this is your message to put on the queue'));

$queueFile = fopen('/path/to/queue/file', 'a+');

// pad the message with spaces so its length is a multiple of the frame size (64 bytes);
// that makes it easier to read messages back out of the file frame by frame
$rawMessage = str_repeat(' ', 64 - (strlen($rawMessage) % 64)).$rawMessage;

flock($queueFile, LOCK_EX);   // lock file
fwrite($queueFile, $rawMessage);
flock($queueFile, LOCK_UN);   // release lock
fclose($queueFile);
This is how you could read messages from a queue file:
<?php
$queueFile = fopen('/path/to/queue/file', 'c+');

flock($queueFile, LOCK_EX);   // lock file

// read the newest message (one or more 64-byte frames) from the end of the file,
// then cut it off the file: read a portion and truncate it
$frame = readFrame($queueFile, 1);
ftruncate($queueFile, fstat($queueFile)['size'] - strlen($frame));
rewind($queueFile);

// strip the space padding and the leading '|' marker to get the message body back
$rawMessage = substr(trim($frame), 1);

flock($queueFile, LOCK_UN);   // release lock
fclose($queueFile);

function readFrame($file, $frameNumber)
{
    $frameSize = 64;
    $offset = $frameNumber * $frameSize;

    // read the n-th 64-byte frame counting backwards from the end of the file
    fseek($file, -$offset, SEEK_END);
    $frame = fread($file, $frameSize);

    if ('' == $frame) {
        return '';
    }

    // '|{' marks the start of a message, so this frame completes it
    if (false !== strpos($frame, '|{')) {
        return $frame;
    }

    // the message spans more than one frame: keep reading backwards and prepend
    return readFrame($file, $frameNumber + 1).$frame;
}
For the locking I'd suggest using the Symfony LockHandler, or simply use enqueue/fs.

Creating files on a time (hourly) basis

I'm experimenting with the Twitter streaming API.
I use Phirehose to connect to Twitter and fetch the data, but I'm having problems storing it in files for further processing.
Basically, what I want to do is create a file named
date("YmdH") . ".txt"
for every hour of connection.
Here is what my code looks like right now (not handling the hourly change of files):
public function enqueueStatus($status)
{
    $data = json_decode($status, true);
    if (isset($data['text']) /* more conditions here */) {
        $fp = fopen("/tmp/$time.txt");
        fwrite($fp, $status);
        fclose($fp);
    }
}
Help is as always much appreciated :)
You want the 'append' mode in fopen - this will either append to a file or create it.
if (isset($data['text']) /* more conditions here */) {
    $fp = fopen("/tmp/" . date("YmdH") . ".txt", "a");
    fwrite($fp, $status);
    fclose($fp);
}
From the Phirehose googlecode wiki:
As of Phirehose version 0.2.2 there is an example of a simple "ghetto queue" included in the tarball (see the files ghetto-queue-collect.php and ghetto-queue-consume.php) that shows how statuses can easily be collected onto the filesystem for processing and then picked up by a separate process (consume).
That is a complete working sample of what you want to do, and the rotation time interval is configurable. There is also a second script to consume and process the written files.
Now if only I could find a way to stop the whole script; my log keeps filling up (the script continues running) even if I close the browser tab :P
