Codeigniter Allowed memory size exhausted while processing large files - php

I'm posting this in case someone else is looking for the same solution, seeing as I just wasted two days on this bullshit.
I have a cron job that updates the database using a very large file once a day, using the following code:
if (($handle = fopen(dirname(__FILE__) . '/uncompressed', "r")) !== FALSE)
{
    while (($data = fgets($handle)) !== FALSE)
    {
        $thisline = json_decode($data, true);
        $this->regen($thisline);
    }
    fclose($handle);
}
This is in a CodeIgniter controller that's only used for cron jobs. The $this->regen function runs through a bunch of different checks and stores the right information from each line in the database. The file itself is over 300MB of JSON objects, one per line.
The problem: it would only process about 20,000 lines before the whole thing ran out of memory.

I spent hours troubleshooting this and got nothing obvious. I'm using fgets, I have $query->free_result() in the right places. It didn't help. So then I started checking a loop of about 100 lines, and watched the output of memory_get_usage(). I finally narrowed it down to the Codeigniter Active Record class - every call to the class caused the memory usage to increase by a tiny amount.
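For anyone who wants to repeat that measurement, here is a minimal sketch of the idea: stream the file one line at a time and sample memory_get_usage() every so often. The names readJsonLines() and profileMemory() are made up for this example and are not part of CodeIgniter; in the real controller the callback would be the $this->regen() call.

```php
<?php
// Yield decoded rows from a newline-delimited JSON file, one at a time.
function readJsonLines(string $path): Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    try {
        while (($line = fgets($handle)) !== false) {
            $line = trim($line);
            if ($line === '') {
                continue; // skip blank lines
            }
            yield json_decode($line, true);
        }
    } finally {
        fclose($handle);
    }
}

// Run $callback for every row and record memory_get_usage() every $every rows.
// Returns an array of [row count => bytes in use].
function profileMemory(iterable $rows, callable $callback, int $every = 100): array
{
    $samples = [];
    $count = 0;
    foreach ($rows as $row) {
        $callback($row);
        if (++$count % $every === 0) {
            $samples[$count] = memory_get_usage();
        }
    }
    return $samples;
}
```

If the sampled numbers keep climbing while the per-row work stays the same, something is holding on to data between iterations, which is what pointed to the database class here.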
Then I found this thread on the EllisLab forums and I got the answer. CI's database class saves every query it runs in memory (that's what $this->db->last_query() and the profiler read from), so the list keeps growing for as long as the script runs. (I am not even going to go into how dumb it is to have that switched on by default.)
Go to /config/database.php and add
$db['default']['save_queries'] = FALSE;
to the end of the file. Keep in mind that this also disables $this->db->last_query() and query profiling. If you need to switch it off just for one case, use
$this->db->save_queries = FALSE;
in the constructor or wherever you need to put it.
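To see why that setting matters in a long-running loop, here is a toy stand-in for a database layer with a query log. ToyDb and runQueries() are invented for this illustration and are not CodeIgniter's code; only the save_queries name is borrowed from the setting above.

```php
<?php
// Toy stand-in for a database layer (hypothetical, not CodeIgniter's implementation).
class ToyDb
{
    public $save_queries = true; // same name as the CI setting, for illustration only
    public $queries = [];        // grows by one entry per query while saving is on

    public function query($sql)
    {
        if ($this->save_queries) {
            $this->queries[] = $sql;
        }
        // a real driver would send $sql to the database here
    }
}

// Simulate a cron job issuing many queries; returns how many were kept in memory.
function runQueries(ToyDb $db, $count)
{
    for ($i = 0; $i < $count; $i++) {
        $db->query("UPDATE items SET seen = 1 WHERE id = $i");
    }
    return count($db->queries);
}

$logging = new ToyDb();
$quiet = new ToyDb();
$quiet->save_queries = false;
printf("log on: %d saved, log off: %d saved\n", runQueries($logging, 20000), runQueries($quiet, 20000));
```

With the log on, every query string stays referenced until the script ends, so memory use scales with the number of lines processed rather than with the size of one line.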

Related

PHP server file download cutoff unexpectedly

I have a web interface that I built into the admin section of a WordPress site. It scrapes a few tables in my database and just displays a big list of data row by row. There are about 30,000 rows of this data, displayed with a basic echo in a for loop. Displaying all 30,000 rows on a page works fine.
Additionally, I include an option to download a CSV file of the complete rows of data. I use fopen and then fputcsv to build the CSV file for download from the result of the data query. This feature used to work, but now that the dataset is at 30,000, the CSV will no longer generate correctly. What happens is the first 200~1000 rows will be written to the CSV file leaving out the majority of the data. I estimate that the CSV that is not properly generated in my case would be about 10 Megs. Then the file will download the first 200~1000 rows as though everything was working correctly.
Here is the code:
// This gets a huge list of data from a SP I built. This data is well formed
$data = $this->run_stats_stored_procedure($job_to_report);
// This is where the data is converted into a csv file. This part is broken
// the file may already exist at that location burn it down if it does
if(file_exists(ABSPATH . "some/path/to/my/file/csv_export.csv")) {
    unlink(ABSPATH . "some/path/to/my/file/csv_export.csv");
}
$csv_file_handler = fopen(ABSPATH . "some/path/to/my/file/candidate_export.csv", 'w');
if(!empty($csv_file_handler)) {
    $title_array = array(
        "ID",
        "other_feild"
    );
    fputcsv($csv_file_handler, $title_array, ",");
    if(!empty($data)) {
        foreach($data as $data_piece) {
            $array_as_csv_line = array();
            foreach($data_piece as $object_property) {
                $array_as_csv_line[] = (string)$object_property;
            }
            fputcsv($csv_file_handler, $array_as_csv_line, ",");
            unset($array_as_csv_line);
        }
    } else {
        fputcsv($csv_file_handler, array("empty"), ",");
    }
    // pros clean everything up when they are done
    fclose($csv_file_handler);
}
I'm not sure what I need to change to get the entire CSV file to download. I believe this could be a configuration issue, but I'm not sure. I am led to believe this because this function used to work with even 20,000 CSV rows; it is now at 30,000 and breaking. Please let me know if additional info would help. Has anyone bumped into issues with huge CSV files before? Thank you to anyone who can help.
Is the "download" taking more than say a minute, two minutes, or three minutes? If so, the webserver could be closing the connection. For example, if you're using the Apache FCGI module, it has this directive:
FcgidBusyTimeout
which defaults to 300 seconds.
This is the maximum time limit for request handling. If a FastCGI request does not complete within FcgidBusyTimeout seconds, it will be subject to termination.
Hope this helps you solve your problem.
The answer that I am currently implementing is to allow the script to use more time. To do this, I am simply running the following code before the script runs:
set_time_limit ( 3600 );
I am doing further research because this is not a sustainable solution. Any further advice would be greatly appreciated.
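One thing that might help while researching is to avoid building the export in one big step: write each row as it arrives and let the caller decide where the bytes go (a file, or php://output for a direct download). This is only a sketch, assuming the rows can be iterated one at a time; writeCsv() is a hypothetical helper, not a WordPress function.

```php
<?php
// Write a header plus any iterable of rows to an open stream; returns the row count.
function writeCsv($handle, array $header, iterable $rows): int
{
    fputcsv($handle, $header, ',', '"', '\\');
    $written = 0;
    foreach ($rows as $row) {
        // cast every column/property to string, as the loop in the question does
        fputcsv($handle, array_map('strval', array_values((array) $row)), ',', '"', '\\');
        $written++;
    }
    return $written;
}

// Example: write two rows to an in-memory stream and print the result.
$out = fopen('php://memory', 'w+');
writeCsv($out, ['ID', 'other_field'], [[1, 'alice'], [2, 'bob']]);
rewind($out);
echo stream_get_contents($out);
fclose($out);
```

Because $rows can be a generator, the full result set never has to sit in one PHP array while the file is being written.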

Load big file into database

I have a big CSV file, about 11 MB, and I need to load its contents into a Postgres database.
I use a PHP script to do this, but it always stops partway through.
I raised the PHP memory limit and some other settings, which let me load more data, but still not all of it.
How can I solve this? Is there some cache I need to clear? Is there a trick to handling big files in PHP?
Thanks in advance.
UPDATE: Add some code
$handler = fopen($fileName, "r");
$dbHandler = pg_connect($databaseConfig);
// fgetcsv() is a plain function that takes the file handle; 0 means no line-length limit
while (($line = fgetcsv($handler, 0, ";")) !== false) {
    // Algorithms to transform data
    // Adding sql sentences in a variable
    // I am using a "batch" idea that execute all sql formed after 5000 read lines
    // When I reach 5000 read lines, execute my sql
    $results = pg_query($dbHandler, $sql);
}
fclose($handler);
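In case you have direct access to the server (and you don't work with some subversion software), PostgreSQL has a far better option that is far less demanding in terms of resources: the COPY command, which loads the file inside the database server instead of pushing every row through PHP. Keep in mind that PHP is comparatively slow and resource-hungry for this kind of bulk load. The file has to be readable by the database server process, and DELIMITER should match your file (the code in the question reads ';').
COPY my_table_name FROM '/home/myfile.csv' WITH (FORMAT csv, DELIMITER ',');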
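If the import has to stay in PHP, the "batch" idea from the question can be kept separate from the parsing. Below is a minimal sketch of just the grouping step; inBatches() is a hypothetical helper and the batch size of 2 is only there to make the output readable.

```php
<?php
// Group any iterable into arrays of at most $size items.
function inBatches(iterable $rows, int $size): Generator
{
    $batch = [];
    foreach ($rows as $row) {
        $batch[] = $row;
        if (count($batch) === $size) {
            yield $batch;
            $batch = [];
        }
    }
    if ($batch !== []) {
        yield $batch; // the final, shorter batch
    }
}

// Example with a tiny batch size so the grouping is visible.
foreach (inBatches(['a', 'b', 'c', 'd', 'e'], 2) as $n => $batch) {
    echo $n, ': ', implode(',', $batch), "\n";
}
```

In the import loop, each batch would be turned into a single statement (or, as one possible option, handed to pg_copy_from()) and sent in one call, so only one batch is held in memory at a time.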

How to parse Large CSV file without timing out?

I'm trying to parse a 50 megabyte .csv file. The file itself is fine, but I'm trying to get past the massive timeout issues involved. Everything is set up upload-wise: I can easily upload and re-open the file, but after the browser times out, I receive a 500 Internal Server Error.
My guess is I can save the file onto the server, open it, and keep a session value of which line I have dealt with. After a certain number of lines I reset the connection via a refresh and open the file at the line I left off at. Is this a doable idea? The previous developer made a very inefficient MySQL class and it controls the entire site, so I don't want to write my own class if I don't have to, and I don't want to mess with his class.
TL;DR version: Is it efficient to save the last line I'm currently on of a CSV file that has 38K lines of products then, and after X number of rows, reset the connection and start from where I left off? Or is there another way to parse a Large CSV file without timeouts?
NOTE: It's the PHP script execution time. Currently at 38K lines, it takes about 46 minutes and 5 seconds to run via command line. It works correctly 100% of the time when I remove it from the browser, suggesting that it is a browser timeout. Chrome's timeout is not editable as far as Google has told me, and Firefox's timeout works rarely.
You could do something like this:
<?php
namespace database;
class importcsv
{
    private $crud;
    public function __construct($dbh, $table)
    {
        // the crud class is not shown in this answer
        $this->crud = new \database\crud($dbh, $table);
    }
    public function import($columnNames, $csv, $seperator)
    {
        $x = 0; // number of rows imported
        $lines = explode("\n", $csv);
        foreach($lines as $line)
        {
            if(trim($line) === '')
                continue; // skip blank lines, e.g. after the final newline
            \set_time_limit(30); // restart the execution timer for every line
            $line = explode($seperator, $line);
            $data = new \stdClass();
            foreach($line as $i => $item)
            {
                if(isset($columnNames[$i])&&!empty($columnNames[$i]))
                    $data->{$columnNames[$i]} = $item; // the braces are required in PHP 7+
            }
            $x++;
            $this->crud->create($data);
        }
        return $x;
    }
    public function importFile($columnNames, $csvPath, $seperator)
    {
        if(file_exists($csvPath))
        {
            $content = file_get_contents($csvPath);
            return $this->import($columnNames, $content, $seperator);
        }
        else
        {
            // Error
        }
    }
}
TL;DR: calling \set_time_limit(30); every time you loop through a line restarts the execution timer, which might fix your timeout issues.
I suggest running PHP from the command line and setting it up as a cron job. This way you don't have to modify your code, there will be no timeout issue, and you can easily parse large CSV files.
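The resume-where-I-left-off idea from the question is also workable if the position is stored as a byte offset rather than a line number, since fseek() can jump straight to it. A minimal sketch, assuming one CSV record per line; processCsvChunk() is a hypothetical name and the callback stands in for the existing MySQL class.

```php
<?php
/**
 * Process at most $maxRows CSV rows, starting at byte $offset.
 * Returns the offset to resume from, or null when the file is finished.
 */
function processCsvChunk(string $path, int $offset, int $maxRows, callable $handleRow, string $separator = ','): ?int
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    fseek($handle, $offset);
    $done = 0;
    while ($done < $maxRows && ($row = fgetcsv($handle, 0, $separator, '"', '\\')) !== false) {
        if ($row === [null]) {
            continue; // blank line
        }
        $handleRow($row);
        $done++;
    }
    $position = ftell($handle);
    $size = fstat($handle)['size'];
    fclose($handle);
    return $position < $size ? $position : null;
}
```

Each request (or each cron run) would call this once, store the returned offset in the session or in a small state file, and stop when it comes back as null.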
Your post is a little unclear due to the typos and grammar, could you please edit?
If you are saying that the upload itself is okay, but the delay is in the processing of the file, then the easiest thing to do is to parse the file in parallel using multiple threads. You can use Java's built-in Executor framework, or Quartz or Jetlang to do this.
Find the size of the file or number of lines.
Select a Thread load (Say 1000 lines per thread)
Start an Executor
Read the file in a loop.
For each 1000 lines, create a Runnable and load it to the Executor
Start the Executor
Wait till all threads are finished
Each runnable does this:
Fetch a connection
Insert the 1000 lines
Log the results
Close the connection

Download a large XML file from an external source in the background, with the ability to resume download if incomplete

Some background information
The files I would like to download are kept on the external server for a week, and a new XML file (10-50 MB) is created there every hour with a different name. I would like the large file to be downloaded to my server chunk by chunk in the background each time my website is loaded, perhaps 0.5 MB each time, resuming the download the next time someone else loads the website. This would require my site to have at least 100 page loads each hour to stay updated, so perhaps a bit more of the file each time if possible. I have researched SimpleXML, XMLReader, and SAX parsing, but whatever I do, it seems to take too long to parse the file directly, so I would like a different approach, namely downloading it as described above.
If I download a 30 MB XML file, I can parse it locally with XMLReader in only 3 seconds (250k iterations), but when I try to do the same from the external server, limited to 50k iterations, it takes 15 seconds to read that small part, so it seems it would not be possible to parse it directly from that server.
Possible solutions
I think it's best to use cURL. But then again, perhaps fopen(), fsockopen(), copy() or file_get_contents() are the way to go. I'm looking for advice on which functions to use to make this happen, or different solutions for how I can parse a 50 MB external XML file into a MySQL database.
I suspect a cron job every hour would be the best solution, but I am not sure how well that would be supported by web hosting companies, and I have no clue how to do something like that. But if that's the best solution, and the majority thinks so, I will have to do my research in that area too.
If a Java applet or JavaScript running in the background would be a better solution, please point me in the right direction when it comes to functions/methods/libraries there as well.
Summary
What's the best solution for downloading parts of a file in the background, and resuming the download each time my website is loaded until it's completed?
If the above solution would be moronic to even try, what language/software would you use to achieve the same thing (download a large file every hour)?
Thanks in advance for all answers, and sorry for the long story/question.
Edit: I ended up using this solution to get the files, with a cron job scheduling a PHP script. It checks my folder for which files I already have, generates a list of the possible downloads for the last four days, then downloads the next XML file in line.
<?php
$date = new DateTime();
$current_time = $date->getTimestamp();
$four_days_ago = $current_time - 345600; // four days in seconds
$temp_url = null;
echo 'Downloading: '."\n";
for ($i = $four_days_ago; $i <= $current_time; $i += 3600) {
    $date->setTimestamp($i);
    if ($date->format('H') !== '00') {
        $temp_filename = $date->format('Y_m_d_H') ."_full.xml";
        if (!glob($temp_filename)) {
            $temp_url = 'http://www.external-site-example.com/'.$date->format('Y/m/d/H') .".xml";
            echo $temp_filename.' --- '.$temp_url.'<br>'."\n";
            break; // with a break here, this loop will only return the next file you should download
        }
    }
}
if ($temp_url === null) {
    exit("Nothing to download.\n"); // every file from the last four days is already here
}
set_time_limit(300);
$Start = microtime(true);
$objInputStream = fopen($temp_url, "rb");
$objTempStream = fopen($temp_filename, "w+b");
stream_copy_to_stream($objInputStream, $objTempStream, (1024*200000));
fclose($objInputStream);
fclose($objTempStream);
$End = microtime(true);
echo '<br>It took '.number_format(($End - $Start),2).' secs to download "'.$temp_filename.'".';
?>
edit2: I just wanted to inform you that there is a way to do what I asked, only it wouldn't work in my case. With the amount of data I need, the website would have to have 400+ visitors an hour for it to work properly. But with smaller amounts of data there are some options; http://www.google.no/search?q=poormanscron
You need to have a scheduled, offline task (e.g., cronjob). The solution you are pursuing is just plain wrong.
The simplest thing that could possibly work is a php script you run every hour (scheduled via cron, most likely) that downloads the file and processes it.
You could try fopen:
<?php
$handle = fopen("http://www.example.com/test.xml", "rb");
$contents = stream_get_contents($handle);
fclose($handle);
?>
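For completeness, the chunk-per-request idea from the question could look roughly like the sketch below, which uses an HTTP Range request to fetch only the bytes that are still missing. It assumes the remote server supports range requests (answers with status 206) and that PHP's cURL extension is available; nextRange() and downloadNextChunk() are hypothetical names. A cron job remains the simpler option.

```php
<?php
// Build the byte range for the next chunk from how much is already on disk.
function nextRange(string $localPath, int $chunkSize): string
{
    clearstatcache(true, $localPath);
    $have = file_exists($localPath) ? filesize($localPath) : 0;
    return sprintf('%d-%d', $have, $have + $chunkSize - 1);
}

// Fetch one chunk and append it to the local file.
// Returns false if the request failed or the server did not honour the range.
function downloadNextChunk(string $url, string $localPath, int $chunkSize = 524288): bool
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RANGE, nextRange($localPath, $chunkSize));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
    if ($body === false || $status !== 206) {
        return false; // a plain 200 would mean the whole file was sent, so do not append it
    }
    return file_put_contents($localPath, $body, FILE_APPEND | LOCK_EX) !== false;
}
```

Once the local file is complete, the next range starts past the end of the remote file and the server will normally refuse it, so the function returns false; a real script would also compare against the remote file size to tell "finished" apart from "failed".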

PHP queue file implementation

For a project I was working on I need a queue which will be too large to hold in normal memory. I had been implementing it as a simple file: it would read the whole file, take the first few (~100) lines, process them, then write back the updated queue with new instructions added and the old ones removed. However, since the queue became too large to hold in memory like this, I need something different. Preferably someone can tell me a way to peel off just the first few lines of a file without having to look at the rest of the data. I had thought about using a database (probably MySQL, with sorted insert timestamps) but I would heavily prefer to do it without one for load and bandwidth reasons (several servers would all have to be sending and receiving a lot of data from the DB). The language I'm working in is PHP, but really this question is more about Unix files I suppose. Any help would be appreciated.
Sucking out the first line of a file is pretty trivial (fopen() followed by an fgets()). Re-writing the file to remove completed jobs would be very painful, especially if you've got multiple concurrent servers working off the same queue file.
One alternative would be to use a separate file for each job. If you have some concurrency-safe method of generating an incrementing ID for these files, then it'd be a simple matter of picking out the file with the lowest ID for the oldest job, and generating a new ID for each new job. You'd have to figure out some file locking, though, to keep two or more servers from grabbing the same file at the same time.
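A rough sketch of the file-per-job idea follows. Instead of a strictly incrementing ID it uses a timestamp-based file name, and instead of explicit locks it relies on rename(), which on a local POSIX filesystem succeeds for only one of several competing processes. Both choices are assumptions made for this example, as are the names enqueueJob() and claimJob(); behaviour on network filesystems may differ.

```php
<?php
// Add a job as its own file. Names sort roughly by creation time, not as a strict sequence.
function enqueueJob(string $dir, string $payload): string
{
    $name = sprintf('%017.6F-%s.job', microtime(true), bin2hex(random_bytes(4)));
    $tmp = $dir . '/.' . $name . '.tmp';
    file_put_contents($tmp, $payload);
    rename($tmp, $dir . '/' . $name); // publish only once the payload is fully written
    return $name;
}

// Take the oldest available job, or return null if there is none.
function claimJob(string $dir): ?string
{
    $jobs = glob($dir . '/*.job') ?: []; // glob() returns names in sorted order
    foreach ($jobs as $job) {
        $claimed = $job . '.claimed';
        if (@rename($job, $claimed)) { // fails if another worker got there first
            $payload = file_get_contents($claimed);
            unlink($claimed);
            return $payload;
        }
    }
    return null;
}
```

A worker would call claimJob() in a loop and sleep briefly whenever it returns null.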
I had the same problem while I was working on the enqueue/fs transport. I failed to modify a small portion at the beginning of the file without copying the whole file into memory and saving it back. It is possible to do that with the end of the file, though: you can read a portion from the end and then truncate the file. That makes it a stack rather than a queue, so if you rely on message ordering this would not be a solution. In my case, I lock the file while it is being read; once the message has been read, the lock is released.
This is how you could write messages to a queue file:
<?php
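// the reader below finds the start of a message by looking for '|{',
// so the message is a JSON object prefixed with '|'
$rawMessage = '|{"body": "this is your message to put to the queue as a string"}';
$queueFile = fopen('/path/to/queue/file', 'a+');
// left-pad with spaces so the message length is a multiple of the frame size (64);
// that makes it easier to read messages from the file.
flock($queueFile, LOCK_EX); // lock file
$rawMessage = str_repeat(' ', (64 - strlen($rawMessage) % 64) % 64).$rawMessage;
fwrite($queueFile, $rawMessage);
flock($queueFile, LOCK_UN); // release lock
fclose($queueFile);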
This is how you could read messages from a queue file:
<?php
$queueFile = fopen('/path/to/queue/file', 'c+');
flock($queueFile, LOCK_EX); // lock file
$frame = readFrame($queueFile, 1);
ftruncate($queueFile, fstat($queueFile)['size'] - strlen($frame));
rewind($queueFile);
$rawMessage = substr(trim($frame), 1); // drop the padding and the leading '|'
flock($queueFile, LOCK_UN); // release lock
fclose($queueFile);
function readFrame($file, $frameNumber)
{
    $frameSize = 64;
    $offset = $frameNumber * $frameSize;
    if (-1 === fseek($file, -$offset, SEEK_END)) {
        return ''; // ran past the start of the file: nothing (more) to read
    }
    $frame = fread($file, $frameSize);
    if ('' == $frame) {
        return '';
    }
    if (false !== strpos($frame, '|{')) {
        return $frame;
    }
    return readFrame($file, $frameNumber + 1).$frame;
}
For the locking I'd suggest using Symfony LockHandler or simply take enqueue/fs.
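The same read-from-the-end-and-truncate technique can be shown in a smaller, self-contained form. The sketch below swaps the 64-byte frames for a different framing, a 4-byte length written after each message, and uses plain flock() for locking; stackPush() and stackPop() are hypothetical names, and like the approach above this is a stack (last in, first out), not a queue.

```php
<?php
// Append a message followed by its length as a 4-byte big-endian integer.
function stackPush(string $path, string $message): void
{
    $fh = fopen($path, 'c+b');
    flock($fh, LOCK_EX);
    fseek($fh, 0, SEEK_END);
    fwrite($fh, $message . pack('N', strlen($message)));
    fflush($fh);
    flock($fh, LOCK_UN);
    fclose($fh);
}

// Remove and return the most recently pushed message, or null if the file is empty.
function stackPop(string $path): ?string
{
    $fh = fopen($path, 'c+b');
    flock($fh, LOCK_EX);
    $size = fstat($fh)['size'];
    $message = null;
    if ($size >= 4) {
        fseek($fh, $size - 4);
        $length = unpack('N', fread($fh, 4))[1];
        $start = $size - 4 - $length;
        fseek($fh, $start);
        $message = $length > 0 ? fread($fh, $length) : '';
        ftruncate($fh, $start); // drop the message and its length from the end
    }
    flock($fh, LOCK_UN);
    fclose($fh);
    return $message;
}
```

Only the tail of the file is ever read or rewritten, so the cost of a pop does not depend on how large the file has grown.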
