I have a web interface that I built into the admin section of a WordPress site. It pulls data from a few tables in my database and displays a big list of it, row by row. There are about 30,000 rows of this data, displayed with a basic echo in a for loop. Displaying all 30,000 rows on a page works fine.
Additionally, I include an option to download a CSV file of the complete data. I use fopen and then fputcsv to build the CSV file for download from the result of the data query. This feature used to work, but now that the dataset is at 30,000 rows, the CSV no longer generates correctly. Only the first 200 to 1,000 rows are written to the CSV file, leaving out the majority of the data; I estimate the properly generated CSV would be about 10 MB. The truncated file then downloads as though everything worked correctly.
Here is the code:
// This gets a huge list of data from a SP I built. This data is well formed
$data = $this->run_stats_stored_procedure($job_to_report);
// This is where the data is converted into a csv file. This part is broken
// the file may already exist at that location; burn it down if it does
if (file_exists(ABSPATH . "some/path/to/my/file/csv_export.csv")) {
    unlink(ABSPATH . "some/path/to/my/file/csv_export.csv");
}
$csv_file_handler = fopen(ABSPATH . "some/path/to/my/file/candidate_export.csv", 'w');
if (!empty($csv_file_handler)) {
    $title_array = array(
        "ID",
        "other_field"
    );
    fputcsv($csv_file_handler, $title_array, ",");
    if (!empty($data)) {
        foreach ($data as $data_piece) {
            $array_as_csv_line = array();
            foreach ($data_piece as $object_property) {
                $array_as_csv_line[] = (string) $object_property;
            }
            fputcsv($csv_file_handler, $array_as_csv_line, ",");
            unset($array_as_csv_line);
        }
    } else {
        fputcsv($csv_file_handler, array("empty"), ",");
    }
    // pros clean everything up when they are done
    fclose($csv_file_handler);
}
I'm not sure what I need to change to get the entire CSV file to download. I believe this could be a configuration issue, but I'm not sure. I am led to believe this because the function used to work with 20,000 CSV rows; it is now at 30,000 and breaking. Please let me know if additional info would help. Has anyone bumped into issues with huge CSV files before? Thank you to anyone who can help.
Is the "download" taking more than say a minute, two minutes, or three minutes? If so, the webserver could be closing the connection. For example, if you're using the Apache FCGI module, it has this directive:
FcgidBusyTimeout
which defaults to 300 seconds.
This is the maximum time limit for request handling. If a FastCGI request does not complete within FcgidBusyTimeout seconds, it will be subject to termination.
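If that turns out to be the cause, the limit can be raised in the Apache configuration. Purely as a sketch, assuming mod_fcgid is the module in play and that an hour is generous enough for your export, it could look like:
<IfModule mod_fcgid.c>
    # Allow long-running FastCGI requests (such as the CSV export) up to an hour
    FcgidBusyTimeout 3600
</IfModule>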
Hope this helps you solve your problem.
The answer that I am currently implementing is to allow the script to use more time. To do this, I am simply running the following code before the script runs:
set_time_limit(3600);
I am doing further research because this is not a sustainable solution. Any further advice would be greatly appreciated.
Related
I'm posting this in case someone else is looking for the same solution, seeing as I just wasted two days on this bullshit.
I have a cron job that updates the database from a very large file once a day, using the following code:
if (($handle = fopen(dirname(__FILE__) . '/uncompressed', "r")) !== FALSE)
{
    while (($data = fgets($handle)) !== FALSE)
    {
        $thisline = json_decode($data, true);
        $this->regen($thisline);
    }
    fclose($handle);
}
This is in a CodeIgniter controller that's only used for cron jobs. The $this->regen function runs through a bunch of different checks and stores the right information from the line in the database. The file itself is over 300 MB of JSON objects separated by newlines.
The problem: it would only process about 20,000 lines before the whole thing ran out of memory.
I spent hours troubleshooting this and got nothing obvious. I'm using fgets, and I have $query->free_result() in the right places; it didn't help. So then I started checking a loop of about 100 lines and watched the output of memory_get_usage(). I finally narrowed it down to the CodeIgniter Active Record class: every call to the class caused the memory usage to increase by a tiny amount.
Then I found this thread on EllisLab and got the answer. CI Active Record saves queries so that, if you want to, you can build a query across multiple functions. (I am not even going to go into how dumb it is to have that switched on by default.)
Go to /config/database.php and add
$db['default']['save_queries'] = FALSE;
to the end of the file. Then make sure you build and execute queries using Active Record in a single function. If you need to switch it off just for one case, use
$this->db->save_queries = FALSE;
in the constructor or wherever you need to put it.
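For context, here's a minimal sketch of how that looks in the cron controller; the class and method names below are just illustrative, not the real ones:
class Cron extends CI_Controller
{
    public function __construct()
    {
        parent::__construct();
        // Stop Active Record from accumulating every query string in memory
        $this->db->save_queries = FALSE;
    }

    public function regen_from_file()
    {
        if (($handle = fopen(dirname(__FILE__) . '/uncompressed', 'r')) !== FALSE)
        {
            while (($data = fgets($handle)) !== FALSE)
            {
                $this->regen(json_decode($data, true));
            }
            fclose($handle);
        }
    }
}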
I have a big file of about 11 MB. It is a CSV file and I need to load its content into a Postgres database.
I use a PHP script to do this job, but it always stops at some point.
I set a big memory limit for PHP, and other stuff, and I could load more data, but not all of it.
How can I solve that? Is there some cache I need to clear? Any secret to managing big files in PHP?
Thanks in advance.
UPDATE: Added some code
$handler = fopen($fileName, "r");
$dbHandler = pg_connect($databaseConfig);

while (($line = fgetcsv($handler, 0, ";")) !== false) {
    // Algorithms to transform data
    // Adding sql sentences in a variable
    // I am using a "batch" idea that executes all sql formed after 5000 read lines
    // When I reach 5000 read lines, execute my sql
    $results = pg_query($dbHandler, $sql);
}
In case you have direct access to the server (and aren't restricted by your hosting or deployment setup), Postgres has a far better option that is far less demanding in terms of resources. Keep in mind that PHP is a slow and resource-consuming language for this kind of bulk loading:
COPY my_table_name FROM '/home/myfile.csv' DELIMITERS ',' CSV
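If you do need to kick it off from PHP, a minimal sketch might look like the following; it assumes the CSV file already sits on the database server and the connecting role is allowed to use server-side COPY (the table name and path are just the examples from above):
$dbHandler = pg_connect($databaseConfig);
// Let Postgres parse and bulk-insert the file itself instead of looping in PHP
$result = pg_query(
    $dbHandler,
    "COPY my_table_name FROM '/home/myfile.csv' WITH (FORMAT csv, DELIMITER ',')"
);
if ($result === false) {
    echo pg_last_error($dbHandler);
}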
I'm trying to parse a 50 megabyte .csv file. The file itself is fine, but I'm trying to get past the massive timeout issues involved. Everything is set correctly upload-wise; I can easily upload and re-open the file, but after the browser timeout I receive a 500 Internal Server Error.
My guess is I can save the file onto the server, open it, and keep a session value of what line I last dealt with. After a certain line, I reset the connection via a refresh and open the file at the line I left off with. Is this a doable idea? The previous developer made a very inefficient MySQL class that controls the entire site, so I don't want to write my own class if I don't have to, and I don't want to mess with his class.
TL;DR version: Is it efficient to save the last line I'm currently on of a CSV file that has 38K lines of products and then, after X number of rows, reset the connection and start from where I left off? Or is there another way to parse a large CSV file without timeouts?
NOTE: It's the PHP script execution time that is the issue. Currently, at 38K lines, it takes about 46 minutes and 5 seconds to run via the command line. It works correctly 100% of the time when I take the browser out of the picture, suggesting that it is a browser timeout. Chrome's timeout is not editable as far as Google has told me, and Firefox's timeout rarely works.
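For reference, the "remember where I left off" idea described above could look roughly like this minimal sketch; the file path, batch size, and session key are made up for illustration:
session_start();
$offset = isset($_SESSION['csv_offset']) ? (int) $_SESSION['csv_offset'] : 0;

$handle = fopen('/path/to/products.csv', 'r');
fseek($handle, $offset);

$processed = 0;
while ($processed < 2000 && ($row = fgetcsv($handle)) !== false) {
    // ... insert/update the product row here ...
    $processed++;
}

if (feof($handle)) {
    unset($_SESSION['csv_offset']);           // finished, clear the bookmark
} else {
    $_SESSION['csv_offset'] = ftell($handle); // remember where we stopped
    header('Refresh: 1');                     // reload to continue with the next batch
}
fclose($handle);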
You could do something like this:
<?php
namespace database;

class importcsv
{
    private $crud;

    public function __construct($dbh, $table)
    {
        $this->crud = new \database\crud($dbh, $table);
        return $this;
    }

    public function import($columnNames, $csv, $separator)
    {
        $lines = explode("\n", $csv);
        $x = 0;
        foreach ($lines as $line)
        {
            \set_time_limit(30);
            $line = explode($separator, $line);
            $data = new \stdClass();
            foreach ($line as $i => $item)
            {
                if (isset($columnNames[$i]) && !empty($columnNames[$i]))
                {
                    $data->{$columnNames[$i]} = $item;
                }
            }
            $x++;
            $this->crud->create($data);
        }
        return $x;
    }

    public function importFile($columnNames, $csvPath, $separator)
    {
        if (file_exists($csvPath))
        {
            $content = file_get_contents($csvPath);
            return $this->import($columnNames, $content, $separator);
        }
        else
        {
            // Error
        }
    }
}
TL;DR: calling \set_time_limit(30); every time you loop through a line might fix your timeout issues.
I suggest running PHP from the command line and setting it up as a cron job. This way you don't have to modify your code, there will be no timeout issue, and you can easily parse large CSV files.
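As a rough illustration (the PHP binary path, script path, and schedule below are made up), the crontab entry might look something like:
# Run the CSV import from the PHP CLI every night at 02:00
0 2 * * * /usr/bin/php /var/www/scripts/import_csv.php >> /var/log/import_csv.log 2>&1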
Your post is a little unclear due to the typos and grammar, could you please edit?
If you are saying that the upload itself is okay, but the delay is in processing the file, then the easiest thing to do is to parse the file in parallel using multiple threads. You can use the Java built-in Executor class, or Quartz or Jetlang, to do this.
Find the size of the file or number of lines.
Select a thread load (say, 1000 lines per thread)
Start an Executor
Read the file in a loop.
For each 1000 lines, create a Runnable and load it onto the Executor
Start the Executor
Wait till all threads are finished
Each runnable does this:
Fetch a connection
Insert the 1000 lines
Log the results
Close the connection
Some background information
The files I would like to download are kept on the external server for a week, and a new XML file (10-50 MB large) is created there every hour with a different name. I would like the large file to be downloaded to my server chunk by chunk in the background each time my website is loaded, perhaps 0.5 MB each time, and then have the download resume the next time someone else loads the website. This would require my site to have at least 100 page loads each hour to stay updated, so perhaps a bit more of the file each time if possible. I have researched SimpleXML, XMLReader, and SAX parsing, but whatever I do, it seems to take too long to parse the file directly, therefore I would like a different approach, namely downloading it as described above.
If I download a 30 MB XML file, I can parse it locally with XMLReader in only 3 seconds (250k iterations), but when I try to do the same from the external server, limiting it to 50k iterations, it takes 15 seconds to read even that small part, so it seems it would not be possible to parse it directly from that server.
Possible solutions
I think it's best to use cURL, but then again, perhaps fopen(), fsockopen(), copy() or file_get_contents() are the way to go. I'm looking for advice on which functions to use to make this happen, or for different solutions on how I can parse a 50 MB external XML file into a MySQL database.
I suspect a cron job every hour would be the best solution, but I am not sure how well that is supported by web hosting companies, and I have no clue how to set up something like that. But if that's the best solution, and the majority thinks so, I will have to do my research in that area too.
If a Java applet or JavaScript running in the background would be a better solution, please point me in the right direction when it comes to functions/methods/libraries there as well.
Summary
What's the best solution for downloading parts of a file in the background, and resuming the download each time my website is loaded, until it's completed?
If the above solution would be moronic to even try, what language/software would you use to achieve the same thing (download a large file every hour)?
Thanks in advance for all answers, and sorry for the long story/question.
Edit: I ended up using this solution to get the files, with a cron job scheduling a PHP script. It checks my folder for which files I already have, generates a list of the possible downloads for the last four days, then downloads the next XML file in line.
<?php
$date = new DateTime();
$current_time = $date->getTimestamp();
$four_days_ago = $current_time - 345600;

echo 'Downloading: ' . "\n";
for ($i = $four_days_ago; $i <= $current_time; ) {
    $date->setTimestamp($i);
    if ($date->format('H') !== '00') {
        $temp_filename = $date->format('Y_m_d_H') . "_full.xml";
        if (!glob($temp_filename)) {
            $temp_url = 'http://www.external-site-example.com/' . $date->format('Y/m/d/H') . ".xml";
            echo $temp_filename . ' --- ' . $temp_url . '<br>' . "\n";
            break; // with a break here, this loop will only return the next file you should download
        }
    }
    $i += 3600;
}

set_time_limit(300);
$Start = getTime();
$objInputStream = fopen($temp_url, "rb");
$objTempStream = fopen($temp_filename, "w+b");
stream_copy_to_stream($objInputStream, $objTempStream, (1024 * 200000));
$End = getTime();
echo '<br>It took ' . number_format(($End - $Start), 2) . ' secs to download "' . $temp_filename . '".';

function getTime() {
    $a = explode(' ', microtime());
    return (double) $a[0] + $a[1];
}
?>
Edit 2: I just wanted to inform you that there is a way to do what I asked, only it wouldn't work in my case. With the amount of data I need, the website would have to have 400+ visitors an hour for it to work properly. But with smaller amounts of data there are some options: http://www.google.no/search?q=poormanscron
You need to have a scheduled, offline task (e.g., cronjob). The solution you are pursuing is just plain wrong.
The simplest thing that could possibly work is a php script you run every hour (scheduled via cron, most likely) that downloads the file and processes it.
You could try fopen:
<?php
$handle = fopen("http://www.example.com/test.xml", "rb");
$contents = stream_get_contents($handle);
fclose($handle);
?>
I'm experimenting with the Twitter streaming API.
I use Phirehose to connect to Twitter and fetch the data, but I'm having problems storing it in files for further processing.
Basically, what I want to do is create a file named
date("YmdH") . ".txt"
for every hour of connection.
Here is what my code looks like right now (not handling the hourly change of files):
public function enqueueStatus($status)
{
    $data = json_decode($status, true);
    if (isset($data['text']) /* more conditions here */) {
        $fp = fopen("/tmp/$time.txt", "w");
        fwrite($fp, $status);
        fclose($fp);
    }
}
Help is as always much appreciated :)
You want the 'append' mode in fopen - this will either append to a file or create it.
if (isset($data['text']) /* more conditions here */) {
    $fp = fopen("/tmp/" . date("YmdH") . ".txt", "a");
    fwrite($fp, $status);
    fclose($fp);
}
From the Phirehose googlecode wiki:
As of Phirehose version 0.2.2 there is an example of a simple "ghetto queue" included in the tarball (see file: ghetto-queue-collect.php and ghetto-queue-consume.php) that shows how statuses could be easily collected on to the filesystem for processing and then picked up by a separate process (consume).
This is a complete working sample of doing what you want to do, and the rotation time interval is configurable. Additionally, there's another script to consume and process the written files.
Now if only I could find a way to stop the whole script; my log keeps filling up (the script continues executing) even if I close the browser tab :P