Trouble reading a huge CSV file with PHP fgetcsv() - understanding memory consumption

Good morning,
I'm going through some hard lessons while trying to handle huge CSV files of up to 4 GB.
The goal is to search for items in a CSV file (an Amazon datafeed) by a given browse node and also by some given item IDs (ASINs), to get a mix of existing items (already in my database) plus some additional new items, since items disappear from the marketplace from time to time. I also filter on the items' titles, because many items use the same title.
I have been reading lots of tips here and finally decided to use PHP's fgetcsv(), thinking this function would not exhaust memory since it reads the file line by line.
But no matter what I try, I always run out of memory.
I cannot understand why my code uses so much memory.
I set the memory limit to 4096 MB and the time limit to 0. The server has 64 GB of RAM and two SSD hard disks.
Could someone please look at my piece of code and explain how it is possible that I'm running out of memory, and more importantly, how the memory is being used?
private function performSearchByASINs()
{
    $found = 0;
    $needed = 0;
    $minimum = 84;
    if(is_array($this->searchASINs) && !empty($this->searchASINs))
    {
        $needed = count($this->searchASINs);
    }
    if($this->searchFeed == NULL || $this->searchFeed == '')
    {
        return false;
    }
    $csv = fopen($this->searchFeed, 'r');
    if($csv)
    {
        $l = 0;
        $title_array = array();
        while(($line = fgetcsv($csv, 0, ',', '"')) !== false)
        {
            $header = array();
            if(trim($line[6]) != '')
            {
                if($l == 0)
                {
                    $header = $line;
                }
                else
                {
                    $asin = $line[0];
                    $title = $this->prepTitleDesc($line[6]);
                    if(is_array($this->searchASINs)
                       && !empty($this->searchASINs)
                       && in_array($asin, $this->searchASINs)) //search for existing items to get them updated
                    {
                        $add = true;
                        if(in_array($title, $title_array))
                        {
                            $add = false;
                        }
                        if($add === true)
                        {
                            $this->itemsByASIN[$asin] = new stdClass();
                            foreach($header as $k => $key)
                            {
                                if(isset($line[$k]))
                                {
                                    $this->itemsByASIN[$asin]->$key = trim(strip_tags($line[$k], '<br><br/><ul><li>'));
                                }
                            }
                            $title_array[] = $title;
                            $found++;
                        }
                    }
                    if(($line[20] == $this->bnid || $line[21] == $this->bnid)
                       && count($this->itemsByKey) < $minimum
                       && !isset($this->itemsByASIN[$asin])) // searching for new items
                    {
                        $add = true;
                        if(in_array($title, $title_array))
                        {
                            $add = false;
                        }
                        if($add === true)
                        {
                            $this->itemsByKey[$asin] = new stdClass();
                            foreach($header as $k => $key)
                            {
                                if(isset($line[$k]))
                                {
                                    $this->itemsByKey[$asin]->$key = trim(strip_tags($line[$k], '<br><br/><ul><li>'));
                                }
                            }
                            $title_array[] = $title;
                            $found++;
                        }
                    }
                }
                $l++;
                if($l > 200000 || $found == $minimum)
                {
                    break;
                }
            }
        }
        fclose($csv);
    }
}

I know my answer is a bit late, but I had a similar problem with fgets() and functions based on fgets(), such as SplFileObject->current(). In my case it was on a Windows system while trying to read a 800+ MB file. I think fgets() does not free the memory of the previous line inside a loop, so every line that was read stayed in memory and led to a fatal out-of-memory error. I fixed it by using fread() with an explicit length instead, which is a bit trickier since you must supply the line length yourself.
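Roughly the kind of thing I mean (a minimal sketch, assuming a fixed chunk size of 4096 bytes and a feed whose quoted fields never contain newlines; the file name is just a placeholder):
$handle = fopen('datafeed.csv', 'r');
$buffer = '';
while (!feof($handle)) {
    // read a fixed-length chunk instead of letting fgets()/fgetcsv() buffer a whole line
    $buffer .= fread($handle, 4096);
    while (($pos = strpos($buffer, "\n")) !== false) {
        $line = substr($buffer, 0, $pos);       // one complete line
        $buffer = substr($buffer, $pos + 1);    // keep the remainder for the next pass
        $fields = str_getcsv($line);            // parse the CSV fields
        // ... process $fields here ...
    }
}
fclose($handle);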

It is very hard to manage large amounts of data in arrays without running into timeout or memory issues. Instead, why not parse this datafeed into a database table and do the heavy lifting from there?
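For example, if MySQL is available, the whole feed can be imported in one statement and then queried with normal SQL. A sketch only: the table name, column layout and file path below are placeholders, not taken from the question, and LOAD DATA LOCAL INFILE has to be enabled on both the server and the client:
$pdo = new PDO('mysql:host=localhost;dbname=feeds;charset=utf8', 'user', 'pass', array(
    PDO::MYSQL_ATTR_LOCAL_INFILE => true,   // allow LOAD DATA LOCAL INFILE from PHP
));
$pdo->exec("LOAD DATA LOCAL INFILE '/path/to/datafeed.csv'
    INTO TABLE amazon_feed
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES");
// afterwards: SELECT ... FROM amazon_feed WHERE asin IN (...) OR browsenode = ...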

Have you tried this? SplFileObject::fgetcsv
<?php
$file = new SplFileObject("data.csv");
while (!$file->eof()) {
    $row = $file->fgetcsv();
    // your code here
}
?>
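For CSV files you can also set the READ_CSV flag so that iterating the object yields parsed rows directly; a small sketch, not part of the original answer:
$file = new SplFileObject("data.csv");
$file->setFlags(SplFileObject::READ_CSV | SplFileObject::READ_AHEAD | SplFileObject::SKIP_EMPTY);
foreach ($file as $row) {
    // $row is already an array of CSV fields; only one row is held in memory at a time
}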
You are running out of memory because you keep data in variables that you never unset(), and you use too many nested foreach loops. You could also break that code up into smaller functions.
A better solution would be to use a real database instead.

Related

Checking for partial duplications in a CSV in PHP

I'm having an issue with a memory leak in this code. What I'm attempting to do is temporarily upload a rather large CSV file (at least 12k records) and check each record for partial duplication against the other records in the file. The reason I say "partial duplication" is that if most of a record matches (at least 30 fields), it is considered a duplicate. The code I've written should, in theory, work as intended, but of course it's a rather large loop and it is exhausting memory. This is happening on the line that contains array_intersect.
This is not something I'm getting paid to do, but it is meant to make life at work easier. I'm a data entry employee, and we are currently having to check for duplicate entries manually, which is asinine, so I'm trying to help out by making a small program for this.
Thank you so much in advance!
if (isset($_POST["submit"])) {
    if (isset($_FILES["sheetupload"])) {
        $fh = fopen(basename($_FILES["sheetupload"]["name"]), "r+");
        $lines = array();
        $records = array();
        $counter = 0;
        while (($row = fgetcsv($fh, 8192)) !== FALSE) {
            $lines[] = $row;
        }
        foreach ($lines as $line) {
            if (!in_array($line, $records)) {
                if (count($records) > 0) {
                    //check array against records for dupes
                    foreach ($records as $record) {
                        if (count(array_intersect($line, $record)) > 30) {
                            $dupes[] = $line;
                            $counter++;
                        }
                        else {
                            $records[] = $line;
                        }
                    }
                }
                else {
                    $records[] = $line;
                }
            }
            else {
                $counter++;
            }
        }
        if ($counter < 1) {
            echo $counter." duplicate records found. New file not created.";
        }
        else {
            echo $counter." duplicate records found. New file created as NEWSHEET.csv.";
            $fp = fopen('NEWSHEET.csv', 'w');
            foreach ($records as $line) {
                fputcsv($fp, $line);
            }
        }
    }
}
A couple of possibilities, assuming the script is reaching the memory limit or timing out. If you can access the php.ini file, try increasing the memory_limit and the max_execution_time.
If you can't access the server settings, try adding these to the top of your script:
ini_set('memory_limit','256M'); // change this number as necessary
set_time_limit(0); // so script does not time out
If altering these settings in the script is not possible, you might try using unset() in a few spots to free up memory:
// after the first while loop
unset($fh, $row);
and
//at end of each foreach loop
unset($line);

PHP - Why is reading this csv file using so much memory, how can I improve my code?

The situation is that I need to import a fairly large CSV file (approx. half a million records, 80 MB) into a MySQL database. I know I could do this from the command line, but I need a UI so the client can do it.
Here is what I have so far:
ini_set('max_execution_time', 0);
ini_set('memory_limit', '1024M');
$field_maps = array();
foreach (Input::get() as $field => $value){
    if ('fieldmap_' == substr($field, 0, 9) && $value != 'unassigned'){
        $field_maps[str_replace('fieldmap_', null, $field)] = $value;
    }
}
$file = app_path().'/../uploads/'.$client.'_'.$job_number.'/'.Input::get('file');
$result_array = array();
$rows = 0;
$bulk_insert_count = 1000;
if (($handle = fopen($file, "r")) !== FALSE)
{
    $header = fgetcsv($handle);
    $data_map = array();
    foreach ($header as $k => $th){
        if (array_key_exists($th, $field_maps)){
            $data_map[$field_maps[$th]] = $k;
        }
    }
    $tmp_rows_count = 0;
    while (($data = fgetcsv($handle, 1000)) !== FALSE) {
        $row_array = array();
        foreach ($data_map as $column => $data_index){
            $row_array[$column] = $data[$data_index];
        }
        $result_array[] = $row_array;
        $rows++;
        $tmp_rows_count++;
        if ($tmp_rows_count == $bulk_insert_count){
            Inputs::insert($result_array);
            $result_array = array();
            if (empty($result_array)){
                echo '*************** array cleared *************';
            }
            $tmp_rows_count = 0;
        }
    }
    fclose($handle);
}
print('done');
I am currently working on a local Vagrant box. When I try to run the above locally, it processes almost all of the rows of the CSV file and then dies shortly before the end (no error), having reached the box's memory limit of 1.5 GB.
I suspect some of what I have done in the above code is unnecessary. I thought that by building up and inserting a limited number of rows at a time I would reduce memory use, but it has not done enough.
I suspect this would probably work on the live server, which has more memory available, but I cannot believe that it has to take 1.5 GB of memory to process an 80 MB file; there must be a better approach. Any help much appreciated.
I had this problem once; this solved it for me:
DB::connection()->disableQueryLog();
Info in the docs about it: http://laravel.com/docs/database#query-logging
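In the code above that would mean disabling the query log before the import loop, roughly like this (a sketch; Laravel 4 keeps every executed query in memory by default, so each Inputs::insert() batch otherwise leaves a logged copy behind):
DB::connection()->disableQueryLog();   // stop Laravel from remembering every query

if (($handle = fopen($file, "r")) !== FALSE) {
    // ... the same fgetcsv() loop and Inputs::insert() batching as above ...
    fclose($handle);
}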

Comparing two CSV files based on multiple columns and saving the result in a separate file

I have two files with the same format, where one has newer updates and the other has older ones. There is no particular unique ID column.
How can I extract only the newly updated lines (with Unix, PHP, or AWK)?
You want to "byte" compare all lines against the other lines, so i would do:
$lines1 = file('file1.txt');
$lines2 = file('file2.txt');

$lookup = array();
foreach($lines1 as $line) {
    $key = crc32($line);
    if (!isset($lookup[$key])) $lookup[$key] = array();
    $lookup[$key][] = $line;
}

foreach($lines2 as $line) {
    $key = crc32($line);
    $found = false;
    if (isset($lookup[$key])) {
        foreach($lookup[$key] as $lookupLine) {
            if (strcmp($lookupLine, $line) == 0) {
                $found = true;
                break;
            }
        }
    }
    // check if not found
    if (!$found) {
        // output to file or do something
    }
}
Note that if the files are very large this will consume quite a lot of memory and you will need some other mechanism, but the idea stays the same.
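One way to cut that down, assuming the first file's lookup table still fits in memory, is to keep the crc32 index for file1 but stream file2 line by line instead of loading it with file(); the output file name here is just a placeholder:
$lookup = array();
foreach (file('file1.txt') as $line) {
    $lookup[crc32($line)][] = $line;    // index file1 by checksum, as above
}

$fh  = fopen('file2.txt', 'r');
$out = fopen('new_lines.txt', 'w');
while (($line = fgets($fh)) !== false) {
    $key = crc32($line);
    $candidates = isset($lookup[$key]) ? $lookup[$key] : array();
    $found = false;
    foreach ($candidates as $lookupLine) {
        if (strcmp($lookupLine, $line) == 0) {
            $found = true;
            break;
        }
    }
    if (!$found) {
        fwrite($out, $line);            // the line only exists in file2
    }
}
fclose($fh);
fclose($out);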

Memory leak in PHP with three for loops

My script is a spider that checks whether a page is a "links page" or an "information page".
If the page is a "links page", it continues in a recursive manner (or a tree, if you will) until it finds an "information page".
I tried to make the script recursive and it was easy, but I kept getting this error:
Fatal error: Allowed memory size of 33554432 bytes exhausted (tried to
allocate 39 bytes) in /srv/www/loko/simple_html_dom.php on line 1316
I was told I would have to use the for-loop method because the script won't free memory no matter whether I use unset(), and I only have three levels I need to loop through, so it makes sense. But after I changed the script the error occurs again. Maybe I can free memory now?
Something needs to die here, please help me destruct someone!
set_time_limit(0);
ini_set('memory_limit', '256M');
require("simple_html_dom.php");

$thelink = "http://www.somelink.com";
$html1 = file_get_html($thelink);
$ret1 = $html1->find('#idTabResults2');
// first inception level, we know page has only links
if (!$ret1){
    $es1 = $html1->find('table.litab a');
    //unset($html1);
    $countlinks1 = 0;
    foreach ($es1 as $aa1) {
        $links1[$countlinks1] = $aa1->href;
        $countlinks1++;
    }
    //unset($es1);
    //for every link in array do the same
    for ($i = 0; $i < $countlinks1; $i++) {
        $html2 = file_get_html($links1[$i]);
        $ret2 = $html2->find('#idTabResults2');
        // if got information then send to DB
        if ($ret2){
            pullInfo($html2);
            //unset($html2);
        } else {
            // continue inception
            $es2 = $html2->find('table.litab a');
            $html2 = null;
            $countlinks2 = 0;
            foreach ($es2 as $aa2) {
                $links2[$countlinks2] = $aa2->href;
                $countlinks2++;
            }
            //unset($es2);
            for ($j = 0; $j < $countlinks2; $j++) {
                $html3 = file_get_html($links2[$j]);
                $ret3 = $html3->find('#idTabResults2');
                // if got information then send to DB
                if ($ret3){
                    pullInfo($html3);
                } else {
                    // inception level three
                    $es3 = $html3->find('table.litab a');
                    $html3 = null;
                    $countlinks3 = 0;
                    foreach ($es3 as $aa3) {
                        $links3[$countlinks3] = $aa3->href;
                        $countlinks3++;
                    }
                    for ($k = 0; $k < $countlinks3; $k++) {
                        echo memory_get_usage();
                        echo "\n";
                        $html4 = file_get_html($links3[$k]);
                        $ret4 = $html4->find('#idTabResults2');
                        // if got information then send to DB
                        if ($ret4){
                            pullInfo($html4);
                        }
                        unset($html4);
                    }
                    unset($html3);
                }
            }
        }
    }
}

function pullInfo($html)
{
    $tds = $html->find('td');
    $count = 0;
    foreach ($tds as $td) {
        $count++;
        if ($count==1){
            $name = html_entity_decode($td->innertext);
        }
        if ($count==2){
            $address = addslashes(html_entity_decode($td->innertext));
        }
        if ($count==3){
            $number = addslashes(preg_replace('/(\d+) - (\d+)/i', '$2$1', $td->innertext));
        }
    }
    unset($tds, $td);
    $name = mysql_real_escape_string($name);
    $address = mysql_real_escape_string($address);
    $number = mysql_real_escape_string($number);
    $inAlready = mysql_query("SELECT * FROM people WHERE phone=$number");
    while($e = mysql_fetch_assoc($inAlready))
        $output[] = $e;
    if (json_encode($output) != "null"){
        //print(json_encode($output));
    } else {
        mysql_query("INSERT INTO people (name, area, phone)
            VALUES ('$name', '$address', '$number')");
    }
}
And here is a picture of the growth in memory size:
I modified the code a little bit to free as much memory as possible.
I've added a comment above each modification. The added comments start with "#" so you can find them more easily.
This is not related to the question, but it is worth mentioning that your database insertion code is vulnerable to SQL injection.
<?php
require("simple_html_dom.php");

$thelink = "http://www.somelink.co.uk";
# do not keep the raw contents of the file in memory
#$data1 = file_get_contents($thelink);
#$html1 = str_get_html($data1);
$html1 = str_get_html(file_get_contents($thelink));
$ret1 = $html1->find('#idResults2');
// first inception level, we know page has only links
if (!$ret1){
    $es1 = $html1->find('table.litab a');
    # free $html1, not used anymore
    unset($html1);
    $countlinks1 = 0;
    foreach ($es1 as $aa1) {
        $links1[$countlinks1] = $aa1->href;
        $countlinks1++;
        // echo (addslashes($aa->href));
    }
    # free memory used by the $es1 value, not used anymore
    unset($es1);
    //for every link in array do the same
    for ($i = 0; $i < $countlinks1; $i++) {
        # do not keep the raw contents of the file in memory
        #$data2 = file_get_contents($links1[$i]);
        #$html2 = str_get_html($data2);
        $html2 = str_get_html(file_get_contents($links1[$i]));
        $ret2 = $html2->find('#idResults2');
        // if got information then send to DB
        if ($ret2){
            pullInfo($html2);
        } else {
            // continue inception
            $es2 = $html2->find('table.litab a');
            # free memory used by $html2, not used anymore.
            # we would unset it at the end of the loop.
            $html2 = null;
            $countlinks2 = 0;
            foreach ($es2 as $aa2) {
                $links2[$countlinks2] = $aa2->href;
                $countlinks2++;
            }
            # free memory used by $es2
            unset($es2);
            for ($j = 0; $j < $countlinks2; $j++) {
                # do not keep the raw contents of the file in memory
                #$data3 = file_get_contents($links2[$j]);
                #$html3 = str_get_html($data3);
                $html3 = str_get_html(file_get_contents($links2[$j]));
                $ret3 = $html3->find('#idResults2');
                // if got information then send to DB
                if ($ret3){
                    pullInfo($html3);
                }
                # free memory used by $html3, or on the last iteration the memory would not get freed
                unset($html3);
            }
        }
        # free memory used by $html2, or on the last iteration the memory would not get freed
        unset($html2);
    }
}

function pullInfo($html)
{
    $tds = $html->find('td');
    $count = 0;
    foreach ($tds as $td) {
        $count++;
        if ($count == 1){
            $name = addslashes($td->innertext);
        }
        if ($count == 2){
            $address = addslashes($td->innertext);
        }
        if ($count == 3){
            $number = addslashes(preg_replace('/(\d+) - (\d+)/i', '$2$1', $td->innertext));
        }
    }
    # check for available data:
    if ($count) {
        # free $tds and $td
        unset($tds, $td);
        mysql_query("INSERT INTO people (name, area, phone)
            VALUES ('$name', '$address', '$number')");
    }
}
Update:
You can trace your memory usage to see how much memory is being used in each section of your code. This can be done with memory_get_usage() calls, saving the result to a file, for example by placing the code below at the end of each of your loops, or before creating objects or calling heavy methods:
file_put_contents('memory.log', 'memory used in line ' . __LINE__ . ' is: ' . memory_get_usage() . PHP_EOL, FILE_APPEND);
This way you can trace the memory usage of each part of your code.
In the end, remember that all this tracing and optimization might not be enough, since your application might really need more memory than 32 MB. I've developed a system that analyzes several data sources and detects spammers, then blocks their SMTP connections, and since the number of connected users is sometimes over 30,000, after a lot of code optimization I had to increase the PHP memory limit to 768 MB on the server, which is not a common thing to do.
If your operation requires more memory and your server has it available, you can call ini_set('memory_limit', '128M'); or something similar (depending on your memory requirements) to increase the amount of memory available to the script.
This does not mean you should not optimise and refactor your code :-) this is just one part of it.
The solution was to use simple_html_dom's clear() method, e.g. $html4->clear(), to free the memory held by a DOM object when you are finished with it.
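A minimal sketch of how that looks inside one of the loops from the question (the clear() call is the important part; the rest mirrors the original code):
for ($k = 0; $k < $countlinks3; $k++) {
    $html4 = file_get_html($links3[$k]);
    $ret4 = $html4->find('#idTabResults2');
    if ($ret4) {
        pullInfo($html4);
    }
    $html4->clear();   // release the DOM tree held internally by simple_html_dom
    unset($html4);     // then drop the wrapper object itself
}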
Firstly, let's turn this into a truly recursive function, which should make it easier to modify the whole chain of events:
function findInfo($thelink)
{
    $data = file_get_contents($thelink); // Might want to make sure that it's a valid link, i.e. that file_get_contents() actually returned something, before trying to run further with it.
    $html = str_get_html($data);
    unset($data); // Finished using it, no reason to keep it around.
    $ret = $html->find('#idResults2');
    if ($ret)
    {
        pullInfo($html);
        return true; // Should stop once it finds it, right?
    }
    else
    {
        $es = $html->find('table.litab a'); // Might want a little error checking here to make sure it actually found links.
        unset($html); // Finished using it, no reason to keep it around.
        $countlinks = 0;
        foreach ($es as $aa)
        {
            $links[$countlinks] = $aa->href;
            $countlinks++;
        }
        unset($es); // Finished using it, no reason to keep it around.
        for ($i = 0; $i < $countlinks; $i++)
        {
            $result = findInfo($links[$i]);
            if ($result === true)
            {
                return true; // To break out of the recursion once a lower call returns true
            }
            else
            {
                unset($links[$i]); // Finished using it, no reason to keep it around.
                continue;
            }
        }
    }
    return false; // Will return false if all else failed; it should hit a return true before this point if it successfully finds an info page.
}
See if that helps at all with the cleanups. You will probably still run out of memory, but at least you won't be holding onto the full HTML of every scanned page with this.
Oh, and if you only want it to go so deep, change the function declaration to something like:
function findInfo($thelink, $depth = 1, $maxdepth = 3)
Then when calling the function within the function, call it like so:
findInfo($links[$i], $depth + 1, $maxdepth); // pass maxdepth through so you can override it in the initial call, e.g. findInfo($thelink, 1, 4)
and then do a check on depth vs. maxdepth at the start of the function and have it return false if $depth is greater than $maxdepth.
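The guard at the top of the function would then look something like this (a sketch using the parameter names from the declaration above):
function findInfo($thelink, $depth = 1, $maxdepth = 3)
{
    // stop recursing once the maximum depth has been reached
    if ($depth > $maxdepth) {
        return false;
    }
    // ... rest of the function as above, with the recursive call made as
    // findInfo($links[$i], $depth + 1, $maxdepth);
}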
If memory usage is your primary concern, you may want to consider using a SAX-based parser. Coding using a SAX parser can be a bit more complicated, but it's not necessary to keep the entire document in memory.

PHP: scandir() is too slow

I have to make a function that lists all the subfolders inside a folder. I have a filter to skip files, but the function uses scandir() for the listing, and that makes the application very slow. Is there an alternative to scandir(), perhaps even a non-native PHP function?
Thanks in advance!
You can use readdir which may be faster, something like this:
function readDirectory($Directory, $Recursive = true)
{
    if (is_dir($Directory) === false)
    {
        return false;
    }
    try
    {
        $Resource = opendir($Directory);
        $Found = array();
        while (false !== ($Item = readdir($Resource)))
        {
            if ($Item == "." || $Item == "..")
            {
                continue;
            }
            // Build the full path so is_dir() works regardless of the current working directory
            $Path = rtrim($Directory, '/') . '/' . $Item;
            if ($Recursive === true && is_dir($Path))
            {
                $Found[] = readDirectory($Path);
            }
            else
            {
                $Found[] = $Path;
            }
        }
        closedir($Resource);
    }
    catch (Exception $e)
    {
        return false;
    }
    return $Found;
}
May require some tweaking, but this is essentially what scandir() does, and it should be faster. If not, please write an update, as I would like to see whether I can come up with a faster solution.
Another issue is that if you're reading a very large directory you're filling up an array in memory, and that may be where your memory is going.
You could try to create a function that reads in offsets so that you can return 50 files at a time!
Reading chunks of files at a time would be just as simple to use; it would look like this:
$offset = 0;
while (false !== ($Batch = ReadFilesByOffset("/tmp", $offset)))
{
    // Use $Batch here, which contains 50 or fewer files!
    // Increment the offset:
    $offset += 50;
}
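ReadFilesByOffset() is not shown here; a rough sketch of what such a helper could look like (the batch size of 50, the false-on-empty return and the function name itself are assumptions made to match the loop above; note it rescans the directory from the start on every call, which is simple but not optimal):
function ReadFilesByOffset($Directory, $offset, $limit = 50)
{
    $Found = array();
    $index = 0;
    $Resource = opendir($Directory);
    while (false !== ($Item = readdir($Resource))) {
        if ($Item == "." || $Item == "..") {
            continue;
        }
        if ($index >= $offset && count($Found) < $limit) {
            $Found[] = rtrim($Directory, '/') . '/' . $Item;
        }
        $index++;
        if (count($Found) == $limit) {
            break;
        }
    }
    closedir($Resource);
    return empty($Found) ? false : $Found;   // false ends the caller's while loop
}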
Don't write your own. PHP has a Recursive Directory Iterator built specifically for this:
http://php.net/manual/en/class.recursivedirectoryiterator.php
As a rule of thumb (i.e. not 100% of the time), since it's implemented in straight C, anything you build in PHP is going to be slower.
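For the subfolder listing in the question, that could look roughly like this (the directory path is a placeholder; SKIP_DOTS plus the isDir() check take care of the "no files" filter):
$iterator = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator('/path/to/folder', FilesystemIterator::SKIP_DOTS),
    RecursiveIteratorIterator::SELF_FIRST
);

$folders = array();
foreach ($iterator as $item) {
    if ($item->isDir()) {                    // keep directories only, skip files
        $folders[] = $item->getPathname();
    }
}
print_r($folders);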
