Checking for partial duplications in a CSV in PHP

I'm having an issue with a memory leak in this code. What I'm attempting to do is to temporarily upload a rather large CSV file (at least 12k records), and check each record for a partial duplication against other records in the CSV file. The reason why I say "partial duplication" is because basically if most of the record matches (at least 30 fields), it is going to be a duplicate record. The code I've written should, in theory, work as intended, but of course, it's a rather large loop and is exhausting memory. This is happening on the line that contains "array_intersect".
This is not for something I'm getting paid to do, but it is with the purpose of helping make life at work easier. I'm a data entry employee, and we are having to look at duplicate entries manually right now, which is asinine, so I'm trying to help out by making a small program for this.
Thank you so much in advance!
if (isset($_POST["submit"])) {
    if (isset($_FILES["sheetupload"])) {
        $fh = fopen(basename($_FILES["sheetupload"]["name"]), "r+");
        $lines = array();
        $records = array();
        $counter = 0;
        while (($row = fgetcsv($fh, 8192)) !== FALSE) {
            $lines[] = $row;
        }
        foreach ($lines as $line) {
            if (!in_array($line, $records)) {
                if (count($records) > 0) {
                    // check array against records for dupes
                    foreach ($records as $record) {
                        if (count(array_intersect($line, $record)) > 30) {
                            $dupes[] = $line;
                            $counter++;
                        }
                        else {
                            $records[] = $line;
                        }
                    }
                }
                else {
                    $records[] = $line;
                }
            }
            else {
                $counter++;
            }
        }
        if ($counter < 1) {
            echo $counter." duplicate records found. New file not created.";
        }
        else {
            echo $counter." duplicate records found. New file created as NEWSHEET.csv.";
            $fp = fopen('NEWSHEET.csv', 'w');
            foreach ($records as $line) {
                fputcsv($fp, $line);
            }
        }
    }
}

A couple of possibilities, assuming the script is reaching the memory limit or timing out. If you can access the php.ini file, try increasing the memory_limit and the max_execution_time.
If you can't access the server settings, try adding these to the top of your script:
ini_set('memory_limit','256M'); // change this number as necessary
set_time_limit(0); // so script does not time out
If altering these settings in the script is not possible, you might try using unset() in a few spots to free up memory:
// after the first while loop
unset($fh, $row);
and
//at end of each foreach loop
unset($line);
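If raising the limits still isn't enough, a lower-memory variation of the same idea is to skip the intermediate $lines array and compare each row as it is read. This is only a rough sketch, not a drop-in fix: it reads the upload via its tmp_name instead of its original name, and keeps the question's 30-field array_intersect test.

if (isset($_POST["submit"]) && isset($_FILES["sheetupload"])) {
    // Read the uploaded temp file directly instead of buffering every row in $lines first
    $fh = fopen($_FILES["sheetupload"]["tmp_name"], "r");
    $records = array();
    $counter = 0;
    while (($row = fgetcsv($fh, 8192)) !== FALSE) {
        $isDupe = false;
        foreach ($records as $record) {
            // 30+ matching fields counts as a duplicate, as in the question
            if (count(array_intersect($row, $record)) > 30) {
                $isDupe = true;
                break;
            }
        }
        if ($isDupe) {
            $counter++;        // only the count of dupes is kept
        } else {
            $records[] = $row; // unique rows are the only data held in memory
        }
    }
    fclose($fh);
    // $records now holds the de-duplicated rows and can be written to NEWSHEET.csv as before
}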

Related

Problem getting last chunk in a while loop

So it's been a while since I did any PHP and, to be honest, this question feels kind of dumb. But my head is just stuck on how to get the last chunk in a file.
My while loop reads a file line by line, and after every 10 lines it should execute some code. The problem occurs when there are, say, 51 lines. How do I reach the last line?
The file is over 300 MB, so I cannot load it into memory (as an array).
while ($row = fgets($handle))
{
    $chunk[] = array_combine($feed_product_arraykeys, explode("\t", $row));
    if (count($chunk) == 10)
    {
        echo count($chunk) . '<br>';
        // Initiate code
        unset($chunk);
    }
}
Best Regards
Here's an alternate way. Just read the file into an array with file() and split it into chunks of 10; the remainder ends up in the last chunk:
foreach (array_chunk(file('/path/to/file'), 10) as $chunk) {
    foreach ($chunk as $row) {
        $rows[] = array_combine($feed_product_arraykeys, explode("\t", $row));
    }
    echo count($chunk) . '<br>';
    // Initiate code
    unset($rows);
}
So I actually fixed it by counting the number of rows in the file. I thought it would be slow, but it's actually fast, even on a 300 MB file with 130k rows.
// Count number of lines in feed
$feed_row_count = count_lines_in_file("tmp/56.csv");
$row_counter = 0;
$feed_handle = fopen("tmp/56.csv", "r");
while ($row = fgets($feed_handle))
{
    $row_counter++;
    $chunk[] = array_combine($feed_product_arraykeys, explode("\t", $row));
    if (count($chunk) == 25 || $feed_row_count == $row_counter)
    {
        echo count($chunk) . '<br>';
        // Initiate SQL
        unset($chunk);
    }
}
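For what it's worth, another common way to handle the leftover rows is simply to flush whatever is still in the chunk once the loop finishes, which avoids the extra pass over the file to count lines. A minimal sketch, assuming the same $feed_product_arraykeys mapping as above:

$feed_handle = fopen("tmp/56.csv", "r");
$chunk = array();
while ($row = fgets($feed_handle)) {
    $chunk[] = array_combine($feed_product_arraykeys, explode("\t", $row));
    if (count($chunk) == 25) {
        // Initiate SQL for a full chunk of 25 rows
        $chunk = array();
    }
}
// Whatever is left over (1-24 rows) is handled here, after the loop
if (!empty($chunk)) {
    // Initiate SQL for the final, possibly partial, chunk
}
fclose($feed_handle);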

Trouble reading huge CSV file with php fgetcsv - understanding memory consumption

Good morning,
I'm going through some hard lessons while trying to handle huge CSV files of up to 4 GB.
The goal is to search for items in a CSV file (an Amazon datafeed) by a given browse node and also by some given item IDs (ASINs), to get a mix of existing items (already in my database) plus some additional new items, since from time to time items disappear from the marketplace. I also filter on the item titles, because many items use the same one.
I have been reading lots of tips here and finally decided to use PHP's fgetcsv(), thinking this function would not exhaust memory since it reads the file line by line.
But no matter what I try, I'm always running out of memory.
I cannot understand why my code uses so much memory.
I set the memory limit to 4096 MB and the time limit to 0. The server has 64 GB of RAM and two SSD hard disks.
Can someone please look at my piece of code, explain how it is possible that I'm running out of memory, and more importantly how the memory is being used?
private function performSearchByASINs()
{
    $found = 0;
    $needed = 0;
    $minimum = 84;
    if (is_array($this->searchASINs) && !empty($this->searchASINs))
    {
        $needed = count($this->searchASINs);
    }
    if ($this->searchFeed == NULL || $this->searchFeed == '')
    {
        return false;
    }
    $csv = fopen($this->searchFeed, 'r');
    if ($csv)
    {
        $l = 0;
        $title_array = array();
        while (($line = fgetcsv($csv, 0, ',', '"')) !== false)
        {
            $header = array();
            if (trim($line[6]) != '')
            {
                if ($l == 0)
                {
                    $header = $line;
                }
                else
                {
                    $asin = $line[0];
                    $title = $this->prepTitleDesc($line[6]);
                    if (is_array($this->searchASINs)
                        && !empty($this->searchASINs)
                        && in_array($asin, $this->searchASINs)) // search for existing items to get them updated
                    {
                        $add = true;
                        if (in_array($title, $title_array))
                        {
                            $add = false;
                        }
                        if ($add === true)
                        {
                            $this->itemsByASIN[$asin] = new stdClass();
                            foreach ($header as $k => $key)
                            {
                                if (isset($line[$k]))
                                {
                                    $this->itemsByASIN[$asin]->$key = trim(strip_tags($line[$k], '<br><br/><ul><li>'));
                                }
                            }
                            $title_array[] = $title;
                            $found++;
                        }
                    }
                    if (($line[20] == $this->bnid || $line[21] == $this->bnid)
                        && count($this->itemsByKey) < $minimum
                        && !isset($this->itemsByASIN[$asin])) // searching for new items
                    {
                        $add = true;
                        if (in_array($title, $title_array))
                        {
                            $add = false;
                        }
                        if ($add === true)
                        {
                            $this->itemsByKey[$asin] = new stdClass();
                            foreach ($header as $k => $key)
                            {
                                if (isset($line[$k]))
                                {
                                    $this->itemsByKey[$asin]->$key = trim(strip_tags($line[$k], '<br><br/><ul><li>'));
                                }
                            }
                            $title_array[] = $title;
                            $found++;
                        }
                    }
                }
                $l++;
                if ($l > 200000 || $found == $minimum)
                {
                    break;
                }
            }
        }
        fclose($csv);
    }
}
I know my answer is a bit late, but I had a similar problem with fgets() and things based on fgets(), like SplFileObject->current(). In my case it was on a Windows system while trying to read a file of 800+ MB. I think fgets() doesn't free the memory of the previous line in a loop, so every line that was read stayed in memory and led to a fatal out-of-memory error. I fixed it using fread($lineLength) instead, but that is a bit trickier since you must supply the length.
It is very hard to manage large data sets in arrays without running into timeout issues. Instead, why not parse this datafeed into a database table and do the heavy lifting from there?
Have you tried this? SplFileObject::fgetcsv
<?php
$file = new SplFileObject("data.csv");
while (!$file->eof()) {
    $row = $file->fgetcsv();
    // your code here
}
?>
You are running out of memory because you keep everything in variables, never call unset(), and use too many nested foreach loops. You could split that code into smaller functions.
A better solution would be to use a real database instead.
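To make the "real database" suggestion concrete, one possible approach is to bulk-load the feed into MySQL once and then do the browse-node, ASIN and title filtering with indexed queries. This is only a rough sketch with hypothetical names (a feed_items table with asin and title columns, a local PDO connection, a datafeed.csv path) that would need to be adapted to the real feed layout; columns 0 (ASIN) and 6 (title) follow the indexes used in the question's code.

// Hypothetical table and credentials; adjust to the real schema
$pdo = new PDO('mysql:host=localhost;dbname=feeds;charset=utf8', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO feed_items (asin, title) VALUES (?, ?)');

$csv = fopen('datafeed.csv', 'r');
fgetcsv($csv, 0, ',', '"'); // skip the header row
$pdo->beginTransaction();
$i = 0;
while (($line = fgetcsv($csv, 0, ',', '"')) !== false) {
    $stmt->execute(array($line[0], $line[6]));
    if (++$i % 5000 === 0) {   // commit in batches so the transaction stays small
        $pdo->commit();
        $pdo->beginTransaction();
    }
}
$pdo->commit();
fclose($csv);
// From here the searching can be done with SELECTs against indexed columns
// instead of scanning the 4 GB file in PHP on every run.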

PHP - Why is reading this csv file using so much memory, how can I improve my code?

The situation is that I need to import a fairly large CSV file (approx. half a million records, 80 MB) into a MySQL database. I know I could do this from the command line, but I need a UI so the client can do it.
Here is what I have so far:
ini_set('max_execution_time', 0);
ini_set('memory_limit', '1024M');

$field_maps = array();
foreach (Input::get() as $field => $value){
    if ('fieldmap_' == substr($field, 0, 9) && $value != 'unassigned'){
        $field_maps[str_replace('fieldmap_', null, $field)] = $value;
    }
}

$file = app_path().'/../uploads/'.$client.'_'.$job_number.'/'.Input::get('file');

$result_array = array();
$rows = 0;
$bulk_insert_count = 1000;

if (($handle = fopen($file, "r")) !== FALSE)
{
    $header = fgetcsv($handle);
    $data_map = array();
    foreach ($header as $k => $th){
        if (array_key_exists($th, $field_maps)){
            $data_map[$field_maps[$th]] = $k;
        }
    }
    $tmp_rows_count = 0;
    while (($data = fgetcsv($handle, 1000)) !== FALSE) {
        $row_array = array();
        foreach ($data_map as $column => $data_index){
            $row_array[$column] = $data[$data_index];
        }
        $result_array[] = $row_array;
        $rows++;
        $tmp_rows_count++;
        if ($tmp_rows_count == $bulk_insert_count){
            Inputs::insert($result_array);
            $result_array = array();
            if (empty($result_array)){
                echo '*************** array cleared *************';
            }
            $tmp_rows_count = 0;
        }
    }
    fclose($handle);
}
print('done');
I am currently working on a local Vagrant box. When I try to run the above locally, it processes almost all the rows of the CSV file and then dies shortly before the end (no error), but it reaches the box's memory limit of 1.5 GB.
I suspect some of what I have done in the above code is unnecessary; I thought that by building up and inserting a limited number of rows at a time I would reduce memory use, but it hasn't been enough.
I suspect this would probably work on the live server, which has more memory available, but I can't believe it should take 1.5 GB of memory to process an 80 MB file; there must be a better approach. Any help much appreciated.
I had this problem once; this solved it for me:
DB::connection()->disableQueryLog();
Info in the docs about it: http://laravel.com/docs/database#query-logging
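For what it's worth, the call only needs to run once before the inserts start, e.g. right next to the limits already set at the top of the import script (assuming this is Laravel 4.x, where query logging is enabled by default):

ini_set('max_execution_time', 0);
ini_set('memory_limit', '1024M');
// Laravel's query log keeps every executed query and its bindings in memory;
// with hundreds of 1000-row bulk inserts, that log alone grows very large.
DB::connection()->disableQueryLog();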

Comparing two csv files based on multiple columns and save in separate file

I have two files with the same format, where one has newer updates and the other has older ones. There is no particular unique ID column.
How can I extract only the newly updated lines (with Unix tools, PHP, or AWK)?
You want to byte-compare every line against the lines of the other file, so I would do:
$lines1 = file('file1.txt');
$lines2 = file('file2.txt');

$lookup = array();
foreach ($lines1 as $line) {
    $key = crc32($line);
    if (!isset($lookup[$key])) $lookup[$key] = array();
    $lookup[$key][] = $line;
}

foreach ($lines2 as $line) {
    $key = crc32($line);
    $found = false;
    if (isset($lookup[$key])) {
        foreach ($lookup[$key] as $lookupLine) {
            if (strcmp($lookupLine, $line) == 0) {
                $found = true;
                break;
            }
        }
    }
    // check if not found
    if (!$found) {
        // output to file or do something
    }
}
Note that if the files are very large this will consume quite a lot of memory and you will need some other mechanism, but the idea stays the same.
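As one example of such a mechanism: stream both files with fgets() instead of loading them with file(), and keep only the hashes of the first file's lines. The trade-off is that a crc32 collision could hide a genuinely new line, so treat this as a sketch of the idea rather than an exact equivalent:

$lookup = array();

$fh1 = fopen('file1.txt', 'r');
while (($line = fgets($fh1)) !== false) {
    $lookup[crc32($line)] = true;  // only the hash is kept per line, not the line itself
}
fclose($fh1);

$fh2 = fopen('file2.txt', 'r');
$out = fopen('new_lines.txt', 'w');
while (($line = fgets($fh2)) !== false) {
    if (!isset($lookup[crc32($line)])) {
        fwrite($out, $line);       // line exists in file2 but was not seen in file1
    }
}
fclose($fh2);
fclose($out);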

PHP counter with flock

I have a problem with a counter. I need to count two values, separated by a |, but sometimes the counter fails to increase one of them.
numeri.txt (the counter):
6122|742610
This is the PHP script:
$filename="numeri.txt";
while(!$fp=fopen($filename,'c+'))
{
usleep(100000);
}
while(!flock($fp,LOCK_EX))
{
usleep(100000);
}
$contents=fread($fp,filesize($filename));
ftruncate($fp,0);
rewind($fp);
$contents=explode("|",$contents);
$clicks=$contents[0];
$impressions=$contents[1]+1;
fwrite($fp,$clicks."|".$impressions);
flock($fp,LOCK_UN);
fclose($fp);
I have another counter that is a lot slower but counts both values (clicks and impressions) exactly. Sometimes the numeri.txt counter records more impressions than the other one. Why? How can I fix this?
We're using the following at our high-traffic site to count impressions:
<?php
$countfile = "counter.txt"; // SET THIS
$yearmonthday = date("Y.m.d");
$yearmonth = date("Y.m");

// Read the current counts
$countFileHandler = fopen($countfile, "r+");
if (!$countFileHandler) {
    die("Can't open count file");
}
if (flock($countFileHandler, LOCK_EX)) {
    while (($line = fgets($countFileHandler)) !== false) {
        list($date, $count) = explode(":", trim($line));
        $counts[$date] = $count;
    }
    $counts[$yearmonthday]++;
    $counts[$yearmonth]++;

    fseek($countFileHandler, 0);

    // Write the counts back to the file
    krsort($counts);
    foreach ($counts as $date => $count) {
        fwrite($countFileHandler, "$date:$count\n");
        fflush($countFileHandler);
    }
    flock($countFileHandler, LOCK_UN);
} else {
    echo "Couldn't acquire file lock!";
}
fclose($countFileHandler);
?>
The results are both daily and monthly totals:
2015.10.02:40513
2015.10.01:48396
2015.10:88909
Try performing a flush before unlocking. You're releasing the lock before the data may even have been written, which allows another execution to clobber it.
http://php.net/manual/en/function.fflush.php
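Applied to the script in the question, that means flushing right after the write, before the lock is released:

fwrite($fp, $clicks . "|" . $impressions);
fflush($fp);          // make sure the buffered write reaches the file first
flock($fp, LOCK_UN);  // only then release the lock
fclose($fp);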
