I have a text file (essentially a CSV without the extension) that has 150,000 lines in it. I need to remove duplicates by key and then insert the rows into the database. I'm attempting fgetcsv to read it line by line, but I don't want to do 150,000 queries. This is what I came up with so far (keep in mind I'm using Laravel):
$count = 0;
$insert = [];
if (($handle = fopen("myHUGEfile.txt", "r")) !== FALSE) {
    while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
        $count++;
        // See if this is the top row, which in this case holds the column headers
        if ($count == 1) continue;
        // Get the parts needed for the new part
        $quantity = $data[0];
        $part_number = $data[1];
        $manufacturer = $data[2];
        $new_part = [
            'manufacturer' => $manufacturer,
            'part_number' => $part_number,
            'stock' => $quantity,
            'price' => '[]',
            'approved' => 0,
        ];
        $insert[] = $new_part;
    }
    fclose($handle);
} else {
    throw new Exception('Could not open file for reading.');
}
// Remove duplicates
$newRows = [];
$parsedCount = 0;
foreach ($insert as $row) {
    $x = 0;
    foreach ($newRows as $n) {
        if (strtoupper($row['part_number']) === strtoupper($n['part_number'])) {
            $x++;
        }
    }
    if ($x == 0) {
        $parsedCount++;
        $newRows[] = $row;
    }
}
$parsed_rows = array_chunk($newRows, 1000, true);
$x = 0;
foreach ($parsed_rows as $chunk) {
    // Insert
    if (count($chunk) > 0 && DB::table('search_parts')->insert($chunk)) {
        $x++;
    }
}
echo $x . " chunks inserted.<br/>" . $count . " parts started with<br/>" . $parsedCount . " rows after duplicates removed.";
But it's very clunky. I have only tested it with a little over 1,000 rows on localhost, and it works. But I'm afraid that if I push it to production it won't be able to handle all 150,000 rows. The file is about 4 MB.
Can someone show me a better, more efficient way to do this?
Right now, you're keeping the first duplicate record. If you're OK with keeping the last dupe, you can just change

$insert[] = $new_part;

to

$insert[strtoupper($part_number)] = $new_part;

That way, your $insert array will only ever hold one value per $part_number. Your inserts will be a little slower, but you can drop all of the code that checks for duplicates, which looks very, very slow (it compares every row against every already-kept row, so it's quadratic in the row count).
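A minimal sketch of how the loop body changes, using only the question's own variables:

// build $new_part exactly as before, then key by the normalized part number;
// later rows with the same part number simply overwrite earlier ones
$insert[strtoupper($part_number)] = $new_part;

// the whole duplicate-checking block disappears; array_chunk() reindexes by default
$parsed_rows = array_chunk($insert, 1000);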
4 MB is not remotely a "huge" file. I'd just read the whole thing into an associative array keyed by part number, which inherently de-dupes, keeping the last row whenever a duplicate is encountered. Something like this, maybe:
$parts = [];
$lines = explode("\n", file_get_contents('file'));
array_shift($lines); // drop the header row
foreach ($lines as $line) {
    if ($line === '') continue; // skip the trailing blank line, if any
    $part = str_getcsv($line);
    $parts[strtoupper($part[1])] = [
        'manufacturer' => $part[2],
        'part_number' => $part[1],
        'stock' => $part[0],
        'price' => '[]',
        'approved' => 0,
    ];
}
// $parts now contains a unique part list
foreach ($parts as $part) {
    $db->insert($part);
}
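If you'd rather keep the bulk inserts from the question (a sketch assuming the same Laravel setup and search_parts table), you can chunk the de-duped list instead of inserting row by row:

foreach (array_chunk($parts, 1000) as $chunk) {
    // each chunk is a plain list of rows, which Laravel's insert() accepts
    DB::table('search_parts')->insert($chunk);
}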
If you don't want duplicates on one or more keys, you can make it easy on yourself and just add a UNIQUE INDEX on the key(s) you don't want duplicated in the table.
That way, all you have to worry about is processing the file: when an insert hits a duplicate key it is rejected, and (with an ignore-style insert) the import simply continues with the next row.
It would also make things easier in the future, because you wouldn't have to modify your code to check additional columns; just modify the index.
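A hedged sketch of what that looks like in Laravel (the schema builder API is standard; insertOrIgnore() assumes Laravel 5.8 or newer, and the table/column names are the question's):

use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Schema;

// one-time migration: enforce uniqueness at the database level
Schema::table('search_parts', function (Blueprint $table) {
    $table->unique('part_number');
});

// import: rows that collide with the unique index are silently skipped
foreach (array_chunk($newRows, 1000) as $chunk) {
    DB::table('search_parts')->insertOrIgnore($chunk);
}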
Related
I'm struggling with GFAPI's function submit_form() when it's used in loops. For unknown reason it often merges data from other loops into one, the outcome of this situation is that only the very first entry is added in a proper way, the rest seems to be empty.
I can't use other function although I've tried - and it worked (add_entry() for example). I need to use QR generation and attach them to the notification, and these codes are generated when the form is submitted.
CSV file has at least 3 columns: email, full_name and phone_number. Here's my code:
function generateData($filename, $date, $id) {
    $rows = array_map('str_getcsv', file($filename));
    $header = array_shift($rows);
    $csv = array();
    foreach ($rows as $row) {
        $csv[] = array_combine($header, $row);
    }
    $count_array = count($csv);
    for ($i = 0; $i < $count_array; $i++) {
        foreach ($csv[$i] as $key => $value) {
            // Delete the columns that we don't need
            if ($key != 'email' && $key != 'full_name' && $key != 'phone_number') {
                unset($csv[$i][$key]);
            }
        }
    }
    insertGFAPI($csv);
}
function insertGFAPI($entries) {
    $count_entries = count($entries);
    for ($i = 0; $i < $count_entries; $i++) {
        $data[$i]['input_1'] = $entries[$i]['email'];
        $data[$i]['input_2'] = $entries[$i]['full_name'];
        $data[$i]['input_3'] = $entries[$i]['phone_number'];
        $result = GFAPI::submit_form(get_option('form-id'), $data[$i]);
    }
}
The outcome that I'd like to get is pretty simple: I want to know why and how it is possible that submit_form() merges data from other iterations, and how I can prevent it.
Do you know what I can do about that?
Solved. It was necessary to empty the $_POST array between submissions.
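For anyone who hits the same thing, a minimal sketch of that fix inside the loop (submit_form() populates and reads $_POST, so leftovers from one submission can bleed into the next; this is the asker's fix, not an official API requirement):

for ($i = 0; $i < $count_entries; $i++) {
    $data = array(
        'input_1' => $entries[$i]['email'],
        'input_2' => $entries[$i]['full_name'],
        'input_3' => $entries[$i]['phone_number'],
    );
    $result = GFAPI::submit_form(get_option('form-id'), $data);
    $_POST = array(); // clear leftovers so the next submit_form() starts clean
}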
I have a script that parses a CSV with a million rows into an array.
I want to batch this with a cron job: for example, every 100,000 rows I want to pause the script and then continue it again, to prevent memory leaks etc.
My script currently looks like this (what it does isn't really relevant; the question is how to loop through it in batches from a cron job).
Can I just make a cron job that calls this script every 5 minutes and remembers where the foreach loop was paused?
$csv = file_get_contents(CSV);
$array = array_map("str_getcsv", explode("\n", $csv));
$headers = $array[0];
$number_of_records = count($array);
for ($i = 1; $i < $number_of_records; $i++) {
    $params['body'][] = [
        'index' => [
            '_index' => INDEX,
            '_type' => TYPE,
            '_id' => $i
        ]
    ];
    // Set the right keys based on the header row
    foreach ($array[$i] as $key => $value) {
        $array[$i][$headers[$key]] = $value;
        unset($array[$i][$key]);
    }
    // Loop fields
    $params['body'][] = [
        'Inrijdtijd' => $array[$i]['Inrijdtijd'],
        'Uitrijdtijd' => $array[$i]['Uitrijdtijd'],
        'Parkeerduur' => $array[$i]['Parkeerduur'],
        'Betaald' => $array[$i]['Betaald'],
        'bedrag' => $array[$i]['bedrag']
    ];
    // Every 100,000 documents, stop and send the bulk request
    if ($i % 100000 == 0) {
        $responses = $client->bulk($params);
        // Erase the old bulk request
        $params = ['body' => []];
        // Unset the bulk response when you are done, to save memory
        unset($responses);
    }
}
// Send the last batch if it exists
if (!empty($params['body'])) {
    $responses = $client->bulk($params);
}
In the given code, the script will always process from the beginning, since no pointer of any sort is kept.
My suggestion would be to split the CSV file into pieces and let another script parse the pieces one by one (e.g. every 5 minutes), deleting each piece afterwards:
$count = 0; // batch counter; the original snippet never initialized this
$fp = fopen(CSV, 'r');
$head = rtrim(fgets($fp), "\n"); // fgets() keeps the newline, so strip it
$output = [$head];
while (!feof($fp)) {
    $line = rtrim(fgets($fp), "\n");
    if ($line === '') continue; // skip the trailing blank line at EOF
    $output[] = $line;
    if (count($output) == 10000) {
        file_put_contents('batches/batch-' . $count . '.csv', implode("\n", $output));
        $count++;
        $output = [$head]; // start the next batch with the header row
    }
}
fclose($fp);
if (count($output) > 1) {
    file_put_contents('batches/batch-' . $count . '.csv', implode("\n", $output));
}
Now the original script can process one batch file per run:
$files = array_diff(scandir('batches/'), ['.', '..']);
if (count($files) > 0) {
    $file = 'batches/' . $files[0];
    // PROCESS FILE
    unlink($file);
}
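To wire it up, a crontab entry along these lines runs the processor every 5 minutes (the script path is a placeholder, not something from the question):

# process one batch file every 5 minutes
*/5 * * * * php /path/to/process_batch.php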
I am trying to fill my Excel sheet with the data I filtered through the methods I have made. For now I am getting a sheet, but only one row is filled, not the others; it's not getting the data I provide through my object.
I am trying to make my sheet similar to this sheet.
I am trying to write the code in this part:
public function export($Sets, $disp_filter)
{
    $objPHPExcel = new PHPExcel();
    $objPHPExcel->getProperties()->setTitle("Offic excel Test Document");
    $styleArray = array(
        'font' => array(
            'bold' => true,
            'color' => array('rgb' => 'FF0000'),
            'size' => 10,
            'name' => 'Verdana'
        ));
    $objPHPExcel->getActiveSheet()->getStyle('A1')->applyFromArray($styleArray);
    $excel_out = array($this->outputSampleName($Sets));
    // var_dump($excel_out);
    // exit;
    $objPHPExcel->getActiveSheet()->SetCellValue('A1', 'Sample Size and Margin of Error');
    $rowCount = 2;
    foreach ($excel_out as $key => $line) {
        $colCount = 'A';
        $i = 0;
        // $line = array($Set['name']);
        // $CT = $Set['crossTabs']['base'];
        // $Moe = array($CT['sample']['moe']);
        foreach ($line as $col_value) {
            // var_dump($col_value);
            // exit;
            $objPHPExcel->getActiveSheet()->setCellValue($colCount.$rowCount, $col_value[$i])
                ->getStyle($colCount.$rowCount)->applyFromArray($styleArray);
            $colCount++;
        }
        $rowCount++;
        $i++;
    }
    return $objPHPExcel;
}
protected function outputSampleName($Sets)
{
    $excel_out = array(); // initialize, so the return value always exists
    foreach ($Sets as $Set) {
        $CT = $Set['crossTabs']['base'];
        $line = array(
            $Set['name'],
            $CT['sample']['moe'] . '%'
        );
        $excel_out[] = $line;
    }
    return $excel_out;
}
When I check with var_dump($excel_out) I get this data structure:
Please suggest how I can get those percentage values into the next row in an optimized way. For now I can only loop through the sample names (enthusiasts, hunter, new shooters etc.) from that array.
Thanks in advance.
Maybe it's because your array elements are arrays themselves, and you are trying to place these subarrays into cells.
Try setting each element of $line in a separate cell:
foreach ($excel_out as $line) {
    $objPHPExcel->getActiveSheet()
        ->setCellValue('A'.$rowCount, $line[0])
        ->setCellValue('B'.$rowCount, $line[1])
        ->setCellValue('C'.$rowCount, $line[2])
        ->setCellValue('D'.$rowCount, $line[3])
        ->setCellValue('E'.$rowCount, $line[4]);
    $rowCount++;
}
Note that the first sub-array in $excel_out has only one element, so you may want to handle that case. You could also use an inner loop to traverse each $line.
EDIT: after looking at the code in your answer, using an inner loop:
foreach ($excel_out as $key => $line) {
    $colCount = 'A';
    $i = 0;
    foreach ($line as $col_value) {
        // var_dump($col_value);
        // exit;
        $objPHPExcel->getActiveSheet()->setCellValue($colCount.$rowCount, $col_value[$i]);
        $colCount++;
        $i++;
    }
    $rowCount++;
}
$objPHPExcel->getActiveSheet()->setCellValue($colCount.$rowCount, $line);
Seems like you're writing the array $line into a single cell. Shouldn't you loop from 0 to count($line) and put each element into its own cell?
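A minimal sketch of that per-element loop, reusing the question's $colCount/$rowCount convention (PHP's string increment walks 'A', 'B', 'C', ...):

foreach ($line as $value) {
    $objPHPExcel->getActiveSheet()->setCellValue($colCount . $rowCount, $value);
    $colCount++; // 'A' becomes 'B', then 'C', and so on
}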
I have this big file containing SWIFT numbers and bank names. I'm using the following PHP function for reading and comparing data:
function csv_query($blz) {
    $cdata = -1;
    $fp = fopen(DIR_WS_INCLUDES . 'data/swift.csv', 'r');
    while ($data = fgetcsv($fp, 1024, ",")) {
        if ($data[0] == $blz) {
            $cdata = array(
                'blz' => $data[0],
                'bankname' => $data[7]
            );
            // 'prz' => $data[2]
        }
    }
    fclose($fp);
    return $cdata;
}
The CSV file looks like this:
"20730054",1,"UniCredit Bank - HypoVereinsbank (ex VereinWest)","21423","Winsen (Luhe)","UniCredit Bk ex VereinWest",,"HYVEDEMM324","68","013765","M",1,"20030000"
"20750000",1,"Sparkasse Harburg-Buxtehude","21045","Hamburg","Spk Harburg-Buxtehude","52002","NOLADE21HAM","00","011993","U",0,"00000000"
"20750000",2,"Sparkasse Harburg-Buxtehude","21605","Buxtehude","Spk Harburg-Buxtehude","52002",,"00","011242","U",0,"00000000"
As you can see from the code, I need the first and the eighth field. If the first field has no duplicates, everything is OK; but if it does, the eighth field of the duplicate will most likely be empty and I get no result back. So I want to ask: how can I return the eighth field of the first matching line when there are duplicates?
I guess this will solve your problem:

function csv_query($blz) {
    $cdata = -1;
    $fp = fopen(DIR_WS_INCLUDES . 'data/swift.csv', 'r');
    $counter = 0; // add this line
    while ($data = fgetcsv($fp, 1024, ",")) {
        if ($data[0] == $blz && !$counter) { // change this line
            $cdata = array(
                'blz' => $data[0],
                'bankname' => $data[7]
            );
            $counter++; // add this line
        }
    }
    fclose($fp);
    return $cdata;
}
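Against the sample rows above, this keeps the first "20750000" match, whose eighth field is populated; a quick usage sketch:

// returns ['blz' => '20750000', 'bankname' => 'NOLADE21HAM'],
// ignoring the later duplicate row whose eighth field is empty
$result = csv_query('20750000');

Adding a break; right after $counter++ would also let the loop stop at the first match instead of scanning the rest of the file.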
I am writing a script to create Magento attributes programmatically, pulling the data from a CSV. I'm not sure the loop that pulls the data from the CSV is actually correct; I was hoping for some expert guidance on the logic.
<?php
$fh = fopen("attributes.csv", "r");
$i = 0;
while (($l = fgetcsv($fh, 1024, ",")) !== FALSE) {
    $i++;
    if ($i == 1) continue; // ignoring the headers, so skip row 0
    $data['label'] = trim($l[2]);
    $data['input'] = trim($l[3]);
    $data['type'] = trim($l[2]);
    // Create the attribute
    $data = array(
        'type' => $data['type'],
        'input' => 'text',
        'label' => $data['label'],
        'global' => Mage_Catalog_Model_Resource_Eav_Attribute::SCOPE_GLOBAL,
        'is_required' => '0',
        'is_comparable' => '0',
        'is_searchable' => '0',
        'is_unique' => '1',
        'is_configurable' => '1',
        'use_defined' => '1'
    );
    $model->addAttribute('catalog_product', 'test_attribute', $data);
}
?>
I basically just want it to grab the attribute data from the CSV and, for each row, run the code to create the attribute (using the label and name specified in the CSV). I'm guessing I'm missing something obvious in the loop? (I'm just really learning what I'm doing!)
You reset the $data array in each loop iteration after inserting the values from the CSV, so the CSV content gets lost. Try this:
$fh = fopen("attributes.csv", "r");
$i = 0;
$attributes = array(); //!!
while (($l = fgetcsv($fh, 1024, ",")) !== FALSE) {
    $i++;
    if ($i == 1) continue; // ignoring the headers, so skip row 0
    $data = array();
    $data['label'] = trim($l[2]);
    $data['input'] = trim($l[3]);
    $data['type'] = trim($l[2]);
    $data['global'] = Mage_Catalog_Model_Resource_Eav_Attribute::SCOPE_GLOBAL;
    $data['is_required'] = '0';
    $data['is_comparable'] = '0';
    $data['is_searchable'] = '0';
    $data['is_unique'] = '1';
    $data['is_configurable'] = '1';
    $data['use_defined'] = '1';
    // insert $data into the attributes array
    $attributes[] = $data;
    // or
    $model->addAttribute('catalog_product', 'test_attribute', $data);
}
fclose($fh);
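Two more things worth flagging as hedged asides: both snippets assume $model is already an EAV setup model, and both pass the hardcoded code 'test_attribute', so every row would target the same attribute. A sketch of obtaining the setup model and deriving a code per row (Magento 1; using CSV column 1 for the code is an assumption, adjust it to your file):

// Magento 1 EAV setup model; 'core_setup' is the conventional resource name
$model = new Mage_Eav_Model_Entity_Setup('core_setup');

// inside the while loop: derive a unique attribute code per CSV row
// (assumes column 1 holds a machine-friendly name)
$code = strtolower(preg_replace('/[^a-z0-9_]+/i', '_', trim($l[1])));
$model->addAttribute('catalog_product', $code, $data);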