I have a script that parses a CSV with a million rows in it into an array.
I want to batch this with a cronjob. For example, every 100,000 rows I want to pause the script and then continue it again, to prevent memory leaks etc.
My script currently looks like this:
It's not relevant what it does, but how can I loop through this in batches from a cronjob?
Can I just make a cronjob that calls this script every 5 minutes and remembers where the foreach loop was paused?
$csv = file_get_contents(CSV);
$array = array_map("str_getcsv", explode("\n", $csv));
$headers = $array[0];
$number_of_records = count($array);
$params = ['body' => []];
for ($i = 1; $i < $number_of_records; $i++) {
    $params['body'][] = [
        'index' => [
            '_index' => INDEX,
            '_type'  => TYPE,
            '_id'    => $i
        ]
    ];
    // Set the right keys
    foreach ($array[$i] as $key => $value) {
        $array[$i][$headers[$key]] = $value;
        unset($array[$i][$key]);
    }
    // Loop fields
    $params['body'][] = [
        'Inrijdtijd'  => $array[$i]['Inrijdtijd'],
        'Uitrijdtijd' => $array[$i]['Uitrijdtijd'],
        'Parkeerduur' => $array[$i]['Parkeerduur'],
        'Betaald'     => $array[$i]['Betaald'],
        'bedrag'      => $array[$i]['bedrag']
    ];
    // Every 100,000 documents stop and send the bulk request
    if ($i % 100000 == 0) {
        $responses = $client->bulk($params);
        // erase the old bulk request
        $params = ['body' => []];
        // unset the bulk response when you are done to save memory
        unset($responses);
    }
}
// Send the last batch if it exists
if (!empty($params['body'])) {
    $responses = $client->bulk($params);
}
In the given code the script will always process from the beginning, since no pointer of any sort is kept.
My suggestion would be to split the CSV file into pieces and let another script parse the pieces one by one (e.g. every 5 minutes), deleting each piece afterwards.
$fp = fopen(CSV, 'r');
$head = rtrim(fgets($fp), "\r\n");
$count = 0;
$output = [$head];
while (!feof($fp)) {
    $line = rtrim((string) fgets($fp), "\r\n");
    if ($line === '') {
        continue;
    }
    $output[] = $line;
    if (count($output) == 10000) {
        file_put_contents('batches/batch-' . $count . '.csv', implode("\n", $output));
        $count++;
        $output = [$head];
    }
}
fclose($fp);
// Write whatever is left over (the header alone counts as 1 line)
if (count($output) > 1) {
    file_put_contents('batches/batch-' . $count . '.csv', implode("\n", $output));
}
Now the original script can process one file at a time:
// array_diff() keeps the original keys, so re-index before taking the first file
$files = array_values(array_diff(scandir('batches/'), ['.', '..']));
if (count($files) > 0) {
    $file = 'batches/' . $files[0];
    // PROCESS FILE
    unlink($file);
}
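To actually run this every 5 minutes, something along these lines could work. This is only a sketch: process_batch.php, the log path, and the lock file are hypothetical names, and the lock is just one way to avoid overlapping runs.
<?php
// process_batch.php - invoked by cron, e.g. with a crontab entry like:
//   */5 * * * * /usr/bin/php /path/to/process_batch.php >> /path/to/batch.log 2>&1

// A simple lock file prevents overlapping runs if one batch takes longer than 5 minutes
$lock = fopen('batches/.lock', 'c');
if (!flock($lock, LOCK_EX | LOCK_NB)) {
    exit(0); // previous run is still busy
}

$files = array_values(array_diff(scandir('batches/'), ['.', '..']));
if (count($files) > 0) {
    $file = 'batches/' . $files[0];
    // ... parse $file and send it to Elasticsearch here, as in the original script ...
    unlink($file); // remove the finished piece
}

flock($lock, LOCK_UN);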
I am inserting data into the database using the insert_batch() function.
I want to split the process.
I mean, if I want to create 10,000 serial numbers but only 1,000 rows at a time, it should run the create process 10 times in a loop.
How can I do that?
$serial_numbers = $this->serial_numbers_model->generate_serial_numbers($product_id, $partner_id, $serial_number_quantity, $serial_number_start);
$issued_date = date("Y-m-d H:i:s");
$inserted_rows = 0;
foreach ($serial_numbers as $sn) {
    $check_number = $this->serial_numbers_model->generate_check_number();
    $first_serial_number = reset($serial_numbers);
    $last_serial_number = end($serial_numbers);
    $inserted_rows++;
    $insert_data[] = array(
        'partner_id' => $partner_id,
        'product_id' => $product_id,
        'check_number' => $check_number,
        'serial_number' => $sn,
        'issued_time' => $issued_date,
        'serial_number_status' => CREATE_STATUS
    );
}
$this->serial_numbers_model->insert_batch($insert_data);
Presumably your serial_numbers_model->insert_batch() is just a wrapper around CodeIgniter's native insert_batch()? The code below uses the native one for clarity; replace it with yours as required.
// Track how many in your batch, and prepare empty batch array
$count = 0;
$insert_data = [];

foreach ($serial_numbers as $sn) {
    // ... your code, prepare your data, etc ...
    $count++;
    $insert_data[] = array(
        // ... your data ...
    );

    // Do you have a batch of 1000 ready?
    if ($count === 1000) {
        // Yes - insert it
        $this->db->insert_batch('table', $insert_data);
        // $this->serial_numbers_model->insert_batch($insert_data);

        // Reset the count, and empty the batch, ready to start again
        $count = 0;
        $insert_data = [];
    }
}

// Watch out! If there were 1001 serial numbers, the first 1000 were handled,
// but the last one hasn't been inserted!
if (sizeof($insert_data)) {
    $this->db->insert_batch('table', $insert_data);
}
I use CodeIgniter, and when an insert_batch() does not fully work (the number of items inserted differs from the number of items given), I have to do the inserts again, using INSERT IGNORE to maximize the number that get through without errors for rows that already exist.
When I use this method, the kind of data I'm inserting does not need strict agreement between the number of items given and the number put in the database; maximizing is the goal.
What would be the correct way of a) using insert_batch as much as possible, and b) when it fails, using a workaround, while minimizing the number of unnecessary requests?
Thanks
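(For reference, a minimal sketch of the INSERT IGNORE fallback described above, assuming MySQL and CodeIgniter 3 with db_debug disabled so a failed batch returns FALSE instead of aborting; 'table' and the column layout are placeholders, and since Query Builder has no native INSERT IGNORE the retry uses a raw bound query.)
// Try the normal batch insert first; insert_batch() returns the number of rows
// inserted, or FALSE on failure (with db_debug turned off)
$inserted = $this->db->insert_batch('table', $insert_data);

// If fewer rows went in than we sent, retry row by row with INSERT IGNORE
if ($inserted === FALSE || $inserted < count($insert_data)) {
    foreach ($insert_data as $row) {
        $sql = 'INSERT IGNORE INTO `table` ('
             . implode(', ', array_keys($row))
             . ') VALUES (' . implode(', ', array_fill(0, count($row), '?')) . ')';
        // Rows that already exist are silently skipped; new ones still get in
        $this->db->query($sql, array_values($row));
    }
}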
The correct way of inserting data using insert_batch() is:
CI_Controller:
public function add_monthly_record()
{
    $date = $this->input->post('date');
    $due_date = $this->input->post('due_date');
    $billing_date = $this->input->post('billing_date');
    $total_area = $this->input->post('total_area');
    $comp_id = $this->input->post('comp_id');
    $unit_id = $this->input->post('unit_id');
    $percent = $this->input->post('percent');
    $unit_consumed = $this->input->post('unit_consumed');
    $per_unit = $this->input->post('per_unit');
    $actual_amount = $this->input->post('actual_amount');
    $subsidies_from_itb = $this->input->post('subsidies_from_itb');
    $subsidies = $this->input->post('subsidies');

    $data = array();
    foreach ($unit_id as $id => $name) {
        $data[] = array(
            'date' => $date,
            'comp_id' => $comp_id,
            'due_date' => $due_date,
            'billing_date' => $billing_date,
            'total_area' => $total_area,
            'unit_id' => $unit_id[$id],
            'percent' => $percent[$id],
            'unit_consumed' => $unit_consumed[$id],
            'per_unit' => $per_unit[$id],
            'actual_amount' => $actual_amount[$id],
            'subsidies_from_itb' => $subsidies_from_itb[$id],
            'subsidies' => $subsidies[$id],
        );
    }

    $result = $this->Companies_records->add_monthly_record($data);

    // return from model
    $total_affected_rows = $result[1];
    $first_insert_id = $result[0];

    // using last id
    if ($total_affected_rows) {
        $count = $total_affected_rows - 1;
        for ($x = 0; $x <= $count; $x++) {
            $id = $first_insert_id + $x;
            $invoice = 'EBR' . date('m') . '/' . date('y') . '/' . str_pad($id, 6, '0', STR_PAD_LEFT);
            $field = array(
                'invoice_no' => $invoice,
            );
            $this->Companies_records->add_monthly_record_update($field, $id);
        }
    }

    echo json_encode($result);
}
CI_Model:
public function add_monthly_record($data)
{
    $this->db->insert_batch('monthly_record', $data);
    $first_insert_id = $this->db->insert_id();
    $total_affected_rows = $this->db->affected_rows();

    return [$first_insert_id, $total_affected_rows];
}
As @q81 mentioned, you would split the batches (as you see fit, or depending on system resources) like this:
$insert_batch = array();
$maximum_items = 100;
$i = 1;

while ($condition == true) {
    // code to add data into $insert_batch
    // ...

    // insert the batch every n items
    if ($i == $maximum_items) {
        $this->db->insert_batch('table', $insert_batch); // insert the batch
        $insert_batch = array(); // empty batch array
        $i = 0;
    }
    $i++;
}

// the last $insert_batch
if ($insert_batch) {
    $this->db->insert_batch('table', $insert_batch);
}
Edit:
While insert_batch() already splits the batches internally, the reason you get "number of items inserted different from the number of items given" might be that the allowed memory size is being reached; this has happened to me many times too.
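(A rough sketch of one way to guard against that: flush the batch early when the script approaches PHP's memory_limit. The ini_bytes() helper and the 80% threshold are assumptions for illustration, not part of the answer above.)
// Crude helper to turn ini values like "128M" into bytes (illustrative only)
function ini_bytes($val) {
    $val = trim($val);
    $unit = strtoupper(substr($val, -1));
    $num = (float) $val;
    if ($unit === 'G') return (int) ($num * 1024 * 1024 * 1024);
    if ($unit === 'M') return (int) ($num * 1024 * 1024);
    if ($unit === 'K') return (int) ($num * 1024);
    return (int) $num; // plain number, or -1 for "unlimited"
}

$limit = ini_bytes(ini_get('memory_limit'));
$insert_batch = array();
$maximum_items = 100;
$i = 1;

while ($condition == true) {
    // ... add data into $insert_batch as before ...

    // Flush early when the batch is full OR we are using ~80% of the allowed memory
    if ($i == $maximum_items || ($limit > 0 && memory_get_usage(true) > 0.8 * $limit)) {
        $this->db->insert_batch('table', $insert_batch);
        $insert_batch = array();
        $i = 0;
    }
    $i++;
}

// the last $insert_batch
if ($insert_batch) {
    $this->db->insert_batch('table', $insert_batch);
}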
I have a text file (which essentially is a CSV without the extension) that has 150,000 lines in it. I need to remove duplicates by key and then insert them into the database. I'm attempting to use fgetcsv to read it line by line, but I don't want to do 150,000 queries. So this is what I came up with so far (keep in mind I'm using Laravel):
$count = 0;
$insert = [];

if (($handle = fopen("myHUGEfile.txt", "r")) !== FALSE) {
    while (($data = fgetcsv($handle, 1000, ",")) !== FALSE) {
        $count++;

        // See if this is the top row, which in this case are column headers
        if ($count == 1) continue;

        // Get the parts needed for the new part
        $quantity = $data[0];
        $part_number = $data[1];
        $manufacturer = $data[2];

        $new_part = [
            'manufacturer' => $manufacturer,
            'part_number' => $part_number,
            'stock' => $quantity,
            'price' => '[]',
            'approved' => 0,
        ];
        $insert[] = $new_part;
    }
    fclose($handle);
} else {
    throw new Exception('Could not open file for reading.');
}

// Remove duplicates
$newRows = [];
$parsedCount = 0;
foreach ($insert as $row) {
    $x = 0;
    foreach ($newRows as $n) {
        if (strtoupper($row['part_number']) === strtoupper($n['part_number'])) {
            $x++;
        }
    }
    if ($x == 0) {
        $parsedCount++;
        $newRows[] = $row;
    }
}

$parsed_rows = array_chunk($newRows, 1000, true);
$x = 0;
foreach ($parsed_rows as $chunk) {
    // Insert
    if (count($chunk) > 0) {
        if (DB::table('search_parts')->insert($chunk)) {
            $x++;
        }
    }
}

echo $x . " chunks inserted.<br/>" . $count . " parts started with<br/>" . $parsedCount . " rows after duplicates removed.";
But it's very clunky. I have only tested it with a little over 1,000 rows and it works on localhost, but I'm afraid that if I push it to production it won't be able to handle all 150,000 rows. The file is about 4 MB.
Can someone show me a better, more efficient way to do this?
Right now, you're keeping the first duplicate record. If you're OK with keeping the last dupe, you can just change
$insert[] = $new_part;
to
$insert[strtoupper($part_number)] = $new_part;
That way, your $insert array will only have one value for each $part_number. Your inserts will be a little slower, but you can drop all of the code that checks for duplicates, which looks very, very slow.
4 MB is not remotely a "huge" file. I'd just read the whole thing into an associative array keyed by part number, which will inherently de-dupe, giving you the last row whenever a duplicate is encountered. Something like this, maybe:
$parts = [];
$lines = explode("\n", file_get_contents('file'));
array_shift($lines); // drop the header row

foreach ($lines as $line) {
    if (trim($line) === '') {
        continue;
    }
    $part = str_getcsv($line);
    // Key by the uppercased part number so duplicates collapse case-insensitively
    $parts[strtoupper($part[1])] = [
        'manufacturer' => $part[2],
        'part_number' => $part[1],
        'stock' => $part[0],
        'price' => '[]',
        'approved' => 0,
    ];
}

// $parts now contains a unique part list
foreach ($parts as $part) {
    $db->insert($part);
}
If you don't want duplicates on a certain key (or combination of keys), you can make it easy on yourself and just add a UNIQUE INDEX on the key(s) you don't want duplicated to the table.
This way, all you have to worry about is processing the file. When a duplicate key is reached, that row cannot be inserted; use INSERT IGNORE (or ON DUPLICATE KEY UPDATE) if you want the rest of the rows to go through anyway.
It would also make things easier in the future, because you wouldn't have to modify your code if you need to check additional columns. Just modify the index.
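(A hedged sketch of that approach in the Laravel context of the question, assuming MySQL; the index name ux_search_parts_part_number is arbitrary, and INSERT IGNORE makes MySQL skip rows that hit the unique index instead of failing the whole statement.)
// One-time schema change: a part_number may only appear once
DB::statement('ALTER TABLE search_parts ADD UNIQUE INDEX ux_search_parts_part_number (part_number)');

// Insert the chunks with INSERT IGNORE so duplicate part numbers are skipped
foreach (array_chunk($newRows, 1000) as $chunk) {
    $placeholders = [];
    $bindings = [];
    foreach ($chunk as $row) {
        $placeholders[] = '(?, ?, ?, ?, ?)';
        array_push($bindings, $row['manufacturer'], $row['part_number'], $row['stock'], $row['price'], $row['approved']);
    }
    DB::insert(
        'INSERT IGNORE INTO search_parts (manufacturer, part_number, stock, price, approved) VALUES ' . implode(', ', $placeholders),
        $bindings
    );
}
Newer Laravel versions also ship DB::table('search_parts')->insertOrIgnore($chunk), which does the same thing without raw SQL.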
I have this big file containing SWIFT numbers and bank names. I'm using the following PHP function for reading and comparing data:
function csv_query($blz) {
    $cdata = -1;
    $fp = fopen(DIR_WS_INCLUDES . 'data/swift.csv', 'r');
    while ($data = fgetcsv($fp, 1024, ",")) {
        if ($data[0] == $blz) {
            $cdata = array(
                'blz' => $data[0],
                'bankname' => $data[7]
            );
            // 'prz' => $data[2]
        }
    }
    return $cdata;
}
The CSV file looks like this:
"20730054",1,"UniCredit Bank - HypoVereinsbank (ex VereinWest)","21423","Winsen (Luhe)","UniCredit Bk ex VereinWest",,"HYVEDEMM324","68","013765","M",1,"20030000"
"20750000",1,"Sparkasse Harburg-Buxtehude","21045","Hamburg","Spk Harburg-Buxtehude","52002","NOLADE21HAM","00","011993","U",0,"00000000"
"20750000",2,"Sparkasse Harburg-Buxtehude","21605","Buxtehude","Spk Harburg-Buxtehude","52002",,"00","011242","U",0,"00000000"
As you can see from the code, I need the first and the eighth field. If the first field has no duplicates, everything is OK; but if it does, the eighth field of the duplicate will most likely be empty and I get no result back. So I want to ask how to return the eighth field of the first occurrence when the key has a duplicate.
I guess this will solve your problem:
function csv_query($blz) {
    $cdata = -1;
    $fp = fopen(DIR_WS_INCLUDES . 'data/swift.csv', 'r');
    $counter = 0; // add this line
    while ($data = fgetcsv($fp, 1024, ",")) {
        if ($data[0] == $blz && !$counter) { // change this line
            $cdata = array(
                'blz' => $data[0],
                'bankname' => $data[7]
            );
            $counter++; // add this line
        }
    }
    fclose($fp);
    return $cdata;
}
I have a CSV file which contains a list of files and directories:
Depth;Directory;
0;bin
1;basename
1;bash
1;cat
1;cgclassify
1;cgcreate
0;etc
1;aliases
1;audit
2;auditd.conf
2;audit.rules
0;home
....
Each line depends on the one above it (for the depth param).
I would like to create an array like the one below, in order to store it in my MongoDB collection using materialized paths:
$directories = array(
array('_id' => null,
'name' => "auditd.conf",
'path' => "etc,audit,auditd.conf"),
array(....)
);
I don't know how to proceed...
Any ideas?
Edit 1:
I'm not really working with directories - it's just an example, so I cannot use filesystem functions or FileIterators.
Edit 2:
From this CSV file, I'm able to create a JSON nested array:
// Builds the nested structure in $tree_map (passed by reference), one CSV row at a time
function nestedarray($row, &$tree_map) {
    list($id, $depth, $cmd) = $row;
    $arr = &$tree_map;
    while ($depth--) {
        end($arr);
        $arr = &$arr[key($arr)];
    }
    $arr[$cmd] = null;
}
But I'm not sure it's the best way to proceed...
This should do the trick, I think (it worked in my test, at least, with your data). Note that this code doesn't do much error checking and expects the input data to be in proper order (i.e. starting with depth 0 and without gaps).
<?php
$input = explode("\n", file_get_contents($argv[1]));
array_shift($input); // drop the "Depth;Directory;" header

$data = array();
$levels = array();

foreach ($input as $dir) {
    if (count($parts = str_getcsv($dir, ';')) < 2) {
        continue;
    }

    if ($parts[0] == 0) {
        $last = array(
            '_id' => null,
            'name' => $parts[1],
            'path' => $parts[1]
        );
        $levels = array($last);
        $data[] = $last;
    } else {
        $last = array(
            '_id' => null,
            'name' => $parts[1],
            'path' => $levels[$parts[0] - 1]['path'] . ',' . $parts[1]
        );
        $levels[$parts[0]] = $last;
        $data[] = $last;
    }
}

print_r($data);
?>
The "best" way to go would be to not store your data in CSV format, as it's the Wrong Tool For The Job.
That said, here you go:
<?php
$lines = file('/path/to/your/csv_file.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$directories = array();
$path = array();
$lastDepth = NULL;

foreach ($lines as $line) {
    list($depth, $dir) = str_getcsv($line, ';');

    // Skip headers and such
    if (!ctype_digit($depth)) {
        continue;
    }

    if ($depth == $lastDepth) {
        // If this depth is the same as the last, pop the last directory
        // we added off the stack
        array_pop($path);
    } else if ($depth < $lastDepth) {
        // We moved back up the tree (this also covers depth 0):
        // truncate the stack to the current depth
        $path = array_slice($path, 0, (int) $depth);
    }

    // Push the current directory onto the path stack
    $path[] = $dir;

    $directories[] = array(
        '_id' => NULL,
        'name' => $dir,
        'path' => implode(',', $path)
    );

    $lastDepth = $depth;
}

var_dump($directories);
Edit:
For what it's worth, once you have the desired nested structure in PHP, it would probably be a good idea to use json_encode(), serialize(), or some other format to store it to disk, and get rid of the CSV file. Then you can just use json_decode() or unserialize() to get it back into PHP array format whenever you need it again.
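(A tiny sketch of that round trip, using the $directories array built above; the cache file name is just a placeholder.)
// Write the structure out once (directories.json is a hypothetical name)
file_put_contents('directories.json', json_encode($directories, JSON_PRETTY_PRINT));

// Later runs can skip the CSV parsing entirely
$directories = json_decode(file_get_contents('directories.json'), true);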