Compare MySQL with CSV and find differences - php

I have a CSV file with 1 column named EAN and a MySQL Table with a column named EAN too.
This is what I want to do comparing both columns:
CSV | MySQL | STATUS
123 | 123   | OK
321 | 321   | OK
444 |       | MISSING IN MySQL
    | 111   | MISSING IN CSV
Any ideas how to do this with PHP?

One way to do it:
(Assuming you already know how to open a file and execute a query.)
First read rows from your CSV and assume the data is missing in SQL.
$result = [];
while (($row = fgetcsv($file)) !== FALSE) {
    $num = $row[0]; // or whatever CSV column the value you want is in
    $result[$num] = ['csv' => $num, 'sql' => '', 'status' => 'MISSING IN SQL'];
}
Then fetch rows from your query and fill the array you created from the CSV accordingly.
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    $num = $row['EAN']; // or whatever your column is named
    if (isset($result[$num])) {
        // This has a value from the CSV, so update the array
        $result[$num]['sql'] = $num;
        $result[$num]['status'] = 'OK';
    } else {
        // This doesn't have a value from the CSV, so insert a new row
        $result[$num] = ['csv' => '', 'sql' => $num, 'status' => 'MISSING IN CSV'];
    }
}
You could change the order of this and process the query results first. Either order will work, just as long as you do the update/insert logic with the second data source.
You can ksort($result); if you want the merged values to be in order, then output $result however you need to.
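Put together, a minimal end-to-end sketch might look like this (the CSV filename, the PDO connection details and the table name products are assumptions; the question only mentions a column named EAN):

<?php
// Minimal sketch: compare EANs from a CSV against a MySQL column.
// 'eans.csv', the DSN and the table name 'products' are placeholders.
$pdo  = new PDO('mysql:host=localhost;dbname=shop;charset=utf8mb4', 'user', 'pass');
$file = fopen('eans.csv', 'r');

$result = [];
while (($row = fgetcsv($file)) !== FALSE) {
    $num = $row[0];
    $result[$num] = ['csv' => $num, 'sql' => '', 'status' => 'MISSING IN SQL'];
}
fclose($file);

$stmt = $pdo->query('SELECT EAN FROM products');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    $num = $row['EAN'];
    if (isset($result[$num])) {
        $result[$num]['sql'] = $num;
        $result[$num]['status'] = 'OK';
    } else {
        $result[$num] = ['csv' => '', 'sql' => $num, 'status' => 'MISSING IN CSV'];
    }
}

ksort($result);
foreach ($result as $line) {
    echo $line['csv'] . ' | ' . $line['sql'] . ' | ' . $line['status'] . "\n";
}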

Related

Laravel insertOrIgnore inserting duplicate rows and empty rows during a CSV file upload

I have a companies table. Each company has a company_id with a unique constraint.
$table->string('company_id', 9)->unique();
I'm uploading a CSV file with 1M rows containing a single column of these unique company_ids.
I'm aware the file has MANY duplicate rows, so not even half of the rows should be inserted, because I'm using Laravel's insertOrIgnore method. Here's my CSV upload method:
public function csvUpload(Request $request)
{
    $validated = $request->validate([
        'csv' => 'required|file',
    ]);
    $path = $request->csv->store('temp');
    $input = base_path() . Storage::url('app/' . $path);
    $handle = fopen($input, 'r');
    // Push the data into an array
    $company_ids = [];
    $rows = 0;
    $now = now();
    $inserted = 0;
    while (($data = fgetcsv($handle, 9)) !== FALSE) {
        $company_ids[] = [
            'company_id' => $data[0],
            'created_at' => $now,
            'updated_at' => $now,
        ];
        $rows++;
        if ($rows >= 1000) {
            // Insert company_ids into companies table
            $inserted += Company::insertOrIgnore($company_ids);
            unset($company_ids); // unset to avoid memory issues
            $company_ids = [];
            $rows = 0;
        }
    }
    // Insert remaining company_ids (when there are no more rows but the total is less than 1000)
    $inserted += Company::insertOrIgnore($company_ids);
    // close the stored file
    fclose($handle);
    // delete the file
    Storage::delete($path);
    return redirect()->route('companies.index')
        ->with('message', $inserted . ' new company_ids uploaded');
}
However, this is saving all rows, including the duplicate ones, and also inserting rows with an empty company_id every two rows.
I'm using Laravel 9 with MySQL.
I'm thinking it could come from how the CSV is parsed, or from the fact that I chunk the data before inserting it, but I cannot find the issue here.
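(Side note on the parsing suspicion, not from the original post: the second argument to fgetcsv() is a maximum line length, and the manual requires it to be greater than the longest line in the file. A standalone sketch with hypothetical 9-character IDs shows how a limit of 9 splits each line into a truncated row plus a leftover row:)

<?php
// Hypothetical sketch (not the original upload code): two 9-character IDs,
// read back with fgetcsv() limited to 9 bytes per line.
$tmp = tmpfile();
fwrite($tmp, "AAAAAAAAA\nBBBBBBBBB\n");
rewind($tmp);
while (($data = fgetcsv($tmp, 9)) !== FALSE) {
    var_dump($data); // rows come back truncated and split across iterations
}

// With no length limit, each ID comes back as one row:
rewind($tmp);
while (($data = fgetcsv($tmp)) !== FALSE) {
    var_dump($data);
}
fclose($tmp);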
EDIT 1:
In parallel, I have queue workers attaching companies to another model. This other model has a company_id attribute, so I'm using firstOrCreate in that job, i.e. find the company with this company_id or create it. More empty rows are being created as I speak, so those jobs must be the ones creating the empty rows, but that doesn't explain why the CSV upload inserted duplicate rows.
EDIT 2:
There seems to be an issue with the jobs described above. I deleted the rows with an empty company_id but, as more rows are inserted into the table by the queue workers, I see that an id is always skipped: one row has id 2955265, the next has id 2955267.

How to find the keys of duplicate entries in multidimensional array

I'm reporting on appointment activity and have included a function to export the raw data behind the KPIs. This raw data is stored as a CSV and I need to check for potentially duplicate consultations that have been entered.
Each row of data is assigned a unique visit ID based on the patient's ID and the appointment ID. The raw data contains 30 columns, but the duplicate check only needs to be performed on 7 of them. I have imported the CSV and created an array as below for the first record, then appended the rest:
$mds = array(
$unique_visit_id => array(
$appt_date,
$dob,
$site,
$CCG,
$GP,
$appt_type,
$treatment_scheme
)
);
What I need is to scan the $mds array and return an array containing just the $unique_visit_id for any duplicate arrays.
e.g. if keys 1111, 2222 and 5555 all reference arrays that contain the same values for all seven fields, then I would need 2222 and 5555 returned.
I've tried searching but haven't come up with anything that works.
Thanks
This is what I've gone with; I'm still validating (the data set is very big) but it seems to be functioning as expected so far:
$handle = fopen("../reports/mds_full_export.csv", "r");
$header = fgetcsv($handle, 0, ',', '"'); // fgetcsv returns numerically indexed rows, so map the header names onto each row (assumes the CSV has a header row with these column names)
$visits = array();
while (($data = fgetcsv($handle, 0, ',', '"')) !== FALSE) { // note the extra parentheses: assign $data first, then compare against FALSE
    $data = array_combine($header, $data);
    $key = $data['unique_visit_id'];
    $value = $data['appt_date'].$data['dob'].$data['site'].$data['CCG'].$data['GP'].$data['appt_type'].$data['treatment_scheme'];
    $visits[$key] = $value;
}
fclose($handle);
asort($visits); // sorts in place and returns a bool, so don't assign the result back to $visits
$previous = "";
$dupes = array();
foreach ($visits as $id => $visit) {
    if (strcmp($previous, $visit) == 0) {
        $dupes[] = $id;
    }
    $previous = $visit;
}
return $dupes;
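An alternative sketch that avoids the sort entirely: group the visit IDs by the concatenated seven fields and report every ID after the first in each group (the header-row handling and the column names are assumptions, as above):

$handle = fopen("../reports/mds_full_export.csv", "r");
$header = fgetcsv($handle, 0, ',', '"'); // assumes a header row
$groups = array();
while (($data = fgetcsv($handle, 0, ',', '"')) !== FALSE) {
    $data = array_combine($header, $data);
    $key = implode('|', array(
        $data['appt_date'], $data['dob'], $data['site'], $data['CCG'],
        $data['GP'], $data['appt_type'], $data['treatment_scheme'],
    ));
    $groups[$key][] = $data['unique_visit_id'];
}
fclose($handle);

$dupes = array();
foreach ($groups as $ids) {
    if (count($ids) > 1) {
        // keep the first ID, report the rest as duplicates
        $dupes = array_merge($dupes, array_slice($ids, 1));
    }
}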

How to process CSV with 100k+ lines in PHP?

I have a CSV file with more than 100,000 lines; each line has 3 values separated by a semicolon. The total file size is approx. 5 MB.
CSV file is in this format:
stock_id;product_id;amount
==========================
1;1234;0
1;1235;1
1;1236;0
...
2;1234;3
2;1235;2
2;1236;13
...
3;1234;0
3;1235;2
3;1236;0
...
We have 10 stocks which are indexed 1-10 in CSV. In database we have them saved as 22-31.
CSV is sorted by stock_id, product_id but I think it doesn't matter.
What I have
<?php
session_start();
require_once ('db.php');
echo '<meta charset="iso-8859-2">';
// convert table: `CSV stock id => DB stock id`
$stocks = array(
    1 => 22,
    2 => 23,
    3 => 24,
    4 => 25,
    5 => 26,
    6 => 27,
    7 => 28,
    8 => 29,
    9 => 30,
    10 => 31
);
$sql = $mysqli->query("SELECT product_id FROM table WHERE fielddef_id = 1");
while ($row = $sql->fetch_assoc()) {
    $products[$row['product_id']] = 1;
}
$csv = file('export.csv');
// go thru CSV file and prepare SQL UPDATE query
foreach ($csv as $row) {
    $data = explode(';', $row);
    // $data[0] - stock_id
    // $data[1] - product_id
    // $data[2] - amount
    if (isset($products[$data[1]])) {
        // the CSV contains products which aren't in the database
        // the echo is just to show me the queries
        echo " UPDATE t
               SET value = " . (int)$data[2] . "
               WHERE fielddef_id = " . (int)$stocks[$data[0]] . " AND
                     product_id = '" . $data[1] . "' -- product_id isn't just numeric
               LIMIT 1<br>";
    }
}
The problem is that writing down 100k lines by echo is soooo slow, it takes long minutes. I'm not sure what MySQL will do, whether it will be faster or take roughly the same time. I have no testing machine here, so I'm worried about testing it on the prod server.
My idea was to load the CSV file into more variables (or rather an array) like below, but I don't know if it would help.
$csv[0] = lines 0 - 10.000;
$csv[1] = lines 10.001 - 20.000;
$csv[2] = lines 20.001 - 30.000;
$csv[3] = lines 30.001 - 40.000;
etc.
I found eg. Efficiently counting the number of lines of a text file. (200mb+), but I'm not sure how it can help me.
When I replace the foreach with print_r, I get the dump in < 1 sec. The task is to make the foreach loop with the database update faster.
Any ideas how to update so many records in the database?
Thanks.
Something like this (please note this is 100% untested and off the top of my head, so it may need some tweaking to actually work :) )
// define the stock conversion array (there are probably better ways of doing this)
$stocks = array(
1 => 22,
2 => 23,
3 => 24,
4 => 25,
5 => 26,
6 => 27,
7 => 28,
8 => 29,
9 => 30,
10 => 31
);
$handle = fopen("file.csv", "r"); // open file
while (($data = fgetcsv($handle, 1000, ";")) !== FALSE) {
    // loop through csv
    $updatesql = "UPDATE t SET `value` = ".$data[2]." WHERE fielddef_id = ".$stocks[$data[0]]." AND product_id = '".$data[1]."'";
    echo "$updatesql<br>"; // for debug only, comment out on live
}
There is no need to do your initial select, since you're only ever setting your product data to 1 anyway in your code, and it looks from your description like your product IDs are always correct; it's just your fielddef column which holds the mapping.
Also, for live use, don't forget to actually execute your $updatesql with mysqli rather than just echoing it, e.g. as sketched below.
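A minimal sketch of that, reusing the $mysqli and $updatesql from above:

// sketch only: run the statement instead of just echoing it
if (!$mysqli->query($updatesql)) {
    echo "Update failed: " . $mysqli->error . "<br>";
}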
To give you a comparison with some actual in-use code (that I can benchmark against!):
This is some code I use for an importer of an uploaded file (it's not perfect, but it does its job).
if (isset($_POST['action']) && $_POST['action']=="beginimport") {
    echo "<h4>Starting Import</h4><br />";
    // Ignore user abort and expand time limit
    //ignore_user_abort(true);
    set_time_limit(60);
    if (($handle = fopen($_FILES['clientimport']['tmp_name'], "r")) !== FALSE) {
        $row = 0;
        //defaults
        $sitetype = 3;
        $sitestatus = 1;
        $startdate = "2013-01-01 00:00:00";
        $enddate = "2013-12-31 23:59:59";
        $createdby = 1;
        //loop and insert
        while (($data = fgetcsv($handle, 10000, ",")) !== FALSE) { // loop through each line of CSV. Returns array of that line each time so we can hard reference it if we want.
            if ($row>0) {
                if (strlen($data[1])>0) {
                    $clientshortcode = mysqli_real_escape_string($db->mysqli,trim(stripslashes($data[0])));
                    $sitename = mysqli_real_escape_string($db->mysqli,trim(stripslashes($data[0]))." ".trim(stripslashes($data[1])));
                    $address = mysqli_real_escape_string($db->mysqli,trim(stripslashes($data[1])).",".trim(stripslashes($data[2])).",".trim(stripslashes($data[3])));
                    $postcode = mysqli_real_escape_string($db->mysqli,trim(stripslashes($data[4])));
                    //look up client ID
                    $client = $db->queryUniqueObject("SELECT ID FROM tblclients WHERE ShortCode='$clientshortcode'",ENABLE_DEBUG);
                    if ($client->ID>0 && is_numeric($client->ID)) {
                        //got client ID so now check if site already exists we can trust the site name here since we only care about double matching against already imported sites.
                        $sitecount = $db->countOf("tblsites","SiteName='$sitename'");
                        if ($sitecount>0) {
                            //site exists
                            echo "<strong style=\"color:orange;\">SITE $sitename ALREADY EXISTS SKIPPING</strong><br />";
                        } else {
                            //site doesn't exist so do import
                            $db->execute("INSERT INTO tblsites (SiteName,SiteAddress,SitePostcode,SiteType,SiteStatus,CreatedBy,StartDate,EndDate,CompanyID) VALUES
                                ('$sitename','$address','$postcode',$sitetype,$sitestatus,$createdby,'$startdate','$enddate',".$client->ID.")",ENABLE_DEBUG);
                            echo "IMPORTED - ".$data[0]." - ".$data[1]."<br />";
                        }
                    } else {
                        echo "<strong style=\"color:red;\">CLIENT $clientshortcode NOT FOUND PLEASE ENTER AND RE-IMPORT</strong><br />";
                    }
                    fcflush();
                    set_time_limit(60); // reset timer on loop
                }
            } else {
                $row++;
            }
        }
        echo "<br />COMPLETED<br />";
    }
    fclose($handle);
    unlink($_FILES['clientimport']['tmp_name']);
    echo "All Imports finished do not reload this page";
}
That imported 150k rows in about 10 seconds
Thanks to the answers and comments on the question, I have the solution. The base for it comes from Dave's answer; I've only updated it to better fit the question.
<?php
require_once 'include.php';
// stock convert table (key is ID in CSV, value ID in database)
$stocks = array(
1 => 22,
2 => 23,
3 => 24,
4 => 25,
5 => 26,
6 => 27,
7 => 28,
8 => 29,
9 => 30,
10 => 31
);
// product IDs in CSV (value) and Database (product_id) are different. We need to take both IDs from database and create an array of e-shop products
$sql = mysql_query("SELECT product_id, value FROM cms_module_products_fieldvals WHERE fielddef_id = 1") or die(mysql_error());
while ($row = mysql_fetch_assoc($sql)) {
    $products[$row['value']] = $row['product_id'];
}
$handle = fopen('import.csv', 'r');
$i = 1;
while (($data = fgetcsv($handle, 1000, ';')) !== FALSE) {
    $p_id = (int)$products[$data[1]];
    if ($p_id > 0) {
        // if the product exists in the database, continue. Without this condition it works, but we send many useless queries (... WHERE product_id = 0 updates nothing, yet takes time)
        if ($i % 300 === 0) {
            // optional, we'll see what it does with the real traffic
            sleep(1);
        }
        $updatesql = "UPDATE table SET value = " . (int)$data[2] . " WHERE fielddef_id = " . $stocks[$data[0]] . " AND product_id = " . (int)$p_id . " LIMIT 1";
        echo "$updatesql<br>"; // for debug only, comment out on live
        $i++;
    }
}
// approx. 1.5 sec to import 100,000+ records
fclose($handle);
Like I said in the comment, use SplFileObject to iterate over the CSV file. Use prepared statements to reduce the performance overhead of calling the UPDATE on each loop iteration. Also, merge your two queries together; there isn't any reason to pull all of the product rows first and check them against the CSV. You can use a JOIN so that only the stocks in the second table that are related to the product in the first table, and to the current CSV row, get updated:
/* First the CSV is pulled in */
$export_csv = new SplFileObject('export.csv');
$export_csv->setFlags(SplFileObject::READ_CSV | SplFileObject::DROP_NEW_LINE | SplFileObject::READ_AHEAD);
$export_csv->setCsvControl(';');

/* Next you prepare your statement object */
$stmt = $mysqli->prepare("
    UPDATE stocks, products
    SET value = ?
    WHERE
        stocks.fielddef_id = ? AND
        product_id = ? AND
        products.fielddef_id = 1
    LIMIT 1
");
$stmt->bind_param('iis', $amount, $fielddef_id, $product_id);

/* Now you can loop through the CSV, set the fields to match the values bound to the prepared statement, and execute the update on each loop. */
foreach ($export_csv as $csv_row) {
    list($stock_id, $product_id, $amount) = $csv_row;
    $fielddef_id = $stock_id + 21;
    if (!empty($stock_id)) {
        $stmt->execute();
    }
}
$stmt->close();
Make the query bigger, i.e. use the loop to compile a larger query. You may need to split it up into chunks (e.g. process 100 at a time), but certainly don't do one query at a time (applies for any kind, insert, update, even select if possible). This should greatly increase the performance.
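For illustration, a minimal sketch of the chunking idea using INSERT ... ON DUPLICATE KEY UPDATE (this assumes a unique key on (fielddef_id, product_id); it reuses $handle, $stocks, $products and $mysqli from the question, and the table name t is the question's own placeholder):

$chunk = [];
$flush = function () use (&$chunk, $mysqli) {
    if (!$chunk) return;
    $mysqli->query("INSERT INTO t (fielddef_id, product_id, value) VALUES "
        . implode(',', $chunk)
        . " ON DUPLICATE KEY UPDATE value = VALUES(value)");
    $chunk = [];
};
while (($data = fgetcsv($handle, 1000, ';')) !== FALSE) {
    if (!isset($products[$data[1]])) {
        continue; // keep the "product exists" filter from the question
    }
    $chunk[] = sprintf("(%d, '%s', %d)",
        $stocks[$data[0]],
        $mysqli->real_escape_string($data[1]),
        (int)$data[2]);
    if (count($chunk) >= 100) {
        $flush(); // one query per 100 CSV rows instead of one per row
    }
}
$flush(); // write out the last partial chunk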
It's generally recommended that you don't query in a loop.
Updating every record every time will be too expensive (mostly due to seeks, but also from writing).
You should TRUNCATE the table first and then insert all the records again (assuming you won't have external foreign keys linking to this table).
To make it even faster, you should lock the table before the insert and unlock it afterwards. This will prevent the indexing from happening at every insert.
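A rough sketch of that sequence with mysqli (the table name t is a placeholder and the bulk inserts themselves are left out; only applicable if rebuilding the whole table from the CSV is acceptable):

$mysqli->query("TRUNCATE TABLE t");     // start from an empty table
$mysqli->query("LOCK TABLES t WRITE");  // lock before the inserts, as suggested above
// ... run the (ideally multi-row) INSERT statements here ...
$mysqli->query("UNLOCK TABLES");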

Iterating through PHP array with duplicate records and deleting one record where a value = 0

I have a MySQL query using Laravel that I convert to a PHP Array.
The rows have values similar to this:
name | override | percentage
Eclipse | 1 | 50%
Eclipse | 0 | 75%
MySQL query
select * from table
Both rows (it's many more than just 2 in reality) have the same name, but one has override set to 0 and one has it set to 1.
How can I get rid of all records in my query result (a PHP array) that are duplicates (determined by the name) AND have override set to 0? I want only the records that have been overridden with a new record (which I have done); I just need a way to remove the records with override = 0, given that the duplicate records share the same name but have a different percentage value.
How can this be done?
Thanks.
Try the following query:
SELECT * from testtable GROUP BY `name` HAVING count(`name`) = 1 OR `override` = 1;
check this sqlfiddle
If I understand your needs correctly, you need to filter out records that have duplicate name and override = 0.
If you sort your result set by name (SELECT * FROM TABLE ORDER BY name), you can use this function.
function removeDuplicatesFromArray($rows) {
    $result = array();
    $old_name = '';
    foreach ($rows as $row) {
        if ($row['name'] != $old_name) {
            $result[] = $row;
            $old_name = $row['name'];
        }
        elseif ($row['override'] == 1) {
            array_pop($result);
            $result[] = $row;
        }
    }
    return $result;
}
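A quick usage sketch with hypothetical rows, already sorted by name:

$rows = [
    ['name' => 'Eclipse', 'override' => 1, 'percentage' => '50%'],
    ['name' => 'Eclipse', 'override' => 0, 'percentage' => '75%'],
    ['name' => 'Other',   'override' => 0, 'percentage' => '20%'],
];
print_r(removeDuplicatesFromArray($rows));
// Keeps the Eclipse row with override = 1 and the single Other row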
NOTE: Doing this in SQL will be WAYYYYYYYYY faster and use far less memory. I would only try this PHP approach if you cannot modify the SQL for some reason.
Maybe try this: hit the db twice, the first time only getting the non-overrides, then get the overrides in a second pass -- coerce your arrays to be indexed by name and array_merge them. (This uses a fair chunk of memory given the number of arrays and copies, but it's easy to understand and keeps it simple.)
$initial = get_non_overridden();
$override = get_overridden();
$init_indexed = index_by_name($initial);
$over_indexed = index_by_name($override);
$desired_result = array_merge($init_indexed, $over_indexed);
Assuming your database gives you a standard rowset (an array of rows, where each row is a hash of fields => values), we want something that looks like this instead:
[
'Eclipse' => [
'name' => 'Eclipse',
'override' => '0',
'percentage' => '75%'
],
'Something' => [
'name' => 'Something',
'override' => '0',
'percentage' => '20%'
],
]
So index_by_name would be:
function index_by_name($rowset) {
    $result = array();
    foreach ($rowset as $row) {
        $result[ $row['name'] ] = $row;
    }
    return $result;
}
There are ways to tweak your efficiency either in memory or run time, but that's the gist of what I was thinking.
array_merge then overwrites the initial ones with the overridden ones.
NOTE: this all assumes that there is only one row where Eclipse's override is 1. If you have twenty Eclipse|0 and one Eclipse|1, this will work; if you have two Eclipse|1, you'd only see one of them... and there's no way to say which one.

phpexcel printing mysql data to excel (printing only 1 row of data from database to excel)

I want to use the 10autofilter.php example from PHPExcel in our program.
But I want code that will print the data from our database to Excel; it currently only prints row 1 and doesn't print all the data from our MySQL table. Please help me; you can see from the code that it only outputs 1 row.
I think there's a problem here: this works fine in PHP when displaying in the browser, but in Excel the output is only 1 row.
I have used $i++ on $row as you can see; I don't know what to do.
$res = mysql_query("select * from services");
$row = mysql_num_rows($res);
for ($i = 0; $i < $row; $i++)
{
    $serviceid = mysql_result($res,$i,"serviceid");
    $servicename = mysql_result($res,$i,"servicename");
    $contactemail = mysql_result($res,$i,"contactemail");
    $charge = mysql_result($res,$i,"charge");
    $contactlastname = mysql_result($res,$i,"contactlastname");
    $contactmiddlename = mysql_result($res,$i,"contactmiddlename");
    $yearassistancereceived = mysql_result($res,$i,"yearassistancereceived");
    $yearestablished = mysql_result($res,$i,"yearestablished");
    $dataArray = array(
        array(
            $serviceid,
            $servicename,
            $contactemail,
            $charge." ".$contactmiddlename." ".$contactlastname,
            $yearassistancereceived,
            $yearestablished
        )
    );
    $objPHPExcel->getActiveSheet()->fromArray($dataArray, NULL, 'A2');
}
I also have the error:
Fatal error: Uncaught exception 'PHPExcel_WriterException' with message 'Invalid parameters passed'. Why? At the bottom of the error it points to C:\xampp\htdocs\DOSTPROJECT\classes\PHPExcel\Writer\Excel2007\ContentTypes.php on line 263.
This does produce an Excel file, but as I said, with only 1 row of data. What's wrong?
If you write every row of data to spreadsheet row #2, then the rows will all overwrite each other. You want to write the first row of data to spreadsheet row #2, the second to spreadsheet row #3, etc.
You used $i++ to get each row from the database, but you're not using it when writing to PHPExcel; you're just writing every row's array at cell A2:
for ($i = 0; $i < $row; $i++)
{
    $serviceid = mysql_result($res,$i,"serviceid");
    $servicename = mysql_result($res,$i,"servicename");
    $contactemail = mysql_result($res,$i,"contactemail");
    $charge = mysql_result($res,$i,"charge");
    $contactlastname = mysql_result($res,$i,"contactlastname");
    $contactmiddlename = mysql_result($res,$i,"contactmiddlename");
    $yearassistancereceived = mysql_result($res,$i,"yearassistancereceived");
    $yearestablished = mysql_result($res,$i,"yearestablished");
    $dataArray = array(
        array(
            $serviceid,
            $servicename,
            $contactemail,
            $charge." ".$contactmiddlename." ".$contactlastname,
            $yearassistancereceived,
            $yearestablished
        )
    );
    $objPHPExcel->getActiveSheet()->fromArray($dataArray, NULL, 'A'.($i+2));
}
