I have a .csv file that is about 5 MB (~45,000 rows). What I need to do is run through each row of the file and check if the ID in each line is already in a table in my database. If it is, I can delete that row from the file.
I did a good amount of research on the most memory-efficient way to do this, and I've been using a method of writing lines that don't need to get deleted to a temporary file and then renaming that file as the original. Code below:
$file = fopen($filename, 'r');
$temp = fopen($tempFilename, 'w');

while (($row = fgetcsv($file)) !== false) {
    // id is the 7th value in the row
    $id = $row[6];

    // check the table to see if the id exists
    $sql = "SELECT id FROM table WHERE id = $id";
    $result = mysqli_query($conn, $sql);

    // if the id is in the database, skip to the next row
    if (mysqli_num_rows($result) > 0) {
        continue;
    }

    // else write the line to the temp file
    fputcsv($temp, $row);
}
fclose($file);
fclose($temp);
// overwrite original file
rename($tempFilename, $filename);
Problem is, I'm running into a timeout while executing this bit of code. Anything I can do to make the code more efficient?
You fire one database query per line, i.e. 45,000 queries... that takes too much time.
Better to run a single query before the loop and read the existing ids into a lookup array, then check only this array inside the loop.
Pseudo code:
$st = query('SELECT id FROM table');
while ($row = $st->fetch()) {
    $lookup[$row['id']] = $row['id'];
}

// now read the CSV
while ($row = fgetcsv($h)) {
    $id = $row[6];
    if (isset($lookup[$id])) {
        // id exists in the database, skip this row
        continue;
    }
    // write the row with the non-existing id to the other file...
}
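For reference, here is a runnable version of that idea as a minimal sketch, assuming the mysqli connection $conn and the $filename/$tempFilename variables from the question:

// Build the lookup once: one query instead of 45,000.
$lookup = [];
$result = mysqli_query($conn, 'SELECT id FROM `table`');
while ($dbRow = mysqli_fetch_assoc($result)) {
    $lookup[$dbRow['id']] = true;
}

$file = fopen($filename, 'r');
$temp = fopen($tempFilename, 'w');

while (($row = fgetcsv($file)) !== false) {
    // keep only rows whose id is NOT already in the database
    if (!isset($lookup[$row[6]])) {
        fputcsv($temp, $row);
    }
}

fclose($file);
fclose($temp);
rename($tempFilename, $filename);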
Edit:
Assume memory isn't sufficient to hold 1 million integers from the database. How can it still be done efficiently?
Collect the ids from the CSV into an array. Run a single query to find which of those ids exist in the database and collect them (the result can contain at most as many ids as the CSV). Now array_diff() the ids from the file with the ids from the database: the remaining ids exist in the CSV but not in the database.
Pseudo code:
$ids_csv = [];
while ($row = fgetcsv($h)) {
    $ids_csv[] = intval($row[6]);
}

$sql = sprintf('SELECT id FROM table WHERE id IN (%s)', implode(',', $ids_csv));

$ids_db = [];
$st = query($sql);
while ($row = $st->fetch()) {
    $ids_db[] = $row['id'];
}

// ids that are in the CSV but not in the database
$missing_in_db = array_diff($ids_csv, $ids_db);
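One caveat (my addition, not part of the original answer): a single IN () list with ~45,000 ids can approach MySQL's max_allowed_packet, so the same idea also works in chunks. A sketch assuming the mysqli connection $conn from the question:

// Look the ids up in chunks of 5,000 to keep each query small.
$ids_db = [];
foreach (array_chunk($ids_csv, 5000) as $chunk) {
    $sql = sprintf('SELECT id FROM `table` WHERE id IN (%s)', implode(',', $chunk));
    $result = mysqli_query($conn, $sql);
    while ($row = mysqli_fetch_assoc($result)) {
        $ids_db[] = (int) $row['id'];
    }
}

// Same diff as above: ids present in the CSV but not in the database.
$missing_in_db = array_diff($ids_csv, $ids_db);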
I would use LOAD DATA INFILE (https://dev.mysql.com/doc/refman/8.0/en/load-data.html) to read the csv file into a separate table. Your database user needs the FILE privilege on the database to use it.
Then you can run one query to delete the ids that already exist (DELETE ... JOIN ...), and export the rows that were left intact.
The other option is to use your loop to insert your csv file into a separate table, and then proceed with the delete and export steps.
Update: I use LOAD DATA INFILE with csv files of up to 2 million rows (at the moment) and do bulk data manipulation with big queries. It's blazingly fast, and I would recommend this route for files containing more than 100k lines.
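A sketch of that route, assuming a staging table named csv_import and placeholder file paths (neither is from the original answer), with the mysqli connection $conn from the question; the FIELDS/LINES clauses must match the actual CSV format:

// Bulk-load the file into the assumed staging table
// (needs the FILE privilege and a path readable by the MySQL server).
mysqli_query($conn, "LOAD DATA INFILE '/path/to/file.csv'
    INTO TABLE csv_import
    FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'");

// One multi-table DELETE removes every staged row whose id
// already exists in the main table (called `table` in the question).
mysqli_query($conn, "DELETE csv_import FROM csv_import
    INNER JOIN `table` ON `table`.id = csv_import.id");

// Export the surviving rows (SELECT ... INTO OUTFILE also needs FILE).
mysqli_query($conn, "SELECT * FROM csv_import
    INTO OUTFILE '/path/to/filtered.csv'
    FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
    LINES TERMINATED BY '\\n'");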
I am in need of a way to export my MySQL database to CSV via PHP, but I need to select the column names as well. So far I have the following, which does everything I need except get the column names.
echo "Export Starting \n";
$SQL = ("SELECT *
FROM INF_TimeEntries
WHERE Exported IS NULL");
$result = mysqli_query($db_conn, $SQL) or die("Selection Error " . mysqli_error($db_conn));
echo "Export Data Selected \n";
$fp = fopen('../updateDatabase/timesheetExport/TimeEntries.csv', 'w');
echo "Starting Write to CSV \n";
while($row = mysqli_fetch_assoc($result)){
fputcsv($fp, $row);
$RowID = $row['ID'];
$exportTime = date("Y-m-d H:i:s");
$sql = ("UPDATE INF_TimeEntries
SET Exported = '$exportTime'
WHERE ID = '$RowID'");
if ($mysqli_app->query($sql) === TRUE) {
}
else {
echo date("Y-m-d H:i:s")."\n";
echo "An Error Occured please contact the administrator ". $mysqli_app->error."\n";
}
}
echo "Export Completed \n";
fclose($fp);
mysqli_close($mysqli_app);
mysqli_close($db_conn);
I am not sure how I would go about achieving this. I do not simply need the column names, but the column names and the data contained in each of these columns. I have not found any information on this in the other suggested question.
Since you're using mysqli_fetch_assoc, the names of the columns are the keys of the $row array in each iteration. You can put them in the file on the first iteration:
echo "Starting Write to CSV \n";
$first = true;
while($row = mysqli_fetch_assoc($result)){
if ($first) {
fputcsv($fp, array_keys($row));
$first = false;
}
fputcsv($fp, $row);
// ..
}
Once you have your $result set from your mysqli_query() call, you can use mysqli_fetch_fields() to return an array of descriptions of the columns in the result set.
Each element of that array is an object with several properties. One property is name, which you can use as a header for your csv file. You also get properties like max_length, length, and table. The linked documentation shows an example of using this metadata.
This metadata is especially useful if you have a query more complex than SELECT * FROM table: if you assign aliases to the columns in your query, they show up in the name properties of the metadata array elements.
This works even if the result set has no rows in it.
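A minimal sketch of that approach, reusing the $result and $fp handles from the question and leaving out the per-row Exported update:

// Build the header line from the result-set metadata.
// This works even when the query returns zero rows.
$header = [];
foreach (mysqli_fetch_fields($result) as $field) {
    $header[] = $field->name; // alias-aware column name
}
fputcsv($fp, $header);

// Then write the data rows as before.
while ($row = mysqli_fetch_assoc($result)) {
    fputcsv($fp, $row);
}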
Sounds pretty simple, as long as everything else is already working for you. You can create an array with the column names, and fputcsv($fp, $array_of_column_names) right before your while loop.
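For completeness, a minimal sketch of that, with an assumed hardcoded column list (it must match the SELECTed columns in order):

// Assumed header; the real column list depends on INF_TimeEntries.
$column_names = ['ID', 'Exported' /* , ... */];
fputcsv($fp, $column_names);

while ($row = mysqli_fetch_assoc($result)) {
    fputcsv($fp, $row);
}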
I have a large database that contains results of an experiment for 1500 individuals. Each individual has 96 data points. I wrote the following script to summarize and then format the data so it can be used by the analysis software. At first all was good until I had more than 500 individuals. Now I am running out of memory.
I was wondering if anyone has a suggestion on how to overcome the memory limit problem without sacrificing speed.
This is how the table looks in the database:
fishId   assayId   allele1   allele2
14_1_1   1         A         T
14_1_1   2         A         A
$mysql = new PDO('mysql:host=localhost; dbname=aquatech_DB', $db_user, $db_pass);
$query = $mysql->prepare("SELECT genotyped.fishid, genotyped.assayid, genotyped.allele1, genotyped.allele2, fishId.sex, " .
"fishId.role FROM `fishId` INNER JOIN genotyped ON genotyped.fishid=fishId.catId WHERE fishId.projectid=:project");
$query->bindParam(':project', $project, PDO::PARAM_INT);
$query->execute();
So this is the call to the database. It is joining information from two tables to build the file I need.
if (!$query) {
    $error = $query->errorInfo();
    print_r($error);
} else {
    $data = array();
    $rows = array();
    if ($results = $query->fetchAll()) {
        foreach ($results as $row) {
            $rows[] = $row[0];
            $role[$row[0]] = $row[5];
            $data[$row[0]][$row[1]]['alelleY'] = $row[2];
            $data[$row[0]][$row[1]]['alelleX'] = $row[3];
        }
        $rows = array_unique($rows);
        foreach ($rows as $ids) {
            $col2 = $role[$ids];
            $alelleX = $alelleY = $content = "";
            foreach ($snp as $loci) {
                $alelleY = convertAllele($data[$ids][$loci]['alelleY']);
                $alelleX = convertAllele($data[$ids][$loci]['alelleX']);
                $content .= "$alelleY\t$alelleX\t";
            }
            $body .= "$ids\t$col2\t" . substr($content, 0, -1) . "\n";
This parses the data. In the file I need, I have to have one row per individual rather than 96 rows per individual; that is why the data has to be formatted. At the end of the script I just write $body to a file.
I need the output file to be
FishId   Assay 1   Assay 2
14_1_1   A T       A A
$location = "results/" . "$filename" . "_result.txt";
$fh = fopen("$location", 'w') or die ("Could not create destination file");
if(fwrite($fh, $body))
Instead of reading the whole result from your database query into a variable with fetchAll(), fetch it row by row:
while($row = $query->fetch()) { ... }
fetchAll() fetches the entire result in one go, which has its uses but is greedy with memory. Why not just use fetch() which handles one row at a time?
You seem to be indexing the rows by the first column, creating another large array, and then removing duplicate items. Why not use SELECT DISTINCT in the query to remove duplicates before they get to PHP?
I'm not sure what the impact would be on speed - fetch() may be slower than fetchAll() - but you don't have to remove duplicates from the array which saves some processing.
I'm also not sure what your second foreach is doing but you should be able to do it all in a single pass. I.e. a foreach loop within a fetch loop.
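A sketch of that single-pass idea, assuming the question's query gets ORDER BY genotyped.fishid appended so each individual's rows arrive together; $snp, convertAllele(), and $location come from the question, while $flush is a hypothetical helper:

$fh = fopen($location, 'w') or die("Could not create destination file");

// Writes one finished individual to the file; called as soon as
// the next fishid shows up in the result stream.
$flush = function ($id, $roleVal, $data) use ($fh, $snp) {
    $content = "";
    foreach ($snp as $loci) {
        $content .= convertAllele($data[$loci]['alelleY']) . "\t"
                  . convertAllele($data[$loci]['alelleX']) . "\t";
    }
    fwrite($fh, "$id\t$roleVal\t" . substr($content, 0, -1) . "\n");
};

$currentId = null;
$currentRole = null;
$data = [];

while ($row = $query->fetch()) {
    if ($currentId !== null && $row[0] !== $currentId) {
        $flush($currentId, $currentRole, $data); // fishid changed: write the previous one
        $data = [];                              // and release its memory
    }
    $currentId = $row[0];
    $currentRole = $row[5];
    $data[$row[1]] = ['alelleY' => $row[2], 'alelleX' => $row[3]];
}
if ($currentId !== null) {
    $flush($currentId, $currentRole, $data); // don't forget the last individual
}
fclose($fh);

This only ever holds one individual's 96 data points in memory at a time.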
Other observations on your code above:
- the $role array seems to do the same indexing job as $rows - using $row[0] as the key effectively removes the duplicates in a single pass. Removing the duplicates by SELECT DISTINCT is probably better but, if not, do you need the $rows array and the array_unique function at all?
- if the same value of $row[0] can have different values of $row[5] then your indexing method will be discarding data - but you know what's in your data so I guess you've already thought of that (the same could be true of the $data array)
Is there a way to cache results of a mysql query manually to a txt file?
Ex:
$a = 1;
$b = 9;
$c = 0;
$cache_filename = 'cached_results/' . md5("$a,$b,$c") . '.txt';

if (!file_exists($cache_filename)) {
    $result = mysql_query("SELECT * FROM abc, def WHERE a=$a AND b=$b AND c=$c");
    while ($row = mysql_fetch_array($result)) {
        echo $row['name'];
    }
    // Write the results in $row to the txt file for re-use
} else {
    // Load results just like $row = mysql_fetch_array($result); from the txt file
}
The original query contains more WHERE conditions and joins that use multiple tables.
So, is this possible? If so, please explain.
If you're sure that your data has a long time-to-live, you can certainly cache data by saving it temporarily to a text file.
if (!file_exists($cachefile)) {
    // Save to cache
    $query = mysql_query('SELECT * FROM ...');
    while ($row = mysql_fetch_array($query)) {
        $result[] = $row;
    }
    file_put_contents($cachefile, serialize($result), LOCK_EX);
} else {
    // Retrieve from cache
    $result = unserialize(file_get_contents($cachefile));
}

foreach ($result as $row) {
    echo $row['name'];
}
Although using APC, Memcache, or XCache would be a better alternative if performance is a concern.
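Since the answer conditions this on the data having a long time-to-live, one small extension (my assumption, not part of the original answer) is to expire the cache based on file age, using the same old mysql_* API as the answer:

$ttl = 3600; // assumed cache lifetime in seconds

// Treat the cache as stale once it is older than the TTL.
$cacheIsFresh = file_exists($cachefile)
    && (time() - filemtime($cachefile)) < $ttl;

if (!$cacheIsFresh) {
    // Rebuild the cache from the database.
    $result = array();
    $query = mysql_query('SELECT * FROM ...');
    while ($row = mysql_fetch_array($query)) {
        $result[] = $row;
    }
    file_put_contents($cachefile, serialize($result), LOCK_EX);
} else {
    // Cache is still fresh: reuse it.
    $result = unserialize(file_get_contents($cachefile));
}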