I have a large database that contains results of an experiment for 1500 individuals. Each individual has 96 data points. I wrote the following script to summarize and then format the data so it can be used by the analysis software. At first all was good until I had more than 500 individuals. Now I am running out of memory.
I was wondering if anyone has a suggestion on how to overcome the memory limit problem without sacrificing speed.
This is how the table looks in the database:
fishId assayId allele1 allele2
14_1_1 1 A T
14_1_1 2 A A
$mysql = new PDO('mysql:host=localhost; dbname=aquatech_DB', $db_user, $db_pass);
$query = $mysql->prepare("SELECT genotyped.fishid, genotyped.assayid, genotyped.allele1, genotyped.allele2, fishId.sex, " .
"fishId.role FROM `fishId` INNER JOIN genotyped ON genotyped.fishid=fishId.catId WHERE fishId.projectid=:project");
$query->bindParam(':project', $project, PDO::PARAM_INT);
$query->execute();
So this is the call to the database. It is joining information from two tables to build the file I need.
if(!$query){
    $error = $query->errorInfo();
    print_r($error);
} else {
    $data = array();
    $rows = array();
    if($results = $query->fetchAll()){
        foreach($results as $row)
        {
            $rows[] = $row[0];
            $role[$row[0]] = $row[5];
            $data[$row[0]][$row[1]]['alelleY'] = $row[2];
            $data[$row[0]][$row[1]]['alelleX'] = $row[3];
        }
        $rows = array_unique($rows);
        foreach($rows as $ids)
        {
            $col2 = $role[$ids];
            $alelleX = $alelleY = $content = "";
            foreach($snp as $loci)
            {
                $alelleY = convertAllele($data[$ids][$loci]['alelleY']);
                $alelleX = convertAllele($data[$ids][$loci]['alelleX']);
                $content .= "$alelleY\t$alelleX\t";
            }
            $body .= "$ids\t$col2\t" . substr($content, 0, -1) . "\n";
        }
    }
}
This parses the data. In the output file I need one row per individual rather than 96 rows per individual, which is why the data has to be reshaped. At the end of the script I just write $body to a file.
I need the output file to be
FishId Assay 1 Assay 2
14_1_1 A T A A
$location = "results/" . "$filename" . "_result.txt";
$fh = fopen("$location", 'w') or die ("Could not create destination file");
if(fwrite($fh, $body))
Instead of reading the whole result from your database query into a variable with fetchAll(), fetch it row by row:
while($row = $query->fetch()) { ... }
fetchAll() fetches the entire result in one go, which has its uses but is greedy with memory. Why not just use fetch() which handles one row at a time?
You seem to be indexing the rows by the first column, creating another large array, and then removing duplicate items. Why not use SELECT DISTINCT in the query to remove duplicates before they get to PHP?
I'm not sure what the impact would be on speed - fetch() may be slower than fetchAll() - but you don't have to remove duplicates from the array which saves some processing.
I'm also not sure what your second foreach is doing but you should be able to do it all in a single pass. I.e. a foreach loop within a fetch loop.
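For example, here is a rough, untested sketch of that single-pass idea: it assumes you add ORDER BY genotyped.fishid to the query so all 96 rows of one individual arrive together, it reuses $snp and convertAllele() from your script, and the $flush closure is just an illustrative helper.
$body = "";
$current = null;     // fishid of the individual currently being collected
$data = array();     // assayid => alleles, for one individual only
$role = null;

// turn one collected individual into one output line
$flush = function ($id, $role, $data) use ($snp) {
    $content = "";
    foreach ($snp as $loci) {
        $content .= convertAllele($data[$loci]['alelleY']) . "\t"
                  . convertAllele($data[$loci]['alelleX']) . "\t";
    }
    return "$id\t$role\t" . substr($content, 0, -1) . "\n";
};

while ($row = $query->fetch(PDO::FETCH_NUM)) {
    if ($current !== null && $row[0] !== $current) {
        $body .= $flush($current, $role, $data);
        $data = array();                 // free the previous individual
    }
    $current = $row[0];
    $role = $row[5];
    $data[$row[1]]['alelleY'] = $row[2];
    $data[$row[1]]['alelleX'] = $row[3];
}
if ($current !== null) {
    $body .= $flush($current, $role, $data); // don't forget the last individual
}
Going one step further, writing each finished line straight to the output file with fwrite() instead of appending to $body would keep memory use flat no matter how many individuals there are.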
Other observations on your code above:
the $role array seems to do the same indexing job as $rows - using $row[0] as the key effectively removes the duplicates in a single pass. Removing the duplicates by SELECT DISTINCT is probably better but, if not, do you need the $rows array and the array_unique function at all?
if the same value of $row[0] can have different values of $row[5] then your indexing method will be discarding data - but you know what's in your data so I guess you've already thought of that (the same could be true of the $data array)
I have a .csv file that is about 5mb (~45,000 rows). What I need to do is run through each row of the file and check if the ID in each line is already in a table in my database. If it is, I can delete that row from the file.
I did a good amount of research on the most memory efficient way to do this, so I've been using a method of writing lines that don't need to get deleted to a temporary file and then renaming that file as the original. Code below:
$file = fopen($filename, 'r');
$temp = fopen($tempFilename, 'w');

while(($row = fgetcsv($file)) != FALSE){
    // id is the 7th value in the row
    $id = $row[6];

    // check table to see if id exists
    $sql = "SELECT id FROM table WHERE id = $id";
    $result = mysqli_query($conn, $sql);

    // if id is in the database, skip to next row
    if(mysqli_num_rows($result) > 0){
        continue;
    }

    // else write line to temp file
    fputcsv($temp, $row);
}

fclose($file);
fclose($temp);

// overwrite original file
rename($tempFilename, $filename);
Problem is, I'm running into a timeout while executing this bit of code. Anything I can do to make the code more efficient?
You fire a database query per line, i.e. 45,000 queries... that takes too much time.
Better to run one query before the loop and read the existing ids into a lookup array, then only check that array inside the loop.
Pseudo code:
$st = query('SELECT id FROM table');
while ($row = $st->fetch()) {
    $lookup[ $row['id'] ] = $row['id'];
}

// now read the CSV ($h is the handle of the CSV file)
while ($row = fgetcsv($h)) {
    $id = $row[6];
    if (isset($lookup[ $id ])) {
        // id exists in the database, skip this row
        continue;
    }
    // write the non-existing row to the temp file...
}
Edit:
Assume memory isn't sufficient to hold 1 million integers from the database. How can it still be done efficiently?
Collect the ids from the CSV into an array. Run a single query to find which of those ids exist in the database and collect them (the result can be at most as large as the CSV). Now array_diff() the ids from the file against the ids from the database - the ids that remain exist in the CSV but not in the database.
Pseudo code:
$ids_csv = [];
while ($row = fgetcsv($h)) {
    $id = $row[6];
    $ids_csv[] = intval($id);
}

$sql = sprintf('SELECT id FROM table WHERE id IN(%s)', implode(',', $ids_csv));

$ids_db = [];
$st = query($sql);
while ($row = $st->fetch()) {
    $ids_db[] = $row['id'];
}

// ids that are in the CSV but not in the database
$missing_in_db = array_diff($ids_csv, $ids_db);
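From there, a second pass over the CSV keeps only the rows whose id came back as missing. A sketch, reusing the file handling and variable names from the question and the code above:
$keep = array_flip($missing_in_db);      // flip for O(1) isset() lookups

$file = fopen($filename, 'r');
$temp = fopen($tempFilename, 'w');
while ($row = fgetcsv($file)) {
    if (isset($keep[intval($row[6])])) { // id is the 7th value in the row
        fputcsv($temp, $row);            // keep only rows not present in the DB
    }
}
fclose($file);
fclose($temp);
rename($tempFilename, $filename);        // overwrite the original file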
I would use LOAD DATA INFILE (https://dev.mysql.com/doc/refman/8.0/en/load-data.html) to read the csv file into a separate table. Your database user needs the FILE privilege on the database to use it.
Then you can run one query to delete the ids that already exist (DELETE ... FROM ... JOIN ...), and export the rows that were left intact.
Another option is to use your loop to insert the csv file into a separate table, and then proceed with the delete step.
Update: I use LOAD DATA INFILE with csv files of up to 2 million rows (at the moment) and do some bulk data manipulation with big queries. It's blazingly fast and I would recommend this route for files containing more than 100k lines.
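A rough, untested sketch of that route: it assumes the CSV has seven columns with the id in the seventh (as in the question), and the table and file names here are invented, so adjust them to your schema.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', $db_user, $db_pass);

// 1. Stage the CSV in a scratch table (the MySQL user needs the FILE privilege).
$pdo->exec("CREATE TABLE csv_import (
                c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT, c5 TEXT, c6 TEXT,
                id INT, KEY (id)
            )");
$pdo->exec("LOAD DATA INFILE '/path/to/file.csv'
            INTO TABLE csv_import
            FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
            LINES TERMINATED BY '\\n'");

// 2. Delete staged rows whose id already exists in the main table.
$pdo->exec("DELETE csv_import FROM csv_import
            INNER JOIN main_table ON main_table.id = csv_import.id");

// 3. Export the surviving rows (SELECT ... INTO OUTFILE also needs the FILE privilege).
$pdo->exec("SELECT c1, c2, c3, c4, c5, c6, id FROM csv_import
            INTO OUTFILE '/path/to/cleaned.csv'
            FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
            LINES TERMINATED BY '\\n'");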
I have a query which selects all ids from a table. Once I have all the ids, they are stored in an array which I foreach over.
Then there is a second array which pulls data from a url (around 5k rows) and should update the DB based on the ids.
The problem: the second foreach loops once for each ID, which is not what I want. What I want is to loop once over all the ids.
Here is the code I have so far
$query = " SELECT id, code FROM `countries` WHERE type = 1";
$result = $DB->query($query);
$url = "https://api.gov/v2/data?api_key=xxxxx";
$api_responce = file_get_contents($url);
$api_responce = json_decode($api_responce);
$data_array = $api_responce->data;
$rows = Array();
while($row = $DB->fetch_object($result)) $rows[] = $row;
foreach ($rows as $row) {
foreach ($data_array as $key => $dataArr) {
$query = "UPDATE table SET data_field = $dataArr->value WHERE country_id = $row->id LIMIT 1";
}
}
The query returns 200 ids, and because of that the second foreach (foreach ($data_array as $key => $dataArr) { ... }) executes everything 200 times.
It must execute once for all 200 ids, not 200 * 5000 times.
Since the question is about using a loop, we will talk about the loop instead of trying to find another way. Actually, I see no reason to find another way.
Loops and recursions are great, powerful tools. As usual with great tools, you also need ways of controlling them.
Take cars, for example: they have brakes. The solution is not to go slowly and stay in the horse era, but to have good brakes.
In the same spirit, all you need to master the power of recursion and loops is to stop them properly. You can use if conditions and the "break" statement in PHP.
For example, here we have a case of arrays containing arrays, where the first element of each array repeats the last element of the previous one, e.g. (1,2,3), (3,4,5), and we want to control the loop so that the data is printed properly, e.g. (1,2,3,4,5).
We will use an if condition and a counter:
<?php
$array = array( array(-1,0,1), array(1,2,3,4,5), array(5,6,7,8,9,10), array(10,11,12,13,14,15) );
static $key_counter;

foreach( $array as $key ){
    $key_counter = 0;                 // position inside the current sub-array
    foreach( $key as $key2 ){
        if ( $key_counter != 0 ) {    // skip the first element of each sub-array
            echo $key2 . ', ';
        }
        $key_counter = $key_counter + 1;
    }
}
Since I don't have access to your DB, it is actually hard for me to run and debug the code, so the best I can say is that you need an if condition which checks whether the ID of the object is the ID we want to process, and only then proceed to processing - see the sketch below.
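A minimal, untested sketch of that idea applied to the loop from the question; it assumes each API record exposes a country_id property that can be matched against the row id (that property name is a guess - check the actual API payload):
foreach ($rows as $row) {
    foreach ($data_array as $dataArr) {
        // only act on the API record that belongs to this country
        if ($dataArr->country_id != $row->id) {
            continue;
        }
        $query = "UPDATE table SET data_field = '$dataArr->value' WHERE country_id = $row->id LIMIT 1";
        $DB->query($query);   // actually run the update (the original only built the string)
        break;                // stop scanning once the matching record has been handled
    }
}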
P.S. Static variables are useful for loops and especially for recursions, since they don't get deleted from memory once the function's execution ends.
The static keyword is also used to declare variables in a function
which keep their value after the function has ended.
I have a slow query problem (maybe I am wrong). Here is what I want:
I have to display more than 40 drop-down lists on a single page with the same fields, fetched from the db, but I feel that the query takes too much time to execute and also uses too many resources.
Here is an example...
$sql_query = "SELECT * FROM tbl_name";
$rows = mysql_query($sql_query);
Now I use a while loop to print all the records of that query in a drop-down list,
but I have to reprint the same records in the next drop-down list, up to 40 lists, so I use
mysql_data_seek() to move back to the first record and then print the next list, and so on for all 40 lists.
But this seemed slow to me, so I tried a second method: running the same query again for each of the 40 lists:
$sql_query2 = "SELECT * FROM tbl_name";
$rows2 = mysql_query($sql_query2);
Do you think I am wrong about the speed of the query, or can you suggest another way that is faster than these methods?
Try putting the rows into an array like so:
<?php
$rows = array();
$fetch_rows = mysql_query("SELECT * FROM table");
while ($row = mysql_fetch_assoc($fetch_rows)) {
    $rows[] = $row;
}
Then just use the $rows array in a foreach ($rows as $row) loop.
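For instance, a rough sketch of building all 40 drop-downs from the same cached array; the column names 'id' and 'name' are placeholders for whatever your table actually returns:
// build all 40 drop-downs from the cached $rows array
for ($i = 1; $i <= 40; $i++) {
    echo '<select name="list_' . $i . '">';
    foreach ($rows as $row) {
        // 'id' and 'name' are placeholder column names - use your real ones
        echo '<option value="' . htmlspecialchars($row['id']) . '">'
           . htmlspecialchars($row['name']) . '</option>';
    }
    echo '</select>';
}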
There is considerable processing overhead associated with fetching rows from a MySQL result resource. Typically it would be quite a bit faster to store the results as an array in PHP rather than to query and fetch the same rowset again from the RDBMS.
$rowset = array();
$result = mysql_query(...);
if ($result) {
    while ($row = mysql_fetch_assoc($result)) {
        // Append each fetched row onto $rowset
        $rowset[] = $row;
    }
}
If your query returns lots of rows (thousands or tens of thousands or millions) and you need to use all of them, you may reach memory limitations by storing all rows into an array in PHP. In that case it may be more memory-conservative to fetch rows individually from MySQL, but it will still probably be more CPU intensive.
Instead of printing the records, going back, and printing them again, put the records in one big string variable, then echo it for each dropdown.
$str = "";
while($row = mysql_fetch_assoc($rows)) {
// instead of echo...
$str .= [...];
}
// now for each dropdown
echo $str;
// will print all the rows.
I was wondering how I can handle a mysql request in php precisely as an object.
Ex:
//supposing...
$beginning=$_GET['start'];//value equal to 3
$ending=$_GET['end'];//value equal to 17
$conn=new mysqli("localhost","user","password","databasename");
$query=$conn->query("select name, favoriteFood, weight from tablename");
1 - Supposing that tablename has 23 rows, how can I print only 14 rows, beginning for example at the 3rd row and ending at the 17th row, as follows?
Ex:
// supposing... this, I guess, would result in an error, but it is a sketch of my idea
for($i=$beginning,$colsCol=$query->fetch_array(MYSQLI_ASSOC); $i<$ending; $i++)
printf("%s %s %s<\br>",$colsCol['name'][$i],$colsCol['favoriteFood'][$i],$colsCol['weight'][$i]);
2 - And later, how can I order the resulting rows using the $query variable?
P.S.: I know that to get ordered results I could use ORDER BY columnname, but in this case I would like to order the resulting rows after the query has been done.
If you want to sort later, after the query's done, then you'd need to store the results in a PHP data structure and do the sorting there. Or re-run the query with new sorting options.
As for fetching only certain rows, it'd be far more efficient to retrieve only the rows you want. Otherwise (for large result sets) you're forcing a lot of data to be pulled off disk, sent over the wire, etc... only to get thrown away. Rather wasteful.
However, if you insist on doing things this way:
$rownum = 0;
$data = array();
while ($row = $query->fetch_array(MYSQLI_ASSOC)) {
    $rownum++;                            // 1-based position of the row
    if (($rownum < 3) || ($rownum > 17)) {
        continue;                         // outside the wanted range
    }
    $data[] = $row;
}
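And for the sorting-afterwards part mentioned above, something along these lines would work on the collected $data (sorting by the weight column is just an example):
// sort the collected rows by the 'weight' column, ascending
usort($data, function ($a, $b) {
    if ($a['weight'] == $b['weight']) {
        return 0;
    }
    return ($a['weight'] < $b['weight']) ? -1 : 1;
});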
For a 2D array you can use this:
function asort2d($records, $field, $reverse = false) {
    // Sort an array of arrays with text keys, like a 2d array from a table query:
    $hash = array();
    foreach ($records as $key => $record) {
        $hash[$record[$field].$key] = $record;
    }
    ($reverse) ? krsort($hash) : ksort($hash);
    $records = array();
    foreach ($hash as $record) {
        $records[] = $record;
    }
    return $records;
} // end function asort2d
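Used with rows collected from a query as above, a call could look like this (the 'weight' field is just an example):
$byWeight     = asort2d($data, 'weight');        // ascending
$byWeightDesc = asort2d($data, 'weight', true);  // descending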
Use SQL for SQL-tasks:
"how to printing only 14 rows, begging for example by 3rd row and ending in 17th row"
$stmt = $database->prepare('SELECT `name`, `favoriteFood`, `weight` FROM `tablename` LIMIT :from, :count');
$stmt->bindValue(':from', (int)$_GET['start'] - 1, PDO::PARAM_INT);
$stmt->bindValue(':count', (int)$_GET['end'] - (int)$_GET['start'], PDO::PARAM_INT);
$stmt->execute();
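Assuming $database is a PDO connection (as the prepare()/bindValue() calls suggest), the rows can then be fetched and printed as usual, for example:
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);   // at most 14 rows come back, so fetchAll() is fine here
foreach ($rows as $row) {
    printf("%s %s %s<br>", $row['name'], $row['favoriteFood'], $row['weight']);
}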