So I wrote a script to extract data from raw genome files. Here's what the raw genome file looks like:
# rsid chromosome position genotype
rs4477212 1 82154 AA
rs3094315 1 752566 AG
rs3131972 1 752721 AG
rs12124819 1 776546 AA
rs11240777 1 798959 AG
rs6681049 1 800007 CC
rs4970383 1 838555 AC
rs4475691 1 846808 CT
rs7537756 1 854250 AG
rs13302982 1 861808 GG
rs1110052 1 873558 TT
rs2272756 1 882033 GG
rs3748597 1 888659 CT
rs13303106 1 891945 AA
rs28415373 1 893981 CC
rs13303010 1 894573 GG
rs6696281 1 903104 CT
rs28391282 1 904165 GG
rs2340592 1 910935 GG
The raw text file has hundreds of thousands of these rows, but I only need about 10,000 specific ones. I have a list of rsids, and I just need the genotype from each matching line. So I loop through the rsid list and use preg_match to find the line I need:
$rawData = file_get_contents('genome_file.txt');
$rsids = $this->get_snps();
while ($row = $rsids->fetch_assoc()) {
$searchPattern = "~rs{$row['rsid']}\t(.*?)\t(.*?)\t(.*?)\n~i";
if (preg_match($searchPattern,$rawData,$matchedGene)) {
$genotype = $matchedGene[3];
// Do something with genotype
}
}
NOTE: I stripped out a lot of code to just show the regexp extraction I'm doing. I'm also inserting each row into a database as I go along. Here's the code with the database work included:
$rawData = file_get_contents('genome_file.txt');
$rsids = $this->get_snps();
$query = "INSERT INTO wp_genomics_results (file_id,snp_id,genotype,reputation,zygosity) VALUES (?,?,?,?,?)";
$stmt = $ngdb->prepare($query);
$stmt->bind_param("iissi", $file_id,$snp_id,$genotype,$reputation,$zygosity);
$ngdb->query("START TRANSACTION");
while ($row = $rsids->fetch_assoc()) {
$searchPattern = "~rs{$row['rsid']}\t(.*?)\t(.*?)\t(.*?)\n~i";
if (preg_match($searchPattern,$rawData,$matchedGene)) {
$genotype = $matchedGene[3];
$stmt->execute();
$insert++;
}
}
$stmt->close();
$ngdb->query("COMMIT");
$rsids->free();
$ngdb->close();
}
So unfortunately my script runs very slowly. Running 50 iterations takes 17 seconds. So you can imagine how long running 18,000 iterations is gonna take. I'm looking into ways to optimise this.
Is there a faster way to extract the data I need from this huge text file? What if I explode it into an array of lines, and use preg_grep(), would that be any faster?
Something I tried is combining all 18,000 rsids into a single expression, i.e. (rs123|rs124|rs125), like this:
$rsids = get_rsids();
$rsid_group = implode('|',$rsids);
$pattern = "~({$rsid_group })\t(.*?)\t(.*?)\t(.*?)\n~i";
preg_match($pattern,$rawData,$matches);
But unfortunately it gave me an error message about exceeding the PCRE expression limit. The needle was way too big. Another thing I tried is adding the S modifier to the expression. I read that this analyses the pattern in order to increase performance. It didn't speed things up at all. Maybe my pattern isn't compatible with it?
So then the second thing I need to try and optimise is the database inserts. I added a transaction hoping that would speed things up but it didn't speed it up at all. So I'm thinking maybe I should group the inserts together, so that I insert multiple rows at once, rather than inserting them individually.
Then another idea is something I read about: using LOAD DATA INFILE to load rows from a text file. In that case, I just need to generate a text file first. Would it work out faster to generate a text file in this case, I wonder?
EDIT: It seems like what's taking up most of the time is the regular expressions. Running that part of the program by itself takes a really long time: 10 rows take 4 seconds.
This is slow because you're searching a vast array of data over and over again.
It looks like you have a text file, not a dbms table, containing lines like these:
rs4477212 1 82154 AA
rs3094315 1 752566 AG
rs3131972 1 752721 AG
rs12124819 1 776546 AA
It looks like you have some other data structure containing a list of values like rs4477212. I think that's already in a table in the dbms.
I think you want exact matches for the rsxxxx values, not prefix or partial matches.
I think you want to process many different files of raw data, and extract the same batch of rsxxxx values from each of them.
So, here's what you do, in pseudocode. Don't load the whole raw data file into memory, rather process it line by line.
Read your rows of rsid values from the dbms, just once, and store them in an associative array.
for each file of raw data....
for each line of data in the file...
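split the line of data to obtain the rsid. In php, $array = explode("\t", $line, 2); will yield your rsid in $array[0], and do it fast. (Your regex suggests the columns are tab-separated; use " " as the delimiter instead if they are separated by single spaces.)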
Look in your array of rsid values for this value. In php, if ( array_key_exists( $array[0], $rsid_array )) { ... will do this.
If the key does exist, you have a match.
extract the last column from the raw text line ('GC' or whatever)
write it to your dbms.
Notice how this avoids regular expressions, and how it processes your raw data line by line. You only have to touch each line of raw data once. That's good, because your raw data is also your largest quantity of data. It exploits php's associative array feature to do the matching. All that will be much faster than your method.
To speed up inserting tens of thousands of rows into a table, read this: Optimizing InnoDB Insert Queries
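One way to group the inserts, as a minimal sketch rather than a definitive implementation: build a single multi-row INSERT with placeholders and execute it once per batch instead of once per row. The helper name buildBatchInsert is hypothetical; the table and column names are taken from the question.

```php
<?php
// Hypothetical helper: builds one multi-row INSERT statement with "?"
// placeholders, $rowCount tuples of count($columns) placeholders each.
function buildBatchInsert(string $table, array $columns, int $rowCount): string
{
    if ($rowCount < 1 || count($columns) === 0) {
        throw new InvalidArgumentException('Need at least one row and one column.');
    }
    // One tuple such as "(?,?,?)"
    $tuple = '(' . implode(',', array_fill(0, count($columns), '?')) . ')';
    // Repeat the tuple once per row in the batch
    $tuples = implode(',', array_fill(0, $rowCount, $tuple));
    return "INSERT INTO {$table} (" . implode(',', $columns) . ") VALUES {$tuples}";
}

echo buildBatchInsert('wp_genomics_results', array('file_id', 'snp_id', 'genotype'), 2), "\n";
```

The values for a batch are then bound in row order. With PDO, for instance, that would be $pdo->prepare($sql)->execute($flatValues), where $flatValues is the flattened list of row values. The batch size is a tuning choice; a few hundred rows per statement is a common starting point.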
+1 to @Ollie Jones' answer. He posted while I was working on my answer. So here's some code to get you started.
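// Build a hash of the wanted rsids once, so each lookup is a simple key check.
$rsidHash = array();
$rsids = $this->get_snps();
while ($row = $rsids->fetch_assoc()) {
    $key = 'rs' . $row['rsid'];
    $rsidHash[$key] = true;
}
// Stream the raw file one tab-separated line at a time.
$rawDataFd = fopen('genome_file.txt', 'r');
while ($rawData = fgetcsv($rawDataFd, 80, "\t")) {
    if (array_key_exists($rawData[0], $rsidHash)) {
        $genotype = $rawData[3];
        // do something with genotype
    }
}
fclose($rawDataFd);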
I wanted to give the LOAD DATA INFILE approach a try to see how well it works, so I came up with what I thought was a nice, elegant approach. Here's the code:
$file = 'C:/wamp/www/nutri/wp-content/plugins/genomics/genome/test';
$data_query = "
LOAD DATA LOCAL INFILE '$file'
INTO TABLE wp_genomics_results
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
IGNORE 18 ROWS
(@rsid,@chromosome,@locus,@genotype)
SET file_id = '$file_id',
snp_id = (SELECT id FROM wp_pods_snp WHERE rsid = SUBSTR(@rsid,2)),
genotype = @genotype
";
$ngdb->query($data_query);
I put a foreign key constraint on the snp_id column (that's the ID for my table of RSIDs) so that it only enters genotypes for rsids that I need. Unfortunately this foreign key constraint caused some kind of error which locked the tables. Ah well. It might not have been a good approach anyhow, since there are on average 200,000 rows in each of these genome files. I'll go with Ollie Jones' approach since that seems to be the most effective and viable approach I've come across.
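A possible middle ground, sketched under assumptions: filter the raw file in PHP first and write only the wanted rows to a small tab-separated file, which LOAD DATA INFILE could then load without relying on a foreign key to reject the other rows. The helper name writeFilteredTsv is hypothetical, and the sketch assumes tab-separated columns and header lines that start with #.

```php
<?php
// Hypothetical helper: copies only the wanted rsids from a raw genome file
// into a tab-separated file (rsid, genotype) suitable for bulk loading.
// $wanted is an associative array keyed by rsid, e.g. array('rs3094315' => true).
function writeFilteredTsv(string $rawPath, string $outPath, array $wanted): int
{
    $in = fopen($rawPath, 'r');
    $out = fopen($outPath, 'w');
    if ($in === false || $out === false) {
        throw new RuntimeException('Unable to open input or output file.');
    }
    $written = 0;
    while (($line = fgets($in)) !== false) {
        $line = rtrim($line, "\r\n");
        if ($line === '' || $line[0] === '#') {
            continue; // skip blank lines and header comments
        }
        $cols = explode("\t", $line);
        if (count($cols) >= 4 && isset($wanted[$cols[0]])) {
            // Keep only the rsid and the genotype (4th column)
            fwrite($out, $cols[0] . "\t" . $cols[3] . "\n");
            $written++;
        }
    }
    fclose($in);
    fclose($out);
    return $written; // number of rows written
}
```

The output file holds roughly as many rows as there are wanted rsids, so the bulk load works on thousands of rows rather than the full 200,000.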
I need to come up with a way to make a large task faster to beat the timeout.
I have very limited access to the server due to the restrictions of the hosting company.
I have a system set up where a cron visits a PHP file that grabs a csv that contains data on some products. The csv does not contain all of the fields that the product would have. Just a handful of essential ones.
I've read a fair number of articles on timeouts and handling CSVs, and currently (in an attempt to shave time) I have made a table (let's call it csv_data) to hold the csv data. I have a script that truncates the csv_data table, then inserts data from the csv, so each night the latest recordset from the csv is in that table (the csv file gets updated nightly). So far, no timeout problems; the task only takes about 4-5 seconds.
The timeouts occur when I have to sift through the data to make updates to the products table. The steps it is running right now are like this:
1. Get the sku from csv_data table (that holds thousands of records)
2. Select * from Products where products.sku = csv.sku (products table also holds thousands of records to loop through)
3. Get numrows.
If numrows==0{no record in products, so skip}.
If numrows>1{duplicate entries, don't change anything, but later on report the sku}
If numrows==1{Update selected fields in the products table with csv data}
4. Go to the next record in csv_data all over again
(I figured outlining the process is shorter and easier than dropping in the code.)
I looked into MySQL views and stored procedures, but I am not skilled enough in them to know if they will handle the 'if' statement portion.
Is there anything I can do to make this faster to avoid the timeouts?
edit:
I should mention that set_time_limit(0); isn't doing it. And if it helps, the server uses IIS7 and fastcgi
Thanks for your help.
Update after using suggestions from Jakob and Shawn:
I'm doing something wrong. The speed is definitely faster and the csv sku is incrementing,
but when I tried to implement Shawn's solution, the query gives me a "PHP Warning: mysql_result() expects parameter 1 to be resource, boolean" error.
Can you help me spot what I am doing wrong?
Here is the section of code:
$csvdata="SELECT * FROM csv_update";
$csvdata_result=mysql_query($csvdata);
mysql_query($csvdata);
$csvdata_num = mysql_num_rows($csvdata_result);
$i=0;
while($i<$csvdata_num){
$csv_code=@mysql_result($csvdata_result,$i,"skucode");
$datacheck=NULL;
$datacheck=substr($csv_code,0,1);
if($datacheck>='0' && $datacheck<='9'){
$csv_price=@mysql_result($csvdata_result,$i,"price");
$csv_retail=@mysql_result($csvdata_result,$i,"retail");
$csv_stock=@mysql_result($csvdata_result,$i,"stock");
$csv_weight=@mysql_result($csvdata_result,$i,"weight");
$csv_manufacturer=@mysql_result($csvdata_result,$i,"manufacturer");
$csv_misc1=@mysql_result($csvdata_result,$i,"misc1");
$csv_misc2=@mysql_result($csvdata_result,$i,"misc2");
$csv_selectlist=@mysql_result($csvdata_result,$i,"selectlist");
$csv_level5=@mysql_result($csvdata_result,$i,"level5");
$csv_frontpage=@mysql_result($csvdata_result,$i,"frontpage");
$csv_level3=@mysql_result($csvdata_result,$i,"level3");
$csv_minquantity=@mysql_result($csvdata_result,$i,"minquantity");
$csv_quantity1=@mysql_result($csvdata_result,$i,"quantity1");
$csv_discount1=@mysql_result($csvdata_result,$i,"discount1");
$csv_quantity2=@mysql_result($csvdata_result,$i,"quantity2");
$csv_discount2=@mysql_result($csvdata_result,$i,"discount2");
$csv_quantity3=@mysql_result($csvdata_result,$i,"quantity3");
$csv_discount3=@mysql_result($csvdata_result,$i,"discount3");
$count_check="SELECT COUNT(*) AS totalCount FROM products WHERE skucode = '$csv_code'";
$count_result=mysql_query($count_check);
mysql_query($count_check);
$totalCount=@mysql_result($count_result,0,'totalCount');
$loopCount = ceil($totalCount / 25);
for($j = 0; $j < $loopCount; $j++){
$prod_check="SELECT skucode FROM products WHERE skucode = '$csv_code' LIMIT ($loopCount*25), 25;";
$prodresult=mysql_query($prod_check);
mysql_query($prod_check);
$prodnum =@mysql_num_rows($prodresult);
$prod_id=@mysql_result($prodresult,0,"catalogid");
if($prodnum<1){
echo "NOT FOUND:$csv_code<br>";
$count_sku_not_found=$count_sku_not_found+1;
$list_sku_not_found=$list_sku_not_found." $csv_code";}
if($prodnum>1){
echo "DUPLICATE:$csv_ccode<br>";
$count_duplicate_skus=$count_duplicate_skus+1;
$list_duplicate_skus=$list_duplicate_skus." $csv_code";}
if ($prodnum==1){
///This prevents an overwrite from happening if the csv file doesn't produce properly
if ($csv_price!="" OR $csv_price!=NULL)
{$sql_price='price="'.$csv_price.'"';}
if ($csv_retail!="" OR $csv_retail!=NULL)
{$sql_retail=',retail="'.$csv_retail.'"';}
if ($csv_stock!="" OR $csv_stock!=NULL)
{$sql_stock=',stock="'.$csv_stock.'"';}
if ($csv_weight!="" OR $csv_weight!=NULL)
{$sql_weight=',weight="'.$csv_weight.'"';}
if ($csv_manufacturer!="" OR $csv_manufacturer!=NULL)
{$sql_manufacturer=',manufacturer="'.$csv_manufacturer.'"';}
if ($csv_misc1!="" OR $csv_misc1!=NULL)
{$sql_misc1=',misc1="'.$csv_misc1.'"';}
if ($csv_misc2!="" OR $csv_misc2!=NULL)
{$sql_pother2=',pother2="'.$csv_misc2.'"';}
if ($csv_selectlist!="" OR $csv_selectlist!=NULL)
{$sql_selectlist=',selectlist="'.$csv_selectlist.'"';}
if ($csv_level5!="" OR $csv_level5!=NULL)
{$sql_level5=',level5="'.$csv_level5.'"';}
if ($csv_frontpage!="" OR $csv_frontpage!=NULL)
{$sql_frontpage=',frontpage="'.$csv_frontpage.'"';}
$import="UPDATE products SET $sql_price $sql_retail $sql_stock $sql_weight $sql_manufacturer $sql_misc1 $sql_misc2 $sql_selectlist $sql_level5 $sql_frontpage $sql_in_stock WHERE skucode='$csv_code'";
mysql_query($import) or die(mysql_error("error updating in products table"));
echo "Update ".$csv_code." successful ($i)<br>";
$count_success_update_skus=$count_success_update_skus+1;
$list_success_update_skus=$list_success_update_skus." $csv_code";
//empty out variables
$sql_price='';
$sql_retail='';
$sql_stock='';
$sql_weight='';
$sql_manufacturer='';
$sql_misc1='';
$sql_misc2='';
$sql_selectlist='';
$sql_level5='';
$sql_frontpage='';
$sql_in_stock='';
$prodnum=0;
}
}
$i++;
}
Is it timing out before the first row is returned, or is it between rows during the read? One good practice would be to handle your query in chunks: do a count first to see how many records you are dealing with for the SKU, then loop through smaller chunks (the size of these chunks would depend on how many things you have to do with each row). Your updated workflow would look more like this:
Get next SKU from CSV
Get a total count: SELECT COUNT(*) AS totalCount FROM products WHERE products.sku = csv.sku
Determine chunk size (using 25 for this demo)
loopCount = ceil(totalCount / 25)
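Loop through all results using a loop like this: for($i = 0; $i < $loopCount; $i++)
Inside your loop you should be running a query like this: SELECT * FROM products WHERE products.sku = csv.sku LIMIT ($i*25), 25. Compute the offset $i*25 in PHP and put the resulting number into the query, because MySQL's LIMIT only accepts integer constants, not expressions.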
You will want to use a constant order for your SELECT chunks; your unique ID would probably be best.
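The chunking steps above can be sketched as follows. This is an illustration under assumptions, not the answerer's code: chunkLimits is a hypothetical helper, and the id column in the usage loop stands in for whatever unique ID the products table has.

```php
<?php
// Hypothetical helper: returns the "LIMIT offset, count" clauses needed to
// walk $totalCount rows in chunks of $chunkSize.
function chunkLimits(int $totalCount, int $chunkSize = 25): array
{
    if ($chunkSize < 1) {
        throw new InvalidArgumentException('Chunk size must be at least 1.');
    }
    $clauses = array();
    $loopCount = (int) ceil($totalCount / $chunkSize);
    for ($i = 0; $i < $loopCount; $i++) {
        // The offset is computed here in PHP, so the SQL only contains plain integers.
        $clauses[] = sprintf('LIMIT %d, %d', $i * $chunkSize, $chunkSize);
    }
    return $clauses;
}

// "id" is an assumed unique column used to keep the chunk order stable.
foreach (chunkLimits(60) as $limit) {
    echo "SELECT * FROM products WHERE sku = ? ORDER BY id {$limit}\n";
}
```

Note that the offset uses the loop index, not the total number of loops; using the loop count would request the same out-of-range chunk every time.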
I think you can solve this problem with cron: http://en.wikipedia.org/wiki/Cron. A cron job has no timeout of its own.
I'm working with an MLS real estate listing provider (RETS). Every 48 hours we will be pulling data from their server in a cron job to an SQL database. I'm charged with the task of writing a php script that will be run after the data from the remote server is dumped into our "raw" tables. In these raw tables, all columns are VARCHAR(255), and we want to move the data into optimized tables. Before I send my script to the guy in charge of setting up the cron job, I wondered if there is a more efficient way to do it so I don't look foolish.
Here's what I'm doing:
There are 8 total tables, 4 raw and 4 optimized, all in the same database. The raw table column names are non-descriptive, like c1, c2, c3, c4, etc. This is intentional because the data that goes in each column may change. The raw table column names are mapped to the correct optimized table columns with php, something like this:
$tables['optimized_table_name1']['raw_table'] = 'raw_table_name1';
$tables['optimized_table_name1']['data_map'] = array(
'c1' => array( // <--- "c1" is the raw table column name
'column_name' => 'id',
// I use other values for table creation,
// but they don't matter to the question.
// Just explaining why the array looks like this
//'type' => 'VARCHAR',
//'max_length' => 45,
//'primary_key' => FALSE,
// etc.
),
'c9' => array('column_name' => 'address'),
'c25' => array('column_name' => 'baths'),
'c2' => array('column_name' => 'bedrooms') //etc.
);
I'm doing the same thing for each of the 4 tables: SELECT * FROM the raw table, read the config array and create a huge SQL insert statement, TRUNCATE the optimized table, then run the INSERT query.
foreach ($tables as $table_name => $config):
$raw_table = $config['raw_table'];
$data_map = $config['data_map'];
$fields = array();
$values = array();
$count = 0;
// Get the raw data and create an array mapped to the optimized table columns.
$query = mysql_query("SELECT * FROM dbname.{$raw_table}");
while ($row = mysql_fetch_assoc($query))
{
// Reading column names from my config file on first pass
// Setting up the array, will only run once per table
if (empty($fields))
{
foreach ($row as $key => $val)
{// Produces an array with the column names
$fields[] = $data_map[$key]['column_name'];
}
}
foreach ($row as $key => $val)
{// Assigns data to an array to be imploded later
$values[$count][] = $val;
}
$count++;
}
// Create the INSERT statement string
$insert = array();
$sql = "\nINSERT INTO `{$table_name}` (`".implode('`,`', $fields)."`) VALUES\n";
foreach ($values as $key => $vals)
{
foreach ($vals as &$val)
{
// Escape the data
$val = mysql_real_escape_string($val);
}
// Using implode for simplicity, could avoid the nested foreach if I wanted to
$insert[] = "('".implode("','", $vals)."')";
}
$sql .= implode(",\n", $insert).";\n";
// TRUNCATE optimized table and run INSERT query here
endforeach;
Which produces something like this (only larger - about 15,000 records max per table, and one insert statement per table):
INSERT INTO `optimized_table_name1` (`id`,`beds`,`baths`,`town`) VALUES
('50300584','2','1','Fairfield'),
('87560584','3','2','New Haven'),
('76545584','2','1','Bristol');
Now I'll admit, I have been under the wing of an ORM for a long time and am not up on my vanilla mysql/php. This is a pretty simple task and I want to keep the code simple.
My questions:
Is the TRUNCATE/INSERT method a good way to do this?
Is there anything about my code that you can see being a problem? I know you see nested foreach loops and just shudder, but I want to keep the code as small and clean as possible and avoid lots of messy string concatenation (to produce the insert query). Like I said, I also haven't used native php functions for SQL in a long time.
I feel like it really doesn't matter if the code is not optimized if it is run at 3AM every 2 days. Does it matter? Is this code going to perform OK?
Is there a better overall strategy to accomplish this task?
Do I need to be using transactions?
How can I be aware of errors that may occur in cron scripts?
Apologies if I don't use correct cron jargon; it's new to me.
Keep it simple. ORM would be swell for this task.
Answers:
Yes.
Your code is readable. At least I did not have any problems to read it.
We had a script that ran early in the morning. It was not optimized and consumed a lot of memory. After FOUR years it started to consume over 512 MB. I spent 2 hours optimizing it, so now it consumes 7 MB (pretty good optimization, huh? :) ). I personally think it is "ok" that your script is not optimized now. If this script starts failing, you'll figure out what the problem is. Maybe it will exhaust memory, maybe your SQL queries will cause deadlocks... maybe you will later optimize it to READ from slave servers... I don't know, but it works fine now, and that's okay.
I'd do something similar to your code. But I'd probably generate a file first and load the data into the server by running the shell command mysql -u username --password=password < import_file.sql. That way I'd have my file stored somewhere on disk, so I can always take a look at it, and maybe even edit it for a one-time corrected load. But you can still do it your way by writing your sql statement into a file.
No. It is just one query. If you use InnoDB engine it is already a transaction.
First, use error_reporting(E_ALL & ~E_NOTICE). Second, use the mysql_error PHP function to check that your query ran correctly. Third, in your cronjob, redirect the error stream into a file like so: 0 7 * * 0 /path/to/php -c /path/to/php.ini /path/to/script.php 2> /tmp/errors_file You can then create a SECOND script, running after the first one, to notify you about errors in script.php by email or whatever way of notifying you prefer. I'd prefer to use register_shutdown_function to check the error file and, if it is not empty, notify you and delete it afterwards.
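The error-file check could look roughly like this. It is a sketch only: collectCronErrors is a hypothetical helper name, and the path is whatever file the crontab line redirects stderr to.

```php
<?php
// Hypothetical helper: returns the contents of the cron error file and
// removes it, or null when there is nothing to report.
function collectCronErrors(string $errorFile): ?string
{
    // Drop cached stat data so a freshly written file is measured correctly.
    clearstatcache(true, $errorFile);
    if (!is_file($errorFile) || filesize($errorFile) === 0) {
        return null;
    }
    $errors = file_get_contents($errorFile);
    unlink($errorFile); // delete it so the same errors are not reported twice
    return $errors;
}
```

A second cron script (or a function registered with register_shutdown_function) could call collectCronErrors('/tmp/errors_file') and, when the result is not null, pass it to mail() or any other notification mechanism.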
Just my opinion, but I hope my answer helps though.
Hey, trying to figure out a way to use a file I have to generate an SQL insert to a database.
The file has many entries of the form:
100090 100090 bill smith 1998
That is, an id number, another id (not always the same), a full name and a year. These are all separated by a space.
Basically, what I want to do is get variables from these lines as I iterate through the file, so that I can, for instance, give the values on each line the names id, id2, name, year. I then want to pass these to a database. So for each line I'd be able to do (in pseudocode):
INSERT INTO BLAH VALUES(id, id2,name , year)
This is in php (I noticed I hadn't mentioned that above). I have also tried using grep to match the lines with a regex, but I can't find a way to wrap the code, e.g. "VALUES()", around the information from the file.
Any help would be appreciated. I'm kind of stuck on this one
Try something like this:
// Assumption: $link is an open mysqli connection; BLAH is the table from the question.
$fh = fopen('filename', 'r');
$values = array();
while (($line = fgets($fh)) !== false) {
    // Match the line against the expected format: id, id2, full name, year
    $fields = array();
    if (preg_match(':^(\d+) (\d+) (.+) (\d{4})$:', trim($line), $fields)) {
        // If it matches, create one "(...)" tuple for the VALUES list.
        // Don't forget to escape the string part.
        $values[] = sprintf('(%d, %d, "%s", %d)',
            $fields[1], $fields[2], mysqli_real_escape_string($link, $fields[3]), $fields[4]);
    }
}
fclose($fh);
$all_values = implode(',', $values);
// Check out the resulting statement:
echo 'INSERT INTO BLAH VALUES ' . $all_values;
If the file is really big you'll have to do your SQL INSERTs inside the loop instead of saving them to the end, but for small files I think it's better to save all VALUEs to the end so we can do only one SQL query.
If you can rely on the file's structure (and don't need to do additional sanitation/checking), consider using LOAD DATA INFILE.
GUI tools like HeidiSQL come with great dialogs to build fully functional mySQL statements easily.
Alternatively, PHP has fgetcsv() to parse CSV files.
If all of your lines look like the one you posted, you can read the contents of the file into a string (see http://www.ehow.com/how_5074629_read-file-contents-string-php.html)
Then use PHP's split() function to give you each piece of the query. (split() is deprecated as of PHP 5.3; use preg_split() or explode() instead.)
The array will look like this:
myData[0] = 100090
myData[1] = 100090
myData[2] = Bill Smith
myData[3] = 1998
.....And so on for each record
Then you can use a nifty loop to build your query.
for ($i = 0; $i + 3 < count($myData); $i += 4)
{
$query = "INSERT INTO MyTABLE VALUES ({$myData[$i]}, {$myData[$i+1]}, '{$myData[$i+2]}', {$myData[$i+3]})";
//Then execute the query (escape the name first if it can contain quotes)
}
This will be better and faster than introducing a 3rd party tool.
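One caveat with splitting on spaces is that a full name such as "bill smith" itself contains a space, so a plain split yields five tokens rather than four. A minimal sketch of one way around that, assuming the first two tokens and the last token are always the ids and the year (parseRecord is a hypothetical helper name):

```php
<?php
// Hypothetical helper: splits "id id2 full name year" into four fields.
// The first two tokens and the last token are fixed; everything in between
// is treated as the name.
function parseRecord(string $line): ?array
{
    $tokens = preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
    if (count($tokens) < 4) {
        return null; // not enough fields to form a record
    }
    $year = array_pop($tokens);
    $id   = array_shift($tokens);
    $id2  = array_shift($tokens);
    return array(
        'id'   => (int) $id,
        'id2'  => (int) $id2,
        'name' => implode(' ', $tokens),
        'year' => (int) $year,
    );
}

print_r(parseRecord('100090 100090 bill smith 1998'));
```

The returned array can then be bound to a prepared INSERT, which also takes care of quoting the name.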
I have an 800 MB text file with 18,990,870 lines in it (each line is a record). I need to pick out certain records and, if there is a match, write them into a database.
It is taking an age to work through them, so I wondered if there was a way to do it any quicker?
My PHP is reading a line at a time as follows:
$fp2 = fopen('download/pricing20100714/application_price','r');
if (!$fp2) {echo 'ERROR: Unable to open file.'; exit;}
while (!feof($fp2)) {
$line = stream_get_line($fp2,128,$eoldelimiter); //use 2048 if very long lines
if ($line[0] === '#') continue; //Skip lines that start with #
$field = explode ($delimiter, $line);
list($export_date, $application_id, $retail_price, $currency_code, $storefront_id ) = explode($delimiter, $line);
if ($currency_code == 'USD' and $storefront_id == '143441'){
// does application_id exist?
$application_id = mysql_real_escape_string($application_id);
$query = "SELECT * FROM jos_mt_links WHERE link_id='$application_id';";
$res = mysql_query($query);
if (mysql_num_rows($res) > 0 ) {
echo $application_id . "application id has price of " . $retail_price . "with currency of " . $currency_code. "\n";
} // end if exists in SQL
} else
{
// no, application_id doesn't exist
} // end check for currency and storefront
} // end while statement
fclose($fp2);
At a guess, the performance issue is because it issues a query for each application_id with USD and your storefront.
If space and IO aren't an issue, you might just blindly write all 19M records into a new staging DB table, add indices and then do the matching with a filter?
Don't try to reinvent the wheel, it's been done. Use a database to search through the file's content. You can load that file into a staging table in your database and query your data, using indexes for fast access if they add value. Most if not all databases have import/loading tools to get a file into the database relatively fast.
19M rows in a DB will slow it down if the DB was not designed properly. You can still use text files if they are partitioned properly. Recreating multiple smaller files, based on certain parameters and stored in a properly sorted way, might work.
Anyway, PHP is not the best language for file IO and processing; it is much slower than Java for this task, while plain old C would be one of the fastest for the job. PHP should be restricted to generating dynamic Web output, while core processing should be in Java/C. Ideally it should be a Java/C service which generates output, with PHP using that feed to generate HTML output.
You are parsing the input line twice by doing two explodes in a row. I would start by removing the first line:
$field = explode ($delimiter, $line);
list($export_date, ...., $storefront_id ) = explode($delimiter, $line);
Also, if you are only using the query to test for a match based on your condition, don't use SELECT * use something like this:
"SELECT 1 FROM jos_mt_links WHERE link_id='$application_id';"
You could also, as Brandon Horsley suggested, buffer a set of application_id values in an array and modify your select statement to use the IN clause thereby reducing the number of queries you are performing.
Have you tried profiling the code to see where it's spending most of its time? That should always be your first step when trying to diagnose performance problems.
Preprocess with sed and/or awk ?
Databases are built and designed to cope with large amounts of data, PHP isn't. You need to re-evaluate how you are storing the data.
I would dump all the records into a database, then delete the records you don't need. Once you have done that, you can copy those records wherever you want.
As others have mentioned, the expense is likely in your database query. It might be faster to load a batch of records from the file (instead of one at a time) and perform one query to check multiple records.
For example, load 1000 records that match the USD currency and storefront at a time into an array and execute a query like:
'select link_id from jos_mt_links where link_id in (' . implode(',', $application_id_array) . ')'
This will return a list of those records that are in the database. Alternatively, you could change the sql to be not in to get a list of those records that are not in the database.
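A minimal sketch of building that batched lookup, assuming link_id is numeric so the ids can be cast to integers and inlined (buildInQuery is a hypothetical helper name):

```php
<?php
// Hypothetical helper: builds the IN (...) lookup for one batch of ids.
// Ids are cast to int so they can be inlined safely (assumes numeric link_id).
function buildInQuery(array $applicationIds): string
{
    $ids = array_unique(array_map('intval', $applicationIds));
    if (count($ids) === 0) {
        throw new InvalidArgumentException('Batch must contain at least one id.');
    }
    return 'SELECT link_id FROM jos_mt_links WHERE link_id IN (' . implode(',', $ids) . ')';
}

echo buildInQuery(array('281796108', '281940292', '281796108')), "\n";
```

With batches of 1000 ids, the roughly one-query-per-line pattern in the question becomes one query per thousand matching lines.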
I have thousands of records in an array that was parsed from XML. My concern is the processing time of my script: will it be affected, since I have a hundred thousand records to insert into the database? Is there a way to process the insertion of the data into the database in batches?
Syntax is:
INSERT INTO tablename (fld1, fld2) VALUES (val1, val2), (val3, val4)... ;
So you can write something like this (dummy example):
foreach ($data AS $key=>$value)
{
$data[$key] = "($value[0], $value[1])";
}
$query = "INSERT INTO tablename (fld1, fld2) VALUES ".implode(',', $data);
This works quite fast even on huge datasets, and don't worry about performance if your dataset fits in memory.
This is for SQL files, but you can follow its model (if not just use it):
It splits the file up into parts that you can specify, say 3000 lines and then inserts them on a timed interval < 1 second to 1 minute or more.
This way a large file is broken into smaller inserts etc.
This will help bypass editing the php server configuration and worrying about memory limits etc. Such as script execution time and the like.
New Users can't insert links so Google Search "sql big dump" or if this works goto:
www [dot] ozerov [dot] de [ slash ] bigdump [ dot ] php
So you could even theoretically modify the above script to accept your array as the data source instead of the SQl file. It would take some modification obviously.
Hope it helps.
-R
It's unlikely to affect the processing time, but you'll need to ensure the DB's transaction logs are big enough to build a rollback segment for 100k rows.
Or with the ADOdb wrapper (http://adodb.sourceforge.net/):
// assuming you have your data in a form like this:
$params = array(
array("key1","val1"),
array("key2","val2"),
array("key3","val3"),
// etc...
);
// you can do this:
$sql = "INSERT INTO `tablename` (`key`,`val`) VALUES ( ?, ? )";
$db->Execute( $sql, $params );
Have you thought about array_chunk? It worked for me in another project
http://www.php.net/manual/en/function.array-chunk.php
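As a rough illustration of the array_chunk idea, the sketch below splits the rows into batches and builds one multi-row VALUES list per batch. The helper name buildValueBatches is hypothetical, and the quoting callback in the demo is a stand-in for the database driver's own escaping.

```php
<?php
// Hypothetical helper: turns rows into one multi-row VALUES list per batch.
// $quote is any callable that escapes a single value for the target database.
function buildValueBatches(array $rows, int $batchSize, callable $quote): array
{
    $batches = array();
    foreach (array_chunk($rows, $batchSize) as $chunk) {
        $tuples = array();
        foreach ($chunk as $row) {
            $tuples[] = '(' . implode(',', array_map($quote, $row)) . ')';
        }
        $batches[] = implode(',', $tuples);
    }
    return $batches;
}

// Stand-in quoting for the demo only; use the driver's escaping in real code.
$quote = function ($value) {
    return "'" . addslashes((string) $value) . "'";
};
$rows = array(array('key1', 'val1'), array('key2', 'val2'), array('key3', 'val3'));
print_r(buildValueBatches($rows, 2, $quote));
```

Each returned string can be appended to "INSERT INTO tablename (fld1, fld2) VALUES " and executed as its own query, which keeps any single statement from growing without bound.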