I have an 800MB text file with 18,990,870 lines (each line is a record), from which I need to pick out certain records and, if there is a match, write them into a database.
It is taking an age to work through them, so I wondered if there is a way to do it any quicker?
My PHP is reading a line at a time as follows:
$fp2 = fopen('download/pricing20100714/application_price', 'r');
if (!$fp2) { echo 'ERROR: Unable to open file.'; exit; }
while (!feof($fp2)) {
    $line = stream_get_line($fp2, 128, $eoldelimiter); // use 2048 if very long lines
    if ($line[0] === '#') continue; // Skip lines that start with #
    $field = explode($delimiter, $line);
    list($export_date, $application_id, $retail_price, $currency_code, $storefront_id) = explode($delimiter, $line);
    if ($currency_code == 'USD' and $storefront_id == '143441') {
        // does application_id exist?
        $application_id = mysql_real_escape_string($application_id);
        $query = "SELECT * FROM jos_mt_links WHERE link_id='$application_id';";
        $res = mysql_query($query);
        if (mysql_num_rows($res) > 0) {
            echo $application_id . " application id has price of " . $retail_price . " with currency of " . $currency_code . "\n";
        } // end if exists in SQL
    } else {
        // no, application_id doesn't exist
    } // end check for currency and storefront
} // end while statement
fclose($fp2);
At a guess, the performance issue is that you issue a separate database query for every application_id that matches USD and your storefront.
If space and IO aren't an issue, you might just blindly write all 19M records into a new staging DB table, add indices and then do the matching with a filter?
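For illustration, a very rough sketch of that staging-table route; the staging table layout, the tab delimiter and the key choices are assumptions, not taken from the question:

// Hypothetical staging table matching the fields the question's explode() pulls out
$create = "CREATE TABLE staging_price (
    export_date    VARCHAR(32),
    application_id INT,
    retail_price   DECIMAL(10,2),
    currency_code  CHAR(3),
    storefront_id  INT,
    KEY (currency_code, storefront_id),
    KEY (application_id)
)";
mysql_query($create);

// One bulk load instead of 19M per-line round trips (tab delimiter assumed)
mysql_query("LOAD DATA LOCAL INFILE 'download/pricing20100714/application_price'
             INTO TABLE staging_price
             FIELDS TERMINATED BY '\\t'
             LINES TERMINATED BY '\\n'");

// The matching then becomes a single set-based query
$res = mysql_query("SELECT s.application_id, s.retail_price
                    FROM staging_price s
                    JOIN jos_mt_links l ON l.link_id = s.application_id
                    WHERE s.currency_code = 'USD' AND s.storefront_id = 143441");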
Don't try to reinvent the wheel; this has been done before. Use a database to search through the file's contents. You can load the file into a staging table in your database and query the data using indexes, where they add value, for fast access. Most if not all databases have import/loading tools to get a file into the database relatively quickly.
19M rows in a database will be slow if the database is not designed properly. You can still use text files if they are partitioned properly; recreating multiple smaller files, split on certain parameters and stored in sorted order, might work.
In any case, PHP is not the best language for file IO and processing: it is much slower than Java for this task, while plain old C would be one of the fastest for the job. PHP should be restricted to generating dynamic web output, with the core processing done in Java or C. Ideally it would be a Java/C service that generates the output, with PHP consuming that feed to generate the HTML.
You are parsing the input line twice by doing two explodes in a row. I would start by removing the first line:
$field = explode ($delimiter, $line);
list($export_date, ...., $storefront_id ) = explode($delimiter, $line);
Also, if you are only using the query to test for a match based on your condition, don't use SELECT *; use something like this:
"SELECT 1 FROM jos_mt_links WHERE link_id='$application_id';"
You could also, as Brandon Horsley suggested, buffer a set of application_id values in an array and modify your select statement to use the IN clause thereby reducing the number of queries you are performing.
Have you tried profiling the code to see where it's spending most of its time? That should always be your first step when trying to diagnose performance problems.
Preprocess with sed and/or awk?
Databases are built and designed to cope with large amounts of data, PHP isn't. You need to re-evaluate how you are storing the data.
I would dump all the records into a database, then delete the records you don't need. Once you have done that, you can copy those records wherever you want.
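Roughly, once everything is in a throwaway table (the staging and destination table names here are invented):

// Delete everything that is not a USD / storefront 143441 record...
mysql_query("DELETE FROM staging_price
             WHERE currency_code <> 'USD' OR storefront_id <> 143441");
// ...then copy what is left wherever it needs to go
mysql_query("INSERT INTO price_results (application_id, retail_price)
             SELECT application_id, retail_price FROM staging_price");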
As others have mentioned, the expense is likely in your database query. It might be faster to load a batch of records from the file (instead of one at a time) and perform one query to check multiple records.
For example, load 1000 records that match the USD currency and storefront at a time into an array and execute a query like:
'SELECT link_id FROM jos_mt_links WHERE link_id IN (' . implode(',', $application_id_array) . ')'
This will return a list of those records that are in the database. Alternatively, you could change the SQL to NOT IN to get a list of those records that are not in the database.
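A rough, untested sketch of that batching idea, reusing the question's variables (the batch size of 1,000 is arbitrary):

$batch = array();
while (!feof($fp2)) {
    $line = stream_get_line($fp2, 128, $eoldelimiter);
    if ($line === false || $line === '' || $line[0] === '#') continue;
    list($export_date, $application_id, $retail_price, $currency_code, $storefront_id) = explode($delimiter, $line);
    if ($currency_code == 'USD' && $storefront_id == '143441') {
        $batch[(int)$application_id] = $retail_price;   // collect matching ids
    }
    if (count($batch) >= 1000) {
        $ids = implode(',', array_keys($batch));        // keys are ints, safe to implode
        $res = mysql_query("SELECT link_id FROM jos_mt_links WHERE link_id IN ($ids)");
        while ($row = mysql_fetch_assoc($res)) {
            echo $row['link_id'] . " has price " . $batch[$row['link_id']] . " USD\n";
        }
        $batch = array();
    }
}
// ...flush the final, partial batch the same way after the loop.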
So I wrote a script to extract data from raw genome files; here's what the raw genome file looks like:
# rsid chromosome position genotype
rs4477212 1 82154 AA
rs3094315 1 752566 AG
rs3131972 1 752721 AG
rs12124819 1 776546 AA
rs11240777 1 798959 AG
rs6681049 1 800007 CC
rs4970383 1 838555 AC
rs4475691 1 846808 CT
rs7537756 1 854250 AG
rs13302982 1 861808 GG
rs1110052 1 873558 TT
rs2272756 1 882033 GG
rs3748597 1 888659 CT
rs13303106 1 891945 AA
rs28415373 1 893981 CC
rs13303010 1 894573 GG
rs6696281 1 903104 CT
rs28391282 1 904165 GG
rs2340592 1 910935 GG
The raw text file has hundreds of thousands of these rows, but I only need about 10,000 specific ones. I have a list of rsids, and I just need the genotype from each matching line. So I loop through the rsid list and use preg_match to find the line I need:
$rawData = file_get_contents('genome_file.txt');
$rsids = $this->get_snps();
while ($row = $rsids->fetch_assoc()) {
    $searchPattern = "~rs{$row['rsid']}\t(.*?)\t(.*?)\t(.*?)\n~i";
    if (preg_match($searchPattern, $rawData, $matchedGene)) {
        $genotype = $matchedGene[3];
        // Do something with genotype
    }
}
NOTE: I stripped out a lot of code to just show the regexp extraction I'm doing. I'm also inserting each row into a database as I go along. Here's the code with the database work included:
$rawData = file_get_contents('genome_file.txt');
$rsids = $this->get_snps();

$query = "INSERT INTO wp_genomics_results (file_id,snp_id,genotype,reputation,zygosity) VALUES (?,?,?,?,?)";
$stmt = $ngdb->prepare($query);
$stmt->bind_param("iissi", $file_id, $snp_id, $genotype, $reputation, $zygosity);

$ngdb->query("START TRANSACTION");
while ($row = $rsids->fetch_assoc()) {
    $searchPattern = "~rs{$row['rsid']}\t(.*?)\t(.*?)\t(.*?)\n~i";
    if (preg_match($searchPattern, $rawData, $matchedGene)) {
        $genotype = $matchedGene[3];
        $stmt->execute();
        $insert++;
    }
}
$stmt->close();
$ngdb->query("COMMIT");
$snps->free();
$ngdb->close();
So unfortunately my script runs very slowly. Running 50 iterations takes 17 seconds, so you can imagine how long 18,000 iterations is going to take. I'm looking into ways to optimise this.
Is there a faster way to extract the data I need from this huge text file? What if I explode it into an array of lines, and use preg_grep(), would that be any faster?
Something I tried is combining all 18,000 rsids into a single expression (i.e. (rs123|rs124|rs125)), like this:
$rsids = get_rsids();
$rsid_group = implode('|',$rsids);
$pattern = "~({$rsid_group})\t(.*?)\t(.*?)\t(.*?)\n~i";
preg_match($pattern,$rawData,$matches);
But unfortunately it gave me an error message about exceeding the PCRE expression limit; the needle was way too big. Another thing I tried was adding the S modifier to the expression. I read that this analyses the pattern in order to increase performance, but it didn't speed things up at all. Maybe my pattern isn't compatible with it?
The second thing I need to try to optimise is the database inserts. I added a transaction hoping that would help, but it didn't speed things up at all. So I'm thinking maybe I should group the inserts together, inserting multiple rows at once rather than one at a time.
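For the grouped-insert idea, here is a minimal sketch; it assumes $ngdb is the mysqli connection the prepared-statement code above suggests, and the chunk size of 500 is arbitrary:

$chunk = array();
while ($row = $rsids->fetch_assoc()) {
    // ...find $genotype for this rsid as before, then queue one value group...
    $chunk[] = sprintf("(%d, %d, '%s', '%s', %d)",
        $file_id, $snp_id, $ngdb->real_escape_string($genotype),
        $ngdb->real_escape_string($reputation), $zygosity);

    if (count($chunk) >= 500) {
        $ngdb->query("INSERT INTO wp_genomics_results
                      (file_id, snp_id, genotype, reputation, zygosity)
                      VALUES " . implode(',', $chunk));
        $chunk = array();
    }
}
if ($chunk) {   // whatever is left over at the end
    $ngdb->query("INSERT INTO wp_genomics_results
                  (file_id, snp_id, genotype, reputation, zygosity)
                  VALUES " . implode(',', $chunk));
}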
Another idea is something I read about: using LOAD DATA INFILE to load rows from a text file. In that case I'd just need to generate a text file first. I wonder whether it would work out faster to generate a text file in this case.
EDIT: It seems like what's taking up most of the time is the regular expressions. Running that part of the program by itself, it takes a really long time; 10 rows takes 4 seconds.
This is slow because you're searching a vast array of data over and over again.
It looks like you have a text file, not a dbms table, containing lines like these:
rs4477212 1 82154 AA
rs3094315 1 752566 AG
rs3131972 1 752721 AG
rs12124819 1 776546 AA
It looks like you have some other data structure containing a list of values like rs4477212. I think that's already in a table in the dbms.
I think you want exact matches for the rsxxxx values, not prefix or partial matches.
I think you want to process many different files of raw data, and extract the same batch of rsxxxx values from each of them.
So, here's what you do, in pseudocode. Don't load the whole raw data file into memory, rather process it line by line.
Read your rows of rsid values from the dbms, just once, and store them in an associative array.
for each file of raw data...
    for each line of data in the file...
        split the line of data to obtain the rsid. In PHP, $array = explode(" ", $line, 2); will yield your rsid in $array[0], and do it fast.
        Look in your array of rsid values for this value. In PHP, if ( array_key_exists( $array[0], $rsid_array )) { ... will do this.
        If the key does exist, you have a match.
            extract the last column from the raw text line ('GC' or whatever)
            write it to your dbms.
Notice how this avoids regular expressions, and how it processes your raw data line by line. You only have to touch each line of raw data once. That's good, because your raw data is also your largest quantity of data. It exploits php's associative array feature to do the matching. All that will be much faster than your method.
To speed up the process of inserting tens of thousands of rows into a table, read this: Optimizing InnoDB Insert Queries.
+1 to @Ollie Jones' answer. He posted while I was working on my answer. So here's some code to get you started.
// Build a hash of the rsids you care about, keyed the way they appear in the file
$rsids = $this->get_snps();
while ($row = $rsids->fetch_assoc()) {
    $key = 'rs' . $row['rsid'];
    $rsidHash[$key] = true;
}

// Walk the genome file line by line instead of loading it all into memory
$rawDataFd = fopen('genome_file.txt', 'r');
while ($rawData = fgetcsv($rawDataFd, 80, "\t")) {
    if (array_key_exists($rawData[0], $rsidHash)) {
        $genotype = $rawData[3];
        // do something with genotype
    }
}
I wanted to give the LOAD DATA INFILE approach a try to see how well it works, so I came up with what I thought was a nice elegant approach; here's the code:
$file = 'C:/wamp/www/nutri/wp-content/plugins/genomics/genome/test';
$data_query = "
LOAD DATA LOCAL INFILE '$file'
INTO TABLE wp_genomics_results
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
IGNORE 18 ROWS
(#rsid,#chromosome,#locus,#genotype)
SET file_id = '$file_id',
snp_id = (SELECT id FROM wp_pods_snp WHERE rsid = SUBSTR(#rsid,2)),
genotype = #genotype
";
$ngdb->query($data_query);
I put a foreign key constraint on the snp_id column (that's the ID for my table of RSIDs) so that it only enters genotypes for rsids that I need. Unfortunately this foreign key constraint caused some kind of error which locked the tables. Ah well. It might not have been a good approach anyhow, since there are on average 200,000 rows in each of these genome files. I'll go with Ollie Jones' approach, since that seems to be the most effective and viable approach I've come across.
I'm faced with a problematic CSV file that I have to import to MySQL.
Either through the use of PHP and then insert commands, or straight through MySQL's load data infile.
I have attached a partial screenshot of how the data within the file looks:
The values I need to insert are below "ACC1000", so I have to start at line 5 and work my way through the file of about 5,500 lines.
It's not possible simply to skip to each next line, because for some accounts there are multiple payments, as shown below.
I have been trying to get to the next row by scanning the rows for the occurrence of "ACC":
if (strpos($data[$c], 'ACC') !== FALSE) {
    echo "Yep ";
} else {
    echo "Nope ";
}
I know it's crude, but I really don't know where to start.
If you have a (foreign key) constraint defined in your target table such that records with a blank value in the type column will be rejected, you could use MySQL's LOAD DATA INFILE to read the first column into a user variable (which is carried forward into subsequent records) and apply its IGNORE keyword to skip those "records" that fail the FK constraint:
LOAD DATA INFILE '/path/to/file.csv'
IGNORE
INTO TABLE my_table
CHARACTER SET utf8
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 4 LINES
(#a, type, date, terms, due_date, class, aging, balance)
SET account_no = #account_no := IF(#a='', #account_no, #a)
There are several approaches you could take.
1) You could go with @Jorge Campos' suggestion and read the file line by line, using PHP code to skip the lines you don't need and insert the ones you want into MySQL. A potential disadvantage of this approach with a very large file is that you will either have to run a bunch of little queries or build up one larger one, and it could take some time to run.
2) You could process the file and remove any rows/columns that you don't need, leaving the file in a format that can be inserted directly into mysql via command line or whatever.
Based on which approach you decide to take, either I or the community can provide code samples if you need them.
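For example, a very rough sketch of option 1; the column positions, the number of header rows and the 'ACC' prefix test are guesses based on the description of the screenshot:

$fh = fopen('/path/to/file.csv', 'r');
$account = null;
$rowNum  = 0;
while (($data = fgetcsv($fh)) !== false) {
    $rowNum++;
    if ($rowNum < 5) continue;                 // skip the 4 header rows
    if (strpos($data[0], 'ACC') === 0) {
        $account = $data[0];                   // a new account block starts here
    }
    if ($account === null) continue;           // nothing usable yet
    // one INSERT per payment row, carrying the current account number forward:
    // INSERT INTO payments (account_no, payment_date, amount, ...) VALUES (...)
}
fclose($fh);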
This snippet should get you going in the right direction:
$file = '/path/to/something.csv';
if( ! $fh = fopen($file, 'r') ) { die('bad file'); }
if( ! $headers = fgetcsv($fh) ) { die('bad data'); }
while($line = fgetcsv($fh)) {
    echo var_export($line, true) . "\n";
    if( preg_match('/^ACC/', $line[0]) ) { echo "record begin\n"; }
}
fclose($fh);
http://php.net/manual/en/function.fgetcsv.php
I'm downloading large sets of data via an XML Query through PHP with the following scenario:
- Query for records 1-1000 and download all parts (1,000 parts is roughly 4.5 MB of text), then keep those in memory while I query the next 1001-2000, and so on (potentially up to 400k records in memory).
I'm wondering if it would be better to write these entries to a text file rather than storing them in memory, and once the complete download is done try to insert them all into the DB, or to write them to the DB as they come in.
Any suggestions would be greatly appreciated.
Cheers
You can run a query like this:
INSERT INTO table (id, text)
VALUES (null, 'foo'), (null, 'bar'), ..., (null, 'value no 1000');
Doing this you do the whole thing in one shot, and the parser is only called once. The best you can do is benchmark it: time 1,000 runs of a query that inserts 1,000 records against 1,000,000 inserts of one record each.
(Sorry about the prev. answer, I've misunderstood the question).
I would write them to the database as soon as you receive them. This saves memory, and you don't have to execute one huge, much slower query at the end. You will need a mechanism to deal with any problems that may occur in this process, like a disconnection after 399K results.
In my experience it would be better to download everything in a temporary area and then, when you are sure that everything went well, to move the data (or the files) in place.
As you are using a database you may want to dump everything into a table, something like this code:
$error = false;
while ( ($row = getNextRow($db)) && !$error ) {
    $sql = "insert into temptable (`key`, `value`) values ($row[0], $row[1])";
    if (mysql_query($sql)) {
        echo '#';
    } else {
        $error = true;
    }
}
if (!$error) {
    $sql = "insert into myTable select * from temptable";
    if (mysql_query($sql)) {
        echo 'Finished';
    } else {
        echo 'Error';
    }
}
Alternatively, if you know the table well, you can add a "new" flag field for newly inserted lines and update everything when you are finished.
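Very roughly, that flag idea looks like this (the is_new column name is invented):

// mark the freshly inserted rows as they go in...
$sql = "insert into myTable (`key`, `value`, is_new) values ($row[0], $row[1], 1)";
// ...and once the whole download has finished cleanly, promote them:
mysql_query("update myTable set is_new = 0 where is_new = 1");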
I have a PHP page that queries a MySQL database and returns about 20,000 rows. However, the browser takes over 20 minutes to display them. I have added an index to my database and it does use it; the query time on the command line is about 1 second for the 20,000 rows, but in the web application it takes far longer. Does anyone know what is causing this problem, and a better way to improve it? Below is my PHP code to retrieve the data:
$query1 = "select * from table where Date between '2010-01-01' and '2010-12-31'";
$result1 = mysql_query($query1) or die('Query failed: ' . mysql_error());
while ($line = mysql_fetch_assoc($result1)) {
    echo "\t\t<tr>\n";
    $Data['Date'] = $line['Date'];
    $Data['Time'] = $line['Time'];
    $Data['Serial_No'] = $line['Serial_No'];
    $Data['Department'] = $line['Department'];
    $Data['Team'] = $line['Team'];
    foreach ($Data as $col_value) {
        echo "\t\t\t<td>$col_value</td>\n";
    }
    echo "\t\t</tr>\n";
}
Try adding an index to your date column.
Also, it's a good idea to learn about the EXPLAIN command.
As mentioned in the comments above, 1 second is still pretty long for your results.
You might consider putting all your output into a single variable and then echoing the variable once the loop is complete.
Also, browsers wait for tables to be completely formed before showing them, so that will slow your results (at least slow the process of building the results in the browser). A list may work better - or better yet a paged view if possible (as recommended in other answers).
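That buffering suggestion as a small sketch (htmlspecialchars() added here only as good practice):

$html = '';
while ($line = mysql_fetch_assoc($result1)) {
    $html .= "\t\t<tr>\n";
    foreach ($line as $col_value) {
        $html .= "\t\t\t<td>" . htmlspecialchars($col_value) . "</td>\n";
    }
    $html .= "\t\t</tr>\n";
}
echo $html;   // one echo instead of thousands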
It's not PHP that's causing it to be slow, but the browser itself rendering a huge page. Why do you have to display all that data anyway? You should paginate the results instead.
Try constructing a static HTML page with 20,000 table elements. You'll see how slow it is.
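If you go the pagination route, a minimal sketch (the page size and parameter name are arbitrary):

$perPage = 100;
$page    = isset($_GET['page']) ? max(1, (int)$_GET['page']) : 1;
$offset  = ($page - 1) * $perPage;

$query1 = "SELECT * FROM table
           WHERE Date BETWEEN '2010-01-01' AND '2010-12-31'
           ORDER BY Date, Time
           LIMIT $perPage OFFSET $offset";
// ...fetch and render only this page, plus previous/next links...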
You can also improve that code:
while ($line = mysql_fetch_assoc($result1)) {
    echo "\t\t<tr>\n";
    foreach ($line as $col_value) {
        echo "\t\t\t<td>$col_value</td>\n";
        flush(); // optional, but gives your program a sense of responsiveness
    }
    echo "\t\t</tr>\n";
}
In addition, you should increase your acceptance rate.
You could time each step of the script by echoing the time before and after connecting to the database, running the query, and outputting the results.
This will tell you how long the different steps take. You may find that it is indeed the traffic causing the delay and not the query.
On the other hand, when you have a table with millions of records, retrieving 20,000 of them can take a long time, even when it is indexed. 20 minutes is extreme, though...
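A quick way to get those timings, as a rough sketch using microtime():

$t = microtime(true);
$result1 = mysql_query($query1) or die('Query failed: ' . mysql_error());
echo 'query took ' . round(microtime(true) - $t, 3) . " s\n";

$t = microtime(true);
while ($line = mysql_fetch_assoc($result1)) {
    // ...build the table rows as before...
}
echo 'fetch/output took ' . round(microtime(true) - $t, 3) . " s\n";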
Hey, I'm trying to figure out a way to use a file I have to generate SQL inserts into a database.
The file has many entries of the form:
100090 100090 bill smith 1998
That is, an id number, another id (not always the same), a full name and a year, all separated by a space.
Basically what I want to do is get variables from these lines as I iterate through the file, so that I can, for instance, give the values on each line the names id, id2, name, year. I then want to pass these to a database. So for each line I'd be able to do (in pseudo code):
INSERT INTO BLAH VALUES(id, id2,name , year)
This is in PHP; I noticed I hadn't mentioned that above. I have also tried using grep to match the lines with a regex, but I can't find a way to wrap the code (e.g. "VALUES()") around the information from the file.
Any help would be appreciated. I'm kind of stuck on this one
Try something like this:
$fh = fopen('filename', 'r');
$values = array();
while (!feof($fh)) {
    // Read a line from the file
    $line = trim(fgets($fh));
    // Match the line against the specified format
    $fields = array();
    if (preg_match(':^(\d+) (\d+) (.+) (\d{4})$:', $line, $fields)) {
        // If it does match, build one (...) value group
        // Don't forget to escape the string part ($db is your mysqli connection)
        $values[] = sprintf('(%d, %d, "%s", %d)',
            $fields[1], $fields[2], mysqli_real_escape_string($db, $fields[3]), $fields[4]);
    }
}
fclose($fh);

$all_values = 'VALUES ' . implode(',', $values);
// Check out what's inside $all_values:
echo $all_values;
If the file is really big you'll have to run your SQL INSERTs inside the loop instead of saving them all for the end, but for small files I think it's better to collect all the value groups so you only run one SQL query.
If you can rely on the file's structure (and don't need to do additional sanitation/checking), consider using LOAD DATA INFILE.
GUI tools like HeidiSQL come with great dialogs to build fully functional mySQL statements easily.
Alternatively, PHP has fgetcsv() to parse CSV files.
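If the layout really is that regular (and the name is always exactly two words), a LOAD DATA sketch could look like this; the table and column names are made up, and if the name can have more than two words you should parse in PHP instead:

$sql = "LOAD DATA LOCAL INFILE '/path/to/data.txt'
        INTO TABLE blah
        FIELDS TERMINATED BY ' '
        LINES TERMINATED BY '\\n'
        (id, id2, @first, @last, year)
        SET name = CONCAT(@first, ' ', @last)";
mysql_query($sql);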
If all of your lines look like the one you posted, you can read the contents of the file into a string (see http://www.ehow.com/how_5074629_read-file-contents-string-php.html).
Then use PHP's split function to give you each piece of the query (it looks like preg_split() as of PHP 5.3).
The array will look like this:
$myData[0] = 100090
$myData[1] = 100090
$myData[2] = bill smith
$myData[3] = 1998
.....and so on for each record
Then you can use a nifty loop to build your query.
for ($i = 0; $i < count($myData); $i += 4)
{
    $query = "INSERT INTO MyTABLE VALUES ($myData[$i], {$myData[$i+1]}, '{$myData[$i+2]}', {$myData[$i+3]})";
    // Then execute the query
}
This will be better and faster than introducing a 3rd party tool.