Best practices for iterating over MASSIVE CSV files in PHP

Ok, I'll try and keep this short, sweet and to-the-point.
We do massive GeoIP updates to our system by uploading a MASSIVE CSV file to our PHP-based CMS. This thing usually has more than 100k records of IP address information. Now, doing a simple import of this data isn't an issue at all, but we have to run checks against our current regional IP address mappings.
This means that we must validate the data, compare and split overlapping IP addresses, etc. And these checks must be made for each and every record.
Not only that, but I've just created a field mapping solution that would allow other vendors to implement their GeoIP updates in different formats. This is done by applying rules to the IP records within the CSV update.
For instance a rule might look like:
if 'countryName' == 'Australia' then send to the 'Australian IP Pool'
There might be multiple rules to run, and every IP record must go through all of them. For instance, 100k records checked against 10 rules would be 1 million iterations; not fun.
We're finding that 2 rules for 100k records takes up to 10 minutes to process. I'm fully aware of the bottleneck here, which is the sheer number of iterations that must occur for a successful import; I'm just not fully aware of any other options we may have to speed things up a bit.
Someone recommended splitting the file into chunks, server-side. I don't think this is a viable solution as it adds yet another layer of complexity to an already complex system. The file would have to be opened, parsed and split. Then the script would have to iterate over the chunks as well.
So, question is, considering what I just wrote, what would the BEST method be to speed this process up a bit? Upgrading the server's hardware JUST for this tool isn't an option unfortunately, but they're pretty high-end boxes to begin with.
Not as short as I thought, but yeah. Halps? :(

Perform a BULK IMPORT into a database (SQL Server is what I use). The BULK IMPORT takes literally seconds, and 100,000 records is peanuts for a database to crunch business rules on. I regularly perform similar data crunches on a table with over 4 million rows, and it doesn't take the 10 minutes you listed.
EDIT: I should point out that, yeah, I don't recommend PHP for this. You're dealing with raw DATA, use a DATABASE.. :P
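For illustration, a minimal sketch of that approach driven from PHP via the sqlsrv extension; the connection details, staging table, file path, and the rule-as-UPDATE are all hypothetical:
$conn = sqlsrv_connect('localhost', array('Database' => 'cms'));
if ($conn === false) {
    die(print_r(sqlsrv_errors(), true));
}
// Load the raw CSV into a staging table in one shot (seconds, not minutes).
$bulk = "BULK INSERT dbo.geoip_staging
         FROM 'C:\\imports\\geoip.csv'
         WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', FIRSTROW = 2)";
sqlsrv_query($conn, $bulk) or die(print_r(sqlsrv_errors(), true));
// A rule like "if countryName == 'Australia' then send to the Australian IP
// Pool" becomes one set-based UPDATE instead of 100k PHP iterations.
sqlsrv_query($conn, "UPDATE dbo.geoip_staging
                     SET pool = 'Australian IP Pool'
                     WHERE countryName = 'Australia'")
    or die(print_r(sqlsrv_errors(), true));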

The simple key to this is keeping as much work out of the inner loop as possible.
Simply put, anything you do in the inner loop is done "100K times", so doing nothing is best (but certainly not practical), and doing as little as possible is the next best bet.
If you have the memory, for example, and it's practical for the application, defer any "output" until after the main processing. Cache any input data if practical as well. This works best for summary data or occasional data.
Ideally, save for the reading of the CSV file, do as little I/O as possible during the main processing.
Does PHP offer any access to the Unix mmap facility? That is typically the fastest way to read files, particularly large files.
Another consideration is to batch your inserts. For example, it's straightforward to build up your INSERT statements as simple strings, and ship them to the server in blocks of 10, 50, or 100 rows. Most databases have some hard limit on the size of the SQL statement (like 64K, or something), so you'll need to keep that in mind. This will dramatically reduce your round trips to the DB.
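As a rough illustration of that batching with mysqli (the ip_pool table, columns, batch size, and the $db/$handle context are assumed), the loop might look like:
$batch = array();
$batchSize = 100;
while (($row = fgetcsv($handle)) !== false) {
    // Escape each value before embedding it in the SQL string.
    $ip      = mysqli_real_escape_string($db, $row[0]);
    $country = mysqli_real_escape_string($db, $row[1]);
    $batch[] = "('$ip', '$country')";
    if (count($batch) >= $batchSize) {
        // One round trip per 100 rows instead of one per row.
        mysqli_query($db, "INSERT INTO ip_pool (ip, country) VALUES " . implode(', ', $batch));
        $batch = array();
    }
}
if ($batch) { // flush the remainder
    mysqli_query($db, "INSERT INTO ip_pool (ip, country) VALUES " . implode(', ', $batch));
}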
If you're creating primary keys through simple increments, do that en masse (blocks of 1000, 10000, whatever). This is another thing you can remove from your inner loop.
And, for sure, you should be processing all of the rules at once for each row, not running all of the records through once per rule.
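In sketch form, that inversion looks something like this (the $rules structure and the assignToPool() helper are hypothetical):
// One pass over the file; every rule is applied to each row as it is read.
// $rules is a hypothetical array of ['test' => callable, 'pool' => string].
while (($record = fgetcsv($handle)) !== false) {
    foreach ($rules as $rule) {
        if (call_user_func($rule['test'], $record)) {
            assignToPool($record, $rule['pool']); // hypothetical helper
        }
    }
}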

100k records isn't a large number. 10 minutes isn't a bad processing time for a single thread. The amount of raw work to be done in a straight line is probably about 10 minutes, regardless of whether you're using PHP or C. If you want it to be faster, you're going to need a more complex solution than a while loop.
Here's how I would tackle it:
Use a map/reduce solution to run the process in parallel. Hadoop is probably overkill. Pig Latin may do the job. You really just want the map part of the map/reduce problem, i.e. you're forking off a chunk of the file to be processed by a sub-process. Your reducer is probably cat. A simple version could be having PHP fork a process for each 10K-record chunk, wait for the children, then re-assemble their output (a minimal fork sketch follows this list).
Use a queue/grid processing model. Queue up chunks of the file, then have a cluster of machines checking in, grabbing jobs and sending the data somewhere. This is very similar to the map/reduce model, just using different technologies, plus you could scale by adding more machines to the grid.
If you can write your logic as SQL, do it in a database. I would avoid this because most web programmers can't work with SQL on this level. Also, SQL is sort of limited for doing things like RBL checks or ARIN lookups.
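A minimal sketch of that fork-per-chunk idea, assuming the pcntl extension (CLI only) and a hypothetical process_chunk() worker:
// Fork one worker per 10k-record chunk; the parent waits for all children.
$chunkSize = 10000;
$totalRows = 100000;
$pids = array();
for ($offset = 0; $offset < $totalRows; $offset += $chunkSize) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed\n");
    } elseif ($pid === 0) {
        process_chunk($offset, $chunkSize); // child handles its slice of the CSV
        exit(0);
    }
    $pids[] = $pid; // parent keeps track of its children
}
foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status); // the "reduce" step: wait, then reassemble output
}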

One thing you can try is running the CSV import under command-line PHP rather than through the web server. It generally provides faster results.

If you are using PHP to do this job, switch the parsing to Python, since it is WAY faster than PHP at this kind of thing; that switch should speed up the process by 75% or even more.
If you are using MySQL, you can also use the LOAD DATA INFILE statement; I'm not sure whether you need to check the data before you insert it into the database, though.
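For reference, a typical invocation looks something like this (the file path, staging table, and delimiters are placeholders):
// Sketch: bulk-load the CSV straight into a MySQL staging table.
$sql = "LOAD DATA INFILE '/tmp/geoip.csv'
        INTO TABLE geoip_staging
        FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
        LINES TERMINATED BY '\\n'
        IGNORE 1 LINES"; // skip the header row
mysqli_query($db, $sql) or die(mysqli_error($db));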

Have worked on this problem intensively for a while now. And yes, the better solution is to only read in a portion of the file at any one time, parse it, do validation and filtering, then export it, and then read the next portion of the file. I would agree that this is probably not a natural fit for PHP, although you can certainly do it in PHP, as long as you have a seek function so that you can start reading from a particular location in the file. You are right that it adds a higher level of complexity, but it's worth that little extra effort (a rough sketch follows below).
If your data is pure, i.e. delimited correctly, string-qualified, free of broken lines and so on, then by all means bulk upload it into a SQL database. Otherwise you want to know where, when and why errors occur, and to be able to handle them.
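A rough sketch of that chunked approach, using ftell()/fseek() to remember where the last pass stopped; the chunk size and the processRecord() helper are made up:
// Process the CSV in chunks, remembering the byte offset between passes.
$chunkRows = 5000;
$offset    = 0; // persist this between runs (e.g. in the DB or a state file)
$handle = fopen('/tmp/geoip.csv', 'r');
fseek($handle, $offset); // resume where the previous chunk ended
for ($i = 0; $i < $chunkRows && ($row = fgetcsv($handle)) !== false; $i++) {
    processRecord($row); // hypothetical per-row validate/filter/export step
}
$offset = ftell($handle); // save for the next pass
fclose($handle);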

I'm working on something similar. The CSV file I'm working with contains Portuguese data: dates (dd/mm/yyyy) that I have to convert to MySQL's yyyy-mm-dd, and Portuguese monetary values (R$ 1.000,15) that have to be converted to the MySQL decimal 1000.15. I also trim any stray spaces and, finally, run addslashes.
There are 25 variables to be treated before the insert.
If I check every $notafiscal value (a SELECT against the table to see if it exists, then an UPDATE), PHP handles around 60k rows. But if I don't check it, PHP handles more than 1 million rows. The server has 4 GB of memory; on my local scripting host (2 GB), it handles half as many rows in both cases.
mysqli_query($db,"SET AUTOCOMMIT=0");
mysqli_query($db, "BEGIN");
mysqli_query($db, "SET FOREIGN_KEY_CHECKS = 0");
fgets($handle); //ignore the header line of csv file
while (($data = fgetcsv($handle, 100000, ';')) !== FALSE):
//if $notafiscal lower than 1, ignore the record
$notafiscal = $data[0];
if ($notafiscal < 1):
continue;
else:
$serie = trim($data[1]);
$data_emissao = converteDataBR($data[2]);
$cond_pagamento = trim(addslashes($data[3]));
//...
$valor_total = trim(moeda($data[24]));
//check if the $notafiscal already exist, if so, update, else, insert into table
$query = "SELECT * FROM venda WHERE notafiscal = ". $notafiscal ;
$rs = mysqli_query($db, $query);
if (mysqli_num_rows($rs) > 0):
//UPDATE TABLE
else:
//INSERT INTO TABLE
endif;
endwhile;
mysqli_query($db,"COMMIT");
mysqli_query($db,"SET AUTOCOMMIT=1");
mysqli_query($db,"SET FOREIGN_KEY_CHECKS = 1");
mysqli_close($db);

Related

Which is faster / more efficient - lots of little MySQL queries or one big PHP array?

I have a PHP/MySQL based web application that has internationalization support by way of a MySQL table called language_strings with the string_id, lang_id and lang_text fields.
I call the following function when I need to display a string in the selected language:
public function get_lang_string($string_id, $lang_id)
{
$db = new Database();
$sql = sprintf('SELECT lang_string FROM language_strings WHERE lang_id IN (1, %s) AND string_id=%s ORDER BY lang_id DESC LIMIT 1', $db->escape($lang_id, 'int'), $db->escape($string_id, 'int'));
$row = $db->query_first($sql);
return $row['lang_string'];
}
This works perfectly but I am concerned that there could be a lot of database queries going on. e.g. the main menu has 5 link texts, all of which call this function.
Would it be faster to load the entire language_strings table results for the selected lang_id into a PHP array and then call that from the function? Potentially that would be a huge array with much of it redundant but clearly it would be one database query per page load instead of lots.
Can anyone suggest another more efficient way of doing this?
There isn't a one-size-fits-all answer; you really have to look at it on a case-by-case basis. Having said that, the majority of the time it will be quicker to get all the data in one query, pop it into an array or object and refer to it from there.
The caveat is whether you can pull all your data that you need in one query as quickly as running the five individual ones. That is where the performance of the query itself comes into play.
Sometimes a query that contains a subquery or two will actually be less time efficient than running a few queries individually.
My suggestion is to test it out. Get a query together that gets all the data you need, see how long it takes to execute. Time each of the other five queries and see how long they take combined. If it is almost identical, stick the output into an array and that will be more efficient due to not having to make frequent connections to the database itself.
If however, your combined query takes longer to return data (it might cause a full table scan instead of using indexes for example) then stick to individual ones.
Lastly, if you are going to use the same data over and over - an array or object will win hands down every single time as accessing it will be much faster than getting it from a database.
OK - I did some benchmarking and was surprised to find that putting things into an array rather than using individual queries was, on average, 10-15% SLOWER.
I think the reason for this was that, even though I filtered out the "uncommon" elements, inevitably there were always going to be unused elements as a matter of course.
With the individual queries I am only ever getting out what I need and as the queries are so simple I think I am best sticking with that method.
This works for me, of course in other situations where the individual queries are more complex, I think the method of storing common data in an array would turn out to be more efficient.
Agree with what everybody says here.. it's all about the numbers.
Some additional tips:
Try to create a single memory array which holds the minimum you require. This means removing most of the obvious redundancies.
There are standard approaches for these issues in performance critical environments, like using memcached with mysql. It's a bit overkill, but this basically lets you allocate some external memory and cache your queries there. Since you choose how much memory you want to allocate, you can plan it according to how much memory your system has.
Just play with the numbers. Try using separate queries (which is the simplest approach) and stress your PHP script (like calling it hundreds of times from the command-line). Measure how much time this takes and see how big the performance loss actually is.. Speaking from my personal experience, I usually cache everything in memory and then one day when the data gets too big, I run out of memory. Then I split everything to separate queries to save memory, and see that the performance impact wasn't that bad in the first place :)
I'm with Fluffeh on this: look into the other options at your disposal (joins, subqueries; make sure your indexes reflect how the data is actually queried, but don't over-index, and test). Most likely you'll end up with an array at some point, so here's a little performance tip. Contrary to what you might expect, something like
$all = $stmt->fetchAll(PDO::FETCH_ASSOC);
is less memory efficient compared to:
$all = array(); // or $all = []; in PHP 5.4
while ($row = $stmt->fetch(PDO::FETCH_ASSOC))
{
    $all[] = $row['lang_string'];
}
What's more, you can check for redundant data while fetching the data.
My answer is to do something in between. Retrieve all strings for a lang_id that are shorter than a certain length (say, 100 characters). Shorter text strings are more likely to be used in multiple places than longer ones. Cache the entries in a static associative array in get_lang_string(). If an item isn't found, then retrieve it through a query.
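A sketch of that hybrid approach, assuming the asker's Database class also exposes a query() method that returns all rows (the 100-character cutoff and the static-cache wiring are illustrative):
public function get_lang_string($string_id, $lang_id)
{
    static $cache = null;
    $db = new Database();
    if ($cache === null) {
        // Warm the cache once per request with all short strings.
        // Ascending order lets the selected language overwrite the default (1).
        $sql = sprintf(
            'SELECT string_id, lang_string FROM language_strings
             WHERE lang_id IN (1, %s) AND CHAR_LENGTH(lang_string) < 100
             ORDER BY lang_id ASC',
            $db->escape($lang_id, 'int')
        );
        $cache = array();
        foreach ($db->query($sql) as $row) {
            $cache[$row['string_id']] = $row['lang_string'];
        }
    }
    if (isset($cache[$string_id])) {
        return $cache[$string_id];
    }
    // Long strings fall through to a per-item query, as before.
    $sql = sprintf(
        'SELECT lang_string FROM language_strings WHERE lang_id IN (1, %s)
         AND string_id=%s ORDER BY lang_id DESC LIMIT 1',
        $db->escape($lang_id, 'int'), $db->escape($string_id, 'int')
    );
    $row = $db->query_first($sql);
    return $row['lang_string'];
}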
I am currently at the point in my site/application where I have had to put the brakes on and think very carefully about speed. I think the speed tests mentioned should consider the volume of traffic on your server as an important variable that will affect the results. If you are putting data into JavaScript data structures and processing it on the client machine, the processing time should be more consistent. If you are requesting lots of data through MySQL via PHP (for example), this puts the demand on one machine/server rather than spreading it. As your traffic grows you have to share server resources with many users, and I am thinking that this is where getting JavaScript to do more will lighten the load on the server. You can also store data on the client machine via localStorage.setItem() / localStorage.getItem() (most browsers have about 5 MB of space per domain). If you have data in the database that does not change often, you can store it on the client and then just check at 'start-up' whether it's still valid.
This is my first comment posted after having and using the account for a year, so I might need to fine-tune my rambling; just voicing what I'm thinking through at present.

inserting huge set of data [PHP, MySQL]

I have a big data set in MySQL (users, companies, contacts): about 1 million records.
Now I need to import new users, companies and contacts from an import file (CSV) with about 100,000 records. Each record in the file has all the info for all three entities (user, company, contact).
Moreover, on production I can't use LOAD DATA (I just don't have the rights :( ).
So there are three steps which should be applied to that data set:
- compare with existing DB data
- update it (if we find something in the previous step)
- and insert new records
I'm using PHP on the server to do this. I can see two approaches:
reading ALL the data from the file at once and then working with this BIG array, applying the steps to it
or reading the file line by line and passing each line through the steps
Which approach is more efficient, in terms of CPU, memory or time usage?
Can I use transactions? Or will that slow down the whole production system?
Thanks.
In terms of CPU/elapsed time there won't be much in it, although reading the whole file will be slightly faster. However, for such a large data set, the additional memory required to read all records into memory will vastly outstrip the time advantage; I would definitely process one line at a time.
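A minimal sketch of that line-by-line approach, committing in batches so the import never holds one giant transaction open (the batch size and upsert helpers are hypothetical):
// Read one CSV line at a time; memory use stays flat regardless of file size.
$handle = fopen('import.csv', 'r');
$count  = 0;
mysqli_query($db, "BEGIN");
while (($line = fgetcsv($handle)) !== false) {
    upsertUser($db, $line);    // hypothetical: compare, then UPDATE or INSERT
    upsertCompany($db, $line);
    upsertContact($db, $line);
    if (++$count % 1000 === 0) { // commit every 1000 rows to keep locks short
        mysqli_query($db, "COMMIT");
        mysqli_query($db, "BEGIN");
    }
}
mysqli_query($db, "COMMIT");
fclose($handle);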
Did you know that phpMyAdmin has that nifty feature of "resumable import" for big SQL files ?
Just check "Allow interrupt of import" in the Partial Import section. And voilà, phpMyAdmin will stop and loop until all requests are executed.
It may be more efficient to just "use the tool" rather than "reinvent the wheel"
I think the 2nd approach is more acceptable:
Create a change list (it would be a separate table)
Make updates line by line (and mark each line as updated, using an "updflag" field, for example)
Perform this process in the background, using transactions.

How to copy tables from one website to another with php?

I have 2 websites, lets say - example.com and example1.com
example.com has a database fruits which has a table apple with 7000 records.
I exported apple and tried to import it into example1.com, but I always get the "MySQL server has gone away" error. I suspect this is due to some server-side restriction.
So, how can I copy the tables without having to contact the system admins? Is there a way to do this using PHP? I went through an example of copying tables, but that was inside the same database.
Both example.com and example1.com are on the same server.
One possible approach:
On the "source" server create a PHP script (the "exporter") that outputs the contents of the table in an easy to parse format (XML comes to mind as easy to generate and to consume, but alternatives like CSV could do).
On the "destination" server create a "importer" PHP script that requests the exporter one via HTTP, parses the result, and uses that data to populate the table.
That's quite generic, but should get you started. Here are some considerations:
http://ie2.php.net/manual/en/function.http-request.php is your friend
If the table contains sensitive data, there are many techniques to enhance security (http_request won't give you https:// support directly, but you can encrypt the data you export and decrypt on importing: look for "SSL" on the PHP manual for further details).
You should consider adding some redundancy (or even full-fledged encryption) to prevent the data from being tampered with while it sails the web between the servers.
You may use GET parameters to add flexibility (for example, passing the table name as a parameter would allow you to have a single script for all tables you may ever need to transfer).
With large tables, PHP timeouts may play against you. The ideal solution for this would be efficient code plus custom timeouts for the export and import scripts, but I'm not even sure if that's possible at all. A quite reliable workaround is to do the job in chunks (GET params come in handy here to tell the exporter which chunk you need, and a special entry in the output can be enough to tell the importer how much is left to import). Redirects help a lot with this approach (each redirect is a new request to the server, so timeouts get reset).
Maybe I'm missing something, but I hope there is enough there to let you get your hands dirty on the job and come back with any specific issue I might have failed to foresee.
Hope this helps.
EDIT:
Oops, I missed the detail that both DBs are on the same server. In that case, you can merge the import and export task into a single script. This means that:
You don't need a "transfer" format (such as XML or CSV): in-memory representation within PHP is enough, since now both tasks are done within the same script.
The data doesn't ever leave your server, so the need for encryption is not so heavy. Just make sure no one else can run your script (via authentication or similar techniques) and you should be fine.
Timeouts are not so restrictive, since you don't waste a lot of time waiting for the response to arrive from the source to the destination server, but they still apply. Chunk processing, redirection, and GET parameters (passed within the Location for the redirect) are still a good solution, but you can squeeze much closer to the timeout since execution time metrics are far more reliable than cross-server data-transfer metrics.
Here is a very rough sketch of what you may have to code:
$link_src = mysql_connect(/* source DB connection details */);
$link_dst = mysql_connect(/* destination DB connection details */);
/* You may want to TRUNCATE the table on destination before going on, to prevent data repetition */
$values = array();
$res = mysql_query("SELECT * FROM `table_name_here`", $link_src);
while ($row = mysql_fetch_assoc($res)) {
    $values[] = sprintf("('%s', '%s', '%s')",
        mysql_real_escape_string($row['field1_name'], $link_dst),
        mysql_real_escape_string($row['field2_name'], $link_dst),
        mysql_real_escape_string($row['field3_name'], $link_dst));
}
mysql_free_result($res);
/* implode() supplies the separating commas, so there is no trailing ',' to strip */
$q = "INSERT INTO `table_name_here` (column_list_here) VALUES " . implode(', ', $values);
mysql_query($q, $link_dst);
You'll have to add the chunking logic in there (that part is too case- and setup-specific), and probably output some confirmation message (maybe a DESCRIBE and a COUNT of both source and destination tables, and a comparison between them?), but that's quite the core of the job.
As an alternative you may run a separate insert per row (invoking the query within the loop), but I'm confident a single query would be faster (however, if PHP's memory limit is too small, this alternative allows you to get rid of the memory-hungry $q).
Yet another edit:
From the documentation link posted by Roberto:
You can also get these errors if you send a query to the server that is incorrect or too large. If mysqld receives a packet that is too large or out of order, it assumes that something has gone wrong with the client and closes the connection. If you need big queries (for example, if you are working with big BLOB columns), you can increase the query limit by setting the server's max_allowed_packet variable, which has a default value of 1MB. You may also need to increase the maximum packet size on the client end. More information on setting the packet size is given in Section B.5.2.10, “Packet too large”.
An INSERT or REPLACE statement that inserts a great many rows can also cause these sorts of errors. Either one of these statements sends a single request to the server irrespective of the number of rows to be inserted; thus, you can often avoid the error by reducing the number of rows sent per INSERT or REPLACE.
If that's what's causing your issue (and by your question it seems very likely it is), then the approach of breaking the INSERT into one query per row will most probably solve it. In that case, the code sketch above becomes:
$link_src = mysql_connect(/* source DB connection details */);
$link_dst = mysql_connect(/* destination DB connection details */);
/* You may want to truncate the table on destination before going on, to prevent data repetition */
$res = mysql_query("SELECT * FROM `table_name_here`", $link_src);
while ($row = mysql_fetch_assoc($res)) {
    /* One small INSERT per row keeps each packet well under max_allowed_packet */
    $q = sprintf("INSERT INTO `table_name_here` (`field1_name`, `field2_name`, `field3_name`) VALUES ('%s', '%s', '%s')",
        mysql_real_escape_string($row['field1_name'], $link_dst),
        mysql_real_escape_string($row['field2_name'], $link_dst),
        mysql_real_escape_string($row['field3_name'], $link_dst));
    mysql_query($q, $link_dst);
}
mysql_free_result($res);
The issue may also be triggered by the huge volume of the initial SELECT; in that case you should combine this with chunking (either with multiple SELECT+while() blocks, taking advantage of SQL's LIMIT clause, or via redirections), thus breaking the SELECT down into multiple, smaller queries. (Note that redirection-based chunking is only needed if you have timeout issues, or your execution time gets close enough to the timeout to threaten issues as the table grows. It may be a good idea to implement it anyway, so even if your table ever grows to an obscene size the script will still work unchanged.)
After struggling with this for a while, came across BigDump. It worked like a charm! Was able to copy LARGE databases without a glitch.
Here are reported the most common causes of the "MySQL Server has gone away" error in MySQL 5.0:
http://dev.mysql.com/doc/refman/5.0/en/gone-away.html
You might want to have a look at it and use it as a checklist to see if you're doing something wrong.

how to speed up Mysql and PHP?

I am developing a script on my localhost using PHP and MySQL, and I am dealing with large data (about 2 million records for scientific research).
Some queries I need to run only once in a lifetime (to analyse the data and prepare some derived data); however, they take a very long time. For example, my script has now been analysing some data for more than 4 hours.
I know I might have some problems with the optimization of my database; I am not an expert.
For example, I just figured out that indexing can be useful to speed up queries.
However, even with some columns indexed, my script is still very slow.
Any idea how to speed up my script (in PHP and MySQL)?
I am using XAMPP as a server package.
Thanks a lot for the help.
best regards
update 1:
part of my slow script which takes more than 4 hours to process
$sql = "select * from urls";//10,000 record of cached HTML documents
$result = $DB->query($sql);
while($row = $DB->fetch_array($result)){
$url_id = $row["id"];
$content = $row["content"];
$dom = new DOMDocument();
#$dom->loadHTML($content);
$xpath = new DOMXPath($dom);
$row = $xpath->evaluate("/html/body//a");
for($i = 0; $i < $row->length; $i++) {
// lots of the code here to deal with the HTML documents and some update and insert and select queries which query another table which has 1 million record
}
update 2:
I do not have JOINs in my queries, or even IN;
they are very simple queries.
And I don't know how to tell what causes the slowness:
is it PHP or MySQL?
First of all, to be able to optimize efficiently, you need to know what is taking time:
is PHP doing too much calculations ?
do you have too many SQL queries ?
do you have SQL queries that take too much time ?
if yes, which ones ?
where is your script spending time ?
With that information, you can then try to figure out:
if you can diminish the number of SQL queries
for instance, if you are doing the exact same query over and over again, you are obviously wasting time
another idea is to "regroup" queries, if that is possible ; for instance, use only one query to get 10 lines, instead of 10 queries which all get one line back.
if you can optimize queries that take too long
either using indexes -- which ones are useful generally depends on the joins and conditions you are using
or re-writing the queries, if they are "bad"
About optimization of select statements, you can take a look at : 7.2. Optimizing SELECT and Other Statements
if PHP is doing too much calculations, can you have it make less calculations ?
Maybe not recalculating similar stuff again and again ?
Or using more efficient queries ?
if PHP is taking time, and the SQL server is not over-loaded, using parallelism (launching several calculations at the same time) might also help speed up the whole thing.
Still, this is quite a specific question, and the answers will probably be pretty specific too -- which means more information might be necessary if you want more than a general answer...
Edit after your edits
As you only have simple queries, things might be a bit easier... Maybe.
First of all: you need to identify the kind of queries you are doing.
I'm guessing that, of all your queries, you can identify some "types" of queries.
for instance: "select * from a where x = 12" and "select * from a where x = 14" are of the same type: same select, same table, same where clause -- only the value changes
once you know which queries are used the most, you'll need to check that they are optimized: using EXPLAIN will help
(if needed, I'm sure some people will be able to help you understand its output, if you provide it alongside the schema of your DB (tables + indexes))
If needed: create the right indexes -- that's kind of the hard/specific part ^^
it is also for those queries that reducing the number of queries might prove useful...
when you're done with the queries used most often, it's time to go after the queries that take too long; using microtime from PHP will help you find out which ones those are (see the sketch after this list)
another solution is to use the Slow Query Log (section 5.2.4 of the MySQL manual)
when you have identified those queries, same as before: optimize.
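For instance, a crude microtime() wrapper, using mysqli for concreteness (the threshold and logging are illustrative):
// Time an individual query with microtime(); log it if it is suspiciously slow.
$start = microtime(true);
$result = mysqli_query($db, $sql); // $sql is whatever query you're measuring
$elapsed = microtime(true) - $start;
if ($elapsed > 0.5) { // arbitrary threshold, in seconds
    error_log(sprintf("slow query (%.3fs): %s", $elapsed, $sql));
}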
Before that, to find out if PHP is working too much, or if it's MySQL, a simple way is to use the top command on Linux, or the "process manager" on Windows (I'm not on Windows, and don't use it in English -- the real name might be something else).
If PHP is eating 100% of CPU, you have your culprit. If MySQL is eating all CPU, you have your culprit too.
When you know which one of those is working too much, it's a first step : you know what to optimize first.
I see from your portion of code that you are:
going through 10,000 elements one by one -- it should be easy to split those into 2 or more slices
using DOM and XPath, which might eat some CPU on the PHP side
If you have a multi-core CPU, an idea (that I would try if I saw that PHP is eating lots of CPU) would be to parallelize.
For instance, you could have two instances of the PHP script running at the same time :
one that will deal with the first half of the URLs
the SQL query for this one will be like "select * from urls where id < 5000"
and the other one that will deal with the second half of the URLs
its query will be like "select * from urls where id >= 5000"
You will get a bit more concurrency on the network (probably not a problem) and on the database (a database knows how to deal with concurrency, and 2 scripts using it will generally not be too much), but you'll be able to process almost twice as many documents in the same time.
If you have 4 CPUs, splitting the urls list into 4 (or even more; find out by trial and error) parts would do too.
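A quick-and-dirty way to launch those two instances from a parent script (process_urls.php is a hypothetical worker that reads its id bounds from argv):
// Launch two copies of the worker, each on half the id range, and wait for both.
$workers = array(
    popen('php process_urls.php 0 5000', 'r'),     // handles id < 5000
    popen('php process_urls.php 5000 10000', 'r'), // handles id >= 5000
);
foreach ($workers as $w) {
    stream_get_contents($w); // block until that worker finishes writing output
    pclose($w);
}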
Since your query is on one table and has no grouping or ordering, it is unlikely that the query is slow. I expect the issue is the size and number of the content fields. It appears that you are storing the entire HTML of a webpage in your database and then pulling it out every time you want to change a couple of values on the page. This is a situation to be avoided if at all possible.
Most scientific webapps (like BLAST for example) have the option to export the data as a delimited text file like a csv. If this is the case for you, you might consider restructuring your url table so that you have one column per data field in the csv. Then your update queries will be significantly faster as you will be able to do them entirely in SQL instead of pulling the entire url table into PHP, accessing and pulling one or more other records for each url record and then updating your table.
Presumably you have stored your data as webpages so you can dump the content easily to a browser. If you change your database schema as I've suggested, you'll need to write a webpage template that you can plug the data into when you wish to output it.
Knowing the queries and table structures would make this easier.
If you can't share them, check whether you use the IN operator; MySQL tends to slow down a lot there. Also try running
EXPLAIN yourquery;
and see how it is executed. Sometimes sorting takes too much time; try to avoid sorting on non-indexed columns.
Inner joins are quicker than left or right joins.
Going through my queries afterwards and thinking specifically about the joins has always sped them up.
Have a look in your MySQL config for settings you can turn off, etc.
If you are not using indexes, that may be the main problem. There are many more optimization hints and tricks, but it would be better to show us, say, your slowest query; it's not possible to help without any input data. Indexes and correct joins can speed things up considerably.
If the queries return the same data each time, you can store the results in a file or in memory and run them just once.
2 million records is not much.
Before you can optimise, you need to find out where the bottleneck is. Can you run the script on a smaller dataset, for testing purposes?
In that case, you should set such a test up, and then profile the code. You can either use a dedicated profiler such as Xdebug, or if you find it too daunting to configure (Not that complicated really, but you sound like you're a bit in the deep end already), you may feel more comfortable with a manual approach. This means starting a timer before parts of your code and stopping it after, then printing the result out. You can then narrow down which part is slowest.
Once you got that, we can give more specific answers, or perhaps it will be apparent to you what to do.

CSV vs MySQL performance

Let's assume the same environment: PHP5 working with MySQL5 and CSV files, with MySQL on the same host as the hosted scripts.
Will MySQL always be faster than retrieving/searching/changing/adding/deleting records in CSV?
Or is there some amount of data below which PHP+CSV performance is better than using a database server?
CSV won't let you create indexes for fast searching.
If you always need all data from a single table (like for application settings), CSV is faster, otherwise not.
I don't even consider SQL queries, transactions, data manipulation or concurrent access here, as CSV is certainly not for these things.
No, MySQL will probably be slower for inserting (appending to a CSV is very fast) and table-scan (non-index based) searches.
Updating or deleting from a CSV is nontrivial - I leave that as an exercise for the reader (a rough sketch of the update case follows this answer).
If you use a CSV, you need to be really careful to handle multiple threads / processes correctly, otherwise you'll get bad data or corrupt your file.
However, there are other advantages too. Care to work out how you do ALTER TABLE on a CSV?
Using a CSV is a very bad idea if you ever need UPDATEs, DELETEs, ALTER TABLE or to access the file from more than one process at once.
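For the curious, the "exercise" roughly amounts to rewriting the whole file. A minimal sketch, assuming column 0 is a unique id; the flock() guard is crude and only protects against other cooperating flock() users:
// Update one record in a CSV by streaming it to a temp file and renaming.
function csvUpdate($path, $id, array $newRow) {
    $in  = fopen($path, 'r');
    $out = fopen($path . '.tmp', 'w');
    flock($in, LOCK_EX); // crude guard against concurrent access
    while (($row = fgetcsv($in)) !== false) {
        fputcsv($out, $row[0] == $id ? $newRow : $row);
    }
    flock($in, LOCK_UN);
    fclose($in);
    fclose($out);
    rename($path . '.tmp', $path); // swap the rewritten file into place
}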
As a person coming from the data industry, I've dealt with exactly this situation.
Generally speaking, MySQL will be faster.
However, you don't state the type of application that you are developing. Are you developing a data warehouse application that is mainly used for searching and retrieval of records? How many fields are typically present in your records? How many records are typically present in your data files? Do these files have any relational properties to each other, i.e. do you have a file of customers and a file of customer orders? How much time do you have to develop a system?
The answer will depend on the answers to the questions listed previously. However, you can generally use the following as guidelines:
If you are building a data warehouse application with records exceeding one million, you may want to consider ditching both and moving to a Column Oriented Database.
CSV will probably be faster for smaller data sets. However, rolling your own insert routines in CSV could be painful and you lose the advantages of database indexing.
My general recommendation would be to just use MySQL; as I said previously, in most cases it will be faster.
From a pure performance standpoint, it completely depends on the operation you're doing, as @MarkR says. Appending to a flat file is very fast. As is reading in the entire file (for a non-indexed search or other purposes).
The only way to know for sure what will work better for your use cases on your platform is to do actual profiling. I can guarantee you that doing a full table scan on a million row database will be slower than grep on a million line CSV file. But that's probably not a realistic example of your usage. The "breakpoints" will vary wildly depending on your particular mix of retrieve, indexed search, non-indexed search, update, append.
To me, this isn't a performance issue. Your data sounds record-oriented, and MySQL is vastly superior (in general terms) for dealing with that kind of data. If your use cases are even a little bit complicated by the time your data gets large, dealing with a 100k line CSV file is going to be horrific compared to a 100k record db table, even if the performance is marginally better (which is by no means guaranteed).
Depends on the use. For example for configuration or language files CSV might do better.
Anyway, if you're using PHP5, you have a 3rd option: SQLite, which comes embedded in PHP. It gives you the ease of use of regular files, but the robustness of an RDBMS.
Databases are for storing and retrieving data. If you need anything more than plain line/entry addition or bulk listing, why not go for the database way? Otherwise you'd basically have to code the functionality (incl. deletion, sorting etc) yourself.
CSV is an incredibly brittle format and requires your app to do all the formatting and calculations. If you need to update a specific record in a CSV you will have to first read the entire file, find the entry in memory that needs to change, then write the whole file out again. This gets very slow very quickly. CSV is only useful for write-once, read-once type apps.
If you want to import swiftly, like a thief in the night, use the SQL format.
If you are working on a production server, CSV is slower but it is the safest.
Just make sure the CSV file doesn't contain primary-key values that would override your existing data.
