I have two websites, let's say example.com and example1.com.
example.com has a database fruits which has a table apple with 7000 records.
I exported apple and tried to import it into example1.com, but I always get a "MySQL server has gone away" error. I suspect this is due to some server-side restriction.
So, how can I copy the tables without having to contact the system admins? Is there a way to do this using PHP? I went through an example of copying tables, but that was inside the same database.
Both example.com and example1.com are on the same server.
One possible approach:
On the "source" server create a PHP script (the "exporter") that outputs the contents of the table in an easy to parse format (XML comes to mind as easy to generate and to consume, but alternatives like CSV could do).
On the "destination" server create an "importer" PHP script that requests the exporter one via HTTP, parses the result, and uses that data to populate the table.
That's quite generic, but should get you started. Here are some considerations:
http://ie2.php.net/manual/en/function.http-request.php is your friend
If the table contains sensitive data, there are many techniques to enhance security (http_request won't give you https:// support directly, but you can encrypt the data you export and decrypt on importing: look for "SSL" on the PHP manual for further details).
You should consider adding some redundancy (or even full-fledged encryption) to prevent the data from being tampered with while it travels between the servers.
You may use GET parameters to add flexibility (for example, passing the table name as a parameter would allow you to have a single script for all tables you may ever need to transfer).
With large tables, PHP timeouts may play against you. The ideal solution for this would be efficient code + custom timeouts for the export and import scripts, but I'm not even sure if that's possible at all. A quite reliable workaround is to do the job in chunks (GET params come in handy here to tell the exporter which chunk you need, and a special entry in the output can be enough to tell the importer how much is left to import). Redirects help a lot with this approach (each redirect is a new request to the server, so timeouts get reset).
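A rough sketch of such a chunked exporter (the database, table, connection details, and chunk size here are placeholders, and XML is just one possible output format):

```php
<?php
// exporter.php - a minimal sketch; names and chunk size are assumptions
$chunk_size = 500;
$offset = isset($_GET['offset']) ? (int)$_GET['offset'] : 0;

$link = mysql_connect('localhost', 'user', 'pass');
mysql_select_db('fruits', $link);

$res = mysql_query(
    sprintf("SELECT * FROM `apple` LIMIT %d, %d", $offset, $chunk_size),
    $link
);

header('Content-Type: text/xml');
echo '<chunk offset="' . $offset . '">';
while ($row = mysql_fetch_assoc($res)) {
    echo '<row>';
    foreach ($row as $field => $value) {
        echo '<' . $field . '>' . htmlspecialchars($value) . '</' . $field . '>';
    }
    echo '</row>';
}
echo '</chunk>';
```

The importer would then request `exporter.php?offset=0`, `exporter.php?offset=500`, and so on, until a chunk comes back with fewer rows than the chunk size.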
Maybe I'm missing something, but I hope there is enough there to let you get your hands dirty on the job and come back with any specific issue I might have failed to foresee.
Hope this helps.
EDIT:
Oops, I missed the detail that both DBs are on the same server. In that case, you can merge the import and export task into a single script. This means that:
You don't need a "transfer" format (such as XML or CSV): in-memory representation within the PHP is enough, since now both tasks are done within the same script.
The data doesn't ever leave your server, so the need for encryption is not so heavy. Just make sure no one else can run your script (via authentication or similar techniques) and you should be fine.
Timeouts are not so restrictive, since you don't waste a lot of time waiting for the response to arrive from the source to the destination server, but they still apply. Chunk processing, redirection, and GET parameters (passed within the Location for the redirect) are still a good solution, but you can squeeze much closer to the timeout since execution time metrics are far more reliable than cross-server data-transfer metrics.
Here is a very rough sketch of what you may have to code:
$link_src = mysql_connect(/* source DB connection details */);
$link_dst = mysql_connect(/* destination DB connection details */);
/* You may want to truncate the table on destination before going on, to prevent data repetition */
$q = "INSERT INTO `table_name_here` (`field1_name`, `field2_name`, `field3_name`) VALUES ";
$res = mysql_query("SELECT * FROM `table_name_here`", $link_src);
while ($row = mysql_fetch_assoc($res)) {
    $q .= sprintf("('%s', '%s', '%s'), ",
        mysql_real_escape_string($row['field1_name'], $link_dst),
        mysql_real_escape_string($row['field2_name'], $link_dst),
        mysql_real_escape_string($row['field3_name'], $link_dst));
}
mysql_free_result($res);
$q = rtrim($q, ', '); /* drop the trailing ", " */
mysql_query($q, $link_dst);
You'll have to add the chunking logic in there (it's too case- and setup-specific), and probably output some confirmation message (maybe a DESCRIBE and a COUNT of both source and destination tables, and a comparison between them?), but that's quite the core of the job.
As an alternative you may run a separate insert per row (invoking the query within the loop), but I'm confident a single query would be faster (however, if you have too small RAM limits for PHP, this alternative allows you to get rid of the memory-hungry $q).
Yet another edit:
From the documentation link posted by Roberto:
You can also get these errors if you send a query to the server that is incorrect or too large. If mysqld receives a packet that is too large or out of order, it assumes that something has gone wrong with the client and closes the connection. If you need big queries (for example, if you are working with big BLOB columns), you can increase the query limit by setting the server's max_allowed_packet variable, which has a default value of 1MB. You may also need to increase the maximum packet size on the client end. More information on setting the packet size is given in Section B.5.2.10, “Packet too large”.
An INSERT or REPLACE statement that inserts a great many rows can also cause these sorts of errors. Either one of these statements sends a single request to the server irrespective of the number of rows to be inserted; thus, you can often avoid the error by reducing the number of rows sent per INSERT or REPLACE.
If that's what's causing your issue (and by your question it seems very likely it is), then the approach of breaking the INSERT into one query per row will most probably solve it. In that case, the code sketch above becomes:
$link_src = mysql_connect(/* source DB connection details */);
$link_dst = mysql_connect(/* destination DB connection details */);
/* You may want to truncate the table on destination before going on, to prevent data repetition */
$res = mysql_query("SELECT * FROM `table_name_here`", $link_src);
while ($row = mysql_fetch_assoc($res)) {
    $q = sprintf("INSERT INTO `table_name_here` (`field1_name`, `field2_name`, `field3_name`) VALUES ('%s', '%s', '%s')",
        mysql_real_escape_string($row['field1_name'], $link_dst),
        mysql_real_escape_string($row['field2_name'], $link_dst),
        mysql_real_escape_string($row['field3_name'], $link_dst));
    mysql_query($q, $link_dst); /* one INSERT per row, executed inside the loop */
}
mysql_free_result($res);
The issue may also be triggered by the huge volume of the initial SELECT; in that case you should combine this with chunking (either with multiple SELECT+while() blocks, taking advantage of SQL's LIMIT clause, or via redirections), thus breaking the SELECT down into multiple, smaller queries. (Note that redirection-based chunking is only needed if you have timeout issues, or your execution time gets close enough to the timeout to threaten issues if the table grows. It may be a good idea to implement it anyway, so even if your table ever grows to an obscene size the script will still work unchanged.)
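For illustration, LIMIT-based chunking combined with redirection might look like this (the script name, chunk size, and the surrounding connection setup are assumptions; the per-row INSERT is the same as in the sketch above):

```php
$chunk_size = 500;
$offset = isset($_GET['offset']) ? (int)$_GET['offset'] : 0;

$res = mysql_query(
    sprintf("SELECT * FROM `table_name_here` LIMIT %d, %d", $offset, $chunk_size),
    $link_src
);
$rows_copied = 0;
while ($row = mysql_fetch_assoc($res)) {
    /* build and run the per-row INSERT against $link_dst, as above */
    $rows_copied++;
}
mysql_free_result($res);

if ($rows_copied === $chunk_size) {
    /* there may be more rows: redirect to self for the next chunk,
       which starts a fresh request and so resets the timeout */
    header('Location: copy_table.php?offset=' . ($offset + $chunk_size));
    exit;
}
```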
After struggling with this for a while, I came across BigDump. It worked like a charm! I was able to copy LARGE databases without a glitch.
Here are reported the most common causes of the "MySQL Server has gone away" error in MySQL 5.0:
http://dev.mysql.com/doc/refman/5.0/en/gone-away.html
You might want to have a look to it and to use it as a checklist to see if you're doing something wrong.
Related
I have a JS script that does one simple thing - an ajax request to my server. On this server I establish a PDO connection, execute one prepared statement:
SELECT * FROM table WHERE param1 = :param1 AND param2 = :param2;
Where table is the table with 5-50 rows, 5-15 columns, with data changing once each day on average.
Then I echo the json result back to the script and do something with it, let's say I console log it.
The problem is that the script is run ~10,000 times a second. Which gives me that much connections to the database, and I'm getting can't connect to the database errors all the time in server logs. Which means sometimes it works, when DB processes are free, sometimes not.
How can I handle this?
Probable solutions:
Memcached - it would also be slow, it's not created to do that. The performance would be similar, or worse, to the database.
File on server instead of the database - great solution, but the structure would be the problem.
Anything better?
For such a tiny amount of data that is changed so rarely, I'd make it just a regular PHP file.
Once you have your data in the form of array, dump it in the php file using var_export(). Then just include this file and use a simple loop to search data.
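A minimal sketch of that idea (the file names and row structure are assumptions):

```php
// generate_cache.php - run this once a day, after the data changes
$rows = get_rows_from_database(); // however you currently fetch them (hypothetical helper)
file_put_contents(
    'data_cache.php',
    '<?php return ' . var_export($rows, true) . ';'
);

// in the AJAX endpoint: no database connection at all
$rows = include 'data_cache.php';
foreach ($rows as $row) {
    if ($row['param1'] == $_GET['param1'] && $row['param2'] == $_GET['param2']) {
        echo json_encode($row);
        break;
    }
}
```

With opcode caching enabled, the include is served straight from memory, so the per-request cost is just the loop over a few dozen rows.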
Another option is to use Memcached, which was created for exactly this sort of job; on a fast machine with high-speed networking, memcached can easily handle 200,000+ requests per second, which is well above your modest 10k rps.
You can even eliminate PHP from the stack, making Nginx ask Memcached directly for the stored values, using ngx_http_memcached_module.
If you want to stick with the current MySQL-based solution, you can increase the max_connections number in the MySQL configuration; however, raising it above 200 may require some OS tweaking as well. But what you should not do is use persistent connections, as that will make things far worse.
You need to leverage a cache. There is no reason at all to go fetch the data from the database every time the AJAX request is made for data that is this slow-changing.
A couple of approaches that you could take (possibly even in combination with each other).
Cache between application and DB. This might be memcache or similar and would allow you to perform hash-based lookups (likely based on some hash of parameters passed) to data stored in memory (perhaps JSON representation or whatever data format you ultimately return to the client).
Cache between client and application. This might take the form of web-server-provided cache, a CDN-based cache, or similar that would prevent the request from ever even reaching your application given an appropriately stored, non-expired item in the cache.
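As an illustration of the second approach, the PHP endpoint itself can emit caching headers so that a browser, CDN, or reverse proxy can answer repeat requests without the application being hit at all (the max-age value is an arbitrary assumption):

```php
// $json is the response body you built, however you built it
$etag = '"' . md5($json) . '"';

// let the browser / CDN / reverse proxy reuse the response for up to an hour
header('Cache-Control: public, max-age=3600');
header('ETag: ' . $etag);

if (isset($_SERVER['HTTP_IF_NONE_MATCH'])
    && $_SERVER['HTTP_IF_NONE_MATCH'] === $etag) {
    header('HTTP/1.1 304 Not Modified'); // client already has it; send nothing
    exit;
}
echo $json;
```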
Anything better? No
Since you output the same results many times, the sensible solution is to cache results.
My educated guess is that your wrong assumption (that memcached is not built for this) is based on your plan of storing each record separately.
I implemented a simple caching mechanism for you to use:
<?php
$memcached_port = YOUR_MEMCACHED_PORT;
$m = new Memcached();
$m->addServer('localhost', $memcached_port);

$key1 = $_GET['key1'];
$key2 = $_GET['key2'];
$m_key = $key1 . $key2; // generate the key from unique values; for long keys, hash them with md5()

if (($data = $m->get($m_key)) === false) {
    // fetch $data from your database here
    $expire = 3600; // 1 hour; use a Unix timestamp instead to expire at a specific time of day
    $m->set($m_key, $data, $expire); // push to memcached
}
echo json_encode($data);
What you do is :
Decide on what signifies a result ( what set of input parameters )
Use that for the memcache key ( for example if it's a country and language the key would be $country.$language )
check if the result exists:
if it does, pull the data you stored as an array and output it.
if it doesn't exist or is outdated :
a. pull the data needed
b. put the data in an array
c. push the data to memcached
d. output the data
There are more efficient ways to cache data, but this is the simplest one, and sounds like your kind of code.
10,000 requests/second still doesn't justify the effort needed to create server-level caching (nginx/whatever).
In an ideally tuned world a chicken with a calculator would be able to run facebook .. but who cares ? (:
I have a web app which is pretty CPU intensive ( it's basically a collection of dictionaries, but they are not just simple dictionaries, they do a lot of stuff, anyway this is not important ). So in a CPU intensive web app you have the scaling problem, too many simultaneous users and you get pretty slow responses.
The flow of my app is this:
js -> ajax call -> php -> vb6 dll -> vb6 code queries the dictionaries and does CPU intensive stuff -> reply to php -> reply to js -> html div gets updated with the new content. Obviously in a windows env with IIS 7.5. PHP acts just as a way of accessing the .dll and does nothing else.
The content replied/displayed is html formatted text.
The app has many php files which call different functions in the .dll.
So in order to avoid the calling of the vb6 dll for each request, which is the CPU intensive part, I'm thinking of doing this:
example ajax request:
php file: displayconjugationofword.php
parameter: word=lol&tense=2&voice=active
So when a user makes the above request to displayconjugationofword.php, I call the vb6 dll, then just before giving the reply back to the client, I can add the request data to a MySQL table like this:
filename, request, content
displayconjugationofword.php, word=blahblah&tense=2&voice=active, blahblahblah
so next time a user makes the EXACT same ajax request, the displayconjugationofword.php code, instead of calling the vb6 dll, first checks the MySQL table to see if the request exists there and, if it does, fetches it from there.
So this mysql table will gradually grow in size, reaching up to 3-4 million rows and as it grows the chance of something requested being in the db, grows up too, which theoretically should be faster than doing the cpu intensive calls ( each anywhere from 50 to 750ms long ).
Do you think this is a good method of achieving what I want? or when the mysql table reaches 3-4 million entries, it will be slow too ?
thank you in advance for your input.
edit
I know about IIS output caching but I think it's not useful in my case because:
1) AFAIK it only caches the .php file when it becomes "hot" (many queries).
2) I do have some .php files which call the vb6 dll but whose reply is random each time.
I love these situations/puzzles! Here are the questions that I'd ask first, to determine what options are viable:
Do you have any idea/sense of how many of these queries are going to be repeated in a given hour, day, week?
Because... the 'more common caching technique' (i.e the technique I've seen and/or read about the most) is to use something like APC or, for scalability, something like Memcache. What I've seen, though, is that these are usually used for < 12 hour-long caches. That's just what I've seen. Benefit: auto-cleanup of unused items.
Can you give an estimate of how long a single 'task' might take?
Because... this will let you know if/when the cache becomes unproductive - that is, when the caching mechanism is slower than the task.
Here's what I'd propose as a solution - do it all from PHP (no surprise). In your work-flow, this would be both PHP points: js -> ajax call -> php -> vb6 dll -> vb6 code queries the dictionaries and does CPU intensive stuff -> reply to php -> reply to js -> html div...
Something like this:
Create a table with columns: __id, key, output, count, modified
1.1 The column '__id' is the auto-increment column (e.g. INT(11) AUTO_INCREMENT) and thus is also the PRIMARY KEY
1.2 The column 'modified' is created like this in MySQL: modified TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
1.3 'key' = CHAR(32), which is the string length of MD5 hashes. 'key' also has a UNIQUE INDEX (very important!! for 3.3 below)
1.4 'output' = TEXT, since the VB6 output will be more than a little text
1.5 'count' = INT(8) or so
Hash the query-string ("word=blahblah&tense=2&voice=active"). I'm thinking something like: $key = md5(var_export($_GET, TRUE)); Basically, hash whatever will give a unique output. Deducing from the example given, it might be best to lowercase the 'word' if case doesn't matter.
Run a conditional on the results of a SELECT for the key. In pseudo-code:
3.1. $result = SELECT output, count FROM my_cache_table_name WHERE key = "$key"
3.2. if (empty($result)) {
         $output = result of running VB6 task
         $count = 1
     } else {
         $count = $result['count'] + 1
     }
3.3. run query 'INSERT INTO my_cache_table_name (key, output, count) VALUES ($key, $output, $count) ON DUPLICATE KEY UPDATE count = $count'
3.4. return $output as "reply to js"
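A hedged PHP/PDO sketch of steps 2-3 above (the connection details are placeholders, and run_vb6_task() stands in for whatever actually wraps the DLL call):

```php
$pdo = new PDO('mysql:host=localhost;dbname=cms', 'user', 'pass');

$key = md5(var_export($_GET, true)); // step 2: one hash per unique request

$stmt = $pdo->prepare('SELECT output, count FROM my_cache_table_name WHERE `key` = ?');
$stmt->execute(array($key));
$result = $stmt->fetch(PDO::FETCH_ASSOC);

if ($result === false) {               // step 3.2: cache miss
    $output = run_vb6_task($_GET);     // hypothetical wrapper around the vb6 dll call
    $count = 1;
} else {                               // cache hit: skip the expensive call
    $output = $result['output'];
    $count = $result['count'] + 1;
}

// step 3.3: the UNIQUE INDEX on `key` turns this into an upsert
$stmt = $pdo->prepare(
    'INSERT INTO my_cache_table_name (`key`, output, count) VALUES (?, ?, ?)
     ON DUPLICATE KEY UPDATE count = VALUES(count)'
);
$stmt->execute(array($key, $output, $count));

echo $output;                          // step 3.4: reply to js
```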
Long-term, you will not only have a cache but you will also know what queries are being run the least and can prune them if needed. Personally, I don't think such a query will ever be all that time-consuming. And there are certainly things you might do to optimize the cache/querying (that's beyond me).
So what I'm not stating directly is this: the above will work (and is pretty much what you suggested). By adding a 'count' column, you will be able to see what queries are done a lot and/or a little and can come back and prune if/as needed.
If you want to see how long queries are taking, you might create another table that holds 'key', 'duration', and 'modified' (like above). Before 3.1 and 3.3, get the microtime(). If this is a cache hit, subtract the microtime()s and store the result in this new table where 'key' = $key and 'duration' = 2nd microtime() - 1st microtime(). Then you can come back later, sort by 'modified DESC', and see how long queries are taking. If you have a TON of data and the latest 'duration' values still look fine, you can pull this whole duration-recording mechanism. Or, if bored, only store the duration when $key ends in a letter (just to cut down on its load on the server).
I'm not an expert but this is an interesting logic problem. Hopefully what I've set out below will help or at least stimulate comments that may or may not make it useful.
To an extent, the answer is going to depend on how many queries you are likely to have, how many at once, and whether the MySQL indexing will be faster than your definitive solution.
A few thoughts then:
It would be possible to pass caching requests on to another server easily which would allow essentially infinite scaling.
Humans being as they are, most word requests are likely to involve only a few thousand words so you will, probably, find that most of the work being done is repeat work fairly soon. It makes sense then to create an indexable database.
Hashing has in the past been suggested as a good way to speed indexing of data. Whether this is useful or not will to an extent depend on the length of your answer.
If you are very clever, you could determine the top 10000 or so likely questions and responses and store them in a separate table for even faster responses. (Guru to comment?)
Does your dll already do caching of requests? If so then any further work will probably slow your service down.
This solution is amenable to simple testing using JS or php to generate multiple requests to test response speeds using or not using caching. Whichever you decide, I reckon you should test it with a large amount of sample data.
In order to get maximum performance for your example you need to follow basic cache optimization principle.
I'm not sure if your application logic allows it, but if it does it will give you a huge benefit: you need to distinguish requests which can be cached (static) from those which return dynamic (random) responses. Use some file naming rule or provide some custom HTTP header or request parameter, i.e. any part of the request which can be used to judge whether to cache it or not.
Speeding up static requests. The idea is to process incoming requests and send back the reply as early as possible (ideally even before the web server comes into play). I suggest using output caching, since it will do what you intend to do in PHP and MySQL internally, in a much more performant way. Some options are:
Use the IIS output caching feature (a quick search shows it can cache queries based on the requested file name and query string).
Place a caching layer in front of a web server. Varnish (https://www.varnish-cache.org/) is a flexible and powerful opensource tool, and you can configure caching strategy optimally depending on the size of your data (use memory vs. disk, how much mem can be used etc).
Speeding up dynamic requests. If they are completely random internally (no dll calls which can be cached) then there's not much to do. If there are some dll calls which can be cached, do it like you described: fetch data from cache, if it's there, you're good, if not, fetch it from dll and save to cache.
But use something more suitable for the task of caching - key/value storage like Redis or memcached are good. They are blazingly fast. Redis may be a better option since the data can be persisted to disk (while memcached drops entire cache on restart so it needs to be refilled).
I have a MySQL table with about 9.5K rows, these won't change much but I may slowly add to them.
I have a process where if someone scans a barcode I have to check if that barcode matches a value in this table. What would be the fastest way to accomplish this? I must mention there is no pattern to these values
Here Are Some Thoughts
Ajax call to PHP file to query MySQL table ( my thoughts would this would be slowest )
Load this MySQL table into an array on log in. Then when scanning Ajax call to PHP file to check the array
Load this table into an array on log in. When viewing the scanning page somehow load that array into a JavaScript array and check with JavaScript. (this seems to me to be the fastest because it eliminates Ajax call and MySQL Query. Would it be efficient to split into smaller arrays so I don't lag the server & browser?)
Honestly, I'd never load the entire table for anything. All I'd do is make an AJAX request back to a PHP gateway that then queries the database, and returns the result (or nothing). It can be very fast (as it only depends on the latency) and you can cache that result heavily (via memcached, or something like it).
There's really no reason to ever load the entire array for "validation"...
Much faster to use a well-indexed MySQL table than to look through an array for something.
But in the end it all depends on what you really want to do with the data.
As you mention, your table contains around 9.5K rows. There is no reason to load the data on login or on the scanning page.
Better to index your table and make an AJAX call whenever required.
Best of Luck!!
While 9.5 K rows are not that much, the related amount of data would need some time to transfer.
Therefore - and in general - I'd propose to run validation of values on the server side. AJAX is the right technology to do this quite easily.
Loading all 9.5K rows only to find one specific row is definitely a waste of resources. Run a SELECT query for the single value.
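For illustration, the whole server-side check can be one indexed lookup; a minimal sketch, where the table and column names are assumptions:

```php
// barcode_check.php - answers the AJAX request with a yes/no
$pdo = new PDO('mysql:host=localhost;dbname=inventory', 'user', 'pass');

// an index on the barcode column makes this lookup effectively instant at 9.5K rows
$stmt = $pdo->prepare('SELECT 1 FROM barcodes WHERE barcode = ? LIMIT 1');
$stmt->execute(array($_GET['barcode']));

echo json_encode(array('valid' => (bool)$stmt->fetchColumn()));
```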
Exposing PHP-functionality at the client-side / AJAX
Have a look at the xajax project, which allows you to expose whole PHP classes or single methods as AJAX methods on the client side. Moreover, xajax helps with the exchange of parameters between client and server.
Indexing to be searched attributes
Please ensure, that the column, which holds the barcode value, is indexed. In case the verification process tends to be slow, look out for MySQL table scans.
Avoiding table scans
To avoid table scans and keep your queries fast, use fixed-size fields. E.g. VARCHAR (among other types) makes queries slower, since rows no longer have a fixed size. Non-fixed-size rows effectively prevent the database from easily predicting the location of the next row in the result set. Therefore, use e.g. CHAR(20) instead of VARCHAR(20).
Finally: Security!
Don't forget that any data transferred to the client side may expose sensitive data. While your 9.5K rows may not get rendered by the client's browser, the rows do exist in the generated HTML page. Using "View source", any user would be able to figure out all valid numbers.
Exposing valid barcode values may or may not be a security problem in your project context.
PS: While not related to your question, I'd propose using PHPExcel for reading or writing spreadsheet data. Unlike other solutions, e.g. a PEAR-based framework, PHPExcel has no external dependencies.
I have a PHP/MySQL based web application that has internationalization support by way of a MySQL table called language_strings with the string_id, lang_id and lang_text fields.
I call the following function when I need to display a string in the selected language:
public function get_lang_string($string_id, $lang_id)
{
$db = new Database();
$sql = sprintf('SELECT lang_string FROM language_strings WHERE lang_id IN (1, %s) AND string_id=%s ORDER BY lang_id DESC LIMIT 1', $db->escape($lang_id, 'int'), $db->escape($string_id, 'int'));
$row = $db->query_first($sql);
return $row['lang_string'];
}
This works perfectly but I am concerned that there could be a lot of database queries going on. e.g. the main menu has 5 link texts, all of which call this function.
Would it be faster to load the entire language_strings table results for the selected lang_id into a PHP array and then call that from the function? Potentially that would be a huge array with much of it redundant but clearly it would be one database query per page load instead of lots.
Can anyone suggest another more efficient way of doing this?
There isn't a one-size-fits-all answer; you really have to look at it on a case-by-case basis. Having said that, the majority of the time it will be quicker to get all the data in one query, pop it into an array or object, and refer to it from there.
The caveat is whether you can pull all your data that you need in one query as quickly as running the five individual ones. That is where the performance of the query itself comes into play.
Sometimes a query that contains a subquery or two will actually be less time efficient than running a few queries individually.
My suggestion is to test it out. Get a query together that gets all the data you need, see how long it takes to execute. Time each of the other five queries and see how long they take combined. If it is almost identical, stick the output into an array and that will be more efficient due to not having to make frequent connections to the database itself.
If however, your combined query takes longer to return data (it might cause a full table scan instead of using indexes for example) then stick to individual ones.
Lastly, if you are going to use the same data over and over - an array or object will win hands down every single time as accessing it will be much faster than getting it from a database.
OK - I did some benchmarking and was surprised to find that putting things into an array rather than using individual queries was, on average, 10-15% SLOWER.
I think the reason for this was because, even if I filtered out the "uncommon" elements, inevitably there was always going to be unused elements as a matter of course.
With the individual queries I am only ever getting out what I need and as the queries are so simple I think I am best sticking with that method.
This works for me, of course in other situations where the individual queries are more complex, I think the method of storing common data in an array would turn out to be more efficient.
Agree with what everybody says here.. it's all about the numbers.
Some additional tips:
Try to create a single memory array which holds the minimum you require. This means removing most of the obvious redundancies.
There are standard approaches for these issues in performance critical environments, like using memcached with mysql. It's a bit overkill, but this basically lets you allocate some external memory and cache your queries there. Since you choose how much memory you want to allocate, you can plan it according to how much memory your system has.
Just play with the numbers. Try using separate queries (which is the simplest approach) and stress your PHP script (like calling it hundreds of times from the command-line). Measure how much time this takes and see how big the performance loss actually is.. Speaking from my personal experience, I usually cache everything in memory and then one day when the data gets too big, I run out of memory. Then I split everything to separate queries to save memory, and see that the performance impact wasn't that bad in the first place :)
I'm with Fluffeh on this: look into the other options at your disposal (joins, subqueries; make sure your indexes reflect the relativity of the data, but don't over-index, and test). Most likely you'll end up with an array at some point, so here's a little performance tip: contrary to what you might expect, stuff like
$all = $stmt->fetchAll(PDO::FETCH_ASSOC);
is less memory efficient compared to:
$all = array(); // or $all = []; in PHP 5.4
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    $all[] = $row['lang_string'];
}
What's more: you can check for redundant data while fetching the data.
My answer is to do something in between. Retrieve all strings for a lang_id that are shorter than a certain length (say, 100 characters). Shorter text strings are more likely to be used in multiple places than longer ones. Cache the entries in a static associative array in get_lang_string(). If an item isn't found, then retrieve it through a query.
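A rough sketch of that hybrid, reusing the Database class from the question (query_all() is an assumed bulk-fetch helper, and the 100-character threshold is arbitrary):

```php
public function get_lang_string($string_id, $lang_id)
{
    static $cache = array();

    if (!isset($cache[$lang_id])) {
        // preload all short strings for this language in one query
        $db = new Database();
        $sql = sprintf('SELECT string_id, lang_string FROM language_strings
                        WHERE lang_id IN (1, %s) AND CHAR_LENGTH(lang_string) < 100
                        ORDER BY lang_id ASC', $db->escape($lang_id, 'int'));
        $cache[$lang_id] = array();
        foreach ($db->query_all($sql) as $row) {
            // later rows (the requested lang_id) overwrite the lang_id 1 fallback
            $cache[$lang_id][$row['string_id']] = $row['lang_string'];
        }
    }

    if (isset($cache[$lang_id][$string_id])) {
        return $cache[$lang_id][$string_id];
    }

    // long (or missing) strings fall back to the original per-string query
    $db = new Database();
    $sql = sprintf('SELECT lang_string FROM language_strings WHERE lang_id IN (1, %s)
                    AND string_id=%s ORDER BY lang_id DESC LIMIT 1',
        $db->escape($lang_id, 'int'), $db->escape($string_id, 'int'));
    $row = $db->query_first($sql);
    return $row['lang_string'];
}
```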
I am currently at the point in my site/application where I have had to put the brakes on and think very carefully about speed.

I think the speed tests mentioned should consider the volume of traffic on your server as an important variable that will affect the results. If you are putting data into JavaScript data structures and processing it on the client machine, the processing time should be more consistent. If you are requesting lots of data through MySQL via PHP (for example), this puts demand on one machine/server rather than spreading it. As your traffic grows, you have to share server resources with many users, and this is where getting JavaScript to do more will lighten the load on the server.

You can also store data on the local machine via localStorage.setItem() / localStorage.getItem() (most browsers have about 5 MB of space per domain). If you have data in the database that does not change often, you can store it on the client and then just check at 'start-up' whether it's still valid/in date.
This is my first comment posted after having and using the account for a year, so I might need to fine-tune my rambling; just voicing what I'm thinking through at present.
Ok, I'll try and keep this short, sweet and to-the-point.
We do massive GeoIP updates to our system by uploading a MASSIVE CSV file to our PHP-based CMS. This thing usually has more than 100k records of IP address information. Now, doing a simple import of this data isn't an issue at all, but we have to run checks against our current regional IP address mappings.
This means that we must validate the data, compare and split overlapping IP address, etc.. And these checks must be made for each and every record.
Not only that, but I've just created a field mapping solution that would allow other vendors to implement their GeoIP updates in different formats. This is done by applying rules to IPs records within the CSV update.
For instance a rule might look like:
if 'countryName' == 'Australia' then send to the 'Australian IP Pool'
There might be multiple rules that have to be run and each IP record must apply them all. For instance, 100k records to check against 10 rules would be 1 million iterations; not fun.
We're finding that 2 rules for 100k records takes up to 10 minutes to process. I'm fully aware of the bottleneck here, which is the sheer number of iterations that must occur for a successful import; I'm just not fully aware of any other options we may have to speed things up a bit.
Someone recommended splitting the file into chunks, server-side. I don't think this is a viable solution as it adds yet another layer of complexity to an already complex system. The file would have to be opened, parsed and split. Then the script would have to iterate over the chunks as well.
So, question is, considering what I just wrote, what would the BEST method be to speed this process up a bit? Upgrading the server's hardware JUST for this tool isn't an option unfortunately, but they're pretty high-end boxes to begin with.
Not as short as I thought, but yeah. Halps? :(
Perform a BULK IMPORT into a database (SQL Server is what I use). The BULK IMPORT literally takes seconds, and 100,000 records is peanuts for a database to crunch business rules on. I regularly perform similar data crunches on a table with over 4 million rows and it doesn't take the 10 minutes you listed.
EDIT: I should point out, yeah, I don't recommend PHP for this. You're dealing with raw DATA, use a DATABASE.. :P
The simple key to this is keeping as much work out of the inner loop as possible.
Simply put, anything you do in the inner loop is done "100K times", so doing nothing is best (but certainly not practical), and doing as little as possible is the next best bet.
If you have the memory, for example, and it's practical for the application, defer any "output" until after the main processing. Cache any input data if practical as well. This works best for summary data or occasional data.
Ideally, save for the reading of the CSV file, do as little I/O as possible during the main processing.
Does PHP offer any access to the Unix mmap facility? That is typically the fastest way to read files, particularly large ones.
Another consideration is to batch your inserts. For example, it's straightforward to build up your INSERT statements as simple strings, and ship them to the server in blocks of 10, 50, or 100 rows. Most databases have some hard limit on the size of the SQL statement (like 64K, or something), so you'll need to keep that in mind. This will dramatically reduce your round trips to the DB.
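As a rough sketch of that batching idea — the table and column names are hypothetical, and values should be escaped with your driver (e.g. mysqli_real_escape_string) before they reach this function:

```php
<?php
// Sketch: build multi-row INSERT statements in blocks to cut DB round trips.
// Table/column names are hypothetical; escape values before calling this.
function buildBatchInserts(array $rows, string $table, array $columns, int $batchSize = 100): array {
    $statements = [];
    foreach (array_chunk($rows, $batchSize) as $chunk) {
        $values = array_map(function (array $row): string {
            return "('" . implode("','", $row) . "')";
        }, $chunk);
        $statements[] = sprintf(
            "INSERT INTO %s (%s) VALUES %s",
            $table,
            implode(',', $columns),
            implode(',', $values)
        );
    }
    return $statements;
}

// 250 rows in batches of 100 -> 3 statements instead of 250 round trips.
$rows  = array_map(fn ($i) => ["ip$i", 'AU'], range(1, 250));
$stmts = buildBatchInserts($rows, 'geoip', ['ip', 'country']);
echo count($stmts); // 3
```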
If you're creating primary keys through simple increments, do that en masse (blocks of 1000, 10000, whatever). This is another thing you can remove from your inner loop.
And, for sure, you should be processing all of the rules at once for each row, not running the records through once per rule.
100k records isn't a large number. 10 minutes isn't a bad processing time for a single thread. The amount of raw work to be done in a straight line is probably about 10 minutes, regardless of whether you're using PHP or C. If you want it to be faster, you're going to need a more complex solution than a while loop.
Here's how I would tackle it:
Use a map/reduce solution to run the process in parallel. Hadoop is probably overkill. Pig Latin may do the job. You really just want the map part of the map/reduce problem. IE: you're forking off a chunk of the file to be processed by a sub-process. Your reducer is probably cat. A simple version could be having PHP fork a process for each 10K-record chunk, wait for the children, then re-assemble their output.
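A minimal sketch of that "PHP forks a process per 10K-record chunk" version might look like this. It assumes the pcntl extension on a Unix CLI; the file name and the body of processChunk() are placeholders for your own file and per-record logic:

```php
<?php
// Sketch: fork one worker per 10K-record chunk (requires the pcntl
// extension, CLI SAPI, Unix). 'geoip_update.csv' and processChunk()
// are hypothetical stand-ins for your file and validation/rule logic.
function processChunk(array $lines): void {
    foreach ($lines as $line) {
        // ... validate the record, apply the rules, write the result ...
    }
}

if (PHP_SAPI === 'cli' && extension_loaded('pcntl') && is_file('geoip_update.csv')) {
    $lines  = file('geoip_update.csv', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $chunks = array_chunk($lines, 10000);    // one chunk per worker

    $children = [];
    foreach ($chunks as $chunk) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            exit("fork failed\n");
        } elseif ($pid === 0) {              // child: process its chunk, then exit
            processChunk($chunk);
            exit(0);
        }
        $children[] = $pid;                  // parent: remember the worker
    }
    foreach ($children as $pid) {            // parent: wait for all workers
        pcntl_waitpid($pid, $status);
    }
}
```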
Use a queue/grid processing model. Queue up chunks of the file, then have a cluster of machines checking in, grabbing jobs and sending the data somewhere. This is very similar to the map/reduce model, just using different technologies, plus you could scale by adding more machines to the grid.
If you can write your logic as SQL, do it in a database. I would avoid this because most web programmers can't work with SQL on this level. Also, SQL is sort of limited for doing things like RBL checks or ARIN lookups.
One thing you can try is running the CSV import under command line PHP. It generally provides faster results.
If you are using PHP to do this job, switch the parsing to Python, since it is much faster than PHP at this kind of work; that change alone could speed up the process by 75% or even more.
If you are using MySQL you can also use the LOAD DATA INFILE statement, though I'm not sure whether you need to check the data before you insert it into the database.
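For reference, a LOAD DATA INFILE call from PHP could look roughly like this. The file path, table and column names are made up, and note that this bypasses any PHP-side validation, so it only fits data you've already checked:

```php
<?php
// Sketch: let MySQL parse the CSV itself with LOAD DATA INFILE.
// The LOCAL variant reads the file from the client side; the plain
// variant needs the FILE privilege on the server.
$sql = "LOAD DATA LOCAL INFILE '/tmp/geoip_update.csv'
        INTO TABLE geoip
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
        LINES TERMINATED BY '\\n'
        IGNORE 1 LINES
        (ip_from, ip_to, country_code, country_name)";

// $db = mysqli_connect(...);     // connect as usual
// mysqli_query($db, $sql);       // the import then runs inside the server
```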
I have worked on this problem intensively for a while now. And yes, the better solution is to read in only a portion of the file at any one time, parse it, do validation and filtering, then export it, and then read the next portion of the file. I would agree that this is probably not a job for PHP, although you can probably do it in PHP as long as you have a seek function, so that you can start reading from a particular location in the file. You are right that it does add a higher level of complexity, but it is worth that little extra effort.
If your data is clean, i.e. delimited correctly, string-qualified, free of broken lines, etc., then by all means bulk upload it into an SQL database. Otherwise you want to know where, when and why errors occur, and to be able to handle them.
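A sketch of that seek-based chunked reading in PHP, using fseek()/ftell() to remember where the last chunk ended so the next run (or the next request, if you're chunking across redirects) can resume. The delimiter and chunk size are illustrative:

```php
<?php
// Sketch: read a large CSV in fixed-size chunks of rows, returning the
// byte offset to resume from via ftell(), so the caller can fseek() back
// to that exact position on the next pass.
function readCsvChunk(string $path, int $offset, int $maxRows): array {
    $rows = [];
    $handle = fopen($path, 'r');
    fseek($handle, $offset);                 // jump to where the last chunk ended
    while (count($rows) < $maxRows && ($data = fgetcsv($handle, 0, ';')) !== false) {
        $rows[] = $data;
    }
    $next = ftell($handle);                  // resume point for the next chunk
    $eof  = feof($handle);
    fclose($handle);
    return [$rows, $next, $eof];
}

// Usage: loop (or redirect) until $eof is true, carrying $offset along.
// [$rows, $offset, $eof] = readCsvChunk('update.csv', $offset, 5000);
```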
I'm working on something similar. The CSV file I'm working with contains Portuguese data: dates (dd/mm/yyyy) that I have to convert into MySQL's yyyy-mm-dd format, and Portuguese monetary values (R$ 1.000,15) that have to be converted into MySQL decimal 1000.15. I also trim any stray spaces and, finally, apply addslashes.
There are 25 variables to be treated before the insert.
If I check every $notafiscal value (a SELECT against the table to see if it exists, then an UPDATE), PHP handles around 60k rows. If I don't check it, PHP handles more than 1 million rows. The server has 4GB of memory; scripting on localhost (2GB of memory), it handles half as many rows in both cases.
mysqli_query($db, "SET AUTOCOMMIT=0");
mysqli_query($db, "BEGIN");
mysqli_query($db, "SET FOREIGN_KEY_CHECKS = 0");

fgets($handle); // ignore the header line of the CSV file
while (($data = fgetcsv($handle, 100000, ';')) !== FALSE):
    // if $notafiscal is lower than 1, ignore the record
    $notafiscal = (int) $data[0]; // cast also guards the query below against injection
    if ($notafiscal < 1):
        continue;
    endif;

    $serie          = trim($data[1]);
    $data_emissao   = converteDataBR($data[2]);
    $cond_pagamento = trim(addslashes($data[3]));
    //...
    $valor_total    = trim(moeda($data[24]));

    // check if the $notafiscal already exists: if so, update, else insert
    $query = "SELECT 1 FROM venda WHERE notafiscal = " . $notafiscal . " LIMIT 1";
    $rs = mysqli_query($db, $query);
    if (mysqli_num_rows($rs) > 0):
        // UPDATE TABLE
    else:
        // INSERT INTO TABLE
    endif;
endwhile;

mysqli_query($db, "COMMIT");
mysqli_query($db, "SET AUTOCOMMIT=1");
mysqli_query($db, "SET FOREIGN_KEY_CHECKS = 1");
mysqli_close($db);