I've a Cronjob script, written in PHP with following requirements:
Step 1 (DB server 1): Get some data from multiple tables (We have lot of data here)
Step 2 (Application server): Perform some calculation
Step 3 (DB Server 2): After calculation, insert that data in another database(MySQL)/table(InnoDB) for reporting purpose. This table contains 97 columns, actually different rates, which can not be normalized further. This is different physical DB server and have only one DB.
Script worked fine during development but on production, Step 1 returned approx 50 million records. Result, as obvious, script run for around 4 days and then failed. (Rough estimation, with current rate, it would have taken approx 171 days to finish)
Just for note, We were using prepared statements and Step 1 is getting data in bunch of 1000 records at a time.
What we did till now
Optimization Step 1: Multiple values in insert & drop all indexes
Some tests showed insert (Step 3 above) is taking maximum time (More then 95% time). To optimize, after some googling, we dropped all indexes from table, and instead of one insert query/row, we are not having one insert query/100 rows. This gave us a bit faster insert but still, as per rough estimate, it will take 90 days to run cron once, and we need to run it once every month as new data will be available every month.
Optimization step 2, instead of writing to DB, write to csv file and then import in mysql using linux command.
This step seems not working. Writing 30000 rows in CSV file took 16 minutes and we still need to import that CSV file in MySQL. We have single file handler for all write operations.
Current state
It seems I'm now clueless on what else can be done. Some key requirements:
Script need to insert approx 50,000,000 records (will increase with time)
There are 97 columns for each records, we can skip some but 85 columns at the minimum.
Based on input, we can break script into three different cron to run on three different server but insert had to be done on one DB server (master) so not sure if it will help.
However:
We are open to change database/storage engine (including NoSQL)
On production, we could have multiple database servers but insert had to be done on master only. All read operations can be directed to slave, which are minimal and occasional (Just to generate reports)
Question
I don't need any descriptive answer but can someone in short suggest what could be possible solution. I just need some optimization hint and I'll do remaining R&D.
We are open for everything, change database/storage engine, Server optimization/ multiple servers (Both DB and application), change programming language or whatever is best configuration for above requirements.
Final expectation, cron must finish in maximum 24 hours.
Edit in optimization step 2
To further understand why generating csv is taking time, I've created a replica of my code, with only necessary code. That code is present on git https://github.com/kapilsharma/xz
Output file of experiment is https://github.com/kapilsharma/xz/blob/master/csv/output500000_batch5000.txt
If you check above file, I'm inserting 500000 records and getting 5000 records form database at a time, making loop running 100 times. Time taken in first loop was 0.25982284545898 seconds but in 100th loop was 3.9140808582306. I assume its because of system resource and/or file size of csv file. In that case, it becomes more of programming question then DB optimization. Still, can someone suggest why it is taking more time in next loops?
If needed, whole code is committed except csv files and sql file generated to create dummy DB as these files are very big. However they can be easily generated with code.
Using OFFSET and LIMIT to walk through a table is O(N*N), that is much slower than you want or expected.
Instead, walk through the table "remembering where you left off". It is best to use the PRIMARY KEY for such. Since the id looks like an AUTO_INCREMENT without gaps, the code is simple. My blog discusses that (and more complex chunking techniques).
It won't be a full 100 (500K/5K) times as fast, but it will be noticeably faster.
This is a very broad question. I'd start by working out what the bottleneck is with the "insert" statement. Run the code, and use whatever your operating system gives you to see what the machine is doing.
If the bottleneck is CPU, you need to find the slowest part and speed it up. Unlikely, given your sample code, but possible.
If the bottleneck is I/O or memory, you're almost certainly going to need either better hardware, or a fundamental re-design.
The obvious way to re-design this is to find a way to handle only deltas in the 50M records. For instance, if you can write to an audit table whenever a record changes, your cron job can look at that audit table and pick out any data that was modified since the last batch run.
I had a mailer cron job on CakePHP, which failed merely on 600 rows fetch and send email to the registered users. It couldn't even perform the job in batch operations. We finally opted for mandrill and since then it all went well.
I'd suggest (considering it a bad idea to touch the legacy system in production) :
Schedule a mirco solution in golang or node.js considering
performance benchmarks, as database interaction is involved -
you'll be fine with any of these. Have this micro solution perform
the cron job. (Fetch + Calculate)
Reporting from NoSQL will be
challenging, so you should try out using available services like
Google Big Query. Have the cron job store data to google big
query and you should get a huge performance improvement even in
generating reports.
or
With each row inserted into your original db server 1, set up a messaging mechanism which performs the operations of cron job everytime an insert is made (sort of trigger) and store it into your reporting server. Possible services you can use are : Google PubSub or Pusher. I think per insert time consumption will be pretty less. (You can also use a async service setup which does the task of storing into the reporting database).
Hope this helps.
Related
I have a big data set into MySQL (users, companies, contacts)? about 1 million records.
And now I need to make import new users, companies, contacts from import file (csv) with about 100000 records. I records from file has all info for all three essences (user, company, contacts).
Moreover on production i can't use LOAD DATA (just do not have so many rights :( ).
So there are three steps which should be applied to that data set.
- compare with existing DB data
- update it (if we will find something on previous step)
- and insert new, records
I'm using php on server for doing that. I can see two approaches:
reading ALL data from file at once and then work with this BIG array and apply those steps.
or reading line by line from the file and pass each line through steps
which approach is more efficient ? by CPU, memory or time usage
Can I use transactions ? or it will slow down whole production system ?
Thanks.
CPU time/time there won't be much in it, although reading the whole file will be slightly faster. However, for such a large data set, the additional memory required to read all records into memory will vastly outstrip the time advantage - I would definitely process one line at a time.
Did you know that phpMyAdmin has that nifty feature of "resumable import" for big SQL files ?
Just check "Allow interrupt of import" in the Partial Import section. And voila, PhpMyAdmin will stop and loop until all requests are executed.
It may be more efficient to just "use the tool" rather than "reinvent the wheel"
I think, 2nd approach is more acceptable:
Create change list (it would be a separate table)
Make updates line by line (and mark each line as updated using "updflag" field, for example)
Perform this process in background using transactions.
I have a PHP script that in every run, inserts a new row to a Mysql db (with a relative small amount of data..)
I have more than 20 requests per second, and this is causing my CPU to scream for help..
I'm using the sql INSERT DELAYED method with a MyISAM engine (although I just notice that INSERT DELAYED is not working with MyISAM).
My main concern is my CPU load and I started to look for ways to store this data with more CPU friendly solutions.
My first idea was to write this data to an hourly log files and once an hour to retrieve the data from the logs and insert it to the DB at once.
Maybe a better idea is to use NoSQL DB instead of log files and then once an hour to insert the data from the NoSQL to the Mysql..
I didn't test yet any of these ideas, so I don't really know if this will manage to decrease my CPU load or not. I wanted to ask if someone can help me find the right solution that will have the lowest affect over my CPU.
I recently had a very similar problem and my solution was to simply batch the requests. This sped things up about 50 times because of the reduced overhead of mysql connections and also the greatly decreased amount of reindexing. Storing them to a file then doing one larger (100-300 individual inserts) statement at once probably is a good idea. To speed things up even more turn off indexing for the duration of the insert with
ALTER TABLE tablename DISABLE KEYS
insert statement
ALTER TABLE tablename ENABLE KEYS
doing the batch insert will reduce the number of instances of the php script running, it will reduce the number of currently open mysql handles (large improvement) and it will decrease the amount of indexing.
Ok guys, I manage to lower the CPU load dramatically with APC-cache
I'm doing it like so:
storing the data in memory with APC-cache, with TTL of 70 seconds:
apc_store('prfx_SOME_UNIQUE_STRING', $data, 70);
once a minute I'm looping over all the records in the cache:
$apc_list=apc_cache_info('user');
foreach($apc_list['cache_list'] as $apc){
if((substr($apc['info'],0,5)=='prfx_') && ($val=apc_fetch($apc['info']))){
$values[]=$val;
apc_delete($apc['info']);
}
}
inserting the $values to the DB
and the CPU continues to smile..
enjoy
I would insert a sleep(1); function at the top of your PHP script, before every insert at the top of your loop where 1 = 1 second. This only allows the loop to cycle once per second.
This way it will regulate a bit just how much load the CPU is getting, this would be ideal assuming your only writing a small number of records in each run.
You can read more about the sleep function here : http://php.net/manual/en/function.sleep.php
It's hard to tell without profiling both methods, if you write to a log file first you could end up just making it worse as your turning your operation count from N to N*2. You gain a slight edge by writing it all to a file and doing a batch insert but bear in mind that as the log file fills up it's load/write time increases.
To reduce database load, look at using mem cache for database reads if your not already.
All in all though your probably best of just trying both and seeing what's faster.
Since you are trying INSERT DELAYED, I assume you don't need up to the second data. If you want to stick with MySQL, you can try using replication and the BLACKHOLE table type. By declaring a table as type BLACKHOLE on one server, then replicating it to a MyISAM or other table type on another server, you can smooth out CPU and io spikes. BLACKHOLE is really just a replication log file, so "inserts" into it are very fast and light on the system.
I do not know what is your table size or your server capabilities but I guess you need to make a lot of inserts per single table. In such a situation I would recommend checking for the construction of vertical partitions that will reduce the physical size of each partition and significantly reduce the insertion time to the table.
Replication
I have an app that Is polling data from a large number of data feeds. It processes thousands of records per day and this number is ever increasing. The data is stored in Mysql.
I then have a website that utilises this data.
I'm trying to build my environment with future in mind.
I thought of mysql replication so that the website can use it's own database on a different server and get bogged down by the thousands of write commands that are happening on the main database.
I am having difficulty getting this setup, despite mysql reporting it's all working fine.
I then started think - is there not a better way ?
From what I understand mysql sends the write command to the slave database as the master.
Does this not mean that what I am trying to avoid is just happening anyway?
Does this mean that the slave database will suffer thousands of writes
I am a one man band, doing this venture with my own money so I need to do this a cheapest way. I am getting a bit lost !
I have a dedicated server,
A vps
Using Php5, mysql 5 in a lamp stack.
I cannot begin to tell you how much I would appreciate some guidance!
If the slaves are a 1:1 clone of the master, than all writes to the master MUST be propagated down to the slaves. Otherwise replication would be useless.
Thousands of records per day is actually very small. Assuming the same processing time for each, and doing 5000 records, you'd have 86400/5000 = 17.28 seconds per record. That's very minimal write overhead.
If you were doing millions of records a day, THEN you'd have a write bottleneck.
I would split this in three layers.
Data Feed layer. Data read from the feeds is preprocessed and posted into a queue. This layer has a temporary queue that serves also as a temporary storage, a buffer to allow all data feed to post its data. I'd use a Message Queue System. It's fast and reliable.
Data Store layer. This layer reads from the queue, maybe processes someway the data read, and stores the data in the database.
Data Analysis layer. This is your "slave" database. It's a data warehouse. It periodically does ETL (extract, transform and load) data from the Data Store layer to this secondary database.
This layeread approach allows you isolate concerns (speed, reliability, security) and implementation details; and allows for future scalability.
Replication is literally what the word suggest - replicating queries on another machine.
MySQL creates a log that's filled with queries that were used to create the dataset on the original machine (master) and sends it to the slave(s) that read the log and re-execute those queries.
Basically, what you want is to increase your write ratio. That's achievable trough using different engines, for example TokuDB is one of them (however it isn't free, but you are allowed to store 50gb of user data for free and use it).
What you want (for the moment) is fast HDD subsystem more than a monolithic write-scalable storage system. InnoDB is capable of achieving a lot of queries per second on properly configured machine with sufficient hardware. I am not sure about pricing, but SSD and 4-8 gigs of ram shouldn't be that expensive. As Marc. B said - until you reach millions of records per day, you don't have to worry about scaling reads and writes trough replication.
You say you have an app "polling" your data from datafeeds. Does that mean you are doing full text searches? I'm making an assumption here in that you are batch processing date feeds and then querying that. If that is the case I'd offload all your fulltext queries to something like Solr. It actually isn't too time consuming to setup, depending on the size of your DB you can get away with running it on a fairly small VPS or on your dedicated, and best yet the difference is search speed is incredible. I've had full text mysql queries that would take 20 minutes to run be done in solr in under a second.
Just make sure you use a try statement in the event your solr instance goes down.
Well, this is the thing. Let's say that my future PHP CMS need to drive 500k visitors daily and I need to record them all in MySQL database (referrer, ip address, time etc.). This way I need to insert 300-500 rows per minute and update 50 more. The main problem is that script would call database every time I want to insert new row, which is every time someone hits a page.
My question, is there any way to locally cache incoming hits first (and what is the best solution for that apc, csv...?) and periodically send them to database every 10 minutes for example? Is this good solution and what is the best practice for this situation?
500k daily it's just 5-7 queries per second. If each request will be served for 0.2 sec, then you will have almost 0 simultaneous queries, so there is nothing to worry about.
Even if you will have 5 times more users - all should work fine.
You can just use INSERT DELAYED and tune your mysql.
About tuning: http://www.day32.com/MySQL/ - there is very useful script (will change nothing, just show you the tips how to optimize settings).
You can use memcache or APC to write log there first, but with using INSERT DELAYED MySQL will do almost same work, and will do it better :)
Do not use files for this. DB will serve locks much better, than PHP. It's not so trivial to write effective mutexes, so let DB (or memcache, APC) do this work.
A frequently used solution:
You could implement an counter in memcached which you increment on an visit, and push an update to the database for every 100 (or 1000) hits.
We do this by storing locally on each server to CSV, then having a minutely cron job to push the entries into the database. This is to avoid needing a highly available MySQL database more than anything - the database should be able to cope with that volume of inserts without a problem.
Save them to a directory-based database (or flat file, depends) somewhere and at a certain time, use a PHP code to insert/update them into your MySQL database. Your php code can be executed periodically using Cron, so check if your server has Cron so that you can set the schedule for that, say every 10 minutes.
Have a look at this page: http://damonparker.org/blog/2006/05/10/php-cron-script-to-run-automated-jobs/. Some codes have been written in the cloud and are ready for you to use :)
One way would be to use Apache access.log. You can get a quite fine logging by using cronolog utility with apache . Cronolog will handle the storage of a very big number of rows in files, and can rotate it based on volume day, year, etc. Using this utility will prevent your Apache from suffering of log writes.
Then as said by others, use a cron-based job to analyse these log and push whatever summarized or raw data you want in MySQL.
You may think of using a dedicated database (or even database server) for write-intensive jobs, with specific settings. For example you may not need InnoDB storage and keep a simple MyIsam. And you could even think of another database storage (as said by #Riccardo Galli)
If you absolutely HAVE to log directly to MySQL, consider using two databases. One optimized for quick inserts, which means no keys other than possibly an auto_increment primary key. And another with keys on everything you'd be querying for, optimized for fast searches. A timed job would copy hits from the insert-only to the read-only database on a regular basis, and you end up with the best of both worlds. The only drawback is that your available statistics will only be as fresh as the previous "copy" run.
I have also previously seen a system which records the data into a flat file on the local disc on each web server (be careful to do only atomic appends if using multiple proceses), and periodically asynchronously write them into the database using a daemon process or cron job.
This appears to be the prevailing optimium solution; your web app remains available if the audit database is down and users don't suffer poor performance if the database is slow for any reason.
The only thing I can say, is be sure that you have monitoring on these locally-generated files - a build-up definitely indicates a problem and your Ops engineers might not otherwise notice.
For an high number of write operations and this kind of data you might find more suitable mongodb or couchdb
Because INSERT DELAYED is only supported by MyISAM, it is not an option for many users.
We use MySQL Proxy to defer the execution of queries matching a certain signature.
This will require a custom Lua script; example scripts are here, and some tutorials are here.
The script will implement a Queue data structure for storage of query strings, and pattern matching to determine what queries to defer. Once the queue reaches a certain size, or a certain amount of time has elapsed, or whatever event X occurs, the query queue is emptied as each query is sent to the server.
you can use a Queue strategy using beanstalk or IronQ
I'm soon to be working on a project that poses a problem for me.
It's going to require, at regular intervals throughout the day, processing tens of thousands of records, potentially over a million. Processing is going to involve several (potentially complicated) formulas and the generation of several random factors, writing some new data to a separate table, and updating the original records with some results. This needs to occur for all records, ideally, every three hours. Each new user to the site will be adding between 50 and 500 records that need to be processed in such a fashion, so the number will not be steady.
The code hasn't been written, yet, as I'm still in the design process, mostly because of this issue. I know I'm going to need to use cron jobs, but I'm concerned that processing records of this size may cause the site to freeze up, perform slowly, or just piss off my hosting company every three hours.
I'd like to know if anyone has any experience or tips on similar subjects? I've never worked at this magnitude before, and for all I know, this will be trivial to the server and not pose much of an issue. As long as ALL records are processed before the next three hour period occurs, I don't care if they aren't processed simultaneously (though, ideally, all records belonging to a specific user should be processed in the same batch), so I've been wondering if I should process in batches every 5 minutes, 15 minutes, hour, whatever works, and how best to approach this (and make it scalable in a way that is fair to all users)?
Below I am going to describe how I would approach this problem(but will cost you money and may not be desired solution):
You should use VPS(a quick listing of some cheap VPS). But I guess you should do some more research finding the best VPS for your needs, if you want to achieve your task without pissing of your hosting company(I am sure you will).
You should not use cronjobs but use a message queue like for example beanstalkd to queue up your messages(tasks) and do the processing offline instead. When using a message queue you could also throttle your processing if needed.
Not really necessary, but I would tackle it in this way.
If performance was really a key issue I would have two VPS(at least) instances. one VPS instance to handle the http request from the users visiting your site and one VPS instance to do the offline processing you desire. This way your users/visitor will not notice any heavy offline processing which you are doing.
I also would probably not use PHP to do the offline processing because of the blocking nature. I would use something like node.js to do this kind of processing because nothing is blocking in node.js which is going to be a lot faster.
I also would probably not store the data in a relational database but use the lightning fast redis as a datastore. node_redis is a blazingly fast client for node.js
The problem with many updates on MySQL tables that are used on a website, is that updating data kills your query cache. Meaning that this will slow down you site significantly, even after you update is complete.
A solution we have used before, is to have two MySQL databases (on different servers too, in our case). Only one of them is actively used by the web server. The other is just a fallback and is used for these kind of updates. The two servers replicate their data to one another.
The solution:
Replication is stopped.
The website is told to use Database1.
These large updates you mention done ran on Database2.
Many commonly used queries are executed once on Database2 to warm up the query cache.
The server is told to use Database2.
Replication is started again. Database2 is now used mainly for reading (by both the website and the replication), so there isn't much delay on the websites.
it could be cone using many servers , where each server could do X records/hour , the more records you will be using in the future the more servers you will need , otherwise you might end up with million records being processed while the last 2-3 or even 4th processing is still not finished ...
You might want to consider what kind of database to use. Maybe a relational database isn't the best for this?
Only way to find out is to actually do some benchmarks simulating what you're going to do though.
In this situation I would consider using Gearman (which also has a PHP extension but can be used with many languages)
Do it all server side using a stored procedure that selects subsets of data then processes the data internally.
Here's an example that uses a cursor to select ranges of data:
drop procedure if exists batch_update;
delimiter #
create procedure batch_update
(
in p_from_id int unsigned, -- range of data to select for each batch
in p_to_id int unsigned
)
begin
declare v_id int unsigned;
declare v_val double(10,4);
declare v_done tinyint default 0;
declare v_cur cursor for select id, val from foo where id between = p_from_id and p_to_id;
declare continue handler for not found set v_done = 1;
start transaction;
open v_cur;
repeat
fetch v_cur into v_id, v_val;
-- do work...
if v_val < 0 then
update foo set...
else
insert into foo...
end if;
until v_done end repeat;
close v_cur;
commit;
end #
delimiter ;
call batch_update(1,10000);
call batch_update(10001, 20000);
call batch_update(20001, 30000);
If you can avoid using cursors at all - great, but the main point of my suggestion is about moving the logic from your application tier back into the data tier. I suggest you create a prototype stored procedure in your database and then perform some benchmarks. If the procedure executes in a few seconds then I dont see you having many issues especially if you're using innodb tables with transactions.
Here's another example which may prove of interest although it works on a much larger dataset 50+ million rows:
Optimal MySQL settings for queries that deliver large amounts of data?
Hope this helps :)