Right guys,
I have a MySQL database using InnoDB tables. Every so often I have to run a big cron job that does a large batch of queries and inserts. When this cron job runs, for the 5 minutes or so that it takes, no other page is able to load. As soon as it is done, the queued queries execute and the pages load.
The table that is actually having all this data added to it isn't even queried by the main site. It's simply that when MySQL is under a lot of work, the rest of the site is untouchable. Surely this can't be right - what could be causing it? CPU usage for mysqld rockets to huge figures like 120% (!!!!!) and all MySQL queries are locked.
What could cause/fix this?
No, that's obviously not right. This is probably down to bad configuration. Take a look at the size of the InnoDB buffer pool and see if it can be increased. This sounds like a typical case of RAM shortage: healthy setups are almost never CPU bound, and certainly not when doing bulk inserts.
With InnoDB, other things should still be able to access the database. Are you prepared to show the schema (or relevant part of it) and the relevant parts of the application?
Maybe it's hardware contention.
How big are the transactions which your "cron" job is using? Using tiny transactions will create a massive amount of IO needlessly.
Do your database servers have battery backed raid controllers (assuming your servers use hard drives not SSD)? If not, commits will be quite slow.
How much RAM is in your database server? If possible, ensure that it is a bit bigger than your database and set innodb_buffer_pool_size to more than your data size - this will mean that read workloads are served from RAM anyway, which should make them fast.
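For instance, a minimal my.cnf sketch - the figures are illustrative only, not taken from this question, and must be sized to your own RAM and data set:

[mysqld]
# Illustrative values: keep the working set in memory, give bulk inserts room.
innodb_buffer_pool_size = 12G
innodb_log_file_size    = 256M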
Can you reproduce the problem in a test system on production-grade hardware?
I think you might need to re-think how you are building up your queries. InnoDB uses row-level locks, but with massive updates you can still lock out quite a few of your other queries.
Post your actual queries and try again. I don't think there is a generic solution to a generic question like this, so look into optimizing what you're doing today.
You could have the script delay for 1/10 second or so between each query. It will take longer but allow activity in the background.
usleep( 100000 ); // 0.1 seconds - PHP's sleep() only accepts whole seconds
You will probably only need to do this for the writes, reads are very cheap.
Related
I see a lot of statements like: "Cassandra is very fast on writes" and "Cassandra's reads are really slower than its writes, but much faster than MySQL's".
On my Windows 7 system:
I installed MySQL with the default configuration.
I installed PHP5 with the default configuration.
I installed Cassandra with the default configuration.
Making a simple write test on MySQL, "INSERT INTO wp_test (id,title) VALUES ('id01','test')", gives me a result of 0.0002(s).
For 1000 inserts: 0.1106(s)
Making the same simple write test on Cassandra, $column_family->insert('id01',array('title'=>'test')), gives me a result of 0.005(s).
For 1000 inserts: 1.047(s)
For read tests I also found that Cassandra is much slower than MySQL.
So the question: does it sound correct that one write operation on Cassandra takes 5ms? Or is something wrong, and it should be more like 0.5ms?
When people say "Cassandra is faster than MySQL", they mean when you are dealing with terabytes of data and many simultaneous users. Cassandra (and many distributed NoSQL databases) is optimized for hundreds of simultaneous readers and writers on many nodes, as opposed to MySQL (and other relational DBs) which are optimized to be really fast on a single node, but tend to fall to pieces when you try to scale them across multiple nodes. There is a generalization of this trade-off by the way- the absolute fastest disk I/O is plain old UNIX flat files, and many latency-sensitive financial applications use them for that reason.
If you are building the next Facebook, you want something like Cassandra because a single MySQL box is never going to stand up to the punishment of thousands of simultaneous reads and writes, whereas with Cassandra you can scale out to hundreds of data nodes and handle that load easily. See scaling up vs. scaling out.
Another use case is when you need to apply a lot of batch processing power to terabytes or petabytes of data. Cassandra or HBase are great because they are integrated with MapReduce, allowing you to run your processing on the data nodes. With MySQL, you'd need to extract the data and spray it out across a grid of processing nodes, which would consume a lot of network bandwidth and entail a lot of unneeded complication.
Cassandra benefits greatly from parallelisation and batching. Try doing 1 million inserts on each of 100 threads (each with its own connection and in batches of 100) and see which one is faster.
Finally, Cassandra insert performance should be relatively stable (maintaining high throughput for a very long time). With MySQL, you will find that it tails off rather dramatically once the B-trees used for the indexes grow too large to fit in memory.
It's likely that the maturity of the MySQL drivers, especially the improved MySQL drivers in PHP 5.3, is having some impact on the tests. It's also entirely possible that the simplicity of the data in your query is impacting the results - maybe on 100 value inserts, Cassandra becomes faster.
Try the same test from the command line and see what the timestamps are, then try with varying numbers of values. You can't do a single test and base your decision on that.
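As one way to do that, here is a minimal timing sketch using PDO from the PHP CLI; the table name matches the question, but the DSN, credentials and row ids are placeholders:

<?php
// Times N single-row inserts; vary $n and the number of columns to compare runs.
$pdo  = new PDO('mysql:host=localhost;dbname=test', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO wp_test (id, title) VALUES (?, ?)');
$n     = 1000;
$start = microtime(true);
for ($i = 0; $i < $n; $i++) {
    $stmt->execute(array('id' . $i, 'test'));  // unique ids avoid duplicate-key errors
}
printf("%d inserts in %.4f s\n", $n, microtime(true) - $start);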
Many user-space factors can impact write performance, such as:
Dozens of settings in each of the database server's configuration.
The table structure and settings.
The connection settings.
The query settings.
Are you swallowing warnings or exceptions? The MySQL sample would on face value be expected to produce a duplicate key error. It could be failing while doing nothing at all. What Cassandra might do in the same case isn't something I'm familiar with.
My limited experience of Cassandra tells me one thing about inserts: while the performance of everything else degrades as data grows, inserts appear to maintain the same speed. How fast it is compared to MySQL, however, isn't something I've tested.
It might not be so much that inserts are fast, but rather that Cassandra tries never to be slow. If you want a more meaningful test you need to incorporate concurrency and more variations on the scenario, such as large data sets, various batch sizes, etc. More complex tests might measure the latency until data is available after an insert, and read speed over time.
It would not surprise me if Cassandra's first port of call for inserting data is to put it on a queue or to simply append. This is configurable if you look at consistency level. MySQL similarly allows you to balance performance and reliability/availability though each will have variations on what they allow and don't allow.
Outside of that unless you get into the internals it may be hard to tell why one performs better than the other.
I did some benchmarks of a use case I had for Cassandra a while ago. For the benchmark it would insert tens of thousands of rows first. I had to make the script sleep for a few seconds because otherwise queries run after the fact would not see the data and the results would be inconsistent between implementations I was testing.
If you really want fast inserts, append to a file on ramdisk.
Replication
I have an app that is polling data from a large number of data feeds. It processes thousands of records per day and this number is ever increasing. The data is stored in MySQL.
I then have a website that utilises this data.
I'm trying to build my environment with the future in mind.
I thought of MySQL replication, so that the website can use its own database on a different server and not get bogged down by the thousands of write commands that are happening on the main database.
I am having difficulty getting this set up, despite MySQL reporting it's all working fine.
I then started to think - is there not a better way?
From what I understand, the master sends the write commands on to the slave database.
Does this not mean that what I am trying to avoid is just happening anyway?
Does this mean that the slave database will suffer the same thousands of writes?
I am a one-man band, doing this venture with my own money, so I need to do this the cheapest way. I am getting a bit lost!
I have a dedicated server,
a VPS,
and I'm using PHP5 and MySQL 5 in a LAMP stack.
I cannot begin to tell you how much I would appreciate some guidance!
If the slaves are a 1:1 clone of the master, then all writes to the master MUST be propagated down to the slaves. Otherwise replication would be useless.
Thousands of records per day is actually very small. Assuming the same processing time for each, and doing 5000 records, you'd have 86400/5000 = 17.28 seconds per record. That's very minimal write overhead.
If you were doing millions of records a day, THEN you'd have a write bottleneck.
I would split this in three layers.
Data Feed layer. Data read from the feeds is preprocessed and posted into a queue. This layer has a temporary queue that serves also as a temporary storage, a buffer to allow all data feed to post its data. I'd use a Message Queue System. It's fast and reliable.
Data Store layer. This layer reads from the queue, maybe processes someway the data read, and stores the data in the database.
Data Analysis layer. This is your "slave" database. It's a data warehouse. It periodically does ETL (extract, transform and load) data from the Data Store layer to this secondary database.
This layered approach allows you to isolate concerns (speed, reliability, security) and implementation details, and it allows for future scalability. A rough sketch of the first two layers follows.
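As an illustration only - the answer doesn't prescribe a particular queue product - this sketch assumes a Redis list as the queue (via the phpredis extension) and PDO for the data store; the names feed_queue and feed_data are invented:

<?php
// Data Feed layer: push raw feed records onto a queue instead of writing to MySQL directly.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->rPush('feed_queue', json_encode(array('feed_id' => 42, 'payload' => '...')));

// Data Store layer (a separate worker process): drain the queue and insert into MySQL.
$pdo  = new PDO('mysql:host=localhost;dbname=feeds', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO feed_data (feed_id, payload) VALUES (?, ?)');
while (($job = $redis->lPop('feed_queue')) !== false) {
    $row = json_decode($job, true);
    $stmt->execute(array($row['feed_id'], $row['payload']));
}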
Replication is literally what the word suggests - replicating queries on another machine.
MySQL creates a log that's filled with queries that were used to create the dataset on the original machine (master) and sends it to the slave(s) that read the log and re-execute those queries.
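For reference, the basic setup is only a few lines - a sketch with placeholder server IDs, hostnames and credentials; the log file and position come from SHOW MASTER STATUS on the master:

# master my.cnf
[mysqld]
server-id = 1
log-bin   = mysql-bin

# slave my.cnf
[mysqld]
server-id = 2

-- then, on the slave:
CHANGE MASTER TO MASTER_HOST='master.example.com',
    MASTER_USER='repl', MASTER_PASSWORD='secret',
    MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4;
START SLAVE;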
Basically, what you want is to increase your write throughput. That's achievable through using different engines; for example, TokuDB is one of them (it isn't free, but you are allowed to store 50GB of user data at no charge).
What you want (for the moment) is a fast HDD subsystem more than a monolithic write-scalable storage system. InnoDB is capable of achieving a lot of queries per second on a properly configured machine with sufficient hardware. I am not sure about pricing, but an SSD and 4-8 GB of RAM shouldn't be that expensive. As Marc B said - until you reach millions of records per day, you don't have to worry about scaling reads and writes through replication.
You say you have an app "polling" your data from data feeds. Does that mean you are doing full-text searches? I'm making an assumption here that you are batch processing data feeds and then querying them. If that is the case I'd offload all your full-text queries to something like Solr. It actually isn't too time consuming to set up; depending on the size of your DB you can get away with running it on a fairly small VPS or on your dedicated server, and best of all, the difference in search speed is incredible. I've had full-text MySQL queries that would take 20 minutes to run complete in Solr in under a second.
Just make sure you wrap the call in a try/catch in the event your Solr instance goes down.
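Something along these lines, for example - a sketch only, which assumes Solr's standard HTTP select endpoint on port 8983 and falls back to an existing MySQL full-text query; the URL, table and columns are illustrative:

<?php
// Hypothetical search helper: try Solr first, fall back to MySQL if it is unreachable.
function search_feeds(PDO $pdo, $term) {
    $url = 'http://localhost:8983/solr/select?wt=json&q=' . urlencode($term);
    try {
        $json = file_get_contents($url);
        if ($json === false) {
            throw new RuntimeException('Solr is unreachable');
        }
        $result = json_decode($json, true);
        return $result['response']['docs'];
    } catch (Exception $e) {
        // Fallback: the slower MySQL full-text search.
        $stmt = $pdo->prepare('SELECT * FROM articles WHERE MATCH(title, body) AGAINST (?)');
        $stmt->execute(array($term));
        return $stmt->fetchAll(PDO::FETCH_ASSOC);
    }
}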
I have a PHP website which displays recipes - www.trymasak.my, to be exact. The recipes displayed on the index page are updated about once a day. To get the latest recipes, I just use a MySQL query, something like "select recipe_name, page_views, image from table order by last_updated". So if I get 10,000 visitors a day, obviously the query is run 10,000 times a day. A friend told me a better way (in terms of reducing server load): when I update the recipes, I just put the latest recipe details (names, images etc) into a text file, and instead of running the same query 10,000 times, my page gets the data from the text file. Is his suggestion really better? If yes, which PHP commands should I use to open, read and close the text file?
thanks
The typical solution is to cache in memory. Either the query result or the whole page.
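For instance, a minimal cache-aside sketch with Memcached - the key name, TTL, table and credentials are made up for illustration; APC's user cache or a file cache would work the same way:

<?php
// Serve the recipe list from Memcached when possible; fall back to one MySQL query on a miss.
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

$recipes = $cache->get('latest_recipes');
if ($recipes === false) {   // cache miss: run the query once
    $pdo = new PDO('mysql:host=localhost;dbname=trymasak', 'user', 'pass');
    $recipes = $pdo->query(
        'SELECT recipe_name, page_views, image FROM recipes ORDER BY last_updated DESC'
    )->fetchAll(PDO::FETCH_ASSOC);
    $cache->set('latest_recipes', $recipes, 3600);  // keep for an hour
}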
Benchmark
To know the truth about something you should really benchmark it. "Simple is Hard" by Rasmus Lerdorf (the creator of PHP) is a really interesting video/slide deck (my opinion ;)) which explains how to benchmark your website. It will teach you to tackle the low-hanging fruit of your website instead of wasting your time on premature optimizations.
Donald Knuth made the following two statements on optimization: "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil" and "In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal and I believe the same viewpoint should prevail in software engineering".
In a nutshell, you will run benchmarks using tools like Siege, ab, httperf, etc. I would really advise you to watch that video if you aren't familiar with this topic, because I found it a really interesting watch/read.
Speed
If speed is your concern you should at least consider:
Using a bytecode cache => APC. Precompiling your PHP will really speed up your website for at least these two big reasons:
Most PHP accelerators work by caching the compiled bytecode of PHP scripts to avoid the overhead of parsing and compiling source code on each request (some or all of which may never even be executed). To further improve performance, the cached code is stored in shared memory and directly executed from there, minimizing the amount of slow disk reads and memory copying at runtime.
PHP accelerators can substantially increase the speed of PHP applications. Improvements of web page generation throughput by factors of 2 to 7 have been observed, and compute-intensive analysis programs have run up to 50 times faster.
Use an in-memory store for your query results => Redis or Memcached. There is a very, very big mismatch between memory and disk (I/O).
Thus, we observe that the main memory is about 10 times slower, and I/O units 1000 times slower, than the processor.
The analogy part is also an interesting read (can't copy from Google Books :)).
Databases are more flexible, secure and scalable in the long run. 10,000 queries per day isn't really that much for a modern RDBMS either. Go with (or stay with) the database.
Optimize on the caching side of things; the HTTP specification has its own section on that:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html
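Since the recipes change roughly once a day, even a conditional-GET sketch like this helps - the date and max-age are illustrative; in practice the Last-Modified value would come from MAX(last_updated) in the database:

<?php
// Let browsers and proxies revalidate instead of re-rendering the page on every hit.
$lastUpdated = strtotime('2012-01-01 06:00:00');  // placeholder; fetch from the DB in practice
header('Cache-Control: public, max-age=3600');
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $lastUpdated) . ' GMT');

if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE']) &&
    strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE']) >= $lastUpdated) {
    header('HTTP/1.1 304 Not Modified');  // client already has the current page
    exit;
}
// ...otherwise render the page as usual.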
Reading files:
file_get_contents(), if you want to store the contents in a variable before outputting
readfile(), if you just want to output the file's contents
If you try it, store the recipes in a folder structure like this:
/recipes/X/Y/Z/id.txt
where X, Y and Z are random integers from 1 to 25
example:
/recipes/3/12/22/12345.txt
This is because the filesystem is just another database. And it has a lot more hidden meta data updates to deal with.
I think MySQL will be faster, and certainly more manageable, since you'd have to backup the MySQL db anyway.
Opening ONE file is faster than doing a MySQL connect + query.
BUT, if your website already needs a MySQL connection to retrieve some other information, you probably want to stick with your query, because the longest part is the connection and your query is very light.
On the other hand, opening 10 files is slower than querying 10 records from a database, because you only open one MySQL connection.
In any case, you have to consider how heavy your query is and whether caching it in a text file has more pros than cons.
I am developing a large web app and want it to alter itself dependent on a factor that relates to the stress the database is currently under.
I am not sure what would be most accurate/effective/easiest. I am considering maybe the number of current connections, or server response time, or CPU usage?
What would be best suited and possible?
Thanks
The MySQL Query Profiler does what you are looking for.
http://dev.mysql.com/tech-resources/articles/using-new-query-profiler.html
If you would rather pay money to get a graphical profiler then try this out:
http://www.jetprofiler.com/
The amount of "stress" the database is under is not very real metric. The important thing is to identify how scalable the application is, and the extenet to which the database is contributing to unacceptable performance. This sounds like a bit of a get-out but there's not much point in spending time and effort on something without a clear objective of what you intend to achieve.
Important things are to start recording microsecond level response times in your webserver logs and enable slow query logging in mysql. Then test your DBMS to see what's slow, what's slow AND getting hit often, and what slows down as demand increases.
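Enabling the slow query log is only a few lines in my.cnf (MySQL 5.1+ variable names shown; older versions use log_slow_queries; the file path and threshold are examples):

[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time     = 1    # log any query taking longer than 1 second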
Certainly if you have performance problems then by all means start looking at CPU, memory usage and I/O but these are primarily symptoms of a performance problem - not true indicators. You might have 10% CPU usage and your system could be running like a dog, or 95% usage and running like a greyhound ;).
System load (i.e. the average length of the run queue) is a better indicator than CPU - but it is still measuring a symptom. In general, database-related slowness is usually primarily about I/O issues, and usually resolved by SQL tuning.
C.
Interesting question. What you REALLY want is a way for PHP to ask the mySQL server two questions:
server, are you using almost all your cpu capacity?
server, are you using almost all your disk IO capacity?
Based on the answers, I suppose you want to simplify the work your PHP web app does ... perhaps by eliminating some kind of search capability, or caching some data more aggressively.
If you have a shell to your (linux or bsd) mysql server, your two questions can be answered by eyeballing the output from these two commands.
sar -u 1 10 # the %idle column tells you about unused cpu cycles
sar -d 1 10 # the %util column tells you which disks are busy and how busy.
But, there's no sweet little query which fetches this data from mySQL to your app.
Edit: one possibility is to write a little Perl hack or other simple program that runs on your server, connects to the local database, and every so often (once a minute, maybe) determines %idle and %util and updates a little one-row table in your database. You could, without too much trouble, also add things like how full your disks are to this table, if you care. Then your PHP app can query this table. This is an ideal use of the MEMORY storage engine. At any rate, keep it simple: you don't want your monitoring to weigh down your server.
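As a sketch of the table side of that idea - the names and values are invented; the monitoring script would fill it in however it parses the sar output:

-- One-row status table kept entirely in RAM.
CREATE TABLE server_load (
    id        TINYINT UNSIGNED NOT NULL PRIMARY KEY,
    cpu_idle  DECIMAL(5,2),
    disk_util DECIMAL(5,2)
) ENGINE=MEMORY;

-- The monitor runs something like this once a minute:
REPLACE INTO server_load (id, cpu_idle, disk_util) VALUES (1, 87.50, 12.30);

-- The PHP app then reads it with a trivial query:
SELECT cpu_idle, disk_util FROM server_load WHERE id = 1;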
A second-best trick, that you CAN do from your client.
Issue the command SHOW FULL PROCESSLIST, count the number of rows (MySQL processes) for which the Command is "Query", and if you have a lot of them consider it to be a high workload.
You might also add up the Time values for the processes which have status Query, and use a high value of that time as a threshold.
EDIT: if you're running on a mySQL 5 server, and your server account has access to the mySql-furnished information_schema, you can use a query directly to get the process data I mentioned:
SELECT COUNT(*) - 1 AS QUERYCOUNT, SUM(P.TIME) AS QUERYTIME
FROM information_schema.PROCESSLIST P
WHERE P.COMMAND = 'Query'
COUNT(*) - 1: because the above query itself counts as a query.
You will need to fiddle with the threshold values to make this work right in production.
It's a good idea to have your PHP web app shed load when the database server can't keep up. Still, a better idea is to identify your long-running queries and optimize them.
I've been coding PHP for a while now and have a pretty firm grip on it; MySQL, well, let's just say I can make it work.
I'd like to make a stats script to track the stats of other websites similar to the obvious statcounter, google analytics, mint, etc.
I, of course, would like to code this properly, and I don't see MySQL liking 20,000,000 to 80,000,000 inserts (roughly 925 inserts per second) daily.
I've been doing some research and it looks like I should store each visit, "entry", into a csv or some other form of flat file and then import the data I need from it.
Am I on the right track here? I just need a push in the right direction - the direction being a way to inhale 1,000 pseudo "MySQL" inserts per second, and the proper way of doing it.
Example Insert: IP, time(), http_referer, etc.
I need to collect this data for the day, and then at the end of the day, or in certain intervals, update ONE row in the database with, for example, how many extra unique hits we got. I know how to do that of course, just trying to give a visualization since I'm horrible at explaining things.
If anyone can help me, I'm a great coder, I would be more than willing to return the favor.
We tackled this at the place I've been working at for the last year or so, over the summer. We didn't require much granularity in the information, so what worked very well for us was coalescing data by different time periods. For example, we'd have a single day's worth of real-time stats; after that it'd be pushed into some daily sums, and then off into a monthly table.
This obviously has some huge drawbacks, namely a loss of granularity. We considered a lot of different approaches at the time. For example, as you said, CSV or some similar format could potentially serve as a way to handle a month of data at a time. The big problem is inserts however.
Start by setting out some sample schema in terms of EXACTLY what information you need to keep, and in doing so, you'll guide yourself (through revisions) to what will work for you.
Another note on the vast number of inserts: we had talked through the idea of dumping real-time statistics into a little daemon which would store up to an hour's worth of data, then, outside of real time, inject that into the database before the next hour was up. Just a thought.
For the kind of activity you're looking at, you need to look at the problem from a new point of view: decoupling. That is, you need to figure out how to decouple the data-recording steps so that delays and problems don't propagate back up the line.
You have the right idea in logging hits to a database table, insofar as that guarantees in-order, non-contended access. This is something the database provides. Unfortunately, it comes at a price, one of which is that the database completes the INSERT before getting back to you. Thus the recording of the hit is coupled with the invocation of the hit. Any delay in recording the hit will slow the invocation.
MySQL offers a way to decouple that; it's called INSERT DELAYED. In effect, you tell the database "insert this row, but I can't stick around while you do it" and the database says "okay, I got your row, I'll insert it when I have a minute". It is conceivable that this reduces locking issues because it lets one thread in MySQL do the insert, not whichever one you connect to. Unfortunately, it only works with MyISAM tables.
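Syntactically it is a one-word change to the statement - the table and columns below are illustrative, and the target table must be MyISAM:

INSERT DELAYED INTO hits (ip, referer, hit_time)
VALUES ('203.0.113.7', 'http://example.com/', NOW());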
Another solution, which is a more general solution to the problem, is to have a logging daemon that accepts your logging information and just en-queues it to wherever it has to go. The trick to making this fast is the en-queueing step. This is the sort of solution syslogd would provide.
In my opinion it's a good thing to stick to MySQL for registering the visits, because it provides tools to analyze your data. To decrease the load I would have the following suggestions.
Make a fast collecting table, with no indexes except the primary key, MyISAM, one row per hit.
Make a normalized data structure for the hits and move the records to it once a day.
This gives you a smaller performance hit for logging and a well indexed normalized structure for querying/analyzing.
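A sketch of what that might look like - table and column names are invented for illustration:

-- Fast collecting table: MyISAM, primary key only, one row per hit.
CREATE TABLE hits_raw (
    id       BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    ip       VARCHAR(45) NOT NULL,
    referer  VARCHAR(255),
    hit_time DATETIME NOT NULL
) ENGINE=MyISAM;

-- Nightly job: move the day's rows into the normalized, indexed structure, then empty the buffer.
INSERT INTO hits_archive (ip, referer, hit_time)
    SELECT ip, referer, hit_time FROM hits_raw;
TRUNCATE TABLE hits_raw;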
Presuming that your MySQL server is on a different physical machine to your web server, then yes it probably would be a bit more efficient to log the hit to a file on the local filesystem and then push those to the database periodically.
That would add some complexity though. Have you tested or considered testing it with regular queries? Ie, increment a counter using an UPDATE query (because you don't need each entry in a separate row). You may find that this doesn't slow things down as much as you had thought, though obviously if you are pushing 80,000,000 page views a day you probably don't have much wiggle room at all.
You should be able to get that kind of volume quite easily, provided that you do some stuff sensibly. Here are some ideas.
You will need to partition your audit table on a regular (hourly? daily?) basis, if nothing else so that you can drop old partitions to manage space sensibly. DELETEing 10M rows is not cool.
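A sketch of daily range partitioning (MySQL 5.1+ syntax; the table definition is invented), which makes dropping a whole day one cheap operation:

CREATE TABLE audit_hits (
    ip       VARCHAR(45) NOT NULL,
    referer  VARCHAR(255),
    hit_time DATETIME NOT NULL
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(hit_time)) (
    PARTITION p20120101 VALUES LESS THAN (TO_DAYS('2012-01-02')),
    PARTITION p20120102 VALUES LESS THAN (TO_DAYS('2012-01-03')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
);

-- Dropping an old day is near-instant compared to DELETEing its rows:
ALTER TABLE audit_hits DROP PARTITION p20120101;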
Your web servers (as you will be running quite a large farm, right?) will probably want to do the inserts in large batches, asynchronously. You'll have a daemon process on each web server machine which reads the flat-file logs and batches them up. This is important for InnoDB performance and to avoid auditing slowing down the web servers. Moreover, if your database is unavailable, your web servers need to continue servicing web requests and still have them audited (eventually).
As you're collecting large volumes of data, some summarisation is going to be required in order to report on it at a sensible speed - how you do this is very much a matter of taste. Make sensible summaries.
InnoDB engine tuning - you will need to tune the InnoDB engine quite significantly - in particular, have a look at the variables controlling its use of disc flushing. Writing out the log on each commit is not going to be cool (maybe unless it's on an SSD - if you need performance AND durability, consider an SSD for the logs) :) Ensure your buffer pool is big enough. Personally I'd use the InnoDB plugin and the file-per-table option, but you could also use MyISAM if you fully understand its characteristics and limitations.
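The variables in question look roughly like this - a sketch only; the values must be tuned to your own hardware and durability requirements:

[mysqld]
innodb_buffer_pool_size        = 8G    # big enough to hold the working set
innodb_flush_log_at_trx_commit = 2     # flush the log roughly once a second instead of per commit
innodb_log_file_size           = 512M  # larger redo logs help sustained insert rates
innodb_file_per_table          = 1     # one tablespace file per table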
I'm not going to further explain any of the above as if you have the developer skills on your team to build an application of that scale anyway, you'll either know what it means or be capable of finding it out.
Provided you don't have too many indexes, 1000 rows/sec is not unrealistic with your data sizes on modern hardware; we insert that many sometimes (and probably have a lot more indexes).
Remember to performance test it all on production-spec hardware (I don't really need to tell you this, right?).
I think that using MySQL is overkill for the task of collecting the logs and summarizing them. I'd stick to plain log files in your case. They don't provide the full power of a relational database management system, but they're quite enough to generate summaries. A simple lock-append-unlock file operation on a modern OS is seamless and instant. By contrast, using MySQL for the same simple operation loads the CPU and may lead to swapping and other scalability hell.
Mind the storage as well. With plain text files you'll be able to store years of logs of a highly loaded website, taking into account the current HDD price/capacity ratio and the compressibility of plain text logs.