I've been working with MySQL for a while, but I've never really used the supported mathematical functions, such as FLOOR(), SQRT(), CRC32(), etc.
Is it faster / better to use these functions in queries rather than just doing the same on the result set with PHP?
EDIT: I don't think this question is a duplicate of this, as my question is about mathematical functions, listed on the page I linked, not CONCAT() or NOW() or any other function as in that question. Please consider this before flagging.
It is more efficient to do this in PHP.
Which is faster depends on the machines involved, if you're talking about speed for one user. If you're talking about a million users hitting a website, then it's more efficient to do these calculations in PHP.
The load of a webserver running PHP is very easily distributed over a large number of machines. These machines can run in parallel, handling requests from visitors and fetching necessary information from the database. The database, however, is not easy to run in parallel. Issues such as replication or sharding are complex and can require specialty software and properly organized data to function well. These are expensive solutions compared to adding another PHP installation to a server array.
Because of this, a CPU cycle on the database machine is far more valuable than one on the webserver. So you should perform these math functions on the webserver, where CPU cycles are cheaper and significantly easier to parallelize.
There's no general answer to that. You certainly shouldn't go out of your way to do math in SQL instead of PHP; it really doesn't make that much of a difference, if there is any. However, if you're doing an SQL query anyway and you have the choice of doing the operation in PHP before you send it to MySQL, or in the query itself... it still won't make much of a difference. Often there will be a logical difference, though, in terms of when and how often the operation is performed, where it needs to be performed, and where the code is best kept for maintainability and reuse. That should be your first consideration, not performance.
Overall, you have to do some really complex math for any of it to make any difference whatsoever. Some simple math operations are trivial in virtually any language and environment. If in doubt, benchmark your specific case.
Like deceze said there's probably not going to be much difference in speed between using math functions in SQL and PHP. If you're really worried, you should always benchmark both of your use cases.
However, one example comes to mind where it's probably better to use SQL math functions than PHP: when you don't need to perform any additional operations on the results from the DB. If you do your operations in MySQL, you avoid having to loop through the results in PHP.
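A minimal sketch of the point (assuming the mysqli extension; the table, column, and credentials are hypothetical), comparing summing in PHP against letting MySQL return the sum as a single row:

<?php
$mysqli = new mysqli('localhost', 'user', 'pass', 'shop');

// In PHP: fetch every row, then loop over the whole result set.
$total = 0;
$result = $mysqli->query("SELECT price FROM orders");
while ($row = $result->fetch_assoc()) {
    $total += $row['price'];
}

// In MySQL: the server does the math and one small row comes back.
$row = $mysqli->query("SELECT SUM(price) FROM orders")->fetch_row();
$total = $row[0];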
There is an additional consideration to think of, and that's scaling. You usually have one MySQL server (if you have more than one, you probably already know all of this). But you can have many web servers connecting to the same MySQL server (e.g. when you have a load balancer).
In that case, it's going to be better to move the computation to PHP to take the load off MySQL. It's "easier" to add more web servers than to increase the performance of MySQL. In theory you can add an unlimited number of web servers, but the amount of memory and the processor speed / number of cores in a MySQL server is finite. You can scale MySQL in other ways, like using MySQL Cluster or doing master-slave replication and reading from the slaves, but that will always be more complicated and harder to do.
MySQL is faster within the scope of an SQL query; PHP is faster in PHP code. If you issue an SQL query just to compute SQRT(), it will definitely be slower (unless your PHP is broken), because of MySQL's parsing and networking overhead.
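A rough sketch of that comparison (assuming a local MySQL server and the mysqli extension; the credentials are hypothetical):

<?php
$mysqli = new mysqli('localhost', 'user', 'pass', 'test');

$start = microtime(true);
for ($i = 1; $i <= 1000; $i++) {
    $x = sqrt($i); // pure PHP: no parsing, no round trip
}
$phpTime = microtime(true) - $start;

$start = microtime(true);
for ($i = 1; $i <= 1000; $i++) {
    // full parse + network overhead on every iteration
    $row = $mysqli->query("SELECT SQRT($i)")->fetch_row();
}
$sqlTime = microtime(true) - $start;

printf("PHP: %.6fs  MySQL: %.6fs\n", $phpTime, $sqlTime);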
I'm currently saving my language files in a MySQL database.
Is it generally better (I'm thinking about performance) to fetch all page-specific language strings at once (far fewer queries, but they are bigger and contain some unnecessary strings) or to fetch on request (a lot more queries, but each one is much smaller and won't fetch unnecessary strings)?
EDIT: I'm using APC, and there are about 200-250 page-specific strings, but it becomes maybe 100-150 if I fetch on request. I'm hosting MySQL on the same machine.
It depends entirely on your situation and your available resources. Fetching everything at once will probably be better if you're making single-threaded requests to a remote server, for example, but more small requests might be faster and less memory-intensive running on a local MySQL server (Tuncay said it results in poor performance, though). It would probably be even faster if the page were rigged up to make the requests asynchronously, so that you're not waiting for the last one before making another.
However, the only way to really know is to run some benchmarks in your environment.
My experience is that the MySQL server can easily handle a big request; several small ones instead result in very poor performance. In comparable situations I find one query is almost always better in terms of performance. Get the whole data set from the database and let PHP sort out the rest.
However, just fetching the data you need from the db in one query is even better. Are you sure you can't use an appropriate WHERE clause?
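A sketch combining both suggestions (one query, restricted by a WHERE clause, with the result cached in APC; the table name and page key are hypothetical):

<?php
$page = 'checkout';
$strings = apc_fetch("lang_$page");

if ($strings === false) { // cache miss: one query for this page only
    $mysqli = new mysqli('localhost', 'user', 'pass', 'site');
    $stmt = $mysqli->prepare("SELECT name, value FROM lang_strings WHERE page = ?");
    $stmt->bind_param('s', $page);
    $stmt->execute();
    $stmt->bind_result($name, $value);

    $strings = array();
    while ($stmt->fetch()) {
        $strings[$name] = $value;
    }
    apc_store("lang_$page", $strings, 3600); // keep for an hour
}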
I see a lot of statements like: "Cassandra very fast on writes", "Cassandra has reads really slower than writes, but much faster than Mysql"
On my windows7 system:
I installed MySQL with the default configuration.
I installed PHP5 with the default configuration.
I installed Cassandra with the default configuration.
Making simple write test on mysql: "INSERT INTO wp_test (id,title) VALUES ('id01','test')" gives me result: 0.0002(s)
For 1000 inserts: 0.1106(s)
Making the same simple write test on Cassandra: $column_family->insert('id01',array('title'=>'test')) gives me a result of: 0.005(s)
For 1000 inserts: 1.047(s)
For read tests I also found that Cassandra is much slower than MySQL.
So the question: does it sound correct that I get 5ms for one write operation on Cassandra? Or is something wrong, and it should be more like 0.5ms?
When people say "Cassandra is faster than MySQL", they mean when you are dealing with terabytes of data and many simultaneous users. Cassandra (like many distributed NoSQL databases) is optimized for hundreds of simultaneous readers and writers on many nodes, as opposed to MySQL (and other relational DBs), which is optimized to be really fast on a single node but tends to fall to pieces when you try to scale it across multiple nodes. There is a generalization of this trade-off, by the way: the absolute fastest disk I/O is plain old UNIX flat files, and many latency-sensitive financial applications use them for that reason.
If you are building the next Facebook, you want something like Cassandra because a single MySQL box is never going to stand up to the punishment of thousands of simultaneous reads and writes, whereas with Cassandra you can scale out to hundreds of data nodes and handle that load easily. See scaling up vs. scaling out.
Another use case is when you need to apply a lot of batch processing power to terabytes or petabytes of data. Cassandra or HBase are great because they are integrated with MapReduce, allowing you to run your processing on the data nodes. With MySQL, you'd need to extract the data and spray it out across a grid of processing nodes, which would consume a lot of network bandwidth and entail a lot of unneeded complication.
Cassandra benefits greatly from parallelisation and batching. Try doing 1 million inserts on each of 100 threads (each with its own connection and in batches of 100) and see which one is faster.
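As a hedged sketch of the batching half (this assumes the phpcassa client the question appears to be using, which provides a batch_insert() method; if your client differs, the idea is the same):

<?php
// build 100 rows in memory, then send them in a single round trip
$rows = array();
for ($i = 0; $i < 100; $i++) {
    $rows['id' . $i] = array('title' => 'test');
}
$column_family->batch_insert($rows); // one call instead of 100 insert()s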
Finally, Cassandra insert performance should be relatively stable (maintaining high throughput for a very long time). With MySQL, you will find that it tails off rather dramatically once the B-trees used for the indexes grow too large to fit in memory.
It's likely that the maturity of the MySQL drivers, especially the improved MySQL drivers in PHP 5.3, is having some impact on the tests. It's also entirely possible that the simplicity of the data in your query is impacting the results - maybe on 100 value inserts, Cassandra becomes faster.
Try the same test from the command line and see what the timestamps are, then try with varying numbers of values. You can't do a single test and base your decision on that.
Many user space factors can impact write performance. Such as:
Dozens of settings in the database server's configuration.
The table structure and settings.
The connection settings.
The query settings.
Are you swallowing warnings or exceptions? The MySQL sample would, on the face of it, be expected to produce a duplicate-key error, since the same id is inserted every time. It could be failing while doing nothing at all. What Cassandra would do in the same case isn't something I'm familiar with.
My limited experience of Cassandra tells me one thing about inserts: while the performance of everything else degrades as data grows, inserts appear to maintain the same speed. How fast it is compared to MySQL, however, isn't something I've tested.
It might not be so much that inserts are fast, but rather that Cassandra tries never to be slow. If you want a more meaningful test you need to incorporate concurrency and more variations on the scenario, such as large data sets, various batch sizes, etc. More complex tests might measure latency until data is available for reading after an insert, and read speed over time.
It would not surprise me if Cassandra's first port of call for inserting data is to put it on a queue or to simply append it. This is configurable if you look at consistency levels. MySQL similarly allows you to balance performance against reliability/availability, though each will have variations on what they allow and don't allow.
Outside of that unless you get into the internals it may be hard to tell why one performs better than the other.
I did some benchmarks of a use case I had for Cassandra a while ago. For the benchmark it would insert tens of thousands of rows first. I had to make the script sleep for a few seconds because otherwise queries run after the fact would not see the data and the results would be inconsistent between implementations I was testing.
If you really want fast inserts, append to a file on ramdisk.
I need to update a large db quickly. It may be easier to code in a scripting language but I suspect a C program would do the update faster. Anybody know if there have been comparative speed tests?
It wouldn't.
The update speed depends on:
database configuration (storage engine used, settings)
hardware of the server, especially the HDD subsystem
network bandwidth between source and target machine
amount of data transferred
I suspect that you think a scripting language will be the hog in that last part - the amount of data transferred.
Any scripting language will be fast enough to deliver the data. If you have a large amount of data that you need to parse / transform quickly, then yes, C would definitely be the language of choice. However, if it's just sending simple string data to the db, there's no point in doing that - although it's not like it's difficult to create a simple C program for an UPDATE operation; complexity-wise it's almost on par with using PHP's mysql_ functions.
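For reference, the PHP side of such an update is only a few lines (a sketch using the old mysql_ functions mentioned above; the connection details and table are hypothetical):

<?php
$link = mysql_connect('localhost', 'user', 'pass');
mysql_select_db('mydb', $link);
mysql_query("UPDATE items SET price = price * 1.10 WHERE category = 'books'", $link)
    or die(mysql_error($link));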
Are you concerned about speed because you're already dealing with a situation where speed is a problem, or are you just planning ahead?
I can say comfortably that DB interactions are generally constrained by IO, network bandwidth, memory, database traffic, SQL complexity, database configuration, indexing issues, and the quantity of data being selected far more than by the choice of a scripting language versus C.
When you run into bottlenecks, they'll almost always be solved by a better algorithm, smarter use of indexes, faster IO devices, more caching... those sorts of things (beginning with algorithms).
The fourth component of LAMP is a scripting language after all. When fine tuning, memcache becomes an option, as well as persistent interpreters (such as mod_perl in a web environment, for example).
The majority of the cost in database transactions lies on the database side. The cost of interpreting / compiling your SQL statement and planning the query execution is much more substantial than any difference to be found in the language that sent it.
It is in rare situations that the application's CPU usage for database-intensive work is a greater factor than the CPU use of the database server, or the disk speed of that server.
Unless your applications are long-running and don't wait on the database, I wouldn't worry about benchmarking them. If they do need benchmarking, you should do it yourself. Data use cases vary wildly and you need your own numbers.
Since C is a lower-level language, it won't have the parsing/type-conversion overhead that the scripting languages have. A MySQL int can map directly onto a C int, whereas a PHP int has various metadata attached to it that needs to be populated/updated.
On the other hand, if you need to do any text manipulation as part of this large update, any speed gains from C would probably be lost in hairpulling/debugging because of its poor string manipulation support versus what you could do with trivial ease in a scripting language like Perl or PHP.
I've heard speculation that the C API is faster, but I haven't seen any benchmarks. For performing large database operations quickly, regardless of programming language, use Stored Procedures: http://dev.mysql.com/tech-resources/articles/mysql-storedprocedures.html.
The speed comes from the fact that there is a reduced strain on the network.
From that link:
Stored procedures are fast! Well, we can't prove that for MySQL yet, and everyone's experience will vary. What we can say is that the MySQL server takes some advantage of caching, just as prepared statements do. There is no compilation, so an SQL stored procedure won't work as quickly as a procedure written with an external language such as C. The main speed gain comes from reduction of network traffic. If you have a repetitive task that requires checking, looping, multiple statements, and no user interaction, do it with a single call to a procedure that's stored on the server. Then there won't be messages going back and forth between server and client, for every step of the task.
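A minimal sketch of what that looks like in practice (the procedure and table names are hypothetical; the point is that the client makes one round trip, however much work the procedure does):

<?php
$mysqli = new mysqli('localhost', 'user', 'pass', 'mydb');

// created once; both statements now live on the server
$mysqli->query("
    CREATE PROCEDURE archive_old_orders(IN cutoff DATE)
    BEGIN
        INSERT INTO orders_archive SELECT * FROM orders WHERE created < cutoff;
        DELETE FROM orders WHERE created < cutoff;
    END
");

// a single message crosses the network for the whole task
$mysqli->query("CALL archive_old_orders('2010-01-01')");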
The C API will be marginally faster, for the simple reason that any other language (regardless of whether it's a "scripting language" or a fully-compiled language) will probably, at some level, be mapping from that language to the C API. Using the C API directly will obviously be a few dozen CPU cycles faster than performing a mapping operation and then using the C API.
But this is just spitting in the ocean. Even accessing main memory is an order of magnitude or two slower than CPU cycles on a modern machine and I/O operations (disk or network access) are several orders of magnitude slower still. There's no point in optimizing to make it a microsecond faster to send the query if it will still take half a second (or even multiple seconds, for queries which are complex or examine/return large amounts of data) to actually run the query.
Choose the language that you will be most productive in and don't worry about micro-optimizing language choice. Even if the language itself becomes a performance issue (which is extremely unlikely), your additional productivity will save more money than the cost of an additional server.
I have found that for large batches of data (gigabytes or more), it is commonly faster overall to dump the data from MySQL into a file or multiple files on an application machine, process it there (with your favourite tool, here: Perl), and then use LOAD DATA LOCAL INFILE to slurp it back into a fresh table, while doing as little as possible in SQL. While doing that, you should (see the sketch after this list):
remove indexes from the table before LOAD (may not be necessary for MyISAM, but meh).
always, ALWAYS load the data in PK order!
add indexes after being done with loading.
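In outline, the reload step looks something like this (the file path and table name are hypothetical; the client may need local_infile enabled, and DISABLE/ENABLE KEYS applies to MyISAM's non-unique indexes):

<?php
$mysqli = new mysqli('localhost', 'user', 'pass', 'mydb');

$mysqli->query("ALTER TABLE fresh_table DISABLE KEYS"); // no index maintenance during the load

// the processed dump file must already be sorted in PK order
$mysqli->query("LOAD DATA LOCAL INFILE '/tmp/processed.tsv' INTO TABLE fresh_table");

$mysqli->query("ALTER TABLE fresh_table ENABLE KEYS"); // rebuild indexes in one pass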
Another advantage is that it may be much easier to parallelize the processing on a cheap application machine with a bunch of fast-but-volatile disks rather than do concurrent writing to your expensive and non-scalable database master.
Either way. Large datasets usually mean that the DB is the bottleneck.
I've been doing a lot of calculation-heavy work lately. Usually I prefer to do these calculations in PHP rather than MySQL, though I know PHP is not good at this; I thought MySQL might be worse. But I've hit a performance problem: some pages load so slowly that the 30-second time limit is not enough for them! So I wonder which is the better place to do the calculations, and whether there are any principles for choosing. Suggestions would be appreciated.
Anything that can be done using the RDBMS (GROUP BY, SUM(), AVG()), where the data can be filtered on the server side, should be done in the RDBMS.
If the calculation would be better suited in PHP then fine, go with that, but otherwise don't try to do in PHP what a RDBMS was made for. YOU WILL LOSE.
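A small illustration of that point (the schema is hypothetical): filter and aggregate on the server instead of dragging raw rows into PHP.

<?php
$mysqli = new mysqli('localhost', 'user', 'pass', 'shop');

$sql = "SELECT category, AVG(price) AS avg_price, SUM(qty) AS total_qty
        FROM   order_items
        WHERE  created >= '2011-01-01'
        GROUP  BY category";

$result = $mysqli->query($sql);
while ($row = $result->fetch_assoc()) {
    // one small row per category arrives, already filtered and summed
    echo $row['category'], ': ', $row['avg_price'], "\n";
}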
I would recommend doing any row level calculations using the RDBMS.
Not only are you going to benefit from better performance but it also makes your applications more portable if you need to switch to another scripting language, let's say PHP to Python, because you've already sorted, filtered and processed the data using your RBDMS.
It also helps separate your application logic, it has helped me keep my controllers cleaner and neater when working in an MVC environment.
I would say do calculations in languages that were created for that, like C++. But if you're choosing between MySQL and PHP, PHP is better.
Just keep track of where your bottlenecks are. If your table gets locked up because you're trying to run some calculations, everyone else is in queue waiting to read/write the data in the selected tables and the queue will continue to grow.
MySQL is typically faster at processing your commands, but PHP should be able to handle simple problems without too much of a fuss. Of course, that does not mean you should be pinging your database multiple times for the same calculation over and over.
You might be better off caching your results if you can and have a cron job updating it once a day/hour (please don't do it every minute, your hosting provider will probably hate you).
Do as much filtering and merging as possible to bring the minimum amount of data into php. Once you have that minimum data set, then it depends on what you are doing, server load, and perhaps other factors.
If you can do something equally well in either, and the sql is not overly complex to write (and maintain) then do that. For simple math, sql is usually a good bet. For string manipulations where the strings will end up about the same length or grow, php is probably a good bet.
The most important thing is to request as little data as possible. The manipulation of the data, at least what sql can do, is secondary to retrieving and transferring the data.
Native MySQL functions are very quick. So do what makes sense in your queries.
If you have multiple servers (i.e., a web server and a DB server), note that DB servers are much more expensive than web servers. So if you have a lot of traffic or a very busy DB server, do not do the 'extras' there that can be handled just as easily/efficiently on a web server machine; it helps prevent slowdowns.
cmptrgeekken is right, we would need some more information. BUT if you need to do calculations that pertain to database queries, or operations on them - comparisons of certain fields from the database - make the database do it. Doing special queries in SQL is cheaper (as far as time is concerned; it is optimized for that). Both PHP and MySQL are server side, so it won't really matter where you do the calculations; but as I said before, if they are operations on database information, write a more complicated SQL query and use that.
Use PHP; don't lag up your MySQL doing endless calculations. If you're talking about things like sorting, it's OK to use MySQL for stuff like that - SUM, AVG - but don't overdo it.
A few minutes ago, I asked whether it was better to perform many queries at once at log in and save the data in sessions, or to query as needed. I was surprised by the answer, (to query as needed). Are there other good rules of thumb to follow when building PHP/MySQL multi-user apps that speed up performance?
I'm looking for specific ways to create the most efficient application possible.
hashing
know your hashes (arrays/tables/ordered maps/whatever you call them). a hash lookup is very fast, and sometimes, if you have O(n^2) loops, you may reduce them to O(n) by organizing them into an array (keyed by primary key) first and then processing them.
an example:
foreach ($results as $result)
    if (in_array($result->id, $other_results))
        $found++;
is slow - in_array() loops through the whole of $other_results, resulting in O(n^2).
foreach ($other_results as $other_result)
    $hash[$other_result->id] = true;

foreach ($results as $result)
    if (isset($hash[$result->id]))
        $found++;
the second one is a lot faster (depending on the result sets - the bigger, the faster), because isset() is (almost) constant time. actually, this is not a very good example - you could do this even faster using built in php functions, but you get the idea.
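one possible version of the "built in php functions" approach hinted at above (assuming both result sets are lists of objects with an id property - the matching then happens inside C instead of a php loop):

$hash = array();
foreach ($other_results as $other_result)
    $hash[$other_result->id] = true;

$ids = array();
foreach ($results as $result)
    $ids[$result->id] = true;

// array_intersect_key() does the key comparison internally
$found = count(array_intersect_key($ids, $hash));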
optimizing (My)SQL
my.cnf: i don't have any idea how much performance you can gain by optimizing your mysql configuration instead of leaving the default. but i've read you can ignore every postgresql benchmark that used the default configuration. afaik configuration matters less with mysql, but why ignore it? rule of thumb: try to fit the whole database into memory :)
explain [query]: an obvious one, a lot of people get wrong. learn about indices. there are rules you can follow, you can benchmark it and you can make a huge difference. if you really want it all, learn about the different types of indices (btrees, hashes, ...) and when to use them.
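a tiny sketch of that workflow (the table and index names are made up):

<?php
$mysqli = new mysqli('localhost', 'user', 'pass', 'site');

// look at the "type" and "key" columns of the output
$explain = $mysqli->query("EXPLAIN SELECT * FROM users WHERE email = 'a@example.com'");
print_r($explain->fetch_assoc());

// "type: ALL" means a full table scan; an index usually fixes it
$mysqli->query("ALTER TABLE users ADD INDEX idx_email (email)");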
caching
caching is hard, but if done right it makes the difference (not a difference). in my opinion: if you can live without caching, don't do it. it often adds a lot of complexity and points of failures. google did a bit of proxy caching once (to make the intertubes faster), and some people saw private information of others.
in php, there are 4 different kinds of caching people regulary use:
query caching: almost always translates to memcached (sometimes to APC shared memory). store the result set of a certain query in a fast key/value (=hashing) storage engine. queries (now lookups) become very cheap. see the sketch after this list.
output caching: store your generated html for later use (instead of regenerating it every time). this can result in the biggest speed-ups, but somewhat works against PHP's dynamic nature.
browser caching: what about etags and http responses? if done right you may avoid most of the work right at the beginning! most php programmers ignore this option because they have no idea what HTTP is.
opcode caching: APC, zend optimizer and so on. makes php code load faster. can help with big applications. got nothing to do with (slow) external datasources though, and the potential is somewhat limited.
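query caching in practice looks something like this (a sketch using the pecl Memcached class; the key name and query are made up):

<?php
$memcache = new Memcached();
$memcache->addServer('localhost', 11211);

$key  = 'top_articles_v1';
$rows = $memcache->get($key);

if ($rows === false) { // cache miss: hit the database once
    $mysqli = new mysqli('localhost', 'user', 'pass', 'site');
    $result = $mysqli->query("SELECT id, title FROM articles ORDER BY views DESC LIMIT 10");
    $rows = array();
    while ($row = $result->fetch_assoc())
        $rows[] = $row;
    $memcache->set($key, $rows, 300); // expire after 5 minutes
}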
sometimes it's not possible to live without caches, e.g. if it comes to thumbnails. image resizing is very expensive, but fortunately easy to control (most of the time).
profiler
xdebug shows you the bottlenecks of your application. if your app is too slow, it's helpful to know why.
queries in loops
there are (php-)experts who do not know what a join is (and for every one you educate, two new ones without that knowledge will surface - and they will write frameworks, see schnalle's law). sometimes those queries-in-loops are not that obvious, e.g. if they come with libraries. count the queries - if they grow with the number of results shown, there is something wrong.
inexperienced developers do have a primal, insatiable urge to write frameworks and content management systems
schnalle's law
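the classic shape of the mistake, and the join that replaces it (the schema is made up):

<?php
$mysqli = new mysqli('localhost', 'user', 'pass', 'blog');

// bad: one extra query per post - the query count grows with the result set
$posts = $mysqli->query("SELECT id, title, author_id FROM posts LIMIT 20");
while ($post = $posts->fetch_assoc()) {
    $author = $mysqli->query(
        "SELECT name FROM authors WHERE id = " . (int)$post['author_id']
    )->fetch_assoc();
}

// good: one query, no matter how many rows come back
$rows = $mysqli->query(
    "SELECT p.id, p.title, a.name
     FROM   posts p
     JOIN   authors a ON a.id = p.author_id
     LIMIT  20");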
Optimize your MySQL queries first, then the PHP that handles it, and then lastly cache the results of large queries and searches. MySQL is, by far, the most frequent bottleneck in an application. A poorly designed query can take two to three times longer than a well designed query that only selects needed information.
Therefore, if your queries are optimized before you cache them, you have saved a good deal of processing time.
However, on some shared hosts caching is file-system only thanks to a lack of Memcached. In this instance it may be better to run smaller queries than it is to cache them, as the seek time of the hard drive (and waiting for access due to other sites) can easily take longer than the query when your site is under load.
Cache.
Cache.
Speedy indexed queries.