I've got a database (MySQL) table with three fields: id, score, and percent.
Long story short, I need to do a calculation on each record that looks like this:
(Score * 10) / (1 - percent) = Value
And then I need to use that value both in my code and as the ORDER BY field. Writing the SQL isn't my issue - I'm just worried about the efficiency of this statement. Is doing that calculation in my SQL statement the most efficient use of resources, or would I be better off grabbing the data and then doing math via PHP?
If SQL is the best way to do it, are there any tips I can keep in mind for keeping my SQL pulls as speedy as possible?
Update 1: Just to clear some things up, because it seems like many of the answers are assuming differently: both the Score and the Percent will be changing constantly. Actually, just about every time a user interacts with the app, those fields will change (those fields are actually linked to a user, btw).
As far as # of records, right now it's very small, but I would like to be scaling for a target set of about 2 million records (users). At any given time I will only need 20ish records, but I need them to be the top 20 records sorted by this calculated value.
It sounds like this calculated value is of inherent meaning in your business domain; if this is the case, I would calculate it once (e.g. at the time the record is created), and use it just like any normal field. This is by far the most efficient way to achieve what you want - the extra calculation on insert or update has minimal performance impact, and from then on you don't have to worry about who does the calculation where.
Drawback is that you do have to update your "insert" and "update" logic to perform this calculation. I don't usually like triggers - they can be the source of impenetrable bugs - but this is a case where I'd consider them (http://dev.mysql.com/doc/refman/5.0/en/triggers.html).
If for some reason you can't do that, I'd suggest doing it on the database server. This should be pretty snappy, unless you are dealing with very large numbers of records; in that case the "order by" will be a real performance problem. It will be a far bigger performance problem if you execute the same logic on the PHP side, of course - but your database tends to be the bottleneck from a performance point of view, so the impact is larger.
If you're dealing with large numbers of records, you may just have to bite the bullet and go with my first suggestion.
If it weren't for the need to sort by the calculation, you could also do this on the PHP side; however, sorting an array in PHP is not something I'd want to do for large result sets, and it seems wasteful not to do sorting in the database (which is good at that kinda thing).
So, after all that, my actual advice boils down to:
1. Do the simplest thing that could work.
2. Test whether it's fast enough within the constraints of your project.
3. If not, iteratively refactor to a faster solution and re-test.
4. Once you reach "good enough", move on.
Based on edit 1:
You've answered your own question, I think - returning (eventually) 2 million rows to PHP, only to find the top 20 records (after calculating their "value" one by one) will be incredibly slow. So calculating in PHP is really not an option.
So, you're going to be calculating it on the server. My recommendation would be to create a view (http://dev.mysql.com/doc/refman/5.0/en/create-view.html) which has the SQL to perform the calculation; benchmark the performance of the view with 200, 200K and 2M records, and see if it's quick enough.
If it isn't quick enough at 2M users/records, you can always create a regular table, with an index on your "value" column, and relatively little needs to change in your client code; you could populate the new table through triggers, and the client code might never know what happened.
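The "regular table with an index on the value column" idea can be sketched as follows. This is only an illustration under stated assumptions: Python and SQLite stand in for PHP and MySQL so the example runs on its own, the table name `users`, the index name, and the helper functions are hypothetical, and the value is recomputed in application code on every write (in MySQL a trigger could do the same job).

```python
import sqlite3

# SQLite stands in for MySQL; table, index and function names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, score REAL, percent REAL, value REAL)"
)
# The index lets the database read the top rows without sorting the whole table.
conn.execute("CREATE INDEX idx_users_value ON users (value)")


def save_user(user_id, score, percent):
    """Insert or overwrite a user and store the calculated value in the same write."""
    value = (score * 10) / (1 - percent)  # assumes percent is always below 1
    # INSERT OR REPLACE is SQLite syntax; MySQL has its own upsert forms.
    conn.execute(
        "INSERT OR REPLACE INTO users (id, score, percent, value) VALUES (?, ?, ?, ?)",
        (user_id, score, percent, value),
    )


def top_users(limit=20):
    """Return (id, value) pairs for the highest stored values."""
    cursor = conn.execute(
        "SELECT id, value FROM users ORDER BY value DESC LIMIT ?", (limit,)
    )
    return cursor.fetchall()


save_user(1, 50, 0.5)    # value 1000.0
save_user(2, 90, 0.25)   # value 1200.0
save_user(3, 10, 0.0)    # value 100.0
save_user(3, 200, 0.5)   # user 3 changes: value becomes 4000.0
print(top_users(2))
```

The extra work happens once per write, and the read path becomes a plain indexed ORDER BY ... LIMIT. Whether that trade-off pays off at 2M rows is something to benchmark rather than assume.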
Doing the math in the database will be more efficient, because sending the data back and forth between the database and the client will be slower than that simple expression, no matter how fast the client is and how slow the database is.
Test it out and let us know the performance results. I think it is going to depend on the volume of data in your result set. For the SQL bit, just make sure your WHERE clause is served by a covering index.
Where you do the math shouldn't be too important. It's the same fundamental operation either way. Now, if MySQL is running on a different server than your PHP code, then you may care which CPU does the calculation. You may wish that the SQL server does more of the "hard work", or you may wish to leave the SQL server doing "only SQL", and move the math logic to PHP.
Another consideration might be bandwidth usage (if MySQL isn't running on the same machine as PHP)--you may wish to have MySQL return whichever form is shorter, to use less network bandwidth.
If they're both on the same physical hardware, though, it probably makes no noticeable difference, from a sheer CPU usage standpoint.
One tip I would offer is to do the ORDER BY on a raw column (such as percent) rather than on the calculated value, wherever that still gives you the ordering you need. That way MySQL can use an index on the column; it can't use an ordinary column index to sort by a calculated expression.
If you have a growing number of records, your script (and its memory) will reach its limits faster than MySQL would. Are you planning to fetch all records anyway?
MySQL would be quicker in general.
I don't get how you would use the value calculated in PHP in an ORDER BY afterwards. If you are planning to sort in PHP, it would become even slower, but it all depends on the number of records you're dealing with.
I have a SQL database with about 20 columns containing percentage values as decimals, like 0.096303533707682 for example.
On my website I need to get these values, multiply them by 100 and round them up, so that 0.096303533707682 will be shown as 10% when the page is opened by the user.
Now my question is: is it faster/cheaper to calculate the 10% in advance and save the value to the database, so there is nothing to calculate after the query, or doesn't it make much sense or difference?
Thanks for your help!
For the individual operation the way to know is: Test it and be aware that performance on both sides can vary between versions and configurations.
On the larger system-level approach mind the following:
If you transfer data from the database to PHP to then do calculation you probably have extra cost due to networking, thus using SQL and calculating there has benefits.
Logic can be put into the database, using virtual columns, views or stored procedures/functions, thus multiple applications can share the logic
However for performance under scale it is simpler to add a new PHP host in front of a database than adding an extra database host.
For this specific question you also have to mind:
If you have to do the calculation every time, maybe you can do it once while storing the data, thus taking more disk space but saving calculation time.
Depending on the amount of data, those costs could be quite negligible, and you should rather put the logic where it makes logical sense. (Did you measure and see any problem at all, or are you doing premature optimization?) Is the calculation more like "data retrieval" or "business logic"? This is a subjective choice.
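To give a sense of how small the per-value work in this question is, here is a minimal sketch of the conversion itself. Python is used only for illustration and the function name is hypothetical; the same multiply-and-round-up step could run in the database, in PHP on read, or once at write time.

```python
import math


def to_display_percent(fraction):
    """Turn a stored fraction such as 0.0963... into a whole percent, rounded up."""
    return math.ceil(fraction * 100)


stored = 0.096303533707682  # the example value from the question
print(f"{to_display_percent(stored)}%")
```

One thing to watch if the result is precomputed and stored: rounding up a binary floating-point product can occasionally overshoot when the product lands a hair above a whole number, so it is worth deciding in one place how rounding is done.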
My question really revolves around the repetitive use of a large amount of data.
I have about 50 MB of data that I need to cross-reference repetitively during a single PHP page execution. This task is most easily solved by using SQL queries with table joins. The problem is the sheer volume of data that I need to process in a very short amount of time and the number of queries required to do it.
What I am currently doing is dumping the relevant part of each table (usually in excess of 30% or 10k rows) into an array and looping. The table joins are always on a single field, so I built a really basic 'index' of sorts to identify which rows are relevant.
The system works. It's been in my production environment for over a year, but now I'm trying to squeeze even more performance out of it. On one particular page I'm profiling, the second highest total time is attributed to the increment line that loops through these arrays. Its hit count is 1.3 million, for a total execution time of 30 seconds. This represents the work that would have been performed by about 8200 SQL queries to achieve the same result.
What I'm looking for is anyone else who has run into a situation like this. I really can't believe that I'm anywhere near the first person to have large amounts of data that need to be processed in PHP.
Thanks!
Thank you very much to everyone that offered some advice here. It looks like there isn't really a silver bullet here like I was hoping. I think what I'm going to end up doing is using a mix of MySQL memory tables and some version of a paged memcache.
This solution depends closely on what you are doing with the data, but I found that using the values of a unique column as array keys speeds things up a lot when you need to look up a row by a given value in that column.
This is because PHP uses a hash table to store array keys for fast lookups. It's hundreds of times faster than iterating over the array or using array_search.
But without seeing a code example it's hard to say.
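The difference between scanning and keyed lookup described above can be sketched like this. It is an illustration in Python rather than PHP (a Python dict plays the role of a PHP array with meaningful keys), and the sample rows and function names are made up.

```python
# Hypothetical rows, as they might come back from a query.
rows = [
    {"id": 101, "name": "alpha"},
    {"id": 102, "name": "beta"},
    {"id": 103, "name": "gamma"},
]


def find_by_scan(rows, wanted_id):
    """Linear search: looks at rows one by one until the id matches."""
    for row in rows:
        if row["id"] == wanted_id:
            return row
    return None


def build_index(rows, key):
    """Build a hash-based index once: unique column value -> row."""
    return {row[key]: row for row in rows}


index = build_index(rows, "id")

# Each lookup is now a single hash probe instead of a loop over all rows.
print(index.get(102))
print(find_by_scan(rows, 102))
```

Building the index costs one pass over the data, so it pays off when the same rows are looked up many times, which matches the repeated cross-referencing described in the question.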
Added from comment:
The next step is to use an in-memory database. You can use memory tables in MySQL, or SQLite. It also depends on how much of your running environment you control, because those methods would need more memory than a shared hosting provider would usually allow. It would probably also simplify your code, because you get grouping, sorting, aggregate functions, etc.
Well, I'm looking at a similar situation in which I have a large amount of data to process, and a choice to try to do as much via MySQL queries, or off-loading it to PHP.
So far, my experience has been this:
PHP is a lot slower than using MySQL queries.
MySQL query speed is only acceptable if I cram the logic into a single call, as the latency between calls is severe.
I'm particularly shocked by how slow PHP is for looping over an even modest amount of data. I keep thinking/hoping I'm doing something wrong...
I have a PHP/MySQL based web application that has internationalization support by way of a MySQL table called language_strings with the string_id, lang_id and lang_text fields.
I call the following function when I need to display a string in the selected language:
public function get_lang_string($string_id, $lang_id)
{
    $db = new Database();
    // lang_id 1 is always included; ORDER BY ... DESC picks the higher lang_id when both rows exist.
    $sql = sprintf(
        'SELECT lang_string FROM language_strings WHERE lang_id IN (1, %s) AND string_id=%s ORDER BY lang_id DESC LIMIT 1',
        $db->escape($lang_id, 'int'),
        $db->escape($string_id, 'int')
    );
    $row = $db->query_first($sql);
    return $row['lang_string'];
}
This works perfectly but I am concerned that there could be a lot of database queries going on. e.g. the main menu has 5 link texts, all of which call this function.
Would it be faster to load the entire language_strings table results for the selected lang_id into a PHP array and then call that from the function? Potentially that would be a huge array with much of it redundant but clearly it would be one database query per page load instead of lots.
Can anyone suggest another more efficient way of doing this?
There isn't an answer that doesn't depend on the case. You really have to look at it on a case-by-case basis. Having said that, the majority of the time, it will be quicker to get all the data in one query, pop it into an array or object and refer to it from there.
The caveat is whether you can pull all your data that you need in one query as quickly as running the five individual ones. That is where the performance of the query itself comes into play.
Sometimes a query that contains a subquery or two will actually be less time efficient than running a few queries individually.
My suggestion is to test it out. Get a query together that gets all the data you need, see how long it takes to execute. Time each of the other five queries and see how long they take combined. If it is almost identical, stick the output into an array and that will be more efficient due to not having to make frequent connections to the database itself.
If however, your combined query takes longer to return data (it might cause a full table scan instead of using indexes for example) then stick to individual ones.
Lastly, if you are going to use the same data over and over - an array or object will win hands down every single time as accessing it will be much faster than getting it from a database.
OK - I did some benchmarking and was surprised to find that putting things into an array rather than using individual queries was, on average, 10-15% SLOWER.
I think the reason for this was that, even if I filtered out the "uncommon" elements, there were inevitably always going to be unused elements as a matter of course.
With the individual queries I am only ever getting out what I need and as the queries are so simple I think I am best sticking with that method.
This works for me. Of course, in other situations where the individual queries are more complex, I think storing common data in an array would turn out to be more efficient.
Agree with what everybody says here: it's all about the numbers.
Some additional tips:
Try to create a single memory array which holds the minimum you require. This means removing most of the obvious redundancies.
There are standard approaches for these issues in performance critical environments, like using memcached with mysql. It's a bit overkill, but this basically lets you allocate some external memory and cache your queries there. Since you choose how much memory you want to allocate, you can plan it according to how much memory your system has.
Just play with the numbers. Try using separate queries (which is the simplest approach) and stress your PHP script (like calling it hundreds of times from the command line). Measure how much time this takes and see how big the performance loss actually is. Speaking from my personal experience, I usually cache everything in memory and then one day when the data gets too big, I run out of memory. Then I split everything to separate queries to save memory, and see that the performance impact wasn't that bad in the first place :)
I'm with Fluffeh on this: look into the other options at your disposal (joins, subqueries; make sure your indexes reflect the relationships in the data, but don't over-index, and test). Most likely you'll end up with an array at some point, so here's a little performance tip. Contrary to what you might expect, stuff like
$all = $stmt->fetchAll(PDO::FETCH_ASSOC);
is less memory efficient compared to:
$all = array(); // or $all = []; in PHP 5.4+
while ($row = $stmt->fetch(PDO::FETCH_ASSOC))
{
    // Keep only the column that is needed instead of the whole row.
    $all[] = $row['lang_string'];
}
What's more: you can check for redundant data while fetching the data.
My answer is to do something in between. Retrieve all strings for a lang_id that are shorter than a certain length (say, 100 characters). Shorter text strings are more likely to be used in multiple places than longer ones. Cache the entries in a static associative array in get_lang_string(). If an item isn't found, then retrieve it through a query.
I am currently at the point in my site/application where I have had to put the brakes on and think very carefully about speed. I think the speed tests mentioned should consider the volume of traffic on your server as an important variable that will affect the results. If you are putting data into JavaScript data structures and processing it on the client machine, the processing time should be more regular. If you are requesting lots of data through MySQL via PHP (for example), this puts the demand on one machine/server rather than spreading it. As your traffic grows you have to share server resources among many users, and I think this is where getting JavaScript to do more will lighten the load on the server. You can also store data on the local machine via localStorage.setItem() / localStorage.getItem() (most browsers have about 5 MB of space per domain). If you have data in the database that does not change that often, you can store it on the client and then just check at 'start-up' whether it's still in date/valid.
This is my first comment posted after having and using the account for 1 year, so I might need to fine-tune my rambling - just voicing what I'm thinking through at present.
This question may seem too basic to some, but please bear with me; it's been a while since I dealt with decent database programming.
I have an algorithm that I need to program in PHP/MySQL to work on a website. It performs some computations iteratively on an array of objects (it ranks the objects based on their properties). In each iteration the algorithm runs through the whole collection a couple of times, accessing various data from different places in the collection. The algorithm needs several hundred iterations to complete. The array comes from a database.
The straightforward solution that I see is to take the results of a database query and create an object for each row of the query, put the objects to an array and pass the array to my algorithm.
However, I'm concerned about the efficiency of such a solution when I have to work with an array of several thousand items, because what I do is essentially mirror the results of a query to memory.
On the other hand, making a database query a couple of times on each iteration of the algorithm also seems wrong.
So, my question is - what is the correct architectural solution for a problem like this? Is it OK to mirror the query results to memory? If not, which is the best way to work with query results in such an algorithm?
Thanks!
UPDATE: The closest problem that I can think of is ranking of search results by a search engine - I need to do something similar to that. Each result is represented as a row of a database and all results of the set are regarded when the rank is computed.
Don't forget, premature optimization is the root of all evil. Give it a shot copying everything to memory. If that uses too much mem, then optimize for memory.
Memory seems like the best way to go - iff you can scale up to meet it. Otherwise you'll have to revise your algorithm to maybe use a divide and conquer type of approach - do something like a merge sort.
It really depends on the situation at hand. It's probably rarely required to do such a thing, but it's very difficult to tell based off of the information you've given.
Try to isolate the data as much as possible. For instance, if you need to perform some independent action on the data that doesn't have data dependencies amongst iterations of the loop, you can write a query to update the affected rows rather than loading them all into memory, only to write them back.
In short, it is probably avoidable but it's hard to tell until you give us more information :)
If you are doing a query to the database, when the results come back, they are already "mirrored to memory". When you get your results using mysql_fetch_assoc (or equiv) you have your copy. Just use that as the cache.
Is the computation of one object dependent on another, or are they all independent? If they are independent, you could load just a small number of rows from the database, converting them to objects as you describe. Then run your hundreds of iterations on these, and then output the result for that block. You then proceed to the next block of items.
This keeps memory usage down, since you are only dealing with a small number of items rather than the whole data set, and avoids running multiple queries on the database.
The SQL keywords LIMIT and OFFSET can help you step through the data block by block.
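Stepping through a table block by block with LIMIT and OFFSET might look like the sketch below. It assumes Python with SQLite in place of PHP with MySQL, and the `items` table, its columns and the `iter_blocks` helper are hypothetical. The ORDER BY on a unique column matters: without a stable order, consecutive blocks are not guaranteed to be disjoint.

```python
import sqlite3

# Hypothetical table with ten rows; SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, weight REAL)")
conn.executemany(
    "INSERT INTO items (id, weight) VALUES (?, ?)",
    [(i, float(i)) for i in range(1, 11)],
)


def iter_blocks(block_size):
    """Yield lists of rows, at most block_size rows at a time."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT id, weight FROM items ORDER BY id LIMIT ? OFFSET ?",
            (block_size, offset),
        ).fetchall()
        if not rows:
            return
        yield rows
        offset += block_size


# Only one block is held in memory at a time; here we just record the sizes.
block_sizes = [len(block) for block in iter_blocks(4)]
print(block_sizes)
```

This only helps when the computation for one object really is independent of the others, as the answer points out; an algorithm that needs the whole collection in every iteration cannot be split this way.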
Writing ranking queries with MySQL is possible as well; you just need to play with user-defined variables a bit. If you provide some input data and the result you want to achieve, the replies will be more detailed.
Can you use a cron job to do your ranking, say once per day, hour, or whatever you need, and then save each item's ranking to a field in its row?
That way, when you call your rows up, you could just order them by the ranking field.
I was wondering if it's faster to process data in MySQL or in a server language like PHP or Python. I'm sure native operations like ORDER BY will be faster in MySQL due to indexing, caching, etc., but what about actually calculating the rank (including ties, so that multiple entries can share the same rank)?
Sample SQL
SELECT TORCH_ID,
distance AS thisscore,
(SELECT COUNT(distinct(distance))+1 FROM torch_info WHERE distance > thisscore) AS rank
FROM torch_info ORDER BY rank
Server
...as opposed to just doing a SELECT TORCH_ID FROM torch_info ORDER BY score DESC and then figure out rank in PHP on the web server.
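For comparison, the application-side alternative can be sketched as follows. This is an illustration in Python (the question mentions PHP or Python), the function name and sample data are made up, and it assigns the same kind of rank as the sample SQL: tied distances share a rank and the next distinct distance gets the next rank.

```python
def dense_rank(rows):
    """Rank (torch_id, distance) pairs by distance, highest first; ties share a rank."""
    ranked = []
    rank = 0
    previous = None
    for torch_id, distance in sorted(rows, key=lambda r: r[1], reverse=True):
        if distance != previous:
            # A new distinct distance starts the next rank.
            rank += 1
            previous = distance
        ranked.append((torch_id, distance, rank))
    return ranked


# Hypothetical rows, as they might come back from the database.
sample = [("a", 10), ("b", 30), ("c", 10), ("d", 20)]
for row in dense_rank(sample):
    print(row)
```

The ranking pass itself is a single loop over already sorted data; the real cost of this route is transferring every row to the application first, which is the trade-off the answers below discuss.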
Edit: Since posting this, my answer has changed completely, partly due to the experience I've gained since then and partly because relational database systems have gotten significantly better since 2009. Today, 9 times out of 10, I would recommend doing as much of your data crunching in-database as possible. There are three reasons for this:
Databases are highly optimized for crunching data—that's their entire job! With few exceptions, replicating what the database is doing at the application level is going to be slower unless you invest a lot of engineering effort into implementing the same optimizations that the DB provides to you for free—especially with a relatively slow language like PHP, Python, or Ruby.
As the size of your table grows, pulling it into the application layer and operating on it there becomes prohibitively expensive simply due to the sheer amount of data transferred. Many applications will never reach this scale, but if you do, it's best to reduce the transfer overhead and keep the data operations as close to the DB as possible.
In my experience, you're far more likely to introduce consistency bugs in your application than in your RDBMS, since the DB can enforce consistency on your data at a low level but the application cannot. If you don't have that safety net built in, you have to be more careful not to make mistakes.
Original answer: MySQL will probably be faster with most non-complex calculations. However, 90% of the time the database server is the bottleneck, so do you really want to add to that by bogging down your database with these calculations? I myself would rather put them on the web/application server to even out the load, but that's your decision.
In general, the answer to the "Should I process data in the database, or on the web server?" question is, "It depends".
It's easy to add another web server. It's harder to add another database server. If you can take load off the database, that can be good.
If the output of your data processing is much smaller than the required input, you may be able to avoid a lot of data transfer overhead by doing the processing in the database. As a simple example, it'd be foolish to SELECT *, retrieve every row in the table, and iterate through them on the web server to pick the one where x = 3, when you can just SELECT * WHERE x = 3
As you pointed out, the database is optimized for operation on its data, using indexes, etc.
The speed of the count is going to depend on which DB storage engine you are using and the size of the table, though I suspect that nearly every count and rank done in MySQL would be faster than pulling that same data into PHP memory and doing the same operation there.
Ranking is based on count, order. So if you can do those functions faster, then rank will obviously be faster.
A large part of your question is dependent on the primary keys and indexes you have set up.
Assuming that TORCH_ID is indexed properly...
You will find that MySQL is faster than server-side code.
Another consideration you might want to make is how often this SQL will be called. You may find it easier to create a rank column and update that as each track record comes in. This will result in a lot of minor hits to your database, versus a number of "heavier" hits to your database.
So let's say you have 10,000 records, 1000 users who hit this query once a day, and 100 users who put in a new track record each day. I'd rather have the DB doing 100 updates in which 10% of them hit every record (9,999) than have the ranking query get hit 1,000 times a day.
My two cents.
If your test is running individual queries instead of posting transactions, then I would recommend using a JDBC driver over the ODBC DSN, because you'll get 2-3 times faster performance. (I'm assuming you're using an ODBC DSN in your tests.)