I have a heavy script that we run a lot. Below is the algorithm used:
Load 4500 rows from database and store them as an array. (A)
Load 60,000 rows from database and store them as an array. (B)
For each element in (A), look for a match in (B).
Go to the next element in (A).
So the maximum number of iterations of this script is 4500 * 60,000, which is 270,000,000, so you can understand why this can be a bit sweaty for PHP.
Can I make this process more efficient somehow?
Reading the rows from the database is not really an issue; it is the array iterations that carry the heavy cost.
It does work pretty fast, but one factor (60,000) will increase greatly in the years to come.
So any ideas?
Here are a few different answers. My guess is that the first one is the right one, the
easy one and sufficient, but it's very hard to be sure.
Possible Answer 1: Use SQL
As the comments indicate, this sounds an awful lot like a join. In addition, your post seems to indicate that you only take an action when a match is found and that not every element in A has a match. This means your SQL statement should only return the matching rows, not all of them. It doesn't matter that you can't do everything in SQL, as long as you can let it organise your data for you.
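For illustration, a rough sketch of what that could look like with PDO; the connection details and the table and column names (table_a, table_b, match_key) are placeholders for whatever your schema actually uses:

$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'pass');

$sql = 'SELECT a.*, b.*
        FROM table_a AS a
        INNER JOIN table_b AS b ON b.match_key = a.match_key';

foreach ($pdo->query($sql) as $row) {
    // act on the match; rows of A without a match in B never reach PHP at all
}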
Possible Answer 2: Sort the arrays
Maybe you can sort the arrays (again, preferably by letting your database do this). Possibly you can sort B so that searching for a match is quicker. Or put the search value in the key of the array so that lookups are very quick. Or, if you are lucky, you might be able to sort both arrays in a way that puts the A's and B's in the same order, i.e. for any A you pick, you know that the right B either does not exist or exists later in the B array.
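If you do end up with two pre-sorted arrays in PHP, a rough sketch of that "walk both arrays once" idea follows; it assumes each element has a 'key' field you match on, that keys are unique, and handleMatch() is a hypothetical callback for whatever you do with a match:

$i = 0;
$j = 0;
$countA = count($a);
$countB = count($b);

while ($i < $countA && $j < $countB) {
    if ($a[$i]['key'] == $b[$j]['key']) {
        handleMatch($a[$i], $b[$j]);   // hypothetical: whatever you do on a match
        $i++;
    } elseif ($a[$i]['key'] < $b[$j]['key']) {
        $i++;                          // no match possible for this A, move on
    } else {
        $j++;                          // this B is too small, skip ahead
    }
}

That turns the worst case from roughly 4500 * 60,000 comparisons into roughly 4500 + 60,000 steps.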
Possible Answer 3: Explain more about the problem
You have only given us your current algorithm, not what you are actually trying to do. Most
likely iterating over everything is not the best idea, but no one can say unless they know
more about your data and what you want to do in the end.
It depends on your data, of course...
Some general aspects:
This really sounds like a use case for database queries, not the PHP script. Looking for matches in data sets is what databases are good at; no tricks will make PHP scripts play even in the same league.
If you really have to use the PHP scripting functions, try to:
not hit your allowed memory limit: your PHP script will just exit with an error, but if your SQL-side result set becomes too big, your SQL server may begin to write temp data to disk, which will slow down the whole execution time -> if possible, fetch and process the data in chunks (OFFSET, LIMIT)
if you match whole words, build your matching array in such a way that the search criterion is a key, not a value, so that you can use isset($potentialMatches[$searchTerm]), which is much faster than in_array($searchTerm, $potentialMatches) for larger arrays. Mockup:
$potentialMatches = array();
while ($row = $resultSet->fetch_assoc()) {
    $potentialMatches[$row['search_column']] = $row; // keyed by the value you will search for
}
But it can't be stressed enough: the usual course to handle this would be:
1. do the matching DB-side
2. process the matches in your script
3. if necessary: do a new query for non-matches (a sketch follows this list)
4. if you did 3., process those results in your script
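A hedged sketch of step 3, again with placeholder table and column names: a LEFT JOIN with an IS NULL filter returns only the rows of A that have no match in B:

$sql = 'SELECT a.*
        FROM table_a AS a
        LEFT JOIN table_b AS b ON b.match_key = a.match_key
        WHERE b.match_key IS NULL';

foreach ($pdo->query($sql) as $row) {
    // process the non-matching rows of A
}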
Related
I have a PHP/MySQL based web application that has internationalization support by way of a MySQL table called language_strings with the string_id, lang_id and lang_string fields.
I call the following function when I need to display a string in the selected language:
public function get_lang_string($string_id, $lang_id)
{
$db = new Database();
$sql = sprintf('SELECT lang_string FROM language_strings WHERE lang_id IN (1, %s) AND string_id=%s ORDER BY lang_id DESC LIMIT 1', $db->escape($lang_id, 'int'), $db->escape($string_id, 'int'));
$row = $db->query_first($sql);
return $row['lang_string'];
}
This works perfectly but I am concerned that there could be a lot of database queries going on. e.g. the main menu has 5 link texts, all of which call this function.
Would it be faster to load the entire language_strings table results for the selected lang_id into a PHP array and then call that from the function? Potentially that would be a huge array with much of it redundant but clearly it would be one database query per page load instead of lots.
Can anyone suggest another more efficient way of doing this?
There isn't a one-size-fits-all answer; you really have to look at it on a case-by-case basis. Having said that, the majority of the time it will be quicker to get all the data in one query, pop it into an array or object and refer to it from there.
The caveat is whether you can pull all your data that you need in one query as quickly as running the five individual ones. That is where the performance of the query itself comes into play.
Sometimes a query that contains a subquery or two will actually be less time efficient than running a few queries individually.
My suggestion is to test it out. Get a query together that gets all the data you need, see how long it takes to execute. Time each of the other five queries and see how long they take combined. If it is almost identical, stick the output into an array and that will be more efficient due to not having to make frequent connections to the database itself.
If however, your combined query takes longer to return data (it might cause a full table scan instead of using indexes for example) then stick to individual ones.
Lastly, if you are going to use the same data over and over - an array or object will win hands down every single time as accessing it will be much faster than getting it from a database.
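If you want to time it as suggested above, a crude sketch with microtime() is usually enough; wrap each variant (the single combined query, then the five individual ones) in a loop like this and compare the totals. The iteration count and what goes inside the loop are up to you:

$start = microtime(true);

for ($i = 0; $i < 1000; $i++) {
    // variant A: run the single combined query here
    // variant B: run the five individual queries here
}

printf("%.4f seconds\n", microtime(true) - $start);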
OK - I did some benchmarking and was surprised to find that putting things into an array rather than using individual queries was, on average, 10-15% SLOWER.
I think the reason for this was because, even if I filtered out the "uncommon" elements, inevitably there was always going to be unused elements as a matter of course.
With the individual queries I am only ever getting out what I need and as the queries are so simple I think I am best sticking with that method.
This works for me; of course, in other situations where the individual queries are more complex, I think the method of storing common data in an array would turn out to be more efficient.
I agree with what everybody says here: it's all about the numbers.
Some additional tips:
Try to create a single memory array which holds the minimum you require. This means removing most of the obvious redundancies.
There are standard approaches for these issues in performance-critical environments, like using memcached with MySQL. It's a bit overkill, but it basically lets you allocate some external memory and cache your queries there. Since you choose how much memory you want to allocate, you can plan it according to how much memory your system has (a rough sketch follows these tips).
Just play with the numbers. Try using separate queries (which is the simplest approach) and stress your PHP script (like calling it hundreds of times from the command line). Measure how much time this takes and see how big the performance loss actually is. Speaking from my personal experience, I usually cache everything in memory, and then one day when the data gets too big I run out of memory. Then I split everything into separate queries to save memory, and find that the performance impact wasn't that bad in the first place :)
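A rough sketch of that memcached idea, assuming the PECL memcached extension and a memcached daemon on localhost:11211; the connection details are placeholders and the table/column names follow the question:

$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

$lang_id  = 2;                                   // whatever language is selected
$cacheKey = 'lang_strings_' . $lang_id;
$strings  = $cache->get($cacheKey);

if ($strings === false) {                        // cache miss: one query, then cache it
    $pdo  = new PDO('mysql:host=localhost;dbname=myapp;charset=utf8', 'user', 'pass');
    $stmt = $pdo->prepare('SELECT string_id, lang_string FROM language_strings WHERE lang_id = ?');
    $stmt->execute(array($lang_id));

    $strings = array();
    while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
        $strings[$row['string_id']] = $row['lang_string'];
    }
    $cache->set($cacheKey, $strings, 300);       // keep it for 5 minutes
}

// $strings is now keyed by string_id, whether it came from the cache or the DB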
I'm with Fluffeh on this: look into the other options at your disposal (joins, subqueries, making sure your indexes reflect the relationships in your data; but don't over-index, and test). Most likely you'll end up with an array at some point, so here's a little performance tip. Contrary to what you might expect, stuff like
$all = $stmt->fetchAll(PDO::FETCH_ASSOC);
is less memory efficient than fetching row by row and keeping only the column you need:
$all = array(); // or $all = []; in PHP 5.4+
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    $all[] = $row['lang_string'];
}
What's more, you can filter out redundant data while fetching it.
My answer is to do something in between. Retrieve all strings for a lang_id that are shorter than a certain length (say, 100 characters). Shorter text strings are more likely to be used in multiple places than longer ones. Cache the entries in a static associative array in get_lang_string(). If an item isn't found, then retrieve it through a query.
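A hedged sketch of that hybrid, reusing the question's Database wrapper; the pre-load query, the 100-character cut-off and the assumption that $db->query() returns an array of associative rows are all mine:

public function get_lang_string($string_id, $lang_id)
{
    static $cache = array();                  // survives between calls within the same request

    if (!isset($cache[$lang_id])) {
        // pre-load every "short" string for this language in one query
        $cache[$lang_id] = array();
        $db  = new Database();
        $sql = sprintf(
            'SELECT string_id, lang_string FROM language_strings
             WHERE lang_id = %s AND CHAR_LENGTH(lang_string) <= 100',
            $db->escape($lang_id, 'int')
        );
        foreach ($db->query($sql) as $row) {  // assumed: query() returns associative rows
            $cache[$lang_id][$row['string_id']] = $row['lang_string'];
        }
    }

    if (isset($cache[$lang_id][$string_id])) {
        return $cache[$lang_id][$string_id];
    }

    // not cached (e.g. a long string): fall back to the original per-string query
    $db  = new Database();
    $sql = sprintf(
        'SELECT lang_string FROM language_strings
         WHERE lang_id IN (1, %s) AND string_id=%s ORDER BY lang_id DESC LIMIT 1',
        $db->escape($lang_id, 'int'), $db->escape($string_id, 'int')
    );
    $row = $db->query_first($sql);
    $cache[$lang_id][$string_id] = $row['lang_string'];

    return $cache[$lang_id][$string_id];
}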
I am currently at the point in my site/application where I have had to put the brakes on and think very carefully about speed. I think these speed tests should consider the volume of traffic on your server as an important variable that will affect the results. If you are putting data into JavaScript data structures and processing it on the client machine, the processing time should be more consistent. If you are requesting lots of data through MySQL via PHP (for example), this puts demand on one machine/server rather than spreading it. As your traffic grows you have to share server resources with many users, and I am thinking that this is where getting JavaScript to do more is going to lighten the load on the server. You can also store data on the client machine via localStorage.setItem() / localStorage.getItem() (most browsers have about 5 MB of space per domain). If you have data in the database that does not change very often, you can store it on the client and then just check at 'start-up' whether it's still in date/valid.
This is my first comment posted after having and using the account for a year, so I might need to fine-tune my rambling; I'm just voicing what I'm thinking through at present.
I've been asked to choose which is the best option out of three in terms of resource optimization. Suppose I have a big Excel file of thousands of records, and I need to extract the data and insert it into a database.
The 3 options are:
Load everything into a multidimensional array and insert everything with just one complex query.
Load everything into a multidimensional array, then loop over each Excel row and do a simple insert query.
Inside a loop, read each Excel row, put it into an array, and then do a simple insert query on the DB.
This is for an interview test (I labelled it homework, not sure if it's right); I pondered for a while:
Case 1: I could risk an *out_of_memory* error (depending on the machine, of course), but it's the solution that makes the fewest requests to the database. The drawbacks are the huge amounts of memory that have to be allocated both to the array and to the database. I know that I can transform the Excel file into CSV, but it's not an option here. I'd go for a big array and a bulk insert, but I fear it would be hard on the database.
Case 2: I could risk an *out_of_memory* error when loading everything into the array, but not for the second part. Nonetheless, performing thousands of queries could be a performance hit on the database, and such a query is likely to be a candidate for optimization.
Case 3: I still have a loop over thousands of records (which also takes a lot of memory...) and still have thousands of queries to run (which hits the database).
So, I actually chose answer one, and it took me some thinking before doing it.
And it was WRONG. And I don't actually know which of the three was the right one.
Can someone help me with this? Is that answer really so bad? I thought that thousands of insert queries would be "bad", but it seems like I'm totally wrong.
EDIT
Clarification: my question is not about which is the best optimization in absolute terms, but which one among the three I presented; so I'm not looking for other alternatives, just an explanation of why I was wrong and which one is, arguably, the best answer instead.
On the one hand, this seems like a bit of a trick question. The sane answer is, use a bulk import utility like MySQL's mysqlimport or SQL Server's BULK INSERT ... FROM [data_file]. On the other hand, those utilities are essentially doing one of the above three options (albeit in a presumably highly-optimized fashion).
Thing is, you have to consider the entirety of the question when answering these. The "best option in terms of resource utilization" is case 3, given that your memory usage will be rather low and that most database platforms are designed to handle a metric crapton of requests per second anyway.
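For what it's worth, a minimal sketch of case 3 with PDO; readExcelRows() is a hypothetical generator standing in for whichever Excel reader you use, and the connection details and table/column names are made up:

$pdo = new PDO('mysql:host=localhost;dbname=import;charset=utf8', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->prepare('INSERT INTO records (col_a, col_b, col_c) VALUES (?, ?, ?)');

$pdo->beginTransaction();                         // one transaction keeps thousands of small inserts cheap
foreach (readExcelRows('data.xlsx') as $row) {    // hypothetical: yields one row (as an array) at a time
    $stmt->execute(array($row[0], $row[1], $row[2]));
}
$pdo->commit();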
"Wrong" seems like the wrong answer.
There are a number of tradeoffs, and the "right" answer depends on factors you haven't listed, such as:
1) Is this a production database?
2) Is the site online when you insert this data?
3) Is it OK if row 1 is inserted and visible to the public while row 10,985 isn't?
4) Are others writing to the table while you are?
Assuming the answer to all of these questions is yes, I'd probably go with the row-at-a-time read and insert. The first two options are going to lock up your table so that no one else will be able to access it. With option 3 you can even meter your rate of inserts.
I think the PHP way presupposes Case 3, because it minimizes the amount of memory used. It's slow, but it reduces how much memory each operation takes. Loading the whole thing into one big multidimensional array and doing a complex insert takes a lot more resources, and the speed-up is not that much better. The question assumes this is a long-running task, so maybe that's what threw you off.
Whoever wrote this doesn't seem to have considered that insert operations are expensive for data loading and are not meant to be used when you have a lot of data to load.
I've got a database (MySQL) table with three fields: id, score, and percent.
Long story short, I need to do a calculation on each record that looks like this:
(Score * 10) / (1 - percent) = Value
And then I need to use that value both in my code and as the ORDER BY field. Writing the SQL isn't my issue - I'm just worried about the efficiency of this statement. Is doing that calculation in my SQL statement the most efficient use of resources, or would I be better off grabbing the data and then doing math via PHP?
If SQL is the best way to do it, are there any tips I can keep in mind for keeping my SQL pulls as speedy as possible?
Update 1: Just to clear some things up, because it seems like many of the answers are assuming differently : Both the Score and the Percent will be changing constantly. Actually, just about every time a user interacts with the app, those fields will change (those fields are actually linked to a user, btw).
As far as the number of records goes, right now it's very small, but I would like to scale to a target of about 2 million records (users). At any given time I will only need about 20 records, but I need them to be the top 20 records sorted by this calculated value.
It sounds like this calculated value is of inherent meaning in your business domain; if this is the case, I would calculate it once (e.g. at the time the record is created), and use it just like any normal field. This is by far the most efficient way to achieve what you want - the extra calculation on insert or update has minimal performance impact, and from then on you don't have to worry about who does the calculation where.
Drawback is that you do have to update your "insert" and "update" logic to perform this calculation. I don't usually like triggers - they can be the source of impenetrable bugs - but this is a case where I'd consider them (http://dev.mysql.com/doc/refman/5.0/en/triggers.html).
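If you do go the trigger route, a hedged sketch of what that could look like in MySQL; "scores" and the "value" column are placeholder names, and you'd want to decide what should happen when percent = 1 (division by zero) before relying on it:

$pdo->exec('CREATE TRIGGER scores_bi BEFORE INSERT ON scores
            FOR EACH ROW SET NEW.value = (NEW.score * 10) / (1 - NEW.percent)');

$pdo->exec('CREATE TRIGGER scores_bu BEFORE UPDATE ON scores
            FOR EACH ROW SET NEW.value = (NEW.score * 10) / (1 - NEW.percent)');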
If for some reason you can't do that, I'd suggest doing it on the database server. This should be pretty snappy, unless you are dealing with very large numbers of records; in that case the "order by" will be a real performance problem. It will be a far bigger performance problem if you execute the same logic on the PHP side, of course - but your database tends to be the bottleneck from a performance point of view, so the impact is larger.
If you're dealing with large numbers of records, you may just have to bite the bullet and go with my first suggestion.
If it weren't for the need to sort by the calculation, you could also do this on the PHP side; however, sorting an array in PHP is not something I'd want to do for large result sets, and it seems wasteful not to do sorting in the database (which is good at that kinda thing).
So, after all that, my actual advice boils down to:
do the simplest thing that could work
test whether it's fast enough within the constraints of your project
if not, iteratively refactor to a faster solution, re-test
once you reach "good enough", move on.
Based on edit 1:
You've answered your own question, I think - returning (eventually) 2 million rows to PHP, only to find the top 20 records (after calculating their "value" one by one) will be incredibly slow. So calculating in PHP is really not an option.
So, you're going to be calculating it on the server. My recommendation would be to create a view (http://dev.mysql.com/doc/refman/5.0/en/create-view.html) which has the SQL to perform the calculation; benchmark the performance of the view with 200, 200K and 2M records, and see if it's quick enough.
If it isn't quick enough at 2M users/records, you can always create a regular table, with an index on your "value" column, and relatively little needs to change in your client code; you could populate the new table through triggers, and the client code might never know what happened.
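A minimal sketch of the view idea; again, "scores" is a placeholder table name and the query assumes a PDO connection:

$pdo->exec('CREATE OR REPLACE VIEW scores_with_value AS
            SELECT id, score, percent, (score * 10) / (1 - percent) AS value
            FROM scores');

// top 20 rows by the calculated value
$top20 = $pdo->query('SELECT * FROM scores_with_value ORDER BY value DESC LIMIT 20')
             ->fetchAll(PDO::FETCH_ASSOC);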
Doing the math in the database will be more efficient, because sending the data back and forth between the database and the client will be slower than that simple expression, no matter how fast the client is and how slow the database is.
Test it out and let us know the performance results. I think it is going to depend on the volume of data in your result set. For the SQL bit, just make sure your WHERE clause has a covering index.
Where you do the math shouldn't be too important. It's the same fundamental operation either way. Now, if MySQL is running on a different server than your PHP code, then you may care which CPU does the calculation. You may wish that the SQL server does more of the "hard work", or you may wish to leave the SQL server doing "only SQL", and move the math logic to PHP.
Another consideration might be bandwidth usage (if MySQL isn't running on the same machine as PHP)--you may wish to have MySQL return whichever form is shorter, to use less network bandwidth.
If they're both on the same physical hardware, though, it probably makes no noticeable difference, from a sheer CPU usage standpoint.
One tip I would offer is to do the ORDER BY on the raw value (percent) rather than on the calculated value--this way MySQL can use an index on the percent column--it can't use indexes on calculated values.
If you have a growing number of records, your script (and its memory) will reach its limits faster than mysql would. Are you planning to fetch all records anyway?
MySQL would be quicker in general.
I don't get how you would use the value calculated in PHP in an ORDER BY afterwards. If you are planning to sort in PHP, it would be even slower, but it all depends on the number of records you're dealing with.
This question may seem too basic to some, but please bear with me; it's been a while since I dealt with decent database programming.
I have an algorithm that I need to program in PHP/MySQL to work on a website. It performs some computations iteratively on an array of objects (it ranks the objects based on their properties). In each iteration the algorithm runs through the whole collection a couple of times, accessing various data from different parts of the collection. The algorithm needs several hundred iterations to complete. The array comes from a database.
The straightforward solution that I see is to take the results of a database query and create an object for each row of the query, put the objects to an array and pass the array to my algorithm.
However, I'm concerned about the efficiency of such a solution when I have to work with an array of several thousand items, because what I do is essentially mirror the results of a query in memory.
On the other hand, making database query a couple of times on each iteration of the algorithm also seems wrong.
So, my question is - what is the correct architectural solution for a problem like this? Is it OK to mirror the query results to memory? If not, which is the best way to work with query results in such an algorithm?
Thanks!
UPDATE: The closest problem that I can think of is ranking of search results by a search engine - I need to do something similar to that. Each result is represented as a row of a database and all results of the set are regarded when the rank is computed.
Don't forget, premature optimization is the root of all evil. Give it a shot: copy everything to memory. If that uses too much memory, then optimize for memory.
Memory seems like the best way to go - iff you can scale up to meet it. Otherwise you'll have to revise your algorithm to maybe use a divide and conquer type of approach - do something like a merge sort.
It really depends on the situation at hand. It's probably rarely required to do such a thing, but it's very difficult to tell based off of the information you've given.
Try to isolate the data as much as possible. For instance, if you need to perform some independent action on the data that doesn't have data dependencies amongst iterations of the loop, you can write a query to update the affected rows rather than loading them all into memory, only to write them back.
In short, it is probably avoidable but it's hard to tell until you give us more information :)
If you are doing a query to the database, when the results come back, they are already "mirrored to memory". When you get your results using mysql_fetch_assoc (or equiv) you have your copy. Just use that as the cache.
Is the computation of one object dependent on another, or are they all independent? If they are independent, you could load just a small number of rows from the database, converting them to objects as you describe. Then run your hundreds of iterations on these, and then output the result for that block. You then proceed to the next block of items.
This keeps memory usage down, since you are only dealing with a small number of items rather than the whole data set, and avoids running multiple queries on the database.
The SQL keywords LIMIT and OFFSET can help you step through the data block by block.
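A rough sketch of that block-by-block approach with PDO, assuming the computation for one block really is independent of the others; "items" and the connection details are placeholders:

$pdo       = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'pass');
$blockSize = 1000;
$offset    = 0;

do {
    $sql  = sprintf('SELECT * FROM items ORDER BY id LIMIT %d OFFSET %d', $blockSize, $offset);
    $rows = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);

    foreach ($rows as $row) {
        // run your iterations on this block, then write back / output the result
    }

    $offset += $blockSize;
} while (count($rows) === $blockSize);   // a short block means we've reached the end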
Writing ranking queries with MySQL is possible as well; you just need to play with user-defined variables a bit. If you provide some input data and the result you are trying to achieve, the replies will be more detailed.
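For example, a hedged sketch of the pre-MySQL-8 user-variable trick (MySQL 8 would normally use a window function such as RANK() instead); "items" and "score" are placeholder names:

$sql = 'SELECT t.*, (@rank := @rank + 1) AS ranking
        FROM items AS t
        CROSS JOIN (SELECT @rank := 0) AS init
        ORDER BY t.score DESC';

$ranked = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);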
Can you use a cron job to do your ranking, say once per day, hour, or whatever you need, and then save each item's ranking to a field in its row?
That way, when you call your rows up, you could just order them by the ranking field.
If you have an array of record ID's in your application code, what is the best way to read the records from the database?
$idNumsIWant = array(2, 4, 5, 7, 9, 23, 56);
Obviously looping over each ID is bad because you do n queries:
foreach ($idNumsIWant as $memID) {
$DBinfo = mysql_fetch_assoc(mysql_query("SELECT * FROM members WHERE mem_id = '$memID'"));
echo "{$DBinfo['fname']}\n";
}
So, perhaps it is better to use a single query?
$sqlResult = mysql_query("SELECT * FROM members WHERE mem_id IN (".join(",",$idNumsIWant).")");
while ($DBinfo = mysql_fetch_assoc($sqlResult))
echo "{$DBinfo['fname']}\n";
But does this method scale when the array has 30,000 elements?
How do you tackle this problem efficiently?
The best approach ultimately depends on the number of IDs you have in your array (you obviously don't want to send a 50 MB SQL query to your server, even though technically it might be capable of dealing with it without too much trouble), but mostly on how you're going to deal with the resulting rows.
If the number of IDs is fairly low (let's say a few thousand tops), a single query with a WHERE clause using the IN syntax will be perfect. Your SQL query will be short enough for it to be transferred reliably, efficiently and quickly to the DB server. This method is perfect for a single thread looping through the resulting records.
If the number of IDs is really big, I would suggest you split the IDs array into several groups and run more than one query, each with a group of IDs. It may be a little heavier on the DB server, but on the application side you can spawn several threads and deal with the multiple recordsets as soon as they arrive, in a parallel way.
Both methods will work.
Cliffnotes: for that kind of situation, focus on data usage, as long as data extraction isn't too big of a bottleneck. And profile your app!
My thoughts:
The first method is too costly in terms of processing and disk reads.
The second method is more efficient and you don't have to worry much about query size limit (but check it anyway).
When I have to deal with that kind of situation, I see at least three or four possible solutions:
one request per id; as you said, this is not really good: lots of requests; I generally don't do that
use the solution you proposed: one request for many ids
but you can't do that with a very long list of ids: some database engines have a limit on the number of values you can pass in an IN()
a very big list in IN() might not be good performance-wise
So I generally do something like one request for X ids, and repeat that. For instance, to fetch the data corresponding to 1,000 ids, I could do 20 requests, each getting the data for 50 ids (that's just an example: benchmarking your DB/table could be interesting for your particular case, as it might depend on several factors)
in some cases, you could also re-think your queries: maybe you could avoid passing such a list of ids by using some kind of join? (this really depends on what you need, your tables' schema, ...)
Also, to make it easy to change the fetching logic later, I would write a function that takes the list of ids and returns the list of data corresponding to those ids.
This way, you always call this function the same way and always get the same data back, without having to worry about how that data is fetched; this allows you to change the fetching method if needed (if you find a better way some day) without breaking anything: HOW the function works will change, but since its interface (input/output) stays the same, it will not change a thing for the rest of your code :-)
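A hedged sketch of both ideas combined (chunks of X ids behind a single function), using PDO and the question's "members"/"mem_id" names; the function name and chunk size are arbitrary:

function fetchMembersByIds(PDO $pdo, array $ids, $chunkSize = 500)
{
    $rows = array();

    foreach (array_chunk($ids, $chunkSize) as $chunk) {
        $placeholders = implode(',', array_fill(0, count($chunk), '?'));
        $stmt = $pdo->prepare("SELECT * FROM members WHERE mem_id IN ($placeholders)");
        $stmt->execute(array_values($chunk));

        foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
            $rows[] = $row;
        }
    }

    return $rows;   // same output no matter which fetching strategy is used inside
}

// usage: $members = fetchMembersByIds($pdo, $idNumsIWant);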
If it were me and I had that large a list of values for the IN clause, I would use a stored procedure with a variable containing the values I wanted, and use a function in it to put them into a temp table and then join to it. Depending on the size of the values you want to send, you might need to split them into multiple input variables to process. Is there any way the values could be permanently stored in the database (if they are often queried on)? And how is the user going to pick out 30,000 values? Surely he or she isn't going to type them all in. So there is probably a better way to query the table, based on a join and a WHERE clause.
Using a string tokenizer to split your string into tokens would make it easier for you to handle retrieving data for multiple values.