I work at a school and have been looking at a way to speed up and improve the way some of our database functions work.
We have a PHP formatting class that seems to be slowing things down now that the database is getting bigger and bigger and some of the queries longer.
The class does things like take a foreign key and find the value for that key in a lookup table.
For example, a student class will use the formatting class like this:
courseID = 114 and a studentID are looked up to return "Biology" and "John Doe", with each lookup running its own MySQL query.
My issue is that some classes generate an array of objects, for example an array of 500 student objects, and each of those student objects uses this formatter class and thus runs several queries.
I am thinking this is what is slowing things down.
Worst case:
500 student objects x 10 lookups in the formatter class means 5,000 queries executed.
I am wondering the best way to deal with this.
Do I preload all the lookup data into arrays in that one formatter class?
Do I make the formatter class a singleton, so that in the worst-case scenario the one master class that generates the whole array of objects uses a single shared instance?
Is it better to store all that lookup data, already parsed, in one array (essentially a cache)?
Some classes now have so many queries they no longer work.
edit 8/23/2013 below
To add further information.
I am not really concerned with a single lookup; those have no issues speed-wise, such as a teacher looking up one student's info. Having a formatter class run 10 queries is no issue.
The classes that generate a huge listing of other objects, such as a teacher requesting to see all students, where there are 500 objects, are the issue.
I have several of these types of classes; creating a JOIN for each of them is probably the fastest approach but, as someone pointed out, a lot of work.
edit 1/30/2014
Wanted to thank Lorenz Meyer for the great start on my speed issues; I have been working on some of the suggestions!
I do have a further related question.
For simpler lookups, say 50 pairs of values, for example the teacherIds and the corresponding teacher names:
Option 1:
In some cases I have added a field to some tables and had a script pre-populate that field with the looked-up value, such as the teacher's name for the teacherId in that row. At run time the field already has a value; I did this in some of the huge scripts and it dramatically cut the number of queries.
With a cron job to run the script it is an OK solution, but I can see it becoming an issue to keep adding fields for rendered data to so many tables.
Option 2:
I have been thinking of using $_SESSION to store those pairs. Once a user logs in, an array of teacherIds and full names is loaded into the $_SESSION data. Any class that previously ran a lookup query to find a teacher's name could use the $_SESSION array instead, with a fallback query to the lookup table just in case.
I don't have many concurrent users, 30 at most, so this would not be hugely taxing, and I would limit it to some of the smaller lookup tables.
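A minimal sketch of what Option 2 could look like; the teachers table, fullName column, and the mysqli connection $db are assumptions about my schema:
// On login: preload the teacherID => name pairs into the session (assumed table/column names)
session_start();
if (!isset($_SESSION['teacher_names'])) {
    $_SESSION['teacher_names'] = array();
    $res = mysqli_query($db, "SELECT teacherID, fullName FROM teachers");
    while ($row = mysqli_fetch_assoc($res)) {
        $_SESSION['teacher_names'][$row['teacherID']] = $row['fullName'];
    }
}

// In any class that used to run a lookup query: use the session array, fall back to the table
function getTeacherName($db, $teacherId) {
    if (isset($_SESSION['teacher_names'][$teacherId])) {
        return $_SESSION['teacher_names'][$teacherId];
    }
    $res = mysqli_query($db, "SELECT fullName FROM teachers WHERE teacherID = " . (int)$teacherId);
    $row = mysqli_fetch_assoc($res);
    return $row ? $row['fullName'] : null;
}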
What are people's thoughts on these two options, especially Option 2?
I see three solutions; I present them from the easiest to the one that is the most work but also the most effective.
Caching, but limited to one request
This solution would be to include a static variable in this function and use this as a temporary store for the students and classes. This would result in fewer queries, because you query each student only once and each class only once.
Something like this:
function format($studentid, $classid){
    static $students = array();
    static $classes = array();
    if (!isset($students[$studentid])) $students[$studentid] = db_lookup($studentid);
    if (!isset($classes[$classid]))    $classes[$classid]    = db_lookup($classid);
    $student_name = $students[$studentid];
    $class_name   = $classes[$classid];
    (...)
instead of
function format($studentid, $classid){
    $student_name = db_lookup($studentid);
    $class_name   = db_lookup($classid);
    (...)
This solution is very easy to implement, but it caches the results only within one request, for instance when you display a table which contains the same course many times.
Caching between requests
For caching between requests, you need a caching solution such as the PEAR package Cache_Lite. It lets you cache the result of a function call for a fixed argument (e.g. db_lookup($studentid=123)) and store the result in the cache. Cache_Lite implements a memory cache, a file cache and a database cache. I used it with memcache and it worked well.
This solution requires more work, and it will use either disk space or memory.
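A rough sketch of how Cache_Lite could wrap db_lookup(); the cache directory, lifetime and key scheme are assumptions, not part of the original code:
require_once 'Cache/Lite.php';

$cache = new Cache_Lite(array(
    'cacheDir' => '/tmp/',   // assumed cache location
    'lifeTime' => 3600       // keep entries for one hour
));

function cached_lookup($cache, $studentid) {
    $key = 'student_' . $studentid;
    if (($data = $cache->get($key)) !== false) {
        return unserialize($data);        // cache hit: no query at all
    }
    $data = db_lookup($studentid);        // cache miss: query once, then store
    $cache->save(serialize($data), $key);
    return $data;
}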
Code refactoring
The most efficient solution, but the one that requires the most effort, is to refactor your code. It does not make sense to query the database 500 times for one row each. You should rewrite the code so that a single query gets all the data, and then format the data for each row of your result set.
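For instance, a single query with joins (table and column names here are only illustrative) can return the course and student names alongside every row, so the formatter never has to run per-row lookups:
$sql = "SELECT e.studentID, s.name AS student_name, c.name AS course_name
        FROM enrolments e
        JOIN students s ON s.studentID = e.studentID
        JOIN courses  c ON c.courseID  = e.courseID
        WHERE c.courseID = 114";
$res = mysqli_query($db, $sql);
while ($row = mysqli_fetch_assoc($res)) {
    // each row already carries the looked-up names: 1 query instead of 5000
    echo $row['student_name'] . ' - ' . $row['course_name'] . "\n";
}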
Related
I've been trying to incorporate the repository pattern into my PHP app, loosely following this example.
I'm trying to decide between:
limiting which fields I retrieve from a database
retrieving all of them and storing them in a domain object.
The accepted answer in the topic I linked shows a way to implement option 1. The second answer shows why it may be better to retrieve full objects instead.
In my case, retrieving full objects often means getting 50 to 250 fields. Retrieving an array of those objects means a lot of fields, which could affect the performance of the app. That's why I've been leaning towards being selective about which fields I retrieve, since I usually need up to 10 to perform common operations.
If I take the selective approach, then I wouldn't be able to use domain objects to store the data, and I may face the issues that "the second answer" mentioned.
You end up with essentially the same data across many queries. For example, with a User, you'll end up writing essentially the same select * for many calls. One call will get 8 of 10 fields, one will get 5 of 10, one will get 7 of 10. Why not replace all with one call that gets 10 out of 10? The reason this is bad is that it is murder to re-factor/test/mock.
It becomes very hard to reason at a high level about your code over time. Instead of statements like "Why is the User so slow?" you end up tracking down one-off queries and so bug fixes tend to be small and localized.
It's really hard to replace the underlying technology. If you store everything in MySQL now and want to move to MongoDB, it's a lot harder to replace 100 ad-hoc calls than it is a handful of entities.
I could still use domain objects for Create, Update and Delete operations, but Read operations would have to return plain objects or associative arrays, as per the CQRS pattern. I currently have some more complex queries involving joins, which would be difficult to replicate with domain objects without doing extra DB calls and processing that the database would normally handle in one query.
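As a sketch of what I mean by the read side, a query service could return plain associative arrays with just the fields a listing needs; the class, table and column names below are made up for illustration:
class StudentReadService {
    private $pdo;

    public function __construct(PDO $pdo) {
        $this->pdo = $pdo;
    }

    // Read path: fetch only the handful of fields the listing needs,
    // returned as plain arrays instead of hydrated domain objects.
    public function listForTeacher($teacherId) {
        $stmt = $this->pdo->prepare(
            "SELECT s.id, s.first_name, s.last_name, c.name AS course_name
             FROM students s
             JOIN courses c ON c.id = s.course_id
             WHERE c.teacher_id = ?"
        );
        $stmt->execute(array($teacherId));
        return $stmt->fetchAll(PDO::FETCH_ASSOC);
    }
}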
Given all of this, does it make sense for me to use domain objects for just read operations, or at all? Am I missing some other advantages that the domain objects can provide to help with the concerns I've stated?
Our app currently works like this:
class myClass{
    private $names = array();

    function getNames($ids = array()){
        // collect the ids whose names we have not cached yet
        $lookup = array();
        foreach ($ids as $id) {
            if (!isset($this->names[$id])) {
                $lookup[] = $id;
            }
        }
        if (!empty($lookup)) {
            $result = array(); // query database for names where id IN $lookup;
            // $result now contains an associative array of id => name pairs
            $this->names = array_merge($this->names, $result);
        }
        $result = array();
        foreach ($ids as $id) {
            $result[$id] = $this->names[$id];
        }
        return $result;
    }
}
Which works fine, except it can still (and often does) result in several queries (400+ in this instance).
So, I am thinking of simply querying the database and populating the $this->names array with every name from the database.
But I am concerned about memory: at how many entries in the database should I start worrying when doing this? (The database column is varchar(100).)
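Roughly, the preload I have in mind would be an extra method on myClass; the names table and columns below are placeholders for whatever the real lookup table is:
// Hypothetical: load every id => name pair once, so getNames() never has to query again
function preloadAllNames(){
    $names = array();
    $res = mysql_query("SELECT id, name FROM names"); // same old mysql_* API used elsewhere in the app
    while ($row = mysql_fetch_assoc($res)) {
        $names[$row['id']] = $row['name'];
    }
    $this->names = $names;
}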
How much memory do you have? And how many concurrent users does your service generally support during peak access times? These are pertinent pieces of information. Without them any answer is useless. Generally, this is a question easily solved by load testing. Then, find the bottlenecks and optimize. Until then, just make it work (within reason).
But ...
If you really want an idea of what you're looking at ...
If we assume you aren't storing multibyte characters, you have 400 names * 100 chars (assuming every name maxes out your char limit) ... you're looking at ~40 KB of memory. Seems way too insignificant to worry about, doesn't it?
Obviously you'll get other overhead from PHP to hold the data structure itself. Could you store things more efficiently using a data structure like SplFixedArray instead of a plain array? Probably -- but then you lose the highly optimized array_* functions that you'd otherwise have available to manipulate the list.
Will the user be using every one of the entries you're planning to buffer in memory? If you have to have them for your application it doesn't really matter how big they are, does it? It's not a good idea to keep lots of information you don't need in memory "just because." One thing you definitely don't want to do is query the database for 4000 records on every page load. At the very least you'd need to put those types of transactions into a memory store like memcached or use APC.
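As a rough sketch of the memcached route (the server address, cache key, TTL, and the load_all_names_from_db() helper are all assumptions):
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$names = $mc->get('all_names');
if ($names === false) {
    // cache miss: run the one big query, then keep the result for 10 minutes
    $names = load_all_names_from_db();   // hypothetical helper wrapping the SELECT
    $mc->set('all_names', $names, 600);
}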
This question -- like most questions in computer science -- is simply a constrained maximization problem. It can't be solved correctly unless you know the variables at your disposal.
Once you get over a thousand items or so, keyed lookups start to get really slow (there is a delay when you access a specific key). You can fix that with ksort(). (I saw a script go from a 15-minute run time down to under 2 minutes just by adding a ksort.)
Other than that, you are really only limited by memory.
A better way would be to build an array of missing data in your script and then fetch them all in one query using an IN list.
You really shouldn't waste memory storing data the user will never see if you can help it.
I've implemented an Access Control List using 2 static arrays (for the roles and the resources), but I added a new table in my database for the permissions.
The idea of using a static array for the roles is that we won't create new roles all the time, so the data won't change all the time. I thought the same for the resources, also because I think the resources are something that only the developers should treat, because they're more related to the code than to a data. Do you have any knowledge of why to use a static array instead of a database table? When/why?
The problem with hardcoding values into your code is that compared with a database change, code changes are much more expensive:
You usually need to create a new package to deploy. That package needs to be regression tested to verify that no bugs have been introduced. Hint: even if you only change one line of code, regression tests are necessary to verify that nothing went wrong in the build process (e.g. a library isn't correctly packaged, causing a module to fail).
Updating code can mean downtime, which also increases risk: what if the update fails? There is always a risk of that.
In an enterprise environment it is usually a lot quicker to get DB updates approved than code changes.
All that costs time/effort/money. Note, in my opinion holding reference data or static data in a database does not mean a hit on performance, because the data can always be cached.
Your static array is an example of 'hard-coding' your data into your program, which is fine if you never ever want to change it.
In my experience, for your use case, this is not ever going to be true, and hard-coding your data into your source will result in you being constantly asked to update those things you assume will never change.
Protip: to a project manager and/or client, nothing is immutable.
I think this just boils down to how you think the database will be used in the future. If you leave the data in arrays, and then later want to create another application that interacts with this database, you will start to have to maintain the roles/resources data in both code bases. But, if you put the roles/resources into the database, the database will be the one authority on them.
I would recommend putting them in the database. You could read the tables into arrays at startup, and you'll have the same performance benefits and the flexibility to have other applications able to get this information.
Also, when/if you get to writing a user management system, it is easier to display the roles/resources of a user by joining the tables than it is to get back the roles/resources IDs and have to look up the pretty names in your arrays.
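For example, a small loader run once at startup could give you the same in-memory array you have now while keeping the database as the single authority (the table and column names here are assumed):
function loadRoles(PDO $pdo) {
    $roles = array();
    foreach ($pdo->query("SELECT id, name FROM roles") as $row) {
        $roles[$row['id']] = $row['name'];
    }
    return $roles; // e.g. array(1 => 'admin', 2 => 'teacher', ...)
}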
Using static arrays you get performance, since you do not need to access the database all the time, but safety is more important than performance, so I suggest you keep the control of permissions in the database.
Read up on RBAC.
Things considered static should be coded statically, that is, if you really do consider them static.
But I suggest using class constants instead of static array values.
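For example (just a sketch; the role names and the $currentRole variable are made up):
class Role {
    const ADMIN   = 'admin';
    const TEACHER = 'teacher';
    const STUDENT = 'student';
}

// usage: compare against a named constant instead of a magic string or array entry
if ($currentRole === Role::TEACHER) {
    echo 'teacher view';
}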
I am currently building a codeigniter application that handles a specific type of mammal. When a user is adding a new record (mammal), they are given lists of 'breed types', 'genders', etc. Those are stored in separate database tables.
Currently, to get these, I have separate functions such as:
$this->Mammal->get_list_of_breeds()
$this->Mammal->get_list_of_genders()
Each of these runs a query, and there may be 7 or 8 more different lookups for me to query. Does anyone know if this will significantly slow down my application or put too many queries on the database?
Is there a better way to do this by consolidating the queries into a single function and using PHP to split out the lookup fields?
Any ideas or thoughts are greatly appreciated.
One idea is to take some of the smaller sets of options and put them in arrays, especially if they cannot be changed by the user. Gender, for example, could probably just be in an array. As far as I know, there are only two options. If there are any other similar option sets you could make those arrays too.
But, even 300 records is not a huge amount of data. I take it you aren't building the next Facebook, so just making several clean queries to get the options you need probably won't be a big deal.
Personally, I wouldn't put it all in one table. Big generic tables just seem kind of hokey, and you would still be getting the same amount of data. You could have separate tables and accomplish the same thing by UNIONing the queries.
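A sketch of what that UNION could look like, using a type column to keep the option sets apart (the table names and the CodeIgniter-style query call are assumptions):
$sql = "SELECT 'breed' AS option_type, id, name FROM breeds
        UNION ALL
        SELECT 'gender' AS option_type, id, name FROM genders";
$query   = $this->db->query($sql);   // CodeIgniter-style call from inside a model (assumption)
$options = array();
foreach ($query->result_array() as $row) {
    $options[$row['option_type']][$row['id']] = $row['name'];
}
// $options['breed'] and $options['gender'] now hold the separate lists from one round trip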
As you commented yourself, yes indeed you should put everything into one table...
So you'd have a table called mammals
And then you'd have the fields: gender, breeds etc...
Now this is a lot easier when programming in PHP, since you can run one query and then display everything, like this:
$query = "SELECT * FROM `mammals`";
$query_exec = mysql_query($query);
while ($result = mysql_fetch_array($query_exec))
{
    print "gender: " . $result['gender'] . " breed: " . $result['breed'];
}
A little explanation:
The query gets everything from the table called mammals.
The while loop continues as long as there are still rows in the result set.
mysql_fetch_array puts each row's data into the variable, and every field can be read via $result['fieldname'].
I know this is not a very clear explanation, but my mind also isn't the clearest at this late hour :/
If you have an array of record ID's in your application code, what is the best way to read the records from the database?
$idNumsIWant = array(2,4,5,7,9,23,56);
Obviously looping over each ID is bad because you do n queries:
foreach ($idNumsIWant as $memID) {
$DBinfo = mysql_fetch_assoc(mysql_query("SELECT * FROM members WHERE mem_id = '$memID'"));
echo "{$DBinfo['fname']}\n";
}
So, perhaps it is better to use a single query?
$sqlResult = mysql_query("SELECT * FROM members WHERE mem_id IN (".join(",",$idNumsIWant).")");
while ($DBinfo = mysql_fetch_assoc($sqlResult))
echo "{$DBinfo['fname']}\n";
But does this method scale when the array has 30,000 elements?
How do you tackle this problem efficiently?
The best approach ultimately depends on the number of IDs you have in your array (you obviously don't want to send a 50MB SQL query to your server, even though technically it might be capable of dealing with it without too much trouble), but mostly on how you're going to deal with the resulting rows.
If the number of IDs is very low (let's say a few thousand tops), a single query with a WHERE clause using the IN syntax will be perfect. Your SQL query will be short enough for it to be transferred reliably, efficiently and quickly to the DB server. This method is perfect for a single thread looping through the resulting records.
If the number of IDs is really big, I would suggest you split the ID array into several groups and run more than one query, each with one group of IDs. It may be a little heavier for the DB server, but on the application side you can spawn several threads and deal with the multiple record sets as soon as they arrive, in parallel.
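One way to do that splitting, assuming a hypothetical fetch_rows_by_ids() helper that runs the IN() query for a single group:
$allRows = array();
foreach (array_chunk($idNumsIWant, 1000) as $chunk) {
    // one IN() query per group of at most 1000 ids (hypothetical helper)
    $allRows = array_merge($allRows, fetch_rows_by_ids($chunk));
}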
Both methods will work.
Cliff notes: for that kind of situation, focus on data usage, as long as data extraction isn't too big a bottleneck. And profile your app!
My thoughts:
The first method is too costly in terms of processing and disk reads.
The second method is more efficient and you don't have to worry much about query size limit (but check it anyway).
When I have to deal with that kind of situation, I see at least three or four possible solutions:
one request per id; as you said, this is not really good: lots of requests; I generally don't do that
use the solution you proposed: one request for many ids
but you can't do that with a very long list of ids: some database engines have a limit on the amount of data you can pass in an IN()
a very big list in IN() might not be good performance-wise
So I generally do something like one request for X ids, and repeat that. For instance, to fetch the data corresponding to 1,000 ids, I might do 20 requests, each getting data for 50 ids (that's just an example: benchmarking your DB/table for your particular case could be interesting, as it might depend on several factors).
in some cases, you could also rethink your requests: maybe you could avoid passing such a list of ids by using some kind of join? (This really depends on what you need, your tables' schema, ...)
Also, to make it easier to change the fetching logic later, I would write a function that takes the list of ids and returns the list of data corresponding to those ids.
This way, you just call this function the same way and you always get the same data, without having to worry about how that data is fetched; this will allow you to change the fetching method if needed (if you find a better way some day) without breaking anything: HOW the function works will change, but as its interface (input/output) remains the same, it will not change a thing for the rest of your code :-)
If it were me and I had that large a list of values for the IN clause, I would use a stored procedure with a variable containing the values I wanted, and use a function inside it to send them into a temp table and then join to it. Depending on the size of the values you want to send, you might need to split them into multiple input variables to process. Is there any way the values could be permanently stored in the database (if they are often queried on)? And how is the user going to pick out 30,000 values? Surely he or she isn't going to type them all in, so there is probably a better way to query the table based on a join and a WHERE clause.
Using a StringTokenizer to split your string into tokens would make it easier for you to handle retrieving data for multiple values.