When to use Doctrine or Symfony Cache? - php

I have been reading extensively about Doctrine’s different caching options as well as Symfony’s caching mechanisms:
Symfony Official: https://symfony.com/doc/4.0/components/cache.html
Doctrine Official: https://www.doctrine-project.org/projects/doctrine-orm/en/2.6/reference/caching.html
KnP university (very useful as always): https://knpuniversity.com/screencast/symfony-fundamentals/caching
Other good resources: https://blog.kaliop.com/blog/2014/10/06/doctrine-symfony2-2/
Nevertheless, although these resources explain HOW to use the cache system, I can’t figure out WHEN to use the cache. Under which circumstances is the cache very useful, and when should it not be used?
For example, in my project, I have a large amount of data to pull from my database that I’d like to cache (pulling entities with tons of left joins). The tables behind those left joins are updated on a regular basis, some every hour, some every day and some every minute, through a bot called by a cron job (a Symfony command).
I don’t know how to make sure all my data is up to date when I display it to the user with the cache mechanism enabled. If the DB gets updated, do I need to remove the data from the cache manually, for example by calling $cacheDriver->delete('my_data'); at update time, and then check whether the data exists and save it anew when retrieving it? Would this be the proper way to do it?
Also, should I use Doctrine Cache or Symfony 4 cache? Which one to choose from?
I have an example of one of the queries I’d like to cache in another SO thread here: https://stackoverflow.com/a/51800728/1083453
Bottom line would be, how to make that query as efficient as possible?
I’m leaning toward the following:
- Remove cache when updating any data included in my query.
- Cache it when calling the query for the first time
- Retrieve it whenever calling the same query between two updates
Am I on the right path?
Any help or advice is much appreciated.
Thanks

There are no official rules. Sometimes you may think you are optimizing when at best you're wasting time and at worst you're losing performance, because checking whether you have a cached entry and whether it is still valid takes longer than the query itself.
Assuming that you have done a fine enough job architecting your application (meaning there are no inefficient operations, your queries are well written and don't load useless data, etc.), it's really a case-by-case study.
Try testing the pages of your application with software that simulates hundreds of clients accessing it. You'll soon enough identify which pages can't handle the load, and the debugger will tell you which queries are slowing them down. Those are the ones that will definitely benefit from caching.

I was wondering the same thing and stumbled upon your question.
Full disclosure: I've never used doctrine result cache :)
One thing I expect is that with Doctrine result cache, you don't need to bother with serializing / deserializing your cached data. This could be very convenient, when you're trying to cache and later retrieve a complex entity.
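The plan outlined in the question (cache on first read, serve from cache between updates, drop the entry when the data changes) can be sketched roughly as follows. This is only a minimal illustration: ArrayCache, fetchCached and updateAndInvalidate are hypothetical names, and the in-memory array stands in for whichever Doctrine or Symfony cache driver ends up being used.

```php
<?php
// Hypothetical in-memory stand-in for a real cache driver; only the flow matters.
final class ArrayCache
{
    /** @var array<string, mixed> */
    private array $items = [];

    public function has(string $key): bool
    {
        return array_key_exists($key, $this->items);
    }

    public function get(string $key)
    {
        return $this->items[$key] ?? null;
    }

    public function set(string $key, $value): void
    {
        $this->items[$key] = $value;
    }

    public function delete(string $key): void
    {
        unset($this->items[$key]);
    }
}

// Read path: return the cached value, or run the expensive loader once and store it.
function fetchCached(ArrayCache $cache, string $key, callable $loader)
{
    if ($cache->has($key)) {
        return $cache->get($key);
    }
    $value = $loader();
    $cache->set($key, $value);
    return $value;
}

// Write path (e.g. the cron command): change the data, then drop the stale entry.
function updateAndInvalidate(ArrayCache $cache, string $key, callable $writer): void
{
    $writer();
    $cache->delete($key);
}

// Usage with a fake "database" and a counter of expensive queries.
$db = ['price' => 10];
$queries = 0;
$loader = function () use (&$db, &$queries) {
    $queries++;
    return $db['price'];
};
$cache = new ArrayCache();
$first = fetchCached($cache, 'my_data', $loader);   // miss: runs the loader
$second = fetchCached($cache, 'my_data', $loader);  // hit: served from cache
updateAndInvalidate($cache, 'my_data', function () use (&$db) {
    $db['price'] = 12;
});
$third = fetchCached($cache, 'my_data', $loader);   // miss again: fresh value
```

With data that changes every minute, the entry would be dropped about as often as it is read, so the benefit depends on how many reads happen between two updates.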

Related

Using Memcache, Should I use PDO or an ORM?

I am working on a project with a custom HTML5 front end and a backend I've designed from experience. The backend is composed of a message queue and a cache - currently I've chosen Beanstalk and Memcache because I'm familiar with them but I am open to suggestions.
My question though comes from how my coder is interfacing with the MySQL DB we are using to store the data. The idea is to pre-cache most or all of the DB so the site runs really fast. It's not a huge DB so RAM for Memcache shouldn't be an issue. However, my coder is using CodeIgniter with GreenBean. I've never heard of GreenBean before and when I google it I get almost nothing that isn't related to greenbeans the food. What little I could find suggested it was an ORM which fits from what my coder has told me.
The problem is this. With raw PDO my pre-caching scheme is simple - I would grab each row from each table and store it in the cache with a key. Then every time I needed that data I would look at the cache first for it and then the DB. If something is changed on the backend then I only need to update that row in the DB and the associated key in the cache.
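The row-per-key scheme described above might look something like this. It is a sketch under assumptions: plain PHP arrays stand in for MySQL and Memcache, and rowKey, warmCache, readRow and writeRow are made-up helper names, not part of any library.

```php
<?php
// Hypothetical sketch: $db and $cache are plain arrays standing in for MySQL and Memcache.
function rowKey(string $table, int $id): string
{
    return $table . ':' . $id;
}

// Pre-cache: copy every row of every table into the cache under its own key.
function warmCache(array $db, array &$cache): void
{
    foreach ($db as $table => $rows) {
        foreach ($rows as $id => $row) {
            $cache[rowKey($table, $id)] = $row;
        }
    }
}

// Read: cache first, then the database (filling the cache on the way).
function readRow(array $db, array &$cache, string $table, int $id): ?array
{
    $key = rowKey($table, $id);
    if (!isset($cache[$key])) {
        if (!isset($db[$table][$id])) {
            return null;
        }
        $cache[$key] = $db[$table][$id];
    }
    return $cache[$key];
}

// Write: update the row in the database and the one matching cache key.
function writeRow(array &$db, array &$cache, string $table, int $id, array $row): void
{
    $db[$table][$id] = $row;
    $cache[rowKey($table, $id)] = $row;
}

$db = [
    'office'   => [1 => ['address' => '1 Old Street']],
    'employee' => [7 => ['name' => 'Ann', 'office_id' => 1]],
];
$cache = [];
warmCache($db, $cache);
writeRow($db, $cache, 'office', 1, ['address' => '2 New Road']);

// The employee row stores only office_id, so it cannot go stale when the office changes.
$employee = readRow($db, $cache, 'employee', 7);
$office = readRow($db, $cache, 'office', $employee['office_id']);
```

Because each cached value is a single row that refers to other rows by id, one write touches exactly one key.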
With an ORM, if I store the entire ORM object serialized into the cache then it holds a bunch of related data. Data that could be incorrect if something were changed. For example, you have a DB of employees that is linked to the office they work in and the dept they work in. The ORM grabs the office and the dept and we store all of that in the cache. But if the office address changes the ORM object for every employee in that office is now stale/incorrect.
In that example, just letting the cache expire probably isn't an issue most of the time. But in my application, that data should really get updated immediately. So in a simple PDO scheme you flush the cache keys related to the data that changed and every future page call gets the updated data. But with an ORM you have lots and lots of cached object instances that might be incorrect and no good way of finding them. So it seems to me you are now left with some form of indexing of your cached objects and when you change something simple you could be flushing and refilling a big chunk of the cache. The site gets really slow then.
Typically I would just cache a DB result after the first time I needed it but in this case I think that could end up being really slow for a lot of users that make the first requests for that particular set of data. Additionally, there are some search features that could require a lot of data from the DB. Thus my desire to pre-cache.
So in this case I'm thinking an ORM would hurt the site's performance. I'm thinking I'm not the first person to have this issue though. Is there an ORM out there that would handle this scenario well? Is there a better backend architecture I'm missing?
Thanks

Is this a viable solution to a list in Memcached?

Basically we have sales people that request leads to call. Right now the system tries a "fresh lead" query to get those.
If there aren't any fresh leads it moves on to a "relatively new" query. We call these "sources" and essentially a closer will go through sources until they find a viable lead.
These queries all query the same table, just different groups of data. However, there is a lot of complex sorting on each query, and between that and inserts/updates to the table (the table being InnoDB) we're experiencing lots of waits (no deadlocks, I'm pretty sure, since they don't show in the InnoDB status), so my guess is we have slow selects coupled with lots of inserts/updates.
NOW, the ultimate question IS:
Should we query the DB for each source and grab about 100ish (obviously variable depending on the system) and cache them in memcached. Then, as closers request leads, send them from cache but update the cache to reflect an "is_acccepted" flag. This way we only call each source as we run out of cached leads so just once as we run out, instead of once per closer requesting a lead?
Then we can use simulated locking with memcached - http://code.google.com/p/memcached/wiki/FAQ#Emulating_locking_with_the_add_command
Does this seem like a viable solution? Any recommendations? We need to minimize the chances of lock waits desperately and quickly.
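For illustration, the "emulate a lock with add" idea from the linked FAQ could be applied to lead hand-out roughly like this. FakeMemcache is a hypothetical in-memory class that only mimics the store-if-absent behaviour of memcached's add command, and the lead_lock: key prefix and claimLead helper are assumptions made for the sketch.

```php
<?php
// Hypothetical in-memory stand-in that mimics only the "add" semantics of memcached:
// add() stores the value and returns true only when the key does not exist yet.
final class FakeMemcache
{
    private array $data = [];

    public function add(string $key, $value): bool
    {
        if (array_key_exists($key, $this->data)) {
            return false;
        }
        $this->data[$key] = $value;
        return true;
    }
}

// Hand out the first cached lead that nobody else has locked yet.
// Returns the lead id, or null when the cached batch is exhausted
// (the caller would then re-run the source query and refill the cache).
function claimLead(FakeMemcache $mc, array $leadIds, string $closer): ?int
{
    foreach ($leadIds as $leadId) {
        if ($mc->add('lead_lock:' . $leadId, $closer)) {
            return $leadId;
        }
    }
    return null;
}

$mc = new FakeMemcache();
$batch = [101, 102];                       // ids fetched once from a "source" query
$a = claimLead($mc, $batch, 'closer_a');   // 101
$b = claimLead($mc, $batch, 'closer_b');   // 102
$c = claimLead($mc, $batch, 'closer_c');   // null: batch used up
```

Against a real memcached server the lock keys would normally be given an expiry so that a crashed request cannot hold a lead forever.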
Sounds viable, but have you looked at your indexes and are you using proper isolation levels on your selects?
A previous SO question may help with the answer you're seeking: Any way to select without causing locking in MySQL?
If you perform your select/update in a SP with full transactions, this could also speed things up quite a bit due to optimization. Of course, there are times when SPs in MySQL are much slower :(
I'd have put this as a comment, but haven't reached that level yet :)
And I did read the part about inno-db, but experience has shown me improvements even with inno when using isolation levels.
You should definitely look at making sure your DB queries are fully optimized before you employ another datastore.
If you do decide to cache this data then consider using Redis, which makes lists first class citizens.

PHP APC To cache or not to cache?

I don't really have any experience with caching at all, so this may seem like a stupid question, but how do you know when to cache your data? I wasn't even able to find one site that talked about this, but it may just be my searching skills or maybe too many variables to consider?
I will most likely be using APC. Does anyone have any examples of what would be the least amount of data you would need in order to cache it? For example, let's say you have an array with 100 items and you use a foreach loop on it and perform some simple array manipulation, should you cache the result? How about if it had a 1000 items, 10000 items, etc.?
Should you be caching the results of your database query? What kind of queries should you be caching? I assume a simple select and maybe a couple joins statement to a mysql db doesn't need caching, or does it? Assuming the mysql query cache is turned on, does that mean you don't need to cache in the application layer, or should you still do it?
If you instantiate an object, should you cache it? How to determine whether it should be cached or not? So a general guide on what to cache would be nice, examples would also be really helpful, thanks.
When you're looking at caching data that has been read from the database in APC/memcache/WinCache/redis/etc, you should be aware that it will not be updated when the database is updated unless you explicitly code to keep the database and cache in synch. Therefore, caching is most effective when the data from the database doesn't change often, but also requires a more complex and/or expensive query to retrieve that data from the database (otherwise, you may as well read it from the database when you need it)... so expensive join queries that return the same data records whenever they're run are prime candidates.
And always test to see if queries are faster read from the database than from cache. Correct database indexing can vastly improve database access times, especially as most databases maintain their own internal cache as well, so don't use APC or equivalent to cache data unless the database overheads justify it.
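A quick way to run that comparison is a small timing harness along these lines. The two closures are placeholders (a usleep() standing in for the query, an unserialize() standing in for the cache read); averageMs is a hypothetical helper, and real numbers only mean something when measured against the actual database and cache.

```php
<?php
// Time a callable over several runs and return the average in milliseconds.
function averageMs(callable $fn, int $runs = 50): float
{
    $start = hrtime(true);              // monotonic clock, nanoseconds
    for ($i = 0; $i < $runs; $i++) {
        $fn();
    }
    return (hrtime(true) - $start) / $runs / 1e6;
}

// Hypothetical stand-ins: replace with the real query and the real cache read.
$fromDatabase = function () {
    usleep(2000);                       // pretend the query takes about 2 ms
    return ['row'];
};
$stored = serialize(['row']);
$fromCache = function () use ($stored) {
    return unserialize($stored);        // cache reads pay for unserialize()
};

$dbMs = averageMs($fromDatabase, 10);
$cacheMs = averageMs($fromCache, 10);
printf("database: %.3f ms, cache: %.3f ms\n", $dbMs, $cacheMs);
```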
You also need to be aware of space usage in the cache. Most caches are a fixed size and you don't want to overfill them... so don't use them to store large volumes of data. Use the apc.php script available with APC to monitor cache usage (though make sure that it's not publicly accessible to anybody and everybody that accesses your site.... bad security).
When holding objects in cache, the object will be serialized() when it's stored, and unserialized() when it's retrieved, so there is an overhead. Objects with resource attributes will lose that resource; so don't store your database access objects.
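If an object that holds a resource really has to go through the cache, one possible approach (an assumption on my part, not something the answer above prescribes) is to leave the resource out of the serialized form and reopen it on the way back. Report is a made-up class and php://memory is just a convenient stream for the demo.

```php
<?php
// Hypothetical class holding a stream resource next to plain data.
final class Report
{
    public $handle;          // resource: cannot be stored in a cache
    public string $title;

    public function __construct(string $title)
    {
        $this->title = $title;
        $this->handle = fopen('php://memory', 'w+');
    }

    // Only the plain data goes into the serialized form.
    public function __serialize(): array
    {
        return ['title' => $this->title];
    }

    // Rebuild the object and reopen the resource when it comes back.
    public function __unserialize(array $data): void
    {
        $this->title = $data['title'];
        $this->handle = fopen('php://memory', 'w+');
    }
}

$original = new Report('Monthly sales');
$stored = serialize($original);          // what a cache would store
$restored = unserialize($stored);        // what a cache read returns
```

For database connections the simpler rule from the answer still applies: keep them out of the cache altogether.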
It's sensible only to use cache to store information that is accessed by many/all users, rather than user-specific data. For user session information, stick with normal PHP sessions.
The simple answer is that you cache data when things get slow. Obviously for any medium to large sized application, you need to do much more planning than just a wait and see approach. But for the vast majority of websites out there, the question to ask yourself is "Are you happy with the load time". Of course if you are obsessive about load time, like myself, you are going to want to try to make it even faster regardless.
Next, you have to identify what specifically is the cause of the slowness. You assumed that your application code was the source, but it's worth examining whether there are other external factors such as large page file size, excessive requests, no gzip, etc. Use a site like http://tools.pingdom.com/ or an extension like yslow as a start for that. (Quick tip: make sure keepalives and gzip are working.)
Assuming the problem is the duration of execution of your application code, you are going to want to profile your code with something like xdebug (http://www.xdebug.org/) and view the output with kcachegrind or wincachegrind. That will let you know what parts of your code are taking long to run. From there you will make decisions on what to cache and how to cache it (or make improvements in the logic of your code).
There are so many possibilities for what the problem could be and the associated solutions, that it is not worth me guessing. So, once you identify the problem you may want to post a new question related to solving that specific problem. I will say that if not used properly, the mysql query cache can be counter productive. Also, I generally avoid the APC user cache in favor of memcached.

Smart (?) Database Cache

I've seen several database cache engines, all of them are pretty dumb (i.e.: keep this query cached for X minutes) and require that you manually delete the whole cache repository after a INSERT / UPDATE / DELETE query has been executed.
About 2 or 3 years ago I developed an alternative DB cache system for a project I was working on, the idea was basically to use regular expressions to find the table(s) involved in a particular SQL query:
$query_patterns = array
(
    'INSERT' => '/INTO\s+(\w+)\s+/i',
    // join keywords are matched as whole words via an alternation group
    'SELECT' => '/FROM\s+((?:[\w]|,\s*)+)(?:\s+(?:(?:LEFT|RIGHT|OUTER|INNER|NATURAL|CROSS)\s*)*JOIN\s+((?:[\w]|,\s*)+)\s*)*/i',
    'UPDATE' => '/UPDATE\s+(\w+)\s+SET/i',
    'DELETE' => '/FROM\s+((?:[\w]|,\s*)+)/i',
    'REPLACE' => '/INTO\s+(\w+)\s+/i',
    'TRUNCATE' => '/TRUNCATE\s+(\w+)/i',
    'LOAD' => '/INTO\s+TABLE\s+(\w+)/i',
);
I know that these regexes probably have some flaws (my regex skills were pretty green back then) and obviously don't match nested queries, but since I never use nested queries that isn't a problem for me.
Anyway, after finding the involved tables I would alphabetically sort them and create a new folder in the cache repository with the following naming convention:
+table_a+table_b+table_c+table_...+
In case of a SELECT query, I would fetch the results from the database, serialize() them and store them in the appropriate cache folder, so for instance the results of the following query:
SELECT `table_a`.`title`, `table_b`.`description` FROM `table_a`, `table_b` WHERE `table_a`.`id` <= 10 ORDER BY `table_a`.`id` ASC;
Would be stored in:
/cache/+table_a+table_b+/079138e64d88039ab9cb2eab3b6bdb7b.md5
The file name is the MD5 hash of the query itself. Upon a subsequent SELECT query the results would be trivial to fetch.
In case of any other type of write query (INSERT, REPLACE, UPDATE, DELETE and so on) I would glob() all the folders that had +matched_table(s)+ in their name and delete all their files. This way it wouldn't be necessary to delete the whole cache, just the cache used by the affected and related tables.
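The folder naming and the glob()-based invalidation described above could be sketched like this. It is a reconstruction from the description, not the original code: cacheFolder, storeResult and invalidateTable are hypothetical names, the list of tables is passed in directly instead of being extracted with the regexes, and a temporary directory stands in for /cache.

```php
<?php
// Build the folder name from the tables a query touches, e.g. "+table_a+table_b+".
function cacheFolder(array $tables): string
{
    sort($tables);
    return '+' . implode('+', $tables) . '+';
}

// Store a SELECT result under <root>/<folder>/<md5 of query>.md5
function storeResult(string $root, array $tables, string $sql, array $rows): string
{
    $dir = $root . '/' . cacheFolder($tables);
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }
    $file = $dir . '/' . md5($sql) . '.md5';
    file_put_contents($file, serialize($rows));
    return $file;
}

// After a write to $table, delete every cache file in folders that mention it.
function invalidateTable(string $root, string $table): void
{
    foreach (glob($root . '/*+' . $table . '+*', GLOB_ONLYDIR) ?: [] as $dir) {
        foreach (glob($dir . '/*.md5') ?: [] as $file) {
            unlink($file);
        }
    }
}

// Usage in a throwaway directory (hypothetical path).
$root = sys_get_temp_dir() . '/query_cache_' . uniqid();
$ab = storeResult($root, ['table_b', 'table_a'], 'SELECT 1', [['title' => 'x']]);
$c = storeResult($root, ['table_c'], 'SELECT 2', [['id' => 1]]);
invalidateTable($root, 'table_a');   // removes $ab, leaves $c alone
```

The surrounding "+" characters are what keep a write to table_a from matching a folder for a table such as my_table_a.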
The system worked pretty well and the difference of performance was visible - although the project had many more read queries than write queries. Since then I started using transactions, FK CASCADE UPDATES / DELETES and never had the time to perfect the system to make it work with these features.
I've used MySQL Query Cache in the past but I must say the performance doesn't even compare.
I'm wondering: am I the only one who sees beauty in this system? Are there any bottlenecks I may not be aware of? Why do popular frameworks like CodeIgniter and Kohana (I'm not aware of Zend Framework) have such rudimentary DB cache systems?
More importantly, do you see this as a feature worth pursuing? If yes, is there anything I could do / use to make it even faster (my main concerns are disk I/O and (de)serialization of query results)?
I appreciate all input, thanks.
I can see the beauty in this solution; however, I believe it only works for a very specific set of applications. Scenarios where it is not applicable include:
Databases which utilize cascading deletes/updates or any kind of triggers. E.g., your DELETE to table A may cause a DELETE from table B. The regex will never catch this.
Accessing the database from points which do not go through your cache invalidation scheme, e.g. crontab scripts etc. If you ever decide to implement replication across machines (introduce read-only slaves), it may also disturb the cache (because it does not go through cache invalidation etc.)
Even if these scenarios are not realistic for your case it does still answer the question of why frameworks do not implement this kind of cache.
Regarding if this is worth pursuing, it all depends on your application. Maybe you care to supply more information?
The solution, as you describe it, is at risk for concurrency issues. When you're receiving hundreds of queries per second, you're bound to hit a case where an UPDATE statement runs, but before you can clear your cache, a SELECT reads from it and gets stale data. Additionally, you may run into issues when several UPDATEs hit the same set of rows in a short time period.
In a broader sense, best practice with caching is to cache the largest objects possible. E.g., rather than having a bunch of "user"-related rows cached all over the place, it's better to just cache the "user" object itself.
Even better, if you can cache whole pages (e.g., you show the same homepage to everyone; a profile page appears identical to almost everyone, etc.), do that. One cache fetch for a whole, pre-rendered page will dramatically outperform dozens of cache fetches for row/query level caches followed by re-rendering the page.
Long story short: profile. If you take the time to do some measurement, you'll likely find that caching large objects, or even pages, rather than small queries used to build those things, is a huge performance win.
While I do see the beauty in this - especially for environments where resources are limited and can not easily be extended, like on shared hosting - I personally would fear complications in the future: What if somebody, newly hired and unaware of the caching mechanism, starts using nested queries? What if some external service starts updating the table, with the cache not noticing?
For a specialized, defined project that urgently needs a speedup that cannot be helped by adding processor power or RAM, this looks like a great solution. As a general component, I find it too shaky, and would fear subtle problems in the long run that stem from people forgetting that there is a cache to be aware of.
I suspect that the regexes may not provide for every case - certainly they don't seem to deal with the scenario of mixing database-qualified table names and bare table names. e.g. consider
update stats.measures set amount=50 where id=1;
and
use stats;
update measures set amount=50 where id=1;
Then there's PL/SQL.
Then there's the fact that it depends on every client opting in to an advisory control mechanism i.e. it pre-supposes that all the database access is from machines implementing the caching control mechanism on a shared filesystem.
(as a small point - wouldn't it be simpler to just check the modification times on the data files to determine if the cached version of a query on a defined set of tables is still current, rather than trying to identify if the cache control mechanism has spotted an update - it would certainly be a lot more robust)
Stepping back a bit, implementing this from scratch using a robust architecture would mean that all queries would have to be intercepted by the control mechanism. The control mechanism would probably need a more sophisticated query parser. It certainly requires a common storage substrate for all the instances of the control mechanism. It probably needs an understanding of the data dictionary - all things which are already implemented by the database itself.
You state that "I've used MySQL Query Cache in the past but I must say the performance doesn't even compare."
I find this rather odd. Certainly when dealing with large result sets from queries, my experience is that loading the data into the heap from a database is a lot faster than unserializing large arrays - although large result sets are rather atypical of web based applications.
When I've tried to speed up database access (after fixing everything else of course) then I've gone down the route of replicating and partitioning data across multiple DBMS instances.
C.
This is related to the problem of session splitting when working with multiple databases in a master-slave configuration. Basically, a similar set of regular expressions are used to determine which tables (or even which rows) are being read from or written to. The system keeps track of which tables were written to and when, and when a read to one of those tables comes up, it's routed to the master. If a query is reading from a table whose data needn't be up-to-the-second accurate, then it's routed to the slave. Generally, information only really needs to be current when it's something a user changed themselves (i.e., editing a user's profile).
They talk about this a good bit in the O'Reilly book High Performance MySQL. I used it quite a bit when developing a system for handling session splits back in the day.
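As a rough illustration of that routing rule (not the system from the book or from my own project), a tracker could remember when each table was last written and send reads of recently written tables to the master. ReadRouter and the 5-second window are assumptions for the sketch; in a session-splitting setup there would be one such tracker per user session, and the table names would come from parsing the SQL.

```php
<?php
// Hypothetical router: remembers when each table was last written and sends
// reads of recently written tables to the master, everything else to a slave.
final class ReadRouter
{
    /** @var array<string, float> table name => time of last write */
    private array $lastWrite = [];
    private float $window;

    public function __construct(float $windowSeconds)
    {
        $this->window = $windowSeconds;
    }

    public function recordWrite(string $table, float $now): void
    {
        $this->lastWrite[$table] = $now;
    }

    public function routeRead(array $tables, float $now): string
    {
        foreach ($tables as $table) {
            $written = $this->lastWrite[$table] ?? null;
            if ($written !== null && $now - $written < $this->window) {
                return 'master';   // the slave may not have this write yet
            }
        }
        return 'slave';
    }
}

$router = new ReadRouter(5.0);                 // assumed 5-second replication window
$router->recordWrite('profiles', 100.0);
$r1 = $router->routeRead(['profiles'], 102.0);             // 'master'
$r2 = $router->routeRead(['articles'], 102.0);             // 'slave'
$r3 = $router->routeRead(['profiles', 'articles'], 110.0); // 'slave'
```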
The improvement you describe is to avoid invalidating caches that are guaranteed to not have been affected by an update because they draw data from a different table.
That is of course nice, but I am not sure if it is fine-grained enough to make a real difference. You would still be invalidating lots of caches that did not really need to be (because the update was on the table, but on different rows).
Also, even this "simple" scheme relies on being able to detect the relevant tables by looking at the SQL query string. This can be difficult to do in the general case, because of views, table aliases, and multiple catalogs.
It is very difficult to automatically (and efficiently) detect whether a cache needs to be invalidated. Because of that, you can either use a very simple scheme (such as invalidating on every update, or per table, as in your system, which does not work too well when there are many updates), or a very hand-crafted cache for the specific application with deep hooks into the query logic (probably difficult to write and hard to maintain), or accept that the cache can contain stale data and just refresh it periodically.

Caching table results for better performance... how?

First of all, the website I run is hosted and I don't have access to be able to install anything interesting like memcached.
I have several web pages displaying HTML tables. The data for these HTML tables are generated using expensive and complex MySQL queries. I've optimized the queries as far as I can, and put indexes in place to improve performance. The problem is if I have high traffic to my site the MySQL server gets hammered, and struggles.
Interestingly - the data within the MySQL tables doesn't change very often. In fact it changes only after a certain 'event' that takes place every few weeks.
So what I have done now is this:
Save the HTML table once generated to a file
When the URL is accessed check the saved file if it exists
If the file is older than 1hr, run the query and save a new file, if not output the file
This ensures that for the vast majority of requests the page loads very fast, and the data can at most be 1hr old. For my purpose this isn't too bad.
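In code, the three steps above come down to something like the following sketch. cachedTable is a hypothetical helper name, the renderer closure stands in for the expensive query plus HTML generation, and the file lives in the system temp directory purely for the example.

```php
<?php
// Return cached HTML if the file is younger than $ttl seconds; otherwise rebuild it.
function cachedTable(string $file, int $ttl, callable $render): string
{
    clearstatcache(true, $file);
    if (is_file($file) && time() - filemtime($file) < $ttl) {
        return file_get_contents($file);
    }
    $html = $render();                       // the expensive query + HTML build
    file_put_contents($file, $html, LOCK_EX);
    return $html;
}

// Usage with a hypothetical renderer that counts how often it runs.
$runs = 0;
$render = function () use (&$runs) {
    $runs++;
    return '<table><tr><td>42</td></tr></table>';
};
$file = sys_get_temp_dir() . '/table_cache_' . uniqid() . '.html';
$first = cachedTable($file, 3600, $render);    // builds and saves
$second = cachedTable($file, 3600, $render);   // served from the file
```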
What I would really like is to guarantee that if any data changes in the database, the cache file is deleted. This could be done by finding all scripts that do any change queries on the table and adding code to remove the cache file, but it's flimsy as all future changes need to also take care of this mechanism.
Is there an elegant way to do this?
I don't have anything but vanilla PHP and MySQL (recent versions) - I'd like to play with memcached, but I can't.
Ok - serious answer.
If you have any sort of database abstraction layer (hopefully you will), you could maintain a field in the database for the last time anything was updated, and manage that from a single point in your abstraction layer.
e.g. (pseudocode): On any update set last_updated.value = Time.now()
Then compare this to the time of the cached file at runtime to see if you need to re-query.
If you don't have an abstraction layer, create a wrapper function to any SQL update call that does this, and always use the wrapper function for any future functionality.
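A minimal sketch of that idea, with some liberties taken: the Db class and cacheIsFresh are hypothetical names, and the timestamp is kept in a property here, whereas in practice it would live in a database field (or be read with a cheap query) so that every script sees the same value.

```php
<?php
// Hypothetical abstraction-layer fragment: every write goes through write(),
// which bumps a single "last updated" timestamp.
final class Db
{
    private int $lastUpdated = 0;

    public function write(string $sql, int $now): void
    {
        // ... execute $sql against the real database here ...
        $this->lastUpdated = $now;
    }

    public function lastUpdated(): int
    {
        return $this->lastUpdated;
    }
}

// The cache file is fresh only if it was written after the last database update.
function cacheIsFresh(Db $db, ?int $cacheFileTime): bool
{
    return $cacheFileTime !== null && $cacheFileTime > $db->lastUpdated();
}

$db = new Db();
$db->write("UPDATE results SET score = 3 WHERE id = 1", 1000);
$freshBefore = cacheIsFresh($db, 1200);   // cache written after the update
$db->write("UPDATE results SET score = 4 WHERE id = 1", 1500);
$freshAfter = cacheIsFresh($db, 1200);    // cache now older than the update
```

At runtime the second argument would typically be filemtime() of the cache file, or null when the file does not exist.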
There are only two hard things in Computer Science: cache invalidation and naming things.
—Phil Karlton
Sorry, doesn't help much, but it is sooooo true.
You have most of the ends covered, but a last_modified field and cron job might help.
There's no way of deleting files from MySQL, Postgres would give you that facility, but MySQL can't.
You can cache your output to a string using PHP's output buffering functions. Google it and you'll find a nice collection of websites explaining how this is done.
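For reference, capturing output with the buffering functions takes only a few lines. captureOutput is a hypothetical wrapper; ob_start() and ob_get_clean() are the standard PHP functions doing the work.

```php
<?php
// Capture everything a callable echoes and return it as a string.
function captureOutput(callable $page): string
{
    ob_start();                       // start buffering instead of sending output
    $page();
    return (string) ob_get_clean();   // stop buffering and fetch the contents
}

$html = captureOutput(function () {
    echo '<table>';
    echo '<tr><td>cached row</td></tr>';
    echo '</table>';
});
// $html can now be written to the cache file with file_put_contents().
```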
I'm wondering, however, how do you know that the data expires after an hour? Or are you assuming the data won't change dramatically enough in 60 minutes to warrant constant page generation?
