I am working on a project with a custom HTML5 front end and a backend I've designed from experience. The backend is composed of a message queue and a cache; currently I've chosen Beanstalk and Memcache because I'm familiar with them, but I'm open to suggestions.
My question, though, comes from how my coder is interfacing with the MySQL DB we are using to store the data. The idea is to pre-cache most or all of the DB so the site runs really fast. It's not a huge DB, so RAM for Memcache shouldn't be an issue. However, my coder is using CodeIgniter with GreenBean. I've never heard of GreenBean before, and when I google it I get almost nothing that isn't related to green beans the food. What little I could find suggested it was an ORM, which fits with what my coder has told me.
The problem is this: with raw PDO my pre-caching scheme is simple. I grab each row from each table and store it in the cache under a key. Every time I need that data I check the cache first, then the DB. If something changes on the backend, I only need to update that row in the DB and the associated key in the cache.
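A minimal sketch of that row-per-key scheme, assuming a local Memcached server and a hypothetical `employees` table (host, credentials and names are placeholders):

```php
<?php
// Read-through cache: one cache key per DB row.
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);
$pdo = new PDO('mysql:host=127.0.0.1;dbname=app', 'user', 'pass');

function getRow(Memcached $cache, PDO $pdo, $table, $id) {
    // $table must come from a trusted whitelist, never from user input
    $key = "$table:$id";
    $row = $cache->get($key);
    if ($row === false) {                      // cache miss: fall back to the DB
        $stmt = $pdo->prepare("SELECT * FROM `$table` WHERE id = ?");
        $stmt->execute([$id]);
        $row = $stmt->fetch(PDO::FETCH_ASSOC);
        $cache->set($key, $row);               // fill the cache for next time
    }
    return $row;
}

// After any UPDATE, refresh exactly one key, e.g.:
// $cache->set("employees:$id", $freshRow);
```

The point of the scheme is exactly that last comment: a write touches one row and one key, nothing else.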
With an ORM, if I store the entire ORM object serialized into the cache, it holds a bunch of related data, and that data could be incorrect if something changes. For example, say you have a DB of employees, each linked to the office and the department they work in. The ORM grabs the office and the department, and we store all of that in the cache. But if the office address changes, the ORM object for every employee in that office is now stale/incorrect.
In that example, just letting the cache expire probably isn't an issue most of the time. But in my application, that data really should be updated immediately. In a simple PDO scheme you flush the cache keys related to the changed data, and every future page call gets the updated data. With an ORM, though, you have lots and lots of cached object instances that might be incorrect and no good way of finding them. So it seems you are left with some form of indexing of your cached objects, and when you change something simple you could be flushing and refilling a big chunk of the cache. The site gets really slow then.
Typically I would just cache a DB result the first time I needed it, but in this case that could be really slow for the users who make the first requests for a particular set of data. Additionally, there are some search features that could require a lot of data from the DB. Hence my desire to pre-cache.
So in this case I'm thinking an ORM would hurt the site's performance. I'm thinking I'm not the first person to have this issue though. Is there an ORM out there that would handle this scenario well? Is there a better backend architecture I'm missing?
Thanks
Related
I have been reading extensively about Doctrine's different caching options as well as Symfony's caching mechanisms:
Symfony Official: https://symfony.com/doc/4.0/components/cache.html
Doctrine Official: https://www.doctrine-project.org/projects/doctrine-orm/en/2.6/reference/caching.html
KnpUniversity (very useful, as always): https://knpuniversity.com/screencast/symfony-fundamentals/caching
Other good resources: https://blog.kaliop.com/blog/2014/10/06/doctrine-symfony2-2/
Nevertheless, despite explaining HOW to use the cache system, these resources don't make clear WHEN to use the cache: under which circumstances caching is very useful, and when not to use it.
For example, in my project I have a large amount of data to pull from my database that I'd like to cache (pulling entities with tons of left joins). Some of those joined tables are updated on a regular basis, every hour, every day or every minute, through a bot called by a cron job (a Symfony command).
I don't know how to make sure all my data is updated properly when I display it to the user with the cache mechanism enabled. If the DB gets updated, do I need to remove the data from the cache manually, calling for example $cacheDriver->delete('my_data'); at update time, and then when retrieving the data check whether it exists in the cache and save it anew if not? Would this be the proper way to do it?
Also, should I use the Doctrine cache or the Symfony 4 cache? Which one should I choose?
I have an example of one of the queries I'd like to cache in another SO thread here: https://stackoverflow.com/a/51800728/1083453
Bottom line would be, how to make that query as efficient as possible?
I’m leaning toward the following:
- Remove the cached entry when updating any data included in my query.
- Cache the result when calling the query for the first time.
- Retrieve it from the cache whenever calling the same query between two updates.
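For what it's worth, the three bullets above can be sketched against the doctrine/cache API; `'my_data'` and the repository call are placeholders:

```php
<?php
// $cacheDriver is any Doctrine\Common\Cache\Cache implementation
// (e.g. MemcachedCache, RedisCache); $repository runs the heavy query.
function getData($cacheDriver, $repository) {
    if ($cacheDriver->contains('my_data')) {
        return $cacheDriver->fetch('my_data');  // served between two updates
    }
    $data = $repository->findAll();             // first call: run the joins
    $cacheDriver->save('my_data', $data);       // cache for subsequent calls
    return $data;
}

// In the cron-driven bot, right after the DB write:
// $cacheDriver->delete('my_data');  // next read repopulates the cache
```

The delete-on-write call is the "remove cache when updating" bullet; everything else falls out of the read path.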
Am I on the right path?
Any help or advice is much appreciated.
Thanks
There are no official rules. Sometimes you may think you are optimizing when at best you're losing time and at worst you're losing performance, because the process of checking whether you have a cache entry and whether it is valid takes longer than the query itself.
Assuming you have done a fine enough job architecting your application (meaning there are no inefficient operations, your queries are well written and don't load unnecessary data, etc.), it's really a case-by-case study.
Try testing the pages of your application with software that simulates hundreds of clients accessing it. You'll soon identify which pages can't handle the load, and the debugger will tell you which queries are slowing them down. Those are the ones that will definitely benefit from caching.
I was wondering the same thing and stumbled upon your question.
Full disclosure: I've never used doctrine result cache :)
One thing I expect is that with Doctrine result cache, you don't need to bother with serializing / deserializing your cached data. This could be very convenient, when you're trying to cache and later retrieve a complex entity.
I've got a heavy-read website backed by a MySQL database. I also have some small "auxiliary" information (it fits in an array of 30-40 elements as of now), hierarchically organized, which gets updated slowly and rarely, 4-5 times per year. It's not a configuration file, since this information is about the subject of the website rather than its functioning, but it behaves like one. Until now I just used a static PHP file containing an array of info, but now I need a way to update it via a backend CMS from my admin panel.
I thought of a simple CMS that lets the admin create/edit/delete entries (a rare, periodic job) and then writes out a static JSON file for the page-building scripts to use instead of pulling this information from the DB.
The question is: given the heavy-read nature of the website, is it better to read a rarely updated JSON file on the server when building pages or just retrieve raw info from the database for every request?
I just used a static PHP
This sounds like a contradiction to me. Either it's static, or it's PHP.
given the heavy-read nature of the website, is it better to read a rarely updated JSON file on the server when building pages or just retrieve raw info from the database for every request?
Cache was invented for a reason :) The same applies to your case: it all depends on how often the data changes versus how often it is read. If the data changes once a day and remains static for 100k downloads during the day, then not caching it, or not serving it from a flat file, would simply be stupid. If the data changes once a day and you have 20 reads per day on average, then returning the data from code on each request would be less stupid; but on the other hand, 19 of those 20 requests could be served from cache anyway, so... If you can, serve from a flat file.
Caching is your best option; Redis and Memcached are common, excellent choices. Between a flat file and a database it's hard to say without knowing the SQL schema you're using (how many columns, the datatype definitions, how many foreign keys and indexes, etc.).
SQL is about relational data; if you have non-relational data, you don't really have a reason to use SQL. Many people are now switching to NoSQL databases to handle this, since modifying SQL schemas after the fact is a huge pain.
I'm working on a site that has a store locator built in.
Since I have developed similar sites in the past, I have experienced some trouble when search peaks hit the database (MySQL) hard.
All of these past location search engines queried the database to get their results.
Now I have taken a different approach, but since I'm not 100% sure, I thought that asking this great community could make me feel more secure about this direction or stick to what I did before.
So for this new search, instead of hitting the database for requests, I'm serving the search with a JSON file that regenerates (querying the database) only when something is updated, created or deleted on the locations list.
My doubt is: can a high load of requests on the JSON file have the same effect as a high load of query requests on the database?
Is serving the search results from a JSON file, to lower the impact on the DB (and server resources), a good approach, or is it a bad idea?
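To make the "regenerate only on change" part concrete, here is a rough sketch, assuming a PDO handle and an invented `locations` table and file path:

```php
<?php
// Rebuild the static JSON whenever a location is created/updated/deleted.
// Writing to a temp file and renaming keeps readers from ever seeing a
// half-written file (rename() is atomic on the same filesystem).
function rebuildLocationsJson(PDO $pdo, $path = '/var/www/static/locations.json') {
    $rows = $pdo->query('SELECT id, name, lat, lng FROM locations')
                ->fetchAll(PDO::FETCH_ASSOC);
    $tmp = $path . '.tmp';
    file_put_contents($tmp, json_encode($rows));
    rename($tmp, $path);   // swap the new file in atomically
}

// Call rebuildLocationsJson($pdo) from every create/update/delete handler.
```

The atomic rename matters: without it, a request arriving mid-write could read truncated JSON.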
Maybe someone out there has had to make the same decision and can share the experience with me, or maybe you just know how things really are and can recommend a certain approach.
Flat files are the poor man's DB and can be even more problematic than a heavily pounded database. For example, reading and writing the file still requires a lock, and it will not scale, since the same file may not be accessible to all app servers.
My suggestion would be any one of the following:
Benchmark your current hardware, identify bottlenecks, scale out or up accordingly.
Implement a caching layer, this will save on costly queries for readonly data.
Consider higher-performance storage solutions such as Aerospike or Redis.
Implement a real full-text search engine such as ElasticSearch or Solr.
Response to comment #1:
You could accomplish the same thing without having to read/write a flat file (which must be accessible by all app servers) by caching the data. Here's a quick-n-dirty rundown of how I would do it:
Zip + 10 miles:
Query the database, pull the store data, json_encode it, and store it in the cache under a key construct like 92562_10. Now when other users enter 92562 + 10 they will pull the data from the cache instead of the database (or flat file).
City, State + 50 miles:
Same as above, except the key construct might look like murrieta_ca_50.
But with the caching layer you get better performance, and the cache server is available to all your app servers, which is much easier than having to install/configure NFS to share the file on the network.
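Put together, the zip + radius flow above might look like this sketch (the radius query itself is a placeholder; substitute your own distance SQL):

```php
<?php
// Cache keyed on the search inputs, e.g. "92562_10" or "murrieta_ca_50".
function searchStores(Memcached $cache, PDO $pdo, $zip, $miles) {
    $key = $zip . '_' . $miles;                 // key construct from above
    $stores = $cache->get($key);
    if ($stores !== false) {
        return $stores;                         // cache hit: skip the DB entirely
    }
    // Placeholder: a stored procedure or haversine query doing the radius math
    $stmt = $pdo->prepare('CALL stores_within_radius(?, ?)');
    $stmt->execute([$zip, $miles]);
    $stores = $stmt->fetchAll(PDO::FETCH_ASSOC);
    $cache->set($key, $stores);                 // later identical searches hit cache
    return $stores;
}
```

Every distinct (location, radius) pair costs one DB query ever, and repeated popular searches never touch MySQL.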
I am considering enabling Memcache support for my large-scale REST service. However I have some questions regarding best approaches for these key-value stores.
The setup:
A database wrapper with functions for select, update, etc.
A REST framework containing all the API functions (getUser, createUser, etc.)
In my head, the ideal approach would be to integrate Memcache into the database wrapper so that, for example, every SQL query gets md5-hashed and saved in the cache (this is, by the way, what most online resources suggest). However, there is an obvious problem with this approach: if a search query has been cached, and one of the users from the search result is updated after the result was cached, this won't be reflected in the next request (because it now comes from the cache).
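For reference, a minimal version of that hashed-query wrapper might look like the sketch below; the TTL at least bounds how stale a hit can be, but the staleness problem described above remains:

```php
<?php
// Generic query cache inside the DB wrapper: the key is derived purely
// from the SQL and its parameters, so it cannot see later row updates.
function cachedSelect(Memcached $cache, PDO $pdo, $sql, array $params = [], $ttl = 60) {
    $key = md5($sql . serialize($params));  // same query + params => same key
    $rows = $cache->get($key);
    if ($rows === false) {                  // miss: run the real query
        $stmt = $pdo->prepare($sql);
        $stmt->execute($params);
        $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
        $cache->set($key, $rows, $ttl);     // expire after $ttl seconds
    }
    return $rows;
}
```

Note that an update to any user in a cached result set leaves this key untouched until it expires, which is exactly the trade-off discussed next.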
As I see it, I have several ways of handling this:
Implement Memcache in the REST framework for each function (getUser, createUser, etc.) and thereby explicitly handle updating the cache when users get updated. This could end up in redundant code.
Let the cached values expire very quickly and live with the fact that some requests will show old cached values.
Do a more advanced implementation of Memcache in the database wrapper so that I can identify which parts (e.g. users) of e.g. a search request need updating.
Could you guide me toward one of these approaches, or a completely different one?
Thanks in advance.
Enabling cache for a web application is not something to take lightly.
Maybe you have done this already, but... I recommend you first come up with a goal based on business needs or a forecast (e.g. must accept 1000 requests per second), then properly stress-test your system to have numbers before you start changing anything, and then identify your bottleneck.
http://en.wikipedia.org/wiki/Performance_tuning
I usually use profiling tools such as XHProf (by Facebook).
https://github.com/facebook/xhprof
Caching all your data to mirror your database might not be the best approach.
Find out how much memory you can allocate for your cache. If your architecture only allows you to allocate 100MB for your Memcache, then that will affect your decisions about what you cache and for how long.
The best cache is one that lasts forever. But we all know that data changes. You can start by caching data that is requested often and requires the most resources to fetch.
Always make sure you are not working hard on something that will give you only a small improvement.
Without understanding your architecture in depth, it would be hazardous for anyone to recommend a caching strategy that best fit your needs.
Maybe you should cache the resulting output of your web services instead? Using a reverse proxy, for example (what @Darrel is talking about), or using output buffering...
http://en.wikipedia.org/wiki/Reverse_proxy
http://php.net/manual/en/book.outcontrol.php
Optimize your database queries before you think about caching. Make sure you use a PHP opcode cache (like APC) and all those things that are standard practice.
http://phplens.com/lens/php-book/optimizing-debugging-php.php
http://blog.digitalstruct.com/2008/01/31/performance-tuning-overview/
If you want to cache data and prevent stale/old data from being served, the trick is to identify your data (by primary key, maybe?) and, when the data is updated or deleted, delete or update the cache entry for that identifier.
<?php
// After inserting the user into the DB, also put it in the cache,
// keyed on the primary key
$memcache->set($userId, $userData);

// After updating or deleting the user, update or delete the cached copy
$memcache->delete($userId);
A lot of sites will show stale data. When I am on Stack Overflow and my reputation increases, and I then go into the Stack Overflow chat, the reputation shown is my old reputation. When I reached a reputation of 20 (the reputation required to chat) I still could not chat for another 5 minutes, because the chat system had my old reputation data and did not yet know my reputation had increased enough to allow me to chat. Some data can be stale, while other types of data should never be stale. Consider that when caching data.
Conclusion
All of your approaches can be valid, depending on the factors I talk about above. In fact, you can use a combination of them for the different types of data you want to cache, based on how long it is acceptable to show old data for each. Maybe the categories or the list of countries (since they do not change often) can be cached for a long time, while reputation (or whatever data changes all the time for all users) should be cached only for a short period.
I'm currently developing the foundation of an application and looking for ways to optimize performance. My setup is based on the CakePHP framework, but I believe my question is relevant to any technology stack, as it relates to data caching.
Let's take a typical post-author relation, which is represented by 2 tables in my db. When I query the database for a specific blog post, at the same time the built-in ORM functionality in CakePHP also fetches the author of the post, comments on the post, etc. All of this is returned as one big-ass nested array, which I store in cache using a unique identifier for the concerned blog post.
When updating the blog post, it is child play to destroy the cache for the post, and have it regenerated with the next request.
But what happens when it's not the main entity (in this case the blog post) that gets updated, but some of the related data? For example, a comment could be deleted, or the author could update his avatar. Are there any approaches (patterns) I could consider for tracking updates to related data and applying the updates to my cache accordingly?
I'm curious to hear whether you've also run into similar challenges, and how you have managed to potentially overcome the hurdle. Feel free to provide an abstract perspective, if you're using another stack on your end. Your views are anyhow much appreciated, many thanks!
It is rather simple: cache entries can be
added
destroyed
You should take care of destroying cache entries when related data changes: in the application layer, in addition to updating the data, you destroy certain types of cached entries whenever you update certain tables. You keep track of the dependencies by hard-coding them.
If you'd like to be smart about it, you could have your cache entries state their dependencies, and also cache the last update times of your DB tables.
Then you could
fetch the cached data and examine its dependencies,
get the update times for the relevant DB tables, and
if the record is stale (the update time of a table that your big-ass cache entry depends on is later than the time of the cache entry), drop it and get fresh data from the database.
You could even integrate the above into your persistence layer.
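A rough sketch of that staleness check, assuming the application bumps an `updated:<table>` key in the cache on every write (all key names here are invented):

```php
<?php
// Each cache entry records when it was built plus the tables it depends on.
// If any dependency table was written after the entry was built, rebuild.
function fetchFresh(Memcached $cache, $key, callable $load, array $tables) {
    $entry = $cache->get($key);               // ['time' => ..., 'data' => ...]
    if ($entry !== false) {
        $stale = false;
        foreach ($tables as $table) {
            // 'updated:<table>' is set to time() by every write to that table
            if ((int)$cache->get("updated:$table") > $entry['time']) {
                $stale = true;                // a dependency changed: drop entry
                break;
            }
        }
        if (!$stale) {
            return $entry['data'];
        }
    }
    $data = $load();                          // fresh read from the database
    $cache->set($key, ['time' => time(), 'data' => $data]);
    return $data;
}

// Usage: fetchFresh($cache, "post:$id", $loadPostFn, ['posts', 'comments', 'authors']);
```

Integrated into the persistence layer, every write then only has to touch its table's `updated:` timestamp, not enumerate every dependent cache entry.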
EDIT:
Of course, the above is for when you want a consistent cache. Sometimes, and for some data, you can relax the consistency requirements, and there are scenarios where a simple TTL will be good enough. For a trivial example, with a TTL of 1 second you should mostly stay out of trouble with users, and it can still help with data processing; with longer times you might still be OK. Say you are caching the list of country ISO codes: your application might be perfectly fine if you cache that for 86400 seconds.
Furthermore, you could also track the times of the information presented to the user. For example:
say the user has seen data A from the cache, and we know this data was created/modified at time t1
the user makes changes to data A (turning it into data B) and commits the change
the application layer can then examine whether data A is still what is in the DB (i.e. whether the cached data on which the user based their changes was indeed fresh)
if it was not fresh, there is a conflict and the user should confirm the changes
This has the cost of an extra read of data A from the DB, but it occurs only on writes.
Also, the conflict can occur not only because of the cache, but also because of multiple users trying to change the data (i.e. it is related to locking strategies).
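For completeness, that conflict check can be collapsed into a single compare-and-set UPDATE; the `posts` table, `body` and `updated_at` columns below are assumptions:

```php
<?php
// $seenUpdatedAt is the timestamp the user's edit form carried, i.e. the
// modification time of the data the user actually saw (t1 above).
function saveChanges(PDO $pdo, $id, $newBody, $seenUpdatedAt) {
    $upd = $pdo->prepare(
        'UPDATE posts SET body = ?, updated_at = NOW()
         WHERE id = ? AND updated_at = ?'     // only if unchanged since the read
    );
    $upd->execute([$newBody, $id, $seenUpdatedAt]);
    // 0 rows means the row changed under the user: report a conflict.
    // (By default PDO/MySQL counts *changed* rows; set
    // PDO::MYSQL_ATTR_FOUND_ROWS if you want *matched* rows instead.)
    return $upd->rowCount() === 1;
}
```

Because the freshness test lives in the WHERE clause, there is no window between reading and writing for another user to sneak in, which also addresses the multi-user case just mentioned.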
One approach for memcached is to use tags (http://code.google.com/p/memcached-tag/). For example, take your post's "big-ass nested array": say it includes the author's information and the post itself, and it is shown on the front page and in some box in the sidebar. So it gets the tags frontpage, {author-id}, sidebar and {post-id}. Now if someone changes the author's information, you flush every cache entry with the tag {author-id}. But that's only one solution, and only for cache backends that support tags, which for example APC does not (AFAIK). Hope that gives you an example.
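Since stock memcached has no tag support, a common workaround is to emulate tags with per-tag version counters baked into the key; a hedged sketch (key names are illustrative):

```php
<?php
// Each tag has a version number stored under "tagv:<tag>". Cache keys embed
// the current versions of all their tags, so bumping a tag's version makes
// every key built with the old version unreachable (it simply expires unused).
function tagVersion(Memcached $cache, $tag) {
    $v = $cache->get("tagv:$tag");
    if ($v === false) {
        $v = 1;
        $cache->set("tagv:$tag", $v);
    }
    return $v;
}

function taggedKey(Memcached $cache, $baseKey, array $tags) {
    $parts = [$baseKey];
    foreach ($tags as $tag) {
        $parts[] = $tag . ':' . tagVersion($cache, $tag);
    }
    return implode('|', $parts);   // e.g. "post:7|frontpage:1|author-42:3"
}

// "Flushing" the {author-id} tag is just one counter bump:
// $cache->increment('tagv:author-42');
```

The trade-off: nothing is actively deleted, so stale entries linger until memcached's LRU evicts them, but invalidation itself is O(1) per tag.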