Assume that I have a big (MySQL) table (>10k rows) mapping id -> string. I can put them all in an array and cache this array. But the question is: how do I cache it efficiently?
a) Cache it as one big item. So I will execute
$redis->set("array", $array);
Quite short and easy. But for every entry I need, I have to fetch the whole thing. Absolutely inefficient.
b) Cache every entry itself:
foreach( $array as $id => $str )
$redis->set( "array:$id", $str );
This way, I will have >10k entries in Redis. That doesn't feel good. If I have 10 of these tables, I will have 100k entries...
So what's your proposal? How to cache a big array?
Caching the big array is only helpful if you plan to always retrieve it as a whole. On top of that, cache invalidation becomes a very "heavy" operation: any time something changes, you have to invalidate the whole array and reread it from the DB.
10k entries in Redis is not much at all. You can have millions of entries without a problem.
I would go with version b) and cache every entry individually. It is easier to maintain, gives you simpler application code, and keeps the memory footprint on the application side small, which becomes more and more important when you want to scale your application.
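As a rough sketch of option b) with a cache-aside pattern (assuming the phpredis extension, a PDO connection and a hypothetical my_table, none of which come from the original question):

// Cache-aside lookup for a single entry: try Redis first, fall back to MySQL.
function getString(Redis $redis, PDO $pdo, int $id): ?string
{
    $key = "array:$id";

    $cached = $redis->get($key);
    if ($cached !== false) {
        return $cached;                                  // cache hit
    }

    // Cache miss: load the single row from the database...
    $stmt = $pdo->prepare("SELECT str FROM my_table WHERE id = ?");
    $stmt->execute([$id]);
    $str = $stmt->fetchColumn();
    if ($str === false) {
        return null;                                     // id does not exist
    }

    // ...and populate the cache with a TTL so stale entries eventually expire.
    $redis->setex($key, 3600, $str);
    return $str;
}

Invalidation then stays cheap: when one row changes, you delete or overwrite just that one key.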
The first question is: why do you need to cache that array?
If you always need the whole array, then:
$redis->set("array", $array);
If you only need some specific indexes (2nd solution), then why are you trying to cache the whole thing instead of querying the database each time for the id you need?
It is always more efficient to fetch only the data you need.
Remember that a cache's usefulness is estimated using the ratio between reads (items effectively read from the cache) and misses (items read from the data source and then added to the cache).
If you are caching the whole table (10k misses) but querying only a few elements by id (2nd solution), then your ratio is near zero.
If you need the whole table each time, then cache it using the first solution (1 miss), and your ratio is more likely to be > 1.
Also, remember that Redis is a separate server. Each request to Redis is a request made to that server (on localhost or not).
So basically the same rule applies to Redis as to MySQL: one big request will perform faster than many little requests.
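If you cache per entry but often need a handful of ids in one go, MGET keeps that to a single round trip as well. A minimal sketch, assuming phpredis and the array:$id key scheme from the question:

// Fetch several cached entries in a single round trip with MGET.
function getStrings(Redis $redis, array $ids): array
{
    $keys = [];
    foreach ($ids as $id) {
        $keys[] = "array:$id";
    }

    $values = $redis->mget($keys);       // one request, N results (false per missing key)

    $result = [];
    foreach ($ids as $i => $id) {
        if ($values[$i] !== false) {
            $result[$id] = $values[$i];
        }
        // ids missing here are cache misses: load them with one
        // "WHERE id IN (...)" query and SET them back into Redis.
    }
    return $result;
}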
In PHP, is array_slice() good enough for processing a large data array that cannot be paginated in SQL, since it isn't stored in a database table but calculated from other DB tables?
Anyway, I have an array of around 50k entries, which might grow later. On the first page load it fetches all 50k records, then slices them for AJAX-based pagination.
Will this cause server load in the future, since all records are fetched on page load?
First of all, it's a bad idea to create an array containing 50k entries, especially since that number can grow. It may eat all your memory under high traffic.
Also, where do you store the sliced parts of the array for use in the AJAX requests?
I think that, if you cannot set a LIMIT in the query, you can create an additional table to store your data (populated by a cron job, for example) and show users data from it using LIMIT for pagination. Alternatively, you can create a caching layer (or use an existing caching system: file cache, PHP Memcache, ...) and write an algorithm for updating the cache (it depends on your program logic).
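A rough sketch of the caching-layer idea, assuming the Memcached extension and a hypothetical buildDataArray() that does the expensive cross-table calculation (mind Memcached's default 1 MB item size limit for very large arrays):

// Build the expensive array once, cache it, and slice per page for the AJAX calls.
function getPage(Memcached $mc, int $page, int $perPage = 50): array
{
    $data = $mc->get('report_data');
    if ($data === false) {
        $data = buildDataArray();                  // the expensive cross-table calculation
        $mc->set('report_data', $data, 600);       // refresh at most every 10 minutes
    }
    return array_slice($data, ($page - 1) * $perPage, $perPage);
}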
Currently I am using shared hosting for my site, but one of the tables has nearly 1,100,000 rows, so the page takes a long time to load. We want to implement database caching techniques like APC or Memcached for our site, but on shared hosting those facilities are not available; we only have eAccelerator, and eAccelerator does not cache DB calls, if I'm not wrong. Considering all these points we want to move to a VPS. In that case, which caching technique should we use, APC or Memcached, to decrease the page load time? Please advise on the VPS and on which of the two caching techniques is better.
We have a similar website and we use APC.
APC will cache the opcodes as well as the HTML that is generated. This helps avoid unnecessary hits to the page.
You should also enable the MySQL query cache to cache the results of your queries.
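As a rough illustration of using APC's user cache for a query result (function names are from the classic APC extension; APCu uses apcu_fetch()/apcu_store(), and the query and key here are made up for the example):

// Cache a heavy query's result in APC so repeated page loads skip the database.
function getHeavyReport(PDO $pdo): array
{
    $rows = apc_fetch('heavy_report', $hit);
    if ($hit) {
        return $rows;                                          // served from shared memory
    }

    $rows = $pdo->query('SELECT id, name FROM big_table LIMIT 1000')
                ->fetchAll(PDO::FETCH_ASSOC);

    apc_store('heavy_report', $rows, 300);                     // cache for 5 minutes
    return $rows;
}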
I had a task where I needed to fetch rows from a database table that had more than 100,000 records, displayed as a scrollable page. So what I did was fetch the first 50 records and cache the next 50 on the first call. On scroll-down events an AJAX request checks whether the data is available in the cache; if not, I fetch it from the database and also cache the next 50. It worked pretty well and solved the inconvenient load time.
If you have a similar scenario, you might benefit from this approach.
PS: I used Memcache.
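A hedged sketch of that fetch-and-prefetch pattern, assuming the Memcached extension and a hypothetical fetchChunk() that runs the LIMIT/OFFSET query:

// Serve one chunk of rows and pre-warm the cache with the next chunk,
// so the following scroll event is answered from memory.
function getChunk(Memcached $mc, PDO $pdo, int $offset, int $size = 50): array
{
    $key  = "rows:$offset:$size";
    $rows = $mc->get($key);
    if ($rows === false) {
        $rows = fetchChunk($pdo, $offset, $size);        // SELECT ... LIMIT $size OFFSET $offset
        $mc->set($key, $rows, 300);
    }

    // Prefetch the next chunk while we are here.
    $nextKey = "rows:" . ($offset + $size) . ":$size";
    if ($mc->get($nextKey) === false) {
        $mc->set($nextKey, fetchChunk($pdo, $offset + $size, $size), 300);
    }

    return $rows;
}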
From your comment I take it you're doing a LIKE %..% query and want to paginate the result. First of all, investigate whether FULLTEXT indices are an option for you, as they should perform better. If that's not an option, you can add a simple cache like so:
Treat each unique search term as an id, i.e. if in your URL you have ..?search=foobar, then "foobar" is the id of the result set. Keep that in all your links, e.g. ..?search=foobar&page=2.
If the result set does not yet exist (see below), create it:
Query the database with your slow query.
Get all the results into an array. Don't overdo it, you don't want to be storing hundreds of megabytes.
Create a unique filename per query, e.g. sha1($query), or maybe sha1(strtolower($query)).
serialize the data and store it in the file.
Get the data from the file, unserialize it, display the portion of the array corresponding to the requested page.
Occasionally, delete old cached results. You can do that with something like if (rand(0, 100) == 1) .., which will run the cleanup job every 100 queries on average. Strike a balance between server load and data freshness. Cache invalidation is a topic whole books can be written about, BTW.
That's a simple poor man's cache implementation. It's not great, but if you have absolutely nothing else to work with, it's better than running slow queries over and over.
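A minimal sketch of that poor man's cache, assuming a hypothetical runSlowQuery() for the LIKE query and a page size of 20:

// File-based cache for a slow LIKE %...% search, paginated from the cached array.
function searchPage(PDO $pdo, string $search, int $page, int $perPage = 20): array
{
    $file = sys_get_temp_dir() . '/search_' . sha1(strtolower($search)) . '.cache';

    if (is_file($file) && filemtime($file) > time() - 3600) {
        $results = unserialize(file_get_contents($file));
    } else {
        $results = runSlowQuery($pdo, $search);          // the slow LIKE %...% query
        file_put_contents($file, serialize($results));
    }

    // Occasionally clean out stale cache files (roughly once per 100 requests).
    if (rand(1, 100) === 1) {
        foreach (glob(sys_get_temp_dir() . '/search_*.cache') as $old) {
            if (filemtime($old) < time() - 3600) {
                unlink($old);
            }
        }
    }

    return array_slice($results, ($page - 1) * $perPage, $perPage);
}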
APC is the Alternative PHP Cache and works only with PHP, whereas Memcached works independently with any language.
I have to do a lookup on around 160K records, where the data is in the form of an id plus a range, and we need to get the rows for which a given value lies between range1 and range2. So far it's a BETWEEN query that we use.
I started using Memcache yesterday: it checks whether the row for the given value is in Memcache, and if not, it fetches the row from the DB and puts it into Memcache.
I am not sure about the order of a lookup in Memcache itself: is it O(1) or O(n)? I know a DB search can at best get me O(log n). I am thinking of keeping another in-memory layer in between (I can't think of the right structure right now, but I certainly don't want to use sessions to keep the table in memory) and getting the data from this in-memory object first, and if it's not found there, then going to the database.
PS: my DB table hardly goes through any changes.
So the order I am thinking of is:
Look up in Memcache.
If not found, look up in the in-memory layer (do a binary search on the array) and add the result to Memcache.
If still not found, look it up in the DB, add it to the in-memory layer, and add it to Memcache.
Am I thinking in the right direction?
Memcached is very fast and lookups are O(1), not O(n), so you are best off just using Memcached with a backend database. Consider the scenario you proposed above: having a secondary cache that is slower than Memcached in the middle will only increase request latency, since you will now potentially have to ask three places for some data instead of just two. Also, since Memcached is going to be faster than the in-memory solution, you're better off just giving all of your extra memory to Memcached. Another thing to consider is the management overhead of adding an extra tier to your application.
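A minimal cache-aside sketch for that kind of lookup, keyed by the looked-up value (assuming the Memcached extension and a hypothetical ranges table, not the asker's actual schema):

// Check Memcached first; on a miss, run the BETWEEN query and cache the matching row.
function lookupByValue(Memcached $mc, PDO $pdo, int $value): ?array
{
    $key = "range_lookup:$value";
    $row = $mc->get($key);
    if (is_array($row)) {
        return $row;                                     // cache hit
    }

    $stmt = $pdo->prepare(
        'SELECT id, range1, range2 FROM ranges WHERE ? BETWEEN range1 AND range2 LIMIT 1'
    );
    $stmt->execute([$value]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    if ($row === false) {
        return null;                                     // no matching range
    }

    // The table rarely changes, so a long TTL is fine here.
    $mc->set($key, $row, 86400);
    return $row;
}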
Basically, one of the metrics that I would like to track is the number of impressions that certain objects receive on our marketing platform.
If you imagine that we display lots of objects, we would like to track each time an object is served up.
Every object is returned to the client through a single gateway/interface. So imagine that a request comes in for a page with some search criteria, and the search request is proxied to our Solr index.
We then get 10 results back.
Each of these 10 results should be regarded as an impression.
I'm struggling to find an incredibly fast and accurate implementation.
Any suggestions on how you might do this? You can throw in any number of technologies. We currently use Gearman, PHP, Ruby, Solr, Redis, MySQL, APC and Memcache.
Ultimately all impressions should eventually be persisted to MySQL, which I could do every hour. But I'm not sure how to store the impressions in memory quickly without affecting the load time of the actual search request.
Ideas (I just added options 4 and 5):
Once the results are returned to the client, the client then requests a base64-encoded URI on our platform which contains the IDs of all the objects it has been served. This is then passed to Gearman, which saves the counts to Redis. Once an hour, Redis is flushed and the count is incremented for each object in MySQL.
After the results have been returned from Solr, loop over them and save directly to Redis (I haven't benchmarked this for speed). Repeat the flushing to MySQL every hour.
Once the items are returned from Solr, send all the IDs in a single job to Gearman, which will then submit them to Redis.
(new idea) Since at most around 20 items will be returned, I could set an X-Application-Objects header with a base64 encoding of the returned IDs. These IDs (in the header) could then be stripped out by nginx, and using a custom Lua nginx module I could write the IDs directly to Redis from nginx. This might be overkill, though. The benefit is that I can tell nginx to return the response object immediately while it's writing to Redis.
(new idea) Use fastcgi_finish_request() to flush the response back to nginx, and then insert the results into Redis.
Any other suggestions?
Edit to answer the question:
The reliability of this data is not essential, as long as it is a best guess. I wouldn't want to see a swing of, say, 30% dropped impressions, but I would allow a tolerance of +/- 10% accuracy.
I see your two best options as:
Use the INCR command in Redis to increment counters as you pull the IDs. Use the ID as the key and increment it in Redis. Redis can easily handle hundreds of thousands of increments per second, so that should be fast enough to do without any noticeable client impact. You could even pipeline each request if the PHP language binding supports it; I think it does.
Use Redis as a plain cache. In this option you would simply use a Redis list and RPUSH a string containing the IDs separated by, e.g., a comma. You might use the hour of the day as the key. Then you can have a separate process pull it out by grabbing the previous hour and massaging it however you want into MySQL. If you put an expiry on the keys, you can have them cleaned out after a period of time, or just delete the keys in the post-processing process.
You can also use a read slave to do the exporting to MySQL if you have very high Redis traffic or just want to offload it, and as a bonus you get a backup of the data. If you do that, you can set the master Redis instance not to flush to disk, increasing write performance.
For some additional options regarding a more extended use of Redis' features for this sort of tracking, see this answer. You could also skip the MySQL portion and pull the data from Redis, keeping the overall system simpler.
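A rough sketch of the first option, assuming phpredis with pipelining, a separate hourly job that drains the counters, and a hypothetical objects table with an impressions column:

// Record impressions: one INCR per object ID, pipelined so the whole
// batch costs a single round trip to Redis.
function recordImpressions(Redis $redis, array $objectIds): void
{
    $pipe = $redis->multi(Redis::PIPELINE);
    foreach ($objectIds as $id) {
        $pipe->incr("impressions:$id");
    }
    $pipe->exec();
}

// Hourly job: drain the counters into MySQL and reset them.
// KEYS is fine for a modest number of counters; prefer SCAN for large keyspaces.
function flushImpressions(Redis $redis, PDO $pdo): void
{
    $stmt = $pdo->prepare('UPDATE objects SET impressions = impressions + ? WHERE id = ?');
    foreach ($redis->keys('impressions:*') as $key) {
        $count = (int) $redis->getSet($key, 0);          // read and reset in one step
        if ($count > 0) {
            $id = (int) substr($key, strlen('impressions:'));
            $stmt->execute([$count, $id]);
        }
    }
}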
I would do something like #2, and hand the data off to the fastest queue you can to update Redis counters. I'm not that familiar with Gearman, but I bet it's slow for this. If your Redis client supports asynchronous writes, I'd use that, or put this in a queue on a separate thread. You don't want to slow down your response waiting to update the counters.
I am writing a PHP function that will need to loop over an array of pointers and, for each item, pull in that data (be it from a MySQL database or a flat file). Does anyone have any ideas for optimizing this, as there could potentially be thousands and thousands of iterations?
My first idea was to have a static array of cached data that I work on; any modifications would just change that cached array, and at the end I could flush it to disk. However, in a loop of over 1000 items, this would be useless if I only keep around 30 in the array. Each item isn't too big, but 1000+ of them in memory is way too much, hence the need for disk storage.
The data is just gzipped serialized objects. Currently I am using a database to store the data, but I am thinking maybe flat files would be quicker (I don't care about concurrency issues and I don't need to parse it, just unzip and unserialize). I already have a custom iterator that pulls in 5 items at a time (to cut down on DB connections) and stores them in this cache. But again, using a cache of 30 when I need to iterate over thousands is fairly useless.
Basically I just need a way to iterate over these many items quickly.
Well, you haven't given a whole lot to go on. You don't describe your data, and you don't describe what your data is doing or when you need one object as opposed to another, and how those objects get released temporarily, and under what circumstances you need it back, and...
So anything anybody says here is going to be a complete shot in the dark.
...so along those lines, here's a shot in the dark.
If you are only comfortable holding x items in memory at any one time, set aside space for x items. Then, every time you access an object, make a note of the time (this might not mean clock time so much as the order in which you access them). Keep each item in a list (it may not be implemented as a list, but rather as a heap-like structure) so that the most recently used items appear sooner in the list. When you need to bring a new one into memory, you replace the one that was used the longest time ago and move the new item to the front of the list. You may need to keep another index of the items so that you know exactly where they are in the list when you need them: you look up where the item is located, link its parent and child pointers as appropriate, then move it to the front of the list. There are probably other ways to optimize lookup time, too.
This is called the LRU algorithm. It's a page replacement scheme for virtual memory. What it does is delay your bottleneck (the disk I/O) until it's practically impossible to avoid. It is worth noting that this algorithm does not guarantee optimal replacement, but it performs pretty well nonetheless.
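A minimal sketch of such an LRU cache in PHP (a made-up class, leaning on the fact that PHP arrays preserve insertion order; array_key_first() needs PHP 7.3+):

// A tiny LRU cache: the array's insertion order doubles as the recency list,
// so the least recently used key sits at the front and is evicted first.
class LruCache
{
    private $items = [];
    private $capacity;
    private $load;

    public function __construct(int $capacity, callable $load)
    {
        $this->capacity = $capacity;
        $this->load     = $load;          // e.g. function ($key) { return loadFromDisk($key); }
    }

    public function get($key)
    {
        if (array_key_exists($key, $this->items)) {
            // Cache hit: move the item to the "most recently used" end.
            $value = $this->items[$key];
            unset($this->items[$key]);
            return $this->items[$key] = $value;
        }

        // Cache miss: evict the least recently used entry if the cache is full,
        // then load the item from disk/DB via the injected callback.
        if (count($this->items) >= $this->capacity) {
            unset($this->items[array_key_first($this->items)]);
        }
        return $this->items[$key] = call_user_func($this->load, $key);
    }
}

With a capacity of 30 this matches the roughly 30 items the asker is comfortable keeping in memory.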
Beyond that, I would recommend parallelizing your code to a large degree (if possible) so that when one item needs to hit the hard disk to load or to dump, you can keep that processor busy doing real work.
< edit >
Based on your comment, you are working on a neural network. In the case of your initial feeding of the data (before the correction stage), or when you are actively using it to classify, I don't see how the algorithm is a bad idea, unless there is just no possible way to fit the most commonly used nodes in memory.
In the correction stage (perhaps back-prop?), it should be apparent what nodes you MUST keep in memory... because you've already visited them!
If your network is large, you aren't going to get away with no disk I/O. The trick is to find a way to minimize it.
< /edit >
Clearly, keeping it in memory is faster than anything else. How big is each item? Even if they are 1 KB each, ten thousand of them is only 10 MB.
You can always break out of the loop once you have the data you need, so that it does not keep looping. If you are storing flat files, your server's HDD will suffer from holding thousands or millions of files of different sizes. But if you are talking about storing the whole actual file in a DB, it is much better to store the file in a folder and just save its path in the DB. Also try putting the pulled items into an XML file, so that it is much easier to access and can contain many attributes for the details of the pulled item, e.g. name, date uploaded, etc.
You could use Memcached to store objects the first time they are read, then use the cached version in subsequent calls. Memcached uses RAM to store objects, so as long as you have enough memory you will get a great acceleration. There is a PHP API for Memcached.
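A hedged sketch of that read-through approach for the gzipped serialized objects from the question, assuming the Memcached extension and a hypothetical loadBlob() that does the DB or flat-file read:

// Unpack the gzipped, serialized object once, then serve later
// iterations of the loop straight from Memcached.
function getItem(Memcached $mc, $id)
{
    $obj = $mc->get("item:$id");
    if ($obj !== false) {
        return $obj;                                     // cache hit
    }

    $blob = loadBlob($id);                               // gzipped serialized data from DB or flat file
    $obj  = unserialize(gzuncompress($blob));

    $mc->set("item:$id", $obj, 3600);                    // Memcached serializes the object for us
    return $obj;
}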