On our site, we have a variety of data collections. We pull a lot of records by ID, and we also have some records that are paginated content.
We are caching our queries using memcached - we essentially serialize the query array and then md5() it. This provides us a key to use for this specific query. However, this query can return multiple records - and if one of those records is changed, we want to invalidate the cached query that resulted in that record being returned.
What would be the best way to accomplish this? I toyed with the idea of having two instances of memcached, with one acting as an index server of sorts.
Thanks!
One of the ways you can address this (if I read you correctly) is to store the collection of IDs under the MD5 key, and then store the rows themselves separately. So, for example, if you query for a collection and it returns 10 results, you save only the IDs under the MD5 of the query key. For each ID, you then query the database for the full details, which are stored individually in memcached. When an individual item is updated, it's now easy to invalidate just that item's memcache entry. And those record updates won't affect the collection entry, since it only stores the IDs. Hope this helps.
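A rough sketch of that pattern, assuming the Memcached extension and an open PDO connection; the items table, key names and TTLs here are illustrative, not taken from your setup:

// Cache the ID list under the query hash, and each record under its own key.
function get_collection(Memcached $memcached, PDO $pdo, array $query)
{
    $listKey = 'query:' . md5(serialize($query));

    // 1. Try to get the list of IDs for this query.
    $ids = $memcached->get($listKey);
    if ($ids === false) {
        $stmt = $pdo->prepare('SELECT id FROM items WHERE category = ?');
        $stmt->execute([$query['category']]);
        $ids = $stmt->fetchAll(PDO::FETCH_COLUMN);
        $memcached->set($listKey, $ids, 300);
    }

    // 2. Fetch each record individually, from cache when possible.
    $records = [];
    foreach ($ids as $id) {
        $record = $memcached->get('item:' . $id);
        if ($record === false) {
            $stmt = $pdo->prepare('SELECT * FROM items WHERE id = ?');
            $stmt->execute([$id]);
            $record = $stmt->fetch(PDO::FETCH_ASSOC);
            $memcached->set('item:' . $id, $record, 300);
        }
        $records[] = $record;
    }
    return $records;
}

// When a single record changes, only its own key has to go:
// $memcached->delete('item:' . $id);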
I have a program that creates logs, and these logs are used to calculate balances, trends, etc. for each individual client. Currently, I store everything in separate MySQL tables. I link all the logs to a specific client by joining the two tables. When I access a client, it pulls all the logs from the log_table and generates a report. The report varies depending on what filters are in place, mostly date and category specific.
My concern is the performance of my program as we accumulate more logs and clients. My intuition tells me to store the log information in the user_table in the form of a serialized array so only one query is used for the entire session. I can then take that log array and filter it using PHP, whereas before it was filtered in a MySQL query (using multiple methods, such as BETWEEN for dates and other comparisons).
My question is, do you think performance would be improved if I used serialized arrays to store the logs as opposed to using a MYSQL table to store each individual log? We are estimating about 500-1000 logs per client, with around 50000 clients (and growing).
It sounds like you don't understand what makes databases powerful. It's not about "storing data", it's about "storing data in a way that can be indexed, optimized, and filtered". You don't store serialized arrays, because the database can't do anything with that. All it sees is a single string without any structure that it can meaningfully work with. Using it that way voids the entire reason to even use a database.
Instead, figure out the schema for your array data and insert it properly, with one field per dedicated table column, so that you can actually use the database as a database, allowing it to optimize storage, retrieval, and relational operations (selecting, joining and filtering).
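As a minimal sketch of what that might look like for the logs (the table, column and index names here are illustrative assumptions, not your actual schema, and $pdo is assumed to be an open PDO connection):

// One row per log entry, with indexes on the columns the reports filter by.
$pdo->exec('
    CREATE TABLE IF NOT EXISTS client_logs (
        id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        client_id  INT UNSIGNED  NOT NULL,
        category   VARCHAR(50)   NOT NULL,
        amount     DECIMAL(10,2) NOT NULL,
        logged_at  DATETIME      NOT NULL,
        INDEX idx_client_date (client_id, logged_at),
        INDEX idx_client_category (client_id, category)
    )
');

// The filtering the report needs (dates, categories) now happens in the
// database, on indexed columns, instead of in PHP:
$stmt = $pdo->prepare('
    SELECT category, SUM(amount) AS total
    FROM client_logs
    WHERE client_id = ?
      AND logged_at BETWEEN ? AND ?
    GROUP BY category
');
$stmt->execute([$clientId, '2013-01-01', '2013-12-31']);
$report = $stmt->fetchAll(PDO::FETCH_ASSOC);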
Are serialized arrays in a DB faster than native PHP? No, of course not: you've forced the database to act as a flat file, with the extra DBMS overhead on top.
Is using the database properly faster than native PHP? Usually, yes, by a lot.
Plus, and this part is important, it means that your database can live "anywhere", including on a faster machine next to your web server, so that it can return results in 0.1s rather than PHP pegging 100% CPU to filter your data and blocking all the threads so that users of your website can't even get page results. For that reason alone it makes absolutely no sense to keep this task in PHP, even if you implement your schema and queries badly, forget to cache results and run subsequent searches inside those cached results, forget to index the columns you filter on for extremely fast retrieval, and so on.
PHP is not for doing all the heavy lifting. It should ask other things for the data it needs, and act as the glue between "a request comes in", "response base data is obtained" and "response is sent back to the client". It should start up, make the calls, generate the result, and die as fast as it can again.
It really depends on how you need to use the data. You might want to look into storing it in something like MongoDB if you don't need to search that data. If you do, leave it in individual rows and create your indexes in a way that makes lookups fast.
If you have 10 billion rows, and need to look up 100 of them to do a calculation, it should still be fast if you have your indexes done right.
Now if you have 10 billion rows and you want to do a sum on 10,000 of them, it would probably be more efficient to save that total somewhere. Whenever a new row is added, removed or updated that would affect that total, you can change that total as well. Consider a bank, where all items in the ledger are stored in a table, but the balance is stored on the user account and is not calculated based on all the transactions every time the user wants to check his balance.
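A hypothetical sketch of that "maintain the total as you go" idea; the transactions and accounts tables and their columns are assumptions made up for illustration, and $pdo is an open PDO connection:

$pdo->beginTransaction();
try {
    // Record the individual ledger entry...
    $stmt = $pdo->prepare(
        'INSERT INTO transactions (account_id, amount, created_at) VALUES (?, ?, NOW())'
    );
    $stmt->execute([$accountId, $amount]);

    // ...and adjust the stored balance in the same transaction, so the two
    // can never drift apart.
    $stmt = $pdo->prepare('UPDATE accounts SET balance = balance + ? WHERE id = ?');
    $stmt->execute([$amount, $accountId]);

    $pdo->commit();
} catch (Exception $e) {
    $pdo->rollBack();
    throw $e;
}

// Checking the balance is now a single indexed lookup, not a SUM over the ledger:
// SELECT balance FROM accounts WHERE id = ?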
In my webapp (PHP) I have a script showing the contents of a single database record: view.php?id=n, where n is the ID of the record.
Also there are Next and Previous links on the page (to browse through the records without getting back to the record list) using the increment/decrement of the current ID to call view.php:
?id=$current_id-1 (Prev)
?id=$current_id+1 (Next)
Now I’m going to implement a search function. The Next and Previous links are now supposed to guide the user through their recent search results, which makes the mechanism above useless.
I would have to store a list of at least the IDs of the search results in the current session. With 5000 records, that means the server has to hold about 20 KB of data (assuming 32-bit integers) for each user whose search query uses no filters (for example, just sorting all records by the time of last change).
Is this a proper way and acceptable in terms of performance and memory usage or are there alternative ways?
If you store only the IDs for Prev/Next, it doesn't buy you much, because you're fetching the record from the DB by ID anyway, and the ID is presumably the primary key and indexed. However, if you store ID => record pairs, you can put them in a cache rather than in the session. Then, when the user clicks Prev or Next, you can get the record from the cache by ID without querying the DB. In that case, you need to refresh the cache on each DB update.
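A small sketch of what that could look like with the Memcached extension; the key names, TTL, records table and the way $searchRows is obtained are assumptions for illustration:

// After running the search, warm the cache with the records keyed by ID,
// and keep only the ID list in the session:
$items = [];
$ids   = [];
foreach ($searchRows as $row) {         // $searchRows: rows from the search query
    $items['record:' . $row['id']] = $row;
    $ids[] = $row['id'];
}
$mc->setMulti($items, 600);             // cache for 10 minutes
$_SESSION['result_ids'] = $ids;

// When the user clicks Prev/Next, fetch the neighbour from the cache by ID,
// falling back to the DB only on a miss (and re-caching after updates):
$record = $mc->get('record:' . $neighbourId);
if ($record === false) {
    $stmt = $pdo->prepare('SELECT * FROM records WHERE id = ?');
    $stmt->execute([$neighbourId]);
    $record = $stmt->fetch(PDO::FETCH_ASSOC);
    $mc->set('record:' . $neighbourId, $record, 600);
}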
There is a data structure that is very convenient for such cases - SplDoublyLinkedList. If you haven't come across linked lists, here's a good explanation with a few illustrations about them.
I'm not sure, though, how big their memory footprint will be, so you'll want to test that first. But still, they're likely to be faster than anything handmade.
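For illustration, a minimal sketch of keeping the search-result IDs in an SplDoublyLinkedList; how you obtain $resultIds and the session keys used here are assumptions:

$list = new SplDoublyLinkedList();
foreach ($resultIds as $id) {            // $resultIds: the IDs returned by the search
    $list->push($id);
}
$_SESSION['search_results'] = $list;     // SplDoublyLinkedList is serializable (PHP 5.4+)

// Given the position of the record currently being viewed:
$pos  = $_SESSION['search_position'];
$prev = $pos > 0                  ? $list->offsetGet($pos - 1) : null;
$next = $pos < $list->count() - 1 ? $list->offsetGet($pos + 1) : null;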
Cheers.
I am storing some history information on my website for future retrieval of the users. So when they visit certain pages it will record the page that they visited, the time, and then store it under their user id for future additions/retrievals.
So my original plan was to store all of the data in an array, and then serialize/unserialize it on each retrieval and then store it back in a TEXT field in the database. The problem is: I don't know how efficient or inefficient this will get with large arrays of data if the user builds up a history of (e.g.) 10k pages.
EDIT: So I want to know what the most efficient way to do this is. I was also considering just inserting a new row in the database for each history entry, but then this would make for a very large table to select from.
The question is: which is faster/more efficient, a massive number of rows in the database or a massive serialized array? Any other, better solutions are obviously welcome. I will eventually be switching to Python, but for now this has to be done in PHP.
There is no benefit to storing the data as serialized arrays. Retrieving a big blob of data, de-serializing, modifying it and re-serializing to update is slow - and worse, will get slower the larger the piece of data (exactly what you're worried about).
Databases are specifically designed to handle large numbers of rows, so use them. You have no extra cost per insert as the data grows, unlike your proposed method, and you're still storing the same amount of data, so let the database do what it does best, and keep your code simple.
Storing the data as an array also makes any sort of querying and aggregation near impossible. If the purpose of the system is to (for example) see how many visits a particular page got, you would have to de-serialize every record, find all the matching pages, etc. If you have the data as a series of rows with user and page, it's a trivial SQL count query.
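To make that concrete, a minimal sketch (the page_history table and its columns are made up for illustration, and $pdo is an open PDO connection):

// Counting visits to one page is a one-liner once each visit is its own row:
$stmt = $pdo->prepare('SELECT COUNT(*) FROM page_history WHERE page = ?');
$stmt->execute(['/products/widget']);
$visits = $stmt->fetchColumn();

// With one serialized array per user, answering the same question means
// loading and unserializing every user's blob and scanning it in PHP.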
If, one day, you find that you have so many rows (10,000 is not a lot of rows) that you're starting to see performance issues, find ways to optimize it, perhaps through aggregation and de-normalization.
You can collect all the data for one session in a session variable and dump it into the database in one go (sketched below).
You can add indexes at the DB level to save time.
Last, and the most effective thing you can do, is to pre-process/aggregate the data and store the result in a separate table, and always select from that table. You can achieve this with a cron job or some other scheduler.
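A rough sketch of the session-buffer idea above (table and column names are illustrative; assumes PHP 5.6+ for argument unpacking and an open PDO connection in $pdo):

// On each page view, just append to the session buffer:
$_SESSION['history_buffer'][] = [$userId, $_SERVER['REQUEST_URI'], time()];

// Every N entries (or at the end of the session), flush the buffer in one
// multi-row INSERT instead of one query per page view:
if (count($_SESSION['history_buffer']) >= 50) {
    $rows = $_SESSION['history_buffer'];
    $placeholders = implode(',', array_fill(0, count($rows), '(?, ?, FROM_UNIXTIME(?))'));
    $stmt = $pdo->prepare(
        "INSERT INTO page_history (user_id, page, visited_at) VALUES $placeholders"
    );
    $stmt->execute(array_merge(...$rows));
    $_SESSION['history_buffer'] = [];
}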
I am starting to learn the benefits of memcache, and would like to implement it on my project. I have understood most of it, such as how data can be retrieved by a key and so on.
Now, I get that I can put a post with all of its details in memcache under a key like POST:123; that is OK, and I can do it for each post.
But how do I deal with the case where I query the posts table to get the list of all posts with their titles? Can this be done with memcache, or should this always be queried from the table?
Memcache is a key-value cache, so as you've described, it is typically used when you know exactly what data you want to retrieve (i.e., it is not used for querying and returning an unknown list of results based on some filters).
Generally, the goal is not to replace all database queries with memcache calls, especially if the optimization isn't necessary.
Assuming the data won't fit in a single value, and you did have a need for quick-access to all the data in your table, you could consider dumping it into keys based on some chunk-value, like a timestamp range.
However, it's probably best to either keep the database query or load it into memory directly (if we're talking about a single server that writes all the updates).
You could have a key called "posts" with a value of "1,2,3,4,10,12", and then update/retrieve it every time a post is created, updated, or deleted.
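A quick sketch of maintaining that "posts" key with the Memcached extension; the key name and helper functions are just examples:

function add_post_id(Memcached $mc, $postId)
{
    $ids = $mc->get('posts');
    $ids = ($ids === false) ? [] : explode(',', $ids);
    $ids[] = $postId;
    $mc->set('posts', implode(',', array_unique($ids)));
}

function remove_post_id(Memcached $mc, $postId)
{
    $ids = $mc->get('posts');
    if ($ids !== false) {
        $ids = array_diff(explode(',', $ids), [$postId]);
        $mc->set('posts', implode(',', $ids));
    }
}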
In any case, memcached is not a database replacement; it's a key/value store (a fast and scalable one at that). So you have to decide what data to keep in the DB and what to offload to memcached.
Alternatively, you could use the "cache-money" plugin, which does read/write-through caching of your ActiveRecord models in memory (even in memcached).
Look at the cache_money gem from Twitter. This gem adds caching at the ActiveRecord level.
The find calls by id will go through the cache. You can add support for indexing by other fields in your table (i.e. title in your case)
class Post < ActiveRecord::Base
index :title
end
Now the find calls that filter by title will go through cache.
Post.all(:conditions => {:title => "xxxx"})
Basically, what you should do is check the cache first to see if it has the information you need. If there is nothing in the cache (or the cached data is stale), query the table, place the returned data in the cache, and return it to the front-end as well.
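A minimal cache-aside sketch for the post list; the key name, TTL, and the $memcached/$pdo objects are assumptions for illustration:

$posts = $memcached->get('posts:all_titles');
if ($posts === false) {
    $stmt  = $pdo->query('SELECT id, title FROM posts ORDER BY id');
    $posts = $stmt->fetchAll(PDO::FETCH_ASSOC);
    $memcached->set('posts:all_titles', $posts, 60);   // cache for 60 seconds
}
// $posts is now usable whether it came from memcache or from MySQL.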
Does that help?
You cannot enumerate memcached contents; it's not designed to be used the way you're trying to. You need to store the complete result set of a SQL query instead of single records.
So, yes: do a SELECT title FROM posts and store the result in memcached (e.g. as POSTS:ALL_UNORDERED). This is how it is designed to be used. And yes, if you query single records like SELECT * FROM posts WHERE id = XX, then store that too. Just keep in mind to create a unique memcached key for each query you perform.
As far as I know, you can't query for multiple keys and values. But if you need to access the same list of posts often, why not store it under a key like "posts" in your memcache?
I have a large table and I'd like to store the results in memcache. I have memcache set up on my server, but there doesn't seem to be any useful documentation (that I can find) on how to efficiently transfer large amounts of data. The only way I can currently think of is to write a MySQL query that grabs the key and value of each row and then saves those in memcache. It's not a particularly scalable solution (especially when my query generates a few hundred thousand rows). Any advice on how to do this?
EDIT: there is some confusion about what I'm attempting to do. Let's say that I have a table with two fields (key and value). I am pulling in information on the fly and have to match it to the key and return the value. I'd like to avoid having to execute ~1000 queries per page load. Memcache seems like a perfect alternative because it's set up to use key/value. Let's say this table has 100K rows. The only way I know to get that data from the DB table to memcache is to run a query that loops through every row in the table and creates an individual memcache entry for it.
Questions: Is this a good way to use memcache? If so, is there a better way to transfer my table?
You can actually pull all the rows into an array and store the array in memcache:
memcache_set($memcache_obj, 'var_key', $your_array);
But you have to remember a few things:
PHP will serialize/unserialize the array going into and out of memcache, so if you have many rows it might be slower than actually querying the DB.
You cannot do any filtering (no SQL); if you want to filter some items, you have to implement the filter yourself, and it will probably perform worse than the DB engine.
memcache won't store more than 1 MB per item (by default).
I don't know what you're trying to achieve, but the general uses of memcache are:
storing the result of SQL queries or time-consuming processing, where the number of resulting rows is small
storing pre-created (X)HTML blobs to avoid DB access
user session storage
Russ,
It sounds almost as if using a MySQL table with the storage engine set to MEMORY might be your way to go.
A RAM-based table gives you the flexibility of using SQL, and also avoids disk thrashing from a large volume of reads/writes (much like memcached).
However, a RAM-based table is very volatile. If anything is stored in the table and not flushed to a disk-based table, and you lose power... well, you just lost your data. That being said, make sure you flush to a real disk-based table every once in a while.
Also, another plus of using MEMORY tables is that you can store most of the typical MySQL data types (note that BLOB/TEXT columns are not supported), and there is no 1 MB per-item size limit.
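A hypothetical sketch of that approach; the table and column names are made up, so adapt them to your actual key/value table, and $pdo is an open PDO connection:

// Create the in-memory lookup table (its contents vanish on restart):
$pdo->exec('
    CREATE TABLE IF NOT EXISTS lookup_cache (
        k VARCHAR(64)  NOT NULL PRIMARY KEY,
        v VARCHAR(255) NOT NULL
    ) ENGINE=MEMORY
');

// Populate it in one statement from the disk-based table:
$pdo->exec('INSERT INTO lookup_cache (k, v) SELECT k, v FROM lookup_table');

// Per-page lookups now hit RAM only:
$stmt = $pdo->prepare('SELECT v FROM lookup_cache WHERE k = ?');
$stmt->execute([$someKey]);
$value = $stmt->fetchColumn();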