I am working on a search application that uses a form with 16 filter options that are either 1 (selected) or 0 (not selected). The result as JSON is retrieved via AJAX using a GET request.
The query string then looks like this:
filter_1=0&filter_2=1 ...omitted... &filter_16=1&page=20
Each search result has at least 2 pages which can be browsed by the user.
My question is: how can I cache the search results based on the input parameters? My first idea was to md5 the request parameters and then write a cache file using the hash as the filename.
Every time a new request comes in, I search for the cache file and if it is there, then use the data from that file instead of querying the database and converting the rows to a json result.
But this does not seem like a good idea because of the many search options. There would be quite a lot of cache files (2^16 = 65536 possible combinations), and because the application is only used by a few users, I doubt that all possible combinations will ever get cached. And each result spans X pages, so each of those pages would be a cache file of its own (2^16 * X).
What would be a good caching strategy for an application like this? Is it actually possible to implement a cache?
Because all of your search parameters are flags that can be either 0 or 1, you might consider bitmasking.
Each of your filters would represent a value that is a power of 2:
$filter_1 = 1;
$filter_2 = 2;
$filter_3 = 4;
...
$filter_8 = 128;
...
$filter_16 = 32768;
By using PHP's bitwise operators, you can easily store all 16 filter values in a single integer. For instance, the value "129" can only be reached using a combination of filter_1 and filter_8. If the user selected filter_1 and filter_8, you could determine the bitmask by doing:
$bitmask = $filter_1 | $filter_8; // gives 129
With a unique bitmask representing the state of all your filters, you can simply use that as your cache key as well, with no expensive md5 operations needed. So in this case, you would save a file named "129" into your cache.
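As a rough sketch of how that key could be computed straight from the query string (the cache directory and the build_json_from_database() helper are placeholders, not part of the original answer):
<?php
// Build the bitmask cache key from the 16 filter flags in the query string.
$bitmask = 0;
for ($i = 1; $i <= 16; $i++) {
    if (!empty($_GET['filter_' . $i])) {
        $bitmask |= 1 << ($i - 1);   // filter_1 = bit 0 ... filter_16 = bit 15
    }
}

// Include the page number so every result page gets its own cache file.
$page      = isset($_GET['page']) ? (int)$_GET['page'] : 1;
$cacheFile = __DIR__ . '/cache/' . $bitmask . '_' . $page . '.json';

if (is_file($cacheFile)) {
    $json = file_get_contents($cacheFile);    // cache hit
} else {
    $json = build_json_from_database();       // placeholder for the existing query + json_encode step
    file_put_contents($cacheFile, $json);
}
echo $json;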
This technique also gives you an easy tool to invalidate your cache: you can check new and updated records to determine which filters they match, and delete any file that has that "bit" set in its name, i.e. if ((((int)$filename) & $filter) === $filter) unlink($filename);. If your tables have frequent writes, this could cause some performance issues from scanning your cache, but it's a decent technique for a read-heavy application.
This is an approach I love to use when dealing with bits or flags. You should consider carefully if you really need caching like this however. If you only have a few users of the system, are you really going to be having performance problems based on a few search queries? As well, MySQL has built-in query caching which performs very well on a high-read application. If your result page generation routines are expensive, then caching the output fragments can definitely be beneficial, but if you're only talking about microseconds of performance here for a handful of users, it might not be worth it.
Why do you need the cache?
If the app is only used by a few users then caching may not actually be required.
Given the requirements you describe (a small number of users), it seems to me that caching all combinations is reasonable - if caching makes sense here at all. How much time does a typical query take? Since you say that the application will be used by only a few people, is it even worth caching? My very rough estimate: if the query does not take several seconds, don't worry about caching. If it takes less than a second, then unless you really want to make the application super responsive, no caching should be needed.
Otherwise, I would say (again, given the small number of users) that caching all combinations is OK. Even if a very large number of them were used, there are still at most 65536 of them, and many modern operating systems can easily handle thousands of files in a directory (in case you plan to cache into files). But in any case, it would be reasonable to limit the number of items in the cache and purge old entries regularly. Also, I would not use MD5; I would just concatenate the zeros and ones from your filters for the cache key (e.g. 0101100010010100).
First verify you actually need a cache (like Toby suggested).
After that, think about how fresh the information needs to be - you'll be needing to flush out old values. You may want to use a preexisting solution for this, such as memcached.
$memcache = new Memcache();
$memcache->connect('localhost', 11211);   // adjust host/port to your setup

$key = calc_key();                         // however you derive a key from the request parameters
$result = $memcache->get($key);
if (!$result) {
    $result = get_data_from_db();
    /* cache result for 3600 seconds == 1 hour */
    $memcache->set($key, $result, 0, 3600);
}
/* use $result */
Related
I have a JS script that does one simple thing - an ajax request to my server. On this server I establish a PDO connection and execute one prepared statement:
SELECT * FROM table WHERE param1 = :param1 AND param2 = :param2;
Where table is the table with 5-50 rows, 5-15 columns, with data changing once each day on average.
Then I echo the json result back to the script and do something with it, let's say I console log it.
The problem is that the script is run ~10,000 times a second, which gives me that many connections to the database, and I'm getting "can't connect to the database" errors all the time in the server logs. This means it sometimes works, when DB processes are free, and sometimes it doesn't.
How can I handle this?
Probable solutions:
Memcached - it would also be slow; it's not created to do that. The performance would be similar to, or worse than, the database.
File on the server instead of the database - a great solution, but the structure would be the problem.
Anything better?
For such a tiny amount of data that is changed so rarely, I'd make it just a regular PHP file.
Once you have your data in the form of an array, dump it into a PHP file using var_export(). Then just include this file and use a simple loop to search the data.
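A rough sketch of that idea (the file name, the query, and the $pdo connection are assumptions; regenerate the cache file whenever the data changes, e.g. from a daily cron job):
<?php
// --- generation step: run once a day, or whenever the data changes ---
$rows = $pdo->query('SELECT * FROM table')->fetchAll(PDO::FETCH_ASSOC); // assumes an open PDO connection
file_put_contents(
    __DIR__ . '/table_cache.php',
    '<?php return ' . var_export($rows, true) . ';'
);

// --- lookup step: used by the AJAX endpoint, no database involved ---
$rows   = include __DIR__ . '/table_cache.php';
$result = array();
foreach ($rows as $row) {
    if ($row['param1'] == $_GET['param1'] && $row['param2'] == $_GET['param2']) {
        $result[] = $row;
    }
}
echo json_encode($result);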
Another option is to use Memcached, which was created for exactly this sort of job; on a fast machine with high-speed networking, memcached can easily handle 200,000+ requests per second, which is far above your modest 10k rps.
You can even eliminate PHP from the path entirely, making Nginx ask Memcached directly for the stored values, using ngx_http_memcached_module.
If you want to stick with the current MySQL-based solution, you can increase the max_connections number in the MySQL configuration; however, raising it above 200 may require some OS tweaking as well. But what you should not do is use persistent connections - that will make things far worse.
You need to leverage a cache. There is no reason at all to go fetch the data from the database every time the AJAX request is made for data that is this slow-changing.
A couple of approaches that you could take (possibly even in combination with each other).
Cache between application and DB. This might be memcache or similar and would allow you to perform hash-based lookups (likely based on some hash of parameters passed) to data stored in memory (perhaps JSON representation or whatever data format you ultimately return to the client).
Cache between client and application. This might take the form of a web-server-provided cache, a CDN-based cache, or similar that would prevent the request from ever even reaching your application, given an appropriately stored, non-expired item in the cache (a minimal header sketch follows below).
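For the second approach, a minimal sketch of what the PHP endpoint could send so a browser, reverse proxy, or CDN may reuse the response without the request reaching the application again (the one-hour lifetime and the $jsonBody variable are assumptions):
<?php
// $jsonBody = the encoded result, built elsewhere in the endpoint.
header('Cache-Control: public, max-age=3600');   // any shared cache may keep this for an hour

// Optional: allow cheap revalidation with a 304 instead of a full body.
$etag = '"' . md5($jsonBody) . '"';
header('ETag: ' . $etag);
if (isset($_SERVER['HTTP_IF_NONE_MATCH']) && $_SERVER['HTTP_IF_NONE_MATCH'] === $etag) {
    http_response_code(304);
    exit;
}

header('Content-Type: application/json');
echo $jsonBody;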
Anything better? No
Since you output the same results many times, the sensible solution is to cache results.
My educated guess is that your wrong assumption -- that memcached is not built for this -- is based on you planning to store each record separately.
I implemented a simple caching mechanism for you to use:
<?php
$memcached_port = YOUR_MEMCACHED_PORT;   // e.g. 11211, the default

$m = new Memcached();
$m->addServer('localhost', $memcached_port);

$key1 = $_GET['key1'];
$key2 = $_GET['key2'];
$m_key = $key1 . $key2; // generate the key from unique values; for large keys, MD5-hash the unique value

$data = false;
if (!($data = $m->get($m_key))) {
    // fetch $data from your database
    $expire = 3600; // 1 hour; you may use a unix timestamp if you wish to expire at a specific time of day
    $m->set($m_key, $data, $expire); // push to memcache
}

echo json_encode($data);
What you do is :
Decide on what signifies a result ( what set of input parameters )
Use that for the memcache key ( for example if it's a country and language the key would be $country.$language )
Check if the result exists:
If it does, pull the data you stored as an array and output it.
If it doesn't exist or is outdated:
a. pull the data needed
b. put the data in an array
c. push the data to memcached
d. output the data
There are more efficient ways to cache data, but this is the simplest one, and sounds like your kind of code.
10,000 requests/second still doesn't justify the effort needed to create server-level caching (nginx/whatever).
In an ideally tuned world a chicken with a calculator would be able to run facebook .. but who cares ? (:
I have a web app which is pretty CPU intensive ( it's basically a collection of dictionaries, but they are not just simple dictionaries, they do a lot of stuff, anyway this is not important ). So in a CPU intensive web app you have the scaling problem, too many simultaneous users and you get pretty slow responses.
The flow of my app is this:
js -> ajax call -> php -> vb6 dll -> vb6 code queries the dictionaries and does CPU intensive stuff -> reply to php -> reply to js -> html div gets updated with the new content. Obviously in a windows env with IIS 7.5. PHP acts just as a way of accessing the .dll and does nothing else.
The content replied/displayed is html formatted text.
The app has many php files which call different functions in the .dll.
So, in order to avoid calling the vb6 dll for each request, which is the CPU intensive part, I'm thinking of doing this:
example ajax request:
php file: displayconjugationofword.php
parameter: word=lol&tense=2&voice=active
So when a user makes the above request to displayconjugationofword.php, I call the vb6 dll, and just before sending the reply back to the client, I can add the request data to a MySQL table like this:
filename, request, content
displayconjugationofword.php, word=blahblah&tense=2&voice=active, blahblahblah
So the next time a user makes the EXACT same ajax request, the displayconjugationofword.php code, instead of calling the vb6 dll, first checks the mysql table to see if the request exists there and, if it does, fetches it from there.
So this mysql table will gradually grow in size, reaching up to 3-4 million rows, and as it grows the chance of a request already being in the db grows too, which theoretically should be faster than doing the cpu intensive calls (each anywhere from 50 to 750ms long).
Do you think this is a good method of achieving what I want? Or when the mysql table reaches 3-4 million entries, will it be slow too?
Thank you in advance for your input.
edit
I know about IIS output caching but I think it's not useful in my case because:
1) AFAIK it only caches the .php file when it becomes "hot" (many queries).
2) I do have some .php files which call the vb6 dll but whose reply is random each time.
I love these situations/puzzles! Here are the questions that I'd ask first, to determine what options are viable:
Do you have any idea/sense of how many of these queries are going to be repeated in a given hour, day, week?
Because... the 'more common caching technique' (i.e the technique I've seen and/or read about the most) is to use something like APC or, for scalability, something like Memcache. What I've seen, though, is that these are usually used for < 12 hour-long caches. That's just what I've seen. Benefit: auto-cleanup of unused items.
Can you give an estimate of how long a single 'task' might take?
Because... this will let you know if/when the cache becomes unproductive - that is, when the caching mechanism is slower than the task.
Here's what I'd propose as a solution - do it all from PHP (no surprise). In your work-flow, this covers both PHP points: js -> ajax call -> php -> vb6 dll -> vb6 code queries the dictionaries and does CPU intensive stuff -> reply to php -> reply to js -> html div...
Something like this:
Create a table with columns: __id, key, output, count, modified
1.1. The column '__id' is the auto-increment column (ex. INT(11) AUTO_INCREMENT) and thus is also the PRIMARY INDEX
1.2 The column 'modified' is created like this in MySQL: modified TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
1.3 'key' = CHAR(32) which is the string length of MD5 hashes. 'key' also has a UNIQUE INDEX (very important!! for 3.3 below)
1.4 'output' = TEXT since the VB6 output will be more than a little text
1.5 'count' = INT(8) or so
Hash the query-string ("word=blahblah&tense=2&voice=active"). I'm thinking something like: $key = md5(var_export($_GET, TRUE)); Basically, hash whatever will give a unique output. Deducing from the example given, it might be best to lowercase the 'word' if case doesn't matter.
Run a conditional on the results of a SELECT for the key. In pseudo-code:
3.1. $result = SELECT output, count FROM my_cache_table_name WHERE `key` = "$key"
3.2. if (empty($result)) {
         $output = result of running the VB6 task
         $count = 1
     } else {
         $output = $result['output']
         $count = $result['count'] + 1
     }
3.3. run query 'INSERT INTO my_cache_table_name (`key`, output, count) VALUES ($key, $output, $count) ON DUPLICATE KEY UPDATE count = $count'
3.4. return $output as "reply to js"
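Putting the steps above together as PHP, as a sketch only (run_vb6_task() stands in for the existing DLL call, $pdo is an already-open PDO connection, and the table is the one described in step 1; note that key has to be backtick-quoted because KEY is a reserved word in MySQL):
<?php
$key = md5(var_export($_GET, true));                 // step 2: one hash per unique request

$stmt = $pdo->prepare('SELECT output, count FROM my_cache_table_name WHERE `key` = ?');
$stmt->execute(array($key));
$result = $stmt->fetch(PDO::FETCH_ASSOC);            // step 3.1

if ($result === false) {                             // step 3.2: cache miss
    $output = run_vb6_task($_GET);                   // the expensive DLL call
    $count  = 1;
} else {                                             // cache hit
    $output = $result['output'];
    $count  = $result['count'] + 1;
}

$sql = 'INSERT INTO my_cache_table_name (`key`, output, count) VALUES (?, ?, ?)
        ON DUPLICATE KEY UPDATE count = ?';          // step 3.3
$pdo->prepare($sql)->execute(array($key, $output, $count, $count));

echo $output;                                        // step 3.4: reply to js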
Long-term, you will not only have a cache but you will also know what queries are being run the least and can prune them if needed. Personally, I don't think such a query will ever be all that time-consuming. And there are certainly things you might do to optimize the cache/querying (that's beyond me).
So what I'm not stating directly is this: the above will work (and is pretty much what you suggested). By adding a 'count' column, you will be able to see what queries are done a lot and/or a little and can come back and prune if/as needed.
If you want to see how long queries are taking, you might create another table that holds 'key', 'duration', and 'modified' (like above). Before 3.1 and 3.3, get the microtime(). If this is a cache-hit, subtract the microtime()s and store the result in this new table where 'key' = $key and 'duration' = 2nd microtime() - 1st microtime(). Then you can come back later, sort by 'modified DESC', and see how long queries are taking. If you have a TON of data and the latest 'duration' is still not bad, you can pull this whole duration-recording mechanism. Or, if bored, only store the duration when $key ends in a letter (just to cut down on its load on the server).
I'm not an expert but this is an interesting logic problem. Hopefully what I've set out below will help or at least stimulate comments that may or may not make it useful.
To an extent, the answer is going to depend on how many queries you are likely to have, how many at once, and whether the mysql indexing will be faster than your definitive solution.
A few thoughts then:
It would be possible to pass caching requests on to another server easily which would allow essentially infinite scaling.
Humans being as they are, most word requests are likely to involve only a few thousand words so you will, probably, find that most of the work being done is repeat work fairly soon. It makes sense then to create an indexable database.
Hashing has in the past been suggested as a good way to speed indexing of data. Whether this is useful or not will to an extent depend on the length of your answer.
If you are very clever, you could determine the top 10000 or so likely questions and responses and store them in a separate table for even faster responses. (Guru to comment?)
Does your dll already do caching of requests? If so then any further work will probably slow your service down.
This solution is amenable to simple testing using JS or php to generate multiple requests to test response speeds using or not using caching. Whichever you decide, I reckon you should test it with a large amount of sample data.
In order to get maximum performance for your example you need to follow basic cache optimization principle.
I'm not sure if your application logic allows it, but if it does, it will give you a huge benefit: you need to distinguish requests which can be cached (static) from those which return dynamic (random) responses. Use some file-naming rule, or provide a custom HTTP header or request parameter - i.e. any part of the request which can be used to judge whether to cache it or not.
Speeding up static requests. The idea is to process incoming requests and send back the reply as early as possible (ideally even before the web server comes into play). I suggest you use output caching, since it will do what you intend to do in php&mysql internally, in a much more performant way. Some options are:
Use the IIS output caching feature (a quick search shows it can cache requests based on the requested file name and query string).
Place a caching layer in front of a web server. Varnish (https://www.varnish-cache.org/) is a flexible and powerful opensource tool, and you can configure caching strategy optimally depending on the size of your data (use memory vs. disk, how much mem can be used etc).
Speeding up dynamic requests. If they are completely random internally (no dll calls which can be cached) then there's not much to do. If there are some dll calls which can be cached, do it like you described: fetch data from cache, if it's there, you're good, if not, fetch it from dll and save to cache.
But use something more suitable for the task of caching - a key/value store like Redis or memcached is a good fit. They are blazingly fast. Redis may be the better option since the data can be persisted to disk (while memcached drops the entire cache on restart, so it needs to be refilled).
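A minimal sketch of that fetch-or-fill pattern with the phpredis extension (the host, the key scheme, and call_expensive_dll() are placeholders):
<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$key = 'conj:' . md5(var_export($_GET, true));   // one key per unique request

$output = $redis->get($key);
if ($output === false) {                         // cache miss
    $output = call_expensive_dll($_GET);         // placeholder for the cacheable dll call
    $redis->setEx($key, 3600, $output);          // keep it for an hour
}
echo $output;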
I have a PHP/MySQL based web application that has internationalization support by way of a MySQL table called language_strings with the string_id, lang_id and lang_string fields.
I call the following function when I need to display a string in the selected language:
public function get_lang_string($string_id, $lang_id)
{
$db = new Database();
$sql = sprintf('SELECT lang_string FROM language_strings WHERE lang_id IN (1, %s) AND string_id=%s ORDER BY lang_id DESC LIMIT 1', $db->escape($lang_id, 'int'), $db->escape($string_id, 'int'));
$row = $db->query_first($sql);
return $row['lang_string'];
}
This works perfectly but I am concerned that there could be a lot of database queries going on. e.g. the main menu has 5 link texts, all of which call this function.
Would it be faster to load the entire language_strings table results for the selected lang_id into a PHP array and then call that from the function? Potentially that would be a huge array with much of it redundant but clearly it would be one database query per page load instead of lots.
Can anyone suggest another more efficient way of doing this?
There isn't an answer that isn't case specific. You really have to look at it on a case-by-case basis. Having said that, the majority of the time it will be quicker to get all the data in one query, pop it into an array or object and refer to it from there.
The caveat is whether you can pull all your data that you need in one query as quickly as running the five individual ones. That is where the performance of the query itself comes into play.
Sometimes a query that contains a subquery or two will actually be less time efficient than running a few queries individually.
My suggestion is to test it out. Get a query together that gets all the data you need, see how long it takes to execute. Time each of the other five queries and see how long they take combined. If it is almost identical, stick the output into an array and that will be more efficient due to not having to make frequent connections to the database itself.
If however, your combined query takes longer to return data (it might cause a full table scan instead of using indexes for example) then stick to individual ones.
Lastly, if you are going to use the same data over and over - an array or object will win hands down every single time as accessing it will be much faster than getting it from a database.
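A simple way to run the timing test suggested above, assuming you wrap each variant in a function of its own (the wrapper names are made up; repeat each variant a few hundred times so connection overhead averages out):
<?php
function time_it($fn, $iterations = 500)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

// $combined and $individual are assumed closures wrapping the two query strategies.
printf("combined query:        %.4fs\n", time_it($combined));
printf("five separate queries: %.4fs\n", time_it($individual));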
OK - I did some benchmarking and was surprised to find that putting things into an array rather than using individual queries was, on average, 10-15% SLOWER.
I think the reason for this was because, even if I filtered out the "uncommon" elements, inevitably there was always going to be unused elements as a matter of course.
With the individual queries I am only ever getting out what I need and as the queries are so simple I think I am best sticking with that method.
This works for me, of course in other situations where the individual queries are more complex, I think the method of storing common data in an array would turn out to be more efficient.
Agree with what everybody says here.. it's all about the numbers.
Some additional tips:
Try to create a single memory array which holds the minimum you require. This means removing most of the obvious redundancies.
There are standard approaches for these issues in performance critical environments, like using memcached with mysql. It's a bit overkill, but this basically lets you allocate some external memory and cache your queries there. Since you choose how much memory you want to allocate, you can plan it according to how much memory your system has.
Just play with the numbers. Try using separate queries (which is the simplest approach) and stress your PHP script (like calling it hundreds of times from the command-line). Measure how much time this takes and see how big the performance loss actually is.. Speaking from my personal experience, I usually cache everything in memory and then one day when the data gets too big, I run out of memory. Then I split everything to separate queries to save memory, and see that the performance impact wasn't that bad in the first place :)
I'm with Fluffeh on this: look into the other options at your disposal (joins, subqueries, making sure your indexes reflect how the data relates - but don't over-index, and test). Most likely you'll end up with an array at some point, so here's a little performance tip. Contrary to what you might expect, something like
$all = $stmt->fetchAll(PDO::FETCH_ASSOC);
is less memory efficient compared to:
$all = array(); // or $all = []; in php 5.4
while ($row = $stmt->fetch(PDO::FETCH_ASSOC))
{
    $all[] = $row['lang_string'];
}
What's more: you can check for redundant data while fetching the data.
My answer is to do something in between. Retrieve all strings for a lang_id that are shorter than a certain length (say, 100 characters). Shorter text strings are more likely to be used in multiple places than longer ones. Cache the entries in a static associative array in get_lang_string(). If an item isn't found, then retrieve it through a query.
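A sketch of what that could look like inside the existing function (the 100-character cutoff follows the idea above; query_all() is an assumed Database method that returns all rows, so adjust it to whatever your class actually provides):
public function get_lang_string($string_id, $lang_id)
{
    static $cache = array();

    // First call for this language: preload every "short" string in one query.
    if (!isset($cache[$lang_id])) {
        $db  = new Database();
        $sql = sprintf('SELECT string_id, lang_string FROM language_strings
                        WHERE lang_id IN (1, %s) AND CHAR_LENGTH(lang_string) <= 100
                        ORDER BY lang_id ASC', $db->escape($lang_id, 'int'));
        $cache[$lang_id] = array();
        foreach ($db->query_all($sql) as $row) {
            // Rows for the selected language come last and overwrite the lang_id 1 fallback.
            $cache[$lang_id][$row['string_id']] = $row['lang_string'];
        }
    }

    if (isset($cache[$lang_id][$string_id])) {
        return $cache[$lang_id][$string_id];
    }

    // Long (or missing) strings fall back to the original per-string query.
    $db  = new Database();
    $sql = sprintf('SELECT lang_string FROM language_strings WHERE lang_id IN (1, %s) AND string_id=%s ORDER BY lang_id DESC LIMIT 1',
                   $db->escape($lang_id, 'int'), $db->escape($string_id, 'int'));
    $row = $db->query_first($sql);
    return $row['lang_string'];
}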
I am currently at the point in my site/application where I have had to put the brakes on and think very carefully about speed. I think these speed tests mentioned should consider the volume of traffic on your server as an important variable that will affect the results. If you are putting data into JavaScript data structures and processing it on the client machine, the processing time should be more consistent. If you are requesting lots of data through mysql via php (for example), this puts demand on one machine/server rather than spreading it. As your traffic grows you have to share server resources with many users, and I am thinking that this is where getting JavaScript to do more is going to lighten the load on the server.
You can also store data on the local machine via localStorage.setItem() / localStorage.getItem() (most browsers have about 5MB of space per domain). If you have data in the database that does not change that often, then you can store it on the client and just check at 'start-up' whether it is still in date/valid.
This is my first comment posted after having and using the account for 1 year, so I might need to fine-tune my rambling - just voicing what I'm thinking through at present.
I am working on an application using memcache pool (5 servers) and some processing nodes. I have two different possible approaches and I was wondering if you guys have any comments on comparison based on performance (speed primarily) between the two
1. I extract a big chunk of data from memcache once per request, iterate over it and discard the bits I don't need for the particular request.
2. I extract small bits from memcached and only extract the ones I need, i.e. I extract the value of a and, based on the value of a, extract the value of either b or c, and use this combination to find the next key I want to extract.
The difference between the two is that the number of memcached lookups (against a pool of servers) is lower in 1, but the size of the response increases. Has anyone seen any benchmarking reports on this?
Unfortunately I can't use a better key based on the request directly, as I don't have enough memcache to support all possible combinations of values, so I have to construct some of it at run time.
Thanks
You would have to benchmark for your own setup. The parts that would matter would be the time spent on:
requesting a large amount of data from memcache + retrieving it + extracting the data from the response
sending several requests to memcache + retrieving the data
Basically first thing you have to measure is how large the overhead for interaction with your cache pool is. And there is that small matter of how this whole thing will react when load increases. What might be fast now, can turn out to be a terrible decision later, when the users start pouring in.
This kinda depends on your definition of "large chunk". Are we talking megabytes here or an array with 100 keys? You also have to consider, that php still needs to process that information.
There are two things you can do at this point:
take a hard look at how you are storing the information. Maybe you can cut it down to two small requests: one to retrieve the specific data for the conditions, and another to get the conditional information.
set up your own benchmark for your server (see the sketch below). Some random article on the web will not be relevant to your system architecture.
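A bare-bones starting point for that benchmark with the Memcached extension (server names, keys, and the chained-lookup logic are placeholders for your actual pool and data):
<?php
$m = new Memcached();
$m->addServers(array(
    array('cache1.example', 11211),
    array('cache2.example', 11211),   // ... list all 5 pool members here
));

// Variant 1: one large blob, filtered in PHP afterwards.
$t = microtime(true);
$big = $m->get('big_chunk_key');
$variant1 = microtime(true) - $t;

// Variant 2: small, dependent lookups (the value of a decides whether b or c is needed).
$t = microtime(true);
$a = $m->get('key_a');
$next = $m->get($a ? 'key_b' : 'key_c');
$variant2 = microtime(true) - $t;

printf("large chunk: %.5fs, chained small gets: %.5fs\n", $variant1, $variant2);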
I know this is not the answer you wanted to hear, but that's my two cents .. here ya go.
APC lets you store data inside keys, but you cannot group these keys.
So if I want to have a group called "articles", and inside this group I would have keys that take the form of the article ID, I can't do this easily.
articles -> 5  -> cached data
         -> 10 -> cached data
         -> 17 -> cached data
         ...
I could prefix the key with the "group" name like:
article_5 -> cached data
article_10 -> cached data
article_17 -> cached data
...
But this makes it impossible to delete the entire group if I want to :(
A working solution would be to store multidimensional arrays (this is what I'm doing now), but I don't think it's good, because when I want to access or delete cached data, I need to get the entire group first. So if the group has one zillion articles in it, you can imagine what kind of array I will be iterating over and searching.
Do you have better ideas on how could I achieve the group thing?
edit: found another solution, not sure if it's much better because I don't know how reliable it is yet. I'm adding a special key called __paths, which is basically a multidimensional array containing the full prefixed key paths for all the other entries in the cache. When I request or delete cached data, I use this array as a reference to quickly find the key (or group of keys) I need to remove, so I don't have to store arrays and iterate through all the keys...
Based upon your observations, I looked at the underlying C implementation of APC's caching model (apc_cache.c) to see what I could find.
The source corroborates your observations that no grouping structure exists in the backing data store, such that any loosely-grouped collection of objects will need to be done based on some namespace constraint or a modification to the cache layer itself. I'd hoped to find some backdoor relying on key chaining by way of a linked list, but unfortunately it seems collisions are reconciled by way of a direct reallocation of the colliding slot instead of chaining.
Further confounding this problem, APC appears to use an explicit cache model for user entries, preventing them from aging off. So, the solution Emil Vikström provided that relies on the LRU model of memcached will, unfortunately, not work.
Without modifying the source code of APC itself, here's what I would do:
Define a namespace constraint that your entries conform to. As you've originally defined above, this would be something like article_ prepended to each of your entries.
Define a separate list of elements in this set. Effectively, this would be the 5, 10, and 17 scheme you'd described above, but in this case, you could use some numeric type to make this more efficient than storing a whole lot of string values.
Define an interface to updating this set of pointers and reconciling them with the backing memory cache, including (at minimum) the methods insert, delete, and clear. When clear is called, walk each of your pointers, reconstruct the key you used in the backing data store, and flush each from your cache.
What I'm advocating for here is a well-defined object that performs the operations you seek efficiently. This scales linearly with the number of entries in your sub-cache, but because you're using a numeric type for each element, you'd need over 100 million entries or so before you started to experience real memory pain at a constraint of, for example, a few hundred megabytes.
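A sketch of such an object on top of the apc_* user-cache functions (the class name and key-naming conventions here are illustrative, not an existing API):
<?php
class ApcGroup
{
    private $group;

    public function __construct($group)
    {
        $this->group = $group;                        // e.g. "article"
    }

    private function ids()
    {
        $ids = apc_fetch($this->group . '__index');   // the separate list of element ids
        return is_array($ids) ? $ids : array();
    }

    public function insert($id, $data, $ttl = 0)
    {
        apc_store($this->group . '_' . $id, $data, $ttl);
        $ids = $this->ids();
        $ids[$id] = true;                             // numeric ids only, so the index stays small
        apc_store($this->group . '__index', $ids);
    }

    public function delete($id)
    {
        apc_delete($this->group . '_' . $id);
        $ids = $this->ids();
        unset($ids[$id]);
        apc_store($this->group . '__index', $ids);
    }

    public function clear()
    {
        // Walk the pointer list, rebuild each real key and flush it.
        foreach (array_keys($this->ids()) as $id) {
            apc_delete($this->group . '_' . $id);
        }
        apc_delete($this->group . '__index');
    }
}

// Usage: $articles = new ApcGroup('article');
//        $articles->insert(5, $data);
//        $articles->clear();    // drops article_5 and every other tracked entry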
Tamas Imrei beat me to suggesting an alternate strategy I was already in the process of documenting, but this has some major flaws I'd like to discuss.
As defined in the backing C code, APCIterator is a linear time operation over the full data set when performing searches (using its constructor, public __construct ( string $cache [, mixed $search = null ...]] )).
This is flatly undesirable in the case where the backing elements you're searching for represent a small percentage of your total data, because it would walk every single element in your cache to find the ones you desire. Citing apc_cache.c:
/* {{{ apc_cache_user_find */
apc_cache_entry_t* apc_cache_user_find(apc_cache_t* cache, char *strkey, \
                                       int keylen, time_t t TSRMLS_DC)
{
    slot_t** slot;
    ...
    slot = &cache->slots[h % cache->num_slots];
    while (*slot) {
        ...
        slot = &(*slot)->next;
    }
}
Therefore, I would most strongly recommend using an efficient, pointer-based virtual grouping solution to your problem as I've sketched out above. Although, in the case where you're severely memory-restricted, the iterator approach may be most correct to conserve as much memory as possible at the expense of computation.
Best of luck with your application.
I have had this problem once with memcached and I solved it by using a version number in my keys, like this:
version -> 5
article_5_5 -> cached data
article_10_5 -> cached data
article_17_5 -> cached data
Just change the version number and the group will be effectively "gone"!
memcached uses a least-recently-used policy to remove old data, so the old-versioned group will be removed from the cache when the space is needed. I don't know if APC has the same feature.
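In code, the trick looks roughly like this with the Memcached extension (key names follow the example above; the article id variable is a placeholder):
<?php
$m = new Memcached();
$m->addServer('localhost', 11211);

$m->add('articles_version', 1);                  // no-op if the counter already exists

// Every article key embeds the current group version.
$version = $m->get('articles_version');
$data    = $m->get('article_' . $article_id . '_' . $version);

// "Deleting" the group = bumping the version; the old keys are never read
// again and eventually fall out via memcached's least-recently-used eviction.
$m->increment('articles_version');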
According to MrGomez this is NOT working for APC. Please read his post, and keep my post in mind only for other cache systems which use a least-recently-used policy (not APC).
You may use the APCIterator class, which seems to exist especially for tasks like this:
The APCIterator class makes it easier to iterate over large APC caches. This is helpful as it allows iterating over large caches in steps ...
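For example, the prefixed keys from the question could be dropped in one call, since apc_delete() accepts an APCIterator (a sketch; the regex has to match your own prefix convention):
<?php
// Select only user-cache entries whose key starts with "article_" ...
$articles = new APCIterator('user', '/^article_/');

// ... and delete that whole "group" in one go.
apc_delete($articles);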
Unfortunately, APC can't do this. I've often wished that it could myself. So I looked for alternatives.
Zend_Cache has an interesting way of doing it, but it simply uses caches to cache the tagging information. It's a component that can in turn use backends (like apc).
If you want to go a step further, you could install Redis. This one has all of that natively included, plus some other really interesting features. This would probably be the cleanest solution to go with. If you are able to use APC, you should also be able to use Redis.
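With phpredis, for instance, the grouping could be tracked in a set next to the cached values (a sketch; the key names are made up):
<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Cache an article and remember its key in the "articles" set.
$redis->set('article:5', $cachedData);           // $cachedData = whatever you would have put in APC
$redis->sAdd('articles', 'article:5');

// Delete the whole group in one sweep.
$keys = $redis->sMembers('articles');
if ($keys) {
    $redis->del($keys);                          // del() accepts an array of keys
}
$redis->del('articles');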