Memcache get optimization with PHP

Okay, so I have some slightly odd questions about Memcache. The basic idea of my caching technique is to save the data requested by my PHP script in a Memcached server. The main issue my team and I faced is that saving large amounts of data can sometimes exceed the 1 MB limit on item size in Memcached.
To further explain the approach imagine the following:
We have lots of data configuring a certain object, and that data contains a lot of text, numbers, etc. We need to save almost 200 of those objects, so the first approach we went with was to cache all ~200 objects as one big item in Memcached. That item may surpass the 1 MB limit, so we figured we needed a new approach.
The new approach is to break the data configuring the object down into smaller building blocks (since we don't use all of the data on the same page), and then use those smaller building blocks to fetch exactly the amount of data that a particular page needs.
The question is as follows:
Does GET speed change as the item gets bigger? Or would the limit on the number of requests Memcached can handle in parallel get in the way of the second approach, since we would be using a multi-get to fetch the multiple building blocks that make up an object?
I know this is a weird question, but it's vital to the new approach we're going with, since it determines the size of the building blocks we will use and whether or not we will add data to them as needed.
Edit 1:
Bear in mind that we can use the multi-get functionality with the second approach, so we don't have to connect to Memcached and wait for a response for each bit of data we're getting; the multiple keys are fetched together in parallel.

Without getting into "what the heck are you storing in memcache, and why not use another solution (like a DB with a memory table storage engine)?"...
I'd say the cost of the multiple requests is indeed a concern, especially with memcached running on remote nodes/hosts. A single request for a large object is most likely faster overall: you still transfer the same amount of data, but you avoid the separate per-request overhead of fetching the ~200 pieces.
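To make the comparison concrete, here is a rough sketch of the two access patterns, assuming the pecl Memcached extension and made-up key names (getMulti() batches the keys into as few round trips as possible):

    <?php
    // Sketch only: assumes the pecl Memcached extension; keys and IDs are hypothetical.
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);

    // Approach 1: one large item (must stay under memcached's 1 MB default item size).
    $all = $mc->get('objects:all');

    // Approach 2: many small building blocks fetched with a single multi-get.
    $neededIds = [1, 2, 3];                    // IDs actually used on this page
    $keys = [];
    foreach ($neededIds as $id) {
        $keys[] = "object:$id:summary";
    }
    $blocks = $mc->getMulti($keys);            // batched, not one round trip per key

Both approaches move a similar amount of data; the multi-get mainly saves per-key round trips, not the transfer itself, which is why the single large item tends to win when you need most of the data anyway.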
BTW, if you're using APC and you don't have many of these huge items, you can use it instead of memcache for local, user-level memory caching; the maximum size is easily tweakable via the PHP config settings. You won't get the benefit of distributed access/sharing across hosts, but it's fast and simple.
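For what it's worth, a minimal local-cache sketch might look like this, assuming the APCu extension (the successor to APC's user cache) and an illustrative key, payload, and TTL; the shared-memory segment size is set with apc.shm_size in php.ini:

    <?php
    // Sketch assuming the APCu extension; key, payload, and TTL are placeholders.
    $config = apcu_fetch('object_config', $hit);
    if (!$hit) {
        $config = ['colors' => ['red', 'blue'], 'limit' => 200]; // stand-in for the real data
        apcu_store('object_config', $config, 300);               // keep it for 5 minutes
    }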

Related

How to store PHP Trie for all later uses?

I'm designing an application in PHP which involves the Trie data structure.
For time-efficient prefix searches, I'm using a Trie.
I'm constructing the Trie from records in the database.
Now, the database has millions of records, so it is not feasible to create the Trie and then search it for every new user request.
Instead, can I create the Trie only once and somehow store it, so that it does not have to be re-created for every new user request and searching can be done immediately? Is there some way I can cache the created Trie (not just for one user session, but across all user requests) using PHP?
Any help would be much appreciated.
You have a couple of standard options.
Cache the database result in memory, using a simple cache like memcached
Cache using Redis, perhaps taking advantage of some of its extra features. This might involve loading the data into a Redis structure and having your trie search code work against Redis directly, rather than against the database result set.
In either case, you are going to cache the result for some period of time that is acceptable, and since the database result will be in memory in some form, there is no load placed on the RDBMS.
In your related question, you indicated that the raw serialized form of the variable would be about 200 MB in size. That is well within the maximum object size for Redis (512 MB), but could be problematic for memcached. I personally use Redis for most app-server caching these days.
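As a rough illustration (not a drop-in solution), caching a serialized trie in Redis with the phpredis extension could look like the sketch below; the key name, the hourly TTL, and the buildTrieFromDatabase() helper are assumptions:

    <?php
    // Hedged sketch using the phpredis extension; buildTrieFromDatabase() is hypothetical.
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    $cached = $redis->get('trie:prefix-index');
    if ($cached !== false) {
        $trie = unserialize($cached);                 // reuse the prebuilt trie
    } else {
        $trie = buildTrieFromDatabase();              // expensive: millions of DB records
        $redis->setex('trie:prefix-index', 3600, serialize($trie)); // refresh hourly
    }

Keep in mind that unserializing a ~200 MB blob on every request is itself costly, which is another argument for pushing the structure into Redis and querying it there.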

Handling big arrays in PHP

The application I am working on needs to fetch a dataset of around 10 MB at most twice an hour. We use that dataset to display paginated results on the site; a simple search by one of the object properties should also be possible.
Currently we are considering two different ways to implement this:
1.) Store the JSON dataset in the database or in a file on the file system, read it, and loop over it to display results whenever we need to.
2.) Store the JSON dataset in a relational MySQL table, query the results, and loop over them whenever we need to display them.
Replacing/refreshing the results has to be done multiple times per hour, as I said.
Both ways have cons, and I am trying to choose the one that is less evil overall. Reading 10 MB into memory is not a lot; on the other hand, rewriting a table a few times an hour could produce conflicts, in my opinion.
My concern regarding 1.) is how safe the app will be if we read 10 MB into memory all the time. What will happen if multiple users do this at the same time: is this something to worry about, or is PHP able to handle this in the background?
What do you think it will be best for this use case?
Thanks!
When PHP runs on a web server (as it usually does), the server starts new PHP processes on demand when they're needed to handle concurrent requests. A powerful web server may allow fifty or so PHP processes. If each of them is handling this large dataset, you'll need enough RAM for fifty copies. And you'll need to load that data somehow for each new request. Reading 10 MB from a file is not an overwhelming burden unless you have some sort of parsing to do, but it is a burden.
As it starts to handle each request, PHP offers a clean context to the programming environment; PHP is not good at maintaining in-RAM context from one request to the next. You may be able to figure out how to do it, but it's a dodgy solution. If you're running on a server that's shared with other web applications, especially applications you don't trust, you should not attempt this: the other applications would have access to your in-RAM data.
You can limit the number of concurrent PHP processes with your Apache, nginx, or PHP-FPM configuration settings and restrict it to five or ten copies of PHP. But if you have a lot of incoming requests, those requests get serialized and they will slow down.
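For example, under PHP-FPM the process count is capped in the pool configuration (a sketch only; the numbers are illustrative):

    ; Hedged sketch of a PHP-FPM pool file such as www.conf; values are illustrative.
    pm = static
    pm.max_children = 10   ; at most ten PHP workers, so at most ten copies of the 10 MB dataset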
Will this application need to scale up? Will you eventually need a pool of web servers to handle all your requests? If so, the in-RAM solution looks worse.
Does your JSON data look like a big array of objects? Do most of the objects in that array have the same elements as each other? If so, the data maps naturally onto a SQL table (sketched below): you can create a table in which the columns correspond to the elements of your objects. Then you can use SQL to avoid touching every row, that is, every element of the array, every time you display or update data.
(The same sort of logic applies to Mongo, Redis, and other ways of storing your data.)
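A hedged sketch of the SQL-table approach (the question's option 2.)) with PDO might look like the following; the table name, columns, and connection details are assumptions, not part of the question:

    <?php
    // Illustrative only: table/column names and credentials are made up.
    $pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'secret');
    $items = json_decode(file_get_contents('dataset.json'), true);

    // Refresh the table a couple of times per hour, inside one transaction.
    $pdo->beginTransaction();
    $pdo->exec('DELETE FROM items');
    $insert = $pdo->prepare('INSERT INTO items (id, name, price) VALUES (?, ?, ?)');
    foreach ($items as $item) {
        $insert->execute([$item['id'], $item['name'], $item['price']]);
    }
    $pdo->commit();

    // Pagination and search now touch only the rows a page actually needs.
    $page = $pdo->prepare(
        'SELECT id, name, price FROM items WHERE name LIKE ? ORDER BY name LIMIT 20 OFFSET 0'
    );
    $page->execute(['%widget%']);
    $rows = $page->fetchAll(PDO::FETCH_ASSOC);

Writing into a staging table and renaming it over the live one is a common refinement, so readers never see a half-refreshed table.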

Optimization: Where to process data? Database, Server or Client?

I've been thinking a lot about optimization lately. I'm developing an application that makes me think about where I should process data, balancing server load, memory, client load, loading speed, size, etc.
I want to understand better how experienced programmers optimize their code when thinking about processing. Take the following 3 options:
Do some processing at the database level, when I'm getting the data.
Process the data in PHP.
Pass the raw data to the client, and process it with JavaScript.
Which would you guys prefer on which occasions and why? Sorry for the broad question, I'd also be thankful if someone could recommend me good reading sources on this.
The database is the heart of any application, so you should keep the load on the database as light as possible. Here are some suggestions:
Get only the required fields from the database.
Two simple queries are better than a single complex query.
Get data from the database, process it with PHP, and then store the processed data in temporary storage (a cache such as Memcache, Couchbase, or Redis). This data should be set with an expiry time; the expiry depends entirely on the type of data. Caching will reduce your database load to a great extent (a sketch of this cache-aside pattern follows this list).
Data is stored in normalized form. But if you know in advance that certain data will be requested, and producing it requires joins across many tables, then the processed data can be stored in advance in a separate table and served from there.
Send as little data as possible to the client side. A smaller HTML payload saves bandwidth, and the browser will be able to render the page quickly.
Load data on demand (using AJAX, lazy loading, etc.); e.g. if an image is not visible on a page until the user clicks on a tab, that image should only be loaded on the click.
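As a sketch of the caching suggestion above (cache-aside with an expiry), assuming the pecl Memcached extension and placeholder key, TTL, and query/processing helpers:

    <?php
    // Cache-aside sketch; 'report:daily', the 600 s TTL, fetchReportRowsFromDb(),
    // and processRowsForPage() are all assumptions for illustration.
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);

    $report = $mc->get('report:daily');
    if ($report === false) {
        $rows   = fetchReportRowsFromDb();       // placeholder for the real, expensive query
        $report = processRowsForPage($rows);     // hypothetical PHP post-processing step
        $mc->set('report:daily', $report, 600);  // expiry depends on the type of data
    }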
Two thoughts: Computers should work, people should think. (IBM ad from the 1960s.)
"Premature optimization is the root of all evil (or at least most of it) in programming." --Donald Knuth
Unless you are, or are planning to become, Google or Amazon or Facebook, you should focus on functionality. "Make it work before you make it fast." If you are planning to grow to that size, do what they did: throw hardware at the problem. It is cheaper and more likely to be effective.
Edited to add: Since you control the processing power on the server, but probably not on the client, it is generally better to put intensive tasks on the server, especially if the clients are likely to be mobile devices. However, consider network latency, bandwidth requirements, and response time. If you can improve response time by processing on the client, then consider doing so. So, optimize the user experience, not the CPU cycles; you can buy more CPU cycles when you need them.
Finally, remember that the client cannot be trusted. For that reason, some things must be on the server.
As a rule of thumb, process as much of the data in the database as possible. The cost of creating a new connection to run a query is very high, so you want to limit it as much as possible. Even if you have to write some very ugly SQL, performing a JOIN will almost always be quicker than performing two SELECT statements.
PHP should really only be used to format and cache data. If you are performing a ton of data operations on every request, you are probably storing your data in a format that's not very practical. You want to cache anything that does not change often, in an almost ready-to-serve state, using something like Redis or APCu.
Finally, the client should never perform data operations on more than a few objects. You never know the client's resource availability, so always keep the client-side data lean. Perform pagination and sorting on any dataset larger than a few dozen items in the back end. An AJAX request using AngularJS is usually just as quick as sorting 100+ items on an iPad 2.
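For instance, a minimal back-end pagination/sorting endpoint (a sketch only; the products table, the column whitelist, and the page size are assumptions) could look like this:

    <?php
    // Sketch of server-side sorting and pagination; schema and connection details are made up.
    $pdo    = new PDO('mysql:host=localhost;dbname=app', 'user', 'secret');
    $sort   = in_array($_GET['sort'] ?? '', ['name', 'created_at'], true)
            ? $_GET['sort'] : 'name';                 // whitelist the sort column
    $page   = max(1, (int) ($_GET['page'] ?? 1));
    $offset = ($page - 1) * 25;                       // 25 items per page

    $stmt = $pdo->query("SELECT id, name, created_at FROM products
                         ORDER BY $sort LIMIT 25 OFFSET $offset");
    header('Content-Type: application/json');
    echo json_encode($stmt->fetchAll(PDO::FETCH_ASSOC));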
If you would like further details on any aspect of this answer please ask and I will do my best to provide examples or additional detail.

High performance impression tracking

Basically, one part of the metrics that I would like to track is the number of impressions that certain objects receive on our marketing platform.
If you imagine that we display lots of objects, we would like to track each time an object is served up.
Every object is returned to the client through a single gateway/interface. So imagine that a request comes in for a page with some search criteria, and the search request is proxied to our Solr index.
We then get 10 results back.
Each of these 10 results should be regarded as an impression.
I'm struggling to find an incredibly fast and accurate implementation.
Any suggestions on how you might do this? You can throw in any number of technologies. We currently use Gearman, PHP, Ruby, Solr, Redis, MySQL, APC and Memcache.
Ultimately all impressions should eventually be persisted to MySQL, which I could do every hour. But I'm not sure how to store the impressions in memory quickly without affecting the load time of the actual search request.
Ideas (I just added options 4 and 5):
1. Once the results are returned to the client, the client then requests a base64-encoded URI on our platform which contains the IDs of all of the objects that they have been served. This object is then passed to Gearman, which saves the count to Redis. Once an hour, Redis is flushed and the count is incremented for each object in MySQL.
2. After the results have been returned from Solr, loop over them and save directly to Redis (haven't benchmarked this for speed). Repeat the flushing to MySQL every hour.
3. Once the items are returned from Solr, send all the IDs in a single job to Gearman, which will then submit them to Redis.
4. (new idea) Since the maximum number of items returned will be around 20, I could set an X-Application-Objects header with a base64-encoded list of the IDs returned. These IDs (in the header) could then be stripped out by nginx and, using a custom Lua nginx module, written directly to Redis from nginx. This might be overkill, though. The benefit is that I can tell nginx to return the response object immediately while it's writing to Redis.
5. (new idea) Use fastcgi_finish_request() to flush the response back to nginx, and then insert the results into Redis (a rough sketch of this is below).
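A minimal sketch of idea 5, assuming PHP-FPM, the phpredis extension, and hypothetical $responseBody/$servedIds variables and key names:

    <?php
    // Sketch of idea 5: flush the response first, then record impressions.
    // $responseBody and $servedIds are assumed to exist; key names are illustrative.
    echo $responseBody;                 // the already-rendered search response
    fastcgi_finish_request();           // nginx gets the response; the client stops waiting here

    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);
    foreach ($servedIds as $id) {       // the ~10-20 object IDs just served
        $redis->incr("impressions:$id");
    }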
Any other suggestions?
Edit to answer the question:
The reliability of this data is not essential, so long as it is a best guess. I wouldn't want to see a swing of, say, 30% dropped impressions, but I would allow a tolerance of +/- 10% accuracy.
I see your two best options as:
Use the INCR command in Redis to increment counters as you pull the IDs. Use the ID as the key and increment it in Redis. Redis can easily handle hundreds of thousands of increments per second, so that should be fast enough to do without any noticeable client impact. You could even pipeline each request if the PHP language binding supports it; I think it does.
Use Redis as a plain cache. In this option you would simply use a Redis list and RPUSH a string containing the IDs separated by, e.g., a comma. You might use the hour of the day as the key. Then you can have a separate process pull it out by grabbing the previous hour and massaging it however you want into MySQL. If you put an expiry on the keys, they can be cleaned out after a period of time, or you can just delete the keys in the post-processing step.
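Hedged sketches of both options with the phpredis extension; key names and IDs are made up:

    <?php
    // Option 1: pipelined per-object counters. Option 2: an hourly list for later post-processing.
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);
    $servedIds = [101, 102, 103];                 // hypothetical IDs returned by Solr

    $pipe = $redis->multi(Redis::PIPELINE);       // one round trip for all increments
    foreach ($servedIds as $id) {
        $pipe->incr("imp:count:$id");
    }
    $pipe->exec();

    $redis->rPush('imp:' . date('YmdH'), implode(',', $servedIds));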
You can also use a read slave to do the exporting to MySQL if you have very high Redis traffic, or just to offload it and get a backup of it as a bonus. If you do that, you can set the master Redis instance not to flush to disk, increasing write performance.
For some additional options regarding a more extended use of Redis' features for this sort of tracking, see this answer. You could also avoid the MySQL portion and pull the data from Redis, keeping the overall system simpler.
I would do something like #2, and hand the data off to the fastest queue you can to update Redis counters. I'm not that familiar with Gearman, but I bet it's slow for this. If your Redis client supports asynchronous writes, I'd use that, or put this in a queue on a separate thread. You don't want to slow down your response waiting to update the counters.

Multiple memcache lookups vs. one lookup (with big output)

I am working on an application using a memcache pool (5 servers) and some processing nodes. I have two different possible approaches, and I was wondering if you have any comments on how they compare performance-wise (primarily speed):
1. I extract a big chunk of data from memcache once per request, iterate over it, and discard the bits I don't need for the particular request.
2. I extract small bits from memcached and only extract the ones I need, i.e. I extract the value of a and, based on the value of a, extract the value of either b or c, then use this combination to find the next key I want to extract.
The difference between the two is that the number of memcached lookups (against a pool of servers) is lower in option 1, but the size of the response is larger. Has anyone seen benchmarking reports around this before?
Unfortunately I can't use a better key based on the request directly, as I don't have enough memcache capacity to support all possible combinations of values, so I have to construct some of it at run time.
Thanks
You would have to benchmark this for your own setup. The parts that would matter would be the time spent on:
requesting a large amount of data from memcache + retrieving it + extracting the needed data from the response
sending several requests to memcache + retrieving the data
Basically, the first thing you have to measure is how large the overhead of interacting with your cache pool is. And there is the small matter of how this whole thing will react when load increases: what might be fast now can turn out to be a terrible decision later, when the users start pouring in.
This kind of depends on your definition of "large chunk". Are we talking megabytes here, or an array with 100 keys? You also have to consider that PHP still needs to process that information.
There are two things you can do at this point:
take a hard look at how you are storing the information. Maybe you can cut it down to two small requests: one to retrieve the specific data for the conditions, and another to get the conditional information.
set up your own benchmark for your server; some random article on the web will not be relevant to your system architecture. A rough sketch of such a benchmark follows below.
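For example, a crude starting point (pecl Memcached extension, hypothetical keys, cache already primed) might be:

    <?php
    // Rough benchmark sketch; prime 'config:blob' and 'config:part:*' before timing,
    // and repeat the loop many times for anything resembling a stable number.
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);

    $start = microtime(true);
    $big   = $mc->get('config:blob');             // approach 1: one large item
    $t1    = microtime(true) - $start;

    $keys = [];
    for ($i = 0; $i < 50; $i++) {
        $keys[] = "config:part:$i";               // approach 2: many small items
    }
    $start = microtime(true);
    $parts = $mc->getMulti($keys);
    $t2    = microtime(true) - $start;

    printf("large item: %.4f s, multi-get of 50 keys: %.4f s\n", $t1, $t2);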
I know this is not the answer you wanted to hear, but that's my two cents .. here ya go.
