One of the metrics I would like to track is the number of impressions that certain objects receive on our marketing platform.
We display lots of objects, and we would like to track each time an object is served.
Every object is returned to the client through a single gateway/interface. For example, a request comes in for a page with some search criteria, and the search request is proxied to our Solr index.
We then get 10 results back.
Each of these 10 results should be regarded as an impression.
I'm struggling to find an incredibly fast and accurate implementation.
Any suggestions on how you might do this? You can throw in any number of technologies. We currently use Gearman, PHP, Ruby, Solr, Redis, MySQL, APC and Memcache.
Ultimately all impressions should eventually be persisted to MySQL, which I could do every hour. But I'm not sure how to store the impressions in memory fast without affecting the load time of the actual search request.
Ideas (I just added options 4 and 5):
1.) Once the results are returned to the client, the client then requests a base64-encoded URI on our platform which contains the IDs of all the objects they have been served. This is then passed to Gearman, which saves the count to Redis. Once an hour, Redis is flushed and the count is incremented for each object in MySQL.
2.) After the results have been returned from Solr, loop over them and save directly to Redis (haven't benchmarked this for speed). Repeat the hourly flush to MySQL.
3.) Once the items are returned from Solr, send all the IDs in a single job to Gearman, which then submits them to Redis.
4.) (new idea) Since the maximum number of items returned will be around 20, I could set an X-Application-Objects header containing a base64-encoded list of the returned IDs. These IDs (in the header) could then be stripped out by nginx and, using a custom Lua nginx module, written directly to Redis from nginx. This might be overkill, though. The benefit is that I can tell nginx to return the response to the client immediately while it's writing to Redis.
5.) (new idea) Use fastcgi_finish_request() to flush the response back to nginx, and then insert the results into Redis (a sketch follows this list).
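Here is a minimal sketch of idea 5, assuming PHP-FPM (fastcgi_finish_request() is FPM-only) and the phpredis extension; the variable names and the impressions:<id> key scheme are illustrative, not part of the plan above.

<?php
// Placeholder for the IDs that came back from Solr.
$objectIds = [101, 102, 103];

// Send the buffered response to nginx and close the connection;
// the client stops waiting at this point.
fastcgi_finish_request();

// Record the impressions without adding any perceived latency.
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->multi(Redis::PIPELINE);
foreach ($objectIds as $id) {
    $redis->incr("impressions:$id");
}
$redis->exec();

The pipeline batches the INCR commands into a single round trip, so even 20 IDs cost roughly one network hop.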
Any other suggestions?
Edit, to answer a question:
The reliability of this data is not essential, so long as it is a best guess. I wouldn't want to see a swing of, say, 30% dropped impressions, but I would allow a tolerance of +/- 10% accuracy.
I see your two best options as:
1.) Using the INCR command in Redis to increment counters as you pull the IDs: use the ID as a key and increment it in Redis. Redis can easily handle hundreds of thousands of increments per second, so that should be fast enough to do without any noticeable client impact. You could even pipeline each request if the PHP language binding supports it. I think it does.
2.) Use Redis as a plain cache. In this option you would simply use a Redis list and RPUSH a string containing the IDs separated by, e.g., a comma. You might use the hour of the day as the key. Then you can have a separate process pull it out by grabbing the previous hour and massaging it however you want into MySQL. If you put an expiry on the keys, you can have them cleaned out after a period of time, or just delete them in the post-processing step. (A sketch of this option follows the list.)
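A minimal sketch of option 2 with phpredis, assuming an hour-stamped key scheme and a MySQL table (impressions, with a unique key on object_id) invented for illustration:

<?php
// --- In the request path: one RPUSH per search response ---
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$servedIds = [101, 102, 103];                 // from Solr
$key = 'impressions:' . date('YmdH');         // e.g. impressions:2013070113
$redis->rPush($key, implode(',', $servedIds));
$redis->expire($key, 3 * 3600);               // safety net if the drain job dies

// --- In the hourly cron job: drain the previous hour into MySQL ---
$prevKey = 'impressions:' . date('YmdH', time() - 3600);
$counts = array();
while (($batch = $redis->lPop($prevKey)) !== false) {
    foreach (explode(',', $batch) as $id) {
        $counts[$id] = isset($counts[$id]) ? $counts[$id] + 1 : 1;
    }
}

$pdo = new PDO('mysql:host=localhost;dbname=metrics', 'user', 'pass');
$stmt = $pdo->prepare(
    'INSERT INTO impressions (object_id, hits) VALUES (?, ?)
     ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits)'
);
foreach ($counts as $id => $hits) {
    $stmt->execute([$id, $hits]);
}
$redis->del($prevKey);

Aggregating in PHP before touching MySQL keeps the hourly job down to one statement per distinct object rather than one per impression.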
You can also do the export to MySQL from a read slave if you have very high Redis traffic, or just to offload it and get a backup of the data as a bonus. If you do that, you can set the master Redis instance not to flush to disk, increasing write performance.
For some additional options regarding a more extended use of Redis' features for this sort of tracking, see this answer. You could also skip the MySQL portion and pull the data straight from Redis, keeping the overall system simpler.
I would do something like #2, and hand the data off to the fastest queue you can to update Redis counters. I'm not that familiar with Gearman, but I bet it's slow for this. If your Redis client supports asynchronous writes, I'd use that, or put this in a queue on a separate thread. You don't want to slow down your response waiting to update the counters.
Related
I've recently implemented Redis in one of my Laravel projects. It's currently more of a technical exercise than production use, as I want to see what it's capable of.
What I've done is created a list of payment transactions. What I'm pushing to the list is the payload which I receive from a webhook every time a transaction is processed. The payload is essentially an object containing all the information to do with that particular transaction.
I've created a VueJS frontend that displays all the data in a table, with pagination so it shows 10 rows at a time.
Initially this was working super quickly, but now that the list contains 30,000 rows, which is about 11 MB worth of data, the request is taking about 11 seconds.
I think the issue here is that I'm using a list and am fetching all the rows from the list using LRANGE.
The reason I used a list was because it has the LPUSH command so that latest transactions go to the start of the list.
I decided to do a test where I got all the data from the list and outputted the value to a blank page and this took about the same time so it's not an issue with Vue, Axios, etc.
Firstly, is this read speed normal? I've always heard that Redis is blazing fast.
Secondly, is there a better way to increase read performance when using Redis?
Thirdly, am I using the wrong data type?
In time I need to be able to store 1m rows of data.
As I understand it, you fetch all 30,000 rows on every transaction update and then paginate them in the frontend. In my opinion, the right strategy is to fetch lighter chunks of data with each request.
For example, apply Laravel-style pagination on the server side so each request returns only one page, as sketched below.
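A minimal sketch of server-side paging with LRANGE, assuming Laravel's Redis facade; the 'transactions' key and the parameter names are illustrative:

<?php
use Illuminate\Support\Facades\Redis;

function transactionsPage(int $page = 1, int $perPage = 10): array
{
    $start = ($page - 1) * $perPage;
    $stop  = $start + $perPage - 1;

    // LRANGE is O(start + count), so a 10-row page near the head of the
    // list stays fast even at a million entries. LPUSH keeps the newest
    // transactions at index 0, which is exactly where page 1 reads from.
    $rows = Redis::lrange('transactions', $start, $stop);

    return array(
        'data'  => array_map('json_decode', $rows),
        'total' => Redis::llen('transactions'),
        'page'  => $page,
    );
}

This drops the payload per request from 11 MB to a few kilobytes, which is where the 11 seconds were going.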
In my opinion:
Firstly: as you know, Redis really is blazing fast, because Redis data always lives in memory. If reading 11 MB of data takes about 11 seconds, you should check your bandwidth.
Secondly: I'm sorry, I don't know how to increase read performance in this environment.
Thirdly: I think your choice of data type is OK.
So, check the bandwidth to your Redis server first.
The application I am working on needs to obtain a dataset of around 10 MB at most twice an hour. We use that dataset to display paginated results on the site; simple search by one of the object properties should also be possible.
Currently we are thinking about two different ways to implement this:
1.) Store the JSON dataset in the database or in a file on the file system, read it, and loop over it to display results whenever we need to.
2.) Store the JSON dataset in a relational MySQL table, then query the results and loop over them whenever we need to display them.
Replacing/refreshing the results has to be done multiple times per hour, as I said.
Both ways have cons. I am trying to choose the way that is less evil overall. Reading 10 MB into memory is not a lot, but on the other hand, rewriting a table a few times an hour could produce conflicts, in my opinion.
My concern regarding 1.) is how safe the app will be if we read 10 MB into memory all the time. What will happen if multiple users do this at the same time? Is this something to worry about, or can PHP handle it in the background?
What do you think it will be best for this use case?
Thanks!
When php runs on a web server (as it usually does) the server starts new php processes on demand when they're needed to handle concurrent requests. A powerful web server may allow fifty or so php processes. If each of them is handling this large data set, you'll need to have enough RAM for fifty copies. And, you'll need to load that data somehow for each new request. Reading 10mb from a file is not an overwhelming burden unless you have some sort of parsing to do. But it is a burden.
As it starts to handle each request, php offers a clean context to the programming environment. php is not good at maintaining in-RAM context from one request to the next. You may be able to figure out how to do it, but it's a dodgy solution. If you're running on a server that's shared with other web applications -- especially applications you don't trust -- you should not attempt to do this; the other applications will have access to your in-RAM data.
You can control the concurrent processes with Apache or nginx configuration settings, and restrict it to five or ten copies of php. But if you have a lot of incoming requests, those requests get serialized and they will slow down.
Will this application need to scale up? Will you eventually need a pool of web servers to handle all your requests? If so, the in-RAM solution looks worse.
Does your json data look like a big array of objects? Do most of the objects in that array have the same elements as each other? If so, it maps naturally onto a SQL table. You can make a table in which the columns correspond to the elements of your object. Then you can use SQL to avoid touching every row -- every element of each array -- every time you display or update data.
(The same sort of logic applies to Mongo, Redis, and other ways of storing your data.)
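A sketch of what the refresh job in option 2 might look like, assuming PDO and a table and columns (items, id, name, price) invented for illustration:

<?php
// Decode the fresh 10 MB dataset.
$items = json_decode(file_get_contents('dataset.json'), true);

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Rebuild the table inside one transaction (InnoDB) so readers never
// see a half-written dataset; this also sidesteps the refresh
// conflicts the question worries about.
$pdo->beginTransaction();
$pdo->exec('DELETE FROM items');
$stmt = $pdo->prepare('INSERT INTO items (id, name, price) VALUES (?, ?, ?)');
foreach ($items as $item) {
    $stmt->execute(array($item['id'], $item['name'], $item['price']));
}
$pdo->commit();

DELETE rather than TRUNCATE matters here: TRUNCATE is DDL and commits implicitly, which would break the atomicity of the swap.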
I have the following web site:
The user inputs some data, and based on it the server generates a lot of results that need to be displayed back to the user. I am calculating the data with PHP, storing it in a MySQL DB, and displaying it in DataTables with server-side processing. The data needs to be saved for a limited time: on every whole hour, the whole table holding it is DROPPED and re-created.
The maximum observed load is 7,000 sessions/users per day, with a max of 400 users at a single time. Every hour we have over 50 million records inserted in the main table. We are using a dedicated server with an Intel i7 and 24 GB RAM, with an HDD.
The problem is that when more people (>100 at a time) use the site, MySQL cannot handle the load, and MySQL plus the hard disk become the bottleneck. The user has to wait minutes even for a few thousand results. The disk is an HDD, and for now there is no option to put in an SSD.
The QUESTION(S):
Can replacing MySQL with Redis improve the performance, and by how much?
How do I store the produced data in Redis so I can retrieve it for one user, sort it by any of the values, and filter it?
I have the following data in PHP:
$user_data = array(
    array("id"=>1, "session"=>"3124", "set"=>"set1", "int1"=>1, "int2"=>11, "int3"=>111, "int4"=>1111),
    array("id"=>2, "session"=>"1287", "set"=>"set2", "int1"=>2, "int2"=>22, "int3"=>222, "int4"=>2222)...
);
$user_data can be an array with a length from 1 to 1-2 million (I am calculating it and inserting it into the DB in chunks of 10,000).
I need to store data in Redis for at least 400 such users and be able to retrieve the data for a particular user in chunks of 10/20 for pagination. I also need to be able to sort by any of the fields set (string), int1, int2, ... (I have around 22 int fields), and to filter by any of the integer fields (similar to an SQL WHERE clause: 9000 < int4 < 100000).
Also, can Redis do something similar to SQL's WHERE set LIKE '%value%'?
Redis is probably a good fit for your problem, if you can hold all your data in memory. But you must rethink your data structure. Redis is very different from a relational database, and there is no direct migration.
As for your questions:
It can probably help with performance. How much will depend on your use case and data structure. Your constraint will no longer be the hard disk, but it may be something else.
Redis has no concept similar to SQL's ORDER BY or WHERE. You will be responsible for maintaining your own indices and filters.
I would create a hash (HSET) for every "record" and then use several sorted sets (ZSETs) to create indexes over those records. (If you really need to order by any field, you'll need one ZSET per field.)
As for filters, the ZSETs used for indexes will probably be useful for filtering ranges of int values. A sketch follows.
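A minimal sketch of that layout with phpredis; the key scheme (user:<session>:...) and field names are illustrative:

<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// One hash per record...
$row = array('id' => 1, 'session' => '3124', 'set' => 'set1',
             'int1' => 1, 'int2' => 11, 'int3' => 111, 'int4' => 1111);
$rowKey = "user:{$row['session']}:row:{$row['id']}";
$redis->hMSet($rowKey, $row);

// ...and one sorted set per sortable/filterable integer field, scored
// by that field's value, with the row key as the member.
foreach (array('int1', 'int2', 'int3', 'int4') as $field) {
    $redis->zAdd("user:{$row['session']}:idx:$field", $row[$field], $rowKey);
}

// Filter: the equivalent of WHERE 9000 < int4 < 100000 (exclusive bounds).
$matches = $redis->zRangeByScore('user:3124:idx:int4', '(9000', '(100000');

// A page of 20 rows ordered by int2 (pagination via the limit option).
$rows = array();
$page = $redis->zRangeByScore('user:3124:idx:int2', '-inf', '+inf',
                              array('limit' => array(0, 20)));
foreach ($page as $key) {
    $rows[] = $redis->hGetAll($key);
}

Note the cost: with 22 indexed fields, every record is written 23 times (one hash plus 22 ZSET entries), which multiplies both insert time and memory use.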
Unfortunately, for the LIKE query I really don't have an answer. When I need advanced search capabilities, I usually use ElasticSearch (in combination with Redis and/or MySQL).
1. Can replacing MySQL with Redis improve the performance, and by how much?
Yes, Redis can improve your basic read/write performance because it stores the information directly in memory. This post describes a performance increase by a factor of 3, but the post dates from 2009, so the numbers may have changed since then.
However, this performance gain is only relevant as long as you have enough memory. Once you exceed the allotted amount of memory, your server will start swapping to disk, drastically reducing Redis performance.
Another thing to keep in mind is that information stored in Redis is not guaranteed to be persistent by default: under the default snapshot settings, the data set is only written to disk periodically (for example, every 60 seconds if enough keys have changed). Changes made since the last snapshot will be lost on a server restart or power loss.
2. How do I store the produced data in Redis so I can retrieve it for one user, sort it by any of the values, and filter it?
Redis is a data store with a different approach from traditional relational databases. It does not offer complex sorting; basic sorting can be done through sorted sets and the SORT command, and anything beyond that will have to be done by your PHP server.
Redis does not have any built-in search support--it will have to be implemented by your PHP server.
3. Conclusion
In my opinion, the best way to handle what you are asking is to use a Redis server for caching and the MySQL server for storing information that needs to be persistent (if you don't have any information that has to be persistent, you can use just the Redis server).
You said that
The data needs to be saved for a limited time - on every whole hour
the whole table with it is DROPPED and re-created.
which is perfect for Redis. Redis supports a TTL through the EXPIRE command on keys, which automatically deletes a key after a set amount of time. This way you don't need to drop and re-create any tables--Redis does it for you.
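A minimal sketch of the TTL approach with phpredis; the key names are illustrative:

<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Store one record and let Redis delete it after an hour; no hourly
// DROP/CREATE job is needed.
$key = 'results:session:3124:row:1';
$redis->hMSet($key, array('set' => 'set1', 'int1' => 1, 'int2' => 11));
$redis->expire($key, 3600);

// For plain string values, SETEX combines SET and EXPIRE in one call.
$redis->setEx('results:session:3124:count', 3600, '42');

One caveat: EXPIRE works per key, so any index ZSETs need their own TTLs as well; expired row keys are not automatically removed from a ZSET's members.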
Okay, so I have some weird-ish questions about Memcached. The basic idea of my caching technique is to save data requested by my PHP script in a Memcached server. The main issue my team and I faced is that saving large amounts of data can sometimes exceed the 1 MB limit on item size in Memcached.
To further explain the approach imagine the following:
We have lots of data to configure a certain object, and that data contains a lot of text, numbers, etc. We need to save almost 200 of those objects, so the first approach we went with was to cache the entire 200-ish objects as one big item in Memcached. That item may surpass the 1 MB limit, so we figured we could go with a new approach.
The new approach we went with breaks the data configuring the object down into smaller building blocks (since we don't use all the data on the same page), and we then use those building blocks to fetch exactly the amount of data that we would use on that particular page.
The question is as follows:
Does GET speed change as items get bigger? Or would the limit on the number of requests Memcached can handle in parallel get in the way of the second approach, since we would then use a multi-get to fetch the multiple building blocks configuring the object?
I know this is a weird question but it's vital to the new approach that we're going with since it would determine the size of the building blocks that we will use and whether or not we will add data to it if we need to.
Edit 1:
Bear in mind that we can use multi-get with the second approach, so we don't have to connect to Memcached and wait for a response for each piece of data we're getting. Parallel requests will be used to fetch the multiple keys.
Without getting into 'what the heck are you storing in memcache, and why not use another solution (like a DB with a memory table storage engine)'...
I'd say the cost of the multiple requests is indeed a concern--especially with memcached running on remote nodes/hosts. A single request for a large object is most likely faster overall--you still transfer the same amount of data, but you avoid the per-request overhead of 200 separate pieces.
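For comparison, here is what the two approaches look like with the PHP Memcached extension; the key names are illustrative:

<?php
$m = new Memcached();
$m->addServer('127.0.0.1', 11211);

// Approach 1: one big item. The set() fails if the serialized value
// exceeds the server's 1 MB item limit.
$all = $m->get('objects:all');

// Approach 2: fetch only the blocks this page needs. getMulti batches
// all keys into a single round trip, so the cost is one request plus
// the transferred bytes, not 200 request/response cycles.
$keys   = array('object:17:title', 'object:17:body', 'object:42:title');
$blocks = $m->getMulti($keys);

So if you do go with the building-block approach, getMulti preserves most of the single-request advantage.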
BTW, if you're using APC and you don't have many of these huge items, you can use it instead of memcached for local user-level memory caching--the maximum size is easily tweakable via the PHP config settings. You won't get the benefit of distributed access/sharing across hosts, but it's fast and simple.
Well, this is the thing. Let's say that my future PHP CMS needs to handle 500k visitors daily, and I need to record them all in a MySQL database (referrer, IP address, time, etc.). That means I need to insert 300-500 rows per minute and update 50 more. The main problem is that the script would call the database every time I want to insert a new row, which is every time someone hits a page.
My question: is there any way to cache incoming hits locally first (and what is the best solution for that: APC, CSV, ...?) and periodically send them to the database, every 10 minutes for example? Is this a good solution, and what is the best practice for this situation?
500k daily is just 5-7 queries per second. If each request is served in 0.2 s, you will have almost no simultaneous queries, so there is nothing to worry about.
Even if you have 5 times more users, it should all work fine.
You can just use INSERT DELAYED and tune your MySQL.
About tuning: http://www.day32.com/MySQL/ has a very useful script (it will change nothing, just show you tips on how to optimize your settings).
You can use memcached or APC to write a log there first, but with INSERT DELAYED, MySQL will do almost the same work, and do it better :)
Do not use files for this. The DB will handle locks much better than PHP. It's not trivial to write effective mutexes, so let the DB (or memcached, APC) do this work.
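A minimal sketch of the INSERT DELAYED approach with PDO; the table and columns are illustrative. (Note that DELAYED only works with MyISAM-type tables, as mentioned further down, and was removed in MySQL 5.7.)

<?php
// The client regains control immediately; MySQL queues the row and
// writes it when the table is free.
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');
$pdo->prepare('INSERT DELAYED INTO hits (ip, referrer, hit_time)
               VALUES (?, ?, NOW())')
    ->execute(array(
        $_SERVER['REMOTE_ADDR'],
        isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '',
    ));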
A frequently used solution:
You could implement a counter in memcached which you increment on each visit, and push an update to the database for every 100 (or 1,000) hits, as sketched below.
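A minimal sketch of that counter with the Memcache extension; the key, table, and column names are illustrative. The get/reset pair is not atomic, so counts are approximate under heavy concurrency, which fits this use case:

<?php
$mc = new Memcache();
$mc->connect('127.0.0.1', 11211);

// increment() fails when the key doesn't exist yet, hence the add().
if ($mc->increment('hits') === false) {
    $mc->add('hits', 1);
}

// Every 100 hits, fold the batch into MySQL with a single UPDATE.
if ((int) $mc->get('hits') >= 100) {
    $mc->set('hits', 0); // reset first; a slight undercount is possible
    $pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');
    $pdo->prepare('UPDATE counters SET hits = hits + 100 WHERE page = ?')
        ->execute(array('home'));
}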
We do this by storing hits locally on each server in a CSV file, then running a minutely cron job to push the entries into the database. This is to avoid needing a highly available MySQL database more than anything; the database should be able to cope with that volume of inserts without a problem. A sketch of the pattern follows.
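A sketch of that pattern, with paths, table, and columns of my own choosing; it assumes LOCAL INFILE is enabled, and a real version would need proper CSV escaping for referrers containing commas:

<?php
// --- In the request path: append one line per hit ---
// FILE_APPEND | LOCK_EX keeps concurrent PHP processes from
// interleaving partial lines.
$line = implode(',', array(
    isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '-',
    isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '-',
    time(),
)) . "\n";
file_put_contents('/var/log/app/hits.csv', $line, FILE_APPEND | LOCK_EX);

// --- In the minutely cron script: rotate, then bulk-load ---
rename('/var/log/app/hits.csv', '/var/log/app/hits.import.csv');
$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass',
               array(PDO::MYSQL_ATTR_LOCAL_INFILE => true));
$pdo->exec("LOAD DATA LOCAL INFILE '/var/log/app/hits.import.csv'
            INTO TABLE hits FIELDS TERMINATED BY ','
            (ip, referrer, hit_time)");
unlink('/var/log/app/hits.import.csv');

Renaming before loading means hits that arrive during the import simply start a fresh file rather than being lost.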
Save them to a directory-based database (or flat file, it depends) somewhere, and at a certain time use PHP code to insert/update them into your MySQL database. Your PHP code can be executed periodically using cron, so check if your server has cron so that you can set the schedule, say every 10 minutes.
Have a look at this page: http://damonparker.org/blog/2006/05/10/php-cron-script-to-run-automated-jobs/. Some code has already been written and is ready for you to use :)
One way would be to use the Apache access.log. You can get quite fine-grained logging by using the cronolog utility with Apache. Cronolog will handle storing a very large number of rows in files, and can rotate them based on volume, day, year, etc. Using this utility will keep Apache from suffering from log writes.
Then, as others have said, use a cron-based job to analyse these logs and push whatever summarized or raw data you want into MySQL.
You may think of using a dedicated database (or even a dedicated database server) for write-intensive jobs, with specific settings. For example, you may not need InnoDB storage and could keep simple MyISAM. And you could even think of another database storage engine (as said by @Riccardo Galli).
If you absolutely HAVE to log directly to MySQL, consider using two databases. One optimized for quick inserts, which means no keys other than possibly an auto_increment primary key. And another with keys on everything you'd be querying for, optimized for fast searches. A timed job would copy hits from the insert-only to the read-only database on a regular basis, and you end up with the best of both worlds. The only drawback is that your available statistics will only be as fresh as the previous "copy" run.
I have also previously seen a system which records the data into a flat file on the local disk on each web server (be careful to do only atomic appends if using multiple processes), and periodically writes it asynchronously into the database using a daemon process or cron job.
This appears to be the prevailing optimal solution: your web app remains available if the audit database is down, and users don't suffer poor performance if the database is slow for any reason.
The only thing I can say is to be sure that you have monitoring on these locally generated files; a build-up definitely indicates a problem, and your Ops engineers might not otherwise notice.
For a high number of write operations and this kind of data, you might find MongoDB or CouchDB more suitable.
Because INSERT DELAYED is only supported by MyISAM, it is not an option for many users.
We use MySQL Proxy to defer the execution of queries matching a certain signature.
This will require a custom Lua script; example scripts are here, and some tutorials are here.
The script will implement a Queue data structure for storage of query strings, and pattern matching to determine what queries to defer. Once the queue reaches a certain size, or a certain amount of time has elapsed, or whatever event X occurs, the query queue is emptied as each query is sent to the server.
You can use a queue strategy using Beanstalkd or IronQ.