Scaling With Session Data in Database - php

My Silex app has always had the session data stored on the server, but I want to move to the mysql database so that I'm not so tied to a single webserver. I'm wondering about performance, though. I plan to use the PdoSessionHandler. My question is this: currently I have about 177K stored sessions. Will the garbage collection be slow? Will I be taking a performance hit by moving to the database from the filesystem?

Are you going to have an index on the session expiry? If there is no index, then yes, it will be slow. OTOH, how fast do you think searching 177,000 files on disk is? Probably a lot slower than using a database to do the thing it is expressly designed to do.
Will you take a performance hit? Probably. Will it be significant? Depends what else the system is doing with the database, the configuration of the DB, and the server it runs on.
In short - yes, there will be an inevitable cost to using the database as a session store, but it could be worth it for the abilities it gives you.
I'd suggest using Redis backed to disk, though.
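For reference, here is a minimal sketch of wiring up Symfony's PdoSessionHandler (which Silex uses) against a table with an indexed time column. The schema and option names are illustrative and vary between Symfony versions, so check the defaults for your version:

    <?php
    use Symfony\Component\HttpFoundation\Session\Storage\Handler\PdoSessionHandler;

    // An indexed expiry/time column is what keeps GC fast:
    // CREATE TABLE sessions (
    //     sess_id   VARBINARY(128) NOT NULL PRIMARY KEY,
    //     sess_data BLOB           NOT NULL,
    //     sess_time INT UNSIGNED   NOT NULL,
    //     KEY idx_sess_time (sess_time)
    // );
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    $handler = new PdoSessionHandler($pdo, array(
        'db_table'    => 'sessions',
        'db_id_col'   => 'sess_id',
        'db_data_col' => 'sess_data',
        'db_time_col' => 'sess_time',
    ));
    session_set_save_handler($handler, true);
    session_start();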

Honestly, using a MySQL database as the de facto session storage in the name of scaling is one of the worst mistakes you can make in distributed session storage.
Let me explain why...
Your MySQL database is likely already your biggest bottleneck, in that PHP probably connects to it for just about everything else persistent anyway. However, there are probably a handful of request URIs where PHP can rely on a cache and avoid hitting your db entirely. Use sessions on those pages and there goes that advantage: every request pays the connection overhead again.
The cost of deleting rows from a large table (in your case, for GC) in MySQL can be extremely expensive at scale. With MyISAM the entire table is locked (worst outcome: the entire site blocks during a large GC cycle). With InnoDB the DBMS has to write all of the undo information to its log, adding I/O and sometimes causing sluggishness depending on fragmentation. This can prove especially problematic if you have re-indexing issues too.
There are already better alternatives and they require you to write less code!
My recommendation is to just use something like memcached instead, where the connection overhead is significantly lower, there are no db schemas to write, and the drivers for the session handler already exist in PHP by default. Throw something like igbinary on top of memcached and you have blazing-fast serialization coupled with cheaper in-memory session handling that can easily be scaled up and distributed with minimal effort and side effects. For example, AWS offers ElastiCache as a load-balancing and replication solution for memcached/Redis in their PaaS. There's also Twemproxy if you're not on AWS.
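As a sketch, assuming the pecl memcached and igbinary extensions are installed (the host/port are examples), the switch can be as small as:

    <?php
    // Point PHP's native session machinery at memcached; these
    // settings can equally go in php.ini.
    ini_set('session.save_handler', 'memcached');
    ini_set('session.save_path', '127.0.0.1:11211');
    // Serialize session payloads with igbinary instead of PHP's default.
    ini_set('session.serialize_handler', 'igbinary');
    session_start();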

You should probably pivot to storing session data in Redis. It serves blazingly fast queries from memory, but it can also recover and repopulate that memory after a crash from its on-disk log.
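A minimal sketch, assuming the phpredis extension (host/port are examples):

    <?php
    // phpredis ships a native session handler.
    ini_set('session.save_handler', 'redis');
    ini_set('session.save_path', 'tcp://127.0.0.1:6379');
    session_start();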

Related

What is the better way to store temporary data [Memcache or MySQL]?

I'm using PHP and want to store temporary data. I could put it in MySQL, but I also have a 64GB Memcached server, so I'd like to use that instead. However, I have doubts about the relative performance of MySQL and Memcached.
What is the fastest and most reliable way to store and fetch temporary data: MySQL or Memcached?
The answer is...it depends. Memcached is very fast and, even better, you can add multiple machines or shards to your memcached pool more or less invisibly to your application. MySQL has a variety of storage engines (and potentially replication) which can persist your data if you need to store it.
So it depends on how "temporary" your temp data needs to be. A good way of thinking about it is this: if your server(s) reboot, do you care if the data is there or not?
Data in a memcached shard will be gone if the memcached instance is restarted, but it is often much faster to access than data in a MySQL table. Data in MySQL will be more persistent but will be slower to access.
MySQL does have a memory storage engine. Memory storage is fast compared to most other MySQL lookups, and data in it will disappear if the server or service restarts.
If you are choosing between that and Memcached, usually Memcached is the better choice (although it always will depend on your own data - YMMV). The MySQL memory engine has strict size limits and you will need to prune your data or risk hitting them. With Memcached, you may hit size limits but the Memcached service will simply start evicting values instead of throwing errors. Evictions aren't great but at least they won't cause errors you need to handle and you can usually prevent them by watching your metrics and setting appropriate TTLs on your content.
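As an illustration of the TTL/eviction point, a sketch using the pecl memcached extension (the key names, payload, and TTLs are made up):

    <?php
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);

    // Store temporary data with a 20-minute TTL; memcached expires it for us.
    $profileArray = array('name' => 'Alice');
    $mc->set('user:42:profile', $profileArray, 1200);

    $profile = $mc->get('user:42:profile');
    if ($profile === false && $mc->getResultCode() === Memcached::RES_NOTFOUND) {
        // Cache miss (expired or evicted): rebuild from the persistent store.
    }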

When not to use memcache

Currently we have a site which makes a lot of API calls to our parent site for user details and other data. We are planning to cache all the details on our side, and I am planning to use memcache for this. This is a live site, so we are expecting heavier traffic in the coming days (not Facebook-level, but then my server isn't like theirs either ;) ). I'd like your opinion on what issues we might face with memcache, and counterarguments for why we shouldn't go with it. Any other alternative would also help.
https://github.com/steveyen/community-site/blob/master/db_doc/main/WhyNotMemcached.wiki
Memcached is terrific! But not for every situation...
You have objects larger than 1MB.
Memcached is not for large media and streaming huge blobs.
Consider other solutions like: http://www.danga.com/mogilefs
You have keys larger than 250 chars.
If so, perhaps you're doing something wrong?
And, see this mailing list conversation on key size for suggestions.
Your hosting provider won't let you run memcached.
If you're on a low-end virtual private server (a slice of a machine), virtualization tech like vmware or xen might not be a great place to run memcached. Memcached really wants to take over and control a hunk of memory -- if that memory gets swapped out by the OS or hypervisor, performance goes away. Using virtualization, though, just to ease deployment across dedicated boxes is fine.
You're running in an insecure environment.
Remember, anyone can just telnet to any memcached server. If you're on a shared system, watch out!
You want persistence. Or, a database.
If you really just wish that memcached had a SQL interface, then you probably need to rethink your understanding of caching and memcached.
You should implement a generic caching layer for the API calls first. Within that layer you can then change the strategy for which backend to use. If you later see that memcache is not a good fit, you can actually switch (and/or, as a test, monitor how it performs compared with other backends).
Even better, you can build this on the filesystem first quite easily, without the hurdle of relying on another daemon, and get started with caching right away - perhaps the filesystem is already enough for your caching needs?
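A sketch of what such a backend-agnostic layer could look like (the interface and class names are made up for illustration):

    <?php
    interface CacheBackend {
        public function get($key);
        public function set($key, $value, $ttl);
    }

    // Filesystem backend to start with; a memcached-backed class could
    // later implement the same interface without touching calling code.
    class FileCache implements CacheBackend {
        private $dir;
        public function __construct($dir) { $this->dir = $dir; }
        public function get($key) {
            $file = $this->dir . '/' . md5($key);
            // The file's mtime is (ab)used as its expiry timestamp.
            if (!is_file($file) || filemtime($file) < time()) {
                return false;
            }
            return unserialize(file_get_contents($file));
        }
        public function set($key, $value, $ttl) {
            $file = $this->dir . '/' . md5($key);
            file_put_contents($file, serialize($value), LOCK_EX);
            touch($file, time() + $ttl);
        }
    }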
Memcache is fast, but it can also use a lot of memory if you want to get the most out of it. Whenever you hit the disk for I/O, you're increasing the latency of your application. Pull items that are frequently accessed and put them in memcache. For my large-scale deployments, we cache sessions there because both database and filesystem session storage are too slow.
A recommendation to add to your stack is APC. It caches compiled PHP files and lessens the overall memory usage per page.
Alternative: Redis
Memcached is, obviously, limited by your available memory and will start to jettison data when memory thresholds are reached. You may want to look at Redis, which is as fast as memcached (faster in some benchmarks) but allows the use of both volatile and non-volatile keys, more complex data structures, and the option of using virtual memory to put least-recently-used (LRU) key values to disk.

Caching data vs. writing in memory table

Which of these three ways would be best for achieving fast hash / session storage?
Way 1:
Create a memory table in MySQL that stores a hash and a timestamp when the entry was created. A MySQL event automatically deletes all entries older than 20 minutes. This should be pretty fast because all data is stored in memory, but the overhead of connecting to the database server might destroy this benefit.
Way 2:
I create an empty file with the hash as its filename and create a cronjob that automatically deletes all files older than 20 minutes. This can become slow because of all the read operations on the HDD.
Way 3:
Since this is going to be PHP related and we use the Zend Framework I could use the Zend_Cache and store the hash with a time-to-live of 20 minutes.
I don't want to use Memcached or APC for this, because I think that is a big overhead for just some small hashes.
Do you have any experience with similar scenarios? I would appreciate your experience and solutions for this.
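For reference, Way 3 might look roughly like this (a sketch using ZF1's Zend_Cache with a File backend; the cache directory and key/value are examples):

    <?php
    // 20-minute lifetime: Zend_Cache handles expiry itself, so neither a
    // MySQL event nor a cronjob is needed.
    $cache = Zend_Cache::factory(
        'Core',
        'File',
        array('lifetime' => 1200, 'automatic_serialization' => true),
        array('cache_dir' => '/tmp/cache/')
    );

    $id   = 'abc123';          // illustrative key
    $hash = sha1('example');   // illustrative value

    $cache->save($hash, 'hash_' . $id);    // store
    $value = $cache->load('hash_' . $id);  // false once expired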
If performance is an issue, I would definitely go with memcached. All the big internet sites rely on memcached, whether for caching otherwise server-intensive tasks or even for session storage and locking mechanisms.
Memcached is the way to go if you ask me
Don't reinvent the wheel - use memcache. Either that, or measure your MySQL performance vs. Memcache. Remember that db access is usually always a bottleneck in high-perf environments anyway.
Have you also considered the scaling issues with each of these approaches? How many hashes are we talking about? How many megabytes of data?
Hold it in the memory of your machine if you can (your variant 3).
Put it on disk, but keep it in memory if you can (your variant 2; if you use file-based sessions, you can use $_SESSION as someone pointed out).
Use a database.
Do you have a lot of data? Use memcached.
As others have said, use memcached. If PHP supported database connection pools, like Java for instance, I would recommend MySQL, but PHP being what it is, memcached is the only real answer.

What is faster, flat files or a MySQL RAM database?

I need a simple way for multiple running PHP scripts to share data.
Should I create a MySQL DB with a RAM storage engine, and share data via that (can multiple scripts connect to the same DB simultaneously?)
Or would flat files with one piece of data per line be better?
Flat files? Nooooooo...
Use a good DB engine (MySQL, SQLite, etc). Then, for maximum performance, use memcached to cache content.
In this way, you have the ease and reliability of sharing data between processes using proven server software that handles concurrency, etc... But you get the speed of having your data cached.
Keep in mind a couple things:
MySQL has a query cache. If you are issuing the same queries repeatedly, you can gain a lot of performance without adding a caching layer.
MySQL is really fast anyway. Have you load-tested to demonstrate it is not fast enough?
Please don't use flat files, for the sanity of the maintainers.
If you're just looking to have shared data, as fast as possible, and you can hold it all in RAM, then memcached is the perfect solution.
If you'd like persistence of data, then use a DBMS, like MySQL.
Generally, a DB is better, however, if you are sharing a small, mostly static amount of data, there might be performance benefits (and simplicity) of doing it with flat files.
Anything other than trivial data sharing and I would pick a DB however.
1- Where a flat file can be useful:
A flat file can be faster than a database, but only in very specific applications.
Flat files are faster if the data is read from start to finish without any searching or writing.
If the data doesn't fit in memory and needs to be read fully to get the job done, a flat file 'can' be faster than a database. Flat files also shine when there are many more writes than reads: most default database setups make read queries wait for writes to finish in order to maintain indexes and foreign keys, which usually makes write queries slower than simple reads.
TL;DR version:
Use flat files for job-based systems (e.g., simple log parsing), not for web search queries.
2- Flat file pitfalls:
If you're going with a flat file, you will need to synchronize your scripts whenever the file changes, using a custom locking mechanism. This can lead to slowdowns, corruption, and even deadlock if you have a bug. A locking sketch follows below.
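Something like this is the minimum a shared flat file needs (a sketch; note flock() is advisory, so every reader and writer must cooperate):

    <?php
    $data = array('counter' => 1); // illustrative payload

    $fp = fopen('/tmp/shared.dat', 'c+');
    if (flock($fp, LOCK_EX)) {     // block until we hold the exclusive lock
        ftruncate($fp, 0);
        fwrite($fp, serialize($data));
        fflush($fp);
        flock($fp, LOCK_UN);
    }
    fclose($fp);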
3- RAM-based database?
Most databases keep in-memory caches for query results and search indexes, which makes them very hard to beat with a flat file. Because they already cache in memory, making a database run entirely from memory is usually ineffective and dangerous; it is better to properly tune the database configuration.
If you're looking to optimize performance using RAM, I would first look at serving your PHP scripts, HTML pages, and small images from a RAM drive, where the caching mechanism is more likely to be crude and to hit the hard drive systematically for unchanging static data.
Better results can be reached with a load balancer, or clustering with backplane connections, up to a RAM-based SAN array. But that's a whole other topic.
4- Can multiple scripts connect to the same DB simultaneously?
Yes, via persistent connections (often called connection pooling). In PHP (client side), the function to open one is mysql_pconnect() (http://php.net/manual/en/function.mysql-pconnect.php).
You can configure the maximum number of open connections in php.ini, I think. A similar setting on the MySQL server side defines the maximum number of concurrent client connections, in /etc/mysql/my.cnf.
You must do this in order to take advantage of parallel processing on the CPU and to avoid PHP scripts waiting on each other's queries to finish. It greatly increases performance under heavy load.
There is also a connection pool/thread pool in the Apache configuration for regular web clients. See httpd.conf.
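For newer code, a sketch of the same idea with PDO (the mysql_* functions are legacy); the DSN and credentials are examples:

    <?php
    // PDO::ATTR_PERSISTENT reuses an existing connection to the server
    // instead of opening a new one on every request.
    $pdo = new PDO(
        'mysql:host=localhost;dbname=app',
        'user',
        'pass',
        array(PDO::ATTR_PERSISTENT => true)
    );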
Sorry for the wall of text, was bored.
Louis.
If you're running them on multiple servers, a filesystem-based approach will not cut it (unless you've got a consistent shared filesystem, which is unlikely and may not be scalable).
Therefore you'll need a server-based database anyway to allow the sharing of data between web servers. If you're serious about either performance or availability, your application will support multiple web servers.
I would say that the MySQL DB would be the better choice, unless you have some mechanism in place to deal with locks on the flat files (and some way to control access). In this case the DB layer (regardless of the specific DBMS) acts as an indirection layer, letting you not worry about it.
Since the OP doesn't specify a web server (and PHP can actually run from a command line), I'm not certain that the caching technologies are what they're after here. The OP could be looking to do some sort of on-the-fly data transform that isn't website-driven. Who knows.
If your system has a PHP cache (that caches compiled PHP code in memory, like APC), try putting your data into a PHP file, as PHP code. If you have to write data, there are some security issues.
I need a simple way for multiple running PHP scripts to share data.
APC, and memcached are both good options depending on context. shared memory may also be an option.
Should I create a MySQL DB with a RAM storage engine, and share data via that (can multiple scripts connect to the same DB simultaneously?)
That's also a decent option, but will probably not be as fast as APC or memcached.
Or would flat files with one piece of data per line be better?
If this is read-only data, that's a possibility -- but may be slower than any of the options above. Especially if the data is large. Rather than writing custom parsing code, however, consider simply building a PHP array, and include() the file.
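A sketch of that approach (the file name and contents are made up); with an opcode cache like APC, the include is served from memory:

    <?php
    // data.php contains, as PHP code:
    //   <?php return array('rate_limit' => 100, 'feature_x' => true);
    $config = include __DIR__ . '/data.php';
    echo $config['rate_limit'];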
If this is a datastore that may be accessed by several writers simultaneously, by all means do NOT use a flat file! Writing to a flat file from multiple processes is likely to lead to file corruption. You can lock the file, but you risk lock contention issues, and long lock wait times.
Handling concurrent writes is the reason applications like mysql and memcached exist.

File access speed vs database access speed

The site I am developing in PHP makes many MySQL database requests per page viewed, albeit many of them small requests against properly designed indexes. I do not know whether it will be worthwhile to develop a cache script for these pages.
Is file I/O generally faster than database requests? Does this depend on the server? Is there a way to test how many of each your server can handle?
One of the pages checks the database for a filename, then checks the server to see if the file exists, then decides what to display. I would assume this would benefit from a cached page view?
Also if there is any other information on this topic that you could forward me to that would be greatly appreciated.
If you're doing read-heavy access (looking up filenames, etc) you might benefit from memcached. You could store the "hottest" (most recently created, recently used, depending on your app) data in memory, then only query the DB (and possibly files) when the cache misses. Memory access is far, far faster than database or files.
If you need write-heavy access, a database is the way to go. If you're using MySQL, use InnoDB tables, or another engine that supports row-level locking. That will avoid people blocking while someone else writes (or worse, writing anyway).
But ultimately, it depends on the data.
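A sketch of that read-through pattern, applied to the filename lookup from the question ($filename, $pdo, and the files table are assumed; the TTL is an example):

    <?php
    $mc = new Memcached();
    $mc->addServer('127.0.0.1', 11211);

    $key = 'file_exists:' . $filename;
    $hit = $mc->get($key);
    if ($hit === false && $mc->getResultCode() === Memcached::RES_NOTFOUND) {
        // Cache miss: do the expensive DB + filesystem check once...
        $stmt = $pdo->prepare('SELECT 1 FROM files WHERE name = ?');
        $stmt->execute(array($filename));
        $hit = ($stmt->fetchColumn() && file_exists($filename)) ? '1' : '0';
        // ...then remember the answer for five minutes.
        $mc->set($key, $hit, 300);
    }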
It depends on how the data is structured, how much there is and how often it changes.
If you've got relatively small amounts of relatively static data with relatively simple relationships, then flat files are the right tool for the job.
Relational databases come into their own when the connections between the data are more complex. For basic 'look up tables' they can be a bit overkill.
But if the data is constantly changing, it can be easier to just use a database rather than handle the configuration management by hand - and for large amounts of data, flat files leave you with the additional problem of how to find the one bit you need, efficiently.
This really depends on many factors. If you have a fast database with much of the data cached in RAM, or a fast RAID system, chances are slim that you will gain much from simple filesystem caching on the web server. Also think about scalability: under high workload, a simple caching mechanism can easily become a bottleneck, while a database is designed to handle high workloads.
If there are not many requests and you (or the operating system) can keep the cache in RAM, you might gain some performance. But then the question arises whether caching is really necessary under low workload at all.
From a plain performance perspective, it is wiser to tune the database server and not complicate the data access logic with intermediate file caches. A good database server will do the caching on its own if the results are cacheable. (I'm not sure whether that is the case with MySQL.)
If you have performance problems, you should profile the pages to see the real bottlenecks. Even if you are - like me - a fan of optimized code, putting stronger or more hardware into the equation is cheaper in the long run.
If you still need to use caches, consider using an existing solution, like memcached.
