I'm going to introduce round-robin load balancing into our architecture, and I don't really want to use sticky sessions since we don't utilize cookies in our apps. I'm trying to decide whether I should store the session data for my PHP apps in the Couchbase database or in a mounted AWS S3 bucket. Currently our session data is stored in the standard way, which is on the local server.
I'm thinking of moving the session storage into the Couchbase database; however, that requires us to change the code to accommodate that capability. Storing the session data on mounted AWS S3 is preferable since I don't really have to change anything in my code other than the pointer in php.ini. But is it a good idea, and does it have any performance implications? Does anyone have experience with this setup and perhaps can share your thoughts/results?
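For reference, the only change in that case would be the session storage pointer in php.ini, something like this (the mount point here is just a placeholder for wherever the S3 bucket would be mounted):

```ini
; Sketch only: keep the default file handler but point it at the s3fs mount.
; /mnt/s3/sessions is a placeholder path, not our actual mount point.
session.save_handler = files
session.save_path = "/mnt/s3/sessions"
```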
Many thanks everyone...
Deciding how to implement sessions based on what is easiest to implement doesn't seem like a very sound basis for choosing an architecture.
S3 -- particularly if you are already using it in a way in which it is not exactly designed to be used (by mounting it to emulate a filesystem, which it is not -- it's an object store) -- does not seem like an appropriate platform for storing session data, for two reasons: the first is the likelihood of performance issues, and the second is the consistency model of S3. A third potential consideration is the per-request pricing for S3, and a fourth is that you'd need to disable any local file cache in s3fs (assuming that's what you're using) or you run the risk of reading stale data... but this would likely introduce additional performance issues.
When you create a new object in S3, and then try to download it, it will not necessarily be immediately available in the US Standard Region, which only provides eventual consistency, which means it is sometimes but not always possible to immediately download something you just finished uploading. The other regions provide read-after-write consistency on new objects, but this could potentially come in the form of a tradeoff that increases the time it takes to initially create the new object, or the initial time to fetch it again.
In contrast to new objects, all regions, not just US Standard, provide only eventual consistency when you overwrite an existing object with different content. This means that if you change the contents of a file on S3 and immediately read the file again, you may not see the newest version of the file... and that if you delete a file and subsequently try to read from it again, you might actually still be able to, for a short period of time. Testing to demonstrate that this isn't a problem would serve no purpose, since this is their stated consistency model, and behavior you observe today could deteriorate in the future yet still be consistent with their model.
http://aws.amazon.com/s3/faqs/
By contrast, SimpleDB, DynamoDB, and RDS all provide services that are more appropriate for storing session data, with the applicability of each of these services depending on your specific requirements.
Or, you could store the sessions in Couchbase, if that provides suitable performance. I can't comment on that possibility, since I am unfamiliar with that platform... but S3, in spite of being an excellent service, does not seem well-suited for this application.
One thing, though...
we don't utilize cookies in our apps
I'm skeptical, since that's typically the way sessions work. How does your server identify the appropriate session for the user connecting to your site, then, if not by a cookie?
Related
I've got a PHP app that stores arbitrary config info in a file. I would like to read that file once, when the app first starts, save it as some kind of application-state variable, and leverage it across potentially thousands of user sessions. My Google-fu is usually pretty good, but in this case the only thing I'm able to come up with is the $_SESSION variable. Using it means reading the config file once per user session, which could mean reading it thousands of times a minute in high-volume installations, which seems inefficient.
When I worked with .NET web apps there was an idea of an application session that could be used to persist app configuration information across multiple user sessions. Does PHP have a similar concept?
Does PHP provide an API for cross-session data management? No.
Does PHP provide mechanisms for reading and updating data? Yes, there are lots of them.
While this sounds like a session handler that is shared across multiple users, its implementation is very different. By default (and by necessity) PHP's sessions are blocking. If access to this shared dataset were blocking, then you would severely limit concurrency.
Given that access to the data must be non-blocking, how do you mediate concurrent updates to the shared data? A lot depends on the frequency of the updates. But there are also questions about capacity and whether you need to support multiple nodes.
Any one-size-fits-all solution for this functionality is going to be severely hampered in capacity and/or performance. There are lots of products PHP will integrate with to provide a suitable storage substrate; however (leaving aside the logic of the interface for your super-session), it is not in the nature of open source software to package up third-party products and hide them behind APIs.
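One common substrate for exactly this use case is shared memory on the server. As a minimal sketch (assuming the APCu extension is installed; the key name, file name, and TTL are just examples):

```php
<?php
// Minimal sketch: cache a parsed config file in shared memory with APCu,
// so it is re-read from disk at most once per TTL rather than once per session.
// Assumes the APCu extension is installed; key, path, and TTL are examples.
function getAppConfig(): array
{
    $config = apcu_fetch('app.config', $hit);
    if ($hit) {
        return $config;                       // served from shared memory
    }

    $config = parse_ini_file(__DIR__ . '/config.ini', true);
    apcu_store('app.config', $config, 300);   // refresh at most every 5 minutes
    return $config;
}
```

Note that APCu is per-server shared memory, so it does not by itself solve the multiple-node question raised above.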
I have a situation where my Linux server will be running a website which gets some of its data from a 3rd-party server through a SOAP interface. The data isn't exactly real-time, but it does change every 5 minutes or so. I was told not to have our website hammer their website for data, which I can completely understand.
So I wondered if this was a good candidate for a cache scheme of some type, where, when a user comes to our web page to display the data, if it's less than 5 minutes old (for example), it would get that data from our server instead of polling the 3rd party for it. This way, if 100 users come to our website at once, our server won't access the 3rd-party website 100 times to share the same exact data within a given time-frame.
Is this a practical thing to do in PHP? Or should this be written in a faster language when it comes to caching? Are there cache packages for this sort of situation which can be used along with a PHP Joomla application? Thanks!
I think memcached is a good choice.
You can set a timeout when you store content on the memcached server; if the key is missing, retrieve the data from the 3rd-party server and store it again.
There is a memcached extension for PHP; check the doc here.
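In outline, the check-then-fetch-then-store pattern looks something like this (a sketch using the PHP Memcached extension; the key name, 5-minute TTL, and fetchFromSoapServer() are placeholders):

```php
<?php
// Sketch of the cache-aside pattern with the PHP Memcached extension.
// Key name, TTL, and fetchFromSoapServer() are placeholders for illustration.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$data = $mc->get('soap.latest');
if ($data === false && $mc->getResultCode() === Memcached::RES_NOTFOUND) {
    $data = fetchFromSoapServer();           // your existing SOAP call
    $mc->set('soap.latest', $data, 300);     // expire after 5 minutes
}
```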
There are lots of ways to solve the problem - we can't say which is the right one without knowing a lot more about the constraints you are working within or how the service is used. If you are using Joomla then you're obviously not that bothered about performance - it would be really hard to write anything which has a measurable impact on your HTML generation times. This does not need to "be written in a faster language", but....
can you install additional software?
have you got access to cron?
at what rate is the service consumed?
how many webservers do you have consuming the service - do they have a shared filesystem? Are they on the same sub-net?
Is the SOAP response cacheable?
how do you deal with non-availability of the service?
For a very scalable solution I would suggest running a simple forward proxy (e.g. squid) but do make sure that it's not accessible from the internet. Sven (see comment elsewhere) is right about POST sometimes not being cacheable - but you can cache the response from a surrogate script on your own site accessed via GET returning appropriate caching instructions - and this could return the data as a serialized php array / object which is much less expensive to process. Indeed whichever method you choose I would recommend caching the parsed response - not the XML. This also allows you to override poor caching information from the service.
If the rate is less than around 1 per minute then the cron solution is overkill. But if it's more than 20 per minute then it makes a lot of sense. If you don't have access to cron / can't install your own software then you might consider simply caching the response and refreshing the cache on demand. Don't bother with memcache unless you are already using it. APC is faster on a single server - but memcache is distributed. If you have multiple servers then use whatever cluster storage you are currently sharing your data in (distributed filesystem / database cluster / shared filesystem....).
Don't try to use locking / mutexes around the cache refresh unless you really have to (i.e. only if accessing the service more than once every 5 minutes is a mortal sin) - this gets real complicated real quick - it's too easy to introduce bugs.
Do make sure you buffer and validate any responses before writing them to the cache.
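As a rough illustration of the "cache the response, refresh on demand" approach with buffering, validation, and an atomic write (the cache path, 5-minute TTL, and fetchFromSoapServer() are assumptions for the sketch):

```php
<?php
// Sketch: file-backed cache refreshed on demand, with validation and an
// atomic write so readers never see a half-written cache file.
// Cache path, TTL, and fetchFromSoapServer() are placeholders.
function getCachedData(): array
{
    $cacheFile = '/var/cache/myapp/soap-data.json';
    $ttl = 300;

    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return json_decode(file_get_contents($cacheFile), true);
    }

    $data = fetchFromSoapServer();             // buffer the full response first
    if (!is_array($data) || $data === []) {    // validate before touching the cache
        // On a bad response, fall back to stale data rather than caching garbage.
        return is_file($cacheFile) ? json_decode(file_get_contents($cacheFile), true) : [];
    }

    $tmp = $cacheFile . '.' . uniqid('', true);
    file_put_contents($tmp, json_encode($data));
    rename($tmp, $cacheFile);                  // atomic swap on the same filesystem
    return $data;
}
```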
Yes, just use HTTP. Most of the heavy lifting has already been built into your web server.
Since SOAP is just a simple HTTP POST request with an XML body, you could set up your website or HTTP API in front of the SOAP endpoint to act as a translator to regular HTTP, attaching the appropriate HTTP caching headers to the transformed response body, and then configure an Nginx reverse proxy in front of it.
Notably: if the transformation is simple you could just use XSLT to transform the response body from the SOAP API and remove the web service layer entirely.
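A bare-bones version of such a translator endpoint might look like this (a sketch; callSoapService() and the 5-minute max-age are assumptions, and the reverse proxy in front would cache based on the Cache-Control header):

```php
<?php
// Sketch of a GET "translator" endpoint placed in front of the SOAP service.
// callSoapService() is a placeholder for the existing SOAP call; the 5-minute
// max-age is an example value. A reverse proxy in front can cache this response.
header('Content-Type: application/json');
header('Cache-Control: public, max-age=300');

echo json_encode(callSoapService());
```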
Your problem is a very small one, which does not require a complicated solution.
You could write a small cron job that is executed every five minutes, sends the request to the SOAP server, and stores the result in a local file. If any script needs the data, it reads the local file. This will result in 288 requests to the SOAP server per day, and have excellent performance for any script call that needs the results because they are already on your server.
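A sketch of such a cron script (the WSDL URL, method name, and output path are placeholders; the cron entry would run it every five minutes):

```php
<?php
// fetch-soap.php - run from cron every five minutes, e.g.:
//   */5 * * * * php /path/to/fetch-soap.php
// The WSDL URL, method name, and output path below are placeholders.
$client = new SoapClient('https://example.com/service?wsdl');
$result = $client->__soapCall('GetData', []);

file_put_contents('/var/www/cache/soap-result.json', json_encode($result));
```

Any page that needs the data then just reads and decodes /var/www/cache/soap-result.json.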
If you do not have cron jobs available and cannot fake them, any other cache will do. You really don't need fancy stuff like Memcached, unless it is already available. Storing the result in a cache file will work as well. Note that if you really have to fetch the SOAP result from the origin, this will take some more time and might affect the perceived performance of your site.
There are plenty of frameworks which also offer cache support, and if you use one you should investigate if there is support included. I'm not sure if Joomla has something appropriate for you. Otherwise, you can implement something yourself. It isn't that hard.
Cache functionality comes in various flavours:
memory-based, where a separate process on the server holds data in RAM (or overflows to disk) and you query it like you would a database; very efficient and powerful, and will have options to manage storage use and clear up after themselves, but requires setting up additional software on the server; e.g. memcached, redis
file-based, where you just write the data to disk; less efficient, but can be implemented in "user-land" code, i.e. pure PHP; beware of filling up your disk with variant caches that have expired but not been cleaned up; many frameworks have an implementation of this built in
database-backed, where you push data into an RDBMS (e.g. MySQL, PostgreSQL) or fully-featured NoSQL store (e.g. MongoDB); might make sense if you have a large amount of data, and can trade a bit of performance; as with files, you need to make sure that stale data is cleaned up
In each case, the basic idea is that you create a "key" that can tell one request from another (e.g. the name of the SOAP call and its input parameters, serialized), and pick a "lifetime" (how long you want to carry on using the same copy of the data). The caching engine or library then checks for a cache with that key, and if it is still within its "lifetime" returns the previously cached data. If there is a "cache miss" (there is no cache for that key, or it has expired), you perform the costly operation (in your case, the SOAP call) and save to the cache, using the same key.
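In code, that pattern usually boils down to a small "return from cache or compute and store" helper like this (a sketch; APCu is used purely as an example backend, and the function and call names are made up):

```php
<?php
// Generic "check cache, otherwise compute and store" helper (sketch).
// APCu is just one possible backend; the names here are made up for illustration.
function remember(string $key, int $lifetime, callable $produce)
{
    $value = apcu_fetch($key, $hit);
    if ($hit) {
        return $value;                        // cache hit: reuse the stored result
    }
    $value = $produce();                      // cache miss: do the costly work
    apcu_store($key, $value, $lifetime);
    return $value;
}

// Key derived from the SOAP call name and its serialized parameters.
$params = ['region' => 'EU'];                 // example input
$data = remember('soap.GetData.' . md5(serialize($params)), 300, function () use ($params) {
    return callSoapService($params);          // placeholder for the real SOAP call
});
```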
You can do more complex things, like pre-caching things in the background so that there is never a cache miss, or having some code paths which accept stale data in order to return quickly, but these can generally be implemented on top of whatever you're using as the main caching solution.
Edit: Another important decision is at what level of granularity to cache the data, in relation to processing it.
At one extreme, you could cache each individual SOAP call: simple to set up, but it means re-processing the same data repeatedly, and it can cause problems if two responses are related but cached independently and get out of sync.
At the other extreme, you can cache whole rendered pages: pages load very fast once cached, but creating variations based on the same data without repeating work becomes tricky.
In between are various points in your code where you have processed and combined data into meaningful chunks: if your application is well-written, these are the inputs and outputs of major functions, or possibly even complete model objects. This is more work to implement, as you have to choose the right keys (avoiding two contexts overwriting each other's caches, while ignoring variables that have no impact on the data in question) and values (avoiding repeats of costly work without having to store huge blobs of data which will be slow to unserialize and will use up the capacity of your cache store).
As with anything else, no approach suits all needs, and a complex application will probably involve caching at multiple levels for different purposes.
I'm on board with the whole cookieless domains / CDN thing, and I understand how just sending cookies for requests to www.yourdomain.com, while setting up a separate domain like cdn.yourdomain.com to keep unnecessary cookies from being sent can help performance.
What I'm curious about is if using PHP's native sessions have a negative effect on performance, and if so, how? I know the session key is kept track of in a cookie, which is small, and so that seems fine.
I'm prompted to ask this question because in the past I've written my web apps and stored a lot of the user's active data, preferences, and authentication information in the $_SESSION variable. However, I notice that some popular web applications out there, like WordPress, don't use $_SESSION at all. But sessions are easy to use and seem fairly secure, especially if you combine them with tracking user-agent / IP changes to prevent session hijacking. So why don't WordPress and other web apps use PHP's sessions? Should I also stop using sessions?
Also, let me clarify that I do realize the server must load the session data to process a page request, but that's not what I'm asking about here. My question is about whether / how it impacts network performance, especially in regard to the headers being sent / received. For example, does using sessions prevent pages or images on the site from being served from the browser's cache? Is the PHPSESSID cookie the only additional header that is being sent? These sorts of things.
The standard store for $_SESSION is the file-system with one file per session. This comes with a price:
When two requests access the same session, one request will win over the other and the other request needs to wait until the first request has finished. A race condition controlled by file-locking.
When cookies are used to store the session data (WordPress, CodeIgniter), the race condition is the same, but the locking is not as pressing; a browser might do locking within its cookie management.
Using cookies has the downside that you cannot store that much data and that the data gets passed with each request and response. This is likely to raise security issues as well: steal the cookie and you've got the data. If it's encrypted, an attacker can try to decrypt it to gain the data stored therein.
The historical reason for WordPress is that the platform never used PHP sessions. The root project started around 2000; it got a lot of traction in 2002 and 2004. Session handling was only available with PHP 4, and PHP 3 was much more popular at that time.
Later on, when $_SESSION was available, the main design of the application was already done, and it worked. Next to that, in 2004/2005 WordPress decided to start a commercial multi-blog hosting service. This created a need to scale the application(s) across servers, and cookies + database looked easier for the session/user handling than using the $_SESSION implementation. In fact, this is pretty easy and just works, so there was never a need to change it.
For CodeIgniter I cannot say as much. I know that it stores all session information inside a cookie by default, so "session" there is just another name for cookie. Optionally it can be encrypted, but this needs configuration. IIRC it was said that this was done because "most users do not need sessions". For those who do, there is a database backend (requires additional configuration), so users can switch from the cookie to the database store transparently within their application. There is a newer implementation available as well that allows you to change to any store you like, e.g. to native PHP sessions. This is done with so-called drivers.
However, this does not mean that you can't achieve the same thing with $_SESSION nowadays. You can replace the store with whatever you like (even cookies :) ), and the PHP implementation of it should be encapsulated anyway in a good program design.
That done, you can implement a store you can better control locking on (e.g. a database) and that works across servers in a load-balanced infrastructure that does not support sticky sessions.
WordPress is a good example of a home-grown implementation of session handling that is totally agnostic to whatever PHP offers. That means the wheel has been re-invented. Seen from today, I would not call their design especially innovative; it fulfills a very specific need in a very specific environment that you can only understand if you know about the project's roots.
CodeIgniter is maybe a small step ahead (in an interface sense), as it offers some sort of (unstable) interface to sessions, and it's possible to replace it with any implementation you like. That's much better for new developers, but it's also sort of re-inventing the wheel, because PHP does this already out of the box.
The best thing you can do in an application design is to make the implementation independent from system needs, so to make the storage mechanism of your session data independent from the rest of the program flow. PHP offers this with a pretty direct interface, the $_SESSION array and the session configuration.
As $_SESSION is a superglobal array, you might want to prevent your application from accessing it directly, as this would introduce global state. So in a good design you would have an interface to it, to be able to fully abstract away from the superglobal.
With that done, plus abstraction of the store plus configuration (e.g. all in one session dependency container), you should be able to scale and maintain your application well over as many servers as you like, for whatever reason. Your implementation can then just use cookies if you think that's right for you. However, you will be able to switch to database-based sessions in case you need to - without having to rewrite large parts of your application.
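As an illustration of that kind of abstraction (a sketch; the class and method names are made up, and the point is simply that the rest of the application never touches $_SESSION directly):

```php
<?php
// Sketch of a thin session wrapper so application code never touches the
// $_SESSION superglobal directly. Class and method names are made up.
class Session
{
    public function __construct()
    {
        if (session_status() !== PHP_SESSION_ACTIVE) {
            session_start();
        }
    }

    public function get(string $key, $default = null)
    {
        return $_SESSION[$key] ?? $default;
    }

    public function set(string $key, $value): void
    {
        $_SESSION[$key] = $value;
    }
}

// Application code depends on the wrapper, not on the superglobal:
$session = new Session();
$session->set('user_id', 42);
```

Swapping the underlying store (files, database, cookies) then happens behind this interface and the session configuration, without touching the calling code.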
I'm not 100% confident this is the case but one reason to avoid the built-in $_SESSION mechanism in PHP is if you want to deploy your web application in a high-availability web farm scenario.
Because the default session behavior in PHP is to store session data locally on the web server (on its file system), it makes it hard (if not impossible) to have multiple servers processing requests from the same user. You would only have this if you wanted to deploy your web application in a web-farm environment where you have a number of PHP web servers processing requests for your app to balance the load.
So, while local session state is generally much faster than a database-based solution, the latter is preferable when you need to process a huge number of requests and a web-farm environment is used to provide the capacity.
As I said in the beginning, I'm not 100% sure whether PHP supports configuring the session state provider to be a database, or a session state server, instead of the local default.
I've seen many sites give up the use of the default handling of sessions in PHP for their own method and I still have no clue why.
They are definitely running PHP and it just seems pointless to me that people would design their own method. Is there some sort of limitation that I do not know of or is it purely so they have control of everything?
(I tried asking them, and either there was no way of contacting them or they had "seen something somewhere against using PHP sessions" without knowing what it actually was.)
Default sessions are stored on the hard drive, usually in the /tmp directory.
When your site gets larger, 1 computer isn't sufficient to run it.
Therefore, people resort to load balancing (among other solutions).
A load balancer effectively switches between a cluster of computers. Therefore, if by any chance you got served by computer #1 on your first request and by computer #2 on your second request, the second computer cannot read the session, since it's not in its /tmp folder.
This is a simplified scenario of course since there's much more to application scaling but this is one of the reasons why people resort to overriding the default session mechanism.
The other thing of interest is storing sessions in the DB, thus making them searchable and whatnot. You can also create an interface for forcibly logging people out, which is something that the default mechanism cannot provide.
I would have thought a principal reason for rolling your own session-handling functionality is for the purposes of testing. If you're running unit tests, you won't necessarily have a browser environment going. You won't be able to set cookies, and so PHP won't set $_SESSION variables for you.
If, however, you wrote your own session handling class(es), then you could create a mock class for running unit tests. The object would behave like a "real" session, but you won't have to faff about with browsers, cookies and human beings.
Well, with the standard setup you are tied to using the file system, saving session data unencrypted, etc.
By writing your own session handling using session_set_save_handler, you can adjust the session management to your needs ... applying encryption, saving sessions in a database, synchronizing the sessions with separate software systems ...
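For instance, a skeleton database-backed handler could look like this (a sketch only; the PDO DSN, credentials, and the sessions table layout are made up, and error handling is omitted):

```php
<?php
// Skeleton of a database-backed session handler registered via
// session_set_save_handler(). DSN, credentials, and table layout are made up.
class DbSessionHandler implements SessionHandlerInterface
{
    public function __construct(private PDO $pdo) {}

    public function open(string $path, string $name): bool { return true; }
    public function close(): bool { return true; }

    public function read(string $id): string|false
    {
        $stmt = $this->pdo->prepare('SELECT data FROM sessions WHERE id = ?');
        $stmt->execute([$id]);
        return $stmt->fetchColumn() ?: '';
    }

    public function write(string $id, string $data): bool
    {
        $stmt = $this->pdo->prepare(
            'REPLACE INTO sessions (id, data, updated_at) VALUES (?, ?, NOW())'
        );
        return $stmt->execute([$id, $data]);
    }

    public function destroy(string $id): bool
    {
        return $this->pdo->prepare('DELETE FROM sessions WHERE id = ?')->execute([$id]);
    }

    public function gc(int $max_lifetime): int|false
    {
        $cutoff = date('Y-m-d H:i:s', time() - $max_lifetime);
        $stmt = $this->pdo->prepare('DELETE FROM sessions WHERE updated_at < ?');
        $stmt->execute([$cutoff]);
        return $stmt->rowCount();
    }
}

$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
session_set_save_handler(new DbSessionHandler($pdo), true);
session_start();
```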
1) Sessions are still widely used. They work and do the job, so there is no point in changing them except for special cases.
2) However, a session is weak in that it relies on a single PHP session ID (which can be stolen). However, it is possible to protect a session using different methods, such as cookie + IP + expiration.
So yes and no. Sessions are still widely used but require some fine-tuning.
I am building a web-application and have a couple of quick questions. From what I learnt, one should not worry about scalability when initially building the app and should only start worrying when the traffic increases. However, this being my first web-application, I am not quite sure if I should take an approach where I design things in an ad-hoc manner and later "fix" them. I have been reading stories about how people start off with an app that gets millions of users in a week or two. Not that I will face the same situation but I can't help but wonder, how do these people do it?
Currently, I bought a shared hosting account on Lunarpages and that got me started in building and testing the application. However, I am interested in learning how to build the same application in a scalable-manner using the cloud, for instance, Amazon's EC2. From my understanding, I can see a couple of components:
There is a load balancer that first receives requests and then decides where to route each request
This request is then handled by a server replica that then processes the request and updates (if required) the database and sends back the response to the client
If a similar request comes in, then a caching mechanism like memcached kicks in and returns objects from the cache
A blackbox that handles database replication
Specifically, I am trying to do the following:
Setting up a load balancer (my homework revealed that HAProxy is one such load balancer)
Setting up replication so that databases can be synchronized
Using memcached
Configuring Apache to work with multiple web servers
Partitioning application to use Amazon EC2 and Amazon S3 (my application is something that will need great deal of storage)
Finally, how can I avoid burning myself when using Amazon services? Because this is just a learning phase, I can probably do with 2-3 servers, a simple load balancer, and replication, but I want to avoid accidentally paying loads of money.
I am able to find resources on individual topics but am unable to find something that starts off from the big picture. Can someone please help me get started?
Personally, I think you should be considering how your app will scale initially - as otherwise you'll run into problems down the line.
I'm not saying you need to build it initially as a multi-server system, but if you think you'll need to do it later, be mindful of the concerns now.
In my experience, this includes things like:
Sessions. Unless you use 'sticky' load balancing, you will have to have some way of sharing session state between servers. This probably means storing session data on either shared storage, or in a DB.
File uploads and replication. If you allow users to upload files, or you have a CMS that allows you to upload images/documents, it needs to cater for the fact that these files will also need to find their way onto other nodes in your cluster. However, if you've gone down the shared storage route mentioned above, this should cover it.
DB scalability. If you're using traditional DB servers, you might want to think about how you'll implement scalability at that level. This may mean coding your app so you use one connection string for reads, and another for writes (a rough sketch of this follows the list). Then, you are free to implement replication with one master node handling the inserts/updates and cascading the changes to read-only nodes that handle the bulk of the work.
Middleware. You might even want to go down the route of implementing some kind of message oriented middleware solution to completely hand off business logic functions - this will give you a great level of flexibility in how you wish to scale this business logic layer in the future. Although initially this will be a lot of complication and work for not a great deal of payoff.
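As a rough sketch of that read/write connection split mentioned above (hostnames, credentials, and table names are placeholders):

```php
<?php
// Sketch of a read/write connection split for a replicated database setup.
// Hostnames, credentials, database, and table names are placeholders.
$writeDb = new PDO('mysql:host=db-master.internal;dbname=app', 'app', 'secret');
$readDb  = new PDO('mysql:host=db-replica.internal;dbname=app', 'app', 'secret');

// Writes go to the master...
$writeDb->prepare('INSERT INTO posts (title) VALUES (?)')->execute(['Hello']);

// ...while the bulk of the read traffic can go to a replica.
$posts = $readDb->query('SELECT id, title FROM posts ORDER BY id DESC LIMIT 10')->fetchAll();
```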
Have you considered playing around with VMs first? You can run 2-3 VMs on your local machine and set them up like you would actual servers, they just won't be able to handle real traffic levels. If all you're looking for is the learning experience, it might be an ideal way to go about it.