I created a small and very simple REST-based webservice with PHP.
This service gets data from a different server and returns the result. It's more like a proxy rather than a full service.
Client --(REST call)--> PHP Webservice --(Relay call)--> Remote server
<-- Return data ---
In order to keep costs as low as possible I want to implement a caching table on the PHP webservice system by maintaining data for a period of time in server memory and only re-request the data after a timeout (let's say after 30 mins).
In pseudo-code I basically want to do this:
$id = $_GET["id"];
$result = null;
if (isInCache($id) && !cacheExpired($id, 30)){
$result = getFromCache($id);
}
else{
$result = getDataFromRemoteServer($id);
saveToCache($result);
}
printData($result);
The code above should get data from a remote server which is identified by an id. If it is in the cache and 30 mins have not passed yet the data should be read from the cache and returned as a result of the webservice call. If not, the remote server should be queried.
While thinking on how to do this I realized 2 important aspects:
I don't want to use filesystem I/O operation because of performance concerns.
Instead, I want to keep the cache in memory. So, no MySQL or local
file operations.
I can't use sessions because the cached data must be shared across different users, browsers and internet connections worldwide.
So, if I could somehow share objects in memory between multiple GET requests, I would be able to implement this caching system pretty easily I think.
But how could I do that?
Edit: I forgot to mention that I cannot install any modules on that PHP server. It's a pure "webhosting-only" service.
I would not implement the cache on the (PHP) application level. REST is HTTP, therefore you should use a caching HTTP proxy between the internet and the web server. Both servers, the web server and the proxy could live on the same machine as long as the application grows (if you worry about costs).
I see two fundamental problems when it comes to application or server level caching:
using memcached would lead to a situation where it is required that a user session is bound to the physical server where the memcache exists. This makes horizontal scaling a lot more complicated (and expensive)
software should being developed in layers. caching should not being part of the application layer (and/or business logic). It is a different layer using specialized components. And as there are well known solutions for this (HTTP caching proxy) they should being used in favour of self crafted solutions.
Well, if you do have to use PHP, and you cannot modify the server, and you do want in-memory caching for performance reasons (without first measuring that any other solution has good enough performance), then the solution for you must be to change the webhosting.
Otherwise, you won't be able to do it. PHP does not really have any memory-sharing facilities available. The usual approach is to use Memcached or Redis or something else that runs separately.
And for a starter and proof-of-concept, I'd really go with a file-based cache. Accessing a file instead of requesting a remote resource is WAY faster. In fact, you'd probably not notice the difference between file cache and memory cache.
Related
I'm building an MVC web application (let's call it xyz.com). I want it to also support mobile apps via an API on the same server (let's call this api.xyz.com).
I'm confused about how to structure the web app vs. the API:
Should the web app use the API to query the database? Or should it perform them independently (like using a Model)? I mean, should the flow be (User > Controller > API > Database) or (User > Controller > Model > Database)?
If we use the APIs to query the database, how would you query the API? Would you use something like cURL?
If we use cURL, wouldn't that slow down the process? (As we are making a second request from the controller)? What's the ideal way to do this?
I've tried reading up on API-centric web apps, but there's not too much information about this on the net.
Any guidance would be appreciated.
since I have not enough points to comment I'll just leave my thoughts here.
1) I would make the web app use the API for 2 main reasons:
I don't like having multiples components accessing my database, you'll most likely have a lot of duplicate code on the data extraction and if you have a lot of filtering you might make mistakes and they'll behave differently.
I believe that (if I'm wrong please correct me) it'll be easier to increase performances on the app rather than the database. Since the API will handle resources it'll be easier to cache things than a full HTML web page.
2) To query the API I would most likely use composer and find a nice HTTP library, not sure which one though since I didn't use php in a while (symfony's and laravel's are quite nice if I remember well).
3) Yeah it might slow down the process a little but I believe it won't be enough to be noticed by the user. As I said above, with a correct cache handling you'll do just fine.
Hope having my thoughts on this can help, if I was wrong somewhere please feel free to correct me in the comments below (Don't know if I'll be able to respond tho... :s).
Have a nice day !
You really want to know the difference between an n-tier architecture and a single-tier architecture. An n-tier architecture is comprised of several tiers, which are connected via an internal protocol and API. E.g.:
backend frontend
data --- HTTP server --- HTTP client --- application --- HTTP server --- browser
application
You could have many more tiers in there. Those tiers can all run on the same physical machine, but they still talk to each other over HTTP; more likely you'd want to run each tier on a different machine though.
Yes, this will obviously incur some overhead in talking to your database backend over HTTP instead of doing it within the same PHP process. However, this is offset by:
Caching is built-in in this architecture. If you make proper use of HTTP caching by using a fully capable HTTP server and client and are using the HTTP cache mechanism well, you can reduce the absolute amount of queries made enormously. Practically you'd set up a reverse proxy on the web server on your frontend server, so your queries go PHP → curl → web server reverse proxy → backend server. If the reverse proxy is caching properly, that's where the chain often stops, which can be much faster than executing the actual query on the database. If the backend server is using HTTP caching effectively, it too can probably often respond with a simple 304 Not Modified, which is also very fast.
Load is distributed among more machines, which speeds up each individual machine. If your frontend servers can largely work with cached data without needing to bother the database, the database is much faster for the queries that it does need to handle. You can also scale up your frontend servers to many instances, each of which is faster because it has less load.
These are the advantages of such an architecture. The further advantage is that you can also directly expose your internal HTTP API to the outside world (again, reverse proxies make a lot of sense here). If you're not using an internal HTTP API, you have to write:
A controller and view which gets data from the model and renders to HTML.
A controller and view which gets data from the model and renders to JSON (or whatever).
To some degree controllers can be reused and simply the view switched out, if everything else is the same, but often you'll find that you need to duplicate each logical "thing" to be served as HTML in one case and JSON in another, and you need to keep those in sync.
I have a situation where my Linux server will be running a website which gets some of its data from a 3rd-party server through a SOAP interface. The data isn't exactly real-time, but it does change every 5 minutes or so. I was told not to have our website hammer their website for data, which I can completely understand.
So I wondered if this was a good candiate to use a cache scheme of some type. Where when a user comes to our web page to display the data, if it's less than 5 minutes old (for example), it would get that data from our server instead of polling the 3rd-party for it. This way, if 100 users at once come to our website, our server won't be access the 3rd-party website 100 times to share the same exact data within a given time-frame.
Is this a practical thing to do in PHP? Or should this be written in a faster language when it comes to caching? Are their cache packages for this sort of situation which can be used along with a PHP Joomla application? Thanks!
I think memcached is a good choice.
You can set timeout when you store content to memcached server, if key-value missed, retrieve data from 3rd-part server and store again.
There is memcached extension for PHP, check doc here.
There's lots of ways to solve the problem -we can't say which is the right one without knowing a lot more about the constraints you are working in or how the service is used. If you are using Joomla then you're obviously not bothered about performance - it would be really hard to write anything which has a measurable impact on your html generation times. This does not need to "be written in a faster language", but....
can you install additional software?
have you got access to cron?
at what rate is the service consumed?
how many webservers do you have consuming the service - do they have a shared filesystem? Are they on the same sub-net?
Is the SOAP response cacheable?
how do you deal with non-availability of the service?
For a very scalable solution I would suggest running a simple forward proxy (e.g. squid) but do make sure that it's not accessible from the internet. Sven (see comment elsewhere) is right about POST sometimes not being cacheable - but you can cache the response from a surrogate script on your own site accessed via GET returning appropriate caching instructions - and this could return the data as a serialized php array / object which is much less expensive to process. Indeed whichever method you choose I would recommend caching the parsed response - not the XML. This also allows you to override poor caching information from the service.
If the rate is less than around 1 per minute then the cron solution is overkill. But if its more than 20 per minute then it makes a lot of sense. If you don't have access to cron / can't install your own software then you might consider simply caching the response and refreshing the cache on demand. Don't bother with memcache unless you are already using it. APC is faster on a single server - but memcache is distributed. If you have multiple servers then use whatever cluster storage you are currently sharing your data in (distributed filesystem / database cluster / shared filesystem....).
Don't try to use locking / mutexes around the cache refresh unless you really have to (i.e. only if accessing the service more than once every 5 minutes is a mortal sin) - this gets real complicated real quick - it's too easy to introduce bugs.
Do make sure you buffer and validate any responses before writing them to the cache.
Yes, just use HTTP. Most of the heavy lifting has already been built into your web server.
Since SOAP is just a simple HTTP POST request with an XML body, you could set up your website or HTTP API in front of the SOAP endpoint to act like a translator to regular HTTP, attaching the appropriate HTTP caching headers on the transformed response body and then configure an NGinx reverse proxy in front of it.
Notably: if the transformation is simple you could just use XSLT to transform the response body from the SOAP API and remove the web service layer entirely.
Your problem is a very small one, which does not require a complicated solution.
You could write a small cron job that is executed every five minutes, sends the request to the SOAP server, and stores the result in a local file. If any script needs the data, it reads the local file. This will result in 288 requests to the SOAP server per day, and have excellent performance for any script call that needs the results because they are already on your server.
If you do not have cron jobs available and cannot fake them, any other cache will do. You really don't need fancy stuff like Memcached, unless it already is available. Storing the result to a cache file will work as well. Note that if you have to really fetch the SOAP result from the origin, this will take some more time and might affect the perceived performance of your site.
There are plenty of frameworks which also offer cache support, and if you use one you should investigate if there is support included. I'm not sure if Joomla has something appropriate for you. Otherwise, you can implement something yourself. It isn't that hard.
Cache functionality comes in various flavours:
memory-based, where a separate process on the server holds data in RAM (or overflows to disk) and you query it like you would a database; very efficient and powerful, and will have options to manage storage use and clear up after themselves, but requires setting up additional software on the server; e.g. memcached, redis
file-based, where you just write the data to disk; less efficient, but can be implemented in "user-land" code, i.e. pure PHP; beware of filling up your disk with variant caches that have expired but not been cleaned up; many frameworks have an implementation of this built in
database-backed, where you push data into an RDBMS (e.g. MySQL, PostgreSQL) or fully-featured NoSQL store (e.g. MongoDB); might make sense if you have a large amount of data, and can trade a bit of performance; as with files, you need to make sure that stale data is cleaned up
In each case, the basic idea is that you create a "key" that can tell one request from another (e.g. the name of the SOAP call and its input parameters, serialized), and pick a "lifetime" (how long you want to carry on using the same copy of the data). The caching engine or library then checks for a cache with that key, and if it is still within its "lifetime" returns the previously cached data. If there is a "cache miss" (there is no cache for that key, or it has expired), you perform the costly operation (in your case, the SOAP call) and save to the cache, using the same key.
You can do more complex things, like pre-caching things in the background so that there is never a cache miss, or having some code paths which accept stale data in order to return quickly, but these can generally be implemented on top of whatever you're using as the main caching solution.
Edit Another important decision is at what level of granularity to cache the data, in relation to processing it. At one extreme, you could cache each individual SOAP call: simple to set up, but means re-processing the same data repeatedly, and can cause problems if two responses are related, but cached independently and may get out of sync. At the other extreme, you can cache whole rendered pages: pages load very fast once cached, but creating variations based on the same data without repeating work becomes tricky. In between are various points in your code where you have processed and combined data into meaningful chunks: if your application is well-written, these are the input and output of major functions, or possibly even complete model objects; this is more work to implement, as you have to choose the right keys (avoiding two contexts overwriting each other's caches while ignoring variables that have no impact on the data in question) and values (avoiding repeats of costly work without having to store huge blobs of data which will be slow to unserialize and use up the capacity of your cache store). As with anything else, no approach suits all needs, and a complex application will probably involve caching at multiple levels for different purposes.
I am using sessions in PHP to track if a user is logged in. I do not use it to store any other data about the user; essentially it is like checking a hash table to see if the user has authenticated.
Would there be some advantage to using redis instead of native PHP sessions?
I'm curious about performance, scalability, and security (not really concerned with code complexity).
Using something like Redis for storing sessions is a great way to get more performance out of load balanced servers.
For example on Amazon Web Services, the load balancers have what's called 'sticky sessions'. What this means is that when a user first connects to your web app, e.g. when logging in to it, the load balancer will choose one of your app servers and this user will continue to be served from this server until they exit your application. This is because the sessions used by PHP, for example, will be stored on the app server that they first start using.
Now, if you use Redis on a separate server, then configure your PHP on each of your app servers to store it's sessions in Redis, you can turn this 'sticky sessions' off. This would mean that any of your servers can access the sessions and, therefore, the user be served from a different server with every request to your app. This ultimately makes for more efficient use of your load balancing set-up.
You want the session save handler to be fast. This is due to the fact that a PHP session will block all other concurrent requests from the same user until the first request is finished.
There are a variety of handlers you could use for PHP sessions across multiple servers: File w/ NFS, MySQL Database, Memcache, and Redis.
The database method (using InnoDB) was the slowest in my experience followed by File w/ NFS. Locking and write contention are the main factors. Memcache and Redis provide similar performance and are by far the better alternatives since all operations are in RAM. Redis is my choice because you can enable disk persistence, and Memcache is only memory based.
I explain Redis Sessions in PHP with Kohana if you want more detail. Here is our dashboard for managing Redis keys:
I don't really think you need to worry much about sessions unless you get MASSIVE ammounts of traffic, PHP handle sessions nicely, and if you store only that little data, it should be fine even with a lot of requests, and about performance it should be close, as redis is not native to PHP.
With 10k users, if each user uses like 1kb data of sessions, it would consume 10,000kb or 10~mb, which is not much; PHP is smart enough to use a good enough data structure to hold and quickly write and read those values. The problem is if the session data is too big, or for some reason the server consumes too many resources reading the session data, but that's normally if it's the data is too big.
Here's a little background, currently i have
3 web servers
one db server which also host memcache for php sessions for the 3 web servers.
I have the php configs on the 3 servers to point to the memcache server for sessions. It was working fine until alot of connections were being produced for reads etc, which then caused connection timeouts.
So I'm currently looking at clustering the memcache on each web server for sessions, my only concern is how to go about making sure that memcache on all the servers have the same information for sessions.
Someone guided me to http://github.com/trs21219/Memcached-Library because i am using codeigniter but how do i converge my php sessions onto this since memcache seems as a key-value store? Thanks in advance.
Has anyone checked out http://repcached.sourceforge.net/ and does it work?
I'm not sure you have the same expectations of memcache that its designers had.
First, however, memcache distribution works differently than you expect: there is no mechanism to replicate stored information. Each memcache instance is a simple key-value store, as you've noticed. The distribution is done by the client code which has a list of all configured memcache instances and does a hash of the key to direct it to one of the instances. It is possible for the client to store it everywhere and retrieve it locally, or for it to hash it multiple times for redundancy, but these are not straightforward exercises.
But the other issue is that memcache is designed for reasonably short-lived data that memcache is allowed to throw away at any time. This makes it really good for caching frequently accessed data that can be a little stale (say up to a few minutes old) but might be expensive to retrieve (such as almost a minute to generate from a query).
PHP sessions don't really qualify for this, in my experience. A database can easily support many thousands of PHP sessions with barely visible traffic, but you need a lot of memcache storage to support the same number: 50k per session and 5000 sessions means close to 256Mb, and then there is all the other data you want to put in there. Not enough storage and you get lots of unexplained logouts (as memcache discards session data when under memory pressure) and thus lots of annoyed users who have to keep logging in again.
We've found GREAT advantage applying MongoDB instead of MySQL for most things, including session handling. It's far faster, far smaller, far easier. We keep MySQL around for transactional needs, but everything else goes into Mongo now. We've relegated memcache to simply caching pages and other data that isn't critical if it's there or not, something like smarty does.
There is no need to use some 3rd party libraries to organize memcached "cluster".
http://ru.php.net/manual/en/memcached.addserver.php
Just use this function to add several servers into the pool and after that data will be stored and distributed over those servers. The server for storing/retrieving the data for the specific key will be selected according to consistent key distribution option.
So in this case you don't need to worry about "how to go about making sure that memcache on all the servers have the same information for sessions"
I am using PHP with the Zend Framework and Database connects alone seem to take longer than the 0,02 seconds Google takes to do a query. The wierd thing today I watched a video that said Google connects to 1000 servers for a single query. With latency I would expect one server for every query to be more efficent than having multiple servers in different datacenters handeling stuff.
How do I get PHP, MySQL and the Zend Framework to work together and reach equal great speeds?
Is caching the only way? How do you optimize your code to take less time to "render".
There are many techniques that Google uses to achieve the amount of throughput it delivers. MapReduce, Google File System, BigTable are a few of those.
There are a few very good Free & Open Source alternatives to these, namely Apache Hadoop, Apache HBase and Hypertable. Yahoo! is using and promoting the Hadoop projects quite a lot and thus they are quite actively maintained.
I am using PHP with the Zend Framework
and Database connects alone seem to
take longer than the 0,02 seconds
Google takes to do a query.
Database connect operations are heavyweight no matter who you are: use a connection pool so that you don't have to initialise resources for every request.
Performance is about architecture not language.
Awhile ago Google decided to put everything into RAM.
http://googlesystem.blogspot.com/2009/02/machines-search-results-google-query.html
If you never have to query the hard drive, your results will improve significantly. Caching helps because you don't query the hard drive as much, but you still do when there is a cache miss (Unless you mean caching with PHP, which means you only compile the PHP program when the source has been modified).
It really depends on what you are trying to do, but here are some examples:
Analyze your queries with explain. In your dev environment you can output your queries and execution time to the bottom of the page - reduce the number of queries and/or optimize those that are slow.
Use a caching layer. Looks like Zend can be memcache enabled. This can potentially greatly speed up your application by sending requests to the ultra-fast caching layer instead of the db.
Look at your front-end loading time. Use Yahoo's YSlow add-on to Firebug. Limit http requests, set far-future headers to cache js, css and images. Etc.
You can get lightning speeds on your web app, probably not as fast as google, if you optimize each layer of your application. Your db connect times are probably not the slowest part of your app.
Memcached is a recommended solution for optimizing storage/retrieval in memory on Linux.
PHP scripts by default are interpreted every time they are called by the http server, so every call initiates script parsing and probably compilation by the Zend Engine.
You can get rid of this bottleneck by using script caching, like APC. It keeps the once compiled PHP script in memory/on disk and uses it for all subsequent requests. Gains are often significant, especially in PHP apps created with sophisticated frameworks like ZF.
Every request by default opens up a connection to the database, so you should use some kind of database connection pooling or persistent connections (which don't always work, depending on http server/php configuration). I have never tried, but maybe there's a way to use memcache to keep database connection handles.
You could also use memcache for keeping session data, if they're used on every request. Their persistence is not that important and memcache helps make it very fast.
The actual "problem" is that PHP works a bit different than other frameworks, because it works in a SSI (server-side includes) way - every request is handled by http server and if it requires running a PHP script, its interpreter is initialized and scripts loaded, parsed, compiled and run. This can be compared to getting into the car, starting the engine and going for 10 meters.
The other way is, let's say, an application-server way, in which the web application itself is handling the requests in its own loop, always sharing database connections and not initializing the runtime over and over. This solution gives much lower latency. This on the other hand can be compared to already being in a running car and using it to drive the same 10 meters. ;)
The above caching/precompiling and pooling solutions are the best in reducing the init overhead. PHP/MySQL is still a RDBMS-based solution though, and there's a good reason why BigTable is, well, just a big, sharded, massively distributed hashtable (a bit of oversimplification, I know) - read up on High Scalability.
If it's for a search engine, the bottleneck is the database, depending of its size.
In order to speed-up search on full text on a large set, you can use Sphinx. It can be configured either on 1 or multiple servers. However, you will have to adapt existing querying code, as Sphinx runs as a search daemon (libs are available for most languages)
Google have a massive, highly distributed system that incorporates a lot of proprietary technology (including their own hardware, and operating, file and database systems).
The question is like asking: "How can I make my car be a truck?" and essentially meaningless.
According to the link supplied by #Coltin, google response times are in the region of .2 seconds, not .02 seconds. As long as your application has an efficient design, I believe you should be able to achieve that on a lot of platforms. Although I do not know PHP it would surpise me if .2 seconds is a problem.
APC code caching;
Zend_Cache with APC or Memcache backend;
CDN for the static files;