I'm working on a system that needs to get a country code based on IP address, and needs to be accessible to multiple applications of all shapes and sizes on multiple servers.
At the moment, this is obtained via a cURL request to a preexisting geo.php library, which I think resolves the country code from a .dat file downloaded from MaxMind. Apparently though this method has been running into problems under heavy loads, perhaps due to a memory leak? No one's really sure.
The powers that be have suggested to me that we should dispense with the cURLing and derive the country code from a locally hosted geocoding library, with the data also stored in a local file. Or else possibly a master file hosted on, e.g., Amazon S3. I'm feeling a bit wary of having a massive file of IP-to-country lookups stored unnecessarily in a hundred different places, of course.
One thing I've done is put the data in a MySQL database and obtained the required results by connecting to that; I don't know for sure, but it seems to me that our sites generally run swiftly and efficiently while connecting to centralised MySQL data, so wouldn't this be a good way of solving this particular problem?
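For reference, a minimal sketch of how such a lookup could work, assuming the MaxMind CSV has been loaded into an ip_country table keyed on integer IP ranges (all table, column and connection names here are illustrative, not our actual setup):
<?php
// Assumed table: ip_country(ip_from INT UNSIGNED, ip_to INT UNSIGNED, country_code CHAR(2))
$pdo = new PDO('mysql:host=db.internal;dbname=geo', 'user', 'pass');

function countryForIp(PDO $pdo, $ip) {
    $ipNum = sprintf('%u', ip2long($ip)); // MaxMind ranges are keyed on the unsigned integer form
    $stmt = $pdo->prepare(
        'SELECT country_code FROM ip_country
          WHERE ip_from <= ? AND ip_to >= ?
          ORDER BY ip_from DESC
          LIMIT 1' // ORDER BY ... LIMIT 1 lets MySQL use an index on ip_from
    );
    $stmt->execute(array($ipNum, $ipNum));
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    return $row ? $row['country_code'] : 'XX'; // 'XX' as an arbitrary "unknown" fallback
}

echo countryForIp($pdo, $_SERVER['REMOTE_ADDR']);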
My question then: what are the relative overheads of obtaining data in different ways? cURLing it in, making a request to a remote database, getting it from a local file, getting it from a file hosted somewhere else? It's difficult to work out which of these are more efficient or inefficient, and whether the relative gains in efficiency are likely to be big enough to matter...
I had a website using cURL to get the country code text from MaxMind as well, for about 1.5 years, with no problems as far as I could tell. One thing that I did do, though, was set a timeout of ~1-2 seconds for the cURL request and default back to a set country code if it didn't respond in that time - if it didn't reach MaxMind by then, I didn't want to slow the page any more. We went through about 1 million queries to MaxMind, I believe, so it must have been used. That's the main disadvantage of using an external service - relying on their response time.
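Something like this, roughly (the endpoint URL and the 'US' default are placeholders, not MaxMind's actual API):
<?php
// Timeout-plus-fallback lookup; URL and default country are placeholders.
function lookupCountry($ip) {
    $ch = curl_init('http://geoip.example.com/geo.php?ip=' . urlencode($ip));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1); // give up connecting after 1 second
    curl_setopt($ch, CURLOPT_TIMEOUT, 2);        // hard cap on the whole request
    $code = curl_exec($ch);
    curl_close($ch);

    // Fall back to a default rather than slowing the page when the remote service is slow or down.
    return ($code !== false && $code !== '') ? trim($code) : 'US';
}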
As for having it locally, the main thing to be concerned about is: will it be up to date a year from now? Obviously you can't get any more different IP addresses out of the current IPv4 pool, but potentially ISPs could buy/sell/trade IPs with different countries (I don't know how it works, but I've seen plenty of IPs from different countries and they never seem to have any pattern to them lol). If that doesn't happen, disregard that part :p. The other thing about having it locally is that you could use the MySQL query cache to store the result so you don't have to worry about resources on subsequent page loads, or alternatively just do what I did and store it in a cookie and check that first before cURLing (or doing a lookup).
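The cookie-first idea looks roughly like this (a sketch reusing the lookupCountry() function above; the cookie name and lifetime are arbitrary):
<?php
// Only do the (remote or local) lookup when the visitor has no country cookie yet.
if (isset($_COOKIE['country']) && preg_match('/^[A-Z]{2}$/', $_COOKIE['country'])) {
    $country = $_COOKIE['country'];
} else {
    $country = lookupCountry($_SERVER['REMOTE_ADDR']);
    setcookie('country', $country, time() + 86400 * 30, '/'); // remember it for 30 days
}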
You're stating this question the wrong way.
There are only two different methods:
a network lookup
a local resource request
And only one answer:
NEVER do any network lookups while serving a client request.
So, as long as you're accessing a local resource (okay - within the limits of the same datacenter), you're all right.
If you're requesting some distant resource - no matter whether it's cURL or a database or whatever - you're in trouble.
That rule seems obvious to me.
I created a small and very simple REST-based webservice with PHP.
This service gets data from a different server and returns the result. It's more like a proxy rather than a full service.
Client --(REST call)--> PHP Webservice --(Relay call)--> Remote server
<-- Return data ---
In order to keep costs as low as possible, I want to implement a caching table in the PHP webservice, maintaining data in server memory for a period of time and only re-requesting the data after a timeout (let's say 30 mins).
In pseudo-code I basically want to do this:
$id = $_GET["id"];
$result = null;

if (isInCache($id) && !cacheExpired($id, 30)) {
    $result = getFromCache($id);
} else {
    $result = getDataFromRemoteServer($id);
    saveToCache($id, $result); // store under the same id so the next request can find it
}

printData($result);
The code above should get data from a remote server which is identified by an id. If it is in the cache and 30 mins have not passed yet the data should be read from the cache and returned as a result of the webservice call. If not, the remote server should be queried.
While thinking on how to do this I realized 2 important aspects:
I don't want to use filesystem I/O operations because of performance concerns. Instead, I want to keep the cache in memory - so no MySQL or local file operations.
I can't use sessions because the cached data must be shared across different users, browsers and internet connections worldwide.
So, if I could somehow share objects in memory between multiple GET requests, I would be able to implement this caching system pretty easily I think.
But how could I do that?
Edit: I forgot to mention that I cannot install any modules on that PHP server. It's a pure "webhosting-only" service.
I would not implement the cache at the (PHP) application level. REST is HTTP, therefore you should use a caching HTTP proxy between the internet and the web server. The web server and the proxy could live on the same machine until the application grows (if you're worried about costs).
I see two fundamental problems when it comes to application or server level caching:
using memcached would lead to a situation where a user session is required to be bound to the physical server where the memcache exists. This makes horizontal scaling a lot more complicated (and expensive)
software should be developed in layers, and caching should not be part of the application layer (and/or business logic). It is a different layer using specialized components, and since there are well-known solutions for this (an HTTP caching proxy), they should be used in favour of self-crafted solutions.
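If you go the proxy route, the PHP service itself only needs to emit standard caching headers; a minimal sketch (the fetchFromRemoteServer() helper and the JSON response shape are assumptions about the service):
<?php
// Let an upstream HTTP cache (Varnish, an nginx proxy_cache, ...) do the caching
// by sending standard headers from the PHP endpoint.
$id = isset($_GET['id']) ? $_GET['id'] : '';

$data = fetchFromRemoteServer($id); // hypothetical relay call to the remote server

header('Content-Type: application/json');
header('Cache-Control: public, max-age=1800'); // cacheable for 30 minutes
echo json_encode($data);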
Well, if you do have to use PHP, and you cannot modify the server, and you do want in-memory caching for performance reasons (without first measuring that any other solution has good enough performance), then the solution for you must be to change the webhosting.
Otherwise, you won't be able to do it. PHP does not really have any memory-sharing facilities available. The usual approach is to use Memcached or Redis or something else that runs separately.
And as a starting point and proof of concept, I'd really go with a file-based cache. Accessing a file instead of requesting a remote resource is WAY faster. In fact, you'd probably not notice the difference between a file cache and a memory cache.
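A minimal file-based sketch of the 30-minute cache from the question (works on plain shared hosting; getDataFromRemoteServer() is the questioner's hypothetical remote call):
<?php
function cachedFetch($id, $ttl = 1800) {
    $file = sys_get_temp_dir() . '/cache_' . md5($id) . '.json';

    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return json_decode(file_get_contents($file), true); // cache hit
    }

    $data = getDataFromRemoteServer($id);                   // the expensive remote call
    file_put_contents($file, json_encode($data), LOCK_EX);  // refresh the cache
    return $data;
}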
Is it possible to keep a SQL connection/session "open" between PHP program iterations, so the program doesn't have to keep re-logging in?
I've written a PHP program that continually (and legally/respectfully) polls the web for statistical weather data, and then dumps it into a local MySQL database for analysis. Rather than having to view the data through the local database browser, I want to have it available as an online webpage hosted by an external web host.
Not sure of the best way to approach this, I exported the local MySQL database up onto my web host's server, figuring that because the PHP program needs to be continually looping (and for longer than the default runtime, with HTML also continually refreshing its page), it would be best if I kept the "engine" on my local computer, where I can have the page continually looping in a browser, and then have it connect to the database up on my web server and dump the data there.
It worked for a few hours. But then, as I feared might happen, I lost access to my cPanel login/host. I've since confirmed through my own testing that my IP has been blocked (the hosting company is currently closed), no doubt due to the PHP program reconnecting to the online SQL database once every 10 minutes. I didn't think this behavior and amount of time between connections would be enough to warrant an IP blacklisting, but alas, it was.
Now, aside from the possibility of getting my IP whitelisted with the hosting company, is there a way to keep a MySQL session/connection alive so that a program doesn't have to keep re-logging in between iterations?
I suppose this might only be possible if I could keep the PHP program running indefinitely, perhaps after manually adjusting the max run-time limits (I don't know if there would be other external limitations, too, perhaps browser limits). I'm not sure if this is feasible, or would work.
Is there some type of low-level, system-wide "cookie" for a MySQL connection? With the PHP program finishing and closing (and then waiting for the HTML to refresh the page), I suppose the only way to not have to re-log in again would be with some type of cookie, or IP-address-based access (which would need server-side functionality/implementation).
I'll admit that my approach here probably isn't the most efficient/effective way to accomplish this. Thus, I'm also open to alternative approaches and suggestions that would accomplish the same end result -- a continual web-scrape loop that dumps into a database, and then have the database continually dumped to a webpage.
(I'm seeking a way to accomplish this other than asking my webhost for an IP whitelist, or merely determining their firewall's access ban rate. I'll do either of these if there's truly no feasible or better way.)
Perhaps you can try a persistent database connection.
This link explains persistent connections: http://in2.php.net/manual/en/function.mysql-pconnect.php
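A minimal sketch of the same idea using PDO's persistent-connection option (the PDO counterpart of mysql_pconnect; host, credentials and table are placeholders). The PHP process keeps the underlying connection open and reuses it on the next iteration instead of logging in again:
<?php
$db = new PDO(
    'mysql:host=your-host.example.com;dbname=weather',
    'db_user',
    'db_password',
    array(PDO::ATTR_PERSISTENT => true) // reuse an already-open connection if one exists
);

$db->exec("INSERT INTO readings (taken_at, temperature) VALUES (NOW(), 21.4)");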
Here it gets a little complicated. I'm in the last few months of finishing a larger web-based project, and since I'm trying to keep the budget low (and learn some stuff myself), I'm now touching an issue I've never touched before: load balancing with NGINX, and scalability for the future.
The setup is the following:
1 Web server
1 Database server
1 File server (also used to store backups)
Using PHP 5.4 or newer over FastCGI
Now, all those servers should be 'scalable' - in the sense that I can add a new file server if free disk space is getting low, or a new web server if I need to handle more requests than expected.
Another thing is: I would like to do everything over one domain, so that access to the different backend servers isn't really noticeable in the frontend (some backend servers are basically called via a subdomain - for example the file server, over 'http://file.myserver.com/...', where load balancing happens only between the file servers).
Do I need an additional, separate server for load balancing? Or can I just use one of the web servers? If yes:
How much power (CPU / RAM) do I need for such a load-balancing server? Does it have to be the same as the web server, or is a 'lighter' server enough for that?
Does the 'load balancing' server have to be scalable too? Will I need more than one if there are too many requests?
How exactly does the whole load balancing work anyway? What I mean:
I've seen many posts stating that there are problems like session handling / synchronisation on load-balanced systems. I found two solutions that might fit my needs: either the user is always directed to the same machine, or the session data is stored in a database. But with the second, wouldn't I basically have to rebuild parts of the $_SESSION functionality PHP already has? (How do I know which user gets which session - are cookies really enough?) See the sketch after this list.
What problems do I have to expect, except the unsynchronized sessions?
Write scalable code - that's a sentence I read a lot. But in terms of PHP, for example, what does it really mean? Usually, all the calculation for one user happens on one server only (the one NGINX directed the user to) - so how can PHP itself be scalable, since it's not actually redirected by NGINX?
Are different 'load balancing' pools possible? What I mean is that all file servers are in one 'pool' and all web servers are in another, and basically, if you request an image from a file server that has too much to do, it redirects to a less busy file server.
SSL - I'll only need one certificate on the load-balancing server, right? Since the data always goes back through the load balancer - or how exactly does that work?
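Regarding the session question above, a minimal sketch of the database-backed approach: PHP lets you swap the session storage backend with session_set_save_handler(), so $_SESSION and the normal session cookie keep working unchanged. The sessions table and connection details here are made up.
<?php
// Assumed table: sessions(id VARCHAR(128) PRIMARY KEY, data TEXT, updated_at INT)
$pdo = new PDO('mysql:host=db.internal;dbname=app', 'user', 'pass');

session_set_save_handler(
    function () { return true; },                          // open
    function () { return true; },                          // close
    function ($id) use ($pdo) {                            // read
        $stmt = $pdo->prepare('SELECT data FROM sessions WHERE id = ?');
        $stmt->execute(array($id));
        $row = $stmt->fetch(PDO::FETCH_ASSOC);
        return $row ? $row['data'] : '';
    },
    function ($id, $data) use ($pdo) {                      // write
        $stmt = $pdo->prepare('REPLACE INTO sessions (id, data, updated_at) VALUES (?, ?, ?)');
        return $stmt->execute(array($id, $data, time()));
    },
    function ($id) use ($pdo) {                             // destroy
        return $pdo->prepare('DELETE FROM sessions WHERE id = ?')->execute(array($id));
    },
    function ($maxlifetime) use ($pdo) {                    // garbage collection
        return $pdo->prepare('DELETE FROM sessions WHERE updated_at < ?')
                   ->execute(array(time() - $maxlifetime));
    }
);

session_start();
$_SESSION['user_id'] = 42; // used exactly as before, regardless of which web server handled the request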
I know it's a huge question - basically, I'm really just looking for some advice and a bit of a helping hand; I'm a bit lost in the whole thing. I can read snippets that partially answer the above questions, but really 'doing' it is completely another thing. So I already know that there won't be a clear, definitive answer, but maybe some experiences.
The end goal is to be easily scalable in the future, and to plan ahead for it (and even buy things like the load-balancer server) in time.
You can use one of the web servers for load balancing, but it'll be more reliable to set up the balancing on a separate machine. If your web servers don't respond very quickly and you're getting many requests, the load balancer will put the requests in a queue, and for a big queue you need a sufficient amount of RAM.
You don't generally need to scale a load balancer.
Alternatively, you can create two or more A (address) records for your domain, each pointing to a different web server's address. That gives you 'DNS load balancing' without a balancing server. Consider this option.
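For the NGINX route, a minimal upstream sketch of the one-domain setup described in the question (hostnames and ports are placeholders); this is also roughly how separate web-server and file-server 'pools' would look:
# Two pools: app servers and file servers.
upstream php_backend {
    server web1.internal:80;
    server web2.internal:80;
}

upstream file_backend {
    server file1.internal:80;
    server file2.internal:80;
}

server {
    listen 80;
    server_name myserver.com;
    location / {
        proxy_pass http://php_backend;   # normal app traffic
        proxy_set_header Host $host;
    }
}

server {
    listen 80;
    server_name file.myserver.com;
    location / {
        proxy_pass http://file_backend;  # file.myserver.com is balanced only between file servers
        proxy_set_header Host $host;
    }
}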
I am designing a file download network.
The ultimate goal is to have an API that lets you directly upload a file to a storage server (no gateway or something). The file is then stored and referenced in a database.
When the file is requested, a server that currently holds the file is selected from the database and an HTTP redirect is done (or an API returns the currently valid direct URL).
Background jobs take care of desired replication of the file for durability/scaling purposes.
Background jobs also move files around to ensure even workload on the servers regarding disk and bandwidth usage.
There is no RAID or anything at any point. Every drive is just attached to the server as JBOD. All the replication is at application level. If one server breaks down, it is just marked as broken in the database, and the background jobs take care of replication from healthy sources until the desired redundancy is reached again.
The system also needs accurate stats for monitoring / balancing and maybe later billing.
So I thought about the following setup.
The environment is a classic Ubuntu, Apache2, PHP, MySql LAMP stack.
A URL that hits the current storage server is generated by the API (that's no problem so far - just a classic PHP website and MySQL database).
Now it gets interesting...
The storage server runs Apache2, and a PHP script catches the request. The URL parameters (a secure token hash) are validated: IP, timestamp and filename are checked so the request is authorized. (No database connection required - just a PHP script that knows a secret token.)
The PHP script sets the X-Sendfile header so the file is delivered via Apache's mod_xsendfile.
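A rough sketch of such a download script (the token scheme, secret and storage path are placeholders, not the finished design):
<?php
$secret  = 'shared-secret-known-to-api-and-storage-server';
$file    = isset($_GET['file'])    ? basename($_GET['file']) : '';
$expires = isset($_GET['expires']) ? (int)$_GET['expires']   : 0;
$token   = isset($_GET['token'])   ? $_GET['token']          : '';

// Recompute the token from IP, expiry timestamp and filename and compare.
$expected = md5($secret . '|' . $_SERVER['REMOTE_ADDR'] . '|' . $expires . '|' . $file);

if ($expires < time() || $token !== $expected) {
    header('HTTP/1.1 403 Forbidden');
    exit('Link expired or invalid');
}

// Hand the actual transfer over to Apache via mod_xsendfile.
header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="' . $file . '"');
header('X-Sendfile: /data/storage/' . $file);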
Apache delivers the file passed by mod_xsendfile and is configured to have the access log piped to another PHP script
Apache runs mod_logio, and the access log is in combined I/O log format, additionally extended with the %D variable (the time taken to serve the request, in microseconds) so I can calculate transfer speeds, spot bottlenecks in the network, and so on.
The piped access log then goes to a PHP script that parses the URL (the first folder is a "bucket", just as in Google Storage or Amazon S3, and is assigned to one client, so the client is known), counts input/output traffic and increments database fields. For performance reasons I thought about having daily rows and updating them like traffic = traffic + X, creating the row if nothing was updated.
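That update-or-create step can be a single statement with MySQL's INSERT ... ON DUPLICATE KEY UPDATE; a sketch, assuming a hypothetical traffic_daily table with a unique key on (bucket, day):
<?php
// Assumed table: traffic_daily(bucket VARCHAR(64), day DATE, bytes_out BIGINT, UNIQUE KEY (bucket, day))
$pdo = new PDO('mysql:host=db.internal;dbname=stats', 'user', 'pass');

function addTraffic(PDO $pdo, $bucket, $bytes) {
    $stmt = $pdo->prepare(
        'INSERT INTO traffic_daily (bucket, day, bytes_out)
         VALUES (?, CURDATE(), ?)
         ON DUPLICATE KEY UPDATE bytes_out = bytes_out + VALUES(bytes_out)'
    );
    $stmt->execute(array($bucket, $bytes));
}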
I have to mention that the servers will be low-budget servers with massive storage.
You can have a closer look at the intended setup in this thread on Server Fault.
The key data is that the systems will have gigabit throughput (maxed out 24/7) and the file requests will be rather large (so no images or loads of small files that produce high load through lots of log lines and requests) - maybe 500 MB on average or something!
The currently planned setup runs on a cheap consumer mainboard (ASUS), 2 GB of DDR3 RAM and an AMD Athlon II X2 220 (2x 2.80 GHz, tray) CPU.
Of course download managers and range requests will be an issue, but I think the average size of an access will be around at least 50 megs or so.
So my questions are:
Do I have any severe bottleneck in this flow? Can you spot any problems?
Am I right in assuming that mysql_affected_rows() can be read directly from the last request and does not issue another request to the MySQL server?
Do you think the system with the specs given above can handle this? If not, how could I improve it? I think the first bottleneck would be the CPU, wouldn't it?
What do you think about it? Do you have any suggestions for improvement? Maybe something completely different? I thought about using lighttpd and the mod_secdownload module. Unfortunately it can't check the IP address, and I'm not as flexible with it. It would have the advantage that the download validation wouldn't need a PHP process to fire. But as the PHP process only runs briefly and doesn't read and output the data itself, I think this is OK - do you? I once served downloads using lighttpd on old throwaway PCs and the performance was awesome. I also thought about using nginx, but I have no experience with that.
What do you think about the piped logging to a script that directly updates the database? Should I rather write requests to a job queue and update the database in a second process that can handle delays? Or not do it at all and just parse the log files at night? My thinking is that I would like it to be as real-time as possible and not have accumulated data anywhere other than in the central database. I also don't want to have to keep track of jobs running on all the servers - that could be a mess to maintain. There should be a simple unit test that generates a secured link, downloads it and checks whether everything worked and the logging has taken place.
Any further suggestions? I am happy for any input you may have!
I am also planning to open source all of this. I just think there needs to be an open-source alternative to expensive storage services like Amazon S3 that is oriented towards file downloads.
I really searched a lot but didn't find anything like this out there. Of course I would reuse an existing solution, preferably open source. Do you know of anything like that?
MogileFS, http://code.google.com/p/mogilefs/ -- this is almost exactly the thing that you want.
I have a simple question and wish to hear others' experiences regarding which is the best way to replicate images across multiple hosts.
I have determined that storing images in the database and then using database replication over multiple hosts would result in maximum availability.
The worry I have with the filesystem is the difficulty of synchronising the images (e.g. I don't want 5 servers all hitting the same server for images!).
Now, the only concerns I have with storing images in the database are the extra queries hitting the database and the extra handling I'd have to put in place in Apache if I wanted 'virtual' image links to point to database entries (e.g. AddHandler).
As far as my understanding goes:
If you have a script serving up the images: each image would require a database call.
If you display the images inline as binary data: this could be done in a single database call.
To provide external / linkable images, you would have to add an AddHandler for the extension you wish to 'fake' and point it to your scripting language (e.g. PHP, ASP).
I might have missed something, but I'm curious if anyone has any better ideas?
Edit:
Tom has suggested using mod_rewrite to avoid needing an AddHandler; I have accepted that as a proposed solution to the AddHandler issue. However, I don't feel like I have a complete solution yet, so please, please keep answering ;)
A few have suggested using lighttpd over Apache. How different are the ISAPI modules for lighttpd?
If you store images in the database, you take an extra database hit plus you lose the innate caching/file serving optimizations in your web server. Apache will serve a static image much faster than PHP can manage it.
In our large app environments, we use up to 4 clusters:
App server cluster
Web service/data service cluster
Static resource (image, documents, multi-media) cluster
Database cluster
You'd be surprised how much traffic a static resource server can handle. Since it's not really computing (no app logic), a response can be optimized like crazy. If you go with a separate static resource cluster, you also leave yourself open to change just that portion of your architecture. For instance, in some benchmarks lighttpd is even faster at serving static resources than apache. If you have a separate cluster, you can change your http server there without changing anything else in your app environment.
I'd start with a 2-machine static resource cluster and see how that performs. That's another benefit of separating functions - you can scale out only where you need it. As far as synchronizing files, take a look at existing file synchronization tools versus rolling your own. You may find something that does what you need without having to write a line of code.
Serving the images from wherever you decide to store them is a trivial problem; I won't discuss how to solve it.
Deciding where to store them is the real decision you need to make. You need to think about what your goals are:
Redundancy of hardware
Lots of cheap storage
Read-scaling
Write-scaling
The last two are not the same and will definitely cause problems.
If you are confident that the size of this image library will not exceed the discs you're happy to put in your web servers (say, 200G at the time of writing, being the largest high-speed server-grade discs that can be obtained; I assume you want to use 1U web servers, so you won't be able to store more than that in RAID 1, depending on your vendor), then you can get very good read-scaling by placing a copy of all the images on every web server.
Of course you might want to keep a master copy somewhere too, and have a daemon or process which syncs them from time to time, and have monitoring to check that they remain in sync and this daemon works, but these are details. Keeping a copy on every web server will make read-scaling pretty much perfect.
But keeping a copy everywhere will ruin write-scalability, as every single web server will have to write every changed / new file. Therefore your total write throughput will be limited to the slowest single web server in the cluster.
"Sharding" your image data between many servers will give good read/write scalability, but is a nontrivial exercise. It may also allow you to use cheap(ish) storage.
Having a single central server (or active/passive pair or something) with expensive IO hardware will give better write-throughput than using "cheap" IO hardware everywhere, but you'll then be limited by read-scalability.
Having your images in a database doesn't necessarily mean a database call for each one; you could cache these separately on each host (e.g. in temporary files) when they are retrieved. The source images would still be in the database and easy to synchronise across servers.
You also don't really need to add Apache handlers to serve an image through a PHP script whilst maintaining nice URLs - you can make URLs like http://server/image.php/param1/param2/param3.JPG and read the parameters through $_SERVER['PATH_INFO']. You could also remove the 'image.php' portion of the URL (if you needed to) using mod_rewrite.
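A rough sketch of that image.php approach combined with the temp-file cache from the previous paragraph (the images table, column names and PATH_INFO layout are made up for illustration):
<?php
$parts   = explode('/', trim($_SERVER['PATH_INFO'], '/')); // e.g. /42/full/photo.jpg
$imageId = (int)$parts[0];
$cache   = sys_get_temp_dir() . '/img_' . $imageId . '.jpg';

if (!is_file($cache)) {
    // First request on this host: pull the image out of the database once.
    $pdo  = new PDO('mysql:host=db.internal;dbname=site', 'user', 'pass');
    $stmt = $pdo->prepare('SELECT data FROM images WHERE id = ?');
    $stmt->execute(array($imageId));
    file_put_contents($cache, $stmt->fetchColumn()); // keep a local copy for next time
}

header('Content-Type: image/jpeg');
readfile($cache);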
What you are looking for already exists and is called MogileFS
The target setup involves mogilefsd, replicated MySQL databases and lighttpd/Perlbal for serving files. It will bring you failover and fine-grained file replication (for example, you can decide to duplicate end-user images across several physical devices and keep only one physical instance of thumbnails). Load balancing can also be achieved quite easily.