I have a web service written in PHP and MySQL. The script fetches data from other websites such as Wikipedia, Google, etc. The average execution time for a script is 5 seconds (currently running on one server). Now I have been asked to scale the system to handle 60 requests/second. Which of these approaches should I follow?
- Split functionality between servers (one server to fetch data from Wikipedia, another to fetch from Google, etc., plus a main server).
- Split load between servers (one main server that round-robins each request to its child servers, with each child processing one complete request. What about sharing the MySQL database between the child servers here?)
I'm not sure what you would really gain by splitting the functionality between servers (option #1). You can use Apache's mod_proxy_balancer to accomplish your second option. It has a few different algorithms to determine which server would be most likely to be able to handle the request.
http://httpd.apache.org/docs/2.1/mod/mod_proxy_balancer.html
Apache/PHP should be able to handle multiple requests concurrently by itself. You just need to make sure you have enough memory and configure Apache correctly.
Your script is not a server; it's acting as a client when it makes requests to other sites. The rest of the time it's merely a component of your server.
Yes, running multiple clients (instances of your script; you don't need more hardware) concurrently will be much faster than running them sequentially. However, if you need to fetch the data synchronously with the incoming request to your script, then coordinating the results of the separate instances will be difficult. Instead, you might take a look at the curl_multi_* functions, which allow you to batch up several requests and run them concurrently from a single PHP thread.
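As a rough illustration of the curl_multi_* approach, here is a minimal sketch. The function name fetch_all and the timeout value are my own assumptions, not something from the question; it needs the curl extension.

```php
<?php
// Sketch only: fetch several URLs concurrently from one PHP process.
// fetch_all() is a hypothetical helper name, not a built-in function.
function fetch_all(array $urls, int $timeoutSeconds = 10): array
{
    $multi = curl_multi_init();
    $handles = [];
    $bodies = [];

    foreach ($urls as $key => $url) {
        $handle = curl_init($url);
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);      // return the body instead of printing it
        curl_setopt($handle, CURLOPT_TIMEOUT, $timeoutSeconds);  // assumed per-request timeout
        curl_multi_add_handle($multi, $handle);
        $handles[$key] = $handle;
        $bodies[$key] = null; // stays null if the transfer fails
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $stillRunning);
        if ($stillRunning > 0) {
            curl_multi_select($multi, 1.0); // wait for activity instead of busy-looping
        }
    } while ($stillRunning > 0 && $status === CURLM_OK);

    // Collect the result of each finished transfer.
    while ($info = curl_multi_info_read($multi)) {
        $key = array_search($info['handle'], $handles, true);
        if ($key !== false && $info['result'] === CURLE_OK) {
            $bodies[$key] = curl_multi_getcontent($info['handle']);
        }
    }

    foreach ($handles as $handle) {
        curl_multi_remove_handle($multi, $handle);
    }
    curl_multi_close($multi);

    return $bodies;
}
```

Called as fetch_all(['wiki' => $wikiUrl, 'google' => $googleUrl]), it returns the bodies under the same keys, so the total wait is roughly that of the slowest request rather than the sum of all of them.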
Alternately, if you know in advance what the incoming request to your webservice will be, then you should think about implementing scheduling and caching of the fetches so they are already available when the request arrives.
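A minimal sketch of the caching idea, assuming a simple file cache in the system temp directory. The name cached_fetch, the file naming scheme and the TTL are assumptions for illustration; in practice you might use memcached or a database table instead.

```php
<?php
// Sketch only: return a cached copy of a remote fetch if it is fresh enough,
// otherwise run the fetch and store the result. cached_fetch() is hypothetical.
function cached_fetch(string $key, int $ttlSeconds, callable $fetch, ?string $cacheDir = null): string
{
    $cacheDir = $cacheDir ?? sys_get_temp_dir();
    $file = $cacheDir . '/fetch_cache_' . md5($key);

    clearstatcache(true, $file); // make sure we see the current modification time
    if (is_file($file) && (time() - filemtime($file)) < $ttlSeconds) {
        return (string) file_get_contents($file);
    }

    $data = (string) $fetch();                  // the slow remote call
    file_put_contents($file, $data, LOCK_EX);   // lock so concurrent writers don't interleave
    return $data;
}
```

A cron job could call the same function ahead of time for the requests you expect, so that incoming requests normally hit the cache.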
Related
I have a webpage that when users go to it, multiple (10-20) Ajax requests are instantly made to a single PHP script, which depending on the parameters in the request, returns a different report with highly aggregated data.
The problem is that a lot of the reports require heavy SQL calls to get the necessary data, and in some cases, a report can take several seconds to load.
As a result, because one client is sending multiple requests to the same PHP script, you end up seeing the reports slowly load on the page one at a time. In other words, the generating of the reports is not done in parallel, and thus causes the page to take a while to fully load.
Is there any way to get around this in PHP and make it possible for all the requests from a single client to a single PHP script to be processed in parallel so that the page and all its reports can be loaded faster?
Thank you.
As far as I know, it is possible to do multi-threading in PHP.
Have a look at the pthreads extension.
What you could do is run the report-generation part of the script in parallel, with each report function executing in a thread of its own, so the results come back much sooner. Also, keep the maximum number of concurrent threads at 10 or fewer so that it doesn't become a resource hog.
Here is a basic tutorial to get you started with pthreads.
And a few more examples which could be of help (Notably the SQLWorker example in your case)
Server setup
This is more of a server configuration issue and depends on how PHP is installed on your system: If you use php-fpm you have to increase the pm.max_children option. If you use PHP via (F)CGI you have to configure the webserver itself to use more children.
Database
You also have to make sure that your database server allows that many concurrent processes to run. It won’t do any good if you have enough PHP processes running but half of them have to wait for the database to notice them.
In MySQL, for example, the setting for that is max_connections.
Browser limitations
Another problem you’re facing is that browsers won’t make 10-20 parallel requests to the same host. It depends on the browser, but to my knowledge modern browsers will only open 2-6 connections to the same host (domain) simultaneously. So any further requests will just get queued, regardless of server configuration.
Alternatives
If you use MySQL, you could try to merge all your calls into one request and use parallel SQL queries using mysqli::poll().
If that’s not possible you could try calling child processes or forking within your PHP script.
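For the mysqli::poll() route, a rough sketch could look like the following. It assumes mysqli with the mysqlnd driver; run_parallel_queries and the $connect callback (which must return a fresh mysqli connection each time) are hypothetical names, and the 30-second deadline is an arbitrary choice.

```php
<?php
// Sketch only: run several SQL queries concurrently, one connection per query.
// With mysqli's exception error mode, a failing query surfaces as an exception.
function run_parallel_queries(array $queries, callable $connect, int $maxWaitSeconds = 30): array
{
    $pending = [];
    $results = [];

    foreach ($queries as $name => $sql) {
        $link = $connect();                        // assumed to return a new mysqli instance
        if ($link->query($sql, MYSQLI_ASYNC) === false) {
            $results[$name] = false;               // the query could not even be sent
            continue;
        }
        $pending[$name] = $link;                   // query is now running on the server
    }

    $deadline = time() + $maxWaitSeconds;
    while ($pending && time() < $deadline) {
        $read = $error = $reject = array_values($pending);
        if (!mysqli_poll($read, $error, $reject, 1)) {
            continue;                              // nothing finished within one second
        }
        foreach ($read as $link) {                 // $read now holds the finished connections
            $name = array_search($link, $pending, true);
            $result = $link->reap_async_query();
            $results[$name] = $result instanceof mysqli_result
                ? $result->fetch_all(MYSQLI_ASSOC)
                : $result;
            unset($pending[$name]);
            $link->close();
        }
    }

    return $results; // queries still pending at the deadline are simply missing
}
```

Each report query still costs one database connection, so max_connections matters here too.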
Of course PHP can execute multiple requests in parallel if it runs behind a web server like Apache or Nginx. The PHP dev server is single-threaded, but it should only be used for development anyway. If you are using PHP's file-based sessions, however, access to the session is serialized, i.e. only one script can have the session file open at any time. Solution: fetch the information you need from the session at script start, then close the session.
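A minimal sketch of that session pattern follows; snapshot_session is a made-up helper name.

```php
<?php
// Sketch only: copy what you need out of the session, then release the
// session lock so other requests from the same client are not blocked.
function snapshot_session(): array
{
    session_start();
    $snapshot = $_SESSION ?? [];  // copy the data while the session is open
    session_write_close();        // release the session file lock early
    return $snapshot;
}
```

After this call, the slow report code works on the returned array only. Anything written to $_SESSION later in the request will not be saved unless the session is started again.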
I'm writing a php code for a web server where it's required to do some heavy duty processes when requested before returning the results to the users.
My question is: does the Apache server create a separate thread/process for each client, or should I use multi-threading to separate them?
The processes include calling the execution of other applications through cmd and downloading files to the server.
Well, every request to the web server is handled by a separate process, which will try to use a free CPU core; if there isn't one available, it will go into a queue and wait.
You can't have multithreading in php with apache within a single web request. You simply can't. Usually at each request apache forks a new O.S. process.
This is configurable, but typically chosen when working with php, since many methods of php standard library are not thread safe.
When I have had to handle heavy computation, I have always chosen to make the user request asynchronous and let a separate daemon process do the actual computation in the background. In this case, after the user's request, I let the client poll the daemon (through further web requests) to find out when the computation is done.
I'm currently working on an event-logging system that will form part of a real-time analytics system. Individual events are sent via rpc from the main application to another server where a separate php script running under apache handles the event data.
Currently the receiving server PHP script hands off the event data to an AMQP exchange/queue from where a Java application pops events from the queue, batches them up and performs a batch db insert.
This will provide great scalability; however, I think the cost is complexity.
I'm now looking to simplify things a little so my questions are:
Would it be possible to remove the AMQP queue and perform the batching and inserting of events directly to the db from within the PHP script(s) on the receiving server?
And if so, would some kind of intermediary database be required to batch up the events or could the batching be done from within PHP ?
Thanks in advance
Edit:
Thanks for taking the time to respond. To be more specific: is it possible for a PHP script running under Apache to be configured to handle multiple HTTP requests?
So, as Apache spawns child processes each of these processes would be configured to accept say 1000 http requests, deal with them and then shut down?
I see three potential answers to your question:
Yes
No
Probably
If you share metrics for the alternative implementations, we can give better suggestions. Everything you ask about is technically possible, so please try it first and get hard results. But as long as you don't provide some meat, put it on the grill and show us the results, there is not much more to tell.
Can you tell me how a server handles several HTTP requests at the same time? If 10 users are logged in to a site and request a page at the same time, what will happen?
Usually, each of the users sends a HTTP request for the page. The server receives the requests and delegates them to different workers (processes or threads).
Depending on the URL given, the server reads a file and sends it back to the user. If the file is a dynamic file such as a PHP file, the file is executed before its output is sent back to the user.
Once the requested file has been sent back, the server usually closes the connection after a few seconds.
For more, see: HowStuffWorks Web Servers
HTTP uses TCP which is a connection-based protocol. That is, clients establish a TCP connection while they're communicating with the server.
Multiple clients are allowed to connect to the same destination port on the same destination machine at the same time. The server just opens up multiple simultaneous connections.
Apache (and most other HTTP servers) have a multi-processing module (MPM). This is responsible for allocating Apache threads/processes to handle connections. These processes or threads can then run in parallel on their own connection, without blocking each other. Apache's MPM also tends to keep open "spare" threads or processes even when no connections are open, which helps speed up subsequent requests.
The program ab (short for ApacheBench) which comes with Apache lets you test what happens when you open up multiple connections to your HTTP server at once.
Apache's configuration files will normally set a limit for the number of simultaneous connections it will accept. This will be set to a reasonable number, such that during normal operation this limit should never be reached.
Note too that the HTTP protocol (from version 1.1) allows for a connection to be kept open, so that the client can make multiple HTTP requests before closing the connection, potentially reducing the number of simultaneous connections they need to make.
More on Apache's MPMs:
Apache itself can use a number of different multi-processing modules (MPMs). Apache 1.x normally used a module called "prefork", which creates a number of Apache processes in advance, so that incoming connections can often be sent to an existing process. This is as I described above.
Apache 2.x normally uses an MPM called "worker", which uses multithreading (running multiple execution threads within a single process) to achieve the same thing. The advantage of multithreading over separate processes is that threading is a lot more light-weight compared to opening separate processes, and may even use a bit less memory. It's very fast.
The disadvantage of multithreading is you can't run things like mod_php. When you're multithreading, all your add-in libraries need to be "thread-safe" - that is, they need to be aware of running in a multithreaded environment. It's harder to write a multi-threaded application. Because threads within a process share some memory/resources between them, this can easily create race condition bugs where threads read or write to memory when another thread is in the process of writing to it. Getting around this requires techniques such as locking. Many of PHP's built-in libraries are not thread-safe, so those wishing to use mod_php cannot use Apache's "worker" MPM.
Apache 2 has two different modes of operation. One is running as a threaded server the other is using a mode called "prefork" (multiple processes).
The requests will be processed simultaneously, to the best ability of the HTTP daemon.
Typically, the HTTP daemon will spawn either several processes or several threads and each one will handle one client request. The server may keep spare threads/processes so that when a client makes a request, it doesn't have to wait for the thread/process to be created. Each thread/process may be mapped to a different processor or core so that they can be processed more quickly. In most circumstances, however, what holds the requests is network I/O, not lack of raw computing, so there is frequently no slowdown from having a number of processors/cores significantly lower than the number of requests handled at one time.
The server (Apache) is multi-threaded, meaning it can run multiple programs at once. A few years ago, a single CPU could switch back and forth quickly between multiple threads, giving the appearance that two things were happening at once. These days, computers have multiple processors, so the computer can actually run two threads of code simultaneously. That being said, threads aren't really mapped to processors in any simple way.
With that ability, a PHP program can be thought of as a single thread of execution. If two requests reach the server at the same time, two threads can be used to process the request simultaneously. They will probably both get about the same amount of CPU, so if they are doing the same thing, they will complete at approximately the same time.
One of the most common issues with multi-threading is "race conditions", where two requests are doing the same thing ("racing" to do the same thing). If they are competing for a single resource, one of them is going to win. If they both insert a record into the database, they can't both get the same id; one of them will win. So when writing code you need to keep in mind that other requests are running at the same time and may modify your database, write files or change globals.
That being said, the programming model allows you to mostly ignore this complexity.
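As one small example of guarding a shared resource, here is a sketch that uses flock() so that two concurrent requests cannot both read the same counter value from a file. The function name increment_counter and the counter file are invented for illustration; for database rows you would rely on transactions or auto-increment ids instead.

```php
<?php
// Sketch only: increment a counter stored in a file without losing updates
// when several requests run at the same time.
function increment_counter(string $path): int
{
    $handle = fopen($path, 'c+'); // open for read/write, create if missing, don't truncate
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    try {
        flock($handle, LOCK_EX);                       // other requests block here until we unlock
        $current = (int) stream_get_contents($handle); // an empty file counts as 0
        $next = $current + 1;

        ftruncate($handle, 0);
        rewind($handle);
        fwrite($handle, (string) $next);
        fflush($handle);
        flock($handle, LOCK_UN);
    } finally {
        fclose($handle);
    }

    return $next;
}
```

Without the exclusive lock, two requests could both read 5 and both write 6, which is exactly the kind of race described above.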
Greetings All!
I am having some troubles on how to execute thousands upon thousands of requests to a web service (eBay), I have a limit of 5 million calls per day, so there are no problems on that end.
However, I'm trying to figure out how to process 1,000 - 10,000 requests every minute to every 5 minutes.
Basically the flow is:
1) Get list of items from database (1,000 to 10,000 items)
2) Make a API POST request for each item
3) Accept return data, process data, update database
Obviously a single PHP instance running this in a loop would be impossible.
I am aware that PHP is not a multithreaded language.
I tried the CURL solution, basically:
1) Get list of items from database
2) Initialize multi curl session
3) For each item add a curl session for the request
4) execute the multi curl session
So you can imagine 1,000-10,000 GET requests occurring...
This was OK: around 100-200 requests were occurring in about a minute or two. However, only 100-200 of the 1,000 items were actually processed. I am thinking that I'm hitting some sort of Apache or MySQL limit?
But this does add latency; it's almost like performing a DoS attack on myself.
I'm wondering how you would handle this problem? What if you had to make 10,000 web service requests and 10,000 MySQL updates from the return data from the web service... And this needs to be done in at least 5 minutes.
I am using PHP and MySQL with the Zend Framework.
Thanks!
I've had to do something similar, but with Facebook, updating 300,000+ profiles every hour. As suggested by grossvogel, you need to use many processes to speed things up because the script is spending most of its time waiting for a response.
You can do this with forking, if your PHP install has support for forking, or you can just execute another PHP script via the command line.
exec('nohup /path/to/script.php >> /tmp/logfile 2>&1 & echo $!', $processId);
You can pass parameters (getopt) to the php script on the command line to tell it which "batch" to process. You can have the master script do a sleep/check cycle to see if the scripts are still running by checking for the process id's. I've tested up to 100 scripts running at once in this manner, at which point the CPU load can get quite high.
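A rough sketch of that launch-and-check cycle is below. The helper names (start_background, is_running, wait_for_all) are made up for this example, and it assumes a Unix-like system where exec() is allowed; is_running() uses posix_kill() when the posix extension is present and falls back to checking /proc, which only works on Linux.

```php
<?php
// Sketch only: start commands in the background and poll their process ids.

// Starts $command detached from this script and returns its process id.
// The caller is responsible for escaping arguments (see escapeshellarg()).
function start_background(string $command, string $logFile = '/dev/null'): int
{
    $output = [];
    exec(sprintf('nohup %s >> %s 2>&1 & echo $!', $command, escapeshellarg($logFile)), $output);
    return (int) ($output[0] ?? 0);
}

// Returns true while a process with this id still exists.
function is_running(int $pid): bool
{
    if (function_exists('posix_kill')) {
        return posix_kill($pid, 0); // signal 0 only checks for existence
    }
    return is_dir('/proc/' . $pid); // Linux-only fallback
}

// The sleep/check cycle: block until none of the given process ids are alive.
function wait_for_all(array $pids, int $pollSeconds = 1): void
{
    while (array_filter($pids, 'is_running')) {
        sleep($pollSeconds);
    }
}
```

The master script could start one worker per batch, for example start_background('php /path/to/script.php --batch=3', '/tmp/logfile'), where the --batch option is a hypothetical parameter that the worker would read with getopt().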
Combine multiple processes with multi-curl, and you should easily be able to do what you need.
My two suggestions are (a) do some benchmarking to find out where your real bottlenecks are and (b) use batching and caching wherever possible.
Mysqli allows multiple-statement queries, so you could definitely batch those database updates.
The http requests to the web service are more likely the culprit, though. Check the API you're using to see if you can get more info from a single call, maybe? To break up the work, maybe you want a single master script to shell out to a bunch of individual processes, each of which makes an api call and stores the results in a file or memcached. The master can periodically read the results and update the db. (Careful to rotate the data store for safe reading and writing by multiple processes.)
To understand your requirements better: must you implement your solution only in PHP, or can you interface a PHP part with another part written in another language?
If you cannot use another language, try to perform this update as a PHP script that runs in the background rather than through Apache.
You can follow Brent Baisley's advice for a simple use case.
If you want to build a robust solution, then you need to:
set up a representation of the actions in a table in database that will be your process queue;
set up a script that pops actions off this queue and processes them;
set up a cron job that runs this script at a regular interval.
This way you can have 1000 PHP scripts running, using your OS's parallelism capabilities and not hanging when eBay is taking too long to respond.
The real advantage of this system is that you can fully control the firepower you throw at your task by adjusting :
the number of request one PHP script does;
the order / number / type / priority of the action in the queue;
the number or scripts the cron daemon runs.
Thanks everyone for the awesome and quick answers!
The advice from Brent Baisley and e-satis works nicely. Rather than executing the sub-processes using cURL like I did before, the forking takes a massive load off, and it also nicely gets around the issue of maxing out my Apache connection limit.
Thanks again!
It is true that PHP is not multithreaded, but it can certainly be set up with multiple processes.
I have created a system that resembles the one that you are describing. It's running in a loop and is basically a background process. It uses up to 8 processes for batch processing and a single control process.
It is somewhat simplified because I do not need any communication between the processes. Everything resides in a database, so each process is spawned with the full context taken from the database.
Here is a basic description of the system.
1. Start control process
2. Check database for new jobs
3. Spawn child process with the job data as a parameter
4. Keep a table of the child processes to be able to control the number of simultaneous processes.
Unfortunately, it does not appear to be a widespread idea to use PHP for this type of application, and I really had to write wrappers for the low-level functions.
The manual has a whole section on these functions, and it appears that there are methods for allowing IPC as well.
PCNTL has the functions to control forking/child processes, and Semaphore covers IPC.
The interesting part of this is that I'm able to fork off actual PHP code, not execute other programs.
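To give an idea of the control-process pattern described above, here is a minimal sketch using PCNTL. It is not the system from this answer: run_jobs and the $worker callback are hypothetical, the jobs would normally come from the database, and it assumes the pcntl extension on the command line (it is not meant to run under Apache).

```php
<?php
// Sketch only: a control process that forks one child per job and never
// has more than $maxChildren running at once.
function run_jobs(array $jobs, callable $worker, int $maxChildren = 8): void
{
    $running = 0;

    foreach ($jobs as $job) {
        // At the limit: wait for any child to finish before forking another.
        if ($running >= $maxChildren) {
            pcntl_wait($status);
            $running--;
        }

        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException('Could not fork');
        }
        if ($pid === 0) {
            // Child: runs actual PHP code, then exits so it never continues the loop.
            // A child should open its own database connection rather than reuse the parent's.
            $worker($job);
            exit(0);
        }

        $running++; // parent: one more child in flight
    }

    // Reap the remaining children.
    while ($running > 0) {
        pcntl_wait($status);
        $running--;
    }
}
```

Because every child gets its full context from its job data (or from the database), no IPC is needed in this simple form.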