I am using curl_multi_exec to process over 100K requests. I do 100 requests at a time, because curl_multi_exec can only handle around 100 requests at a time, and repeat until I eventually get through all 100K requests. We've added multiple servers to this system to spread the load [we are using load balancing]. What is the best way to have curl handle 100K requests and make use of these additional servers? What's the downside (other than time) of handling that many requests on one server? How can I use the additional servers to help handle those requests?
To elaborate: basically, we are using curl to send out over 100K requests to third-party servers. The problem with using only one server is that there is a memory limit on the number of requests one server can handle. So we decided to add additional servers, but we are not sure how to design this system to use curl to handle that many requests.
Thanks!
Don't obsess about CURL. It's simply a tool. You're focusing on the wrong level of design.
What you need to consider is how to spread this workload amongst multiple servers. A simple design would have one central database listing all your urls, and a method for clients to "check out" a url (or set of urls) to chug away on.
Once you've got that working, the curl portion will be the easiest part of it all.
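For instance, here is a rough sketch of the "check out" step against a single jobs table; the table name, column names and batch size are made up for illustration, and any claiming scheme that stops two workers grabbing the same rows would do:

// Atomically claim a batch of pending URLs for this worker, then process them.
$workerId = gethostname() . ':' . getmypid();

$pdo = new PDO('mysql:host=central-db;dbname=crawler', 'user', 'pass');
$pdo->exec("UPDATE urls
            SET status = 'in_progress', claimed_by = " . $pdo->quote($workerId) . "
            WHERE status = 'pending'
            LIMIT 100");

$stmt = $pdo->prepare("SELECT id, url FROM urls WHERE claimed_by = ? AND status = 'in_progress'");
$stmt->execute([$workerId]);

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    // ... run the cURL request for $row['url'] (or curl_multi for the whole batch) ...
    // ... then mark the row 'done' or 'failed'.
}

Each of your servers runs the same script, so adding capacity is just a matter of pointing another box at the central database.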
I would like to find, empirically and dynamically, the maximum number of parallel HTTP requests I can make to a server using the Guzzle HTTP package without getting errors from either my own server or the host server. For example, I start from an initial number, say 200, and if that doesn't produce an error the code should automatically increase the number of parallel requests until it reaches the maximum at which errors occur.
Working with PHP and mostly doing web development, I have forgotten concepts like memoization that might help solve this problem.
So let's say I have a million requests stored in a database and I would like to send them as soon as possible. What should I do? Something like this:
while (unresolved_requests()) {
    try {
        // placeholder helpers: send the next batch starting at the first unresolved request
        make_the_requests($number_of_parallel_requests, $first_unresolved_request_id);
        increase_the_parallel_requests();
    } catch (Throwable $err) {
        decrease_the_parallel_requests();
    }
}
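In Guzzle terms, the batching part of that pseudocode roughly corresponds to a Pool with a concurrency option. A minimal sketch; $urls (the batch pulled from the database) and the way errors feed back into $number_of_parallel_requests are assumptions left to the surrounding loop:

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

$client = new Client(['timeout' => 10]);
$errors = 0;

$requests = function (array $urls) {
    foreach ($urls as $url) {
        yield new Request('GET', $url);   // one request per stored row
    }
};

$pool = new Pool($client, $requests($urls), [
    'concurrency' => $number_of_parallel_requests,      // current parallelism level
    'fulfilled'   => function ($response, $index) {
        // mark request $index as resolved in the database
    },
    'rejected'    => function ($reason, $index) use (&$errors) {
        $errors++;                                      // count failures so the outer loop can back off
    },
]);
$pool->promise()->wait();

// afterwards: if $errors > 0, lower the parallelism for the next batch; otherwise raise it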
If you want to do load testing, use a load testing tool.
Take a look at JMeter. There are many others (Wrk, Autocannon, K6, Bombardier...).
It is not trivial to develop such a testing tool and get consistent results.
I've got a rather large PHP web app which gets its products from numerous other suppliers through their APIs, which usually respond with a large XML document to parse. Currently there are 20 suppliers, but this number is due to rise even further.
Our current setup uses multi curl to make the requests, and this takes about 30-40 seconds to complete, which is too long. The script runs in the background while the front end polls the database looking for results and then displays them as they come in.
To improve this process we were thinking of using a job server to run in the background, each supplier request being a separate job. We've seen beanstalkd and Gearman being mentioned.
So are we looking in the right direction, as in, is a job server the right way to go? We're looking at doing some promotion soon so we may get 200+ users searching 30 suppliers at the same time so the right choice needs to scale well if we have to load balance.
Any advice is gratefully received.
You can use Beanstalkd, as you can customize the priority of jobs and the TTR (time to run; the default is 60 seconds, but for your scenario you must increase it). There is also a nice admin console panel for Beanstalkd.
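For example, with a beanstalkd client such as Pheanstalk (a v4-style sketch; the tube name, payload and the 600-second TTR are assumptions you would adapt to your slowest supplier API):

use Pheanstalk\Pheanstalk;

$pheanstalk = Pheanstalk::create('127.0.0.1');

// Producer: queue one job per supplier search, with a TTR longer than the slowest API call.
$pheanstalk->useTube('supplier-search');
$pheanstalk->put(
    json_encode(['supplier_id' => 42, 'query' => 'blue widgets']),
    1024,   // default priority
    0,      // no delay
    600     // TTR in seconds, raised well above the 60-second default
);

// Worker: reserve a job, call the supplier API, store the result, then delete the job.
$pheanstalk->watch('supplier-search');
$job = $pheanstalk->reserve();
$payload = json_decode($job->getData(), true);
// ... fetch and parse the supplier XML here, write results to the database ...
$pheanstalk->delete($job);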
You should also keep leveraging multi cURL calls, i.e. make the requests in parallel. To make use of keep-alive you should maintain a pool of cURL handles and keep them warm (see the sketch below); the usual high-performance cURL tips apply. You may also need to tune the Linux network stack.
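For the handle pool, the idea is to create the cURL handles once and reuse them across batches, since cURL reuses an open connection automatically when the same handle talks to the same host again. A rough sketch; the pool size, timeouts and the $supplierUrls variable are illustrative assumptions:

$poolSize = 20;                            // roughly your per-server concurrency
$handles = [];
for ($i = 0; $i < $poolSize; $i++) {
    $handles[$i] = curl_init();
    curl_setopt_array($handles[$i], [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TCP_KEEPALIVE  => 1,       // send TCP keep-alive probes on idle connections
        CURLOPT_TIMEOUT        => 30,
    ]);
}

foreach (array_chunk($supplierUrls, $poolSize) as $batch) {
    $mh = curl_multi_init();
    foreach ($batch as $i => $url) {
        curl_setopt($handles[$i], CURLOPT_URL, $url);   // reusing a handle lets cURL reuse its connection
        curl_multi_add_handle($mh, $handles[$i]);
    }
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);
    foreach ($batch as $i => $url) {
        $xml = curl_multi_getcontent($handles[$i]);
        // ... parse the supplier XML and write it to the database here ...
        curl_multi_remove_handle($mh, $handles[$i]);
    }
    curl_multi_close($mh);
}

Connection reuse only pays off when a given handle keeps hitting the same supplier host, so in practice you would map handles to suppliers rather than chunking blindly as above.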
If you run this in the cloud, make sure you use multiple small (micro) machines rather than one heavy machine, as throughput is better when you have multiple resources available.
If I have a loop with a lot of curl executions happening, will that slow down the server that is running that process? I've noticed that when this process runs and I open a new tab to access some other page on the website, the page doesn't load until the curl process finishes. Is there a way for this process to run without interfering with the performance of the site?
For example this is what I'm doing:
foreach ($chs as $ch) {
    $content = curl_exec($ch);
    // ... do random stuff ...
}
I know I can do multi curl, but for the purposes of what I'm doing, I need to do it like this.
Edit:
Okay, this might change things a bit: I actually want this process to run using WordPress cron. If this is running as a WordPress "cron" job, would it hinder the page performance of the WordPress site? In essence, if the process is running and people try to access the site, will they experience lag?
The curl requests are not asynchronous, so when you use curl like that, any code after the loop has to wait to execute until each of the curl requests has finished in turn.
curl_multi_init is PHP's fix for this issue. You mentioned you need to do it the way you are, but is there a way you can refactor to use that?
http://php.net/manual/en/function.curl-multi-init.php
As an alternative, this library is really good for this purpose too: https://github.com/petewarden/ParallelCurl
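For reference, a minimal sketch of what that loop could look like with curl_multi, assuming $chs is the same array of already-configured handles from the question and each handle has CURLOPT_RETURNTRANSFER set:

$mh = curl_multi_init();
foreach ($chs as $ch) {
    curl_multi_add_handle($mh, $ch);        // queue every handle instead of running them one by one
}

do {
    curl_multi_exec($mh, $running);         // drive all transfers
    curl_multi_select($mh);                 // wait for activity instead of busy-looping
} while ($running > 0);

foreach ($chs as $ch) {
    $content = curl_multi_getcontent($ch);  // requires CURLOPT_RETURNTRANSFER on each handle
    // ... do random stuff ...
    curl_multi_remove_handle($mh, $ch);
}
curl_multi_close($mh);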
Not likely, unless you use a strictly single-threaded server for development. Different requests are handled, e.g. in Apache, by workers (which, depending on your exact setup, can be either threads or separate processes), and all these workers run independently.
The effect you're seeing is caused by your browser and not by the server. RFC 2616 suggests that a client only open a limited number of parallel connections to a server:
Clients that use persistent connections SHOULD limit the number of
simultaneous connections that they maintain to a given server. A
single-user client SHOULD NOT maintain more than 2 connections with
any server or proxy.
By the way, the standard usage of capitalized keywords like SHOULD and SHOULD NOT here is explained in RFC 2119.
And that's what e.g. Firefox, and probably other browsers, use as their default. By opening more tabs you quickly exhaust these parallel channels, and that's what causes the wait.
EDIT: but after reading earl3s' reply I realize that there's more to it: earl3s addresses the performance within each page request (and thus the server's "performance" as experienced by the individual user), which can in fact be sped up by parallelizing the curl requests. But that comes at the cost of creating more than one simultaneous connection to the system(s) you're querying... And that's where RFC 2616's recommendation comes back into play: unless the backend systems delivering the content are under your control, you should think twice before parallelizing your curl requests, as each page hit on your system will hit the backend system with n simultaneous hits...
EDIT2: to answer the OP's clarification: no (for the same reason I explained in the first paragraph: the "cron" job will be running in a different worker than those serving your users), and if you don't overdo it, i.e. don't go wild on parallel threads, you can even mildly parallelize the outgoing requests. But do the latter more to be a good neighbour than out of fear of melting down your own server.
I just tested it and it looks like the multi curl process running on WP's "cron" had no noticeable negative impact on the site's performance. I was able to load multiple other pages without any terrible lag while the multi curl process was running, so it looks like it's okay. I also made sure that there is locking so that this process doesn't get scheduled multiple times. Besides, this process will only run once a day, during U.S. low-peak hours. Thanks.
I have a web service written in PHP/MySQL. The script involves fetching data from other websites like Wikipedia, Google, etc. The average execution time for the script is 5 seconds (currently running on one server). I have now been asked to scale the system to handle 60 requests/second. Which approach should I follow?
- Split functionality between servers (I create one server to fetch data from Wikipedia, another to fetch from Google, etc., plus a main server.)
- Split load between servers (I create one main server which round-robins each request entirely to one of its child servers, with each child processing one complete request. What about sharing the MySQL database between child servers here?)
I'm not sure what you would really gain by splitting the functionality between servers (option #1). You can use Apache's mod_proxy_balancer to accomplish your second option. It has a few different algorithms to determine which server would be most likely to be able to handle the request.
http://httpd.apache.org/docs/2.1/mod/mod_proxy_balancer.html
Apache/PHP should be able to handle multiple requests concurrently by itself. You just need to make sure you have enough memory and configure Apache correctly.
Your script is not a server; it's acting as a client when it makes requests to other sites. The rest of the time it's merely a component of your server.
Yes, running multiple clients (instances of your script; you don't need more hardware) concurrently will be much faster than running them sequentially. However, if you need to fetch the data synchronously with the incoming request to your script, coordinating the results of the separate instances will be difficult. Instead, you might take a look at the curl_multi* functions, which allow you to batch up several requests and run them concurrently from a single PHP thread.
Alternatively, if you know in advance what the incoming requests to your web service will be, you should think about implementing scheduling and caching of the fetches so the data is already available when a request arrives.
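A hedged sketch of that caching idea using a plain file cache; the function name, TTL and cache location are made up:

// Serve a cached copy if it is still fresh, otherwise fetch the URL and refresh the cache.
function fetch_with_cache(string $url, int $ttl = 300): string {
    $cacheFile = sys_get_temp_dir() . '/fetch_' . md5($url) . '.cache';
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return file_get_contents($cacheFile);   // cache hit: no outgoing request at all
    }

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    curl_close($ch);

    if ($body !== false) {
        file_put_contents($cacheFile, $body);   // refresh the cache for the next caller
    }
    return (string) $body;
}

A cron job can call this for the known queries ahead of time, so the web request itself only ever hits the cache.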
I am trying to build a site/server uptime monitoring system in PHP; it will be required to check thousands of domains/IPs a minute. I have looked into cURL, as this seems to be the best method.
Edit:
The system will be required to probe a server, check that its response time is reasonable, and return its response code. It will then add a row to a MySQL database containing the response time and status code. The notification part of the system is fairly straightforward from there. The system will be on dedicated servers. Hope this adds some clarity.
Why not go for the KISS approach and use php's get_headers() function?
If you want to retrieve the status code, here's a snippet from the comments to the php manual page:
function get_http_response_code($theURL) {
    $headers = get_headers($theURL);
    return substr($headers[0], 9, 3);
}
This function (being a core php feature) should be faster and more efficient than curl.
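If you only need the status code, you can also tell get_headers() to issue HEAD requests instead of full GETs by changing the default stream context. A small sketch; the method and timeout values are just illustrative:

// Make get_headers() send HEAD requests with a short timeout instead of downloading the body.
stream_context_set_default([
    'http' => [
        'method'  => 'HEAD',
        'timeout' => 5,   // seconds; tune to what you consider a reasonable response time
    ],
]);

$headers = get_headers('http://example.com');
$status  = (int) substr($headers[0], 9, 3);   // e.g. 200, 301, 503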
If I understand correctly, this system of yours will constantly be connecting to thousands of domains/IPs, and if the connection works, it assumes that the server is up and running?
I suppose you could use cURL, but it would take a long time, especially if you're talking thousands of requests; you'd need multiple servers and lots of bandwidth for this to work properly.
You can also take a look at multi cURL for parallel requests (i.e. simultaneously sending out 10+ cURL requests instead of one at a time).
http://php.net/manual/en/function.curl-multi-exec.php
There are very good tools for things like that. No need to write it on your own.
Have a look at Nagios for example. Many admins use it for monitoring.
Your bottleneck will be in waiting for a given host to respond. Given a 30 second timeout and N hosts to check and all but the last host not responding, you'll need to wait 30(N-1) seconds to check the last host. You may never get to checking the last host.
You certainly need to send multiple HTTP requests - either multi cURL as already suggested, or the HttpRequestPool class for an OO approach.
You will also need to consider how to break the set of N hosts into subsets, to avoid the problem of failing to reach a host because you first have to work through a queue of non-responding hosts.
Checking N hosts from one server presents the greatest chance of not reaching one or more hosts due to a queue of non-responding hosts. This is the cheapest, easiest and least reliable option.
Checking one host each from N servers presents the least chance of not reaching one or more hosts due to a queue of non-responding hosts. This is the most expensive, (possibly) most difficult and most reliable option.
Consider a cost/difficulty/reliability balance that works best for you.
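As a rough illustration of keeping non-responders from blocking the rest on a single server, here is a hedged sketch; the $hosts array, the 100-host chunk size and the timeout values are arbitrary assumptions:

// Probe hosts in chunks; tight connect/total timeouts stop dead hosts from holding up the queue.
foreach (array_chunk($hosts, 100) as $chunk) {
    $mh = curl_multi_init();
    $handles = [];
    foreach ($chunk as $host) {
        $ch = curl_init($host);
        curl_setopt_array($ch, [
            CURLOPT_NOBODY         => true,   // HEAD-style probe, we only need the status code
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_CONNECTTIMEOUT => 5,      // give up quickly on unreachable hosts
            CURLOPT_TIMEOUT        => 10,     // hard cap per check
        ]);
        curl_multi_add_handle($mh, $ch);
        $handles[$host] = $ch;
    }
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);
    foreach ($handles as $host => $ch) {
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);    // 0 means no response at all
        $time = curl_getinfo($ch, CURLINFO_TOTAL_TIME);
        // ... INSERT $host, $code and $time into the MySQL table here ...
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
}

Splitting $hosts across several such workers (or servers) then follows naturally from the cost/reliability trade-off above.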