I would like to find the maximum number of parallel HTTP requests I can make to a server without getting errors, either from my own server or from the host server, using the Guzzle HTTP package, in an empirical, dynamic way. For example, I start from an initial number, say 200, and if no error occurs the code should automatically increase the number of parallel requests until it reaches the maximum at which errors start to occur.
Working with PHP and just developing for the web, I have forgotten concepts like memoization that may help find the solution to this problem.
So let's say I have a million requests stored in a database and I would like to send them as soon as possible: what should I do?
while (unresolved_requests()) {
    try {
        // Send the next batch in parallel, starting from the first unresolved ID.
        make_the_requests($number_of_parallel_requests, $first_unresolved_request_id);
        increase_the_parallel_requests();
    } catch (Exception $err) {
        decrease_the_parallel_requests();
    }
}
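For what it's worth, here is a minimal sketch of that loop using Guzzle's Pool, which takes a concurrency option. fetch_unresolved_batch() and mark_resolved() are hypothetical placeholders for your database access, and the adjustment rule is an additive-increase/multiplicative-decrease scheme borrowed from TCP congestion control:

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

$client      = new Client(['timeout' => 30]);
$concurrency = 200; // initial guess from the question

while ($rows = fetch_unresolved_batch($concurrency)) { // hypothetical DB helper
    $failures = 0;
    $requests = function () use ($rows) {
        foreach ($rows as $row) {
            yield new Request('GET', $row['url']);
        }
    };
    $pool = new Pool($client, $requests(), [
        'concurrency' => $concurrency,
        'fulfilled'   => function ($response, $i) use ($rows) {
            mark_resolved($rows[$i]['id']); // hypothetical DB helper
        },
        'rejected'    => function ($reason, $i) use (&$failures) {
            $failures++;
        },
    ]);
    $pool->promise()->wait();

    // No errors: probe a little higher. Errors: back off sharply.
    $concurrency = $failures === 0
        ? $concurrency + 10
        : max(10, intdiv($concurrency, 2));
}

Halving on failure instead of decrementing by one backs off quickly when the ceiling is hit, so you don't keep hammering a struggling server while probing for the maximum.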
If you want to do load testing, use a load testing tool.
Take a look at JMeter. There are many others (Wrk, Autocannon, K6, Bombardier...).
It is not trivial to develop such a testing tool and get consistent results.
I have a PHP site in which I make an AJAX call; in that AJAX call I call an API that returns XML, which I parse. The problem is that sometimes the XML is so huge that it takes a long time. The load balancer in EC2 has a timeout value of 20 minutes, so if my call takes longer than this I get a 504 error. How can I solve this issue? I know it's a server issue, but how can I solve it? I don't think php.ini is helpful here.
HTTP is a stateless protocol. It works best when responses to requests are made within a few seconds of the request. When you don't respond quickly, timeouts start coming into play. This might be a timeout you can control (fcgi process timeout) or one you can't control (third party proxy, client browser).
So what do you do when you have work that will take longer than a few seconds? Use a message queue of course.
The cheap way to do this is to store the job in a db table and have cron read from the table and process the work. This can work on a small scale, but it has some issues when you try to get larger.
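As an illustration of the cheap approach, a minimal sketch run from cron, assuming a hypothetical jobs table with id, payload and status columns; process_job() stands in for the slow XML work:

// cron_worker.php - run from cron, e.g. every minute (credentials are placeholders)
$pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$rows = $pdo->query("SELECT id, payload FROM jobs WHERE status = 'pending' LIMIT 50")
            ->fetchAll(PDO::FETCH_ASSOC);

foreach ($rows as $row) {
    process_job($row['payload']); // hypothetical function doing the slow work
    $pdo->prepare("UPDATE jobs SET status = 'done' WHERE id = ?")
        ->execute([$row['id']]);
}

The AJAX endpoint then just inserts a row and returns immediately, and the browser can poll a second endpoint for the job's status instead of holding a connection open past the load balancer's timeout.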
The proper way to do this is to use a real message queue system. Amazon has SQS, but you could just as well use Gearman, ZeroMQ, RabbitMQ, and others to handle this.
If I have a loop with a lot of curl executions happening, will that slow down the server that is running the process? I've noticed that when this process runs and I open a new tab to access some other page on the website, the page doesn't load until the curl process finishes. Is there a way for this process to run without interfering with the performance of the site?
For example this is what I'm doing:
foreach ($chs as $ch) {
    $content = curl_exec($ch);
    // ... do random stuff ...
}
I know I can do multi curl, but for the purposes of what I'm doing, I need to do it like this.
Edit:
Okay, maybe this might change things a bit, but I actually want this process to run via WordPress cron. If it runs as a WordPress "cron" job, would it hinder the page performance of the WordPress site? In essence, if the process is running and people try to access the site, will they experience lag?
The curl requests are not asynchronous, so using curl like that, any code after the loop will have to wait to execute until each of the curl requests has finished in turn.
curl_multi_init is PHP's fix for this issue. You mentioned you need to do it the way you are, but is there a way you can refactor to use that?
http://php.net/manual/en/function.curl-multi-init.php
As an alternate, this library is really good for this purpose too: https://github.com/petewarden/ParallelCurl
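For reference, a minimal curl_multi sketch, assuming $urls holds the URLs behind your $chs handles:

$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Drive all transfers until none are still active.
do {
    $status = curl_multi_exec($mh, $active);
    if ($active) {
        curl_multi_select($mh); // block until there is activity instead of spinning
    }
} while ($active && $status === CURLM_OK);

foreach ($handles as $ch) {
    $content = curl_multi_getcontent($ch);
    // ... do random stuff ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);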
Not likely, unless you use a strictly single-threaded server for development. In Apache, for example, different requests are handled by workers (which, depending on your exact setup, can be either threads or separate processes), and all these workers run independently.
The effect you're seeing is caused by your browser, not by the server. RFC 2616 suggests that a client only open a limited number of parallel connections to a server:
Clients that use persistent connections SHOULD limit the number of
simultaneous connections that they maintain to a given server. A
single-user client SHOULD NOT maintain more than 2 connections with
any server or proxy.
By the way, the standard usage of capitalized keywords like SHOULD and SHOULD NOT here is explained in RFC 2119.
And that's what e.g. Firefox, and probably other browsers too, use as their defaults. By opening more tabs you quickly exhaust these parallel open channels, and that's what causes the wait.
EDIT: but after reading earl3s' reply I realize that there's more to it: earl3s addresses the performance within each page request (and thus the server's "performance" as experienced by the individual user), which can in fact be sped up by parallelizing curl requests. But that comes at the cost of creating more than one simultaneous connection to the system(s) you're querying... And that's where RFC 2616's recommendation comes back into play: unless the backend systems delivering the content are under your control, you should think twice before parallelizing your curl requests, as each page hit on your system will hit the backend system with n simultaneous hits...
EDIT2: to answer the OP's clarification: no (for the same reason I explained in the first paragraph: the "cron" job will run in a different worker than those serving your users), and if you don't overdo it, i.e. don't go wild with parallel threads, you can even mildly parallelize the outgoing requests. But do the latter more to be a good neighbour than out of fear of melting down your own server.
I just tested it, and it looks like the multi curl process running on WP's "cron" made no noticeable negative impact on the site's performance. I was able to load multiple other pages without terrible lag while the multi curl process was running. So it looks like it's okay. I also made sure that there is locking, so this process doesn't get scheduled multiple times. Besides, it will only run once a day, during U.S. low-peak hours. Thanks.
I'm using the rolling-curl [https://github.com/LionsAd/rolling-curl] library to asynchronously retrieve content from a large number of web resources as part of a scheduled task. The library allows you to set the maximum number of concurrent cURL connections; I started out at 20 but later moved up to 50 to increase speed.
It seems that every time I run it, arbitrary URLs out of the several thousand being processed simply fail and return a blank string, and the more concurrent connections I have, the more failed requests I get. The same URL that failed one time may work the next time I run the function. What could be causing this, and how can I avoid it?
Everything Luc Franken wrote is accurate, and his answer led me to the solution to my version of the questioner's problem, which is:
Remote servers respond according to their own, highly variable, schedules. To give them enough time to respond, it's important to set two cURL parameters to provide a liberal amount of time. They are:
CURLOPT_CONNECTTIMEOUT => 30
CURLOPT_TIMEOUT => 30
You can try longer and shorter amounts of time until you find something that minimizes errors. But if you're getting intermittent non-responses with curl/multi-curl/rollingcurl, you can likely solve most of the issue this way.
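For illustration, this is how those two options look on a plain cURL handle (the URL is a placeholder; however you hand options to rolling-curl, the same constants apply):

$ch = curl_init('https://example.com/resource'); // placeholder URL
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_CONNECTTIMEOUT => 30, // seconds allowed to establish the connection
    CURLOPT_TIMEOUT        => 30, // seconds allowed for the entire transfer
]);
$content = curl_exec($ch);
curl_close($ch);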
In general you assume that this should not happen.
In the case of accessing external servers, that is just not true. Your code should be fully aware that servers might not respond, might not respond in time, or might respond incorrectly. In the HTTP process it is accepted that things can go wrong. If you reach the server you should be notified by an HTTP error code (although that doesn't always happen), but network issues can also produce empty or useless responses.
Don't trust external input. That's the root of the issue.
In your concrete case, you steadily increase the number of requests. That creates more requests, open sockets, and other resource usage. To find the solution to your exact issue you need advanced access to the server, so you can inspect the log files and monitor open connections and other concerns. Preferably you would test this on a test server without any other software creating connections, so you can isolate the issue.
But however well you test it, you are left with uncertainties. For example, you might get blocked by external servers because you make too many requests, or get caught in security filters such as DDoS filters. Monitoring and tuning the number of requests (automated or by hand) will give you the most stable solution. You could also just accept the lost requests and maintain a stable queue that makes sure you get the contents at a certain moment in time.
I am trying to build a system for monitoring site/server uptime in PHP; it will be required to check thousands of domains/IPs a minute. I have looked into cURL, as this seems to be the best method.
Edit:
The system will be required to probe a server, check that its response time is reasonable, and return its response code. It will then add a row to a MySQL db containing response time and status code. The notification part of the system is fairly straightforward from there. The system will run on dedicated servers. Hope this adds some clarity.
Why not go for the KISS approach and use PHP's get_headers() function?
If you want to retrieve the status code, here's a snippet from the comments on the PHP manual page:
function get_http_response_code($theURL) {
    $headers = get_headers($theURL);
    if ($headers === false) {
        return false; // the request failed entirely
    }
    // $headers[0] is the status line, e.g. "HTTP/1.1 200 OK"
    return substr($headers[0], 9, 3);
}
This function (being a core PHP feature) should be faster and more efficient than curl.
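For example, with a placeholder URL:

$code = get_http_response_code('https://example.com'); // e.g. "200"
if ($code === false || (int)$code >= 400) {
    // treat the host as down and record it, e.g. in your MySQL table
}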
If I understand correctly, this system of yours will constantly be connecting to thousands of domains/ips, and if it works, it assumes that the server is up and running?
I suppose you could use cURL, but it would take a long time especially if you're talking thousands of requests - you'd need multiple servers and lots of bandwidth for this to work properly.
You can also take a look at multi cURL for parallel requests (i.e. simultaneously sending out 10+ cURL requests, instead of one at a time).
http://php.net/manual/en/function.curl-multi-exec.php
There are very good tools for things like that. No need to write it on your own.
Have a look at Nagios for example. Many admins use it for monitoring.
Your bottleneck will be in waiting for a given host to respond. Given a 30-second timeout and N hosts to check, with all but the last host failing to respond, you'll need to wait 30(N-1) seconds before checking the last host (for 1,000 hosts, that's over eight hours). You may never get to checking the last host.
You certainly need to send multiple HTTP requests - either multi cURL as already suggested, or the HttpRequestPool class for an OO approach.
You will also need to consider how to break the set of N hosts down into subsets, to avoid failing to reach a host because you first have to work through a queue of non-responding hosts; a sketch of this follows below.
Checking N hosts from 1 server presents the greatest chance of not reaching one or more hosts due to a queue of non-responding hosts. This is the cheapest, easiest, and least reliable option.
Checking 1 host each from N servers presents the least chance of not reaching one or more hosts due to a queue of non-responding hosts. This is the most expensive, (possibly) most difficult, and most reliable option.
Consider a cost/difficulty/reliability balance that works best for you.
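To make the subset idea concrete, here is a hypothetical sketch for the single-server case: check_batch() stands in for a curl_multi-based checker (like the one shown in an earlier answer) that returns host => HTTP status for one batch, and record_check() is a placeholder for the MySQL insert:

foreach (array_chunk($hosts, 50) as $batch) { // batch size is an assumption to tune
    foreach (check_batch($batch, 30) as $host => $statusCode) {
        // a status of 0 means no response within the 30-second timeout
        record_check($host, $statusCode);
    }
}

This way a queue of non-responders only delays its own batch, not the whole run.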
I'm keeping myself busy working on an app that gets a feed from the Twitter search API, then needs to extract all the URLs from each status in the feed, and finally, since lots of the URLs are shortened, checks the response header of each URL to get the real URL it leads to.
For a feed of 100 entries this process can take more than a minute!! (Still working locally on my PC.)
I'm initiating the cURL resource once per feed and keeping it open until I've finished all the URL expansions. Though this helped a bit, I'm still worried that I'll be in trouble when going live.
Any ideas how to speed things up?
The issue is, as Asaph points out, that you're doing this in a single-threaded process, so all of the network latency is being serialized.
Does this all have to happen inside an HTTP request, or can you queue URLs somewhere and have some background process chew through them?
If you can do the latter, that's the way to go.
If you must do the former, you can do the same sort of thing.
Either way, you want to look at ways to chew through the requests in parallel. You could write a command-line PHP script that forks to accomplish this, though you might be better off writing such a beast in a language that supports threading, such as Ruby or Python.
You may be able to get significantly increased performance by making your application multithreaded. Multi-threading is not supported directly by PHP per se, but you may be able to launch several PHP processes, each working on a concurrent processing job.
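A minimal sketch of that multi-process idea, where worker.php is a hypothetical script that expands the URLs listed in the file it is given:

$chunks = array_chunk($urls, max(1, (int) ceil(count($urls) / 4))); // 4 workers, an arbitrary choice
$procs  = [];
foreach ($chunks as $i => $chunk) {
    file_put_contents("/tmp/chunk_$i.json", json_encode($chunk));
    // launch one worker per chunk; they run concurrently from here on
    $procs[] = popen("php worker.php /tmp/chunk_$i.json", 'r');
}
foreach ($procs as $p) {
    pclose($p); // pclose() blocks until that worker exits
}

With four workers the serialized network latency is roughly quartered, at the cost of four simultaneous connections to the URL-shortener hosts.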