PHP best way to send post requests to hundreds of sites? - php

I've tried using Rolling Curl, Epi Curl, and other PHP multi-cURL solutions that are out there, and it takes an average of 180 seconds to send POST requests to JUST 40 sites and receive data back from them (just small success/fail strings). That is dog slow!!!
It only does well with a single POST request, which takes about 3-6 seconds, and I don't even know if that's good, because I see others talking about getting 1-second responses, which is crazy.
I've also tried using proc_open to run Linux shell commands (curl, wget), but that is slow as well, and not server friendly.
What I'm basically trying to do is build a WordPress plugin that can manage multiple WordPress sites and do mass upgrades, remote publishing, blogroll management, etc. I know there is a site out there called managewp.com, but I don't want to use their service because I want to keep the sites I manage private and develop my own. What I notice about them is that their request/response is ridiculously fast, and I am puzzled at how they're able to do that, especially with hundreds of sites.
So can someone please shed some light on how I can make these POST requests faster?
Edit
I've been doing some thinking and I asked myself, "What is so important about fetching the response? It's not like the requests that get sent don't get processed properly; they do, 99% of the time!"
So I was thinking maybe I can just send all the requests without waiting for the responses. And if I really want to track those processes and how they went, I can have the child sites send a POST request back with the status of the process, have the master site add it to a database table, and have an AJAX request poll every 10 seconds or so for status updates. How does that sound?
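One rough way to approximate that fire-and-forget idea with cURL (a sketch only; the child-site URL and the 'action' field are made up, and CURLOPT_TIMEOUT_MS needs cURL >= 7.16.2 and PHP >= 5.2.3) is to give the request a very short timeout and simply not wait for the reply:
$ch = curl_init('http://child-site.example.com/wp-admin/admin-ajax.php'); // placeholder URL
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, array('action' => 'mass_upgrade')); // placeholder payload
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_NOSIGNAL, 1);     // needed for sub-second timeouts on some builds
curl_setopt($ch, CURLOPT_TIMEOUT_MS, 200); // stop waiting after 200 ms instead of blocking on the reply
curl_exec($ch);  // usually "fails" with a timeout, which is fine here; a very slow connection may cut the request off, though
curl_close($ch);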

cURL takes about 0.6 - 0.8 seconds per request,
so for about 500 websites it could take from 300 to 400 seconds.
You could whip this through a loop.
$ch = curl_init(); // Init cURL
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/post.php"); // Post location
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // 1 = Return data, 0 = No return
curl_setopt($ch, CURLOPT_POST, true); // This is POST
// Our data
$postdata = array(
'name1' => 'value1',
'name2' => 'value2',
'name3' => 'value3',
'name4' => 'value4'
);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata); // Add the data to the request
$o = curl_exec($ch); // Execute the request
curl_close($ch); // Finish the request. Close it.
This also depends on your connection speed. From a datacenter it should be fine; if you're testing from a home location it might give not-as-good results.
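For what it's worth, a minimal loop around a snippet like the one above might look like this (the site list and payload are placeholders). It is still one request at a time, so the total time is roughly the per-request time multiplied by the number of sites:
$sites = array('http://site1.example.com/post.php', 'http://site2.example.com/post.php'); // placeholders
$postdata = array('name1' => 'value1', 'name2' => 'value2');

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata);

foreach ($sites as $site) {
    curl_setopt($ch, CURLOPT_URL, $site); // only the target changes per iteration
    $o = curl_exec($ch);                  // the small success/fail string
    // ... check $o and record the result ...
}
curl_close($ch);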

I'm currently working on a project that downloads hundreds of URLs at a time with PHP and curl_multi. Do batches of up to 250 URLs, and play with CURLOPT_TIMEOUT and CURLOPT_CONNECTTIMEOUT to refine your code's speed.
I have a cURL class (2,500+ lines) handling all the cURL magic, including multi and straight-to-file downloads: 250 URLs in 15-25 seconds using decent timeouts. (But I'm not sharing it for free...)
PS: Downloading that many URLs would require using temporary files as cURL download targets rather than memory. Just a thought...
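A bare-bones curl_multi sketch along those lines, fetching one batch in parallel (the URLs, the POST payload and the timeout values are placeholders to tune for your own setup):
$urls = array('http://site1.example.com/endpoint.php', 'http://site2.example.com/endpoint.php');
$mh = curl_multi_init();
$handles = array();

foreach ($urls as $i => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, array('cmd' => 'upgrade')); // placeholder payload
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // give up on dead hosts quickly
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);       // cap the whole transfer
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// Run all handles until every transfer has finished.
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);

foreach ($handles as $i => $ch) {
    $response = curl_multi_getcontent($ch); // the small success/fail string
    // ... record $response for $urls[$i] ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);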

Related

Is excessive cURL intensive on the server?

I am setting up a site where my users can create lists of names that get stored in the database. They can then "check" these lists: each name in the list is run through a cURL function that checks an external site to see if that name is available or taken (for domain names, Twitter names, Facebook names, gaming names, etc.). There will be a drop-down for them to select which type of name they want to check, and it checks the corresponding site.
Here's a code sample for a Runescape name checker:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://services.runescape.com/m=adventurers-log/display_player_profile.ws?searchName=" . urlencode($name));
curl_setopt($ch, CURLOPT_HTTPGET, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$output = curl_exec($ch);
curl_close($ch);
// Any of these phrases in the profile page means the name is already in use.
if (stristr($output, "non-member account")
    || stristr($output, "private profile")
    || stristr($output, "top skills")) {
    echo 'Not available';
} else {
    echo 'Available';
}
Will this cause too much stress on the server? I'm also thinking of capping lists, so maybe only 1,000 names per list for free members, and they can upgrade to run bigger lists (maybe even smaller than 1,000 for free users). Another thing I could do is store the results locally (which I'll do anyway) and load them from there if the name was searched recently, but then it's not completely accurate.
The answer can only be "it depends." It depends on how many users you have, how often those users hit the page in question, how beefy your hardware is, how much bandwidth your host allows, how much data is being transferred, and a million other things.
In general, you should locally (as in, on your server) cache as much data as you can from API responses. That prevents unnecessary duplicate API requests for data that you already had at some point previously. As for what data makes sense to cache, that is completely application/API specific, and something you will have to decide. In general, good candidates for caching are things that don't change very often and are either easy to determine when they are changed, or not important enough that somewhat stale data will be a big deal.
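As a loose illustration of that caching advice, a tiny file-based cache keyed on the lookup with a time-to-live might look like this (the path, the one-hour TTL and the check_name() helper are all invented for the example):
function cached_check($name, $ttl = 3600) {
    $cacheFile = sys_get_temp_dir() . '/namecheck_' . md5($name);

    // Serve from cache while the stored result is fresh enough.
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return file_get_contents($cacheFile);
    }

    $result = check_name($name); // hypothetical helper that does the actual cURL lookup
    file_put_contents($cacheFile, $result);
    return $result;
}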
cURL requests are fundamentally slow, and PHP is, for the most part, a synchronous language, so unless you want to wait for each request to return (which, when I tested your command, took ~1.2 seconds per request), your best bet is to either have PHP fork the curl requests using your OS's curl command via exec, or to use non-blocking sockets. This article has a good explanation of how to do it:
https://segment.io/blog/how-to-make-async-requests-in-php/
However, you're still going to run into issues where the receiving host may not be able to handle the volume of requests you're sending (or it will blacklist you). You might have an easier time breaking the requests into batches (say, ten names at a time) and then running those batches simultaneously against each host (Runescape, FB, etc.); this lets you run a few hundred simultaneous requests without hitting any one host too hard. It's still going to be a slowish process, and you might get your IP banned, but it's a reasonable approach.
Also, you might think about spreading the whole process out over a longer period of time: the user uploads the list, your server says "thanks, you'll receive an email when we're done", and a cron job then spreads the subsequent cURL requests over the course of an hour or so, which should help with all of the above issues.
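A very rough sketch of that queue-plus-cron idea, with the table, columns and check_name() helper invented for illustration:
// crontab entry (every 5 minutes):  */5 * * * * php /path/to/worker.php

// worker.php: drain a small batch of unchecked names each run
$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass'); // placeholder credentials

$rows = $db->query("SELECT id, name FROM name_queue WHERE checked_at IS NULL LIMIT 25")
           ->fetchAll(PDO::FETCH_ASSOC);

foreach ($rows as $row) {
    $status = check_name($row['name']); // hypothetical cURL lookup
    $stmt = $db->prepare("UPDATE name_queue SET status = ?, checked_at = NOW() WHERE id = ?");
    $stmt->execute(array($status, $row['id']));
    sleep(1); // stay polite to the remote host
}
// Once a list's queue is empty, email the user that the results are ready.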

CURL and DDOS Problems

I need to get some data from a remote HTTP server. I'm using cURL classes for multi-requests.
My problem is the remote server's firewall. I'm sending between 1,000 and 10,000 GET and POST requests, and the server bans me for DDoS.
I have tried these measures:
The requests contain header information:
curl_setopt($this->ch, CURLOPT_HTTPHEADER, $header);
The requests contain random referer information:
curl_setopt($this->ch, CURLOPT_REFERER, $refs[array_rand($refs)]);
The requests contain random user agents:
curl_setopt($this->ch, CURLOPT_USERAGENT, $agents[array_rand($agents)]);
And I send the requests at random intervals using sleep:
sleep(rand(0,10));
But the server still bans my access for 1 hour each time.
Sorry for my bad English :)
Thanks to all.
Sending a large number of requests in a short space of time to the server is likely to have the same impact as a DoS attack, whether that is what you intended or not. A quick fix would be to change the sleep line from sleep(rand(0,10)); (which leaves a 1 in 11 chance of sending the next request instantly) to sleep(3); (which always leaves roughly 3 seconds between requests). Three seconds should be enough of a gap to keep most servers happy. Once you've verified this works, you can reduce the value to 2 or 1 to see if you can speed things up.
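As a small illustration (the URL list is assumed; only the constant sleep is the point):
$delay = 3; // seconds; lower it once the bans stop
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
foreach ($urls as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $response = curl_exec($ch);
    // ... process $response ...
    sleep($delay); // keep a predictable, even gap between requests
}
curl_close($ch);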
A far better solution would be to create an API on the server that allows you to get the data you need in 1, or at least only a few, requests. Obviously this is only possible if you're able to make changes to the server (or can persuade those who can to make the changes on your behalf).

best value for curl timeout and connection timeout

Greetings everyone
I am working on a small crawling engine and am using cURL to request pages from various websites. What do you suggest I set my connection_timeout and timeout values to? The stuff I would normally be crawling would be pages with lots of images and text.
cURL knows two different timeouts.
For CURLOPT_CONNECTTIMEOUT it doesn't matter how much text the site contains or how many other resources like images it references because this is a connection timeout and even the server cannot know about the size of the requested page until the connection is established.
For CURLOPT_TIMEOUT it does matter. Even large pages require only a few packets on the wire, but the server may need more time to assemble the output. Also the number of redirects and other things (e.g. proxies) can significantly increase response time.
Generally speaking the "best value" for timeouts depends on your requirements and conditions of the networks and servers. Those conditions are subject of change. Therefore there is no "one best value".
I recommend using rather short timeouts and retrying failed downloads later.
Btw cURL does not automatically download resources referenced in the response. You have to do this manually with further calls to curl_exec (with fresh timeouts).
If you set it too high, your script will be slow, as a single URL that is down will take all the time you set in CURLOPT_TIMEOUT to finish processing. If you are not using proxies, you can just set the following values:
CURLOPT_TIMEOUT = 3
CURLOPT_CONNECTTIMEOUT = 1
Then you can go through the failed URLs at a later time to double-check them.
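A small sketch of that two-pass idea: aggressive timeouts on the first pass, with failed URLs collected and retried under more generous limits (fetch() is a made-up helper, and the timeout numbers are just examples):
function fetch($url, $connectTimeout, $timeout) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $connectTimeout);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body; // false on timeout or other failure
}

$failedUrls = array();
foreach ($urls as $url) {               // $urls is assumed to exist
    if (fetch($url, 1, 3) === false) {  // fast first pass
        $failedUrls[] = $url;
    }
}
foreach ($failedUrls as $url) {
    fetch($url, 10, 30);                // slower, more patient retry
}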
The best response is rik's.
I have a proxy checker, and in my benchmarks I saw that most working proxies take less than 10 seconds to connect.
So I use 10 seconds for both the connection timeout and the timeout, but that's my case; you have to decide how much time you want to allow, so start with big values, use curl_getinfo to see timing benchmarks, and decrease the values from there.
Note: a proxy that takes more than 5 or 10 seconds to connect is useless to me; that's why I use those values.
Yes. If your target is a proxy used to query another site, such a cascading connection will need fairly long values like these for the cURL calls to complete.
Especially when you run into intermittent cURL problems, check these values first.
I use
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT,30);
curl_setopt($ch, CURLOPT_TIMEOUT,60);

How to gracefully handle a downed API

With Twitter being down today, I was thinking about how best to handle calls to an API when it is down. If I am using cURL to call their API, how do I make the script fail quickly and handle the errors so as not to slow down the application?
Perhaps use a sort of cache of whether Twitter is up or down. Log invalid responses from the API in a database or a server-side file. Once you get two/three/some other number of invalid responses in a row, disable all requests to the API for x amount of time.
After x amount of time, attempt a request, if it's still down, disable for x minutes again.
If your server can run cron jobs, consider making a script that checks the API for a valid response every few minutes. If it finds the API is down, disable requests until it's back up. At least in this case the server would be doing the testing, and users won't have to be the guinea pigs.
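A loose sketch of that failure-cache idea (the flag file location, the failure threshold and the cooldown are arbitrary choices for illustration):
define('API_FLAG_FILE', sys_get_temp_dir() . '/twitter_api_down');
define('API_COOLDOWN', 300);   // seconds to back off after repeated failures
define('API_FAIL_LIMIT', 3);   // consecutive failures before backing off

function api_is_disabled() {
    if (!is_file(API_FLAG_FILE)) {
        return false;
    }
    $fails  = (int) file_get_contents(API_FLAG_FILE);
    $recent = (time() - filemtime(API_FLAG_FILE)) < API_COOLDOWN;
    return $fails >= API_FAIL_LIMIT && $recent;
}

function call_api($url) {
    if (api_is_disabled()) {
        return false; // skip the call entirely while backing off
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1); // fail fast if the API is unreachable
    curl_setopt($ch, CURLOPT_TIMEOUT, 3);
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($body === false || $code >= 500) {
        // Track consecutive failures in the flag file's contents.
        $fails = is_file(API_FLAG_FILE) ? (int) file_get_contents(API_FLAG_FILE) : 0;
        file_put_contents(API_FLAG_FILE, $fails + 1);
        return false;
    }
    @unlink(API_FLAG_FILE); // healthy again, so clear the failure count
    return $body;
}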
Use curl_setopt
curl_setopt($yourCurlHandle, CURLOPT_CONNECTTIMEOUT, '1'); // 1 second
If you use curl >= 7.16.2 and PHP >= 5.2.3, there is also CURLOPT_CONNECTTIMEOUT_MS.
Use curl_getinfo to get the cURL response code or content length and check against those.
$HttpCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);

Improving cURL performance (PHP Library)

Here is a brief overview of what I am doing; it is quite simple really:
Go out and fetch records from a database table.
Walk through all those records and for each column that contains a URL go out (using cURL) and make sure the URL is still valid.
For each record a column is updated with a current time stamp indicating when it was last checked and some other db processing takes place.
Anyhow, all this works well and does exactly what it is supposed to. The problem is that I think performance could be greatly improved in terms of how I am validating the URLs with cURL.
Here is a brief (over simplified) excerpt from my code which demonstrates how cURL is being used:
$ch = curl_init();
while($dbo = pg_fetch_object($dbres))
{
// for each iteration set url to db record url
curl_setopt($ch, CURLOPT_URL, $dbo->url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch); // perform a cURL session
$ihttp_code = intval(curl_getinfo($ch, CURLINFO_HTTP_CODE));
// do checks on $ihttp_code and update db
}
// do other stuff here
curl_close($ch);
As you can see, I am just reusing the same cURL handle the entire time, but even if I strip out all the other processing (database or otherwise), the script still takes incredibly long to run. Would changing any of the cURL options help improve performance? Tuning timeout values, etc.? Any input would be appreciated.
Thank you,
Nicholas
Setting CURLOPT_NOBODY to 1 (see the curl documentation) tells curl not to ask for the body of the response. This will contact the web server and issue a HEAD request. The response code will tell you whether the URL is valid, without transferring the bulk of the data back.
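A quick sketch of the HEAD-only variant of the loop in the question, keeping the handle reuse but adding CURLOPT_NOBODY (the timeout values are just suggestions):
$ch = curl_init();
curl_setopt($ch, CURLOPT_NOBODY, true);         // issue a HEAD request, skip the body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);

while ($dbo = pg_fetch_object($dbres)) {
    curl_setopt($ch, CURLOPT_URL, $dbo->url);
    curl_exec($ch);
    $ihttp_code = intval(curl_getinfo($ch, CURLINFO_HTTP_CODE));
    // treat 2xx/3xx as valid, update the timestamp column, etc.
}
curl_close($ch);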
If that's still too slow, then you'll likely see a vast improvement by running N threads (or processes) each doing 1/Nth of the work. The bottleneck may not be in your code, but in the response times of the remote servers. If they're slow to respond, then your loop will be slow to run.
