Here is a brief overview of what I am doing; it is quite simple, really:
Go out and fetch records from a database table.
Walk through all those records and for each column that contains a URL go out (using cURL) and make sure the URL is still valid.
For each record a column is updated with a current time stamp indicating when it was last checked and some other db processing takes place.
Anyhow, all this works well and does exactly what it is supposed to. The problem is that I think performance could be greatly improved in terms of how I am validating the URLs with cURL.
Here is a brief (oversimplified) excerpt from my code which demonstrates how cURL is being used:
$ch = curl_init();
while ($dbo = pg_fetch_object($dbres))
{
    // for each iteration set url to db record url
    curl_setopt($ch, CURLOPT_URL, $dbo->url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch); // perform a cURL session
    $ihttp_code = intval(curl_getinfo($ch, CURLINFO_HTTP_CODE));
    // do checks on $ihttp_code and update db
}
// do other stuff here
curl_close($ch);
As you can see, I am just reusing the same cURL handle the entire time, but even if I strip out all of the other processing (database or otherwise) the script still takes incredibly long to run. Would changing any of the cURL options help improve performance? Tuning timeout values, etc.? Any input would be appreciated.
Thank you,
Nicholas
Set CURLOPT_NOBODY to 1 (see the curl documentation) to tell curl not to ask for the body of the response. This will contact the web server and issue a HEAD request. The response code will tell you if the URL is valid or not, and the bulk of the data won't be transferred back.
If that's still too slow, then you'll likely see a vast improvement by running N threads (or processes) each doing 1/Nth of the work. The bottleneck may not be in your code, but in the response times of the remote servers. If they're slow to respond, then your loop will be slow to run.
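In case it helps, here is a minimal sketch of the question's loop with CURLOPT_NOBODY applied, reusing the $dbres result set and url column from the question; the timeout values are just examples:
$ch = curl_init();
curl_setopt($ch, CURLOPT_NOBODY, true);         // issue HEAD requests, skip the body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // don't echo anything to output
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);    // give up quickly on dead hosts
curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // cap the whole request

while ($dbo = pg_fetch_object($dbres)) {
    curl_setopt($ch, CURLOPT_URL, $dbo->url);
    curl_exec($ch);
    $ihttp_code = intval(curl_getinfo($ch, CURLINFO_HTTP_CODE));
    $valid = ($ihttp_code >= 200 && $ihttp_code < 400); // treat 2xx/3xx as "still valid"
    // ... update the record's timestamp/status here ...
}

curl_close($ch);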
Related
I am setting up a site where my users can create lists of names that get stored in the database. They can then "check" these lists, and each name in the list is run through a cURL function, checking an external site to see if that name is available or taken (for domain names, Twitter names, Facebook names, gaming names, etc.). There will be a drop-down for them to select which type of name they want to find, and it checks that site.
Here's a code sample for a Runescape name checker:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://services.runescape.com/m=adventurers-log/display_player_profile.ws?searchName=" . $name);
curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$output = curl_exec($ch);
if (stristr($output, "non-member account")) {
    echo 'Not available';
}
elseif (stristr($output, "private profile")) {
    echo 'Not available';
}
elseif (stristr($output, "top skills")) {
    echo 'Not available';
}
else {
    echo 'Available';
}
curl_close($ch);
Will this cause too much stress on the server? I'm thinking also of capping lists, so maybe only 1,000 names per list for free members or something, and they can upgrade to run bigger lists (maybe even smaller than 1,000 for free users). Another thing I could do is store the results locally (which I'll do anyway), and load it from there if the name was searched recently. But then it's not completely accurate.
The answer can only be "it depends." It depends on how many users you have, how often those users hit the page in question, how beefy your hardware is, how much bandwidth your host allows, how much data is being transferred, and a million other things.
In general, you should locally (as in, on your server) cache as much data as you can from API responses. That prevents unnecessary duplicate API requests for data that you already had at some point previously. As for what data makes sense to cache, that is completely application/API specific, and something you will have to decide. In general, good candidates for caching are things that don't change very often and are either easy to determine when they are changed, or not important enough that somewhat stale data will be a big deal.
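Purely as an illustration of the caching idea, one way it could look; the checked_names table, the 24-hour lifetime, and the MySQL-flavoured SQL are all invented for the example:
// Hypothetical cache lookup before hitting the external site. $pdo, the
// checked_names table and the 24-hour lifetime are assumptions for the example.
function get_cached_status($pdo, $type, $name)
{
    $stmt = $pdo->prepare(
        'SELECT status FROM checked_names
         WHERE type = ? AND name = ? AND checked_at > NOW() - INTERVAL 24 HOUR'
    );
    $stmt->execute(array($type, $name));
    $status = $stmt->fetchColumn();
    return $status === false ? null : $status; // null = cache miss
}

function store_status($pdo, $type, $name, $status)
{
    $stmt = $pdo->prepare(
        'REPLACE INTO checked_names (type, name, status, checked_at)
         VALUES (?, ?, ?, NOW())'
    );
    $stmt->execute(array($type, $name, $status));
}

// Usage: only fall back to cURL when the cache misses.
// $status = get_cached_status($pdo, 'runescape', $name);
// if ($status === null) { /* run the cURL check, then store_status(...) */ }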
cURL requests are fundamentally slow, and PHP is, for the most part, a synchronous language, so unless you want to wait for each request to return (which, when I tested, took ~1.2 seconds per request for your command), your best bet is to either have PHP fork the curl requests using your OS's curl command via exec, or to use non-blocking sockets. This article has a good explanation of how to do it:
https://segment.io/blog/how-to-make-async-requests-in-php/
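If you go the exec route, the idea is roughly this; the temp-file naming is just one possible convention, so treat the whole thing as a sketch:
// Rough sketch: fork curl via the shell so PHP does not block on each request.
// Each response is written to its own temp file and can be parsed once the
// background processes finish.
function fork_curl($url, $outFile)
{
    $cmd = 'curl -s -o ' . escapeshellarg($outFile) . ' '
         . escapeshellarg($url) . ' > /dev/null 2>&1 &';
    exec($cmd); // returns immediately; curl keeps running in the background
}

// Usage: fork_curl($url, '/tmp/namecheck_' . md5($url) . '.html');
// then poll the /tmp files (or wait a few seconds) and run the same
// stristr() checks against their contents.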
However, you're still going to run into issues where the receiving host may not be able to handle the volume of requests you're sending (or it will blacklist you). You might have an easier time breaking the requests into batches (say ten names at a time) and then run those requests simultaneously against each host (Runescape, FB, etc)... this will let you run a few hundred simultaneous requests without hitting any one host too hard... It's still going to be a slowish process, and you might get your IP banned, but it's a reasonable approach.
Also, you might think about having the whole process broken down over a long-ish period of time... so a user uploads the list, and your server says "thanks, you'll receive an email when we're done"... then use a cron job to schedule the subsequent cURL requests over the course of an hour or so... which should help with all the above issues.
I need to get some data from a remote HTTP server. I'm using cURL classes for multi-requests.
My problem is the remote server's firewall. I'm sending between 1,000 and 10,000 GET and POST requests, and the server bans me for DDoS.
I used these measures:
The requests contain header information:
curl_setopt($this->ch, CURLOPT_HTTPHEADER, $header);
The requests contain random referer information:
curl_setopt($this->ch, CURLOPT_REFERER, $refs[rand(0, count($refs) - 1)]);
The requests use random user agents:
curl_setopt($this->ch, CURLOPT_USERAGENT, $agents[rand(0, count($agents) - 1)]);
I send the requests at random intervals using sleep():
sleep(rand(0,10));
But the server still bans my access for 1 hour each time.
Sorry for my bad english :)
Thanks for all.
Sending a large number of requests in a short space of time to the server is likely to have the same impact as a DOS attack whether that is what you intended or not. A quick fix would be to change the sleep line from sleep(rand(0,10)); which means there is a 1 in 11 chance of sending the next request instantly to sleep(3); which means there will always be 3 seconds (approximately) between requests. 3 seconds should be enough of a gap to keep most servers happy. Once you've verified this works you can reduce the value to 2 or 1 to see if you can speed things up.
A far better solution would be to create an API on the server that allows you to get the data you need in 1, or at least only a few, requests. Obviously this is only possible if you're able to make changes to the server (or can persuade those who can to make the changes on your behalf).
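If you can add such an endpoint, a hypothetical batch script on the server might look like this; the items table, the ids parameter and $pdo are all invented for the example:
// Hypothetical batch endpoint (e.g. api.php) on the remote server. It accepts
// ?ids=1,2,3 and returns all matching rows as one JSON payload, so the client
// needs a single request instead of thousands. $pdo is an existing connection.
header('Content-Type: application/json');

$ids = array_filter(array_map('intval', explode(',', isset($_GET['ids']) ? $_GET['ids'] : '')));
if (empty($ids)) {
    http_response_code(400);
    echo json_encode(array('error' => 'no ids given'));
    exit;
}

$placeholders = implode(',', array_fill(0, count($ids), '?'));
$stmt = $pdo->prepare("SELECT id, data FROM items WHERE id IN ($placeholders)");
$stmt->execute($ids);

echo json_encode($stmt->fetchAll(PDO::FETCH_ASSOC));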
I've tried using Rolling Curl, Epi Curl, and other PHP multi-curl solutions that are out there, and it takes an average of 180 seconds to send post requests to JUST 40 sites and receive data from them (I'm talking about receiving just small little success/fail strings). That is dog slow!!!
It only does well with 1 post request, which takes 3-6 seconds, and I don't even know if that's good, because I see others talking about getting 1-second responses, which is crazy.
I've also tried using proc_open to run Linux shell commands (curl, wget), but that is also slow, and not server friendly.
What I'm pretty much trying to do is a Wordpress plugin that is able to manage multiple Wordpress sites and do mass upgrades, remote publishings, blogroll management, etc. I know that there is a site out there called managewp.com, but I don't want to use their services because I want to keep the sites I manage private and develop my own. What I notice about them is that their request/response is ridiculously fast and I am just puzzled at how they're able to do that, especially with hundreds of sites.
So can someone please shed light how I can make these post requests faster?
Edit
I've been doing some thinking and I asked myself, "What is so important about fetching the response? It's not like the requests that get sent don't get processed properly, they all do 99% of the time!"
And so I was thinking maybe I can just send all the requests without getting the responses. And if I really want to do some tracking of those processes and how they went, I can have those child sites send a post request back with the status of how the process went and have the master site add them into a database table and have an ajax request query like every 10 seconds or so for status updates or something like that.. how does that sound?
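Roughly what I have in mind for the "send without waiting" part is to give cURL a very short timeout so it returns almost immediately after the request is on the wire. The timeout values below are just guesses and this is admittedly a crude sketch:
// Crude "send and don't wait" sketch: the request is dispatched, then cURL
// gives up on reading the response after ~200 ms. The child site still
// processes the request and reports back later via its own status POST.
function send_without_waiting($url, array $post)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_NOSIGNAL, true);  // needed for timeouts under one second on some systems
    curl_setopt($ch, CURLOPT_TIMEOUT_MS, 200); // bail out almost immediately
    curl_exec($ch);                            // ignore the (timed-out) result
    curl_close($ch);
}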
cURL takes about 0.6 - 0.8 seconds per request.
So for about 500 websites it could take from 300 to 400 seconds.
You could whip this through a loop.
$ch = curl_init(); // Init cURL
curl_setopt($ch, CURLOPT_URL, "http://www.example.com/post.php"); // Post location
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // 1 = Return data, 0 = No return
curl_setopt($ch, CURLOPT_POST, true); // This is POST
// Our data
$postdata = array(
'name1' => 'value1',
'name2' => 'value2',
'name3' => 'value3',
'name4' => 'value4'
);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata); // Add the data to the request
$o = curl_exec($ch); // Execute the request
curl_close($ch); // Finish the request. Close it.
This also depends on your connection speed. From a datacenter it should be fine; if you're testing from a home location it might give not-as-good results.
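Looping that over a list of sites while reusing the same handle could look roughly like this; $sites and $postdata stand in for your real URL list and payload:
// Sketch: reuse one handle for every site instead of re-initialising cURL.
$sites    = array('http://site1.example/post.php', 'http://site2.example/post.php');
$postdata = array('name1' => 'value1', 'name2' => 'value2');

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // don't hang on dead sites
curl_setopt($ch, CURLOPT_TIMEOUT, 15);

$results = array();
foreach ($sites as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $results[$url] = curl_exec($ch); // false on failure
}
curl_close($ch);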
I'm currently working on a project that downloads hundreds of URLs at a time with PHP and curl_multi. Do batches of up to 250 URLs and play with CURLOPT_TIMEOUT and CURLOPT_CONNECTTIMEOUT to refine your code's speed.
I have a cURL class (2500+ lines) handling all the cURL magic, including multi and straight-to-file downloads. 250 URLs / 15-25 seconds using decent timeouts. (But I'm not sharing it for free...)
PS: Downloading that many URLs would require using temporary files as cURL download targets and not memory. Just a thought...
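Without sharing the full class, the batching idea looks roughly like this; the batch size, temp-file naming and timeout values are examples to tune rather than recommendations:
// Sketch of one batch: download a chunk of URLs in parallel with curl_multi,
// streaming each response straight to a temp file instead of holding it in memory.
function fetch_batch(array $urls)
{
    $mh = curl_multi_init();
    $handles = array();

    foreach ($urls as $url) {
        $fp = fopen('/tmp/dl_' . md5($url), 'w');
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_FILE, $fp); // write to file, not memory
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_multi_add_handle($mh, $ch);
        $handles[] = array($ch, $fp);
    }

    // Run all transfers until none are still active.
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);

    foreach ($handles as $pair) {
        curl_multi_remove_handle($mh, $pair[0]);
        curl_close($pair[0]);
        fclose($pair[1]);
    }
    curl_multi_close($mh);
}

// Usage: foreach (array_chunk($allUrls, 250) as $batch) { fetch_batch($batch); }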
I was asked at work to use a simple Facebook API to return the number of likes or shares, which returns a JSON string.
Now, since I am going to do this for a very large number of links, which one is better:
using file_get_contents or cURL?
Both of them seem to return the same results, and cURL seems to be more complicated to use, but what is the difference between them? Why do most people recommend using cURL over file_get_contents?
Before I run the API, which might take a whole day to process, I would like to have some feedback.
A few years ago I benchmarked the two and CURL was faster. With CURL you create one CURL instance which can be used for every request, and it maps directly to the very fast libcurl library. Using file_get_contents you have the overhead of protocol wrappers and the initialization code getting executed for every single request.
I will dig out my benchmark script and run on PHP 5.3 but I suspect that CURL will still be faster.
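If you want a rough number for your own setup before committing to a day-long run, a throwaway comparison could be as simple as this; the URL and iteration count are placeholders, so point it at one of the actual API URLs you will be calling:
// Throwaway benchmark: time N requests through file_get_contents vs. a reused cURL handle.
$url = 'http://example.com/';
$n   = 50;

$t = microtime(true);
for ($i = 0; $i < $n; $i++) {
    file_get_contents($url);
}
echo 'file_get_contents: ' . round(microtime(true) - $t, 2) . "s\n";

$t = microtime(true);
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
for ($i = 0; $i < $n; $i++) {
    curl_exec($ch);
}
curl_close($ch);
echo 'cURL (reused handle): ' . round(microtime(true) - $t, 2) . "s\n";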
cURL supports HTTPS requests more widely than file_get_contents, and it's not too terribly complicated. Although the one-line file_get_contents solution sure is clean looking, its behind-the-scenes overhead is larger than cURL's.
$curl_handle = curl_init();
curl_setopt($curl_handle, CURLOPT_URL, $feedURL);
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl_handle, CURLOPT_SSL_VERIFYPEER, false); // note: this disables SSL certificate verification
$buffer = curl_exec($curl_handle);
curl_close($curl_handle);
This is what I use to make facebook api calls as many of them require an access_token and facebook will only accept access_token information in a secure connection. I've also noticed a large difference in execution time (cURL is much faster).
I'm currently using cURL in PHP a lot. It takes a lot of time to get results for about 100 pages each time. For every request I'm using code like this:
$ch = curl_init();
// get source
curl_close($ch);
What are my options to speed things up?
How should I use curl_multi_init(), etc.?
Reuse the same cURL handle ($ch) without running curl_close. This will speed it up just a little bit.
Use curl_multi_init to run the processes in parallel. This can have a tremendous effect.
Take curl_multi - it is far better. Save the handshakes - they are not needed every time!
When I use the code given at http://php.net/curl_multi_init, the responses of the 2 requests conflict.
But the code written at the link below returns each response separately (in array format):
https://stackoverflow.com/a/21362749/3177302
Or take pcntl_fork and fork some new processes to execute curl_exec. But it's not as good as curl_multi.
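The usual way to keep the responses from conflicting with curl_multi is to set CURLOPT_RETURNTRANSFER on every handle and read each one back with curl_multi_getcontent, roughly like this ($urls is a placeholder list):
// Sketch: run several requests in parallel and collect each response
// separately, keyed by its URL.
$urls = array('http://example.com/a', 'http://example.com/b');

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep output per handle
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);

$responses = array();
foreach ($handles as $url => $ch) {
    $responses[$url] = curl_multi_getcontent($ch); // each body stays separate
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);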