I am setting up a site where my users can create lists of names that get stored in the database. They can then "check" these lists, and each name in the list is run through a cURL function that checks an external site to see if that name is available or taken (for domain names, Twitter names, Facebook names, gaming names, etc.). There will be a drop-down for them to select which type of name they want to check, and the appropriate site is queried.
Here's a code sample for a Runescape name checker:
$ch = curl_init();
// URL-encode the name so spaces and special characters don't break the query string
curl_setopt($ch, CURLOPT_URL, "http://services.runescape.com/m=adventurers-log/display_player_profile.ws?searchName=" . urlencode($name));
curl_setopt($ch, CURLOPT_HTTPGET, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$output = curl_exec($ch);
// Any of these phrases in the profile page means the name is already taken
if (stristr($output, "non-member account")) {
    echo 'Not available';
} elseif (stristr($output, "private profile")) {
    echo 'Not available';
} elseif (stristr($output, "top skills")) {
    echo 'Not available';
} else {
    echo 'Available';
}
curl_close($ch);
Will this cause too much stress on the server? I'm also thinking of capping lists, maybe at 1,000 names per list for free members (or even fewer), with an upgrade option to run bigger lists. Another thing I could do is store the results locally (which I'll do anyway) and load them from there if the name was searched recently, but then it's not completely accurate.
The answer can only be "it depends." It depends on how many users you have, how often those users hit the page in question, how beefy your hardware is, how much bandwidth your host allows, how much data is being transferred, and a million other things.
In general, you should locally (as in, on your server) cache as much data as you can from API responses. That prevents unnecessary duplicate API requests for data that you already had at some point previously. As for what data makes sense to cache, that is completely application/API specific, and something you will have to decide. In general, good candidates for caching are things that don't change very often and are either easy to determine when they are changed, or not important enough that somewhat stale data will be a big deal.
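For example, here is a minimal sketch of that kind of cache, assuming a hypothetical name_checks table (name, result, checked_at) and a PDO connection; results newer than a day are served from the database instead of triggering a fresh cURL request:
function checkNameCached(PDO $db, $name, callable $remoteCheck) {
    // Look for a result checked within the last day (MySQL date syntax shown;
    // adapt the interval expression to your database)
    $stmt = $db->prepare(
        "SELECT result FROM name_checks
         WHERE name = ? AND checked_at > NOW() - INTERVAL 1 DAY");
    $stmt->execute(array($name));
    $cached = $stmt->fetchColumn();
    if ($cached !== false) {
        return $cached; // recent enough, skip the remote request
    }
    // Not cached (or stale): hit the remote site, then store the result
    $result = $remoteCheck($name); // e.g. a function wrapping the cURL check above
    $upsert = $db->prepare(
        "REPLACE INTO name_checks (name, result, checked_at) VALUES (?, ?, NOW())");
    $upsert->execute(array($name, $result));
    return $result;
}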
cURL requests are fundamentally slow, and PHP is, for the most part, a synchronous language, so unless you want to wait for each request to return (which, when I tested your request, took ~1.2 seconds each) your best bet is to either have PHP fork the curl requests using your OS's curl command via exec, or to use non-blocking sockets. This article has a good explanation of how to do it:
https://segment.io/blog/how-to-make-async-requests-in-php/
However, you're still going to run into issues where the receiving host may not be able to handle the volume of requests you're sending (or it will blacklist you). You might have an easier time breaking the requests into batches (say ten names at a time) and then run those requests simultaneously against each host (Runescape, FB, etc)... this will let you run a few hundred simultaneous requests without hitting any one host too hard... It's still going to be a slowish process, and you might get your IP banned, but it's a reasonable approach.
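A rough sketch of that batching idea using curl_multi (the batch size, URLs, and options here are illustrative, not a drop-in implementation):
// Sketch only: check a batch of names in parallel with curl_multi.
// $urls maps each name to the profile URL to check.
function checkBatch(array $urls) {
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $name => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$name] = $ch;
    }
    // Run all handles until every transfer in the batch has finished
    do {
        curl_multi_exec($mh, $running);
        if ($running > 0) {
            curl_multi_select($mh, 1.0); // wait for activity, avoid a busy loop
        }
    } while ($running > 0);
    $results = array();
    foreach ($handles as $name => $ch) {
        $results[$name] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results; // raw HTML per name, ready for the stristr() checks above
}
Ten or so handles per batch, with a short pause between batches, keeps you well under most hosts' tolerance while still being far faster than one request at a time.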
Also, you might think about having the whole process broken down over a long-ish period of time... so a user uploads the list, and your server says "thanks, you'll receive an email when we're done"... then use a cron job to schedule the subsequent cURL requests over the course of an hour or so... which should help with all the above issues.
I need to get some data from a remote HTTP server. I'm using cURL classes for multi-requests.
My problem is the remote server's firewall. I'm sending between 1,000 and 10,000 GET and POST requests, and the server bans me for DDoS.
I have tried these measures:
My requests include header information:
curl_setopt($this->ch, CURLOPT_HTTPHEADER, $header);
My requests include random referer information:
curl_setopt($this->ch, CURLOPT_REFERER, $refs[array_rand($refs)]);
My requests include random user agents:
curl_setopt($this->ch, CURLOPT_USERAGENT, $agents[array_rand($agents)]);
And I send the requests at random intervals using sleep:
sleep(rand(0,10));
But the server still bans my access for 1 hour each time.
Sorry for my bad English :)
Thanks to all.
Sending a large number of requests to the server in a short space of time is likely to have the same impact as a DoS attack, whether or not that is what you intended. A quick fix would be to change the sleep line from sleep(rand(0,10)); (which means there is a 1 in 11 chance of sending the next request instantly) to sleep(3);, so there will always be roughly 3 seconds between requests. 3 seconds should be enough of a gap to keep most servers happy. Once you've verified this works, you can reduce the value to 2 or 1 to see if you can speed things up.
A far better solution would be to create an API on the server that allows you to get the data you need in 1, or at least only a few, requests. Obviously this is only possible if you're able to make changes to the server (or can persuade those who can to make the changes on your behalf).
I am writing a kind of test system in PHP that tests my database records. I have separate PHP files for every test case. One (master) file is given the test number and the input parameters for that test in the form of a URL string. That file determines the test number and calls the appropriate test case based on it. Now I have a bunch of URL strings; I want those to be passed to that (master) file so that every test case starts working independently after receiving its parameters.
PHP is a single-threaded entity; no multithreading currently exists for it. However, there are a few things you can do to achieve similar (but not identical) results for the use cases I have come across when people ask me about multithreading. Again, there is no multithreading in PHP, but some of the options below may help you create something with characteristics that match your requirement.
libevent: you could use this to create an event loop for PHP which would make blocking less of an issue. See http://www.php.net/manual/en/ref.libevent.php
curl_multi: Another useful part of the cURL extension that can fire off GET/POST requests to several other services in parallel.
Process Control: I haven't used this myself, but it may be of value if process control is one aspect of your issue (see the sketch after this list). http://uk.php.net/pcntl
Gearman: Now this I've used and it's pretty good. It allows you to create workers and spin off processes into a queue. You may also want to look at rabbit-php or ZeroMQ.
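As a rough illustration of the process-control route (a sketch only, assuming the pcntl extension is installed and the script runs from the CLI; the URLs are placeholders), each child process could run one test case independently:
// Sketch: fork one child per test URL (requires the pcntl extension, CLI only).
$testUrls = array(
    "http://localhost/master.php?test=1&param=foo",   // hypothetical URLs
    "http://localhost/master.php?test=2&param=bar",
);
$children = array();
foreach ($testUrls as $url) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed");
    } elseif ($pid === 0) {
        // Child process: run one test case, then exit
        file_get_contents($url);
        exit(0);
    }
    $children[] = $pid; // parent keeps track of child PIDs
}
// Parent waits for all children to finish
foreach ($children as $pid) {
    pcntl_waitpid($pid, $status);
}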
PHP is not multithreaded, it's single-threaded. You cannot start new threads within PHP. Your best bet would be a file_get_contents (or cURL) call to another PHP script to "mimic" threads. True multithreading isn't available in PHP.
You could also have a look at John's post at http://phplens.com/phpeverywhere/?q=node/view/254.
What you can do is use cURL to send the requests back to the server. The request will be handled and the results will be returned.
An example would be:
$c = curl_init("http://servername/".$script_name.$params);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($c);
curl_close($c);
Although this is not considered multithreading, it can be used to achieve your goal.
Greetings everyone
I am working on a small crawling engine and am using curl to request pages from various websites. The question is: what do you suggest I set my connection timeout and timeout values to? The stuff I would normally be crawling would be pages with lots of images and text.
cURL knows two different timeouts.
For CURLOPT_CONNECTTIMEOUT it doesn't matter how much text the site contains or how many other resources like images it references because this is a connection timeout and even the server cannot know about the size of the requested page until the connection is established.
For CURLOPT_TIMEOUT it does matter. Even large pages require only a few packets on the wire, but the server may need more time to assemble the output. Also the number of redirects and other things (e.g. proxies) can significantly increase response time.
Generally speaking, the "best value" for timeouts depends on your requirements and the conditions of the networks and servers. Those conditions are subject to change. Therefore there is no one best value.
I recommend using rather short timeouts and retrying failed downloads later.
Btw cURL does not automatically download resources referenced in the response. You have to do this manually with further calls to curl_exec (with fresh timeouts).
If you set it too high then your script will be slow, as one URL that is down will take all the time you set in CURLOPT_TIMEOUT to finish processing. If you are not using proxies then you can just set the following values:
CURLOPT_TIMEOUT = 3
CURLOPT_CONNECTTIMEOUT = 1
Then you can go through the failed URLs at a later time to double-check on them.
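A minimal sketch of that pattern (the timeout values match the ones above; how you store and retry the failures is up to you):
// Sketch: short timeouts, collect failures for a later retry pass.
$failed = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1); // give up on dead hosts quickly
    curl_setopt($ch, CURLOPT_TIMEOUT, 3);        // cap the whole transfer
    $body = curl_exec($ch);
    if ($body === false) {
        $failed[] = $url; // retry these later with more generous timeouts
    } else {
        // process $body
    }
    curl_close($ch);
}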
The best response is rik's.
I have a proxy checker, and in my benchmarks I saw that most working proxies take less than 10 seconds to connect.
So I use 10 seconds for both the connection timeout and the total timeout, but that's my case; you have to decide how much time you want to allow. Start with big values, use curl_getinfo to see the time benchmarks, and then decrease the value.
Note: a proxy that takes more than 5 or 10 seconds to connect is useless for me, which is why I use those values.
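A quick way to collect those curl_getinfo benchmarks (a sketch; the proxy address and target URL are placeholders):
// Sketch: measure connect and total time for one request through a proxy.
$ch = curl_init("http://example.com/");
curl_setopt($ch, CURLOPT_PROXY, $proxy);          // illustrative proxy address
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_exec($ch);
$connectTime = curl_getinfo($ch, CURLINFO_CONNECT_TIME);
$totalTime   = curl_getinfo($ch, CURLINFO_TOTAL_TIME);
curl_close($ch);
echo "connect: {$connectTime}s, total: {$totalTime}s\n";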
Yes. If your target is a proxy used to query another site, such a cascading connection will require fairly long timeouts like these values to complete the curl calls.
Especially if you encounter intermittent curl problems, check these values first.
I use
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
I just had a look at the docs on sleep().
Where would you use this function?
Is it there to give the CPU a break in an expensive function?
Any common pitfalls?
One place where it finds use is to create a delay.
Let's say you've built a crawler that uses curl/file_get_contents to get remote pages. Now you don't want to bombard the remote server with too many requests in a short time, so you introduce a delay between consecutive requests.
sleep takes its argument in seconds; its friend usleep takes its argument in microseconds and is more suitable in some cases.
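For example, a polite crawler loop might look like this (a sketch; the URL list and the half-second delay are illustrative):
// Sketch: half a second between requests so the remote server isn't hammered.
foreach ($urlsToCrawl as $url) {
    $html = file_get_contents($url);
    // ... parse/store $html ...
    usleep(500000); // 500,000 microseconds = 0.5 seconds
}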
Another example: You're running some sort of batch process that makes heavy use of a resource. Maybe you're walking the database of 9,000,000 book titles and updating about 10% of them. That process has to run in the middle of the day, but there are so many updates to be done that running your batch program drags the database server down to a crawl for other users.
So you modify the batch process to submit, say, 1000 updates, then sleep for 5 seconds to give the database server a chance to finish processing any requests from other users that have backed up.
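In code, that batch pattern is roughly the following (a sketch; updateNextBatchOfTitles is a hypothetical helper and $db is your database handle):
// Sketch: update 1000 titles at a time, then pause so other queries can catch up.
$offset = 0;
do {
    $updated = updateNextBatchOfTitles($db, $offset, 1000); // hypothetical helper
    $offset += 1000;
    sleep(5); // give the database server breathing room
} while ($updated > 0);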
Here's a snippet of how I use sleep in one of my projects:
foreach($addresses as $address)
{
$url = "http://maps.google.com/maps/geo?q={$address}&output=json...etc...";
$result = file_get_contents($url);
$geo = json_decode($result, TRUE);
// Do stuff with $geo
sleep(1);
}
In this case sleep helps me prevent being blocked by Google maps, because I am sending too many requests to the server.
Old question I know, but another reason for using u/sleep can be when you are writing security/cryptography code, such as an authentication script. A couple of examples:
You may wish to reduce the effectiveness of a potential brute force attack by making your login script purposefully slow, especially after a few failed attempts (a rough sketch of this follows at the end of this answer).
Also you might wish to add an artificial delay during encryption to mitigate against timing attacks. I know that the chances are slim that you're going to be writing such in-depth encryption code in a language like PHP, but still valid I reckon.
EDIT
Using u/sleep against timing attacks is not a good solution. You can still get the important data in a timing attack, you just need more samples to filter out the noise that u/sleep adds.
You can find more information about this topic in: Could a random sleep prevent timing attacks?
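Coming back to the brute-force example above, a minimal sketch might look like this (how $failedAttempts is tracked per account or IP is left out):
// Sketch: slow the login check down as failed attempts accumulate.
// $failedAttempts is assumed to be tracked per account/IP elsewhere.
if ($failedAttempts > 3) {
    sleep(min($failedAttempts, 10)); // cap the delay at 10 seconds
}
// ... then verify the credentials as usual ...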
Another way to use it is when you want to execute a cron job more often than once per minute (cron's minimum interval). I use the following code for this:
sleep(30);
include 'cronjob.php';
I have cron call this file and cronjob.php every minute, so cronjob.php effectively runs every 30 seconds.
This is a bit of an odd case...file transfer throttling.
In a file transfer service we ran a long time ago, the files were served from 10Mbps uplink servers. To prevent the network from bogging down, the download script tracked how many users were downloading at once, and then calculated how many bytes it could send per second per user. It would send part of this amount, then sleep a moment (1/4 second, I think) then send more...etc.
In this way, the servers ran continuously at about 9.5Mbps, without having uplink saturation issues...and always dynamically adjusting speeds of the downloads.
I wouldn't do it this way, or in PHP, now...but it worked great at the time.
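A sketch of that throttling loop (the bandwidth variables and the chunk-size calculation are simplified placeholders):
// Sketch: send the file in chunks, sleeping 1/4 second between chunks.
$bytesPerSecond = $allowedBandwidth / max($activeDownloads, 1); // per-user budget
$chunkSize = (int) ($bytesPerSecond / 4);                       // 4 chunks per second
$fp = fopen($filePath, 'rb');
while (!feof($fp)) {
    echo fread($fp, $chunkSize);
    flush();          // push the chunk out to the client
    usleep(250000);   // 1/4 second pause keeps the uplink under its limit
}
fclose($fp);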
You can use sleep to pause the script execution... for example to delay an AJAX response on the server side, or to implement an observer/polling loop. You can also use it to simulate delays.
I also use it to delay sendmail() and the like.
Some people use sleep() to prevent DoS attacks and login brute-forcing; I don't agree with that, because you would also need checks to prevent the user from running the script multiple times in parallel.
Also check out usleep.
I had to use it recently when I was utilising Google's Geolocation API. Every address in a loop needed to call Google's server so it needed a bit of time to receive a response. I used usleep(500000) to give everything involved enough time.
I wouldn't typically use it for serving web pages, but it's useful for command line scripts.
$ready = false;
do {
$ready = some_monitor_function();
sleep(2);
} while (!$ready);
Super old posts, but I thought I would comment as well.
I recently had to check on a VERY long-running process that creates some files. So I made a function that iterates over a cURL check: if the file I'm looking for doesn't exist yet, I sleep the PHP script and check again a little later:
function remoteFileExists() {
    $curl = curl_init('http://domain.com/file.ext');
    // don't fetch the actual page, you only want to check the connection is ok
    curl_setopt($curl, CURLOPT_NOBODY, true);
    // do request
    $result = curl_exec($curl);
    $statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    curl_close($curl); // close the handle before we possibly recurse
    // if request did not fail
    if ($result !== false) {
        // file not there yet: wait a bit and check again
        if ($statusCode == 404) {
            sleep(7);
            return remoteFileExists();
        }
        return true;
    }
    return false;
}
echo remoteFileExists() ? 'exists' : 'not found';
One of its applications: if I am sending mails with a script to 100+ customers, the whole operation takes only 1-2 seconds, and most providers like Hotmail and Yahoo consider that spam, so to avoid this we need to add a delay after every mail.
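Something along these lines (a sketch; the one-second delay is just an example value):
// Sketch: pause briefly after each mail so the burst doesn't look like spam.
foreach ($customers as $customer) {
    mail($customer['email'], $subject, $body, $headers);
    sleep(1); // spread the sends out over time
}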
Among the others: you are testing a web application that makes asynchronous requests (AJAX calls, lazy image loading, ...).
You are testing it locally, so responses are immediate since there is only one user (you) and no network latency.
Using sleep lets you see/test how the web app behaves when load and network latency delay requests.
A quick pseudo code example of where you may not want to get millions of alert emails for a single event but you want your script to keep running.
if (CheckSystemCPU() > 95) {
    SendMeAnEmail();
    sleep(1800); // wait 30 minutes before the next possible alert
}
Here is a brief overview of what I am doing, it is quite simple really:
Go out and fetch records from a database table.
Walk through all those records and for each column that contains a URL go out (using cURL) and make sure the URL is still valid.
For each record a column is updated with a current time stamp indicating when it was last checked and some other db processing takes place.
Anyhow, all this works well and does exactly what it is supposed to. The problem is that I think performance could be greatly improved in terms of how I am validating the URLs with cURL.
Here is a brief (over simplified) excerpt from my code which demonstrates how cURL is being used:
$ch = curl_init();
while($dbo = pg_fetch_object($dbres))
{
// for each iteration set url to db record url
curl_setopt($ch, CURLOPT_URL, $dbo->url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch); // perform a cURL session
$ihttp_code = intval(curl_getinfo($ch, CURLINFO_HTTP_CODE));
// do checks on $ihttp_code and update db
}
// do other stuff here
curl_close($ch);
As you can see, I am just reusing the same cURL handle the entire time, but even if I strip out all of the other processing (database or otherwise) the script still takes incredibly long to run. Would changing any of the cURL options help improve performance? Tuning timeout values, etc.? Any input would be appreciated.
Thank you,
Nicholas
Set CURLOPT_NOBODY to 1 (see the curl documentation) to tell curl not to ask for the body of the response. This will contact the web server and issue a HEAD request. The response code will tell you if the URL is valid or not, without transferring the bulk of the data back.
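Applied to the loop above, that would look roughly like this (the extra connect timeout is optional):
$ch = curl_init();
while ($dbo = pg_fetch_object($dbres)) {
    curl_setopt($ch, CURLOPT_URL, $dbo->url);
    curl_setopt($ch, CURLOPT_NOBODY, true);        // HEAD request: headers only, no body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);   // optional: fail fast on dead hosts
    curl_exec($ch);
    $ihttp_code = intval(curl_getinfo($ch, CURLINFO_HTTP_CODE));
    // do checks on $ihttp_code and update db
}
curl_close($ch);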
If that's still too slow, then you'll likely see a vast improvement by running N threads (or processes) each doing 1/Nth of the work. The bottleneck may not be in your code, but in the response times of the remote servers. If they're slow to respond, then your loop will be slow to run.