I am scraping data from a URL using cURL:
for ($i = 0; $i < 1000000; $i++) {
    $curl_handle = curl_init();
    curl_setopt($curl_handle, CURLOPT_URL, 'http://example.com?page='.$i);
    curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, true); // capture the response instead of echoing it
    $html = curl_exec($curl_handle);
    curl_close($curl_handle);
    // some code to save the HTML page on HDD
}
Is there some way I could speed up the process? Maybe multithreading? How could I do it?
cURL Multi does not make parallel requests; it makes asynchronous requests.
The documentation was wrong until 5 minutes ago, it will take some time for the corrected documentation to be deployed and translated.
Asynchronous I/O (using something like the cURL Multi API) is the simplest thing to do. However, it can only make the requests asynchronously: the processing of data once downloaded, for example writing to disk, still causes lots of blocking I/O, and further processing of the data (parsing JSON, for example) happens synchronously, in a single thread of execution.
Multi-threading is the other option; it requires a thread-safe build of PHP and the pthreads extension installed.
Multi-threading has the advantage that all processing, for each download and any subsequent actions, can be done in parallel, fully utilizing all the CPU cores available.
What is best depends largely on how much processing of downloaded data your code must perform, and even then it can be considered a matter of opinion.
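For illustration, here is a minimal sketch of the threaded approach, assuming a thread-safe (ZTS) PHP build with the pthreads extension loaded, run from the CLI; the batch size and output path are arbitrary placeholders:

// one worker thread per download; the download and the disk write both happen in parallel
class PageDownload extends Thread {
    private $url;
    public function __construct($url) { $this->url = $url; }
    public function run() {
        $ch = curl_init($this->url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $html = curl_exec($ch);
        curl_close($ch);
        file_put_contents('/tmp/page_' . md5($this->url) . '.html', $html);
    }
}

$threads = [];
for ($i = 0; $i < 8; $i++) { // a small batch at a time, not a million threads at once
    $threads[$i] = new PageDownload('http://example.com?page=' . $i);
    $threads[$i]->start();
}
foreach ($threads as $t) {
    $t->join(); // wait for the whole batch before starting the next one
}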
You're looking for the curl_multi_* set of functions: "Allows the processing of multiple cURL handles in parallel".
Take a look at the complete example on the curl_multi_init() page.
Check out these articles for more information about how curl_multi_exec() works:
http://technosophos.com/2012/10/26/php-and-curlmultiexec.html
http://www.somacon.com/p537.php
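To give a feel for the API, here is a rough sketch of fetching a batch of the asker's pages with curl_multi; the batch size of 10 is an arbitrary choice:

$mh = curl_multi_init();
$handles = [];
for ($i = 0; $i < 10; $i++) {
    $ch = curl_init('http://example.com?page=' . $i);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$i] = $ch;
}

// drive all transfers until every one has finished
do {
    $status = curl_multi_exec($mh, $active);
    if ($active) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($active && $status == CURLM_OK);

foreach ($handles as $i => $ch) {
    $html = curl_multi_getcontent($ch);
    // save $html to disk here
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);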
Initial condition: I have code in a PHP file. The code is called 5 times in this file, and executing it takes 30 seconds.
What happens next: if I need to execute this code 50 times, one execution in the browser will take 300 seconds; for 500 times, 3000 seconds. So it is serial execution of the code.
What I need: I need to execute this code in parallel, as several instances, to minimize the execution time so the user does not have to wait so long.
What I did: I used PHP cURL to execute this code in parallel. I called this file several times to minimize the execution time.
So I want to know whether this method is correct, how many cURL calls I can run at once, and how many resources they require. If there is a better method for executing this code in parallel, I would appreciate a tutorial.
Any help will be appreciated.
Probably the simplest option, without changing your code (too much), would be to call PHP through the command line and not cURL. This cuts out the overhead of Apache (both in memory and speed), networking, etc. Plus, cURL is not a portable option, as some servers can't see themselves (in network terms).
$process1 = popen('php myfile.php [parameters]', 'r'); // popen requires a mode argument
$process2 = popen('php myfile.php [parameters]', 'r');

// get the response from the children: you can loop until all have completed
$response1 = stream_get_contents($process1);
$response2 = stream_get_contents($process2);
pclose($process1);
pclose($process2);
You'll need to remove any reference to Apache-added variables in $_SERVER, and replace $_GET with argv/argc references, but otherwise it should just work.
The best solution, though, will probably be pthreads (http://php.net/manual/en/book.pthreads.php), which allows you to do exactly what you want. It will require some editing of your code (and possibly installing the extension), but it does what you're asking.
PHP cURL's overhead is low enough that you don't have to worry about it. If you can make loopback calls to a server farm through a load balancer, that's a good use case for cURL. I've also used pcntl_fork() for same-host parallelism, but it's harder to set up. I've written classes built on both; see my PHP lib at https://github.com/andrasq/quicklib for ideas (or just borrow code, it's open source).
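In case it helps, a bare-bones sketch of the pcntl_fork() pattern (CLI only, POSIX systems only; do_work() is a hypothetical worker function):

$pids = [];
for ($i = 0; $i < 4; $i++) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed\n");
    } elseif ($pid === 0) {
        do_work($i); // child: handle one slice of the work, then exit
        exit(0);
    }
    $pids[] = $pid; // parent: remember the child and keep forking
}
foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status); // reap children so they don't become zombies
}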
Consider using Gearman. Documentation:
http://php.net/manual/en/book.gearman.php
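A minimal sketch of the Gearman pattern, assuming the PECL gearman extension and a gearmand server on localhost; run_test and my_long_task are hypothetical names:

// worker.php -- run one or more of these in the background:
$worker = new GearmanWorker();
$worker->addServer(); // defaults to 127.0.0.1:4730
$worker->addFunction('run_test', function (GearmanJob $job) {
    return my_long_task($job->workload()); // hypothetical long-running task
});
while ($worker->work());

// client.php -- queue jobs without blocking the web request:
$client = new GearmanClient();
$client->addServer();
$client->doBackground('run_test', json_encode(['case' => 1]));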
I am writing a kind of test system in PHP that tests my database records. I have a separate PHP file for every test case. One (master) file is given the test number and the input parameters for that test in the form of a URL string. That file determines the test number and calls the appropriate test case based on it. Now I have a bunch of URL strings to be passed; I want them passed to that (master) file so that every test case starts working independently after receiving its parameters.
PHP is a single-threaded entity; no multithreading currently exists for it. However, there are a few things you can do to achieve similar (but not identical) results for the use cases I have come across when people ask me about multithreading. Again, there is no multithreading in PHP, but some of the options below may help you build something with characteristics that match your requirement.
libevent: you could use this to create an event loop for PHP which would make blocking less of an issue. See http://www.php.net/manual/en/ref.libevent.php
curl_multi: Another useful library that can fire off get/post to other services.
Process Control: I have not used this myself, but it may be of value if process control is one aspect of your issue. http://uk.php.net/pcntl
Gearman: Now this I've used and it's pretty good. It allows you to create workers and spin off processes into a queue. You may also want to look at rabbit-php or ZeroMQ.
PHP is not multithreaded; it's single-threaded. You cannot start new threads within PHP. Your best bet would be a file_get_contents (or cURL) request to another PHP script to "mimic" threads. True multithreading isn't available in PHP.
You could also have a look at John's post at http://phplens.com/phpeverywhere/?q=node/view/254.
What you can do is use cURL to send the requests back to the server. The request will be handled and the results will be returned.
An example would be:
$c = curl_init("http://servername/".$script_name.$params);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($c);
curl_close($c);
Although this is not considered multithreading, it can be used to achieve your goal.
I was asked at work to use a simple Facebook API that returns the number of likes or shares as a JSON string.
Now, since I am going to do this for a very large number of links, which one is better:
file_get_contents or cURL?
Both of them seem to return the same results, and cURL seems more complicated to use, but what is the difference between them? Why do most people recommend using cURL over file_get_contents?
Before I run the API, which might take a whole day to process, I would like to have some feedback.
A few years ago I benchmarked the two, and cURL was faster. With cURL you create one cURL instance that can be reused for every request, and it maps directly to the very fast libcurl library. With file_get_contents you pay the overhead of the protocol wrappers and the initialization code that gets executed for every single request.
I will dig out my benchmark script and run it on PHP 5.3, but I suspect that cURL will still be faster.
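If you want to reproduce the comparison yourself, here is a rough micro-benchmark sketch (the URL and iteration count are placeholders):

$url = 'http://example.com/';
$n = 100;

$start = microtime(true);
for ($i = 0; $i < $n; $i++) {
    file_get_contents($url); // wrapper setup cost is paid on every request
}
$fgc = microtime(true) - $start;

$start = microtime(true);
$ch = curl_init($url); // one handle, reused for every request
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
for ($i = 0; $i < $n; $i++) {
    curl_exec($ch);
}
curl_close($ch);
$curl = microtime(true) - $start;

printf("file_get_contents: %.2fs, cURL (reused handle): %.2fs\n", $fgc, $curl);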
cURL supports HTTPS requests more widely than file_get_contents, and it's not too terribly complicated. Although the one-line file_get_contents solution sure is clean looking, its behind-the-scenes overhead is larger than cURL's.
$curl_handle = curl_init();
curl_setopt($curl_handle, CURLOPT_URL, $feedURL);
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl_handle, CURLOPT_SSL_VERIFYPEER, false); // note: this disables certificate checking
$buffer = curl_exec($curl_handle);
curl_close($curl_handle);
This is what I use to make Facebook API calls, as many of them require an access_token and Facebook will only accept access_token information over a secure connection. I've also noticed a large difference in execution time (cURL is much faster).
I just had a look at the docs on sleep().
Where would you use this function?
Is it there to give the CPU a break in an expensive function?
Any common pitfalls?
One place where it finds use is to create a delay.
Let's say you've built a crawler that uses curl/file_get_contents to get remote pages. Now you don't want to bombard the remote server with too many requests in a short time, so you introduce a delay between consecutive requests.
sleep takes its argument in seconds; its friend usleep takes microseconds and is more suitable in some cases.
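For example, a quarter-second pause between fetches (the URL list is a placeholder):

foreach ($urls as $url) {
    $html = file_get_contents($url);
    // ... process $html ...
    usleep(250000); // 250 ms, so we don't hammer the remote server
}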
Another example: You're running some sort of batch process that makes heavy use of a resource. Maybe you're walking the database of 9,000,000 book titles and updating about 10% of them. That process has to run in the middle of the day, but there are so many updates to be done that running your batch program drags the database server down to a crawl for other users.
So you modify the batch process to submit, say, 1000 updates, then sleep for 5 seconds to give the database server a chance to finish processing any requests from other users that have backed up.
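A sketch of that pattern, where fetch_next_batch() and apply_update() are hypothetical helpers around your database layer:

while ($batch = fetch_next_batch($db, 1000)) { // next 1000 titles needing an update
    foreach ($batch as $row) {
        apply_update($db, $row);
    }
    sleep(5); // let the database server catch up on other users' requests
}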
Here's a snippet of how I use sleep in one of my projects:
foreach ($addresses as $address) {
    $url = "http://maps.google.com/maps/geo?q=" . urlencode($address) . "&output=json...etc...";
    $result = file_get_contents($url);
    $geo = json_decode($result, TRUE);
    // Do stuff with $geo
    sleep(1);
}
In this case sleep helps me avoid being blocked by Google Maps for sending too many requests to the server.
Old question, I know, but another reason for using sleep/usleep can be when you are writing security/cryptography code, such as an authentication script. A couple of examples:
You may wish to reduce the effectiveness of a potential brute-force attack by making your login script purposefully slow, especially after a few failed attempts (see the sketch below).
Also, you might wish to add an artificial delay during encryption to mitigate timing attacks. I know the chances are slim that you're going to be writing such in-depth encryption code in a language like PHP, but I reckon it's still valid.
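A sketch of the brute-force throttle, assuming $failedAttempts is a counter you load from your datastore:

if ($failedAttempts > 0) {
    sleep(min($failedAttempts, 10)); // grow the delay with each failure, but cap it
}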
EDIT
Using sleep/usleep against timing attacks is not a good solution. An attacker can still get the important data in a timing attack; they just need more samples to filter out the noise that the sleep adds.
You can find more information about this topic in: Could a random sleep prevent timing attacks?
Another way to use it: if you want to execute a cronjob more often than cron's one-minute minimum. I use the following code for this:
sleep(30);
include 'cronjob.php';
I have cron call both this file and cronjob.php every minute, so cronjob.php effectively runs every 30 seconds.
This is a bit of an odd case...file transfer throttling.
In a file transfer service we ran a long time ago, the files were served from 10Mbps uplink servers. To prevent the network from bogging down, the download script tracked how many users were downloading at once, and then calculated how many bytes it could send per second per user. It would send part of this amount, then sleep a moment (1/4 second, I think) then send more...etc.
In this way, the servers ran continuously at about 9.5Mbps, without having uplink saturation issues...and always dynamically adjusting speeds of the downloads.
I wouldn't do it this way, or in PHP, now...but it worked great at the time.
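For the curious, the core of the idea looked roughly like this (the path and rate are made up; the real script recalculated the per-user budget as users came and went):

$bytesPerSecond = 100 * 1024; // this user's current share of the uplink
$chunk = (int) ($bytesPerSecond / 4); // we wake up four times a second

$fp = fopen('/path/to/file.bin', 'rb');
while (!feof($fp)) {
    echo fread($fp, $chunk);
    flush(); // push the bytes out before pausing
    usleep(250000); // quarter-second nap keeps us near the budget
}
fclose($fp);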
You can use sleep to pause script execution, for example to delay an AJAX response on the server side or to implement an observer. You can also use it to simulate delays.
I also use it to delay sendmail() & co.
Some people use sleep() to slow down DoS attacks and login brute-forcing; I don't entirely agree, because you need additional checks to prevent the attacker from simply running multiple requests in parallel.
Check out usleep too.
I had to use it recently when I was utilising Google's Geolocation API. Every address in a loop needed to call Google's server, so each call needed a bit of time to receive a response. I used usleep(500000) to give everything involved enough time.
I wouldn't typically use it for serving web pages, but it's useful for command line scripts.
$ready = false;
do {
    $ready = some_monitor_function();
    if (!$ready) {
        sleep(2); // only wait when we have to poll again
    }
} while (!$ready);
Super old post, I know, but I thought I would comment as well.
I recently had to wait on a VERY long-running process that created some files, so I made a function that polls via cURL. If the file I'm looking for doesn't exist yet, the script sleeps and checks again a bit later:
function remoteFileExists() {
    while (true) {
        $curl = curl_init('http://domain.com/file.ext');
        //don't fetch the actual page, you only want to check the connection is ok
        curl_setopt($curl, CURLOPT_NOBODY, true);
        //do request
        $result = curl_exec($curl);
        $statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
        curl_close($curl); //close the handle before we possibly sleep and retry
        //if the request failed outright, stop polling
        if ($result === false) {
            return 'request failed';
        }
        //file not there yet: wait, then poll again
        if ($statusCode == 404) {
            sleep(7);
            continue;
        }
        return 'exists';
    }
}
echo remoteFileExists();
One of its applications: if I am sending mails by script to 100+ customers, the whole operation takes only a second or two, so providers like Hotmail and Yahoo may flag it as spam. To avoid this we need to add a small delay in execution after every mail.
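Something along these lines (the recipient list and message variables are placeholders):

foreach ($customers as $customer) {
    mail($customer['email'], $subject, $body); // or your mailer of choice
    sleep(1); // spread the sends out so they look less like a spam burst
}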
Among the others: you are testing a web application that makes asynchronous requests (AJAX calls, lazy image loading, ...).
You are testing it locally, so responses are immediate, since there is only one user (you) and no network latency.
Using sleep lets you see/test how the web app behaves when load and network cause delays on requests.
A quick example (with placeholder function names) of where you may not want to get millions of alert emails for a single event but want your script to keep running:
if (CheckSystemCPU() > 95) {
    SendMeAnEmail();
    sleep(1800); // suppress further alerts for 30 minutes
}
I'm currently using cURL with PHP a lot, and it takes a lot of time to get results of about 100 pages each time. For every request I'm using code like this:
$ch = curl_init();
// get source
curl_close($ch);
What are my options to speed things up?
How should I use curl_multi_init() etc.?
Reuse the same cURL handle ($ch) without calling curl_close. This will speed it up just a little bit.
Use curl_multi_init to run the requests in parallel. This can have a tremendous effect.
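A sketch of the handle-reuse idea (assuming $urls holds the pages you need):

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
foreach ($urls as $url) {
    curl_setopt($ch, CURLOPT_URL, $url); // same handle, new target
    $pages[] = curl_exec($ch); // keep-alive connections get reused
}
curl_close($ch);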
Take curl_multi - it is far better. Save the handshakes; they are not needed every time!
When I used the code given at http://php.net/curl_multi_init, the responses of the 2 requests conflicted with each other.
But the code in the link below returns each response separately (in array format):
https://stackoverflow.com/a/21362749/3177302
Or take pcntl_fork and fork some new processes to execute curl_exec, but it's not as good as curl_multi.