$query = 'SELECT * FROM `chat` LIMIT 0, 24334436743;';
$result = mysql_query($query);
while ($row = mysql_fetch_array($result)) {
    $URL = $row['url'];
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $URL);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    //curl_setopt($ch, CURLOPT_POSTFIELDS, "user=unrevoked clarity&randominfo=hi");
    curl_exec($ch);
    curl_close($ch);
}
Alright, the above snippet is me pulling a whole bunch of URLs from a database and trying to send data to each of them. But it seems to gum the page up (even with only one or two URLs). Is there a built-in system to handle this, or something?
You can initialize multiple requests using the curl_multi_*() functions, then have them sent all at once. There is probably a limit to how many requests can be pooled, and the overall processing will take as long as the slowest connection/server.
So your approach (many, many URLs at once) is still problematic. Maybe you can rewrite it to do the processing in the browser, starting multiple AJAX requests with some visual feedback.
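For illustration, a rough sketch of what the curl_multi_*() approach could look like for the snippet above (the batch size of 30 and reusing the question's commented-out POST payload are my assumptions, not part of the original code):

<?php
// Hypothetical sketch: collect the URLs first, then send them in pooled batches with curl_multi_*().
$urls = array();
$result = mysql_query('SELECT * FROM `chat`');
while ($row = mysql_fetch_array($result)) {
    $urls[] = $row['url'];
}

foreach (array_chunk($urls, 30) as $batch) {           // 30 per pool is an arbitrary cap
    $mh = curl_multi_init();
    $handles = array();
    foreach ($batch as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, "user=unrevoked clarity&randominfo=hi");
        curl_multi_add_handle($mh, $ch);
        $handles[] = $ch;
    }

    // Each batch runs concurrently; it finishes when its slowest server responds.
    do {
        curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh);
        }
    } while ($running > 0);

    foreach ($handles as $ch) {
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
}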
Requesting a URL over the network is an expensive operation, and even downloading a few will noticeably increase the latency of your page. Can you cache the contents of the pages in a database? Do you have to download the URLs yourself, or can you make the client do it with an iframe?
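If the contents do not change often, a database-backed cache along these lines might help (the page_cache table, its columns, and the one-hour freshness window are purely hypothetical):

<?php
// Rough sketch of the caching idea: only hit the network when the cached copy is stale.
// Assumes a hypothetical table page_cache(url VARCHAR UNIQUE, body LONGTEXT, fetched_at DATETIME).
function get_page_cached($url) {
    $safe = mysql_real_escape_string($url);
    $res  = mysql_query("SELECT body FROM page_cache
                         WHERE url = '$safe' AND fetched_at > NOW() - INTERVAL 1 HOUR");
    if ($res && ($row = mysql_fetch_array($res))) {
        return $row['body'];                            // cache hit, no network round-trip
    }

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $body = curl_exec($ch);
    curl_close($ch);

    if ($body !== false) {
        mysql_query("REPLACE INTO page_cache (url, body, fetched_at)
                     VALUES ('$safe', '" . mysql_real_escape_string($body) . "', NOW())");
    }
    return $body;
}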
I use Simple HTML DOM Parser together with cURL (I do not have much experience with cURL) and I am trying to figure out why it hangs for a long time on different URL requests. I have tried logging with verbose mode but did not get back any useful information. It seems like a caching problem, because after one long response all my other requests behave the same way until I clear the browser cache.
str_get_html(get_data($target));

function get_data($url)
{
    $ch = curl_init();
    $timeout = 30;
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_USERAGENT, 'some useragent');
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
You are using the CURLOPT_NOBODY cURL option in your request. Are you sure what it does? It sends a HEAD request to the target URL instead of a GET. There are a lot of web servers on the Internet that accept the HEAD request but leave it stuck until the timeout occurs, and this is what you are experiencing right now.
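If that is the case, a small adjustment to the get_data() helper above might be enough: force a plain GET and add an overall transfer timeout, not just a connect timeout (the 30-second values here simply mirror the question's $timeout):

<?php
// Sketch: the same helper as above, with CURLOPT_NOBODY dropped and a full-transfer timeout added.
function get_data($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HTTPGET, true);          // make sure the request is a GET, not a HEAD
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);     // time allowed to establish the connection
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);            // time allowed for the whole transfer
    curl_setopt($ch, CURLOPT_USERAGENT, 'some useragent');
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}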
I am using cURL to access Instagram's API on a webpage I am building. The functionality works great; however, page load is sacrificed. For instance, consider this DOM structure:
Header
Article
Instagram Photos (retrieved via cURL)
Footer
When loading the page, the footer will not load until the Instagram photos have been fully loaded with cURL. Below is the cURL function that is being called:
function fetchData($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 20);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
$result = fetchData("https://api.instagram.com/v1/media/search?lat={$lat}&lng={$lng}&distance={$distance}&access_token={$accessToken}");
$result = json_decode($result);
So only after this function has run is the rest of the DOM displayed. If I move the function call below the footer, it does not work.
Is there anything I can do to load the entire webpage and have the cURL request sent on top of the loading site (so it does not cause a lag or holdup)?
UPDATE: Is the best solution to load it after the footer, and then append it to another area with JS?
You can cache the resulting JSON in a file that is saved locally. You can set up a cronjob that is called every minute and updates the local cache file. This makes your page load much faster. The downside is that your cache is updated even when you don't have visitors, and the data from Instagram can be up to a minute old.
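A minimal sketch of that setup, assuming a cron entry such as * * * * * php /path/to/update_cache.php and a writable cache path (both are placeholders of mine, not from the answer):

<?php
// update_cache.php -- run by cron; refreshes the local copy of the Instagram JSON.
$json = fetchData("https://api.instagram.com/v1/media/search?lat={$lat}&lng={$lng}&distance={$distance}&access_token={$accessToken}");
if ($json !== false && $json !== '') {
    file_put_contents('/path/to/cache/instagram.json', $json, LOCK_EX);
}

<?php
// In the page itself: read the cached file instead of calling the API during page load.
$result = json_decode(file_get_contents('/path/to/cache/instagram.json'));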
I was searching Stack Overflow for a solution, but couldn't find anything even close to what I am trying to achieve. Perhaps I am just blissfully unaware of some magic PHP sauce everyone is using to tackle this problem... ;)
Basically, I have an array with, give or take, a few hundred URLs pointing to different XML files on a remote server. I'm doing some magic file-checking to see if the content of the XML files has changed, and if it has, I'll download the newer XMLs to my server.
PHP code:
$urls = array(
    'http://stackoverflow.com/a-really-nice-file.xml',
    'http://stackoverflow.com/another-cool-file2.xml'
);

set_time_limit(0);
foreach ($urls as $url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, false);
    $contents = curl_exec($ch);
    curl_close($ch);
    file_put_contents($filename, $contents);
}
Now, $filename is set somewhere else and gives each XML its own ID based on my logic.
So far this script runs OK and does what it should, but it is terribly slow. I know my server can handle a lot more, and I suspect my foreach is slowing down the process.
Is there any way I can speed up the foreach? Currently I am thinking of handling 10 or 20 URLs per iteration instead of one, basically cutting my execution time 10- or 20-fold, but I can't think of the best and most performant way to approach this. Any help or pointers on how to proceed?
Your bottleneck (most likely) is your cURL requests; you can only write to a file after each request is done, and there is no way (in a single script) to speed up that process.
I don't know how it all works, but you can execute cURL requests in parallel: http://php.net/manual/en/function.curl-multi-exec.php.
Maybe you can fetch the data (if memory is available to store it) and then, as the requests complete, write it out.
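A rough sketch of what that could look like for the XML loop above, using the curl_multi_* functions from the linked manual page (the batch size of 20 is arbitrary, and the file naming is a placeholder for the question's own $filename logic):

<?php
// Hypothetical sketch: download the XML files in parallel batches, then write each one out.
set_time_limit(0);

foreach (array_chunk($urls, 20) as $batch) {
    $mh = curl_multi_init();
    $handles = array();

    foreach ($batch as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Let the whole batch run concurrently.
    do {
        curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh);
        }
    } while ($running > 0);

    foreach ($handles as $url => $ch) {
        $contents = curl_multi_getcontent($ch);                   // the downloaded XML, held in memory
        $filename = basename(parse_url($url, PHP_URL_PATH));      // placeholder for your own ID-based naming
        file_put_contents($filename, $contents);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
}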
Just run more scripts. Each script will download some of the URLs.
You can get more information about this pattern here: http://en.wikipedia.org/wiki/Thread_pool_pattern
The more scripts you run, the more parallelism you get.
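As a sketch of that idea (the chunk size, the chunk file names, and the worker.php script are all hypothetical), a dispatcher could split the URL list and launch several PHP processes in the background:

<?php
// dispatcher.php -- hypothetical sketch of the "run more scripts" approach.
// Each worker process would run the question's original foreach loop over its own chunk of $urls.
$chunks = array_chunk($urls, 50);                      // 50 URLs per worker is an arbitrary choice

foreach ($chunks as $i => $chunk) {
    file_put_contents("chunk_$i.json", json_encode($chunk));
    // Launch worker.php in the background so all chunks download in parallel (Unix-style shell assumed).
    exec(sprintf('php worker.php %d > /dev/null 2>&1 &', $i));
}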
For parallel requests I use a Guzzle pool ;) (you can send X parallel requests):
http://docs.guzzlephp.org/en/stable/quickstart.html
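For reference, a Guzzle pool along those lines might look like this (requires guzzlehttp/guzzle via Composer; the concurrency of 20 and the file naming are my assumptions):

<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

$client = new Client(['timeout' => 30]);

// Build one request per URL lazily, so only `concurrency` of them are in flight at once.
$requests = function () use ($urls) {
    foreach ($urls as $url) {
        yield new Request('GET', $url);
    }
};

$pool = new Pool($client, $requests(), [
    'concurrency' => 20,                               // X parallel requests
    'fulfilled'   => function ($response, $index) {
        file_put_contents("file_$index.xml", (string) $response->getBody());
    },
    'rejected'    => function ($reason, $index) {
        error_log("Request $index failed: $reason");
    },
]);

$pool->promise()->wait();                              // block until every request has finished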
I have to fetch multiple web pages, let's say 100 to 500. Right now I am using curl to do so.
function get_html_page($url) {
    //create curl resource
    $ch = curl_init();
    //set url
    curl_setopt($ch, CURLOPT_URL, $url);
    //return the transfer as a string
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, FALSE);
    //$html contains the output string
    $html = curl_exec($ch);
    //close curl resource to free up system resources
    curl_close($ch);
    return $html;
}
My major concern is the total time taken by my script to fetch all these web pages. I know the time taken is largely determined by my internet connection, and hence the majority of the time is spent in the $html = curl_exec($ch); call.
I was thinking that instead of creating and destroying a cURL instance again and again for each and every web page, I could create it only once, reuse it for every page, and finally destroy it at the end. Something like:
<?php
function get_html_page($ch, $url) {
    //set the url on the shared handle
    curl_setopt($ch, CURLOPT_URL, $url);
    //$html contains the output string
    $html = curl_exec($ch);
    return $html;
}

//create curl resource
$ch = curl_init();
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HEADER, FALSE);
.
.
.
<fetch web pages using get_html_page()>
.
.
.
//close curl resource to free up system resources
curl_close($ch);
?>
Will it make any significant difference in the total time taken? If there is any other, better approach, please let me know about it as well.
How about benchmarking it? It may be more efficient to do it the second way, but I don't think it will add up to much. I'm sure your system can create and destroy cURL instances in microseconds. It has to initiate the same HTTP connections each time either way, too.
If you were running many of these at the same time and were worried about system resources rather than time, it might be worth exploring. As you noted, most of the time spent doing this will be waiting for network transfers, so I don't think you'll notice a change in overall time with either method.
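If you want hard numbers, a tiny benchmark along these lines would settle it (the URL list and the use of microtime() are just my illustration, not code from the question):

<?php
// Hypothetical micro-benchmark: time the fetches with a fresh handle each time vs. one reused handle.
$urls = array('http://example.com/', 'http://example.org/', 'http://example.net/');

// Approach 1: create and destroy a handle per request.
$start = microtime(true);
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    curl_close($ch);
}
$fresh = microtime(true) - $start;

// Approach 2: one handle reused for every request.
$start = microtime(true);
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
foreach ($urls as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_exec($ch);
}
curl_close($ch);
$reused = microtime(true) - $start;

printf("fresh handle each time: %.3fs, reused handle: %.3fs\n", $fresh, $reused);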
For web scraping I would use YQL + JSON + XPath. You'd implement it using cURL, and I think you'll save a lot of resources.
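In case it helps, this is roughly how the YQL approach was typically wired up with cURL (the public YQL endpoint and the example query are my assumptions about that service, not something from the answer):

<?php
// Hypothetical sketch: ask YQL to fetch the page and apply the XPath, then read the JSON result.
$yql = 'select * from html where url="http://example.com/" and xpath="//title"';
$api = 'http://query.yahooapis.com/v1/public/yql?q=' . urlencode($yql) . '&format=json';

$ch = curl_init($api);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$json = curl_exec($ch);
curl_close($ch);

$data = json_decode($json, true);
var_dump($data['query']['results']);   // only the nodes matched by the XPath come back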
I have a database set up, and all it does is give me at most 10 URLs. I need to post data to those 10 URLs when the page is loaded. That means the script to send a message hits "send.php?message=Foo" and it posts 'Foo' to the pages in the database. There is no way around this, but I need to be able to do it. Right now I am trying to use regular cURL requests in a while loop, but that only posts to the first URL. How do I use the cURL multi functions to do this:
$query = 'SELECT * FROM `chat` LIMIT 0, 20;';
$result = mysql_query($query);
while ($row = mysql_fetch_array($result)) {
    $URL = $row['url'];
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $URL);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_exec($ch);
    curl_close($ch);
}
Are you sure it only posts to the first URL? Try adding some debugging statements, and check whether you're hitting a timeout in PHP.
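Something along these lines would show what each request actually did (the 5-second timeout and the use of error_log() are my additions for illustration):

<?php
// Hypothetical debugging version of the loop: log the HTTP status or the cURL error per URL.
while ($row = mysql_fetch_array($result)) {
    $ch = curl_init($row['url']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);             // don't let one slow URL block the rest
    $body = curl_exec($ch);
    if ($body === false) {
        error_log('cURL error for ' . $row['url'] . ': ' . curl_error($ch));
    } else {
        error_log($row['url'] . ' -> HTTP ' . curl_getinfo($ch, CURLINFO_HTTP_CODE));
    }
    curl_close($ch);
}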
What about timeouts? Look at the example here: http://au2.php.net/manual/en/function.curl-multi-init.php. Start a multi handle before the while loop, create the cURL handles inside the while loop and add them to the multi handle, then after the while loop execute the multi call and close the handles. Do check for errors reported by cURL; for example, a URL may time out, and so on.
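Put together, a sketch of that structure applied to the question's loop could look like this (the 10-second timeout and the message POST field are my assumptions based on the question's description):

<?php
// Hypothetical sketch following the outline above: build all handles first, run them at once, then check errors.
$mh      = curl_multi_init();
$handles = array();

while ($row = mysql_fetch_array($result)) {
    $ch = curl_init($row['url']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_POSTFIELDS, 'message=' . urlencode($_GET['message']));
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Execute every request concurrently.
do {
    curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);
    }
} while ($running > 0);

// Check each handle for problems (e.g. a URL that timed out), then clean up.
foreach ($handles as $ch) {
    if (curl_errno($ch)) {
        error_log('Failed: ' . curl_getinfo($ch, CURLINFO_EFFECTIVE_URL) . ' - ' . curl_error($ch));
    }
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);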