cURL fast way to stream images - php

I know that cURL's performance is quite fast; however, since the files I'm streaming are on a different server, it takes 4-5 minutes to complete.
Is there any way to do 100 requests in 10-15 seconds? I have tried multi-process cURL and the effect is still the same, even with file_get_contents().
Thank you
<?php
require_once('FileMaker.php');
use \setasign\Fpdi;
use \setasign\Fpdi\PdfParser\StreamReader;
require_once('FPDi/vendor/setasign/fpdf/fpdf.php');
require_once('FPDi/vendor/autoload.php');
require_once('FPDi/src/PdfParser/StreamReader.php');
$pdf = new Fpdi\Fpdi();
$fm = new FileMaker('database sample', 'location sample', 'username sample',
'password sample');
$connected = $fm->getLayout('Accessioning | Desktop | Form');
if (FileMaker::isError($connected)) :
// echo '<script> alert("NOT CONNECTED"); </script>';
endif;
//echo '<script> alert("CONNECTED"); </script>';
$record_gathered = $fm->getRecordByID("Accessioning | Desktop | Form", "18226");
$related_records = $record_gathered->getRelatedSet('Library | Accession Catalog Items');
$result_gathered_information = array();
$result = array();
if (!FileMaker::iserror ($related_records)) {
foreach($related_records as $related_record){
@$counter = $counter + 1;
if ($counter <= 3){
$fetchURL = 'http://sampleURL/SuperContainer/RawData/Library/Catalog/Items/'. $related_record->getField('Library | Accession Catalog Items::recno') . '/?a=1';
$result_gathered_information[] = $fetchURL;
}
}
}
foreach ($result_gathered_information as $detectedinfo){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $detectedinfo);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result[] = curl_exec($ch);
curl_close($ch);
}
foreach ($result as $file) {
$pageCount = $pdf->setSourceFile(StreamReader::createByString($file));
for ($pageNo = 1; $pageNo <= $pageCount; $pageNo++) {
// import a page
$templateId = $pdf->importPage($pageNo);
// get the size of the imported page
$size = $pdf->getTemplateSize($templateId);
// add a page with the same orientation and size
$pdf->AddPage($size['orientation'], $size);
// use the imported page
$pdf->useTemplate($templateId);
$pdf->SetFont('Helvetica');
$pdf->SetXY(5, 5);
}
}
$pdf->Output();
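One way to get closer to the 10-15 second target is to run the downloads in parallel with curl_multi instead of one blocking curl_exec() per URL. Below is a minimal sketch, not the original poster's code: fetch_parallel() is a hypothetical helper, the batch size of 10 is an arbitrary choice, and it assumes the same $result_gathered_information array of URLs built above.
<?php
// Fetch a list of URLs in parallel with curl_multi, in batches.
function fetch_parallel(array $urls, int $batchSize = 10): array {
    $results = [];
    foreach (array_chunk($urls, $batchSize, true) as $chunk) {
        $mh = curl_multi_init();
        $handles = [];
        foreach ($chunk as $key => $url) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_TIMEOUT, 30);
            curl_multi_add_handle($mh, $ch);
            $handles[$key] = $ch;
        }
        // Drive all transfers in this batch until they finish.
        do {
            $status = curl_multi_exec($mh, $running);
            if ($running) {
                curl_multi_select($mh); // wait for network activity instead of busy-looping
            }
        } while ($running && $status === CURLM_OK);
        foreach ($handles as $key => $ch) {
            $results[$key] = curl_multi_getcontent($ch);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
        curl_multi_close($mh);
    }
    return $results;
}

// Usage with the array built in the snippet above:
// $result = fetch_parallel($result_gathered_information);
The PDF-merging loop can then consume $result unchanged.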

Related

PHP CURL script getting 502/503 server error after first couple of requests

I have been working on a client's WP site which lists deals from Groupon. I am using Groupon's official XML feed, importing it via WP All Import. This works without much hassle. The issue is that Groupon doesn't update that feed frequently, but some of their deals get sold out or taken off the market often. To resolve this, I am using a cURL script to crawl the links, check whether each deal is still available, and turn the unavailable deals into draft posts (once a day only).
The custom script works almost perfectly, except that after the first 14/24 requests the server starts responding with 502/503 HTTP status codes. To overcome the issue I have taken the precautions below:
Using the proper headers (captured from the requests made by the browser).
Parsing cookies from the response header and sending them back.
Using a proper referrer and user agent.
Using proxies.
Sending requests after a set interval: PHP sleep(5);
Unfortunately, none of these got me the solution I wanted. I am attaching my code and I would like to request your expert insights on the issue, please.
Thanks in advance for your time.
Shahriar
PHP SCRIPT - https://pastebin.com/FF2cNm5q
<?php
// Suppress errors and extend the maximum execution time
error_reporting(0);
ini_set('max_execution_time', 50000);
// Sitemap URL List
$all_activity_urls = array();
$sitemap_url = array(
'https://www.groupon.de/sitemaps/deals-local0.xml.gz'
);
$cookies = Array();
// looping through sitemap url for scraping activity urls
for ($u = 0; $u < count($sitemap_url); $u++)
{
$ch1 = curl_init();
curl_setopt($ch1, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch1, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:38.0) Gecko/20100101 Firefox/38.0');
curl_setopt($ch1, CURLOPT_REFERER, "https://www.groupon.de/");
curl_setopt($ch1, CURLOPT_TIMEOUT, 40);
// curl_setopt($ch1, CURLOPT_COOKIEFILE, "cookie.txt");
curl_setopt($ch1, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch1, CURLOPT_URL, $sitemap_url[$u]);
curl_setopt($ch1, CURLOPT_SSL_VERIFYPEER, FALSE);
// Parsing Cookie from the response header
curl_setopt($ch1, CURLOPT_HEADERFUNCTION, "curlResponseHeaderCallback");
$activity_url_source = curl_exec($ch1);
$status_code = curl_getinfo($ch1, CURLINFO_HTTP_CODE);
curl_close($ch1);
if ($status_code === 200)
{
// Parsing XML sitemap for activity urls
$activity_url_list = json_decode(json_encode(simplexml_load_string($activity_url_source)));
for ($a = 0; $a < count($activity_url_list->url); $a++)
{
array_push($all_activity_urls, $activity_url_list->url[$a]->loc);
}
}
}
if (count($all_activity_urls) > 0)
{
// URL Loop count
$loop_from = 0;
$loop_to = (count($all_activity_urls) > 0) ? 100 : 0;
// $loop_to = count($all_activity_urls);
$final_data = array();
echo 'script start - ' . date('h:i:s') . "<br>";
for ($u = $loop_from; $u < $loop_to; $u++)
{
//Pull source from webpage
$headers = array(
'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'accept-language: en-US,en;q=0.9,bn-BD;q=0.8,bn;q=0.7,it;q=0.6',
'cache-control: max-age=0',
'cookie: ' . implode('; ', $cookies),
'upgrade-insecure-requests: 1',
'user-agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
);
$site = $all_activity_urls[$u];
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_REFERER, "https://www.groupon.de/");
curl_setopt($ch, CURLOPT_TIMEOUT, 40);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL, $site);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
// Parsing Cookie from the response header
curl_setopt($ch, CURLOPT_HEADERFUNCTION, "curlResponseHeaderCallback");
$data = curl_exec($ch);
$status_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($status_code === 200)
{
// Ready data for parsing
$document = new DOMDocument();
$document->loadHTML('<meta http-equiv="content-type" content="text/html; charset=utf-8">' . $data);
$xpath = new DOMXpath($document);
$title = '';
$availability = '';
$price = '';
$base_price = '';
$link = '';
$image = '';
$link = $all_activity_urls[$u];
// Scraping Availability
$raw_availability = $xpath->query('//div[@data-bhw="DealHighlights"]/div[0]/div/div');
$availability = $raw_availability->item(0)->nodeValue;
// Scraping Title
$raw_title = $xpath->query('//h1[@id="deal-title"]');
$title = $raw_title->item(0)->nodeValue;
// Scraping Price
$raw_price = $xpath->query('//div[@class="price-discount-wrapper"]');
$price = trim(str_replace(array("$", "€", "US", " "), array("", "", "", ""), $raw_price->item(0)->nodeValue));
// Scraping Old Price
$raw_base_price = $xpath->query('//div[contains(@class, "value-source-wrapper")]');
$base_price = trim(str_replace(array("$", "€", "US", " "), array("", "", "", ""), $raw_base_price->item(0)->nodeValue));
// Creating Final Data Array
array_push($final_data, array(
'link' => $link,
'availability' => $availability,
'name' => $title,
'price' => $price,
'baseprice' => $base_price,
'img' => $image,
));
}
else
{
$link = $all_activity_urls[$u];
if ($status_code === 429)
{
$status_msg = ' - Too Many Requests';
}
else
{
$status_msg = '';
}
array_push($final_data, array(
'link' => $link,
'status' => $status_code . $status_msg,
));
}
echo 'before break - ' . date('h:i:s') . "<br>";
sleep(5);
echo 'after break - ' . date('h:i:s') . "<br>";
flush();
}
echo 'script end - ' . date('h:i:s') . "<br>";
// Converting data to XML
$activities = new SimpleXMLElement("<?xml version=\"1.0\"?><activities></activities>");
array_to_xml($final_data, $activities);
$xml_file = $activities->asXML('activities.xml');
if ($xml_file)
{
echo 'XML file has been generated successfully.';
}
else
{
echo 'XML file generation error.';
}
}
else
{
$activities = new SimpleXMLElement("<?xml version=\"1.0\"?><activities></activities>");
$activities->addChild("error", htmlspecialchars("No URL scraped from sitemap. Stoping script."));
$xml_file = $activities->asXML('activities.xml');
if ($xml_file)
{
echo 'XML file has been generated successfully.';
}
else
{
echo 'XML file generation error.';
}
}
// Recursive Function for creating XML Nodes
function array_to_xml($array, &$activities)
{
foreach ($array as $key => $value)
{
if (is_array($value))
{
if (!is_numeric($key))
{
$subnode = $activities->addChild("$key");
array_to_xml($value, $subnode);
}
else
{
$subnode = $activities->addChild("activity");
array_to_xml($value, $subnode);
}
}
else
{
$activities->addChild("$key", htmlspecialchars("$value"));
}
}
}
// Cookie Parsing Function
function curlResponseHeaderCallback($ch, $headerLine)
{
global $cookies;
if (preg_match('/^Set-Cookie:\s*([^;]*)/mi', $headerLine, $cookie) == 1)
{
$cookies[] = $cookie[1];
}
return strlen($headerLine); // Needed by curl
}
There is a mess of cookies in your snippet. The callback function just appends cookies to the array regardless of whether they already exist or not. Here is a new version which at least seems to work in this case, since there are no semicolon-separated multiple cookie definitions. Usually the cookie string should even be properly parsed; if you have the http extension installed, you can use http_parse_cookie.
// Cookie Parsing Function
function curlResponseHeaderCallback($ch, $headerLine)
{
global $cookies;
if (preg_match('/^Set-Cookie:\s*([^;]+)/mi', $headerLine, $match) == 1)
{
if(false !== ($p = strpos($match[1], '=')))
{
$replaced = false;
$cname = substr($match[1], 0, $p+1);
foreach ($cookies as &$cookie)
if(0 === strpos($cookie, $cname))
{
$cookie = $match[1];
$replaced = true;
break;
}
if(!$replaced)
$cookies[] = $match[1];
}
var_dump($cookies);
}
return strlen($headerLine); // Needed by curl
}
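A simpler alternative, if writing a file is acceptable, is to let cURL track the cookies itself with CURLOPT_COOKIEJAR / CURLOPT_COOKIEFILE instead of a header callback. A minimal sketch, assuming a writable cookies.txt path next to the script:
<?php
// Let cURL persist cookies between requests via a cookie jar file.
$cookieJar = __DIR__ . '/cookies.txt'; // assumed to be writable

$ch = curl_init('https://www.groupon.de/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar); // read cookies from the jar for this request
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);  // write received cookies back when the handle is closed
$html = curl_exec($ch);
curl_close($ch);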

Submit POST data in loop without cURL and get response data - CodeIgniter

I have code that sends SMS to customers and is set up as a cron job. My problem is that when I use cURL, it sends 2-part messages and our wallet is charged twice. We have an average of 500 messages to send per day. My goal is to make the message 1-part only.
So I'm trying to find the best way to send messages in a loop without using cURL.
I have thought of saving the POST data from the loop in an array and sending it to my view. Then, when the view is called, I will automatically submit the form without hitting the submit button, using JavaScript to auto-submit. After the view is called, I will use file_get_contents() to get the response from the URL. But I can't make it work:
The view does not work inside the loop; it only works outside the loop.
If I put the loop inside the view instead of the controller, how can I get the response data on each iteration?
My current code in CURL:
public function send_sms_globe(){
error_reporting(-1);
ini_set('display_errors', 1);
set_time_limit(0);
//get all data from the database pull with status = queue
$globe_data = $this->New_Sms_Api_model->get_queued_data('globe_api');
$passphrase = '[our_pass_phrase]';
$app_id = '[our_app_id]';
$app_secret = '[our_app_secret]';
$url = 'https://devapi.globelabs.com.ph/smsmessaging/v1/outbound/<our_shortcode>/requests/';
$ch = curl_init($url);
$limit = 0;
foreach($globe_data AS $records_data){
if($limit == 49){
break;
}
switch($limit) {
case 49:
$limit = 0;
break;
default:
if($records_data['Remarks'] == 'LOADED'){
if($records_data['sent_to'] == 'sender'){
$address = $records_data['sender_phone_number'];
}else if($records_data['sent_to'] == 'consignee'){
$address = $records_data['consignee_phone_number'];
}
} else {
$address = $records_data['sender_phone_number'];
}
//$address = '+63917*******';//address : *subscriber number $records_data['phone_number'];
$message = (isset($records_data['Message']) && $records_data['Message'] != '') ? $records_data['Message']:''; //message : *sms content
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $message);
$post_data = [
'app_id' => $app_id,
'app_secret' => $app_secret,
'passphrase' => $passphrase,
'message' => rawurlencode($str),
'address' => $address
];
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
// execute!
$response = curl_exec($ch);
$http_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
$queue_id = $records_data['ID'];
$invoice_number = $records_data['InvoicecNumber'];
$status_remark = $records_data['Remarks'];
$record_id = $records_data['record_id'];
if($http_code == 201 || $http_code == 200){
$no_of_tries = $records_data['no_of_tries'];
if($no_of_tries == 0){
$no_of_tries = 1;
} else {
$no_of_tries = $records_data['no_of_tries'];
}
$status = 'sent';
$retry = 0;
} else {
$no_of_tries = $records_data['no_of_tries'] + 1;
if($no_of_tries == 3){
$status = 'failed';
$retry = 0;
} else {
$status = 'retry';
$retry = 1;
}
}
$update_queued_data = $this->New_Sms_Api_model->update_queued_data($queue_id, $invoice_number, $status, $retry, $no_of_tries);
if($update_queued_data){
if($status == 'sent'){
if($status_remark == 'LOADED'){
$sent_to = $records_data['sent_to'];
} else {
$sent_to = NULL;
}
$this->New_Sms_Api_model->save_to_cq_sms($invoice_number, $status_remark, $record_id,$sent_to);
echo $records_data['record_id'].' ---- '.$status;
}
$limit++;
}
}
}
// close the connection, release resources used
curl_close($ch);
}
We have a message of 157 characters (160 max). I already talked to the support team of the API we're using. First they suggested formatting my message as URL-encoded, and so I did; the 3-part message became 2-part. Then they said it will still send as 2-part even if it's formatted that way because we use cURL. They suggested that we use Postman, but it's not free, so it's not an option.
Any ideas that can replace my current code? Thanks!
Sorry for the confusion. I was able to fix the issue without changing all of my code. I just removed rawurlencode from my string message and it's sending a 1-part message now.
Maybe using $str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $message); already did the trick, and applying rawurlencode afterwards just added extra characters, e.g. %20.
Thanks everyone!
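For reference, a sketch of what the fix described above amounts to in the original loop: the message goes into CURLOPT_POSTFIELDS as-is after iconv(), without rawurlencode(), so no percent-escapes inflate the SMS length (variable names as in the code above):
<?php
// Adjusted POST fields: no rawurlencode() on the message anymore.
$str = iconv('UTF-8', 'ASCII//IGNORE//TRANSLIT', $message);
$post_data = [
    'app_id'     => $app_id,
    'app_secret' => $app_secret,
    'passphrase' => $passphrase,
    'message'    => $str,      // previously rawurlencode($str)
    'address'    => $address,
];
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);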

How to download google search big image using php script?

I want to download the big image from a Google image search. Below is my code, which works fine for downloading small images from Google search.
<?php
include_once('simple_html_dom.php');
set_time_limit(0);
$fp = fopen('csv/search.csv','r') or die("can't open file");
$csv_data = array();
while($csv_line = fgetcsv($fp)) {
for ($i = 0, $j = count($csv_line); $i < $j; $i++) {
$imgname = $csv_line[$i];
$search_query = $csv_line[$i];
$search_query = urlencode(trim($search_query));
$html=file_get_html('http://images.google.com/images?as_q='. $search_query .'&hl=en&imgtbs=z&btnG=Search+Images&as_epq=&as_oq=&as_eq=&imgtype=&imgsz=m&imgw=&imgh=&imgar=&as_filetype=&imgc=&as_sitesearch=&as_rights=&safe=images&as_st=y');
$image_container = $html->find('div#rcnt', 0);
$images = $html->find('img');
$image_count = 1; //Enter the amount of images to be shown
$k = 0; // separate counter so the outer for loop's $i is not clobbered
foreach($images as $image){
$srcimg = $image->src;
if($k == $image_count) break;
$k++;
$randname = $imgname.".jpg";
$randname = "Images/".$randname;
file_put_contents("$randname", file_get_contents($srcimg));
}
}
}
?>
Any idea?
This worked for me. simple_html_dom.php wouldn't do the trick since the 'big image' is inside a snippet of JSON near each thumbnail in the DOM.
<?php
$search_query = "Some Keyword"; //change this
$search_query = urlencode( $search_query );
$googleRealURL = "https://www.google.com/search?hl=en&biw=1360&bih=652&tbs=isz%3Alt%2Cislt%3Asvga%2Citp%3Aphoto&tbm=isch&sa=1&q=".$search_query."&oq=".$search_query."&gs_l=psy-ab.12...0.0.0.10572.0.0.0.0.0.0.0.0..0.0....0...1..64.psy-ab..0.0.0.wFdNGGlUIRk";
// Call Google with CURL + User-Agent
$ch = curl_init($googleRealURL);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Linux i686; rv:20.0) Gecko/20121230 Firefox/20.0');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
$google = curl_exec($ch);
$array_imghtml = explode("\"ou\":\"", $google); //the big url is inside JSON snippet "ou":"big url"
foreach($array_imghtml as $key => $value){
if ($key > 0) {
$array_imghtml_2 = explode("\",\"", $value);
$array_imgurl[] = $array_imghtml_2[0];
}
}
var_dump($array_imgurl); //array contains the urls for the big images
die();
?>
I think rather than crawling the page you can use the Google Custom Search API.
For more details, here is the URL:
https://developers.google.com/custom-search/json-api/v1/overview
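A minimal sketch of that approach against the Custom Search JSON API; the API key and search engine ID (cx) are placeholders you would create in the Google console, and the full-size image URLs come back in the items[].link fields:
<?php
// Query the Google Custom Search JSON API for images instead of scraping result pages.
$apiKey = 'YOUR_API_KEY';          // placeholder: API key from the Google developer console
$cx     = 'YOUR_SEARCH_ENGINE_ID'; // placeholder: custom search engine configured for image search
$query  = urlencode('Some Keyword');

$url = "https://www.googleapis.com/customsearch/v1?key=$apiKey&cx=$cx&searchType=image&q=$query";

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$json = curl_exec($ch);
curl_close($ch);

$data = json_decode($json, true);
foreach ($data['items'] ?? [] as $item) {
    echo $item['link'] . "\n"; // URL of the full-size image
}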

PHP - Safe way to download large files?

Information
There are many ways to download files in PHP: file_get_contents + file_put_contents, fopen, readfile and cURL.
Question?
When having a large file, let's say 500 MB from another server / domain, what is the "correct" way to download it safely? If the connection fails, it should find the position and continue, OR download the file again if it contains errors.
It's going to be used on a web site, not in the php.exe shell.
What I figured out so far
I've read about AJAX solutions with progress bars, but what I'm really looking for is a PHP solution.
I don't need to buffer the file to a string like file_get_contents does. That probably uses memory as well.
I've also read about memory problems. A solution that doesn't use that much memory might be preferred.
Concept
This is sort of what I want if the result is false:
function download_url( $url, $filename ) {
// Code
$success['success'] = false;
$success['message'] = 'File not found';
return $success;
}
The easiest way to copy large files is demonstrated in Save large files from php stdin, but that does not show how to copy files with an HTTP Range header.
$url = "http://REMOTE_FILE";
$local = __DIR__ . "/test.dat";
try {
$download = new Downloader($url);
$download->start($local); // Start Download Process
} catch (Exception $e) {
printf("Copied %d bytes\n", $pos = $download->getPos());
}
When an Exception occurs you can resume the file download from the last position:
$download->setPos($pos);
Class used
class Downloader {
private $url;
private $length = 8192;
private $pos = 0;
private $timeout = 60;
public function __construct($url) {
$this->url = $url;
}
public function setLength($length) {
$this->length = $length;
}
public function setTimeout($timeout) {
$this->timeout = $timeout;
}
public function setPos($pos) {
$this->pos = $pos;
}
public function getPos() {
return $this->pos;
}
public function start($local) {
$part = $this->getPart("0-1");
// Check partial Support
if ($part && strlen($part) === 2) {
// Split data with curl
$this->runPartial($local);
} else {
// Use stream copy
$this->runNormal($local);
}
}
private function runNormal($local) {
$in = fopen($this->url, "r");
// 'c' keeps any data already downloaded so the copy can resume at $this->pos
$out = fopen($local, 'c');
fseek($out, $this->pos);
// skip the bytes that were already copied (a plain http stream cannot seek)
while (($pos = ftell($in)) < $this->pos) {
$remaining = $this->pos - $pos;
fread($in, $remaining > $this->length ? $this->length : $remaining);
}
$this->pos += stream_copy_to_stream($in, $out);
return $this->pos;
}
private function runPartial($local) {
$i = $this->pos;
// 'c' keeps the already-downloaded part instead of truncating the file
$fp = fopen($local, 'c');
fseek($fp, $this->pos);
while(true) {
$data = $this->getPart(sprintf("%d-%d", $i, ($i + $this->length)));
// -1 means the server replied with something other than 206 Partial Content
if ($data === - 1)
throw new Exception("File Corrupted");
// false (416 Range Not Satisfiable) or an empty body means the download is complete
if (! $data)
break;
$i += strlen($data);
fwrite($fp, $data);
$this->pos = $i;
}
fclose($fp);
}
private function getPart($range) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $this->url);
curl_setopt($ch, CURLOPT_RANGE, $range);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, $this->timeout);
$result = curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
// Request not Satisfiable
if ($code == 416)
return false;
// Check 206 Partial Content
if ($code != 206)
return - 1;
return $result;
}
}
You'd want to download the remote file in chunks. This answer has a great example:
How download big file using PHP (low memory usage)
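The gist of that linked answer is to stream the remote file in fixed-size chunks so memory stays flat. A rough sketch in the shape of the download_url() concept from the question, assuming allow_url_fopen is enabled (otherwise the cURL Range approach from the class above applies):
<?php
// Copy a remote file to disk in chunks; memory use stays at roughly one chunk.
function download_url($url, $filename, $chunkSize = 1048576) {
    $in  = @fopen($url, 'rb');
    $out = @fopen($filename, 'wb');
    if (!$in || !$out) {
        return ['success' => false, 'message' => 'File not found or destination not writable'];
    }
    while (!feof($in)) {
        $buffer = fread($in, $chunkSize);
        if ($buffer === false) {
            fclose($in);
            fclose($out);
            return ['success' => false, 'message' => 'Read error'];
        }
        fwrite($out, $buffer);
    }
    fclose($in);
    fclose($out);
    return ['success' => true, 'message' => 'Download complete'];
}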

multi curl unable to handle more than 200 request at a time

Could you please tell me, is there any limitation on sending requests using multi_curl?
When I tried to send more than 200 requests at a time, it was timing out.
See the code below.
foreach($newUrlArry as $url){
$gatherUrl[] = $url['url'];
}
/*...................Array slice----------------------*/
$totalUrlRequest = count($gatherUrl);
if($totalUrlRequest > 10){
$offset = 10;
$index = 0;
$matchedAnchors = array();
$dom = new DOMDocument;
$NoOfTilesRequest = ceil($totalUrlRequest/$offset);
for($sl = 0; $sl<$NoOfTilesRequest;$sl++){
$output = array_slice($gatherUrl, $index, $offset);
$index = $offset+$index;
$responseAction = $this->multiRequestAction($output);
$k=0;
foreach($responseAction as $responseHtml){
@$dom->loadHTML($responseHtml);
$documentLinks = $dom->getElementsByTagName("a");
$chieldFlag = false;
for($i=0;$i<$documentLinks->length;$i++) {
$documentLink = $documentLinks->item($i);
if ($documentLink->hasAttribute('href') AND substr($documentLink->getAttribute('href'), 0, strlen($match)) == $match) {
$description = $documentLink->childNodes;
foreach($description as $words) {
$name = trim($words->nodeName);
if($name == 'em' || $name == 'b' || $name=="span" || $name=="p") {
if(!empty($words->nodeValue)) {
$matchedAnchors[$sl][$k]['anchor'] = trim($words->nodeValue);
$matchedAnchors[$sl][$k]['img'] = 0;
if($documentLink->hasAttribute('rel'))
$matchedAnchors[$sl][$k]['rel'] = 'Y';
else
$matchedAnchors[$sl][$k]['rel'] = 'N';
$chieldFlag = true;
break;
}
}
elseif($name == 'img' ) {
$alt= $words->getAttribute('alt');
if(!empty($alt)) {
$matchedAnchors[$sl][$k]['anchor'] = trim($words->getAttribute('alt'));
$matchedAnchors[$sl][$k]['img'] = 1;
if($documentLink->hasAttribute('rel'))
$matchedAnchors[$sl][$k]['rel'] = 'Y';
else
$matchedAnchors[$sl][$k]['rel'] = 'N';
$chieldFlag = true;
break;
}
}
}
if(!$chieldFlag){
$matchedAnchors[$sl][$k]['anchor'] = $documentLink->nodeValue;
$matchedAnchors[$sl][$k]['img'] = 0;
if($documentLink->hasAttribute('rel'))
$matchedAnchors[$sl][$k]['rel'] = 'Y';
else
$matchedAnchors[$sl][$k]['rel'] = 'N';
}
}
}$k++;
}
}
}
Both @Phliplip & @lunixbochs have mentioned common cURL pitfalls (max execution time & being denied by the target server).
When sending that many cURL requests to the same server I try to "be nice" and add voluntary sleep periods so I don't bombard the host. For a low-end site, 1000+ requests could be like a mini DDoS!
Here's code that's worked for me. I used it to scrape a client's product data from their old site, since the data was locked in a proprietary database system with NO export function.
<?php
header('Content-type: text/html; charset=utf-8', true);
set_time_limit(0);
$urls = array(
'http://www.example.com/cgi-bin/product?id=500',
'http://www.example.com/cgi-bin/product?id=501',
'http://www.example.com/cgi-bin/product?id=502',
'http://www.example.com/cgi-bin/product?id=503',
'http://www.example.com/cgi-bin/product?id=504',
);
$i = 0;
foreach($urls as $url){
echo $url."\n";
$curl = curl_init($url);
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
curl_setopt($curl, CURLOPT_AUTOREFERER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1 );
curl_setopt($curl, CURLOPT_TIMEOUT, 25 );
$html = curl_exec($curl);
$html = @mb_convert_encoding($html, 'HTML-ENTITIES', 'utf-8');
curl_close($curl);
// now do something with info returned by curl
$i++;
if($i%10==0){
sleep(20);
} else {
sleep(2);
}
}
?>
The main features are:
no max execution time
voluntary sleep-ing
new curl init & exec for each request.
In my experience, going to sleep() will stop servers from denying you.
However, if by "different different server" you mean that you are sending a small number of requests to a large number of servers, for example:
$urls = array(
'http://www.example-one.com/',
'http://www.example-two.com/',
'http://www.example-three.com/',
'http://www.example-four.com/',
'http://www.example-five.com/',
'http://www.example-six.com/'
);
and you are using set_time_limit(0); then an error may be causing your code to die; try
ini_set('display_errors',1);
error_reporting(E_ALL);
And tell us the error message you are getting.
PHP doesn't place a restriction on the number of connections using curl_multi_init, but memory usage and time limits will be an issue.
Check your memory_limit setting in your php.ini and try to increase it to see if that helps you.
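For completeness, a short sketch of raising those limits from the script itself rather than php.ini; the values are only examples, not recommendations:
<?php
// Raise limits for a long-running multi-curl job (example values).
ini_set('memory_limit', '512M'); // equivalent to memory_limit = 512M in php.ini
set_time_limit(0);               // no maximum execution time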
