I made a loop that counts up to 500000, but it stops at 65536. Is this a PHP limit or a system limit? Is there any way to avoid it?
By the way, it's just an ordinary for loop.
for($i=0; $i<500000; $i++){}
Here's the real code. I do care about the load, but that's not my concern here.
class ECStun{
    var $sp = 1;
    var $d = 5;
    var $url = 'http://example.com/';
    var $html = '';
    var $invalid = '404';
    var $data = array();
    var $db = false; // db connection
    var $dbhost = 'localhost';
    var $dbuser = 'root';
    var $dbpass = '';
    var $dbname = 'scrape';
    var $tbname = 'ecstun'; // this table will be created automatically by the script
    var $proxies = array(
    );
    var $startat = 0;

    function init(){
        $this->initDB();
        $this->startat = microtime(true);
        $x = rand(0, count($this->proxies) - 1);
        print('start scraping using proxy : '.$this->proxies[$x]."\n");
        for($i = $this->sp; $i <= $this->d; $i++){
            $url = $this->url.'ES'.$i;
            if(!$this->isSaved($url)){ // skip if already saved to DB
                $link = curl_init();
                if(count($this->proxies) > 0){
                    curl_setopt($link, CURLOPT_PROXY, $this->proxies[$x]);
                }
                curl_setopt($link, CURLOPT_URL, $url);
                curl_setopt($link, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17');
                curl_setopt($link, CURLOPT_RETURNTRANSFER, true);
                curl_setopt($link, CURLOPT_FOLLOWLOCATION, true);
                $this->html = curl_exec($link);
                $code = curl_getinfo($link, CURLINFO_HTTP_CODE);
                curl_close($link);
                if($code == 200 || $code == 301){
                    $this->parse($url);
                }else{
                    /*
                    ** if the IP got banned, delete the banned IP,
                    ** then check whether there are any remaining IPs we can use;
                    ** if no more IPs are left in $proxies, exit the script and report the latest ES
                    */
                    //unset($this->proxies[$x]);
                    array_splice($this->proxies, $x, 1);
                    if(count($this->proxies) > 0){
                        $this->sp = $i; // if banned, restart from the latest ES using a different IP
                        $this->init();
                    }else{
                        exit('LAST ES: ES'.$i);
                    }
                }
            }
        }
        //$this->initDB();
    }
}
$ecs = new ECStun;
if(isset($_GET['verify_proxy'])){ // browser only
    $ecs->verifyProxy();
}else{
    $ecs->init();
}
65535 is a magical number. From Wikipedia:
65535 occurs frequently in the field of computing because it is the
highest number which can be represented by an unsigned 16-bit binary
number.
Try this:
var_dump(PHP_INT_MAX);
If it prints 65535, that confirms the limit. Try upgrading to a 32-bit or 64-bit build.
Related
I'm constantly trying to improve my programming skills; I've learned everything online so far. But I can't find a way to avoid duplicate code. Here's my code:
public function Curl($page, $check_top = 0, $pages = 1, $pagesources = array()){
    //$page is the URL
    //$check_top: 0 = false, 1 = true; when 1, fetch both the standard and the top pages
    //$pages is the number of pages to check
    $agent = "Mozilla/5.0 (Windows NT x.y; Win64; x64; rv:10.0) Gecko/20100101 Firefox/10.0";
    try{
        for($i = 0; $i < $pages; $i++){
            $count = $i * 25; //Page 1 starts at 0, page 2 at 25, etc.
            $ch = curl_init($page . "/?count=" . $count);
            curl_setopt($ch, CURLOPT_USERAGENT, $agent);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($ch, CURLOPT_TIMEOUT, 60);
            curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
            $pagesource = curl_exec($ch);
            $pagesources[] = $pagesource;
        }
        if($check_top == 1){
            for($i = 0; $i < $pages; $i++){
                $count = $i * 25;
                $ch = curl_init($page . "/top/?sort=top&t=all&count=" . $count);
                curl_setopt($ch, CURLOPT_USERAGENT, $agent);
                curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
                curl_setopt($ch, CURLOPT_TIMEOUT, 60);
                curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
                $pagesource = curl_exec($ch);
                $pagesources[] = $pagesource;
            }
        }
    } catch (Exception $e){
        echo $e->getMessage();
    }
    return $pagesources;
}
What I'm trying to do:
I want to get the HTML page sources from a specific page range (for example, pages 1 to 5). There are top pages and standard pages, and I want to get the sources of both within that page range. My code works fine, but obviously there must be a better way.
Here's a short example of how you can avoid duplicate code by writing functions and using them together.
class A
{
    public function methodA($paramA, $paramB, $paramC)
    {
        if ($paramA == 'A') {
            $result = $this->methodB($paramB);
        } else {
            $result = $this->methodB($paramC);
        }
        return $result;
    }

    public function methodB($paramA)
    {
        // do something with the given param and return the result
    }
}

$classA = new A();
$result = $classA->methodA('foo', 'bar', 'baz');
The code given above shows a simple class with two methods. As you declared your Curl function as public, I guess you're using a class. The class in the example above is very basic: it calls methodB with different params inside methodA.
What does this mean for you? Work out which parameters your helper function needs. Once you know, write another class method that executes the cURL calls with the given parameters (see the sketch below). Easy as pie.
If you're new to using classes and methods with PHP, I suggest reading the documentation, where the basics of classes, methods and members are described: http://php.net/manual/en/classobj.examples.php.
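Applied to your Curl() method, that idea could look something like this. A minimal sketch, assuming the same URL patterns as your original; fetchPage() is a hypothetical helper name, not an existing API:

class Scraper
{
    // One place for the duplicated cURL setup: fetch a single URL and return its source.
    private function fetchPage($url)
    {
        $agent = "Mozilla/5.0 (Windows NT x.y; Win64; x64; rv:10.0) Gecko/20100101 Firefox/10.0";
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_USERAGENT, $agent);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT, 60);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        $pagesource = curl_exec($ch);
        curl_close($ch);
        return $pagesource;
    }

    public function Curl($page, $check_top = 0, $pages = 1, $pagesources = array())
    {
        for ($i = 0; $i < $pages; $i++) {
            $count = $i * 25; // page 1 starts at 0, page 2 at 25, etc.
            $pagesources[] = $this->fetchPage($page . "/?count=" . $count);
            if ($check_top == 1) {
                $pagesources[] = $this->fetchPage($page . "/top/?sort=top&t=all&count=" . $count);
            }
        }
        return $pagesources;
    }
}

Note the try/catch is gone: the curl_* functions report failure through return values (curl_exec() returns false), not exceptions, so the catch block never fired anyway.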
I was recently tasked with monitoring external webpage response/loading times via CACTI. I found some working PHP scripts (pageload-agent.php and class.pageload.php) that use cURL. All was fine until I was asked to move them from a Linux server to a Windows 2012 R2 server. I'm having a very hard time modifying the scripts to work on Windows. PHP and cURL are already installed, and both work as tested. Here are the scripts, taken from askaboutphp.
class.pageload.php
<?php
class PageLoad {
    var $siteURL = "";
    var $pageInfo = "";

    /*
     * sets the URL to check for load time
     */
    function setURL($url) {
        if (!empty($url)) {
            $this->siteURL = $url;
            return true;
        }
        return false;
    }

    /*
     * extract the header information of the url
     */
    function doPageLoad() {
        $u = $this->siteURL;
        if(function_exists('curl_init') && !empty($u)) {
            $ch = curl_init($u);
            curl_setopt($ch, CURLOPT_HEADER, true);
            curl_setopt($ch, CURLOPT_ENCODING, "gzip");
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_NOBODY, false);
            curl_setopt($ch, CURLOPT_FRESH_CONNECT, false);
            curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)");
            $pageBody = curl_exec($ch);
            $this->pageInfo = curl_getinfo($ch);
            curl_close($ch);
            return true;
        }
        return false;
    }

    /*
     * compile the page load statistics only
     */
    function getPageLoadStats() {
        $info = $this->pageInfo;
        //stats from info
        $s['dest_url'] = $info['url'];
        $s['content_type'] = $info['content_type'];
        $s['http_code'] = $info['http_code'];
        $s['total_time'] = $info['total_time'];
        $s['size_download'] = $info['size_download'];
        $s['speed_download'] = $info['speed_download'];
        $s['redirect_count'] = $info['redirect_count'];
        $s['namelookup_time'] = $info['namelookup_time'];
        $s['connect_time'] = $info['connect_time'];
        $s['pretransfer_time'] = $info['pretransfer_time'];
        $s['starttransfer_time'] = $info['starttransfer_time'];
        return $s;
    }
}
?>
pageload-agent.php
#! /usr/bin/php -q
<?php
//include the class
include_once 'class.pageload.php';

// read in an argument - must make sure there's an argument to use
if ($argc == 2) {
    //read in the arg.
    $url_argv = $argv[1];
    if (!eregi('^http://', $url_argv)) {
        $url_argv = "http://$url_argv";
    }
    // check that the arg is not empty
    if ($url_argv != "") {
        //initiate the results array
        $results = array();
        //initiate the class
        $lt = new PageLoad();
        //set the page to check the loadtime
        $lt->setURL($url_argv);
        //load the page
        if ($lt->doPageLoad()) {
            //load the page stats into the results array
            $results = $lt->getPageLoadStats();
        } else {
            //do nothing
            print "";
        }
        //print out the results
        if (is_array($results)) {
            //expecting only one record as we only passed in 1 page.
            $output = $results;
            print "dns:".$output['namelookup_time'];
            print " con:".$output['connect_time'];
            print " pre:".$output['pretransfer_time'];
            print " str:".$output['starttransfer_time'];
            print " ttl:".$output['total_time'];
            print " sze:".$output['size_download'];
            print " spd:".$output['speed_download'];
        } else {
            //do nothing
            print "";
        }
    }
} else {
    //do nothing
    print "";
}
?>
Thank you; any assistance is greatly appreciated.
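One note on the agent script above, in case the Windows box runs a newer PHP build: eregi() was deprecated in PHP 5.3 and removed in 7.0, so the URL check can fail before cURL is ever reached. A minimal sketch of a drop-in replacement using preg_match():

// Hypothetical replacement for the eregi() check in pageload-agent.php;
// the i modifier gives the same case-insensitive match.
$url_argv = $argv[1];
if (!preg_match('#^http://#i', $url_argv)) {
    $url_argv = "http://$url_argv";
}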
I am trying to load a URL using cURL, but I am running into a problem with browser cookies: the webpage asks me to enable browser cookies. My code is like this.
public function execute() {
    // Set default options, and merge any extra ones in
    if(!isset($this->options[CURLOPT_TIMEOUT])) $this->options[CURLOPT_TIMEOUT] = 45;
    if(!isset($this->options[CURLOPT_COOKIESESSION])) $this->options[CURLOPT_COOKIESESSION] = TRUE;
    if(!isset($this->options[CURLOPT_COOKIEJAR])) $this->options[CURLOPT_COOKIEJAR] = 'cookie.txt';
    if(!isset($this->options[CURLOPT_RETURNTRANSFER])) $this->options[CURLOPT_RETURNTRANSFER] = TRUE;
    if(!isset($this->options[CURLOPT_FOLLOWLOCATION])) $this->options[CURLOPT_FOLLOWLOCATION] = TRUE;
    if(!isset($this->options[CURLOPT_USERAGENT])) $this->options[CURLOPT_USERAGENT] = "Mozilla/5.0 (Windows NT 6.1; rv:20.0) Gecko/20100101 Firefox/20.0";
    if(!isset($this->options[CURLOPT_AUTOREFERER])) $this->options[CURLOPT_AUTOREFERER] = TRUE;
    if(!isset($this->options[CURLOPT_CONNECTTIMEOUT])) $this->options[CURLOPT_CONNECTTIMEOUT] = 15;
    if(!isset($this->options[CURLOPT_MAXREDIRS])) $this->options[CURLOPT_MAXREDIRS] = 4;
    if(!isset($this->options[CURLOPT_HEADER])) $this->options[CURLOPT_HEADER] = FALSE;
    if(!isset($this->options[CURLOPT_SSL_VERIFYPEER])) $this->options[CURLOPT_SSL_VERIFYPEER] = FALSE;
    if(!isset($this->options[CURLOPT_FAILONERROR])) $this->options[CURLOPT_FAILONERROR] = FALSE;
    if(!isset($this->options[CURLOPT_ENCODING])) $this->options[CURLOPT_ENCODING] = '';
    $this->options();
    $return = curl_exec($this->session);

    // Request failed
    if($return === FALSE){
        $this->error_code = curl_errno($this->session);
        $this->error_string = curl_error($this->session);
        curl_close($this->session);
        $this->session = NULL;
        return $return;
    // Request successful
    } else {
        $this->info = curl_getinfo($this->session);
        curl_close($this->session);
        $this->session = NULL;
        return $return;
    }
}
But I am still facing the same problem; please help me with it.
I know this is restricted by the website itself:
"Before you can move on, please activate your browser cookies."
I think you forgot about CURLOPT_COOKIEFILE. CURLOPT_COOKIEJAR only writes cookies out when the handle is closed; CURLOPT_COOKIEFILE is what makes cURL read that file and send the cookies back on the next request. (Note that CURLOPT_COOKIE expects a cookie string such as "name=value", not TRUE.)
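A minimal sketch of the pairing, assuming the same cookie.txt file as in your code:

// CURLOPT_COOKIEFILE reads stored cookies and sends them with the request;
// CURLOPT_COOKIEJAR writes any cookies received when the handle is closed.
$ch = curl_init('http://example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookie.txt');
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie.txt');
$html = curl_exec($ch);
curl_close($ch);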
I'm trying to pull the subscriber count for a particular YouTube channel. I referred to some links on Stack Overflow as well as external sites, and came across links like this. Almost all of them suggested using the YouTube GData API and pulling the count from subscriberCount, but the following code
$data = file_get_contents("http://gdata.youtube.com/feeds/api/users/Tollywood/playlists");
$xml = simplexml_load_string($data);
print_r($xml);
returns no such subscriberCount. Is there any other way of getting the subscriber count, or am I doing something wrong?
The YouTube API v2.0 is deprecated. Here's how to do it with v3.0. OAuth is not needed.
1) Log in to a Google account and go to https://console.developers.google.com/. You may have to start a new project.
2) Navigate to APIs & auth and go to Public API Access -> Create a New Key
3) Choose the option you need (I used 'browser applications') This will give you an API key.
4) Navigate to your channel in YouTube and look at the URL. The channel ID is here: https://www.youtube.com/channel/YOUR_CHANNEL_ID
5) Use the API key and channel ID to get your result with this query: https://www.googleapis.com/youtube/v3/channels?part=statistics&id=YOUR_CHANNEL_ID&key=YOUR_API_KEY
Great success!
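For example, a minimal PHP sketch of step 5 (YOUR_CHANNEL_ID and YOUR_API_KEY are placeholders):

<?php
// Query the v3 channels endpoint and read the subscriber count from statistics.
$channelId = 'YOUR_CHANNEL_ID';
$apiKey = 'YOUR_API_KEY';
$url = 'https://www.googleapis.com/youtube/v3/channels?part=statistics'
     . '&id=' . $channelId . '&key=' . $apiKey;
$response = json_decode(file_get_contents($url), true);
echo $response['items'][0]['statistics']['subscriberCount'];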
The documentation is actually pretty good, but there's a lot of it. Here are a couple of key links:
Channel information documentation: https://developers.google.com/youtube/v3/sample_requests
"Try it" page: https://developers.google.com/youtube/v3/docs/subscriptions/list#try-it
Try this ;)
<?php
$data = file_get_contents('http://gdata.youtube.com/feeds/api/users/Tollywood');
$xml = new SimpleXMLElement($data);
$stats_data = (array)$xml->children('yt', true)->statistics->attributes();
$stats_data = $stats_data['@attributes'];
/********* OR **********/
$data = file_get_contents('http://gdata.youtube.com/feeds/api/users/Tollywood?alt=json');
$data = json_decode($data, true);
$stats_data = $data['entry']['yt$statistics'];
/**********************************************************/
echo 'lastWebAccess = '.$stats_data['lastWebAccess'].'<br />';
echo 'subscriberCount = '.$stats_data['subscriberCount'].'<br />';
echo 'videoWatchCount = '.$stats_data['videoWatchCount'].'<br />';
echo 'viewCount = '.$stats_data['viewCount'].'<br />';
echo 'totalUploadViews = '.$stats_data['totalUploadViews'].'<br />';
?>
I could do it with a regex for my page; not sure whether it works for you or not. Check the following code:
<?php
$channel = 'http://youtube.com/user/YOURUSERNAME/';
$t = file_get_contents($channel);
$pattern = '/yt-uix-tooltip" title="(.*)" tabindex/';
preg_match($pattern, $t, $matches, PREG_OFFSET_CAPTURE);
echo $matches[1][0];
<?php
//this code was written by Abdu ElRhoul
//If you have any questions please contact me at info#oklahomies.com
//My website is http://Oklahomies.com
set_time_limit(0);

function retrieveContent($url){
    $file = fopen($url, "rb");
    if (!$file)
        return "";
    $salida = "";
    while (feof($file) === false) {
        $line = fgets($file, 1024);
        $salida .= $line;
    }
    fclose($file);
    return $salida;
}

$content = retrieveContent("https://www.youtube.com/user/rhoula/about"); //replace rhoula with the channel name
$start = strpos($content, '<span class="about-stat"><b>');
$end = strpos($content, '</b>', $start + 1);
$output = substr($content, $start, $end - $start);
echo "Number of Subscribers = $output";
?>
<?php
echo get_subscriber("UCOshmVNmGce3iwozz55hpww");
function get_subscriber($channel, $use = "user") {
    $subs = 0;
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "https://www.youtube.com/".$use."/".$channel."/about?disable_polymer=1");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_POST, 0);
    curl_setopt($ch, CURLOPT_REFERER, 'https://www.youtube.com/');
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0');
    $result = curl_exec($ch);
    $R = curl_getinfo($ch);
    if($R["http_code"] == 200) {
        $pattern = '/yt-uix-tooltip" title="(.*)" tabindex/';
        preg_match($pattern, $result, $matches, PREG_OFFSET_CAPTURE);
        $subs = intval(str_replace(',', '', $matches[1][0]));
    }
    if($subs == 0 && $use == "user") return get_subscriber($channel, "channel");
    return $subs;
}
Could you please tell me, is there any limit on the number of requests that can be sent using multi_curl? When I tried to send more than 200 requests, it timed out. See the code below.
foreach($newUrlArry as $url){
    $gatherUrl[] = $url['url'];
}

/*-------------------- Array slice --------------------*/
$totalUrlRequest = count($gatherUrl);
if($totalUrlRequest > 10){
    $offset = 10;
    $index = 0;
    $matchedAnchors = array();
    $dom = new DOMDocument;
    $NoOfTilesRequest = ceil($totalUrlRequest / $offset);
    for($sl = 0; $sl < $NoOfTilesRequest; $sl++){
        $output = array_slice($gatherUrl, $index, $offset);
        $index = $offset + $index;
        $responseAction = $this->multiRequestAction($output);
        $k = 0;
        foreach($responseAction as $responseHtml){
            @$dom->loadHTML($responseHtml); // @ suppresses warnings from malformed HTML
            $documentLinks = $dom->getElementsByTagName("a");
            $chieldFlag = false;
            for($i = 0; $i < $documentLinks->length; $i++) {
                $documentLink = $documentLinks->item($i);
                if ($documentLink->hasAttribute('href') AND substr($documentLink->getAttribute('href'), 0, strlen($match)) == $match) {
                    $description = $documentLink->childNodes;
                    foreach($description as $words) {
                        $name = trim($words->nodeName);
                        if($name == 'em' || $name == 'b' || $name == "span" || $name == "p") {
                            if(!empty($words->nodeValue)) {
                                $matchedAnchors[$sl][$k]['anchor'] = trim($words->nodeValue);
                                $matchedAnchors[$sl][$k]['img'] = 0;
                                if($documentLink->hasAttribute('rel'))
                                    $matchedAnchors[$sl][$k]['rel'] = 'Y';
                                else
                                    $matchedAnchors[$sl][$k]['rel'] = 'N';
                                $chieldFlag = true;
                                break;
                            }
                        }
                        elseif($name == 'img') {
                            $alt = $words->getAttribute('alt');
                            if(!empty($alt)) {
                                $matchedAnchors[$sl][$k]['anchor'] = trim($words->getAttribute('alt'));
                                $matchedAnchors[$sl][$k]['img'] = 1;
                                if($documentLink->hasAttribute('rel'))
                                    $matchedAnchors[$sl][$k]['rel'] = 'Y';
                                else
                                    $matchedAnchors[$sl][$k]['rel'] = 'N';
                                $chieldFlag = true;
                                break;
                            }
                        }
                    }
                    if(!$chieldFlag){
                        $matchedAnchors[$sl][$k]['anchor'] = $documentLink->nodeValue;
                        $matchedAnchors[$sl][$k]['img'] = 0;
                        if($documentLink->hasAttribute('rel'))
                            $matchedAnchors[$sl][$k]['rel'] = 'Y';
                        else
                            $matchedAnchors[$sl][$k]['rel'] = 'N';
                    }
                }
            }
            $k++;
        }
    }
}
Both @Phliplip & @lunixbochs have mentioned common cURL pitfalls (max execution time & being denied by the target server).
When sending that many cURL requests to the same server, I try to "be nice" and insert voluntary sleep periods so I don't bombard the host. For a low-end site, 1000+ requests could feel like a mini-DDoS!
Here's code that's worked for me. I used it to scrape a client's product data from their old site, since the data was locked in a proprietary database system with no export function.
<?php
header('Content-type: text/html; charset=utf-8', true);
set_time_limit(0);

$urls = array(
    'http://www.example.com/cgi-bin/product?id=500',
    'http://www.example.com/cgi-bin/product?id=501',
    'http://www.example.com/cgi-bin/product?id=502',
    'http://www.example.com/cgi-bin/product?id=503',
    'http://www.example.com/cgi-bin/product?id=504',
);

$i = 0;
foreach($urls as $url){
    echo $url."\n";
    $curl = curl_init($url);
    $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_TIMEOUT, 25);
    $html = curl_exec($curl);
    $html = @mb_convert_encoding($html, 'HTML-ENTITIES', 'utf-8');
    curl_close($curl);
    // now do something with the info returned by curl
    $i++;
    if($i % 10 == 0){
        sleep(20); // longer pause after every 10th request
    } else {
        sleep(2);
    }
}
?>
The main features are:
no max execution time
voluntary sleep-ing
new curl init & exec for each request.
In my experience, sleep()-ing between requests will stop servers from denying you.
However, if by "different different server" you mean that you are sending a small number of requests to a large number of servers, for example:
$urls = array(
    'http://www.example-one.com/',
    'http://www.example-two.com/',
    'http://www.example-three.com/',
    'http://www.example-four.com/',
    'http://www.example-five.com/',
    'http://www.example-six.com/'
);
and you are using set_time_limit(0), then an error may be causing your code to die. Try:
ini_set('display_errors',1);
error_reporting(E_ALL);
And tell us the error message you are getting.
PHP doesn't place a restriction on the number of connections you can open with curl_multi_init, but memory usage and time limits will be an issue.
Check your memory_limit setting in php.ini and try increasing it to see if that helps.
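If it helps, here's a minimal curl_multi sketch that processes URLs in small batches, so memory stays bounded and no single burst is too large. multiFetch() is a hypothetical helper, and $gatherUrl is the array from the question:

// Fetch a batch of URLs concurrently and return their bodies, keyed like the input.
function multiFetch(array $urls, $timeout = 30) {
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $i => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($mh, $ch);
        $handles[$i] = $ch;
    }
    $running = null;
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh); // wait for activity instead of busy-looping
        }
    } while ($running && $status == CURLM_OK);
    $results = array();
    foreach ($handles as $i => $ch) {
        $results[$i] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}

// Usage: 10 URLs at a time, matching the array_slice batching in the question.
// foreach (array_chunk($gatherUrl, 10) as $batch) {
//     $pages = multiFetch($batch);
// }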