PHP: my crawler crashes after some time with a segmentation fault error

I am a newbie in PHP, and with my knowledge I built a script, but after some time it crashes.
I tested it on 5-6 different Linux distributions: Debian, Ubuntu, Red Hat, Fedora, etc. Only on Fedora it doesn't crash, but after 3-4 hours of running it stops and doesn't give me any error. The process still remains open; it doesn't crash, it just stops working, but this happens only on Fedora.
Here's my script code:
<?php
ini_set('max_execution_time', 0);
include_once('simple_html_dom.php');

$file = fopen("t.txt", "r");
while (!feof($file)) {
    $line = fgets($file);
    $line = trim($line);
    $line = crawler($line);
}
fclose($file);

function crawler($line) {
    $site = $line;
    // Check target.
    $agent = "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; pt-pt) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $line);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_VERBOSE, false);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($httpcode >= 200 && $httpcode <= 300) {
        $check2 = $html = @file_get_html($site);
        if ($check2 === false) {
            return $line;
        } else {
            foreach ($html->find('a') as $element) {
                $checkurl = parse_url($element->href);
                $checkline = parse_url($line);
                if (isset($checkurl['scheme'], $checkurl['host'])) {
                    if ($checkurl['host'] !== $checkline['host']) {
                        $split = str_split($checkurl['host']);
                        $replacethis = ".";
                        $replacewith = "dot";
                        for ($i = 0; $i < count($split); $i++) {
                            if ($split[$i] == $replacethis) {
                                $split[$i] = $replacewith;
                            }
                        }
                        chdir('C:\xampp\htdocs\_test\db');
                        foreach ($split as $element2) {
                            if (!chdir($element2)) { mkdir($element2); chdir($element2); }
                        }
                        $save = fopen('results.txt', 'a');
                        $txt = "$line,$element->innertext\n";
                        fwrite($save, $txt);
                        fclose($save);
                    }
                }
            }
        }
    }
}
?>
So my script crawls all the backlinks from the targets I specified in t.txt, but only outgoing ones... then it walks down into per-host directories and saves the information.
Here are the errors I got:
Allowed memory size of 16777216 bytes exhausted (tried to allocate 24 bytes)
Segmentation fault (core dumped)
It seems there is a bug somewhere... something is wrong. Any idea? Thanks.

Such an error can be thrown when you run out of free memory. I believe it happens inside simple_html_dom. According to its documentation, when using it in a loop you need to call
void clear() — Clean up memory.
on each DOM object.
Also, you perform two HTTP requests for each line, but a single cURL request is enough. Just save the response
$html = curl_exec($ch);
and then use str_get_html($html) instead of file_get_html($site).
Also, it's bad practice to use the error suppression operator @. If a call can throw an exception, you'd better handle it with a try ... catch construction.
Also, you don't need to do things like
$site = $line;
just use $line directly.
And finally, instead of your long line $save = fopen('results.txt', 'a'); ... you can use the simpler file_put_contents().
And I suggest you output to the console what you are actually doing, e.g.
echo "getting HTML from URL " . $line;
echo "parsing text...";
so you can monitor the process.
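A minimal sketch of how these suggestions could fit together, reusing the names from the question (t.txt, simple_html_dom.php, results.txt) and leaving out the directory-tree logic:
<?php
ini_set('max_execution_time', 0);
include_once('simple_html_dom.php');

$file = fopen("t.txt", "r");
while (!feof($file)) {
    $line = trim(fgets($file));
    if ($line === '') continue;

    echo "getting HTML from URL " . $line . "\n";

    // One cURL request per line; keep the body instead of fetching the page twice.
    $ch = curl_init($line);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    $body = curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($body === false || $httpcode < 200 || $httpcode >= 300) continue;

    echo "parsing text...\n";
    $html = str_get_html($body);   // reuse the body already downloaded by cURL
    if ($html === false) continue;

    foreach ($html->find('a') as $element) {
        $checkurl = parse_url($element->href);
        if (isset($checkurl['scheme'], $checkurl['host'])) {
            // file_put_contents() replaces the fopen/fwrite/fclose sequence.
            file_put_contents('results.txt', "$line,{$element->innertext}\n", FILE_APPEND);
        }
    }

    // Free the DOM tree before the next iteration, as the documentation advises.
    $html->clear();
    unset($html);
}
fclose($file);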

Related

scraping a webpage returns encrypted characters

I have tried quite a few methods of downloading the page below ($url = 'https://kat.cr/usearch/life%20of%20pi/';) using PHP. However, I always receive a page with encrypted characters.
I've tried searching for possible solutions prior to posting, and have tried out a few, however, I haven't been able to get any to work yet.
Please see the methods I have tried below and suggest a solution. I am looking for a PHP solution for the same.
Approach 1 - using file_get_contents - returns encrypted characters
<?php
//$contents = file_get_contents($url, $use_include_path, $context, $offset);
include('simple_html_dom.php');
$url = 'https://kat.cr/usearch/life%20of%20pi/';
$html = str_get_html(utf8_encode(file_get_contents($url)));
echo $html;
?>
Approach 2 - using file_get_html - returns encrypted characters
<?php
include('simple_html_dom.php');
$url = 'https://kat.cr/usearch/life%20of%20pi/';
$encoded = htmlentities(utf8_encode(file_get_html($url)));
echo $encoded;
?>
Approach 3 - using gzread - returns blank page
<?php
include('simple_html_dom.php');
$url = 'https://kat.cr/usearch/life%20of%20pi/';
$fp = gzopen($url,'r');
$contents = '';
while($html = gzread($fp , 256000))
{
$contents .= $html;
}
gzclose($fp);
?>
Approach 4 - using gzinflate - returns empty page
<?php
include('simple_html_dom.php');
//function gzdecode($data)
//{
// return gzinflate(substr($data,10,-8));
//}
//$contents = file_get_contents($url, $use_include_path, $context, $offset);
$url = 'https://kat.cr/usearch/life%20of%20pi/';
$html = str_get_html(utf8_encode(file_get_contents($url)));
echo gzinflate(substr($html,10,-8));
?>
Approach 5 - using fopen and fgets - returns encrypted characters
<?php
$url='https://kat.cr/usearch/life%20of%20pi/';
$handle = fopen($url, "r");
if ($handle)
{
while (($line = fgets($handle)) !== false)
{
echo $line;
}
}
else
{
// error opening the file.
echo "could not open the wikipedia URL!";
}
fclose($handle);
?>
Approach 6 - adding ob_start at the beginning of script - page does not load
<?php
ob_start("ob_gzhandler");
$url = 'https://kat.cr/usearch/life%20of%20pi/';
$handle = fopen($url, "r");
if ($handle)
{
while (($line = fgets($handle)) !== false)
{
echo $line;
}
}
else
{
// error opening the file.
echo "could not open the wikipedia URL!";
}
fclose($handle);
?>
Approach 7 - using curl - returns empty page
<?php
$url = 'https://kat.cr/usearch/life%20of%20pi/';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url); // Define target site
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Return page in string
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.2 (KHTML, like Gecko) Chrome/5.0.342.3 Safari/533.2');
curl_setopt($ch, CURLOPT_ENCODING , "gzip");
curl_setopt($ch, CURLOPT_TIMEOUT,5);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Follow redirects
$return = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);
$html = str_get_html("$return");
echo $html;
?>
Approach 8 - using R - returns encrypted characters
> thepage = readLines('https://kat.cr/usearch/life%20of%20pi/')
There were 29 warnings (use warnings() to see them)
> thepage[1:5]
[1] "\037‹\b"
[2] "+SC®\037\035ÕpšÐ\032«F°{¼…àßá$\030±ª\022ù˜ú×Gµ."
[3] "\023\022&ÒÅdDjÈÉÎŽj\t¹Iꬩ\003ä\fp\024“ä(M<©U«ß×Ðy2\tÈÂæœ8ž­\036â!9ª]ûd<¢QR*>öÝdpä’kß!\022?ÙG~è'>\016¤ØÁ\0019Re¥†\0264æ’؉üQâÓ°Ô^—\016\t¡‹\\:\016\003Š]4¤aLiˆ†8ìS\022Ão€'ðÿ\020a;¦Aš`‚<\032!/\"DF=\034'EåX^ÔˆÚ4‰KDCê‡.¹©¡ˆ\004Gµ4&8r\006EÍÄO\002r|šóóZðóú\026?\0274Š ½\030!\týâ;W8Ž‹k‡õ¬™¬ÉÀ\017¯2b1ÓA< \004„š€&J"
[4] "#ƒˆxGµz\035\032Jpâ;²C‡u\034\004’Ñôp«e^*Wz-Óz!ê\022\001èÌI\023ä;LÖ\v›õ‡¸O⺇¯Y!\031þ\024-mÍ·‡G#°›„¦Î#º¿ÉùÒò(ìó¶³f\177¤?}\017½<Cæ_eÎ\0276\t\035®ûÄœ\025À}rÌ\005òß$t}ï/IºM»µ*íÖšh\006\t#kåd³¡€âȹE÷CÌG·!\017ý°èø‡x†ä\a|³&jLJõìè>\016ú\t™aᾞ[\017—z¹«K¸çeØ¿=/"
[5] "\035æ\034vÎ÷Gûx?Ú'ûÝý`ßßwö¯v‹bÿFç\177F\177\035±?ÿýß\177þupþ'ƒ\035ösT´°ûï¢<+(Òx°Ó‰\"<‘G\021M(ãEŽ\003pa2¸¬`\aGýtÈFíî.úÏîAQÙ?\032ÉNDpBÎ\002Â"
Approach 9 - using BeautifulSoup (python) - returns encrypted characters
import urllib
htmltext = urllib.urlopen("https://kat.cr/usearch/life%20of%20pi/").read()
print htmltext
Approach 10 - using wget on the linux terminal - gets a page with encrypted characters
wget -O page https://kat.cr/usearch/Monsoon%20Mangoes%20malayalam/
Approach 11 -
tried manually by pasting the url to the below service - works
https://www.hurl.it/
Approach 12 -
tried manually by pasting the url to the below service - works
https://www.import.io/
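For what it's worth, the bytes at the start of Approach 8's output (\037‹\b, i.e. 0x1f 0x8b 0x08) are the gzip magic number, which suggests the responses are gzip-compressed rather than encrypted. A minimal sketch, assuming the URL is still reachable, that lets cURL negotiate and transparently decompress the response (an empty CURLOPT_ENCODING string means "accept every encoding cURL supports"):
<?php
include('simple_html_dom.php');

$url = 'https://kat.cr/usearch/life%20of%20pi/';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_ENCODING, '');        // send Accept-Encoding and auto-decompress
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
$body = curl_exec($ch);
curl_close($ch);

if ($body !== false) {
    $html = str_get_html($body);
    echo $html;
}
?>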

PHP Get metadata of remote .mp3 file (from URL)

I am trying to get the song name / artist name / song length / bitrate etc. from a remote .mp3 file such as http://shiro-desu.com/scr/11.mp3.
I have tried the getID3 script, but from what I understand it doesn't work for remote files, as I got this error: "Remote files are not supported - please copy the file locally first"
Also, this code:
<?php
$tag = id3_get_tag( "http://shiro-desu.com/scr/11.mp3" );
print_r($tag);
?>
did not work either.
"Fatal error: Call to undefined function id3_get_tag() in /home4/shiro/public_html/scr/index.php on line 2"
Considering the common error case here (undefined function):
The error you get means the ID3 extension is not enabled in your PHP configuration.
If you don't have the ID3 extension, check its installation documentation for how to install it.
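A hypothetical guard (not from the original answer) to confirm the extension is actually available before calling it:
<?php
// Hypothetical check: bail out early if the PECL id3 extension is missing.
if (!function_exists('id3_get_tag')) {
    die('The id3 extension is not installed or not enabled.');
}

$tag = id3_get_tag("http://shiro-desu.com/scr/11.mp3");
print_r($tag);
?>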
Firstly, I didn't create this; I'm just making it easy to understand with a full example.
You can read more about it here, preserved thanks to archive.org:
https://web.archive.org/web/20160106095540/http://designaeon.com/2012/07/read-mp3-tags-without-downloading-it/
To begin, download this library from here: http://getid3.sourceforge.net/
When you open the zip folder, you’ll see ‘getid3’. Save that folder in to your working folder.
Next, create a folder called "temp" in the working folder that the following script will run from.
Basically, what it does is download the first and last 64 KB of the file, then read the metadata from that partial copy.
I enjoy a simple example. I hope this helps.
<?php
require_once("getid3/getid3.php");
$url_media = "http://example.com/myfile.mp3";
$a=getfileinfo($url_media);
echo"<pre>";
echo $a['tags']['id3v2']['album'][0] . "\n";
echo $a['tags']['id3v2']['artist'][0] . "\n";
echo $a['tags']['id3v2']['title'][0] . "\n";
echo $a['tags']['id3v2']['year'][0] . "\n";
echo $a['tags']['id3v2']['year'][0] . "\n";
echo "\n-----------------\n";
//print_r($a['tags']['id3v2']['album']);
echo "-----------------\n";
//print_r($a);
echo"</pre>";
function getfileinfo($remoteFile)
{
$url=$remoteFile;
$uuid=uniqid("designaeon_", true);
$file="temp/".$uuid.".mp3";
$size=0;
$ch = curl_init($remoteFile);
//==============================Get Size==========================//
$contentLength = 'unknown';
$ch1 = curl_init($remoteFile);
curl_setopt($ch1, CURLOPT_NOBODY, true);
curl_setopt($ch1, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch1, CURLOPT_HEADER, true);
curl_setopt($ch1, CURLOPT_FOLLOWLOCATION, true); //not necessary unless the file redirects (like the PHP example we're using here)
$data = curl_exec($ch1);
curl_close($ch1);
if (preg_match('/Content-Length: (\d+)/', $data, $matches)) {
$contentLength = (int)$matches[1];
$size=$contentLength;
}
//==============================Get Size==========================//
if (!$fp = fopen($file, "wb")) {
echo 'Error opening temp file for binary writing';
return false;
} else if (!$urlp = fopen($url, "r")) {
echo 'Error opening URL for reading';
return false;
}
try {
$to_get = 65536; // 64 KB
$chunk_size = 4096; // Haven't bothered to tune this, maybe other values would work better??
$got = 0; $data = null;
// Grab the first 64 KB of the file
while (!feof($urlp) && $got < $to_get) {
    $data .= fgets($urlp, $chunk_size);
    $got += $chunk_size;
}
fwrite($fp, $data);
// Grab the last 64 KB of the file, if we know how big it is.
if ($size > 0) {
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RESUME_FROM, $size - $to_get);
curl_exec($ch);
}
// Now $fp should be the first and last 64KB of the file!!
#fclose($fp);
#fclose($urlp);
}
catch (Exception $e) {
#fclose($fp);
#fclose($urlp);
echo 'Error transfering file using fopen and cURL !!';
return false;
}
$getID3 = new getID3;
$filename=$file;
$ThisFileInfo = $getID3->analyze($filename);
getid3_lib::CopyTagsToComments($ThisFileInfo);
unlink($file);
return $ThisFileInfo;
}
?>

php curl multi error handler

I want to capture cURL errors and warnings in my error handler so that they do not get echoed to the user. To prove that all errors have been caught, I prepend the $err_start string to the error. Currently, here is a working (but simplified) snippet of my code (run it in a browser, not the CLI):
<?php
set_error_handler('handle_errors');
test_curl();
function handle_errors($error_num, $error_str, $error_file, $error_line)
{
$err_start = 'caught error'; //to prove that the error has been properly caught
die("$err_start $error_num, $error_str, $error_file, $error_line<br>");
}
function test_curl()
{
$curl_multi_handle = curl_multi_init();
$curl_handle1 = curl_init('iamdooooooooooown.com');
curl_setopt($curl_handle1, CURLOPT_RETURNTRANSFER, true);
curl_multi_add_handle($curl_multi_handle, $curl_handle1);
$still_running = 1;
while($still_running > 0) $multi_errors = curl_multi_exec($curl_multi_handle, $still_running);
if($multi_errors != CURLM_OK) trigger_error("curl error [$multi_errors]: ".curl_error($curl_multi_handle), E_USER_ERROR);
if(strlen(curl_error($curl_handle1))) trigger_error("curl error: [".curl_error($curl_handle1)."]", E_USER_ERROR);
$curl_info = curl_getinfo($curl_handle1); //info for individual requests
$content = curl_multi_getcontent($curl_handle1);
curl_multi_remove_handle($curl_multi_handle, $curl_handle1);
curl_close($curl_handle1);
curl_multi_close($curl_multi_handle);
}
?>
Note that my full code has multiple requests in parallel; however, the issue still manifests with a single request as shown here. Note also that the error handler shown in this code snippet is very basic - my actual error handler will not die on warnings or notices, so no need to school me on this.
Now if I try to cURL a host which is currently down, then I successfully capture the cURL error and my script dies with:
caught error 256, curl error: [Couldn't resolve host 'iamdooooooooooown.com'], /var/www/proj/test_curl.php, 18
However, the following warning is not caught by my error handler function and is being echoed to the page:
Warning: (null)(): 3 is not a valid cURL handle resource in Unknown on line 0
I would like to capture this warning in my error handler so that I can log it for later inspection.
One thing I have noticed is that the warning only manifests when the cURL code is inside a function - it does not happen when the code is at the highest scope level. Is it possible that one of the cURL globals (eg CURLM_OK) is not accessible within the scope of the test_curl() function?
I am using PHP Version 5.3.2-1ubuntu4.19
edits
updated the code snippet to fully demonstrate the error
the uncaptured warning only manifests when inside a function or class method
I don't think I agree with the way you are capturing the error... you can try
$nodes = array(
"http://google.com",
"http://iamdooooooooooown.com",
"https://gokillyourself.com"
);
echo "<pre>";
print_r(multiplePost($nodes));
Output
Array
(
[google.com] => #HTTP-OK 48.52 kb returned
[iamdooooooooooown.com] => #HTTP-ERROR 0 for : http://iamdooooooooooown.com
[gokillyourself.com] => #HTTP-ERROR 0 for : https://gokillyourself.com
)
Function Used
function multiplePost($nodes) {
$mh = curl_multi_init();
$curl_array = array();
foreach ( $nodes as $i => $url ) {
$url = trim($url);
$curl_array[$i] = curl_init($url);
curl_setopt($curl_array[$i], CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl_array[$i], CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)');
curl_setopt($curl_array[$i], CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($curl_array[$i], CURLOPT_TIMEOUT, 15);
curl_setopt($curl_array[$i], CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl_array[$i], CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($curl_array[$i], CURLOPT_SSL_VERIFYPEER, 0);
curl_multi_add_handle($mh, $curl_array[$i]);
}
$running = NULL;
do {
usleep(10000);
curl_multi_exec($mh, $running);
} while ( $running > 0 );
$res = array();
foreach ( $nodes as $i => $url ) {
$domain = parse_url($url, PHP_URL_HOST);
$curlErrorCode = curl_errno($curl_array[$i]);
if ($curlErrorCode === 0) {
$info = curl_getinfo($curl_array[$i]);
$info['url'] = trim($info['url']);
if ($info['http_code'] == 200) {
$content = curl_multi_getcontent($curl_array[$i]);
$res[$domain] = sprintf("#HTTP-OK %0.2f kb returned", strlen($content) / 1024);
} else {
$res[$domain] = "#HTTP-ERROR {$info['http_code'] } for : {$info['url']}";
}
} else {
$res[$domain] = sprintf("#CURL-ERROR %d: %s ", $curlErrorCode, curl_error($curl_array[$i]));
}
curl_multi_remove_handle($mh, $curl_array[$i]);
curl_close($curl_array[$i]);
flush();
ob_flush();
}
curl_multi_close($mh);
return $res;
}
It is possible that this is a bug in php-curl. When the following line is removed, everything behaves OK:
if(strlen(curl_error($curl_handle1))) trigger_error("curl error: [".curl_error($curl_handle1)."]", E_USER_ERROR);
As far as I can tell, curling a host that is down corrupts $curl_handle1 in some way that the curl_error() function is not prepared for. To get around this problem (until a bug fix is made), just test whether the http_code returned by curl_getinfo() is 0. If it is 0, do not use the curl_error() function:
if($multi_errors != CURLM_OK) trigger_error("curl error [$multi_errors]: ".curl_error($curl_multi_handle), E_USER_ERROR);
$curl_info = curl_getinfo($curl_handle1); //info for individual requests
$is_up = ($curl_info['http_code'] == 0) ? 0 : 1;
if($is_up && strlen(curl_error($curl_handle1))) trigger_error("curl error: [".curl_error($curl_handle1)."]", E_USER_ERROR);
It's not a very elegant solution, but it may have to do for now.
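Another way to sidestep calling curl_error() on a handle the multi stack may have left in a bad state is to read per-transfer results with curl_multi_info_read(), which reports each handle's CURLE_* code directly. A rough sketch (not from the answers above):
<?php
$mh = curl_multi_init();
$ch = curl_init('http://iamdooooooooooown.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_multi_add_handle($mh, $ch);

$running = 0;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh, 1.0); // avoid a busy loop
} while ($running > 0);

// Each finished transfer is reported once, with its CURLE_* result code.
while ($info = curl_multi_info_read($mh)) {
    if ($info['result'] !== CURLE_OK) {
        trigger_error("curl error code {$info['result']}", E_USER_WARNING);
    }
}

curl_multi_remove_handle($mh, $ch);
curl_close($ch);
curl_multi_close($mh);
?>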

Caching JSON output in PHP

Got a slight bit of an issue. I've been playing with the Facebook and Twitter APIs and getting the JSON output of status search queries no problem. However, I've read up further and realised that I could end up being "rate limited", as quoted from the documentation.
I was wondering: is it easy to cache the JSON output each hour so that I can at least try to prevent this from happening? If so, how is it done? I tried a YouTube video, but it only showed how to write the contents of a directory listing to a cache.php file; it didn't point out whether this can be done with JSON output, how to use a time interval of 60 minutes, or how to get the information back out of the cache file.
Any help or code would be very much appreciated, as there seems to be very little in the way of tutorials on this sort of thing.
Here is a simple function that adds caching to fetching a URL's contents:
function getJson($url) {
// cache files are created like cache/abcdef123456...
$cacheFile = 'cache' . DIRECTORY_SEPARATOR . md5($url);
if (file_exists($cacheFile)) {
$fh = fopen($cacheFile, 'r');
$size = filesize($cacheFile);
$cacheTime = trim(fgets($fh));
// if data was cached recently, return cached data
if ($cacheTime > strtotime('-60 minutes')) {
return fread($fh, $size);
}
// else delete cache file
fclose($fh);
unlink($cacheFile);
}
$json = /* get from Twitter as usual */;
$fh = fopen($cacheFile, 'w');
fwrite($fh, time() . "\n");
fwrite($fh, $json);
fclose($fh);
return $json;
}
It uses the URL to identify cache files, a repeated request to the identical URL will be read from the cache the next time. It writes the timestamp into the first line of the cache file, and cached data older than an hour is discarded. It's just a simple example and you'll probably want to customize it.
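A hypothetical usage example (the endpoint URL is made up; the cache directory must already exist and be writable):
<?php
$json = getJson('https://api.twitter.com/1.1/search/tweets.json?q=php'); // hypothetical endpoint
$data = json_decode($json, true);
?>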
It's a good idea to use caching to avoid the rate limit.
Here's some example code that shows how I did it for Google+ data,
in some php code I wrote recently.
private function getCache($key) {
$cache_life = intval($this->instance['cache_life']); // minutes
if ($cache_life <= 0) return null;
// fully-qualified filename
$fqfname = $this->getCacheFileName($key);
if (file_exists($fqfname)) {
if (filemtime($fqfname) > (time() - 60 * $cache_life)) {
// The cache file is fresh.
$fresh = file_get_contents($fqfname);
$results = json_decode($fresh,true);
return $results;
}
else {
unlink($fqfname);
}
}
return null;
}
private function putCache($key, $results) {
$json = json_encode($results);
$fqfname = $this->getCacheFileName($key);
file_put_contents($fqfname, $json, LOCK_EX);
}
and to use it:
// $cacheKey is a value that is unique to the
// concatenation of all params. A string concatenation
// might work.
$results = $this->getCache($cacheKey);
if (!$results) {
// cache miss; must call out
$results = $this->getDataFromService(....);
$this->putCache($cacheKey, $results);
}
I know this post is old, but it shows up in Google, so for everyone looking: I made this simple function that cURLs a JSON URL and caches it in a file inside a specific folder. When the JSON is requested again, it cURLs it afresh if 5 minutes have passed; if 5 minutes haven't passed yet, it serves it from the file. It uses the timestamp in the file name to track the time. Enjoy.
function ccurl($url,$id){
$path = "./private/cache/$id/";
$files = scandir($path);
$files = array_values(array_diff(scandir($path), array('.', '..')));
if(count($files) > 1){
foreach($files as $file){
unlink($path.$file);
$files = scandir($path);
$files = array_values(array_diff(scandir($path), array('.', '..')));
}
}
if(empty($files)){
$c = curl_init();
curl_setopt($c, CURLOPT_URL, $url);
curl_setopt($c, CURLOPT_TIMEOUT, 15);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_USERAGENT,
'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0');
$response = curl_exec($c);
curl_close ($c);
$fp = file_put_contents($path.time().'.json', $response);
return $response;
}else {
if(time() - str_replace('.json', '', $files[0]) > 300){
unlink($path.$files[0]);
$c = curl_init();
curl_setopt($c, CURLOPT_URL, $url);
curl_setopt($c, CURLOPT_TIMEOUT, 15);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
curl_setopt($c, CURLOPT_USERAGENT,
'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:17.0) Gecko/20100101 Firefox/17.0');
$response = curl_exec($c);
curl_close ($c);
$fp = file_put_contents($path.time().'.json', $response);
return $response;
}else {
return file_get_contents($path. $files[0]);
}
}
}
For usage, create a directory for all cached files (for me it's /private/cache), then create another directory inside it for the request cache, for example x. When calling the function it should look like this:
ccurl('json_url','x')
where x is the ID. If you have questions, please ask me ^_^ Also enjoy (I might update it later so it doesn't use a directory per ID).

How to reduce virtual memory by optimising my PHP code?

My current code (see below) uses 147MB of virtual memory!
My provider has allocated 100MB by default and the process is killed once run, causing an internal error.
The code is utilising curl multi and must be able to loop with more than 150 iterations whilst still minimizing the virtual memory. The code below is only set at 150 iterations and still causes the internal server error. At 90 iterations the issue does not occur.
How can I adjust my code to lower the resource use / virtual memory?
Thanks!
<?php
function udate($format, $utimestamp = null) {
if ($utimestamp === null)
$utimestamp = microtime(true);
$timestamp = floor($utimestamp);
$milliseconds = round(($utimestamp - $timestamp) * 1000);
return date(preg_replace('`(?<!\\\\)u`', $milliseconds, $format), $timestamp);
}
$url = 'https://www.testdomain.com/';
$curl_arr = array();
$master = curl_multi_init();
for($i=0; $i<150; $i++)
{
$curl_arr[$i] = curl_init();
curl_setopt($curl_arr[$i], CURLOPT_URL, $url);
curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_arr[$i], CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($curl_arr[$i], CURLOPT_SSL_VERIFYPEER, FALSE);
curl_multi_add_handle($master, $curl_arr[$i]);
}
do {
curl_multi_exec($master,$running);
} while($running > 0);
for($i=0; $i<150; $i++)
{
$results = curl_multi_getcontent ($curl_arr[$i]);
$results = explode("<br>", $results);
echo $results[0];
echo "<br>";
echo $results[1];
echo "<br>";
echo udate('H:i:s:u');
echo "<br><br>";
usleep(100000);
}
?>
As per your last comment..
Download RollingCurl.php.
Hopefully this will sufficiently spam the living daylights out of your API.
<?php
$url = '________';
$fetch_count = 150;
$window_size = 5;
require("RollingCurl.php");
function request_callback($response, $info, $request) {
list($result0, $result1) = explode("<br>", $response);
echo "{$result0}<br>{$result1}<br>";
//print_r($info);
//print_r($request);
echo "<hr>";
}
$urls = array_fill(0, $fetch_count, $url);
$rc = new RollingCurl("request_callback");
$rc->window_size = $window_size;
foreach ($urls as $url) {
$request = new RollingCurlRequest($url);
$rc->add($request);
}
$rc->execute();
?>
Looking through your questions, I saw this comment:
If the intention is domain snatching, then using one of the established services is a better option. Your script implementation is hardly as important as the actual connection and latency.
I agree with that comment.
Also, you seem to have posted the "same question" approximately seven hundred times:
https://stackoverflow.com/users/558865/icer
https://stackoverflow.com/users/516277/icer
How can I adjust the server to run my PHP script quicker?
How can I re-code my php script to run as quickly as possible?
How to run cURL once, checking domain availability in a loop? Help fixing code please
Help fixing php/api/curl code please
How to reduce virtual memory by optimising my PHP code?
Overlapping HTTPS requests?
Multiple https requests.. how to?
Doesn't the fact that you have to keep asking the same question over and over tell you that you're doing it wrong?
This comment of yours:
#mario: Cheers. I'm competing against 2 other companies for specific ccTLD's. They are new to the game and they are snapping up those domains in slow time (up to 10 seconds after purge time). I'm just a little slower at the moment.
I'm fairly sure that PHP on a shared hosting account is the wrong tool to use if you are seriously trying to beat two companies at snapping up expired domain names.
The result of each of the 150 queries is being stored in PHP memory, and by your evidence this is insufficient. The only conclusion is that you cannot keep 150 responses in memory; you must either stream them to files instead of memory buffers, or reduce the number of queries and process the list of URLs in batches.
To use streams you must set CURLOPT_RETURNTRANSFER to 0 and implement a callback for CURLOPT_WRITEFUNCTION; there is an example in the PHP manual:
http://www.php.net/manual/en/function.curl-setopt.php#98491
function on_curl_write($ch, $data)
{
global $fh;
$bytes = fwrite ($fh, $data, strlen($data));
return $bytes;
}
curl_setopt ($curl_arr[$i], CURLOPT_WRITEFUNCTION, 'on_curl_write');
Getting the correct file handle in the callback is left as a problem for the reader to solve; one possible approach is sketched below.
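A sketch of one way to do that (not from the original answer): map each cURL handle's resource id to its own file pointer, so the write callback knows where to stream each response. The (int) cast works for the resource-based handles of PHP 5/7; on PHP 8, where handles are CurlHandle objects, spl_object_id() would be needed instead.
<?php
// Assumes a writable ./out directory for the streamed responses.
$fhs = array();

function on_curl_write($ch, $data)
{
    global $fhs;
    return fwrite($fhs[(int)$ch], $data); // must return the number of bytes handled
}

$url = 'https://www.testdomain.com/';
$master = curl_multi_init();
$curl_arr = array();
for ($i = 0; $i < 150; $i++) {
    $curl_arr[$i] = curl_init($url);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, 0);
    curl_setopt($curl_arr[$i], CURLOPT_WRITEFUNCTION, 'on_curl_write');
    $fhs[(int)$curl_arr[$i]] = fopen("out/result_$i.txt", 'wb');
    curl_multi_add_handle($master, $curl_arr[$i]);
}
do {
    curl_multi_exec($master, $running);
} while ($running > 0);
foreach ($curl_arr as $i => $ch) {
    fclose($fhs[(int)$ch]);
    curl_multi_remove_handle($master, $ch);
    curl_close($ch);
}
curl_multi_close($master);
?>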
<?php
echo str_repeat(' ', 1024); //to make flush work
$url = 'http://__________/';
$fetch_count = 15;
$delay = 100000; //0.1 second
//$delay = 1000000; //1 second
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
for ($i=0; $i<$fetch_count; $i++) {
$start = microtime(true);
$result = curl_exec($ch);
list($result0, $result1) = explode("<br>", $result);
echo "{$result0}<br>{$result1}<br>";
flush();
$end = microtime(true);
$sleeping = max(0, $delay - (int)(($end - $start) * 1000000)); // elapsed time converted to microseconds
echo 'sleeping: ' . ($sleeping / 1000000) . ' seconds<hr />';
usleep($sleeping);
}
curl_close($ch);
?>
