Exit out of a cURL fetch - php

I'm trying to find a way to only quickly access a file and then disconnect immediately.
So I've decided to use cURL since it's the fastest option for me. But I can't figure out how I should "disconnect" cURL.
With the code below, Apache's access log shows that the file I requested was indeed accessed, but I'm feeling a little iffy about this: when I run the while loop without breaking out of it, it just keeps looping. Shouldn't the loop stop when cURL has finished fetching the file? Or am I just being silly, and the loop is simply restarting constantly?
<?php
$Resource = curl_init();
curl_setopt($Resource, CURLOPT_URL, '...');
curl_setopt($Resource, CURLOPT_HEADER, 0);
curl_setopt($Resource, CURLOPT_USERAGENT, '...');
while(curl_exec($Resource)){
break;
}
curl_close($Resource);
?>
I tried setting the CURLOPT_CONNECTTIMEOUT_MS / CURLOPT_CONNECTTIMEOUT options to very small values, but it didn't help in this case.
Is there a more "proper" way of doing this?

This statement is superfluous:
while(curl_exec($Resource)){
break;
}
Instead just keep the return value for future reference:
$result = curl_exec($Resource);
The while loop does not help anything. So now to your question: You can tell curl that it should only take some bytes from the body and then quit. That can be achieved by reducing the CURLOPT_BUFFERSIZE to a small value and by using a callback function to tell curl it should stop:
$withCallback = array(
CURLOPT_BUFFERSIZE => 20, # ~ value of bytes you'd like to get
CURLOPT_WRITEFUNCTION => function($handle, $data) {
echo "WRITE: (", strlen($data), ") $data\n";
return 0;
},
);
$handle = curl_init("http://stackoverflow.com/");
curl_setopt_array($handle, $withCallback);
curl_exec($handle);
curl_close($handle);
Output:
WRITE: (10) <!DOCTYPE
Another alternative is to make a HEAD request by using CURLOPT_NOBODY which will never fetch the body. But it's not a GET request.
The connect timeout settings control how long cURL waits for the connection to be established before giving up. The connect phase lasts until the server accepts input from cURL. It is not related to the phase in which cURL fetches data from the server; that phase is governed by:
CURLOPT_TIMEOUT The maximum number of seconds to allow cURL functions to execute.
You can find a long list of available options in the PHP Manual under curl_setopt.
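To make the two phases concrete, here is a minimal sketch of an options array that limits them separately. The helper name `timeoutOptions` and the example values are assumptions for illustration; the option constants are the standard PHP cURL ones.

```php
<?php
// Hypothetical helper: separate limits for the connect phase and the whole request.
function timeoutOptions(int $connectSeconds, int $totalSeconds): array
{
    return [
        CURLOPT_CONNECTTIMEOUT => $connectSeconds, // time allowed to establish the connection
        CURLOPT_TIMEOUT        => $totalSeconds,   // time allowed for the entire transfer
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
    ];
}
```

It could be applied with `curl_setopt_array($handle, timeoutOptions(2, 10));`, so a slow server only trips the second limit once the connection is already open.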

Perhaps that might be helpful?
$GLOBALS["dataread"] = 0;
define("MAX_DATA", 3000); // how many bytes should be read?
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.php.net/");
curl_setopt($ch, CURLOPT_WRITEFUNCTION, "handlewrite");
curl_exec($ch);
curl_close($ch);
function handlewrite($ch, $data)
{
$GLOBALS["dataread"] += strlen($data);
echo "READ " . strlen($data) . " bytes\n";
if ($GLOBALS["dataread"] > MAX_DATA) {
return 0;
}
return strlen($data);
}
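The same byte limit can be kept without a global by letting a closure carry the counter. This is only a sketch: `makeLimitedWriter` is a hypothetical helper name (not part of cURL), and it returns a callable intended for CURLOPT_WRITEFUNCTION.

```php
<?php
// Hypothetical helper: builds a write callback that stops the transfer
// once more than $maxBytes have been received in total.
function makeLimitedWriter(int $maxBytes, string &$buffer): callable
{
    $received = 0;
    return function ($handle, string $data) use ($maxBytes, &$received, &$buffer): int {
        $received += strlen($data);
        $buffer .= $data;
        if ($received > $maxBytes) {
            // Returning anything other than strlen($data) makes cURL abort.
            return 0;
        }
        return strlen($data);
    };
}
```

It would be wired up with `curl_setopt($ch, CURLOPT_WRITEFUNCTION, makeLimitedWriter(3000, $body));`. Because the callback aborts the transfer on purpose, curl_exec() then returns false with a write error, which is expected here.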

Related

What would be the best way to collect the titles (in bulk) of a subreddit

I am looking to collect the titles of all of the posts on a subreddit, and I wanted to know what would be the best way of going about this?
I've looked around and found some stuff talking about Python and bots. I've also had a brief look at the API and am unsure in which direction to go.
As I do not want to commit to find out 90% of the way through it won't work, I ask if someone could point me in the right direction of language and extras like any software needed for example pip for Python.
My own experience is in web languages such as PHP so I initially thought of a web app would do the trick but am unsure if this would be the best way and how to go about it.
So as my question stands
What would be the best way to collect the titles (in bulk) of a subreddit?
Or if that is too subjective
How do I retrieve and store all the post titles of a subreddit?
Preferably it needs to:
do more than 1 page of (25) results
save to a .txt file
Thanks in advance.
PHP; in 25 lines:
$subreddit = 'pokemon';
$max_pages = 10;
// Set variables with default data
$page = 0;
$after = '';
$titles = '';
do {
$url = 'http://www.reddit.com/r/' . $subreddit . '/new.json?limit=25&after=' . $after;
// Set URL you want to fetch
$ch = curl_init($url);
// Set curl option of header to false (we don't need the headers)
curl_setopt($ch, CURLOPT_HEADER, 0);
// Set curl option of nobody to false as we need the body
curl_setopt($ch, CURLOPT_NOBODY, 0);
// Set curl timeout of 5 seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
// Set curl to return output as string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// Execute curl
$output = curl_exec($ch);
// Get HTTP code of request
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
// Close curl
curl_close($ch);
// If http code is 200 (success)
if ($status == 200) {
// Decode JSON into PHP object
$json = json_decode($output);
// Set after for next curl iteration (reddit's pagination)
$after = $json->data->after;
// Loop through each post and collect its title
foreach ($json->data->children as $k => $v) {
$titles .= $v->data->title . "\n";
}
}
// Increment page number
$page++;
// Loop while the page count is below the maximum and reddit returned an "after" cursor
} while ($page < $max_pages && $after);
// Save titles to text file
file_put_contents(dirname(__FILE__) . '/' . $subreddit . '.txt', $titles);
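The pagination step can also be separated from the HTTP call, which makes it easier to check. A minimal sketch, assuming the listing JSON has the shape the code above relies on (`data.after` and `data.children[].data.title`) and that a missing or null `after` means there are no further pages; the function name `extractTitles` is hypothetical:

```php
<?php
// Hypothetical helper: pulls the titles and the pagination cursor out of one listing page.
function extractTitles(string $json): array
{
    $listing = json_decode($json);
    // Invalid JSON or an unexpected shape yields an empty page.
    if (!isset($listing->data->children) || !is_array($listing->data->children)) {
        return ['titles' => [], 'after' => null];
    }
    $titles = [];
    foreach ($listing->data->children as $child) {
        if (isset($child->data->title)) {
            $titles[] = $child->data->title;
        }
    }
    return ['titles' => $titles, 'after' => $listing->data->after ?? null];
}
```

The fetch loop would then call `extractTitles($output)` for each page and stop as soon as the returned `after` is null.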

HTTP Response Code 0 - Site is working

I am making a website that will check if a website is working and live. I pass in the URL of the site I would like to check and the following code will check if the site is live and return the HTTP response code as well as true or false.
function urlExists($url=NULL)
{
if($url == NULL) return false;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$data = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
if ($httpcode == 0) {
return array (false, $httpcode);
}
else if($httpcode < 400){
return array (true, $httpcode);
} else {
return array (false, $httpcode);
}
}
With one of the sites I am testing though I am getting the HTTP response code of 0 even though I know that the site is live and working.
The site is very slow, as it's a large site on a not very powerful server, so response times can vary between 7 and 25 seconds.
Any help would be greatly appreciated.
Thanks,
Sam
Based on these two links:
https://curl.haxx.se/libcurl/c/CURLOPT_TIMEOUT.html
and
https://curl.haxx.se/libcurl/c/CURLOPT_CONNECTTIMEOUT.html
The first one (CURLOPT_TIMEOUT) sets the maximum time the request is allowed to take.
The second one (CURLOPT_CONNECTTIMEOUT) is the timeout for the connect phase.
As you said, the site you are hitting takes 7-25 seconds to respond. With both limits set to 5 seconds, your cURL request is terminated and closed before the response arrives, so no HTTP status code is received and you get 0.
Increase these two timeouts in your code and it will work for you.
Thanks.
I will offer 2 alternatives for you to compare - along with your curl() function, you will have 3 options to see which one is better/faster for you.
Option A (all php versions), requires fopen() to be activated:
if (!$fp = fopen($url, 'r'))
{
trigger_error("Unable to open URL ($url)", E_USER_ERROR);
}
$headers = stream_get_meta_data($fp);
fclose($fp);
$http_header_info = $headers['wrapper_data'][0];
$httpCode = (int)substr($http_header_info, 9, 3);
Option B (php5+):
$headers = get_headers($url, 1);
$http_header_info = $headers[0];
$httpCode = substr($http_header_info, 9, 3);
Also, if anyone has benchmarks on these 3 approaches, I am curious to see which is more appropriate (only for retrieving HTTP response headers, of course).
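Both options read the status code from a fixed offset, which assumes a status line like `HTTP/1.1 200 OK`; a line such as `HTTP/2 404` has the code at a different position. A minimal sketch that parses the line instead; `parseStatusCode` is a hypothetical helper name:

```php
<?php
// Hypothetical helper: extracts the three-digit status code from an HTTP status line.
function parseStatusCode(string $statusLine): ?int
{
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})\b#', $statusLine, $m) === 1) {
        return (int) $m[1];
    }
    // Not a recognisable status line.
    return null;
}
```

With Option B it would be called as `parseStatusCode($headers[0])`.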
Code 0 is often returned when the URL syntax is invalid or the host cannot be found.
You can also call curl_error($ch) (http://php.net/manual/en/function.curl-error.php) to get the error details.
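A status of 0 means no HTTP response was received, so the cURL error number is the more useful signal. A minimal sketch that keeps the two apart; `interpretCurlResult` is a hypothetical helper, and the values passed to it would come from curl_getinfo(), curl_errno() and curl_error():

```php
<?php
// Hypothetical helper: turns the raw cURL results into an up/down verdict with a reason.
function interpretCurlResult(int $httpCode, int $errno, string $error): array
{
    if ($errno !== 0) {
        // Transport-level failure (timeout, DNS, refused connection, ...): no usable HTTP status.
        return ['up' => false, 'code' => $httpCode, 'reason' => "cURL error $errno: $error"];
    }
    $up = $httpCode > 0 && $httpCode < 400;
    return ['up' => $up, 'code' => $httpCode, 'reason' => $up ? '' : "HTTP status $httpCode"];
}
```

Inside urlExists() it would be called before curl_close(), for example `interpretCurlResult(curl_getinfo($ch, CURLINFO_HTTP_CODE), curl_errno($ch), curl_error($ch))`.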

How to call posts from PHP

I have a website that uses the WP Super Cache plugin. I need to recycle the cache once a day and then request 5 posts (URL addresses) so that WP Super Cache puts them into the cache again (caching is quite time-consuming, so I'd like to have the pages precached before users come so they don't have to wait).
On my hosting I can use a CRON but only for 1 call/hour. And I need to call 5 different URL's at once.
Is it possible to do that? Maybe create one HTML page with these 5 posts in iframe? Will something like that work?
Edit: Shell is not available, so I have to use PHP scripting.
The easiest way to do it in PHP is to use file_get_contents() (fopen() also works), if the HTTP stream wrapper is enabled on your server:
<?php
$postUrls = array(
'http://my.site.here/post1',
'http://my.site.here/post2',
'http://my.site.here/post3',
'http://my.site.here/post4',
'http://my.site.here/post5',
);
foreach ($postUrls as $url) {
// Get the post as a user would
$text = file_get_contents($url);
// Here you can check if the request was successful
// For example, use strpos() or regex to find a piece of text you expect
// to find in the post
// Replace 'copyright bla, bla, bla' with a piece of text you display
// in the footer of your site
if (strpos($text, 'copyright bla, bla, bla') === FALSE) {
echo('Retrieval of '.$url." failed.\n");
}
}
If file_get_contents() fails to open the URLs on your server (some hosting providers restrict this behaviour), you can try cURL instead:
function curl_get_contents($url)
{
$ch = curl_init($url);
curl_setopt_array($ch, array(
CURLOPT_CONNECTTIMEOUT => 30, // timeout in seconds
CURLOPT_RETURNTRANSFER => TRUE, // tell curl to return the page content instead of just TRUE/FALSE
));
$text = curl_exec($ch);
curl_close($ch);
return $text;
}
Then use the function curl_get_contents() listed above instead of file_get_contents().
An example using PHP without building a cURL request.
Using PHP's shell exec, you can have an extremely light function like so :
$siteList = array("http://url1", "http://url2", "http://url3", "http://url4", "http://url5");
foreach ($siteList as $site) {
// escapeshellarg() keeps the URL from being interpreted by the shell
$request = shell_exec('wget ' . escapeshellarg($site));
}
Now of course this is not the most concise answer and not always a good solution. Also, if you actually want anything from the response, you will have to work with it differently than with cURL, but it's a low-impact option.
Thanks to Arkascha's tip, I created a PHP page that I call from CRON. This page contains a simple function using cURL:
function cache_it($Url){
if (!function_exists('curl_init')){
die('No cURL, sorry!');
}
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $Url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 50); //higher timeout needed for cache to load
curl_exec($ch); // the output isn't needed; otherwise use $output = curl_exec($ch);
curl_close($ch);
}
cache_it('http://www.mywebsite.com/url1');
cache_it('http://www.mywebsite.com/url2');
cache_it('http://www.mywebsite.com/url3');
cache_it('http://www.mywebsite.com/url4');
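The question asks for the URLs to be requested at once rather than one after another. A minimal sketch using PHP's curl_multi_* functions, assuming the cURL extension is available; `warmUrls` is a hypothetical helper and the 50-second default only mirrors the timeout used above:

```php
<?php
// Hypothetical helper: requests all URLs concurrently and returns url => HTTP status code.
function warmUrls(array $urls, int $timeoutSeconds = 50): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,            // keep the page out of the script output
            CURLOPT_TIMEOUT        => $timeoutSeconds, // caching can be slow, allow a long transfer
        ]);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none is still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for activity instead of spinning
        }
    } while ($running > 0 && $status === CURLM_OK);

    $codes = [];
    foreach ($handles as $url => $ch) {
        $codes[$url] = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);
    return $codes;
}
```

A single cron-triggered page could then call `warmUrls([...])` once with all post URLs and log any entry whose code is not 200.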

file get content or fsockopen - timeout issue

I have a php file called testResponse.php which is only :
<?php
sleep(5);
echo"go";
?>
Now, I'm calling this file from a another page using file_get_contents like this :
$start= microtime(true);
$opts = array('http' =>
array(
'method' => 'GET',
'timeout' => 1
)
);
$context = stream_context_create($opts);
$loc = @file_get_contents("http://www.mywebsite.com/testResponse.php", false, $context);
$end= microtime(true);
echo $end - $start, "\n";
The output is more than 5 sec, which means that my timeout has been ignored...
I followed the advice of this post : stackoverflow.com/questions/3689371
But it seems that the hostname there cannot include a path (like www.mywebsite.com/testResponse.php); it has to be the bare hostname, like www.mywebsite.com.
So I'm stuck trying to achieve this goal:
Get the content of the page www.test.com/x.php with these constraints:
if test.com doesn't exist or the page x.php doesn't exist, return nothing quickly
if the page exists but takes more than 1 sec to load, abort
else get the content of the file
Edit: By the way, it seems to work when I call this page (testResponse.php) from my local server. Well, it multiplies the timeout by 2. For instance, if I set the timeout to 1, the script echoes something like "2.0054645". But only locally...
The solution is to use PHP's cURL functions. The other question you linked to explains things properly, about the read timeouts vs. the connection timeouts, and so on, but neither of those are truly what you're looking for here. Even the connection timeout won't work, because the connection to testResponse.php is always successful; after that it's waiting, so what you need is an execution timeout. This is where cURL comes in handy.
So, testResponse.php doesn't need to be altered. In your main file, though, try the following code (this is tested and it works on my server):
$start = microtime(true);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.mywebsite.com/testResponse.php");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 1);
$output = curl_exec($ch);
$errno = curl_errno($ch);
if ($errno > 0) {
if ($errno === 28) {
echo "Connection timed out.";
}
else {
echo "Error #" . $errno . ": " . curl_error($ch);
}
}
else {
echo $output;
}
$end = microtime(true);
echo "<br><br>" . ($end - $start);
curl_close($ch);
This sets the execution time of the cURL session, via the CURLOPT_TIMEOUT option you see on line 5. So, when the connection is timed out, $errno will equal 28, the code for cURL's operation timeout error. The rest of the error codes are listed in the cURL documentation, so you can expand the script above to act accordingly.
Finally, because the CURLOPT_RETURNTRANSFER option is set, curl_exec($ch) will return the content of the retrieved page if the session succeeds. Otherwise, it will return false.
Hope this helps!
Edit: Removed the statement setting CURLOPT_HEADER. I also, for some reason, was under the impression that curl_exec($ch) set the value of $ch to the returned contents, forgetting that the contents are returned by curl_exec().

Safe image download from PHP

I want to allow my users to upload a file by providing a URL to the image.
Pretty much like imgur, you enter http://something.com/image.png and the script downloads the file, then keeps it on the server and publishes it.
I tried using file_get_contents() and getimagesize(). But I'm thinking there would be problems:
how can I protect the script from 100 users supplying 100 URLs to large images?
how can I determine if the download process will take or already takes too long?
This is actually interesting.
It appears that you can actually track and control the progress of a cURL transfer. See documentation on CURLOPT_NOPROGRESS, CURLOPT_PROGRESSFUNCTION and CURLOPT_WRITEFUNCTION
I found this example and changed it to:
<?php
file_put_contents('progress.txt', '');
$target_file_name = 'targetfile.zip';
$target_file = fopen($target_file_name, 'w');
$ch = curl_init('http://localhost/so/testfile2.zip');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_NOPROGRESS, FALSE);
curl_setopt($ch, CURLOPT_PROGRESSFUNCTION, 'progress_callback');
curl_setopt($ch, CURLOPT_WRITEFUNCTION, 'write_callback');
curl_exec($ch);
if ($target_file) {
fclose($target_file);
}
$_download_size = 0;
// Since PHP 5.5 the cURL handle is passed to the progress callback as its first argument
function progress_callback($ch, $download_size, $downloaded_size, $upload_size, $uploaded_size) {
global $_download_size;
$_download_size = $download_size;
static $previous_progress = 0;
if ($download_size == 0) {
$progress = 0;
}
else {
$progress = round($downloaded_size * 100 / $download_size);
}
if ($progress > $previous_progress) {
$previous_progress = $progress;
$fp = fopen('progress.txt', 'a');
fputs($fp, $progress .'% ('. $downloaded_size .'/'. $download_size .")\n");
fclose($fp);
}
}
function write_callback($ch, $data) {
global $target_file_name;
global $target_file;
global $_download_size;
if ($_download_size > 1000000) {
return '';
}
return fwrite($target_file, $data);
}
write_callback checks whether the size of the data is greater than a specified limit. If it is, it returns an empty string that aborts the transfer. I tested this on 2 files with 80K and 33M, respectively, with a 1M limit. In your case, progress_callback is pointless beyond the second line, but I kept everything in there for debugging purposes.
One other way to get the size of the data is to do a HEAD request but I don't think that servers are required to send a Content-length header.
To answer question one, you simply need to add the appropriate limits in your code. Define how many requests you want to accept in a given amount of time, track your requests in a database, and go from there. Also put a cap on file size.
For question two, you can set appropriate timeouts if you use cURL.
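Once the bytes are on the server, it is also worth checking that they really are an image before publishing them. A minimal sketch, assuming only GIF, JPEG and PNG should be accepted; `isAcceptableImage` and the size cap are hypothetical, while getimagesizefromstring() is the string-based counterpart of the getimagesize() call mentioned in the question:

```php
<?php
// Hypothetical helper: accepts the data only if it is small enough and decodes as a known image type.
function isAcceptableImage(string $bytes, int $maxBytes): bool
{
    if ($bytes === '' || strlen($bytes) > $maxBytes) {
        return false;
    }
    // getimagesizefromstring() returns false when the data is not a recognised image.
    $info = @getimagesizefromstring($bytes);
    if ($info === false) {
        return false;
    }
    // Index 2 holds one of the IMAGETYPE_* constants.
    return in_array($info[2], [IMAGETYPE_GIF, IMAGETYPE_JPEG, IMAGETYPE_PNG], true);
}
```

This check complements the transfer limits; it does not replace the size cap during the download, because the data has already been fetched by the time it runs.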
