I'm using the curl_multi functions with PHP. I already know that you can get the response contents back from curl_exec when the CURLOPT_RETURNTRANSFER flag is on. However, how can we grab the response contents of multiple requests as strings when using curl_multi_exec?
Does it return an array when this flag is set? Nope; curl_multi_exec only returns a cURL multi status code (CURLM_OK or an error constant), with no option to return the contents the way curl_exec does.
It turns out that the curl_multi_getcontent function, while somewhat inelegant, works for getting the contents as a string from each individual cURL handle.
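For reference, here is a minimal sketch of that approach (the URLs are placeholders; each handle needs CURLOPT_RETURNTRANSFER so the content is buffered and can later be fetched with curl_multi_getcontent):

$urls = array('http://www.example.com/', 'http://www.example.org/'); // placeholder URLs

$mh = curl_multi_init();
$handles = array();

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // buffer the response instead of printing it
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Drive all transfers until every handle has finished.
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);

// Collect each response body as a string.
$contents = array();
foreach ($handles as $ch) {
    $contents[] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);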
When calling cURL from PHP, I'm able to hook a callback on CURLOPT_PROGRESSFUNCTION and read headers while the request is in progress using curl_multi_getcontent($handle):
$handle = curl_init();
curl_setopt($handle, CURLOPT_NOPROGRESS, false);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_PROGRESSFUNCTION, function($handle) {
    $response = curl_multi_getcontent($handle);
    // some logic here
});
curl_exec($handle);
How can I do that with Guzzle?
The problem is that I cannot use curl_multi_getcontent($handle) without setting CURLOPT_RETURNTRANSFER to true.
But when I set CURLOPT_RETURNTRANSFER in Guzzle's curl config, I can read headers in the progress function with $response = curl_multi_getcontent($handle); however, the response stream then contains empty content.
$request->getResponse()->getBody()->getContents(); // always outputs ""
Edit:
I have made this change (https://github.com/guzzle/guzzle/pull/2173) so I can access the handle in the progress callback via the progress setting:
'progress' => function($handle) {
    $response = curl_multi_getcontent($handle);
    // some logic here
},
That works as long as CURLOPT_RETURNTRANSFER is true. However, as I mentioned earlier, the response contents are then empty ("").
There is a progress request option.
// Send a GET request to /
$result = $client->request(
'GET',
'/',
[
'progress' => function(
$downloadTotal,
$downloadedBytes,
$uploadTotal,
$uploadedBytes
) {
//do something
},
]
);
http://docs.guzzlephp.org/en/stable/request-options.html#progress
I found the solution, or rather an explanation of why it is happening.
Guzzle sets the CURLOPT_FILE option by default when no custom sink is defined (or CURLOPT_WRITEFUNCTION when a sink is defined, but that doesn't really matter here).
However, setting CURLOPT_RETURNTRANSFER to true overrides both of those options, so they are no longer applied.
Two things then happen after setting CURLOPT_RETURNTRANSFER:
1. The response can be read in the PROGRESSFUNCTION callback with $response = curl_multi_getcontent($handle). For that, this Guzzle modification is necessary: https://github.com/guzzle/guzzle/pull/2173
2. The response is returned as the return value when Guzzle calls curl_exec($handle). But Guzzle doesn't assign it to any variable, because it expects the response to arrive not as a return value but through the WRITEFUNCTION, which was neutralized by setting CURLOPT_RETURNTRANSFER.
So my solution is not the cleanest one, but I can't find any other way around it with Guzzle; Guzzle simply isn't built to handle that.
I forked Guzzle and created a custom Stream class that behaves like the default sink, i.e. it writes to php://temp. When my custom Stream class is set as the sink, I write the result of curl_exec into the stream:
$result = curl_exec($easy->handle);
$easy->sink->write($result);
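To make that slightly more concrete, here is a sketch of the kind of sink involved; this is not the actual forked code, just the stock GuzzleHttp\Psr7\Stream wrapped around php://temp, which is essentially what Guzzle's default sink is:

use GuzzleHttp\Psr7\Stream;

// A php://temp-backed PSR-7 stream, mimicking Guzzle's default sink behaviour.
$sink = new Stream(fopen('php://temp', 'r+'));

The two lines above, placed in the forked handler, then push curl_exec()'s return value into this sink so that the response body is no longer empty when it is read later.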
Currently I'm writing a PHP script that is supposed to check whether a URL is current (returns an HTTP 200 code or redirects to such a URL).
Since several of the URLs to be tested return a file, I'd like to avoid a normal GET request so that I don't have to actually download the file.
I would normally use the HTTP HEAD method; however, tests show that many servers don't recognize it and return a different HTTP code than the corresponding GET request.
My idea was now to make a GET request and use CURLOPT_HEADERFUNCTION to define a callback function which checks the HTTP code in the first line of the header and then immediately terminates the request by returning 0 (instead of the length of the header) if it's not a redirect code.
My question is: is it OK to terminate an HTTP request like that? Or will it have any negative effects on the server? Will this actually avoid the unnecessary download?
Example code (untested):
$url = "http://www.example.com/";
$ch = curl_init($url);
curl_setopt_array($ch, array(
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_HEADER => true,
CURLINFO_HEADER_OUT => true,
CURLOPT_HTTPGET => true,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HEADERFUNCTION => 'requestHeaderCallback',
));
$curlResult = curl_exec($ch);
curl_close($ch);
function requestHeaderCallback($ch, $header) {
    $matches = array();
    if (preg_match('/^HTTP\/\d\.\d (\d{3}) /', $header, $matches)) {
        $code = (int) $matches[1];
        if ($code < 300 || $code >= 400) {
            return 0; // abort the transfer unless it's a 3xx redirect
        }
    }
    return strlen($header);
}
Yes, it is fine, and yes, it will stop the transfer right there.
It will also cause the connection to get disconnected, which is only a concern if you intend to make many requests to the same host, since keeping the connection alive could then be a performance benefit.
I am new to using PHP to make web requests. Previously I have only ever used node.js.
In Node, our program continues to run after we send the web request. When the response comes back, Node automatically runs the callback function associated with the request.
However, in PHP I see that I can make a web request by calling curl_exec on my cURL handle. But how do I get the callback? What if I need to keep running code between the time the request is sent and the time the response comes back? Is there a way to basically do a callback through some other method?
Thanks!
In PHP, most functions are blocking: program execution halts until the operation has finished. This is the case with curl_exec.
You get the returned response (or a boolean indicating success) as the return value of the function. See the manual on php.net:
Returns TRUE on success or FALSE on failure. However, if the CURLOPT_RETURNTRANSFER option is set, it will return the result on success, FALSE on failure.
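For illustration, a minimal sketch of that blocking behaviour (the URL is just a placeholder):

$ch = curl_init('https://www.example.com/'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// Execution blocks on this line until the response arrives (or the request fails).
$body = curl_exec($ch); // string on success, false on failure
curl_close($ch);

if ($body !== false) {
    echo strlen($body) . " bytes received\n";
}
// Anything down here only runs after the request has completely finished.

If you really need to run your own code while a request is in flight, the curl_multi_* functions let you drive one or more handles in a loop and interleave other work between curl_multi_exec() calls.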
I’m using a REST API which, among other things, uses the DELETE method like this:
DELETE /resources/whatever/items/123
To access this using PHP I’m using cURL like this:
self::$curl = curl_init();
curl_setopt_array(self::$curl, array(
CURLOPT_AUTOREFERER => true,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_RETURNTRANSFER => true,
));
As you can see, my cURL instance is static and will be reused for subsequent calls. This works fine when switching between “builtin” request methods. For example, in my get() method, I do something like this:
curl_setopt_array(self::$curl, array(
CURLOPT_HTTPGET => true,
CURLOPT_URL => self::BASE . 'whatever',
));
and then run curl_exec(). By explicitly setting the request method via CURLOPT_HTTPGET, a possible previous CURLOPT_POST will be cleared.
However, setting CURLOPT_CUSTOMREQUEST (for example to DELETE) will override any other builtin request method. That’s fine as long as I want to DELETE things, but calling for example curl_setopt(self::$curl, CURLOPT_HTTPGET, true) will not reset the custom method; DELETE will still be used.
I have tried setting CURLOPT_CUSTOMREQUEST to null, false or the empty string, but this only results in an HTTP request like
/resources/whatever/items/123
i.e. with an empty string as the method, followed by a space, followed by the path.
I know that I could set CURLOPT_CUSTOMREQUEST to GET instead and do GET requests without any problems, but I wonder whether there is a possibility to reset CURLOPT_CUSTOMREQUEST.
This is actually a bug in PHP, since the original documentation states the following:
Restore to the internal default by setting this to NULL.
Unfortunately, as you can see from the source code, the option value gets cast to a string before it's passed to the underlying library.
Solution
I've written a pull request that addresses the issue and allows for NULL to be passed for the CURLOPT_CUSTOMREQUEST option value.
The above patch will take some time to get merged into the project, so until then you would have to explicitly set the method yourself once you start using this option.
Update
The fix has been applied to 5.5.11 and 5.6.0 (beta1).
Set CURLOPT_CUSTOMREQUEST to NULL and CURLOPT_HTTPGET to TRUE to reset back to an ordinary GET.
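With the static handle from the question, and a PHP version that includes the fix mentioned above (5.5.11+ / 5.6.0+), that reset might look like this:

// Clear the sticky custom method, then re-select the builtin GET method.
curl_setopt(self::$curl, CURLOPT_CUSTOMREQUEST, null);
curl_setopt(self::$curl, CURLOPT_HTTPGET, true);
curl_setopt(self::$curl, CURLOPT_URL, self::BASE . 'whatever');
$response = curl_exec(self::$curl);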
(Shrug ...) What I wound up doing in my API is to simply "set it every time": my POST routine sets the custom request method to "POST", and so on. Works great.
I found that once you do set a custom request method, it "sticks": future calls which then attempt to do an ordinary GET, POST or PUT start to fail. I experimented with the suggestions listed earlier in this post (with PHP 7) and didn't very quickly meet with success ... so: "to heck with it, this isn't elegant, but it works."
Given a list of URLs, I would like to check that each URL:
Returns a 200 OK status code
Returns a response within X amount of time
The end goal is a system that is capable of flagging URLs as potentially broken so that an administrator can review them.
The script will be written in PHP and will most likely run on a daily basis via cron.
The script will be processing approximately 1000 URLs at a go.
The question has two parts:
Are there any big-time gotchas with an operation like this? What issues have you run into?
What is the best method for checking the status of a URL in PHP, considering both accuracy and performance?
Use the PHP cURL extension. Unlike fopen(), it can also make HTTP HEAD requests, which are sufficient to check the availability of a URL and save you a ton of bandwidth, since you don't have to download the entire body of the page just to check.
As a starting point you could use some function like this:
function is_available($url, $timeout = 30) {
$ch = curl_init(); // get cURL handle
// set cURL options
$opts = array(CURLOPT_RETURNTRANSFER => true, // do not output to browser
CURLOPT_URL => $url, // set URL
CURLOPT_NOBODY => true, // do a HEAD request only
CURLOPT_TIMEOUT => $timeout); // set timeout
curl_setopt_array($ch, $opts);
curl_exec($ch); // do it!
$retval = curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200; // check if HTTP OK
curl_close($ch); // close handle
return $retval;
}
However, there are a ton of possible optimizations: you might want to re-use the cURL instance and, if you're checking more than one URL per host, even re-use the connection.
Oh, and this code checks strictly for HTTP response code 200. It does not follow redirects (302) -- but there is also a cURL option for that.
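If you want redirected URLs to count as available, one way is to add the redirect options to the $opts array above and keep checking for a final 200 (the CURLOPT_MAXREDIRS value here is just an arbitrary safety limit):

$opts[CURLOPT_FOLLOWLOCATION] = true; // follow 301/302 redirects
$opts[CURLOPT_MAXREDIRS]      = 5;    // arbitrary safety limit
// With CURLOPT_FOLLOWLOCATION set, CURLINFO_HTTP_CODE reports the status of the
// final response in the redirect chain.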
Look into cURL. There's a library for PHP.
There's also an executable version of cURL so you could even write the script in bash.
I actually wrote something in PHP that does this over a database of 5k+ URLs. I used the PEAR class HTTP_Request, which has a method called getResponseCode(). I just iterate over the URLs, send a request to each one, and evaluate the code returned by getResponseCode().
However, it doesn't work for FTP addresses, URLs that don't begin with http or https (unconfirmed, but I believe that's the case), or sites with invalid security certificates (a 0 is returned). Also, a 0 is returned for server-not-found (there's no status code for that).
And it's probably easier than cURL: you include a few files and use a single function to get an integer code back.
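For reference, a rough sketch of that PEAR approach (HTTP_Request is the old PEAR package, long superseded by HTTP_Request2; the method names below are from the classic API and worth double-checking against your installed version):

require_once 'HTTP/Request.php';

function get_status_code($url) {
    $req = new HTTP_Request($url);
    $result = $req->sendRequest();
    if (PEAR::isError($result)) {
        return 0; // matches the "0 for server-not-found" behaviour described above
    }
    return $req->getResponseCode();
}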
fopen() supports HTTP URIs.
If you need more flexibility (such as a timeout), look into the cURL extension.
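A small sketch of the stream-wrapper route, assuming allow_url_fopen is enabled (the URL is a placeholder; 'ignore_errors' keeps fopen() from failing on 4xx/5xx responses so the status line can still be read):

$context = stream_context_create(array(
    'http' => array(
        'method'        => 'HEAD',
        'timeout'       => 10,   // seconds
        'ignore_errors' => true, // don't fail fopen() on 4xx/5xx responses
    ),
));

$fp = @fopen('http://www.example.com/', 'r', false, $context);
if ($fp !== false) {
    fclose($fp);
}

// The http wrapper fills $http_response_header with the raw response headers.
if (isset($http_response_header[0]) &&
    preg_match('/ (\d{3}) /', $http_response_header[0], $m)) {
    echo "Status: {$m[1]}\n";
}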
Seems like it might be a job for curl.
If you're not stuck on PHP, Perl's LWP might be an answer too.
You should also be aware of URLs returning 301 or 302 HTTP responses which redirect to another page. Generally this doesn't mean the link is invalid. For example, http://amazon.com returns 301 and redirects to http://www.amazon.com/.
Just returning a 200 response is not enough; many once-valid links will keep returning 200 after they turn into porn or gambling portals when the former owner fails to renew the domain.
Domain squatters typically ensure that every URL in their domains returns 200.
One potential problem you will undoubtedly run into is when the box this script runs on loses access to the Internet... you'll get 1000 false positives.
It would probably be better for your script to keep some kind of history and only report a failure after 5 days of failures.
Also, the script should be self-checking in some way (like checking a known good web site [google?]) before continuing with the standard checks.
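As a sketch of that self-check, reusing the is_available() helper from the cURL answer above (the choice of Google as the known-good site is arbitrary):

// Bail out early instead of flagging 1000 URLs when our own connectivity is down.
if (!is_available('https://www.google.com/')) {
    exit("Known-good site unreachable; skipping this run.\n");
}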
You only need a bash script to do this. Please check my answer on a similar post here. It is a one-liner that reuses HTTP connections to dramatically improve speed, retries n times for temporary errors and follows redirects.