Clear Splash Browser Cache - php

I'm trying to visualize a website speed analysis. As a headless browser I use Splash 3.2 ... unfortunately I have problems getting a correct HAR file.
The first request looks good, but from the second request onward I only get the requests that were not cached.
I tried to empty the cache with a POST request to the _gc endpoint, unfortunately without success.
My curl requests:
$url = 'http://localhost:8050/render.har?url=' . esc_url( $url ) .'&response_body=1&wait=5&timeout=10';
$cache_url = 'http://localhost:8050/_gc';
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $cache_url);
curl_setopt($curl, CURLOPT_POST, 1);
curl_setopt($curl, CURLOPT_POSTFIELDS,"cached_args_removed=1");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$server_output = curl_exec($curl);
curl_reset($curl);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_REFERER, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$data = curl_exec($curl);
curl_close($curl);
The result of the _gc request:
{"cached_args_removed": 0, "pyobjects_collected": 1269, "status": "ok"}
Afterwards I tried to start Splash with --disable-browser-caches to get a correct output, but Splash does not cache anything and therefore makes many requests to the same files, if they occur several times.
Is there another way to flush the browser cache before rendering, or should I prefer to use another headless browser (recommendation)?

#Tobias
You mentioned you're using Splash version 3.2.
I'm the author of PR 821, which introduced --disable-browser-caches, and according to the changelog, this feature landed in Splash version 3.3.
So please upgrade to Splash version 3.3 and you should be able to use that feature.
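As a side note on the request code in the question, here is a minimal sketch of building the render.har URL with proper query encoding. The helper name splash_har_url is hypothetical; the endpoint and the url, response_body, wait and timeout arguments are the ones already used in the question, and localhost:8050 is assumed as the Splash address. It uses http_build_query on the assumption that the target URL should be percent-encoded as a query value, which esc_url is not designed to do.

```php
<?php
// Hypothetical helper: builds the Splash render.har URL used in the question.
// Assumes Splash listens on localhost:8050, as in the question.
function splash_har_url(string $target, string $splashBase = 'http://localhost:8050'): string
{
    $query = http_build_query([
        'url'           => $target, // percent-encoded by http_build_query
        'response_body' => 1,
        'wait'          => 5,
        'timeout'       => 10,
    ]);
    return rtrim($splashBase, '/') . '/render.har?' . $query;
}

echo splash_har_url('https://example.com/?a=1&b=2'), PHP_EOL;
```

Without the encoding, a target URL that carries its own query string (like the one above) could have its parameters read as arguments to Splash itself.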

Related

Set cookie through curl request

I want to set a cookie through a curl request. I used this code, but the requested URL returns "You must enable Javascript and accept cookies". What am I doing wrong here?
The cookie.txt file has 0644 permissions.
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $base);
curl_setopt($curl, CURLOPT_COOKIEFILE, 'cookie.txt');
curl_setopt($curl, CURLOPT_REFERER, $base);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$str = curl_exec($curl);
curl_close($curl);
The problem is not your code; the page you're trying to access only works by executing JavaScript. cURL doesn't support that: it just downloads the HTML code of the page and doesn't execute any JavaScript.
If you need to retrieve information from a website that requires JavaScript, you need to rely on solutions that provide headless browsers, like Selenium.

Authenticated curl not working in php

I am trying to create a simple PHP script that calls two different REST APIs on two different domains. Both services are HTTPS and require authentication. When I do a curl from the terminal, I get the response in JSON for both domains and everything works beautifully:
curl --user "myuser:mypassword" https://www.example.com/rest/api/2/projects
Notice that it's a GET, not a POST.
The strange thing is that when I try the exact same curl commands from my PHP script neither of them works.
This is what happens:
The first domain returns an empty JSON array with no errors. Just this: []
The second domain returns this error in JSON:
{"errors":[
{
"context":null,
"message":"You are not permitted to access this resource",
"exceptionName":"com.atlassian.stash.exception.AuthorisationException"
}
]}
Here's what's NOT happening:
No SSL certificate errors or warnings
No authentication errors.
Even if I put in a bad username or password, both services act exactly the same way.
To me what's suspicious is that neither domain authenticates my calls, which makes me think there's either a problem with my code or in the PHP curl library.
Here's my code:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $link3);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$encodedAuth = base64_encode($username.":".$password);
curl_setopt($curl, CURLOPT_HTTPHEADER, array("Authentication : Basic ".$encodedAuth));
curl_setopt($curl, CURLOPT_USERPWD, $username.":".$password);
curl_setopt($curl, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION,true);
curl_setopt($curl, CURLINFO_HEADER_OUT, true);
$status_code = curl_getinfo($curl, CURLINFO_HTTP_CODE); //get status code
I know some of it is redundant, but I wanted to try everything and nothing works. Any ideas?
My environment:
OS X Yosemite (10.10.2)
PHP 5.6.6 (I manually upgraded to the latest version as an attempt to make this work)
The current code mixes various approaches and does it in a conflicting way:
the authentication header is named Authorization: and not Authentication:
CURLOPT_HTTPAUTH with CURLAUTH_ANY makes cURL negotiate the authentication scheme with the server, which it shouldn't (the working cURL command line doesn't either)
Just use:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $link3);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_USERPWD, $username.":".$password);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION,true);
# for debugging/non-prod
#curl_setopt($curl, CURLOPT_VERBOSE, true);
#curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
$result = curl_exec($curl);
curl_close($curl);
echo $result;
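If you ever do need to send the header by hand instead of relying on CURLOPT_USERPWD, a minimal sketch could look like this. basic_auth_header is a hypothetical helper name, not part of the cURL extension; the credentials are the example pair from RFC 7617.

```php
<?php
// Hypothetical helper: builds the header that HTTP Basic authentication expects.
function basic_auth_header(string $username, string $password): string
{
    // The header name is "Authorization", with no space before the colon.
    return 'Authorization: Basic ' . base64_encode($username . ':' . $password);
}

// Example credentials from RFC 7617.
echo basic_auth_header('Aladdin', 'open sesame'), PHP_EOL;
// Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```

The result would be passed as array(basic_auth_header($username, $password)) to CURLOPT_HTTPHEADER. Use either this or CURLOPT_USERPWD, not both.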

How can I properly follow all redirects on sites I am trying to scrape with cURL in PHP?

I am using cURL to try to scrape an ASP site that is not on my server, with the following option to automatically follow redirects it comes across:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
but it is not properly following all redirects that the website sends back: it is putting several of the redirect URLs as relative to my server and my PHP script's path, instead of the website's server and the path that the website's pages should be relative to. Is there any way to set the base path or server path in cURL, so my script can properly follow the relative redirects it comes across when scraping through the other website?
For example: If I authenticate on their site and then try to access "https://www.theirserver.com/theirapp/mainForm/securepage.aspx" with my script at "https://www.myserver.com/php/myscript.php", then, under some circumstances, their website tries to redirect back to their login page, but this causes a big problem, because the redirect sends my cURL client to "https://www.myserver.com/php/mainForm/login.aspx", that is, '/mainForm/login.aspx' relative to my script on my server, instead of the correct "https://www.theirserver.com/theirapp/mainForm/login.aspx" relative to the site I am scraping on their server.
I would expect cURL's FOLLOWLOCATION option to properly follow relative redirects based on the "Location:" header of the web pages I am accessing, but it seems that it doesn't and can't. Since this seems to not work, preferably I want a way to tell cURL a base path for the server or for all relative redirects it sees, so I can just use FOLLOWLOCATION. If not, then I need to figure out some code that will do the same thing FOLLOWLOCATION does, but that can let me specify a base path to handle these relative URLs when it comes across them.
I see several similar questions about following relative paths with cURL, but none of the answers have any good suggestions for dealing with this problem, where I don't own the website's server and I don't know every single redirect that might come up. In fact, none of the answers I've seen for similar questions seem to even understand that a person might be trying to scrape an external website and would want any relative redirects they come across while scraping the site to just be relative to that site.
EDIT: Here is the code in question:
$urlLogin = "https://www.theirsite.com/theirApp/MainForm/login.aspx";
$urlSecuredPage = "https://www.theirsite.com/theirApp/ContentPages/content.aspx";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; yie8)");
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
// GET login page
$data=curl_exec($ch);
// Read ASP viewstate and eventvalidation fields
$viewstate = parseExtract($data,$regexViewstate, 1);
$eventval = parseExtract($data, $regexEventVal, 1);
//set POST data
$postData = '__EVENTTARGET='.$eventtarget
.'&__EVENTARGUMENT='.$eventargument
.'&__VIEWSTATE='.$viewstate
.'&__EVENTVALIDATION='.$eventval
.'&'.$nameUsername.'='.$valUsername
.'&'.$namePassword.'='.$valPassword
.'&'.$nameLoginBtn.'='.$valLoginBtn;
// POST authentication
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
$data = curl_exec($ch);
/******************************************************************
GET secure page (This is where a redirect fails... when getting
the secure page, it redirects to /mainForm/login.aspx relative to my
script, instead of /mainForm/login.aspx on their site.
*****************************************************************/
curl_setopt($ch, CURLOPT_POST, FALSE);
curl_setopt($ch, CURLOPT_URL, $urlSecuredPage);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
$data = curl_exec($ch);
echo $data; // Page Not Found
You may be running into redirects that are JavaScript redirects.
To find out what is there:
This will give you additional info.
curl_setopt($ch, CURLOPT_FILETIME, true);
You should set fail on error:
curl_setopt($ch, CURLOPT_FAILONERROR,true);
You may also need to see all the Request and Response headers:
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);
The big thing you are missing is curl_getinfo($ch);
It has info on all the redirects and the headers.
You may want to turn off CURLOPT_FOLLOWLOCATION
and do each request individually. You can get the redirect location from curl_getinfo($ch, CURLINFO_REDIRECT_URL).
Or you can set CURLOPT_MAXREDIRS to the number of successful redirects, then do a separate curl request for the problem redirect location.
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
When you get the response, and there is no curl error, split off the response header:
$data = curl_exec($ch);
if (curl_errno($ch)){
echo 'Retrieve Base Page Error: ' . curl_error($ch);
}
else {
// CURLINFO_HEADER_SIZE is only useful when CURLOPT_HEADER is true
$skip = intval(curl_getinfo($ch, CURLINFO_HEADER_SIZE));
$responseHeader = substr($data,0,$skip);
$data = substr($data,$skip);
$info = curl_getinfo($ch);
$info = var_export($info,true);
echo $responseHeader . $info . $data;
}
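If you end up following redirects by hand and a Location header turns out to be relative, it has to be resolved against the URL that was just requested, not against your own script. A simplified sketch of that resolution follows; resolve_location is a hypothetical helper, and it does not cover every case of the URL resolution rules (for example, it drops a trailing "." segment and ignores user info in the base URL).

```php
<?php
// Hypothetical helper: resolve a Location header value against the URL that
// was requested. Simplified sketch, not a full RFC 3986 implementation.
function resolve_location(string $base, string $location): string
{
    // Already absolute (has a scheme): use as-is.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $location)) {
        return $location;
    }
    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');
    // Scheme-relative: //host/path
    if (strncmp($location, '//', 2) === 0) {
        return $parts['scheme'] . ':' . $location;
    }
    // Absolute path: /path
    if ($location !== '' && $location[0] === '/') {
        return $origin . $location;
    }
    // Relative path: keep query/fragment aside, then merge with the base directory.
    $cut     = strcspn($location, '?#');
    $relPath = substr($location, 0, $cut);
    $suffix  = substr($location, $cut);
    $path    = $parts['path'] ?? '/';
    $dir     = substr($path, 0, strrpos($path, '/') + 1);
    $merged  = $relPath === '' ? $path : $dir . $relPath;
    $segments = [];
    foreach (explode('/', $merged) as $seg) {
        if ($seg === '..') {
            // Never pop the leading empty segment (the root).
            if (count($segments) > 1) { array_pop($segments); }
        } elseif ($seg !== '.') {
            $segments[] = $seg;
        }
    }
    return $origin . implode('/', $segments) . $suffix;
}

echo resolve_location('https://www.theirserver.com/theirapp/mainForm/securepage.aspx', 'login.aspx'), PHP_EOL;
// https://www.theirserver.com/theirapp/mainForm/login.aspx
```

The base URL for each hop can be taken from curl_getinfo($ch, CURLINFO_EFFECTIVE_URL) after the request.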
A better way to scrape a webpage is to use two PHP packages: Guzzle and DomCrawler.
I ran a lot of tests with this combination and came to the conclusion that it is the best choice.
Here, you will find an example for your implementation.
Let me know if you have any problem! ;)

Remote login using curl on cakephp returns blackhole

I have a website on which I need to implement a login form for another website, which is based on CakePHP.
I don't want to change the current security settings on the CakePHP website.
The login is based on the Auth component.
Therefore I've pulled the form from the CakePHP site using cURL, to keep the token fields.
When I log in, I get a blackhole message: 'The request has been black-holed'.
How can I fix that without lowering the security level?
What are the steps to debug this situation?
Thanks
$url = WEB_APP . '/users/login';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_REFERER, '/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$result = curl_exec($ch);
$markup = strip_tags($result, '<form><input>');
echo $markup;
I've checked, and the problem is definitely that the cookie is missing.
I tried to use sendcookie on the cookie file I got from cURL, and it's saved encoded.
Using setrawcookie returns false.
How can I set the cookie correctly?

Making GET requests to the Meetup API in PHP

I am using PHP's built-in cURL library to make GET requests to the Meetup API. This is an example of a query I'm running to view every meetup group within 25 miles of Central Park:
https://api.meetup.com/groups.json/?lat=40.75&lon=-73.98999786376953&order=members&page=200&offset=0&key=MY_API_KEY
This query works correctly when passed to the browser; it returns the expected 200 largest groups.
When I run this in a PHP script, I'm using cURL set with these options:
curl_setopt($cURL, CURLOPT_URL, $groups_url);
curl_setopt($cURL, CURLOPT_HEADER, 1);
curl_setopt($cURL, CURLOPT_RETURNTRANSFER, 1);
$json_string = curl_exec($cURL);
I am hoping to get cURL to execute and return a JSON string that I can parse, but for some reason I do not understand, the result of curl_exec is always NULL. I am not sure why an input that works in the browser will not work in a script; this could just be me being dumb. Thank you for your help in advance.
This is because it's HTTPS (SSL).
So the quick fix is to add this line:
curl_setopt($cURL, CURLOPT_SSL_VERIFYPEER, 0);
Here is an example of it all working:
$cURL = curl_init();
curl_setopt($cURL, CURLOPT_URL, "https://api.meetup.com/groups.json/?lat=40.75&lon=-73.98999786376953&order=members&page=200&offset=0&key=MY_API_KEY");
curl_setopt($cURL, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($cURL, CURLOPT_SSL_VERIFYPEER, 0);
$json_string = curl_exec($cURL);
echo $json_string;
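Note that the working example above no longer sets CURLOPT_HEADER. If you keep that option from the question, the returned string starts with the response headers, and json_decode will return NULL for it. A minimal sketch of separating the two is shown below; decode_json_body is a hypothetical helper, and the response is simulated so the snippet runs without a network call.

```php
<?php
// Hypothetical helper: strips the headers from a response fetched with
// CURLOPT_HEADER enabled, then decodes the remaining body as JSON.
function decode_json_body(string $raw, int $headerSize)
{
    $body = substr($raw, $headerSize);
    return json_decode($body, true); // null if the body is not valid JSON
}

// Simulated response. With a real handle, the header size would come from
// curl_getinfo($cURL, CURLINFO_HEADER_SIZE).
$headers = "HTTP/1.1 200 OK\r\nContent-Type: application/json\r\n\r\n";
$raw     = $headers . '{"results":[]}';

var_dump(json_decode($raw, true));                   // NULL: the headers break the JSON
var_dump(decode_json_body($raw, strlen($headers)));  // array with an empty "results" list
```

Also keep in mind that curl_exec itself returns false on failure, in which case curl_error($cURL) tells you what went wrong.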
