Accessing a web site via web scrape - php

When attempting to web scrape Rubies, I am unable to get past the login. I have absolutely no idea why I am not able to, but here are the cURL options that I am using. If anyone sees a problem, I would greatly appreciate it!
curl_setopt_array($curl, array(
CURLOPT_URL => "https://www.rubies.com/customer/account/loginPost/",
CURLOPT_RETURNTRANSFER => true,
CURLOPT_ENCODING => "",
CURLOPT_MAXREDIRS => 10,
CURLOPT_TIMEOUT => 30,
CURLOPT_HEADER => true,
CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
CURLOPT_POST => 1,
CURLOPT_POSTFIELDS => array('form_key' => "****", "login[username]" => "****", "login[password]" => "****", "persistent_remember_me" => 'on', "send" => ''),
CURLOPT_FOLLOWLOCATION => 1,
CURLOPT_COOKIEFILE => 'cookie.txt',
CURLOPT_COOKIEJAR => 'cookie.txt',
CURLOPT_HTTPHEADER => array(
'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Host: www.rubies.com',
'Content-Type: application/x-www-form-urlencoded',
'Origin: https://www.rubies.com',
'Referer: https://www.rubies.com/customer/account/',
'Connection: keep-alive',
'Cache-Control: no-cache',
'Upgrade-Insecure-Requests: 1'
),
CURLOPT_SSL_VERIFYPEER => false,
CURLOPT_SSL_VERIFYHOST => false,
CURLINFO_HEADER_OUT => true
));
I currently have the form key hard-coded, but I am not sure whether the form key has to change with each login. The response from the POST is empty, but I get redirected twice: once to the account page, then back to the login page. If anyone can tell me what is going on, I would appreciate it. I think they are using some kind of basic auth system.

Use Fiddler2 or another packet sniffer to look at the cURL traffic, both requests and responses. Compare that to the traffic from a browser.
You probably either missed or mistyped a field, or missed follow-up steps like setting cookies and posting additional data.
Code for a login often requires fetching the login page, scraping a one-time token (changes with each page request), then posting as the first step. This might trigger script code to set cookies and/or automatically submit other data.

You are making several mistakes.
You tell the server that your POST body is application/x-www-form-urlencoded, but you give CURLOPT_POSTFIELDS an array, so what cURL actually sends is multipart/form-data. To have cURL send the data as application/x-www-form-urlencoded, pass CURLOPT_POSTFIELDS a URL-encoded string; for arrays, http_build_query will build it for you. Furthermore, when POSTing multipart/form-data or application/x-www-form-urlencoded, don't set the Content-Type header at all: cURL sets it automatically, depending on which encoding was used. On that note, you shouldn't set the User-Agent header manually either; use CURLOPT_USERAGENT. And you should not set the Host header: cURL generates it automatically, and you're more likely than cURL to make a mistake.
Also, you send a fake Referer header. Some websites can detect a fake referer; it's safer to set CURLOPT_AUTOREFERER and make a real request, thus obtaining a real referer. Finally, to actually log in to https://www.rubies.com/customer/account/loginPost/ you need both a cookie session and a form_key. The form_key is probably tied to your cookie session and is probably a form of CSRF token, but you provide no code to acquire either. On top of that, you may need a real referer.
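To make the array-versus-string point concrete, here is a minimal sketch of what http_build_query produces for nested login fields. The field values are placeholders invented for illustration, not real credentials or a real form_key.

```php
<?php
// Placeholder values; a real form_key has to be scraped from the login page.
$fields = array(
    'form_key' => 'abc123',
    'login' => array(
        'username' => 'user@example.com',
        'password' => 'p&ss word',
    ),
);

// An array given to CURLOPT_POSTFIELDS is sent as multipart/form-data.
// A string is sent as application/x-www-form-urlencoded, which is what
// http_build_query() builds, including the login[...] bracket notation.
$body = http_build_query($fields);

echo $body, PHP_EOL;
```

The brackets, the "@" and the "&" inside the password all come out percent-encoded, so the server can split the pairs unambiguously.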
Using hhb_curl from https://github.com/divinity76/hhb_.inc.php/blob/master/hhb_.inc.php ,
here's example code that I think would be able to log in, given a real username/password, making none of the mistakes I listed above:
<?php
declare(strict_types = 1);
require_once ('hhb_.inc.php');
$hc = new hhb_curl ();
$hc->_setComfortableOptions ();
$hc->exec ( 'https://www.rubies.com/customer/account/login/' ); // << getting a referer, form_key (csrf token?), and a session.
$domd = new DOMDocument ();
@$domd->loadHTML ( $hc->getResponseBody () ); // @ suppresses warnings caused by malformed real-world HTML
$csrf = NULL;
// extract the form_key
foreach ( $domd->getElementsByTagName ( "form" ) as $form ) {
if ($form->getAttribute ( "class" ) !== 'form form-login') {
continue;
}
foreach ( $form->getElementsByTagName ( "input" ) as $input ) {
if ($input->getAttribute ( "name" ) !== 'form_key') {
continue;
}
$csrf = $input->getAttribute ( "value" );
break;
}
break;
}
if ($csrf === NULL) {
throw new \RuntimeException ( 'failed to extract the form_key token!' );
}
$hc->setopt_array ( array (
CURLOPT_POST => true,
CURLOPT_POSTFIELDS => http_build_query ( array (
'form_key' => $csrf,
'login' => array (
'username' => '???',
'password' => '???'
),
'persistent_remember_me' => 'on',
'send' => '' // ??
) )
) );
$hc->exec ( 'https://www.rubies.com/customer/account/login/' );
hhb_var_dump ( $hc->getStdErr (), $hc->getResponseBody () );
EDIT: fixed a URL. The original code definitely wouldn't work, but it should now.

Related

Office365 OAuth - Token via CURL

I'm trying to connect to the Azure platform to get a token for accessing an Office365 mailbox.
The following is what I'm using, but I always get a NULL response.
What other CURLOPT_POSTFIELDS need to be included, or what else needs to be changed?
$curl = curl_init();
curl_setopt_array($curl, array(
CURLOPT_URL => "https://login.microsoftonline.com/".$tennantid."/oauth2/v2.0/authorize",
CURLOPT_RETURNTRANSFER => true,
CURLOPT_ENCODING => '',
CURLOPT_MAXREDIRS => 10,
CURLOPT_TIMEOUT => 0,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
CURLOPT_CUSTOMREQUEST => 'POST',
CURLOPT_POSTFIELDS => 'client_id='.$appid.'&response_type=token&scope=https://graph.microsoft.com/User.Read&redirect_uri='.$uri,
CURLOPT_HTTPHEADER => array(
'Content-Type: application/x-www-form-urlencoded'
),
));
$response = curl_exec($curl);
$response_decode = json_decode($response);
var_dump($response_decode);
curl_close($curl);
I currently get the token back ok when I use the following method
$params = array ('client_id' =>$appid,
'redirect_uri' =>$uri,
'response_type' =>'token',
'response_mode' =>'form_post',
'scope' =>'https://graph.microsoft.com/User.Read',
'state' =>$_SESSION['state']);
header ('Location: '.$login_url.'?'.http_build_query ($params));
Which works fine.
But I need to use the cURL method, as this has to run as a background cron job.
What am I missing?
Thanks in advance
A link to the API documentation may have helped. Without that I cannot help much. Just one obvious thing.
You are using Content-Type: application/x-www-form-urlencoded
You are not encoding the data. From MDN:
application/x-www-form-urlencoded: the keys and values are encoded in
key-value tuples separated by '&', with a '=' between the key and the
value. Non-alphanumeric characters in both keys and values are percent
encoded: this is the reason why this type is not suitable to use with
binary data (use multipart/form-data instead)
Source: MDN Post data
This would mean the keys and values are percent-encoded individually, while the '=' and '&' separators stay literal, so your request data should look like this:
client_id=1234&response_type=token&scope=https%3A%2F%2Fgraph.microsoft.com%2FUser.Read&redirect_uri=http%3Aexample.com
Try this (http_build_query encodes each value for you):
$post = http_build_query(array('client_id' => $appid, 'response_type' => 'token', 'scope' => 'https://graph.microsoft.com/User.Read', 'redirect_uri' => 'http:example.com'));
CURLOPT_POSTFIELDS => $post,
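One detail worth checking when encoding the body: percent-encoding applies to each key and value, not to the '=' and '&' separators. The sketch below contrasts the two approaches; the client_id value is a made-up placeholder.

```php
<?php
$params = array(
    'client_id' => '1234', // hypothetical application ID
    'response_type' => 'token',
    'scope' => 'https://graph.microsoft.com/User.Read',
);

// urlencode() on the whole string also escapes '=' and '&', so the server
// sees one opaque token instead of separate key/value pairs.
$wholeString = urlencode('client_id=1234&response_type=token');

// http_build_query() encodes keys and values and leaves separators literal.
$perPair = http_build_query($params);

echo $wholeString, PHP_EOL;
echo $perPair, PHP_EOL;
```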

Curl POST is executed as a GET

I'm trying to develop a kind of Browser with PHP.
So far my class can process a GET or a POST request with this Content Type: application/x-www-form-urlencoded.
Now I need to move to a JSON one. I've set the Content-Type header to application/json.
The fact is, with this type I got the following issue: Setting up a POST request will result in a GET request. This is really weird.
Here is my code:
private function request($url, $reset_cookies, $post_data = null, $custom_headers = null)
{
// Create options
$options = array(
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_HEADER => 0,
CURLINFO_HEADER_OUT => 1,
CURLOPT_FAILONERROR => 1,
CURLOPT_USERAGENT => $this->user_agent,
CURLOPT_CONNECTTIMEOUT => 30,
CURLOPT_TIMEOUT => 30,
CURLOPT_FOLLOWLOCATION => 1,
CURLOPT_MAXREDIRS => 10,
CURLOPT_AUTOREFERER => 1,
CURLOPT_COOKIESESSION => $reset_cookies ? 1 : 0,
CURLOPT_COOKIEJAR => $this->cookies_file,
CURLOPT_COOKIEFILE => $this->cookies_file,
CURLOPT_HTTPHEADER => array('Accept-language: en'),
// SSL
/*
CURLOPT_SSL_CIPHER_LIST => 'TLSv1',
CURLOPT_SSL_VERIFYPEER => 1,
CURLOPT_CAINFO => dirname(__FILE__) . '/Entrust.netCertificationAuthority(2048).crt',
*/
);
// Add headers
if (isset($custom_headers)) $options[CURLOPT_HTTPHEADER] = array_merge($options[CURLOPT_HTTPHEADER], $custom_headers);
// Add POST data
if (isset($post_data))
{
$options[CURLOPT_POST] = 1;
$options[CURLOPT_POSTFIELDS] = is_string($post_data) ? $post_data : http_build_query($post_data);
}
// Attach options
curl_setopt_array($this->curl, $options);
// Execute the request and read the response
$content = curl_exec($this->curl);
print_r($options);
print_r(curl_getinfo($this->curl, CURLINFO_HEADER_OUT));
// Clean local variables
unset($url);
unset($reset_cookies);
unset($post_data);
unset($custom_headers);
unset($options);
// Handle any error
if (curl_errno($this->curl))
{
unset($content);
throw new Exception(curl_error($this->curl));
}
return $content;
}
To illustrate my issue, here is an example:
CUrl options as an Array:
Array
(
[10002] => http://mywebsite.com/post/
[19913] => 1
[42] => 0
[2] => 1
[45] => 1
[10018] => Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20100101 Firefox/10.0.2
[78] => 30
[13] => 30
[52] => 1
[68] => 10
[58] => 1
[96] => 0
[10082] => C:\wamp\www\v2\_libs/../_cookies/14d0fd2b-9f15-4ac5-8fae-4246cc6cef49.cookie
[10031] => C:\wamp\www\v2\_libs/../_cookies/14d0fd2b-9f15-4ac5-8fae-4246cc6cef49.cookie
[10023] => Array
(
[0] => Accept-language: en
[1] => RequestVerificationToken: 4PMxvJsQzFJ5oFt3JdUPe6Bp_geIj4obDJCYIRoU09PrrfcBSUgJT9iB3mXnGFc2KSlYrPcRHF7iHdQhGNu0GKLUzd5FywfaADbGS8wjhXraF36W0
[2] => Content-Type: application/json
)
[47] => 1
[10015] => {"usernameOrFeedId":"manitoba","feed_message_body":"Dummy message goes here"}
)
So the request header seems good to me, but I may be wrong.
And here is the real header sent by CUrl:
GET /post/ HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20100101 Firefox/10.0.2
Host: mywebsite.com
Accept: */*
Referer: http://mywebsite.com/post/
Cookie: ADRUM_BT=R%3a53%7cclientRequestGUID%3a9787a51b-b24d-4400-9d6a-efbd618c74c0%7cbtId%3a18790%7cbtERT%3a44; CSRFToken=o_eoIVji7pWclOsrLaJpZEbOFSBJBm851rHbH0Xqwdzw2tC5j07EAc23mlj-opWowgpj0RkHyiktl1cS6onBqI43afM1; WebSessionId=3aem0m2xpwmvesgphna5gaop; prod=rd101o00000000000000000000ffff0a5a2a74o80; AuthenticateCookie=AAAAAtsQgeb8+UXrJ+wa7CGVJKnqizEAo2bMuFvqvwYMAl1NRaa6z68LBRx9hiHzPBC8tYqiayHID6pHChGXB7VywemwTpGivcRQ3nRlUVuaYQKyxQt21p1mx7OMlLCsRA==; web_lang.prod=fr
Accept-language: en
RequestVerificationToken: 4PMxvJsQzFJ5oFt3JdUPe6Bp_geIj4obDJCYIRoU09PrrfcBSUgJT9iB3mXnGFc2KSlYrPcRHF7iHdQhGNu0GKLUzd5FywfaADbGS8wjhXraF34W0
Content-Type: application/json
As you can see, it's a GET request and the POST data seems to have disappeared.
Am I doing it wrong?
You're following redirects, that means you get a 3xx response code and curl makes a second request to the new URL.
curl will act according to the specific 3xx code, and for some redirects it will change the request method from POST to GET. Enabling VERBOSE will show you whether it does so. The response codes that make curl change method are 301, 302 and 303. It does so because that's how browsers act on those response codes.
libcurl offers an option called CURLOPT_POSTREDIR that you can use to tell curl to not change method for specific HTTP responses. Using that, you can thus have curl send a POST even after redirecting with one of these response codes.
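As a rough sketch of how that option could be set from PHP: the helper name buildPostOptions is invented for this example, and it assumes the cURL extension is loaded. Whether the server actually wants the POST repeated after its redirect is a separate question.

```php
<?php
// Hypothetical helper: builds options for a JSON POST that stays a POST
// across 301, 302 and 303 redirects.
function buildPostOptions(string $url, string $jsonBody): array
{
    return array(
        CURLOPT_URL => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS => 10,
        CURLOPT_POST => true,
        CURLOPT_POSTFIELDS => $jsonBody,
        CURLOPT_HTTPHEADER => array('Content-Type: application/json'),
        // Keep the POST method for all three redirect codes.
        CURLOPT_POSTREDIR => CURL_REDIR_POST_ALL,
        // Print the request/response exchange to STDERR to see what happens.
        CURLOPT_VERBOSE => true,
    );
}

$options = buildPostOptions('http://mywebsite.com/post/', '{"key":"value"}');
// Usage: curl_setopt_array($ch, $options); $response = curl_exec($ch);
```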
CURLOPT_FOLLOWLOCATION seems to be the cause, as shown by this header in the second request:
Referer: http://mywebsite.com/post/
It looks like the server is doing a Post/Redirect/Get (PRG):
http://en.wikipedia.org/wiki/Post/Redirect/Get
Disable redirect following by setting CURLOPT_FOLLOWLOCATION to false, and remove CURLOPT_MAXREDIRS from your code:
CURLOPT_FOLLOWLOCATION => false,
// CURLOPT_MAXREDIRS => 10,

POST using cURL and x-www-form-urlencoded in PHP returning Access Denied

I have been able to use the Advanced Rest Client extension for Chrome to send POST queries to a specific HTTPS server, and I get Status Code: 200 - OK with the same body fields as the ones I use in this code. But when I run the following code I get this response: 403 - Access Denied.
<?php
$postData = array(
'type' => 'credentials',
'id' => 'exampleid',
'secret_key' => 'gsdDe32dKa'
);
// Setup cURL
$ch = curl_init('https://www.mywebsite.com/oauth/token');
curl_setopt_array($ch, array(
CURLOPT_POST => TRUE,
CURLOPT_RETURNTRANSFER => TRUE,
CURLOPT_HTTPHEADER => array('Content-Type: application/x-www-form-urlencoded'
),
CURLOPT_POSTFIELDS => json_encode($postData)
));
// Send the request
$response = curl_exec($ch);
var_dump($response);
// Check for errors
if($response === FALSE){
die(curl_error($ch));
}
// Decode the response
$responseData = json_decode($response, TRUE);
// Print the date from the response
echo $responseData['published'];
?>
I've also noticed that when I use the Advanced Rest Client extension for Chrome and set the Content-Type to application/json, I am asked for a login and a password. I don't know what those are: even if I enter the id and secret key from the code, it returns 401 Unauthorized. So I'm guessing the code I wrote is not actually sending the body as application/x-www-form-urlencoded, but I'm not sure. Thank you for any help on this issue!
Can you try it like this and see if it helps:
curl_setopt_array($ch, array(
CURLOPT_POST => TRUE,
CURLOPT_RETURNTRANSFER => TRUE,
CURLOPT_COOKIEFILE => 'cookie.txt',
CURLOPT_COOKIEJAR => 'cookie.txt',
CURLOPT_USERPWD => 'username:password', //Your credentials goes here
CURLOPT_HTTPHEADER => array('Content-Type: application/x-www-form-urlencoded'),
CURLOPT_POSTFIELDS => http_build_query($postData),
));
I guess the site expects basic authentication on top of the secret_key that you already provided.
The site may also send a cookie, so just in case it is a good idea to store it and reuse it in subsequent cURL calls.
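For reference, with cURL's default authentication scheme (Basic), CURLOPT_USERPWD results in an Authorization header built from the base64-encoded "user:password" pair. A small sketch of that header, using the same placeholder credentials as above:

```php
<?php
// Placeholder credentials, as in the options shown above.
$credentials = 'username:password';

// This is the header that Basic authentication puts on the wire.
$header = 'Authorization: Basic ' . base64_encode($credentials);

echo $header, PHP_EOL;
```

Comparing this header against what the REST client sends may help confirm whether the 401/403 is an authentication problem rather than a body-encoding one.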

PHP script to automate login and form submit

I have an external site which requires me to
a. login
b. post form (with 2-3 dynamic parameters)
I need a PHP script to automate this behavior, i.e. the script should first log in with a username/password, then navigate to the URL and submit the form (using dynamic parameters).
How can I do the same using PHP?
I recommend using this class:
http://semlabs.co.uk/journal/object-oriented-curl-class-with-multi-threading
It will be something like this:
// Create the request object (CURLRequest comes from the class linked above)
$c = new CURLRequest();
$c->retry = 2;
$url = 'https://secure.login.co.uk/';
$opts = array(
CURLOPT_USERAGENT => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
CURLOPT_COOKIEFILE => 'anc.tmp',
CURLOPT_COOKIEJAR => 'anc.tmp',
CURLOPT_FOLLOWLOCATION => 1,
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_SSL_VERIFYHOST => 0,
CURLOPT_SSL_VERIFYPEER => 0,
CURLOPT_TIMEOUT => 120
);
$opts[CURLOPT_POSTFIELDS] = 'username=user&password=pass&submit=1';
$request = $c->get( $url, $opts );
N.B. Some sites require you to download the login page first to set a cookie.
Also, you need to urlencode() special characters in the post fields.
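For comparison, the same login-then-submit flow can be sketched with PHP's built-in cURL functions and no extra class. The function names, URLs and field names are assumptions made up for this illustration; a real site may also require a scraped token in the login form.

```php
<?php
// Hypothetical helper: POST url-encoded fields on an existing handle.
function postForm($ch, string $url, array $fields): string
{
    curl_setopt_array($ch, array(
        CURLOPT_URL => $url,
        CURLOPT_POST => true,
        // http_build_query() url-encodes special characters in the values.
        CURLOPT_POSTFIELDS => http_build_query($fields),
    ));
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException(curl_error($ch));
    }
    return $body;
}

// Hypothetical flow: fetch the login page, log in, then submit the form.
function loginAndSubmit(string $loginUrl, array $credentials, string $formUrl, array $formFields): string
{
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE => '', // empty string enables in-memory cookies
        CURLOPT_TIMEOUT => 30,
    ));

    // Step 1: GET the login page so the site can set its session cookie.
    curl_setopt($ch, CURLOPT_URL, $loginUrl);
    if (curl_exec($ch) === false) {
        throw new RuntimeException(curl_error($ch));
    }

    // Step 2: log in. The handle keeps the cookies for the next request.
    postForm($ch, $loginUrl, $credentials);

    // Step 3: submit the form with the dynamic parameters.
    return postForm($ch, $formUrl, $formFields);
}

// Usage (placeholder URLs and fields):
// echo loginAndSubmit('https://example.com/login', array('username' => 'user', 'password' => 'pass'),
//     'https://example.com/form', array('param1' => 'a', 'param2' => 'b'));
```

Reusing one handle keeps the session cookies between the three requests, which is the part the cookie file options handle in the class-based version.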

PHP cURL error: "Empty reply from server"

I have a class function to interface with the RESTful API for Last.FM - its purpose is to grab the most recent tracks for my user. Here it is:
private static $base_url = 'http://ws.audioscrobbler.com/2.0/';
public static function getTopTracks($options = array())
{
$options = array_merge(array(
'user' => 'bachya',
'period' => NULL,
'api_key' => 'xxxxx...', // obfuscated, obviously
), $options);
$options['method'] = 'user.getTopTracks';
// Initialize cURL request and set parameters
$ch = curl_init();
curl_setopt_array($ch, array(
CURLOPT_URL => self::$base_url,
CURLOPT_POST => TRUE,
CURLOPT_POSTFIELDS => $options,
CURLOPT_RETURNTRANSFER => TRUE,
CURLOPT_TIMEOUT => 30,
CURLOPT_USERAGENT => 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)'
));
$results = curl_exec($ch);
return $results;
}
This returns "Empty reply from server". I know that some have suggested that this error comes from some fault in network infrastructure; I do not believe this to be true in my case. If I run a cURL request through the command line, I get my data; the Last.FM service is up and accessible.
Before I go to those folks and see if anything has changed, I wanted to check with you fine folks and see if there's some issue in my code that would be causing this.
Thanks!
ANSWER: @Jan Kuboschek helped me stumble onto what is (maybe) going on here. When CURLOPT_POSTFIELDS is given an associative array, cURL sends the body as multipart/form-data, a content type that may not work with certain RESTful services. A smarter solution is to manually create a URL-encoded version of that data and pass that as the CURLOPT_POSTFIELDS.
For more info, check out: http://www.brandonchecketts.com/archives/array-versus-string-in-curlopt_postfields
A common issue is spaces in the URL: at the beginning, in the middle, or trailing. Did you check for that?
Edit - per comments below, spacing is not the issue.
I ran your code and had the same problem - no output whatsoever. I tried the URL and with a GET request, the server talks to me. I would do the following:
Use the following as $base_url: $base_url = 'http://ws.audioscrobbler.com/2.0/?user=bachya&period=&api_key=xxx&method=user.getTopTracks';
Remove the post fields from your request.
Edit
I moved your code out of the class since I didn't have the rest and modified it. The following code runs perfect for me. If these changes don't work for you, I suggest that your error is in a different function.
<?php
function getTopTracks()
{
$base_url = 'http://ws.audioscrobbler.com/2.0/?user=bachya&period=&api_key=8066d2ebfbf1e1a8d1c32c84cf65c91c&method=user.getTopTracks';
$options = array_merge(array(
'user' => 'bachya',
'period' => NULL,
'api_key' => 'xxxxx...', // obfuscated, obviously
));
$options['method'] = 'user.getTopTracks';
// Initialize cURL request and set parameters
$ch = curl_init($base_url);
curl_setopt_array($ch, array(
CURLOPT_URL => $base_url,
CURLOPT_RETURNTRANSFER => TRUE,
CURLOPT_TIMEOUT => 30,
CURLOPT_USERAGENT => 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)'
));
$results = curl_exec($ch);
return $results;
}
echo getTopTracks();
?>
The server received your request, but sent an empty response. Check the result of curl_getinfo($ch, CURLINFO_HTTP_CODE) to find out if the server responded with an HTTP error code.
Update: Ok so the server responds with the 100 Continue HTTP status code. In that case, this should solve your problem:
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Expect:'));
I found this here: PHP and cURL: Disabling 100-continue header. Hope it works!
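One thing to watch when applying this: CURLOPT_HTTPHEADER replaces the whole custom header list, so setting only 'Expect:' drops any headers set earlier. A small sketch of merging it into an existing list follows; the helper name withoutExpect is invented for this example.

```php
<?php
// Hypothetical helper: returns the header list with any existing Expect
// header removed and an empty 'Expect:' appended, which tells cURL not to
// send "Expect: 100-continue".
function withoutExpect(array $headers): array
{
    $kept = array_values(array_filter($headers, function ($header) {
        return stripos($header, 'Expect:') !== 0;
    }));
    $kept[] = 'Expect:';
    return $kept;
}

$headers = withoutExpect(array('Accept-language: en', 'Expect: 100-continue'));
// Usage: curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
print_r($headers);
```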
I came across the same issue. My HTTP code was 200 but the response was empty. In my experience there can be several reasons for this.
-- Your headers might be incorrect:
CURLOPT_HTTPHEADER => array('Content-Type:application/json', 'Expect:')
-- You might need to send the data as POST fields in cURL rather than attached to the URL like url?p1=a1&p2=a2:
$data = array('p1' => 'a1', 'p2' => 'a2');
CURLOPT_POSTFIELDS => $data
So your options array would be similar to the below
array(
CURLOPT_URL => $url,
CURLOPT_FAILONERROR => TRUE, // FALSE if in debug mode
CURLOPT_RETURNTRANSFER => TRUE,
CURLOPT_TIMEOUT => 4,
CURLOPT_HTTPHEADER => array('Content-Type:application/json', 'Expect:'),
CURLOPT_POST => TRUE,
CURLOPT_POSTFIELDS => $data,
);
According to Last.FM API documentation you should use GET method instead of POST to pass parameters. When I've changed POST to GET I've received the answer about incorrect key.
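A GET version can reuse the options array from the question by turning it into a query string. This is a sketch only; the api_key value is a placeholder.

```php
<?php
$base_url = 'http://ws.audioscrobbler.com/2.0/';
$options = array(
    'method' => 'user.getTopTracks',
    'user' => 'bachya',
    'period' => null,      // null values are skipped by http_build_query()
    'api_key' => 'xxxxx',  // placeholder, not a real key
);

// Append the parameters to the URL instead of sending them as POST fields.
$url = $base_url . '?' . http_build_query($options);

echo $url, PHP_EOL;
// Usage: $ch = curl_init($url); with CURLOPT_RETURNTRANSFER and no CURLOPT_POST.
```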
And here's the code to get album info from Last.FM, even if it returns an error:
The Function:
function getAlbum($xml,$artist,$album)
{
$base_url = $xml;
$options = array_merge(array(
'user' => 'YOUR_USERNAME',
'artist'=>$artist,
'album'=>$album,
'period' => NULL,
'api_key' => 'xYxOxUxRxxAxPxIxxKxExYxx',
));
$options['method'] = 'album.getinfo';
// Initialize cURL request and set parameters
$ch = curl_init($base_url);
curl_setopt_array($ch, array(
CURLOPT_URL => 'http://ws.audioscrobbler.com/2.0/',
CURLOPT_POST => TRUE,
CURLOPT_POSTFIELDS => $options,
CURLOPT_RETURNTRANSFER => TRUE,
CURLOPT_TIMEOUT => 30,
CURLOPT_HTTPHEADER => array( 'Expect:' ) ,
CURLOPT_USERAGENT => 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)'
));
$results = curl_exec($ch);
unset ($options);
return $results;
}
Usage:
// Get the XML
$xml_error = getAlbum($xml,$artist,$album);
// Show XML error
if (preg_match("/error/i", $xml_error)) {
echo " <strong>ERROR:</strong> ".trim(strip_tags($xml_error));
}
