I'm trying to find a way to log into an Amazon Seller Central account via PHP. I found this script
https://github.com/mindevolution/amazonSellerCentralLogin
which in theory should work, but I'm redirected to the login page every time I run it.
I also tried PhantomJS + CasperJS without any luck. The first problem with that approach is that I need to disable two-factor authentication, and the second is that I'm getting captchas which I can't solve via code.
Here is the CasperJS code I tried:
var urlBeforeLoggedIn = "https://sellercentral.amazon.com/gp/homepage.html";
var urlAfterLoggedIn = "https://sellercentral.amazon.com/";

var casper = require('casper').create({
    pageSettings: {
        loadImages: false,
        loadPlugins: false,
        userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36'
    }
});

casper.start(urlBeforeLoggedIn);

casper.waitForSelector('form[name="signIn"]', function() {
    casper.fillSelectors('form[name="signIn"]', {
        'input[name="email"]': 'some_username',
        'input[name="password"]': 'some_password'
    }, true);
});

casper.waitForUrl(urlAfterLoggedIn, function() {
    this.viewport(3000, 1080);
    this.capture('./testscreenshot.png', {top: 0, left: 0, width: 3000, height: 1080});
});

casper.run();
Not an answer, but too long to post as a comment.
Do not parse HTML with regex; use a proper HTML parser instead, like DOMDocument & DOMXPath. I don't have an account to test with, but this should get you past the first login page, given a correct email & password:
<?php
declare(strict_types=1);
header("Content-Type: text/plain;charset=utf-8");
$email = "em@ail.com";
$password = "passw0rd";
$ch = curl_init();
curl_setopt_array($ch, array(
    CURLOPT_AUTOREFERER => true,
    CURLOPT_BINARYTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_SSL_VERIFYPEER => false,
    CURLOPT_CONNECTTIMEOUT => 4,
    CURLOPT_TIMEOUT => 8,
    CURLOPT_COOKIEFILE => "", // << makes curl save/load cookies across requests (in memory)
    CURLOPT_ENCODING => "", // << makes curl advertise all supported encodings (gzip/deflate/etc.) and decompress automatically, which makes transfers faster
    CURLOPT_USERAGENT => 'whatever; curl/' . (curl_version()['version']) . ' (' . (curl_version()['host']) . '); php/' . PHP_VERSION,
    CURLOPT_RETURNTRANSFER => 1,
    CURLOPT_URL => 'https://sellercentral.amazon.com/gp/homepage.html',
));
$html = curl_exec($ch);
//var_dump($html) & die();
$domd = @DOMDocument::loadHTML($html);
$xp = new DOMXPath($domd);
$form = $xp->query("//form[@name='signIn']")->item(0);
$inputs = [];
foreach ($form->getElementsByTagName("input") as $input) {
    $name = $input->getAttribute("name");
    if (empty($name) && $name !== "0") {
        continue;
    }
    $inputs[$name] = $input->getAttribute("value");
}
assert(isset($inputs['email'], $inputs['password'],
    $inputs['appActionToken'], $inputs['workflowState'],
    $inputs['rememberMe']), "missing form inputs!");
$inputs["email"] = $email;
$inputs["password"] = $password;
$inputs["rememberMe"] = "false";
$login_url = $form->getAttribute("action");
var_dump($inputs, $login_url);
curl_setopt_array($ch, array(
    CURLOPT_URL => $login_url,
    CURLOPT_POST => 1,
    CURLOPT_POSTFIELDS => http_build_query($inputs)
));
$html = curl_exec($ch);
$domd = @DOMDocument::loadHTML($html);
$xp = new DOMXPath($domd);
$loginErrors = [];
// warning-message-box is also used for login *errors*; the naming is misleading.
foreach ($xp->query("//*[contains(@id,'error-message-box')]|//*[contains(@id,'warning-message-box')]") as $loginError) {
    $loginErrors[] = preg_replace("/\s+/", " ", trim($loginError->textContent));
}
if (!empty($loginErrors)) {
    echo "login errors: ";
    var_dump($loginErrors);
    die();
}
//var_dump($html);
echo "login successful!";
The important takeaway here is:
$domd = @DOMDocument::loadHTML($html);
$xp = new DOMXPath($domd);
$form = $xp->query("//form[@name='signIn']")->item(0);
$inputs = [];
foreach ($form->getElementsByTagName("input") as $input) {
    $name = $input->getAttribute("name");
    if (empty($name) && $name !== "0") {
        continue;
    }
    $inputs[$name] = $input->getAttribute("value");
}
That's how most website login pages can be parsed for login info.
"I'm getting captchas which I can't solve via code"
The Death By Captcha API to the rescue: http://www.deathbycaptcha.com/user/api
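If a captcha does turn up mid-flow, the solving step can be spliced into the cURL script above before the credentials are posted. This is only a hypothetical sketch: solveCaptcha() stands in for whatever client the solving service provides, and both the image selector and the 'guess' field name are assumptions about Amazon's markup, so verify them against the actual page.
// Hypothetical sketch - solveCaptcha() is a placeholder for the solving
// service's client call; the selector and the 'guess' field name are
// assumptions about Amazon's markup, not confirmed values.
$captchaImg = $xp->query("//form[@name='signIn']//img[contains(@src,'captcha')]")->item(0);
if ($captchaImg !== null) {
    // fetch the captcha image with the same handle so session cookies are reused
    curl_setopt_array($ch, array(
        CURLOPT_URL => $captchaImg->getAttribute("src"),
        CURLOPT_HTTPGET => true,
    ));
    $imageBytes = curl_exec($ch);
    $inputs['guess'] = solveCaptcha($imageBytes); // returns the solved text
}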
Related
I need to log in to http://auto.vsk.ru/login.aspx by making a POST request to it from my site.
I wrote a JS AJAX function that sends a POST request to a PHP script on my server, which in turn sends the cross-domain request via cURL.
post.php
<?php
function request($url, $post, $cook)
{
    $ch = curl_init();
    $curlConfig = array(
        CURLOPT_URL => $url,
        CURLOPT_POST => 1,
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_COOKIEFILE => $cook,
        CURLOPT_COOKIEJAR => $cook,
        CURLOPT_USERAGENT => '"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; Trident/7.0; Touch; .NET4.0C; .NET4.0E; Tablet PC 2.0)"',
        CURLOPT_FOLLOWLOCATION => 1,
        CURLOPT_REFERER => $url,
        CURLOPT_POSTFIELDS => $post,
        CURLOPT_HEADER => 1,
    );
    curl_setopt_array($ch, $curlConfig);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
$result = request($_POST['url'], $_POST['data'], $_POST['cook']);
if ($result === FALSE)
    echo('error');
else
    echo($result);
?>
JS code (note: the original had a typo, calling requestsp instead of requestp):
function postcross(path, data, cook, run)
{
    requestp('post.php', 'url=' + path + '&data=' + data + '&cook=' + cook, run);
}
function requestp(path, data, run)
{
    var http = new XMLHttpRequest();
    http.open('POST', path, true);
    http.setRequestHeader('Content-type', 'application/x-www-form-urlencoded');
    http.onreadystatechange = function()
    {
        if (http.readyState == 4 && http.status == 200)
        {
            run(http);
        }
    }
    http.send(data);
}
postcross('http://auto.vsk.ru/login.aspx', encodeURIComponent('loginandpassword'), 'vskcookies.txt', function(e) {
    document.getElementById('container').innerText = e.responseText;
});
The HTML page I get in the response says two things:
My browser is not Internet Explorer and I should switch to it (actually the site works from Google Chrome; at least I can log in).
My browser doesn't support cookies.
About the cookies, it is very similar to this (veeeery long) question. The file vskcookies.txt is created on my server, is actually updated after each POST request, and stores cookies.
About IE: at first I thought the site checks the browser from JS, but that's wrong, because no JS runs at all - I only read the HTML page as plain text, and it already contains that notification about IE.
So I wondered: what if I'm making the cURL request wrong? I wrote a new PHP script that shows request headers; here is the source:
head.php
<?php
foreach (getallheaders() as $name => $value)
{
    echo "$name: $value\n";
}
?>
The result of postcross('http://mysite/head.php', encodeURIComponent('loginandpassword'), 'vskcookies.txt', function(e){ document.getElementById('container').innerText = e.responseText; }):
Host: my site
User-Agent: "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; Trident/7.0; Touch; .NET4.0C; .NET4.0E; Tablet PC 2.0)"
Accept: */*
Content-Type: application/x-www-form-urlencoded
Referer: mysite/head
X-1gb-Client-Ip: my ip
X-Forwarded-For: ip, ip, ip
X-Forwarded-Port: 443
X-Forwarded-Proto: https
X-Port: 443
Accept-Encoding: gzip
X-Forwarded-URI: /head
X-Forwarded-Request: POST /head HTTP/1.1
X-Forwarded-Host: my site
X-Forwarded-Server: my site
Content-Length: 823
Connection: close
For some reason there is no Cookie: header, but the user agent is IE as I mentioned.
I also tried replacing the head.php source with
print_r($_COOKIE);
and got an empty array.
Am I doing something wrong, or is it the site's bot protection?
Update 1
It shows cookies only if I pass them through CURLOPT_COOKIE.
So I think I will leave CURLOPT_COOKIEFILE => $cook as it is, and for CURLOPT_COOKIE pass something like file_get_contents($cook), although that includes useless bookkeeping information.
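Note that CURLOPT_COOKIE expects a plain "name=value; name2=value2" header string, while CURLOPT_COOKIEJAR writes the Netscape cookie-file format with extra tab-separated columns - that is the "useless information" above. A rough sketch of stripping the jar down to a header string, assuming the standard seven-column jar layout:
<?php
// Sketch: convert a Netscape-format cookie jar (what CURLOPT_COOKIEJAR
// writes) into the "name=value; name2=value2" string CURLOPT_COOKIE expects.
// Assumes the standard 7-column layout; #HttpOnly_-prefixed lines are skipped here.
function jarToCookieHeader($jarFile)
{
    $pairs = array();
    foreach (file($jarFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        if ($line[0] === '#') { // comment lines (and #HttpOnly_ entries)
            continue;
        }
        $cols = explode("\t", $line); // domain, flag, path, secure, expiry, name, value
        if (count($cols) === 7) {
            $pairs[] = $cols[5] . '=' . $cols[6];
        }
    }
    return implode('; ', $pairs);
}
// usage: CURLOPT_COOKIE => jarToCookieHeader($cook)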
Important Update 2
Okay, probably I'm just being stupid. The response HTML page does indeed contain messages about IE and disabled cookies, but they are in divs that are display:none and are only shown by JS.
So it seems my attempts fail for other reasons.
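One way to pin down the real reason is to have cURL record the exact request headers it sent, rather than testing against a second script; a minimal sketch using CURLINFO_HEADER_OUT:
<?php
// Sketch: capture the outgoing request headers curl actually sent,
// so the User-Agent / Cookie headers can be inspected directly.
$ch = curl_init('http://auto.vsk.ru/login.aspx');
curl_setopt_array($ch, array(
    CURLINFO_HEADER_OUT    => true, // keep a copy of the request headers
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEFILE     => 'vskcookies.txt',
    CURLOPT_COOKIEJAR      => 'vskcookies.txt',
));
curl_exec($ch);
echo curl_getinfo($ch, CURLINFO_HEADER_OUT); // the request exactly as sent
curl_close($ch);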
I am trying to fetch prices from Play and Amazon for a personal project, but I have two problems.
Firstly, I have got Play to work, but it fetches the wrong price; secondly, Amazon doesn't fetch any results.
Here is the code I have been trying to get working:
$playdotcom = file_get_contents('http://www.play.com/Search.html?searchstring=".$getdata[getlist_item]."&searchsource=0&searchtype=r2alldvd');
$amazoncouk = file_get_contents('http://www.amazon.co.uk/gp/search?search-alias=dvd&keywords=".$getdata[getlist_item]."');
preg_match('#<span class="price">(.*)</span>#', $playdotcom, $pmatch);
$newpricep = $pmatch[1];
preg_match('#used</a> from <strong>(.*)</strong>#', $playdotcom, $pmatch);
$usedpricep = $pmatch[1];
preg_match('#<span class="bld lrg red"> (.*)</span>#', $amazoncouk, $amatch);
$newpricea = $amatch[1];
preg_match('#<span class="price bld">(.*)</span> used#', $amazoncouk, $amatch);
$usedpricea = $amatch[1];
then echo the results:
echo "Play :: New: $newpricep - Used: $usedpricep";
echo "Amazon :: New: $newpricea - Used: $usedpricea";
Just so you know what's going on:
$getdata[getlist_item] = "American Pie 5: The Naked Mile";
which is working fine.
Any idea why these aren't working correctly?
EDIT: I have just realised that $getdata[getlist_item] in the file_get_contents call is not using the variable, just printing it as-is... why is it doing that?
The quotes you are using aren't consistent! Both your opening and closing quotes need to be the same.
Try this:
$playdotcom = file_get_contents("http://www.play.com/Search.html?searchstring=".$getdata['getlist_item']."&searchsource=0&searchtype=r2alldvd");
$amazoncouk = file_get_contents("http://www.amazon.co.uk/gp/search?search-alias=dvd&keywords=".$getdata['getlist_item']);
As it was, ".$getdata[getlist_item]." was considered part of the string, since you never closed the single-quoted string you started.
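A minimal illustration of the quoting difference:
<?php
$item = 'American Pie 5';
echo 'keywords=".$item."' . "\n"; // single quotes: the text ".$item." is printed literally
echo "keywords=" . $item . "\n";  // concatenation: prints keywords=American Pie 5
echo "keywords=$item\n";          // double quotes interpolate: keywords=American Pie 5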
Use a cURL function with the correct headers. The code below will fetch the pages; then use a proper parser (DOMDocument or the Simple HTML DOM tool) to read the price from the HTML content.
$playdotcom = getPage("http://www.play.com/Search.html?searchstring=".$getdata['getlist_item']."&searchsource=0&searchtype=r2alldvd");
$amazoncouk = getPage("http://www.amazon.co.uk/gp/search?search-alias=dvd&keywords=".$getdata['getlist_item']);

function getPage($url) {
    $user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';
    $options = array(
        CURLOPT_CUSTOMREQUEST => "GET",
        CURLOPT_POST => false,
        CURLOPT_USERAGENT => $user_agent,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER => false,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_ENCODING => 'gzip',
        CURLOPT_AUTOREFERER => true,
        CURLOPT_CONNECTTIMEOUT => 30, // seconds (these options take seconds, not milliseconds)
        CURLOPT_TIMEOUT => 30,        // seconds
        CURLOPT_MAXREDIRS => 10,
    );
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $content = curl_exec($ch);
    curl_close($ch);
    return $content;
}
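For the parsing step, here is a rough DOMDocument/DOMXPath sketch; the span class name is taken from the question's regexes and is an assumption that may no longer match what the sites actually serve:
<?php
// Sketch: extract the first <span class="price"> from the fetched HTML.
// The class name comes from the question's regex and is an assumption.
$html = getPage("http://www.play.com/Search.html?searchstring=" . urlencode($getdata['getlist_item']) . "&searchsource=0&searchtype=r2alldvd");
$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings from real-world markup
$xp = new DOMXPath($dom);
$node = $xp->query("//span[@class='price']")->item(0);
$newpricep = ($node !== null) ? trim($node->textContent) : 'not found';
echo "Play :: New: $newpricep";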
I am trying to scrape product data by product section from a Zen-cart store using Simple HTML DOM. I can scrape data from the first page fine but when I try to load the 'next' page of products the site returns the index.php landing page.
If I use the function directly with http://URLxxxxxxxxxx.com/index.php?main_page=index&cPath=36&sort=20a&page=2 it scrapes the product information from page 2 fine.
The same thing occurs if I use cURL.
getPrices('http://URLxxxxxxxxxx.com/index.php?main_page=index&cPath=36');

function getPrices($sectionURL) {
    $opts = array('http' => array(
        'method' => "GET",
        'header' => "Accept-language: en\r\n" .
                    "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6\r\n" .
                    "Cookie: zenid=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\r\n"
    ));
    $context = stream_context_create($opts);
    $html = file_get_contents($sectionURL, false, $context);
    $dom = new simple_html_dom();
    $dom->load($html);
    // Do cool stuff here with information from page: product name, image, price and more-info URL
    if ($nextPage = $dom->find('a[title= Next Page ]', 0)) {
        $nextPageURL = $nextPage->href;
        echo $nextPageURL;
        $dom->clear();
        unset($dom);
        getPrices($nextPageURL);
    } else {
        echo "\nNo more pages to scrape!!";
        $dom->clear();
        unset($dom);
    }
}
Any ideas on how to fix this problem?
I see lots of potential culprits. You're not keeping track of cookies or setting the referer, and there's a good chance simple_html_dom is letting you down.
My recommendation is to proxy your requests through Fiddler or Charles and make sure they look the way they do coming from a browser.
Turned out the next-page URLs being passed to the function in the loop contained &amp; instead of &, and file_get_contents didn't like it.
$sectionURL = str_replace("&amp;", "&", urldecode(trim($sectionURL)));
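Since &amp; is ordinary HTML entity encoding (hrefs scraped from markup come back entity-encoded), html_entity_decode() handles this and any other entities more generally than a one-off str_replace(); a sketch for the loop above:
<?php
// Decode every HTML entity in the scraped href, not just &amp;
$nextPageURL = html_entity_decode(trim($nextPage->href), ENT_QUOTES);
getPrices($nextPageURL);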
I am taking part in a beauty competition, and I require that I be nominated.
The nomination form requires my details and my nominators' details.
My nominators may have a problem switching between my email containing my details and the nomination form, and this may discourage them from filling in the form at all.
The solution I came up with is to create an HTML page (which I have 100% control over) that already contains my pre-filled details, so that the nominators don't get confused filling in my details; all I have to do is ask them for their own details.
Now I want my HTML form to pass the details on to another website (the competition organiser's website) and have that form automatically filled in, so that all the nominators have to do is click submit on the competition's website. I have absolutely no control over the competition's website, so I cannot add or change any programming code there.
How can I pass the data from my own HTML page (100% under my control) to a third-party PHP page?
Any coding examples are appreciated.
Thank you xx
The same-origin policy makes this impossible unless the competition organiser were to grant you permission using CORS (in which case you could load their site in a frame and use JavaScript to manipulate its DOM … in supporting browsers).
The form they are using submits the form data to a mailing script which is secured by checking the referer (at least). You could use something like cURL in PHP to spoof the referer like this (not tested):
function get_web_page($url, $curl_data)
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,     // return web page
        CURLOPT_HEADER => false,            // don't return headers
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_ENCODING => "",             // handle all encodings
        CURLOPT_USERAGENT => "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)", // who am i
        CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
        CURLOPT_TIMEOUT => 120,             // timeout on response
        CURLOPT_MAXREDIRS => 10,            // stop after 10 redirects
        CURLOPT_POST => 1,                  // i am sending post data
        CURLOPT_POSTFIELDS => $curl_data,   // these are my post vars
        CURLOPT_SSL_VERIFYHOST => 0,        // don't verify ssl
        CURLOPT_SSL_VERIFYPEER => false,    //
        CURLOPT_REFERER => "http://fashionawards.com.mt/nominationform.php", // spoofed referer
        CURLOPT_VERBOSE => 1                //
    );
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $content = curl_exec($ch);
    $err = curl_errno($ch);
    $errmsg = curl_error($ch);
    $header = curl_getinfo($ch);
    curl_close($ch);
    $header['errno'] = $err;
    $header['errmsg'] = $errmsg;
    $header['content'] = $content;
    return $header;
}

$curl_data = "nameandsurname_nominator=XXXX&id_nominator=XXX.....etc....";
$url = "http://www.logix.com.mt/cgi-bin/FormMail.pl";
$response = get_web_page($url, $curl_data);
print '<pre>';
print_r($response);
print '</pre>';
In the line that reads $curl_data = "nameandsurname_nominator=XXXX&id_nominator=XXX.....etc...."; you can set the POST variables according to their names in the original form.
Thus you could make your own form submit to their mailing script & have some of the fields populated with what you need, as in the sketch below...
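For the pre-filled page itself, something along these lines could sit on your server; this is a sketch only, and every field name in it is a hypothetical placeholder that must be replaced with the exact names from the organiser's form source:
<?php
// relay.php (sketch): shows the nominator a short form, then forwards the
// submission through the get_web_page() function above with the spoofed
// referer. Field names here are hypothetical placeholders.
if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    $fields = $_POST + array("nameandsurname_nominee" => "Your Name"); // your pre-filled details
    $response = get_web_page("http://www.logix.com.mt/cgi-bin/FormMail.pl", http_build_query($fields));
    echo "Nomination sent.";
    exit;
}
?>
<form method="post">
    <label>Your name: <input type="text" name="nameandsurname_nominator"></label>
    <input type="submit" value="Submit nomination">
</form>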
BEWARE: You may easily get disqualified or run into legal trouble for using such techniques! The recipient may very easily notice that the form has been compromised!
I have a database of a few thousand URLs that I am checking for links on pages (I end up looking for specific links), so I am running the function below in a loop. Every once in a while one of the URLs is bad, and then the entire program stalls, stops running, and starts building up used memory. I thought adding CURLOPT_TIMEOUT would fix this, but it didn't. Any ideas?
$options = array(
    CURLOPT_RETURNTRANSFER => true,     // return web page
    CURLOPT_HEADER => false,            // don't return headers
    CURLOPT_FOLLOWLOCATION => true,     // follow redirects
    CURLOPT_ENCODING => "",             // handle all encodings
    CURLOPT_USERAGENT => "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13", // who am i
    CURLOPT_AUTOREFERER => true,        // set referer on redirect
    CURLOPT_TIMEOUT => 2,               // timeout on response
    CURLOPT_MAXREDIRS => 10,            // stop after 10 redirects
    CURLOPT_POST => 0,                  // not sending post data
    CURLOPT_POSTFIELDS => $curl_data,   // post vars (unused while CURLOPT_POST is 0)
    CURLOPT_SSL_VERIFYHOST => 0,        // don't verify ssl
    CURLOPT_SSL_VERIFYPEER => false,    //
    CURLOPT_VERBOSE => 1                //
);
$ch = curl_init($url);
curl_setopt_array($ch, $options);
$content = curl_exec($ch);
$err = curl_errno($ch);
$errmsg = curl_error($ch);
$header = curl_getinfo($ch);
curl_close($ch);
// $header['errno'] = $err;
// $header['errmsg'] = $errmsg;
$header['content'] = $content;

# Extract the raw URL from the current one
$scheme = parse_url($url, PHP_URL_SCHEME); // Ex: http
$host = parse_url($url, PHP_URL_HOST);     // Ex: www.google.com
$raw_url = $scheme . '://' . $host;        // Ex: http://www.google.com

# Replace relative links with absolute ones
$relative = array();
$absolute = array();
# Strings to search for
$relative[0] = '/src="\//';
$relative[1] = '/href="\//';
# Strings to replace them with
$absolute[0] = 'src="' . $raw_url . '/';
$absolute[1] = 'href="' . $raw_url . '/';
$source = preg_replace($relative, $absolute, $content); // Ex: src="/image/google.png" to src="http://www.google.com/image/google.png"
return $source;
curl_exec will return false if it cannot find the URL.
The HTTP status code will be zero.
Check the result of curl_exec, and check the HTTP status code too.
$content = curl_exec($ch);
$httpStatus = curl_getinfo($ch, CURLINFO_HTTP_CODE);
if ($content === false) {
    if ($httpStatus == 0) {
        $content = "link was not found";
    }
}
....
The way you have it currently, the line of code
$header['content'] = $content;
will get the value false. This is not what you want.
I am using curl_exec, and my code does not stall if it cannot find the URL; the code keeps running.
You may end up with nothing in your browser though, and a message in the Firebug console like "500 Internal Server Error".
Maybe that's what you mean by stall.
So basically you don't know, and are just guessing, that the cURL request is stalling.
For this answer I can only guess as well, then. You might need to set the following cURL option too: CURLOPT_CONNECTTIMEOUT.
If the connect phase already stalls, the other timeout setting might not be taken into account. I'm not entirely sure, but please see Why would CURL time out in 1000ms when I have set up timeout upto 3000ms?.
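A short sketch of the two timeouts side by side (CURLOPT_CONNECTTIMEOUT caps the connect phase, CURLOPT_TIMEOUT the whole transfer; the values here are illustrative):
<?php
// Bound both phases of the request so one dead host can't stall the loop.
curl_setopt_array($ch, array(
    CURLOPT_CONNECTTIMEOUT => 5,  // give up if the TCP connect takes more than 5s
    CURLOPT_TIMEOUT        => 10, // give up if the whole transfer takes more than 10s
    CURLOPT_NOSIGNAL       => 1,  // avoids signal-based timeout issues; also needed for the sub-second *_MS variants
));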