Making cURL cookies work across successive curl_exec calls - PHP

I'm crawling a website with cURL and PHP's DOM. The site has a products section where you can go page by page viewing all the products, and it also has subsections for narrower searches; each page lists 9 products.
I need to store which subsection each product belongs to. I start from all the subsection URLs, and the program below shows how I try to fetch the next 9-product page of a subsection.
The problem is that the site handles this with redirects carrying some state that I assume is in a cookie, because there is no POST trace in the network traffic.
For example: In the ALL PRODUCTS section the URL of the second page is like:
www.example.com/product/?n=2
The first page of any subsection has a unique URL like:
www.example.com/product/subsection
The problem is that the link to the next subsection page (next 9 products) is
www.example.com/product/?n=2
The URL is THE SAME as in the ALL PRODUCTS section, but it shows the subsection's products.
The problem is that I get the ALL PRODUCTS page instead of the SUBSECTION page.
I have tried using cookies, but I don't get different results. Any suggestions?
<?php
class Crawler
{
    private $ckfile;

    public function main()
    {
        $this->ckfile = tempnam("C:/Web/", "CURLCOOKIE");
        $copy = $this->get_page();
        $next_visit = $this->link_next($copy);
        while ($next_visit != false) { // false means we reached the last page
            $copy = $this->get_page($next_visit, $this->get_name($next_visit));
            $next_visit = $this->link_next($copy);
        }
    }

    public function get_page($URL = "http://www.example.com", $nombre = "example")
    {
        $ch = curl_init();
        $options = array(
            CURLOPT_HTTPHEADER => array("Accept-Language: es-es,en"),
            CURLOPT_USERAGENT => "Googlebot/2.1 (+http://www.google.com/bot.html)",
            CURLOPT_AUTOREFERER => true,         // set referer on redirect
            CURLOPT_ENCODING => "",              // allow all encodings
            CURLOPT_FOLLOWLOCATION => true,      // follow redirects
            CURLOPT_HEADER => false,
            CURLOPT_CONNECTTIMEOUT => 120,       // timeout on connect
            CURLOPT_TIMEOUT => 120,              // timeout on response
            CURLOPT_MAXREDIRS => 10,             // stop after 10 redirects
            CURLOPT_COOKIEFILE => $this->ckfile, // read cookies from this file
            CURLOPT_COOKIEJAR => $this->ckfile,  // write cookies back to the same file
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_URL => $URL
        );
        curl_setopt_array($ch, $options);
        $g = 'C:/Web/' . $nombre . '.html';
        if (!is_file($g)) {
            $fp = fopen($g, "w");
            curl_setopt($ch, CURLOPT_FILE, $fp);
            curl_exec($ch); // write the page straight to disk; we don't browse it here
            fclose($fp);
        }
        curl_close($ch);
        return $g;
    }

    public function link_next($value)
    {
        // Searches the DOM for the "next page" link and returns a well-formed URL,
        // or false if it doesn't find one (last page).
    }
}
?>

To make multiple calls, you want to use curl_multi:
$ch = curl_multi_init();
not
$ch = curl_init();
See this post for an example: Multiple PHP cUrl posts to same page
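For what it's worth, here is a minimal sketch of that curl_multi pattern under stated assumptions: the two URLs are placeholders taken from the question, and the cookies.txt path is an assumption; both handles point at the same cookie file so the requests share saved session state:

<?php
// Minimal curl_multi sketch; the URLs and cookies.txt path are assumptions.
$urls = array(
    "http://www.example.com/product/subsection",
    "http://www.example.com/product/?n=2",
);

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt"); // shared cookie jar
    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Drive all transfers until every handle has finished.
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for activity instead of busy-looping
} while ($running > 0);

foreach ($handles as $ch) {
    $html = curl_multi_getcontent($ch); // requires CURLOPT_RETURNTRANSFER
    // ... parse $html here ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
?>

Note that curl_multi runs the handles in parallel, so each transfer only sees the cookie file as it was when it started; for strictly sequential, cookie-dependent paging, reusing a single curl_init() handle across requests also works.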

Related

extract reCaptcha from web page to be completed externally via cURL and then return results to view page

I am creating a web scraper for personal use that scrapes car dealership sites based on my input, but several of the sites I am attempting to collect data from are blocked by a redirecting captcha page. The current site I am scraping with cURL returns this HTML:
<html>
<head>
<title>You have been blocked</title>
<style>#cmsg{animation: A 1.5s;}#keyframes A{0%{opacity:0;}99%{opacity:0;}100%{opacity:1;}}</style>
</head>
<body style="margin:0">
<p id="cmsg">Please enable JS and disable any ad blocker</p>
<script>
var dd={'cid':'AHrlqAAAAAMA1gZrYHNP4MIAAYhtzg==','hsh':'C0705ACD75EBF650A07FF8291D3528','t':'fe','host':'geo.captcha-delivery.com'}
</script>
<script src="https://ct.captcha-delivery.com/c.js"></script>
</body>
</html>
I am using this to scrape the page:
<?php
function web_scrape($url)
{
    $ch = curl_init();
    $imei = "013977000272744";
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_VERBOSE, 1);
    curl_setopt($ch, CURLOPT_COOKIE, '_ym_uid=1460051101134309035; _ym_isad=1; cxx=80115415b122e7c81172a0c0ca1bde40; _ym_visorc_20293771=w');
    curl_setopt($ch, CURLOPT_POSTFIELDS, array(
        'imei' => $imei,
    ));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $server_output = curl_exec($ch);
    curl_close($ch); // close before returning; in the original this line came after return and never ran
    return $server_output;
}

echo web_scrape($url);
?>
And to reiterate what I want to do: I want to collect the reCaptcha from this page, so that when I want to view the page details on an external site, I can fill in the reCaptcha on my external site and then scrape the page originally entered.
Any response would be great!
Datadome is currently utilizing Recaptcha v2 and GeeTest captchas, so this is what your script should do:
Navigate to the redirect https://geo.captcha-delivery.com/captcha/?initialCid=….
Detect what type of captcha is used.
Obtain a token for this captcha using any captcha-solving service, like Anti Captcha.
Submit the token and check whether you were redirected to the target page.
Sometimes the target page contains an iframe with the address https://geo.captcha-delivery.com/captcha/?initialCid=.. , so you need to repeat from step 2 inside this iframe.
I'm not sure whether the steps above can be done with PHP, but you can do it with browser automation engines like Puppeteer, a library for NodeJS. It launches a Chromium instance and emulates a real user's presence. NodeJS is a must if you want to build professional scrapers; it's worth investing some time in YouTube lessons.
Here’s a script which does all steps above: https://github.com/MoterHaker/bypass-captcha-examples/blob/main/geo.captcha-delivery.com.js
You’ll need a proxy to bypass GeeTest protection.
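Before handing anything to a solving service, you first have to recognize the block server-side. Here is a minimal PHP sketch (my own illustration, not part of the linked script; the helper name is hypothetical) that detects the DataDome block page shown in the question and extracts the dd parameters:

<?php
// Hypothetical helper: detect the DataDome block page in a scraped response
// and pull out the 'dd' JavaScript object ('cid', 'hsh', 't', 'host').
function detect_datadome_block($html)
{
    // The block page embeds: var dd={'cid':'...','hsh':'...','t':'fe','host':'...'}
    if (preg_match("/var dd=(\{[^}]+\})/", $html, $m)) {
        // The object uses single quotes; convert them so json_decode accepts it.
        $json = str_replace("'", '"', $m[1]);
        return json_decode($json, true); // array with cid, hsh, t, host; null on failure
    }
    return false; // no DataDome challenge in this response
}

$dd = detect_datadome_block(web_scrape($url)); // web_scrape() is the question's function
if ($dd !== false) {
    // This is the redirect target from step 1; feed it to your captcha-solving flow.
    echo "Blocked; captcha at https://" . $dd['host'] . "/captcha/?initialCid=" . urlencode($dd['cid']);
}
?>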
Based on the high demand for code, HERE is my upgraded scraper that bypassed this specific issue. However, my attempt to obtain the captcha did not work, and I still have not figured out how to obtain it.
include "simple_html_dom.php";
/**
* Get a web file (HTML, XHTML, XML, image, etc.) from a URL. Return an
* array containing the HTTP server response header fields and content.
*/
// This function is where the Magic comes from. It bypasses ever peice of security carsales.com.au can throw at me
function get_web_page( $url ) {
$options = array(
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // don't return headers
CURLOPT_FOLLOWLOCATION => true, // follow redirects
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_USERAGENT => "spider", // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
CURLOPT_SSL_VERIFYPEER => false // Disabled SSL Cert checks
);
$ch = curl_init( $url ); //initiate the Curl program that we will use to scrape data off the webpage
curl_setopt_array( $ch, $options ); //set the data sent to the webpage to be readable by the webpage (JSON)
$content = curl_exec( $ch ); //creates function to read pages content. This variable will be used to hold the sites html
$err = curl_errno( $ch ); //errno function that saves all the locations our scraper is sent to. This is just for me so that in the case of a error,
//I can see what parts of the page has it seen and more importantly hasnt seen
$errmsg = curl_error( $ch ); //check error message function. for example if I am denied permission this string will be equal to: 404 access denied
$header = curl_getinfo( $ch ); //the information of the page stored in a array
curl_close( $ch ); //Closes the Curler to save site memory
$header['errno'] = $err; //sending the header data to the previously made errno, which contains a array path of all the places my scraper has been
$header['errmsg'] = $errmsg; //sending the header data to the previously made error message checker function.
$header['content'] = $content; //sending the header data to the previously made content checker that will be the variable holder of the webpages HTML.
return $header; //Return all the pages data and my identifying functions in a array. To be used in the presentation of the search results.
};
//using the function we just made, we use the url genorated by the form to get a developer view of the scraping.
$response_dev = get_web_page($url);
// print_r($response_dev);
$response = end($response_dev); //takes only the end of the developer response because the rest is for my eyes only in the case that the site runs into a issue
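The script includes simple_html_dom.php but never calls it; as a hedged follow-up sketch, this is how the scraped HTML could be fed into that parser (the 'a' selector is just a placeholder):

// Sketch: parse the scraped HTML with simple_html_dom; the selector is a placeholder.
$dom = str_get_html($response);
if ($dom !== false) {
    foreach ($dom->find('a') as $link) { // e.g. collect every link on the page
        echo $link->href . "\n";
    }
    $dom->clear(); // free the DOM tree
}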

How to follow all redirects with CURL including META-refresh

I'm using an API that returns a set of URLs. All the URLs redirect, but how many redirects there are and where they end up is unknown.
So what I'm trying to do is trace the path and find the last URL.
I basically want to do the same as http://wheregoes.com/retracer.php, but I only need to know the last URL.
I've found a way to do it with cURL, but the trace stops when it hits a meta refresh.
I've seen this thread: PHP: Can CURL follow meta redirects, but it doesn't help me much.
This is my current code:
function trace_url($url) {
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_FOLLOWLOCATION => TRUE,
        CURLOPT_RETURNTRANSFER => TRUE,
        CURLOPT_SSL_VERIFYHOST => FALSE,
        CURLOPT_SSL_VERIFYPEER => FALSE,
    ));
    curl_exec($ch);
    $url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);
    return $url;
}
$lasturl = trace_url('http://myurl.org');
echo $lasturl;
Well, there is a big difference between header redirects, which basically fall under the 3xx class, and META refreshes: simply put, one relies on the server and the other is handled by the client.
Since cURL (libcurl) executes on the server side, it can handle the first type, header (HTTP) redirects.
You can then extract the URL in a bunch of ways. For the meta refresh, you will need to handle it manually:
1) Scrape the web page contents.
2) Extract the link from the meta tag.
3) Grab this new link if you want.
From your example:
function trace_url($url) {
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_FOLLOWLOCATION => TRUE,
        CURLOPT_RETURNTRANSFER => TRUE,
        CURLOPT_SSL_VERIFYHOST => FALSE,
        CURLOPT_SSL_VERIFYPEER => FALSE,
    ));
    // Return the page body this time (not the effective URL), so the caller
    // can search it for a meta refresh.
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}

$response = trace_url('http://myurl.org');
// quick pattern for explanation purposes only, you may improve it as you like
preg_match('#<meta.*?content="[0-9]*;url=([^"]+)"\s*/?>#i', $response, $links);
$newLink = $links[1];
Or, as mentioned in the thread from your question, use simplexml. Note that $response here holds the page content rather than a file name, so simplexml_load_string is the right call, and the XPath attribute prefix is @:
$xml = simplexml_load_string($response);
$link = $xml->xpath("//meta[@http-equiv='refresh']");
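Putting both steps together, here is a hedged sketch of a loop that keeps following meta refreshes until none is left (trace_url_with_meta is a hypothetical name, and the 10-hop cap is an arbitrary safety limit I added):

// Sketch: let cURL follow the HTTP (3xx) redirects, then chase meta refreshes.
// Note: this assumes the meta refresh target is an absolute URL.
function trace_url_with_meta($url, $max_hops = 10) {
    while ($max_hops-- > 0) {
        $ch = curl_init($url);
        curl_setopt_array($ch, array(
            CURLOPT_FOLLOWLOCATION => TRUE, // cURL handles the 3xx redirects
            CURLOPT_RETURNTRANSFER => TRUE,
            CURLOPT_SSL_VERIFYHOST => FALSE,
            CURLOPT_SSL_VERIFYPEER => FALSE,
        ));
        $body = curl_exec($ch);
        $url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // where the 3xx chain ended
        curl_close($ch);
        // Look for a client-side meta refresh in the body; stop when there is none.
        if (!preg_match('#<meta[^>]*content="[0-9]*;\s*url=([^"]+)"#i', $body, $m)) {
            return $url; // final URL reached
        }
        $url = $m[1]; // hop to the meta refresh target and go around again
    }
    return $url; // give up after the hop limit
}

echo trace_url_with_meta('http://myurl.org');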

Login with curl and move to another page

I'm trying to access one page on a website with cURL, but it requires being logged in. I tried the following code to log in, and it was successful:
<?php
$user_agent = "Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20140319 Firefox/24.0 Iceweasel/24.4.0";
$curl_crack = curl_init();
curl_setopt($curl_crack, CURLOPT_URL, "https://www.vininspect.com/en/account/login");
curl_setopt($curl_crack, CURLOPT_USERAGENT, $user_agent);
curl_setopt($curl_crack, CURLOPT_PROXY, "183.78.169.60:37899");
curl_setopt($curl_crack, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
curl_setopt($curl_crack, CURLOPT_POST, true);
curl_setopt($curl_crack, CURLOPT_POSTFIELDS, "LoginForm[email]=naceriwalid%40hotmail.com&LoginForm[password]=passwordhere&toploginform[rememberme]=0&yt1=&toploginform[rememberme]=0");
curl_setopt($curl_crack, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl_crack, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl_crack, CURLOPT_COOKIEFILE, "cookie.txt"); // put the full path of the cookie file if you want it written there
curl_setopt($curl_crack, CURLOPT_COOKIEJAR, "cookie.txt");  // put the full path of the cookie file if you want it written there
curl_setopt($curl_crack, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($curl_crack, CURLOPT_TIMEOUT, 30);
$exec = curl_exec($curl_crack);
if (preg_match("/^you are logged|logout|successfully logged$/i", $exec))
{
    echo "yoooha";
}
?>
Now the only problem I'm facing: let's say that I don't want to stay on the logged-in page; I want to go to this page http://example.com/buy. How can I do that in the same code?
If you want to go to /buy after you log in, just use the same cURL handle and issue another request for that page. cURL will retain the cookies for the duration of the handle (and on subsequent requests, since you are saving them to a file and reading them back with the cookie jar).
For example:
$user_agent = "Mozilla/5.0 (X11; Linux i686; rv:24.0) Gecko/20140319 Firefox/24.0 Iceweasel/24.4.0";
$curl_crack = curl_init();
curl_setopt($curl_crack, CURLOPT_URL, "https://www.vininspect.com/en/account/login");
curl_setopt($curl_crack, CURLOPT_USERAGENT, $user_agent);
curl_setopt($curl_crack, CURLOPT_PROXY, "183.78.169.60:37899");
curl_setopt($curl_crack, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
curl_setopt($curl_crack, CURLOPT_POST, true);
curl_setopt($curl_crack, CURLOPT_POSTFIELDS, "LoginForm[email]=naceriwalid%40hotmail.com&LoginForm[password]=passwordhere&toploginform[rememberme]=0&yt1=&toploginform[rememberme]=0");
curl_setopt($curl_crack, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl_crack, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl_crack, CURLOPT_COOKIEFILE, "cookie.txt"); // put the full path of the cookie file if you want it written there
curl_setopt($curl_crack, CURLOPT_COOKIEJAR, "cookie.txt");  // put the full path of the cookie file if you want it written there
curl_setopt($curl_crack, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($curl_crack, CURLOPT_TIMEOUT, 30);
$exec = curl_exec($curl_crack);
if (preg_match("/^you are logged|logout|successfully logged$/i", $exec))
{
    $post = array('search' => 'keyword', 'abc' => 'xyz');
    curl_setopt($curl_crack, CURLOPT_POST, 1); // the next request is a POST as well
    curl_setopt($curl_crack, CURLOPT_POSTFIELDS, http_build_query($post)); // set post data
    curl_setopt($curl_crack, CURLOPT_URL, 'http://example.com/buy'); // set url for next request
    $exec = curl_exec($curl_crack); // request /buy on the same handle, reusing the login session
}
Here are some other examples of using PHP & cURL to make multiple requests:
How to login in with Curl and SSL and cookies (links to multiple other examples)
Grabbing data from a website with cURL after logging in?
Pinterest login with PHP and cURL not working
Login to Google with PHP and Curl, Cookie turned off?
PHP Curl - Cookies problem
You just need to change the URL after login is complete and then run curl_exec, like this:
<?php
// login code goes here
if (preg_match("/^you are logged|logout|successfully logged$/i", $exec))
{
    echo "Logged in! Now let's go to another page while we are logged in, shall we?";
    // The new URL that you want to visit while logged in goes on the next line:
    curl_setopt($curl_crack, CURLOPT_URL, "https://new_url_to_go.com/something");
    $exec = curl_exec($curl_crack);
    // now $exec contains the content of the new page, fetched with the login session
}
curl_close($curl_crack); // don't forget to close the cURL session at the end
?>
First, define this function to get an associative array containing the URL's headers and content (see http://nadeausoftware.com/articles/2007/06/php_tip_how_get_web_page_using_curl):
/**
 * Get a web file (HTML, XHTML, XML, image, etc.) from a URL. Return an
 * array containing the HTTP server response header fields and content.
 */
function get_web_page( $url, $params, $is_post = true )
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,                   // return web page
        CURLOPT_HEADER => false,                          // don't return headers
        CURLOPT_FOLLOWLOCATION => true,                   // follow redirects
        CURLOPT_ENCODING => "",                           // handle all encodings
        CURLOPT_USERAGENT => "Mozilla/4.0 (compatible;)", // i'm mozilla
        CURLOPT_AUTOREFERER => true,                      // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,                    // timeout on connect
        CURLOPT_TIMEOUT => 120,                           // timeout on response
        CURLOPT_MAXREDIRS => 10,                          // stop after 10 redirects
    );
    if ($is_post) { // use POST
        $options[CURLOPT_POST] = 1;
        $options[CURLOPT_POSTFIELDS] = http_build_query($params);
    } else {        // use GET
        $url = $url . '?' . http_build_query($params);
    }
    $ch = curl_init( $url );
    curl_setopt_array( $ch, $options );
    $content = curl_exec( $ch );
    $err = curl_errno( $ch );
    $errmsg = curl_error( $ch );
    $header = curl_getinfo( $ch );
    curl_close( $ch );
    $header['errno'] = $err;
    $header['errmsg'] = $errmsg;
    $header['content'] = $content;
    return $header;
}
Try this to load http://www.example.com/buy after login is successful:
// after curl login setup
$exec = curl_exec($curl_crack);
if (preg_match("/^you are logged|logout|successfully logged$/i", $exec))
{
    // close the login cURL resource and free up system resources
    curl_close($curl_crack);
    $params = array('product_id' => 'xxxx', 'qty' => 10);
    $url = 'http://www.example.com/buy';
    // use the function above to get the url content via POST params
    $result = get_web_page($url, $params, true);
    if ($result['http_code'] == 200) {
        // echo the content
        echo $result['content'];
        die();
    }
}

CURL returns response code 200 and 0 for the same page at different times

I am facing some unusual behavior with cURL. For a given page, I sometimes get HTTP response code 200 and sometimes 0. I cannot tell whether the page is valid or not. If you try the given code, please run it at least 5-10 times so that you can see the difference.
function print_info()
{
    $arr = array(
        'bart.no',
        'bolandirekt.nu',
        'ekompassen.com',
        'ekompassen.nu',
    );
    foreach ($arr as $url)
    {
        echo "<br/>URL: " . $url;
        $temp = str_replace(array("www.", "http://", "https://"), "", strtolower($url));
        // From this array it will be decided which prefix to prepend
        $pre_array = array("", "www.", "https://", "http://", "https://www.", "http://www.");
        // For each prefix the status code will be checked
        foreach ($pre_array as $pre)
        {
            $options = array(
                CURLOPT_RETURNTRANSFER => TRUE,  // return web page
                CURLOPT_HEADER => TRUE,          // return headers as well
                CURLOPT_FOLLOWLOCATION => FALSE, // don't follow redirects
                CURLOPT_ENCODING => "",          // handle all encodings
                CURLOPT_USERAGENT => "spider",   // who am i
                CURLOPT_AUTOREFERER => FALSE,    // don't set referer on redirect
                CURLOPT_SSL_VERIFYHOST => FALSE, // ssl verify host
                CURLOPT_SSL_VERIFYPEER => FALSE, // ssl verify peer
                CURLOPT_NOBODY => FALSE,
                CURLOPT_CONNECTTIMEOUT => 20,    // timeout on connect
                CURLOPT_TIMEOUT => 20,           // timeout on response
            );
            // Initialize cURL
            $ch = curl_init($pre . $temp);
            // Set cURL options
            curl_setopt_array($ch, $options);
            // Execute cURL
            $content = curl_exec($ch);
            $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
            echo "<pre/>";
            if ($code == 200)
            {
                print_r(curl_getinfo($ch));
                curl_close($ch); // close the handle before leaving the loop
                break;
            }
            curl_close($ch);
        }
    }
}
So my final doubt is: why do I get response code 200 for pages which do not exist or do not open in a browser? And why do I sometimes get response code 0 and sometimes 200 for the same page, even when I leave an interval between requests?
A response code of 0 means the cURL request did not complete, so there is no response code.
The reason for this may be an invalid host name (can't resolve), a malformed URL, a timeout, etc.
You should be able to get the cURL error code as in CodeCaster's comment and the curl_error / curl_errno docs.
Once the cURL request completes properly, a response code (from the server) will be available and meaningful.
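A minimal sketch of that check (the hostname is one from the question's list): when the HTTP code comes back 0, curl_errno / curl_error tell you why the transfer never completed:

$ch = curl_init("http://ekompassen.nu"); // host taken from the question's list
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
if ($code === 0) {
    // The transfer failed before any HTTP response arrived.
    echo "cURL error " . curl_errno($ch) . ": " . curl_error($ch);
    // e.g. 6 = couldn't resolve host, 28 = operation timed out
} else {
    echo "HTTP response code: " . $code;
}
curl_close($ch);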

Pass data from my own HTML page to a third-party PHP page

I am taking part in a beauty competition, and I require that I be nominated.
The nomination form requires my details and my nominators' details.
My nominators may have a problem switching between my email containing my details and the nomination form, and that may discourage them from filling in the form at all.
The solution I came up with is to create an HTML page (which I control 100%) that already contains my pre-filled details, so that the nominators do not get confused filling in my details; all I have to do is ask them for their own details.
Now I want my HTML form to pass the details on to another website (the competition organiser's website) and have that form automatically filled in, so that all the nominators have to do is click submit on the competition's website. I have absolutely no control over the competition's website, so I cannot add or change any programming code there.
How can I pass the data from my own HTML page (100% under my control) to a third-party PHP page?
Any examples of coding are appreciated.
Thank you xx
The same-origin policy makes this impossible unless the competition organiser were to grant you permission using CORS (in which case you could load their site in a frame and modify it using JavaScript to manipulate its DOM … in supporting browsers).
The form they are using submits the form data to a mailing script which is secured by checking the referer (at least). You could use something like cURL in PHP to spoof the referer like this (not tested):
function get_web_page( $url, $curl_data )
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,    // return web page
        CURLOPT_HEADER => false,           // don't return headers
        CURLOPT_FOLLOWLOCATION => true,    // follow redirects
        CURLOPT_ENCODING => "",            // handle all encodings
        CURLOPT_USERAGENT => "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)", // who am i
        CURLOPT_CONNECTTIMEOUT => 120,     // timeout on connect
        CURLOPT_TIMEOUT => 120,            // timeout on response
        CURLOPT_MAXREDIRS => 10,           // stop after 10 redirects
        CURLOPT_POST => 1,                 // i am sending post data
        CURLOPT_POSTFIELDS => $curl_data,  // these are my post vars
        CURLOPT_SSL_VERIFYHOST => 0,       // don't verify ssl
        CURLOPT_SSL_VERIFYPEER => false,
        CURLOPT_REFERER => "http://fashionawards.com.mt/nominationform.php", // spoofed referer
        CURLOPT_VERBOSE => 1
    );
    $ch = curl_init($url);
    curl_setopt_array($ch, $options);
    $content = curl_exec($ch);
    $err = curl_errno($ch);
    $errmsg = curl_error($ch);
    $header = curl_getinfo($ch);
    curl_close($ch);
    $header['errno'] = $err;
    $header['errmsg'] = $errmsg;
    $header['content'] = $content;
    return $header;
}
$curl_data = "nameandsurname_nominator=XXXX&id_nominator=XXX.....etc....";
$url = "http://www.logix.com.mt/cgi-bin/FormMail.pl";
$response = get_web_page($url,$curl_data);
print '<pre>';
print_r($response);
print '</pre>';
In the line that says $curl_data = "nameandsurname_nominator=XXXX&id_nominator=XXX.....etc...."; you can set the POST variables according to their names in the original form.
That way you can make your own form submit to their mailing script and have some of the fields populated with what you need (see the sketch below for building that string)...
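As a small aside, rather than hand-writing that query string, PHP's http_build_query will URL-encode each value for you; a short sketch using the field names from the snippet above:

$fields = array(
    'nameandsurname_nominator' => 'XXXX', // same field names as in the original form
    'id_nominator' => 'XXX',
    // .....etc....
);
$curl_data = http_build_query($fields); // produces a properly URL-encoded query string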
BEWARE: You may easily get disqualified or run into legal trouble for using such techniques! The recipient may very easily notice that the form has been tampered with!
