cURL Scrape then Parse/Find Specific Content


I'm using PHP and cURL to scrape a web page, but the page is poorly designed (no classes or ids on its tags), so I need to search for specific text, go to the tag holding it (e.g. a <p>), then move to the next element (the next <p>) and get its text.
There are various things I need to get from the page, including the text inside an <a onclick="get this stuff here">. So I think I need to use cURL to fetch the source into a PHP variable, then use PHP to parse it and find what I need.
Does this sound like the best way to do this? Does anyone have pointers, or can anyone demonstrate how to put the source fetched by cURL into a variable?
Thanks!
EDIT (Working/Current Code) -----------
<?php
class Scrape
{
public $cookies = 'cookies.txt';
private $user = null;
private $pass = null;
/*Data generated from cURL*/
public $content = null;
public $response = null;
/* Links */
private $url = array(
'login' => 'https://website.com/login.jsp',
'submit' => 'https://website.com/LoginServlet',
'page1' => 'https://website.com/page1',
'page2' => 'https://website.com/page2',
'page3' => 'https://website.com/page3'
);
/* Fields */
public $data = array();
public function __construct ($user, $pass)
{
$this->user = $user;
$this->pass = $pass;
}
public function login()
{
$this->cURL($this->url['login']);
if($form = $this->getFormFields($this->content, 'login'))
{
$form['login'] = $this->user;
$form['password'] =$this->pass;
// echo "<pre>".print_r($form,true);exit;
$this->cURL($this->url['submit'], $form);
//echo $this->content;//exit;
}
//echo $this->content;//exit;
}
// NEW TESTING
public function loadPage($page)
{
$this->cURL($this->url[$page]);
echo $this->content;//exit;
}
/* Scan for form */
private function getFormFields($data, $id)
{
if (preg_match('/(<form.*?name=.?'.$id.'.*?<\/form>)/is', $data, $matches)) {
$inputs = $this->getInputs($matches[1]);
return $inputs;
} else {
return false;
}
}
/* Get Inputs in form */
private function getInputs($form)
{
$inputs = array();
$elements = preg_match_all('/(<input[^>]+>)/is', $form, $matches);
if ($elements > 0) {
for($i = 0; $i < $elements; $i++) {
$el = preg_replace('/\s{2,}/', ' ', $matches[1][$i]);
if (preg_match('/name=(?:["\'])?([^"\'\s]*)/i', $el, $name)) {
$name = $name[1];
$value = '';
if (preg_match('/value=(?:["\'])?([^"\']*)/i', $el, $value)) {
$value = $value[1];
}
$inputs[$name] = $value;
}
}
}
return $inputs;
}
/* Perform curl function to specific URL provided */
public function cURL($url, $post = false)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13");
// "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page as a string instead of printing it
// NOTE: certificate checks are disabled below; re-enable them outside of testing
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, $this->cookies);
curl_setopt($ch, CURLOPT_COOKIEFILE, $this->cookies);
curl_setopt($ch, CURLOPT_HEADER, 0);
// each option is set once; the earlier 30-second connect timeout was overridden by this one anyway
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
if($post) //if post is needed
{
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
}
curl_setopt($ch, CURLOPT_URL, $url);
$this->content = curl_exec($ch);
$this->response = curl_getinfo( $ch );
$this->url['last_url'] = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
}
}
$sc = new Scrape('user','pass');
$sc->login();
$sc->loadPage('page1');
echo "<h1>TESTTESTEST</h1>";
$sc->loadPage('page2');
echo "<h1>TESTTESTEST</h1>";
$sc->loadPage('page3');
echo "<h1>TESTTESTEST</h1>";
(Note: credit to @Ramz, from "scrape a website with secured login".)

You can divide your problem into several parts.
Retrieving the data from the data source.
For that, you can use cURL or file_get_contents(), depending on your requirements. Code examples are everywhere; see http://php.net/manual/en/function.file-get-contents.php and http://php.net/manual/en/curl.examples-basic.php
Parsing the retrieved data.
For that, I would start by looking into "PHP Simple HTML DOM Parser". You can use it to extract data from an HTML string. http://simplehtmldom.sourceforge.net/
Building and generating the output.
This is simply a question of what you want to do with the data that you have extracted. For example, you can print it, reformat it, or store it to a database/file.
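
The parsing step can also be sketched with PHP's built-in DOM extension instead of a third-party parser. This is only a minimal illustration, not the page from the question: the markup in $html, the label "Account holder" and the helper names textAfterLabel() and onclickValues() are all made up for the example. It finds a <p> by its text, reads the next <p>, and collects onclick attributes from links.

```php
<?php
// Hypothetical markup standing in for the scraped page; in practice this
// string would be the value returned by curl_exec().
$html = <<<HTML
<html><body>
<p>Account holder</p>
<p>Jane Doe</p>
<a href="#" onclick="openReport('2013-10');">Latest report</a>
</body></html>
HTML;

// Returns the text of the first <p> sibling that follows a <p> containing
// $label, or null when there is no such pair.
function textAfterLabel(DOMXPath $xpath, string $label): ?string
{
    // Assumes $label contains no double quotes (XPath 1.0 has no escaping).
    $query = sprintf('//p[contains(normalize-space(.), "%s")]/following-sibling::p[1]', $label);
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Returns the onclick attribute of every <a> that has one.
function onclickValues(DOMXPath $xpath): array
{
    $values = [];
    foreach ($xpath->query('//a[@onclick]') as $a) {
        $values[] = $a->getAttribute('onclick');
    }
    return $values;
}

libxml_use_internal_errors(true); // scraped HTML is rarely valid; collect warnings quietly
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$name = textAfterLabel($xpath, 'Account holder');
$handlers = onclickValues($xpath);
echo $name, "\n";
echo implode("\n", $handlers), "\n";
```

XPath's following-sibling axis is what makes the "find the text, then step to the next tag" pattern workable on pages without classes or ids.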

I suggest you use a ready-made scraper. I use Goutte (https://github.com/FriendsOfPHP/Goutte), which lets me load website content and traverse it much as you would with jQuery. For example, if I want the content of the <div id="content">, I call $crawler->filter('#content')->text() on the crawler returned by $client->request().
It even allows me to find and 'click' links and submit forms to retrieve and process the content.
It makes life so much easier than using cURL or file_get_contents() and working your way through the HTML manually.

Related

PHP - Simple Html Dom load multiple pages speed

I finally got my script to work, but the search (done via AJAX) takes a long time. Given a keyword, it searches the page and captures the title, URL and thumbnail of each video. The problem came when I also needed the tags of each video: I have to open each video's page to read them, and the only way I could think of was a second loop inside the loop that walks the found videos. That is:
For each video found -> capture title, thumbnail, URL -> go to the captured URL and capture its tags.
The code I use is basically the following. Is there any way to speed up the search, either by optimizing the code or by taking a different approach?
My parse function:
<?php
function dlPage($href) {
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HTTPHEADER, array("Accept-Language: en-US")); // request headers go in CURLOPT_HTTPHEADER; CURLOPT_HEADER is a boolean switch
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $href);
curl_setopt($curl, CURLOPT_REFERER, $href);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
$str = curl_exec($curl);
curl_close($curl);
// Create a DOM object
$dom = new simple_html_dom();
// Load HTML from a string
$dom->load($str);
return $dom;
}
?>
My script:
$buscartag = urlencode($_POST['buscartag']); // encodes spaces as "+" and escapes other characters
$urlparse = "https://example.com/?k=".$buscartag;
$paginas = rand(0, 50);
$html = dlPage($urlparse."&p=".$paginas);
$counter = 0;
foreach($html->find('div.video-box') as $videos) {
    $titulo = $videos->find('div.video-box>p[!class]>a[!class]', 0)->attr['title'];
    $pathvideo = str_replace('_', '', $videos->attr['id']);
    $link = "https://www.example.com/".$pathvideo."/";
    $thumb = $videos->find('div.thumb', 0)->innertext; // index 0: first match rather than an array
    // Second loop: one extra request per video to collect its tags
    $gettags2 = array();
    $html_tags = file_get_html($link);
    foreach ($html_tags->find('a.nu') as $gettags) {
        $gettags2[] = $gettags->innertext;
    }
    // The original tested $idvideo and $urlimagen, which this snippet never sets;
    // $pathvideo and $thumb are assumed to be the intended values.
    if (!empty($titulo) && !empty($link) && !empty($pathvideo) && !empty($thumb)) {
        $counter++;
        //here will echo all variables
    }
}
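
One possible way to cut the waiting time, offered only as a sketch: the tag pages are independent of each other, so they could be downloaded concurrently with PHP's curl_multi_* functions instead of one at a time. The helper name fetchAll() and its parameters are invented for this example, and error handling is left out.

```php
<?php
// Downloads several URLs concurrently and returns [url => body].
// Hypothetical helper; per-transfer errors are not inspected here
// (curl_multi_info_read() could be used for that).
function fetchAll(array $urls, int $timeout = 20): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none is still running.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh, 1.0); // wait for activity instead of spinning
        }
    } while ($running && $status === CURLM_OK);

    $bodies = [];
    foreach ($handles as $url => $ch) {
        $bodies[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);
    return $bodies;
}
```

With this approach the first loop would only collect the video links; a single fetchAll($links) call would then download every tag page, and each returned body could be parsed with Simple HTML DOM's str_get_html(). Whether the target site tolerates that many simultaneous requests is a separate question.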

Manipulate dom with php to scrape data

I am currently trying to manipulate the DOM through PHP to extract the view count from an FB video page. The code below was working until recently, but now it doesn't find the node that contains the view count. This information is inside a div with id fbPhotoPageMediaInfo. What would be the best way to manipulate the DOM through PHP to get the views of an FB video page?
private function _callCurl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Linux; Android 5.0.1; SAMSUNG-SGH-I337 Build/LRX22C; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
curl_setopt($ch, CURLOPT_URL, $url);
$response = curl_exec($ch);
$http = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
return array(
$http,
$response,
);
}
function test()
{
$url = "https://www.facebook.com/TaylorSwift/videos/10153665021155369/";
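$request = $this->_callCurl($url); // _callCurl() is a private method of the same class
if ($request[0] == 200) {
$dom = new DOMDocument();
@$dom->loadHTML($request[1]); // @ silences warnings about malformed HTML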
$elm = $dom->getElementById('fbPhotoPageMediaInfo');
if (isset($elm->nodeValue)) {
$views = preg_replace('/[^0-9]/', '', $elm->nodeValue);
} else {
$views = null;
}
} else {
echo "Error!";
}
return isset($views) ? $views : null;
}

Here is what I've determined...
If you var_dump() on $request you can see that it's giving you a 302 code (redirect) rather than a 200 (ok).
Changing CURLOPT_FOLLOWLOCATION to true or commenting it out entirely makes the error go away, but now we're getting a different page from the one expected.
I ran the following to see where I was being redirected to:
$htm = file_get_contents("https://www.facebook.com/TaylorSwift/videos/10153665021155369/");
var_dump($htm);
This gave me a page saying I was using an outdated browser, and needed to update it. So apparently Facebook doesn't like the User Agent.
I updated it as follows:
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/44.0.2');
That appears to solve the problem.
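
Once the request does come back with a 200, the extraction step itself can be sketched as follows. The sample markup in $response is invented (the real Facebook markup is not shown in this thread and changes over time), and extractViews() is a hypothetical helper name.

```php
<?php
// Hypothetical fragment standing in for the downloaded page.
$response = '<div id="fbPhotoPageMediaInfo"><span>1,234,567 Views</span></div>';

// Returns the digits found inside the element with the given id,
// or null if the element is missing or holds no digits.
function extractViews(string $html, string $id): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }
    $previous = libxml_use_internal_errors(true); // keep parser warnings off the output
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($dom);
    // Assumes $id contains no double quotes.
    $node = $xpath->query(sprintf('//*[@id="%s"]', $id))->item(0);
    if ($node === null) {
        return null;
    }
    $digits = preg_replace('/[^0-9]/', '', $node->textContent);
    return $digits === '' ? null : $digits;
}

$views = extractViews($response, 'fbPhotoPageMediaInfo');
echo $views, "\n";
```

Using libxml_use_internal_errors() rather than the @ operator keeps warnings from malformed markup quiet without hiding other errors.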

Personally I prefer to use Simplehtmldom.
FB, like other high-traffic sites, updates its source to help prevent scraping, so you may have to adjust your node search in the future.
<?php
$ua = "Mozilla/5.0 (Windows NT 5.0) AppleWebKit/5321 (KHTML, like Gecko) Chrome/13.0.872.0 Safari/5321"; // must be a valid User Agent
ini_set('user_agent', $ua);
require_once('simplehtmldom/simple_html_dom.php'); // http://simplehtmldom.sourceforge.net/
Function Scrape_FB_Views($url) {
IF (!filter_var($url, FILTER_VALIDATE_URL) === false) {
// Create DOM from URL
$html = file_get_html($url);
IF ($html) {
IF (($html->find('span[class=fcg]', 3))) { // 4th instance of span with fcg class
$text = trim($html->find('span[class=fcg]', 3)->plaintext); // get content of span as plain text
$result = preg_replace('/[^0-9]/', '', $text); // replace all non-numeric characters
}ELSE{
$result = "Node is no longer valid.";
}
}ELSE{
$result = "Could not get HTML.";
}
}ELSE{
$result = "URL is invalid.";
}
return $result;
}
$url = "https://www.facebook.com/TaylorSwift/videos/10153665021155369/";
echo("<p>".Scrape_FB_Views($url)."</p>");
?>

cURL login into ebay.co.uk [closed]

I've been trying for some time now to use cURL to log in to eBay.co.uk. The cookies are being set and the POST data is being sent correctly; however, I am not sure the cookie file is being read back, or even written correctly, since eBay returns a page saying I need to enable cookies in my browser.
Here is the cURL class I'm using:
class Curl {
private $ch;
private $cookie_path;
private $agent;
public function __construct($userId) {
$this->cookie_path = dirname(realpath(basename($_SERVER['PHP_SELF']))).'/cookies/' . $userId . '.txt';
$this->agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)";
}
private function init() {
$this->ch = curl_init();
}
private function close() {
curl_close ($this->ch);
}
private function setOptions($submit_url) {
curl_setopt($this->ch, CURLOPT_URL, $submit_url);
curl_setopt($this->ch, CURLOPT_USERAGENT, $this->agent);
curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, 1);
//curl_setopt($this->ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($this->ch, CURLOPT_COOKIEFILE, $this->cookie_path);
curl_setopt($this->ch, CURLOPT_COOKIEJAR, $this->cookie_path);
}
public function curl_cookie_set($submit_url) {
$this->init();
$this->setOptions($submit_url);
$result = curl_exec ($this->ch);
$this->close();
return $result;
}
public function curl_post_request($referer, $submit_url, $data) {
$this->init();
$this->setOptions($submit_url);
$post = http_build_query($data);
curl_setopt($this->ch, CURLOPT_POST, 1);
curl_setopt($this->ch, CURLOPT_POSTFIELDS, $post);
curl_setopt($this->ch, CURLOPT_REFERER, $referer);
$result = curl_exec ($this->ch);
$this->close();
return $result;
}
public function curl_clean() {
// cleans and closes the curl connection
if (file_exists($this->cookie_path)) {
unlink($this->cookie_path);
}
if ($this->ch != '') {
curl_close ($this->ch);
}
}
}
Here is the test script, the login details are for a throwaway account, so feel free to test with them:
$curl = new Curl(md5(1)); //(md5($_SESSION['userId']));
$referer = 'http://ebay.co.uk';
$submit_url = "http://signin.ebay.co.uk/aw-cgi/eBayISAPI.dll";
$data['userid'] = "VitoGambino-us";
$data['pass'] = "P0wqw12vi";
$data['MfcISAPICommand'] = 'SignInWelcome';
$data['siteid'] = '0';
$data['co_partnerId'] = '2';
$data['UsingSSL'] = '0';
$data['ru'] = '';
$data['pp'] = '';
$data['pa1'] = '';
$data['pa2'] = '';
$data['pa3'] = '';
$data['i1'] = '-1';
$data['pageType'] = '-1';
$curl->curl_cookie_set($referer);
$result = $curl->curl_post_request($referer, $submit_url, $data);
echo $result;
Here is what the cookie files contents are:
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
www.ebay.co.uk FALSE / FALSE 0 JSESSIONID BDE9B23B829CA7DF2CC4D5880F5173A6
.ebay.co.uk TRUE / FALSE 0 ebay %5Esbf%3D%23%5Ecv%3D15555%5E
.ebay.co.uk TRUE / FALSE 1431871451 dp1 bu1p/QEBfX0BAX19AQA**53776c5b^
#HttpOnly_.ebay.co.uk TRUE / FALSE 0 s CgAD4ACBRl4pbYjJjZDk1YTAxM2UwYTU2YjYzYzRhYmU0ZmY2ZjcyODYBSgAXUZeKWzUxOTYzOGI5LjMuMS43LjY2LjguMC4xuMWzLg**
.ebay.co.uk TRUE / FALSE 1400335451 nonsession CgADLAAFRlj/jMgDKACBa/DpbYjJjZDk1YTAxM2UwYTU2YjYzYzRhYmU0ZmY2ZjcyODcBTAAXU3dsWzUxOTYzOGI5LjMuMS42LjY1LjEuMC4xhVUTMQ**
.ebay.co.uk TRUE / FALSE 1526479451 lucky9 4551358

I was able to figure it out.
eBay uses a pretty tricky method for logging in. It's a combination of cookies, hidden fields and a javascript redirect after successful login.
Here's how I solved it.
Newly modified class:
class Curl {
private $ch;
private $cookie_path;
private $agent;
// userId will be used later to keep multiple users logged
// into ebay site at one time.
public function __construct($userId) {
$this->cookie_path = dirname(realpath(basename($_SERVER['PHP_SELF']))).'/cookies/' . $userId . '.txt';
$this->agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)";
}
private function init() {
$this->ch = curl_init();
}
private function close() {
curl_close ($this->ch);
}
// Set cURL options
private function setOptions($submit_url) {
$headers[] = "Accept: */*";
$headers[] = "Connection: Keep-Alive";
curl_setopt($this->ch, CURLOPT_URL, $submit_url);
curl_setopt($this->ch, CURLOPT_USERAGENT, $this->agent);
curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($this->ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($this->ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($this->ch, CURLOPT_COOKIEFILE, $this->cookie_path);
curl_setopt($this->ch, CURLOPT_COOKIEJAR, $this->cookie_path);
}
// Grab initial cookie data
public function curl_cookie_set($submit_url) {
$this->init();
$this->setOptions($submit_url);
curl_exec ($this->ch);
echo curl_error($this->ch);
}
// Grab hidden fields
public function get_form_fields($submit_url) {
curl_setopt($this->ch, CURLOPT_URL, $submit_url);
$result = curl_exec ($this->ch);
echo curl_error($this->ch);
return $this->getFormFields($result);
}
// Send login data
public function curl_post_request($referer, $submit_url, $data) {
$post = http_build_query($data);
curl_setopt($this->ch, CURLOPT_URL, $submit_url);
curl_setopt($this->ch, CURLOPT_POST, 1);
curl_setopt($this->ch, CURLOPT_POSTFIELDS, $post);
curl_setopt($this->ch, CURLOPT_REFERER, $referer);
$result = curl_exec ($this->ch);
echo curl_error($this->ch);
$this->close();
return $result;
}
// Show the logged in "My eBay" or any other page
public function show_page( $submit_url) {
curl_setopt($this->ch, CURLOPT_URL, $submit_url);
$result = curl_exec ($this->ch);
echo curl_error($this->ch);
return $result;
}
// Used to parse out form
private function getFormFields($data) {
if (preg_match('/(<form name="SignInForm".*?<\/form>)/is', $data, $matches)) {
$inputs = $this->getInputs($matches[1]);
return $inputs;
} else {
die('Form not found.');
}
}
// Used to parse out hidden field names and values
private function getInputs($form) {
$inputs = array();
$elements = preg_match_all('/(<input[^>]+>)/is', $form, $matches);
if ($elements > 0) {
for($i = 0; $i < $elements; $i++) {
$el = preg_replace('/\s{2,}/', ' ', $matches[1][$i]);
if (preg_match('/name=(?:["\'])?([^"\'\s]*)/i', $el, $name)) {
$name = $name[1];
$value = '';
if (preg_match('/value=(?:["\'])?([^"\'\s]*)/i', $el, $value)) {
$value = $value[1];
}
$inputs[$name] = $value;
}
}
}
return $inputs;
}
// Destroy cookie and close curl.
public function curl_clean() {
// cleans and closes the curl connection
if (file_exists($this->cookie_path)) {
unlink($this->cookie_path);
}
if ($this->ch != '') {
curl_close ($this->ch);
}
}
}
The actual code in use:
$curl = new Curl(md5(1)); //(md5($_SESSION['userId']));
$referer = 'http://ebay.com';
$formPage = 'http://signin.ebay.com/aw-cgi/eBayISAPI.dll?SignIn';
// Grab cookies from main page, ebay.com
$curl->curl_cookie_set($referer);
// Grab the hidden form fields and then set UsingSSL = 0
// Login with credentials and hidden fields
$data = $curl->get_form_fields($formPage);
$data['userid'] = "";
$data['pass'] = "";
$data['UsingSSL'] = '0';
// Post data to login page. Don't echo this result, there's a
// javascript redirect. Just do this to save login cookies
$formLogin = "https://signin.ebay.com/ws/eBayISAPI.dll?co_partnerId=2&siteid=3&UsingSSL=0";
$curl->curl_post_request($referer, $formLogin, $data);
// Use login cookies to load the "My eBay" page, voilà, you're logged in.
$result = $curl->show_page('http://my.ebay.com/ws/eBayISAPI.dll?MyeBay');
// take out JavaScript so it won't redirect to the actual eBay site
echo str_replace('<script', '<', $result);
I used some of the code posted here, thanks to drew010!
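
As an alternative to the regular expressions in getFormFields() and getInputs(), the hidden-field step could be sketched with PHP's DOM extension. This is not the code used above: formFields() is a hypothetical helper, and the sample form in $page is invented to keep the example self-contained.

```php
<?php
// Returns [name => value] for every named <input> inside the form whose
// name attribute equals $formName. Inputs without a value get ''.
function formFields(string $html, string $formName): array
{
    libxml_use_internal_errors(true); // tolerate invalid markup quietly
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Assumes $formName contains no double quotes.
    $query = sprintf('//form[@name="%s"]//input[@name]', $formName);
    $fields = [];
    foreach ($xpath->query($query) as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

// Invented sample; the real sign-in page has many more hidden fields.
$page = '<form name="SignInForm" action="/signin">'
      . '<input type="hidden" name="siteid" value="3">'
      . '<input type="hidden" name="ru" value="">'
      . '<input type="text" name="userid">'
      . '</form>';

$data = formFields($page, 'SignInForm');
print_r($data);
```

A parser-based version does not depend on attribute order or quoting style, which is where the regex approach tends to break when the page changes.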

Faking Post Request with PHP Curl - Rejection

I am trying to build a script that posts information into the RoyalMail tracking system and extracts the output.
What I currently have is getting an error from their server - see the link, somehow it is detecting that I am not using their website as per normal and throwing me an error.
Things I think I have taken into account:
Using an exact copy of their form by parsing it beforehand (the post parameters)
Saving the cookies between each request
Accepting redirect headers
Providing a refer header that is actually valid (the previously visited page)
Does anyone know anything else I need to check or can figure out what I am doing wrong?
A full copy of the source is at EDIT: please see my answer below

Websites commonly check two request headers to decide whether you are a human or a bot: the HTTP Referer and the User-Agent. I suggest you use cURL with a specified user agent and referer (replace 'http://something/' with the real URL of a page you would normally visit before navigating to the URL you want to download with PHP):
<?php
$url = 'http://track2.royalmail.com/portal/rm/track';
$html = file_get_contents2($url, '');
$post['_dyncharset'] = 'ISO-8859-1';
$post['trackConsigniaPage'] = 'track';
$post['/rmg/track/RMTrackFormHandler.value.searchCompleteUrl'] = '/portal/rm/trackresults?catId=22700601&pageId=trt_rmresultspage';
$post['_D:/rmg/track/RMTrackFormHandler.value.searchCompleteUrl'] = '';
$post['/rmg/track/RMTrackFormHandler.value.invalidInputUrl'] = '/portal/rm/trackresults?catId=22700601&pageId=trt_rmresultspage&keyname=track_blank';
$post['_D:/rmg/track/RMTrackFormHandler.value.invalidInputUrl'] = '';
$post['/rmg/track/RMTrackFormHandler.value.searchBusyUrl'] = '/portal/rm/trackresults?catId=22700601&pageId=trt_busypage&keyname=3E_track';
$post['_D:/rmg/track/RMTrackFormHandler.value.searchBusyUrl'] = '';
$post['/rmg/track/RMTrackFormHandler.value.searchWaitUrl'] = '/portal/rm/trackresults?catId=22700601&timeout=true&pageId=trt_timeoutpage&keyname=3E_track';
$post['_D:/rmg/track/RMTrackFormHandler.value.searchWaitUrl'] = '';
$post['/rmg/track/RMTrackFormHandler.value.keyname'] = '3E_track';
$post['_D:/rmg/track/RMTrackFormHandler.value.keyname'] = '';
$post['/rmg/track/RMTrackFormHandler.value.previousTrackingNumber'] = '';
$post['_D:/rmg/track/RMTrackFormHandler.value.previousTrackingNumber'] = '';
$post['/rmg/track/RMTrackFormHandler.value.trackingNumber'] = 'ZW791944749GB';
$post['_D:/rmg/track/RMTrackFormHandler.value.trackingNumber'] = '';
$post['/rmg/track/RMTrackFormHandler.track.x'] = '50';
$post['/rmg/track/RMTrackFormHandler.track.y'] = '14';
$post['_D:/rmg/track/RMTrackFormHandler.track'] = '';
$post['/rmg/track/RMTrackFormHandler.value.day'] = '19';
$post['_D:/rmg/track/RMTrackFormHandler.value.day'] = '';
$post['/rmg/track/RMTrackFormHandler.value.month'] = '5';
$post['_D:/rmg/track/RMTrackFormHandler.value.month'] = '';
$post['/rmg/track/RMTrackFormHandler.value.year'] = '2012';
$post['_D:/rmg/track/RMTrackFormHandler.value.year'] = '';
$post['_DARGS'] = '/portal/rmgroup/apps/templates/html/rm/rmTrackResultPage.jsp';
$url2 = 'http://track2.royalmail.com/portal/rm?_DARGS=/portal/rmgroup/apps/templates/html/rm/rmTrackAndTraceForm.jsp';
$html2 = file_get_contents2($url2, $url, $post);
echo $html2;
function file_get_contents2($address, $referer, $post = false)
{
$useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1";
$c = curl_init();
curl_setopt($c, CURLOPT_URL, $address);
curl_setopt($c, CURLOPT_USERAGENT, $useragent);
curl_setopt($c, CURLOPT_HEADER, 0);
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
if ($post)
{
$postF = http_build_query($post);
curl_setopt($c, CURLOPT_POST, true);
curl_setopt($c, CURLOPT_POSTFIELDS, $postF);
}
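curl_setopt($c, CURLOPT_COOKIEJAR, 'cookie.txt'); // write received cookies here
curl_setopt($c, CURLOPT_COOKIEFILE, 'cookie.txt'); // send them back on later requests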
//curl_setopt($c, CURLOPT_FRESH_CONNECT, 1);
curl_setopt($c, CURLOPT_REFERER, $referer);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, 1);
if (!$data = curl_exec($c))
{
return false;
}
return $data;
}
The above updated code returned me:
Item ZW791944749GB was posted at 1 High Street RG17 9TJ on 19/05/12 and is being progressed through our network for delivery.
So it seems it works.

I have now fixed it. The problem was with PHP cURL following redirects: when it follows a redirect it doesn't always re-post the request data and can send a GET request instead.
To deal with this I disabled automatic redirect following with curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); and built a redirect follower myself that works recursively. Essentially it extracts the Location header from the response, checks for a 301 or a 302, and then runs the method again as required.
This means the information will definitely be POSTed again.
I also improved the user agent string by simply copying my current one, on the basis that it won't be blocked for a long while since, as of 2012, it is in active use!
Here is a final copy of the curl class (in case the link dies; I've been downvoted for that in the past), which is working:
/**
* Make a curl request respecting redirects
* Also supports posts
*/
class pegCurlRequest {
private $url, $postFields = array(), $referer = NULL, $timeout = 3;
private $debug = false, $postString = "";
private $curlInfo = array();
private $content = "";
private $response_meta_info = array();
static $cookie;
function __construct($url, $postFields = array(), $referer = NULL, $timeout = 3) {
$this->setUrl($url);
$this->setPost($postFields);
$this->setReferer($referer);
$this->setTimeout($timeout);
if(empty(self::$cookie)) self::$cookie = tempnam("/tmp", "pegCurlRequest"); //one time cookie
}
function setUrl($url) {
$this->url = $url;
}
function setTimeout($timeout) {
$this->timeout = $timeout;
}
function setPost($postFields) {
if(is_array($postFields)) {
$this->postFields = $postFields;
}
$this->updatePostString();
}
function updatePostString() {
//Cope with posting: http_build_query() URL-encodes keys and values and joins them with '&'
$this->postString = "";
if(!empty($this->postFields)) {
$this->postString = http_build_query($this->postFields);
}
}
function setReferer($referer) {
//Set a referee either specified or based on the url
$this->referer = $referer;
}
function debugInfo() {
//Debug
if($this->debug) {
echo "<table><tr><td colspan='2'><b><u>Pre Curl Request</u></b></td></tr>";
echo "<tr><td><b>URL: </b></td><td>{$this->url}</td></tr>";
if(!empty(self::$cookie)) echo "<tr><td><b>Cookie String: </b></td><td>".self::$cookie."</td></tr>";
if(!empty($this->referer)) echo "<tr><td><b>Referer: </b></td><td>".$this->referer."</td></tr>";
if(!empty($this->postString)) echo "<tr><td><b>Post String: </b></td><td>".$this->postString."</td></tr>";
if(!empty($this->postFields)) {
echo "<tr><td><b>Post Values:</b></td><td><table>";
foreach($this->postFields as $key=>$value)
echo "<tr><td>$key</td><td>$value</td></tr>";
echo "</table>";
}
echo "</td></tr></table><br />\n";
}
}
function debugFurtherInfo() {
//Debug
if($this->debug) {
echo "<table><tr><td colspan='2'><b><u>Post Curl Request</u></b></td></tr>";
echo "<tr><td><b>URL: </b></td><td>{$this->url}</td></tr>";
if(!empty($this->referer)) echo "<tr><td><b>Referer: </b></td><td>".$this->referer."</td></tr>";
if(!empty($this->curlInfo)) {
echo "<tr><td><b>Curl Info:</b></td><td><table>";
foreach($this->curlInfo as $key=>$value)
echo "<tr><td>$key</td><td>$value</td></tr>";
echo "</table>";
}
echo "</td></tr></table><br />\n";
}
}
/**
* Make the actual request
*/
function makeRequest($url=NULL) {
//Shorthand request
if(!is_null($url))
$this->setUrl($url);
//Output debug info
$this->debugInfo();
//Using a shared cookie
$cookie = self::$cookie;
//Setting up the starting information
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 Safari/536.11" );
curl_setopt($ch, CURLOPT_URL, $this->url);
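curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie); // save cookies when the handle closes
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie); // load them again on the next request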
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_ENCODING, "gzip");
//register a callback function which will process the headers
//this assumes your code is into a class method, and uses $this->readHeader as the callback //function
curl_setopt($ch, CURLOPT_HEADERFUNCTION, array(&$this,'readHeader'));
//Some servers (like Lighttpd) will not process the curl request without this header and will return error code 417 instead.
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Expect:"));
//Referer
if(empty($this->referer)) {
curl_setopt($ch, CURLOPT_REFERER, dirname($this->url));
} else {
curl_setopt($ch, CURLOPT_REFERER, $this->referer);
}
//Posts
if(!empty($this->postFields)) {
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $this->postString);
}
//Redirects, transfers and timeouts
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $this->timeout);
curl_setopt($ch, CURLOPT_TIMEOUT, $this->timeout);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
//Debug
if($this->debug) {
curl_setopt($ch, CURLOPT_VERBOSE, true); // logging stuffs
curl_setopt($ch, CURLINFO_HEADER_OUT, true); // enable tracking
}
//Get the content and the header info
$content = curl_exec($ch);
$response = curl_getinfo($ch);
//get the default response headers
$headers = curl_getinfo($ch);
//add the headers from the custom headers callback function
$this->response_meta_info = array_merge($headers, $this->response_meta_info);
curl_close($ch); //be nice
//Curl info
$this->curlInfo = $response;
//Output debug info
$this->debugFurtherInfo();
//Are we being redirected?
if ($response['http_code'] == 301 || $response['http_code'] == 302) {
$location = $this->getHeaderLocation();
if(!empty($location)) { //the location exists
$this->setReferer($this->getTrueUrl()); //update referer
return $this->makeRequest($location); //recurse to location
}
}
//Is there a javascript redirect on the page?
elseif (preg_match("/window\.location\.replace\('(.*)'\)/i", $content, $value) ||
preg_match("/window\.location\=\"(.*)\"/i", $content, $value)) {
$this->setReferer($this->getTrueUrl()); //update referer
return $this->makeRequest($value[1]); //recursion
} else {
$this->content = $content; //set the content - final page
}
}
/**
* Get the url after any redirection
*/
function getTrueUrl() {
return $this->curlInfo['url'];
}
function __toString() {
return $this->content;
}
/**
* CURL callback function for reading and processing headers
* Override this for your needs
*
* @param object $ch
* @param string $header
* @return integer
*/
private function readHeader($ch, $header) {
//This is run for every header, use ifs to grab and add
$location = $this->extractCustomHeader('Location: ', '\n', $header);
if ($location) {
$this->response_meta_info['location'] = trim($location);
}
return strlen($header);
}
private function extractCustomHeader($start,$end,$header) {
$pattern = '/'. $start .'(.*?)'. $end .'/';
if (preg_match($pattern, $header, $result)) {
return $result[1];
} else {
return false;
}
}
function getHeaders() {
return $this->response_meta_info;
}
function getHeaderLocation() {
return $this->response_meta_info['location'];
}
}
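
The core of the manual redirect handling described in this answer can be shown in a much smaller form. The sketch below is not the class above: redirectTarget() and postFollowingRedirects() are hypothetical names, and the actual HTTP call is passed in as a callable ($send) so the loop can be run here with a stub instead of a real cURL request. It assumes the Location header holds an absolute URL.

```php
<?php
// Returns the redirect target from a raw response header block, or null
// when the response is not a 301/302 or carries no Location header.
function redirectTarget(string $rawHeaders): ?string
{
    if (!preg_match('#^HTTP/\S+\s+(\d{3})#', $rawHeaders, $status)) {
        return null;
    }
    if (!in_array((int) $status[1], [301, 302], true)) {
        return null;
    }
    if (preg_match('/^Location:\s*(.+?)\s*$/mi', $rawHeaders, $m)) {
        return $m[1];
    }
    return null;
}

// Re-sends the same POST fields to every redirect target.
// $send is a hypothetical callable: fn(string $url, array $fields): array,
// returning [rawHeaders, body]; in real use it would wrap a cURL request
// made with CURLOPT_FOLLOWLOCATION set to false.
function postFollowingRedirects(callable $send, string $url, array $fields, int $maxHops = 10): string
{
    for ($hop = 0; $hop <= $maxHops; $hop++) {
        [$headers, $body] = $send($url, $fields);
        $next = redirectTarget($headers);
        if ($next === null) {
            return $body; // final page reached
        }
        $url = $next; // assumes an absolute Location URL
    }
    throw new RuntimeException('Too many redirects');
}

// Stub sender for demonstration: the first hop redirects, the second answers.
$log = [];
$stub = function (string $url, array $fields) use (&$log): array {
    $log[] = [$url, $fields];
    if (count($log) === 1) {
        return ["HTTP/1.1 302 Found\r\nLocation: http://example.test/results\r\n\r\n", ''];
    }
    return ["HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n", '<p>tracked</p>'];
};

$page = postFollowingRedirects($stub, 'http://example.test/track', ['trackingNumber' => 'X1']);
echo $page, "\n";
```

libcurl also has a CURLOPT_POSTREDIR option that asks it to keep the POST method across 301/302 redirects; that may be worth trying before writing a custom follower, though it was not tested against this site.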

Well first of all, you are talking about the Royal Mail, so I'm not sure this simple little trick would trip them up...
But what you could try is spoofing your user agent with a quick ini_set(). Note that this setting applies to PHP's stream functions such as file_get_contents(); with cURL, set CURLOPT_USERAGENT instead.
ini_set('user_agent', 'Mozilla/5.0 (X11; CrOS i686 1660.57.0) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.46 Safari/535.19');
That's a Chrome user agent string from Chrome OS (CrOS).
The cURL user agent string would look quite different. For example:
curl/7.15.5 (i686-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
It's a long shot - but they might be rejecting requests that are not originating from recognized browsers.

PHP: cURL login with hash bypass

I am trying to log in through a form which has a hidden hash field. The problem is that when I cURL the page to get the hash and then include it as a POST value in my next cURL call (to the same page), the hash is no longer valid: to the server, the second cURL call looks like a page refresh, so it has already generated a new hash.
So how do I get the hash without simulating a refreshed page?
here is my sample code:
<?php
$la = new LoginAuth('http://site.tld/auth.php', 'username', 'password');
$result = $la->auth(0);
echo $result;
class LoginAuth
{
public $url;
public $usr;
public $pwd;
public $status;
private $last_url;
public function __construct($url, $usr, $pwd)
{
$this->url = $url;
$this->usr= $usr;
$this->pwd= $pwd;
}
public function get_hash()
{
$output = $this->curl($this->url, $this->last_url);
$hash = $this->match('!<input.*?name="hash".*?value="(.*?)"!ms', $output, 1);
return $hash;
}
public function auth($server)
{
$hash = $this->get_hash();
$auth_data = 'username=' . $this->usr . '&password=' . $this->pwd . '&server=' . $server . '&hash=' . $hash;
$output = $this->curl($this->url, $this->last_url, $auth_data);
$this->status = $output;
return $output;
}
private function curl($url, $referer = null, $post_param = null)
{
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (iPhone; U; CPU iPhone OS 2_2_1 like Mac OS X; en-us) AppleWebKit/525.18.1 (KHTML, like Gecko) Version/3.1.1 Mobile/5H11 Safari/525.20");
if($referer)
curl_setopt($ch, CURLOPT_REFERER, $referer);
if(!is_null($post_param))
{
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_param);
}
$html = curl_exec($ch);
$this->last_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
return $html;
}
private function match($regex, $str, $out_ary = 0)
{
return preg_match($regex, $str, $match) == 1 ? $match[$out_ary] : false;
}
}
/* End of file auth.php */
/* Location: ./auth.php */

The server is probably sending you a Set-Cookie header with a session id. It stores the hash on its side, tied to that session, and compares the hash you submit against the stored one only if you send the session cookie back to it.
You'll need to read the session cookie out of the get_hash() response, and then submit it back in your auth() call.
I'd fire up Firebug and check the headers being sent back and forth when you do it by hand; there may be some other important ones as well.
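
To illustrate the idea, here is a rough sketch of pulling cookies out of a raw response header block and turning them into a value for the next request. cookiesFromHeaders() and cookieHeader() are hypothetical helpers, and the header text is invented; getting the raw headers in the first place would require CURLOPT_HEADER to be enabled in the question's curl() method.

```php
<?php
// Collects name => value pairs from every Set-Cookie line in a raw header
// block. Attributes such as Path or HttpOnly are ignored.
function cookiesFromHeaders(string $rawHeaders): array
{
    $cookies = [];
    // A line looks like "Set-Cookie: name=value; Path=/; HttpOnly"
    if (preg_match_all('/^Set-Cookie:\s*([^=;\s]+)=([^;\r\n]*)/mi', $rawHeaders, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $cookies[$match[1]] = $match[2];
        }
    }
    return $cookies;
}

// Builds the "name=value; name2=value2" string expected by CURLOPT_COOKIE.
function cookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}

// Invented response headers standing in for the get_hash() response.
$headers = "HTTP/1.1 200 OK\r\nSet-Cookie: PHPSESSID=abc123; path=/\r\nSet-Cookie: lang=en\r\n\r\n";

$cookies = cookiesFromHeaders($headers);
$cookieLine = cookieHeader($cookies);
echo $cookieLine, "\n";
// In auth(), the value could then be sent back with:
// curl_setopt($ch, CURLOPT_COOKIE, $cookieLine);
```

In practice it is usually simpler to let cURL do this itself by setting CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE to the same file on both requests, as the classes elsewhere on this page do.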
