I am currently trying to manipulate dom throuhg php to extract views from an fb video page. The below code was working until a bit ago. However now it doesnt find the node that contains the views count. This information is inside a div with id fbPhotoPageMediaInfo. What would be the best way to manipulate the dom through php to get views of an fb video page?
private function _callCurl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Linux; Android 5.0.1; SAMSUNG-SGH-I337 Build/LRX22C; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
curl_setopt($ch, CURLOPT_URL, $url);
$response = curl_exec($ch);
$http = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
return array(
$http,
$response,
);
}
function test()
{
$url = "https://www.facebook.com/TaylorSwift/videos/10153665021155369/";
$request = callCurl($url);
if ($request[0] == 200) {
$dom = new DOMDocument();
#$dom->loadHTML($request[1]);
$elm = $dom->getElementById('fbPhotoPageMediaInfo');
if (isset($elm->nodeValue)) {
$views = preg_replace('/[^0-9]/', '', $elm->nodeValue);
} else {
$views = null;
}
} else {
echo "Error!";
}
return isset($views) ? $views : null;
}
Here is what I've determined...
If you var_dump() on $request you can see that it's giving you a 302 code (redirect) rather than a 200 (ok).
Changing CURLOPT_FOLLOWLOCATION to true or commenting it out entirely makes the error go away, but now we're getting a different page from the one expected.
I ran the following to see where I was being redirected to:
$htm = file_get_contents("https://www.facebook.com/TaylorSwift/videos/10153665021155369/");
var_dump($htm);
This gave me a page saying I was using an outdated browser, and needed to update it. So apparently Facebook doesn't like the User Agent.
I updated it as follows:
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/44.0.2');
That appears to solve the problem.
Personally I prefer to use Simplehtmldom.
FB like other high traffic sites do update their source to help prevent scraping. You may in the future have to adjust your node search.
<?php
$ua = "Mozilla/5.0 (Windows NT 5.0) AppleWebKit/5321 (KHTML, like Gecko) Chrome/13.0.872.0 Safari/5321"; // must be a valid User Agent
ini_set('user_agent', $ua);
require_once('simplehtmldom/simple_html_dom.php'); // http://simplehtmldom.sourceforge.net/
Function Scrape_FB_Views($url) {
IF (!filter_var($url, FILTER_VALIDATE_URL) === false) {
// Create DOM from URL
$html = file_get_html($url);
IF ($html) {
IF (($html->find('span[class=fcg]', 3))) { // 4th instance of span with fcg class
$text = trim($html->find('span[class=fcg]', 3)->plaintext); // get content of span as plain text
$result = preg_replace('/[^0-9]/', '', $text); // replace all non-numeric characters
}ELSE{
$result = "Node is no longer valid."
}
}ELSE{
$result = "Could not get HTML.";
}
}ELSE{
$result = "URL is invalid.";
}
return $result;
}
$url = "https://www.facebook.com/TaylorSwift/videos/10153665021155369/";
echo("<p>".Scrape_FB_Views($url)."</p>");
?>
Related
I implemented this function in order to parse HTML pages using two different "methods".
As you can see both are using the very handy class called simple_html_dom.
The difference is the first method is also using curl to load the HTML while the second is not using curl
Both methods are working fine on a lot of pages but I'm struggling with this specific call:
searchThroughDOM('https://fr.shopping.rakuten.com/offer/buy/3458931181/new-york-1997-4k-ultra-hd-blu-ray-blu-ray-bonus-edition-boitier-steelbook.html', 'simple_html_dom');
In both cases, I end up with a 403 access denied response.
Did I do something wrong?
Or is there another method in order to avoid this type of denial?
function searchThroughDOM ($url, $method)
{
echo '$url = '.$url.'<br>'.'$method = '.$method.'<br><br>';
$time_start = microtime(true);
switch ($method) {
case 'curl':
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_REFERER, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36');
$str = curl_exec($curl);
curl_close($curl);
// Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load($str);
break;
case 'simple_html_dom':
$html = new simple_html_dom();
$html->load_file($url);
break;
}
$collection = $html->find('h1');
foreach($collection as $x => $x_value) {
echo 'x = '.$x.' => value = '.$x_value.'<br>';
}
$html->save('result.htm');
$html->clear();
$time_end = microtime(true);
echo 'Elapsed Time (DOM) = '.($time_end - $time_start).'<br><br>';
}
From my point of view , there is nothing wrong with "simple_html_dom"
you may remove the simple html dom "part" of the code , leave only for the CURL
which I assume is the source of the problem.
There are lots of reasons cause the curl Not working on page
first of all I can see you add
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
you should also try to add CURLOPT_SSL_VERIFYHOST , false
Secondly , check your curl version, see if it is too old
third option, if none of above working , you may want to enable cookie , it may possible the cookie disabled cause the website detect it is machine, not real person send the request .
lastly , if all above attempt failed , try other library or even file_get_content ,
Curl is not your only option, of cause it is the most powerful one.
I'm trying to get reviews in Google Business. The goal is to get access via curl and then get value from pane.rating.moreReviews label jsaction.
How I can fix code below to get curl?
function curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36');
$html = curl_exec($ch);
curl_close($ch);
return $html;
}
$html = curl("https://www.google.com/maps?cid=12909283986953620003");
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$finder = new DomXPath($DOM);
$classname = 'pane.rating.moreReviews';
$nodes = $finder->query("//*[contains(#jsaction, '$classname')]");
foreach ($nodes as $node) {
$check_reviews = $node->nodeValue;
$ses_key = preg_replace('/[^0-9]+/', '', $check_reviews);
}
// result should be: 166
echo $ses_key;
If I try do var_dump($html);, I'm getting:
string(348437) " "
And this number is changing on each page refresh.
Get Google-Reviews with PHP cURL & without API Key
How to find the CID - If you have the business open in Google Maps:
Do a search in Google Maps for the business name
Make sure it’s the only result that shows up.
Replace http:// with view-source: in the URL
Click CTRL+F and search the source code for “ludocid”
CID will be the numbers after “ludocid\u003d” and till the last number
or use this tool: https://ryanbradley.com/tools/google-cid-finder/
Example
ludocid\\u003d16726544242868601925\
HINT: Use the class ".quote" in you CSS to style the output
The PHP
<?php
/*
💬 Get Google-Reviews with PHP cURL & without API Key
=====================================================
How to find the CID - If you have the business open in Google Maps:
- Do a search in Google Maps for the business name
- Make sure it’s the only result that shows up.
- Replace http:// with view-source: in the URL
- Click CTRL+F and search the source code for “ludocid”
- CID will be the numbers after “ludocid\\u003d” and till the last number
or use this tool: https://pleper.com/index.php?do=tools&sdo=cid_converter
Example
-------
```TXT
ludocid\\u003d16726544242868601925\
```
> HINT: Use the class ".quote" in you CSS to style the output
###### Copyright 2019 Igor Gaffling
*/
$cid = '16726544242868601925'; // The CID you want to see the reviews for
$show_only_if_with_text = false; // true OR false
$show_only_if_greater_x = 0; // 0-4
$show_rule_after_review = false; // true OR false
/* ------------------------------------------------------------------------- */
$ch = curl_init('https://www.google.com/maps?cid='.$cid);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla / 5.0 (Windows; U; Windows NT 5.1; en - US; rv:1.8.1.6) Gecko / 20070725 Firefox / 2.0.0.6");
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
$result = curl_exec($ch);
curl_close($ch);
$pattern = '/window\.APP_INITIALIZATION_STATE(.*);window\.APP_FLAGS=/ms';
if ( preg_match($pattern, $result, $match) ) {
$match[1] = trim($match[1], ' =;'); // fix json
$reviews = json_decode($match[1]);
$reviews = ltrim($reviews[3][6], ")]}'"); // fix json
$reviews = json_decode($reviews);
//$customer = $reviews[0][1][0][14][18];
//$reviews = $reviews[0][1][0][14][52][0];
$customer = $reviews[6][18]; // NEW IN 2020
$reviews = $reviews[6][52][0]; // NEW IN 2020
}
if (isset($reviews)) {
echo '<div class="quote"><strong>'.$customer.'</strong><br>';
foreach ($reviews as $review) {
if ($show_only_if_with_text == true and empty($review[3])) continue;
if ($review[4] <= $show_only_if_greater_x) continue;
for ($i=1; $i <= $review[4]; ++$i) echo '⭐'; // RATING
if ($show_blank_star_till_5 == true)
for ($i=1; $i <= 5-$review[4]; ++$i) echo '☆'; // RATING
echo '<p>'.$review[3].'<br>'; // TEXT
echo '<small>'.$review[0][1].'</small></p>'; // AUTHOR
if ($show_rule_after_review == true) echo '<hr size="1">';
}
echo '</div>';
}
Source: https://github.com/gaffling/PHP-Grab-Google-Reviews
Please try below code
$html = curl("https://maps.googleapis.com/maps/api/place/details/json?cid=12909283986953620003&key=<google_apis_key>", "Mozilla 5.0");
$datareview = json_decode($html);// get all data in array
Ex. : http://meetingwords.com/QiIN1vaIuY
It will work for you.
Create Google Key From google console developer
https://developers.google.com/maps/documentation/embed/get-api-key
I get page contents with this php code:
but I don't know why server access this page with old browser version
$url = $target_domain . 'http://www.facebook.com';
//Download page
$site = file_get_contents($url);
$dom = DOMDocument::loadHTML($site);
if($dom instanceof DOMDocument) {
// find <head> tag
$head_tag_list = $dom->getElementsByTagName('head');
// there should only be one <head> tag
if($head_tag_list->length !== 1) {
throw new Exception('Wow! The HTML is malformed without single head tag.');
}
$head_tag = $head_tag_list->item(0);
// find first child of head tag to later use in insertion
$head_has_children = $head_tag->hasChildNodes();
if($head_has_children) {
$head_tag_first_child = $head_tag->firstChild;
}
// create new <base> tag
$base_element = $dom->createElement('base');
$base_element->setAttribute('href', $target_domain);
// insert new base tag as first child to head tag
if($head_has_children) {
$base_node = $head_tag->insertBefore($base_element, $head_tag_first_child);
} else {
$base_node = $head_tag->appendChild($base_element);
}
echo $dom->saveHTML();
} else {
// something went wrong in loading HTML to DOM Document
// provide error messaging
}
?>
take a look on photo :
";
please help me how to access with new version of browsers.
Instead of using file_get_contents which is little poor, you can use the Curl which we can add useful parameters, in your case you should indicate a new UserAgent in the request, try this this simple exemple :
function curl_download($Url) {
// is cURL installed yet?
if (!function_exists('curl_init')) {
die('Sorry cURL is not installed!');
}
// OK cool - then let's create a new cURL resource handle
$ch = curl_init();
// Now set some options (most are optional)
// Set URL to download
curl_setopt($ch, CURLOPT_URL, $Url);
// User agent
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36");
// Include header in result? (0 = yes, 1 = no)
curl_setopt($ch, CURLOPT_HEADER, 0);
// Should cURL return or print out the data? (true = return, false = print)
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
// Timeout in seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
// Download the given URL, and return output
$output = curl_exec($ch);
// Close the cURL resource, and free system resources
curl_close($ch);
return $output;
}
I'm using php and cURL to scrape a web page, but this web page is poorly designed (as in no classes or ids on tags), so I need to search for specific text, then go to the tag holding it (ie <p>) then move to the next child (or next <p>) and get the text.
There are various things I need to get from the page, some also being the text within an <a onclick="get this stuff here">. So basically I feel that I need to use cURL to scrape the source code to a php variable, then I can use php to kind of parse through and find the stuff I need.
Does this sound like the best method to do this? Does anyone have any pointers or can demonstrate how I can put source code from cURL into a variable?
Thanks!
EDIT (Working/Current Code) -----------
<?php
class Scrape
{
public $cookies = 'cookies.txt';
private $user = null;
private $pass = null;
/*Data generated from cURL*/
public $content = null;
public $response = null;
/* Links */
private $url = array(
'login' => 'https://website.com/login.jsp',
'submit' => 'https://website.com/LoginServlet',
'page1' => 'https://website.com/page1',
'page2' => 'https://website.com/page2',
'page3' => 'https://website.com/page3'
);
/* Fields */
public $data = array();
public function __construct ($user, $pass)
{
$this->user = $user;
$this->pass = $pass;
}
public function login()
{
$this->cURL($this->url['login']);
if($form = $this->getFormFields($this->content, 'login'))
{
$form['login'] = $this->user;
$form['password'] =$this->pass;
// echo "<pre>".print_r($form,true);exit;
$this->cURL($this->url['submit'], $form);
//echo $this->content;//exit;
}
//echo $this->content;//exit;
}
// NEW TESTING
public function loadPage($page)
{
$this->cURL($this->url[$page]);
echo $this->content;//exit;
}
/* Scan for form */
private function getFormFields($data, $id)
{
if (preg_match('/(<form.*?name=.?'.$id.'.*?<\/form>)/is', $data, $matches)) {
$inputs = $this->getInputs($matches[1]);
return $inputs;
} else {
return false;
}
}
/* Get Inputs in form */
private function getInputs($form)
{
$inputs = array();
$elements = preg_match_all('/(<input[^>]+>)/is', $form, $matches);
if ($elements > 0) {
for($i = 0; $i < $elements; $i++) {
$el = preg_replace('/\s{2,}/', ' ', $matches[1][$i]);
if (preg_match('/name=(?:["\'])?([^"\'\s]*)/i', $el, $name)) {
$name = $name[1];
$value = '';
if (preg_match('/value=(?:["\'])?([^"\']*)/i', $el, $value)) {
$value = $value[1];
}
$inputs[$name] = $value;
}
}
}
return $inputs;
}
/* Perform curl function to specific URL provided */
public function cURL($url, $post = false)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13");
// "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, $this->cookies);
curl_setopt($ch, CURLOPT_COOKIEFILE, $this->cookies);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
if($post) //if post is needed
{
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
}
curl_setopt($ch, CURLOPT_URL, $url);
$this->content = curl_exec($ch);
$this->response = curl_getinfo( $ch );
$this->url['last_url'] = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
}
}
$sc = new Scrape('user','pass');
$sc->login();
$sc->loadPage('page1');
echo "<h1>TESTTESTEST</h1>";
$sc->loadPage('page2');
echo "<h1>TESTTESTEST</h1>";
$sc->loadPage('page3');
echo "<h1>TESTTESTEST</h1>";
(note: credit to #Ramz scrape a website with secured login)
You can divide your problem in several parts.
Retrieving the data from the data source.
For that, you can possibly use CURL or file_get_contents() depending on your requirements. Code examples are everywhere. http://php.net/manual/en/function.file-get-contents.php and http://php.net/manual/en/curl.examples-basic.php
Parsing the retrieved data.
For that, i would start by looking into "PHP Simple HTML DOM Parser" You can use it to extract data from an HTML string. http://simplehtmldom.sourceforge.net/
Building and generating the output.
This is simply a question of what you want to do with the data that you have extracted. For example, you can print it, reformat it, or store it to a database/file.
I suggest you use a rready made scaper. I use Goutte (https://github.com/FriendsOfPHP/Goutte) which allows me to load website content and traverse it in the same way you do with jQuery. i.e. if I want the content of the <div id="content"> I use $client->filter('#content')->text()
It even allows me to find and 'click' on links and submit forms to retreive and process the content.
It makes life soooooooo mucn easier than using cURL or file_get_contentsa() and working your way through the html manually
I am trying to build a script that posts information into the RoyalMail tracking system and extracts the output.
What I currently have is getting an error from their server - see the link, somehow it is detecting that I am not using their website as per normal and throwing me an error.
Things I think I have taken into account:
Using an exact copy of their form by parsing it beforehand (the post parameters)
Saving the cookies between each request
Accepting redirect headers
Providing a refer header that is actually valid (the previously visited page)
Does anyone know anything else I need to check or can figure out what I am doing wrong?
A full copy of the source is at EDIT: please see my answer below
Websites usually use 2 ways to detect if you are a human or a bot: HTTP REFERER and USER AGENT. I suggest you use Curl it specified user agent and referer (replace 'http://something/' with real URL of a page you would normally visit before navigating to the url you want to download with PHP):
<?php
$url = 'http://track2.royalmail.com/portal/rm/track';
$html = file_get_contents2($url, '');
$post['_dyncharset'] = 'ISO-8859-1';
$post['trackConsigniaPage'] = 'track';
$post['/rmg/track/RMTrackFormHandler.value.searchCompleteUrl'] = '/portal/rm/trackresults?catId=22700601&pageId=trt_rmresultspage';
$post['_D:/rmg/track/RMTrackFormHandler.value.searchCompleteUrl'] = '';
$post['/rmg/track/RMTrackFormHandler.value.invalidInputUrl'] = '/portal/rm/trackresults?catId=22700601&pageId=trt_rmresultspage&keyname=track_blank';
$post['_D:/rmg/track/RMTrackFormHandler.value.invalidInputUrl'] = '';
$post['/rmg/track/RMTrackFormHandler.value.searchBusyUrl'] = '/portal/rm/trackresults?catId=22700601&pageId=trt_busypage&keyname=3E_track';
$post['_D:/rmg/track/RMTrackFormHandler.value.searchBusyUrl'] = '';
$post['/rmg/track/RMTrackFormHandler.value.searchWaitUrl'] = '/portal/rm/trackresults?catId=22700601&timeout=true&pageId=trt_timeoutpage&keyname=3E_track';
$post['_D:/rmg/track/RMTrackFormHandler.value.searchWaitUrl'] = '';
$post['/rmg/track/RMTrackFormHandler.value.keyname'] = '3E_track';
$post['_D:/rmg/track/RMTrackFormHandler.value.keyname'] = '';
$post['/rmg/track/RMTrackFormHandler.value.previousTrackingNumber'] = '';
$post['_D:/rmg/track/RMTrackFormHandler.value.previousTrackingNumber'] = '';
$post['/rmg/track/RMTrackFormHandler.value.trackingNumber'] = 'ZW791944749GB';
$post['_D:/rmg/track/RMTrackFormHandler.value.trackingNumber'] = '';
$post['/rmg/track/RMTrackFormHandler.track.x'] = '50';
$post['/rmg/track/RMTrackFormHandler.track.y'] = '14';
$post['_D:/rmg/track/RMTrackFormHandler.track'] = '';
$post['/rmg/track/RMTrackFormHandler.value.day'] = '19';
$post['_D:/rmg/track/RMTrackFormHandler.value.day'] = '';
$post['/rmg/track/RMTrackFormHandler.value.month'] = '5';
$post['_D:/rmg/track/RMTrackFormHandler.value.month'] = '';
$post['/rmg/track/RMTrackFormHandler.value.year'] = '2012';
$post['_D:/rmg/track/RMTrackFormHandler.value.year'] = '';
$post['_DARGS'] = '/portal/rmgroup/apps/templates/html/rm/rmTrackResultPage.jsp';
$url2 = 'http://track2.royalmail.com/portal/rm?_DARGS=/portal/rmgroup/apps/templates/html/rm/rmTrackAndTraceForm.jsp';
$html2 = file_get_contents2($url2, $url, $post);
echo $html2;
function file_get_contents2($address, $referer, $post = false)
{
$useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1";
$c = curl_init();
curl_setopt($c, CURLOPT_URL, $address);
curl_setopt($c, CURLOPT_USERAGENT, $useragent);
curl_setopt($c, CURLOPT_HEADER, 0);
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
if ($post)
{
$postF = http_build_query($post);
curl_setopt($c, CURLOPT_POST, true);
curl_setopt($c, CURLOPT_POSTFIELDS, $postF);
}
curl_setopt($c, CURLOPT_COOKIEJAR, 'cookie.txt');
//curl_setopt($c, CURLOPT_FRESH_CONNECT, 1);
curl_setopt($c, CURLOPT_REFERER, $referer);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, 1);
if (!$data = curl_exec($c))
{
return false;
}
return $data;
}
The above updated code returned me:
Item ZW791944749GB was posted at 1 High Street RG17 9TJ on 19/05/12 and is being progressed through our network for delivery.
So it seems it works.
I have now fixed it, the problem was with PHP curl and following redirects, it seems that it doesn't always post the request data and sends a GET request when following.
To deal with this I disabled curl follow location with curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); and then built a follow location system myself that works recursively. Essentially it extracts the location header from the response, checks for a 301 or a 302 and then runs the method again as required.
This means the information will definitely be POSTED again.
I also improved the user agent string, simply copying my current one on the basis it won't be blocked for a long while as in 2012 it is in active use!
Here is a final copy of the curl class (in case the link dies - been down voted for that in the past) which is working:
/**
* Make a curl request respecting redirects
* Also supports posts
*/
class pegCurlRequest {
private $url, $postFields = array(), $referer = NULL, $timeout = 3;
private $debug = false, $postString = "";
private $curlInfo = array();
private $content = "";
private $response_meta_info = array();
static $cookie;
function __construct($url, $postFields = array(), $referer = NULL, $timeout = 3) {
$this->setUrl($url);
$this->setPost($postFields);
$this->setReferer($referer);
$this->setTimeout($timeout);
if(empty(self::$cookie)) self::$cookie = tempnam("/tmp", "pegCurlRequest"); //one time cookie
}
function setUrl($url) {
$this->url = $url;
}
function setTimeout($timeout) {
$this->timeout = $timeout;
}
function setPost($postFields) {
if(is_array($postFields)) {
$this->postFields = $postFields;
}
$this->updatePostString();
}
function updatePostString() {
//Cope with posting
$this->postString = "";
if(!empty($this->postFields)) {
foreach($this->postFields as $key=>$value) { $this->postString .= $key.'='.$value.'&'; }
$this->postString= rtrim($this->postString,'&'); //Trim off the waste
}
}
function setReferer($referer) {
//Set a referee either specified or based on the url
$this->referer = $referer;
}
function debugInfo() {
//Debug
if($this->debug) {
echo "<table><tr><td colspan='2'><b><u>Pre Curl Request</b><u></td></tr>";
echo "<tr><td><b>URL: </b></td><td>{$this->url}</td></tr>";
if(!empty(self::$cookie)) echo "<tr><td><b>Cookie String: </b></td><td>".self::$cookie."</td></tr>";
if(!empty($this->referer)) echo "<tr><td><b>Referer: </b></td><td>".$this->referer."</td></tr>";
if(!empty($this->postString)) echo "<tr><td><b>Post String: </b></td><td>".$this->postString."</td></tr>";
if(!empty($this->postFields)) {
echo "<tr><td><b>Post Values:</b></td><td><table>";
foreach($this->postFields as $key=>$value)
echo "<tr><td>$key</td><td>$value</td></tr>";
echo "</table>";
}
echo "</td></tr></table><br />\n";
}
}
function debugFurtherInfo() {
//Debug
if($this->debug) {
echo "<table><tr><td colspan='2'><b><u>Post Curl Request</b><u></td></tr>";
echo "<tr><td><b>URL: </b></td><td>{$this->url}</td></tr>";
if(!empty($this->referer)) echo "<tr><td><b>Referer: </b></td><td>".$this->referer."</td></tr>";
if(!empty($this->curlInfo)) {
echo "<tr><td><b>Curl Info:</b></td><td><table>";
foreach($this->curlInfo as $key=>$value)
echo "<tr><td>$key</td><td>$value</td></tr>";
echo "</table>";
}
echo "</td></tr></table><br />\n";
}
}
/**
* Make the actual request
*/
function makeRequest($url=NULL) {
//Shorthand request
if(!is_null($url))
$this->setUrl($url);
//Output debug info
$this->debugInfo();
//Using a shared cookie
$cookie = self::$cookie;
//Setting up the starting information
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 Safari/536.11" );
curl_setopt($ch, CURLOPT_URL, $this->url);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_ENCODING, "gzip");
//register a callback function which will process the headers
//this assumes your code is into a class method, and uses $this->readHeader as the callback //function
curl_setopt($ch, CURLOPT_HEADERFUNCTION, array(&$this,'readHeader'));
//Some servers (like Lighttpd) will not process the curl request without this header and will return error code 417 instead.
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Expect:"));
//Referer
if(empty($this->referer)) {
curl_setopt($ch, CURLOPT_REFERER, dirname($this->url));
} else {
curl_setopt($ch, CURLOPT_REFERER, $this->referer);
}
//Posts
if(!empty($this->postFields)) {
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $this->postString);
}
//Redirects, transfers and timeouts
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $this->timeout);
curl_setopt($ch, CURLOPT_TIMEOUT, $this->timeout);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
//Debug
if($this->debug) {
curl_setopt($ch, CURLOPT_VERBOSE, true); // logging stuffs
curl_setopt($ch, CURLINFO_HEADER_OUT, true); // enable tracking
}
//Get the content and the header info
$content = curl_exec($ch);
$response = curl_getinfo($ch);
//get the default response headers
$headers = curl_getinfo($ch);
//add the headers from the custom headers callback function
$this->response_meta_info = array_merge($headers, $this->response_meta_info);
curl_close($ch); //be nice
//Curl info
$this->curlInfo = $response;
//Output debug info
$this->debugFurtherInfo();
//Are we being redirected?
if ($response['http_code'] == 301 || $response['http_code'] == 302) {
$location = $this->getHeaderLocation();
if(!empty($location)) { //the location exists
$this->setReferer($this->getTrueUrl()); //update referer
return $this->makeRequest($location); //recurse to location
}
}
//Is there a javascript redirect on the page?
elseif (preg_match("/window\.location\.replace\('(.*)'\)/i", $content, $value) ||
preg_match("/window\.location\=\"(.*)\"/i", $content, $value)) {
$this->setReferer($this->getTrueUrl()); //update referer
return $this->makeRequest($value[1]); //recursion
} else {
$this->content = $content; //set the content - final page
}
}
/**
* Get the url after any redirection
*/
function getTrueUrl() {
return $this->curlInfo['url'];
}
function __toString() {
return $this->content;
}
/**
* CURL callback function for reading and processing headers
* Override this for your needs
*
* #param object $ch
* #param string $header
* #return integer
*/
private function readHeader($ch, $header) {
//This is run for every header, use ifs to grab and add
$location = $this->extractCustomHeader('Location: ', '\n', $header);
if ($location) {
$this->response_meta_info['location'] = trim($location);
}
return strlen($header);
}
private function extractCustomHeader($start,$end,$header) {
$pattern = '/'. $start .'(.*?)'. $end .'/';
if (preg_match($pattern, $header, $result)) {
return $result[1];
} else {
return false;
}
}
function getHeaders() {
return $this->response_meta_info;
}
function getHeaderLocation() {
return $this->response_meta_info['location'];
}
}
Well first of all, you are talking about the Royal Mail. So I'm not sure if this simple little trick would trip them up...
But what you could try is spoofing your user agent with a quick ini_set() -
ini_set('user_agent', 'Mozilla/5.0 (X11; CrOS i686 1660.57.0) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.46 Safari/535.19'
That's an Ubuntu chrome user agent string.
The cURL user agent string would look quite different. For example:
curl/7.15.5 (i686-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
It's a long shot - but they might be rejecting requests that are not originating from recognized browsers.