I have been asked to grab a certain line from a page, but it appears the site has blocked cURL requests.
The site in question is http://www.habbo.com/home/Intricat
I tried changing the user agent to see if they were blocking that, but it didn't do the trick.
The code I am using is as follows:
<?php
$curl_handle = curl_init();

// The URL you would like the content grabbed from
curl_setopt($curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0");
curl_setopt($curl_handle, CURLOPT_URL, 'http://www.habbo.com/home/Intricat');

// Time in seconds until the request times out; useful if the server
// you are requesting data from is down, so you can offer a "sorry" page.
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);

$buffer = curl_exec($curl_handle);

// Close the handle and free its resources
curl_close($curl_handle);

// Change the message below as you wish; keep it within the quotes.
if (empty($buffer)) {
    print "Sorry, it seems our weather resources are currently unavailable, please check back later.";
} else {
    print $buffer;
}
?>
Any ideas on another way I can grab a line of code from that page if they've blocked CURL requests?
EDIT: Running curl -i from my server, it appears the site is setting a cookie first.
You are not very specific about the kind of block you're talking about. The website in question, http://www.habbo.com/home/Intricat, does first of all check whether the browser has JavaScript enabled:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Script-Type" content="text/javascript">
<script type="text/javascript">function setCookie(c_name, value, expiredays) {
var exdate = new Date();
exdate.setDate(exdate.getDate() + expiredays);
document.cookie = c_name + "=" + escape(value) + ((expiredays == null) ? "" : ";expires=" + exdate.toGMTString()) + ";path=/";
}
function getHostUri() {
var loc = document.location;
return loc.toString();
}
setCookie('YPF8827340282Jdskjhfiw_928937459182JAX666', '179.222.19.192', 10);
setCookie('DOAReferrer', document.referrer, 10);
location.href = getHostUri();</script>
</head>
<body>
<noscript>This site requires JavaScript and Cookies to be enabled. Please change your browser settings or upgrade your
browser.
</noscript>
</body>
</html>
As curl has no JavaScript support, you either need to use an HTTP client that does, or you need to mimic that script and create the cookie and the new request URI on your own.
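For example, here is a minimal sketch of the mimicking approach. It assumes the page keeps using this exact setCookie() call; the regex and the second request are my own illustration, not a tested recipe:
<?php
// Sketch: request once, pull out the cookie the JavaScript would set,
// then repeat the request with that cookie attached.
$url = 'http://www.habbo.com/home/Intricat';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$first = curl_exec($ch);

// Matches: setCookie('NAME', 'VALUE', 10);
if (preg_match("#setCookie\('([^']+)',\s*'([^']+)'#", $first, $m)) {
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Cookie: ' . $m[1] . '=' . $m[2]));
    echo curl_exec($ch); // the second request now passes the cookie check
}
curl_close($ch);
?>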
Go in with your browser and copy the exact headers that are being sent;
the site won't be able to tell that you are using curl, because the request will look exactly the same.
If cookies are used, attach them as headers.
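For instance (the header values here are placeholders; paste the ones your browser actually sends):
<?php
// Sketch: replay the browser's headers verbatim with curl.
$ch = curl_init('http://www.habbo.com/home/Intricat');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'User-Agent: Mozilla/5.0',                 // placeholder
    'Accept: text/html,application/xhtml+xml', // placeholder
    'Cookie: name=value',                      // copied from the browser's dev tools
));
$html = curl_exec($ch);
curl_close($ch);
?>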
This is a cut-and-paste from a Curl class I wrote quite a few years back; I hope you can pick some gems out of it for yourself.
function get_url($url)
{
    curl_setopt($this->ch, CURLOPT_URL, $url);
    curl_setopt($this->ch, CURLOPT_USERAGENT, $this->user_agent);
    curl_setopt($this->ch, CURLOPT_COOKIEFILE, $this->cookie_name);
    curl_setopt($this->ch, CURLOPT_COOKIEJAR, $this->cookie_name);
    if (!is_null($this->referer)) {
        curl_setopt($this->ch, CURLOPT_REFERER, $this->referer);
    }
    curl_setopt($this->ch, CURLOPT_SSL_VERIFYHOST, 2);
    curl_setopt($this->ch, CURLOPT_HEADER, 0);
    if ($this->follow) {
        curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, 1);
    } else {
        curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, 0);
    }
    curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($this->ch, CURLOPT_HTTPHEADER, array("Accept: text/html,text/vnd.wap.wml,*.*"));
    curl_setopt($this->ch, CURLOPT_SSL_VERIFYPEER, FALSE); // this line makes it work under https

    $try = 0;
    $result = "";
    while (($try <= $this->retry_attempts) && (empty($result))) // retry up to $this->retry_attempts times
    {
        $try++;
        $result = curl_exec($this->ch);
        $this->response = curl_getinfo($this->ch);
        // $this->response['http_code'] 4xx is an error
    }

    // set referring URL to the current URL for the next page
    if ($this->referer_to_last) $this->set_referer($url);
    return $result;
}
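For reference, a hypothetical usage sketch; the constructor and the $user_agent, $cookie_name, $retry_attempts, $follow and $referer_to_last properties are assumed to be set up elsewhere in the class:
// Hypothetical usage, assuming the rest of the class wires up
// the curl handle, user agent, cookie file name, etc.
$browser = new Curl();
$browser->follow = true; // follow redirects
$html = $browser->get_url('http://www.habbo.com/home/Intricat');
if (!empty($html)) {
    echo $html;
}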
I know this is a very old post, but since I had to answer the same question myself today, I'm sharing it here for anyone who comes along; it may be of use to them. I'm also fully aware the OP asked about curl specifically, but, just like me, there may be people interested in a solution whether it uses curl or not.
The page I wanted to fetch with curl blocked it. If the block is not because of JavaScript but because of the user agent (that was my case, and setting the agent in curl didn't help), then wget could be a solution:
wget -O output.txt --no-check-certificate --user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" "http://example.com/page"
Related
I am trying to get information about groceries: title, image, price, etc.
All other URLs work fine and the cUrl response is exactly as expected.
The problem I am having is when URLs contain accented latin/non-standard url/non-english characters like ü or è.
I've tried everything I can think of, but there is probably a simple solution I am missing:
stringtest.php?url=http://www.sainsburys.co.uk/shop/gb/groceries/desserts/g%C3%BC-lemon-pots-3x45g
stringtest.php?url=http://www.sainsburys.co.uk/shop/gb/groceries/desserts/gü-lemon-pots-3x45g
stringtest.php?url=http%3A%2F%2Fwww.sainsburys.co.uk%2Fshop%2Fgb%2Fgroceries%2Fdesserts%2Fg%C3%BC-lemon-pots-3x45g
This is my code for testing cURL:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="utf-8">
</head>
<body>
<?php
$url = $_GET['url'];
echo curlUrl($url);

function curlUrl($url) {
    $ch = curl_init();
    $timeout = 5;
    $cookie_file = "/tmp/cookie/cookie1.txt";
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}
?>
<form action="stringtest.php" method="get" id="process">
    <input type="text" name="url" placeholder="Url" autofocus>
    <input type="submit">
</form>
</body>
</html>
The result I get from cUrl is Sainsburys' 404 page claiming the page isn't found.
Copying http://www.sainsburys.co.uk/shop/gb/groceries/desserts/gü-lemon-pots-3x45g from the URL bar results in the URL-encoded version of ü (%C3%BC) being copied, as expected. When entering the URL in the browser, both ü and %C3%BC reach the actual product page, so why does Sainsburys return a 404 when cURL'd?
I've tried various things such as urldecode(), using the exact headers the browser uses, but to no avail.
Seems like an issue with the Sainsbury website itself.
The server returns a 404 when you don't send a valid cookie.
Did you try reloading?
I tried
stringtest.php?url=http://www.sainsburys.co.uk/shop/gb/groceries/desserts/gü-chocolate-ganache-pots-3x45g
and it worked with a valid cookie.
If you try:
wget http://www.sainsburys.co.uk/shop/gb/groceries/desserts/g%C3%BC-lemon-pots-3x45g
The response is:
Resolving www.sainsburys.co.uk (www.sainsburys.co.uk)... 109.94.142.1
Connecting to www.sainsburys.co.uk (www.sainsburys.co.uk)|109.94.142.1|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://www.sainsburys.co.uk/webapp/wcs/stores/servlet/gb/groceries/bakery?langId=44&storeId=10151&krypto=xbYM3SJja%2F1mDOxJIVlKl9vZN6zjdlTL4MSiHOKiUMQoum9OkLwoTv6wj27CjUXwqM4%2BsteXag0O%0AQOWiHuS8onFdmoVLWlJyZ7hXaMhcMW9MIMMAsnPdWTPEzSEnOP5a&ddkey=http:AjaxAutoCompleteDisplayView [following]
--2014-10-07 11:56:11-- http://www.sainsburys.co.uk/webapp/wcs/stores/servlet/gb/groceries/bakery?langId=44&storeId=10151&krypto=xbYM3SJja%2F1mDOxJIVlKl9vZN6zjdlTL4MSiHOKiUMQoum9OkLwoTv6wj27CjUXwqM4%2BsteXag0O%0AQOWiHuS8onFdmoVLWlJyZ7hXaMhcMW9MIMMAsnPdWTPEzSEnOP5a&ddkey=http:AjaxAutoCompleteDisplayView
Reusing existing connection to www.sainsburys.co.uk:80.
HTTP request sent, awaiting response... 200 OK
To follow the redirect in curl, use the -L flag:
curl -L http://www.sainsburys.co.uk/shop/gb/groceries/desserts/g%C3%BC-lemon-pots-3x45g
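The PHP-side equivalent is CURLOPT_FOLLOWLOCATION plus a cookie jar so the session survives the redirect; a minimal sketch (the cookie file path is illustrative):
<?php
$ch = curl_init('http://www.sainsburys.co.uk/shop/gb/groceries/desserts/g%C3%BC-lemon-pots-3x45g');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);              // the -L flag's equivalent
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');  // keep cookies across the redirect
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');
$html = curl_exec($ch);
curl_close($ch);
?>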
I am currently attempting to configure a cURL & PHP function found online that, when called, checks whether the HTTP response code is in the 200-300 range to determine if a web page is up. This works when run against an individual website with the code below (not the function itself, but the if statements etc.). The function returns true or false depending on the HTTP response code:
$page = "www.google.com";
$page = gzdecode($page);
if (Visit($page))
{
echo $page;
echo " Is OK <br>";
}
else
{
echo $page;
echo " Is DOWN <br>";
}
However, when running it against an array of URLs stored within the script through a foreach loop, it reports every web page in the list as down, despite the code being the same bar the added loop.
Does anyone know what the issue may be?
Edit - adding Visit function
My bad, sorry, I wasn't thinking.
The visit function is the following:
function Visit($url) {
    $agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_VERBOSE, false);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
    curl_setopt($ch, CURLOPT_SSLVERSION, 3);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
    $page = curl_exec($ch);
    //echo curl_error($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($httpcode >= 200 && $httpcode < 310) return true;
    else return false;
}
The foreach loop as mentioned looks like this:
foreach ($Urls as $URL) {
    $page = $URL;
    $page = gzdecode($page);
    if (Visit($page))
The if/else around the Visit() call is the same as before.
$page = $URL;
$page = gzdecode($page);
Why are you trying to uncompress the non-compressed URL? Assuming you really meant to uncompress the content returned from the URL, why would the remote server compress it when you've told it that the client does not support compression? And why are you fetching the entire page just to see the headers?
The code you've shown us here has never worked.
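For what it's worth, a minimal corrected sketch: pass the URL straight through (no gzdecode() on a plain string) and use CURLOPT_NOBODY so only the headers are fetched; note that some servers answer HEAD-style requests differently from GET. The URL list is illustrative:
<?php
// Hypothetical fixed loop: pass the URL straight to Visit(), no gzdecode().
function Visit($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0");
    curl_setopt($ch, CURLOPT_NOBODY, true); // headers only, no body download
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return ($httpcode >= 200 && $httpcode < 310);
}

$Urls = array("http://www.google.com", "http://www.example.com"); // illustrative list
foreach ($Urls as $URL) {
    echo $URL . (Visit($URL) ? " Is OK <br>" : " Is DOWN <br>");
}
?>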
Why this is not working code, I do not understand. The code gets a response from curl and looks (must look) in that response for the word yes; if it is found, it displays one text, if not, the other. The code:
<?PHP
// CURL
$ch = curl_init('http://dev.local/phpwhois-4.2.2/example.php?query=domain.ru&output=object');
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0");
curl_setopt($ch, CURLOPT_HEADER, false);
$curl = curl_exec($ch);
echo $curl;
curl_close($ch);

if (preg_match('~\s*yes\s*~u', $curl))
    echo 'Ok';
else
    echo 'Else text';
?>
The error is strange; more precisely, there isn't really one. If curl returns text containing yes, the match does not work and the else text is printed; and if it returns no text at all, it's also the else text. But if I put the text that curl outputs into the variable myself, it works.
Here is what curl gives the script (with this response it still prints the else text):
regrinfo->Array
    disclaimer->Array
        0->By submitting a query to RIPN's Whois Service
        1->you agree to abide by the following terms of use:
        2->#3.2 (in Russian)
        3->#3.2 (in English).
    domain->Array
        name->hashcode.ru
        nserver->Array
            ns1.nameself.com->81.176.95.18
            ns2.nameself.com->88.212.207.45
        status->REGISTERED, DELEGATED, VERIFIED
        created->2010-11-05
        expires->2014-11-05
        source->TCI
        registered->yes
regyinfo->Array
    referrer->
    registrar->RUCENTER-REG-RIPN
    servers->Array
        0->Array
            server->ru.whois-servers.net
            args->hashcode.ru
            port->43
            type->domain
rawdata->Array
    0->% By submitting a query to RIPN's Whois Service
    1->% you agree to abide by the following terms of use:
    2->% (in Russian)
    3->% (in English).
    4->
    5->domain:
    6->nserver: .
    7->nserver: .
    8->state: REGISTERED, DELEGATED, VERIFIED
    9->person: Private Person
    10->registrar: REGTIME-REG-RIPN
    11->admin-contact:
    12->created: 2010.11.05
    13->paid-till: 2014.11.05
    14->free-date: 2014.12.06
    15->source: TCI
    16->
    17->Last updated on 2014.07.27 12:31:31 MSK
    18->
You have forgotten to set the return-transfer flag. Without CURLOPT_RETURNTRANSFER, curl_exec() prints the response straight to output and returns true, so $curl is a boolean and preg_match() never sees the text.
<?PHP
// CURL
$ch = curl_init('http://dev.local/phpwhois-4.2.2/example.php?query=domain.ru&output=object');
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0");
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$curl = curl_exec($ch);
echo $curl;
curl_close($ch);

if (preg_match('~\s*yes\s*~u', $curl))
    echo 'Ok';
else
    echo 'Else text';
?>
Also take care about timeouts in the future; a minimal sketch (the timeout values are illustrative):
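<?php
// Illustrative timeout settings: fail fast instead of hanging forever.
$ch = curl_init('http://dev.local/phpwhois-4.2.2/example.php?query=domain.ru&output=object');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // seconds allowed to establish the connection
curl_setopt($ch, CURLOPT_TIMEOUT, 15);       // seconds allowed for the whole request
$curl = curl_exec($ch);
curl_close($ch);
?>
Good luck.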
I have a repetitive task that I do daily: log in to a web portal, click a link that pops open a new window, and then click a button to download an Excel spreadsheet. It's a time-consuming task that I would like to automate.
I've been doing some research with PHP and cUrl, and while it seems like it should be possible, I haven't found any good examples. Has anyone ever done something like this, or do you know of any tools that are better suited for it?
Are you familiar with the basics of HTTP requests? Like, do you know the difference between a POST and a GET request? If what you're doing amounts to nothing more than GET requests, then it's actually super simple and you don't need to use cURL at all. But if "clicking a button" means submitting a POST form, then you will need cURL.
One way to check this is by using a tool such as Live HTTP Headers and watching what requests happen when you click on your links/buttons. It's up to you to figure out which variables need to get passed along with each request and which URLs you need to use.
But assuming that there is at least one POST request, here's a basic script that will post data and get back whatever HTML is returned.
<?php
if ($ch = curl_init()) {
    $data = 'field1=' . urlencode('somevalue');
    $data .= '&field2[]=' . urlencode('someothervalue');
    $url = 'http://www.website.com/path/to/post.asp';
    $userAgent = 'Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)';
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    $html = curl_exec($ch);
    curl_close($ch);
} else {
    $html = false;
}

// write code here to look through $html for
// the link to download your excel file
?>
Try this:
$ch = curl_init();
$csrf_token = $this->getCSRFToken($ch); // your function to get a CSRF token from the site, if you need one
$ch = $this->signIn($ch, $csrf_token);  // your sign-in function; it must return the curl handle

curl_setopt($ch, CURLOPT_HTTPGET, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 300); // in case the file is large
curl_setopt($ch, CURLOPT_URL, "https://your-URL/anything");
$return = curl_exec($ch);

// the important part
$destination = "files.xlsx";
if (file_exists($destination)) {
    unlink($destination);
}
$file = fopen($destination, "w+");
fputs($file, $return);
if (fclose($file)) {
    echo "downloaded";
}
curl_close($ch);
I am using the Twitter API to display the statuses of a user. However, in some cases (like today), Twitter goes down and takes all the APIs with it. Because of this, my application fails and continuously displays the loading screen.
I was wondering if there is a quick way (using PHP or JS) to query Twitter and see if it (and the API) is up. I'm thinking it could be an easy response of some sort.
Thanks in advance,
Phil
Request http://api.twitter.com/1/help/test.xml or test.json. Check to make sure you get a 200 HTTP response code.
If you requested XML the response should be:
<ok>true</ok>
The JSON response should be:
"ok"
JSONP!
You can have some function like this, declared in the head or before including the next script tag below:
var isTwitterWorking = false;
function testTwitter(status) {
if (status === "ok") {
isTwitterWorking = true;
}
}
And then
<script src="http://api.twitter.com/1/help/test.json?callback=testTwitter"></script>
function visit($url) {
    $agent = "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_VERBOSE, false);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);
    $page = curl_exec($ch);
    //echo curl_error($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($httpcode >= 200 && $httpcode < 300)
        return true;
    else
        return false;
}
// Examples
if (visit("http://www.twitter.com"))
    echo "Website OK" . "\n"; // site is online
else
    echo "Website DOWN";      // site is offline / no response
I hope this helps you.