I have to put this page, http://www.tvindiretta.com/m/, in an iframe. The page is cURL-powered: here is its content. When I try to put the URL http://www.tvindiretta.com/m/index.php in an iframe tag, the browser redirects to the iframe's URL. How can I keep this page inside the iframe? I have to change the user agent. I'm a complete noob with cURL, but please help me. Here is the /m/index.php page source code:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.tvindiretta.com/");
curl_setopt($ch, CURLOPT_MAXREDIRS, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
// spoof a mobile browser so the site serves its mobile markup
curl_setopt($ch, CURLOPT_HTTPHEADER, array('User-Agent: Mozilla/5.0 (iPhone; U; CPU iPhone OS 2_2_1 like Mac OS X; en-us) AppleWebKit/525.18.1 (KHTML, like Gecko) Version/3.1.1 Mobile/5H11 Safari/525.20'));
// return the response as a string instead of printing it directly
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// run the request once; the original code called curl_exec() and curl_close() twice
$result = curl_exec($ch);
curl_close($ch);
print $result;
?>
I don't think there is a user-agent redirection on this web page, since the following test:
<?php
if (isset($_GET['get'])) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, "http://www.tvindiretta.com/m");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    // run the request once and print the returned body
    $result = curl_exec($ch);
    curl_close($ch);
    print $result;
}
else {
?>
<!DOCTYPE HTML>
<html>
<head>
<meta charset="utf-8">
</head>
<body>
<iframe src="test.php?get" style="position:absolute; top:100px; left:100px; width:400px; height:400px;"></iframe>
</body>
</html>
<?php } ?>
seems to break the page, but still gives me the mobile content anyway.
So I guess the real problem here is the JavaScript code inside that page.
In HTML5 there is a new iframe attribute, sandbox, which lets you restrict the behaviour of the iframe's content.
Unfortunately, this seems to be supported only by Chrome and Safari.
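For illustration, a sandboxed iframe might look like the markup below; leaving allow-scripts out of the token list is what would stop the frame-busting redirect script from running:
<!-- scripts inside the frame are blocked because "allow-scripts" is absent -->
<iframe src="http://www.tvindiretta.com/m/index.php" sandbox="allow-same-origin allow-forms" style="width:400px; height:400px;"></iframe>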
One idea here could be to scrape the content of the web page (with DOMDocument in PHP, for instance), keep only the content you are interested in, and try to reproduce its style. That may be easier said than done, but I can't see a cleaner way to do it.
Since it seems you are interested in getting TV listings, you could also check out a dedicated XML scraper, XMLTV.
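A minimal sketch of that scraping idea, assuming the listings sit in an element with id="palinsesto" (a hypothetical id; check the real page source first):
<?php
$ch = curl_init('http://www.tvindiretta.com/m/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // real-world HTML is rarely valid
$doc->loadHTML($html);
libxml_clear_errors();

$container = $doc->getElementById('palinsesto');  // hypothetical id
if ($container !== null) {
    echo $doc->saveHTML($container);  // re-emit only the fragment you need
}
?>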
I have solved the problem of downloading the source code of a Google search results page. Here is the code:
<!DOCTYPE html>
<html>
<body>
<!-- this program saves the source code of a website to an external file -->
<!-- the string for the fake user agent can be found here: http://useragentstring.com/index.php -->
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.google.com/search?q=blue+car');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0');
$html = curl_exec($ch);
if (empty($html)) {
    echo "<pre>cURL request failed:\n".curl_error($ch)."</pre>";
} else {
    $myfile = fopen("file.txt", "w") or die("Unable to open file!");
    fwrite($myfile, $html);
    fclose($myfile);
}
curl_close($ch); // free the handle once the response has been used
?>
</body>
</html>
Now I wish to have 100 results instead of only 10. Changing my Google search settings has no influence on the code written above; the number of search results is stored somewhere else and is not part of the query string when searching on Google...
Please use the &num parameter to specify the number of records returned (&num=xx)
So for your case, please change
curl_setopt($ch, CURLOPT_URL, 'https://www.google.com/search?q=blue+car');
to
curl_setopt($ch, CURLOPT_URL, 'https://www.google.com/search?q=blue+car&num=100');
I need your help: can anyone explain why my code doesn't find the "privacy" a-tag on the site zoho.com?
My code finds the "privacy" link fine on other sites, but not on zoho.com.
I use the Symfony DomCrawler component: https://symfony.com/doc/current/components/dom_crawler.html
// Imprint Check //
use Symfony\Component\DomCrawler\Crawler;

function findPrivacy($domain) {
    $ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13';
    $curl = curl_init($domain);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($curl, CURLOPT_USERAGENT, $ua);
    $data = curl_exec($curl);
    curl_close($curl);

    $crawler = new Crawler($data);
    $nodeValues = $crawler->filter('a')->each(function ($node) {
        // attr() can return null, so cast before the substring check
        $href = (string) $node->attr('href');
        return str_contains($href, 'privacy-policy') || str_contains($href, 'privacy');
    });
    return $nodeValues;
}
If you look at the source code of zoho.com, you will see the footer is empty. But on the live site, the footer isn't empty once you scroll down.
How can I find this "privacy" link?
Your script cannot find what is not there. If you load the zoho.com page in a browser and look at the source code, you will notice that the word privacy is not even present. It's possible that the footer containing the link to the privacy policy is loaded asynchronously, which PHP cannot handle.
EDIT: by asynchronously loaded I mean using something like AJAX, which runs client-side only. Since PHP is server-side only, it cannot perform the operations required to load the footer containing the link to the privacy policy.
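If the link only exists after JavaScript has run, one workaround (a sketch only, and not part of the original answer) is to drive a real headless browser from PHP with symfony/panther instead of raw cURL, so the rendered DOM, footer included, is what the crawler sees:
use Symfony\Component\Panther\Client;

// Requires: composer require symfony/panther, plus a local chromedriver.
$client = Client::createChromeClient();
$crawler = $client->request('GET', 'https://www.zoho.com');

// Wait until JavaScript has rendered at least one footer link.
$client->waitFor('footer a');

$found = $crawler->filter('a')->each(function ($node) {
    return str_contains((string) $node->attr('href'), 'privacy');
});
var_dump(in_array(true, $found, true));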
I am making a home automation project with Arduino, and I am using Teleduino to remotely control an LED as a test. I want to take the contents of this link and display them in a PHP page.
<!DOCTYPE html>
<html>
<body>
<?php
include 'simple_html_dom.php';
echo file_get_html('http://us01.proxy.teleduino.org/api/1.0/2560.php?k=202A57E66167ADBDC55A931D3144BE37&r=definePinMode&pin=7&mode=1');
?>
</body>
</html>
The problem is that the function does not return anything.
Is something wrong with my code?
Is there any other function I can use to send a request to a page and get that page in return?
I think you meant to use the function file_get_contents, but the server is protecting its data from scraping, so cURL is the better solution:
<?php
// echo file_get_contents('http://us01.proxy.teleduino.org/api/1.0/2560.php?k=202A57E66167ADBDC55A931D3144BE37&r=definePinMode&pin=7&mode=1');
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "http://us01.proxy.teleduino.org/api/1.0/2560.php?k=202A57E66167ADBDC55A931D3144BE37&r=definePinMode&pin=7&mode=1");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
// $output contains the output string
$output = curl_exec($ch);
echo $output;
// close curl resource to free up system resources
curl_close($ch);
?>
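If nothing prints at all, it can help to look at cURL's own error message before closing the handle; a small, optional addition to the code above:
$output = curl_exec($ch);
if ($output === false) {
    // curl_error() explains why the request failed (DNS, timeout, SSL, ...)
    echo 'cURL error: ' . curl_error($ch);
} else {
    echo $output;
}
curl_close($ch);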
I am trying to get information about groceries: title, image, price, etc.
All other URLs work fine and the cURL response is exactly as expected.
The problem I am having is when URLs contain accented Latin/non-English characters like ü or è.
I've tried everything I can think of, but there is probably a simple solution I am missing:
stringtest.php?url=http://www.sainsburys.co.uk/shop/gb/groceries/desserts/g%C3%BC-lemon-pots-3x45g
stringtest.php?url=http://www.sainsburys.co.uk/shop/gb/groceries/desserts/gü-lemon-pots-3x45g
stringtest.php?url=http%3A%2F%2Fwww.sainsburys.co.uk%2Fshop%2Fgb%2Fgroceries%2Fdesserts%2Fg%C3%BC-lemon-pots-3x45g
This is my code for testing cURL:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
</head>
<body>
<?php
$url = $_GET['url'];
echo curlUrl($url);
function curlUrl($url) {
    $ch = curl_init();
    $timeout = 5;
    $cookie_file = "/tmp/cookie/cookie1.txt";
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}
?>
<form action="stringtest.php" method="get" id="process">
<input type="text" name="url" placeholder="Url" autofocus>
<input type="submit">
</form>
</body>
</html>
The result I get from cURL is Sainsburys' 404 page claiming the page isn't found.
Copying http://www.sainsburys.co.uk/shop/gb/groceries/desserts/gü-lemon-pots-3x45g from the URL bar results in the URL-encoded version of ü (%C3%BC) being copied, as expected. When entering the URL in the browser, both ü and %C3%BC reach the actual product page, so why does Sainsburys return a 404 when cURL'd?
I've tried various things, such as urldecode() and using the exact headers the browser uses, but to no avail.
Seems like an issue with the Sainsbury website itself.
The server returns a 404 when you don't send a valid cookie.
Did you try reloading?
I tried
stringtest.php?url=http://www.sainsburys.co.uk/shop/gb/groceries/desserts/gü-chocolate-ganache-pots-3x45g
and it worked with a valid cookie.
If you try:
wget http://www.sainsburys.co.uk/shop/gb/groceries/desserts/g%C3%BC-lemon-pots-3x45g
The response is:
http://www.sainsburys.co.uk/shop/gb/groceries/bakery
Resolving www.sainsburys.co.uk (www.sainsburys.co.uk)... 109.94.142.1
Connecting to www.sainsburys.co.uk (www.sainsburys.co.uk)|109.94.142.1|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: http://www.sainsburys.co.uk/webapp/wcs/stores/servlet/gb/groceries/bakery?langId=44&storeId=10151&krypto=xbYM3SJja%2F1mDOxJIVlKl9vZN6zjdlTL4MSiHOKiUMQoum9OkLwoTv6wj27CjUXwqM4%2BsteXag0O%0AQOWiHuS8onFdmoVLWlJyZ7hXaMhcMW9MIMMAsnPdWTPEzSEnOP5a&ddkey=http:AjaxAutoCompleteDisplayView [following]
--2014-10-07 11:56:11-- http://www.sainsburys.co.uk/webapp/wcs/stores/servlet/gb/groceries/bakery?langId=44&storeId=10151&krypto=xbYM3SJja%2F1mDOxJIVlKl9vZN6zjdlTL4MSiHOKiUMQoum9OkLwoTv6wj27CjUXwqM4%2BsteXag0O%0AQOWiHuS8onFdmoVLWlJyZ7hXaMhcMW9MIMMAsnPdWTPEzSEnOP5a&ddkey=http:AjaxAutoCompleteDisplayView
Reusing existing connection to www.sainsburys.co.uk:80.
HTTP request sent, awaiting response... 200 OK
To follow the redirect in curl, use the -L flag:
curl -L http://www.sainsburys.co.uk/shop/gb/groceries/desserts/g%C3%BC-lemon-pots-3x45g
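The PHP equivalent of -L is CURLOPT_FOLLOWLOCATION, which the question's code already sets; combining it with a cookie jar makes sure the krypto cookie issued during the redirect chain is sent back. A sketch, with an assumed cookie-file path:
$ch = curl_init('http://www.sainsburys.co.uk/shop/gb/groceries/desserts/g%C3%BC-lemon-pots-3x45g');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // same as curl's -L flag
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/sainsburys_cookies.txt');   // assumed path
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/sainsburys_cookies.txt');
$html = curl_exec($ch);
curl_close($ch);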
I have been asked to grab a certain line from a page, but it appears that the site has blocked cURL requests.
The site in question is http://www.habbo.com/home/Intricat
I tried changing the user agent to see if they were blocking that, but it didn't seem to do the trick.
The code I am using is as follows:
<?php
$curl_handle = curl_init();
// This is the URL you would like the content grabbed from
curl_setopt($curl_handle, CURLOPT_USERAGENT, "Mozilla/5.0");
curl_setopt($curl_handle, CURLOPT_URL, 'http://www.habbo.com/home/Intricat');
// This is the timeout in seconds; useful if the server you are requesting
// data from is down, so you can offer a "sorry" page instead
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
$buffer = curl_exec($curl_handle);
// Close the handle to free up resources
curl_close($curl_handle);
// Change the message below as you wish; keep it within the quotes.
if (empty($buffer)) {
    print "Sorry, it seems our weather resources are currently unavailable, please check back later.";
} else {
    print $buffer;
}
?>
Any ideas on another way I can grab a line of code from that page if they've blocked cURL requests?
EDIT: Running curl -i from my server shows that the site sets a cookie first.
You are not very specific about the kind of block you're talking about. The website in question, http://www.habbo.com/home/Intricat, does first of all check whether the browser has JavaScript enabled:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Script-Type" content="text/javascript">
<script type="text/javascript">function setCookie(c_name, value, expiredays) {
var exdate = new Date();
exdate.setDate(exdate.getDate() + expiredays);
document.cookie = c_name + "=" + escape(value) + ((expiredays == null) ? "" : ";expires=" + exdate.toGMTString()) + ";path=/";
}
function getHostUri() {
var loc = document.location;
return loc.toString();
}
setCookie('YPF8827340282Jdskjhfiw_928937459182JAX666', '179.222.19.192', 10);
setCookie('DOAReferrer', document.referrer, 10);
location.href = getHostUri();</script>
</head>
<body>
<noscript>This site requires JavaScript and Cookies to be enabled. Please change your browser settings or upgrade your
browser.
</noscript>
</body>
</html>
As curl has no JavaScript support, you either need to use an HTTP client that does, or you need to mimic that script and create the cookie and the new request URI yourself.
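A rough sketch of the second option, assuming the challenge page always embeds the cookie in a setCookie('name', 'value', 10) call like the one quoted above:
<?php
// First request: fetch the JavaScript challenge page.
$ch = curl_init('http://www.habbo.com/home/Intricat');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
$challenge = curl_exec($ch);

// Mimic the script: pull the cookie name and value out of the setCookie() call.
if (preg_match("/setCookie\('([^']+)',\s*'([^']+)'/", $challenge, $m)) {
    // Second request: replay the cookie the JavaScript would have set.
    curl_setopt($ch, CURLOPT_COOKIE, $m[1] . '=' . $m[2]);
    print curl_exec($ch);
}
curl_close($ch);
?>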
Go in with your browser and copy the exact headers that are being sent; the site won't be able to tell that you are using curl, because the request will look exactly the same.
If cookies are used, attach them as headers as well.
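For example (the header values below are placeholders; copy the real ones from your browser's network tab):
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'User-Agent: Mozilla/5.0 ...',  // paste your browser's exact value here
    'Accept: text/html,application/xhtml+xml,*/*;q=0.8',
    'Accept-Language: en-US,en;q=0.5',
    'Cookie: name=value',  // if cookies are used, attach them too
));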
This is a cut-and-paste from a Curl class I wrote quite a few years back; I hope you can pick some gems out of it for yourself.
function get_url($url)
{
    curl_setopt($this->ch, CURLOPT_URL, $url);
    curl_setopt($this->ch, CURLOPT_USERAGENT, $this->user_agent);
    curl_setopt($this->ch, CURLOPT_COOKIEFILE, $this->cookie_name);
    curl_setopt($this->ch, CURLOPT_COOKIEJAR, $this->cookie_name);
    if (!is_null($this->referer)) {
        curl_setopt($this->ch, CURLOPT_REFERER, $this->referer);
    }
    curl_setopt($this->ch, CURLOPT_SSL_VERIFYHOST, 2);
    curl_setopt($this->ch, CURLOPT_HEADER, 0);
    if ($this->follow) {
        curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, 1);
    } else {
        curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, 0);
    }
    curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($this->ch, CURLOPT_HTTPHEADER, array("Accept: text/html,text/vnd.wap.wml,*.*"));
    curl_setopt($this->ch, CURLOPT_SSL_VERIFYPEER, FALSE); // this line makes it work under https

    $try = 0;
    $result = "";
    // force a retry, up to $this->retry_attempts times
    while (($try <= $this->retry_attempts) && (empty($result))) {
        $try++;
        $result = curl_exec($this->ch);
        $this->response = curl_getinfo($this->ch);
        // a $this->response['http_code'] of 4xx indicates an error
    }
    // set the referring URL to the current URL for the next page
    if ($this->referer_to_last) $this->set_referer($url);
    return $result;
}
I know this is a very old post, but since I had to answer the same question for myself today, I'm sharing my solution here; it may be of use to people who come along later. I'm also fully aware the OP asked about curl specifically, but, just like me, there could be people interested in a solution whether it uses curl or not.
The page I wanted to get with curl blocked it. If the block is not because of JavaScript but because of the user agent (that was my case, and setting the agent in curl didn't help), then wget can be a solution (note the capital -O, which saves the page itself to output.txt; a lowercase -o would only write wget's log there):
wget -O output.txt --no-check-certificate --user-agent="Mozilla/5.0 (Windows NT 5.2; rv:2.0.1) Gecko/20100101 Firefox/4.0.1" "http://example.com/page"