Get page content using cURL - PHP

I would like to scrape the content of this page,
http://whostreams.net/embed/gryr4u074z82x, using cURL.
I've tried setting different user agents and other options,
but I can't get the content of that page: I often get redirected, or I get a "page moved" error.
I believe it has something to do with the query string being encoded somewhere, but I'm not sure how to get around that.
$url = 'http://whostreams.net/embed/gryr4u074z82x';
$curl_handle=curl_init();
curl_setopt($curl_handle, CURLOPT_REFERER, 'http://www.fel3arda.com/2018/09/denmark-vs-wales.html');
curl_setopt($curl_handle, CURLOPT_HTTPHEADER, array('Host: whostreams.net'));
curl_setopt($curl_handle, CURLOPT_URL,$url);
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36');
$query = curl_exec($curl_handle);
curl_close($curl_handle);
echo ($query) ;
What do I need to do to get my PHP code to show the exact content of the page?

curl_exec() needs to be called before curl_close(),
because curl_close() terminates the cURL session and releases its resources. The handle $curl_handle is destroyed as well.

The code you posted works for me; I just added a <?php to it.
<?php
$url = 'http://whostreams.net/embed/gryr4u074z82x';
$curl_handle=curl_init();
curl_setopt($curl_handle, CURLOPT_REFERER, 'http://www.fel3arda.com/2018/09/denmark-vs-wales.html');
curl_setopt($curl_handle, CURLOPT_HTTPHEADER, array('Host: whostreams.net'));
curl_setopt($curl_handle, CURLOPT_URL,$url);
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36');
$query = curl_exec($curl_handle);
curl_close($curl_handle);
echo ($query) ;
I do indeed get the
CLICK HERE TO UNMUTE
STREAM IS OFFLINE
Retrying in seconds
page, plus the heavily obfuscated JavaScript used to start streaming the video from wss://ws.peer5.com.
You say "I just can't seem to get the content of that page". Well, what content are you getting, and what did you expect to get instead? Here is roughly what both my Google Chrome web browser and curl get:
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<meta name="viewport" content="width=device-width; initial-scale=1.0">
<script>if(window==window.top) document.location="/"</script>
<link rel="stylesheet" href="/css/embed.min.css?v=0.1" />
<!-- Tssp-->
<!-- PopAds.net Popunder Code for whostreams.net | 2018-09-09,2437207,0,0 -->
<script type="text/javascript" data-cfasync="false">
/*<![CDATA[/* */
/* Generated 2018-09-09 16:26:49 for "PopAds%20CGAPIL%20A", len 1367 */
(function(){ var p=window;p["\x5f\x70\x6fp"]=[["\u0073i\x74e\u0049\x64",2437207],["\u006d\x69\x6e\u0042i\x64",0],["\x70\x6f\u0070un\x64er\x73Pe\x72\x49\x50",0],["\x64\x65\u006c\u0061y\u0042e\x74\u0077een",0],["\u0064\x65\u0066\u0061u\u006ct",false],["\x64\u0065fau\x6c\x74P\x65\u0072\x44a\u0079",0],["\u0074o\u0070\x6dos\x74\x4cay\x65\x72",!1]];var l=["/\x2fc\u0031\x2ep\x6f\u0070a\u0064s\u002en\x65\u0074\u002f\x70o\x70\u002e\u006a\u0073","/\u002f\x63\u0032.p\x6fpa\x64\u0073.n\x65t/\x70\u006fp\u002ej\x73","//w\x77\x77.\x6b\u0061\u006f\x6ariv\u006d\u0068\x79s\x2ec\u006f\x6d\u002f\u0062p\x2ejs","/\x2fww\x77.\x74djo\x61\x6f\x73\u0069\u0062\x65\x73\u002e\x63om\x2f\x78\u002ejs",""],w=0,x,a=function(){if(""==l[w])return;x=p["\u0064\x6f\u0063\u0075\u006de\u006e\u0074"]["\x63\u0072e\x61\x74\u0065\u0045le\u006d\x65n\x74"]("\x73cr\u0069\x70\x74");x["\x74\x79\x70\u0065"]="te\x78\x74\u002f\u006a\x61v\x61\u0073\u0063\x72\x69p\u0074";x["\x61\x73\u0079\u006ec"]=!0;var s=p["\x64\x6fcu\u006de\x6et"]["g\u0065\u0074Ele\x6d\x65n\x74\x73\x42\u0079\x54\x61\x67\u004ea\x6d\x65"]("\x73\u0063r\u0069\u0070\u0074")[0];x["\x73\x72c"]=l[w];if(w<2){x["\u0063ro\u0073\x73Or\u0069g\x69\u006e"]="\x61\x6eo\u006e\x79mo\x75s";};x["\u006f\u006ee\x72\x72\u006f\u0072"]=function(){w++;a()};s["p\x61\x72\u0065n\u0074\u004e\u006f\x64\u0065"]["\u0069nse\x72\x74\x42\x65\x66ore"](x,s)};a()})();
/*]]>/* */
</script>
</head>
<body>
<div class="jwplayer jw-reset jw-skin-glow" id="player"></div>
<div id="btn-unmute" onclick="WSUnmute()">CLICK HERE TO UNMUTE</div>
<div class="tb stream-offline" >
<div class="tb-col">
<img src="/imgs/logo.png" />
<h2>STREAM IS OFFLINE</h2>
<p>Retrying in <span class="counter"></span> seconds</p>
</div>
</div>
<script src="/js/jquery.min.js"></script>
<script>var WSreloadCounter,WSnTries=0,videoStarted = false, startMuted = startMuted();function errorPlaying(){$(".stream-offline .counter").text(10);$(".stream-offline").css("display","table");WSreloadCounter=setInterval(function(){var a=$(".stream-offline .counter").text();if(a>1){a--;$(".stream-offline .counter").text(a)}else{ clearInterval(WSreloadCounter);WSnTries++;if(WSnTries<10){WSreloadStream();}else{ window.location.reload() } }},1000)}function startMuted(){var d=/constructor/i.test(window.HTMLElement)||(function(a){return a.toString()==="[object SafariRemoteNotification]"})(!window.safari||(typeof safari!=="undefined"&&safari.pushNotification));if(d){return true}var c=!!window.chrome&&!!window.chrome.webstore;if(c&&getChromeVersion()>=66){return true}return false}function getChromeVersion(){var a=navigator.userAgent.match(/Chrom(e|ium)\/([0-9]+)\./);return a?parseInt(a[2],10):false};</script>
<script src="//api.peer5.com/peer5.js?id=5yaksk6z3h8drz14s022"></script><script src="//api.peer5.com/peer5.clappr.plugin.js"></script>
<script src="/players/clappr/clappr.min.js?v=0.22"></script>
<script>eval(function(p,a,c,k,e,d){e=function(c){return(c<a?'':e(parseInt(c/a)))+((c=c%a)>35?String.fromCharCode(c+29):c.toString(36))};if(!''.replace(/^/,String)){while(c--){d[e(c)]=k[c]||e(c)}k=[function(e){return d[e]}];e=function(){return'\\w+'};c=1};while(c--){if(k[c]){p=p.replace(new RegExp('\\b'+e(c)+'\\b','g'),k[c])}}return p}('6 3;$(4).J(2(){3=C D.E({K:"L://Y.l.k:10/V/H.17?s=-Z&e=W",X:"#3",11:"r%",12:"r%",14:q,13:q,U:"M",I:"",N:"1",O:"",T:{S:2(e){R()},P:2(e){5(2(){$(".9-B").G()},Q);16(!p){p=8;5(2(){6 h=4.o("t")[0],s=4.u("x");s.w("n-v","y.b");s.f="z/d";s.m=8;s.g="//1o.b/1n/18/1l/1q.j";h.i(s)},F);5(2(){6 h=4.o("t")[0],s=4.u("x");s.w("n-v","y.b");s.f="z/d";s.m=8;s.g="//l.k/1i/1h.j";h.i(s)},1g);5(2(){$.1m("",{"1f":"H","a":"A"})},F)}},1e:2(e){$(".9-B").1d()},19:2(e){$("#1a-c").G()},}})});2 1b(){$(".9-1k").1j("1p","1s");6 7=3.1r(3);7=C D.E(7.1t);3.1c();3=7;3.A();3.c()}2 15(){3.c()}',62,92,'||function|player|document|setTimeout|var|newplayer|true|stream||com|unmute|javascript||type|src||appendChild|js|net|whostreams|async|data|getElementsByTagName|videoStarted|false|100||head|createElement|domain|setAttribute|script|aeckcjy|text|play|logo|new|Clappr|Player|15000|fadeOut|gryr4u074z82x|watermark|ready|source|http|bestfit|position|watermarkLink|onPlay|1000|errorPlaying|onError|events|stretching|hls|1536534286|parent|cdn|Xj60CxQUPZV0M5RAeKbFA|8080|width|height|mute|autoPlay|WSUnmute|if|m3u8|d1|onVolumeUpdate|btn|WSreloadStream|destroy|fadeIn|onPause|ref|120000|adcash|pops|css|offline|fa|post|d4|wdaxvjr9dc|display|d4d1faecf77b3799e550953764a305da|configure|none|options'.split('|'),0,{}))
</script><!--Amung / Analytics -->
<div style="display:none;"><img name="viewers" src="//whos.amung.us/cwidget/whostreams/000000ffffff.png"></div>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-112185528-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-112185528-1');
</script>
</body>
</html>
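If the redirects or the "page moved" response are the actual problem, one thing worth checking is whether cURL follows redirects at all: the code in the question never sets CURLOPT_FOLLOWLOCATION. The following is a minimal sketch, not a confirmed fix for this site. The helper name fetch_page() and the timeout values are assumptions made for illustration.

```php
<?php
// Hypothetical helper (not from the original post): fetch a URL, follow
// redirects, and report what happened instead of echoing the body blindly.
function fetch_page(string $url, string $userAgent = 'Mozilla/5.0'): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow 301/302 "page moved" responses
        CURLOPT_MAXREDIRS      => 5,     // assumed limit, adjust as needed
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => 15,
        CURLOPT_USERAGENT      => $userAgent,
    ]);

    $body = curl_exec($ch);              // false on failure

    // Collect the diagnostics while the handle is still available.
    return [
        'body'      => $body === false ? null : $body,
        'status'    => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'final_url' => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
        'error'     => curl_error($ch),
    ];
}
```

Calling fetch_page('http://whostreams.net/embed/gryr4u074z82x') and printing the status, final_url and error entries should show whether the request ended on a redirect, on an error, or on the page itself.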

Related

How to download source code of Google search result page having 100 results instead of only 10

I have solved the problem of downloading the source code of a Google search results page. Here is the code:
<!DOCTYPE html>
<html>
<body>
<!-- this program saves source code of a website to an external file -->
<!-- the string there for the fake user agent can be found here: http://useragentstring.com/index.php -->
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.google.com/search?q=blue+car');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0');
$html = curl_exec($ch);
if(empty($html)) {
echo "<pre>cURL request failed:\n".curl_error($ch)."</pre>";
} else {
$myfile = fopen("file.txt", "w") or die("Unable to open file!");
fwrite($myfile, $html);
fclose($myfile);
}
?>
</body>
</html>
Now I would like to get 100 results instead of only 10. Changing the Google search settings has no influence on the code above. The number of search results seems to be stored somewhere else, and it is not part of the query string when searching on Google...
Use the num parameter to specify the number of results returned (&num=xx).
So in your case, change
curl_setopt($ch, CURLOPT_URL, 'https://www.google.com/search?q=blue+car');
to
curl_setopt($ch, CURLOPT_URL, 'https://www.google.com/search?q=blue+car&num=100');
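Rather than editing the URL string by hand, the query can be assembled with http_build_query(), which also takes care of the encoding. This is only a sketch: google_search_url() is a hypothetical helper name, and whether Google honours num for a given request is an assumption, since it may cap or ignore the value.

```php
<?php
// Hypothetical helper: build the search URL from its parts so the query
// string is encoded consistently.
function google_search_url(string $query, int $num = 100): string
{
    // The explicit '&' separator avoids depending on the arg_separator.output ini setting.
    return 'https://www.google.com/search?' . http_build_query([
        'q'   => $query,
        'num' => $num,
    ], '', '&');
}

echo google_search_url('blue car'), "\n";
```

The resulting string can be passed to CURLOPT_URL exactly as in the code above.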

Privacy Crawler

I need your help: can anyone explain why my code doesn't find the privacy a-tag on the site zoho.com?
My code finds the "privacy" link on other sites just fine, but not on zoho.com.
I use the Symfony Crawler: https://symfony.com/doc/current/components/dom_crawler.html
// Privacy check //
use Symfony\Component\DomCrawler\Crawler;
function findPrivacy($domain) {
$ua = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13';
$curl = curl_init($domain);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($curl, CURLOPT_USERAGENT, $ua);
$data = curl_exec($curl);
$crawler = new Crawler($data);
$nodeValues = $crawler->filter('a')->each(function ($node) {
// 'privacy' also matches 'privacy-policy'; the cast guards against <a> tags without an href
if (str_contains((string) $node->attr('href'), 'privacy')) {
return true;
} else {
return false;
}
});
return $nodeValues;
}
If you look at the source code of zoho.com, you will see that the footer is empty. But on the rendered site, the footer isn't empty when you scroll down.
How can I find the Privacy link in that case?
Your script cannot find what is not there. If you load the zoho.com page in a browser and look at the source code, you will notice that the word "privacy" is not even present. It's possible that the footer containing the link to the privacy policy is loaded asynchronously, which a plain PHP request cannot handle.
EDIT: by asynchronously loaded I mean using something like AJAX, which runs client-side in the browser. Since PHP with cURL only downloads the HTML and does not execute JavaScript, it cannot perform the operations required to load the footer containing the link to the privacy policy.
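To see the difference in isolation, the check can be run against two small HTML strings: one with a static footer and one whose footer would only be filled in by a script. This is a sketch using PHP's built-in DOMDocument rather than the Symfony Crawler from the question. The helper name find_privacy_links() and both sample strings are made up for illustration.

```php
<?php
// Hypothetical helper: list the href values in the *static* HTML that
// mention "privacy". Links injected later by JavaScript never appear here.
function find_privacy_links(string $html): array
{
    if (trim($html) === '') {
        return [];                                 // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);  // silence warnings from real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');          // empty string when the attribute is missing
        if (stripos($href, 'privacy') !== false) {
            $links[] = $href;
        }
    }
    return $links;
}

// Static footer: the link is part of the downloaded HTML.
var_dump(find_privacy_links('<div><a href="/privacy.html">Privacy</a></div>'));
// Footer filled in by a script: the raw HTML contains no link to find.
var_dump(find_privacy_links('<div id="footer"></div><script>loadFooter()</script>'));
```

If the second case matches what zoho.com returns, the link has to come from somewhere else, for example the request the page itself makes to load the footer, or a headless browser that executes the JavaScript.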

file_get_html(); not working with Teleduino links

I am making a home automation project with Arduino, and I am using Teleduino to remotely control an LED as a test. I want to take the contents of this link and display them in a PHP page.
<!DOCTYPE html>
<html>
<body>
<?php
include 'simple_html_dom.php';
echo file_get_html('http://us01.proxy.teleduino.org/api/1.0/2560.php?k=202A57E66167ADBDC55A931D3144BE37&r=definePinMode&pin=7&mode=1');
?>
</body>
</html>
The problem is that the function does not return anything.
Is something wrong with my code?
Is there any other function I can use to send a request to a page and get that page in return?
I think you should use file_get_contents() instead, but the server seems to be protecting its data from scraping, so cURL would be a better solution:
<?php
// echo file_get_contents('http://us01.proxy.teleduino.org/api/1.0/2560.php?k=202A57E66167ADBDC55A931D3144BE37&r=definePinMode&pin=7&mode=1');
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "http://us01.proxy.teleduino.org/api/1.0/2560.php?k=202A57E66167ADBDC55A931D3144BE37&r=definePinMode&pin=7&mode=1");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
// $output contains the output string
$output = curl_exec($ch);
echo $output;
// close curl resource to free up system resources
curl_close($ch);
?>

get xml of gestis database

I am trying to get the content (apparently not XML) of this website:
http://gestis.itrust.de/nxt/gateway.dll/gestis_de/010520.xml?f=templates$fn=default-doc.htm$3.0
via cURL or file_get_contents in PHP.
You can open the website in any browser, but whenever I try to open it with PHP to fetch the content automatically, it returns a 500 error.
here is the code used:
<?php
/* gets the data from a URL */
function get_data($url) {
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$returned_content = get_data('http://gestis.itrust.de/nxt/gateway.dll/gestis_de/010520.xml?f=templates$fn=default-doc.htm$3.0');
echo $returned_content;
?>
Does anybody have an idea how to get the XML from this website via PHP?
The website you want to open needs the vid=gestisdeu:sdbdeu value in the form of a cookie to work:
Cookie: nxt/gateway.dll/vid=gestisdeu%3Asdbdeu;
Please consult the curl documentation on how to set cookies, or take a look at the existing material on this website, for example "Is it possible to set the cookie content with CURL?" and the like.
Be aware that this may change depending on the website and its configuration. So technically your question can't really be answered, because that website doesn't document its HTTP request requirements. You need to find them out on your own and provide them when you ask such a question.
PHP Example:
$url = 'http://gestis.itrust.de/nxt/gateway.dll/gestis_de/010520.xml?f=templates$fn=default-doc.htm$3.0';
$options['http'] = ['header' => 'Cookie: nxt/gateway.dll/vid=gestisdeu%3Asdbdeu;'];
stream_context_set_default($options);
$content = file_get_contents($url);
var_dump($content);
Output:
string(104975) "<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>DGUV-IFA GESTIS</title>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
</head>
<body>
<html>
<head>
<META http-equiv="Content-Type" content="text/html">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link rel="stylesheet" href="/nxt/gateway.dll/gestis_de/010520.xml?f=stylesheets$fn=gestis-doc.css$up=1$3.0" type="text/css">
<"...
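The same cookie can also be sent with cURL through CURLOPT_COOKIE, which takes the name=value part of the Cookie header. This is a sketch under the assumption that the site still requires only this one cookie. The helper names gestis_curl_options() and gestis_fetch(), and the timeout value, are hypothetical.

```php
<?php
// Hypothetical helper: the cURL options for a request carrying the cookie
// from the answer above.
function gestis_curl_options(string $url): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,   // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 5,      // assumed value
        // Contents of the Cookie header, without the "Cookie: " prefix.
        CURLOPT_COOKIE         => 'nxt/gateway.dll/vid=gestisdeu%3Asdbdeu',
    ];
}

// Hypothetical helper: perform the request; returns the body or false on failure.
function gestis_fetch(string $url)
{
    $ch = curl_init();
    curl_setopt_array($ch, gestis_curl_options($url));
    return curl_exec($ch);
}
```

Passing the URL from the question to gestis_fetch() should then return the same document as the file_get_contents() example, provided the server's requirements have not changed.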

How to put CURLOPT_HTTPHEADER page in iframe?

I have to put this page, http://www.tvindiretta.com/m/, in an iframe. The page is powered by cURL. When I put the URL http://www.tvindiretta.com/m/index.php in an iframe (with the <iframe> tag), the browser redirects to the iframe URL. How can I keep this page inside the iframe? I have to change the user agent. I'm a complete noob with cURL, so please help. Here is the /m/index.php page source code:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.tvindiretta.com/");
curl_setopt($ch, CURLOPT_MAXREDIRS, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('User-Agent: Mozilla/5.0 (iPhone; U; CPU iPhone OS 2_2_1 like Mac OS X; en-us) AppleWebKit/525.18.1 (KHTML, like Gecko) Version/3.1.1 Mobile/5H11 Safari/525.20'));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it directly
$result = curl_exec($ch); // run the request once
curl_close($ch);
print $result;
?>
I don't think there is a user-agent redirection on this web page, since this test:
<?php
if (isset($_GET['get'])){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.tvindiretta.com/m");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$result = curl_exec($ch); // run the request once and keep the body
curl_close ($ch);
print $result;
}
else{
?>
<!DOCTYPE HTML>
<html>
<head>
<meta charset="utf-8">
</head>
<body>
<iframe src="test.php?get" style="position:absolute; top:100px; left:100px; width:400px; height:400px;"></iframe>
</body>
</html>
<?php } ?>
seems to break the page layout, but gives me the mobile content anyway.
So I guess the real problem here is the JavaScript code inside that page.
HTML5 has a new iframe attribute, "sandbox", which lets you restrict the behaviour of the iframe's content.
Unfortunately, at the time of writing this seemed to be supported only by Chrome and Safari.
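As an illustration of the sandbox idea, the iframe tag could be generated like this. It is a sketch only: sandboxed_iframe() is a hypothetical helper, the size is arbitrary, and whether the framed page still works once sandboxed is an assumption that needs testing. Leaving allow-top-navigation out of the token list is what prevents the framed page from navigating the top-level window.

```php
<?php
// Hypothetical helper: render an <iframe> with the HTML5 sandbox attribute.
// An empty sandbox value applies every restriction; each token lifts one.
function sandboxed_iframe(string $src, array $allow = ['allow-scripts']): string
{
    return sprintf(
        '<iframe src="%s" sandbox="%s" width="400" height="400"></iframe>',
        htmlspecialchars($src, ENT_QUOTES, 'UTF-8'),
        htmlspecialchars(implode(' ', $allow), ENT_QUOTES, 'UTF-8')
    );
}

echo sandboxed_iframe('test.php?get'), "\n";
```

Scripts inside the frame keep running because of allow-scripts, but an attempt to replace the parent page should be blocked by the browser.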
One idea would be to scrape the content of the web page (with DOMDocument in PHP, for instance), keep only the content you are interested in, and try to reproduce its style. That may be easier said than done, but I can't see a cleaner way to do it.
Since it seems you are interested in getting a TV program listing, you could look into a dedicated XML scraper such as XMLTV.
