Getting a page from Twitter or Facebook anonymously with cURL - PHP

I'm trying to build a page parser (more specifically, one that highlights certain words on pages) and I've run into some problems. I fetch the whole page from a URL using cURL, and most pages cooperate nicely, while others don't.
My goal is to get all of a page's HTML just as a browser would, and to do it anonymously, exactly like a browser. I mean: if a page requires a login before it shows data to a browser, that page doesn't interest me. The problem is that I can't fetch Twitter or Facebook pages that I can reach anonymously from a regular browser, even when I set all the headers exactly as Firefox or Chrome normally send them.
Is there a way to simply emulate a browser to get pages from these sites, or do I have to use OAuth (and can someone explain why browsers don't need it)?
EDIT
I found the solution! If anyone else has this problem, you should:
-> try switching the protocol from https to http
-> get rid of the /#!/ element if there is one in the URL
-> in my case the "Accept-Encoding: gzip, deflate" header was also causing problems; I'm not sure why, but without it everything is OK
My code:
if (substr($this->url, 0, 5) == 'https')
    $this->url = str_replace('https://', 'http://', $this->url);
$this->url = str_replace('/#!/', '/', $this->url);

// check if a valid URL is provided
if (!filter_var($this->url, FILTER_VALIDATE_URL))
    return false;

$curl = curl_init();
$header = array();
$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
// -> gives an error: $header[] = "Accept-Encoding: gzip, deflate";
$header[] = "Accept-Language: pl,en-us;q=0.7,en;q=0.3";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Pragma: "; // browsers keep this blank

curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_URL, $this->url);
curl_setopt($curl, CURLOPT_COOKIEJAR, "cookie.txt");
curl_setopt($curl, CURLOPT_COOKIEFILE, "cookie.txt");
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($curl, CURLOPT_COOKIESESSION, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7 (.NET CLR 3.5.30729)');
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);

$response = curl_exec($curl);
curl_close($curl);

if ($response) return $response;
return false;
This was all inside a class, but it's easy to extract the code. For me it now fetches both Twitter and Facebook nicely.
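For completeness, a rough sketch of how the method might be called from outside the class; the class and method names here are made up, since the full class isn't shown:
<?php
// Sketch only: PageParser and getPage() are hypothetical names, not the original class.
$parser = new PageParser();
$parser->url = 'https://twitter.com/#!/some_user'; // gets rewritten to http:// without /#!/
$html = $parser->getPage();

if ($html !== false) {
    echo strlen($html) . " bytes fetched\n";
} else {
    echo "Request failed\n";
}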

Yes, it is possible to emulate a browser, but you need to carefully watch all the HTTP headers (including cookies) that the browser sends, and handle redirects as well. Some of this can be automated by cURL options; the rest you'll have to handle manually.
Note: I'm not talking about HTML headers in the markup; these are the HTTP headers sent and received by browsers.
The easiest way to spot them is to use Fiddler to monitor the traffic. Choose a request and use the inspectors on the right; you'll see the headers that were sent and the headers that were received.
Facebook makes this more complicated with a myriad of iframes, so I suggest you start with a simpler website!
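To compare what your script actually sends with what Fiddler shows for a real browser, cURL can report its own outgoing request headers. A small diagnostic sketch (the URL is just a placeholder):
<?php
// Dump the request headers cURL sends so they can be compared with the browser's.
$ch = curl_init('http://example.com/');       // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);  // record the outgoing request headers
curl_setopt($ch, CURLOPT_VERBOSE, true);      // log the whole conversation...
$log = fopen('php://temp', 'w+');
curl_setopt($ch, CURLOPT_STDERR, $log);       // ...into this stream instead of STDERR
curl_exec($ch);

echo "Request headers sent:\n" . curl_getinfo($ch, CURLINFO_HEADER_OUT);
rewind($log);
echo "Verbose log (response headers, redirects):\n" . stream_get_contents($log);
curl_close($ch);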

Related

PHP cURL function returns 403 or 503 on websites that are working & public

I have a cURL function in PHP which I use to request the title and description of websites. I've tested it with thousands of different websites (using my existing bookmarks list) and it works great, but there are some websites where it fails because cURL returns a status code of either 403 or 503. For example, CodePen pens such as https://codepen.io/vrugtehagel/pen/eYJjYNm return 503 or sometimes 403.
This is the cURL function with the options that I have set up:
$ch = curl_init();
$header[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-GB,en-US;q=0.9,en;q=0.8";
$header[] = "Pragma: no-cache";
// Set agent as MacBook Chrome
$user_agent_chrome = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36';
// Set cURL Options
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, TRUE);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent_chrome);
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_ENCODING, '');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($ch, CURLOPT_MAXREDIRS, 4);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 4);
curl_setopt($ch, CURLOPT_TIMEOUT, 8);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE); // Staging
// Execute & fetch page string data
$url_data = curl_exec($ch);
// Flag if cURL was terminated
$curl_err = curl_error($ch) ? true : false;
// Get the status (404 | 200 | ... )
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
// Get final URL (if redirection happened)
$effective_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
// Terminate connection
curl_close($ch);
Since these websites are publicly accessible and they work when shared on social media sites, I'm wondering what could be set up wrong. I've searched a lot and tried several other methods. Could it be a cookie, or a particular referrer? Does anyone have a clue, or do you have your own well-tested cURL method/options that you can share?
Is there an ultimate "should work everywhere" cURL example somewhere on the web that we don't know about?
I've been trying to get around this problem for almost four months now, always giving up because I can't figure out why only a few minor websites don't work. I'd appreciate any help.
Thanks in advance!
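Not a guaranteed fix, but since cookies and referrers are exactly the kind of thing that trips this up, options along these lines are worth testing on top of the ones above (the cookie path and referrer value here are assumptions):
// Extra options to try alongside the existing ones; not guaranteed to help.
$cookie_jar = sys_get_temp_dir() . '/curl_cookies.txt';       // any writable file works
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_jar);             // save cookies set during redirects
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_jar);            // replay them on the next request
curl_setopt($ch, CURLOPT_REFERER, 'https://www.google.com/'); // some sites behave differently with a referrer
curl_setopt($ch, CURLINFO_HEADER_OUT, true);                  // so the exact request can be inspected afterwards
// After curl_exec(), curl_getinfo($ch, CURLINFO_HEADER_OUT) shows the request that was actually
// sent; if the body returned with the 403/503 turns out to be a JavaScript bot-protection
// challenge page, plain cURL will not be able to get past it.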

php curl with browser cookies

I have a web service like this:
http://www.sample.com/api/v2/exchanges/Web/stocks/Stock/lastN=200
that returns JSON.
The main problem is that you have to be logged in to http://sample.com to see this JSON; otherwise you get a 403 Forbidden error. The website uses Google authentication for login. Can I use the browser's cookie with cURL to get this JSON?
This is the code I found, but it didn't work for me:
function get_content($url, $ref)
{
    $browser = $_SERVER['HTTP_USER_AGENT'];
    $ch = curl_init();
    $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // browsers keep this blank
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $browser);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_REFERER, $ref);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, false);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}
"this website use google authenticate for login"
Then log in to Gmail first; Gmail will give you a special cookie which should allow you to fetch the JSON as an authenticated user.
"can i use browser cookie with curl for get this json?"
Yup. You can get the cookie by logging in in your browser and checking document.cookie in a JS console (a rough sketch of passing it to cURL follows below), but that sounds like a very inconvenient solution.
"code i found but it didnt work for me"
That code does not attempt to authenticate at all. For an example of Google authentication via Gmail, check
https://gist.github.com/divinity76/544d7cadd3e88e057ea3504cb8b3bf7e
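If you do copy the cookie out of the browser, a minimal sketch of passing it to cURL looks like this; the cookie string is a placeholder you would replace with the value from document.cookie, and it will stop working once that browser session expires:
<?php
// Sketch only: the cookie value below is a placeholder copied from the logged-in browser.
$url    = 'http://www.sample.com/api/v2/exchanges/Web/stocks/Stock/lastN=200';
$cookie = 'session=PASTE-YOUR-BROWSER-COOKIE-HERE';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_COOKIE, $cookie);          // send the browser's cookie with the request
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0'); // any browser-like UA string
$json = curl_exec($ch);
curl_close($ch);

$data = json_decode($json, true); // null if the response was not valid JSON
var_dump($data);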

Using cURL to load an HTML file that submits an external form with JS

I need a little help here:
I have 2 files:
index.php
form0.html
form0.html automatically fills in a form and submits it.
When I go straight to the file it works fine, but when I try to access it through my PHP script it won't work unless I print the results.
PHP code:
<?php
set_time_limit(30);
$delay_time = 2; // time to wait before looping again
$loop_times = 1; // number of times to loop
$url = array("http://localhost/htmlfile0.html");
for ($x = 0; $x < $loop_times; $x++)
{
    echo count($url);
    for ($i = 0; $i < count($url); $i++)
    {
        $url1 = $url[$i];
        $curl = curl_init(); // initialize the cURL handle
        $header = array();
        $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
        $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] = "Cache-Control: max-age=0";
        $header[] = "Connection: keep-alive";
        $header[] = "Keep-Alive: 30000";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma: "; // browsers keep this blank
        $var_host = parse_url($url1, PHP_URL_HOST);
        $cookieJar = 'cookies/' . $var_host . '.txt'; // per-host cookie file; adjust to your needs
        curl_setopt($curl, CURLOPT_HTTPHEADER, $header); // browser-like headers
        curl_setopt($curl, CURLOPT_COOKIEJAR, $cookieJar); // cookie file, if the site requires it
        curl_setopt($curl, CURLOPT_COOKIEFILE, $cookieJar);
        curl_setopt($curl, CURLOPT_HEADER, 0);
        curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9) Gecko/2008052906 Firefox/3.0');
        curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1); // follow any redirects, just in case
        curl_setopt($curl, CURLOPT_AUTOREFERER, true);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the page instead of printing it
        curl_setopt($curl, CURLOPT_POSTFIELDS, '');
        curl_setopt($curl, CURLOPT_POST, true); // post the form
        curl_setopt($curl, CURLOPT_URL, $url1); // set the URL
        $ch = curl_exec($curl); // fetch the page
        curl_close($curl); // close the cURL handle
        echo date('h:i:s') . "\n";
        if ($ch) echo "Success: " . $url1;
        else echo "Fail: " . $url1;
        echo '<hr>';
        sleep(5);
        echo date('h:i:s') . "\n";
    }
    if ($x < $loop_times) sleep($delay_time);
}
?>
How can I get past this?
Thanks.
Correct me if I'm wrong, but you're trying to execute JS using a cURL request, which you can't, at least not directly.
When you go straight to the file it works fine: your browser's JS interpreter executes the JavaScript properly and the form is submitted. By using cURL you are doing something different: you are sending an HTTP request to a certain URL (http://localhost/htmlfile0.html), and you may or may not fetch the response content, but the script inside it is never run.
Even if you fetch the content (and later run the JS in a browser's interpreter), the JavaScript may refuse to take its action depending on whether or not it is running at the expected URL.
Example:
do some action if it is the right URL, i.e. the code was reached at http://example.com/script.html
do nothing if it's not that URL, which is the case when your cURL PHP script reaches that URL and then outputs the document's code on http://localhost.
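Since cURL can't run the JavaScript that auto-submits the form, one workaround is to POST the form data directly to the form's action URL yourself. A sketch only: the action URL and field names below are made up, and you'd take the real ones from the <form> element in form0.html:
<?php
// Hypothetical action URL and fields; replace with the real ones from form0.html.
$action = 'http://localhost/form-handler.php';
$fields = array(
    'field1' => 'value1',
    'field2' => 'value2',
);

$curl = curl_init($action);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_POST, true);
curl_setopt($curl, CURLOPT_POSTFIELDS, http_build_query($fields)); // same data the JS would submit
$response = curl_exec($curl);
curl_close($curl);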

Why can I not scrape the title off this site?

I'm using simple-html-dom to scrape the title off of a specified site.
<?php
include('simple_html_dom.php');
$html = file_get_html('http://www.pottermore.com/');
foreach ($html->find('title') as $element)
    echo $element->innertext . '<br>';
?>
Any other site I've tried works, apple.com for example.
But if I input pottermore.com, it doesn't output anything. Pottermore has Flash elements on it, but the home page I'm trying to scrape the title from has no Flash, just HTML.
This works for me :)
$url = 'http://www.pottermore.com/';
$html = get_html($url);
file_put_contents('page.htm', $html); // just to test what you have downloaded
echo 'The title from: ' . $url . ' is: ' . get_snip($html, '<title>', '</title>');

function get_html($url)
{
    $ch = curl_init();
    $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // browsers keep this blank
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows;U;Windows NT 5.0;en-US;rv:1.4) Gecko/20030624 Netscape/7.1 (ax)');
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, COOKIE); // COOKIE must be defined elsewhere as the path to a writable cookie file
    curl_setopt($ch, CURLOPT_COOKIEJAR, COOKIE);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}

// Returns the substring of $string between the markers $start and $end.
function get_snip($string, $start, $end, $trim_start = '1', $trim_end = '1')
{
    $startpos = strpos($string, $start);
    $endpos = strpos($string, $end, $startpos);
    if ($trim_start != '')
    {
        $startpos += strlen($start); // skip the opening marker
    }
    if ($trim_end == '')
    {
        $endpos += strlen($end); // include the closing marker
    }
    return substr($string, $startpos, ($endpos - $startpos));
}
Just to confirm what others are saying, if you don't send a user agent string this site sends 403 Forbidden.
Adding this worked for me:
User-Agent: Mozilla/5.0 (Windows;U;Windows NT 5.0;en-US;rv:1.4) Gecko/20030624 Netscape/7.1 (ax)
The function file_get_html uses file_get_contents under the covers. This function can pull data from a URL, but to do so, it sends a User Agent string.
By default, this string is empty. Some webservers use this fact to detect that a non-browser is accessing its data and opt to forbid this.
You can set user_agent in php.ini to control the User Agent string that gets sent. Or, you could try:
ini_set('user_agent','UA-String');
with 'UA-String' set to whatever you like.
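Alternatively, if you'd rather not change the setting globally, a stream context can set the User Agent for a single file_get_contents() call. A minimal sketch:
<?php
// Sends a browser-like User Agent for this one request only.
$context = stream_context_create(array(
    'http' => array(
        'user_agent' => 'Mozilla/5.0 (Windows;U;Windows NT 5.0;en-US;rv:1.4) Gecko/20030624 Netscape/7.1 (ax)',
    ),
));
$html = file_get_contents('http://www.pottermore.com/', false, $context);
// str_get_html($html) from simple_html_dom can then parse the result.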

file_get_contents and jQuery pageless

I am using the PHP file_get_contents function to retrieve the HTML from Pinterest's source tracking page, which shows all of the pins originating from a particular domain. Ex: http://pinterest.com/source/google.com/
However, Pinterest appears to be using the jQuery "pageless" feature, so not all of the content is present in the HTML that loads initially.
Is there a way to force the file_get_contents function to trigger the pageless function, so that the entire result set is returned?
I tried file_get_contents, but for some reason that didn't give me much of anything; cURL, however, seems to work just fine for me.
You will need cURL installed on your server, and the libcurl extension for PHP, but you could try something like this and see what you get:
<?php
$cl = curl_init();
$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3";
$header[] = "Accept-Language: nb-NO,nb;q=0.8,no;q=0.6,nn;q=0.4,en-US;q=0.2,en;q=0.2";
$header[] = "Pragma: ";
curl_setopt($cl, CURLOPT_FAILONERROR, true);
curl_setopt($cl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7');
curl_setopt($cl, CURLOPT_HTTPHEADER, $header);
curl_setopt($cl, CURLOPT_REFERER, 'http://www.google.com');
curl_setopt($cl, CURLOPT_ENCODING, 'gzip,deflate');
curl_setopt($cl, CURLOPT_AUTOREFERER, false);
curl_setopt($cl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($cl, CURLOPT_CONNECTTIMEOUT, 2);
$url = 'http://pinterest.com/source/google.com/';
curl_setopt($cl, CURLOPT_URL, $url);
$output = curl_exec($cl);
curl_close($cl);
?>
<!DOCTYPE html>
<html>
<head>
    <title>get pinterest</title>
</head>
<body>
<xmp>
<?php echo $output; ?>
</xmp>
</body>
</html>
file_get_contents(..) simply gives you what you see as the page source in your browser. It can't give you content that gets loaded through JavaScript. The best approach in your case is to look for the AJAX calls the page makes. You can use the browser's developer tools to monitor page activity (on Chrome you get them with Ctrl+Shift+J).
Once you have the URL the request is made to, you can use it directly in file_get_contents(..) to fetch the relevant data.
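Once you've found that request in the network monitor, fetching it directly might look roughly like this; the endpoint below is purely illustrative and you'd substitute whatever URL the page actually calls:
<?php
// Purely illustrative endpoint; use the real URL you see in the browser's network monitor.
$endpoint = 'http://pinterest.com/source/google.com/?page=2';
$json = file_get_contents($endpoint);
if ($json !== false) {
    $data = json_decode($json, true); // many such endpoints return JSON rather than HTML
    var_dump($data);
}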
