I'm using simple-html-dom to scrape the title off of a specified site.
<?php
include('simple_html_dom.php');
$html = file_get_html('http://www.pottermore.com/');
foreach ($html->find('title') as $element) {
    echo $element->innertext . '<br>';
}
?>
Every other site I've tried works, apple.com for example.
But if I point it at pottermore.com, it doesn't output anything. Pottermore has Flash elements on it, but the home page I'm trying to scrape the title from has no Flash, just HTML.
This works for me :)
// Path to a cookie jar file; it must be writable by the web server.
define('COOKIE', __DIR__ . '/cookie.txt');

$url  = 'http://www.pottermore.com/';
$html = get_html($url);
file_put_contents('page.htm', $html); // just to test what you have downloaded
echo 'The title from: ' . $url . ' is: ' . get_snip($html, '<title>', '</title>');
function get_html($url)
{
    $ch = curl_init();
    $header[0]  = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // browsers keep this blank.

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows;U;Windows NT 5.0;en-US;rv:1.4) Gecko/20030624 Netscape/7.1 (ax)');
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header); // actually send the headers built above
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, COOKIE);
    curl_setopt($ch, CURLOPT_COOKIEJAR, COOKIE);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
function get_snip($string, $start, $end, $trim_start = '1', $trim_end = '1')
{
    $startpos = strpos($string, $start);
    if ($startpos === false) {
        return ''; // start marker not found
    }
    $endpos = strpos($string, $end, $startpos);
    if ($trim_start != '') {
        $startpos += strlen($start); // skip past the opening marker
    }
    if ($trim_end == '') {
        $endpos += strlen($end); // include the closing marker
    }
    return substr($string, $startpos, $endpos - $startpos);
}
Just to confirm what others are saying, if you don't send a user agent string this site sends 403 Forbidden.
Adding this worked for me:
User-Agent: Mozilla/5.0 (Windows;U;Windows NT 5.0;en-US;rv:1.4) Gecko/20030624 Netscape/7.1 (ax)
The function file_get_html uses file_get_contents under the covers. That function can pull data from a URL, and when it does, it sends whatever User-Agent string PHP is configured with.
By default, this string is empty. Some web servers use this fact to detect that a non-browser is accessing their data and opt to forbid it.
You can set user_agent in php.ini to control the User Agent string that gets sent. Or, you could try:
ini_set('user_agent','UA-String');
with 'UA-String' set to whatever you like.
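For example, the original simple_html_dom snippet should work once a user agent is set first. Here is a minimal sketch, assuming simple_html_dom.php is available in the include path; the user-agent string itself is arbitrary:
<?php
// Minimal sketch: set a user agent before calling file_get_html().
// Assumes simple_html_dom.php is available; the UA string is arbitrary.
include('simple_html_dom.php');
ini_set('user_agent', 'Mozilla/5.0 (compatible; MyScraper/1.0)');
$html = file_get_html('http://www.pottermore.com/');
if ($html) {
    foreach ($html->find('title') as $element) {
        echo $element->innertext . '<br>';
    }
}
?>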
Related
I have a web service like this:
http://www.sample.com/api/v2/exchanges/Web/stocks/Stock/lastN=200
It returns JSON.
The main problem is that you have to be logged in to http://sample.com to see this JSON; otherwise you get a 403 Forbidden error. This website uses Google authentication for login. Can I use a browser cookie with cURL to get this JSON?
This is the code I found, but it didn't work for me:
function get_content($url, $ref)
{
    $browser = $_SERVER['HTTP_USER_AGENT'];
    $ch = curl_init();
    $header[0]  = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // browsers keep this blank.

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $browser);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header); // send the headers built above
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_REFERER, $ref);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, false);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}
"this website use google authenticate for login"
Then log in to Gmail first; Google will give you a special cookie which should allow you to fetch the JSON as an authenticated user.
"can i use browser cookie with curl for get this json?"
Yes. You can get the cookie by logging in with your browser and reading document.cookie in a JavaScript console, but that sounds like a very inconvenient solution.
"code i found but it didnt work for me"
That code does not attempt to authenticate at all. For an example of Google authentication via Gmail, check
https://gist.github.com/divinity76/544d7cadd3e88e057ea3504cb8b3bf7e
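If you do go the copy-the-cookie-from-the-browser route, the idea is roughly the following sketch; the cookie name and value are placeholders, not the real ones this API uses:
<?php
// Sketch: reuse a cookie copied from an authenticated browser session.
// The cookie string below is a placeholder, not a real value.
$url    = 'http://www.sample.com/api/v2/exchanges/Web/stocks/Stock/lastN=200';
$cookie = 'SESSIONID=paste-value-from-document.cookie-here';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_COOKIE, $cookie);          // send the copied cookie header
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0'); // some servers also check this

$json = curl_exec($ch);
curl_close($ch);

var_dump(json_decode($json, true));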
I'm using the following script to pull the latest post from my Facebook page.
It does this as expected; however, if the Facebook post contains a hyperlink, the link becomes garbled and no longer works. Try it out if you can using my code, making sure cURL is installed.
<?php
$url = "http://www.facebook.com/feeds/page.php?id=466171083413035&format=json";
// disguises the curl using fake headers and a fake user agent.
function disguise_curl($url)
{
    $curl = curl_init();

    // Setup headers - the same headers from Firefox version 2.0.0.6
    // below was split up because the line was too long.
    $header[0]  = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // browsers keep this blank.

    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla');
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($curl, CURLOPT_REFERER, '');
    curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_TIMEOUT, 10);

    $html = curl_exec($curl); // execute the curl command
    curl_close($curl);        // close the connection
    return $html;             // and finally, return $html
}
// uses the function and displays the text off the website
$text = disguise_curl($url);
$json_feed_object = json_decode($text);
$i = 0;
foreach ($json_feed_object->entries as $entry)
{
    echo "<h2>{$entry->title}</h2>";
    $published = date("g:i A F j, Y", strtotime($entry->published));
    echo "<small>{$published}</small>";
    $content = preg_replace("/<img[^>]+\>/i", "", $entry->content);
    echo "<p style='word-wrap:break-word;'>{$content}</p>";
    echo "<hr />";
    $i++;
    if ($i == 1) { break; }
}
?>
EDIT
My hyperlink appears as:
http://www.empireonline.com/news/story.asp?NID=36903<br/><br/>
Has anyone ever come across this issue before? Is there a solution?
Many thanks for any pointers.
My bad.
All I needed to do was a string replace on the URLs to prepend facebook.com.
Here is my code in case it helps any others:
$content = str_replace(' href="/l.php', ' href="http://www.facebook.com/l.php',$content);
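If other relative paths show up besides /l.php, a slightly more general sketch (assuming every relative link in the feed content should point back to facebook.com) is:
// Sketch: prefix any relative href (one starting with a single "/") with facebook.com.
// Assumes all relative links in the feed content belong to Facebook.
$content = preg_replace(
    '/href="\/(?!\/)/',      // matches href="/..." but not protocol-relative href="//..."
    'href="http://www.facebook.com/',
    $content
);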
I'm trying to make a kind of page parser (more specifically, something that highlights certain words on pages) and I've run into some problems with it. I'm fetching whole pages from URLs using cURL, and most pages cooperate nicely, while others don't.
My goal is to get all of the page HTML exactly as a browser would get it, and to do so anonymously, just like a browser. I mean that if a page requires a login to show its data to a browser, that data doesn't interest me. The problem is that I can't fetch Twitter or Facebook pages that I can reach anonymously from a regular browser, even when I set all the headers exactly as Firefox or Chrome sends them.
Is there any way to simply emulate a browser to get pages from these sites, or do I have to use OAuth (and can someone explain why browsers don't need to use it)?
EDIT
I found the solution! If anybody else has problems with this, you should:
-> try switching the protocol from https to http
-> get rid of the /#!/ element if there is one in the URL
-> in my case the cURL header "Accept-Encoding: gzip, deflate" was also causing problems; I don't know why, but without it everything is OK
My code:
if (substr($this->url, 0, 5) == 'https') {
    $this->url = str_replace('https://', 'http://', $this->url);
}
$this->url = str_replace('/#!/', '/', $this->url);

// check if a valid url is provided
if (!filter_var($this->url, FILTER_VALIDATE_URL)) {
    return false;
}

$curl = curl_init();
$header = array();
$header[0]  = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
// -> gives an error: $header[] = "Accept-Encoding: gzip, deflate";
$header[] = "Accept-Language: pl,en-us;q=0.7,en;q=0.3";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Pragma: "; // browsers keep this blank.

curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_URL, $this->url);
curl_setopt($curl, CURLOPT_COOKIEJAR, "cookie.txt");
curl_setopt($curl, CURLOPT_COOKIEFILE, "cookie.txt");
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($curl, CURLOPT_COOKIESESSION, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7 (.NET CLR 3.5.30729)');
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);

$response = curl_exec($curl);
curl_close($curl);

if ($response) {
    return $response;
}
return false;
This was all inside a class, but you can extract the code very easily. For me it fetches both Twitter and Facebook nicely.
Yes, it is possible to emulate a browser, but you need to carefully watch all the HTTP headers (including cookies) that are sent by the browser, and handle redirects as well. Some of this can be "automated" by cURL options; the rest you'll need to handle manually.
Note: I'm not talking about HTML headers in the markup; these are HTTP headers sent and received by browsers.
The easiest way to spot these is to use Fiddler to monitor the traffic. Choose a URL and look on the right for "inspect element", and you'll see the headers that get sent and the headers that are received.
Facebook makes this more complicated with a myriad of iframes, so I suggest you start on a simpler website!
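On the PHP side, one way to compare what your script sends with what the browser sends is to ask cURL to record its outgoing request headers; here is a minimal sketch:
<?php
// Sketch: record the request headers cURL actually sends, so they can be
// compared against the headers a real browser sends (e.g. as seen in Fiddler).
$ch = curl_init('http://twitter.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLINFO_HEADER_OUT, true); // keep a copy of the outgoing headers
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');

$body = curl_exec($ch);
$sentHeaders = curl_getinfo($ch, CURLINFO_HEADER_OUT);
curl_close($ch);

echo "Request headers sent:\n" . $sentHeaders;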
I'm looking to build a PHP script that parses HTML for particular tags. I've been using this code block, adapted from this tutorial:
<?php
$data = file_get_contents('http://www.google.com');
$regex = '/<title>(.+?)</';
preg_match($regex,$data,$match);
var_dump($match);
echo $match[1];
?>
The script works with some websites (like google, above), but when I try it with other websites (like, say, freshdirect), I get this error:
"Warning: file_get_contents(http://www.freshdirect.com) [function.file-get-contents]: failed to open stream: HTTP request failed!"
I've seen a bunch of great suggestions on StackOverflow, for example to enable extension=php_openssl.dll in php.ini. But (1) my version of php.ini didn't have extension=php_openssl.dll in it, and (2) when I added it to the extensions section and restarted the WAMP server, per this thread, still no success.
Would someone mind pointing me in the right direction? Thank you very much!
It just requires a user-agent ("any" really, any string suffices):
file_get_contents("http://www.freshdirect.com",false,stream_context_create(
array("http" => array("user_agent" => "any"))
));
See more options.
Of course, you can set user_agent in your ini:
ini_set("user_agent","any");
echo file_get_contents("http://www.freshdirect.com");
... but I prefer to be explicit for the next programmer working on it.
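For instance, the same context mechanism accepts other request options too; a small sketch (the header values here are only examples):
<?php
// Sketch: a stream context can carry more than just the user agent.
// The header values below are only examples.
$context = stream_context_create(array(
    "http" => array(
        "user_agent" => "Mozilla/5.0 (compatible; MyScraper/1.0)",
        "header"     => "Accept-Language: en-us,en;q=0.5\r\n" .
                        "Accept: text/html,application/xhtml+xml\r\n",
        "timeout"    => 10,
    ),
));
echo file_get_contents("http://www.freshdirect.com", false, $context);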
$html = file_get_html('http://google.com/');
$title = $html->find('title', 0)->innertext; // find() needs an index to return a single element
Or, if you prefer preg_match, then you should really be using cURL instead of file_get_contents...
function curl($url){
    $headers[] = "User-Agent:Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13";
    $headers[] = "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    $headers[] = "Accept-Language:en-us,en;q=0.5";
    $headers[] = "Accept-Encoding:gzip,deflate";
    $headers[] = "Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $headers[] = "Keep-Alive:115";
    $headers[] = "Connection:keep-alive";
    $headers[] = "Cache-Control:max-age=0";

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($curl, CURLOPT_ENCODING, "gzip");
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($curl);
    curl_close($curl);
    return $data;
}
$data = curl('http://www.google.com');
$regex = '#<title>(.*?)</title>#mis';
preg_match($regex,$data,$match);
var_dump($match);
echo $match[1];
Another option: some hosts disable CURLOPT_FOLLOWLOCATION, so a recursive approach is what you want here; it will also log any errors to a text file. There is also a simple example of how to use DOMDocument() to extract the content. Obviously it's not extensive, but it's something you could build upon.
<?php
function file_get_site($url){
    (function_exists('curl_init')) ? '' : die('cURL must be installed. Ask your host to enable it or uncomment extension=php_curl.dll in php.ini');
    $curl = curl_init();
    $header[0]  = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: ";

    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0 Firefox/5.0');
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($curl, CURLOPT_HEADER, true);
    curl_setopt($curl, CURLOPT_REFERER, $url);
    curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_TIMEOUT, 60);

    $html   = curl_exec($curl);
    $status = curl_getinfo($curl);
    curl_close($curl);

    if($status['http_code'] != 200){
        // Follow 301/302 redirects manually, since CURLOPT_FOLLOWLOCATION may be disabled.
        if($status['http_code'] == 301 || $status['http_code'] == 302) {
            list($header) = explode("\r\n\r\n", $html, 2);
            $matches = array();
            preg_match("/(Location:|URI:)[^(\n)]*/", $header, $matches);
            $url = trim(str_replace($matches[1], "", $matches[0]));
            $url_parsed = parse_url($url);
            return (isset($url_parsed)) ? file_get_site($url) : '';
        }
        // Log any other non-200 response to a text file.
        $oline = '';
        foreach($status as $key => $eline){ $oline .= '[' . $key . ']' . $eline . ' '; }
        $line   = $oline . " \r\n " . $url . "\r\n-----------------\r\n";
        $handle = @fopen('./curl.error.log', 'a'); // "@" suppresses the warning if the log cannot be opened
        fwrite($handle, $line);
        return FALSE;
    }
    return $html;
}
function get_content_tags($source, $tag, $id = null, $value = null){
    $xml = new DOMDocument();
    @$xml->loadHTML($source);
    foreach($xml->getElementsByTagName($tag) as $tags) {
        if($id != null){
            // When filtering by an attribute, only return the content of the matching tag.
            if($tags->getAttribute($id) == $value){
                return $tags->getAttribute('content');
            }
            continue;
        }
        return $tags->nodeValue;
    }
}
$source = file_get_site('http://www.freshdirect.com/about/index.jsp');
echo get_content_tags($source,'title'); //FreshDirect
echo get_content_tags($source,'meta','name','description'); //Online grocer providing high quality fresh......
?>
I am authenticating a login via CURL just fine. I have a variable I am using to display the returned HTML, and it is returning my user control panel as if I am logged in.
After authenticating, I want to submit variables to a form on another page within the site, but for some reason the HTML returned from that page shows the non-authenticated version of the header (as if the original authentication never took place).
I have a cookies.txt file with 777 permissions, and I have tried just fetching the contents of the same page that is shown when I authenticate; it is as if I am losing any associated session/cookie data somewhere along the way.
Here is my curl.class.php file:
<?php
class Curl {
    public $curl;          // cURL handle, created in get() and postForm()
    public $cookieJar = "";

    // Make sure the cookies.txt file has read/write permissions
    public function __construct($cookieJarFile = 'cookies.txt') {
        $this->cookieJar = $cookieJarFile;
    }

    function setup() {
        $header = array();
        $header[0]  = "Accept: text/xml,application/xml,application/xhtml+xml,";
        $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] = "Cache-Control: max-age=0";
        $header[] = "Connection: keep-alive";
        $header[] = "Keep-Alive: 300";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma: "; // browsers keep this blank.

        curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
        curl_setopt($this->curl, CURLOPT_HTTPHEADER, $header);
        curl_setopt($this->curl, CURLOPT_COOKIEJAR, $this->cookieJar);
        curl_setopt($this->curl, CURLOPT_COOKIEFILE, $this->cookieJar);
        curl_setopt($this->curl, CURLOPT_AUTOREFERER, true);
        curl_setopt($this->curl, CURLOPT_COOKIESESSION, true);
        curl_setopt($this->curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);
    }

    function get($url) {
        $this->curl = curl_init($url);
        $this->setup();
        return $this->request();
    }

    function getAll($reg, $str) {
        preg_match_all($reg, $str, $matches);
        return $matches[1];
    }

    function postForm($url, $fields, $referer = '') {
        $this->curl = curl_init($url);
        $this->setup();
        curl_setopt($this->curl, CURLOPT_URL, $url);
        curl_setopt($this->curl, CURLOPT_POST, 1);
        curl_setopt($this->curl, CURLOPT_REFERER, $referer);
        curl_setopt($this->curl, CURLOPT_POSTFIELDS, $fields);
        return $this->request();
    }

    function getInfo($info) {
        $info = ($info == 'lasturl') ? curl_getinfo($this->curl, CURLINFO_EFFECTIVE_URL) : curl_getinfo($this->curl, $info);
        return $info;
    }

    function request() {
        return curl_exec($this->curl);
    }
}
?>
And here is my curl.php file -
<?php
include('curl.class.php'); // This path would change to where you store the file
$curl = new Curl();
$url = "http://www.site.com/public/member/signin";
$fields = "MAX_FILE_SIZE=50000000&dado_form_3=1&member[email]=email&member[password]=pass&x=16&y=5&member[persistent]=true";
// Calling URL
$referer = "http://www.site.com/public/member/signin";
$html = $curl->postForm($url, $fields, $referer);
echo($html);
?>
<hr style="clear:both;"/>
<?php
$html = $curl->postForm('http://www.site.com/index.php','nid=443&sid=733005&tab=post&eval=yes&ad=&MAX_FILE_SIZE=10000000&ip=63.225.235.30','http://www.site.com/public/member/signin');
echo $html; // This will show you the HTML of the current page you are logged into
?>
Any ideas?
As always when doing HTTP scripting, you should use LiveHTTPHeaders or a similar tool to record a manual session first, and then mimic that as closely as possible when you write your cURL code.
Also (unfortunately) the command-line curl tool offers slightly better debugging and tracing options than the PHP binding does, which makes it the better tool for working out exactly what you need to do; once that works, you can convert it to a PHP program.
See http://curl.haxx.se/docs/httpscripting.html for further details.
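The PHP binding can still produce a verbose trace, roughly the equivalent of curl -v on the command line; here is a small sketch that writes the trace to a log file:
<?php
// Sketch: enable cURL's verbose trace and send it to a log file, so the
// request/response headers and cookie handling can be inspected afterwards.
$ch = curl_init('http://www.site.com/public/member/signin');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_STDERR, fopen('curl_trace.log', 'a'));

curl_exec($ch);
curl_close($ch);
// curl_trace.log now contains the full trace to compare with a recorded browser session.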
Err, please tell us what authentication scheme the server is using. Not all schemes use cookies.
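For example, if the site turned out to use HTTP Basic authentication rather than a cookie-based login form, cURL handles that directly without any cookie jar; a sketch with placeholder credentials:
<?php
// Sketch: HTTP Basic authentication with cURL, no cookies involved.
// The URL and credentials are placeholders.
$ch = curl_init('http://www.site.com/protected/page');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
curl_setopt($ch, CURLOPT_USERPWD, 'username:password');

$html = curl_exec($ch);
curl_close($ch);
echo $html;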