Not able to parse HTML content using PHP because of special characters

I'm trying to scrape a website using cURL. So far I have written the following:
The Curl class:
<?php
class Curl
{
    public $curl;
    public $cookieJar = "";

    public function __construct($cookieJarFile = 'cookies.txt') {
        $this->cookieJar = $cookieJarFile;
    }

    function setup()
    {
        $header = array();
        $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
        $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] = "Cache-Control: max-age=0";
        $header[] = "Connection: keep-alive";
        $header[] = "Keep-Alive: 300";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma: "; // browsers keep this blank.

        curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
        curl_setopt($this->curl, CURLOPT_HTTPHEADER, $header);
        curl_setopt($this->curl, CURLOPT_COOKIEJAR, $this->cookieJar);
        curl_setopt($this->curl, CURLOPT_COOKIEFILE, $this->cookieJar);
        curl_setopt($this->curl, CURLOPT_AUTOREFERER, true);
        curl_setopt($this->curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);
    }

    function get($url)
    {
        $this->curl = curl_init($url);
        $this->setup();
        return $this->request();
    }

    function getAll($reg, $str)
    {
        preg_match_all($reg, $str, $matches);
        return $matches[1];
    }

    function postForm($url, $fields, $referer = '')
    {
        $this->curl = curl_init($url);
        $this->setup();
        curl_setopt($this->curl, CURLOPT_URL, $url);
        curl_setopt($this->curl, CURLOPT_POST, 1);
        curl_setopt($this->curl, CURLOPT_REFERER, $referer);
        curl_setopt($this->curl, CURLOPT_POSTFIELDS, $fields);
        return $this->request();
    }

    function getInfo($info)
    {
        $info = ($info == 'lasturl') ? curl_getinfo($this->curl, CURLINFO_EFFECTIVE_URL) : curl_getinfo($this->curl, $info);
        return $info;
    }

    function request()
    {
        return curl_exec($this->curl);
    }
}
?>
And then I'm calling this Curl class in my PHP file:
include_once("curl.php");

$curl = new Curl();
$html = $curl->get("www.somewebsite.com");
$html = htmlentities($html);
//echo $html;
$pattern = htmlentities("<span class=\"review-text\">");

function get_string_between($string, $start, $end)
{
    $string = " " . $string;
    $ini = strpos($string, $start);
    if ($ini == 0)
        return "";
    $ini += strlen($start);
    $len = strpos($string, $end, $ini) - $ini;
    return substr($string, $ini, $len);
}

echo get_string_between($html, '<span class=\"review-text\">', '<\/span>');
Now, when I try to get the string between those two markers, I get a blank page, even though when I view the HTML content I can clearly spot the string.
The HTML content is very big, and I'm trying to search that huge file and extract the content between the two markers.
I even tried replacing the "<" symbol with "&lt;", but it still does not find the string.

There is a better way to get the value of an HTML tag: use the DOM.
$dom = new DomDocument();
$dom->preserveWhiteSpace = false;
@$dom->loadHTML($html);
$spans = $dom->getElementsByTagName('span');
foreach ($spans as $span) {
    if ($span->getAttribute('class') == 'review-text') {
        print $span->nodeValue;
    }
}
Or there is another way, with XPath:
$dompath = new DOMXPath($dom);
$review_div = $dompath->query('//*[@class="review-text"]')->item(0);
$string = $review_div->nodeValue;
Hope this helps you.
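For reference, here is a minimal end-to-end sketch tying the two pieces together: fetch the page with the Curl class from the question, skip the htmlentities() step (DOM wants the raw HTML), and pull every matching span with XPath. The URL is the question's placeholder:
include_once("curl.php");

$curl = new Curl();
$html = $curl->get("http://www.somewebsite.com"); // placeholder URL from the question

$dom = new DomDocument();
@$dom->loadHTML($html); // @ suppresses warnings from real-world malformed HTML
$xpath = new DOMXPath($dom);

// Print every <span class="review-text">, not just the first
foreach ($xpath->query('//span[@class="review-text"]') as $span) {
    echo $span->nodeValue . "\n";
}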

Related

Even CURL function can't scrape some urls

I'm using cURL to scrape the HTML from URLs. It works great for about 80% of the URLs I use, but some URLs don't seem "scrapeable". For example, when I try to scrape http://www.thefancy.com it doesn't work: the website keeps loading and in the end it doesn't return a result. The problem is testable at: http://www.itemmized.com/test/test/ This is my code:
if ($_POST['submit']) {
    function curl_exec_follow($ch, &$maxredirect = null) {
        $mr = $maxredirect === null ? 5 : intval($maxredirect);
        if (ini_get('open_basedir') == '' && ini_get('safe_mode') == 'Off') {
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $mr > 0);
            curl_setopt($ch, CURLOPT_MAXREDIRS, $mr);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        } else {
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
            if ($mr > 0)
            {
                $original_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
                $newurl = $original_url;
                $rch = curl_copy_handle($ch);
                curl_setopt($rch, CURLOPT_HEADER, true);
                curl_setopt($rch, CURLOPT_NOBODY, true);
                curl_setopt($rch, CURLOPT_FORBID_REUSE, false);
                do
                {
                    curl_setopt($rch, CURLOPT_URL, $newurl);
                    $header = curl_exec($rch);
                    if (curl_errno($rch)) {
                        $code = 0;
                    } else {
                        $code = curl_getinfo($rch, CURLINFO_HTTP_CODE);
                        if ($code == 301 || $code == 302) {
                            preg_match('/Location:(.*?)\n/', $header, $matches);
                            $newurl = trim(array_pop($matches));
                            // if no scheme is present then the new url is a
                            // relative path and thus needs some extra care
                            if (!preg_match("/^https?:/i", $newurl)) {
                                $newurl = $original_url . $newurl;
                            }
                        } else {
                            $code = 0;
                        }
                    }
                } while ($code && --$mr);
                curl_close($rch);
                if (!$mr)
                {
                    if ($maxredirect === null)
                        trigger_error('Too many redirects.', E_USER_WARNING);
                    else
                        $maxredirect = 0;
                    return false;
                }
                curl_setopt($ch, CURLOPT_URL, $newurl);
            }
        }
        return curl_exec($ch);
    }

    $ch = curl_init($_POST['form_url']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $data = curl_exec_follow($ch);
    curl_close($ch);
    echo $data;
}
Try this, reusing the exact same Curl class from the question above; hope this helps:
<?php
include_once("curl.php");

$curl = new Curl();
$html = $curl->get("http://www.thefancy.com");
echo $html;
?>
You're probably unable to scrape http://www.thefancy.com because every time you reach the bottom of the page new content loads, so you are effectively trying to fetch an enormous amount of information with cURL; that's probably where the problem is, and you just hit a timeout. Try setting a larger timeout in php.ini and give it another try. It will probably take a while to load, but I think it will work this way.
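If editing php.ini is not an option, the timeouts can also be raised per request; a minimal sketch, assuming the Curl class shown earlier (the values are illustrative, not tuned):
// Inside Curl::setup(), next to the other curl_setopt() calls:
curl_setopt($this->curl, CURLOPT_CONNECTTIMEOUT, 30); // seconds allowed to establish the connection
curl_setopt($this->curl, CURLOPT_TIMEOUT, 120);       // seconds allowed for the whole transfer

// And in the calling script, so PHP itself doesn't abort first:
set_time_limit(180);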

Grab imdb poster image from search term using php

I would like to get the poster image URL from IMDb using PHP, starting from a search term. For example, given the search term 21 Jump Street, I would like to get back the image URL, or just the IMDb movie URL. With the code below I only need to retrieve the URL of the movie from a search term.
Here is the code I have:
<?php
include("simple_html_dom.php");

//url to imdb page
$url = 'hereistheurliwanttogetfromsearch';

//get the page content
$imdb_content = file_get_contents($url);
$html = str_get_html($imdb_content);

$name = $html->find('title', 0)->plaintext;
$director = $html->find('a[itemprop="director"]', 0)->innertext;
$plot = $html->find('p[itemprop="description"]', 0)->innertext;
$release_date = $html->find('time[itemprop="datePublished"]', 0)->innertext;
$mpaa = $html->find('span[itemprop="contentRating"]', 0)->innertext;
$run_time = $html->find('time[itemprop="duration"]', 0)->innertext;
$img = $html->find('img[itemprop="image"]', 0)->src;

$content = "";
//build content
$content .= '<h2>Film</h2><p>' . $name . '</p>';
$content .= '<h2>Director</h2><p>' . $director . '</p>';
$content .= '<h2>Plot</h2><p>' . $plot . '</p>';
$content .= '<h2>Release Date</h2><p>' . $release_date . '</p>';
$content .= '<h2>MPAA</h2><p>' . $mpaa . '</p>';
$content .= '<h2>Run Time</h2><p>' . $run_time . '</p>';
$content .= '<h2>Full Details</h2><p>' . $url . '</p>';
$content .= '<img src="' . $img . '" />';

echo $content;
?>
Using the API that Kasper Mackenhauer Jacobsen suggested, here's a fuller answer:
$url = 'http://www.imdbapi.com/?i=&t=21+jump+street';
$json_response = file_get_contents($url);
$object_response = json_decode($json_response);
if (!is_null($object_response) && isset($object_response->Poster)) {
    $poster_url = $object_response->Poster;
    echo $poster_url . "\n";
}
Parsing with regex is bad, but there is very little in this that could break. It's advised to use cURL, as it's faster and you can mask your user agent.
The main problem with getting the image from a search is that you first need to know the IMDb ID; then you can load the page and rip the image URL. Hope it helps.
<?php
//Is form posted
if ($_SERVER['REQUEST_METHOD'] == 'POST') {
    $find = $_POST['find'];
    //Get Imdb code from search
    $source = file_get_curl('http://www.imdb.com/find?q=' . urlencode(strtolower($find)) . '&s=tt');
    if (preg_match('#/title/(.*?)/mediaindex#', $source, $match)) {
        //Get main page for imdb id
        $source = file_get_curl('http://www.imdb.com/title/' . $match[1]);
        //Grab the first .jpg image, which is always the main poster
        if (preg_match('#<img src="(.*?)\.jpg"#', $source, $match)) {
            $imdb = $match[1] . '.jpg';
            //do something with image
            echo '<img src="' . $imdb . '" />';
        }
    }
}

//The curl function
function file_get_curl($url) {
    (function_exists('curl_init')) ? '' : die('cURL must be installed');
    $curl = curl_init();
    $header = array();
    $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: ";
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0');
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($curl, CURLOPT_HEADER, true);
    curl_setopt($curl, CURLOPT_REFERER, $url);
    curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_TIMEOUT, 5);
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    $html = curl_exec($curl);
    $status = curl_getinfo($curl);
    curl_close($curl);
    if ($status['http_code'] != 200) {
        if ($status['http_code'] == 301 || $status['http_code'] == 302) {
            list($header) = explode("\r\n\r\n", $html, 2);
            $matches = array();
            preg_match("/(Location:|URI:)[^(\n)]*/", $header, $matches);
            $url = trim(str_replace($matches[1], "", $matches[0]));
            $url_parsed = parse_url($url);
            return (isset($url_parsed)) ? file_get_curl($url) : '';
        }
        return FALSE;
    } else {
        return $html;
    }
}
?>
<form method="POST" action="">
<p><input type="text" name="find" size="20"><input type="submit" value="Submit"></p>
</form>

How to return an array from a function

I have a function which is supposed to return an array, but it is not working: when I try a print_r, nothing is returned. The strange thing is that if I put a print_r in the function just before the return, it prints the array properly. Hope someone can help. Thank you in advance for your replies. Cheers, Marc.
$url = "http://www.somesite.com";
$path = "somexpath";
$print = print_url_data($url, $path);
print_r($print);

function print_url_data($url, $path)
{
    $content = get_url_data($url, $path);
    foreach ($content as $value)
    {
        $output .= $value->nodeValue . "<br />";
    }
    return $output;
}

function get_url_data($url, $path)
{
    $xml_content = get_url($url);
    $dom = new DOMDocument();
    @$dom->loadHTML($xml_content);
    $xpath = new DomXPath($dom);
    $content_title = $xpath->query($path);
    $tableau = array();
    foreach ($content_title as $node)
        array_push($tableau, utf8_decode(urldecode($node->nodeValue)));
    return $tableau; //What is being returned to the function call
}

function get_url($url)
{
    $curl = curl_init();
    // Setup headers - I used the same headers from Firefox version 2.0.0.6
    // below was split up because php.net said the line was too long. :/
    $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // browsers keep this blank.
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Googlebot/2.1 (+http://www.google.com/bot.html)');
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($curl, CURLOPT_REFERER, 'http://www.google.com');
    curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($curl); // execute the curl command
    curl_close($curl); // close the connection
    return $html; // and finally, return $html
}
Make sure errors and warnings are enabled in your php.ini file. It may help.
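For a quick check without editing php.ini, error display can also be switched on at the top of the script; a small sketch (development use only):
// Surface all notices and warnings for this run; don't leave this on in production
error_reporting(E_ALL);
ini_set('display_errors', '1');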
Sorry, my mistake. I misplaced the array construction. Below is the code that works, for those who are interested. Thanks to all who took the time to help me out. Cheers.
<?php
$url = "http://www.somesite.com";
$path = "somexpath";
print_r(print_url_data($url, $path));

///////////////////////////////////

function print_url_data($url, $path)
{
    $content = get_url_data($url, $path);
    $tableau = array();
    foreach ($content as $value)
    {
        array_push($tableau, $value->nodeValue);
    }
    return $tableau;
}

function get_url_data($url, $path)
{
    $xml_content = get_url($url);
    $dom = new DOMDocument();
    @$dom->loadHTML($xml_content);
    $xpath = new DomXPath($dom);
    $content_title = $xpath->query($path);
    return $content_title;
}

function get_url($url)
{
    $curl = curl_init();
    // Setup headers - I used the same headers from Firefox version 2.0.0.6
    // below was split up because php.net said the line was too long. :/
    $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: "; // browsers keep this blank.
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Googlebot/2.1 (+http://www.google.com/bot.html)');
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($curl, CURLOPT_REFERER, 'http://www.google.com');
    curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($curl); // execute the curl command
    curl_close($curl); // close the connection
    return $html; // and finally, return $html
}
Edit:
In light of the new code that you have posted, change:
$output .= $value->nodeValue . "<br />";
to
$output .= $value . "<br />";
This is working for me; let me know if you would like me to post a link to a test script on my server.
(You were referencing $content as an object, but it was declared as an array :] )
If by $node->nodeValue you are parsing XML: I had a similar problem with this a while back. I found that explicitly casting the nodeValue to a string when adding it to the array fixed my problem.
I believe this may be because a reference to the XML object is added to the array instead of the string; once your function ends, the XML object is destroyed and the data is no longer accessible. Hope this helps :)
Example:
array_push($tableau, (string) utf8_decode(urldecode($node->nodeValue)));

file_get_contents script works with some websites but not others

I'm looking to build a PHP script that parses HTML for particular tags. I've been using this code block, adapted from this tutorial:
<?php
$data = file_get_contents('http://www.google.com');
$regex = '/<title>(.+?)</';
preg_match($regex,$data,$match);
var_dump($match);
echo $match[1];
?>
The script works with some websites (like google, above), but when I try it with other websites (like, say, freshdirect), I get this error:
"Warning: file_get_contents(http://www.freshdirect.com) [function.file-get-contents]: failed to open stream: HTTP request failed!"
I've seen a bunch of great suggestions on StackOverflow, for example to enable extension=php_openssl.dll in php.ini. But (1) my version of php.ini didn't have extension=php_openssl.dll in it, and (2) when I added it to the extensions section and restarted the WAMP server, per this thread, still no success.
Would someone mind pointing me in the right direction? Thank you very much!
It just requires a user-agent ("any" really, any string suffices):
file_get_contents("http://www.freshdirect.com", false, stream_context_create(
    array("http" => array("user_agent" => "any"))
));
See more options.
Of course, you can set user_agent in your ini:
ini_set("user_agent","any");
echo file_get_contents("http://www.freshdirect.com");
... but I prefer to be explicit for the next programmer working on it.
With simple_html_dom:
$html = file_get_html('http://google.com/');
$title = $html->find('title', 0)->innertext;
Or, if you prefer, with preg_match (and you should really be using cURL instead of file_get_contents anyway):
function curl($url) {
    $headers[] = "User-Agent:Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13";
    $headers[] = "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    $headers[] = "Accept-Language:en-us,en;q=0.5";
    $headers[] = "Accept-Encoding:gzip,deflate";
    $headers[] = "Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $headers[] = "Keep-Alive:115";
    $headers[] = "Connection:keep-alive";
    $headers[] = "Cache-Control:max-age=0";

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($curl, CURLOPT_ENCODING, "gzip");
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($curl);
    curl_close($curl);
    return $data;
}

$data = curl('http://www.google.com');
$regex = '#<title>(.*?)</title>#mis';
preg_match($regex, $data, $match);
var_dump($match);
echo $match[1];
Another option: some hosts disable CURLOPT_FOLLOWLOCATION, so a recursive approach is what you want; it will also log any errors to a text file. This also includes a simple example of how to use DOMDocument() to extract the content; obviously it's not extensive, but it's something you could build upon.
<?php
function file_get_site($url) {
    (function_exists('curl_init')) ? '' : die('cURL must be installed. Ask your host to enable it or uncomment extension=php_curl.dll in php.ini');
    $curl = curl_init();
    $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[] = "Cache-Control: max-age=0";
    $header[] = "Connection: keep-alive";
    $header[] = "Keep-Alive: 300";
    $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[] = "Accept-Language: en-us,en;q=0.5";
    $header[] = "Pragma: ";
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0');
    curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
    curl_setopt($curl, CURLOPT_HEADER, true);
    curl_setopt($curl, CURLOPT_REFERER, $url);
    curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_TIMEOUT, 60);
    $html = curl_exec($curl);
    $status = curl_getinfo($curl);
    curl_close($curl);
    if ($status['http_code'] != 200) {
        if ($status['http_code'] == 301 || $status['http_code'] == 302) {
            list($header) = explode("\r\n\r\n", $html, 2);
            $matches = array();
            preg_match("/(Location:|URI:)[^(\n)]*/", $header, $matches);
            $url = trim(str_replace($matches[1], "", $matches[0]));
            $url_parsed = parse_url($url);
            return (isset($url_parsed)) ? file_get_site($url) : '';
        }
        $oline = '';
        foreach ($status as $key => $eline) { $oline .= '[' . $key . ']' . $eline . ' '; }
        $line = $oline . " \r\n " . $url . "\r\n-----------------\r\n";
        $handle = @fopen('./curl.error.log', 'a');
        fwrite($handle, $line);
        fclose($handle);
        return FALSE;
    }
    return $html;
}

function get_content_tags($source, $tag, $id = null, $value = null) {
    $xml = new DOMDocument();
    @$xml->loadHTML($source);
    foreach ($xml->getElementsByTagName($tag) as $tags) {
        if ($id != null) {
            if ($tags->getAttribute($id) == $value) {
                return $tags->getAttribute('content');
            }
        } else {
            return $tags->nodeValue;
        }
    }
}

$source = file_get_site('http://www.freshdirect.com/about/index.jsp');
echo get_content_tags($source, 'title'); //FreshDirect
echo get_content_tags($source, 'meta', 'name', 'description'); //Online grocer providing high quality fresh......
?>

CURL Authentication being lost?

I am authenticating a login via cURL just fine. I have a variable I am using to display the returned HTML, and it returns my user control panel as if I am logged in.
After authenticating, I want to submit variables to a form on another page within the site, but for some reason the HTML from that page comes back with a non-authenticated version of the header (as if the original authentication never took place).
I have a cookies.txt file with 777 permissions, and I have tried just getting the contents of the same page shown when I authenticate; it is as if I am losing any associated session/cookie data somewhere along the way.
Here is my curl.class file -
<?php
class Curl {
    public $curl;
    public $cookieJar = "";

    // Make sure the cookies.txt file has read/write permissions
    public function __construct($cookieJarFile = 'cookies.txt') {
        $this->cookieJar = $cookieJarFile;
    }

    function setup() {
        $header = array();
        $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
        $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] = "Cache-Control: max-age=0";
        $header[] = "Connection: keep-alive";
        $header[] = "Keep-Alive: 300";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma: "; // browsers keep this blank.

        curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
        curl_setopt($this->curl, CURLOPT_HTTPHEADER, $header);
        curl_setopt($this->curl, CURLOPT_COOKIEJAR, $this->cookieJar);
        curl_setopt($this->curl, CURLOPT_COOKIEFILE, $this->cookieJar);
        curl_setopt($this->curl, CURLOPT_AUTOREFERER, true);
        curl_setopt($this->curl, CURLOPT_COOKIESESSION, true);
        curl_setopt($this->curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);
    }

    function get($url) {
        $this->curl = curl_init($url);
        $this->setup();
        return $this->request();
    }

    function getAll($reg, $str) {
        preg_match_all($reg, $str, $matches);
        return $matches[1];
    }

    function postForm($url, $fields, $referer = '') {
        $this->curl = curl_init($url);
        $this->setup();
        curl_setopt($this->curl, CURLOPT_URL, $url);
        curl_setopt($this->curl, CURLOPT_POST, 1);
        curl_setopt($this->curl, CURLOPT_REFERER, $referer);
        curl_setopt($this->curl, CURLOPT_POSTFIELDS, $fields);
        return $this->request();
    }

    function getInfo($info) {
        $info = ($info == 'lasturl') ? curl_getinfo($this->curl, CURLINFO_EFFECTIVE_URL) : curl_getinfo($this->curl, $info);
        return $info;
    }

    function request() {
        return curl_exec($this->curl);
    }
}
?>
And here is my curl.php file -
<?
include('curl.class.php'); // This path would change to where you store the file
$curl = new Curl();
$url = "http://www.site.com/public/member/signin";
$fields = "MAX_FILE_SIZE=50000000&dado_form_3=1&member[email]=email&member[password]=pass&x=16&y=5&member[persistent]=true";
// Calling URL
$referer = "http://www.site.com/public/member/signin";
$html = $curl->postForm($url, $fields, $referer);
echo($html);
?>
<hr style="clear:both;"/>
<?
$html = $curl->postForm('http://www.site.com/index.php','nid=443&sid=733005&tab=post&eval=yes&ad=&MAX_FILE_SIZE=10000000&ip=63.225.235.30','http://www.site.com/public/member/signin');
echo $html; // This will show you the HTML of the current page you and logged into
?>
Any ideas?
As always when doing HTTP scripting, you should use LiveHTTPHeaders or a similar tool to record a manual session first, and then mimic that as closely as possible when you write your curl code.
Also, (unfortunately) the command-line tool curl offers slightly better debug and tracing options than the PHP binding does, which makes it a better tool for working out exactly what you need to do; once that works, you convert it to a PHP program.
See http://curl.haxx.se/docs/httpscripting.html for further details.
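A rough PHP-side equivalent of that tracing is CURLOPT_VERBOSE; a small sketch, added inside Curl::setup() (the log file name is just an example):
// Dump the full request/response dialogue, including cookie traffic,
// so it can be compared line by line against the recorded browser session.
$trace = fopen('curl_trace.txt', 'w');
curl_setopt($this->curl, CURLOPT_VERBOSE, true);
curl_setopt($this->curl, CURLOPT_STDERR, $trace);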
Err, please tell us what authentication scheme the server is using. Not all schemes use cookies.
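If it turns out the server uses HTTP Basic authentication rather than a cookie-based login form, cURL handles that directly; a small sketch (the credentials are placeholders):
// Hypothetical: HTTP Basic auth instead of posting a login form and keeping a cookie jar
curl_setopt($this->curl, CURLOPT_HTTPAUTH, CURLAUTH_BASIC);
curl_setopt($this->curl, CURLOPT_USERPWD, 'username:password');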
