How do I download a file from an .ashx page with php? - php

I am trying to download a file from this URL in PHP: http://www.roblox.com/Asset/BodyColors.ashx?userId=36377783
The page returns a file that a web browser downloads automatically.
I tried using cURL:
<?php
$uid = 36377783;
$xUrl = "http://www.roblox.com/Asset/BodyColors.ashx?userId=".$uid;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $xUrl);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
$xml = curl_exec($ch);
curl_close($ch);
echo $xml;
?>
But it redirects me to an error page.
How do I download the file the .ashx url returns?
(Setting CURLOPT_USERAGENT doesn't work.)

The URL redirects. I used file_get_contents() here (cURL would work as well) and read the redirect target from $http_response_header:
$uid = 36377783;
$xUrl = "http://www.roblox.com/Asset/BodyColors.ashx?userId=".$uid;
$opts = array(
'http'=>array(
'method'=>"GET",
'follow_location' => true,
'header'=>
"Host: www.roblox.com\r\n" .
"User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0\r\n" .
"Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n" .
"Accept-Encoding: gzip, deflate\r\n" .
"DNT: 1\r\n"
)
);
$context = stream_context_create($opts);
$xml = file_get_contents($xUrl, false, $context);
#print_r($http_response_header);
$url_redirect = str_replace('Location: ',"",$http_response_header[5]);
#print $url_redirect;
$xml = file_get_contents($url_redirect);
#print_r($xml);
$roblox_responses = new SimpleXMLElement($xml);
print_r($roblox_responses);
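The snippet above reads the Location header from a fixed position ($http_response_header[5]), which only holds as long as the server sends its headers in that exact order. A minimal sketch of a lookup by name instead; find_redirect_location is a hypothetical helper name, not something from the original answer:

```php
<?php
// Hypothetical helper: return the value of the last "Location:" header
// found in a list of raw header lines, or null if there is none.
function find_redirect_location(array $headers)
{
    $location = null;
    foreach ($headers as $line) {
        // Header names are case-insensitive, so compare with stripos().
        if (stripos($line, 'Location:') === 0) {
            $location = trim(substr($line, strlen('Location:')));
        }
    }
    return $location;
}

// Example with a hand-written header list (no network access needed).
$sample = array(
    'HTTP/1.1 302 Found',
    'Content-Type: text/html',
    'Location: http://example.com/asset.xml',
);
echo find_redirect_location($sample), "\n";
```

In the answer's code this would be used as $url_redirect = find_redirect_location($http_response_header); followed by a check for null before the second request.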

Related

How to work with file_get_contents for m3u8 file?

I wrote PHP code that sends three GET requests, one after the other.
The purpose of the code is to get the content of the m3u8 file,
but the last GET request fails with an error.
PHP:
<?php
//1. Create a proper token for the m3u8 to work
$opts = array(
'http'=>array(
'method'=>"GET",
// All header lines go into one string, separated by \r\n
'header'=>"Referer: http://www.hotstar.com\r\n" .
"User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:51.0) Gecko/20100101 Firefox/51.0\r\n"
));
$context = stream_context_create($opts);
$url = "http://www.hotstar.com/get_cdn_token.php";
$data = file_get_contents($url, false, $context);
$values = json_decode($data, true);
$url = $values['token'];
//2. Send another GET request along with the token, to pull the master m3u8
$opts = array(
'http'=>array(
'method'=>"GET",
// All header lines go into one string, separated by \r\n
'header'=>"Referer: http://www.hotstar.com\r\n" .
"User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:51.0) Gecko/20100101 Firefox/51.0\r\n"
));
$context = stream_context_create($opts);
$url = "https://secure-getcdn.hotstar.com/AVS/besc?hotstarauth=$url&action=GetCDN&appVersion=5.0.40&asJson=Y&channel=TABLET&id=1000055355&type=VOD";
$data = file_get_contents($url, false, $context);
$values = json_decode($data, true);
$link = $values['resultObj']['src'];
//3. Get the m3u8 content
$opts = array(
'http'=>array(
'method'=>"GET",
'header'=>"User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:51.0) Gecko/20100101 Firefox/51.0"
));
$context = stream_context_create($opts);
$url = "$link";
$data = file_get_contents($url, false, $context);
echo $data;
Through the browser's Inspect Element I get this result:
#EXTM3U
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=241000,RESOLUTION=320x180,CODECS="avc1.66.30, mp4a.40.2",CLOSED-CAPTIONS=NONE
https://staragvod1-vh.akamaihd.net/i/videos/plus/sns/1365/1000055355_,16,180,400,800,1300,2000,3000,4500,_STAR.mp4.csmil/index_1_av.m3u8?null=0&id=AgC0lfI2aGb2DFFZW1pBPartIAq++S+ee++3UM8jU49rfzGeMpTl2IaWB4PCyZ0c2yGZOtSqAhal4g%3d%3d
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=461000,RESOLUTION=416x234,CODECS="avc1.66.30, mp4a.40.2",CLOSED-CAPTIONS=NONE
https://staragvod1-vh.akamaihd.net/i/videos/plus/sns/1365/1000055355_,16,180,400,800,1300,2000,3000,4500,_STAR.mp4.csmil/index_2_av.m3u8?null=0&id=AgC0lfI2aGb2DFFZW1pBPartIAq++S+ee++3UM8jU49rfzGeMpTl2IaWB4PCyZ0c2yGZOtSqAhal4g%3d%3d
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=861000,RESOLUTION=640x360,CODECS="avc1.66.30, mp4a.40.2",CLOSED-CAPTIONS=NONE
https://staragvod1-vh.akamaihd.net/i/videos/plus/sns/1365/1000055355_,16,180,400,800,1300,2000,3000,4500,_STAR.mp4.csmil/index_3_av.m3u8?null=0&id=AgC0lfI2aGb2DFFZW1pBPartIAq++S+ee++3UM8jU49rfzGeMpTl2IaWB4PCyZ0c2yGZOtSqAhal4g%3d%3d
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1360000,RESOLUTION=720x404,CODECS="avc1.66.30, mp4a.40.2",CLOSED-CAPTIONS=NONE
https://staragvod1-vh.akamaihd.net/i/videos/plus/sns/1365/1000055355_,16,180,400,800,1300,2000,3000,4500,_STAR.mp4.csmil/index_4_av.m3u8?null=0&id=AgC0lfI2aGb2DFFZW1pBPartIAq++S+ee++3UM8jU49rfzGeMpTl2IaWB4PCyZ0c2yGZOtSqAhal4g%3d%3d
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=2060000,RESOLUTION=1280x720,CODECS="avc1.77.30, mp4a.40.2",CLOSED-CAPTIONS=NONE
https://staragvod1-vh.akamaihd.net/i/videos/plus/sns/1365/1000055355_,16,180,400,800,1300,2000,3000,4500,_STAR.mp4.csmil/index_5_av.m3u8?null=0&id=AgC0lfI2aGb2DFFZW1pBPartIAq++S+ee++3UM8jU49rfzGeMpTl2IaWB4PCyZ0c2yGZOtSqAhal4g%3d%3d
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=3060000,RESOLUTION=1600x900,CODECS="avc1.77.30, mp4a.40.2",CLOSED-CAPTIONS=NONE
https://staragvod1-vh.akamaihd.net/i/videos/plus/sns/1365/1000055355_,16,180,400,800,1300,2000,3000,4500,_STAR.mp4.csmil/index_6_av.m3u8?null=0&id=AgC0lfI2aGb2DFFZW1pBPartIAq++S+ee++3UM8jU49rfzGeMpTl2IaWB4PCyZ0c2yGZOtSqAhal4g%3d%3d
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=4562000,RESOLUTION=1920x1080,CODECS="avc1.77.30, mp4a.40.2",CLOSED-CAPTIONS=NONE
https://staragvod1-vh.akamaihd.net/i/videos/plus/sns/1365/1000055355_,16,180,400,800,1300,2000,3000,4500,_STAR.mp4.csmil/index_7_av.m3u8?null=0&id=AgC0lfI2aGb2DFFZW1pBPartIAq++S+ee++3UM8jU49rfzGeMpTl2IaWB4PCyZ0c2yGZOtSqAhal4g%3d%3d
Through PHP I get the error:
Warning: file_get_contents(https://staragvod1-vh.akamaihd.net/i/videos/plus/sns/1365/1000055355_,16,180,400,800,1300,2000,3000,4500,_STAR.mp4.csmil/master.m3u8?hdnea=st=1515937603~exp=1515938203~acl=/*~hmac=c5f9294a198233a9751edbca51631c9cb12db63a08a69499c20d1208bd07aca8): failed to open stream: HTTP request failed! HTTP/1.0 403 Forbidden in **** on line 37
How can I fix this?
The file requires a login to be accessed, so your requests need to carry cookies. This can be done with php-curl, or with curl alone in the shell.
Here is a php-curl snippet that can help in this case.
There is more you can do with curl. Note that the system needs an additional PHP package for this to work:
sudo apt install php-curl
$handle = curl_init();
$url = "https://lalala.com/file/files/oups.m3u";
$domain = preg_replace("(^https?://)", "", $url );
$header = array('Accept-Language: fr,fr-fr;q=0.8,en-us;q=0.5,en;q=0.3');
curl_setopt($handle, CURLOPT_URL, $url);
curl_setopt($handle, CURLINFO_HEADER_OUT, 1);
curl_setopt($handle, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0.1');
curl_setopt($handle, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($handle, CURLOPT_NOSIGNAL, true);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, false);
curl_setopt($handle, CURLOPT_HTTPHEADER, $header);
curl_setopt($handle, CURLOPT_HEADER, false);
header('Content-Type: text/html');
header("Access-Control-Allow-Origin: *");
$result = curl_exec($handle);
var_dump($result);
See https://curl.haxx.se/libcurl/c/CURLOPT_USERAGENT.html for more details about the user-agent option (the page documents libcurl, which php-curl wraps).
See https://curl.haxx.se/libcurl/c/CURLOPT_COOKIE.html for how to set up cookies.
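Separately from cookies: when you build the 'header' option for stream_context_create(), as the question's requests do, all header lines have to end up in one string (lines separated by \r\n) or in one array. A minimal sketch that builds the string from a name => value list; build_http_header is a hypothetical helper and the User-Agent value is only a placeholder:

```php
<?php
// Hypothetical helper: turn array('Name' => 'value', ...) into a single
// header string in which every line is terminated by \r\n.
function build_http_header(array $headers)
{
    $lines = array();
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }
    return implode("\r\n", $lines) . "\r\n";
}

$opts = array(
    'http' => array(
        'method' => 'GET',
        'header' => build_http_header(array(
            'Referer'    => 'http://www.hotstar.com',
            'User-Agent' => 'ExampleAgent/1.0', // placeholder value
        )),
    ),
);
// Creating the context does not send any request yet.
$context = stream_context_create($opts);
```

This will not fix a 403 by itself, but it rules out the case where a header was silently left out of the request.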

Instagram API retrieve the code using PHP

I am trying to use the Instagram API, but it is not easy.
According to the API documentation, a code must be retrieved in order to get an access token and then make requests to the Instagram API.
After several attempts, I still have not succeeded.
I have already configured the settings at https://www.instagram.com/developer
I call the URL api.instagram.com/oauth/authorize/?client_id=[CLIENT_ID]&redirect_uri=[REDIRECT_URI]&response_type=code with cURL, but the response does not contain the redirect URI with the code.
Can you help me, please?
I would recommend you use one of the existing PHP instagram client libraries like this https://github.com/cosenary/Instagram-PHP-API
I did this not too long ago, here's a good reference:
https://auth0.com/docs/connections/social/instagram
Let me know if it helps!
I wrote this code for a use case like yours. I hope it has no errors. Here is the code; I explain how it works below.
$authorization_url = "https://api.instagram.com/oauth/authorize/?client_id=".$instagram_client_id."&redirect_uri=".$your_website_redirect_uri."&response_type=code";
$username='ig_username';
$password='ig_password';
$_defaultHeaders = array(
'User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: en-US,en;q=0.5',
'Accept-Encoding: ',
'Connection: keep-alive',
'Upgrade-Insecure-Requests: 1',
'Cache-Control: max-age=0'
);
$ch = curl_init();
$cookie='/application/'.strtoupper(VERSI)."instagram_cookie/instagram.txt";
curl_setopt( $ch, CURLOPT_POST, 0 );
curl_setopt( $ch, CURLOPT_HTTPGET, 1 );
// $token is optional here: set it beforehand only if you already have one
if(isset($token)){
array_push($_defaultHeaders,"Authorization: ".$token);
}
curl_setopt( $ch, CURLOPT_HTTPHEADER, $_defaultHeaders);
curl_setopt( $ch, CURLOPT_HEADER, true);
curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_COOKIEFILE,getcwd().$cookie );
curl_setopt( $ch, CURLOPT_COOKIEJAR, getcwd().$cookie );
curl_setopt($ch,CURLOPT_URL,$authorization_url);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,true);
$result = curl_exec($ch);
$redirect_uri = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
$form = explode('login-form',$result)[1];
$form = explode("action=\"",$form)[1];
// vd('asd',$form);
$action = substr($form,0,strpos($form,"\""));
// vd('action',$action);
$csrfmiddlewaretoken = explode("csrfmiddlewaretoken\" value=\"",$form);
$csrfmiddlewaretoken = substr($csrfmiddlewaretoken[1],0,strpos($csrfmiddlewaretoken[1],"\""));
//finish getting parameter
$post_param['csrfmiddlewaretoken']=$csrfmiddlewaretoken;
$post_param['username']=$username;
$post_param['password']=$password;
//format instagram cookie from vaha's answer https://stackoverflow.com/questions/26003063/instagram-login-programatically
preg_match_all('/^Set-Cookie:\s*([^;]*)/mi', $result, $matches);
$cookieFileContent = '';
foreach($matches[1] as $item)
{
$cookieFileContent .= "$item; ";
}
$cookieFileContent = rtrim($cookieFileContent, '; ');
$cookieFileContent = str_replace('sessionid=; ', '', $cookieFileContent);
$cookie=getcwd().'/application/'.strtoupper(VERSI)."instagram_cookie/instagram.txt";
$oldContent = file_get_contents($cookie);
$oldContArr = explode("\n", $oldContent);
if(count($oldContArr))
{
foreach($oldContArr as $k => $line)
{
if(strstr($line, '# '))
{
unset($oldContArr[$k]);
}
}
$newContent = implode("\n", $oldContArr);
$newContent = trim($newContent, "\n");
file_put_contents(
$cookie,
$newContent
);
}
// end format
$useragent = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0";
$arrSetHeaders = array(
'origin: https://www.instagram.com',
'authority: www.instagram.com',
'upgrade-insecure-requests: 1',
'Host: www.instagram.com',
"User-Agent: $useragent",
'content-type: application/x-www-form-urlencoded',
'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language: en-US,en;q=0.5',
'Accept-Encoding: deflate, br',
"Referer: $redirect_uri",
"Cookie: $cookieFileContent",
'Connection: keep-alive',
'cache-control: max-age=0',
);
$ch = curl_init();
// $cookie already holds the absolute path built above
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER, $arrSetHeaders);
// Assumes $action is a path relative to https://www.instagram.com
curl_setopt($ch, CURLOPT_URL, 'https://www.instagram.com'.$action);
curl_setopt($ch, CURLOPT_REFERER, $redirect_uri);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post_param));
sleep(5);
$page = curl_exec($ch);
preg_match_all('/^Set-Cookie:\s*([^;]*)/mi', $page, $matches);
$cookies = array();
foreach($matches[1] as $item) {
parse_str($item, $cookie1);
$cookies = array_merge($cookies, $cookie1);
}
var_dump($page);
Step 1:
We need to get to the login page first.
We request our application's Instagram authorization URL with a cURL GET and CURLOPT_FOLLOWLOCATION set to true, so that we are redirected to the login page:
$authorization_url = "https://api.instagram.com/oauth/authorize/?client_id=".$instagram_client_id."&redirect_uri=".$your_website_redirect_uri."&response_type=code";
This is step one from the Instagram documentation.
The first GET gives us the response page and its final URL, which we store in $redirect_uri. It must be sent as the Referer header when we POST the login form.
After getting the login page, we need to format the cookies. The code for this is adapted from vaha's answer, linked in the code comment above.
Step 2:
From the login page we extract the form's action URL and the value of the hidden csrfmiddlewaretoken input.
With those values we build the POST parameters for the login.
We must set the redirect URI, and not forget the cookie jar and the other header settings shown in the code above. After the login POST succeeds, Instagram calls your redirect URI, for example https://www.yourwebsite.com/save_instagram_code. There you must use or save the Instagram code to get the access token, using cURL again (I only explain how to get the code).
I wrote this in a short time. I'll update it with code I have tested when I have time. Feel free to suggest an edit with working code or a better explanation.
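The extraction in step 2 is done with explode() and substr() in the code above, which breaks as soon as the markup changes slightly. A minimal sketch of the same extraction with DOMDocument and DOMXPath; extract_login_form is a hypothetical helper, and the sample HTML is made up for illustration (it is not Instagram's real login page):

```php
<?php
// Hypothetical helper: return the action URL and CSRF token of the form
// containing a "csrfmiddlewaretoken" input, or null if none is found.
function extract_login_form($html)
{
    if ($html === '') {
        return null;
    }
    $doc = new DOMDocument();
    // Collect HTML parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $inputs = $xpath->query('//form//input[@name="csrfmiddlewaretoken"]');
    if ($inputs === false || $inputs->length === 0) {
        return null;
    }
    $input = $inputs->item(0);
    // Walk up to the enclosing <form> element to read its action.
    $form = $input->parentNode;
    while ($form !== null && $form->nodeName !== 'form') {
        $form = $form->parentNode;
    }
    if ($form === null) {
        return null;
    }
    return array(
        'action'              => $form->getAttribute('action'),
        'csrfmiddlewaretoken' => $input->getAttribute('value'),
    );
}

// Made-up sample markup, for illustration only.
$sampleHtml = '<html><body><form id="login-form" action="/accounts/login/" method="post">'
    . '<input type="hidden" name="csrfmiddlewaretoken" value="abc123">'
    . '<input type="text" name="username"></form></body></html>';
print_r(extract_login_form($sampleHtml));
```

Note that the code above requests the page with CURLOPT_HEADER enabled, so the response headers would need to be split off from $result before passing it to a helper like this.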

PHP Curl with a file attachment

I am trying to simulate a PHP cURL POST that requires a file upload.
Here is the HTML form from the website I am trying to POST TO: http://pastebin.com/X6Y0mmfP
The file I need to upload is "domains.txt" which can be found on the same directory as the script.
Using Live HTTP headers (firefox addon) I've retrieved this information:
POST to: http://www.majesticseo.com/reports/bulk-backlinks-upload
HTTP Headers:
Host: www.majesticseo.com
User-Agent: Mozilla/5.0 (Windows NT 6.2; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: https://www.majesticseo.com/reports/bulk-backlink-checker
Cookie: _pk_id.2.d6bc=a607157d494109d4.1382175578.4.1388174858.1384073229.; RURI=reports%2Fbulk- backlink-checker; _pk_ses.2.d6bc=*; STOK=Ox09WRWBeFCU3l3TAim86efmBa
Connection: keep-alive
Content-Type: multipart/form-data; boundary=---------------------------210646678590
Content-Length: 1106
POST Content:
-----------------------------210646678590\r\n
Content-Disposition: form-data; name="fileType"\r\n
\r\n
SingleColumn\r\n
-----------------------------210646678590\r\n
Content-Disposition: form-data; name="indexType"\r\n
\r\n
F\r\n
-----------------------------210646678590\r\n
Content-Disposition: form-data; name="ajaxLoadUrl"\r\n
\r\n
/reports/downloads/confirm-file-upload/backlinksAjax\r\n
-----------------------------210646678590\r\n
Content-Disposition: form-data; name="file"; filename="domains.txt"\r\n
Content-Type: text/plain\r\n
\r\n
facebook.com\n
twitter.com\n
google.com\n
youtube.com\n
wordpress.org\n
adobe.com\n
blogspot.com\n
wikipedia.org\n
wordpress.com\n
linkedin.com\n
yahoo.com\n
amazon.com\n
flickr.com\n
w3.org\n
pinterest.com\n
apple.com\n
tumblr.com\n
myspace.com\n
microsoft.com\n
vimeo.com\n
digg.com\n
qq.com\n
stumbleupon.com\n
baidu.com\n
addthis.com\n
miibeian.gov.cn\n
statcounter.com\n
bit.ly\n
feedburner.com\n
nytimes.com\n
reddit.com\n
delicious.com\n
msn.com\n
macromedia.com\n
bbc.co.uk\n
weebly.com\n
blogger.com\n
icio.us\n
goo.gl\n
gov.uk\n
cnn.com\n
yandex.ru\n
webs.com\n
google.de\n
mail.ru\n
livejournal.com\n
sourceforge.net\n
go.com\n
imdb.com\n
jimdo.com\n
\r\n
-----------------------------210646678590--\r\n
In the manual browser upload, I am using domains.txt - which is also the file on the server (in the same directory as the script).
My script first logs in, then it attempts to make this request.
This is what I have tried to do so far, however it is not being accepted:
$ch = curl_init();
$post = array('fileType' => 'SingleColumn',
'indexType' => 'F',
'ajaxLoadUrl' => '/reports/downloads/confirm-file-upload/backlinksAjax',
'file'=>'@'.realpath('./domains.txt') . ';filename=domains.txt'
);
$post = http_build_query($post);
curl_setopt($ch, CURLOPT_URL, "https://www.majesticseo.com/reports/bulk-backlinks-upload");
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5 );
curl_setopt($ch, CURLOPT_COOKIEJAR, 'majestic.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, 'majestic.txt');
curl_setopt($ch, CURLOPT_REFERER, 'https://www.majesticseo.com/reports/bulk-backlink-checker');
curl_setopt ($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
$result = curl_exec($ch);
curl_close($ch);
cURL doesn't work well with relative paths; provide the full path, for example:
realpath('/home/user/public_html/domains.txt')
This is the function I use, followed by how I build the data for the request:
function send_curl_request_with_attachment($method, $headers, $url, $post_fields) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    // Only set headers when a non-empty array was passed
    if (is_array($headers) && count($headers) > 0) {
        curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    }
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, $method);
    curl_setopt($ch, CURLOPT_POST, 1);
    // Passing an array here makes cURL send multipart/form-data
    curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_VERBOSE, true);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
$token_slams = "Authorization: Bearer " . $access_token;
$authHeader = array(
$token_slams,
'Accept: application/form-data');
$schedule_path ='../../documents/' . $docs_record["document"];
$cFile = curl_file_create($schedule_path);
$post = array(
'old_record' => $old_record,
'employer_number' => $employer_number,
'payment_date' => $payment_date,
'fund_year' => $fund_year,
'fund_month' => $fund_month,
'employer_schedule'=> $cFile
);
send_curl_request_with_attachment("POST", $authHeader, $my_url, $post);
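One more thing to check in the question's attempt: it runs the fields through http_build_query() before handing them to CURLOPT_POSTFIELDS. When CURLOPT_POSTFIELDS receives a string, the body is sent as application/x-www-form-urlencoded and the file path is just text. When it receives an array, cURL sends multipart/form-data. The "@path" syntax was also removed in PHP 7, so a CURLFile object is needed there. A minimal sketch, assuming the field names from the question; build_upload_fields is a hypothetical helper and /tmp/domains.txt is a placeholder path:

```php
<?php
// Text fields taken from the question's form.
$fields = array(
    'fileType'    => 'SingleColumn',
    'indexType'   => 'F',
    'ajaxLoadUrl' => '/reports/downloads/confirm-file-upload/backlinksAjax',
);

// What http_build_query() does to an "@path" value: it becomes plain,
// URL-encoded text, so no file is attached to the request.
$asString = http_build_query($fields + array('file' => '@/tmp/domains.txt'));
echo $asString, "\n";

// Hypothetical helper: keep the fields as an array and attach the file
// as a CURLFile (requires the curl extension, PHP 5.5 or later).
function build_upload_fields(array $fields, $path)
{
    $fields['file'] = new CURLFile($path, 'text/plain', 'domains.txt');
    return $fields;
}
```

With this, the upload call would be curl_setopt($ch, CURLOPT_POSTFIELDS, build_upload_fields($fields, realpath('./domains.txt'))); with no http_build_query() in between.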

https headers with file_get_contents

This does not get gzipped content, but plain content. How to make file_get_contents send headers with https ?
$url = 'https://www.google.co.in/';
///Try to fetch compressed content using the file_get_contents function
$opts = array(
'http'=>array(
'method'=>"GET",
'header'=>"Accept-language: en-US,en;q=0.8\r\n" .
"Accept-Encoding: gzip,deflate,sdch\r\n" .
"Accept-Charset:UTF-8,*;q=0.5\r\n"
)
);
$context = stream_context_create($opts);
$zipped_content = file_get_contents($url ,false,$context);
echo $zipped_content;
print_r($http_response_header);
If the url is http://www.yahoo.co.in then the gzipped content is served (and to confirm, it appears like rubbish).
But when using "https://" it seems that file_get_contents does not send the headers specified.
The headers you send are not the problem. Add a User-Agent header and it will be fine:
"User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:19.0) Gecko/20100101 Firefox/19.0 FirePHP/0.4\r\n".
Why? That is up to Google's servers.
Try this:
$url = "https://www.google.co.in/";
$timeout = 5; // connection timeout in seconds
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept-Encoding: gzip'));
$contents = curl_exec($ch);
curl_close($ch);
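Once the server does answer with gzipped content, the body comes back as raw compressed bytes (the "rubbish" mentioned in the question). A minimal sketch of decoding it with gzdecode() when a Content-Encoding: gzip response header is present; decode_http_body is a hypothetical helper, and it requires the zlib extension:

```php
<?php
// Hypothetical helper: gunzip $body if the response headers announce
// "Content-Encoding: gzip"; otherwise return the body unchanged.
function decode_http_body($body, array $responseHeaders)
{
    foreach ($responseHeaders as $line) {
        if (preg_match('/^Content-Encoding:\s*gzip\s*$/i', $line)) {
            $decoded = gzdecode($body);
            // Fall back to the raw body if it was not valid gzip data.
            return $decoded === false ? $body : $decoded;
        }
    }
    return $body;
}

// Offline example: compress a string, then decode it again.
$compressed = gzencode('<html>hello</html>');
echo decode_http_body($compressed, array('HTTP/1.1 200 OK', 'Content-Encoding: gzip')), "\n";
```

With file_get_contents() this would be called as decode_http_body($zipped_content, $http_response_header).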

file_get_contents script works with some websites but not others

I'm looking to build a PHP script that parses HTML for particular tags. I've been using this code block, adapted from this tutorial:
<?php
$data = file_get_contents('http://www.google.com');
$regex = '/<title>(.+?)</';
preg_match($regex,$data,$match);
var_dump($match);
echo $match[1];
?>
The script works with some websites (like google, above), but when I try it with other websites (like, say, freshdirect), I get this error:
"Warning: file_get_contents(http://www.freshdirect.com) [function.file-get-contents]: failed to open stream: HTTP request failed!"
I've seen a bunch of great suggestions on StackOverflow, for example to enable extension=php_openssl.dll in php.ini. But (1) my version of php.ini didn't have extension=php_openssl.dll in it, and (2) when I added it to the extensions section and restarted the WAMP server, per this thread, still no success.
Would someone mind pointing me in the right direction? Thank you very much!
It just requires a user-agent ("any" really, any string suffices):
file_get_contents("http://www.freshdirect.com",false,stream_context_create(
array("http" => array("user_agent" => "any"))
));
See more options.
Of course, you can set user_agent in your ini:
ini_set("user_agent","any");
echo file_get_contents("http://www.freshdirect.com");
... but I prefer to be explicit for the next programmer working on it.
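// file_get_html() comes from the Simple HTML DOM library, not from PHP itself
$html = file_get_html('http://google.com/');
// find() with an index returns a single element instead of an array
$title = $html->find('title', 0)->innertext;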
Or, if you prefer preg_match, you should really be using cURL instead of file_get_contents():
function curl($url){
$headers[] = "User-Agent:Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13";
$headers[] = "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
$headers[] = "Accept-Language:en-us,en;q=0.5";
$headers[] = "Accept-Encoding:gzip,deflate";
$headers[] = "Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$headers[] = "Keep-Alive:115";
$headers[] = "Connection:keep-alive";
$headers[] = "Cache-Control:max-age=0";
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
curl_setopt($curl, CURLOPT_ENCODING, "gzip");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($curl);
curl_close($curl);
return $data;
}
$data = curl('http://www.google.com');
$regex = '#<title>(.*?)</title>#mis';
preg_match($regex,$data,$match);
var_dump($match);
echo $match[1];
Another option: some hosts disable CURLOPT_FOLLOWLOCATION, so following redirects recursively is what you want. This function also logs any errors to a text file. It is followed by a simple example of how to use DOMDocument() to extract the content. It is not extensive, but it is something you can build upon.
<?php
function file_get_site($url){
(function_exists('curl_init')) ? '' : die('cURL Must be installed. Ask your host to enable it or uncomment extension=php_curl.dll in php.ini');
$curl = curl_init();
$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
$header[] = "Pragma: ";
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0 Firefox/5.0');
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
curl_setopt($curl, CURLOPT_HEADER, true);
curl_setopt($curl, CURLOPT_REFERER, $url);
curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
curl_setopt($curl, CURLOPT_AUTOREFERER, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_TIMEOUT, 60);
$html = curl_exec($curl);
$status = curl_getinfo($curl);
curl_close($curl);
if($status['http_code']!=200){
if($status['http_code'] == 301 || $status['http_code'] == 302) {
list($header) = explode("\r\n\r\n", $html, 2);
$matches = array();
preg_match("/(Location:|URI:)[^(\n)]*/", $header, $matches);
$url = trim(str_replace($matches[1],"",$matches[0]));
$url_parsed = parse_url($url);
return (isset($url_parsed))? file_get_site($url):'';
}
$oline='';
foreach($status as $key=>$eline){$oline.='['.$key.']'.$eline.' ';}
$line =$oline." \r\n ".$url."\r\n-----------------\r\n";
$handle = @fopen('./curl.error.log', 'a');
fwrite($handle, $line);
return FALSE;
}
return $html;
}
function get_content_tags($source,$tag,$id=null,$value=null){
    $xml = new DOMDocument();
    // Suppress warnings caused by invalid real-world HTML
    @$xml->loadHTML($source);
    foreach($xml->getElementsByTagName($tag) as $tags) {
        if($id!=null){
            if($tags->getAttribute($id)==$value){
                return $tags->getAttribute('content');
            }
            // Attribute did not match: try the next element
            continue;
        }
        return $tags->nodeValue;
    }
}
$source = file_get_site('http://www.freshdirect.com/about/index.jsp');
echo get_content_tags($source,'title'); //FreshDirect
echo get_content_tags($source,'meta','name','description'); //Online grocer providing high quality fresh......
?>
