I'm trying to download a file with PHP that is located on a remote server, and having no luck. I've tried using fopen and file_get_contents, but nothing has worked.
I am passing in a download URL, which isn't the exact file location; it is the "download URL" of the file, which then forces the browser to download.
So I'm thinking that is why fopen and file_get_contents are failing. Can someone tell me what I have to do to download a file from a URL with headers set to force a file download?
Any help greatly appreciated!
While not technically a duplicate, this has been asked on SO before: How to get redirecting url link with php from bit.ly
Your problem is that file_get_contents does not follow redirects. See the linked answer for a solution.
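If you would rather stay with file_get_contents, you can also spell out the redirect handling through a stream context. A minimal, untested sketch; the URL and file name are placeholders:
<?php
// Ask the HTTP stream wrapper to follow "Location:" redirects explicitly.
$context = stream_context_create( array(
    'http' => array(
        'follow_location' => 1,             // follow redirects
        'max_redirects'   => 10,            // give up after 10 hops
        'user_agent'      => 'Mozilla/5.0', // some hosts reject requests with no UA
    ),
) );
$data = file_get_contents( 'http://example.com/download.php?id=123', false, $context );
if ( $data !== false ) {
    file_put_contents( 'local-copy.bin', $data );
}
?>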
Although you didn't phrase your question very clearly, I think I know what you mean.
Try this function, taken from the php.net comments.
I haven't tested it, but it looks solid and appears to follow HTTP "Location:" redirects as well as JavaScript redirects to the file.
<?php
/*==================================
Get url content and response headers (given a url, follows all
redirections on it and returns the content and response headers
of the final url)
@return array[0] content
        array[1] array of response headers
==================================*/
function get_url( $url, $javascript_loop = 0, $timeout = 5 )
{
    $url = str_replace( "&amp;", "&", urldecode( trim( $url ) ) );

    $cookie = tempnam( "/tmp", "CURLCOOKIE" );
    $ch = curl_init();
    curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt( $ch, CURLOPT_ENCODING, "" );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
    curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false ); // skips certificate checks; required for some https urls
    curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );

    $content  = curl_exec( $ch );
    $response = curl_getinfo( $ch );
    curl_close( $ch );

    // Chase any remaining "Location:" redirect by hand.
    if ( $response['http_code'] == 301 || $response['http_code'] == 302 )
    {
        ini_set( "user_agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
        if ( $headers = get_headers( $response['url'] ) )
        {
            foreach ( $headers as $value )
            {
                if ( substr( strtolower( $value ), 0, 9 ) == "location:" )
                    return get_url( trim( substr( $value, 9 ) ) );
            }
        }
    }

    // Follow JavaScript redirects, up to 5 deep.
    if ( ( preg_match( "/>[[:space:]]+window\.location\.replace\('(.*)'\)/i", $content, $value )
        || preg_match( "/>[[:space:]]+window\.location\=\"(.*)\"/i", $content, $value ) )
        && $javascript_loop < 5 )
    {
        return get_url( $value[1], $javascript_loop + 1 );
    }
    else
    {
        return array( $content, $response );
    }
}
?>
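For what it's worth, a usage sketch (untested; the URL and file name are placeholders): the function returns the final body and the cURL info array, so saving the download would look something like:
list( $content, $info ) = get_url( "http://example.com/download.php?file=42" );
if ( $info['http_code'] == 200 )
{
    file_put_contents( "downloaded.bin", $content );
}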
A while back, I wrote a little utility function that takes $inPath and $outPath, opens both, and copies from one to the other using fread() and fwrite(). allow_url_fopen is enabled.
Well, I've got a URL that I'm trying to get the contents of, and fopen() doesn't get any data, but if I use curl to do the same, it works.
The url in question is: http://www.deltagroup.com/Feeds/images.php?lid=116582497&id=1
fopen version:
$in  = @fopen( $inPath, "rb" );
$out = @fopen( $outPath, "wb" );
if( !$in || !$out )
{
    echo 0;
    exit;
}
while( $chunk = fread( $in, 8192 ) )
{
    fwrite( $out, $chunk );
}
fclose( $in );
fclose( $out );
if( file_exists($outPath) )
{
    echo 1;
}
else
{
    echo 0;
}
curl version:
$opt = "curl -o " . $outPath . " " . $inPath;
$res = `$opt`;
if( file_exists($outPath) )
{
    echo 1;
}
else
{
    echo 0;
}
Any idea why this would happen?
Even using PHP's cURL, I was unable to download the file, until I added a CURLOPT_USERAGENT string. Nothing in the response indicated that it was required (no errors, nothing other than an HTTP 200).
Final code:
$out = @fopen( $outPath, "wb" );
if( !$out )
{
    echo 0;
    exit;
}
$ch = curl_init();
curl_setopt( $ch, CURLOPT_URL, $inPath );
curl_setopt( $ch, CURLOPT_FILE, $out ); // write the response body straight to $out
curl_setopt( $ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13' );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, 15 );
curl_setopt( $ch, CURLOPT_TIMEOUT, 18000 );
$data = curl_exec( $ch );
curl_close( $ch );
fclose( $out );
if( file_exists($outPath) )
{
    echo 1;
}
else
{
    echo 0;
}
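Worth noting about this approach: with CURLOPT_FILE set, cURL streams the response directly into the open file handle, so curl_exec() just returns true or false and the download never has to fit in memory, which matters for large files.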
I want to get the paragraphs under this tag: <p ng-bind-html="getContent(material.content)" id="dev-content" class="details-text">
I tried this:
<?php
$doc = new DOMDocument();
$doc->loadHTMLFile("https://sabq.org/xMQjz2");
$elements = $doc->getElementsByTagName('p');
if (!is_null($elements)) {
    foreach ($elements as $element) {
        $nodes = $element->childNodes;
        foreach ($nodes as $node) {
            echo $node->textContent . "\n";
        }
    }
}
?>
And I got the paragraphs I wanted along with unwanted ones, and they were duplicated.
EDIT:
I changed the URL, hope it works
The link that you provided throws an error when accessed, so what I did was find a function that gets the contents of the webpage using cURL instead of the DOMDocument class you were using.
I then used preg_match and a regex to extract the specific element that you were looking for.
Here's the code:
<?php
// Fetch the page with cURL (loading the URL directly threw "failed to open stream").
$content = get_fcontent("https://sabq.org/%D8%B4%D8%A7%D9%87%D8%AF-%D8%A3%D9%84%D9%81-%D8%B5%D9%81%D8%AD%D8%A9-%D8%AA%D8%B1%D9%88%D9%8A-%D9%82%D8%B5%D8%B5-%D8%A7%D9%84%D8%AD%D8%B1%D9%85%D9%8A%D9%86-%D9%85%D9%86%D8%B0-%D8%A7%D9%86%D8%B7%D9%84%D8%A7%D9%82-%D8%A7%D9%84%D8%B9%D9%87%D8%AF-%D8%A7%D9%84%D8%B3%D8%B9%D9%88%D8%AF%D9%8A");

// Extract the specific html tag and its innerHTML.
preg_match('/<p .*? ng\-bind\-html\=\"getContent\(material\.content\)\" .*?>.*?<\/p>/m', $content[0], $matches);

// Display the wanted element.
echo $matches[0];

// Get the contents using cURL, following redirects along the way.
function get_fcontent( $url, $javascript_loop = 0, $timeout = 5 ) {
    $url = str_replace( "&amp;", "&", urldecode( trim( $url ) ) );

    $cookie = tempnam( "/tmp", "CURLCOOKIE" );
    $ch = curl_init();
    curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt( $ch, CURLOPT_ENCODING, "" );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
    curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false ); // skips certificate checks; required for some https urls
    curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );

    $content  = curl_exec( $ch );
    $response = curl_getinfo( $ch );
    curl_close( $ch );

    // Chase any remaining "Location:" redirect by hand.
    if ( $response['http_code'] == 301 || $response['http_code'] == 302 ) {
        ini_set( "user_agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
        if ( $headers = get_headers( $response['url'] ) ) {
            foreach ( $headers as $value ) {
                if ( substr( strtolower( $value ), 0, 9 ) == "location:" )
                    return get_fcontent( trim( substr( $value, 9 ) ) );
            }
        }
    }

    // Follow JavaScript redirects, up to 5 deep.
    if ( ( preg_match( "/>[[:space:]]+window\.location\.replace\('(.*)'\)/i", $content, $value ) || preg_match( "/>[[:space:]]+window\.location\=\"(.*)\"/i", $content, $value ) ) && $javascript_loop < 5 ) {
        return get_fcontent( $value[1], $javascript_loop + 1 );
    } else {
        return array( $content, $response );
    }
}
?>
For testing, I created a local file called test.html:
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<p>This should not be showing.</p>
<p ng-bind-html="getContent(material.content)" id="dev-content" class="details-text">This is a test.</p>
</body>
</html>
I used the local url http://localhost/example/test.html instead of the link you provided for testing purposes.
And from the local file I created for testing, I got the following result:
<p ng-bind-html="getContent(material.content)" id="dev-content" class="details-text">This is a test.</p>
Here's the result that I got from the original url:
<p ng-bind-html="getContent(material.content)" id="dev-content" class="details-text"></p>
I hope this helps!
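As a side note, if you would rather avoid the regex, the same element can be pulled out of the cURL output with DOMXPath. An untested sketch along the same lines; it reuses get_fcontent() from above and the short URL from the question:
<?php
$content = get_fcontent( "https://sabq.org/xMQjz2" );

$doc = new DOMDocument();
libxml_use_internal_errors( true ); // silence warnings from real-world HTML
$doc->loadHTML( $content[0] );
libxml_clear_errors();

// Select only the paragraphs carrying the ng-bind-html attribute.
$xpath = new DOMXPath( $doc );
foreach ( $xpath->query( '//p[@ng-bind-html]' ) as $p ) {
    echo $doc->saveHTML( $p ) . "\n";
}
?>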
I am trying to get some information from a UK retailer's website, and many of the scrapes I have done have been quite simple. However, there are a number where I just cannot get around issues, and most of the time they are caused by cookies. This was a great SO question, but it hasn't helped.
I have the following PHP function...
function file_get_contents_curl_many_redir2( $url, $timeout = 15 ) {
    $cookie  = tempnam( "/tmp", "CURLCOOKIE" );
    $verbose = fopen( 'php://temp', 'rw+' );

    $ch = curl_init();
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0" );
    curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );
    curl_setopt( $ch, CURLOPT_COOKIEFILE, $cookie );
    curl_setopt( $ch, CURLOPT_COOKIESESSION, true );
    curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false );
    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt( $ch, CURLOPT_ENCODING, "" );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
    curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
    curl_setopt( $ch, CURLOPT_MAXREDIRS, 100 );
    curl_setopt( $ch, CURLOPT_VERBOSE, true );
    curl_setopt( $ch, CURLOPT_STDERR, $verbose );

    $data = curl_exec( $ch );

    rewind( $verbose );
    $verboseLog = stream_get_contents( $verbose );
    echo "Verbose information:\n<pre>", htmlspecialchars( $verboseLog ), "</pre>\n";

    $curlVersion = curl_version();
    extract( curl_getinfo( $ch ) );
    $metrics = <<<EOD
URL....: $url
Code...: $http_code ($redirect_count redirect(s) in $redirect_time secs)
Content: $content_type Size: $download_content_length (Own: $size_download) Filetime: $filetime
Time...: $total_time Start @ $starttransfer_time (DNS: $namelookup_time Connect: $connect_time Request: $pretransfer_time)
Speed..: Down: $speed_download (avg.) Up: $speed_upload (avg.)
Curl...: v{$curlVersion['version']}
EOD;
    var_dump( $metrics );

    if ( curl_errno( $ch ) ) {
        echo 'Curl error: ' . curl_error( $ch ) . ' on url: ' . $url;
        var_dump( curl_getinfo( $ch, CURLINFO_HTTP_CODE ) );
    }
    curl_close( $ch );
    return $data;
}
On the majority of websites this works (in fact I have simpler functions, as most don't redirect lots of times).
But with www.homebase.co.uk, when I use this URL http://www.homebase.co.uk/SearchDisplay?pageSize=43&searchSource=Q&resultCatEntryType=2&pageView=&catalogId=10011&showResultsPage=true&beginIndex=0&langId=110&categoryId=&storeId=10201&sType=SimpleSearch&searchTerm=$sku where $sku is the 6-digit SKU that Homebase uses (which is also the same as the last 6 numbers of any product page), I get no information.
When I run the URL through Redirect Detective, it seems to be an unsupported-browser issue; I presume because it's not a browser accessing the website, and it knows this.
The Main Question:
How do I fix it so I can get the correct source code for this page? (I am currently just getting blank text, not the product page(s) I want.)
Related questions
There's a handful of other websites that don't "let me in" either. Is this just cookie related, or is the coding of their website "smart enough" to know whether it's a browser or not? Can I fool it into thinking it is one? (This may be answered via my primary question.)
Related extra information that may help answer the above:
* When I set CURLOPT_MAXREDIRS to 100, it seems to only let me go up to 40. Could this be the issue?
* Using $sku = 323526; you should arrive, after some redirects, at the final URL http://www.homebase.co.uk/en/homebaseuk/sovereign-petrol-self-propelled-rotary-mower---1493cc---40cm-323526. I do not need to know where the cURL ends up; I just want to be able to pinch the title, image and other info from the product page, knowing only the SKU!
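One avenue that might be worth trying here, though I have not verified it against Homebase: sites that serve an "unsupported browser" page sometimes inspect more than the User-Agent, so sending a fuller browser-like header set via CURLOPT_HTTPHEADER could help, e.g. adding this to the function above:
// Hypothetical addition: send the headers a real browser would,
// in case the site inspects more than the User-Agent.
curl_setopt( $ch, CURLOPT_HTTPHEADER, array(
    "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language: en-GB,en;q=0.5",
    "Connection: keep-alive",
) );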
I started off using file_get_contents() and it returned string(9259) in which every character is a space (i.e. it's a lot of empty). After some research I tried using cURL, and after a few struggles getting CURLOPT_SSL_VERIFYPEER and CURLOPT_USERAGENT to work, it brought me right back to where I was: string(9259), all blank.
I am attempting to retrieve the tracking information on multiple packages automatically and the code for a single iteration is as follows:
function curl($url)
{
    $ch = curl_init();
    curl_setopt( $ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17' );
    curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false );
    curl_setopt( $ch, CURLOPT_URL, $url );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
    $data = curl_exec( $ch );
    var_dump( curl_getinfo( $ch ) );
    echo curl_errno( $ch ) . '<br/>';
    echo curl_error( $ch ) . '<br/>';
    curl_close( $ch );
    return $data;
}
$url for one instance is https://www.fedex.com/fedextrack/?tracknumbers=055575670028673&cntry_code=us
My question is essentially why am I receiving the string(9259) of blank characters? I expected to receive an actual string representation of the website.
Maybe FedEx doesn't like it when you scrape its pages, has detected that you are doing so, and is returning dummy data?
They do have APIs for this: http://www.fedex.com/us/developer/web-services/index.html
How do I download a copy of the HTML from a website that has language detection (e.g. Google, YouTube) and redirection? I have tried file_get_contents but it is too limited.
I am trying to use cURL in PHP to get the HTML from www.google.com, but it detects that I am from the UK and sends me a 302 redirect to www.google.co.uk.
I have tried many different things with no joy. Is this possible? Websites like www.markosweb.com do it.
My code:
$url = "http://www.google.com/";
$ch = curl_init( $url );
// $userAgent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0)";
// $userAgent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)';
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
$header = array(
    "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
    "Accept-Language: en-US,us;q=0.7,en-us;q=0.5,en;q=0.3",
    "Accept-Charset: windows-1251,utf-8;q=0.7,*;q=0.7",
    "Keep-Alive: 300");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);  // return the transfer as a string instead of outputting it directly
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);     // seconds to wait while trying to connect
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent); // contents of the "User-Agent:" header for the HTTP request
curl_setopt($ch, CURLOPT_FAILONERROR, TRUE);     // fail silently if the HTTP code returned is >= 400
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);     // follow any "Location:" header that the server sends
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);     // set the Referer: field when following a Location: redirect
curl_setopt($ch, CURLOPT_TIMEOUT, 10);           // maximum number of seconds to allow cURL functions to execute
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($ch, CURLOPT_REFERER, $url);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);   // send the custom headers built above
$content = curl_exec( $ch );
$err = curl_errno( $ch );
$errmsg = curl_error( $ch );
$header = curl_getinfo( $ch );
curl_close( $ch );
$header['errno'] = $err;
$header['errmsg'] = $errmsg;
$header['content'] = $content;
return $header;
I have tried changing the user agent to lots of things, and tried with and without the header details. I managed to get something if I used the header "Accept-Language: ru-ru,ru;q=0.7,en-us;q=0.5,en;q=0.3", but it was in Russian or something.
Thanks for your help.
Carl
Try this proxy script:
// Change these configuration options if needed.
$enable_jsonp    = false;
$enable_native   = false;
$valid_url_regex = '/.*/';

// ############################################################################

$url = $_GET['url'];

if ( !$url ) {
    // Passed url not specified.
    $contents = 'ERROR: url not specified';
    $status   = array( 'http_code' => 'ERROR' );
} else if ( !preg_match( $valid_url_regex, $url ) ) {
    // Passed url doesn't match $valid_url_regex.
    $contents = 'ERROR: invalid url';
    $status   = array( 'http_code' => 'ERROR' );
} else {
    $ch = curl_init( $url );

    if ( strtolower( $_SERVER['REQUEST_METHOD'] ) == 'post' ) {
        curl_setopt( $ch, CURLOPT_POST, true );
        curl_setopt( $ch, CURLOPT_POSTFIELDS, $_POST );
    }

    if ( $_GET['send_cookies'] ) {
        $cookie = array();
        foreach ( $_COOKIE as $key => $value ) {
            $cookie[] = $key . '=' . $value;
        }
        if ( $_GET['send_session'] ) {
            $cookie[] = SID;
        }
        $cookie = implode( '; ', $cookie );
        curl_setopt( $ch, CURLOPT_COOKIE, $cookie );
    }

    curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
    curl_setopt( $ch, CURLOPT_HEADER, true );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_USERAGENT, $_GET['user_agent'] ? $_GET['user_agent'] : $_SERVER['HTTP_USER_AGENT'] );

    list( $header, $contents ) = preg_split( '/([\r\n][\r\n])\\1/', curl_exec( $ch ), 2 );

    $status = curl_getinfo( $ch );
    curl_close( $ch );
}

// Split header text into an array.
$header_text = preg_split( '/[\r\n]+/', $header );

if ( $_GET['mode'] == 'native' ) {
    if ( !$enable_native ) {
        $contents = 'ERROR: invalid mode';
        $status   = array( 'http_code' => 'ERROR' );
    }

    // Propagate headers to response.
    foreach ( $header_text as $header ) {
        if ( preg_match( '/^(?:Content-Type|Content-Language|Set-Cookie):/i', $header ) ) {
            header( $header );
        }
    }

    print $contents;
} else {
    // $data will be serialized into JSON data.
    $data = array();

    // Propagate all HTTP headers into the JSON data object.
    if ( $_GET['full_headers'] ) {
        $data['headers'] = array();
        foreach ( $header_text as $header ) {
            preg_match( '/^(.+?):\s+(.*)$/', $header, $matches );
            if ( $matches ) {
                $data['headers'][ $matches[1] ] = $matches[2];
            }
        }
    }

    // Propagate all cURL request / response info to the JSON data object.
    if ( $_GET['full_status'] ) {
        $data['status'] = $status;
    } else {
        $data['status'] = array();
        $data['status']['http_code'] = $status['http_code'];
    }

    // Set the JSON data object contents, decoding it from JSON if possible.
    $decoded_json = json_decode( $contents );
    $data['contents'] = $decoded_json ? $decoded_json : $contents;

    // Generate appropriate content-type header.
    $is_xhr = strtolower( $_SERVER['HTTP_X_REQUESTED_WITH'] ) == 'xmlhttprequest';
    header( 'Content-type: application/' . ( $is_xhr ? 'json' : 'x-javascript' ) );

    // Get JSONP callback.
    $jsonp_callback = $enable_jsonp && isset( $_GET['callback'] ) ? $_GET['callback'] : null;

    // Generate JSON/JSONP string.
    $json = json_encode( $data );
    print $jsonp_callback ? "$jsonp_callback($json)" : $json;
}
Make sure to perform a request like this:
http://example.com/script?url=http://whateverurl.com/
Oh, and this PHP script will display the result in JSON.
From there, you can parse it using jQuery.
Like I use this jQuery code:
<script type="text/javascript">
$(document).ready(function(){
    var url = '+++++URL WHICH THE PHP PROXY SCRIPT IS IN++++++';
    $(window).load(function(){
        $.getJSON(url, function(json){
            $("#resu").append(json.contents);
        });
    });
});
</script>
Edit: This script is not a true proxy in the sense that it does not fake an IP address. Sorry for the confusion.