I started off using file_get_contents(), and it returned string(9259) in which every character is a space (in other words, a lot of nothing). After some research I tried cURL, and after a few struggles getting CURLOPT_SSL_VERIFYPEER and CURLOPT_USERAGENT to work, it brought me right back to where I was: a string(9259) that is entirely blank.
I am attempting to retrieve the tracking information for multiple packages automatically, and the code for a single iteration is as follows:
function curl($url)
{
$ch = curl_init();
curl_setopt( $ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17' );
curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false );
curl_setopt( $ch, CURLOPT_URL, $url );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, 1 );
$data = curl_exec( $ch );
var_dump( curl_getinfo( $ch ) );
echo curl_errno( $ch ) . '<br/>';
echo curl_error( $ch ) . '<br/>';
curl_close( $ch );
return $data;
}
$url for one instance is https://www.fedex.com/fedextrack/?tracknumbers=055575670028673&cntry_code=us
My question is essentially: why am I receiving a string(9259) of blank characters? I expected to receive an actual string representation of the website.
Maybe FedEx doesn't like it when you scrape pages, has detected that you are doing so, and is returning dummy data?
They do have APIs for this: http://www.fedex.com/us/developer/web-services/index.html
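Before switching to the API, it may help to confirm that the body really is nothing but whitespace. The sketch below assumes a hypothetical helper name, describeBody(), which is not from the question; it only summarises whatever curl_exec() or file_get_contents() returned.

```php
<?php
// Hypothetical helper (not from the original post): summarise a response
// body so that a whitespace-only reply is obvious at a glance.
function describeBody($data)
{
    // Both curl_exec() and file_get_contents() return false on failure.
    if ($data === false) {
        return array('ok' => false, 'length' => 0, 'blank' => true);
    }
    return array(
        'ok'     => true,
        'length' => strlen($data),
        // trim() strips spaces, tabs, newlines, carriage returns and NUL bytes.
        'blank'  => trim($data) === '',
    );
}

// Example with a stand-in for the 9259-space response described above.
print_r(describeBody(str_repeat(' ', 9259)));
```

If 'blank' comes back true, the server answered with an empty shell, which fits the theory that the real content is withheld or filled in later by JavaScript.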
I'm an amateur programmer trying to fetch my Facebook page likes for my webpage, since I do not want Facebook's standard plugin. After a lot of research and testing I found the right call to make, which is:
https://graph.facebook.com/{PAGE_ID}?access_token={APP_ID}|{APP_SECRET}&fields=fan_count
Using this in my web browser it works great and I get the fan count in json.
My problem is that when I make this call using either cURL or file_get_contents inside a PHP script, I don't seem to get anything back. After doing a var_dump I get bool(false) back, and after doing print_r I get nothing.
Is Facebook somehow blocking my calls, or what could be wrong? I have tried multiple approaches I found online and none of them works. This is one of the examples:
<?php
$json_url = "https://graph.facebook.com/XXXXXXXXXXX?access_token=XXXXXXXXXXXXX&fields=fan_count";
$json = file_get_contents($json_url);
$json = json_decode($json, true);
print_r($json);
echo "Number of likes : ". $json['fan_count']; // json_decode(..., true) returns an associative array
?>
This is another:
<?php
$ch = curl_init("https://graph.facebook.com/v2.6/XXXXXXXXXXXXX?access_token=XXXXXXXXXXXX&fields=fan_count");
curl_setopt( $ch, CURLOPT_POST, false );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7");
curl_setopt( $ch, CURLOPT_HEADER, false );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
$data = curl_exec( $ch );
echo $data;
?>
The problem seems to occur right after the graph call. I can't figure out why; all help will be appreciated :) Thanks!
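For what it's worth, curl_exec() returns false when the transfer itself fails, and curl_error() reports the reason, so printing it is usually the first step. Below is a minimal sketch of that check; the function names fetchBody() and extractFanCount() are made up for illustration, and the JSON shape is assumed to be an object with a fan_count key, as in the browser output described above.

```php
<?php
// Hypothetical helper: fetch a URL and keep the cURL error text on failure.
function fetchBody($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $body  = curl_exec($ch);                        // false on failure
    $error = ($body === false) ? curl_error($ch) : null;
    $code  = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return array('body' => $body, 'error' => $error, 'http_code' => $code);
}

// Hypothetical helper: pull fan_count out of the response body.
// Assumes the body looks like {"fan_count": 123, "id": "..."}.
function extractFanCount($body)
{
    if (!is_string($body)) {
        return null;                                // transfer failed
    }
    $json = json_decode($body, true);               // true => associative array
    if (!is_array($json) || !isset($json['fan_count'])) {
        return null;                                // not the expected JSON
    }
    return (int) $json['fan_count'];
}
```

Usage would be along the lines of $r = fetchBody($json_url); followed by echoing $r['error'] when $r['body'] is false, and extractFanCount($r['body']) otherwise.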
I have a script that sends POST data to several pages. However, I ran into difficulties sending requests to some servers. The reason is redirection. Here's the sequence:
I send a POST request to the server.
The server responds: 301 Moved Permanently.
Then curl_setopt ( $ch, CURLOPT_FOLLOWLOCATION, TRUE) kicks in and follows the redirection (but via a GET request).
To solve this I'm using curl_setopt ( $ch, CURLOPT_CUSTOMREQUEST, "POST"), and yes, now it redirects as POST, but without the body content that I sent in the first request. How can I force curl to send the POST body when redirected? Thanks!
Here's the example:
<?php
function curlPost($url, $postData = "")
{
$ch = curl_init () or exit ( "curl error: Can't init curl" );
$url = trim ( $url );
curl_setopt ( $ch, CURLOPT_URL, $url );
//curl_setopt ( $ch, CURLOPT_POST, 1 );
curl_setopt ( $ch, CURLOPT_CUSTOMREQUEST, "POST");
curl_setopt ( $ch, CURLOPT_POSTFIELDS, $postData );
curl_setopt ( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt ( $ch, CURLOPT_CONNECTTIMEOUT, 30 );
curl_setopt ( $ch, CURLOPT_TIMEOUT, 30 );
curl_setopt ( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36");
curl_setopt ( $ch, CURLOPT_FOLLOWLOCATION, TRUE);
$response = curl_exec ( $ch );
if (! $response) {
echo "Curl errno: " . curl_errno ( $ch ) . " (" . $url . " postdata = $postData )\n";
echo "Curl error: " . curl_error ( $ch ) . " (" . $url . " postdata = $postData )\n";
$info = curl_getinfo($ch);
echo "HTTP code: ".$info["http_code"]."\n";
// exit();
}
curl_close ( $ch );
// echo $response;
return $response;
}
?>
curl is following what RFC 7231 suggests, which also is what browsers typically do for 301 responses:
Note: For historical reasons, a user agent MAY change the request
method from POST to GET for the subsequent request. If this
behavior is undesired, the 307 (Temporary Redirect) status code
can be used instead.
If you think that's undesirable, you can change it with the CURLOPT_POSTREDIR option, which seems to be only sparsely documented in PHP, but the libcurl docs explain it. By setting the correct bitmask there, you make curl keep the method when it follows the redirect.
If you control the server end for this, an easier fix would be to make sure a 307 response code is returned instead of a 301.
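As a rough sketch of the CURLOPT_POSTREDIR approach, assuming a PHP build that exposes the CURL_REDIR_POST_ALL constant: the function names newPostHandle() and curlPostKeepMethod() are invented for this example, and it uses CURLOPT_POST rather than CURLOPT_CUSTOMREQUEST so that curl treats the request as a regular POST.

```php
<?php
// Sketch only: build a handle that keeps POST (and its body) when
// following 301/302/303 redirects. Function names are hypothetical.
function newPostHandle($url, $postData = "")
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, trim($url));
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // Bitmask: 1 = keep POST after a 301, 2 = after a 302, 4 = after a 303.
    // CURL_REDIR_POST_ALL combines the bits the installed libcurl supports.
    curl_setopt($ch, CURLOPT_POSTREDIR, CURL_REDIR_POST_ALL);
    return $ch;
}

function curlPostKeepMethod($url, $postData = "")
{
    $ch = newPostHandle($url, $postData);
    return curl_exec($ch); // response body, or false on failure
}
```

If only the 301 case matters, passing 1 (CURL_REDIR_POST_301) instead of CURL_REDIR_POST_ALL should be enough.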
I am trying to get some information from a UK retailer's website, and for many of the scrapes I have done it's quite simple. However, there are a number of sites where I just cannot get around the problems, which most of the time are caused by cookies. This was a great SO question, but it hasn't helped.
I have the following PHP function...
function file_get_contents_curl_many_redir2( $url, $timeout = 15 ) {
$cookie = tempnam ("/tmp", "CURLCOOKIE");
$verbose = fopen('php://temp', 'rw+');
$ch = curl_init();
curl_setopt( $ch, CURLOPT_URL, $url );
curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0" );
curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );
curl_setopt( $ch, CURLOPT_COOKIEFILE, $cookie ); //
curl_setopt( $ch, CURLOPT_COOKIESESSION, true );
curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false); //
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
curl_setopt( $ch, CURLOPT_ENCODING, "" );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
curl_setopt( $ch, CURLOPT_MAXREDIRS, 100 );
curl_setopt( $ch, CURLOPT_VERBOSE, true);
curl_setopt( $ch, CURLOPT_STDERR, $verbose);
$data = curl_exec($ch);
rewind($verbose);
$verboseLog = stream_get_contents($verbose);
echo "Verbose information:\n<pre>", htmlspecialchars($verboseLog), "</pre>\n";
$curlVersion = curl_version();
extract(curl_getinfo($ch));
$metrics = <<<EOD
URL....: $url
Code...: $http_code ($redirect_count redirect(s) in $redirect_time secs)
Content: $content_type Size: $download_content_length (Own: $size_download) Filetime: $filetime
Time...: $total_time Start # $starttransfer_time (DNS: $namelookup_time Connect: $connect_time Request: $pretransfer_time)
Speed..: Down: $speed_download (avg.) Up: $speed_upload (avg.)
Curl...: v{$curlVersion['version']}
EOD;
var_dump($metrics);
if(curl_errno($ch)){
echo 'Curl error: ' . curl_error($ch) . "on url:" .$url;
var_dump(curl_getinfo($ch, CURLINFO_HTTP_CODE));
}
curl_close($ch);
return $data;
}
On the majority of websites this works (I in fact have simpler functions as most don't redirect lots of times.)
But with www.homebase.co.uk, when I use the URL http://www.homebase.co.uk/SearchDisplay?pageSize=43&searchSource=Q&resultCatEntryType=2&pageView=&catalogId=10011&showResultsPage=true&beginIndex=0&langId=110&categoryId=&storeId=10201&sType=SimpleSearch&searchTerm=$sku, where $sku is the six-digit SKU that Homebase uses (which is also the same as the last six digits of any product page URL), I get no information.
When I run the URL through Redirect Detective it seems to be an unsupported-browser issue, I presume because it's not a browser accessing the website, and the site knows this.
The Main Question:
How do I fix it so I can then get the correct source code to this page? (I am currently just getting blank text, not the product page(s) I want)
Related questions
There's a handful of other websites that don't "let me in" either. Is this just cookie related, or is the coding of their website "smart enough" to know whether it's a browser or not? Can I fool it into thinking it is one? (This may be answered by my primary question.)
Related extra information that may help answer the above.
* When I set CURLOPT_MAXREDIRS to 100, it only seems to let me go up to 40. Could this be the issue?
* Using $sku = 323526; you should arrive, after some redirects, at the final URL http://www.homebase.co.uk/en/homebaseuk/sovereign-petrol-self-propelled-rotary-mower---1493cc---40cm-323526. I do not need to know where cURL ends up; I just want to be able to pinch the title, image and other info from the product page, knowing only the SKU!
I want to get json data from here: JSON url;
Using Chrome I can see all the JSON data, but using curl (code below) it seems to redirect and get lost (if CURLOPT_FOLLOWLOCATION is false, it does nothing):
$json_url = 'http://cartolafc.globo.com/mercado/filtrar.json?page=1&order_by=media&status_id=7&posicao_id=1';
$ch = curl_init($json_url);
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
//curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, false );
//even killing the redirect process it does not return JSON data
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7");
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
$data = curl_exec( $ch );
$dataDecoded = json_decode($data);
print_r($dataDecoded);
I saw something about simulating a browser with curl, but I thought setting the user agent would do the trick. Maybe it's something about the server using cookies... I really don't know. I saw other answers here today but they didn't solve my problem. Am I missing something?
Thank you.
Wrong variable use:
$data = curl_exec( $ch );
^^^^^--- data here
$dataDecoded = json_decode($json);
                           ^^^^^--- not $data here
So you're trying to decode a variable that doesn't exist.
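A small sketch of the corrected flow, in which the same variable that curl_exec() filled is the one passed to json_decode(). The helper name decodeJsonBody() is made up for this example; it also checks json_last_error() so that an HTML error page is not silently treated as data.

```php
<?php
// Hypothetical helper: decode a response body as JSON, or return null
// when the transfer failed or the body is not valid JSON.
function decodeJsonBody($data)
{
    if (!is_string($data) || $data === '') {
        return null;                     // curl_exec() returned false or nothing
    }
    $decoded = json_decode($data);       // decode the SAME variable curl filled
    if (json_last_error() !== JSON_ERROR_NONE) {
        return null;                     // e.g. an HTML page instead of JSON
    }
    return $decoded;
}

// Stand-in body for illustration; in the question it would come from curl_exec($ch).
$data = '{"page": 1, "items": []}';
$dataDecoded = decodeJsonBody($data);
print_r($dataDecoded);
```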
I'm trying to download a file with PHP that is located on a remote server, and I'm having no luck. I've tried using fopen and file_get_contents, but nothing has worked.
I am passing in a download URL, which isn't the exact file location; it is the "download URL" of the file, which then forces the browser to download it.
So I'm thinking that is why fopen and file_get_contents are failing. Can someone tell me what I have to do to download a file from a URL whose headers are set to force a file download?
Any help greatly appreciated!
While not technically a duplicate, this has been asked on SO before: How to get redirecting url link with php from bit.ly
Your problem is that file_get_contents does not follow redirects. See the linked answer for a solution.
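If the cURL extension is available, one option is to let it follow the redirects and write straight to disk. This is only a sketch: downloadTo() is a hypothetical name, and the limits (10 redirects, 60 seconds) are arbitrary choices.

```php
<?php
// Hypothetical helper: save the resource at $url to $destPath,
// following Location redirects on the way. Returns true on success.
function downloadTo($url, $destPath)
{
    $fp = fopen($destPath, 'wb');
    if ($fp === false) {
        return false;
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FILE, $fp);            // stream the body into the file
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow 3xx redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 10);        // arbitrary safety limit
    curl_setopt($ch, CURLOPT_FAILONERROR, true);    // treat HTTP >= 400 as failure
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);
    $ok = curl_exec($ch);                           // true/false when CURLOPT_FILE is set
    fclose($fp);
    if ($ok === false) {
        unlink($destPath);                          // do not leave a partial file behind
        return false;
    }
    return true;
}
```

A call would look like downloadTo($downloadUrl, '/tmp/file.bin'), where both arguments are placeholders.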
Although you didn't phrase your question clearly, I think I know what you mean.
Try this function, taken from the php.net comments.
I didn't test it, but it looks good and seems to follow HTTP header redirects as well as JavaScript redirects to the file.
<?php
/*==================================
Get url content and response headers (given a url, follows all redirections on it and returned content and response headers of final url)
@return array[0] content
array[1] array of response headers
==================================*/
function get_url( $url, $javascript_loop = 0, $timeout = 5 )
{
$url = str_replace( "&amp;", "&", urldecode(trim($url)) );
$cookie = tempnam ("/tmp", "CURLCOOKIE");
$ch = curl_init();
curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
curl_setopt( $ch, CURLOPT_URL, $url );
curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );
curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
curl_setopt( $ch, CURLOPT_ENCODING, "" );
curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false ); # required for https urls
curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );
$content = curl_exec( $ch );
$response = curl_getinfo( $ch );
curl_close ( $ch );
if ($response['http_code'] == 301 || $response['http_code'] == 302)
{
ini_set("user_agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1");
if ( $headers = get_headers($response['url']) )
{
foreach( $headers as $value )
{
if ( substr( strtolower($value), 0, 9 ) == "location:" )
return get_url( trim( substr( $value, 9, strlen($value) ) ) );
}
}
}
if ( ( preg_match("/>[[:space:]]+window\.location\.replace\('(.*)'\)/i", $content, $value) || preg_match("/>[[:space:]]+window\.location\=\"(.*)\"/i", $content, $value) ) &&
$javascript_loop < 5
)
{
return get_url( $value[1], $javascript_loop+1 );
}
else
{
return array( $content, $response );
}
}
?>