file_get_contents request to external site - php

I am trying to do a file_get_contents of this Demo URL.
However, the server has trouble getting data from external sites. This is the error I get when I echo the result of file_get_contents():
Found. The document has moved here.
Apache/2.4 Server at spotifycharts.com Port 80
I have turned register_globals on in the php.ini file, but this doesn't help.
What would be the most logical thing to check to make sure my website is able to get data from external sites?

Just use the https URL instead of the http URL:
https://spotifycharts.com/api/?type=regional&country=nl&recurrence=daily&date=latest&limit=200
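For reference, a minimal sketch of fetching that URL with file_get_contents() and a stream context (follow_location and max_redirects are standard PHP HTTP context options; the URL is the one from the question):
$context = stream_context_create([
    'http' => [
        'follow_location' => 1,  // follow Location headers (the default)
        'max_redirects'   => 10,
    ],
]);
$json = file_get_contents('https://spotifycharts.com/api/?type=regional&country=nl&recurrence=daily&date=latest&limit=200', false, $context);
if ($json === false) {
    // $http_response_header holds the raw response headers of the last request
    var_dump($http_response_header);
}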

You may want to request with cURL instead; file_get_contents() does follow 302 redirects by default, but cURL makes it easy to capture the Location header yourself.
Something like this should work...
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url); // e.g. the Spotify charts URL above
curl_setopt($ch, CURLOPT_HEADER, TRUE);          // include headers in the output
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, FALSE); // stop at the redirect itself
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$a = curl_exec($ch);
curl_close($ch);
// Pull the redirect target out of the Location header
if (preg_match('#Location: (.*)#i', $a, $r)) {
    $l = trim($r[1]);
}
Source: How to get the real URL after file_get_contents if redirection happens?

Related

How to find last (final) URL after series of redirects via shortened URL from PHP

I built a custom plugin for WordPress that lets people post without registering or logging in; they just confirm a password twice. It has been working well and spam free, but someone started posting spammy links.
I wrote a plugin to detect the pattern by IP address, block the IP, and delete all posts from blocked users. However, I think this spammer is using a tool that spoofs or rotates IP addresses, and they started posting from different addresses. One thing the posts have in common is that the links end up at the same URL after a series of redirects.
I've tried the following function to trace the destination, but no luck.
function myfunction($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_HEADER, false);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_exec($ch);
    // Note: curl_getinfo($ch) with no option returns the whole info array;
    // ask specifically for the final URL.
    $lastUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);
    return $lastUrl;
}
I've also tried getting the header information from the link, but no luck.
So I tried many online tools that grab the final URL from a link, and none of them worked.
The URL shortener service this spammer uses is http://urnic.com/
I don't think it is doing a JavaScript redirect, as the link still redirected with JavaScript turned off in my Chrome.
You can use cURL's CURLOPT_FOLLOWLOCATION together with CURLINFO_EFFECTIVE_URL to find the final address, provided the redirects you speak of are HTTP redirects (i.e. a 3xx status such as 300 Multiple Choices, 301 Moved Permanently, 302 Found, or 307 Temporary Redirect).
function get_final_url(string $redirect_url): string {
    $ch = curl_init($redirect_url);
    curl_setopt_array($ch, array(
        CURLOPT_FOLLOWLOCATION => 1,
        CURLOPT_ENCODING => '', // accept any encoding cURL supports
        CURLOPT_USERAGENT => 'many_websites_block_UAless_requests',
        // Ideally we would use CURLOPT_NOBODY, but some websites respond
        // differently to HEAD requests, so a GET request is the safest
        // option. If you're worried about RAM usage, point CURLOPT_FILE
        // at /dev/null or set a CURLOPT_WRITEFUNCTION that discards the body.
        CURLOPT_RETURNTRANSFER => 1,
    ));
    curl_exec($ch);
    $ret = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);
    return $ret;
}
(PS: untested, there might be a typo or something, but it should work in theory.)
As mentioned in the code comment, the function can be optimized to use less RAM if you're worried about huge responses: CURLOPT_RETURNTRANSFER puts the entire response in RAM, which can be avoided with a CURLOPT_WRITEFUNCTION that discards the data.
Anyhow, that should return the final URL.
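For example, a rough sketch of the discard-the-body variant (equally untested; the callback signature is the standard cURL write callback):
$ch = curl_init($redirect_url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
// Consume each chunk without storing it; returning the chunk length
// tells cURL the data was handled.
curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $data) {
    return strlen($data);
});
curl_exec($ch);
$final = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);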
You can use preg_match to capture the Location header URL; it works perfectly for me.
$curlhandle = curl_init();
curl_setopt($curlhandle, CURLOPT_URL, $url);
curl_setopt($curlhandle, CURLOPT_HEADER, 1);
curl_setopt($curlhandle, CURLOPT_USERAGENT, 'googlebot');
curl_setopt($curlhandle, CURLOPT_FOLLOWLOCATION, 0);
curl_setopt($curlhandle, CURLOPT_RETURNTRANSFER, 1);
$final = curl_exec($curlhandle);
curl_close($curlhandle);
if (preg_match('~Location: (.*)~i', $final, $lasturl)) {
    $loc = trim($lasturl[1]);
    echo $loc;
} else {
    echo "No redirect URL found...";
}
This will behave like Googlebot and show you the redirected URL; the key addition is the curl_setopt($curlhandle, CURLOPT_USERAGENT, 'googlebot'); line.

How to run an external link in PHP

Currently I have a page, say page1.php. On a certain event, I want to request another URL, say http://example.com, without actually refreshing this page. The link is a script which updates my database. I tried shell_exec('php '.$url); where $url = 'http://example.com', but it showed a "could not open file" error, so I suppose shell_exec works only for files present on the server. Is there a way to do this directly, or do I have to go with AJAX? Thanks in advance.
Try using cURL to send the request to the server from PHP.
$url = 'http://example.com';
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
curl_setopt($ch, CURLOPT_NOBODY, TRUE);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
curl_exec($ch);
curl_close($ch);
Alternatively you could try file_get_contents
file_get_contents('http://example.com');
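If the update script can be slow, a short read timeout keeps page1.php from blocking; a minimal sketch using a stream context (timeout is a standard PHP HTTP context option):
$context = stream_context_create([
    'http' => [
        'timeout' => 5, // give up after 5 seconds
    ],
]);
file_get_contents('http://example.com', false, $context);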
I would do this on the front end using JSONP: much cleaner and safer IMHO.

How to avoid "Moved Permanently: The document has moved here"

I'm on one site and I want to call an API that is on another site, so I build a cURL request:
$url = ........
$curl_data = array('name'=>$name);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $curl_data);
$output = curl_exec($ch);
$info = curl_getinfo($ch, CURLINFO_HTTP_CODE);
When I execute the cURL request and print $output, I get "Moved Permanently: The document has moved here." This is wrong, because I want to call this API, get the value back, and return to the page where the process started. Can anyone help me?
After a day I resolved it by adding this line before calling the function:
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,true);
Check whether the URL is indeed returning a 301 (Moved Permanently). Use Fiddler, since it can capture HTTP status codes.
See this: PHP cURL says Moved Permanently when POSTing to a virtual host
curl_setopt($curl, CURLOPT_CUSTOMREQUEST, "POST");
curl_setopt($curl, CURLOPT_POSTREDIR, 3);
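CURLOPT_POSTREDIR is a bitmask telling cURL to keep the POST method when following specific redirect codes; PHP exposes named constants for the bits, so the 3 above is equivalent to:
// 1 = keep POST on a 301, 2 = keep POST on a 302
curl_setopt($curl, CURLOPT_POSTREDIR, CURL_REDIR_POST_301 | CURL_REDIR_POST_302);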
This worked for me after adding this line:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
I faced the same issue. You can fix it by changing your URL prefix from http to https (or, if it was already https, to http).
The problem may be in .htaccess, in a rewrite rule something like:
RewriteRule ............ [R=301,L]

How can I properly follow all redirects on sites I am trying to scrape with cURL in PHP?

I am using cURL to try to scrape an ASP site that is not on my server, with the following option to automatically follow redirects it comes across:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
but it is not properly following all of the redirects the website sends back: it resolves several of the redirect URLs relative to my server and my PHP script's path, instead of the website's server and the path its pages should be relative to. Is there any way to set the base path or server path in cURL, so my script can properly follow the relative redirects it comes across while scraping the other website?
For example: If I authenticate on their site and then try to access "https://www.theirserver.com/theirapp/mainForm/securepage.aspx" with my script at "https://www.myserver.com/php/myscript.php", then, under some circumstances, their website tries to redirect back to their login page, but this causes a big problem, because the redirect sends my cURL client to "https://www.myserver.com/php/mainForm/login.aspx", that is, '/mainForm/login.aspx' relative to my script on my server, instead of the correct "https://www.theirserver.com/theirapp/mainForm/login.aspx" relative to the site I am scraping on their server.
I would expect cURL's FOLLOWLOCATION option to properly follow relative redirects based on the "Location:" header of the web pages I am accessing, but it seems that it doesn't and can't. Since this seems to not work, preferably I want a way to tell cURL a base path for the server or for all relative redirects it sees, so I can just use FOLLOWLOCATION. If not, then I need to figure out some code that will do the same thing FOLLOWLOCATION does, but that can let me specify a base path to handle these relative URLs when it comes across them.
I see several similar questions about following relative paths with cURL, but none of the answers have any good suggestions for dealing with this problem, where I don't own the website's server and I don't know every single redirect that might come up. In fact, none of the answers I've seen for similar questions seem to even understand that a person might be trying to scrape an external website and would want any relative redirects they come across while scraping the site to just be relative to that site.
EDIT: Here is the code in question:
$urlLogin = "https://www.theirsite.com/theirApp/MainForm/login.aspx";
$urlSecuredPage = "https://www.theirsite.com/theirApp/ContentPages/content.aspx";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_AUTOREFERER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; yie8)");
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
// GET login page
$data=curl_exec($ch);
// Read ASP viewstate and eventvalidation fields
$viewstate = parseExtract($data,$regexViewstate, 1);
$eventval = parseExtract($data, $regexEventVal, 1);
//set POST data
$postData = '__EVENTTARGET='.$eventtarget
.'&__EVENTARGUMENT='.$eventargument
.'&__VIEWSTATE='.$viewstate
.'&__EVENTVALIDATION='.$eventval
.'&'.$nameUsername.'='.$valUsername
.'&'.$namePassword.'='.$valPassword
.'&'.$nameLoginBtn.'='.$valLoginBtn;
// POST authentication
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postData);
curl_setopt($ch, CURLOPT_URL, $urlLogin);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
$data = curl_exec($ch);
/******************************************************************
GET secure page (This is where a redirect fails... when getting
the secure page, it redirects to /mainForm/login.aspx relative to my
script, instead of /mainForm/login.aspx on their site.
*****************************************************************/
curl_setopt($ch, CURLOPT_POST, FALSE);
curl_setopt($ch, CURLOPT_URL, $urlSecuredPage);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
$data = curl_exec($ch);
echo $data; // Page Not Found
You may be running into JavaScript redirects. To find out what is there, this will give you additional info:
curl_setopt($ch, CURLOPT_FILETIME, true);
You should set fail on error:
curl_setopt($ch, CURLOPT_FAILONERROR,true);
You may also need to see all the Request and Response headers:
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);
The big thing you are missing is curl_getinfo($ch);
It has info on all the redirects and the headers.
You may want to turn off: CURLOPT_FOLLOWLOCATION
and do each request individually. You can get the next redirect location from curl_getinfo($ch, CURLINFO_REDIRECT_URL).
Or you can set CURLOPT_MAXREDIRS to the number of redirects that succeed, then do a separate cURL request for the problem redirect location:
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
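A rough sketch of that one-hop-at-a-time approach (my own illustration, untested): CURLINFO_REDIRECT_URL returns the Location target already resolved to an absolute URL, which sidesteps the relative-redirect problem; $cookieFile is the cookie jar from the question's code.
$url = $urlSecuredPage;
for ($hop = 0; $hop < 10; $hop++) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // keep the session
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
    $body = curl_exec($ch);
    $next = curl_getinfo($ch, CURLINFO_REDIRECT_URL); // empty when no redirect
    curl_close($ch);
    if (!$next) {
        break; // final page reached; $body holds its content
    }
    $url = $next;
}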
When you get the response, if there is no cURL error, get the response header:
$data = curl_exec($ch);
if (curl_errno($ch)) {
    $data .= 'Retrieve Base Page Error: ' . curl_error($ch);
    echo $data;
}
else {
    // Split the headers off the body using the reported header size
    $skip = intval(curl_getinfo($ch, CURLINFO_HEADER_SIZE));
    $responseHeader = substr($data, 0, $skip);
    $data = substr($data, $skip);
    $info = curl_getinfo($ch);
    $info = var_export($info, true);
    echo $responseHeader . $info . $data;
}
A better way to scrape a webpage is to use two PHP packages: Guzzle + DomCrawler.
I ran a lot of tests with this combination and came to the conclusion that it is the best choice.
Here, you will find an example for your implementation.
Let me know if you have any problem! ;)
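A rough sketch of that combination (my own illustration; assumes guzzlehttp/guzzle and symfony/dom-crawler are installed via Composer, and reuses the login URL from the question):
use GuzzleHttp\Client;
use Symfony\Component\DomCrawler\Crawler;

require 'vendor/autoload.php';

// Guzzle follows redirects by default and keeps cookies with 'cookies' => true
$client = new Client(['cookies' => true]);
$response = $client->request('GET', 'https://www.theirsite.com/theirApp/MainForm/login.aspx');

// Parse the ASP.NET hidden fields out of the page
$crawler = new Crawler((string) $response->getBody());
$viewstate = $crawler->filterXPath('//input[@id="__VIEWSTATE"]')->attr('value');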

curl Request to Unverified Website

I am trying to use PHP's cURL functions, and for some reason my code does not return any data.
I am making a request to a URL that has an unverified SSL certificate:
Here is my code:
<?php
$ch = curl_init("**SENSITIVE URL**");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
print_r($result);
curl_close($ch);
?>
If I put in www.google.com it does return the Google page to my site. I apologize that I can't give out the URL, but I assure you that going to the URL directly does return data.
You need to tell cURL to ignore the (bad) SSL cert. Try adding the following options:
// Do not verify the cert
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
// Ignore the "does not match target host name" error
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
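If you can obtain the server's CA certificate, a safer sketch is to keep verification on and point cURL at a CA bundle that contains it (the bundle path here is hypothetical):
// Verify the peer against a CA bundle that includes the server's CA
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2); // check the cert name matches the host
curl_setopt($ch, CURLOPT_CAINFO, '/path/to/cacert.pem'); // hypothetical path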
