file_get_contents() for short urls


file_get_contents() doesn't read data for short urls
Example:
http://wp.me/pbZy8-1WM,
http://bit.ly/d00E2C
Please help me handle this. Or is there a cURL function that can handle the links above?

In general this works fine, because the http wrapper follows redirects by default. If you find it doesn't do the right thing, you can explicitly use a stream context:
$url = "http://bit.ly/d00E2C";
$context = stream_context_create(array('http' => array('max_redirects' => 5)));
$val = file_get_contents($url, false, $context);
That should do it. There is no need to touch cURL for this.

On my machine, I cannot replicate your problem; I receive the page as intended. However, should the issue be with the redirect, this may solve your problem.
<?php
$opts = array(
'http' => array(
'follow_location' => 1,
'max_redirects' => 20
)
);
$context = stream_context_create($opts);
echo file_get_contents('http://wp.me/pbZy8-1WM', false, $context);
I imagine there may be a directive that toggles redirect following, but I have not yet found it. I will edit my answer if I find one.
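To check whether redirects are actually being followed, it can help to look at the headers the http wrapper collected. Here is a minimal sketch, assuming the request went through file_get_contents() so that $http_response_header is populated; the helper name summarize_redirects is made up for illustration:

```php
<?php
// Hypothetical helper: lists every status line and Location header found in
// the header array that the http:// wrapper fills in ($http_response_header).
function summarize_redirects(array $headers)
{
    $chain = array();
    foreach ($headers as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            // A new status line marks the start of another response in the chain.
            $chain[] = 'status ' . $m[1];
        } elseif (stripos($line, 'Location:') === 0) {
            // 9 is the length of "Location:".
            $chain[] = 'redirect to ' . trim(substr($line, 9));
        }
    }
    return $chain;
}

// Usage (performs a network request, so it is left commented out):
// $context = stream_context_create(array('http' => array('follow_location' => 1, 'max_redirects' => 5)));
// $body = file_get_contents('http://bit.ly/d00E2C', false, $context);
// print_r(summarize_redirects($http_response_header));
```

If the chain ends in a 3xx status with no further entry, the redirect was not followed and max_redirects or follow_location is the place to look.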

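Alternatively, you can use cURL with CURLOPT_FOLLOWLOCATION set to true. Set CURLOPT_RETURNTRANSFER as well, otherwise curl_exec() prints the page itself and only returns true:
$ch = curl_init("http://bit.ly/d00E2C");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
curl_close($ch);
echo $result;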

Related

Unable to get Healthline search results with PHP

I am trying to run a script that will search Healthline with a query string and determine if there are any search results, but I can't get the contents with the query string posting to the page. To search for something on their site, you go to https://www.healthline.com/search?q1=search+string.
Here is what I tried:
$healthline_url = 'https://www.healthline.com/search';
$search_string = 'ashwaganda';
$postdata = http_build_query(
array(
'q1' => $search_string
)
);
$opts = array('http' =>
array(
'method' => 'POST',
'header' => 'Content-type: application/x-www-form-urlencoded',
'content' => $postdata
)
);
$stream = stream_context_create($opts);
$theHtmlToParse = file_get_contents($healthline_url, false, $stream);
print_r($theHtmlToParse);
I also tried to just add the query string to the url and skip the stream, amongst other variations, but I'm running out of ideas. This also didn't work:
$healthline_url = 'https://www.healthline.com/search';
$search_string = 'ashwaganda';
$opts = array(
'http'=>array(
'method'=>"GET",
'header'=>"Content-Type: text/xml; charset=utf-8"
)
);
$stream = stream_context_create($opts);
$theHtmlToParse = file_get_contents($healthline_url.'&q1='.$search_string, false, $stream);
print_r($theHtmlToParse);
Any suggestions?
EDIT: I changed the url in case someone wants to look at the search page. Also fixed the query string. Still doesn't work.
In response to Ken Lee, I did try the following cURL script that also just returns the page without search results:
$healthline_url = 'https://www.healthline.com/search?q1=ashwaganda';
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $healthline_url);
$data = curl_exec($ch);
curl_close($ch);
print_r($data);

Healthline does not load the search results directly. Its search index is stored in Algolia, and the page makes extra JavaScript calls to retrieve the results. Therefore you cannot see the search results with file_get_contents() or plain cURL.
To see the search results, you need a browser simulator that runs the page in a JavaScript-capable browser.
For PHP developers, you may try php-webdriver to control browsers through WebDriver (e.g. Selenium, Chrome + chromedriver, Firefox + geckodriver).
Update: I didn't know that the target site was Healthline. I updated the answer once I found out.

How to follow all redirects with CURL including META-refresh

I'm using an API that returns a set of URLs. All of the URLs redirect, but how many redirects there are and where the URLs end up is unknown.
So what I'm trying to do is trace the path and find the last URL.
I basically want to do the same as http://wheregoes.com/retracer.php, but I only need to know the last URL.
I've found a way to do it with cURL, but the trace stops when it reaches a META refresh.
I've seen the thread "PHP: Can CURL follow meta redirects", but it doesn't help me much.
This is my current code:
function trace_url($url){
$ch = curl_init($url);
curl_setopt_array($ch, array(
CURLOPT_FOLLOWLOCATION => TRUE,
CURLOPT_RETURNTRANSFER => TRUE,
CURLOPT_SSL_VERIFYHOST => FALSE,
CURLOPT_SSL_VERIFYPEER => FALSE,
));
curl_exec($ch);
$url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
return $url;
}
$lasturl = trace_url('http://myurl.org');
echo $lasturl;

There is a big difference between header redirects, which fall under the 3xx class of HTTP status codes, and a META refresh: the first is issued by the server in the response headers, while the second is an HTML tag that the client has to interpret.
cURL (libcurl) works at the HTTP level, so it can handle the first type, header redirects, but it does not parse HTML.
You will therefore need to handle META refreshes manually:
1) scrape the web page contents.
2) extract the link from the meta tag.
3) fetch this new link if you want.
Starting from your example, with the function changed to also return the page body (the meta tag lives in the HTML, not in the effective URL):
function trace_url($url){
$ch = curl_init($url);
curl_setopt_array($ch, array(
CURLOPT_FOLLOWLOCATION => TRUE,
CURLOPT_RETURNTRANSFER => TRUE,
CURLOPT_SSL_VERIFYHOST => FALSE,
CURLOPT_SSL_VERIFYPEER => FALSE,
));
$body = curl_exec($ch); // keep the HTML so the meta tag can be searched
$url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
return array($url, $body);
}
list($lasturl, $response) = trace_url('http://myurl.org');
// quick pattern for explanation purposes only, you may improve it as you like
if (preg_match('#<meta[^>]*content="[0-9]*;\s*url=([^"]+)"#i', $response, $links)) {
$newLink = $links[1];
}
Alternatively, as in the thread mentioned in your question, load the page with simplexml_load_string() and query the tag with XPath (this only works when the page is well-formed XML):
$xml = simplexml_load_string($response);
$link = $xml->xpath("//meta[@http-equiv='refresh']");
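Since real pages are rarely well-formed XML, a more forgiving variant of step 2 is to parse the HTML with DOMDocument. This is only a sketch: the helper name extract_meta_refresh is hypothetical, it assumes the DOM extension is enabled, and it returns the target exactly as written in the tag without resolving relative URLs.

```php
<?php
// Hypothetical helper: returns the target of a <meta http-equiv="refresh"> tag,
// or null when the page has none.
function extract_meta_refresh($html)
{
    if (!is_string($html) || trim($html) === '') {
        return null; // nothing to parse (e.g. curl_exec() returned false)
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings about sloppy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('http-equiv')) !== 'refresh') {
            continue;
        }
        // The content attribute looks like "5; url=http://example.com/next".
        if (preg_match('/url\s*=\s*[\'"]?([^\'"\s]+)/i', $meta->getAttribute('content'), $m)) {
            return $m[1];
        }
    }
    return null;
}
```

To trace a whole chain, fetch the body with cURL, pass it to this function, and repeat while it returns a target, with a hop limit so that two pages refreshing to each other cannot loop forever.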

php script to read and save web page contents not working on some sites

I have a very simple script that works perfectly on most sites, but not on the main site I want it to work with. The code below accesses a sample site perfectly. However, when I use it on the site I actually want to access, http://www.livescore.com, I get an error.
This works.
<?php
$url = "http://www.cambodia.me.uk";
$page = file_get_contents($url);
$outfile = "contents.html";
file_put_contents($outfile, $page);
?>
This does not work.....
<?php
$url = "http://www.livescore.com";
$page = file_get_contents($url);
$outfile = "contents.html";
file_put_contents($outfile, $page);
?>
and gives the following error
Warning: file_get_contents(http://www.livescore.com)
[function.file-get-contents]: failed to open stream: HTTP request
failed! HTTP/1.0 404 Not Found in C:\Program Files
(x86)\EasyPHP-5.3.8.1\www\Livescore\attempt-1-read-page.php on line 3
Thanks for any assistance

In the common case you can just tell file_get_contents to follow redirects:
$context = stream_context_create(
array(
'http' => array(
'follow_location' => true
)
)
);
$html = file_get_contents('http://www.example.com/', false, $context);
This particular site inspects the User-Agent HTTP header and fails if it is missing. Try adding a User-Agent header:
<?php
$context = stream_context_create(
array(
'http' => array(
'header' => "User-agent: chrome",
'ignore_errors' => true,
'follow_location' => true
)
)
);
$html = file_get_contents('http://www.livescore.com/', false, $context);
echo substr($html, 0, 200)."\n";
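With ignore_errors enabled, file_get_contents() returns the body even for a 404, so it is worth checking the status code before saving the page. A rough sketch follows; the helper name final_status_code and the User-Agent string in the usage comment are assumptions for illustration, not anything the site requires:

```php
<?php
// Hypothetical helper: returns the status code of the final response, i.e. the
// last "HTTP/..." line in the header array, or null if there is none.
function final_status_code(array $headers)
{
    $status = null;
    foreach ($headers as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1]; // later status lines overwrite earlier ones
        }
    }
    return $status;
}

// Usage (network request, so it is left commented out):
// $context = stream_context_create(array('http' => array(
//     'header'        => "User-Agent: Mozilla/5.0 (compatible; example-fetcher)",
//     'ignore_errors' => true,
// )));
// $html = file_get_contents('http://www.livescore.com/', false, $context);
// if (final_status_code($http_response_header) === 200) {
//     file_put_contents('contents.html', $html);
// }
```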

Most likely www.livescore.com is doing a hidden redirect which file_get_contents is too basic to catch.
Do you have lynx installed on your server?
$page= shell_exec("lynx -source 'http://www.livescore.com'");
lynx is a full browser and can 'bypass' certain redirects.

How can I interrupt a PHP function that's taking too long?

I would like to stop a simplexml_load_file call if it takes too long to load and/or the URL isn't reachable (occasionally the site with the XML goes down), since I don't want my site to lag completely when theirs isn't up.
I tried to experiment a bit myself, but haven't managed to make anything work.
Thank you so much in advance for any help!

You can't have an arbitrary function quit after a specified time. What you can do instead is to try to load the contents of the URL first - and if it succeeds, continue processing the rest of the script.
There are several ways to achieve this. The easiest is to use file_get_contents() with a stream context set:
$context = stream_context_create(array('http' => array('timeout' => 5)));
$xmlStr = file_get_contents($url, FALSE, $context);
$xmlObj = simplexml_load_string($xmlStr);
Or you could use a stream context with simplexml_load_file() via the libxml_set_streams_context() function:
$context = stream_context_create(array('http' => array('timeout' => 5)));
libxml_set_streams_context($context);
$xmlObj = simplexml_load_file($url);
You could wrap it as a nice little function:
function simplexml_load_file_from_url($url, $timeout = 5)
{
$context = stream_context_create(
array('http' => array('timeout' => (int) $timeout))
);
$data = file_get_contents($url, FALSE, $context);
if(!$data) {
trigger_error("Couldn't get data from: '$url'", E_USER_NOTICE);
return FALSE;
}
return simplexml_load_string($data);
}
Alternatively, you can consider using cURL (the extension is commonly available, though it may need to be enabled). The benefit of using cURL is that you get fine-grained control over the request and over how the response is handled.
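For what it's worth, the cURL route could look roughly like this. It is a sketch under assumptions: both function names are made up, the timeout values are arbitrary, and the cURL and SimpleXML extensions are assumed to be loaded.

```php
<?php
// Hypothetical helper: fetches $url with cURL and returns the body,
// or false on any failure (timeout, DNS error, refused connection, ...).
function fetch_with_timeout($url, $connectTimeout = 2, $totalTimeout = 5)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => $connectTimeout, // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $totalTimeout,   // seconds allowed for the whole transfer
    ));
    return curl_exec($ch); // the handle is released when $ch goes out of scope
}

// Hypothetical helper: parses XML without emitting warnings and returns null
// for false, empty or invalid input.
function parse_xml_or_null($data)
{
    if (!is_string($data) || $data === '') {
        return null;
    }
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($data);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $xml === false ? null : $xml;
}

// Usage (network request, so it is left commented out):
// $xml = parse_xml_or_null(fetch_with_timeout('http://yoururl'));
// if ($xml === null) { /* fall back to cached data or skip the feed */ }
```

Separating the connect timeout from the total timeout means a dead host fails fast while a slow but working one still gets a few seconds to respond.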

You should be using a stream context with a timeout option coupled with file_get_contents
$context = stream_context_create(array('http' => array('timeout' => 5))); //<---- Setting timeout to 5 seconds...
and now map that to your file_get_contents
$xml_load = file_get_contents('http://yoururl', FALSE, $context);
$xml = simplexml_load_string($xml_load);

Make cURL output STDERR to file (or string)

We're trying to debug some cURL errors on the server, and I would like to see the STDERR log. Currently, all we can see for our error is "error code: 7" and that we can't connect to target server. We have contacted the host and made special rule to open the port we need and we're even ignoring the certificate for the time being.
Still, we can't connect. I need to debug this, but I can't see any pertinent information on my end.
The lines mentioning "VERBOSE" and "STDERR" are the most important, I think. Nothing is written to $curl_log. What am I doing wrong? Following the manual's logic, this should be correct...
PHP in use:
<?php
$curl = curl_init();
$curl_log = fopen("curl.txt", 'w');
$url = "http://www.google.com";
curl_setopt_array($curl, array(
CURLOPT_URL => $url, // Our destination URL
CURLOPT_VERBOSE => 1, // Logs verbose output to STDERR
CURLOPT_STDERR => $curl_log, // Output STDERR log to file
CURLOPT_SSL_VERIFYPEER => 0, // Do not verify certificate
CURLOPT_FAILONERROR => 0, // true to fail silently for http requests > 400
CURLOPT_RETURNTRANSFER => 1 // Return data received from server
));
$output = fread($curl_log, 2048);
echo $output; // This returns nothing!
fclose($curl_log);
$response = curl_exec($curl);
//...restofscript...
?>
From PHP manual: http://php.net/manual/en/function.curl-setopt.php
CURLOPT_VERBOSE TRUE to output verbose information. Writes output to STDERR
CURLOPT_STDERR An alternative location to output errors to instead of STDERR.
It is not a permission issue either, I have set file and script permissions to 777 on server side and my local client is windows and has never cared about permission settings (it's only for dev anyway).

You are making a couple of mistakes in your example:
1) you have to call curl_exec() prior to reading from the "verbose log", because curl_setopt() doesn't perform any action, so nothing can be logged prior to the curl_exec().
2) you are opening $curl_log = fopen("curl.txt", 'w'); only for write, so nothing could be read, even after you write to the file and rewind the internal file pointer.
So the correct shortened code should look like:
<?php
$curl = curl_init();
$curl_log = fopen("curl.txt", 'w+'); // open file for READ and write ('rw' is not a valid fopen mode)
$url = "http://www.google.com";
curl_setopt_array($curl, array(
CURLOPT_URL => $url,
CURLOPT_VERBOSE => 1,
CURLOPT_STDERR => $curl_log,
CURLOPT_RETURNTRANSFER => 1
));
$response = curl_exec($curl);
rewind($curl_log);
$output= fread($curl_log, 2048);
echo "<pre>". print_r($output, 1). "</pre>";
fclose($curl_log);
// ...
?>
NOTE: the verbose log can be longer than 2048 bytes, so you could fclose() $curl_log after curl_exec() and then read the whole file with, for example, file_get_contents().
In that case, point 2) should not be considered a mistake :-)
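If the log is only needed as a string, a file on disk can be avoided altogether by pointing CURLOPT_STDERR at a php://temp stream. The following is a sketch rather than a drop-in fix: the function name curl_exec_with_log and the timeout are assumptions, and it needs the cURL extension.

```php
<?php
// Hypothetical helper: runs a GET request and returns array(body, verbose log).
function curl_exec_with_log($url, $timeout = 5)
{
    $log = fopen('php://temp', 'w+'); // temporary stream, readable and writable
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => $timeout,
        CURLOPT_VERBOSE        => true,
        CURLOPT_STDERR         => $log,
    ));

    $body = curl_exec($ch);
    if ($body === false) {
        // Append the error code and message, e.g. error 7 for "couldn't connect".
        fwrite($log, 'cURL error ' . curl_errno($ch) . ': ' . curl_error($ch) . PHP_EOL);
    }
    unset($ch); // release the handle before reading, so nothing more is written

    rewind($log); // the pointer sits at the end after writing
    $verbose = stream_get_contents($log); // reads everything, not just 2048 bytes
    fclose($log);

    return array($body, $verbose);
}

// Usage (network request, so it is left commented out):
// list($body, $verbose) = curl_exec_with_log('http://www.google.com');
// echo '<pre>' . htmlspecialchars($verbose) . '</pre>';
```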

A bit late to the party, but this page still pops up high in Google, so let's go.
It seems that CURLOPT_VERBOSE doesn't log anything if CURLINFO_HEADER_OUT is also set to TRUE.
This is a known bug in PHP (#65348), and for various reasons they decided not to fix it.

Putting all the above answers together, I use this function to make a cURL POST request with optional logging to a file:
function CURLPostRequest($url, array $post = NULL, array $options = array(), $log_file = NULL){
$defaults = array(
CURLOPT_POST => 1,
CURLOPT_HEADER => 0,
CURLOPT_URL => $url,
CURLOPT_FRESH_CONNECT => 1,
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_FORBID_REUSE => 1,
CURLOPT_TIMEOUT => 4,
CURLOPT_POSTFIELDS => http_build_query($post)
);
if (is_resource($log_file)){
$defaults[CURLOPT_VERBOSE]=1;
$defaults[CURLOPT_STDERR]=$log_file;
$defaults[CURLINFO_HEADER_OUT]=1;
}
$ch = curl_init();
curl_setopt_array($ch, ($options + $defaults));
if( ! $result = curl_exec($ch)){
throw new Exception(curl_error($ch));
}
if (is_resource($log_file)){
$info = curl_getinfo($ch);
if (isset($info['request_header'])){
fwrite($log_file, PHP_EOL.PHP_EOL.'* POST Content'.PHP_EOL.PHP_EOL);
fwrite($log_file, print_r($info['request_header'],true));
fwrite($log_file, http_build_query($post));
}
fwrite($log_file, PHP_EOL.PHP_EOL.'* Response Content'.PHP_EOL.PHP_EOL);
fwrite($log_file, $result.PHP_EOL.PHP_EOL);
}
curl_close($ch);
return $result;
}
Hope this helps someone.

I needed to close the file before being able to read it, this worked for me:
$filename = 'curl.txt';
$curl_log = fopen($filename, 'w'); // open file for write (rw, a, etc didn't help)
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_STDERR, $curl_log);
$result = curl_exec($ch);
fclose($curl_log);
$curl_log = fopen($filename, 'r'); // open file for read
$output= fread($curl_log, filesize($filename));
echo $output;
(PHP 5.6.0, Apache/2.2.15)

From php manual for function curl_setopt:
CURLOPT_FILE The file that the transfer should be written to. The default is STDOUT (the browser window).

You should put
$output = fread($curl_log, 2048);
echo $output; // This returns nothing!
fclose($curl_log);
after $response = curl_exec($curl); otherwise the file is read and closed before cURL has executed and written anything to it.
