How to use fsockopen to load a URL from an XML sitemap - PHP

I am attempting to load each URL in a sitemap.xml file in an effort to pre-cache them and speed up the user experience.
I have the following code, which grabs the URLs from the sitemap:
$ch = curl_init();
/**
 * Set the URL of the page or file to download.
 */
curl_setopt($ch, CURLOPT_URL, 'http://onlineservices.letterpart.com/sitemap.xml;jsessionid=1j1agloz5ke7l?id=1j1agloz5ke7l');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($ch);
curl_close($ch);

$xml = new SimpleXMLElement($data);
foreach ($xml->url as $url_list) {
    $url = $url_list->loc;
    echo $url . "<br>";
}
I am now trying to use fsockopen to load each URL in turn, where $url is in this format:
http://onlineservices.letterpart.com:80/content/en/FAMILY-201103311115/Family_FLJONLINE_FLJ_2009_07_4
foreach ($xml->url as $url_list) {
    $url = $url_list->loc;
    $fp = fsockopen($url, 80);
    if ($fp) {
        fwrite($fp, "GET / HTTP/1.1\r\nHOST: $url\r\n\r\n");
        while (!feof($fp)) {
            print fread($fp, 256);
        }
        fclose($fp);
    } else {
        print "Fatal error\n";
    }
}
But this is giving me this error for each url:
[12-May-2011 13:34:09] PHP Warning: fsockopen() [function.fsockopen]: unable to connect to http://onlineservices.letterpart.com:80/content/en/FAMILY-201103311115/Family_FLJONLINE_FLJ_2009_07_4:-1 (Unable to find the socket transport "http" - did you forget to enable it when you configured PHP?) in /home/digital1/public_html/dev/sitemap.php on line 32
I have read that I need to: "just the hostname, not the URL in the fsockopen call. You'll need to provide the uri, minus the host/port in the actual HTTP headers"
so I tried this:
$fp = fsockopen("http://onlineservices.letterpart.com", 80);
if ($fp) {
    fwrite($fp, "GET / HTTP/1.1\r\nHOST: content/en/FAMILY-201103311115/Family_FLJONLINE_FLJ_2009_07_4\r\n\r\n");
    while (!feof($fp)) {
        print fread($fp, 256);
    }
    fclose($fp);
} else {
    print "Fatal error\n";
}
But I still get the same error.
EDIT:
If I change the fsockopen call to:
$fp = fsockopen ("onlineservices.letterpart.com",80);
then I get a slightly different and better, but still wrong, response. It seems to be ignoring the onlineservices.letterpart.com section and trying http:///content/ BUT... it has appended /web/ui.xql?action=html&resource=login.html to the end of the URL, which is our login page, so it must be seeing our server...
HTTP/1.1 302 Moved Temporarily
Date: Thu, 12 May 2011 14:40:02 GMT
Server: Jetty/5.1.12 (Windows 2003/5.2 x86 java/1.6.0_07
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Set-Cookie: JSESSIONID=nh62zih3q8mf;Path=/
Location: http:///content/en/FAMILY-201103311115/Family_FLJONLINE_FLJ_2009_07_4/web/ui.xql?action=html&resource=login.html
Content-Length: 0
Thanks.

fsockopen is not intended to be used for HTTP requests;
cURL is a better choice (and much more powerful).
There is also file_get_contents, which makes it quick:
foreach ($xml->url as $url_list) {
    $url = $url_list->loc;
    file_get_contents($url);
}
Useful for application cache warmup!
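If cURL is available, a similar warmup loop can reuse a single handle for all of the sitemap URLs. This is only a minimal sketch, assuming you just want each page fetched (and discarded) so the server caches it:

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // fetch the body without printing it
foreach ($xml->url as $url_list) {
    curl_setopt($ch, CURLOPT_URL, (string) $url_list->loc); // cast the SimpleXML node to a string
    curl_exec($ch); // response is discarded; we only need the server to render/cache the page
}
curl_close($ch);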


Remote file access from PHP server side gives 301 instead of file, what to do?

EDIT: the answer is in the comments to the marked answer.
I am currently working on updating a few key components of a mobile web site. The site uses data from a different server to display student schedules. Recently this other site (over which I have zero control) was subject to a major overhaul, and naturally I now have to update the mobile web site.
What I am trying to do is access an iCal file and parse it. Since the site I am working on runs in an environment that has neither the cURL library nor fopen wrappers properly set up, I have resorted to the method described here (number 4, using a socket directly).
My current issue is that instead of getting the iCal file I get a 301 error. However, if I attempt to access the same file (via the same URL) in a web browser, it works just fine.
EDIT:
I added a bit of logging and here is what came out of it:
-------------
Querying url:
https://someUrl/schema/ri654Q055ZQZ60QbQ0ygnQ70cWny067Z0109Zx4h0Z7o525Y407Q.ics
Response:
HTTP/1.1 301 Moved Permanently
Server: nginx/1.2.8
Date: Sun, 11 Aug 2013 14:08:36 GMT
Content-Type: text/html
Content-Length: 184
Connection: close
Location:
https://someUrl/schema/ri654Q055ZQZ60QbQ0ygnQ70cWny067Z0109Zx4h0Z7o525Y407Q.ics
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.2.8</center>
</body>
</html>
Redirect url found: https://someUrl/schema/ri654Q055ZQZ60QbQ0ygnQ70cWny067Z0109Zx4h0Z7o525Y407Q.ics
The new location I am getting is identical to the original one.
This is the code used:
function getRemoteFile($url)
{
    error_log("------------- \r\nQuerying url: " . $url, 3, "error_log.log");

    // get the host name and url path
    $parsedUrl = parse_url($url);
    $host = $parsedUrl['host'];
    if (isset($parsedUrl['path'])) {
        $path = $parsedUrl['path'];
    } else {
        // the url is pointing to the host like http://www.mysite.com
        $path = '/';
    }
    if (isset($parsedUrl['query'])) {
        $path .= '?' . $parsedUrl['query'];
    }
    if (isset($parsedUrl['port'])) {
        $port = $parsedUrl['port'];
    } else {
        // most sites use port 80
        // but we want port 443 because we are using https
        error_log("Using port 443\r\n" . $url, 3, "error_log.log");
        $port = 443;
    }

    $timeout = 10;
    $response = '';

    // connect to the remote server
    $fp = fsockopen($host, $port, $errno, $errstr, $timeout);
    if (!$fp) {
        echo "Cannot retrieve $url";
    } else {
        $payload = "GET $path HTTP/1.0\r\n" .
            "Host: $host\r\n" .
            "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.3) Gecko/20060426 Firefox/1.5.0.3\r\n" .
            "Accept: */*\r\n" .
            "Accept-Language: sv-SE,sv;q=0.8,en-us,en;q=0.3\r\n" .
            "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7\r\n" .
            "Referer: https://$host\r\n\r\n";

        error_log("\nPAYLOAD: " . $payload, 3, "error_log.log");

        // send the necessary headers to get the file
        fputs($fp, $payload);

        // retrieve the response from the remote server
        while ($line = stream_socket_recvfrom($fp, 4096)) {
            $response .= $line;
        }
        fclose($fp);

        // naively find location redirect
        $location_pos = strpos($response, "Location:");
        if ($location_pos) {
            $location_pos += 10;
            $new_url = substr($response, $location_pos, strpos($response, "\r\n\r\n") - $location_pos);
            error_log("\nRedirect url found: " . $new_url, 3, "error_log.log");
        } else {
            // log the response
            error_log($response, 3, "error_log.log");
        }

        // strip the headers
        $pos = strpos($response, "\r\n\r\n");
        $response = substr($response, $pos + 4);
    }

    // return the file content
    return $response;
}
HTTP Response Code 301 is a permanent redirect, not an error.
Your code will have to follow that redirect in order to access the resource.
For example, http://google.com/ returns a 301 in order to redirect users to http://www.google.com/ instead.
$ curl -I http://google.com/
HTTP/1.1 301 Moved Permanently
Location: http://www.google.com/
Content-Type: text/html; charset=UTF-8
Date: Sun, 11 Aug 2013 01:25:34 GMT
Expires: Tue, 10 Sep 2013 01:25:34 GMT
Cache-Control: public, max-age=2592000
Server: gws
Content-Length: 219
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Alternate-Protocol: 80:quic
You can see the 301 response on line 2, followed by the Location header which tells the web browser where to go instead.
What likely happened is that during this major overhaul they moved the resource to another location. In order not to break any users' bookmarks or calendars, they used a 301 redirect so that clients automatically fetch the resource from the new location.
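As a rough illustration of what "following the redirect" means at the socket level, here is a minimal sketch. fetchRaw() is a hypothetical helper standing in for the socket code above, assumed to return the full response (status line, headers and body) for a URL:

// Minimal sketch: follow up to $maxRedirects Location headers before giving up.
// fetchRaw($url) is a hypothetical wrapper around the socket request shown above.
function getWithRedirects($url, $maxRedirects = 5)
{
    for ($i = 0; $i <= $maxRedirects; $i++) {
        $response = fetchRaw($url);

        // Split the headers from the body.
        $parts = explode("\r\n\r\n", $response, 2);
        $headers = $parts[0];
        $body = isset($parts[1]) ? $parts[1] : '';

        // If the status line is not a 3xx redirect, we are done.
        if (!preg_match('#^HTTP/\d\.\d\s+3\d\d#', $headers)) {
            return $body;
        }

        // Otherwise re-request whatever the Location header points at.
        if (preg_match('/^Location:\s*(\S+)/mi', $headers, $m)) {
            $url = $m[1];
        } else {
            return $body; // 3xx without a Location header: nothing to follow
        }
    }
    return false; // too many redirects
}

Note that in the situation above the Location header points back at the same URL, so a redirect limit like the one in this loop matters; the real fix (per the comments mentioned in the question) lies elsewhere.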

uploading remote url to server

I am using the following code to upload remote files to my server. It works great where a direct download link is given, but recently I have noticed that a few websites give mysql links as the download link, and when you click on that link the file starts downloading to my PC. Even in the HTML source of that page it does not show the direct link.
Here is my code:
<form method="post">
    <input name="url" size="50" />
    <input name="submit" type="submit" />
</form>
<?php
if (!isset($_POST['submit'])) die();

$destination_folder = 'mydownloads/';
$url = $_POST['url'];
$newfname = $destination_folder . basename($url);

$file = fopen($url, "rb");
if ($file) {
    $newf = fopen($newfname, "wb");
    if ($newf) {
        while (!feof($file)) {
            fwrite($newf, fread($file, 1024 * 8), 1024 * 8);
        }
    }
}
if ($file) {
    fclose($file);
}
if ($newf) {
    fclose($newf);
}
?>
It works great for all links where the download link is direct. For example, if I give the link
http://priceinindia.org/muzicpc/48.php?id=415508
it will upload the music file, but the file name will be 48.php?id=415508, while the actual mp3 file is stored at
http://lq.mzc.in/data48-2/37202/Appy_Budday_(Videshi)-Santokh_Singh(www.Mzc.in).mp3
If I could get the actual destination URL, the name would be Appy_Budday_(Videshi)-Santokh_Singh(www.Mzc.in).mp3.
So I want to get the actual download URL.
You should use the cURL library for this: http://php.net/manual/en/book.curl.php
An example of how to use cURL is provided in the manual (at that link). Before you close the connection, call curl_getinfo (http://php.net/manual/en/function.curl-getinfo.php) and specifically get CURLINFO_EFFECTIVE_URL, which is what you want.
<?php
// Create a curl handle
$ch = curl_init('http://www.yahoo.com/');

// Execute
$fileData = curl_exec($ch);

// Check if any error occurred
if (!curl_errno($ch)) {
    $effectiveURL = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
}

// Close handle
curl_close($ch);
?>
(You can also use cURL to write directly to a file; use the CURLOPT_FILE option. Also in the manual.)
The problem is that the original URL is redirecting. You want to catch the URL it is being redirected to; try reading the headers, and then use basename($redirect_url) as your file name.
+1 for Robbie's suggestion to use cURL.
If you run (from command line)
[username#localhost ~]$ curl http://priceinindia.org/muzicpc/48.php?id=415508 -I
HTTP/1.1 302 Moved Temporarily
Server: nginx/1.0.10
Date: Wed, 19 Sep 2012 07:31:18 GMT
Content-Type: text/html
Connection: keep-alive
X-Powered-By: PHP/5.3.10
Location: http://lq.mzc.in/data48-2/37202/Appy_Budday_(Videshi)-Santokh_Singh(www.Mzc.in).mp3
You can see that the Location header here is the new URL.
In PHP, try something like:
$ch = curl_init('http://priceinindia.org/muzicpc/48.php?id=415508');
curl_setopt($ch, CURLOPT_HEADER, 1); // return header
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); // dont redirect
$c = curl_exec($ch); //execute
echo curl_getinfo($ch, CURLINFO_HTTP_CODE); // will echo http code. 302 for temp move
echo curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // url being redirected to
You want to find the Location part of the header; I'm not sure of the exact setting, though.
EDIT 3... or 4?
Yeah right, I see what's happening. You actually want to follow the Location URL and then echo the effective URL without downloading the file. Try:
$ch = curl_init('http://priceinindia.org/muzicpc/48.php?id=415508');
curl_setopt($ch, CURLOPT_NOBODY, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$c = curl_exec($ch); //execute
echo curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // url being redirected to
When I run this my output is
[username#localhost ~]$ php test.php
http://lq.mzc.in/data48-2/37202/Appy_Budday_(Videshi)-Santokh_Singh(www.Mzc.in).mp3
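Putting the two pieces together, a rough sketch of the original upload script using the effective URL for the file name might look like this. The mydownloads/ folder and the url form field are taken from the question; nothing here is tested against those exact sites, so treat it as an outline rather than a drop-in fix:

// Step 1: follow redirects with a body-less request to learn the effective URL.
$url = $_POST['url'];
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);
$effectiveUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);

// Step 2: download the file, naming it after the redirected URL's path.
$destination = 'mydownloads/' . basename(parse_url($effectiveUrl, PHP_URL_PATH));
$fh = fopen($destination, 'wb');
$ch = curl_init($effectiveUrl);
curl_setopt($ch, CURLOPT_FILE, $fh); // stream the response straight into the file
curl_exec($ch);
curl_close($ch);
fclose($fh);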

How do I make a POST request over HTTPS in PHP?

I am trying to work with the YouTube API and its ClientLogin, and that means I need to make a POST request to their servers.
The URL I need to make the request to is https://www.google.com/accounts/ClientLogin. The variables I need to send are Email, Passwd, source and service. So far, so good.
I found this neat function to make POST calls (see below), but it does not use HTTPS, which I think I must use. It all works, but I think my POST request is being forwarded to HTTPS and therefore it does not give me the proper callback. When I try to var_dump the returned data, the web page reloads and I end up at https://www.google.com/accounts/ClientLogin, where I get the proper data. But of course I need this data as an array or string.
So how do I make a POST request using HTTPS?
See my code (which I found at Jonas' Snippet Library) below:
function post_request($url, $data, $referer = '') {
    $data = http_build_query($data);
    $url = parse_url($url);
    $host = $url['host'];
    $path = $url['path'];

    $fp = fsockopen($host, 80, $errno, $errstr, 30);
    if ($fp) {
        fputs($fp, "POST $path HTTP/1.1\r\n");
        fputs($fp, "Host: $host\r\n");
        if ($referer != '')
            fputs($fp, "Referer: $referer\r\n");
        fputs($fp, "Content-type: application/x-www-form-urlencoded\r\n");
        fputs($fp, "Content-length: " . strlen($data) . "\r\n");
        fputs($fp, "Connection: close\r\n\r\n");
        fputs($fp, $data);

        $result = '';
        while (!feof($fp)) {
            $result .= fgets($fp, 128);
        }
    } else {
        return array(
            'status' => 'err',
            'error'  => "$errstr ($errno)"
        );
    }
    fclose($fp);

    $result = explode("\r\n\r\n", $result, 2);
    $header = isset($result[0]) ? $result[0] : '';
    $content = isset($result[1]) ? $result[1] : '';

    return array(
        'status'  => 'ok',
        'header'  => $header,
        'content' => $content
    );
}
These are the response headers:
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Date: Tue, 03 May 2011 12:15:20 GMT
Expires: Tue, 03 May 2011 12:15:20 GMT
Cache-Control: private, max-age=0
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Content-Length: 728
Server: GSE
Connection: close
The content I get back is some kind of auto-submitted form, which I think is because I use HTTP instead of HTTPS:
function autoSubmit() {
document.forms["hiddenpost"].submit();
}
Processing...
So, how do I do a HTTPS POST request?
As octopusgrabbus kindly pointed out, I need to use port 443 instead of 80. So I changed this, but now I get nothing back.
var_dump from function return:
array(3) {
  ["status"]=>
  string(2) "ok"
  ["header"]=>
  string(0) ""
  ["content"]=>
  string(0) ""
}
I get no header and no content back. What is wrong?
I think you cannot talk HTTPS directly over a plain socket, as HTTPS is HTTP encrypted with the public certificate of the server you are connecting to. Maybe you could use some of the SSL functions in PHP, but this will take you some time and, frankly, there are easier options.
Just take a look at cURL (client URL), which has support for GET and POST requests and can also connect to HTTPS servers.
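For illustration, a minimal cURL sketch of the same POST over HTTPS might look like the following. The field values are placeholders, and the ClientLogin endpoint itself has long since been retired by Google:

// Minimal sketch: POST form fields to an HTTPS URL with cURL.
$ch = curl_init('https://www.google.com/accounts/ClientLogin');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'Email'   => 'user@example.com',  // placeholder credentials
    'Passwd'  => 'secret',
    'source'  => 'myCompany-myApp-1',
    'service' => 'youtube',
)));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$response = curl_exec($ch);
if ($response === false) {
    echo 'cURL error: ' . curl_error($ch);
}
curl_close($ch);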
You are opening your socket at port 80. The SSL port is 443.
If this is SSL, there is an official computer name tied to the secure cert that's present on that web server. You might need to connect using that official name.
When you open the socket, changing the port to 443 and prepending ssl:// to the host should work. (I just had this issue with PayPal and some third-party code.) This assumes you don't already have a protocol in your host.
So
$fp = fsockopen('ssl://' . $host, 443, $errno, $errstr, 30);
As Carlos pointed out, cURL is good for this sort of thing, but there's no need to completely change what you're using in this case, particularly when it's a single-line change.

PHP - Downloading very large files with fsockopen(), fgets() and feof()

I have a simple download function in a class that might be dealing with files of many hundreds of megabytes at a time from an Amazon Web Services bucket. The whole file cannot be loaded into memory at once, so it must be streamed directly to a file pointer. This is my understanding as this is the first time I've dealt with this issue and I'm picking things up as I go along.
I've ended up with this, based on a 4 KB file buffer which simple testing showed was a good size:
$fs = fsockopen($host, 80, $errno, $errstr, 30);

if (!$fs) {
    $this->writeDebugInfo("FAILED ", $errstr . '(' . $errno . ')');
} else {
    $out = "GET $file HTTP/1.1\r\n";
    $out .= "Host: $host\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fs, $out);

    $fm = fopen($temp_file_name, "w");
    stream_set_timeout($fs, 30);

    while (!feof($fs) && ($debug = fgets($fs)) != "\r\n"); // ignore headers

    while (!feof($fs)) {
        $contents = fgets($fs, 4096);
        fwrite($fm, $contents);

        $info = stream_get_meta_data($fs);
        if ($info['timed_out']) {
            break;
        }
    }

    fclose($fm);
    fclose($fs);

    if ($info['timed_out']) {
        // Delete temp file if fails
        unlink($temp_file_name);
        $this->writeDebugInfo("FAILED - Connection timed out: ", $temp_file_name);
    } else {
        // Move temp file if succeeds
        $media_file_name = str_replace('temp/', 'media/', $temp_file_name);
        rename($temp_file_name, $media_file_name);
        $this->writeDebugInfo("SUCCESS: ", $media_file_name);
    }
}
In testing it's fine. However I have got into a conversation with someone who is saying that I am not understanding how fgets() and feof() work together, and he's mentioning chunked encoding as a more efficient method.
Is the code generally OK, or am I missing something vital here? What is the benefit that chunked encoding will give me?
Your solution seems fine to me; however, I have a few comments.
1) Don't create an HTTP packet yourself, i.e. don't build the HTTP request by hand. Instead, use something like cURL. This is more foolproof and will support a wider range of responses the server might reply with. Additionally, cURL can be set up to write directly to a file, saving you doing it yourself.
2) Using fgets may be a problem if you are reading binary data. fgets reads to the end of a line, and with binary data this may corrupt your download. Instead I suggest fread($fs, 4096);, which will handle both text and binary data (see the sketch after this list).
3) Chunked encoding is a way for a web server to send you the response in multiple chunks. I don't think this is very useful to you; however, a better encoding that the web server might support is gzip encoding. This would allow the web server to compress the response on the fly. If you use a library like cURL, it will tell the server it supports gzip and then automatically decompress it for you.
I hope this helps.
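As a small sketch of point 2, the only change needed in the download loop from the question is swapping fgets for fread; this mirrors the change the asker ultimately made in the final code further down:

// Minimal sketch: read fixed-size, binary-safe chunks instead of lines.
while (!feof($fs)) {
    $contents = fread($fs, 4096); // fread does not stop at newlines, so binary data stays intact
    fwrite($fm, $contents);
}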
Don't deal with sockets; optimize your code and use the cURL library (PHP cURL), like this:
$url = 'http://'.$host.'/'.$file;
// create a new cURL resource
$fh = fopen ($temp_file_name, "w");
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FILE, $fh);
//curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// grab URL and pass it to the browser
curl_exec($ch);
// close cURL resource, and free up system resources
curl_close($ch);
fclose($fh);
And here is the final result, in case it helps anyone else. I also wrapped the whole thing in a retry loop to decrease the risk of a completely failed download, although it does increase the use of resources:
$download_attempt = 0; // retry counter (must be initialised before the loop)

do {
    $fs = fopen('http://' . $host . $file, "rb");

    if (!$fs) {
        $this->writeDebugInfo("FAILED ", $errstr . '(' . $errno . ')');
    } else {
        $fm = fopen($temp_file_name, "w");
        stream_set_timeout($fs, 30);

        while (!feof($fs)) {
            $contents = fread($fs, 4096); // Buffered download
            fwrite($fm, $contents);

            $info = stream_get_meta_data($fs);
            if ($info['timed_out']) {
                break;
            }
        }

        fclose($fm);
        fclose($fs);

        if ($info['timed_out']) {
            // Delete temp file if fails
            unlink($temp_file_name);
            $this->writeDebugInfo("FAILED on attempt " . $download_attempt . " - Connection timed out: ", $temp_file_name);
            $download_attempt++;
            if ($download_attempt < 5) {
                $this->writeDebugInfo("RETRYING: ", $temp_file_name);
            }
        } else {
            // Move temp file if succeeds
            $media_file_name = str_replace('temp/', 'media/', $temp_file_name);
            rename($temp_file_name, $media_file_name);
            $this->newDownload = true;
            $this->writeDebugInfo("SUCCESS: ", $media_file_name);
        }
    }
} while ($download_attempt < 5 && $info['timed_out']);

Seeking in remote flv files using PHP

I'm trying to seek in a remotely hosted FLV file and have it stream locally. Streaming from the start works, but when I try to 'seek', the player stops.
I'm using this script to seek within the remote file:
$fp = fsockopen($host, 80, $errno, $errstr, 30);

$out = "GET $path_to_flv HTTP/1.1\r\n";
$out .= "Host: $host\r\n";
$out .= "Range: bytes=$pos-$end\r\n";
$out .= "Connection: Close\r\n\r\n";
fwrite($fp, $out);

$content = false;
while (!feof($fp)) {
    $data = fgets($fp, 1024);
    if ($content) echo $data;
    if ($data == "\r\n") {
        $content = true;
        header("Content-Type: video/x-flv");
        header("Content-Length: " . (urlfilesize($file) - $pos));
        if ($pos > 0) {
            print("FLV");
            print(pack('C', 1));
            print(pack('C', 1));
            print(pack('N', 9));
            print(pack('N', 9));
        }
    }
}
fclose($fp);
Any ideas?
UPDATE
So apparently, even though the server signals that it accepts range requests (with Accept-Ranges: bytes), it does not actually honour them. To see if there is another way to make the FLV seekable, let's have a look at the communication between the Flash player and the server (I used Wireshark for this):
The request when starting the player is:
GET /files/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/ HTTP/1.1
Host: xxxxxx.megavideo.com
<some more headers>
<no range header>
This is answered with a response like this:
HTTP/1.0 200 OK
Server: Apache/1.3.37 (Debian GNU/Linux) PHP/4.4.7
Content-Type: video/flv
ETag: "<video-id>"
Content-Length: <length of complete video>
<some more headers>
<the flv content>
Now when I seek in the Flash player, another request is sent. It is almost the same as the initial one, with the following difference:
GET /files/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/8800968 HTTP/1.1
<same headers as first request>
This gets answered with a response that is almost the same as the initial one, differing only in the Content-Length header.
That lets me assume that the 8800968 at the end of the request URL is the "seek range" (the byte offset in the file after seeking) we are looking for, and that the second response's Content-Length is the initial Content-Length (the length of the whole file) minus this offset. Which is indeed the case.
With this information, it should be possible to get what you want. Good luck!
UPDATE END
This will only work if the server supports HTTP Range requests. If it does, it will return a 206 Partial Content response code with a Content-Range header and your requested range of bytes. Check for these in the response to your request.
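For example, a quick probe (sketched here with cURL rather than raw sockets, purely for brevity; $url stands for the full FLV URL from the question) can tell you whether the server honours ranges before you rely on them:

// Minimal sketch: request a small byte range and check whether the server honours it.
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RANGE, '0-1023'); // ask for the first 1 KB only
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($status === 206 && stripos($response, 'Content-Range:') !== false) {
    echo "Server supports byte ranges; seeking via Range headers should work.\n";
} else {
    echo "Server ignored the Range header (it replied $status with the full file).\n";
}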
