Using PHP, I am accessing an external URL that serves an XML feed, and I'm parsing the results into my database. The XML file is large, around 27 MB.
How can I get that file compressed before the data transfer starts, so I receive something much smaller than 27 MB? My guess is gzip should be used, but I don't know how.
This is the code I'm using to retrieve the data from the XML file:
$url = "http://www.website.com/feed.xml";
$xmlStr = file_get_contents("$url") or die("can't get file");
$xmlLinq = simplexml_load_string($xmlStr);
EDIT: The file is already offered with standard gzip/deflate compression, but I seem to be receiving the uncompressed version.
EDIT: I got this piece of code from the owner of the feed; it is supposed to show how to solve the problem, but it is C#. I'd need the equivalent in PHP:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
request.Timeout = 60000;
request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
request.KeepAlive = false;
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.0; ru; rv:1.9) Gecko/2008052906 Firefox/3.0 (.NET CLR 3.5.30729)";
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream responseStream = response.GetResponseStream();
if (response.ContentEncoding.ToLower().Contains("gzip"))
    responseStream = new GZipStream(responseStream, CompressionMode.Decompress);
else if (response.ContentEncoding.ToLower().Contains("deflate"))
    responseStream = new DeflateStream(responseStream, CompressionMode.Decompress);
StreamReader reader = new StreamReader(responseStream, Encoding.UTF8);
Expanding on my comment, web servers will only send content compressed using Gzip if the request's Accept-Encoding header contains gzip. To fire off a request containing this header, you can use the following:
$url = "http://www.website.com/feed.xml";
$curl = curl_init($url);
curl_setopt_array($curl, array(
CURLOPT_ENCODING => '', // specify that we accept all supported encoding types
CURLOPT_RETURNTRANSFER => true));
$xml = curl_exec($curl);
curl_close($curl);
if($xml === false) {
die('Can\'t get file');
}
$xmlLinq = simplexml_load_string($xml);
This uses the cURL extension, which is a very flexible library for making HTTP requests.
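If you also want to mirror the timeout and user agent from the C# snippet, the same call can be extended. Here is a sketch; the extra options are standard cURL options, and the values are simply copied from the C# code above:
$curl = curl_init("http://www.website.com/feed.xml");
curl_setopt_array($curl, array(
    CURLOPT_ENCODING       => '',    // send Accept-Encoding for all supported types, auto-decompress the response
    CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
    CURLOPT_TIMEOUT        => 60,    // in seconds; mirrors request.Timeout = 60000 (milliseconds)
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows; U; Windows NT 6.0; ru; rv:1.9) Gecko/2008052906 Firefox/3.0 (.NET CLR 3.5.30729)'
));
$xml = curl_exec($curl);
curl_close($curl);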
Related
I'm using file_get_contents() to get a PHP file which I use as a template to create a PDF.
I need to pass some POST values to it in order to fill the template and get the produced HTML back into a PHP variable, which I then use with mPDF.
This works perfectly on MY server (a VPS using PHP 5.6.24)...
Now, at the point where I'm installing the fully tested script on the client's live site (PHP 5.6.29),
I get this error:
PHP Warning: file_get_contents(http://www.example.com/wp-content/calculator/pdf_page1.php): failed to open stream: HTTP request failed! HTTP/1.1 406 Not Acceptable
So I guess this can be fixed in php.ini or some config file.
I can ask (I WANT TO!!) my client to contact his host to fix it...
But since I know that hosters are generally not inclined to change server configs...
I would like to know exactly what to change in which file to allow the code below to work.
For my personal knowledge... Obviously.
But also to make it look "easy" for the hoster (and my client!!) to change it efficiently. ;)
I'm pretty sure this is just one PHP config param with a strange name...
<?php
$baseAddr = "http://www.example.com/wp-content/calculator/";
// ====================================================
// CLEAR OLD PDFs
$now = date("U");
$delayToKeepPDFs = 60*60*2; // 2 hours in seconds.
if ($handle = opendir('.')) {
    while (false !== ($entry = readdir($handle))) {
        if (substr($entry, -4) == ".pdf") {
            $fileTime = filemtime($entry); // Returns unix timestamp
            if ($fileTime + $delayToKeepPDFs < $now) {
                unlink($entry); // Delete file
            }
        }
    }
    closedir($handle);
}
// ====================================================
// Random file number
$random = rand(100, 999);
$page1 = $_POST['page1']; // Here are the values, sent via ajax, to fill the template.
$page2 = $_POST['page2'];
// Instantiate mpdf
require_once __DIR__ . '/vendor/autoload.php';
$mpdf = new mPDF( __DIR__ . '/vendor/mpdf/mpdf/tmp');
// GET PDF templates from external PHP
// ==============================================================
// REF: http://stackoverflow.com/a/2445332/2159528
// ==============================================================
$postdata = http_build_query(
    array(
        "page1" => $page1,
        "page2" => $page2
    )
);
$opts = array('http' =>
    array(
        'method'  => 'POST',
        'header'  => 'Content-type: application/x-www-form-urlencoded',
        'content' => $postdata
    )
);
$context = stream_context_create($opts);
// ==============================================================
// Plain assignment: these variables are not defined yet, so .= would raise a notice
$STYLE = file_get_contents("smolov.css", false, $context);
$PAGE_1 = file_get_contents($baseAddr . "pdf_page1.php", false, $context);
$PAGE_2 = file_get_contents($baseAddr . "pdf_page2.php", false, $context);
$mpdf->AddPage('P');
// Write style.
$mpdf->WriteHTML($STYLE,1);
// Write page 1.
$mpdf->WriteHTML($PAGE_1,2);
$mpdf->AddPage('P');
// Write page 2.
$mpdf->WriteHTML($PAGE_2,2);
// Create the pdf on server
$file = "training-" . $random . ".pdf";
$mpdf->Output(__DIR__ . "/" . $file,"F");
// Send filename to ajax success.
echo $file;
?>
Just to avoid the "What have you tried so far?" comments:
I searched these keywords in many combinations, but didn't find the setting that would need to be changed:
php
php.ini
request
header
content-type
application
HTTP
file_get_contents
HTTP/1.1 406 Not Acceptable
Maaaaany thanks to @Rasclatt for the priceless help! Here is working cURL code, as an alternative to file_get_contents() (I do not quite understand it yet... but proven functional!). The decisive part seems to be the fake browser User-Agent, which the server apparently requires before answering with anything other than 406:
function curl_get_contents($url, $fields, $fields_url_enc){
    # Start curl
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    # Required to get data back
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    # Notes that request is sending a POST
    curl_setopt($ch, CURLOPT_POST, count($fields));
    # Send the post data
    curl_setopt($ch, CURLOPT_POSTFIELDS, $fields_url_enc);
    # Send a fake user agent to simulate a browser hit
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11) AppleWebKit/601.1.56 (KHTML, like Gecko) Version/9.0 Safari/601.1.56');
    # Set the endpoint
    curl_setopt($ch, CURLOPT_URL, $url);
    # Execute the call and get the data back from the hit
    $data = curl_exec($ch);
    # Close the connection
    curl_close($ch);
    # Send back data
    return $data;
}
# Store post data
$fields = array(
'page1' => $_POST['page1'],
'page2' => $_POST['page2']
);
# Create query string as noted in the curl manual
$fields_url_enc = http_build_query($fields);
# Request to page 1, sending post data
$PAGE_1 = curl_get_contents($baseAddr . "pdf_page1.php", $fields, $fields_url_enc);
# Request to page 2, sending post data
$PAGE_2 = curl_get_contents($baseAddr . "pdf_page2.php", $fields, $fields_url_enc);
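For the record, if cURL had not been available, a likely alternative would have been to make the original stream context look more like a browser as well, since the fake user agent appears to be what satisfies this server. An untested sketch along those lines, reusing the variables from the code above:
$opts = array('http' =>
    array(
        'method'  => 'POST',
        'header'  => "Content-type: application/x-www-form-urlencoded\r\n" .
                     "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11) AppleWebKit/601.1.56 (KHTML, like Gecko) Version/9.0 Safari/601.1.56\r\n",
        'content' => $fields_url_enc // the same http_build_query() string as above
    )
);
$context = stream_context_create($opts);
$PAGE_1 = file_get_contents($baseAddr . "pdf_page1.php", false, $context);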
When I try to read the RSS feeds of kat.cr using PHP's file_get_contents() function, I get unreadable text, but when I open the feed in my browser it looks fine.
I have tried several other hosts, but had no luck getting the correct data.
I have even tried setting the user agent to different browsers, but still no change.
This is the simple code I've tried:
$options = array('http' => array('user_agent' => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'));
$url = 'https://kat.cr/movies/?rss=1';
$data = file_get_contents($url, false, stream_context_create($options)); // second arg is use_include_path (bool), not a flags constant
echo $data;
I'm curious how they're doing it and what I can do to overcome the problem.
A part of unreadable text:
‹ي]يrم6–?Oپي©™ت,à7{»âgw&يؤe;éN¹\S´HK\S¤–¤l+ے÷ِùِIِ”(إژzA5ةض؛غ%K4ـ{qtqy½ùوa^ »¬nٍھ|ûٹSِ eه¤Jَrِْصڈ1q^}sü§7uسlدزؤYً¾²yفVu•يغWGG·Iس&m>،“j~$ےzؤ(?zïج’²جٹم?!ّ÷¦حغ";گ´Yس¢ï³{tر5ز ³َsgYٹْ.ں#
Actually, every time I open the link the unreadable text is different.
As I mentioned in the comment, the content returned is gzip-encoded, so you need to un-gzip the data. Depending on your version of PHP you may or may not have gzdecode() available (it was added in PHP 5.4). I don't, but the fallback function below does the trick.
if( !function_exists('gzdecode') ){
    function gzdecode( $data ){
        // Write the gzipped data to a temp file, then read it back decompressed
        $g = tempnam('/tmp', 'ff');
        file_put_contents($g, $data);
        ob_start();
        readgzfile($g);
        $d = ob_get_clean();
        unlink($g);
        return $d;
    }
}
$data = gzdecode(file_get_contents($url));
echo $data;
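For completeness: on PHP 5.4 and newer, gzdecode() is built in, so the whole fetch-and-decode (reusing the user agent from the question) can be as short as this sketch:
$options = array('http' => array(
    'user_agent' => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'
));
$url = 'https://kat.cr/movies/?rss=1';
// gzdecode() is native from PHP 5.4 onward; on older versions the fallback above kicks in
echo gzdecode(file_get_contents($url, false, stream_context_create($options)));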
I have a C++ program that recognizes people: it recognizes faces, but also which person a face belongs to. I'm quite a C++ newbie, so I was already glad I could get it to work (the original program isn't written by me, but it needed some changes to work).
There is a function in this program that needs to alter a PHP file when it recognizes, for example, me.
I have written this function, but it completely destroys the formatting of the PHP file and deletes pieces of code I don't want deleted.
The C++ code that looks for the PHP file and edits it:
if (nWho == P_NICK)
{
    fstream calendar("/var/www/html/MagicMirror_Old/calendar.php");
    string readout;
    string search;
    search = "$url = 'some_URL_to_some_site'";
    string replace;
    replace = "$url = 'some_URL_to_some_other_site'";
    while (getline(calendar, readout))
    {
        if (readout == search)
        {
            calendar << replace;
        }
        else
        {
            calendar << readout;
        }
    }
}
Now, the original PHP file being edited has the following content before the edit:
<?php
// Set the url of the calendar feed.
$url = 'some_URL_to_some_site';
/*****************************************/
// Run the helper function with the desired URL and echo the contents.
echo get_url($url);
// Define the helper function that retrieved the data and decodes the content.
function get_url($url)
{
//user agent is very necessary, otherwise some websites like google.com wont give zipped content
$opts = array(
'http'=>array(
'method'=>"GET",
'header'=>"Accept-Language: en-US,en;q=0.8rn" .
"Accept-Encoding: gzip,deflate,sdchrn" .
"Accept-Charset:UTF-8,*;q=0.5rn" .
"User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:19.0) Gecko/20100101 Firefox/19.0 FirePHP/0.4rn",
"ignore_errors" => true //Fix problems getting data
),
//Fixes problems in ssl
"ssl" => array(
"verify_peer"=>false,
"verify_peer_name"=>false
)
);
$context = stream_context_create($opts);
$content = file_get_contents($url ,false,$context);
//If http response header mentions that content is gzipped, then uncompress it
foreach($http_response_header as $c => $h)
{
if(stristr($h, 'content-encoding') and stristr($h, 'gzip'))
{
//Now lets uncompress the compressed data
$content = gzinflate( substr($content,10,-8) );
}
}
return $content;
}
Which turns to the following after the file is edited by C++:
<?php
<?php Set the url of the calendar feed.
Set the url of the calendar feed.= 'https://p01-calendarws.icloud.com/ca/subscribe/1/n6x7Farxpt7m9S8bHg1TGArSj7J6kanm_2KEoJPL5YIAk3y70FpRo4GyWwO-6QfHSY5mXtHcRGVxYZUf7U3HPDOTG5x0qYnno1Zr_VuKH2M';
= 'https://p01-calendarws.icloud.com/ca/subscribe/1/n6x7Farxpt7m9S8bHg1TGArSj7J6kanm_2KEoJPL5YIAk3y70FpRo4GyWwO-6QfHSY5mXtHcRGVxYZUf7U3HPDOTG5x0qYnno1Zr_VuKH2M';***********/
***********/ helper function with the desired URL and echo the contents.
helper function with the desired URL and echo the contents.trieved the data and decodes the content.
trieved the data and decodes the content.ent is very necessary, otherwise some websites like google.com wont give zipped content
ent is very necessary, otherwise some websites like google.com wont give zipped content'header'=>"Accept-Language: en-US,en;q=0.8\r\n" .
'header'=>"Accept-Language: en-US,en;q=0.8\r\n" .,deflate,sdch\r\n" .
,deflate,sdch\r\n" . "Accept-Charset:UTF-8,*;q=0.5\r\n" .
"Accept-Charset:UTF-8,*;q=0.5\r\n" .illa/5.0 (X11; Linux x86_64; rv:19.0) Gecko/20100101 Firefox/19.0 FirePHP/0.4\r\n",
illa/5.0 (X11; Linux x86_64; rv:19.0) Gecko/20100101 Firefox/19.0 FirePHP/0.4\r\n",ixes problems in ssl
ixes problems in ssl "verify_peer"=>false,
"verify_peer"=>false,=>false
=>false );
); $context = stream_context_create($opts);
$context = stream_context_create($opts);e,$context);
e,$context); /If http response header mentions that content is gzipped, then uncompress it
/If http response header mentions that content is gzipped, then uncompress it, 'content-encoding') and stristr($h, 'gzip'))
, 'content-encoding') and stristr($h, 'gzip'))the compressed data
the compressed datant = gzinflate( substr($content,10,-8) );
nt = gzinflate( substr($content,10,-8) );tent;
tent;
As you probably noticed, this isn't how the file should look, considering its original state.
Basically, only the $url on the second line needs to be replaced with a different URL; the rest of the PHP file's formatting should stay the same.
Is there a way to do this in C++?
The original code fails because it reads from and writes to the same fstream at the same time, so each write lands in the middle of the text still being read. A safer approach is to read the whole file into memory, do the replacement there, and write the result to a new file. Taking the code for the replace() and replaceAll() functions from this SO answer for replacing some text in a string:
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <cstdio> // for remove() and rename()
using namespace std;
bool replace(std::string& str, const std::string& from, const std::string& to) {
size_t start_pos = str.find(from);
if(start_pos == std::string::npos)
return false;
str.replace(start_pos, from.length(), to);
return true;
}
void replaceAll(std::string& str, const std::string& from, const std::string& to) {
if(from.empty())
return;
size_t start_pos = 0;
while((start_pos = str.find(from, start_pos)) != std::string::npos) {
str.replace(start_pos, from.length(), to);
start_pos += to.length(); // In case 'to' contains 'from', like replacing 'x' with 'yx'
}
}
int main()
{
ifstream calendar("calendar.php");
std::stringstream buffer;
// read whole file in a buffer
buffer << calendar.rdbuf();
// use a new file for output
ofstream newcalendar;
newcalendar.open("newcalendar.php");
string search = "$url = 'some_URL_to_some_site'";
string to = "$url = 'some_URL_to_some_other_site'";
string content = buffer.str();
replaceAll(content, search, to);
newcalendar << content;
newcalendar.close();
calendar.close();
remove("calendar.php");
rename("newcalendar.php", "calendar.php");
return 0;
}
Be careful: the spelling of the searched text has to match exactly!
EDIT: Added two lines for renaming the files
I am trying to grab the HTML from the page below using some simple PHP.
URL: https://kat.cr/usearch/architecture%20category%3Abooks/
My code is:
$html = file_get_contents('https://kat.cr/usearch/architecture%20category%3Abooks/');
echo $html;
where file_get_contents() works but returns scrambled data.
I have tried using cURL as well as various functions like htmlentities(), mb_convert_encoding() and utf8_encode(), but just get different variations of the scrambled text.
The source of the page says it is charset=utf-8, but I am not sure what the problem is.
Calling file_get_contents() on the base URL kat.cr returns the same mess.
What am I missing here?
The response is gzip-compressed; when the browser fetches it, it decompresses it transparently, so you need to do the same. To decompress and output it in one step you can use readgzfile():
readgzfile('https://kat.cr/usearch/architecture%20category%3Abooks/');
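Note that readgzfile() writes the decompressed data straight to the output. If you need it in a variable instead (for example, to parse it), one option is to capture it with output buffering, the same trick the gzdecode() fallback shown earlier uses:
// Capture the decompressed output into a variable instead of echoing it
ob_start();
readgzfile('https://kat.cr/usearch/architecture%20category%3Abooks/');
$html = ob_get_clean();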
Your site's response is compressed, so you have to decompress it to recover the original form.
The quickest way is to use gzinflate(), skipping the 10-byte gzip header and the 8-byte trailer, as below:
$html = gzinflate(substr(file_get_contents("https://kat.cr/usearch/architecture%20category%3Abooks/"), 10, -8));
Or, for a more robust solution, consider the following function (found on this blog):
function get_url($url)
{
    // The user agent is very necessary, otherwise some websites like google.com won't give zipped content
    $opts = array(
        'http' => array(
            'method' => "GET",
            'header' => "Accept-Language: en-US,en;q=0.8\r\n" .
                        "Accept-Encoding: gzip,deflate,sdch\r\n" .
                        "Accept-Charset: UTF-8,*;q=0.5\r\n" .
                        "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:19.0) Gecko/20100101 Firefox/19.0 FirePHP/0.4\r\n"
        )
    );
    $context = stream_context_create($opts);
    $content = file_get_contents($url, false, $context);
    // If the HTTP response headers say the content is gzipped, uncompress it
    foreach ($http_response_header as $c => $h)
    {
        if (stristr($h, 'content-encoding') and stristr($h, 'gzip'))
        {
            // Strip the 10-byte gzip header and 8-byte trailer, then inflate
            $content = gzinflate(substr($content, 10, -8));
        }
    }
    return $content;
}
echo get_url('http://www.google.com/');
The code I am using is below. It works perfectly fine until I encounter a URL with Japanese or other special characters. From what I have observed, whenever the URL contains such characters the request only reaches the domain name, so I keep getting random results I don't intend to retrieve.
include_once 'simple_html_dom.php';
header('Content-Type: text/html; charset=utf-8');
$url_link = 'http://kissanime.com/Anime/Knights-of-Ramune-VS騎士ラムネ&40FRESH';
$html = file_get_html($url_link);
echo $html->find('.bigChar', 0)->innertext;
I should be getting 'Knights of Ramune' as a result, since that's the element I'm trying to retrieve. Instead, $url_link gets redirected to the domain name 'http://kissanime.com/' without 'Anime/Knights-of-Ramune-VS騎士ラムネ&40FRESH', and from there looking up the '.bigChar' class returns a random value.
The real problem is how to retrieve data from a URL containing UTF-8 characters, not simple_html_dom.
First of all, we need to encode the characters:
$url_link = 'http://kissanime.com/Anime/Knights-of-Ramune-VS騎士ラムネ&40FRESH';
$strPosLastPart = strrpos($url_link, '/') + 1;
$lastPart = substr($url_link, $strPosLastPart);
$encodedLastPart = rawurlencode($lastPart);
$url_link = str_replace($lastPart, $encodedLastPart, $url_link);
Normally this should work. But when I tested it, it did not. To find out why this error happens, I made a call using cURL and got back:
Object reference not set to an instance of an object. Description: An
unhandled exception occurred during the execution of the current web
request. Please review the stack trace for more information about the
error and where it originated in the code.
Exception Details: System.NullReferenceException: Object reference not
set to an instance of an object.
Now we know this page is written in ASP.NET. But I still asked myself why it did not work. I added a user agent, and voilà:
$ch = curl_init($url_link);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it, so $data holds the HTML
$data = curl_exec($ch);
echo $data;
All together (working):
$url_link = 'http://kissanime.com/Anime/Knights-of-Ramune-VS騎士ラムネ&40FRESH';
//Encode Characters
$strPosLastPart = strrpos($url_link, '/') + 1;
$lastPart = substr($url_link, $strPosLastPart);
$encodedLastPart = rawurlencode($lastPart);
$url_link = str_replace($lastPart, $encodedLastPart, $url_link);
//Download Data
$ch = curl_init($url_link);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // needed so $data receives the HTML instead of it being echoed
$data = curl_exec($ch);
//Load Data into Html (untested, since I am not using this lib)
$html = str_get_html($data);
The difference is that you now read $data into your simple_html_dom class with str_get_html(), instead of using file_get_html().
Cheers