loading XML with PHP taking too long - php

I'm trying to retrieve information from an online XML file, and it takes too long to get that information. Most of the time it even fails with a timeout error.
The strange part is that when I open the link directly in the browser, it is fast.
$xmlobj = simplexml_load_file("http://apple.accuweather.com/adcbin/apple/Apple_Weather_Data.asp?zipcode=EUR;PT;PO019;REGUA");
print header("Content-type: text/plain");
print_r($xmlobj);

That's because they block requests depending on the user agent (browser) you send.
Try this:
$curl = curl_init();
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.6) Gecko/2009012700 SUSE/3.0.6-1.4 Firefox/3.0.6');
curl_setopt($curl, CURLOPT_URL,'http://apple.accuweather.com/adcbin/apple/Apple_Weather_Data.asp?zipcode=EUR;PT;PO019;REGUA');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$xmlstr = curl_exec($curl);
$xmlobj = simplexml_load_string($xmlstr);
header("Content-type: text/plain");
print_r($xmlobj);
BTW, the file says "Redistribution Prohibited", so you might want to look for a royalty-free source of weather data.

The above code works perfectly fine for me. Try reading another (small) XML file from a different location.
Looks like a firewall issue to me!

Once you've sent the faux user agent headers with cURL as vartec pointed out, it might be a good idea to cache the XML to your server. For weather, maybe an hour would be a good time (play with this, if the RSS is updating more frequently, you may want to try 15 minutes).
Once it is saved locally to your server, reading it and parsing the XML will be much quicker.
Keep in mind too that the RSS does state Redistribution Prohibited. IIRC there are a few free online weather RSS feeds, so maybe you should try another one.
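The caching idea can be sketched roughly as follows. This is only a minimal illustration: the helper name cached_fetch, the cache file path, and the one-hour lifetime are assumptions, and the callback stands in for whatever request (such as the cURL call above) actually fetches the XML.

```php
<?php
// Hypothetical helper: return the cached copy if it is younger than $ttl seconds,
// otherwise call $fetch (e.g. a cURL request) and store the fresh result.
function cached_fetch(string $cacheFile, int $ttl, callable $fetch): string
{
    clearstatcache(true, $cacheFile); // make sure filemtime() is not stale
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return file_get_contents($cacheFile);
    }
    $data = $fetch();
    // Only cache non-empty string results so a failed request is retried next time
    if (is_string($data) && $data !== '') {
        file_put_contents($cacheFile, $data, LOCK_EX);
    }
    return (string) $data;
}
```

With something like this, the script would call cached_fetch('/tmp/weather.xml', 3600, $fetch) and pass the result to simplexml_load_string, so the remote server is contacted at most once per hour.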

Related

Running Out Of Memory With Fread

I'm using Backblaze B2 to store files and am using their documentation code to upload via their API. However, their code uses fread to read the file, which causes issues for files larger than 100MB because it tries to load the entire file into memory. Is there a better way to do this that doesn't try to load the entire file into RAM?
$file_name = "file.txt";
$my_file = "<path-to-file>" . $file_name;
$handle = fopen($my_file, 'r');
$read_file = fread($handle,filesize($my_file));
$upload_url = ""; // Provided by b2_get_upload_url
$upload_auth_token = ""; // Provided by b2_get_upload_url
$bucket_id = ""; // The ID of the bucket
$content_type = "text/plain";
$sha1_of_file_data = sha1_file($my_file);
$session = curl_init($upload_url);
// Add read file as post field
curl_setopt($session, CURLOPT_POSTFIELDS, $read_file);
// Add headers
$headers = array();
$headers[] = "Authorization: " . $upload_auth_token;
$headers[] = "X-Bz-File-Name: " . $file_name;
$headers[] = "Content-Type: " . $content_type;
$headers[] = "X-Bz-Content-Sha1: " . $sha1_of_file_data;
curl_setopt($session, CURLOPT_HTTPHEADER, $headers);
curl_setopt($session, CURLOPT_POST, true); // HTTP POST
curl_setopt($session, CURLOPT_RETURNTRANSFER, true); // Receive server response
$server_output = curl_exec($session); // Let's do this!
curl_close ($session); // Clean up
echo ($server_output); // Tell me about the rabbits, George!
I have tried using:
curl_setopt($session, CURLOPT_POSTFIELDS, array('file' => '@'.realpath('file.txt')));
However I get an error response: Error reading uploaded data: SocketTimeoutException(Read timed out)
Edit: Streaming the filename within cURL also doesn't seem to work.
The issue you are having is related to this:
fread($handle,filesize($my_file));
With the filesize in there you might as well just use file_get_contents. It's much better memory-wise to read one line at a time with fgets:
$handle = fopen($my_file, 'r');
while (($line = fgets($handle)) !== false) {
    // process $line here
}
fclose($handle);
This way you only read one line into memory, but if you need the full file contents you will still hit a bottleneck.
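For files without line breaks (binary data, for instance), reading fixed-size chunks keeps memory bounded in the same way. The sketch below is an illustration only: the function name sha1_in_chunks and the 8 KB chunk size are assumptions, and it computes the SHA-1 incrementally just to show that the whole file never has to be in memory at once.

```php
<?php
// Hypothetical helper: hash a file by reading it in fixed-size chunks.
function sha1_in_chunks(string $path, int $chunkSize = 8192): string
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $ctx = hash_init('sha1');
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize); // at most $chunkSize bytes in memory
        if ($chunk === false) {
            break;
        }
        hash_update($ctx, $chunk);
    }
    fclose($handle);
    return hash_final($ctx);
}
```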
The only real way is to stream the upload.
I did a quick search and it seems the default for CURL is to stream the file if you give it the filename
$post_data['file'] = 'myfile.csv';
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_data);
You can see the previous answer for more details
Is it possible to use cURL to stream upload a file using POST?
So as long as you can get past the sha1_file call, it looks like you can just stream the file, which should avoid the memory issues. There may be issues with the time limit, though. Also, I can't really think of a way around computing the hash if that fails.
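One way to stream a file as the raw request body with PHP's cURL extension is to hand it an open file handle instead of a string. The following is a sketch under assumptions, not a confirmed recipe for this particular API: the function name stream_upload_options is hypothetical, and whether the receiving server accepts a body sent this way has not been verified here.

```php
<?php
// Hypothetical helper: build cURL options that stream $path as the request body.
// $uploadUrl and $headers are placeholders supplied by the caller.
function stream_upload_options(string $uploadUrl, string $path, array $headers): array
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    return array(
        CURLOPT_URL            => $uploadUrl,
        CURLOPT_UPLOAD         => true,            // read the body from CURLOPT_INFILE
        CURLOPT_CUSTOMREQUEST  => 'POST',          // CURLOPT_UPLOAD alone would send PUT
        CURLOPT_INFILE         => $handle,         // cURL reads from this handle in pieces
        CURLOPT_INFILESIZE     => filesize($path), // lets cURL send a Content-Length
        CURLOPT_HTTPHEADER     => $headers,
        CURLOPT_RETURNTRANSFER => true,
    );
}
```

The options would be applied with curl_setopt_array($session, stream_upload_options($upload_url, $my_file, $headers)) before calling curl_exec($session).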
Just FYI, I have never tried this personally; I typically just use SFTP for large file transfers. So I don't know if the key has to be specifically $post_data['file']; I just copied that from the other answer.
Good luck...
UPDATE
Seeing as streaming seems to have failed (see comments), you may want to test the streaming on its own to make sure it works. I don't know what all that would involve; maybe stream a file to your own server? I am also not sure why it wouldn't work "as advertised", and you may have tested it already. But it never hurts to test something; never assume something works until you know for sure. It's very easy to try something new as a solution, miss a setting or get a path wrong, and then fall back to blaming the original issue.
I've spent a lot of time tearing things apart only to realize I had a spelling error. I'm pretty adept at programming these days, so I typically overthink the errors too. My point is, be sure it's not a simple mistake before moving on.
Assuming everything is set up right, I would try file_get_contents. I don't know if it will be any better, but it is meant for reading whole files. It would also be more readable in the code, because then it's clear that the whole file is needed. It just seems more semantically correct, if nothing else.
You can also increase the RAM PHP has access to by using
ini_set('memory_limit', '512M');
You can even go higher than that, depending on your server. The highest I went before was 3G, but the server I use has 54GB of RAM and that was a one-time thing (we migrated 130 million rows from MySQL to MongoDB, and the InnoDB index was eating up 30+GB). Typically I run with 512M and have some scripts that routinely need 1G. But I wouldn't just raise the memory limit willy-nilly; that is usually a last resort for me after optimizing and testing. We do a lot of heavy processing, which is why we have such a big server; we also have 2 slave servers (among other things) that run with 16GB each.
As for what size to set, I typically increase it by 128M until it works, then add an extra 128M just to be sure, but you might want to go in smaller steps. People typically use multiples of 8, but I don't know if that makes much difference these days.
Again, good luck.

How to work around a site forbidding me to scrape their images with PHP

I'm scraping a site, searching for JPGs to download.
Scraping the site's HTML pages works fine.
But when I try getting the JPGs with cURL, copy(), fopen(), etc., I get a 403 Forbidden status.
I know that's because the site owners don't want their images scraped, so I understand a good answer would be just don't do it, because they don't want you to.
Ok, but let's say it's ok and I try to work around this, how could this be achieved?
If I get the same URL with a browser, I can open the image perfectly, it's not that my IP is banned or anything, and I'm testing the scraper one file at a time, so it's not blocking me because I make too many requests too often.
From my understanding, it could be that either the site is checking for some cookies that confirm that I'm using a browser and browsing their site before I download a JPG.
Or that maybe PHP is using some user agent for the requests that the server can detect and filter out.
Anyway, any ideas?
Actually it was quite simple.
As @Leigh suggested, it only took spoofing the HTTP referrer with the CURLOPT_REFERER option.
In fact, for every request I just provided the domain name as the referrer and it worked.
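A minimal sketch of that approach might look like the following. The function names referer_for and fetch_with_referer are made up for illustration, and deriving the referrer from the scheme and host of the requested URL is an assumption about what "the domain name" means here.

```php
<?php
// Hypothetical helper: derive "scheme://host/" from an absolute URL.
function referer_for(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        throw new InvalidArgumentException("Not an absolute URL: $url");
    }
    return $parts['scheme'] . '://' . $parts['host'] . '/';
}

// Hypothetical helper: fetch $url with the Referer header set to its own site.
// Returns the response body, or false on failure.
function fetch_with_referer(string $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_REFERER, referer_for($url));
    return curl_exec($ch);
}
```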
Are you able to view the page through a browser? Wouldn't a simple search of the page source find all images?
$findme = '.jpg';
$pos = strpos($html, $findme);
if ($pos === false) {
    echo "The string '$findme' was not found in the string '$html'";
} else {
    echo "Images found.";
    // grab image location code
}
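If the goal is the image locations rather than a yes/no answer, parsing the HTML is usually more reliable than searching for a substring. The sketch below assumes the images appear in ordinary img tags with a src attribute; the function name find_jpg_urls is hypothetical.

```php
<?php
// Hypothetical helper: collect the src of every <img> that points at a JPEG.
function find_jpg_urls(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = array();
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        // Match .jpg or .jpeg, optionally followed by a query string
        if (preg_match('/\.jpe?g(\?.*)?$/i', $src)) {
            $urls[] = $src;
        }
    }
    return $urls;
}
```

The returned values are whatever the page put in src, so relative paths still need to be resolved against the page URL before downloading.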
Basic image retrieval:
Using the GD Library plugin commonly installed by default with many web hosts. This is something of an ugly hack but some may find the fact it can be done this way useful.
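$remote_img = 'http://www.somwhere.com/images/image.jpg';
$img = imagecreatefromjpeg($remote_img); // requires allow_url_fopen
$path = 'images/image.jpg'; // imagejpeg() needs a file name, not just a directory
if ($img !== false) {
    imagejpeg($img, $path);
}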
Classic cURL image grabbing function for when you have extracted the location of the image from the donor page's HTML.
function save_image($img, $fullpath) {
    $ch = curl_init($img);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the raw bytes instead of printing them
    $rawdata = curl_exec($ch);
    curl_close($ch);
    if ($rawdata === false) {
        return false; // request failed, leave any existing file untouched
    }
    // Overwrite any existing file with the downloaded bytes
    return file_put_contents($fullpath, $rawdata) !== false;
}
If the basic cURL image grabbing function fails then the donor site probably has some form of server side defences in place to prevent retrieval and so you are probably breaching the terms of service by proceeding further. Though rare some sites do create images 'on the fly' using the GD library module, so what may look like a link to an image is actually a PHP script and that could be checking for things like a cookie, referer or session value being passed to it before the image is created and outputted.

Downloading files using GZIP

I have many XML files that I download using file or file_get_contents, but the server administrator told me that downloading through GZIP is more efficient. My question is how I can use GZIP; I have never done this before, so this solution is really new to me.
You shouldn't need to do any decoding yourself if you use cURL. Just use the basic cURL example code, with the CURLOPT_ENCODING option set to "", and it will automatically request the file using gzip encoding, if the server supports it, and decode it.
Or, if you want the content in a string instead of in a file:
$ch = curl_init("http://www.example.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, ""); // accept any supported encoding
$content = curl_exec($ch);
curl_close($ch);
I've tested this, and it indeed downloads the content in gzipped format and decodes it automatically.
(Also, you should probably include some error handling.)
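As a rough idea of what that error handling could look like, here is a sketch. The function name fetch_decoded and the timeout values are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: fetch a URL with automatic decoding, throwing on failure.
function fetch_decoded(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_ENCODING, "");       // accept any supported encoding
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // seconds to wait for the connection
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);        // seconds for the whole transfer
    $content = curl_exec($ch);
    if ($content === false) {
        // Transport-level failure: DNS, timeout, refused connection, ...
        throw new RuntimeException('cURL error ' . curl_errno($ch) . ': ' . curl_error($ch));
    }
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($status >= 400) {
        // The server answered, but with an error status
        throw new RuntimeException("HTTP status $status for $url");
    }
    return $content;
}
```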
I don't understand your question.
You say that you downloaded these files - you can't unilaterally enable compression client-side.
OTOH you can control it server-side - and since you've flagged the question as PHP, and it doesn't make any sense for your administrator to recommend compression where you don't have control over the server then I assume this is what you are talking about.
In which case you'd simply do something like:
<?php
ob_start("ob_gzhandler");
...your code for generating the XML goes here
...or maybe this is nothing to do with PHP, and the XML files are static - in which case you'd need to configure your webserver to compress on the fly.
Unless you mean that compression is available on the server and you are fetching data over HTTP using PHP as the client - in which case the server will only compress the data if the client provides an "Accept-Encoding" request header including "gzip". In which case, instead of file_get_contents() you might use:
function gzip_get_contents($url)
{
$ch=curl_init($url);
curl_setopt($ch, CURLOPT_ENCODING, 'gzip');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content=curl_exec($ch);
curl_close($ch);
return $content;
}
cURL can probably fetch a gzipped file:
http://www.php.net/curl
Try using it instead of file_get_contents.
edit: tested
curl_setopt($c,CURLOPT_ENCODING,'gzip');
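then, only if the response is still compressed (with CURLOPT_ENCODING set, cURL normally decodes it for you):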
gzdecode($responseContent);
Send an Accept-Encoding: gzip header in your HTTP request and then uncompress the result as shown here:
http://de2.php.net/gzdecode
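When the Accept-Encoding header is sent by hand (for example through a stream context with file_get_contents), the body arrives compressed only if the server chose to compress it, so it helps to check before decoding. A minimal sketch, with the function name maybe_gunzip being an assumption:

```php
<?php
// Hypothetical helper: decode the body only if it really is gzip data.
function maybe_gunzip(string $body): string
{
    // gzip streams start with the magic bytes 0x1f 0x8b
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        if ($decoded !== false) {
            return $decoded;
        }
    }
    return $body; // not compressed (or not decodable): return unchanged
}
```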

downloading a page without downloading image files or css or javascript with curl

Whenever I use cURL (PHP) to download a page, it downloads everything on the page, like images, CSS files or JavaScript files, but sometimes I don't want to download these. Can I control the resources that I download through cURL? I have gone through the manual but haven't found an option that can make this happen. Please don't suggest getting the whole page and then using some regex magic, because that would still download the page and increase load time.
This is demo code where I download a page from mozilla.com:
<?php
$url="http://www.mozilla.com/en-US/firefox/new/";
$userAgent="Mozilla/5.0 (Windows NT 5.1; rv:2.0)Gecko/20100101 Firefox/4.0";
//$accept="text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
$encoding="gzip, deflate";
// CURLOPT_HTTPHEADER expects a list of "Name: value" strings
$header=array(
"Accept-Language: en-us,en;q=0.5",
"Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
"Connection: keep-alive",
"Keep-Alive: 115",
);
$ch=curl_init();
curl_setopt($ch,CURLOPT_USERAGENT,$userAgent);
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_ENCODING,$encoding);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch,CURLOPT_AUTOREFERER,1);
$content=curl_exec($ch);
curl_close($ch);
echo $content;
?>
When I echo the content it shows the images too. I saw in Firebug's Network tab that images and external JS files are being downloaded.
PHP's curl only fetches what you tell it to. It doesn't parse html to look for javascript/css <link> tags and <img> tags and doesn't fetch them automatically.
If you have curl downloading those resources, then it's your code telling it to do so, and it's up to you to decide what to fetch and what not to. Curl only does what you tell it to.
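The images and scripts are requested by your browser when it renders the echoed HTML, not by cURL. You can avoid that by escaping the output: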
echo htmlentities($content);

PHP: get remote file size with strlen? (html)

I was looking at PHP docs for fsockopen and whatnot and they say you can't use filesize() on a remote file without doing some crazy things with ftell or something (not sure what they said exactly), but I had a good thought about how to do it:
$file = file_get_contents("http://www.google.com");
$filesize = mb_strlen($file) / 1000; //KBs, mb_* in case file contains unicode
Would this be a good method? It seemed so simple and good to use at the time, just want to get any thoughts if this could run into problems or not be the true file size.
I only wish to use this on text (websites) by the way not binary.
This answer requires PHP5 and cURL. It first checks the headers. If Content-Length isn't specified, it uses cURL to download the file and check the size (the file is not saved anywhere, though, just held temporarily in memory).
<?php
echo get_remote_size("http://www.google.com/");
function get_remote_size($url) {
$headers = get_headers($url, 1);
if (isset($headers['Content-Length'])) return $headers['Content-Length'];
if (isset($headers['Content-length'])) return $headers['Content-length'];
$c = curl_init();
curl_setopt_array($c, array(
CURLOPT_URL => $url,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HTTPHEADER => array('User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3'),
));
curl_exec($c);
return curl_getinfo($c, CURLINFO_SIZE_DOWNLOAD);
}
?>
You should look at the get_headers() function. It will return a hash of HTTP headers from an HTTP request. The Content-length header may be a better judge of the size of the actual content, if it's present.
That being said, you really should use either curl or streams to do a HEAD request instead of a GET. Content-length should be present, which saves you the transfer. It will be both faster and more accurate.
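A HEAD request can be made with get_headers by passing a stream context, as in the sketch below. The function names content_length_from_headers and remote_size are hypothetical, and the code assumes PHP 7.1 or later, where get_headers accepts a context argument. The header lookup is case-insensitive and copes with redirects, where get_headers returns one value per response.

```php
<?php
// Hypothetical helper: pull Content-Length out of a get_headers()-style array.
function content_length_from_headers(array $headers): ?int
{
    foreach ($headers as $name => $value) {
        if (is_string($name) && strcasecmp($name, 'Content-Length') === 0) {
            // After redirects there is one value per response; use the last one
            if (is_array($value)) {
                $value = end($value);
            }
            return (int) $value;
        }
    }
    return null; // header not present
}

// Hypothetical helper: ask for the headers only (HEAD), without the body.
function remote_size(string $url): ?int
{
    $context = stream_context_create(array('http' => array('method' => 'HEAD')));
    $headers = get_headers($url, true, $context);
    return $headers === false ? null : content_length_from_headers($headers);
}
```

A null result means the server did not report a length, in which case downloading the content is the only way to measure it.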
It will fetch the whole file and then calculate the file size (or rather the string length) from the retrieved data. filesize can usually get the size directly from the filesystem without reading the whole file first.
So this will be rather slow, and it will fetch the whole file every time before being able to determine the file size (string length).
