How to cache RSS feed data to display on website? - php

I am developing a website that will display the most recent item from an RSS feed. However, each time a user accesses the website, I'd like the page to display cached data. This will make the page load much more quickly, since I plan on caching 50+ RSS feeds.
My question is, how do I cache an RSS feed, but make sure it updates in the background every 4 hours or so?
Thanks in advance.

Create a cache folder to store all the RSS feeds.
When the page is loaded, check whether the cached file exists; if it doesn't, download the feed and process it.
If the file exists and filemtime($cached_file) + (60 * 60 * 4) is greater than time(), less than 4 hours have passed since the RSS feed was fetched, so display the page as normal. Otherwise, re-download the feed, refresh the cache, and then display it.
There are many tutorials around for parsing RSS feeds in PHP. I prefer using PHP's DOM extension, but there are many different ways you can do it.
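A minimal sketch of that flow, with a placeholder feed URL and cache path:
<?php
// Hypothetical example: cache one feed in cache/ for 4 hours before re-fetching.
$feed_url    = 'http://example.com/feed.xml';        // placeholder feed URL
$cached_file = __DIR__ . '/cache/example-feed.xml';  // placeholder cache path
$max_age     = 60 * 60 * 4;                          // 4 hours

if (!file_exists($cached_file) || filemtime($cached_file) + $max_age < time()) {
    // Cache is missing or stale: re-download the feed and overwrite the cache.
    $data = file_get_contents($feed_url);
    if ($data !== false) {
        file_put_contents($cached_file, $data);
    }
}

// Parse the cached copy, e.g. with the DOM extension.
$doc = new DOMDocument();
$doc->load($cached_file);
$title = $doc->getElementsByTagName('title')->item(0);
echo $title ? $title->nodeValue : 'No title found';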

I created a simple PHP class to tackle this issue. Since I'm dealing with a variety of sources, it can handle whatever you throw at it (XML, JSON, etc.). You give it a local filename (for storage purposes), the external feed URL, and an expiry time. It begins by checking for the local file. If it exists and hasn't expired, it returns the contents. If it has expired, it attempts to grab the remote file. If there's an issue with the remote file, it falls back to the cached file.
Blog post here: http://weedygarden.net/2012/04/simple-feed-caching-with-php/
Code here: https://github.com/erunyon/FeedCache

Related

Retrieve images from a link

Is there a script or service or snippet or method or anything that can get a thumbnail from a URL? By thumbnail I don't mean a snapshot of the site, but an image that can automatically be fetched and used as the post thumbnail, much like the one used on Facebook. The image should be fetched like this: img src="xxxxxxx?url=google.com" . This would fetch the Google logo.
Maybe there are existing solutions for this, but it's not really hard to implement (a rough sketch follows the lists below):
fetch the remote site, e.g. with file_get_contents
optionally use Tidy to clean up the source HTML
parse the output with an XML parser if you used Tidy to clean the fetched data, or with an HTML parser otherwise
fetch the first n images from the site (n should be a relatively small number)
store this fetched image set in a cache, because the fetching and parsing can take time
Comments:
you may fetch the robots.txt from the site to check whether it's allowed to use/index the content
set a timeout for this remote website fetching, because if the website is down or slow it would timeout on your site as well
limit the concurrent fetching to a site and globally to protect against DoS-ing
you could use an HTTP client that limits the fetched HTML data size, or use the HEAD HTTP method to check the Content-Length before downloading the actual content, if the server allows it
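A rough sketch of the fetch-and-parse steps above, assuming file_get_contents and DOMDocument; Tidy, caching, robots.txt, and timeout handling are left out for brevity, and the url parameter name is made up:
<?php
// Hypothetical: fetch a page and collect the first few image URLs from it.
$url  = isset($_GET['url']) ? $_GET['url'] : 'http://example.com/'; // placeholder
$html = file_get_contents($url); // in practice, use a stream context with a timeout

$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress warnings caused by messy real-world HTML

$images = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    if ($src !== '') {
        $images[] = $src; // may be relative; resolve against $url before using
    }
    if (count($images) >= 5) { // n = 5, a relatively small number
        break;
    }
}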

PHP reading live rss feeds, the fastest way

I'm building a web application that reads RSS feeds from different sites. On every feed I can add a search key in the URL, so the feeds are always different.
Right now I'm using simplexml_load_file, but it takes too long to read the feeds. Today I searched on Stack Overflow and found the XMLReader class. It is a bit faster, but still not fast enough.
My question: is there a faster way to read multiple RSS feeds that are always unique, so the user doesn't have to wait so long?
Check out SimplePie - the library is very easy to use and implements caching that works well.
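A minimal SimplePie sketch with its built-in file cache (the feed URL and cache path are placeholders; the exact setup depends on your SimplePie version):
<?php
// Assumes SimplePie is already included (via Composer's autoloader or simplepie.inc).
$feed = new SimplePie();
$feed->set_feed_url('http://example.com/feed?q=search+term'); // placeholder URL
$feed->set_cache_location(__DIR__ . '/cache');                // must be a writable directory
$feed->set_cache_duration(60 * 15);                           // serve from cache for 15 minutes
$feed->init();

foreach ($feed->get_items(0, 5) as $item) {
    echo $item->get_title(), "\n";
}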
Another thing you can do to improve perceived load time is to load the page without the feed content, then pipe the feeds in with AJAX. If you stick a loader animation image in the content area where the feed will go and start the AJAX request on page load, the user will perceive that your page is loading faster - it will be usable sooner, even if the feeds take the same total amount of time to load. Plus, users who are not there for the feeds can start doing what they need to do without waiting for content they don't even care about.
Why not cache 5 or 6 feeds in files that can just be included at random when there is a user request? That way the user never ends up waiting for a feed to be processed. The cached files can be refreshed every 10 or 15 minutes with a cron job so there is always fresh content.
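A rough sketch of the serving side, assuming the cron job has already written rendered feed fragments into a cache/ directory (paths are placeholders):
<?php
// Hypothetical: serve one pre-built cached feed at random; no parsing at request time.
$cached = glob(__DIR__ . '/cache/*.html');

if (!empty($cached)) {
    readfile($cached[array_rand($cached)]);
} else {
    echo 'No cached feeds available yet.';
}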
OK, I used different RSS parsers like:
- SimplePie
- LastRSS
But the fastest way is to use PHP's XMLReader, because you don't have to read the whole XML file.
I think what you need to do is not read and parse the whole RSS feed, but only fetch part of it. If you use file_get_contents() you can set a length limit so you don't download the whole page, only the first part.
Of course your RSS feed is then truncated and broken. I don't know how your reader reacts to this. Maybe you can have it ignore the problem, or repair the broken end of the feed.
Here you go:
$feed = file_get_contents('http://stackoverflow.com/...', false, null, 0, 1000); // fetch only the first 1000 bytes
$end = strrpos($feed, '</entry>'); // position of the last complete entry's closing tag
echo substr($feed, 0, $end) . '</entry></feed>'; // cut off the partial entry and re-close the feed

Can you get a specific xml value without loading the full file?

I recently wrote a PHP plugin to interface with my phpBB installation which will take my users' Steam IDs, convert them into the community IDs that Steam uses on their website, grab the XML file for that community ID, get the value of avatarFull (which contains the link to the full avatar), download it via cURL, resize it, and set it as the user's new avatar.
In effect it is syncing my forum's avatars with Steam's avatars (Steam is a gaming community/platform and I run a gaming clan). My issue is that reading the value from the XML file takes around a second per user, because the entire XML file is loaded before the value is searched for, and this causes the whole script to take a very long time to complete.
Ideally I want to have my script run several times a day to check each avatarFull value from Steam and see if it has changed (and download the file if it has), but it currently takes just too long for me to tie everything up waiting on it.
Is there any way to have the server serve up just the xml value that I am looking for without loading the entire thing?
Here is how I am calling the value currently:
$xml = @simplexml_load_file("http://steamcommunity.com/profiles/".$steamid."?xml=1");
$avatarlink = $xml->avatarFull;
And here is an example xml file: XML file
The file isn't big. Parsing it doesn't take much time. Your second is mostly spent on network communication.
Since there is no way around this, you must implement a cache. Schedule a script that will run on your server every hour or so, looking for changes. This script will take a lot of time - at least a second for every user; several seconds if the picture has to be downloaded.
When it has the latest picture, it will store it in some predefined location on your server. The scripts that serve your webpage will use this location instead of communicating with Steam. That way they will work instantly, and the pictures will be at most 1 hour out-of-date.
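A rough sketch of such a scheduled script, assuming you have a list of Steam IDs and a local avatars/ directory (both are placeholders):
<?php
// Hypothetical cron script: refresh the cached Steam avatars for all users.
$steamids   = array('76561197960287930', '76561197960287931'); // placeholder: load these from your DB
$avatar_dir = __DIR__ . '/avatars';

foreach ($steamids as $steamid) {
    $xml = @simplexml_load_file("http://steamcommunity.com/profiles/" . $steamid . "?xml=1");
    if ($xml === false || !isset($xml->avatarFull)) {
        continue; // skip profiles that could not be read
    }

    $data = @file_get_contents((string) $xml->avatarFull);
    if ($data !== false) {
        file_put_contents($avatar_dir . '/' . $steamid . '.jpg', $data);
    }

    sleep(1); // be polite to Steam's servers
}
The page-serving code then just points at the file in avatars/ and never talks to Steam directly.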
Added: Here's an idea to complement this: Have your visitors perform AJAX requests to Steam and check if the picture has changed via JavaScript. Do this only for pictures that they're actually viewing. If it has, then you can immediately replace the outdated picture in their browser. Also you can notify your server who can then download the updated picture immediately. Perhaps you won't even need to schedule anything yourself.
You have to read the whole stream to get to the data you need, but it doesn't have to be kept in memory.
If I were doing this with Java, I'd use a SAX parser instead of a DOM parser. I could handle the few values I was interested in and not keep a large DOM in memory. See if there's something equivalent for you with PHP.
SimpleXml is a DOM parser. It will load and parse the entire document into memory before you can work with it. If you do not want that, use XMLReader which will allow you to process the XML while you are reading it from a stream, e.g. you could exit processing once the avatar was fetched.
But as other people have already pointed out elsewhere on this page, with a file as small as the one shown, this is more likely a network latency issue than an XML parsing issue.
Also see Best XML Parser for PHP
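A minimal XMLReader sketch of that idea, assuming $steamid is set as in the question; parsing stops as soon as avatarFull has been read:
<?php
// Stream-parse the profile XML and stop at the first avatarFull element.
$reader = new XMLReader();
$reader->open("http://steamcommunity.com/profiles/" . $steamid . "?xml=1");

$avatarlink = null;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'avatarFull') {
        $avatarlink = $reader->readString(); // text content, CDATA included
        break;                               // no need to parse the rest of the file
    }
}
$reader->close();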
That file looks small enough. It shouldn't take that long to parse. It probably takes that long because of some sort of network problem rather than slow parsing.
If the network is your issue then no amount of trickery will help you :(.
If it isn't the network, then you could try a regex match on the input. That will probably be marginally faster.
Try this expression:
/<avatarFull><!\[CDATA\[(.*?)\]\]><\/avatarFull>/
and read the link from the first group match.
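For example, assuming the raw XML has been fetched into a string:
$raw = file_get_contents("http://steamcommunity.com/profiles/" . $steamid . "?xml=1");
if (preg_match('/<avatarFull><!\[CDATA\[(.*?)\]\]><\/avatarFull>/', $raw, $matches)) {
    $avatarlink = $matches[1]; // the URL inside the CDATA block
}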
You could try the SAX way of parsing (http://php.net/manual/en/book.xml.php), but as I said, since the file is small I doubt it will really make a difference.
You can take advantage of caching the results of simplexml_load_file() somewhere like Memcached or the filesystem. Here is a typical workflow (a sketch follows the list):
check whether the XML file was processed during the last N seconds
on success, return the cached processing results
on failure, get the results via SimpleXML
process them
resize the images
store the results in the cache
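A rough sketch of the filesystem variant, caching just the extracted avatar URL per user (the file names and the N-second window are placeholders):
<?php
// Hypothetical: return the avatarFull URL for a Steam ID, cached for $max_age seconds.
function get_avatar_url($steamid, $max_age = 3600) {
    $cache_file = __DIR__ . '/cache/' . $steamid . '.txt';

    // Reuse the cached result if it is still fresh.
    if (file_exists($cache_file) && filemtime($cache_file) + $max_age > time()) {
        return file_get_contents($cache_file);
    }

    // Otherwise hit Steam, extract the value, and cache it.
    $xml = @simplexml_load_file("http://steamcommunity.com/profiles/" . $steamid . "?xml=1");
    if ($xml === false || !isset($xml->avatarFull)) {
        return false;
    }

    $url = (string) $xml->avatarFull;
    file_put_contents($cache_file, $url);
    return $url;
}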

Any way to display a large amount of external rss feeds on a site, without physically re-scraping them?

IMDb has an individual RSS feed for every single movie that they have listed. I have a site that has a lot of pages associated with movies, and I stored an IMDB id with each one.
I wanted to show top 5 results from each RSS feed, for each individual movie. The feed looks like this:
http://rss.imdb.com/title/tt1013743/news
As you can imagine, IMDB has over a million films indexed, with a large number of them actually active. Many update several times a day. Is there a way to have a live feed of the news, fetched from IMDB, without having my server physically fetch each RSS feed, for each movie, several times a day?
I think the short answer is no.
Unless imdb itself provides such a feed, then something somewhere has to do the work of fetching each feed individually, in order to find the movies with the most recently updated news.
There is an overall site news feed, but I really don't think this does what you want.
I suppose that theoretically you could use Yahoo Pipes to deliver a combined feed, so your server only has to fetch that single feed. However, you'd still need to plumb in every movie feed, or find some way to cycle through them (is the 'tt1013743' part of your RSS URI example incremented for each new film?). Realistically, I've no idea if Pipes could even manage this potentially enormous task. Your best bet may be to contact IMDb and ask for a "Recently Updated" RSS feed to be added.
You can store the Content-Length header for each feed in your database. It is very unlikely that two versions of a feed will have exactly the same byte length, and the worst thing that could happen is that you miss one update, which is not a big problem. This way you only need to send HEAD HTTP requests, which is very cheap. On the server side, you can store the generated cache files compressed (gzcompress) to keep the file size as low as possible. This way you also save the time of XML-parsing the RSS feed again.
In addition, you can try YQL to get only the 5 most recent news items from each feed. Also, make sure to use cURL for fetching the RSS, because it is very flexible and accepts compressed input, so you can reduce your bandwidth usage and transfer time.
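A minimal sketch of the HEAD-request check with cURL; how you store the previously seen Content-Length per feed is up to you (the function name is made up):
<?php
// Hypothetical: decide whether a feed needs re-fetching by comparing its Content-Length.
function feed_changed($feed_url, $last_length) {
    $ch = curl_init($feed_url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD request, no body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_exec($ch);

    $length = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
    curl_close($ch);

    // -1 means the server did not report a Content-Length; treat that as "changed".
    return $length == -1 || (int) $length !== (int) $last_length;
}

// Usage sketch: only download and re-cache the feed when the length differs.
// if (feed_changed('http://rss.imdb.com/title/tt1013743/news', $stored_length)) { /* re-fetch */ }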

PHP: I want to create a page that extracts images from a forum thread, doable? codeigniter?

You have a forum (vBulletin) that has a bunch of images - how easy would it be to have a page that visits a thread, steps through each page, and forwards the images to the user (via AJAX or whatever)? I'm not asking about filtering (that's easy, of course).
doable in a day? :)
I have a site that uses codeigniter as well - would it be even simpler using it?
Assuming this is to be carried out on the server, cURL + regexp are your friends .. and yes .. doable in a day...
There are also some open-source HTML parsers that might make this cleaner.
It depends on where your scraping script runs.
If it runs on the same server as the forum software, you might want to access the database directly and check for image links there. I'm not familiar with vBulletin, but it probably offers a plugin API that allows for high-level database access. That would simplify querying all posts in a thread.
If, however, your script runs on a different machine (or, in other words, is unrelated to the forum software), it would have to act as an HTTP client. It could fetch all pages of a thread (either automatically, by searching for a NEXT link in a page, or manually, by having all pages specified as parameters) and search the HTML source code for image tags (<img .../>).
Then a regular expression could be used to extract the image urls. Finally, the script could use these image urls to construct another page displaying all these images, or it could download them and create a package.
In the second case the script actually acts as a "spider", so it should respect things like robots.txt or meta tags.
When doing this, make sure to rate-limit your fetching. You don't want to overload the forum server by requesting many pages per second. Simplest way to do this is probably just to sleep for X seconds between each fetch.
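A rough sketch of that client approach, assuming the thread's page URLs are passed in explicitly (all URLs are placeholders):
<?php
// Hypothetical: collect image URLs from a list of thread pages, with rate limiting.
$pages = array(
    'http://example.com/forum/thread.php?t=123&page=1',
    'http://example.com/forum/thread.php?t=123&page=2',
);
$images = array();

foreach ($pages as $page_url) {
    $html = @file_get_contents($page_url);
    if ($html === false) {
        continue;
    }

    // Pull the src attribute out of every <img> tag on the page.
    if (preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $matches)) {
        $images = array_merge($images, $matches[1]);
    }

    sleep(2); // rate-limit so we don't hammer the forum server
}

$images = array_unique($images);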
Yes, doable in a day.
Since you already have a working CI setup I would use it.
I would use the following approach:
1) Make a model in CI capable of:
logging in to vBulletin (images are often added as attachments and you need to be logged in before you can download them). Use something like Snoopy.
collecting the URL for the "last page" button using preg_match(), parsing that URL with parse_url() and parse_str(), and generating links from page 1 to the last page (see the sketch at the end of this answer)
collecting the HTML from all generated links. Still using Snoopy.
finding all images in the HTML using preg_match_all()
downloading all images. Still using Snoopy.
moving each downloaded image from a tmp directory into another directory, renaming it imagename_01, imagename_02, etc. if the same image name already exists.
saving the image name and precise byte size in a DB table. Then you can avoid downloading the same image more than once.
2) Make a method in a controller that collects all images
3) Set up a cron job that collects images at regular intervals. wget -O /tmp/useless.html http://localhost/imageminer/collect should do nicely
4) Write the code that outputs pretty HTML for the end user, using the DB table to get the images.
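Here's a rough sketch of the pagination step from point 1, assuming the "last page" URL has already been extracted with preg_match() (the URL format is a placeholder):
<?php
// Hypothetical: expand a vBulletin-style "last page" URL into links for every page.
$last_url = 'http://example.com/forum/showthread.php?t=123&page=17'; // from preg_match()

$parts = parse_url($last_url);
parse_str($parts['query'], $query);   // e.g. array('t' => '123', 'page' => '17')
$last_page = (int) $query['page'];

$page_urls = array();
for ($i = 1; $i <= $last_page; $i++) {
    $query['page'] = $i;
    $page_urls[]   = $parts['scheme'] . '://' . $parts['host'] . $parts['path']
                   . '?' . http_build_query($query);
}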
