I'm building a web application that reads RSS feeds from different sites. On every feed I can add a search key in the URL, so the feeds are always different.
Right now I'm using simplexml_load_file(), but it takes too long to read the feeds. Today I searched on Stack Overflow and found the XMLReader class. This is a bit faster, but still not fast enough.
My question: is there a faster way to read multiple RSS feeds that are always unique, so the user doesn't have to wait so long?
Check out SimplePie - the library is very easy to use and implements caching that works well.
Another thing you can do to speed up perceived load time is to load the page without the feed content, then pipe the feeds in with AJAX. If you stick a loader animation image in the content area where the feed will go and start the AJAX request on page load, the user will perceive that your page is loading faster - it will be usable faster, even if the feeds take the same total amount of time to load. Plus, users who are not there for the feeds can start doing what they need to do without waiting for content they don't even care about.
Why not cache 5 or 6 feeds in files that can just be included at random when there is a user request? That way the user doesn't end up waiting for a feed to be processed. The cached files can be refreshed every 10 or 15 minutes with a cron job so there is always fresh content.
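A rough sketch of serving a random cached feed (the cache/ directory and file names are assumptions; a cron job would rebuild those files):

// Serve one of the cron-refreshed feed caches at random.
$cached = glob(__DIR__ . '/cache/feed-*.html');
if (!empty($cached)) {
    include $cached[array_rand($cached)]; // already-rendered feed markup
} else {
    // No cache built yet: fall back to fetching and parsing a feed live.
}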
Okay, I used different RSS parsers like:
- SimplePie
- LastRSS
But the fastest way is to use PHP's XMLReader, because you don't have to read the whole XML file.
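For example, a minimal XMLReader sketch that stops after the first 10 <item> elements instead of parsing the whole feed (the URL and the limit of 10 are just placeholders):

$reader = new XMLReader();
$reader->open('http://example.com/feed.rss?q=term'); // placeholder: one of the per-search feed URLs
$doc = new DOMDocument();
$items = array();
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
        // Hand just this <item> to SimpleXML for convenient field access.
        $item = simplexml_import_dom($doc->importNode($reader->expand(), true));
        $items[] = array('title' => (string) $item->title, 'link' => (string) $item->link);
        if (count($items) >= 10) {
            break; // stop reading; the rest of the feed is never parsed
        }
    }
}
$reader->close();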
I think what you need to do is not read and parse the whole RSS feed, but get only a part of it. With file_get_contents() you can set a length limit so you don't download the whole page, only the first part.
Of course your RSS feed is then truncated and invalid. I don't know how your reader reacts to this. Maybe you can have it ignore the problem, or repair the broken end of the feed.
Here you go:
$feed = file_get_contents('http://stackoverflow.com/...', false, null, 0, 1000); // only fetch the first 1000 bytes
$end = strrpos($feed, '</entry>'); // position of the last complete </entry> in the fragment
echo substr($feed, 0, $end) . '</entry></feed>'; // drop the truncated tail and close the document again
I have an RSS feed with more than 5000 items in it. In the user interface, I'm trying to build a search feature with the ability to do a custom search based on different categories. When the page first loads I just show the first 10 items, which loads really quickly as it's supposed to, but when a search string is entered with a category selected, the processing is pretty slow. I want to know if there is a way to do this more efficiently than going through each and every feed item every single time.
I'm not adding any code here because I'm looking for ideas for handling/searching such large RSS feeds. So far I have been using PHP (SimpleXML) and JavaScript.
RSS (and XML in general) are great data transport formats, but they are not good formats for random access to that data.
Import the feeds into a database (properly, don't just dump the raw XML in there) such as PostgreSQL or MySQL and use the full text search provided by the database server.
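A rough sketch of that with MySQL and PDO (the feed_items table, its columns, and the connection details are assumptions, not anything from the question):

// Assumed schema: a feed_items table with a FULLTEXT index, e.g.
// CREATE TABLE feed_items (
//     id INT AUTO_INCREMENT PRIMARY KEY,
//     category VARCHAR(100),
//     title TEXT,
//     description TEXT,
//     FULLTEXT KEY ft_search (title, description)
// );

$pdo  = new PDO('mysql:host=localhost;dbname=feeds;charset=utf8mb4', 'user', 'pass');
$stmt = $pdo->prepare(
    'SELECT title, description
       FROM feed_items
      WHERE category = :category
        AND MATCH(title, description) AGAINST(:term IN NATURAL LANGUAGE MODE)
      LIMIT 10'
);
$stmt->execute(array('category' => $_GET['category'], 'term' => $_GET['q']));
$results = $stmt->fetchAll(PDO::FETCH_ASSOC);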
Don't use SimpleXML for this. (In fact, it really shouldn't be used at all). Rather, use the DOMDocument class to parse through your XML.
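For reference, a minimal DOMDocument sketch (feed.xml is a placeholder path):

$doc = new DOMDocument();
$doc->load('feed.xml'); // placeholder; loadXML($string) works for XML you already fetched
foreach ($doc->getElementsByTagName('item') as $item) {
    $titleNode = $item->getElementsByTagName('title')->item(0);
    $title     = $titleNode ? $titleNode->nodeValue : '';
    // match $title (and any other fields you pull out) against the search term here
}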
You can use a session variable to store all the feeds. Also in the background have a polling script which checks for new feed. If you get one, add it to the session. Use the session variable to search the feed.
As part of my web app, I built a system that periodically pulls an RSS feed and scrapes its content. I also look for any image tags present in the feed item, and attempt to pull it to query its size and such to determine which "picture" to use.
Here is a rough sketch of that part of the code:
1. Is there an <image> node? If so, that is the image. Exit.
2. Parse the content of the description node through simplehtmldom and look for any and all img tags.
3. Iterate through all img tags:
   - getimagesize();
   - If the image size is greater than the one I found earlier, use this picture.
4. Exit.
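In PHP, that logic might look roughly like this (a sketch only; it assumes the simple_html_dom library is loaded and that $item is the SimpleXML element for the feed entry):

function pick_item_image($item) // hypothetical helper; $item is a SimpleXML feed entry
{
    // 1. Prefer an explicit <image> node if the feed provides one.
    if (isset($item->image) && (string) $item->image !== '') {
        return (string) $item->image;
    }

    // 2. Otherwise scan the description HTML for <img> tags with simple_html_dom.
    $html = str_get_html((string) $item->description);
    if ($html === false) {
        return null;
    }

    // 3. Check each image's real dimensions (the slow part: one HTTP request per image).
    $best = null;
    $bestArea = 0;
    foreach ($html->find('img') as $img) {
        $size = @getimagesize($img->src);
        if ($size !== false && $size[0] * $size[1] > $bestArea) {
            $bestArea = $size[0] * $size[1];
            $best = $img->src;
        }
    }

    // 4. Done.
    return $best;
}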
At step 3, the script can take a while, especially for feeds that have lots of images for me to check. I assume that each call to getimagesize() takes a certain amount of time and it adds up quickly. I'm not too worried about it taking a long time (although if it could be reduced, that would be best), but the fact that while this script is running, it effectively leaves all other concurrent users hanging until the script has finished.
I'd like to avoid this, but am not too proficient at server admin - perhaps someone could give me some guiding pointers?
Thanks!
Run it on a separate server if you need the performance boost. getimagesize() can really slow things down. I'd recommend running the scraping script on its own server and hosting everything else on your current server.
I am developing a website that will display the most recent item from an RSS feed. However, each time a user accesses the website, I'd like the page to display cached data. This will make the page display much quicker, since I plan on caching 50+ RSS feeds.
My question is, how do I cache an RSS feed, but make sure it updates in the background every 4 hours or so?
Thanks in advance.
Create a cache folder to store all the RSS feeds.
When the page is loaded, check whether the cached file exists; if it doesn't, download the feed and process it.
If the file exists and filemtime($cached_file) + (60 * 60 * 4) is greater than time(), less than 4 hours have passed since the RSS feed was fetched, so display the page as normal. Otherwise, redownload the file and then display it.
There are many tutorials around for parsing RSS feeds in PHP. I prefer using PHP's DOM extension, but there are so many different ways you can do it.
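A minimal sketch of that check (the cache path and the parse_feed() helper are assumptions):

$feed_url    = 'http://example.com/feed.rss';            // placeholder feed URL
$cached_file = __DIR__ . '/cache/' . md5($feed_url) . '.xml';
$max_age     = 60 * 60 * 4;                              // 4 hours

if (!file_exists($cached_file) || filemtime($cached_file) + $max_age < time()) {
    // Cache is missing or older than 4 hours: fetch a fresh copy and store it.
    $xml = file_get_contents($feed_url);
    if ($xml !== false) {
        file_put_contents($cached_file, $xml);
    }
}

// Always render from the cached copy (error handling for a first-ever failed fetch is omitted).
$items = parse_feed(file_get_contents($cached_file));    // parse_feed() is a hypothetical DOM-based parser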
I created a simple PHP class to tackle this issue. Since I'm dealing with a variety of sources, it can handle whatever you throw at it (XML, JSON, etc.). You give it a local filename (for storage purposes), the external feed, and an expires time. It begins by checking for the local file. If it exists and hasn't expired, it returns the contents. If it has expired, it attempts to grab the remote file. If there's an issue with the remote file, it will fall back to the cached file.
Blog post here: http://weedygarden.net/2012/04/simple-feed-caching-with-php/
Code here: https://github.com/erunyon/FeedCache
I recently wrote a PHP plugin to interface with my phpBB installation. It takes my users' Steam IDs, converts them into the community IDs that Steam uses on its website, grabs the XML file for that community ID, reads the value of avatarFull (which contains the link to the full avatar), downloads it via cURL, resizes it, and sets it as the user's new avatar.
In effect it is syncing my forum's avatars with Steam's avatars (Steam is a gaming community/platform and I run a gaming clan). My issue is that reading the value from the XML file takes around a second per user, because the entire XML file is loaded before the value is looked up, and this causes the whole script to take a very long time to complete.
Ideally I want to have my script run several times a day to check each avatarFull value from Steam and check to see if it has changed (and download the file if it has), but it currently takes just too long for me to tie up everything to wait on it.
Is there any way to have the server serve up just the xml value that I am looking for without loading the entire thing?
Here is how I am calling the value currently:
$xml = @simplexml_load_file("http://steamcommunity.com/profiles/".$steamid."?xml=1");
$avatarlink = $xml->avatarFull;
And here is an example xml file: XML file
The file isn't big, and parsing it doesn't take much time. Your second is spent mostly on network communication.
Since there is no way around this, you must implement a cache. Schedule a script that will run on your server every hour or so, looking for changes. This script will take a lot of time - at least a second for every user; several seconds if the picture has to be downloaded.
When it has the latest picture, it will store it in some predefined location on your server. The scripts that serve your webpage will use this location instead of communicating with Steam. That way they will work instantly, and the pictures will be at most 1 hour out-of-date.
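A rough sketch of such a scheduled script (get_forum_steamids(), update_forum_avatar(), and the avatars/ path are assumptions about the surrounding app):

// Run from cron, e.g. hourly: php sync_avatars.php
foreach (get_forum_steamids() as $userId => $steamid) {     // hypothetical helper returning user => Steam ID
    $xml = @simplexml_load_file('http://steamcommunity.com/profiles/' . $steamid . '?xml=1');
    if ($xml === false || empty($xml->avatarFull)) {
        continue;                                            // profile unavailable, try again next run
    }

    $remote   = (string) $xml->avatarFull;
    $urlCache = __DIR__ . '/avatars/' . $steamid . '.url';   // last avatar URL we saw
    $picture  = __DIR__ . '/avatars/' . $steamid . '.jpg';   // predefined location on the server

    // Assumption: the avatarFull URL changes whenever the image changes, so comparing URLs is enough.
    if (!file_exists($picture) || !file_exists($urlCache) || trim(file_get_contents($urlCache)) !== $remote) {
        file_put_contents($picture, file_get_contents($remote));
        file_put_contents($urlCache, $remote);
        update_forum_avatar($userId, $picture);              // hypothetical phpBB resize + update step
    }
}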
Added: Here's an idea to complement this: have your visitors perform AJAX requests to Steam and check via JavaScript whether the picture has changed. Do this only for pictures that they're actually viewing. If one has changed, you can immediately replace the outdated picture in their browser. You can also notify your server, which can then download the updated picture immediately. Perhaps you won't even need to schedule anything yourself.
You have to read the whole stream to get to the data you need, but it doesn't have to be kept in memory.
If I were doing this with Java, I'd use a SAX parser instead of a DOM parser. I could handle the few values I was interested in and not keep a large DOM in memory. See if there's something equivalent for you with PHP.
SimpleXml is a DOM parser. It will load and parse the entire document into memory before you can work with it. If you do not want that, use XMLReader which will allow you to process the XML while you are reading it from a stream, e.g. you could exit processing once the avatar was fetched.
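A minimal sketch of that early exit with XMLReader, stopping as soon as avatarFull has been read:

$reader = new XMLReader();
$reader->open('http://steamcommunity.com/profiles/' . $steamid . '?xml=1');

$avatarlink = null;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'avatarFull') {
        $avatarlink = trim($reader->readString()); // readString() returns the node text, CDATA included
        break;                                     // stop parsing the rest of the document
    }
}
$reader->close();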
But like other people already pointed out elsewhere on this page, with a file as small as shown, this is likely rather a network latency issue than an XML issue.
Also see Best XML Parser for PHP
That file looks small enough, so it shouldn't take that long to parse. It probably takes that long because of some sort of network problem rather than slow parsing.
If the network is your issue then no amount of trickery will help you :(.
If it isn't the network, then you could try a regex match on the input. That will probably be marginally faster.
Try this expression:
/<avatarFull><!\[CDATA\[(.*?)\]\]><\/avatarFull>/
and read the link from the first group match.
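For example (assuming $feed already holds the raw XML fetched with file_get_contents() or cURL):

if (preg_match('/<avatarFull><!\[CDATA\[(.*?)\]\]><\/avatarFull>/', $feed, $matches)) {
    $avatarlink = $matches[1]; // first group: the avatar URL inside the CDATA block
}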
You could try the SAX way of parsing (http://php.net/manual/en/book.xml.php), but as I said, since the file is small I doubt it will really make a difference.
You can take advantage of caching the results of simplexml_load_file() somewhere like memcached or the filesystem. Here is a typical workflow:
- check if the XML file was processed during the last N seconds
- return the processing results on success
- on failure, get the results from simplexml
- process them
- resize images
- store the results in the cache
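A minimal file-based sketch of that workflow (the TTL, cache path, and process_feed() helper are assumptions; memcached would work the same way with get()/set()):

$url       = 'http://example.com/feed.rss';              // placeholder feed URL
$cacheFile = sys_get_temp_dir() . '/feed_' . md5($url) . '.cache';
$ttl       = 300;                                        // "N seconds"

if (file_exists($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
    // Processed recently: return the stored results.
    $results = unserialize(file_get_contents($cacheFile));
} else {
    // Otherwise parse with SimpleXML, process, resize images...
    $xml     = simplexml_load_file($url);
    $results = process_feed($xml);                        // hypothetical: returns plain arrays, handles resizing
    // ...and store the results in the cache.
    file_put_contents($cacheFile, serialize($results));
}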
IMDb has an individual RSS feed for every single movie that they have listed. I have a site that has a lot of pages associated with movies, and I stored an IMDb ID with each one.
I wanted to show the top 5 results from each RSS feed, for each individual movie. The feed looks like this:
http://rss.imdb.com/title/tt1013743/news
As you can imagine, IMDB has over a million films indexed, with a large number of them actually active. Many update several times a day. Is there a way to have a live feed of the news, fetched from IMDB, without having my server physically fetch each RSS feed, for each movie, several times a day?
I think the short answer is no.
Unless IMDb itself provides such a feed, something somewhere has to do the work of fetching each feed individually in order to find the movies with the most recently updated news.
There is an overall site news feed, but I really don't think this does what you want.
I suppose that theoretically you could use Yahoo Pipes to deliver a combined feed; then your server only has to fetch that single feed. However, you'd still need to plumb in every movie feed, or find some way to cycle through them (is the 'tt1013743' part of your RSS URI example incremented for each new film?). Realistically, I've no idea if Pipes could even manage this potentially enormous task. Your best bet may be to contact IMDb and ask for a "Recently Updated" RSS feed to be added.
You can store the Content-Length header information in your database for each release. It is very unlikely that two releases will have the exact same byte length, and the worst that could happen is losing an update, which is not a big problem. This way you only need to send HEAD HTTP requests, which are very cheap. On the server side, you can store the generated cache files compressed (gzcompress) to keep the file size as low as possible. This way you also save the time spent XML-parsing the RSS feed.
In addition, you can try YQL to get only the 5 most recent news items from the feed. Also, make sure to use cURL for fetching the RSS, because it is very flexible and accepts compressed input, so you can reduce your bandwidth usage and transfer time.
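A rough sketch of that HEAD check with cURL (stored_length() stands in for whatever database lookup you use):

$url = 'http://rss.imdb.com/title/tt1013743/news';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD request: headers only, no body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_ENCODING, '');         // accept any compression cURL supports
curl_exec($ch);
$length = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
curl_close($ch);

// Compare against the length stored in the database for this title.
if ($length > 0 && $length != stored_length('tt1013743')) { // stored_length() is a hypothetical lookup
    // The feed has (very likely) changed: fetch the full body, parse it,
    // and overwrite the gzcompress()ed cache file.
}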