Searching an RSS feed - PHP

I have an RSS feed with more than 5000 items in it. In the user interface, I'm trying to provide a search feature with the ability to do a custom search based on different categories. When the page first loads I show just the first 10 items, which loads as quickly as it should, but when a search string is entered with a category selected, the processing is pretty slow. I want to know if there is a way to do this more efficiently than going through each and every feed item every single time.
I'm not adding any code here because I'm looking for ideas for handling/searching such large RSS feeds. So far I have been using PHP (SimpleXML) and JavaScript.

RSS (and XML in general) is a great data transport format. It is not a good format for random access to the data it carries.
Import the feeds into a database such as PostgreSQL or MySQL (properly - don't just dump the raw XML in there) and use the full-text search provided by the database server.
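A minimal sketch of that approach with MySQL and PDO; the table, column, and connection details are assumptions (at the time, MySQL's FULLTEXT indexes required the MyISAM engine):

    <?php
    // Schema sketch:
    // CREATE TABLE feed_items (
    //     id INT AUTO_INCREMENT PRIMARY KEY,
    //     guid VARCHAR(255) UNIQUE,
    //     category VARCHAR(100),
    //     title TEXT,
    //     description TEXT,
    //     FULLTEXT (title, description)
    // ) ENGINE=MyISAM;

    $pdo = new PDO('mysql:host=localhost;dbname=feeds;charset=utf8', 'user', 'pass');

    // Let the database do the searching instead of walking every item in PHP.
    $stmt = $pdo->prepare(
        'SELECT title, description
           FROM feed_items
          WHERE category = :category
            AND MATCH(title, description) AGAINST (:query)
          LIMIT 10'
    );
    $stmt->execute([':category' => $_GET['category'], ':query' => $_GET['q']]);
    $results = $stmt->fetchAll(PDO::FETCH_ASSOC);

With 5000+ items this stays fast because the FULLTEXT index is consulted rather than every row being scanned on each request.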

Don't use SimpleXML for this (in fact, it really shouldn't be used at all). Instead, use the DOMDocument class to parse your XML.
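For the parsing side, a minimal DOMDocument sketch; the feed URL is a placeholder:

    <?php
    $doc = new DOMDocument();
    $doc->load('http://example.com/feed.rss'); // placeholder URL

    foreach ($doc->getElementsByTagName('item') as $item) {
        $title       = $item->getElementsByTagName('title')->item(0)->nodeValue;
        $description = $item->getElementsByTagName('description')->item(0)->nodeValue;
        $link        = $item->getElementsByTagName('link')->item(0)->nodeValue;
        // ...insert the item into the database here...
    }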

You can use a session variable to store all the feed items. Also have a polling script running in the background which checks for new items; if you get one, add it to the session. Then search the session variable instead of re-parsing the feed.
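A rough sketch of that idea; load_feed_items() and the item structure are hypothetical, and note that session data is stored per user, so every visitor carries their own copy:

    <?php
    session_start();

    // Parse the feed once per session; the background poller would append new items.
    if (!isset($_SESSION['feed_items'])) {
        $_SESSION['feed_items'] = load_feed_items(); // hypothetical loader
    }

    // Search the cached items instead of re-parsing the XML on every request.
    $query   = strtolower($_GET['q']);
    $matches = array_filter($_SESSION['feed_items'], function ($item) use ($query) {
        return strpos(strtolower($item['title'] . ' ' . $item['description']), $query) !== false;
    });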

Related

Using keywords with RSS feed links

I learned from some sites (such as from this) that I can use keywords with RSS feed links, but I have no success when I try this with a particular site. Is this way of accessing RSS feeds a general one, or is it limited somehow?
RSS is a data format, just like HTML.
If you want to get different data, then the server has to generate different data.
That requires server-side code designed to generate different RSS (or different static files to be created).

How to make PHP pages load faster when fetching bulk information?

Basically I am building a website that does web scraping, fetching particular web pages from around 8 different websites to extract prices. I am using the file_get_html() function of PHP Simple HTML DOM Parser extensively to fetch the page source into a string variable and extract the price information from it.
Now the main problem is that the page which shows the price information from all the different sites takes a very long time to load.
So my questions are:
1. How can I make the page load faster?
2. How can I load the page in steps, so that information which has already been fetched is displayed while the rest loads subsequently, like Google Image Search?
Don't fetch the data on page load; do it in a background job (a cronjob?) and save it in the database.
Then you will only have to retrieve the data from the database. Additionally you could show a timestamp of when the data was retrieved and/or give the user the ability to manually update (re-fetch) the data.
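A minimal cron-driven sketch of that; the table, selectors, and URLs are all placeholders, and `site` is assumed to be a unique key so REPLACE overwrites the previous row:

    <?php
    // fetch_prices.php -- run from cron, e.g.: */15 * * * * php /path/to/fetch_prices.php
    require 'simple_html_dom.php';

    $pdo   = new PDO('mysql:host=localhost;dbname=prices;charset=utf8', 'user', 'pass');
    $sites = ['http://example-shop-1.com/product', 'http://example-shop-2.com/product'];

    foreach ($sites as $url) {
        $html  = file_get_html($url);
        $price = $html->find('.price', 0)->plaintext; // selector is a placeholder

        // One row per site, overwritten on each run together with a timestamp.
        $stmt = $pdo->prepare('REPLACE INTO prices (site, price, fetched_at) VALUES (?, ?, NOW())');
        $stmt->execute([$url, trim($price)]);
    }

The page itself then only does a single SELECT against the prices table, which is fast regardless of how slow the scraped sites are.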
Well, first off you can use cURL instead of file_get_html(); it's easy and very configurable, and it's faster than the Simple HTML DOM function. Obviously you will then have to convert the string into a DOM object using the Simple HTML DOM function str_get_html().
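A sketch of that combination; the URL and selector are placeholders:

    <?php
    require 'simple_html_dom.php';

    $ch = curl_init('http://example.com/product');  // placeholder URL
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_ENCODING, '');         // accept gzip/deflate responses
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // don't hang on a slow site
    $source = curl_exec($ch);
    curl_close($ch);

    // Convert the raw string into a DOM object, as described above.
    $html  = str_get_html($source);
    $price = $html->find('.price', 0)->plaintext;   // selector is a placeholder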

One huge XML file with long SQL statement or not?

I have a database with about 10 tables and they are all interconnected in some way (foreign keys, associative tables).
I want to use all that data to plot it on my instance of Google Maps as markers with infoboxes, and be able to filter the map.
Judging from the Google Maps articles, you have to use XML with the data from the database to plot the markers and everything else.
But what would be the right way to generate that XML file? Have a huge SQL statement generate one XML file with the entire data from all tables when the web page and map load, or is there a more correct approach?
You in no way have to use XML to place markers on an instance of Google Maps. You could, but you don't have to if it seems difficult. I work a lot with the Google Maps V3 API and I would recommend you export your data to JSON and embed it in your document using PHP or make it available for JavaScript to load using Ajax.
Creating interactive Markers from the data is REALLY easy. You just need to iterate over your data, create a Marker object for each point you want on the map, supply some HTML you want displayed in the info window and show that info window on the Marker's click event.
Instead of walking you through with teaspoon accuracy I'll refer you to the Google Maps API v3 beginner tutorial which among other things includes examples of how to create Markers and display them on the map.
Fun fact: you can control which icon is displayed for each marker (you can supply a URL to any image you want), as well as make them bounce. To summarize, you have way more control using JavaScript than if you went with XML.
Regarding performance, I would heed cillosis' advice and cache your MySQL data in whichever format you end up choosing. If you were to go with JSON, you could cache the result of that as well. You can simply save the output in a file called something like "mysql-export-1335797013.json". That number is a Unix timestamp from which you can work out when the data needs to be refreshed.
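A sketch of that caching scheme; the table, columns, and one-hour lifetime are assumptions:

    <?php
    $maxAge  = 3600; // refresh the export after one hour
    $exports = glob('mysql-export-*.json');
    $latest  = $exports ? max($exports) : null;

    // The filename embeds the Unix timestamp of the export.
    $exportedAt = $latest ? (int) preg_replace('/\D/', '', $latest) : 0;

    if (time() - $exportedAt > $maxAge) {
        $pdo  = new PDO('mysql:host=localhost;dbname=map;charset=utf8', 'user', 'pass');
        $rows = $pdo->query('SELECT id, lat, lng, title FROM markers') // hypothetical table
                    ->fetchAll(PDO::FETCH_ASSOC);
        $latest = sprintf('mysql-export-%d.json', time());
        file_put_contents($latest, json_encode($rows));
    }

    header('Content-Type: application/json');
    readfile($latest); // JavaScript loads this via Ajax and builds the Markers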
Use SQL the first time to generate the XML for a specific query, and then cache that XML output for later use. The very first time it may be slow, but after that it will already be generated and will be really fast.
If you want to use XML because PHP and AJAX make it relatively easy, then do; that's why the examples use it. But you are definitely not restricted to XML. JSON is commonly used because it's also easy with PHP, a smaller download than XML, and delivered in a form which is directly usable by JavaScript. Or you could use anything else which can be manipulated by your page's JavaScript.
With regard to whether to use one humongous query and data download or not, you don't have to do that either. You could, but it might be slow: not only running the query but also transferring the data, where caching the query results won't help. Your users would have to wait for the data to arrive and then be processed by your JavaScript before anything appears on the map.
You might consider doing a fairly simple query to get basic data to display so the users get something to see reasonably quickly, following that up with more queries, perhaps as data is required. There is no point in downloading loads of InfoWindow data if the user is not going to click and see it. In that instance, deliver marker data and basic InfoWindow data and only get detailed data if the user actually requests it (that is, use a two-stage InfoWindow).
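The second stage can be a tiny endpoint that is only hit when a marker is actually clicked; the script name, table, and columns here are hypothetical:

    <?php
    // infowindow.php?id=123 -- requested via Ajax from the marker's click handler,
    // so detailed InfoWindow content is never downloaded for markers nobody opens.
    $pdo  = new PDO('mysql:host=localhost;dbname=map;charset=utf8', 'user', 'pass');
    $stmt = $pdo->prepare('SELECT title, description, photo_url FROM places WHERE id = ?');
    $stmt->execute([(int) $_GET['id']]);

    header('Content-Type: application/json');
    echo json_encode($stmt->fetch(PDO::FETCH_ASSOC));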

Any way to display a large number of external RSS feeds on a site, without physically re-scraping them?

IMDb has an individual RSS feed for every single movie they have listed. I have a site with a lot of pages associated with movies, and I stored an IMDb id with each one.
I wanted to show top 5 results from each RSS feed, for each individual movie. The feed looks like this:
http://rss.imdb.com/title/tt1013743/news
As you can imagine, IMDb has over a million films indexed, with a large number of them actually active. Many update several times a day. Is there a way to have a live feed of the news, fetched from IMDb, without my server physically fetching each RSS feed, for each movie, several times a day?
I think the short answer is no.
Unless IMDb itself provides such a feed, something somewhere has to do the work of fetching each feed individually in order to find the movies with the most recently updated news.
There is an overall site news feed, but I really don't think this does what you want.
I suppose that theoretically you could use Yahoo Pipes to deliver a combined feed, so your server only has to fetch that single feed. However, you'd still need to plumb in every movie feed, or find some way to cycle through them (is the 'tt1013743' part of your RSS URI example incremented for each new film?). Realistically, I've no idea if Pipes could even manage this potentially enormous task. Your best bet may be to contact IMDb and ask for a "Recently Updated" RSS feed to be added.
You can store the Content-Length header information in your database for each release. It is very unlikely that two releases will have exactly the same byte length, and the worst that could happen is losing an update, which is not a big problem. This way you only need to send HTTP HEAD requests, which is very cheap. On the server side, you can store the generated cache files compressed (gzcompress) to keep the file size as low as possible. This way you also save the time of parsing the RSS feed's XML.
In addition, you can try YQL to get only the 5 most recent news items from the feed. Also, make sure to use cURL for fetching the RSS, because it is very flexible and accepts compressed input, so you can reduce your bandwidth usage and transfer time.
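A sketch of the HEAD-request check with cURL; get_stored_length() and update_feed() are hypothetical database helpers:

    <?php
    // Cheap change detection: compare the Content-Length header against the
    // value stored from the previous fetch before downloading the whole feed.
    $url = 'http://rss.imdb.com/title/tt1013743/news';

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // send a HEAD request only
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    $length = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD);
    curl_close($ch);

    if ($length != get_stored_length($url)) {       // hypothetical DB lookup
        // The feed (very probably) changed: do the full GET, parse it, and
        // store the cache file compressed with gzcompress() as suggested above.
        update_feed($url, $length);                 // hypothetical
    }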

How to create RSS feeds (.xml) for our own dynamic site using PHP and MySQL?

I've tried every method that I know, but haven't found a solution for how to create RSS feeds and a sitemap for a dynamic site that get updated automatically.
The basic idea is to have your PHP file generate valid RSS (XML) based on your database results. So, select the data you want to show from the MySQL db, and output it conforming to the RSS standard.
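A minimal sketch of such a script; the table and column names are assumptions:

    <?php
    // rss.php -- emits an RSS 2.0 feed straight from the database.
    $pdo   = new PDO('mysql:host=localhost;dbname=site;charset=utf8', 'user', 'pass');
    $items = $pdo->query(
        'SELECT title, url, summary, published_at FROM articles
          ORDER BY published_at DESC LIMIT 20'
    )->fetchAll(PDO::FETCH_ASSOC);

    header('Content-Type: application/rss+xml; charset=utf-8');
    echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    ?>
    <rss version="2.0">
      <channel>
        <title>My Dynamic Site</title>
        <link>http://example.com/</link>
        <description>Latest articles</description>
        <?php foreach ($items as $item): ?>
        <item>
          <title><?= htmlspecialchars($item['title']) ?></title>
          <link><?= htmlspecialchars($item['url']) ?></link>
          <description><?= htmlspecialchars($item['summary']) ?></description>
          <pubDate><?= date(DATE_RSS, strtotime($item['published_at'])) ?></pubDate>
        </item>
        <?php endforeach; ?>
      </channel>
    </rss>

Since the feed is built from the database on every request, it updates automatically whenever new rows are inserted.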
Check the top three Google results.
For the sitemap it will be a little harder, and it greatly depends on your structure, which is unknown to us. But the principle is the same: output a valid XML file conforming to the sitemap standard, based on the pages you want to show.
