What I'm trying to do:
Fetch a number of RSS feeds from my blogs and echo only the new entries. My problem is: how do I know which items have already been parsed?
Solution so far:
Fetch each feed every 5 hours and store all titles in a database table or flat file. On the next run, check whether each title is already in the database; if not, print it and save it to the database.
But I'm not sure whether this is best practice.
If someone knows a faster way, that would be great. Sorry for my poor English.
If the blog entries you are parsing have some date indicator, add a field called CREATED of type DATETIME to your database and save that date value there. Then, when you parse, select the latest value with SELECT MAX(CREATED) FROM posts and don't insert anything that is not newer than that date.
This solution has a slight drawback if some of your blogs publish to their RSS with a delay but keep the past date as the timestamp: such entries would be skipped.
I think you should store the date of the last post you fetched. The next time you fetch, you can collect only the ones that are newer than the date you stored.
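A minimal sketch of the date-based approach described above, written in Python rather than PHP to keep it short. It assumes every item carries an RFC 822 pubDate; the function name newer_items and the sample feed are hypothetical, and last_seen stands in for the value you would read from your own table.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
import xml.etree.ElementTree as ET


def newer_items(rss_text, last_seen):
    """Return (title, published) pairs newer than last_seen, oldest first."""
    root = ET.fromstring(rss_text)
    fresh = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        if pub is None:
            continue  # no date to compare, so this approach cannot classify the item
        published = parsedate_to_datetime(pub)
        if published > last_seen:
            fresh.append((item.findtext("title"), published))
    return sorted(fresh, key=lambda pair: pair[1])


# Hypothetical feed used only for illustration.
SAMPLE = """<rss version="2.0"><channel>
<item><title>Old post</title><pubDate>Wed, 07 Apr 2010 09:00:00 GMT</pubDate></item>
<item><title>New post</title><pubDate>Fri, 09 Apr 2010 09:00:00 GMT</pubDate></item>
</channel></rss>"""

# In practice this would come from SELECT MAX(CREATED) on your own table.
last_seen = datetime(2010, 4, 8, tzinfo=timezone.utc)
for title, published in newer_items(SAMPLE, last_seen):
    print(title, published.isoformat())
```

Items without a pubDate are skipped here, which is one reason the guid-based approach may be more robust.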
I believe that the usual practice is to work off of the guid element in the RSS feed. This is sometimes the URI of the source article, sometimes a number, sometimes a traditional GUID.
Using this element to see if you have already received an article will negate the need to parse for a date and this is how Google Reader usually determines if an item has already been collected.
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
<atom:link href="http://www.stevefenton.co.uk/RSS/Blog/" rel="self" type="application/rss+xml" />
<title>Steve Fenton Blog</title>
<link>http://www.stevefenton.co.uk/RSS/Blog/</link>
<description>Blog</description>
<language>en</language>
<copyright>Copyright 2008 - 2010 Steve Fenton</copyright>
<category>Blog</category>
<generator>Swift Point Content Management System</generator>
<ttl>60</ttl>
<managingEditor>info#stevefenton.co.uk (Site Admin)</managingEditor>
<item>
<title><![CDATA[Jquery Plugin Infinite Scroller With AJAX]]></title>
<link>http://www.stevefenton.co.uk/Content/Blog/Date/201004/Blog/Jquery-Plugin-Infinite-Scroller-With-AJAX/</link>
<description><![CDATA[Friday, 9th April 2010 - Jquery Plugin Infinite Scroller With AJAX <p>I have just finished a new plugin for the jQuery framework.</p><p>The jQuery Infinite Scroller is a great way to deliver a really long list of things, in smaller chunks. For example, if you were displaying articles you could load a page with the first 10 results, then dynamically add more results to the bottom of the list when people start scrolling down. The further they scroll, the more articles you add - thus making it theoretically infinite.</p><p>When the plugin detects that no more results are available, it stops trying to get more items to add.]]> <a href="http://www.stevefenton.co.uk/Content/Blog/Date/201004/Blog/Jquery-Plugin-Infinite-Scroller-With-AJAX">View Details</a>.</description>
<guid>http://www.stevefenton.co.uk/Content/Blog/Date/201004/Blog/Jquery-Plugin-Infinite-Scroller-With-AJAX</guid>
</item>
<item>
<title><![CDATA[Auto Load Your PHP Classes]]></title>
<link>http://www.stevefenton.co.uk/Content/Blog/Date/201004/Blog/Auto-Load-Your-PHP-Classes/</link>
<description><![CDATA[Wednesday, 7th April 2010 - Auto Load Your PHP Classes <p>In PHP5 you can create classes to organise your code and represent objects that you want to pass around. This has long been a feature of other languages and was a fundamentally important step forward for PHP.</p><p>There was one thing, though, that I didn't like about PHP classes. If I wanted to instantiate a new "Customer" or "Product", I had to make sure that I included the PHP file that contained the "Customer" or "Product" class. This meant doing this:</p><p>[[#CODE:php:<br>include_once 'classes/Customer.php';</p>]]> <a href="http://www.stevefenton.co.uk/Content/Blog/Date/201004/Blog/Auto-Load-Your-PHP-Classes">View Details</a>.</description>
<guid>http://www.stevefenton.co.uk/Content/Blog/Date/201004/Blog/Auto-Load-Your-PHP-Classes</guid>
</item>
</channel>
</rss>
Every feed item has a unique ID (the guid element) associated with it. You can check that ID and store it in the database instead of storing the title.
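As a rough illustration of storing the guid instead of the title, here is a sketch in Python using an SQLite table. The table name seen, the function name unseen_items, and the fallback to the link element when a guid is missing are all assumptions made for the example, not part of any feed standard.

```python
import sqlite3
import xml.etree.ElementTree as ET


def unseen_items(conn, rss_text):
    """Return titles of items whose guid is not stored yet, and remember them."""
    conn.execute("CREATE TABLE IF NOT EXISTS seen (guid TEXT PRIMARY KEY)")
    fresh = []
    for item in ET.fromstring(rss_text).iter("item"):
        # Assumption: fall back to <link> when a feed omits <guid>.
        guid = item.findtext("guid") or item.findtext("link")
        if guid is None:
            continue
        guid = guid.strip()
        known = conn.execute("SELECT 1 FROM seen WHERE guid = ?", (guid,)).fetchone()
        if known is None:
            conn.execute("INSERT INTO seen (guid) VALUES (?)", (guid,))
            fresh.append(item.findtext("title"))
    conn.commit()
    return fresh


# Hypothetical feed used only for illustration.
FEED = """<rss version="2.0"><channel>
<item><title>First</title><guid>http://example.com/posts/1</guid></item>
<item><title>Second</title><guid>http://example.com/posts/2</guid></item>
</channel></rss>"""

conn = sqlite3.connect(":memory:")
print(unseen_items(conn, FEED))  # first run: every item is new
print(unseen_items(conn, FEED))  # second run: nothing is new
```

The primary key on guid keeps lookups fast even when the table grows, and no date parsing is needed.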
Try reading the PubSubHubbub docs: http://superfeedr.com/documentation#pubsubhubbub
Related
I'm fairly new to PHP, and I'm trying to write a script that solves the following problem.
I have an RSS feed that gets saved to my server every 10 minutes (copied from elsewhere).
There is a problem with the timestamps (pubDate tag) on the RSS feed: they always have the correct date but 00:00:00 GMT as the time (I have no control over this).
Therefore, when I use an auto-tweeting program to tweet updates from the feed (it checks the feed every hour or so), it only tweets the first update of each day.
So what I'm trying to do to fix it, to some degree, is to check whether the feed has changed and, if it has, change the saved pubDate to the current server time on the new items only.
I'm also kind of confused as to how I can have it check for changes - If I have a corrected version (with fairly accurate timestamps) saved to my server, it will ALWAYS think there are changes, because the timestamps will always be 00:00:00. I'm thinking, check both feeds for items including the full strings such as <guid isPermaLink="true">http://services.runescape.com/m=adventurers-log/a=161/display_player_profile.ws?searchName=A13d&id=-463827091</guid> - Since the id= at the end stays constant, it would only change the <pubDate> of items found to be new.
http://services.runescape.com/m=adventurers-log/a=161/rssfeed?searchName=A13d Here is a feed as an example. If anyone could get me started or point me to some kind of tutorial that might help, I'd really appreciate it. This is over my head, but something I need to learn how to do.
Maybe there is something wrong with your code parsing the timestamp, date format perhaps?
I believe that doing full string comparisons (<title> and <description>) between items with the same <guid> is your best bet. Here is some reading about RSS duplicate detection if you are interested.
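One way the guid matching could be used to fix the timestamps, sketched in Python rather than PHP: keep the pubDate already saved for every known guid and give the current time only to guids that have not been seen before. The function name stamp_new_items and the sample guids are hypothetical, and the sketch assumes now is a timezone-aware UTC datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import xml.etree.ElementTree as ET


def stamp_new_items(saved_xml, incoming_xml, now):
    """Keep saved pubDates for known guids; give unseen guids the time `now`."""
    saved = {
        item.findtext("guid"): item.findtext("pubDate")
        for item in ET.fromstring(saved_xml).iter("item")
    }
    root = ET.fromstring(incoming_xml)
    for item in root.iter("item"):
        pub = item.find("pubDate")
        if pub is None:
            pub = ET.SubElement(item, "pubDate")
        guid = item.findtext("guid")
        # Known guid: reuse the corrected time. New guid: stamp it with `now`.
        pub.text = saved.get(guid) or format_datetime(now, usegmt=True)
    return ET.tostring(root, encoding="unicode")


# Hypothetical feeds: SAVED is the corrected copy, INCOMING is the fresh download.
SAVED = """<rss><channel>
<item><guid>http://example.com/log?id=1</guid><pubDate>Mon, 01 Jul 2013 09:15:00 GMT</pubDate></item>
</channel></rss>"""
INCOMING = """<rss><channel>
<item><guid>http://example.com/log?id=2</guid><pubDate>Mon, 01 Jul 2013 00:00:00 GMT</pubDate></item>
<item><guid>http://example.com/log?id=1</guid><pubDate>Mon, 01 Jul 2013 00:00:00 GMT</pubDate></item>
</channel></rss>"""

now = datetime(2013, 7, 1, 14, 30, tzinfo=timezone.utc)
print(stamp_new_items(SAVED, INCOMING, now))
```

Because the comparison is on the guid and not on the timestamp, the always-midnight pubDate in the source feed no longer makes every item look changed.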
I'm looking into the possibility of efficiently comparing two similar XML files and updating outdated information.
The main XML file I'm working with is about 200-250 MB in size. The second is a tad smaller.
The two XML files pretty much look like this:
<product>
<Category>BOOK</Category>
<Bookgroup>BOOKF</Bookgroup>
<Productname>Name of the book</Productname>
<Productcode>123456789</Productcode>
<Price>79.00</Price>
<Availability>Stock On Order</Availability>
<ProductURL>www.url.com</ProductURL>
<Release>07.08.2013</Release>
<Author>Name of author</Author>
<Genre>Crime</Genre>
<BookType>Pocket</BookType>
<Language>English</Language>
</product>
As you can see I'm working with books, and the purpose of having a second XML-file with the same information is that I only want one copy of each book for further use.
Basically I'm trying to figure out how I can efficiently parse the first XML and check whether each book exists in the second XML. If it exists, I'll check whether the product information (price, availability, etc.) has been updated. If it has, it needs to be updated in the second XML as well.
If it doesn't exist, it needs to be added to the second XML.
Using XMLReader I'm able to parse each book from the first XML fairly fast (about 40 seconds to loop through 4.5 million lines of XML and echo out all the books) by using an approach similar to this.
My problem occurs when I want to check if this book exists in the second XML and make changes in the second XML if it needs to be updated or added.
Would it for example be possible to use XMLReader on the second XML and stop at nodes with the same booktitle as I've stopped at in the first XML and then make the check? If so how?
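Instead of scanning the second file once per book, one option is to stream the second file a single time into an in-memory index keyed by Productcode, and then stream the first file against that index. The sketch below shows the idea in Python with a streaming parser. It assumes Productcode is unique, that the products sit under some root element (called products here, which is a guess), and that only Price and Availability need comparing; the function names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

FIELDS = ("Price", "Availability")  # fields assumed to change between exports


def load_index(source):
    """Stream a catalogue and map Productcode -> tuple of tracked fields."""
    index = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "product":
            code = elem.findtext("Productcode")
            index[code] = tuple(elem.findtext(f) for f in FIELDS)
            elem.clear()  # drop the parsed children to keep memory use low
    return index


def diff_catalogues(main_source, index):
    """Yield ('add' | 'update', Productcode) for books that differ from index."""
    for _event, elem in ET.iterparse(main_source, events=("end",)):
        if elem.tag == "product":
            code = elem.findtext("Productcode")
            current = tuple(elem.findtext(f) for f in FIELDS)
            if code not in index:
                yield "add", code
            elif index[code] != current:
                yield "update", code
            elem.clear()


# Tiny hypothetical catalogues standing in for the two large files.
MAIN = b"""<products>
<product><Productcode>1</Productcode><Price>79.00</Price><Availability>In Stock</Availability></product>
<product><Productcode>2</Productcode><Price>99.00</Price><Availability>In Stock</Availability></product>
<product><Productcode>3</Productcode><Price>59.00</Price><Availability>Stock On Order</Availability></product>
</products>"""
SECOND = b"""<products>
<product><Productcode>1</Productcode><Price>89.00</Price><Availability>In Stock</Availability></product>
<product><Productcode>2</Productcode><Price>99.00</Price><Availability>In Stock</Availability></product>
</products>"""

index = load_index(io.BytesIO(SECOND))
for action, code in diff_catalogues(io.BytesIO(MAIN), index):
    print(action, code)
```

The sketch only reports what needs to change; writing the updated second file is left out. The same two-pass idea should carry over to XMLReader in PHP with an associative array as the index.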
I would like to use the GoogleNews XML Feed and use some PHP to style it differently for creating morning news summaries.
QUESTIONS
Is it possible to search for a series of phrases in one XML URL? Only one phrase needs to match for an item to be returned, but all of the phrases are part of the search.
e.g. Fiscal+Cliff,US+Debt
The feed URL should fetch only the last 24 hours, but my query does not. The problem is with the last two parameters. What needs to be done to fix it?
xml = http://news.google.com/news?output=rss&num=100&q=fiscal+cliff&as_drrb=q&as_qdr=d
I then want to fetch the <title>, <url> and if possible <author> of each article
Then I want each URL to be used for the PHP to fetch a caption and an image.
$item[title], $item[url], $item[author], $item[image_src], $item[caption]
I would then echo this information how I want it set up on the page. How do I do this?
http://www.queness.com/post/8743/learn-how-to-read-parse-and-display-xml-data-in-random-order-with-jquery
Use Google, as this question is not unique; you can use the XML DOM to fetch, parse and display the data.
Thanks in advance to those who will lend their time answering this question.
I'm trying to display data from two external XML files on my page. Let us say the fictional locations of these XMLs are www.ExampleDomain1.com/xml-1.xml and www.ExampleDomain2.com/xml-2.xml respectively. The two XMLs have different element tags but similar contents. Here's the example:
XML-1.xml
<property>
<type>SP</type>
<subtype>Apartment</subtype>
<refno>011248</refno>
<title>Fantastic Facilities!</title>
<description> Offering this fantastic 2 bedroom apartment set within this popular building in The Views. The property offers in excess of 1450sqft of internal living space comprising of two double bedrooms (en-suite to master), fitted kitchen with integrated appliances, main bathroom and spacious lounge/diner leading onto a good size balcony with partial views of the golf course and views of the Marina skyline.
</description>
<size>1458</size>
<sizeunits>SqFt</sizeunits>
<price>1525000</price>
<pricecurrency>AED</pricecurrency>
<totalclosingfee>1525000</totalclosingfee>
<bedrooms>2</bedrooms>
<bathrooms>2</bathrooms>
<locationtext>The Views</locationtext>
<locationlat>25.090200</locationlat>
<locationlon>55.170200</locationlon>
<developer>0</developer>
<lastupdated>2011-02-18 20:15:08</lastupdated>
<photos>
<photo>
http://www.ExampleDomain1.com/images/med_imgga94jqe351494b66b20eaec11fe501f5bdf797f4.jpg
</photo>
<photo>
http://www.ExampleDomain1.com/images/med_imgga94l1maf3ccdd27d2000e3f9255a7e3e2c48800.jpg
</photo>
</photos>
</property>
XML-2.xml
<listings>
<category>SP</category>
<subcategory>Apartment</subcategory>
<reference>011250</reference>
<title>Fantastic Facilities!</title>
<description> A fantastic 1 bedroom apartment in the exclusive Downtown area. The property offers 850 sq.ft. of internal living space. Fantastic layout and personalized design. Externally the property has an easy accessible carport with parking for one.
</description>
<size>1200</size>
<unitsize>SqFt</unitsize>
<price>905000</price>
<currency>AED</currency>
<closingfee>1525000</closingfee>
<bedrooms>2</bedrooms>
<bathrooms>2</bathrooms>
<location> Downtown </location>
<locationlon>55.170200</locationlon>
<developer>0</developer>
<updated>2011-02-18 20:15:08</updated>
<photos>
<photo> http://www.ExampleDomain2.com/images/med_imgga94jqe351494b66b20eaec11fe501f5bdf797f4.jpg
</photo>
<photo> http://www.ExampleDomain2.com/images/med_imgga94l1maf3ccdd27d2000e3f9255a7e3e2c48800.jpg
</photo>
</photos>
</listings>
From the above example you can clearly see that the XML tags differ even though both files hold the same kind of content. Someone will ask: why not clean up the XML before fetching it? The problem is that both XMLs are produced by proprietary software that has no option to change the output. So matching the element tags first is not viable; besides, doing it manually would be a huge task, especially when the data is large.
To make things more complicated, I want to fetch both XMLs and display both sets of results together on one page. I want to load both XMLs simultaneously and display them as search results.
I've been burning my eyes for a week now, and the closest code example I've found is this: http://net.tutsplus.com/tutorials/javascript-ajax/use-jquery-to-retrieve-data-from-an-xml-file/. The problem with this example is that it loads only one XML from a local directory. What if I want to load many XMLs at once from external sources?
To elaborate, what exactly I want to achieve is this:
Make a search form that fetches both XMLs and displays the results on one page.
Make the images appear, as the XML in the tutorial above has a different structure from the XML that I have.
Which technology do I need to know to accomplish this (PHP, jQuery, AJAX, or a combination of the three)?
Can it be achieved even if the XMLs would not be stored in a database?
As a newbie in PHP who is just following some jQuery script examples, without your help this problem would take me ages to solve. I'm confident in HTML and CSS but not on the programming side.
Can you please show me the path I need to follow (example code)? Or is there a genius out there who could throw me the exact scientific code that I'm looking for?
Many thanks,
Mike
Ok. THAT seems to be quite complicated. But actually you can achieve this with a medium amount of work.
I will not provide you with any code or examples right now but show the possibilities how you can solve that problem.
First things first. You can do it with Javascript/jQuery only. This would mean that the client has lots of work and fetching to do, but it is possible.
You can also do it with a combination of PHP and JS. This would mean that the server does the heavy work of fetching and matching the data and sends a combined result back to the client.
If it is always the same data, the PHP solution has a major advantage: you could cache the combined data and save the time of matching the XML on every request.
Just think about what will work better for you and we can help you with the coding. (Just to be clear: does the XML always look like the examples you provided? Are the fields/tags in the XML always the same? If not, the entire thing is much more complicated.)
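To make the server-side matching step concrete, here is a sketch of the idea in Python (the same structure should translate to PHP): describe each feed with a small table that maps its tag names onto one shared vocabulary, then merge the normalised records. The shared field names, the wrapper elements properties and results, and the shortened photo URLs are assumptions for illustration; the source tag names come from the two samples in the question.

```python
import xml.etree.ElementTree as ET

# Source tag -> shared field name, one table per feed.
FEED1_MAP = {"type": "type", "subtype": "subtype", "refno": "reference",
             "title": "title", "price": "price", "pricecurrency": "currency",
             "locationtext": "location", "lastupdated": "updated"}
FEED2_MAP = {"category": "type", "subcategory": "subtype", "reference": "reference",
             "title": "title", "price": "price", "currency": "currency",
             "location": "location", "updated": "updated"}


def normalise(xml_text, record_tag, tag_map):
    """Turn every <record_tag> element into a dict with the shared field names."""
    records = []
    for node in ET.fromstring(xml_text).iter(record_tag):
        record = {common: (node.findtext(src) or "").strip()
                  for src, common in tag_map.items()}
        record["photos"] = [p.text.strip() for p in node.iter("photo") if p.text]
        records.append(record)
    return records


# Shortened, hypothetical versions of the two feeds.
XML1 = """<properties><property>
<type>SP</type><refno>011248</refno><title>Fantastic Facilities!</title>
<price>1525000</price><pricecurrency>AED</pricecurrency>
<locationtext>The Views</locationtext>
<photos><photo> http://www.ExampleDomain1.com/images/a.jpg </photo></photos>
</property></properties>"""
XML2 = """<results><listings>
<category>SP</category><reference>011250</reference><title>Fantastic Facilities!</title>
<price>905000</price><currency>AED</currency>
<location> Downtown </location>
<photos><photo> http://www.ExampleDomain2.com/images/b.jpg </photo></photos>
</listings></results>"""

combined = normalise(XML1, "property", FEED1_MAP) + normalise(XML2, "listings", FEED2_MAP)
combined.sort(key=lambda r: int(r["price"]))  # one merged list, cheapest first
for record in combined:
    print(record["reference"], record["location"], record["price"], record["photos"])
```

Once both feeds are in the same shape, searching, sorting and caching work on a single list, and adding a third feed only needs another mapping table.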
My RSS feed is generated on demand.
As far as I can see, I have 2 options for lastBuildDate: current time or pubDate.
Which one would you choose and why?
According to the RSS 2.0 spec, lastBuildDate is the last time the content of the channel changed. (I'm not entirely satisfied with this definition because what if the feed's meta data changes? I think the common convention is to update lastBuildDate in that case, too.)
The channel-wide pubDate is the publication date for the content in the channel. It is not a good value to use for lastBuildDate, because the pubDate is supposed to stay unchanged even if an item gets updated later.
Using the current time is the easy way out, but it's not perfect (because clients may start unnecessary operations due to the changed lastBuildDate).
The best way would be to actually know / find out when the feed's content last changed, and output that.
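If each item's last modification time is available, the lastBuildDate can be derived from it. A small Python sketch, assuming the times are timezone-aware datetimes and that the list is not empty; the function name last_build_date is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def last_build_date(item_times):
    """Format the most recent item change as an RFC 822 date for lastBuildDate."""
    latest = max(item_times)  # raises ValueError when there are no items
    return format_datetime(latest.astimezone(timezone.utc), usegmt=True)


# Hypothetical modification times of two items.
times = [
    datetime(2010, 4, 7, 9, 0, tzinfo=timezone.utc),
    datetime(2010, 4, 9, 12, 30, tzinfo=timezone.utc),
]
print(last_build_date(times))  # Fri, 09 Apr 2010 12:30:00 GMT
```

With this, the value only changes when an item actually changes, so clients polling the feed are not told about updates that did not happen.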
Related question
The newest pubDate among the items should become the lastBuildDate.
[EDIT]: If you are also using a separate pubDate for the whole feed, then lastBuildDate should be the current time, because you are building the feed on demand at the current time :).
[EDIT 2]: As lastBuildDate is optional and you're including a pubDate for the whole feed anyway, why not remove it from your feed output?