Finding "similar" articles in an RSS feed with PHP - php

There is something I am trying to accomplish although I'm not really sure where to start.
I currently have a MySQL database with a list of articles. The DB contains the article title, content, and some other info such as dates.
There is an RSS feed that we monitor for new articles; it's a Google Alert feed that just contains the latest news on certain subjects. I want to be able to automatically monitor this feed and record any feed items that are similar to stories currently in our DB.
I know how to set a script to run automatically, and I know how to parse the RSS feed with SimplePie.
What I need to figure out is how to take the description of each RSS feed item, check it against our DB to see whether it is similar to something we already have, and return a numerical score of some sort, a kind of "similarity rating".
After that I can have the info I need recorded to the DB if the "similarity rating" is above a set limit, which I know how to do.
So my only issue is how to compare each feed item to our current articles, and return a score based on how similar it is.

PHP's built-in levenshtein() function is a good way to handle this. It returns the minimum number of single-character edits (insertions, deletions, and substitutions) required to convert one string into the other. You can base your "similarity rating" on that value; note that it is a distance, so a lower number means the strings are more similar.
EDIT: the Levenshtein function is not available natively in MySQL but there are SQL implementations of it that you can use such as: http://kristiannissen.wordpress.com/2010/07/08/mysql-levenshtein/
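
As a rough illustration of the idea, here is a minimal sketch that turns the Levenshtein distance into a rating between 0 and 1, where 1 means identical. The function names similarityRating() and bestMatch() are made up for this example, and the $articles array (id => title) stands in for rows loaded from your DB. Before PHP 8.0, levenshtein() returns -1 for strings longer than 255 characters, so the sketch truncates its inputs; that makes it better suited to titles or short descriptions than to full article bodies.

```php
<?php
// Hypothetical helper: converts a Levenshtein distance into a 0..1 rating.
function similarityRating(string $a, string $b): float
{
    // Normalise case and whitespace, and cap the length at 255 bytes
    // so the call also works on PHP versions before 8.0.
    $a = substr(strtolower(trim($a)), 0, 255);
    $b = substr(strtolower(trim($b)), 0, 255);

    $longest = max(strlen($a), strlen($b));
    if ($longest === 0) {
        return 1.0; // two empty strings are treated as identical
    }
    // The distance can never exceed the longer length, so this stays in 0..1.
    return 1.0 - levenshtein($a, $b) / $longest;
}

// Hypothetical helper: finds the stored article that best matches a feed item.
// $articles is assumed to be an array of id => title.
function bestMatch(string $feedText, array $articles): array
{
    $best = ['id' => null, 'rating' => 0.0];
    foreach ($articles as $id => $title) {
        $rating = similarityRating($feedText, $title);
        if ($rating > $best['rating']) {
            $best = ['id' => $id, 'rating' => $rating];
        }
    }
    return $best;
}
```

You would then record the feed item only when the returned rating is above whatever limit you choose. The limit needs tuning against real data, since edit distance measures character-level similarity rather than similarity of topic.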

Related

Notification System that loops through first ten database entries

I have looked for days now and cannot seem to find any direct examples of what I am trying to accomplish, that I can reference.
I am trying to create a simple, elegant notification system that pulls a person's image, name (in text format), and a predefined message (selected from a drop-down menu) from a database, and then displays the info in an elegant little "profile-like" layout on a webpage or smartphone. The only feature that I want the app to have is an auto-refresh setup (using AJAX maybe?) that cycles through the latest ten entries in the database in a continual loop.
I already have the MySQL database set up, as well as the form which supplies the information that I want to show to the database -- but I can't for the life of me figure out how to pull that info into a nice little alert and get it to cycle through the latest ten database entries.
Thank you so much, in advance, for any assistance you can provide. I'm OK with databases and PHP, but I'm racking my brain trying to figure out how to get it to display and cycle through the first ten entries.
Thanks again!
If you have an id column or some sort of timestamp column, you can use ORDER BY and LIMIT in MySQL to extract only the last X records.
For example:
SELECT * FROM profiles ORDER BY id DESC LIMIT 10
That extracts the 10 highest ids, which with a standard auto-increment id column are the 10 most recent records.
As for formatting the display, that question is far too broad; there are many ways to do it.
I think you should have a PHP file that runs the SQL query SELECT ... ORDER BY id DESC LIMIT 10 and uses json_encode to JSON-encode the returned array. JSON is easy to consume from JavaScript via AJAX.
As for AJAX, I would use jQuery and call jQuery.getJSON on that PHP file repeatedly on a timer.
EDIT: On each refresh, parse the new JSON, remove the old cards from the container, and add the new elements.
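
The PHP side of this could look roughly like the sketch below. The profiles table comes from the query above, but the column names (image, name, message), the function name latestEntries(), and the connection details are assumptions for illustration only.

```php
<?php
// Hypothetical helper: fetches the newest $limit rows from `profiles`.
// The column names image, name and message are assumptions.
function latestEntries(PDO $pdo, int $limit = 10): array
{
    $stmt = $pdo->prepare(
        'SELECT id, image, name, message FROM profiles ORDER BY id DESC LIMIT :lim'
    );
    // Bind as an integer so the value is valid inside LIMIT.
    $stmt->bindValue(':lim', $limit, PDO::PARAM_INT);
    $stmt->execute();
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

// In the endpoint script itself (connection details are placeholders):
// $pdo = new PDO('mysql:host=localhost;dbname=app;charset=utf8mb4', 'user', 'pass');
// header('Content-Type: application/json');
// echo json_encode(latestEntries($pdo));
```

The page can then request this script on a timer and rebuild the cards from the decoded array each time.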

Simple RSS reader in php with continuous scrolling

I already made a simple RSS reader, but it only gets me about 25 articles. How do I make it work like feedly.com or digg.com, so that it retrieves many more items and not only 25?
The PHP code I have:
$rss = simplexml_load_file('http://www.elespectador.com/rss.xml');
I already know how to retrieve the title, description, etc. of each item.
Pagination in feeds is arbitrary and you'll have trouble finding a consistent pattern. You should store the items yourself: right now you have 25 elements, but as new ones are published you keep adding them to your store, so your archive keeps growing.
Another solution is to use the data from a service like Superfeedr (I created it!), which stores past content for millions of feeds.
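
The store-as-you-go approach could be sketched as follows. This assumes a standard RSS 2.0 layout (channel, item, guid, link); mergeFeedItems() is a hypothetical name, and the plain array stands in for whatever storage you actually use (a database table, a file, and so on).

```php
<?php
// Hypothetical helper: merges the items currently in the feed into an
// archive keyed by each item's <guid>, falling back to its <link>.
function mergeFeedItems(array $archive, SimpleXMLElement $rss): array
{
    foreach ($rss->channel->item as $item) {
        $key = (string) $item->guid;
        if ($key === '') {
            $key = (string) $item->link;
        }
        if ($key === '' || isset($archive[$key])) {
            continue; // nothing to key on, or already stored
        }
        $archive[$key] = [
            'title'       => (string) $item->title,
            'link'        => (string) $item->link,
            'description' => (string) $item->description,
            'pubDate'     => (string) $item->pubDate,
        ];
    }
    return $archive;
}
```

Run on a schedule against the result of simplexml_load_file(), this keeps every item ever seen, so your reader can page through your own archive instead of the 25 items the feed currently exposes.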

Multiple xml feeds, sql match

I'm developing a store which gets its product info from lots of XML feeds. I'll have maybe 3000 products in my database, and I'll import them using a cron job.
What I'd like to do is write posts, let's say a general post about picking the best TV set for your family. Then I'd run a MySQL MATCH which should take the post's title and content, match it against the thousands of products in my database, and retrieve the closest matches to display in my post.
I'm thinking of this because having a lot of XML feeds with different nodes and categories would make it very hard for me to properly filter them using PHP.
Now, do you think that's a good idea, content-wise and performance-wise?
Do you think MySQL MATCH could do it? Or should I use some other method?
Should I store all the product info like price, description, reviews in a single table field and use it for the MySQL MATCH?
Is there a better way I can do this?
Any idea is very much appreciated. I need to sort this out and make a plan before I start coding and wasting time.
What you are trying to do would be awful with pure XML.
I strongly suggest you leave this task to your database, in this case MySQL, which is basically your third point.
With a MyISAM table (or InnoDB from MySQL 5.6 onward) you can set up full-text search if you need a more complex query based on relevance.
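
As a rough sketch of what such a full-text lookup might look like from PHP: the table name products, its columns name and description, and the helper buildProductMatchQuery() are all assumptions for illustration, and the table would need a FULLTEXT index covering (name, description).

```php
<?php
// Hypothetical helper: builds a full-text query for products related to a post.
// Assumes a table `products` with a FULLTEXT index on (name, description).
function buildProductMatchQuery(string $postTitle, string $postContent, int $limit = 5): array
{
    // Use the post's title plus its content (HTML stripped) as the search text.
    $text = trim($postTitle . ' ' . strip_tags($postContent));

    // Two separate placeholders, because a named placeholder cannot be
    // reused when PDO runs native (non-emulated) prepared statements.
    $sql = 'SELECT id, name, '
         . 'MATCH(name, description) AGAINST (:score_text IN NATURAL LANGUAGE MODE) AS score '
         . 'FROM products '
         . 'WHERE MATCH(name, description) AGAINST (:where_text IN NATURAL LANGUAGE MODE) '
         . 'ORDER BY score DESC '
         . 'LIMIT ' . max(1, $limit);

    return [$sql, [':score_text' => $text, ':where_text' => $text]];
}

// Usage with an existing PDO connection ($pdo is assumed):
// [$sql, $params] = buildProductMatchQuery($post['title'], $post['content']);
// $stmt = $pdo->prepare($sql);
// $stmt->execute($params);
// $products = $stmt->fetchAll(PDO::FETCH_ASSOC);
```

With around 3000 products this kind of query should be cheap, but how good the matches are depends heavily on how descriptive the product text is.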

A Publisher with RSS as Datasource

I am writing some code to fetch news from an RSS feed and publish n items at a time every m hours to another site.
I compare the updated XML file with the previous one saved on the server using PHP.
I load the two XML files into PHP arrays and filter out the latest posts using array_diff_assoc().
If the number of new posts is greater than n, the older ones will be published first and the rest will be handled next time. Therefore I need some way to record which items have been published.
What is the simplest way to do so? I don't want to use MySQL/S for such a simple task.
Can't you just store those not published? Then each time, pull up the old, stored ones, and append to the list those new ones ID'd by array_diff_assoc(). Publish n, and if number > n, store the new list of unpublished ones.
As to how to store them, I'm not a PHP programmer, but what about using PHP's serialize and unserialize functions? In Python, I'd use the pickle module if I had to store data objects of some type, and I understand those functions are the PHP equivalent.
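
A minimal sketch of that idea, assuming the new items arrive as an array ordered oldest first; the function names and the choice of a flat file are illustrative only.

```php
<?php
// Hypothetical helper: loads the stored backlog of unpublished items.
function loadQueue(string $file): array
{
    if (!is_file($file)) {
        return [];
    }
    // allowed_classes => false stops unserialize() from creating objects.
    $data = unserialize(file_get_contents($file), ['allowed_classes' => false]);
    return is_array($data) ? $data : [];
}

// Hypothetical helper: writes the backlog back to disk.
function saveQueue(string $file, array $queue): void
{
    file_put_contents($file, serialize($queue), LOCK_EX);
}

// Appends the new items to the stored backlog, takes the oldest $n for
// publishing and stores whatever is left for the next run.
function nextBatch(string $file, array $newItems, int $n): array
{
    $queue = array_merge(loadQueue($file), array_values($newItems));
    $batch = array_splice($queue, 0, $n); // $queue now holds the remainder
    saveQueue($file, $queue);
    return $batch;
}
```

Each scheduled run would pass the items identified by array_diff_assoc() to nextBatch() and publish whatever it returns.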

Search Lucene - Usage Practice

I’ve got search lucene set up and running. Everything works perfectly.
My website is an application that displays results much like eBay does: each item has an image, title, content description and some other information that comes with it.
I have two options for populating my data, and I'd like you to suggest which one I should go for.
1. Store the title, content, image name, and all other information in the index files. When users search, I just query the index files and get everything from there.
2. Store only the title, content and row ids. When users search, I query the index files, get the ids of the matching results, then use those ids to query my actual database for all the other information.
I would probably go with the first solution, storing everything into the search/index engine (Lucene, in your case).
This way, in order to display your list of products, you will not have to make any request to your database, which will lower the load on your DB server -- and your site will scale better.
