RSS aggregator; how to insert only new items - php

A tutorial here shows how to build an aggregator in PHP, but I'm having some trouble finding the best way to avoid inserting the same items into my database more than once.
If I were to run the script on http://visualwebsiteoptimizer.com/split-testing-blog/feed/ and then run it again in 5 minutes it'll just insert the same items again.
That tutorial just specifies an interval at which it will reload the RSS feed and save all the items.
I was wondering if RSS implements some request header that will only send the items published after a certain date. I see here that I could use lastBuildDate and maybe ignore channels whose date is older than the last fetch, but it doesn't say whether that element is mandatory.
My question here is: how can I check RSS feeds regularly and insert it in a database without inserting the same item more than once?
I'm thinking the only way to do it is to check whether a record already exists using link and only insert it if it doesn't. I know link is optional, but I won't save items that don't have one anyway. This seems a bit inefficient, though; checking before every insert might be fine in the beginning, but when the database starts filling up it might get very slow.

You might have to use a few different strategies depending on how well the site you are consuming has implemented the spec.
First I would try adding a unique index in the database on the GUID value; GUIDs by their nature should be unique (http://en.wikipedia.org/wiki/Globally_unique_identifier). Then, depending on which DB you are using, you should be able to use syntax like INSERT IGNORE INTO... or INSERT ... ON DUPLICATE KEY UPDATE... and just have the update clause not really do anything.
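A minimal sketch of that idea with PDO; the table layout, connection details, $feedId and the $item SimpleXML element for one feed <item> are all assumptions, not taken from the tutorial:
// assumed schema: items(id, feed_id, guid, title, link, published_at)
$pdo = new PDO('mysql:host=localhost;dbname=aggregator', 'user', 'pass'); // placeholder credentials
$pdo->exec('ALTER TABLE items ADD UNIQUE INDEX uniq_guid (guid)'); // one-time migration

$sql = 'INSERT INTO items (feed_id, guid, title, link, published_at)
        VALUES (:feed_id, :guid, :title, :link, :published_at)
        ON DUPLICATE KEY UPDATE guid = guid'; // the "do nothing" update
$stmt = $pdo->prepare($sql);
$stmt->execute([
    ':feed_id'      => $feedId,
    ':guid'         => (string) $item->guid,
    ':title'        => (string) $item->title,
    ':link'         => (string) $item->link,
    ':published_at' => date('Y-m-d H:i:s', strtotime((string) $item->pubDate)),
]);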
If some sites don't have a guid field (I am assuming you will end up consuming more than just the example) you could add the unique index on the siteId field plus either the time or the title; both are less than ideal, of course. Contacting the site owner to get them to implement a guid might work too ;)
You could also run an md5 hash on the post content and store that alongside the post; that should stop duplicates too.
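A sketch of that, assuming the item's content is already in $content and the items table has a content_hash column with a unique index on it:
// reject items whose content has already been stored, via a unique index on content_hash
$hash = md5($content);
$stmt = $pdo->prepare('INSERT IGNORE INTO items (content, content_hash) VALUES (:content, :hash)');
$stmt->execute([':content' => $content, ':hash' => $hash]);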
How big are you expecting the DB to get? With proper indexing I would have thought it would have to be huge before it runs slow; lookups on siteId, guid, time and/or hash that hit an index, limited to just 1 row and selecting just the row id, should be quick enough, especially if you can get your script to run on the command line / as a cron job rather than through a webserver.

Related

Add id to parser html element?

I'm going to write a tool that extracts data from soccerway.com. In fact I'm going to create a sort of historical archive.
As you can see, the data is grouped into football seasons, so there would be a database for 2015/2016, 2016/2017 and so on. What I do is take the data, enter it into the database and then run a cron job that updates these values. The problem is that I need some kind of key on each record in order to recognise what to update. At the moment I only have the parser working, and I don't know how I can create a key for each parsed item. For example, take the league standings from the link I provided: once the data is entered, how can I later check with a cron job whether there are updates and replace the values?
I know that to see whether there are updates I could use the lastUpdate header field and save it somewhere in the database, and then have the cron job check this field for each league. The most important point, however, is recognising which values need to be updated, because I have no id to reference.
Any ideas?
While parsing the data, you can store the date & time of the forthcoming matches and set the script to run then (there won't be any updates in the meantime). If you parse the HTML code directly it shouldn't take long.

php speed vs. mysql speed

I'm making a feed aggregator using PHP and MySQL. And I'm writing a paper about it, which must contain math.
I have a table feeds (id, title, description, link) where id is the primary key.
When I collect new feeds I need to add them to the database, but I must not let any duplicates in. I see two ways to do that:
1) for each feed run something like this:
SELECT id FROM feeds
WHERE title=$feed.title AND description=$feed.description;
And see if it returns any feeds.
2) Assume that feeds which came from different sources never match. In this case:
for each source of feeds run something like this:
SELECT title, description, source FROM feeds WHERE source=$source;
Then use PHP to match collected feeds against this array.
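For concreteness, a rough sketch of approach 2 in PHP, assuming $pdo is an open PDO connection, $source identifies the feed source, and $collected is the array of newly collected feeds with title and description keys (all illustrative names):
// load everything already stored for this source into a lookup set
$stmt = $pdo->prepare('SELECT title, description FROM feeds WHERE source = :source');
$stmt->execute([':source' => $source]);
$seen = [];
foreach ($stmt as $row) {
    $seen[$row['title'] . '|' . $row['description']] = true;
}

// keep only the collected feeds that are not already in the table
$new = array_filter($collected, function ($feed) use ($seen) {
    return !isset($seen[$feed['title'] . '|' . $feed['description']]);
});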
I admit, I don't have any performance problem. But I'm writing a paper about it and I must find some way to apply math to the problem. I've chosen the second approach because it allows me to go into mathematical detail about why it can be faster.
But I suspect that PHP might do the work much more slowly than MySQL would, and it might actually be faster to run a query for each feed.
Am I right? Is there any practical reason to choose the second approach? How can I justify my choice?
Have you considered using a composite unique index instead?
alter table feeds add unique index(title, description);
This would prevent adding new rows when title and description taken together are already present in the table.
You would have to do a large number of inserts into a large database to really see meaningful performance figures, though.
Edit:
This does have one downfall: in MySQL, NULL is always considered unique, so you could end up with several rows where title=NULL and description=NULL. You should check for this before attempting to insert the data.
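A small guard for that case might look like this (assuming each collected feed is an associative array):
foreach ($feeds as $feed) {
    // MySQL treats NULLs as distinct in a unique index, so a row where both
    // key columns are NULL would slip past the composite unique key
    if ($feed['title'] === null && $feed['description'] === null) {
        continue;
    }
    // ... safe to attempt the insert for $feed here
}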
For the math, consider what the scaling implications are for your database. How long does an add of a new feed take for the first feed? How about the 10,000th? What about the 10 millionth? In what way does the increase in number of existing feeds affect the speed by which a new feed can be added?
PHP and MySQL both run on the server side, unlike JavaScript, which runs client-side in the browser.
Unless you have many millions of rows, it won't be slow anyway.
Why not just add an index that is unique on title and description? I don't know if it's the best performance-wise, but it will handle the logic for you in the most correct way.
I think the fastest way would be to put a UNIQUE index on the source column, and simply do an INSERT IGNORE, sending all your collected feeds in one query without even manually checking for duplicates. Not only will this save you the processing/network overhead of doing one query per feed, the index will ensure you don't have any duplicates (assuming source is actually unique per feed).
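A sketch of that single-query idea, assuming $pdo is a PDO connection, $feeds holds the collected feeds, and the source column carries the UNIQUE index (table and key names are illustrative):
// build one multi-row INSERT IGNORE; the unique index silently drops duplicates
if (!empty($feeds)) {
    $placeholders = [];
    $values = [];
    foreach ($feeds as $feed) {
        $placeholders[] = '(?, ?, ?, ?)';
        array_push($values, $feed['title'], $feed['description'], $feed['link'], $feed['source']);
    }
    $sql = 'INSERT IGNORE INTO feeds (title, description, link, source) VALUES '
         . implode(', ', $placeholders);
    $pdo->prepare($sql)->execute($values);
}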

Display latest search results from MySQL with PHP

Google unfortunately didn't seem to have the answers I wanted. I currently own a small search engine website for specific content using PHP GET.
I want to add a latest searches page, meaning to have each search recorded, saved, and then displayed on another page, with the "most searched" at the top, or even the "latest search" at the top.
In short: Store my latest searches in a MySQL database (or anything that'll work), and display them on a page afterwards.
I'm guessing this would best be accomplished with MySQL, and then I'd like to output it in to PHP.
Any help is greatly appreciated.
Recent searches could be abused easily. All someone has to do is go onto your site and search for "your site sucks" or worse, and they've essentially defaced your site. I'd think carefully before adding that feature.
In terms of building the most popular searches and scaling it nicely I'd recommend:
Log queries somewhere. Could be a MySQL db table but a logfile would be more sensible as it's a log.
Run a script/job periodically to extract/group data from the log
Have that periodic script/job populate some table with the most popular searches (a minimal sketch is below)
I like this approach because:
A backend script does all of the hard work - there's no GROUP BY, etc made by user requests
You can introduce filtering or any other logic to the backend script and it doesn't affect user requests
You don't ever need to put big volumes of data into the database
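A minimal sketch of that log-then-aggregate flow; the log path, table name, the cut-off of 20 rows, $query (the user's search term) and $pdo are all assumptions:
// at search time: append the query to a plain log file (no DB work on the user's request)
file_put_contents('/var/log/mysite/searches.log', date('c') . "\t" . $query . "\n", FILE_APPEND | LOCK_EX);

// in the periodic cron script: group the log and rebuild the popular-searches table
$counts = [];
foreach (file('/var/log/mysite/searches.log', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    $parts = explode("\t", $line, 2);
    $q = $parts[1] ?? $parts[0];
    $counts[$q] = ($counts[$q] ?? 0) + 1;
}
arsort($counts);
$pdo->exec('TRUNCATE popular_searches');
$stmt = $pdo->prepare('INSERT INTO popular_searches (query, hits) VALUES (:q, :hits)');
foreach (array_slice($counts, 0, 20, true) as $q => $hits) {
    $stmt->execute([':q' => $q, ':hits' => $hits]);
}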
Create a database, create a table (for example recent_searches) with fields such as query (the query searched) and timestamp (the Unix timestamp at which the query was made); then for your script your MySQL query will be something like:
SELECT * FROM `recent_searches` ORDER BY `timestamp` DESC LIMIT 0, 5
This should return the 5 most recent searches, with the most recent one appearing first.
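And a minimal PHP loop to run that query and print the results (assuming a PDO connection in $pdo):
$rows = $pdo->query('SELECT * FROM `recent_searches` ORDER BY `timestamp` DESC LIMIT 0, 5');
foreach ($rows as $row) {
    // timestamp is a Unix timestamp, as per the schema described above
    echo htmlspecialchars($row['query']) . ' - ' . date('Y-m-d H:i', (int) $row['timestamp']) . '<br>';
}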
Create table (something named like latest_searches) with fields query, searched_count, results_count.
Then after each search (if results_count > 0), check whether this search query already exists in that table, and either update the existing row or insert a new one.
And on some page you can just use data from this table.
It's pretty simple.
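A sketch of that check-and-update step, collapsed into a single ON DUPLICATE KEY UPDATE statement; the unique index on query, the exact column names and $pdo are assumptions:
// requires: ALTER TABLE latest_searches ADD UNIQUE INDEX uniq_query (query);
if ($resultsCount > 0) {
    $stmt = $pdo->prepare(
        'INSERT INTO latest_searches (query, searched_count, results_count)
         VALUES (:q, 1, :rc)
         ON DUPLICATE KEY UPDATE searched_count = searched_count + 1, results_count = :rc2'
    );
    $stmt->execute([':q' => $query, ':rc' => $resultsCount, ':rc2' => $resultsCount]);
}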
Ok, your question is not yet clear. But I'm guessing that you mean you want to READ the latest results first.
To achieve this, follow these steps:
When storing the results use an extra field to hold DATETIME. So your insert query will look like this:
INSERT INTO `Table` (SearchItem, `When`) VALUES ('$strSearchItem', NOW())
When retrieving, make sure you include an order by like this:
SELECT * FROM `Table` ORDER BY `When` DESC
I hope this is what you meant to do :)
You simply store the link and name of the link/search in MySQL and then add a timestamp to record what time somebody searched for them. Then you pull them out of the DB ordered by the timestamp and display them on the website with PHP.
Create a table with three columns: search, link, timestamp.
Then write a PHP script to insert rows when needed (this is done when the user actually searches)
Your main page, where you want the results displayed, simply gets the data back out and puts it into a link container such as $nameOfWebsite
It's probably best to use a for/while loop to do step 3
You could additionally add something like a counter to know which searches are the most popular / this would be another field in MySQL that you just keep updating (increasing it by one, but limited per IP)

Basic version control for MySQL table

I'm trying to set up a (I thought) fairly simple versioning system for static HTML pages on a site. The goal is to keep previous versions of the content, then restore to them if needed (I guess basically creating a new version that's a duplicate of an old one), and optionally to toss out data older than X versions ago.
The table's setup is fairly straightforward:
id
reference_id (string/used to determine what page the item pertains to)
content (document/html page sized amount of data)
e_user (user who changed it last)
e_timestamp (when it was changed)
I just want to have something setup to create a previous version for each edit to the content, then be able to restore to it if needed.
What's the best method for accomplishing this? Should everything be in the same table, or spread across a few different ones?
I read through a few pages on the subject, but a lot of them seemed like overkill for what I'm trying to accomplish (e.g. http://www.jasny.net/articles/versioning-mysql-data/).
Are there any platforms/guides that will help me in this endeavor?
Ideally you would want everything in the same table with something in your query to get the correct version, however you should be careful how you do this as an inefficient query will put extra load on your server. If normally you would select a single item like this:
SELECT * FROM your_table WHERE id = 42
This would then become:
SELECT * FROM your_table
WHERE id = 42
AND e_timestamp < '2010-10-12 15:23:24'
ORDER BY e_timestamp DESC
LIMIT 1
Add an index on (id, e_timestamp) to allow this to perform efficiently.
Selecting multiple rows in a single query is more tricky and requires a groupwise-maximum approach but it can be done.
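For reference, a sketch of one common groupwise-maximum form, pulling the latest version of every page in one go; it uses the reference_id and e_timestamp columns from the question and assumes a $pdo connection:
$sql = 'SELECT t.*
        FROM your_table t
        JOIN (
            SELECT reference_id, MAX(e_timestamp) AS latest
            FROM your_table
            GROUP BY reference_id
        ) m ON m.reference_id = t.reference_id AND t.e_timestamp = m.latest';
$latestVersions = $pdo->query($sql)->fetchAll();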
You can use a technique called "auditing". You would set up audit tables. Then you would either write it into your code or set up triggers on the DB side so that every time a change is made, an entry is added to the appropriate audit table. Then you can go back through the audit table and see things like:
"Oh, yesterday Sue went in and fixed a typo"
"Uh oh, Steve wiped out an entire paragraph by accident earlier today while trying to rewrite this section"
Your primary table that stores the data doesn't keep all that history, so it can stay slim. If you ever need to look at that data and, say, roll stuff back, you can go look in your audit table and do that. You can set up the audit table however you want, so each audit row can hold the entire content BEFORE the edit, and not just what was edited. That should make "rolling back" fairly easy.
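A sketch of the trigger flavour of this, assuming the main table is called pages and is shaped like the one in the question (the audit table and trigger names are made up):
$pdo->exec('
    CREATE TABLE pages_audit (
        audit_id     INT AUTO_INCREMENT PRIMARY KEY,
        reference_id VARCHAR(255),
        content      MEDIUMTEXT,
        e_user       VARCHAR(255),
        e_timestamp  DATETIME
    )
');
// copy the old version of the row into the audit table on every edit
$pdo->exec('
    CREATE TRIGGER pages_before_update
    BEFORE UPDATE ON pages
    FOR EACH ROW
    INSERT INTO pages_audit (reference_id, content, e_user, e_timestamp)
    VALUES (OLD.reference_id, OLD.content, OLD.e_user, OLD.e_timestamp)
');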
Add a version column and a delete column (bool) and create some functions that compare the versions of rows with the same id. You'll definitely want to be able to easily find the current version and the previous version. To get rid of old data you'll want to write another function that sorts all of the versions of an id, figures out which are old enough to be deleted, and marks them for deletion by another function. You'll probably want to have an option to make certain pages immune to deletion or to postpone it.

Twitter competition ~ saving tweets (PHP & MySQL)

I am creating an application to help our team manage a twitter competition. So far I've managed to interact with the API fine, and return a set of tweets that I need.
I'm struggling to decide on the best way to handle the storage of the tweets in the database, how often to check for them and how to ensure there are no overlaps or gaps.
You can get a maximum number of 100 tweets per page. At the moment, my current idea is to run a cron script say, once every 5 minutes or so and grab a full 100 tweets at a time, and loop through them looking in the db to see if I can find them, before adding them.
This has the obvious drawback of running 100 queries against the db every 5 minutes, plus however many INSERTs there are, which I really don't like. I would also much rather have something a little more real-time. As Twitter is a live service, it stands to reason that we should update our list of entrants as soon as they enter.
This again throws up a drawback of having to repeatedly poll Twitter, which, although might be necessary, I'm not sure I want to hammer their API like that.
Does anyone have any ideas on an elegant solution? I need to ensure that I capture all the tweets, not leave anyone out, and keep each user unique in the db. I have considered just adding everything and then grouping the resulting table by username, but it's not tidy.
I'm happy to deal with the display side of things separately as that's just a pull from mysql and display. But the backend design is giving me a headache as I can't see an efficient way to keep it ticking over without hammering either the api or the db.
100 queries in 5 minutes is nothing. Especially since a tweet has essentially only four pieces of data associated with it: user ID, timestamp, tweet text, tweet ID - say, about 170 characters' worth of data per tweet. Unless you're running your database on a 4.77 MHz 8088, it won't even blink at that kind of "load".
The Twitter API offers a streaming API that is probably what you want to do to ensure you capture everything:
http://dev.twitter.com/pages/streaming_api_methods
If I understand what you're looking for, you'll probably want to use statuses/filter, with the track parameter set to whatever distinguishing characteristics (hashtags, words, phrases, locations, users) you're looking for.
Many Twitter API libraries have this built in, but basically you keep an HTTP connection open and Twitter continuously sends you tweets as they happen. See the streaming API overview for details on this. If your library doesn't do it for you, you'll have to check for dropped connections and reconnect, check the error codes, etc - it's all in the overview. But adding them as they come in will allow you to completely eliminate duplicates in the first place (unless you only allow one entry per user - but that's client-side restrictions you'll deal with later).
As far as not hammering your DB, once you have Twitter just sending you stuff, you're in control on your end - you could easily have your client cache up the tweets as they come in, and then write them to the db at given time or count intervals - write whatever it has gathered every 5 minutes, or write once it has 100 tweets, or both (obviously these numbers are just placeholders). This is when you could check for existing usernames if you need to - writing a cached-up list would allow you the best chance to make things efficient however you want to.
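A rough sketch of that cache-then-flush idea; the flush threshold, table layout, and the shape of the decoded tweet array (id_str, user, text, created_at) are assumptions based on what the streaming API returned at the time:
$buffer = [];

// called by your streaming client for every incoming tweet
function collectTweet(array $tweet, PDO $pdo, array &$buffer, int $flushAt = 100): void
{
    $buffer[] = $tweet;
    if (count($buffer) >= $flushAt) {
        flushTweets($pdo, $buffer);
        $buffer = [];
    }
}

// one multi-row INSERT IGNORE per flush; a unique index on tweet_id keeps duplicates out
function flushTweets(PDO $pdo, array $tweets): void
{
    if (!$tweets) {
        return;
    }
    $placeholders = implode(', ', array_fill(0, count($tweets), '(?, ?, ?, ?)'));
    $values = [];
    foreach ($tweets as $t) {
        array_push($values, $t['id_str'], $t['user']['screen_name'], $t['text'], $t['created_at']);
    }
    $pdo->prepare("INSERT IGNORE INTO tweets (tweet_id, username, text, created_at) VALUES $placeholders")
        ->execute($values);
}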
Update:
My solution above is probably the best way to do it if you want to get live results (which it seems like you do). But as is mentioned in another answer, it may well be possible to just use the Search API to gather entries after the contest is over, and not worry about storing them at all - you can specify pages when you ask for results (as outlined in the Search API link), but there are limits as to how many results you can fetch overall, which may cause you to miss some entries. What solution works best for your application is up to you.
I read over your question and it seems to me that you want to duplicate data already stored by Twitter. Without more specifics on the competition you're running (how users enter, for example, or the estimated number of entries), it's impossible to know whether storing this information locally in a database is the best way to approach this problem.
A better solution might be to skip storing duplicate data locally and pull the entrants directly from Twitter, i.e. when you're attempting to find a winner.
You could then eliminate duplicate entries on the fly while the code is running. You would just need to call "the next page" once it has finished processing the 100 entries it has already fetched. Although I'm not sure whether this is possible directly through the Twitter API.
I think running a cron every X minutes and basing it off of the tweets' creation dates may work. You can query your database to find the date/time of the last recorded tweet, and then only run duplicate-check selects for tweets with matching times. Then, when you do your inserts into the database, use one or two insert statements containing all the entries you want to record, to keep performance up.
INSERT INTO `tweets` (id, date, ...) VALUES (..., ..., ...), (..., ..., ...), ...;
This doesn't seem too intensive... it also depends on the number of tweets you expect to record, though. Also make sure to index the table properly.
