I have a system that connects to 2 popular APIs. I need to aggregate the data from each into a unified result that can then be paginated. The scope of the project means that the system could end up supporting tens of APIs.
Each API imposes a max limit of 50 results per request.
What is the best way of aggregating this data so that it is reliable, i.e. ordered, with no duplicates, etc.?
I am using CakePHP framework on a LAMP environment, however, I think this question relates to all programming languages.
My approach so far is to query the search API of each provider and then populate a MySQL table. From this the results are ordered, paginated etc. However, my concern is performance: API communication, parsing, inserting and then reading all in one execution.
Am I missing something, does anyone have any other ideas? I'm sure this is a common problem with many alternative solutions.
Any help would be greatly appreciated.
Yes, this is a common problem.
Search SO for questions like https://stackoverflow.com/search?q=%5Bphp%5D+background+processing
Everyone who tries this realizes that calling other sites for data is slow. The first one or two seem quick, but some sites break (and your app breaks) and some sites are slow (and your app is slow).
You have to disconnect the front-end from the back-end.
Choice 1 - pre-query the data with a background process that simply gets and loads the database.
Choice 2 - start a long-running background process and check back from a JavaScript function to see if it's done yet.
Choice 3 - the user's initial request spawns the background process -- you then email them a link so they can return when the job is done.
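A minimal sketch of Choice 1, assuming a background job has already fetched each provider's results: they are loaded into one table with a uniqueness constraint, and the front end only ever reads ordered pages from that table. The table and column names, the item keys and the helper functions are hypothetical, and SQLite stands in for MySQL here (`INSERT OR IGNORE` is SQLite syntax; MySQL spells it `INSERT IGNORE`).

```python
import sqlite3


def make_db():
    """Create an in-memory table holding results from all providers."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE results ("
        " provider TEXT NOT NULL,"
        " external_id TEXT NOT NULL,"
        " title TEXT NOT NULL,"
        " published_at TEXT NOT NULL,"
        " UNIQUE (provider, external_id))"
    )
    return conn


def store_results(conn, provider, items):
    """Insert one provider's results, skipping rows that are already stored."""
    conn.executemany(
        "INSERT OR IGNORE INTO results (provider, external_id, title, published_at) "
        "VALUES (?, ?, ?, ?)",
        [(provider, it["id"], it["title"], it["published_at"]) for it in items],
    )
    conn.commit()


def fetch_page(conn, page, per_page=10):
    """Read one page of the unified result set, newest first."""
    offset = (page - 1) * per_page
    rows = conn.execute(
        "SELECT provider, external_id, title FROM results "
        "ORDER BY published_at DESC, provider, external_id LIMIT ? OFFSET ?",
        (per_page, offset),
    )
    return rows.fetchall()
```

The background job would call `store_results` once per provider response; page requests only call `fetch_page`, so a slow or broken provider never blocks a visitor.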
i have a site doing just that with over 100 rss/atom feeds, this is what i do:
i have a list of feeds and a cron job that iterates over them, about 5 feeds a minute, meaning i cycle through all feeds every 20 minutes or so.
i lift the feed, and try to insert each entry into the database, using the url as a unique field, if the url exists, i do not insert. the entry date is my current system clock, and is inserted by my application, as date fields in rss cannot be trusted, and in some cases, can't even be parsed.
for some feeds, and only experience can tell you which, i also search for duplicate titles, as some websites change the urls for their own reasons.
the items are now all placed in the same database table, ready to be queried.
one last thought: if your application is likely to have new feeds added while in production, you really should also check if a feed is "new" (ie: has no previous entries in the db), if it is, you should mark all currently available links as inactive, otherwise, when you add a feed, there will be a block of articles from that feed, all with the same date and time. (simply put: the method i described is for future additions to the feed only, past articles will not be available).
hope this helps.
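The ingestion step described above could look roughly like this. It is only a sketch: the `entries` table, its columns and the `ingest_feed` helper are made up for illustration, feed parsing is assumed to have happened already, and SQLite stands in for the real database. The url is the unique key, the timestamp comes from the application rather than the feed, and the backlog of a brand new feed is stored as inactive.

```python
import sqlite3
from datetime import datetime, timezone


def make_db():
    """Create an in-memory table with the entry url as the unique key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries ("
        " url TEXT PRIMARY KEY,"
        " feed_id INTEGER NOT NULL,"
        " title TEXT NOT NULL,"
        " seen_at TEXT NOT NULL,"
        " active INTEGER NOT NULL)"
    )
    return conn


def ingest_feed(conn, feed_id, entries, now=None):
    """Store unseen entries of one feed; return how many rows were inserted."""
    # Use our own clock, not the date fields supplied by the feed
    now = now or datetime.now(timezone.utc).isoformat()
    # A feed with no stored entries is "new": its backlog is stored as inactive
    is_new_feed = conn.execute(
        "SELECT 1 FROM entries WHERE feed_id = ? LIMIT 1", (feed_id,)
    ).fetchone() is None
    active = 0 if is_new_feed else 1
    inserted = 0
    for entry in entries:
        cur = conn.execute(
            "INSERT OR IGNORE INTO entries (url, feed_id, title, seen_at, active) "
            "VALUES (?, ?, ?, ?, ?)",
            (entry["url"], feed_id, entry["title"], now, active),
        )
        inserted += cur.rowcount  # 0 when the url already existed
    conn.commit()
    return inserted
```

The duplicate-title check mentioned for some feeds is left out; it would be an extra lookup before the insert for those feeds only.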
I'm the developer for a real estate syndication website and am currently having trouble figuring out a way to update massive numbers of listings/records efficiently (2,000,000+ listings).
We currently accept XML feeds, containing real estate listings, from about 20 different websites. Most of the incoming feeds are small (~100 or so listings), but we have a couple of XML feeds that contain ~1,000,000 listings. The small feeds are parsed quickly and easily; however, the large feeds are taking upwards of 2-3 hours each.
The current "live" database table that contains the listings for viewing on the site is MyISAM. I chose MyISAM because ~95% of the queries to the table are SELECTs. Really the only time there are writes (UPDATE/INSERT queries) are during the time the XML feeds are being processed.
The current process is as follows:
There is a CRON in place that starts the main parsing script.
It loops through a feeds table and grabs the external XML feed source files. It then runs through said file and for each record in the XML file it checks against the listings table to see if a listing needs to be updated or inserted (if it's a new listing).
This is all happening against the live table. What I'd like to find out is if anybody has any better logic to make these updates/inserts happen in the background so as to not slow down the production tables, and ultimately, the user experience.
Would a delta table be the best choice? Maybe do all the heavy work on a separate database and just copy the new table over to the production database? On a separate workhorse domain altogether? Should I have a separate listings table that does all the parsing which would be InnoDB instead of MyISAM?
What we're trying to accomplish is to have our system be able to update listings frequently throughout the day without slowing the site down. Our competitors boast that they are updating their listings every 5 minutes in some cases. I just don't see how that's even possible.
I'm working right now so this is more of a brain dump just to get the ball rolling. If anybody would like me to provide table schematics, I'd be more than happy.
In summary: I'm looking for a way to frequently update millions of records in our database (daily) via a couple dozen external XML feeds/files. I just need some logic on how to effectively, and efficiently, make this happen so as to not drag the production server down with it.
Firstly, for your existing 3 hour import, try wrapping every 100 inserts in a transaction. They will be written to the database in one go, and that might speed things up dramatically. Play around with the 100 value - the best value will depend on how resilient you want it, and how much memory your transaction cache has. (This will of course require you to switch from MyISAM to a transactional engine such as InnoDB, since MyISAM does not support transactions.)
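As a rough illustration of the batching idea, here is a sketch that commits once per batch instead of once per row. The `listings` table with `id` and `price` columns and the `insert_in_batches` helper are assumptions for the example, and SQLite stands in for a transactional MySQL engine.

```python
import sqlite3


def insert_in_batches(conn, rows, batch_size=100):
    """Insert rows, committing once per batch; return the number of batches."""
    batches = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        with conn:  # commits the batch on success, rolls it back on error
            conn.executemany(
                "INSERT INTO listings (id, price) VALUES (?, ?)", batch
            )
        batches += 1
    return batches


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE listings (id INTEGER PRIMARY KEY, price INTEGER)")
    rows = [(i, i * 10) for i in range(250)]
    print(insert_in_batches(conn, rows, batch_size=100), "batches committed")
```

A failed batch loses at most `batch_size` rows of work, which is the resilience trade-off mentioned above.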
For providers that are known to offer larger files, try keeping a copy of the previous XML download, and then do a text diff between the old one and the new one. If you set your context setting (i.e. the number of unchanged lines shown around changed lines) large enough, you might be able to capture the primary keys of changed items. You would then just do a small number of updates.
Of course, it would help if your providers maintain the order of their XML listing. If they don't, a text sort then a diff may still be faster than importing everything.
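To make the diff idea concrete, here is a small sketch using Python's `difflib` in place of a command-line diff. It assumes, purely for illustration, that each listing sits on a single line, so every changed line already carries its own key and no context lines are needed. For files with a million lines an external diff tool is likely the better fit, since `SequenceMatcher` can be slow on very large inputs.

```python
import difflib


def changed_lines(old_lines, new_lines):
    """Return (removed, added) lines between two versions of a feed dump."""
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(old_lines[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new_lines[j1:j2])
    return removed, added


if __name__ == "__main__":
    old = ['<listing id="1" price="100"/>', '<listing id="2" price="200"/>']
    new = ['<listing id="1" price="100"/>', '<listing id="2" price="250"/>']
    removed, added = changed_lines(old, new)
    print("removed:", removed)
    print("added:", added)
```

Only the lines in `added` need to be parsed and turned into inserts or updates; lines that appear only in `removed` are candidates for deletion.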
FWIW, I think a complete refresh every 5 minutes is probably not feasible. I expect your providers would not be happy with you downloading 1M records at this frequency!
I'm developing a website that is sensitive to page visits. For instance it has sections that will show the users which parts of the website (which items) have been visited the most. To implement this feature, two strategies come to my mind:
Create a page hit counter, sort the pages by the number of visits and pick the highest ones.
Create a Google Analytics account and use its info.
If the first strategy has been chosen, I would need a very fast and accurate hit counter with the ability to distinguish the unique IPs (or users). I believe that using MySQL wouldn't be a good choice, since a lot of page visits, means a lot of DB locks and performance problems. I think a fast logging class would be a good one.
The second option seems very interesting when all the problems of the first one emerge but I don't know if there is a way (like an API) for Google Analytics to make me able to access the information I want. And if there is, is it fast enough?
Which approach (or even an alternative approach) would you suggest I take? Which one is faster? Performance is my top priority. Thanks.
UPDATE:
Thank you. It's interesting to see different answers. These answers reminded me of an important factor. My website updates the "most visited" items every 8 minutes, so I don't need the data in real time, but I need it to be accurate enough every 8 minutes or so. What I had in mind was this:
Log every page visit to a simple text log file
Send a cookie to the user to separate unique users
Every 8 minutes, load the log file, collect the info and update the MySQL tables.
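The periodic aggregation step could be sketched like this. The log format is an assumption (one tab-separated line per visit: timestamp, cookie-based visitor id, page), and the function names are hypothetical. The result would then be written to the MySQL tables by the 8 minute job.

```python
from collections import defaultdict


def count_unique_visits(log_lines):
    """Count distinct visitor ids per page from 'timestamp<TAB>visitor_id<TAB>page' lines."""
    visitors_by_page = defaultdict(set)
    for line in log_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # ignore malformed lines
        _timestamp, visitor_id, page = parts
        visitors_by_page[page].add(visitor_id)
    return {page: len(ids) for page, ids in visitors_by_page.items()}


def top_pages(counts, limit=10):
    """Pages sorted by unique visitors, highest first (ties broken by page name)."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


if __name__ == "__main__":
    sample = ["t1\tu1\t/a\n", "t2\tu1\t/a\n", "t3\tu2\t/a\n", "t4\tu1\t/b\n"]
    print(top_pages(count_unique_visits(sample)))
```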
That said, I wouldn't want to reinvent the wheel. If a 3rd party service can meet my requirements, I would be happy to use it.
Given you are planning to use the page hit data to determine what data to display on your site, I'd suggest logging the page hit info yourself. You don't want to be reliant upon some 3rd party service that you'd have to interrogate in order to create your page. This is especially true if you are loading that data real time as you'd have to interrogate that service for every incoming request to your site.
I'd be inclined to save the data yourself in a database. If you're really concerned about the performance of the inserts, then you could investigate intercepting requests (I'm not sure how you go about this in PHP, but I'm assuming it's possible) and then passing the request data off to a separate thread to store the request info. By having a separate thread handle the logging, you won't interrupt your response to the end user.
Also, given you are planning to use the data collected to "... show the users which parts of the website (which items) have been visited the most", you'll need to think about accessing this data to build your dynamic page. Maybe it'd be good to store a consolidated count for each resource. For example, rather than having 30000 rows showing that index.php was requested, maybe have one row showing index.php was requested 30000 times. This would certainly be quicker to reference than having to perform queries on what could become quite a large table.
Google Analytics has a latency to it and it samples some of the data returned to the API so that's out.
You could try the API from Clicky. Bear in mind that:
Free accounts are limited to the last 30 days of history, and 100 results per request.
There are many examples of hit counters out there, but it sounds like you didn't find one that met your needs.
I'm assuming you don't need real-time data. If that's the case, I'd probably just read the data out of the web server log files.
Your web server can distinguish IP addresses. There's no fully reliable way to distinguish users. I live in a university town; half the dormitory students have the same university IP address. I think Google Analytics relies on cookies to identify users, but shared computers make that somewhat less than 100% reliable. (But that might not be a big deal.)
"Visited the most" is also a little fuzzy. The easy way out is to count every hit on a particular page as a visit. But a "visit" of 300 milliseconds is of questionable worth. (Probably realized they clicked the wrong link, and hit the "back" button before the page rendered.)
Unless there are requirements I don't know about, I'd probably start by using awk to extract timestamp, ip address, and page name into a CSV file, then load the CSV file into a database.
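For anyone who would rather not use awk, the same extraction can be sketched in Python. The regular expression assumes the web server writes the Common Log Format (or the combined format, which starts the same way); the pattern and the `log_to_csv` helper are illustrative, not taken from any particular server setup.

```python
import csv
import io
import re

# ip, two ignored fields, [timestamp], then "METHOD path ..." inside the quotes
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] "(?:\S+) (?P<path>\S+)[^"]*" '
)


def log_to_csv(log_lines, out):
    """Write timestamp, ip and path of each parseable log line to `out` as CSV."""
    writer = csv.writer(out)
    written = 0
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        writer.writerow([match["timestamp"], match["ip"], match["path"]])
        written += 1
    return written


if __name__ == "__main__":
    sample = ['127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.php HTTP/1.0" 200 2326']
    buffer = io.StringIO()
    log_to_csv(sample, buffer)
    print(buffer.getvalue(), end="")
```

The resulting CSV can then be bulk loaded into a database table for the actual counting.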
I'm really interested to find out how people approach collaborative filtering and recommendation engines etc. I mean this more in terms of performance of the script than anything. I have started reading Programming Collective Intelligence, which is really interesting but tends to focus more on the algorithmic side of things.
I currently only have 2k users, but my current system is proving to be totally not future proof and very taxing on the server already. The entire system is based on making recommendations of posts to users. My application is PHP/MySQL but I use some MongoDB for my collaborative filtering stuff - I'm on a large Amazon EC2 instance. My setup is really a 2 step process. First I calculate similarities between items, then I use this information to make recommendations. Here's how it works:
First my system calculates similarities between users' posts. The script runs an algorithm which returns a similarity score for each pair. The algorithm examines information such as common tags, common commenters and common likers, and returns a similarity score. The process goes like:
Each time a post is added, has a tag added, commented on or liked I add it to a queue.
I process this queue via cron (once a day), finding out the relevant information for each post, e.g. user_id's of the commenters and likers and tag_id's. I save this information to MongoDB in this kind of structure: {"post_id":1,"tag_ids":[12,44,67],"commenter_user_ids":[6,18,22],"liker_user_ids":[87,6]}. This allows me to eventually build up a MongoDB collection which gives me easy and quick access to all of the relevant information for when I try to calculate similarities
I then run another cron script (once a day also, but after the previous) which goes through the queue again. This time, for each post in the queue, I grab its entry from the MongoDB collection and compare it to all of the other entries. When 2 entries have some matching information, I give them +1 in terms of similarity. In the end I have an overall score for each pair of posts. I save the scores to a different MongoDB collection with the following structure: {"post_id":1,"similar":{"23":2,"2":5,"7":2}} ('similar' is a key=>value array with the post_id as key and the similarity score as the value). I don't save a score if it is 0.
I have 5k posts. So all of the above is quite hard on the server. There's a huge amount of reads and writes to be performed. Now, this is only half the issue. I then use this information to work out what posts would be interesting to a particular user. So, once an hour I run a cron script which runs a script that calculates 1 recommended post for each user on the site. The process goes like so:
The script first decides which type of recommendation the user will get. It's a 50-50 chance of - 1. A post similar to one of your posts or 2. A post similar to a post you have interacted with.
If 1, then the script grabs the user's post_ids from MySQL, then uses them to grab their similar posts from MongoDB. The script takes the post that is most similar and has not yet been recommended to the user.
If 2, the script grabs all of the posts the user has commented on or liked from MySQL and uses their ids to do the same as in 1 above.
Unfortunately the hourly recommendation script is getting very resource intensive and is slowly taking longer and longer to complete... currently 10-15 minutes. I'm worried that at some point I won't be able to provide hourly recommendations anymore.
I'm just wondering if anyone feels I could be approaching this any better?
With 5000 posts, that's up to 25,000,000 ordered pairs (roughly 12.5 million distinct pairs), growing as O(n^2).
Your first problem is how you can avoid examining so many relationships every time the batch runs. Using tags or keywords will help with content matching - and you could use date ranges to limit common 'likes'. Beyond that... we'd need to know a lot more about the methodology for establishing relationships.
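One way to avoid comparing every post with every other post, sketched here under the assumption that the documents look like the ones in the question, is an inverted index: group post ids by each tag, commenter and liker, and only score pairs that land in the same group. The function name and the `FIELDS` constant are made up for the example.

```python
from collections import defaultdict
from itertools import combinations

FIELDS = ("tag_ids", "commenter_user_ids", "liker_user_ids")


def similarity_scores(posts):
    """Score only pairs of posts that share at least one tag, commenter or liker."""
    # Inverted index: (field, value) -> ids of posts containing that value
    index = defaultdict(set)
    for post in posts:
        for field in FIELDS:
            for value in post.get(field, []):
                index[(field, value)].add(post["post_id"])

    scores = defaultdict(int)
    for post_ids in index.values():
        # Each shared value adds +1 to every pair of posts that contains it
        for a, b in combinations(sorted(post_ids), 2):
            scores[(a, b)] += 1
    return dict(scores)


if __name__ == "__main__":
    posts = [
        {"post_id": 1, "tag_ids": [12, 44], "commenter_user_ids": [6], "liker_user_ids": [87, 6]},
        {"post_id": 2, "tag_ids": [12], "commenter_user_ids": [6, 18], "liker_user_ids": []},
    ]
    print(similarity_scores(posts))
```

Pairs with nothing in common never appear, which matches not storing scores of 0. A very popular tag or a very active user still produces a large group, so capping or skipping such values may be needed.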
Another consideration is when you establish relationships. Why are you waiting until the batch runs to compare a new post with existing data? Certainly it makes sense to handle this asynchronously to ensure that the request is processed quickly - but (other than the restrictions imposed by your platform) why wait until the batch kicks in before establishing the relationships? Use an asynchronous message queue.
Indeed depending on how long it takes to process a message, there may even be a case for re-generating cached relationship data when an item is retrieved rather than when it is created.
And if I were writing a platform to measure relationships with data then (the clue is in the name) I'd definitely be leaning towards a relational database where joins are easy and much of the logic can be implemented on the database tier.
It's certainly possible to reduce the length of time the system takes to cross-reference the data. This is exactly the kind of problem map-reduce is intended to address - but the benefits of this mainly come from being able to run the algorithm in parallel across lots of machines - at the end of the day it takes just as many clock ticks.
I'm starting to plan how to do this.
First thing is to possibly get rid of your database technology or supplement it with either triplestore or graph technologies. That should provide some better performance for analyzing similar likes or topics.
Next yes get a subset. Take a few interests that the user has and get a small pool of users that have similar interests.
Then build indexes of likes in some sort of meaningful order and count the inversions (divide and conquer - this is pretty similar to merge sort and you'll want to sort on your way out to count split inversions anyways).
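For reference, counting inversions with a merge-sort style divide and conquer can be sketched as below. How the input list is built is left open here; one possible reading, which is my assumption rather than something stated above, is that it holds one user's ranking of items expressed in another user's order, so fewer inversions means more similar taste.

```python
def count_inversions(values):
    """Return (sorted_values, inversions) using a merge-sort style divide and conquer."""
    if len(values) <= 1:
        return list(values), 0
    mid = len(values) // 2
    left, left_inv = count_inversions(values[:mid])
    right, right_inv = count_inversions(values[mid:])
    merged, split_inv = [], 0
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
            # right[j] jumps ahead of every element still waiting in left
            split_inv += len(left) - i
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, left_inv + right_inv + split_inv


if __name__ == "__main__":
    print(count_inversions([2, 4, 1, 3, 5]))
```

This runs in O(n log n) instead of the O(n^2) needed to check every pair directly.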
I hope that helps - you don't want to compare everything to everything else or it's definitely O(n^2). You should be able to replace that with something somewhere between constant and linear if you take sets of people who have similar likes and use that.
For example, from a graph perspective, take something that they recently liked, and look at the in edges and then go trace them out and just analyze those users. Maybe do this on a few recently liked articles and then find a common set of users from that and use that for the collaborative filtering to find articles the user would likely enjoy. Then you're at a workable problem size - especially in graph where there is no index growth (although maybe more in edges to traverse on the article - that just gives you more chance of finding usable data though).
Even better would be to key the articles themselves so that if an article was liked by someone you can see articles that they may like based on other users (ie Amazon's 'users that bought this also bought').
Hope that gives a few ideas. For graph analysis there are some frameworks that may help, like Faunus for stats and derivations.
I am creating an application to help our team manage a twitter competition. So far I've managed to interact with the API fine, and return a set of tweets that I need.
I'm struggling to decide on the best way to handle the storage of the tweets in the database, how often to check for them and how to ensure there are no overlaps or gaps.
You can get a maximum number of 100 tweets per page. At the moment, my current idea is to run a cron script say, once every 5 minutes or so and grab a full 100 tweets at a time, and loop through them looking in the db to see if I can find them, before adding them.
This has the obvious drawback of running 100 queries against the db every 5 minutes, and however many INSERT there are also. Which I really don't like. Plus I would much rather have something a little more real time. As twitter is a live service, it stands to reason that we should update our list of entrants as soon as they enter.
This again throws up a drawback of having to repeatedly poll Twitter, which, although might be necessary, I'm not sure I want to hammer their API like that.
Does anyone have any ideas on an elegant solution? I need to ensure that I capture all the tweets, and not leave anyone out, and keeping the db user unique. Although I have considered just adding everything and then grouping the resultant table by username, but it's not tidy.
I'm happy to deal with the display side of things separately as that's just a pull from mysql and display. But the backend design is giving me a headache as I can't see an efficient way to keep it ticking over without hammering either the api or the db.
100 queries in 5 minutes is nothing. Especially since a tweet has essentially only 4 pieces of data associated with it: user ID, timestamp, tweet text, and tweet ID - say, about 170 characters worth of data per tweet. Unless you're running your database on a 4.77MHz 8088, your database won't even blink at that kind of "load".
The Twitter API offers a streaming API that is probably what you want to do to ensure you capture everything:
http://dev.twitter.com/pages/streaming_api_methods
If I understand what you're looking for, you'll probably want a statuses/filter, using the track parameter with whatever distinguishing characteristics (hashtags, words, phrases, locations, users) you're looking for.
Many Twitter API libraries have this built in, but basically you keep an HTTP connection open and Twitter continuously sends you tweets as they happen. See the streaming API overview for details on this. If your library doesn't do it for you, you'll have to check for dropped connections and reconnect, check the error codes, etc - it's all in the overview. But adding them as they come in will allow you to completely eliminate duplicates in the first place (unless you only allow one entry per user - but that's client-side restrictions you'll deal with later).
As far as not hammering your DB, once you have Twitter just sending you stuff, you're in control on your end - you could easily have your client cache up the tweets as they come in, and then write them to the db at given time or count intervals - write whatever it has gathered every 5 minutes, or write once it has 100 tweets, or both (obviously these numbers are just placeholders). This is when you could check for existing usernames if you need to - writing a cached-up list would allow you the best chance to make things efficient however you want to.
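The caching idea can be sketched as a small buffer that flushes either when it holds enough tweets or when its oldest tweet has waited long enough. The class name, its parameters and the `flush` callback (which would do the actual database write) are hypothetical, and the 100 / 300 second defaults are just the placeholder numbers from above.

```python
import time


class TweetBuffer:
    """Collect tweets in memory and flush them in batches by count or age."""

    def __init__(self, flush, max_items=100, max_age_seconds=300, clock=time.monotonic):
        self._flush = flush            # called with a list of buffered tweets
        self._max_items = max_items
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so the timing can be tested
        self._items = []
        self._first_added_at = None

    def add(self, tweet):
        if not self._items:
            self._first_added_at = self._clock()
        self._items.append(tweet)
        self.flush_if_due()

    def flush_if_due(self):
        """Flush when the buffer is full or its oldest tweet has waited long enough."""
        if not self._items:
            return False
        too_many = len(self._items) >= self._max_items
        too_old = self._clock() - self._first_added_at >= self._max_age
        if too_many or too_old:
            batch, self._items = self._items, []
            self._flush(batch)
            return True
        return False
```

`add` would be called for every tweet the stream delivers. Since the age check only runs when something calls it, a quiet stream also needs a periodic call to `flush_if_due`, for example from the loop that reads the connection.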
Update:
My solution above is probably the best way to do it if you want to get live results (which it seems like you do). But as is mentioned in another answer, it may well be possible to just use the Search API to gather entries after the contest is over, and not worry about storing them at all - you can specify pages when you ask for results (as outlined in the Search API link), but there are limits as to how many results you can fetch overall, which may cause you to miss some entries. What solution works best for your application is up to you.
I read over your question and it seems to me that you want to duplicate data already stored by Twitter. Without more specifics on the competition you're running (how users enter, for example, or the estimated number of entries), it's impossible to know whether or not storing this information locally in a database is the best way to approach this problem.
A better solution might be to skip storing duplicate data locally and pull the entrants directly from Twitter, i.e. when you're attempting to find a winner.
You could then eliminate duplicate entries on the fly whilst the code is running. You would just need to call "the next page" once it has finished processing the 100 entries it has already fetched. Although, I'm not sure if this is possible directly through the Twitter API.
I think running a cron every X minutes and basing it off of the tweet's creation date may work. You can query your database to find the date/time of the last recorded tweet, then only run selects if there are matching times to prevent duplicates. Then, when you do your inserts into the database, use one or two insert statements containing all the entries you want to record to keep performance up.
INSERT INTO `tweets` (id, date, ...) VALUES (..., ..., ...), (..., ..., ...), ...;
This doesn't seem too intensive...also depends on the number of tweets you expect to record though. Also make sure to index the table properly.
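A sketch of the batched insert, with duplicates left to the database instead of a SELECT per tweet: the tweet ID is the primary key and conflicting rows are simply skipped. The column names and the `save_tweets` helper are assumptions, and SQLite stands in for MySQL (where the equivalent statement is `INSERT IGNORE`).

```python
import sqlite3

SCHEMA = (
    "CREATE TABLE tweets ("
    " id INTEGER PRIMARY KEY,"
    " username TEXT NOT NULL,"
    " created_at TEXT NOT NULL,"
    " text TEXT NOT NULL)"
)


def save_tweets(conn, tweets):
    """Insert a batch of (id, username, created_at, text) rows; return rows added."""
    before = conn.total_changes
    with conn:  # one transaction for the whole batch
        conn.executemany(
            "INSERT OR IGNORE INTO tweets (id, username, created_at, text) "
            "VALUES (?, ?, ?, ?)",
            tweets,
        )
    # Rows skipped because their id already exists do not count as changes
    return conn.total_changes - before


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    batch = [(1, "ann", "2010-01-01 10:00:00", "hi"), (2, "bob", "2010-01-01 10:01:00", "yo")]
    print(save_tweets(conn, batch), "new tweets stored")
```

Overlapping fetches are then harmless: re-inserting tweets that were already stored adds nothing.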
I have a site where the users can view quite a large number of posts. Every time this is done I run a query similar to UPDATE table SET views=views+1 WHERE id = ?. However, there are a number of disadvantages to this approach:
There is no way of tracking when the pageviews occur - they are simply incremented.
Updating the table that often will, as far as I understand it, clear the MySQL cache of the row, thus making the next SELECT of that row slower.
Therefore I consider employing an approach where I create a table, say:
object_views { object_id, year, month, day, views }, so that each object has one row per day in this table. I would then periodically update the views column in the objects table so that I wouldn't have to do expensive joins all the time.
This is the simplest solution I can think of, and it seems that it is also the one with the least performance impact. Do you agree?
(The site is built on PHP 5.2, Symfony 1.4 and Doctrine 1.2 in case you wonder)
Edit:
The purpose is not web analytics - I know how to do that, and that is already in place. There are two purposes:
Allow the user to see how many times a given object has been shown, for example today or yesterday.
Allow the moderators of the site to see simple view statistics without going into Google Analytics, Omniture or whatever solution. Furthermore, the results in the backend must be realtime, a feature which GA cannot offer at this time. I do not wish to use the Analytics API to retrieve the usage data (not realtime, GA requires JavaScript).
Quote : Updating the table that often will, as far as I understand it, clear the MySQL cache of the row, thus making the next SELECT of that row slower.
There is much more to it than this. This is a database killer.
I suggest you make a table like this:
object_views { object_id, timestamp}
This way you can aggregate on object_id (using the COUNT() function).
So every time someone views the page you INSERT a record into the table.
Once in a while you must clean the old records out of the table. The UPDATE statement is EVIL :)
On most platforms it will basically mark the row as deleted and insert a new one, thus making the table fragmented. Not to mention locking issues.
Hope that helps
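A rough sketch of the insert-only approach: one row per view, counts derived with COUNT(), and old rows purged now and then. The column name `viewed_at` (instead of `timestamp`), the index and the helper functions are my own choices for the example, and SQLite stands in for MySQL, so the date function differs from what you would write there.

```python
import sqlite3


def make_db():
    """Create an in-memory view log with an index for per-object lookups."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE object_views (object_id INTEGER NOT NULL, viewed_at TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX idx_object_views ON object_views (object_id, viewed_at)")
    return conn


def record_view(conn, object_id, viewed_at):
    """Append one view; existing rows are never updated."""
    with conn:
        conn.execute(
            "INSERT INTO object_views (object_id, viewed_at) VALUES (?, ?)",
            (object_id, viewed_at),
        )


def views_on_day(conn, object_id, day):
    """Count views of one object on a given day ('YYYY-MM-DD')."""
    row = conn.execute(
        "SELECT COUNT(*) FROM object_views WHERE object_id = ? AND date(viewed_at) = ?",
        (object_id, day),
    ).fetchone()
    return row[0]


def purge_before(conn, cutoff):
    """Delete views older than `cutoff` ('YYYY-MM-DD HH:MM:SS'); return rows removed."""
    with conn:
        cur = conn.execute("DELETE FROM object_views WHERE viewed_at < ?", (cutoff,))
    return cur.rowcount
```

Before purging, the per-day counts could be copied into a summary table so that older statistics stay available.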
Along the same lines as Rage, you simply are not going to get the same results doing it yourself when there are a million third party log tools out there. If you are tracking on a daily basis, then a basic program such as webtrends is perfectly capable of tracking the hits especially if your URL contains the ID's of the items you want to track... I can't stress this enough, it's all about the URL when it comes to these tools (Wordpress for example allows lots of different URL constructs)
Now, if you are looking into "impression" tracking then it's another ball game because you are probably tracking each object, the page, the user, and possibly a weighted value based upon location on the page. If this is the case you can keep your performance up by hosting the tracking on another server where you can fire and forget. In the past I worked this using SQL updating against the ID and a string version of the date... that way when the date changes from 20091125 to 20091126 it's a simple query without the overhead of let's say a datediff function.
First, just a quick remark: why not combine the year, month and day into a single DATETIME column? It would make more sense in my mind.
Also, I am not really sure what the exact reason for doing this is. If it's for a marketing/web stats purpose, you would be better off using a tool made for that purpose.
There are two big families of tools capable of giving you an idea of your website access statistics: log-based ones (awstats is probably the most popular) and ajax/1-pixel-image-based ones (Google Analytics would be the most popular).
If you prefer to build your own stats database you can probably manage to build a log parser easily using PHP. If you find parsing Apache logs (or IIS logs) too much of a burden, you could make your application output some custom logs formatted in a simpler way.
Another possible solution is to use memcached: the daemon provides a kind of counter that you can increment. You can log views there and have a script collect the results every day.
If you're going to do that, why not just log each access? MySQL can cache inserts in continuous tables quite well, so there shouldn't be a notable slowdown due to the insert. You can always run SHOW PROFILES to see what the performance penalty actually is.
On the datetime issue, you can always use GROUP BY MONTH( accessed_at ) , YEAR( accessed_at) or WHERE MONTH(accessed_at) = 11 AND YEAR(accessed_at) = 2009.