I am working on my bachelor's project and I'm trying to figure out a simple dilemma.
It's a website for a football club. Some information will be fetched from the national football association's website (basically the league table and match history). I'm trying to decide on the best way to store this fetched data. I'm thinking about two possibilities:
1) I will set up a cron job that runs, let's say, every hour. It will call a script that fetches the league table and all other data from the website and stores it in a flat file.
2) I will use Zend_Cache object to do the same, except the data will be stored in cached files. The cache will get updated about every hour as well.
Which is the better approach?
I think the answer can be found in why you want to cache the file. Is it to place minimal load on the external server by only updating the cache every so often, or is it to keep pages loading fast because the file takes a long time to download or process?
If it's only to respect the other server, and fetching/processing the page adds little noticeable time for the end user, I'd just implement Zend_Cache. It's simple: you don't have to worry about one script downloading the page and another script loading the downloaded data (plus the cron job).
If the cache is also needed because fetching/processing the page takes significant time, I'd still use Zend_Cache; however, I'd set the cache to expire every 2 hours and set up a cron job (or something similar) to manually update the cache every hour. Sure, this adds back the complexity of two scripts (or at least a request flag to manually refresh the cache), but should the cron job fail, you're still fine.
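For reference, a minimal sketch of that Zend_Cache side of things under the Zend Framework 1 API; the cache directory, cache key, and fetch_league_table() helper below are assumptions, not part of the original question:

<?php
require_once 'Zend/Cache.php';

// Sketch only: 2-hour lifetime, with an hourly cron/flag refresh as described above.
$cache = Zend_Cache::factory(
    'Core',
    'File',
    array('lifetime' => 7200, 'automatic_serialization' => true), // expire after 2 hours
    array('cache_dir' => '/tmp/cache/')                            // assumed cache directory
);

if (($table = $cache->load('league_table')) === false) {
    $table = fetch_league_table();          // hypothetical function that scrapes the FA site
    $cache->save($table, 'league_table');
}

// The hourly cron job (or a request flag) can force a refresh before expiry:
// $cache->remove('league_table'); then re-fetch and save as above.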
Well, if you choose option 1 it adds some complexity, because you have to use cron as well (not that cron is overly complex), and then you have to check that the data file is complete before using it, or deal with moving files from a temp location after they have been downloaded and parsed into the proper format.
Option 2 eliminates much of that, except that on the request where the cache has expired, someone has to wait for the download/parse.
I would say option 1 is the better approach, but option 2 is going to be easier to implement and less prone to error. That said, it's fairly trivial to add safeguards to the cron script to prevent the negatives I describe, so I would probably go with option 1.
Here's the situation. I am scraping a website to get the data from its articles, using a robots page supplied by that website (a list of URLs pointing to every article posted on the site). So far, I do a database merge to 'upsert' the URLs into my table. I know that each scraping run will take a good while because there are over 1400 articles to parse. I need to write an algorithm that will only do a small chunk of the jobs on each cron run so it doesn't overload my server, etc.
Edit: I think I should mention that I'm using Drupal 7. Also, this has to be an ongoing script that runs over time; I'm not so worried about the time it takes for the initial fill of the database. The robots page is dynamic, and URLs get added to it periodically as articles are published. I'm currently using hook_cron() for this, but I'm open to better methods if there's something better suited to the job.
You can use the Drupal queue operations API to enqueue each page to scrape as a queue item. You can, but are not required to, declare your queue as cron-executed. Drupal will then take care of executing as many queue items as it can at each cron run without exceeding the queue's declared maximum execution time.
See aggregator_cron() for an example of item enqueuing, and aggregator_cron_queue_info() for the declaration that lets Drupal process these queued items during its cron run.
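A minimal sketch of those hooks, modeled on the aggregator pattern; the module name mymodule, the {article_urls} table, and the worker callback are assumptions for illustration:

<?php
/**
 * Implements hook_cron().
 * Enqueue every URL that still needs scraping.
 * (A real module would also avoid re-queuing items already in the queue.)
 */
function mymodule_cron() {
  $queue = DrupalQueue::get('mymodule_scrape');
  $result = db_query('SELECT url FROM {article_urls} WHERE scraped = 0');
  foreach ($result as $record) {
    $queue->createItem($record->url);
  }
}

/**
 * Implements hook_cron_queue_info().
 * Lets Drupal process queued items during its own cron run.
 */
function mymodule_cron_queue_info() {
  return array(
    'mymodule_scrape' => array(
      'worker callback' => 'mymodule_scrape_url',
      'time' => 60, // max seconds spent on this queue per cron run
    ),
  );
}

/**
 * Queue worker: fetch and process a single article.
 */
function mymodule_scrape_url($url) {
  $response = drupal_http_request($url);
  if ($response->code == 200) {
    // Parse $response->data and upsert the article here.
  }
}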
If queue processing during normal Drupal cron is an issue, you can process your queue independently with the help of modules like Waiting Queue or Beanstalkd integration.
Most likely the HTTP overhead of fetching each article will vastly outweigh the overhead of doing the database operations. Just don't fetch too many articles in parallel and you should be fine. Most webmasters frown on scrapers, especially when they're doing 10, 20, 500+ parallel fetches.
So, you already have the URLs in your database. Add a status column to that table (scraped or not). The cron can kick off every so often, grab the next URL that has not been scraped, and mark it as scraped.
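A minimal sketch of that status-column approach using the Drupal 7 database layer; the table and column names are assumptions:

<?php
// Grab a small batch of unscraped URLs and mark each one done once processed.
$result = db_query_range('SELECT id, url FROM {article_urls} WHERE scraped = 0', 0, 20);
foreach ($result as $record) {
  // ... fetch and parse $record->url here ...
  db_update('article_urls')
    ->fields(array('scraped' => 1))
    ->condition('id', $record->id)
    ->execute();
}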
Using a PHP script, I need to update a number every 5 seconds while somebody is on my page. So let's say I have 300 visitors, each spending about 1 minute on the page, and every 5 seconds they stay on the page the number changes... which is a total of 3600 changes per minute. I would prefer to update the number in my MySQL database, but I'm not sure whether it's too inefficient to open so many MySQL connections (just for one number change) when I could just change the number in a file.
P.S.: I have no idea whether 3600 connections/minute is a high number or not, but what about this case in general, considering an even higher number of visitors? What is the most efficient way to do this?
Doing 3,600 reads and writes per minute against the same file is just out of the question. It's complicated (you need to be extremely careful with file locking), it's going to have awful performance, and sooner or later your data will get corrupted.
DBMSs like MySQL are designed for concurrent access. If they can't cope with your load, a file won't do it better.
It will fail eventually if the user count grows, but the performance depends on your server setup and the other tasks related to this update.
You can do a quick test: open up 300 persistent connections to your database and fire off as many queries as you can in a minute.
If you don't need it to be transactional (the order of executed queries is not important), then I suggest you use memcached (or Redis if you need to persist to disk) for this instead.
If you save to a file, you have to solve concurrency issues yourself (and all but the currently reading/writing process will have to wait). The DB solves this for you. For better performance you could use memcached.
Maybe you could avoid the "do every 5s for each user" approach by other means (e.g. saving the current time and subtracting it the next time the user does something). This depends on your actual problem.
Don't even think about trying to handle this with files - it's just not going to work unless you build a lock queue manager - and if you're going to all that trouble, you might as well use a daemon to manage the value rather than just queue locks.
Using a DBMS is the simplest approach.
For a more efficient but massively more esoteric approach, write a single-threaded socket server daemon and have the clients connect to that. (there's a lib here for doing the socket handling, and there's a PEAR class for running PHP as a daemon)
Files aren't transactional, and you don't want to lose counts, so the database is the way to go.
memcached's increment command is faster than the database and was, I think, the basis of one really fast view-counting setup.
If you use, say, a key per hour, then each page view issues an increment on page:<hour>, and a background process can collect the counts from the past hour and insert them into a database. If memcache fails you might lose the count for that hour, but you will not have double counted or missed any, and keeping counts per period gives interesting statistics.
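A minimal sketch of that per-hour counter using the PHP Memcached extension; the key scheme and server address are assumptions:

<?php
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$key = 'page:' . date('YmdH'); // one counter per hour

// increment() fails if the key does not exist yet; add() is atomic,
// so only one client creates it and everyone else just increments.
if ($mc->increment($key) === false) {
  if (!$mc->add($key, 1)) {
    $mc->increment($key);
  }
}

// A background job can later read the previous hour's total and store it in MySQL:
// $total = $mc->get('page:' . date('YmdH', time() - 3600));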
Using a dedicated temporary file will certainly be the most efficient disk access you can have. However, you will not be protected from concurrent access to the file if your server uses multiple threads or processes. If what you want to do is update 1 number per user, then using a $_SESSION sub-variable will work, and I believe this is stored in memory, so it should be very efficient. Then you can easily store this number in your database every 5 minutes per user.
I need to show some basic stats on the front page of our site like the number of blogs, members, and some counts - all of which are basic queries.
I'd prefer to find a method to run these queries, say, every 30 minutes and store the output, but I'm not sure of the best approach and I don't really want to use a cron. Basically, I don't want to make thousands of queries per day just to display these results.
Any ideas on the best method for this type of function?
Thanks in advance
Unfortunately, cron is the better and more reliable solution.
Cron is a time-based job scheduler in Unix-like computer operating systems. The name cron comes from the word "chronos", Greek for "time". Cron enables users to schedule jobs (commands or shell scripts) to run periodically at certain times or dates. It is commonly used to automate system maintenance or administration, though its general-purpose nature means that it can be used for other purposes, such as connecting to the Internet and downloading email.
If you store the output in a disk file, you can always check whether its filemtime is less than 30 minutes old before proceeding to re-run the expensive queries.
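A minimal sketch of that filemtime check; the cache file path, the PDO connection $db, and the table names are assumptions:

<?php
$cacheFile = '/tmp/front_page_stats.json';
$maxAge    = 30 * 60; // 30 minutes

if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
    $stats = json_decode(file_get_contents($cacheFile), true);
} else {
    // Cache is missing or stale: re-run the expensive queries and rewrite the file.
    $stats = array(
        'blogs'   => (int) $db->query('SELECT COUNT(*) FROM blogs')->fetchColumn(),
        'members' => (int) $db->query('SELECT COUNT(*) FROM members')->fetchColumn(),
    );
    file_put_contents($cacheFile, json_encode($stats), LOCK_EX);
}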
There is nothing at all wrong with using a cron to store this kind of stuff somewhere.
If you're looking for a bit more sophisticated caching methods, I suggest reading into memcached or APC, which could both provide a solution for your problem.
A cron job is the best approach; I haven't seen anything else that's feasible.
There are many ways to do this. I think a good one, though not the best, is to store your data in a table and refresh it every 30 minutes using the sleep() function.
I recommend you take a look at the WordPress blog system, and especially at the BuddyPress plugin.
I did the same thing some time ago; every time someone loaded the page, the query did the job and retrieved the information from the database. I remember it was something like
SELECT COUNT(*) FROM my_table
and I got the number of posts in my case.
Anyway, there are so many approaches. Good luck.
Don't forget, cron is always your best friend.
Using cron is the simplest way to solve the problem.
One good reason for not using cron: you'll be generating the stats even if nobody requests them.
Depending on how long it takes to generate the data (you might want to keep track of the previous counts and only add rows whose timestamp is greater than the previous run - with appropriate indexes!), you could trigger the regeneration when a request comes in and the data looks stale.
Note that you should keep the stats in the database and think about how to implement a mutex to avoid multiple requests trying to update the cache at the same time.
However, the right solution would be to update the stats every time a record is added. Unless you've got very large traffic volumes, the overhead would be minimal. While 'SELECT COUNT(*) FROM some_table' will run very quickly, you'll obviously run into problems if you don't simply want to count all the rows in a table (e.g. if blogs and replies are held in the same table). Indeed, if you were to implement the stats update as a trigger on the relevant tables, you wouldn't need to make any changes to your PHP code.
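For illustration, a sketch of maintaining such a running count from application code; a database trigger would achieve the same thing with no PHP changes. The table and column names below are assumptions:

<?php
// Insert the new record and bump the denormalised counter in one transaction.
$db->beginTransaction();
$db->prepare('INSERT INTO blogs (title, body) VALUES (?, ?)')
   ->execute(array($title, $body));
$db->exec('UPDATE site_stats SET blog_count = blog_count + 1');
$db->commit();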
Bit of an odd question but I'm hoping someone can point me in the right direction. Basically I have two scenarios and I'd like to know which one is the best for my situation (a user checking a scoreboard on a high traffic site).
The top 10 is regenerated every time a user hits the page - increased load on the server, especially under high traffic, but the user will see his/her correct standing ASAP.
The top 10 is regenerated at a set interval, e.g. every 10 minutes - this only generates one set of results, causing one spike every 10 minutes rather than potentially one every x seconds, but if a user hits the page between refreshes they won't see their updated score.
Each one has its pros and cons; in your experience, which one would be best to use, or are there any magical alternatives?
EDIT - An update: after taking on board what everyone has said, I've decided to rebuild this part of the application. Rather than dealing with the individual scores, I'm dealing with the totals; these are then saved out to a separate table which sort of acts like a cached data source.
Thank you all for the great input.
Adding to Marcel's answer, I would suggest only updating the scoreboards upon write events (like a new or deleted score). This way you can keep static answers for popular queries like the top 10, etc. Use something like memcache to keep data cached for requests, or, if you don't/can't install something like memcache on your server, serialize common requests and write them to flat files, then delete/update them upon write events. Have your code look for the cached result (or file) first, and then, only if it's missing, run the query and create the data.
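A minimal sketch of that pattern: serve the cached top 10 when it exists, rebuild it only on a cache miss, and invalidate it on write events. The key name, table, and connection objects are assumptions:

<?php
function get_top10(Memcached $mc, PDO $db) {
    $top10 = $mc->get('scoreboard:top10');
    if ($top10 === false) {
        // Cache miss: run the query once and store the result.
        $top10 = $db->query('SELECT user_id, score FROM scores ORDER BY score DESC LIMIT 10')
                    ->fetchAll(PDO::FETCH_ASSOC);
        $mc->set('scoreboard:top10', $top10);
    }
    return $top10;
}

// Call this whenever a score is added or deleted, so the next read rebuilds the cache.
function invalidate_top10(Memcached $mc) {
    $mc->delete('scoreboard:top10');
}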
Nothing ever really needs to be real time when it comes to the web. I would go with option 2; users will not notice that their score is not changing instantly. You can use some JS to refresh the top 10 every time the cache has cleared.
To add to Jordan's suggestion: I'd put the scoreboard in a separate (HTML-formatted) file that is produced only when new data arrives. You can include this file in the PHP page containing the scoreboard, or even let a visitor's browser fetch it periodically using XMLHttpRequest (to save bandwidth). Users with JavaScript disabled, or using a browser that doesn't support XMLHttpRequest (rare these days, but possible), will just see a static page.
The Drupal voting module will handle this for you, giving you an option of when to recalculate. If you're implementing it yourself, then caching the top 10 somewhere is a good idea - you can either regenerate it at regular intervals or you can invalidate the cache at certain points. You'd need to look at how often people are voting, how often that will cause the top 10 to change, how often the top 10 page is being viewed and the performance hit that regenerating it involves.
If you're not set on Drupal/MySQL then CouchDB would be useful here. You can create a view which calculates the top 10 data and it'll be cached until something happens which causes a recalculation to be necessary. You can also put in an http caching proxy inline to cache results for a set number of minutes.
I have a cron that, for the time being, runs once every 20 minutes but will ultimately run once a minute. This cron processes potentially hundreds of functions that each grab an XML file remotely, process it, and perform their tasks. The problem is that, due to the speed of the remote sites, this script can sometimes take a while to run.
Is there a safe way to do this without [a] the script timing out, [b] overloading the server, or [c] overlapping and not completing its task for that minute before it runs again (would that error out?)?
Unfortunately, caching isn't an option, as the data changes in near real-time and comes from a variety of sources.
I think a slight design change would benefit this process quite a bit. Given that a remote server could time out, or a connection could be slow, you'll definitely run into concurrency issues if one slow job is still writing files when another one starts up.
I would break it into two separate scripts. Have one script that is only used for fetching the latest XML data, and another for processing it. The fetch script can take its sweet time if it needs to, while the process script continually looks for the newest file available in order to process it.
This way they can operate independently, and the processing script can always work with the latest data, regardless of how long either script takes to run.
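On the overlap concern [c], a minimal sketch of guarding either script with an exclusive, non-blocking lock so a new cron run exits immediately if the previous one is still going; the lock-file path is an assumption:

<?php
$lock = fopen('/tmp/fetch_feeds.lock', 'c');
if ($lock === false || !flock($lock, LOCK_EX | LOCK_NB)) {
    exit("Previous run still in progress.\n");
}

// ... fetch (or process) the XML files here ...

flock($lock, LOCK_UN);
fclose($lock);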
Have a stack that you keep all the jobs on, and have a handful of threads whose job it is to:
Pop a job off the stack
Check if you need to refresh the XML file (check ETags, Expires headers, etc.)
Grab the XML if need be (this is the bit that could take time, hence spreading the load over threads). This should time out if it takes too long and report the failure to someone, as you might have a site down, a dodgy RSS generator, or whatever.
Then process it.
This way you'll be able to grab lots of data each time.
It could be that you don't need to grab the file at all (it would help if you could store the last ETag for each file, etc.).
One tip: don't expect any of them to be in a valid format. I suggest you have a look at Mark Pilgrim's RSS RegExp reader, which does a damn fine job of reading most RSS feeds.
Addition: I would say hitting the same sites every minute is not really playing nice with those servers and creates a lot of work for your own server; do you really need to hit them that often?
You should make sure to read the <ttl> tag of the feeds you are grabbing, to ensure you are not unnecessarily fetching feeds before they change. <ttl> holds the update period in minutes, so if a feed has <ttl>60</ttl>, it should only be fetched every 60 minutes.
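A minimal sketch of reading <ttl> with SimpleXML; the feed URL and the 60-minute default are assumptions:

<?php
$feed = simplexml_load_file('http://example.com/feed.xml');
$ttl  = isset($feed->channel->ttl) ? (int) $feed->channel->ttl : 60; // minutes

// Skip this feed until $ttl minutes have passed since the last successful fetch.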