I am a newbie with PHP and therefore this is more of a conceptual question or maybe even a question about 'best practices'.
Often, I see websites with stats drawn from their database. For example, let's say it is a sales lead website. It may have stats at the top of the page like:
NEW SALES LEADS YESTERDAY: 123
NEW SALES LEADS THIS MONTH: 556
NEW SALES LEADS THIS YEAR: 3870
Obviously, this should not be calculated every time the page is displayed, right? That would potentially be a large burden on the server? How do people cache this type of data? Any best practices? I thought about writing a cron job that would calculate it on a daily basis and insert it into a database. What are your ideas? Thank you!
You can calculate it once and then store it in XCache. There doesn't seem to be a need for cron here, though: the query can run once and its result can be stored in XCache. The important thing is to set the expiration time of the stored value according to your use case. E.g. if you need to store daily stats like the ones above, set the expiration time to a few hours. For data that gets updated every minute, set the expiration time to a few minutes.
Something like this.
// Cache lifetime in seconds; a few hours suits daily stats
$ttlSeconds = 3 * 60 * 60;
if (xcache_isset("newSalesLeadYest")) {
    $newSalesLeadYest = xcache_get("newSalesLeadYest");
} else {
    // Placeholder for the query that computes the stat
    $newSalesLeadYest = runQueryToFetchStat();
    xcache_set("newSalesLeadYest", $newSalesLeadYest, $ttlSeconds);
}
What you need is to come up with a caching strategy.
Some factors to help you decide:
How frequently does the data change?
How important is it that the values are current? Is it OK if they are 1 min, 1 hr, or 1 day old?
How expensive, time wise, is loading fresh data?
How much traffic are you getting? 10s, 100s, millions?
There are a few ways you can achieve the result.
You can use something like memcached to persist the data to avoid it being generated each request.
You can use http caching and load the data client side using javascript from an api.
You can have a background worker (eg. run by cron), which generates the latest figures and persists to a lookup database table.
You could improve the queries and indexes so that getting live data is fast enough to do on every request.
You could alter your database schema so that you have more static data.
From the 3 examples you gave, 3 simple counts should not be expensive enough to warrant complex caching systems. If you can paste the sql queries, we can help optimise them.
The data sounds like it will only get updated once per day, so a simple nightly cron "flatten" query would be a nice fit.
Related
I just want an approach for building a database with live records, so please don't just downvote. I don't expect any code.
At the moment I have a MySQL database with about 2 thousand users, and the number is growing. Each player/user has several points, which increase or decrease through certain actions.
My goal is for this database to be refreshed about every second, so that users with more points move up, others move down, and so on.
My question is: what is the best approach for this "live database", where records have to be updated every second? In MySQL I can run time-based actions which execute an SQL command, but I don't think this is the greatest way. Can someone suggest a good way to handle this? E.g. other database providers like MongoDB, or anything else?
EDIT
This doesn't work client side, so I can't simply push/post it into the database, because some events are time based. To explain: a user is training his character in the application. This training (to gain 1 level) takes 12 hours. After the time has elapsed, the record should be updated in the database AUTOMATICALLY, even if the user doesn't send a POST request himself (for example, if he is not logged in), so that other users see the updated data in his profile.
You need to accept the fact that rankings will be stale to some extent. Your predicament is no different than any other gaming platform (or SO rankings for that matter). Business decisions were put in place and constantly get reviewed for the level of staleness. Take the leaderboards on tags here, for instance. Or the recent change that has profile pages updated a lot more frequently, versus around 4AM GMT.
Consider the use of MySQL Events. It is built-in functionality that replaces the need for cron tasks. I have 3 event-related links off my profile page if interested. You could calculate ranks on a timed schedule (your tolerance for staleness) and the users' requests for them would be fast (faster than the below from Gordon). On the con-side, they are stale.
Consider not saving (writing) rank info, but rather focusing just on filling in the slots of your other data, and getting your rankings on the fly. As an example, see this rankings answer here from Gordon. It is dynamic, runs upon request, so it is not stale at least at the moment of the request, and would not require Events.
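As a rough illustration of the on-the-fly option, ranks can be derived from the current points at request time instead of being stored. This is only a sketch in plain PHP over an in-memory array: the function name rankPlayers and the sample users are made up for the example, and in practice the ordering would normally be done by the database query itself.

```php
<?php
// Hypothetical helper: derive ranks from current points at read time.
// Ties share a rank ("competition" ranking: 1, 2, 2, 4).
function rankPlayers(array $pointsByUser): array
{
    arsort($pointsByUser);                // highest points first, keys preserved
    $ranks = [];
    $position = 0;
    $previousPoints = null;
    $currentRank = 0;
    foreach ($pointsByUser as $user => $points) {
        $position++;
        if ($points !== $previousPoints) {
            $currentRank = $position;     // new score: rank jumps to the position
            $previousPoints = $points;
        }
        $ranks[$user] = $currentRank;
    }
    return $ranks;
}

// Made-up sample data
$ranks = rankPlayers(['ann' => 120, 'bob' => 300, 'cid' => 120, 'dee' => 50]);
```

Nothing is written when points change, so the result is only as fresh as the points themselves; the cost is a sort on every request.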
Know that only you should decide what is tolerable for the UX.
I'm writing a realtime web application, something similar to an auction site. The problem is that I need a daemon script, preferably in PHP, that runs in the background, constantly queries the MySQL DB and, based on certain criteria (time and conditions from result sets), updates other tables. Performance of the daemon is crucial. Sample use case: we have a deal that is going to expire in 2:37 minutes. Even if nobody is watching or bidding on it, we need to expire it exactly 2:37 after it started.
Can anybody advise a programming technology/software that performs this kind of task the best?
Thanks in advance
UPDATED: need to perform a query when a deal expires, no matter if it has ever been accessed by a user or not.
Why do you need to fire queries at time intervals? Can't you just change how your frontend works?
For example, in the "Deals" page, just show only deals that haven't expired - simplified example:
SELECT * FROM Deal WHERE NOW() <= DateTimeToExpire
Accordingly for the "Orders" page, a deal can become a placed order only if time hasn't expired yet.
Does your daemon need to trigger actions instantaneously? If you need a table containing the expired state as a column you could just compute the expire value on the fly or define a view? You could then use a daemon/cron job querying the view every 10 minutes or so if you have to send out emails or do some cleanup work etc.
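Computing the expired state on the fly could look roughly like this in PHP. It is a minimal sketch, assuming the start time is stored as a Unix timestamp and the lifetime in seconds; dealIsExpired and both variable names are hypothetical.

```php
<?php
// Hypothetical helper: a deal is expired once "now" is past start + duration.
// Timestamps are Unix seconds; nothing has to be written when the deal ends.
function dealIsExpired(int $startedAt, int $durationSeconds, int $now): bool
{
    return $now > $startedAt + $durationSeconds;
}

$startedAt = 1700000000;      // example start time
$duration  = 2 * 60 + 37;     // 2:37 minutes = 157 seconds
```

Because the state is derived from stored values, a deal nobody looks at still counts as expired the first time anything reads it.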
I have a table of more than 15000 feeds and it's expected to grow. What I am trying to do is to fetch new articles using simplepie, synchronously and storing them in a DB.
Now I have run into a problem: since the number of feeds is high, my server stops responding and I am not able to fetch feeds any longer. I have also implemented some caching, and I fetch odd and even feeds at different time intervals.
What I want to know is whether there is any way of improving this process, maybe by fetching feeds in parallel. Or maybe someone can give me a pseudo-algorithm for it.
15,000 Feeds? You must be mad!
Anyway, a few ideas:
Increase the Script Execution time-limit - set_time_limit()
Don't go overboard, but ensuring you have a decent amount of time to work in is a start.
Track Last Check against Feed URLs
Maybe add a field for each feed, last_check and have that field set to the date/time of the last successful pull for that feed.
Process Smaller Batches
Better to run smaller batches more often. Think of it as being the PHP equivalent of "all of your eggs in more than one basket". With the last_check field above, it would be easy to identify those with the longest period since the last update, and also set a threshold for how often to process them.
Run More Often
Set a cronjob and process, say 100 records every 2 minutes or something like that.
Log and Review your Performance
Have logfiles and record stats. How many records were processed, how long was it since they were last processed, how long did the script take. These metrics will allow you to tweak the batch sizes, cronjob settings, time-limits, etc. to ensure that the maximum checks are performed in a stable fashion.
Setting all this up may sound like a lot of work compared to a single process, but it will allow you to handle increased user volumes, and would form a strong foundation for any further maintenance tasks you might be looking at down the track.
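The batch selection described above can be sketched in plain PHP. This is an illustration over an in-memory array, with the hypothetical name selectStalestFeeds; against a real table the same thing would usually be an ORDER BY on last_check with a LIMIT.

```php
<?php
// Hypothetical helper: pick the $batchSize feeds whose last_check is oldest
// and at least $minAgeSeconds in the past. Each feed is
// ['id' => int, 'last_check' => int Unix timestamp].
function selectStalestFeeds(array $feeds, int $now, int $minAgeSeconds, int $batchSize): array
{
    $due = array_filter($feeds, function (array $feed) use ($now, $minAgeSeconds): bool {
        return $now - $feed['last_check'] >= $minAgeSeconds;
    });
    usort($due, function (array $a, array $b): int {
        return $a['last_check'] <=> $b['last_check'];   // oldest first
    });
    return array_slice($due, 0, $batchSize);
}
```

Each cron run would call this with its batch size (say 100), fetch those feeds, and set last_check on the ones that were pulled successfully.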
fetch new articles using simplepie, synchronously
What do you mean by "synchronously"? Do you mean consecutively in the same process? If so, this is a very dumb approach.
You need a way of sharding the data to run across multiple processes. Doing this declaratively based on, say the modulus of the feed id, or the hash of the URL is not a good solution - one slow URL would cause multiple feeds to be held up.
A better solution would be to start up multiple threads/processes which would each:
1. lock list of URL feeds
2. identify the feed with the oldest expiry date in the past which is not flagged as reserved
3. flag this record as reserved
4. unlock the list of URL feeds
5. fetch the feed and store it
6. remove the reserved flag on the list for this feed and update the expiry time
Note that if there are no expired records at step 2, the table should be unlocked. The next step then depends on how you run the threads: as daemons, each should implement an exponential back-off (e.g. sleeping for 10 seconds, doubling up to 320 seconds for consecutive iterations); as batches, just exit.
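A small sketch of the two pieces of logic described above: choosing and reserving the next feed, and the back-off delay when nothing is due. Both function names and the array layout are assumptions made for the example. The locking itself is left out, since this version works on an in-memory array in a single process; with a real table the claim would have to run while the lock is held.

```php
<?php
// Hypothetical helper: find the unreserved feed with the oldest expiry in the
// past and flag it as reserved. Returns its id, or null when nothing is due.
// $feeds: [id => ['expires_at' => int, 'reserved' => bool]]
function claimNextFeed(array &$feeds, int $now): ?int
{
    $chosen = null;
    foreach ($feeds as $id => $feed) {
        if ($feed['reserved'] || $feed['expires_at'] > $now) {
            continue;                                  // taken, or not yet expired
        }
        if ($chosen === null || $feed['expires_at'] < $feeds[$chosen]['expires_at']) {
            $chosen = $id;
        }
    }
    if ($chosen !== null) {
        $feeds[$chosen]['reserved'] = true;
    }
    return $chosen;
}

// Hypothetical helper: seconds to sleep after $emptyPolls consecutive
// iterations that found nothing due: 10, 20, 40, ... capped at 320.
function backoffDelay(int $emptyPolls, int $baseSeconds = 10, int $maxSeconds = 320): int
{
    $exponent = min(max($emptyPolls, 0), 30);          // clamp to keep the power small
    return min($baseSeconds * (2 ** $exponent), $maxSeconds);
}
```

A daemon loop would reset its empty-poll counter to zero whenever claimNextFeed returns an id, and sleep for backoffDelay of the counter otherwise.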
Thank You for your responses. I apologize I am replying a little late. I got busy with this problem and later I forgot about this post.
I have been researching this a lot and faced a lot of problems. You see, 15,000 feeds every day is not easy.
Maybe I am MAD! :) But I did solve it.
How?
I wrote my own algorithm. And YES! It's written in PHP/MYSQL. I basically implemented a simple weighted machine learning algorithm. My algorithm basically learns the posting time about a feed and then estimates the next polling time for the feed. I save it in my DB.
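The post gives no details of the algorithm, so the following is not the poster's code. It is only one simple way such an estimate could be made, assuming the poll time is predicted from an exponentially weighted average of past intervals between posts; the name estimateNextPoll and the weight $alpha are hypothetical.

```php
<?php
// Hypothetical sketch, not the poster's code: estimate the next poll time as
// last post time + an exponentially weighted average of past posting intervals.
// $postTimes: ascending Unix timestamps; $alpha: weight of the newest interval.
function estimateNextPoll(array $postTimes, float $alpha = 0.5): ?int
{
    $count = count($postTimes);
    if ($count < 2) {
        return null;                       // not enough history to learn from
    }
    $average = null;
    for ($i = 1; $i < $count; $i++) {
        $interval = $postTimes[$i] - $postTimes[$i - 1];
        $average = $average === null
            ? (float) $interval
            : $alpha * $interval + (1 - $alpha) * $average;
    }
    return $postTimes[$count - 1] + (int) round($average);
}
```

Recent intervals weigh more than old ones, so the estimate adapts when a feed changes its posting rhythm.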
And since it's a learning algorithm, it improves with time. Of course, there are 'misses', but these misses are at least better than crashing servers. :)
I have also written a paper on this, which got published in a local computer science journal.
Also, regarding the performance gain, I am getting a 500% to 700% improvement in speed as opposed to sequential polling.
How is it going so far?
I have a DB that has grown to terabytes in size. I am using MySQL. Yes, I am facing performance issues on MySQL, but not many. Most probably, I will move to some other DB or implement sharding on my existing DB.
Why I chose PHP?
Simple, because I wanted to show people that PHP and MySQL are capable of such things! :)
I'm currently building a user panel which will scrape daily information using curl. For each URL it will INSERT a new row to the database. Every user can add multiple URLs to scrape. For example: the database might contain 1,000 users, and every user might have 5 URLs to scrape on average.
How do I run the curl scraping: by a cron job once a day at a specific time? Will a single dedicated server handle this without lag? Are there any techniques to reduce the server load? And about the MySQL database: with 5,000 new rows a day, the database will be huge after a single month.
In case you wonder, I'm building a statistics service which will show the daily growth of users' pages (not talking about traffic), so as I understand it I need to insert a new value per user per day.
Any suggestions will be appreciated.
5000 x 365 is only 1.8 million... nothing to worry about for the database. If you want, you can stuff the data into mongodb (need 64bit OS). This will allow you to expand and shuffle loads around to multiple machines more easily when you need to.
If you want to run curl non-stop from cron until it is finished, just "nice" the process so it doesn't use too many system resources. Otherwise, you can run a script which sleeps a few seconds between each curl pull. If each scrape takes 2 seconds, that would allow you to scrape 43,200 pages per 24-hour period. If you slept 4 seconds after each 2-second pull, that would let you do 14,400 pages per day (5k is about 35% of 14.4k, so you should be done in well under half a day, roughly 8.3 hours, with a 4-second sleep after each 2-second scrape).
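The arithmetic is easy to check with a couple of helper functions. The function names are made up for this sketch, and the 2-second scrape time is the estimate used above, not a measurement.

```php
<?php
// Pages that fit in one day if each scrape takes $scrapeSeconds and the
// script sleeps $sleepSeconds between scrapes.
function pagesPerDay(int $scrapeSeconds, int $sleepSeconds = 0): int
{
    return intdiv(24 * 60 * 60, $scrapeSeconds + $sleepSeconds);
}

// Hours needed to scrape $pages one after another at that pace.
function hoursToScrape(int $pages, int $scrapeSeconds, int $sleepSeconds = 0): float
{
    return $pages * ($scrapeSeconds + $sleepSeconds) / 3600;
}
```

With these, pagesPerDay(2) gives 43,200, pagesPerDay(2, 4) gives 14,400, and hoursToScrape(5000, 2, 4) comes to about 8.3 hours.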
This seems very doable on a minimal VPS machine for the first year, at least for the first 6 months. Then, you can think about utilizing more machines.
(edit: also, if you want you can store the binary GZIPPED scraped page source if you're worried about space)
I understand that each customer's pages need to be checked at the same time each day to make the growth stats accurate. But, do all customers need to be checked at the same time? I would divide my customers into chunks based on their ids. In this way, you could update each customer at the same time every day, but not have to do them all at once.
For the database size problem I would do two things. First, use partitions to break up the data into manageable pieces. Second, if the value did not change from one day to the next, I would not insert a new row for the page. In my processing of the data, I would then extrapolate for presentation the values of the data. UNLESS all you are storing is small bits of text. Then, I'm not sure the number of rows is going to be all that big a problem if you use proper indexing and pagination for queries.
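Skipping unchanged rows means the presentation layer has to fill the gaps. A minimal sketch of that step, assuming rows are keyed by date and that a missing day simply repeats the last stored value; fillDailyValues is a hypothetical name.

```php
<?php
// Hypothetical helper: expand sparse rows (stored only on days the value
// changed) into one value per day by carrying the last known value forward.
// $changes: ['YYYY-MM-DD' => value]; $days: ordered list of dates to present.
function fillDailyValues(array $changes, array $days): array
{
    $filled = [];
    $last = null;                     // stays null until the first stored row
    foreach ($days as $day) {
        if (array_key_exists($day, $changes)) {
            $last = $changes[$day];
        }
        $filled[$day] = $last;
    }
    return $filled;
}
```

The trade-off is that a failed scrape and an unchanged value look the same in the table unless failures are recorded separately.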
Edit: adding a bit of an example
function do_curl($start_index, $stop_index) {
    // Cast to int so the values are safe to embed in the query
    $start_index = (int) $start_index;
    $stop_index = (int) $stop_index;
    // Query for all pages with ids between start index and stop index
    $query = "select * from db_table where id >= $start_index and id <= $stop_index";
    for ($i = $start_index; $i <= $stop_index; $i++) {
        // do curl here
    }
}
urls would look roughly like
http://xxx.example.com/do_curl?start_index=1&stop_index=10
http://xxx.example.com/do_curl?start_index=11&stop_index=20
The best way to deal with the growing database size is to perhaps write a single cron script that would generate the start_index and stop_index based on the number of pages you need to fetch and how often you intend to run the script.
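Generating the start_index and stop_index pairs could look something like this. It is a sketch that assumes page ids run from 1 to the total without gaps; buildIndexRanges is a hypothetical name.

```php
<?php
// Hypothetical helper: split ids 1..$totalPages into [start, stop] ranges of
// at most $chunkSize ids, one range per scheduled run.
function buildIndexRanges(int $totalPages, int $chunkSize): array
{
    if ($chunkSize < 1) {
        throw new InvalidArgumentException('chunkSize must be at least 1');
    }
    $ranges = [];
    for ($start = 1; $start <= $totalPages; $start += $chunkSize) {
        $ranges[] = [$start, min($start + $chunkSize - 1, $totalPages)];
    }
    return $ranges;
}
```

The cron script would then pass each pair to do_curl (or request the matching URL), spacing the calls out over the day.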
Use multi curl, and properly optimise your database design rather than simply normalising it. If I were to run this cron job, I would spend time studying whether it is possible to do this in chunks. Regarding hardware, start with an average configuration, keep monitoring it, and add CPU or memory as needed. Remember, there is no silver bullet.
I need to show some basic stats on the front page of our site like the number of blogs, members, and some counts - all of which are basic queries.
I'd prefer to find a method to run these queries, say, every 30 mins and store the output, but I'm not sure of the best approach and I don't really want to use cron. Basically, I don't want to make thousands of queries per day just to display these results.
Any ideas on the best method for this type of function?
Thanks in advance
Unfortunately, cron is the better and more reliable solution.
Cron is a time-based job scheduler in Unix-like computer operating systems. The name cron comes from the word "chronos", Greek for "time". Cron enables users to schedule jobs (commands or shell scripts) to run periodically at certain times or dates. It is commonly used to automate system maintenance or administration, though its general-purpose nature means that it can be used for other purposes, such as connecting to the Internet and downloading email.
If you store the output in a file on disk,
you can always check whether its filemtime is less than 30 minutes old
before re-running the expensive queries.
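A minimal sketch of that file-based check, assuming the stats fit in a small array that can be stored as JSON. The function name cachedStats, the cache path and the callback are all hypothetical; the callback stands in for the expensive queries.

```php
<?php
// Hypothetical helper: return the cached value if the cache file is younger
// than $maxAgeSeconds, otherwise run $compute and rewrite the file.
function cachedStats(string $cacheFile, int $maxAgeSeconds, callable $compute)
{
    clearstatcache(true, $cacheFile);     // filemtime() results are cached per request
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $maxAgeSeconds) {
        return json_decode(file_get_contents($cacheFile), true);
    }
    $fresh = $compute();                  // stands in for the expensive queries
    file_put_contents($cacheFile, json_encode($fresh), LOCK_EX);
    return $fresh;
}
```

Called with a 1800-second maximum age, this re-runs the queries at most once per 30 minutes and only when a page is actually requested, so no cron entry is needed.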
There is nothing at all wrong with using a cron to store this kind of stuff somewhere.
If you're looking for a bit more sophisticated caching methods, I suggest reading into memcached or APC, which could both provide a solution for your problem.
A cron job is the best approach; I have not seen anything else that is feasible.
You have many ways to do this. One that is good, though not the best: store your data in a table and refresh it every 30 minutes using the function sleep().
I recommend you take a look at the WordPress blog system, and especially at the BuddyPress plugin.
I did the same some time ago: every time someone loads the page, the query does the job and retrieves the information from the database. I remember it was something like
SELECT COUNT(*) FROM my_table
and I got the number of posts in my case.
Anyway, there are so many approaches. Good luck.
Don't forget: cron is always your best friend.
Using cron is the simplest way to solve the problem.
One good reason for not using cron - you'll be generating the stats even if nobody will request them.
Depending on the length of time it takes to generate the data (you might want to keep track of the previous counts and just add counts where the timestamp is greater than the previous run - with appropriate indexes!) then you could trigger this when a request comes in and the data looks as if it is stale.
Note that you should keep the stats in the database and think about how to implement a mutex to avoid multiple requests trying to update the cache at the same time.
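One possible way to get such a mutex is a non-blocking file lock, sketched below. This is an assumption rather than the only option (a database-level lock would also do), and the names refreshIfUnlocked and $lockFile are hypothetical. A request that cannot get the lock skips the refresh and serves the stale numbers.

```php
<?php
// Hypothetical helper: run $refresh only if no other request holds the lock.
// Returns true if this request did the refresh, false if it was skipped.
function refreshIfUnlocked(string $lockFile, callable $refresh): bool
{
    $handle = fopen($lockFile, 'c');          // create if missing, never truncate
    if ($handle === false) {
        return false;
    }
    try {
        if (!flock($handle, LOCK_EX | LOCK_NB)) {
            return false;                     // someone else is already refreshing
        }
        $refresh();                           // recompute and store the stats
        flock($handle, LOCK_UN);
        return true;
    } finally {
        fclose($handle);                      // closing also releases the lock
    }
}
```

The non-blocking flag matters here: waiting for the lock would make every concurrent request queue up behind the one doing the recalculation.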
However the right solution would be to update the stats every time a record is added. Unless you've got very large traffic volumes, the overhead would be minimal. While 'SELECT count(*) FROM some_table' will run very quickly you'll obviously run into problems if you don't simply want to count all the rows in a table (e.g. if blogs and replies are held in the same table). Indeed, if you were to implement the stats update as a trigger on the relevant tables, then you wouldn't need to make any changes to your PHP code.