Strategies for rarely updated data


Background:
Two minutes before every hour, the server stops access to the site, returning a busy screen while it processes data received in the previous hour. This can take less than two minutes, in which case it sleeps until the two minutes are up. If it takes longer than two minutes, it runs for as long as it needs to and then returns. The block is stored in its own table with one field and one value in that field.
Currently the user is only informed of the block when (s)he tries to perform an action (click a link, send a form etc). I was planning to update the code to automatically bring down a lightbox with the blocking message via the BlockUI jQuery plugin.
There are basically 2 methods I can see to achieve my aim:
1. Polling every N seconds (via PeriodicalUpdater or similar)
2. Long polling (Comet)
You can reduce server load for 1 by checking the local time and starting the polling loop only when it gets close to the block time. This can be made more accurate by sending the local time to the server, which returns the difference mod 60. That still leaves 100+ people querying the server, which causes an additional hit on the db.
Option 2 is the more attractive choice. It removes the repeated hit on the webserver, but doesn't alleviate the repeated check on the db. However, 2 is not a good choice for Apache 2.0 runners like us, and even though we own our server, none of us are web admins and we don't want to break it - people pay real money to play, so if it ain't broke, don't fix it (which is why we are still running PHP4/MySQL3).
Because of the problems with option 2 we are back with option 1, which is sub-optimal.
So my question is really two-fold:
1. Are there any other possibilities I've missed?
2. Is long polling really such a problem at this size? I understand it doesn't scale, but I am more concerned about the level at which it starves Apache of threads. Also, are there any options you can adjust in Apache so it scales slightly further?

Can you just send to the page how much time is left before the server starts processing the data received in the previous hour? Let's say that when sending the HTML you note that the server will start processing in 1 minute. Then create a JS timer that triggers after that 1 minute and shows the lightbox.
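A minimal sketch of the server-side half of that idea. It assumes the block always starts at minute 58 of the hour and that the server's timezone has a whole-hour UTC offset; `secondsUntilBlock()` is a hypothetical name, and the type declarations need PHP 7 (drop them on older versions).

```php
<?php
// Hypothetical helper: seconds from $now (a Unix timestamp) until the next
// maintenance block, assuming the block starts 2 minutes before every hour.
function secondsUntilBlock(int $now): int
{
    $blockOffset = 58 * 60;      // block starts at hh:58:00
    $intoHour    = $now % 3600;  // seconds elapsed in the current hour
    if ($intoHour >= $blockOffset) {
        return 0;                // already inside the block window
    }
    return $blockOffset - $intoHour;
}

// When rendering a page, hand the delay to the browser, for example:
// echo '<script>var blockInMs = ' . (secondsUntilBlock(time()) * 1000) . ';</script>';
```

On the client, a single `setTimeout` with that delay can then show the BlockUI overlay without any polling. The page still needs one check after the two minutes to find out whether processing has finished, since the block can run long.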

The alternative I see is to get it done faster, so there is less downtime from the user's perspective. To do that I would use a distributed system, such as Hadoop, to do the actual data processing behind the hourly update. Then use whichever method is most appropriate for that short downtime to update the page.

Related

How can I scale a database/CPU intensive script?

I currently have a PHP script that collects similar data from various sources. Each data source is scraped and parsed every 120 seconds. At the moment I have 20 data sources, but I expect to integrate another 100 over the coming weeks.
Currently each data source is scraped in its own thread: there is one main PHP script that executes other scripts to perform the scraping work. This method allows all sources to be scraped at the same time, but it also puts a strain on the server and creates a bottleneck on the database (MySQL).
I'm looking for a way to scale my current application. Could I do something like this with AWS? Perhaps each of these scraping scripts could run in its own small server instance; each instance would be created automatically by a "main" instance and then die once the script has finished. I don't have any experience with AWS, so I'm not entirely sure if this is possible, or maybe it's just a bad idea.
The main question here is: How can I scale my current scraping script to allow for many new data sources? I'm interested in any solution even if I need to buy additional services.

You need a queueing system
You're describing a sort of worker / queue pattern, with your main server performing both the en-queueing and the worker execution, which of course is going to be a huge strain on your server.
First and foremost, your workers need to be asynchronous: you shouldn't be waiting for something that may or may not come back. You really should take a look at ZeroMQ which, I might add, contains some of the best documentation on the planet. If you're willing to learn, take a look at how this works and follow some tutorials, there are plenty out there. Have your queue taking on new jobs and dispatching others elsewhere (i.e. to other boxes) hosted on your main server.
Horizontal Scaling
You can create some sort of Instance Controller to handle AWS instances. You really just need to sit down and think about your logic (when do I want this many boxes, when do I want to shut them down). The API is pretty simple to use once you get your head around it. Here's some code I wrote a while back to wrap Amazon's SDK for PHP. I'm not sure if it's working 100% with the latest version (I used it around a year ago), but the concepts are there - you have simple methods like startBox() or stopBox() that you call from your queue, and have your box automatically start doing its stuff once it starts up.
You could use Amazon's t1.micro instances, which are covered by the free tier up to a certain limit.
Get it working properly, with a loop on your main server deciding how many boxes you need working at any one time given certain circumstances (no. of jobs in your database table, for example), and you'll have theoretically infinite scaling. Here's how I did it for my code:
Tier 1: > 5 jobs, < 10 jobs = 1 box
Tier 2: > 10 jobs, < 20 jobs = 2 boxes
etc. etc.
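The tier rule above could be sketched as follows. Only the first two tiers come from the answer; the function name `boxesNeeded()`, the table layout, and the handling of boundary values (exactly 10 jobs stays in tier 1, anything above the last tier keeps the last box count) are assumptions for illustration.

```php
<?php
// Hypothetical tier table: "more than this many jobs" => number of boxes.
// Only these two tiers are stated in the answer; extend the table as needed.
const BOX_TIERS = [5 => 1, 10 => 2];

function boxesNeeded(int $pendingJobs, array $tiers = BOX_TIERS): int
{
    $boxes = 0;                 // below the first tier no extra box is started
    ksort($tiers);              // make sure thresholds are checked in ascending order
    foreach ($tiers as $moreThan => $count) {
        if ($pendingJobs > $moreThan) {
            $boxes = $count;    // the highest threshold exceeded wins
        }
    }
    return $boxes;
}
```

The loop on the main server would call this with the current job count and start or stop boxes until the running count matches.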
Advice
Log everything. Log every box coming up, every box coming down. Calculate your costs in your code and store them, maybe in a database, or log them, so you know exactly how much you're spending - you don't want things to get out of hand.
Make sure you open up your DB ports so your instances can talk to your DB to say when a job is done or anything else you need to pass between your "master" box and your "slave" boxes.
Also, if you're paying for web servers, you'll be billed by the hour with AWS, so you need to record the time you start the box, and when it's time to shut down, only actually shut it down when 55 minutes or so have passed - you might as well get those extra minutes for what you're paying.
I can't really think of anything else. Do your research, figure out the best way to build a queueing system, and build it with scalability in mind (it can react and change to numbers that you control).
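The "wait until about minute 55" advice can be expressed as a small check. This is a sketch that assumes hourly billing counted from the box's start time, as described above; `nearEndOfBilledHour()` is a hypothetical name and `intdiv()` needs PHP 7.

```php
<?php
// Hypothetical check: has this box used up (almost) all of its current billed hour?
// $startedAt and $now are Unix timestamps.
function nearEndOfBilledHour(int $startedAt, int $now, int $thresholdMinutes = 55): bool
{
    // Minutes elapsed within the current billed hour (0..59).
    $minutesIntoHour = intdiv($now - $startedAt, 60) % 60;
    return $minutesIntoHour >= $thresholdMinutes;
}
```

An idle box would only be stopped once this returns true, so a box that becomes idle at minute 10 stays available for new jobs until minute 55.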

Split your scraping up across multiple instances (say 5 per server) and have them talk to a central DB like Amazon RDS.
No need to kill the instances after you have finished scraping if you're doing this every 120 seconds.

What would be better for conserving server system resources?

I made a private chat system. So far the chat has 3 jQuery ajax post scripts calling the server in a loop for new data:
1. Message window between current user and target user (the ajax gets the timestamp of the last message in the db and compares it to the timestamp of the last message that was displayed. It gets all messages newer than the last displayed timestamp and shows them in the message window. The ajax loops every 5 seconds after the last return.)
2. Who's online checker (checks the db for who's online. The ajax loops every 30 seconds after the last return.)
3. Who messaged current user (checks for and gets users who are not the current target user in the message window and have messaged the current user. The ajax loops every 15 seconds after the last return.)
So far the above 3 are the only ajax loops I have, and I am still double checking my code for areas where I can trim it down.
My question is: would it be better for conserving server system resources if I grouped the above 3 ajax posts into 1 ajax post and looped it every 5 to 8 seconds? Or should I leave it as is?
I ask this because I got a warning from my hosting provider before that I was consuming too much of their server's system resources (due to a very stupid experiment). If I mess up again they're going to cut my hosting, so I do hope you guys understand why I ask this kind of question.
Extra details: I use jquery ajax to talk to a php script that gets the data from a mysql db. The loop for the requests are done client side.

Websockets are tricky. So if you decide to go with ajax there are a couple of factors to consider:
The frequency. Efficient systems usually use a sort of tick system. In your case a tick would be 5 seconds, as all your intervals (5, 15 and 30 seconds) are multiples of 5 seconds. And yes, of course you group everything that needs to be transmitted in a tick into 1 transmission.
The data quantity. Try not to send more than 1 KB per tick. E.g. use sparse formats like CSV rather than XML. Set hard entry limits. Compress. Things like that. Network traffic is sent in packets, so sending 1025 bytes causes allocation of 2 KB of resources.
Act on the user's inactivity somehow. E.g. do not use up every tick for the "Message window between current user and target user" check if the user has been inactive for more than a minute. A sort-of session timeout of 20 minutes or so...
The computation. Make the server-side tick response QUICK and small. Consider using memory tables or memcaches for the tick handling, and then have an agent that runs every ten minutes or so and writes whatever needs to persist to permanent storage. Try to avoid complex, fat operations (e.g. more than 3 db round trips) in the tick response.
The hoster. That was also said in another comment. A quick additional hint: you could ask whether you are allowed to implement such a thing before you sign the contract, if you are still able to change the contract. Sometimes things like video and instant messaging are mentioned in the general terms of service.
There are probably more things.. But these come to my mind immediately...
In general maybe you should also check out https://developers.google.com/speed/docs/best-practices/rtt
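As a sketch of the tick idea from the first point: the client sends one request every 5 seconds along with a running tick counter, and the server decides which of the three checks are due on that tick. The check names and `checksDueOnTick()` are hypothetical; the intervals are the ones from the question.

```php
<?php
// Hypothetical schedule: check name => run every N ticks (one tick = 5 seconds).
const TICK_SCHEDULE = [
    'messages'    => 1,  // every 5 s
    'whoMessaged' => 3,  // every 15 s
    'whoIsOnline' => 6,  // every 30 s
];

// Returns the names of the checks that should run on the given tick.
function checksDueOnTick(int $tick, array $schedule = TICK_SCHEDULE): array
{
    $due = [];
    foreach ($schedule as $check => $everyNTicks) {
        if ($tick % $everyNTicks === 0) {
            $due[] = $check;
        }
    }
    return $due;
}
```

The PHP endpoint would run only the due checks and return their results in a single response, so the browser makes one request per tick instead of three independent loops.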

How to implement a manager of scripts execution in php on a remote server

I'm trying to build a service that will collect some data from the web at certain intervals, then parse that data, and finally, depending on the parse results, execute dedicated procedures. A typical run of the service:
1. Request the list of items to be updated
2. Download data of listed items
3. Check what's not updated yet
4. Update database
5. Filter data that contains updates (get only highest priority updates)
6. Perform some procedures to parse updates
7. Filter data that contains updates (get only medium priority updates)
8. Perform some procedures to parse ...
...
...
Everything would be simple if there were not so much data to be updated.
There is so much data to be updated that at every step from 1 to 8 (maybe except 1) the scripts will fail due to the 60-second max execution time limit. Even if there were an option to increase it, this would not be optimal, as the primary goal of the project is to deliver the highest-priority data first. Unfortunately, determining the priority level of a piece of information requires getting most of the data and doing a lot of comparisons between already stored data and incoming (update) data.
I could sacrifice overall service speed to get at least the high-priority updates, and wait longer for everything else.
I thought about writing some parent script (manager) to control every step (1-8) of the service, maybe by executing other scripts?
The manager should be able to resume an unfinished step (script) to get it completed. It is possible to write every step in such a way that it does a small portion of work and, after finishing it, marks that portion as done in, e.g., an SQL DB. After the manager resumes it, the step (script) will continue from the point where it was terminated by the server for exceeding the max execution time.
Known platform restrictions:
remote server, unchangeable max execution time, usually limited to running one script at a time, no access to many Apache features, and all the other restrictions typical of remote servers
Requirements:
Some kind of manager is mandatory because, besides calling particular scripts, this parent process must write some notes about the scripts that were activated.
The manager can be called by curl; a one-minute interval is enough. Unfortunately, giving curl a list of calls to every step of the service is not an option here.
I also considered getting a new remote host for every step of the service and controlling them from another remote host that could call them and ask them to do their job using, e.g., SOAP, but this scenario is at the bottom of my list of preferred solutions because it does not solve the max execution time problem and it involves a lot of data exchange over the internet, which is the slowest way to work on data.
Any thoughts about how to implement solution?

I don't see how steps 2 and 3 by themselves could run for over 60 seconds. If you use curl_multi_exec for step 2, it will run in seconds. If step 3 were heavy enough to run past 60 seconds, you would hit "memory limit exceeded" first, and a lot earlier.
All that leads me to the conclusion that the script is very unoptimized. The solution would be to:
break the task into (a) working out what to update and saving that in the database (say flag 1 for rows to update, 0 for rows to leave alone); (b) cycling through the rows that need updating and updating them, setting the flag to 0. At ~50 seconds just shut down (assuming that the script is run every few minutes, that will work).
get a second server and set it up with a proper execution time so your script can run for hours. Since it will have access to your first database directly (and not via HTTP calls), it won't be a major traffic increase.
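Part (b) of the first suggestion, a resumable loop with a time budget, might look like the sketch below. An in-memory array stands in for the database table; the row layout, the `needs_update` column name and `processPending()` are assumptions, and in a real script the flag would be written back with an UPDATE query.

```php
<?php
// Process flagged rows until the time budget runs out.
// $rows:   array of ['id' => ..., 'needs_update' => 1|0] (stand-in for a DB table).
// $update: hypothetical callable doing the real work for one row.
// Returns the number of rows processed in this run.
function processPending(array &$rows, callable $update, float $budgetSeconds = 50.0): int
{
    $deadline = microtime(true) + $budgetSeconds;
    $done = 0;
    foreach ($rows as &$row) {
        if (microtime(true) >= $deadline) {
            break;                    // stop cleanly before the host kills the script
        }
        if ($row['needs_update'] !== 1) {
            continue;                 // already handled by an earlier run
        }
        $update($row);
        $row['needs_update'] = 0;     // mark as done so the next run skips it
        $done++;
    }
    unset($row);                      // break the reference left behind by foreach
    return $done;
}
```

Because each row is flagged as done right after it is processed, a run that stops at the deadline loses no work; the next scheduled run picks up the remaining flagged rows.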

Generating scoreboards on large traffic sites

Bit of an odd question but I'm hoping someone can point me in the right direction. Basically I have two scenarios and I'd like to know which one is the best for my situation (a user checking a scoreboard on a high traffic site).
1. Top 10 is regenerated every time a user hits the page - increased load on the server, especially under high traffic, but the user will see his/her correct standing asap.
2. Top 10 is regenerated at a set interval, e.g. every 10 minutes - only generates one set of results, causing one spike every 10 minutes rather than potentially one every x seconds, but a user who hits the page between refreshes won't see their updated score.
Each one has its pros and cons. In your experience, which one would be best to use, or are there any magical alternatives?
EDIT - An update, after taking on board what everyone has said I've decided to rebuild this part of the application. Rather than dealing with the individual scores I'm dealing with the totals, this is then saved out to a separate table which sort of acts like a cached data source.
Thank you all for the great input.

Adding to Marcel's answer, I would suggest only updating the scoreboards upon write events (like a new score or a deleted score). This way you can keep static answers for popular queries like Top 10, etc. Use something like MemCache to keep data cached for requests, or if you don't/can't install something like MemCache on your server, serialize common requests and write them to flat files, and then delete/update them upon write events. Have your code look for the cached result (or file) first, and then iff it's missing, do the query and create the data.
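A minimal sketch of the flat-file variant, under the assumption that the top 10 fits in a plain PHP array. `cachedTop10()`, `invalidateTop10()` and the `$compute` callable (which would run the real query) are hypothetical names.

```php
<?php
// Return the cached result if present, otherwise compute it and store it.
// $compute is a hypothetical callable that runs the expensive top-10 query.
function cachedTop10(string $cacheFile, callable $compute): array
{
    if (is_file($cacheFile)) {
        $cached = unserialize(file_get_contents($cacheFile));
        if (is_array($cached)) {
            return $cached;           // cache hit: no database work at all
        }
    }
    $fresh = $compute();              // cache miss: do the query once
    file_put_contents($cacheFile, serialize($fresh), LOCK_EX);
    return $fresh;
}

// Call this from every write path (new score, deleted score).
function invalidateTop10(string $cacheFile): void
{
    if (is_file($cacheFile)) {
        unlink($cacheFile);
    }
}
```

With this shape the query runs at most once per write event, no matter how many visitors load the scoreboard in between.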

Nothing ever needs to be real time when it comes to the web. I would go with option 2; users will not notice that their score is not changing. You can use some JS to refresh the top 10 every time the cache has cleared.

To add to Jordan's suggestion: I'd put the scorecards in a separate (HTML formatted) file, that is produced every time when new data arrives and only then. You can include this file in the PHP page containing the scorecard or even let a visitor's browser fetch it periodically using XMLHttpRequests (to save bandwidth). Users with JavaScript disabled or using a browser that doesn't support XMLHttpRequests (rare these days, but possible) will just see a static page.

The Drupal voting module will handle this for you, giving you an option of when to recalculate. If you're implementing it yourself, then caching the top 10 somewhere is a good idea - you can either regenerate it at regular intervals or you can invalidate the cache at certain points. You'd need to look at how often people are voting, how often that will cause the top 10 to change, how often the top 10 page is being viewed and the performance hit that regenerating it involves.
If you're not set on Drupal/MySQL then CouchDB would be useful here. You can create a view which calculates the top 10 data and it'll be cached until something happens which causes a recalculation to be necessary. You can also put an HTTP caching proxy inline to cache results for a set number of minutes.

How to live update browser game attributes like the 4 resources in Travian game?

I would like to make a web-based game which is Travian-like (or Ikariam-like). The game will be PHP- and MySQL-based. I wonder how I can achieve the live updating of game attributes.
For the frontend, I can achieve this by using AJAX calls (fetching the latest values from the database), or even by faking the update of values (without communicating with the server).
For the backend, is this done by a PHP cron job (which runs every few seconds)? If so, can anyone provide me with some sample code?
By the way, I know it would be troublesome if I use IIS + FastCGI.
=== Version Information ===
PHP : 5.2.3
IIS : 6.0 with FastCGI
OS : Windows Server 2003 Standard R2

The correct answer depends on your exact needs.
Does everyone always get resources at the same rate? If so, a simple solution is to track how long their user has existed, calculate the amount of resources based on the rate they're getting, and subtract the number of resources they've spent in total. That's going to be a bit of a problem if the rate can ever change, though, so if you use this solution, you're pretty much stuck with the rate you pick unless you rewrite the handling entirely (for example to the one below).
If it varies how quickly people can get resources, you'll need to update the data periodically. A cronjob/scheduled task would work well to make sure everyone is updated, but in some situations, it might be better to simply measure how long it's been since you've updated each user's resources, and then update them on every page load they make while logged in by multiplying the time they've been away by the rate at which they gain resources - that way, you avoid updating until you actually need the new value.
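The update-on-demand approach boils down to one formula: stored amount plus rate times elapsed time. A minimal sketch, assuming a per-hour production rate and Unix timestamps; `currentResources()` and its parameter names are hypothetical, and the type declarations need PHP 7 (drop them on older versions such as 5.2).

```php
<?php
// Resources at time $now, derived from the last stored amount.
// $ratePerHour is the player's current production rate (hypothetical unit).
function currentResources(float $storedAmount, int $lastUpdate, float $ratePerHour, int $now): float
{
    $elapsedSeconds = max(0, $now - $lastUpdate);   // guard against clock skew
    return $storedAmount + $ratePerHour * $elapsedSeconds / 3600;
}
```

Whenever the rate changes or resources are spent, the result is written back together with a fresh `$lastUpdate`, so the formula always starts from a correct baseline.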

For Travian-like resource management you need to keep track of when you last updated the user's resources. When you read the resource values (for a page refresh or something), you need to add the amount of resources gained since the 'last update time' (depending on the number of resource fields and the bonuses the user gets) and send that value to the browser. You could also let the browser script calculate these amounts.
You might want to consider caching all resource amounts somehow, since these values are required a lot; that would reduce the communication with your database.
If a user finishes building a resource field, uses the market, builds a structure, etc., you need to update the amount of resources (and the 'last update time'), because you cannot simply keep track of these kinds of events.
By calculating the resources the database load is reduced, since you do not need to write the new values every time when the user refreshes the browser page. It is also more accurate since you have less rounding errors.
To keep the resources increasing between page refreshes you need a method as Frank Farmer described. Just embed the resource amount and the 'gain frequency' in some javascript and increase the resource amount every 'gain frequency' by one.

You can also calculate the resources each time a page or the JavaScript asks. You'd need to store the last updated time.

It may be an old post but it comes up right away in Google so here's another option which is how the game I've been developing does it.
I use a client side JavaScript that uses a flash socket to get live updates from a dedicated game server running on the host.
I use the xmlsocket kit from http://devpro.it/xmlsocket/
