I have a news site which receives around 58,000 hits a day across 36,000 articles. Of these 36,000 unique stories, 30,000 get only 1 hit (the majority of which come from search engine crawlers) and only 250 stories get over 20 impressions. It is a waste of memory to cache anything but these 250 articles.
Currently I am using the MySQL Query Cache and xcache for data caching. The table is updated every 5-10 minutes, so the Query Cache alone is not very useful. How can I detect just the frequently visited pages and cache only their data?
I think you have two options to start with:
You don't cache anything by default.
You can implement, with an Observer/Observable pattern, a way to trigger an event when an article's view count reaches a threshold, and start caching the page at that point.
You cache every article at creation.
In both cases, you can use a cron job to purge articles that don't reach your defined threshold.
In either case, you'll probably need some heuristic to determine early enough that an article will need to be cached, and as with any heuristic, you'll get false positives and false negatives.
It'll depend on how your content is read: if the articles are real-time news, this will probably be efficient, since they quickly generate high traffic.
The main problem with these methods is that you'll need to store extra information, such as the last access datetime and the current page view count, which could result in extra queries.
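To make the threshold idea concrete, here's a minimal sketch of what the check could look like on each article view. The $memcached instance, $articleId, and the render_article() / bump_and_get_view_count() helpers are just placeholders for your own page-building and view-counting code:

$threshold = 20;    // only cache articles with 20+ impressions
$ttl       = 600;   // cache lifetime in seconds

$key  = 'article_' . $articleId;
$html = $memcached->get($key);

if ($html === false) {
    $html  = render_article($articleId);            // build the page from MySQL
    $views = bump_and_get_view_count($articleId);   // counter kept in the DB
    if ($views >= $threshold) {
        $memcached->set($key, $html, $ttl);         // cache only the popular articles
    }
}
echo $html;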
You can cache only new articles (let's say the ones which have been added recently). I'd suggest having a look at memcached and Redis - they are both very useful, simple and at the same time powerful caching engines.
Related
I've looked around and haven't found a pre-existing answer to this question.
Info
My site relies on Ajax, Apache, Mysql, and PHP.
I have already built my site and it works well; however, as soon as too many users begin to connect (roughly 200+ requests per second), the server performs very poorly.
This site is very reliant on ajax. The main page of the site performs an ajax request every second so if 100 people are online, I'm receiving at least 100 requests per second.
These ajax queries invoke mysql queries on the server-side. These queries return small datasets. The returned datasets will change very often so I'd imagine caching would be ineffective.
Questions
1) What configuration practices would be best to help me increase the maximum number of requests per second? This applies to Ajax, Mysql, PHP, and Apache.
2) For Apache, do I want persistent connections (the KeepAlive directive) to be "On" or "Off"? As I understand, Off is useful if you are expecting many users, but On is useful for ajax and I need both of these things.
3) When I test the server's performance on serving a plain, short html page with no ajax (and involving only 1 minor mysql query) it still performs very poorly when this page gets 200+ requests per second. I'd imagine this must be due to apache configuration / server resources. What options do I have to improve this?
Thanks for any help!
Depending on actual user needs, caching can be implemented in different patterns. In many cases users don't really need updates every second, and/or the responses can be cached for longer periods while still appearing to update frequently. It depends...
Just to give some ideas:
Does every user need to get truly unique, user-specific responses to their ajax requests, or are the responses the same or similar for all users or for sub-groups of users?
Does it make sense to have every second updates for every user?
Can the users notice the difference if the data is cached for, let's say, 10 seconds? (A sketch of that approach follows this list.)
If the data really is unique to every user, but doesn't change for every user every second, couldn't you refresh on demand, i.e. invalidate the cached data only when the underlying data actually changes?
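To illustrate the 10-second idea, here's a minimal sketch of a short-TTL cache in front of the ajax endpoint; Memcached and the fetch_latest_data() helper are assumptions standing in for your own setup:

$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$payload = $mc->get('latest_data');
if ($payload === false) {
    $payload = json_encode(fetch_latest_data());   // the MySQL query lives here
    $mc->set('latest_data', $payload, 10);         // 10-second lifetime
}
header('Content-Type: application/json');
echo $payload;

With 100 users polling every second, this turns roughly 100 MySQL queries per second into about one every 10 seconds.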
I used RequireJS to lazy-load the JS, HTML, and CSS files. For the server to serve lots of assets, you need to keep KeepAliveTimeout at around 15.
I am doing a pagination system for about 100 items.
My question is:
Should I just load all 100 of them and then use jQuery to switch pages without reloading? Or should I use a MySQL query with "LIMIT 5" and then, each time user presses on Next Page or Previous Page, another Mysql query with LIMIT 5 is initiated?
For every item, I would have to load a thumbnail picture but I could keep it in the cache to avoid using my server bandwidth.
Which one is the best option from a server resource perspective?
Thanks in advance. Regards
Try connecting directly to your MySQL instance via the command-line interface. Execute the query with 100 at a time, and then with LIMIT 5. Look at the millisecond timings. This will tell you which is more efficient or less resource-demanding.
100 records at a time from MySql (depending on dataset) really is nothing. The performance hit wouldn't be noticeable for a properly written query/database schema.
That said, I vote for calling only the results you need at a time. Use the LIMIT clause and your jquery pagination method to make it efficient.
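As a rough sketch of the LIMIT approach (table and column names are made up, and $pdo is assumed to be an existing PDO connection):

$perPage = 5;
$page    = isset($_GET['page']) ? max(1, (int)$_GET['page']) : 1;
$offset  = ($page - 1) * $perPage;

$stmt = $pdo->query(sprintf(
    'SELECT id, title, thumb_url FROM items ORDER BY id LIMIT %d OFFSET %d',
    $perPage, $offset
));
echo json_encode($stmt->fetchAll(PDO::FETCH_ASSOC));

Your jQuery pagination then just requests this script with the page number it wants.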
For the server, the most efficient way would be to grab all 100 items once, send them to the client once, and have the client page through them locally. That's one possibly expensive query, but that's cheaper overall than having the client go back and forth for each additional five items.
Having said that, whether that's feasible is a different topic. You do not want to be pushing a huge amount of data to your client at once, since it'll slow down page loads and client-side processing. In fact, it's usually desirable to keep the bandwidth consumed by the client to a minimum. From that POV, making small AJAX requests with five results at a time when and only when necessary is much preferable. Unless even 100 results are so small overall that it doesn't make much of a difference.
Which one works best for you is something you'll need to figure out.
Depends significantly on your query. If it is a simple SELECT from a well-designed table (indexes set, etc.) then, unless you're running on a very underpowered server, there will be no noticeable difference between requesting 100 rows and 5 rows. If it is a complicated query, then you should probably limit the number of times you run it.
Other considerations to take into account are how long it takes to load a page, as in the actual round-trip time for the client to receive the data from the server. I'm going to make the wild guess that you are in America or Europe, where internet speeds are nice and fast; not the entire world is that lucky. Limiting the number of times your site has to request data from the server is a much better metric than how much load your server has.
This is moving rapidly into UX territory, but your users don't care about your server load; they don't care if this way means your load average is 0.01 instead of 0.02. They will care if you have almost-instantaneous transitions between sections of your site.
Personally, I'd go with the "load all data, then page locally" method. Also remember that Ajax is your friend: if you have to, load the results page first, then request the data. You can even split the request in two: the first page, then the rest of the pages. There are a lot of behind-the-scenes tweaks you can do to make your site seem incredibly fast, and that is something people notice.
I'd say, load 5 at a time and paginate. My considerations:
It is indeed much lighter to load 5 at a time
Not all of your users will navigate through all 100, so those loaded might not even be used
A slight load time between pages of 5 records is expected (i.e. most users won't complain just because they have to wait 500 ms - 1 s)
You can also give users the option to display x items per page, including an option to see all items on one page. Over time, you can monitor which page size most of your users prefer and use that as the default LIMIT.
I have a table of more than 15000 feeds and it's expected to grow. What I am trying to do is to fetch new articles using simplepie, synchronously and storing them in a DB.
Now I have run into a problem: since the number of feeds is high, my server stops responding and I am no longer able to fetch feeds. I have also implemented some caching, and I fetch odd and even feeds at different time intervals.
What I want to know is whether there is any way of improving this process. Maybe fetching feeds in parallel? Or maybe someone can suggest a pseudo-algorithm for it.
15,000 Feeds? You must be mad!
Anyway, a few ideas:
Increase the Script Execution time-limit - set_time_limit()
Don't go overboard, but ensuring you have a decent amount of time to work in is a start.
Track Last Check against Feed URLs
Maybe add a field to each feed, last_check, and have that field set to the date/time of the last successful pull for that feed.
Process Smaller Batches
Better to run smaller batches more often. Think of it as being the PHP equivalent of "all of your eggs in more than one basket". With the last_check field above, it would be easy to identify those with the longest period since the last update, and also set a threshold for how often to process them.
Run More Often
Set up a cron job and process, say, 100 records every 2 minutes or something like that.
Log and Review your Performance
Have logfiles and record stats. How many records were processed, how long was it since they were last processed, how long did the script take. These metrics will allow you to tweak the batch sizes, cronjob settings, time-limits, etc. to ensure that the maximum checks are performed in a stable fashion.
Setting all this up may sound like a lot of work compared to a single process, but it will allow you to handle increased volumes and will form a strong foundation for any further maintenance tasks you might be looking at down the track.
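Pulling those ideas together, one cron batch might look roughly like the sketch below. The feeds table, its last_check column, and the store_items() helper are assumptions standing in for your own schema and storage code:

set_time_limit(110);  // leave headroom under a 2-minute cron interval

$stmt = $pdo->query('SELECT id, url FROM feeds ORDER BY last_check ASC LIMIT 100');
foreach ($stmt as $row) {
    $feed = new SimplePie();
    $feed->set_feed_url($row['url']);
    $feed->init();                          // fetch and parse the feed
    store_items($pdo, $row['id'], $feed);   // save any new articles
    $pdo->prepare('UPDATE feeds SET last_check = NOW() WHERE id = ?')
        ->execute(array($row['id']));
}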
fetch new articles using simplepie, synchronously
What do you mean by "synchronously"? Do you mean consecutively in the same process? If so, this is a very dumb approach.
You need a way of sharding the data to run across multiple processes. Doing this declaratively based on, say, the modulus of the feed ID or the hash of the URL is not a good solution - one slow URL would cause multiple feeds to be held up.
A better solution would be to start up multiple threads/processes which would each:
lock list of URL feeds
identify the feed with the oldest expiry date in the past which is not flagged as reserved
flag this record as reserved
unlock the list of URL feeds
fetch the feed and store it
remove the reserved flag on the list for this feed and update the expiry time
Note that if there are no expired records at step 2, the table should still be unlocked; what happens next depends on whether you run the threads as daemons (in which case they should implement an exponential back-off, e.g. sleeping for 10 seconds and doubling up to 320 seconds on consecutive empty iterations) or as batches (in which case they should simply exit).
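A minimal sketch of one worker iteration along those lines follows; the feeds table with url, expires_at and reserved columns, and the fetch_and_store_feed() helper, are assumptions:

// steps 1-4: reserve one expired, unreserved feed under a table lock
$pdo->exec('LOCK TABLES feeds WRITE');
$row = $pdo->query('SELECT id, url FROM feeds
                    WHERE reserved = 0 AND expires_at < NOW()
                    ORDER BY expires_at ASC LIMIT 1')->fetch(PDO::FETCH_ASSOC);
if ($row) {
    $pdo->exec('UPDATE feeds SET reserved = 1 WHERE id = ' . (int)$row['id']);
}
$pdo->exec('UNLOCK TABLES');

if ($row) {
    // steps 5-6: fetch and store, then release the reservation and push out the expiry
    fetch_and_store_feed($row['url']);
    $pdo->exec('UPDATE feeds SET reserved = 0,
                expires_at = NOW() + INTERVAL 30 MINUTE
                WHERE id = ' . (int)$row['id']);
    $backoff = 10;                          // reset the back-off after useful work
} else {
    sleep($backoff);                        // daemon mode: exponential back-off
    $backoff = min($backoff * 2, 320);      // ($backoff starts at 10 seconds)
}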
Thank you for your responses. I apologize for replying a little late. I got busy with this problem and later I forgot about this post.
I have been researching this a lot and faced a lot of problems. You see, 15,000 feeds every day is not easy.
Maybe I am MAD! :) But I did solve it.
How?
I wrote my own algorithm. And YES! It's written in PHP/MySQL. I basically implemented a simple weighted machine-learning algorithm: it learns the posting times of a feed and then estimates the next polling time for that feed. I save this in my DB.
And since it's a learning algorithm, it improves with time. Of course, there are 'misses', but those misses are at least better than crashing servers. :)
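Just to give a flavour of the general idea (a simplified illustration, not the exact algorithm from the paper): weight the gaps between a feed's recent posts and poll again one average gap after the latest one.

// estimate the next poll time from a weighted average of posting gaps
function estimate_next_poll(array $postTimestamps, $alpha = 0.3)
{
    sort($postTimestamps);
    $avgGap = null;
    for ($i = 1; $i < count($postTimestamps); $i++) {
        $gap    = $postTimestamps[$i] - $postTimestamps[$i - 1];
        $avgGap = ($avgGap === null) ? $gap : $alpha * $gap + (1 - $alpha) * $avgGap;
    }
    // poll one average gap after the latest post; fall back to 1 hour
    return end($postTimestamps) + ($avgGap !== null ? (int)$avgGap : 3600);
}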
I have also written a paper on this, which got published in a local computer science journal.
Also, regarding the performance gain, I am getting a 500% to 700% improvement in speed as opposed to sequential polling.
How is it going so far?
I have a DB that has grown to TBs in size. I am using MySQL. Yes, I am facing performance issues with MySQL, but they're not severe. Most probably, I will be moving to some other DB or adding sharding to my existing DB.
Why I chose PHP?
Simple, because I wanted to show people that PHP and MySQL are capable of such things! :)
I'll most probably be using MemCache for caching some database results.
As I haven't ever written or done caching before, I thought it would be a good idea to ask those of you who have already done it. The system I'm writing may have concurrently running scripts at some point. This is what I'm planning on doing:
I'm writing a banner exchange system.
The information about banners is stored in the database.
There are different sites, with different traffic levels, loading a PHP script that generates the code for those banners (so that the banners are displayed on the client's site).
When a banner is displayed for the first time, it gets cached with memcache.
The banner has a cache lifetime of, for example, 1 hour.
Every hour the cache is renewed.
The potential problem I see in this task is at steps 4 and 6.
If we have, for example, 100 sites with big traffic, it may happen that the script has several instances running simultaneously. How can I guarantee that when the cache expires it'll get regenerated only once and the data will be intact?
How could I guarantee that when the cache expires it'll get regenerated once and the data will be intact?
The approach to caching I take is, for lack of a better word, a "lazy" implementation. That is, you don't cache something until it has been retrieved once, in the hope that someone will need it again. Here's roughly what that algorithm looks like (using memcache):
// returns false if there is no value or the value has expired
$result = $memcached->get($key);
if ($result === false)
{
    $result = fetch_from_db();   // however you normally load the data
    // set it for next time, until it expires anyway
    $memcached->set($key, $result, $expiry);
}
This works pretty well for what we want to use it for, as long as you use the cache intelligently and understand that not all information is the same. For example, in a hypothetical user comment system, you don't need an expiry time because you can simply invalidate the cache whenever a new user posts a comment on an article, so the next time comments are loaded, they're recached. Some information however (weather data comes to mind) should get a manual expiry time since you're not relying on user input to update your data.
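For the comment example, the invalidation side might look like this minimal sketch (table and key names are made up):

function add_comment(PDO $pdo, Memcached $mc, $articleId, $body)
{
    $pdo->prepare('INSERT INTO comments (article_id, body) VALUES (?, ?)')
        ->execute(array($articleId, $body));
    // drop the cached comment list; the next read re-caches it lazily
    $mc->delete('comments_' . $articleId);
}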
For what it's worth, memcache works well in a clustered environment, and you should find that setting something like that up isn't hard to do, so this should scale pretty easily to whatever you need it to be.
Bit of an odd question but I'm hoping someone can point me in the right direction. Basically I have two scenarios and I'd like to know which one is the best for my situation (a user checking a scoreboard on a high traffic site).
Top 10 is regenerated every time a user hits the page - increased load on the server, especially with high traffic, but the user sees his/her correct standing ASAP.
Top 10 is regenerated at a set interval, e.g. every 10 minutes - only one set of results is generated, causing one spike every 10 minutes rather than potentially one every x seconds, but if a user hits the page between refreshes they won't see their updated score.
Each one has its pros and cons; in your experience, which one would be best to use, or are there any magical alternatives?
EDIT - An update: after taking on board what everyone has said, I've decided to rebuild this part of the application. Rather than dealing with the individual scores I'm dealing with the totals; these are then saved out to a separate table which sort of acts like a cached data source.
Thank you all for the great input.
Adding to Marcel's answer, I would suggest only updating the scoreboards upon write events (like a new score or a deleted score). This way you can keep static answers for popular queries like the Top 10. Use something like MemCache to keep the data cached for requests, or, if you don't/can't install something like MemCache on your server, serialize common requests and write them to flat files, then delete/update them upon write events. Have your code look for the cached result (or file) first, and only if it's missing, run the query and create the data.
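A rough sketch of the flat-file variant might look like this (the path, table and column names are just examples):

$cacheFile = '/var/cache/myapp/top10.json';

if (is_file($cacheFile)) {
    $top10 = json_decode(file_get_contents($cacheFile), true);
} else {
    $top10 = $pdo->query('SELECT player, score FROM scores
                          ORDER BY score DESC LIMIT 10')->fetchAll(PDO::FETCH_ASSOC);
    file_put_contents($cacheFile, json_encode($top10), LOCK_EX);
}

// on any write event (new or deleted score), just drop the file:
// unlink($cacheFile);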
Nothing ever really needs to be real-time when it comes to the web. I would go with option 2; users will not notice that their score isn't changing instantly. You can use some JS to refresh the top 10 each time the cache has cleared.
To add to Jordan's suggestion: I'd put the scoreboard in a separate (HTML-formatted) file that is regenerated whenever new data arrives, and only then. You can include this file in the PHP page containing the scoreboard, or even let a visitor's browser fetch it periodically using XMLHttpRequests (to save bandwidth). Users with JavaScript disabled, or using a browser that doesn't support XMLHttpRequests (rare these days, but possible), will just see a static page.
The Drupal voting module will handle this for you, giving you an option of when to recalculate. If you're implementing it yourself, then caching the top 10 somewhere is a good idea - you can either regenerate it at regular intervals or you can invalidate the cache at certain points. You'd need to look at how often people are voting, how often that will cause the top 10 to change, how often the top 10 page is being viewed and the performance hit that regenerating it involves.
If you're not set on Drupal/MySQL then CouchDB would be useful here. You can create a view which calculates the top 10 data and it'll be cached until something happens which causes a recalculation to be necessary. You can also put in an http caching proxy inline to cache results for a set number of minutes.