How does one retrieve and display posts on a forum? - php

I'm struggling with a conceptual question. When you have a forum with thousands of posts and/or threads, how do you retrieve all those posts to be displayed on your site? Do you connect to your database every time someone visits your page, then capture every post in an array and display it? Surely that would be very taxing on your server and would cause a whole bunch of unnecessary database reads. Can anyone shed some light on this topic?
Thanks.

You never retrieve all those posts at once. In most cases, forums show a page of X threads/posts, and you just get those X threads/posts from the database each time a page is served. RDBMSes are pretty good at this. A forum is (or should be) quite dynamic, so it does generate a fair load on the database, but that is what databases are made for: storing and retrieving data.
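A minimal sketch of that per-page query, assuming a posts table with thread_id and created_at columns (all names here are illustrative):

<?php
// Fetch one page of posts, never the whole thread.
$pdo = new PDO('mysql:host=localhost;dbname=forum', 'user', 'pass');

$perPage = 20;
$page    = max(1, (int)($_GET['page'] ?? 1));

$stmt = $pdo->prepare(
    'SELECT id, author, body, created_at
       FROM posts
      WHERE thread_id = ?
      ORDER BY created_at
      LIMIT ? OFFSET ?'
);
$stmt->bindValue(1, (int)($_GET['thread'] ?? 0), PDO::PARAM_INT);
$stmt->bindValue(2, $perPage, PDO::PARAM_INT);
$stmt->bindValue(3, ($page - 1) * $perPage, PDO::PARAM_INT);
$stmt->execute();
$posts = $stmt->fetchAll(PDO::FETCH_ASSOC);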

One new(ish) way of doing this is to use a document-oriented database like CouchDB, where everything about an individual post is stored in the same document and that document gets loaded on request.
It seems a document-oriented database would work very well for a forum or blog type site in a case like this.
As far as relational databases go, I'm pretty sure the database gets hit every time the page loads unless some sort of caching is implemented (then you'd have to worry about data getting stale, though, which brings up a whole new mess of problems).

Don't worry too much about stale data. Facebook doesn't... their database is only "eventually consistent". The idea is this: making sure that the comments are always, always 100% up to date is very expensive, and it puts a large load on your DB. As Serty says, that's what the DB is made for, but whether or not your physical box is sufficient for the load is another matter.
Facebook and Digg, to name a few, took a different approach... Is it really all that important that every load of every page be 100% accurate? How many page loads actually result in every single comment being read by the end user anyway? It's a lot cheaper to get the comments right 'most' of the time, and by 'most' I mean something you get to decide. Is a 10% chance of a page with missing comments OK? Is a 1% chance? How many nodes need to have the right data right now? When I write a new comment, how many nodes have to say they got the update before the write counts as successful?
I like the idea behind Cassandra, which is, in summary: "how much are we willing to spend to get Aunt Martha's comment about her nephew's baptism picture 100% correct?"
That's a fine question for a free website, but it wouldn't work so well for a business application.


Getting all data once for future use

Well, this is really a question about how to design a website that uses fewer resources than normal websites, and is mobile-optimized as well.
Here it goes: I want to display an overview of, say, 5 posts (from a blog, for example). If I then click on the first post, it loads in a new window. But instead of connecting to the database again and fetching that specific post by its id, I'd just look the post up (in PHP) in the array of 5 posts that I created earlier, when the page was fetched for the first time.
Would it save data to download? PHP works server-side as well, so that's why I'm not sure.
Ok, I'll explain again:
Method 1:
1. The user connects to my website.
2. 5 posts are displayed and saved to an array (with all their data).
3. The user clicks on the first post, expecting more information about it.
4. My program looks the post up in my array and displays it.
Method 2:
1. The user connects to my website.
2. 5 posts are displayed.
3. The user clicks on the first post, expecting more information about it.
4. My program connects to MySQL again and fetches the post from the server.
First off, this sounds like a case of premature optimization. I would not start caching anything outside of the database until measurements prove that it's a wise thing to do. Caching takes your focus away from the core task at hand, and introduces complexity.
If you do want to keep DB results in memory, just using an array allocated in a PHP-processed HTTP request will not be sufficient. Once the page is processed, memory allocated at that scope is no longer available.
You could certainly put the results in SESSION scope. The advantage of saving some DB results in the SESSION is that you avoid DB round trips. Disadvantages include the increased complexity of programming the solution, web server memory spent on data that may never be accessed, and increased initial load on the DB to retrieve extra pages that may or may not ever be requested by the user.
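A minimal sketch of the SESSION approach, assuming the cached array is keyed by post id; fetchRecentPostsFromDb() and fetchPostFromDb() are hypothetical helpers:

<?php
session_start();

// First request: cache the overview posts in the user's session.
if (!isset($_SESSION['recent_posts'])) {
    $_SESSION['recent_posts'] = fetchRecentPostsFromDb(5); // hypothetical helper
}

// Later request: try the session cache before going back to the DB.
$postId = (int)($_GET['id'] ?? 0);
$post   = $_SESSION['recent_posts'][$postId]
       ?? fetchPostFromDb($postId); // hypothetical helper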
If DB performance, after measurement, really is causing you to miss your performance objectives you can use a well-proven caching system such as memcached to keep frequently accessed data in the web server's (or dedicated cache server's) memory.
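And a minimal memcached sketch using the Memcached extension; the key scheme, TTL, and fetch helper are assumptions:

<?php
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

$key   = 'thread_posts_' . $threadId;
$posts = $cache->get($key);
if ($posts === false) {                   // cache miss
    $posts = fetchPostsFromDb($threadId); // hypothetical helper
    $cache->set($key, $posts, 60);        // keep for 60 seconds
}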
Final note: You say
PHP works server-side as well
That's not accurate. PHP works server-side only.
Have you thought about putting the posts in hidden divs and only making them visible when the user clicks somewhere? Here's how to do that.
Put some sort of cache between your code and the database.
So your code will look something like this:
if (isPostInCache($postId)) {
    $post = loadPostFromCache($postId);
} else {
    $post = loadPostFromDatabase($postId);
    savePostToCache($postId, $post); // fill the cache for next time
}
Go for some caching system; the web is full of them. You can use memcached, or static caching you build yourself (e.g. saving posts as txt files on the server).
To me, this is a little more inefficient than making a second call to the database, and here is why.
The first query should only pull the fields you want, like title, author, and date. Pulling the content of each post can make for a heavy query, so I'd exclude that (you can pull a teaser if you like).
Then, if the user wants the details of a post, I would query for the content using an indexed key column.
That way you're not pulling content for 5 posts that may never be seen.
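A sketch of that two-query approach with PDO; the posts schema is an assumption:

<?php
// List view: light columns only, no heavy content.
$list = $pdo->query(
    'SELECT id, title, author, created_at
       FROM posts
      ORDER BY created_at DESC
      LIMIT 5'
)->fetchAll(PDO::FETCH_ASSOC);

// Detail view: fetch the heavy content by indexed primary key,
// and only when the user actually asks for it.
$stmt = $pdo->prepare('SELECT content FROM posts WHERE id = ?');
$stmt->execute([(int)($_GET['id'] ?? 0)]);
$content = $stmt->fetchColumn();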
If your PHP code is constantly re-connecting to the database, you've configured it wrong and aren't using connection pooling properly. The execution time of a query should be a few milliseconds at most if your stack is properly tuned. Do not cache unless you absolutely have to.
What you're advocating here is side-stepping a serious problem. Database queries should be effortless provided your database is properly configured. Fix that issue and you won't need to go down the caching road.
Saving data from one request to the next is a broken design, and if not done perfectly it can lead to embarrassing data-bleed situations where one user sees content intended for another. This is why caching is an option usually pursued after all other avenues have been exhausted.
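PHP has no built-in connection pool, but persistent connections are the closest analogue; a minimal sketch with PDO (host and credentials are placeholders):

<?php
// A persistent connection is reused across requests served by the same
// worker process, avoiding a fresh connect + handshake every page load.
$pdo = new PDO(
    'mysql:host=localhost;dbname=forum',
    'user',
    'pass',
    [PDO::ATTR_PERSISTENT => true]
);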

Would it be better to use MySQL to store comments, or to write them to a file or XML file in PHP and show them on the main page that way?

I am making a comment system. The user will log in with their details on the main page, which has already been built. On the second page, where the comments will live, I want to show each comment in order of time created. Would this be better done by storing the comments in a MySQL database, or by putting the comments into a file and reading them from files on the server?
XML [aka a 'flat file' database] may seem preferable for simplicity's sake, but you will find that your performance degrades ridiculously fast once you get a reasonable amount of traffic. Let's say you have a separate XML file storing comments for each page, and you want to display the last 5 comments, newest first. The server has to read the entire file from start to finish just to figure out which 5 comments are the last.
What happens when you have 10,000 comments on something? What happens when you have 100 posts with 10,000 comments each and 1,000 pageviews per second? You're putting so much I/O load on your disk that everything else will grind to a halt behind the queued I/O.
The point of an RDBMS like MySQL is that the information is indexed as it is put into the database, and that index is held in memory. That way the application doesn't have to re-examine 100% of the data each time a request is made. A MySQL query is written to consult the index and have the system retrieve only the desired information, i.e. the last 5 comments.
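A sketch of what that looks like in practice; the schema and index here are assumptions:

<?php
// Assumed schema:
//   CREATE TABLE comments (
//     id INT AUTO_INCREMENT PRIMARY KEY,
//     page_id INT, body TEXT, created_at DATETIME,
//     INDEX idx_page_time (page_id, created_at)
//   );
// With that index, the last 5 comments come straight off the index
// instead of a scan over every comment ever written.
$stmt = $pdo->prepare(
    'SELECT body, created_at
       FROM comments
      WHERE page_id = ?
      ORDER BY created_at DESC
      LIMIT 5'
);
$stmt->execute([$pageId]);
$latest = $stmt->fetchAll(PDO::FETCH_ASSOC);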
Trust me, I worked for a hosting provider, and code like this constantly causes problems of the variety "it worked fine for months and now my site is slow as hell". You will not regret taking a little extra time to do it right the first time, rather than scrambling to re-do the application when it grinds to a halt.
Definitely in MySQL. The actions are easy, and traceability is easy.
Unless you want to go the NoSQL route.
Storing the comments in a database is the better option. You get more power with a database. By "more power" I mean you can easily do things like the following (sketched in code after this list):
When a user gets deleted, you can decide whether or not to delete their comments
You can show the top 10 (or so) comments
You can show all comments from a user on some other page
Etc.
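Rough sketches of the queries behind that list, with assumed table and column names:

<?php
// Top 10 comments, newest first:
$top = $pdo->query(
    'SELECT * FROM comments ORDER BY created_at DESC LIMIT 10'
)->fetchAll(PDO::FETCH_ASSOC);

// All comments from one user, for their profile page:
$stmt = $pdo->prepare(
    'SELECT * FROM comments WHERE user_id = ? ORDER BY created_at DESC'
);
$stmt->execute([$userId]);

// Delete (or keep!) a user's comments when the account is removed:
$pdo->prepare('DELETE FROM comments WHERE user_id = ?')->execute([$userId]);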
I would recommend a database; it's a lot easier to access the data that way. XML is always a pain when it comes to laying out the code that pulls the data out. It might be worth getting a more detailed explanation from someone else, though, as I've not had that much experience with XML.
Plus you have more control over moderating the comments.
Hope that helps!
MySQL every time!
It's perfect for doing this and very suitable/flexible/powerful.

How to deal with External API latency

I have an application that is fetching several e-commerce websites using Curl, looking for the best price.
This process returns a table comparing the prices of all searched websites.
But now we have a problem: the number of stores is starting to increase, and the loading time has become unacceptable from the user-experience side (currently a 10 s page load).
So we decided to create a database and start injecting all the Curl-filtered results into it, in order to cut down the external calls (and their DNS lookups) and speed up the page load.
What I want to know is: despite all these efforts, is there still an advantage to implementing a Memcache module?
I mean, will it help even more, or is it just a waste of time?
The Memcache idea was inspired by this topic, of a guy that had a similar problem: Memcache to deal with high latency web services APIs - good idea?
Memcache could be helpful, but (in my opinion) it's kind of a weird way to approach the issue. If it was me, I'd go about it this way:
Firstly, I would indeed cache everything I could in my database. When the user searches, or whatever interaction triggers this, I'd show them a "searching" page with whatever results the server currently has, and a progress bar that fills up as the asynchronous searches complete.
I'd use AJAX to add additional results as they become available. I'm imagining that the search takes about ten seconds - it might take longer, and that's fine. As long as you've got a progress bar, your users will appreciate and understand that Stuff Is Going On.
Obviously, the more searches go through your system, the more up-to-date data you'll have in your database. I'd use cached results that are under a half-hour old, and I'd also record search terms and make sure I kept the top 100 (or so) searches cached at all times.
Know your customers and have what they want available. This doesn't have much to do with any specific technology, but it is all about your ability to predict what they want (or write software that predicts for you!)
Oh, and there's absolutely no reason why PHP can't handle the job. Tying together a bunch of unrelated interfaces is one of the things PHP is best at.
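A rough sketch of the cache-first lookup described above (serve DB-cached prices under half an hour old, otherwise re-fetch and store); the table, columns, and helpers are assumptions:

<?php
$stmt = $pdo->prepare(
    'SELECT price FROM price_cache
      WHERE store = ? AND product = ?
        AND fetched_at > NOW() - INTERVAL 30 MINUTE'
);
$stmt->execute([$store, $product]);
$price = $stmt->fetchColumn();

if ($price === false) {                        // stale or missing: hit the store
    $ch = curl_init($storeUrl);                // hypothetical store URL
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);      // one slow store must not stall the page
    $html = curl_exec($ch);
    curl_close($ch);

    $price = parsePrice($html);                // hypothetical parser
    // Assumes a unique key on (store, product):
    $pdo->prepare(
        'REPLACE INTO price_cache (store, product, price, fetched_at)
         VALUES (?, ?, ?, NOW())'
    )->execute([$store, $product, $price]);
}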
Your result is found outside the bounds of PHP alone. Don't bother hacking together a result in PHP when a cron job could easily populate your database and your PHP script could simply query that database.
If you plan to stick with PHP only, then I suggest you change your script to read from the database you have populated. To populate the results, have a cron job ping a PHP script, not accessible to users, that performs all of your curl functionality.
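A sketch of that division of labour; the crontab entry, script name, and helpers are all assumptions:

<?php
// refresh_prices.php - run from cron, not reachable from the web, e.g.:
//   */15 * * * * php /var/www/cron/refresh_prices.php
foreach (getStores() as $store) {                // hypothetical store list
    foreach (getTrackedProducts() as $product) { // hypothetical product list
        refreshPriceCache($store, $product);     // the curl + REPLACE logic sketched above
    }
}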

Caching debate/forum entries in PHP

Just looking for a piece of advice. On one of our webpages we have a debate/forum site. Every time a user requests the debate page, he/she gets a list of all topics (and their count of answers, etc.).
Likewise, when the user requests a specific topic/thread, all answers in the thread are shown, along with each poster's username, picture, age, and total number of forum posts.
All content is currently retrieved with a MySQL query every time the page is accessed. This is, however, starting to get painfully slow (especially with large threads, 3000+ answers).
I would like to cache the debate entries somehow to speed this process up. The problem, however, is that if I cache the entries themselves, the number of posts etc. (which is dynamic, of course) will not always be up to date.
Is there any smart way of caching the pages, and re-caching them when stuff like this is updated? :)
Thanks in advance,
fischer
You should create a tag or a name for the cache based on its data.
For example, for the post named Jake's Post you could create an md5 of the name, which would give you the tag 49fec15add24931728652baacc08b8ee.
Now cache the contents and everything to do with this post against the tag 49fec15add24931728652baacc08b8ee. When the post is updated or a comment is added, go to the cache and delete everything associated with 49fec15add24931728652baacc08b8ee.
Now there is no cache, and it will be rebuilt when the next visitor arrives to view the post.
You could break this down further by having multiple tags per post. E.g. you could have one tag for comments and another for answers; when a comment is added, delete the comments tag but not the answers tag. This reduces the work the server has to do when rebuilding the cache, as only the comments are now missing.
There are number of libraries and frameworks that can aid you in doing this.
Jake
EDIT
I'd use files to store the data, more specifically the HTML output of the page. You can then do something like:
if (file_exists($tag)) {
    // Cache hit: load the contents of the cache file and output it.
    echo file_get_contents($tag);
} else {
    // Cache miss: do the complex database lookup, then cache the
    // rendered HTML for the next visitor.
    $html = buildPageFromDatabase(); // pseudocode
    file_put_contents($tag, $html);
    echo $html;
}
Remember that frameworks like Zend have this sort of thing built in. I would seriously consider using a framework.
Interesting topic!
The first thing I'd look at is optimizing your database - even if you have to spend money upgrading the hardware, it will be significantly easier and cheaper than introducing a cache - fewer moving parts, fewer things that can go wrong...
If you can't squeeze more performance out of your database, the next thing I'd consider is de-normalizing the data a little. For instance, maintain a "reply_count" column rather than counting the replies against each topic on every page view. This is ugly, but it introduces fewer opportunities for things to go wrong - with a bit of luck, you can localize all the logic in your data access layer, as in the sketch below.
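For instance, a sketch of keeping that counter honest (table names assumed): bump it in the same transaction that inserts the reply, so the two can't drift apart.

<?php
$pdo->beginTransaction();
$pdo->prepare('INSERT INTO replies (topic_id, body) VALUES (?, ?)')
    ->execute([$topicId, $body]);
$pdo->prepare('UPDATE topics SET reply_count = reply_count + 1 WHERE id = ?')
    ->execute([$topicId]);
$pdo->commit();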
The next option I'd consider is to cache pages. For instance, just caching the "debate page" for 30 seconds should dramatically reduce the load on your database if you've got reasonable levels of traffic, and even if it all goes wrong, because you're caching the entire page, it will sort itself out the next time the page goes stale. In most situations, caching an entire page is okay - it's not the end of the world if a new post has appeared in the last 30 seconds and you don't see it on your page.
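For illustration, a minimal sketch of 30-second whole-page caching with output buffering (file path and render function are assumptions):

<?php
$cacheFile = '/tmp/debate_page.html';

if (file_exists($cacheFile) && time() - filemtime($cacheFile) < 30) {
    readfile($cacheFile);   // serve the recent copy, skip the DB entirely
    exit;
}

ob_start();
renderDebatePage();         // hypothetical function that builds the full page
file_put_contents($cacheFile, ob_get_flush()); // store and send in one step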
If you really have to provide more "up to date" content on the page, you might introduce caching at the database access level. I have, in the past, built a database access layer which cached the results of SQL queries based on hard-wired logic about how long to cache the results. In our case, we built a function to call the database which allowed you to specify the query (e.g. get posts for user), an array of parameters (e.g. username, date-from), and the cache duration. The database access function would cache results for the cache duration based on the query and the parameters; if the cache duration had expired, it would refresh the cache.
This scheme was fairly bug-proof - as an end user, you'd rarely notice weirdness due to caching, and because we kept the cache period fairly short, it all sorted itself out very quickly.
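A simplified sketch of such a function, using files as the cache store; every name here is illustrative rather than what we actually built:

<?php
// Results are keyed by query + parameters and kept for $ttl seconds.
function cachedQuery(PDO $pdo, string $sql, array $params, int $ttl): array
{
    $key = '/tmp/q_' . md5($sql . serialize($params));
    if (file_exists($key) && time() - filemtime($key) < $ttl) {
        return unserialize(file_get_contents($key));
    }
    $stmt = $pdo->prepare($sql);
    $stmt->execute($params);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
    file_put_contents($key, serialize($rows));
    return $rows;
}

// e.g. "get posts for user", cached for 60 seconds:
$posts = cachedQuery(
    $pdo,
    'SELECT * FROM posts WHERE author = ? AND created_at >= ?',
    [$username, $dateFrom],
    60
);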
Building up your page by caching snippets of content is possible, but it very quickly becomes horribly complex. It's easy to produce a page that makes no sense to the end user because different snippets follow different caching policies - "unread posts" doesn't add up to the number of posts in the breakdown because "summary" and "detail" are cached on different schedules.

PHP: How to do caching?

So I'm looking to do caching for a forum I'm building and I want to understand the best method. I've been doing some reading and the way that the Zend Framework handles caching (here) explains the idea well, but there are a few things I'm not sure about.
Let's say I want to cache posts. Should I simply "dump" the contents of the query into a file and then retrieve from that, or should I build the layout around the data and return the contents of that file? And how would I handle user information? Historically, the standard forum display shows a user's total post count next to each of their posts; this can change very often (assuming 30 posts per page), which would mean constantly clearing the cache - and that seems pretty redundant.
I can't find any articles about how I should approach this and I'd be interested to learn more, does anyone have any insight or relevant articles to help?
There's always a trade-off between how often you will hit the cache (and hence how useful the cache is), how much you want to cache, and how long the lifetime should be.
You should identify the bottlenecks in your application. If it's the query that's holding the performance back, by all means cache the query. If it's building some parts of the page, cache those instead.
As for retrieving the user posts: if you want that to be as live as possible, then you can't cache those (or if you do, you'll have to invalidate all the cached threads where that user has ever posted...). Retrieving post counts from the database (if done right) shouldn't be too taxing. You can cache a template where the post count is left blank to be filled in later, or you can do some tricks with JavaScript.
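A sketch of that template trick, assuming a Memcached-style $cache and a hypothetical {{POSTCOUNT_7}} placeholder token: cache the expensive markup once, and substitute the cheap, volatile number on every request.

<?php
$html = $cache->get('thread_42_html');
if ($html === false) {
    $html = renderThread(42);   // hypothetical; leaves {{POSTCOUNT_7}} tokens in place
    $cache->set('thread_42_html', $html, 300);
}

// Cheap indexed lookup, done on every request:
$stmt = $pdo->prepare('SELECT post_count FROM users WHERE id = ?');
$stmt->execute([7]);
echo str_replace('{{POSTCOUNT_7}}', (string)$stmt->fetchColumn(), $html);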
