I have a database containing many tables: followers, favorites, posts, etc.
These tables define the different activities a user can perform: he can send posts, add other people to favorites, and follow others.
What I want to do is extract data from these tables and build a real-time news feed.
I have two options:
1- Creating a separate table for the notifications (so that I won't have to get data from multiple tables), then using a JavaScript timer to fetch results every x seconds.
2- Using an XMPP server that sends (or pushes) notifications to the client, without the client having to send AJAX requests every x seconds.
And for these two options, I don't know whether I should query the activity tables directly to build the news feed, or just create a separate table for notifications.
I have searched the subject but haven't found anything really helpful yet; any links would be appreciated.
If your data is normalized, you should be able to pull all the data with one query (using JOINs), or you could try creating a View if you want to query from one table. It's always best to keep your data in the appropriate tables to avoid having to duplicate data.
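For example, a minimal sketch of such a view, assuming each activity table has a user_id and a created_at column (all names here are assumptions, not your actual schema):

CREATE VIEW news_feed AS
    SELECT user_id, 'post' AS activity, created_at FROM posts
    UNION ALL
    SELECT user_id, 'favorite' AS activity, created_at FROM favorites
    UNION ALL
    SELECT follower_id AS user_id, 'follow' AS activity, created_at FROM followers;

-- Feed for one viewer: the latest activity of the people he follows.
SELECT n.*
FROM news_feed n
JOIN followers f ON f.followed_id = n.user_id
WHERE f.follower_id = 123
ORDER BY n.created_at DESC
LIMIT 20;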
Push notifications are easier on the server, since it isn't receiving polling requests from every client. Depending on your load, though, you could probably get away with simple AJAX requests.
News feed requests will be very frequent, so you must keep your code fast and use as few resources (CPU time, database queries) as possible.
I suggest you take the first option. It meets your requirements and is simple enough.
Because you have many tables, all of them will grow day by day. Joining them to build the news feed every time will take longer and longer and increase the load on your database. On top of that, your SQL query will be complex.
As @Curtis Mattoon said, avoid duplicating data; but sometimes we need to spend more space to save time.
So I suggest creating a new table to store the notification data. You can even delete old data from this table periodically.
At the same time, your SQL and PHP code for the news feed will stay simple and run fast.
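A rough sketch of what that notifications table and the periodic cleanup could look like (column names and intervals are assumptions):

CREATE TABLE notifications (
    id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id    INT UNSIGNED NOT NULL,   -- who should see this item
    actor_id   INT UNSIGNED NOT NULL,   -- who performed the action
    type       VARCHAR(20)  NOT NULL,   -- 'post', 'favorite', 'follow', ...
    created_at DATETIME     NOT NULL,
    INDEX (user_id, created_at)
);

-- The JavaScript timer's AJAX call then maps to one indexed lookup:
SELECT actor_id, type, created_at
FROM notifications
WHERE user_id = 123 AND created_at > '2012-06-17 12:00:00'
ORDER BY created_at DESC;

-- Old rows can be purged periodically:
DELETE FROM notifications WHERE created_at < NOW() - INTERVAL 30 DAY;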
Say I have a webmail service with a table whose fields are username, IP, location, and login_time. Let's say my service is hugely popular, with hundreds of users logging in every minute. At the end of the day, if I want to display this table for today's list of users, there are, say, half a million rows. Even after indexing the table, loading this page takes a huge amount of time. How do I make it faster (or at least give the feel of a speedy load)? Maybe I can paginate and load, say, 50 rows at a time as users move between pages. What if I don't have that option?
Best would be to use a jQuery "Load more" plugin and fetch only a limited amount of data at once. Users can click the "Load more" button and work through the whole table if they want.
Use backend pagination, as you said.
Imagine that you have an Excel file containing that many rows: how fast do you think it will open? And, unlike your browser, Excel is specialized software for working with rows of data.
Put it in another perspective: is it helpful in any way for the user to see half a million rows at once? I doubt it. The user gets exactly the same functionality from your software if you offer a paged results list with a search form.
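A sketch of backend pagination, assuming the table is called logins (the plain LIMIT/OFFSET form is simplest; the keyset form, which filters on the last login_time already shown, stays fast on deep pages):

-- Page 3, 50 rows per page:
SELECT username, ip, location, login_time
FROM logins
WHERE login_time >= CURDATE()
ORDER BY login_time DESC
LIMIT 50 OFFSET 100;

-- Keyset variant: pass in the oldest login_time from the previous page.
SELECT username, ip, location, login_time
FROM logins
WHERE login_time >= CURDATE()
  AND login_time < '2012-06-17 10:15:00'
ORDER BY login_time DESC
LIMIT 50;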
I think table partitioning based on the login_time column is the solution. Partitioning lets you store parts of your table in their own logical space, so your query only has to look at a subset of the data to get a result, not the whole table, which makes it several times faster depending on the number of rows. There is more about partitioning at the link below:
http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
Once you have partitioned your table, you can use a pagination mechanism since showing all 0.5 million rows to the user would not serve any purpose.
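A hedged sketch of what that could look like (table name and column sizes are assumptions; MySQL 5.1 needs an integer-valued partition expression, hence TO_DAYS):

CREATE TABLE logins (
    username   VARCHAR(64)  NOT NULL,
    ip         VARCHAR(45)  NOT NULL,
    location   VARCHAR(128) NOT NULL,
    login_time DATETIME     NOT NULL,
    KEY (login_time)
)
PARTITION BY RANGE (TO_DAYS(login_time)) (
    PARTITION p20120616 VALUES LESS THAN (TO_DAYS('2012-06-17')),
    PARTITION p20120617 VALUES LESS THAN (TO_DAYS('2012-06-18')),
    PARTITION pmax      VALUES LESS THAN MAXVALUE
);

-- Queries that filter on login_time only touch the matching partition:
SELECT username, ip, location, login_time
FROM logins
WHERE login_time >= '2012-06-17' AND login_time < '2012-06-18';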
I want to build a "top articles" system.
I want to filter some content (articles, etc.) by number of views.
If I update the database with views = views + 1 every time a user views a link, I think it will be slow, and it seems like bad practice.
An example of another site that does this is YouTube: it updates its view counts only at certain intervals, so views aren't updated live. Is this a good practice?
You could create a log (a simple text file, XML, JSON, ...) to store the view counts, and run a job that parses the file and inserts the results into the DB.
This job could run at fixed intervals, or whenever the system detects an idle processor.
In my opinion, using RAM to store sensitive data such as this count (an important part of your system) seems somewhat risky.
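For the job itself, the SQL side could be as small as this hedged sketch (the staging table and the articles.views column are assumptions):

-- 1. The parser loads one row per article from the log file:
CREATE TABLE view_log_staging (
    article_id INT UNSIGNED NOT NULL,
    hits       INT UNSIGNED NOT NULL
);

-- 2. All counts are applied in a single statement:
UPDATE articles a
JOIN view_log_staging s ON s.article_id = a.id
SET a.views = a.views + s.hits;

-- 3. The staging table is emptied for the next run:
TRUNCATE TABLE view_log_staging;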
Create a second table of views; you can also use this second table to filter "duplicate" views (backed by cookies as well; removing duplicate views won't be a perfect operation).
As for being "slow": I've used that approach on many sites with many users. Storing and aggregating data is what databases do, so use their power.
You can then use this second table to total the views, or periodically sum up the data, store the running total in the main table, and clear out the second table to keep space down. I usually keep all the data but denormalize for speed.
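A hedged sketch of that second table and the periodic roll-up (names are assumptions; in practice you would only sum rows older than a cutoff so views inserted mid-roll-up aren't lost):

CREATE TABLE article_views (
    article_id INT UNSIGNED NOT NULL,
    viewer_key VARCHAR(64)  NOT NULL,  -- user id or cookie hash, for duplicate filtering
    viewed_at  DATETIME     NOT NULL,
    KEY (article_id)
);

-- Live totals when needed:
SELECT article_id, COUNT(*) AS views
FROM article_views
GROUP BY article_id;

-- Periodic roll-up into the main table, then clear the detail rows:
UPDATE articles a
JOIN (
    SELECT article_id, COUNT(*) AS c
    FROM article_views
    GROUP BY article_id
) v ON v.article_id = a.id
SET a.views = a.views + v.c;

TRUNCATE TABLE article_views;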
I have one large database table of request data, much like Apache request logs, of about 50 million rows:
request_url
user_agent
created
that contains data like this:
/profile/Billy    Mozilla.....    2012-06-17...
/profile/Jane     Mozilla.....    2012-06-17...
I then have my user database table, with all my user data including usernames.
At the moment, every night, I process the previous day's request data row by row and check whether the URL matches one of the usernames in the users table. If it does, I increment a total in another table that stores stats, which lets users see how many pageviews they got on any particular day.
However, as the datasets grow, this is becoming resource-intensive and can take a long time to complete, even when grouping the request data by URL and grabbing a count for each group.
Is there a better way of processing this information to get the end result I need? The request data is going to be logged anyway, so it would be preferable to generate the stats after the fact rather than incrementing a total on every page view.
I'm running this on one server, so distributed processing of the data on multiple servers isn't required.
Start with a fresh log-table every day. When the day is done, use it to increment the totals, then append it to that huge main log-table and delete it.
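Either way, the nightly run can be a single set-based statement instead of row-by-row processing. A hedged sketch, assuming the log table is called requests, URLs look exactly like /profile/<username>, and there is a stats table daily_profile_views with a unique key on (user_id, day):

INSERT INTO daily_profile_views (user_id, day, views)
SELECT u.id, DATE(r.created), COUNT(*)
FROM requests r
JOIN users u ON r.request_url = CONCAT('/profile/', u.username)
WHERE r.created >= CURDATE() - INTERVAL 1 DAY
  AND r.created <  CURDATE()
GROUP BY u.id, DATE(r.created)
ON DUPLICATE KEY UPDATE views = views + VALUES(views);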
Incrementing the total on every page view is your best option. It saves the trouble of searching later for each user separately. It's just one extra update query per pageview, so the processing load is spread throughout the day instead of hitting all at once (plus your stats stay up to date all the time, instead of being updated daily).
If you insist on doing it in SQL after the fact, you might consider
SELECT COUNT(request_url) FROM your_table WHERE request_url LIKE '%/profile/username%'
(though I am not sure if that's what you're already doing?)
Start looking into an analytic database like Infobright. Column-based storage engines are huge in big-data initiatives and are built for in-memory analytics on aggregates as well as ad hoc querying.
Disclaimer: the author is affiliated with Infobright.
I'm supposed to query a MySQL database once a day and display the data on a page, and this sounds like a cron job. I've never done this before and I'd like your opinion.
If I run the query once a day, I have to save the data in a file, say an XML file, and every time the page loads it has to parse the data from that file.
From my point of view, it would be faster and more user-friendly to run the query every time the page loads, as the data would always be fresh...
Any help please ....
Thanks for your answers; I'll update my question... I don't think the queries would be extensive: something like finding the most popular categories of articles, or the most popular cities the authors come from... three queries of that kind. So the data pulled from the database will rely on only two tables, three at most, and only one of them will hold dynamic data; the others will be small.
I didn't ask why yet... because that isn't available at the moment...
It all depends on the load on the server. If users are requesting this data only a few times a day, then pulling the data on each request should be OK (KISS first). However, if they are hitting the server many times and the request is slow on top of that, then you should cache the data somewhere. I would suggest storing it in a table and clearing that table each night on a successful reload.
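If you do cache it, the table can be trivial; a hedged sketch, assuming the stats are simple name/value pairs and an articles table with a category column:

CREATE TABLE daily_stats_cache (
    stat_name    VARCHAR(64)  NOT NULL,
    stat_value   VARCHAR(255) NOT NULL,
    generated_at DATETIME     NOT NULL
);

-- Nightly cron job: clear and repopulate.
TRUNCATE TABLE daily_stats_cache;

INSERT INTO daily_stats_cache (stat_name, stat_value, generated_at)
SELECT 'most_popular_category', category, NOW()
FROM articles
GROUP BY category
ORDER BY COUNT(*) DESC
LIMIT 1;

-- Page load: one cheap read instead of the heavy query.
SELECT stat_name, stat_value FROM daily_stats_cache;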
If this is a normal query that doesn't take long to execute, there is no reason to cache the result in a file. MySQL also has caching built in, which may be closer to what you want.
That would depend on the complexity of the query. If the "query" is actually going through a lot of work to build a dataset, or querying a dozen different database servers, I can see only doing it once per day.
For example, if you own a chain of stores across 30 states and 5 countries, each with their own stock levels, and you want to display local stock levels on your website, I can see only going through the trouble of doing that once per day...
If efficiency is the only concern, it should be pretty easy to estimate which is better:
Time to run query + (time to load XML x estimated visits)
versus
Time to run query x estimated visits
How do I build a proper structure for an analytics service? Currently I have one table that stores data about every user who visits a page carrying my client's ID, so that later my clients can see the statistics for a specific date.
I've been thinking about it today and I'm wondering: let's say I have 1,000 users and each of them gets around 1,000 impressions on their sites daily. That means I add 1,000,000 (1M) new records to a single table every day. How will it work after two months or so, when the table reaches 60 million records?
I just think that after some time it will have so many records that the PHP queries to pull out the data will be really heavy, slow, and resource-hungry. Is that true, and how do I prevent it?
A friend of mine is working on something similar, and he is going to create a new table for every client. Is this the correct way to go?
Thanks!
The problem you are facing is an I/O-bound system. One million records a day is roughly 12 write queries per second. That's achievable, but pulling data out while writing at the same time will make your system bound at the HDD level.
What you need to do is configure your database to support the I/O volume you'll be doing: use an appropriate database engine (InnoDB, not MyISAM); make sure you have a fast enough HDD subsystem (RAID, not single drives, since they can and will fail at some point); design your database optimally; inspect queries with EXPLAIN to see where you might have gone wrong with them; and maybe even use a different storage engine entirely. Personally, I'd use TokuDB if I were you.
Also, I sincerely hope you'll be doing your querying, sorting, and filtering on the database side and not on the PHP side.
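For example, running EXPLAIN on a typical stats query (names here are hypothetical) shows whether the date filter is using an index:

EXPLAIN
SELECT client_id, COUNT(*)
FROM impressions
WHERE created_at >= '2012-06-17'
GROUP BY client_id;

-- If the key column comes back NULL, add one:
ALTER TABLE impressions ADD INDEX (created_at);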
Take a look at the Google Analytics Platform Components Overview page and pay special attention to the way data is written to the database, based simply on the architecture of the entire system.
Instead of writing everything to your database right away, you could write everything to a log file, then process the log later (perhaps at a time when the traffic isn't so high). At the end of the day, you'll still need to make all of those writes to your database, but if you batch them together and do them when that kind of load is more tolerable, your system will scale a lot better.
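A hedged sketch of that batch step, assuming the tracker appends tab-separated lines (client id, page id, timestamp) to a daily file and you keep a simple per-day aggregate table:

CREATE TABLE impressions_staging (
    client_id INT UNSIGNED NOT NULL,
    page_id   INT UNSIGNED NOT NULL,
    hit_time  DATETIME     NOT NULL
);

-- Bulk-load the day's file in one pass (path is only an example):
LOAD DATA INFILE '/var/log/myapp/impressions-2012-06-17.log'
INTO TABLE impressions_staging
FIELDS TERMINATED BY '\t';

-- Fold it into the aggregate table and empty the staging table:
INSERT INTO impressions_daily (client_id, page_id, day, hits)
SELECT client_id, page_id, DATE(hit_time), COUNT(*)
FROM impressions_staging
GROUP BY client_id, page_id, DATE(hit_time);

TRUNCATE TABLE impressions_staging;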
You could normalize the impressions data like this:
Client Table
{
ID
Name
}
Pages Table
{
ID
Page_Name
}
PagesClientsVisits Table
{
ID
Client_ID
Page_ID
Visits
}
and just increment visits on the final table on each new impression. Then the maximum number of records in there becomes (No. of clients * No. of pages)
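Assuming ID is an auto-increment primary key and (Client_ID, Page_ID) is made unique, each impression then becomes one upsert; a hedged sketch:

-- One-time: make the (client, page) pair unique so the upsert works.
ALTER TABLE PagesClientsVisits ADD UNIQUE KEY client_page (Client_ID, Page_ID);

-- On each impression:
INSERT INTO PagesClientsVisits (Client_ID, Page_ID, Visits)
VALUES (42, 7, 1)
ON DUPLICATE KEY UPDATE Visits = Visits + 1;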
Having a table with 60 million records can be OK; that is what a database is for. But you should be careful about how many fields the table has, and what datatype (and therefore size) each field uses.
You will create some kind of reports from this data, so think about what you really need for those reports. For example, you might only need the number of visits per user on each page; a simple count would do the trick.
What you can also do is generate the report every night and delete the raw data afterwards, as sketched below.
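A tiny sketch of that nightly job (table and column names are assumptions):

-- Summarize yesterday's raw rows, then drop them:
INSERT INTO daily_report (client_id, day, visits)
SELECT client_id, DATE(visit_time), COUNT(*)
FROM raw_impressions
WHERE visit_time < CURDATE()
GROUP BY client_id, DATE(visit_time);

DELETE FROM raw_impressions WHERE visit_time < CURDATE();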
So, read and think about it.