Processing and matching large amounts of data - php

I have one large database table of request data, much like Apache request logs, of about 50 million rows:
request_url
user_agent
created
that contains data like this:
/profile/Billy
Mozilla.....
2012-06-17...
/profile/Jane
Mozilla.....
2012-06-17...
I then have my user database table, with all my user data including usernames.
At the moment, every night, I process the request data for the previous day, row by row and see if it contains an URL that matches one of the usernames in the users table. If it does, I increment a total in another table that stores stats that allows users to see how many pageviews they got for any particular day.
However as the datasets grow, this is becoming resource intensive and can also take a long time to complete, even when grouping the request data by URL and grabbing a count for that group.
Is there a better way of processing this information to get the end result I need? The request data is going to be logged anyway, so it would be preferable to to generate the stats after the fact rather than incrementing the total on every page view.
I'm running this on one server, so distributed processing of the data on multiple servers isn't required.

Start with a fresh log-table every day. When the day is done, use it to increment the totals, then append it to that huge main log-table and delete it.

Incrementing total on every page view is your best option. It saves trouble of "search" later on for each user separately. It's just one extra query of update on every pageview, and thus processing load is spread out throughout the day instead of single time (Plus your stats stay updated all the time, instead of being updated daily)
If you are insistent on doing in SQL, you might consider
SELECT COUNT(request_url) FROM your_table WHERE request_url LIKE %/profile/username%
(though I am not sure if that's what you're already doing?)

Start looking into analytic database like Infobright. Column Based storeage engines are huge in the big data initiatives and are built for doing in memory analytics on aggregates as well as ad hoc querying.
Disclaimer: the author is affiliated with Infobright.

Related

MySQL big chunks of data - fast access without recalculation

I would like to store big chunks of data in RAM using sphinx /solr/ elastic search whatever else suits such needs (The problem is I don't know what tool suits the best I had only heard that people use them).
I build reports about sales, I get nearly 800-900k lines of sales per month and user wants to scroll the page and see them smoothly.
I can't give them all data at once becasue browser will just hang
in the same time I can't use LIMIT from mysql because queries demand merging cross tables.
Recalculating it on the flow is not an option.
Creating a temp table in mysql is a bad idea because there are a bunch of criteria and more than one user can view data.
Temporary_table
id product_id product_count order_id order_status.... .....user_id
Having such table I would store all result for current user in the table and would hold them there as long as user doesn't make a new query. But I don't like this solution. There must be something better.
I feel like it's over my head.
Any ideas?
"Drill down", don't "Scroll down" !
When I need to present a million lines of info, I start by thinking of way to slice and dice it -- subtotals by hour, by region, by product type, by whatever. Each slice might be a hundred lines -- quite manageable, especially with summary tables.
In that hundred lines would be clickable items that take the user to a more detailed page about one of the items. That would also have a hundred lines (or 10 or 1000 -- whatever makes sense; but >1000 is usually unreasonable). That page may have further links to drill further down. And/or links to move laterally.
With suitable slicing and dicing, you are very unlikely to need to send him a million lines; only a few hundred.
With suitable Summary Tables, the "tmp tables", etc, go away.

Creating a news feed realtime

I have a database containing many tables : [followers, favorites, posts ...etc)
These tables define the different activities a user can achieve, he can send posts, add other people to favorites and follow others.
What I want to do.. is to extract data from these tables, and build a real-time news feed .
I have two options:
1- Creating a separate table for the notifications (so that I won't have to get data from multiple tables, then using a Javascript timer to return results every x seconds.
2- using XMPP server...that sends (or pushes) notifications every x second, without actually sending any ajax queries.
And for these two options, I don't know whether I should connect to these tables to get news feed, or just create a separate table for notifications.
I searched in the subject, but I didn't find something really helpful yet, Any links will be appreciated.
If your data is normalized, you should be able to pull all the data with one query (using JOINs), or you could try creating a View if you want to query from one table. It's always best to keep your data in the appropriate tables to avoid having to duplicate data.
The push notifications are easier on the server, since they're not receiving requests from each client. Depending on your load, you could probably get away with simple AJAX requests.
The request for news feed will be very frequently. So you must keep your code run fast and take as less resource (CPU-time, database query) as possible.
I suggest you to take the first option. It meets your requirement and is simple enough.
Because you have many tables, all of them will grow bigger day by day. Every time you connect them to get news feed will take a long time and increase the load of your database. In the other hand, you query sql will be complex.
Like #Curtis Mattoon said: avoid having to duplicate data, but sometime, we need spend more space for less time.
So I suggest to create a new table to store the notification data. You even can delete the old data from this table periodically.
At the same time, your sql and php code for news feed will be simple and run fast.

A top content system written in PHP

I want to write a top articles system.
I want to filter some content (articles/etc..) by number of views.
If I insert in database with views=views+1 every time when a user view that link, I think it's slowly and it's a bad practice.
An example of another site that does this is YouTube. It updates this table only at a certain interval, so the views aren't updated live. Is this a good practice to do this?
You could create a log (simple text file / xml / Json ………) to store the view count and do a job to parse the file and insert the result into the DB.
This job could run in time intervals or check the system for idle processor.
In my opinion, using the RAM to store sensitive data as this count (an important part of your system) seems kind of insecure.
Create a second table of views; you can also use this second table to filter "duplicate" views (backed too by cookies - removing duplicate views won't be a perfect operation).
As for being "slow" I've used that approach on many sites with many users. Storing and aggregating data is what databases do, so use their power.
Then you can use this second table to total the views, or periodically sum up the data and store the running total in the main table, clearing out the second table to keep space down. I usually keep all the data but demoralize for speed.

MySql queries at certain time

I'm supposed to make queries from MySql database once a day and display data on the page... and this sounds like cron job - I never did this before and I'd like you opinion.
if I make query once a day, I have to save this data in a file, let's say, xml file and every time the page reloads, it has to parse data from that file.
From my point of view, it would be faster and more user friendly to make query every time the page loads, as data would be refreshed ...
Any help please ....
Thank for your answers, I'll update my answer ... I don't think the queries would be extensive: something like find the most popular categories from articles, the most popular cites from where the author is ... three of those queries. So data pulled out from database will rely only on two tables, max three and only one will have dynamic data, other will be small ones.
I didn't ask yet why ... because it is not available at the moment ...
It all depends on the load on the server. If users are requesting this data a few times a day, then pulling the data on each request should be ok (KISS first). However, if they are slamming the server many times and the request is slow on top of that, then you should store the data off. I would just suggest storing it to a table and just clearing the table each night on a successful reload.
If this is a normal query that doesn't take long to execute, there is no reason to cache the result in a file. MySQL also has caching built in, which may be closer to what you want.
That would depend on the complexity of the query. If the "query" is actually going through a lot of work to build a dataset, or querying a dozen different database servers, i can see only doing it once per day.
For example, if you own a chain of stores across 30 states and 5 countries, each with their own stock-levels, and you want to display local stock levels on your website, i can see only going through the trouble of doing that once per day...
If efficiency is the only concern, it should be pretty easy to estimate which is better:
Time to run Query + (Time to load xml x estimated visits)
versus
Time to run Query x Estimated Visits

How to build a proper Database for a traffic analytics system?

How to build a proper structure for an analytics service? Currently i have 1 table that stores data about every user that visits the page with my client's ID so later my clients will be able to see the statistics for a specific date.
I've thought a bit today and I'm wondering: Let's say i have 1,000 users and everyone has around 1,000 impressions on their sites daily, means i get 1,000,000 (1M) new records every day to a single table. How will it work after 2 months or so (when the table reaches 60 Million records)?
I just think that after some time it will have so much records that the PHP queries to pull out the data will be really heavy, slow and take a lot of resources, is it true? and how to prevent that?
A friend of mine working on something similar and he is gonna make a new table for every client, is this the correct way to go with?
Thanks!
Problem you are facing is I/O bound system. 1 million records a day is roughly 12 write queries per second. That's achievable, but pulling the data out while writing at the same time will make your system to be bound at the HDD level.
What you need to do is configure your database to support the I/O volume you'll be doing, such as - use appropriate database engine (InnoDB and not MyISAM), make sure you have fast enough HDD subsystem (RAID, not regular drives since they can and will fail at some point), design your database optimally, inspect queries with EXPLAIN to see where you might have gone wrong with them, maybe even use a different storage engine - personally, I'd use TokuDB if I were you.
And also, I sincerely hope you'd be doing your querying, sorting, filtering on the database side and not on PHP side.
Consider this Link to the Google Analytics Platform Components Overview page and pay special attention to the way the data is written to the database, simply based on the architecture of the entire system.
Instead of writing everything to your database right away, you could write everything to a log file, then process the log later (perhaps at a time when the traffic isn't so high). At the end of the day, you'll still need to make all of those writes to your database, but if you batch them together and do them when that kind of load is more tolerable, your system will scale a lot better.
You could normalize impressions the data like this;
Client Table
{
ID
Name
}
Pages Table
{
ID
Page_Name
}
PagesClientsVisits Table
{
ID
Client_ID
Page_ID
Visits
}
and just increment visits on the final table on each new impression. Then the maximum number of records in there becomes (No. of clients * No. of pages)
Having a table with 60 million records can be ok. That is what a database is for. But you should be careful about how many fields you have in the table. Also what datatype (=>size) each field has.
You create some kind of reports on the data. Think about what data you really need for those reports. For example you might need only the numbers of visits per user on every page. A simple count would do the trick.
What you also can do is generate the report every night and delete the raw data afterwards.
So, read and think about it.

Categories