Counting large amounts of data [closed] - php

My site has a social feature. On the user profile it shows how many posts the user has, how many followers, and so on.
However, our database has over 100,000 rows.
It's working fine, but I'm getting skeptical about the performance in the long run.
I was thinking of another method which I think would work better.
Basically, right now it just counts the rows that the user owns in the MySQL database.
Instead of scanning through the entire MySQL table, would it be better to do the following:
Create a column in the Users table called post_counts. Every time the user makes a post the counter goes up; every time the user removes a post it goes down, and so forth.
I've tried both methods, but since the DB is still small it's hard to tell if there is a performance increase.
The current method just runs SELECT * FROM table_name WHERE user = user_id; and then counts the result in PHP with count($fetchedRows);
Is there a better way to handle this?
[update]
Basically the feature is like Twitter followers. I'm sure Twitter doesn't count billions of rows to determine a user's follower count.

I have MySQL tables that have 70M+ rows in them. 100,000 is nothing.
But yes, I would keep the counters in a field and simply update them whenever that user posts something or deletes a post. Make sure you have good indexes.
Also, what you COUNT() makes a difference. A COUNT(*) takes less overhead than a COUNT(col) WHERE ...
Use EXPLAIN to see how long different COUNT() statements take and how many rows they scan.
As in: mysql> explain select count(*) from my_table where user_id = 72 \G
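For illustration, a minimal sketch of the counter-column approach; the posts table and the ids are assumptions for the example (post_counts is the column the question proposes), and wrapping each pair of statements in a transaction keeps the counter consistent with the posts table:
-- when user 42 creates a post:
START TRANSACTION;
INSERT INTO posts (user_id, body) VALUES (42, 'hello');
UPDATE users SET post_counts = post_counts + 1 WHERE id = 42;
COMMIT;
-- when user 42 deletes post 1337:
START TRANSACTION;
DELETE FROM posts WHERE id = 1337 AND user_id = 42;
UPDATE users SET post_counts = post_counts - 1 WHERE id = 42;
COMMIT;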

Related

Database number of columns and separate table? [closed]

I'm using a MySQL database and wondering about table design.
I've seen sets of two kinds of tables designed by an experienced PHP developer. Basically, one table contains dynamic stats and the other more static data, but each record shares the same row ID. For example, a users table where info like name and password is stored, and a user_stats table holding a summary of actions, like the total money spent, etc.
Is there any real advantage to doing this besides adding clarity to the function of each table? How many columns would you say is optimal in a table? Is it bad to mix the more static info and the dynamic stuff together into, say, 20 columns of the same table?
From a design standpoint, you might consider that the User table describes things, namely "users." There might be hundreds of thousands of them, and rows might need to remain in that table ... even for Users who've been pushing-up daisies in the local graveyard for many years now ... because this table associates the user_id values that are scattered throughout the database with individual properties of that "thing."
Meanwhile, we also collect User_Stats. But, we don't keep these stats nearly so long, and we don't keep them about every User that we know of. (Certainly not about the ones who live in the graveyard.) And, when we want to run reports about those statistics, we don't want to pore through all of those hundreds-of-thousands of User records, looking for the ones that actually have statistics.
User_Stats, then, is an entirely separate collection of records. Yes, it is related to Users, e.g. in the referential-integrity sense that "any user_id (foreign key ...) in User_Stats must correspond to a user_id in Users." But it is, nevertheless, "an entirely separate collection of records."
Another important reason for keeping these as a separate table is that there might naturally be a one-to-many relationship between a User and his User_Stats. (For instance, maybe you keep aggregate statistics by day, week, or month ...)
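A minimal sketch of such a pair of tables in MySQL; the names and columns are illustrative (taken loosely from the question), not prescribed:
CREATE TABLE Users (
    user_id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    name      VARCHAR(64)  NOT NULL,
    pass_hash CHAR(60)     NOT NULL
);
CREATE TABLE User_Stats (
    stat_id      INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    user_id      INT UNSIGNED NOT NULL,
    period_start DATE NOT NULL,                   -- one row per user per day/week/month
    money_spent  DECIMAL(10,2) NOT NULL DEFAULT 0,
    FOREIGN KEY (user_id) REFERENCES Users (user_id)
);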
If you have nothing better to do with your afternoon than to read database textbooks ... ;-) ... the formal name of this general topic is: "normal forms." ("First," "Second," "Third," and so on.)

Paginate via PHP vs MySQL best practice? [closed]

When dealing with tables of at least 1 million rows, in terms of performance, is it better to:
Select the whole table, e.g. SELECT * FROM tbl, then paginate the result in PHP using array_chunk() or array_slice(),
or
Select only part of the records per page, e.g. SELECT * FROM tbl LIMIT x?
I think it depends. You can keep the whole response in memory using memcache if your table is not too big, which avoids disk requests that are more time-consuming; but since you don't know whether your users will look at many pages, it is better to limit it with SQL.
It depends.
Does data change often in this table?
Yes -> you need to query the DB.
Is the database big, and does it change often?
Then use some kind of search engine like Elasticsearch, and don't query the DB; just populate the search engine.
Is the database small, but queries take a long time?
Use some kind of cache like Redis/memcache.
It really depends on your needs.
The best method will depend on your context. If you choose to use the database directly, beware of this issue:
The naive LIMIT method will give you problems when you get into later pages. ORDER BY some_key LIMIT offset, page_size works like this: go through the key, throw away the first offset records, then return page_size records. So offset + page_size records are examined; if offset is high, you have a problem.
Better - remember the last key value of the current page. When fetching next page use it like this:
SELECT * FROM tbl WHERE the_key > $last_key ORDER BY the_key ASC LIMIT $page_size
If your key is not unique, make it unique by adding an extra unique ID column at the end.
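Concretely, assuming a page size of 25 and a unique, indexed column the_key (both illustrative), the two steps look like this:
-- first page:
SELECT * FROM tbl ORDER BY the_key ASC LIMIT 25;
-- next page, where 1337 is the last the_key value seen on the previous page:
SELECT * FROM tbl WHERE the_key > 1337 ORDER BY the_key ASC LIMIT 25;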
It REALLY depends on context.
In general you want to make heavy use of indexes to select the content you want out of a large dataset with fast results. It can also be faster to paginate in the programming language than in the database; the database is often the bottleneck. We had to do it this way for an application that had hundreds of queries a minute. Hits to the database needed to be capped, so we returned datasets that we knew would probably not need another query to the DB, around 100 results, and then paginated by 25 in the application.
In general: index, and narrow your results with those indexes. If performance is key with lots of activity on the DB, tune your DB and your code to decrease I/O and DB hits by paginating in the application. You'll know why when your server is bleeding with a load of 12 and your I/O is showing 20% utilization. You'll need to hit the operating table, stat!
It is better to use LIMIT. Think about it: the first approach fetches everything, even if you have 1,000,000 rows, whereas LIMIT only fetches your set number of rows each time.
You will then want to make sure you have your offsets set correctly to get the next set of items from the table.
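For example (page number and page size are illustrative), the usual offset arithmetic is:
-- page 3 with 25 rows per page: offset = (3 - 1) * 25 = 50
SELECT * FROM tbl ORDER BY id LIMIT 25 OFFSET 50;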

How to Handle a great number of rows with SQL Queries and take only small amount of data efficiently? [closed]

I'm coding a site in PHP, and the site will contain a lot of messages (like 100,000, 200,000, or more) which users will post. The problem is, messages are stored in a table called 'site_messages' and keyed by their ID. This means the messages aren't grouped by their poster; they're ordered by their ID. If I want to fetch the messages posted by user 'foo', I have to scan a lot of rows, and I think it will get really slow. Or if I want to fetch the messages by post subject (yes, the table will contain a post subject column too, and maybe more columns), I must scan the whole table again, which is even less efficient. Are there any speedy solutions for this? I'm using PHP and MySQL (and phpMyAdmin).
Edit: For example, my table would look like this:
MessageID: 1
MessageContent(Varchar, this is the message that user posts): Hi I like this site. Bye!
MessagePoster(Varchar): crazyuser
MessagePostDate: 12/12/09
MessagePostedIn(Varchar, this is the post subject): How to make a pizza
MessageID: 2
MessageContent(Varchar): This site reallllly sucks.
MessagePoster(Varchar): top_lel
MessagePostDate: 12/12/09
MessagePostedIn(Varchar): Hello, I have a question!
MessageID: 3
MessageContent(Varchar): Who is the admin of this site?
MessagePoster(Varchar): creepy2000
MessagePostDate: 1/13/10
MessagePostedIn(Varchar): This site is boring.
etc...
This is what DBs (especially relational DBs) were built for! MySQL and other DBs use things like indexes to help you get access to the rows you need in the most efficient way. You will be able to write queries like select * from site_messages where subject like "News%" order by entryDateTime desc limit 10 to find the latest ten messages starting with "News", or select * from site_messages, user where user.userid='foo' and site_messages.fk_user=user.id to find all posts for a certain user, and you'll find they perform pretty well. For these, you'd probably have (amongst others) an index for the subject column and an index on the fk_user column.
Work on having a good table structure (data model). Of course if you have issues you can research DB performance and the topic of explain plans to help.
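A sketch of those two indexes, assuming the table and column names used in the example queries above:
CREATE INDEX idx_subject ON site_messages (subject);
CREATE INDEX idx_fk_user ON site_messages (fk_user);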
Yes, for each set of columns you want, you will query the table again. Think of a query as a set of rows. Avoid sending large numbers of rows over connections. As the other commenters have suggested, we can't help much more without more details about your tables.
Two candidates for indexing that jump right out are (Poster, PostDate) and (PostDate, Poster) to help queries in the form:
select ...
from ...
where Poster = #PID and PostDate > #Yesterday;
and
select Poster, count(*) as Postings, ...
from ...
where PostDate > #Yesterday
group by Poster;
and
select Poster, ...
from ...
where PostDate between #DayBeforeYesterday and #Yesterday;
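In MySQL DDL, those two candidate indexes might look like this (using the column names from the question's example table):
CREATE INDEX idx_poster_date ON site_messages (MessagePoster, MessagePostDate);
CREATE INDEX idx_date_poster ON site_messages (MessagePostDate, MessagePoster);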
Just keep in mind that indexing improves queries at the expense of the DML operations (insert, update, delete). If the query/DML ratio is very low, you just may want to live with the slower queries.

Database query for timeline of a social networking site [closed]

I am building a social networking site and notice board system for my college.
I want to show the 10 to 15 most recent posts from the friends of the logged-in user on the home page.
If I select all the rows of the timeline table from the database, it will consume more memory and also fetch posts which are not from friends of the logged-in user.
So what would be the best possible way to do this?
Help me.
The "friends" table stores UserID, FriendID, and status.
The "timeline" table stores the posts of all the users, with PostID and UserID.
I'm using Apache, PHP and MySQL.
Thanks...
You'll want to look into SQL joins pretty carefully before you go build a social network. I'm not gonna lie. This is a very basic question that you should be able to answer on your own (without Stack Overflow) if you're planning on building a social network of any kind.
That said, here are two queries that would work. And you might try experimenting with both and reading up on the difference between the two, as well as the performance implications of using one vs the other. I should also mention, these will give you the PostIDs you need, but you'll need to join those with a posts table (presumably) to get the actual post content. I leave that step up to you.
Also, in terms of getting the 10-15 most recent posts, you need to ask yourself a question... can you even answer that question with the tables you've listed?
Hint: You can't. Question: Why not?
Query Option 1 (gets all the post ids belonging to friends of friends.UserId = ???):
SELECT b.postid, b.userid
FROM friends AS a
INNER JOIN timeline AS b
ON a.FriendID = b.userid
WHERE a.UserID = ???
Query Option 2 (gets all the post ids belonging to friends of friends.UserId = ???):
SELECT b.postid, b.userid
FROM timeline AS b
WHERE EXISTS (SELECT * FROM friends
WHERE friends.UserID = ??? AND b.UserID = friends.FriendID)
Now, a question for you: Do both queries above return the same result? Is one preferable to the other? If so, is it always preferable or does it depend on the circumstance? Can you give me an example?
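For what it's worth, the answer to the hint above is that neither table stores a post date, so "most recent" can't be determined from them. Assuming a PostDate column were added to timeline (a hypothetical column, not in the question's schema), the first query could be extended like this:
SELECT b.postid, b.userid
FROM friends AS a
INNER JOIN timeline AS b
ON a.FriendID = b.userid
WHERE a.UserID = ???
ORDER BY b.PostDate DESC
LIMIT 15;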
Add a timestamp to specify the posts that you want to show.
If you are building a social network website, NEVER select all rows (you'll face security problems, injections, and so forth if you're putting it online).
If you are mixing PHP and HTML together, do an IF-ELSE statement: IF they are friends, display the posts of the friends that are in your friends table with the value 1.
Hope this helps; I've also built a social network site for a school project.

Looking for recommended methods for storing/cacheing counts [closed]

I'm building a website using php/mysql where there will be Posts and Comments.
Posts need to show the number of comments they have. I have a count_comments column in the Posts table and update it every time a comment is created or deleted.
Someone recently advised me that denormalizing this way is a bad idea and that I should be using caching instead.
My take is: You are doing the right thing. Here is why:
See the field count_comments as not being part of your data model - this is easily provable, you can delete all contents of this field and it is trivial to recreate it.
Instead see it as a cache, the storage of which is just co-located with the post - perfectly smart, as you get it for free whenever you have to query for the post(s)
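That "trivial to recreate" property can be made concrete in one statement; the table and column names here (posts.id, comments.post_id) are assumptions for the sketch:
UPDATE posts AS p
SET p.count_comments = (
    SELECT COUNT(*) FROM comments AS c WHERE c.post_id = p.id
);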
I do not think this is a bad approach.
One thing I do recognize is that it's very easy to introduce side effects as the code base expands with this more rigid approach. The nice part is that at some point the number of rows in the database will have to be calculated or kept track of; there is not really a way of getting out of this.
I would not advise against this. There are other solutions to getting comment counts. Check out Which is fastest? SELECT SQL_CALC_FOUND_ROWS FROM `table`, or SELECT COUNT(*)
That solution is slower on selects, but requires less code to keep track of the comment count.
I will say that your approach avoids LIMIT DE-optimization, which is a plus.
This is an optimization that is almost never needed for two reasons:
1) Proper indexing will make simple counts extremely fast. Ensure that your comments.post_id column has an index (see the sketch after point 2).
2) By the time you need to cache this value, you will need to cache much more. If your site has so many posts, comments, users and traffic that you need to cache the comments total, then you will almost definitely need to be employing caching strategies for much of your data/output (saving built pages to static, memcache, etc.). Those strategies will, no doubt, encompass your comments total, making the table field approach moot.
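A sketch of point 1, assuming a comments table with a post_id column:
ALTER TABLE comments ADD INDEX idx_post_id (post_id);
-- with the index in place this is an index range scan, not a full table scan:
SELECT COUNT(*) FROM comments WHERE post_id = 123;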
I have no idea what was meant by "caching," and I'll be interested in other answers than the one I have to offer:
Removing redundant information from your database is important, and, in a "believer way" (meaning I didn't really test it; it's merely speculative), I think that using the database's COUNT() function is a better way to go.
Assuming all your comments have a post_id, all you need is something like:
SELECT COUNT(*) FROM comments WHERE post_id = {post_id_variation_here}
That way, you remove one constant write that happens just to keep track of how many comments there are, and you increase performance.
Unless you have hundreds or thousands of hits per second on your application, there's nothing wrong with using a SQL statement like this:
select posts_field1, ..., (select count(*) from comments where comments_parent = posts_id) as commentNumber from posts
You can cache the HTML output of your page anyway; then no database query has to be done at all.
Maybe you could relate the post and comment tables to each other and count the comment rows with PHP's mysql_num_rows() function. Like so:
Post table
postid*
postcontent
Comment table
commentid
postid*
comment
And then count the comments in PHP like:
$link = mysql_connect("localhost", "mysql_user", "mysql_password");
mysql_select_db("database", $link);
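// Note: this fetches every matching row just to count them;
// SELECT COUNT(*) would let MySQL do the counting instead.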
$result = mysql_query("SELECT * FROM commenttable WHERE postid = '1'", $link);
$num_rows = mysql_num_rows($result);
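As a side note, the mysql_* functions were removed in PHP 7. A roughly equivalent sketch using PDO and COUNT(*), reusing the credentials and table/column names from the example above:
$pdo = new PDO("mysql:host=localhost;dbname=database", "mysql_user", "mysql_password");
$stmt = $pdo->prepare("SELECT COUNT(*) FROM commenttable WHERE postid = ?");
$stmt->execute([1]);
$num_rows = (int) $stmt->fetchColumn();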
