News Feed Database Design Efficiency


Greetings All, I've seen similar questions asked before with no conclusive or tested answers.
I'm designing a News Feed system using PHP/MySQL, similar to Facebook's. Since this table could grow to be quite large, any inefficiency could result in a significant bottleneck.
Example Notifications:
(Items in bold are linked objects)
User_A and USER_B commented on User_C's new album.
User_A added a new Vehicle to [his/her] garage.
Initially, I implemented this using an excessive number of columns: Obj1:Type1 | Obj2:Type2 | etc.
It works, but I fear it's not nearly scalable enough, so now I'm looking at object serialization.
So, for example my new database is set up like so:
News_ID | User_ID | News_Desc | Timestamp
2643 | 904 | {User904} and {User890} commented on {User222}'s new {Album724}. | SomeTimestamp
Anything inside braces represents data that would be serialized using JSON.
Is this a smart (efficient / scalable) way to move forward?
Will it be difficult to separate the serialized data from the rest of the string using regular expressions?

What happens if User890 deletes his/her comment? I think you need to be more atomic - possibly storing the type of action (comment) with the actioner (User890), then generate the actual story on the fly, with heavy caching. This would also help the issue of translation, if you extend your site to several markets/audiences.
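A minimal sketch of that "atomic" idea, written in Python with SQLite standing in for PHP/MySQL so it runs on its own. The table name `news_events`, its columns, and the `TEMPLATES` dictionary are assumptions made for illustration, not something from the question. Each action is stored as its own row, and the sentence is generated at read time:

```python
import sqlite3

# Hypothetical schema: one row per atomic action, no pre-rendered text.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE news_events (
           news_id     INTEGER PRIMARY KEY,
           actor_id    INTEGER NOT NULL,   -- who did it (e.g. User890)
           action      TEXT    NOT NULL,   -- e.g. 'comment'
           object_type TEXT    NOT NULL,   -- e.g. 'album'
           object_id   INTEGER NOT NULL,
           created_at  INTEGER NOT NULL
       )"""
)
conn.executemany(
    "INSERT INTO news_events (actor_id, action, object_type, object_id, created_at)"
    " VALUES (?, ?, ?, ?, ?)",
    [(904, "comment", "album", 724, 1000), (890, "comment", "album", 724, 1001)],
)

# Templates live in code (or a translation catalogue), not in the table.
TEMPLATES = {("comment", "album"): "{actors} commented on album {object_id}."}


def render_story(conn, action, object_type, object_id):
    """Build the story text from whichever event rows still exist."""
    rows = conn.execute(
        "SELECT actor_id FROM news_events"
        " WHERE action = ? AND object_type = ? AND object_id = ?"
        " ORDER BY created_at",
        (action, object_type, object_id),
    ).fetchall()
    if not rows:
        return None
    actors = " and ".join(f"User{actor_id}" for (actor_id,) in rows)
    return TEMPLATES[(action, object_type)].format(actors=actors, object_id=object_id)


story_before = render_story(conn, "comment", "album", 724)

# User890 deletes the comment: remove that one row and re-render.
conn.execute("DELETE FROM news_events WHERE actor_id = 890 AND object_id = 724")
story_after = render_story(conn, "comment", "album", 724)
```

Because nothing is pre-rendered, deleting a comment only removes a row, and swapping the template is enough to produce the story in another language. The rendered strings are what you would cache.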

Related

PHP & MySQL performance - One big query vs. multiple small

For a MySQL database I am using the InnoDB engine, and the structure of my tables looks like this:
Table user
id | username | etc...
----|------------|--------
1 | bruce | ...
2 | clark | ...
3 | tony | ...
Table user-emails
id | person_id | email
----|-------------|---------
1 | 1 | bruce@wayne-ent.com
2 | 1 | ceo@wayne-ent.com
3 | 2 | clark.k@daily-planet.com
To fetch data from the database I've written a tiny framework. For example, __construct($id) checks whether there is a person with the given id; if so, it creates the corresponding model and saves only the id field to an array. During runtime, if I need another field from the model, it fetches only that value from the database, saves it to the array, and returns it. The same goes for the emails field: my code accesses the user-emails table and gets all the emails for the corresponding user.
For small models this works all right, but now I am working on another project where I have to fetch a lot of data at once for a list, and that takes some time. I also know that many connections to MySQL and many queries are quite stressful for the server.
My question now is: Should I fetch all data at once (with left joins etc.) while constructing the model and save the fields as an array or should I use some other method?

Why do people insist on referring to entities and domain objects as "models"?
Unless your entities are extremely large, I would populate the entire entity, when you need it. And, if "email list" is part of that entity, I would populate that too.
As I see it, the question is more related to "what to do with tables, that are related by foreign keys".
Let's say you have Users and Articles tables, where each article has a specific owner associated by a user_id foreign key. In this case, when populating the Article entity, I would only retrieve the user_id value instead of pulling in all the information about the user.
But in your example with Users and UserEmails, the emails seem to be a part of the User entity, and something that you would often call via $user->getEmailList().
TL;DR
I would do this in two queries when populating the User entity:
1. Select all you need from the Users table and apply it to the User entity.
2. Select all of the user's emails from the UserEmails table and apply them to the User entity.
P.S
You might want to look at data mapper pattern for "how" part.
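The two-query approach could look roughly like this. It is a sketch in Python with SQLite rather than PHP/MySQL, and the `User` class, the `load_user` function, the `user_emails` table name, and the example.com addresses are all made up for illustration:

```python
import sqlite3
from dataclasses import dataclass, field

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE user_emails (id INTEGER PRIMARY KEY, person_id INTEGER, email TEXT);
    INSERT INTO user VALUES (1, 'bruce'), (2, 'clark'), (3, 'tony');
    INSERT INTO user_emails VALUES
        (1, 1, 'bruce@example.com'),
        (2, 1, 'ceo@example.com'),
        (3, 2, 'clark.k@example.com');
    """
)


@dataclass
class User:
    id: int
    username: str
    emails: list = field(default_factory=list)


def load_user(conn, user_id):
    """Populate a whole User entity with exactly two queries."""
    # Query 1: every scalar field of the user in one round trip.
    row = conn.execute(
        "SELECT id, username FROM user WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return None
    user = User(id=row[0], username=row[1])
    # Query 2: the email list that belongs to the entity.
    user.emails = [
        email
        for (email,) in conn.execute(
            "SELECT email FROM user_emails WHERE person_id = ? ORDER BY id",
            (user_id,),
        )
    ]
    return user


bruce = load_user(conn, 1)
```

The entity is complete after the second query, so later calls such as a `getEmailList()` equivalent never go back to the database.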

In my opinion you should fetch all your fields at once, and divide queries in a way that makes your code easier to read/manage.
When we're talking about one query or two, the difference is usually negligible unless the combined query (with JOINs or whatever) is overly complex. Usually an index or two is the solution to a very slow query.
If we're talking about one vs hundreds or thousands of queries, that's when the connection/transmission overhead becomes more significant, and reducing the number of queries can make an impact.
It seems that your framework suffers from premature optimization. You are hyper-concerned about fetching too many fields from a row, but why? Do you have thousands of columns or something?
The time consuming part of your query is almost always the lookup, not the transmission of data. You are causing the database to do the "hard" part over and over again as you pull one field at a time.

How to approach multi-million data selection

I have a table that stores specific updates for all customers.
Some sample table:
record_id | customer_id | unit_id | time_stamp | data1 | data2 | data3 | data4 | more
When I created the application, I did not realize how much this table would grow. Currently I have over 10 million records within 1 month. I am facing issues where PHP stops executing because of how long the queries take. Some queries produce top-1 results based on time_stamp + customer_id + unit_id.
How would you suggest handling this type of issue? For example, I could create a new table for each customer, although I do not think that is a good solution.
I am stuck with no good solution in mind.

If you're on the cloud (where you're charged for moving data between server and db), ignore this answer.
Move all logic to the server
The fastest query is a SELECT with a WHERE on the PRIMARY key. It won't matter much how large your table is; it will come back nearly as fast as it would from a table of 1 row (as long as your hardware isn't unbalanced).
I can't tell exactly what you're doing with your query, but first download all of the sorting and limiting data into PHP. Once you've got what you need, SELECT the data directly WHEREing on record_id (I assume that's your PRIMARY).
It looks like your on demand data is pretty computationally intensive and huge, so I recommend using a faster language. http://blog.famzah.net/2010/07/01/cpp-vs-python-vs-perl-vs-php-performance-benchmark/
Also, when you start sorting and limiting on the server rather than the db, you can start identifying shortcuts to speed it up even further.
This is what the server's for.

I suggest you use partitioning of your data following some criteria.
You can make horizontal or vertical partition of your data.
For example, group your customers into 10 partitions using customer_id modulo 10.
So a customer_id ending in 0 goes to partition 0, one ending in 1 goes to partition 1, and so on.
MySQL can do this for you easily.
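To make the modulo rule concrete, here is a small Python sketch of the routing logic. The function names are invented for illustration. In MySQL itself the declarative equivalent would be a `PARTITION BY HASH(customer_id) PARTITIONS 10` clause on the table, in which case the server does this routing for you.

```python
NUM_PARTITIONS = 10  # taken from the "10 partitions" example above


def partition_for(customer_id: int) -> int:
    """Return the partition number: the last decimal digit of the id."""
    return customer_id % NUM_PARTITIONS


def group_by_partition(customer_ids):
    """Bucket a list of customer ids the way the partitioning would."""
    buckets = {p: [] for p in range(NUM_PARTITIONS)}
    for customer_id in customer_ids:
        buckets[partition_for(customer_id)].append(customer_id)
    return buckets


buckets = group_by_partition([10, 21, 31, 45, 1230])
```

A query that filters on a single customer_id then only has to touch one of the ten buckets.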

What is the count of records within the tables? Often, with relational databases, it's not how much data you have (millions are nothing to relational databases), it's how you're retrieving it.
From the look of your select, in fact, you probably just need to optimize the statement itself and avoid the multiple subselects, which are probably the main cause of the slowdown. Try running EXPLAIN on that statement, or just get the ids and run the interior select individually on the ids of the records that you've actually found and retrieved in the first run.
Just the fact that you have those subselects within your overall statement means that you haven't optimized that far into the process anyway. For example, you could be running a nightly or hourly cron job that aggregates into a new table the sets like the one created by SELECT gps_unit.idgps_unit, and then you can run your selects against a previously generated table instead of creating blocks of data that are equivalent of a table on the fly.
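As a rough illustration of the pre-aggregation idea, the sketch below keeps a small summary table with the newest row per (customer_id, unit_id) and rebuilds it from a function that a cron job could call. It uses Python and SQLite so it runs standalone. The table names `updates` and `latest_update` are assumptions, and it assumes time_stamp is unique within each (customer_id, unit_id) pair.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE updates (
        record_id   INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        unit_id     INTEGER NOT NULL,
        time_stamp  INTEGER NOT NULL,
        data1       TEXT
    );
    -- Hypothetical summary table: one row per (customer, unit).
    CREATE TABLE latest_update (
        customer_id INTEGER NOT NULL,
        unit_id     INTEGER NOT NULL,
        record_id   INTEGER NOT NULL,
        time_stamp  INTEGER NOT NULL,
        PRIMARY KEY (customer_id, unit_id)
    );
    """
)
conn.executemany(
    "INSERT INTO updates (customer_id, unit_id, time_stamp, data1) VALUES (?, ?, ?, ?)",
    [(1, 10, 100, "a"), (1, 10, 200, "b"), (1, 11, 150, "c"), (2, 10, 50, "d")],
)


def refresh_latest(conn):
    """Rebuild the summary table; meant to be run from a scheduled job."""
    with conn:  # delete and re-insert are committed together
        conn.execute("DELETE FROM latest_update")
        # Assumes time_stamp is unique per (customer_id, unit_id);
        # ties would violate the summary table's primary key.
        conn.execute(
            """INSERT INTO latest_update (customer_id, unit_id, record_id, time_stamp)
               SELECT u.customer_id, u.unit_id, u.record_id, u.time_stamp
               FROM updates AS u
               JOIN (SELECT customer_id, unit_id, MAX(time_stamp) AS max_ts
                     FROM updates
                     GROUP BY customer_id, unit_id) AS m
                 ON m.customer_id = u.customer_id
                AND m.unit_id = u.unit_id
                AND m.max_ts = u.time_stamp"""
        )


refresh_latest(conn)

# The "top-1" lookup is now a primary-key read on a small table.
latest = conn.execute(
    "SELECT record_id, time_stamp FROM latest_update"
    " WHERE customer_id = ? AND unit_id = ?",
    (1, 10),
).fetchone()
```

The expensive grouping runs once per refresh instead of once per page view. The price is that results can be as stale as the refresh interval.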
If you find yourself unable to effectively optimize that select statement, you have "final" options like:
Categorize via some criteria and split into different tables.
Keep a deep archive, such that anything past the first year or so is migrated to a less used table and requires special retrieval.
Finally, if you have so much small data, you may be able to completely archive certain tables and keep them around in file form only and then truncate past a certain date. Often with web tracking data that isn't that important and is kinda spammy, I end up doing this after a few years, when the data is really not going to do anyone any good any more.

MySQL optimization: more entries vs complex queries

I want to improve the speed of a notification board. It retrieves data from the event table.
At this moment the events MySQL table looks like this
id | event_type | who_added_id | date
In the event table I store one row with information about a particular event. Each time user A asks for new notifications, the query runs through the table and checks whether the notifications added by user B are relevant to A (they have to be friends, be members of the same groups, or have previously chatted).
The events table has become big, and because of the bulky query the page loads slowly.
I'm thinking of changing this design entirely: instead of adding one event row and then checking whether the event is relevant to each user, I would add as many rows as there are interested users. I would change the events table structure as follows:
id | event_type | who_added_id | forwho_id | date
Now, if user B creates an event that interests 50 other members, I create 50 rows with the same information, and the 'forwho_id' field holds the member who must get this notification.
I think the query will become much simpler and it will take less time to search through the table.
What do you think:
1. Is this a good approach to storing this kind of data, or should we avoid duplicate data at any cost?
2. How do you think the events table will behave if the number of interested users is not 50 but hundreds?
Thank you for reading this and I hope I made myself understandable.

Duplicated data is not "bad", and it's not to be "avoided at all cost".
What is "bad" is uncontrolled redundancy, and the kind of problems that come up when the logical data model isn't third normal form. It is acceptable and expected that an implementation will deviate from a logical data model, and introduce redundancy for performance.
Your revised design looks appropriate for your needs.
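For what it's worth, the revised "one row per interested user" design can be sketched like this (Python with SQLite as a stand-in for PHP/MySQL). The function names and the index are assumptions. How the list of interested users is computed is left out, since that is the friendship/group logic the question already has.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (
        id           INTEGER PRIMARY KEY,
        event_type   TEXT    NOT NULL,
        who_added_id INTEGER NOT NULL,
        forwho_id    INTEGER NOT NULL,
        date         INTEGER NOT NULL
    );
    -- The read path filters on forwho_id, so index it together with date.
    CREATE INDEX idx_events_forwho_date ON events (forwho_id, date);
    """
)


def publish_event(conn, event_type, who_added_id, interested_user_ids, date):
    """Write path: insert one row per interested user in a single transaction."""
    with conn:
        conn.executemany(
            "INSERT INTO events (event_type, who_added_id, forwho_id, date)"
            " VALUES (?, ?, ?, ?)",
            [(event_type, who_added_id, uid, date) for uid in interested_user_ids],
        )


def notifications_for(conn, user_id, limit=20):
    """Read path: a single lookup by recipient, no friendship or group joins."""
    return conn.execute(
        "SELECT event_type, who_added_id, date FROM events"
        " WHERE forwho_id = ? ORDER BY date DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()


# User 2 creates an event that interests 50 other members (ids 100..149).
publish_event(conn, "photo", 2, range(100, 150), 1700000000)
board = notifications_for(conn, 100)
```

The cost moves from read time to write time: hundreds of recipients mean hundreds of small inserts per event, but each notification board query stays a simple indexed lookup.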

Creating an efficient friendlist using PHP

I would like to build a website that has some elements of a social network.
So I have been trying to think of an efficient way to store a friend list (somewhat like Facebook).
And after searching a bit the only suggestion I have come across is making a "table" with two "ids" indicating a friendship.
That might work in small websites but it doesn't seem efficient one bit.
I have a background in Java but I am not proficient enough with PHP.
An idea has crossed my mind which I think could work pretty well, problem is I am not sure how to implement it.
The idea is to have all the "id"s of your friends saved in a tree data structure, where each node in that tree represents one digit of a friend's id.
It starts with one node, and more nodes are added as the user adds friends.
(A bit like Lempel–Ziv).
Every node can point to 11 other nodes: 0 to 9 and X.
"X" marks the end of the id.
For example, see this tree:
[Figure: An Example]
In this tree the user has 4 friends with the following "id"s:
0
143
1436
15
Update: as it might have been unclear before, the idea is that every user will have a tree in a form of multidimensional array in which the existence of the pointers themselves indicate the friend's "id".
If every user had such a multidimensional array, checking whether id "y" is a friend of mine, deleting id "y" from my friend list, or adding id "y" to my friend list would all require constant time O(1) with respect to the number of users the website might have. The only drawback is that taking such a huge array, serializing it, and pushing it into each row of the table just doesn't seem right.
-Is this even possible to implement?
-Would using serializing to insert that tree into a table be practical?
-Is there any better way of doing this?
The benefit for which I chose this is that even with a really large number of ids (millions or billions), the search, add, and delete time is linear in the number of digits of the id.
I'd greatly appreciate any help with implementing this or any suggestions for alternative ways to improve or change this method.

I would strongly advise against this.
Storage savings are not significant, and storage may (probably?) be worse. In a real dataset, the actual space savings afforded to you by this approach are minimal. Computing the average savings is a very difficult problem, but use some real numbers and try a few samples with random IDs. If you have a million users, consider a user with 15 friends. How much data do you save with this approach? You may actually use more space, since tree adjacency models can require significant data.
"Rendering" a list of users requires CPU investment.
Inserts are non-deterministic and non-trivial. When you add a new user to an existing tree, you will have a variety of methods of inserting them. Assuming you don't choose arbitrarily, it is difficult to compute which approach is the best (and would only be based on heuristics).
These are the big ones that came to my mind. But generally, I think you are over-thinking this.

You should check out OQGRAPH, the Open Query graph storage engine. It is designed to handle efficient tree and graph storage for MySQL.
You can also check out my presentation Models for Hierarchical Data with SQL and PHP, or my answer to What is the most efficient/elegant way to parse a flat table into a tree? here on Stack Overflow.
I describe a design I call Closure Table, which records all paths between ancestors and descendants in a hierarchy.

You say 'using PHP' in the title, but this seems to be just a database question at its heart. And believe it or not the linking table is by far the best way to go. Especially if you have millions or billions of users. It would be faster to process, easier to handle in the PHP code and smaller to store.
Update
Users table:
id | name | moreInfo
1 | Joe | stuff
2 | Bob | stuff
3 | Katie | stuff
4 | Harold | stuff
Friendship table:
left | right
1 | 4
1 | 2
3 | 1
3 | 4
In this example Joe knows everyone and Katie knows Harold.
This is of course a simplified example.
I'd love to hear if someone has a better logic to the left and right and an explanation as to why.
Update
I gave some PHP code in a comment below but it was marked up wrong, so here it is again.
$sqlcmd = sprintf( 'SELECT IF( `left` = %1$d, `right`, `left`) AS "friend" FROM `friendship` WHERE `left` = %1$d OR `right` = %1$d', $userid);
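Here is a runnable sketch of the same linking-table lookup, using Python and SQLite instead of PHP/MySQL and the sample rows from the Friendship table above. The `friends_of` function and the extra index are assumptions. The column names are quoted because `left` and `right` are SQL keywords, and `CASE` plays the role of MySQL's `IF()`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE friendship (
        "left"  INTEGER NOT NULL,
        "right" INTEGER NOT NULL,
        PRIMARY KEY ("left", "right")
    );
    -- The primary key covers lookups by "left"; this covers lookups by "right".
    CREATE INDEX idx_friendship_right ON friendship ("right");
    INSERT INTO friendship VALUES (1, 4), (1, 2), (3, 1), (3, 4);
    """
)


def friends_of(conn, user_id):
    """Return the ids on the other side of every row that mentions user_id."""
    rows = conn.execute(
        'SELECT CASE WHEN "left" = :uid THEN "right" ELSE "left" END AS friend'
        ' FROM friendship WHERE "left" = :uid OR "right" = :uid'
        " ORDER BY friend",
        {"uid": user_id},
    ).fetchall()
    return [friend for (friend,) in rows]


joe_friends = friends_of(conn, 1)
katie_friends = friends_of(conn, 3)
```

With both columns indexed, the lookup cost depends on how many friends one user has, not on how many users the site has.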

Few ideas:
ordered lists - searching through ordered list is fast, though ordering itself might be heavier;
horizontal partitioning data;
getting rid of premature optimizations.

Database structure and querying hierarchal data and trees of data [duplicate]

Possible Duplicate:
What is the most efficient/elegant way to parse a flat table into a tree?
This I am finding rather tricky and would like some opinions on the matter.
I am trying to store hierarchical (tree-like) data with an unknown number of levels and branches. I want to be able to add new nodes and delete any node at any time.
I need to be able to query, from any node in the hierarchy, for all of the children's ids in one go, and efficiently, because of a large user base.
Let's take a hypothetical example of a website where families socialise and update their status like on Facebook. At any time you can be viewing a family member's "Wall", which will also include all of the recent status updates from the people below them in the hierarchy, in chronological order.
Obviously, fetching the posts is easy enough in a loop once you have the array of ids of the family members who are children of this family member's node.
Let's take an example of a simple table structure:
id | parentId | name
________________________
1 | NULL | John
2 | 1 | Peter
3 | 1 | Bob
4 | 3 | Emma
5 | 2 | Sam
6 | 4 | Gill
etc.... You get the idea.
I need to be able to do the above with something like this unless you think the structure needs to be adapted.
I have read up on the MySQL nested set model.
This seems very fiddly and could be unreliable: if something did not update correctly, it would mess everything up.
I am used to PHP and MySQL but have been reading a bit about Cassandra and Thrift. Not sure if this would be easier?

There are already good approaches out there which are more simple than the solution you propose.
Here are a couple of links which explain how to do it (we use this ourselves for much the same problem you describe and it works well).
Managing Hierarchical Data in MySQL (from MySQL)
Storing Hierarchical Data in a Database (from Sitepoint, but a clearer explanation, I think)
This makes inserting/updating more complex, but selecting portions of the tree structure far faster (with only one query). It allows finding all children of any given node in one query, and finding all the ancestors of a given node with one query.

So I think I have come up with an idea.
The reason I am against the nested set model is because it seems like it is still not the best way and is not going to be the ideal performance solution.
I am going to cover a proposed solution I have been thinking about.
The concept means creating a hierarchical map table to keep track of all the relationships between each family member/node.
The way it would work is:
Using map table structure of this:
id | fMemberId | parentid
=====================================
1 | 3 | 2
2 | 4 | 3
3 | 4 | 2
1) As a new family member is created as a child of a parent, we would take the parent's id and create a new row in our family members table with the parent id set, for future additional uses and functionality.
2) As this row is created, we will create new rows in the map table with all of the parent ids for the new family member.
A quick way to do this would be to take the parent id of the new family member and query the map table for all the rows whose family member id equals that parent id. We would then store an array in PHP of the resulting parent ids, to be stored alongside the new family member's id in the map table. This would require only one SQL query to grab all the parent ids for adding them, rather than a number of queries based on the number of nodes.
This would mean when we are viewing a family members feed of posts we would be able to query the db for simply the rows in the map table to get all the children id's of the current family member and subsequently query other tables for the post data.
The main trade off being the amount of potential storage required for this kind of system.
However, I believe reading would be quicker, as there are no conditional SQL statements, and writing to the db in this way may be just as quick.
We could overcome this by using InnoDB's cluster id's assigning an initial family id index and creating a new table with the "next family members id" based on the family id.
Also reliability, if a row wasn't written it would be easy enough to add it in. It prevents having to continually edit rows just to create a member.
What are your thoughts on this?
So far this seems to be a good way in my opinion. Took a lot of thinking to get to here. I also believe it could maybe be improved with time and being able to store arrays of id's per member rather than all of them. Still trying to work that one out!

Yes, your solution is called a transitive closure. I have written about it before:
What is the most efficient/elegant way to parse a flat table into a tree?
Models for Hierarchical Data
You also need the zero-length paths, e.g. 2-2, 3-3, 4-4.
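A small sketch of why the zero-length paths matter, again in Python with SQLite so it can be run directly. The table name `family_paths`, its column names, and the helper functions are assumptions. The sample tree is the one from the question (John, Peter, Bob, Emma, Sam, Gill as ids 1 to 6).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE family_paths (
           ancestor   INTEGER NOT NULL,
           descendant INTEGER NOT NULL,
           PRIMARY KEY (ancestor, descendant)
       )"""
)


def add_member(conn, member_id, parent_id=None):
    """Insert the zero-length path, then one path per ancestor of the new member."""
    with conn:
        conn.execute(
            "INSERT INTO family_paths (ancestor, descendant) VALUES (?, ?)",
            (member_id, member_id),
        )
        if parent_id is not None:
            # Because the row (parent, parent) exists, this single query
            # returns the parent itself as well as the parent's ancestors.
            conn.execute(
                "INSERT INTO family_paths (ancestor, descendant)"
                " SELECT ancestor, ? FROM family_paths WHERE descendant = ?",
                (member_id, parent_id),
            )


def descendants_of(conn, member_id):
    """All members below member_id, at any depth, in one query."""
    rows = conn.execute(
        "SELECT descendant FROM family_paths"
        " WHERE ancestor = ? AND descendant <> ? ORDER BY descendant",
        (member_id, member_id),
    ).fetchall()
    return [d for (d,) in rows]


# (id, parentId) pairs from the example table in the question.
for member_id, parent_id in [(1, None), (2, 1), (3, 1), (4, 3), (5, 2), (6, 4)]:
    add_member(conn, member_id, parent_id)

below_john = descendants_of(conn, 1)
below_bob = descendants_of(conn, 3)
```

Without the self-referencing rows, the insert would need a second statement to add the direct parent link. With them, adding a member is one self-row plus one `INSERT ... SELECT`.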
