MySQL stored procedure vs. multiple selects - php

Here's my scenario:
I've got a table of (let's call them) nodes. Primary key on each one is simply "node_id".
I've got a table maintaining a hierarchy of nodes, with only two columns: parent_node_id and child_node_id.
The hierarchy is maintained in a separate table because nodes can have an N:N relationship. That is to say, one node can have multiple children, and multiple parents.
If I start with a node and want to get all of its ancestors (i.e. everything higher up the hierarchy), I could either do several selects, or do it all in one stored procedure.
Does anyone with practical experience of this know which one is likely to perform better? I've read things online that recommend both ways.
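To make this concrete, here is roughly what each option could look like. This is only a sketch: the hierarchy table name (node_hierarchy), the procedure name, and its body are illustrative assumptions, not the actual code.

    -- Option 1: several selects, one per level, driven from PHP.
    -- Repeat with the newly found ids until no rows come back:
    SELECT parent_node_id
    FROM node_hierarchy
    WHERE child_node_id IN (/* ids collected so far */);

    -- Option 2: a stored procedure that does the same walk server-side.
    DELIMITER //
    CREATE PROCEDURE get_ancestors(IN start_node INT)
    BEGIN
      DECLARE new_rows INT DEFAULT 1;

      DROP TEMPORARY TABLE IF EXISTS ancestors, frontier, next_frontier;
      CREATE TEMPORARY TABLE ancestors     (node_id INT PRIMARY KEY);
      CREATE TEMPORARY TABLE frontier      (node_id INT PRIMARY KEY);
      CREATE TEMPORARY TABLE next_frontier (node_id INT PRIMARY KEY);

      INSERT INTO frontier VALUES (start_node);

      WHILE new_rows > 0 DO
        -- parents of the current frontier
        DELETE FROM next_frontier;
        INSERT IGNORE INTO next_frontier
          SELECT h.parent_node_id
          FROM node_hierarchy h
          JOIN frontier f ON f.node_id = h.child_node_id;

        -- discard nodes already collected (also guards against cycles)
        DELETE nf FROM next_frontier nf
          JOIN ancestors a ON a.node_id = nf.node_id;

        INSERT IGNORE INTO ancestors SELECT node_id FROM next_frontier;

        DELETE FROM frontier;
        INSERT INTO frontier SELECT node_id FROM next_frontier;

        SELECT COUNT(*) INTO new_rows FROM next_frontier;
      END WHILE;

      SELECT node_id FROM ancestors;
    END //
    DELIMITER ;

(Three temporary tables are used because MySQL does not allow a temporary table to be referenced twice in the same statement.)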

"which one is likely to have the best performance? " : No one can know ! The only thing you can do is try both and MEASURE. That's sadly enough the main answer to all performance related questions... except in cases where you clearly have a O(n) difference between algorithms.
And, by the way, "multiple parents" does not make a hierarchy (otherwise I would recommend reading some books by Joe Celko) but a DAG (Directed Acyclic Graph), a much harder beast to tame...

If performance is your concern, then that schema design is not going to work as well for you as others could.
See More Trees & Hierarchies in SQL for more info.

I think a general statement could lead you into problems, because it depends on how your queries and the stored procedure, respectively, make use of the indexes.
To say anything helpful, it would be necessary to compare the SQL of your selects with that of the stored procedure.

Related

Based on database design, is it possible to predict the queries to be used in the application?

Let's say I have this MySQL DB, and all the tables in the DB are related to one another: primary keys, foreign keys, etc. are all set. Now, is it possible to predict, just from the database design, what queries will be used in the application? Since the database dictates the application's capabilities, can we therefore predict from the design what queries the application will use?
If it is possible, is there a strategy or automated way to generate the possible queries?
I have written a book on the subject of analyzing data using SQL and Excel, and have spent many years working with databases.
Yes, from a database structure, you can figure out how tables are going to be joined together. You are not going to figure out the harder -- and generally more business relevant -- things that users need. Here are some examples:
You can have a database where the primary table is telephone calls, with the associated information. From this database, you may need to know the maximum number of active calls at one time. Or you may need to know how many different people someone calls in a month.
You can have a database of subscriber records. You may need to figure out the probability that someone will stop after a given amount of time.
You can have a database of products and purchases. You may need to figure out the most common combinations of three products that occur together.
You can have a database of credit card purchases. You may need to figure out who spends more than $200 in a restaurant more than 50 miles from their billing address.
The point is: a database does not represent "application capabilities". A database represents entities and the relationships between them, presumably in the real world. It is hubris to think that you can look at a database and know what the business questions are.
Instead, the purpose of a database is to support data, which in turn supports applications. The needs of applications will change over time. The beauty of databases, as opposed to many other data storage technologies, is that the technology scales as the data increases, supports changes to the structure, and allows new entities and relationships to be added into the system without completely rewriting it.
Over time, and with experience, you might develop intuition on what's important. Even if you do, you will be constantly surprised at the varied needs of your users.
I am sincerely not trying to be smart here, but the answer is: yes and no.
Yes, because a 3NF design usually outlines the business rules behind it pretty well, so you can to a degree tell what the business logic behind it is; you can create an object or graph model from it and get a good idea of what kinds of questions can be asked based on the connections/relations and accessible properties.
No, because combinatorially you might have an intractable number of combinations of questions from a graph. Hence, you can't really tell what questions one might ask in a reasonable, non-exponential amount of time.
In general, if the design is good and the tables are meaningfully named, you can get a pretty good idea of what is going on.
Theoretically it's possible, but due to the combinatorial explosion of N rows by X columns by Z tables by W possible functions by Q possible values in each column/row, this is an amazingly large number.
The issue here is that you need to take the data into account too. Some queries only make sense when particular data is present, and others don't. So you are essentially considering a massively large hypercube.
I work with multidimensional databases (denormalised cubes), which are essentially denormalised databases. Have a read up on OLAP theory and you'll see why.
So, in short, no: it's practically impossible.
Now is it possible to predict, just from the database design, what the queries will be used for the application?
You can, at least in principle, predict which queries can be answered efficiently. Which queries the applications will actually try to execute is another matter.
In an ideal world, the database model would take into account all the querying needs of all the applications, now and in the future. We don't live in that world yet ;)
If it is possible, is there a strategy or automated way to generate the possible queries?
No, that requires human understanding of what the model actually means. Unfortunately, there is no good way to teach a tool to have that level of understanding.
A good model will immediately make sense to a person experienced in database modeling and in the domain being modeled. Such a person will typically be able to predict a fair portion of the queries actually being used, but rarely all of them, so documentation alongside the database model itself is desirable. And of course, not all models are good...

How to implement a nested comment system?

What would be the ideal way to implement this sort of thing? The idea I have in my head right now is to have a comments table and have each comment carry a thread identifier and a parent comment identifier. The thread identifier would indicate which thread the comment belongs to and would allow for a simple MySQL statement using the WHERE clause. Each comment would have an auto_increment identifier as per usual database design, and the parent identifier column would indicate which comment this comment is a child of.
This type of design would put most of the stress on the PHP side of things, because it would only take one SQL call to get all the comments in a thread. Another implementation I found was having an SQL query for each nesting level. That solution would place the stress on the SQL side of things.
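For reference, here's a sketch of the table I'm describing (the column names are just placeholders):

    CREATE TABLE comments (
      comment_id        INT AUTO_INCREMENT PRIMARY KEY,
      thread_id         INT NOT NULL,   -- which thread the comment belongs to
      parent_comment_id INT NULL,       -- NULL for top-level comments
      body              TEXT NOT NULL,
      INDEX (thread_id)
    );

    -- One call fetches the whole thread; PHP then groups the rows
    -- by parent_comment_id to build the tree in memory:
    SELECT comment_id, parent_comment_id, body
    FROM comments
    WHERE thread_id = 42
    ORDER BY comment_id;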
How would SO implement this? Currently I'm at a loss because I am not sure which solution is the "best" solution, and I am still quite new to database design, PHP, and jQuery.
Thanks.
Look at Managing Hierarchical Data in MySQL, specifically the section called "Nested Set Model". You may have to read through it a few times before it makes sense (I did) but it's worth it. It's a very powerful way to work with nested data and retrieve the parts you want with only one query.
On the downside, for updates you have to do a lot more work.
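For illustration, a subtree fetch under the nested set model looks something like this (a sketch; it assumes the comments table has been given the lft/rgt columns the article describes):

    SELECT child.comment_id, child.body
    FROM comments AS parent
    JOIN comments AS child
      ON child.lft BETWEEN parent.lft AND parent.rgt
    WHERE parent.comment_id = 42   -- root of the subtree you want
    ORDER BY child.lft;            -- depth-first order, ready for display

One self-join returns the whole subtree in display order; that's the payoff for the extra update work.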

Multi-tiered / Hierarchical SQL : How does Reddit do it? Which is the most efficient way? And what databases make it simpler?

I've been reading up a bit on how multi-tiered commenting systems are built:
http://articles.sitepoint.com/article/hierarchical-data-database/2
I understand the two methods talked about in that article. In fact I went down the recursive path myself, and I can see how the "Modified Preorder Tree Traversal" method is very useful as well, but I have a few questions:
How well do these two methods perform in a large environment like Reddit's, where you can have thousands and thousands of multi-tiered comments?
Which method does Reddit use? It simply seems very costly, to me, to have to update thousands of rows if they use the MPTT method. I'm not deluding myself into thinking I am building a system to handle Reddit's traffic; this is simply curiosity.
There's another way of retrieving comments like this ... JOINs via SQL that return the rows with IDs defining their parents. How much slower/faster/better/worse would it be to simply take these unformatted results, loop through them and add them into a formatted array using my language of choice (PHP)?
After reading that sitepoint article, I believe I understand that Oracle offers this functionality in a much simpler, easier to use way, and MySQL does not. Are there any free databases that offer something similar to Oracle?
On a side note, how is SQL pronounced? I'm getting the feeling I've been wrong for the past several years by saying 'sequel' instead of 's - q - l', although "My Sequel" rolls easier off the tongue than "My S Q L"!
MPTT is easier to fetch (a single SQL query), but more expensive to update. Simply delegate the update to a background process (that's what queue managers are for). Also note that most of that update is a single SQL UPDATE command. It might take a long time to process, but a smart RDBMS could make the transaction visible (in cache) to new (read-only) queries before it's committed to disk.
I'd bet Reddit uses MPTT, but not only doing the 'hard' update in the background; quite likely it also does a simple rendering to an in-memory cache. This way, the posting user can see his post immediately, without having to wait for so many rows to be updated. Also, SSDs do help in getting high transaction rates.
That's called the adjacency model (or sometimes an adjacency list). It's the more obvious way to do it, and simpler to update (it doesn't modify existing records), but FAR more inefficient to read. You have to do a recursive walk of the tree, with an SQL query at each node. That's what kills you: the number of small queries.
PostgreSQL has recursive SELECTs, which do in the server what you envision in PHP. It's better than PHP because it's closer to the data; but it still has the same (huge) number of random-access disk seeks.
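For illustration, such a recursive SELECT looks something like this (table and column names are assumptions; this runs on PostgreSQL, and MySQL 8.0+ accepts the same WITH RECURSIVE syntax):

    WITH RECURSIVE thread AS (
        SELECT comment_id, parent_id, body, 0 AS depth
        FROM comments
        WHERE thread_id = 42 AND parent_id IS NULL   -- top-level comments
      UNION ALL
        SELECT c.comment_id, c.parent_id, c.body, t.depth + 1
        FROM comments c
        JOIN thread t ON c.parent_id = t.comment_id  -- children of rows found so far
    )
    SELECT * FROM thread;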
You should have a closer look at the links in the Further Reading section they give at the end. The Four ways to work with hierarchical data article on evolt linked there provides another way to approach this problem (the flat table). Since that approach is extremely easy to implement for a threaded discussion board, I wouldn't be surprised if Reddit uses it (or a variation on the theme).
I do like MPTT (aka nested set) though, and have used it for hierarchies that are (almost) static.

Understanding large mysql data relations

I am trying to teach myself how to use SQL, namely mysql.
What I am trying to understand is how to deal with many different types of data within the same table. Say I am building a web application, and I have many different content types (blog item, comment item, files, pages, forms) for which I need to store different data fields. Would I create a new table for each content type, since each content type has its own unique field requirements, or is there a better way to do this? It seems like a lot to create a new table for each content type. If I had 30 types of content in my web app, that would be 30 tables just for the types, which seems excessive. And if I added a new content type, I would have to create a new table containing all the required fields for that type.
Is there a better way to do something like this, when I have many different types of content, each of which requires different fields of data to go into the database? Can I somehow check what type the content is, then select another table that holds all the fields for that type?
A little confused about what to do.
Just to give an example:
Stack Overflow itself uses the same database table (called Posts) for questions and answers. Even though these two types of data are not identical, the site creators considered them similar enough to put them into one table. There's a PostTypeId field that says whether a post is a question or an answer. On answers, the Title field would be NULL; on questions, other columns might be ignored.
Comments, on the other hand, are in a different table. Of course you could theoretically put them into the same Posts table and have a PostTypeId for comments. But the overhead this would create (given how lightweight comments are) justifies creating a separate table.
I know this isn't really an answer, and other developers might even have decided to put questions and answers into different tables; but it gives some perspective. Long story short: It depends :)
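A rough sketch of the single-table idea (heavily simplified; the real Posts table has many more columns, so treat the specifics here as illustrative):

    CREATE TABLE Posts (
      Id         INT AUTO_INCREMENT PRIMARY KEY,
      PostTypeId TINYINT NOT NULL,   -- e.g. 1 = question, 2 = answer
      ParentId   INT NULL,           -- an answer points at its question
      Title      VARCHAR(255) NULL,  -- NULL on answers
      Body       TEXT NOT NULL
    );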
Sketch interactions
First, try not to think about database design, but about how entities should interact with one another. Think of it as each entity having its own class, which represents its required data.
It's always a good start to take pencil and paper and sketch the interactions (or relations) between these entities that you are trying to accomplish. See Learning the Database design process.
Extendability and reuse
For example, you want to have a User, who can post BlogPosts; each BlogPost can have a set of Tags and a relevant set of Comments. Attachments can be injected into a BlogPost and also into a Comment.
Reusability and extendability are the key. When sketching your interactions, try to isolate dependencies. Think of it in an OO manner. Let's explore the Attachment a little more, as sketched below. You can create an Attachment table and then extend Attachment by creating BlogPostAttachment and CommentAttachment, where you can easily create relations between these dependent entities. This creates an easily extendable content type, which you can further reuse in e.g. a UserDetailsAttachment.
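A sketch of that extension idea (all names are illustrative; the BlogPost and Comment tables are assumed to exist, and their own foreign keys are omitted for brevity):

    CREATE TABLE Attachment (
      attachment_id INT AUTO_INCREMENT PRIMARY KEY,
      file_path     VARCHAR(255) NOT NULL
    );

    -- Link tables "extend" Attachment for each entity that can own one:
    CREATE TABLE BlogPostAttachment (
      blog_post_id  INT NOT NULL,
      attachment_id INT NOT NULL,
      PRIMARY KEY (blog_post_id, attachment_id),
      FOREIGN KEY (attachment_id) REFERENCES Attachment (attachment_id)
    );

    CREATE TABLE CommentAttachment (
      comment_id    INT NOT NULL,
      attachment_id INT NOT NULL,
      PRIMARY KEY (comment_id, attachment_id),
      FOREIGN KEY (attachment_id) REFERENCES Attachment (attachment_id)
    );

Adding a UserDetailsAttachment later is just one more small link table; the Attachment table itself never changes.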
ORMs to the rescue
By studying example code and usage of object-relational mappers like Doctrine or Propel, you can grasp some ideas for table extendability. Practical examples are always the best ones.
Related SO questions, which you may be interested in
Good Resources for Relational Database Design
Good PHP ORM Library?
How should a programmer learn great database design?
I know it's a long way to go, but considering the factors involved in creating large-scale DB applications with many relations and entity types, it's best to use the help of an ORM in the long run.
You needn't be afraid of using many, many tables; the database will happily deal with lots of them without complaining. If you let each content type have its own table, you get certain advantages:
Simplicity: Each table can be fairly simple, and the constraints are straightforward. For example if ContentType1 has a field with a relation to another table, you can make that a foreign key in the database design and the RDBMS will take care of data integrity for you.
Indexing efficiency: if ContentType2 needs to be indexed by date but ContentType3 needs to be indexed by name (to take a simple example), having them in two separate tables means each index is there for exactly the data it needs and nothing else. Combining them in one table means you need both indexes covering the combined dataset, which is messier and uses up more disk space.
If you need to output a list combining two content types, a UNION of the two tables is easy (see the sketch after this list); and if you need to do that often with large amounts of data, an indexed view (in engines that support them; MySQL itself doesn't) can make it cheap.
On the other hand, if you have two content types which are very similar (as in the StackOverflow case above for example), you can get some advantages from combining them into one table:
Simplicity: You only need to code the table once - if done right (i.e. the two content types are really very similar), this can make your codebase smaller and simpler.
Extensibility: if a third content type crops up which is again similar to the first two, and similar in the same way that the first two match each other, the table can straightforwardly be extended to store all three content types.
Indexing for performance: If the most common way of getting at the data is to combine the two content types and order them by date (say), a field which is common to both content types, then it can be inefficient to have two separate tables which must repeatedly be UNIONed and then sorted. Combining the two content types in one table lets you put a single index on the date field, allowing faster querying (though remember you can get a similar benefit from indexed views).
If you normalize rigorously, you will have a database where every entity type has its own table in the database. However, denormalization in various ways (such as combining two entity types in one table) can have benefits which might (depending on the size and shape of your data) outweigh the costs. I'd advise a strategy of keeping all content types separate at least at first, and considering combining them as a tactical denormalization if it turns out to be necessary.
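For the UNION approach mentioned in the first list, a combined listing might look like this (table and column names are assumptions for illustration):

    SELECT id, title, created_at, 'blog_post' AS content_type FROM blog_posts
    UNION ALL
    SELECT id, title, created_at, 'page'      AS content_type FROM pages
    ORDER BY created_at DESC
    LIMIT 20;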
You need to read a book about building websites with PHP and MySQL. It's a good habit to Google first, because some programmers think this is a lazy question. I suggest reading "Learning PHP, MySQL and JavaScript".
Anyway, before you start coding your site, you need to plan what kind of information you will store, and then design your database. Say a registration form will contain First_Name, Second_Name, DateOfBirth, Country, Gender and Email. You create a table named, say, "USER_INFO" and assign each column a datatype matching the data you would like to store: a number, text, a date, and so on. Then via PHP you connect to MySQL and store or retrieve the data you want. You really need to read a book or a tutorial to get a full answer, AND GOOGLE :P
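For instance, that table could be sketched like this (the datatypes are illustrative choices, not the only ones possible):

    CREATE TABLE USER_INFO (
      user_id     INT AUTO_INCREMENT PRIMARY KEY,  -- assumed surrogate key
      First_Name  VARCHAR(50)   NOT NULL,
      Second_Name VARCHAR(50)   NOT NULL,
      DateOfBirth DATE          NOT NULL,
      Country     VARCHAR(60)   NOT NULL,
      Gender      ENUM('M','F') NOT NULL,
      Email       VARCHAR(255)  NOT NULL UNIQUE
    );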

Reasons why you wouldn't use a foreign key? [php + MySQL]

I'm working on an old web application my company uses to create surveys. I looked at the database schema through the mysql command prompt and thought the tables looked pretty solid. Though I'm not a DB guru I'm well versed in the theory behind it (having taken a few database design courses in my software engineering program).
That being said, I dumped the create statements into an SQL file and imported them in MySQL Workbench and saw that they make no use of any "actual" foreign keys. They'll store another table's primary key like you would with a FK but they don't declare it as one.
So, seeing how their DB is designed the way I would design it through what I know (minus the FK issue), I'm left wondering whether there's a reason behind it. Is this a case of lazy programming, or could you get some performance gains by doing all the error checking programmatically?
In case you'd like an example: they basically have Surveys, and a survey has a series of Questions. A question is part of a survey, so it holds the survey's PK in a column. That's pretty much it, but they use it everywhere.
I'd appreciate any insight :) (I understand that this question might not have a right/wrong answer, but I'm looking more for some information on why they would do this; this system has been pretty solid ever since we started using it, so I'm led to believe that these guys knew what they were doing.)
The original developers might have opted to use MyISAM or any other storage engine that does not support foreign key constraints.
MySQL only supports the defining of actual foreign key relationships on InnoDB tables, maybe yours are MyISAM, or something else?
More important is that the proper columns have indices defined on them (so the ones holding the PK of another table should be indexed). This is also possible in MyISAM.
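To check, and if you want, to add the constraint (a sketch; the column names follow the survey example from the question and are assumptions):

    -- FK constraints are only enforced on InnoDB tables:
    SHOW TABLE STATUS WHERE Name IN ('surveys', 'questions');

    ALTER TABLE surveys   ENGINE = InnoDB;
    ALTER TABLE questions ENGINE = InnoDB;

    ALTER TABLE questions
      ADD CONSTRAINT fk_questions_survey
      FOREIGN KEY (survey_id) REFERENCES surveys (survey_id);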
As general points; keys speed up reads (if they are applicable to the read taking place they help the optimizer) and slow down writes (because they add overhead to the tables).
In the vast majority of cases the improvement of speed for reading and maintenance of referential integrity outweighs the minor overhead they add to writes.
This distinction has been blurred by caching, mirroring, etc., as so many reads on the very big sites don't actually hit the 'live' database; but this is not very relevant unless you are working for Amazon, Twitter or the like.
On uber-large databases (the type that Teradata supports) you find that they don't use foreign keys. The reason is performance. Every time you write to the database, which happens often enough in a data warehouse, you have the added overhead of having to check all the FKs on a table. If you already know the data to be valid, what's the point?
Good design on a small db would just mean you put them in, but there are performance gains to be had by leaving them out.
You don't really have to use foreign keys.
If you don't have them, data might become inconsistent and you won't be able to use cascading deletes and updates (sketched below).
If you have them, you might lose some of your users' data due to a bug in your SQL statements that happens because of schema changes.
Some prefer to have them, some prefer life without them. There's no real advantage in either case.
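To make the cascade point concrete, here is what the declared version could look like (a sketch reusing the survey example from this thread):

    ALTER TABLE questions
      ADD CONSTRAINT fk_questions_survey
      FOREIGN KEY (survey_id) REFERENCES surveys (survey_id)
      ON DELETE CASCADE    -- deleting a survey deletes its questions
      ON UPDATE CASCADE;   -- changing a survey's id propagates automatically

Without the constraint, both of those clean-ups have to be remembered and coded by hand in PHP.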
Here is a real life instance where I'm not using a foreign key.
I needed a way to store a parent child relationship where the child may not exist, and the child is an abstract class. Since the child could be of a few types, I use one field to name the type of the child and one field to list the id of the child. The application handles most of the logic.
I'm not sure if this was the best design decision, but it was the best I could come up with under the deadline. It's been working well so far!
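A sketch of that design (all names are assumptions): the child is referenced by a type discriminator plus an id, so no foreign key can be declared on child_id, because the table it points at varies per row and the child may not exist at all.

    CREATE TABLE parent_child (
      parent_id  INT NOT NULL PRIMARY KEY,
      child_type VARCHAR(30) NULL,  -- names the concrete child table, e.g. 'invoice'
      child_id   INT NULL           -- id within that table; NULL when there is no child
    );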
