Say I have 2 tables,
which I can "merge" and represent in a single nested array.
I'm wandering what would be the best way to do that, considering:
efficiency
best-practice
DB/server-side usage trade-off
what you should do in real life
same case for 3, 4 or more tables that can be "merged" that way
The question is about ANY server-side/relational-db.
2 simple ways I was thinking about
(if you have others, please suggest!
notice I'm asking for a simple SERVER-SIDE and RELATIONAL-DB,
so please don't waste your time explaining why I shouldn't
use this kind of DB, use MVC design, etc., etc. ...):
2 loops, 5 simple "SELECT" queries
1 loop, 1 "JOIN" query
I've tried to give a simple and detailed example,
in order to explain myself & understand better your answers
(though how to write the code and/or
finding possible mistakes is not the issue here,
so try not to focus on that...)
SQL SCRIPTS FOR CREATING AND INSERTING DATA TO TABLES
CREATE TABLE persons
(
id int NOT NULL AUTO_INCREMENT,
fullName varchar(255),
PRIMARY KEY (id)
);
INSERT INTO persons (fullName) VALUES ('Alice'), ('Bob'), ('Carl'), ('Dan');
CREATE TABLE phoneNumbers
(
id int NOT NULL AUTO_INCREMENT,
personId int,
phoneNumber varchar(255),
PRIMARY KEY (id)
);
INSERT INTO phoneNumbers (personId, phoneNumber) VALUES ( 1, '123-456'), ( 1, '234-567'), (1, '345-678'), (2, '456-789'), (2, '567-890'), (3, '678-901'), (4, '789-012');
A JSON REPRESENTATION OF THE TABLES AFTER I "MERGED" THEM:
[
{
"id": 1,
"fullName": "Alice",
"phoneNumbers": [
"123-456",
"234-567",
"345-678"
]
},
{
"id": 2,
"fullName": "Bob",
"phoneNumbers": [
"456-789",
"567-890"
]
},
{
"id": 3,
"fullName": "Carl",
"phoneNumbers": [
"678-901"
]
},
{
"id": 4,
"fullName": "Dan",
"phoneNumbers": [
"789-012"
]
}
]
PSEUDO CODE FOR 2 WAYS:
1.
query: "SELECT id, fullName FROM persons"
personList = new List<Person>()
foreach row x in query result:
current = new Person(x.fullName)
"SELECT phoneNumber FROM phoneNumbers WHERE personId = x.id"
foreach row y in query result:
current.phoneNumbers.Push(y.phoneNumber)
personList.Push(current)
print personList
2.
query: "SELECT persons.id, fullName, phoneNumber FROM persons
LEFT JOIN phoneNumbers ON persons.id = phoneNumbers.personId"
personList = new List<Person>()
current = null
previouseId = null
foreach row x in query result:
if ( x.id != previouseId )
if ( current != null )
personList.Push(current)
current = null
current = new Person(x.fullName)
current.phoneNumbers.Push(x.phoneNumber)
print personList
CODE IMPLEMENTATION IN PHP/MYSQL:
1.
/* get all persons */
$result = mysql_query("SELECT id, fullName FROM persons");
$personsArray = array(); //Create an array
//loop all persons
while ($row = mysql_fetch_assoc($result))
{
//add new person
$current = array();
$current['id'] = $row['id'];
$current['fullName'] = $row['fullName'];
/* add all person phone-numbers */
$id = $current['id'];
$sub_result = mysql_query("SELECT phoneNumber FROM phoneNumbers WHERE personId = {$id}");
$phoneNumbers = array();
while ($sub_row = mysql_fetch_assoc($sub_result))
{
$phoneNumbers[] = $sub_row['phoneNumber']);
}
//add phoneNumbers array to person
$current['phoneNumbers'] = $phoneNumbers;
//add person to final result array
$personsArray[] = $current;
}
echo json_encode($personsArray);
2.
/* get all persons and their phone-numbers in a single query */
$sql = "SELECT persons.id, fullName, phoneNumber FROM persons
LEFT JOIN phoneNumbers ON persons.id = phoneNumbers.personId";
$result = mysql_query($sql);
$personsArray = array();
/* init temp vars to save current person's data */
$current = null;
$previouseId = null;
$phoneNumbers = array();
while ($row = mysql_fetch_assoc($result))
{
/*
if the current id is different from the previous id:
you've got to a new person.
save the previous person (if such exists),
and create a new one
*/
if ($row['id'] != $previouseId )
{
// in the first iteration,
// current (previous person) is null,
// don't add it
if ( !is_null($current) )
{
$current['phoneNumbers'] = $phoneNumbers;
$personsArray[] = $current;
$current = null;
$previouseId = null;
$phoneNumbers = array();
}
// create a new person
$current = array();
$current['id'] = $row['id'];
$current['fullName'] = $row['fullName'];
// set current as previous id
$previouseId = $current['id'];
}
// you always add the phone-number
// to the current phone-number list
$phoneNumbers[] = $row['phoneNumber'];
}
}
// don't forget to add the last person (saved in "current")
if (!is_null($current))
$personsArray[] = $current);
echo json_encode($personsArray);
P.S.
this link is an example of a different question here, where i tried to suggest the second way: tables to single json
Preliminary
First, thank you for putting that much effort into explaining the problem, and for the formatting. It is great to see someone who is clear about what they are doing, and what they are asking.
But it must be noted that that, in itself, forms a limitation: you are fixed on the notion that this is the correct solution, and that with some small correction or guidance, this will work. That is incorrect. So I must ask you to give that notion up, to take a big step back, and to view (a) the whole problem and (b) my answer without that notion.
The context of this answer is:
all the explicit considerations you have given, which are very important, which I will not repeat
the two most important of which is, what best practice and what I would do in real life
This answer is rooted in Standards, the higher order of, or frame of reference for, best practice. This is what the commercial Client/Server world does, or should be doing.
This issue, this whole problem space, is becoming a common problem. I will give a full consideration here, and thus answer another SO question as well. Therefore it might contain a tiny bit more detail that you require. If it does, please forgive this.
Consideration
The database is a server-based resource, shared by many users. In an online system, the database is constantly changing. It contains that One Version of the Truth (as distinct from One Fact in One Place, which is a separate, Normalisation issue) of each Fact.
the fact that some database systems do not have a server architecture, and that therefore the notion of server in such software is false and misleading, are separate but noted points.
As I understand it, JSON and JSON-like structures are required for "performance reasons", precisely because the "server" doesn't, cannot, perform as a server. The concept is to cache the data on each (every) client, such that you are not fetching it from the "server" all the time.
This opens up a can of worms. If you do not design and implement this properly, the worms will overrun the app.
Such an implementation is a gross violation of the Client/Server Architecture, which allows simple code on both sides, and appropriate deployment of software and data components, such that implementation times are small, and efficiency is high.
Further, such an implementation requires a substantial implementation effort, and it is complex, consisting of many parts. Each of those parts must be appropriately designed.
The web, and the many books written in this subject area, provide a confusing mix of methods, marketed on the basis of supposed simplicity; ease; anyone-can-do-anything; freeware-can-do-anything; etc. There is not scientific basis for any of those proposals.
Non-architecture & Sub-standard
As evidenced, you have learned that that some approaches to database design are incorrect. You have encountered one problem, one instance that that advice is false. As soon as you solve this one problem, the next problem, which is not apparent to you right now, will be exposed. The notions are a never-ending set of problems.
I will not enumerate all the false notions that are sometimes advocated. I trust that as you progress through my answer, you will notice that one after the other marketed notion is false.
The two bottom lines are:
The notions violate Architecture and Design Standards, namely Client/Server Architecture; Open Architecture; Engineering Principles; and to a lesser in this particular problem, Database Design Principles.
Which leads to people like you, who are trying to do an honest job, being tricked into implementing simple notions, which turn into massive implementations. Implementations that will never quite work, so they require substantial ongoing maintenance, and will eventually be replaced, wholesale.
Architecture
The central principle being violated is, never duplicate anything. The moment you have a location where data is duplicated (due to caching or replication or two separate monolithic apps, etc), you create a duplicate that will go out of synch in an online situation. So the principle is to avoid doing that.
Sure, for serious third-party software, such as a gruntly report tool, by design, they may well cache server-based data in the client. But note that they have put hundreds of man-years into implementing it correctly, with due consideration to the above. Yours is not such a piece of software.
Rather than providing a lecture on the principles that must be understood, or the evils and costs of each error, the rest of this answer provides the requested what would you do in real life, using the correct architectural method (a step above best practice).
Architecture 1
Do not confuse
the data which must be Normalised
with
the result set, which, by definition, is the flattened ("de-normalised" is not quite correct) view of the data.
The data, given that it is Normalised, will not contain duplicate values; repeating groups. The result set will contain duplicate values; repeating groups. That is pedestrian.
Note that the notion of Nested Sets (or Nested Relations), which is in my view not good advice, is based on precisely this confusion.
For forty-five years since the advent of the RM, they have been unable to differentiate base relations (for which Normalisation does apply) from derived relations (for which Normalisation does not apply).
Two of these proponents are currently questioning the definition of First Normal Form. 1NF is the foundation of the other NFs, if the new definition is accepted, all the NFs will be rendered value-less. The result would be that Normalisation itself (sparsely defined in mathematical terms, but clearly understood as a science by professionals) will be severely damaged, if not destroyed.
Architecture 2
There is a centuries-old scientific or engineering principle, that content (data) must be separated from control (program elements). This is because the analysis, design, and implementation of the two are completely different. This principle is no less important in the software sciences, where it has specific articulation.
In order to keep this brief (ha ha), instead of a discourse, I will assume that you understand:
That there is a scientifically demanded boundary between data and program elements. Mixing them up results in complex objects that are error-prone and hard to maintain.
The confusion of this principle has reached epidemic proportions in the OO/ORM world, the consequences reach far and wide.
Only professionals avoid this. For the rest, the great majority, they accept the new definition as "normal", and they spend their lives fixing problems that we simply do not have.
The architectural superiority, the great value, of data being both stored and presented in Tabular Form per Dr E F Codd's Relational Model. That there are specific rules for Normalisation of data.
And importantly, you can determine when the people, who write and market books, advise non-relational or anti-relational methods.
Architecture 3
If you cache data on the client:
Cache the absolute minimum.
That means cache only the data that does not change in the online environment. That means Reference and Lookup tables only, the tables that populate the higher level classifiers, the drop-downs, etc.
Currency
For every table that you do cache, you must have a method of (a) determining that the cached data has become stale, compared to the One Version of the Truth which exists on the server, and (b) refreshing it from the server, (c) on a table-by-table basis.
Typically, this involves a background process that executes every (e) five minutes, that queries the MAX updated DateTime for each cached table on the client vs the DateTime on the server, and if changed, refreshes the table, and all its child tables, those that dependent on the changed table.
That, of course, requires that you have an UpdatedDateTime column on every table. That is not a burden, because you need that for OLTP ACID Transactions anyway (if you have a real database, instead of a bunch of sub-standard files).
Which really means, never replicate, the coding burden is prohibitive.
Architecture 4
In the sub-commercial, non-server world, I understand that some people advise the reverse caching of "everything".
That is the only way the programs like PostgreSQL, can to the used in a multi-user system.
You always get what you pay for: you pay peanuts, you get monkeys; you pay zero, you get zero.
The corollary to Architecture 3 is, if you do cache data on the client, do not cache tables that change frequently. These are the transaction and history tables. The notion of caching such tables, or all tables, on the client is completely bankrupt.
In a genuine Client/Server deployment, due to use of applicable standards, for each data window, the app should query only the rows that are required, for that particular need, at that particular time, based on context or filter values, etc. The app should never load the entire table.
If the same user using the same window inspected its contents, 15 minutes after the first inspection, the data would be 15 mins out of date.
For freeware/shareware/vapourware platforms, which define themselves by the absence of a server architecture, and thus by the result, that performance is non-existent, sure, you have to cache more than the minimum tables on the client.
If you do that, you must take all the above into account, and implement it correctly, otherwise your app will be broken, and the ramifications will drive the users to seek your termination. If there is more than one user, they will have the same cause, and soon form an army.
Architecture 5
Now we get to how you cache those carefully chosen tables on the client.
Note that databases grow, they are extended.
If the system is broken, a failure, it will grow in small increments, and require a lot of effort.
If the system is even a small success, it will grow exponentially.
If the system (each of the database, and the app, separately) is designed and implemented well, the changes will be easy, the bugs will be few.
Therefore, all the components in the app must be designed properly, to comply with applicable standards, and the database must be fully Normalised. This in turn minimises the effect of changes in the database, on the app, and vice versa.
The app will consist of simple, not complex, objects, which are easy to maintain and change.
For the data that you do cache on the client, you will use arrays of some form: multiple instances of a class in an OO platform; DataWindows (TM, google for it) or similar in a 4GL; simple arrays in PHP.
(Aside. Note that what people in situations such as yours produce in one year, professional providers using a commercial SQL platform, a commercial 4GL, and complying with Architecture and Standards.)
Architecture 6
So let's assume that you understand all the above, and appreciate its value, particularly Architecture 1 & 2.
If you don't, please stop here and ask questions, do not proceed to the below.
Now that we have established the full context, we can address the crux of your problem.
In those arrays in the app, why on Earth would you store flattened views of data ?
and consequently mess with, and agonise over, the problems
instead of storing copies of the Normalised tables ?
Answer
Never duplicate anything that can be derived. That is an Architectural Principle, not limited to Normalisation in a database.
Never merge anything.
If you do, you will be creating:
data duplication, and masses of it, on the client. The client will not only be fat and slow, it will be anchored to the floor with the ballast of duplicated data.
additional code, which is completely unnecessary
complexity in that code
code that is fragile, that will constantly have to change.
That is the precise problem you are suffering, a consequence of the method, which you know intuitively is wrong, that there must be a better way. You know it is a generic and common problem.
Note also that method, that code, constitutes a mental anchor for you. Look at the way that you have formatted it and presented it so beautifully: it is of importance to you. I am reluctant to inform you of all this.
Which reluctance is easily overcome, due to your earnest and forthright attitude, and the knowledge that you did not invent this method
In each code segment, at presentation time, as and when required:
a. In the commercial Client/Server context
Execute a query that joins the simple, Normalised, unduplicated tables, and retrieves only the qualifying rows. Thereby obtaining current data values. The user never sees stale data. Here, Views (flattened views of Normalised data) are often used.
b. In the sub-commercial non-server context
Create a temporary result-set array, and join the simple, unduplicated, arrays (copies of tables that are cached), and populate it with only the qualifying rows, from the source arrays. The currency of which is maintained by the background process.
Use the Keys to form the joins between the arrays, in exactly the same way that Keys are used to form the joins in the Relational tables in the database.
Destroy those components when the user closes the window.
A clever version would eliminate the result-set array, and join the source arrays via the Keys, and limit the result to the qualifying rows.
Separate to being architecturally incorrect, Nested Arrays or Nested Sets or JSON or JSON-like structures are simply not required. This is the consequence of confusing the Architecture 1 Principle.
If you do choose to use such structures, then use them only for the temporary result-set arrays.
Last, I trust this discourse demonstrates that n tables is a non-issue. More important, that m levels deep in the data hierarchy, the "nesting", is a non-issue.
Answer 2
Now that I have given the full context (and not before), which removes the implications in your question, and makes it a generic, kernel one.
The question is about ANY server-side/relational-db. [Which is better]:
2 loops, 5 simple "SELECT" queries
1 loop, 1 "JOIN" query
The detailed examples you have given are not accurately described above. The accurate descriptions is:
Your Option 1
2 loops, each loop for loading each array
1 single-table SELECT query per loop
(executed n x m times ... the outermost loop, only, is a single execution)
Your Option 2
1 Joined SELECT query executed once
followed by 2 loops, each loop for loading each array
For the commercial SQL platforms, neither, because it does not apply.
The commercial SQL server is a set-processing engine. Use one query with whatever joins are required, that returns a result set. Never step through the rows using a loop, that reduces the set-processing engine to a pre-1970's ISAM system. Use a View, in the server, since it affords the highest performance and the code is in one place.
However, for the non-commercial, non-server platforms, where:
your "server" is not a set-processing engine ie. it returns single rows, therefore you have to fetch each row and fill the array, manually or
your "server" does not provide Client/Server binding, ie. it does not provide facilities on the client to bind the incoming result set to a receiving array, and therefore you have to step through the returned result set, row by row, and fill the array, manually,
as per your example then, the answer is, by a large margin, your option 2.
Please consider carefully, and comment or ask questions.
Response to Comment
Say I need to print this json (or other html page) to some STOUT (example: an http response to: GET /allUsersPhoneNumbers. It's just an example to clarify what I'm expecting to get), should return this json. I have a php function that got this 2 result sets (1). now it should print this json - how should I do that? this report could be an employee month salary for a whole year, and so one. one way or anther, I need to gather this information and represent it in a "JOIN"ed representation
Perhaps I was not clear enough.
Basically, do not use JSON unless you absolutely have to. Which means sending to some system that requires it, which means that receiving system, and that demand is stupid.
Make sure that your system doesn't make such demands on others.
Keep your data Normalised. Both in the database, and in whatever program elements that you write. That means (in this example) use one SELECT per table or array. That is for loading purposes, so that you can refer to and inspect them at any point in the program.
When you need a join, understand that it is:
a result-set; a derived relation; a view
therefore temporary, it exists for the duration of the execution of that element, only
a. For tables, join them in the usual manner, via Keys. One query, joining two (or more) tables.
b. For arrays, join arrays in the program, the same way you join tables in the database, via Keys.
For the example you have given, which is a response to some request, first understand that it is the category [4], and then fulfil it.
Why even consider JSON?
What has JSON got to do with this?
JSON is misunderstood and people are interested in the wow factor. It is a solution looking for a problem. Unless you have that problem it has no value.
Check these two links:
Copter - What is JSON
Stack Overflow - What is JSON
Now if you understand that, it is mostly for incoming feeds. Never for outgoing. Further, it requires parsing, deconstructing, etc, before the can be used.
Recall:
I need to gather this information and represent it in a "JOIN"ed representation
Yes. That is pedestrian. Joined does not mean JSONed.
In your example, the receiver is expecting a flattened view (eg. spreadsheet), with all the cells filled, and yes, for Users with more than one PhoneNumber, their User details will be repeated on the second nad subsequent result-set row. For any kind of print, eg. for debugging, I want a flattened view. It is just a:
SELECT ... FROM Person JOIN PhoneNumber
And return that. Or if you fulfil the request from arrays, join the Person and PhoneNumber Arrays, which may require a temporary result-set array, and return that.
please don't tell me you should get only 1 user at a time, etc. etc.
Correct. If someone tells you to regress to procedural processing (ie. row by row, in a WHILE loop), where the engine or your program has set processing (ie. processes an entire set in one command), that marks them as someone who should not be listened to.
I have already stated, your Option 2 is correct, Option 1 is incorrect. That is as far as the GET or SELECT is concerned.
On the other hand, for programming languages that do not have set-processing capability (ie. cannot print/set/inspect an array in a single command), or "servers" that do not provide client-side array binding, you do have to write loops, one loop per depth of the data hierarchy (in your example, two loops, one for Person, and one for PhoneNumber per User).
You have to do that to parse an incoming JSON object.
You have to do that to load each array from the result set that is returned in your Option 2.
You have to do that to print each array from the result set that is returned in your Option 2.
Response to Comment 2
I've ment I have to return a result represented in a nested version (let's say I'm printing the report to the page), json was just an example for such representation.
I don't think you understand the reasoning and the conclusions I have provided in this answer.
For printing and displaying, never nest. Print a flattened view, the rows returned from the SELECT per Option 2. That is what we have been doing, when printing or displaying data Relationally, for 31 years. It is easier to read, debug, search, find, fold, staple, mutilate. You cannot do anything with a nested array, except look at it, and say gee that is interesting.
Code
Caveat
I would prefer to take your code and modify it, but actually, looking at your code, it is not well written or structured, it cannot be reasonably modified. Second, if I use that, it would be a bad teaching tool. So I will have to give you fresh, clean code, otherwise you will not learn the correct methods.
This code examples follow my advice, so I am not going to repeat. And this is way beyond the original question.
Query & Print
Your request, using your Option 2. One SELECT executed once. Followed by one loop. Which you can "pretty up" if you like.
In general it is a best practice to grab the data you need in as few trips to the database as possible then map the data into the appropriate objects. (Option 2)
But, to answer your question I would ask yourself what the use case for your data is. If you know for sure that you will be needing your person and your phone number data then I would say the second method is your best option.
However, option one can also have its use case when the joined data is optional.One example of this could be that on the UI you have a table of all your people and if a user wants to see the phone number for a particular person they have to click on that person. Then it would be acceptable to "lazy-load" all of the phone numbers.
This is the common problem, especially if you are creating a WebAPIs, converting those table sets to nested arrays is a big deal..
I always go for you the second option(in slightly different method though), because the first is worst possible way to do it... One thing I learned with my experience is never query inside a loop, that is a waste of DB calls, well you know what I trying to say.
Although I don't accept all the things PerformanceDBA said, there are two major things I need the address,
1. Don't have duplicate data
2. Fetch only data you want
The Only problem I see in Joining the table is, we end up duplicating data lots of them, take you data for example, Joining Person ans phoneNumber tables we end up duplicating Every person for each of his phone number, for two table with few hundred rows its fine, imagine we need to merge 5 tables with thousands of rows its huge...
So here's my solution:
Query:
SELECT id, fullName From Person;
SELECT personId, phoneNumber FROM phoneNumbers
WHERE personId IN (SELECT id From Person);
So I get to tables in my result set, now I assign Table[0] to my Person list,
and use a 2 loops to place right phoneNumbers in to right person...
Code:
personList = ConvertToEntity<List<Person>>(dataset.Table[0]);
pnoList = ConvertToEntity<List<PhoneNumber>>(dataset.Table[1]);
foreach (person in personList) {
foreach (pno in pnoList) {
if(pno.PersonId = person.Id)
person.PhoneNumer.Add(pno)
}
}
I think above method reduce lots of duplication and only get me what I wanted, if there is any downside to the above method please let me know... and thanks for asking these kind of questions...
I have a webapp development problem that I've developed one solution for, but am trying to find other ideas that might get around some performance issues I'm seeing.
problem statement:
a user enters several keywords/tokens
the application searches for matches to the tokens
need one result for each token
ie, if an entry has 3 tokens, i need the entry id 3 times
rank the results
assign X points for token match
sort the entry ids based on points
if point values are the same, use date to sort results
What I want to be able to do, but have not figured out, is to send 1 query that returns something akin to the results of an in(), but returns a duplicate entry id for each token matches for each entry id checked.
Is there a better way to do this than what I'm doing, of using multiple, individual queries running one query per token? If so, what's the easiest way to implement those?
edit
I've already tokenized the entries, so, for example, "see spot run" has an entry id of 1, and three tokens, 'see', 'spot', 'run', and those are in a separate token table, with entry ids relevant to them so the table might look like this:
'see', 1
'spot', 1
'run', 1
'run', 2
'spot', 3
you could achive this in one query using 'UNION ALL' in MySQL.
Just loop through the tokens in PHP creating a UNION ALL for each token:
e.g if the tokens are 'x', 'y' and 'z' your query may look something like this
SELECT * FROM `entries`
WHERE token like "%x%" union all
SELECT * FROM `entries`
WHERE token like "%y%" union all
SELECT * FROM `entries`
WHERE token like "%z%" ORDER BY score ect...
The order clause should operate on the entire result set as one, which is what you need.
In terms of performance it won't be all that fast (I'm guessing), however with databases the main overhead in terms of speed is often sending the query to the database engine from PHP and receiving the results. With this technique this only happens once instead of once per token, so performance will increase, I just don't know if it'll be enough.
I know this isn't strictly an answer to the question you're asking but if your table is thousands rather than millions of rows, then a FULLTEXT solution might be the best way to go here.
In MySQL when you use MATCH on your indexed column, each keyword you supply will be given a relevance score (calculated roughly by the number of times each keyword was mentioned) that will be more accurate than your method and certainly more effecient for multiple keywords.
See here:
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
If you're using the UNION ALL pattern you may also want to include the following parts to your query:
SELECT COUNT(*) AS C
...
GROUP BY ID
ORDER BY c DESC
While this is a really trivial example it does get you the frequency of the matches for each result and this could be a pseudo rank to start with.
You'll probably get much better performance if you used a data structure designed for search tasks rather than a database. For example, you might try looking at building an inverted index. Rather than writing it youself, however, you might also want to look into something like Lucene which does most of the work for you.