PHP/MySQL OOP: Loading complex objects from SQL
So I'm working on a project for a realtor. I have the following objects/MySQL tables in my design:
Complexes
Units
Amenities
Pictures
Links
Documents
Events
Agents
These are the relationships between the above objects.
Complexes have a single Agent.
Complexes have multiple Units, Amenities, Pictures, Links, Documents, and Events.
Units have multiple Pictures, Links, and Documents.
Amenities, Pictures, Links, Documents, and Events all have the necessary foreign keys in the database to specify which unit/complex they belong to.
I need to load the necessary objects from the database into PHP so I can use them in my project.
If I try to select all the data out of the table in 1 query, using LEFT JOINS, I'll get AT LEAST (# of links) * (# of pictures) * (# of documents) rows for each unique unit. Add amenities, and events to that and I'll get all that * # of amenities * # of events for each complex...Not sure I want to try to deal with loading that into an object in PHP.
The other possibility is, for each complex/unit, to execute a separate SQL statement for links, pictures, documents, events, and amenities.
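For illustration, that second approach amounts to a handful of indexed lookups per unit, something like the following (table and column names are assumptions based on the description above):

SELECT * FROM pictures  WHERE unit_id = 123;
SELECT * FROM links     WHERE unit_id = 123;
SELECT * FROM documents WHERE unit_id = 123;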
My questions are as follows:
If I properly index all my tables, is it REALLY a bad idea to execute 3-5 extra queries for each complex/unit?
If not, how else can I get the data I need to load into a PHP object. Ideally, I would have an object as follows for units:
Unit Object
(
[id]
[mls_number]
[type]
[retail_price]
[investor_price]
[quantity]
[beds]
[baths]
[square_feet]
[description]
[featured]
[year_built]
[has_garage]
[stories]
[other_features]
[investor_notes]
[tour_link]
[complex] => Complex Object
(
[id]
[name]
[description]
etc.
)
[agent] => Agent Object
(
[id]
[first_name]
[last_name]
[email]
[phone]
[phone2]
etc.
)
[pictures] => Array
(
[1] => Picture Object
(
)
)
[links] => Array
(
[1] => Link Object
(
)
)
[documents] => Array
(
[1] => Document Object
(
)
)
)
I don't ALWAYS need ALL of this information, sometimes I only need the primary key of the complex, sometimes I only need the primary key of the agent, etc. But I figured the correct way to do this would be to load the entire object every time I instantiate it.
I've been doing a lot of research on OO PHP, but most (read: all) online examples use only one table. That obviously doesn't help, as the project I'm working on has many complex relationships. Any ideas? Am I totally off the mark here?
Thanks
[UPDATE]
On the other hand, usually on the front-end, which everyone will see, I WILL need ALL the information. For instance, when someone wants information on a specific complex, I need to display all units belonging to that complex, all pictures, documents, links, and events for the complex, as well as all pictures, documents, and links for each unit.
What I was hoping to avoid was, during one page load, executing one query to get the complex I need, then another query to get the 20 units associated with the complex, then, for each of the 20 units, executing a query for pictures, another for documents, another for links, etc. I wanted to get them all at once, in one trip to the database.
[EDIT 2]
Also, note that the queries to select the pictures, documents, links, events, and agent from the database are pretty simple. Just basic SELECT [list of columns] FROM [table] WHERE [primary_key] = [value] with the occasional INNER JOIN. I'm not doing any complex computations or subqueries, just basic stuff.
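For example (the column names here are assumed, not taken from the actual schema), the picture and agent lookups are on the order of:

SELECT id, file_path, caption
FROM pictures
WHERE unit_id = 123;

SELECT a.id, a.first_name, a.last_name, a.email, a.phone
FROM agents a
INNER JOIN complexes c ON c.agent_id = a.id
WHERE c.id = 45;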
[BENCHMARK]
So after reading all the answers to my question, I decided to run a benchmark on what I decided to do. What I do is load all the units that I need; then, as I need to display pictures, documents, and so on, I load them at that time. I created 30,000 test units, each with 100 pictures, 100 documents, and 100 links. Then I loaded a certain number of units (I started with 1000, then 100, then the more realistic 10), looped through them, then loaded all pictures, documents, and links associated with each unit. With 1000 units, it took approximately 30 seconds. With 100 units, it took about 3 seconds. With 10 units, it took about 0.5 seconds. There was a lot of variance in the results: sometimes, with 10 units, it would take 0.12 seconds, then 0.8, then maybe 0.5, then 0.78. It was really all over the place, but it seemed to average around half a second.
In reality, though, I might only need 6 units at a time, and they each might only have 10 pictures, 5 links, and 5 documents associated with them... so I think the "grab the data when you need it" approach is the best bet in a situation like this. If you needed to get all this data at once, though, it would be worthwhile to come up with a single SQL statement to load all the data you need, so you are only looping through the data one time (6,700 units at a time took 217 seconds, while the full 30,000 made PHP run out of memory).
If I properly index all my tables, is it REALLY a bad idea to execute 3-5 extra queries for each complex/unit?
In short, no. For each of the related tables, you should probably run a separate query. That's what most ORM (Object-Relational Mapping/Modelling) systems would do.
If performance is really a problem (and, based on what you've said, it won't be) then you might consider caching the results using something like APC, memcache or Xcache.
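If it ever does become a problem, a minimal caching sketch with the Memcached extension might look like this (the key scheme, the 10-minute TTL, and load_pictures_for_unit() are all placeholders, not part of the project):

$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

$key      = 'unit_' . $unitId . '_pictures';
$pictures = $cache->get($key);

if ($pictures === false) {
    // cache miss: hit MySQL once, then keep the result for 10 minutes
    $pictures = load_pictures_for_unit($unitId);   // hypothetical loader
    $cache->set($key, $pictures, 600);
}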
The point of ORM is not to load entire objects every time. The point is to make it easy and transparent for your app to access objects.
That being said, if you need the unit object, then load the unit object, and only the unit object. If you need the agent object, then load that when you need it, not when you load the unit object.
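A minimal sketch of that idea, using PDO and illustrative table/column names (the agent_id column on units is an assumption):

function loadUnit(PDO $db, $id) {
    $stmt = $db->prepare('SELECT * FROM units WHERE id = ?');
    $stmt->execute(array($id));
    return $stmt->fetch(PDO::FETCH_ASSOC) ?: null;
}

function loadAgent(PDO $db, $id) {
    $stmt = $db->prepare('SELECT * FROM agents WHERE id = ?');
    $stmt->execute(array($id));
    return $stmt->fetch(PDO::FETCH_ASSOC) ?: null;
}

// load the unit now; fetch the agent only at the point it is actually needed
$unit  = loadUnit($db, 42);
$agent = loadAgent($db, $unit['agent_id']);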
Maybe you should think of breaking this up.
When you initiate your object, get only what details you need for that object to function. If and when you need more details, then go and get them. You distribute your load and processing this way: the object only gets the load and processing it needs to function, and when more is needed, it gets it then.
So, in your example - create the complex first. When you need to access a unit, then create that unit, when you need the agent, then get that agent, etc.
$complexDetails = array('id' => $id /* , etc. */);
$complexUnits = array();
// ... later, when a unit is actually needed ...
$complexUnits[] = new Unit();
// ... and only when the agent is needed ...
$complexDetails['agent'] = new Agent();
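A slightly fuller sketch of that idea, fetching a unit's pictures only on first access (PDO, the pictures table, and its columns are assumptions here, not the project's actual code):

class Unit {
    private $db;
    private $id;
    private $pictures = null;   // not loaded until first requested

    public function __construct(PDO $db, $id) {
        $this->db = $db;
        $this->id = $id;
    }

    // the pictures query only runs the first time pictures are asked for
    public function getPictures() {
        if ($this->pictures === null) {
            $stmt = $this->db->prepare('SELECT * FROM pictures WHERE unit_id = ?');
            $stmt->execute(array($this->id));
            $this->pictures = $stmt->fetchAll(PDO::FETCH_ASSOC);
        }
        return $this->pictures;
    }
}

// cheap to construct; the extra query happens only on pages that show pictures
$unit = new Unit($db, 42);
$pictures = $unit->getPictures();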
I had to address this issue a while back when I concocted my own MVC framework as an experiment. To limit the layers of data loaded from the DB, I passed an integer to the constructor. Each constructor would decrement this integer before passing it to the constructors of the objects it instantiated. When it got to 0, no more sub-objects would be instantiated. This meant, basically, the int passed was the number of layers loaded.
So if I only wanted an attribute of the unit object, I'd do this:
$myUnit = new Unit($unitId,1);
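A rough outline of that counter pattern (PDO, the complex_id column, and a Complex class that follows the same pattern are all assumptions for illustration, and the extra $db argument is not part of the original example):

class Unit {
    public $id;
    public $complexId;
    public $complex = null;

    public function __construct(PDO $db, $unitId, $depth = 1) {
        $stmt = $db->prepare('SELECT * FROM units WHERE id = ?');
        $stmt->execute(array($unitId));
        $row = $stmt->fetch(PDO::FETCH_ASSOC);

        $this->id        = $row['id'];
        $this->complexId = $row['complex_id'];   // assumed column name

        // one layer is consumed by this object; only build sub-objects
        // while layers remain, passing the decremented counter down
        if (--$depth > 0) {
            $this->complex = new Complex($db, $this->complexId, $depth);
        }
    }
}

// new Unit($db, $unitId, 1) -> just the unit's own columns, no sub-objects
// new Unit($db, $unitId, 2) -> the unit plus its Complex, one level deep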
If you want to "store" the objects, meaning cache them, just load them into a PHP array and serialize it. Then you can store it back to the database, in memcache or anywhere else. Attaching a label to it would allow you to retrieve it, and include a time stamp so you know how old it is (i.e. needs to be refreshed).
If the data doesn't change, or changes infrequently, there really is no reason to run multiple complex queries every time. Simple ones, like getting a primary, you might as well just hit the database directly.
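A bare-bones sketch of that serialize-and-label idea; the object_cache table, the 'featured_units' label, the 10-minute freshness window, and the PDO handle $db are all invented for illustration:

// wrap the data with a timestamp so its age can be checked later
$payload = serialize(array(
    'saved_at' => time(),
    'data'     => $units,          // the array of objects/rows being cached
));

$stmt = $db->prepare('REPLACE INTO object_cache (label, payload) VALUES (?, ?)');
$stmt->execute(array('featured_units', $payload));

// later: pull it back and decide whether it is still fresh
$stmt = $db->prepare('SELECT payload FROM object_cache WHERE label = ?');
$stmt->execute(array('featured_units'));
$raw    = $stmt->fetchColumn();
$cached = ($raw !== false) ? unserialize($raw) : false;

if ($cached !== false && time() - $cached['saved_at'] < 600) {
    $units = $cached['data'];      // fresh enough, use the cached copy
}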
Related
Need aggregate data AND distinct data: should I use 2 queries, or process hundreds of thousands of rows? (best practice)
I have 2 tables:
1. The first table contains prospects, their treatment status, and the mail code they received (think of it as a foreign key).
2. The second table contains mails, indexed by mail code.
I need to display some charts about hundreds of thousands of prospects, so I was thinking about an aggregate query (get prospect data grouped by month, count positive statuses, count negative statuses, between a start and end date, etc.). The result is short and simple, and I can use it directly in the charts:
[ "2019-01" => [ "WON" => 55000, "LOST" => 85000, ...], ... ]
Then I was asked to add a filter on mails (code and human-readable label) so the user can choose them from a multi-select field. I can handle writing the query (or queries), but I am wondering which way I should go. I have a choice between:
- keeping my first query and adding a second one (distinct mail values, same conditions)
- querying everything and processing all my rows with PHP
I know coding, but I have little knowledge about performance. In theory I should not run 2 queries over the same data, but processing all those rows with PHP when MySQL can do it better looks like... overkill. Is there a best practice?
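A sketch of the two-query approach being described (table and column names are guesses based on the description, and the IN list would be built from the codes chosen in the multi-select):

-- 1) the aggregate the charts are built from
SELECT DATE_FORMAT(p.treated_at, '%Y-%m') AS month,
       SUM(p.status = 'WON')  AS won,
       SUM(p.status = 'LOST') AS lost
FROM prospects p
WHERE p.treated_at BETWEEN '2019-01-01' AND '2019-12-31'
  AND p.mail_code IN ('M1', 'M2')
GROUP BY month;

-- 2) the distinct mails used to populate the multi-select filter
SELECT DISTINCT m.code, m.label
FROM mails m
JOIN prospects p ON p.mail_code = m.code
WHERE p.treated_at BETWEEN '2019-01-01' AND '2019-12-31';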
I have a lot of PHP pages with dozens of queries supporting them, and they run plenty fast. When a page does not run fast, I focus on the slowest query; I do not waste time playing games in PHP. But I avoid running a query that hits hundreds of thousands of rows; it will be "too" slow. Some options:
- Maybe I will find a way to pre-aggregate the data to avoid a big scan.
- Maybe I will move the big query to a second page -- this avoids penalizing the user who does not need it.
- Maybe I will break up the big scan so that the user must ask for pieces, rather than building a page with 100K lines. Pagination is not good for that many rows, so maybe I will dynamically build an index into a second level of pages.
To discuss this further, please provide SHOW CREATE TABLE, some SELECTs (don't worry about how bad they are; we'll tell you), and mockups of the page(s).
Best way to create nested array from tables: multiple queries/loops VS single query/loop style
Say I have 2 tables, which I can "merge" and represent in a single nested array. I'm wondering what would be the best way to do that, considering:
- efficiency
- best practice
- DB/server-side usage trade-off
- what you should do in real life
- the same case for 3, 4 or more tables that can be "merged" that way
The question is about ANY server-side/relational-db. There are 2 simple ways I was thinking about (if you have others, please suggest! Notice I'm asking for a simple SERVER-SIDE and RELATIONAL-DB, so please don't waste your time explaining why I shouldn't use this kind of DB, use MVC design, etc., etc. ...):
1. 2 loops, 5 simple "SELECT" queries
2. 1 loop, 1 "JOIN" query
I've tried to give a simple and detailed example in order to explain myself and better understand your answers (though how to write the code and/or finding possible mistakes is not the issue here, so try not to focus on that...).
SQL SCRIPTS FOR CREATING AND INSERTING DATA INTO THE TABLES:
CREATE TABLE persons (
    id int NOT NULL AUTO_INCREMENT,
    fullName varchar(255),
    PRIMARY KEY (id)
);
INSERT INTO persons (fullName) VALUES ('Alice'), ('Bob'), ('Carl'), ('Dan');
CREATE TABLE phoneNumbers (
    id int NOT NULL AUTO_INCREMENT,
    personId int,
    phoneNumber varchar(255),
    PRIMARY KEY (id)
);
INSERT INTO phoneNumbers (personId, phoneNumber) VALUES
    (1, '123-456'), (1, '234-567'), (1, '345-678'),
    (2, '456-789'), (2, '567-890'),
    (3, '678-901'),
    (4, '789-012');
A JSON REPRESENTATION OF THE TABLES AFTER I "MERGED" THEM:
[
  { "id": 1, "fullName": "Alice", "phoneNumbers": ["123-456", "234-567", "345-678"] },
  { "id": 2, "fullName": "Bob",   "phoneNumbers": ["456-789", "567-890"] },
  { "id": 3, "fullName": "Carl",  "phoneNumbers": ["678-901"] },
  { "id": 4, "fullName": "Dan",   "phoneNumbers": ["789-012"] }
]
PSEUDO CODE FOR THE 2 WAYS:
1.
query: "SELECT id, fullName FROM persons"
personList = new List<Person>()
foreach row x in query result:
    current = new Person(x.fullName)
    query: "SELECT phoneNumber FROM phoneNumbers WHERE personId = x.id"
    foreach row y in query result:
        current.phoneNumbers.Push(y.phoneNumber)
    personList.Push(current)
print personList
2.
query: "SELECT persons.id, fullName, phoneNumber FROM persons LEFT JOIN phoneNumbers ON persons.id = phoneNumbers.personId"
personList = new List<Person>()
current = null
previousId = null
foreach row x in query result:
    if (x.id != previousId)
        if (current != null)
            personList.Push(current)
        current = new Person(x.fullName)
        previousId = x.id
    current.phoneNumbers.Push(x.phoneNumber)
if (current != null)
    personList.Push(current)
print personList
CODE IMPLEMENTATION IN PHP/MYSQL:
1.
/* get all persons */
$result = mysql_query("SELECT id, fullName FROM persons");
$personsArray = array();

// loop over all persons
while ($row = mysql_fetch_assoc($result)) {
    // add a new person
    $current = array();
    $current['id'] = $row['id'];
    $current['fullName'] = $row['fullName'];

    /* add all of this person's phone numbers */
    $id = $current['id'];
    $sub_result = mysql_query("SELECT phoneNumber FROM phoneNumbers WHERE personId = {$id}");
    $phoneNumbers = array();
    while ($sub_row = mysql_fetch_assoc($sub_result)) {
        $phoneNumbers[] = $sub_row['phoneNumber'];
    }

    // add the phoneNumbers array to the person
    $current['phoneNumbers'] = $phoneNumbers;

    // add the person to the final result array
    $personsArray[] = $current;
}
echo json_encode($personsArray);
2.
/* get all persons and their phone numbers in a single query */
$sql = "SELECT persons.id, fullName, phoneNumber
        FROM persons
        LEFT JOIN phoneNumbers ON persons.id = phoneNumbers.personId";
$result = mysql_query($sql);
$personsArray = array();

/* temp vars holding the current person's data */
$current = null;
$previousId = null;
$phoneNumbers = array();

while ($row = mysql_fetch_assoc($result)) {
    /* if the current id differs from the previous id, we have reached a new
       person: save the previous person (if one exists) and start a new one */
    if ($row['id'] != $previousId) {
        // on the first iteration $current (the previous person) is null,
        // so there is nothing to add yet
        if (!is_null($current)) {
            $current['phoneNumbers'] = $phoneNumbers;
            $personsArray[] = $current;
            $phoneNumbers = array();
        }
        // create a new person
        $current = array();
        $current['id'] = $row['id'];
        $current['fullName'] = $row['fullName'];
        // remember this id as the previous one
        $previousId = $current['id'];
    }
    // always add the phone number to the current person's list
    $phoneNumbers[] = $row['phoneNumber'];
}

// don't forget to add the last person (still held in $current)
if (!is_null($current)) {
    $current['phoneNumbers'] = $phoneNumbers;
    $personsArray[] = $current;
}

echo json_encode($personsArray);
P.S. This link is an example of a different question here, where I tried to suggest the second way: tables to single json
Preliminary
First, thank you for putting that much effort into explaining the problem, and for the formatting. It is great to see someone who is clear about what they are doing, and what they are asking.
But it must be noted that, in itself, that forms a limitation: you are fixed on the notion that this is the correct solution, and that with some small correction or guidance, this will work. That is incorrect. So I must ask you to give that notion up, to take a big step back, and to view (a) the whole problem and (b) my answer without that notion.
The context of this answer is: all the explicit considerations you have given, which are very important and which I will not repeat; the two most important of which are what is best practice and what I would do in real life.
This answer is rooted in Standards, the higher order of, or frame of reference for, best practice. This is what the commercial Client/Server world does, or should be doing. This issue, this whole problem space, is becoming a common problem. I will give a full consideration here, and thus answer another SO question as well. Therefore it might contain a tiny bit more detail than you require. If it does, please forgive this.
Consideration
The database is a server-based resource, shared by many users. In an online system, the database is constantly changing. It contains the One Version of the Truth (as distinct from One Fact in One Place, which is a separate, Normalisation issue) of each Fact. The fact that some database systems do not have a server architecture, and that therefore the notion of server in such software is false and misleading, is a separate but noted point.
As I understand it, JSON and JSON-like structures are required for "performance reasons", precisely because the "server" doesn't, cannot, perform as a server. The concept is to cache the data on each (every) client, such that you are not fetching it from the "server" all the time.
This opens up a can of worms. If you do not design and implement this properly, the worms will overrun the app. Such an implementation is a gross violation of the Client/Server Architecture, which allows simple code on both sides, and appropriate deployment of software and data components, such that implementation times are small, and efficiency is high. Further, such an implementation requires a substantial implementation effort, and it is complex, consisting of many parts. Each of those parts must be appropriately designed.
The web, and the many books written in this subject area, provide a confusing mix of methods, marketed on the basis of supposed simplicity; ease; anyone-can-do-anything; freeware-can-do-anything; etc. There is no scientific basis for any of those proposals.
Non-architecture & Sub-standard
As evidenced, you have learned that some approaches to database design are incorrect. You have encountered one problem, one instance where that advice is false. As soon as you solve this one problem, the next problem, which is not apparent to you right now, will be exposed. The notions are a never-ending set of problems.
I will not enumerate all the false notions that are sometimes advocated. I trust that as you progress through my answer, you will notice that one after another marketed notion is false.
The two bottom lines are:
1. The notions violate Architecture and Design Standards, namely Client/Server Architecture; Open Architecture; Engineering Principles; and, to a lesser extent in this particular problem, Database Design Principles.
2. Which leads to people like you, who are trying to do an honest job, being tricked into implementing simple notions, which turn into massive implementations. Implementations that will never quite work, so they require substantial ongoing maintenance, and will eventually be replaced, wholesale.
Architecture
The central principle being violated is: never duplicate anything. The moment you have a location where data is duplicated (due to caching or replication or two separate monolithic apps, etc), you create a duplicate that will go out of synch in an online situation. So the principle is to avoid doing that.
Sure, for serious third-party software, such as a gruntly report tool, by design, they may well cache server-based data in the client. But note that they have put hundreds of man-years into implementing it correctly, with due consideration to the above. Yours is not such a piece of software.
Rather than providing a lecture on the principles that must be understood, or the evils and costs of each error, the rest of this answer provides the requested what would you do in real life, using the correct architectural method (a step above best practice).
Architecture 1
Do not confuse the data, which must be Normalised, with the result set, which, by definition, is the flattened ("de-normalised" is not quite correct) view of the data.
The data, given that it is Normalised, will not contain duplicate values; repeating groups. The result set will contain duplicate values; repeating groups. That is pedestrian.
Note that the notion of Nested Sets (or Nested Relations), which is in my view not good advice, is based on precisely this confusion. For forty-five years since the advent of the RM, they have been unable to differentiate base relations (for which Normalisation does apply) from derived relations (for which Normalisation does not apply).
Two of these proponents are currently questioning the definition of First Normal Form. 1NF is the foundation of the other NFs; if the new definition is accepted, all the NFs will be rendered value-less. The result would be that Normalisation itself (sparsely defined in mathematical terms, but clearly understood as a science by professionals) would be severely damaged, if not destroyed.
Architecture 2
There is a centuries-old scientific or engineering principle that content (data) must be separated from control (program elements). This is because the analysis, design, and implementation of the two are completely different. This principle is no less important in the software sciences, where it has specific articulation.
In order to keep this brief (ha ha), instead of a discourse, I will assume that you understand:
- That there is a scientifically demanded boundary between data and program elements. Mixing them up results in complex objects that are error-prone and hard to maintain. The confusion of this principle has reached epidemic proportions in the OO/ORM world; the consequences reach far and wide. Only professionals avoid this. For the rest, the great majority, they accept the new definition as "normal", and they spend their lives fixing problems that we simply do not have.
- The architectural superiority, the great value, of data being both stored and presented in Tabular Form per Dr E F Codd's Relational Model.
- That there are specific rules for Normalisation of data.
- And importantly, that you can determine when the people who write and market books advise non-relational or anti-relational methods.
Architecture 3
If you cache data on the client, cache the absolute minimum. That means cache only the data that does not change in the online environment. That means Reference and Lookup tables only: the tables that populate the higher-level classifiers, the drop-downs, etc.
Currency
For every table that you do cache, you must have a method of (a) determining that the cached data has become stale, compared to the One Version of the Truth which exists on the server, and (b) refreshing it from the server, (c) on a table-by-table basis.
Typically, this involves a background process that executes every (say) five minutes, that queries the MAX updated DateTime for each cached table on the client vs the DateTime on the server, and, if changed, refreshes the table and all its child tables, those that depend on the changed table.
That, of course, requires that you have an UpdatedDateTime column on every table. That is not a burden, because you need that for OLTP ACID Transactions anyway (if you have a real database, instead of a bunch of sub-standard files).
Which really means, never replicate; the coding burden is prohibitive.
Architecture 4
In the sub-commercial, non-server world, I understand that some people advise the reverse: caching "everything". That is the only way programs like PostgreSQL can be used in a multi-user system. You always get what you pay for: you pay peanuts, you get monkeys; you pay zero, you get zero.
The corollary to Architecture 3 is: if you do cache data on the client, do not cache tables that change frequently. These are the transaction and history tables. The notion of caching such tables, or all tables, on the client is completely bankrupt.
In a genuine Client/Server deployment, due to use of applicable standards, for each data window, the app should query only the rows that are required, for that particular need, at that particular time, based on context or filter values, etc. The app should never load the entire table. If the same user using the same window inspected its contents, 15 minutes after the first inspection, the data would be 15 mins out of date.
For freeware/shareware/vapourware platforms, which define themselves by the absence of a server architecture, and thus by the result that performance is non-existent, sure, you have to cache more than the minimum tables on the client. If you do that, you must take all the above into account, and implement it correctly, otherwise your app will be broken, and the ramifications will drive the users to seek your termination. If there is more than one user, they will have the same cause, and soon form an army.
Architecture 5
Now we get to how you cache those carefully chosen tables on the client.
Note that databases grow; they are extended. If the system is broken, a failure, it will grow in small increments, and require a lot of effort. If the system is even a small success, it will grow exponentially.
If the system (each of the database, and the app, separately) is designed and implemented well, the changes will be easy, and the bugs will be few. Therefore, all the components in the app must be designed properly, to comply with applicable standards, and the database must be fully Normalised. This in turn minimises the effect of changes in the database on the app, and vice versa. The app will consist of simple, not complex, objects, which are easy to maintain and change.
For the data that you do cache on the client, you will use arrays of some form: multiple instances of a class in an OO platform; DataWindows (TM, google for it) or similar in a 4GL; simple arrays in PHP.
(Aside. Note that what people in situations such as yours produce in one year, professional providers produce far more quickly, using a commercial SQL platform, a commercial 4GL, and complying with Architecture and Standards.)
Architecture 6
So let's assume that you understand all the above, and appreciate its value, particularly Architecture 1 & 2. If you don't, please stop here and ask questions; do not proceed to the below.
Now that we have established the full context, we can address the crux of your problem. In those arrays in the app, why on Earth would you store flattened views of data, and consequently mess with, and agonise over, the problems, instead of storing copies of the Normalised tables?
Answer
Never duplicate anything that can be derived. That is an Architectural Principle, not limited to Normalisation in a database.
Never merge anything. If you do, you will be creating:
- data duplication, and masses of it, on the client. The client will not only be fat and slow, it will be anchored to the floor with the ballast of duplicated data.
- additional code, which is completely unnecessary
- complexity in that code
- code that is fragile, that will constantly have to change.
That is the precise problem you are suffering: a consequence of the method, which you know intuitively is wrong, and that there must be a better way. You know it is a generic and common problem.
Note also that that method, that code, constitutes a mental anchor for you. Look at the way that you have formatted it and presented it so beautifully: it is of importance to you. I am reluctant to inform you of all this. Which reluctance is easily overcome, due to your earnest and forthright attitude, and the knowledge that you did not invent this method.
In each code segment, at presentation time, as and when required:
a. In the commercial Client/Server context: execute a query that joins the simple, Normalised, unduplicated tables, and retrieves only the qualifying rows, thereby obtaining current data values. The user never sees stale data. Here, Views (flattened views of Normalised data) are often used.
b. In the sub-commercial non-server context: create a temporary result-set array, join the simple, unduplicated arrays (copies of tables that are cached, the currency of which is maintained by the background process), and populate it with only the qualifying rows from the source arrays. Use the Keys to form the joins between the arrays, in exactly the same way that Keys are used to form the joins in the Relational tables in the database. Destroy those components when the user closes the window. A clever version would eliminate the result-set array, join the source arrays via the Keys, and limit the result to the qualifying rows.
Separate to being architecturally incorrect, Nested Arrays or Nested Sets or JSON or JSON-like structures are simply not required. This is the consequence of confusing the Architecture 1 Principle. If you do choose to use such structures, then use them only for the temporary result-set arrays.
Last, I trust this discourse demonstrates that n tables is a non-issue. More important, that m levels deep in the data hierarchy, the "nesting", is a non-issue.
Answer 2
Now that I have given the full context (and not before), which removes the implications in your question, and makes it a generic, kernel one:
The question is about ANY server-side/relational-db. [Which is better]:
1. 2 loops, 5 simple "SELECT" queries
2. 1 loop, 1 "JOIN" query
The detailed examples you have given are not accurately described above. The accurate descriptions are:
Your Option 1:
- 2 loops, one loop for loading each array
- 1 single-table SELECT query per loop (executed n x m times ... the outermost loop, only, is a single execution)
Your Option 2:
- 1 joined SELECT query executed once
- followed by 2 loops, one loop for loading each array
For the commercial SQL platforms: neither, because it does not apply. The commercial SQL server is a set-processing engine. Use one query with whatever joins are required, that returns a result set. Never step through the rows using a loop; that reduces the set-processing engine to a pre-1970's ISAM system. Use a View, in the server, since it affords the highest performance and the code is in one place.
However, for the non-commercial, non-server platforms, where:
- your "server" is not a set-processing engine, ie. it returns single rows, and therefore you have to fetch each row and fill the array manually, or
- your "server" does not provide Client/Server binding, ie. it does not provide facilities on the client to bind the incoming result set to a receiving array, and therefore you have to step through the returned result set, row by row, and fill the array manually, as per your example,
then the answer is, by a large margin, your Option 2.
Please consider carefully, and comment or ask questions.
Response to Comment
Say I need to print this json (or other html page) to some STDOUT (example: an http response to GET /allUsersPhoneNumbers; it's just an example to clarify what I'm expecting to get), it should return this json. I have a php function that got these 2 result sets (1). Now it should print this json - how should I do that? This report could be an employee's monthly salary for a whole year, and so on. One way or another, I need to gather this information and represent it in a "JOIN"ed representation.
Perhaps I was not clear enough. Basically, do not use JSON unless you absolutely have to. Which means sending to some system that requires it, which means that receiving system, and that demand, is stupid. Make sure that your system doesn't make such demands on others.
Keep your data Normalised. Both in the database, and in whatever program elements that you write. That means (in this example) use one SELECT per table or array. That is for loading purposes, so that you can refer to and inspect them at any point in the program.
When you need a join, understand that it is:
- a result-set; a derived relation; a view
- therefore temporary; it exists for the duration of the execution of that element, only
a. For tables, join them in the usual manner, via Keys. One query, joining two (or more) tables.
b. For arrays, join arrays in the program, the same way you join tables in the database, via Keys.
For the example you have given, which is a response to some request, first understand that it is the category [4], and then fulfil it.
Why even consider JSON? What has JSON got to do with this? JSON is misunderstood, and people are interested in the wow factor. It is a solution looking for a problem. Unless you have that problem, it has no value. Check these two links:
- Copter - What is JSON
- Stack Overflow - What is JSON
Now if you understand that, it is mostly for incoming feeds. Never for outgoing. Further, it requires parsing, deconstructing, etc, before it can be used.
Recall:
I need to gather this information and represent it in a "JOIN"ed representation
Yes. That is pedestrian. Joined does not mean JSONed.
In your example, the receiver is expecting a flattened view (eg. a spreadsheet), with all the cells filled, and yes, for Users with more than one PhoneNumber, their User details will be repeated on the second and subsequent result-set rows. For any kind of print, eg. for debugging, I want a flattened view. It is just a:
SELECT ... FROM Person JOIN PhoneNumber
And return that. Or, if you fulfil the request from arrays, join the Person and PhoneNumber arrays, which may require a temporary result-set array, and return that.
please don't tell me you should get only 1 user at a time, etc. etc.
Correct. If someone tells you to regress to procedural processing (ie. row by row, in a WHILE loop), where the engine or your program has set processing (ie. processes an entire set in one command), that marks them as someone who should not be listened to.
I have already stated: your Option 2 is correct, Option 1 is incorrect. That is as far as the GET or SELECT is concerned.
On the other hand, for programming languages that do not have set-processing capability (ie. cannot print/set/inspect an array in a single command), or "servers" that do not provide client-side array binding, you do have to write loops, one loop per depth of the data hierarchy (in your example, two loops: one for Person, and one for PhoneNumber per User). You have to do that to parse an incoming JSON object. You have to do that to load each array from the result set that is returned in your Option 2. You have to do that to print each array from the result set that is returned in your Option 2.
Response to Comment 2
I meant I have to return a result represented in a nested version (let's say I'm printing the report to the page); json was just an example of such a representation.
I don't think you understand the reasoning and the conclusions I have provided in this answer. For printing and displaying, never nest. Print a flattened view, the rows returned from the SELECT per Option 2. That is what we have been doing, when printing or displaying data Relationally, for 31 years. It is easier to read, debug, search, find, fold, staple, mutilate. You cannot do anything with a nested array, except look at it, and say gee that is interesting.
Code
Caveat: I would prefer to take your code and modify it, but actually, looking at your code, it is not well written or structured; it cannot be reasonably modified. Second, if I use that, it would be a bad teaching tool. So I will have to give you fresh, clean code, otherwise you will not learn the correct methods. These code examples follow my advice, so I am not going to repeat it. And this is way beyond the original question.
Query & Print
Your request, using your Option 2. One SELECT executed once. Followed by one loop. Which you can "pretty up" if you like.
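The code itself is not reproduced above, but going by that description (one joined SELECT executed once, followed by one loop that prints the flattened rows), a minimal sketch in the question's own mysql_* style would be roughly:

$sql = "SELECT persons.id, persons.fullName, phoneNumbers.phoneNumber
        FROM persons
        LEFT JOIN phoneNumbers ON persons.id = phoneNumbers.personId
        ORDER BY persons.id";
$result = mysql_query($sql);

// one row per (person, phoneNumber) pair: print the flattened view as-is
while ($row = mysql_fetch_assoc($result)) {
    printf("%-4s %-10s %s\n",
        $row['id'], $row['fullName'], $row['phoneNumber']);
}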
In general, it is best practice to grab the data you need in as few trips to the database as possible and then map the data into the appropriate objects (Option 2). But to answer your question, I would ask what the use case for your data is. If you know for sure that you will be needing both your person and your phone number data, then I would say the second method is your best option. However, option one can also have its use case when the joined data is optional. One example of this could be that on the UI you have a table of all your people, and if a user wants to see the phone number for a particular person, they have to click on that person. Then it would be acceptable to "lazy-load" all of the phone numbers.
This is a common problem, especially if you are creating Web APIs; converting those table sets to nested arrays is a big deal. I always go for the second option (in a slightly different form, though), because the first is the worst possible way to do it. One thing I learned from experience is never query inside a loop; that is a waste of DB calls, well, you know what I am trying to say.
Although I don't accept everything PerformanceDBA said, there are two major points I need to address:
1. Don't have duplicate data
2. Fetch only the data you want
The only problem I see with joining the tables is that we end up duplicating data, lots of it. Take your data for example: joining the persons and phoneNumbers tables, we end up duplicating every person for each of his phone numbers. For two tables with a few hundred rows that's fine; imagine we need to merge 5 tables with thousands of rows, it's huge. So here's my solution:
Query:
SELECT id, fullName FROM persons;
SELECT personId, phoneNumber FROM phoneNumbers WHERE personId IN (SELECT id FROM persons);
So I get two tables in my result set. Now I assign Table[0] to my Person list and use 2 loops to place the right phoneNumbers into the right person.
Code:
personList = ConvertToEntity<List<Person>>(dataset.Table[0]);
pnoList    = ConvertToEntity<List<PhoneNumber>>(dataset.Table[1]);

foreach (person in personList) {
    foreach (pno in pnoList) {
        if (pno.PersonId == person.Id)
            person.PhoneNumbers.Add(pno);
    }
}
I think the above method avoids a lot of duplication and only gets me what I wanted. If there is any downside to the above method, please let me know... and thanks for asking these kinds of questions.
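For reference, a hedged PHP adaptation of that two-result-set idea (PDO and the $db handle are assumptions); indexing the phone numbers by personId first avoids looping over every (person, number) pair:

// query 1: all persons
$persons = $db->query('SELECT id, fullName FROM persons')->fetchAll(PDO::FETCH_ASSOC);

// query 2: all phone numbers, grouped by personId in a single pass
$numbers = array();
$rows = $db->query('SELECT personId, phoneNumber FROM phoneNumbers')->fetchAll(PDO::FETCH_ASSOC);
foreach ($rows as $r) {
    $numbers[$r['personId']][] = $r['phoneNumber'];
}

// stitch the two together without duplicating any person data
foreach ($persons as &$p) {
    $p['phoneNumbers'] = isset($numbers[$p['id']]) ? $numbers[$p['id']] : array();
}
unset($p);

echo json_encode($persons);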
SQL: What if serializing PHP array into string and put it into table cell?
The exact question: what about making a one-to-many relation like this: store the ids of the "many" side in a field on the "one" side, if we interact with it quite seldom and no deletes in the "many" table are expected?
Some other details. I have Dishes. Dishes consist of Products. Products have their own price. What if I did it this way:
Products columns: { Id, Name, PricePerOne }
Dish columns: { Id, Name, Content (a serialized [n x 2] PHP array with a ProductId and the Amount of that product in each row) }
And then unserialize it when necessary and calculate the exact sum, querying Products like this: WHERE Id IN ( ".explode(..)." ), or even caching this figure.
So, I must have said something wrong, but I don't need to compare this serialized string or even do anything with it. It would simply be used when querying the price. Actually, that is quite close to a one-to-many relation: each dish relates to a few products. So I simply store data about the amount of each product I need in this exact dish record.
I would advise against serializing anything you intend to search, or use in a "WHERE" clause. The only times I have serialized data to put in my database are for logging full sets of POST or GET variables for later debugging, or for caching an array of data. In each case, I don't need to make any comparisons in SQL against the serialized string.
You may THINK it's easier to work with serialized data, but you're just going to end up tearing out your hair when your app runs at a glacial rate. You're negating the entire purpose of using a database. Sit back and rethink your app from the ground up BEFORE you start down this path.
I would garner that the overhead on executing requests on a serialized DB format vs a relational one such as SQL increases exponentially with the complexity of the query you are performing. Not only that, error checking and performing any number of more complex processes (Join? Union?) would doubtless cause premature aging...
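For comparison, the normalized alternative to a serialized Content column might look roughly like this (the DishProducts table is invented for illustration; Products and Dish follow the columns given in the question):

CREATE TABLE DishProducts (
    DishId    INT NOT NULL,
    ProductId INT NOT NULL,
    Amount    INT NOT NULL,
    PRIMARY KEY (DishId, ProductId)
);

-- the price of one dish, computed entirely in SQL
SELECT d.Id, d.Name, SUM(p.PricePerOne * dp.Amount) AS Price
FROM Dish d
JOIN DishProducts dp ON dp.DishId = d.Id
JOIN Products p ON p.Id = dp.ProductId
WHERE d.Id = 1
GROUP BY d.Id, d.Name;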
Creating an efficient friendlist using PHP
I would like to build a website that has some elements of a social network. So I have been trying to think of an efficient way to store a friend list (somewhat like Facebook), and after searching a bit, the only suggestion I have come across is making a "table" with two "ids" indicating a friendship. That might work on small websites, but it doesn't seem efficient one bit. I have a background in Java but I am not proficient enough with PHP.
An idea has crossed my mind which I think could work pretty well; the problem is I am not sure how to implement it. The idea is to have all the "id"s of your friends saved in a tree data structure, where each node in that tree represents one digit of a friend's id, starting with 1 node and adding more nodes as the user adds friends (a bit like Lempel-Ziv). Every node can point to 11 other nodes: 0 to 9 and X, where "X" marks the end of an id. For example, consider a tree in which the user has 4 friends with the following "id"s: 0, 143, 1436, 15.
Update: as it might have been unclear before, the idea is that every user will have a tree in the form of a multidimensional array, in which the existence of the pointers themselves indicates the friend's "id". If every user had such a multidimensional array, searching whether id "y" is a friend of mine, deleting id "y" from my friend list, or adding id "y" to my friend list would all require constant time O(1), without depending on the number of users the website might have. The only drawback is that taking such a huge array, serializing it, and pushing it into each row of the table just doesn't seem right.
- Is this even possible to implement?
- Would using serialization to insert that tree into a table be practical?
- Is there any better way of doing this?
The benefit upon which I chose this is that even with a really large number of ids (millions or billions), the search, add, and delete time depends only on the number of digits. I'd greatly appreciate any help with implementing this or any suggestions for alternative ways to improve or change this method.
I would strongly advise against this.
Storage savings are not significant, and may (probably?) be worse. In a real dataset, the actual space savings afforded to you by this approach are minimal. Computing the average savings is a very difficult problem, but use some real numbers and try a few samples with random IDs. If you have a million users, consider a user with 15 friends. How much data do you save with this approach? You may actually use more space, since tree adjacency models can require significant data.
"Rendering" a list of users requires CPU investment.
Inserts are non-deterministic and non-trivial. When you add a new user to an existing tree, you will have a variety of methods of inserting them. Assuming you don't choose arbitrarily, it is difficult to compute which approach is the best (and it would only be based on heuristics).
These are the big ones that came to my mind. But generally, I think you are over-thinking this.
You should check out OQGRAPH, the Open Query graph storage engine. It is designed to handle efficient tree and graph storage for MySQL. You can also check out my presentation Models for Hierarchical Data with SQL and PHP, or my answer to What is the most efficient/elegant way to parse a flat table into a tree? here on Stack Overflow. I describe a design I call Closure Table, which records all paths between ancestors and descendants in a hierarchy.
You say 'using PHP' in the title, but this seems to be just a database question at its heart. And believe it or not, the linking table is by far the best way to go, especially if you have millions or billions of users. It would be faster to process, easier to handle in the PHP code, and smaller to store.
Update
Users table:
id | name   | moreInfo
1  | Joe    | stuff
2  | Bob    | stuff
3  | Katie  | stuff
4  | Harold | stuff
Friendship table:
left | right
1    | 4
1    | 2
3    | 1
3    | 4
In this example Joe knows everyone and Katie knows Harold. This is of course a simplified example. I'd love to hear if someone has better logic for the left and right, and an explanation as to why.
Update
I gave some PHP code in a comment below but it was marked up wrong, so here it is again:
$sqlcmd = sprintf(
    'SELECT IF(`left` = %1$d, `right`, `left`) AS "friend"
     FROM `friendship`
     WHERE `left` = %1$d OR `right` = %1$d',
    $userid
);
A few ideas: ordered lists (searching through an ordered list is fast, though the ordering itself might be heavier); horizontally partitioning the data; getting rid of premature optimizations.
To serialize or to keep a separate table?
This question has arisen on many different occasions for me, but it's hard to explain without giving a specific example. So here goes: let's imagine for a while that we are creating an issue tracker database in PHP/MySQL. There is a "tasks" table. Now you need to keep track of people who are associated with a particular task (have commented or what not). These persons will get an email when a task changes. There are two ways to solve this type of situation. One is to create a separate table task_participants:
CREATE TABLE IF NOT EXISTS `task_participants` (
    `task_id` int(10) unsigned NOT NULL,
    `person_id` int(10) unsigned NOT NULL,
    UNIQUE KEY `task_id_person_id` (`task_id`,`person_id`)
);
And to query this table with SELECT person_id FROM task_participants WHERE task_id='XXX'. If there are 5000 tasks and each task has 4 participants on average (the reporter, the subject for whom the task brought benefit, the solver, and one commenter), then the task_participants table would be 5000*4 = 20 000 rows.
There is also another way: create a field in the tasks table and store a serialized array (JSON or PHP serialize()) of person_ids. Then there would be no need for this 20 000 row table.
What are your comments, which way would you go?
Go with the multiple records. It promotes database normalization. Normalization is very important. Updating a serialized value is no fun to maintain. With multiple records I can let the database do the work with INSERT, UPDATE and DELETE. Also, you are limiting your future joins by having a multivalued column.
Definitely do the cross reference table (the first option you listed). Why? First of all, do not worry about the size of the cross reference table. Relational databases would have been out on their ear decades ago if they could not handle the scale of a simple cross reference table. Stop worrying about 20K or 200K records, etc. In fact, if you're going to worry about something like this, it's better to start worrying about why you've chosen a relational DB instead of a key-value DB. After that, and only when it actually starts to be a problem, then you can start worrying about adding an index or other tuning techniques. Second, if you serialize the association info, you're probably opaque-ifying a whole dimension of your data that only your specialized JSON-enabled app can query. Serialization of data into a single cell in a table typically only makes sense if the embedded structure is (a) not something that contains data you would never need to query outside your app, (b) is not something you need to query the internals of efficiently (e.g., avg count(*) of people with tasks), and (c) is just something you either do not have time to model out properly or is in a prototypical state. So I say probably above, because it's not usually the case that data worth persisting fits these criteria. Finally, by serializing your data, you are now forced to solve any computation on that serialized data in your code, which is just a big waste of time that you could have spent doing something more productive. Your database already can slice and dice that data any way you need, yet because your data is not in a format it understands, you need to now do that in your code. And now imagine what happens when you change the serialized data structure in V2. I won't say there aren't use cases for serializing data (I've done it myself), but based on your case above, this probably isn't one of them.
There are a couple of great answers already, but they explain things in rather theoretical terms. Here's my (essentially identical) answer, in plain English: 1) 20k records is nothing to MySQL. If it gets up into the 20 million record range, then you might want to start getting concerned - but it still probably won't be an issue. 2) OK, let's assume you've gone with concatenating all the people involved with a ticket into a single field. Now... Quick! Tell me how many tickets Alice has touched! I have a feeling that Bob is screwing things up and Charlie is covering for him - can you get me a list of tickets that they both worked on, divided up by who touched them last? With a separate table, MySQL itself can find answers to all kinds of questions about who worked on what tickets and it can find them fast. With everything crammed into a single field, you pretty much have to resort to using LIKE queries to find the (potentially) relevant records, then post-process the query results to extract the important data and summarize it yourself.
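To make that concrete, these are the kinds of queries the separate table makes trivial (written against the task_participants table from the question; the person ids are made up for the example):

-- how many tasks has Alice (person_id = 1, say) touched?
SELECT COUNT(*) FROM task_participants WHERE person_id = 1;

-- tasks that both Bob (2) and Charlie (3) worked on
SELECT task_id
FROM task_participants
WHERE person_id IN (2, 3)
GROUP BY task_id
HAVING COUNT(DISTINCT person_id) = 2;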