data grabbing/restructuring speed - php

This question might get flagged as too broad or opinion-based, but I'll take the risk...
I have a REST API in PHP that fetches all data from a MySQL table, including 'hasMany' fields. Let's say a 'post' hasMany 'comments'.
Right now I'm doing ONE select with a LEFT JOIN on comments, then walking through the results to restructure the output to
{ "posts": [
{"id": 1,
"comments": [1,2,3]
},
....
]}
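For a single hasMany, the restructuring walk is simple enough. A minimal sketch of that step (the row keys post_id and comment_id are assumed for illustration, not taken from my actual schema):

$posts = [];
foreach ($rows as $row) {
    $id = $row['post_id'];
    if (!isset($posts[$id])) {
        $posts[$id] = ['id' => $id, 'comments' => []];
    }
    if ($row['comment_id'] !== null) {  // LEFT JOIN may yield posts without comments
        $posts[$id]['comments'][] = (int)$row['comment_id'];
    }
}
$output = ['posts' => array_values($posts)];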
All is fine until I have more than one hasMany field, because then the restructuring gets complicated (it now produces duplicate entries) and I would need to loop through the result (not manually, but still with built-in functions) several times.
So I thought about refactoring my code to:
1. select the actual item ('post')
2. select all hasMany fields ('comments', 'anythingelse',...) and add the results.
which of course produces loads of queries against my db.
So my question is whether anybody has a simple answer like 'better to grab all the data in one go from the database and do the work in PHP', or the opposite.
Yes, I could do benchmarks myself. But first, to be honest, I would like to avoid all the reprogramming just to find out it's slower; second, I don't know if my benchmark would hold on an optimized (and Linux) production machine (right now I'm developing with EasyPHP on Windows).
Some info:
The 'post' table could contain a few hundred records, and each hasMany table about the same. Combined with several hasMany fields, the first approach could produce a recordset of several thousand rows.

Use the IN (…) operator.
First, get the relevant posts on their own:
SELECT […stuff…] FROM posts WHERE […conditions…]
Then take the list of post IDs from the results you get there and substitute the whole list into a set of queries of the form:
SELECT […stuff…] FROM comments WHERE post_id IN (1, 2, 3 […etc…])
SELECT […stuff…] FROM anythingelse WHERE post_id IN (1, 2, 3 […etc…])
Running one query per dependent table is fine. It's not significantly more expensive than running a single JOINed query; in fact, it may be less expensive, as there's no duplication of the fields from the parent table.
Make sure the post_id column is indexed on the subtables, of course.
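In PHP, the whole flow might look like this sketch (PDO is assumed, and the column names and conditions on posts are illustrative, not from the actual schema):

// 1) fetch the parent rows and index them by id
$stmt = $dbh->prepare('SELECT id, title FROM posts WHERE published = 1');
$stmt->execute();
$posts = [];
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    $row['comments'] = [];
    $posts[$row['id']] = $row;
}

// 2) one IN (...) query per dependent table, then attach the children
if ($posts) {
    $ids = array_keys($posts);
    $in  = implode(',', array_fill(0, count($ids), '?'));
    $stmt = $dbh->prepare("SELECT id, post_id FROM comments WHERE post_id IN ($in)");
    $stmt->execute($ids);
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $c) {
        $posts[$c['post_id']]['comments'][] = $c['id'];
    }
}

Repeat step 2 for each additional hasMany table; the post rows are fetched only once.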

The best alternative that I can think of would be along the lines of:
$stmt = $dbh->prepare('SELECT [fields] FROM posts WHERE [conditions]');
$stmt->execute([...]);
$posts = $stmt->fetchAll(PDO::FETCH_ASSOC);

$stmt = $dbh->prepare('SELECT id FROM comments WHERE post_id = ?');
for ($i = 0; $i < count($posts); $i++) {
    $stmt->execute([$posts[$i]['id']]);  // execute() takes an array of parameters
    $posts[$i]['comments'] = $stmt->fetchAll();
}
You need to decide if the work/overhead tradeoff of dealing with "duplicate" data as a result of the join is more or less than that of separately retrieving the comments for each post.
Chances are, if you were using an ORM, something along the lines of the above would be happening automagically.

Related

Good practice for handling naturally JOINed results across an application

I'm working on an existing application that uses some JOIN statements to create "immutable" objects (i.e. the results are always JOINed to create a processable object - results from only one table will be meaningless).
For example:
SELECT r.*,u.user_username,u.user_pic FROM articles r INNER JOIN users u ON u.user_id=r.article_author WHERE ...
will yield a result of type, let's say, ArticleWithUser that is necessary to display an article with the author details (like a blog post).
Now, I need to make a table featured_items which contains the columns item_type (article, file, comment, etc.) and item_id (the article's, file's or comment's id), and query it to get a list of the featured items of some type.
Assuming tables other than articles contain whole objects that do not need JOINing with other tables, I can simply pull them with a dynamically generated query like
SELECT some_table.* FROM featured_items RIGHT JOIN some_table ON some_table.id = featured_items.item_id WHERE featured_items.type = X
But what if I need to get a featured item from the aforementioned type ArticleWithUser? I cannot use the dynamically generated query because the syntax will not suit two JOINs.
So, my question is: is there a better practice to retrieve results that are always combined together? Maybe do the second JOIN on the application end?
Or do I have to write special code for each of those combined results types?
Thank you!
A view can be thought of as a table, for the faint of heart.
https://dev.mysql.com/doc/refman/5.0/en/create-view.html
Views can incorporate joins, and other views. Keep in mind that upon creation they take a snapshot of the columns in existence at that time on the underlying tables, so ALTER TABLE statements adding columns to those tables are not picked up by SELECT *.
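To make that concrete for the ArticleWithUser case from the question, a view might look like this (a sketch; the column list is copied from the question's query, and the view name is illustrative):

CREATE VIEW article_with_user AS
SELECT r.*, u.user_username, u.user_pic
FROM articles r
INNER JOIN users u ON u.user_id = r.article_author;

The dynamically generated featured_items query can then join against article_with_user exactly as if it were a single table.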
There is an old article by Peter Zaitsev which I consider required reading on the subject of MySQL views.
To answer your question as to whether they are widely used, they are a major part of the database developer's toolkit, and in some situations offer significant benefits, which have more to do with indexing than with the nature of views, per se.

What's faster, db calls or resorting an array?

In a site I maintain I have a need to query the same table (articles) twice, once for each category of article. AFAICT there are basically two ways of doing this (maybe someone can suggest a better, third way?):
1. Perform the db query twice, meaning the db server has to sort through the entire table twice. After each query, I iterate over the cursor to generate HTML for a list entry on the page.
2. Perform the query just once and pull out all the records, then sort them into two separate arrays. After this, I have to iterate over each array separately in order to generate the HTML.
So it's this:
$newsQuery = $mysqli->query("SELECT * FROM articles WHERE type='news'");
while ($newRow = $newsQuery->fetch_assoc()) {
    // generate article summary in html
}
// repeat for informational articles
vs this:
$query = $mysqli->query("SELECT * FROM articles");
$news = array();
$info = array();
while ($row = $query->fetch_assoc()) {
    if ($row['type'] == "news") {
        $news[] = $row;
    } else {
        $info[] = $row;
    }
}
// iterate over each array separately to generate article summaries
The recordset is not very large, currently <200, and will probably grow to 1000-2000. Is there a significant difference in the times between the two approaches, and if so, which one is faster?
(I know this whole thing seems awfully inefficient, but it's a poorly coded site I inherited and have to take care of without a budget for refactoring the whole thing...)
I'm writing in PHP, no framework :( , on a MySql db.
Edit
I just realized I left out one major detail. On a given page in the site, we will display (and thus retrieve from the db) no more than 30 records at once - but here's the catch: 15 info articles, and 15 news articles. On each page we pull the next 15 of each kind.
You know you can sort in the DB right?
SELECT * FROM articles ORDER BY type
EDIT
Due to the change made to the question, I'm updating my answer to address the newly revealed requirement: 15 rows for 'news' and 15 rows for not-'news'.
The gist of the question is the same: "which is faster... one query or two separate queries". The gist of the answer remains the same: each database roundtrip incurs overhead (extra time, especially over a network connection to a separate database server), so with all else being equal, reducing the number of database roundtrips can improve performance.
The new requirement really doesn't impact that. What the newly revealed requirement really impacts is the actual query to return the specified resultset.
For example:
( SELECT n.*
FROM articles n
WHERE n.type='news'
LIMIT 15
)
UNION ALL
( SELECT o.*
FROM articles o
WHERE NOT (o.type<=>'news')
LIMIT 15
)
Running that statement as a single query is going to require fewer database resources, and be faster than running two separate statements, and retrieving two disparate resultsets.
We weren't provided any indication of what the other values for type can be, so the statement offered here simply addresses two general categories of rows: rows that have type='news', and all other rows that have some other value for type.
That query assumes that type allows for NULL values, and we want to return rows that have a NULL for type. If that's not the case, we can adjust the predicate to be just
WHERE o.type <> 'news'
Or, if there are specific values for type we're interested in, we can specify that in the predicate instead
WHERE o.type IN ('alert','info','weather')
If "paging" is a requirement... "next 15", the typical pattern we see applied, LIMIT 30,15 can be inefficient. But this question isn't asking about improving efficiency of "paging" queries, it's asking whether running a single statement or running two separate statements is faster.
And the answer to that question is still the same.
ORIGINAL ANSWER below
There's overhead for every database roundtrip. In terms of database performance, for small sets (like you describe) you're better off with a single database query.
The downside is that you're fetching all of those rows and materializing an array. (But it looks like that's the approach you're using in either case.)
Given the choice between the two options you've shown, go with the single query. That's going to be faster.
As far as a different approach, it really depends on what you are doing with those arrays.
You could actually have the database return the rows in a specified sequence, using an ORDER BY clause.
To get all of the 'news' rows first, followed by everything that isn't 'news', you could
ORDER BY type<=>'news' DESC
That's MySQL shorthand for the more ANSI-standards-compliant:
ORDER BY CASE WHEN t.type = 'news' THEN 1 ELSE 0 END DESC
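Putting that together, a complete statement might look like this (a sketch, reusing the table and column names from the question):

SELECT *
FROM articles
ORDER BY type <=> 'news' DESC;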
Rather than fetch every single row and store it in an array, you could just fetch from the cursor as you output each row, e.g.
while ($row = $query->fetch_assoc()) {
    echo "<br>Title: " . htmlspecialchars($row['title']);
    echo "<br>byline: " . htmlspecialchars($row['byline']);
    echo "<hr>";
}
The best way of dealing with a situation like this is to test it for yourself. It doesn't matter how many records you have at the moment; you can simulate whatever amount you'd like, that's never a problem. Also, 1000-2000 is really a small set of data.
I somewhat don't understand why you'd have to iterate over all the records twice. You should never retrieve all the records in a query either way, only the small subset you need to be working with. In a typical site where you manage articles it's usually about 10 records per page MAX. No user will ever go through 2000 articles in a way that would require you to pull all the records at once. Utilize paging and smart querying, as sketched below.
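For example, a paging query for the 'news' articles might look like this (a sketch; the ORDER BY column is assumed):

SELECT * FROM articles WHERE type = 'news' ORDER BY id DESC LIMIT 15 OFFSET 15;  -- second page of 15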
// iterate over each array separately to generate article summaries
I'm not really sure what you mean by this, but something tells me this data should be stored in the database as well. I really hope you're not generating article excerpts on the fly for every page hit.
It all sounds to me more like a bad architecture design than anything else...
PS: I believe sorting/ordering/filtering of a database data should be done on the database server, not in the application itself. You may save some traffic by doing a single query, but it won't help much if you transfer too much data at once, that you won't be using anyway.

One SELECT (to fetch/rule them all!) then handle the entire collection using Ruby array functions?

I'm wondering about looping through a set of rows and, for each row, fetching another table's set of rows. For example, looping through a series of categories and, for each category, fetching all its news articles, perhaps to display on a single page. It seems like a lot of SELECT queries: 1 to get all categories, and one per category (to get its articles). So, my question is: is it quicker to simply do two fetches at the start:
categories = Category.all
articles = Articles.all
...and then just use select() or where() on articles by category id, taking only the matching entries from the articles array? Replacing multiple SELECT queries with multiple array functions: which is quicker? I imagine the answer may vary from app to app depending on the number of rows. I would be interested to hear what people think, or any links that clarify this, as I didn't find much on the matter myself.
My code example above is Ruby on Rails but this question might actually apply to any given language. I also use PHP from time to time.
It depends on what you want to do with your data. You could try eager loading.
categories = Category.includes(:articles)
Here's the documentation. http://guides.rubyonrails.org/active_record_querying.html#eager-loading-associations
I think you're describing what's called the N+1 problem (I'm new to this too). Here's another stack overflow question that addresses this issue generally: What is SELECT N+1?
n+1 is the worst, especially when you think about 10k or 10M articles like timpone pointed out. For 10M articles you'll be hitting the DB 10,000,001 times for a single request (hence the name n + 1 problem). Avoid this. Always. Anything is better than this.
If Category has a has_many relation to Article (and Article has a belongs_to relation to Category) you could use #includes to "pre-fetch" the association like so:
categories = Category.includes(:articles)
This will do two queries, one for the Category and one for the Article. You can write it out as two explicit select/where statements, but I think doing it this way is semantically clearer. If you want to retrieve all the categories and then, for each category, get all the articles, you can write code like this (in Ruby):
categories.each do |category|
  category.articles.each do |article|
    # do stuff...
  end
end
and it's immediately clear that you mean "all the articles for this category instance".

PHP/MySQL: Massive SQL query or several smaller queries?

I have a database design here that looks like this in simplified form:
Table building:
id
attribute1
attribute2
Data in there is like:
(1, 1, 1)
(2, 1, 2)
(3, 5, 4)
And the tables, attribute1_values and attribute2_values, structured as:
id
value
Which contains information like:
(1, "Textual description of option 1")
(2, "Textual description of option 2")
...
(6, "Textual description of option 6")
I am unsure whether this is the best setup or not, but it was done this way per the requirements of my project manager. There is definitely some truth in it, as you can now modify the text easily without messing up the ids.
However now I have come to a page where I need to list the attributes, so how do I go about there? I see two major options:
1) Make one big query which gathers all values from building and at the same time picks the correct textual representation from the attribute{x}_values table.
2) Make a small query that gathers all values from the building table. Then after that get the textual representation of each attribute one at a time.
What is the best option to pick? Is option 1 even faster than option 2 at all? If so, is it worth the extra trouble in terms of maintenance?
Another suggestion would be to create a view on the server with only the data you need and query from that. That would keep the work on the server end, and you can pull just what you need each time.
If you have a small number of rows in the attribute tables, I suggest fetching them first (all of them!) and storing them in arrays keyed by id.
Then, as you proceed through the building data, you just look up each attribute value in the respective array, as sketched below.
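A minimal sketch of that approach (mysqli assumed; table and column names taken from the question):

// load each small attribute table once, keyed by id
$att1 = [];
$res = mysqli_query($connection, "SELECT id, value FROM attribute1_values");
while ($row = mysqli_fetch_assoc($res)) {
    $att1[$row['id']] = $row['value'];
}
// ...repeat for attribute2_values into $att2...

// then resolve attributes with plain array lookups, no further queries
$res = mysqli_query($connection, "SELECT id, attribute1, attribute2 FROM building");
while ($row = mysqli_fetch_assoc($res)) {
    echo $att1[$row['attribute1']];
}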
I would recommend something in between: parse the result from the first table in PHP, and figure out which attributes you need to select from each attribute[x]_values table.
You can then select those attributes in bulk, using one query per table rather than one query per attribute or one query per building, as in the sketch below.
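A sketch of that bulk selection (mysqli assumed; only attribute1 is shown, attribute2 works the same way):

// collect the distinct attribute1 ids actually used
$buildings = [];
$ids = [];
$res = mysqli_query($connection, "SELECT id, attribute1, attribute2 FROM building");
while ($row = mysqli_fetch_assoc($res)) {
    $buildings[] = $row;
    $ids[$row['attribute1']] = true;
}

// one bulk query for just those ids
$in  = implode(',', array_map('intval', array_keys($ids)));
$res = mysqli_query($connection, "SELECT id, value FROM attribute1_values WHERE id IN ($in)");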
Here is a PHP solution:
$query = "SELECT * FROM building";
$result = mysqli_query(connection,$query);
$query = "SELECT * FROM attribute1_values";
$result2 = mysqli_query(connection,$query);
$query = "SELECT * FROM attribute2_values";
$result3 = mysqli_query(connection,$query);
$n = mysqli_num_rows($result);
for($i = 1; $n <= $i; $i++) {
$row = mysqli_fetch_array($result);
mysqli_data_seek($result2,$row['attribute1']-1);
$row2 = mysqli_fetch_array($result2);
$row2['value'] //Use this as the value for attribute one of this object.
mysqli_data_seek($result3,$row['attribute2']-1);
$row3 = mysqli_fetch_array($result3);
$row3['value'] //Use this as the value for attribute one of this object.
}
Keep in mind that this solution requires that the ids in attribute1_values and attribute2_values start at 1 and increase by 1 with every single row.
Oracle / Postgres / MySql DBA here:
Running a query many times has quite a bit of overhead. There are multiple round trips to the db, and if it's on a remote server, this can add up. The DB will likely have to parse the same query multiple times, which in MySQL will be terribly inefficient if there are tons of rows. Now, one thing your PHP method (multiple queries) has as an advantage is that it'll use less memory, as it releases results once they're no longer needed (if you run the query as a nested loop, that is; if you fetch all the results up front, you'll have a lot of memory overhead, depending on the table sizes).
The optimal result would be to run it as one query, and fetch the results one at a time, displaying each one as needed and discarding it, which can wreak havoc with MVC frameworks unless you're either comfortable running model code in your view, or run small view fragments.
Your question is very generic, and I think that to get an answer you should give more hints as to what this page will look like and how big the dataset is.
Will you get all the buildings with their attributes, or just one at a time?
Your data structure looks very simple, and anything more powerful than a Raspberry Pi can handle it very well.
If you need one record at a time you don't need any special technique; just JOIN the tables.
If you need to list all buildings and you want to save db time, you have to measure your data.
If you have more attributes than buildings you have to choose one way; if you have 8 attributes and 2000 buildings, you can think about caching the attributes in arrays, with one select per table, and then just printing them using the arrays. I don't think you will see any speed drop or improvement with such simple tables on a modern computer.
$att1[1] = 'description1';
$att1[2] = 'description2';
...
Never do one-at-a-time queries; try to combine them into a single one.
MySQL will cache your query and it will run much faster. PHP loops are faster than making many requests to the database.
The query cache stores the text of a SELECT statement together with the corresponding result that was sent to the client. If an identical statement is received later, the server retrieves the results from the query cache rather than parsing and executing the statement again.
http://dev.mysql.com/doc/refman/5.1/en/query-cache.html

Optimal method for retrieving two levels of hierarchical data from MySQL

There seems to be no shortage of hierarchical data questions in MySQL on SO, however it seems they are mostly talking about managing such data in the database or actually retrieving recursively hierarchical data. My situation is neither. I have a grid of items I need to display. Each item can also have 0 or more comments associated with it. Right now, both the item, along with its data, are displayed in the grid as well as any comments belonging to that item. Usually there is some sort of drill down, dialog, or other user action required to see child data for a grid item but in this case we display both parent and child data in the same grid. Might not fit the de facto standards but it is what it is.
Right now the comments are retrieved by a separate MySQL query for every single parent item in the grid. I immediately cringe at this, being aware of all the completely separate database queries that have to be run for a single page load. I haven't profiled, but I wouldn't be too surprised if this is part of the slow page loads we sometimes see. I'd like to ideally bring this down to a single query, or perhaps two. However, I'm having difficulty coming up with a solution that sounds any better than what is currently being done.
My first thought was to flatten the comment children for each row with some sort of separator like '|' and then explode them back apart in PHP when rendering the page. The issue with this is it gets increasingly complicated with having to separate each field in a comment, and then each comment, and then account for the possibility of separator characters in the data. Just feels like a mess to maintain and debug.
My next thought was to left outer join the comments to the items and just account for the item duplicates in PHP. I'm working with CodeIgniter's database library, which returns a PHP array for database data. This sounds like potentially a lot of duplicated data in the resulting array, which could be taxing on the system for larger result sets. I'm thinking in most cases it wouldn't be too bad, though, so this option is currently at the top of my possibilities list.
Ideally, if I understand MVC correctly, I should keep my database, business logic, and view/display as separate as possible. So again, ideally, there should not be any database "quirks" (for lack of a better word) apparent in the data returned by the model. That is, whatever calls for data from this model method shouldn't be concerned with duplicate data like this. So I'd have to add an additional loop to eliminate the duplicate item array entries, but only after I have retrieved all the child comments and placed them into their own array.
Two queries is another idea but then I have to pass numerous item IDs in the SQL statement for the comments and then go through and zip all the data together manually in PHP.
My goal isn't to get out of doing work here but I am hoping there is some more optimal (less resource intensive and less confusing to the coder) method I haven't thought of yet.
As you state in your question, using a join will bring back a lot of duplicate information. It should be simple enough to remove in PHP, but why bring it back in the first place?
Compiling a SQL statement with a list of IDs retrieved from the query for your list of items shouldn't be a problem (see cwallenpoole's answer). Alternatively, you could create a sub-query so that MySQL recreates the list of IDs for you - it depends on how intensive the sub-query is.
Select your items:
SELECT * FROM item WHERE description = 'Item 1';
Then select the comments for those items:
SELECT * FROM comment WHERE item_id IN (
SELECT id FROM item WHERE description = 'Item 1'
);
For the most part, I solve this type of problem using some sort of ORM lazy-loading system, but it does not look like you have that as an option.
Have you considered:
1. Select all top-level items.
2. Select all second-level items by the IDs in the top-level set.
3. Associate the objects retrieved in step 2 with the items found in step 1 in PHP.
Basically (in pseudo-code)
$stmt = $pdo->query("SELECT ID /*columns*/ FROM ENTRIES");
$entries = array();
foreach( $row as $stmt->fetchAll(PDO::FETCH_ASSOC) )
{
$row['child-entities'] = array();
$entries[$row['id']] = $row;
}
$ids = implode(',',array_keys($entries));
$stmt = $pdo->query("SELECT PARENT_ID /*columns*/ FROM children WHERE PARENT_ID IN ($ids)");
foreach( $row as $stmt->fetchAll(PDO::FETCH_ASSOC) )
{
$entries[$row['parent_pid']]['child-entities'][] = $row;
}
$entries will now be an associative array with parent items directly associated with child items. Unless recursion is needed, that should be everything in two queries.
