I'm working on a management system for a small library. I proposed replacing the Excel spreadsheet they are using now with something more robust and professional like PhpMyBibli - https://en.wikipedia.org/wiki/PhpMyBibli - but they are scared by the number of fields to fill in, and the interfaces are not fully translated into Italian.
So I made a very trivial DB, with basically a table for the authors and a table for the books. The authors table exists because I'm tired of having to explain that "Gabriele D'Annunzio" != "Gabriele d'Annunzio" != "Dannunzio G." and so on.
My test tables are now populated with ~ 100k books and ~ 3k authors, both with plausible random text, to check the scripts under pressure.
For the public consultation I want to make an interface like that of Gallica, the website of the Bibliothèque nationale de France, which I find pretty useful. A sample can be seen here: http://gallica.bnf.fr/Search?ArianeWireIndex=index&p=1&lang=EN&f_typedoc=livre&q=Computer&x=0&y=0
The concept is pretty easy: for each menu, e.g. the author one, I generate a fancy <select> field with all the names retrieved from the DB, and this works smoothly.
The issue arises when I try to add, beside every author name, the number of books, as Gallica does, in this way (warning - conceptual code, not actual PHP):
SELECT id, surname, name FROM authors
foreach row {
    SELECT COUNT(*) AS num FROM books WHERE id_auth = id
    echo "<option>$surname, $name ($num)</option>";
}
With the code above, a core of the CPU jumps to 100%, and no results are shown in the browser. Not surprising, since that's 3k queries against a 100k-row table in a very short time.
Just to try, I added a LIMIT 100 to the first query (on the authors table). The page then took 3 seconds to generate, and 15 seconds when I raised the LIMIT to 500 (it seems to grow linearly). But of course I can't show library users a truncated list of authors.
I don't know which hardware/software Gallica uses to achieve their results, but I bet their budget is far above that of a small village library using second-hand computers.
Do you think adding a "number_of_books" field to the authors table, updated every time a new book is inserted, could be a practical solution, rather than scanning the whole books table at every request?
BTW, a similar procedure must be done for the publication date, the language, the theme, and some other fields, so the query time will take further hits, even if those tables are a lot smaller than the authors one.
Your query style is very inefficient - try a JOIN with GROUP BY instead:
SELECT
authors.id,
authors.surname,
authors.name,
COUNT(books.id) AS numbooks
FROM authors
INNER JOIN books ON books.id_auth=authors.id
GROUP BY authors.id
ORDER BY numbooks DESC
;
EDIT
Just to clear up some things I didn't explicitly say:
Of course you no longer need a query inside the PHP loop, just the displaying portion
Indexes on books.id_auth and authors.id (the latter primary or unique) are assumed (see the sketch below)
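In case the former index is missing, it could be created like this (the index name is illustrative):

ALTER TABLE books ADD INDEX idx_id_auth (id_auth);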
EDIT 2
As @GordonLinoff pointed out, the IFNULL() is redundant in an inner join, so I removed it.
To get all themes, even those without any books, just use a LEFT JOIN (this time keeping the IFNULL(), in case your provider's MySQL is old):
SELECT
themes.id,
themes.main,
themes.sub,
IFNULL(COUNT(books.theme),0) AS num
FROM themes
LEFT JOIN books ON books.theme=themes.id
GROUP BY themes.id
;
EDIT 3
Of course a stored value will give you the best performance - but this denormalization comes at a cost: your database now has the potential to become inconsistent in a user-visible way.
If you do go with this method, I strongly recommend you use triggers to auto-fill this field (and of course those triggers must sit on the books table); see the sketch below.
Be prepared to see slowed-down inserts - this might of course be okay, as I guess you will see a much higher rate of SELECTs than INSERTs.
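A minimal sketch of such triggers, using the number_of_books field name proposed in the question (the delete case is added here as an assumption):

-- Keep authors.number_of_books in sync with the books table
CREATE TRIGGER books_ai AFTER INSERT ON books
FOR EACH ROW
UPDATE authors SET number_of_books = number_of_books + 1
WHERE id = NEW.id_auth;

CREATE TRIGGER books_ad AFTER DELETE ON books
FOR EACH ROW
UPDATE authors SET number_of_books = number_of_books - 1
WHERE id = OLD.id_auth;

If id_auth can change via an UPDATE on books, that case needs a third trigger along the same lines.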
After reading a lot about how the JOIN statement works, with the help of useful answer 1 and useful answer 2, I discovered I had used it some 15 or 20 years ago, then forgot about it since I never needed it again.
I made a test using the options I had:
reply with the JOIN query with IFNULL(): 0.5 seconds
reply with the JOIN query without IFNULL(): 0.5 seconds
reply using a stored value: 0.4 seconds
That DB will run on some single-core old iron, so I think a 20% difference could be significant, and I decided to use stored values, updating the count every time a new book is inserted (i.e. not often).
Anyway thanks a lot for having refreshed my memory: JOIN queries will be useful somewhere else in my DB.
update
I used the JOIN method above to query the book themes, which are stored in a far smaller table, in this way:
SELECT themes.id, themes.main, themes.sub, COUNT(books.theme) AS num FROM themes JOIN books ON books.theme = themes.id GROUP BY themes.id ORDER BY themes.main ASC, themes.sub ASC
It works fine, but for themes which are not in the books table I obviously get no row at all, so I don't have lines like Contemporary Poetry - Etruscan (0) to show as disabled options for the sake of list completeness.
Is there a way to get back my theme.main and theme.sub?
Related
Our current setup looks a bit like this.
public_entry (5,000,000 rows) → telephone_number (5,000,000 rows) → user (400,000 rows)
3 tables, with each arrow indicating a foreign key constraint: the table on the left contains a foreign key (integer) referencing the table on the right.
Now we have two "views" of the data we want to present in our web app.
displaying telephone numbers with public entries based on user attributes (e.g. only numbers from male users), a bit like a score.
displaying telephone numbers with public entries based on their entry date
Each result should get a score assigned indicating whether the number fits your needs (e.g. if you look for a plumber, and the number is in your area and the related user is a plumber, the telephone number should score high).
We tried several approaches to solving this problem, with two scenarios.
The first approach does a SELECT with INNER JOINs over the table, like the following
SELECT ..., (...) as score
FROM public_entry pe
INNER JOIN telephone_number tn ON tn.id = pe.numberid
INNER JOIN user u ON u.id = tn.userid WHERE ... ORDER BY score
Using this query on a smaller system (about 1/4 of the production system), it performs very, very well, even under load.
However, when we put this query on the production system, it wreaked havoc, with execution times over 30 seconds.
The second approach was getting all public_entries filtered with a single SELECT on public_entry without any JOINs, iterating over them, and calling a SELECT for each public_entry to fetch the telephone_number and user, computing the score and discarding the result if telephone_number and user do not match our filter/interest.
Usually the second approach is never considered, because it creates over 300 queries for a single page load; foreach'ing over results and calling SELECTs within the loop is usually considered bad style.
However, approach number two performs on the production system: not well, but it does not take more than 1-3 seconds. Then again, it performs badly on the test systems.
Do you have any suggestions on where the problem might be?
EDIT:
Query
SELECT COUNT(p.id)
FROM public_entry p, fon f, user u
WHERE p.isweb = 1
AND f.hidden = 0
AND f.deleted = 0
AND f.id = p.fonid
AND u.id = f.userid
AND u.gender = 'female'
This query has 3 seconds execution time.
This is just an example query. I can take out the WHERE and it performs just a bit worse. In general, if we do a SELECT COUNT() with a single INNER JOIN over the data, the query blows up (30 seconds).
I don't have the magic answer you want, but here are some 'reasons' for poor performance, and some possible workarounds (with caveats).
Which of isweb, hidden, deleted, and gender are the most 'selective'? The optimizer sees low-selectivity columns as useless and annoying: if each has only two values, an INDEX on that field alone is probably useless. Hence, it picks one table, does a full scan, then reaches into the next table, etc. Notice in the EXPLAIN that it picked the smallest table (user) first; this is typically what the optimizer does when nothing in the WHERE clause looks useful.
Whether MySQL does all that work, or you do all that work, is about the same amount of effort. Perhaps you can do it faster, since you can keep simple associative arrays in memory, while MySQL is coded to allow the tables to live on disk and be "cached" in RAM, block by block. But if you don't have enough RAM to load everything, you are stuck with MySQL.
If you actually removed "hidden" and "deleted" rows, the task would be a little faster.
Your two SELECTs do not look much alike. Are you suggesting there is a wide range of SELECTs? And do you effectively need to look through most of all 3 tables to get the "score" or "count"?
Let's look at this from a Data Warehouse approach... Is some of the data "static", that is, unchanging and therefore summarizable? If so, precomputing subtotals (COUNT(*)) into a summary table would let the ultimate queries be a lot faster. DW often involves subtotals by day, but it requires that those subtotals don't change.
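A sketch of that idea against the question's tables (the entry_date column and the daily_counts table name are assumptions):

-- Rebuilt nightly by a batch job; queries then read this tiny table
CREATE TABLE daily_counts (
day DATE NOT NULL,
gender VARCHAR(10) NOT NULL,
cnt INT NOT NULL,
PRIMARY KEY (day, gender)
);

INSERT INTO daily_counts
SELECT DATE(p.entry_date), u.gender, COUNT(*)
FROM public_entry p
JOIN fon f ON f.id = p.fonid
JOIN user u ON u.id = f.userid
GROUP BY DATE(p.entry_date), u.gender;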
COUNT(x) has the overhead of checking x for being NULL. Usually that is not necessary and COUNT(*) gives you what you want.
How often are you running the same SELECT? Or at least similar SELECTs? Do you need up-to-the-second scores? I'm fishing for running all the likely queries in the middle of the night, then using the results for 24 hours. Note that some queries can run faster by doing multiple things at once: for example, instead of two SELECTs for 'female' versus 'male', do one SELECT with GROUP BY gender.
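A sketch of that single grouped query, reusing the tables and filters from the example query above:

SELECT u.gender, COUNT(*) AS cnt
FROM public_entry p
JOIN fon f ON f.id = p.fonid
JOIN user u ON u.id = f.userid
WHERE p.isweb = 1
AND f.hidden = 0
AND f.deleted = 0
GROUP BY u.gender;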
Sorry if the title is a little... crappy. Basically, I'm writing a small forum and using multiple subqueries to select the number of threads, the number of posts, and the date of the last post in a forum, while grabbing the forum's information at the same time to display on the main page!
This is my query, since I suck at explaining things:
SELECT `f`.*,
(SELECT COUNT(`id`)
FROM `forum_threads`
WHERE `forumId1` = `f`.`id1`
AND `forumId2` = `f`.`id2`) AS `threadCount`,
(SELECT COUNT(`id`)
FROM `forum_posts`
WHERE `forumId1` = `f`.`id1`
AND `forumId2` = `f`.`id2`) AS `postCount`,
(SELECT `date`
FROM `forum_posts`
WHERE `forumId1` = `f`.`id1`
AND `forumId2` = `f`.`id2`
ORDER BY `date` DESC LIMIT 1) AS `lastPostDate`
FROM `forum_forums` AS `f`
ORDER BY `f`.`position` ASC, `f`.`id1` ASC;
And I am using the usual foreach loop to display the results:
foreach($forums AS $forum) {
echo $forum->name .'<br />';
echo $forum->threadCount .'<br />';
echo $forum->postCount .'<br />';
echo $forum->lastPostDate .'<br />';
}
(Not exactly like that of course, but for the sake of explaining...)
Now I was wondering if that would be "bad" for performance, or if there was any better way of doing it? Assuming there are quite a few posts and threads in each forum.
I was originally storing "posts", "threads", and "lastPost" columns in the forum table itself, and was going to increment the values (posts = posts + 1) every time someone created a new thread or post. Then I had this idea as well and was wondering if it was any good. :P
I would do things a bit differently:
It seems to me that these three fields: threadCount, postCount and lastPostDate are all fields that you can maintain in a separate table, say forum_stats, which will hold only 4 columns:
* forum_id
* thread_count
* post_count
* last_post_date
These columns can be updated via trigger upon insert/update; see the sketch below.
If you'll pay this small overhead during the update operations, you'll get a very fast query for the select (and it will remain very fast regardless of the number of forums/posts/threads you have).
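A minimal sketch of one such trigger (forum_stats as described; since the question's tables are keyed by a forumId1/forumId2 pair, the stats table is assumed here to use the same pair):

CREATE TRIGGER forum_posts_ai AFTER INSERT ON forum_posts
FOR EACH ROW
UPDATE forum_stats
SET post_count = post_count + 1,
last_post_date = NEW.`date`
WHERE forumId1 = NEW.forumId1 AND forumId2 = NEW.forumId2;

Analogous triggers on forum_threads (for thread_count) and for deletes complete the picture.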
Another approach (not as good IMO):
Create the stats table and run a batch job daily (or every few hours) which updates the stats. The price is that the data you display will never be fully up-to-date, and the job requires resources; you might want to run it only at night, for example, since it's heavy and you don't want it to affect the majority of your website visitors.
Usually this kind of thing is terrible from a performance perspective and you'd be better off with counter columns that you can fetch from a single row. Keeping these in sync can be annoying, but there's no retrieval cost once they're in there.
You've identified the data you're retrieving, so what you need to do next is figure out how to put that data in there in the first place. @alfasin's answer describes an example schema, and while putting it in a separate table is one idea, there's usually not too much trouble in just putting the columns in the main one. If you're worried about locking, update in smaller batches.
One approach is to write a TRIGGER that updates the counters as records are added and removed from the various tables. This tends to hide a lot of the complexity, which can be a bad thing if the logic changes often and people need to be aware of how the system works.
A simple method is to just fiddle with the columns using an additional query after you've created or removed something that would have updated them. For instance, adjusting the last-posted date is trivial if you do it at the time a post is created.
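For example, right after inserting a post (the counter columns posts and lastPost follow the question's original design):

UPDATE forum_forums
SET posts = posts + 1,
lastPost = NOW()
WHERE id1 = ? AND id2 = ?;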
If these counters get a bit screwy, and they eventually will, you need a method to bring them back into sync. An easy way is to write a VIEW that produces the same results your query does now, perhaps rewritten to use LEFT JOIN, and then UPDATE against that if that's possible. This may involve using a temporary table if MySQL can't cope with updating a table from a view of itself, but that's usually not a big deal.
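A sketch of such a resync using a derived table instead of a view (the posts counter column as above; thread counts are analogous):

UPDATE forum_forums f
JOIN (
SELECT forumId1, forumId2, COUNT(*) AS pc
FROM forum_posts
GROUP BY forumId1, forumId2
) p ON p.forumId1 = f.id1 AND p.forumId2 = f.id2
SET f.posts = p.pc;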
I am currently using MySQL and MyISAM.
I have a function which returns an array of user IDs, of either friends or users in general in my application, and when displaying them a foreach seemed best.
Now my issue is that I only have the IDs, so I would need to nest a database call to get each user's other info (i.e. name, avatar, other fields) based on the user ID in the loop.
I do not expect hundreds of thousands of users (this is mainly for hobby learning), but how should I handle this? I'd like the flexibility of placing code in a foreach for display, but with only an array of IDs am I out of luck for using a single query?
Any general structures or tips on how I can display the list appropriately?
Is my amount of queries (1 per user in the list) inappropriate? (Although with pages 0..n of users, shown 10 at a time, it seems not as bad, I just realize.)
You could use MySQL's IN() operator, i.e.
SELECT username,email,etc FROM user_table WHERE userid IN (1,15,36,105)
That will return all rows where the userid matches those IDs. It gets less efficient the more IDs you add, but the 10 or so you mention should be just fine.
Why couldn't you just use a left join to get all the data in 1 shot? It sounds like you are getting a list, but then you only need to get all of a single user's info. Is that right?
Remember, databases are about result SETS, and while you can generally return just a single row if you need it, you almost never have to get a single row and then go back for more info.
For instance a list of friends might be held in a text column on a user's entry.
Whether you expect to have a small database or a large one, I would consider using the InnoDB engine rather than MyISAM. It does have a little higher processing overhead than MyISAM, but you get added benefits (as your hobby grows) such as transactions and foreign key constraints. And a JOIN will allow you to pull in specific data from multiple tables:
SELECT u.`id`, p.`name`, p.`avatar`
FROM `Users` AS u
LEFT JOIN `Profiles` AS p USING (`id`)
This would return id from Users, and name and avatar from Profiles (where the id of both tables matches).
There are numerous resources online talking about database normalization, you might enjoy: http://www.devshed.com/c/a/MySQL/An-Introduction-to-Database-Normalization/
I'm running a SQL query to get basic details from a number of tables, sorted by the last-update date field. It's terribly tricky, and I'm wondering if there is an alternative to using the UNION clause. I'm working in PHP/MySQL.
Actually, I have a few tables containing news, articles, photos, events, etc., and need to collect all of them in one query to show a simple "what's newly added on the website" kind of thing.
Maybe do it in PHP rather than MySQL: if you want the latest n items, fetch the latest n of each of your news items, articles, photos, and events, then sort in PHP (you'll need the latest n of each, obviously, and you'll then trim the combined set down to n in PHP). This is probably easier than combining them with UNION, given they're likely to have lots of data items which are different.
I'm not aware of an alternative to UNION that does what you want, and hopefully those fetches won't be too expensive. It would definitely be wise to profile this though.
If you use a JOIN in your query, you can select data from different tables that are related by foreign keys.
You can look at this from another angle: do you absolutely need up-to-date information (i.e. the moment someone enters new information, it should appear)?
If not, you can have a table holding the results of the query in the format you need (serving as a cache), and update this table every 5 minutes or so. Then your query problem becomes trivial, as you can have the updates run as several updates in the background.
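A sketch of such a refresh, e.g. from a cron job (the latest_items cache table and the column names are illustrative):

TRUNCATE latest_items;
INSERT INTO latest_items (kind, item_id, title, updated_at)
SELECT * FROM (
SELECT 'news' AS kind, id, title, last_update AS updated_at FROM news
UNION ALL
SELECT 'article', id, title, last_update FROM articles
UNION ALL
SELECT 'photo', id, title, last_update FROM photos
UNION ALL
SELECT 'event', id, title, last_update FROM events
) t
ORDER BY t.updated_at DESC
LIMIT 50;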
Does anyone have any idea how to create a product filtering query (or queries) that will emulate the results on this page?
http://www.emag.ro/notebook_laptop
Explanation
If you press HP as a brand, the page will show you all the HP products, and the rest of the available filters are gathered from this query result. Fine and dandy until now; I've got this licked without any problems.
Press 4GB RAM, and of course you will see all HP products that have this property/feature. Again fine and dandy, no problems up to here.
BUT if you look closely, you will see that the Brand filter now also shows, let's say, Acer, having a few products with the 4GB feature (and maybe more brands after Acer), even though their checkboxes aren't pressed yet.
The only idea that comes to mind is to make that many more queries to the database to get these other possible results.
After you start checking the 3rd possible option (let's say display size), things start to get even more complicated.
I guess my question is:
Does anyone have any idea how to do this without taxing the server with tons of queries?
Thank you very much for reading this far; I hope I made myself clear in all this little story.
Take a look at the SQL UNION syntax: "UNION is used to combine the result from multiple SELECT statements into a single result set."
It's not really "tons" of queries: it's one query per attribute type (brand, RAM, HDD). Let's say you have selected HP, 4GB RAM and a 250GB disk. Now, for each attribute type, select products according to the filter, except for the current type, and group the results by the current type. In a simplistic model, the queries could look like this:
SELECT brand, count(*) FROM products WHERE ram='4GB' AND disk='250GB' GROUP BY brand
SELECT ram, count(*) FROM products WHERE brand='HP' AND disk='250GB' GROUP BY ram
SELECT disk, count(*) FROM products WHERE brand='HP' AND ram='4GB' GROUP BY disk
SELECT cpu, count(*) FROM products WHERE brand='HP' AND ram='4GB' AND disk='250GB' GROUP BY cpu
...
You should have indexes on these columns so that the queries don't do a sequential scan over the table; see the sketch below. Of course there are some "popular" combinations, and you will likely have to display the same numbers on multiple pages while the user is sorting/navigating the list, so you might want to cache the numbers and invalidate the cache on update/insert/delete.
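For the indexes mentioned above, something like this (index names are illustrative):

ALTER TABLE products
ADD INDEX idx_brand (brand),
ADD INDEX idx_ram (ram),
ADD INDEX idx_disk (disk);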
It could be that there is some sophisticated means of determining a computed distance of a result from your criteria, but maybe it is as simple as using an OR in the query rather than an AND.