Our current setup looks a bit like this.
public_entry (5,000,000 rows) → telephone_number (5,000,000 rows) → user (400,000 rows)
Three tables; each arrow indicates a foreign key constraint, i.e. the table on the left holds an integer foreign key referencing the table on its right.
Now we have two "views" of the data we want to present in our web app.
displaying telephone numbers with public entries based on user attributes (e.g. only numbers from male users), a bit like a score.
displaying telephone numbers with public entries based on their entry date
Each result should get a score indicating how well the number fits your needs (e.g. if you are looking for a plumber, a number in your area whose related user is a plumber should score high).
We tried several approaches to solving this problem in these two scenarios.
The first approach does a SELECT with INNER JOINs over the tables, like the following:
SELECT ..., (...) as score
FROM public_entry pe
INNER JOIN telephone_number tn ON tn.id = pe.numberid
INNER JOIN user u ON u.id = tn.userid
WHERE ...
ORDER BY score
Using this query on a smaller system (about 1/4 the size of the production system) performs very well, even under load.
However, when we put this query on the production system it wreaked havoc, with execution times over 30 seconds.
The second approach fetched all filtered public_entries with a single SELECT on public_entry without any JOINs, then iterated over them, issuing a SELECT per public_entry to fetch the telephone_number and user, computing the score, and discarding the result if the telephone_number and user did not match our filter/interest.
Normally the second approach would never be considered, because it issues over 300 queries for a single page load; looping over results and firing a SELECT inside each iteration is usually considered bad style.
However, approach number two does perform on the production system: not well, but it does not take more than 1-3 seconds. On the test systems, though, it performs badly.
Do you have any suggestions on where the problem might be?
EDIT:
Query
SELECT COUNT(p.id)
FROM public_entry p, fon f, user u
WHERE p.isweb = 1
AND f.hidden = 0
AND f.deleted = 0
AND f.id = p.fonid
AND u.id = f.userid
AND u.gender = "female"
This query has 3 seconds execution time.
This is just an example query. I can take out the WHERE and it performs only slightly worse. In general, as soon as we do a SELECT COUNT() with a single INNER JOIN over the data, the query blows up (30 seconds).
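For reference, the execution plan can be inspected by prefixing the query with EXPLAIN:

-- Shows the join order and index choices MySQL plans to use
EXPLAIN
SELECT COUNT(p.id)
FROM public_entry p, fon f, user u
WHERE p.isweb = 1
AND f.hidden = 0
AND f.deleted = 0
AND f.id = p.fonid
AND u.id = f.userid
AND u.gender = "female";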
I don't have the magic answer you want, but here are some 'reasons' for poor performance, and some possible workarounds (with caveats).
Which of isweb, hidden, deleted, and gender are the most 'selective'? The optimizer sees them as useless and annoying: if each has only two values, an INDEX on that field alone is probably useless. Hence it picks one table, does a full scan, then reaches into the next table, etc. Notice in the EXPLAIN that it picked the smallest table (user) first. This is typically what the optimizer does when nothing in the WHERE clause looks useful.
Whether MySQL does all that work or you do all that work, it is about the same amount of effort. Perhaps you can do it faster since you can keep simple associative arrays in memory, while MySQL is coded to allow the tables to live on disk and be "cached" in RAM, block by block. But if you don't have enough RAM to load everything in, you are stuck with MySQL.
If you actually removed "hidden" and "deleted" rows, the task would be a little faster.
Your two SELECTs do not look much alike. Are you suggesting there is a wide range of SELECTs? And you effectively need to look through most of all 3 tables to get the "score" or "count"?
Let's look at this from a Data Warehouse approach... Is some of the data "static"; that is, unchanging and could be summarized? If so, precomputing subtotals (COUNT(*)) into a summary table would let the ultimate queries be a lot faster. DW often involves subtotals by day. But it requires that these subtotals don't change.
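A rough sketch of that idea, using the tables from the question; the entry-date column name created_at and the summary table itself are only illustrative assumptions:

-- Hypothetical per-day summary table, refreshed nightly
CREATE TABLE daily_counts (
    entry_date DATE NOT NULL,
    gender     VARCHAR(10) NOT NULL,
    cnt        INT UNSIGNED NOT NULL,
    PRIMARY KEY (entry_date, gender)
);

-- Run once per night for the previous day (created_at is an assumed column)
INSERT INTO daily_counts (entry_date, gender, cnt)
SELECT DATE(p.created_at), u.gender, COUNT(*)
FROM public_entry p
JOIN fon f  ON f.id = p.fonid
JOIN user u ON u.id = f.userid
WHERE DATE(p.created_at) = CURDATE() - INTERVAL 1 DAY
GROUP BY DATE(p.created_at), u.gender;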
COUNT(x) has the overhead of checking x for being NULL. Usually that is not necessary and COUNT(*) gives you what you want.
How often are you running the same SELECT? Or, at least, similar SELECTs? Do you need up-to-the-second scores? I'm fishing for running all the likely queries in the middle of the night, then using the results for 24 hours. Note that some queries can run faster by doing multiple things at once. For example, instead of two SELECTs for 'female' versus 'male', do one SELECT and GROUP BY gender.
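For example, using the tables from the example query above, a single pass can produce the counts for both genders at once:

-- One scan instead of separate 'female' and 'male' counts
SELECT u.gender, COUNT(*) AS cnt
FROM public_entry p
JOIN fon f  ON f.id = p.fonid
JOIN user u ON u.id = f.userid
WHERE p.isweb = 1 AND f.hidden = 0 AND f.deleted = 0
GROUP BY u.gender;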
Related
I'm working on a management system for a small library. I proposed that they replace the Excel spreadsheet they are using now with something more robust and professional like PhpMyBibli - https://en.wikipedia.org/wiki/PhpMyBibli - but they are scared by the amount of fields to fill in, and also the interfaces are not fully translated into Italian.
So I made a very trivial DB, with basically a table for the authors and a table for the books. The authors table exists because I'm tired of having to explain that "Gabriele D'Annunzio" != "Gabriele d'Annunzio" != "Dannunzio G." and so on.
My test tables are now populated with ~ 100k books and ~ 3k authors, both with plausible random text, to check the scripts under pressure.
For the public consultation I want to make an interface like that of Gallica, the website of the Bibliothèque nationale de France, which I find pretty useful. A sample can be seen here: http://gallica.bnf.fr/Search?ArianeWireIndex=index&p=1&lang=EN&f_typedoc=livre&q=Computer&x=0&y=0
The concept is pretty easy: for each menu, e.g. the author one, I generate a fancy <select> field with all the names retrieved from the DB, and this works smoothly.
The issue arises when I try to add beside every author name the number of books, as Gallica does, in this way (warning - conceptual code, not actual PHP):
SELECT id, surname, name FROM authors
foreach row {
SELECT COUNT(*) as num FROM BOOKS WHERE id_auth=id
echo "<option>$surname, $name ($num)</option>";
}
With the code above a core of the CPU jumps to 100%, and no results are shown in the browser. Not surprising, since that is 3k queries against a 100k-row table in a very short time.
Just to try, I added a LIMIT 100 to the first query (on the authors table). The page then took 3 seconds to generate, and 15 seconds when I raised the LIMIT to 500 (seemingly a linear increase). But of course I can't show library users a reduced list of authors.
I don't know which hardware/software is used by Gallica to achieve their results, but I bet their budget is far above that of a small village library using 2nd hand computers.
Do you think that adding a "number_of_books" field to the authors table, updated every time a new book is inserted, could be a practical solution, rather than scanning the whole list on every request?
BTW, a similar procedure must be done for the publication date, the language, the theme, and some other fields, so the query time will be hit again, even if the other tables are a lot smaller than the authors one.
Your query style is very inefficient - try using a join and group structure:
SELECT
authors.id,
authors.surname,
authors.name,
COUNT(books.id) AS numbooks
FROM authors
INNER JOIN books ON books.id_auth=authors.id
GROUP BY authors.id
ORDER BY numbooks DESC
;
EDIT
Just to clear up some things I did not explicitly say:
Of course you no longer need a query inside the PHP loop, just the displaying portion
Indices on books.id_auth and authors.id (the latter primary or unique) are assumed
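If the index on books.id_auth is missing, it could be added like this (authors.id is assumed to already be the primary key):

-- Speeds up the join and the per-author grouping
ALTER TABLE books ADD INDEX idx_books_id_auth (id_auth);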
EDIT 2
As @GordonLinoff pointed out, the IFNULL() is redundant in an inner join, so I removed it.
To get all themes, even if there aren't any books in them, just use a left join (this time including the IFNULL(), in case your provider's MySQL is old):
SELECT
    themes.id,
    themes.main,
    themes.sub,
    IFNULL(COUNT(books.theme), 0) AS num
FROM themes
LEFT JOIN books ON books.theme = themes.id
GROUP BY themes.id
;
EDIT 3
Of course a stored value will give you the best performance - but this denormalization comes at a cost: your database now has the potential to become inconsistent in a user-visible way.
If you do go with this method, I strongly recommend you use triggers to auto-fill this field (and of course those triggers must sit on the books table).
Be prepared to see slowed-down inserts - this may of course be acceptable, as I guess you will see a much higher rate of SELECTs than INSERTs.
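A rough sketch of such triggers, assuming the proposed number_of_books column has been added to authors (add an UPDATE trigger as well if a book's id_auth can change):

DELIMITER //
-- Keep authors.number_of_books in sync when books are added or removed
CREATE TRIGGER books_after_insert AFTER INSERT ON books
FOR EACH ROW
    UPDATE authors SET number_of_books = number_of_books + 1 WHERE id = NEW.id_auth;
//
CREATE TRIGGER books_after_delete AFTER DELETE ON books
FOR EACH ROW
    UPDATE authors SET number_of_books = number_of_books - 1 WHERE id = OLD.id_auth;
//
DELIMITER ;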
After reading a lot about how the JOIN statement works, with the help of
useful answer 1 and useful answer 2, I discovered I used it some 15 or 20 years ago, then I forgot about this since I never needed it again.
I made a test using the options I had:
reply with the JOIN query with IFNULL(): 0.5 seconds
reply with the JOIN query without IFNULL(): 0.5 seconds
reply using a stored value: 0.4 seconds
That DB will run on some single-core old iron, so I think a 20% difference could be significant, and I decided to use stored values, updating the count every time a new book is inserted (i.e. not often).
Anyway thanks a lot for having refreshed my memory: JOIN queries will be useful somewhere else in my DB.
update
I used the JOIN method above to query the book themes, which are stored into a far smaller table, in this way:
SELECT themes.id, themes.main, themes.sub, COUNT(books.theme) AS num
FROM themes
JOIN books ON books.theme = themes.id
GROUP BY themes.id
ORDER BY themes.main ASC, themes.sub ASC
It works fine, but for themes which are not in the books table I obviously don't get a 0 response, so I don't have lines like Contemporary Poetry - Etruscan (0) to show as disabled options for the sake of list completeness.
Is there a way to get those missing theme.main and theme.sub rows back in the result?
Hi, I need some help optimizing this code; currently it takes 38 seconds to run the SQL query, and 23 seconds to load it as a view.
Here's the background -
The redirects table records when a member uses a link, where they go, and when they return and with what status.
The projects table holds the per-project information that I need.
Currently I do have a third table that keeps a per project count which is updated each time a record is added to the redirects table, however the counts can be a little unreliable. Every hour the server runs the query to fix/verify the counts.
Is there any good way to compute these per-status counts without having to use sum(if(xxx,1,0))?
Select projects.ID as ID,cid,name as name,state as status,
sum(if(status="complete",1,0)) as complete,cpc,
cpc*ss as mmkingaku,
cpc*sum(if(status="complete",1,0)) as total,
sum(if(status="screenout",1,0)) as screenout,
sum(if(status="quotafull",1,0)) as quotafull,
sum(if(status="short",1,0)) as short,
sum(if(status="gate",1,0)) as gate,
sum(if(status is null,1,0)) as empty,
sum(if(status="complete",1,0))/(sum(if(status="complete",1,0))+sum(if(status="screenout",1,0)))*100 as IR
from redirects, projects
where redirects.rid = projects.rid
  and state <> "test"
group by name
order by cid desc
SQL performance is not usually due to calculations in the select clause. You need to look at the from and group by clauses.
Do your tables have appropriate indexes? You should have an index on redirects.rid, projects.rid, or both. In fact, these should probably be composite indexes, also including state and status (wherever appropriate).
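For example, assuming state lives in projects and status in redirects (as the question suggests), a possible starting point could be:

-- Filter projects on state and join on rid straight from the index
ALTER TABLE projects  ADD INDEX idx_projects_state_rid (state, rid);
-- Find matching redirects by rid and read status from the index
ALTER TABLE redirects ADD INDEX idx_redirects_rid_status (rid, status);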
The group by can be a performance hog in MySQL. How much data is in each table?
I have recently written a survey application that has done its job and all the data is gathered. Now I have to analyze the data and I'm having some time issues.
I have to find out how many people selected what option and display it all.
I'm using this query, which does do its job:
SELECT COUNT(*)
FROM survey
WHERE users = ? AND `table` = ? AND col = ? AND `row` = ? AND selected = ?
GROUP BY users, `table`, col, `row`, selected
As evidenced by the "?", I'm using MySQLi (in PHP) to fetch the data when needed, but I fear this is what is causing it to be so slow.
The table consists of all the elements above (+ a unique ID) and all of them are integers.
To explain some of the fields:
Each survey was divided into 3 or 4 tables (sized from 2x3 to 5x5) with a 1 to 10 happiness grade to select from. (Questions are on the right and top of the table, and you answer where the questions intersect.)
users - age groups
table, row, col - explained above
selected - dooooh explained above
Now with the surveys complete and around 1 million entries in the table, the query is getting very slow. Sometimes it takes like 3 minutes, sometimes (I guess) the time limit expires and you get no data at all. I also don't have access to the full database, just my empty "testing" one, since the customer is kinda paranoid :S (and his server seems to be a bit slow).
Now (after the initial essay) my questions are: I left indexing out intentionally because, with a lot of data being written during the survey, it would have been a bad idea. But since no new data is coming in at this point, would it make sense to index all the fields of the table? How much sense does it make to index integers that never go above 10? (As you can guess, I haven't got a clue about indexes.) Do I need the primary unique ID in this table?
I read somewhere that indexing may help GROUP BY, but only if you group by the first columns in a table (and since my ID is first and, from my point of view, useless, can I remove it and gain anything by it?)
Is there another way to write my query that would basically do the same thing but in a shorter period of time?
Thanks for all your suggestions in advance!
Add an index on the columns that you GROUP BY or use in the WHERE. So that's ONE index incorporating users, table, col, row and selected in your case.
Some quick rules:
combine fields to have the WHERE first, and the GROUP BY elements last.
If you have other queries that only use part of it (e.g. users,table,col and selected) then leave the missing value (row, in this example) last.
Don't use too many indexes, as each one slows down updates to the table marginally - so on a really large system you need to balance query speed against index maintenance.
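Putting that together for the survey table from the question, the single index could look like this (table and row are backtick-quoted because they are, or can be, reserved words):

-- One composite index covering the WHERE and GROUP BY columns
ALTER TABLE survey
    ADD INDEX idx_survey_lookup (users, `table`, col, `row`, selected);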
Edit: do you need the GROUP BY users, col, row at all, since these are already fixed by the WHERE? If the WHERE has already filtered them, you only need to GROUP BY selected.
I have again run into the problem of selecting a random subset of rows. As many probably know, ORDER BY RAND() is quite inefficient, or at least that's the consensus. I have read that MySQL generates a random value for every row in the table, then filters, then orders by these random values, and then returns the set. The biggest performance impact seems to come from the fact that there are as many random numbers generated as there are rows in the table. So I was looking for a possibly better way to return a random subset of results for a query like this:
SELECT id FROM <table> WHERE <some conditions> LIMIT 10;
Of course the simplest and easiest way to do what I want would be the one which I am trying to avoid:
SELECT id FROM <table> WHERE <some conditions> ORDER BY RAND() LIMIT 10; (a)
Now, after some thinking, I came up with another option for this task:
SELECT id FROM <table> WHERE (<some conditions>) AND RAND() > x LIMIT 10; (b)
(Of course we can use < instead of >.) Here we take x from the range 0.0 - 1.0. Now I'm not exactly sure how MySQL handles this, but my guess is that it first selects rows matching <some conditions> (using indexes?) and then generates a random value per row to decide whether to return or discard it. But what do I know :) that's why I ask here. So, some observations about this method:
First, it does not guarantee that the asked-for number of rows will be returned even if there are many more matching rows than needed; this is especially true for x values close to 1.0 (or close to 0.0 if we use <).
The returned rows don't really have a random ordering; they are just rows selected randomly, ordered by the index used or by the way they are stored(?) (of course this might not matter at all in some cases).
We probably need to choose x according to the size of the result set: if we have a large result set and x is, let's say, 0.1, it is very likely that only some of the first results will be returned most of the time; on the other hand, if we have a small result set and choose a large x, it is likely that we will get fewer rows than we want, even though there are enough of them.
Now some words about performance. I did a little testing using JMeter. <table> has about 20k rows, and there are about 2-3k rows matching <some conditions>. I wrote a simple PHP script that executes the query and print_r's the result. Then I set up a JMeter test that starts 200 threads, so that is 200 requests per second, each requesting said PHP script. I ran it until about 3k requests were done (the average response time stabilizes well before this). I also executed all queries with SQL_NO_CACHE to prevent the query cache from having an effect. Average response times were:
~30ms for query (a)
13-15ms for query (b) with x = 0.1
17-20ms for query (b) with x = 0.9, as expected larger x is slower since it has to discard more rows
So my questions are: what do you think about this method of selecting random rows? Maybe you have used it or tried it and found that it did not work out? Maybe you can explain better how MySQL handles such a query? What could be some caveats that I'm not aware of?
EDIT: I probably was not clear enough: the point is that I need random rows of a query, not simply of a table, which is why I included <some conditions>, and there are quite a few of them. Moreover, the table is guaranteed to have gaps in id (not that it matters much, since this is not about random rows from a table but from a query), and there will be quite a lot of such queries, so suggestions involving querying the table multiple times do not sound appealing. <some conditions> will vary at least a bit between requests, meaning that different requests will have different conditions.
From my own experience, I've always used ORDER BY RAND(), but this has its own performance implications on larger datasets. For example, if you had a table that was too big to fit in memory then MySQL will create a temporary table on disk, and then perform a file sort to randomise the dataset (storage engine permitting). Your LIMIT 10 clause will have no effect on the execution time of the query AFAIK, but it will reduce the size of the data to send back to the client obviously.
Basically, the limit and order by happen after the query has been executed (full table scan to find matching records, then it is ordered and then it is limited). Any rows outside of your LIMIT 10 clause are discarded.
As a side note, adding SQL_NO_CACHE will disable MySQL's internal query cache, but it does not prevent your operating system from caching the data (due to the random nature of this query I don't think it would have much of an effect on your execution time anyway).
Hopefully someone can correct me here if I have made any mistakes but I believe that is the general idea.
An alternative way which probably would not be faster, but might be, who knows :)
Either use a table status query to determine the next auto_increment value, or the row count, or use SELECT COUNT(*). Then you can decide ahead of time the auto_increment value of a random item and then select that unique item.
This will fail if you have gaps in the auto_increment field, but if it is faster than your other methods, you could loop a few times or fall back to a failsafe method in the case of zero rows returned. Best case might be a big savings, worst case would be comparable to your current method.
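A minimal sketch of that idea, keeping the <table> and <some conditions> placeholders from the question and assuming ids start at 1 (the retry / fallback logic is left out):

-- Pick a random id up front, then fetch exactly that row;
-- if the id falls in a gap or fails the conditions, retry or fall back
SELECT t.id
FROM <table> t
JOIN (SELECT FLOOR(1 + RAND() * MAX(id)) AS rnd FROM <table>) r ON t.id = r.rnd
WHERE <some conditions>;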
You're using the wrong data structure.
The usual method is something like this:
Find out the number of elements n — something like SELECT count(id) FROM tablename.
Choose r distinct randomish numbers in the interval [0,n). I usually recommend a LCG with suitably-chosen parameters, but simply picking r randomish numbers and discarding repeats also works.
Return those elements. The hard bit.
MySQL appears to support indexed lookups with something like SELECT id from tablename ORDER BY id LIMIT :i,1 where :i is a bound-parameter (I forget what syntax mysqli uses); alternative syntax LIMIT 1 OFFSET :i. This means you have to make r queries, but this might be fast enough (it depends on per-statement overheads and how efficiently it can do OFFSET).
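In MySQL the per-row lookup could be done with a prepared statement along these lines (tablename as in the text; the offset 42 stands in for one of the r random numbers):

-- Fetch the row at a given 0-based position in id order; run once per random number
PREPARE pick_row FROM 'SELECT id FROM tablename ORDER BY id LIMIT ?, 1';
SET @i = 42;
EXECUTE pick_row USING @i;
DEALLOCATE PREPARE pick_row;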
An alternative method, which should work for most databases, is a bit like interval-bisection:
SELECT count(id),max(id),min(id) FROM tablename. Then you know rows [0,n-1] have ids [min,max].
So rows [a,b] have ids [min,max]. You want row i. If i == a, return min. If i == b, return max. Otherwise, bisect:
Choose split = min+(max-min)/2 (avoiding integer overflow).
SELECT count(id),max(id) FROM tablename WHERE :min < id AND id < :split and SELECT count(id),min(id) FROM tablename WHERE :split <= id AND id < :max. The two counts should equal b-a+1 if the table hasn't been modified...
Figure out which range i is in, and update a, b, min, and max appropriately. Repeat.
There are plenty of edge cases (I've probably included some off-by-one errors) and a few potential optimizations (you can do this for all the indexes at once, and you don't really need to do two queries per iteration if you don't assume that i == b implies id = max). It's not really worth doing if SELECT ... OFFSET is even vaguely efficient.
I'm working on the next version of a local online dating site, PHP & MySQL based and I want to do things right. The user table is quite massive and is expected to grow even more with the new version as there will be a lot of money spent on promotion.
The current version, which I guess is 7-8 years old, was probably done by someone not very knowledgeable in PHP and MySQL, so I have to start over from scratch.
The community currently has 200k+ users and is expected to grow to 500k-1mil in the next one or two years. There are more than 100 attributes for each user's profile and I have to be able to search by at least 30-40 of them.
As you can imagine I'm a little wary of making a table with 200k rows and 100 columns. My predecessor split the user table in two ... one with the most used and searched columns and one with the rest (and bulk) of the columns. But this led to big synchronization problems between the two tables.
So, what do you think it's the best way to go about it?
This is not an answer per se, but since a few answers here suggested the attribute-value model, I just wanted to jump in and share my experience.
I once tried using this model with a table with 120+ attributes (growing by 5-10 every year), adding about 100k+ rows every 6 months, and the indexes grew so big that it took forever to add or update a single user_id.
The problem I find with this type of design (not that it's completely unfit for any situation) is that you need to put a primary key on (user_id, attrib) on that second table. Not knowing the potential length of attrib, you would usually use a larger length, thus increasing the index size. In my case, attribs could have from 3 to 130 chars. Also, the value column most certainly suffers from the same assumption.
And as the OP said, this leads to synchronization problems. Imagine if every attribute (or say at least 50% of them) NEEDS to exist.
Also, as the OP suggests, the search needs to be done on 30-40 attributes, and I just can't imagine how 30-40 joins would be efficient, or even a GROUP_CONCAT(), due to its length limitation.
My only viable solution was to go back to a table with as much columns as there are attributes. My indexes are now greatly smaller, and searches are easier.
EDIT: Also, there are no normalization problems: either have lookup tables for attribute values or make them ENUM().
EDIT 2: Of course, one could say I should have a look-up table for the possible attribute values (reducing index sizes), but I would then need a join on that table.
What you could do is split the user data across two tables.
1) Table: user
This will contain the "core" fixed information about a user such as firstname, lastname, email, username, role_id, registration_date and things of that nature.
Profile related information can go in its own table. This will be an infinitely expandable table with a key => val nature.
2) Table: user_profile
Fields: user_id, option, value
user_id: 1
option: profile_image
value: /uploads/12/myimage.png
and
user_id: 1
option: questions_answered
value: 24
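A minimal sketch of such a table (types, lengths and the InnoDB choice are assumptions; `option` is backtick-quoted because OPTION is a reserved word in MySQL):

-- Key/value profile storage; one row per (user, option) pair
CREATE TABLE user_profile (
    user_id  INT UNSIGNED NOT NULL,
    `option` VARCHAR(64)  NOT NULL,
    value    VARCHAR(255) NOT NULL,
    PRIMARY KEY (user_id, `option`)
) ENGINE=InnoDB;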
Hope this helps,
Paul.
The entity-attribute-value model might be a good fit for you:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
Rather than have 100 and growing columns, add one table with three columns:
user_id, property, value.
In general, you shouldn't sacrifice database integrity for performance.
The first thing that I would do about this is to create a table with 1 mln rows of dummy data and test some typical queries on it, using a stress tool like ab. It will most probably turn out that it performs just fine - 1 mln rows is a piece of cake for MySQL. So, before trying to solve a problem, make sure you actually have one.
If you find the performance poor and the database really turns out to be a bottleneck, consider general optimizations, like caching (on all levels, from the MySQL query cache to HTML caching), getting better hardware, etc. This should work out in most cases.
In general you should always get the schema formally correct before you worry about performance!
That way you can make informed decisions about adapting the schema to resolve specific performance problems, rather than guessing.
You definitely should go down the 2-table route. This will significantly reduce the amount of storage, the code complexity, and the effort of changing the system to add new attributes.
Assuming that each attribute can be represented by an ordinal number, and that you're only looking for symmetrical matches (i.e. you're trying to match people based on similar attributes, rather than an expression of intention)...
At a simple level, the query to find suitable matches may be very expensive. Effectively you are looking for nodes within the same proximity in an N-dimensional space; unfortunately most relational databases aren't really set up for this kind of operation (I believe PostgreSQL has support for this). So most people would probably start with something like:
SELECT candidate.id,
       COUNT(*)
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value=current_user_attrs.attr_value
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
However this forces the system to compare every available candidate to find the best match. Apply a little heuristics and you can get a very effective query:
SELECT candidate.id,
       COUNT(*)
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
    BETWEEN current_user_attrs.attr_value-$tolerance
        AND current_user_attrs.attr_value+$tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
(the value of $tolerance will affect the number of rows returned and query performance - if you've got an index on attr_type, attr_value).
This can be further refined into a points scoring system:
SELECT candidate.id,
       SUM(1/(1+
           ((candidate_attrs.attr_value - current_user_attrs.attr_value)
           *(candidate_attrs.attr_value - current_user_attrs.attr_value))
       )) AS match_score
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
    BETWEEN current_user_attrs.attr_value-$tolerance
        AND current_user_attrs.attr_value+$tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;
This approach lets you do lots of different things - including searching by a subset of attributes, e.g.
SELECT candidate.id,
       SUM(1/(1+
           ((candidate_attrs.attr_value - current_user_attrs.attr_value)
           *(candidate_attrs.attr_value - current_user_attrs.attr_value))
       )) AS match_score
FROM users candidate,
     attributes candidate_attrs,
     attributes current_user_attrs,
     attribute_subsets s
WHERE current_user_attrs.user_id=$current_user
AND candidate.id<>$current_user
AND candidate.id=candidate_attrs.user_id
AND s.subset_name=$required_subset
AND s.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_type=current_user_attrs.attr_type
AND candidate_attrs.attr_value
    BETWEEN current_user_attrs.attr_value-$tolerance
        AND current_user_attrs.attr_value+$tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;
Obviously this does not accommodate non-ordinal data (e.g. birth sign, favourite pop-band). Without knowing a lot more about the structure of the existing data, it's rather hard to say exactly how effective this will be.
If you want to add more attributes, then you don't need to make any changes to your PHP code nor the database schema - it can be completely data-driven.
Another approach would be to identify stereotypes - i.e. reference points within the N-dimensional space - then work out which of these a particular user is closest to. You collapse all the attributes down to a single composite identifier, then you just need to apply the same approach to find the best match within the subset of candidates who have also been matched to the stereotype.
I can't really suggest anything without seeing the schema. Generally, a MySQL database should be normalized to at least 3NF or BCNF. It rather sounds like it is not normalized right now, with 100 columns in one table.
Also, you can easily enforce referential integrity with foreign keys by using the InnoDB engine and transactions.