MySQL performance issue with large tables - php

I've been asked to develop a web application that stores readings from heat metering devices and divides the heat expenses among all the flat owners. I chose to work in PHP with MySQL's MyISAM engine.
I was not used to working with large data sets, so I simply created a logical database where we have:
a table for buildings, with an id as an indexed primary key (we currently have ~1200 buildings in the DB)
a table with all the flats in all the buildings, with an id as an indexed primary key and a building_id linking to the building (around 32k+ flats in total)
a table with all the heaters in all the flats, with an id as an indexed primary key and a flat_id linking to the flat (around 280k+ heaters)
a table with all the reading values, with the timestamp of the reading, an id as primary key and a heater_id linking to the heater (around 2.7M+ readings now)
There is also a separate table, linked to the building, which stores the start date and the end date between which the division of expenses has to be done.
When I need to get all the data for a building, the approach I used is to fetch raw data from the DB with a single query, process it in PHP, then make the next query.
So here is roughly the sequence of operations I used (sketched in PHP after the list):
get the start and end dates from the specific table with a single query
store the dates in PHP variables
get all the flats of the building: SELECT * FROM flats WHERE building_id=my_building_id
iterate over the results with a PHP while loop
on each step of the while loop, run a query getting all the heaters of that specific flat: SELECT * FROM heaters WHERE flat_id=my_flat_id
iterate over the heaters of each flat with an inner PHP while loop
on each step of this inner while loop, get the last reading value of that specific heater: SELECT * FROM reading_values WHERE heater_id=my_heater_id AND data<my_data
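A minimal sketch of this sequence, assuming mysqli ($mysqli being an open connection) and the table/column names described above:
$flats = $mysqli->query("SELECT * FROM flats WHERE building_id = $building_id");
while ($flat = $flats->fetch_assoc()) {
    $heaters = $mysqli->query("SELECT * FROM heaters WHERE flat_id = {$flat['id']}");
    while ($heater = $heaters->fetch_assoc()) {
        // one extra query per heater; the ORDER BY ... LIMIT 1 is an assumption,
        // picking the latest reading before the cut-off date
        $reading = $mysqli->query("SELECT * FROM reading_values
            WHERE heater_id = {$heater['id']} AND data < '$my_data'
            ORDER BY data DESC LIMIT 1")->fetch_assoc();
    }
}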
Now the problem is that I have serious performance issues.
Before someone points it out: I cannot fetch only the reading values and skip the first 6 steps of the list above, since I need to print bills, and each bill has to show all the flat information and all the heater information, so I have to get all the flat and heater data anyway.
So I'd like some suggestions on how to improve the script's performance:
all the tables are indexed, but do I have to add an index somewhere else?
would a single query with subqueries, instead of several queries interleaved with PHP code, improve performance?
any other suggestions?
Apart from the sketch above, I haven't included my actual code, as I think it would have made the question too heavy, but if asked I can add some.

Some suggestions:
Don't use SELECT * if you can avoid it: just fetch the fields you really need.
I didn't test it in your particular case, but usually a single query which joins all the tables achieves much better performance than looping through results with PHP.
If you need to loop for some reason, then at least use MySQL prepared statements, which again should increase performance given the number of queries (see the sketch after this list).
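A minimal sketch of that, assuming PDO ($pdo being an open connection) and the column names from the question; the statement is parsed and planned once, then only the bound values change:
$stmt = $pdo->prepare('SELECT * FROM reading_values
    WHERE heater_id = :heater_id AND data < :my_data
    ORDER BY data DESC LIMIT 1');
foreach ($heaters as $heater) {
    // re-execute the already-prepared statement with new values
    $stmt->execute([':heater_id' => $heater['id'], ':my_data' => $my_data]);
    $reading = $stmt->fetch(PDO::FETCH_ASSOC);
}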
Hope it helps!
Regards
EDIT:
Just to exemplify an alternative query; I'm not sure whether this suits your specific needs, and I haven't tested it (which probably means I forgot something):
SELECT
a.field1,
b.field2,
c.field3,
d.field4
FROM heaters a
JOIN reading_values b ON (b.heater_id = a.heater_id)
JOIN flats c ON (c.flat_id = a.flat_id)
JOIN buildings d ON (d.building_id = c.building_id)
WHERE
a.heater_id = my_heater_id
AND b.date < my_date
GROUP BY a.heater_id
EDIT 2
Following your comments, I modified the query so that it retrieves the information the way you want it: given a building id, it lists all the heaters and their newest reading value as of a given date:
SELECT
a.name,
b.name,
c.name,
d.reading_value,
d.created
FROM buildings a
JOIN flats b ON (b.building_id = a.building_id)
JOIN heaters c ON (c.flat_id = b.flat_id)
JOIN reading_values d ON (d.reading_value_id = (
    SELECT reading_value_id
    FROM reading_values
    WHERE created <= my_date AND heater_id = c.heater_id
    ORDER BY created DESC
    LIMIT 1
))
WHERE
a.building_id = my_building_id
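One addition regarding your indexing question: the correlated subquery above filters on heater_id and orders by created, so a composite index on those two columns (assuming these column names match your actual schema) lets MySQL resolve each LIMIT 1 lookup directly instead of scanning:
-- composite index supporting the WHERE + ORDER BY of the subquery
ALTER TABLE reading_values ADD INDEX heater_created_idx (heater_id, created);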
It would be interesting to know how it performs in your environment.
Regards

Related

SQL Join duplicates, converting an access db

I am converting an access database to a new format. Currently all data resides in MySQL.
For the purposes of this question, there are 3 tables. tbl_Bills, tbl_Documents, and tbl_Receipts.
I wrote an outer join query, as some bills have documents and receipts and others don't, and I need a full listing of each set, given those situations, to be processed by a PHP script later on.
The problem is that the primary identifier, which we'll call fld_CommonID, happens to exist in duplicate. For example, 3 bills have the same identifier with different information, and 3 documents and 3 receipts match those 3 bills.
So, as you might have guessed, my join query results in 9 indistinct rows (6 duplicates) when there should be 3 (one join from each table). An inner join excludes data that isn't defined in the other table, and so doesn't work for my needs.
So I'm thinking that what I want to do is update those 3 records in each table (across all rows that have duplicates) so that they carry a unique counter id (#1, #2, and #3 respectively), allowing me to perform join queries on them uniquely per row.
Is that possible without running PHP code that selects the duplicates ordered by natural table order and then updates them with a counter?
Would you advise that I go that (scripted) route instead of some magical SQL query, if such a query can even be made?
Or is it possible to outer join based on natural table order (pretty sure that's impossible)?
Writing this answer simply to close the question.
Inner joins would be perfect if there were a way to link duplicate fields in separate tables based on natural order (no primary key). The problem isn't that I lack a query; it's that the database is poorly structured, which is a problem better solved with code than with complex queries.
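For reference, if each table did have a primary key (a hypothetical id column below) and you added a hypothetical fld_SubID column, MySQL 8+ could assign the per-duplicate counters in one statement per table; a sketch, untested against this schema:
UPDATE tbl_Bills b
JOIN (
    -- number the rows within each group of duplicated fld_CommonID values
    SELECT id, ROW_NUMBER() OVER (PARTITION BY fld_CommonID ORDER BY id) AS rn
    FROM tbl_Bills
) n ON n.id = b.id
SET b.fld_SubID = n.rn;
-- repeat for tbl_Documents and tbl_Receipts, then join on (fld_CommonID, fld_SubID)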

nested mysql queries with huge tables

I'm working on a management system for a small library. I proposed that they replace the Excel spreadsheet they are using now with something more robust and professional like PhpMyBibli - https://en.wikipedia.org/wiki/PhpMyBibli - but they are scared by the number of fields to fill in, and the interfaces are not fully translated into Italian.
So I made a very trivial DB, with basically a table for the authors and a table for the books. The authors table exists because I'm tired of having to explain that "Gabriele D'Annunzio" != "Gabriele d'Annunzio" != "Dannunzio G." and so on.
My test tables are now populated with ~100k books and ~3k authors, both with plausible random text, to check the scripts under pressure.
For the public consultation I want to make an interface like that of Gallica, the website of the Bibliothèque nationale de France, which I find pretty useful. A sample can be seen here: http://gallica.bnf.fr/Search?ArianeWireIndex=index&p=1&lang=EN&f_typedoc=livre&q=Computer&x=0&y=0
The concept is pretty easy: for each menu, e.g. the author one, I generate a fancy <select> field with all the names retrieved from the DB, and this works smoothly.
The issue arises when I try to add, beside every author name, the number of books, as Gallica does, in this way (conceptual code, written as mysqli for concreteness):
$authors = $mysqli->query("SELECT id, surname, name FROM authors");
while ($a = $authors->fetch_assoc()) {
    $num = $mysqli->query("SELECT COUNT(*) AS num FROM books WHERE id_auth = {$a['id']}")
                  ->fetch_assoc()['num'];
    echo "<option>{$a['surname']}, {$a['name']} ($num)</option>";
}
With the code above, one CPU core jumps to 100% and no results are shown in the browser. Not surprising, since that is 3k queries against a 100k-row table in a very short time.
Just to try, I added a LIMIT 100 to the first query (on the authors table). The page then took 3 seconds to generate, and 15 seconds when I raised the LIMIT to 500 (the increase seems linear). But of course I can't show library users a reduced list of authors.
I don't know what hardware/software Gallica uses to achieve their results, but I bet their budget is far above that of a small village library using second-hand computers.
Do you think that adding a number_of_books field to the authors table, updated every time a new book is inserted, could be a practical solution, rather than scanning the whole list at every request?
BTW, a similar procedure must be done for the publication date, the language, the theme, and some other fields, so the query time will be hit again, even if those other tables are a lot smaller than the authors one.
Your query style is very inefficient - try using a join and group structure:
SELECT
authors.id,
authors.surname,
authors.name,
COUNT(books.id) AS numbooks
FROM authors
INNER JOIN books ON books.id_auth=authors.id
GROUP BY authors.id
ORDER BY numbooks DESC
;
EDIT
Just to clear up some things I didn't explicitly say:
Of course you no longer need a query inside the PHP loop, just the displaying portion
Indices on books.id_auth and authors.id (the latter primary or unique) are assumed
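To make the first point concrete, a sketch of the remaining PHP (assuming mysqli and the query above): one query, then a plain display loop:
$res = $mysqli->query("
    SELECT authors.id, authors.surname, authors.name, COUNT(books.id) AS numbooks
    FROM authors
    INNER JOIN books ON books.id_auth = authors.id
    GROUP BY authors.id
    ORDER BY numbooks DESC");
while ($row = $res->fetch_assoc()) {
    // no per-author COUNT query here: the counts arrived with the names
    echo "<option>{$row['surname']}, {$row['name']} ({$row['numbooks']})</option>";
}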
EDIT 2
As #GordonLinoff pointed out, the IFNULL() is redundant in an inner join, so I removed it.
To get all themes, even those without any books, just use a left join (the IFNULL() is kept here for clarity, although COUNT() itself never returns NULL):
SELECT
themes.id,
themes.main,
themes.sub,
IFNULL(COUNT(books.theme),0) AS num
FROM themes
LEFT JOIN books ON books.theme=themes.id
GROUP BY themes.id
;
EDIT 3
Of course a stored value will give you the best performance - but this denormalization comes at a cost: your database now has the potential to become inconsistent in a user-visible way.
If you do go with this method, I strongly recommend you use triggers to auto-fill this field (and of course those triggers must sit on the books table).
Be prepared to see slowed-down inserts - this might of course be okay, as I guess you will see a much higher rate of SELECTs than INSERTs.
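A minimal sketch of such a trigger, assuming the number_of_books column proposed in the question (a matching AFTER DELETE trigger would decrement the counter):
DELIMITER //
CREATE TRIGGER books_after_insert
AFTER INSERT ON books
FOR EACH ROW
BEGIN
    -- keep the denormalized counter in sync whenever a book is added
    UPDATE authors SET number_of_books = number_of_books + 1
    WHERE id = NEW.id_auth;
END//
DELIMITER ;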
After reading a lot about how the JOIN statement works, with the help of useful answer 1 and useful answer 2, I discovered I had used it some 15 or 20 years ago, then forgot about it since I never needed it again.
I made a test using the options I had:
reply with the JOIN query with IFNULL(): 0.5 seconds
reply with the JOIN query without IFNULL(): 0.5 seconds
reply using a stored value: 0.4 seconds
That DB will run on some single-core old iron, so I think a 20% difference could be significant, and I decided to use stored values, updating the count every time a new book is inserted (i.e. not often).
Anyway, thanks a lot for refreshing my memory: JOIN queries will be useful somewhere else in my DB.
update
I used the JOIN method above to query the book themes, which are stored in a far smaller table, in this way:
SELECT themes.id, themes.main, themes.sub, COUNT(books.theme) AS num FROM themes JOIN books ON books.theme = themes.id GROUP BY themes.id ORDER BY themes.main ASC, themes.sub ASC
It works fine, but for themes which are not in the books table I obviously don't get a 0 count, so I don't have lines like Contemporary Poetry - Etruscan (0) to show as disabled options for the sake of list completeness.
Is there a way to get my themes.main and themes.sub back?

Multiple SELECTs vs Single Query with JOIN

Our current setup looks a bit like this.
public_entry (5,000,000 rows) → telephone_number (5,000,000 rows) → user (400,000 rows)
3 tables, with each arrow indicating that the left table holds an integer foreign key referencing the right table.
Now we have two "views" of the data we want to present in our web app.
displaying telephone numbers with public entries based on user attributes (e.g. only numbers from male users), a bit like a score.
displaying telephone numbers with public entries based on their entry date
Each result should get a score indicating how well the number fits your needs (e.g. if you look for a plumber, a number in your area whose related user is a plumber should score high).
We tried two approaches to solving this problem.
The first approach does a SELECT with INNER JOINs over the tables, like the following:
SELECT ..., (...) AS score
FROM public_entry pe
INNER JOIN telephone_number tn ON tn.id = pe.numberid
INNER JOIN user u ON u.id = tn.userid
WHERE ... ORDER BY score
On a smaller system, about 1/4 the size of production, this query performs very well, even under load.
However, when we put this query on the production system, it wreaked havoc, with execution times over 30 seconds.
The second approach fetches all public_entries with a single SELECT on public_entry without any JOINs, then iterates over them, calling a SELECT for each public_entry to fetch the telephone_number and user, computing the score, and discarding the result if the telephone_number and user do not match our filter/interest.
Usually the second approach is never considered, because it creates over 300 queries for a single page load. Foreach'ing over results and calling SELECTs within the loop is usually considered bad style.
However, approach number two performs on the production system: not well, but it does not take more than 1-3 seconds. It performs badly on the test systems, though.
Do you have any suggestions on where the problem might be?
EDIT:
Query
SELECT COUNT(p.id)
FROM public_entry p, fon f, user u
WHERE p.isweb = 1
AND f.hidden = 0
AND f.deleted = 0
AND f.id = p.fonid
AND u.id = f.userid
AND u.gender = "female"
This query has a 3 second execution time.
This is just an example query. I can take out the WHERE and it performs just a bit worse. In general, if we do a SELECT COUNT() with a single INNER JOIN over the data, the query blows up (30 seconds).
I don't have the magic answer you want, but here are some 'reasons' for poor performance, and some possible workarounds (with caveats).
Which of isweb, hidden, deleted, and gender are the most 'selective'? The optimizer sees low-selectivity columns as useless and annoying: if each has only two values, an INDEX on that field alone is probably useless. Hence, it picks one table, does a full scan, then reaches into the next table, etc. Notice in the EXPLAIN that it picked the smallest table (user) first; this is typically what the optimizer does when nothing in the WHERE clause looks useful.
Whether MySQL does all that work or you do all that work is about the same amount of effort. Perhaps you can do it faster, since you can keep simple associative arrays in memory, while MySQL is coded to allow the tables to live on disk and be "cached" in RAM, block by block. But if you don't have enough RAM to load everything in, you are stuck with MySQL.
If you actually removed "hidden" and "deleted" rows, the task would be a little faster.
Your two SELECTs do not look much alike. Are you suggesting there is a wide range of SELECTs? And you effectively need to look through most of all 3 tables to get the "score" or "count"?
Let's look at this from a Data Warehouse approach... Is some of the data "static", that is, unchanging and therefore summarizable? If so, precomputing subtotals (COUNT(*)) into a summary table would let the ultimate queries be a lot faster. DW often involves subtotals by day, but it requires that those subtotals don't change.
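A sketch of that idea with hypothetical names (a daily_counts summary table, and an entry_date column on public_entry, which the question never names); rebuild it nightly, then serve counts from the small table:
CREATE TABLE daily_counts (
    entry_date DATE NOT NULL,
    gender VARCHAR(10) NOT NULL,
    cnt INT UNSIGNED NOT NULL,
    PRIMARY KEY (entry_date, gender)
);
-- refill; REPLACE relies on the primary key to overwrite stale rows
REPLACE INTO daily_counts (entry_date, gender, cnt)
SELECT p.entry_date, u.gender, COUNT(*)
FROM public_entry p
JOIN fon f ON f.id = p.fonid
JOIN user u ON u.id = f.userid
WHERE p.isweb = 1 AND f.hidden = 0 AND f.deleted = 0
GROUP BY p.entry_date, u.gender;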
COUNT(x) has the overhead of checking x for being NULL. Usually that is not necessary and COUNT(*) gives you what you want.
How often are you running the same SELECT? Or, at least, similar SELECTs? Do you need up-to-the-second scores? I'm fishing for running all the likely queries in the middle of the night, then using the results for 24 hours. Note that some queries can run faster by doing multiple things at once. For example, instead of two SELECTs for 'female' versus 'male', do one SELECT and GROUP BY gender.
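As an example of that last point, the COUNT query from the edit above can answer for both genders at once (same tables and filters as in the question):
SELECT u.gender, COUNT(p.id) AS cnt
FROM public_entry p
JOIN fon f ON f.id = p.fonid
JOIN user u ON u.id = f.userid
WHERE p.isweb = 1 AND f.hidden = 0 AND f.deleted = 0
GROUP BY u.gender;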

Optimal mySQL table index structure for faster SELECT of a large range of daily data

I am wondering about the best format to lay out my data in a MySQL table so that it can be queried in the fastest manner to gather an array of daily values to be further utilized by PHP.
So far, I have laid out the table as such:
item_id  price_date  price_amount
1        2000-03-01  22.4
2        2000-03-01  19.23
3        2000-03-01  13.4
4        2000-03-01  14.95
1        2000-03-02  22.5
2        2000-03-02  19.42
3        2000-03-02  13.4
4        2000-03-02  13.95
with item_id defined as an index.
Also, I am using:
"SELECT DISTINCT price_date FROM table_name"
to get an array containing a unique list of dates.
Furthermore, the part of the code that is within a loop (and the focus of my optimization question) is currently written as:
"SELECT price_amount FROM table_name WHERE item_id = 1 ORDER BY price_date"
This second "SELECT" statement is actually within a loop where I am selecting/storing-in-array the daily prices of each item_id requested.
All is currently functioning and pulling the data from MySQL properly; however, each of the above SELECT statements takes approx 4-5 seconds to complete, and when looping through 100+ products to create a summary, this adds up to a very slow information system.
Is there a more efficient way to structure the MySQL table and/or the SELECT statements to retrieve the results faster? Perhaps defining a different index on a different column? I have used the EXPLAIN command on the queries, but am unsure how to use the EXPLAIN output to improve their efficiency.
Thanks in advance to any MySQL wizards that may be able to assist.
Single column index
I am using:
"SELECT DISTINCT price_date FROM table_name"
to get an array containing a unique list of dates.
This query can be executed more efficiently if you create an index for the price_date column:
ALTER TABLE table_name ADD INDEX price_idx (price_date);
Multiple column index
Furthermore, the part of the code that is within a loop (and the focus of my optimization question) is currently written as:
"SELECT price_amount FROM table_name WHERE item_id = 1 ORDER BY price_date"
For the second query, you should create an index covering both the item_id and price_date column:
ALTER TABLE table_name ADD INDEX item_price_idx (item_id, price_date);
I know this is a bit late, but I stumbled across this and thought I would throw my thoughts into the mix.
Indexes used well are very helpful in speeding up queries (EXPLAIN shows some really good results around which indexes are being chosen - if any - for a specific query). However, efficient PHP will help even more.
In your case you do not show the PHP, but it looks like you offer a list of dates and then loop through, finding all the items on each date to get the prices. It would be more efficient to do something like the following:
SELECT item_id, price_amount FROM table_name WHERE price_date = ? ORDER BY item_id, price_amount
with an index (preferably a unique index) on price_date, item_id, price_amount.
You then have a single loop through the resulting rows, not a loop running multiple SQL queries (this is especially true if your SQL server is separate from the PHP box, as each round trip over the network adds overhead).
4-5 seconds for a single query, though, is very slow (by a factor of at least 100x), so it would indicate a problem (a very large table with no usable key) or, potentially, disk issues.
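A sketch of that single-query pattern, assuming PDO ($pdo being an open connection) and the table layout from the question:
$stmt = $pdo->prepare('SELECT item_id, price_amount
    FROM table_name
    WHERE price_date = ?
    ORDER BY item_id, price_amount');
$stmt->execute(['2000-03-01']);
$prices = [];
foreach ($stmt as $row) {
    // group the day's prices by item in a single pass over the result set
    $prices[$row['item_id']][] = $row['price_amount'];
}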

To relation or not to relation? A MySQL, PHP database workflow

I'm kinda new to MySQL and I'm trying to create a somewhat complex database, and I need some help.
My db structure
Tables(columns)
1.patients (Id,name,dob,etc....)
2.visits (Id,doctor,clinic,Patient_id,etc....)
3.prescription (Id,visit_id,drug_name,dose,tdi,etc....)
4.payments (id,doctor_id,clinic_id,patient_id,amount,etc...) etc..
I have about 9 tables; in all of them the primary key is id and it's set to auto-increment.
I don't use relations in my DB (because I don't know whether it would be better or not, and I never got really deep into MySQL), so I just use PHP to run a query fetching info from one table, and use that to run another query to get more info, store it, etc.
For example, if I want to view all the drugs I gave to one of my patients, say with id 100:
1 - click the patient's name (the name link is generated from tbl patients, column id)
2 - search tbl visits WHERE patient_id = 100; that returns all his visits ($x array)
3 - loop through tbl prescription searching for drugs with a visit_id matching $x (loop over the array)
4 - return all rows found.
As my database expands (1k+ records in the visits table), one patient can have more than 40 visits; that's 40 loops into the prescription table to get all his previous prescriptions.
So I came up with a small tweak: I edited my DB so that patient_id and visit_id are columns in nearly all tables, letting me merge steps 2 and 3 into one (search tbl prescription WHERE patient_id = 100). But that left me with many duplicates in my DB, and I feel it's a kinda stupid way to do it!
Should I start considering using a relational database?
If so, can someone explain a bit how this will ease my life?
Can I do this redesign by altering the current tables, or must I recreate all the tables?
Thank you very much.
Yes, you should exploit MySQL's relational database capabilities. They will make your life much easier as this project scales up.
Actually, you're already using them well. You've discovered that patients can have zero or more visits, for example. What you need to do now is learn to use JOIN queries in MySQL.
Once you know how to use JOIN, you may want to declare some foreign keys and other database constraints. But your system will work OK without them.
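As a sketch of what declaring those constraints might look like, assuming InnoDB tables (MyISAM does not enforce foreign keys) and the column names from your question:
-- each child row must now point at an existing parent row
ALTER TABLE visits
    ADD CONSTRAINT fk_visits_patient
    FOREIGN KEY (patient_id) REFERENCES patients (id);
ALTER TABLE prescription
    ADD CONSTRAINT fk_prescription_visit
    FOREIGN KEY (visit_id) REFERENCES visits (id);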
You have already decided to denormalize your database by including both patient_id and visit_id in nearly all tables. Denormalization is the addition of formally redundant data to various tables, usually done for performance reasons. This may or may not be a wise decision as your system scales up, but I think you can trust your instinct about the need for the denormalization you have chosen. Read up on "database normalization" to get some background.
One little bit of advice: Don't use columns named simply "id". Name columns the same in every table. For example, use patients.patient_id, visits.patient_id, and so forth. This is because there are a bunch of automated software engineering tools that help you understand the relationships in your database. If your ID columns are named consistently these tools work better.
So, here's an example of how to do the steps numbered 2 and 3 in your question with a single JOIN query.
SELECT p.patient_id, p.name, v.visit_id, rx.drug_name, rx.drug_dose
FROM patients AS p
LEFT JOIN visits AS v ON p.patient_id = v.patient_id
LEFT JOIN prescription AS rx ON v.visit_id = rx.visit_id
WHERE p.patient_id = '100'
ORDER BY p.patient_id, v.visit_id, rx.prescription_id
Like all SQL queries, this returns a virtual table of rows and columns. In this case each row of your virtual table has patient, visit, and drug data. I used LEFT JOIN in this example, which means that a patient with no visits will have a row with NULL visit data in it. If you specify JOIN instead, MySQL will omit those patients from the virtual table.
