The website I manage is a search engine for workers (yellow-pages style).
I have created a database like this:
People: <---- 4,000,000 records
id
name
address
id_activity <--- linked to the activities table
tel
fax
id_region <--- linked to the regions table
activities: <---- 1,500 activities
id
name_activity
regions: <--- 95 regions
id
region_name
locations: <---- 4,000,000 records
id_people
lat
lon
So basically the query that I am having slowness problems with selects all the "workers" around a selected city (chosen by the user).
The query I have written works, but it takes 5-6 seconds to return results.
Basically I select from the locations table all the rows within a certain radius and then join them to the people table:
SELECT people.*, lat, lon,
(6371 * acos(cos(radians(plat)) * cos(radians(lat)) * cos(radians(lon) - radians(plon)) + sin(radians(plat)) * sin(radians(lat)))) AS distance
FROM locations,
people
WHERE locations.id_people = people.id
HAVING distance < dist
ORDER BY distance LIMIT 0, 20;
My questions are:
Is my database well designed? I don't know whether it's a good idea to have two tables with 4,000,000 records each. Is it OK to run a SELECT on them?
Is my query badly designed?
How can I speed up the search?
The design looks normalized; this is what I would expect to see in most well-designed databases. The amount of data in the tables matters, but it is secondary. However, if there is a 1-to-1 correlation between People and Locations, as your query suggests, I would say the two tables should be merged into one. That will certainly help.
Your SQL looks OK, though adding constraints to reduce the number of rows involved would help; see the sketch below.
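For example, a cheap bounding-box filter on lat/lon means the distance formula only has to run on candidate rows. A sketch, reusing the plat/plon/dist placeholders from the question (dist in km; the 111 km-per-degree factor is an approximation):

SELECT people.*, lat, lon,
(6371 * acos(cos(radians(plat)) * cos(radians(lat)) * cos(radians(lon) - radians(plon)) + sin(radians(plat)) * sin(radians(lat)))) AS distance
FROM locations
JOIN people ON people.id = locations.id_people
WHERE lat BETWEEN plat - (dist / 111.0) AND plat + (dist / 111.0)
  AND lon BETWEEN plon - (dist / (111.0 * cos(radians(plat)))) AND plon + (dist / (111.0 * cos(radians(plat))))
HAVING distance < dist
ORDER BY distance LIMIT 0, 20;

An index on locations (lat, lon) makes the range filter cheap; even without one, MySQL at least skips the trigonometry for rows that fall outside the box.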
You need to index your tables. This is what will normally help most with slowness (as most developers don't consider database indexes at all).
There are a couple of basic things that could be making your query run slowly.
What are your indexes like on your tables? Have you declared primary keys on the tables? Joining two tables each with 4M rows without having indexes causes a lot of work on the DB. Make sure you get this right first.
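A minimal sketch of the indexes implied here, assuming the column names from the question (skip any statement whose key already exists):

ALTER TABLE people ADD PRIMARY KEY (id);
ALTER TABLE locations ADD INDEX idx_locations_people (id_people);
-- lookup columns used elsewhere in the schema
ALTER TABLE people ADD INDEX idx_people_activity (id_activity);
ALTER TABLE people ADD INDEX idx_people_region (id_region);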
If you've already built the right indexes for your DB, you can look at caching data. You're doing a calculation in your query. Are the locations (lat/lon) generally fixed? How often do they change? Are the items in your locations table actual places (cities, buildings, etc.), or are they records of where people have been (like Foursquare check-ins)?
If your locations are places, you can make a lot of nice optimizations by isolating the parts of your data that change infrequently and pre-calculating the distances between them.
If all else fails, make sure your database server has enough RAM. If the server can keep your data in memory it will speed things up a lot.
I've been asked to develop a web application that stores reading data from heat-metering devices and divides the heating expenses among all the flat owners. I chose to work in PHP with MySQL's MyISAM engine.
I was not used to working with large data sets, so I simply created a logical database where we have:
a table for buildings, with an id as an indexed primary key (we now have ~1,200 buildings in the DB)
a table with all the flats in all the buildings, with an id as an indexed primary key and a building_id linking to the building (around 32k+ flats in total)
a table with all the heaters in all the flats, with an id as an indexed primary key and a flat_id linking to the flat (around 280k+ heaters)
a table with all the reading values, with the timestamp of the reading, an id as primary key, and a heater_id linking to the heater (around 2.7M+ readings now)
There is also a separate table, linked to the building, that stores the start and end dates between which the division of expenses has to be done.
When I need to get all the data for a building, the approach I used is to fetch raw data from the DB with a single query, process it in PHP, then make the next query.
So here is roughly the sequence of operations I used:
get the start and end dates from that separate table with a single query
store the dates in PHP variables
get all the flats of the building: SELECT * FROM flats WHERE building_id = my_building_id
loop over the flats in PHP with a while loop
on each step of that loop, run a query getting all the heaters of that specific flat: SELECT * FROM heaters WHERE flat_id = my_flat_id
loop over the heaters with an inner while loop
on each step of this inner loop, get the last reading value of that specific heater: SELECT * FROM reading_values WHERE heater_id = my_heater_id AND data < my_data
Now the problem is that I have serious performance issues.
Before someone points it out: I cannot fetch only the reading values and skip the first six steps above, since I need to print bills, and each bill has to include all the flat and heater information, so I have to get all the flats and heaters data anyway.
So I'd like some suggestions on how to improve the script's performance:
all the tables are indexed, but should I add an index somewhere else?
would using a single query with subqueries, instead of several queries in the PHP code, improve performance?
any other suggestions?
I haven't included specific code as I think it would have made the question too heavy, but I can add some if asked.
Some suggestions:
Don't use SELECT * if you can avoid it; just get the fields you really need.
I didn't test it in your particular case, but usually a single query that joins all three tables achieves much better performance than looping through results in PHP.
If you need to loop for some reason, then at least use MySQL prepared statements, which again should improve performance given the number of queries (see the sketch below) :)
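For reference, a minimal sketch of a server-side prepared statement in MySQL's own SQL syntax; from PHP you would normally use the equivalent mysqli or PDO prepared-statement API. The data column name comes from your question, the ORDER BY/LIMIT is added so only the latest reading before the cutoff comes back, and the values below are just example placeholders:

PREPARE last_reading FROM
  'SELECT * FROM reading_values
   WHERE heater_id = ? AND data < ?
   ORDER BY data DESC
   LIMIT 1';

SET @heater_id = 123, @cutoff = '2014-01-01';   -- example values, set per heater inside the loop
EXECUTE last_reading USING @heater_id, @cutoff;
DEALLOCATE PREPARE last_reading;

The statement is parsed once and executed many times, which saves work when you run thousands of near-identical queries.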
Hope it helps!
Regards
EDIT:
Just to give an example of an alternative query; I'm not sure it suits your specific needs, and I haven't tested it (which probably means I forgot something):
SELECT
a.field1,
b.field2,
c.field3,
d.field4
FROM heaters a
JOIN reading_values b ON (b.heater_id = a.heater_id)
JOIN flats c ON (c.flat_id = a.flat_id)
JOIN buildings d ON (d.building_id = c.building_id)
WHERE
a.heater_id = my_heater_id
AND b.date < my_date
GROUP BY a.heater_id
EDIT 2
Following your comments, I modified the query so that it retrieves the information as you want it: given a building id, it lists all the heaters and their newest reading value up to a given date:
SELECT
a.name,
b.name,
c.name,
d.reading_value,
d.created
FROM buildings a
JOIN flats b ON (b.building_id = a.building_id)
JOIN heaters c ON (c.flat_id = b.flat_id)
JOIN reading_values d ON (d.reading_value_id = (SELECT reading_value_id FROM reading_values WHERE created <= my_date AND heater_id = c.heater_id ORDER BY created DESC LIMIT 1))
WHERE
a.building_id = my_building_id
GROUP BY c.heater_id
It would be interesting to know how it performs in your environment.
Regards
I'm kind of new to MySQL, I'm trying to create a somewhat complex database, and I need some help.
My DB structure, tables (columns):
1. patients (Id, name, dob, etc.)
2. visits (Id, doctor, clinic, Patient_id, etc.)
3. prescription (Id, visit_id, drug_name, dose, tdi, etc.)
4. payments (id, doctor_id, clinic_id, patient_id, amount, etc.), and so on.
I have about 9 tables; in all of them the primary key is 'id' and it's set to auto-increment.
I don't use relations in my DB (because I don't know whether it would be better or not, and I never got really deep into MySQL, so I just use PHP to run a query that fetches info from one table and use that to run another query to get more info, store data, etc.).
For example:
If I want to view all the drugs I gave to one of my patients, say his id is 100:
1. click the patient's name (the link is generated from tbl: patients, column: id)
2. search the visits tbl WHERE patient_id = '100' → that returns all his visits ($x array)
3. loop over the prescription tbl searching for drugs whose visit_id matches an entry of $x (loop over the array)
4. return all rows found.
As my database keeps expanding (1k+ records in the visits table), one patient can have more than 40 visits; that's 40 loops into the prescription table to get all his previous prescriptions.
So I came up with a small tweak: I edited my DB so that patient_id and visit_id are columns in nearly all tables, which lets me collapse steps 2 and 3 into one step (search the prescription tbl WHERE patient_id = 100). But that left me with many duplicates in my DB, and I feel it's kind of a stupid way to do it!
Should I start considering using it as a relational database?
If so, can someone explain a bit how this will make my life easier?
Can I do this redesign by altering the current tables, or must I recreate all the tables?
Thank you very much.
Yes, you should exploit MySQL's relational database capabilities. They will make your life much easier as this project scales up.
Actually you're already using them well. You've discovered that patients can have zero or more visits, for example. What you need to do now is learn to use JOIN queries to MySQL.
Once you know how to use JOIN, you may want to declare some foreign keys and other database constraints. But your system will work OK without them.
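A minimal sketch of what declaring those foreign keys could look like, assuming your tables use (or are converted to) the InnoDB engine and the column names from your question:

-- each visit belongs to a patient; each prescription belongs to a visit
ALTER TABLE visits
  ADD CONSTRAINT fk_visits_patient
  FOREIGN KEY (patient_id) REFERENCES patients (id);

ALTER TABLE prescription
  ADD CONSTRAINT fk_prescription_visit
  FOREIGN KEY (visit_id) REFERENCES visits (id);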
You have already decided to denormalize your database by including both patient_id and visit_id in nearly all tables. Denormalization is the adding of data that's formally redundant to various tables. It's usually done for performance reasons. This may or may not be a wise decision as your system scales up. But I think you can trust your instinct about the need for the denormalization you have chosen. Read up on "database normalization" to get some background.
One little bit of advice: Don't use columns named simply "id". Name columns the same in every table. For example, use patients.patient_id, visits.patient_id, and so forth. This is because there are a bunch of automated software engineering tools that help you understand the relationships in your database. If your ID columns are named consistently these tools work better.
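If you do go for the rename, a rough sketch of the ALTER statement (the INT NOT NULL AUTO_INCREMENT definition here is an assumption; keep whatever definition your id column actually has):

-- rename patients.id to patients.patient_id, keeping its definition
ALTER TABLE patients CHANGE id patient_id INT NOT NULL AUTO_INCREMENT;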
So, here's an example of how to do the steps numbered 2 and 3 in your question with a single JOIN query.
SELECT p.patient_id, p.name, v.visit_id, rx.drug_name, rx.drug_dose
FROM patients AS p
LEFT JOIN visits AS v ON p.patient_id = v.patient_id
LEFT JOIN prescription AS rx ON v.visit_id = rx.visit_id
WHERE p.patient_id = '100'
ORDER BY p.patient_id, v.visit_id, rx.prescription_id
Like all SQL queries, this returns a virtual table of rows and columns. In this case each row of your virtual table has patient, visit, and drug data. I used LEFT JOIN in this example; that means a patient with no visits will still get a row, with NULLs in it. If you specify a plain (inner) JOIN, MySQL will omit those patients from the virtual table.
I pull a range (e.g. limit 72, 24) of games from a database according to which have been voted most popular. I have a separate table for tracking game data, and one for tracking individual votes for a game (rating from 1 to 5, one vote per user per game). A game is considered "most popular" or "more popular" when that game has the highest average rating of all the rating votes for said game. Games with less than 5 votes are not considered. Here is what the tables look like (two tables, "games" and "votes"):
games:
gameid(key)
gamename
thumburl
votes:
userid(key)
gameid(key)
rating
Now, I understand that there is something called an "index" which can speed up my queries by essentially pre-querying my tables and constructing a separate table of indices (I don't really know.. that's just my impression).
I've also read that MySQL operates fastest when multiple queries can be condensed into one longer query (containing joins and nested SELECT statements, I presume).
However, I am currently NOT using an index, and I am making multiple queries to get my final result.
What changes should be made to my database (if any -- including constructing index tables, etc.)? And what should my query look like?
Thank you.
Your query that calculates the average for every game could look like:
SELECT gamename, AVG(rating)
FROM games INNER JOIN votes ON games.gameid = votes.gameid
GROUP BY games.gameid
HAVING COUNT(*)>=5
ORDER BY avg(rating) DESC
LIMIT 0,25
You must have an index on gameid in both games and votes (if you have defined gameid as the primary key of the games table, that side is already covered).
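A minimal sketch of adding the missing index on the votes side (the index name is arbitrary; skip this if gameid is already the leading column of an existing key on votes):

-- lets the join above look up each game's votes through an index
ALTER TABLE votes ADD INDEX idx_votes_gameid (gameid);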
According to the MySQL documentation, an index is created when you designate a primary key at table creation. This is worth mentioning because not all RDBMSs behave this way.
I think you have the right idea here, with your "votes" table acting as a bridge between "games" and "user" to handle the many-to-many relationship. Just make sure that "userid" and "gameid" are indexed on the "votes" table.
If you have access to use InnoDB storage for your tables, you can create foreign keys on gameid in the votes table which will use the index created for your primary key in the games table. When you then perform a query which joins these two tables (e.g. ... INNER JOIN votes ON games.gameid = votes.gameid) it will use that index to speed things up.
Your understanding of an index is essentially correct — it basically creates a separate lookup table which it can use behind the scenes when the query is executed.
When using an index it is useful to use the EXPLAIN syntax (simply prepend your SELECT with EXPLAIN to try this out). The output shows you the list of possible keys available for the query as well as which key the query actually uses. This can be very helpful when optimising your query.
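For instance, a sketch using the rating query from earlier (the exact output varies by MySQL version and data):

EXPLAIN
SELECT gamename, AVG(rating)
FROM games INNER JOIN votes ON games.gameid = votes.gameid
GROUP BY games.gameid
HAVING COUNT(*) >= 5
ORDER BY AVG(rating) DESC
LIMIT 0, 25;
-- check the possible_keys and key columns to see which index, if any, is chosen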
An index is a PHYSICAL DATA STRUCTURE used to speed up retrieval-type queries; it's not simply a table on top of a table, although that is a good working concept. Another way to think of it is the index at the back of a textbook (the difference being that in a book a search key can point to multiple pages/matches, whereas with a unique index a search key points to only one match). An index is defined by a data structure, so you could use a B+ tree index, and there are even hash indexes. This is database/query optimization at the physical/internal level of the database; I'm assuming you know that you're working at the higher levels of the DBMS, which is easier. An index is rooted in the internal levels, and that makes DB query optimization much more effective and interesting.
I've noticed from your question that you have not even developed the query yet. Focus on the query first. Indexing comes after; as a matter of fact, in many graduate and postgraduate database courses, indexing falls under the maintenance of a database rather than its development.
Also, N.B.: I have seen quite a few people say that, as a rule, all primary keys should be indexed. This is not always true. There are instances where a primary key index would slow the database down. In fact, if we were to go with only primary-key indexes, hash indexes would be worth considering, since they can beat B+ trees for exact-match lookups!
In summary, it doesn't make much sense to ask for an index before the query exists. Ask for help with the query first. Then, given your tables (relational schema) and the SQL query, and only then, could I advise you on the best index; remember, indexing is maintenance, and we can't do maintenance if there has been no development.
Kind Regards,
N.B.: most index questions at the postgraduate level of many computing courses go as follows: we give the students a relational schema (i.e. your tables) and a query, and then ask them to critically suggest a suitable index for that query on those tables. We can't ask a question like this if there is no query.
I am currently attempting to build a web application that relies quite heavily on postcode data (supplied from OS CodePoint Open). The postcode database has 120 tables, split by the initial postcode prefix (e.g. SE, WS, B). Each of these tables has between 11k and 48k rows with 3 fields (Postcode, Lat, Lng).
What I need to be able to do is for a user to come online and enter their postcode, e.g. SE1 1LD, which then selects the SE table and converts the postcode into a lat/lng.
I am fine with doing this at the PHP level. My concern is the huge number of rows that will be queried, and whether that is going to grind my website to a halt.
If there are any techniques that I should know about, please do let me know; I've never worked with tables this big before!
Thanks :)
48K are not big numbers. 48 million is. :) If your tables are properly indexed (put indexes on the fields you use in the WHERE clause) it won't be a problem at all.
Avoid LIKE with a leading wildcard (it cannot use an index), and use INNER JOINs instead of LEFT JOINs if possible.
Selecting from 48k rows in MySQL is not big; in fact it's rather small. Index it properly and you are fine.
If I understand correctly, there is an SE table, a WS one, a B one, etc.: in all, 120 tables with the same structure (Postcode, Lat, Lng).
I strongly suggest you normalize the tables.
You can have either one table:
postcode( prefix, postcode, lat, lng)
or two:
postcode( prefixid , postcode, lat, lng )
prefix( prefixid, prefix )
The single postcode table will be much bigger than 11K-48K rows, about 30K × 120 = 3.6M rows, but it will save you from writing a different query for every prefix, and from quite complex ones if, for example, you want to search by latitude and longitude (imagine a query that has to search 120 tables).
If you are not convinced, try to add a person table so you can store data for your users. How would that table relate to the postcode table(s)?
EDIT
Since the prefix is just the first characters of the postcode, which is also the primary key, there is no need for an extra field or a second table. I would simply combine the 120 tables into one:
postcode( postcode, lat, lng )
Then queries like:
SELECT *
FROM postcode
WHERE postcode = 'SE11LD'
or
SELECT *
FROM postcode
WHERE postcode LIKE 'SE%'
will be fast, as they will be using the primary key index.
As long as you have indexes on the appropriate columns, there should be no problem. One of my customers has the postcode database stored in a table like :
CREATE TABLE `postcode_geodata` (
`postcode` varchar(8) NOT NULL DEFAULT '',
`x_coord` float NOT NULL DEFAULT '0',
`y_coord` float NOT NULL DEFAULT '0',
UNIQUE KEY `postcode_idx` (`postcode`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
And we have no problems (from a performance point of view) in querying that.
If your table did become really large, then you could always look at using MySQL's partitioning support - see http://dev.mysql.com/doc/refman/5.1/en/partitioning.html - but I wouldn't look at that until you've done the easier things first (see below).
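For illustration only, a hedged sketch of what a partitioned version of the table above might look like (MySQL 5.1-style syntax; the partition count is arbitrary, and KEY partitioning is used because the partitioning column must appear in every unique key):

CREATE TABLE postcode_geodata (
  postcode varchar(8) NOT NULL DEFAULT '',
  x_coord float NOT NULL DEFAULT '0',
  y_coord float NOT NULL DEFAULT '0',
  UNIQUE KEY postcode_idx (postcode)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
PARTITION BY KEY (postcode)
PARTITIONS 16;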
If you think performance is an issue, turn on MySQL's slow_query_log (see /etc/mysql/my.cnf) and see what it says (you may also find the command 'mysqldumpslow' useful at this point for analysing the slow query log).
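A minimal sketch of turning the slow query log on at runtime instead of editing my.cnf (the one-second threshold is just an example; this needs a privileged account and MySQL 5.1 or later):

SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;   -- log anything slower than 1 second
SHOW VARIABLES LIKE 'slow_query_log_file';   -- where the log is written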
Also try using the 'explain' syntax on the MySQL cli - e.g.
EXPLAIN SELECT a, b, c FROM mytable WHERE d = 'foo' AND e = 'bar'
These steps will help you optimise the database - by identifying which indexes are (or aren't) being used for a query.
Finally, there's the mysqltuner.pl script (see http://mysqltuner.pl) which helps you optimise the MySQL server's settings (e.g. query cache, memory usage, etc., which affect I/O and therefore performance/speed).
I could really use some help optimizing a table on my website that is used to display rankings. I have been reading a lot on how to optimize queries and how to properly use indexes but even after implementing changes I thought would work, little improvement can be seen. My quick fix has been simply to use only the top 100,000 rankings (updated daily and stored in a different table) to improve the speed for now, but I really don't like that option.
So I have a table that stores the information for users that looks something like:
table 'cache':
id (Primary key)
name
region
country
score
There are other variables being stored about the user, but I don't think they are relevant here as they are not used in the rankings.
There are 3 basic ranking pages that a user can view:
A world view:
SELECT name,region,country,score FROM cache ORDER BY score DESC LIMIT 0,26
A region view:
SELECT name,region,country,score FROM cache WHERE region='Europe' ORDER BY score DESC LIMIT 0,26
and a country view:
SELECT name,region,country,score FROM cache WHERE region='Europe' AND country='Germany' ORDER BY score DESC LIMIT 0,26
I have tried almost every combination of indexes I can think of to reduce the work for the database, and while some seem to help a little, I can't find one that lets MySQL stop after just 26 rows for the region and country queries (with simply an index on 'score' the world rankings are blazing fast).
I feel like I might be missing something basic, any help would be much appreciated!
A little extra info: the cache table is currently around 920 megabytes, with a little more than 800,000 rows total. If you need any more info, just let me know.
Your world rankings benefit from the score index because score is the only criterion in the query, and the sort order the query needs is exactly the order the index provides. So that's good.
The other queries will benefit from an index on region. However, similar to what #Matt indicates, a composite index on region, country and score may be the best bet. Note, the three columns for the key should be in region, country, score sequence.
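A minimal sketch of that composite index, assuming the table really is called cache (the index name is arbitrary):

-- composite index in (region, country, score) order: the region query can filter on the
-- leading column, and the region + country query can also read rows pre-sorted by score
ALTER TABLE cache ADD INDEX idx_region_country_score (region, country, score);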
put ONE index on country, score, region. Too many indices will slow you down.
How many records are we talking about?
I am no SQL guru, but in this case, if changes to indexes alone did not do the trick, I would consider playing around with the table structure to see what performance gains I could get (see the query sketch after the outline below).
Cache
id (pk)
LocationId (index)
name
score
Location
LocationId (pk)
CountryId (index) maybe
RegionId (index) maybe
Country
CountryId
name
Region
RegionId
name
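As a rough sketch of what, say, the country rankings might look like against that structure (all table and column names here are assumptions taken from the outline above):

-- country rankings against the restructured schema (names are illustrative)
SELECT c.name, r.name AS region, co.name AS country, c.score
FROM Cache AS c
JOIN Location AS l ON l.LocationId = c.LocationId
JOIN Region AS r ON r.RegionId = l.RegionId
JOIN Country AS co ON co.CountryId = l.CountryId
WHERE co.name = 'Germany'
ORDER BY c.score DESC
LIMIT 0, 26;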
Temp tables in procs would allow you to select on LocationId in every case. It would reduce the overall complexity of the issue you are having: you would be troubleshooting one query plan, not three.
The effort would be high, and the payoff wouldn't come until you were done, so I would suggest looking at the index approach first.
Good luck