I am currently attempting to build a web application that relies quite heavily on postcode data (supplied from OS CodePoint Open). The postcode database has 120 tables, which break the data down by initial postcode prefix (e.g. SE, WS, B). Each of these tables has between 11k and 48k rows with 3 fields (Postcode, Lat, Lng).
What I need is for a user to come online and enter their postcode, e.g. SE1 1LD, which then selects the SE table and converts the postcode into a lat/lng.
I am fine with doing this at the PHP level. My concern is the huge number of rows that will be queried and whether that is going to grind my website to a halt.
If there are any techniques that I should know about, please do let me know. I've never worked with tables containing this many rows!
Thanks :)
48K is not a big number. 48 million is. :) If your tables are properly indexed (put indexes on the fields you use in the WHERE clause), it won't be a problem at all.
Avoid LIKE, and use INNER JOINs instead of LEFT JOINs if possible.
Selecting from 48k rows in MySQL is not big; in fact it's rather small. Index it properly and you are fine.
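As a rough sketch, assuming the per-prefix tables are kept as they are (the table name se and the stored postcode format are assumptions, the column names come from the question):
-- Make the postcode lookup an indexed equality check
ALTER TABLE se ADD PRIMARY KEY (Postcode);

SELECT Lat, Lng FROM se WHERE Postcode = 'SE1 1LD';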
If I understand correctly, there is an SE table, a WS one, a B one, etc. In all, 120 tables with the same structure (Postcode, Lat, Lng).
I strongly propose you normalize the tables.
You can have either one table:
postcode( prefix, postcode, lat, lng)
or two:
postcode( prefixid , postcode, lat, lng )
prefix( prefixid, prefix )
The postcode table will be much bigger than any single 11K-48K table, about 30K x 120 = 3.6M rows, but it will save you from writing different queries for every prefix, and from quite complex ones if, for example, you want to search by latitude and longitude (imagine a query that searches 120 tables).
If you are not convinced, try to add a person table so you can store data for your users. How will this table be related to the postcode table(s)?
EDIT
Since the prefix is just the first characters of the postcode, which is also the primary key, there is no need for an extra field or a second table. I would simply combine the 120 tables into one:
postcode( postcode, lat, lng )
Then queries like:
SELECT *
FROM postcode
WHERE postcode = 'SE11LD'
or
SELECT *
FROM postcode
WHERE postcode LIKE 'SE%'
will be fast, as they will be using the primary key index.
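A minimal sketch of the combined table and the merge (the per-prefix source table names such as se are assumptions, not from the question):
-- Combined table with the postcode as primary key
CREATE TABLE postcode (
    postcode VARCHAR(8) NOT NULL,
    lat DOUBLE NOT NULL,
    lng DOUBLE NOT NULL,
    PRIMARY KEY (postcode)
);

-- Repeat for each of the 120 prefix tables
INSERT INTO postcode (postcode, lat, lng)
SELECT Postcode, Lat, Lng FROM se;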
As long as you have indexes on the appropriate columns, there should be no problem. One of my customers has the postcode database stored in a table like:
CREATE TABLE `postcode_geodata` (
`postcode` varchar(8) NOT NULL DEFAULT '',
`x_coord` float NOT NULL DEFAULT '0',
`y_coord` float NOT NULL DEFAULT '0',
UNIQUE KEY `postcode_idx` (`postcode`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
And we have no problems (from a performance point of view) in querying that.
If your table did become really large, then you could always look at using MySQL's partitioning support - see http://dev.mysql.com/doc/refman/5.1/en/partitioning.html - but I wouldn't look at that until you've done the easier things first (see below).
If you think performance is an issue, turn on MySQL's slow_query_log (see /etc/mysql/my.cnf) and see what it says (you may also find the command 'mysqldumpslow' useful at this point for analysing the slow query log).
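As a rough sketch, the slow query log can also be turned on at runtime rather than by editing my.cnf (the threshold and file path here are illustrative):
-- Illustrative values; adjust the threshold and path for your setup
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;   -- log anything slower than 1 second
SET GLOBAL slow_query_log_file = '/var/log/mysql/mysql-slow.log';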
Also try using the 'explain' syntax on the MySQL cli - e.g.
EXPLAIN SELECT a,b,c FROM table WHERE d = 'foo' and e = 'bar'
These steps will help you optimise the database - by identifying which indexes are (or aren't) being used for a query.
Finally, there's the mysqltuner.pl script (see http://mysqltuner.pl), which helps you optimise the MySQL server's settings (e.g. query cache, memory usage, etc., which affect I/O and therefore performance/speed).
This is my first time making my own MySQL database, and I was hoping for some pointers. I've looked through previous questions and found that it IS possible to search multiple tables at once... so that expanded my possibilities.
What I am trying to do, is have a searchable / filterable listing of Snowmobile clubs on a PHP page.
These clubs should be listable by state or county, or searchable by name / other contained info.
I'd also like them to be alphabetized in the results, despite the order of entry.
Currently my thinking was to have a table for NY, PA, etc.
With columns for county (varchar), club name (varchar), street address (long text), phone (varchar), email (varchar), website address (varchar).
Should I really be making multiple tables for each county, such as NY.ALBANY, NY.MADISON?
Are the field formats I have chosen the sensible ones?
Should Address be broken into subcomponents... such as street1, street2, city, state, zip?
Eventually, I think I'd like a column "trailsopen" with a yes or no, and change the tr background to green or red based on input.
Hope this makes sense...
Here is how I would set up your db:
state
id (tinyint) //primary key auto incremented unsigned
short (varchar(2)) // stores NY, PA
long (varchar(20)) // Stores New York, Pennsylvania
county
id (int) //primary key auto incremented unsigned
state_id (tinyint) //points to state.id
name (varchar(50))
club_county
id (int) //primary key auto incremented unsigned
county_id (int) //points to county.id
club_id (int) //points to club.id
club
id (int) //primary key auto incremented unsigned
name (varchar(100))
address (varchar(100))
city (varchar(25))
zip (int)
etc...
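A rough sketch of those tables as MySQL DDL might look like this (column sizes and the engine choice are assumptions; treat it as a starting point rather than a final schema):
CREATE TABLE state (
    id TINYINT UNSIGNED NOT NULL AUTO_INCREMENT,
    short VARCHAR(2) NOT NULL,      -- NY, PA
    `long` VARCHAR(20) NOT NULL,    -- New York, Pennsylvania
    PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE county (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    state_id TINYINT UNSIGNED NOT NULL,  -- points to state.id
    name VARCHAR(50) NOT NULL,
    PRIMARY KEY (id),
    FOREIGN KEY (state_id) REFERENCES state(id)
) ENGINE=InnoDB;

CREATE TABLE club (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name VARCHAR(100) NOT NULL,
    address VARCHAR(100),
    city VARCHAR(25),
    zip INT,
    PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE club_county (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    county_id INT UNSIGNED NOT NULL,  -- points to county.id
    club_id INT UNSIGNED NOT NULL,    -- points to club.id
    PRIMARY KEY (id),
    FOREIGN KEY (county_id) REFERENCES county(id),
    FOREIGN KEY (club_id) REFERENCES club(id)
) ENGINE=InnoDB;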
From my perspective, it seems like 1 table will be enough for your needs. MySQL is so robust that there are many ways to do just about anything. I recommend downloading and using MySQL Workbench, which makes creating tables, changing tables, and writing queries easier and quicker than embedding them in a webpage.
Download MySQL Workbench -> http://dev.mysql.com/downloads/workbench/
You will also need to learn a lot about the MySQL queries. I think you can put all the info that you need in one table, and the trick is which query you use to display the information.
For example, assume you only have 1 table, with all states together. You can display just the snow mobile clubs from NY state with a query like this:
select * from my_table where state = "NY";
If you want to display the result alphabetic by Club Name, then you would use something like this:
select * from my_table where state = "NY" order by clubname;
There is A LOT of documentation online. So I would suggest doing quite a few hours of research and playing with MySQL Workbench.
The purpose of Stack Overflow is to answer more specific questions that have to do with specific code or queries. So once you have built a program, and get stumped on something, you can ask the specific question here. Good luck!
You can create a single table with a composite key constraint. For example:
Say I have 3 departments in a company and each has a number of sub-departments, so I can create a table like this:
Dept_id || sub_dept_id || Name || Sal || Address || Phone
...where Dept_id and sub_dept_id jointly form the primary key and guarantee uniqueness.
But remember, if your database is going to be very large, think before taking this step; you might need clustering or an index for that scenario.
It is also good practice to break a large item down into sub-components, so you can split up the Address.
As for your yes/no: use an integer field and plan it so that YES is stored as 1, otherwise 0 (zero).
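A minimal sketch of such a composite primary key (the column names follow the example above; the table name and types are assumptions):
CREATE TABLE department_staff (
    Dept_id     INT NOT NULL,
    sub_dept_id INT NOT NULL,
    Name        VARCHAR(100),
    Sal         DECIMAL(10,2),
    Address     VARCHAR(255),
    Phone       VARCHAR(20),
    PRIMARY KEY (Dept_id, sub_dept_id)  -- jointly unique
) ENGINE=InnoDB;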
You shouldn't make individual tables for the individual counties. What you should do instead is create a table for states, a table for counties, and a table for addresses.
The result could look something like this:
state (id, code, name),
county (id, stateID, name),
club (id, countyID, name, streetAddress, etc...)
The process used to determine what to break up, and when, is called "database normalisation" - there are actually algorithms that do this for you. The wiki page on it is a good place to start: http://en.wikipedia.org/wiki/Database_normalization
One long text for the street address is fine, btw, as are varchars for the other fields.
Should I really be making multiple tables for each county, such as NY.ALBANY, NY.MADISON?
It depends, but in your described case an alternative might be to have one database table with all the snowmobile clubs, and one table for all the states/counties. In the clubs table you could have an id field as foreign key which links the entry to a specific state/county entry.
To get all the info together you'd just do a JOIN operation on the tables (please refer to the MySQL documentation); a rough example follows below.
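As a sketch, assuming a clubs table with a region_id foreign key pointing at a combined states/counties table called regions (both names are assumptions):
-- List all clubs in New York, alphabetised by club name
SELECT c.clubname, r.county, r.state
FROM clubs AS c
JOIN regions AS r ON c.region_id = r.id
WHERE r.state = 'NY'
ORDER BY c.clubname;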
Are the field formats I have chosen the sensible ones?
They would work.
Should Address be broken into subcomponents... such as street1, street2, city, state, zip?
Essentially the question here is whether you need it broken down into subcomponents, either now or in the future. If it is broken down, the data is separated, which makes further processing (e.g. generating form letters, automated lookups) potentially simpler, but that depends on your processing; if you don't need it separated, why make life more complicated?
so many answers already.. agreed.
I need to sort through a column in my database. This column is my category structure; the data in the column is city names, but not all the names are the same for each city. What I need to do is go through the values in the column. I may have 20-40 values that are the same city but written differently, and I need a script that can interpret them and change them to a single value.
So I may have two values in the city column, say ( england > london ) and ( westlondon ), but I need to change both to just london. Is there a script out there that is capable of interpreting the values that are already there and changing them to the value I want? I know the difficult way of doing this one by one, but I wondered if there was a script in any language that could do this.
I've done this sort of data clean-up plenty of times and I'm afraid I don't know of anything easier than just writing your own fixes.
One thing I can recommend is making the process repeatable. Have a replacement table with something like (rulenum, pattern, new_value). Then, work on a copy of the relevant bits of your table so you can just re-run the whole script.
Then, you can start with the obvious matches (just see what looks plausible) and move to more obscure ones. Eventually you'll have 50 without matches and you can just manually patch entries for this.
Making it repeatable is important because you'll be bound to find mis-matches in your first few attempts.
So, something like (syntax untested):
CREATE TABLE matches (rule_num INT PRIMARY KEY, pattern TEXT, new_value TEXT);

CREATE TABLE cityfix AS
    SELECT id, city AS old_city, '' AS new_city, 0 AS match_num FROM locations;

UPDATE cityfix AS c
JOIN matches AS m ON c.old_city LIKE m.pattern
SET c.new_city = m.new_value, c.match_num = m.rule_num
WHERE c.match_num = 0;

-- Review results, add new patterns to matches, repeat the UPDATE.
-- If you need to, you can drop table cityfix and start again.
Just an idea: 16K is not so much. First use Perl's DBI (I'm assuming you are going to use Perl) to fetch that city column and store it in a hash (city name as the key). Then find an algorithm that suits your needs (performance-wise) to iterate over the hash keys, and use String::Diff to find matching intersections (read about it, it can definitely help you out), storing each as the value. Then you can use that hash to update the database, with the key as the old value and the value as the new value.
The website I have to manage is a search engine for workers (yellow pages style).
I have created a database like this:
People: <---- 4,000,000 records
id
name
address
id_activity <--- linked to the activites table
tel
fax
id_region <--- linked to the regions table
activites: <---- 1500 activites
id
name_activity
regions: <--- 95 regions
id
region_name
locations: <---- 4,000,000 records
id_people
lat
lon
So basically the query that I am having a slowness problem with is the one that selects all the "workers" around a city selected by the user.
The query I have created is fully working but takes 5-6 seconds to return results...
Basically I do a select on the locations table to find all the locations within a certain radius, and then join to the people table:
SELECT people.*,id, lat, lng, poi,
(6371 * acos(cos(radians(plat)) * cos(radians(lat)) * cos(radians(lng) - radians(plon)) + sin(radians(plat)) * sin(radians(lat)))) AS distance
FROM locations,
people
WHERE locations.id = people.id
HAVING distance < dist
ORDER BY distance LIMIT 0 , 20;
My questions are:
Is my Database nicely designed? I don't know if it's a good idea to have 2 table with 4,000,000 records each. Is it OK to do a select on it?
Is my request badly designed?
How can I speed up the search?
The design looks normalized. This is what I would expect to see in most well designed databases. The amount of data in the tables is important, but secondary. However if there is a 1-to-1 correlation between People and Locations, as appears from your query, I would say the tables should be one table. This will certainly help.
Your SQL looks OK, though adding constraints to reduce the number of rows involved would help.
You need to index your tables. This is what will normally help most with slowness (as most developers don't consider database indexes at all).
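As a hedged sketch of both suggestions above (index the join columns, and add constraints so the distance formula only runs over nearby rows), using the column names from the original query; plat, plon and dist stand for the user's latitude, longitude and radius in km, as in that query:
-- Index the join columns and the coordinates
-- (skip any key that already exists; if the locations key is actually
--  id_people, as in the schema above, adjust accordingly)
ALTER TABLE people    ADD PRIMARY KEY (id);
ALTER TABLE locations ADD PRIMARY KEY (id), ADD INDEX lat_lng (lat, lng);

-- Pre-filter with a rough bounding box; 111 km per degree of latitude is an approximation
SELECT people.*, lat, lng,
       (6371 * ACOS(COS(RADIANS(plat)) * COS(RADIANS(lat))
              * COS(RADIANS(lng) - RADIANS(plon))
              + SIN(RADIANS(plat)) * SIN(RADIANS(lat)))) AS distance
FROM locations
JOIN people ON locations.id = people.id
WHERE lat BETWEEN plat - (dist / 111) AND plat + (dist / 111)
  AND lng BETWEEN plon - (dist / (111 * COS(RADIANS(plat))))
              AND plon + (dist / (111 * COS(RADIANS(plat))))
HAVING distance < dist
ORDER BY distance
LIMIT 0, 20;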
There are a couple of basic things that could be making your query run slowly.
What are your indexes like on your tables? Have you declared primary keys on the tables? Joining two tables each with 4M rows without having indexes causes a lot of work on the DB. Make sure you get this right first.
If you've already built the right indexes for your DB you can look at caching data. You're doing a calculation in your query. Are the locations (lat/lon) generally fixed? How often do they change? Are the items in your locations table actual places (cities, buildings, etc), or are they records of where the people have been (like Foursquare check-ins)?
If your locations are places you can make a lot of nice optimizations if you isolate the parts of your data that change infrequently and pre-calculate the distances between them.
If all else fails, make sure your database server has enough RAM. If the server can keep your data in memory it will speed things up a lot.
I'm working on the next version of a local online dating site, PHP & MySQL based and I want to do things right. The user table is quite massive and is expected to grow even more with the new version as there will be a lot of money spent on promotion.
The current version which I guess is 7-8 years old was done probably by someone not very knowledgeable in PHP and MySQL so I have to start over from scratch.
The community currently has 200k+ users and is expected to grow to 500k-1mil in the next one or two years. There are more than 100 attributes for each user's profile and I have to be able to search by at least 30-40 of them.
As you can imagine I'm a little wary of making a table with 200k rows and 100 columns. My predecessor split the user table in two... one with the most used and searched columns and one with the rest (and bulk) of the columns. But this led to big synchronization problems between the two tables.
So, what do you think it's the best way to go about it?
This is not an answer per se, but since a few answers here suggested the attribute-value model, I just wanted to jump in and share my experience.
I tried using this model once, with a table of 120+ attributes (growing by 5-10 every year) and about 100k+ new rows every 6 months; the indexes grew so big that it took forever to add or update a single user_id.
The problem I find with this type of design (not that it's completely unfit for any situation) is that you need to put a primary key on (user_id, attrib) on that second table. Not knowing the potential length of attrib, you would usually use a generous length, thus growing the index. In my case, attribs could be from 3 to 130 chars. The value column almost certainly suffers from the same problem.
And as the OP said, this leads to synchronization problems. Imagine if every attribute (or say at least 50% of them) NEEDS to exist.
Also, as the OP says, the search needs to be done on 30-40 attributes, and I just can't imagine how 30-40 joins would be efficient, or even a GROUP_CONCAT() given its length limitation.
My only viable solution was to go back to a table with as many columns as there are attributes. My indexes are now greatly smaller, and searches are easier.
EDIT: Also, there are no normalization problems. Either have lookup tables for attribute values or make them ENUMs.
EDIT 2: Of course, one could say I should have a look-up table for the possible attribute values (reducing index sizes), but then I'd have to join on that table.
What you could do is split the user data across two tables.
1) Table: user
This will contain the "core" fixed information about a user such as firstname, lastname, email, username, role_id, registration_date and things of that nature.
Profile related information can go in its own table. This will be an infinitely expandable table with a key => val nature.
2) Table: user_profile
Fields: user_id, option, value
user_id: 1
option: profile_image
value: /uploads/12/myimage.png
and
user_id: 1
option: questions_answered
value: 24
Hope this helps,
Paul.
The entity-attribute-value model might be a good fit for you:
http://en.wikipedia.org/wiki/Entity-attribute-value_model
Rather than have 100 and growing columns, add one table with three columns:
user_id, property, value.
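A minimal sketch of that layout and a lookup against it (the names and column sizes here are illustrative, not prescriptive):
CREATE TABLE user_property (
    user_id  INT UNSIGNED NOT NULL,
    property VARCHAR(64)  NOT NULL,
    value    VARCHAR(255),
    PRIMARY KEY (user_id, property),
    INDEX property_value (property, value(100))  -- prefix index keeps the key small
) ENGINE=InnoDB;

-- Find users whose eye_colour is 'green'
SELECT user_id
FROM user_property
WHERE property = 'eye_colour' AND value = 'green';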
In general, you shouldn't sacrifice database integrity for performance.
The first thing that I would do about this is to create a table with 1 million rows of dummy data and test some typical queries against it, using a stress tool like ab. It will most probably turn out that it performs just fine: 1 million rows is a piece of cake for MySQL. So, before trying to solve a problem, make sure you actually have it.
If you find the performance poor and the database really turns out to be a bottleneck, consider general optimizations, like caching (on all levels, from mysql query cache to html caching), getting better hardware etc. This should work out in most cases.
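One quick way to produce that dummy data is a small stored procedure; this is only a sketch, and the users table and its columns here are assumptions:
-- Seed a test table with n dummy rows
DELIMITER //
CREATE PROCEDURE seed_dummy_users(IN n INT)
BEGIN
    DECLARE i INT DEFAULT 0;
    WHILE i < n DO
        INSERT INTO users (username, age, city)
        VALUES (CONCAT('user', i), 18 + (i MOD 60), CONCAT('city', i MOD 500));
        SET i = i + 1;
    END WHILE;
END //
DELIMITER ;

CALL seed_dummy_users(1000000);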
In general you should always get the schema formally correct before you worry about performance!
That way you can make informed decisions about adapting the schema to resolve specific performance problems, rather than guessing.
You definitely should go down the 2-table route. This will significantly reduce the amount of storage, the code complexity, and the effort of changing the system to add new attributes.
Assuming that each attribute can be represented by an ordinal number, and that you're only looking for symmetrical matches (i.e. you're trying to match people based on similar attributes, rather than an expression of intention)...
At a simple level, the query to find suitable matches may be very expensive. Effectively you are looking for nodes within the same proximity in an N-dimensional space; unfortunately most relational databases aren't really set up for this kind of operation (I believe PostgreSQL has support for this). So most people would probably start with something like:
SELECT candidate.id,
       COUNT(*)
FROM   users candidate,
       attributes candidate_attrs,
       attributes current_user_attrs
WHERE  current_user_attrs.user_id = $current_user
AND    candidate.id <> $current_user
AND    candidate.id = candidate_attrs.user_id
AND    candidate_attrs.attr_type = current_user_attrs.attr_type
AND    candidate_attrs.attr_value = current_user_attrs.attr_value
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
However this forces the system to compare every available candidate to find the best match. Apply a little heuristics and you can get a much more effective query:
SELECT candidate.id,
       COUNT(*)
FROM   users candidate,
       attributes candidate_attrs,
       attributes current_user_attrs
WHERE  current_user_attrs.user_id = $current_user
AND    candidate.id <> $current_user
AND    candidate.id = candidate_attrs.user_id
AND    candidate_attrs.attr_type = current_user_attrs.attr_type
AND    candidate_attrs.attr_value
           BETWEEN current_user_attrs.attr_value - $tolerance
               AND current_user_attrs.attr_value + $tolerance
GROUP BY candidate.id
ORDER BY COUNT(*) DESC;
(The value of $tolerance will affect the number of rows returned and the query performance, particularly if you've got an index on (attr_type, attr_value).)
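That index, if it does not already exist, could be added along these lines (assuming the attributes table used in the queries above):
ALTER TABLE attributes ADD INDEX attr_type_value (attr_type, attr_value);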
This can be further refined into a points scoring system:
SELECT candidate.id,
       SUM(1 / (1 +
           (candidate_attrs.attr_value - current_user_attrs.attr_value)
         * (candidate_attrs.attr_value - current_user_attrs.attr_value)
       )) AS match_score
FROM   users candidate,
       attributes candidate_attrs,
       attributes current_user_attrs
WHERE  current_user_attrs.user_id = $current_user
AND    candidate.id <> $current_user
AND    candidate.id = candidate_attrs.user_id
AND    candidate_attrs.attr_type = current_user_attrs.attr_type
AND    candidate_attrs.attr_value
           BETWEEN current_user_attrs.attr_value - $tolerance
               AND current_user_attrs.attr_value + $tolerance
GROUP BY candidate.id
ORDER BY match_score DESC;
This approach lets you do lots of different things - including searching by a subset of attributes, e.g.
SELECT candidate.id,
       SUM(1 / (1 +
           (candidate_attrs.attr_value - current_user_attrs.attr_value)
         * (candidate_attrs.attr_value - current_user_attrs.attr_value)
       )) AS match_score
FROM   users candidate,
       attributes candidate_attrs,
       attributes current_user_attrs,
       attribute_subsets s
WHERE  current_user_attrs.user_id = $current_user
AND    candidate.id <> $current_user
AND    candidate.id = candidate_attrs.user_id
AND    candidate_attrs.attr_type = current_user_attrs.attr_type
AND    candidate_attrs.attr_value
           BETWEEN current_user_attrs.attr_value - $tolerance
               AND current_user_attrs.attr_value + $tolerance
AND    s.subset_name = $required_subset
AND    s.attr_type = current_user_attrs.attr_type
GROUP BY candidate.id
ORDER BY match_score DESC;
Obviously this does not accommodate non-ordinal data (e.g. birth sign, favourite pop band). Without knowing a lot more about the structure of the existing data, it's rather hard to say exactly how effective this will be.
If you want to add more attributes, then you don't need to make any changes to your PHP code or the database schema - it can be completely data-driven.
Another approach would be to identify stereotypes - i.e. reference points within the N-dimensional space - then work out which of these a particular user is closest to. You collapse all the attributes down to a single composite identifier, then you just need to apply the same approach to find the best match within the subset of candidates who have also been matched to the stereotype.
Can't really suggest anything without seeing the schema. Generally, a MySQL database should be normalized to at least 3NF or BCNF. It rather sounds like it is not normalized right now, with 100 columns in one table.
Also, you can easily enforce referential integrity with foreign keys and transactions by using the InnoDB engine.
I know I am writing queries wrong, and when we get a lot of traffic our database gets hit HARD and the page slows to a crawl...
I think I need to write queries based on CREATE VIEW for the last 30 days from CURDATE()? But I'm not sure where to begin, or whether this would be a MORE efficient query for the database.
Anyways, here is a sample query I have written..
$query_Recordset6 = "SELECT `date`, title, category, url, comments
FROM cute_news
WHERE category LIKE '%45%'
ORDER BY `date` DESC";
Any help or suggestions would be great! I have about 11 queries like this, but I am confident that if I can get help on one of these, I can apply the same approach to the rest!
Putting a wildcard on the left side of a value comparison:
LIKE '%xyz'
...means that an index cannot be used, even if one exists. You might want to consider using Full Text Search (FTS), which means adding a full-text index.
Normalizing the data would be another step to consider - categories should likely be in a separate table.
SELECT `date`, title, category, url, comments
FROM cute_news
WHERE category LIKE '%45%'
ORDER BY `date` DESC
The LIKE '%45%' means a full table scan will need to be performed. Are you perhaps storing a list of categories in the column? If so, creating a new table storing category and news_article_id will allow an index to be used to retrieve the matching records much more efficiently.
OK, time for psychic debugging.
In my mind's eye, I see that query performance would be improved considerably through database normalization, specifically by splitting the multi-valued category column into a separate table that has two columns: the primary key for cute_news and the category ID.
This would also allow you to directly link said table to the categories table without having to parse it first.
Or, as Chris Date said: "Every row-and-column intersection contains exactly one value from the applicable domain (and nothing else)."
Anything with LIKE '%XXX%' is going to be slow. It's a slow operation.
For something like categories, you might want to separate categories out into another table and use a foreign key in the cute_news table. That way you can have category_id, and use that in the query which will be MUCH faster.
Also, I'm not quite sure why you're talking about using CREATE VIEW. Views will not really help you with speed. Not unless it's a materialized view, which MySQL doesn't support natively.
If your database is getting hit hard, the solution isn't to make a view (the view is still basically the same amount of work for the database to do), the solution is to cache the results.
This is especially applicable since, from what it sounds like, your data only needs to be refreshed once every 30 days.
I'd guess that your category column is a list of category values like "12,34,45,78"?
This is not good relational database design. One reason it's not good is as you've discovered: it's incredibly slow to search for a substring that might appear in the middle of that list.
Some people have suggested using fulltext search instead of the LIKE predicate with wildcards, but in this case it's simpler to create another table so you can list one category value per row, with a reference back to your cute_news table:
CREATE TABLE cute_news_category (
news_id INT NOT NULL,
category INT NOT NULL,
PRIMARY KEY (news_id, category),
FOREIGN KEY (news_id) REFERENCES cute_news(news_id)
) ENGINE=InnoDB;
Then you can query and it'll go a lot faster:
SELECT n.`date`, n.title, c.category, n.url, n.comments
FROM cute_news n
JOIN cute_news_category c ON (n.news_id = c.news_id)
WHERE c.category = 45
ORDER BY n.`date` DESC
Any answer is a guess. Show:
- the relevant SHOW CREATE TABLE outputs
- the EXPLAIN output from your common queries.
And Bill Karwin's comment certainly applies.
After all this analysis and optimizing, sampling the data into a table holding only the last 30 days could still be desirable, in which case you're better off running a daily cronjob to do just that, as sketched below.
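A rough sketch of what that nightly job might run (the snapshot table name recent_news is an assumption):
-- Rebuild a 30-day snapshot table each night
CREATE TABLE IF NOT EXISTS recent_news LIKE cute_news;

TRUNCATE TABLE recent_news;

INSERT INTO recent_news
SELECT * FROM cute_news
WHERE `date` >= CURDATE() - INTERVAL 30 DAY;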