I have a database containing places, and I need to show distances from any place to other ones in my webpage. Having the distances stored somewhere will save lots of work (loading them should be easier than computing them anew). But how to save the square matrix of the distances? Creating a new column every time I insert a new row doesn't seem to be a good solution, but I didn't find any better solution (though I can think of workarounds such as computing some 10 or 20 nearest distances and assuming I will rarely need more).
What is the optimal way to save square tables of variable (and growing) size in PHP/MySQL? Or is there no good solution, and my workaround (or some other one) is better?
Edit Note: As was mentioned in a comment, once you get enough places it may make more sense to just store long/lat values and calculate the distance on the fly based on those instead. However the solution explained here may still be relevant for other applications.
The best way to handle this would be with a pivot table, where each row has two place IDs and a distance value.
Now, since distance A-B is the same as B-A, we only need to store each pairing once. We can do this by only ever storing a distance when the ID of place A is less than the ID of place B.
SETUP
First a places table to store your places
id | name
---+---------
1 | Place_A
2 | Place_B
3 | Place_C
4 | Place_D
Then a places_distances Pivot table:
place_id_1 | place_id_2 | distance
-----------+------------+----------
1 | 2 | 10.0
1 | 3 | 20.0
1 | 4 | 15.0
2 | 3 | 12.0
2 | 4 | 8.0
3 | 4 | 14.0
Note that pivot tables do not need their own ID field (though some may argue it's still good to have one sometimes). You will set up a unique key as follows (you'll want to look into the documentation for correct usage):
UNIQUE KEY `UNIQUE_placesDistances_primary`(`place_id_1`,`place_id_2`)
This ensures that you cannot have the same place/place pairing in the table twice.
You will also want to make sure to set up foreign keys:
CONSTRAINT FOREIGN KEY `FK_placesDistances_place1` (`place_id_1`)
REFERENCES `places`(`id`),
CONSTRAINT FOREIGN KEY `FK_placesDistances_place2` (`place_id_2`)
REFERENCES `places`(`id`)
This ensures that you can only add entries for places you have actually defined in places. It also means (with the default foreign key behavior) that you can't delete a place while a distance row still references it.
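Putting the pieces together, a full schema might look like the following. This is only a sketch: the column types and sizes are assumptions, so adjust them to your data.
CREATE TABLE `places` (
  `id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `name` VARCHAR(100) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE `places_distances` (
  `place_id_1` INT UNSIGNED NOT NULL,
  `place_id_2` INT UNSIGNED NOT NULL,
  `distance` DECIMAL(10,1) NOT NULL,  -- assumed precision; pick what suits your units
  UNIQUE KEY `UNIQUE_placesDistances_primary` (`place_id_1`,`place_id_2`),
  CONSTRAINT `FK_placesDistances_place1` FOREIGN KEY (`place_id_1`)
    REFERENCES `places`(`id`),
  CONSTRAINT `FK_placesDistances_place2` FOREIGN KEY (`place_id_2`)
    REFERENCES `places`(`id`)
) ENGINE=InnoDB;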
Usage Examples
Looking up the distance between two places
(Given two variables #id_1 as the id of the first place and #id_2 as the id of the second place)
SELECT `distance`
FROM `places_distances`
WHERE (`place_id_1` = #id_1 AND `place_id_2` = #id_2)
OR (`place_id_2` = #id_1 AND `place_id_1` = #id_2)
LIMIT 1;
We use the OR to account for the case where we try to look up distance 2 to 1 and not 1 to 2 - remember, we only store values where the first place's id is less than the second to avoid storing duplicates.
Inserting a new distance
(Given three variables #id_1 as the id of the first place and #id_2 as the id of the second place, and #distance being the distance)
INSERT INTO `places_distances`(`place_id_1`,`place_id_2`,`distance`)
VALUES(LEAST(#id_1, #id_2),GREATEST(#id_1, #id_2), #distance)
We're using the built-in comparison functions LEAST and GREATEST to maintain our rule that the first ID is always less than the second, which avoids storing duplicates.
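Since the unique key rejects a second row for the same pair, you can also turn this into an upsert if you ever need to recompute a distance. A sketch, using the same placeholder variables:
INSERT INTO `places_distances`(`place_id_1`,`place_id_2`,`distance`)
VALUES(LEAST(#id_1, #id_2), GREATEST(#id_1, #id_2), #distance)
ON DUPLICATE KEY UPDATE `distance` = VALUES(`distance`);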
Showing a list of place names, sorted by their distance from furthest to closest
To get the original names from the places table to show up in our places_distances query, we have to join the places table in twice, once for each place column. A LEFT JOIN works here since we only care about what is in the places_distances table (and thanks to the foreign keys, an INNER JOIN would return the same rows).
SELECT
`p_1`.`name` AS `place_1`,
`p_2`.`name` AS `place_2`,
`distance`
FROM `places_distances` AS `distances`
LEFT JOIN `places` AS `p_1`
ON `distances`.`place_id_1` = `p_1`.`id`
LEFT JOIN `places` AS `p_2`
ON `distances`.`place_id_2` = `p_2`.`id`
ORDER BY `distance` DESC
Which should return a table like this:
place_1 | place_2 | distance
--------+---------+----------
Place_A | Place_C | 20.0
Place_A | Place_D | 15.0
Place_C | Place_D | 14.0
Place_B | Place_C | 12.0
Place_A | Place_B | 10.0
Place_B | Place_D | 8.0
Showing a table of places and their distances to a specific given place
This one is a bit trickier, since each row needs to show the name that is not our input place, but we can use another useful function, IF(CONDITION, TRUE_OUTPUT, FALSE_OUTPUT), to do that.
(#place_name being the variable containing the place name, in this case 'Place_B')
SELECT
IF(`p_1`.`name`=#place_name, `p_2`.`name`, `p_1`.`name`) AS `name`,
`distance`
FROM `places_distances` AS `distances`
LEFT JOIN `places` AS `p_1`
ON `distances`.`place_id_1` = `p_1`.`id`
LEFT JOIN `places` AS `p_2`
ON `distances`.`place_id_2` = `p_2`.`id`
WHERE `p_1`.`name` = #place_name OR `p_2`.`name` = #place_name
ORDER BY `distance` DESC
Which should return a table like this:
name | distance
--------+-----------
Place_C | 12.0
Place_A | 10.0
Place_D | 8.0
I would store the lat/long for all places and write a function to calculate the distance between them from the lat/long information.
That way, there is no need to calculate the distances for new places you want to add to your DB.
Moreover, if you have lots of places and use a pivot table to store the distances, be aware that this table can grow very fast, since it needs to cover every combination of places.
For instance: for 1,000 places you get 1000 × 999 / 2 = 499,500 rows even when each pair is stored only once (999,000 if you store both directions). Do the math for larger numbers; this table may contain a lot of rows depending on how many places you have.
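For reference, here is a minimal sketch of computing a distance on the fly with the haversine formula. The `lat`/`lng` column names and the fixed point are assumptions; 6371 is the Earth's mean radius in km.
SET @lat = 37.98, @lng = 23.73;

SELECT `id`, `name`,
       6371 * 2 * ASIN(SQRT(
           POW(SIN(RADIANS(`lat` - @lat) / 2), 2)
         + COS(RADIANS(@lat)) * COS(RADIANS(`lat`))
           * POW(SIN(RADIANS(`lng` - @lng) / 2), 2)
       )) AS `distance_km`
FROM `places`
ORDER BY `distance_km`;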
Break it into another table called "distance" that relates back to the original "place" table:
create table distance (place_id_1 int, place_id_2 int, distance int);
That is, for each place, calculate the distance for another place and save it in this new table.
You could create a new table with two columns as foreign keys for the locations and one column for the distance between them.
place1 | place2 | distance
-------+--------+---------
 ...   |  ...   |  ...
Depending on how many locations you have, this table could grow very fast.
The simplest way is to make another table which contains two place IDs and the distance, like:
place1 | place2 | distance
-------+--------+---------
  a    |   b    |   20
  c    |   d    |   30
When fetching data, just join it with the places table.
I think something like this could do the job.
ORIGIN | CITY 1 | CITY 2 | CITY 3 | CITY 4 | CITY 5
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
CITY 1 0 20 40 20
CITY 5 10 50 20 0
CITY 3 10 0 10 40
You can easily get the distances to other places, and you don't need to repeat the city names for every distance you store.
SELECT `CITY 2` FROM DISTANCES WHERE ORIGIN = 'CITY 5';
Related
I have a MySQL table with area and lat/lon location columns. Every area has many locations, say 20,000. Is there a way to pick just a few, say 100, that look somewhat evenly distributed on the map?
The distribution doesn't have to be perfect, query speed is more important. If that is not possible directly with MySQL a very fast algorithm that somehow picks evenly distributed locations might also work.
Thanks in advance.
Edit: answering some requests in the comments. The data doesn't have anything that can be used for grouping; it's just the area and the coordinates of the locations. Example:
+-------+--------------+----------+-----------+------------+--------+--------+
| id | area | postcode | lat | lon | colour | size |
+-------+--------------+----------+-----------+------------+--------+--------+
| 16895 | Athens | 10431 | 37.983917 | 23.7293599 | red | big |
| 16995 | Athens | 11523 | 37.883917 | 23.8293599 | green | medium |
| 16996 | Athens | 10432 | 37.783917 | 23.7293599 | yellow | small |
| 17000 | Thessaloniki | 54453 | 40.783917 | 22.7293599 | green | small |
+-------+--------------+----------+-----------+------------+--------+--------+
There are some more columns with characteristics but those are just used for filtering.
I did try getting every nth row in the meantime; it seems to work, although it's a bit slow:
SET @a = 0;
SELECT * FROM `locations` WHERE (@a := @a + 1) % 200 = 0;
Using ORDER BY RAND() also works, but it's a bit slow too.
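A sketch of that random fallback, assuming the `locations` table from the example above and the ~100-row cap mentioned in the edits below:
SELECT *
FROM `locations`
WHERE `area` = 'Athens'
ORDER BY RAND()
LIMIT 100;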
Edit 2: It turned out to be easy to add postal codes to the table. With those, grouping by postal code gives a result that looks nice to the eye. The only issue is that some very large areas have around 3,000 distinct postcodes, so picking just 100 can still leave many of them clustered in one place; I will probably need to post-process in PHP.
Edit 3, answering @RickJames's questions in the comments so they are in one place:
Please define "evenly distributed" -- evenly spaced in latitude? no two are "close" to each other? etc.
"Evenly distributed" was a bad choice of words. We just want to show some locations on the area that are not all in one place
Are the "areas" rectangles? Hexagons? Or gerrymandered congressional districts?
They can be thought of roughly as rectangles, but it shouldn't matter. An important thing I missed: we also need to show locations from multiple areas. Areas may be far apart from each other or neighboring (but not overlapping). In that case we'd want the sample of 100 to be split between the areas.
Is the "100 per area" fixed? Or can it be "about 100"
It's not fixed, it's around 100 but we can change this if it doesn't look nice
Is there an AUTO_INCREMENT id on the table? Are there gaps in the numbers?
Yes there is an AUTO_INCREMENT id and can have gaps
Has the problem changed from "100 per area" to "1 per postal code"?
Nope, the problem is still the same: "show 100 per area in a way that not all of them are in the same place". How this is done doesn't matter.
What are the total row count and desired number of rows in output?
Total row count depends on the area and criteria; it can be up to 40k in an area. If the total is more than 1000, we want to fall back to showing just a random 100. If it's 1000 or less, we can show all of them.
Do you need a different sample each time you run the query?
Same sample or different sample even with the same criteria is fine
Are you willing to add a column to the table?
It's not up to me, but with a good argument we can most probably add a new column.
Here's an approach that may satisfy the goals.
Preprocess the table, making a new table, to get rid of "duplicate" items.
If the new table is small enough, a full scan of it may be fast enough.
As for "duplicates", consider this as a crude way to discover that two items land in the same spot:
SELECT ROUND(latitude * 5),
ROUND(longitude * 3),
MIN(id) AS id_to_keep
FROM tbl
GROUP BY 1,2
The "5" and "3" can be tweaked upward (or downard) to cause more (or fewer) ids to be kept. "5" and "3" are different because of the way the lat/lng are laid out; that ratio might work most temperate latitudes. (Use equal numbers near the equator, use a bigger ration for higher latitudes.)
There is a minor flaw... Two items very close to each other might be across the boundaries created by those ROUNDs.
How many rows are in the original table? How many rows does the above query generate? ( SELECT COUNT(*) FROM ( ... ) x; )
I have a table company in which I save company information, and I want to save any number (N) of locations for a particular company (country_id, city_id); one company has multiple locations. I have to save country and city in the database in such a way that if a user wants to view companies filtered by country or by city, the search will be very fast (with indexing applied).
Which option will give me better performance in terms of fast search and normalization?
Option 1:
Should I maintain the country IDs and city IDs as JSON and save that in the company table?
No new table needed; I would just add to or update the JSON based on the user's selections.
For example:
[{"country1": ["city1", "city2", "city3"]},
 {"country3": ["city5", "city1", "city3"]}]
Then I can run a LIKE query on this field -> decode the JSON -> return the result.
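To make Option 1 concrete, the lookup would be something like the following hypothetical sketch (assuming the JSON lives in a `locations_json` column):
SELECT *
FROM `company`
WHERE `locations_json` LIKE '%city12%';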
Option 2:
Should I create a new table and save the country's and city's PKs along with the company_id FK?
company_id (FK) | country id | city id
1 | 25 | 12
1 | 25 | 16
1 | 25 | 19
1 | 30 | 1
1 | 30 | 69
1 | 30 | 14
Then just query and return the result.
Normalize if you're using traditional SQL.
MongoDB and other similar systems for storing hierarchical data (MarkLogic, etc) have ways of making the search of JSON docs fast.
But searching and updating denormalized data is an unreliable pain in the neck in SQL. With the volume you have, it will be very slow.
Option #2, meaning creating a separate table for company location is the best option. Use the combination of all 3 columns to create the primary key as a clustered index.
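A sketch of what that Option 2 table could look like, with the composite primary key suggested above plus secondary indexes for the country and city filters (all names here are assumptions):
CREATE TABLE `company_location` (
  `company_id` INT NOT NULL,
  `country_id` INT NOT NULL,
  `city_id`    INT NOT NULL,
  PRIMARY KEY (`company_id`, `country_id`, `city_id`),
  KEY `idx_country` (`country_id`),
  KEY `idx_city` (`city_id`),
  CONSTRAINT `fk_cl_company` FOREIGN KEY (`company_id`) REFERENCES `company`(`id`)
) ENGINE=InnoDB;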
Never, under any circumstances, will a delimited-value column be more efficient than lookup tables in a relational database. The cost of using LIKE or parsing the data (not to mention using the LIKE operator to fetch more results than needed and then parsing them in code) is always higher than the cost of querying well-indexed, normalized tables with a simple inner join.
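For illustration, the "simple inner join" lookup would be something like this sketch (assuming the `company_location` table above):
SELECT c.*
FROM `company` AS c
JOIN `company_location` AS cl ON cl.`company_id` = c.`id`
WHERE cl.`city_id` = 12;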
I have a table which contains a standard auto-incrementing ID, a type identifier, a number, and some other irrelevant fields. When I insert a new object into this table, the number should auto-increment based on the type identifier.
Here is an example of how the output should look:
id type_id number
1 1 1
2 1 2
3 2 1
4 1 3
5 3 1
6 3 2
7 1 4
8 2 2
As you can see, every time I insert a new object, the number increments according to the type_id (i.e. if I insert an object with type_id of 1 and there are 5 objects matching this type_id already, the number on the new object should be 6).
I'm trying to find a performant way of doing this with huge concurrency. For example, there might be 300 inserts within the same second for the same type_id and they need to be handled sequentially.
Methods I've tried already:
PHP
This was a bad idea, but I've added it for completeness. A request was made to get the MAX() number for the item type, and then number + 1 was inserted. This is quick but doesn't work concurrently, as there could be 200 inserts between the request for MAX() and that particular insert, leading to multiple objects with the same number and type_id.
Locking
Manually locking and unlocking the table before and after each insert in order to maintain the increment. This caused performance issues due to the number of concurrent inserts and because the table is constantly read from throughout the app.
Transaction with Subquery
This is how I'm currently doing it but it still causes massive performance issues:
START TRANSACTION;
INSERT INTO objects (type_id,number) VALUES ($type_id, (SELECT COALESCE(MAX(number),0)+1 FROM objects WHERE type_id = $type_id FOR UPDATE));
COMMIT;
Another negative of this approach is that I need a follow-up query to get the number that was added (i.e. searching for the object with that $type_id ordered by number descending so I can see the number that was created; this is done per $user_id, so it works, but it adds an extra query which I'd like to avoid).
Triggers
I looked into using a trigger to set the number dynamically on insert, but this wasn't performant, as I need to query the table I'm inserting into (which isn't allowed directly, so it has to be done via a subquery, causing performance issues).
Grouped Auto-Increment
I've had a look at grouped auto-increment (so that the number would auto-increment based on type_id) but then I lose my auto-increment ID.
Does anybody have any ideas on how I can make this performant at the level of concurrent inserts that I need? My table is currently InnoDB on MySQL 5.5
Appreciate any help!
Update: Just in case it is relevant, the objects table has several million objects in it. Some of the type_id can have around 500,000 objects assigned to them.
Use a transaction and SELECT ... FOR UPDATE. This will resolve the concurrency conflicts.
For the Transaction with Subquery approach, try adding an index on the type_id column. I think an index on type_id will speed up your subquery.
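Going a step further (an assumption on my part, not something from the question): a composite index on (type_id, number) lets MySQL answer the MAX(number) subquery for one type_id from the index alone, without touching the rows.
ALTER TABLE `objects` ADD INDEX `idx_type_number` (`type_id`, `number`);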
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table
(id INT NOT NULL AUTO_INCREMENT PRIMARY KEY
,type_id INT NOT NULL
);
INSERT INTO my_table VALUES
(1,1),(2,1),(3,2),(4,1),(5,3),(6,3),(7,1),(8,2);
SELECT x.*
, COUNT(*) rank
FROM my_table x
JOIN my_table y
ON y.type_id = x.type_id
AND y.id <= x.id
GROUP
BY id
ORDER
BY type_id
, rank;
+----+---------+------+
| id | type_id | rank |
+----+---------+------+
| 1 | 1 | 1 |
| 2 | 1 | 2 |
| 4 | 1 | 3 |
| 7 | 1 | 4 |
| 3 | 2 | 1 |
| 8 | 2 | 2 |
| 5 | 3 | 1 |
| 6 | 3 | 2 |
+----+---------+------+
or, if performance is an issue, just do the same thing with a couple of @variables.
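For illustration, the @variables version usually looks something like this. It's a MySQL 5.x idiom that relies on left-to-right evaluation of the select list, so treat it as a sketch rather than guaranteed behavior:
SELECT `id`, `type_id`, `rank`
FROM (
  SELECT `id`, `type_id`,
         @rank := IF(`type_id` = @prev, @rank + 1, 1) AS `rank`,
         @prev := `type_id` AS `prev_type`
  FROM my_table
  CROSS JOIN (SELECT @rank := 0, @prev := NULL) AS vars
  ORDER BY `type_id`, `id`
) AS ranked;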
Perhaps an idea: create a (temporary) table for all rows with a common type_id.
In that table you can use auto-increment for your num column, so num should be fully trustworthy.
Then you can select your data and update your first table.
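A very rough sketch of that idea for a single type_id (all names here are hypothetical, and note that AUTO_INCREMENT can still leave gaps on rolled-back inserts):
CREATE TEMPORARY TABLE type_1_numbers (
  `num`       INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `object_id` INT NOT NULL
);

-- Each insert receives the next num from AUTO_INCREMENT:
INSERT INTO type_1_numbers (`object_id`) VALUES (42);

-- Copy the generated numbers back onto the main table:
UPDATE `objects` AS o
JOIN type_1_numbers AS t ON t.`object_id` = o.`id`
SET o.`number` = t.`num`;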
My website has a followers/following system (like Twitter's). My dilemma is creating the database structure to handle who's following who.
What I came up with was creating a table like this:
id | user_id | followers | following
1 | 20 | 23,58,84 | 11,156,27
2 | 21 | 72,35,14 | 6,98,44,12
... | ... | ... | ...
Basically, I was thinking that each user would have a row with columns for their followers and the users they're following. The followers and people they're following would have their user id's separated by commas.
Is this an effective way of handling it? If not, what's the best alternative?
That's the worst way to do it. It goes against normalization. Have 2 separate tables: Users and User_Followers. Users will store user information. User_Followers will be like this:
id | user_id | follower_id
1 | 20 | 45
2 | 20 | 53
3 | 32 | 20
user_id and follower_id will be foreign keys referring to the id column in the Users table.
There is a better physical structure than proposed by other answers so far:
CREATE TABLE follower (
user_id INT, -- References user.
follower_id INT, -- References user.
PRIMARY KEY (user_id, follower_id),
UNIQUE INDEX (follower_id, user_id)
);
InnoDB tables are clustered, so the secondary indexes behave differently than in heap-based tables and can have unexpected overheads if you are not cognizant of that. Having a surrogate primary key id just adds another index for no good reason1 and makes indexes on {user_id, follower_id} and {follower_id, user_id} fatter than they need to be (because secondary indexes in a clustered table implicitly include a copy of the PK).
The table above has no surrogate key id and (assuming InnoDB) is physically represented by two B-Trees (one for the primary/clustering key and one for the secondary index), which is about as efficient as it gets for searching in both directions2. If you only need one direction, you can abandon the secondary index and go down to just one B-Tree.
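To illustrate, both lookup directions are then satisfied directly by one of those two B-Trees:
-- Followers of user 20 (served by the primary key):
SELECT `follower_id` FROM `follower` WHERE `user_id` = 20;

-- Users that user 20 follows (served by the secondary index):
SELECT `user_id` FROM `follower` WHERE `follower_id` = 20;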
BTW what you did was a violation of the principle of atomicity, and therefore of 1NF.
1 And every additional index takes space, lowers the cache effectiveness and impacts the INSERT/UPDATE/DELETE performance.
2 From followee to follower and vice versa.
One weakness of that representation is that each relationship is encoded twice: once in the row for the follower and once in the row for the followed user. That makes it harder to maintain data integrity and makes updates tedious.
I would make one table for users and one table for relationships. The relationship table would look like:
id | follower | following
1 | 23 | 20
2 | 58 | 20
3 | 84 | 20
4 | 20 | 11
...
This way adding new relationships is simply an insert, and removing relationships is a delete. It's also much easier to roll up the counts to determine how many followers a given user has.
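For example, assuming the relationship table above is called relationships:
INSERT INTO relationships (follower, following) VALUES (23, 20);  -- 23 follows 20
DELETE FROM relationships WHERE follower = 23 AND following = 20; -- 23 unfollows 20
SELECT COUNT(*) FROM relationships WHERE following = 20;          -- follower count for 20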
No, the approach you describe has a few problems.
First, storing multiple data points as comma-separated strings has a number of issues. Such a column is difficult to join on (you can join using LIKE, but it will hurt performance), difficult and slow to search on, and can't be indexed the way you would want.
Second, if you store both a list of followers and a list of people following, you have redundant data (the fact that A is following B will show up in two places), which is both a waste of space, and also creates the potential of data getting out-of-sync (if the database shows A on B's list of followers, but doesn't show B on A's list of following, then the data is inconsistent in a way that's very hard to recover from).
Instead, use a join table. That's a separate table where each row has a user id and a follower id. This allows things to be stored in one place, allows indexing and joining, and also allows you to add additional columns to that row, for example to show when the following relationship started.
Earlier I asked this question, which basically asked how to list 10 winners in a table with many winners, according to their points.
This was answered.
Now I'm looking to search for a given winner X in the table, and find out what position he is in, when the table is ordered by points.
For example, if this is the table:
Winners:
NAME     | POINTS
---------+--------
Winner1 | 1241
Winner2 | 1199
Sally | 1000
Winner4 | 900
Winner5 | 889
Winner6 | 700
Winner7 | 667
Jacob | 623
Winner9 | 622
Winner10 | 605
Winner11 | 600
Winner12 | 586
Thomas | 455
Pamela | 434
Winner15 | 411
Winner16 | 410
These are possible inputs and outputs for what I want to do:
Query: "Sally", "Winner12", "Pamela", "Jacob"
Output: 3, 12, 14, 8
How can I do this? Is it possible, using only a MySQL statement? Or do I need PHP as well?
This is the kind of thing I want:
WHEREIS FROM Winners WHERE Name='Sally' LIMIT 1
Ideas?
Edit - NOTE: You do not have to deal with the situation where two Winners have the same Points (assume for simplicity's sake that this does not happen).
I think this will get you the desired result. Note that it properly handles cases where the targeted winner is tied on points with another winner (both get the same position).
SELECT COUNT(*) + 1 AS Position
FROM myTable
WHERE Points > (SELECT Points FROM myTable WHERE Winner = 'Sally')
Edit:
I'd like to "plug" Ignacio Vazquez-Abrams' answer which, in several ways, is better than the above.
For example, it allows listing all (or several) winners and their current position.
Another advantage is that it allows expressing a more complicated condition for deciding that a given player is ahead of another (see below). Reading incrediman's comment to the effect that there will not be "ties" prompted me to look into this: the query can be slightly modified as follows to handle players with the same number of points (such players would formerly have been given the same Position value; now the position is further tied to their relative Start values).
SELECT w1.name, (
SELECT COUNT(*)
FROM winners AS w2
WHERE (w2.points > w1.points)
OR (W2.points = W1.points AND W2.Start < W1.Start) -- Extra cond. to avoid ties.
)+1 AS rank
FROM winners AS w1
-- WHERE W1.name = 'Sally' -- optional where clause
For comparison, here is the simpler version without the tie-breaker:
SELECT w1.name, (
SELECT COUNT(*)
FROM winners AS w2
WHERE w2.points > w1.points
)+1 AS rank
FROM winners AS w1