We are creating a website where users can create a certain profile. At the moment we already have about 662000 profiles (records in our database). The user can link certain keywords (divided into 5 categories) to their profile. They can link up to about 1250 keywords per category (no, this isn't nonsense, for certain profiles this would actually make sense). At the moment we save these keywords into an array and insert the serialized array in the profile's record in the database.
When a different user uses the search function and searches for one of the keywords, an SQL query is executed with 'WHERE keyword LIKE %keyword%'. This means that is has to go to a pretty big number of records and go through the entire serialized array for each record. Adding an index to the keyword columns is pretty tricky, since they don't have a defined max lenght (this could be 22000+ chars!).
Is there any other more sensible and practical way to go about this?
Thanks!
Never, never, never store multiple values in one column!
Use a mapping table
user_keywords TABLE
--------------------
user_id INT
keyword_id INT
users TABLE
---------------------
id INT
name VARCHAR
...
keywords TABLE
---------------------
id INT
name VARCHAR
...
You could then return all users having a specific keyword in their profile like this
select u.*
from users u
inner join user_keywords uk on uk.user_id = u.id
inner join keywords k on uk.keyword_id = k.id
where k.name = 'keyword_name'
Since you are dealing with a large data you should use NoSQL databases such as Hadoop/Hbase, Cassandra etc. You should also take a look at Lucene/Solr...
http://nosql-database.org/
Related
I need help with this mysql query that executes too long or does not execute at all.
(What I am trying to do is a part of more complex problem, where I want to create PHP cron script that will execute few heavy queries and calculate data from the results returned and then use those data to store it in database for further more convenient use. Most likely I will make question here about that process.)
First lets try to solve one of the problems with these heavy queries.
Here is the thing:
I have table: users_bonitet. This table has fields: id, user_id, bonitet, tstamp.
First important note: when I say user, please understand that users are actually companies, not people. So user.id is id of some company, but for some other reasons table that I am using here is called "users".
Three key fields in users_bonitet table are: user_id ( referencing user.id), bonitet ( represents the strength of user, it can have 3 values, 1 - 2 - 3, where 3 is the best ), and tstamp ( stores the time of bonitet insert. Every time when bonitet value changes for some user, new row is inserted with tstamp of that insert and of course new bonitet value.). So basically some user can have bonitet of 1 indicating that he is in bad situation, but after some time it can change to 3 indicating that he is doing great, and time of that change is stored in tstamp.
Now, I will just list other tables that we need to use in query, and then I will explain why. Tables are: user, club, club_offer and club_territories.
Some users ( companies ) are members of a club. Member of the club can have some club offers ( he is representing his products to the people and other club members ) and he is operating on some territory.
What I need to do is to get bonitet value for every club offer ( made by some user who is member of a club ) but only for specific territory with id of 1100000; Since bonitet values are changing over time for each user, that means that I need to get the latest one only. So if some user have bonitet of 1 at 21.01.2012, but later at 26.05.2012 it has changed to 2, I need to get only 2, since that is the current value.
I made an SQL Fiddle with example db schema and query that I am using right now. On this small database, query is working what I want and it is fast, but on real database it is very slow, and sometimes do not execute at all.
See it here: http://sqlfiddle.com/#!9/b0d98/2
My question is: am I using wrong query to get all this data ? I am getting right result but maybe my query is bad and that is why it executes so slow ? How can I speed it up ? I have tried by putting indexes using phpmyadmin, but it didn't help very much.
Here is my query:
SELECT users_bonitet.user_id, users_bonitet.bonitet, users_bonitet.tstamp,
club_offer.id AS offerId, club_offer.rank
FROM users_bonitet
INNER JOIN (
SELECT max( tstamp ) AS lastDate, user_id
FROM users_bonitet
GROUP BY user_id
)lastDate ON users_bonitet.tstamp = lastDate.lastDate
AND users_bonitet.user_id = lastDate.user_id
JOIN users ON users_bonitet.user_id = users.id
JOIN club ON users.id = club.user_id
JOIN club_offer ON club.id = club_offer.club_id
JOIN club_territories ON club.id = club_territories.club_id
WHERE club_territories.territory_id = 1100000
So I am selecting bonitet values for all club offers made by users that are members of a club and operate on territory with an id of 1100000. Important thing is that I am selecting club_offer.id AS offerId, because I need to use that offerId in my application code so I can do some calculations based on bonitet values returned for each offer, and insert data that was calculated to the field "club_offer.rank" for each row with the id of offerId.
Your query looks fine. I suspect your query performance may be improved if you add a compound index to help the subquery that finds the latest entry from users_botinet for each user.
The subquery is:
SELECT max( tstamp ) AS lastDate, user_id
FROM users_bonitet
GROUP BY user_id
If you add (user_id, tstamp) as an index to this table, that subquery can be satisfied with a very efficient loose index scan.
ALTER TABLE users_bonitet ADD KEY maxfinder (user_id, tstamp);
Notice that if this users_botinet table had an autoincrementing id number in it, your subquery could be refactored to use that instead of tstamp. That would eliminate the possibility of duplicates and be even more efficient, because there's a unique id for joining. Like so.
FROM users_botinet
INNER JOIN (
SELECT MAX(id) AS id
FROM users_botinet
GROUP BY user_id
) ubmax ON users_botinet.id = ubmax.id
In this case your compound index would be (user_id, id.
Pro tip: Don't add lots of indexes unless you know you need them. It's a good idea to read up on how indexes can help you. For example. http://use-the-index-luke.com/
I have a query regarding regular expression.I have design a table which contain three column one column contain member ids which are separated by commas.I am showing you my table structure.Please follow
send_id member_id
1 1211,23,34
2 1,23
I want to select only send_id 2 data which contain member_id as 1.
this is the query that i am using
SELECT * FROM table WHERE column REGEXP '^[1]+$';
but this query giving me both row.Please help me.
With Regards
Rahul
Never store separate values in one column
Normalize your structure like
send_id member_id
1 1211
1 23
1 34
2 1
2 23
If you still want your regex, then it will be
SELECT * FROM t WHERE column REGEXP '(^|[^0-9])1([^0-9]|$)'
First, you should be normalizing your data so you're not in this horrible mess in the first place. Here's a good resource explaining normalization.
Second, I believe your problem lies with your regular expression. Try this instead:
SELECT * FROM table WHERE column REGEXP '^[1]$';
The regular expression you're using uses the [1]+ group. The + means it has to match [1] 1 or more times, hence why you're getting two rows instead of one. Removing the + means it will match [1] once.
However, that still won't fix your problem, as more than one row contains 1. This is why normalization is so important.
Having multiple values inside a column isn't a good practice for designing a DB.
You should normalize your data, i.e., put just one piece of atomic information inside each element of your table.
You can find more information regarding to this in Wikipedia:
http://en.wikipedia.org/wiki/Database_normalization
Like they have told you, perfect solution would be normalize your data, I think Alma Do Mundo answer explains it quite well.
If you want to use REGEXP anyway you have to take in account four approaches; id is the only one, id is the first, id is in the middle and id is at the end. I have use id=74 for the example:
SELECT * FROM table WHERE member_id REGEXP '(^74$|^74,|,74,|,74$)';
depending on your requirements, you should either normalize your data i.e. make 3 tables, one with the send ID, one with the member id, and one that combines the two, then you can link them up with INNER JOINS.
However, if you are going to do it that way, you can use a "WHERE member_id LIKE %1%" to pull in all the relevant fields. You'll have to use the application to filter the relevant records.
In any case, if you're not going to normalize the data you will have to use the front end to filter out the results.
An example of the inner join syntax would look like this
SELECT * FROM SendTable
JOIN Send_Member ON SendTable.send_id = Send_Member.send_id
JOIN Member ON Member.member_id = Send_Member.member_id
WHERE Member.member_id = 1;
where the schema looks like:
Sendtable:
send_Id (primary key)
...other fields
Send_Member:
send_id (primary key and foreign key to SendTable)
member_id (primary key and foreign key to member)
...any fields you might want that are relevant to the particular send table and member table link
Member:
member_id (primarykey)
...other fields
in MySQL, I have a row for each user, with a column that contains friends names separated by \n.
eg.
Friend1
Friend2
Friend3
I'd like to be able to quickly search all the users where the Friends field contains Friend2.
I've found FIND_IN_SET but that only works for commas and the data can contains commas and foreign characters.
Obviously searching with regular expressions and the such will be slow. I'm new to the whole cross referencing so I'd love some help on the best way to structure the data so that it can be found quickly.
Thanks in advance.
Edit: Ok, I forgot to mention a point that the data is coming from a
game where friends names are stored locally and there are no links to
another users ID. Thus the strings. Every time they connect I am given
a dump of their friends names which I use in the background to help match games.
The most commonly used structure for this kind of data is usually adding an extra table. I.e.
user
id,
name
email,
e.t.c.
user_friend
user_id
friend_id
Querying this is a matter of querying the tables. I.e.
List all of a users friends names:
SELECT friend_id
FROM user_friend
WHERE user_id = :theUser
Edit: Regarding OPs edit. Just storing the names is possible too. In this case the table structure would become:
user_friend
user_id
friend_name
and the query:
SELECT friend_name
FROM user_friend
WHERE user_id = :theUser
Why are you keeping friend names as text? This will be inefficient to edit uf say a user removes a friend or changes their name. That's another thing, you should store friend names by some auto_increment id key in your database. It's much faster to search for an integer than a string, especially in a very large database. You should set up a friends table which is like
Column 1: connectionid auto_increment key
Column 2: user1id int
Column 3: user2id int
Column 4: date added date
ect...
Then you can search the connection table above for all rows where user is user1id or user2id and get a list of the other users from that.
My database hasn't been filled yet so I can easily change the format and structure in which the data will be stored.
Yes, you need to normalize your database a bit. With current structure, your searches will be quite slow and consume more space.
Check out this wiki for detailed help on normalization.
You can have the friends table and users table separate and link them both by either foreign key constraint or inner joins.
The structure would be:
Users table
id: AUTO_INCRMENT PK
name
other columns
Friends table
id: AUTO_INCREMENT(not required, but good for partitioning)
UserID:
FriendsID
DateAdded
OtherInfo if required.
I have a voting script which pulls out the number of votes per user.
Everything is working, except I need to now display the number of votes per user in order of number of votes. Please see my database structure:
Entries:
UserID, FirstName, LastName, EmailAddress, TelephoneNumber, Image, Status
Voting:
item, vote, nvotes
The item field contains vt_img and then the UserID, so for example: vt_img4 and both vote & nvotes display the number of votes.
Any ideas how I can relate those together and display the users in order of the most voted at the top?
Thanks
You really need to change the structure of the voting table so that you can do a normal join. I would strongly suggest adding either a pure userID column, or at the very least not making it a concat of two other columns. Based on an ID you could then easily do something like this:
select
a.userID,
a.firstName,
b.votes
from
entries a
join voting b
on a.userID=b.userID
order by
b.votes desc
The other option is to consider (if it is a one to one relationship) simply merging the data into one table which would make it even easier again.
At the moment, this really is an XY problem, you are looking for a way to join two tables that aren't meant to be joined. While there are (horrible, ghastly, terrible) ways of doing it, I think the best solution is to do a little extra work and alter your database (we can certainly help with that so you don't lose any data) and then you will be able to both do what you want right now (easily) and all those other things you will want to do in the future (that you don't know about right now) will be oh so much easier.
Edit: It seems like this is a great opportunity to use a Trigger to insert the new row for you. A MySQL trigger is an action that the database will make when a certain predefined action takes place. In this case, you want to insert a new row into a table when you insert a row into your main table. The beauty is that you can use a reference to the data in the original table to do it:
CREATE TRIGGER Entries_Trigger AFTER insert ON Entries
FOR EACH ROW BEGIN
insert into Voting values(new.UserID,0,0);
END;
This will work in the following manner - When a row is inserted into your Entries table, the database will insert the row (creating the auto_increment ID and the like) then instantly call this trigger, which will then use that newly created UserID to insert into the second table (along with some zeroes for votes and nvotes).
Your database is badly designed. It should be:
Voting:
item, user_id, vote, nvotes
Placing the item id and the user id into the same column as a concatenated string with a delimiter is just asking for trouble. This isn't scalable at all. Look up the basics on Normalization.
You could try this:
SELECT *
FROM Entries e
JOIN Voting v ON (CONCAT('vt_img', e.UserID) = v.item)
ORDER BY nvotes DESC
but please notice that this query might be quite slow due to the fact that the join field for Entries table is built at query time.
You should consider changing your database structure so that Voting contains a UserID field in order to do a direct join.
I'm figuring the Entries table is where votes are cast (you're database schema doesn't make much sense to me, seems like you could work it a little better). If the votes are actually on the Votes table and that's connected to a user, then you should have UserID field in that table too. Either way the example will help.
Lets say you add UserID to the Votes table and this is where a user's votes are stored than this would be your query
SELECT Users.id, Votes.*,
SUM(Votes.nvotes) AS user_votes
FROM Users, Votes
WHERE Users.id = Votes.UserID
GROUP BY Votes.UserID
ORDER BY user_votes
USE ORDER BY in your query --
SELECT column_name(s)
FROM table_name
ORDER BY column_name(s) ASC|DESC
In cases when some one needs to store more than one value in a in a cell, what approach is more desirable and advisable, storing it with delimiters or glue and exploding it into an array later for processing in the server side language of choice, for example.
$returnedFromDB = "159|160|161|162|163|164|165";
$myIdArray = explode("|",$returnedFromDB);
or as a JSON or PHP serialized array, like this.
:6:{i:0;i:1;i:1;i:2;i:2;i:3;i:3;i:4;i:4;i:5;i:5;i:6;}
then later unserialize it into an array and work with it,
OR
have a new row for every new entry like this
postid 12 | showto 2
postid 12 | showto 3
postid 12 | showto 5
postid 12 | showto 6
postid 12 | showto 8
instead of postid 12 | showto "2|3|4|6|8|5|".
OR postid 12 | showto ":6:{i:0;i:2;i:1;i:3;i:2;i:3;i:3;i:4;i:4;i:5;i:5;i:6;}".
Thanks, looking forward to your opinions :D
In cases when some one needs to store more than one value in a in a cell, what approach is more desirable and advisable, storing it with delimiters or glue and exploding it into an array later for processing in the server side language of choice, for example.
Neither. Oh goodness, neither! Edgar F. Codd is rolling in his grave right now.
Storing delimited data in a text field is no better than storing it in a flat file. The data becomes unqueryable. Storing PHP serialized data in a text field is even worse because then only PHP can parse the data.
You want a nice, happy, normalized database.
The thing you're trying to describe is a many-to-many relationship. Each user can maintain one or more posts. Likewise, each post can be maintained by one or more user. Right? Then something like this will work.
CREATE TABLE users (
user_id INTEGER PRIMARY KEY,
...
);
CREATE TABLE posts (
post_id INTEGER PRIMARY KEY,
...
);
CREATE TABLE user_posts (
user_id INTEGER REFERENCES users(user_id),
post_id INTEGER REFERENCES posts(post_id),
UNIQUE KEY(user_id, post_id)
);
-- All posts made by user 22.
SELECT posts.*
FROM posts, user_posts
WHERE user_posts.user_id = 22
AND posts.post_id = user_posts.post_id
-- All users that worked on post 47
SELECT users.*
FROM users, user_posts
WHERE user_posts.post_id = 47
AND users.user_id = user_posts.user_id
Most of the time the recommendation is that many-to-many relationships (such as posts to users) should have a mapping table with 1 row for each post-user combination (in other words, your "new row for every new entry" version).
It's more optimal for things like join queries, and lets you retrieve only the data you need.
You should only serialize data in the DB if the data is never needed to be processed by the DB. For example, you could serialize user ID in the user_id field if you never need to do a query with the user_id field; e.g. never selecting anything based on user.
If these are posts (blog/news/etc. posts?) then I'm pretty confident you'll need to be able to query them by user. Normalizing the user into another table would serve you:
CREATE TABLE posts (post_id, ....);
CREATE TABLE post_users (post_id, user_id, ...);
You can then get the users in a different query, or use group_concat: SELECT post_id, GROUP_CONCAT(user_id) FROM posts JOIN post_users USING (post_id) GROUP BY post_id. When you need to show user name, just join to the users table to get their name in the group concat.
From RDBMS point of view i would 'have a new row for every new entry'
Thats called m:n relationship table.
You can then query the data however you like.
If you need postid 12 | showto ":6:{i:0;i:2;i:1;i:3;i:2;i:3;i:3;i:4;i:4;i:5;i:5;i:6;}". you can do
SELECT postid, CONCAT(':',count(showto),':{i:',GROUP_CONCAT(showto SEPARATOR ';i:'),';}') AS showto
FROM tablename
GROUP BY postid
However if you only need the data in 1 form and not do any other kind of queries on that data then you may aswell store the string.