In the interest of good relational database design:
There are currently two columns in the DB: "GroupName" and "WebGroupName". The second column is used for simple URL access to a profile, e.g. www.example.com/myWebGroupName. The reason for this is that it avoids spaces being passed in the URL; for example, www.example.com/my Web Group Name would not work.
To reiterate the DB structure: column one would store "My Group Name" and column two would store "MyGroupName".
Possible solutions may err on the side of storing the group name without spaces, then using some regular expression to add the spaces back. The focus of my question is how to eliminate the need for two columns storing near-duplicate data.
Thank you for your time
Assuming that you really have a problem with spaces in URLs (as Larry Lustig pointed out, it isn't necessarily a problem), it isn't bad relational database design to have two columns that often hold very similar information.
The kind of repetition that is to be avoided (normalized) deals with repetition across multiple rows. If you have two columns which are meant to contain different, but related information, then these two columns are perfectly OK and you aren't breaking any rules. The fact that sometimes these two columns are equal (coincidentally) is not a problem.
You said:
Possible solutions may err on the side of storing the group name
without spaces then using some regular expression to add the spaces
back. The focus of my question is how to eliminate the need for two
columns storing near-duplicate data.
From this I assume that what is most important to your system is the web group name. If the group name were the driver then writing an expression that removes spaces would be trivial. If the web group name is something that can be set arbitrarily based on the group name, then you should store the name with spaces and replace them with empty strings when you need a web group name. If the web group name is not completely arbitrary then you really do have two independent data points and they need to be stored in two separate columns.
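As a minimal sketch of the "store the name with spaces, derive the web name" option: the helper name web_group_name is hypothetical, and it assumes the web name is nothing more than the group name with whitespace removed.

```python
def web_group_name(group_name: str) -> str:
    """Derive the URL-friendly name by dropping all whitespace.

    Hypothetical helper: assumes the web name is purely a function
    of the stored group name.
    """
    return "".join(group_name.split())


print(web_group_name("My Group Name"))
```

Going the other way (re-inserting spaces) is not generally possible without extra rules, which is one reason to store the spaced form and derive the compact one.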
Related
Which is good practice? To store data as a comma separated list in the database or have multiple rows?
I have a table for accounts, classes, and enrolments.
If the enrolment table has 3 fields: ID, AccountID and ClassID, is it better for ClassID to be a varchar containing a comma separated list such as this: "24,21,182,12" or for it to be just an int and have one entry per enrolment?
TL;DR: Don't do this. That is, don't use a "packed array" here.
Use a correctly normalized design with "multiple rows". This is likely a good candidate for a Many-to-Many relationship. Consider this structure:
Classes 1:M Enrollments(Class,Student) M:1 Students
Following a properly normalized design will reduce pain. In addition, here are some other advantages:
Referential integrity (use InnoDB)
Consistent model described with relationships
Type enforcement (can't have "foo,,")
JOIN and query without needing custom code
"What are the names of the students in class A?"
"Who is taking more than one class?"
Columns can be usefully indexed (query performance)
Generally faster than handling locally in code
More flexible and consistent
Can attach attributes to enrollments such as status
No need to have code to handle serialization at access sites
More accommodating of placeholders and ORMs
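The Classes 1:M Enrollments M:1 Students structure above can be sketched roughly as follows. This uses Python's built-in sqlite3 module purely so the example runs anywhere (the thread itself is about MySQL/InnoDB); the table names, column names, and sample rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE classes  (class_id   INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- One row per (student, class) pair; extra attributes such as status fit here.
CREATE TABLE enrollments (
    student_id INTEGER NOT NULL REFERENCES students(student_id),
    class_id   INTEGER NOT NULL REFERENCES classes(class_id),
    status     TEXT NOT NULL DEFAULT 'active',
    PRIMARY KEY (student_id, class_id)
);
INSERT INTO students VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO classes  VALUES (24, 'A'), (21, 'B');
INSERT INTO enrollments (student_id, class_id) VALUES (1, 24), (1, 21), (2, 24);
""")

# "What are the names of the students in class A?"
in_class_a = [row[0] for row in conn.execute("""
    SELECT s.name
    FROM students s
    JOIN enrollments e ON e.student_id = s.student_id
    JOIN classes c     ON c.class_id = e.class_id
    WHERE c.title = 'A'
    ORDER BY s.name
""")]

# "Who is taking more than one class?"
multi_class = [row[0] for row in conn.execute("""
    SELECT s.name
    FROM students s
    JOIN enrollments e ON e.student_id = s.student_id
    GROUP BY s.student_id, s.name
    HAVING COUNT(*) > 1
""")]

print(in_class_a, multi_class)
```

Both questions are answered with plain joins and no string parsing, which is the point of the normalized layout.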
Never ever ever cram multiple values into a single database field by combining them with some sort of delimiter, like a comma, or fixed length substrings. In the rare cases where this clearly gives a benefit in storage requirements or performance ... see rule #1: never ever ever. Ever.
When you cram multiple values into a single field, you sabotage all the clever features built into the database engine to help you retrieve and manipulate values.
Like let's say you have this -- I guess it's some sort of student database.
Plan A
student (student_id, account_id, class_id_mash)
Plan B
student (student_id, account_id)
student_class (student_id, class_id)
Okay, let's say you want a list of all the students taking class #27. With Plan B you write
select student.student_id
from student join student_class on student.student_id=student_class.student_id
where class_id=27
Easy.
How would you do it with Plan A? You might think
select student_id
from student
where class_id_mash like '%27%'
But that will not only find all students in class 27, but also all those in class 127 or 272.
Okay, how about:
select student_id
from student
where class_id_mash like '%,27,%'
There, now we won't find 127 or 272! But, oops, we also won't find it if the 27 happens to be the first or last one in the list, because then there aren't commas on both sides.
So okay, maybe we could get around that with more rules about delimiters or with a more complex matching expression. But it would be unnecessarily complex and painful.
And even if we did it, every search for class id has to be a full-file sequential search. With one value per field and multiple records, you can create an index on the class_id field for fast, efficient retrieval. (Some database engines have ways to index into the middle of text fields, but again, why get into complicated solutions when there's an easy solution?)
How do we validate the class_id's? With separate fields, we can say "class_id references class" and the database engine will ensure that we don't enter an illegal value. With the mash, no such free validation.
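To make the Plan A versus Plan B comparison concrete, here is a small runnable sketch using Python's sqlite3 module. The table student_a stands in for Plan A's student table, and the sample rows are invented for illustration; SQLite is an assumption chosen only so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for SQLite to enforce FKs
conn.executescript("""
-- Plan A: every class id mashed into one text column
CREATE TABLE student_a (student_id INTEGER PRIMARY KEY, class_id_mash TEXT);
INSERT INTO student_a VALUES (1, '27,30'), (2, '12,127'), (3, '5,27,9'), (4, '272');

-- Plan B: one row per (student, class), validated against the class table
CREATE TABLE class (class_id INTEGER PRIMARY KEY);
CREATE TABLE student_class (
    student_id INTEGER NOT NULL,
    class_id   INTEGER NOT NULL REFERENCES class(class_id),
    PRIMARY KEY (student_id, class_id)
);
CREATE INDEX idx_student_class_class ON student_class(class_id);
INSERT INTO class VALUES (5), (9), (12), (27), (30), (127), (272);
INSERT INTO student_class VALUES
    (1, 27), (1, 30), (2, 12), (2, 127), (3, 5), (3, 27), (3, 9), (4, 272);
""")


def ids(sql):
    """Run a query and return the sorted first column."""
    return sorted(row[0] for row in conn.execute(sql))


# Over-matches: also catches 127 and 272.
loose = ids("SELECT student_id FROM student_a WHERE class_id_mash LIKE '%27%'")
# Under-matches: misses 27 when it is first or last in the list.
strict = ids("SELECT student_id FROM student_a WHERE class_id_mash LIKE '%,27,%'")
# Exact, and able to use the index on class_id.
plan_b = ids("SELECT student_id FROM student_class WHERE class_id = 27")

# The foreign key rejects a class id that does not exist.
try:
    conn.execute("INSERT INTO student_class VALUES (5, 999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(loose, strict, plan_b, rejected)
```

Students 1 and 3 are the only ones actually in class 27; only the Plan B query returns exactly those two.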
I have done both, but instead of storing the information in the database as comma separated, I use another delimiter, such as | (so that I don't have to worry about formatting on insert into the DB). It's more about how often you will query the data.
If you are only going to need the complete list, it is fine to store it as a comma separated value. But if you need to query the list, they should be stored separately.
I'm currently storing 3 bits of information about a person's name.
name,nicename,searchname = ("Mr.Joe bloggs", "Mr-Joe-bloggs", "mrjoebloggs")
name is used for the user's display name, nicename for the URL, and searchname for real-time searching of the database (so speed is a must; milliseconds matter!).
Currently one table holds all 3 fields, but how much more efficient would it be to store each field in a separate table and relate everything by id?
Or would that just waste extra selects relating them to one another? The DB will have over 100m records.
If you insist on keeping those three fields, you'd be creating a one-to-one relationship with every piece of data. It would make sense to keep them all in the same row.
However, you might find it better to only store the name. When you need the "nice name", you can use a regex to replace periods and space (and other characters) with hyphens (or remove them). When a user searches for "mr joe bloggs", you can make a simple searching algorithm by dividing up the three words and using the LIKE clause.
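A rough sketch of deriving the other two values from the stored name, using the question's example triple. The function names and the exact character rules are assumptions; real names will need more cases than periods and spaces.

```python
import re


def nicename(name: str) -> str:
    """URL form: runs of periods/whitespace become a single hyphen."""
    return re.sub(r"[.\s]+", "-", name.strip()).strip("-")


def searchname(name: str) -> str:
    """Search form: lowercase, keeping only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


name = "Mr.Joe bloggs"
print(nicename(name), searchname(name))
```

Whether to compute these on the fly or persist them is a trade-off: with 100m rows and millisecond search targets, a stored and indexed searchname may still be worth its space.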
Background
I am creating a MySQL database to store items such as courses where there may be many attributes to a single course. For example:
A single course may have any or all of the following attributes:
Title (varchar)
Secondary Title (varchar)
Description (text)
Date
Time
Specific Location (varchar; eg. White Hall Room 7)
General Location (varchar; eg. Las Vegas, NV)
Location Coords (floats; eg. lat, long)
etc.
The database is set up as follows:
A table storing specific course info:
courses table:
Course_ID (a Primary Key unique ID for each course)
Creator_ID (a unique ID for the creator)
Creation_Date (datetime of course creation)
Modified_Date (where this is the most recent timestamp the course was modified)
The table storing each course's multiple attributes is set up as follows:
course_attributes table:
Attribute_ID (a unique ID for each attribute)
Course_ID (reference to the specific course the attribute is for)
Attribute (varchar defining the attribute; eg. 'title')
Value (text containing value of specified attribute; eg. 'Title Of My Course')
Desire
I would like to search this database using sphinx search. With this search, I have different fields weighing different amounts, for example: 'title' would be more important than 'description'.
Specific search fields that I wish to have are:
Title
Date
Location (string)
Location (geo - lat/long)
The Question
Should I define a View in MySQL to organize the attributes according to 'title', 'description', etc., or is there a way to define my sphinx.conf file to understand specific attributes?
I am open to all suggestions to solving this problem, whether it be rearrangement of the database/tables or the way in which I search.
Let me know if you need any additional details to help me find a solution.
Thanks in advance for the help
!--Update--!
OK, so after reading some of the answers, I feel that I should provide some additional information.
Latitude / Longitude
The latitude/longitude attributes are created by me internally after receiving the general location string. I can generate the values in any way I wish, meaning that I can store them together in a single lat/long attribute as 'float lat, float long' values or any other desired format. This is done only after they have been generated from the initial location string and verified. This is to guard against malformed data as @X-Zero and @Cody have suggested.
Keep in mind that the latitude and longitude were merely illustrating the need to have that field be searchable, nothing more than that. It is simply another attribute; one of many.
Weighting Search Results
I know how to add weights to results in a Sphinx search query:
$cl->setFieldWeights( array('title'=>1000, 'description'=>500) );
This causes the title column to have a higher weight than the description column if the structure were as @X-Zero suggested. My question was directed more at how one would apply the above logic with the current table definition.
Database Structure, Views, and Efficiency
Using my introductory knowledge of Views, I was thinking that I could possibly create something that displays a row for each course where each attribute is its own column. I don't know how to accomplish this or if it's even possible.
I am not the most confident with database structures, but the reason I set my tables up as described was because there are many cases where not all of the fields will be completed for every course and I was attempting to be efficient [yes, it seems as though I've failed].
I was thinking that using my current structure, each attribute would contain a value and would therefore cause no wasted space in the table. Alternatively, if I had a table with tons of potential attributes, I would think there would be wasted space. If I am incorrect, I am happy to learn why my understanding is wrong.
Let me preface this by saying that I've never even heard of Sphinx, nor (obviously) used it. However, from a database perspective...
Doing multi-domain columns like this is a terrible (I will hunt you down and kill you) idea. For one thing, it's impossible to index or sort meaningfully, period. You also have to pray that you don't get a latitude attribute with textual data (and because this can only be enforced programmatically, I'm going to guarantee this will happen) - doing so will cause all distance-based formulas to crash. And speaking of location, what happens if somebody stores a latitude without a longitude (note that this is possible regardless of whether you are storing a single GeoLocation attribute, or the pair)?
Your best bet is to do the following:
Figure out which attributes will always be required. These belong in the course table (...mostly).
For each related set of optional attributes, create a table. For example, location (although this should probably be required...), which would contain Latitude/Longitude, City, State, Address, Room, etc. Allow the columns to be nullable (in sets - add constraints so users can't add just longitude and not latitude).
For every set of common queries add a view. Even (perhaps especially) if you persist in using your current design, use a view. This promotes separation between the logical and physical implementations of the database. (This assumes searching by SQL.) You will then be able to search by specifying view_column is null or view_column = input_parameter or whichever.
For weighted searching (assuming dynamic weighting) your query will need to use left joins (inside the view as well - please document this), and use prepared-statement host-parameters (just save yourself the trouble of trying to escape things yourself). Check each set of parameters (both lat and long, for example), and assign the input weighting to a new column (per attribute), which can be summed up into a 'total' column (which must be over some threshold).
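The "nullable in sets" idea from the list above can be sketched with CHECK constraints. This is only an illustration run through Python's sqlite3 module; the location table, its columns, and the helper try_insert are assumptions, and the exact constraint syntax and enforcement vary between database engines and versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE location (
    location_id INTEGER PRIMARY KEY,
    city        TEXT,
    latitude    REAL,
    longitude   REAL,
    -- Either both coordinates are present or both are absent.
    CHECK ((latitude IS NULL) = (longitude IS NULL)),
    -- Range checks evaluate to NULL (and so pass) when the value is NULL.
    CHECK (latitude  BETWEEN -90  AND 90),
    CHECK (longitude BETWEEN -180 AND 180)
)
""")


def try_insert(lat, lon):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO location (city, latitude, longitude) VALUES (?, ?, ?)",
            ("Las Vegas, NV", lat, lon),
        )
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert(36.17, -115.14))  # both present
print(try_insert(None, None))      # both absent
print(try_insert(36.17, None))     # latitude without longitude
print(try_insert(200.0, 10.0))     # latitude out of range
```

With typed, constrained columns the database itself refuses the half-filled pair, which an attribute/value table cannot do.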
EDIT:
Using views:
For your structure, what you would normally do is left join to the attributes table multiple times (one for each attribute needed), keying off of the attribute (which should really be an int FK to a table; you don't want both 'title' and 'Title' in there) and joining on course_id - the value would be included as part of the select. Using this technique, it would be simple to then get the list of columns, which you can then apparently weight in Sphinx.
The problem with this is if you need to do any data conversion - you are betting that you'll be able to find all conversions if the type ever changes. When using strongly typed columns, this ranges from trivial (the likelihood is that you end up with a uniquely named column) to unnecessary (views usually take their datatype definitions from the fields in the query); with your architecture, you'll likely end up looking through too many false positives.
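The left-join technique described above can be sketched as a view that turns each attribute into its own column. The table and column names follow the question; the view name course_search and the sample rows are assumptions, and SQLite (via Python's sqlite3) is used only to keep the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE courses (Course_ID INTEGER PRIMARY KEY, Creator_ID INTEGER);
CREATE TABLE course_attributes (
    Attribute_ID INTEGER PRIMARY KEY,
    Course_ID    INTEGER NOT NULL REFERENCES courses(Course_ID),
    Attribute    TEXT NOT NULL,
    Value        TEXT
);
INSERT INTO courses VALUES (1, 10), (2, 11);
INSERT INTO course_attributes (Course_ID, Attribute, Value) VALUES
    (1, 'title', 'Title Of My Course'),
    (1, 'description', 'An example course'),
    (2, 'title', 'Another Course');

-- One LEFT JOIN per attribute; a missing attribute shows up as NULL.
CREATE VIEW course_search AS
SELECT c.Course_ID,
       t.Value AS title,
       d.Value AS description
FROM courses c
LEFT JOIN course_attributes t
       ON t.Course_ID = c.Course_ID AND t.Attribute = 'title'
LEFT JOIN course_attributes d
       ON d.Course_ID = c.Course_ID AND d.Attribute = 'description';
""")

rows = conn.execute(
    "SELECT Course_ID, title, description FROM course_search ORDER BY Course_ID"
).fetchall()
for row in rows:
    print(row)
```

A query like the view's SELECT is the kind of statement an indexer can be pointed at, so that title and description arrive as separate fields that can be weighted independently.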
Database efficiency:
You're right, unfilled columns are wasted space. Usually, when something is optional(ish), that means you may need an additional table, which is why I suggested splitting off location into its own table: this prevents events which don't need a location (... what?) from 'wasting' the space, but then forces any event that defines a location to specify all required information. There's an additional benefit to splitting it off this way: if multiple events all use the same location (... not at the same time, we hope), a cross-reference table will save you a lot of space. Way more than your attributes table ever could (you're still having to store the complete location per event, after all). If you still have a lot of 'optional' attributes, I hear that NoSQL is made for these kinds of things (but I haven't really looked into it). However, other than that, the cost of an additional table is trivial; the cost of the data inside may not be, but the space required is weighed against the perceived value of the data stored. Remember that disk space is relatively cheap - it's developer/maintainer time that is expensive.
Side note for addresses:
You are probably going to want to create an address table. This would be completely divorced from the event information, and would include (among other things) the precomputed latitude/longitude (in the recommended datatype - I don't know what it is, but it's for sure not a comma-separated string). You would then have an event_address table that would be the cross-reference between the events and where they take place - if there is additional information (such as room), that should be kept in a location table that is referenced (instead of referencing address directly). Once a lat/long value is computed, you should never need to change it.
Thoughts on later updates for lat/long:
While specifying the lat/long values yourself is better, you're going to want to make them a required part of the address table (or part of/in addition to a purely lat/long only table). Frankly, multi-value columns (delimited lists) of any sort are just begging for trouble - you keep having to parse them every time you search on them (among other related issues). And the moment you make them separate rows, one of the pair will eventually get dropped - Murphy himself will personally intervene, if necessary. Additionally, updating them at different times from the addresses will result in an address having a lat/long pair that does not match; your best bet is to compute this at insertion time (there are a number of webservices to find this information for you).
Multi-domain tables:
With a multi-domain table, you're basically betting that the domain key (attribute) will never become out-of-sync with the value (err, value). I don't care how good you are, somewhere, somehow, it's going to happen: at my company, we had one of these in our legacy application (it stored FK links and which files the FKs refer to, along with an attribute). At one point an application was installed in production which promptly began storing the correct file links, but the FK links to a different file, for a given class of attribute. Thankfully, there were audit records in another file which allowed this to be reversed (... as near as they were able to tell).
In summary:
Revisit your required/optional data. Don't be afraid to create additional tables, each for a single entity, with every column for a single domain; you will also need relationship tables. You may also wish to place your audit data (last_updated_time) in a set of separate tables (single-domain tables will help immensely in this regard).
In the Sphinx config you define your index and the SQL queries that populate it. You can define basic attributes; see the Sphinx documentation on attributes.
Sphinx also supports geo searches on lat/long, but the values need to be expressed in radians, definitely not text columns like you have. I agree with X-Zero that storing lat/lng values as strings is a bad idea.
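Since the coordinates need to be in radians, a conversion step along these lines could run when the lat/long values are generated. This is a plain-Python sketch; the function name to_radians is hypothetical and the Las Vegas coordinates are approximate, illustrative values.

```python
import math


def to_radians(lat_deg: float, lon_deg: float) -> tuple:
    """Convert a latitude/longitude pair from degrees to radians."""
    if not (-90.0 <= lat_deg <= 90.0 and -180.0 <= lon_deg <= 180.0):
        raise ValueError("latitude/longitude out of range")
    return math.radians(lat_deg), math.radians(lon_deg)


# Approximate coordinates for Las Vegas, NV (illustrative only).
lat_rad, lon_rad = to_radians(36.17, -115.14)
print(lat_rad, lon_rad)
```

Storing the converted values in numeric columns keeps the conversion in one place and avoids parsing text at search time.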
I have two separate tables that contain parts of a user's name (don't ask why)...
t1
---------------
firstName
lastName
t2
---------------
middleName
stage_firstName
stage_middleName
stage_lastName
So before I output the name I run it through a function that capitalizes the first letter of each name and uses the stage name if provided.
It works OK, but I now have a case where I need to display multiple names. The question I have is: should I use MySQL to store the properly formatted name when the user formats it initially, or keep the values in multiple tables and keep on using the function to format them? For some reason I think I can improve performance by reading a single value from a table, even if that means adding an extra column, rather than keeping the fields in two separate tables and then passing each name through this huge function.
Am I wrong with these assumptions?
And if, at some point, you need to extract the name and display/use it in a different format, you will need to then perform some kind of translation on the already formatted string.
You could write the formatting into the query though.
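One way "formatting in the query" might look: the stage-name fallback is done in SQL with COALESCE/NULLIF, and only the capitalization stays in a small helper. The user_id join key, the per-field fallback rule, and the sample rows are all assumptions (the question does not say how t1 and t2 are linked); SQLite via Python's sqlite3 is used just to make it runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- user_id is a hypothetical key linking the two tables
CREATE TABLE t1 (user_id INTEGER PRIMARY KEY, firstName TEXT, lastName TEXT);
CREATE TABLE t2 (user_id INTEGER PRIMARY KEY, middleName TEXT,
                 stage_firstName TEXT, stage_middleName TEXT, stage_lastName TEXT);
INSERT INTO t1 VALUES (1, 'john', 'smith'), (2, 'jane', 'doe');
INSERT INTO t2 VALUES (1, 'q', NULL, NULL, NULL),
                      (2, NULL, 'lady', NULL, 'star');
""")

# Prefer the stage value when it is present and non-empty.
QUERY = """
SELECT t1.user_id,
       COALESCE(NULLIF(t2.stage_firstName,  ''), t1.firstName)  AS first_name,
       COALESCE(NULLIF(t2.stage_middleName, ''), t2.middleName) AS middle_name,
       COALESCE(NULLIF(t2.stage_lastName,   ''), t1.lastName)   AS last_name
FROM t1
LEFT JOIN t2 ON t2.user_id = t1.user_id
ORDER BY t1.user_id
"""


def display_name(parts):
    """Capitalize each present part and join with spaces."""
    return " ".join(p.capitalize() for p in parts if p)


names = [display_name(row[1:]) for row in conn.execute(QUERY)]
print(names)
```

Note that str.capitalize lowercases everything after the first letter, so names like "McDonald" would need a gentler rule.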
I'm in need of some quick help on matching a field in my database that stores all of the "parent" categories for my online store. Here's an example of how my "parents" are stored in the table via one field named Parent:
MENS MENS-BRANDS MENS-SHIRTS MENS-T-SHIRTS
Here is my query in PHP to perform the call:
$query = "SELECT id FROM $usertable where parent like '".strtoupper($parent)."'";
The problem is, if I am on MENS-BRANDS, this will also return those products who are listed in every other category because it contains the word "MENS." Since all of the parents are stored in one field, how can I make my SQL query only recognize each physical word that is separated by spaces in the field itself, instead of it trying to find every instance of different fragments of a word throughout the field?
I hope this makes sense, and any help is surely appreciated.
Ideally you can change your schema so that you have a separate table linking these categories to your existing entries. This way you have one row per product/category pair, and you can easily write a SQL query that looks for the specific category you want without the need for a LIKE match. Added bonus: this will improve performance.
However, if you absolutely cannot change this schema, your best bet is probably to use a regular expression. A word-boundary pattern such as WHERE parent REGEXP '[[:<:]]MENS[[:>:]]' is the usual starting point, but a hyphen also counts as a word boundary, so that pattern still matches MENS-BRANDS. Because the values in the field are separated by spaces, anchoring on the spaces is safer: WHERE parent REGEXP '(^| )MENS( |$)'
I'm using MySQL regular expressions here. If you're using a different database management system the same concept will work, but the exact syntax may be different.
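The difference between a word-boundary match and a whole-token match on a space-separated field can be checked outside the database. This sketch uses Python's re module rather than MySQL's REGEXP, so the pattern syntax is Python's; the helper has_parent is hypothetical.

```python
import re


def has_parent(parent_field: str, parent: str) -> bool:
    """True if `parent` is one of the space-separated tokens in the field."""
    return parent.upper() in parent_field.upper().split()


field = "MENS-BRANDS MENS-SHIRTS"

# A word boundary sits between "S" and "-", so this matches even though
# the field has no standalone MENS token.
boundary_hit = re.search(r"\bMENS\b", field) is not None

# Requiring a space or the start/end of the string on both sides does not.
token_hit = re.search(r"(?:^| )MENS(?: |$)", field) is not None

print(boundary_hit, token_hit, has_parent(field, "mens"))
print(has_parent("MENS MENS-BRANDS", "mens"))
```

The same whole-token idea is what a separate product/category table gives you for free, with an exact comparison instead of a pattern match.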