Hi, so I have this database project I'm working on that involves transcribing archival sources to make them more accessible.
I'm revamping the database structure so the representation of the archival data is more faithful to the manuscript sources. As part of that, I have a new table which holds the labels/titles for the columns of data in the documents, plus a "used" field for each, which acts both as a flag for whether the field is used and as its left-to-right position (the order changes between documents).
I'm wondering if there's a way to pair the columns together so I can write a query that, when asking for a single row to be returned, sorts the "used" fields numerically (returning all the ones that aren't -1) and also returns the matching "label" fields in that same order (e.g. if guns_used is 2, men_used is 1 and ship_name_used is 0, the query will put them in that order and return shipname_label, men_label and guns_label in the corresponding order).
I'm also working with/around WordPress, so I have the whole $wpdb object available to me too.
I'm hoping to be able to "pair" the fields in some way so that if I order one set, the other gets ordered as well.
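To illustrate the shape of query I'm imagining (the table and column names here are invented for the example; one UNION branch per label/used pair):

SELECT 'ship_name' AS field, shipname_label AS label, ship_name_used AS position
  FROM column_layout WHERE layout_id = 1 AND ship_name_used <> -1
UNION ALL
SELECT 'men', men_label, men_used
  FROM column_layout WHERE layout_id = 1 AND men_used <> -1
UNION ALL
SELECT 'guns', guns_label, guns_used
  FROM column_layout WHERE layout_id = 1 AND guns_used <> -1
ORDER BY position;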
Edit:
I would really prefer to find a way to do this in a query, but until I do, I'm going to:
a) Select the entire row that I need.
b) Use a long series of if statements, one for each pair of _label/_used fields, assigning each label to the position in an array indicated by the value of the _used field.
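For illustration, a minimal sketch of that fallback, assuming $row came back from $wpdb->get_row(..., ARRAY_A) and the pair prefixes are maintained by hand:

// hypothetical list of field-name prefixes, one per _label/_used pair
$prefixes = array('ship_name', 'men', 'guns');

$ordered = array();
foreach ($prefixes as $p) {
    $position = (int) $row[$p . '_used'];
    if ($position !== -1) {
        // the _used value doubles as the left-to-right position
        $ordered[$position] = $row[$p . '_label'];
    }
}
ksort($ordered); // labels now sorted by their _used positions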
This one might be a bit complicated. I searched for similar questions and found nothing that seemed relevant.
Let me start by establishing my database structure.
I have several tables, but the relevant ones are as follows:
Right now I have the decklist stored as a comma-delimited string of card IDs. I realize this is inefficient, and when I get around to improving my code I will make a new table tcg_card_in_deck that has relationid, cardid, and deckid. For now, my code assumes a decklist string.
I'm building a function to allow purchases of a deck. In order to give the buyer the actual cards, I have the following query (generated with PHP; in practice it will have about 50 entries):
$db->query_write("INSERT INTO
`tcg_card`
(masterid, userid, foil)
VALUES
('159', '15', '0'),
('209', '15', '0'),
('209', '15', '0'),
('318', '15', '0')");
This part is easy. My issue now is making sure the cards that have just been added can have their ids grabbed and put together in an array (to enter in as a string currently, and as entries into the separate table once the code is updated). If it was one entry I could use LAST_INSERT_ID(). If I did 50 separate insert queries I could grab the id on each iteration and add them into the array. But because it's all done with one query, I don't know how to effectively find the correct cards to put in the decklist. I suppose I could add a dateline field to the cards table to specify date acquired, but that seems sloppy and it may produce flawed results if a user gets cards from a trade or a booster pack in a similar timeframe.
Any advice would be appreciated!
Change tcg_card by removing cardid and making masterid and userid a compound key. Then add a column called quantity. Since you cannot distinguish between two copies of a card in any meaningful way (except for foils, which you could handle with this schema), there is no need for every card to get its own ID.
Presumably you aren't entering new tcg_master rows dynamically, so you don't have to worry about pulling their IDs back out.
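A sketch of what that schema might look like (the types are guesses; including foil in the key is one way to handle foils):

CREATE TABLE tcg_card (
    masterid INT UNSIGNED NOT NULL,
    userid   INT UNSIGNED NOT NULL,
    foil     TINYINT(1)   NOT NULL DEFAULT 0,
    quantity INT UNSIGNED NOT NULL DEFAULT 1,
    PRIMARY KEY (masterid, userid, foil)
);

-- granting a card then becomes an upsert:
INSERT INTO tcg_card (masterid, userid, foil, quantity)
VALUES (209, 15, 0, 1)
ON DUPLICATE KEY UPDATE quantity = quantity + 1;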
Reading comments on this question I thought of a very simple and easy solution:
Get all inserted IDs when inserting multiple rows using a single query
I already track booster pack purchases with a table tcg_history. This table can also track other types of purchases, such as a starter deck.
I simply need to add a field on the tcg_card table that references tcg_history.recordid; then I will be able to select all the cards that came from that purchase.
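A sketch of that flow, assuming the $db wrapper exposes insert_id() and query_read()/fetch_array() as in vBulletin-style DB classes (the historyid column and 'deck' type are placeholders):

// 1) record the purchase and grab its id
$db->query_write("INSERT INTO tcg_history (userid, type) VALUES ('15', 'deck')");
$historyid = $db->insert_id();

// 2) tag every card in the multi-row insert with that purchase
$db->query_write("INSERT INTO tcg_card (masterid, userid, foil, historyid)
                  VALUES ('159', '15', '0', $historyid),
                         ('209', '15', '0', $historyid),
                         ('318', '15', '0', $historyid)");

// 3) fetch exactly the cards from this purchase
$result = $db->query_read("SELECT cardid FROM tcg_card WHERE historyid = $historyid");
$cardids = array();
while ($card = $db->fetch_array($result)) {
    $cardids[] = $card['cardid'];
}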
I've got a problem that I just can't seem to find the answer to. I've developed a very small CRM-like application in PHP that's driven by MySQL. Users of this application can import new data to the database via an uploaded CSV file. One of the issues we're working to solve right now is duplicate, or more importantly, near duplicate records. For example, if I have the following:
Record A: [1, Bob, Jones, Atlanta, GA, 30327, (404) 555-1234]
and
Record B: [2, Bobby, Jones, Atlanta, GA, 30327, Bob's Shoe Store, (404) 555-1234]
I need a way to see that these are both similar, take the record with more information (in this case record B) and remove record A.
But here's where it gets even more complicated. This must be done upon importing new data, and I also need a function I can execute to remove duplicates from the database at any time. I have been able to put something together in PHP that gets all duplicate rows from the MySQL table and matches them up by phone number, or by using implode() on all of a row's columns and then strlen() to decide which record is longest.
There has got to be a better way of doing this, and one that is more accurate.
Do any of you have any brilliant suggestions that I may be able to implement or build on? It's obvious that when importing new data I'll need to open their CSV file into an array or temporary MySQL table, do the duplicate/similar search, then recompile the CSV file or add everything from the temporary table to the main table. I think. :)
I'm hoping that some of you can point out something that I may be missing that can scale somewhat decently and that's somewhat accurate. I'd rather present a list of duplicates we're 'unsure' about to a user that's 5 records long, not 5,000.
Thanks in advance!
Alex
If I were you, I'd put a UNIQUE key on name, surname and phone number, since in theory if all three are equal the record is a duplicate (a phone number can have only one owner). In any case, you should find a combination of two, three or maybe four columns and assign them a unique key. Once you have such a structure, run something like this:
-- assuming you have defined something like the following in your CREATE TABLE:
--   UNIQUE (phone, name, surname)
-- then you should perform something like:
INSERT INTO your_table (phone, name, surname)
VALUES ($val1, $val2, $val3)
ON DUPLICATE KEY UPDATE phone   = IFNULL($val1, phone),
                        name    = IFNULL($val2, name),
                        surname = IFNULL($val3, surname);
So basically, if the inserted value is a duplicate, this code will update the row rather than inserting a new one. The IFNULL function checks whether its first expression is null; if it is, it picks the second expression, which in this case is the column value that already exists in your table. Hence, it will update your row with as much information as possible.
I don't think there are brilliant solutions. You need to determine which of your data fields you can rely on for detecting similarity, and prioritize them: for example phone, some kind of ID, or a uniform address or official name.
You can store cleaned-up values (reduced to the same format, like digits-only phones or a concatenated full address) alongside each row, and use those for similarity searches when adding records.
Then, based on data completeness, you need to decide whether to update existing rows with the more complete fields, or to delete the old row and add the new one.
I don't know of any ready-made solutions for such a variable task, and I doubt they exist.
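For example, a rough sketch of the cleaned-value idea in PHP (assuming a PDO connection $pdo and a hypothetical phone_normalized column):

// reduce a phone number to digits only before storing and searching
function normalize_phone($phone) {
    return preg_replace('/\D+/', '', $phone);
}

$clean = normalize_phone('(404) 555-1234'); // "4045551234"

// find candidate duplicates on the normalized value
$stmt = $pdo->prepare('SELECT * FROM contacts WHERE phone_normalized = ?');
$stmt->execute(array($clean));
$candidates = $stmt->fetchAll();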
So I've got this form with an array of checkboxes to search for an event. When you create an event, you choose one or more of the checkboxes and then the event gets created with these "attributes". What is the best way to store it in a MySQL database if I want to filter results when searching for these events? Would creating several columns with boolean values be the best way? Or possibly a new table with the checkbox values only?
I'm pretty sure serializing is out of the question because I wouldn't be able to query the serialized string for whether the checkbox was ticked or not, right?
Thanks
You can use the set datatype or a separate table that you join. Either will work.
I would not do a bunch of columns though.
You can search the set easily using FIND_IN_SET(), but it's not indexed, so it depends on how many rows you expect (up to a few thousand is probably OK - it's a very fast search).
The normal solution is a separate table with one column being the ID of the event, and the second column being the attribute using the enum datatype (don't use text, it's slower).
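A sketch of both options (the attribute names are just examples):

-- Option 1: a SET column, searched with FIND_IN_SET (no index used)
CREATE TABLE events (
    event_id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    attributes SET('outdoor','family','free','evening') NOT NULL DEFAULT ''
);
SELECT * FROM events WHERE FIND_IN_SET('outdoor', attributes);

-- Option 2: a separate, indexable join table with an ENUM column
CREATE TABLE event_attributes (
    event_id  INT UNSIGNED NOT NULL,
    attribute ENUM('outdoor','family','free','evening') NOT NULL,
    PRIMARY KEY (event_id, attribute)
);
SELECT e.* FROM events e
JOIN event_attributes ea ON ea.event_id = e.event_id
WHERE ea.attribute = 'outdoor';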
Create separate columns, or store them all in one column using a bit mask.
One way would be to create a new table with a column for each checkbox, as already described by others. I'll not add to that.
However, another way is to use a bitmask. You have just one column, myCheckboxes, and store the values as an int. Then in the code you have constants or another appropriate way to record the correlation between each checkbox and its bit, i.e.:
CHECKBOX_ONE 1
CHECKBOX_TWO 2
CHECKBOX_THREE 4
CHECKBOX_FOUR 8
...
CHECKBOX_NINE 256
Remember to always use the next power of two for new values, otherwise you'll get values that overlap.
So, if the first two checkboxes are checked you should have 3 as the value of myCheckboxes for that row. If you have ONE and FOUR checked you'd have 9, etc. When you want to see which rows have, say, checkboxes ONE, THREE and NINE checked, your query would look like:
SELECT * FROM myTable where myCheckboxes & 1 AND myCheckboxes & 4 AND myCheckboxes & 256;
This query will return only rows that have all of these checkboxes checked.
You should also use bitwise operations when storing and reading the data.
This is a very efficient way when it comes to speed. You have just a single column, probably just a smallint, and your searches are pretty fast. This can make a big difference if you have several different collections of checkboxes that you want to store and search through. However, it makes the values harder to understand. If you see the value 261 in the DB, it is not easy for a human to immediately see that it means checkboxes ONE, THREE and NINE are checked, whereas that is much easier with a separate column per checkbox. This is normally not an issue, since humans don't need to manually poke the database, but it's worth mentioning.
From the coding perspective it's not much of a difference, but you'll have to be careful not to corrupt the values: it's magnitudes easier to mess up a single int than data stored across separate columns. So test carefully when adding new stuff. All that said, the speed and low memory benefits can be very big if you have a ton of different collections.
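A minimal PHP sketch of the bitwise reads and writes described above:

// powers of two, one bit per checkbox
define('CHECKBOX_ONE',   1);
define('CHECKBOX_THREE', 4);
define('CHECKBOX_NINE',  256);

// storing: combine the checked boxes with bitwise OR
$myCheckboxes = CHECKBOX_ONE | CHECKBOX_THREE | CHECKBOX_NINE; // 261

// reading: test an individual bit with bitwise AND
if ($myCheckboxes & CHECKBOX_THREE) {
    // checkbox three is checked
}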
Background
I am creating a MySQL database to store items such as courses where there may be many attributes to a single course. For example:
A single course may have any or all of the following attributes:
Title (varchar)
Secondary Title (varchar)
Description (text)
Date
Time
Specific Location (varchar; eg. White Hall Room 7)
General Location (varchar; eg. Las Vegas, NV)
Location Coords (floats; eg. lat, long)
etc.
The database is set up as follows:
A table storing specific course info:
courses table:
Course_ID (a Primary Key unique ID for each course)
Creator_ID (a unique ID for the creator)
Creation_Date (datetime of course creation)
Modified_Date (the most recent timestamp at which the course was modified)
The table storing each course's multiple attributes is set up as follows:
course_attributes table:
Attribute_ID (a unique ID for each attribute)
Course_ID (reference to the specific course attribute is for)
Attribute (varchar defining the attribute; eg. 'title')
Value (text containing value of specified attribute; eg. 'Title Of My Course')
Desire
I would like to search this database using sphinx search. With this search, I have different fields weighing different amounts, for example: 'title' would be more important than 'description'.
Specific search fields that I wish to have are:
Title
Date
Location (string)
Location (geo - lat/long)
The Question
Should I define a View in MySQL to organize the attributes according to 'title', 'description', etc., or is there a way to define my sphinx.conf file to understand specific attributes?
I am open to all suggestions to solving this problem, whether it be rearrangement of the database/tables or the way in which I search.
Let me know if you need any additional details to help me find a solution.
Thanks in advance for the help
Update
OK, so after reading some of the answers, I feel that I should provide some additional information.
Latitude / Longitude
The latitude/longitude attributes are created by me internally after receiving the general location string. I can generate the values in any way I wish, meaning that I can store them together in a single lat/long attribute as 'float lat, float long' values or any other desired format. This is done only after they have been generated from the initial location string and verified. This is to guard against malformed data, as @X-Zero and @Cody have suggested.
Keep in mind that the latitude and longitude was merely illustrating the need to have that field be searchable as opposed to anything more than that. It is simply another attribute; one of many.
Weighting Search Results
I know how to add weights to results in a Sphinx search query:
$cl->setFieldWeights( array('title'=>1000, 'description'=>500) );
This causes the title column to have a higher weight than the description column if the structure was as @X-Zero suggested. My question was more directed at how one would apply the above logic with the current table definition.
Database Structure, Views, and Efficiency
Using my introductory knowledge of Views, I was thinking that I could possibly create something that displays a row for each course where each attribute is its own column. I don't know how to accomplish this or if it's even possible.
I am not the most confident with database structures, but the reason I set my tables up as described was because there are many cases where not all of the fields will be completed for every course and I was attempting to be efficient [yes, it seems as though I've failed].
I was thinking that using my current structure, each attribute would contain a value and would therefore cause no wasted space in the table. Alternatively, if I had a table with tons of potential attributes, I would think there would be wasted space. If I am incorrect, I am happy to learn why my understanding is wrong.
Let me preface this by saying that I've never even heard of Sphinx, nor (obviously) used it. However, from a database perspective...
Doing multi-domain columns like this is a terrible (I will hunt you down and kill you) idea. For one thing, it's impossible to index or sort meaningfully, period. You also have to pray that you don't get a latitude attribute with textual data (and because this can only be enforced programmatically, I'm going to guarantee this will happen) - doing so will cause all distance-based formulas to crash. And speaking of location, what happens if somebody stores a latitude without a longitude (note that this is possible regardless of whether you are storing a single GeoLocation attribute, or the pair)?
Your best bet is to do the following:
Figure out which attributes will always be required. These belong in the course table (...mostly).
For each related set of optional attributes, create a table. For example, location (although this should probably be required...), which would contain Latitude/Longitude, City, State, Address, Room, etc. Allow the columns to be nullable in sets - add constraints so users can't add just a longitude and not a latitude (see the sketch after this list).
For every set of common queries, add a view. Even (perhaps especially) if you persist in using your current design, use a view. This promotes separation between the logical and physical implementations of the database. (This assumes searching by SQL.) You will then be able to search by specifying view_column IS NULL or view_column = input_parameter, or whichever.
For weighted searching (assuming dynamic weighting) your query will need to use left joins (inside the view as well - please document this), and use prepared-statement host-parameters (just save yourself the trouble of trying to escape things yourself). Check each set of parameters (both lat and long, for example), and assign the input weighting to a new column (per attribute), which can be summed up into a 'total' column (which must be over some threshold).
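As a sketch of the 'nullable in sets' idea for the location table (column names are illustrative; note that MySQL only enforces CHECK constraints from 8.0.16 onward, so older versions would need a trigger or an application-side check):

CREATE TABLE course_location (
    Course_ID INT NOT NULL PRIMARY KEY,
    city      VARCHAR(100) NULL,
    state     CHAR(2)      NULL,
    address   VARCHAR(255) NULL,
    room      VARCHAR(50)  NULL,
    latitude  DECIMAL(9,6) NULL,
    longitude DECIMAL(9,6) NULL,
    -- lat/long must be provided together or not at all
    CONSTRAINT chk_latlong_pair CHECK ((latitude IS NULL) = (longitude IS NULL)),
    FOREIGN KEY (Course_ID) REFERENCES courses (Course_ID)
);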
EDIT:
Using views:
For your structure, what you would normally do is left join to the attributes table multiple times (one for each attribute needed), keying off of the attribute (which should really be an int FK to a table; you don't want both 'title' and 'Title' in there) and joining on course_id - the value would be included as part of the select. Using this technique, it would be simple to then get the list of columns, which you can then apparently weight in Sphinx.
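Concretely, a sketch of such a view over the existing tables (keeping the string attribute keys as-is, per the caveat above):

CREATE VIEW course_flat AS
SELECT c.Course_ID,
       t.Value AS title,
       d.Value AS description
FROM courses c
LEFT JOIN course_attributes t
       ON t.Course_ID = c.Course_ID AND t.Attribute = 'title'
LEFT JOIN course_attributes d
       ON d.Course_ID = c.Course_ID AND d.Attribute = 'description';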
The problem with this is if you need to do any data conversion - you are betting that you'll be able to find all conversions if the type ever changes. With strongly typed columns, this ranges from trivial (the likelihood is that you end up with a uniquely named column) to unnecessary (views usually take their datatype definitions from the fields in the query); with your architecture, you'll likely end up looking through too many false positives.
Database efficiency:
You're right, unfilled columns are wasted space. Usually, when something is optional(ish), that means you may need an additional table. This is why I suggested splitting location off into its own table: it prevents events that don't need a location (... what?) from 'wasting' the space, but forces any event that defines a location to specify all the required information. There's an additional benefit to splitting it off this way: if multiple events all use the same location (... not at the same time, we hope), a cross-reference table will save you a lot of space - way more than your attributes table ever could (you're still storing the complete location per event, after all). If you still have a lot of 'optional' attributes, I hear that NoSQL is made for these kinds of things (but I haven't really looked into it). Other than that, the cost of an additional table is trivial; the cost of the data inside may not be, but the space required is weighed against the perceived value of the data stored. Remember that disk space is relatively cheap - it's developer/maintainer time that is expensive.
Side note for addresses:
You are probably going to want to create an address table. This would be completely divorced from the event information, and would include (among other things) the precomputed latitude/longitude (in the recommended datatype - I don't know what it is, but it's for sure not a comma-separated string). You would then have an event_address table that would be the cross-reference between the events and where they take place - if there is additional information (such as room), that should be kept in a location table that is referenced (instead of referencing address directly). Once a lat/long value is computed, you should never need to change it.
Thoughts on later updates for lat/long:
While specifying the lat/long values yourself is better, you're going to want to make them a required part of the address table (or part of/in addition to a purely lat/long only table). Frankly, multi-value columns (delimited lists) of any sort are just begging for trouble - you keep having to parse them every time you search on them (among other related issues). And the moment you make them separate rows, one of the pair will eventually get dropped - Murphy himself will personally intervene, if necessary. Additionally, updating them at different times from the addresses will result in an address having a lat/long pair that does not match; your best bet is to compute this at insertion time (there are a number of webservices to find this information for you).
Multi-domain tables:
With a multi-domain table, you're basically betting that the domain key (attribute) will never become out-of-sync with the value (err, value). I don't care how good you are; somewhere, somehow, it's going to happen: at my company, we had one of these in our legacy application (it stored FK links and which files the FKs refer to, along with an attribute). At one point an application was installed in production which promptly began storing the correct file links, but the FK links to a different file, for a given class of attribute. Thankfully, there were audit records in another file which allowed this to be reversed (... as near as they were able to tell).
In summary:
Revisit your required/optional data. Don't be afraid to create additional tables, each for a single entity, with every column for a single domain; you will also need relationship tables. You may also wish to place your audit data (last_updated_time) in a set of separate tables (single-domain tables will help immensely in this regard).
In the sphinx config you define your index and the SQL queries that populate it. You can define basic attributes, see Sphinx Attributes
Sphinx also supports geo searches on lat/long, but they need to be expressed in radians, definitely not as text columns like you have. I agree with X-Zero that storing lat/lng values as strings is a bad idea.
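For instance, with the PHP client the anchor point is passed in radians (this assumes float attributes named latitude and longitude in the index):

// anchor geo searches on a point, converting degrees to radians
$cl->SetGeoAnchor('latitude', 'longitude', deg2rad(36.1699), deg2rad(-115.1398));
// matches then carry a @geodist attribute (meters) you can filter or sort on
$cl->SetFilterFloatRange('@geodist', 0.0, 25000.0); // within 25 km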
OK, I know the technical answer is NEVER.
BUT, there are times when it seems to make things SO much easier with less code and seemingly few downsides, so please hear me out.
I need to build a Table called Restrictions to keep track of what type of users people want to be contacted by and that will contain the following 3 columns (for the sake of simplicity):
minAge
lookingFor
drugs
lookingFor and drugs can contain multiple values.
Database theory tells me I should use a join table to keep track of the multiple values a user might have selected for either of those columns.
But it seems that using comma-separated values makes things so much easier to implement and execute. Here's an example:
Let's say User 1 has the following Restrictions:
minAge => 18
lookingFor => 'Hang Out','Friendship'
drugs => 'Marijuana','Acid'
Now let's say User 2 wants to contact User 1. Well, first we need to see if he fits User 1's Restrictions, but that's easy enough EVEN WITH the comma-separated columns, as such:
First I'd get the Target's (User 1) Restrictions:
SELECT * FROM Restrictions WHERE UserID = 1
Now I just put those into their respective PHP variables as-is:
$targetMinAge = $row['minAge'];
$targetLookingFor = $row['lookingFor'];
$targetDrugs = $row['drugs'];
Now we just check if the SENDER (User 2) fits that simple Criteria:
SELECT COUNT(*)
FROM Users
WHERE
Users.UserID = 2 AND
Users.minAge >= $targetMinAge AND
Users.lookingFor IN ($targetLookingFor) AND
Users.drugs IN ($targetDrugs)
Finally, if COUNT == 1, User 2 can contact User 1, else they cannot.
How simple was THAT? It just seems really easy and straightforward, so what is the REAL problem with doing it this way as long as I sanitize all inputs to the DB every time a user updates their contact restrictions? Being able to use MySQL's IN function and already storing the multiple values in a format it will understand (e.g. comma-separated values) seems to make things so much easier than having to create join tables for every multiple-choice column. And I gave a simplified example, but what if there are 10 multiple choice columns? Then things start getting messy with so many join tables, whereas the CSV method stays simple.
So, in this case, is it really THAT bad if I use comma-separated values?
*ducks*
You already know the answer.
First off, your PHP code isn't even close to working, because it only works if user 2 has only a single value in lookingFor or drugs. If either of these columns contains multiple comma-separated values, then IN won't work even if those values are in the exact same order as User 1's values. What do you expect IN to do when the right-hand side contains one or more commas?
Therefore, it's not "easy" to do what you want in PHP. It's actually quite a pain and would involve splitting user 2's fields into single values, writing dynamic SQL with many ORs to do the comparison, and then doing an extremely inefficient query to get the results.
Furthermore, the fact that you even need to write PHP code to answer such a relatively simple question about the intersection of two sets means that your design is badly flawed. This is exactly the kind of problem (relational algebra) that SQL exists to solve. A correct design allows you to solve the problem in the database and then simply implement a presentation layer on top in PHP or some other technology.
Do it correctly and you'll have a much easier time.
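For instance, a sketch of the normalized version (table and column names invented): each multi-valued column becomes a join table, and the 'does the sender satisfy the target's restrictions' check becomes a set intersection in SQL:

CREATE TABLE restriction_lookingfor (
    userid INT NOT NULL,          -- the user who owns the restriction
    value  VARCHAR(50) NOT NULL,  -- one allowed 'looking for' value
    PRIMARY KEY (userid, value)
);
CREATE TABLE user_lookingfor (
    userid INT NOT NULL,          -- the user being described
    value  VARCHAR(50) NOT NULL,  -- one of that user's own values
    PRIMARY KEY (userid, value)
);
-- same shape again for drugs

-- can user 2 contact user 1, on the lookingFor dimension?
SELECT COUNT(*) AS matches
FROM user_lookingfor u
JOIN restriction_lookingfor r ON r.value = u.value
WHERE u.userid = 2    -- the sender's own values
  AND r.userid = 1;   -- the target's allowed values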
Suppose User 1 is looking for 'Hang Out','Friendship' and User 2 is looking for 'Friendship','Hang Out'
Your code would not match them up, because 'Friendship','Hang Out' is not in ('Hang Out','Friendship')
That's the real problem here.
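You can see the failure directly in MySQL: the whole comma-separated string is compared as a single value, so it never equals any one element of the list:

SELECT 'Friendship,Hang Out' IN ('Hang Out', 'Friendship');  -- 0: no match
SELECT 'Friendship'          IN ('Hang Out', 'Friendship');  -- 1: a single value matches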