It's highly likely this has been answered elsewhere, but I can't find anything, so if you can link to another post that would be ace.
I have a MySQL table which lists tickets and payments, amongst other columns. I didn't design the table and I can't change it much; adding a column is out of the question. Here's a simplified version of the table:
CREATE TABLE cases (
  id INT PRIMARY KEY,
  created TIMESTAMP,   -- when the row was created in the DB; each day the table is truncated and a new CSV imported
  opened DATE,         -- will differ from the timestamp; this is the column I want to group by
  total DECIMAL,       -- not relevant to this question, but I'm totalling this by month too
  tickets VARCHAR(256) -- the one I want to count; two tickets would look like '"foo1234","baa5678"', and there is no pattern (i.e. same prefix) in the tickets either
);
The tickets column can reference multiple tickets. Each reference is wrapped in double quotes in the DB (I have a feeling this is a bit insecure; not my decision and I can't really change it), and references are separated by commas when there is more than one.
There are other columns like client name and date closed, but they are irrelevant in this case.
I've written a query which groups the rows by month and year and totals the payments for each month.
I want the results to also include the count of tickets, itemized monthly. It can be assumed that there will always be at least one ticket. On the front end I currently either count the commas and add one, or explode by comma and count the array, which works fine for a handful of rows (there are rarely more than 4 tickets in one row), but as this will form a report over the entire database I want it to be a bit more efficient.
I don't have any code to post (I'm not looking for someone to write it for me, just point me in the right direction; maybe there is a method I'm forgetting?), as I'm not really sure where to start. All the similar posts I've found are about grouping and totalling numbers (which I also need to do, but I'm OK with that part); nothing covers how to do it with a string. Essentially SUM(), but with a string.
This is easily solved by using CONCAT_WS; see the MySQL documentation for details.
CONCAT_WS concatenates strings with a separator of your choosing; in your case you would use a comma: CONCAT_WS(',', string_column)
This will give you all the strings separated by commas. If you really need the count, you can continue with the approach from "How to count items in comma separated list MySQL".
Using this would give:
LENGTH(CONCAT_WS(',', string_column)) - LENGTH(REPLACE(CONCAT_WS(',', string_column), ',', '')) + 1
The + 1 is because there is always one more item than there are commas.
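As a rough example of how that plugs into the monthly report, something like the following should work (a sketch based on the simplified schema from the question; since each row already stores a single comma-separated string and always holds at least one ticket, the LENGTH/REPLACE trick is applied straight to the tickets column and summed per month):
SELECT
  YEAR(opened) AS yr,
  MONTH(opened) AS mth,
  SUM(total) AS month_total,
  -- commas per row + 1 = tickets per row, summed over the month
  SUM(LENGTH(tickets) - LENGTH(REPLACE(tickets, ',', '')) + 1) AS ticket_count
FROM cases
GROUP BY YEAR(opened), MONTH(opened)
ORDER BY yr, mth;
CONCAT_WS only really matters when you are stitching several columns together; with a single column it simply returns that column.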
Related
Hi, so I have this database project I'm working on that involves transcribing archival sources to make them more accessible.
I'm revamping the database structure so I can make the depiction of the archival data more accurate to the manuscript sources. As part of that, I have this new table, which has both the labels/titles for columns of data in the documents, plus a "used" field which acts both as a flag for whether the field is used and as the position it should be in, left to right (as the order changes sometimes).
I'm wondering if there's a way to pair the columns together so I can do a query that, when asking for a single row to be returned, sorts the "used" fields numerically (returning all the ones that aren't -1) and also returns all the "label" fields sorted into the same order (e.g. if guns_used is 2, men_used is 1 and ship_name_used is 0, the query will put them in the correct order and also return guns_label, men_label and ship_name_label in the correct order).
I'm also working with/around WordPress, so I have the whole wpdb object available to me too.
I'm hoping to be able to "pair" the fields in some way so that if I order one set, the other gets ordered as well.
Edit:
I really would prefer to find a way to do this in a query (see the sketch after this list for the kind of thing I'm after), but until I find one I'm going to:
a) Select the entire row that I need
b) Have a long series of if statements, one for each pair of _label/_used fields, assigning each value to the position in the array indicated by the value of the _used field.
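For the query-only route, one option is to unpivot each _label/_used pair into its own row with UNION ALL and sort by the used value. A minimal sketch, assuming a hypothetical table name document_columns and the example column names from above (in WordPress this could be run through $wpdb->get_results()):
SELECT label, used
FROM (
    SELECT guns_label      AS label, guns_used      AS used FROM document_columns WHERE id = ?
    UNION ALL
    SELECT men_label       AS label, men_used       AS used FROM document_columns WHERE id = ?
    UNION ALL
    SELECT ship_name_label AS label, ship_name_used AS used FROM document_columns WHERE id = ?
    -- one SELECT per _label/_used pair
) AS pairs
WHERE used <> -1
ORDER BY used;
Because each branch carries a label together with its used value, sorting by used keeps every label paired with its position, and the WHERE used <> -1 drops the unused fields.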
I am probably thinking about this wrong but here goes.
A computer starts spitting out a gazillion random numbers between 11111111111111111111 and 99999999999999999999, in a linear row:
Sometimes the computer adds a number to one end of the line.
Sometimes the computer adds a number to the other end of the line.
Each number has a number that comes, or will come, before.
Each number has a number that comes, or will come, after.
Not all numbers are unique, many, but not most, are repeated.
The computer never stops spitting out numbers.
As I record all of these numbers, I need to be able to make an educated guess, at any given time:
If this is the second time I have seen a number I must know what number preceded it in line last time.
If it has appeared more than two times, I must know the probability/frequency of numbers preceding it.
If this is the second time I have seen a number, I must also know what number came after it in line last time.
If it has appeared more than two times, I must know the probability/frequency of numbers coming after it.
How the heck do I structure the tables in a MySQL database to store all these numbers? Which engine do I use and why? How do I formulate my queries? I need the answers fast, but capacity is also important, because who knows when the thing will stop spitting them out?
My ill-conceived plan:
2 Tables:
1. Unique ID/#
2. #/ID/#
My thoughts:
Unique IDs are almost always going to be shorter than the number = faster match.
Numbers repeat = fewer ID rows = faster match initially.
SELECT * FROM table2 WHERE id = (SELECT id FROM table1 WHERE # = ?)
OR:
3 Tables:
1. Unique ID/#
2. #/ID
3. ID/#
My thoughts:
If I only need left/before, or only need after/right, I'm shrinking the size of the second query.
SELECT # FROM table2 (or table3) WHERE id = (SELECT id FROM table1 WHERE # = ?)
OR
1 Table:
1. #/#/#
Thoughts:
Fewer queries = less time.
SELECT * FROM table WHERE col2 = #
I'm lost... :( Each number has four attributes: that which comes before + its frequency, and that which comes after + its frequency.
Would I be better off thinking of it in that way? If I store and increment frequency in the table, do I do away with repetition and thus speed up my queries? I was initially thinking that if I store every occurrence, it would be faster to work out the frequency programmatically...
Such simple data, but I just don't have the knowledge of how databases function to know which is more efficient.
In light of a recent comment, I would like to add a bit of information about the actual problem: I have a string of indefinite length. I am trying to store a Markov chain frequency table of the various characters, or chunks of characters, in this string.
Given any point in the string I need to know the probability of the next state, and the probability of the previous state.
I am anticipating user input, based on a corpus of text and past user input. A major difference compared to other applications I have seen is that I am going farther down the chain (more states) at a given time, and I need the frequency data to provide multiple possibilities.
I hope that clarifies the picture a lot more. I didn't want to get into the nitty-gritty of the problem, because in the past I have created questions that were not specific enough to get a specific answer.
This seems maybe a bit better. My primary question with this solution is: would providing the "key" (first few characters of the state) increase the speed of the system? I.e. query for state_key first, then check the full state only within those results?
Table 1:
name: state
col1:state_id - unique, auto incrementing
col2:state_key - the first X characters of the state
col3:state - fixed length string or state
Table 2:
name: occurence
col1:state_id_left - non unique key from table 1
col2:state_id_right - non unique key from table 1
col3:frequency - int, incremented every time the two states occur next to each other.
QUERY TO FIND PREVIOUS STATES:
SELECT * FROM occurence WHERE state_id_right = (SELECT state_id FROM state WHERE state_key = ? AND state = ?)
QUERY TO FIND NEXT STATES:
SELECT * FROM occurence WHERE state_id_left = (SELECT state_id FROM state WHERE state_key = ? AND state = ?)
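For reference, here is roughly what that proposed schema looks like as DDL (just a sketch of the layout described above; the CHAR lengths and the index names are placeholders I picked, not something fixed by the problem):
CREATE TABLE state (
  state_id  INT AUTO_INCREMENT PRIMARY KEY,  -- unique, auto incrementing
  state_key CHAR(4),                         -- the first X characters of the state (4 is arbitrary here)
  state     CHAR(20),                        -- fixed-length string for the state
  KEY idx_state_key_state (state_key, state)
);
CREATE TABLE occurence (
  state_id_left  INT,  -- non-unique key from state
  state_id_right INT,  -- non-unique key from state
  frequency      INT,  -- incremented every time the two states occur next to each other
  KEY idx_left  (state_id_left),
  KEY idx_right (state_id_right)
);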
I'm not familiar with Markov Chains but here is an attempt to answer the question. Note: To simplify things, let's call each string of numbers a 'state'.
First of all I imagine a table like this
Table states:
order : integer, auto-incrementing (add an index here)
state_id : integer (add an index here)
state : varchar (?)
order: just use a sequential number (1, 2, 3, ..., n); this will make it easy to search for the previous or next state.
state_id: a unique number associated with the state. As an example, you can use the number 1 to represent the state '1111111111...1' (whatever the length of the sequence is). What's important is that a reoccurrence of a state needs to use the same state_id that was used before. You may be able to derive the state_id from the string (maybe by subtracting a number). Of course, a state_id only makes sense if the number of possible states fits in a MySQL int field.
state: that is the string of numbers, '11111111...1' to '99999999...9'. I'm guessing this can only be stored as a string, but if it fits in an integer/numeric column you should try that, as it may well be that you don't need the state_id at all.
The point of state_id is that searching numbers is quicker than searching text, but there will always be trade-offs when it comes to performance ... profile and identify your bottlenecks to make better design decisions.
So, how do you look for a previous occurrence of the state S_i?
"SELECT `order`, state_id, state FROM states WHERE state_id = " and then append get_state_id(S_i), where get_state_id ideally uses a formula to generate a unique ID for the state. (Note that order is a reserved word in MySQL, so the column name has to be quoted with backticks or renamed.)
Now, with order - 1 or order + 1 you can access the neighboring states by issuing an additional query.
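For instance, the previous/next lookups could be sketched like this (using the backtick-quoted order column; the ? placeholders are bound from the application):
-- all past positions of state S_i in the sequence
SELECT `order` FROM states WHERE state_id = ?;
-- the state immediately before a given position
SELECT state_id, state FROM states WHERE `order` = ? - 1;
-- the state immediately after it
SELECT state_id, state FROM states WHERE `order` = ? + 1;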
Next we need to track the frequency of different occurrences. You can do that in a different table that could look like this:
Table state_frequencies:
state_id integer (indexed)
occurrences integer
And only add records as you get the numbers.
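A common MySQL way to "only add records as you get the numbers" is an upsert; a sketch, assuming state_id is declared as the primary (or a unique) key of state_frequencies:
INSERT INTO state_frequencies (state_id, occurrences)
VALUES (?, 1)
ON DUPLICATE KEY UPDATE occurrences = occurrences + 1;
-- the first sighting inserts a row with count 1; later sightings just bump the counter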
Finally, you can have tables to track frequency for the neighboring states:
Table prev_state_frequencies (next_state_frequencies is the same):
state_id: integer (indexed)
prev_state_id: integer (indexed)
occurrences: integer
You will be able to infer probabilities (I guess this is what you are trying to do) by looking at the number of occurrences of a state (in state_frequencies) vs the number of occurrences of its predecessor state (in prev_state_frequencies).
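As a rough example, the relative frequency of each possible predecessor of a given state could be read out with something like this (a sketch over the two tables above; the division gives the share of that state's occurrences that were preceded by each prev_state_id):
SELECT p.prev_state_id,
       p.occurrences / s.occurrences AS probability
FROM prev_state_frequencies AS p
JOIN state_frequencies AS s ON s.state_id = p.state_id
WHERE p.state_id = ?
ORDER BY probability DESC;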
I'm not sure if I got your problem right but if this makes sense I'm guessing I have.
Hope it helps,
AH
It seems to me that the Markov chain is finite, so first I would start by defining the limit of the chain (i.e. 26 characters with x number of spaces to fill); then you can calculate the total number of possible combinations. To determine the probability of a certain arrangement of characters, the math, if I remember correctly, is:
x = ((C)(C))(P)
where
C = the number of possible characters and
P = the total potential outcomes.
This is a ton of data to store, and creating procedures to filter through the data could turn out to be a seemingly endless task.
If you are using an auto-incremented ID in your table, you could query the table and use preg_match to test the new result against the previous results, then insert the number of total matches along with the new result into the table. This would also allow you to query the preceding results to see what came before it, which should give you a general idea of the pattern within the results as well as a general basis for statistical relevance and new algorithm generation.
I need to sort through a column in my database. This column is my category structure; the data in the column is city names, but not all the names are written the same way for each city. What I need to do is go through the values in the column: I may have 20-40 values that are the same city but written differently, and I need a script that can interpret them and change them to a single value.
So I may have two values in the city column, say ( england > london ) and ( westlondon ), but I need to change both to just london. Is there a script out there that is capable of interpreting the values that are already there and changing them to the value I want? I know the difficult way of doing this one by one, but I wondered if there was a script in any language that could do it.
I've done this sort of data clean-up plenty of times and I'm afraid I don't know of anything easier than just writing your own fixes.
One thing I can recommend is making the process repeatable. Have a replacement table with something like (rule_num, pattern, new_value). Then, work on a copy of the relevant bits of your table so you can just re-run the whole script.
Then, you can start with the obvious matches (just see what looks plausible) and move to more obscure ones. Eventually you'll have 50 or so without matches and you can just patch those entries manually.
Making it repeatable is important because you're bound to find mismatches in your first few attempts.
So, something like (syntax untested):
CREATE TABLE matches (rule_num int PRIMARY KEY, pattern text, new_value text);
CREATE TABLE cityfix AS
SELECT id, city AS old_city, CAST('' AS CHAR(255)) AS new_city, 0 AS match_num FROM locations;
UPDATE cityfix AS c
JOIN matches AS m ON c.old_city LIKE m.pattern
SET c.new_city = m.new_value, c.match_num = m.rule_num
WHERE c.match_num = 0;
-- Review results, add new patterns to matches, repeat the UPDATE
-- If you need to you can drop table cityfix and repeat it.
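As a concrete (made-up) example of a rule, using the two values from the question:
INSERT INTO matches (rule_num, pattern, new_value) VALUES (1, '%london%', 'london');
-- LIKE '%london%' matches both 'england > london' and 'westlondon', so both get new_city = 'london'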
Just an idea: 16K is not that much. First use Perl's DBI (I'm assuming you are going to use Perl) to fetch that city column and store it in a hash (city name as the key), then find an algorithm that suits your needs (performance-wise) to iterate over the hash keys and use String::Diff to find matching intersections (read about it, it can definitely help you out) and store the result as the value. Then you can use that to update the database, with the key as the old value and the hash value as the new value.
I have recently written a survey application that has done its job and all the data is gathered. Now I have to analyze the data and I'm having some time issues.
I have to find out how many people selected what option and display it all.
I'm using this query, which does do its job:
SELECT COUNT(*)
FROM survey
-- table is a reserved word in MySQL, hence the backticks
WHERE users = ? AND `table` = ? AND col = ? AND `row` = ? AND selected = ?
GROUP BY users, `table`, col, `row`, selected
As is evident from the "?", I'm using MySQLi (in PHP) to fetch the data when needed, but I fear this is causing it to be so slow.
The table consists of all the elements above (+ a unique ID), and all of them are integers.
To explain some of the fields:
Each survey was divided into 3 or 4 tables (sized from 2x3 to 5x5) with a 1 to 10 happiness grade to select from. (Questions are on the right and top of the table, and you answer where the questions intersect.)
users - age groups
table, row, col - explained above
selected - dooooh explained above
Now with the surveys complete and around 1 million entries in the table, the query is getting very slow. Sometimes it takes like 3 minutes; sometimes (I guess) the time limit expires and you get no data at all. I also don't have access to the full database, just my empty "testing" one, since the customer is kinda paranoid :S (and his server seems to be a bit slow).
Now (after the initial essay) my questions are: I left indexing out intentionally because, with a lot of data being written during the survey, it would have been a bad idea. But since no new data is coming in at this point, would it make sense to index all the fields of the table? How much sense does it make to index integers that never go above 10? (As you can guess, I haven't got a clue about indexes.) Do I need the primary unique ID in this table?
I read somewhere that indexing may help GROUP BY, but only if you group by the first columns in a table (and since my ID is first and, from my point of view, useless, can I remove it and gain anything by it?)
Is there another way to write my query that would basically do the same thing but in a shorter period of time?
Thanks for all your suggestions in advance!
Add an index on the columns that you GROUP BY or use in the WHERE. So that's ONE index incorporating users, table, col, row and selected in your case.
Some quick rules:
Combine fields so that the WHERE columns come first and the GROUP BY elements last.
If you have other queries that only use part of it (e.g. users, table, col and selected) then leave the missing value (row, in this example) last.
Don't use too many indexes, as each one slows updates to the table marginally, so on a really large system you need to balance queries against indexes.
Edit: do you need the GROUP BY users, col, row at all, as these are already fixed by the WHERE? If the WHERE has filtered them out, you only need to group by selected.
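For example, a single composite index along those lines could be added like this (a sketch; the index name is arbitrary, and table needs backticks because it is a reserved word in MySQL):
ALTER TABLE survey
  ADD INDEX idx_survey_report (users, `table`, col, `row`, selected);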
This is kind of a weird question so my title is just as weird.
This is a voting app, so I have a table called ballots that has two important fields: username and ballot. The field ballot is a VARCHAR, but it basically stores a list of 25 ids (numbers from 1-200) as CSVs. For example it might be:
22,12,1,3,4,5,6,7,...
And another one might have
3,4,12,1,4,5,...
And so on.
So given an id (let's say 12) I want to find which row (or username) has that id in the leading spot. So in our example above it would be the first user, because he has 12 in the second position whereas the second user has it in the third position. It's possible that multiple people may have 12 in the leading position (say if users #3 and #4 have it in spot #1), and it's possible that no one ranked 12 at all.
I also need to do the reverse (who has it in the worst spot) but I figure if I can figure out one problem the rest is easy.
I would like to do this using a minimal number of queries and statements but for the life of me I cannot see how.
The simplest solution I thought of is to traverse all of the users in the table and keep track of who has an id in the leading spot. This will work fine for me now but the number of users can potentially increase exponentially.
The other idea I had was to do a query like this:
select `username` from `ballots` where `ballot` like '12,%'
and if that returns results I'm done because position 1 is the leading spot. But if that returned 0 results I'd do:
select `username` from `ballots` where `ballot` like '*,12,%'
where * is a wildcard character that will match one number and one number only (unlike the %). But I don't know if this can actually be done.
Anyway does anyone have any suggestions on how best to do this?
Thanks
I'm not sure I understood correctly what you want to do - to get a list of users who have a given number in the 'ballot' field ordered by its position in that field?
If so, you should be able to use MySQL FIND_IN_SET() function:
SELECT username, FIND_IN_SET(12, ballot) as position
FROM ballots
WHERE FIND_IN_SET(12, ballot) > 0
ORDER BY position
This will return all rows that have your number (e.g. 12) somewhere in ballot, sorted by position; you can apply LIMIT to reduce the number of rows returned.
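For instance, to get only the user(s) holding id 12 in the best (earliest) spot, and correctly handle ties, you could compare against the minimum position (a sketch building on the query above; for the worst spot just swap MIN for MAX):
SELECT username, FIND_IN_SET(12, ballot) AS position
FROM ballots
WHERE FIND_IN_SET(12, ballot) = (
    SELECT MIN(FIND_IN_SET(12, ballot))
    FROM ballots
    WHERE FIND_IN_SET(12, ballot) > 0
);
If no one ranked 12 at all, the subquery returns NULL and the outer query simply returns no rows.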