I am probably thinking about this wrong but here goes.
A computer starts spitting out a gazillion random numbers between 11111111111111111111 and 99999999999999999999, in a linear row:
Sometimes the computer adds a number to one end of the line.
Sometimes the computer adds a number to the other end of the line.
Each number has a number that comes, or will come, before.
Each number has a number that comes, or will come, after.
Not all numbers are unique; many, but not most, are repeated.
The computer never stops spitting out numbers.
As I record all of these numbers, I need to be able to make an educated guess, at any given time:
If this is the second time I have seen a number I must know what number preceded it in line last time.
If it has appeared more than two times, I must know the probability/frequency of numbers preceding it.
If this is the second time I have seen a number, I must also know what number came after it in line last time.
If it has appeared more than two times, I must know the probability/frequency of numbers coming after it.
How the heck do I structure the tables in a MySQL database to store all these numbers? Which engine do I use and why? How do I formulate my queries? Lookups need to be fast, but capacity is also important, because who knows when the thing will stop spitting them out?
My ill-conceived plan:
2 Tables:
1. Unique ID/#
2. #/ID/#
My thoughts:
Unique IDs are almost always going to be shorter than the number = faster match.
Numbers repeat = fewer ID rows = faster match initially.
SELECT * FROM table2 WHERE id = (SELECT id FROM table1 WHERE # = ?)
OR:
3 Tables:
1. Unique ID/#
2. #/ID
3. ID/#
My thoughts:
If I only need left/before, or only need after/right, I'm shrinking the size of the second query.
SELECT # FROM table2 (or table3) WHERE id = (SELECT id FROM table1 WHERE # = ?)
OR
1 Table:
1. #/#/#
Thoughts:
Fewer queries = less time.
SELECT * FROM table WHERE col2 = #
I'm lost.... :( Each number has four attributes, that which comes before+frequency and that which comes after+frequency.
Would I be better off thinking of it in that way? If I store and increment frequency in the table, do I do away with repetition and thus speed up my queries? I was initially thinking that if I stored every occurrence, it would be faster to figure out the frequency programmatically.......
Such simple data, but I just don't have the knowledge of how databases function to know which is more efficient.
In light of a recent comment, I would like to add a bit of information about the actual problem: I have a string of indefinite length. I am trying to store a Markov chain frequency table of the various characters, or chunks of characters, in this string.
Given any point in the string I need to know the probability of the next state, and the probability of the previous state.
I am anticipating user input, based on a corpus of text and past user input. A major difference compared to other applications I have seen is that I am going farther down the chain, more states, at a given time and I need the frequency data to provide multiple possibilities.
I hope that clarifies the picture a lot more. I didn't want to get into the nitty gritty of the problem, because in the past I have created questions that are not specific enough to get a specific answer.
This seems maybe a bit better. My primary question with this solution is: would providing the "key" (first few characters of the state) increase the speed of the system? i.e. query for state_key, then query only the results of that query for the full state?
Table 1:
name: state
col1:state_id - unique, auto incrementing
col2:state_key - the first X characters of the state
col3:state - fixed-length string holding the state
Table 2:
name: occurence
col1:state_id_left - non unique key from table 1
col2:state_id_right - non unique key from table 1
col3:frequency - int, incremented every time the two states occur next to each other.
QUERY TO FIND PREVIOUS STATES:
SELECT * FROM occurence WHERE state_id_right = (SELECT state_id FROM state WHERE state_key = ? AND state = ?)
QUERY TO FIND NEXT STATES:
SELECT * FROM occurence WHERE state_id_left = (SELECT state_id FROM state WHERE state_key = ? AND state = ?)
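A runnable sketch of this two-table design might look like the following. It uses Python with SQLite standing in for MySQL (the SQL carries over with minor changes), and the key length and helper names are illustrative assumptions, not part of the question:

```python
import sqlite3

# Stand-in for the schema above, using SQLite so the sketch is self-contained.
# Table and column names follow the question; key_len is an assumption.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE state (
    state_id  INTEGER PRIMARY KEY AUTOINCREMENT,
    state_key TEXT,
    state     TEXT UNIQUE
);
CREATE INDEX idx_state_key ON state(state_key, state);
CREATE TABLE occurence (
    state_id_left  INTEGER,
    state_id_right INTEGER,
    frequency      INTEGER,
    PRIMARY KEY (state_id_left, state_id_right)
);
""")

def state_id(s, key_len=2):
    """Insert the state if unseen, then return its id."""
    db.execute("INSERT OR IGNORE INTO state(state_key, state) VALUES (?, ?)",
               (s[:key_len], s))
    return db.execute("SELECT state_id FROM state WHERE state = ?",
                      (s,)).fetchone()[0]

def record_transition(left, right):
    """Bump the frequency of the (left, right) adjacency."""
    ids = (state_id(left), state_id(right))
    cur = db.execute("""UPDATE occurence SET frequency = frequency + 1
                        WHERE state_id_left = ? AND state_id_right = ?""", ids)
    if cur.rowcount == 0:
        db.execute("INSERT INTO occurence VALUES (?, ?, 1)", ids)

def next_states(s):
    """States observed immediately after s, most frequent first.
    (A JOIN replaces the subquery form above; swap the left/right
    columns to get previous states instead.)"""
    return db.execute("""
        SELECT r.state, o.frequency
        FROM occurence o
        JOIN state l ON l.state_id = o.state_id_left
        JOIN state r ON r.state_id = o.state_id_right
        WHERE l.state = ?
        ORDER BY o.frequency DESC""", (s,)).fetchall()

record_transition("th", "he")
record_transition("th", "he")
record_transition("th", "ey")
print(next_states("th"))   # [('he', 2), ('ey', 1)]
```

Whether the state_key pre-filter actually helps depends on the index: with a composite index on (state_key, state), the full-state lookup is already cheap, so a separate two-step query is unlikely to beat a single indexed lookup.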
I'm not familiar with Markov Chains but here is an attempt to answer the question. Note: To simplify things, let's call each string of numbers a 'state'.
First of all I imagine a table like this
Table states:
order : integer, auto-increment (add an index here)
state_id : integer (add an index here)
state : varchar (?)
order: just use a sequential number (1, 2, 3, ..., n); this will make it easy to search for the previous or next state.
state_id: a unique number associated with the state. As an example, you can use the number 1 to represent the state '1111111111...1' (whatever the length of the sequence is). What's important is that a reoccurrence of a state needs to use the same state_id that was used before. You may be able to derive the state_id from the string itself (maybe by subtracting a number). Of course, a state_id only makes sense if the number of possible states fits in a MySQL int field.
state: the string of numbers, '11111111...1' to '99999999...9'. I'm guessing this can only be stored as a string, but if it fits in an integer/number column you should try that; it may well be that you don't need the state_id at all.
The point of state_id is that searching numbers is quicker than searching text, but there will always be trade-offs when it comes to performance ... profile and identify your bottlenecks to make better design decisions.
So, how do you look for a previous occurrence of the state S_i ?
"SELECT `order`, state_id, state FROM states WHERE state_id = " and then append get_state_id(S_i), where get_state_id ideally uses a formula to generate a unique id for the state. (order is a reserved word in MySQL, hence the backticks.)
Now, with order - 1 or order + 1 you can access the neighboring states by issuing an additional query.
Next we need to track the frequency of different occurrences. You can do that in a different table that could look like this:
Table state_frequencies:
state_id integer (indexed)
occurrences integer
And only add records as you get the numbers.
Finally, you can have tables to track frequency for the neighboring states:
Table prev_state_frequencies (next_state_frequencies is the same):
state_id: integer (indexed)
prev_state_id: integer (indexed)
occurrences: integer
You will be able to infer probabilities (I guess this is what you are trying to do) by looking at the number of occurrences of a state (in state_frequencies) vs the number of occurrences of its predecessor state (in prev_state_frequencies).
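To make that last step concrete, here is a tiny sketch of how the two frequency tables combine into a probability. Plain Python dicts stand in for the tables, and all the numbers and names are made up:

```python
# Stand-ins for the two tables described above; values are illustrative.
state_frequencies = {"A": 5, "B": 3}        # state_id -> occurrences
prev_state_frequencies = {                   # (state_id, prev_state_id) -> occurrences
    ("B", "A"): 2,
    ("B", "B"): 1,
}

def prob_prev(state_id, prev_state_id):
    """Fraction of state_id's occurrences that were preceded by prev_state_id."""
    return (prev_state_frequencies.get((state_id, prev_state_id), 0)
            / state_frequencies[state_id])

print(prob_prev("B", "A"))   # 2/3: B occurred 3 times, preceded by A twice
```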
I'm not sure if I got your problem right but if this makes sense I'm guessing I have.
Hope it helps,
AH
It seems to me that the Markov chain is finite, so first I would start by defining the limit of the chain (i.e. 26 characters with x number of spaces to fill); then you can calculate the total number of possible combinations. To determine the probability of a certain arrangement of characters, the math, if I remember correctly, is:
x = ((C)(C))(P)
where
C = the number of possible characters and
P = the total potential outcomes.
This is a ton of data to store, and creating procedures to filter through the data could turn out to be a seemingly endless task.
If you are using an auto-incremented id in your table, you could query the table and use preg_match to test the new result against the previous results, then insert the number of total matches along with the new result into the table. This would also allow you to query the preceding results to see what came before each one. That should give you a general idea of the pattern within the results, as well as a general base for statistical relevance and new algorithm generation.
Related
I have three tables: Users with a unique nickname, more than four hundred Names, 300,000-plus Adjectives, and a ton of possible combinations.
When subscribing, the user can generate a unique, random and hopefully funny nickname by combining a random name with a random adjective. The user clicks a button and Voilà! an exhilarating identity is born.
I select the random names and adjectives by running two queries for each:
SELECT FLOOR(RAND() * COUNT(*)) AS `offset` FROM names/adjectives
and
SELECT * FROM names/adjectives LIMIT offset, 1
Then I check if the User was unlucky enough to generate an already existing identity.
SELECT COUNT(nickname) FROM users WHERE nickname=:generatedNickname
If he was, the poor chap, I loop through this again until it settles on something untaken.
But, as you guys probably already figured out, the growth of the user base also means lengthier loops and more sweat from my feeble EC2 Tier 1 Matchbox. So I came up with a brilliant solution: what if I pre-generate all the possible combinations and stuff them in a huge table? This would allow a simple pluck-and-play operation while I sip worry-free martinis on some anonymous beach. Or would it? Will my humble LAMP instance tremble and flee at the glorious sight of the humongous tables (both male and female)? Is there any better solution?
Generating those combinations beforehand will result in a huge amount of data. I do not recommend it. My suggestion would be to use a better source of randomness than RAND(). The likelihood of a collision (based on your estimates) is only around n/120000000, where n is the number of users, so your loop will not run for very long if you do get one.
Give the Nouns and Adjectives an AUTO_INCREMENT id that is the PRIMARY KEY. The other column (nouns/adjectives) should be UNIQUE.
Keep COUNT(*) for each of those two tables somewhere handy. Recompute these counts if you ever modify the tables. Do not do SELECT COUNT(*) in the code below, it will do a table scan -- not cheap.
Use SELECT noun FROM Nouns WHERE id = CEIL(noun_count * RAND()) to get a random "noun". Ditto for "adjective".
Now we need to check for dups. You have stored the adjective-noun combo in the user table, correct? and it is INDEXed, correct? So simply check this combo for already having been used.
If it is a dup, then start over.
None of the steps takes long, so even when you have to (rarely) repeat the process, it will not take long.
PS: I think you will find that RAND() is good enough for this task.
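As a sanity check of the overall flow, here is a toy sketch in Python. Lists stand in for the Nouns/Adjectives tables and a set for the indexed nickname column; all names are made up, and random.choice plays the role of id = CEIL(count * RAND()), which assumes contiguous ids with no gaps:

```python
import random

# Toy stand-ins for the tables; in MySQL the dup check would be an
# indexed lookup on the users table.
nouns = ["badger", "teapot", "walrus"]
adjectives = ["sleepy", "dapper", "rowdy"]
taken = {"sleepy badger"}   # nicknames already in the users table

def random_nickname():
    while True:
        # equivalent of SELECT ... WHERE id = CEIL(count * RAND()) on each table
        nick = random.choice(adjectives) + " " + random.choice(nouns)
        if nick not in taken:   # collisions are rare, so the loop is short
            taken.add(nick)
            return nick

print(random_nickname())
```

With 120 million combinations and far fewer users, the expected number of loop iterations stays extremely close to 1.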
I have a MySQL database that contains all the words in a standard English dictionary, which I am using to create a simple Scrabble word generator. The database is separated into 26 tables: one for each letter of the alphabet. Each table contains two columns:
"Word" column: this column is the primary key, is of type char(12), and does not accept null values.
"Length" column: this column contains an unsigned tinyint value and does not accept null values.
In my application, the user enters in any number of letters into a textbox (indicating their tiles) and I query the database using this code:
// this is looped over 26 times, and $char is a letter between 'A' and 'Z'
// check if the user entered character $char or a blank tile (signified by ? in app)
// this check prevents me from having to query useless tables
if (in_array($char, $lettersArray) || $blanks)
{
    // if so, select all words that have a length that's possible to make
    $query = 'SELECT Word FROM '.$char.'Words WHERE Length <= '.strlen($letters);
    $result = $db->query($query);
    $num_results = $result->num_rows;
    for ($j = 0; $j < $num_results; $j++)
    {
        // determine if it's possible to create word based on letters input
        // if so, perform appropriate code
    }
}
Everything is working, but my application takes a long time compared to the competition (theoretical competition, that is; this is more of a learning project I created for myself and I doubt I'll release it on the internet), despite the fact that the application is on my local computer. I tried using the automatic optimization feature of phpMyAdmin, but that provided no noticeable speed increase.
I don't think the performance problem is really the database. The structure of your data store is going to have the most significant impact on the performance of your algorithm.
One fairly easy-to-understand approach to the problem would be to handle the problem as anagrams. You could alphabetize all of the letters in each of your words, and store that as a column with an index on it.
word dorw
-------- -------
DALE ADEL
LEAD ADEL
LED DEL
HELLO EHLLO
HELP EHLP
Then, given a set of letters, you could query the database for all matching anagrams. Just alphabetize the set of letters passed in, and run a query.
SELECT word FROM dictionary WHERE dorw = 'AERT'
RATE
TARE
TEAR
Then, you could query for subsets of the letters:
SELECT word FROM dictionary WHERE dorw IN ('AER','AET','ART','ERT')
This approach would get you the longest words returned first.
This isn't the most efficient approach, but it's workable.
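A quick way to convince yourself the alphabetized-key idea works is a few lines of Python. A dict stands in for the indexed dorw column, and the word list is made up:

```python
from collections import defaultdict
from itertools import combinations

# A dict stands in for the indexed "dorw" column; the word list is illustrative.
dictionary = ["DALE", "LEAD", "LED", "HELLO", "HELP", "RATE", "TARE", "TEAR"]

by_key = defaultdict(list)
for w in dictionary:
    by_key["".join(sorted(w))].append(w)   # e.g. LEAD -> ADEL

def anagrams(letters):
    """Full-rack matches: SELECT word ... WHERE dorw = 'AERT'."""
    return by_key["".join(sorted(letters))]

def all_playable(letters):
    """Subset matches too, longest first: the WHERE dorw IN (...) query.
    Note: racks with repeated letters would need the combos deduplicated."""
    found = []
    for n in range(len(letters), 1, -1):
        for combo in combinations(sorted(letters), n):
            found.extend(by_key["".join(combo)])
    return found

print(anagrams("AERT"))      # ['RATE', 'TARE', 'TEAR']
print(all_playable("TLED"))  # ['LED'] -- found via the subset key DEL
```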
Handling a "blank" tile is going to be more work: you'd need to substitute a possible letter for it, and checking all 26 possibilities could be done in one query.
If they have letters ABCD and the blank tile, for example...
SELECT word FROM dictionary WHERE dorw IN ('AABCD', 'ABBCD', 'ABCCD'
    , 'ABCDD', 'ABCDE', 'ABCDF', ..., 'ABCDZ')
That gets more painful when you start dealing with the subsets...
(In Crossword and Jumble puzzles, there aren't any blank tiles)
So this may not be the most appropriate algorithm for Scrabble.
There are other algorithms that may be more efficient, especially at returning the shorter words first.
One approach is to build a tree.
The root node is a "zero"-letter word. As children of the root node, you would have nodes for all one-letter words. Each node would be marked as to whether it represents a valid word or not. As children of those nodes, you would have all possible two-letter words, again marked as to whether they are valid or not.
That will be a lot of nodes. For words up to 12 letters in length, that's a total possible space of 1 + 26 + 26**2 + 26**3 + 26**4 + ...
But you wouldn't need to store every possible node, you'd only store those branches that result in a valid word. You wouldn't have branches below ->Z->Z or ->X->Q
However, you would have a branch under ->X->Y->L, even though XYL is not a word, it would be the beginning of a branch leading to 'XYLOPHONE'
But that's a tree traversal algorithm, which is fundamentally different.
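For comparison, the tree described here is a classic trie; a minimal in-memory sketch (word list made up) looks like this:

```python
# Minimal trie sketch of the tree described above: each node stores its
# children and a flag for whether the path from the root spells a valid word.
class Node:
    def __init__(self):
        self.children = {}
        self.is_word = False

root = Node()

def insert(word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, Node())
    node.is_word = True

for w in ["XYLOPHONE", "LEAD", "LED"]:   # illustrative word list
    insert(w)

def lookup(word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:        # pruned branch: e.g. nothing under Z->Z
            return False
    return node.is_word

print(lookup("LED"))   # True
print(lookup("XYL"))   # False: the branch exists (toward XYLOPHONE) but isn't a word
```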
It sounds like you need to learn about indexes. If you created indexes in the database, even if all the data was in one table, it would not be querying "useless letters".
You should provide some more information though: how long a query takes to return a result if you run it from the MySQL console, and how long it takes to move that result from the database to the PHP engine. You might, for example, be bringing back a 100 MB result set with each query you run; if that is the case, limit the results to the first result or a fixed number of possible results.
To look at how much data is being returned, manually run one of your queries in the console and see how many records come back. If the number is high, the data will take longer to be passed to PHP, and it also means your code must iterate through a lot more results. You might want to consider dropping out of the for loop after you find the first word that can be accepted. If at least one word is possible, don't check again until another letter is placed.
I know this question is about optimizing your database but if I were doing this I would only read the words from the database once, initialize some data structure and search that structure instead of continually querying the database.
Sorry if this was completely irrelevant.
Suppose Table1 contains column orderid (not a key, although it's NOT NULL and unique). It contains 5-digit numbers.
What's the best way to generate a PHP variable $unique_var holding a number that is not in that column?
Possibly important: from 10% to 30% of the 5-digit numbers are already in the table, so generating a number and checking it in a while (mysql_num_rows() == 0) {} loop is not the best way to find one. Are there any better solutions for performance?
Thank you.
If just 10-30% of the numbers are already taken, then only 10-30% of attempts will need to be repeated at least once. That is not a big performance issue at all.
Otherwise, just create an all-5-digit-numbers table (just 100k rows) and remove all the numbers that already exist. When you need another random number, just pick one and delete it.
I would suggest finding the biggest number (with a MAX() clause) and starting from there.
Here are a couple of suggestions. Each has its drawbacks.
Pre-populate your table and add a column to indicate that a number is unused. Select an unused number using LIMIT 1 and mark it used. This uses a lot of space.
Keep a separate table containing previously used numbers. If that table is empty, generate numbers sequentially from the last used number (or from 00001 if Table1 is empty). This requires some extra bookkeeping.
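The pre-populated pool idea can be sketched in a few lines (Python stands in for the SQL; the taken numbers are made up, and the range assumes 5-digit numbers without leading zeros). Swapping the chosen element to the end makes removal O(1):

```python
import random

# Pre-generate every candidate number and drop the ones already taken;
# the 'taken' set stands in for Table1.orderid.
taken = {10001, 10002, 42424}
pool = [n for n in range(10000, 100000) if n not in taken]

def next_unique():
    """Pick a random unused number and remove it from the pool."""
    i = random.randrange(len(pool))
    pool[i], pool[-1] = pool[-1], pool[i]   # swap-with-last: O(1) removal
    return pool.pop()

x = next_unique()
print(x)
```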
From someone with more experience than myself, would it be a better idea to simply count the number of items in a table (such as counting the number of topics in a category) or to keep a variable that holds that value and just increment and call it (an extra field in the category table)?
Is there a significant difference between the two or is it just very slight, and even if it is slight, would one method still be better than the other? It's not for any one particular project, so please answer generally (if that makes sense) rather than based on something like the number of users.
Thank you.
To get the number of items (rows in a table), you'd use standard SQL and do it on demand
SELECT COUNT(*) FROM MyTable
Note, in case I've missed something: each item (row) in the table has some unique identifier, whether it's a part number, some code, or an auto-increment value. So adding a new row could trigger the "auto-increment" of a column.
This is unrelated to counting rows: because of DELETEs or ROLLBACKs, those numbers may not be contiguous.
Trying to maintain row counts separately will end in tears and/or disaster. Trying to use COUNT(*)+1 or MAX(id)+1 to generate a new row identifier is even worse.
I think there is some confusion about your question. My interpretation is that you're asking whether to do a SELECT COUNT(*) or to keep a column where you track the actual count.
I would not add such a column, if you don't have reasons to do so. This is premature optimization and you complicate your software design.
Also, you want to avoid having the same information stored in different places. Counting is a trivial task, so you would actually be duplicating information, which is a bad idea.
I'd go with just counting. If you notice a performance issue, you can consider other options, but as soon as you keep a value that's separate, you have to do some work to make sure it's always correct. Using COUNT() you always get the actual number "straight from the horse's mouth" so to speak.
Basically, don't start optimizing until you have to. If everything works fine and fast using COUNT(), then do that. Otherwise, store the count somewhere, but rather than adding/subtracting to update the stored value, run COUNT() when needed to get the new number of items.
In my forum I count the sub-threads in a forum like this:
SELECT COUNT(forumid) AS count FROM forumtable
As long as you're filtering on an identifier that specifies the forum and/or sub-section, and that column has an index, it's very fast. So there's no reason to add more columns than you need.
This is kind of a weird question so my title is just as weird.
This is a voting app, so I have a table called ballots that has two important fields: username and ballot. The ballot field is a VARCHAR, but it basically stores a list of 25 ids (numbers from 1-200) as CSVs. For example it might be:
22,12,1,3,4,5,6,7,...
And another one might have
3,4,12,1,4,5,...
And so on.
So given an id (let's say 12) I want to find which row (or username) has that id in the leading spot. So in our example above it would be the first user because he has 12 in the second position whereas the second user has it in the third position. It's possible that multiple people may have 12 in the leading position (say if user #3 and #4 have it in spot #1) and it's possible that no one may have ranked 12.
I also need to do the reverse (who has it in the worst spot) but I figure if I can figure out one problem the rest is easy.
I would like to do this using a minimal number of queries and statements but for the life of me I cannot see how.
The simplest solution I thought of is to traverse all of the users in the table and keep track of who has an id in the leading spot. This will work fine for me now but the number of users can potentially increase exponentially.
The other idea I had was to do a query like this:
select `username` from `ballots` where `ballot` like '12,%'
and if that returns results I'm done because position 1 is the leading spot. But if that returned 0 results I'd do:
select `username` from `ballots` where `ballot` like '*,12,%'
where * is a wildcard character that will match one number and one number only (unlike the %). But I don't know if this can actually be done.
Anyway does anyone have any suggestions on how best to do this?
Thanks
I'm not sure I understood correctly what you want to do - to get a list of users who have a given number in the 'ballot' field ordered by its position in that field?
If so, you should be able to use MySQL FIND_IN_SET() function:
SELECT username, FIND_IN_SET(12, ballot) as position
FROM ballots
WHERE FIND_IN_SET(12, ballot) > 0
ORDER BY position
This will return all rows that have your number (e.g. 12) somewhere in ballot, sorted by position; you can apply LIMIT to reduce the number of rows returned.
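If it helps to see the logic outside SQL, here is what that FIND_IN_SET query is doing, sketched in Python with two made-up rows based on the question:

```python
# Dict stands in for the ballots table; rows are illustrative.
ballots = {
    "alice": "22,12,1,3,4,5,6,7",
    "bob":   "3,4,12,1,4,5",
}

def positions(target):
    """(username, position) pairs for every ballot containing target,
    best (lowest) position first; position is 1-based like FIND_IN_SET."""
    rows = []
    for user, csv in ballots.items():
        ids = csv.split(",")
        if str(target) in ids:                        # exact element match, not substring
            rows.append((user, ids.index(str(target)) + 1))
    return sorted(rows, key=lambda r: r[1])

print(positions(12))   # [('alice', 2), ('bob', 3)]
```

For the "worst spot" case, sort descending instead (ORDER BY position DESC in the SQL).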