I have a MySQL database that contains all the words in a standard English dictionary, which I am using to create a simple Scrabble word generator. The database is separated into 26 tables: one for each letter of the alphabet. Each table contains two columns:
"Word" column: this column is the primary key, is of type char(12), and does not accept null values.
"Length" column: this column contains an unsigned tinyint value and does not accept null values.
In my application, the user enters any number of letters into a textbox (indicating their tiles) and I query the database using this code:
// this is looped over 26 times, and $char is a letter between 'A' and 'Z'
// check if the user entered character $char or a blank tile (signified by ? in the app)
// this check prevents me from having to query useless tables
if (in_array($char, $lettersArray) || $blanks)
{
    // if so, select all words with a length that's possible to make
    $query = 'SELECT Word FROM '.$char.'Words WHERE Length <= '.strlen($letters);
    $result = $db->query($query);
    $num_results = $result->num_rows;
    for ($j = 0; $j < $num_results; $j++)
    {
        // determine if it's possible to create the word from the letters input
        // if so, perform the appropriate code
    }
}
Everything is working, but my application takes a long time compared to the competition (theoretical competition, that is; this is more of a learning project I created for myself and I doubt I'll release it on the internet), despite the fact that the application is on my local computer. I tried using the automatic optimization feature of phpMyAdmin, but that provided no noticeable speed increase.
I don't think the performance problem is really the database. The structure of your data store is going to have the most significant impact on the performance of your algorithm.
One fairly easy-to-understand approach to the problem would be to treat the words as anagrams. You could alphabetize all of the letters in each of your words, and store that as a column with an index on it.
word dorw
-------- -------
DALE ADEL
LEAD ADEL
LED DEL
HELLO EHLLO
HELP EHLP
Then, given a set of letters, you could query the database for all matching anagrams. Just alphabetize the set of letters passed in, and run a query.
SELECT word FROM dictionary WHERE dorw = 'AERT'
RATE
TARE
TEAR
Then, you could query for subsets of the letters:
SELECT word FROM dictionary WHERE dorw IN ('AER','AET','ART','ERT')
This approach would get you the longest words returned first.
This isn't the most efficient approach, but it's workable.
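For illustration, here is a minimal PHP sketch of the alphabetize-and-query step; the dictionary table with word and indexed dorw columns follows the example above, and $db is assumed to be the mysqli connection from the question:

// Alphabetize the letters of a word or rack: 'LEAD' -> 'ADEL'
function alphabetize($letters)
{
    $chars = str_split(strtoupper($letters));
    sort($chars);
    return implode('', $chars);
}

// Find all exact anagrams of the user's letters
$rack = alphabetize($letters);
$stmt = $db->prepare('SELECT word FROM dictionary WHERE dorw = ?');
$stmt->bind_param('s', $rack);
$stmt->execute();

The same helper would be used to populate the dorw column when the dictionary is loaded.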
Handling a "blank" tile is going to be more work, you'd need to substitute a possible letter for it, and checking all 26 possibilities could be done in one query,
If they have letters ABCD and the blank tile, for example...
SELECT word FROM dictionary WHERE dorw IN ('AABCD','ABBCD', 'ABCCD'
, 'ABCDD', 'ABCDE', 'ABCDF', ..., 'ABCDZ')
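Generating those candidate strings in PHP could look like this (a sketch reusing the alphabetize() helper from above):

// Expand one blank tile into all 26 candidate racks, alphabetized
function expandBlank($rack)
{
    $candidates = array();
    foreach (range('A', 'Z') as $letter) {
        $candidates[] = alphabetize($rack . $letter);
    }
    return $candidates;
}

// expandBlank('ABCD') yields 'AABCD', 'ABBCD', ..., 'ABCDZ'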
That gets more painful when you start dealing with the subsets...
(In Crossword and Jumble puzzles, there aren't any blank tiles)
So this may not be the most appropriate algorithm for Scrabble.
There are other algorithms that may be more efficient, especially at returning the shorter words first.
One approach is to build a tree.
The root node is a "zero-letter" word. The children of the root node would be all one-letter words. Each node would be marked as to whether it represents a valid word or not. As children of those nodes, you would have all possible two-letter words, again marked as valid or not.
That will be a lot of nodes. For words up to 12 letters in length, that's a total possible space of 1 + 26 + 26**2 + 26**3 + ... + 26**12.
But you wouldn't need to store every possible node, only those branches that result in a valid word. You wouldn't have branches below ->Z->Z or ->X->Q.
However, you would have a branch under ->X->Y->L; even though XYL is not a word, it is the beginning of a branch leading to 'XYLOPHONE'.
But that's a tree traversal algorithm, which is fundamentally different.
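For concreteness, a minimal PHP sketch of such a tree; the nested-array representation is my own choice, not something the approach prescribes:

// Each node: array('valid' => bool, 'children' => array(letter => node))
function trieInsert(&$node, $word)
{
    foreach (str_split($word) as $ch) {
        if (!isset($node['children'][$ch])) {
            $node['children'][$ch] = array('valid' => false, 'children' => array());
        }
        $node = &$node['children'][$ch];   // descend, creating the branch as needed
    }
    $node['valid'] = true;                 // mark the end of a real word
}

$root = array('valid' => false, 'children' => array());
trieInsert($root, 'XYLOPHONE');            // creates the X->Y->L->... branch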
It sounds like you need to learn about indexes. If you created indexes in the database, then even if all the data were in one table, it would not be querying "useless letters".
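For example, a single table with a composite index can skip the "useless letters" just as effectively as 26 separate tables. A sketch, where the first_letter column and index name are my own:

$db->query('
    CREATE TABLE words (
        word         CHAR(12) NOT NULL PRIMARY KEY,
        length       TINYINT UNSIGNED NOT NULL,
        first_letter CHAR(1)  NOT NULL,
        INDEX idx_letter_length (first_letter, length)
    )
');

// The index narrows this to the A rows without scanning anything else
$result = $db->query("SELECT word FROM words WHERE first_letter = 'A' AND length <= 7");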
You should provide some more information, though: how long a query takes to return a result when you run it from the mysql console, and how long it takes to move that result from the database to the PHP engine. You might, for example, be bringing back a 100 MB result set with each query you run; if that is the case, limit the results to the first result or a fixed number of possible results.
To look at how much data is being returned, manually run one of your queries in the console and see how many records come back. If the number is high, the data will take longer to be passed to PHP, and it will also mean your code must iterate through a lot more results. You might want to consider dropping out of the for loop after you find the first word that can be accepted. If at least one word is possible, don't check again until another letter is placed.
I know this question is about optimizing your database, but if I were doing this I would only read the words from the database once, initialize some data structure, and search that structure instead of continually querying the database.
Sorry if this was completely irrelevant.
Related
I want to store a large number (~thousands) of strings and be able to perform matches using wildcards.
For example, here is a sample content:
Folder1
Folder1/Folder2
Folder1/*
Folder1/Folder2/Folder3
Folder2/Folder*
*/Folder4
*/Fo*4
(each line has additional data too, like tags, but the matching is only against that key)
Here is an example of what I would like to match against the data:
Folder1
Folder1/Folder2/Folder3
Folder3
(* being a wildcard here, it can be a different character)
I naively considered storing it in a MySQL table and using % wildcards with the LIKE operator, but MySQL indexes only work for characters to the left of the wildcard, and in my case the wildcard can be anywhere (i.e. %/Folder3).
So I'm looking for a fast solution that can be used from PHP. And I am open: it can be a separate server, a PHP library using files with regexes, ...
Have you considered using MySQL's regular expression engine? Try something like this:
SELECT *
FROM your_table
WHERE your_query_string REGEXP pattern_column
This will return rows with regex keys that your query string matches. I expect it will perform better than running a query to pull all of the data and doing the matching in PHP.
More info here: http://dev.mysql.com/doc/refman/5.1/en/regexp.html
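Note that the stored patterns use * wildcards, so they would first need converting to regex syntax for pattern_column. One possible PHP sketch (the function name is my own, and the escaping may need adjusting for MySQL's regex flavor):

// Turn a '*' wildcard pattern into a regex for the pattern column,
// e.g. 'Folder1/*' -> '^Folder1/.*$'
function wildcardToRegexp($pattern)
{
    $escaped = preg_quote($pattern);              // escape regex metacharacters
    return '^' . str_replace('\*', '.*', $escaped) . '$';
}

echo wildcardToRegexp('*/Fo*4');                  // prints ^.*/Fo.*4$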
You might want to use a multicore approach to solve that search in a fraction of the time. For search and matching I would recommend FPGAs, but that's probably the hardest way to do it. Consider THIS ARTICLE on using CUDA: you can do those searches 16x faster than usual. On multicore CPU systems you can use POSIX threads or a cluster of computers (MPI, for example), or you can call a Gearman service to run the searches using advanced algorithms.
Were it me, I'd store the key field twice: once forward and once reversed (see MySQL's REVERSE function). You can then search the index with LEFT(main_field, n) and LEFT(reversed_field, n). It won't help when you have a wildcard in the middle of the string AND at the beginning (e.g. *Folder1*Folder2), but it will when you have a wildcard at the beginning or the end.
E.g. if you want to search */Folder1, then search WHERE LEFT(reverse_field, 8) = '1redloF/';
For Folder1/*/FolderX, search WHERE LEFT(reverse_field, 8) = 'XredloF/' AND LEFT(main_field, 8) = 'Folder1/'.
If your strings represent some kind of hierarchical structure (as your sample content suggests), even if they are not "real" files, and since you say you are open to alternative solutions, why not consider something like a file-based index?
Choose a new directory like myindex
Create an empty file for each entry using the string key as location & file name in myindex
Now you can find matches using glob; thanks to the hierarchical file structure, a glob search should be much faster than scanning all your database entries.
If needed, you can match the results to your MySQL data; thanks to your MySQL index on the key, this action will be very fast.
But don't forget to update the myindex structure on INSERT, UPDATE or DELETE in your MySQL database.
This solution will only compete on a huge data set (but not too huge, as #Kyle mentioned) with a hierarchical structure that is deep rather than wide.
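A rough PHP sketch of the indexing and lookup steps (the directory and key names are placeholders):

$indexDir = 'myindex';

// Indexing: one empty file per key, directories for the intermediate parts
$key = 'Folder1/Folder2/Folder3';
$path = $indexDir . '/' . $key;
if (!is_dir(dirname($path))) {
    mkdir(dirname($path), 0777, true);
}
touch($path);

// Matching: glob expands the wildcards against the directory structure
$matches = glob($indexDir . '/*/Folder4');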
EDIT
Sorry, this would only work if the wildcards are in your search terms, not in the stored strings themselves.
As the wildcards (*) are in your data and not in your queries, I think you should start by breaking up your data into pieces. You should create an index table having columns like:
dataGroup INT(11),
exactString varchar(100),
wildcardEnd varchar(100),
wildcardStart varchar(100),
If you have a value like "Folder1/Folder2" store it in "exactString" and assign the ID of the value in the main data table to "dataGroup" in the above index table.
If you have a value like "Folder1/*" store a value of "Folder1/" to "wildcardEnd" and again assign the id of the value in the main table to the "dataGroup" field in above Table.
You can then do a match within your query using:
indexTable.wildcardEnd = LEFT('Folder1/WhatAmILookingFor/Data', LENGTH(indexTable.wildcardEnd))
This will truncate the search string ('Folder1/WhatAmILookingFor/Data') to "Folder1/" and then match it against the wildcardEnd field. I assume MySQL is clever enough not to do the truncation for every row, but to start with the first character and match it against every row (using B-tree indexes).
A value like "*/Folder4" will go into the field "wildcardStart" but reversed. To cite Missy Elliot: "Is it worth it, let me work it
I put my thing down, flip it and reverse it" (http://www.youtube.com/watch?v=Ke1MoSkanS4). So store a value of "4redloF/" in "wildcardStart". Then a WHERE like the following will match rows:
indexTable.wildcardStart = LEFT(REVERSE('Folder1/WhatAmILookingFor/Folder4'), LENGTH(indexTable.wildcardStart))
Of course you could do the REVERSE already in your application logic.
Now about the tricky part: something like "*/Fo*4" should get split up into two records:
# Record 1
dataGroup ==> id of "*/Fo*4" in data table
wildcardStart ==> oF/
wildcardEnd ==> /Fo
# Record 2
dataGroup ==> id of "*/Fo*4" in data table
wildcardStart ==> 4
Now if you match something, you have to take care that every index record of a dataGroup is returned for a complete match and that no overlapping occurs. This could also be solved in SQL, but that is beyond this question.
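A combined candidate lookup might then look something like this (a sketch using the column names above; filtering down to complete dataGroup matches is the part left out):

$needle = 'Folder1/WhatAmILookingFor/Folder4';
$stmt = $db->prepare(
    'SELECT dataGroup FROM indexTable
     WHERE exactString = ?
        OR wildcardEnd = LEFT(?, LENGTH(wildcardEnd))
        OR wildcardStart = LEFT(REVERSE(?), LENGTH(wildcardStart))'
);
$stmt->bind_param('sss', $needle, $needle, $needle);
$stmt->execute();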
A database isn't the right tool for these kinds of searches. You can still use a database (any database and any structure) to store the strings, but you have to write the code to do all the searches in memory. Load all the strings from the database (a few thousand strings is really no biggy), cache them, and run your search/match algorithm on them.
You probably have to code your algorithm yourself, because the standard tools will be overkill for what you are trying to achieve and there is no guarantee that they will be able to achieve exactly what you need.
I would build a regex representation of your wildcard-based strings and run those regexes on your input. You will probably have to do some work until you get the regexes right, but it will be the fastest way to go.
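A sketch of that approach, with the pattern list hard-coded here for illustration:

// Compile each stored '*' pattern into a PCRE once, then match inputs against all
$patterns = array('Folder1/*', '*/Folder4', '*/Fo*4');   // would come from the DB
$compiled = array();
foreach ($patterns as $p) {
    $compiled[$p] = '#^' . str_replace('\*', '.*', preg_quote($p, '#')) . '$#';
}

$input = 'Folder1/Folder2/Folder3';
foreach ($compiled as $original => $regex) {
    if (preg_match($regex, $input)) {
        echo $original, " matches\n";                     // here: 'Folder1/*'
    }
}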
I suggest reading the keys and their associated payloads into a binary tree representation ordered alphanumerically by key. If your keys are not terribly "clumped", you can avoid the (slight additional) overhead of building a balanced tree. You can also avoid any tree-maintenance code since, if I understand your problem correctly, the data will be changing frequently, and it would be simplest to rebuild the tree rather than add/remove/update nodes in place. The overhead of reading into the tree is similar to performing an initial sort, and tree traversal to search for your value is straightforward and much more efficient than just running a regex against a bunch of strings. You may even find, while working it through, that your wildcards in the tree lead to some shortcuts to prune the search space. A quick search shows lots of resources and PHP snippets to get you started.
If you run SELECT folder_col, COUNT(*) FROM your_sample_table GROUP BY folder_col, do you get duplicate folder_col values (i.e. COUNT(*) greater than 1)?
If not, that means you can produce SQL that would generate a valid Sphinx index (see http://sphinxsearch.com/).
I wouldn't recommend doing text search on a large collection of data in MySQL. You need a database to store the data, but that would be it. For searching, use a search engine like:
Solr (http://lucene.apache.org/solr/)
Elastic Search (http://www.elasticsearch.org/)
Sphinx (http://sphinxsearch.com/)
Those services will allow you to do all sorts of funky text search (including wildcards) in the blink of an eye ;-)
I am probably thinking about this wrong but here goes.
A computer starts spitting out a gazillion random numbers between 11111111111111111111 and 99999999999999999999, in a linear row:
Sometimes the computer adds a number to one end of the line.
Sometimes the computer adds a number to the other end of the line.
Each number has a number that comes, or will come, before.
Each number has a number that comes, or will come, after.
Not all numbers are unique; many, but not most, are repeated.
The computer never stops spitting out numbers.
As I record all of these numbers, I need to be able to make an educated guess, at any given time:
If this is the second time I have seen a number I must know what number preceded it in line last time.
If it has appeared more than two times, I must know the probability/frequency of numbers preceding it.
If this is the second time I have seen a number, I must also know what number came after it in line last time.
If it has appeared more than two times, I must know the probability/frequency of numbers coming after it.
How the heck do I structure the tables in a MySQL database to store all these numbers? Which engine do I use, and why? How do I formulate my queries? I need lookups to be fast, but capacity is also important, because who knows when the thing will stop spitting them out?
My ill-conceived plan:
2 Tables:
1. Unique ID/#
2. #/ID/#
My thoughts:
Unique IDs are almost always going to be shorter than the number = faster match.
Numbers repeat = fewer ID rows = faster match initially.
SELECT * FROM table2 WHERE id=(SELECT id FROM table1 WHERE #=?)
OR:
3 Tables:
1. Unique ID/#
2. #/ID
3. ID/#
My thoughts:
If I only need left/before, or only need after/right, I'm shrinking the size of the second query.
SELECT # FROM table2 (or 3) WHERE id=(SELECT id FROM table1 WHERE #=?)
OR
1 Table:
1. #/#/#
Thoughts:
Fewer queries = less time.
SELECT * FROM table WHERE col2=#.
I'm lost.... :( Each number has four attributes: that which comes before plus its frequency, and that which comes after plus its frequency.
Would I be better off thinking of it in that way? If I store and increment frequency in the table, do I do away with repetition and thus speed up my queries? I was initially thinking that if I store every occurrence, it would be faster to figure the frequency programmatically...
Such simple data, but I just don't have the knowledge of how databases function to know which is more efficient.
In light of a recent comment, I would like to add a bit of information about the actual problem: I have a string of indefinite length. I am trying to store a Markov chain frequency table of the various characters, or chunks of characters, in this string.
Given any point in the string I need to know the probability of the next state, and the probability of the previous state.
I am anticipating user input, based on a corpus of text and past user input. A major difference compared to other applications I have seen is that I am going farther down the chain, more states, at a given time and I need the frequency data to provide multiple possibilities.
I hope that clarifies the picture a lot more. I didn't want to get into the nitty gritty of the problem, because in the past I have created questions that are not specific enough to get a specific answer.
This seems maybe a bit better. My primary question with this solution is: Would providing the "key" (first few characters of the state) increase the speed of the system? i.e query for state_key, then query only the results of that query for the full state?
Table 1:
name: state
col1:state_id - unique, auto incrementing
col2:state_key - the first X characters of the state
col3:state - fixed length string or state
Table 2:
name: occurence
col1:state_id_left - non unique key from table 1
col2:state_id_right - non unique key from table 1
col3:frequency - int, incremented every time the two states occur next to each other.
QUERY TO FIND PREVIOUS STATES:
SELECT * FROM occurence WHERE state_id_right=(SELECT state_id FROM state WHERE state_key=? AND state=?)
QUERY TO FIND NEXT STATES:
SELECT * FROM occurence WHERE state_id_left=(SELECT state_id FROM state WHERE state_key=? AND state=?)
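One way to do the frequency increment in a single statement, assuming a UNIQUE key on (state_id_left, state_id_right) in the occurence table and a mysqli connection $db:

// Record one left->right transition; the upsert needs the UNIQUE key
// on (state_id_left, state_id_right) to fire the update branch
$stmt = $db->prepare(
    'INSERT INTO occurence (state_id_left, state_id_right, frequency)
     VALUES (?, ?, 1)
     ON DUPLICATE KEY UPDATE frequency = frequency + 1'
);
$stmt->bind_param('ii', $leftId, $rightId);
$stmt->execute();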
I'm not familiar with Markov chains, but here is an attempt to answer the question. Note: to simplify things, let's call each string of numbers a 'state'.
First of all, I imagine a table like this:
Table states:
order : integer, auto-incrementing (add an index here)
state_id : integer (add an index here)
state : varchar (?)
order: just use a sequential number (1, 2, 3, ..., n); this will make it easy to search for the previous or next state.
state_id: a unique number associated with the state. As an example, you can use the number 1 to represent the state '1111111111...1' (whatever the length of the sequence is). What's important is that a reoccurrence of a state needs to use the same state_id that was used before. You may be able to formulate the state_id based on the string (maybe by subtracting a number). Of course, a state_id only makes sense if the number of possible states fits in a MySQL int field.
state: that is the string of numbers, '11111111...1' to '99999999...9'. I'm guessing this can only be stored as a string, but if it fits in an integer/number column you should try it, as it may well be that you don't need the state_id.
The point of state_id is that searching numbers is quicker than searching text, but there will always be trade-offs when it comes to performance. Profile and identify your bottlenecks to make better design decisions.
So, how do you look for a previous occurrence of the state S_i ?
"SELECT order, state_id, state FROM states WHERE state_id = " and then attach get_state_id(S_i) where get_state_id ideally uses a formula to generate a unique id for the state.
Now, with order - 1 or order + 1 you can access the neighboring states by issuing an additional query.
Next we need to track the frequency of different occurrences. You can do that in a different table that could look like this:
Table state_frequencies:
state_id integer (indexed)
occurrences integer
And only add records as you get the numbers.
Finally, you can have tables to track frequency for the neighboring states:
Table prev_state_frequencies (next_state_frequencies is the same):
state_id: integer (indexed)
prev_state_id: integer (indexed)
occurrences: integer
You will be able to infer probabilities (I guess this is what you are trying to do) by looking at the number of occurrences of a state (in state_frequencies) vs. the number of occurrences of its predecessor state (in prev_state_frequencies).
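For example, the conditional probability of each predecessor could be computed with a query along these lines (a sketch against the tables above; $db is assumed to be a mysqli connection and $stateId the state being looked at):

$stmt = $db->prepare(
    'SELECT p.prev_state_id,
            p.occurrences / s.occurrences AS probability
     FROM prev_state_frequencies p
     JOIN state_frequencies s ON s.state_id = p.state_id
     WHERE p.state_id = ?'
);
$stmt->bind_param('i', $stateId);
$stmt->execute();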
I'm not sure if I got your problem right, but if this makes sense I'm guessing I have.
Hope it helps,
AH
It seems to me that the Markov chain is finite, so first I would start by defining the limit of the chain (i.e. 26 characters with x number of spaces to fill); then you can calculate the total number of possible combinations. To determine the probability of a certain arrangement of characters, the math, if I remember correctly, is:
x = ((C)(C))(P)
where
C = the number of possible characters and
P = the total potential outcomes.
This is a ton of data to store, and creating procedures to filter through the data could turn out to be a seemingly endless task.
If you are using an auto-incremented id in your table, you could query the table and use preg_match to test the new result against the previous results, then insert the number of total matches along with the new result into the table. This would also allow you to query the preceding results to see what came before them, which should give you a general idea of the pattern within the results as well as a general base for statistical relevance and new algorithm generation.
Suppose Table1 contains the column orderid (not a key, although it's NOT NULL and unique). It contains 5-digit numbers.
What's the best way to generate a PHP var $unique_var that is not in that column?
What could be important: from 10% to 30% of 5-digit numbers are already in the table (so generating a number and checking with while (mysql_num_rows() == 0) {} is not the best way to find one; is there any better solution for performance?).
Thank you.
If just 10-30% of the numbers are already taken, then only 10-30% of your attempts will need to be performed at least twice, which is not a big performance issue at all.
Otherwise, just create an all-5-digit-numbers table (just 100k rows) and remove all the numbers that already exist. When you need another random number, just pick one and delete it.
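A sketch of the pick-and-delete step; free_numbers is an assumed name for the pre-populated table, and mysqli with transactions is assumed:

// Grab a random unused number and remove it so it can't be handed out twice
$db->begin_transaction();
$row = $db->query('SELECT num FROM free_numbers ORDER BY RAND() LIMIT 1 FOR UPDATE')
          ->fetch_row();
$unique_var = $row[0];
$db->query('DELETE FROM free_numbers WHERE num = ' . (int) $unique_var);
$db->commit();

ORDER BY RAND() is fine at this scale (under 100k rows); for bigger tables you would pick a random offset instead.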
I would suggest finding the biggest number (with MAX()) and starting from there.
Here are a couple of suggestions. Each has its drawbacks.
Pre-populate your table and add a column to indicate that a number is unused. Select an unused number using LIMIT 1 and mark it used. This uses a lot of space.
Keep a separate table containing previously used numbers. If that table is empty, generate numbers sequentially from the last used number (or from 00001 if Table1 is empty). This requires some extra bookkeeping.
This is kind of a weird question so my title is just as weird.
This is a voting app, so I have a table called ballots that has two important fields: username and ballot. The field ballot is a VARCHAR, but it basically stores a list of 25 ids (numbers from 1-200) as CSVs. For example it might be:
22,12,1,3,4,5,6,7,...
And another one might have
3,4,12,1,4,5,...
And so on.
So given an id (let's say 12), I want to find which row (or username) has that id in the leading spot. In our example above it would be the first user, because he has 12 in the second position whereas the second user has it in the third position. It's possible that multiple people have 12 in the leading position (say, if users #3 and #4 have it in spot #1), and it's possible that no one has ranked 12.
I also need to do the reverse (who has it in the worst spot) but I figure if I can figure out one problem the rest is easy.
I would like to do this using a minimal number of queries and statements but for the life of me I cannot see how.
The simplest solution I thought of is to traverse all of the users in the table and keep track of who has an id in the leading spot. This will work fine for me now but the number of users can potentially increase exponentially.
The other idea I had was to do a query like this:
select `username` from `ballots` where `ballot` like '12,%'
and if that returns results I'm done, because position 1 is the leading spot. But if that returned 0 results I'd do:
select `username` from `ballots` where `ballot` like '*,12,%'
where * is a wildcard character that will match one number and one number only (unlike %). But I don't know if this can actually be done.
Anyway does anyone have any suggestions on how best to do this?
Thanks
I'm not sure I understood correctly what you want to do: get a list of users who have a given number in the ballot field, ordered by its position in that field?
If so, you should be able to use MySQL's FIND_IN_SET() function:
SELECT username, FIND_IN_SET(12, ballot) as position
FROM ballots
WHERE FIND_IN_SET(12, ballot) > 0
ORDER BY position
This will return all rows that have your number (e.g. 12) somewhere in ballot, sorted by position. You can apply LIMIT to reduce the number of rows returned.
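From PHP, that might look like this (a sketch; $db is assumed to be a mysqli connection):

$id = 12;
$stmt = $db->prepare(
    'SELECT username, FIND_IN_SET(?, ballot) AS position
     FROM ballots
     WHERE FIND_IN_SET(?, ballot) > 0
     ORDER BY position
     LIMIT 1'
);
$stmt->bind_param('ii', $id, $id);
$stmt->execute();
$stmt->bind_result($username, $position);
$stmt->fetch();   // the user holding the id in the leading spot

Switching to ORDER BY position DESC gives the worst spot instead.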
Suppose it is a long article (say 100,000 words), and I need to write a PHP file to display page 1, page 2, or page 38 of the article, via
display.php?page=38
but the number of words per page can change over time (for example, it is 500 words per page right now, but next month we could easily change it to 300 words per page). What is a good way to divide the long article and store it in the database?
P.S. The design may be further complicated if we want to display 500 words but include whole paragraphs. That is, if we are already showing word 480 but the paragraph has 100 more words remaining, then show those 100 words anyway even though they exceed the 500-word limit (and then the next page shouldn't show those 100 words again).
I would do it by splitting the article into chunks when saving it. The save script would split the article using whatever rules you design into it and save each chunk into a table like this:
CREATE TABLE article_chunks (
article_id int not null,
chunk_no int not null,
body text
)
Then, when you load a page of an article:
$sql = "select body from article_chunks where article_id = "
.$article_id." and chunk_no=".$page;
Whenever you want to change the logic of splitting articles into pages, you run a script that pulls all the chunks together and re-splits them:
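A re-split script could look roughly like this; split_into_chunks() is a hypothetical stand-in for whatever splitting rule you design:

// Pull the chunks together, re-split, and write them back
$full = '';
$chunks = $db->query('SELECT body FROM article_chunks WHERE article_id = '
                     . (int) $article_id . ' ORDER BY chunk_no');
while ($row = $chunks->fetch_assoc()) {
    $full .= $row['body'];
}

$db->query('DELETE FROM article_chunks WHERE article_id = ' . (int) $article_id);
$stmt = $db->prepare('INSERT INTO article_chunks (article_id, chunk_no, body)
                      VALUES (?, ?, ?)');
foreach (split_into_chunks($full) as $no => $body) {   // your splitting rule
    $stmt->bind_param('iis', $article_id, $no, $body);
    $stmt->execute();
}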
UPDATE: In giving this advice I suppose your application is read-intensive rather than write-intensive, meaning that articles are read more often than they are written.
You could of course output exactly 500 words per page, but the better way would be to put some kind of break markers into your article (end of sentence, end of paragraph), placed where a break would be good. This way your pages won't have exactly X words each, but about or up to X, and it won't tear sentences or paragraphs apart.
Of course, when displaying the pages, don't display these break markers.
You might want to start by breaking the article up into an array of paragraphs using explode (the old split() function is deprecated):
http://www.php.net/explode
$array = explode("\n", $articleText);
It's better to cut the text manually, because it's not a good idea to let a program determine where to cut. Sometimes it will cut just after an h2 tag and continue with the text on the next page.
This is a simple database structure for that:
article(id, title, time, ...)
article_body(id, article_id, page, body, ...)
The SQL query:
SELECT a.*, ab.body, ab.page
FROM article a
INNER JOIN article_body ab
ON ab.article_id = a.id
WHERE a.id = $aricle_id AND ab.page= $page
LIMIT 1;
In the application you can use jQuery to simply add a new textarea for another page...
Your table could be something like
CREATE TABLE ArticleText (
artId INTEGER,
wordNum INTEGER,
wordId INTEGER,
PRIMARY KEY (artId, wordNum),
FOREIGN KEY (artId) REFERENCES Articles,
FOREIGN KEY (wordId) REFERENCES Words
)
This, of course, may be very space-expensive, or slow, etc., but you'll need some measurements to determine that (as so much depends on your DB engine). BTW, I hope it's clear that the Articles table is simply a table with metadata on articles keyed by artId, and the Words table is a table of all words in every article keyed by wordId (trying to save some space there by identifying already-known words when an article is entered, if that's feasible...). One special word must be the "end of paragraph" marker, easily identifiable as such and distinct from every real word.
If you structure your data like this, you gain lots of flexibility in retrieving by page, and the page length can be changed in a snap, even query by query if you wish. To get a page:
SELECT wordText
FROM Articles
JOIN ArticleText USING (artId)
JOIN Words USING (wordId)
WHERE wordNum BETWEEN (#pagenum-1)*#pagelength AND #pagenum * #pagelength + #extras
AND Articles.artId = #articleid
The parameters #pagenum, #pagelength, #extras, and #articleid are to be inserted into the prepared query at query time (use whatever syntax your DB and language like, such as :extras or numbered parameters or whatever).
So we get #extras words beyond the expected end-of-page, and then on the client side we check those extra words to make sure one of them is the end-of-paragraph marker; otherwise we do another query (with different BETWEEN values) to get yet more.
Far from ideal, but, given all the issues you've highlighted, worth considering. If you can count on the page length always being e.g. a multiple of 100, you can adopt a slight variation of this based on 100-word chunks (and no Words table, just text stored directly per row).
Let the author divide the article into parts themselves.
Writers know how to make an article interesting and readable by dividing it into logical parts, like "Part 1—Installation", "Part 2—Configuration", etc. Having an algorithm do it is a bad decision, imho.
Chopping an article in the wrong place just makes the reader annoyed. Don't do it.
my 2¢
/0