Convert string of words to unique number - php

I'm building a custom spell checker that maps a word, or a phrase of several words, to a custom correction.
For that I created an SQL table with the following structure:
| id (int 11) | keyword (varchar 255) | correction (varchar 255) |
|-------------|-----------------------|--------------------------|
| 1           | Facebooc              | Facebook                 |
| 2           | I lovi you            | I love you               |
| 3           | This is a tsst        | This is a test           |
The keyword column is marked as unique and has an ascending index on it.
A keyword can be more than one word (a phrase).
When I get a request with a new keyword, my code runs a SELECT query to check whether this specific keyword has a correction (if the keyword does not exist, it inserts the new keyword into the table without a correction).
I expect this table to become very large (about 10 million rows or more), so I thought that a unique constraint and index on the keyword column might not be such a good idea.
Is the current structure good for my needs?
I thought about adding another int column to the table and converting each keyword to a unique number, so that searching and selecting data might be easier. Is that a good idea?

You can add a column with a short checksum as provided by the crc32() function.
However, crc32() does not produce unique values: there is a non-zero probability that two different strings generate the same checksum.
If the same checksum is not found for a new keyword, the keyword is certainly not yet in the database.
If the same checksums are found, then the keywords themselves have to be checked.
Whether this method brings advantages in speed also depends heavily on the performance of the database system.
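A minimal sketch of the checksum lookup described above, written in Python so it can be run as-is. The assumptions are mine: sqlite3 stands in for MySQL, zlib.crc32 stands in for PHP's crc32(), and the table name speller, the column keyword_crc and the helper names are hypothetical. The checksum narrows the search through an integer index, and the keyword itself is still compared because checksums can collide.

```python
import sqlite3
import zlib

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE speller (
           id          INTEGER PRIMARY KEY,
           keyword     TEXT NOT NULL,
           keyword_crc INTEGER NOT NULL,   -- hypothetical checksum column
           correction  TEXT
       )"""
)
# Non-unique index: two different keywords may share a checksum.
conn.execute("CREATE INDEX idx_speller_crc ON speller (keyword_crc)")


def checksum(keyword):
    """32-bit checksum of the keyword (stand-in for PHP's crc32())."""
    return zlib.crc32(keyword.encode("utf-8"))


def add_keyword(keyword, correction=None):
    conn.execute(
        "INSERT INTO speller (keyword, keyword_crc, correction) VALUES (?, ?, ?)",
        (keyword, checksum(keyword), correction),
    )


def lookup(keyword):
    """Return the stored correction, or None after recording a new keyword."""
    row = conn.execute(
        # The checksum narrows the candidates; the keyword comparison
        # rules out collisions.
        "SELECT correction FROM speller WHERE keyword_crc = ? AND keyword = ?",
        (checksum(keyword), keyword),
    ).fetchone()
    if row is None:
        add_keyword(keyword)  # unknown keyword: store it without a correction
        return None
    return row[0]


add_keyword("Facebooc", "Facebook")
```

Whether this beats a plain unique index on keyword is something to measure on the real database rather than assume.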

Related

Search database for cell that contains all of the specified characters but no others

I'm attempting to write a small API for my internal use that basically searches my table column for all results that contain specified letters but ignores cells that contain more letters than specified.
For example, let's say I have the following rows:
| id | word |
| 1 | cloud |
| 2 | could |
| 3 | cloudy |
And my user inputs the string "dulco". In this example, we search the database for a cell that contains all of the letters supplied, but no extras. So every word would be a positive match, except "cloudy" because it has an extra letter. Now just imagine this on a much bigger scale, where there are over 100 thousand rows.
I've done a bit of searching trying to figure out how to do this, but I can't figure out the proper search terms, and thus I can't figure out how to do the search. I thought of something like a LIKE clause:
$query = "SELECT word FROM words WHERE word LIKE '%d%u%l%c%o%'";
but this wouldn't work properly because there may be other letters there too, meaning the selection would be invalid.
Is this type of query possible, and if so, how can I do such a query?
You could try something like this, using a combination of INSTR and LENGTH:
SELECT word FROM words
WHERE INSTR(word,"d")
AND INSTR(word,"u")
AND INSTR(word,"l")
AND INSTR(word,"c")
AND INSTR(word,"o")
AND LENGTH(word)=5;
Here 5 is the length of the input string.
Output:
+-------+
| word |
+-------+
| cloud |
| could |
+-------+
Edit for multiple occurrences of the same letter:
SELECT word FROM words
WHERE LOCATE("y",word)
AND LOCATE("e",word)
AND LOCATE("l",word)
AND LOCATE("l",word, LOCATE("l",word)+1)
AND LENGTH(word)=4;
So, if the table contains yell and yella, it will return only yell.
The query must be generated by PHP from the input string; the nested LOCATE is needed ONLY for letters that occur more than once.
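The preprocessing step could look roughly like the sketch below. It is written in Python rather than PHP, the function names are hypothetical, and the %s placeholder style is an assumption borrowed from common Python MySQL drivers. For a letter that occurs n times it emits one condition per occurrence, nesting LOCATE calls so that each search starts just after the previous hit.

```python
from collections import Counter


def locate_expr(n):
    """LOCATE expression that looks for the n-th occurrence of a letter.

    It is only meaningful when the expressions for 1..n-1 are also part of
    the WHERE clause, which build_query() guarantees.
    """
    expr = "LOCATE(%s, word)"
    for _ in range(n - 1):
        # Search again, starting one position after the previous hit.
        expr = f"LOCATE(%s, word, {expr} + 1)"
    return expr


def build_query(letters):
    """Return (sql, params) matching words made of exactly these letters."""
    conditions = []
    params = []
    for letter, count in Counter(letters).items():
        for n in range(1, count + 1):
            conditions.append(locate_expr(n))
            params.extend([letter] * n)  # one placeholder per nested LOCATE
    conditions.append("LENGTH(word) = %s")
    params.append(len(letters))
    sql = "SELECT word FROM words WHERE " + " AND ".join(conditions)
    return sql, params


sql, params = build_query("yell")
print(sql)
print(params)
```

Passing the letters as parameters instead of concatenating them into the SQL string keeps user input out of the query text.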
MySQL also has a REGEXP operator, so in theory you can filter rows using a regular expression.
$query = "SELECT word FROM words WHERE word REGEXP '^[dulco]{5}$'";
Just consider how expensive this query could be, since the pattern generally has to be evaluated against every row.
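One caveat with the character-class pattern: it only checks that every character comes from the set and that the length is right, so a word that repeats one letter five times also matches. The sketch below shows this with Python's re module, which treats this particular pattern the same way. The function names are hypothetical, and same_letters is my suggestion for a stricter check done in application code, not something from the answer.

```python
import re


def matches_pattern(word, letters="dulco"):
    """Same idea as REGEXP '^[dulco]{5}$': right alphabet, right length."""
    pattern = f"[{re.escape(letters)}]{{{len(letters)}}}"
    return re.fullmatch(pattern, word) is not None


def same_letters(word, letters="dulco"):
    """Stricter check: the word uses exactly the supplied letters."""
    return sorted(word) == sorted(letters)


for candidate in ("cloud", "could", "cloudy", "ccccc"):
    print(candidate, matches_pattern(candidate), same_letters(candidate))
```

If the regular expression is used to pre-filter rows in the database, a check like same_letters can be applied to the returned rows to drop false positives.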

Foreign Keys Yah or Nah

Non-FK schema
human
human_id | name
alien
alien_id | name | planet
comment
comment_id | text
1 | hello
vote
to_id | to_type | who | who_type
1 | human | 1 | alien
1 | comment | 1 | human
FK schema
human
human_id | name
alien
alien_id | name | planet
comment
comment_id | text
1 | hello
entity_id
entity_id | id | type
1 | 1 | human
2 | 1 | comment
3 | 1 | alien
vote
to_id | who_id
1 | 3
2 | 1
I want to ask which one is better.
The first one is without foreign keys.
The second one is with foreign keys.
Won't the second one (with foreign keys) be slow, since I have to insert twice and do unnecessary joins to get the human/alien name etc.?
And what will happen if entity_id reaches the maximum of 18446744073709551615?
I suggest you add a supertype to unify Human and Alien and use this supertype in relationships. I also suggest separating votes on comments from votes on users into separate relationships. Consider the following tables:
This is the basic idea, though somewhat oversimplified. It allows a User to have both Human and Alien details. If required, disjoint subtypes can be enforced with a few additional columns and triggers.
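The answer's own table definitions are not reproduced here, so the following is only one possible reading of the supertype idea, not the answerer's schema. Every table and column name is an assumption, and sqlite3 stands in for MySQL. A user table acts as the supertype, human and alien hold subtype details keyed by the same id, and votes on users and votes on comments are separate tables with real foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(
    """
    CREATE TABLE user    (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE human   (user_id INTEGER PRIMARY KEY REFERENCES user(user_id));
    CREATE TABLE alien   (user_id INTEGER PRIMARY KEY REFERENCES user(user_id),
                          planet  TEXT);
    CREATE TABLE comment (comment_id INTEGER PRIMARY KEY, text TEXT NOT NULL);
    CREATE TABLE user_vote (
        voter_id  INTEGER NOT NULL REFERENCES user(user_id),
        target_id INTEGER NOT NULL REFERENCES user(user_id),
        PRIMARY KEY (voter_id, target_id)
    );
    CREATE TABLE comment_vote (
        voter_id   INTEGER NOT NULL REFERENCES user(user_id),
        comment_id INTEGER NOT NULL REFERENCES comment(comment_id),
        PRIMARY KEY (voter_id, comment_id)
    );
    """
)

# Hypothetical sample data: one human, one alien, one comment.
conn.execute("INSERT INTO user (user_id, name) VALUES (1, 'Ann'), (2, 'Zorg')")
conn.execute("INSERT INTO human (user_id) VALUES (1)")
conn.execute("INSERT INTO alien (user_id, planet) VALUES (2, 'Mars')")
conn.execute("INSERT INTO comment (comment_id, text) VALUES (1, 'hello')")
conn.execute("INSERT INTO user_vote (voter_id, target_id) VALUES (2, 1)")
conn.execute("INSERT INTO comment_vote (voter_id, comment_id) VALUES (1, 1)")
conn.commit()


def comment_voters(comment_id):
    """Names of users who voted on a comment: one join, no type column."""
    rows = conn.execute(
        """SELECT u.name
           FROM comment_vote v
           JOIN user u ON u.user_id = v.voter_id
           WHERE v.comment_id = ?
           ORDER BY u.name""",
        (comment_id,),
    ).fetchall()
    return [name for (name,) in rows]
```

Because voter_id always references user, the name lookup needs a single join regardless of whether the voter is a human or an alien, and the database rejects votes that point at rows that do not exist.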
You ask whether a foreign key and joins will be slower. An argument can be made that normalized databases are likely to be more efficient, since redundant associations are eliminated. In practice, performance has much more to do with effective use of indexes than avoiding joins.
If an auto_increment column overflows, the database engine will return an error and refuse to insert more rows. In this case you can adjust the column to use a larger type. When you exceed the space of the largest types in MySQL, it's probably time for a different (or even custom) solution.
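To put the overflow question in perspective, here is a quick back-of-the-envelope calculation. The insert rate is an assumption chosen to be deliberately generous; the maximum is the figure quoted in the question.

```python
MAX_UNSIGNED_BIGINT = 2**64 - 1        # 18446744073709551615
INSERTS_PER_SECOND = 1_000_000         # assumed sustained insert rate
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Whole years of continuous inserting before the counter is exhausted.
years_to_overflow = MAX_UNSIGNED_BIGINT // (INSERTS_PER_SECOND * SECONDS_PER_YEAR)
print(years_to_overflow)
```

Even at a million inserts per second, exhausting an unsigned 64-bit counter takes well over half a million years, so overflow is a realistic concern only for the smaller integer types.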

Followers/following database structure

My website has a followers/following system (like Twitter's). My dilemma is creating the database structure to handle who's following who.
What I came up with was creating a table like this:
id | user_id | followers | following
1 | 20 | 23,58,84 | 11,156,27
2 | 21 | 72,35,14 | 6,98,44,12
... | ... | ... | ...
Basically, I was thinking that each user would have a row with columns for their followers and the users they're following. The followers and people they're following would have their user id's separated by commas.
Is this an effective way of handling it? If not, what's the best alternative?
That's the worst way to do it. It violates normalization. Have two separate tables: Users and User_Followers. Users will store user information. User_Followers will look like this:
id | user_id | follower_id
1 | 20 | 45
2 | 20 | 53
3 | 32 | 20
User_Id and Follower_Id will be foreign keys referencing the Id column in the Users table.
There is a better physical structure than proposed by other answers so far:
CREATE TABLE follower (
user_id INT, -- References user.
follower_id INT, -- References user.
PRIMARY KEY (user_id, follower_id),
UNIQUE INDEX (follower_id, user_id)
);
InnoDB tables are clustered, so the secondary indexes behave differently than in heap-based tables and can have unexpected overheads if you are not cognizant of that. Having a surrogate primary key id just adds another index for no good reason [1] and makes indexes on {user_id, follower_id} and {follower_id, user_id} fatter than they need to be (because secondary indexes in a clustered table implicitly include a copy of the PK).
The table above has no surrogate key id and (assuming InnoDB) is physically represented by two B-Trees (one for the primary/clustering key and one for the secondary index), which is about as efficient as it gets for searching in both directions [2]. If you only need one direction, you can abandon the secondary index and go down to just one B-Tree.
BTW what you did was a violation of the principle of atomicity, and therefore of 1NF.
[1] And every additional index takes space, lowers the cache effectiveness and impacts the INSERT/UPDATE/DELETE performance.
[2] From followee to follower and vice versa.
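A small sketch of the logical behaviour of that table, with sqlite3 standing in for MySQL. It does not reproduce InnoDB's clustering or its storage characteristics; it only shows that the composite primary key makes each (user, follower) pair unique without a surrogate id. The index name and the follow() helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE follower (
        user_id     INTEGER NOT NULL,   -- the user being followed
        follower_id INTEGER NOT NULL,   -- the user who follows
        PRIMARY KEY (user_id, follower_id)
    );
    CREATE UNIQUE INDEX follower_reverse ON follower (follower_id, user_id);
    """
)


def follow(follower_id, user_id):
    """Record that follower_id follows user_id.

    Returns False when the pair already exists: the composite primary key
    rejects the duplicate row.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO follower (user_id, follower_id) VALUES (?, ?)",
                (user_id, follower_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

A second call such as follow(45, 20) after a first successful one returns False and leaves a single row in the table.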
One weakness of that representation is that each relationship is encoded twice: once in the row for the follower and once in the row for the following user, making it harder to maintain data integrity and updates tedious.
I would make one table for users and one table for relationships. The relationship table would look like:
id | follower | following
1 | 23 | 20
2 | 58 | 20
3 | 84 | 20
4 | 20 | 11
...
This way adding new relationships is simply an insert, and removing relationships is a delete. It's also much easier to roll up the counts to determine how many followers a given user has.
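As a rough illustration of how simple the operations become, here is a sketch using the sample rows above. sqlite3 stands in for MySQL; the table name relationship and the helper names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE relationship (
           id        INTEGER PRIMARY KEY,
           follower  INTEGER NOT NULL,
           following INTEGER NOT NULL
       )"""
)
conn.executemany(
    "INSERT INTO relationship (follower, following) VALUES (?, ?)",
    [(23, 20), (58, 20), (84, 20), (20, 11)],
)


def follower_count(user_id):
    """How many users follow user_id."""
    return conn.execute(
        "SELECT COUNT(*) FROM relationship WHERE following = ?", (user_id,)
    ).fetchone()[0]


def following_count(user_id):
    """How many users user_id follows."""
    return conn.execute(
        "SELECT COUNT(*) FROM relationship WHERE follower = ?", (user_id,)
    ).fetchone()[0]


def unfollow(follower, following):
    """Removing a relationship is a single DELETE."""
    conn.execute(
        "DELETE FROM relationship WHERE follower = ? AND following = ?",
        (follower, following),
    )
```

With the comma-separated layout, the same unfollow would mean rewriting two strings in two different rows.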
No, the approach you describe has a few problems.
First, storing multiple data points as comma-separated strings has a number of issues. It's difficult to join on (and while you can join using like it will slow down performance) and difficult and slow to search on, and can't be indexed the way you would want.
Second, if you store both a list of followers and a list of people following, you have redundant data (the fact that A is following B will show up in two places), which is both a waste of space, and also creates the potential of data getting out-of-sync (if the database shows A on B's list of followers, but doesn't show B on A's list of following, then the data is inconsistent in a way that's very hard to recover from).
Instead, use a join table. That's a separate table where each row has a user id and a follower id. This allows things to be stored in one place, allows indexing and joining, and also allows you to add additional columns to that row, for example to show when the following relationship started.

Which approach is best for storing a list of words in mysql that will later be used for statistics?

DETAILS
I have a quiz (let’s call it quiz1). Quiz1 uses the same wordlist each time it is generated.
If the user needs to, they can skip words to complete the quiz. I’d like to store those skipped words in mysql and then later perform statistics on them.
At first I was going to store the missed words in one column as a string. Each word would be separated by a comma.
|testid | missedwords | score | userid |
*************************************************************************
| quiz1 | wordlist,missed,skipped,words | 59 | 1 |
| quiz2 | different,quiz,list | 65 | 1 |
The problem with this approach is that I want to show statistics at the end of each quiz about which words were most frequently missed by users who took quiz1.
I’m assuming that storing missed words in one column as above is inefficient for this purpose, as I'd need to extract the information and then tally it (probably using PHP, unless I stored the tallied data in a separate table).
I then thought perhaps I need to create a separate table for the missed words.
The advantage of the table below is that it should be easy to tally the words from it.
|Instance| missed word |
*****************************
| 1 | wordlist |
| 1 | missed |
| 1 | skipped |
Another approach
I could create a table with tallies and update it each time quiz1 is taken.
Testid | wordlist| missed| skipped| otherword|
**************************************************
Quiz1 | 1 | 1| 1| 0 |
The problem with this approach is that I would need a different table for each quiz, because each quiz will use different words. Also, information is lost because only the tally is kept, not the related data such as which user missed which words.
Question
Which approach would you use? Why? Alternative approaches to this task are welcome. If you see any flaws in my logic please feel free to point them out.
EDIT
Users will be able to retake the quiz as many times as they like. Their information will not be updated; instead a new instance will be created for each quiz they retake.
The best way to do this is to have the word collection completely normalized. This way, analyses will be easy and fast.
quiz_words with wordID, word
quiz_skipped_words with quizID, userID, wordID
To get all the skipped words of a user:
SELECT wordID, word
FROM quiz_words
JOIN quiz_skipped_words USING (wordID)
WHERE userID = ?;
You could add a GROUP BY clause to get counts per word.
To get the number of times a specific word was skipped:
SELECT COUNT(*)
FROM quiz_words
JOIN quiz_skipped_words USING (wordID)
WHERE word = ?;
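The grouped tally for a quiz could be sketched as follows, using sqlite3 as a stand-in for MySQL. The table and column names follow the answer; the sample rows and the most_skipped() helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE quiz_words (
        wordID INTEGER PRIMARY KEY,
        word   TEXT NOT NULL UNIQUE
    );
    CREATE TABLE quiz_skipped_words (
        quizID INTEGER NOT NULL,
        userID INTEGER NOT NULL,
        wordID INTEGER NOT NULL REFERENCES quiz_words(wordID)
    );
    -- Hypothetical sample data
    INSERT INTO quiz_words (wordID, word)
    VALUES (1, 'wordlist'), (2, 'missed'), (3, 'skipped');
    INSERT INTO quiz_skipped_words (quizID, userID, wordID)
    VALUES (1, 1, 1), (1, 1, 2), (1, 2, 2), (1, 3, 2), (1, 3, 3);
    """
)


def most_skipped(quiz_id):
    """Words skipped in a quiz, most frequently skipped first."""
    return conn.execute(
        """SELECT word, COUNT(*) AS times_skipped
           FROM quiz_words
           JOIN quiz_skipped_words USING (wordID)
           WHERE quizID = ?
           GROUP BY wordID, word
           ORDER BY times_skipped DESC, word""",
        (quiz_id,),
    ).fetchall()


print(most_skipped(1))
```

Because every skipped word is its own row, the tally is a single GROUP BY and no string splitting is needed in application code.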
According to database normalization theory, the second approach is better, because ideally one cell of a relational table should store only one value, which is atomic and cannot be split. Each word is an entity instance.
Also, I would suggest not creating per-quiz word tables, but instead adding a column to the Missed-Word table for the quiz the word belongs to, and using this column as a foreign key to the Quiz table. That way you avoid creating tables at run time (which is a "bad practice" in database design).
Why not have a quiz table and a quiz_words table? The quiz_words table would store id, quizID, word as columns. Then for each quiz instance create records in the quiz_words table for each word the user did use.
You could then run MySQL counts on the quiz_words table based on quizID and/or quiz type.
The best solution (from my point of view) for what you are trying to achieve is the normalized approach:
test table, which has a test_id column and other columns
missed_words table, which has id (AI PK) and word (UQ). Here you can also have a hits column that is incremented each time an association to this word is made in the test_missed_words table; this way the stats you want are already compiled and don't need to be calculated by a SELECT query
test_missed_words, a link table that has test_id and missed_word_id (composite PK)
This way you do not have redundant data (missed words) and you can easily extract the stats that you want.
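One way the hits counter could be kept up to date is with a trigger on the link table. The answer does not say how the increment should happen, so the trigger, its name and the record_miss() helper are all assumptions; sqlite3 stands in for MySQL, whose trigger syntax differs slightly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE test (test_id INTEGER PRIMARY KEY);
    CREATE TABLE missed_words (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        word TEXT NOT NULL UNIQUE,
        hits INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE test_missed_words (
        test_id        INTEGER NOT NULL REFERENCES test(test_id),
        missed_word_id INTEGER NOT NULL REFERENCES missed_words(id),
        PRIMARY KEY (test_id, missed_word_id)
    );
    -- Hypothetical trigger: bump the counter whenever a link row is added.
    CREATE TRIGGER bump_hits AFTER INSERT ON test_missed_words
    BEGIN
        UPDATE missed_words SET hits = hits + 1 WHERE id = NEW.missed_word_id;
    END;
    """
)


def record_miss(test_id, word):
    """Link a missed word to a test instance, creating rows as needed."""
    with conn:
        conn.execute("INSERT OR IGNORE INTO test (test_id) VALUES (?)", (test_id,))
        conn.execute("INSERT OR IGNORE INTO missed_words (word) VALUES (?)", (word,))
        word_id = conn.execute(
            "SELECT id FROM missed_words WHERE word = ?", (word,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO test_missed_words (test_id, missed_word_id) VALUES (?, ?)",
            (test_id, word_id),
        )


def hits(word):
    return conn.execute(
        "SELECT hits FROM missed_words WHERE word = ?", (word,)
    ).fetchone()[0]
```

The precomputed counter trades a little write cost for cheaper reads; the same number can always be recomputed from test_missed_words if the two ever need to be checked against each other.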
Keeping as much information as possible (and being able to compile user-specific stats later as well as overall stats now) I would create a table structure similar to:
Stats
quizId | userId | type| wordId|
******************************************
1 | 1 | missed| 4|
1 | 1 | skipped| 7|
Where type can either be an int defining the different types of actions, or a string representation - depending on if you believe it can ever be more. ^^
Then:
Quizzes
quizId | quizName|
********************
1| Quiz 1|
With the word list made for each quiz like:
WordList (pk: wordId)
quizId | wordId| word|
***************************
1 | 1 | Cat|
1 | 2 | Dog|
You would have your user table however you want, we are just linking the id from it in to this system.
With this, all id fields will be non-unique keys in the stats table. When a user skips or misses a word, you would add the id of that word to the stats table along with the relevant quizId and type. Getting stats this way is easy on a per-user basis, a per-word basis, or a per-type basis, or any combination of the three. It also makes the word list for each quiz easily available for generating the quizzes. ^^
Hope this helps!
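A sketch of how the per-word, per-type and per-user breakdowns could be queried from such a Stats table. sqlite3 stands in for MySQL, the table and column names follow the answer, and the sample rows and the tally() helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE WordList (
        wordId INTEGER PRIMARY KEY,
        quizId INTEGER NOT NULL,
        word   TEXT NOT NULL
    );
    CREATE TABLE Stats (
        quizId INTEGER NOT NULL,
        userId INTEGER NOT NULL,
        type   TEXT NOT NULL,
        wordId INTEGER NOT NULL
    );
    -- Hypothetical sample data
    INSERT INTO WordList (wordId, quizId, word) VALUES (1, 1, 'Cat'), (2, 1, 'Dog');
    INSERT INTO Stats (quizId, userId, type, wordId)
    VALUES (1, 1, 'missed', 1), (1, 1, 'skipped', 2), (1, 2, 'missed', 1);
    """
)


def tally(quiz_id, user_id=None):
    """Count events per word and type, optionally for a single user."""
    sql = (
        "SELECT w.word, s.type, COUNT(*) "
        "FROM Stats s JOIN WordList w ON w.wordId = s.wordId "
        "WHERE s.quizId = ?"
    )
    params = [quiz_id]
    if user_id is not None:
        sql += " AND s.userId = ?"
        params.append(user_id)
    sql += " GROUP BY w.word, s.type ORDER BY w.word, s.type"
    return conn.execute(sql, params).fetchall()
```

Calling tally(1) gives the overall picture for a quiz, while tally(1, user_id=2) narrows it to one user without changing the schema.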

How to add a series of string in incrementing id in any table?

I have a MySQL database with an Order table that has an auto-increment id. Until now we have had plain numeric order IDs such as 1, 2, 3, 4, 5... From now on I have to prepend the series A20 to my id, like A20103, A20104, A20105 and so on, and when the last three digits reach 999 the series should change to A21001, A21002, A21003 and so on. The same series has to be added to the previously added orders.
How can I achieve this? Please guide me.
Altering an existing auto_increment column does not sound like a good idea - do you really have to do this? Instead, why not just modify your select query to return a suitably formatted id? By doing so, you maintain referential integrity, and you are also free to change the order id format at any time in the future, without having to update your database.
SELECT id, CONCAT('A2', LPAD(id, 4, '0')) AS order_id FROM <table>;
Example output:
+------+----------+
| id | order_id |
+------+----------+
| 1 | A20001 |
| 2 | A20002 |
...
| 999 | A20999 |
| 1000 | A21000 |
| 1001 | A21001 |
+------+----------+
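If the formatting is done in application code instead of in the SELECT, the equivalent is a one-liner. The sketch below is in Python and the function name is hypothetical. One difference worth knowing: MySQL's LPAD shortens its result to the requested length, so ids of 10000 and above would lose digits in the SQL version, whereas zfill only pads and never truncates.

```python
def format_order_id(order_id):
    """Display form of a numeric order id: 'A2' plus the id padded to 4 digits."""
    return "A2" + str(order_id).zfill(4)


for sample in (1, 2, 999, 1000, 1001):
    print(sample, format_order_id(sample))
```

The stored id stays an integer either way; only the displayed form changes.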
something along the lines of:
"AR2" . str_pad((int) $ordernumber, 4, "0", STR_PAD_LEFT);
jim
[edit] - i'm assuming this is for display purposes as stated elsewhere, the ID field on the DB is integer!!
You can't have an auto-increment column that is not numeric. You had better keep the current auto-incrementing column and add a new one that you compute manually according to your rules.
You'll probably want to use the MAX() function to get the latest order and generate the next value: remember to do it within a transaction.
You could create a function or a trigger, to do the insert cleanly for you.
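The rule itself can be written as a pure function of the numeric id. The sketch below is one possible reading of the question, not a confirmed specification: it assumes each series holds sequence numbers 001 to 999, so id 999 maps to A20999 and id 1000 starts the next series at A21001. The function name and the first_series parameter are hypothetical.

```python
def order_code(order_id, first_series=20):
    """Map a 1-based numeric order id to a code such as 'A20103'.

    Assumed rule: 999 codes per series (001..999), then the series
    number increases by one.
    """
    if order_id < 1:
        raise ValueError("order_id must be a positive integer")
    series = first_series + (order_id - 1) // 999
    sequence = (order_id - 1) % 999 + 1
    return f"A{series}{sequence:03d}"


for sample in (103, 999, 1000, 1998, 1999):
    print(sample, order_code(sample))
```

Since the code is derived entirely from the existing id, it can be computed for the previously added orders as well, either on the fly or once to fill a new column.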
You can't add prefixes directly to the database. However, when selecting it you can prepend it.
SELECT concat('A', id) as id FROM table
To get the effect of starting from 20000 you can set the auto increment starting value.
