"Did you mean" feature on a dictionary database

I have a table of roughly 300,000 rows containing technical terms, queried from PHP against MySQL with FULLTEXT indexes. When I search for a mistyped term, for example "hyperpext", the query naturally returns no results.
I need to compensate for small typing errors and fetch the nearest record from the database. How can I accomplish such a feature? I know about the Levenshtein distance, Soundex and Metaphone algorithms, but I don't have a solid idea of how to apply them when querying the database.
Thanks

See this article for how you might implement Levenshtein distance in a MySQL stored function.
For posterity, the author's suggestion is to do this:
CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN SET c = c_temp; END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END
He also supplies a LEVENSHTEIN_RATIO helper function, which returns a similarity percentage rather than a raw edit distance: 100 minus the edit distance expressed as a percentage of the longer word's length. For instance, a result of 60 means the edit distance is two-fifths of the longer word's length.
CREATE FUNCTION LEVENSHTEIN_RATIO (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, max_len INT;
-- CHAR_LENGTH counts characters, matching LEVENSHTEIN above (LENGTH counts bytes)
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2);
IF s1_len > s2_len THEN SET max_len = s1_len; ELSE SET max_len = s2_len; END IF;
RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
END
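To sanity-check the two stored functions outside the database, here is a minimal Python sketch of the same rolling-row calculation. The function names mirror the SQL ones, but this is an illustration rather than the article author's code. Returning 100 for two empty strings is an assumption of this sketch, and Python's `round` handles exact halves differently from MySQL's `ROUND`.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Edit distance, keeping only the previous row like the stored function."""
    if s1 == s2:
        return 0
    if not s1:
        return len(s2)
    if not s2:
        return len(s1)
    prev = list(range(len(s2) + 1))  # distances from the empty prefix of s1
    for i, ch1 in enumerate(s1, start=1):
        cur = [i]
        for j, ch2 in enumerate(s2, start=1):
            cost = 0 if ch1 == ch2 else 1
            cur.append(min(cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost,   # substitution or match
                           prev[j] + 1))         # deletion
        prev = cur
    return prev[-1]


def levenshtein_ratio(s1: str, s2: str) -> int:
    """Similarity percentage: 100 means identical, 0 means nothing in common."""
    max_len = max(len(s1), len(s2))
    if max_len == 0:
        return 100  # assumption: two empty strings count as identical
    return round((1 - levenshtein(s1, s2) / max_len) * 100)


if __name__ == "__main__":
    print(levenshtein("hyperpext", "hypertext"))
    print(levenshtein_ratio("hyperpext", "hypertext"))
```

For the misspelling from the question, the distance is 1 and the ratio is 89, so a threshold somewhere below that would accept "hypertext" as a suggestion.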

From the comments on http://dev.mysql.com/doc/refman/5.0/en/udf-compiling.html:
Download the package from the MySQL UDF repository
http://empyrean.lib.ndsu.nodak.edu/~nem/mysql/
wget http://empyrean.lib.ndsu.nodak.edu/~nem/mysql/udf/dludf.cgi?ckey=28
ll
tar -xzvf dludf.cgi\?ckey\=28
gcc -shared -o libmysqllevenshtein.so mysqllevenshtein.cc -I/usr/include/mysql/
mv libmysqllevenshtein.so /usr/lib
mysql -uroot -pPASS
mysql> use DATABASE
mysql> CREATE FUNCTION levenshtein RETURNS INT SONAME 'libmysqllevenshtein.so';
mysql> select levenshtein(w1.word,w2.word) as dist from word w1, word w2 where ETC........... order by dist asc limit 0,10;

I suggest that you generate typo variations on the query input.
e.g. hyperpext > { hyperpeext, hipertext, ... } etc.
One of these is bound to be the correct spelling (especially for common misspellings).
To identify the most likely match, look up each variation in an index that tells you the document frequency of the term, and prefer the most frequent one. (Make sense?)
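The variation approach can be sketched roughly as follows. The names `edits1` and `suggest` and the `frequency` dictionary are hypothetical stand-ins for illustration; in the real application the frequency lookup would presumably be a single query against the terms table rather than an in-memory dictionary.

```python
import string


def edits1(word: str) -> set:
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def suggest(word: str, frequency: dict):
    """Return the known term closest to word, preferring higher frequency."""
    if word in frequency:
        return word
    candidates = [w for w in edits1(word) if w in frequency]
    # Pick the variation that occurs most often; None if nothing matches.
    return max(candidates, key=frequency.get, default=None)


if __name__ == "__main__":
    # Hypothetical document frequencies for a tiny dictionary.
    frequency = {"hypertext": 120, "hyperlink": 80}
    print(suggest("hyperpext", frequency))
```

A nine-letter word produces a few hundred variations, so one `IN (...)` lookup per search is feasible; going to two edits multiplies that count considerably.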

Why not add a table column that stores each word in an alternate (e.g., Soundex) form? That way, if your first SELECT does not find an exact match, you can run a second search for matching alternate forms.
The trick is to encode each word so that misspelled variations end up converted into the same alternate form.
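To illustrate what such an alternate form looks like, here is a small Python sketch of the classic four-character Soundex code. It is only an illustration: MySQL has its own built-in SOUNDEX() function whose output can differ from this sketch, so the stored column and the search term should both be encoded by the same implementation.

```python
# Letter groups that share a Soundex digit; vowels, h, w and y have no digit.
CODES = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in group:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Classic Soundex: first letter plus three digits, padded with zeros."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    digits = []
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (word[0].upper() + "".join(digits) + "000")[:4]


if __name__ == "__main__":
    for term in ("hypertext", "hipertext", "hyperpext"):
        print(term, soundex(term))
```

Note the limitation this exposes: "hipertext" encodes to the same key as "hypertext", but "hyperpext" from the question does not, because p and t fall into different consonant groups. A phonetic column catches sound-alike errors, not arbitrary typos.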

Related

Timeout when I execute query from php code

I'm using PHP with Oracle 10 as the database. When I execute this query I always get a timeout. I've raised the timeout to 1000s, but the problem remains.
I've checked the Oracle logs but didn't find any information.
DECLARE RET NUMBER;
CHR VARCHAR2(80);
BEGIN
DBMS_PIPE.PURGE('SPAq3qefqefhd1f19b21c3a7gvt30');
DBMS_PIPE.PACK_MESSAGE('q3qefqefhd1f19b21c3a7gvt30;100;14;3345 0047 10/02/2023 S X2009292 ');
RET := DBMS_PIPE.SEND_MESSAGE('SPA');
IF RET = 0 THEN
RET := DBMS_PIPE.RECEIVE_MESSAGE('SPAq3qefqefhd1f19b21c3a7gvt30', '100');
IF RET = 0 THEN
DBMS_PIPE.UNPACK_MESSAGE(CHR);
:ret_string := CHR;
ELSIF RET = 1 THEN
:ret_string := 'KOTIMEOUT';
ELSE :ret_string := 'KOCOMMERROR';
END IF;
ELSIF RET = 1 THEN
:ret_string := 'KOTIMEOUTSEND';
ELSIF RET = 3 THEN
:ret_string := 'KOINTERRUPTSEND';
ELSE
:ret_string := 'KOERRORSEND';
END IF;
END;

Query where sum of specified columns (all ints) in row are less than a specified int

I'm writing a word search program.
My database is set up to MyISAM
with one table (Words) structured
WordID | String | A | B | ... | Z |
------------------------------------
int varchar int int ... int
where the values for columns A - Z are the number of occurrences of that letter in the string.
I want to write a query that finds all possible words made out of a specified (but dynamic, user-chosen) set of characters, including wildcard characters, e.g. "Bu!!er" should return but, butt, bull, etc.
Where
S is the set of characters specified that we can use
W is the set of characters in a word
I'll need to query the database for all strings where
the number of occurrences in the word of each specified character (not including "!") is at most the number of occurrences of that character in the specified string
W_k <= S_k where k is each character specified
AND
the occurrences of letters not specified in the specified string are, in sum, at most the total occurrences of the wildcard character ("!") in the specified string
SUM(W_q) <= S_! where q is each character not specified and S_! is the total number of occurrences of "!".
For the first part of the WHERE statement (W_k <= S_k)
For bu!!er the statement would be
`B` <= 1 AND `U` <= 1 AND `E` <= 1 AND `R` <= 1
And for the second part
`A` + `C` + `D` + ... + `Z` <= 2
The complete Where part of the query becomes
( ( `A` + (IF(`B`-1 < 0, 0, `B`-1)) + `C` + `D` + (IF(`E`-1 < 0, 0, `E`-1)) + `F` + `G` + `H` + `I` + `J` + `K` + `L` + `M` + `N` + `O` + `P` + `Q` + (IF(`R`-1 < 0, 0, `R`-1)) + `S` + `T` + (IF(`U`-1 < 0, 0, `U`-1)) + `V` + `W` + `X` + `Y` + `Z` ) <= 2 )
Is there a better way to do it than this?

`A` + `C` + `D` + ... + `Z`
Use denormalization? Store the full length in a separate column.
`TOTAL` <= 5
As a sidenote:
Your schema restricts the possible queries too much - though it's enough for this job. It might be better to keep all words in the memory (one per server instance) and do "full table scans" or "indexed scans" on the words.
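If the words are kept in memory as suggested, the matching rule from the question can be sketched like this. The name `can_form` and the word list are hypothetical; the logic is the same as the combined WHERE clause: the letters a word needs beyond what the rack offers must be covered by wildcards.

```python
from collections import Counter


def can_form(word: str, rack: str, wildcard: str = "!") -> bool:
    """True if word can be built from rack, spending wildcards on missing letters."""
    available = Counter(rack.lower())
    wildcards = available.pop(wildcard, 0)
    needed = Counter(word.lower())
    # Letters the word needs beyond what the rack provides.
    shortfall = sum(max(count - available[letter], 0)
                    for letter, count in needed.items())
    return shortfall <= wildcards


if __name__ == "__main__":
    words = ["but", "butt", "bull", "butter", "bubble"]
    print([w for w in words if can_form(w, "Bu!!er")])
```

A scan like this over a few hundred thousand words is a plain loop, and it avoids building a 26-term expression for every search.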

Store tree data structure in database

In the above tree each node has a name and value. Each node can have 6 children at max. How to store it in MySQL database to perform the below operations efficiently?
Operations
1) grandValue(node) - should give (sum of all of the descendants' values, including self)
Eg.,
grandValue(C) = 300
grandValue(I) = 950
grandValue(A) = 3100
2) children(node) - should give the list of all children (immediate descendants only)
Eg.,
children(C) = null
children(I) = L,M,N
children(A) = B,C,D,E
3) family(node) - should give the list of descendants
family(C) = null
family(I) = L,M,N
family(A) = B,C,D,E,F,G,H,I,J,K,L,M,N
4) parent(node) - should give the parent of the node
parent(C) = A
parent(I) = D
parent(A) = null
5) insert(parent, node, value) - should insert node as a child of parent
insert(C, X, 500) Insert a node name X with value 500 as C's child
I am thinking of using recursive methods for these operations, as we do with binary trees, but I am not sure that's the optimal way to do it. The tree may hold 10 to 30 million nodes and may be skewed, so loading the data onto the in-memory stack is my main concern.
Please help.
NOTE: I am using PHP, MySQL and Laravel on a VPS machine.
UPDATE: The tree will grow in size. New nodes will be added as children of leaf nodes or of nodes that have fewer than 6 children, never in between 2 existing nodes.

You could store the data in a table using nested sets.
http://en.wikipedia.org/wiki/Nested_set_model#Example
I worry that your millions of nodes may make life difficult if you intend to constantly add new items. Perhaps that concern could be mitigated by using rational numbers instead of integers as the left and right values. Add a column for depth to make it cheap to ask for immediate descendants. I wrote some SQL to create the table and the stored procedures you asked for. I did it in SQL Server, so the syntax might be slightly different in MySQL, but these are all standard SQL statements. I also decided the lower and upper bounds for each node by hand. You would have to write the code that gets these nodes inserted (and maintained) in your database.
CREATE TABLE Tree(
Node nvarchar(10) NOT NULL,
Value int NOT NULL,
L int NOT NULL,
R int NOT NULL,
Depth int NOT NULL
);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('A', 100, 1, 28, 0);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('B', 100, 2, 3, 1);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('C', 300, 4, 5, 1);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('D', 150, 6, 25, 1);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('E', 200, 26, 27, 1);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('F', 400, 7, 8, 2);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('G', 250, 9, 10, 2);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('H', 500, 11, 12, 2);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('I', 350, 13, 20, 2);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('J', 100, 21, 22, 2);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('K', 50, 23, 24, 2);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('L', 100, 14, 15, 3);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('M', 300, 16, 17, 3);
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES ('N', 200, 18, 19, 3);
CREATE PROCEDURE grandValue
@Node NVARCHAR(10)
AS
BEGIN
SET NOCOUNT ON;
DECLARE @lbound INT;
DECLARE @ubound INT;
SELECT @lbound = L, @ubound = R FROM Tree WHERE Node = @Node
SELECT SUM(Value) AS Total FROM Tree WHERE L >= @lbound AND R <= @ubound
RETURN
END;
EXECUTE grandValue 'C';
EXECUTE grandValue 'I';
EXECUTE grandValue 'A';
CREATE PROCEDURE children
@Node NVARCHAR(10)
AS
BEGIN
SET NOCOUNT ON;
DECLARE @lbound INT;
DECLARE @ubound INT;
DECLARE @depth INT;
SELECT @lbound = L, @ubound = R, @depth = Depth FROM Tree WHERE Node = @Node
SELECT Node FROM Tree WHERE L > @lbound AND R < @ubound AND Depth = (@depth + 1)
RETURN
END;
EXECUTE children 'C';
EXECUTE children 'I';
EXECUTE children 'A';
CREATE PROCEDURE family
@Node NVARCHAR(10)
AS
BEGIN
SET NOCOUNT ON;
DECLARE @lbound INT;
DECLARE @ubound INT;
SELECT @lbound = L, @ubound = R FROM Tree WHERE Node = @Node
SELECT Node FROM Tree WHERE L > @lbound AND R < @ubound
RETURN
END;
EXECUTE family 'C';
EXECUTE family 'I';
EXECUTE family 'A';
CREATE PROCEDURE parent
@Node NVARCHAR(10)
AS
BEGIN
SET NOCOUNT ON;
DECLARE @lbound INT;
DECLARE @ubound INT;
DECLARE @depth INT;
SELECT @lbound = L, @ubound = R, @depth = Depth FROM Tree WHERE Node = @Node
SELECT Node FROM Tree WHERE L < @lbound AND R > @ubound AND Depth = (@depth - 1)
RETURN
END;
EXECUTE parent 'C';
EXECUTE parent 'I';
EXECUTE parent 'A';
CREATE PROCEDURE ancestor
@Node NVARCHAR(10)
AS
BEGIN
SET NOCOUNT ON;
DECLARE @lbound INT;
DECLARE @ubound INT;
SELECT @lbound = L, @ubound = R FROM Tree WHERE Node = @Node
SELECT Node FROM Tree WHERE L < @lbound AND R > @ubound
RETURN
END;
EXECUTE ancestor 'C';
EXECUTE ancestor 'I';
EXECUTE ancestor 'A';
For creating the nested sets in the table in the first place you can run some code to generate the inserts or start with the first node and then successively add each additional node - although since each add potentially modifies many of the nodes in the set there can be a lot of thrashing of the database as you build this.
Here's a stored procedure for adding a node as a child of another node:
CREATE PROCEDURE insertNode
@ParentNode NVARCHAR(10), @NewNodeName NVARCHAR(10), @NewNodeValue INT
AS
BEGIN
SET NOCOUNT ON;
DECLARE @ubound INT;
DECLARE @depth INT;
SELECT @ubound = R, @depth = Depth FROM Tree WHERE Node = @ParentNode
UPDATE Tree SET L = L + 2 WHERE L >= @ubound
UPDATE Tree SET R = R + 2 WHERE R >= @ubound
INSERT INTO Tree (Node, Value, L, R, Depth) VALUES (@NewNodeName, @NewNodeValue, @ubound, @ubound + 1, @depth + 1);
RETURN
END;
I got this from http://www.evanpetersen.com/item/nested-sets.html, which also shows a nice graph-walking algorithm for creating the initial L and R values. You'd have to enhance it to keep track of the depth as well, but that would be easy.
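As a rough illustration of such a walk with depth tracking, here is a Python sketch. It is not the algorithm from the linked page: the function name and the parent-to-children dictionary are assumptions made for this example. It uses an explicit stack instead of recursion, which sidesteps the stack-depth worry for a skewed tree.

```python
def number_nested_set(children: dict, root: str) -> dict:
    """Assign L, R and Depth to every node with an iterative depth-first walk."""
    rows = {}
    counter = 1
    # Each stack entry: (node, depth, whether its subtree is already numbered).
    stack = [(root, 0, False)]
    while stack:
        node, depth, visited = stack.pop()
        if visited:
            rows[node]["R"] = counter  # leaving the node: assign the right bound
            counter += 1
            continue
        rows[node] = {"L": counter, "R": None, "Depth": depth}
        counter += 1
        stack.append((node, depth, True))
        # Push children in reverse so they are numbered left to right.
        for child in reversed(children.get(node, [])):
            stack.append((child, depth + 1, False))
    return rows


if __name__ == "__main__":
    # The example tree from the question, as parent -> children lists.
    tree = {"A": ["B", "C", "D", "E"],
            "D": ["F", "G", "H", "I", "J", "K"],
            "I": ["L", "M", "N"]}
    for name, row in number_nested_set(tree, "A").items():
        print(name, row)
```

For the example tree this yields A = (1, 28), D = (6, 25) and I = (13, 20), which can then be turned into INSERT statements.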

How do I rename duplicates in MySQL using PHP or just a MySQL

I'm trying to rename duplicates in a MySQL database. So far I'm using the code below, but it only appends 1 to the end of the name:
UPDATE phpfox_photo n
JOIN (SELECT title_url, MIN(photo_id) min_id FROM phpfox_photo GROUP BY title_url HAVING COUNT(*) > 1) d
ON n.title_url = d.title_url AND n.photo_id <> d.min_id
SET n.title_url = CONCAT(n.title_url, '1');
So if I have
Anna
Anna
Anna
the result is
Anna
Anna1
Anna11
With 200 Annas the result is Anna1111111111111111111111111111111111111111111....etc
How do I rename them with an incrementing number instead, like this?
Anna
Anna1
Anna2

If I didn't miss something, you can write a stored procedure that iterates through your rows with a cursor, as follows:
DELIMITER //
CREATE PROCEDURE rename_duplicates()
BEGIN
DECLARE done BOOLEAN DEFAULT FALSE;
DECLARE cur_id INT; -- assumes photo_id is an integer key
DECLARE cur_title VARCHAR(255);
DECLARE prev_title VARCHAR(255) DEFAULT NULL;
DECLARE suffix INT DEFAULT 0;
-- Every duplicate row except the one with the lowest photo_id per title
DECLARE ucur CURSOR FOR
SELECT n.photo_id, n.title_url
FROM phpfox_photo n
JOIN (SELECT title_url, MIN(photo_id) min_id
FROM phpfox_photo GROUP BY title_url HAVING COUNT(*) > 1) d
ON n.title_url = d.title_url AND n.photo_id <> d.min_id
ORDER BY n.title_url, n.photo_id;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
OPEN ucur;
uloop: LOOP
FETCH ucur INTO cur_id, cur_title;
IF done THEN
LEAVE uloop;
END IF;
-- Restart the numbering whenever a new title begins
IF prev_title IS NULL OR cur_title <> prev_title THEN
SET suffix = 1, prev_title = cur_title;
ELSE
SET suffix = suffix + 1;
END IF;
UPDATE phpfox_photo SET title_url = CONCAT(cur_title, suffix) WHERE photo_id = cur_id;
END LOOP uloop;
CLOSE ucur;
END //
DELIMITER ;

With User-Defined Variables
SET @counter:=0;
SET @title_url:='';
UPDATE phpfox_photo n
JOIN (SELECT title_url, MIN(photo_id) min_id
FROM phpfox_photo
GROUP BY title_url
HAVING COUNT(*) > 1) d
ON n.title_url = d.title_url AND n.photo_id <> d.min_id
SET n.title_url = IF(n.title_url <> @title_url, CONCAT(@title_url:=n.title_url, @counter:=1), CONCAT(n.title_url, @counter:=@counter+1));
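The numbering rule behind this statement (keep the row with the lowest id, then append 1, 2, 3, ... per title) can be sketched in Python as below. The function name is hypothetical, and the sketch assumes rows arrive as (photo_id, title_url) pairs. It does not check whether a generated name such as "Anna1" already exists in the table.

```python
def number_duplicates(rows):
    """Map photo_id -> new title, numbering repeats of each title from 1."""
    seen = {}      # title -> how many rows with that title were processed so far
    renamed = {}
    for photo_id, title in sorted(rows):   # lowest photo_id keeps the plain title
        count = seen.get(title, 0)
        renamed[photo_id] = title if count == 0 else f"{title}{count}"
        seen[title] = count + 1
    return renamed


if __name__ == "__main__":
    rows = [(3, "Anna"), (1, "Anna"), (2, "Anna"), (4, "Bob")]
    print(number_duplicates(rows))
```

Because the counter is kept per title, the result does not depend on rows with the same title being adjacent, which the single UPDATE statement has to rely on.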

Maybe you can use modulo to produce the numbering, like this (SQLite example, but it should be similar in MySQL):
SELECT *, (rowid % (SELECT COUNT(*) FROM table as t WHERE t.name = table.name ) ) FROM table ORDER BY name
All you need is to translate rowid (for example to the primary key, photo_id here) and the modulo function (% or MOD() in MySQL).
Then you can CONCAT results as you desire.

UPDATE phpfox_photo n
JOIN
(SELECT title_url,
MIN(photo_id) min_id
FROM phpfox_photo
GROUP BY title_url
HAVING COUNT(*) > 1
)
d
ON n.title_url = d.title_url
AND n.photo_id <> d.min_id
SET n.title_url =
CASE
WHEN <last char is int>
THEN <replace last char with incremented last char>
ELSE <string + 1>
END

How can I apply mathematical function to MySQL query?

I've got the following query to determine how many votes a story has received:
SELECT s_id, s_title, s_time, (s_time-now()) AS s_timediff,
(
(SELECT COUNT(*) FROM s_ups WHERE stories.q_id=s_ups.s_id) -
(SELECT COUNT(*) FROM s_downs WHERE stories.s_id=s_downs.s_id)
) AS votes
FROM stories
I'd like to apply the following mathematical function to it for upcoming stories (I think it's what reddit uses) -
http://redflavor.com/reddit.cf.algorithm.png
I can perform the function on the application side (which I'm doing now), but I can't sort it by the ranking which the function provides.
Any advice?

Try this:
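SELECT s_id, s_title, LOG10(z) + (y * s_timediff) / 45000 AS redditfunction
FROM (
-- y is the sign of the net votes, z their magnitude (at least 1)
SELECT s_id, s_title, s_timediff,
IF(x > 0, 1, IF(x < 0, -1, 0)) AS y,
IF(ABS(x) >= 1, ABS(x), 1) AS z
FROM (
-- Aliases cannot be reused within the same SELECT list, so x is computed one level down.
-- Correlated subqueries keep the two counts independent; two LEFT JOINs would multiply them.
SELECT stories.s_id, stories.s_title,
stories.s_time - NOW() AS s_timediff,
(SELECT COUNT(*) FROM s_ups WHERE stories.q_id = s_ups.s_id) -
(SELECT COUNT(*) FROM s_downs WHERE stories.s_id = s_downs.s_id) AS x
FROM stories
) AS vote_totals
) AS derived_table1
ORDER BY redditfunction DESC
Check whether this statement works with your dataset.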

y and z are the tricky ones. You want a specific return based on x's value. That sounds like a good reason to make a function.
http://dev.mysql.com/doc/refman/5.0/en/if-statement.html
You should make one function for y and one for z. Pass in x, and expect a number back.
DELIMITER //
CREATE FUNCTION y_element(x INT)
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE y INT;
-- y is the sign of x: 1, 0 or -1
IF x > 0 THEN SET y = 1;
ELSEIF x = 0 THEN SET y = 0;
ELSEIF x < 0 THEN SET y = -1;
END IF;
RETURN y;
END //
DELIMITER ;
That's y. I did it by hand without checking, so you may have to fix a few typos.
Do z the same way, and then you have all of the values for your final function.
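For reference, the arithmetic of the ranking can be sketched in Python as follows. This follows the formula as used in the answers here (log10 of z plus y times the time difference over 45000); the function name and the `seconds` parameter are assumptions for the example, and what the time difference should be measured from is left to the caller.

```python
import math


def hot_score(ups: int, downs: int, seconds: float) -> float:
    """Ranking value: log10(z) + y * seconds / 45000."""
    x = ups - downs            # net votes
    y = (x > 0) - (x < 0)      # sign of x: 1, 0 or -1
    z = max(abs(x), 1)         # magnitude, at least 1 so log10 is defined
    return math.log10(z) + y * seconds / 45000


if __name__ == "__main__":
    print(hot_score(100, 0, 45000))
    print(hot_score(0, 10, 45000))
```

Having the same calculation on the application side makes it easy to compare a few rows against what the SQL returns.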
