Search in MySQL with permutations


I need help.
I have a table with only two columns, ID and NAME, and this data:
ID | NAME
1 HOME
2 GAME
3 LINK
I want to show, for example, the row with the name HOME if the user searches for HOME, OMEH, EMOH, HMEO, etc., that is, any permutation of the word HOME.
I can't store all these permutations in MySQL and search those columns, because some words will be too long (9-10 characters), and that would take more than 40 MB for each 9-character word.

One way to solve this problem is to store the sorted set of characters of each name in an additional column, and then sort the string the user inputs before searching. For example, the database has:
ID NAME CHARS
1 HOME EHMO
2 GAME AEGM
3 LINK IKLN
Then when searching in PHP you would do this:
$search = 'MEHO'; // user input = MEHO
$chars = str_split($search);
sort($chars);
$search = implode('', $chars); // now contains EHMO
$sql = "SELECT ID, NAME FROM table1 WHERE CHARS = '$search'";
// perform query etc.
Output
ID NAME
1 HOME
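The sorting step above can be wrapped in a small helper and combined with a bound parameter instead of string interpolation. This is only a sketch: the function name letterSignature is made up for illustration, it assumes single-byte (ASCII) names, and the upper-casing is an added assumption so that home and HOME produce the same key.

```php
<?php
// Hypothetical helper: builds the order-independent key stored in the CHARS column.
// Assumes ASCII input; multibyte names would need a multibyte-aware split instead.
function letterSignature(string $word): string
{
    $chars = str_split(strtoupper($word)); // e.g. ['M', 'E', 'H', 'O']
    sort($chars, SORT_STRING);             // put the characters in byte order
    return implode('', $chars);
}

// The key can then be passed as a bound parameter, e.g. with PDO
// ($pdo and $userInput are assumed to exist in the calling code):
// $stmt = $pdo->prepare('SELECT ID, NAME FROM table1 WHERE CHARS = ?');
// $stmt->execute([letterSignature($userInput)]);
```

The same helper would be used when inserting or updating a row, so the stored CHARS value and the search key are always built the same way.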

This sounds like a "please do my homework for me" question. It is hard to conceive what real world problem this is applicable to and there is no standard solution. It is OK to ask for help with your homework here, but you should state that this is the case.
more than 40 MB for each 9 chars words
Your maths is a bit wonky, but indeed the storage does not scale well. OTOH leaving aside the amount of storage, in terms of the processing workload it does scale well as a solution.
You could simply brute-force a dynamic query:
function mkqry($word)
{
$qry="SELECT * FROM yourtable WHERE 1 ";
$last=strlen($word);
for ($x=0; $x<$last; $x++) {
$qry.=" AND word LIKE '%" . substr($word, $x, 1) . "%'";
}
return $qry;
}
However this will always result in a full table scan (slow) and won't correctly handle cases where a letter occurs twice in a word.
The solution is to use an indexing function which is independent of the order in which the characters appear - a non-cryptographic hash. An obvious candidate would be to XOR the characters together, although this only results in a one character identifier which is not very selective. So I would suggest simply adding the character codes:
function pos_ind_hash($word)
{
$sum=0;
$last=strlen($word);
for ($x=0; $x<$last; $x++) {
$sum+=ord(substr($word, $x, 1));
}
return $sum;
}
function mkqry($word)
{
$qry="SELECT * FROM yourtable WHERE 1 ";
$last=strlen($word);
for ($x=0; $x<$last; $x++) {
$qry.=" AND word LIKE '%" . substr($word, $x, 1) . "%'";
}
$qry.=" AND yourtable.hash=" . pos_ind_hash($word);
return $qry;
}
Note that the hash mechanism here does not uniquely identify a single word, but is specific enough to reduce the volume to the point where an index (on the hash) would be effective.
Multiplying rather than adding would create fewer collisions but at a greater risk of overflowing (which would create ambiguity between implementations).
But both the hash and the single-character LIKE only reduce the number of potential matches. To get the query to behave definitively, you need to go further. You could add an attribute to the table (and to the index with the hash) containing the string length. This would be more selective (i.e. improve the effectiveness of the index) but still not definitive.
For a definitive method you would need to specify in your query that the data does NOT contain characters which are NOT in the word you are looking for.
The wrong way to do that would be to add a loop specifying "AND NOT LIKE....".
A valid way of doing that would be to add a test to the query which takes the table attribute, removes every letter that appears in the word you are searching for, and checks that the result is a zero-length string.
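Since neither the hash nor the LIKE tests are definitive, another option (a swapped-in alternative, not the in-query test described above) is to verify the few candidate rows in PHP once the indexed hash has narrowed them down. A minimal sketch, assuming single-byte strings; the function name isPermutation is hypothetical:

```php
<?php
// Hypothetical post-filter for rows returned by the hash-based query.
function isPermutation(string $a, string $b): bool
{
    // count_chars(..., 1) returns [byte value => occurrences] for the bytes
    // that occur, so two strings are permutations of each other exactly
    // when these maps are equal. Repeated letters are handled correctly.
    return count_chars($a, 1) == count_chars($b, 1);
}
```

For instance, AD and BC share the same additive hash (65 + 68 = 66 + 67 = 133), which is why a final check of this kind is still needed.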

Related

Is it worth saving the keyword <-> link relation in a "hashtable"-like structure in MySQL?

I'm working on a PHP + MySQL application that will crawl an HDD or shared drive and index all files and directories into a database, to provide "fulltext" search over them. So far it's going well, but I'm stuck on the question of whether I chose a good way to store the data in the database.
In the picture below, you can see part of my database schema. The idea is that I save a domain (which represents the part of the disk I want to index), then there are links (which represent files and folders, with content, file path, etc.), and then I have a table to store the unique keywords that I find in file/folder names or content.
Finally, I have 16 linkkeyword tables to store the relations between links and keywords. I have 16 of them because I thought it might be good to make something like a hashtable, since I'm expecting a high number of link <-> keyword relations (so far, for 15k links and 400k keywords, I have about 2.5 million linkkeyword records). So, to avoid storing so much data in one table (and later searching over it), I thought this hashtable could be faster. It works like this: when I want to search for a word, I compute its md5 and look at the first character of the md5, which tells me which linkkeyword table to use. So there are only about 150-200k records in each linkkeyword table (versus 2.5 million).
So I'm curious whether this approach is of any use, or whether it would be better to store all linkkeyword information in a single table and let MySQL take care of it (and up to how many link <-> keyword relations would that work?).
So far this was a great solution for me, but I crashed hard when I tried to implement regular-expression search, so that the user can enter e.g. "tem*", which can match temp, temporary, temple, etc. In a normal word search, I compute the md5 hash and then I know which linkkeyword table to look in. But for a regular expression I need to get all keywords from the keyword table that match the expression and then process them one by one.
I'm also attaching the part of the code for normal keyword search:
private function searchKeywords($selectedDomains) {
$searchValues = $this->searchValue;
$this->resultData = array();
foreach (explode(" ", $searchValues) as $keywordName) {
$keywordName = strtolower($keywordName);
$keywordMd5 = md5($keywordName);
$selection = $this->database->table('link');
$results = $selection->where('domain.id', $selectedDomains)->where('domain.searchable = ?', '1')->where(':linkkeyword' . $keywordMd5[0] . '.keyword.keyword LIKE ?', $keywordName)
->select('link.*,:linkkeyword' . $keywordMd5[0] . '.weight,:linkkeyword' . $keywordMd5[0] . '.keyword.keyword');
foreach ($results as $result) {
$keyExists = array_key_exists($result->linkId, $this->resultData);
if ($keyExists) {
$this->resultData[$result->linkId]->updateWeight($result->weight);
$this->resultData[$result->linkId]->addKeyword($result->keyword);
} else {
$domain = $result->ref('domain');
$linkClass = new search\linkClass($result, $domain);
$linkClass->updateWeight($result->weight);
$linkClass->addKeyword($result->keyword);
$this->resultData[$result->linkId] = $linkClass;
}
}
}
}
and the regular expression search function:
private function searchRegexp($selectedDomains) {
//get stored search value
$searchValues = $this->searchValue;
//replace the asterisk and exclamation mark (the user-facing wildcard characters) with their MySQL LIKE equivalents
$searchValues = str_replace("*", "%", $searchValues);
$searchValues = str_replace("!", "_", $searchValues);
// empty result array to prevent previous results to interfere
$this->resultData = array();
//searched phrase can be multiple keywords, so split it by space and get results for each keyword
foreach (explode(" ", $searchValues) as $keywordName) {
//set default link result weight to -1 (default value)
$weight = -1;
//select all keywords, which match searched keyword (or its regular expression)
$keywords = $this->database->table('keyword')->where('keyword LIKE ?', $keywordName);
foreach ($keywords as $keyword) {
//count keyword md5 sum to determine which table should be use to match it links
$md5 = md5($keyword->keyword);
//get all link ids from linkkeyword relation table
$keywordJoinLink = $keyword->related('linkkeyword' . $md5[0])->where('link.domain.searchable','1');
//loop found links
foreach ($keywordJoinLink as $link) {
//store link weight, for later result sort
$weight = $link->weight;
//get link ID
$linkId = $link->linkId;
//check if link already exists in results, to prevent duplicity
$keyExists = array_key_exists($linkId, $this->resultData);
//if link already exists in result set, just update its weight and insert matching keyword for later keyword tag specification
if ($keyExists) {
$this->resultData[$linkId]->updateWeight($weight);
$this->resultData[$linkId]->addKeyword($keyword->keyword);
//if link isnt in result yet, insert it
} else {
//get link reference
$linkData = $link->ref('link', 'linkId');
//get information about domain, to which link belongs (location, flagPath,...)
$domainData = $linkData->ref('domain', 'domainId');
//if is domain searchable and was selected before search, add link to result set. Otherwise ignore it
if ($domainData->searchable == 1 && in_array($domainData->id, $selectedDomains)) {
//create new link instance
$linkClass = new search\linkClass($linkData, $domainData);
//insert matching keyword to links keyword set
$linkClass->addKeyword($keyword->keyword);
//set links weight
$linkClass->updateWeight($weight);
//insert link into result set
$this->resultData[$linkId] = $linkClass;
}
}
}
}
}
}

Your question is mostly one of opinion, so you may want to include the criteria that allow us to answer "worth it" more objectively.
It appears you've re-invented the concept of database sharding (though without distributing your data across multiple servers).
I assume you are trying to optimize search time; if that's the case, I'd suggest that 2.5 million records on modern hardware is not a particularly big performance challenge, as long as your queries can use an index. If you can't use an index (e.g. because you're doing a regular expression search), sharding will probably not help at all.
My general recommendation with database performance tuning is to start with the simplest possible relational solution, keep tuning that until it breaks your performance goals, then add more hardware, and only once you've done that should you go for "exotic" solutions like sharding.
This doesn't mean using prayer as a strategy. For performance-critical applications, I typically build a test database where I can experiment with solutions. In your case, I'd build a database with your schema without the "sharding" tables, and then populate it with test data (either write your own population routines, or use a tool like DBMonster). Typically, I'd go for at least double the size I expect in production. You can then run and tune queries to prove, one way or another, whether your schema is good enough. It sounds like a lot of work, but it's much less work than your sharding solution is likely to bring along.
There are (as @danFromGermany comments) solutions that are optimized for text search, and you could use MySQL fulltext search features rather than regular expressions.
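On the wildcard search from the question: a pattern with a fixed prefix such as tem* translates to LIKE 'tem%', which MySQL can generally serve from an index on the keyword column because the wildcard is not at the start. The sketch below shows one way to do that translation while escaping any literal % and _ in the user's input. The function name wildcardToLike is made up for illustration; the * and ! conventions are taken from the question's code.

```php
<?php
// Hypothetical helper: turns the user-facing pattern into a LIKE pattern.
function wildcardToLike(string $pattern): string
{
    // Escape the backslash first, then LIKE's own wildcards, so characters
    // typed literally by the user are not treated as wildcards.
    $escaped = str_replace(['\\', '%', '_'], ['\\\\', '\\%', '\\_'], $pattern);
    // Map the user-facing wildcards: * = any run of characters, ! = one character.
    return str_replace(['*', '!'], ['%', '_'], $escaped);
}
```

The result is meant to be passed as a bound parameter to a keyword LIKE ? condition, as the question's code already does.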

elastica scoring based on regular expression using mvel

I am new to Elasticsearch, and here is the scenario I am trying to solve.
I have a search input box that supports autosuggestion logic.
The results are fetched from an elastic index which uses ngram filter.
What I want to improve is to introduce a scoring capability so as to order the results from the most important to the less important one (depending on the score).
The score must be based on the following cases:
If there is a match that starts with the given string, set score 100
If there is a match that contains the given string and does not start with it, set score to 10
For this purpose an elastica script was implemented with mvel statements in order to support regular expression match. In other words, it checks to see if the value on the left matches the regular expression on the right (only then a variable is incremented accordingly). But unfortunately it goes wrong when search string is language specific despite the fact that the value on the left is of the specified language too. Another problem to deal with is the second case I mention above (cannot make it to work).
The script works just fine when a value of the name field ('one example') starts with the given word ('one'):
$testParam = mb_strtolower('one', 'utf-8');
$regexStart = '^' . $testParam . '.*$';
$ElasticaScript = new Elastica_Script(" total = 1; if(doc['name'].value ~= '{$regexStart}'){ total += 100; } return total; ");
The script does not work when a value of the name field ('one example') merely contains the given word ('example'). The total score remains 1 and does not increase to 11 as it should:
$testParam = mb_strtolower('example', 'utf-8');
$regexStart = '^.*' . $testParam . '.*$';
$ElasticaScript = new Elastica_Script(" total = 1; if(doc['name'].value ~= '{$regexStart}'){ total += 10; } return total; ");
Lastly, with the same logic, when I try to match a Greek word against a value of the name field that contains Greek letters, the increment of the total score is ignored as well.
All the work has been done using Elastica, and therefore PHP.
Could you please help to solve my problem ?
If there is another approach/solution, feel free to share it with me.
Thank you in advance

doc['name'].value loads the analyzed version of the field. Unless your field is set to not analyzed, this will likely be very different than the original content of the field, and not useful for doing regex matches. The Elasticsearch docs on script fields say this only makes sense for non-analyzed or single term fields. For example, if your content is indexed as ngrams, this value will consist of ngrams.
You can access the original text of the field using _source.field_name, and then compute your score based on that. You can still do your search as usual against the ngrams, and use the _source just for scoring.
Here's a sample function_score query that defaults the score to _score, adds 100 if the name field starts with one, else adds 10 if the name field contains one anywhere else. It uses _source.name to access the contents of the name field, so it's doing the regex against the original text of the name field, not the ngrams calculated from the name field.
{
"query": {
"function_score": {
"boost_mode": "replace",
"script_score": {
"script": "total = _score; if (_source.name ~= '^one.*') { total += 100 } else if (_source.name ~= '.*?one.*?') { total += 10 } return total"
}
}
}
}
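To make the intended scoring rule concrete, and to have something to compare the script's results against, the same rule can be written in plain PHP. This is only an illustration of the rule, not Elasticsearch code: the function name prefixContainsScore is hypothetical, the base score of 1 and the bonuses of 100 and 10 come from the question, and it assumes the mbstring extension so that Greek text is compared by character rather than by byte.

```php
<?php
// Hypothetical reference implementation of the question's scoring rule.
function prefixContainsScore(string $name, string $term): int
{
    // Lower-case both sides with an explicit encoding, as the question does.
    $name = mb_strtolower($name, 'UTF-8');
    $term = mb_strtolower($term, 'UTF-8');

    $total = 1;
    $pos = mb_strpos($name, $term, 0, 'UTF-8');
    if ($pos === 0) {
        $total += 100;  // the name starts with the term
    } elseif ($pos !== false) {
        $total += 10;   // the term occurs later in the name
    }
    return $total;
}
```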

How to search partial/masked strings?

I am storing social security numbers in the database, but instead of storing whole numbers, I store only a 5-digit sequence. So, if the SSN is 123-12-1234, my database would store it as #23121### or ####21234 or anything else, as long as it has 5 digits in a row.
Therefore, when user enters whole SSN, I want the database to locate all matches.
So, I can do this :
SELECT * FROM user WHERE ssn like 123121234
But the query above would not work, since I have some masked characters in the SSN field (#23121###). Is there a good way of doing this?
Maybe a good way would be to use
SELECT * FROM user WHERE REPLACE (ssn, '#', '') like 123121234
Although there could be an issue - the query might return non-relevant matches since 5 numbers that I store in the DB could be anywhere in a sequence.
Any idea how to do a better search?

If the numbers are always in a sequential block, you can generate a very efficient query by just generating the 5 variations of the ssn that could be stored in the DB and search for all of them with an exact match. This query can also use indexes to speed things up.
SELECT *
FROM user
WHERE ssn IN ('12312####',
'#23121###',
'##31212##',
'###12123#',
'####21234');

I think you can do something like this:
Extract all possible 5-char combinations out of the queried SSN.
Make an IN() query on those numbers. I'm not sure though how many results you would get from this.
$n = '123121234'; // keep the SSN as a string so strlen() and substr() work on digits
$sequences = array();
for($i = 0; $i + 5 <= strlen($n); $i++) {
$sequences[] = substr($n, $i, 5);
}
var_dump($sequences);
Tell me if you need those hash sign surrounding the strings.
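If the masks are needed, the candidates for the IN() list can be generated rather than typed by hand. A minimal sketch, assuming a 5-digit window and # as the mask character; the function name maskedVariants is invented for illustration:

```php
<?php
// Hypothetical helper: returns every masked form a stored SSN could take.
function maskedVariants(string $ssn, int $window = 5, string $mask = '#'): array
{
    $digits = preg_replace('/\D/', '', $ssn); // drop dashes and other non-digits
    $len = strlen($digits);
    $variants = [];
    for ($i = 0; $i + $window <= $len; $i++) {
        // Mask everything before and after the visible window.
        $variants[] = str_repeat($mask, $i)
            . substr($digits, $i, $window)
            . str_repeat($mask, $len - $i - $window);
    }
    return $variants;
}

// One placeholder per variant keeps the IN() list parameterised:
$variants = maskedVariants('123-12-1234');
$placeholders = implode(', ', array_fill(0, count($variants), '?'));
$sql = "SELECT * FROM user WHERE ssn IN ($placeholders)";
```

The variants array is then bound to the placeholders when the statement is executed.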

PHP mysql search queries

I'm trying to create a search engine for an inventory based site. The issue is that I have information inside bbtags (like in [b]test[/b] sentence, the test should be valued at 3, whereas sentence should be valued at 1).
Here is an example of an index:
My test sentence, my my (has a SKU of TST-DFS)
The Database:
|Product| word |relevancy|
| 1 | my | 3 |
| 1 | test | 1 |
| 1 |sentence| 1 |
| 1 | TST-DFS| 10 |
But how would I match TST-DFS if the user typed in TST DFS? I would like that SKU to have a relevancy of say 8, instead of the full 10..
I have heard that the FULL TEXT search feature in MySQL would help, but I can't seem to find a good way to do it. I would like to avoid things like UNIONS, and to keep the query as optimized as possible.
Any help with coming up with a good system for this would be great.
Thanks,
Max

But how would I match TST-DFS if the user typed in TST DFS?
I would like that SKU to have a relevancy of say 8, instead of the full 10..
If I got the question right, the answer is actually easy.
Well, if you forge your query a little before sending it to mysql.
Ok, let's say we have $query and it contains TST-DFS.
Are we gonna focus on word spans?
I suppose we should, as most search engines do, so:
$ok=preg_match_all('#\w+#',$query,$m);
Now if that pattern matched... $m[0] contains the list of words in $query.
This can be fine-tuned to your SKU, but matching against full words in an AND fashion is pretty much what the user presumes is happening (as it does on Google and Yahoo).
Then we need to cook a $expr expression that will be injected into our final query.
if(!$ok) { // the search string is non-alphanumeric
$expr="false";
} else { // the search contains words, which are now in $m[0]
$expr='';
foreach($m[0] as $word) {
if($expr)
$expr.=" AND "; // put an AND inbetween "LIKE" subexpressions
$s_word=addslashes($word); // I put a s_ to remind me the variable
// is safe to include in a SQL statement, that's me
$expr.="word LIKE '%$s_word%'";
}
}
Now $expr should look like "word LIKE '%TST%' AND word LIKE '%DFS%'"
With that value, we can build the final query:
$s_expr="($expr)";
$s_query=addslashes($query);
$s_fullquery=
"SELECT Product, word, IF(word LIKE '$s_query', relevancy, relevancy-2) AS relevancy ".
"FROM some_index ".
"WHERE word LIKE '$s_query' OR $s_expr";
Which shall read, for "TST-DFS":
SELECT Product, word, IF(word LIKE 'TST-DFS', relevancy, relevancy-2) AS relevancy
FROM some_index
WHERE word LIKE 'TST-DFS' OR (word LIKE '%TST%' AND word LIKE '%DFS%')
As you can see, in the first SELECT line, if the match is partial, mysql will return relevancy-2
In the third one, the WHERE clause, if the full match fails, $s_expr, the partial match query we cooked in advance, is tried instead.
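The expression-building step above can also be written with placeholders, so the words travel as bound parameters rather than through addslashes. This is a sketch under assumptions: the function name buildWordFilter is hypothetical, and the column name word follows the example index.

```php
<?php
// Hypothetical variant of the $expr builder that returns SQL plus parameters.
function buildWordFilter(string $query): array
{
    // Same word extraction as above: runs of letters, digits and underscores.
    if (!preg_match_all('#\w+#', $query, $m)) {
        return ['FALSE', []]; // nothing searchable in the input
    }
    $clauses = [];
    $params = [];
    foreach ($m[0] as $word) {
        $clauses[] = 'word LIKE ?';       // one condition per word
        $params[] = '%' . $word . '%';    // bound value, never interpolated
    }
    return ['(' . implode(' AND ', $clauses) . ')', $params];
}
```

The returned SQL fragment takes the place of $s_expr in the WHERE clause, and the parameters are bound when the statement is executed.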

I like to lower case everything and strip out special characters (like in a phone number or credit card I take everything out on both sides that isn't a number)

Rather than try to create your own FTS solution, you could try to fit the MySQL FTS engine to your requirements. What I've seen done is create a new table to store your FTS data. Create a column for each different piece of data that you want to have a different relevance. For your sku field you could store the raw sku, with spaces, underscores, hyphens and any other special character intact. Then store a stripped down version with all these things removed. You may also want to store a version with leading zeros removed, as people often leave things like that out. You can store all these variations in the same column. Store your product name in another column, and the product description in another column. Create a separate index on each column. Then when you do your search, you can search each column individually, and multiply the rank of the results based on how important you think that column is. So you could multiply sku results by 10, title by 5 and leave description results as is. You may have to do a little experimentation to get the results you want, but it may ultimately be simpler than creating your own index.

Create a keywords table. Something along the lines of:
integer keywordId (autoincrement) | varchar keyword | int pointValue
Assign all possible keywords, skus, etc, into this table. Create another table, a post-keywords bridge, (assuming postId is the id you've assigned in your original table) along the lines of:
integer keywordId | integer postId
Once you have this, you can easily add keywords to each post as it is inserted. To calculate the total point value for a given post, a query such as the following should do the trick:
SELECT sum(pointValue) FROM keywordPostsBridge kpb
JOIN keywords k ON k.keywordId = kpb.keywordId
WHERE kpb.postId = YOUR_INTENDED_POST

I think the solution is quite straightforward unless I missed something.
Basically, run two searches: one is an exact match, the other is a LIKE match or regex match.
Join the two result sets together, with the LIKE match left-joined to the exact match. Then, for example:
final_relevancy = (IFNULL(like_relevancy, 0) + IFNULL(exact_relevancy, 0) * 3) / 4
I didn't try this myself though. Just an idea.
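As a quick check of how that weighting behaves, the formula can be mirrored in PHP. The function name finalRelevancy is invented for illustration, and null stands in for a row that is missing from one of the two result sets (the IFNULL case).

```php
<?php
// Hypothetical PHP mirror of the final_relevancy formula above.
function finalRelevancy(?float $likeRelevancy, ?float $exactRelevancy): float
{
    // A missing match counts as 0; the exact match is weighted three times.
    return (($likeRelevancy ?? 0.0) + ($exactRelevancy ?? 0.0) * 3) / 4;
}
```

With these weights, a row found only by the LIKE search with relevancy 8 ends up at 2, while a row found by both searches with relevancies 8 and 10 ends up at 9.5.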

I would add a column that is stripped of all special characters and misspellings, and then upcased (or create a function that compares on text that has been stripped and upcased). That way your relevancy will be consistent.

/*
q and q1 - your table
this query takes too many resources;
turn it into an update query (a scheduled task, or call it on save if you are developing a new system)
*/
SELECT
CASE
WHEN word NOT REGEXP "^[a-zA-Z]+$"
/* chain many REPLACE calls for the junk characters,
or create a custom function,
or, if you have full DB access, install this: https://launchpad.net/mysql-udf-regexp
*/
THEN REPLACE(REPLACE( word, '-', ' ' ), '#', ' ')
ELSE word
END word ,
CASE
WHEN word NOT REGEXP "^[a-zA-Z]+$"
THEN 8
ELSE relevancy
END relevancy
FROM ( SELECT 'my' word,
3 relevancy
UNION
SELECT 'test' word,
1 relevancy
UNION
SELECT 'sentence' word,
1 relevancy
UNION
SELECT 'TST-DFS' word,
10 relevancy
)
q
UNION
SELECT *
FROM ( SELECT 'my' word,
3 relevancy
UNION
SELECT 'test' word,
1 relevancy
UNION
SELECT 'sentence' word,
1 relevancy
UNION
SELECT 'TST-DFS' word,
10 relevancy
)
q1

This is the code for a page that shows the query result.
I did not use functions here, although using them would make the work easier.
<html>
<head>
</head>
<body>
<?php
//author S_A_KHAN
//date 10/02/2013
$dbcoonect=mysql_connect("127.0.0.1","root");
if (!$dbcoonect)
{
die ('unable to connect: '.mysql_error());
}
else
{
echo "connection successfully <br>";
}
$data_base=mysql_select_db("connect",$dbcoonect);
if ($data_base==FALSE){
die ('unable to select database: '.mysql_error($dbcoonect));
}
else
{
echo "connection successfully done<br>";
// cast the search value to an integer so it cannot alter the query
$SQLString = "select * from user where id = " . (int) $_GET["search"];
$QueryResult = mysql_query($SQLString, $dbcoonect);
echo "<table width='100%' border='1'>\n";
echo "<tr><th bgcolor=gray>Id</th><th bgcolor=gray>Name</th></tr>\n";
while (($Row = mysql_fetch_row($QueryResult)) !== FALSE) {
echo "<tr><td bgcolor=tan>{$Row[0]}</td>";
echo "<td bgcolor=tan>{$Row[1]}</td></tr>";
}
}
?>
</body>
</html>

While loop for mysql database with php?

I am developing a mysql database.
I "need" a unique id for each user but it must not auto increment! It is vital it is not auto increment.
So I was thinking of inserting a random number something like mt_rand(5000, 1000000) into my mysql table when a user signs up for my web site to be. This is where I am stuck?!
The id is a unique key on my MySQL table, specific to each user, and I cannot 100% guarantee that inserting mt_rand(5000, 1000000) as the user id will not inadvertently clash with another user's id.
Is there a way to use mt_rand(5000, 1000000) and scan the MySQL database, so that if the number is unique it is inserted as the user's new ID, and if it is not (somebody already has that id) a new id is generated until a unique one is found and inserted into the database?
I know this is possible, I have seen it many times. I have tried with while loops and all sorts, so this place is my last resort.
Thanks

You're better off using this: http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functions.html#function_uuid
Or using this: http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html
But if you actually want to do what you are saying, you can just do something like:
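do {
// candidate id in the range from the question
$x = mt_rand(5000, 1000000);
$result = mysql_query("SELECT COUNT(*) FROM `table` WHERE id = $x");
$count = (int) mysql_result($result, 0);
} while ($count != 0);
// $x is now a value that's not in the db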

You could use a guid. That's what I've seen done when you can't use an auto number.
http://php.net/manual/en/function.com-create-guid.php

Doesn't this function do what you want (without verification): http://www.php.net/manual/en/function.uniqid.php?

I think you need to approach the problem from a different direction, specifically why a sequence of incrementing numbers is not desired.
If it needs to be an 'opaque' identifier, you can do something like start with a simple incrementing number and then add something around it to make it look like it's not, such as three random digits on the end. You could go further than that and put some generated letters in front (either random or based on some other algorithm, such as the day of the month they first registered, or which server they hit), then use a simple checksumming algorithm to make another letter for the end. Now someone can't easily guess an ID, and you have a way of rejecting one sort of invalid ID before it hits the database. You will need to store the additional data around the ID somewhere, too.
If it needs to be a number that is random and unique, then you need to check the database with the generated ID before you tell the new user. This is where you will run into problems of scale: with too small a number space you will get too many collisions before the check lucks upon an unallocated one. If that is likely, then you will need to divide your ID generation into two parts: the first part is used to find all IDs with that prefix, then you can generate a new one that doesn't exist in the set you got from the DB.
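The check-then-retry idea can be sketched independently of the database by passing the existence check in as a callback. Everything named here is an assumption for illustration: the function generateUniqueId, the $isTaken callback (which in practice would run a SELECT against the users table), and the cap of 100 attempts.

```php
<?php
// Hypothetical generator: retries until $isTaken reports the candidate as free.
function generateUniqueId(callable $isTaken, int $min = 5000, int $max = 1000000, int $maxAttempts = 100): int
{
    for ($attempt = 0; $attempt < $maxAttempts; $attempt++) {
        $candidate = random_int($min, $max);
        if (!$isTaken($candidate)) {
            return $candidate; // not in use according to the check
        }
    }
    // Too many collisions suggests the number space is too crowded.
    throw new RuntimeException('No free ID found within the attempt limit.');
}
```

The unique key on the column remains the real guarantee: two sign-ups can pass the check at the same moment, so the INSERT should still be prepared to fail on a duplicate and retry.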

Random string generation... letters, numbers, there are 218 340 105 584 896 combinations for 8 chars.
function randr($j = 8){
$string = "";
for($i=0;$i < $j;$i++){
// no reseeding here: mt_rand() seeds itself, and reseeding on every pass only weakens the randomness
$x = mt_rand(0,2);
switch($x){
case 0:$string.= chr(mt_rand(97,122));break;
case 1:$string.= chr(mt_rand(65,90));break;
case 2:$string.= chr(mt_rand(48,57));break;
}
}
return $string;
}
Loop...
do{
$id = randr();
$sql = mysql_query("SELECT COUNT(0) FROM table WHERE id = '$id'");
$sql = mysql_fetch_array($sql);
$count = $sql[0];
}while($count != 0);

For starters I always prefer to do all the randomization in php.
function gencode($link){
do {
$tempid=mt_rand(5000, 1000000);
$check=mysql_fetch_assoc(mysql_query("SELECT id FROM users WHERE id = $tempid",$link));
} while($check); // retry while the id is already taken
$reg=mysql_query("INSERT INTO users (id) VALUES ($tempid)",$link);
// of course you can check $reg to see whether the insert succeeded
return $tempid;
}
