I have 3 distinct lists of strings. The first contains names of people (from 10 to 80 characters long). The second contains room numbers (903, 231 and so on). The last contains group numbers (ABCD-1312, CXVZ-123).
I have a query entered by a user. First I tried searching by Levenshtein distance, but it didn't work: whenever the user types 3 characters, it returns some room number, even though the query contains no digits. Then I tried similar_text(), which worked better, but because the names all have different lengths, it mostly returns the shorter names.
The best I have come up with so far is using similar_text() together with str_pad() to make every string the same length. It still doesn't work properly.
I want to somehow give extra weight to strings that have several matching characters in a row, or that start with the same letter as the query, and so on.
$search_min_heap = new SearchMinHeap();
$query = strtolower($query); // similar_text is case sensitive, so make everything lowercase
foreach ($res["result"] as &$item) {
    similar_text($query, str_pad(strtolower($item["name_en"]), 100, " "), $cur_distance_en);
    similar_text($query, str_pad(strtolower($item["name_ru"]), 100, " "), $cur_distance_ru);
    similar_text($query, str_pad(strtolower($item["name_kk"]), 100, " "), $cur_distance_kk);
    $cur_max_distance = max($cur_distance_en, $cur_distance_ru, $cur_distance_kk);
    $item["matching"] = $cur_max_distance;
    $search_min_heap->insert($item);
}
$first_elements = $search_min_heap->getFirstElements($count);
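One way the weighting described in the question might be sketched is shown below. It is only an illustration, not a tested ranking method: the function name weighted_score and the bonus values 1.0 and 0.5 are assumptions. The similar_text() count is divided by the query length rather than the candidate length, so long names are not penalised, and extra weight is added when the whole query appears as one run or when the first letters match. The functions used here are byte-based, so the Russian and Kazakh names would need multibyte-aware handling (e.g. mb_strtolower()).

```php
<?php
// Hypothetical scoring helper; the bonus values are arbitrary assumptions.
function weighted_score(string $query, string $candidate): float
{
    $q = strtolower($query);
    $c = strtolower($candidate);
    if ($q === '' || $c === '') {
        return 0.0;
    }
    // Number of matching characters, relative to the query length only.
    $score = similar_text($q, $c) / strlen($q);
    // Extra weight when the whole query occurs as one consecutive run.
    if (strpos($c, $q) !== false) {
        $score += 1.0;
    }
    // Extra weight when query and candidate start with the same letter.
    if ($q[0] === $c[0]) {
        $score += 0.5;
    }
    return $score;
}
```

The result could be stored in $item["matching"] in place of the similar_text() percentage.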
Related
I need help.
I have a table with only two columns, ID and NAME, and this data:
ID | NAME
1 HOME
2 GAME
3 LINK
I want to show, for example, the row with the name HOME if the user searches for HOME, OMEH, EMOH, HMEO, etc., i.e. any permutation of the word HOME.
I can't save all these permutations to MySQL and search that column, because some words are too long (9-10 chars) and would take more than 40 MB for each 9-character word.
One way to solve this problem is to store the sorted set of characters of each name in your database as an additional column, and then sort the string the user inputs before searching. For example, the database has:
ID NAME CHARS
1 HOME EHMO
2 GAME AEGM
3 LINK IKLN
Then when searching in PHP you would do this:
$search = 'MEHO'; // user input = MEHO
$chars = str_split($search);
sort($chars);
$search = implode('', $chars); // now contains EHMO
$sql = "SELECT ID, NAME FROM table1 WHERE CHARS = '$search'";
// perform query etc.
Output
ID NAME
1 HOME
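The sorting step above can be wrapped in a small helper so that the same key is computed when rows are inserted and when the user searches. This is a minimal sketch: the function name sorted_letters_key is an assumption, and upper-casing the input assumes the CHARS column is stored in upper case as in the example table.

```php
<?php
// Hypothetical helper: order-independent key for a word, e.g. MEHO -> EHMO.
function sorted_letters_key(string $word): string
{
    $chars = str_split(strtoupper($word)); // split into single characters
    sort($chars);                          // alphabetical order
    return implode('', $chars);
}

// Usage with a prepared statement (assumes $pdo is an existing PDO connection):
// $stmt = $pdo->prepare('SELECT ID, NAME FROM table1 WHERE CHARS = ?');
// $stmt->execute([sorted_letters_key($search)]);
```

Binding the key as a parameter also avoids placing user input directly in the SQL string.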
This sounds like a "please do my homework for me" question. It is hard to conceive what real world problem this is applicable to and there is no standard solution. It is OK to ask for help with your homework here, but you should state that this is the case.
more than 40 MB for each 9 chars words
Your maths is a bit wonky, but indeed the storage does not scale well. OTOH leaving aside the amount of storage, in terms of the processing workload it does scale well as a solution.
You could simply brute-force a dynamic query:
function mkqry($word)
{
    $qry="SELECT * FROM yourtable WHERE 1 ";
    $last=strlen($word);
    for ($x=0; $x<$last; $x++) {
        $qry.=" AND word LIKE '%" . substr($word, $x, 1) . "%'";
    }
    return $qry;
}
However this will always result in a full table scan (slow) and won't correctly handle cases where a letter occurs twice in a word.
The solution is to use an indexing function which is independent of the order in which the characters appear - a non-cryptographic hash. An obvious candidate would be to XOR the characters together, although this only results in a one character identifier which is not very selective. So I would suggest simply adding the character codes:
function pos_ind_hash($word)
{
    $sum=0;
    $last=strlen($word);
    for ($x=0; $x<$last; $x++) {
        // add the character code of the single character at position $x
        $sum+=ord(substr($word, $x, 1));
    }
    return $sum;
}
function mkqry($word)
{
    $qry="SELECT * FROM yourtable WHERE 1 ";
    $last=strlen($word);
    for ($x=0; $x<$last; $x++) {
        $qry.=" AND word LIKE '%" . substr($word, $x, 1) . "%'";
    }
    $qry.=" AND yourtable.hash=" . pos_ind_hash($word);
    return $qry;
}
Note that the hash mechanism here does not uniquely identify a single word, but is specific enough to reduce the volume to the point where an index (on the hash) would be effective.
Multiplying rather than adding would create fewer collisions but at a greater risk of overflowing (which would create ambiguity between implementations).
But both the hash and the single-character LIKE only reduce the number of potential matches. To get the query to behave definitively, you need to go further. You could add an attribute containing the string length to the table (and to the index with the hash). This would be more selective (i.e. improve the effectiveness of the index) but still not definitive.
For a definitive method you would need to specify in your query that the data does NOT contain characters which are NOT in the word you are looking for.
The wrong way to do that would be to add a loop specifying "AND NOT LIKE....".
A valid way of doing that would be to add a test to the query which removes from the table attribute every letter that appears in the word you are searching for, and checks that the result is a zero-length string.
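The test described above might be built as a nested REPLACE() expression, as in the sketch below. The function name mk_only_letters_clause is an assumption, and the input is restricted to letters so it can be embedded in the SQL text safely. On its own this clause does not catch repeated letters (HHOM would pass for HOME), so it is meant to be combined with the length and hash conditions discussed earlier.

```php
<?php
// Hypothetical helper: builds a condition that is true only when $column
// contains no characters other than the letters of $word.
function mk_only_letters_clause(string $column, string $word): string
{
    if (!ctype_alpha($word)) {
        throw new InvalidArgumentException('word must contain letters only');
    }
    $expr = $column;
    // Wrap the column in one REPLACE() per distinct letter of the word.
    foreach (array_unique(str_split($word)) as $ch) {
        $expr = "REPLACE($expr, '$ch', '')";
    }
    return "CHAR_LENGTH($expr) = 0";
}
```

The returned string could be appended to the query from mkqry() with " AND ".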
We have a system that creates a 5-character alphanumeric string of numbers and letters. Originally, I had the full alphabet and 0-9, so something like the following was possible:
0O1I0
Because different fonts may be used on different systems, there was confusion between the o's and i's so I updated the function to only include the numbers. Because there are historical items with the "o" and "i" items I have been asked to modify our search to automatically look for a zero if an o is entered and a 1 if an i is entered (or vice versa).
These are 5 digit ids with 2 possible values for the specific character. I'm thinking I could loop over the value with PHP prior to writing the query to build a list of options and then check if "IN (list of items)" in my query. I don't know if there's something built in that I'm missing though in MySQL like..
WHERE ID = o/0, i/1, etc.
So how about parsing the id in PHP, replacing every occurrence of 0 or O with the regex string [o0], and similarly replacing i and 1 with [i1]?
Then you could use this string in your query like this
WHERE id REGEXP '...[i1]...[o0]...'
The php code could look like this
$id = '0O1I0';
$id = preg_replace('/[i1]/i', '[i1]', $id);
$id = preg_replace('/[o0]/i', '[o0]', $id);
echo $id; // [o0][o0][i1][i1][o0]
...
mysqli_query($conn, "SELECT ... WHERE id REGEXP '$id'");
What about something like
select <stuff> from <table>
where replace(replace(upper(id), 'I', '1'), 'O', '0') like '%<number-search-term>%'
EDIT (more detail)
replace() in mysql takes three arguments: the original term, what to look for, and what to swap it with. In the where clause I did a nested replace. The inner one replaced any instances of I with 1 and the outer one took the inner replace as its argument (so with all Is as 1s) and replaced any Os with 0s. This is then compared against the number search term (I used a like statement).
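For the comparison to line up, the search term needs the same substitution as the column. A minimal sketch of doing that in PHP before binding the value follows; the function name normalize_id is an assumption.

```php
<?php
// Hypothetical helper: maps I -> 1 and O -> 0 after upper-casing, mirroring
// replace(replace(upper(id), 'I', '1'), 'O', '0') on the SQL side.
function normalize_id(string $id): string
{
    return strtr(strtoupper($id), 'IO', '10');
}
```

With both sides normalized this way, an exact comparison (=) could be used instead of LIKE if only whole-id matches are wanted.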
I am running the following SQL statement from a PHP script:
SELECT PHONE, COALESCE(PREFERREDNAME, POPULARNAME) FROM distilled_contacts WHERE PHONE LIKE :phone LIMIT 6
As is obvious, the statement returns the first 6 matches from the table in question. The value I'm binding to the :phone variable looks something like this:
$search = '%'.$search.'%';
Where, $search could be any string of numerals. The wildcard characters ensure that a search on, say 918, would return every record where the PHONE field contains 918:
9180078961
9879189872
0098976918
918
...
My problem is what happens if there does exist an entry with the value that matches the search string exactly, in this case 918 (the 4th item in the list above). Since there's a LIMIT 6, only the first 6 entries would be retrieved which may or may not contain the one with the exact match. Is there a way to ensure the results always contain the record with the exact match, on top of the resulting list, should one be available?
You could use an order by to ensure the exact match is always on top:
ORDER BY CASE WHEN PHONE = :phone THEN 1 ELSE 2 END
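Since :phone in the question is bound to the value with wildcards ('%918%'), comparing PHONE to it with = would not find the exact match. A sketch that binds the raw value separately is shown below; the function name build_phone_search and the parameter names :pattern and :exact are assumptions.

```php
<?php
// Hypothetical helper: returns the SQL text and the parameters to bind.
function build_phone_search(string $search): array
{
    $sql = 'SELECT PHONE, COALESCE(PREFERREDNAME, POPULARNAME) '
         . 'FROM distilled_contacts '
         . 'WHERE PHONE LIKE :pattern '
         . 'ORDER BY CASE WHEN PHONE = :exact THEN 1 ELSE 2 END '
         . 'LIMIT 6';
    return [$sql, [':pattern' => '%' . $search . '%', ':exact' => $search]];
}

// Usage with PDO (assumes $pdo is an existing PDO connection):
// [$sql, $params] = build_phone_search('918');
// $stmt = $pdo->prepare($sql);
// $stmt->execute($params);
```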
Using $search = $search.'%' instead will return only the results that start with the search value.
I am storing social security numbers in the database, but instead of storing the whole number, I store only a 5-digit sequence. So, if the SSN is 123-12-1234, my database would store it as #23121### or ####21234 or anything else, as long as it has 5 digits in a row.
Therefore, when a user enters the whole SSN, I want the database to locate all matches.
So, I can do this :
SELECT * FROM user WHERE ssn like 123121234
But the query above would not work, since I have some masked characters in the SSN field (#23121###). Is there a good way of doing this?
Maybe a good way would be to use
SELECT * FROM user WHERE REPLACE (ssn, '#', '') like 123121234
Although there could be an issue - the query might return non-relevant matches since 5 numbers that I store in the DB could be anywhere in a sequence.
Any idea how to do a better search?
If the numbers are always in a sequential block, you can generate a very efficient query by just generating the 5 variations of the ssn that could be stored in the DB and search for all of them with an exact match. This query can also use indexes to speed things up.
SELECT *
FROM user
WHERE ssn IN ('12312####',
'#23121###',
'##31212##',
'###12123#',
'####21234');
I think you can do something like this:
Extract all possible 5-char combinations out of the queried SSN.
Make an IN() query on those numbers. I'm not sure though how many results you would get from this.
$n = 123121234;
$sequences = array();
for($i = 0; $i + 5 <= strlen($n); $i++) {
$sequences[] = substr($n, $i, 5);
}
var_dump($sequences);
Tell me if you need the hash signs surrounding the strings.
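If the hash signs are needed, the same loop can pad each window to the full length, which yields the values for the IN () list from the other answer. This is a sketch only: the function name ssn_masks is an assumption, and it assumes the stored values always use # for the hidden positions.

```php
<?php
// Hypothetical helper: every masked form of an SSN that exposes one
// consecutive block of $window digits.
function ssn_masks(string $ssn, int $window = 5): array
{
    $digits = preg_replace('/\D/', '', $ssn); // drop dashes and spaces
    $len = strlen($digits);
    $masks = [];
    for ($i = 0; $i + $window <= $len; $i++) {
        $masks[] = str_repeat('#', $i)
                 . substr($digits, $i, $window)
                 . str_repeat('#', $len - $i - $window);
    }
    return $masks;
}
```

Each returned value can be bound to one placeholder of the IN () list in a prepared statement.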
I have an array that contains the same data:
data range
115X0101-115X0200
115X0101-115X0200
115X0101-115X0200
The 115X is the production code, which is unimportant here.
We are only concerned with the four digits after it, which we can count.
1. I want the script to read or search for "0101" and "0200" in 115X0101-115X0200.
2. I have tried using a regex to count them, so that 0101 to 0200 gives 100 (200 - 101 + 1).
3. "115X0101-115X0200" is repeated until there are 20 rows like this.
4. Once it reaches 20, show the result on the page:
data range
100
If this is the raw data, the easiest way to extract it is probably using a regular expression, as you've mentioned.
You'll probably want something like this (in PHP):
# Get this from the database
$sql_results = array(
'115X0101-115X0200',
'115X0101-115X0200',
'115X0101-115X0200',
);
foreach($sql_results as $row)
{
    preg_match_all('/\d{4}/', $row, $matches);
    #200 #101
    echo intval($matches[0][1]) - intval($matches[0][0]) + 1;
}
For each row, preg_match_all will find groups of 4 digits (\d{4}) and place them in $matches (use var_dump($matches) to see what it looks like).
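The same extraction can be put into a function that also rejects rows that do not contain exactly two 4-digit groups. This is a minimal sketch, and the function name range_count is an assumption.

```php
<?php
// Hypothetical helper: inclusive count of a range such as 115X0101-115X0200.
function range_count(string $range): int
{
    preg_match_all('/\d{4}/', $range, $matches);
    if (count($matches[0]) !== 2) {
        throw new InvalidArgumentException("unexpected range format: $range");
    }
    // Both ends are included, hence the + 1.
    return intval($matches[0][1]) - intval($matches[0][0]) + 1;
}
```

If a total over all rows is wanted, array_sum(array_map('range_count', $sql_results)) would add the counts together.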
More on Regex
Regular Expressions Cheatsheet
Regular Expressions Help
SQL Limit
Side note: If you only want 20 results at a time, you'll want to SELECT * FROM table LIMIT 20 when you query the database. To get rows 31-50 you'd use LIMIT 30, 20, which means offset by 30, then get 20 rows.