Optimizing near-duplicate value search

I'm trying to find near duplicate values in a set of fields in order to allow an administrator to clean them up.
There are two criteria that I am matching on:
1. One string is wholly contained within the other, and is at least 1/4 of its length.
2. The strings have an edit distance less than 5% of the total length of the two strings.
The pseudo-PHP code:
foreach($values as $value){
    $matches = array();
    foreach($values as $match){
        if(
            (
                $value['length'] < $match['length']
                &&
                $value['length'] * 4 > $match['length']
                &&
                stripos($match['value'], $value['value']) !== false
            )
            ||
            (
                $match['length'] < $value['length']
                &&
                $match['length'] * 4 > $value['length']
                &&
                stripos($value['value'], $match['value']) !== false
            )
            ||
            (
                abs($value['length'] - $match['length']) * 20 < ($value['length'] + $match['length'])
                &&
                0 < ($match['changes'] = levenshtein($value['value'], $match['value']))
                &&
                $match['changes'] * 20 <= ($value['length'] + $match['length'])
            )
        ){
            // store a copy; a reference to the loop variable would make every entry alias the last value
            $matches[] = $match;
        }
    }
    // output matches for current outer loop value
}
I've tried to reduce calls to the comparatively expensive stripos and levenshtein functions where possible, which has reduced the execution time quite a bit.
However, as an O(n^2) operation this just doesn't scale to the larger sets of values and it seems that a significant amount of the processing time is spent simply iterating through the arrays.
Some properties of a few sets of values being operated on:
 Total   | Strings      | Matches per string      |
 strings | with matches | Average | Median | Max  | Time (s)
---------+--------------+---------+--------+------+----------
 844     | 413          | 1.8     | 1      | 58   | 140
 593     | 156          | 1.2     | 1      | 5    | 62
 272     | 168          | 3.2     | 2      | 26   | 10
 157     | 47           | 1.5     | 1      | 4    | 3.2
 106     | 48           | 1.8     | 1      | 8    | 1.3
 62      | 47           | 2.9     | 2      | 16   | 0.4
Are there any other things I can do to reduce the time to check criteria, and more importantly are there any ways for me to reduce the number of criteria checks required (for example, by pre-processing the input values), since there is such low selectivity?
Edit: Implemented solution
// $values is ordered from shortest to longest string length
$values_count = count($values); // saves a ton of time, especially on linux
$matches = array();
for($vid = 0; $vid < $values_count; $vid++){
    $value = $values[$vid];
    for($mid = $vid+1; $mid < $values_count; $mid++){ // only check against longer strings
        $match = $values[$mid];
        if(
            (
                $value['length'] * 4 > $match['length']
                &&
                stripos($match['value'], $value['value']) !== false
            )
            ||
            (
                ($match['length'] - $value['length']) * 20 < ($value['length'] + $match['length'])
                &&
                0 < ($changes = levenshtein($value['value'], $match['value']))
                &&
                $changes * 20 <= ($value['length'] + $match['length'])
            )
        ){
            // store match in both directions
            $matches[$vid][$mid] = true;
            $matches[$mid][$vid] = true;
        }
    }
}
// Sort outer array of matches alphabetically with uksort()
foreach($matches as $vid => $mids){
    // sort inner array of matches by usage count with uksort()
    // output matches
}

You could first order the strings by length ( O(N log N) ) and then only check smaller strings to be substrings of larger strings, plus only check with levenshtein the string pairs for which the length difference is not too large.
You already perform these checks, but now you do it for all N x N pairs, while preselecting by length will help you reduce the pairs to check first. Avoid the N x N loop, even if it contains only tests that will fail.
For substring matching you could further improve by creating an index for all smaller items, and updating it as you parse larger items. The index can form a tree structure branching on letters, where each word (string) forms a path from root to leaf. This way you can find out whether any of the words in the index occurs in some string to match. For each character in your match string, try to advance every pointer in the tree index, and create a new pointer at the root of the index. If a pointer cannot be advanced to the following character in the index, you remove it. If any pointer reaches a leaf node, you've found a substring match.
Implementing this is, I think, not difficult, but not trivial either.
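The letter-tree index described above could be sketched roughly as follows. This is only an illustration under some assumptions: the function names trieInsert and trieFindSubstrings are made up for this example, matching is ASCII case-insensitive (like stripos), and the "at least 1/4 of the length" criterion from the question would still have to be checked on each hit.

```php
<?php
// Hypothetical sketch of the letter-tree (trie) index; names are illustrative.

// Add one short string to the tree. The '#end' key marks a complete word.
function trieInsert(array &$trie, string $word): void
{
    if ($word === '') {
        return; // nothing to index
    }
    $node = &$trie;
    foreach (str_split(strtolower($word)) as $ch) {
        if (!isset($node[$ch])) {
            $node[$ch] = [];
        }
        $node = &$node[$ch]; // descend one level
    }
    $node['#end'] = $word;
}

// Return every indexed word that occurs somewhere inside $text.
function trieFindSubstrings(array $trie, string $text): array
{
    $found = [];
    $active = []; // pointers into the tree, one per candidate start position
    foreach (str_split(strtolower($text)) as $ch) {
        $active[] = $trie; // a match may start at this character
        $next = [];
        foreach ($active as $node) {
            if (!isset($node[$ch])) {
                continue; // pointer cannot advance, so it is dropped
            }
            $child = $node[$ch];
            if (isset($child['#end'])) {
                $found[$child['#end']] = $child['#end'];
            }
            $next[] = $child;
        }
        $active = $next;
    }
    return array_values($found);
}

$trie = [];
foreach (['cat', 'at', 'dog'] as $short) {
    trieInsert($trie, $short);
}
$hits = trieFindSubstrings($trie, 'Concatenate'); // contains 'cat' and 'at'
```

Because the values are processed from shortest to longest, each string would be looked up against the index first and inserted into it afterwards.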

You can roughly halve the work right away by tightening your inner loop so that each pair is compared only once. Aren't you getting duplicate matches in your results?
For a preprocess step I'd go through and calculate character frequencies (assuming your set of characters is small like a-z0-9, which, given that you're using stripos, I think is likely). Then rather than comparing sequences (expensive) compare frequencies (cheap). This will give you false positives which you can either live with, or plug into the test you've currently got to weed out.
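One way to sketch the frequency idea, assuming byte-wise comparison as in PHP's levenshtein(): every character that one string has in surplus must be deleted, inserted or substituted, so the larger surplus is a lower bound on the edit distance. The function names below are hypothetical, and the budget check mirrors the "changes * 20 <= total length" test from the question.

```php
<?php
// Hypothetical sketch: cheap character-frequency prefilter before levenshtein().

// Lower bound on the edit distance, derived from per-byte counts only.
function charSurplusBound(string $a, string $b): int
{
    $countA = count_chars($a, 1); // byte value => occurrences, non-zero only
    $countB = count_chars($b, 1);
    $surplusA = 0;
    $surplusB = 0;
    foreach ($countA + $countB as $byte => $unused) {
        $diff = ($countA[$byte] ?? 0) - ($countB[$byte] ?? 0);
        if ($diff > 0) {
            $surplusA += $diff;
        } else {
            $surplusB -= $diff;
        }
    }
    return max($surplusA, $surplusB);
}

// True when the strings differ by at least one edit and by at most 5% of their combined length.
function withinEditBudget(string $a, string $b): bool
{
    $budget = intdiv(strlen($a) + strlen($b), 20);
    if (charSurplusBound($a, $b) > $budget) {
        return false; // cannot match, skip the expensive call
    }
    $changes = levenshtein($a, $b);
    return $changes > 0 && $changes <= $budget;
}
```

The bound never exceeds the true distance, so the prefilter only discards pairs that levenshtein() would have rejected anyway; pairs that pass still go through the full check.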

Related

Create identification number in order and range delimited to registers

I have a list with many full names (>20000), and it grows with each new registration. I need to create a seven-digit identification number for every record, in alphabetical order, so that the numbers start at 0100000 and end at 9999999. The number must be based on the full name and its position in the order.
When new names are added and merged into the existing base, their new numbers must be merged in as well.
I have not yet been able to develop a good algorithm that solves this. Once I have one, I'll need to write a PHP script for it.
It is a conversion of names to numbers, but within a defined range.
For example
Anthony Felder : 0.459.789
Bianca Mall : 0.989.432
Danton Bishop : 2.986.999
Mario Cortez: 7.883.120
Paul Rudd: 8.788.454
Zeta Jones: 9.987.001
A new name inserted:
Augustus Novell : 0.589.223
Frederic Francis Ford Copolla : 3.765.453

You are going to run into problems, because eventually you will add so many records that August zzzzzperson gets the number 0.989.432, and that number already exists.
If you don't expect TOO many new people being added, what you could do:
If Augustus Novell is added to your database - find out between which two names he should be placed (alphabetically).
Anthony Felder : 0.459.789
Bianca Mall : 0.989.432
Grab their numbers and get a number right in the middle of the two:
roundUp((0.459.789 + 0.989.432) / 2) = 0.724.611
This works as long as you leave a significant gap between records at the start. In this example, with this gap, you can only do this about 20 times if you keep adding a new name between Anthony Felder and the most recently added name. Increasing the gap increases the number of times you can do this, but you have to DOUBLE the gap just to get one additional name in there.
The limit of 20 only applies if you keep using the same name 20 times as the upper or lower limit. I would love to hear if there is a smarter algorithm, but I doubt it, without rebuilding indices. Taking the middle of two numbers makes sure you always have the biggest gap between two numbers (not taking predictive models into account).
I don't like my solution of taking the average, but I think it may be the best solution. In other words, unless someone comes up with a better algorithm, I would try to approach your situation differently, for example by letting go of the need to make the numbers follow the alphabetical order of the names (I wonder why this is needed anyway).
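The midpoint step could look something like this in PHP. It is a minimal sketch: the function names midpointId and formatId are invented for illustration, and the numbers are handled as plain integers without the dots.

```php
<?php
// Hypothetical sketch of the "take the middle of the two neighbours" step.

// Returns the rounded-up midpoint, or null when no free number is left in between.
function midpointId(int $lower, int $upper): ?int
{
    if ($upper - $lower < 2) {
        return null; // the gap is exhausted; the numbers would have to be rebuilt
    }
    return intdiv($lower + $upper + 1, 2);
}

// Formats 724611 as "0.724.611" (seven digits, dotted like the examples).
function formatId(int $id): string
{
    $digits = str_pad((string) $id, 7, '0', STR_PAD_LEFT);
    return $digits[0] . '.' . substr($digits, 1, 3) . '.' . substr($digits, 4, 3);
}

// Augustus Novell sorts between Anthony Felder and Bianca Mall:
echo formatId(midpointId(459789, 989432)), "\n"; // prints 0.724.611
```

A null result is the signal that the gap between two neighbours has run out.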
EDIT: One other option. Map their name to a number
a = 01, b = 02, c = 03... z = 26, space = 27
Optional, space is a dot, but you can also put a dot every 3 letters (6 numbers)
That means 2 people with the same name would get the same number. You can solve this by having the first two numbers telling you which person it is.
So the first Anthony Felder would start with 01, second Anthony Felder with 02, third Anthony Felder with 03 etc and then start mapping the A (=01).
You have to define how to deal with other characters such as é .
This leads to numbers with variable lengths
This can lead to LONG numbers.
The limit is 99 people with the same name (or 100 if you start with 00)
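The letter mapping might be sketched like this. The function name nameToNumber is hypothetical, and silently skipping characters other than a-z and space is an assumption made here, since handling of characters such as é is left open above.

```php
<?php
// Hypothetical sketch: a = 01 ... z = 26, space = 27, prefixed by a two-digit ordinal
// that distinguishes people sharing the same name (01 to 99).
function nameToNumber(string $name, int $ordinal = 1): string
{
    $out = sprintf('%02d', $ordinal);
    foreach (str_split(strtolower($name)) as $ch) {
        $code = ord($ch);
        if ($ch === ' ') {
            $out .= '27';
        } elseif ($code >= 97 && $code <= 122) { // 'a' .. 'z'
            $out .= sprintf('%02d', $code - 96);
        }
        // any other character is skipped in this sketch (assumption)
    }
    return $out;
}

echo nameToNumber('Ab c'), "\n";    // prints 0101022703
echo nameToNumber('Zeta', 2), "\n"; // prints 0226052001
```

As noted above, the result grows by two digits per character, so it does not fit a fixed seven-digit range.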

Actually, we can turn your idea into code, but it needs more execution time (you say that there are 20000 records).
$value = 100000;
$n= 100000 + $total_db_row; //find total name count from your db table
for( $i=100000; $i < $n; $i++ ) {
$a = str_pad($i, 7, '0', STR_PAD_LEFT);
$a = substr_replace( $a, '.', 1, 0 );
$id[] = substr_replace( $a, '.', 5, 0 );
}
//var_dump($id);
Then, after each insertion, you should do the following:
mysql> SELECT * FROM ordered_names;
+----+-----------+----------+------------+
| id | firstname | lastname | sort_order |
+----+-----------+----------+------------+
| 1 | pamela | edwards | NULL |
| 2 | rolando | edwards | NULL |
| 3 | diamond | edwards | NULL |
+----+-----------+----------+------------+
3 rows in set (0.00 sec)
mysql>
Next, let's populate the sort order column:
SET @x = 0;
UPDATE ordered_names SET sort_order = (@x:=@x+1) ORDER BY lastname,firstname;
SELECT * FROM ordered_names;
Let's run that:
mysql> SET @x = 0;
Query OK, 0 rows affected (0.00 sec)
mysql> UPDATE ordered_names SET sort_order = (@x:=@x+1) ORDER BY lastname,firstname;
Query OK, 3 rows affected (0.00 sec)
Rows matched: 3 Changed: 3 Warnings: 0
mysql> SELECT * FROM ordered_names;
+----+-----------+----------+------------+
| id | firstname | lastname | sort_order |
+----+-----------+----------+------------+
| 1 | pamela | edwards | 2 |
| 2 | rolando | edwards | 3 |
| 3 | diamond | edwards | 1 |
+----+-----------+----------+------------+
3 rows in set (0.00 sec)
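Then update the sort_order column from the PHP-generated array, matching on the current 'sort_order' value.
for ($i = 0; $i < count($id); $i++)
{
// sort_order was numbered from 1 above, while $id is indexed from 0;
// this assumes sort_order can hold a string such as '0.100.000'
$sql = "UPDATE ordered_names SET sort_order = '" . $id[$i] . "' WHERE sort_order = " . ($i + 1);
$db->query($sql);
}
This updates the sort_order column with values from 0.100.000 upward.
But I repeat: it needs more execution time for larger tables.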

PHP- Query MySQLi results nearest a given number

I am trying to search for an invoice by the amount. So, I would like to search all invoices +/- 10% of the amount searched, and order by the result closest to the given number:
$search = 100.00;
$lower = $search * 0.9; // 90
$higher = $search * 1.1; // 110
$results = $db->select("SELECT ID from `invoices` WHERE Amount >= $lower && Amount <= $higher");
So, I am not sure how to order these. Let's say this query gives me the following results:
108, 99, 100, 103, 92
I want to order the results, starting with the actual number searched (since it's an exact match), and working out from there, so:
100, 99, 103, 92, 108

You could do this as follows:
$search = 100.00;
$deviation = 0.10;
$results = $db->select("
SELECT ID, Amount, ABS(1 - Amount/$search) deviation
FROM invoices
WHERE ABS(1 - Amount/$search) <= $deviation
ORDER BY ABS(1 - Amount/$search)
");
Output is:
+----+--------+-----------+
| id | Amount | deviation |
+----+--------+-----------+
| 3 | 100 | 0 |
| 2 | 99 | 0.01 |
| 4 | 103 | 0.03 |
| 1 | 108 | 0.08 |
| 5 | 92 | 0.08 |
+----+--------+-----------+
This way you let SQL calculate the deviation, by dividing the actual amount by the "perfect" amount ($search). This will be 1 for a perfect match. By subtracting this from 1, the perfect match is represented by the value 0. Any deviation is non-zero. By taking the absolute value of that, you get the exact deviation as a fractional number (representing a percentage), like for example 0.02 (which is 2%).
By comparing this deviation to a given maximum deviation ($deviation), you get what you need. Of course, ordering is then easily done on this calculated deviation.

Try this:
$search = 100.00;
$lower = $search * 0.9; // 90
$higher = $search * 1.1; // 110
$results = $db->select("SELECT ID from `invoices`
WHERE Amount >= $lower && Amount <= $higher
ORDER BY ABS(Amount - $search)
");
The ABS function returns the absolute value of its argument (=> it basically removes the minus from negative numbers). Therefore ABS(Amount - $search) returns the distance from the $search value.
Besides that, you should consider using prepared statements. Otherwise your application could be vulnerable to SQL injection.
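With plain mysqli, a prepared version of this query might look roughly like the sketch below. The question uses its own $db->select() wrapper, so the function names amountBounds and findInvoicesNear are assumptions made for illustration, and get_result() additionally assumes the mysqlnd driver is available.

```php
<?php
// Hypothetical sketch: the same search with bound parameters instead of interpolated values.

// Returns [lower, upper] for a search amount and a relative deviation (0.10 = 10%).
function amountBounds(float $search, float $deviation = 0.10): array
{
    return [$search * (1 - $deviation), $search * (1 + $deviation)];
}

// Assumes an open mysqli connection and the `invoices` table from the question.
function findInvoicesNear(mysqli $db, float $search, float $deviation = 0.10): array
{
    [$lower, $upper] = amountBounds($search, $deviation);
    $stmt = $db->prepare(
        'SELECT ID, Amount FROM invoices
         WHERE Amount BETWEEN ? AND ?
         ORDER BY ABS(Amount - ?)'
    );
    $stmt->bind_param('ddd', $lower, $upper, $search); // three doubles
    $stmt->execute();
    return $stmt->get_result()->fetch_all(MYSQLI_ASSOC);
}
```

The values are sent separately from the SQL text, so nothing the user types is ever parsed as part of the query.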

Advertisement System Tips

I am creating an advertisement system which shows the highest bidder's ads more frequently.
Here is an example of the table structure I am using, but simplified...
+----+----------+------------------------+----------------------+-----+
| id | name | image | destination | bid |
+----+----------+------------------------+----------------------+-----+
| 1 | abc, co | htt.../blah | htt...djkd.com/ | 3 |
+----+----------+------------------------+----------------------+-----+
| 2 | facebook | htt.../blah | htt...djkd.com/ | 200 |
+----+----------+------------------------+----------------------+-----+
| 3 | google | htt.../blah | htt...djkd.com/ | 78 |
+----+----------+------------------------+----------------------+-----+
Right now I am selecting the values from the database, inserting them into an array, and picking one out at random, similar to the following:
$ads_array = [];
$ads = Ad::where("active", "=", 1)->orderBy("price", "DESC");
if ($ads->count() > 0) {
$current = 0;
foreach ($ads->get() as $ad) {
for ($i = 0; $i <= ($ad->price == 0 ? 1 : $ad->price); $i++) {
$ads_array[$current] = $ad->id;
$current++;
}
}
$random = rand(0,$current-1);
$ad = Ad::where("id", "=", $ads_array[$random])->first();
...
}
So, essentially what this is doing is, it is inserting the advert's ID into an array 1*$bid times. This is very inefficient, sadly (for obvious reasons), but it was the best way I could think of doing this.
Is there a better way of picking out a random ad from my database; while still giving the higher bidders a higher probability of being shown?

It looks like this might do the trick (all the credit goes to a commenter):
SELECT ads.*
FROM ads
ORDER BY -log(1.0 - rand()) / ads.bid
LIMIT 1
A script to test this :
<?php
$pdo = new PDO('mysql:host=localhost;dbname=test;', 'test', 'test');
$times = array();
// repeat a lot to have real values
for ($i = 0; $i < 10000; $i++) {
$stmt = $pdo->query('SELECT ads.* FROM ads ORDER BY -log(1.0 - rand()) / bid LIMIT 1');
$bid = $stmt->fetch()['bid'];
if (isset($times[$bid])) {
$times[$bid] += 1;
} else {
$times[$bid] = 1;
}
}
// echoes the number of times one bid is represented
var_dump($times);
The figures I get from that test are pretty good:
// key is the bid, value is the number of times this bid is represented
array (size=3)
200 => int 7106
78 => int 2772
3 => int 122
Further reference on the mathematical explanation:
Many important univariate distributions can be sampled by inversion using simple closed form expressions. Some of the most useful ones are listed here.
Example 4.1 (Exponential distribution). The standard exponential distribution has density f(x) = e^(−x) on x > 0. If X has this distribution, then E(X) = 1, and we write X ∼ Exp(1). The cumulative distribution function is F(x) = P(X ≤ x) = 1 − e^(−x), with F^(−1)(u) = −log(1 − u). Therefore taking X = −log(1 − U) for U ∼ U(0, 1) generates standard exponential random variables. Complementary inversion uses X = −log(U).
The exponential distribution with rate λ > 0 (and mean θ = 1/λ) has PDF λ exp(−λx) for 0 ≤ x < ∞. If X has this distribution, then we write X ∼ Exp(1)/λ or equivalently X ∼ θ Exp(1), depending on whether the problem is more naturally formulated in terms of the rate λ or mean θ. We may generate X by taking X = −log(1 − U)/λ.
Source: http://statweb.stanford.edu/~owen/mc/Ch-nonunifrng.pdf
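The same idea can be tried out in PHP without a database. Each ad draws an exponential key with its bid as the rate, and the smallest key wins, which makes the chance of winning proportional to the bid. This is only a sketch: the function name pickWeightedAd and the injectable $uniform callable (used here to make the behaviour reproducible) are assumptions, not part of the answer above.

```php
<?php
// Hypothetical sketch: pick one ad id with probability proportional to its bid.
// $bids maps ad id => bid; $uniform must return a float in [0, 1).
function pickWeightedAd(array $bids, ?callable $uniform = null)
{
    if ($uniform === null) {
        $uniform = function (): float {
            return mt_rand() / (mt_getrandmax() + 1);
        };
    }
    $bestId = null;
    $bestKey = INF;
    foreach ($bids as $id => $bid) {
        if ($bid <= 0) {
            continue; // a zero bid can never win in this sketch
        }
        $key = -log(1.0 - $uniform()) / $bid; // Exp(1) sample scaled by 1/bid
        if ($key < $bestKey) {
            $bestKey = $key;
            $bestId = $id;
        }
    }
    return $bestId;
}

// Bids from the example table: id 1 => 3, id 2 => 200, id 3 => 78
$winner = pickWeightedAd([1 => 3, 2 => 200, 3 => 78]);
```

With these bids, id 2 should win about 200/281 of the time, which is in line with the test figures above.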

php - map a value using fromRange and toRange?

I'm trying to figure out how to map a number between 1 and 1000 to a number between 1 and 5.
For example:
I have a database of 1000 records and I want to assign an id number between 1 and 5 to each record. I don't want it to be random; that's easy enough with rand(1,5).
In the Arduino language it has a function that I'm hoping PHP has:
result = map(value, fromLow, fromHigh, toLow, toHigh)
The purpose of this is that I don't want to store the mapped value in the database. I need a PHP function that I can call so that if, say, the db record is 100, it will always return the same mapped value no matter how often it is called.
Thanks!

The function you're looking for maps ranges by using different scales. So that's easy to do:
function map($value, $fromLow, $fromHigh, $toLow, $toHigh) {
$fromRange = $fromHigh - $fromLow;
$toRange = $toHigh - $toLow;
$scaleFactor = $toRange / $fromRange;
// Re-zero the value within the from range
$tmpValue = $value - $fromLow;
// Rescale the value to the to range
$tmpValue *= $scaleFactor;
// Re-zero back to the to range
return $tmpValue + $toLow;
}
So basically, it'll re-base the number within the range. Now, note that there is no error checking if value is within either range. The reason is that it maps scales, not ranges. So you can use it for unit conversion:
$feet = map($inches, 0, 12, 0, 1);
And you can map "ranges" as well since it re-bases the number (moves it along the number line):
5 == map(15, 10, 20, 0, 10);
So for from range (0, 1000) and to range (0, 5), the following table will hold true:
-200 | -1
0 | 0
1 | 0.005
100 | 0.5
200 | 1
400 | 2
600 | 3
800 | 4
1000 | 5
2000 | 10
3000 | 15
And to show the re-basing, if we map (0, 1000) to (5, 10):
-200 | 4
0 | 5
1 | 5.005
100 | 5.5
200 | 6
400 | 7
600 | 8
800 | 9
1000 | 10
2000 | 15
3000 | 20
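Since the question asks for a whole number between 1 and 5, an integer-only variant may be closer to what is needed. The sketch below is an assumption-laden alternative, not the map() function above: the name mapToBucket is made up, both ranges are treated as inclusive, and out-of-range values are clamped.

```php
<?php
// Hypothetical sketch: map an integer in [fromLow, fromHigh] to an integer bucket
// in [toLow, toHigh], with buckets of (nearly) equal size.
function mapToBucket(int $value, int $fromLow, int $fromHigh, int $toLow, int $toHigh): int
{
    $fromSize = $fromHigh - $fromLow + 1; // number of source values
    $toSize = $toHigh - $toLow + 1;       // number of buckets
    $value = max($fromLow, min($fromHigh, $value)); // clamp into the source range
    return $toLow + intdiv(($value - $fromLow) * $toSize, $fromSize);
}

echo mapToBucket(100, 1, 1000, 1, 5), "\n";  // prints 1
echo mapToBucket(201, 1, 1000, 1, 5), "\n";  // prints 2
echo mapToBucket(1000, 1, 1000, 1, 5), "\n"; // prints 5
```

Records 1 to 200 land in bucket 1, 201 to 400 in bucket 2, and so on, and the same record always gets the same bucket.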

Have you considered: $mappedValue = ($value % 5) + 1;? This takes the remainder after dividing the value by 5 (i.e. 0-4) and then adds one.

How could I detect if a character is close to another character on a QWERTY keyboard?

I'm developing a spam detection system and was alarmed to find that it can't detect strings like this - "asdfsdf".
My solution to this involves detecting if the previous keys were near the other keys on the keyboard. I am not getting the input (to detect spam from) from the keyboard, I'm getting it in the form of a string.
All I want to know is whether a character is one key, two keys or more than two keys away from another character.
For example, on a modern QWERTY keyboard, the characters 'q' and 'w' would be 1 key away. Same would the chars 'q' and 's'. Humans can figure this out logically, how could I do this in code?

You could simply create a two-dimensional map for the standard qwerty keyboard.
Basically it could look something like this:
map[0][0] = 'q';
map[0][1] = 'a';
map[1][0] = 'w';
map[1][1] = 's';
and so on.
When you get two characters, you simply need to find their x and y in the array 'map' above, and can then calculate the distance using Pythagoras. It would not fulfil your requirement that 'q' and 's' be 1 apart; rather, it would be sqrt(1^2 + 1^2), approximately 1.4.
The formula would be:
Characters are c1 and c2
Find coordinates for c1 and c2: (x1,y1) and (x2,y2)
Calculate the distance using pythagoras: dist = sqrt((x2-x1)^2 + (y2-y1)^2).
If necessary, ceil or floor the result.
For example:
Say you get the characters c1='q', and c2='w'. Examine the map and find that 'q' has coordinates (x1,y1) = (0, 0) and 'w' has coordinates (x2,y2) = (1, 0). The distance is
sqrt((1-0)^2 + (0-0)^2) = sqrt(1) = 1

Well, let's see. That's a tough one. I always take the brute-force method and I stay away from advanced concepts like that guy Pythagoras tried to foist on us, so how about a two-dimensional table? Something like this, maybe:
+---+---+---+---+---+---+---
| | a | b | c | d | f | s ...
+---+---+---+---+---+---+---
| a | 0 | 5 | 4 | 2 | 3 | 1 ...
| b | 5 | 0 | 3 | 3 | 2 | 4 ...
| c | 4 | 3 | 0 | 1 | 2 | 2 ...
| d | 2 | 3 | 1 | 0 | 1 | 1 ...
| f | 3 | 2 | 2 | 1 | 0 | 2 ...
| s | 1 | 4 | 2 | 1 | 2 | 0 ...
+---+---+---+---+---+---+---
Could that work for ya'? You could even have negative numbers to show that one key is to the left of the other. PLUS you could put a 2-integer struct in each cell where the second int is positive or negative to show that the second letter is up or down from the first. Get my patent attorney on the phone, quick!

Build a map from keys to positions on an idealized keyboard. Something like:
'q' => {0,0},
'w' => {0,1},
'a' => {1,0},
's' => {1,1}, ...
Then you can take the "distance" as the mathematical distance between the two points.

The basic idea is to create a map of characters and their positions on the keyboard. You can then use a simple distance formula to determine how close they are together.
For example, consider the left side of the keyboard:
1 2 3 4 5 6
q w e r t
a s d f g
z x c v b
Using [row, column] positions, character a is at [2, 0] and character b is at [3, 4]. The formula for their distance apart is:
sqrt((x2-x1)^2 + (y2-y1)^2);
So the distance between a and b is sqrt((4 - 0)^2 + (3 - 2)^2) = sqrt(17), or about 4.12.
It'll take you a little bit of effort to map the keys into a rectangular grid (my example isn't perfect, but it gives you the idea). But after that you can build a map (or dictionary), and lookup is simple and fast.
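A small PHP sketch of such a map is shown below. Note that it swaps the Euclidean formula for the larger of the row and column differences (Chebyshev distance), so that diagonal neighbours such as 'q' and 's' count as one key apart, as the question asks. The function names are hypothetical, the grid ignores the stagger between rows, and only the three letter rows of a QWERTY layout are included.

```php
<?php
// Hypothetical sketch: key positions on an idealized (unstaggered) QWERTY grid.
function keyPositions(): array
{
    $rows = ['qwertyuiop', 'asdfghjkl', 'zxcvbnm'];
    $positions = [];
    foreach ($rows as $y => $row) {
        foreach (str_split($row) as $x => $ch) {
            $positions[$ch] = [$x, $y];
        }
    }
    return $positions;
}

// Number of key steps between two letters, or null for characters not on the grid.
function keyDistance(string $c1, string $c2): ?int
{
    static $positions = null;
    if ($positions === null) {
        $positions = keyPositions();
    }
    $c1 = strtolower($c1);
    $c2 = strtolower($c2);
    if (!isset($positions[$c1], $positions[$c2])) {
        return null;
    }
    // Chebyshev distance: a diagonal step counts as a single key
    return max(
        abs($positions[$c1][0] - $positions[$c2][0]),
        abs($positions[$c1][1] - $positions[$c2][1])
    );
}

echo keyDistance('q', 'w'), "\n"; // prints 1
echo keyDistance('q', 's'), "\n"; // prints 1
echo keyDistance('q', 'e'), "\n"; // prints 2
```

Comparing the result against 1 and 2 gives the "one key, two keys, or more" classification from the question.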

I developed a function for the same purpose in PHP because I wanted to see whether I can use it to analyse strings to figure out whether they're likely to be spam.
This is for the QWERTZ keyboard, but it can easily be changed. The first number in the array $keys is the approximate distance from the left and the second is the row number from top.
function string_distance($string){
    // Strip everything except a-z first, so the length check applies to what is actually compared
    $string=preg_replace("/[^a-z]+/",'',mb_strtolower($string));
    if(mb_strlen($string)<2){
        return NULL;
    }
    $keys=array(
        'q'=>array(1,1),
        'w'=>array(2,1),
        'e'=>array(3,1),
        'r'=>array(4,1),
        't'=>array(5,1),
        'z'=>array(6,1),
        'u'=>array(7,1),
        'i'=>array(8,1),
        'o'=>array(9,1),
        'p'=>array(10,1),
        'a'=>array(1.25,2),
        's'=>array(2.25,2),
        'd'=>array(3.25,2),
        'f'=>array(4.25,2),
        'g'=>array(5.25,2),
        'h'=>array(6.25,2),
        'j'=>array(7.25,2),
        'k'=>array(8.25,2),
        'l'=>array(9.25,2),
        'y'=>array(1.85,3),
        'x'=>array(2.85,3),
        'c'=>array(3.85,3),
        'v'=>array(4.85,3),
        'b'=>array(5.85,3),
        'n'=>array(6.85,3),
        'm'=>array(7.85,3)
    );
    $distances=array();
    for($i=0;$i+1<mb_strlen($string);$i++){
        $char_a=mb_substr($string,$i,1);
        $char_b=mb_substr($string,$i+1,1);
        $a=abs($keys[$char_a][0]-$keys[$char_b][0]);
        $b=abs($keys[$char_a][1]-$keys[$char_b][1]);
        // ** is exponentiation; ^ would be bitwise XOR in PHP
        $distance=sqrt($a**2+$b**2);
        $distances[]=$distance;
    }
    // average distance between consecutive characters
    return array_sum($distances)/count($distances);
}
You can use it the following way.
string_distance('Boat'); # returns approximately 5.14
string_distance('HDxtaBQrGkjny'); # returns approximately 2.96
I used multibyte functions because I was thinking about extending it for other characters. One could extend it by checking the case of characters as well.
