MySql Php Find Similar Values - php

I run a website where users have a username. They can change their usernames whenever they want. When they change their name, we check that the name isn't currently in use and then allow or reject the change. On our site, people often change their username to imitate someone else's (making their name very similar in order to confuse others about their identity). This isn't uncommon for the type of site we run.
Is there a way to easily check for usernames that are somewhat similar using a simple query?
Here are some examples of usernames that we would like to have a query match up.
testingman1 = testingman11
lionhead = Iionhead (one has an l and the other has a capital i)
sleepybears = sleeepybears
Is there any way to do a character-by-character count of the same letters in the same position and then decide, based on the percentage, whether a name is a copy of another user's?
I know I'll most likely have to write a custom function, but I'm looking for advice on how to make it as painless and as light on the system as possible.

You can use
levenshtein(str1, str2), which returns an integer: the edit distance between the two strings.
In PHP versions before 8.0, if either string is longer than 255 characters the function returns -1.
More info: http://php.net/manual/en/function.levenshtein.php
Or, if you want a percentage, you can use similar_text(string $first, string $second [, float &$percent]),
which stores the similarity percentage in the third parameter, passed by reference.
More info: http://www.php.net/manual/en/function.similar-text.php
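As a rough sketch of how the two functions could be applied to the usernames from the question: the helper name looksLikeCopy and the default threshold of 1 edit are assumptions made for illustration, not something the PHP manual prescribes.

```php
<?php
// Hypothetical helper: flags two usernames as confusingly similar when
// their case-insensitive Levenshtein distance is within $maxDistance.
function looksLikeCopy(string $candidate, string $existing, int $maxDistance = 1): bool
{
    // Lower-case both names so differences in case alone do not count as edits.
    $a = strtolower($candidate);
    $b = strtolower($existing);
    return levenshtein($a, $b) <= $maxDistance;
}

var_dump(looksLikeCopy('testingman11', 'testingman1'));  // bool(true)
var_dump(looksLikeCopy('Iionhead', 'lionhead'));         // bool(true)
var_dump(looksLikeCopy('sleeepybears', 'sleepybears'));  // bool(true)

// similar_text() reports the similarity as a percentage through its third argument.
similar_text('sleeepybears', 'sleepybears', $percent);
printf("%.1f%%\n", $percent);
```

This only compares one pair of names. Checking a new name against every existing user means running the comparison once per row, so it is probably worth narrowing the candidates first (for example by length) before looping.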

Related

How to convert string to integer in PHP?

(This question is not about PHP type-casting.)
I have read in a couple of questions that it is best not to show record IDs to users, but to use another value that doesn't give away any information about how many records there are in the database, etc. I wanted to implement this, and after searching on Google, surprisingly, I found no solution.
So, my question is: is it possible to convert a (long, for example) string into a unique (or unique enough for at least a million conversions) sequence of digits? Are there any options available that I'm not aware of?
Just to show you an example:
$uName = $this->newUsername;
$publicId = $this->strToInt(somecomplexstring); // Outputs something like 13272992
// Or feed with username
$publicId = $this->strToInt($uName);
You might consider using something like a slug. Your user will have a unique ID in the database but also a unique slug (a random string, e.g. TGqJItemU5TGqJItemU5f6S5VaCr2n). You can then use this slug instead of the ID when presenting data to the browser.
As stated in a comment, you can use uniqid to get a unique string. This function returns a hexadecimal string.
If you need only digits, convert the hex string to a decimal number with hexdec.
The final code will look like this :
hexdec(uniqid());
Another way to get an integer from a string is to use md5. This function also returns a hex string, so you will have to use hexdec to get a decimal number.
Please note that an MD5 hash is not guaranteed to be unique: there is a (very small) probability of accidental collision (1 in 340 undecillion; see "How many random elements before MD5 produces collisions?" for more info).
You can use:
md5(uniqid(rand(), true))
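One caveat when combining md5 with hexdec: a full MD5 hash has 32 hex digits, which is far beyond PHP's integer range, so hexdec returns a float and the low digits are lost. A minimal sketch that keeps the result an exact integer, assuming 64-bit PHP and reusing the strToInt name from the question, is to convert only the first 15 hex digits:

```php
<?php
// Sketch of the strToInt() idea from the question (64-bit PHP assumed).
// The same input always produces the same integer.
function strToInt(string $value): int
{
    // 15 hex digits = 60 bits, which always fits in a 64-bit PHP integer.
    return (int) hexdec(substr(md5($value), 0, 15));
}

echo strToInt('some complex string'), "\n";
```

Truncating the hash makes collisions more likely than with the full 128 bits, so a unique index on the stored value is still a sensible safety net.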

Convert numbers to another one in the same range, and back

I'm looking for a solution to convert all numbers in a given range to another number in the same range, and later convert that number back.
More concretely, let's say I have the numbers 1..100.
The easiest way to convert every number to another one in the same range is b = 101 - a; the original is recovered later with a = 101 - b.
My problem is that I want to simulate some randomness.
I want to implement this in PHP, but the coding language doesn't matter.
WHY?
You may ask why. Good question :)
I am generating easy-to-read short code strings based on IDs, and because the IDs are incremented one by one, my consecutive short codes are too similar.
Later I need to "decode" the short codes, to get the id.
What my algorithm is doing now is:
0000001 -> ababac, 0000002 -> ababad, 0000003 -> ababaf, etc.
later
ababac -> 0000001, ababad -> 0000002, ababaf -> 0000003, etc.
So before I actually generate the short code I want to "randomize" the number as much as possible.
Option 1:
Why don't you just keep a conversion table in the database? I.e. each record has a "real" ID and a "random md5" string or something.
Option 2:
Use a rainbow table - maybe even an MD5 lookup table for the range 0 - 10,000 or whatever. Then just do a hashtable lookup.
Finally I found a solution based on the modulo operator, on the math forum.
The solution can be found here:
https://math.stackexchange.com/questions/259891/function-to-convert-each-number-in-a-m-n-to-another-number-in-the-same-range
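For illustration, here is one common modulo-based approach; it is not necessarily the exact formula from the linked answer. Multiplying by a number that shares no factor with the range size permutes the range, and multiplying by its modular inverse undoes it. The range 1..100, the multiplier 37 and the offset 11 are arbitrary assumptions.

```php
<?php
// Assumed parameters for illustration only.
const RANGE_SIZE     = 100; // numbers 1..100
const RANGE_MULT     = 37;  // must share no factor with RANGE_SIZE
const RANGE_MULT_INV = 73;  // 37 * 73 = 2701, and 2701 % 100 === 1
const RANGE_OFFSET   = 11;

function scramble(int $a): int
{
    $x = $a - 1; // shift 1..100 to 0..99
    return (($x * RANGE_MULT + RANGE_OFFSET) % RANGE_SIZE) + 1;
}

function unscramble(int $b): int
{
    $y = $b - 1;
    // Add RANGE_SIZE before multiplying so the value never goes negative.
    return ((($y - RANGE_OFFSET + RANGE_SIZE) * RANGE_MULT_INV) % RANGE_SIZE) + 1;
}

echo scramble(1), ' ', scramble(2), ' ', scramble(3), "\n"; // 12 49 86
echo unscramble(12), "\n";                                  // 1
```

Consecutive inputs land far apart, yet every number in the range is still used exactly once. This only looks random; anyone who knows the multiplier and offset can reverse it.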

Generating an int within a certain range based on two variables

I'm making an anonymous commenting system for my blog. Each user needs a randomly picked username from an array of 600 usernames that I have made. I can't pick it purely at random, because then people wouldn't know whether the same person was posting a reply. So I have given each post a randomly generated key between 1 and 9999, and I want to combine that key with the user's ID in some calculation so that the resulting number stays consistent throughout that particular post. The result has to be within 1-600.
something like:
user_id x foo(1-9999) = bar(1-600)
Thanks.
What you're probably looking for is a hash function. To quote Wikipedia:
A hash function is any algorithm or subroutine that maps large data sets of variable length, called keys, to smaller data sets of a fixed length.
So you can use a standard hash function, plus modular arithmetic to further map the output of that hash function to your username range, like so:
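function anonymise($username, $post_key, array $usernames) {
    // Hash the user and the per-post key together so the result is stable within a post.
    $hash = hash("adler32", "$username/$post_key");
    // adler32 yields 8 hex digits, which fit in an integer on 64-bit PHP.
    $hash_decimal = hexdec($hash);
    // Map the hash onto a valid array index (0-599 for 600 names).
    $anonymised_id = $hash_decimal % count($usernames);
    return $usernames[$anonymised_id];
}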
So, what you really want is a unique identifier for every poster?
Why not use http://php.net/ip2long modulo 600?
Of course, you'll have to do some collision detection with that too.
You can try using md5 on the concatenated ID and post key. It gives you a consistent 32-character hash of that input, and since the hash is a hexadecimal string, you can convert it to a number easily with a hex-to-int conversion.
Edit: Based on your feedback, you can take the generated int modulo 600.
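A minimal sketch of the md5 variant, assuming 64-bit PHP. The function name usernameIndex and the "/" separator are made up for this example. Only the first 8 hex digits of the hash are converted, because the full 32 digits would overflow an integer.

```php
<?php
// Hypothetical helper: maps (user ID, post key) to a stable index 0..599.
function usernameIndex(int $userId, int $postKey, int $poolSize = 600): int
{
    // The first 8 hex digits are 32 bits, so hexdec() returns an exact integer.
    $number = (int) hexdec(substr(md5("$userId/$postKey"), 0, 8));
    return $number % $poolSize;
}

// The same user on the same post always gets the same index.
echo usernameIndex(42, 1234), "\n";
echo usernameIndex(42, 1234), "\n";
```

Two different users can still land on the same index for a given post, since 600 names cannot cover every user without overlap.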

Name comparison algorithm

To check if a name is inside an anti-terrorism list.
In addition of the given name, also search for similar names (possible aliases).
Example:
given name => Bin Laden alert!
given name => Ben Larden mhm.. suspicious name, matches at xx% with Bin Laden
How can I do this?
using PHP
names are 100% correct, since they are from official sources
i'm Italian, but i think this won't be a problem, since names are international
names can be composed of several words: Najmiddin Kamolitdinovich JALOLOV
looking for companies and people
I looked at different algorithms: do you think that Levenshtein can do the job?
thank you in advance!
ps i got some problems to format this text, sorry :-)
I'd say your best bet to get this working with PHP's native functions is
soundex() - Calculate the soundex key of a string
levenshtein() - Calculate Levenshtein distance between two strings
metaphone() - Calculate the metaphone key of a string
similar_text() - Calculate the similarity between two strings
Since you are likely matching the names against a database (?), you might also want to check whether your database provides any Name Matching Functions.
Google also provided a PDF with a nice overview on Name Matching Algorithms:
http://homepages.cs.ncl.ac.uk/brian.randell/Genealogy/NameMatching.pdf
The Levenshtein function (http://php.net/manual/en/function.levenshtein.php) can do this:
$string1 = 'Bin Laden';
$string2 = 'Ben Larden';
levenshtein($string1, $string2); // result: 2
Set a threshold on this result and determine if the name looks similar.
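To get the "matches at xx%" figure from the question, the distance can be scaled by the length of the longer name. This is only a sketch: the helper name nameSimilarity is an assumption, and any alert threshold would need tuning against real data.

```php
<?php
// Hypothetical helper: Levenshtein distance scaled to a 0-100 similarity score.
function nameSimilarity(string $a, string $b): float
{
    $a = strtolower(trim($a));
    $b = strtolower(trim($b));
    $longest = max(strlen($a), strlen($b));
    if ($longest === 0) {
        return 100.0; // two empty strings are identical
    }
    return (1 - levenshtein($a, $b) / $longest) * 100;
}

printf("%.0f%%\n", nameSimilarity('Bin Laden', 'Ben Larden')); // 80%
printf("%.0f%%\n", nameSimilarity('Bin Laden', 'bin laden'));  // 100%
```

Note that levenshtein() and strlen() work on bytes, so names with multi-byte characters (accents, non-Latin scripts) would need to be transliterated or handled separately first.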

Creating your own TinyURL

I have just found this great tutorial as it is something that I need.
However, after having a look, it seems that this might be inefficient. The way it works is, first generate a unique key then check if it exists in the database to make sure it really is unique. However, the larger the database gets the slower the function gets, right?
Instead, I was thinking, is there a way to add ordering to this function? So all that has to be done is check the previous entry in the DB and increment the key. So it will always be unique?
function generate_chars()
{
    $num_chars = 4; // length of the random key
    $i = 0;
    $my_keys = "123456789abcdefghijklmnopqrstuvwxyz"; // characters to choose from
    $keys_length = strlen($my_keys);
    $url = "";
    while ($i < $num_chars)
    {
        // Indexes start at 0; starting at 1 would never pick the first character.
        $rand_num = mt_rand(0, $keys_length - 1);
        $url .= $my_keys[$rand_num];
        $i++;
    }
    return $url;
}
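function isUnique($chars)
{
    // Check whether the key is already stored in the database.
    // $link is assumed to be an open mysqli connection; the old mysql_*
    // functions were removed in PHP 7.
    global $link;
    $stmt = mysqli_prepare($link, "SELECT 1 FROM `urls` WHERE `unique_chars` = ? LIMIT 1");
    mysqli_stmt_bind_param($stmt, "s", $chars);
    mysqli_stmt_execute($stmt);
    mysqli_stmt_store_result($stmt);
    // No matching row means the key is still free.
    return mysqli_stmt_num_rows($stmt) === 0;
}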
The tiny url people like to use random tokens because then you can't just troll the tiny url links. "Where does #2 go?" "Oh, cool!" "Where does #3 go?" "Even cooler!" You can type in random characters but it's unlikely you'll hit a valid value.
Since the key is rather sparse (4 values each having 36* possibilities gives you 1,679,616 unique values, 5 gives you 60,466,176) the chance of collisions is small (indeed, it's a desired part of the design) and a good SQL index will make the lookup be trivial (indeed, it's the primary lookup for the url so they optimize around it).
If you really want to avoid the lookup and just use auto-increment, you can create a function that turns an integer into a string of seemingly-random characters with the ability to convert back. So "1" becomes "54jcdn" and "2" becomes "pqmw21". Similar to Base64-encoding, but not using consecutive characters.
(*) I actually like using fewer than 36 characters -- single-cased, no vowels, and no similar characters (1, l, I). This prevents accidental swear words and also makes it easier for someone to speak the value to someone else. I even map similar characters to each other, accepting "0" for "O". If you're entirely machine-based you could use upper and lower case and all digits for even greater possibilities.
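A sketch of such a reversible function, under assumptions of my own: a 28-character alphabet without vowels or look-alike characters, fixed 6-character keys, and a multiplier of 3^18. The ID is first scrambled by modular multiplication and then written in base 28. Decoding reverses both steps. IDs must be smaller than 28^6 = 481,890,304, and 64-bit PHP is assumed.

```php
<?php
// All constants below are assumptions chosen for illustration.
const KEY_ALPHABET   = '23456789bcdfghjkmnpqrstvwxyz'; // 28 symbols: no a/e/i/o/u, no 0, 1 or l
const KEY_WIDTH      = 6;                              // 28^6 = 481,890,304 possible keys
const KEY_MULTIPLIER = 387420489;                      // 3^18: shares no factor with 28

// Modular inverse via the extended Euclidean algorithm.
function modInverse(int $a, int $m): int
{
    [$oldR, $r] = [$a, $m];
    [$oldS, $s] = [1, 0];
    while ($r !== 0) {
        $q = intdiv($oldR, $r);
        [$oldR, $r] = [$r, $oldR - $q * $r];
        [$oldS, $s] = [$s, $oldS - $q * $s];
    }
    return (($oldS % $m) + $m) % $m;
}

function keySpace(): int
{
    return strlen(KEY_ALPHABET) ** KEY_WIDTH;
}

function encodeId(int $id): string
{
    $alphabet = KEY_ALPHABET;
    $base = strlen($alphabet);
    $n = ($id * KEY_MULTIPLIER) % keySpace(); // spread consecutive IDs apart
    $key = '';
    for ($i = 0; $i < KEY_WIDTH; $i++) {
        $key = $alphabet[$n % $base] . $key;  // write digits right to left
        $n = intdiv($n, $base);
    }
    return $key;
}

function decodeKey(string $key): int
{
    $alphabet = KEY_ALPHABET;
    $base = strlen($alphabet);
    $n = 0;
    foreach (str_split($key) as $char) {
        $n = $n * $base + strpos($alphabet, $char);
    }
    // Undo the multiplication with the modular inverse.
    return ($n * modInverse(KEY_MULTIPLIER, keySpace())) % keySpace();
}

echo encodeId(1), ' ', encodeId(2), "\n";
echo decodeKey(encodeId(12345)), "\n"; // 12345
```

The keys only look random. Anyone who learns the alphabet and multiplier can decode them, so this hides the sequence from casual browsing but is not a security measure.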
In the database table, there is an index on the unique_chars field, so I don't see why that would be slow or inefficient.
UNIQUE KEY `unique_chars` (`unique_chars`)
Don't rush to do premature optimization on something that you think might be slow.
Also, there may be some benefit in a url shortening service that generates random urls instead of sequential urls.
I don't know why you'd bother. The premise of the tutorial is to create a "random" URL. If the random space is large enough, then you can simply rely on pure, dumb luck. If your random character space is 62 characters (A-Za-z0-9), then the chance of a conflict for the 4 characters they use, given a reasonable random number generator, is 1 in 62^4, which is 1 in 14,776,336. Five characters is 1 in 916,132,832. So, a conflict is, literally, "1 in a billion".
Obviously, as the table fills up, the chance of a collision increases.
With 10,000 documents and five characters, it's 1 in 91,613, almost 1 in 100,000 (for round numbers).
That means, for every new document, you have a 1 in 91,613 chance of hitting the DB again for another pull on the slot machine.
It is not deterministic. It's random. It's luck. In theory, you can hit a string of really, really, bad luck and just get collision after collision after collision. Also, it WILL, eventually, fill up. How many URLs do you plan on hashing?
But if 1 in 91,613 odds isn't good enough, boosting it to 6 chars makes it more than 1 in 5M for 10,000 documents. We're talking almost LOTTO odds here.
Simply put, make the key big enough (7 characters? 8?) and the problem pretty much "wishes" itself out of existence.
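The figures above can be reproduced with a few lines; the helper name collisionOdds is made up for this example. It divides the size of the key space by the number of keys already stored.

```php
<?php
// Hypothetical helper: approximate "1 in N" odds that a freshly generated
// random key collides with one of the keys already stored.
function collisionOdds(int $alphabetSize, int $keyLength, int $existingKeys): int
{
    return intdiv($alphabetSize ** $keyLength, $existingKeys);
}

echo collisionOdds(62, 5, 10000), "\n"; // 91613
echo collisionOdds(62, 6, 10000), "\n"; // 5680023
```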
Couldn't you encode the URL as Base36 when it's generated, and then decode it when visited - that would allow you to remove the database completely?
A snippet from Channel9:
The formula is simple, just turn the Entry ID of our post, which is a long into a short string by Base-36 encoding it and then stick 'http://ch9.ms/' onto the front of it. This produces reasonably short URLs, and can be computed at either end without any need for a database look up. The result, a URL like http://ch9.ms/A49H is then used in creating the twitter link.
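In PHP the same round trip can be sketched with base_convert. The function names and the entry ID 472085 are made up for the example; that ID happens to encode to the A49H shown in the quote.

```php
<?php
// Hypothetical helpers: Base-36 encode an entry ID and decode it again.
function shortCode(int $entryId): string
{
    // base_convert() works on strings and returns lower-case letters.
    return strtoupper(base_convert((string) $entryId, 10, 36));
}

function entryId(string $shortCode): int
{
    return (int) base_convert(strtolower($shortCode), 36, 10);
}

echo 'http://ch9.ms/' . shortCode(472085), "\n"; // http://ch9.ms/A49H
echo entryId('A49H'), "\n";                      // 472085
```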
I solved a similar problem by implementing an algorithm that generates serial numbers one by one in base36. I used my own ordering of the base36 characters, all of which are unique. Since it generates the numbers serially, I did not have to worry about duplication. The complexity and apparent randomness of the numbers depend on the ordering of the base36 characters... and that only matters to the public, because to my application they are just serial numbers :)
Check out this guy's functions - http://www.pgregg.com/projects/php/base_conversion/base_conversion.php source - http://www.pgregg.com/projects/php/base_conversion/base_conversion.inc.phps
You can use any base you like, for example to convert 554512 to base 62, call
$tiny = base_base2base(554512, 10, 62); and that evaluates to $tiny = '2KFk'.
So, just pass in the unique id of the database record.
In a project where I used this, I removed a few characters from the $sChars string and am using base 58. You can also rearrange the characters in the string if you want the values to be less easy to guess.
You could of course add ordering by simply numbering the urls:
http://mytinyfier.com/1
http://mytinyfier.com/2
and so on. But if the hash key is indexed in the database (which it obviously should be), the performance boost would be minimal at best.
I wouldn't bother doing ordered enumeration for two reasons:
1) SQL servers are very effective at checking such hash collisions (given correct indexes)
2) That might hurt privacy, as users would be able to easily figure out what other users are tinyurl-ing.
Use autoincrement on the database, and get the latest id as described by http://www.acuras.co.uk/articles/24-php-use-mysqlinsertid-to-get-the-last-entered-auto-increment-value
Perhaps this is a bit off-answer, but my general rule for creating always-unique keys is simple: md5( time() * 100 + rand( 0, 100 ) ); There is a one in 100,000 chance that two people using the same service in the same second will get the same result (nigh impossible).
That said, md5( rand( 0, n ) ) works too.
That might work, but the easiest way to accomplish the problem would probably be with hashing. Theoretically speaking, hashing runs in O(1) time, as in, it only has to perform the hash, and then does only one actual hit to the database to retrieve the value. Then, you would introduce complications for checking for hash collisions, but it seems like this is probably what most of the tinyurl providers do. And, a good hash function isn't terribly hard to write.
I have also created small tinyurl service.
I wrote a Python script that generates keys and stores them in a MySQL table named tokens with status U (unused).
It runs offline: I have a cron job on my VPS that runs the script every 10 minutes. The script checks whether there are fewer than 1000 keys in the table and, if so, keeps generating keys and inserting them, provided they don't already exist in the table, until the key count reaches 1000.
For my service, 1000 keys every 10 minutes is more than enough; you can adjust the interval or the number of keys generated to suit your needs.
Now, when a tiny URL needs to be created on my website, my PHP script just fetches any unused key from the table and marks its status as T (taken). The PHP script does not have to worry about uniqueness, because the Python script has already populated the table with unique keys only.
Couldn't you just trim the hash to the length you wish?
$tinyURL = substr(md5($longURL . time()),0,4);
Granted, this may not provide as much pseudo randomness as using the entire string length. But, if you hash the long URL concatenated with the time(), wouldn't this be sufficient? Thoughts on using this method? Thanks!
