PHP: speed up levenshtein comparison, 10k+ records

In my MySQL table I have the field name, which is unique. However, the contents of the field are gathered from different places, so spelling errors can leave me with two records with very similar names instead of the second one being discarded.
Now I want to find the entries that are very similar to another one. To do that I loop through all my records and compare each name to the other entries by looping through all the records again. The problem is that there are over 15k records, which takes far too much time. Is there a way to do this faster?
This is my code:
for($x=0;$x<count($serie1);$x++)
{
    for($y=0;$y<count($serie2);$y++)
    {
        $sim=levenshtein($serie1[$x]['naam'],$serie2[$y]['naam']);
        if($sim==1)
            print("{$serie1[$x]['naam']} --> {$serie2[$y]['naam']} = {$sim}<br>");
    }
}

A preamble: such a task will always be time consuming, and there will always be some pairs that slip through.
Nevertheless, a few ideas:
1. Actually, the algorithm can be improved (a bit)
Assuming that $serie1 and $serie2 hold the same values in the same order, you don't need to loop over the whole second array in the inner loop every time. In this use case you only need to evaluate each pair of values once: levenshtein('a', 'b') is sufficient, you don't need levenshtein('b', 'a') as well (nor do you need levenshtein('a', 'a')).
Under these assumptions, you can write your loop like this:
for($x=0;$x<count($serie1);$x++)
{
    for($y=$x+1;$y<count($serie2);$y++) // <-- $y doesn't need to start at 0
    {
        $sim=levenshtein($serie1[$x]['naam'],$serie2[$y]['naam']);
        if($sim==1)
            print("{$serie1[$x]['naam']} --> {$serie2[$y]['naam']} = {$sim}<br>");
    }
}
2. Maybe MySQL is faster
There are examples on the net of levenshtein() implemented as a MySQL function. An example on SO is here: How to add levenshtein function in mysql?
If you are comfortable with complex(ish) SQL, you could delegate the heavy lifting to MySQL and at least gain a bit of performance because you aren't fetching all 15k+ rows into the PHP runtime.
3. Don't do everything at once / save your results
Of course you have to run the function once for every record, but after the initial run you only have to check the entries added since the last run. Schedule a cron job that checks all new records once every day/week/month. You would need an inserted_at column in your table, and you would still need to compare the new names with every other name entry.
3.5 Do some of the work on insert
a) If the wait is acceptable, run the check whenever a new record is about to be inserted, so that you can either write it to a log or give direct feedback to the user. (A tangent: this could be a good use case for an asynchronous task queue like http://gearman.org/ -> start a new process for the check in the background and return the success message for the insert immediately.)
b) PHP has two other functions that help with finding similar strings: metaphone() and soundex(). These functions generate keys that represent how a string sounds when spoken. You could generate one or both of these keys on each insert, store them in a separate field in your table and use simple SQL comparisons to find records with matching keys.
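As a rough sketch of idea 3.5 b), the names can be bucketed by their metaphone() key so that only names sharing a key need a closer look with levenshtein(). The sample names are made up for illustration, and the bucketing happens in a PHP array here rather than in a stored table column.

```php
<?php
// Hypothetical sample data standing in for the `naam` column.
$names = ['Breaking Bad', 'Braking Bad', 'Lost', 'Dexter'];

// Bucket the names by their phonetic key.
$buckets = [];
foreach ($names as $name) {
    $buckets[metaphone($name)][] = $name;
}

// Only buckets holding more than one name can contain near-duplicates.
$candidates = array_filter($buckets, function (array $group): bool {
    return count($group) > 1;
});

foreach ($candidates as $key => $group) {
    echo $key, ': ', implode(' | ', $group), PHP_EOL;
}
```

Phonetic keys miss typos that change how a word sounds, so this narrows the search rather than replacing the levenshtein() pass.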

The trouble with levenshtein is it only compares string a to string b. I built a spelling corrector once that puts all the strings a into a big trie, and that functioned as a dictionary. Then it would look up any string b in that dictionary, finding all nearest-matching words. I did it first in Fortran (!), then in Pascal. It would be easiest in a more modern language, but I suspect php would not make it easy. Look here.

Related

Does anyone know an algorithm to keep two or more datetimes from colliding in PHP?

I want to find the correct algorithm to avoid collisions between datetime variables.
For example, we don't want one wedding to collide with another one in the same hour.

You have to make two checks because, from what you've stated, you don't know which of the two events starts first. So, this is what you do in mysql, assuming that you've joined two rows and one is called a and one is called b and they each have a datetime field for start and end:
select ...your query...
where (
a.end<=b.start or b.end<=a.start
) ... continue your query here if you want...
I changed this. You don't want ANY overlap, so all you need to do is ensure that one ends before the other one starts. That simplifies the query.
The same logic is used if you have PHP variables. I'll use $a_start, $a_end, $b_start, and $b_end as an example:
if($a_end<=$b_start || $b_end<=$a_start) // You are good!
I used <=, so one event can end at the exact moment another starts. Use just < if you don't want them to touch at all.
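The same check can be sketched with PHP's date objects, which compare directly with <=. The function name and the wedding times below are made up for illustration.

```php
<?php
// True when the two events do not overlap; an event may start
// at the exact moment the other one ends (<=).
function noCollision(
    DateTimeInterface $aStart,
    DateTimeInterface $aEnd,
    DateTimeInterface $bStart,
    DateTimeInterface $bEnd
): bool {
    return $aEnd <= $bStart || $bEnd <= $aStart;
}

// Hypothetical wedding slots.
$w1Start = new DateTimeImmutable('2024-06-01 14:00');
$w1End   = new DateTimeImmutable('2024-06-01 15:00');
$w2Start = new DateTimeImmutable('2024-06-01 14:30');
$w2End   = new DateTimeImmutable('2024-06-01 15:30');

var_dump(noCollision($w1Start, $w1End, $w2Start, $w2End)); // bool(false): the slots overlap
```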

MySQLi query vs PHP Array, which is faster?

I'm developing an algorithm for intense calculations on multiple huge arrays. Right now I use PHP arrays to do the job, but it seems slower than I need it to be. I was thinking of using MySQLi tables, converting the PHP arrays into database rows and then starting the calculations, to solve the speed issue.
At the very first step, when I was converting a 20*10 PHP array into 200 database rows containing zeros, it took a long time. Here is the code (basically, it generates a zero matrix, if you're interested to know):
$stmt = $mysqli->prepare("INSERT INTO `table` (`Row`, `Col`, `Value`) VALUES (?, ?, '0')");
for($i=0;$i<$rowsNo;$i++){
    for($j=0;$j<$colsNo;$j++){
        //$myArray[$j]=array_fill(0,$colsNo,0);
        $stmt->bind_param("ii", $i, $j);
        $stmt->execute();
    }
}
$stmt->close();
The commented-out line "$myArray[$j]=array_fill(0,$colsNo,0);" generates the array very fast, while filling the table in the next two lines took much longer.
Array time: 0.00068 seconds
MySQLi time: 25.76 seconds
There is a lot more calculating left to do, and I am worried that even after modifying numerous parts it may get worse. I searched a lot but couldn't find any answer on whether arrays or MySQL tables are the better choice. Has anybody done, or does anybody know of, a benchmark on this?
I really appreciate any help.
Thanks in advance
UPDATE:
I did the following test for a 273*273 matrix. I created two versions of the same data: first a two-dimensional PHP array, and second a table with 273*273=74529 rows, both containing the same data. These are the speed test results for retrieving the same data from both (here, finding out which column(s) of a certain row have a value equal to 1, while the other columns are zero):
It took 0.00021 seconds for the array.
It took 0.0026 seconds for mysqli table. (more than 10 times slower)
My conclusion is sticking to the arrays instead of converting them into database tables.
Last thing to say, in case the mentioned data is stored in the database table in the first place, generating an array and then using it would be much much slower as shown below (slower due to data retrieval from database):
It took 0.9 seconds for the array. (more than 400 times slower)
It took 0.0021 seconds for mysqli table.

The main reason is not that the database itself is slower. The main reason is that the database accesses the hard drive to store data, while PHP array functions work only in RAM, which is faster than the hard drive.

Although there is a way to speed up your insert queries (most likely you are using an InnoDB table without a transaction), the very premise of the question is wrong.
A database is intended, in the first place, to store data. To store it permanently. It does that well. It can do calculations too, but again, before doing any calculations there is one necessary step: storing the data.
If you want to do your calculations on stored data, it's OK to use a database.
If you want to push your data into a database only to do calculations on it, it does not make much sense.
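The insert speed-up hinted at above usually comes from wrapping all inserts in a single transaction, so that InnoDB does not commit after every row. A minimal sketch, assuming an open mysqli connection and the table from the question; the function name fillZeroMatrix is made up for illustration.

```php
<?php
// Hypothetical helper: inserts a $rowsNo x $colsNo zero matrix inside one transaction.
function fillZeroMatrix(mysqli $mysqli, int $rowsNo, int $colsNo): void
{
    $stmt = $mysqli->prepare(
        "INSERT INTO `table` (`Row`, `Col`, `Value`) VALUES (?, ?, '0')"
    );
    $i = 0;
    $j = 0;
    // Parameters are bound by reference, so updating $i and $j is enough.
    $stmt->bind_param("ii", $i, $j);

    $mysqli->begin_transaction();
    try {
        for ($i = 0; $i < $rowsNo; $i++) {
            for ($j = 0; $j < $colsNo; $j++) {
                $stmt->execute();
            }
        }
        $mysqli->commit();   // one commit for all rows
    } catch (Throwable $e) {
        $mysqli->rollback(); // undo the partial insert
        throw $e;
    } finally {
        $stmt->close();
    }
}
```

How much this helps depends on the storage engine and server settings, so it is worth timing it against the original loop.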

In my case, as shown on the update part of the question, I think arrays have better performance than mysql databases.
Array usage showed 10 times faster response even when I search through the cells to find desired values in a row. Even good indexing of the table couldn't beat the array functionality and speed.

Is my random generation method flawed? (PHP)

I have this small internal project that occasionally inserts entries into a MySQL database. I have a column named "idChar" where I set its value to a randomly generated string of length 31 using 62 possible characters.
Today I discovered that a new entry happened to have exactly the same idChar as an entry from several months ago. I am now checking for duplicate entries before saving them, but this made me think about the odds of this happening, and I am curious to know if my implementation of generating these random keys is flawed. Getting a duplicate should be roughly 1 in 62^31, right?
function getCode($len)
{
    //$len = 10;
    $base='ABCDEFGHIJKLMNOPQRSTWXYZabcdefghijklmnopqrstwxyz123456789';
    $max=strlen($base)-1;
    $linkCode='';
    mt_srand((double)microtime()*1000000);
    while (strlen($linkCode)<$len+1)
        $linkCode.=$base[mt_rand(0,$max)]; // [] instead of {}: curly-brace offsets were removed in PHP 8
    return $linkCode;
}
$idChar=getCode(30);
//code to insert into MySQL here

The odds of getting a duplicate would be calculated as per the birthday problem, because that's how you calculate the chance of collisions for the output of a function that produces a randomly chosen output from a discrete codomain. In effect you want to calculate the chance that among a pool of selections made randomly any two selections are the same.
You should also completely drop the mt_srand call as it is not necessary, and it's likely to provide worse seeds than what PHP will do automatically. Consider that the output of microtime (in my system at least) is like
0.29574400 1348356024
which means that you only have 1 million different seeds available as the last two digits of the float are always zeroes and the (double)microtime() cast completely ignores the seconds part (it would be a lousy seed anyway).
Assuming that the random number generator produces the same sequence of random numbers whenever it is seeded with the same seed then in effect you only have 1 million possible random codes instead of 62^31 -- quite a decrease! Fortunately it is documented that this does not happen on PHP 5.2.1 onwards.
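To put numbers on the birthday problem mentioned above, here is a small sketch using the usual approximation p ≈ 1 - exp(-n(n-1)/(2N)) for n random picks out of N equally likely values. The function name and the figure of 1000 stored entries are assumptions for illustration.

```php
<?php
// Approximate probability that at least two of $n random picks from
// $space equally likely values coincide (birthday-problem approximation).
function collisionProbability(float $n, float $space): float
{
    // -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -expm1(-($n * ($n - 1)) / (2 * $space));
}

// 1000 stored codes (hypothetical) drawn from the full 62^31 key space...
echo collisionProbability(1000, 62 ** 31), PHP_EOL;
// ...versus only one million distinct codes, as with the microtime seed.
echo collisionProbability(1000, 1000000), PHP_EOL;
```

With the full key space the result is vanishingly small; with only a million possible codes it becomes a very real risk after a modest number of entries.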

MySql speed of executing max(), min(), sum() on relatively large database

I have a relatively large database (130,000+ rows) of weather data, which is accumulating very fast (a new row is added every 5 minutes). On my website I publish min/max data for the day, and for the entire existence of my weather station (which is around 1 year).
Now I would like to know whether I would benefit from creating additional tables where these min/max data would be stored, rather than letting PHP run a MySQL query that searches for the day's min/max data and the min/max data for the entire existence of my weather station. Would a query for max(), min() or sum() (I need sum() to add up rain accumulation for months) take that much longer than a simple query on a table that already holds those min, max and sum values?

That depends on whether your columns are indexed or not. In the case of MIN() and MAX() you can read the following in the MySQL manual:
"MySQL uses indexes for these operations: To find the MIN() or MAX() value for a specific indexed column key_col. This is optimized by a preprocessor that checks whether you are using WHERE key_part_N = constant on all key parts that occur before key_col in the index. In this case, MySQL does a single key lookup for each MIN() or MAX() expression and replaces it with a constant."
In other words in case that your columns are indexed you are unlikely to gain much performance benefits by denormalization. In case they are NOT you will definitely gain performance.
As for SUM() it is likely to be faster on an indexed column but I'm not really confident about the performance gains here.
Please note that you should not be tempted to index your columns after reading this post. If you put indices your update queries will slow down!

Yes, denormalization should help performance a lot in this case.
There is nothing wrong with storing calculations for historical data that will not change in order to gain performance benefits.

While I agree with RedFilter that there is nothing wrong with storing historical data, I don't agree with the performance boost you will get. Your database is not what I would consider a heavy use database.
One of the major advantages of databases is indexes. They use advanced data structures to make data access lightning fast. Just think, every primary key you have is an index. You shouldn't be afraid of them. Of course, it would probably be counterproductive to index all your fields, but that should never really be necessary. I would suggest researching indexes more to find the right balance.
As for the work done when a change happens, it is not that bad. An index is a tree-like representation of your field data. This is done to reduce a search to a small number of near-binary decisions.
For example, think of finding a number between 1 and 100. Normally you would randomly stab at numbers, or you would just start at 1 and count up. This is slow. Instead, it would be much faster if you set it up so that you could ask if you were over or under when you choose a number. Then you would start at 50 and ask if you are over or under. Under, then choose 75, and so on till you found the number. Instead of possibly going through 100 numbers, you would only have to go through around 6 numbers to find the correct one.
The problem here is when you add 50 numbers and make it out of 1 to 150. If you start at 50 again, your search is less optimized as there are 100 numbers above you. Your binary search is out of balance. So, what you do is rebalance your search by starting at the mid-point again, namely 75.
So the work a database does is just an adjustment to rebalance the mid-point of its index. It isn't actually a lot of work. If you are working on a database that is large and requires many changes a second, you would definitely need a strong strategy for your indexes. In a small database that gets very few changes, like yours, it's not a problem.
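The number-guessing analogy is a binary search. A minimal sketch (the function name is made up) that counts how many guesses are needed for a number between 1 and 100:

```php
<?php
// Counts the "over or under" guesses needed to find $target in [$low, $high],
// always guessing the mid-point of the remaining range.
function guessesNeeded(int $target, int $low = 1, int $high = 100): int
{
    $guesses = 0;
    while ($low <= $high) {
        $mid = intdiv($low + $high, 2);
        $guesses++;
        if ($mid === $target) {
            break;
        }
        if ($mid < $target) {
            $low = $mid + 1;   // guess was under: search the upper half
        } else {
            $high = $mid - 1;  // guess was over: search the lower half
        }
    }
    return $guesses;
}

// Worst case over every possible target.
$worst = max(array_map('guessesNeeded', range(1, 100)));
echo "Worst case: {$worst} guesses", PHP_EOL;
```

No target between 1 and 100 needs more than 7 guesses, compared with up to 100 when counting up from 1.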

Creating your own TinyURL

I have just found this great tutorial as it is something that I need.
However, after having a look, it seems that this might be inefficient. The way it works is, first generate a unique key then check if it exists in the database to make sure it really is unique. However, the larger the database gets the slower the function gets, right?
Instead, I was thinking, is there a way to add ordering to this function? So all that has to be done is check the previous entry in the DB and increment the key. So it will always be unique?
function generate_chars()
{
    $num_chars = 4; //max length of random chars
    $i = 0;
    $my_keys = "123456789abcdefghijklmnopqrstuvwxyz"; //keys to be chosen from
    $keys_length = strlen($my_keys);
    $url = "";
    while($i<$num_chars)
    {
        $rand_num = mt_rand(0, $keys_length-1); //start at 0 so the first key can be chosen too
        $url .= $my_keys[$rand_num];
        $i++;
    }
    return $url;
}
function isUnique($chars)
{
    //check the uniqueness of the chars
    //assumes $link is an open mysqli connection (the mysql_* functions were removed in PHP 7)
    global $link;
    $stmt = $link->prepare("SELECT 1 FROM `urls` WHERE `unique_chars` = ?");
    $stmt->bind_param("s", $chars);
    $stmt->execute();
    $stmt->store_result();
    $unique = ($stmt->num_rows === 0);
    $stmt->close();
    return $unique;
}

The tiny url people like to use random tokens because then you can't just troll the tiny url links. "Where does #2 go?" "Oh, cool!" "Where does #3 go?" "Even cooler!" You can type in random characters but it's unlikely you'll hit a valid value.
Since the key is rather sparse (4 values each having 36* possibilities gives you 1,679,616 unique values, 5 gives you 60,466,176) the chance of collisions is small (indeed, it's a desired part of the design) and a good SQL index will make the lookup be trivial (indeed, it's the primary lookup for the url so they optimize around it).
If you really want to avoid the lookup and just use auto-increment, you can create a function that turns an integer into a string of seemingly random characters, with the ability to convert back. So "1" becomes "54jcdn" and "2" becomes "pqmw21". Similar to Base64 encoding, but not using consecutive characters.
(*) I actually like using fewer than 36 characters: single-cased, no vowels, and no similar characters (1, l, I). This prevents accidental swear words and also makes it easier for someone to speak the value to someone else. I even map similar characters to each other, accepting "0" for "O". If you're entirely machine-based you could use upper and lower case and all digits for even more possibilities.
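A minimal sketch of the reversible integer-to-string idea, assuming a hand-shuffled alphabet without vowels or look-alike characters. The alphabet and function names are made up for illustration. Note that this only reorders the digits of a base conversion, so consecutive IDs still share a prefix; output like "54jcdn" for 1 would need additional scrambling.

```php
<?php
// Hypothetical alphabet: no vowels, no look-alikes (0/o, 1/l/i), shuffled by hand.
const ALPHABET = 'k7d3xq9zb5tnc2rw8mh4jsg6pfv';

// Turns a non-negative integer ID into a short token.
function encodeId(int $id): string
{
    $base = strlen(ALPHABET);
    $out = '';
    do {
        $out = ALPHABET[$id % $base] . $out;
        $id = intdiv($id, $base);
    } while ($id > 0);
    return $out;
}

// Turns a token produced by encodeId() back into the integer ID.
function decodeId(string $token): int
{
    $base = strlen(ALPHABET);
    $id = 0;
    foreach (str_split($token) as $char) {
        $id = $id * $base + strpos(ALPHABET, $char);
    }
    return $id;
}

echo encodeId(123456), ' -> ', decodeId(encodeId(123456)), PHP_EOL;
```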

In the database table, there is an index on the unique_chars field, so I don't see why that would be slow or inefficient.
UNIQUE KEY `unique_chars` (`unique_chars`)
Don't rush to do premature optimization on something that you think might be slow.
Also, there may be some benefit in a url shortening service that generates random urls instead of sequential urls.

I don't know why you'd bother. The premise of the tutorial is to create a "random" URL. If the random space is large enough, then you can simply rely on pure, dumb luck. If your random character space is 62 characters (A-Za-z0-9), then the chance of a conflict for the 4 characters they use, given a reasonable random number generator, is 1 in 62^4, which is 1 in 14,776,336. Five characters is 1 in 916,132,832. So, a conflict is, literally, "1 in a billion".
Obviously, as the documents fill, your odds increase for the chance of a collision.
With 10,000 documents, it's 1 in 91,613, almost 1 in 100,000 (for round numbers).
That means, for every new document, you have a 1 in 91,613 chance of hitting the DB again for another pull on the slot machine.
It is not deterministic. It's random. It's luck. In theory, you can hit a string of really, really, bad luck and just get collision after collision after collision. Also, it WILL, eventually, fill up. How many URLs do you plan on hashing?
But if 1 in 91,613 odds isn't good enough, boosting it to 6 chars makes it more than 1 in 5M for 10,000 documents. We're talking almost LOTTO odds here.
Simply put, make the key big enough (7 characters? 8?) and the problem pretty much "wishes" itself out of existence.
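The figures above can be reproduced with a few lines. This is a sketch assuming a 62-character alphabet, 10,000 existing documents and 64-bit PHP integers; the function name is made up.

```php
<?php
// The chance that one freshly generated key is already taken is
// $existing / ($alphabetSize ^ $length). Returns X for "1 in X" odds.
// Assumes 64-bit integers so the power does not overflow into a float.
function oneInOdds(int $alphabetSize, int $length, int $existing): int
{
    return intdiv($alphabetSize ** $length, $existing);
}

foreach ([4, 5, 6, 7, 8] as $length) {
    printf(
        "%d chars: 1 in %s\n",
        $length,
        number_format(oneInOdds(62, $length, 10000))
    );
}
```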

Couldn't you encode the URL as Base36 when it's generated, and then decode it when visited - that would allow you to remove the database completely?
A snippet from Channel9:
"The formula is simple, just turn the Entry ID of our post, which is a long into a short string by Base-36 encoding it and then stick 'http://ch9.ms/' onto the front of it. This produces reasonably short URLs, and can be computed at either end without any need for a database look up. The result, a URL like http://ch9.ms/A49H is then used in creating the twitter link."
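In PHP the Base-36 step described in the quote can be sketched with the built-in base_convert(), which supports bases up to 36. The entry ID below is a made-up example value.

```php
<?php
// Hypothetical numeric ID of a stored entry (e.g. an auto-increment key).
$entryId = 554512;

// Encode: decimal -> base 36. base_convert() works on strings.
$short = base_convert((string) $entryId, 10, 36);

// Decode: base 36 -> decimal, then cast back to an integer.
$decoded = (int) base_convert($short, 36, 10);

echo 'http://ch9.ms/' . $short, PHP_EOL;
```

This removes the uniqueness check, but the entry ID still has to map to the stored long URL somewhere. base_convert() can also lose precision on very large numbers, so it suits ordinary auto-increment IDs best.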

I solved a similar problem by implementing an algorithm that generates serial numbers one by one in base36. I used my own ordering of the base36 characters, all of which are unique. Since it generates the numbers serially, I did not have to worry about duplicates. The complexity and apparent randomness of the number depend on the ordering of the base36 characters... and that is only for the public, because to my application they are simply serial numbers :)

Check out this guy's functions - http://www.pgregg.com/projects/php/base_conversion/base_conversion.php source - http://www.pgregg.com/projects/php/base_conversion/base_conversion.inc.phps
You can use any base you like, for example to convert 554512 to base 62, call
$tiny = base_base2base(554512, 10, 62); and that evaluates to $tiny = '2KFk'.
So, just pass in the unique id of the database record.
In a project I used this in, I removed a few characters from the $sChars string and am using base 58. You can also rearrange the characters in the string if you want the values to be harder to guess.

You could of course add ordering by simply numbering the urls:
http://mytinyfier.com/1
http://mytinyfier.com/2
and so on. But if the hash key is indexed in the database (which it obviously should be), the performance boost would be minimal at best.

I wouldn't bother doing ordered enumeration for two reasons:
1) SQL servers are very effective at checking such hash collisions (given correct indexes)
2) That might hurt privacy, as users would be able to easily figure out what other users are tinyurl-ing.

Use autoincrement on the database, and get the latest id as described by http://www.acuras.co.uk/articles/24-php-use-mysqlinsertid-to-get-the-last-entered-auto-increment-value

Perhaps this is a bit off-answer, but my general rule for creating always-unique keys is simple: md5( time() * 100 + rand( 0, 100 ) ); There is a one in 100,000 chance that if two people are using the same service at the same second they will get the same result (nigh impossible).
That said, md5( rand( 0, n ) ) works too.

That might work, but the easiest way to accomplish the problem would probably be with hashing. Theoretically speaking, hashing runs in O(1) time, as in, it only has to perform the hash, and then does only one actual hit to the database to retrieve the value. Then, you would introduce complications for checking for hash collisions, but it seems like this is probably what most of the tinyurl providers do. And, a good hash function isn't terribly hard to write.

I have also created a small tinyurl service.
I wrote a script in Python that generates keys and stores them in a MySQL table named tokens with status U (unused).
But I am doing it offline. I have a cron job on my VPS that runs the script every 10 minutes. The script checks whether there are fewer than 1000 keys in the table; if so, it keeps generating keys and inserting those that do not already exist in the table until the key count is back up to 1000.
For my service, 1000 keys per 10 minutes is more than enough; you can set the interval or the number of keys generated according to your needs.
Now when a tiny url needs to be created on my website, my PHP script just fetches any unused key from the table and marks its status as T (taken). The PHP script does not have to worry about uniqueness, as my Python script has already populated the table with unique keys only.

Couldn't you just trim the hash to the length you wish?
$tinyURL = substr(md5($longURL . time()),0,4);
Granted, this may not provide as much pseudo randomness as using the entire string length. But, if you hash the long URL concatenated with the time(), wouldn't this be sufficient? Thoughts on using this method? Thanks!
