High performance PHP similarity checking on large database - php

I have 30,000 rows in a database that need to be similarity checked (using similar_text or another such function).
Doing this will require 30,000^2 checks for each column.
I estimate I will be checking on average 4 columns.
This means I will have to do 3,600,000,000 checks.
What is the best (fastest, and most reliable) way to do this with PHP, bearing in mind request memory limits and time limits etc?
The server still needs to actively serve webpages while doing this.
PS. The server we are using is an 8 core Xeon 32 GB ram.
Edit:
The size of each column is normally less than 50 characters.

I guess you just need FULL TEXT search.
If that doesn't fit, you have only one option: cache the results.
That way you will not have to run 3.6 billion comparisons on each request.
Anyway, here is how you can do it:
$result = array();
$sql = "SELECT * FROM TABLE";
while( $row = ... ) {
$result[] = $row; //> Append the current record
}
Now $result contains all the rows from your table.
At this point you said you want to similar_text() all columns with each other.
To do that and cache the results you need at least a table (as I said in the comment).
//> Start calculating the similarity
foreach($result as $k=>$v) {
    foreach($result as $k2=>$v2) {
        //> At this point you have 2 rows, $v and $v2, containing your columns
        $similarity = 0;
        $similarity += levenshtein($v['column1'],$v2['column1']);
        $similarity += levenshtein($v['column2'],$v2['column2']);
        //> Whatever comparison you need here between columns
        //> Now you can finally store the result by inserting the $similarity in a table
        $sql = "INSERT DELAYED INTO similarity (value) VALUES ('$similarity')";
    }
}
Two things to notice:
I used levenshtein() because it is much faster than similar_text(). Note that its value is the inverse of similar_text(): the greater the value levenshtein() returns, the less similar the strings are.
I used INSERT DELAYED to greatly lower the database cost. (Note that INSERT DELAYED is deprecated as of MySQL 5.6 and is treated as a plain INSERT by later versions.)

oy... similar_text() is O(n^3) !
do you really need a percentage similarity for each comparison or can you just do a quick compare of the first/middle/last X bytes of the strings to narrow the field?
if you're just looking for dups say... you can probably narrow down the number of comparisons you need to do, and that will be the most effective tack imho.
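A minimal sketch of that kind of narrowing: a cheap pre-check that runs before the expensive similar_text() or levenshtein() call. The function name worthComparing() and the two thresholds are hypothetical, chosen only for illustration. The trade-off is that pairs which differ in their first few bytes are never compared.

```php
<?php
// Cheap pre-check: only pairs that pass are worth an expensive comparison.
// $prefixLen and $maxLenDiff are illustrative knobs, not values from the question.
function worthComparing(string $a, string $b, int $prefixLen = 3, int $maxLenDiff = 5): bool
{
    // Strings whose lengths differ a lot cannot be near-duplicates.
    if (abs(strlen($a) - strlen($b)) > $maxLenDiff) {
        return false;
    }
    // Quick, case-insensitive compare of the first $prefixLen bytes.
    return strncasecmp($a, $b, $prefixLen) === 0;
}
```

In the nested loop you would call worthComparing() first and skip the pair when it returns false.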

Related

php : speed up levenshtein comparing, 10k + records

In my MySQL table I have the field name, which is unique. However, the contents of the field are gathered from different places, so because of spelling errors I can end up with two records with very similar names instead of the second one being discarded.
Now I want to find the entries that are very similar to another one. To do that I loop through all my records and compare each name to the other entries by looping through all the records again. The problem is that there are over 15k records, which takes far too much time. Is there a way to do this faster?
This is my code:
for($x=0;$x<count($serie1);$x++)
{
    for($y=0;$y<count($serie2);$y++)
    {
        $sim=levenshtein($serie1[$x]['naam'],$serie2[$y]['naam']);
        if($sim==1)
            print("{$serie1[$x]['naam']} --> {$serie2[$y]['naam']} = {$sim}<br>");
    }
}
A preamble: such a task will always be time-consuming, and there will always be some pairs that slip through.
Nevertheless, a few ideas:
1. Actually, the algorithm can be (a bit) improved
Assuming that $serie1 and $serie2 have the same values in the same order, you don't need to loop over the whole second array in the inner loop every time. In this use case you only need to evaluate each pair of values once: levenshtein('a', 'b') is sufficient, you don't need levenshtein('b', 'a') as well (and neither do you need levenshtein('a', 'a')).
Under these assumptions, you can write your loops like this:
for($x=0;$x<count($serie1);$x++)
{
    for($y=$x+1;$y<count($serie2);$y++) // <-- $y doesn't need to start at 0
    {
        $sim=levenshtein($serie1[$x]['naam'],$serie2[$y]['naam']);
        if($sim==1)
            print("{$serie1[$x]['naam']} --> {$serie2[$y]['naam']} = {$sim}<br>");
    }
}
2. Maybe MySQL is faster
There are examples on the net of levenshtein() implemented as a MySQL function. An example on SO is here: How to add levenshtein function in mysql?
If you are comfortable with complex(ish) SQL, you could delegate the heavy lifting to MySQL and gain at least a bit of performance, because you aren't fetching the whole table (15k+ rows) into the PHP runtime.
3. Don't do everything at once / save your results
Of course you have to run the function once for every record, but after the initial run you only have to check the entries that are new since the last run. Schedule a cron job that checks all new records once every day/week/month. You would need an inserted_at column in your table, and you would still need to compare the new names with every other name entry.
3.5 Do some of the work on insert
a) If the wait is acceptable, run the check whenever a new record is about to be inserted, so that you can either write the result to a log or give direct feedback to the user. (A tangent: this could be a good use case for an asynchronous task queue like http://gearman.org/ -> start a new process for the check in the background and return the success message for the insert immediately.)
b) PHP has two other functions that help with finding nearly identical strings: metaphone() and soundex(). These functions generate phonetic keys that represent how a string sounds when spoken. You could generate one or both of these keys on each insert, store them in a separate field in your table, and use simple SQL comparisons to find records with matching keys.
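As a rough sketch of idea b), the names can be bucketed by their soundex() key so that levenshtein() only runs inside each bucket. The helper names groupByPhoneticKey() and similarPairs() are hypothetical, and the bucketing is done in PHP here only to keep the example self-contained; in the setup described above the key would live in its own table column. Typos that change the phonetic key (for example a wrong first letter) will be missed.

```php
<?php
// Bucket names by phonetic key so only names in the same bucket are compared.
function groupByPhoneticKey(array $names): array
{
    $buckets = [];
    foreach ($names as $name) {
        $buckets[soundex($name)][] = $name;
    }
    return $buckets;
}

// Return [nameA, nameB, distance] for every pair within $maxDistance edits.
function similarPairs(array $names, int $maxDistance = 1): array
{
    $pairs = [];
    foreach (groupByPhoneticKey($names) as $bucket) {
        $n = count($bucket);
        for ($i = 0; $i < $n; $i++) {
            for ($j = $i + 1; $j < $n; $j++) {
                $distance = levenshtein($bucket[$i], $bucket[$j]);
                if ($distance > 0 && $distance <= $maxDistance) {
                    $pairs[] = [$bucket[$i], $bucket[$j], $distance];
                }
            }
        }
    }
    return $pairs;
}
```

With 15k names spread over many buckets, the number of levenshtein() calls drops from roughly n^2/2 to the sum of the much smaller per-bucket pair counts.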
The trouble with levenshtein is that it only compares string a to string b. I once built a spelling corrector that put all the strings a into a big trie, which functioned as a dictionary. Then it would look up any string b in that dictionary, finding all nearest-matching words. I did it first in Fortran (!), then in Pascal. It would be easier in a more modern language, but I suspect PHP would not make it easy. Look here.

How to iterate through a "window" of data in a dataset?

I have a data set in mysql with 150 rows. I have a set of 2 for loops that run math calculations based on some user inputs and the dataset. The code does calculations for 30 row windows, and accumulates the results for each 30 row window in an array. What I mean is, I do a "cycle" of calculations on rows 0-29, then 1-30, then 2-31, etc... That would result in 120 "cycles".
Right now the for loop is set up like so (there are more fields; I just trimmed the code to keep this question simple):
$window=30; // the loops below use $window, so define it under that name
$query = "SELECT * FROM table";
$result = mysql_query($query);
while ($row = mysql_fetch_assoc($result)){
    $data[] = array("Date" => $row['Date'], "ID" => $row['ID']);
}
for($i=0;$i<(count($data)-$window);$i++){
    for($j=0;$j<$window;$j++){
        //do calculations here with $data[]
        $results[$i][$j]= calculations;
    }
}
This works fine for the number of rows I have. However, I opened up the script to a larger dataset (1700 rows) with a different window (360 rows). This means there are far more iterations (1,340 x 360 = 482,400 instead of 120 x 30 = 3,600), and it gave me an out of memory error. Some quick use of memory_get_peak_usage() showed that memory just kept increasing.
I'm starting to think that having the loops search through that data array is extremely laborious, especially when the "window" overlaps on a lot of the "cycles". Example: Cycle 0 goes through rows 0-29. Cycle 1 goes through rows 1-30. So both of those cycles share most of the data they need (rows 1-29), but I'm telling PHP to look up the data again each time.
Is there a way to structure this better? I'm getting kind of lost thinking about running these concurrent cycles.
I think the array that is blowing your memory is the $results array. In your small sample it would be a 2-dimensional array with 150 x 149 cells. At 144 bytes per element that's 3,218,400 bytes, slightly over 3 MB, plus the remaining bucket space.
In your second, larger sample it would be 1700 x 1699 cells. At 144 bytes per element that's 415,915,200 bytes, roughly 397 MB, plus the remaining bucket space, just to hold the results of your calculations.
I think you need to ask whether you really need to hold all this data. If you really do, you may have to come up with another way of storing it.
I don't see any point in attempting 1000-odd database calls, as this will only add overhead: you still have to maintain the huge list of results in an array.
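One way to avoid holding every intermediate value is to keep a single aggregate per window and update it as the window slides. The sketch below assumes, purely for illustration, that the per-window calculation is a sum; rollingSums() is a hypothetical helper and the real calculation in the question may not decompose this way.

```php
<?php
// One aggregate per window instead of a [window][row] matrix.
// Each step adds the row entering the window and drops the row leaving it.
function rollingSums(array $values, int $window): array
{
    $n = count($values);
    if ($window < 1 || $window > $n) {
        return [];
    }
    $sum = array_sum(array_slice($values, 0, $window));
    $sums = [$sum];
    for ($i = $window; $i < $n; $i++) {
        $sum += $values[$i] - $values[$i - $window];
        $sums[] = $sum;
    }
    return $sums;
}
```

This stores one number per window rather than one per window and row, and it touches each row only twice regardless of the window size.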
The SQL Way
You can accomplish this by using LIMIT
$period = 30;
$cycle = 0; // offset of the current window; increment it on every loop
$query = "SELECT * FROM table LIMIT $cycle,$period";
This will return only the results you need for each cycle. You will need to loop and increment $cycle. The way you are doing it now is probably better, however.
This won't loop back however and grab the first of the data, you will have to add additional logic to handle that case.

Create 10,000 non-repeating random numbers in PHP

I need to work out a way to create 10,000 non-repeating random numbers in PHP, and then put it into database table. Number will be 12 digits long.
What is the best way to do this?
At 12 digits long, I don't think the possibility of getting repeats is very large. I would probably just generate the numbers, try to insert them into the table, and if it already exists (assuming you have a unique constraint on that column) just generate another one.
Read e.g. 40000 (=PHP_INT_SIZE * 10000) bytes from /dev/random at once, then split it, modularize it (the % operator), and there you have it.
Then filter it, and repeat the procedure.
That avoids too many syscalls/context switches (between the php runtime, the zend engine, and the operating system itself - I'm not going to dive into details here).
That should be the most performant way of doing it.
Generate 10000 random numbers and place them in an array. Run the array through array_unique. Check the length. If less than 10000, add on a bunch more. Run the array through array_unique. If greater than 10000, then run through array_slice to give 10000. Otherwise, lather, rinse, repeat.
This assumes that you can generate a 12 digit random number without problems (use getrandmax() to see how big you can get; according to php.net, on some systems 32k is as large a number as you can get).
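The generate, de-duplicate, top up and slice loop described above could look roughly like this. uniqueRandomBatch() is a hypothetical name, and the sketch assumes a 64-bit PHP build so that 12-digit integers fit in an int.

```php
<?php
// Generate $count distinct random integers in [$min, $max].
function uniqueRandomBatch(int $count, int $min, int $max): array
{
    if ($max - $min + 1 < $count) {
        throw new InvalidArgumentException('Range is too small for that many unique numbers');
    }
    $numbers = [];
    while (count($numbers) < $count) {
        // Add as many numbers as are still missing, then drop the duplicates.
        $missing = $count - count($numbers);
        for ($i = 0; $i < $missing; $i++) {
            $numbers[] = mt_rand($min, $max);
        }
        $numbers = array_values(array_unique($numbers));
    }
    return array_slice($numbers, 0, $count);
}

// 10,000 numbers that are all exactly 12 digits long.
$numbers = uniqueRandomBatch(10000, 100000000000, 999999999999);
```

A unique constraint on the database column is still a sensible safety net, since this only guarantees uniqueness within one batch.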
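$array = array();
while(count($array) < 10000){ // "<=" would produce 10,001 numbers
    // a minimum of 100000000000 guarantees 12 digits; needs a 64-bit PHP build
    $number = mt_rand(100000000000, 999999999999);
    if(!array_key_exists($number,$array)){
        $array[$number] = null;
    }
}
foreach($array as $key=>$val){
    //write array records to db.
}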
You could use either rand() or mt_rand(). mt_rand() is supposed to be faster however.

Creating your own TinyURL

I have just found this great tutorial as it is something that I need.
However, after having a look, it seems that this might be inefficient. The way it works is, first generate a unique key then check if it exists in the database to make sure it really is unique. However, the larger the database gets the slower the function gets, right?
Instead, I was thinking, is there a way to add ordering to this function? So all that has to be done is check the previous entry in the DB and increment the key. So it will always be unique?
function generate_chars()
{
$num_chars = 4; //max length of random chars
$i = 0;
$my_keys = "123456789abcdefghijklmnopqrstuvwxyz"; //keys to be chosen from
$keys_length = strlen($my_keys);
$url = "";
while($i<$num_chars)
{
$rand_num = mt_rand(0, $keys_length-1); // start at 0, otherwise the first character is never picked
$url .= $my_keys[$rand_num];
$i++;
}
return $url;
}
function isUnique($chars)
{
//check the uniqueness of the chars
global $link;
$q = "SELECT * FROM `urls` WHERE `unique_chars`='".$chars."'";
$r = mysql_query($q, $link);
//echo mysql_num_rows($r); die();
if( mysql_num_rows($r)>0 ):
return false;
else:
return true;
endif;
}
The tiny url people like to use random tokens because then you can't just troll the tiny url links. "Where does #2 go?" "Oh, cool!" "Where does #3 go?" "Even cooler!" You can type in random characters but it's unlikely you'll hit a valid value.
Since the key is rather sparse (4 values each having 36* possibilities gives you 1,679,616 unique values, 5 gives you 60,466,176) the chance of collisions is small (indeed, it's a desired part of the design) and a good SQL index will make the lookup be trivial (indeed, it's the primary lookup for the url so they optimize around it).
If you really want to avoid the lookup and just use auto-increment, you can create a function that turns an integer into a string of seemingly random characters, with the ability to convert back. So "1" becomes "54jcdn" and "2" becomes "pqmw21". Similar to Base64 encoding, but not using consecutive characters.
(*) I actually like using fewer than 36 characters: single-cased, no vowels, and no similar characters (1, l, I). This prevents accidental swear words and also makes it easier for someone to speak the value to someone else. I even map similar characters to each other, accepting "0" for "O". If you're entirely machine-based you could use upper and lower case and all digits for even greater possibilities.
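A minimal sketch of a reversible integer-to-key conversion with a restricted alphabet follows. The alphabet and the names ALPHABET, encodeId() and decodeId() are assumptions made for illustration. Note that this is plain base conversion: consecutive IDs still produce keys that differ only in their last character, so a fixed shuffle of the alphabet or an extra scrambling step would be needed to get the "seemingly random" look described above.

```php
<?php
// Hypothetical alphabet: digits and lower-case consonants without 0, 1 and l.
const ALPHABET = '23456789bcdfghjkmnpqrstvwxyz';

function encodeId(int $id): string
{
    if ($id < 0) {
        throw new InvalidArgumentException('ID must not be negative');
    }
    $base = strlen(ALPHABET);
    $key = '';
    do {
        // Prepend the character for the lowest remaining digit.
        $key = substr(ALPHABET, $id % $base, 1) . $key;
        $id = intdiv($id, $base);
    } while ($id > 0);
    return $key;
}

function decodeId(string $key): int
{
    if ($key === '') {
        throw new InvalidArgumentException('Key must not be empty');
    }
    $base = strlen(ALPHABET);
    $id = 0;
    foreach (str_split($key) as $char) {
        $position = strpos(ALPHABET, $char);
        if ($position === false) {
            throw new InvalidArgumentException("Invalid character: $char");
        }
        $id = $id * $base + $position;
    }
    return $id;
}
```

The auto-increment ID stays the primary key, so no uniqueness check is needed when a new short URL is created.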
In the database table, there is an index on the unique_chars field, so I don't see why that would be slow or inefficient.
UNIQUE KEY `unique_chars` (`unique_chars`)
Don't rush to do premature optimization on something that you think might be slow.
Also, there may be some benefit in a url shortening service that generates random urls instead of sequential urls.
I don't know why you'd bother. The premise of the tutorial is to create a "random" URL. If the random space is large enough, then you can simply rely on pure, dumb luck. If your random character space is 62 characters (A-Za-z0-9), then with the 4 characters they use and a reasonable random number generator, the chance of hitting a given key is 1 in 62^4, which is 1 in 14,776,336. Five characters is 1 in 916,132,832. So, a conflict is, literally, "1 in a billion".
Obviously, as the documents fill up, the chance of a collision increases.
With five characters and 10,000 documents, it's 1 in 91,613, almost 1 in 100,000 (for round numbers).
That means, for every new document, you have a 1 in 91,613 chance of hitting the DB again for another pull on the slot machine.
It is not deterministic. It's random. It's luck. In theory, you can hit a string of really, really, bad luck and just get collision after collision after collision. Also, it WILL, eventually, fill up. How many URLs do you plan on hashing?
But if 1 in 91,613 odds isn't good enough, boosting it to 6 chars makes it more than 1 in 5M for 10,000 documents. We're talking almost LOTTO odds here.
Simply put, make the key big enough (7 characters? 8?) and the problem pretty much "wishes" itself out of existence.
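The figures above can be reproduced with a few lines of PHP. collisionOneIn() is a hypothetical helper; it assumes keys are drawn uniformly at random and that the key space fits in a 64-bit integer.

```php
<?php
// Odds ("1 in N") that one freshly generated random key is already taken.
function collisionOneIn(int $alphabetSize, int $keyLength, int $existingKeys): int
{
    $keySpace = $alphabetSize ** $keyLength; // e.g. 62^5 possible keys
    return intdiv($keySpace, $existingKeys);
}

echo collisionOneIn(62, 5, 10000), "\n"; // 91613
echo collisionOneIn(62, 6, 10000), "\n"; // 5680023
```

Each extra character multiplies the key space by the alphabet size, which is why lengthening the key is so much more effective than tuning the generator.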
Couldn't you encode the URL as Base36 when it's generated, and then decode it when visited - that would allow you to remove the database completely?
A snippet from Channel9:
The formula is simple, just turn the Entry ID of our post, which is a long into a short string by Base-36 encoding it and then stick 'http://ch9.ms/' onto the front of it. This produces reasonably short URLs, and can be computed at either end without any need for a database look up. The result, a URL like http://ch9.ms/A49H is then used in creating the twitter link.
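In PHP the same Base-36 idea can be sketched with the built-in base_convert(). The function names idToKey() and keyToId() are hypothetical. base_convert() works on floating point internally, so this is only safe for IDs well within the range a float represents exactly.

```php
<?php
// Turn a numeric entry ID into a short Base-36 key and back.
function idToKey(int $id): string
{
    return base_convert((string) $id, 10, 36);
}

function keyToId(string $key): int
{
    // Lower-case first so that keys such as "A49H" are accepted as well.
    return (int) base_convert(strtolower($key), 36, 10);
}

echo idToKey(12345), "\n";  // 9ix
echo keyToId('A49H'), "\n"; // 472085
```

The short URL can be computed from the ID alone; resolving it back to the long URL still means fetching the record with that ID.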
I solved a similar problem by implementing an algorithm that generated serial numbers one by one in base36. I had my own ordering of the base36 characters, all of which are unique. Since it was generating numbers serially, I did not have to worry about duplication. The complexity and randomness of the number depends on the ordering of the base36 characters, and that only matters to the public, because to my application they are just serial numbers :)
Check out this guy's functions - http://www.pgregg.com/projects/php/base_conversion/base_conversion.php source - http://www.pgregg.com/projects/php/base_conversion/base_conversion.inc.phps
You can use any base you like, for example to convert 554512 to base 62, call
$tiny = base_base2base(554512, 10, 62); and that evaluates to $tiny = '2KFk'.
So, just pass in the unique id of the database record.
In a project where I used this, I removed a few characters from the $sChars string and am using base 58. You can also rearrange the characters in the string if you want the values to be less easy to guess.
You could of course add ordering by simply numbering the urls:
http://mytinyfier.com/1
http://mytinyfier.com/2
and so on. But if the hash key is indexed in the database (which it obviously should be), the performance boost would be minimal at best.
I wouldn't bother doing ordered enumeration for two reasons:
1) SQL servers are very effective at checking such hash collisions (given correct indexes)
2) That might hurt privacy, as users would be able to easily figure out what other users are tinyurl-ing.
Use autoincrement on the database, and get the latest id as described by http://www.acuras.co.uk/articles/24-php-use-mysqlinsertid-to-get-the-last-entered-auto-increment-value
Perhaps this is a bit off-answer, but my general rule for creating always-unique keys is simple: md5( time() * 100 + rand( 0, 100 ) ). If two people use the service in the same second, there is roughly a one in 100 chance that they get the same result, so widen the random range if you want a clash to be nigh impossible.
That said, md5( rand( 0, n ) ) works too.
That might work, but the easiest way to accomplish the problem would probably be with hashing. Theoretically speaking, hashing runs in O(1) time, as in, it only has to perform the hash, and then does only one actual hit to the database to retrieve the value. Then, you would introduce complications for checking for hash collisions, but it seems like this is probably what most of the tinyurl providers do. And, a good hash function isn't terribly hard to write.
I have also created small tinyurl service.
I wrote a script in Python that generates keys and stores them in a MySQL table named tokens with status U (unused).
But I am doing it offline. I have a cron job on my VPS that runs a script every 10 minutes. The script checks whether there are fewer than 1000 keys in the table. If so, it keeps generating keys and inserting them, provided they are unique and do not already exist in the table, until the key count is back up to 1000.
For my service, 1000 keys every 10 minutes are more than enough; you can set the timing or the number of keys generated according to your needs.
Now when a tiny URL needs to be created on my website, my PHP script just fetches any unused key from the table and marks its status as T (taken). The PHP script does not have to bother about uniqueness, as my Python script has already populated the table with unique keys only.
Couldn't you just trim the hash to the length you wish?
$tinyURL = substr(md5($longURL . time()),0,4);
Granted, this may not provide as much pseudo randomness as using the entire string length. But, if you hash the long URL concatenated with the time(), wouldn't this be sufficient? Thoughts on using this method? Thanks!

PHP/SQL: ORDER BY or sort($array)?

Which do you think is faster in a PHP script:
$query = "SELECT... FROM ... ORDER BY first_val";
or
while($row = odbc_fetch_array($result))
$arrayname[] = array(
"first_key" => $row['first_val'],
"second_key" => $row['second_val'],
etc...
);
sort($arrayname);
It depends on so many factors that I don't even know where to begin.
But as a rule, you perform sorting on database side.
Indexes, collations and all this, they help.
Which do you think is faster in a php script:
The ORDER BY doesn't execute in the PHP script -- it executes in the database, before data is retrieved by the PHP script. Apologies if this seems pedantic, I just want to make sure you understand this.
Anyway, the reason I would use ORDER BY is that the database has access to indexes and cached pages from the database. Sorting in PHP sorts the data set in memory of course, but has no access to any index.
If the ordered field is indexed, I'd say probably the SQL query. If not, I'm not sure, but I can't imagine it will be overly noticeable either way unless you're dealing with an absurdly large number of rows.
ORDER BY will almost always be faster.
In my opinion, nothing beats actually timing the thing so you really, really know for sure:
$time_start = microtime(true);
// Try the ORDER BY and sort($array) variants here
$time_end = microtime(true);
$time = $time_end - $time_start;
echo "It took $time seconds";
If there's a LIMIT on the first query, and the set of rows the query would match without the LIMIT is much larger than the LIMIT, then ORDER BY on the query is DEFINITELY faster.
That is to say, if you need the top 50 rows from a 10,000 row table, it's much faster to have the database sort for you, and return only those top 50 rows, than it is to retrieve all 10,000 rows and sort them yourself in PHP. This is probably representative of the vast majority of what will happen in real-world applications
If there are any cases at all in which sorting in PHP is even comparable, they're few and far between.
Additionally, SQL sorting is much more powerful -- it's trivial to sort on multiple columns, subqueries, the return values of aggregate functions etc.
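To illustrate the difference in effort, here is a sketch of what a two-column sort takes on the PHP side. The rows and the second sort column are made up for illustration, and the spaceship operator requires PHP 7 or later.

```php
<?php
// Hypothetical rows, already fetched without any ORDER BY.
$rows = [
    ['first_val' => 2, 'second_val' => 'b'],
    ['first_val' => 1, 'second_val' => 'z'],
    ['first_val' => 2, 'second_val' => 'a'],
];

// PHP equivalent of: ORDER BY first_val ASC, second_val DESC
usort($rows, function (array $a, array $b): int {
    // Swapping $a and $b for second_val reverses the order of that column.
    return [$a['first_val'], $b['second_val']] <=> [$b['first_val'], $a['second_val']];
});
```

In SQL the same ordering is a single clause, and the database can use an index to satisfy it.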
