How can I write a query to select similar titles? - php

I would like to select those movies which have similar titles.
I found this, but it doesn't work: it returns nothing. I would like to get Toy Story 2, Toy Story 3 and other movies with similar titles, such as Toy Soldiers.
$title = "Toy Story";
$query = mysql_query("SELECT title, year, poster, LEVENSHTEIN_RATIO( ".$title.", title ) as textDiff FROM movies HAVING textDiff > 60");
I can compare strings in PHP with this function:
static public function string_compare($str_a, $str_b)
{
$length = strlen($str_a);
$length_b = strlen($str_b);
$i = 0;
$segmentcount = 0;
$segmentsinfo = array();
$segment = '';
while ($i < $length)
{
$char = substr($str_a, $i, 1);
if (strpos($str_b, $char) !== FALSE)
{
$segment = $segment.$char;
if (strpos($str_b, $segment) !== FALSE)
{
$segmentpos_a = $i - strlen($segment) + 1;
$segmentpos_b = strpos($str_b, $segment);
$positiondiff = abs($segmentpos_a - $segmentpos_b);
$posfactor = ($length - $positiondiff) / $length_b;
$lengthfactor = strlen($segment)/$length;
$segmentsinfo[$segmentcount] = array( 'segment' => $segment, 'score' => ($posfactor * $lengthfactor));
}
else
{
$segment = '';
$i--;
$segmentcount++;
}
}
else
{
$segment = '';
$segmentcount++;
}
$i++;
}
// PHP 5.3 lambda in array_map
$totalscore = array_sum(array_map(function($v) { return $v['score']; }, $segmentsinfo));
return $totalscore;
}
But how can I compare in a SELECT query or any other way?

You can use LIKE queries for that.
The following example returns all records from the customer table whose name ends with kh:
select * from customer where name like '%kh'
The following example returns all records whose name starts with kh:
select * from customer where name like 'kh%'
The following example returns all records whose name contains kh anywhere:
select * from customer where name like '%kh%'
If you want more specific records, add AND/OR conditions to your query.
I recommend reading the MySQL documentation on LIKE:
http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html#operator_like
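When the search term comes from a user, the LIKE wildcards inside it usually need escaping before a pattern is built. The following is a minimal sketch, not part of the original answer; the helper names escapeLike and likePatterns are made up for illustration, and it assumes MySQL's default escape character (backslash).

```php
<?php
// Hypothetical helper: escape LIKE wildcards so "%" and "_" in the input
// are matched literally (backslash is MySQL's default LIKE escape character).
function escapeLike(string $term): string
{
    return addcslashes($term, '\\%_');
}

// Build the three pattern shapes: starts with, ends with, contains.
function likePatterns(string $term): array
{
    $safe = escapeLike($term);
    return [
        'starts_with' => $safe . '%',
        'ends_with'   => '%' . $safe,
        'contains'    => '%' . $safe . '%',
    ];
}

print_r(likePatterns('Toy Story'));
```

The resulting pattern is best passed as a bound parameter of a prepared statement (for example `WHERE title LIKE ?`) rather than concatenated into the SQL string.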

I think you might need to define how similar things need to be to be considered a match.
But if you just want to search for titles containing any of the words, you could split your search string on whitespace and use the words in a REGEXP in your query:
$search_array = explode(" ", "Toy story");
$query = "SELECT title, year, poster FROM movies WHERE title REGEXP '".implode("|", $search_array)."'";
This would probably match a lot of rows, but you could write a more restrictive regular expression.
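One way to narrow down a broad match like this is to rank the candidate titles in PHP after fetching them. The sketch below is only an illustration under assumptions: the function name rankSimilarTitles, the 60 percent threshold, and the sample titles are not from the question, and similar_text() is just one possible similarity measure.

```php
<?php
// Rank candidate titles by similarity to the search title.
// similar_text() fills $percent with a similarity score between 0 and 100.
function rankSimilarTitles(string $search, array $titles, float $minPercent = 60.0): array
{
    $scored = [];
    foreach ($titles as $title) {
        similar_text(strtolower($search), strtolower($title), $percent);
        if ($percent >= $minPercent) {
            $scored[$title] = $percent;
        }
    }
    arsort($scored); // highest score first, titles kept as keys
    return $scored;
}

// Sample data for illustration only; in practice these would be the rows
// returned by the REGEXP or LIKE query.
$candidates = ['Toy Story 2', 'Toy Story 3', 'Toy Soldiers', 'Forrest Gump'];
$ranked = rankSimilarTitles('Toy Story', $candidates);
print_r($ranked);
```

The database does the cheap filtering and PHP does the expensive scoring on a much smaller set, which avoids needing a user-defined LEVENSHTEIN_RATIO function in MySQL.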

Related

mysql JSON_EXTRACT - don't return any results

I am building a simple search form. First I split the query into words and build a condition for each word, so that several words can be searched at once. I only want to search in movie titles, which are stored inside movie_data.
My table, called movies, has these columns:
id and
movie_data - contains the movie data as JSON. Example of the stored JSON:
{"movie_title":"Forrest Gump","movie_cover":"http://1.fwcdn.pl/po/09/98/998/7314731.6.jpg","movie_name_en":"","movie_desc":"Historia życia Forresta, chłopca o niskim ilorazie inteligencji z niedowładem kończyn, który staje się miliarderem i bohaterem wojny w Wietnamie.","movie_year":"1994","movie_genres":"DramatKomedia"}
Here is my code. It doesn't return any results.
$query = 'Forrest Gump';
$words = explode(' ', $query);
$i = 0;
$len = count($words);
$query_build = '';
foreach ($words as $item) {
if ($i == $len - 1) {
$query_build .= ' titlemov LIKE "%$item%"';
} else {
$query_build .= ' titlemov LIKE "%$item%" OR ';
}
// …
$i++;
}
$sql = "SELECT movie_data, JSON_EXTRACT(movie_data, '$.movie_title') AS titlemov
FROM movies WHERE $query_build";
This code doesn't return any items.
UPDATE
I updated my code.
$query = 'Forrest Gump';
$words = explode(' ', $query);
$i = 0;
$len = count($words);
$query_build = '';
foreach ($words as $item) {
if ($i == $len - 1) {
$query_build .= " JSON_EXTRACT(movie_data, \'$.movie_title\') LIKE '%$item%'";
} else {
$query_build .= " JSON_EXTRACT(movie_data, \'$.movie_title\') LIKE '%$item%' OR ";
}
// …
$i++;
}
$sql = "SELECT * FROM movies WHERE $query_build";
Now I am getting this error:
Warning: mysqli::query(): (22032/3141): Invalid JSON text in argument 1 to function json_extract: "Missing a comma or '}' after an object member." at position 228. in /search.php on line 43
Line 43:
$result = $conn->query($sql);
var_dump of $sql var:
SELECT * FROM movies WHERE JSON_EXTRACT(movie_data, '$.movie_title') LIKE '%Forrest%' OR JSON_EXTRACT(movie_data, '$.movie_title') LIKE '%Gump%'
Hmm?
titlemov doesn't exist for your where clause to use. What you want to use inside your where clause (specified in the $query_build) is the JSON_SEARCH function. Here's what it looks like in general with wildcards for the movie name:
SELECT movie_data, movie_data->'$.movie_title' AS titlemov FROM movies
WHERE JSON_SEARCH(movie_data->'$.movie_title', 'one', "%orrest Gum%") IS NOT NULL;
You'll just need to modify your $query_build string. I also drop the needless JSON_EXTRACT function calls.
Try this:
$query_build = " JSON_SEARCH(movie_data->'$.movie_title', 'one', '%$item%') IS NOT NULL";
Final note, I'm assuming $item is not a user supplied string, otherwise this opens you up to SQL injection. If it is supplied by user, you'll want to use a prepared statement and bind this value.
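The prepared-statement suggestion could look roughly like the sketch below. It is an assumption-laden illustration, not the answerer's code: the function name buildTitleSearch is made up, and it uses JSON_UNQUOTE(JSON_EXTRACT(...)) with LIKE instead of JSON_SEARCH, which is a swapped-in variant that should behave similarly for a plain string title. Note also that the "Invalid JSON text" error in the question suggests at least one row holds malformed JSON; if movie_data is a text column, JSON_VALID(movie_data) may help locate such rows.

```php
<?php
// Build a WHERE clause with one placeholder per search word.
// JSON_UNQUOTE(JSON_EXTRACT(...)) yields the title as plain text for LIKE.
function buildTitleSearch(string $query): array
{
    $words = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);
    $conditions = [];
    $params = [];
    foreach ($words as $word) {
        $conditions[] = "JSON_UNQUOTE(JSON_EXTRACT(movie_data, '$.movie_title')) LIKE ?";
        $params[] = '%' . $word . '%';
    }
    // With no words at all, match nothing instead of producing invalid SQL.
    $where = $conditions ? implode(' OR ', $conditions) : '1 = 0';
    return ['SELECT * FROM movies WHERE ' . $where, $params];
}

[$sql, $params] = buildTitleSearch('Forrest Gump');
echo $sql, PHP_EOL;
print_r($params);

// With an open mysqli connection in $conn (assumed), this could be run as:
// $stmt = $conn->prepare($sql);
// $stmt->bind_param(str_repeat('s', count($params)), ...$params);
// $stmt->execute();
// $result = $stmt->get_result();
```

Because the words are bound as parameters, a quote or wildcard typed by the user cannot change the structure of the query.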

Grouping N person randomly but evenly? (Name & Gender)

I am trying to create an algorithm to assign people to classrooms. The requirements for each class are:
Have at least 30 people and a maximum of 45
The names must not be homogeneous (i.e. it should not happen that classes 1-3 contain only names starting with the letter "A", classes 4-5 only the letter "B", etc.)
The genders are evenly distributed
If the classes are full, the remaining people are moved to a waiting list
My data has the columns Unique ID, Name, and Gender. I'm still new to this kind of problem, so I don't know where to start. Is it even possible? I am using PHP and my data is in a MySQL database.
Step 1
You need to get the data (all people) from the database.
$host = '***';
$user = '***';
$password = '***';
$database = '***';
$link = mysqli_connect($host, $user, $password, $database) or die("Error: " . mysqli_connect_error());
$query = "SELECT * FROM people";
// keep the result in $result, which Step 2 iterates over
$result = mysqli_query($link, $query) or die("Error: " . mysqli_error($link));
mysqli_close($link);
Step 2
Convert the mysqli result to an array and shuffle it.
$people = [];
foreach ($result as $person) {
$people[] = $person;
}
shuffle($people);
Step 3
Here is the algorithm:
$count = count($people);
// Classes
$classes = [];
const MIN_SIZE = 30;
const MAX_SIZE = 45;
$maxSizeClass = $count / MIN_SIZE;
$minSizeClass = $count / MAX_SIZE;
$countClasses = max(ceil($minSizeClass), floor($maxSizeClass));
$currentCountClass = $count / $countClasses;
$tmpClass = [];
foreach ($people as $person) {
    if (count($tmpClass) >= $currentCountClass) {
        // the current class is full: store it and start a new one,
        // so that the current person is not dropped
        $classes[] = $tmpClass;
        $tmpClass = [];
    }
    $tmpClass[] = $person;
}
if (count($tmpClass) >= MIN_SIZE) {
    $classes[] = $tmpClass;
    $tmpClass = [];
}
foreach ($tmpClass as $index => $person) {
    foreach ($classes as &$class) {
        if (count($class) < MAX_SIZE) {
            $class[] = $person;
            // be careful, PHP7 is OK
            unset($tmpClass[$index]);
            continue 2;
        }
    }
}
unset($class); // break the reference left by the inner foreach
// persons awaiting distribution
$waitingQueue = $tmpClass;
Step 4
Result is:
$waitingQueue - persons awaiting distribution
$classes - classes with persons
$letters = range('a', 'z');
foreach ($letters as $letter) {
    $sql['male'] = "SELECT * FROM people_table WHERE person_name LIKE '".$letter."%' AND person_gender = 'male' ORDER BY person_name";
    $sql['female'] = "SELECT * FROM people_table WHERE person_name LIKE '".$letter."%' AND person_gender = 'female' ORDER BY person_name";
    foreach ($sql as $key => $query) {
        $results[$key] = $connection->query($query);
        for ($i = 0; $i < $results[$key]->num_rows; $i++) {
            $people[$letter][$key][] = $results[$key]->fetch_array(MYSQLI_ASSOC);
        }
    }
}
Here we have the people listed by letter and by gender. Now we can loop over them and insert a man and a woman in pairs. If the list of pairs has fewer than 30 people, they keep waiting. If it has more than 44 (pairs cannot add up to 45 people, if I haven't misunderstood the question), save those 44 in a class array $class[$letter], which holds the classes for each letter. To find out how many classes you have in total, use count($class); to find out how many classes a specific letter has, use count($class[$letter]).
You can run another foreach over the $letters array, or put the loop inside the foreach above to create the array of classes.
Inside the foreach($letters as $letter){}, at the end:
if (!(count($people[$letter]['female']) < 15 OR count($people[$letter]['male']) < 15)) {
    $she = count($people[$letter]['female']);
    $he = count($people[$letter]['male']);
    // the smaller group decides how many pairs can be formed
    $pairs = min($she, $he);
    for ($i = 0; $i < $pairs; $i++) {
        // even slots hold women, odd slots hold men, so no value is overwritten
        $class[$letter][2 * $i] = $people[$letter]['female'][$i];
        $class[$letter][2 * $i + 1] = $people[$letter]['male'][$i];
    }
}
A false result in the outer if means that a class with evenly distributed genders cannot be created for this letter. You can loop again to make each class an entry of an array.
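As an alternative illustration of the requirements in the question, the sketch below shuffles each gender group and deals people into classes round-robin, so that each class gets a near-equal share of every gender and names are mixed by the shuffle. This is a different technique from either answer above, and it rests on assumptions: the function name assignClasses, the 'gender' array key, and the rule "as many classes as can reach the minimum size" are all made up for the example. With fewer people than the minimum, it still returns a single undersized class.

```php
<?php
// Deal people into classes round-robin, one gender group after another,
// so every class receives a near-equal share of each gender.
function assignClasses(array $people, int $minSize = 30, int $maxSize = 45): array
{
    // As many classes as can each reach the minimum size (at least one).
    $classCount = max(1, (int) floor(count($people) / $minSize));

    $byGender = [];
    foreach ($people as $person) {
        $byGender[$person['gender']][] = $person;
    }

    $classes = array_fill(0, $classCount, []);
    $waiting = [];
    $slot = 0; // keeps rotating across gender groups so class sizes stay even
    foreach ($byGender as $group) {
        shuffle($group); // random order breaks up alphabetical runs of names
        foreach ($group as $person) {
            $index = $slot % $classCount;
            if (count($classes[$index]) < $maxSize) {
                $classes[$index][] = $person;
                $slot++;
            } else {
                // Sizes differ by at most one, so a full class here means all are full.
                $waiting[] = $person;
            }
        }
    }
    return ['classes' => $classes, 'waiting' => $waiting];
}

// Sample data for illustration only.
$people = [];
for ($i = 1; $i <= 100; $i++) {
    $people[] = ['id' => $i, 'name' => "Person $i", 'gender' => $i % 2 ? 'male' : 'female'];
}
$result = assignClasses($people);
foreach ($result['classes'] as $n => $class) {
    echo 'Class ', $n + 1, ': ', count($class), " people\n";
}
echo 'Waiting: ', count($result['waiting']), "\n";
```

One limitation of this sketch is that whichever gender group is processed last is the one that ends up on the waiting list when the classes fill up.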

Can i select data from mysqli database using SUBSTR to refine the query

I am trying to add refine tools to a search on my website. The bit I'm stuck on is searching by first letter. For example, I could use a wildcard '%X%', but this would return anything that contains the letter 'x'.
I read on a few sites that SUBSTRING can be used in MySQL queries:
http://dev.mysql.com/
http://www.kirupa.com/
https://stackoverflow.com/questions/6302027
This is what I have so far, but it returns nothing. There is data in the database that the query should return.
public function refineUsersFollowers($user_id,$q){
if($this->databaseConnection()){
// get the users followers
$state = array(1,2);
$stmt = $this->db_connection->prepare("SELECT * FROM friends WHERE id_2 = :1 AND Friend_Request_State = :2 OR id_2 = :3 AND Friend_Request_State = :4");
$stmt->bindParam(':1', $user_id);
$stmt->bindParam(':2', $state[0]);
$stmt->bindParam(':3', $user_id);
$stmt->bindParam(':4', $state[1]);
$stmt->execute();
// format the SQL OR statements
$sql = '';
$ids = [];
while($rows = $stmt->fetch(\PDO::FETCH_ASSOC)){
array_push($ids,$rows['id_1']);
}
for($x = 0; $x < count($ids); $x++){
if(count($ids) == 1){
//if there is one result
$sql.= ' user_id = :'.$x." AND SUBSTRING('first_name',0,1) = :".$x.$x;
}else if($x == (count($ids) - 1)){
// last entry
$sql.= ' user_id = :'.$x." AND SUBSTRING('first_name',0,1) = :".$x.$x;
}else{
//continue loop
$sql.= ' user_id = :'.$x." AND SUBSTRING('first_name',0,1) = :".$x.$x." OR";
}
}
$stmt = $this->db_connection->prepare("SELECT * FROM account WHERE ".$sql);
for($x = 0; $x < count($ids); $x++){
$stmt->bindParam(':'.$x,$ids[$x]);
$insert = $x.$x.'';
$stmt->bindParam(':'.$insert,$q);
}
$stmt->execute();
$results = $stmt->fetch(\PDO::FETCH_ASSOC);
print_r($results);
// check for followers that start with letter
}
}
The first part of the function is fine, this gets an array of id's which is then placed together as an SQL string. Is the SQL not returning results because SUBSTRING is not supported in this way?
If so is there a way of producing a query like this or would it be easier to pull every result from the database then check them in a different function?
You have two issues with this expression:
SUBSTRING('first_name', 0, 1) = :".$x.$x;
First, substr() in SQL (in general) starts counting at 1 and not 0. So, the second argument (the start position) should be 1.
Second, you have the first argument in single quotes. So, at best, this would return the letter 'f'. Here is a simple rule: Only use single quotes for string and date constants. Never use single quotes to refer to column names.
There are several ways to write what you want. Here are three:
SUBSTRING(first_name, 1, 1) = $x
LEFT(first_name, 1) = $x
first_name like '$x%'
Your query can be greatly simplified with the LIKE operator. This:
"AND SUBSTRING('first_name',0,1) = :".$x.$x;
can become this:
"AND first_name LIKE '".$x.$x."%'";
I'm not sure what the $x.$x is for, so I just left it in for illustrative purposes.
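Putting the suggestions together, the per-id OR chain in the question could be replaced by a single IN list plus a LIKE prefix match, with every value bound as a parameter. The following is a minimal sketch under assumptions: the function name buildFollowerFilter is hypothetical, and it uses positional `?` placeholders instead of the question's numbered ones.

```php
<?php
// Build "user_id IN (?, ?, ...) AND first_name LIKE ?" with positional placeholders.
function buildFollowerFilter(array $ids, string $letter): array
{
    if (count($ids) === 0) {
        // No followers: return a query that matches nothing.
        return ['SELECT * FROM account WHERE 1 = 0', []];
    }
    $placeholders = implode(', ', array_fill(0, count($ids), '?'));
    $sql = "SELECT * FROM account WHERE user_id IN ($placeholders) AND first_name LIKE ?";
    $params = array_values($ids);
    $params[] = $letter . '%'; // prefix match: names starting with $letter
    return [$sql, $params];
}

[$sql, $params] = buildFollowerFilter([4, 8, 15], 'J');
echo $sql, PHP_EOL;

// With a PDO connection in $pdo (assumed), this could be run as:
// $stmt = $pdo->prepare($sql);
// $stmt->execute($params);
// $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
```

Note that fetch() in the question's code returns only one row; fetchAll() returns every matching follower.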

PHP and mysql Ignoring the special chracters present in the database

I need a little help with this one. Here are the details:
[Products]
id int
name text
category
color
Problem is the values of the color field, sample values are:
GOLDRED
GOLD-RED
GOLD/RED
BLUE/GREEN-RED
WHITE GOLD-YELLOW/ORANGE
I could easily clean the search query, as in this sample, using a basic function:
"select * from products where color=".cleanstring($stringval)." limit 1";
function cleanstring($var) {
$newtext = $var;
$newtext = preg_replace("/[^a-zA-Z0-9\s]/", "", $newtext);
$newtext = str_replace(" ", "", $newtext);
$newtext = strtoupper($newtext);
return $newtext;
}
The problem is with the content: there are thousands of records that follow no standard naming convention.
I want to select the records whose values, once cleaned the way my cleanstring() does, match the query.
Example:
Query = GOLDRED
Can select
GOLD-RED
GOLD RED
GOLDRED
GOLD/RED
GOLDRED
Any solution that you could recommend? Code is in PHP/MySQL.
"select * from products where 1".cleanstring($stringval);
function cleanstring($var) {
    $color_list = array('GOLD','RED','GREEN','WHITE');
    $sql_where = '';
    foreach ($color_list as $v) {
        if (strpos($var, $v) !== false) {
            $sql_where .= " AND color LIKE '%{$v}%'";
        }
    }
    return $sql_where;
}
// select * from products where 1 AND color LIKE '%GOLD%' AND color LIKE '%RED%'
REMARK:
input: GOLDRED
matches: GOLD RED, GOLD-RED, GOLD/RED, ... and also GOLD/RED/ABC, RED_GOLDGREEN
After fetching all the data, you could rank the rows by match percentage, like a search engine does.
You could probably just use a MySQL REGEXP such as 'GOLD.?RED' or 'GOLD(-|[[:space:]])?RED'.
That's an online example I made : http://regexr.com?34mmg
It's not the best way, and I am sure it has tons of downsides, but if I did not make any mistakes in the PHP code (I don't have a machine to try it on), it would do the job:
"select * from products where color REGEXP '".cleanstring($stringval)."' limit 1";
function cleanstring($var) {
    $var = preg_replace('![-\/ ]+!', '', $var);
    $strLength = strlen($var);
    $parts = array();
    // allow one optional separator at every position inside the keyword
    for ($i = 1; $i < $strLength; $i++) {
        $parts[] = substr($var, 0, $i).'[-/ ]?'.substr($var, $i);
    }
    return "[[:<:]](".implode('|', $parts).")[[:>:]]";
}
It would output something like this:
"select * from products where color REGEXP '[[:<:]](G[-/ ]?OLDRED|GO[-/ ]?LDRED|GOL[-/ ]?DRED|GOLD[-/ ]?RED|GOLDR[-/ ]?ED|GOLDRE[-/ ]?D)[[:>:]]' limit 1"
which basically breaks your keyword in pieces letter by letter, i.e.
G OLDRED
GO LDRED
GOL DRED
GOLD RED
GOLDR ED
GOLDRE D
and does the "LIKE" match on them, but with smarter word boundaries; instead of just a space, it accepts "-" and "/" as well.
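Another option, shown here only as a sketch, is to normalise the column inside the query so that both sides of the comparison go through the same cleaning. The helper normalizedColumnSql is a made-up name, the separator list is an assumption based on the sample values in the question, and a query written this way cannot use an ordinary index on color.

```php
<?php
// Same idea as the question's cleanstring(): keep letters and digits, upper-case.
function cleanstring(string $value): string
{
    return strtoupper(preg_replace('/[^a-zA-Z0-9]/', '', $value));
}

// Hypothetical helper: wrap a column in nested REPLACE() calls,
// one per separator, so the stored value is cleaned inside the query.
function normalizedColumnSql(string $column, array $separators): string
{
    $expr = "UPPER($column)";
    foreach ($separators as $sep) {
        $expr = "REPLACE($expr, '" . addslashes($sep) . "', '')";
    }
    return $expr;
}

$sql = 'SELECT * FROM products WHERE '
     . normalizedColumnSql('color', ['-', '/', ' '])
     . ' = ?';
echo $sql, PHP_EOL;

// The cleaned search value is then bound to the placeholder,
// e.g. cleanstring('Gold-Red') for a user who typed "Gold-Red".
```

If the table is searched often, storing the cleaned value in an extra indexed column at write time would likely be faster than cleaning on every query.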

Compare lots of texts (clustering) with a matrix

I have the following PHP function to calculate the relation between two texts:
function check($terms_in_article1, $terms_in_article2) {
$length1 = count($terms_in_article1); // number of words
$length2 = count($terms_in_article2); // number of words
$all_terms = array_merge($terms_in_article1, $terms_in_article2);
$all_terms = array_unique($all_terms);
foreach ($all_terms as $all_termsa) {
$term_vector1[$all_termsa] = 0;
$term_vector2[$all_termsa] = 0;
}
foreach ($terms_in_article1 as $terms_in_article1a) {
$term_vector1[$terms_in_article1a]++;
}
foreach ($terms_in_article2 as $terms_in_article2a) {
$term_vector2[$terms_in_article2a]++;
}
$score = 0;
foreach ($all_terms as $all_termsa) {
$score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
}
$score = $score/($length1*$length2);
$score *= 500; // for better readability
return $score;
}
The variable $terms_in_articleX must be an array containing all single words which appear in the text.
Assuming I have a database of 20,000 texts, this function would take a very long time to run through all the connections.
How can I accelerate this process? Should I add all texts into a huge matrix instead of always comparing only two texts? It would be great if you had some approaches with code, preferably in PHP.
I hope you can help me. Thanks in advance!
You can split the text when you add it. A simple example: preg_match_all('/\w+/', $text, $matches); Real splitting is not that simple, of course, but it is possible: just adjust the pattern :)
Create a table id(int primary autoincrement), value(varchar unique) and a link table like this: word_id(int), text_id(int), word_count(int). Then fill the tables with the new values after splitting the text.
Finally, you can do anything you want with this data, operating quickly on indexed integers (IDs) in the DB.
UPDATE:
Here are the tables and queries:
CREATE TABLE terms (
id int(11) NOT NULL auto_increment,
value char(255) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `value` (`value`)
);
CREATE TABLE `terms_in_articles` (
term int(11) NOT NULL,
article int(11) NOT NULL,
cnt int(11) NOT NULL default '1',
UNIQUE KEY `term` (`term`,`article`)
);
/* Returns all unique terms in both articles (your $all_terms) */
SELECT t.id, t.value
FROM terms t, terms_in_articles a
WHERE a.term = t.id AND a.article IN (1, 2);
/* Returns your $term_vector1, $term_vector2 */
SELECT article, term, cnt
FROM terms_in_articles
WHERE article IN (1, 2) ORDER BY article;
/* Returns article and total count of term entries in it ($length1, $length2) */
SELECT article, SUM(cnt) AS total
FROM terms_in_articles
WHERE article IN (1, 2) GROUP BY article;
/* Returns your $score, which you may divide by ($length1 * $length2) from the previous query */
SELECT SUM(tmp.term_score) * 500 AS total_score FROM
(
SELECT (a1.cnt * a2.cnt) AS term_score
FROM terms_in_articles a1, terms_in_articles a2
WHERE a1.article = 1 AND a2.article = 2 AND a1.term = a2.term
GROUP BY a2.term, a1.term
) AS tmp;
I hope this helps. The last two queries are enough to perform your task; the other queries are there just in case. Of course, you can compute more stats, like "the most popular terms", etc.
Here's a slightly optimized version of your original function. It produces exactly the same results. (I ran it on two articles from Wikipedia with 10000+ terms, with about 20 runs each.)
check():
test A score: 4.55712524522
test B score: 5.08138042619
--Time: 1.0707
check2():
test A score: 4.55712524522
test B score: 5.08138042619
--Time: 0.2624
Here's the code:
function check2($terms_in_article1, $terms_in_article2) {
$length1 = count($terms_in_article1); // number of words
$length2 = count($terms_in_article2); // number of words
$score_table = array();
foreach($terms_in_article1 as $term){
if(!isset($score_table[$term])) $score_table[$term] = 0;
$score_table[$term] += 1;
}
$score_table2 = array();
foreach($terms_in_article2 as $term){
if(isset($score_table[$term])){
if(!isset($score_table2[$term])) $score_table2[$term] = 0;
$score_table2[$term] += 1;
}
}
$score =0;
foreach($score_table2 as $key => $entry){
$score += $score_table[$key] * $entry;
}
$score = $score / ($length1*$length2);
$score *= 500;
return $score;
}
(Btw. The time needed to split all the words into arrays was not included.)
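The same counting idea can also be written with PHP's built-in array functions. The sketch below is an illustrative variant, not the answerer's code; the name check3 is made up, and no timing claim is made for it. It computes the same score as check(): the sum of count products over shared terms, divided by the product of the lengths, times 500.

```php
<?php
// Variant of check() built on array_count_values() and array_intersect_key().
function check3(array $terms1, array $terms2): float
{
    $length1 = count($terms1);
    $length2 = count($terms2);
    if ($length1 === 0 || $length2 === 0) {
        return 0.0; // avoid division by zero for empty texts
    }
    $counts1 = array_count_values($terms1); // term => occurrences in text 1
    $counts2 = array_count_values($terms2); // term => occurrences in text 2

    $score = 0;
    // Only terms present in both texts contribute to the score.
    foreach (array_intersect_key($counts1, $counts2) as $term => $count) {
        $score += $count * $counts2[$term];
    }
    return $score / ($length1 * $length2) * 500;
}

$a = array('this', 'is', 'just', 'a', 'test');
$b = array('this', 'is', 'not', 'test');
echo check3($a, $b), PHP_EOL;
```

For the two sample arrays, three terms are shared with one occurrence each, so the score is 3 / (5 * 4) * 500.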
EDIT: Trying to be more explicit:
First, encode every term into an integer. You can use a dictionary associative array, like this:
$count = 0;
$dict = array();
$doc_as_int = array();
foreach ($doc as $term) {
    if (!isset($dict[$term])) {
        // first time we see this term: give it the next free integer
        $dict[$term] = $count++;
    }
    $val = $dict[$term];
    if (!isset($doc_as_int[$val])) {
        $doc_as_int[$val] = 0;
    }
    $doc_as_int[$val]++;
}
This way, you replace string calculations with integer calculations. For example, you can represent the word "cloud" as the number 5, and then use index 5 of arrays to store counts of the word "cloud". Notice that we only use associative array lookup here; there is no need for CRC etc.
Do store all texts as a matrix, preferably a sparse one.
Use feature selection (PDF).
Maybe use a native implementation in a faster language.
I suggest you first use K-means with about 20 clusters to get a rough draft of which documents are near each other, and then compare only pairs inside each cluster. Assuming uniformly sized clusters (and, as these numbers imply, 200 documents), this reduces the number of comparisons to 20*200 + 20*10*9, around 6000 comparisons instead of 19900.
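The sparse-matrix suggestion in the list above can be sketched with an inverted index: each term maps to the documents that contain it, so only pairs of documents that share at least one term are ever touched. This is an illustration under assumptions; the function name pairwiseDotProducts and the "id:id" key format are made up, and the result is the raw dot product, before dividing by the lengths and scaling by 500 as the original check() does.

```php
<?php
// $docs: document id => list of terms.
// Returns "idA:idB" => dot product, only for pairs that share at least one term.
function pairwiseDotProducts(array $docs): array
{
    // Inverted index (a sparse term-document matrix): term => [doc id => count]
    $index = [];
    foreach ($docs as $id => $terms) {
        foreach (array_count_values($terms) as $term => $count) {
            $index[$term][$id] = $count;
        }
    }

    $scores = [];
    foreach ($index as $postings) {
        $ids = array_keys($postings);
        $n = count($ids);
        // Every pair of documents sharing this term gains count * count.
        for ($i = 0; $i < $n; $i++) {
            for ($j = $i + 1; $j < $n; $j++) {
                $key = $ids[$i] . ':' . $ids[$j];
                $scores[$key] = ($scores[$key] ?? 0)
                    + $postings[$ids[$i]] * $postings[$ids[$j]];
            }
        }
    }
    return $scores;
}

// Sample data for illustration only.
$docs = [
    1 => ['cloud', 'rain', 'cloud'],
    2 => ['cloud', 'sun'],
    3 => ['sun', 'sand'],
];
$scores = pairwiseDotProducts($docs);
print_r($scores);
```

Very common terms produce long posting lists and dominate the cost, which is one reason the feature-selection step mentioned above tends to help.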
If you can compare plain text instead of arrays, and if I understood your goal correctly, you can use PHP's levenshtein function (which is often used to provide a Google-like 'Did you mean ...?' feature in PHP search engines).
It works the opposite way from your function: it returns the difference between two strings.
Example:
<?php
function check($a, $b) {
return levenshtein($a, $b);
}
$a = 'this is just a test';
$b = 'this is not test';
$c = 'this is just a test';
echo check($a, $b) . '<br />';
//return 5
echo check($a, $c) . '<br />';
//return 0, the strings are identical
?>
I don't know exactly whether this will improve the execution speed, but it might: you get rid of many foreach loops and the array_merge call.
EDIT:
A simple speed test (a script written in 30 seconds, so it's not 100% accurate):
function check($terms_in_article1, $terms_in_article2) {
$length1 = count($terms_in_article1); // number of words
$length2 = count($terms_in_article2); // number of words
$all_terms = array_merge($terms_in_article1, $terms_in_article2);
$all_terms = array_unique($all_terms);
foreach ($all_terms as $all_termsa) {
$term_vector1[$all_termsa] = 0;
$term_vector2[$all_termsa] = 0;
}
foreach ($terms_in_article1 as $terms_in_article1a) {
$term_vector1[$terms_in_article1a]++;
}
foreach ($terms_in_article2 as $terms_in_article2a) {
$term_vector2[$terms_in_article2a]++;
}
$score = 0;
foreach ($all_terms as $all_termsa) {
$score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
}
$score = $score/($length1*$length2);
$score *= 500; // for better readability
return $score;
}
$a = array('this', 'is', 'just', 'a', 'test');
$b = array('this', 'is', 'not', 'test');
$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
print: end in 0.36765 seconds
Second test:
<?php
function check($a, $b) {
return levenshtein($a, $b);
}
$a = 'this is just a test';
$b = 'this is not test';
$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
?>
print: end in 0.05023 seconds
So, yes, it seems faster.
It would be nice to try it with many array items (and many words for levenshtein).
2nd EDIT:
With similar_text the speed seems to be about equal to the levenshtein method:
<?php
function check($a, $b) {
return similar_text($a, $b);
}
$a = 'this is just a test ';
$b = 'this is not test';
$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
?>
print: end in 0.05988 seconds
But it can take more than 255 characters:
Note also that the complexity of this algorithm is O(N**3) where N is the length of the longest string.
And it can even return the similarity as a percentage:
function check($a, $b) {
similar_text($a, $b, $p);
return $p;
}
Yet another edit
What about creating a database function, so that the comparison happens directly in the SQL query, instead of retrieving all the data and looping over it?
If you're running MySQL, take a look at this one (a hand-made levenshtein function, still with the 255-character limit).
Otherwise, if you're on PostgreSQL, this other one (many functions that should be evaluated).
Another approach to take would be Latent Semantic Analysis, which leverages a large corpus of data to find similarities between documents.
The way it works is by taking the co-occurrence matrix of the text and comparing it to the corpus, essentially providing you with an abstract location of your document in a 'semantic space'. This will speed up your text comparison, as you can compare documents using Euclidean distance in the LSA semantic space. It's pretty fun semantic indexing. Thus, adding new articles will not take much longer.
I can't give a specific use case of this approach, having only learned it in school, but it appears that KnowledgeSearch is an open source implementation of the algorithm.
(Sorry, it's my first post, so I can't post links; just look it up.)
