Translate phrase with more than 1 singular/plural value - php

I have the following phrase and want to translate it.
$lifeTime = 'Expires in '.$days.($days == 1 ? ' day ':' days ' ).$h.(($h == 1 ? ' hour ':' hours ' ));
My question is: do I need to split the phrase, translate the parts separately, and concatenate them? Or is there a way to make the __n() function accept multiple singular/plural "instances", each with its own count, in a single phrase?
That is a bit confusing, so here is an example to make it clear:
__('I have %s eggs and %s milk boxes', $eggs, $milk)
This has no singular and plural forms. Can I translate the entire phrase without splitting it into two __n() calls?

If you know beforehand what words might come in your strings you could do as follows:
// the preparation/setup
$translations = array();
$translations['hours']['search'] = "%hour";
$translations['hours']['inflections'] = array("hour", "hours");
$translations['milk']['search'] = "%milk";
$translations['milk']['inflections'] = array("milk bottle", "bottles of milk");
// add what else you might need
function replaceWithPlurals($haystack, $translations) {
    foreach ($translations as $key => $vals) {
        $haystack = str_replace($vals['search'], $vals['cnt']." ".$vals['inflections'][usePlural($vals['cnt'])], $haystack);
    }
    return $haystack;
}
function usePlural($cnt) {
    // index 0 = singular, index 1 = plural
    return ($cnt == 1) ? 0 : 1;
}
// dynamic vars
$haystack = "I drank %milk in %hour";
$hours = 1;
$milk = 2;
$translations['hours']['cnt'] = $hours;
$translations['milk']['cnt'] = $milk;
// the actual usage
echo replaceWithPlurals($haystack, $translations);
Overall that's quite a lot of preparation if you only need it on four specific occasions. But if you use it regularly, I hope it helps.
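Another way to look at the original question is to pluralise each counted part on its own and then drop the parts into one outer sentence. The sketch below assumes a plural-aware helper; plural() here is a hypothetical stand-in for a call such as __n(), and lifeTime() is likewise just an illustrative name. A real translation layer would also look up the translated message for the current locale.

```php
<?php
// Hypothetical stand-in for a plural-aware translation call such as __n().
// It only chooses between the two English forms; it does no lookup.
function plural(string $singular, string $plural, int $count): string
{
    return sprintf($count === 1 ? $singular : $plural, $count);
}

// Hypothetical helper: builds the sentence from two independently pluralised parts.
function lifeTime(int $days, int $hours): string
{
    $dayPart  = plural('%d day', '%d days', $days);
    $hourPart = plural('%d hour', '%d hours', $hours);

    // The outer sentence is one translatable message with numbered slots,
    // so a translator could reorder the two parts if the language needs it.
    return sprintf('Expires in %1$s %2$s', $dayPart, $hourPart);
}

echo lifeTime(1, 5), "\n"; // Expires in 1 day 5 hours
```

The phrase is still split into three messages, but each message keeps a complete, translatable unit rather than a bare word.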

Related

Get range letter and number with php

How can I expand the range given by the string below in PHP?
M0000001:M0000100
I want this result:
M0000001
M0000002
M0000003
..
..
..
M0000100
This is what I tried:
<?php
$string = "M0000001:M0000100";
$explode = explode(":",$string );
$text_one = $explode[0];
$text_two = $explode[1];
$range = range($text_one,$text_two);
print_r($range);
?>
So can anyone help me with this?
This is one of many ways you could do this and this is a little verbose but hopefully it shows you some "steps" to take.
It doesn't check for the 1st number being bigger than the 2nd.
It doesn't check that your range strings start with an "M".
It doesn't have all of the required comments.
Those are things for you to consider and work out...
<?php
$string = "M00000045:M000099";
echo generate_range_from_string($string);
function generate_range_from_string($string) {
// First explode the two strings
$explode = explode(":", $string);
$text_one = $explode[0];
$text_two = $explode[1];
// Remove the Leading Alpha character
$range_one = str_replace('M', '', $text_one);
$range_two = str_replace('M', '', $text_two);
$padding_length = strlen($range_one);
// Build the output string
$output = '';
for ( $index = (int) $range_one; $index <= (int) $range_two; $index ++ ) {
$output .= 'M' . str_pad($index, $padding_length, '0', STR_PAD_LEFT) . '<br>';
}
return $output;
}
The output lists a String in the format you have specified in the question. So this is based solely upon that.
This could undergo a few more revisions to make it more function like, as I'm sure some folks will pick out!
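As one possible revision, the checks listed above (matching prefix, first number not bigger than the second) could be folded in along these lines. This is a sketch, not a definitive version: the function name expandRange and the assumed input format (one letter followed by digits on each side of the colon) are illustrative choices, and it returns an array instead of an HTML string.

```php
<?php
// Sketch: expand "M0000001:M0000100" into a list of codes.
// Assumes each side is a single letter followed by digits.
function expandRange(string $spec): array
{
    if (!preg_match('/^([A-Za-z])(\d+):([A-Za-z])(\d+)$/', $spec, $m)) {
        throw new InvalidArgumentException("Unrecognised range: $spec");
    }
    [, $prefixA, $from, $prefixB, $to] = $m;

    // Both ends must share the prefix, and the range must not run backwards.
    if ($prefixA !== $prefixB || (int) $from > (int) $to) {
        throw new InvalidArgumentException("Inconsistent range: $spec");
    }

    $width = strlen($from); // keep the zero padding of the first code
    $codes = [];
    for ($n = (int) $from; $n <= (int) $to; $n++) {
        $codes[] = $prefixA . str_pad((string) $n, $width, '0', STR_PAD_LEFT);
    }
    return $codes;
}

$codes = expandRange('M0000001:M0000100');
echo $codes[0], ' ... ', end($codes), ' (', count($codes), " codes)\n";
```

Joining the array with "<br>" or "\n" afterwards gives the same listing as the function above.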

Knapsack Equation with item groups

Apparently I can't call this a problem on Stack Overflow, but I am currently trying to understand how to integrate constraints in the form of item groups into the knapsack problem. My math skills are proving fairly limiting here; however, I am very motivated both to make this work as intended and to figure out what each part does (in that order, since things make more sense when they work).
With that said, I have found an absolutely beautiful implementation at Rosetta Code and cleaned up the variable names some to help myself better understand this from a very basic perspective.
Unfortunately I am having an incredibly difficult time figuring out how I can apply this logic to include item groups. My purpose is for building fantasy teams, supplying my own value & weight (points/salary) per player but without groups (positions in my case) I am unable to do so.
Would anyone be able to point me in the right direction for this? I'm reviewing code examples from other languages and additional descriptions of the problem as a whole, however I would like to get the groups implemented by whatever means possible.
<?php
function knapSolveFast2($itemWeight, $itemValue, $i, $availWeight, &$memoItems, &$pickedItems)
{
global $numcalls;
$numcalls++;
// Return memo if we have one
if (isset($memoItems[$i][$availWeight]))
{
return array( $memoItems[$i][$availWeight], $memoItems['picked'][$i][$availWeight] );
}
else
{
// At end of decision branch
if ($i == 0)
{
if ($itemWeight[$i] <= $availWeight)
{ // Will this item fit?
$memoItems[$i][$availWeight] = $itemValue[$i]; // Memo this item
$memoItems['picked'][$i][$availWeight] = array($i); // and the picked item
return array($itemValue[$i],array($i)); // Return the value of this item and add it to the picked list
}
else
{
// Won't fit
$memoItems[$i][$availWeight] = 0; // Memo zero
$memoItems['picked'][$i][$availWeight] = array(); // and a blank array entry...
return array(0,array()); // Return nothing
}
}
// Not at end of decision branch..
// Get the result of the next branch (without this one)
list ($without_i,$without_PI) = knapSolveFast2($itemWeight, $itemValue, $i-1, $availWeight,$memoItems,$pickedItems);
if ($itemWeight[$i] > $availWeight)
{ // Is this item too heavy to fit?
$memoItems[$i][$availWeight] = $without_i; // Memo without including this one
$memoItems['picked'][$i][$availWeight] = $without_PI; // and the items picked without it
return array($without_i,$without_PI); // and return it
}
else
{
// Get the result of the next branch (WITH this one picked, so available weight is reduced)
list ($with_i,$with_PI) = knapSolveFast2($itemWeight, $itemValue, ($i-1), ($availWeight - $itemWeight[$i]),$memoItems,$pickedItems);
$with_i += $itemValue[$i]; // ..and add the value of this one..
// Get the greater of WITH or WITHOUT
if ($with_i > $without_i)
{
$res = $with_i;
$picked = $with_PI;
array_push($picked,$i);
}
else
{
$res = $without_i;
$picked = $without_PI;
}
$memoItems[$i][$availWeight] = $res; // Store it in the memo
$memoItems['picked'][$i][$availWeight] = $picked; // and store the picked item
return array ($res,$picked); // and then return it
}
}
}
$items = array("map","compass","water","sandwich","glucose","tin","banana","apple","cheese","beer","suntan cream","camera","t-shirt","trousers","umbrella","waterproof trousers","waterproof overclothes","note-case","sunglasses","towel","socks","book");
$weight = array(9,13,153,50,15,68,27,39,23,52,11,32,24,48,73,42,43,22,7,18,4,30);
$value = array(150,35,200,160,60,45,60,40,30,10,70,30,15,10,40,70,75,80,20,12,50,10);
## Initialize
$numcalls = 0;
$memoItems = array();
$selectedItems = array();
## Solve
list ($m4, $selectedItems) = knapSolveFast2($weight, $value, sizeof($value)-1, 400, $memoItems, $selectedItems);
# Display Result
echo "<b>Items:</b><br>" . join(", ", $items) . "<br>";
echo "<b>Max Value Found:</b><br>$m4 (in $numcalls calls)<br>";
echo "<b>Array Indices:</b><br>". join(",", $selectedItems) . "<br>";
echo "<b>Chosen Items:</b><br>";
echo "<table border cellspacing=0>";
echo "<tr><td>Item</td><td>Value</td><td>Weight</td></tr>";
$totalValue = 0;
$totalWeight = 0;
foreach($selectedItems as $key)
{
$totalValue += $value[$key];
$totalWeight += $weight[$key];
echo "<tr><td>" . $items[$key] . "</td><td>" . $value[$key] . "</td><td>".$weight[$key] . "</td></tr>";
}
echo "<tr><td align=right><b>Totals</b></td><td>$totalValue</td><td>$totalWeight</td></tr>";
echo "</table><hr>";
?>
That knapsack program is traditional, but I think that it obscures what's going on. Let me show you how the DP can be derived more straightforwardly from a brute force solution.
In Python (sorry; this is my scripting language of choice), a brute force solution could look like this. First, there's a function for generating all subsets with breadth-first search (this is important).
def all_subsets(S):  # brute force
    subsets_so_far = [()]
    for x in S:
        new_subsets = [subset + (x,) for subset in subsets_so_far]
        subsets_so_far.extend(new_subsets)
    return subsets_so_far
Then there's a function that returns True if the solution is valid (within budget and with a proper position breakdown) – call it is_valid_solution – and a function that, given a solution, returns the total player value (total_player_value). Assuming that players is the list of available players, the optimal solution is this.
max(filter(is_valid_solution, all_subsets(players)), key=total_player_value)
Now, for a DP, we add a function cull to all_subsets.
def best_subsets(S):  # DP
    subsets_so_far = [()]
    for x in S:
        new_subsets = [subset + (x,) for subset in subsets_so_far]
        subsets_so_far.extend(new_subsets)
        subsets_so_far = cull(subsets_so_far)  ### This is new.
    return subsets_so_far
What cull does is to throw away the partial solutions that are clearly not going to be missed in our search for an optimal solution. If the partial solution is already over budget, or if it already has too many players at one position, then it can safely be discarded. Let is_valid_partial_solution be a function that tests these conditions (it probably looks a lot like is_valid_solution). So far we have this.
def cull(subsets):  # INCOMPLETE!
    # list(...) so the result can still be extended on Python 3
    return list(filter(is_valid_partial_solution, subsets))
The other important test is that some partial solutions are just better than others. If two partial solutions have the same position breakdown (e.g., two forwards and a center) and cost the same, then we only need to keep the more valuable one. Let cost_and_position_breakdown take a solution and produce a string that encodes the specified attributes.
def cull(subsets):
    best_subset = {}  # empty dictionary/map
    for subset in filter(is_valid_partial_solution, subsets):
        key = cost_and_position_breakdown(subset)
        if (key not in best_subset or
                total_player_value(subset) > total_player_value(best_subset[key])):
            best_subset[key] = subset
    # list(...) so the result can still be extended on Python 3
    return list(best_subset.values())
That's it. There's a lot of optimization to be done here (e.g., throw away partial solutions for which there's a cheaper and more valuable partial solution; modify the data structures so that we aren't always computing the value and position breakdown from scratch and to reduce the storage costs), but it can be tackled incrementally.
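For PHP readers, the same culling idea can be sketched for the simplest grouped case, where exactly one player is picked per position (often called the multiple-choice knapsack). Everything named here is an assumption for illustration: the function bestOnePerGroup, the [name, cost, value] tuples, and the one-per-position rule. Partial teams are keyed by total cost, and only the most valuable team at each cost survives a round.

```php
<?php
// Sketch: multiple-choice knapsack, exactly one item per group.
// $groups: position => list of [name, cost, value] (hypothetical layout).
function bestOnePerGroup(array $groups, int $budget): ?array
{
    // $best[cost] = [total value, picked names] of the best partial team at that cost
    $best = [0 => [0, []]];

    foreach ($groups as $group) {
        $next = [];
        foreach ($best as $cost => [$value, $picked]) {
            foreach ($group as [$name, $itemCost, $itemValue]) {
                $newCost = $cost + $itemCost;
                if ($newCost > $budget) {
                    continue; // cull: over budget
                }
                $newValue = $value + $itemValue;
                // cull: keep only the most valuable team for each cost
                if (!isset($next[$newCost]) || $newValue > $next[$newCost][0]) {
                    $next[$newCost] = [$newValue, array_merge($picked, [$name])];
                }
            }
        }
        $best = $next;
    }

    $winner = null;
    foreach ($best as $cost => [$value, $picked]) {
        if ($winner === null || $value > $winner['value']) {
            $winner = ['value' => $value, 'cost' => $cost, 'picked' => $picked];
        }
    }
    return $winner; // null when no team fits the budget
}

$groups = [
    'QB' => [['A', 5, 10], ['B', 3, 6]],
    'RB' => [['C', 4, 8], ['D', 2, 3]],
];
print_r(bestOnePerGroup($groups, 8));
```

Allowing several players per position would need the position counts in the key as well, as the cost_and_position_breakdown function above does.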
One potential small advantage with regard to composing recursive functions in PHP is that variables are passed by value (meaning a copy is made) rather than reference, which can save a step or two.
Perhaps you could better clarify what you are looking for by including a sample input and output. Here's an example that makes combinations from given groups - I'm not sure if that's your intention... I made the section accessing the partial result allow combinations with less value to be considered if their weight is lower - all of this can be changed to prune in the specific ways you would like.
function make_teams($players, $position_limits, $weights, $values, $max_weight){
$player_counts = array_map(function($x){
return count($x);
}, $players);
$positions = array_map(function($x){
return []; // start every position with an empty list of players
}, $position_limits);
$num_positions = count($positions);
$combinations = [];
$hash = [];
$stack = [[$positions,0,0,0,0,0]];
while (!empty($stack)){
$params = array_pop($stack);
$positions = $params[0];
$i = $params[1];
$j = $params[2];
$len = $params[3];
$weight = $params[4];
$value = $params[5];
// too heavy
if ($weight > $max_weight){
continue;
// the variable, $positions, is accumulating so you can access the partial result
} else if ($j == 0 && $i > 0){
// remember weight and value after each position is chosen
if (!isset($hash[$i])){
$hash[$i] = [$weight,$value];
// end thread if current value is lower for similar weight
} else if ($weight >= $hash[$i][0] && $value < $hash[$i][1]){
continue;
// remember better weight and value
} else if ($weight <= $hash[$i][0] && $value > $hash[$i][1]){
$hash[$i] = [$weight,$value];
}
}
// all positions have been filled
if ($i == $num_positions){
$positions[] = $weight;
$positions[] = $value;
if (!empty($combinations)){
$last = &$combinations[count($combinations) - 1];
if ($weight < $last[$num_positions] && $value > $last[$num_positions + 1]){
$last = $positions;
} else {
$combinations[] = $positions;
}
} else {
$combinations[] = $positions;
}
// current position is filled
} else if (count($positions[$i]) == $position_limits[$i]){
$stack[] = [$positions,$i + 1,0,$len,$weight,$value];
// otherwise create two new threads: one with player $j added to
// position $i, the other thread skipping player $j
} else {
if ($j < $player_counts[$i] - 1){
$stack[] = [$positions,$i,$j + 1,$len,$weight,$value];
}
if ($j < $player_counts[$i]){
$positions[$i][] = $players[$i][$j];
$stack[] = [$positions,$i,$j + 1,$len + 1
,$weight + $weights[$i][$j],$value + $values[$i][$j]];
}
}
}
return $combinations;
}
Usage and output:
$players = [[1,2],[3,4,5],[6,7]];
$position_limits = [1,2,1];
$weights = [[2000000,1000000],[10000000,1000500,12000000],[5000000,1234567]];
$values = [[33,5],[78,23,10],[11,101]];
$max_weight = 20000000;
echo json_encode(make_teams($players, $position_limits, $weights, $values, $max_weight));
/*
[[[1],[3,4],[7],14235067,235],[[2],[3,4],[7],13235067,207]]
*/

MySQL / PHP, but more of a MATH Question (Shortening Script)

For my latest project I need to shorten URLs, which I then store in a MySQL database.
I have run into a problem that I don't know how to solve. Basically, the shortened strings should look like this (I want to include lowercase letters, uppercase letters and numbers):
a
b
...
z
0
...
9
A
...
Z
aa
ab
ac
...
ba
So the first URL maps to a and is stored in MySQL.
The next new URL is stored as b, because a is already in the MySQL database.
And that is it. But I don't have any idea how to do it. Could someone please help me out?
Edit: Formatted and added further explanation.
It is kind of like the imgur.com URL shortening service. It should continue like this indefinitely (which is probably more than I need...)
You can use the following function (code adapted from my personal framework):
function Base($input, $output, $number = 1, $charset = 'abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ')
{
if (strlen($charset) >= 2)
{
$input = max(2, min(intval($input), strlen($charset)));
$output = max(2, min(intval($output), strlen($charset)));
$number = ltrim(preg_replace('~[^' . preg_quote(substr($charset, 0, max($input, $output)), '~') . ']+~', '', $number), $charset[0]);
if (strlen($number) > 0)
{
if ($input != 10)
{
$result = 0;
foreach (str_split(strrev($number)) as $key => $value)
{
$result += pow($input, $key) * intval(strpos($charset, $value));
}
$number = $result;
}
if ($output != 10)
{
$result = $charset[$number % $output];
while (($number = intval($number / $output)) > 0)
{
$result = $charset[$number % $output] . $result;
}
$number = $result;
}
return $number;
}
return $charset[0];
}
return false;
}
Basically you just need to grab the newly generated auto-incremented ID (this also makes sure you don't generate any collisions) from your table and pass it to this function like this:
$short_id = Base(10, 62, $auto_increment_id);
Note that the first and second arguments define the input and output bases, respectively.
Also, I've reordered the charset from the "default" 0-9a-zA-Z to comply with your examples.
You can also just use base_convert() if you can live without the mixed alphabet case (base 36).
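One detail worth noting: in a plain base-62 conversion the first character of the charset acts as zero, so after Z comes ba, not aa. If the exact sequence from the question (a ... Z, aa, ab, ...) matters, a bijective numbering is one way to get it. The sketch below is an assumption-laden illustration: shortCode is a hypothetical name, and IDs are assumed to start at 1, as auto-increment IDs usually do.

```php
<?php
// Sketch: bijective base-62 numbering, so 1 => a, 62 => Z, 63 => aa, 64 => ab.
// shortCode() is a hypothetical helper; $id is assumed to be >= 1.
function shortCode(int $id, string $charset = 'abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'): string
{
    if ($id < 1) {
        throw new InvalidArgumentException('ID must be at least 1');
    }
    $base = strlen($charset);
    $code = '';
    while ($id > 0) {
        $id--; // shift to 0-based so there is no "zero" digit
        $code = $charset[$id % $base] . $code;
        $id = intdiv($id, $base);
    }
    return $code;
}

foreach ([1, 2, 26, 27, 62, 63, 64] as $id) {
    echo $id, ' => ', shortCode($id), "\n";
}
```

Because the code is derived from the unique ID, no lookup for an unused string is needed.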

Wrongly asked or am I stupid?

There's a blog post comment on codinghorror.com by Paul Jungwirth which includes a little programming task:
You have the numbers 123456789, in that order. Between each number, you must insert either nothing, a plus sign, or a multiplication sign, so that the resulting expression equals 2001. Write a program that prints all solutions. (There are two.)
Bored, I thought, I'd have a go, but I'll be damned if I can get a result for 2001. I think the code below is sound and I reckon that there are zero solutions that result in 2001. According to my code, there are two solutions for 2002. Am I right or am I wrong?
/**
* Take the numbers 123456789 and form expressions by inserting one of ''
* (empty string), '+' or '*' between each number.
* Find (2) solutions such that the expression evaluates to the number 2001
*/
$input = array(1,2,3,4,5,6,7,8,9);
// an array of strings representing 8 digit, base 3 numbers
$ops = array();
$numOps = sizeof($input)-1; // always 8
$mask = str_repeat('0', $numOps); // mask of 8 zeros for padding
// generate the ops array
$limit = pow(3, $numOps) -1;
for ($i = 0; $i <= $limit; $i++) {
$s = (string) $i;
$s = base_convert($s, 10, 3);
$ops[] = substr($mask, 0, $numOps - strlen($s)) . $s;
}
// for each element in the ops array, generate an expression by inserting
// '', '*' or '+' between the numbers in $input. e.g. element 11111111 will
// result in 1+2+3+4+5+6+7+8+9
$limit = sizeof($ops);
$stringResult = null;
$numericResult = null;
for ($i = 0; $i < $limit; $i++) {
$l = $numOps;
$stringResult = '';
$numericResult = 0;
for ($j = 0; $j <= $l; $j++) {
$stringResult .= (string) $input[$j];
switch (substr($ops[$i], $j, 1)) {
case '0':
break;
case '1':
$stringResult .= '+';
break;
case '2':
$stringResult .= '*';
break;
default :
}
}
// evaluate the expression
// split the expression into smaller ones to be added together
$temp = explode('+', $stringResult);
$additionElems = array();
foreach ($temp as $subExpressions)
{
// split each of those into ones to be multiplied together
$multplicationElems = explode('*', $subExpressions);
$working = 1;
foreach ($multplicationElems as $operand) {
$working *= $operand;
}
$additionElems[] = $working;
}
$numericResult = 0;
foreach($additionElems as $operand)
{
$numericResult += $operand;
}
if ($numericResult == 2001) {
echo "{$stringResult}\n";
}
}
Further down the same page you linked to.... =)
"Paul Jungwirth wrote:
You have the numbers 123456789, in that order. Between each number, you must insert either nothing, a plus sign, or a multiplication sign, so that the resulting expression equals 2001. Write a program that prints all solutions. (There are two.)
I think you meant 2002, not 2001. :)
(Just correcting for anyone else like me who obsessively tries to solve little "practice" problems like this one, and then hit Google when their result doesn't match the stated answer. ;) Damn, some of those Perl examples are ugly.)"
The number is 2002.
A recursive solution takes eleven lines of JavaScript. That excludes evaluating the string expression, which eval() does as a standard JavaScript function; rolling your own evaluator for this specific scenario would probably take another ten or so lines of code:
function combine (digit, exp) {
    if (digit > 9) {
        if (eval(exp) == 2002) alert(exp + '=2002');
        return;
    }
    combine(digit + 1, exp + '+' + digit);
    combine(digit + 1, exp + '*' + digit);
    combine(digit + 1, exp + digit);
    return;
}
combine(2, '1');
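Since the question was in PHP, here is a sketch of the same recursion in PHP, with a small hand-rolled evaluator in place of eval(). The function names evaluate and solve are illustrative only. The evaluator relies on the expressions containing nothing but digits, '+' and '*', so multiplication can be handled by splitting on '+' first.

```php
<?php
// Evaluate an expression made only of digits, '+' and '*'.
// Splitting on '+' first gives '*' its usual higher precedence.
function evaluate(string $expr): int
{
    $sum = 0;
    foreach (explode('+', $expr) as $term) {
        $sum += array_product(array_map('intval', explode('*', $term)));
    }
    return $sum;
}

// Try '', '+' and '*' before each digit from $digit to 9 and
// collect every complete expression that evaluates to $target.
function solve(int $target, int $digit = 2, string $expr = '1'): array
{
    if ($digit > 9) {
        return evaluate($expr) === $target ? [$expr] : [];
    }
    $found = [];
    foreach (['', '+', '*'] as $op) {
        $found = array_merge($found, solve($target, $digit + 1, $expr . $op . $digit));
    }
    return $found;
}

echo implode("\n", solve(2002)), "\n";
```

Two expressions that do work out to 2002 are 1*2+34*56+7+89 and 1*23+45*6*7+89.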

Compare lots of texts (clustering) with a matrix

I have the following PHP function to calculate the relation between two texts:
function check($terms_in_article1, $terms_in_article2) {
$length1 = count($terms_in_article1); // number of words
$length2 = count($terms_in_article2); // number of words
$all_terms = array_merge($terms_in_article1, $terms_in_article2);
$all_terms = array_unique($all_terms);
foreach ($all_terms as $all_termsa) {
$term_vector1[$all_termsa] = 0;
$term_vector2[$all_termsa] = 0;
}
foreach ($terms_in_article1 as $terms_in_article1a) {
$term_vector1[$terms_in_article1a]++;
}
foreach ($terms_in_article2 as $terms_in_article2a) {
$term_vector2[$terms_in_article2a]++;
}
$score = 0;
foreach ($all_terms as $all_termsa) {
$score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
}
$score = $score/($length1*$length2);
$score *= 500; // for better readability
return $score;
}
The variable $terms_in_articleX must be an array containing all single words which appear in the text.
Assuming I have a database of 20,000 texts, this function would take a very long time to run through all the connections.
How can I accelerate this process? Should I add all texts into a huge matrix instead of always comparing only two texts? It would be great if you had some approaches with code, preferably in PHP.
I hope you can help me. Thanks in advance!
You can split each text into words when you add it. A simple example: preg_match_all('/\w+/', $text, $matches); Real splitting is of course not that simple, but it is possible; just adjust the pattern :)
Create a table id(int primary autoincrement), value(varchar unique) and a link table like this: word_id(int), text_id(int), word_count(int). Then fill the tables with new values after splitting each text.
Finally, you can do anything you want with this data, operating quickly on indexed integers (IDs) in the database.
UPDATE:
Here are the tables and queries:
CREATE TABLE terms (
id int(11) NOT NULL auto_increment, value char(255) NOT NULL,
PRIMARY KEY (`id`), UNIQUE KEY `value` (`value`)
);
CREATE TABLE `terms_in_articles` (
term int(11) NOT NULL,
article int(11) NOT NULL,
cnt int(11) NOT NULL default '1',
UNIQUE KEY `term` (`term`,`article`)
);
/* Returns all unique terms in both articles (your $all_terms) */
SELECT t.id, t.value
FROM terms t, terms_in_articles a
WHERE a.term = t.id AND a.article IN (1, 2);
/* Returns your $term_vector1, $term_vector2 */
SELECT article, term, cnt
FROM terms_in_articles
WHERE article IN (1, 2) ORDER BY article;
/* Returns article and total count of term entries in it ($length1, $length2) */
SELECT article, SUM(cnt) AS total
FROM terms_in_articles
WHERE article IN (1, 2) GROUP BY article;
/* Returns your $score, which you may divide by ($length1 * $length2) from the previous query */
SELECT SUM(tmp.term_score) * 500 AS total_score FROM
(
SELECT (a1.cnt * a2.cnt) AS term_score
FROM terms_in_articles a1, terms_in_articles a2
WHERE a1.article = 1 AND a2.article = 2 AND a1.term = a2.term
GROUP BY a2.term, a1.term
) AS tmp;
Well, I hope this helps. The last two queries are enough to perform your task; the other queries are just in case. Of course, you can compute more stats, like "the most popular terms"...
Here's a slightly optimized version of your original function. It produces the exact same results. I ran it on two articles from Wikipedia with 10000+ terms, about 20 runs each:
check():
test A score: 4.55712524522
test B score: 5.08138042619
--Time: 1.0707
check2():
test A score: 4.55712524522
test B score: 5.08138042619
--Time: 0.2624
Here's the code:
function check2($terms_in_article1, $terms_in_article2) {
$length1 = count($terms_in_article1); // number of words
$length2 = count($terms_in_article2); // number of words
$score_table = array();
foreach($terms_in_article1 as $term){
if(!isset($score_table[$term])) $score_table[$term] = 0;
$score_table[$term] += 1;
}
$score_table2 = array();
foreach($terms_in_article2 as $term){
if(isset($score_table[$term])){
if(!isset($score_table2[$term])) $score_table2[$term] = 0;
$score_table2[$term] += 1;
}
}
$score =0;
foreach($score_table2 as $key => $entry){
$score += $score_table[$key] * $entry;
}
$score = $score / ($length1*$length2);
$score *= 500;
return $score;
}
(Btw. The time needed to split all the words into arrays was not included.)
EDIT: Trying to be more explicit:
First, encode every term into an integer. You can use a dictionary (associative array), like this:
$count = 0;
$dict = array();
$doc_as_int = array();
foreach ($doc as $term) {
    if (!isset($dict[$term])) {
        $dict[$term] = $count++; // first time we see this term: assign the next ID
    }
    $val = $dict[$term];
    if (!isset($doc_as_int[$val])) {
        $doc_as_int[$val] = 0;
    }
    $doc_as_int[$val]++; // count occurrences by integer ID
}
This way, you replace string calculations with integer calculations. For example, you can represent the word "cloud" as the number 5, and then use index 5 of arrays to store counts of the word "cloud". Notice that we only use associative array search here; there is no need for CRC etc.
Do store all texts as a matrix, preferably a sparse one.
Use feature selection (PDF).
Maybe use a native implementation in a faster language.
I suggest you first use K-means with about 20 clusters to get a rough draft of which documents are near one another, and then compare only pairs inside each cluster. Assuming uniformly sized clusters and, for example, 200 documents, this reduces the number of comparisons to 20*200 + 20*10*9, around 6000 comparisons instead of 19900.
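The "sparse matrix" point above can be sketched in PHP: count each document's terms once, keep the counts as a sparse row (term => count), and score any pair from those rows without rebuilding the vectors per comparison. The names termCounts and score are illustrative, and the formula is assumed to be the same as in the original check() function.

```php
<?php
// Sketch: one sparse row (term => count) per document, built once.
function termCounts(array $terms): array
{
    return array_count_values($terms);
}

// Same score as check(): dot product of the counts, divided by the
// product of the document lengths, times 500.
function score(array $countsA, array $countsB): float
{
    $lengthA = array_sum($countsA);
    $lengthB = array_sum($countsB);
    if ($lengthA === 0 || $lengthB === 0) {
        return 0.0; // an empty document shares nothing
    }
    // Iterate over the row with fewer distinct terms.
    if (count($countsA) > count($countsB)) {
        [$countsA, $countsB] = [$countsB, $countsA];
    }
    $dot = 0;
    foreach ($countsA as $term => $n) {
        if (isset($countsB[$term])) {
            $dot += $n * $countsB[$term];
        }
    }
    return 500 * $dot / ($lengthA * $lengthB);
}

$docs = [
    'a' => ['this', 'is', 'just', 'a', 'test'],
    'b' => ['this', 'is', 'not', 'test'],
];
$rows = array_map('termCounts', $docs); // computed once per document
echo score($rows['a'], $rows['b']), "\n"; // 75
```

The number of pairs is unchanged, but each comparison only touches the terms of the smaller document, which is where clustering first would then reduce the pair count.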
If you can compare plain text instead of arrays, and if I understood your goal correctly, you can use PHP's levenshtein function (usually used to provide the Google-like "Did you mean ...?" feature in PHP search engines).
It works the opposite way from your function: it returns the difference between two strings.
Example:
<?php
function check($a, $b) {
return levenshtein($a, $b);
}
$a = 'this is just a test';
$b = 'this is not test';
$c = 'this is just a test';
echo check($a, $b) . '<br />';
//return 5
echo check($a, $c) . '<br />';
//return 0, the strings are identical
?>
I don't know exactly whether this will improve the speed of execution, but it may: you drop many foreach loops and the array_merge call.
EDIT:
A simple speed test (a script written in 30 seconds, so it's not 100% accurate):
function check($terms_in_article1, $terms_in_article2) {
$length1 = count($terms_in_article1); // number of words
$length2 = count($terms_in_article2); // number of words
$all_terms = array_merge($terms_in_article1, $terms_in_article2);
$all_terms = array_unique($all_terms);
foreach ($all_terms as $all_termsa) {
$term_vector1[$all_termsa] = 0;
$term_vector2[$all_termsa] = 0;
}
foreach ($terms_in_article1 as $terms_in_article1a) {
$term_vector1[$terms_in_article1a]++;
}
foreach ($terms_in_article2 as $terms_in_article2a) {
$term_vector2[$terms_in_article2a]++;
}
$score = 0;
foreach ($all_terms as $all_termsa) {
$score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
}
$score = $score/($length1*$length2);
$score *= 500; // for better readability
return $score;
}
$a = array('this', 'is', 'just', 'a', 'test');
$b = array('this', 'is', 'not', 'test');
$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
print: end in 0.36765 seconds
Second test:
<?php
function check($a, $b) {
return levenshtein($a, $b);
}
$a = 'this is just a test';
$b = 'this is not test';
$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
?>
print: end in 0.05023 seconds
So, yes, it seems faster.
It would be nice to try it with many array items (and many words for levenshtein).
Second EDIT:
With similar_text the speed seems to be about equal to the levenshtein method:
<?php
function check($a, $b) {
return similar_text($a, $b);
}
$a = 'this is just a test ';
$b = 'this is not test';
$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
?>
print: end in 0.05988 seconds
But similar_text can take strings longer than 255 characters. Keep this in mind, though:
Note also that the complexity of this algorithm is O(N**3) where N is the length of the longest string.
It can even return the similarity as a percentage:
function check($a, $b) {
similar_text($a, $b, $p);
return $p;
}
Yet another edit
What about creating a database function that does the comparison directly in the SQL query, instead of retrieving all the data and looping over it?
If you're running MySQL, take a look at this one (a hand-made levenshtein function, still with the 255-character limit).
Otherwise, if you're on PostgreSQL, this other one (many functions that should be evaluated).
Another approach to take would be Latent Semantic Analysis, which leverages a large corpus of data to find similarities between documents.
The way it works is by taking the co-occurrence matrix of the text and comparing it to the corpus, essentially providing you with an abstract location of your document in a "semantic space". This will speed up your text comparison, as you can compare documents using Euclidean distance in the LSA semantic space. It's pretty fun semantic indexing. Thus, adding new articles will not take much longer.
I can't give a specific use case of this approach, having only learned it in school, but it appears that KnowledgeSearch is an open source implementation of the algorithm.
(Sorry, it's my first post, so I can't post links; just look it up.)
