Compare lots of texts (clustering) with a matrix - php

I have the following PHP function to calculate the relation between two texts:
function check($terms_in_article1, $terms_in_article2) {
$length1 = count($terms_in_article1); // number of words
$length2 = count($terms_in_article2); // number of words
$all_terms = array_merge($terms_in_article1, $terms_in_article2);
$all_terms = array_unique($all_terms);
foreach ($all_terms as $all_termsa) {
$term_vector1[$all_termsa] = 0;
$term_vector2[$all_termsa] = 0;
}
foreach ($terms_in_article1 as $terms_in_article1a) {
$term_vector1[$terms_in_article1a]++;
}
foreach ($terms_in_article2 as $terms_in_article2a) {
$term_vector2[$terms_in_article2a]++;
}
$score = 0;
foreach ($all_terms as $all_termsa) {
$score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
}
$score = $score/($length1*$length2);
$score *= 500; // for better readability
return $score;
}
The variable $terms_in_articleX must be an array containing all single words which appear in the text.
Assuming I have a database of 20,000 texts, this function would take a very long time to run through all the pairs (about 200 million comparisons).
How can I accelerate this process? Should I add all texts into a huge matrix instead of always comparing only two texts? It would be great if you had some approaches with code, preferably in PHP.
I hope you can help me. Thanks in advance!

You can split the text when you add it. Simple example: preg_match_all('/\w+/', $text, $matches); Real word splitting is not that simple, of course, but it is possible: just refine the pattern :)
Create a terms table with id (int, primary key, auto-increment) and value (varchar, unique), and a link table like this: word_id (int), text_id (int), word_count (int). Then fill the tables with the new values after splitting each text.
Finally, you can do anything you want with this data, operating quickly on indexed integer IDs in the DB.
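As a rough illustration of the splitting step, the sketch below turns a text into (term, count) pairs that could then be written to the link table. The function name termCounts() and the lowercasing step are assumptions for this example, not part of the schema described here.

```php
<?php
// Hypothetical helper: split a text into words and count each term.
function termCounts(string $text): array
{
    // \w+ is a simplistic notion of a word; refine the pattern as needed.
    preg_match_all('/\w+/', strtolower($text), $matches);
    // Maps each distinct term to its number of occurrences.
    return array_count_values($matches[0]);
}

foreach (termCounts('This is a test. This is not.') as $term => $cnt) {
    echo $term, ': ', $cnt, "\n";
}
```

Each resulting pair corresponds to one row of the link table once the term has been mapped to its integer ID.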
UPDATE:
Here are the tables and queries:
CREATE TABLE terms (
id int(11) NOT NULL auto_increment, value char(255) NOT NULL,
PRIMARY KEY (`id`), UNIQUE KEY `value` (`value`)
);
CREATE TABLE `terms_in_articles` (
term int(11) NOT NULL,
article int(11) NOT NULL,
cnt int(11) NOT NULL default '1',
UNIQUE KEY `term` (`term`,`article`)
);
/* Returns all unique terms in both articles (your $all_terms) */
SELECT t.id, t.value
FROM terms t, terms_in_articles a
WHERE a.term = t.id AND a.article IN (1, 2);
/* Returns your $term_vector1, $term_vector2 */
SELECT article, term, cnt
FROM terms_in_articles
WHERE article IN (1, 2) ORDER BY article;
/* Returns article and total count of term entries in it ($length1, $length2) */
SELECT article, SUM(cnt) AS total
FROM terms_in_articles
WHERE article IN (1, 2) GROUP BY article;
/* Returns your $score, which you may divide by ($length1 * $length2) from the previous query */
SELECT SUM(tmp.term_score) * 500 AS total_score FROM
(
SELECT (a1.cnt * a2.cnt) AS term_score
FROM terms_in_articles a1, terms_in_articles a2
WHERE a1.article = 1 AND a2.article = 2 AND a1.term = a2.term
GROUP BY a2.term, a1.term
) AS tmp;
I hope this helps. The last two queries are enough to perform your task; the other queries are there just in case. You can, of course, compute more statistics, such as "the most popular terms".

Here's a slightly optimized version of your original function. It produces the exact same results. (I ran it on two articles from Wikipedia with 10,000+ terms, about 20 runs each.)
check():
test A score: 4.55712524522
test B score: 5.08138042619
--Time: 1.0707
check2():
test A score: 4.55712524522
test B score: 5.08138042619
--Time: 0.2624
Here's the code:
function check2($terms_in_article1, $terms_in_article2) {
$length1 = count($terms_in_article1); // number of words
$length2 = count($terms_in_article2); // number of words
$score_table = array();
foreach($terms_in_article1 as $term){
if(!isset($score_table[$term])) $score_table[$term] = 0;
$score_table[$term] += 1;
}
$score_table2 = array();
foreach($terms_in_article2 as $term){
if(isset($score_table[$term])){
if(!isset($score_table2[$term])) $score_table2[$term] = 0;
$score_table2[$term] += 1;
}
}
$score =0;
foreach($score_table2 as $key => $entry){
$score += $score_table[$key] * $entry;
}
$score = $score / ($length1*$length2);
$score *= 500;
return $score;
}
(Btw. The time needed to split all the words into arrays was not included.)

EDIT: Trying to be more explicit:
First, encode every term into an integer. You can use a dictionary associative array, like this:
$count = 0;
$dict = array(); // term => integer ID
$doc_as_int = array(); // integer ID => count in this document
foreach ($doc as $term) {
if (!isset($dict[$term])) {
$dict[$term] = $count++; // first time we see this term
}
$val = $dict[$term];
if (!isset($doc_as_int[$val])) {
$doc_as_int[$val] = 0;
}
$doc_as_int[$val]++;
}
This way, you replace string calculations with integer calculations. For example, you can represent the word "cloud" as the number 5, and then use index 5 of the arrays to store the counts of the word "cloud". Notice that we only use an associative array lookup here; there is no need for CRC etc.
Do store all texts as a matrix, preferably a sparse one.
Use feature selection (PDF).
Maybe use a native implementation in a faster language.
I suggest you first run K-means with about 20 clusters to get a rough draft of which documents are near each other, and then compare only pairs inside each cluster. For example, with 200 documents and uniformly sized clusters of 10 documents each, this takes about 20*200 + 20*10*9 = 5,800 comparisons (roughly 6,000) instead of the 19,900 needed for all pairs (200*199/2).
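The sparse-matrix idea can be sketched in PHP as follows, assuming each document is converted once into a term => count array so that no pair of documents has to be re-counted. The names termVector() and sparseScore() are made up for this example; the scoring formula is the one from the question.

```php
<?php
// Hypothetical helper: one sparse row of the document-term matrix.
function termVector(array $terms): array
{
    return array_count_values($terms);
}

// Dot product over the shared terms only, scaled as in the question.
function sparseScore(array $v1, int $len1, array $v2, int $len2): float
{
    // Iterate over the smaller vector and look terms up in the larger one.
    if (count($v1) > count($v2)) {
        [$v1, $v2] = [$v2, $v1];
    }
    $dot = 0;
    foreach ($v1 as $term => $cnt) {
        if (isset($v2[$term])) {
            $dot += $cnt * $v2[$term];
        }
    }
    return 500 * $dot / ($len1 * $len2);
}

$a = array('this', 'is', 'just', 'a', 'test');
$b = array('this', 'is', 'not', 'test');
echo sparseScore(termVector($a), count($a), termVector($b), count($b)), "\n"; // 75
```

The vectors and lengths would be computed once per document and cached, so each pairwise comparison only costs a loop over the shorter vector.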

If you can compare plain text instead of arrays, and if I understood your goal correctly, you can use PHP's levenshtein() function (which is often used for the Google-like 'Did you mean ...?' feature in PHP search engines).
It works the opposite way from your function: it returns the difference between two strings.
Example:
<?php
function check($a, $b) {
return levenshtein($a, $b);
}
$a = 'this is just a test';
$b = 'this is not test';
$c = 'this is just a test';
echo check($a, $b) . '<br />';
//return 5
echo check($a, $c) . '<br />';
//return 0, the strings are identical
?>
I don't know for sure whether this will improve execution speed, but it might: you get rid of several foreach loops and the array_merge() call.
EDIT:
A simple speed test (a script written in 30 seconds, so it is not 100% accurate):
function check($terms_in_article1, $terms_in_article2) {
$length1 = count($terms_in_article1); // number of words
$length2 = count($terms_in_article2); // number of words
$all_terms = array_merge($terms_in_article1, $terms_in_article2);
$all_terms = array_unique($all_terms);
foreach ($all_terms as $all_termsa) {
$term_vector1[$all_termsa] = 0;
$term_vector2[$all_termsa] = 0;
}
foreach ($terms_in_article1 as $terms_in_article1a) {
$term_vector1[$terms_in_article1a]++;
}
foreach ($terms_in_article2 as $terms_in_article2a) {
$term_vector2[$terms_in_article2a]++;
}
$score = 0;
foreach ($all_terms as $all_termsa) {
$score += $term_vector1[$all_termsa]*$term_vector2[$all_termsa];
}
$score = $score/($length1*$length2);
$score *= 500; // for better readability
return $score;
}
$a = array('this', 'is', 'just', 'a', 'test');
$b = array('this', 'is', 'not', 'test');
$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
print: end in 0.36765 seconds
Second test:
<?php
function check($a, $b) {
return levenshtein($a, $b);
}
$a = 'this is just a test';
$b = 'this is not test';
$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
?>
print: end in 0.05023 seconds
So, yes, it seems faster.
It would be nice to try it with many array items (and many words for levenshtein()).
Second edit:
With similar_text() the speed seems to be about equal to the levenshtein() method:
<?php
function check($a, $b) {
return similar_text($a, $b);
}
$a = 'this is just a test ';
$b = 'this is not test';
$timenow = microtime();
list($m_i, $t_i) = explode(' ', $timenow);
for($i = 0; $i != 10000; $i++){
check($a, $b);
}
$last = microtime();
list($m_f, $t_f) = explode(' ', $last);
$fine = $m_f+$t_f;
$inizio = $m_i+$t_i;
$quindi = $fine - $inizio;
$quindi = substr($quindi, 0, 7);
echo 'end in ' . $quindi . ' seconds';
?>
print: end in 0.05988 seconds
But it can take strings longer than 255 characters. Note, from the manual:
Note also that the complexity of this algorithm is O(N**3) where N is the length of the longest string.
It can also return the similarity as a percentage:
function check($a, $b) {
similar_text($a, $b, $p);
return $p;
}
Yet another edit
What about creating a database function, so that the comparison happens directly in the SQL query instead of retrieving all the data and looping over it?
If you're running MySQL, take a look at this one (a hand-made levenshtein function, still with the 255-character limit).
Otherwise, if you're on PostgreSQL, there is this other one (it has many functions that are worth evaluating).

Another approach would be Latent Semantic Analysis, which leverages a large corpus of data to find similarities between documents.
It works by taking the co-occurrence matrix of the text and comparing it to the corpus, essentially giving you an abstract location for your document in a 'semantic space'. This speeds up text comparison, because you can compare documents using Euclidean distance in the LSA semantic space. It's pretty fun semantic indexing. Adding new articles will therefore not take much longer.
I can't give a specific use case of this approach, having only learned it in school, but it appears that KnowledgeSearch is an open source implementation of the algorithm.
(Sorry, it's my first post, so I can't post links; just look it up.)

Related

Solve Multiple Choice Knapsack (MCKP) With Dynamic Programming?

Example Data
For this question, let's assume the following items:
Items: Apple, Banana, Carrot, Steak, Onion
Values: 2, 2, 4, 5, 3
Weights: 3, 1, 3, 4, 2
Max Weight: 7
Objective:
The MCKP is a type of Knapsack Problem with the additional constraint that "[T]he items are subdivided into k classes... and exactly one item must be taken from each class"
I have written the code to solve the 0/1 KS problem with dynamic programming using recursive calls and memoization. My question is whether it is possible to add this constraint to my current solution? Say my classes are Fruit, Vegetables, Meat (from the example), I would need to include 1 of each type. The classes could just as well be type 1, 2, 3.
Also, I think this can be solved with linear programming and a solver, but if possible, I'd like to understand the answer here.
Current Code:
<?php
$value = array(2, 2, 4, 5, 3);
$weight = array(3, 1, 3, 4, 2);
$maxWeight = 7;
$maxItems = 5;
$seen = array(array()); //2D array for memoization
$picked = array();
//Put a dummy zero at the front to make things easier later.
array_unshift($value, 0);
array_unshift($weight, 0);
//Call our Knapsack Solver and return the sum value of optimal set
$KSResult = KSTest($maxItems, $maxWeight, $value, $weight);
$maxValue = $KSResult; //copy the result so we can recreate the table
//Recreate the decision table from our memo array to determine what items were picked
//Here I am building the table backwards because I know the optimal value will be at the end
for($i=$maxItems; $i > 0; $i--) {
for($j=$maxWeight; $j > 0; $j--) {
if($seen[$i][$j] != $seen[$i-1][$j]
&& $maxValue == $seen[$i][$j]) {
array_push($picked, $i);
$maxValue -= $value[$i];
break;
}
}
}
//Print out picked items and max value
print("<pre>".print_r($picked,true)."</pre>");
echo $KSResult;
// Recursive formula to solve the KS Problem
// $n = number of items to check
// $c = total capacity of bag
function KSTest($n, $c, &$value, &$weight) {
global $seen;
if(isset($seen[$n][$c])) {
//We've seen this subproblem before
return $seen[$n][$c];
}
if($n === 0 || $c === 0){
//No more items to check or no more capacity
$result = 0;
}
elseif($weight[$n] > $c) {
//This item is too heavy, check next item without this one
$result = KSTest($n-1, $c, $value, $weight);
}
else {
//Take the higher result of keeping or not keeping the item
$tempVal1 = KSTest($n-1, $c, $value, $weight);
$tempVal2 = $value[$n] + KSTest($n-1, $c-$weight[$n], $value, $weight);
if($tempVal2 >= $tempVal1) {
$result = $tempVal2;
//some conditions could go here? otherwise use max()
}
else {
$result = $tempVal1;
}
}
//memo the results and return
$seen[$n][$c] = $result;
return $result;
}
?>
What I've Tried:
My first thought was to add a class (k) array, sort the items via class (k), and when we choose to select an item that is the same as the next item, check if it's better to keep the current item or the item without the next item. Seemed promising, but fell apart after a couple of items being checked. Something like this:
$tempVal3 = $value[$n] + KSTest($n-2, $c-$weight[$n]);
max( $tempVal2, $tempVal3);
Another thought is that at the function call, I could call a loop for each class type and solve the KS with only 1 item at a time of that type + the rest of the values. This will definitely be making some assumptions, though, because the results of set 1 might still be assuming multiples of set 2, for example.
This looks to be the equation (If you are good at reading all those symbols?) :) and a C++ implementation? but I can't really see where the class constraint is happening?
The C++ implementation looks OK.
Your values and weights, which are one-dimensional arrays in your current PHP implementation, will become two-dimensional.
So, for example,
values[i][j] will be the value of the j-th item in class i, and similarly for weights[i][j]. You take exactly one item from each class i and move forward while maximizing the total value.
The C++ implementation also optimizes the memo. It keeps only 2 arrays, each sized by max_weight, holding the current and previous states. This is because you only need these 2 states at a time to compute the present state.
Answers to your doubts:
1)
My first thought was to add a class (k) array, sort the items via
class (k), and when we choose to select an item that is the same as
the next item, check if it's better to keep the current item or the
item without the next item. Seemed promising, but fell apart after a
couple of items being checked. Something like this: $tempVal3 =
$value[$n] + KSTest($n-2, $c-$weight[$n]); max( $tempVal2, $tempVal3);
This won't work because there could be some item in class k+1 where you take the optimal value and, to respect the constraint, you need to take a suboptimal value for class k. So sorting and picking the best won't work when the constraint is hit. If the constraint is not hit, you can always pick the best value with the best weight.
2)
Another thought is that at the function call, I could call a loop for
each class type and solve the KS with only 1 item at a time of that
type + the rest of the values.
Yes you are on the right track here. You will assume that you had already solved for first k classes. Now you will try extending using the values of k+1 class respecting the weight constraint.
3)
... but I can't really see where the class constraint is happening?
for (int i = 1; i < weight.size(); ++i) {
fill(current.begin(), current.end(), -1);
for (int j = 0; j < weight[i].size(); ++j) {
for (int k = weight[i][j]; k <= max_weight; ++k) {
if (last[k - weight[i][j]] > 0)
current[k] = max(current[k],
last[k - weight[i][j]] + value[i][j]);
}
}
swap(current, last);
}
In the above c++ snippet, the first loop iterates on class, the second loop iterates on values of class and the third loop extends the current state current using the previous state last and only 1 item j with class i at a time. Since you are only using previous state last and 1 item of the current class to extend and maximize, you are following the constraint.
Time complexity:
O( total_items x max_weight) which is equivalent to O( class x max_number_of_items_in_a_class x max_weight)
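For readers more comfortable with PHP, the loop structure of the C++ snippet might be sketched like this. It is a loose translation, not the linked C++ code: the function name mckpBottomUp(), the -1 marker for unreachable weights, and the grouping of the example items into Fruit, Vegetables and Meat are assumptions of this sketch.

```php
<?php
// Returns the best total value when exactly one item is taken from each
// class and the total weight stays within $maxWeight, or -1 if impossible.
function mckpBottomUp(array $value, array $weight, int $maxWeight): int
{
    // $last[$k]: best value at total weight exactly $k for the classes
    // processed so far; -1 marks a weight that cannot be reached.
    $last = array_fill(0, $maxWeight + 1, -1);
    $last[0] = 0;

    foreach ($weight as $i => $classWeights) {          // loop over classes
        $current = array_fill(0, $maxWeight + 1, -1);
        foreach ($classWeights as $j => $w) {           // loop over items of class $i
            for ($k = $w; $k <= $maxWeight; $k++) {     // extend the previous state
                if ($last[$k - $w] >= 0) {
                    $current[$k] = max($current[$k], $last[$k - $w] + $value[$i][$j]);
                }
            }
        }
        // Only the previous state is kept, as in the two-array optimization.
        $last = $current;
    }
    return max($last);
}

// Assumed grouping: Fruit (Apple, Banana), Vegetables (Carrot, Onion), Meat (Steak).
$value  = array(array(2, 2), array(4, 3), array(5));
$weight = array(array(3, 1), array(3, 2), array(4));
echo mckpBottomUp($value, $weight, 7), "\n"; // 10 (Banana + Onion + Steak)
```

Because each class starts from a fresh $current array and only reads $last, every reachable state contains exactly one item per class processed so far.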
I am not a PHP programmer, but I will try to write pseudocode with a good explanation.
In the original problem, the meaning of each cell i, j was "the value of filling the knapsack with items 1 to i until it reaches capacity j". The solution in the link you provided defines each cell as "the value of filling the knapsack with items from buckets 1 to i until it reaches capacity j". Notice that in this variation there is no such thing as not taking an element from a class.
So on each step (each call to KSTest with $n, $c), we need to find which element to pick from the n-th class such that the weight of this element is at most c and its value + KSTest(n - 1, c - w) is the greatest.
So I think you should only change the elseif and else statements to something like:
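else {
$result = 0;
// $value[$n] and $weight[$n] hold the items of the n-th class
for ($i = 0; $i < count($weight[$n]); $i++) {
if ($weight[$n][$i] > $c) {
//This item is too heavy, check next item
continue;
}
// value of this item plus the best result for the remaining classes
$result = max($result, $value[$n][$i] + KSTest($n-1, $c - $weight[$n][$i], $value, $weight));
}
}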
Now two disclaimers:
I do not code in php so this code will not run :)
This is not the implementation given in the link you provided. TBH, I didn't understand why the time complexity of their algorithm is so small (or what C is), but this implementation should work since it follows the definition of the recursive formula given.
The time complexity of this should be O(max_weight * number_of_classes * size_of_largest_class).
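A runnable version of this recursion could look like the sketch below. The nested $value/$weight arrays, the INFEASIBLE marker and the function name mckpBest() are assumptions made for the example; the marker is there because, with exactly one item required per class, some capacities have no valid selection at all.

```php
<?php
const INFEASIBLE = -1; // assumed marker: no valid pick exists for this state

// Best value using exactly one item from each of the classes 1..$n
// with total weight at most $c, or INFEASIBLE.
function mckpBest(int $n, int $c, array $value, array $weight, array &$memo): int
{
    if ($n === 0) {
        return 0; // no classes left to pick from
    }
    if (isset($memo[$n][$c])) {
        return $memo[$n][$c];
    }
    $best = INFEASIBLE;
    foreach ($weight[$n] as $i => $w) {
        if ($w > $c) {
            continue; // this item is too heavy
        }
        $rest = mckpBest($n - 1, $c - $w, $value, $weight, $memo);
        if ($rest === INFEASIBLE) {
            continue; // the remaining classes cannot be satisfied
        }
        $best = max($best, $value[$n][$i] + $rest);
    }
    $memo[$n][$c] = $best;
    return $best;
}

// Assumed grouping: 1 = Fruit (Apple, Banana), 2 = Vegetables (Carrot, Onion), 3 = Meat (Steak).
$value  = array(1 => array(2, 2), 2 => array(4, 3), 3 => array(5));
$weight = array(1 => array(3, 1), 2 => array(3, 2), 3 => array(4));
$memo = array();
echo mckpBest(3, 7, $value, $weight, $memo), "\n"; // 10 (Banana + Onion + Steak)
```

Storing INFEASIBLE in the memo keeps isset() usable, since a stored 0 or -1 is still a set value.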
This is my PHP solution. I've tried to comment the code in a way that it's easy to follow.
Update:
I updated the code because the old script was giving unreliable results. This is cleaner and has been thoroughly tested. Key takeaways are that I use two memo arrays, one at the group level to speed up execution and one at the item level to reconstruct the results. I found any attempts to track which items are being chosen as you go are unreliable and much less efficient. Also, isset() instead of if($var) is essential for checking the memo array because the previous results might have been 0 ;)
<?php
/**
* Multiple Choice Knapsack Solver
*
* @author Michael Cruz
* @version 1.0 - 03/27/2020
**/
class KS_Solve {
public $KS_Items;
public $maxValue;
public $maxWeight;
public $maxItems;
public $finalValue;
public $finalWeight;
public $finalItems;
public $finalGroups;
public $memo1 = array(); //Group memo
public $memo2 = array(); //Item memo for results rebuild
public function __construct() {
//some default variables as an example.
//KS_Items = array(Value, Weight, Group, Item #)
$this->KS_Items = array(
array(2, 3, 1, 1),
array(2, 1, 1, 2),
array(4, 3, 2, 3),
array(5, 4, 2, 4),
array(3, 2, 3, 5)
);
$this->maxWeight = 7;
$this->maxItems = 5;
$this->KS_Wrapper();
}
public function KS_Wrapper() {
$start_time = microtime(true);
//Put a dummy zero at the front to make things easier later.
array_unshift($this->KS_Items, array(0, 0, 0, 0));
//Call our Knapsack Solver
$this->maxValue = $this->KS_Solver($this->maxItems, $this->maxWeight);
//Recreate the decision table from our memo array to determine what items were picked
//ksort($this->memo2); //for debug
for($i=$this->maxItems; $i > 0; $i--) {
//ksort($this->memo2[$i]); //for debug
for($j=$this->maxWeight; $j > 0; $j--) {
if($this->maxValue == 0) {
break 2;
}
if($this->memo2[$i][$j] == $this->maxValue
&& $j == $this->maxWeight) {
$this->maxValue -= $this->KS_Items[$i][0];
$this->maxWeight -= $this->KS_Items[$i][1];
$this->finalValue += $this->KS_Items[$i][0];
$this->finalWeight += $this->KS_Items[$i][1];
$this->finalItems .= " " . $this->KS_Items[$i][3];
$this->finalGroups .= " " . $this->KS_Items[$i][2];
break;
}
}
}
//Print out the picked items and value. (IMPLEMENT Proper View or Return!)
echo "<pre>";
echo "RESULTS: <br>";
echo "Value: " . $this->finalValue . "<br>";
echo "Weight: " . $this->finalWeight . "<br>";
echo "Item's in KS:" . $this->finalItems . "<br>";
echo "Selected Groups:" . $this->finalGroups . "<br><br>";
$end_time = microtime(true);
$execution_time = ($end_time - $start_time);
echo "Results took " . sprintf('%f', $execution_time) . " seconds to execute<br>";
}
/**
* Recursive function to solve the MCKS Problem
* $n = number of items to check
* $c = total capacity of KS
**/
public function KS_Solver($n, $c) {
$group = $this->KS_Items[$n][2];
$groupItems = array();
$count = 0;
$result = 0;
$bestVal = 0;
if(isset($this->memo1[$group][$c])) {
$result = $this->memo1[$group][$c];
}
else {
//Sort out the items for this group
foreach($this->KS_Items as $item) {
if($item[2] == $group) {
$groupItems[] = $item;
$count++;
}
}
//$k adjusts the index for item memoization
$k = $count - 1;
//Find the results of each item + items of other groups
foreach($groupItems as $item) {
if($item[1] > $c) {
//too heavy
$result = 0;
}
elseif($item[1] >= $c && $group != 1) {
//too heavy for next group
$result = 0;
}
elseif($group == 1) {
//Just take the highest value
$result = $item[0];
}
else {
//check this item with following groups
$result = $item[0] + $this->KS_Solver($n - $count, $c - $item[1]);
}
if($result == $item[0] && $group != 1) {
//No solution with the following sets, so don't use this item.
$result = 0;
}
if($result > $bestVal) {
//Best item so far
$bestVal = $result;
}
//memo the results
$this->memo2[$n-$k][$c] = $result;
$k--;
}
$result = $bestVal;
}
//memo and return
$this->memo1[$group][$c] = $result;
return $result;
}
}
new KS_Solve();
?>

Random generator returning endless duplicates

I am trying to create a random string which will be used as a short reference number. I have spent the last couple of days trying to get this to work but it seems to get to around 32766 records and then it continues with endless duplicates. I need at minimum 200,000 variations.
The code below is a very simple mockup to explain what happens. The codes should follow the format 1a-x1y2z (example), which should give far more than 32k results.
I have a feeling it may be related to memory but not sure. Any ideas?
<?php
function createReference() {
$num = rand(1, 9);
$alpha = substr(str_shuffle("abcdefghijklmnopqrstuvwxyz"), 0, 1);
$char = '0123456789abcdefghijklmnopqrstuvwxyz';
$charLength = strlen($char);
$rand = '';
for ($i = 0; $i < 6; $i++) {
$rand .= $char[rand(0, $charLength - 1)];
}
return $num . $alpha . "-" . $rand;
}
$codes = [];
for ($i = 1; $i <= 200000; $i++) {
$code = createReference();
while (in_array($code, $codes) == true) {
echo 'Duplicate: ' . $code . '<br />';
$code = createReference();
}
$codes[] = $code;
echo $i . ": " . $code . "<br />";
}
exit;
?>
UPDATE
So I am beginning to wonder if this is not something with our WAMP setup (Bitnami) as our local machine gets to exactly 1024 records before it starts duplicating. By removing 1 character from the string above (instead of 6 in the for loop I make it 5) it gets to exactly 32768 records.
I uploaded the script to our centos server and had no duplicates.
What in our environment could cause such behaviour?
The code looks overly complex to me. Let's assume for the moment you really want to create n unique strings, each based on a single random value (rand/mt_rand/some other integer source).
You can start by decoupling the generation of the random values from the encoding (there seems to be nothing in the code that makes a string dependent on any previous state, except for the uniqueness). Comparing integers is quite a bit faster than comparing arbitrary strings.
mt_rand() called without arguments returns a value between 0 and mt_getrandmax() (2^31 - 1), which gives ~2^31 possible values. You want to pick 200k, let's make it 400k; that's roughly 1/5000 of the value range. Collisions will therefore be rare (the birthday approximation n^2/(2N) suggests a few dozen among 400k values), so it's reasonable to generate all the values first, check for uniqueness afterwards, and add more values if a collision occurred. That is again much faster than calling in_array() in each iteration of the loop.
Once you have enough values, you can encode/convert them to whatever format you wish. I don't know whether the <digit><character>-<something> format is mandatory, but assuming it is not -> base_convert()
<?php
function unqiueRandomValues($n) {
$values = array();
while( count($values) < $n ) {
for($i=count($values);$i<$n; $i++) {
$values[] = mt_rand();
}
$values = array_unique($values);
}
return $values;
}
function createReferences($n) {
return array_map(
function($e) {
return base_convert($e, 10, 36);
},
unqiueRandomValues($n)
);
}
$start = microtime(true);
$references = createReferences(400000);
$end = microtime(true);
echo count($references), ' ', count(array_unique($references)), ' ', $end-$start, ' ', $references[0];
prints e.g. 400000 400000 3.3981630802155 f3plox on my i7-4770. (The $end-$start part is constantly between 3.2 and 3.4)
Using base_convert() there can be strings like li10, which can be quite annoying to decipher if you have to manually type the string.
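If that matters, one option is to encode the integers with a custom alphabet that leaves out easily confused characters. The sketch below is one way to do it; the function name encodeUnambiguous() and the choice of dropped characters (0, 1, i, l, o) are assumptions, not part of base_convert().

```php
<?php
// Hypothetical encoder: base-31 using only characters that are hard to misread.
function encodeUnambiguous(int $n, string $alphabet = '23456789abcdefghjkmnpqrstuvwxyz'): string
{
    $base = strlen($alphabet);
    $out = '';
    do {
        // Prepend the digit for the current remainder.
        $out = $alphabet[$n % $base] . $out;
        $n = intdiv($n, $base);
    } while ($n > 0);
    return $out;
}

echo encodeUnambiguous(0), ' ', encodeUnambiguous(31), "\n"; // 2 32
```

Distinct non-negative integers still map to distinct strings, so the uniqueness check on the integers carries over to the encoded references.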

Why does this random string generator perform so poorly?

I found this bit of PHP code for generating random strings (alphabetical, alphanumeric, numeric, and hexadecimal).
<?php
function random($length = 8, $seeds = 'alpha') {
// Possible seeds
$seedings['alpha'] = 'abcdefghijklmnopqrstuvwqyz';
$seedings['numeric'] = '0123456789';
$seedings['alphanum'] = 'abcdefghijklmnopqrstuvwqyz0123456789';
$seedings['hexidec'] = '0123456789abcdef';
// Choose seed
if (isset($seedings[$seeds])) {
$seeds = $seedings[$seeds];
}
// Seed generator
list($usec, $sec) = explode(' ', microtime());
$seed = (float) $sec + ((float) $usec * 100000);
mt_srand($seed);
// Generate
$str = '';
$seeds_count = strlen($seeds);
for ($i = 0; $length > $i; $i++) {
$str .= $seeds{mt_rand(0, $seeds_count - 1)};
}
return $str;
}
?>
If I run this function with the default arguments (so it is generating 8 character strings, alphabetical only) and generate 1,000,000 strings, I'd think my collision rate would be low:
26^8 = 208,827,064,576
1,000,000 / 208,827,064,576 ~= 0.0004%
In actuality, when I run that on my machine, I get a 90% collision rate! Only 10% of my generated strings are unique.
Actually, it is suspiciously close to 10%. Generating multiple sets of 1,000,000 random strings, I find that each set generates...
100,032 unique strings
100,035 unique strings
100,032 unique strings
100,028 unique strings
100,030 unique strings
you get the idea...
So what gives? Obviously it has to do with how I'm seeding mt_srand, or how php implements mt_rand, or something else.
So...
Why doesn't this code generate useful random strings?
And what would be a better approach?
Don't set the seed unless you know what you're doing, from the manual:
Note: There is no need to seed the random number generator with
srand() or mt_srand() as this is done automatically.
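To see why reseeding hurts, note that seeding with the same value replays the same sequence. The sketch below uses a hypothetical helper, stringForSeed(), to show this. In the question's code the seed is truncated to an integer built from the seconds plus the first five digits of the microseconds, so it can take only roughly 100,000 distinct values over a short run, which would be consistent with the roughly 100,000 unique strings observed.

```php
<?php
// Hypothetical helper: reseed, then build one random string.
function stringForSeed(int $seed, int $length = 8): string
{
    mt_srand($seed); // reseeding restarts the generator's sequence
    $chars = 'abcdefghijklmnopqrstuvwxyz';
    $str = '';
    for ($i = 0; $i < $length; $i++) {
        $str .= $chars[mt_rand(0, strlen($chars) - 1)];
    }
    return $str;
}

// Same seed, same string: every repeated seed yields a duplicate.
var_dump(stringForSeed(12345) === stringForSeed(12345)); // bool(true)
```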
The following code gets me an almost 100% unique set of strings:
<?php
function random($length = 8, $charset = 'alpha'){
$list = [
'alpha' => 'abcdefghijklmnopqrstuvwxyz',
'numeric' => '0123456789',
'alphanum' => 'abcdefghijklmnopqrstuvwxyz0123456789',
'hexidec' => '0123456789abcdef'
];
if(!isset($list[$charset])){
trigger_error("Invalid charset '$charset', allowed sets: '".implode(', ', array_keys($list))."'", E_USER_NOTICE);
$charset = 'alpha';
}
$str = '';
$max = strlen($list[$charset]) - 1;
for ($i = 0; $length > $i; $i++) {
$str .= $list[$charset][mt_rand(0, $max)];
}
return $str;
}
$loop = 1000000;
for($i=0;$i<$loop;$i++){
$arr[random()] = true;
}
echo $loop - count($arr), " dupes found in list.";
?>
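As a sanity check on what "almost" means here, the birthday approximation gives the number of duplicates to expect from an ideal generator. The helper name expectedCollisions() is an assumption for this sketch, and the figure assumes 26 equally likely letters.

```php
<?php
// Birthday approximation: expected number of colliding pairs when drawing
// $n values uniformly at random from $space possibilities.
function expectedCollisions(float $n, float $space): float
{
    return $n * ($n - 1) / (2 * $space);
}

// One million 8-letter strings over a 26-letter alphabet.
printf("%.2f\n", expectedCollisions(1000000, 26 ** 8)); // 2.39
```

So a handful of duplicates per million strings is expected even without any seeding problem.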

Counting possibilities in a char. combination using a consecutive repetition criterion

In PHP, given
the final string length
the range of characters it can use
min consecutive repetition count possible
how can you calculate the number of matches that fit these criteria? To draw a better picture…
$range = array('a','b','c');
$length = 2; // looking for 2 digit results
$minRep = 2; // with >=2 consecutive characters
// aa,bb,cc = 3 possibilities
another one:
$range = array('a','b','c');
$length = 3; // looking for 3 digit results
$minRep = 2; // with >=2 consecutive characters
// aaa,aab,aac,baa,caa
// bbb,bba,bbc,abb,cbb
// ccc,cca,ccb,acc,bcc
// 5 + 5 + 5 = 15 possibilities
// note that combos like aa,bb,cc are not included
// because their length is smaller than $length
last one:
$range = array('a','b','c');
$length = 3; // looking for 3 digit results
$minRep = 3; // with >=3 consecutive characters
// aaa,bbb,ccc = 3 possibilities
So basically, in the 2nd example the 3rd criterion made it catch e.g. [aa]b in aab because a was repeating consecutively more than once, whereas [a]b[a] wouldn't be a match because those a's are separate.
Needless to say, none of the variables is static.
Got it. All credit to leonbloy @ mathexchange.com.
/* The main function computes the number of words that do NOT contain
* a character repetition of length $minRep (or more). */
function countStrings($rangeLength, $length, $minRep, &$results = array())
{
if (!isset($results[$length]))
{
$b = 0;
if ($length < $minRep)
$b = pow($rangeLength, $length);
else
{
for ($i = 1; $i < $minRep; $i++)
$b += countStrings($rangeLength, $length - $i, $minRep, $results);
$b *= $rangeLength - 1;
}
$results[$length] = $b;
}
return $results[$length];
}
/* This one answers directly the question. */
function printNumStringsRep($rangeLength, $length, $minRep)
{
$n = (pow($rangeLength, $length)
- countStrings($rangeLength, $length, $minRep));
echo "Size of alphabet : $rangeLength<br/>"
. "Size of string : $length<br/>"
. "Minimal repetition : $minRep<br/>"
. "<strong>Number of words : $n</strong>";
}
/* Prints :
*
Size of alphabet : 3
Size of string : 3
Minimal repetition : 2
Number of words : 15
*
*/
printNumStringsRep(3, 3, 2);
I think it is best to handle this with math.
$range = array('a','b','c');
$length = 3; // looking for 3 digit results
$minRep = 2; // with >=2 consecutive characters
$rangeLength = count($range);
$count = (pow($rangeLength,$length-$minRep+1) * ($length-$minRep+1)) - ($rangeLength * ($length-$minRep)); // is the result
Now, $count gives the correct result for all three examples. But it may not be a general formula and may need improvement.
Let me try to explain it:
pow($rangeLength,$length-$minRep+1)
Here we count the repeated characters as one. For instance, in the second example you gave, we treat aa in aab as a single character, because the two characters have to change together. So we now think of a two-character string like xy. Each of these characters can take the same values a, b, and c, namely 3 ($rangeLength) possible values for each of the two characters ($length-$minRep+1). So there are 3^2=9 possible cases for the second example.
The 9 we calculated covers only xy, not yx. To account for that, we multiply by the length of xy ($length-$minRep+1), and get 18.
It may seem that we have the result now, but there is double counting in our calculation. We didn't account for this situation: xy => aaa and yx => aaa. So we calculate and subtract the repeated results:
- ($rangeLength * ($length-$minRep))
After this, we get the result.
As I said at the beginning of the description, this formula may need improvement.
With math, the work becomes really complex. But there is always a way, even if it is not as beautiful as math. We can create all possible strings with PHP and check them with a regexp, like below:
$range = array('a','b','c');
$length = 3;
$minRep = 2;
$rangeLength = count($range);
$createdStrings = array();
$matchedStrings = array();
function calcIndex(){
global $range;
global $length;
global $rangeLength;
static $ret;
$addTrigger = false;
// initial values
if(is_null($ret)){
$ret = array_fill(0, $length, 0);
return $ret;
}
for($i=$length-1;$i>=0;$i--){
if($ret[$i] == ($rangeLength-1)) {
if($i==0) return false;
$ret[$i] = 0;
}
else {
$ret[$i]++;
break;
}
}
return $ret;
}
function createPattern()
{
global $minRep;
$patt = '/(.)\\1{'.($minRep-1).'}/';
return $patt;
}
$pattern = createPattern();
while(1)
{
$index = calcIndex();
if($index === false) break;
$string = '';
for($i=0;$i<$length;$i++)
{
$string .= $range[$index[$i]];
}
if(!in_array($string, $createdStrings)){
$createdStrings[] = $string;
if(preg_match($pattern, $string)){
$matchedStrings[] = $string;
}
}
}
echo count($createdStrings).' strings created:';
var_dump($createdStrings);
echo count($matchedStrings).' strings matched:';
var_dump($matchedStrings);
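The same brute-force idea can be sketched more compactly without globals by reading a counter as a number in base count($range). This is only an illustrative variant under stated assumptions: countWithRun is a hypothetical name, and the characters are assumed to be single-byte so that "." in the pattern matches exactly one of them.

```php
<?php
// Count the strings of $length over $range that contain a run of at least
// $minRep identical characters. countWithRun is a hypothetical name.
function countWithRun($range, $length, $minRep) {
    $base = count($range);
    $total = (int) pow($base, $length);
    // Same pattern as createPattern(): one character followed by $minRep-1 copies.
    $pattern = '/(.)\\1{' . ($minRep - 1) . '}/';
    $matched = 0;
    for ($n = 0; $n < $total; $n++) {
        // Read $n as a base-$base number: each digit selects one character.
        $string = '';
        $rest = $n;
        for ($i = 0; $i < $length; $i++) {
            $string .= $range[$rest % $base];
            $rest = (int) ($rest / $base);
        }
        if (preg_match($pattern, $string)) {
            $matched++;
        }
    }
    return $matched;
}

echo countWithRun(array('a', 'b', 'c'), 3, 2), "\n"; // 15
```

An enumeration like this is also a handy check on the formula. With $length = 4 it yields 57, while the formula evaluates to 3^3*3 - 3*2 = 75, which supports the caveat that the formula is not general.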

How to reduce lists of ranges?

Given a list of ranges, e.g. 1-3,5,6-4,31,9,19,10,25-20,
how can I reduce it to 1-6,9-10,19-25,31?
Here is what I've done so far. It seems a little complicated, so
is there a simpler or cleverer way to do this?
$in = '1-3,5,6-4,31,9,19,10,25-20';
// Explode the list in ranges
$rs = explode(',', $in);
$tmp = array();
// for each range of the list
foreach($rs as $r) {
// find the start and end date of the range
if (preg_match('/(\d+)-(\d+)/', $r, $m)) {
$start = $m[1];
$end = $m[2];
} else {
// If only one date
$start = $end = $r;
}
// flag each date in an array
foreach(range($start,$end) as $i) {
$tmp[$i] = 1;
}
}
$str = '';
$prev = 999;
// for each date of a month (1-31)
for($i=1; $i<32; $i++) {
// is this date flagged ?
if (isset($tmp[$i])) {
// is output string empty ?
if ($str == '') {
$str = $i;
} else {
// if the previous date is less than the current minus 1
if ($i-1 > $prev) {
// build the new range
$str .= '-'.$prev.','.$i;
}
}
$prev = $i;
}
}
// build the last range
if ($i-1 > $prev) {
$str .= '-'.$prev;
}
echo "str=$str\n";
NB: it must run under PHP 5.1.6 (I can't upgrade).
FYI: the numbers represent days of the month, so they are limited to 1-31.
Edit:
From a given list of date ranges (1-3,6,7-8), I'd like to obtain another list (1-3,6-8) in which all the ranges are recalculated and ordered.
Perhaps not the most efficient, but shouldn't be too bad with the limited range of values you're working with:
$in = '1-3,5,6-4,31,9,19,10,25-20';
$inSets = explode(',',$in);
$outSets = array();
foreach($inSets as $inSet) {
list($start,$end) = explode('-',$inSet.'-'.$inSet);
$outSets = array_merge($outSets,range($start,$end));
}
$outSets = array_unique($outSets);
sort($outSets);
$newSets = array();
$start = $outSets[0];
$end = -1;
foreach($outSets as $outSet) {
if ($outSet == $end+1) {
$end = $outSet;
} else {
if ($start == $end) {
$newSets[] = $start;
} elseif($end > 0) {
$newSets[] = $start.'-'.$end;
}
$start = $end = $outSet;
}
}
if ($start == $end) {
$newSets[] = $start;
} else {
$newSets[] = $start.'-'.$end;
}
var_dump($newSets);
echo '<br />';
You just have to walk through your data to get what you want. Split the input on the delimiter, in your case ','. Then sort it, which saves you from searching to the left of the current position. Take the first element, check whether it's a range, and use the highest number in it (3 for the range 1-3, or 3 if 3 is a single element) for further comparisons. Then take the second element in your list and check whether it is a direct successor of the last element. If so, combine the first and second elements/ranges into a new range. Repeat.
Edit: I'm not sure about PHP, but a regular expression is a bit overkill for this problem. Just look for a '-' in each element of your exploded array; if it's there, you know it's a range. Sorting the exploded array saves you the backtracking, i.e. the work you are doing with $prev. You could also explode every element of the exploded array on '-' and check whether the resulting array has a size > 1 to learn whether the element is a range.
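The sort-and-merge idea described above could be sketched in PHP roughly as follows. This is a sketch under stated assumptions, not a tested drop-in: reduceRanges is a hypothetical name, reversed ranges such as 6-4 are normalised to 4-6, and overlapping ranges are merged as well as directly adjacent ones. It sticks to array() syntax and avoids closures with the PHP 5.1.6 constraint in mind, although it has not been run on that version.

```php
<?php
// Hypothetical helper: parse, sort and merge the ranges without expanding them.
function reduceRanges($in) {
    $ranges = array();
    foreach (explode(',', $in) as $part) {
        // A '-' in the element means it is a range; otherwise it is a single value.
        $bounds = explode('-', $part);
        $a = (int) $bounds[0];
        $b = count($bounds) > 1 ? (int) $bounds[1] : $a;
        $ranges[] = array(min($a, $b), max($a, $b)); // normalise "6-4" to 4..6
    }
    // Arrays of equal size compare element by element, so this orders by start.
    sort($ranges);
    $merged = array();
    foreach ($ranges as $r) {
        $last = count($merged) - 1;
        if ($last >= 0 && $r[0] <= $merged[$last][1] + 1) {
            // Overlaps or directly follows the previous range: extend it.
            $merged[$last][1] = max($merged[$last][1], $r[1]);
        } else {
            $merged[] = $r;
        }
    }
    $out = array();
    foreach ($merged as $r) {
        $out[] = ($r[0] == $r[1]) ? $r[0] : ($r[0] . '-' . $r[1]);
    }
    return implode(',', $out);
}

echo reduceRanges('1-3,5,6-4,31,9,19,10,25-20'), "\n"; // 1-6,9-10,19-25,31
```

Because the ranges are never expanded into individual numbers, the work depends on the number of list elements rather than on the size of the values.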
Looking at the problem from an algorithmic stand-point, let's consider the limitations that you've put on the problem. All numbers will be from 1-31. The list is a collection of "ranges", each of which is defined by two numbers (start and end). There is no rule for whether start will be more, less than, or equal to end.
Since we have an arbitrarily large list of ranges but a definite means of sorting/organizing these, a divide and conquer strategy may yield the best complexity.
At first I typed out a long and careful explanation of how I built each step of this algorithm (the dividing portion, the conquering portion, optimizations, etc.), but it got extremely long. To keep it short, here's the final answer:
<?php
$ranges = "1-3,5,6-4,31,9,19,10,25-20";
$range_array = explode(',', $ranges);
$include = array();
foreach($range_array as $range){
list($start, $end) = explode('-', $range.'-'.$range); //"1-3-1-3" or "5-5"
$include = array_merge($include, range($start, $end));
}
$include = array_unique($include);
sort($include);
$new_ranges = array();
$start = $include[0];
$count = $start;
// And begin the simple conquer algorithm
for( $i = 1; $i < count($include); $i++ ){
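if( $include[$i] != ++$count ){
// $count has overshot by one, so the run ended at $count-1
if($start == $count-1){
$new_ranges[] = $start;
} else {
// parentheses are required: before PHP 8, "." and "-" share the same precedence
$new_ranges[] = $start."-".($count-1);
}
$start = $include[$i];
$count = $start;
}
}
// flush the last run, which the loop never closes
$new_ranges[] = ($start == $count) ? $start : $start."-".$count;
$new_ranges = implode(',', $new_ranges);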
?>
This should (theoretically) work on arrays of arbitrary length for any positive integers. Negative integers would get tripped up since - is our delimiter for the range.
