I have scraped 5000 files, stored them in individual files (0-4999.txt), now i need to find duplicate content in them. so i am comparing each file with one another in nested loop (ETA 82 hours). This approach will definitely take hours to complete. My main concern here is the no. of iterations. Can anyone suggest a better approach to cut down iterations and reduce time taken?
current code: NCD algorithm
function ncd_new($sx, $sy, $prec=0, $MAXLEN=9000) {
# NCD with gzip artifact correctoin and percentual return.
# sx,sy = strings to compare.
# Use $prec=-1 for result range [0-1], $pres=0 for percentual,
# For NCD definition see http://arxiv.org/abs/0809.2553
$x = $min = strlen(gzcompress($sx));
$y = $max = strlen(gzcompress($sy));
$xy= strlen(gzcompress($sx.$sy));
$a = $sx;
if ($x>$y) { # swap min/max
$min = $y;
$max = $x;
$a = $sy;
}
$res = ($xy-$min)/$max; # NCD definition.
if ($MAXLEN<0 || $xy<$MAXLEN) {
$aa= strlen(gzcompress($a.$a));
$ref = ($aa-$min)/$min;
$res = $res - $ref; # correction
}
return ($prec<0)? $res: 100*round($res,2+$prec);
}
looping over each file:
$totalScraped = 5000;
for($fileC=0;$fileC<$totalScraped;$fileC++)
{
$f1 = file_get_contents($fileC.".txt");
$stripstr = array('/\bis\b/i', '/\bwas\b/i', '/\bthe\b/i', '/\ba\b/i');
$file1 = preg_replace($stripstr, '', $f1);
// 0+fileC => exclude already compared files
// eg. if fileC=10 , start loop 11 to 4999
for($fileD=(0+$fileC);$fileD<$totalScraped;$fileD++)
{
$f2 = file_get_contents($fileD.".txt", FILE_USE_INCLUDE_PATH);
$stripstr = array('/\bis\b/i', '/\bwas\b/i', '/\bthe\b/i', '/\ba\b/i');
$file2 = preg_replace($stripstr, '', $f2);
$total=ncd_new($file1,$file2);
echo "$fileName1 vs $fileName2 is: $total%\n";
}
}
You may want to find a way to distinguish likely candidates from unlikely ones.
So, maybe there is a way that you can compute a value for each file (say: a word count, a count of sentences / paragraphs... maybe even a count of individual letters), to identify the unlikely candidates beforehand.
If you could achieve this, you could reduce the amount of comparisons by ordering your arrays by this computed number.
another process that i tried was:
strip html tags from page
replace \s{2,} with \s, \n{2,} with \n, so that text b/w each tag is presented in a single line(almost)
compare two such generated files by taking a line, preg_matching, if found -> duplicate, else break line into array of words, calculate array_intersect, if count is 70% or more of line length -> duplicate.
which was very efficient and i could compare 5000 files in ~10 minutes
but still slow for my requirements.
So i implemented the first logic "ncd algo" method in C language, and it completes the task with 5-10 seconds (depending on the average page size)
Related
I'm generating a 6 digit code from the following characters. These will be used to stamp on stickers.
They will be generated in batches of 10k or less (before printing) and I don't envisage there will ever be more than 1-2 million total (probably much less).
After I generate the batches of codes, I'll check the MySQL database of existing codes to ensure there are no duplicates.
// exclude problem chars: B8G6I1l0OQDS5Z2
$characters = 'ACEFHJKMNPRTUVWXY4937';
$string = '';
for ($i = 0; $i < 6; $i++) {
$string .= $characters[rand(0, strlen($characters) - 1)];
}
return $string;
Is this a solid approach to generating the code?
How many possible permutations would there be? (6 Digit code from pool of 21 characters). Sorry math isn't my strong point
21^6 = 85766121 possibilities.
Using a DB and storing used values is bad. If you want to fake randomness you can use the following:
Reduce to 19 possible numbers and make use of the fact that groups of order p^k where p is an odd prime are always cyclic.
Take the group of order 7^19, using a generator co-prime to 7^19 (I'll pick 13^11, you can choose anything not divisible by 7).
Then the following works:
$previous = 0;
function generator($previous)
{
$generator = pow(13,11);
$modulus = pow(7,19); //int might be too small
$possibleChars = "ACEFHJKMNPRTUVWXY49";
$previous = ($previous + $generator) % $modulus;
$output='';
$temp = $previous;
for($i = 0; $i < 6; $i++) {
$output += $possibleChars[$temp % 19];
$temp = $temp / 19;
}
return $output;
}
It will cycle through all possible values and look a little random unless they go digging. An even safer alternative would be multiplicative groups but I forget my math already :(
There is a lot of possible combination with or without repetition so your logic would be sufficient
Collision would be frequent because you are using rand see str_shuffle and randomness.
Change rand to mt_rand
Use fast storage like memcached or redis not MySQL when checking
Total Possibility
21 ^ 6 = 85,766,121
85,766,121 should be ok , To add database to this generation try:
Example
$prifix = "stamp.";
$cache = new Memcache();
$cache->addserver("127.0.0.1");
$stamp = myRand(6);
while($cache->get($prifix . $stamp)) {
$stamp = myRand(6);
}
echo $stamp;
Function Used
function myRand($no, $str = "", $chr = 'ACEFHJKMNPRTUVWXY4937') {
$length = strlen($chr);
while($no --) {
$str .= $chr{mt_rand(0, $length- 1)};
}
return $str;
}
as Baba said generating a string on the fly will result in tons of collisions. the closer you will go to 80 millions already generated ones the harder it will became to get an available string
another solution could be to generate all possible combinations once, and store each of them in the database already, with some boolean column field that marks if a row/token is already used or not
then to get one of them
SELECT * FROM tokens WHERE tokenIsUsed = 0 ORDER BY RAND() LIMIT 0,1
and then mark it as already used
UPDATE tokens SET tokenIsUsed = 1 WHERE token = ...
You would have 21 ^ 6 codes = 85 766 121 ~ 85.8 million codes!
To generate them all (which would take some time), look at the selected answer to this question: algorithm that will take numbers or words and find all possible combinations.
I had the same problem, and I found very impressive open source solution:
http://www.hashids.org/php/
You can take and use it, also it's worth it to look in it's source code to understand what's happening under the hood.
Or... you can encode username+datetime in md5 and save to database, this for sure will generate an unique code ;)
I have a fully-populated array of values, and I would like to arbitrarily remove elements from this array with more removed towards the far end.
For example, given input ( where a . signifies a populated index )
............................................
I would like something like
....... . ... .. . . .. . .
My first thought was to count the elements, then iterate over the array generating a random number somewhere between the current index and the total size of the array, eg:
if ( mt_rand( 0, $total ) > $total - $current_index )
//remove this element
however, as this entails making a random number each time the loop goes round it becomes very arduous.
Is there a better way of doing this?
One easy way is to flip a weighted coin for each entry with coin flips more weighted towards the end. For example, if the array is size n, for each entry you could choose a random number from 0 to n-1 and only keep the value if the index is less than or equal to the random number. (That is, keep each entry with probability 1 - index/total.) This has the nice advantage that if you're going to be compacting your array anyways, and you're using a good enough but efficient random number generator (could be a simple integer hash over a nonce), it's going to be rather fast for memory access.
On the other hand if you're only blanking out a few items and aren't rearranging the array, you can go with some sort of weighted random number generator that more often chooses numbers that are toward the end of the index. For example, if you have a random number generator that generates floats in the value of [0,1] (closed or open bounds not mattering that much likely), consider obtaining such a random float r and squaring it. This will tend to prefer lower values. You can fix this by flipping it around: 1-r^2. Of course, you need this to be in your index range of 0 to n - 1, so take floor(n * (1 - r^2)) and also round n down to n-1.
There's practically an infinite number of variations on both of these techniques.
This is quite probably not the best/most efficient way to do this, but it is the best I can come up with and it does work.
N.B. the codepad example takes a long time to execute, but this is because of the pretty-print loop I added to the end so you can see it visibly working. If you remove the inner loop, execution time drops to acceptable levels.
<?php
$array = range(0, 99);
for ($i = 0, $count = count($array); $i < $count; $i++) {
// Get array keys
$keys = array_keys($array);
// Get a random number between 0 and count($keys) - 1
$rand = mt_rand(0, count($keys) - 1);
// Cut $rand elements off the beginning of the keys
$keys = array_slice($keys, $rand);
// Unset a random key from the remaining keys
unset($array[$keys[array_rand($keys)]]);
}
This method isn't random- it works by you defining a function, and its inverse. Different functions, with different constant coefficients will have different distribution characteristics.
The results are very pattern like, as expected when mapping a continuous function to a discrete structure like an array.
Here's an example using a quadratic function. You could try varying the constant.
demo: http://codepad.org/ojU3s9xM
#as in y = x^2 / 7;
function y($x) {
return $x * $x / 7;
}
function x($y) {
return 7 * sqrt($y);
}
$theArray = range(0,100);
$size = count($theArray);
//use func inverse to find the max value we can input to $y() without going out of array bounds
$maximumX = x($size);
for ($i=0; $i<$maximumX; $i++) {
$index = (int) y($i);
//unset the index if it still exists, else, the next greatest index
while (!isset($theArray[$index]) && $index < $size) {
$index++;
}
unset($theArray[$index]);
}
for ($i=0; $i<$size; $i++) {
printf("[%-3s]", isset($theArray[$i]) ? $theArray[$i] : '');
}
I am implementing a system at the moment where it needs to allocate a number in a certain range to a person, but not use any number that has been used before.
Keep in mind, both the number range and exclusion list are both going to be quite large.
Initially, I thought doing something like this would be best:
<?php
$start = 1;
$end = 199999;
$excluded = array(4,6,7,8,9,34);
$found = FALSE;
while (!$found) {
$rand = mt_rand($start,$end);
if (!in_array($rand,$excluded)) {
$found = TRUE;
}
}
?>
But I don't think this is ideal, there is the possibility of an infinite loop (or it taking a very long time / timing out the script).
I also thought about generating an array of all the numbers I needed, but surely a massive array would be worse? Also doing an array diff on 2 massive arrays would surely take a long time too?
Something like this:
<?php
$start = 1;
$end = 199999;
$allnums = range($start,$end);
$excluded = array(4,6,7,8,9,34);
$searcharray = array_diff($allnums,$excluded);
$rand = array_rand($searcharray);
?>
So, my question would be which would be a better option? And is there another (better) way of doing this that someone has used before?
Array's holding large amounts of data will use up a lot of memory, can you not use a database to hold these numbers in? That's generally what they are designed for.
Is there is any way to avoid duplication in random number generation .
I want to create a random number for a special purpose. But it's should be a unique value. I don't know how to avoid duplicate random number
ie, First i got the random number like 1892990070. i have created a folder named that random number(1892990070). My purpose is I will never get that number in future. I it so i have duplicate random number in my folder.
A random series of number can always have repeated numbers. You have to keep a record of which numbers are already used so you can regenerate the number if it's already used. Like this:
$used = array(); //Initialize the record array. This should only be done once.
//Do like this to draw a number:
do {
$random = rand(0, 2000);
}while(in_array($random, $used));
$used[] = $random; //Save $random into to $used array
My example above will of course only work across a single page load. If it should be static across page loads you'll have to use either sessions (for a single user) or some sort of database (if it should be unique to all users), but the logic is the same.
You can write a wrapper for mt_rand which remembers all the random number generated before.
function my_rand() {
static $seen = array();
do{
$rand = mt_rand();
}while(isset($seen[$rand]));
$seen[$rand] = 1;
return $rand;
}
The ideas to remember previously generated numbers and create new ones is a useful general solution when duplicates are a problem.
But are you sure an eventual duplicate is really a problem? Consider rolling dice. Sometimes they repeat the same value, even in two sequential throws. No one considers that a problem.
If you have a controlled need for a choosing random number—say like shuffling a deck of cards—there are several approaches. (I see there are several recently posted answer to that.)
Another approach is to use the numbers 0, 1, 2, ..., n and modify them in some way, like a Gray Code encoding or exclusive ORing by a constant bit pattern.
For what purpose are you generating the random number? If you are doing something that generates random "picks" of a finite set, like shuffling a deck of cards using a random-number function, then it's easiest to put the set into an array:
$set = array('one', 'two', 'three');
$random_set = array();
while (count($set)) {
# generate a random index into $set
$picked_idx = random(0, count($set) - 1);
# copy the value out
$random_set []= $set[$picked_idx];
# remove the value from the original set
array_splice($set, $picked_idx, 1);
}
If you are generating unique keys for things, you may need to hash them:
# hold onto the random values we've picked
$already_picked = array();
do {
$new_pick = rand();
# loop until we know we don't have a value that's been picked before
} while (array_key_exists($new_pick, $already_picked));
$already_picked[$new_pick] = 1;
This will generate a string with one occurence of each digit:
$randomcharacters = '0123456789';
$length = 5;
$newcharacters = str_shuffle($randomcharacters);
$randomstring = substr($newcharacters, 0, $length);
rand(1,N) but excluding array(a,b,c,..),
is there already a built-in function that I don't know or do I have to implement it myself(how?) ?
UPDATE
The qualified solution should have gold performance whether the size of the excluded array is big or not.
No built-in function, but you could do this:
function randWithout($from, $to, array $exceptions) {
sort($exceptions); // lets us use break; in the foreach reliably
$number = rand($from, $to - count($exceptions)); // or mt_rand()
foreach ($exceptions as $exception) {
if ($number >= $exception) {
$number++; // make up for the gap
} else /*if ($number < $exception)*/ {
break;
}
}
return $number;
}
That's off the top of my head, so it could use polishing - but at least you can't end up in an infinite-loop scenario, even hypothetically.
Note: The function breaks if $exceptions exhausts your range - e.g. calling randWithout(1, 2, array(1,2)) or randWithout(1, 2, array(0,1,2,3)) will not yield anything sensible (obviously), but in that case, the returned number will be outside the $from-$to range, so it's easy to catch.
If $exceptions is guaranteed to be sorted already, sort($exceptions); can be removed.
Eye-candy: Somewhat minimalistic visualisation of the algorithm.
I don't think there's such a function built-in ; you'll probably have to code it yourself.
To code this, you have two solutions :
Use a loop, to call rand() or mt_rand() until it returns a correct value
which means calling rand() several times, in the worst case
but this should work OK if N is big, and you don't have many forbidden values.
Build an array that contains only legal values
And use array_rand to pick one value from it
which will work fine if N is small
Depending on exactly what you need, and why, this approach might be an interesting alternative.
$numbers = array_diff(range(1, N), array(a, b, c));
// Either (not a real answer, but could be useful, depending on your circumstances)
shuffle($numbers); // $numbers is now a randomly-sorted array containing all the numbers that interest you
// Or:
$x = $numbers[array_rand($numbers)]; // $x is now a random number selected from the set of numbers you're interested in
So, if you don't need to generate the set of potential numbers each time, but are generating the set once and then picking a bunch of random number from the same set, this could be a good way to go.
The simplest way...
<?php
function rand_except($min, $max, $excepting = array()) {
$num = mt_rand($min, $max);
return in_array($num, $excepting) ? rand_except($min, $max, $excepting) : $num;
}
?>
What you need to do is calculate an array of skipped locations so you can pick a random position in a continuous array of length M = N - #of exceptions and easily map it back to the original array with holes. This will require time and space equal to the skipped array. I don't know php from a hole in the ground so forgive the textual semi-psudo code example.
Make a new array Offset[] the same length as the Exceptions array.
in Offset[i] store the first index in the imagined non-holey array that would have skipped i elements in the original array.
Now to pick a random element. Select a random number, r, in 0..M the number of remaining elements.
Find i such that Offset[i] <= r < Offest[i+i] this is easy with a binary search
Return r + i
Now, that is just a sketch you will need to deal with the ends of the arrays and if things are indexed form 0 or 1 and all that jazz. If you are clever you can actually compute the Offset array on the fly from the original, it is a bit less clear that way though.
Maybe its too late for answer, but I found this piece of code somewhere in my mind when trying to get random data from Database based on random ID excluding some number.
$excludedData = array(); // This is your excluded number
$maxVal = $this->db->count_all_results("game_pertanyaan"); // Get the maximum number based on my database
$randomNum = rand(1, $maxVal); // Make first initiation, I think you can put this directly in the while > in_array paramater, seems working as well, it's up to you
while (in_array($randomNum, $excludedData)) {
$randomNum = rand(1, $maxVal);
}
$randomNum; //Your random number excluding some number you choose
This is the fastest & best performance way to do it :
$all = range($Min,$Max);
$diff = array_diff($all,$Exclude);
shuffle($diff );
$data = array_slice($diff,0,$quantity);