PHP Detect Duplicate Text


I have a site where users can put in a description about themselves.
Most users write something appropriate but some just copy/paste the same text a number of times (to create the appearance of a fair amount of text).
eg: "Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace"
Is there a good method to detect repetitive text with PHP?
The only idea I have so far is to break the text into separate words (delimited by spaces) and then check whether any word is repeated more than a set limit. Note: I'm not 100% sure how I would code this solution.
Thoughts on the best way to detect duplicate text? Or how to code the above idea?

This is a basic text classification problem. There are lots of articles out there on how to determine if some text is spam/not spam which I'd recommend digging into if you really want to get into the details. A lot of it is probably overkill for what you need to do here.
Granted one approach would be to evaluate why you're requiring people to enter longer bios, but I'll assume you've already decided that forcing people to enter more text is the way to go.
Here's an outline of what I would do:
1. Build a histogram of word occurrences for the input string.
2. Study the histograms of some valid and invalid text.
3. Come up with a formula for classifying a histogram as valid or not.
This approach would require you to figure out what's different between the two sets. Intuitively, I'd expect spam to show fewer unique words and if you plot the histogram values, a higher area under the curve concentrated toward the top words.
Here's some sample code to get you going:
$str = 'Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace';
// Build a histogram mapping words to occurrence counts
$hist = array();
// Split on any number of consecutive whitespace characters
foreach (preg_split('/\s+/', $str) as $word)
{
    // Force all words lowercase to ignore capitalization differences
    $word = strtolower($word);
    // Count occurrences of the word
    if (isset($hist[$word]))
    {
        $hist[$word]++;
    }
    else
    {
        $hist[$word] = 1;
    }
}
// Once you're done, extract only the counts
$vals = array_values($hist);
rsort($vals); // Sort max to min
// Now that you have the counts, analyze and decide valid/invalid
var_dump($vals);
When you run this code on some repetitive strings, you'll see the difference. For the example string you gave, $vals holds just four counts, each equal to 6 (the plot of this array is not reproduced here).
Compare that with the first two paragraphs of Martin Luther King Jr.'s bio from Wikipedia (plot not reproduced here):
A long tail indicates lots of unique words. There's still some repetition, but the general shape shows some variation.
FYI, PHP has a stats package you can install if you're going to be doing lots of math like standard deviation, distribution modeling, etc.
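As one possible "formula" for step 3, here is a minimal sketch that reduces the histogram to a single number: the share of distinct words among all words. The function names and the 0.5 cut-off are assumptions made for illustration, not values from the answer above; you would tune the cut-off against real bios.

```php
<?php
// Hypothetical helper: share of distinct words among all words (1.0 = no repeats).
function uniqueWordRatio(string $text): float
{
    // Lowercase first so "Love" and "love" count as the same word
    $words = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return 0.0;
    }
    return count(array_unique($words)) / count($words);
}

// Assumed cut-off of 0.5; adjust after studying valid and invalid samples.
function looksRepetitive(string $text, float $minRatio = 0.5): bool
{
    return uniqueWordRatio($text) < $minRatio;
}
```

For the example string this ratio is 4 distinct words out of 24, so it falls well below the assumed cut-off, while a sentence without repeated words scores 1.0.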

You could use a regex, like this:
if (preg_match('/(.{10,})\\1{2,}/', $theText)) {
    echo "The string is repeated.";
}
Explanation:
(.{10,}) looks for and captures a string that is at least 10 characters long
\\1{2,} looks for the first string at least 2 more times
Possible tweaks to suit your needs:
Change 10 to a higher or lower number to match longer or shorter repeated strings. I just used 10 as an example.
If you want to catch even one repetition (love and peace love and peace), delete the {2,}. If you want to catch a higher number of repetitions, increase the 2.
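Deleting the , in {2,} changes the quantifier to "exactly 2 more times". Because the pattern is not anchored, it still matches the same inputs as {2,}; it only changes how much of the text the match covers.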
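The tweaks above can be collected into one small function. This is only a sketch: the function name and its parameters are made up for illustration, and the i and s modifiers (ignore case, let . match newlines) are additions that the original pattern does not have.

```php
<?php
// Hypothetical wrapper around the regex shown above.
// $minLen: minimum length of the repeated chunk.
// $minRepeats: how many additional copies must follow the first one.
function hasRepeatedChunk(string $text, int $minLen = 10, int $minRepeats = 2): bool
{
    // i: ignore case, s: let "." match newlines too
    $pattern = sprintf('/(.{%d,})\1{%d,}/is', $minLen, $minRepeats);
    // preg_match returns 1 on a match, 0 on no match and false on error
    return preg_match($pattern, $text) === 1;
}
```

Backreference patterns like this can backtrack heavily on long inputs, so it is worth testing with the longest text your form accepts.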

I think you are on the right track breaking down the string and looking at repeated words.
Here is some code though which does not use a PCRE and leverages PHP native string functions (str_word_count and array_count_values):
<?php
$words = str_word_count("Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace", 1);
$words = array_count_values($words);
var_dump($words);
/*
array(5) {
  ["Love"]=>
  int(1)
  ["a"]=>
  int(6)
  ["and"]=>
  int(6)
  ["peace"]=>
  int(6)
  ["love"]=>
  int(5)
}
*/
Some tweaks might be to:
set up a list of common words to be ignored
look at the order of words (previous and next), not just the number of occurrences
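The first tweak could look roughly like this. It is a sketch only: the function name and the default list of ignored words are assumptions, and the text is lowercased first so that "Love" and "love" are counted together.

```php
<?php
// Count case-insensitive word frequencies, skipping an assumed list of common words.
function wordCounts(string $text, array $ignore = ['a', 'and', 'the', 'i']): array
{
    $words = str_word_count(strtolower($text), 1);
    // Drop the common words before counting
    $words = array_diff($words, $ignore);
    $counts = array_count_values($words);
    arsort($counts); // most frequent first
    return $counts;
}
```

For the example string only "love" and "peace" remain, each with a count of 6.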

Another idea would be to use substr_count iteration:
$str = "Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace";
$rep = "";
$str = strtolower($str);
for ($i = 0, $len = strlen($str), $pattern = ""; $i < $len; ++$i) {
    $pattern .= $str[$i];
    if (substr_count($str, $pattern) > 1)
        $rep = strlen($rep) < strlen($pattern) ? $pattern : $rep;
    else
        $pattern = "";
}
// warn if 20%+ of the string is repetitive
if (strlen($rep) > strlen($str) / 5) echo "Repetitive string alert!";
else echo "String seems to be non-repetitive.";
echo " Longest pattern found: '$rep'";
Which would output
Repetitive string alert! Longest pattern found: 'love a and peace love a and peace love a and peace'

// 3 examples of how you might detect repeating user input
// use preg_match
// pattern to match against
$pattern = '/^text goes here$/';
// the user input
$input = 'text goes here';
// check if it matches
$repeats = preg_match($pattern, $input);
if ($repeats) {
    var_dump($repeats);
} else {
    // do something else
}
// use strpos
$string = 'text goes here';
$input = 'text goes here';
$repeats = strpos($string, $input);
if ($repeats !== false) {
    var_dump($repeats);
} else {
    // do something else
}
// or you could do something like:
function repeatingWords($str)
{
    $words = explode(' ', trim($str)); // Trim to prevent any extra blank
    // Fewer unique words than total words means at least one word repeats
    return count(array_unique($words)) < count($words);
}
$string = 'text goes here. text goes here. ';
if (repeatingWords($string)) {
    var_dump($string);
} else {
    // do something else
}

Here's the code for the function you describe in the question:
<?php
function duplicate(){
    $txt = strtolower("Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace");
    $strings = explode(" ", $txt);
    $set = 2; // how many repeats are tolerated
    for ($i = 0; $i < sizeof($strings); $i++) {
        $count = 0;
        $current = $strings[$i];
        for ($j = $i + 1; $j < sizeof($strings); $j++) {
            if ($strings[$j] !== $current) {
                continue;
            } else if ($count < $set) {
                $count++;
            } else {
                echo ("String ".$current." repeated more than ".$set." times\n");
                break; // report once per starting word instead of once per extra match
            }
        }
    }
}
echo("Hello World!\n");
duplicate();
?>

I think the approach of finding duplicate words will be messy. Most likely you'll get duplicate words in real descriptions: "I really, really, really, like ice cream, especially vanilla ice cream".
A better approach is to split the string into words, find all the unique words, add up the character counts of the unique words, and compare that total to some limit. Say you require 100-character descriptions: require around 60 characters from unique words.
Copying @ficuscr's approach:
$words = str_word_count(strtolower("Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace"), 1);
$total = 0;
// Sum the lengths of the distinct words only
foreach (array_unique($words) as $word) { $total += strlen($word); }

I am not sure whether it is a good idea to fight this problem. If a person wants to put junk in the about-me field, they will always come up with a way to do it. But I will ignore this fact and treat the problem as an algorithmic challenge:
Given a string S that consists of a substring repeated
many times without overlapping, find the substring it consists of.
The definition is loose, and I assume that the string has already been converted to lowercase.
First, an easier way:
Use a modification of the longest common substring problem, which has an easy dynamic programming (DP) solution. But instead of finding a common substring of two different strings, find the longest common substring of the string with itself, LCS(s, s).
It sounds stupid at first (surely LCS(s, s) == s), but we do not actually care about the answer; we care about the DP matrix that it produces.
Let's look at the example: s = "abcabcabc" and the matrix is:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
[0, 0, 2, 0, 0, 2, 0, 0, 2, 0]
[0, 0, 0, 3, 0, 0, 3, 0, 0, 3]
[0, 1, 0, 0, 4, 0, 0, 4, 0, 0]
[0, 0, 2, 0, 0, 5, 0, 0, 5, 0]
[0, 0, 0, 3, 0, 0, 6, 0, 0, 6]
[0, 1, 0, 0, 4, 0, 0, 7, 0, 0]
[0, 0, 2, 0, 0, 5, 0, 0, 8, 0]
[0, 0, 0, 3, 0, 0, 6, 0, 0, 9]
Note the nice diagonals there. As you can see, the first diagonal ends with 3, the second with 6 and the third with 9 (the last one is the original DP solution, which we do not care about).
This is not a coincidence. After looking in more detail at how the DP matrix is constructed, you can see that these diagonals correspond to duplicate strings.
Here is an example for s = "aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas"
and the very last row in the matrix is:
[0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 17, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 34, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 51, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 68].
As you can see, the big numbers (17, 34, 51, 68) correspond to the ends of the diagonals (there is also some noise, just because I deliberately added small duplicates like aaa).
This suggests that we can just take the gcd of the two biggest numbers, gcd(68, 51) = 17, which will be the length of our repeated substring.
Here, because we know that the whole string consists of repeated substrings, we know that the repeat starts at position 0 (if we did not know that, we would need to find the offset).
And here we go: the string is "aaabasdfwasfsdtas".
P.S. This method allows you to find repeats even if they are slightly modified.
For people who would like to play around, here is a Python script (which was created in a hurry, so feel free to improve it):
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm

def longest_common_substring(s1, s2):
    # m[x][y] = length of the longest common suffix of s1[:x] and s2[:y]
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
            else:
                m[x][y] = 0
    return m

s = "aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas"
m = longest_common_substring(s, s)
print(m[-1])  # last row of the DP matrix
arr = np.asarray(m)
plt.imshow(arr, cmap=cm.Greys_r, interpolation='none')
plt.show()
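Since the question is about PHP, here is a rough PHP counterpart of the same idea: compute only the last row of the matrix for the string against itself, then take the gcd of its two largest values. The function names are made up for this sketch, and, as noted above, the gcd shortcut assumes the whole string is built from one repeated unit.

```php
<?php
// Last row of the common-suffix DP matrix of $s against itself.
// Entry $y is the length of the longest common suffix of $s and substr($s, 0, $y).
function lastRowOfSelfMatrix(string $s): array
{
    $n = strlen($s);
    $prev = array_fill(0, $n + 1, 0);
    for ($x = 1; $x <= $n; $x++) {
        $curr = array_fill(0, $n + 1, 0);
        for ($y = 1; $y <= $n; $y++) {
            if ($s[$x - 1] === $s[$y - 1]) {
                $curr[$y] = $prev[$y - 1] + 1;
            }
        }
        $prev = $curr; // only the previous row is needed
    }
    return $prev;
}

// Plain Euclidean gcd, so the GMP extension is not required.
function gcd(int $a, int $b): int
{
    while ($b !== 0) {
        [$a, $b] = [$b, $a % $b];
    }
    return $a;
}

// Length of the repeated unit; equals strlen($s) when nothing repeats.
function repeatedUnitLength(string $s): int
{
    if ($s === '') {
        return 0;
    }
    $row = lastRowOfSelfMatrix($s);
    rsort($row); // largest values first
    return gcd($row[0], $row[1]);
}
```

The double loop is quadratic in the length of the text, which is fine for a short bio but worth keeping in mind for long inputs.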
I described the easy way and forgot to write about the hard way.
It is getting late, so I will just explain the idea. The implementation is harder and I am not sure whether it will give you better results. But here it is:
Use the algorithm for the longest repeated substring (you will need to implement a trie or a suffix tree, which is not easy in PHP).
After this:
s = "aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas"
s1 = largest_substring_algo1(s)
I took the implementation of largest_substring_algo1 from here. It is actually not the best (it is just for showing the idea), as it does not use the above-mentioned data structures. The results for s and s1 are:
aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas
aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaa
As you can see, the difference between them is actually the substring that was duplicated.

You have a tricky problem on your hands, primarily because your requirements are somewhat unclear.
You indicate you want to disallow repeated text, because it's "bad".
Consider someone who puts the last stanza of Robert Frost's Stopping by Woods on a Snowy Evening in their profile:
These woods are lovely, dark and deep
but I have promises to keep
and miles to go before I sleep
and miles to go before I sleep
You might consider this good, but it does have a repetition. So what's good, and what's bad? (note that this is not an implementation problem just yet, you're just looking for a way to define "bad repetitions")
Directly detecting duplicates thus proves tricky. So let's devolve to tricks.
Compression works by taking redundant data, and compressing it into something smaller. A very repetitive text would be very easily compressed. A trick you could perform, is to take the text, zip it, and take a look at the compression ratio. Then tweak the allowed ratio to something you find acceptable.
implementation:
$THRESHOLD = ???;
$bio = ???;
$zippedbio = gzencode($bio);
$compression_ratio = strlen($zippedbio) / strlen($bio);
if ($compression_ratio >= $THRESHOLD) {
    //ok;
} else {
    //not ok;
}
A couple of experimental results from examples found in this question/answers:
"Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace": 0.3960396039604
"These woods are lovely, dark and deep
but I have promises to keep
and miles to go before I sleep
and miles to go before I sleep": 0.78461538461538
"aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas": 0.58823529411765
suggest a threshold value of around 0.6 before rejecting it as too repetitive.
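Wrapped into functions, the check might look like the sketch below. The function names are hypothetical, the 0.6 default is simply the value suggested by the experiments above, and treating an empty bio as non-repetitive is an assumption. gzencode needs PHP's zlib extension.

```php
<?php
// Ratio of gzip-compressed size to original size (lower = more repetitive).
function compressionRatio(string $bio): float
{
    if ($bio === '') {
        return 1.0; // assumption: an empty bio is not judged as repetitive
    }
    return strlen(gzencode($bio)) / strlen($bio);
}

// True when the text compresses better than the chosen threshold allows.
function isTooRepetitive(string $bio, float $threshold = 0.6): bool
{
    return compressionRatio($bio) < $threshold;
}
```

Keep in mind that the gzip container adds a fixed header and trailer, so very short texts can have a ratio above 1 regardless of content; the threshold is only meaningful for texts of reasonable length.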

Related

Find the coin change (Greedy Algorithm) when coins are in decimals and returned amount in coins is larger than original return value

I need to find the number of coins that make a given value where coins are in decimals and there is a possibility that the algorithm will return more coins (money) because there is no way to return the exact value where the returned amount is close to the given value.
For example:
Coins: [23, 29.90, 34.50]
Value: 100
Possible solutions:
Solution 1: 34.60, 34.50, 29.90, 23 (122)
Solution 2: 29.90, 29.90, 29.90 ,29.90 (119.90)
Solution 3: 23, 23, 23, 23, 23 (115)
Solution 4: 23, 23, 23, 34.50 (103.5)
Based on the possible solutions, the clear winner is "Solution 4" and I am looking for an algorithm that will help me to solve this issue. I don't care how many coins are used I just need to be sure that returned values in coins are as close as passed/desired value.
Does someone know the solution or algorithm for this case?
Best Regards.

The greedy algorithm takes the largest possible coin, then the next largest possible one, and so on until you reach the sum. But it does not give the best solution in the general case.
So consider using a table of possible sums:
1. Multiply the sum and all coin values by 100 to work in integers.
2. Make an array A[] of length 1 + sum + largest_coin, filled with zeros, and set A[0] to -1.
3. For every coin value C, walk through the array. If A[i-C] is not zero, put the value C into A[i].
4. After that, scan the array range A[sum]..A[max] to find the first non-zero item. Its index K is the best sum. This cell contains the last coin added, so you can unwind the whole combination by walking down to index 0: A[K] => A[K - A[K]] and so on.
Python code
def makesum(lst, summ):
    mx = max(lst)
    A = [-1] + [0]*(mx+summ)
    for c in lst:
        for i in range(c, summ + c + 1):
            if A[i - c]:
                A[i] = c
    print(A)
    # look for the smallest possible combination >= summ
    for k in range(summ, summ + mx + 1):
        if A[k]:
            break
    else:
        return  # no reachable sum in the scanned range
    # unwind combination of used coins
    while (k > 0):
        print(A[k])
        k = k - A[k]

makesum([7, 13, 21], 30)
Array for reference. Non-zero entries - for possible sums.
[-1, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 13, 7, 0, 0, 0, 0, 0, 13, 21, 0, 0,
0, 0, 13, 13, 21, 0, 0, 0, 0, 13, 21, 21, 0, 0, 0, 13, 13, 21, 21, 0, 0, 0,
0, 21, 21, 21, 0, 0]
Combination:
13
13
7
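Applied to the decimal coins from the question, the same table idea could be sketched in PHP as follows. The function name and return shape are made up for this example; amounts are converted to integer cents first, which assumes the coins have at most two decimal places.

```php
<?php
// Smallest reachable sum >= $target, with the coins that produce it.
function closestSumAtLeast(array $coins, float $target): ?array
{
    // Work in integer cents to avoid floating-point drift (assumes 2 decimals).
    $cents = array_map(fn($c) => (int) round($c * 100), $coins);
    $goal = (int) round($target * 100);
    $max = max($cents);

    // $last[$i] = a coin that completes sum $i, 0 if unreachable; -1 marks the empty sum.
    $last = array_fill(0, $goal + $max + 1, 0);
    $last[0] = -1;
    foreach ($cents as $c) {
        for ($i = $c; $i <= $goal + $max; $i++) {
            if ($last[$i - $c] !== 0 && $last[$i] === 0) {
                $last[$i] = $c;
            }
        }
    }

    // First reachable sum at or above the goal is the closest overshoot.
    for ($k = $goal; $k <= $goal + $max; $k++) {
        if ($last[$k] !== 0) {
            $used = [];
            for ($j = $k; $j > 0; $j -= $last[$j]) {
                $used[] = $last[$j] / 100; // unwind the combination
            }
            return ['sum' => $k / 100, 'coins' => $used];
        }
    }
    return null;
}

$result = closestSumAtLeast([23, 29.90, 34.50], 100);
echo $result['sum']; // 103.5, the total of "Solution 4" in the question
```

The table has one entry per cent up to the target plus the largest coin, so for large targets you may prefer a coarser unit if the coin values allow it.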

Getting started
I do not have enough reputation to comment, otherwise I would have asked you further questions, but I believe this can get you started.
- Assuming we are not sure how many coins a user is going to pick
- Any of the coins can have the same amount as another, but they have to be treated as different inputs
- Any number of coins can be added together, and the sum closest to the desired maximum is accepted
What exactly is the script intended to achieve?
A user picks a random number of coins, which are recorded and put into an array. Any number of those coins can be picked and added, and if the sum falls within a specific threshold, those coins are accepted.
Concept
<?php
$arrX = array("1.1","20.1","3.5","4","5.7","6.8","7.3","8.6","9","10"); //random coins from user
$minthresh = "30";
$desired = "33";
$maxthresh = "35"; //A threshold is necessary if the desired amount is not enforced
$randIndex = array_rand($arrX, 2); //Pick two random coins, never the same coin twice
$sumofpair = $arrX[$randIndex[0]] + $arrX[$randIndex[1]]; //Numeric sum of the two picked coins
//Debug to see which two coins are picked and how much they give
print_r($randIndex[0]);
echo " pair with ";
print_r($randIndex[1]);
echo " = ".$sumofpair." Possible acceptable sum<br>";
if (($sumofpair >= $minthresh) && ($sumofpair <= $maxthresh)) { //Check if the sum is within the threshold
    echo "<br>found something<br>";
    //if a pair is found show which pair exactly and how much did the pair make
    echo "<br>This ".$arrX[$randIndex[0]]."+".$arrX[$randIndex[1]]." Gets you this ".$sumofpair." Which is very close to ".$desired;
} else {
    //If the pair does not match, refresh until a pair is found...See below for more info on this
    echo "<br>No match so far. Refresh again<br>";
}
?>
To remove the need for refreshing, run this in a loop until a pair that gives you the desired sum is found.
You can expand this to check 3 random coins, 4 random coins and so on:
$randIndex = array_rand($arrX, 3) ... $randIndex = array_rand($arrX, 4) ...
array_rand returns distinct keys, so the same coin is never picked twice in one call and a coin is never added to itself.
Here is a cleaned-up version. Run it: any pair whose sum is between 30 and 35 should be reported as found.
<?php
$arrX = array("1.1","20.1","3.5","4","5.7","6.8","7.3","8.6","9","10");
$desired = "32";
$randIndex = array_rand($arrX, 2);
$sumofpair = $arrX[$randIndex[0]] + $arrX[$randIndex[1]];
echo $arrX[$randIndex[0]];
echo " pair with ";
echo $arrX[$randIndex[1]];
echo " = ".$sumofpair;
if (($sumofpair >= "30") && ($sumofpair <= "35")) {
    echo "<br>This coin ".$arrX[$randIndex[0]]."+ This coin ".$arrX[$randIndex[1]]." = ".$sumofpair." This amount ~ equal to ".$desired;
} else {
    echo "<br>Not a match so far. Refresh again<br>";
}
?>
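Instead of drawing random pairs until one fits, every pair can be checked once. The sketch below does that for pairs only; the function name and return shape are assumptions, and the input is expected to be a plain list with keys 0..n-1.

```php
<?php
// Try every pair of distinct coins and keep the one whose sum is closest to $desired.
function closestPair(array $coins, float $desired): ?array
{
    $best = null;
    $n = count($coins);
    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) { // $j > $i: a coin is never paired with itself
            $sum = (float) $coins[$i] + (float) $coins[$j];
            if ($best === null || abs($sum - $desired) < abs($best['sum'] - $desired)) {
                $best = ['coins' => [$coins[$i], $coins[$j]], 'sum' => $sum];
            }
        }
    }
    return $best; // null when fewer than two coins are given
}
```

With the coins from the example and a desired value of 33, the closest pair is "20.1" and "10".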

Iterate through 2d array of booleans and leave only the largest contiguous "2D blob of ones"

Ok, so the question is kind of awkwardly phrased, but I hope this will clear things up.
I have this sample 2d array.
$array = array(
array(1, 0, 0, 0, 1, 0, 0, 1),
array(0, 0, 1, 1, 1, 1, 0, 1),
array(0, 1, 1, 0, 1, 0, 0, 0),
array(0, 1, 1, 0, 0, 0, 1, 0),
array(1, 0, 0, 0, 1, 1, 1, 1),
array(0, 1, 1, 0, 1, 0, 1, 0),
array(0, 0, 0, 0, 0, 0, 0, 1)
);
When iterated by rows (and terminating each row with \n), and for every row then iterated by column, it will echo something like this: (░░ = 0, ▓▓ = 1)
▓▓░░░░░░▓▓░░░░▓▓
░░░░▓▓▓▓▓▓▓▓░░▓▓
░░▓▓▓▓░░▓▓░░░░░░
░░▓▓▓▓░░░░░░▓▓░░
▓▓░░░░░░▓▓▓▓▓▓▓▓
░░▓▓▓▓░░▓▓░░▓▓░░
░░░░░░░░░░░░░░▓▓
But what I'd like to do is to "analyse" the array and only leave 1 contiguous shape (the one with the most "cells"), in this example, the result would be:
░░░░░░░░▓▓░░░░░░
░░░░▓▓▓▓▓▓▓▓░░░░
░░▓▓▓▓░░▓▓░░░░░░
░░▓▓▓▓░░░░░░░░░░
▓▓░░░░░░░░░░░░░░
░░▓▓▓▓░░░░░░░░░░
░░░░░░░░░░░░░░░░
My initial approach was to:
Assign each ▓▓ cell a unique number (be it completely random, or the current iteration number):
01░░░░░░02░░░░03
░░░░04050607░░08
░░0910░░11░░░░░░
░░1213░░░░░░14░░
15░░░░░░16171819
░░2021░░22░░23░░
░░░░░░░░░░░░░░24
Iterate through the array many, MANY times: on every iteration, each ▓▓ cell takes the largest unique number among its neighbours. The loop goes on until no change is detected between the current state and the previous state. After the last iteration, the result would be this:
01░░░░░░21░░░░08
░░░░21212121░░08
░░2121░░21░░░░░░
░░2121░░░░░░24░░
21░░░░░░24242424
░░2121░░24░░24░░
░░░░░░░░░░░░░░24
Now it all comes down to counting the value that occurs the most. Then, iterating once again, to turn all the cells whose value is not the most popular one, to 0, giving me the desired result.
However, I feel it's quite a roundabout and computationally heavy approach for such a simple task and there has to be a better way. Any ideas would be greatly appreciated, cheers!
BONUS POINTS: Divide all the blobs into an array of 2D arrays, ordered by number of cells, so we can do something with the smallest blob, too

Always fun, these problems. And done before, so I'll dump my code here, maybe you can use some of it. This basically follows every shape by looking at a cell and its surrounding 8 cells, and if they connect go to the connecting cell, look again and so on...
<?php
$shape_nr=1;
$ln_max=count($array);
$cl_max=count($array[0]);
$done=[];
//LOOP ALL CELLS, GIVE 1's unique number
for($ln=0;$ln<$ln_max;++$ln){
    for($cl=0;$cl<$cl_max;++$cl){
        if($array[$ln][$cl]===0)continue;
        $array[$ln][$cl] = ++$shape_nr;
    }
}
//DETECT SHAPES
for($ln=0;$ln<$ln_max;++$ln){
    for($cl=0;$cl<$cl_max;++$cl){
        if($array[$ln][$cl]===0)continue;
        $shape_nr=$array[$ln][$cl];
        if(in_array($shape_nr,$done))continue;
        look_around($ln,$cl,$ln_max,$cl_max,$shape_nr,$array);
        //SET SHAPE_NR to DONE, no need to look at that number again
        $done[]=$shape_nr;
    }
}
//LOOP THE ARRAY and COUNT SHAPENUMBERS
$res=array();
for($ln=0;$ln<$ln_max;++$ln){
    for($cl=0;$cl<$cl_max;++$cl){
        if($array[$ln][$cl]===0)continue;
        if(!isset($res[$array[$ln][$cl]]))$res[$array[$ln][$cl]]=1;
        else $res[$array[$ln][$cl]]++;
    }
}
//get largest shape
$max = max($res);
$shape_value_max = array_search ($max, $res);
//get smallest shape
$min = min($res);
$shape_value_min = array_search ($min, $res);
// recursive function: detect connecting cells
function look_around($ln,$cl,$ln_max,$cl_max,$nr,&$array){
    //create mini array
    $mini=mini($ln,$cl,$ln_max,$cl_max);
    if($mini===false)return false;
    //loop surrounding cells
    foreach($mini as $v){
        if($array[$v[0]][$v[1]]===0){continue;}
        if($array[$v[0]][$v[1]]!==$nr){
            // set shape_nr of connecting cell
            $array[$v[0]][$v[1]]=$nr;
            // follow the shape
            look_around($v[0],$v[1],$ln_max,$cl_max,$nr,$array);
        }
    }
    return $nr;
}
// CREATE ARRAY WITH THE (UP TO) 8 SURROUNDING CELLS
function mini($ln,$cl,$ln_max,$cl_max){
    $look=[];
    $mini=[[-1,-1],[-1,0],[-1,1],[0,-1],[0,1],[1,-1],[1,0],[1,1]];
    foreach($mini as $v){
        if( $ln + $v[0] >= 0 &&
            $ln + $v[0] < $ln_max &&
            $cl + $v[1] >= 0 &&
            $cl + $v[1] < $cl_max
        ){
            $look[]=[$ln + $v[0], $cl + $v[1]];
        }
    }
    if(count($look)===0){return false;}
    return $look;
}
Here's a fiddle

I can only think of a few minor improvements:
- Keep a linked list of the non-empty fields. In step 2 you then do not need to touch all n² matrix elements, only the ones in your linked list, which might be far fewer depending on how sparse your matrix is.
- You only need to compare in the right, right-down, left-down and down directions. The other directions have already been checked from the previous row/column. What I mean: when I am greater than my right neighbour, I can already change the number of the right neighbour (same for down and right-down). This halves the number of comparisons.

If your array size isn't huge and memory won't be a problem maybe a recursive solution would be faster. I found a c++ algorithm that does this here:
https://www.geeksforgeeks.org/find-length-largest-region-boolean-matrix/
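A flood fill along those lines can be sketched in PHP with an explicit stack instead of recursion, which avoids deep call chains on large grids. The function name is made up for this example, and cells are treated as connected in all 8 directions, as in the expected result from the question.

```php
<?php
// Keep only the largest 8-connected blob of 1s; every other cell becomes 0.
function keepLargestBlob(array $grid): array
{
    $rows = count($grid);
    $cols = $rows > 0 ? count($grid[0]) : 0;
    $seen = [];
    $best = [];
    for ($r = 0; $r < $rows; $r++) {
        for ($c = 0; $c < $cols; $c++) {
            if ($grid[$r][$c] !== 1 || isset($seen["$r,$c"])) {
                continue;
            }
            // Flood fill from this cell using an explicit stack
            $blob = [];
            $stack = [[$r, $c]];
            $seen["$r,$c"] = true;
            while ($stack) {
                [$y, $x] = array_pop($stack);
                $blob[] = [$y, $x];
                for ($dy = -1; $dy <= 1; $dy++) {
                    for ($dx = -1; $dx <= 1; $dx++) {
                        $ny = $y + $dy;
                        $nx = $x + $dx;
                        if ($ny < 0 || $ny >= $rows || $nx < 0 || $nx >= $cols) {
                            continue; // outside the grid
                        }
                        if ($grid[$ny][$nx] !== 1 || isset($seen["$ny,$nx"])) {
                            continue; // empty or already visited
                        }
                        $seen["$ny,$nx"] = true;
                        $stack[] = [$ny, $nx];
                    }
                }
            }
            if (count($blob) > count($best)) {
                $best = $blob;
            }
        }
    }
    // Rebuild an all-zero grid and paint the winning blob back in
    $out = $rows > 0 ? array_fill(0, $rows, array_fill(0, $cols, 0)) : [];
    foreach ($best as [$y, $x]) {
        $out[$y][$x] = 1;
    }
    return $out;
}
```

Each cell is visited once, so the work grows linearly with the grid size. For the bonus question, collecting every $blob into a list and sorting that list by count() would give the blobs ordered by size.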

Get number by knowing chance

I don't think the article title is correct so I will try to explain what I need.
ATM I have array:
array(
'start' => 1,
'end' => 10,
'lucky_numbers' => 6
);
and knowing this array I can define that chance to win is 60% out of 100%. The next part is the second array:
array(0, 1.25, 0.5, 1.25, 0, 3, 0, 1, 0.5, 1.25, 0.5, 1.25, 0, 2, 0.5, 2)
and this is the hard part. I don't have any clue how to pick one number knowing that chance to pick not a zero is 60%. Any ideas?
EDIT
this is the wheel numbers. when user spins the wheel i need to give him 60% chance to win. So 60% chance to spin not the 0

Let me see if I understand:
- This is a "slots-like" winnings-multiplier minigame where you spin a "wheel" showing the possible multipliers the player can win. This wheel is the second array, and it is variable in length.
- You want to give the player a variable chance to win (sometimes 60%, sometimes 80%, sometimes 20%).
If you only want to control how often the player gets a 0 multiplier, do the opposite: take the probability of a 0 appearing, put the equivalent number of zeros into the array, then fill the rest of the array with random multipliers and shuffle it.
$multipliers = [0.5, 1.25, 2, 3];
$wheel = [];
for ($i = 0; $i < $arraylenght; $i++) {
    if ($i < floor($arraylenght * (1 - ((float)$luckyNumbers/10)))) {
        $wheel[] = 0;
    } else {
        // array_rand returns a key, so look up the multiplier itself
        $wheel[] = $multipliers[array_rand($multipliers)];
    }
}
shuffle($wheel);
Now if you also want to control the probabilities of each multiplier... That's another beast.
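If the wheel itself has to stay as given, another option is to decide win or lose first and only then pick a segment. The sketch below assumes the win chance is passed in as a fraction (for the question's data that would be lucky_numbers / (end - start + 1) = 0.6); the function name is hypothetical.

```php
<?php
// Spin the wheel so that a non-zero multiplier comes up with probability $winChance.
function spin(array $wheel, float $winChance): float
{
    // All non-zero segments of the wheel
    $winning = array_values(array_filter($wheel, fn($v) => $v > 0));
    // Uniform random number in [0, 1)
    $roll = mt_rand() / (mt_getrandmax() + 1);
    if ($roll < $winChance && $winning) {
        // Win: pick one of the non-zero segments at random
        return (float) $winning[array_rand($winning)];
    }
    return 0.0; // lose
}

$wheel = [0, 1.25, 0.5, 1.25, 0, 3, 0, 1, 0.5, 1.25, 0.5, 1.25, 0, 2, 0.5, 2];
$multiplier = spin($wheel, 6 / 10);
```

Multipliers that appear more often on the wheel are still more likely to be chosen among the wins, since the pick is uniform over the non-zero segments.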

How to split a string in two parts without cutting words PHP?

I have this code that cuts the string $txt0 in two parts when its length is greater than 64 characters. The problem is that it sometimes cuts a word, giving me a result like:
$txt0 = "This house is ... and really pretty"
----result----
This house is...and rel
ly pretty
$array_posiciones = array (-250, -200, -150, -100, -50);
$posicion_lineas = 0;
if ( !empty($txt0) )
{
    if ( strlen($txt0) > 64 )
    {
        $lines = str_split($txt0, 40);
        $listing_title->annotateImage($draw, 0, $array_posiciones[$posicion_lineas], 0, $lines[0]."-");
        $posicion_lineas++;
        $listing_title->annotateImage($draw, 0, $array_posiciones[$posicion_lineas], 0, $lines[1]);
        $posicion_lineas++;
    } else {
        $listing_title->annotateImage($draw, 0, $array_posiciones[$posicion_lineas], 0, $txt0);
        $posicion_lineas++;
    }
}
I am trying to draw two clean lines on an image using Imagick, but at the moment I can't avoid cutting a word when I separate the lines. Thank you, and sorry for my poor English. I hope you can understand the question.

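$pos=strpos($txt0," ",64);
$txt0_short0=substr($txt0,0,$pos);
$txt0_short1=substr($txt0,$pos+1);
This finds the first " " at or after the first 64 characters and uses it as the split point. (strrpos with a positive offset would return the last space in the whole string instead.) It assumes $txt0 is longer than 64 characters and has a space after that point; otherwise strpos returns false.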
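PHP's built-in wordwrap can also do the splitting, with the advantage that no line exceeds the chosen width. A minimal sketch, with a made-up function name and the question's 40 characters as the default width:

```php
<?php
// Split $text into lines of at most $width characters, breaking at spaces.
// The final "true" makes wordwrap cut a word only when it is longer than $width.
function splitIntoLines(string $text, int $width = 40): array
{
    return explode("\n", wordwrap($text, $width, "\n", true));
}

$lines = splitIntoLines("This house is big and really pretty", 20);
// $lines[0] is "This house is big", $lines[1] is "and really pretty"
```

Each element of the returned array can then be passed to annotateImage in turn, however many lines there are.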

PHP substr() REALLY slow when not taking first part of string

I have a text file that has about 5,000 lines with each line being about 200 characters long. Each line actually contains 6 different pieces of data that I've been using substr() to break apart. For example, on each line, characters 0 - 10 contain the Client#, characters 10-20 contain the Matter#, etc. This is all well and good and was running faster than I even needed it to.
My problems arose when I was told by my boss that the client number has 4 leading zeros and they need to be stripped off. So I thought, no problem - I just changed my first substr() function from substr(0, 10) (start at 0 and take 10 characters) and changed it to substr(4, 6) (starting at the 4th character and just taking 6) which will skip the 4 leading zeros and I'll be good to go.
However, when I changed the substr(0, 10) to substr(4,6) the process grinds to a halt and takes forever to complete. Why is this?
Here is a snippet from my code:
// open the file
$file_matters = fopen($varStoredIn_matters,"r") or exit("Unable to open file!");
// run until the end of the file
while(!feof($file_matters))
{
    // place current line in temp variable
    $tempLine_matters = fgets($file_matters);
    // increment the matters line count
    $linecount_matters++;
    // break up each column
    $clientID = trim(substr($tempLine_matters, 0, 10)); // THIS ONE WORKS FINE
    //$clientID = trim(substr($tempLine_matters, 4, 6)); // THIS ONE MAKES THE PROCESS GRIND TO A HALT!!
    $matterID = trim(substr($tempLine_matters, 10, 10));
    //$matterID = trim(substr($tempLine_matters, 15, 5));
    $matterName = trim(substr($tempLine_matters, 20, 80));
    $subMatterName = trim(substr($tempLine_matters, 100, 80));
    $dateOpen = trim(substr($tempLine_matters, 180, 10));
    $orgAttorney = trim(substr($tempLine_matters, 190, 3));
    $bilAttorney = trim(substr($tempLine_matters, 193, 3));
    $resAttorney = trim(substr($tempLine_matters, 196, 3));
    //$tolCode = trim(substr($tempLine_matters, 200, 3));
    $tolCode = trim(substr($tempLine_matters, 200, 3));
    $dateClosed = trim(substr($tempLine_matters, 203, 10));
    // just does an insert into the DB using the variables above
}

I can't see why that would be so much slower, but you could take a look at unpack which could extract your fixed width record in one hit:
$fields = unpack('A10client/A10matter/A60name ...etc... ',$tempLine_matters);
I did a quick benchmark using a similar record pattern to your example and found unpack was over twice as fast as using 10 substr calls in each iteration.
I'd suggest profiling your code with xdebug to see where the difference really lies.
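As a small illustration of the unpack idea, the sketch below parses only the first three columns, using the widths from the question's substr calls, and then drops the four leading zeros from the client number. The function name and the shortened format string are assumptions; a real format would list all columns.

```php
<?php
// Parse the first three fixed-width columns of one line.
function parseMatterLine(string $line): array
{
    // "A" reads a space-padded string and strips the trailing padding
    $fields = unpack('A10client/A10matter/A80name', $line);
    // Drop the 4 leading zeros of the client number
    $fields['client'] = substr($fields['client'], 4);
    return $fields;
}

$line = "0000123456" . "MATTER0001" . str_pad("Some matter name", 80);
$fields = parseMatterLine($line);
// $fields['client'] is "123456", $fields['name'] is "Some matter name"
```

Lines shorter than the format expects make unpack fail, so real input (for example the last, possibly empty, line read by fgets) should be length-checked first.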

It's not a very optimized process. You should maybe think about it a little bit more.
But if it's working right now, that's the most important thing...
Maybe it will be faster if you get your value in two steps. For example:
$clientID_bis = trim(substr($tempLine_matters, 0, 10));
$clientID = trim(substr($clientID_bis, 4, 6));
