Ok, so the question is kind of awkwardly phrased, but I hope this will clear things up.
I have this sample 2d array.
$array = array(
array(1, 0, 0, 0, 1, 0, 0, 1),
array(0, 0, 1, 1, 1, 1, 0, 1),
array(0, 1, 1, 0, 1, 0, 0, 0),
array(0, 1, 1, 0, 0, 0, 1, 0),
array(1, 0, 0, 0, 1, 1, 1, 1),
array(0, 1, 1, 0, 1, 0, 1, 0),
array(0, 0, 0, 0, 0, 0, 0, 1)
);
When iterated by rows (and terminating each row with \n), and for every row then iterated by column, it will echo something like this: (░░ = 0, ▓▓ = 1)
▓▓░░░░░░▓▓░░░░▓▓
░░░░▓▓▓▓▓▓▓▓░░▓▓
░░▓▓▓▓░░▓▓░░░░░░
░░▓▓▓▓░░░░░░▓▓░░
▓▓░░░░░░▓▓▓▓▓▓▓▓
░░▓▓▓▓░░▓▓░░▓▓░░
░░░░░░░░░░░░░░▓▓
But what I'd like to do is to "analyse" the array and only leave 1 contiguous shape (the one with the most "cells"), in this example, the result would be:
░░░░░░░░▓▓░░░░░░
░░░░▓▓▓▓▓▓▓▓░░░░
░░▓▓▓▓░░▓▓░░░░░░
░░▓▓▓▓░░░░░░░░░░
▓▓░░░░░░░░░░░░░░
░░▓▓▓▓░░░░░░░░░░
░░░░░░░░░░░░░░░░
My initial approach was to:
Assign each ▓▓ cell a unique number (be it completely random, or the current iteration number):
01░░░░░░02░░░░03
░░░░04050607░░08
░░0910░░11░░░░░░
░░1213░░░░░░14░░
15░░░░░░16171819
░░2021░░22░░23░░
░░░░░░░░░░░░░░24
Iterate through the array many, MANY times: on every iteration, each ▓▓ cell takes the largest unique number among itself and its neighbours. The loop goes on until no change is detected between the current state and the previous state. After the last iteration, the result would be this:
01░░░░░░21░░░░08
░░░░21212121░░08
░░2121░░21░░░░░░
░░2121░░░░░░24░░
21░░░░░░24242424
░░2121░░24░░24░░
░░░░░░░░░░░░░░24
Now it all comes down to counting the value that occurs the most. Then, iterating once again, to turn all the cells whose value is not the most popular one, to 0, giving me the desired result.
However, I feel it's quite a roundabout and computationally heavy approach for such a simple task and there has to be a better way. Any ideas would be greatly appreciated, cheers!
BONUS POINTS: Divide all the blobs into an array of 2D arrays, ordered by number of cells, so we can do something with the smallest blob, too
Always fun, these problems. And done before, so I'll dump my code here, maybe you can use some of it. This basically follows every shape by looking at a cell and its surrounding 8 cells, and if they connect go to the connecting cell, look again and so on...
<?php
$shape_nr=1;
$ln_max=count($array);
$cl_max=count($array[0]);
$done=[];
//LOOP ALL CELLS, GIVE 1's unique number
for($ln=0;$ln<$ln_max;++$ln){
    for($cl=0;$cl<$cl_max;++$cl){
        if($array[$ln][$cl]===0)continue;
        $array[$ln][$cl] = ++$shape_nr;
    }
}
//DETECT SHAPES
for($ln=0;$ln<$ln_max;++$ln){
    for($cl=0;$cl<$cl_max;++$cl){
        if($array[$ln][$cl]===0)continue;
        $shape_nr=$array[$ln][$cl];
        if(in_array($shape_nr,$done))continue;
        look_around($ln,$cl,$ln_max,$cl_max,$shape_nr,$array);
        //SET SHAPE_NR to DONE, no need to look at that number again
        $done[]=$shape_nr;
    }
}
//LOOP THE ARRAY and COUNT SHAPENUMBERS
$res=array();
for($ln=0;$ln<$ln_max;++$ln){
    for($cl=0;$cl<$cl_max;++$cl){
        if($array[$ln][$cl]===0)continue;
        if(!isset($res[$array[$ln][$cl]]))$res[$array[$ln][$cl]]=1;
        else $res[$array[$ln][$cl]]++;
    }
}
//get largest shape
$max = max($res);
$shape_value_max = array_search ($max, $res);
//get smallest shape
$min = min($res);
$shape_value_min = array_search ($min, $res);
// recursive function: detect connecting cells
function look_around($ln,$cl,$ln_max,$cl_max,$nr,&$array){
    //create mini array
    $mini=mini($ln,$cl,$ln_max,$cl_max);
    if($mini===false)return false;
    //loop surrounding cells
    foreach($mini as $v){
        if($array[$v[0]][$v[1]]===0){continue;}
        if($array[$v[0]][$v[1]]!==$nr){
            // set shape_nr of connecting cell
            $array[$v[0]][$v[1]]=$nr;
            // follow the shape
            look_around($v[0],$v[1],$ln_max,$cl_max,$nr,$array);
        }
    }
    return $nr;
}
// CREATE ARRAY WITH THE (UP TO) 8 SURROUNDING CELLS THAT LIE INSIDE THE GRID
function mini($ln,$cl,$ln_max,$cl_max){
    $look=[];
    $mini=[[-1,-1],[-1,0],[-1,1],[0,-1],[0,1],[1,-1],[1,0],[1,1]];
    foreach($mini as $v){
        if( $ln + $v[0] >= 0 &&
            $ln + $v[0] < $ln_max &&
            $cl + $v[1] >= 0 &&
            $cl + $v[1] < $cl_max
        ){
            $look[]=[$ln + $v[0], $cl + $v[1]];
        }
    }
    if(count($look)===0){return false;}
    return $look;
}
Here's a fiddle
I can only think of a few minor improvements:
Keep a linked list of the non-empty fields. In step 2 you then do not need to touch all n² matrix elements, only the ones in your linked list, which might be far fewer depending on how sparse your matrix is.
You only need to compare in the right, down-right, down-left and down directions. The other directions have already been checked from the previous row/column. What I mean: when I am greater than my right neighbour, I can already change the number of the right neighbour (same for down and down-right). This halves the number of comparisons.
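To illustrate the half-neighbourhood idea, here is a minimal sketch. Instead of repeated relabelling passes it uses a union-find (disjoint set) structure, which is a swapped-in technique and not part of the question's approach. The function names findRoot and blobSizes are hypothetical, and cells are assumed to hold the integers 0 and 1.

```php
<?php
// Hypothetical helper: follow parent links to the representative of cell $i,
// shortening the path on the way (path halving).
function findRoot(array &$parent, int $i): int {
    while ($parent[$i] !== $i) {
        $parent[$i] = $parent[$parent[$i]];
        $i = $parent[$i];
    }
    return $i;
}

// Returns a map "representative cell index => number of cells", largest first.
// Diagonal neighbours count as connected (8-connectivity).
function blobSizes(array $grid): array {
    $rows = count($grid);
    $cols = count($grid[0]);
    $parent = [];
    // Every filled cell starts out as its own one-cell blob.
    for ($r = 0; $r < $rows; $r++) {
        for ($c = 0; $c < $cols; $c++) {
            if ($grid[$r][$c] !== 0) {
                $parent[$r * $cols + $c] = $r * $cols + $c;
            }
        }
    }
    // Only the "forward" neighbours: right, down-left, down, down-right.
    $forward = [[0, 1], [1, -1], [1, 0], [1, 1]];
    foreach (array_keys($parent) as $i) {
        $r = intdiv($i, $cols);
        $c = $i % $cols;
        foreach ($forward as [$dr, $dc]) {
            $nr = $r + $dr;
            $nc = $c + $dc;
            if ($nr < $rows && $nc >= 0 && $nc < $cols && $grid[$nr][$nc] !== 0) {
                $a = findRoot($parent, $i);
                $b = findRoot($parent, $nr * $cols + $nc);
                if ($a !== $b) {
                    $parent[$b] = $a; // merge the two blobs
                }
            }
        }
    }
    // Count the cells under each representative.
    $sizes = [];
    foreach (array_keys($parent) as $i) {
        $root = findRoot($parent, $i);
        $sizes[$root] = ($sizes[$root] ?? 0) + 1;
    }
    arsort($sizes);
    return $sizes;
}
```

Each pair of adjacent cells is looked at exactly once, and only filled cells are visited in the second and third loops, so this also covers the "only touch the non-empty fields" point.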
If your array isn't huge and memory won't be a problem, a recursive solution might be faster. I found a C++ algorithm that does this here:
https://www.geeksforgeeks.org/find-length-largest-region-boolean-matrix/
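For reference, a rough PHP sketch of the same flood-fill idea. It is not a port of the linked C++ code: it uses an explicit stack instead of recursion so that large shapes cannot exhaust the call stack, and the names findBlobs and blobToGrid are made up for this example. It assumes PHP 7.4+ (arrow functions) and cells holding the integers 0 and 1.

```php
<?php
// Collect every blob as a list of [row, col] pairs, largest blob first.
// Diagonal neighbours count as connected (8-connectivity).
function findBlobs(array $grid): array {
    $rows = count($grid);
    $cols = count($grid[0]);
    $seen = [];
    $blobs = [];
    for ($r = 0; $r < $rows; $r++) {
        for ($c = 0; $c < $cols; $c++) {
            if ($grid[$r][$c] === 0 || isset($seen[$r][$c])) {
                continue;
            }
            // Start a new blob and flood-fill from this cell.
            $blob = [];
            $stack = [[$r, $c]];
            $seen[$r][$c] = true;
            while ($stack) {
                [$y, $x] = array_pop($stack);
                $blob[] = [$y, $x];
                for ($dy = -1; $dy <= 1; $dy++) {
                    for ($dx = -1; $dx <= 1; $dx++) {
                        $ny = $y + $dy;
                        $nx = $x + $dx;
                        if ($ny < 0 || $ny >= $rows || $nx < 0 || $nx >= $cols) {
                            continue;
                        }
                        if ($grid[$ny][$nx] === 0 || isset($seen[$ny][$nx])) {
                            continue;
                        }
                        $seen[$ny][$nx] = true;
                        $stack[] = [$ny, $nx];
                    }
                }
            }
            $blobs[] = $blob;
        }
    }
    usort($blobs, fn($a, $b) => count($b) <=> count($a));
    return $blobs;
}

// Build a grid of the given size that contains only the cells of one blob.
function blobToGrid(array $blob, int $rows, int $cols): array {
    $out = array_fill(0, $rows, array_fill(0, $cols, 0));
    foreach ($blob as [$y, $x]) {
        $out[$y][$x] = 1;
    }
    return $out;
}
```

With the sample array from the question, blobToGrid(findBlobs($array)[0], 7, 8) gives the grid with only the largest shape left. Mapping blobToGrid over all entries of findBlobs($array) gives the list of 2D arrays ordered by cell count that the bonus question asks for.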
I have a site where users can put in a description about themselves.
Most users write something appropriate but some just copy/paste the same text a number of times (to create the appearance of a fair amount of text).
eg: "Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace"
Is there a good method to detect repetitive text with PHP?
The only idea I currently have is to break the text into separate words (delimited by spaces) and then check whether any word is repeated more than a set limit. Note: I'm not 100% sure how I would code this solution.
Thoughts on the best way to detect duplicate text? Or how to code the above idea?
This is a basic text classification problem. There are lots of articles out there on how to determine if some text is spam/not spam which I'd recommend digging into if you really want to get into the details. A lot of it is probably overkill for what you need to do here.
Granted one approach would be to evaluate why you're requiring people to enter longer bios, but I'll assume you've already decided that forcing people to enter more text is the way to go.
Here's an outline of what I would do:
Build a histogram of word occurrences for the input string
Study the histograms of some valid and invalid text
Come up with a formula for classifying a histogram as valid or not
This approach would require you to figure out what's different between the two sets. Intuitively, I'd expect spam to show fewer unique words and if you plot the histogram values, a higher area under the curve concentrated toward the top words.
Here's some sample code to get you going:
$str = 'Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace';
// Build a histogram mapping words to occurrence counts
$hist = array();
// Split on any number of consecutive whitespace characters
foreach (preg_split('/\s+/', $str) as $word)
{
    // Force all words lowercase to ignore capitalization differences
    $word = strtolower($word);
    // Count occurrences of the word
    if (isset($hist[$word]))
    {
        $hist[$word]++;
    }
    else
    {
        $hist[$word] = 1;
    }
}
// Once you're done, extract only the counts
$vals = array_values($hist);
rsort($vals); // Sort max to min
// Now that you have the counts, analyze and decide valid/invalid
var_dump($vals);
When you run this code on some repetitive strings, you'll see the difference. Here's a plot of the $vals array from the example string you gave:
Compare that with the first two paragraphs of Martin Luther King Jr.'s bio from Wikipedia:
A long tail indicates lots of unique words. There's still some repetition, but the general shape shows some variation.
FYI, PHP has a stats package you can install if you're going to be doing lots of math like standard deviation, distribution modeling, etc.
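As a starting point for the "come up with a formula" step, here is a small sketch that boils the histogram down to two ratios. The function name histogramStats and the two metrics are only illustrative choices; any cut-off values would have to come from studying your own valid and invalid samples.

```php
<?php
// Hypothetical helper: summarise a word histogram with two simple ratios.
function histogramStats(string $text): array {
    $words = preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
    $total = count($words);
    if ($total === 0) {
        return ['unique_ratio' => 0.0, 'top3_share' => 0.0];
    }
    $counts = array_values(array_count_values($words));
    rsort($counts); // Sort max to min
    return [
        // Distinct words divided by all words: low for copy/pasted text.
        'unique_ratio' => count($counts) / $total,
        // Share of all words taken up by the three most frequent ones.
        'top3_share'   => array_sum(array_slice($counts, 0, 3)) / $total,
    ];
}
```

For the "Love a and peace ..." example there are 24 words but only 4 distinct ones, so unique_ratio is 4/24 and the three most frequent words already account for 18 of the 24 words.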
You could use a regex, like this:
if (preg_match('/(.{10,})\\1{2,}/', $theText)) {
echo "The string is repeated.";
}
Explanation:
(.{10,}) looks for and captures a string that is at least 10 characters long
\\1{2,} looks for the first string at least 2 more times
Possible tweaks to suit your needs:
Change 10 to a higher or lower number to match longer or shorter repeated strings. I just used 10 as an example.
If you want to catch even one repetition (love and peace love and peace), delete the {2,}. If you want to catch a higher number of repetitions, increase the 2.
If you only need to know that the repetition occurs, not how often, you can also delete the , in {2,}: the match then stops after two further copies, and preg_match still reports it.
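If you want to experiment with these tweaks, the pattern can be wrapped in a small function. This is just a sketch: the name hasRepeatedRun and its parameters are made up, and the i and s flags (ignore case, let . match newlines) are additions that the pattern above does not have.

```php
<?php
// Hypothetical wrapper: $minLength and $minRepeats correspond to the 10 and
// the 2 in the pattern above.
function hasRepeatedRun(string $text, int $minLength = 10, int $minRepeats = 2): bool {
    // In a single-quoted PHP string, \1 reaches the regex engine unchanged.
    $pattern = '/(.{' . $minLength . ',})\1{' . $minRepeats . ',}/is';
    // preg_match() returns 1 on a match, 0 on no match and false on error.
    return preg_match($pattern, $text) === 1;
}
```

Note that on very long input the regex engine may hit its backtracking limit, in which case preg_match returns false and this function reports no repetition.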
I think you are on the right track breaking down the string and looking at repeated words.
Here is some code though which does not use a PCRE and leverages PHP native string functions (str_word_count and array_count_values):
<?php
$words = str_word_count("Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace", 1);
$words = array_count_values($words);
var_dump($words);
/*
array(5) {
["Love"]=>
int(1)
["a"]=>
int(6)
["and"]=>
int(6)
["peace"]=>
int(6)
["love"]=>
int(5)
}
*/
Some tweaks might be to:
setup a list of common words to be ignored
look at order of words (previous and next), not just number of occurrences
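Both tweaks could look roughly like this. The function name repeatedBigrams and the default stop-word list are placeholders for illustration, not a recommended list; arrow functions need PHP 7.4+.

```php
<?php
// Count adjacent word pairs after dropping ignored words, and return only the
// pairs that occur more than once, most frequent first.
function repeatedBigrams(string $text, array $stopWords = ['a', 'an', 'the', 'and', 'of']): array {
    $words = str_word_count(strtolower($text), 1);
    // Remove the ignored words and reindex so neighbours are adjacent again.
    $words = array_values(array_diff($words, $stopWords));
    $pairs = [];
    for ($i = 0, $n = count($words); $i < $n - 1; $i++) {
        $pair = $words[$i] . ' ' . $words[$i + 1];
        $pairs[$pair] = ($pairs[$pair] ?? 0) + 1;
    }
    arsort($pairs);
    return array_filter($pairs, fn($count) => $count > 1);
}
```

A normal description should return few or no repeated pairs, while pasted text returns the same pairs with high counts.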
Another idea would be to use substr_count iteration:
$str = "Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace";
$rep = "";
$str = strtolower($str);
for($i=0,$len=strlen($str),$pattern=""; $i<$len; ++$i) {
    $pattern.= $str[$i];
    if(substr_count($str,$pattern)>1)
        $rep = strlen($rep)<strlen($pattern) ? $pattern : $rep;
    else
        $pattern = "";
}
// warn if 20%+ of the string is repetitive
if(strlen($rep)>strlen($str)/5) echo "Repetitive string alert!";
else echo "String seems to be non-repetitive.";
echo " Longest pattern found: '$rep'";
Which would output
Repetitive string alert! Longest pattern found: 'love a and peace love a and peace love a and peace'
// 3 examples of how you might detect repeating user input
// use preg_match
// pattern to match against
$pattern = '/^text goes here$/';
// the user input
$input = 'text goes here';
// check if it's a match
$repeats = preg_match($pattern, $input);
if ($repeats) {
    var_dump($repeats);
} else {
    // do something else
}
// use strpos
$string = 'text goes here';
$input = 'text goes here';
$repeats = strpos($string, $input);
if ($repeats !== false) {
    # code...
    var_dump($repeats);
} else {
    // do something else
}
// or you could do something like:
function repeatingWords($str)
{
    $words = explode(' ', trim($str)); // Trim to prevent any extra blank
    if (count(array_unique($words)) < count($words)) {
        return true; // Fewer unique words than words: something repeats
    }
    return false;
}
$string = 'text goes here. text goes here. ';
if (repeatingWords($string)) {
    var_dump($string);
} else {
    // do something else
}
Here's the code for the function you describe in the question:
<?php
function duplicate(){
    $txt = strtolower("Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace");
    $strings = explode(" ",$txt);
    $set = 2 ;
    for($i=0;$i < sizeof($strings);$i++){
        $count = 0;
        $current = $strings[$i];
        for($j=$i+1;$j < sizeof($strings);$j++){
            if($strings[$j]!==$current){
                continue;
            }else if($count<$set){
                $count++;
            }else{
                echo ("String ".$current." repeated more than ".$set." times\n");
            }
        }
    }
}
echo("Hello World!\n");
duplicate();
?>
I think the approach of finding duplicate words will be messy. Most likely you'll get duplicate words in real descriptions: "I really, really, really like ice cream, especially vanilla ice cream".
A better approach is to split the string into words, find all the unique words, add up the character counts of the unique words, and compare that to some limit. Say you require 100-character descriptions; then require around 60 characters from unique words.
Borrowing @ficuscr's approach:
$words = array_count_values(str_word_count(strtolower("Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace"), 1));
$total = 0;
// The keys of $words are the unique words; add up their lengths
foreach ($words as $word => $count) { $total += strlen($word); }
I am not sure whether it is a good idea to fight this kind of problem. If a person wants to put junk in the about-me field, they will always find a way to do it. But I will ignore that and treat the problem as an algorithmic challenge:
Given a string S that consists of a substring repeated
many times without overlapping, find that substring.
The definition is loose, and I assume that the string has already been converted to lowercase.
First an easier way:
Use a modification of the longest common substring algorithm, which has an easy dynamic programming (DP) solution. But instead of comparing two different strings, compare the string with itself: LCS(s, s).
It sounds stupid at first (surely LCS(s, s) == s), but we actually do not care about the answer; we care about the DP matrix that it produces.
Let's look at the example: s = "abcabcabc" and the matrix is:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
[0, 0, 2, 0, 0, 2, 0, 0, 2, 0]
[0, 0, 0, 3, 0, 0, 3, 0, 0, 3]
[0, 1, 0, 0, 4, 0, 0, 4, 0, 0]
[0, 0, 2, 0, 0, 5, 0, 0, 5, 0]
[0, 0, 0, 3, 0, 0, 6, 0, 0, 6]
[0, 1, 0, 0, 4, 0, 0, 7, 0, 0]
[0, 0, 2, 0, 0, 5, 0, 0, 8, 0]
[0, 0, 0, 3, 0, 0, 6, 0, 0, 9]
Note the nice diagonals there. As you can see, the first diagonal ends with 3, the second with 6 and the third with 9 (the original DP solution, which we do not care about).
This is not a coincidence. I hope that after looking in more detail at how the DP matrix is constructed you can see that these diagonals correspond to duplicated strings.
Here is an example for s = "aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas"
and the very last row in the matrix is:
[0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 17, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 34, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 51, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 68].
As you can see, the big numbers (17, 34, 51, 68) correspond to the ends of the diagonals (there is also some noise, because I deliberately added short repeated runs like aaa).
This suggests that we can just take the gcd of the two biggest numbers, gcd(68, 51) = 17, which is the length of our repeated substring.
Here, because we know that the whole string consists of repeated substrings, we know that it starts at position 0 (if we did not know that, we would need to find the offset).
And here we go: the string is "aaabasdfwasfsdtas".
P.S. this method allows you to find repeats even if they are slightly modified.
For people who would like to play around, here is a Python script (written in a hurry, so feel free to improve it):
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm


def longest_common_substring(s1, s2):
    # m[x][y] = length of the common suffix of s1[:x] and s2[:y]
    m = [[0] * (1 + len(s2)) for _ in range(1 + len(s1))]
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
            else:
                m[x][y] = 0
    # Return the whole DP matrix, not just the longest match
    return m


s = "aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas"
m = longest_common_substring(s, s)
print(m[-1])  # last row of the matrix

# Plot the matrix; the repeats show up as diagonals
arr = np.asarray(m)
plt.imshow(arr, cmap=cm.Greys_r, interpolation='none')
plt.show()
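Since the question is about PHP, here is a rough PHP sketch of the same steps: compute the last row of the DP matrix, take the gcd of its two biggest values, and cut that many characters off the front. The function names lastDpRow, gcdInt and repeatedUnit are made up for this example, and it relies on the same assumption as above, namely that the whole string consists of repeats starting at position 0.

```php
<?php
// Last row of the DP matrix for s compared with itself. Only two rows are
// kept in memory at a time; the running time is still O(n^2).
function lastDpRow(string $s): array {
    $n = strlen($s);
    $prev = array_fill(0, $n + 1, 0);
    for ($x = 1; $x <= $n; $x++) {
        $cur = array_fill(0, $n + 1, 0);
        for ($y = 1; $y <= $n; $y++) {
            if ($s[$x - 1] === $s[$y - 1]) {
                $cur[$y] = $prev[$y - 1] + 1;
            }
        }
        $prev = $cur;
    }
    return $prev;
}

// Plain Euclidean algorithm, so the GMP extension is not needed.
function gcdInt(int $a, int $b): int {
    while ($b !== 0) {
        [$a, $b] = [$b, $a % $b];
    }
    return $a;
}

// Guess the repeated unit from the two largest values in the last row.
// If nothing repeats, the second value is 0 and the whole string comes back.
function repeatedUnit(string $s): string {
    if ($s === '') {
        return '';
    }
    $row = lastDpRow($s);
    rsort($row);
    return substr($s, 0, gcdInt($row[0], $row[1]));
}
```

A bio could then be flagged when the returned unit is much shorter than the whole text. For long texts the quadratic running time may become noticeable.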
I told about the easy way, and forgot to write about the hard way.
It is getting late, so I will just explain the idea. The implementation is harder and I am not sure whether it will give you better results. But here it is:
Use the algorithm for longest repeated substring (you will need to implement trie or suffix tree which is not easy in php).
After this:
s = "aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas"
s1 = largest_substring_algo1(s)
I took the implementation of largest_substring_algo1 from here. It is not the best one (it is just for showing the idea), as it does not use the above-mentioned data structures. The results for s and s1 are:
aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas
aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaa
As you see the difference between them is actually the substring which was duplicated.
You have a tricky problem on your hands, primarily because your requirements are somewhat unclear.
You indicate you want to disallow repeated text, because it's "bad".
Consider someone who puts the last stanza of Robert Frost's Stopping by Woods on a Snowy Evening in their profile:
These woods are lovely, dark and deep
but I have promises to keep
and miles to go before I sleep
and miles to go before I sleep
You might consider this good, but it does have a repetition. So what's good, and what's bad? (note that this is not an implementation problem just yet, you're just looking for a way to define "bad repetitions")
Directly detecting duplicates thus proves tricky. So let's devolve to tricks.
Compression works by taking redundant data, and compressing it into something smaller. A very repetitive text would be very easily compressed. A trick you could perform, is to take the text, zip it, and take a look at the compression ratio. Then tweak the allowed ratio to something you find acceptable.
Implementation:
$THRESHOLD = ???;
$bio = ???;
$zippedbio = gzencode($bio);
$compression_ratio = strlen($zippedbio) / strlen($bio);
if ($compression_ratio >= $THRESHOLD) {
    //ok;
} else {
    //not ok;
}
A couple of experimental results from examples found in this question/answers:
"Love a and peace love a and peace love a and peace love a and peace love a and peace love a and peace": 0.3960396039604
"These woods are lovely, dark and deep
but I have promises to keep
and miles to go before I sleep
and miles to go before I sleep": 0.78461538461538
"aaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtasaaabasdfwasfsdtas": 0.58823529411765
suggest a threshold value of around 0.6 before rejecting it as too repetitive.
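Wrapped up as functions, this could look roughly as follows. It is a sketch that assumes the zlib extension is available; the names compressionRatio and looksRepetitive are made up, and the default of 0.6 is only the rough value suggested by the three samples above, so tune it on real bios.

```php
<?php
// Compressed size divided by original size; needs the zlib extension.
function compressionRatio(string $text): float {
    if ($text === '') {
        return 1.0; // Assumption: treat empty input as "not compressible".
    }
    return strlen(gzencode($text)) / strlen($text);
}

// $threshold defaults to the roughly 0.6 suggested by the samples above.
function looksRepetitive(string $text, float $threshold = 0.6): bool {
    return compressionRatio($text) < $threshold;
}
```

Keep in mind that gzip adds a fixed header and trailer, so very short texts can come out with a ratio above 1 no matter what they contain. The check only makes sense once a minimum length is already enforced.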
I have a text file that has about 5,000 lines with each line being about 200 characters long. Each line actually contains 6 different pieces of data that I've been using substr() to break apart. For example, on each line, characters 0 - 10 contain the Client#, characters 10-20 contain the Matter#, etc. This is all well and good and was running faster than I even needed it to.
My problems arose when I was told by my boss that the client number has 4 leading zeros and they need to be stripped off. So I thought, no problem - I just changed my first substr() function from substr(0, 10) (start at 0 and take 10 characters) and changed it to substr(4, 6) (starting at the 4th character and just taking 6) which will skip the 4 leading zeros and I'll be good to go.
However, when I changed the substr(0, 10) to substr(4,6) the process grinds to a halt and takes forever to complete. Why is this?
Here is a snippet from my code:
// open the file
$file_matters = fopen($varStoredIn_matters,"r") or exit("Unable to open file!");
// run until the end of the file
while(!feof($file_matters))
{
    // place current line in temp variable
    $tempLine_matters = fgets($file_matters);
    // increment the matters line count
    $linecount_matters++;
    // break up each column
    $clientID = trim(substr($tempLine_matters, 0, 10)); // THIS ONE WORKS FINE
    //$clientID = trim(substr($tempLine_matters, 4, 6)); // THIS ONE MAKES THE PROCESS GRIND TO A HALT!!
    $matterID = trim(substr($tempLine_matters, 10, 10));
    //$matterID = trim(substr($tempLine_matters, 15, 5));
    $matterName = trim(substr($tempLine_matters, 20, 80));
    $subMatterName = trim(substr($tempLine_matters, 100, 80));
    $dateOpen = trim(substr($tempLine_matters, 180, 10));
    $orgAttorney = trim(substr($tempLine_matters, 190, 3));
    $bilAttorney = trim(substr($tempLine_matters, 193, 3));
    $resAttorney = trim(substr($tempLine_matters, 196, 3));
    //$tolCode = trim(substr($tempLine_matters, 200, 3));
    $tolCode = trim(substr($tempLine_matters, 200, 3));
    $dateClosed = trim(substr($tempLine_matters, 203, 10));
    // just does an insert into the DB using the variables above
}
I can't see why that would be so much slower, but you could take a look at unpack which could extract your fixed width record in one hit:
$fields = unpack('A10client/A10matter/A60name ...etc... ',$tempLine_matters);
I did a quick benchmark using a similar record pattern to your example and found unpack was over twice as fast as using 10 substr calls in each iteration.
I'd suggest profiling your code with xdebug to see where the difference really lies.
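Spelled out for the first eight columns of the question, the unpack approach might look like this. Treat it as a sketch: the field widths are read off the substr() offsets in the question, the function name parseMatterLine is made up, and the remaining columns (tolCode, dateClosed) are left out.

```php
<?php
// Parse one fixed-width record; returns null if the line is too short.
function parseMatterLine(string $line): ?array {
    // 10 + 10 + 80 + 80 + 10 + 3 + 3 + 3 = 199 bytes in total.
    // "A" reads a space-padded string and strips trailing whitespace.
    $format = 'A10client/A10matter/A80matterName/A80subMatterName/'
            . 'A10dateOpen/A3orgAttorney/A3bilAttorney/A3resAttorney';
    if (strlen($line) < 199) {
        return null;
    }
    $fields = unpack($format, $line);
    // Drop the four leading zeros from the client number.
    $fields['client'] = substr($fields['client'], 4);
    return $fields;
}
```

Stripping the zeros after the record has been split keeps the format string aligned with the file layout, so the other offsets do not have to change.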
This is not a very optimized process, and it may be worth rethinking it.
But if it works right now, that is the most important thing.
It might be faster to extract the value in two steps. For example:
$clientID_bis = trim(substr($tempLine_matters, 0, 10));
$clientID = trim(substr($clientID_bis, 4, 6));