I am looking for the most efficient algorithm in PHP to check if a string was made from dictionary words only or not.
Example:
thissentencewasmadefromenglishwords
thisonecontainsyxxxyxsomegarbagexaatoo
pure
thisisalsobadxyyyaazzz
Output:
thissentencewasmadefromenglishwords
pure
a.txt contains the dictionary words.
b.txt contains the strings: one per line, without spaces, made from a..z chars only.
Another way to do this is to employ the Aho-Corasick string matching algorithm. The basic idea is to read in your dictionary of words and from that create the Aho-Corasick tree structure. Then, you run each string you want to split into words through the search function.
The beauty of this approach is that creating the tree is a one time cost. You can then use it for all of the strings you're testing. The search function runs in O(n) (n being the length of the string), plus the number of matches found. It's really quite efficient.
Output from the search function will be a list of string matches, telling you which words match at what positions.
The Wikipedia article does not give a great explanation of the Aho-Corasick algorithm. I prefer the original paper, which is quite approachable. See Efficient String Matching: An Aid to Bibliographic Search.
So, for example, given your first string:
thissentencewasmadefromenglishwords
You would get (in part):
this, 0
his, 1
sent, 4
ten, 7
etc.
Now, sort the list of matches by position. It will be almost sorted when you get it from the string matcher, but not quite.
Once the list is sorted by position, first make sure that there is a match at position 0. If there is not, then the string fails the test. If there is (and there might be multiple matches at position 0), add the length of the matched word to its position and see if there's a match starting at that new position. Add that match's length and see if there's a match at the next position, and so on until you reach the end of the string. When a position has several matches, you may have to try each of them before giving up.
If the strings you're testing aren't very long, then you can use a brute force algorithm like that. It would be more efficient, though, to construct a hash map of the matches, indexed by position. Of course, there could be multiple matches for a particular position, so you have to take that into account. But looking to see if there is a match at a position would be very fast.
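The position-indexed lookup described above could be sketched in PHP roughly as follows. This is a minimal sketch, assuming the matcher's output has already been collected into a list of [word, position] pairs; the function name isFullyCovered and that input shape are assumptions made for illustration, not the API of any particular Aho-Corasick library. It indexes match lengths by start position and then tracks every position reachable from 0, so multiple matches at one position are all followed.

```php
<?php
// Hypothetical helper: $matches is a list of [word, startPosition] pairs,
// e.g. as collected from some Aho-Corasick search (input shape is assumed).
function isFullyCovered(string $text, array $matches): bool
{
    $length = strlen($text);

    // Hash map: start position => list of match lengths at that position.
    $lengthsAt = [];
    foreach ($matches as [$word, $pos]) {
        $lengthsAt[$pos][] = strlen($word);
    }

    // $reachable[$i] is set when the first $i characters split into matches.
    $reachable = [0 => true];
    for ($i = 0; $i < $length; $i++) {
        if (empty($reachable[$i]) || empty($lengthsAt[$i])) {
            continue; // nothing ends here, or nothing starts here
        }
        foreach ($lengthsAt[$i] as $len) {
            $reachable[$i + $len] = true;
        }
    }

    // The string passes only if the end of the string is reachable.
    return !empty($reachable[$length]);
}
```

Because every reachable position is recorded, the list of matches does not need to be sorted first.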
It's some work, of course, to implement the Aho-Corasick algorithm. A quick Google search shows that there are PHP implementations available. How well they work, I don't know.
In the average case, this should be very quick. Again, it depends on how long your strings are. But you're helped by there being relatively few matches at any one position. You could probably construct strings that would exhibit pathologically poor runtimes, but you'd probably have to try real hard. And again, even a pathological case isn't going to take too terribly long if the string is short.
This is a problem that can be solved using dynamic programming, based on the following formulas:
f(0) = true
f(i) = OR { f(i-j) AND Dictionary.contains(s.substring(i-j, i)) } for each j = 1,...,i
Here f(i) is true when the first i characters of s can be split into dictionary words. First, load your file into a dictionary, then use the DP solution for the above formula.
Pseudo code is something like: (Hope I have no "off by one" for indices..)
check(word):
    f = new boolean[word.length() + 1]
    f[0] = true
    for i from 1 to word.length():
        f[i] = false
        for j from 0 to i-1:
            if f[j] AND dictionary.contains(word.substring(j, i)):
                f[i] = true
    return f[word.length()]
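Since the question asks for PHP, the recurrence above might translate as follows. This is a sketch under assumptions: the function name canSegment is made up for illustration, the dictionary is passed in as a plain list of words, and the inner loop is limited to the longest dictionary word as a small optimization that the formula itself does not require.

```php
<?php
// Returns true when $word can be split entirely into dictionary words.
// $dictionary is a plain list of words; it is flipped into a hash set so
// that each lookup is a single isset() call.
function canSegment(string $word, array $dictionary): bool
{
    $set = array_flip($dictionary);
    $n = strlen($word);

    // No candidate piece can be longer than the longest dictionary word.
    $maxLen = 0;
    foreach ($dictionary as $entry) {
        $maxLen = max($maxLen, strlen($entry));
    }

    // $f[$i] is true when the first $i characters can be segmented.
    $f = array_fill(0, $n + 1, false);
    $f[0] = true;
    for ($i = 1; $i <= $n; $i++) {
        for ($j = max(0, $i - $maxLen); $j < $i; $j++) {
            if ($f[$j] && isset($set[substr($word, $j, $i - $j)])) {
                $f[$i] = true;
                break;
            }
        }
    }
    return $f[$n];
}
```

The dictionary could be loaded with something like file('a.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES), after which each line of b.txt is passed to the function in turn.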
I recommend a recursive approach. Something like this:
<?php
$wordsToCheck = array(
'otherword',
'word1andother',
'word1',
'word1word2',
'word1word3',
'word1word2word3'
);
$wordList = array(
'word1',
'word2',
'word3'
);
$results = array();
function onlyListedWords($word, $wordList) {
    if (in_array($word, $wordList)) {
        return true;
    }
    $length = strlen($word);
    $part = '';
    // Try every proper prefix that is a listed word. If the remainder
    // cannot be split, keep going and try a longer prefix.
    for ($i = 0; $i < $length - 1; $i++) {
        $part .= $word[$i];
        if (in_array($part, $wordList)
            && onlyListedWords(substr($word, $i + 1), $wordList)) {
            return true;
        }
    }
    return false;
}
foreach ($wordsToCheck as $word) {
if (onlyListedWords($word, $wordList))
$results[] = $word;
}
var_dump($results);
?>
I have a range of whole numbers that might or might not have some numbers missing. Is it possible to find the smallest missing number without using a loop structure? If there are no missing numbers, the function should return the maximum value of the range plus one.
This is how I solved it using a for loop:
$range = [0,1,2,3,4,6,7];
// sort just in case the range is not in order
asort($range);
$range = array_values($range);
$first = true;
for ($x = 0; $x < count($range); $x++)
{
    // don't check the first element
    if ( ! $first )
    {
        if ( $range[$x - 1] + 1 !== $range[$x])
        {
            echo $range[$x - 1] + 1;
            break;
        }
    }
    // if we're on the last element, there are no missing numbers
    if ($x + 1 === count($range))
    {
        echo $range[$x] + 1;
    }
    $first = false;
}
Ideally, I'd like to avoid looping completely, as the range can be massive. Any suggestions?
Algo solution
There is a way to check whether a number is missing using simple arithmetic. It's explained here. Basically, to add the numbers from 1 to 100 we don't need to sum them one by one; we just compute (100 * (100 + 1)) / 2. So how is this going to solve our issue?
We take the first and the last element of the array and calculate the expected sum with this formula. We then use array_sum() to calculate the actual sum. If the results are the same, then there is no missing number. Otherwise we can "backtrack" the missing number by subtracting the actual sum from the expected one. This of course only works if exactly one number is missing and will fail if several are missing. So let's put this in code:
$range = range(0,7); // Creating an array
echo check($range) . "\r\n"; // check
unset($range[3]); // unset offset 3
echo check($range); // check
function check($array){
    if($array[0] == 0){
        unset($array[0]); // get rid of the zero
    }
    sort($array); // sorting
    $first = reset($array); // get the first value
    $last = end($array); // get the last value
    // expected sum of first..last: number of terms * (first + last) / 2
    $sum = (($last - $first + 1) * ($first + $last)) / 2;
    $actual_sum = array_sum($array); // the actual sum
    if($sum == $actual_sum){
        return $last + 1; // no missing number
    }else{
        return $sum - $actual_sum; // missing number
    }
}
Output
8
3
If there are several numbers missing, then just use array_map() or something similar to do an internal loop.
Regex solution
Let's take this to a new level and use regex ! I know it's nonsense, and it shouldn't be used in real world application. The goal is to show the true power of regex :)
So first let's make a string out of our range in the following format: I,II,III,IIII for the range 1 to 4.
$range = range(0,7);
if($range[0] === 0){ // get rid of 0
unset($range[0]);
}
$str = implode(',', array_map(function($val){return str_repeat('I', $val);}, $range));
echo $str;
The output should be something like: I,II,III,IIII,IIIII,IIIIII,IIIIIII.
I've come up with the following regex: ^(?=(I+))(^\1|,\2I|\2I)+$. So what does this mean ?
^ # match begin of string
(?= # positive lookahead, we use this to not "eat" the match
(I+) # match I one or more times and put it in group 1
) # end of lookahead
( # start matching group 2
^\1 # match begin of string followed by what's matched in group 1
| # or
,\2I # match a comma, with what's matched in group 2 (recursive !) and an I
| # or
\2I # match what's matched in group 2 and an I
)+ # repeat one or more times
$ # match end of line
Let's see what's actually happening ....
I,II,III,IIII,IIIII,IIIIII,IIIIIII
^
(I+) do not eat but match I and put it in group 1
I,II,III,IIII,IIIII,IIIIII,IIIIIII
^
^\1 match what was matched in group 1, which means I gets matched
I,II,III,IIII,IIIII,IIIIII,IIIIIII
^^^ ,\2I match a comma plus what was matched in group 2 (one I in this case) and add an I to it
I,II,III,IIII,IIIII,IIIIII,IIIIIII
^^^^ \2I match what was matched previously in group 2 (,II in this case) and add an I to it
I,II,III,IIII,IIIII,IIIIII,IIIIIII
^^^^^ \2I match what was matched previously in group 2 (,III in this case) and add an I to it
We're moving forward since there is a + sign which means match one or more times,
this is actually a recursive regex.
We put the $ to make sure it's the end of string
If the number of I's don't correspond, then the regex will fail.
See it working and failing. And Let's put it in PHP code:
$range = range(0,7);
if($range[0] === 0){
unset($range[0]);
}
$str = implode(',', array_map(function($val){return str_repeat('I', $val);}, $range));
if(preg_match('#^(?=(I*))(^\1|,\2I|\2I)+$#', $str)){
echo 'works !';
}else{
echo 'fails !';
}
Now let's also return the number that's missing. We remove the $ end anchor so that the regex does not fail, and we use group 2 to recover the missing number:
$range = range(0,7);
if($range[0] === 0){
unset($range[0]);
}
unset($range[2]); // remove 2
$str = implode(',', array_map(function($val){return str_repeat('I', $val);}, $range));
preg_match('#^(?=(I*))(^\1|,\2I|\2I)+#', $str, $m); // REGEEEEEX !!!
$n = substr_count($m[2], 'I'); // number of I's in the last repetition of group 2
if($n == max($range)){
    echo $n + 1; // matched all the way to the end: no missing number
}else{
    echo $n; // the match stopped at the missing number
}
EDIT: NOTE
This question is about performance. Functions like array_diff and array_filter are not magically fast. They can add a huge time penalty. Replacing a loop in your code with a call to array_diff will not magically make things fast, and will probably make things slower. You need to understand how these functions work if you intend to use them to speed up your code.
This answer uses the assumption that no items are duplicated and no invalid elements exist to allow us to use the position of the element to infer its expected value.
This answer is theoretically the fastest possible solution if you start with a sorted list. The solution posted by Jack is theoretically the fastest if sorting is required.
In the series [0,1,2,3,4,...], the n'th element has the value n if no elements before it are missing. So we can spot-check at any point to see if our missing element is before or after the element in question.
So you start by cutting the list in half and checking to see if the item at position x = x
[ 0 | 1 | 2 | 3 | 4 | 5 | 7 | 8 | 9 ]
^
Yup, list[4] == 4. So move halfway from your current point to the end of the list.
[ 0 | 1 | 2 | 3 | 4 | 5 | 7 | 8 | 9 ]
^
Uh-oh, list[6] == 7. So somewhere between our last checkpoint and the current one, one element was missing. Divide the difference in half and check that element:
[ 0 | 1 | 2 | 3 | 4 | 5 | 7 | 8 | 9 ]
^
In this case, list[5] == 5
So we're good there. So we take half the distance between our current check and the last one that was abnormal. And oh.. it looks like cell n+1 is one we already checked. We know that list[6]==7 and list[5]==5, so the element number 6 is the one that's missing.
Since each step divides the number of elements to consider in half, you know that your worst-case performance is going to check no more than log2 of the total list size. That is, this is an O(log(n)) solution.
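The halving procedure walked through above could be written in PHP along these lines. This is a minimal sketch: the function name firstMissing is an assumption, and it relies on the same preconditions the answer states (a sorted, 0-indexed list starting at 0 with no duplicates).

```php
<?php
// Assumes $list is sorted ascending, 0-indexed, starts at 0 and has no
// duplicates, so that $list[$i] === $i holds up to the first gap.
function firstMissing(array $list): int
{
    $lo = 0;             // every index below $lo is known to be in place
    $hi = count($list);  // indexes from $hi onward are known to be shifted
    while ($lo < $hi) {
        $mid = intdiv($lo + $hi, 2);
        if ($list[$mid] === $mid) {
            $lo = $mid + 1; // no gap up to and including $mid
        } else {
            $hi = $mid;     // the gap is at or before $mid
        }
    }
    // When nothing is missing this is count($list), i.e. maximum plus one.
    return $lo;
}
```

For the example list [0, 1, 2, 3, 4, 5, 7, 8, 9] it probes indexes 4, 6 and 5, the same checkpoints as in the walkthrough, and settles on 6.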
If this whole arrangement looks familiar, it's because you learned it back in your second year of college in a Computer Science class. It's a minor variation on the binary search algorithm, one of the most widely used index schemes in the industry. Indeed, this question appears to be a perfectly contrived application for this searching technique.
You can of course repeat the operation to find additional missing elements, but since you've already tested the values at key elements in the list, you can avoid re-checking most of the list and go straight to the interesting ones left to test.
Also note that this solution assumes a sorted list. If the list isn't sorted then obviously you sort it first. Except, binary searching has some notable properties in common with quicksort. It's quite possible that you can combine the process of sorting with the process of finding the missing element and do both in a single operation, saving yourself some time.
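One way the "combine sorting with searching" idea might look is a quicksort-style partition that only keeps the half which can still contain the gap. This is a sketch of that idea rather than something taken from the answer: the function name smallestMissingUnsorted is made up, and it assumes distinct, non-negative integers in any order.

```php
<?php
// Assumes distinct, non-negative integers in any order (no sorting needed).
function smallestMissingUnsorted(array $values): int
{
    $lo = 0; // every integer below $lo is known to be present
    while ($values) {
        // Split the remaining values around a pivot value above $lo.
        $pivot = $lo + intdiv(count($values), 2) + 1;
        $left = [];
        $right = [];
        foreach ($values as $v) {
            if ($v < $pivot) {
                $left[] = $v;
            } else {
                $right[] = $v;
            }
        }
        if (count($left) === $pivot - $lo) {
            // [$lo, $pivot) is completely filled, so the gap is further up.
            $lo = $pivot;
            $values = $right;
        } else {
            // Something below $pivot is missing; discard the upper part.
            $values = $left;
        }
    }
    return $lo;
}
```

Each round keeps at most about half of the remaining values, so the total work stays linear in the size of the input, at the cost of the temporary arrays.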
Finally, to sum up the list, that's just a stupid math trick thrown in for good measure. The sum of a list of numbers from 1 to N is just N*(N+1)/2. And if you've already determined that any elements are missing, then obviously just subtract the missing ones.
Technically, you can't really do without the loop (unless you only want to know if there's a missing number). However, you can accomplish this without first sorting the array.
The following algorithm uses O(n) time with O(n) space:
$range = [0, 1, 2, 3, 4, 6, 7];
$N = count($range);
$temp = str_repeat('0', $N); // assume all values are out of place
foreach ($range as $value) {
if ($value < $N) {
$temp[$value] = 1; // value is in the right place
}
}
// count number of leading ones
echo strspn($temp, '1'), PHP_EOL;
It builds an ordered identity map of N entries, marking each value against its position as "1"; in the end all entries must be "1", and the first "0" entry is the smallest value that's missing.
Btw, I'm using a temporary string instead of an array to reduce physical memory requirements.
I honestly don't get why you wouldn't want to use a loop. There's nothing wrong with loops. They're fast, and you simply can't do without them. However, in your case, there is a way to avoid having to write your own loops, using PHP core functions. They do loop over the array, though, but you simply can't avoid that.
Anyway, I gather what you're after, can easily be written in 3 lines:
function highestPlus(array $in)
{
    $compare = range(min($in), max($in));
    $diff = array_diff($compare, $in);
    // array_diff() preserves keys, so take the first element with reset()
    return empty($diff) ? max($in) + 1 : reset($diff);
}
Tested with:
echo highestPlus(range(0,11));//echoes 12
$arr = array(9,3,4,1,2,5);
echo highestPlus($arr);//echoes 6
And now, to shamelessly steal Pé de Leão's answer (but "augment" it to do exactly what you want):
function highestPlus(array $range)
{//an unreadable one-liner... horrid, so don't, but know that you can...
    // range() runs to max+1, so the diff is never empty
    return min(array_diff(range(0, max($range)+1), $range));
}
How it works:
$compare = range(min($in), max($in));//range(lowest value in array, highest value in array)
$diff = array_diff($compare, $in);//get all values present in $compare, that aren't in $in
return empty($diff) ? max($in) + 1 : reset($diff);
//-------------------------------------------------
// read as:
if (empty($diff))
{//every number in min-max range was found in $in, return highest value +1
    return max($in) + 1;
}
//there were numbers in min-max range, not present in $in, return first missing number
//(array_diff preserves keys, so use reset() rather than $diff[0]):
return reset($diff);
That's it, really.
Of course, if the supplied array might contain null or falsy values, or even strings, and duplicate values, it might be useful to "clean" the input a bit:
function highestPlus(array $in)
{
    $clean = array_filter(
        $in,
        'is_numeric'//or even is_int
    );
    $compare = range(min($clean), max($clean));
    $diff = array_diff($compare, $clean);//duplicates aren't an issue here
    return empty($diff) ? max($clean) + 1 : reset($diff);
}
Useful links:
The array_diff man page
The max and min functions
Good Ol' range, of course...
The array_filter function
The array_map function might be worth a look
Just as array_sum might be
$range = array(0,1,2,3,4,6,7);
// sort just in case the range is not in order
asort($range);
$range = array_values($range);
$indexes = array_keys($range);
$diff = array_diff($indexes,$range);
echo reset($diff); // >> will print: 5 (array_diff keeps keys, so $diff[0] is not set here)
// if $diff is an empty array - you can print
// the "maximum value of the range plus one": $range[count($range)-1]+1
echo min(array_diff(range(0, max($range)+1), $range));
Simple
$array1 = array(0,1,2,3,4,5,6,7);// array with actual number series
$array2 = array(0,1,2,4,6,7); // array with your custom number series
$missing = array_diff($array1,$array2);
sort($missing);
echo $missing[0];
$range = array(0,1,2,3,4,6,7);
$max=max($range);
$expected_total=($max*($max+1))/2; // sum if no number was missing.
$actual_total=array_sum($range); // sum of the input array.
if($expected_total==$actual_total){
    echo $max+1; // no difference, so no number is missing: echo max + 1.
}else{
    echo $expected_total-$actual_total; // the difference will be the missing number.
}
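You can use array_diff() like this:
<?php
$range = array("0","1","2","3","4","6","7","9");
sort($range); // sort() reindexes, so $range[$len-1] is the largest value
$len = count($range);
if ($range[$len-1] == $len-1) {
    $r = $range[$len-1] + 1; // nothing missing: maximum plus one
}
else {
    $ref = range(0, $len-1);
    $result = array_diff($ref, $range);
    $r = min($result); // smallest missing number
}
echo $r;
?>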
function missing( $v ) {
static $p = -1;
$d = $v - $p - 1;
$p = $v;
return $d?1:0;
}
$result = array_search( 1, array_map( "missing", $ARRAY_TO_TEST ) );
I'm attempting to write a regex in PHP that validates the following:
At least 10 chars
Has at least 2 Upper-case characters
Has at least 2 Numbers OR Symbols
I've looked at just about every reference I can find but, to no avail.
I guess I can test individually, but that makes me very sad :(
Can someone please help? (And send me to a spot where I can learn in plain English Reg Ex?)
This picture is worth more than 1000 words
(and that's a lot of entropy)
(image via XKCD)
With this in mind you might want to consider dropping rules 2 & 3 if password length is higher than X (say.. 20) or increase the minimum to at least 16 characters (as the only rule).
As for your requirement:
As opposed to having one big, ugly, hard-to-maintain, advanced RegExp you might want to break the problem in smaller parts and tackle each bit separately using dedicated functions.
For this you could look at ctype_* functions, count_chars() and MultiByte String Functions.
Now the ugly:
This advanced RegEx will return true or false according to your rules:
preg_match('/^(?=.{10,}$)(?=.*?[A-Z].*?[A-Z])(?=.*?([\x20-\x40\x5b-\x60\x7b-\x7e\x80-\xbf]).*?(?1).*?$).*$/',$string);
Test demo here: http://regex101.com/r/qE9eB2
1st part (LookAhead) : (?=.{10,}$) will check string length and continue if it has at least 10 characters. You could drop this and do a check with strlen() or even better mb_strlen().
2nd part (also a LookAhead): (?=.*?[A-Z].*?[A-Z]) will check for the presence of 2 UPPERCASE characters. You could also do a $upper=preg_replace('/[^A-Z]/','',$string) instead and check that $upper contains at least two characters.
3rd LookAhead uses a character class: [\x20-\x40\x5b-\x60\x7b-\x7e\x80-\xbf] with hex escaped character ranges for digits and common symbols (pretty much all the symbols one could find on an average keyboard). You could also do a $sym=preg_replace('/[a-zA-Z]/','',$string) instead and check that $sym contains at least two characters. Note: to make it shorter I used a subroutine call (?1) so as not to repeat the same character class again.
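Put together, the three separate checks might look something like this. It is only a sketch: the function name isValidPassword is hypothetical, it assumes single-byte (ASCII) input, and it treats anything that is neither a letter nor whitespace as a "number or symbol", which is an interpretation of rule 3 rather than something the question spells out.

```php
<?php
// Hypothetical helper; assumes ASCII input. preg_match_all() returns the
// number of matches it found, which is all that is needed here.
function isValidPassword(string $password): bool
{
    // Rule 1: at least 10 characters.
    if (strlen($password) < 10) {
        return false;
    }
    // Rule 2: at least 2 upper-case letters.
    $upper = preg_match_all('/[A-Z]/', $password);
    // Rule 3: at least 2 digits or symbols (anything but letters/whitespace).
    $digitsOrSymbols = preg_match_all('/[^a-zA-Z\s]/', $password);

    return $upper >= 2 && $digitsOrSymbols >= 2;
}
```

Keeping each rule on its own line makes it easy to change a threshold or report which rule failed, which a single combined pattern cannot do.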
For learning, the most comprehensive RegExp reference I know of is: regular-expressions.info
You can use lookaheads to make sure that what you are looking for is contained appropriately.
/(?=.*[A-Z].*[A-Z])(?=.*[^a-zA-Z].*[^a-zA-Z]).{10,}/
I have always preferred good old procedural code for handling stuff like this. Regular expressions can be useful but they can also be a little cumbersome, especially for code maintenance and quick scanning (regular expressions are not exactly examples of readability).
function strContains($string, $contains, $n = 1, $exact = false) {
    $length = strlen($string);
    $tally = 0;
    // count the characters of $string that appear in $contains
    for ($i = 0; $i < $length; $i++) {
        if (strpos($contains, $string[$i]) !== false) {
            $tally++;
        }
    }
    return ($exact ? $tally == $n : $tally >= $n);
}
function validPassword($password) {
    if (strlen($password) < 10) {
        return false;
    }
    $upperChars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $upperCount = 2;
    if (strContains($password, $upperChars, $upperCount) === false) {
        return false;
    }
    $numSymChars = '0123456789!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~';
    $numSymCount = 2;
    if (strContains($password, $numSymChars, $numSymCount) === false) {
        return false;
    }
    return true;
}
Let's say we have a string: "abcbcdcde"
I want to identify all substrings that are repeated in this string using regex (i.e. no brute-force iterative loops).
For the above string, the result set would be: {"b", "bc", "c", "cd", "d"}
I must confess that my regex is far more rusty than it should be for someone with my experience. I tried using a backreference, but that'll only match consecutive duplicates. I need to match all duplicates, consecutive or otherwise.
In other words, I want to match any character(s) that appears for the >= 2nd time. If a substring occurs 5 times, then I want to capture each of occurrences 2-5. Make sense?
This is my pathetic attempt thus far:
preg_match_all( '/(.+)(.*)\1+/', $string, $matches ); // Way off!
I tried playing with look-aheads but I'm just butchering it. I'm doing this in PHP (PCRE) but the problem is more or less language-agnostic. It's a bit embarrassing that I'm finding myself stumped on this.
Your problem is recursi ... you know what, forget about recursion! =p it wouldn't really work well in PHP and the algorithm is pretty clear without it as well.
function find_repeating_sequences($s)
{
    $res = array();
    while ($s) {
        $i = 1; $pat = $s[0];
        while (false !== strpos($s, $pat, $i)) {
            $res[$pat] = 1;
            // expand pattern and try again
            $pat .= $s[$i++];
        }
        // move the string forward
        $s = substr($s, 1);
    }
    return array_keys($res);
}
Out of interest, I wrote Tim's answer in PHP as well:
function find_repeating_sequences_re($s)
{
$res = array();
preg_match_all('/(?=(.+).*\1)/', $s, $matches);
foreach ($matches[1] as $match) {
$length = strlen($match);
if ($length > 1) {
for ($i = 0; $i < $length; ++$i) {
for ($j = $i; $j < $length; ++$j) {
$res[substr($match, $i, $j - $i + 1)] = 1;
}
}
} else {
$res[$match] = 1;
}
}
return array_keys($res);
}
I've let them fight it out in a small benchmark of 800 bytes of random data:
$data = base64_encode(openssl_random_pseudo_bytes(600));
Each code is run for 10 rounds and the execution time is measured. The results?
Pure PHP - 0.014s (10 runs)
PCRE - 40.86s <-- ouch!
It gets weirder when you look at 24k bytes (or anything above 1k really):
Pure PHP - 4.565s (10 runs)
PCRE - 0.232s <-- WAT?!
It turns out that the regular expression broke down after 1k characters and so the $matches array was empty. These are my .ini settings:
pcre.backtrack_limit => 1000000 => 1000000
pcre.recursion_limit => 100000 => 100000
It's not clear to me how a backtrack or recursion limit would have been hit after only 1k of characters. But even if those settings are "fixed" somehow, the results are still obvious, PCRE doesn't seem to be the answer.
I suppose writing this in C would speed it up somewhat, but I'm not sure to what degree.
Update
With some help from hakre's answer I put together an improved version that increases performance by ~18% after optimizing the following:
Remove the substr() calls in the outer loop to advance the string pointer; this was a left over from my previous recursive incarnations.
Use the partial results as a positive cache to skip strpos() calls inside the inner loop.
And here it is, in all its glory (:
function find_repeating_sequences3($s)
{
$res = array();
$p = 0;
$len = strlen($s);
while ($p != $len) {
$pat = $s[$p]; $i = ++$p;
while ($i != $len) {
if (!isset($res[$pat])) {
if (false === strpos($s, $pat, $i)) {
break;
}
$res[$pat] = 1;
}
// expand pattern and try again
$pat .= $s[$i++];
}
}
return array_keys($res);
}
You can't get the required result in a single regex because a regex will match either greedily (finding bc...bc) or lazily (finding b...b and c...c), but never both. (In your case, it does find c...c, but only because c is repeated twice.)
But once you've found a repeated substring of length > 1, it logically follows that all the smaller "substrings of that substring" must also be repeated. If you want to get them spelled out for you, you need to do this separately.
Taking your example (using Python because I don't know PHP):
>>> results = set(m.group(1) for m in re.finditer(r"(?=(.+).*\1)", "abcbcdcde"))
>>> results
{'d', 'cd', 'bc', 'c'}
You could then go and apply the following function to each of your results:
def substrings(s):
    return [s[start:stop] for start in range(len(s))
                          for stop in range(start+1, len(s)+1)]
For example:
>>> substrings("123456")
['1', '12', '123', '1234', '12345', '123456', '2', '23', '234', '2345', '23456',
'3', '34', '345', '3456', '4', '45', '456', '5', '56', '6']
The closest I can get is /(?=(.+).*\1)/
The purpose of the lookahead is to allow the same characters to be matched more than once (for instance, c and cd). However, for some reason it doesn't seem to be getting the b...
Interesting question. I basically took the function in Jack's answer and tried to see whether the number of tests could be reduced.
I first tried to search only half the string, but it turned out that creating the pattern to search for via substr each time was way too expensive. The way it is done in Jack's answer, appending one character per iteration, looks much better. Then I ran out of time, so I could not look further into it.
However, while looking for such an alternative implementation, I at least found that some of the differences in the algorithm I had in mind could be applied to Jack's function as well:
There is no need to cut the beginning of the string in each outer iteration, as the search is already done with offsets.
If the rest of the subject to search for a repetition is shorter than the repetition needle, you do not need to search for the needle.
If the needle was already searched for, you don't need to search again.
Note: This is a memory trade-off. If you have many repetitions, you will use similar memory. However, if you have few repetitions, then this variant uses more memory than before.
The function:
function find_repeating_sequences($string) {
$result = array();
$start = 0;
$max = strlen($string);
while ($start < $max) {
$pat = $string[$start];
$i = ++$start;
while ($max - $i > 0) {
$found = isset($result[$pat]) ? $result[$pat] : false !== strpos($string, $pat, $i);
if (!$result[$pat] = $found) break;
// expand pattern and try again
$pat .= $string[$i++];
}
}
return array_keys(array_filter($result));
}
So just see this as an addition to Jack's answer.