Performance of tokenizing CSS in PHP


This is a noob question from someone who has never written a parser/lexer before.
I'm writing a tokenizer/parser for CSS in PHP (please don't reply with 'OMG, why in PHP?'). The syntax is written down neatly by the W3C here (CSS2.1) and here (CSS3, draft).
It's a list of 21 possible tokens, all but two of which cannot be represented as static strings.
My current approach is to loop over an array containing the 21 patterns again and again, do an if (preg_match()) and reduce the source string match by match. In principle this works really well. However, for a 1000-line CSS string this takes somewhere between 2 and 8 seconds, which is too much for my project.
Now I'm banging my head over how other parsers tokenize and parse CSS in fractions of a second. OK, C is always faster than PHP, but nonetheless, are there any obvious D'Oh!s that I fell into?
I made some optimizations, like checking for '@', '#' or '"' as the first char of the remaining string and applying only the relevant regexp then, but this hasn't brought any great performance boost.
My code (snippet) so far:
$TOKENS = array(
'IDENT' => '...regexp...',
'ATKEYWORD' => '@...regexp...',
'String' => '"...regexp..."|\'...regexp...\'',
//...
);
$string = '...CSS source string...';
$stream = array();
// we reduce $string token by token
while ($string != '') {
$string = ltrim($string, " \t\r\n\f"); // unconsumed whitespace at the
// start is insignificant but doing a trim reduces exec time by 25%
$matches = array();
// loop through all possible tokens
foreach ($TOKENS as $t => $p) {
// The '&' is used as delimiter, because it isn't used anywhere in
// the token regexps
if (preg_match('&^'.$p.'&Su', $string, $matches)) {
$stream[] = array($t, $matches[0]);
$string = substr($string, strlen($matches[0]));
// Yay! We found one that matches!
continue 2;
}
}
// if we come here, we have a syntax error and handle it somehow
}
// result: an array $stream consisting of arrays with
// 0 => type of token
// 1 => token content

Use a lexer generator.

The first thing I would do is get rid of preg_match(). Basic string functions such as strpos() are much faster, but I don't think you even need that. It looks like you are looking for a specific token at the front of the string with preg_match() and then cutting that many characters off the front. For tokens that are fixed strings (this does not work for tokens that need a real pattern), you could accomplish the same thing with a simple substr() comparison, like this:
foreach ($TOKENS as $t => $p)
{
    $len = strlen($p); // this could be pre-stored in $TOKENS
    $front = substr($string, 0, $len);
    if ($front == $p) {
        // store the matched token text, not the whole remaining string
        $stream[] = array($t, $front);
        $string = substr($string, $len);
        // Yay! We found one that matches!
        continue 2;
    }
}
You could further optimize that by pre-calculating the length of all your tokens and storing them in the $TOKENS array, so that you don't have to call strlen() all the time. If you sorted $TOKENS into groups by length, you could reduce the number of substr() calls further as well, as you could take a substr($string) of the current string being analyzed just once for each token length, and run through all the tokens of that length before moving on to the next group of tokens.

The (probably) faster (but less memory-friendly) approach would be to tokenize the whole stream at once, using one big regexp with an alternative for each token, like
preg_match_all('/
(...string...)
|
(@ident)
|
(#ident)
...etc
/x', $stream, $tokens);
foreach($tokens as $token)...parse
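As a rough sketch of this single-regex idea: the patterns below are hypothetical, heavily simplified stand-ins for the W3C token definitions, and tokenizeAll() is an illustrative name, not something from the question. Each alternative is wrapped in a named group so the token type can be read off the match, and the catch-all DELIM alternative is there so preg_match_all() does not silently skip input it cannot classify.

```php
<?php
// Hypothetical, heavily simplified patterns; NOT the full W3C token grammar.
// Order matters: the first alternative that matches at a position wins.
$tokenPatterns = [
    'S'         => '[ \t\r\n\f]+',
    'STRING'    => '"[^"]*"',
    'ATKEYWORD' => '@[a-zA-Z_][a-zA-Z0-9_-]*',
    'HASH'      => '#[a-zA-Z0-9_-]+',
    'NUMBER'    => '[0-9]*\.?[0-9]+',
    'IDENT'     => '-?[a-zA-Z_][a-zA-Z0-9_-]*',
    'DELIM'     => '.', // catch-all so no input is silently skipped
];

function tokenizeAll(string $css, array $tokenPatterns): array
{
    // Build one alternation with a named group per token type.
    $parts = [];
    foreach ($tokenPatterns as $name => $pattern) {
        $parts[] = "(?<$name>$pattern)";
    }
    $regex = '~' . implode('|', $parts) . '~su';

    if (preg_match_all($regex, $css, $sets, PREG_SET_ORDER) === false) {
        throw new RuntimeException('Regex error (for example invalid UTF-8 input)');
    }

    $tokens = [];
    foreach ($sets as $set) {
        // Exactly one named group is non-empty per match; that is the token type.
        foreach ($tokenPatterns as $name => $unused) {
            if (isset($set[$name]) && $set[$name] !== '') {
                if ($name !== 'S') { // drop whitespace tokens
                    $tokens[] = [$name, $set[$name]];
                }
                break;
            }
        }
    }
    return $tokens;
}

$tokens = tokenizeAll('a { color: #fff; }', $tokenPatterns);
// $tokens[0] is ['IDENT', 'a'] and $tokens[4] is ['HASH', '#fff']
```

The whole input is scanned in one call, so the per-token cost of restarting preg_match() 21 times disappears; the price is that all matches are held in memory at once.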

Don't use regexp, scan character by character.
$tokens = array();
$string = "...code...";
$length = strlen($string);
$i = 0;
while ($i < $length) {
    $char = $string[$i];
    if (ctype_alpha($char) || $char == '_' || $char == '-') {
        // identifier: consume every following identifier character
        $start = $i;
        while ($i < $length && (ctype_alpha($string[$i]) || $string[$i] == '_' || $string[$i] == '-')) {
            $i++;
        }
        $tokens[] = array('IDENT', substr($string, $start, $i - $start));
    } else if (......) {
        // ......
    } else {
        $i++; // skip anything not handled yet so the loop always advances
    }
}
However, that makes the code hard to maintain; a parser generator is therefore the better choice.

It's an old post, but here are my 2 cents.
One thing that seriously slows down the original code in the question is the following line:
$string = substr($string, strlen($matches[0]));
Each call copies the entire remaining string. Instead of working on the entire string, take just a part of it (say 50 chars), which is enough for all the possible regexes, and apply the same line of code to that part. When this chunk shrinks below a preset length, load some more data into it.
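A different way to avoid the repeated copy, shown here as a sketch rather than as what this answer proposes, is to leave the string untouched and pass a byte offset to preg_match(), anchoring the pattern at that offset with \G. The helper name matchAt() and the toy pattern are made up for illustration.

```php
<?php
// Try to match $pattern exactly at $offset; return the matched text or null.
function matchAt(string $pattern, string $subject, int $offset): ?string
{
    // \G anchors the match at $offset, so nothing before it is re-read
    // and the subject string is never copied or shortened.
    if (preg_match('~\G(?:' . $pattern . ')~', $subject, $m, 0, $offset) === 1) {
        return $m[0];
    }
    return null;
}

$css    = 'margin:0 auto';
$offset = 0;
$pieces = [];
while ($offset < strlen($css)) {
    // Toy pattern: a lowercase word, a number, whitespace, or any single character.
    $piece = matchAt('[a-z-]+|[0-9]+|\s+|.', $css, $offset);
    if ($piece === null) {
        break; // nothing matched here; a real tokenizer would report a syntax error
    }
    $pieces[] = $piece;
    $offset  += strlen($piece); // advance the cursor instead of calling substr()
}
// $pieces is now ['margin', ':', '0', ' ', 'auto']
```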

Related

PHP: trim word OR part of it from beginning/end of string

I need to trim words from the beginning and end of a string. The problem is that sometimes the words are abbreviated, i.e. only the first three letters (followed by a dot).
I tried hard to find a suitable regular expression. Basically I need to catch three or more initial characters, up to the length of the replacement, but I cannot find a regular expression that matches a variable length and keeps the order of the characters.
For example, if I need to trim 'insurance' from the sentence 'insur. companies are rich', then the pattern \^[insurance]{3,9}\ comes to mind, but this pattern will also catch words like 'sensace', because the order of characters (and their occurrence) inside [] does not matter to the regexp.
Also, at the end of the string, I need to remove serial numbers that are abbreviated from the beginning: say 'XK-25F14' is sometimes presented as '25F14'. So I decided to go purely with character-by-character comparison.
I therefore ended up with the following PHP function:
function trimWords($s, $dirt, $case_insensitive = false, $reverse = true)
{
$pos = 0;
$func = $case_insensitive ? 'strncasecmp' : 'strncmp';
// Get number of initial characters, that match in both strings
while ($func($s, $dirt, $pos + 1) === 0)
$pos++;
// If more than 2 initial characters match, then remove the match
if ($pos > 2)
$s = substr($s, $pos);
// Reverse $s and $dirt so it will trim from the end of string
$s = strrev($s);
if ($reverse)
return trimWords($s, strrev($dirt), $case_insensitive, false);
// After second run return back-reversed string
return trim($s, ' .-');
}
I'm happy with this function, but it has one drawback: it trims only one occurrence of the word. How can I make it trim more occurrences, i.e. remove both forms of 'insurance' from 'Insurance insur. companies'?
I'm also curious: does a regular expression that matches a variable length and respects the order of characters in the pattern really not exist?
Final solution
Thanks to mrhobo, I ended up with a function based on a regular expression. This function can be easily improved and should also be the most efficient for this task.
I modified my previous function and it is twice as fast as the regexp, but it can remove only one word per run. To remove the word from both the beginning and the end it has to run twice, at which point performance is the same as the regexp; to remove more than one occurrence of the word it has to run multiple times, which gets slower and slower.
The final function goes like this:
function trimWords($string, $word, $case_insensitive = false, $min_abbrv = 3)
{
$exc = substr($word, $min_abbrv);
$pat = null;
$i = strlen($exc);
while ($i--)
$pat = '(?>'.preg_quote($exc[$i], '#').$pat.')?';
$pat = substr($word, 0, $min_abbrv).$pat;
$pat = '#(?<begin>^)?(?:\W*\b'.$pat.'\b\W*)+(?(begin)|$)#';
if ($case_insensitive)
$pat .= 'i';
return preg_replace($pat, '', $string);
}
NOTE: with this function it does not matter whether the abbreviation ends with a dot or not; it wipes out any shorter form of the word and also removes all non-word characters around the word.
EDIT: I just tried creating a replacement pattern like insu(r|ra|ran|ranc|rance); the function with atomic groups is faster by ~30%, and with longer words it could possibly be even more efficient.

Matching a word and all possible abbreviations from the nth letter isn't quite an easy task in regex.
Here is how I would do it for the word insurance from the 4th letter:
insu(?>r(?>a(?>n(?>c(?>(?<last>e))?)?)?)?)?(?(last)|\.)
http://regex101.com/r/aL2gV4
It works by using atomic groups to force the regex engine as far forward as possible through the trailing 'rance' letters, using the nested pattern (?>a(?>b)?)?. If the last letter is matched, we are not dealing with an abbreviation, so no dot is required; otherwise the dot is required. This is coded by (?(last)|\.).
To trim, I would create a function to build the above regex for an abbreviation. Then you can write a while loop that replaces each of the abbreviation regexes with empty space until there are no more matches.
Non regex version
Here is my non regex version that removes multiple words and abbreviated words from a string:
function trimWords($str, $word, $min_abbrv, $case_insensitive = false) {
    $len = 0;
    $dot = 0;
    $word_len = strlen($word);
    $strlen = strlen($str);
    // function names must be quoted strings to be callable
    $cmp = $case_insensitive ? 'strncasecmp' : 'strncmp';
    for ($i = 0; $i < $strlen; $i++) {
        // compare exactly one character; stop once the whole word has matched
        if ($len < $word_len && $cmp($str[$i], $word[$len], 1) == 0) {
            $len++;
        } else if ($len > 0) {
            if ($len == $word_len || ($len >= $min_abbrv && ($dot = $str[$i] == '.'))) {
                // cut the matched word (and the dot of an abbreviation) out of the string
                $i -= $len;
                $len += $dot;
                $str = substr($str, 0, $i) . substr($str, $i+$len);
                $strlen = strlen($str);
                $dot = 0;
            }
            $len = 0;
        }
    }
    return $str;
}
Example:
$string = 'ins. <- "ins." / insu. insuranc. insurance / insurance. <- "."';
echo trimWords($string, 'insurance', 4);
Output is:
ins. <- "ins." / / . <- "."

I wrote a function that constructs the regular expression pattern according to mrhobo's answer, plus a simple test, and benchmarked it against my function based on pure PHP string comparison.
Here is the code:
$string = 'Insur. companies are nasty rich';
$dirt = 'insurance';
$cycles = 500000;
$start = microtime(true);
$i = $cycles;
while ($i) {
$i--;
regexpStyle($string, $dirt, true);
}
$stop = microtime(true);
$i = $cycles;
while ($i) {
$i--;
trimWords($string, $dirt, true);
}
$end = microtime(true);
$res1 = $stop - $start;
$res2 = $end - $stop;
$winner = $res1 < $res2 ? '<<<' : '>>>';
echo 'regexp: '.$res1.' '.$winner.' string operations: '.$res2;
function trimWords($s, $dirt, $case_insensitive = false, $reverse = true)
{
$pos = 0;
$func = $case_insensitive ? 'strncasecmp' : 'strncmp';
// Get number of initial characters, that match in both strings
while ($func($s, $dirt, $pos + 1) === 0)
$pos++;
// If more than 2 initial characters match, then remove the match
if ($pos > 2)
$s = substr($s, $pos);
// After second run return back-reversed string
return trim($s, ' .-');
}
function regexpStyle($s, $dirt, $case_insensitive, $min_abbrev = 3)
{
$ss = substr($dirt, $min_abbrev);
$arr = str_split($ss);
$patt = '(?>(?<last>'.array_pop($arr).'))?';
$i = count($arr);
while ($i)
$patt = '(?>'.$arr[--$i].$patt.')?';
$patt = '#^'.substr($dirt, 0, $min_abbrev).$patt.'(?(last)|\.)#';
$patt .= $case_insensitive ? 'i' : null;
return trim(preg_replace($patt, '', $s));
}
and the winner is... moment of silence... it is...
a draw
regexp: 8.5169589519501 >>> string operations: 8.0951890945435
but I have a strong feeling that the regexp approach could be put to better use.

One-line PHP random string generator?

I am looking for the shortest way to generate random/unique strings and for that I was using the following two:
$cClass = sha1(time());
or
$cClass = md5(time());
However, I need the string to begin with a letter. I was looking at base64 encoding, but that adds == at the end, which I would then need to get rid of.
What would be the best way to achieve this with one line of code?
Update:
PRNDL came up with a good suggestion, which I ended up using in a slightly modified form:
echo substr(str_shuffle("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"), 0, 1) . substr(str_shuffle("aBcEeFgHiJkLmNoPqRstUvWxYz0123456789"), 0, 31);
This yields 32 characters mimicking an md5 hash, but the first character is always a letter.
However, Uours improved on this, and his answer,
substr(str_shuffle("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"), 0, 1).substr(md5(time()),1);
is shorter and sweeter.
The other suggestion, by Anonymous2011, was very good, but the first character would for some reason always be M, N, Y or Z, so it didn't fit my purposes; otherwise it would have been the chosen answer. By the way, does anyone know why it always yields those particular letters?
Here is a preview of my modified version:
echo rtrim(base64_encode(md5(microtime())),"=");
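On the question of why only a few first letters ever show up, a likely explanation is that md5() returns hexadecimal text, so the first input byte is always one of 0-9 or a-f, and the first Base64 character depends only on the top six bits of that byte. The sketch below simply enumerates the sixteen cases; the two padding bytes 'xx' are arbitrary and do not influence the first output character.

```php
<?php
// Map each possible first hex digit to the first Base64 character it produces.
$firstChars = [];
foreach (str_split('0123456789abcdef') as $hexDigit) {
    // Only the top six bits of the first byte determine the first Base64 character.
    $firstChars[$hexDigit] = base64_encode($hexDigit . 'xx')[0];
}

// Collapse to the distinct letters that can ever appear in first position.
$distinct = array_values(array_unique($firstChars));
echo implode(', ', $distinct), "\n"; // prints "M, N, O, Y, Z"
```

Digits '0' to '9' give M, N or O and letters 'a' to 'f' give Y or Z, which matches the observation above (O is the rarest of the five because only '8' and '9' produce it).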

Rather than shuffling the alphabet string, it is quicker to pick a single random char.
Get a single random char from the string and then append md5(time()) to it. Before appending md5(time()), remove one char from it so that the resulting string stays 32 chars long:
substr("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ", mt_rand(0, 51), 1).substr(md5(time()), 1);
Lowercase version :
substr("abcdefghijklmnopqrstuvwxyz", mt_rand(0, 25), 1).substr(md5(time()), 1);
Or even shorter and a tiny bit faster lowercase version :
chr(mt_rand(97, 122)).substr(md5(time()), 1);
/* or */
chr(mt_rand(ord('a'), ord('z'))).substr(md5(time()), 1);
A note to anyone trying to generate many random strings within a second: since time() returns the time in seconds, md5(time()) stays the same throughout a given second, so strings generated within the same second share everything but their first character and are likely to contain duplicates.
I tested this using the code below, which exercises the lowercase version:
$num_of_tests = 100000;
$correct = $incorrect = 0;
for( $i = 0; $i < $num_of_tests; $i++ )
{
$rand_str = substr( "abcdefghijklmnopqrstuvwxyz" ,mt_rand( 0 ,25 ) ,1 ) .substr( md5( time( ) ) ,1 );
$first_char_of_rand_str = substr( $rand_str ,0 ,1 );
if( ord( $first_char_of_rand_str ) < ord( 'a' ) or ord( $first_char_of_rand_str ) > ord( 'z' ) )
{
$incorrect++;
echo $rand_str ,'<br>';
}
else
{
$correct++;
}
}
echo 'Correct: ' ,$correct ,' . Incorrect: ' ,$incorrect ,' . Total: ' ,( $correct + $incorrect );

I had found something like this:
$length = 10;
$randomString = substr(str_shuffle(md5(time())),0,$length);
echo $randomString;

If you need it to start with a letter, you could do this. It's messy... but it's one line.
$randomString = substr(str_shuffle("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"), 0, 1) . substr(str_shuffle("0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"), 0, 10);
echo $randomString;

I decided this question needs a better answer. Like code golf! This also uses a better random byte generator.
preg_replace("/[\/=+]/", "", base64_encode(openssl_random_pseudo_bytes(8)));
Increase the number of bytes for a longer password, obviously.

Creates a 200-character hex string:
$string = bin2hex(openssl_random_pseudo_bytes(100));
maaarghk's answer is better though.

base_convert(microtime(true), 10, 36);

You can try this:
function KeyGenerator($uid) {
$tmp = '';
for($z=0;$z<5;$z++) {
$tmp .= chr(rand(97,122)) . rand(0,100);
}
$tmp .= $uid;
return $tmp;
}

I have written this code for you. Simple, short and (reasonably) elegant.
It uses base64 as you mentioned, if length is not important to you. However, it removes the "==" using str_replace.
<?php
echo str_ireplace("==", "", base64_encode(time()));
?>

I use this function
usage:
echo randomString(20, TRUE, TRUE, FALSE);
/**
 * Generate Random String
 * @param Int Length of string(50)
 * @param Bool Upper Case(True,False)
 * @param Bool Numbers(True,False)
 * @param Bool Special Chars(True,False)
 * @return String Random String
 */
function randomString($length, $uc, $n, $sc) {
    $rstr = '';
    $source = 'abcdefghijklmnopqrstuvwxyz';
    if ($uc)
        $source .= 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    if ($n)
        $source .= '1234567890';
    if ($sc)
        $source .= '|@#~$%()=^*+[]{}-_';
    if ($length > 0) {
        // the first character is always a lowercase letter
        $rstr = chr(mt_rand(97, 122));
        // the remaining characters come from the selected source set
        $max = strlen($source) - 1;
        for ($i = 1; $i < $length; $i++) {
            $rstr .= $source[mt_rand(0, $max)];
        }
    }
    return $rstr;
}

I'm using this one to generate dozens of unique strings in a single go, without repeating them, based on other good examples above:
$string = chr(mt_rand(97, 122))
. substr(md5(str_shuffle(time() . rand(0, 999999))), 1);
This way, I was able to generate 1.000.000 unique strings in ~5 seconds. It's not THAT fast, I know, but as I just need a handful of them, I'm ok with it. By the way, generating 10 strings took less than 0.0001 ms.

JavaScript Solution:
function randomString(pIntLenght) {
var strChars = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXTZabcdefghiklmnopqrstuvwxyz";
var strRandomstring = '';
for (var intCounterForLoop=0; intCounterForLoop < pIntLenght; intCounterForLoop++) {
var rnum = Math.floor(Math.random() * strChars.length);
strRandomstring += strChars.substring(rnum,rnum+1);
}
return strRandomstring;
}
alert(randomString(20));
Reference URL : Generate random string using JavaScript
PHP Solution:
function getRandomString($pIntLength = 30) {
$strAlphaNumericString = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
$strReturnString = '';
for ($intCounter = 0; $intCounter < $pIntLength; $intCounter++) {
$strReturnString .= $strAlphaNumericString[rand(0, strlen($strAlphaNumericString) - 1)];
}
return $strReturnString;
}
echo getRandomString(20);
Reference URL : Generate random String using PHP

This function returns random lowercase string:
function randomstring($len=10){
$randstr='';
for($iii=1; $iii<=$len; $iii++){$randstr.=chr(rand(97,122));};
return($randstr);
};

I find that base64 encoding is useful for creating random strings, and use this line:
base64_encode(openssl_random_pseudo_bytes(9));
It gives me a random string of 12 positions, with the additional benefit that the randomness is "cryptographically strong".

To generate a string consisting of random characters, you can use this function:
public function generate_random_name_for_file($length=50){
$key = '';
$keys = array_merge(range(0, 9), range('a', 'z'));
for ($i = 0; $i < $length; $i++) {
$key .= $keys[array_rand($keys)];
}
return $key;
}

It really depends on your requirements.
I needed strings to be unique between test runs, but not many other restrictions.
I also needed my string to start with a character, and this was good enough for my purpose.
$mystring = "a" . microtime(true);
Example output:
a1511953584.0997

How to match the OPs original request in an awful way (expanded for readability):
// [0-9] ASCII DEC 48-57
// [A-Z] ASCII DEC 65-90
// [a-z] ASCII DEC 97-122
// Generate: [A-Za-z][0-9A-Za-z]
$r = implode("", array_merge(array_map(function($a)
{
$a = [rand(65, 90), rand(97, 122)];
return chr($a[array_rand($a)]);
}, array_fill(0, 1, '.')),
array_map(function($a)
{
$a = [rand(48, 57), rand(65, 90), rand(97, 122)];
return chr($a[array_rand($a)]);
}, array_fill(0, 7, '.'))));
In the last array_fill(), change the '7' to your length - 1.
For one that is fully alphanumeric (and still slow):
// [0-9A-Za-z]
$x = implode("", array_map(function($a)
{
$a = [rand(48, 57), rand(65, 90), rand(97, 122)];
return chr($a[array_rand($a)]);
}, array_fill(0, 8, '.')));

The following one-liner meets the requirements in your question: notably, it begins with a letter.
substr("abcdefghijklmnop",random_int(0, 15),1) . bin2hex(random_bytes(15))
If you didn't care whether the string begins with a letter, you could just use:
bin2hex(random_bytes(16))
Note that here we use random_bytes and random_int, which were introduced in PHP 7 and use cryptographic random generators, something that is important if you want unique strings to be hard to guess. Many other solutions, including those involving time(), microtime(), uniqid(), rand(), mt_rand(), str_shuffle(), array_rand(), and shuffle(), are much more predictable and are unsuitable if the random string will serve as a password, a bearer credential, a nonce, a session identifier, a "verification code" or "confirmation code", or another secret value.
I also list other things to keep in mind when generating unique identifiers, especially random ones.
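For a version that is not limited to hex digits, the same idea can be sketched as a small helper. The function name randomToken() and the lowercase-plus-digits alphabet are illustrative choices, not part of the answer above; random_int() requires PHP 7 or later.

```php
<?php
// Build a random string that starts with a letter, using the CSPRNG-backed random_int().
function randomToken(int $length = 32): string
{
    if ($length < 1) {
        throw new InvalidArgumentException('Length must be at least 1');
    }
    $letters  = 'abcdefghijklmnopqrstuvwxyz';
    $alphabet = $letters . '0123456789';

    // First character: letters only, so the result can be used where an
    // identifier must not start with a digit.
    $token = $letters[random_int(0, strlen($letters) - 1)];

    // Remaining characters: letters or digits.
    for ($i = 1; $i < $length; $i++) {
        $token .= $alphabet[random_int(0, strlen($alphabet) - 1)];
    }
    return $token;
}

echo randomToken(32), "\n";
```

random_int() draws uniformly from the inclusive range, so both bounds must be valid string indexes (0 to length minus 1).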

True one liner random string options:
implode('', array_rand(array_flip(str_split(str_shuffle('abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ'))), 21));
md5(microtime() . implode('', array_rand(array_flip(str_split(str_shuffle('abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ'))), 21)));
sha1(microtime() . implode('', array_rand(array_flip(str_split(str_shuffle('abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ'))), 21)));

PHP Runtime Issue when Breaking a 10,000 char string into segments

$chapter is a string that stores a chapter of a book with 10,000 - 15,000 characters. I want to break up the string into segments with a minimum of 1000 characters but officially break after the next whitespace, so that I don't break up a word. The provided code will run successfully about 9 times and then it will run into a run time issue.
"Fatal error: Maximum execution time of 30 seconds exceeded in D:\htdocs\test.php on line 16"
<?php
$chapter = ("10000 characters")
$len = strlen($chapter);
$i=0;
do{$key="a";
for($k=1000;($key != " ") && ($i <= $len); $k = $k+1) {
$j=$i+$k; echo $j;
$key = substr($chapter,$j,1);
}
$segment = substr ($chapter,$i,$k);
$i=$j;
echo ($segment);
} while($i <= $len);
?>

I think the way you wrote it has too much overhead. Increasing max_execution_time would help, but not everyone is able to modify their server settings. This simple snippet split 15,000 bytes of lorem ipsum text (2k words) into 1000-character segments. I assume it would do well with more, as the execution time was fairly quick.
//Define variables, Set $x as int(1 = true) to start
$chapter = ("15000 bytes of Lorum Ipsum Here");
$sections = array();
$x = 1;
//Start Splitting
while( $x ) {
//Get current length of $chapter
$len = strlen($chapter);
//If $chapter is longer than 1000 characters
if( $len > 1000 ) {
//Get Position of last space character before 1000
$x = strrpos( substr( $chapter, 0, 1000), " ");
//If $x is not FALSE - Found last space
if( $x ) {
//Add to $sections array, assign remainder to $chapter again
$sections[] = substr( $chapter, 0, $x );
$chapter = substr( $chapter, $x );
//If $x is FALSE - No space in string
} else {
//Add last segment to $sections for debugging
//Last segment will not have a space. Break loop.
$sections[] = $chapter;
break;
}
//If remaining $chapter is not longer than 1000, simply add to array and break.
} else {
$sections[] = $chapter;
break;
}
}
print_r($sections);
Edit:
Tested with 5k Words (33K bytes) In a fraction of a second. Divided the text up into 33 segments. (Whoops, I had it set to divide into 10K character segments, before.)
Added verbose comments to code, as to explain what everything does.

Here is a simple function to do that
$chapter = "Your full chapter";
breakChapter($chapter,1000);
function breakChapter($chapter,$size){
do{
if(strlen($chapter)<$size){
$segment=$chapter;
$chapter='';
}else{
$pos=strpos($chapter,' ', $size);
if ($pos==false){
$segment=$chapter;
$chapter='';
}else{
$segment=substr($chapter,0,$pos);
$chapter=substr($chapter,$pos+1);
}
}
echo $segment. "\n";
}while ($chapter!='');
}
Checking each character is not a good option; it is resource- and time-intensive.
PS: I have not tested this (just typed it in here), and it may not be the best way to do this, but the logic works!

You are always reading the $chapter from the start. You should delete the already read characters from $chapter so you will never read much more than 10000 characters. If you do this, you must also tweak the cycles.

Try
set_time_limit(240);
at the beginning of the code. (This is the ThrowSomeHardwareAtIt approach.)

It can be done in a single line, which speeds up your code a lot.
echo $segment = substr($chapter, 0, strpos($chapter, " ", 1000));
It takes the substring of the chapter from the start up to the first space at or after position 1000. Note that strpos() returns false if there is no such space.
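To cover the whole chapter rather than only the first segment, the same strpos() call can be repeated with a moving start position. This is a minimal sketch under the assumption that a plain space is the only break character and that byte length is good enough; splitAtWhitespace() is a made-up name.

```php
<?php
// Split $text into segments of at least $minLength bytes, each ending just
// before the next space, so that no word is cut in half.
function splitAtWhitespace(string $text, int $minLength = 1000): array
{
    $segments = [];
    $start    = 0;
    $total    = strlen($text);

    while ($start < $total) {
        // Look for the first space at or after the minimum segment length.
        $searchFrom = $start + $minLength;
        $breakAt    = $searchFrom < $total ? strpos($text, ' ', $searchFrom) : false;

        if ($breakAt === false) {
            // No further break point: the rest of the text is the last segment.
            $segments[] = substr($text, $start);
            break;
        }
        $segments[] = substr($text, $start, $breakAt - $start);
        $start      = $breakAt + 1; // continue after the space itself
    }
    return $segments;
}

$parts = splitAtWhitespace('aaa bbb ccc ddd', 5);
// $parts is ['aaa bbb', 'ccc ddd']
```

Because the original string is never shortened and each strpos() starts at the previous break, every character is examined about once.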

Search for pattern in a string

Pattern search within a string.
for eg.
$string = "111111110000";
FindOut($string);
Function should return 0
function FindOut($str){
$items = str_split($str, 3);
print_r($items);
}

If I understand you correctly, your problem comes down to finding out whether a substring of 3 characters occurs in a string twice without overlapping. This will get you the first occurrence's position if it does:
function findPattern($string, $minlen=3) {
$max = strlen($string)-$minlen;
for($i=0;$i<=$max;$i++) {
$pattern = substr($string,$i,$minlen);
if(substr_count($string,$pattern)>1)
return $i;
}
return false;
}
Or am I missing something here?

What you have here can conceptually be solved with a sliding window. For your example, you have a sliding window of size 3.
For each character in the string, you take the substring of the current character and the next two characters as the current pattern. You then slide the window up one position, and check if the remainder of the string has what the current pattern contains. If it does, you return the current index. If not, you repeat.
Example:
1010101101
|-|
So, pattern = 101. Now, we advance the sliding window by one character:
1010101101
|-|
And see if the rest of the string has 101, checking every combination of 3 characters.
Conceptually, this should be all you need to solve this problem.
Edit: I really don't like when people just ask for code, but since this seemed to be an interesting problem, here is my implementation of the above algorithm, which allows for the window size to vary (instead of being fixed at 3, the function is only briefly tested and omits obvious error checking):
function findPattern( $str, $window_size = 3) {
// Start the index at 0 (beginning of the string)
$i = 0;
// while( (the current pattern in the window) is not empty / false)
while( ($current_pattern = substr( $str, $i, $window_size)) != false) {
$possible_matches = array();
// Get the combination of all possible matches from the remainder of the string
for( $j = 0; $j < $window_size; $j++) {
$possible_matches = array_merge( $possible_matches, str_split( substr( $str, $i + 1 + $j), $window_size));
}
// If the current pattern is in the possible matches, we found a duplicate, return the index of the first occurrence
if( in_array( $current_pattern, $possible_matches)) {
return $i;
}
// Otherwise, increment $i and grab a new window
$i++;
}
// No duplicates were found, return -1
return -1;
}
It should be noted that this certainly isn't the most efficient algorithm or implementation, but it should help clarify the problem and give a straightforward example on how to solve it.

Looks like you more want to use a sub-string function to walk along and check every three characters and not just break it into 3
function fp($s, $len = 3){
    $max = strlen($s) - $len; //borrowed from lafor as it was a terrible oversight by me
    $parts = array();
    // <= so that the final window at position $max is checked too
    for($i=0; $i <= $max; $i++){
        $three = substr($s, $i, $len);
        if(array_key_exists("$three",$parts)){
            return $parts["$three"];
            //if we've already seen it before then this is the first duplicate, we can return it
        }
        else{
            $parts["$three"] = $i; //save the index of the starting position.
        }
    }
    return false; //if we get this far then we didn't find any duplicate strings
}

Based on the str_split documentation, calling str_split on "1010101101" will result in:
Array(
[0] => 101
[1] => 010
[2] => 110
[3] => 1
)
None of these will match each other.
You need to look at each 3-long slice of the string (starting at index 0, then index 1, and so on).
I suggest looking at substr, which you can use like this:
substr($input_string, $index, $length)
And it will get you the section of $input_string starting at $index of length $length.

Quick and dirty implementation of such a pattern search:
function findPattern($string){
    $matches = 0;
    $substrStart = 0;
    while($matches < 2 && $substrStart+ 3 < strlen($string) && $pattern = substr($string, $substrStart++, 3)){
        $matches = substr_count($string,$pattern);
    }
    if($matches < 2){
        return null;
    }
    return $substrStart-1;
}

Select random strings over 6 and under 10 characters from text file

I have a text file, and I need to pick a random string that is over 6 characters and under 10 characters. Normally, I would use a script like this, which would work, but since it needs to be a certain length, that won't work. Does anybody have a solution to this?
A sample input would be something like this:
Apple
Banana
Orange
Strawberry
Blueberry
Pineapple
Somelongfruithere
Those values would be in a .txt file, each with a line break. An example of a string that would be allowed is pineapple, but apple or Somelongfruithere wouldn't be allowed.

You'll need to do something like this:
$lines = array();
$tmpLines = file('random.txt', FILE_IGNORE_NEW_LINES); // drop line endings so strlen() counts only the word
for($i = 0; $i < count($tmpLines); ++$i)
{
if(strlen($tmpLines[ $i ]) > 6 && strlen($tmpLines[ $i ]) < 10)
{
$lines[] = $tmpLines[ $i ];
}
}
$randomWord = $lines[ array_rand($lines) ];
A shorter way, in number of lines, goes like this (but is much less safe):
$randomWord = '';
$lines = file('random.txt', FILE_IGNORE_NEW_LINES); // drop line endings so strlen() counts only the word
while(strlen($randomWord) <= 6 || strlen($randomWord) >= 10)
$randomWord = $lines[ array_rand($lines) ];
The first option gets all the lines in the file, and copies only the ones between 6 and 10 chars in length to another array. When choosing a random element from this array, you are "guaranteed" a reasonable access time for any random string.
The second option simply continues to pick a random string until one of the proper length is chosen, but could potentially take a while depending on the random number generator's mood. Unlikely, but I wouldn't want to risk it. Always take reliability as the best approach, in my book.
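The first option can also be written with array_filter(), sketched here with a hypothetical helper pickRandomLine() that returns null when nothing qualifies instead of failing on an empty array. "Over 6 and under 10" is expressed as the inclusive range 7 to 9.

```php
<?php
// Pick a random entry whose length lies between $min and $max (inclusive),
// or return null if no entry qualifies.
function pickRandomLine(array $lines, int $min = 7, int $max = 9): ?string
{
    $candidates = array_values(array_filter(
        $lines,
        function ($line) use ($min, $max) {
            $len = strlen($line);
            return $len >= $min && $len <= $max;
        }
    ));

    if ($candidates === []) {
        return null; // avoid calling array_rand() on an empty array
    }
    return $candidates[array_rand($candidates)];
}

// In real use the array would come from something like:
// file('random.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES)
$fruits = ['Apple', 'Banana', 'Orange', 'Strawberry', 'Blueberry', 'Pineapple', 'Somelongfruithere'];
echo pickRandomLine($fruits), "\n"; // either Blueberry or Pineapple
```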

I would say explode the text file into a variable, run your random generator to get a placement value (the position of the random string), then in a loop (a do/while loop) pull the string from the exploded variable and check its length to ensure it's what you want:
if (strlen($rand_word) > 6 && strlen($rand_word) < 10) {
//execute function and end loop
} else {
// keep checking using a new random placement number
}

The answer depends on the requirement.
If you need to select from a group of strings and only accept one that fits your criteria, then you'll need to use strlen and try again if it is not the correct length.
Otherwise, you're still going to need strlen, to make sure it is at least 6 Chars, but then you can use substr to cut it to 10. If whitespace does not count, use ltrim & rtrim before strlen and substr.

First find all the words that are in the right character range, then pull a random one from the resulting array:
$fileLines = file('somefile.txt');
$myWords = array();
foreach ($fileLines as $line)
{
$thisLine = explode(" ", trim($line)); // split() was removed in PHP 7; trim() drops the line ending
foreach ($thisLine as $word)
{
$length = strlen($word);
if ($length > 6 && $length < 10)
{
$myWords[] = $word;
}
}
}
$randomWord = $myWords[array_rand($myWords)];

Shortened way!
for ($i=7; $i<=9; $i++)
if (strlen($str) == $i )
echo "bingo! " . strlen($str);
