PHP regex generator - php

I now have a working regex for the following criteria:
a one-line, PHP-ready regex that covers a number of keywords and keyterms and matches if at least one of them is present.
For example:
Keyterms:
apple
banana
strawberry
pear cake
Now if any of these key terms are found then it returns true. However, to add a little more difficulty here, the pear cake term should be split as two keywords which must both be in the string, but need not be together.
Example strings which should return true:
A great cake is made from pear
i like apples
i like apples and bananas
i like cakes made from pear and apples
I like cakes made from pears
The working regex is:
/\bapple|\bbanana|\bstrawberry|\bpear.*?\bcake|\bcake.*?\bpear/
Now I need a PHP function that will create this regex on the fly from an array of keyterms. The sticking point is that a keyterm may contain any number of keywords. Only one of the keyterms needs to be found, but several can be present. As above, all of the words within a keyterm must appear in the string, in any order.

I have written a function for you here:
<?php
function permutations($array)
{
$list = array();
for ($i=0; $i<=10000; $i++) {
shuffle($array);
$tmp = implode(',',$array);
if (isset($list[$tmp])) {
$list[$tmp]++;
} else {
$list[$tmp] = 1;
}
}
ksort($list);
$list = array_keys($list);
return $list;
}
function CreateRegex($array)
{
$toReturn = '/';
foreach($array AS $value)
{
//Contains spaces
if(strpos($value, " ") !== false)
{
$pieces = explode(" ", $value);
$combos = permutations($pieces);
foreach($combos AS $currentCombo)
{
$currentPieces = explode(',', $currentCombo);
foreach($currentPieces AS $finallyGotIt)
{
$toReturn .= '\b' . $finallyGotIt . '.*?';
}
$toReturn = substr($toReturn, 0, -3) . '|';
}
}
else
{
$toReturn .= '\b' . $value . '|';
}
}
$toReturn = substr($toReturn, 0, -1) . '/';
return $toReturn;
}
var_dump(CreateRegex(array('apple', 'banana', 'strawberry', 'pear cake')));
?>
I got the permutations function from:
http://www.hashbangcode.com/blog/getting-all-permutations-array-php-74.html
However, I would recommend finding a better permutations function. At first glance this one is pretty ugly: it always runs its shuffle loop about 10,000 times regardless of the array size, and it relies on random shuffling to hit every permutation.
Also, here is a codepad for the code:
http://codepad.org/nUhFwKz1
Let me know if there is something wrong with it!
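As a possible alternative to the shuffle-based helper, here is a minimal sketch that enumerates the permutations deterministically and escapes each keyword with preg_quote(). The function names keywordPermutations() and buildKeytermRegex() are made up for this illustration and are not part of the answer above.

```php
<?php
// Hypothetical helper: returns every ordering of $items, in a stable order.
function keywordPermutations(array $items): array
{
    if (count($items) <= 1) {
        return [$items];
    }
    $result = [];
    foreach ($items as $i => $item) {
        $rest = $items;
        unset($rest[$i]);
        foreach (keywordPermutations(array_values($rest)) as $tail) {
            $result[] = array_merge([$item], $tail);
        }
    }
    return $result;
}

// Hypothetical builder: one alternative per single keyword,
// one alternative per ordering for multi-word keyterms.
function buildKeytermRegex(array $keyterms): string
{
    $alternatives = [];
    foreach ($keyterms as $keyterm) {
        $words = preg_split('/\s+/', trim($keyterm), -1, PREG_SPLIT_NO_EMPTY);
        if (!$words) {
            continue; // skip empty keyterms
        }
        foreach (keywordPermutations($words) as $ordering) {
            $quoted = array_map(function ($w) {
                return '\b' . preg_quote($w, '/');
            }, $ordering);
            $alternatives[] = implode('.*?', $quoted);
        }
    }
    return '/' . implode('|', $alternatives) . '/';
}

$regex = buildKeytermRegex(array('apple', 'banana', 'strawberry', 'pear cake'));
```

Keep in mind that the number of alternatives grows factorially with the number of words in a keyterm, so this only stays practical for short keyterms.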

Related

How to correct misspelled words using a dictionary array?

I have a dictionary array of correct words:
<?php
$dict = array("apple","windows","microsoft","happy");
?>
I tried several approaches to correction, but none of them solved my problem. Here is one of the methods I tried:
<?php
$config_dic= pspell_config_create ('en');
function orthograph($string)
{
// Suggests possible words in case of misspelling
$config_dic = pspell_config_create('en');
// Ignore words under 3 characters
pspell_config_ignore($config_dic, 3);
// Configure the dictionary
pspell_config_mode($config_dic, PSPELL_FAST);
$dictionary = pspell_new_config($config_dic);
// To find out if a replacement has been suggested
$replacement_suggest = false;
// Split on spaces (explode() does not accept an empty delimiter)
$string = explode(' ', trim(str_replace(',', ' ', $string)));
foreach ($string as $key => $value) {
if(!pspell_check($dictionary, $value)) {
$suggestion = pspell_suggest($dictionary, $value);
// Suggestions are case sensitive. Grab the first one.
if(strtolower($suggestion [0]) != strtolower($value)) {
$string [$key] = $suggestion [0];
$replacement_suggest = true;
}
}
}
if ($replacement_suggest) {
// We have a suggestion, so we return the corrected text.
return implode(' ', $string);
} else {
return null;
}
}
$search = $_POST['input'];
$suggestion_spell = orthograph($search);
if ($suggestion_spell) {
echo "Try with this spelling : $suggestion_spell";
}
$dict = pspell_new ("en");
if (!pspell_check ($dict, "lappin")) {
$suggestions = pspell_suggest ($dict, "lappin");
foreach ($suggestions as $suggestion) {
echo "Did you mean: $suggestion?<br />";
}
}
// Suggests possible words in case of misspelling
$config_dic = pspell_config_create('en');
// Ignore words under 3 characters
pspell_config_ignore($config_dic, 3);
// Configure the dictionary
pspell_config_mode($config_dic, PSPELL_FAST);
$dictionary = pspell_new_config($config_dic);
..................................................
The idea is there, but I could not implement it. Thank you in advance for any help. The idea is to check each word for similarity against the array of correct words, as a percentage or by some other measure, and if a word matches a word in the array by 90-95%, replace it with the correct version. For example, at most 2-3 letters may be missing from a word; most people will miss 1-2 letters. Google Translate has this kind of functionality, and I want to implement it in my project.
The user, for example, can enter a word in such different incorrect variants:
$textarea = "Windows - Widows, Apple - Aple";
There are also cases where the user adds unneeded letters to the word:
$text = "microsoft - microosoft, happy - haapy";
I would like to write my own function to solve this problem rather than use a third-party extension. Could you show an example of what the options are?
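One way to approach this without an extension is PHP's built-in levenshtein() function, which counts the insertions, deletions and substitutions needed to turn one word into another. The following is only a sketch under that assumption: correctWord(), correctText() and the default threshold of 2 edits are made up for this illustration. If a percentage is preferred, similar_text() can report one through its third argument.

```php
<?php
// Hypothetical helper: returns the closest dictionary word if it is
// within $maxDistance edits, otherwise the word unchanged.
function correctWord(string $word, array $dict, int $maxDistance = 2): string
{
    $best = $word;
    $bestDistance = $maxDistance + 1;
    foreach ($dict as $candidate) {
        $distance = levenshtein(strtolower($word), strtolower($candidate));
        if ($distance < $bestDistance) {
            $best = $candidate;
            $bestDistance = $distance;
        }
    }
    return $best;
}

// Hypothetical helper: corrects every run of ASCII letters in $text.
function correctText(string $text, array $dict, int $maxDistance = 2): string
{
    return preg_replace_callback('/[A-Za-z]+/', function ($m) use ($dict, $maxDistance) {
        return correctWord($m[0], $dict, $maxDistance);
    }, $text);
}

$dict = array("apple", "windows", "microsoft", "happy");
echo correctText("Windows - Widows, Apple - Aple", $dict);
// prints: windows - windows, apple - apple
```

Note that matches are returned in the dictionary's own spelling and case, and that a fixed edit distance can be too generous for very short words, so the threshold probably needs tuning for real input.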

In_array not working - compare two files

The code below is a simple version of what I am trying to do. The code will read in two files, see if there is a matching entry and, if there is, display the difference in the numbers for that item. But it isn't working. The first echo displays the word but the second echo is never reached. Would someone please explain what I am missing?
$mainArry = array('Albert,8');
$arry = array('Albert,12');
foreach ($arry as $line) {
$kword = explode(',', $line);
echo 'kword '.$kword[0];
if (in_array($kword[0], $mainArry)) {
echo 'line '.$line. ' has count of '.$kword[1] . '<br>';
}
}
Your $mainArry contains a single element: the string 'Albert,8'. It looks like you want to use it as an array (elements 'Albert' and '8') instead of a string.
You mention the code will read from two files, so you can 'explode' it to a real array, as you do with $arry. A simpler approach would be using str_getcsv() to parse the CSV string into $mainArry.
$inputString = 'Albert,8';
$mainArry = str_getcsv($inputString); // now $mainArry is ['Albert','8']
$arry = array('Albert,12');
foreach ($arry as $line) {
$kword = explode(',', $line);
echo 'kword '.$kword[0];
if (in_array($kword[0], $mainArry)) {
echo 'line '.$line. ' has count of '.$kword[1] . '<br>';
}
}
Test it here.
You are attempting to compare the string Albert with Albert,8, so they won't match. If you want to check if the string contains the keyword, assuming your array has more than one element, you could use:
$mainArry = array('Albert,8');
$arry = array('Albert,12');
foreach ($arry as $line) {
$kword = explode(',', $line);
echo 'kword '.$kword[0];
foreach ($mainArry as $comp) {
if (strstr($comp, $kword[0])) {
echo 'line '.$line. ' has count of '.$kword[1] . '<br>';
}
}
}
example: https://eval.in/728566
I don't recommend your way of working, but here is a solution. Basically, the processing you apply to $arry should also be applied to the $mainArry you're comparing it against.
$mainArry = array('Albert,8');
$arry = array('Albert,12');
/***
NEW loop below takes the values out of the main array
and sets them in their own arrays so they can be properly compared.
***/
foreach ($mainArry as $arrr){
$ma = explode(",",$arrr);
$names[] = $ma[0];
$values[] = $ma[1];
}
unset($arrr,$ma);
foreach ($arry as $line) {
$kword = explode(',', $line);
echo 'kword '.$kword[0];
/// note var reference here is updated.
if (in_array($kword[0], $names)) {
echo '<br>line '.$kword[0]. ' has count of '.$kword[1] . '<br>';
}
}
Yeah, MarcM's answer above does the same thing in a neat single line, but I wanted to illustrate a little more under the hood operations of the value setting. :-/
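Since the stated goal is to display the difference in the numbers for matching entries, another option is to key both lists by name first. This is only a sketch; parseCounts() and countDifferences() are names invented here, and the lines are assumed to have the "name,count" shape from the question.

```php
<?php
// Hypothetical helper: turns lines like "Albert,8" into ['Albert' => 8].
function parseCounts(array $lines): array
{
    $counts = [];
    foreach ($lines as $line) {
        $parts = explode(',', trim($line), 2);
        if (count($parts) === 2) {
            $counts[$parts[0]] = (int) $parts[1];
        }
    }
    return $counts;
}

// Hypothetical helper: for every name present in both lists,
// returns the count in $otherLines minus the count in $mainLines.
function countDifferences(array $mainLines, array $otherLines): array
{
    $main = parseCounts($mainLines);
    $diffs = [];
    foreach (parseCounts($otherLines) as $name => $count) {
        if (array_key_exists($name, $main)) {
            $diffs[$name] = $count - $main[$name];
        }
    }
    return $diffs;
}

foreach (countDifferences(array('Albert,8'), array('Albert,12')) as $name => $diff) {
    echo $name . ' differs by ' . $diff . '<br>';
}
```

When reading from real files, file($path, FILE_IGNORE_NEW_LINES) would give the arrays of lines to pass in.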

Merging two words together letter by letter in php. How to make it work?

How can I merge two words letter by letter in PHP in the following way:
Input #1: Apricot
Input #2: Kiwi
Expected output: AKpirwiicot.
If one word has more characters than the other, the remaining characters are simply written out at the end.
I tried it by this logic:
Input smthing
str_split()
array_merge()
But I failed. Any solutions appreciated.
$string1 and $string2 can be in any order.
$string1 = str_split("Apricot");
$string2 = str_split("Kiwi");
// Make sure $string1 is the longer of the two.
if (count($string2) > count($string1)) {
    $temp = $string1;
    $string1 = $string2;
    $string2 = $temp;
}
$result = "";
foreach ($string1 as $key => $var) {
    $result .= $var;
    // Append the matching letter of the shorter word, if there is one.
    if (isset($string2[$key])) {
        $result .= $string2[$key];
    }
}
echo $result;
array_merge() just sticks one array on the end of the other, so I don't believe it would do what you are looking for.
Edit: I've adjusted this so the order of the inputs doesn't matter, like @nikkis' answer.
How about this:
def str_merge(a, b):
    s = ''
    k = min(len(a), len(b))
    for i in range(k):
        s += a[i] + b[i]
    s += a[k:] + b[k:]
    return s
In PHP:
function merge($a, $b)
{
$s = '';
$k = min(strlen($a), strlen($b));
for($i=0; $i<$k; $i++)
{
$s = $s . $a[$i] . $b[$i];
}
$s = $s . substr($a, $k) . substr($b, $k);
return $s;
}
Please forgive my PHP, not my strongest language...

PHP: Check string for certain words

How can I check if data submitted from a form or querystring has certain words in it?
I'm trying to look for words containing admin, drop, create etc in form [Post] data and querystring data so I can accept or reject it.
I'm converting from ASP to PHP. I used to do this using an array in ASP (keep all illegal words in a string and use ubound to check the whole string for those words), but is there a better (efficient) way to do this in PHP?
Eg: A string like this would be rejected: "The administrator dropped a blah blah" because it has admin and drop in it.
I intend using this to check usernames when creating accounts and for other things too.
Thanks
You could use stripos()
int stripos ( string $haystack , string $needle [, int $offset = 0 ] )
You could have a function like:
function checkBadWords($str, $badwords) {
foreach ($badwords as $word) {
if (stripos(" $str ", " $word ") !== false) {
return false;
}
}
return true;
}
And to use it:
if (!checkBadWords('something admin', array('admin'))) {
// ...
}
strpos() will let you search for a substring within a larger string. It's quick and works well. It returns false if the string's not found, and a number (which could be zero, so you need to use === to check) if it finds the string.
stripos() is a case-insensitive version of the same.
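A short illustration of why the strict comparison matters, using a made-up input string in which the needle sits at position 0:

```php
<?php
// 'admin' is found at the very start of the string, so strpos() returns 0.
$position = strpos('admin panel', 'admin');

var_dump($position);            // int(0)
var_dump($position == false);   // bool(true): the loose check wrongly reports "not found"
var_dump($position !== false);  // bool(true): the strict check correctly reports "found"
```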
I'm trying to look for words containing admin, drop, create etc in form [Post] data and querystring data so I can accept or reject it.
I suspect that you are trying to filter the string so it's suitable for including in something like a database query. If this is the case, this is probably not a good way to go about it; you'd actually need to escape the string using mysql_real_escape_string() or an equivalent.
$badwords = array("admin", "drop",);
foreach (str_word_count($string, 1) as $word) {
foreach ($badwords as $bw) {
if (strpos($word, $bw) === 0) {
//contains word $word that starts with bad word $bw
}
}
}
For JGB146, here is a performance comparison with regular expressions:
<?php
function has_bad_words($badwords, $string) {
foreach (str_word_count($string, 1) as $word) {
foreach ($badwords as $bw) {
if (stripos($word, $bw) === 0) {
return true;
}
}
return false; // NOTE: this sits inside the outer loop, so only the first word is ever checked
}
}
function has_bad_words2($badwords, $string) {
$regex = array_map(function ($w) {
return "(?:\\b". preg_quote($w, "/") . ")"; }, $badwords);
$regex = "/" . implode("|", $regex) . "/";
return preg_match($regex, $string) != 0;
}
$badwords = array("abc", "def", "ghi", "jkl", "mnop");
$string = "The quick brown fox jumps over the lazy dog";
$start = microtime(true);
for ($i = 0; $i < 10000; $i++) {
has_bad_words($badwords, $string);
}
echo "elapsed: ". (microtime(true) - $start);
$start = microtime(true);
for ($i = 0; $i < 10000; $i++) {
has_bad_words2($badwords, $string);
}
echo "elapsed: ". (microtime(true) - $start);
Example output:
elapsed: 0.076514959335327
elapsed: 0.29999899864197
So, in this test, the regular expression version took roughly four times as long.
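One caveat when reading these numbers: in has_bad_words() the return false statement sits inside the outer foreach, so the function gives up after the first word of the string. A sketch of a variant that checks every word might look like this (the name hasBadWordPrefix() is invented here, and no timings are claimed for it):

```php
<?php
// Returns true if any word in $string starts with one of $badwords
// (case-insensitive); false only after every word has been checked.
function hasBadWordPrefix(array $badwords, string $string): bool
{
    foreach (str_word_count($string, 1) as $word) {
        foreach ($badwords as $bw) {
            if (stripos($word, $bw) === 0) {
                return true;
            }
        }
    }
    return false;
}
```

With the question's example, hasBadWordPrefix(array("admin", "drop"), "The administrator dropped a blah blah") returns true, because "administrator" starts with "admin".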
You could use regular expression like this:
preg_match("~(admin)|(drop)|(another token)|(yet another)~",$subject);
Building the pattern string from an array:
$pattern = implode(")|(", $banned_words);
$pattern = "~(".$pattern.")~";
function check($string, $array) {
foreach($array as $item) {
if( preg_match("/($item)/", $string) )
return true;
}
return false;
}
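If the banned words can contain characters that are special in a regex, it is probably safer to escape them with preg_quote() before joining them. A minimal sketch, where containsBannedWord() is a name chosen for this example and the i flag is added on the assumption that the check should be case-insensitive:

```php
<?php
// Builds one alternation from the escaped words and tests it in a single call.
function containsBannedWord(string $subject, array $bannedWords): bool
{
    if ($bannedWords === []) {
        return false; // an empty alternation would match everything
    }
    $quoted = array_map(function ($w) {
        return preg_quote($w, '~');
    }, $bannedWords);
    $pattern = '~(?:' . implode('|', $quoted) . ')~i';
    return preg_match($pattern, $subject) === 1;
}
```

The non-capturing group (?:...) is used because only a yes/no answer is needed, not the matched word.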
You can certainly do a loop, as others have suggested. But I think you can get closer to the behavior you're looking for with an operation that directly uses arrays, plus it allows execution via a single if statement.
Originally, I was thinking you could do this with a simple preg_match() call (hence the downvote), however preg_match does not support arrays. Instead, you can do a replacement via preg_replace to have all rejected strings replaced with nothing, and then check to see if the string is changed. This is simple and avoids requiring a loop iteration for each rejected string.
$rejectedStrs = array("/admin/", "/drop/", "/create/");
if($input == preg_replace($rejectedStrs, "", $input)) {
//do stuff
} else {
//reject
}
Note also that you can provide case-insensitive searches by using the i flag on the regex patterns, changing the array of patterns to $rejectedStrs = array("/admin/i", "/drop/i", "/create/i");
On Efficiency
There has been some debate about the efficiency of doing it this way vs the accepted nested loop method. I ran some tests and found the preg_replace method executed around twice as fast as the nested loop. Here is the code and output of those tests:
$input = "You can certainly do a loop, as others have suggested. But I think you can get closer to the behavior you're looking for with an operation that directly uses arrays, plus it allows execution via a single if statement. You can certainly do a loop, as others have suggested. But I think you can get closer to the behavior you're looking for with an operation that directly uses arrays, plus it allows execution via a single if statement.";
$input = "Short string with no matches";
$input2 = "Longer string with a lot more words but still no matches. Longer string with a lot more words but still no matches. Longer string with a lot more words but still no matches. Longer string with a lot more words but still no matches. Longer string with a lot more words but still no matches. Longer string with a lot more words but still no matches. Longer string with a lot more words but still no matches. ";
$input3 = "Short string which loop will match quickly";
$input4 = "Longer string that will eventually be matches but first has a lot of words, followed by more words and then more words, followed by more words and then more words, followed by more words and then more words, followed by more words and then more words, followed by more words and then more words, followed by more words and then more words, followed by more words and then more words, followed by more words and then more words and then finally the word create near the end";
$start1 = microtime(true);
$rejectedStrs = array("/loop/", "/operation/", "/create/");
$p_matches = 0;
for ($i = 0; $i < 10000; $i++) {
if (preg_check($rejectedStrs, $input)) $p_matches++;
if (preg_check($rejectedStrs, $input2)) $p_matches++;
if (preg_check($rejectedStrs, $input3)) $p_matches++;
if (preg_check($rejectedStrs, $input4)) $p_matches++;
}
$start2 = microtime(true);
$rejectedStrs = array("loop", "operation", "create");
$l_matches = 0;
for ($i = 0; $i < 10000; $i++) {
if (loop_check($rejectedStrs, $input)) $l_matches++;
if (loop_check($rejectedStrs, $input2)) $l_matches++;
if (loop_check($rejectedStrs, $input3)) $l_matches++;
if (loop_check($rejectedStrs, $input4)) $l_matches++;
}
$end = microtime(true);
echo "preg_match: ".$start1." ".$start2."= ".($start2-$start1)."\nloop_match: ".$start2." ".$end."=".($end-$start2);
function preg_check($rejectedStrs, $input) {
if($input == preg_replace($rejectedStrs, "", $input))
return true;
return false;
}
function loop_check($badwords, $string) {
foreach (str_word_count($string, 1) as $word) {
foreach ($badwords as $bw) {
if (stripos($word, $bw) === 0) {
return true;
}
}
return false; // NOTE: this sits inside the outer loop, so only the first word is ever checked
}
}
Output:
preg_match: 1281908071.4032 1281908071.9947= 0.5915060043335
loop_match: 1281908071.9947 1281908073.006=1.0112948417664
This is actually pretty simple, use substr_count.
An example for you would be:
if (substr_count($variable_to_search, "drop"))
{
echo "error";
}
And to make things even simpler, put your keywords (i.e. "drop", "create", "alter") in an array and use foreach to check them. That way you cover all your words. An example:
foreach ($keywordArray as $keyword)
{
if (substr_count($variable_to_search, $keyword))
{
echo "error"; //or do whatever you want to do when you find something you don't like
}
}

Regular Expression to match unlimited number of options

I want to be able to parse file paths like this one:
/var/www/index.(htm|html|php|shtml)
into an ordered array:
array("htm", "html", "php", "shtml")
and then produce a list of alternatives:
/var/www/index.htm
/var/www/index.html
/var/www/index.php
/var/www/index.shtml
Right now, I have a preg_match statement that can split two alternatives:
preg_match_all ("/\(([^)]*)\|([^)]*)\)/", $path_resource, $matches);
Could somebody give me a pointer how to extend this to accept an unlimited number of alternatives (at least two)? Just regarding the regular expression, the rest I can deal with.
The rule is:
The list needs to start with a ( and close with a )
There must be one | in the list (i.e. at least two alternatives)
Any other occurrence(s) of ( or ) are to remain untouched.
Update: I need to be able to also deal with multiple bracket pairs such as:
/var/(www|www2)/index.(htm|html|php|shtml)
sorry I didn't say that straight away.
Update 2: If you're looking to do what I'm trying to do in the filesystem, then note that glob() already provides this functionality out of the box. There is no need to implement a custom solution. See @Gordon's answer below for details.
I think you're looking for:
/\(([^|]+)(\|([^|]+))+\)/
Basically, put the splitter '|' into a repeating pattern.
Also, your words should be made up of 'not pipes' instead of 'not parens', per your third requirement.
Also, prefer + to * for this problem. + means 'at least one'. * means 'zero or more'.
Not exactly what you are asking, but what's wrong with using what you already have to capture the whole list (ignoring the |s), putting it into a variable, and then explode()-ing it on the |s? That would give you an array of however many items there were (including 1 if there wasn't a | present).
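The capture-then-explode idea can be sketched roughly as follows. The function name expandAlternatives() and the exact pattern are assumptions made for this illustration; the pattern only accepts a bracket pair that contains at least one |, and any other parentheses are left untouched.

```php
<?php
// Expands the first "(a|b|...)" group, then recurses on the rest of the path.
function expandAlternatives(string $path): array
{
    // Capture the contents of the first bracket pair holding at least one "|".
    if (!preg_match('/\(([^()|]+(?:\|[^()|]+)+)\)/', $path, $m, PREG_OFFSET_CAPTURE)) {
        return [$path];
    }
    $start  = $m[0][1];
    $length = strlen($m[0][0]);
    $prefix = substr($path, 0, $start);
    $suffix = substr($path, $start + $length);

    $results = [];
    foreach (explode('|', $m[1][0]) as $option) {
        foreach (expandAlternatives($suffix) as $tail) {
            $results[] = $prefix . $option . $tail;
        }
    }
    return $results;
}

print_r(expandAlternatives('/var/(www|www2)/index.(htm|html|php|shtml)'));
```

Because the first group is always expanded first, the results keep the order in which the alternatives were written.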
Non-regex solution :)
<?php
$test = '/var/www/index.(htm|html|php|shtml)';
/**
*
* @param string $str "/var/www/index.(htm|html|php|shtml)"
* @return array "/var/www/index.htm", "/var/www/index.php", etc
*/
function expand_bracket_pair($str)
{
// Only get the very last "(" and ignore all others.
$bracketStartPos = strrpos($str, '(');
$bracketEndPos = strrpos($str, ')');
// Split on "|".
$exts = substr($str, $bracketStartPos, $bracketEndPos - $bracketStartPos);
$exts = trim($exts, '()|');
$exts = explode('|', $exts);
// List all possible file names.
$names = array();
$prefix = substr($str, 0, $bracketStartPos);
$affix = substr($str, $bracketEndPos + 1);
foreach ($exts as $ext)
{
$names[] = "{$prefix}{$ext}{$affix}";
}
return $names;
}
function expand_filenames($input)
{
$nbBrackets = substr_count($input, '(');
// Start with the last pair.
$sets = expand_bracket_pair($input);
// Now work backwards and recurse for each generated filename set.
for ($i = 0; $i < $nbBrackets; $i++)
{
foreach ($sets as $k => $set)
{
$sets = array_merge(
$sets,
expand_bracket_pair($set)
);
}
}
// Clean up.
foreach ($sets as $k => $set)
{
if (false !== strpos($set, '('))
{
unset($sets[$k]);
}
}
$sets = array_unique($sets);
sort($sets);
return $sets;
}
var_dump(expand_filenames('/(a|b)/var/(www|www2)/index.(htm|html|php|shtml)'));
Maybe I'm still not getting the question, but my assumption is you are running through the filesystem until you hit one of the files, in which case you could do
$files = glob("$path/index.{htm,html,php,shtml}", GLOB_BRACE);
The resulting array will contain any file matching your extensions in $path or none. If you need to include files by a specific extension order, you can foreach over the array with an ordered list of extensions, e.g.
foreach(array('htm','html','php','shtml') as $ext) {
foreach($files as $file) {
if(pathinfo($file, PATHINFO_EXTENSION) === $ext) {
// do something
}
}
}
Edit: and yes, you can have multiple curly braces in glob.
The answer has already been given, but it's a fun puzzle and I just couldn't resist:
function expand_filenames2($str) {
$r = array($str);
$n = 0;
while(preg_match('~(.*?) \( ( \w+ \| [\w|]+ ) \) (.*) ~x', $r[$n++], $m)) {
foreach(explode('|', $m[2]) as $e)
$r[] = $m[1] . $e . $m[3];
}
return array_slice($r, $n - 1);
}
print_r(expand_filenames2('/(a|b)/var/(ignore)/(www|www2)/index.(htm|html|php|shtml)!'));
maybe this explains a bit why we like regexps that much ;)
