How to effectively match a string with lots of regular expressions


I want to be able to effectively match a string with a number of regular expressions to determine what this string represents.
^[0-9]{1}$ if string matches it is of type 1
^[a-x]{300}$ if string matches it is of type 2
... ...
Iterating over a collection containing all of the regular expressions every time I want to match a string is far too slow for me.
Is there a more efficient way? Maybe I can compile these regexps into one big one? Maybe something that works like Google Suggestions, analysing letter after letter?
In my project I am using PHP/MySQL, but I will be thankful for a clue in any language.
Edit:
Strings will be matched very frequently, and the string values will vary.

What you could do, if possible, is group your regexes together and determine which group a string belongs to.
For instance, if a string doesn't match \d, you know there is no digit in it and you can skip all regexes that require one. So (for instance) instead of matching against 300+ regexes, you can narrow that down to just 25.
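The grouping idea can be sketched roughly as follows. This is only an illustration: the function name classifyGrouped, the 'guard' and 'patterns' keys, and the two example groups are assumptions made for this sketch, not part of the answer.

```php
<?php
// Hypothetical helper: each group pairs a cheap guard with the regexes
// that can only match when the guard passes.
function classifyGrouped(string $input, array $groups): ?string
{
    foreach ($groups as $group) {
        if (!$group['guard']($input)) {
            continue; // skips every regex in this group at once
        }
        foreach ($group['patterns'] as $type => $regex) {
            if (preg_match($regex, $input)) {
                return $type;
            }
        }
    }
    return null; // no group produced a match
}

// Example groups (assumed for illustration only).
$groups = [
    [
        // only strings containing a digit can match these patterns
        'guard' => function (string $s): bool {
            return strpbrk($s, '0123456789') !== false;
        },
        'patterns' => ['type1' => '/^[0-9]$/'],
    ],
    [
        // only strings of exactly 300 characters can match these patterns
        'guard' => function (string $s): bool {
            return strlen($s) === 300;
        },
        'patterns' => ['type2' => '/^[a-x]{300}$/'],
    ],
];
```

The guards should be much cheaper than the regexes they protect (a length check, strpbrk, ctype_digit), otherwise the grouping gains nothing.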

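You can combine your regexes into one. The alternatives need a group around them so that ^ and $ apply to every alternative rather than only the first and the last:
^(?:([0-9])|([a-x]{300}))$
Later, if you get more regexes, you can extend it:
^(?:([0-9])|([a-x]{300})|([x-z]{1,5})|([ab]{2,})...)$
Then use this code:
$input=...
if (preg_match('#^(?:([0-9])|([a-x]{300}))$#', $input, $matches)) {
    // a group that did not take part in the match is either missing or an empty string
    if (isset($matches[1]) && $matches[1] !== '') {
        // type 1
    } else if (isset($matches[2]) && $matches[2] !== '') {
        // type 2
    }
    // and so on...
}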
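A more general version of the combined-pattern idea might look like the sketch below. The function name classify, the generated group names t0, t1, ... and the labels in the usage note are assumptions for illustration. It relies on the PREG_UNMATCHED_AS_NULL flag, which needs PHP 7.2 or later, and it assumes the pattern bodies contain no ~ delimiter and no named groups of their own.

```php
<?php
// Hypothetical helper: $patterns maps a type label to a regex body
// (no delimiters, no ^ or $ anchors).
function classify(string $input, array $patterns): ?string
{
    $parts = [];
    $labels = [];
    $i = 0;
    foreach ($patterns as $label => $body) {
        $group = 't' . $i++;          // generated group name: t0, t1, ...
        $labels[$group] = $label;
        $parts[] = "(?<$group>$body)";
    }
    // One anchored regex with every pattern as a named alternative.
    $regex = '~^(?:' . implode('|', $parts) . ')$~';

    if (!preg_match($regex, $input, $m, PREG_UNMATCHED_AS_NULL)) {
        return null;
    }
    // The first named group that took part in the match tells us the type.
    foreach ($labels as $group => $label) {
        if (isset($m[$group])) {
            return $label;
        }
    }
    return null;
}
```

For example, classify('7', ['digit' => '[0-9]', 'letters' => '[a-x]{300}']) returns 'digit'. Whether one big alternation is actually faster than a loop depends on the patterns, so it is worth measuring on real data.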

Since the regexes are going to be changing, I don't think you can get a generic answer - both your regex(es) and the way you handle them will need to evolve. For now, if you're looking to optimize the processing of your script, test for known substrings before evaluating the regexes, using something like indexOf to lighten the regex load.
For instance, if you have 4 strings:
asdfsdfkjslkdujflkj2lkjsdlkf2lkja
100010010100111010100101001001011
101032021309420940389579873987113
asdfkajhslkdjhflkjshdlfkjhalksjdf
Each belongs to a different "type" as you've described it, so you could do:
// type 1 contains only 0 or 1
// type 2 must have a "2"
// type 3 contains only letters
var arr = [
    "asdfsdfkjslkdujflkj2lkjsdlkf2lkja",
    "100010010100111010100101001001011",
    "101032021309420940389579873987113",
    "asdfkajhslkdjhflkjshdlfkjhalksjdf"
];
for (var i = 0; i < arr.length; i++) {
    var s = arr[i];
    // indexOf returns -1 when the substring is absent; 0 is a valid hit at the start
    if (s.indexOf('2') !== -1) {
        // type 2
    } else if (s.indexOf('0') !== -1) {
        if (/^[01]+$/.test(s)) {
            // type 1
        } else {
            // ignore
        }
    } else if (/^[a-z]+$/i.test(s)) {
        // type 3
    }
}
See it in action here: http://jsfiddle.net/remus/44MdX/

Related

Detect common Password/PIN

I made a PIN authentication for my website, and I don't want my users to pick common PINs like 12345, 11111, 121212, etc.
I tried this:
if($PIN=="111111" || $PIN=="222222"){
echo "This PIN is common";
}
But I think listing every PIN this way will be too long for a simple function.
How can I simplify it?

Your problem is actually quite simple: you want, for example, to reject PINs that repeat the same character several times in a row and/or PINs that use the same character more than X times anywhere in the string.
With a regex this is easy to achieve. For example, preg_match in the following returns 1 if the same character appears three or more times in a row.
<?php
$pin = '111025';
// (.) captures one character; \1{2} requires two more copies of it right after
if ( preg_match( '/(.)\1{2}/', $pin ) ) {
    echo "This PIN is common";
}
?>
Learn more
RegEx.
A regular expression is a sequence of characters that forms a search pattern. When you search for data in a text, you can use this search pattern to describe what you are searching for.
Function / expression | Description
preg_match() | Returns 1 if the pattern was found in the string and 0 if not
( ) | You can use parentheses ( ) to apply quantifiers to entire patterns. They also can be used to select parts of the pattern to be used as a match
. | Find just one instance of any character
n{x} | Matches any string that contains a sequence of X n's
Source: PHP RegEx, https://www.w3schools.com/php/php_regex.asp
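The question also mentions PINs such as 12345 and 121212, which the repeated-character check alone does not catch. A minimal sketch that adds two more checks is shown below; the function name isCommonPin and the choice of rules are assumptions for illustration, not an exhaustive definition of a weak PIN.

```php
<?php
// Hypothetical helper: flags a few obvious PIN shapes.
function isCommonPin(string $pin): bool
{
    // Same character three or more times in a row, e.g. 111025.
    if (preg_match('/(.)\1{2}/', $pin)) {
        return true;
    }
    // Whole PIN is one block repeated, e.g. 121212 or 111111.
    if (preg_match('/^(.+)\1+$/', $pin)) {
        return true;
    }
    // Ascending or descending run of digits, e.g. 12345 or 54321.
    return strpos('0123456789', $pin) !== false
        || strpos('9876543210', $pin) !== false;
}
```

A fixed list of the most frequently used PINs could be checked with in_array in addition to these rules.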

Delete multiple file for/while

I have a PHP pull-down menu where I select an item and delete all files associated with it.
It works well when there are only 5 or 6 items. After I put in the first 4 to test it and got it working, I realized it could take a very long time to enter a couple hundred, and that would bloat the script.
I don't know enough about for and while loops. Is there anyone who might have a way to help?
There will never be more than one set deleted at a time.
Thanks in advance.
<?php
$workitem = $_POST["workitem"];
$workdirPath = "/var/work.files/";
if($workitem == 'item1.php')
{
unlink("$workdirPath/page1.php");
unlink("$workdirPath/temp1.php");
unlink("$workdirPath/all1.php");
}
if($workitem == 'item2.php')
{
unlink("$workdirPath/page2.php");
unlink("$workdirPath/temp2.php");
unlink("$workdirPath/all2.php");
}
if($workitem == 'item3.php')
{
unlink("$workdirPath/page3.php");
unlink("$workdirPath/temp3.php");
unlink("$workdirPath/all3.php");
}
if($workitem == 'item4.php')
{
unlink("$workdirPath/page4.php");
unlink("$workdirPath/temp4.php");
unlink("$workdirPath/all4.php");
}
?>

Some simple pattern matching and substitution is all you need here.
First, the code:
if (preg_match('/^item(\d+)\.php$/', $workitem, $matches)) {
    $number = $matches[1];
    foreach(array('page','temp','all') as $base) {
        unlink("$workdirPath/$base$number.php");
    }
} else {
    # unrecognized work item value; complain to user or whatever
}
The preg_match function takes a pattern, a string, and an array. If the string matches the pattern, the parts that match are stored in the array. The particular type of pattern is a Perl 5-compatible regular expression, which is where the preg_ part of the name comes from (p for Perl, reg for regular expression).
Regular expressions are scary-looking to the uninitiated, but they're a handy way to scan a string and get some values out of it. Most characters just represent themselves; the string "foo" matches the regular expression /foo/. But some characters have special meanings that let you make more general patterns to match a whole set of strings where you don't have to know ahead of time exactly what's in them.
The /s just mark the beginning and end of the actual regular expression; they're there because you can stick additional modifier flags inside the string along with the expression itself, after the closing slash.
The ^ and $ represent the beginning and end of the string. /foo/ matches "foo", but also "foobar", "bunnyfoofoo", and so on - any string that contains "foo" will match. But /^foo$/ matches only "foo" exactly.
\d means "any digit". + means "one or more of that last thing". So \d+ means "one or more digits".
The period (.) is special; it matches any character at all. Since we want a literal period, we have to escape it with a backslash; \. just matches a period.
So our regular expression is '/^item\d+\.php$/', which will match any itemnumber.php filename. But that's not quite enough. The preg_match function is basically a binary test: does the string match the pattern or not, yes or no? In this case, it's not enough to just say "yup, the string is valid"; we need to know which items specifically the user specified. That's what capture groups are for. We use parentheses to say "remember what matched this part", and provide an array name that gets filled with those remembrances.
The part of the string that matches the whole regular expression (which may not be the whole string, if the regular expression isn't anchored with ^...$ like this one is) is always put in element 0 of the array. If you use parentheses in the regular expression, then the part of the string that matches the part of the regular expression inside the first pair of parentheses is stored in element 1 of the array; if there's a second set of parentheses, the matching part of the string goes in element 2 of the array, and so on.
So we put parentheses around our number ((\d+)) and then the actual number will be remembered in element 1 of our $matches array.
Great, we have a number. Now we just need to use it to build up the filenames we want to delete.
In each case, we want to delete three files: page$n.php, temp$n.php, and all$n.php, where $n is the number we extracted above. We could just put three unlink calls, but since they're all so similar, we can use a loop instead.
Take the different prefixes that are the same no matter the number, and make an array out of them. Then loop over that array. In the body of the loop, the variable $base will contain whichever element of the array it's currently on. Stick that between the $workdirPath prefix and the $number we got from the match, append .php, and that's your file. unlink it and go back to the top of the loop to grab the next one.
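To see what ends up in the array before deleting anything, a dry run along these lines can help. The sample value item12.php and the $files list are assumptions for illustration; the pattern and the prefixes are the ones from the answer.

```php
<?php
// Sample input (assumed); in the real script this comes from $_POST.
$workitem = 'item12.php';
$files = [];

if (preg_match('/^item(\d+)\.php$/', $workitem, $matches)) {
    // $matches[0] is the whole match, $matches[1] the first capture group
    $number = $matches[1];
    foreach (['page', 'temp', 'all'] as $base) {
        // collect the names instead of calling unlink()
        $files[] = $base . $number . '.php';
    }
}

print_r($matches);
print_r($files);
```

Because the pattern is anchored and only allows digits in the variable part, a value such as ../../etc/passwd never reaches unlink.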

How to find if two characters are in an array php

I am looking to develop a search function that allows users to just search for the item, or modify their search with a price range in brackets. So that is to say if they are looking for a car, then they can enter either car and receive all cars in the database or they can enter car (100, 299) or car(100, 299) and receive only cars in the database with the price range of 100 to 299.
What I did before was three different explode function calls, but that was cumbersome and looked ridiculously ugly. I also tried to put the brackets in an array and then compare that against the word searched (a word is basically an array of characters) but that didn't work. Finally I have been reading up on strpos and substr but they don't seem to fit the requirements, as strpos returns the first occurrence of the character and substr returns the characters within a specified length after a specific occurrence.
So for example the problem with strpos is the user can just enter ( and no ) bracket and I'll make a call to my search function with who knows what. And for example the problem with substr is that the price range can vary wildly.

You can use preg_match to parse the search string - I'm assuming that's the part you're having issues with.
if (preg_match('/car ?\(([^,]+), ?([^\)]+)\)/', $search_text, $matches)) {
$low_price = $matches[1];
$high_price = $matches[2];
//do your price filtering here
}
The regular expression may need a little tweaking, I don't remember offhand if parentheses need to be escaped in character classes.

Yes, Sam is right. You should do this with regular expressions.
Look up preg_match() in the documentation.
To complete his answer, the regular expression for your case is:
$regex = '/^([a-zA-Z]+)\s*\(\s*([0-9]+)\s*,\s*([0-9]+)\s*\)$/';
if (preg_match($regex, $search_text, $matches)) {
    $type = $matches[1];
    $low_price = $matches[2];
    $high_price = $matches[3];
    //do your price filtering here
}
Be careful: $matches[0] holds the whole matched string, so the captured groups start at index 1, not 0.
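Since the search may also be a bare word such as car with no price range, one option is to make the bracketed part optional. The sketch below is only an illustration; the function name parseSearch and the shape of the returned array are assumptions.

```php
<?php
// Hypothetical helper: returns the item and an optional price range,
// or null when the search text has an unexpected shape (e.g. "car (100").
function parseSearch(string $search): ?array
{
    // word, then optionally "(low, high)" with flexible whitespace
    $regex = '/^\s*([a-zA-Z]+)\s*(?:\(\s*(\d+)\s*,\s*(\d+)\s*\))?\s*$/';
    if (!preg_match($regex, $search, $m)) {
        return null;
    }
    return [
        'item' => $m[1],
        // groups 2 and 3 are absent from $m when no range was given
        'low'  => isset($m[2]) ? (int) $m[2] : null,
        'high' => isset($m[3]) ? (int) $m[3] : null,
    ];
}
```

An unbalanced input such as car (100 fails the match as a whole, so the search function is never called with a half-parsed range.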

PHP Regex Check if two strings share two common characters

I'm just getting to know regular expressions, but after doing quite a bit of reading (and learning quite a lot), I still have not been able to figure out a good solution to this problem.
Let me be clear, I understand that this particular problem might be better solved not using regular expressions, but for the sake of brevity let me just say that I need to use regular expressions (trust me, I know there are better ways to solve this).
Here's the problem. I'm given a big file, each line of which is exactly 4 characters long.
This is a regex that defines "valid" lines:
"/^[AB][CD][EF][GH]$/m"
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
What I'm trying to do is given one of those lines, match all other lines that contain 2 or more common characters.
The below example assumes the following:
$line is always a valid format
BigFileOfLines.txt contains only valid lines
Example:
// Matches all other lines in string that share 2 or more characters in common
// with "$line"
function findMatchingLines($line, $subject) {
$regex = "magic regex I'm looking for here";
$matchingLines = array();
preg_match_all($regex, $subject, $matchingLines);
return $matchingLines;
}
// Example Usage
$fileContents = file_get_contents("BigFileOfLines.txt");
$matchingLines = findMatchingLines("ACFG", $fileContents);
/*
* Desired return value (Note: this is an example set, there
* could be more or less than this)
*
* BCEG
* ADFG
* BCFG
* BDFG
*/
One way I know that will work is to have a regex like the following (the following regex would only work for "ACFG"):
"/^(?:AC.{2}|.CF.|.{2}FG|A.F.|A.{2}G|.C.G)$/m"
This works alright, and performance is acceptable. What bothers me about it though is that I have to generate this based off of $line, where I'd rather have it be ignorant of what the specific parameter is. Also, this solution doesn't scale terribly well if later the code is modified to match, say, 3 or more characters, or if the size of each line grows from 4 to 16.
It just feels like there's something remarkably simple that I'm overlooking. Also seems like this could be a duplicate question, but none of the other questions I've looked at really seem to address this particular problem.
Thanks in advance!
Update:
It seems that the norm with Regex answers is for SO users to simply post a regular expression and say "This should work for you."
I think that's kind of a halfway answer. I really want to understand the regular expression, so if you can include in your answer a thorough (within reason) explanation of why that regular expression:
A. Works
B. Is the most efficient (I feel there are a sufficient number of assumptions that can be made about the subject string that a fair amount of optimization can be done).
Of course, if you give an answer that works, and nobody else posts the answer *with* a solution, I'll mark it as the answer :)
Update 2:
Thank you all for the great responses, a lot of helpful information, and a lot of you had valid solutions. I chose the answer I did because after running performance tests, it was the best solution, averaging equal runtimes with the other solutions.
The reasons I favor this answer:
The regular expression given provides excellent scalability for longer lines
The regular expression looks a lot cleaner, and is easier for mere mortals such as myself to interpret.
However, a lot of credit goes to the below answers as well for being very thorough in explaining why their solution is the best. If you've come across this question because it's something you're trying to figure out, please give them all a read, helped me tremendously.

Why don't you just use this regex $regex = "/.*[$line].*[$line].*/m";?
For your example, that translates to $regex = "/.*[ACFG].*[ACFG].*/m";

This is a regex that defines "valid" lines:
/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m
In english, each line has either A or B at position 0, either C or D
at position 1, either E or F at position 2, and either G or H at
position 3. I can assume that each line will be exactly 4 characters
long.
That's not what that regex means. That regex means that each line has either A or B or a pipe at position 0, C or D or a pipe at position 1, etc; [A|B] means "either 'A' or '|' or 'B'". The '|' only means 'or' outside of character classes.
Also, {1} is a no-op; lacking any quantifier, everything has to appear exactly once. So a correct regex for the above English is this:
/^[AB][CD][EF][GH]$/
or, alternatively:
/^(A|B)(C|D)(E|F)(G|H)$/
That second one has the side effect of capturing the letter in each position, so that the first captured group will tell you whether the first character was A or B, and so on. If you don't want the capturing, you can use non-capture grouping:
/^(?:A|B)(?:C|D)(?:E|F)(?:G|H)$/
But the character-class version is by far the usual way of writing this.
As to your problem, it is ill-suited to regular expressions; by the time you deconstruct the string, stick it back together in the appropriate regex syntax, compile the regex, and do the test, you would probably have been much better off just doing a character-by-character comparison.
I would rewrite your "ACFG" regex thus: /^(?:AC..|A.F.|A..G|.CF.|.C.G|..FG)$/, but that's just appearance; I can't think of a better solution using regex. (Although as Mike Ryan indicated, it would be better still as /^(?:A(?:C..|.F.|..G)|.C(?:F.|.G)|..FG)$/ - but that's still the same solution, just in a more efficiently-processed form.)
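The character-by-character comparison mentioned above could be sketched like this. The names commonPositions and findMatchingLines, the $required parameter, and passing the lines as an array are assumptions for illustration.

```php
<?php
// Hypothetical helper: counts positions where both strings hold the same character.
function commonPositions(string $a, string $b): int
{
    $n = min(strlen($a), strlen($b));
    $count = 0;
    for ($i = 0; $i < $n; $i++) {
        if ($a[$i] === $b[$i]) {
            $count++;
        }
    }
    return $count;
}

// Returns every other line sharing at least $required positions with $line.
function findMatchingLines(string $line, array $lines, int $required = 2): array
{
    $keep = function ($candidate) use ($line, $required) {
        return $candidate !== $line
            && commonPositions($line, $candidate) >= $required;
    };
    return array_values(array_filter($lines, $keep));
}
```

This version needs no pattern generation, and changing the threshold or the line length does not change the code.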

You've already answered how to do it with a regex, and noted its shortcomings and inability to scale, so I don't think there's any need to flog the dead horse. Instead, here's a way that'll work without the need for a regex:
function findMatchingLines($line) {
    static $file = null;
    // FILE_IGNORE_NEW_LINES keeps the trailing "\n" out of the comparison
    if ($file === null) $file = file("BigFileOfLines.txt", FILE_IGNORE_NEW_LINES);
    $search = str_split($line);
    $result = array();
    foreach ($file as $l) {
        $test = str_split($l);
        $matches = count(array_intersect($search, $test));
        if ($matches >= 2) // define number of matches required here - optionally make it an argument
            $result[] = $l;
    }
    // empty array if nothing matched
    return $result;
}

There are 6 possibilities that at least two characters match out of 4: MM.., M.M., M..M, .MM., .M.M, and ..MM ("M" meaning a match and "." meaning a non-match).
So, you need only to convert your input into a regex that matches any of those possibilities. For an input of ACFG, you would use this:
"/^(AC..|A.F.|A..G|.CF.|.C.G|..FG)$/m"
This, of course, is the conclusion you're already at--so good so far.
The key issue is that Regex isn't a language for comparing two strings, it's a language for comparing a string to a pattern. Thus, either your comparison string must be part of the pattern (which you've already found), or it must be part of the input. The latter method would allow you to use a general-purpose match, but does require you to mangle your input.
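function findMatchingLines($line, $subject) {
    // The lookahead captures the four characters of the last line (the appended $line);
    // the alternation then compares the current line against them.
    $regex = '/^(?=[\s\S]*\n([AB])([CD])([EF])([GH])\z)'
        . '(?:\1\2..|\1.\3.|\1..\4|.\2\3.|.\2.\4|..\3\4)$/m';
    $matchingLines = array();
    preg_match_all($regex, rtrim($subject, "\n") . "\n" . $line, $matchingLines);
    return $matchingLines[0];
}
What this function does is append the line you want to match against to your input string, then use a pattern that compares each earlier line to that last line's 4 characters (that's the lookahead running ahead to \z, the very end of the input). The comparison line has to sit at the end because PCRE allows a lookahead of any length but not an unbounded lookbehind, so a line can look forward to the reference but not back to it. Note that the lookahead scans to the end of the input for every line, so this gets slow on big files.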
If you also want to validate those matching lines against the "rules", just replace the . in each pattern to the appropriate character class (\1\2[EF][GH], etc.).

People may be confused by your first regex. You give:
"/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m"
And then say:
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
But that's not what that regex means at all.
This is because the | operator has the lowest precedence here. So, what that regex really says, in English, is: Either A or | or B at the start of a line, OR C or | or D anywhere, OR E or | or F anywhere, OR G or | or H at the end of a line.
This is because [A|B] means a character class with one of the three given characters (including the |), because {1} means one character (it is also completely superfluous and could be dropped), and because the outer |s alternate between everything around them, so the ^ belongs only to the first alternative and the $ only to the last. In my English expression above each capitalized OR stands for one of your alternating |'s.
To get your English description as a regex, you would want:
/^[AB][CD][EF][GH]$/
The regex will go through and check the first position for A or B (in the character class), then check C or D in the next position, etc.
--
EDIT:
You want to test for only two of these four characters matching.
Very strictly speaking, and picking up from @Mark Reed's answer, the fastest regex (after it's been parsed) is likely to be:
/^(?:A(?:C..|.F.|..G)|.C(?:F.|.G)|..FG)$/
as compared to:
/^(AC..|A.F.|A..G|.CF.|.C.G|..FG)$/
This is because of how the regex implementation steps through text. You first test if A is in the first position. If that succeeds, then you test the sub-cases. If that fails, then you're done with all those possible cases (of which there are 3). If you don't yet have a match, you then test if C is in the 2nd position. If that succeeds, then you test for the two subcases. And if none of those succeed, you test for FG in the 3rd and 4th positions.
This regex is specifically created to fail as fast as possible. Listing each case out separately means that, to fail, you would have to test 6 different cases (each of the six alternatives), instead of 3 cases (at a minimum). And in cases of A not being in the first position, you would immediately go to test the 2nd position, without hitting it two more times. Etc.
(Note that I don't know exactly how PHP compiles regex's -- it's possible that they compile to the same internal representation, though I suspect not.)
--
EDIT: One additional point. Fastest regex is a somewhat ambiguous term. Fastest to fail? Fastest to succeed? And given what possible range of sample data of succeeding and failing rows? All of these would have to be clarified to really determine what criteria you mean by fastest.

Here's something that uses Levenshtein distance instead of regex and should be extensible enough for your requirements:
$lines = array_map('rtrim', file('file.txt')); // load file into array removing \n
$common = 2; // number of common characters required
$match = 'ACFG'; // string to match
$matchingLines = array_filter($lines, function ($line) use ($common, $match) {
// error checking here if necessary - $line and $match must be same length
return (levenshtein($line, $match) <= (strlen($line) - $common));
});
var_dump($matchingLines);

I bookmarked the question yesterday in the evening to post an answer today, but seems that I'm a little late ^^ Here is my solution anyways:
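/^[^ACFG\n]*+(?:[ACFG][^ACFG\n]*+){2,}$/m
It looks for two or more occurrences of one of the ACFG characters surrounded by any other characters. The \n inside the negated classes keeps a match from running across line breaks. The loop is unrolled and uses possessive quantifiers, to improve performance a bit.
Can be generated using:
function getRegexMatchingNCharactersOfLine($line, $num) {
    // concatenation avoids PHP reading {$num} as variable interpolation syntax
    return '/^[^' . $line . '\n]*+(?:[' . $line . '][^' . $line . '\n]*+){' . $num . ',}$/m';
}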

I need a PHP regular expression to validate string format of 5 digits, one comma

I have a huge PHP input box on a webpage. This input should only take 5 digit string separated by commas:
00100,00247,90277,97030,00657
notice the last one has no comma at the end.
Is there a regular expression that can do this? Since the input box is very large and can take 100+ of these items, I want to validate it on the PHP server side before the database is queried and thus avoid any SQL injection attempts.
Query is only run if only 5 numbers and a comma in the sequence, except for the last one.
These are a state's public water system ID's by the way.

I believe this will get the result you're looking for, though explode may be the better option.
/^(?:\d{5},)*\d{5}$/
This will only match 1 or more 5-digit numbers that are comma delimited with no spaces.

Since this is user submitted data, your validation should be more flexible. What if the user accidentally puts a space after one of the commas? Or a line break gets inserted?
I realize you are looking for a regex solution but may I suggest using explode to create an array and apply a rule to each element. Having them separated into elements allows more flexibility when validating and storing:
$nums = explode(',', '00100,00247,90277,97030,00657');
foreach ($nums as $num) {
if (!preg_match('/^\d{5}$/', trim($num))) {
// error!
}
}

I'd explode it and validate each string individually:
$input = '00100,00247,90277,97030,00657';
$input_array = explode(',', $input);
$is_valid = true;
foreach ($input_array as $number) {
if (!preg_match('/^\d{5}$/', trim($number))) {
$is_valid = false;
}
}
print($is_valid);

I think you rather need str_getcsv, which takes the submitted string (here assumed to be in $input) and splits it on commas:
$row = str_getcsv($input);
// $row is an array containing your digits

Simple. This regex matches a value having one or more comma separated 5-digit numbers:
if (preg_match('/^\d{5}(\s*,\s*\d{5})*$/', $value)) {
// Good value
}
It allows whitespace between the numbers as well.
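Validation and splitting can be combined so that the caller gets either a clean list of IDs or nothing. The following is a minimal sketch; the function name parseSystemIds is an assumption, and it additionally tolerates whitespace at the very start and end of the input.

```php
<?php
// Hypothetical helper: returns the list of 5-digit IDs, or null if the
// input is not a comma-separated list of 5-digit numbers.
function parseSystemIds(string $value): ?array
{
    if (!preg_match('/^\s*\d{5}(?:\s*,\s*\d{5})*\s*$/', $value)) {
        return null;
    }
    // The regex has already guaranteed the shape; trim drops the allowed whitespace.
    return array_map('trim', explode(',', $value));
}
```

Even with validated input, passing the IDs to the database through prepared statements remains the more reliable protection against SQL injection.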

This might work:
/^\d{5}(?:,\d{5})*$/
edit 1 noticed ridgerunner has the same answer, so disregard this.
edit 2 some notes on performance.
Failure analysis
Backtracking give back on failure:
^\d{5}(?:,\d{5})*$ gives back ,\d{5}
^(?:\d{5},)*\d{5}$ gives back \d{5},
Post Backtracking regressive topography checks:
(After backtracking give back, checks are to the right of the one that gave back)
^\d{5}(?:,\d{5})*$ checks for $
^(?:\d{5},)*\d{5}$ checks for \d{5}$
Winner: ^\d{5}(?:,\d{5})*$
NON-Backtracking regexes (using the possessive quantifier *+):
^\d{5}(?:,\d{5})*+$ gives nothing back, fails immediately
^(?:\d{5},)*+\d{5}$ gives nothing back fails immediately
Benchmarks
Using a string of 50 blocks of \d{5},.
The sample string is matched against each regex in a loop of 100,000 times.
Failure was induced at the end of the string, and removed for a success test.
Success:
All took 1 second to complete a successful run.
Failure, Backtracking:
^\d{5}(?:,\d{5})*$ took 1.2 seconds best
^(?:\d{5},)*\d{5}$ took 1.6 seconds
Failure, Non-Backtracking:
^\d{5}(?:,\d{5})*+$ took .9 seconds
^(?:\d{5},)*+\d{5}$ took .9 seconds
Conclusions
Backtracking - Put the smallest post-backtracking check
after the backtracking sub-expression. In this case, the
smallest is $.
In general, put the required expressions ahead of the optional ones.
Best ^\d{5}(?:,\d{5})*$
NON-Backtracking - It doesn't matter.
^\d{5}(?:,\d{5})*+$ or ^(?:\d{5},)*+\d{5}$
