I would like to validate whether a user input string is in the correct form for further processing / database update.
Form:
elephant1:elephant2:elephant3;cat1:cat2:cat3;unicorn1:unicorn2:unicorn3
: as separator between siblings and ; as separator between groups of siblings
Rules:
There are ALWAYS 3 siblings. Since this is meant just for personal bulk import, I just want to avoid mistakes with very long strings. As for the groups, there can be one or more, so the group separator is not obligatory. Sibling names are letters only, with the exception of underscore (_) standing in for spaces when a name has two or more words.
I was thinking regex, but I am not very familiar with it. If there are any other, simpler ways to achieve this, please suggest.
Valid examples
Any number of groups, separated by semicolons, each containing exactly three (3) members separated by colons. As mentioned before, names are letters only, with the exception of underscore as a space for names with multiple words.
VALIDS:
john:mike:dave;jenny:helen:jessica
dog:cat:frog;car:boat:ship;house:flat:shack
meat:vegetable:fruit
UPDATE:
This is what I came up with while trying to understand your answers. It works fine so far:
"/^(([a-z]+:[a-z]+:[a-z]+;?)+)$/"
Upgraded to Roman's answer
"/([a-z_]+:[a-z_]+:[a-z_]+;?)+/i"
allowing function to ignore spaces, tabs and allow underscores where items have multiple words.
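The solution using the preg_match function with an anchored regex pattern:
$str = 'dog:cat:frog;car:boat:ship;house:flat:shack';
// ^ and $ force the whole string to match; ";" is only allowed between groups
if (preg_match("/^[a-z_]+:[a-z_]+:[a-z_]+(?:;[a-z_]+:[a-z_]+:[a-z_]+)*$/i", $str)) {
    echo 'valid';
} else {
    echo 'invalid';
}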
^(?:[a-zA-Z_]+:[a-zA-Z_]+:[a-zA-Z_]+(?:;(?!$)|$))+$ (demo, with multiline flag on)
^ # Anchors to beginning of string
(?: # Opens non-capturing group
[a-zA-Z_]+ # Any number of letters/underscore, one or more times
: # Literal :
[a-zA-Z_]+ # Any number of letters/underscore, one or more times
: # Literal :
[a-zA-Z_]+ # Any number of letters/underscore, one or more times
(?: # Opens non-capturing group
; # Literal ;
(?!$) # Negative Lookahead, ensuring that semi-colons are not at the end of line
| # Or
$ # End of string
) # Closes non-capturing group
)+ # Repeats overall non-capturing-group one or more times
$ # Anchors to end of string
You didn't specify whether siblings could be 0 characters long. If that's the case, change each [a-zA-Z_]+ to [a-zA-Z_]*
// PHP Code generated by Regex101.
$re = '/^(?:[a-zA-Z_]+:[a-zA-Z_]+:[a-zA-Z_]+(?:;(?!$)|$))+$/m';
$str = 'a_b:bread:stack_overflow;test:this_thing:jane;Get_me:h:down
ab:bread:stack_overflow;test:this_thing:jane;Get_me:h:down
a_b:any other characters break it:stack_overflow;test:this_thing:jane;Get_me:h:down
a_b:bread:format_messed_up-test:this_thing:jane;Get_me:h:down
a_b:bread:stack_overflow;test:this_thing:jane;semi_colon_at_end;';
preg_match_all($re, $str, $matches);
// Print the entire match result
print_r($matches);
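Since the question also asks for simpler alternatives to one large pattern, here is a minimal sketch that splits the string first and checks each piece. The function name isValidSiblingString is made up for illustration, and the sketch assumes names must be at least one character long.

```php
<?php
// Hypothetical helper (not from the answers above): validates the
// "a:b:c;d:e:f" format by splitting instead of using one large pattern.
function isValidSiblingString(string $input): bool
{
    // Groups are separated by ";". A trailing ";" produces an empty
    // group, which fails the sibling count check below.
    foreach (explode(';', $input) as $group) {
        $siblings = explode(':', $group);
        // Each group must contain exactly three siblings.
        if (count($siblings) !== 3) {
            return false;
        }
        foreach ($siblings as $name) {
            // Letters and underscores only, at least one character.
            if (preg_match('/^[a-zA-Z_]+$/D', $name) !== 1) {
                return false;
            }
        }
    }
    return true;
}

var_dump(isValidSiblingString('dog:cat:frog;car:boat:ship')); // bool(true)
var_dump(isValidSiblingString('dog:cat;car:boat:ship'));      // bool(false)
```

The per-name check is still a regex, but a very small one, and the group and sibling counts are handled by plain PHP.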
Related
I'm trying to replace only the string between two symbols, and only if that string contains a specific word. For example:
$string = '%Test% %font-style:italic; font-weight:bold;%'; //It Can be with different orders such as
$string = '%Test% %font-weight:bold; font-style:italic;%';
So the string I want to run preg_replace on is the one between the two % symbols, and I want to replace it only if it contains a CSS declaration such as font-style:italic; color:red; font-weight:bold; etc. I've tried
$string = preg_replace('`\%(.*?)((.*?):(.*?);)(.*?)\%`si', '(span style="$2$5")', $string); // ( used as start tag html symbol
But It caused a problem when I used it for
http://localhost/NaiTreNo/Games/Games/BatMan%20Arkham%20Knight/Image/Cover.jpg :D %color:blue; font-weight:bold;%
it should return:
http://localhost/NaiTreNo/Games/Games/BatMan%20Arkham%20Knight/Image/Cover.jpg :D
<span style="color:blue; font-weight:bold;">
But it returned:
http://localhost/NaiTreNo/Games/Games/BatMan<span style="20Arkham%20Knight/Image/Cover.jpg :D %color:blue; font-weight:bold;%">
Please help.
Your regex is too loose. It only checks for the presence of : and ; somewhere inside the string. I would use the knowledge that CSS property names have a specific format to write a regex that doesn't match just any string that happens to contain : and ;.
For example, something like this:
#%(([a-z]+(-[a-z]+){0,2}: *[^;]+;)+)(.*?)%#si
A CSS property name starts with a word containing one or more lowercase letters [a-z]+, followed by zero, one or two more words, each of them preceded by a dash (-[a-z]+){0,2}.
A rule to restrict the too-accepting .*? used for values can also be created, but the outcome isn't worth the effort (and the regex becomes difficult to understand).
How the regex works:
% # your custom boundary start symbol
( # start of group #1 used to capture the CSS rules
( # start of group #2 that captures a single CSS rule
[a-z]+ # first word of CSS property name
(-[a-z]+){0,2} # 0-2 more words, separated with dash (-)
: * # the colon followed by optional white spaces
[^;]+; # anything until the first semicolon (at least one character)
)+ # end of group #2; it can repeat; at least one occurrence is required
) # end of group #1
(.*?) # captures everything after the last semicolon
% # your custom boundary end symbol
The regex above doesn't match when there is only one CSS property and its value is not followed by a semicolon, e.g. %color: red%. To fix this, the + after group #2 would have to be replaced with * (to match zero or more CSS rules ending in ;), but then the trailing (.*?) would match anything, including Test or the URL in your examples.
This can be fixed by replacing .*? in the last group with the content of group #2 without the trailing ;. The expression then becomes longer and harder to understand, so I won't post it here. You had better make sure your CSS rules always end with a semicolon (;), including the last one.
A playground for this regex can be found at: https://regex101.com/r/vC1oS2/3
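For reference, a minimal sketch of how the pattern above could be plugged into preg_replace. The function name styleMarkersToSpan is made up for illustration. Note that with this pattern the rules captured by group #1 end up in $1 and the remainder before the closing % in $4 (groups #2 and #3 are inner groups).

```php
<?php
// Hypothetical wrapper around the pattern from the answer above.
function styleMarkersToSpan(string $text): string
{
    $pattern = '#%(([a-z]+(-[a-z]+){0,2}: *[^;]+;)+)(.*?)%#si';
    // $1 = the leading CSS rule(s), $4 = whatever follows up to the closing %.
    return preg_replace($pattern, '<span style="$1$4">', $text);
}

$url = 'http://localhost/NaiTreNo/Games/Games/BatMan%20Arkham%20Knight/Image/Cover.jpg';
echo styleMarkersToSpan($url . ' :D %color:blue; font-weight:bold;%'), "\n";
echo styleMarkersToSpan('%Test% %font-style:italic; font-weight:bold;%'), "\n";
```

The %20 sequences in the URL are left alone because a digit cannot start a CSS property name, and %Test% is left alone because it contains no colon.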
I can propose to add a space at the beginning.
$string = preg_replace('`(^| )\%(.*?)((.*?):(.*?);)(.*?)\%`si', ' <span style="$2$5">', $string); // ( used as start tag html symbol
Try at https://3v4l.org/uubF7
I need help to write a regular expression to match numbers which may be broken up into sections by spaces or dashes e.g:
606-606-606
606606606
123 456-789
However, matches should be rejected if all the digits of the number are identical (or if there are any other characters besides [0-9 -]):
111 111 111
111111111
123456789a
If spaces/dashes weren't allowed, the Regex would be simple:
/^(\d)(?!\1*$)\d*$/
But how would I allow dashes and spaces in the number?
EDIT
How would I also allow letters in the same regex (dashes and spaces should still be allowed)? E.g.:
aaaaa - it's not ok
aa-aaa-aaa-aaaaa - it's not OK
ababab - it's OK
ab-ab-ab - it's OK
This rule checks only numbers.
^(?!(?:(\d)\1+[ -]*)+$)\d[\d -]+$
Desired results can be achieved by this Regular Expression:
^(?!(?:(\d)\1+[ -]*)+$)\d[\d -]+$
Live demo
Explanations:
^ # Start of string
(?! # Negative Lookahead to check duplicate numbers
(?: # Non-capturing group
(\d) # Capture first digit
\1+ # More digits same as the one just captured
[ -]* # Any spaces and dashes between
)+ # One or more of what's captured up to now
$ # End of string
) # End of negative lookahead
\d # Start of match with a digit
[\d -]+ # One or more digits/dashes/spaces
$ # End of string
The theory behind this regex is to use a lookaround to check whether the string consists only of repeated digits, based on the captured digit. If the lookaround does not match, the string is accepted.
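One thing to watch with the lookahead above: the digit is captured again on every repetition of the group, so as far as I can tell an input such as 111 222 is rejected while 1 1 1 is accepted. A possible variant, assuming the intent is "reject only when every digit equals the first one", captures the first digit once. This is a sketch and not the pattern from the answer; the function name isAcceptableNumber is made up for illustration.

```php
<?php
// Assumed variant: the first digit is captured once, and the lookahead
// checks whether every later digit (separators aside) repeats it.
function isAcceptableNumber(string $number): bool
{
    // The D modifier makes $ match only at the very end of the string.
    $pattern = '/^(?!(\d)(?:[ -]*\1)*[ -]*$)\d[\d -]*$/D';
    return preg_match($pattern, $number) === 1;
}

var_dump(isAcceptableNumber('123 456-789')); // bool(true)
var_dump(isAcceptableNumber('111 111 111')); // bool(false)
var_dump(isAcceptableNumber('123456789a'));  // bool(false)
```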
Even if you can, I wonder if a regex is the right tool to solve this problem. Just imagine your fellow developers scratching their heads trying to understand your code. How much time do you grant them? Even worse, what if you need to alter the rules?
A small function with some comments could make them happy...
function checkNumberWithSpecialRequirements($number)
{
// ignore given set of characters
$cleanNumber = str_replace([' ', '-'], '', $number);
// handle empty string
if ($cleanNumber == '')
return false;
// check whether non-digit characters are inside
if (!ctype_digit($cleanNumber))
return false;
// check if a character differs from the first (not all equal)
for ($index = 1; $index < strlen($cleanNumber); $index++)
{
if ($cleanNumber[$index] != $cleanNumber[0])
return true;
}
return false;
}
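The same approach extends to the follow-up in the question (letters allowed as well). A minimal sketch, assuming "letters" means ASCII letters and that spaces and dashes are still ignored; the function name hasDistinctCharacters is made up for illustration.

```php
<?php
// Hypothetical extension of the function above to letters and digits.
function hasDistinctCharacters(string $value): bool
{
    // Ignore the allowed separators.
    $clean = str_replace([' ', '-'], '', $value);
    // Reject empty input and anything other than letters and digits.
    if ($clean === '' || !ctype_alnum($clean)) {
        return false;
    }
    // At least two different characters must be present.
    return count(array_unique(str_split($clean))) > 1;
}

var_dump(hasDistinctCharacters('aa-aaa-aaa-aaaaa')); // bool(false)
var_dump(hasDistinctCharacters('ab-ab-ab'));         // bool(true)
```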
I have to extract some data from strings. Unfortunately, the data has a really unfriendly format. I had to create about 15 regular expressions, each placed in a separate preg_replace. It's worth mentioning that they contain many ORs (|). My question is what should I do finally: combine all expressions into one and separate them using | or leave them as is - in individual preg_replace?
Is it very bad practice to keep separate expressions for the sake of clarity? I think I could combine some of the expressions into one, but they would become very complicated and hard to understand.
For example I have:
$itemFullName = preg_replace("#^\b(([a-zA-Z]{1,3})?[0-9]{1,2}(\.|\-|X)[0-9]{1,2}(\s|\.|\-)?(X|x)?\s?[0-9]{1,3}\.?(([0-9]{1,3})?(X[0-9]{1,3})|(\s[0-9]\/[0-9]|\/[0-9]{1,3}))?(\s\#[0-9]{1,3}\/[0-9]{1,3})?)\s#", ' ', $itemFullName, -1, $sum);
Untidy:
For starters your original PHP statement:
$itemFullName = preg_replace("#^\b(([a-zA-Z]{1,3})?[0-9]{1,2}(\.|\-|X)[0-9]{1,2}(\s|\.|\-)?(X|x)?\s?[0-9]{1,3}\.?(([0-9]{1,3})?(X[0-9]{1,3})|(\s[0-9]\/[0-9]|\/[0-9]{1,3}))?(\s\#[0-9]{1,3}\/[0-9]{1,3})?)\s#", ' ', $itemFullName, -1, $sum);
would be much more readable (and maintainable) if you write it in free-spacing mode with comments like so:
Tidy:
$itemFullName = preg_replace("/(?#!php re_item_tidy Rev:20180207_0700)
^ # Anchor to start of string.
\b # String must begin with a word char.
( # $1: Unnecessary group.
([a-zA-Z]{1,3})? # $2: Optional 1-3 alphas.
[0-9]{1,2} # 1-2 decimal digits.
(\.|\-|X) # $3: Either a dot, hyphen or X.
[0-9]{1,2} # One or two decimal digits.
(\s|\.|\-)? # $4: Optional whitespace, dot or hyphen.
(X|x)? # $5: Optional X or x.
\s?[0-9]{1,3}\.? # Optional whitespace, 1-3 digits, optional dot.
( # $6: Optional ??? from 2 alternatives.
([0-9]{1,3})? # Either a1of2 $7: Optional 1-3 digits.
(X[0-9]{1,3}) # $8: X and 1-3 digits.
| ( # Or a2of2 $9: one ??? from 2 alternatives.
\s[0-9]\/[0-9] # Either a1of2.
| \/[0-9]{1,3} # Or a2of2.
) # End $9: one ??? from 2 alternatives.
)? # End $6: optional ??? from 2 alternatives.
( # $10: Optional sequence.
\s\#[0-9]{1,3} # whitespace, hash, 1-3 digits.
\/[0-9]{1,3} # Forward slash, 1-3 digits.
)? # End $10: Optional sequence
) # End $1: Unnecessary group.
\s # End with a single whitespace char.
/x", ' ', $itemFullName, -1, $sum);
Critique:
This regex is really not bad performance-wise. It has a start of string anchor at the start which helps it fail quickly for non-matching strings. It also does not have any backtracking problems. However, there are a few minor improvements which can be made:
There are three groups of alternatives where each of the alternatives consists of only one character - each of these can be replaced with a simple character class.
There are 10 capture groups but preg_replace uses none of the captured data. These capture groups can be changed to be non-capturing.
There are several unnecessary groups which can be simply removed.
Group 2: ([a-zA-Z]{1,3})? can be written more simply as: [a-zA-Z]{0,3}. Group 7 has a similar construct.
The \b word boundary at the start is unnecessary.
With PHP, it's best to enclose regex patterns inside single-quoted strings. Double-quoted strings have many metacharacters that must be escaped. Single-quoted strings only have two: the single quote and the backslash.
There are a few unnecessarily escaped forward slashes.
Note also that you are using the $sum variable to count the number of replacements being made by preg_replace(). Since you have a ^ start of string anchor at the beginning of the pattern, you will only ever have one replacement because you have not specified the 'm' multi-line modifier. I am assuming that you actually do want to perform multiple replacements (and count them in $sum), so I've added the 'm' modifier.
Here is an improved version incorporating these changes:
Tidier:
$itemFullName = preg_replace('%(?#!php/m re_item_tidier Rev:20180207_0700)
^ # Anchor to start of each line (m modifier).
[a-zA-Z]{0,3} # Optional 1-3 alphas.
[0-9]{1,2} # 1-2 decimal digits.
[.X-] # Either a dot, hyphen or X.
[0-9]{1,2} # One or two decimal digits.
[\s.-]? # Optional whitespace, dot or hyphen.
[Xx]? # Optional X or x.
\s?[0-9]{1,3}\.? # Optional whitespace, 1-3 digits, optional dot.
(?: # Optional ??? from 2 alternatives.
[0-9]{0,3} # Either a1of2: Optional 1-3 digits
X[0-9]{1,3} # followed by X and 1-3 digits.
| (?: # Or a2of2: One ??? from 2 alternatives.
\s[0-9]/[0-9] # Either a1of2.
| /[0-9]{1,3} # Or a2of2.
) # End one ??? from 2 alternatives.
)? # End optional ??? from 2 alternatives.
(?: # Optional sequence.
\s\#[0-9]{1,3} # whitespace, hash, 1-3 digits.
/[0-9]{1,3} # Forward slash, 1-3 digits.
)? # End optional sequence
\s # End with a single whitespace char.
%xm', ' ', $itemFullName, -1, $sum);
Note however, that I don't think you'll see much if any performance improvements - your original regex is pretty good. Your performance issues are probably coming from some other aspect of your program.
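The effect of the 'm' modifier on the replacement count mentioned in the critique can be checked with a small experiment. The pattern and input here are placeholders chosen for illustration, not the item pattern from the question.

```php
<?php
// Placeholder input with two lines, each starting with a number.
$text = "1 apple\n2 pears";

// Without "m", ^ matches only at the start of the whole string.
preg_replace('/^\d+\s/', '', $text, -1, $withoutM);

// With "m", ^ also matches after each newline.
preg_replace('/^\d+\s/m', '', $text, -1, $withM);

echo $withoutM, ' ', $withM, "\n"; // 1 2
```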
Hope this helps.
Edit 2018-02-07: Removed extraneous double quote, added regex shebangs.
My question is what should I do finally: combine all expressions into one and separate them using | or leave them as is - in individual preg_replace?
Keep the regular expressions in separate preg_replace() calls because this gives you more maintainability, readability and efficiency.
Using a lot of OR operators (|) in a regular expression is not performance-friendly, especially for large amounts of text, because at every position in the input the regex engine may have to try each alternative in the | list.
Please don't worry about "fastest" without having first done some sort of measurement that it matters. Unless your program is operating too slowly, and you've run a profiler like XDebug to determine that the regex matching is the bottleneck, then you're doing premature optimization.
Rather than worrying about fastest, think about which way is clearest.
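One way to keep separate expressions readable is to give each one a label and apply them in a loop. This is only a sketch: the two patterns and their labels are placeholders, and the function name applyRules is made up for illustration.

```php
<?php
// Hypothetical rule set: labels and patterns are placeholders.
$rules = [
    'leading size code' => '/^[0-9]{1,2}[.X-][0-9]{1,2}\s/',
    'trailing fraction' => '/\s[0-9]\/[0-9]$/',
];

// Applies each rule in turn and records how often it fired.
function applyRules(string $name, array $rules, array &$counts): string
{
    foreach ($rules as $label => $pattern) {
        $name = preg_replace($pattern, ' ', $name, -1, $n);
        $counts[$label] = $n;
    }
    return trim($name);
}

$counts = [];
echo applyRules('12X34 Widget 1/2', $rules, $counts), "\n"; // Widget
print_r($counts);
```

Keeping a per-rule count also makes it easy to see which expression is responsible when the output looks wrong.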
I have a string that looks like this:
[if-abc] 12345 [if-def] 67890 [/if][/if]
I have the following regex:
/\[if-([a-z0-9-]*)\]([^\[if]*?)\[\/if\]/s
This matches the inner brackets just like I want it to. However, when I replace the 67890 with text (ie. abcdef), it doesn't match it.
[if-abc] 12345 [if-def] abcdef [/if][/if]
I want to be able to match ANY characters, including line breaks, except for another opening bracket [if-.
This part doesn't work like you think it does:
[^\[if]
This will match a single character that is neither of [, i or f. Regardless of the combination. You can mimic the desired behavior using a negative lookahead though:
~\[if-([a-z0-9-]*)\]((?:(?!\[/?if).)*)\[/if\]~s
I've also included closing tags in the lookahead, as this avoids the ungreedy repetition (which is usually worse performance-wise). Plus, I've changed the delimiters, so that you don't have to escape the slash in the pattern.
So this is the interesting part ((?:(?!\[/?if).)*) explained:
( # capture the contents of the tag-pair
(?: # start a non-capturing group (the ?: are just a performance
# optimization). this group represents a single "allowed" character
(?! # negative lookahead - makes sure that the next character does not mark
# the start of either [if or [/if (the negative lookahead will cause
# the entire pattern to fail if its contents match)
\[/?if
# match [if or [/if
) # end of lookahead
. # consume/match any single character
)* # end of group - repeat 0 or more times
) # end of capturing group
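Both points can be tried out in a few lines. The first two calls show that [^\[if] is a character class that excludes the single characters [, i and f. The helper below it is a sketch of one way to use the lookahead pattern, assuming you want to strip the wrappers from the inside out; the function name unwrapIfBlocks is made up for illustration.

```php
<?php
// [^\[if] excludes the single characters "[", "i" and "f".
var_dump(preg_match('/^[^\[if]*$/', ' 67890 '));  // int(1)
var_dump(preg_match('/^[^\[if]*$/', ' abcdef ')); // int(0), because of the "f"

// Hypothetical helper: strips [if-...] ... [/if] wrappers, innermost first.
function unwrapIfBlocks(string $text): string
{
    $pattern = '~\[if-([a-z0-9-]*)\]((?:(?!\[/?if).)*)\[/if\]~s';
    do {
        // Each pass only matches blocks that contain no further [if or [/if.
        $text = preg_replace($pattern, '$2', $text, -1, $count);
    } while ($count > 0);
    return $text;
}

$input = '[if-abc] 12345 [if-def] abcdef [/if][/if]';
preg_match('~\[if-([a-z0-9-]*)\]((?:(?!\[/?if).)*)\[/if\]~s', $input, $m);
echo $m[1], "\n"; // def
```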
Modifying a little results in:
/\[if-([a-z0-9-]+)\](.+?)(?=\[if)/s
Running it on [if-abc] 12345 [if-def] abcdef [/if][/if]
Results in a first match as: [if-abc] 12345
Your groups are: abc and 12345
And modifying even further:
/\[if-([a-z0-9-]+)\](.+?)(?=(?:\[\/?if))/s
matches both groups. Although the delimiter [/if] is not captured by either of these.
NOTE: Instead of matching the delimiters, I used a lookahead ((?=)) in the regex to stop when the text ahead matches the lookahead.
Use a period to match any character.
So, let's say I want to accept strings as follows
SomeColumn IN||<||>||= [123, 'hello', "wassup"]||123||'hello'||"yay!"
For example: MyValue IN ['value', 123] or MyInt > 123 -> I think you get the idea. Now, what's bothering me is how to phrase this in a regex? I'm using PHP, and this is what I'm doing right now:
$temp = explode(';', $constraints);
$matches = array();
foreach ($temp as $condition) {
preg_match('/(.+)[\t| ]+(IN|<|=|>|!)[\t| ]+([0-9]+|[.+]|.+)/', $condition, $matches[]);
}
foreach ($matches as $match) {
if ($match[2] == 'IN') {
preg_match('/(?:([0-9]+|".+"|\'.+\'))/', substr($match[3], 1, -1), $tempm);
print_r($tempm);
}
}
Really appreciate any help right there, my regex'ing is horrible.
I assume your input looks similar to this:
$string = 'SomeColumn IN [123, \'hello\', "wassup"];SomeColumn < 123;SomeColumn = \'hello\';SomeColumn > 123;SomeColumn = "yay!";SomeColumn = [123, \'hello\', "wassup"]';
If you use preg_match_all there is no need for explode or to build the matches yourself. Note that the resulting two-dimensional array will have its dimensions switched, but that is often desirable. Here is the code:
preg_match_all('/(\w+)[\t ]+(IN|<|>|=|!)[\t ]+((\'[^\']*\'|"[^"]*"|\d+)|\[[\t ]*(?4)(?:[\t ]*,[\t ]*(?4))*[\t ]*\])/', $string, $matches);
$statements = $matches[0];
$columns = $matches[1];
$operators = $matches[2];
$values = $matches[3];
There will also be a $matches[4] but it does not really have a meaning and is only used inside the regular expression. First, a few things you did wrong in your attempt:
(.+) will consume as much as possible, and any character. So if you have something inside a string value that looks like IN 13 then your first repetition might consume everything until there and return it as the column. It also allows whitespace and = inside column names. There are two ways around this. Either making the repetition "ungreedy" by appending ? or, even better, restrict the allowed characters, so you cannot go past the desired delimiter. In my regex I only allow letters, digits and underscores (\w) for column identifiers.
[\t| ] this mixes up two concepts: alternation and character classes. What this does is "match a tab, a pipe or a space". In character classes you simply write all characters without delimiting them. Alternatively you could have written (\t| ) which would be equivalent in this case.
[.+] I don't know what you were trying to accomplish with this, but it matches either a literal . or a literal +. And again it might be useful to restrict the allowed characters, and to check for correct matching of quotes (to avoid 'some string")
Now for an explanation of my own regex (you can copy this into your code, as well, it will work just fine; plus you have the explanation as comments in your code):
preg_match_all('/
(\w+) # match an identifier and capture in $1
[\t ]+ # one or more tabs or spaces
(IN|<|>|=|!) # the operator (capture in $2)
[\t ]+ # one or more tabs or spaces
( # start of capturing group $3 (the value)
( # start of subpattern for single-valued literals (capturing group $4)
\' # literal quote
[^\']* # arbitrarily many non-quote characters, to avoid going past the end of the string
\' # literal quote
| # OR
"[^"]*" # equivalent for double-quotes
| # OR
\d+ # a number
) # end of subpattern for single-valued literals
| # OR (arrays follow)
\[ # literal [
[\t ]* # zero or more tabs or spaces
(?4) # reuse subpattern no. 4 (any single-valued literal)
(?: # start non-capturing subpattern for further array elements
[\t ]* # zero or more tabs or spaces
, # a literal comma
[\t ]* # zero or more tabs or spaces
(?4) # reuse subpattern no. 4 (any single-valued literal)
)* # end of additional array element; repeat zero or more times
[\t ]* # zero or more tabs or spaces
\] # literal ]
) # end of capturing group $3
/x',
$string,
$matches);
This makes use of PCRE's recursion feature where you can reuse a subpattern (or the whole regular expression) with (?n) (where n is just the number you would also use for a backreference).
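The difference between reusing a subpattern with (?n) and a backreference \n is easy to see on a tiny example. The patterns below are illustrative only and much simpler than the one above.

```php
<?php
// (?1) re-runs the subpattern of group 1, so each element may differ.
var_dump(preg_match('/^(\d+)(?:,(?1))*$/', '1,22,333')); // int(1)

// \1 requires the exact text that group 1 matched, so this fails.
var_dump(preg_match('/^(\d+)(?:,\1)*$/', '1,22,333'));   // int(0)
```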
I can think of three major things that could be improved with this regex:
It does not allow for floating-point numbers
It does not allow for escaped quotes (if your value is 'don\'t do this', I would only capture 'don\'). This can be solved using a negative lookbehind.
It does not allow for empty arrays as values (this could be easily solved by wrapping all parameters in a subpattern and making it optional with ?)
I included none of these, because I was not sure whether they apply to your problem, and I thought the regex was already complex enough to present here.
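For the first two points, here is a rough sketch of what an extended literal subpattern could look like. It is an assumption on my part, not part of the regex above: floats are handled with an optional fractional part, and escaped quotes with the common "non-quote character or backslash pair" alternation instead of the lookbehind mentioned in the list.

```php
<?php
// Assumed literal pattern: quoted strings with backslash escapes, or a
// number with an optional fractional part.
$literal = '/\'(?:[^\'\\\\]|\\\\.)*\'|"(?:[^"\\\\]|\\\\.)*"|\d+(?:\.\d+)?/';

// Nowdoc, so the backslashes reach the regex engine unchanged.
$input = <<<'TXT'
[123, 4.5, 'don\'t', "say \"hi\""]
TXT;

preg_match_all($literal, $input, $found);
print_r($found[0]);
```

In the single-quoted PHP string, \\\\ becomes \\ in the pattern, which matches one literal backslash.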
Usually regular expressions are not powerful enough to do proper language parsing anyway. It is generally better to write your own parser.
And since you said your regex'ing is horrible... while regular expressions seem like a lot of black magic due to their uncommon syntax, they are not that hard to understand, if you take the time once to get your head around their basic concepts. I can recommend this tutorial. It really takes you all the way through!