Preg_match is "ignoring" a capture group delimiter

We have thousands of structured filenames stored in our database, and unfortunately many hundreds have been manually altered to names that do not follow our naming convention. Using regex, I'm trying to match the correct file names in order to identify all the misnamed ones.
The files all relate to a meeting agenda, and use the date, meeting type, Agenda Item #, and description in the name.
Our naming convention is yyyymmdd_aa[_bbb]_ccccc.pdf where:
yyyymmdd is a date (and may optionally use underscores such as yyyy_mm_dd)
aa is a 2-3 character Meeting Type code
bbb is an optional Agenda Item
ccccc is a freeform variable length description of the file (alphanumeric only)
Example filenames:
20200225_RM_agenda.pdf
20200225_RM_2_memo.pdf
20200225_SS1_3c_presenTATION.pdf
20200225_CA_4d_SiGnEd.pdf
20200225_RM_5_Order1234.pdf
2021_02_25_EV_Notice.pdf
The regex I'm using to match these files is below (regex demo):
/^(\d{4}[_]?\d{2}[_]?\d{2})_(\w{2,3})_([a-z0-9]{1,3})_?(.+)?.pdf/i
The Problem:
In general, it's working fine, BUT if the Agenda Number ("bbb") is NOT in the filename, the regex captures and returns the first 3 characters of the description. It seems to me that the 3rd capture group _([a-z0-9]{1,3})_ is saying 1-3 alphanumeric characters between underscores, but I don't know how to "force the underscore delimiters", or otherwise tell it that the group may not be there, and that it's now looking at the descriptive text. This can be seen in the demo code where the first and last filenames do not use an Agenda Number.
Any assistance is appreciated.

The optional quantifier ? applies only to the token immediately before it, which is either a single character or a group. So the expression ([a-z0-9]{1,3})_? makes the underscore optional, but not the preceding group. The solution is to move the underscore inside the parentheses and make the whole group optional.
^(\d{4}[_]?\d{2}[_]?\d{2})_(\w{2,3})_([a-z0-9]{1,3}_)?(.+)?.pdf
Additionally, the [_]? can be simplified to just _?, file name periods should be escaped (otherwise they are a wildcard), and I personally like to name my groups using (?<name>) syntax. Putting that all together you get:
^(?<date>\d{4}_?\d{2}_?\d{2})_(?<meeting_type>\w{2,3})_(?<agenda>[a-z0-9]{1,3}_)?(?<description>.+)?\.pdf$
Demo here: https://regex101.com/r/BUKCih/1
Updated:
I've made some updates based on the comments. I added $ to the end to force "end of filename", as @Chris Maurer suggested. This stops file.pdf.txt from getting through. I also made a sub-group and moved the name into that group, so the trailing underscore is no longer included in the named group. I'm going to leave Chris's other comment about tightening the last matching group alone, although I do agree with it, and the OP might find a couple more non-conforming files if they use [a-z0-9]+ or similar. PHP's PCRE engine also supports POSIX character classes inside a bracket expression, so [[:alnum:]]+ could be used too.
^(?<date>\d{4}_?\d{2}_?\d{2})_(?<meeting_type>\w{2,3})_((?<agenda>[a-z0-9]{1,3})_)?(?<description>.+)?\.pdf$
Updated demo here: https://regex101.com/r/ebmxkF/1
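As a rough illustration of how the final pattern could be used to list the misnamed files, here is a minimal sketch in Python rather than PHP. Two things are assumptions on my part: Python's re module spells named groups (?P<name>...) instead of (?<name>...), and the helper name parse_filename and the sample list are made up for the example. The unnamed wrapper group is written as non-capturing here, which does not change what matches.

```python
import re

# Same structure as the final pattern above, translated to Python's
# named-group syntax. IGNORECASE mirrors the /i flag in the question.
FILENAME_RE = re.compile(
    r"^(?P<date>\d{4}_?\d{2}_?\d{2})_"
    r"(?P<meeting_type>\w{2,3})_"
    r"(?:(?P<agenda>[a-z0-9]{1,3})_)?"   # agenda item plus its underscore, both optional
    r"(?P<description>.+)?\.pdf$",
    re.IGNORECASE,
)


def parse_filename(name):
    """Return the named parts of a conforming filename, or None if it does not match."""
    match = FILENAME_RE.match(name)
    return match.groupdict() if match else None


# Hypothetical sample: the last entry is deliberately non-conforming.
names = [
    "20200225_RM_agenda.pdf",
    "20200225_SS1_3c_presenTATION.pdf",
    "2021_02_25_EV_Notice.pdf",
    "20200225_RM_agenda.pdf.txt",
]
misnamed = [n for n in names if parse_filename(n) is None]
print(misnamed)
```

When the agenda item is absent, the agenda entry in the returned dictionary is None and the description keeps its first characters.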

Related

Regular expression for 12;1;19-39;43

I am new to regular expression and trying to match the following pattern using regular expression:
Groups of numbers, each looks like either a single number like 12, or a number range like 19-39
Groups are separated by semicolon(;)
All numbers are within range 1-48 (but we don't need to verify this in regular expression)
So an example match would be 12;13;19-39;43
For a single group, I can think of using
\b[1-9]{1}|[1-9]{1}[0-9]{1}\b
for single number, and
\b[1-9]{1}|[1-9]{1}[0-9]{1}-[1-9]{1}|[1-9]{1}[0-9]{1}\b
for number range.
The question is how to take the semicolon(;) into consideration also: any number of the above groups of number(s) connected by ; can be matched.

This should exactly match your requirement:
\d*[0-9](|-\d*[0-9]|;\d*[0-9])*$
Explanation:
Match any digit multiple times.
Next, check for a - or ; followed by another series of digits.
Repeat this for as long as further groups are found.
Try it out here:
http://gskinner.com/RegExr/
You can paste sample text in the big text area and see the exp in action. Cheers!

Try this:
/^\d*[0-9](|-\d*[0-9]|;\d*[0-9])*$/;
It matches your requirement. (The range separator must be a literal -; an unescaped . would match any character.)

One trick to learning these is to break the problem into parts and write brute-force versions to start:
For a number from 1 to 48 followed by a semicolon, you can be as explicit as: ((\d)|([1-3]\d)|(4[0-8]));
For dashed groups, repeat the same components with a dash between them: ((\d)|([1-3]\d)|(4[0-8]))-((\d)|([1-3]\d)|(4[0-8]));
Now combine them as an either/or and repeat the whole group: ((((\d)|([1-3]\d)|(4[0-8]));)|(((\d)|([1-3]\d)|(4[0-8]))-((\d)|([1-3]\d)|(4[0-8]));))*
Now we have this gross, brute force, regex with a ridiculous number of groupings above, but it works. Next we can think about simplifying and you have an even better place (sort of) to start asking for help from.
I was going to start simplifying, but you have other answers here already.
Simplifying a little, and noting that your final number does not end with a semicolon, you can start by merging with something like what @Sunny has:
^((\d)|([1-3]\d)|(4[0-8]))(|-((\d)|([1-3]\d)|(4[0-8]))|;((\d)|([1-3]\d)|(4[0-8])))*$
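The "build it from parts" idea can also be expressed directly in code. This is only a sketch in Python, and it is a variant of the patterns above rather than a copy: it uses [1-9] for single digits so that 0 is rejected, uses non-capturing groups, and allows at most one dash per group. The names NUMBER, GROUP, and is_valid are invented for the example.

```python
import re

# One number in the range 1-48.
NUMBER = r"(?:[1-9]|[1-3]\d|4[0-8])"
# A group is a single number or a range such as 19-39.
GROUP = rf"{NUMBER}(?:-{NUMBER})?"
# The whole string is one group followed by any number of ;group repetitions.
LIST_RE = re.compile(rf"{GROUP}(?:;{GROUP})*")


def is_valid(text):
    """True if text consists only of semicolon-separated numbers or ranges."""
    return LIST_RE.fullmatch(text) is not None


print(is_valid("12;13;19-39;43"))
print(is_valid("12;49"))
```

Using fullmatch plays the role of the ^ and $ anchors, so a trailing semicolon or stray text makes the check fail.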

Is there a regex symbol to match one, the other, or both (if possible)?

I want to highlight a group of words, they can appear single or in a row. I'd like them to be highlighted together if they appear one after the other, and if they don't, they should also be highlighted, like the normal behavior. For instance, if I want to highlight the words:
results as
And the subject is:
real time results: shows results as you type
I'd like the result to be:
real time results: shows <span class="highlighted"> results as </span> you type
The whitespaces are also a headache, because I tried using an or expression:
( results )|( as )
with whitespaces to prevent highlighting words like bass, crash, and so on. But since the whitespace after results is the same as the whitespace before as, the regexp ignores it and only highlights results.
It can be used to highlight many words, so combinations of
( (one) (two) )|( (two) (one) )|( one )|( two )
are not an option :(
Then I thought that there might be an operator that works like | but could be used to match both if possible, else one or the other.

Using spaces to ensure you match full words is the wrong approach. That's what word boundaries are for: \b matches a position between a word and a non-word character (where word characters usually are letters, digits and underscores). To match combinations of your desired words, you can simply put them all in an alternation (like you already do), and repeat as often as possible. Like so:
(?:\bresults\b\s*|\bas\b\s*)+
This assumes that you want to highlight the first and separate results in your example as well (which would satisfy your description of the problem).
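A small sketch of this approach in Python, under a couple of assumptions: the function name highlight is made up, and the pattern is a slight variant of the one above that only swallows whitespace between two listed words, so the space after the last word stays outside the span.

```python
import re


def highlight(text, words):
    """Wrap each run of consecutive listed words in one highlighted span."""
    alternatives = "|".join(re.escape(word) for word in words)
    # A listed word, followed by zero or more (whitespace + listed word) repetitions.
    pattern = re.compile(
        rf"\b(?:{alternatives})\b(?:\s+(?:{alternatives})\b)*",
        re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: f'<span class="highlighted">{m.group(0)}</span>', text
    )


print(highlight("real time results: shows results as you type", ["results", "as"]))
```

Because of the \b anchors, words such as bass or crash are left alone even though they contain "as".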

Perhaps you do not need to match a string of words next to each other. Why not just apply your highlighting like so:
real time results: shows <span class="highlighted">results</span> <span class="highlighted">as</span> you type
The only real difference is that the space between the words is not highlighted, but it's a clean and easy compromise which will save you hours of work and doesn't seem to hurt the UX in the least (in my opinion).
In that case, you could just use alternation:
\b(results|as)\b
(\b being the word boundary anchor)
If you really don't like the space between words not being highlighted, you could write a jQuery function to find "highlighted" spans separated only by white space and then combine them (a "second stage" to achieve your UX design goals).
Update
(OK... so merging spans is actually kind of difficult via jQuery. See Find text between two tags/nodes)
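If the highlighting is produced as a string before it reaches the page, the "second stage" can be done on the string instead of on DOM nodes. The following is a hedged sketch in Python, not jQuery; the helper names highlight_each and merge_adjacent are hypothetical, and it assumes the spans are exactly the ones generated by the first step.

```python
import re

SPAN_OPEN = '<span class="highlighted">'
SPAN_CLOSE = "</span>"


def highlight_each(text, words):
    """First stage: wrap every listed word in its own span."""
    alternatives = "|".join(re.escape(word) for word in words)
    return re.sub(rf"\b({alternatives})\b", rf"{SPAN_OPEN}\1{SPAN_CLOSE}", text)


def merge_adjacent(html):
    """Second stage: join spans that are separated only by whitespace."""
    boundary = re.escape(SPAN_CLOSE) + r"(\s+)" + re.escape(SPAN_OPEN)
    # Dropping the close/open pair and keeping the whitespace fuses the two spans.
    return re.sub(boundary, r"\1", html)


marked = highlight_each("real time results: shows results as you type", ["results", "as"])
print(merge_adjacent(marked))
```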

Regex to detect word abbreviations

I'm currently working on a CSV that has information about Portugal's administrative areas and postal codes, but the file doesn't follow any strict format, which means sometimes there are entire strings in uppercase, along with other issues.
The issue I want to solve is as follows: some areas have an abbreviation at the end of the name, related to its parent's administrative level, that I want to remove. As far as I can see, these are the rules:
Abbreviations don't take more than 3 characters in length (always 3 characters so far);
The first character may be any letter, case insensitive;
The last 2 characters are always consonants (e.g. Z, B, M, P, ..);
(edit) the abbreviations always occur as the last word in a string;
(edit 2) - The strings are always UTF-8
The purpose is to remove these abbreviations from the area names.

Sounds simple enough..
/\b[a-z][ZBMP]{2}\b/i
This would match any abbreviation as described. Add letters to the second character class ([ZBMP]) to complete the match.
It would only match if it's not part of another word (That's the \b's job).
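For illustration, here is one way the removal could look in Python. It is a sketch under assumptions: the consonant class is spelled out in full as [b-df-hj-np-tv-z], the pattern is anchored to the end of the string because the question says the abbreviation is always the last word, and the sample names and the function name strip_abbreviation are invented.

```python
import re

# Whitespace, any ASCII letter, two ASCII consonants, then the end of the string.
# Accented letters are not covered by these classes.
ABBREVIATION_RE = re.compile(r"\s+[a-z][b-df-hj-np-tv-z]{2}\s*$", re.IGNORECASE)


def strip_abbreviation(name):
    """Remove a trailing three-letter abbreviation, if there is one."""
    return ABBREVIATION_RE.sub("", name)


for sample in ["ABRANTES ABT", "Lisboa LSB", "Vila do Mar", "Faro"]:
    print(repr(strip_abbreviation(sample)))
```

A last word such as "Mar" is kept because its second letter is a vowel.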

PHP regex non-capture non-match group

I'm making a date matching regex, and it's all going pretty well, I've got this so far:
"/(?:[0-3])?[0-9]-(?:[0-1])?[0-9]-(?:20)[0-1][0-9]/"
It will (hopefully) match single or double digit days and months, and double or quadruple digit years in the 21st century. A few trials and errors have gotten me this far.
But, I've got two simple questions regarding these results:
(?: ) what is a simple explanation for this? Apparently it's a non-matching group. But then...
What is the trailing ? for? e.g. (? )?

[Edited (again) to improve formatting and fix the intro.]
This is a comment and an answer.
The answer part... I do agree with alex' earlier answer.
(?: ), in contrast to ( ), is used to avoid capturing text, generally so as to have fewer back references thrown in with those you do want or to improve speed performance.
The ? following the (?: ) -- or following anything except * + ? or {} -- means that the preceding item may or may not be present in a legitimate match. E.g., /^z34?$/ will match z3 as well as z34, but it won't match z35 or z.
The comment part... I made what might be considered improvements to the regex you were working on:
(?:^|\s)(0?[1-9]|[1-2][0-9]|30|31)-(0?[1-9]|10|11|12)-((?:20)?[0-9][0-9])(?:\s|$)
-- First, it avoids things like 0-0-2011
-- Second, it avoids things like 233443-4-201154564
-- Third, it includes things like 1-1-2022
-- Fourth, it includes things like 1-1-11
-- Fifth, it avoids things like 34-4-11
-- Sixth, it allows you to capture the day, month, and year so you can refer to these more easily in code.. code that would, for example, do a further check (is the second captured group 2 and is either the first captured group 29 and this a leap year or else the first captured group is <29) in order to see if a feb 29 date qualified or not.
Finally, note that you'll still get dates that won't exist, eg, 31-6-11. If you want to avoid these, then try:
(?:^|\s)(?:(?:(0?[1-9]|[1-2][0-9]|30|31)-(0?[13578]|10|12))|(?:(0?[1-9]|[1-2][0-9]|30)-(0?[469]|11))|(?:(0?[1-9]|[1-2][0-9])-(0?2)))-((?:20)?[0-9][0-9])(?:\s|$)
Also, I assumed the dates would be preceded and followed by a space (or beginning/end of line), but you may want to adjust that (e.g., to allow punctuation).
A commenter elsewhere referenced this resource which you might find useful:
http://rubular.com/
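To make the "capture, then check further in code" idea concrete, here is a minimal sketch in Python. It reuses the first improved pattern unchanged and lets the standard library's date type reject impossible dates such as 29 February in a non-leap year; the function name find_date and the rule of adding 2000 to two-digit years are assumptions for the example.

```python
import re
from datetime import date

# The first improved pattern from the answer: day, month, and year are captured.
DATE_RE = re.compile(
    r"(?:^|\s)(0?[1-9]|[1-2][0-9]|30|31)-(0?[1-9]|10|11|12)-((?:20)?[0-9][0-9])(?:\s|$)"
)


def find_date(text):
    """Return the first plausible d-m-y date in text as a date object, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    day, month, year = (int(part) for part in match.groups())
    if year < 100:
        year += 2000  # assumption: two-digit years belong to the 21st century
    try:
        return date(year, month, day)  # raises ValueError for 31-6 or a bad 29-2
    except ValueError:
        return None


print(find_date("due 29-2-2012 at noon"))
print(find_date("29-2-2011"))
```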

It is a non-capturing group. You cannot back-reference it. It is usually used to declutter backreferences and/or improve performance.
It means the preceding group is optional.

Subpatterns
Subpatterns are delimited by parentheses (round brackets), which can be nested. Marking part of a pattern as a subpattern does two things:
It localizes a set of alternatives. For example, the pattern cat(aract|erpillar|) matches one of the words "cat", "cataract", or "caterpillar". Without the parentheses, it would match "cataract", "erpillar" or the empty string.
It sets up the subpattern as a capturing subpattern (as defined above). When the whole pattern matches, that portion of the subject string that matched the subpattern is passed back to the caller via the ovector argument of pcre_exec(). Opening parentheses are counted from left to right (starting from 1) to obtain the numbers of the capturing subpatterns.
For example, if the string "the red king" is matched against the pattern the ((red|white) (king|queen)) the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3.
The fact that plain parentheses fulfill two functions is not always helpful. There are often times when a grouping subpattern is required without a capturing requirement. If an opening parenthesis is followed by "?:", the subpattern does not do any capturing, and is not counted when computing the number of any subsequent capturing subpatterns. For example, if the string "the white queen" is matched against the pattern the ((?:red|white) (king|queen)) the captured substrings are "white queen" and "queen", and are numbered 1 and 2. The maximum number of captured substrings is 65535. It may not be possible to compile such large patterns, however, depending on the configuration options of libpcre.
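The numbering described above can be checked quickly. This is a small sketch using Python's re module rather than PCRE itself, which is an assumption of the example; for these two patterns the grouping behaviour is the same.

```python
import re

# Plain parentheses: three capturing groups, numbered by opening parenthesis.
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
print(capturing.groups())

# With ?: the colour alternation still groups but is no longer counted.
non_capturing = re.match(r"the ((?:red|white) (king|queen))", "the white queen")
print(non_capturing.groups())
```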
As a convenient shorthand, if any option settings are required at the start of a non-capturing subpattern, the option letters may appear between the "?" and the ":". Thus the two patterns
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
match exactly the same set of strings. Because alternative branches are tried from left to right, and options are not reset until the end of the subpattern is reached, an option setting in one branch does affect subsequent branches, so the above patterns match "SUNDAY" as well as "Saturday".
It is possible to name a subpattern using the syntax (?P<name>pattern). This subpattern will then be indexed in the matches array by its normal numeric position and also by name. PHP 5.2.2 introduced two alternative syntaxes (?<name>pattern) and (?'name'pattern).
Sometimes it is necessary to have multiple matching, but alternating subgroups in a regular expression. Normally, each of these would be given their own backreference number even though only one of them would ever possibly match. To overcome this, the (?| syntax allows having duplicate numbers. Consider the following regex matched against the string Sunday:
(?:(Sat)ur|(Sun))day
Here Sun is stored in backreference 2, while backreference 1 is empty. Matching yields Sat in backreference 1 while backreference 2 does not exist. Changing the pattern to use the (?| fixes this problem:
(?|(Sat)ur|(Sun))day
Using this pattern, both Sun and Sat would be stored in backreference 1.
Reference : http://php.net/manual/en/regexp.reference.subpatterns.php

Ignore common words (the, and) in MySQL REGEXP query

I'm trying to query a database of Book titles based on the first letter of the title. However, I want to ignore common words such as "The" and "A".
So when searching for books that start with the letter "T"
"The Adventures of Huck Finn" - would NOT be matched
"Transformation of a Runner" - would be matched
I'm not very experienced with REGEX, but this is what I have so far (where $first_letter could equal 't')
... WHERE title = '^[(a )(the )]*[$first_letter]' ...
This successfully matches book titles that start with a particular letter even after the words "A" or "The", but doesn't ignore those words. So if $first_letter='t', it would match BOTH books mentioned above.
I've tried googling it, but haven't found any solutions. Any help would be greatly appreciated.
Thanks in advance.
Kevin

Read about MySQL full text search

The regular expression you've written isn't valid. []s are used to denote what is called a character class. Everything you enter between the brackets (with some characters potentially needing to be escaped, such as the literal characters [ and ]) is treated as standing-in for a single character.
edit After re-reading my answer, I realized lookaround wasn't a good way to approach this.
The functionality you're groping for is called negative lookahead, negative lookbehind, or some similar variant. I'm unsure whether MySQL's regex flavor supports it, but I don't think it would be a good fit for this problem.
Alternatively, you could do a regex that looks like this:
^((a|the|of|and) )?[letter of interest]
The breakdown:
There are two groups
The inner-most group looks for instances of words you want to ignore
The outer-most group just adds a space to the end of that
The ? asserts that there could be 0 or 1 instances of this group
You'll have to do the legwork of translating this into MySQL regex syntax yourself. My apologies.
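One caveat about the pattern above: because the prefix group is optional, on its own it still matches "The Adventures of Huck Finn" when the letter of interest is t, since the engine can skip the group and match the T of "The". The sketch below, in Python rather than MySQL, adds a negative lookahead so that a leading common word is never taken as the first real word. Whether MySQL's REGEXP accepts a lookahead depends on the server version, so treat this as an illustration of the logic only; STOP_WORDS and starts_with_letter are hypothetical names.

```python
import re

STOP_WORDS = ("a", "the", "of", "and")


def starts_with_letter(title, letter):
    """True if the first non-stop-word of title begins with letter."""
    stop = "|".join(re.escape(word) for word in STOP_WORDS)
    # Optional stop word + space, then a position that is NOT the start of
    # another stop word, then the letter itself.
    pattern = rf"^(?:(?:{stop}) )?(?!(?:{stop}) ){re.escape(letter)}"
    return re.search(pattern, title, re.IGNORECASE) is not None


print(starts_with_letter("The Adventures of Huck Finn", "t"))
print(starts_with_letter("Transformation of a Runner", "t"))
```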
