PHP Regex extract date or date range - php

I have a database full of movie titles and I want to extract the date, which I've managed to do with the following:
(19|20)[0-9][0-9]
However, I've noticed that some of my dates are ranges, for example 1998-2003, and sometimes there are spaces, like 1998 - 2003. Is there any way to adapt the regex to match the ranges with or without spaces?

Use \s* to match zero or more spaces.
(?:19|20)[0-9]{2}\s*-\s*(?:19|20)[0-9]{2}
If you also want to match a single year, make the second part optional.
(?:19|20)[0-9]{2}(?:\s*-\s*(?:19|20)[0-9]{2})?
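A minimal sketch of how the optional-range pattern could be used from PHP. The helper name extractYearOrRange and the sample titles are made up for illustration; only the pattern itself comes from the answer.

```php
<?php
// Hypothetical helper: returns the first year or year range found in a title,
// or null when the title contains neither.
function extractYearOrRange(string $title): ?string
{
    // Pattern from the answer, wrapped in / delimiters for preg_match.
    $pattern = '/(?:19|20)[0-9]{2}(?:\s*-\s*(?:19|20)[0-9]{2})?/';

    if (preg_match($pattern, $title, $m) === 1) {
        return $m[0]; // the whole match, including any spaces around the dash
    }
    return null;
}

var_dump(extractYearOrRange('Some Show (1998 - 2003)'));
var_dump(extractYearOrRange('Some Movie 2001'));
```

Note that the pattern returns the first four-digit run starting with 19 or 20, so a title that itself begins with such a number would match there first. Adding \b on both sides is one way to avoid matching inside longer digit runs.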

Related

Preg_match is "ignoring" a capture group delimiter

We have thousands of structured filenames stored in our database, and unfortunately many hundreds have been manually altered to names that do not follow our naming convention. Using regex, I'm trying to match the correct file names in order to identify all the misnamed ones.
The files are all relative to a meeting agenda, and use the date, meeting type, Agenda Item#, and description in the name.
Our naming convention is yyyymmdd_aa[_bbb]_ccccc.pdf where:
yyyymmdd is a date (and may optionally use underscores such as yyyy_mm_dd)
aa is a 2-3 character Meeting Type code
bbb is an optional Agenda Item
ccccc is a freeform variable length description of the file (alphanumeric only)
Example filenames:
20200225_RM_agenda.pdf
20200225_RM_2_memo.pdf
20200225_SS1_3c_presenTATION.pdf
20200225_CA_4d_SiGnEd.pdf
20200225_RM_5_Order1234.pdf
2021_02_25_EV_Notice.pdf
The regex I'm using to match these files is below (regex demo):
/^(\d{4}[_]?\d{2}[_]?\d{2})_(\w{2,3})_([a-z0-9]{1,3})_?(.+)?.pdf/i
The Problem:
In general, it's working fine, BUT if the Agenda Number ("bbb") is NOT in the filename, the regex captures and returns the first 3 characters of the description. It seems to me that the 3rd capture group _([a-z0-9]{1,3})_ is saying 1-3 alphanumeric characters between underscores, but I don't know how to "force the underscore delimiters", or otherwise tell it that the group may not be there, and that it's now looking at the descriptive text. This can be seen in the demo code where the first and last filenames do not use an Agenda Number.
Any assistance is appreciated.
The optional quantifier ? applies only to the token immediately before it, which is either a single character or a group. So the expression ([a-z0-9]{1,3})_? makes the underscore optional, but not the preceding group. The solution is to move the underscore inside the parentheses and make the whole group optional.
^(\d{4}[_]?\d{2}[_]?\d{2})_(\w{2,3})_([a-z0-9]{1,3}_)?(.+)?.pdf
Additionally, the [_]? can be simplified to just _?, file name periods should be escaped (otherwise they are a wildcard), and I personally like to name my groups using (?<name>) syntax. Putting that all together you get:
^(?<date>\d{4}_?\d{2}_?\d{2})_(?<meeting_type>\w{2,3})_(?<agenda>[a-z0-9]{1,3}_)?(?<description>.+)?\.pdf$
Demo here: https://regex101.com/r/BUKCih/1
Updated:
I've made some updates based on the comments. I added $ to the end to force "end of filename", as @Chris Maurer suggested. This stops file.pdf.txt from getting through. I also wrapped the agenda in an outer group and moved the name onto the inner group, so the trailing underscore is no longer included in the named group. I'm going to leave Chris's other comment about tightening the last matching group alone, although I do agree with it, and the OP might find a couple of non-conforming files if they use [a-z0-9]+ or similar. PCRE, which PHP uses, also supports POSIX character classes inside a bracket expression, so [[:alnum:]]+ could be used too.
^(?<date>\d{4}_?\d{2}_?\d{2})_(?<meeting_type>\w{2,3})_((?<agenda>[a-z0-9]{1,3})_)?(?<description>.+)?\.pdf$
Updated demo here: https://regex101.com/r/ebmxkF/1
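Here is a rough sketch of using the updated pattern with preg_match. The function name parseAgendaFilename and the shape of the returned array are my own choices; the /i flag is carried over from the question's original regex.

```php
<?php
// Updated pattern from the answer, with / delimiters and the /i flag added.
const FILENAME_PATTERN =
    '/^(?<date>\d{4}_?\d{2}_?\d{2})_(?<meeting_type>\w{2,3})_'
    . '((?<agenda>[a-z0-9]{1,3})_)?(?<description>.+)?\.pdf$/i';

// Hypothetical helper: returns the parsed parts, or null for a misnamed file.
function parseAgendaFilename(string $name): ?array
{
    if (preg_match(FILENAME_PATTERN, $name, $m) !== 1) {
        return null;
    }
    return [
        'date'         => $m['date'],
        'meeting_type' => $m['meeting_type'],
        // An optional group that did not take part in the match comes back as ''.
        'agenda'       => ($m['agenda'] ?? '') !== '' ? $m['agenda'] : null,
        'description'  => $m['description'] ?? '',
    ];
}

var_dump(parseAgendaFilename('20200225_RM_agenda.pdf'));
var_dump(parseAgendaFilename('20200225_SS1_3c_presenTATION.pdf'));
```

Filenames for which the function returns null are the ones that do not follow the naming convention.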

Extract an 8 character integer string from an ical file

This code to get all sequences of 8 integers works fine:
preg_match_all('/[0-9]{8}/', $string, $match);
However I am only interested if the match starts with 20.
I know I have to add ^20 somewhere but I have tried many times with no success. I have looked at many regex tutorials but none of them seems to explain how to do 2 separate searches.
I am actually trying to parse ICAL files to extract the dates. If the 8 digit integer starts with 20 it almost certainly is a date.
For example: DTSTART:20150112T120000Z
How about this solution:
/(20)\d{6}/
This will probably find what you are looking for:
(?=20)(\d{8})
The positive lookahead (?=20) asserts that the next two characters are 20, and (\d{8}) then captures the eight digits starting at that position.
The answer depends heavily on what you want to achieve. Do you want to extract any and all dates from an iCalendar file? If so, you might miss birthdays, as their years are most likely to start with 19.
Also matching any dates will yield most likely many undesired dates like UNTIL, TRIGGER, DTEND, ...
Assuming from your example you want to extract events start dates, you could try:
DTSTART[a-zA-Z._%+/=;-]*:(\d{8})(?:T\d{6})?
Keep in mind that DTSTART can be followed by a timezone definition such as TZID=America/New_York and/or the value type DATE or DATE-TIME (see RFC 5545, DATE-TIME).
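As a sketch of the DTSTART approach in PHP: the pattern below is a looser variant of my own, not the one from the answer. It accepts any parameter text between DTSTART and the colon, and treats the time part as optional. The function name and the sample calendar data are invented for the example.

```php
<?php
// Hypothetical helper: collects the YYYYMMDD part of every DTSTART line.
function extractStartDates(string $ical): array
{
    // ^ with the /m flag anchors at the start of each line, so DTEND, UNTIL
    // and similar properties are skipped. Parameters such as
    // ;TZID=America/New_York or ;VALUE=DATE may sit before the colon.
    // Assumption: parameter values contain no colon of their own.
    $pattern = '/^DTSTART(?:;[^:\r\n]*)?:(\d{8})(?:T\d{6}Z?)?/m';

    preg_match_all($pattern, $ical, $matches);
    return $matches[1]; // group 1 holds the eight date digits
}

$sample = "BEGIN:VEVENT\r\nDTSTART:20150112T120000Z\r\n"
        . "DTEND:20150112T130000Z\r\nEND:VEVENT\r\n"
        . "BEGIN:VEVENT\r\nDTSTART;TZID=America/New_York:20150301T090000\r\nEND:VEVENT\r\n"
        . "BEGIN:VEVENT\r\nDTSTART;VALUE=DATE:19990704\r\nEND:VEVENT\r\n";

print_r(extractStartDates($sample));
```

Because the match is tied to the property name rather than to a leading 20, dates from the 1900s are picked up as well.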

PCRE(php) Is it possible to check if sequence of numbers contains only unique number for that sequence?

Assuming I have a set of numbers (from 1 to 22) divided by some trivial delimiters (comma, point, space, etc). I need to make sure that this set of numbers does not contain any repetition of the same number. Examples:
1,14,22,3 // good
1,12,12,3 // not good
Is it possible to do via regular expression?
I know it's easy to do using just PHP, but I really wonder how to make it work with regex.
Yes, you could achieve this with regex by using a negative lookahead.
^(?!.*\b(\d+)\b.*\b\1\b)\d+(?:,\d+)+$
(?!.*\b(\d+)\b.*\b\1\b) Negative lookahead at the start asserts that no number is repeated anywhere in the string. \b(\d+)\b.*\b\1\b matches a repeated number.
\d+ matches one or more digits.
(?:,\d+)+ One or more occurrences of a comma followed by one or more digits.
$ Asserts that we are at the end of the string.
OR
Regex for the numbers separated by space, dot, comma as delimiters.
^(?!.*\b(\d+)\b.*\b\1\b)\d+(?:([.\s,])\d+)(?:\2\d+)*$
The capturing group inside (?:([.\s,])\d+) records the first delimiter, so that \2 can require every later delimiter to be the same one. In other words, the above regex won't match strings with mixed delimiters such as 2,3 5.6
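To see the comma-only lookahead pattern in action, here is a small sketch that checks it against a plain PHP uniqueness test. Both function names are hypothetical; the regex is the first one from this answer.

```php
<?php
// Regex version: the negative lookahead rejects any list in which some whole
// number appears again later in the string.
function isUniqueByRegex(string $list): bool
{
    return preg_match('/^(?!.*\b(\d+)\b.*\b\1\b)\d+(?:,\d+)+$/', $list) === 1;
}

// Plain PHP version, for comparison.
function isUniqueByArray(string $list): bool
{
    $numbers = explode(',', $list);
    return count($numbers) === count(array_unique($numbers));
}

foreach (['1,14,22,3', '1,12,12,3', '1,12,3,12'] as $list) {
    var_dump($list, isUniqueByRegex($list), isUniqueByArray($list));
}
```

One difference to be aware of: because of (?:,\d+)+ the regex requires at least two numbers, so a single number such as 5 is rejected by the regex but counts as unique in the array version.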
You can use this regex:
^(?!.*?(\b\d+)\W+\1\b)\d+(\W+\d+)*$
The negative lookahead (?!.*?(\b\d+)\W+\1\b) rejects the match when the same number appears twice in a row, separated by 1 or more non-word characters.
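A quick sketch of what "one after another" means for this pattern in practice. The constant and function names are made up; the pattern is copied from the answer with / delimiters added.

```php
<?php
// Pattern from the answer above, wrapped in / delimiters.
const ADJACENT_ONLY = '/^(?!.*?(\b\d+)\W+\1\b)\d+(\W+\d+)*$/';

// Hypothetical helper name.
function passesAdjacentCheck(string $list): bool
{
    return preg_match(ADJACENT_ONLY, $list) === 1;
}

var_dump(passesAdjacentCheck('1,14,22,3')); // bool(true)
var_dump(passesAdjacentCheck('1,12,12,3')); // bool(false): 12 repeats back to back
var_dump(passesAdjacentCheck('1,12,3,12')); // bool(true): the repeat is not adjacent
```

So this version is enough when duplicates can only occur next to each other, but a repeat with another number in between slips through.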
Here is the solution that fits my current need:
^(?>(?!\2\b|\3\b)(1\d{1}|2[0-2]{1}|\d{1}+)[,.; ]+)(?>(?!\1\b|\3\b)(1\d{1}|2[0-2]{1}|\d{1}+)[,.; ]+)(?>(?!\1\b|\2\b)(1\d{1}|2[0-2]{1}|\d{1}+))$
It matches sequences of unique numbers separated by one or more delimiters. It also limits each number to the range from 1 to 22 and allows exactly 3 numbers in the sequence.
It's not perfect, but it works fine! Thanks a lot to everyone who gave me a hand with this!

Regex for two latitudes and longitudes not working

I am using some data which gives paths for Google Maps either as a path or as a set of two latitudes and longitudes. I have stored both kinds of value as a BLOB in a MySQL database, but I need to detect the values which are not paths when they come out in the result. In an attempt to do this, I have saved them in the BLOB in the following format:
array(lat,lng+lat,lng)
I am using preg_match to find these results, but I haven't managed to get any to work. Here are the regexes I have tried:
^[a]{1}[r]{2}[a]{1}[y]{1}[\(]{1}[1-9\.\,\+]{1*}[\)]{1}^
^[a]{1}[r]{2}[a]{1}[y]{1}[\(]{1}(\-?\d+(\.\d+)?),(\-?\d+(\.\d+)?)\+(\-?\d+(\.\d+)?),(\-?\d+(\.\d+)?)[\)]{1}^
Regex confuses me sometimes (as it is doing now). Can anyone help me out?
Edit:
The lat can be 2 digits followed by a decimal point and 8 more digits, and the lng can be 3 digits followed by a decimal point and 8 more digits. Both can be positive or negative.
Here are some example lat lngs:
51.51160000,-0.12766000
-53.36442000,132.27519000
51.50628000,0.12699000
-51.50628000,-0.12699000
So a full match would look like:
array(51.51160000,-0.12766000+-53.36442000,132.27519000)
Further Edit
I am using the preg_match() php function to match the regex.
Here are some pointers for writing regex:
If you have a single possibility for a character, for example, the a in array, you can indeed write it as [a]; however, you can also write it as just a.
If you are looking to match exactly one of something, you can indeed write it as a{1}, however, you can also write it as just a.
Applying this lots, your example of ^[a]{1}[r]{2}[a]{1}[y]{1}[\(]{1}[1-9\.\,\+]{1*}[\)]{1}^ reduces to ^array\([1-9\.\,\+]{1*}\)^ - that's certainly an improvement!
Next, numbers may also include 0's, as well as 1-9. In fact, \d - any digit - is usually used instead of 1-9.
You are using ^ as the delimiter - usually that is /; I didn't recognize it at first. I'm not sure what you can use for the delimiter, so, just in case, I'll change it to the usual /. This makes the above regex /array\([\d\.\,\+]{1*}\)/.
To match one or more of a character or character set, use +, rather than {1*}. This makes your query /array\([\d\.\,\+]+\)/
Then, to collect the resulting numbers (assuming you want only the part between the brackets), put it in (non-escaped) brackets, thus: /array\(([\d\.\,\+]+)\)/ - you would then need to split them, first by +, then by ,. Alternatively, if there are exactly two lat,lng pairs, you might want: /array\(([\d\.]+),([\d\.]+)\+([\d\.]+),([\d\.]+)\)/ - this will return 4 values, one for each number; the additional stuff (+, ,) will already be removed, because it is not in (unescaped) brackets ().
Edit: If you want negative lats and longs (and why wouldn't you?) you will need \-? (a "literal -", rather than part of a range) in the appropriate places; the ? makes it optional (i.e. 0 or 1 dashes). For example, /array\((\-?[\d\.]+),(\-?[\d\.]+)\+(\-?[\d\.]+),(\-?[\d\.]+)\)/
You might also want to check out http://regexpal.com - you can put in a regex and a set of strings, and it will highlight what matches/doesn't match. You will need to exclude the delimiter / or ^.
Note that this is a little fast and loose; it would also match array(5,0+0,1...........). You can nail it down a little more, for example, by using (\-?\d*\.\d+)\) instead of (\-?[\d\.]+)\) for the numbers; that will match (0 or 1 literal -) followed by (0 or more digits) followed by (exactly one literal dot) followed by (1 or more digits).
This is the regex I made:
array\((-*\d+\.\d+),(-*\d+\.\d+)\+(-*\d+\.\d+),(-*\d+\.\d+)\)
This also breaks the four numbers into groups so you can get the individual numbers.
You will note the repeated pattern of
(-*\d+\.\d+)
Explanation:
-* means 0 or more matches of the - sign (so the - sign is optional)
\d+ means 1 or more digits
\. means a literal period (decimal point)
\d+ means 1 or more digits
The whole thing is wrapped in brackets to make it a captured group.
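A minimal sketch of the four-group approach with preg_match. It differs from the regex above in two ways that are my own assumptions: it uses -? (at most one minus sign) instead of -*, and it anchors the pattern with ^ and $ so the whole value must have this shape. The function name parseLatLngPair is hypothetical.

```php
<?php
// Hypothetical helper: returns two lat/lng points, or null when the stored
// value is not in the array(lat,lng+lat,lng) format (i.e. it is a path).
function parseLatLngPair(string $value): ?array
{
    $number  = '(-?\d+\.\d+)'; // optional minus, digits, literal dot, digits
    $pattern = '/^array\(' . $number . ',' . $number
             . '\+' . $number . ',' . $number . '\)$/';

    if (preg_match($pattern, $value, $m) !== 1) {
        return null;
    }
    return [
        ['lat' => (float) $m[1], 'lng' => (float) $m[2]],
        ['lat' => (float) $m[3], 'lng' => (float) $m[4]],
    ];
}

print_r(parseLatLngPair('array(51.51160000,-0.12766000+-53.36442000,132.27519000)'));
```

A null result can then be used as the test for "this value is a path, not a pair of points".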

How can I use regex to solve this?

I have two strings that I need to pull data out of but can't seem to get it working. I wish I knew regular expression but unfortunately I don't. I have read some beginner tutorials but I can't seem to find an expression that will do what I need.
Out of this first string delimited by the equal character, I need to skip the first 6 characters and grab the following 9 characters. After the equal character, I need to grab the first 4 characters which is a day and year. Lastly for this string, I need the remaining numbers which is a date in YYYYmmdd.
636014034657089=130719889904
The second string seems a little more difficult because the spaces between the characters differ but always seem to be delimited by at minimum, a single space. Sometimes, there are as many as 15 or 20 spaces separating the blocks of data.
Here are two different samples that show the space difference.
!!92519 C 01 M600200BLNBRN D55420090205M1O
!!95815 A M511195BRNBRN D62520070906 ":%/]Q2#0*&
The data that I need out of these last two strings are:
The zip code following the 2 exclamation marks.
The single letter 'M' following that. It always appears to be in a 13 character block
The 3 numbers after the single letter
The next 3 numbers which are the person's height
The following next 3 are the person's weight
The next 3 are eye color
The next block of 3 which are the person's hair color
The last block that I need data from:
I need to get the single letter which in the example appears to be a 'D'.
Skip the next 3 numbers
The last and remaining 8 numbers which is a date in YYYYmmdd
If someone could help me resolve this, I'd be very grateful.
For the first string you can use this regular expression:
^[0-9]{6}([0-9]{9})=([0-9]{4})([0-9]{4})([0-9]{2})([0-9]{2})$
Explanation:
^ Start of string/line
[0-9]{6} Match the first 6 digits
([0-9]{9}) Capture the next 9 digits
= Match an equals sign
([0-9]{4}) Capture the "day and year" (what format is this in?)
([0-9]{4}) Capture the year
([0-9]{2}) Capture the month
([0-9]{2}) Capture the day
$ End of string/line
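The breakdown above could be applied in PHP roughly like this. The function name and the array keys are placeholders I picked, since the question does not name the fields; no date validation is attempted.

```php
<?php
// Hypothetical helper: splits the first string into its captured parts.
function parseFirstString(string $s): ?array
{
    $pattern = '/^[0-9]{6}([0-9]{9})=([0-9]{4})([0-9]{4})([0-9]{2})([0-9]{2})$/';

    if (preg_match($pattern, $s, $m) !== 1) {
        return null; // the string is not in the expected format
    }
    return [
        'id'     => $m[1], // the 9 digits after the first 6
        'prefix' => $m[2], // the 4-digit "day and year" block
        'year'   => $m[3],
        'month'  => $m[4],
        'day'    => $m[5],
    ];
}

print_r(parseFirstString('636014034657089=130719889904'));
```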
For the second:
^!!([0-9]{5}) +.*? +M([0-9]{3})([0-9]{3})([A-Z]{3})([A-Z]{3}) +([A-Z])[0-9]{3}([0-9]{4})([0-9]{2})([0-9]{2})
It works in a similar way to the first. You may need to adjust it slightly if your data is not exactly in the format that the regular expression expects. You might want to replace the .*? with something more precise but I'm not sure what because you haven't described the format of the parts you are not interested in.
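For the second format, a sketch along the same lines. The pattern is the one given above; the function name and the array keys are my own guesses at labels, since the question's field list and the sample data do not line up exactly (the samples have two 3-digit blocks after the M, followed by two 3-letter blocks).

```php
<?php
// Hypothetical helper: pulls the fields out of one record of the second format.
function parseSecondString(string $s): ?array
{
    $pattern = '/^!!([0-9]{5}) +.*? +M([0-9]{3})([0-9]{3})([A-Z]{3})([A-Z]{3})'
             . ' +([A-Z])[0-9]{3}([0-9]{4})([0-9]{2})([0-9]{2})/';

    if (preg_match($pattern, $s, $m) !== 1) {
        return null;
    }
    return [
        'zip'     => $m[1],
        'digitsA' => $m[2], // first 3 digits after the M
        'digitsB' => $m[3], // next 3 digits
        'eyes'    => $m[4],
        'hair'    => $m[5],
        'letter'  => $m[6],
        'date'    => $m[7] . $m[8] . $m[9], // YYYYmmdd
    ];
}

print_r(parseSecondString('!!92519 C 01 M600200BLNBRN D55420090205M1O'));
print_r(parseSecondString('!!95815 A M511195BRNBRN D62520070906 ":%/]Q2#0*&'));
```

Because the pattern uses " +" between blocks, it should not matter whether the blocks are separated by one space or by many.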
