Regex optional groups

I'd like to capture up to four groups of text between <p> and </p>. I can do that using the following regex:
<h5>Trivia<\/h5><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p>
The text to match on:
<h5>Trivia</h5><p>Was discovered by a freelance photographer while sunbathing on Bournemouth Beach in August 2003.</p><p>Supports Southampton FC.</p><p>She has 11 GCSEs and 2 'A' Levels.</p><p>Listens to soul, R&B, Stevie Wonder, Aretha Franklin, Usher Raymond, Michael Jackson and George Michael.</p>
It outputs the four lines of text. It also works as intended if there are more trivia items or <p> occurrences.
But if there are fewer than four trivia items or <p> groups, it outputs nothing since it cannot find the fourth group. How do I make that group optional?
I've tried: <h5>Trivia<\/h5><p>(.*?)<\/p>(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)? and that works according to http://gskinner.com/RegExr/ but it doesn't work if I put it inside PHP code. It only detects one group and puts everything in it.

The magic word is either 'escaping' or 'delimiters', read on.
The first regex:
<h5>Trivia<\/h5><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p>
worked because you escaped the / characters in tags like </h5> to <\/h5>.
But in your second regex (which correctly encloses each additional paragraph in an optional non-capturing group, fetching 1 to 5 paragraphs):
<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?
you forgot to escape those / characters.
It should then have been:
$pattern = '/<h5>Trivia<\/h5><p>(.*?)<\/p>(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?/';
The above assumes you were putting your regex between two / "delimiter" characters (the conventional choice).
To dive a little deeper into the rabbit hole: in PHP, the first and last characters of a regular expression passed to the preg_* functions are delimiters, which is what lets you add modifiers after the closing one (case-insensitive, etc.).
So instead of escaping the slashes in your regex, you could also use a ~ character (or #, etc.) as the delimiter.
Thus you could also take the exact (second) regex that you posted and enclose it, for example, like this:
$pattern = '~<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';
Here is a working (web-based) example of that, using # as delimiter (just because we can).
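As a minimal sketch of the delimiter approach, the ~-delimited pattern can be run against input that has only two paragraphs. The shortened sample text and the variable names $html and $items are illustrative assumptions, not taken from the question:

```php
<?php
// Sample input with only two trivia paragraphs (shortened, illustrative).
$html = '<h5>Trivia</h5><p>Supports Southampton FC.</p><p>She has 11 GCSEs.</p>';

// Same pattern as above; ~ is the delimiter, so / needs no escaping.
$pattern = '~<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';

$items = [];
if (preg_match($pattern, $html, $m)) {
    // Drop the full match ($m[0]) and any optional groups that captured nothing.
    $items = array_values(array_filter(array_slice($m, 1), 'strlen'));
}

print_r($items);
```

Filtering out empty captures keeps the result the same whether or not PHP reports the unmatched optional groups.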

You can use the question mark to make each <p>...</p> optional:
$pattern = '~<h5>Trivia</h5>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';
Using the DOM is a good option too.
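A rough sketch of what the DOM route could look like, assuming the ext-dom extension is available and that the paragraphs are siblings of the <h5>Trivia</h5> heading. The sample markup, the XPath expression and the variable names are illustrative, not from the question:

```php
<?php
// Shortened sample markup (illustrative).
$html = '<h5>Trivia</h5><p>Supports Southampton FC.</p><p>She has 11 GCSEs.</p>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every <p> that follows the "Trivia" heading at the same level.
$xpath = new DOMXPath($doc);
$items = [];
foreach ($xpath->query('//h5[normalize-space()="Trivia"]/following-sibling::p') as $p) {
    $items[] = $p->textContent;
}

print_r($items);
```

This handles any number of paragraphs without repeating an optional group for each one.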

Related

How to extract text from multiple lines including the first and last word?

I am trying to extract part of a long text, such as information about caring for a plant. The text contains paragraphs and blank lines. I have three problems: I am not able to capture the specific text I want, the last word is missing from the extracted text, and the match fails when my search starts at the beginning of a line.
I tried searching for the text I want to extract by starting from a word that isn't at the beginning of the line. That worked, except that the end of the desired text is missing a word, and if that word is on a new line, there are no results at all.
I was using https://scriptun.com/tools/php/preg_match for testing
//The first word to start the search is 'How to'. And I want to capture it as well
// The second word where the text I want ends is '(optional):'
'/(?=How to).*?\s(?=\(optional\):)/'
The sample text I am using to test is:
//Text comes before this..
How to care for Split Leaf Plant
The Split leaf philodendron, also called monstera deliciosa or swiss
cheese plant, is a large, popular, easy- care houseplant that is not
really in the philodendron family. There is a great deal of confusion
about what to call this plant; the various names have become
inter-changeable over the years.
Here is more info (optional):
//And more text goes here
I want to extract all the text from the word 'How to' ending with '(optional):'. Regardless of how many lines or paragraphs are in between
The expected extracted text:
How to care for Split Leaf Plant
The Split leaf philodendron, also called monstera deliciosa or swiss
cheese plant, is a large, popular, easy- care houseplant that is not
really in the philodendron family. There is a great deal of confusion
about what to call this plant; the various names have become
inter-changeable over the years.
Here is more info (optional):
Thank you

That's pretty easy. You can use the following pattern (with the m flag, so that ^ and $ match at line boundaries):
https://regex101.com/r/TjE2x8/2
Pattern: ^How to[\w\W]+?\(optional\):$

Pattern: ^How to(?:.|\R)*optional\):$
demo on regex101
Explanation:
^ match the first instance where How to appears at the beginning of the line
(?: ) non capturing group. We need it because of the following OR instruction which is the pipe |. But we don't need to capture the contents. That's why we use ?: after the first parenthesis.
. every character
| or
\R every kind of new line
* repeat the group zero or more times
optional\):$ match the word optional followed by a closing parenthesis \) (escaped, because here it is a literal character and not an instruction) and a colon : at the very end of the line $
Pattern 2: /^How to.*optional\):$/ms
demo on regex101
This pattern is even simpler, but requires the m and s flags: m so that ^ and $ match at line boundaries, and s so that . also matches newlines.
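Here is a small sketch of Pattern 2 in PHP. The sample text is a shortened version of the one in the question, built with explicit "\n" line endings (an assumption; CRLF input would need adjusting), and the variable names are illustrative:

```php
<?php
// Shortened sample text with explicit "\n" line endings.
$text = implode("\n", [
    'Text comes before this..',
    'How to care for Split Leaf Plant',
    'The Split leaf philodendron, also called monstera deliciosa or swiss',
    'cheese plant, is a large, popular, easy- care houseplant.',
    'Here is more info (optional):',
    'And more text goes here',
]);

// m: ^ and $ match at line boundaries; s: . also matches newlines.
$extracted = null;
if (preg_match('/^How to.*optional\):$/ms', $text, $m)) {
    $extracted = $m[0];
}

echo $extracted, PHP_EOL;
```

Because .* is greedy, the match runs to the last line that ends in optional):, whereas the lazy [\w\W]+? variant stops at the first one.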

Detect phone number with preg_replace with some specifics

It's a basic preg_replace that detects phone numbers (and just long numbers). My problem is I want to avoid detecting numbers between double "", single '' and forward slashes //
$text = preg_replace("/(\+?[\d-\(\)\s]{8,25}[0-9]?\d)/", "<strong>$1</strong>", $text);
I poked around but nothing is working for me. Your help will be appreciated.

I predict that your pattern is going to let you down more than it is going to satisfy you (or you are very comfortable with "over-matching" within the scope of your project).
While my suggestion really blows out the pattern length, a (*SKIP)(*FAIL) technique will serve you well enough by consuming and discarding the substrings that require disqualification. There may be a way of dictating the pattern logic with lookaround instead, but with an initial pattern with so many potential holes in it and no sample data, there are just too many variables to make a confident suggestion.
Regex101 Demo
Code: (Demo)
$text = <<<TEXT
A number 555555555 then some more text and a quoted number "(123)4567890" and
then 1 2 3 4 6 (54) 3 -2 and forward slashed /+--------0/ versus
+--------0 then something more realistic '234 588 9191' no more text.
This is not closed by the same character on both
ends: "+012345678901/ which of course is a _necessary_ check?
TEXT;
echo preg_replace(
'~([\'"/])\+?[\d()\s-]{8,25}\d{1,2}\1(*SKIP)(*FAIL)|((?!\s)\+?[\d()\s-]{8,25}\d{1,2})~',
"<strong>$2</strong>",
$text);
Output:
A number <strong>555555555</strong> then some more text and a quoted number "(123)4567890" and
then <strong>1 2 3 4 6 (54) 3 -2</strong> and forward slashed /+--------0/ versus
<strong>+--------0</strong> then something more realistic '234 588 9191' no more text.
This is not closed by the same character on both
ends: "<strong>+012345678901</strong>/ which of course is a _necessary_ check?
For the technical breakdown, see the Regex101 link.
Otherwise, this is effectively checking for "phone numbers" (by your initial pattern) and if they are wrapped by ', ", or / then the match is ignored and the regex engine continues looking for matches AFTER that substring. I have added (?!\s) at the start of the second usage of your phone pattern so that leading spaces are omitted from the replacement.

It seems that you are not validating; in that case you might be looking for an expression with fewer boundaries, such as:
^\+?[0-9()\s-]{8,25}[0-9]$
If you wish to simplify, modify or explore the expression, it is explained in the top right panel of regex101.com. If you'd like, you can also see in this link how it matches against some sample inputs.

replace all punctuations except for abbreviations

I have a regex in PHP that replaces everything I don't want with spaces
/[^a-z0-9\p{L}]/siu
But there is one exception: I want to keep the punctuation in abbreviations.
Example:
F.B.I.Federal.Bureau.of.Investigation => 'F B I Federal Bureau of Investigation'
S.W.A.T.Team => 'S W A T Team'
Should be:
F.B.I.Federal.Bureau.of.Investigation => 'F.B.I. Federal Bureau of Investigation'
S.W.A.T.Team => 'S.W.A.T. Team'
PHP code:
$s = "F.B.I.Federal.Bureau.of.Investigation";
return preg_replace('/[^a-z0-9\p{L}]/siu', " ", $s);
So the logic is that it should check the second character of the first match and, if it is a '.', not replace it.
I'm not sure whether this is possible with regex; if not, I would appreciate an alternative in PHP.

Actually, there are many types of abbreviations, and as Jon Stirling says, there is no 100% reliable solution here, since you would need a complete list of possible abbreviations to filter out. You may have a peek at a fancy regex solution by @ndn and grab the pattern part related to abbreviations there.
If you need to only handle patterns like in the question, you may consider using
'~(\b(?:\p{Lu}\.){2,})|[^0-9\p{L}]~u'
or - if D.Word should also be treated as an abbreviation:
'~(\b(?:\p{Lu}\.)+)|[^0-9\p{L}]~u'
and replace with '$1 '. See the regex demo.
Pattern details:
(\b(?:\p{Lu}\.)+) - Group 1 (later referenced with the $1 backreference): 1 or more consecutive occurrences of any Unicode uppercase letter followed by a dot
| - or
[^0-9\p{L}] - any char that is neither an ASCII digit nor a Unicode letter.
And here is a variant of the regex with @ndn's abbreviations:
'~\b((?:[Ee]tc|St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd|pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs|\p{Lu}(?:\.\p{Lu})+)\.)|[^0-9\p{L}]~'
See the regex demo.
If you do not want to remove -, ( and ), just make sure to add them to the negated character class, replace [^0-9\p{L}] with [^0-9\p{L}()-].
Feel free to update by adding more abbreviations or enhance by shrinking the alternatives.
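To make the replacement concrete, here is a short sketch that applies the first pattern above (the {2,} variant) to the two inputs from the question. The variable names are illustrative:

```php
<?php
// Keep runs of two or more "uppercase letter + dot" (group 1);
// any other character that is not a digit or letter matches the second alternative.
$pattern = '~(\b(?:\p{Lu}\.){2,})|[^0-9\p{L}]~u';

$inputs = ['F.B.I.Federal.Bureau.of.Investigation', 'S.W.A.T.Team'];
$results = [];
foreach ($inputs as $s) {
    // '$1 ' re-inserts the abbreviation (empty when the second alternative matched) plus a space.
    $results[] = preg_replace($pattern, '$1 ', $s);
}

print_r($results);
```

The same replacement string works for both alternatives because $1 is simply empty when only the punctuation branch matched.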

How do I find blocks of text ending with "!!", while still allowing "!" characters in Regex?

I have a peculiar use case where I need to detect paragraphs that end in !!. Normal occurrences of ! (a single one) are fine in the paragraph, but the block ends when !! is found.
For example:
test foo bar !!
longer paragraph this time!
goes on and on
and then stops !!
Should be detected as two separate matches, one covering the first line, and another (separate) covering lines 2, 3 and 4. This brings it to a total of 2 matches.
(Preferably it should work with multiline-mode, as it's part of a larger regex that employs this mode.)
How would I accomplish this? I tried [^!!]* which to me says, find as many non-!! characters as possible, but I'm not sure how to leverage that, and worse yet it still finds single occurrences of !.

There is a common idiom in regular expressions that is used for escape sequences. (Like "\n" in a string.) You can use the same concept here.
The trick is to match either NOT the first character, or the first character followed by a valid second character.
In your case, that would be:
(?: # this is a package, either A or B, choose one
[^!] # Not a bang
| # or
![^!] # Bang, followed by not-a-bang
)
This pair of alternatives describes all the characters in your paragraph. So you can repeat it either 0 times (*) or one-or-more times (+) depending on what you are doing in the rest of your pattern.
# All together:
(?:[^!]|![^!])* # zero or more
(?:[^!]|![^!])+ # one or more
(Obviously, you can match '!!' at the end if you like...)
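As a sketch of how this could look in PHP, the one-or-more form is followed here by a literal !! and run against the sample from the question. The trim step and the variable names are additions for illustration:

```php
<?php
$text = "test foo bar !!\nlonger paragraph this time!\ngoes on and on\nand then stops !!";

// One or more of: a non-bang, or a bang followed by a non-bang; then the closing "!!".
preg_match_all('/(?:[^!]|![^!])+!!/', $text, $m);

// The second block starts with the newline that follows the first "!!", so trim each match.
$blocks = array_map('trim', $m[0]);

print_r($blocks);
```

No multiline flag is needed, since the pattern uses no anchors and [^!] matches newlines on its own.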

/^([!]?[^!]+[!]?[^!]+)*[!]{2}$/gm
This regex worked for me. It ensures any single ! characters are separated by non-! characters, but there don't have to be any single ! characters. It worked on multiline mode. This also has the added benefit of extracting the text that comes before an occurrence of "!!" since I assume you want to work with it.
/^([!]?[^!]+[!]?[^!]+)*.?[!]{2}$|^([!]?[^!]+[!]?[^!]+)*[^!]?[!]?$/gm
This slightly longer regex captures text that occurs after the final !! (ie, if the file has text between !! and EOF). I wouldn't recommend using the capturing groups though as on my regex checker, they didn't seem to work properly (that may have just been an implementation glitch, however, as the capturing groups look like they should work properly).

Try this:
([\w\s!]+?\!{2})
DEMO
Output:
MATCH 1
1. [0-15] `test foo bar !!`
MATCH 2
1. [15-76] `
longer paragraph this time!
goes on and on
and then stops !!`
or
(?:\n?([\w\s!]+?)\s?\!{2})
DEMO
Output:
MATCH 1
1. [0-12] `test foo bar`
MATCH 2
1. [16-73] `longer paragraph this time!
goes on and on
and then stops`

Try the following regex, which uses lookarounds
VERSION #1
/(?<=!!|^).*?(?=!!)/gms
Please see https://regex101.com/r/cQ0wC0/2
Result should be
OUTPUT:
test foo bar
longer paragraph this time!
goes on and on
and then stops
VERSION #2
Since the OP wants to capture the last paragraph of text after !! even if it does not end with bang signs:
/(?<=!!|^).*?(?=!!)|(?<=!!).*$/gms
Please see demo https://regex101.com/r/cQ0wC0/4
INPUT:
test foo bar !!
longer paragraph this time!
goes on and on
and then stops !!
longer paragraph this time!
goes on and on
OUTPUT:
test foo bar
longer paragraph this time!
goes on and on
and then stops
longer paragraph this time!
goes on and on
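For reference, VERSION #1 might be used from PHP roughly like this. PHP has no g flag, so preg_match_all takes its place; the trim step and the variable names are illustrative additions:

```php
<?php
$text = "test foo bar !!\nlonger paragraph this time!\ngoes on and on\nand then stops !!";

// Start right after a "!!" or at a line start; lazily consume up to the next "!!".
// m: ^ matches at line starts; s: . also matches newlines.
preg_match_all('/(?<=!!|^).*?(?=!!)/ms', $text, $m);

// Matches keep the surrounding whitespace (trailing space, leading newline), so trim them.
$paragraphs = array_map('trim', $m[0]);

print_r($paragraphs);
```

Unlike the consuming approach, the lookarounds leave the !! terminators out of each match.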

Retrieve 0 or more matches from comma separated list inside parenthesis using regex

I am trying to retrieve matches from a comma separated list that is located inside parenthesis using regular expression. (I also retrieve the version number in the first capture group, though that's not important to this question)
What's worth noting is that the expression should ideally handle all possible cases, where the list could be empty or could have more than 3 entries = 0 or more matches in the second capture group.
The expression I have right now looks like this:
SomeText\/(.*)\s\(((,\s)?([\w\s\.]+))*\)
The string I am testing this on looks like this:
SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)
Result of this is:
1. [6-11] `1.0.4`
2. [32-52] `, Macbook Pro Retina`
3. [32-34] `, `
4. [34-52] `Macbook Pro Retina`
The desired result would look like this:
1. [6-11] `1.0.4`
2. [32-52] `debug`
3. [32-34] `OS X 10.11.2`
4. [34-52] `Macbook Pro Retina`
According to the image above (as far as I can see), the expression should work on the test string. What is the cause of the weird results and how could I improve the expression?
I know there are other ways of solving this problem, but I would like to use a single regular expression if possible. Please don't suggest other options.

When dealing with a varying number of groups, regex ain't the best. Solve it in two steps.
First, break down the statement using a simple regex:
SomeText\/([\d.]*) \(([^)]*)\)
1. [9-14] `1.0.4`
2. [16-55] `debug, OS X 10.11.2, Macbook Pro Retina`
Then just explode the second result by ',' to get your groups.
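A minimal sketch of the two steps in PHP. Note that it swaps explode for preg_split, an assumption made so that the spaces after the commas are dropped and an empty list "()" yields an empty array; the variable names are illustrative:

```php
<?php
$line = 'SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)';

$version = null;
$items = [];

// Step 1: capture the version and the raw list between the parentheses.
if (preg_match('~SomeText/([\d.]*) \(([^)]*)\)~', $line, $m)) {
    $version = $m[1];
    // Step 2: split on commas with optional surrounding whitespace;
    // PREG_SPLIT_NO_EMPTY returns [] when the list is empty.
    $items = preg_split('/\s*,\s*/', $m[2], -1, PREG_SPLIT_NO_EMPTY);
}

print_r([$version, $items]);
```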

Probably the \G anchor works best here for binding the match to an entry point. This regex is designed for input that is always similar to the sample that is provided in your question.
(?<=SomeText\/|\G(?!^))[(,]? *\K[^,)(]+
(?<=SomeText\/|\G) the lookbehind is the part where matches should be glued to
\G matches where the previous match ended (?!^) but don't match start
[(,]? * matches an optional opening parenthesis or comma followed by any amount of space
\K resets beginning of the reported match
[^,)(]+ matches the wanted characters, that are none of ( ) ,
Demo at regex101 (grab matches of $0)
Another idea with use of capture groups.
SomeText\/([^(]*)\(|\G(?!^),? *([^,)]+)
This one without lookbehind is a bit more accurate (it also requires the opening parenthesis), of better performance (needs fewer steps) and probably easier to understand and maintain.
SomeText\/([^(]*)\( the entry anchor and version is captured here to $1
|\G(?!^),? *([^,)]+) or glued to previous match: capture to $2 one or more characters, that are not , ) preceded by optional space or comma.
Another demo at regex101
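The capture-group variant could be wired up in PHP along these lines. This is a sketch under the assumption that the input looks like the sample in the question; the ~ delimiter and the variable names are illustrative choices:

```php
<?php
$line = 'SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)';

// First alternative: the entry point, capturing the version into group 1.
// Second alternative: glued to the previous match via \G, capturing one list item into group 2.
preg_match_all('~SomeText/([^(]*)\(|\G(?!^),? *([^,)]+)~', $line, $m);

// Group 1 is only filled by the first match; it includes the space before "(".
$version = isset($m[1][0]) ? trim($m[1][0]) : null;

// Group 2 is empty for the entry-point match, so drop empty entries.
$items = array_values(array_filter($m[2], 'strlen'));

print_r([$version, $items]);
```

An empty list simply produces no second-alternative matches, so $items ends up as an empty array.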

Actually, stribizhev was close:
(?:SomeText\/([^() ]*)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\))
Just had to make that one class expect at least one match
(?:SomeText\/([0-9.]+)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\)) is a little more clear as long as the version number is always numbers and periods.

I wanted to come up with something more elegant than this (though this does actually work):
SomeText\/(.*)\s\(([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?\)
Obviously, the
([^\,]+)?\,?\s?
is repeated 6 times.
(It can be repeated any number of times and it will work for any number of comma-separated items equal to or below that number of times).
I tried to shorten the long, repetitive list of ([^\,]+)?\,?\s? above to
(?:([^\,]+)\,?\s?)*
but it doesn't work and my knowledge of regex is not currently good enough to say why not.

This should solve your problem. Use the code you already have and add something like this. It will determine where commas are in your string and delete them.
Use trim() to delete white spaces at the start or the end.
$a = strpos($line, ",");
$line = trim(substr($line, 55-$a));
I hope, this helps you!
