Regex starts with x or x prefixed or suffixed - php

I'm trying to get pattern match for string like the following to convert every line into a list item <li>:
-Give on result
&Second new text
-The third text
Another paragraph without list.
-New list here
In natural language: match every line that starts with - and ends with the newline character \n.
I tried the following pattern that works fine:
/^([-|-]\w+\s*.*)?\n*$/gum
Of course we can write it simply without the square brackets ^(-\w+\s*.*)?\n*$ but for debugging I used it as described.
In the example above, when I replace the second - with & to get ^([-|&]\w+\s*.*)?\n*$, it works fine too and matches the second line of the sample string. However, I was not able to make it match a - that is prefixed or suffixed with white space.
I changed the sample string to:
- Give on result
&Second new text
-The third text
Another paragraph without list.
-New list here
and I tried the following pattern:
/^([-|\- |&| -]\w+\s*.*)?\n*$/gum
However, it failed to match a - that is prefixed or suffixed with white space.
Here is a live demo for the original working pattern:

To my understanding, you want to match a line that starts with a marker (either & or -), where the marker may be preceded and/or followed by space(s).
^\s*[&-]\s*(.*)$
If you do not want multilines, simply do not use the m modifier.

^(\h*(?:-|&)\h*\w+\s*.*)\n*$
You can try this. | inside [] has no special meaning. See demo.
https://regex101.com/r/nS2lT4/3
A line may start with horizontal whitespace, then it should have either - or &, which may be followed by spaces. Then it should have at least one alphanumeric character, which may be followed by whitespace. Then it can have anything or nothing. At the end, it consumes all the trailing newlines, or none if there aren't any.
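As a rough PHP sketch of how the answers above could be applied to the list-item task: it uses a simplified variant of the patterns above (not either one verbatim), with \h around the marker so the pattern cannot run across line breaks. The variable names, the extra leading space on the third sample line, and the <li> replacement are assumptions for illustration.

```php
<?php
// Sample text from the question; the third line gets a leading space for illustration.
$text = "- Give on result\n"
      . "&Second new text\n"
      . " -The third text\n"
      . "Another paragraph without list.\n"
      . "-New list here";

// ^ and $ anchor at line boundaries because of the m flag.
// \h* allows spaces or tabs before and after the - or & marker.
$pattern = '/^\h*[-&]\h*(.*)$/m';

// Wrap the text after the marker in <li>...</li>; other lines stay untouched.
$html = preg_replace($pattern, '<li>$1</li>', $text);

echo $html, "\n";
```

Note that PHP has no g flag: preg_replace already replaces every match.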

Related

How to extract text from multiple lines including the first and last word?

I am trying to extract part of a long text, such as information about caring for a plant. The text contains paragraphs and blank lines. I have three problems: I cannot capture the specific text I want, the last word is missing from the extracted text, and the match fails when the search starts at the beginning of a line.
I tried searching with a start word that isn't at the beginning of a line. That worked, except that the last word of the desired text was missing, and if that word is on a new line, there are no results at all.
I was using https://scriptun.com/tools/php/preg_match for testing
//The first word to start the search is 'How to'. And I want to capture it as well
// The second word where the text I want ends is '(optional):'
'/(?=How to).*?\s(?=\(optional\):)/'
The sample text I am using to test is:
//Text comes before this..
How to care for Split Leaf Plant
The Split leaf philodendron, also called monstera deliciosa or swiss
cheese plant, is a large, popular, easy- care houseplant that is not
really in the philodendron family. There is a great deal of confusion
about what to call this plant; the various names have become
inter-changeable over the years.
Here is more info (optional):
//And more text goes here
I want to extract all the text from the word 'How to' ending with '(optional):'. Regardless of how many lines or paragraphs are in between
The expected extracted text:
How to care for Split Leaf Plant
The Split leaf philodendron, also called monstera deliciosa or swiss
cheese plant, is a large, popular, easy- care houseplant that is not
really in the philodendron family. There is a great deal of confusion
about what to call this plant; the various names have become
inter-changeable over the years.
Here is more info (optional):
Thank you
That's pretty easy. You can use the following pattern:
https://regex101.com/r/TjE2x8/2
Pattern: ^How to[\w\W]+?\(optional\):$
Pattern: ^How to(?:.|\R)*optional\):$
demo on regex101
Explanation:
^ anchors the match where How to appears at the beginning of a line (this needs the m flag, since How to is not at the very start of the sample text)
(?: ) non capturing group. We need it because of the following OR instruction which is the pipe |. But we don't need to capture the contents. That's why we use ?: after the first parenthesis.
. every character
| or
\R every kind of new line
* repeats the group zero or more times, as many times as possible (greedy)
optional\):$ matches the word optional followed by a closing parenthesis \) (escaped, because a bare ) is regex syntax) and a colon : at the end of a line $ (again with the m flag)
Pattern 2: /^How to.*optional\):$/ms
demo on regex101
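This pattern is even simpler, but requires the m flag so that ^ and $ match at line boundaries, and the s flag so that the . metacharacter also matches newlines. Because .* is greedy, it runs to the last optional): in the text; use .*? to stop at the first one.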
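A minimal PHP sketch of Pattern 2, with one swap: .*? (lazy) instead of .* so the match ends at the first line that ends in (optional):. The sample text is shortened, and the variable names are assumptions for illustration.

```php
<?php
// Shortened version of the sample text from the question.
$text = "//Text comes before this..\n"
      . "How to care for Split Leaf Plant\n"
      . "The Split leaf philodendron, also called monstera deliciosa.\n"
      . "\n"
      . "Here is more info (optional):\n"
      . "//And more text goes here";

// m: ^ and $ match at line boundaries; s: . also matches newlines.
// .*? is lazy, so the match stops at the first "(optional):" that ends a line.
$pattern = '/^How to.*?\(optional\):$/ms';

if (preg_match($pattern, $text, $m)) {
    $extracted = $m[0]; // the whole match, first and last words included
} else {
    $extracted = null;  // no block found
}

echo $extracted, "\n";
```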

PHP preg_replace: find string part not starting with an exclamation point

I am working on some very messy Excel sheets, and trying to use PHP to find clues..
I have a MySQL database with all formulas from an Excel document and, as usual, the cell names from the current sheet do not have a "sheetname!" in front of them. To make it searchable (and find dead routes in the formulas), I would like to prefix every such cell reference in the database with its sheet name.
Example:
=+(sheet_factory_costs!A17/sheet_employees!D23)+T12+W12
The database contains the name of the current sheet, and I would like to rewrite the formula above with that sheet name (let's call it "sheet_turnover"):
=+(sheet_factory_costs!A17 / sheet_employees!D23)+sheet_turnover!T12+sheet_turnover!W12
I try this in PHP with preg_replace, and I think I need the following rules:
Find one or two letters, directly followed by a number. This is always a cell address within formulas.
When there is a ! in the position before, there is already a sheet name. So I am only looking for letters and numbers NOT preceded by an exclamation point.
The problem seems to be that the ! is also a special sign within patterns. Even if I try to escape it, it does not work:
$newformula =
preg_replace('/(?<\!)[A-Z]{1,2}[0-9]/',
'lala',
$oldformula);
(lala is my temporary marker to see if it is selecting the right cell addresses)
(and yes, the lala is only placed over the first digit, but that's no issue right now)
(and yes, all Excel $..$.. (permanent) markers have already been replaced. No need to build that in the formula)
Your negative lookbehind is malformed; you need to write it as (?<!!). However, you also need to use either a word boundary before it, or a (?<![A-Z]) lookbehind, to make sure you have no other letters before the [A-Z]{1,2}.
So, you may use
'~\b(?<!!)[A-Z]{1,2}[0-9]~'
See the regex demo. Replace with sheet_turnover!$0 where $0 is the whole match value.
Details
\b - a word boundary (it is necessary, or name!AA11 would still get matched)
(?<!!) - no ! immediately to the left of the current location
[A-Z]{1,2} - 1 or 2 letters
[0-9] - a digit.
Another approach is to match and skip the "wrong" contexts, and then match and keep the "right" ones:
'~\w+![A-Z]{1,2}[0-9](*SKIP)(*F)|\b[A-Z]{1,2}[0-9]~'
See this regex demo.
Here, the \w+![A-Z]{1,2}[0-9](*SKIP)(*F)| part matches 1 or more word chars, then a !, then 1 or 2 uppercase ASCII letters and then a digit; (*SKIP)(*F) then discards that match and makes the engine continue looking for matches after the end of the discarded text.
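Both patterns above can be tried in PHP roughly as follows. This is a sketch: $sheet and the result variable names are assumptions for illustration, and the sheet name is hard-coded where the question would read it from the database.

```php
<?php
$oldformula = '=+(sheet_factory_costs!A17/sheet_employees!D23)+T12+W12';
$sheet = 'sheet_turnover'; // hypothetical: name of the current sheet

// Variant 1: word boundary plus a negative lookbehind for "!".
// $0 in the replacement stands for the whole match (e.g. "T1").
$viaLookbehind = preg_replace(
    '~\b(?<!!)[A-Z]{1,2}[0-9]~',
    $sheet . '!$0',
    $oldformula
);

// Variant 2: match already-prefixed references first and discard them
// with (*SKIP)(*F), then match the remaining bare references.
$viaSkip = preg_replace(
    '~\w+![A-Z]{1,2}[0-9](*SKIP)(*F)|\b[A-Z]{1,2}[0-9]~',
    $sheet . '!$0',
    $oldformula
);

echo $viaLookbehind, "\n";
echo $viaSkip, "\n";
```

Only the letters and the first digit are matched, but since the prefix is inserted in front of the match, the remaining digits of the cell address simply follow it.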

How do I find blocks of text ending with "!!", while still allowing "!" characters in Regex?

I have a peculiar use case where I need to detect paragraphs that end in !!. Normal occurrences of ! (a single one) are fine in the paragraph, but the block ends when !! is found.
For example:
test foo bar !!
longer paragraph this time!
goes on and on
and then stops !!
Should be detected as two separate matches, one covering the first line, and another (separate) covering lines 2, 3 and 4. This brings it to a total of 2 matches.
(Preferably it should work with multiline-mode, as it's part of a larger regex that employs this mode.)
How would I accomplish this? I tried [^!!]* which to me says, find as many non-!! characters as possible, but I'm not sure how to leverage that, and worse yet it still finds single occurrences of !.
There is a common idiom in regular expressions that is used for escape sequences. (Like "\n" in a string.) You can use the same concept here.
The trick is to match either NOT the first character, or the first character followed by a valid second character.
In your case, that would be:
(?: # this is a package, either A or B, choose one
[^!] # Not a bang
| # or
![^!] # Bang, followed by not-a-bang
)
This pair of alternatives describes all the characters in your paragraph. So you can repeat it either 0 times (*) or one-or-more times (+) depending on what you are doing in the rest of your pattern.
# All together:
(?:[^!]|![^!])* # zero or more
(?:[^!]|![^!])+ # one or more
(Obviously, you can match '!!' at the end if you like...)
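A short PHP sketch of this idiom with the !! terminator appended, run against the sample from the question. The variable names and the trim step are assumptions for illustration.

```php
<?php
$text = "test foo bar !!\n"
      . "longer paragraph this time!\n"
      . "goes on and on\n"
      . "and then stops !!";

// Body: any non-"!" character, or a "!" followed by a non-"!" character.
// Terminator: the literal "!!".
$pattern = '/(?:[^!]|![^!])+!!/';

preg_match_all($pattern, $text, $m);

// The second block starts with the newline left over after the first "!!",
// so trim surrounding whitespace from each match.
$blocks = array_map('trim', $m[0]);

print_r($blocks);
```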
/^([!]?[^!]+[!]?[^!]+)*[!]{2}$/gm
This regex worked for me. It ensures any single ! characters are separated by non-! characters, but there don't have to be any single ! characters. It worked on multiline mode. This also has the added benefit of extracting the text that comes before an occurrence of "!!" since I assume you want to work with it.
/^([!]?[^!]+[!]?[^!]+)*.?[!]{2}$|^([!]?[^!]+[!]?[^!]+)*[^!]?[!]?$/gm
This slightly longer regex captures text that occurs after the final !! (ie, if the file has text between !! and EOF). I wouldn't recommend using the capturing groups though as on my regex checker, they didn't seem to work properly (that may have just been an implementation glitch, however, as the capturing groups look like they should work properly).
Try this:
([\w\s!]+?\!{2})
DEMO
Output:
MATCH 1
1. [0-15] `test foo bar !!`
MATCH 2
1. [15-76] `
longer paragraph this time!
goes on and on
and then stops !!`
or
(?:\n?([\w\s!]+?)\s?\!{2})
DEMO
Output:
MATCH 1
1. [0-12] `test foo bar`
MATCH 2
1. [16-73] `longer paragraph this time!
goes on and on
and then stops`
Try following regex using lookahead
VERSION #1
/(?<=!!|^).*?(?=!!)/gms
Please see https://regex101.com/r/cQ0wC0/2
Result should be
OUTPUT:
test foo bar
longer paragraph this time!
goes on and on
and then stops
VERSION #2
Since the OP wants to capture the last paragraph of text after !! even if it does not end with bang signs:
/(?<=!!|^).*?(?=!!)|(?<=!!).*$/gms
Please see demo https://regex101.com/r/cQ0wC0/4
INPUT:
test foo bar !!
longer paragraph this time!
goes on and on
and then stops !!
longer paragraph this time!
goes on and on
OUTPUT:
test foo bar
longer paragraph this time!
goes on and on
and then stops
longer paragraph this time!
goes on and on
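In PHP, VERSION #2 might be used as sketched here. PHP has no g flag, so preg_match_all takes its place; the trim step and the variable names are assumptions for illustration.

```php
<?php
$text = "test foo bar !!\n"
      . "longer paragraph this time!\n"
      . "goes on and on\n"
      . "and then stops !!\n"
      . "longer paragraph this time!\n"
      . "goes on and on";

// First alternative: text between a line start or "!!" and the next "!!".
// Second alternative: whatever follows the final "!!".
$pattern = '/(?<=!!|^).*?(?=!!)|(?<=!!).*$/ms';

preg_match_all($pattern, $text, $m);

// Matches keep the whitespace next to the "!!" markers, so trim each one.
$paragraphs = array_map('trim', $m[0]);

print_r($paragraphs);
```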

PHP - preg_match_all - a little advanced

I need to find specific part of text in string.
That text need to have:
12 characters (letters and numbers only)
whole string must contain at least 3 digits
3*4 characters with spaces (ex. K9X6 6GM6 LM11)
every block from example above must contain at least 1 number
words like this, line, spod shouldn't be recognized
So I ended with this code:
preg_match_all("/(?<!\S)(?i:[a-z\d]{4}|[a-z\d]{12})(?!\S)/", $input_lines, $output_array);
But it doesn't work for all of the requirements. Of course I can use preg_replace or str_replace to remove all (!,?,#) and count the digits in a loop to check whether there are 4 or more, but I wonder if it is possible to do with preg_match_all...
Here is a string to search in:
?K9X6 6GM6 LM11 // not recognized - but it should be
!K9X6 6GM6 LM11 // not recognized - but it should be
K0X6 0GM7 LM12! // not recognized - but it should be
K1X6 1GM8 LM13# // not recognized - but it should be
K2X6 2GM9 LM14? // not recognized - but it should be
K3X6 3GM0 LM15# // not recognized - but it should be
K4X6 4GM1 LM16* // not recognized - but it should be
K5X65GM2LM17
bla bla bla
this shouldn't be visible
spod also shouldn't be visible
but line below should be!!
K9X66GM6LM11! (see that "!" at the end? Help me with this)
A correct preg_match_all should return this:
K9X6
6GM6
LM11
K9X6
6GM6
LM11
K0X6
0GM7
LM12
K1X6
1GM8
LM13
K2X6
2GM9
LM14
K3X6
3GM0
LM15
K4X6
4GM1
LM16
K5X65GM2LM17
K9X66GM6LM11
working example: http://www.phpliveregex.com/p/bHX
The following should do the trick:
\b(?:(?=.{0,3}?\d)[A-Za-z\d]{4}\s??){3}\b
Demo
[A-Za-z\d]{4} matches 4 letters/digits
(?=.{0,3}?\d) checks there's a digit in these 4 characters
\s?? matches a whitespace character, but tries not to match it if possible
\b makes sure everything isn't contained in a larger word
Note that this will allow strings like K2X6 2GM9LM14, I'm not sure whether you want these to match or not.
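A rough PHP sketch of the pattern above on a few lines of the sample input. The pattern returns each code as a single match, so the sketch adds a preg_split step (an assumption, not part of the answer) to break spaced codes into the 4-character blocks the question expects; $blocks is a hypothetical name.

```php
<?php
// A few lines from the sample input in the question.
$input_lines = "?K9X6 6GM6 LM11\n"
             . "K0X6 0GM7 LM12!\n"
             . "K5X65GM2LM17\n"
             . "bla bla bla\n"
             . "spod also shouldn't be visible\n"
             . "K9X66GM6LM11!";

// Three groups of 4 letters/digits, each containing at least one digit,
// optionally separated by a single whitespace character.
$pattern = '/\b(?:(?=.{0,3}?\d)[A-Za-z\d]{4}\s??){3}\b/';

preg_match_all($pattern, $input_lines, $output_array);

// Split spaced codes into their blocks; unspaced codes stay whole.
$blocks = [];
foreach ($output_array[0] as $code) {
    foreach (preg_split('/\s+/', $code) as $part) {
        $blocks[] = $part;
    }
}

print_r($blocks);
```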

Regex optional groups

I'd like to capture up to four groups of text between <p> and </p>. I can do that using the following regex:
<h5>Trivia<\/h5><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p>
The text to match on:
<h5>Trivia</h5><p>Was discovered by a freelance photographer while sunbathing on Bournemouth Beach in August 2003.</p><p>Supports Southampton FC.</p><p>She has 11 GCSEs and 2 'A' Levels.</p><p>Listens to soul, R&B, Stevie Wonder, Aretha Franklin, Usher Raymond, Michael Jackson and George Michael.</p>
It outputs the four lines of text. It also works as intended if there are more trivia items or <p> occurrences.
But if there are less than 4 trivia items or <p> groups, it outputs nothing since it cannot find the fourth group. How do I make that group optional?
I've tried: <h5>Trivia<\/h5><p>(.*?)<\/p>(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)? and that works according to http://gskinner.com/RegExr/ but it doesn't work if I put it inside PHP code. It only detects one group and puts everything in it.
The magic word is either 'escaping' or 'delimiters', read on.
The first regex:
<h5>Trivia<\/h5><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p>
worked because you escaped the / characters in tags like </h5> to <\/h5>.
But in your second regex (which correctly encloses each paragraph in an optional non-capturing group, fetching 1 to 5 paragraphs):
<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?
you forgot to escape those / characters.
It should then have been:
$pattern = '/<h5>Trivia<\/h5><p>(.*?)<\/p>(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?/';
The above is assuming you were putting your regex between two / "delimiters" characters (out of conventional habit).
To dive a little deeper into the rabbit hole, note that in PHP the first and last characters of a regular expression are "delimiters", which is what allows modifiers (such as case-insensitive) to be added after the closing one.
So instead of escaping your regex, you could also use a ~ character (or #, etc) as a delimiter.
Thus you could also use the same identical (second) regex that you posted and enclose for example like this:
$pattern = '~<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';
Here is a working (web-based) example of that, using # as delimiter (just because we can).
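A minimal PHP sketch of the ~-delimited pattern above, run on input with only two trivia items. The shortened $html sample and the $items name are assumptions for illustration.

```php
<?php
// Only two <p> items, fewer than the pattern's maximum of five.
$html = "<h5>Trivia</h5><p>Supports Southampton FC.</p>"
      . "<p>She has 11 GCSEs and 2 'A' Levels.</p>";

// With ~ as the delimiter, the / in the closing tags needs no escaping.
$pattern = '~<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';

$items = [];
if (preg_match($pattern, $html, $m)) {
    // $m[0] is the whole match; the remaining entries are the captured paragraphs.
    // Trailing optional groups that did not take part in the match are left out.
    $items = array_slice($m, 1);
}

print_r($items);
```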
You can use the question mark to make each <p>...</p> optional:
$pattern = '~<h5>Trivia</h5>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';
Using the DOM is a good option too.
