PHP Regex: Trouble identifying if next sub-pattern starts another pattern


I've been trying to extract this data from a file, but at the point where I'm stuck there could either be a whole new pattern (which starts with a date) or a complement to the route (which does not start with a digit).
I'm having trouble identifying whether what comes next starts a new pattern or is a complement. I also haven't been able to optimize this pattern, as you can see after the EQPT mark.
Examples of strings to match:
291011 311011 1234560 AZU4059 E190/M SBKP1513 N0458 350 DCT BGC DCT TRIVI DCT CNF UW58 SBRF0249 EQPT/WRG PBN/D1O1 EET/SBRE0107 SAGAZ/N0454F370 UW58 GEBIT UW10
271011 UFN 1230060 AZU4062 E190/M SBPA2140 N0460 350 UM540 OSAMU DCT NEGUS UW47 SBKP0120 EQPT/WRG PBN/D1O1 EET/SBBS0106
My regex so far:
preg_match_all('/([0-3][0-9][0|1][0-9][0-9]{2})\s*(UFN|[0-3][0-9][0|1][0-9][0-9]{2})\s*([0-7]{7})\s*(AZU[0-9]{4})\s*([A-Z0-9]{4})\/([L|M|H])\s*([A-Z0-9]{8})\s*(N[0-9]{4})\s*([0-9]{3})\s*([\S\s]{1,40})\s*([A-Z0-9]{8})\s*(EQPT\/WR?G?\s?P?B?N?\/?D?1?O?1?\s?E?E?T?\/?([A-Z0-9]{8})?)\s*)/', $result, $match);

I got it!
I had to do many things to make this work:
I removed all the double spaces and replaced all of the first sub-pattern dates with "######". I also replaced the second parameters with "UFN" and kept track of the values I replaced in a couple of arrays.
Then I added a # at the end and used it at the end of the regex pattern, so that it would be certain that it would start a new pattern when it came to a #. And it all worked out, I then just had to reposition the rest of the route so that it would complement the other one.
Thank you for trying to help!
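A minimal sketch of a related idea, not the code used above: instead of inserting # markers, it splits the input with a lookahead at every position where a new record begins. It assumes each record starts with a six-digit date, then either UFN or another six-digit date, then a seven-character field of digits 0-7, as in the two sample strings. The names $data and $records are illustrative only.

```php
<?php
// Two sample records from the question, joined as they might appear in the file.
$data = "291011 311011 1234560 AZU4059 E190/M SBKP1513 N0458 350 DCT BGC DCT TRIVI DCT CNF UW58 SBRF0249 EQPT/WRG PBN/D1O1 EET/SBRE0107 SAGAZ/N0454F370 UW58 GEBIT UW10\n"
      . "271011 UFN 1230060 AZU4062 E190/M SBPA2140 N0460 350 UM540 OSAMU DCT NEGUS UW47 SBKP0120 EQPT/WRG PBN/D1O1 EET/SBBS0106";

// Collapse all runs of whitespace (including newlines) into single spaces.
$data = preg_replace('/\s+/', ' ', trim($data));

// Zero-width lookahead: split wherever "date, date-or-UFN, 7 digits 0-7" begins.
// Assumption: this triple never occurs inside a route complement.
$recordStart = '/(?=\b\d{6} (?:UFN|\d{6}) [0-7]{7}\b)/';

$records = array_map('trim', preg_split($recordStart, $data, -1, PREG_SPLIT_NO_EMPTY));

print_r($records);
```

Each element of $records then holds one complete entry, including any route complement that follows the EQPT part, and can be matched with a simpler per-record pattern.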

Related

How to extract text from multiple lines including the first and last word?

I am trying to extract part of a long text, such as information about caring for a plant. The text contains paragraphs and blank lines. I have three problems: I can't capture the specific text I want, the last word is missing from the extracted text, and the search fails when the start word is at the beginning of a line.
I tried searching with a start word that isn't at the beginning of the line. That worked, except that the last word of the desired text is missing, and if that word is on a new line I get no results at all.
I was using https://scriptun.com/tools/php/preg_match for testing
//The first word to start the search is 'How to'. And I want to capture it as well
// The second word where the text I want ends is '(optional):'
'/(?=How to).*?\s(?=\(optional\):)/'
The sample text I am using to test is:
//Text comes before this..
How to care for Split Leaf Plant
The Split leaf philodendron, also called monstera deliciosa or swiss
cheese plant, is a large, popular, easy- care houseplant that is not
really in the philodendron family. There is a great deal of confusion
about what to call this plant; the various names have become
inter-changeable over the years.
Here is more info (optional):
//And more text goes here
I want to extract all the text from the word 'How to' ending with '(optional):'. Regardless of how many lines or paragraphs are in between
The expected extracted text:
How to care for Split Leaf Plant
The Split leaf philodendron, also called monstera deliciosa or swiss
cheese plant, is a large, popular, easy- care houseplant that is not
really in the philodendron family. There is a great deal of confusion
about what to call this plant; the various names have become
inter-changeable over the years.
Here is more info (optional):
Thank you

That's pretty easy. You can use the following pattern:
https://regex101.com/r/TjE2x8/2
Pattern: ^How to[\w\W]+?\(optional\):$

Pattern: ^How to(?:.|\R)*optional\):$
demo on regex101
Explanation:
^ match How to where it appears at the beginning of a line (this needs the m flag; without it ^ only matches at the start of the whole text)
(?: ) non capturing group. We need it because of the following OR instruction which is the pipe |. But we don't need to capture the contents. That's why we use ?: after the first parenthesis.
. every character
| or
\R every kind of new line
* repeat the group zero or more times
optional\):$ match the word optional followed by a closing parenthesis \) (escaped, because otherwise it would close a group) and a colon : at the end of a line $
Pattern 2: /^How to.*optional\):$/ms
demo on regex101
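This pattern is even simpler, but requires the m and s flags: m so that ^ and $ match at line boundaries, and s so that . also matches newlines. Note that .* is greedy, so if (optional): occurs more than once the match runs to the last occurrence; use .*? to stop at the first one.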
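A short sketch of Pattern 2 in PHP, using a shortened version of the sample text from the question. It uses the lazy .*? and escapes the opening parenthesis as well, which are small assumptions on top of the pattern as posted; $text and $extracted are illustrative names.

```php
<?php
// Shortened sample text; lines are joined with "\n".
$text = implode("\n", [
    '//Text comes before this..',
    'How to care for Split Leaf Plant',
    'The Split leaf philodendron, also called monstera deliciosa or swiss',
    'inter-changeable over the years.',
    'Here is more info (optional):',
    '//And more text goes here',
]);

$extracted = null;

// m: ^ and $ match at line boundaries; s: . also matches newlines.
if (preg_match('/^How to.*?\(optional\):$/ms', $text, $m)) {
    $extracted = $m[0];
}

echo $extracted, "\n";
```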

How would I replace a word in a string that I know the start and ending to, but not the entire word? Ie: Converting an ID# to a name

I am creating a web interface for a Discord bot I have created. I currently store all user accounts, messages, etc in a SQL database so that the web interface can have extensive logs for the mods to use. I am currently trying to come up with a solution for when viewing messages to convert "Discord Mentions" to readable names.
For example, when someone tags/mentions another user in a message, instead of the SQL storing '#name' it stores '<#!12345678>'. Based on how that text starts with <#! I know that it's linking a user name, in which I can access the SQL table containing all the users to retrieve their plain text name, but I'm not sure how to:
A) Specifically grab any words that both start with <#! and end with > to be able to grab the ID for a query and
B) Replace the above <#!12345etc>, which is easy enough to do once I know how to do A.
Just for clarification, I'm not looking for help with the SQL query, just for help getting the entire word that starts with <#! and ends with > from a string/paragraph.
I'm terrible with regex so hopefully there is a solution that can work without needing it haha. Any tips you could provide would be greatly appreciated.
TLDR:
Sample string:
"Hey <#!123456789> thanks for that, I'll get back to you sooon."
How do I grab the entire word that starts with <#! and ends with > so that I can run a SQL query with it and then do a replace() later?
I thought about exploding the string with a space and then going through each word one at a time checking each word with startswith and endswith but if the message author didn't leave a space between mentions and the rest of the text that wouldn't work.

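If I'm understanding this correctly, you want all the values between "<#!" and ">". In that case I believe all you need is /<#!(.+?)>/ (the lazy .+? stops at the first >, so several mentions in one message are matched separately; PHP has no g flag, so use preg_match_all to get every match).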
demo

You can do it this way:
<?php
$str = "Hey <#!123456789> thanks for that, I'll get back to you sooon.";
$re = '/(?<=<#!).+?(?=>)/m';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// flatten the array result (otherwise it's an array of arrays)
$matches = array_merge(...$matches);
// Print the entire match result
print_r($matches); //Array ( [0] => 123456789 )
Demo https://3v4l.org/D64Ma
Regex explanation:
(?<=<#!) looks behind to find <#!. This starts the match
.+? matches any character any number of times until next lookahead.
(?=>) match ends when > is found (but not included in match)
The difference between using lookarounds and the plain /<#!(.+?)>/ is the matches array that they produce.
With lookarounds, <#! and > are not part of the match, so the result is an array of arrays containing only the IDs ("123456789").
Without lookarounds, the result is an array of arrays containing both the full match ("<#!123456789>") and the captured group ("123456789"), so you would have to extract the captured groups from the resulting arrays.
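For part B of the question, a possible sketch that does the lookup and replacement in one pass with preg_replace_callback. The $names array stands in for the SQL lookup and is purely hypothetical, as is the name Alice; the sketch also assumes the IDs consist of digits only.

```php
<?php
// Hypothetical ID => name map; in the real application this would come from the users table.
$names = ['123456789' => 'Alice'];

$message = "Hey <#!123456789> thanks for that, I'll get back to you sooon.";

// $m[0] is the whole mention, $m[1] the numeric ID captured by (\d+).
$readable = preg_replace_callback(
    '/<#!(\d+)>/',
    function (array $m) use ($names): string {
        // Leave unknown IDs untouched rather than guessing a name.
        return isset($names[$m[1]]) ? '#' . $names[$m[1]] : $m[0];
    },
    $message
);

echo $readable, "\n";
```

Because the pattern does not depend on whitespace, it also handles mentions that are not separated from the surrounding text by spaces.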

PHP preg_replace: find string part not starting with an exclamation point

I am working on some very messy Excel sheets, and trying to use PHP to find clues..
I have a MySQL database with all formulas from an excel document, and as usual, the cellnames from the current sheet do not have a "sheetname!" in front of it. To make it searchable (and find dead-routes in the formulas) I like to replace all formulas in the database with their sheetname as prefix.
Example:
=+(sheet_factory_costs!A17/sheet_employees!D23)+T12+W12
The database contains the name of the current sheet, and I like to change the formula above with that sheetname (let's call it "sheet_turnover").
=+(sheet_factory_costs!A17 / sheet_employees!D23)+sheet_turnover!T12+sheet_turnover!W12
I try this in PHP with preg_replace, and I think I need the following rules:
Find one or two letters, directly followed by a number. This is always a cell address within formulas.
When there is a ! in the position before, there is already a sheetname. So I am only looking for the letters and numbers NOT preceded by an exclamation point.
The problem seems to be that the ! is also a special sign within patterns. Even if I try to escape it, it does not work:
$newformula =
preg_replace('/(?<\!)[A-Z]{1,2}[0-9]/',
'lala',
$oldformula);
(lala is my temporary marker to see if it is selecting the right cell addresses)
(and yes, the lala only replaces up to the first digit, but that's no issue right now)
(and yes, all Excel $..$.. (permanent) markers have already been replaced. No need to build that in the formula)

Your negative lookbehind is corrupt, you need to define it as (?<!!). However, you also need to use either a word boundary before it, or a (?<![A-Z]) lookbehind to make sure you have no other letters before the [A-Z]{1,2}.
So, you may use
'~\b(?<!!)[A-Z]{1,2}[0-9]~'
See the regex demo. Replace with sheet_turnover!$0 where $0 is the whole match value.
Details
\b - a word boundary (it is necessary, or name!AA11 would still get matched)
(?<!!) - no ! immediately to the left of the current location
[A-Z]{1,2} - 1 or 2 letters
[0-9] - a digit.
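A small sketch applying this pattern to the example formula from the question, with sheet_turnover as the current sheet name. The variable names are illustrative only.

```php
<?php
$oldformula = '=+(sheet_factory_costs!A17/sheet_employees!D23)+T12+W12';

// $0 in the replacement is the whole match (letters plus first digit),
// so the prefix is inserted in front of the cell address.
$newformula = preg_replace(
    '~\b(?<!!)[A-Z]{1,2}[0-9]~',
    'sheet_turnover!$0',
    $oldformula
);

echo $newformula, "\n";
```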
Another approach is match and skip "wrong" contexts and then match and keep the "right" ones:
'~\w+![A-Z]{1,2}[0-9](*SKIP)(*F)|\b[A-Z]{1,2}[0-9]~'
See this regex demo.
Here, the \w+![A-Z]{1,2}[0-9](*SKIP)(*F)| part matches 1 or more word chars, then a !, then 1 or 2 uppercase ASCII letters and then a digit; (*SKIP)(*F) then discards that match and makes the engine continue looking for matches after the end of the discarded text.
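The same example run through the match-and-skip pattern, as a sketch; on this input it should give the same result as the lookbehind version. Variable names are illustrative only.

```php
<?php
$oldformula = '=+(sheet_factory_costs!A17/sheet_employees!D23)+T12+W12';

// First alternative consumes already-prefixed references and throws them away;
// only bare cell addresses reach the second alternative.
$skipPattern = '~\w+![A-Z]{1,2}[0-9](*SKIP)(*F)|\b[A-Z]{1,2}[0-9]~';

$prefixed = preg_replace($skipPattern, 'sheet_turnover!$0', $oldformula);

echo $prefixed, "\n";
```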

Retrieve 0 or more matches from comma separated list inside parenthesis using regex

I am trying to retrieve matches from a comma separated list that is located inside parenthesis using regular expression. (I also retrieve the version number in the first capture group, though that's not important to this question)
What's worth noting is that the expression should ideally handle all possible cases, where the list could be empty or could have more than 3 entries = 0 or more matches in the second capture group.
The expression I have right now looks like this:
SomeText\/(.*)\s\(((,\s)?([\w\s\.]+))*\)
The string I am testing this on looks like this:
SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)
Result of this is:
1. [6-11] `1.0.4`
2. [32-52] `, Macbook Pro Retina`
3. [32-34] `, `
4. [34-52] `Macbook Pro Retina`
The desired result would look like this:
1. [6-11] `1.0.4`
2. [32-52] `debug`
3. [32-34] `OS X 10.11.2`
4. [34-52] `Macbook Pro Retina`
According to the image above (as far as I can see), the expression should work on the test string. What is the cause of the weird results and how could I improve the expression?
I know there are other ways of solving this problem, but I would like to use a single regular expression if possible. Please don't suggest other options.

When dealing with a varying number of groups, regex ain't the best. Solve it in two steps.
First, break down the statement using a simple regex:
SomeText\/([\d.]*) \(([^)]*)\)
1. [9-14] `1.0.4`
2. [16-55] `debug, OS X 10.11.2, Macbook Pro Retina`
Then just explode the second result by ',' to get your groups.
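The two steps could look roughly like this in PHP. The trimming and the empty-list check are additions for illustration, not part of the answer as written.

```php
<?php
$input = 'SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)';

$version = null;
$items = [];

// Step 1: capture the version and the raw list between the parentheses.
if (preg_match('~SomeText/([\d.]*) \(([^)]*)\)~', $input, $m)) {
    $version = $m[1];
    // Step 2: split the list on commas; an empty list yields an empty array.
    $items = $m[2] === '' ? [] : array_map('trim', explode(',', $m[2]));
}

print_r([$version, $items]);
```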

Probably the \G anchor works best here for binding the match to an entry point. This regex is designed for input that is always similar to the sample that is provided in your question.
(?<=SomeText\/|\G(?!^))[(,]? *\K[^,)(]+
(?<=SomeText\/|\G) the lookbehind is the part where matches should be glued to
\G matches where the previous match ended (?!^) but don't match start
[(,]? * matches an optional opening parenthesis or comma followed by any number of spaces
\K resets beginning of the reported match
[^,)(]+ matches the wanted characters, that are none of ( ) ,
Demo at regex101 (grab matches of $0)
Another idea with use of capture groups.
SomeText\/([^(]*)\(|\G(?!^),? *([^,)]+)
This one without lookbehind is a bit more accurate (it also requires the opening parenthesis), of better performance (needs fewer steps) and probably easier to understand and maintain.
SomeText\/([^(]*)\( the entry anchor and version is captured here to $1
|\G(?!^),? *([^,)]+) or glued to previous match: capture to $2 one or more characters, that are not , ) preceded by optional space or comma.
Another demo at regex101
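A sketch of how the capture-group variant might be consumed in PHP with preg_match_all. The loop that separates the version from the list items, and the trim() on the version (group 1 includes the space before the parenthesis), are assumptions added for illustration.

```php
<?php
$input = 'SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)';

preg_match_all(
    '~SomeText/([^(]*)\(|\G(?!^),? *([^,)]+)~',
    $input,
    $sets,
    PREG_SET_ORDER
);

$version = null;
$items = [];

foreach ($sets as $set) {
    if (($set[2] ?? '') !== '') {
        // A match glued to the previous one via \G: one list entry in group 2.
        $items[] = $set[2];
    } else {
        // The entry match: group 1 holds the version, possibly with a trailing space.
        $version = trim($set[1]);
    }
}

print_r([$version, $items]);
```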

Actually, stribizhev was close:
(?:SomeText\/([^() ]*)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\))
Just had to make that one class expect at least one match
(?:SomeText\/([0-9.]+)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\)) is a little more clear as long as the version number is always numbers and periods.

I wanted to come up with something more elegant than this (though this does actually work):
SomeText\/(.*)\s\(([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?\)
Obviously, the
([^\,]+)?\,?\s?
is repeated 6 times.
(It can be repeated any number of times and it will work for any number of comma-separated items equal to or below that number of times).
I tried to shorten the long, repetitive list of ([^\,]+)?\,?\s? above to
(?:([^\,]+)\,?\s?)*
but it doesn't work and my knowledge of regex is not currently good enough to say why not.
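A likely reason, shown as a sketch: when a capture group sits inside a repeated group, the regex engine keeps only the text from the last iteration, so earlier list entries are overwritten. The pattern below is the shortened one with the class changed to [^,)]+ so that it cannot run past the closing parenthesis; that change is an adjustment made for this illustration.

```php
<?php
$input = 'SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)';

// Group 2 is inside (?: ... )*, so it is overwritten on every iteration.
preg_match('~SomeText/(.*)\s\((?:([^,)]+),?\s?)*\)~', $input, $m);

// $m has only three entries: whole match, version, and the LAST list item.
print_r($m);
```

This is also why the original expression in the question reported only the final entry of the list.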

This should solve your problem. Use the code you already have and add something like this. It will determine where commas are in your string and delete them.
Use trim() to delete white spaces at the start or the end.
$a = strpos($line, ",");
$line = trim(substr($line, 55-$a));
I hope this helps you!

RegEx with character set inside positive lookbehind, Is it possible?

I need to match "name" only after "listing", but of course those words could be any url directory or page.
mydomain.com/listing/name
so the only thing I can "REGuest" (request) is to be some parent directory there.
In other words, I want to match the "position" i.e. whatever comes 2nd after the domain.
I'm trying something like
(?<=mydomain\.com/[^/\?&]+/)[^/\?&]+(?:/)?
But the character set won't work inside the positive lookbehind, at least it's setup to match only ONE character. As soon as I try to match other than one (e.g. modify it with +, ? or *) it just stops working.
I'm obviously missing the positive lookbehind syntax and it seems not intended for what I'm trying.
How can I match that 2nd level filename?
Thanks.

Regular-expressions.info states that
The bad news is that most regex flavors do not allow you to use just
any regex inside a lookbehind, because they cannot apply a regular
expression backwards. Therefore, the regular expression engine needs
to be able to figure out how many steps to step back before checking
the lookbehind...
(Read further, they even mention Perl, Python and Java.)
I think the quantifier might be the problem. I found this on Stack Overflow and only skimmed it.
Wouldn't it be possible to just match the whole path, and use a group for the second level filename:
mydomain\.com\/[^\/\?&]+\/([^\/\?&]+)(?:\/)?
(note: I had to escape the / for my tests...)
The result of this would be something like:
Array
(
[0] => mydomain.com/listing/name
[1] => name
)
Now, because I don't know the context of your problem, I just assumed you would be able to postprocess the results and get the group 1 (index 1) from the result. If not, I unfortunately don't know...
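In PHP, reading group 1 could look like the sketch below. It uses ~ as the delimiter so the slashes need no escaping; $url and $secondLevel are illustrative names.

```php
<?php
$url = 'mydomain.com/listing/name';

$secondLevel = null;

// Group 1 captures the second path segment; the first segment is matched but not captured.
if (preg_match('~mydomain\.com/[^/?&]+/([^/?&]+)/?~', $url, $m)) {
    $secondLevel = $m[1];
}

echo $secondLevel, "\n";
```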
