RegEx parsing eshop params with PHP

Sorry for my bad English. I have some product parameters from an e-shop, like:
Mraznička
* Počet zásuvek mrazničky 3
* XXL zásuvka
* Mrazící výkon 4,5 kg/24 h
Rozměry balení:
Hmotnost (kg): 61.000
Výška (cm): 182.00
Šířka (cm): 64.00
Hloubka (cm): 71.00
Typ: volně stojící
Konstrukce chladničky: kombinovaná
Umístění mrazícího prostoru: mraznička dole
Změna otevírání dveří: ANO
Ovládání: mechanické-knoflíkové
Displej: bez displeje
Energetická třída: A++
There are three kinds of block, and I need to determine which kind a given block is.
Conditions for the types:
1) The text block begins with any character other than *, and its first line does NOT end with :. This line must be followed by one or more lines beginning with *.
2) The text block begins with any character other than *, and its first line ends with :. This line must be followed by one or more lines NOT beginning with *.
3) One or more lines, each beginning with a word (or words), followed by the character ":" and then any other word (or words).
Can you help me identify the type of a text block? I need to check each text block separately; splitting the long text into blocks is already done and works fine.
Thanks.

Here is a possible solution for the 3 cases, each with a link to an online regex tester.
Each of these regexes matches only one of the block types.
As a precondition, I assumed that the blocks are always separated by empty lines.
Edit
Minor update to the regexes, inspired by a comment (posted as a separate answer): cases 2 and 3 can overlap, so each regex now requires an empty line before the block.
1) http://www.myregextester.com/?r=df2be635
^[\r\n]{1,2}(?:[^*].+[^:][\r\n]{1,2})(?:\*.+[\r\n]{1,2})+$
2) http://www.myregextester.com/?r=f903ae6d
^[\r\n]{1,2}(?:[^*].+:[\r\n]{1,2})(?:[^*\s].+[\r\n]{1,2})+$
3) http://www.myregextester.com/?r=17ed0af8
^[\r\n]{1,2}(?:[^*].+:.+[^:][\r\n]{1,2})(?:[^*\s].+[\r\n]{1,2})+$
For all three cases, the whole match is available as group 0. Each regex is composed of two non-capturing groups: one for the first line and one for the repeated lines that follow it.
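Since each block is checked separately anyway, the three conditions from the question can also be tested line by line instead of with one multi-line regex. A minimal sketch, written in Python for brevity (the patterns only use syntax that PCRE shares). It is not the regexes above: classify_block and the pattern names are hypothetical, and it assumes the lines of a block are separated by \n.

```python
import re
from typing import Optional

# Hypothetical helper patterns, one per line shape named in the question.
HEAD_PLAIN = re.compile(r"^[^*].*[^:]$")     # first line: no leading *, no trailing :
HEAD_COLON = re.compile(r"^[^*].*:$")        # first line: no leading *, trailing :
KEY_VALUE = re.compile(r"^[^*:]+:\s*\S.*$")  # "words: words" on one line


def classify_block(block: str) -> Optional[int]:
    """Return 1, 2 or 3 for the block types in the question, or None."""
    lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
    if not lines:
        return None
    head, rest = lines[0], lines[1:]
    # Type 1: plain heading followed only by * lines.
    if HEAD_PLAIN.match(head) and rest and all(ln.startswith("*") for ln in rest):
        return 1
    # Type 2: heading ending with ":" followed only by non-* lines.
    if HEAD_COLON.match(head) and rest and not any(ln.startswith("*") for ln in rest):
        return 2
    # Type 3: every line is "key: value".
    if all(KEY_VALUE.match(ln) for ln in lines):
        return 3
    return None


type1 = classify_block("Mraznička\n* Počet zásuvek mrazničky 3\n* XXL zásuvka")
type2 = classify_block("Rozměry balení:\nHmotnost (kg): 61.000\nVýška (cm): 182.00")
type3 = classify_block("Typ: volně stojící\nDisplej: bez displeje")
```

Checking type 2 before type 3 is what keeps the two from overlapping: a heading line such as "Rozměry balení:" has nothing after the colon, so it never counts as a key/value line.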

Related

Regex to Capture Each Line to Unique Capture Group, Where Number of Lines Varies, and Some Data may be Missing

I’m looking for a regex expression that will capture each line (NOT including the line title colon and the space) to a separate Group. I'm using this regex within the Mac Application Keyboard Maestro.
Here's what I have: https://regex101.com/r/pxVzPM/1
My current regex captures the entire line but I recently decided to add the 'name' of the data like "Prefix: " and so I only want to capture the data itself. I tried changing the capture so that it ignores everything before the data I want like this:
\R?\h*:\ ((?:.+)?)
But when I repeat this, the regex no longer works.
Also, it would be great to have this as a repeating capture group if at all possible, instead of having to copy the code 11 times.
Caveats:
Sometimes the field data may be blank, like ‘Start: ‘ (see below): the ‘Start: ‘ label is there, but its value is not. Any of the fields may be blank.
I need a regex that will work for data with a minimum of say 4 or 5 lines, up to 'as many lines as are present'. Most likely this will be less than 20 total lines.
The capture data could be 'anything' from text to numbers to a colon etc.
Here is the data that I'm searching:
Prefix: 123
Name: Testing
File: 12345
Description: This field
Duration: 01:32
Start:
Volume: 200
Tempo: 120BPM
Referencing: Another Track
Original: This One
Notes: This is a test project
So I’m trying to capture this:
123
Testing
12345
This field
…etc.
Into Capture Groups:
Group 1 would be:
123
Group 2 would be:
Testing
and so on...
Any help is much appreciated!
Thanks!

What about :\s*(.*)?
This looks for a colon followed by optional whitespace, and captures everything after that whitespace up to the end of the line in a group.
You can look at the results of your test-data here:
regex101.com
EDIT:
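For blank fields, use :(.*) instead, and then trim every result to remove the leading whitespace. (With \s* after the colon, the whitespace match can run across the line break that follows an empty field, so the group ends up capturing the next line.)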
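To see both variants side by side, here is a small sketch in Python (not Keyboard Maestro; DATA is a shortened copy of the sample from the question, and the variable names are made up for the example):

```python
import re

# Shortened sample from the question; "Start:" has no value.
DATA = "Description: This field\nDuration: 01:32\nStart:\nVolume: 200\n"

# :(.*) captures from the first colon to the end of the line;
# strip() removes the leading space, as the answer suggests.
values = [m.group(1).strip() for m in re.finditer(r":(.*)", DATA)]

# Pitfall with \s*: it also matches the newline after an empty field,
# so the group swallows the following line.
skipped = re.findall(r":\s*(.*)", "Start:\nVolume: 200")

print(values)   # ['This field', '01:32', '', '200']
print(skipped)  # ['Volume: 200']
```

Because each match consumes the rest of its line, the second colon in "01:32" is never treated as a separator.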

How to iterate through regular expression for '1 or many' items?

I am trying to develop a regular expression to read a pager message into categories. At the end are the responding brigades' codes (CBORT, CYAND)
These brigade codes represent each responding brigade. The issue is that there can either be one eg. (CBORT) or many eg. (CBAGR, CBORT, CYAND). I am unsure how to make the regex match each brigade as an individual match.
Each brigade code will have the letter C as a prefix.
Can this be done using a regular expression or will I require a PHP script to iterate through the last part of the message to match each of these brigade codes into an array?
Pager Message:
##ALERT BORT1 G&SC1 GRASS FIRE - SPREADING QUICKLY 79 BOORT-YANDO RD BOORT SVNW 214 J15 (475017) AFPR CBORT CYAND F190400036
Current Regex:
(##)(ALERT)\s(\w+)\s(\S+)(C1|C3)\s(.+)\s(\d+|\d+KM|\d+ M|CNR|NEAR|NEXT TO|ADJACENT|BEHIND|ACROSS FROM|ACROSS|REAR OF|REAR|OUTSIDE)\s(.+)\s(SVNW)\s(\d+)\s(\w\d+)\s((\d+))\s(F|AF|FP|AFP|AFPR|AFPRS)\s(C\w+)\s(F\d+)
The (C\w+) section is the part I wish to repeat 1 or many times.
Thank you

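You can use ((C\w+)\s)+ to match one or more brigade codes. Be aware that a repeated capturing group keeps only its last iteration, so to collect every code you still need a second step (for example, capture the whole run of codes and split it on whitespace).
You also need to escape the parentheses in your message with \( and \), since they are reserved characters.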
Full regex :
(##)(ALERT)\s(\w+)\s(\S+)(C1|C3)\s(.+)\s(\d+|\d+KM|\d+ M|CNR|NEAR|NEXT TO|ADJACENT|BEHIND|ACROSS FROM|ACROSS|REAR OF|REAR|OUTSIDE)\s(.+)\s(SVNW)\s(\d+)\s(\w\d+)\s(\(\d+\))\s(F|AF|FP|AFP|AFPR|AFPRS)\s((C\w+)\s)+(F\d+)
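A repeated group such as ((C\w+)\s)+ only reports its last repetition, so getting all codes into an array takes one extra step. A rough sketch in Python, using just the tail of the pattern to keep it short; TAIL and the variable names are invented for the example, and capturing the whole run with ((?:C\w+\s)+) is my variation, not part of the regex above:

```python
import re

MESSAGE = ("##ALERT BORT1 G&SC1 GRASS FIRE - SPREADING QUICKLY 79 BOORT-YANDO RD "
           "BOORT SVNW 214 J15 (475017) AFPR CBORT CYAND F190400036")

# Tail of the pattern only: response code, run of brigade codes, incident number.
TAIL = re.compile(r"\s(F|AF|FP|AFP|AFPR|AFPRS)\s((?:C\w+\s)+)(F\d+)$")

match = TAIL.search(MESSAGE)
brigades = match.group(2).split()   # split the captured run on whitespace
incident = match.group(3)

# For comparison: the repeated capturing group keeps only its last iteration.
last_only = re.search(r"((C\w+)\s)+(F\d+)$", MESSAGE).group(2)

print(brigades)   # ['CBORT', 'CYAND']
print(last_only)  # CYAND
```

In PHP the same two steps would be preg_match followed by a split of the captured run.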

Matching string that contains asterisk [duplicate]

I know this sounds easy, but I am stuck.
I want to match strings that contain an asterisk (*).
Essentially, I want to allow strings with an asterisk at the front, the back, or both, but not in the middle:
(At most there will be 2 asterisks, at the front and/or back but never in the middle, and the string itself must be present.)
ALLOW:
*string* *string string* string
DENY:
*str*ing*
*str*ing str*ing* str*ing
*string*****
I tried
^\\*?((?!\\*).)*\\\*?$
and somehow it works.
Can someone explain how this works?
And verify whether it is correct? Regex is hard to debug and check.

You can use the following regex:
^\*?\w+\*?$
demo: https://regex101.com/r/vwuXv2/1/
Explanations:
^ anchor imposing the start of a line
\*? a * appearing at most one time
\w+ at least 1 word char appearing in the text ([a-zA-Z0-9_] feel free to change it depending on your need)
\*? a * appearing at most one time
$ end of line anchor
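A quick way to check the pattern against the ALLOW/DENY lists from the question, sketched in Python (the list names are made up for the example):

```python
import re

PATTERN = re.compile(r"^\*?\w+\*?$")

ALLOW = ["*string*", "*string", "string*", "string"]
DENY = ["*str*ing*", "*str*ing", "str*ing*", "str*ing", "*string*****"]

# Keep only the candidates the pattern accepts.
accepted = [s for s in ALLOW + DENY if PATTERN.match(s)]
print(accepted)  # ['*string*', '*string', 'string*', 'string']
```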
Now if you are interested in partial line matches, you can use the following regex:
(?<=^| )\*?\w+\*?(?=$| )
demo: https://regex101.com/r/vwuXv2/2/
Explanations: the added lookbehind and lookahead assertions require the match to be bounded by a space or by the start/end of the line.
Adding Japanese characters as requested in the comment (add in [^*\s] all the characters you need to exclude from the words):
^\*?[^*\s]+\*?$
demo: https://regex101.com/r/RaCmwt/1/
or
^\*?[[:alpha:]]+\*?$
(with unicode flag enabled) or just
^\*?\p{L}+\*?$
demo: https://regex101.com/r/RaCmwt/2/

You can simply say: Optionally start with asterisk, 0 or more arbitrary characters except asterisk, optionally end with asterisk.
^\*?[^*]*\*?$
https://regex101.com/r/bibCEc/2
An alternative is to invert the match and test that there is no asterisk (i.e. if(!...)) anywhere other than at the beginning or end, using a negative lookbehind and lookahead:
(?<!^)\*(?!$)
https://regex101.com/r/8St0M4/2
According to your recent edit, you would use the quantifier + to match 1 or more characters:
^\*?[^*]+\*?$
https://regex101.com/r/bibCEc/3
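The inverted check can be tried out like this; a sketch in Python, where is_allowed is a hypothetical helper name. Note that on its own this check does not require any text to be present, so a bare "*" passes:

```python
import re

# An asterisk that is neither the first nor the last character.
MIDDLE_STAR = re.compile(r"(?<!^)\*(?!$)")


def is_allowed(s: str) -> bool:
    """True when no asterisk appears in the middle of s."""
    return MIDDLE_STAR.search(s) is None


ALLOW = ["*string*", "*string", "string*", "string"]
DENY = ["*str*ing*", "*str*ing", "str*ing*", "str*ing", "*string*****"]

allow_results = [is_allowed(s) for s in ALLOW]
deny_results = [is_allowed(s) for s in DENY]
```

If the string itself is mandatory, combine it with a length check or use the ^\*?[^*]+\*?$ form instead.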

How do I find blocks of text ending with "!!", while still allowing "!" characters in Regex?

I have a peculiar use case where I need to detect paragraphs that end in !!. Single occurrences of ! are fine within the paragraph, but the block ends when !! is found.
For example:
test foo bar !!
longer paragraph this time!
goes on and on
and then stops !!
Should be detected as two separate matches, one covering the first line, and another (separate) covering lines 2, 3 and 4. This brings it to a total of 2 matches.
(Preferably it should work with multiline-mode, as it's part of a larger regex that employs this mode.)
How would I accomplish this? I tried [^!!]* which to me says, find as many non-!! characters as possible, but I'm not sure how to leverage that, and worse yet it still finds single occurrences of !.

There is a common idiom in regular expressions that is used for escape sequences. (Like "\n" in a string.) You can use the same concept here.
The trick is to match either NOT the first character, or the first character followed by a valid second character.
In your case, that would be:
(?: # this is a package, either A or B, choose one
[^!] # Not a bang
| # or
![^!] # Bang, followed by not-a-bang
)
This pair of alternatives describes all the characters in your paragraph. So you can repeat it either 0 times (*) or one-or-more times (+) depending on what you are doing in the rest of your pattern.
# All together:
(?:[^!]|![^!])* # zero or more
(?:[^!]|![^!])+ # one or more
(Obviously, you can match '!!' at the end if you like...)
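Put together with the closing !!, the idiom finds the two blocks from the question. A small sketch in Python (TEXT is the sample from the question; the leading newline that the second match picks up is stripped afterwards):

```python
import re

TEXT = "test foo bar !!\nlonger paragraph this time!\ngoes on and on\nand then stops !!"

# One or more of: a non-bang, or a bang followed by a non-bang; then the closing "!!".
BLOCK = re.compile(r"(?:[^!]|![^!])+!!")

blocks = [m.group(0).strip() for m in BLOCK.finditer(TEXT)]
for block in blocks:
    print(repr(block))
```

[^!] also matches line breaks, so no dot-matches-newline flag is needed for a block to span several lines.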

/^([!]?[^!]+[!]?[^!]+)*[!]{2}$/gm
This regex worked for me. It ensures any single ! characters are separated by non-! characters, but there don't have to be any single ! characters. It worked on multiline mode. This also has the added benefit of extracting the text that comes before an occurrence of "!!" since I assume you want to work with it.
/^([!]?[^!]+[!]?[^!]+)*.?[!]{2}$|^([!]?[^!]+[!]?[^!]+)*[^!]?[!]?$/gm
This slightly longer regex captures text that occurs after the final !! (ie, if the file has text between !! and EOF). I wouldn't recommend using the capturing groups though as on my regex checker, they didn't seem to work properly (that may have just been an implementation glitch, however, as the capturing groups look like they should work properly).

Try this:
([\w\s!]+?\!{2})
DEMO
Output:
MATCH 1
1. [0-15] `test foo bar !!`
MATCH 2
1. [15-76] `
longer paragraph this time!
goes on and on
and then stops !!`
or
(?:\n?([\w\s!]+?)\s?\!{2})
DEMO
Output:
MATCH 1
1. [0-12] `test foo bar`
MATCH 2
1. [16-73] `longer paragraph this time!
goes on and on
and then stops`

Try the following regex, which uses lookbehind and lookahead
VERSION #1
/(?<=!!|^).*?(?=!!)/gms
Please see https://regex101.com/r/cQ0wC0/2
Result should be
OUTPUT:
test foo bar
longer paragraph this time!
goes on and on
and then stops
VERSION #2
Since the OP wants to capture the last paragraph of text after !! even if it does not end with bang signs:
/(?<=!!|^).*?(?=!!)|(?<=!!).*$/gms
Please see demo https://regex101.com/r/cQ0wC0/4
INPUT:
test foo bar !!
longer paragraph this time!
goes on and on
and then stops !!
longer paragraph this time!
goes on and on
OUTPUT:
test foo bar
longer paragraph this time!
goes on and on
and then stops
longer paragraph this time!
goes on and on
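If you want to try both versions outside regex101, here is a sketch in Python. One adjustment is needed there: Python's re module rejects a lookbehind whose alternatives have different widths, so (?<=!!|^) is written as (?:(?<=!!)|^). The pattern names are made up for the example, and the results are stripped of surrounding whitespace.

```python
import re

FLAGS = re.MULTILINE | re.DOTALL  # the m and s flags from the answer

# VERSION #1, with the lookbehind split so that Python accepts it.
V1 = re.compile(r"(?:(?<=!!)|^).*?(?=!!)", FLAGS)
# VERSION #2 additionally keeps trailing text after the last "!!".
V2 = re.compile(r"(?:(?<=!!)|^).*?(?=!!)|(?<=!!).*$", FLAGS)

TEXT1 = "test foo bar !!\nlonger paragraph this time!\ngoes on and on\nand then stops !!"
TEXT2 = TEXT1 + "\nlonger paragraph this time!\ngoes on and on"

result1 = [m.strip() for m in V1.findall(TEXT1)]
result2 = [m.strip() for m in V2.findall(TEXT2)]
```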

Substring with dots

I am using the SUBSTRING function to retrieve an "excerpt" of a message body:
SELECT m.id, m.thread_id, m.user_id, SUBSTRING(m.body, 1, 100) AS body, m.sent_at
FROM message m;
What I would like to do is add 3 dots to the end of the substring, but only if the source string was more than my upper limit (100 characters), i.e. if substring had to cut off the string. If the source string is less than 100 characters then no need to add any dots to the end.
I am using PHP as my scripting language.

That can be done in the query, rather than PHP, using:
SELECT m.id, m.thread_id, m.user_id,
       CASE
         WHEN CHAR_LENGTH(m.body) > 100 THEN CONCAT(SUBSTRING(m.body, 1, 100), '...')
         ELSE m.body
       END AS body,
       m.sent_at
FROM message m;
The term for the three trailing dots is "ellipsis".

Ask for 101 characters. If you receive 101 characters, your source string is definitely longer than 100 characters. In that case, remove the last character in your scripting language of choice and add "...". This will relieve your DB somewhat.
Personally I would advise you to create a bit of a difference though. E.g. cut off at 90 characters if and only if you exceed 110 characters (by requesting 110 + 1 characters of course). Otherwise you will get the effect I notice with Slashdot sometimes: you have a Read the rest of this comment link, only to receive the final word of the story.
More or less, the user will be annoyed if the method of retrieving the rest of the story takes more space than the story itself.
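Both variants are easy to sketch on the application side. The example below is in Python rather than PHP, and the function names and default limits are only illustrative; each function assumes the query already fetched one character more than the threshold, e.g. SUBSTRING(m.body, 1, 101) for the first one.

```python
def add_ellipsis(fetched: str, limit: int = 100) -> str:
    """fetched is assumed to hold at most limit + 1 characters of the body."""
    if len(fetched) > limit:
        # The extra character proves the body was longer than the limit.
        return fetched[:limit] + "..."
    return fetched


def excerpt_with_slack(fetched: str, cut: int = 90, threshold: int = 110) -> str:
    """Cut at `cut` characters, but only when the body exceeds `threshold`.

    fetched is assumed to hold at most threshold + 1 characters of the body.
    """
    if len(fetched) > threshold:
        return fetched[:cut] + "..."
    return fetched


short = add_ellipsis("a" * 100)         # unchanged, exactly at the limit
long_ = add_ellipsis("a" * 101)         # cut to 100 characters plus "..."
kept = excerpt_with_slack("a" * 110)    # unchanged, within the slack
cut = excerpt_with_slack("a" * 111)     # cut to 90 characters plus "..."
```

With the slack variant, a "read more" link always reveals at least 21 further characters, which avoids the final-word effect described above.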
