Regular expressions: matching words containing sequences - php

I am trying to match words containing the following: eph gro iss
I have eph|gro|iss which will match eph gro iss in this example: new grow miss eph.
However, I need to match the whole word. For example, it should match all of miss, not just iss, and grow, not just gro.
Thanks

You can do it like this:
\b(\w*(eph|gro|iss)\w*)\b
How it works:
The expression is bracketed with word-boundary anchors \b, so it only matches whole words. These words must contain one of the literals eph, gro or iss somewhere, but the \w* parts allow the literals to appear anywhere within the whole word.
The important thing here is that you need to adopt some specific definition for "words". If you are OK with the regex definition that words are sequences that match [a-zA-Z0-9_]+ then you can use the above verbatim.
If your definition of word is something else, you will need to replace the \b anchors and \w classes appropriately.
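As a minimal sketch of the approach above, assuming the sample sentence from the question (the variable names are only illustrative, and the inner group is made non-capturing so that only the whole word is captured):

```php
<?php
// Sample sentence from the question.
$text = 'new grow miss eph';

// \w* on both sides lets the literal sit anywhere inside the word;
// \b on both ends forces the match to cover the whole word.
preg_match_all('/\b(\w*(?:eph|gro|iss)\w*)\b/', $text, $matches);

print_r($matches[1]); // grow, miss, eph
```

Note that "new" is skipped because it contains none of the three literals.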

Try this:
\b([a-zA-Z]*(?:eph|gro|iss)[a-zA-Z]*)\b
Breakdown:
\b - word boundary
( - start capture
[a-zA-Z]* - zero or more letters
(?:eph|gro|iss) - your original regex, non-capturing
[a-zA-Z]* - zero or more letters
) - end capture
\b - word boundary
Example output:
php > $string = "new grow miss eph";
php > preg_match_all("/\b([a-zA-Z]*(?:eph|gro|iss)[a-zA-Z]*)\b/", $string, $matches);
php > print_r($matches);
Array
(
[0] => Array
(
[0] => grow
[1] => miss
[2] => eph
)
[1] => Array
(
[0] => grow
[1] => miss
[2] => eph
)
)

Related

How can I extract values that have opening and closing brackets with regular expression?

I am trying to extract [[String]] with regular expression. Notice how a bracket opens [ and it needs to close ]. So you would receive the following matches:
[[String]]
[String]
String
If I use \[[^\]]+\] it will just find the first closing bracket it comes across without taking into consideration that a new one has opened in between and it needs the second close. Is this at all possible with regular expression?
Note: This type can either be String, [String] or [[String]] so you don't know upfront how many brackets there will be.
You can use the following PCRE compliant regex:
(?=((\[(?:\w++|(?2))*])|\b\w+))
Details:
(?= - start of a positive lookahead (necessary to match overlapping strings):
( - start of Capturing group 1 (it will hold the "matches"):
(\[(?:\w++|(?2))*]) - Group 2 (technical, used for recursing): [, then zero or more occurrences of one or more word chars or the whole Group 2 pattern recursed, and then a ] char
| - or
\b\w+ - a word boundary (necessary since all overlapping matches are being searched for) and one or more word chars
) - end of Group 1
) - end of the lookahead.
See the PHP demo:
$s = "[[String]]";
if (preg_match_all('~(?=((\[(?:\w++|(?2))*])|\b\w+))~', $s, $m)){
print_r($m[1]);
}
Output:
Array
(
[0] => [[String]]
[1] => [String]
[2] => String
)
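Since the question notes that the nesting depth is not known up front, here is a small sketch that runs the same pattern over all three forms. The helper name extractNested is hypothetical and only wraps the call shown above.

```php
<?php
// Hypothetical helper wrapping the pattern from the answer above.
function extractNested(string $s): array
{
    // The lookahead consumes nothing, so overlapping candidates are all reported;
    // (?2) recurses into group 2 to handle nested brackets.
    preg_match_all('~(?=((\[(?:\w++|(?2))*])|\b\w+))~', $s, $m);
    return $m[1];
}

print_r(extractNested('String'));     // String
print_r(extractNested('[String]'));   // [String], String
print_r(extractNested('[[String]]')); // [[String]], [String], String
```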

Double regex matches

I'm preg_match_all looping through a string using different patterns. Sometimes these patterns look a lot like each other, but differ slightly.
Right now I'm looking for a way to stop pattern A from matching strings that only pattern B - which has a 'T' in front of the 4 digits - should match.
The problem I'm running into is that pattern A also matches pattern B:
A:
(\d{4})(A|B)?(C|D)?
... matches 1234, 1234A, 1234AD, etc.
B:
I also have another pattern:
T(\d{4})\/(\d{4})
... which matches strings like: T7878/6767
The result
When running a preg_match_all on "T7878/6767 1234AD", A will give the following matches:
7878, 6767, 1234AD
Does anyone have a suggestion how to prevent A from matching B in a string like "Some text T7878/6767 1234AD and some more text"?
Your help is greatly appreciated!
Scenario with boundaries
If you only want to match those specific strings within some boundaries, use those boundary patterns on each side of the pattern.
If you expect a whitespace boundary before each match, then add the (?<!\S) negative lookbehind at the start of the pattern. If you expect a whitespace boundary at the end of the match, add the (?!\S) negative lookahead. If there can be any chars (as is in your original question), then SKIP-FAIL is the only way (see below).
So, in this first case, you may use
(?<!\S)(\d{4})([AB]?)([CD]?)(?!\S)
and
(?<!\S)T(\d{4})\/(\d{4})(?!\S)
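A short sketch of how the two bounded patterns behave on the sample string from the question (variable names are illustrative; ~ is used as the delimiter so the slash needs no escaping):

```php
<?php
$subject = 'Some text T7878/6767 1234AD and some more text';

// Pattern A with whitespace boundaries: the digits inside T7878/6767 are
// preceded by "T" or "/", so the (?<!\S) lookbehind rejects them.
preg_match_all('~(?<!\S)(\d{4})([AB]?)([CD]?)(?!\S)~', $subject, $a);

// Pattern B with the same boundaries.
preg_match_all('~(?<!\S)T(\d{4})/(\d{4})(?!\S)~', $subject, $b);

print_r($a[0]); // 1234AD
print_r($b[0]); // T7878/6767
```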
Scenario with no specific boundaries
You need to make sure the second pattern is skipped when you parse the string with the first one. Use SKIP-FAIL technique for this:
'~T\d{4}/\d{4}(*SKIP)(*F)|(\d{4})(A|B)?(C|D)?~'
If you do not need the capturing groups, you may simplify it to
'~T\d{4}/\d{4}(*SKIP)(*F)|\d{4}[AB]?[CD]?~'
Details
T\d{4}/\d{4} - T followed with 4 digits, / and another 4 digits
(*SKIP)(*F) - the matched text is discarded and the next match is searched from the matched text end
| - or
\d{4}[AB]?[CD]? - 4 digits, then optionally A or B and then optionally C or D.
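To see the SKIP-FAIL version in action, here is a minimal sketch using the sample string from the question plus a second, made-up input with no whitespace around the values:

```php
<?php
// (*SKIP)(*F) makes the engine discard anything pattern B would match and
// resume searching after it, so only pattern A's matches are reported.
$pattern = '~T\d{4}/\d{4}(*SKIP)(*F)|(\d{4})(A|B)?(C|D)?~';

preg_match_all($pattern, 'Some text T7878/6767 1234AD and some more text', $m1);
print_r($m1[0]); // 1234AD

// No whitespace is needed around the values for this to work.
preg_match_all($pattern, 'abcT7878/67671234B', $m2);
print_r($m2[0]); // 1234B
```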
From what you're asking, your current regexes don't really work. (A|B)?(C|D)? will never match AB. So I think you meant [ABCD]
Here's your new regex:
T(\d{4})\/(\d{4}) (\d{4}[ABCD]*)
For the string input:
T7878/6767 1234AB
We get the groups:
Match 1
Full match 0-17 `T7878/6767 1234AB`
Group 1. 1-5 `7878`
Group 2. 6-10 `6767`
Group 3. 11-17 `1234AB`
Your syntax is pretty specific, so your regex needs to be as well. Get rid of all your capture groups because they are screwing things up. You only need two groups which match your string syntax exactly.
The first group looks for a word boundary followed by T, then 4 digits, then /, then 4 more digits and a word boundary.
The second group matches 4 digits and then letters A-D between 0 and 2 times. It has a negative lookbehind, (?<!\S), so it will only match if the 4 digits are preceded by a whitespace character or the start of the string.
(\bT\d{4}\/\d{4}\b)|(?<!\S)(\d{4}[A-D]{0,2})
Preg match all output:
Array
(
[0] => Array
(
[0] => T7878/6767
[1] => 1234AB
)
[1] => Array
(
[0] => T7878/6767
[1] =>
)
[2] => Array
(
[0] =>
[1] => 1234AB
)
)

Quick regex pattern in PHP

I have a large string chunk of text, and I need to extract all occurrences of text matching the following pattern:
QXXXXX-X (where X can be any digit, 0-9).
How do I do this in PHP?
<?php
preg_match_all("","Q05546-8 XXX Q13323-0",$output,PREG_PATTERN_ORDER);
print_r($output);
?>
Here you go:
preg_match_all('/\bQ[0-9]{5}-[0-9]\b/',"Q05546-8 XXX Q13323-0",$output,PREG_PATTERN_ORDER);
print_r($output);
Or, you can use shorthand class \d for a digit: \bQ\d{5}-\d\b.
Regex explanation:
\b - Word boundary (we are either at the beginning/end of the string or between a word character ([a-zA-Z0-9_]) and a non-word one (all others))
Q - Literal case-sensitive Q
[0-9]{5} - Exactly 5 (due to {5}) digits in the 0 to 9 range
- - Literal hyphen
[0-9] - Exactly 1 digit from 0 to 9
\b - Again a word boundary.
If you have these values inside longer sequences, you may consider using \bQ[0-9]{5}-[0-9](?![0-9]) or using shorthand classes, \bQ\d{5}-\d(?!\d).
Output of the demo:
Array
(
[0] => Array
(
[0] => Q05546-8
[1] => Q13323-0
)
)
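The difference between the trailing \b and the (?!\d) lookahead variant mentioned above only shows up when a value is glued to other characters. A small sketch with a made-up input:

```php
<?php
$text = 'Q05546-8abc Q13323-07';

// With a trailing \b, neither value matches: "8" is followed by a letter
// and "0" by another digit, so there is no word boundary after them.
preg_match_all('/\bQ\d{5}-\d\b/', $text, $strict);

// With (?!\d), only a following digit blocks the match.
preg_match_all('/\bQ\d{5}-\d(?!\d)/', $text, $loose);

print_r($strict[0]); // empty
print_r($loose[0]);  // Q05546-8
```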

find a string mapped between two string using php

I know this question has been asked many times before, and I have read most of the answers, but I still have an issue with this.
I will have a string with parts wrapped in [[[ and ]]]. I don't know the position of these parts, nor how many times they will occur.
For example:
$string = '[[[this is a string]]] and this is some other part. [[[this is another]]]and etc.';
Now, would somebody help me learn how I can find this is a string and this is another?
Thanks in Advance
You need to use preg_match_all(), and you also need to be sure to escape the square brackets since they are special characters.
$string = '[[[this is a string]]] and this is some other part. [[[this is another]]]and etc.';
preg_match_all('/\[\[\[([^\]]*)\]\]\]/', $string, $matches);
print_r($matches);
Regex logic:
\[\[\[([^\]]*)\]\]\]
Output:
Array
(
[0] => Array
(
[0] => [[[this is a string]]]
[1] => [[[this is another]]]
)
[1] => Array
(
[0] => this is a string
[1] => this is another
)
)
Here is a method using lookbehinds and lookaheads:
$string = '[[[this is a string]]] and this is some other part. [[[this is another]]]and etc.';
preg_match_all('/(?<=\[{3}).*?(?=\]{3})/', $string, $m);
print_r($m);
This outputs the following:
Array
(
[0] => Array
(
[0] => this is a string
[1] => this is another
)
)
Here is the explanation of the REGEX:
(?<=  \[{3}  )  .*?  (?=  \]{3}  )
 1      2    3   4    5     6    7
1. (?<= Positive lookbehind - This combination of (?<= ... ) tells REGEX to make sure that whatever is in the parentheses has to appear directly before whatever it is we are trying to match. It will check to see if it's there, but won't include it in the matches.
2. \[{3} This says to look for an opening square brace '[', three times in a row {3}. The only thing is that the square brace is a special character in REGEX, so we have to escape it with a backslash \. [ becomes \[.
3. ) Closing parenthesis ) for the lookbehind (Item #1)
4. .*? This tells REGEX to match any character ., any number of times *, but as few as possible ?. The ? makes the quantifier lazy, so matching stops as soon as the next part of our regular expression can match. In this case, the next part is a lookahead for three closing square braces.
5. (?= Positive lookahead - The combination of (?= ... ) tells REGEX to make sure that whatever is in the parentheses appears directly after what we are currently matching. It will check to see if it's there, but won't include it as part of our match.
6. \]{3} This looks for a closing square brace ], three times in a row {3} and as with item #2, must be escaped with a backslash \.
7. ) Closing parenthesis ) for the lookahead (Item #5)
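The two answers above give the same text for the sample string, but they can differ when the wrapped text itself contains a closing bracket. A sketch with a made-up input:

```php
<?php
// Hypothetical input whose inner text contains a single "]".
$input = '[[[a]b]]] tail';

// Capturing-group version: [^\]]* cannot cross the inner "]", so nothing matches.
preg_match_all('/\[\[\[([^\]]*)\]\]\]/', $input, $byGroup);

// Lookaround version: the lazy .*? expands until three "]" follow.
preg_match_all('/(?<=\[{3}).*?(?=\]{3})/', $input, $byLookaround);

print_r($byGroup[1]);      // empty
print_r($byLookaround[0]); // a]b
```

Which behavior is preferable depends on whether a lone ] can legitimately appear inside the wrapped text.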

Regular expression: match a word of certain length which starts with certain letters

I need a regex which matches a 7 letter word, which starts with 'st'.
For example, it should only match 'startin' out of the following: start startin starting
General tips:
The starting symbols are included into the regex directly, e.g. st.
If the starting characters are special in the sense of regex-syntax (like dots, parentheses, etc.), you need to escape them with a backslash, but it is not needed in your case.
After the starting symbols, include character class for the remaining characters of your "word". If you want to allow all characters, use a dot: .. If you want to allow all non-whitespace characters, use \S. If you want to allow only (unicode) letters, use \p{L}. To only allow non-accented latin letters, use [A-Za-z]. There are many possibilities here.
Finally, include a repetition quantifier for the character class from the previous step. In your case, you need exactly 5 characters after st, so the repetition quantifier is {5}.
If you want only the whole string to match, use \A at the beginning and \z at the end of your regex. Or include \b at the beginning/end of your regex to match at the so-called word boundaries (including start/end of the string, whitespace, punctuation).
The most powerful alternative (with full control) is the so-called lookahead - I'll leave it out here for the sake of simplicity.
See a regex tutorial for details. You can just look for the specific keywords I've mentioned, e.g. repetition, character class, unicode, lookahead, etc.
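Putting the tips above together, a minimal sketch might look like this. It assumes "letters" means non-accented Latin letters, and the $pattern variable is only there to avoid repeating the core expression:

```php
<?php
$pattern = 'st[A-Za-z]{5}'; // "st" plus exactly five ASCII letters

// \A ... \z: the whole string must be the word.
var_dump(preg_match('/\A' . $pattern . '\z/', 'startin'));                // int(1)
var_dump(preg_match('/\A' . $pattern . '\z/', 'start startin starting')); // int(0)

// \b ... \b: find such words anywhere in a longer text.
preg_match_all('/\b' . $pattern . '\b/', 'start startin starting', $found);
print_r($found[0]); // startin
```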
To match words with non-accent characters that are case insensitive you'll need the i modifier or you'll need to declare both letters at the beginning in both cases.
<?php
$regex = '!\bst[a-z]{5}\b!i';
$words = "start startin starting station Stalker SHOWER Staples Stiffle Steerin StÄbles'";
preg_match_all($regex,$words,$matches);
print_r($matches[0]);
?>
Output
Array
(
[0] => startin
[1] => station
[2] => Stalker
[3] => Staples
[4] => Stiffle
[5] => Steerin
)
With the same output as above, if you didn't use the i modifier you would have to declare more characters:
$regex = '!\b[Ss][Tt][A-Za-z]{5}\b!';
If you want to match Unicode Characters you can do this:
print "<meta charset=\"utf-8\"><body>";
$regex = '!\bst\p{L}{5}\b!iu'; // \p{L} matches any Unicode letter; the u modifier enables UTF-8 mode
$words = "start startin starting station Stalker SHOWER Staples Stiffle Steerin StÄbles'";
preg_match_all($regex,$words,$matches);
print_r($matches[0]);
print "</body>";
Output
Array
(
[0] => startin
[1] => station
[2] => Stalker
[3] => Staples
[4] => Stiffle
[5] => Steerin
[6] => StÄbles //without UTF-8 output it looks like this-> StÃ"bles
)
preg_match_all('/\bst\w{5}\b/', 'start startin starting', $arr, PREG_PATTERN_ORDER);
UPDATE: used word boundaries before and after, based on comment
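One thing to keep in mind with the \w version: \w also accepts digits and the underscore, so it is slightly broader than "letters only". A quick sketch with one made-up extra token:

```php
<?php
// \w also accepts digits and "_", so "st_1234" counts as a 7-character word here.
preg_match_all('/\bst\w{5}\b/', 'start startin starting st_1234', $arr, PREG_PATTERN_ORDER);
print_r($arr[0]); // startin, st_1234
```

If only letters should count, [a-z] with the i modifier or \p{L} with the u modifier, as in the other answers, is the safer choice.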
