How can I use groups with preg_match? - php

I have some data that will be one of the following
Word Number Word Number
Word Number Word Word Number
Word Word Number Word Number
Word Word Number Word Word Number
I would like to extract the Word(s) up until the numbers, and the numbers. Here is what I have at the moment (which looks OK to me, but I don't fully understand regex).
preg_match('/([A-Za-z ])([0-9])([A-Za-z ])([0-9])/', $game, $info);
print_r($info);
However, the array is empty. I know I've seen ^ and + and $ used before but I'm not quite sure how to work it into the regex.

In order to match the strings with the format you described, you need
preg_match_all('/^([a-z]+(?:\s+[a-z]+)?)\s+([0-9]+)\s+([a-z]+(?:\s+[a-z]+)?)\s+([0-9]+)$/im', $game, $info);
See the regex demo
IDEONE demo:
$re = '~^([a-z]+(?:\s+[a-z]+)?)\s+([0-9]+)\s+([a-z]+(?:\s+[a-z]+)?)\s+([0-9]+)$~im';
$game = "Word 123 Word 456\nWord 1234 Word Word 3456\nWord Word 3455 Word 4566\nWord Word 4434 Word Word 44332";
preg_match_all($re, $game, $info);
print_r($info);
The regex explanation:
^ - start of string
([a-z]+(?:\s+[a-z]+)?) - Group 1 for Word Word or Word pattern
\s+ - one or more whitespaces
([0-9]+) - Group 2 for Number
\s+ - one or more whitespaces
([a-z]+(?:\s+[a-z]+)?) - Group 3 for Word Word or Word pattern
\s+ - one or more whitespaces
([0-9]+) - Group 4 for Number pattern
$ - end of string
The /i modifier makes the pattern case-insensitive. /m modifier is used for testing only (it makes ^ and $ match start and end of a line, not the whole string).
The [a-z]+(?:\s+[a-z]+)? subpattern means: match one or more letters with [a-z]+ and then match one or zero occurrences of a sequence of one or more whitespaces (\s+) followed by one or more letters ([a-z]+). Thus, this pattern effectively matches 1 or 2 words separated by whitespace.
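For a single record rather than a multi-line test string, the same pattern can be used with preg_match and without the m modifier. The following is a minimal sketch under that assumption; the helper name parse_game is made up for illustration and is not part of the original answer.

```php
<?php
// Hypothetical helper: split "Word(s) Number Word(s) Number" into its four parts.
function parse_game(string $game): ?array
{
    $re = '~^([a-z]+(?:\s+[a-z]+)?)\s+([0-9]+)\s+([a-z]+(?:\s+[a-z]+)?)\s+([0-9]+)$~i';
    if (!preg_match($re, $game, $info)) {
        return null; // the string does not follow the expected format
    }
    // $info[0] holds the whole match; $info[1]..$info[4] hold the capture groups.
    return array_slice($info, 1);
}

print_r(parse_game('Word Word 3455 Word 4566'));
```

The groups come back in the order they open in the pattern, so index 0 of the returned array is the first word part and index 3 is the second number.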

Related

Why regex with lookaheads doesn't match?

I need (in PHP) to split a sentence by the word that cannot be the first or the last one in the sentence. Say the word is "pression" and here is my regex
/^.+?(?=[\s\.\,\:\;])pression(?=[\s\.\,\:\;]).+$/i
Live here: https://regex101.com/r/CHAhKj/1/
First, it doesn't match.
Next, I wonder: is it at all possible to split that way? I tried a simplified example
print_r(preg_split('/^.+pizza.+$/', 'my pizza is cool'));
live here http://sandbox.onlinephpfunctions.com/code/10b674900fc1ef44ec79bfaf80e83fe1f4248d02
and it prints an array of 2 empty strings, when I expect
['my ', ' is cool']
I need (in PHP) to split a sentence by the word that cannot be the first or the last one in the sentence
You may use this regex:
(?<=[^\s.?]\h)pression(?=\h[^\s.?])
RegEx Details:
(?<=[^\s.?]\h): Lookbehind to assert that before the current position we have a character that is not a whitespace, not a dot and not a ?, followed by a horizontal whitespace.
pression: Match word pression
(?=\h[^\s.?]): Lookahead to assert that ahead of the current position we have a horizontal whitespace followed by a character that is not a whitespace, not a dot and not a ?.
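As a rough sketch of how this lookaround pattern behaves with preg_split (the sample sentences are chosen here for illustration):

```php
<?php
// Split on "pression" only when it has a word on each side.
$re = '~(?<=[^\s.?]\h)pression(?=\h[^\s.?])~';

$parts = preg_split($re, 'You can use any regular expression pression inside the lookahead');
print_r($parts);

// "pression" as the first word has nothing before it, so no split happens
// and preg_split returns the whole string as a single element.
$unsplit = preg_split($re, 'pression is the first word here');
print_r($unsplit);
```

Because the lookarounds consume nothing, the spaces around the word stay in the returned pieces. The "pression" inside "expression" is skipped since it is not preceded by whitespace.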
First, ^.+?(?=[\s\.\,\:\;])pression(?=[\s\.\,\:\;]).+$ can't match any string at all because the (?=[\s\.\,\:\;])p part requires p to be also either a whitespace char, or a ., ,, : or ;, which invalidates the whole match at once.
Second, the ^.+pizza.+$ pattern does not ensure that the matched pizza is not the first or last word in a sentence, as . matches whitespace, too. It does not return anything meaningful either, because preg_split removes each match and returns what is left around it: the pattern matches the entire string, so the two values are the empty strings before and after that match.
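The preg_split behaviour described here can be checked with a short sketch, using the string from the question:

```php
<?php
// The pattern consumes the whole string, so nothing is left on either side of the match.
$whole = preg_split('/^.+pizza.+$/', 'my pizza is cool');
var_dump($whole); // two empty strings

// Splitting on the word alone keeps the text around it.
$around = preg_split('/pizza/', 'my pizza is cool');
var_dump($around);
```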
That said, all you need is:
preg_match('~^(.*?\w\W+)pression(\W+\w.*)$~is', $text, $m)
See the regex demo. Details:
^ - start of string
(.*?\w\W+) - Capturing group 1: any zero or more chars, as few as possible, then a word char and then one or more non-word chars
pression - a word
(\W+\w.*) - Capturing group 2: one or more non-word chars, a word char, and then any zero or more chars as many as possible
$ - end of string.
s makes the . match across lines and i flag makes the pattern match in a case insensitive way.
See the PHP demo:
$text = "You can use any regular expression pression inside the lookahead ";
if (preg_match('~^(.*?\w\W+)pression(\W+\w.*)$~is', $text, $m)) {
echo $m[1] . " << | >> " . $m[2];
}
// => You can use any regular expression << | >> inside the lookahead

Regex match with next being either a space or end of string

My regfu has declined... and I'm having trouble getting expected matches.
Here's example of what needs to match and what not:
NLNL LL
LNLN LL LL
NNLL LL LL LL
LNLN LLL LL
LLNN LL LLL <-- skip because:
Only need:
1 to 3 Pairs of letters separated by one space
Which are consecutive to end of string
\s{1}([A-Z]{2}) is close, but also grabbing part of the skip above.
Why? I need to grab what are name initials from strings. There are either 1,2,or 3 persons initials appended to the strings. I will be grabbing those with PHP to store them.
You may use
if (preg_match('~(?: [A-Z]{2})+$~', $s, $match)) {
print_r(explode(" ", trim($match[0])));
}
Here, (?: [A-Z]{2})+$ matches one or more sequences of a space and then two uppercase ASCII letters till the end of string, and then explode(" ", trim($match[0])) splits the trimmed match with a space into chunks.
Or, if you want to match all occurrences with one regex call:
if (preg_match_all('~(?:\G(?!\A)|(?=(?:\s[A-Z]{2})+$))\s\K[A-Z]{2}~', $s, $matches)) {
print_r($matches[0]);
}
Here, the regex matches:
(?:\G(?!\A)|(?=(?:\s[A-Z]{2})+$)) - end of previous match (\G(?!\A)) or (|) a location immediately followed with one or more sequences of a space and then two uppercase ASCII letters till the end of string
\s - a whitespace
\K - match reset operator
[A-Z]{2} - two uppercase ASCII letters.
See the PHP demo.
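A small sketch that runs both approaches side by side. The sample strings are invented here to follow the N/L templates from the question, and the function names are hypothetical:

```php
<?php
// Hypothetical helper: return the trailing two-letter initials, or [] if there are none.
function trailing_initials(string $s): array
{
    if (preg_match('~(?: [A-Z]{2})+$~', $s, $match)) {
        return explode(' ', trim($match[0]));
    }
    return [];
}

// Same idea in a single preg_match_all call, using \G and \K.
function trailing_initials_g(string $s): array
{
    preg_match_all('~(?:\G(?!\A)|(?=(?:\s[A-Z]{2})+$))\s\K[A-Z]{2}~', $s, $matches);
    return $matches[0];
}

foreach (['1A2B CD', 'A1B2 CD EF', '12AB CD EF GH', 'AB12 CD EFG'] as $s) {
    echo $s, ' => ', implode(',', trailing_initials($s)), "\n";
}
```

Neither pattern caps the number of pairs at three, since + is unbounded. Writing (?: [A-Z]{2}){1,3}$ would cap it, but on a longer run it would still match the last three pairs.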

How to use preg_replace to remove excessive single spaces

We are extracting text from PDF files, and there is a high frequency of results that contain malformed text. Specifically adding spaces between the characters of a word. e.g. SEATTLE is being returned as S E A T T L E.
Is there a RegEx expression for preg_replace that can remove any spaces in the case of n number of single character "words"? Specifically, remove spaces from any occurrence of a string that is more than 3 single alpha characters and is separated by spaces?
I've googled this for a while, but can't even imagine how to construct the expression. As expressed in a comment, I don't want ALL spaces removed, but only when there is an occurrence of >3 single alpha characters, e.g. Welcome to the Greater S E A T T L E area should become Welcome to the Greater SEATTLE area. The result is to be used in full text searching, so case sensitivity is not a concern.
You may use a simple approach with a preg_replace_callback. Match '~\b[A-Za-z](?: [A-Za-z]){2,}\b~' and str_replace spaces in the anonymous function:
$regex = '~\b[A-Za-z](?: [A-Za-z]){2,}\b~';
$result = preg_replace_callback($regex, function($m) {
return str_replace(" ", "", $m[0]);
}, $s);
See the regex demo.
To only match sequences of uppercase letters, remove a-z from the pattern:
$regex = '~\b[A-Z](?: [A-Z]){2,}\b~';
And another thing: there may be soft/hard spaces, tabs, other kind of whitespace. Then, use
$regex = '~\b[A-Za-z](?:\h[A-Za-z]){2,}\b~u';
(the changes are \h in place of the literal space, and the added u modifier)
Finally, to match any Unicode letter, use \p{L} (to only match uppercase ones, \p{Lu}) instead of [a-zA-Z]:
$regex = '~\b\p{L}(?:\h\p{L}){2,}\b~u';
NOTE: It will most probably fail to work in some cases, e.g. when there are one-letter words. You will have to handle those cases separately/manually. Anyway, there is no safe regex-only way to fix OCR issues.
Pattern details
\b - a word boundary
[A-Za-z] - a single letter
(?: [A-Za-z]){2,} - 2 or more occurrences of
- a space (\h matches any kind of horizontal whitespace)
[A-Za-z] - a single letter
\b - a word boundary
When using the u modifier, \h becomes Unicode-aware.
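Wrapped in a function, the callback approach could look like the sketch below. The name join_spaced_letters and the second sample sentence are made up for illustration; the second call shows the one-letter-word limitation.

```php
<?php
// Hypothetical helper: join runs of 3 or more space-separated single letters.
function join_spaced_letters(string $s): string
{
    $regex = '~\b[A-Za-z](?: [A-Za-z]){2,}\b~';
    return preg_replace_callback($regex, function ($m) {
        // $m[0] is the whole run, e.g. "S E A T T L E"
        return str_replace(' ', '', $m[0]);
    }, $s);
}

echo join_spaced_letters('Welcome to the Greater S E A T T L E area'), "\n";
// A genuine one-letter word next to a spaced-out word is merged into it as well.
echo join_spaced_letters('Take a T O U R today'), "\n";
```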
You could do this in one go:
(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)
See live demo here
Explanation:
(?i: # Start of non-capturing group with case-insensitive modifier on
(?<!\S) # Negative lookbehind to ensure there is no leading non-whitespace character
([a-z]) + # Capture one letter and at least one space
((?1)) # Capture one letter in 2nd capturing group
| # Or
\G(?!\A) + # Start match from where previous match ends
# with matching spaces
((?1))\b # Match a letter at word boundary
) # End of non-capturing group
PHP code:
$str = preg_replace('~(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)~', '$1$2$3', $str);
You may use this pure regex approach with lookarounds and \G:
$re = '~\b(?:(?=(?:\pL\h+){3}\pL\b)|(?<!^)\G)(\pL)\h+(?=\pL\b)~';
$repl = preg_replace($re, '$1', $str);
RegEx Details:
\b: Match word boundary
(?:: Start non-capture group
(?=(?:\pL\h+){3}\pL\b): Lookahead to assert we have at least 4 single letters separated by 1+ horizontal whitespaces (three letter-plus-whitespace pairs, then a final single letter)
|: OR
(?<!^)\G: \G asserts position at the end of the previous match. (?<!^) ensures we don't match start of the string for the first match
): End non-capture group
(\pL): Match a single letter and capture it
\h+: Followed by 1+ horizontal whitespace
(?=\pL\b): Assert that we only have a single letter ahead
In the replacement we use $1 which is the letter left of whitespace we capture
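A quick sketch of this lookaround pattern on two inputs; the second sentence is invented here to show the lower bound of the lookahead:

```php
<?php
$re = '~\b(?:(?=(?:\pL\h+){3}\pL\b)|(?<!^)\G)(\pL)\h+(?=\pL\b)~';

// Seven spaced letters: every space inside the run is removed.
$joined = preg_replace($re, '$1', 'Welcome to the Greater S E A T T L E area');
echo $joined, "\n";

// Only three spaced letters: the lookahead needs four, so nothing changes.
$untouched = preg_replace($re, '$1', 'Plan A B C today');
echo $untouched, "\n";
```

Each match covers one letter plus the whitespace after it, so a run of n letters is handled in n - 1 consecutive matches chained together by \G.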

PHP Regular expression to get words with repeated chars in a string

I'm trying to get the words in a string with repeated chars.
For example: "II loooovve this video. It's awesooooommeee."
How can I get the result:
loooovve
awesooooommeee
?
You can use this regex with a back-reference:
\b\w*(\w)\1\w*
RegEx Breakup:
\b # word boundary
\w* # match 0 or more word characters
(\w) # match a single word char and capture it as group #1
\1 # back-reference to captured group #1 to make sure we have a *repeat*
\w* # match 0 or more word characters
btw it will also match II since it has a repeating character I.
Pattern for matching all words with 3+ repeated letters:
\b\w*(\w)\1{2}\w*
II loooovve this video. It's awesooooommeee.
https://regex101.com/r/cP7kT7/1
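In PHP, both patterns can be applied with preg_match_all. A minimal sketch using the sentence from the question:

```php
<?php
$text = "II loooovve this video. It's awesooooommeee.";

// Words containing any doubled character.
preg_match_all('~\b\w*(\w)\1\w*~', $text, $doubled);
print_r($doubled[0]);

// Words containing a character repeated three or more times in a row.
preg_match_all('~\b\w*(\w)\1{2}\w*~', $text, $tripled);
print_r($tripled[0]);
```

Index 0 of the result holds the full words; index 1 holds the repeated character captured by group 1.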

Php Regex to insert character after first all-capital letter word in a string

I'm trying to use a preg_replace or similar php function to:
- identify the first all capital letter word in a string,
- and insert a character directly after it (a dash or semi-colon will do)
- the all capital letter word should be 3 characters long or more.
So far I have the regular expression:
/(?<!\ )([^A-Z{3,}])/
But, this isn't working in terms of only words that are 3+ characters. I'm also not sure I have it 'strictly' only looking at the very first word.
I believe that once I have the regex sorted out - this
$string = "LONDON On November 12th twelve people...";
$replaced_string = preg_replace('/myregex/',': ', $string);
will output as the following
"LONDON: On November 12th twelve people..."
It's a fairly simple regex, really:
$replacedString = preg_replace('/\b([A-Z]{3,})\b/', '$1: ', $string);
It works like this:
\b: word boundary. This detects the start and end of a "word"
([A-Z]{3,}): Match 3 or more upper-case characters. The brackets capture this part of the match, so we can use it in the replacement string
\b: Another word boundary
Replace this match with:
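'$1: ': the $1 refers back to the first captured group (the 3 or more upper case characters). To this, we're adding a colon and a space. That will be our replacement string. Note that the match does not include the space that already follows the word, so the result has two spaces after the colon; use '$1:' as the replacement to keep a single space.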
This will add the colon and space after all upper-case words of 3 or more characters. To replace only 1 word, just pass a limit to preg_replace:
$replaced = preg_replace('/\b([A-Z]{3,})\b/', '$1: ', $string, 1);
Where that last argument is the number of matches you wish to replace. -1 for all, 1 for 1, 2 for 2, etc...
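The effect of the limit argument can be seen in a short sketch. It uses '$1:' without a trailing space as the replacement, on the assumption that the word is followed by a space in the input; that space is not part of the match and is kept. The second sample string is made up for illustration.

```php
<?php
$string = "LONDON On November 12th twelve people...";

// Limit of 1: only the first all-caps word of 3 or more letters is changed.
$first = preg_replace('/\b([A-Z]{3,})\b/', '$1:', $string, 1);
echo $first, "\n";

// Default limit (-1) versus a limit of 1 on a string with two candidates.
$all = preg_replace('/\b([A-Z]{3,})\b/', '$1:', 'LONDON and PARIS agree');
$one = preg_replace('/\b([A-Z]{3,})\b/', '$1:', 'LONDON and PARIS agree', 1);
echo $all, "\n", $one, "\n";
```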
Judging by your sample string, the upper-case words are city names. It's possible for city names to contain a dash, or even a space. To address this, you might want to match all strings containing upper-case chars, dashes and spaces:
$replaceAll = preg_replace('/\b([A-Z -]{2,}[A-Z])\b/', '$1: ', $string);
What changed:
([A-Z -]{2,}: The capturing group starts with 2 or more characters (not 3), each of which may be an upper-case letter, a space or a dash.
[A-Z]): The last character of the captured group must be an upper-case character, this avoids capturing the trailing spaces or dashes. The result is that we capture stuff like "NEW YORK" or "FOO-TOWN", but not "ON - Something".
The rest is the same as before. If you want to allow for other characters that might occur (like a dot) just add them to the first part of the capturing group. The most complete pattern will probably be something like this:
$replaced = preg_replace('/\b([A-Z][A-Z .-]+[A-Z])\b/', '$1: ', $string);
This ensures the captured group starts and ends with an upper-case character, and contains any number of upper-case chars, spaces, dots and dashes in between. So this will match something like "ST. LOUIS", too.
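A sketch of that last pattern on a few made-up datelines, again with '$1:' as the replacement and a limit of 1. The third call shows a caveat that follows from allowing spaces inside the group: a one-letter capitalised word right after the city name is captured too.

```php
<?php
$re = '/\b([A-Z][A-Z .-]+[A-Z])\b/';

$stLouis = preg_replace($re, '$1:', 'ST. LOUIS On November 12th...', 1);
$newYork = preg_replace($re, '$1:', 'NEW YORK On November 12th...', 1);
echo $stLouis, "\n", $newYork, "\n";

// Caveat: the standalone "A" is upper-case and ends at a word boundary,
// so it becomes part of the captured name.
$swallowed = preg_replace($re, '$1:', 'NEW YORK A man said...', 1);
echo $swallowed, "\n";
```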
