Finding sentences between characters - php

I am trying to find sentences between pipe | and dot ., e.g.
| This is one. This is two.
The regex pattern I use :
preg_match_all('/(:\s|\|+)(.*?)(\.|!|\?)/s', $file0, $matches);
So far I could not manage to capture both sentences. The regex I use captures only the first sentence.
How can I solve this problem?
EDIT: as may be seen from the regex, I am trying to find the sentences BETWEEN (: or |) AND (. or ! or ?)
A colon or pipe indicates the starting point for sentences.
The sentences might be:
: Sentence one. Sentence two. Sentence three.
| Sentence one. Sentence two?
| Sentence one. Sentence two! Sentence three?

I would keep it simple and just match on:
\s*[^.|]+\s*
This matches any run of characters that contains no pipe or full stop. Note that [^.|] also matches whitespace, so the \s* parts do not trim anything on their own; apply trim() to each match to drop the surrounding spaces.
$input = "| This is one. This is two.";
preg_match_all('/\s*[^.|]+\s*/s', $input, $matches);
print_r(array_map('trim', $matches[0]));
This prints:
Array
(
[0] => This is one
[1] => This is two
)

This does the job:
$str = '| This is one. This is two.';
preg_match_all('/(?:\s|\|)+(.*?)(?=[.!?])/', $str, $m);
print_r($m);
Output:
Array
(
[0] => Array
(
[0] => | This is one
[1] => This is two
)
[1] => Array
(
[0] => This is one
[1] => This is two
)
)

Another option is to use \G, which asserts the position at the end of the previous match, to get consecutive matches. Each sentence is captured in group 1 and must be followed by a dot and 0+ horizontal whitespace chars.
(?:\|\h*|\G(?!^))([^.\r\n]+)\.\h*
In parts
(?: Non capturing group
\|\h* Match | and 0+ horizontal whitespace chars
| Or
\G(?!^) Assert position at the end of previous match
) Close group
( Capture group 1
[^.\r\n]+ Match 1+ times any char other than . or a newline
) Close group
\.\h* Match 1 . and 0+ horizontal whitespace chars
For example
$re = '/(?:\|\h*|\G(?!^))([^.\r\n]+)\.\h*/';
$str = '| This is one. This is two.
John loves Mary.| This is one. This is two.';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
print_r($matches);
Output
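Array
(
[0] => Array
(
[0] => | This is one.
[1] => This is one
)
[1] => Array
(
[0] => This is two.
[1] => This is two
)
[2] => Array
(
[0] => | This is one.
[1] => This is one
)
[3] => Array
(
[0] => This is two.
[1] => This is two
)
)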

To keep it simple, find everything between | and . and then split:
$input = "John loves Mary. | This is one. This is two. | Sentence 1. Sentence 2.";
preg_match_all('/\|\s*([^|]+)\./', $input, $matches);
if ($matches) {
foreach($matches[1] as $match) {
print_r(preg_split('/\.\s*/', $match));
}
}
Prints:
Array
(
[0] => This is one
[1] => This is two
)
Array
(
[0] => Sentence 1
[1] => Sentence 2
)
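The answers above only handle a pipe and a full stop, while the question also lists : as a start marker and ! and ? as terminators. A minimal sketch that applies the same capture-then-split idea to those extra characters might look like this. The function name sentencesAfterMarker is hypothetical, and the code assumes that : and | never occur inside a sentence.

```php
<?php
// Hypothetical helper: returns the sentences that follow a ':' or '|' marker.
// Assumption: ':' and '|' never appear inside a sentence.
function sentencesAfterMarker(string $text): array
{
    $sentences = [];
    // Each segment runs from a marker up to the next marker or the end of input.
    preg_match_all('/[:|]\s*([^:|]+)/', $text, $segments);
    foreach ($segments[1] as $segment) {
        // Split after '.', '!' or '?' and drop empty pieces.
        $parts = preg_split('/[.!?]\s*/', $segment, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($parts as $part) {
            $sentences[] = trim($part);
        }
    }
    return $sentences;
}

print_r(sentencesAfterMarker("| Sentence one. Sentence two! Sentence three?"));
```

Text before the first marker is ignored. A trailing fragment without a terminator would still be returned as a sentence, so add a check if that matters for your data.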

Related

Regex splitting string by space or " NOT " sequence (php)?

I'm looking to split a string by spaces, unless there is the string " NOT ", in which case I would only want to split by the space before the "NOT", and not after the "NOT".
Example:
"cancer disease NOT brain NOT sickle"
should become:
["cancer", "disease", "NOT brain", "NOT sickle"]
Here is what I have so far, but it is incorrect:
$splitKeywordArr = preg_split('/[^(NOT)]( )/', "cancer disease NOT brain NOT sickle");
It results in:
["cance", "diseas", "NOT brai", "NOT sickle"]
I know why it is incorrect, but I don't know how to fix it.
You may use
<?php
$text = "cancer disease NOT brain NOT sickle";
$pattern = "~NOT\s+(*SKIP)(*FAIL)|\s+~";
print_r(preg_split($pattern, $text));
?>
Which yields
Array
(
[0] => cancer
[1] => disease
[2] => NOT brain
[3] => NOT sickle
)
See a demo on ideone.com.
You might also match optional repetitions of the word NOT followed by 1+ word characters in case the word occurs multiple times after each other.
(?:\bNOT\h+)*\w+
The pattern matches:
(?: Non capture group
\bNOT\h+ A word boundary, match NOT and 1 or more horizontal whitespace chars
)* Close non capture group and optionally repeat
\w+ Match 1+ word characters
$str = "cancer disease NOT brain NOT sickle";
preg_match_all('/(?:\bNOT\h+)*\w+/', $str, $matches);
print_r($matches[0]);
Output
Array
(
[0] => cancer
[1] => disease
[2] => NOT brain
[3] => NOT sickle
)
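If you would rather avoid the (*SKIP)(*FAIL) backtracking verbs used in the split-based answer, a lookbehind can express the same rule. This is only a sketch under the assumption that NOT is always upper case and stands as a whole word; the function name splitKeepingNot is made up for illustration.

```php
<?php
// Split on whitespace runs, except the run that directly follows the word NOT.
// (?<!\bNOT) rejects a position preceded by the whole word NOT;
// (?<!\s) makes sure a match can only start at the beginning of a whitespace run.
function splitKeepingNot(string $text): array
{
    return preg_split('/(?<!\bNOT)(?<!\s)\s+/', $text);
}

print_r(splitKeepingNot("cancer disease NOT brain NOT sickle"));
```

The second lookbehind is there so that several spaces after NOT are not split in the middle of the run.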

Regex: Capturing multiple instances in one word group

I'm not good at Regex and I've been trying for hours now so I hope you can help me. I have this text:
&#x271D;his is *&#x271D;he* *in&#x271D;erne&#x271D;*
I need to capture (using PREG_OFFSET_CAPTURE) only the &#x271D; entities (rendered as ✝) in a word surrounded with *, so I only need to capture the last three ✝ in this example. The output array should look something like this:
[0] => Array
(
[0] => ✝
[1] => 17
)
[1] => Array
(
[0] => ✝
[1] => 32
)
[2] => Array
(
[0] => ✝
[1] => 44
)
I've tried using (✝) but of course this will select all instances, including those in words without asterisks. Then I've tried \*[^ ]*(✝)[^ ]*\* but this only gives me the last instance in a word. I've tried many other variations but all were wrong.
To clarify: The asterisks can be anywhere in the string, but always at the beginning and end of a word. The opening asterisk is always preceded by a space, except at the beginning of the string, and the closing asterisk is always followed by a space, except at the end of the string. I must add that punctuation marks can be inside these asterisks. ✝ is exactly (and only) what I need to capture and can be at any position in a word.
You could make use of the \G anchor to get iterative matches between the *. The anchor matches either at the start of the string, or at the end of the previous match.
(?:\*|\G(?!^))[^&*]*(?>&(?!#)[^&*]*)*\K&#x271D;(?=[^*]*\*)
Explanation
(?: Non capture group
\* Match *
| Or
\G(?!^) Assert the end of the previous match, not at the start
) Close non capture group
[^&*]* Match 0+ times any char except & and *
(?> Atomic group
&(?!#) Match & only when not directly followed by #
[^&*]* Match 0+ times any char except & and *
)* Close atomic group and repeat 0+ times
\K Clear the match buffer (forget what is matched until now)
&#x271D; Match the entity literally
(?=[^*]*\*) Positive lookahead, assert a * at the right
For example
$re = '/(?:\*|\G(?!^))[^&*]*(?>&(?!#)[^&*]*)*\K&#x271D;(?=[^*]*\*)/m';
$str = '&#x271D;his is *&#x271D;he* *in&#x271D;erne&#x271D;*';
preg_match_all($re, $str, $matches, PREG_OFFSET_CAPTURE);
print_r($matches[0]);
Output
Array
(
[0] => Array
(
[0] => &#x271D;
[1] => 16
)
[1] => Array
(
[0] => &#x271D;
[1] => 31
)
[2] => Array
(
[0] => &#x271D;
[1] => 43
)
)
Note: each offset is 1 less than the expected value because offsets start counting at 0. See PREG_OFFSET_CAPTURE.
If you want to match more variations, you could use a non capturing group and list the ones that you would accept to match. If you don't want to cross newline boundaries you can exclude matching those in the negated character class.
(?:\*|\G(?!^))[^&*\r\n]*(?>&(?!#)[^&*\r\n]*)*\K&#(?:x271D|169);(?=[^*\r\n]*\*)
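If a single pattern with \G and \K feels hard to maintain, the same offsets can be collected in two steps: find each *...* run together with its offset, then look for the needle inside it with strpos(). This is a rough sketch, not a drop-in replacement. It assumes the cross is stored as the entity &#x271D; (as the & and # handling in the pattern above suggests) and that asterisks always come in pairs. The function name offsetsInStarredWords is hypothetical.

```php
<?php
// Hypothetical helper: byte offsets of $needle inside every *...* run.
// Assumption: asterisks always come in opening/closing pairs.
function offsetsInStarredWords(string $text, string $needle): array
{
    $offsets = [];
    // Step 1: locate each *...* run and remember where it starts.
    preg_match_all('/\*[^*]*\*/', $text, $words, PREG_OFFSET_CAPTURE);
    foreach ($words[0] as [$word, $start]) {
        // Step 2: scan the run for every occurrence of the needle.
        $pos = 0;
        while (($pos = strpos($word, $needle, $pos)) !== false) {
            $offsets[] = $start + $pos;
            $pos += strlen($needle);
        }
    }
    return $offsets;
}

$str = '&#x271D;his is *&#x271D;he* *in&#x271D;erne&#x271D;*';
print_r(offsetsInStarredWords($str, '&#x271D;'));
```

The array destructuring in the foreach needs PHP 7.1 or later.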

Can preg_match() capture unknown number of occurrences?

Let's say I'm having the following string:
$string = 'cats[Garfield,Tom,Azrael]';
I need to capture the following strings:
cats
Garfield
Tom
Azrael
That string can be any word-like text, followed by brackets with the list of comma-separated word-like entries. I tried the following:
preg_match('#^(\w+)\[(\w+)(?:,(\w+))*\]$#', $string, $matches);
The problem is that $matches ignores Tom, matching only the first and the last cat.
Now, I know how to do that with more calls, perhaps combining preg_match() and explode(), so the question is not how to do it in general.
The question is: can that be done in single preg_match(), so I could validate and match on one go?
The underlying question seems to be: is it possible to extract each occurrence of a repeated capture group?
The answer is no.
However, several workarounds exist:
The most understandable uses two steps: you capture the full list and then you split it. Something like:
$str = 'cats[Garfield,Tom,Azrael,Supermatou]';
if ( preg_match('~(?<item>\w+)\[(?<list>\w+(?:,\w+)*)]~', $str, $m) )
$result = [ $m['item'], explode(',', $m['list']) ];
(or any structure you want)
Another workaround uses preg_match_all in conjunction with the \G anchor that matches either the start of the string or the position after a successful match:
$pattern = '~(?:\G(?!\A),|(?<item>\w+)\[(?=[\w,]+]))(?<elt>\w+)~';
if ( preg_match_all($pattern, $str, $matches) )
print_r($matches);
This design ensures that all elements are between the brackets.
To obtain a more flat result, you can also write it like this:
$pattern = '~\G(?!\A)[[,]\K\w+|\w+(?=\[[\w,]+])~';
Details of this last pattern:
~
# first alternative (can't be the first match)
\G (?!\A) # position after the last successful match
# (the negative lookahead discards the start of the string)
[[,] # an opening bracket or a comma
\K # return the whole match from this position
\w+ # an element
| # OR
# second alternative (the first match)
\w+ # the item
(?= # lookahead to check forward if the format is correct
\[ # opening bracket
[\w,]+ # word characters and comma (feel free to be more descriptive
# like \w+(?:,\w+)* or anything you want)
] # closing bracket
)
~
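To get closer to "validate and match in one go", the flat pattern can be tightened so that the first alternative only succeeds when the whole string has the expected shape. The sketch below is one way to do that and has not been tested against unusual input. The ^ and \z anchors and the stricter \w+(?:,\w+)* lookahead are my additions, and parseBracketList is a hypothetical name.

```php
<?php
// Hypothetical helper: returns [item, element, element, ...] or null
// when the input is not of the form word[word,word,...].
function parseBracketList(string $input): ?array
{
    // Second alternative: the item, only if the rest of the string is a valid list.
    // First alternative: each element, chained to the previous match with \G.
    $pattern = '~\G(?!\A)[[,]\K\w+|^\w+(?=\[\w+(?:,\w+)*]\z)~';
    if (!preg_match_all($pattern, $input, $m)) {
        return null; // no match at all means the format check failed
    }
    return $m[0];
}

print_r(parseBracketList('cats[Garfield,Tom,Azrael]'));
var_dump(parseBracketList('cats[Garfield,,Tom]'));
```

Because the \G branch can never start a chain by itself, an input that fails the lookahead produces no matches at all, which is what makes the null check work as validation.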
Why not a simple preg_match_all:
$string = 'cats[Garfield,Tom,Azrael], entity1[child11,child12,child13], entity2:child21&child22&child23';
preg_match_all('#\w+#', $string, $matches);
print_r($matches);
Output:
Array
(
[0] => Array
(
[0] => cats
[1] => Garfield
[2] => Tom
[3] => Azrael
[4] => entity1
[5] => child11
[6] => child12
[7] => child13
[8] => entity2
[9] => child21
[10] => child22
[11] => child23
)
)

need some help on regex in preg_match_all()

So I need to extract the ticket number "Ticket#999999" from a string. How do I do this using regex?
My current regex works if I have more than one digit after Ticket#, as in Ticket#9999, but it does not work if I only have Ticket#9. Please help.
Current regex:
preg_match_all('/(Ticket#[0-9])\w\d+/i',$data,$matches);
thank you.
In your pattern, [0-9] matches 1 digit, \w matches one more word character (a letter, digit or underscore) and \d+ matches 1+ digits, so at least 3 characters are required after #.
Use
preg_match_all('/Ticket#([0-9]+)/i',$data,$matches);
This will match:
Ticket# - a literal string Ticket#
([0-9]+) - Group 1 capturing 1 or more digits.
PHP demo:
$data = "Ticket#999999 ticket#9";
preg_match_all('/Ticket#([0-9]+)/i',$data,$matches, PREG_SET_ORDER);
print_r($matches);
Output:
Array
(
[0] => Array
(
[0] => Ticket#999999
[1] => 999999
)
[1] => Array
(
[0] => ticket#9
[1] => 9
)
)
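The character-count explanation above can be checked quickly by running both patterns over a few samples. This is just a small experiment; the sample strings are made up.

```php
<?php
$original = '/(Ticket#[0-9])\w\d+/i'; // pattern from the question
$fixed    = '/Ticket#([0-9]+)/i';     // pattern from the answer

// preg_match() returns 1 when the pattern matches and 0 when it does not.
foreach (['Ticket#9', 'Ticket#99', 'Ticket#999', 'Ticket#9a9'] as $sample) {
    printf(
        "%-11s original: %d  fixed: %d\n",
        $sample,
        preg_match($original, $sample),
        preg_match($fixed, $sample)
    );
}
```

The original pattern only starts matching at three characters after #, and it also accepts Ticket#9a9 because \w is not limited to digits.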

Regular expression doesn't work as expected: '/=(\w+\s*)+=/'

This is what I have:
<?php
preg_match_all('/=(\w+\s*)+=/', 'aaa =bbb ccc ddd eee= zzz', $match);
print_r($match);
It matches only eee:
Array
(
[0] => Array
(
[0] => =bbb ccc ddd eee=
)
[1] => Array
(
[0] => eee
)
)
I need it to match bbb, ccc, ddd, eee, e.g.:
...
[1] => Array
(
[0] => bbb
[1] => ccc
[2] => ddd
[3] => eee
)
...
Where is the problem?
Try this regex:
(\w+)(?=[^=]*=[^=]*$)
Explaining:
(\w+) # group all words
(?= # only if right after can be found:
[^=]* # any number of non-'=' characters
= # one '=' character
[^=]*$ # only non-'=' characters up to the end; this rejects the words before the first '='. You can try removing it on regex101 to see what happens.
)
Hope it helps.
That is expected behaviour. Group captures are overwritten on repetition.
1 group, 1 capture
Instead of trying to get them in 1 match attempt, you should match one token on each attempt. Use \G to match the end of last match.
Something like this should work:
/(?(?=^)[^=]*+=(?=.*=)|\G\s+)([^\s=]+)/
Regex break-down
(?(?=^) ... | ... ) IF on start of string
[^=]*+= consume everything up to the first =
(?=.*=) and check there's a closing = as well
ELSE
\G\s+ only match if the last match ended here, consuming preceding spaces
([^\s=]+) Match 1 token, captured in group 1.
If you're also interested in matching more than 1 set of tokens, you need to match the text in between sets as well:
/(?(?=^)[^=]*+=(?=.*=)|\G\s*+(?:=[^=]*+=(?=.*=))?)([^\s=]+)/
Your regex starts and ends with an =, so the only possible match is:
=bbb ccc ddd eee=
You can use preg_replace with preg_split, for example:
$string = "aaa =bbb ccc ddd eee= zzz";
$matches = preg_split('/ /', preg_replace('/^.*?=|=.*?$/', '', $string));
print_r($matches);
OUTPUT:
Array
(
[0] => bbb
[1] => ccc
[2] => ddd
[3] => eee
)
DEMO:
http://ideone.com/pAmjbk
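Since a repeated group keeps only its last capture, another plain way around it is to capture the whole span between the = signs once and split it afterwards. Here is a minimal sketch, assuming you only care about the first =...= pair. The function name wordsBetweenEquals is hypothetical.

```php
<?php
// Hypothetical helper: words between the first pair of '=' signs.
function wordsBetweenEquals(string $text): array
{
    // Step 1: grab everything between the first pair of '=' signs.
    if (!preg_match('/=([^=]*)=/', $text, $m)) {
        return [];
    }
    // Step 2: split the captured text on whitespace, dropping empty pieces.
    return preg_split('/\s+/', $m[1], -1, PREG_SPLIT_NO_EMPTY);
}

print_r(wordsBetweenEquals('aaa =bbb ccc ddd eee= zzz'));
```

Unlike the preg_replace version, this returns an empty array when the string contains no pair of = signs.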
