php sentence boundaries detection [duplicate] - php

I would like to divide a text into sentences in PHP. I'm currently using a regex, which gives ~95% accuracy, and I would like to improve on that with a better approach. I've seen NLP tools that do this in Perl, Java, and C, but nothing that fits PHP. Do you know of such a tool?

An enhanced regex solution
Assuming you care about handling abbreviations such as Mr. and Mrs., the following single-regex solution works pretty well:
<?php // test.php Rev:20160820_1800
$split_sentences = '%(?#!php/i split_sentences Rev:20160820_1800)
# Split sentences on whitespace between them.
# See: http://stackoverflow.com/a/5844564/433790
(?<= # Sentence split location preceded by
[.!?] # either an end of sentence punct,
| [.!?][\'"] # or end of sentence punct and quote.
) # End positive lookbehind.
(?<! # But don\'t split after these:
Mr\. # Either "Mr."
| Mrs\. # Or "Mrs."
| Ms\. # Or "Ms."
| Jr\. # Or "Jr."
| Dr\. # Or "Dr."
| Prof\. # Or "Prof."
| Sr\. # Or "Sr."
| T\.V\.A\. # Or "T.V.A."
# Or... (you get the idea).
) # End negative lookbehind.
\s+ # Split on whitespace between sentences,
(?=\S) # (but not at end of string).
%xi'; // End $split_sentences.
$text = 'This is sentence one. Sentence two! Sentence thr'.
'ee? Sentence "four". Sentence "five"! Sentence "'.
'six"? Sentence "seven." Sentence \'eight!\' Dr. '.
'Jones said: "Mrs. Smith you have a lovely daught'.
'er!" The T.V.A. is a big project! '; // Note ws at end.
$sentences = preg_split($split_sentences, $text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
printf("Sentence[%d] = [%s]\n", $i + 1, $sentences[$i]);
}
?>
Note that you can easily add or take away abbreviations from the expression. Given the following test paragraph:
This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!" The T.V.A. is a big project!
Here is the output from the script:
Sentence[1] = [This is sentence one.]
Sentence[2] = [Sentence two!]
Sentence[3] = [Sentence three?]
Sentence[4] = [Sentence "four".]
Sentence[5] = [Sentence "five"!]
Sentence[6] = [Sentence "six"?]
Sentence[7] = [Sentence "seven."]
Sentence[8] = [Sentence 'eight!']
Sentence[9] = [Dr. Jones said: "Mrs. Smith you have a lovely daughter!"]
Sentence[10] = [The T.V.A. is a big project!]
The essential regex solution
The author of the question commented that the above solution "overlooks many options" and is not generic enough. I'm not sure what that means, but the essence of the above expression is about as clean and simple as you can get. Here it is:
$re = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
Note that both solutions correctly identify sentences ending with a quotation mark after the ending punctuation. If you don't care about matching sentences ending in a quotation mark the regex can be simplified to just: /(?<=[.!?])\s+(?=\S)/.
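To make the difference between the two patterns concrete, here is a small sketch. The sample text and the variable names are made up for illustration and are not part of the original answer:

```php
<?php
// Illustrative sample text (an assumption, not from the original answer).
$text = 'He said "Stop." Then he left.';

// Pattern that also accepts a quote after the end-of-sentence punctuation.
$withQuotes = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
// Simplified pattern that only looks at the punctuation itself.
$simple = '/(?<=[.!?])\s+(?=\S)/';

$a = preg_split($withQuotes, $text, -1, PREG_SPLIT_NO_EMPTY);
$b = preg_split($simple, $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($a); // two elements: the quoted sentence and "Then he left."
print_r($b); // one element: the closing quote hides the period from the lookbehind
```

With the simplified pattern, the whitespace after the closing quote is preceded by a quote rather than by punctuation, so no split happens there.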
Edit: 20130820_1000 Added T.V.A. (another punctuated word to be ignored) to regex and test string. (to answer PapyRef's comment question)
Edit: 20130820_1800 Tidied and renamed regex and added shebang. Also fixed regexes to prevent splitting text on trailing whitespace.

Slight improvement on someone else's work:
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?] # Either an end of sentence punct,
| [.!?][\'"] # or end of sentence punct and quote.
) # End positive lookbehind.
(?<! # Begin negative lookbehind.
Mr\. # Skip either "Mr."
| Mrs\. # or "Mrs.",
| Ms\. # or "Ms.",
| Jr\. # or "Jr.",
| Dr\. # or "Dr.",
| Prof\. # or "Prof.",
| Sr\. # or "Sr.",
| \s[A-Z]\. # or initials ex: "George W. Bush",
# or... (you get the idea).
) # End negative lookbehind.
\s+ # Split on whitespace between sentences.
/ix';
$sentences = preg_split($re, $story, -1, PREG_SPLIT_NO_EMPTY);

As a low-tech approach, you might want to consider using a series of explode calls in a loop, using ., !, and ? as your needle. This would be very memory and processor intensive (as most text processing is). You would have a bunch of temporary arrays and one master array with all found sentences numerically indexed in the right order.
Also, you'd have to check for common exceptions (such as a . in titles like Mr. and Dr.), but with everything being in an array, these types of checks shouldn't be that bad.
I'm not sure if this is any better than regex in terms of speed and scaling, but it would be worth a shot. How big are these blocks of text you want to break into sentences?
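As a rough sketch of what such a low-tech approach could look like, here is one possible variation: instead of exploding on the punctuation itself, it explodes on spaces and checks the last character of each word. The function name `split_sentences_lowtech` and the abbreviation list are hypothetical, and the sketch ignores cases such as a closing quote after the punctuation:

```php
<?php
// Hypothetical helper, not from the original answer: split on spaces with
// explode() and close a sentence whenever a word ends in ., ! or ?,
// unless the word is a known abbreviation.
function split_sentences_lowtech(string $text, array $abbreviations): array
{
    $sentences = [];
    $current = [];
    foreach (explode(' ', $text) as $word) {
        if ($word === '') {
            continue; // skip empty pieces produced by repeated spaces
        }
        $current[] = $word;
        $endsSentence = in_array(substr($word, -1), ['.', '!', '?'], true);
        $isAbbreviation = in_array(rtrim($word, '.'), $abbreviations, true);
        if ($endsSentence && !$isAbbreviation) {
            $sentences[] = implode(' ', $current);
            $current = [];
        }
    }
    if ($current !== []) {
        $sentences[] = implode(' ', $current); // trailing text without punctuation
    }
    return $sentences;
}

$abbreviations = ['Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'Jr', 'Sr'];
$result = split_sentences_lowtech('Dr. Jones is here. Is Mrs. Smith in? Yes!', $abbreviations);
print_r($result);
```

Whether this is faster than a regex would have to be measured on real input.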

I was using this regex:
preg_split('/(?<=[.?!])\s(?=[A-Z"\'])/', $text);
It won't split before a sentence that starts with a number, but it should also produce very few false positives. Of course, what you are doing matters too. My program now uses
explode('.',$text);
because I decided speed was more important than accuracy.
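A short sketch of the limitation mentioned above, using a made-up sample text: the regex only splits where the next sentence begins with an uppercase letter or a quote, so a sentence that starts with a digit stays attached to the previous one.

```php
<?php
// Sample text invented for illustration.
$text = 'It costs 5 dollars. 3 of them are mine. Then we left.';

// Same pattern as above: split on one whitespace character that follows
// ., ? or ! and precedes an uppercase letter or a quote.
$parts = preg_split('/(?<=[.?!])\s(?=[A-Z"\'])/', $text);

print_r($parts);
```

Here the text is split only before "Then", because "3" fails the lookahead.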

Build a list of abbreviations like this
$skip_array = array('Jr', 'Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'Sr'); // etc.
Compile them into an expression
$skip = '';
foreach($skip_array as $abbr) {
$skip = $skip . (empty($skip) ? '' : '|') . '\s{1}' . $abbr . '[.!?]';
}
Finally, run this preg_split to break the text into sentences.
$lines = preg_split ("/(?<!$skip)(?<=[.?!])\s+(?=[^a-z])/",
$txt, -1, PREG_SPLIT_NO_EMPTY);
And if you're processing HTML, watch for deleted tags that eliminate the space between sentences, such as <p></p>. If you have situations.Like this where.They stick together, the text becomes immensely more difficult to parse.
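One way to avoid sentences sticking together is to insert a space in front of every tag before stripping it. This is only a sketch under the assumption that the input contains no literal `<` characters outside of tags; the sample HTML is made up:

```php
<?php
// Sample HTML invented for illustration.
$html = '<p>First sentence.</p><p>Second sentence.</p>';

// Stripping tags directly glues the sentences together.
$glued = strip_tags($html);

// Put a space in front of every tag first, then strip and collapse whitespace.
$spaced = strip_tags(str_replace('<', ' <', $html));
$spaced = trim(preg_replace('/\s+/', ' ', $spaced));

echo $glued, "\n";
echo $spaced, "\n";
```

The second string keeps whitespace between sentences, so a whitespace-based split still works on it.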

@ridgerunner I ported your PHP code to C#.
I get two sentences as the result:
Mr. J. Dujardin régle sa T.V.
A. en esp. uniquement
The correct result should be the sentence : Mr. J. Dujardin régle sa T.V.A. en esp. uniquement
and with our test paragraph
string sText = "This is sentence one. Sentence two! Sentence three? Sentence \"four\". Sentence \"five\"! Sentence \"six\"? Sentence \"seven.\" Sentence 'eight!' Dr. Jones said: \"Mrs. Smith you have a lovely daughter!\" The T.V.A. is a big project!";
The result is
index: 0 sentence: This is sentence one.
index: 22 sentence: Sentence two!
index: 36 sentence: Sentence three?
index: 52 sentence: Sentence "four".
index: 69 sentence: Sentence "five"!
index: 86 sentence: Sentence "six"?
index: 102 sentence: Sentence "seven.
index: 118 sentence: " Sentence 'eight!'
index: 136 sentence: ' Dr. Jones said: "Mrs. Smith you have a lovely daughter!
index: 193 sentence: " The T.V.
index: 203 sentence: A. is a big project!
C# code :
string sText = "Mr. J. Dujardin régle sa T.V.A. en esp. uniquement";
Regex rx = new Regex(@"(\S.+?
[.!?] # Either an end of sentence punct,
| [.!?]['""] # or end of sentence punct and quote.
)
(?<! # Begin negative lookbehind.
Mr. # Skip either Mr.
| Mrs. # or Mrs.,
| Ms. # or Ms.,
| Jr. # or Jr.,
| Dr. # or Dr.,
| Prof. # or Prof.,
| Sr. # or Sr.,
| \s[A-Z]. # or initials ex: George W. Bush,
| T\.V\.A\. # or "T.V.A."
) # End negative lookbehind.
(?=|\s+|$)",
RegexOptions.CultureInvariant | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);
foreach (Match match in rx.Matches(sText))
{
Console.WriteLine("index: {0} sentence: {1}", match.Index, match.Value);
}

Related

Regular expression for highlighting numbers between words

Site users enter numbers in different ways, example:
from 8 000 packs
432534534
from 344454 packs
45054 packs
04 555
434654
54 564 packs
I am looking for a regular expression with which I could highlight words before digits (if there are any), digits in any format and words after (if there are any). It is advisable to exclude spaces.
I currently have the following pattern, but it does not work correctly:
(^[0-9|a-zA-Z].*?)\s([0-9].*?)\s([a-zA-Z]*$)
The main purpose of this is to put the strings in order, bring them to the same form, format them in PHP digit format, etc.
As a result, I need to get the text before the digits, the digits themselves and the text after them into the variables separately.
$before = 'from';
$num = '8000';
$after = 'packs';
Thank you for any help in this matter.
I think you may try this:
^(\D+)?([\d \t]+)(\D+)?$
group 1: optional(?) group that will contain anything but digit
group 2: mandatory group that will contain only digits and
white space character like space and tab
group 3: optional(?) group that will contain anything but digit
$re = '/^(\D+)?([\d \t]+)(\D+)?$/m';
$str = 'from 8 000 packs
432534534
from 344454 packs
45054 packs
04 555
434654
54 564 packs
';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
foreach ($matches as $matchgroup)
{
echo "before: ".$matchgroup[1]."\n";
echo "number:".preg_replace('/\D/m','',$matchgroup[2])."\n";
echo "after:".$matchgroup[3]."";
echo "\n\n\n";
}
I corrected your regex and added groups, the regex looks like this:
^(?<before>[a-zA-Z]+)?\s?(?<number>[0-9].*?)\s?(?<after>[a-zA-Z]+)?$
Test regex here: https://regex101.com/r/QLEC9g/2
By using groups you can easily separate the words and numbers, and handle them any way you want.
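As a sketch of how the named groups could be read in PHP: the pattern is the one above, while the loop, the `$rows` array and the sample lines are assumptions made for illustration.

```php
<?php
$re = '/^(?<before>[a-zA-Z]+)?\s?(?<number>[0-9].*?)\s?(?<after>[a-zA-Z]+)?$/';

$rows = [];
foreach (['from 8 000 packs', '45054 packs', '432534534'] as $line) {
    if (preg_match($re, $line, $m)) {
        // Trailing optional groups that did not participate may be absent from $m.
        $before = $m['before'] ?? '';
        $num    = preg_replace('/\s+/', '', $m['number']); // drop spaces inside the number
        $after  = $m['after'] ?? '';
        $rows[] = [$before, $num, $after];
        printf("before: %s | num: %s | after: %s\n", $before, $num, $after);
    }
}
```

The `??` fallbacks guard against array keys that PHP omits when an optional group at the end of the pattern does not match.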
Your pattern does not match because there are 4 required parts that all expect 1 character to be present:
(^[0-9|a-zA-Z].*?)\s([0-9].*?)\s([a-zA-Z]*$)
  ^^^^^^^^^^^^    ^^ ^^^^^    ^^
The other thing to note is that the first character class [0-9|a-zA-Z] can also match digits (you can omit the | as it would match a literal pipe char)
If you want to allow any characters other than digits on the left and right, with at least a single digit present, you can use a negated character class [^\d\r\n]*, which optionally matches any character except a digit or a newline:
^([^\d\r\n]*)\h*(\d+(?:\h+\d+)*)\h*([^\d\r\n]*)$
^ Start of string
([^\d\r\n]*) Capture group 1, match any char except a digit or a newline
\h* Match optional horizontal whitespace chars
(\d+(?:\h+\d+)*) Capture group 2, match 1+ digits and optionally repeat matching spaces and 1+ digits
\h* Match optional horizontal whitespace chars
([^\d\r\n]*) Capture group 3, match any char except a digit or a newline
$ End of string
For example
$re = '/^([^\d\r\n]*)\h*(\d+(?:\h+\d+)*)\h*([^\d\r\n]*)$/m';
$str = 'from 8 000 packs
test from 8 000 packs test
432534534
from 344454 packs
45054 packs
04 555
434654
54 564 packs';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
foreach($matches as $match) {
list(,$before, $num, $after) = $match;
echo sprintf(
"before: %s\nnum:%s\nafter:%s\n--------------------\n",
$before, preg_replace("/\h+/", "", $num), $after
);
}
Output
before: from
num:8000
after:packs
--------------------
before: test from
num:8000
after:packs test
--------------------
before:
num:432534534
after:
--------------------
before: from
num:344454
after:packs
--------------------
before:
num:45054
after:packs
--------------------
before:
num:04555
after:
--------------------
before:
num:434654
after:
--------------------
before:
num:54564
after:packs
--------------------
If there should be at least a single digit present, and the only allowed characters are a-z for the word(s), you can use a case insensitive pattern:
(?i)^((?:[a-z]+(?:\h+[a-z]+)*)?)\h*(\d+(?:\h+\d+)*)\h*((?:[a-z]+(?:\h+[a-z]+)*)?)?$

php preg_match_all not working

The regex is meant to highlight wrong words like Hell«o» and ignore correct words such as «Hello» or Hello.
My pattern works fine in my JavaScript code, but when I try it in PHP it also highlights this string, which it shouldn't:
'«This is the point of sale» ';
here is my regex: https://regex101.com/r/SqCR1y/14
PHP Code:
$re = '/^(?:.*[[{(«][^\]})»\n]*|[^[{(«\n]*[\]})»].*|.*\w[[{(«].*|.*[\]})»]\w.*)$/m';
$str = '«This is the point of sale»';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
//Output
array(1) {
[0]=>
array(1) {
[0]=>
string(29) "«This is the point of sale»"
}
}
expected: empty array
jsfiddle here, which is working fine
Thanks in advance
You're not using the right pattern. Try this:
$re = '/^
(?:
\([^)\n] | [^(\n]*\). |
\[[^]\n] | [^[\n]*\]. |
{[^}\n] | [^{\n]}.* |
«[^»\n] | [^«\n]*». |
.?\w[[{(«]. | .?[\]})»]\w.
)
$/mxu';
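Separately from the choice of pattern, it may be worth checking the u modifier. The pattern in the question contains the multi-byte characters « and » but is compiled without u, so PCRE treats each of them as two separate bytes inside a character class. The sketch below runs the question's pattern on the question's sample string with and without u; the variable names are illustrative:

```php
<?php
// The question's pattern, without delimiters so both variants can be built from it.
$pattern = '^(?:.*[[{(«][^\]})»\n]*|[^[{(«\n]*[\]})»].*|.*\w[[{(«].*|.*[\]})»]\w.*)$';
$str = '«This is the point of sale»';

// Without u: « and » are two bytes each, so the classes match single bytes.
$bytes = preg_match_all('/' . $pattern . '/m', $str, $m1, PREG_SET_ORDER);
// With u: pattern and subject are treated as UTF-8, one character per code point.
$utf8 = preg_match_all('/' . $pattern . '/mu', $str, $m2, PREG_SET_ORDER);

var_dump($bytes, $utf8);
```

This assumes the PHP source file and the subject string are both UTF-8 encoded.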
What about a string like "(not) balanced)" ? Should that be legal?
This type of pattern isn't explicit in your test input, but since none of your "good" strings are imbalanced, you could consider covering these cases by using regex recursion to match balanced bracket expressions and targeting valid strings instead of invalid ones:
$re = '/
^
(?!.*\w[{}«»\(\)\[\]]\w) # disallow brackets inside words
(?:
[^\n{}«»\(\)\[\]]| # non bracket character, OR:
( # (capture group 1, the recursive subpattern) "one of the following balanced groups":
(\((?:(?>[^\n«»\(\){}\[\]]|(?1))*)\))| # balanced paren groups
(\[(?:(?>[^\n«»\(\){}\[\]]|(?1))*)\])| # balanced bracket groups
(«(?:(?>[^\n«»\(\){}\[\]]|(?1))*)»)| # balanced chevron groups
({(?:(?>[^\n«»\(\){}\[\]]|(?1))*)}) # balanced curly bracket groups
)
)+ # repeat "non bracket character or balanced group" until end of string
$
/mxu';
The recursion takes this form:
[openbracket]([nonbracket] | [open/close pattern again via recursion])*[closebracket]
To use part of the pattern recursively you identify it via the capture group that encloses it (?N), where N is the number of the group.
*The initial negative lookahead will fail any "word boundary" violations before going into the recursive stuff
*This regex looks to be about 35% faster than the original approach, as seen here: https://regex101.com/r/MBITHe/4
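To illustrate the recursion on its own, here is a reduced sketch that handles only round parentheses. It is a simplification written for this example, not the pattern above; `$balanced`, `$tests` and `$results` are illustrative names:

```php
<?php
// Group 1 matches one balanced (...) group and calls itself via (?1)
// for anything nested inside it.
$balanced = '/^(?:[^()\n]|(\((?:[^()\n]|(?1))*\)))+$/';

$tests = ['a (b (c) d) e', '(not) balanced)', '((unclosed)'];
$results = [];
foreach ($tests as $s) {
    $results[$s] = preg_match($balanced, $s) === 1;
    printf("%-18s %s\n", $s, $results[$s] ? 'balanced' : 'not balanced');
}
```

Only the first string is accepted; the other two contain a parenthesis without a partner.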

preg_split based on a Sentence

I have the following script to split up sentences. There are a few phrases that I would like to treat as the end of a sentence in addition to punctuation. This works fine when the phrase is a single word, but not when it contains a space.
This is the code I have that works:
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?:\#*] # Either an end of sentence punct,
| [.!?:][\'"]
| [\r\t\n] # or end of sentence punct and quote.
| HYPERLINK
| .org
| .gov
| .aspx
| .com
| Date
| Dear
) # End positive lookbehind.
(?<! # Begin negative lookbehind.
Mr\. # Skip either "Mr."
| Mrs\. # or "Mrs.",
| Ms\. # or "Ms.",
| Jr\. # or "Jr.",
| Dr\. # or "Dr.",
| Prof\. # or "Prof.",
| U\.S\.A\.
| U\.S\.
| Sr\. # or "Sr.",
| T\.V\.A\. # or "T.V.A.",
| a\.m\. # or "a.m.",
| p\.m\. # or "p.m.",
| a€¢\.
| :\.
# or... (you get the idea).
) # End negative lookbehind.
\s+ # Split on whitespace between sentences.
/ix';
This is an example phrase I have tried to add:
"Total Gross Income"
I have tried formatting it in these ways, but none of them work:
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?:\#*] # Either an end of sentence punct,
| [.!?:][\'"]
| [\r\t\n] # or end of sentence punct and quote.
| HYPERLINK
| .org
| .gov
| .aspx
| .com
| Date
| Dear
| "Total Gross Income"
| Total[ X]Gross[ X]Income
| Total" "Gross" "Income
)
For example, if I have the following code:
$block_o_text = "You could receive the wrong amount. If you receive more benefits than you should, you must pay them back. When will we review your case? An eligibility review form will be sent before your benefits stop. Total Gross Income Total ResourcesMedical ProgramsHousehold.";
$sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
echo $i . " - " . $sentences[$i] . "<BR>";
}
The results I get are:
77 - You could receive the wrong amount.
78 - If you receive more benefits than you should, you must pay them back.
79 - When will we review your case?
80 - An eligibility review form will be sent before your benefits stop.
81 - 01/201502/2015
82 - Total Gross Income Total ResourcesMedical ProgramsHousehold
What I want to get is :
77 - You could receive the wrong amount.
78 - If you receive more benefits than you should, you must pay them back.
79 - When will we review your case?
80 - An eligibility review form will be sent before your benefits stop.
81 - 01/201502/2015
82 - Total Gross Income
83 - Total ResourcesMedical ProgramsHousehold
What am I doing wrong?
Your problem is with the whitespace requirement that follows your lookbehind: it requires at least one whitespace character in order to split, but if you remove it, you end up capturing the preceding letter and breaking the whole thing.
As far as I can tell, you can't do this entirely with lookarounds. You'll still need part of the expression to work with lookarounds (whitespace preceded by punctuation, etc.), but for specific phrases you can't.
You can also use the PREG_SPLIT_DELIM_CAPTURE flag to capture out what you're splitting. Something like this should get you started:
$re = '/((?<=[\.\?\!])\s+|Total\sGross\sIncome)/ix';
$block_o_text = "You could receive the wrong amount. If you receive more benefits than you should, you must pay them back. When will we review your case? An eligibility review form will be sent before your benefits stop. Total Gross IncomeTotal ResourcesMedical ProgramsHousehold.";
$sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
if (!ctype_space($sentences[$i])) {
echo $i . " - " . $sentences[$i] . "<br>";
}
}
Output:
0 - You could receive the wrong amount.
2 - If you receive more benefits than you should, you must pay them back.
4 - When will we review your case?
6 - An eligibility review form will be sent before your benefits stop.
8 - Total Gross Income
9 - Total ResourcesMedical ProgramsHousehold.
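The output above skips index numbers because the whitespace delimiters are still in the array. If consecutive numbering is wanted, one option is to filter them out and reindex. This is a sketch on a shortened, made-up sample text, using the same splitting idea:

```php
<?php
$re = '/((?<=[.?!])\s+|Total\sGross\sIncome)/i';
// Shortened sample text, invented for illustration.
$text = 'Benefits stop. Total Gross IncomeTotal Resources.';

$parts = preg_split($re, $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

// Drop whitespace-only delimiters and renumber the remaining pieces from 0.
$sentences = array_values(array_filter($parts, function ($part) {
    return trim($part) !== '';
}));

foreach ($sentences as $i => $sentence) {
    echo $i . ' - ' . $sentence . "\n";
}
```

The captured phrase itself is kept as its own element, which is what the question asked for.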

Regular expression to find a string included between two bars and containing certain words

I always forget regex right after learning it. I want to extract the isbn number from a string.
String: English | ISBN: 1285463234 | 2014 | 499 pages | PDF | 28 MB
Target to extract: 1285463234
You can use findall with this regex:
/(?<=ISBN: )\d+/
Explanation:
(?<= Opens a positive lookbehind group, asserting that the match is preceded by:
ISBN: Matches the string "ISBN: "
) Closes the lookbehind group.
\d+ Matches one or more digits.
You could try the below regex to extract the ISBN number,
ISBN:\s*\K\d+
Your PHP code would be,
<?php
$mystring = 'English | ISBN: 1285463234 | 2014 | 499 pages | PDF | 28 MB';
$regex = '~ISBN:\s*\K\d+~';
if (preg_match($regex, $mystring, $m)) {
$yourmatch = $m[0];
echo $yourmatch; //=> 1285463234
}
?>
Explanation:
ISBN: Matches the string ISBN:
\s* Matches zero or more spaces.
\K Discards the previously matched characters (i.e., ISBN:) from the reported match.
\d+ Matches one or more digits.
If you have trouble with regex, there are other functions that can handle a straightforward example like yours, for instance the sscanf function in the PHP string library:
$subject = 'English | ISBN: 1285463234 | 2014 | 499 pages | PDF | 28 MB';
$result = sscanf($subject, 'English | ISBN: %d | ', $isbn);
If there is a match ($result is 1), the $isbn variable will contain the ISBN number as integer:
int(1285463234)
Be aware that ISBN-10 numbers can start with 0 and can end with the check character X, so %d may drop a leading zero or stop early. If you need the value as a string, use %s instead of %d:
$result = sscanf($subject, 'English | ISBN: %s | ', $isbn);
The result is then a string:
string(10) "1285463234"
For less complex string parsing, scanf patterns are much easier to deal with than PCRE regular expressions (which are more powerful but also more complex). Assigning the result to a specific variable is also easier.
An exact representation (and also a patchwork of the other answers ^^) would be this:
(?<=\| ISBN: )(\S+)(?= \|)
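Used from PHP, that pattern could be applied roughly as follows. This is a minimal sketch; `$subject` and `$isbn` are illustrative variable names:

```php
<?php
$subject = 'English | ISBN: 1285463234 | 2014 | 499 pages | PDF | 28 MB';

// The lookbehind and lookahead anchor the match between "| ISBN: " and " |"
// without including either of them in the result.
if (preg_match('/(?<=\| ISBN: )(\S+)(?= \|)/', $subject, $m)) {
    $isbn = $m[1];
    echo $isbn, "\n";
}
```

Because the value is captured with \S+, it stays a string and keeps any leading zero.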

Split text into words & numbers with unicode support (preg_split)

I'm trying to split (with preg_split) a text containing many foreign characters and digits into words and numbers of length >= 2, without punctuation.
I currently have this code, but it only splits into words, ignoring digits and the length >= 2 requirement.
How can I do this?
$text = 'abc 文 字化け, efg Yukarda mavi gök, asağıda yağız yer yaratıldıkta; (1998 m. siejės 7 d.). Ton pate dėina bandomkojė бойынша бірінші орында тұр (79.65 %), айына 41';
$splitted = preg_split('#\P{L}+#u', $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
Expected result should be : array('abc', '字化け', 'efg', 'Yukarda', 'mavi', 'gök', 'asağıda', 'yağız', 'yer', 'yaratıldıkta', '1998', 'siejės', 'Ton', 'pate', 'dėina', 'bandomkojė', 'бойынша', 'бірінші', 'орында', 'тұр', '79.65', 'айына', '41');
NB: I already tried with these docs (link1 & link2) but I can't get it to work.
Use preg_match_all instead, then you can check the length condition (that is hard to do with preg_split, but not impossible):
$text = 'abc 文 字化け, efg Yukarda mavi gök, asağıda yağız yer yaratıldıkta; (1998 m. siejės 7 d.). Ton pate dėina bandomkojė бойынша бірінші орында тұр (79.65 %), айына 41';
preg_match_all('~\p{L}{2,}+|\d{2,}+(?>\.\d++)?|\d\.\d++~u',$text,$matches);
print_r($matches);
explanation:
\p{L}{2,}+ # letter 2 or more times
| # OR
\d{2,}+ # digit 2 or more times
(?>\.\d++)? # can be a decimal number
| # OR
\d\.\d++ # single digit MUST be followed by at least a decimal
# (length constraint)
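For comparison, here is a rough sketch of how the same result might be approached with preg_split plus a length filter. The delimiter pattern is an assumption written for this example and is run on a shortened version of the question's text; it keeps a dot only when it sits between two digits:

```php
<?php
// Shortened version of the question's text.
$text = 'abc 文 字化け, (1998 m. 7 d.) тұр (79.65 %), айына 41';

// Delimiters: anything that is not a letter, digit or dot, plus any dot
// that is not sitting between two digits.
$delim = '~(?:[^\p{L}\d.]|\.(?!\d)|(?<!\d)\.)+~u';
$tokens = preg_split($delim, $text, -1, PREG_SPLIT_NO_EMPTY);

// Keep only tokens that are at least two characters long.
$splitted = array_values(array_filter($tokens, function ($token) {
    return preg_match('/^.{2,}$/su', $token) === 1;
}));

print_r($splitted);
```

Unlike the preg_match_all pattern, this variant would keep mixed tokens such as letters followed directly by digits in one piece.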
With a little hack to match digits separated by dot before matching only digits as part of the word:
preg_match_all("#(?:\d+\.\d+|\w{2,})#u", $text, $matches);
$splitted = $matches[0];
http://codepad.viper-7.com/X7Ln1V
Splitting CJK into "words" is kind of meaningless. Each character is a word. If you use whitespace, then you split into phrases.
So it depends on what you're actually trying to accomplish. If you're indexing text, then you need to consider bigrams and/or CJK idioms.
