I have the following script to split up sentences. There are a few phrases that I would like to treat as the end of a sentence, in addition to punctuation. This works fine if the phrase is a single word, but not when it contains a space.
This is the code I have that works:
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?:\#*] # Either an end of sentence punct,
| [.!?:][\'"]
| [\r\t\n] # or end of sentence punct and quote.
| HYPERLINK
| .org
| .gov
| .aspx
| .com
| Date
| Dear
) # End positive lookbehind.
(?<! # Begin negative lookbehind.
Mr\. # Skip either "Mr."
| Mrs\. # or "Mrs.",
| Ms\. # or "Ms.",
| Jr\. # or "Jr.",
| Dr\. # or "Dr.",
| Prof\. # or "Prof.",
| U\.S\.A\.
| U\.S\.
| Sr\. # or "Sr.",
| T\.V\.A\. # or "T.V.A.",
| a\.m\. # or "a.m.",
| p\.m\. # or "p.m.",
| •\.
| :\.
# or... (you get the idea).
) # End negative lookbehind.
\s+ # Split on whitespace between sentences.
/ix';
This is an example phrase I have tried to add:
"Total Gross Income"
I have tried formatting it in these ways, but none of them work:
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?:\#*] # Either an end of sentence punct,
| [.!?:][\'"]
| [\r\t\n] # or end of sentence punct and quote.
| HYPERLINK
| .org
| .gov
| .aspx
| .com
| Date
| Dear
| "Total Gross Income"
| Total[ X]Gross[ X]Income
| Total" "Gross" "Income
)
For example, if I have the following code:
$block_o_text = "You could receive the wrong amount. If you receive more benefits than you should, you must pay them back. When will we review your case? An eligibility review form will be sent before your benefits stop. Total Gross Income Total ResourcesMedical ProgramsHousehold.";
$sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
echo $i . " - " . $sentences[$i] . "<BR>";
}
The results I get are:
77 - You could receive the wrong amount.
78 - If you receive more benefits than you should, you must pay them back.
79 - When will we review your case?
80 - An eligibility review form will be sent before your benefits stop.
81 - 01/201502/2015
82 - Total Gross Income Total ResourcesMedical ProgramsHousehold
What I want to get is :
77 - You could receive the wrong amount.
78 - If you receive more benefits than you should, you must pay them back.
79 - When will we review your case?
80 - An eligibility review form will be sent before your benefits stop.
81 - 01/201502/2015
82 - Total Gross Income
83 - Total ResourcesMedical ProgramsHousehold
What am I doing wrong?
Your problem is the whitespace requirement (\s+) that follows your lookbehinds: it needs at least one whitespace character in order to split. If you remove it, you end up consuming the preceding letter and breaking the whole thing.
As far as I can tell, you can't do this entirely with lookarounds. Part of the expression can still use lookarounds (whitespace preceded by punctuation, etc.), but specific phrases have to be matched directly.
You can then use the PREG_SPLIT_DELIM_CAPTURE flag to keep what you're splitting on. Something like this should get you started:
$re = '/((?<=[\.\?\!])\s+|Total\sGross\sIncome)/ix';
$block_o_text = "You could receive the wrong amount. If you receive more benefits than you should, you must pay them back. When will we review your case? An eligibility review form will be sent before your benefits stop. Total Gross IncomeTotal ResourcesMedical ProgramsHousehold.";
$sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
if (!ctype_space($sentences[$i])) {
echo $i . " - " . $sentences[$i] . "<br>";
}
}
Output:
0 - You could receive the wrong amount.
2 - If you receive more benefits than you should, you must pay them back.
4 - When will we review your case?
6 - An eligibility review form will be sent before your benefits stop.
8 - Total Gross Income
9 - Total ResourcesMedical ProgramsHousehold.
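One more thing that may be worth checking, offered as a sketch rather than a definitive diagnosis: the original pattern is compiled with the x flag, and in that mode PCRE ignores unescaped whitespace in the pattern (outside character classes). A phrase written as Total Gross Income is therefore compiled as TotalGrossIncome. The example below compares an unescaped and an escaped version of the phrase inside a lookbehind; the variable names and the shortened test string are made up for illustration.

```php
<?php
// Shortened test string (assumed for illustration, not the asker's full text).
$text = 'Total Gross Income Total Resources';

// Under /x the spaces in the phrase are ignored, so this looks for "TotalGrossIncome".
$unescapedPattern = '/(?<= Total Gross Income ) \s+ /ix';

// Escaping each space keeps it as a literal space even under /x.
$escapedPattern = '/(?<= Total\ Gross\ Income ) \s+ /ix';

$withoutEscape = preg_split($unescapedPattern, $text, -1, PREG_SPLIT_NO_EMPTY);
$withEscape    = preg_split($escapedPattern, $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($withoutEscape); // a single element: no split happened
print_r($withEscape);    // two elements: "Total Gross Income" and "Total Resources"
```

This only helps when whitespace actually follows the phrase; when the next word is glued on, as in the test string in the answer above, the phrase still has to be matched directly.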
Related
I have input string coming from user like this
" |appl | pinea | orang frui | vege lates"
and I want to convert it to this format
"+(appl* | pinea* | orang*) +(frui* | vege*) +(lates*)"
I have tried preg_match and arrays but have been unable to find a solution:
$term = " |appl | pinea | orang frui | vege lates";
$term = trim($term, "| ");
$term = preg_replace("#[\| ]{2,}#", " | ", $term);
// $term value is "appl | pinea | orang frui | vege lates" after applying safe guards
// $matches = [];
// $matched = preg_match_all("#(\P{Xan}+)\|(\P{Xan}+)#ui", $term, $matches);
// var_dump($matches);
$termArray = explode(" ", $term);
foreach($termArray as $index => $singleWord) {
$termArray[$index] = trim($singleWord);
}
$termArray = array_filter($termArray);
foreach($termArray as $index => $singleWord) {
if($singleWord === "|") {
$pipeIndexes[] = $index;
$orIndexes[] = $index - 1;
$orIndexes[] = $index + 1;
unset($termArray[$index]);
}
}
// desired string I am hoping for
// "+(appl* | pinea* | orang*) +(frui* | vege*) +(lates*)"
This assumes you have cleaned up your input string to the following:
appl | pinea | orang frui | vege lates
Quick working example
https://3v4l.org/NjIai
<?php
$subject = 'appl | pinea | orang frui | vege lates';
// Add the "*" in place at end
$subject = preg_replace('/[a-z]+/i', '$0*', $subject);
// Group the terms adding brackets and + sign
$result = preg_replace('/(?:[^\s|]+\s*\|\s*)*(?:[^\s|]+)/i', '+($0)', $subject);
var_dump($result);
// Outputs:
// string(53) "+(appl* | pinea* | orang*) +(frui* | vege*) +(lates*)"
Add in the * after each word:
$subject = preg_replace('/[a-z]+/i', '$0*', $subject);
Group based on space
$result = preg_replace('/(?:[^\s|]+\s*\|\s*)*(?:[^\s|]+)/i', '+($0)', $subject);
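To tie the two steps to the raw user input, here is a minimal sketch that first applies the same trim and pipe-normalising clean-up the question already uses, then the two replacements from this answer. The function name toBooleanQuery is hypothetical and only used for illustration.

```php
<?php
// Hypothetical helper: clean the raw input, then apply the two replacements.
function toBooleanQuery(string $term): string
{
    // Clean-up taken from the question: strip outer pipes/spaces,
    // then collapse runs of pipes and spaces into a single " | ".
    $term = trim($term, '| ');
    $term = preg_replace('#[| ]{2,}#', ' | ', $term);

    // Step 1: append "*" to every word.
    $term = preg_replace('/[a-z]+/i', '$0*', $term);

    // Step 2: wrap each pipe-joined run of words in "+( ... )".
    return preg_replace('/(?:[^\s|]+\s*\|\s*)*(?:[^\s|]+)/i', '+($0)', $term);
}

echo toBooleanQuery(' |appl | pinea | orang frui | vege lates');
// +(appl* | pinea* | orang*) +(frui* | vege*) +(lates*)
```

Note that the clean-up assumes words inside a group are separated by exactly one space; two or more spaces would be turned into a pipe.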
Human Readable
(?:[^\s|]+\s*\|\s*)*(?:[^\s|]+)
Options: Case insensitive; Exact spacing; Dot doesn’t match line breaks; ^$ don’t match at line breaks; Greedy quantifiers; Regex syntax only
Match the regular expression below: (?:[^\s|]+\s*\|\s*)*
  Between zero and unlimited times, as many times as possible, giving back as needed (greedy): *
  Match any single character NOT present in the list below: [^\s|]+
    Between one and unlimited times, as many times as possible, giving back as needed (greedy): +
    A “whitespace character” (any Unicode separator, tab, line feed, carriage return, vertical tab, form feed, next line): \s
    The literal character “|”: |
  Match a single character that is a “whitespace character”: \s*
    Between zero and unlimited times, as many times as possible, giving back as needed (greedy): *
  Match the character “|” literally: \|
  Match a single character that is a “whitespace character”: \s*
    Between zero and unlimited times, as many times as possible, giving back as needed (greedy): *
Match the regular expression below: (?:[^\s|]+)
  Match any single character NOT present in the list below: [^\s|]+
    Between one and unlimited times, as many times as possible, giving back as needed (greedy): +
    A “whitespace character” (any Unicode separator, tab, line feed, carriage return, vertical tab, form feed, next line): \s
    The literal character “|”: |
Replacement: +($0)
  Insert the character “+” literally: +
  Insert an opening parenthesis: (
  Insert the whole regular expression match: $0
  Insert a closing parenthesis: )
I have the following expression:
"MSRP | <span style='text-decoration: line-through;'>$74,660</span><br /> Buy | $67,092
I need
MSRP $74,660 $67,092
I can't seem to find the regex to include the '$' symbol in the match group. This is what I am currently doing:
MSRP | <[^>]+>\$([^>]+)<[^>]+> *<[^>]+> *Buy | $([^>]+)\/i
What is wrong with this expression and why is it not including the '$' symbol?
You need to escape the $ and the |, and also put the \$ inside the parentheses so that it is captured as part of the group:
(MSRP) \| <[^>]+>(\$[^>]+)<[^>]+> *<[^>]+> *Buy \| (\$[^>]+)
The $ sign only shows up together with the numbers in your results if it sits inside the capturing parentheses.
Here is a working example with PHP:
$str = "MSRP | <span style='text-decoration: line-through;'>$74,660</span><br /> Buy | $67,092";
preg_match('/MSRP \| <[^>]+>(\$[^>]+)<[^>]+> *<[^>]+> *Buy \| (\$[^>]+)/i', $str, $matches);
I got the results like this:
1 => '$74,660', 2 => '$67,092'
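As a small follow-up sketch, the captured groups can be joined to produce the exact line asked for in the question. The input is written as a single-quoted string here, and the variable names $pattern and $line are made up for illustration.

```php
<?php
// Single-quoted so PHP never tries to interpolate the "$" signs.
$str = 'MSRP | <span style=\'text-decoration: line-through;\'>$74,660</span><br /> Buy | $67,092';

// Same pattern as in the answer: "|" and "$" are escaped, and "\$" sits inside the groups.
$pattern = '/(MSRP) \| <[^>]+>(\$[^>]+)<[^>]+> *<[^>]+> *Buy \| (\$[^>]+)/i';

$line = null;
if (preg_match($pattern, $str, $matches)) {
    // $matches[1] is the label, $matches[2] and $matches[3] are the two prices.
    $line = $matches[1] . ' ' . $matches[2] . ' ' . $matches[3];
    echo $line; // MSRP $74,660 $67,092
}
```

An unescaped $ is an end-of-string anchor and an unescaped | is alternation, which is why the original expression never captured the symbol.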
I am using the following script, which I have altered, to split a large string into sentences. However, I am having issues getting it to also break on dates.
Original working code:
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?:] # Either an end of sentence punct,
| [.!?:][\'"]
| [\r\t\n] # or end of sentence punct and quote.
) # End positive lookbehind.
(?<! # Begin negative lookbehind.
Mr\. # Skip either "Mr."
| Mrs\. # or "Mrs.",
| Ms\. # or "Ms.",
| Jr\. # or "Jr.",
| Dr\. # or "Dr.",
| Prof\. # or "Prof.",
| U\.S\.A\.
| Sr\. # or "Sr.",
| T\.V\.A\. # or "T.V.A.",
| a\.m\. # or "a.m.",
| p\.m\. # or "p.m.",
| •\.
| :\.
| •\.
# or... (you get the idea).
) # End negative lookbehind.
\s+ # Split on whitespace between sentences.
/ix';
$sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
I have added [0-9]/[0-9]/[0-9], but it doesn't seem to have the desired effect. What am I missing? Here is my updated code:
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?:] # Either an end of sentence punct,
| [.!?:][\'"]
| [\r\t\n] # or end of sentence punct and quote.
| [0-9]/[0-9]/[0-9] # or on a date
) # End positive lookbehind.
(?<! # Begin negative lookbehind.
Mr\. # Skip either "Mr."
| Mrs\. # or "Mrs.",
| Ms\. # or "Ms.",
| Jr\. # or "Jr.",
| Dr\. # or "Dr.",
| Prof\. # or "Prof.",
| U\.S\.A\.
| Sr\. # or "Sr.",
| T\.V\.A\. # or "T.V.A.",
| a\.m\. # or "a.m.",
| p\.m\. # or "p.m.",
| •\.
| :\.
| •\.
# or... (you get the idea).
) # End negative lookbehind.
\s+ # Split on whitespace between sentences.
/ix';
Dates are not made up of single digits only, especially the year, so you need to allow for several digits in each part. You also need to escape the /, since that is your regex delimiter.
[0-9]{1,2}\/[0-9]{1,2}\/[0-9]{2,4}
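One caveat, offered as a sketch under assumptions: a pattern with variable repetition counts like this one may be rejected inside a lookbehind, because PCRE has traditionally required each lookbehind alternative to have a fixed length. A possible workaround is to match the date normally and then use \K to drop it from the reported match, so that only the whitespace after it is treated as the delimiter. The sample text and variable names below are invented, and the abbreviation lookbehind from the question is left out to keep the example short.

```php
<?php
// Invented sample text containing a date that is not followed by punctuation.
$text = 'Issued 01/15/2015 Your case is open. Reply soon.';

// Either whitespace after end-of-sentence punctuation,
// or whitespace after a date (the date itself is kept via \K).
$re = '/(?: (?<=[.!?:]) | \d{1,2}\/\d{1,2}\/\d{2,4} \K ) \s+ /x';

$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

foreach ($sentences as $i => $sentence) {
    echo $i . ' - ' . $sentence . "\n";
}
// 0 - Issued 01/15/2015
// 1 - Your case is open.
// 2 - Reply soon.
```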
I always forget regex right after learning it. I want to extract the isbn number from a string.
String: English | ISBN: 1285463234 | 2014 | 499 pages | PDF | 28 MB
Target to extract: 1285463234
You can use findall with this regex:
/(?<=ISBN: )\d+/
Explanation:
(?<= Opens a positive lookbehind group, asserts that the match is preceded by:
ISBN: Matches the string "ISBN: "
) Closes the lookbehind group.
\d+ Matches one or more digits.
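Since the question does not name a language, here is a minimal sketch of the same lookbehind regex used from PHP with preg_match; the variable names are only for illustration.

```php
<?php
$subject = 'English | ISBN: 1285463234 | 2014 | 499 pages | PDF | 28 MB';

$isbn = null;
// The lookbehind requires "ISBN: " before the digits without including it in the match.
if (preg_match('/(?<=ISBN: )\d+/', $subject, $m)) {
    $isbn = $m[0];
    echo $isbn; // 1285463234
}
```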
You could try the below regex to extract the ISBN number,
ISBN:\s*\K\d+
Your PHP code would be,
<?php
$mystring = 'English | ISBN: 1285463234 | 2014 | 499 pages | PDF | 28 MB';
$regex = '~ISBN:\s*\K\d+~';
if (preg_match($regex, $mystring, $m)) {
$yourmatch = $m[0];
echo $yourmatch;
}
?> //=> 1285463234
Explanation:
ISBN: Matches the string ISBN:
\s* Matches zero or more spaces.
\K Discards the previously matched characters (i.e., ISBN: and any spaces after it).
\d+ Matches one or more digits.
If you have problems with regex, other tools can also handle a straightforward example like yours, for example the sscanf function in the PHP string library:
$subject = 'English | ISBN: 1285463234 | 2014 | 499 pages | PDF | 28 MB';
$result = sscanf($subject, 'English | ISBN: %d | ', $isbn);
If there is a match ($result is 1), the $isbn variable will contain the ISBN number as integer:
int(1285463234)
Bear in mind that %d only works for ISBNs made up purely of digits with no leading zero; ISBN-10 numbers can start with 0 and can end with the check character X. If you need it as a string, use %s instead of %d:
$result = sscanf($subject, 'English | ISBN: %s | ', $isbn);
The result is then a string:
string(10) "1285463234"
For simple string parsing, scanf patterns are much easier to deal with than PCRE regular expressions (which are more powerful but also more complex). Assigning the result to a specific variable is also easier.
An exact representation (and also a patchwork of the other answers ^^) would be this:
(?<=\| ISBN: )(\S+)(?= \|)
I would like to divide a text into sentences in PHP. I'm currently using a regex, which brings ~95% accuracy and would like to improve by using a better approach. I've seen NLP tools that do that in Perl, Java, and C but didn't see anything that fits PHP. Do you know of such a tool?
An enhanced regex solution
Assuming you do care about handling: Mr. and Mrs. etc. abbreviations, then the following single regex solution works pretty well:
<?php // test.php Rev:20160820_1800
$split_sentences = '%(?#!php/i split_sentences Rev:20160820_1800)
# Split sentences on whitespace between them.
# See: http://stackoverflow.com/a/5844564/433790
(?<= # Sentence split location preceded by
[.!?] # either an end of sentence punct,
| [.!?][\'"] # or end of sentence punct and quote.
) # End positive lookbehind.
(?<! # But don\'t split after these:
Mr\. # Either "Mr."
| Mrs\. # Or "Mrs."
| Ms\. # Or "Ms."
| Jr\. # Or "Jr."
| Dr\. # Or "Dr."
| Prof\. # Or "Prof."
| Sr\. # Or "Sr."
| T\.V\.A\. # Or "T.V.A."
# Or... (you get the idea).
) # End negative lookbehind.
\s+ # Split on whitespace between sentences,
(?=\S) # (but not at end of string).
%xi'; // End $split_sentences.
$text = 'This is sentence one. Sentence two! Sentence thr'.
'ee? Sentence "four". Sentence "five"! Sentence "'.
'six"? Sentence "seven." Sentence \'eight!\' Dr. '.
'Jones said: "Mrs. Smith you have a lovely daught'.
'er!" The T.V.A. is a big project! '; // Note ws at end.
$sentences = preg_split($split_sentences, $text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
printf("Sentence[%d] = [%s]\n", $i + 1, $sentences[$i]);
}
?>
Note that you can easily add or take away abbreviations from the expression. Given the following test paragraph:
This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!" The T.V.A. is a big project!
Here is the output from the script:
Sentence[1] = [This is sentence one.]
Sentence[2] = [Sentence two!]
Sentence[3] = [Sentence three?]
Sentence[4] = [Sentence "four".]
Sentence[5] = [Sentence "five"!]
Sentence[6] = [Sentence "six"?]
Sentence[7] = [Sentence "seven."]
Sentence[8] = [Sentence 'eight!']
Sentence[9] = [Dr. Jones said: "Mrs. Smith you have a lovely daughter!"]
Sentence[10] = [The T.V.A. is a big project!]
The essential regex solution
The author of the question commented that the above solution "overlooks many options" and is not generic enough. I'm not sure what that means, but the essence of the above expression is about as clean and simple as you can get. Here it is:
$re = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
Note that both solutions correctly identify sentences ending with a quotation mark after the ending punctuation. If you don't care about matching sentences ending in a quotation mark the regex can be simplified to just: /(?<=[.!?])\s+(?=\S)/.
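As a quick check of the essential regex, here is a small sketch run against an invented test string with trailing whitespace. It is only meant to illustrate the behaviour described above, including the (?=\S) guard.

```php
<?php
// Invented test string; note the two trailing spaces.
$text = 'One. Two! "Three?" Four.  ';

$re = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

foreach ($sentences as $i => $sentence) {
    printf("Sentence[%d] = [%s]\n", $i + 1, $sentence);
}
// Sentence[1] = [One.]
// Sentence[2] = [Two!]
// Sentence[3] = ["Three?"]
// Sentence[4] = [Four.  ]
```

Because of (?=\S), the trailing whitespace is not treated as a split point, so it stays attached to the last sentence; call trim() on the pieces if that matters.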
Edit: 20130820_1000 Added T.V.A. (another punctuated word to be ignored) to regex and test string. (to answer PapyRef's comment question)
Edit: 20130820_1800 Tidied and renamed regex and added shebang. Also fixed regexes to prevent splitting text on trailing whitespace.
Slight improvement on someone else's work:
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?] # Either an end of sentence punct,
| [.!?][\'"] # or end of sentence punct and quote.
) # End positive lookbehind.
(?<! # Begin negative lookbehind.
Mr\. # Skip either "Mr."
| Mrs\. # or "Mrs.",
| Ms\. # or "Ms.",
| Jr\. # or "Jr.",
| Dr\. # or "Dr.",
| Prof\. # or "Prof.",
| Sr\. # or "Sr.",
| \s[A-Z]\. # or initials ex: "George W. Bush",
# or... (you get the idea).
) # End negative lookbehind.
\s+ # Split on whitespace between sentences.
/ix';
$sentences = preg_split($re, $story, -1, PREG_SPLIT_NO_EMPTY);
As a low-tech approach, you might want to consider using a series of explode calls in a loop, using ., !, and ? as your needle. This would be very memory and processor intensive (as most text processing is). You would have a bunch of temporary arrays and one master array with all found sentences numerically indexed in the right order.
Also, you'd have to check for common exceptions (such as a . in titles like Mr. and Dr.), but with everything being in an array, these types of checks shouldn't be that bad.
I'm not sure if this is any better than regex in terms of speed and scaling, but it would be worth a shot. How big are these blocks of text you want to break into sentences?
I was using this regex:
preg_split('/(?<=[.?!])\s(?=[A-Z"\'])/', $text);
Won't work on a sentence starting with a number, but should have very few false positives as well. Of course what you are doing matters as well. My program now uses
explode('.',$text);
because I decided speed was more important than accuracy.
Build a list of abbreviations like this
$skip_array = array(
'Jr', 'Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'Sr', // etc.
);
Compile them into an expression
$skip = '';
foreach($skip_array as $abbr) {
$skip = $skip . (empty($skip) ? '' : '|') . '\s{1}' . $abbr . '[.!?]';
}
Finally, run this preg_split to break the text into sentences.
$lines = preg_split ("/(?<!$skip)(?<=[.?!])\s+(?=[^a-z])/",
$txt, -1, PREG_SPLIT_NO_EMPTY);
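Put together, the three steps might look like the sketch below. It builds the skip expression with implode and preg_quote instead of the loop, which is a stylistic substitution on my part, and the sample text is invented.

```php
<?php
// Abbreviations that should not end a sentence.
$skip_array = array('Jr', 'Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'Sr');

// One lookbehind alternative per abbreviation: whitespace, the abbreviation, then punctuation.
$skip = implode('|', array_map(function ($abbr) {
    return '\s' . preg_quote($abbr, '/') . '[.!?]';
}, $skip_array));

// Invented sample text.
$txt = 'He met Dr. Jones today. Then Mr. Smith left! Everyone went home.';

$lines = preg_split("/(?<!$skip)(?<=[.?!])\s+(?=[^a-z])/", $txt, -1, PREG_SPLIT_NO_EMPTY);

print_r($lines);
// [0] => He met Dr. Jones today.
// [1] => Then Mr. Smith left!
// [2] => Everyone went home.
```

Since each alternative starts with \s, an abbreviation at the very start of the text is not protected by this expression.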
And if you're processing HTML, watch for tags being stripped, which eliminates the space between sentences.<p></p> If you have situations.Like this where.They stick together, it becomes immensely more difficult to parse.
@ridgerunner I ported your PHP code to C#.
I get 2 sentences as the result:
Mr. J. Dujardin régle sa T.V.
A. en esp. uniquement
The correct result should be the sentence : Mr. J. Dujardin régle sa T.V.A. en esp. uniquement
and with our test paragraph
string sText = "This is sentence one. Sentence two! Sentence three? Sentence \"four\". Sentence \"five\"! Sentence \"six\"? Sentence \"seven.\" Sentence 'eight!' Dr. Jones said: \"Mrs. Smith you have a lovely daughter!\" The T.V.A. is a big project!";
The result is
index: 0 sentence: This is sentence one.
index: 22 sentence: Sentence two!
index: 36 sentence: Sentence three?
index: 52 sentence: Sentence "four".
index: 69 sentence: Sentence "five"!
index: 86 sentence: Sentence "six"?
index: 102 sentence: Sentence "seven.
index: 118 sentence: " Sentence 'eight!'
index: 136 sentence: ' Dr. Jones said: "Mrs. Smith you have a lovely daughter!
index: 193 sentence: " The T.V.
index: 203 sentence: A. is a big project!
C# code :
string sText = "Mr. J. Dujardin régle sa T.V.A. en esp. uniquement";
Regex rx = new Regex(@"(\S.+?
[.!?] # Either an end of sentence punct,
| [.!?]['""] # or end of sentence punct and quote.
)
(?<! # Begin negative lookbehind.
Mr. # Skip either Mr.
| Mrs. # or Mrs.,
| Ms. # or Ms.,
| Jr. # or Jr.,
| Dr. # or Dr.,
| Prof. # or Prof.,
| Sr. # or Sr.,
| \s[A-Z]. # or initials ex: George W. Bush,
| T\.V\.A\. # or "T.V.A."
) # End negative lookbehind.
(?=|\s+|$)",
RegexOptions.CultureInvariant | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);
foreach (Match match in rx.Matches(sText))
{
Console.WriteLine("index: {0} sentence: {1}", match.Index, match.Value);
}