Breaking a string on dates - php

I am using the following script I have altered to split a large string into sentances. However I am having issues getting it to also break on dates.
Original working code:
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?:] # Either an end of sentence punct,
| [.!?:][\'"]
| [\r\t\n] # or end of sentence punct and quote.
) # End positive lookbehind.
(?<! # Begin negative lookbehind.
Mr\. # Skip either "Mr."
| Mrs\. # or "Mrs.",
| Ms\. # or "Ms.",
| Jr\. # or "Jr.",
| Dr\. # or "Dr.",
| Prof\. # or "Prof.",
| U\.S\.A\.
| Sr\. # or "Sr.",
| T\.V\.A\. # or "T.V.A.",
| a\.m\. # or "a.m.",
| p\.m\. # or "p.m.",
| •\.
| :\.
| •\.
# or... (you get the idea).
) # End negative lookbehind.
\s+ # Split on whitespace between sentences.
/ix';
$sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
I have added [0-9]/[0-9]/[0-9], but it doesn't seem to be having the desired effect. What am I missing? Here Is my updated code below:
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?:] # Either an end of sentence punct,
| [.!?:][\'"]
| [\r\t\n] # or end of sentence punct and quote.
| [0-9]/[0-9]/[0-9] # or on a date
) # End positive lookbehind.
(?<! # Begin negative lookbehind.
Mr\. # Skip either "Mr."
| Mrs\. # or "Mrs.",
| Ms\. # or "Ms.",
| Jr\. # or "Jr.",
| Dr\. # or "Dr.",
| Prof\. # or "Prof.",
| U\.S\.A\.
| Sr\. # or "Sr.",
| T\.V\.A\. # or "T.V.A.",
| a\.m\. # or "a.m.",
| p\.m\. # or "p.m.",
| •\.
| :\.
| •\.
# or... (you get the idea).
) # End negative lookbehind.
\s+ # Split on whitespace between sentences.
/ix';

Dates do not have only single digits especially in the year. You need to account for that. You also need to escape the / since that is your regex delimiter.
[0-9]{1,2}\/[0-9]{1,2}\/[0-9]{2,4}

Related

Regex split string on a char with exception for inner-string

I have a string like aa | bb | "cc | dd" | 'ee | ff' and I'm looking for a way to split this to get all the values separated by the | character with exeption for | contained in strings.
The idea is to get something like this [a, b, "cc | dd", 'ee | ff']
I've already found an answer to a similar question here : https://stackoverflow.com/a/11457952/11260467
However I can't find a way to adapt it for a case with multiple separator characters, is there someone out here which is less dumb than me when it come to regular expressions ?
This is easily done with the (*SKIP)(*FAIL) functionality pcre offers:
(['"]).*?\1(*SKIP)(*FAIL)|\s*\|\s*
In PHP this could be:
<?php
$string = "aa | bb | \"cc | dd\" | 'ee | ff'";
$pattern = '~([\'"]).*?\1(*SKIP)(*FAIL)|\s*\|\s*~';
$splitted = preg_split($pattern, $string);
print_r($splitted);
?>
And would yield
Array
(
[0] => aa
[1] => bb
[2] => "cc | dd"
[3] => 'ee | ff'
)
See a demo on regex101.com and on ideone.com.
This is easier if you match the parts (not split). Patterns are greedy by default, they will consume as many characters as possible. This allows to define more complex patterns for the quoted string before providing a pattern for an unquoted token:
$subject = '[ aa | bb | "cc | dd" | \'ee | ff\' ]';
$pattern = <<<'PATTERN'
(
(?:[|[]|^) # after | or [ or string start
\s*
(?<token> # name the match
"[^"]*" # string in double quotes
|
'[^']*' # string in single quotes
|
[^\s|]+ # non-whitespace
)
\s*
)x
PATTERN;
preg_match_all($pattern, $subject, $matches);
var_dump($matches['token']);
Output:
array(4) {
[0]=>
string(2) "aa"
[1]=>
string(2) "bb"
[2]=>
string(9) ""cc | dd""
[3]=>
string(9) "'ee | ff'"
}
Hints:
The <<<'PATTERN' is called HEREDOC syntax and cuts down on escaping
I use () as pattern delimiters - they are group 0
Naming matches makes code a lot more readable
Modifier x allows to indent and comment the pattern
Use
$string = "aa | bb | \"cc | dd\" | 'ee | ff'";
preg_match_all("~(?|\"([^\"]*)\"|'([^']*)'|([^|'\"]+))(?:\s*\|\s*|\z)~", $string, $matches);
print_r(array_map(function($x) {return trim($x);}, $matches[1]));
See PHP proof.
Results:
Array
(
[0] => aa
[1] => bb
[2] => cc | dd
[3] => ee | ff
)
EXPLANATION
--------------------------------------------------------------------------------
(?| Branch reset group, does not capture:
--------------------------------------------------------------------------------
\" '"'
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^\"]* any character except: '\"' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\" '"'
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
' '\''
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
' '\''
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[^|'\"]+ any character except: '|', ''', '\"'
(1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
) end of grouping
--------------------------------------------------------------------------------
(?: group, but do not capture:
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\| '|'
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0
or more times (matching the most amount
possible))
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\z the end of the string
--------------------------------------------------------------------------------
) end of grouping
It's interesting that there are so many ways to construct a regular expression for this problem. Here is another that is similar to #Jan's answer.
(['"]).*?\1\K| *\| *
PCRE Demo
(['"]) # match a single or double quote and save to capture group 1
.*? # match zero or more characters lazily
\1 # match the content of capture group 1
\K # reset the starting point of the reported match and discard
# any previously-consumed characters from the reported match
| # or
\ * # match zero or more spaces
\| # match a pipe character
\ * # match zero or more spaces
Notice that the part before the pipe character ("or") serves merely to move the engine's internal string pointer to just past the closing quote or a quoted substring.

preg_split based on a Sentence

I have the follwoing script to split up sentences. There are a few phrases that I would like to treat as the end of a sentence in addition to punctuation. This works fine if it is a single character, but not when it there is a space.
This is the code I have that works:
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?:\#*] # Either an end of sentence punct,
| [.!?:][\'"]
| [\r\t\n] # or end of sentence punct and quote.
| HYPERLINK
| .org
| .gov
| .aspx
| .com
| Date
| Dear
) # End positive lookbehind.
(?<! # Begin negative lookbehind.
Mr\. # Skip either "Mr."
| Mrs\. # or "Mrs.",
| Ms\. # or "Ms.",
| Jr\. # or "Jr.",
| Dr\. # or "Dr.",
| Prof\. # or "Prof.",
| U\.S\.A\.
| U\.S\.
| Sr\. # or "Sr.",
| T\.V\.A\. # or "T.V.A.",
| a\.m\. # or "a.m.",
| p\.m\. # or "p.m.",
| a€¢\.
| :\.
# or... (you get the idea).
) # End negative lookbehind.
\s+ # Split on whitespace between sentences.
/ix';
This is an example phrase I have tried to add:
"Total Gross Income"
I have tried formating it in these ways, but none of them work:
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?:\#*] # Either an end of sentence punct,
| [.!?:][\'"]
| [\r\t\n] # or end of sentence punct and quote.
| HYPERLINK
| .org
| .gov
| .aspx
| .com
| Date
| Dear
| "Total Gross Income"
| Total[ X]Gross[ X]Income
| Total" "Gross" "Income
)
This for example if I have the following code:
$block_o_text = "You could receive the wrong amount. If you receive more benefits than you should, you must pay them back. When will we review your case? An eligibility review form will be sent before your benefits stop. Total Gross Income Total ResourcesMedical ProgramsHousehold.";
$sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
echo $i . " - " . $sentance . "<BR>";
}
The results I get are:
77 - You could receive the wrong amount.
78 - If you receive more benefits than you should, you must pay them back.
79 - When will we review your case?
80 - An eligibility review form will be sent before your benefits stop.
81 - 01/201502/2015
82 - Total Gross Income Total ResourcesMedical ProgramsHousehold
What I want to get is :
77 - You could receive the wrong amount.
78 - If you receive more benefits than you should, you must pay them back.
79 - When will we review your case?
80 - An eligibility review form will be sent before your benefits stop.
81 - 01/201502/2015
82 - Total Gross Income
83 - Total ResourcesMedical ProgramsHousehold
What am I doing wrong?
Your problem is with the white space declaration that follows your lookbehind - it requires at least one white space in order to split, but if you remove it, then you end up capturing the preceeding letter and breaking the whole thing.
Thus As far as I can tell, you can't do this entirely with lookarounds. You'll still need to have some of the expression work with lookarounds (space preceded by punctuation, etc.), but for specific phrases, you can't.
You can also use the PREG_SPLIT_DELIM_CAPTURE flag to capture out what you're splitting. Something like this should get you started:
$re = '/((?<=[\.\?\!])\s+|Total\sGross\sIncome)/ix';
$block_o_text = "You could receive the wrong amount. If you receive more benefits than you should, you must pay them back. When will we review your case? An eligibility review form will be sent before your benefits stop. Total Gross IncomeTotal ResourcesMedical ProgramsHousehold.";
$sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
if (!ctype_space($sentences[$i])) {
echo $i . " - " . $sentences[$i] . "<br>";
}
}
Output:
0 - You could receive the wrong amount.
2 - If you receive more benefits than you should, you must pay them back.
4 - When will we review your case?
6 - An eligibility review form will be sent before your benefits stop.
8 - Total Gross Income
9 - Total ResourcesMedical ProgramsHousehold.

regex - Compilation failed: assertion expected at offset 6

I'm trying to use Casimir et Hippolyte's pattern (Here) to wrap HTML tags in string.
$html = <<<EOD
$str
EOD;
$pattern = <<<'EOD'
~
(?(DEFINE)
(?<self> < [^\W_]++ [^>]* > )
(?<comment> <!-- (?>[^-]++|-(?!->))* -->)
(?<cdata> \Q<![CDATA[\E (?>[^]]++|](?!]>))* ]]> )
(?<text> [^<]++ )
(?<tag>
< ([^\W_]++) [^>]* >
(?> \g<text> | \g<tag> | \g<self> | \g<comment> | \g<cdata> )*
</ \g{-1} >
)
)
# main pattern
(?: \g<tag> | \g<self> | \g<comment> | \g<cdata> )+
~x
EOD;
After implementing this method, I got an error Compilation failed: assertion expected after (?( at offset 6. What's wrong with this pattern?
After some researches, it seems that PCRE versions < 7.2 have this kind of bug with the DEFINE syntax.
You can write the same pattern like that:
$pattern = <<<'EOD'
~
(?:
(?<tag>
< ([^\W_]++) [^>]* >
(?> (?<text> [^<]++ )
| \g<tag>
| (?<self> < [^\W_]++ [^>]* > )
| (?<comment> <!-- (?>[^-]++|-(?!->))* -->)
| (?<cdata> \Q<![CDATA[\E (?>[^]]++|](?!]>))* ]]>)
)*
</ \g{2} > # second group from pattern start (<tag> is 1st)
)
| \g<self> | \g<comment> | \g<cdata>
)+
~x
EOD;

Regular expression testing for UTF-8

Today I decided to test a small function that checks if a string is UTF-8.
I used recommendations of the Multilingual form encoding and created a small helper:
function is_utf8($string) {
if (strlen($string) == 0)
{
return true;
}
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
}
As a test, I used a string with 196 characters. And just checked my helper. But browser doesn't display page with result, instead - 404 Page not found.
$string = "1234567890123456789012345678..."; // 196 characters here
echo strlen($string); // result - 196
var_dump(is_utf8($string)); // Error - Page not found!
But if I use 195 characters, everything works fine.
I've tried any of the characters, even spaces. This function only works with a string of no more than 195 characters.
Why?
This works as well, with a simple regular expression and serialize
function check_utf8($str) {
return (bool)preg_match('//u', serialize($str));
}
Did a simple test.
I performed the function of 1000000 times. Looked who faster.
I would also like to thank #mario for the help of an atomic grouping.
$string = "ывлдоkfdsuLIU(*knj4k58u7MJHKkiyhsf9hfhlknhlkjldfivjo8iulkjlgs".
"2345678901234567890123456789012345678901234567890123456789012".
"ыдваолт ДЛЯОЧДльы0щ39478509г0*()*?Щчялртодылматцю4к 2ылвсголо".
"4567890123456789012345678901234567890123456789012345678901234".
"4567890123456789012345678901234567890123456789012345678901234".
"asdfsd ds.kjasldasjlKUJLjLKZjulizL kzjxLkUJOLIULKM.LKl;.mcvss";
$s = microtime(true);
for ($i=0; $i<1000000; $i++)
{
// algorithm
}
$e = microtime(true);
echo $e-$s;
And here result:
preg_match('//u', $string )
Result: 11.634791135788 sec
(preg_match('%^(?>
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string)
Result: Fatal error: Maximum execution time of 30 seconds exceeded
preg_match('/^./su', $string)
Result: 12.27244400978 sec
mb_detect_encoding($string, array('UTF-8'), true)
Result: 15.370143890381 sec
And I also tried method proposed here by #helloworld
preg_match('//u', serialize($string))
Result: 23.193331956863 sec
Thank you all for your advice!
You helped me to understand
If the String is too long -> PCRE crash
look http://www.java-samples.com/showtutorial.php?tutorialid=1526 for solving

php sentence boundaries detection [duplicate]

This question already has answers here:
Split string into sentences using regex
(7 answers)
Closed 2 years ago.
I would like to divide a text into sentences in PHP. I'm currently using a regex, which brings ~95% accuracy and would like to improve by using a better approach. I've seen NLP tools that do that in Perl, Java, and C but didn't see anything that fits PHP. Do you know of such a tool?
An enhanced regex solution
Assuming you do care about handling: Mr. and Mrs. etc. abbreviations, then the following single regex solution works pretty well:
<?php // test.php Rev:20160820_1800
$split_sentences = '%(?#!php/i split_sentences Rev:20160820_1800)
# Split sentences on whitespace between them.
# See: http://stackoverflow.com/a/5844564/433790
(?<= # Sentence split location preceded by
[.!?] # either an end of sentence punct,
| [.!?][\'"] # or end of sentence punct and quote.
) # End positive lookbehind.
(?<! # But don\'t split after these:
Mr\. # Either "Mr."
| Mrs\. # Or "Mrs."
| Ms\. # Or "Ms."
| Jr\. # Or "Jr."
| Dr\. # Or "Dr."
| Prof\. # Or "Prof."
| Sr\. # Or "Sr."
| T\.V\.A\. # Or "T.V.A."
# Or... (you get the idea).
) # End negative lookbehind.
\s+ # Split on whitespace between sentences,
(?=\S) # (but not at end of string).
%xi'; // End $split_sentences.
$text = 'This is sentence one. Sentence two! Sentence thr'.
'ee? Sentence "four". Sentence "five"! Sentence "'.
'six"? Sentence "seven." Sentence \'eight!\' Dr. '.
'Jones said: "Mrs. Smith you have a lovely daught'.
'er!" The T.V.A. is a big project! '; // Note ws at end.
$sentences = preg_split($split_sentences, $text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
printf("Sentence[%d] = [%s]\n", $i + 1, $sentences[$i]);
}
?>
Note that you can easily add or take away abbreviations from the expression. Given the following test paragraph:
This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!" The T.V.A. is a big project!
Here is the output from the script:
Sentence[1] = [This is sentence one.]
Sentence[2] = [Sentence two!]
Sentence[3] = [Sentence three?]
Sentence[4] = [Sentence "four".]
Sentence[5] = [Sentence "five"!]
Sentence[6] = [Sentence "six"?]
Sentence[7] = [Sentence "seven."]
Sentence[8] = [Sentence 'eight!']
Sentence[9] = [Dr. Jones said: "Mrs. Smith you have a lovely daughter!"]
Sentence[10] = [The T.V.A. is a big project!]
The essential regex solution
The author of the question commented that the above solution "overlooks many options" and is not generic enough. I'm not sure what that means, but the essence of the above expression is about as clean and simple as you can get. Here it is:
$re = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
Note that both solutions correctly identify sentences ending with a quotation mark after the ending punctuation. If you don't care about matching sentences ending in a quotation mark the regex can be simplified to just: /(?<=[.!?])\s+(?=\S)/.
Edit: 20130820_1000 Added T.V.A. (another punctuated word to be ignored) to regex and test string. (to answer PapyRef's comment question)
Edit: 20130820_1800 Tidied and renamed regex and added shebang. Also fixed regexes to prevent splitting text on trailing whitespace.
Slight improvement on someone else's work:
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?] # Either an end of sentence punct,
| [.!?][\'"] # or end of sentence punct and quote.
) # End positive lookbehind.
(?<! # Begin negative lookbehind.
Mr\. # Skip either "Mr."
| Mrs\. # or "Mrs.",
| Ms\. # or "Ms.",
| Jr\. # or "Jr.",
| Dr\. # or "Dr.",
| Prof\. # or "Prof.",
| Sr\. # or "Sr.",
| \s[A-Z]\. # or initials ex: "George W. Bush",
# or... (you get the idea).
) # End negative lookbehind.
\s+ # Split on whitespace between sentences.
/ix';
$sentences = preg_split($re, $story, -1, PREG_SPLIT_NO_EMPTY);
As a low-tech approach, you might want to consider using a series of explode calls in a loop, using ., !, and ? as your needle. This would be very memory and processor intensive (as most text processing is). You would have a bunch of temporary arrays and one master array with all found sentences numerically indexed in the right order.
Also, you'd have to check for common exceptions (such as a . in titles like Mr. and Dr.), but with everything being in an array, these types of checks shouldn't be that bad.
I'm not sure if this is any better than regex in terms of speed and scaling, but it would be worth a shot. How big are these blocks of text you want to break into sentences?
I was using this regex:
preg_split('/(?<=[.?!])\s(?=[A-Z"\'])/', $text);
Won't work on a sentence starting with a number, but should have very few false positives as well. Of course what you are doing matters as well. My program now uses
explode('.',$text);
because I decided speed was more important than accuracy.
Build a list of abbreviations like this
$skip_array = array (
'Jr', 'Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'Sr' , etc.
Compile them into a an expression
$skip = '';
foreach($skip_array as $abbr) {
$skip = $skip . (empty($skip) ? '' : '|') . '\s{1}' . $abbr . '[.!?]';
}
Last run this preg_split to break into sentences.
$lines = preg_split ("/(?<!$skip)(?<=[.?!])\s+(?=[^a-z])/",
$txt, -1, PREG_SPLIT_NO_EMPTY);
And if you're processing HTML, watch for tags getting deleted which eliminate the space between sentences.<p></p> If you have situations.Like this where.They stick together it becomes immensely more difficult to parse.
#ridgerunner I wrote your PHP code in C #
I get like 2 sentences as result :
Mr. J. Dujardin régle sa T.V.
A. en esp. uniquement
The correct result should be the sentence : Mr. J. Dujardin régle sa T.V.A. en esp. uniquement
and with our test paragraph
string sText = "This is sentence one. Sentence two! Sentence three? Sentence \"four\". Sentence \"five\"! Sentence \"six\"? Sentence \"seven.\" Sentence 'eight!' Dr. Jones said: \"Mrs. Smith you have a lovely daughter!\" The T.V.A. is a big project!";
The result is
index: 0 sentence: This is sentence one.
index: 22 sentence: Sentence two!
index: 36 sentence: Sentence three?
index: 52 sentence: Sentence "four".
index: 69 sentence: Sentence "five"!
index: 86 sentence: Sentence "six"?
index: 102 sentence: Sentence "seven.
index: 118 sentence: " Sentence 'eight!'
index: 136 sentence: ' Dr. Jones said: "Mrs. Smith you have a lovely daughter!
index: 193 sentence: " The T.V.
index: 203 sentence: A. is a big project!
C# code :
string sText = "Mr. J. Dujardin régle sa T.V.A. en esp. uniquement";
Regex rx = new Regex(#"(\S.+?
[.!?] # Either an end of sentence punct,
| [.!?]['""] # or end of sentence punct and quote.
)
(?<! # Begin negative lookbehind.
Mr. # Skip either Mr.
| Mrs. # or Mrs.,
| Ms. # or Ms.,
| Jr. # or Jr.,
| Dr. # or Dr.,
| Prof. # or Prof.,
| Sr. # or Sr.,
| \s[A-Z]. # or initials ex: George W. Bush,
| T\.V\.A\. # or "T.V.A."
) # End negative lookbehind.
(?=|\s+|$)",
RegexOptions.CultureInvariant | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);
foreach (Match match in rx.Matches(sText))
{
Console.WriteLine("index: {0} sentence: {1}", match.Index, match.Value);
}

Categories