Explode a paragraph into sentences in PHP


I have been using
explode(".",$mystring)
to split a paragraph into sentences. However, this doesn't cover sentences that end with different punctuation such as ! ? : ;
Is there a way of using an array as a delimiter instead of a single character? Alternatively, is there another neat way of splitting on various punctuation marks?
I tried
explode(("." || "?" || "!"),$mystring)
hopefully but it didn't work...

You can use preg_split() combined with a PCRE lookbehind assertion to split the string after each occurrence of ., ;, :, ? or ! while keeping the actual punctuation intact:
Code:
$subject = 'abc sdfs. def ghi; this is an.email#addre.ss! asdasdasd? abc xyz';
// split on whitespace between sentences preceded by a punctuation mark
$result = preg_split('/(?<=[.?!;:])\s+/', $subject, -1, PREG_SPLIT_NO_EMPTY);
print_r($result);
Result:
Array
(
[0] => abc sdfs.
[1] => def ghi;
[2] => this is an.email#addre.ss!
[3] => asdasdasd?
[4] => abc xyz
)
You can also add a blacklist for abbreviations (Mr., Mrs., Dr., ...) that should not end a sentence by inserting a negative lookbehind assertion:
$subject = 'abc sdfs. Dr. Foo said he is not a sentence; asdasdasd? abc xyz';
// split on whitespace preceded by a punctuation mark, unless the mark belongs to a listed abbreviation
$result = preg_split('/(?<!Mr\.|Mrs\.|Dr\.)(?<=[.?!;:])\s+/', $subject, -1, PREG_SPLIT_NO_EMPTY);
print_r($result);
Result:
Array
(
[0] => abc sdfs.
[1] => Dr. Foo said he is not a sentence;
[2] => asdasdasd?
[3] => abc xyz
)

You can do:
preg_split('/\.|\?|!/',$mystring);
or (simpler):
preg_split('/[.?!]/',$mystring);

Assuming that you actually want the punctuation marks in the end result, have you tried:
$mystring = str_replace("?","?---",str_replace(".",".---",str_replace("!","!---",$mystring)));
$tmp = explode("---",$mystring);
This would leave your punctuation marks intact.

preg_split('/\s+|[.?!]/',$string);
A possible problem is an email address: the periods inside it would cause it to be split halfway through.

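Use preg_split and give it a regex like [.?!] to split on. Inside a character class, neither the dot nor the question mark needs escaping, and a | would match a literal pipe rather than mean "or".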

You can't have multiple delimiters for explode. That's what preg_split() is for. But even then, it splits at the delimiter, so the sentences are returned without their punctuation marks.
You can take preg_split a step further and flag it with PREG_SPLIT_DELIM_CAPTURE to return the delimiters as their own elements, then loop over the returned array to join each sentence with the punctuation mark that follows it, or just use preg_match_all():
preg_match_all('~.*?[?.!]~s', $string, $sentences);
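The PREG_SPLIT_DELIM_CAPTURE route mentioned above could look roughly like this. It is only a sketch: the function name split_keep_punctuation is made up for illustration, and it assumes that ., ? and ! are the only sentence terminators you care about.

```php
<?php
// Hypothetical helper (name is an assumption): split on sentence-ending
// punctuation, then glue each mark back onto the sentence it ended.
function split_keep_punctuation(string $text): array
{
    // The capturing group makes preg_split return the delimiters as
    // separate elements between the text pieces.
    $parts = preg_split('/([.?!]+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $sentences = [];
    // Elements alternate: text, delimiter, text, delimiter, ..., trailing text.
    for ($i = 0, $n = count($parts); $i < $n; $i += 2) {
        $sentence = trim($parts[$i] . ($parts[$i + 1] ?? ''));
        if ($sentence !== '') {
            $sentences[] = $sentence;
        }
    }
    return $sentences;
}

print_r(split_keep_punctuation('Is it late? Yes! Go home. Bye'));
```

Unlike the preg_match_all pattern, this also keeps a trailing fragment that has no closing punctuation mark.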

$mylist = preg_split("/[.?!:;]/", $mystring);

You can try preg_split
$sentences = preg_split("/[.?!:;]+/", $mystring);
Please note this will remove the punctuation. If you would also like to strip the whitespace that follows each delimiter:
$sentences = preg_split("/[.?!:;]+\s*/", $mystring);

Related

Matching whole words between commas, or a comma at the beginning, or a comma at the end with Regex

I have a string like this:
page-9000,page-template,page-type,page-category-128,image-195,listing-latest,rss-latest,even-more-info,even-more-tags
I made this regex that I expect to get the whole tags with:
(?<=\,)(rss-latest|listing-latest-no-category|category-128|page-9000)(?=\,)
I want it to match all the occurrences.
In this case:
page-9000 and rss-latest.
This regex checks whole words between commas just fine, but it ignores the first and the last because they're not between commas (obviously).
I've also tried making it check whether the word is between commas OR has one comma at the beginning OR one comma at the end, but that gives me false positives, as it would match:
category-128
while the string contains:
page-category-128
Any help?

Try using the following pattern:
(?<=,|^)(rss-latest|listing-latest-no-category|category-128|page-9000)(?=,|$)
The only change I have made is to add boundary markers ^ and $ to the lookarounds to also match on the start and end of the input.
Script:
$input = "page-9000,page-template,page-type,page-category-128,image-195,listing-latest,rss-latest,even-more-info,even-more-tags";
preg_match_all("/(?<=,|^)(rss-latest|listing-latest-no-category|category-128|page-9000)(?=,|$)/", $input, $matches);
print_r($matches[1]);
This prints:
Array
(
[0] => page-9000
[1] => rss-latest
)

Here is a non-regex way using explode and array_intersect:
$arr1 = explode(',', 'page-9000,page-template,page-type,page-category-128,image-195,listing-latest,rss-latest,even-more-info,even-more-tags');
$arr2 = explode('|', 'rss-latest|listing-latest-no-category|category-128|page-9000');
print_r(array_intersect($arr1, $arr2));
Output:
Array
(
[0] => page-9000
[6] => rss-latest
)

The (?<=\,) and (?=,) require the presence of , on both sides of the matching pattern. You want to match also at the start/end of string, and this is where you need to either explicitly tell to match either , or start/end of string or use double-negating logic with negated character classes inside negative lookarounds.
You may use
(?<![^,])(?:rss-latest|listing-latest-no-category|category-128|page-9000)(?![^,])
Here, (?<![^,]) matches the start of string position or a , and (?![^,]) matches the end of string position or ,.
Now, you do not even need a capturing group, you may get rid of its overhead using a non-capturing group, (?:...). preg_match_all won't have to allocate memory for the submatches and the resulting array will be much cleaner.
PHP demo:
$re = '/(?<![^,])(?:rss-latest|listing-latest-no-category|category-128|page-9000)(?![^,])/m';
$str = 'page-9000,page-template,page-type,page-category-128,image-195,listing-latest,rss-latest,even-more-info,even-more-tags';
if (preg_match_all($re, $str, $matches)) {
print_r($matches[0]);
}
// => Array ( [0] => page-9000 [1] => rss-latest )
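If the list of tags is not fixed, the same pattern can be assembled at run time. The following is a minimal sketch, assuming the tags arrive in an array (the $tags variable is made up for this example); preg_quote() escapes anything in a tag that would otherwise be read as regex syntax.

```php
<?php
// Hypothetical tag list; in practice it might come from configuration or user input.
$tags = ['rss-latest', 'listing-latest-no-category', 'category-128', 'page-9000'];

// Escape each tag so characters such as "." or "+" are matched literally.
$quoted = array_map(function ($tag) {
    return preg_quote($tag, '/');
}, $tags);

// Same double-negative lookarounds: no non-comma character may sit on either side.
$re = '/(?<![^,])(?:' . implode('|', $quoted) . ')(?![^,])/';

$str = 'page-9000,page-template,page-type,page-category-128,image-195,listing-latest,rss-latest,even-more-info,even-more-tags';
preg_match_all($re, $str, $matches);
print_r($matches[0]);
```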

Regular Expression on PHP

I have a problem splitting a string using regex.
I have searched for a regex that splits a string on uppercase words, but what I need is to split a string like in the following example.
Having this example data:
This is First SentenceThis is Second Sentence
... the string should be split like this:
This is First Sentence
This is Second Sentence
Anyone know the solution for this?

You can use the \K token combined with a lookahead assertion.
$str = 'This is First SentenceThis is Second Sentence';
$results = preg_split('~[a-z]\K(?=[A-Z])~', $str);
print_r($results);
Or utilize both look-behind and lookahead assertions:
$results = preg_split('~(?<=[a-z])(?=[A-Z])~', $str);
Output
Array
(
[0] => This is First Sentence
[1] => This is Second Sentence
)

php RegEx extract values from string

I am new to regular expressions and I am trying to extract some specific values from this string:
"Iban: EU4320000713864374\r\nSwift: DTEADCCC\r\nreg.no 2361 \r\naccount no. 1234531735"
Values that I am trying to extract:
EU4320000713864374
2361
This is what I am trying to do now:
preg_match('/[^Iban: ](?<iban>.*)[^\\r\\nreg.no ](?<regnr>.*)[^\\r\\n]/',$str,$matches);
All I am getting back is null or empty array. Any suggestions would be highly appreciated

The square brackets make no sense here (they define character classes); you perhaps meant to anchor at the beginning of a line:
$result = preg_match(
'/^Iban: (?<iban>.*)\R.*\R^reg\.no (?<regnr>.*)/m'
, $str, $matches
);
This requires the multi-line modifier (see m at the very end). I also replaced \r\n with \R so that this handles all kinds of line-separator sequences easily.
Example: https://eval.in/47062
A slightly better variant captures only non-whitespace values:
$result = preg_match(
'/^Iban: (?<iban>\S*)\R.*\R^reg\.no (?<regnr>\S*)/m'
, $str, $matches
);
Example: https://eval.in/47069
Result then is (beautified):
Array
(
[0] => "Iban: EU4320000713864374
Swift: DTEADCCC
reg.no 2361"
[iban] => "EU4320000713864374"
[1] => "EU4320000713864374"
[regnr] => "2361"
[2] => "2361"
)

preg_match("/Iban: (\\S+).*reg\\.no (\\S+)/s", $str, $matches);
There is a specific feature about newlines: the dot (.) does not match a newline character unless the s flag is specified.
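To see the effect of the s flag, here is a small comparison using the string from the question. It is a sketch that assumes PCRE's default newline setting (LF), which is what PHP normally ships with.

```php
<?php
$str = "Iban: EU4320000713864374\r\nSwift: DTEADCCC\r\nreg.no 2361 \r\naccount no. 1234531735";

// Without the s flag, .* cannot cross the "\n", so "reg.no" is never reached.
$without = preg_match('/Iban: (\S+).*reg\.no (\S+)/', $str, $m1);

// With the s flag, the dot also matches newlines and the pattern succeeds.
$with = preg_match('/Iban: (\S+).*reg\.no (\S+)/s', $str, $m2);

var_dump($without); // int(0)
var_dump($with);    // int(1)
print_r([$m2[1], $m2[2]]);
```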

Splitting a string on multiple separators in PHP

I can split a string with a comma using preg_split, like
$words = preg_split('/[,]/', $string);
How can I split the string on any of a dot, a space or a semicolon?
PS. I couldn't find any relevant example on the PHP preg_split page, that's why I am asking.

Try this:
<?php
$string = "foo bar baz; boo, bat";
$words = preg_split('/[,.\s;]+/', $string);
var_dump($words);
// -> ["foo", "bar", "baz", "boo", "bat"]
The pattern explained:
[] is a character class. It lists several characters and matches any one of them.
. matches a literal period. It does not need to be escaped inside a character class, though it does outside one, because there . means "match any character".
\s matches whitespace.
; splits on the semicolon. It does not need to be escaped because it has no special meaning.
The + after the class treats a run of consecutive separators as a single split point, so no empty strings show up in the result.
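The role of the + is easiest to see side by side. In this short sketch the variable names are only illustrative, and PREG_SPLIT_NO_EMPTY is shown as an alternative way to drop the empty pieces.

```php
<?php
$string = "foo bar baz; boo, bat";

// Without +, "; " counts as two separate delimiters, so an empty
// string appears between them.
$single = preg_split('/[,.\s;]/', $string);

// With +, a run of delimiters is treated as one split point.
$run = preg_split('/[,.\s;]+/', $string);

// PREG_SPLIT_NO_EMPTY filters the empty pieces out after splitting.
$noEmpty = preg_split('/[,.\s;]/', $string, -1, PREG_SPLIT_NO_EMPTY);

var_dump($single);
var_dump($run);
var_dump($noEmpty);
```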

The examples are there, perhaps not literally, but here is a split with multiple delimiter options:
$words = preg_split('/[ ;.,]/', $string);

something like this?
<?php
$string = "blsdk.bldf,las;kbdl aksm,alskbdklasd";
$words = preg_split('/[,\ \.;]/', $string);
print_r( $words );
result:
Array
(
[0] => blsdk
[1] => bldf
[2] => las
[3] => kbdl
[4] => aksm
[5] => alskbdklasd
)

$words = preg_split('/[\,\.\ ]/', $string);

Just add these characters to your expression:
$words = preg_split('/[;,. ]/', $string);
EDIT: thanks to Igoris Azanovas, escaping the dot in a character class is not needed ;)

$words = preg_split('/[,\.\s;]/', $string);

how to split by letter-followed-by-period?

I want to split text by the letter-followed-by-period rule. So I do this:
$text = 'One two. Three test. And yet another one';
$splitted_text = preg_split("/\w\./", $text);
print_r($splitted_text);
Then I get this:
Array ( [0] => One tw [1] => Three tes [2] => And yet another one )
But I do need it to be like this:
Array ( [0] => One two [1] => Three test [2] => And yet another one )
How can I fix this?

It's splitting on the letter and the period. If you want to make sure that a letter precedes the period without consuming that letter, you need to use a positive lookbehind assertion.
$text = 'One two. Three test. And yet another one';
$splitted_text = preg_split("/(?<=\w)\./", $text);
print_r($splitted_text);

Use explode:
$text = 'One two. Three test. And yet another one';
$splitted_text = explode(".", $text);
print_r($splitted_text);
Update
$splitted_text = explode(". ", $text);
With ". " as the delimiter, explode consumes the space as well.
The delimiter can be any string, not only a single character.

Using regex is overkill here; you can easily use explode. Since an explode-based answer has already been given, I'll give a regex-based answer:
$splitted_text = preg_split("/\.\s*/", $text);
Regex used: \.\s*
\. - A dot is a metacharacter. To match a literal dot we escape it.
\s* - zero or more whitespace characters.
If you use the regex \. alone, some of the resulting pieces will have leading spaces.
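The difference is easy to check. Here is a short sketch that combines the lookbehind from the earlier answer with the optional whitespace; the variable names are only illustrative.

```php
<?php
$text = 'One two. Three test. And yet another one';

// Splitting on the period alone leaves the following space on each later piece.
$raw = preg_split('/(?<=\w)\./', $text);

// Consuming optional whitespace after the period removes it.
$clean = preg_split('/(?<=\w)\.\s*/', $text);

var_dump($raw);
var_dump($clean);
```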
