Php splitting a sentence - php

I'm trying to split a string of sentences by "." to get each sentence in an array. Like below:
$Text = "Hello, Mr. James. How are you today.";
$split= explode(".", $Text);
As you can see, $Text contains 2 sentences, so I should only have 2 elements in the array. The issue I'm having is that sometimes $Text contains words like "Mr." or other words with a "." in the middle of a sentence. This results in sentences being split in the middle and placed separately in the array, like below:
Array ( [0] => Hello, Mr [1] => James [2] => How are you today [3] => )

You can avoid a lot of exception handling and general misery if you can ensure that every English sentence is followed by 2 consecutive spaces. This can be difficult when dealing with some digitized strings, because multi-spacing sometimes gets condensed to a single space.
This is what I mean:
$Text = "Hello, Mr. James.  How are you today."; // two spaces after "James."
$split = explode("  ", $Text); // the delimiter is two spaces
var_export($split);
// array ( 0 => 'Hello, Mr. James.', 1 => 'How are you today.', )
Exploding on each space-space will give you a reliable result.
If you want good output, you'll need to use good input.
If you want to blacklist a few predictable substrings that should not be used to split the string, then you can use (*SKIP)(*FAIL) for that.
Code:
$text = "Hello, Mr. James. How are you today.";
var_export(
preg_split('~(?:Mrs?|Miss|Ms|Prof|Rev|Col|Dr)[.?!:](*SKIP)(*F)|[.?!:]+\K\s+~', $text, 0, PREG_SPLIT_NO_EMPTY)
);
Output:
array (
0 => 'Hello, Mr. James.',
1 => 'How are you today.',
)
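If the list of abbreviations needs to change over time, the same (*SKIP)(*F) idea can be wrapped in a small helper. This is only a sketch: the function name splitSentences and its default abbreviation list are made up for illustration, and it assumes PHP 7.4 or later for the arrow function.

```php
<?php
// Hypothetical helper: split text into sentences, never splitting
// right after one of the listed abbreviations.
function splitSentences(string $text, array $abbreviations = ['Mr', 'Mrs', 'Ms', 'Dr', 'Prof']): array
{
    // Quote each abbreviation so it is treated literally inside the pattern.
    $quoted = array_map(fn($a) => preg_quote($a, '~'), $abbreviations);

    // \b keeps an abbreviation from matching the tail of a longer word.
    // First branch: abbreviation plus dot is consumed and discarded.
    // Second branch: split on whitespace that follows ., ? or !.
    $pattern = '~\b(?:' . implode('|', $quoted) . ')\.(*SKIP)(*F)|[.?!]+\K\s+~';

    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
}

$sentences = splitSentences("Hello, Mr. James. How are you today? Ask Dr. Lee.");
var_export($sentences);
```

Any abbreviation that is missing from the list will still cause a split, so the list has to match the text you expect.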

Related

Regex for matching phrase between '' php

I want to match the content inside the ' and ' (single quotes). For example: 'for example' should return for and example. It's only a part of the sentence I have to analyze. I used preg_split(\s) for the whole sentence, so 'for example' has already been split into the two tokens 'for and example'.
Right now I've tried /^'(.*)|(.*)'$/ and it only returns for but not the example, if I put it like /^(.*)'|'(.*)$/, it only returns example but not for. How should I fix this?
You can avoid double handling of the string by leveraging the \G metacharacter to continue matching an unlimited number of space-delimited strings inside of single quotes.
Code:
$string = "text 'for an example of the \"continue\" metacharacter' text";
var_export(preg_match_all("~(?|'|\G(?!^) )\K[^ ']+~", $string, $out) ? $out[0] : []);
Output:
array (
0 => 'for',
1 => 'an',
2 => 'example',
3 => 'of',
4 => 'the',
5 => '"continue"',
6 => 'metacharacter',
)
To get the single sentences (which you then want to split) you can use preg_match_all() to capture anything between two single quotes.
preg_match_all("~'([^']+)'~", $text, $matches);
$string = $matches[1][0]; // first quoted phrase; $matches[1] lists all of them
$string now contains something like "example string with words".
Now if you want to split a string according to a specific sequence / character, you can make use of explode():
$string = "example string with words";
$result = explode(" ", $string);
print_r($result);
gives you:
Array
(
[0] => example
[1] => string
[2] => with
[3] => words
)
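Putting the two steps together might look like the sketch below. The input text is a made-up example, and it assumes the quoted phrases contain no apostrophes of their own.

```php
<?php
// Hypothetical input: a sentence with two quoted phrases.
$text = "she said 'for example' and then 'one more phrase' again";

$words = [];
if (preg_match_all("~'([^']+)'~", $text, $matches)) {
    // $matches[1] holds the text captured inside each pair of quotes.
    foreach ($matches[1] as $phrase) {
        // Split each phrase into its space-separated words.
        $words[] = explode(' ', $phrase);
    }
}
print_r($words);
```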

How to extract substrings with delimiters from a string in php

I would like to remove substrings from a string that have delimiters.
Example:
$string = "Hi, I want to buy an [apple] and a [banana].";
How do I get "apple" and "banana" out of this string and in an array? And the other parts of the string "Hi, I want to buy an" and "and a" in another array.
I apologize if this question has already been answered. I searched this site and couldn't find anything that would help me. Every situation was just a little different.
You could use preg_split() thus:
<?php
$pattern = '/[\[\]]/'; // Split on either [ or ]
$string = "Hi, I want to buy an [apple] and a [banana].";
echo print_r(preg_split($pattern, $string), true);
which outputs:
Array
(
[0] => Hi, I want to buy an
[1] => apple
[2] => and a
[3] => banana
[4] => .
)
You can trim the whitespace if you like and/or ignore the final fullstop.
preg_match_all('/(?<=\[)[a-z]+(?=\])/', $string, $matches);
Should do what you want. $matches[0] will be an array with each match.
I assume you want words as values in the array:
$words = explode(' ', $string);
$result = preg_grep('/\[[^\]]+\]/', $words);
$others = array_diff($words, $result);
Create an array of words using explode() on a space
Use a regex to find [somethings] using preg_grep()
Find the difference of all words and [somethings] using array_diff(), which will be the "other" parts of the string
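If the goal is the two arrays described in the question (the bracketed words in one, the surrounding text in the other), a different approach is to run the same bracket pattern twice, once to capture and once to split. This is a rough sketch that assumes brackets are never nested; the variable names $inside and $outside are just for illustration.

```php
<?php
$string = "Hi, I want to buy an [apple] and a [banana].";

// Text inside each [...] pair; $m[1] holds the captures without the brackets.
preg_match_all('/\[([^\]]+)\]/', $string, $m);
$inside = $m[1];

// Everything else: split on the bracketed parts, then trim each piece.
$outside = array_map('trim', preg_split('/\[[^\]]+\]/', $string));

print_r($inside);
print_r($outside);
```

The final "." ends up as its own element in $outside; filter it out if you do not want it.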

split string by any amount of whitespace in PHP

I know how to split a string so that the words between the delimiter become elements in an array, using explode() with " ".
But that only splits the string on a single whitespace character. How can I split on any amount of whitespace?
So an element in the array ends when whitespace is found, and the next element starts at the next non-whitespace character.
So something like "The quick brown fox" turns into an array with The, quick, brown, and fox as its elements.
And "jumped over the lazy dog" is also split so that each word is an individual element in the returned array.
Like this:
preg_split('#\s+#', $string, -1, PREG_SPLIT_NO_EMPTY);
$yourSplitArray=preg_split('/[\ \n\,]+/', $your_string);
Try this:
preg_split("/ +/", "hypertext language programming"); // for one or more spaces
You can also see the PHP explode() function:
<?php
$str = "Hello world. It's a beautiful day.";
print_r (explode(" ",$str));
?>
will return:
Array ( [0] => Hello [1] => world. [2] => It's [3] => a [4] => beautiful [5] => day. )
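Note that explode(" ", ...) as shown above still splits on single spaces only. For comparison, here is a small sketch of the \s+ approach on a made-up string that mixes runs of spaces, a tab, and a newline.

```php
<?php
// Hypothetical input mixing several spaces, a tab, and a newline.
$string = "The   quick\tbrown \n fox ";

// \s+ matches one or more whitespace characters of any kind;
// PREG_SPLIT_NO_EMPTY drops the empty piece left by the trailing space.
$words = preg_split('/\s+/', $string, -1, PREG_SPLIT_NO_EMPTY);
print_r($words);
```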

Intelligent split of string into an array

This code will split the string into an array that contains test and string:
$str = 'test string';
$arr = preg_split('/\s+/', $str);
But I also want to detect quotes and ignore the text between them when splitting, for example:
$str = 'test "Two words"';
This should also return an array with two elements, test and Two words.
And another form, if possible:
$str = 'test=Two Words';
So if the equal sign is present before any spaces, the string should be split by =, otherwise the other rules from above should apply.
So how can I do this with preg_split?
Try str_getcsv:
print_r(str_getcsv('test string'," "));
print_r(str_getcsv('test "Two words"'," "));
print_r(str_getcsv('test=Two Words',"="));
Outputs
Array
(
[0] => test
[1] => string
)
Array
(
[0] => test
[1] => Two words
)
Array
(
[0] => test
[1] => Two Words
)
You can use something like preg_match to check whether an equal sign appears before the first space, and then decide which delimiter to use.
Works only in PHP>=5.3 though.
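That delimiter check could be sketched roughly as follows. The helper name splitSmart is hypothetical, and the code assumes at most one rule applies per string, as in the examples from the question.

```php
<?php
// Hypothetical helper: use "=" when it appears before any whitespace,
// otherwise fall back to a space.
function splitSmart(string $str): array
{
    // ^[^\s=]*= succeeds only if an "=" is reached before any whitespace.
    $delimiter = preg_match('/^[^\s=]*=/', $str) ? '=' : ' ';

    // Enclosure and escape are spelled out rather than left to defaults.
    return str_getcsv($str, $delimiter, '"', '\\');
}

print_r(splitSmart('test string'));
print_r(splitSmart('test "Two words"'));
print_r(splitSmart('test=Two Words'));
```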
I'm sure this could be done with regex, but how about just splitting the string by quotation marks, then by spaces, using explode?
Given the string 'I am a string "with an embedded" string', you could first split by quotation marks, giving you ['I am a string', 'with an embedded', 'string'], then you go over every other element in the array and split by spaces, resulting in ['I', 'am', 'a', 'string', 'with an embedded', 'string'].
The exact code to do this you can probably write yourself. If not, let me know and I'll help you.
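For what it's worth, one way that idea might be written is sketched here. It assumes the double quotes are balanced and never escaped.

```php
<?php
$str = 'I am a string "with an embedded" string';

$result = [];
// Splitting on the double quote puts quoted text at the odd indexes.
foreach (explode('"', $str) as $i => $part) {
    if ($i % 2 === 1) {
        // Inside quotes: keep the phrase whole.
        $result[] = $part;
    } else {
        // Outside quotes: split on whitespace, dropping empty pieces.
        $words = preg_split('/\s+/', $part, -1, PREG_SPLIT_NO_EMPTY);
        $result = array_merge($result, $words);
    }
}
print_r($result);
```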
In your last example, just split by the equals symbol:
$str = 'test=Two Words';
print_r(explode('=', $str));

string to array, split by single and double quotes, ignoring escaped quotes

I have another PHP preg_split question which is very similar to my last question, although I fear the solution will be quite a bit more complicated. As before, I'm trying to use PHP to split a string into array components using either " or ' as the delimiter. In addition, I would like to ignore escaped single quotes within the string (escaped double quotes within a string will not happen, so there is no need to worry about that). All of the examples from my last question remain valid, but the following two desired results should also be obtained:
$pattern = "?????";
$str = "the 'cat\'s dad sat on' the mat then \"fell 'sideways' off\" the mat";
$res = preg_split($pattern, $str, -1, PREG_SPLIT_DELIM_CAPTURE);
print_r($res);
/*output:
Array
(
[0] => the
[1] => 'cat\'s dad sat on'
[2] => the mat then
[3] => "fell 'sideways' off"
[4] => the mat
)*/
$str = "the \"cat\'s dad\" sat on 'the \"cat\'s\" own' mat";
$res = preg_split($pattern, $str, -1, PREG_SPLIT_DELIM_CAPTURE);
print_r($res);
/*output:
Array
(
[0] => the
[1] => "cat\'s dad"
[2] => sat on
[3] => 'the "cat\'s" own'
[4] => mat
)*/
@mcrumley's answer to my previous question worked well if there were no escaped quotes:
$pattern = "/('[^']*'|\"[^\"]*\")/U";
However, as soon as an escaped single quote is given, the regex uses it as the end of the match, which is not what I want.
I have tried something like this:
$pattern = "/('(?<=(?!\\').*)'|\"(?<=(?!\\').*)\")/";
but it's not working. Unfortunately, my knowledge of lookarounds is not good enough for this.
After some reading and fiddling, this seems closer:
$pattern = "/('(?:(?!\\').*)')|(\"(?:(?!\\'|').*)\")/";
but the level of greediness is wrong and it does not produce the above outputs.
Try this:
$pattern = "/(?<!\\\\)('(?:\\\\'|[^'])*'|\"(?:\\\\\"|[^\"])*\")/";
//           ^^^^^^^^^  ^^^^^^^^^    ^     ^^^^^^^^^^     ^
Demo at http://rubular.com/r/Eps2mx8KCw.
You can also collapse that into a unified expression using back-references:
$pattern = "/(?<!\\\\)((['\"])(?:\\\\\\2|(?!\\2).)*\\2)/";
Demo at http://rubular.com/r/NLZKyr9xLk.
These don't work though if you also want escaped backslashes to be recognized in your text, but I doubt that's a scenario you need to account for.
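As a quick check, here is a sketch that runs the first pattern against the first string from the question. Note that the pieces keep their surrounding spaces, so you may want to trim them afterwards.

```php
<?php
// First pattern from the answer above.
$pattern = "/(?<!\\\\)('(?:\\\\'|[^'])*'|\"(?:\\\\\"|[^\"])*\")/";

// First test string from the question; \' stays a literal backslash-quote
// inside a double-quoted PHP string.
$str = "the 'cat\'s dad sat on' the mat then \"fell 'sideways' off\" the mat";

// PREG_SPLIT_DELIM_CAPTURE keeps the quoted sections in the result.
$res = preg_split($pattern, $str, -1, PREG_SPLIT_DELIM_CAPTURE);
print_r($res);
```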
