How to split by letter-followed-by-period? - PHP

I want to split text by the letter-followed-by-period rule. So I do this:
$text = 'One two. Three test. And yet another one';
$splitted_text = preg_split("/\w\./", $text);
print_r($splitted_text);
Then I get this:
Array ( [0] => One tw [1] => Three tes [2] => And yet another one )
But I do need it to be like this:
Array ( [0] => One two [1] => Three test [2] => And yet another one )
How to settle the matter?

It's splitting on both the letter and the period, so the letter is consumed as part of the delimiter. If you only want to check that a letter precedes the period, without consuming it, you need to use a positive lookbehind assertion.
$text = 'One two. Three test. And yet another one';
$splitted_text = preg_split("/(?<=\w)\./", $text);
print_r($splitted_text);
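To make the difference visible, here is a minimal sketch that runs both patterns on the question's sample text. The variable names are only illustrative, and the trim step is an addition (not part of the answer above) that removes the space left after each period.

```php
<?php
$text = 'One two. Three test. And yet another one';

// The letter is part of the match, so it disappears together with the period.
$consuming = preg_split('/\w\./', $text);

// The lookbehind only checks for the letter; just the period is removed.
$lookbehind = preg_split('/(?<=\w)\./', $text);

// Assumption: surrounding whitespace is not wanted in the pieces.
$trimmed = array_map('trim', $lookbehind);

print_r($consuming);
print_r($trimmed);
```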

Use explode:
$text = 'One two. Three test. And yet another one';
$splitted_text = explode(".", $text);
print_r($splitted_text);
Update:
$splitted_text = explode(". ", $text);
With ". " as the delimiter, explode also consumes the space after the period.
The delimiter can be any string, including a whole phrase, not only a single character.

Using a regex is overkill here; you can easily use explode. Since an explode-based answer has already been given, I'll give a regex-based one:
$splitted_text = preg_split("/\.\s*/", $text);
Regex used: \.\s*
\. - A dot is a metacharacter. To match a literal dot we escape it.
\s* - zero or more whitespace characters.
If you use only the regex \.
you'll have leading spaces in some of the resulting pieces.
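A small sketch of the leading-space effect described above. The sample text here ends with a period, which is an assumption added to show a second detail: a trailing delimiter produces an empty last element unless PREG_SPLIT_NO_EMPTY is passed.

```php
<?php
// Assumed sample: same sentences as the question, plus a final period.
$text = 'One two. Three test. And yet another one.';

// Splitting on the bare dot keeps the space that followed each dot,
// and the final dot yields an empty trailing element.
$plain = preg_split('/\./', $text);

// Consuming the whitespace and dropping empty pieces gives clean sentences.
$clean = preg_split('/\.\s*/', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($plain);
print_r($clean);
```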

Related

Matching whole words between commas, or a comma at the beginning, or a comma at the end with Regex

I have a string like this:
page-9000,page-template,page-type,page-category-128,image-195,listing-latest,rss-latest,even-more-info,even-more-tags
I made this regex that I expect to get the whole tags with:
(?<=\,)(rss-latest|listing-latest-no-category|category-128|page-9000)(?=\,)
I want it to match all the occurrences.
In this case:
page-9000 and rss-latest.
This regex checks whole words between commas just fine, but it ignores the first and the last tags because they're not between commas (obviously).
I've also tried that it checks if it's between commas OR one comma at the beginning OR one comma to the end, however it would give me false positives, as it would match:
category-128
while the string contains:
page-category-128
Any help?
Try using the following pattern:
(?<=,|^)(rss-latest|listing-latest-no-category|category-128|page-9000)(?=,|$)
The only change I have made is to add boundary markers ^ and $ to the lookarounds to also match on the start and end of the input.
Script:
$input = "page-9000,page-template,page-type,page-category-128,image-195,listing-latest,rss-latest,even-more-info,even-more-tags";
preg_match_all("/(?<=,|^)(rss-latest|listing-latest-no-category|category-128|page-9000)(?=,|$)/", $input, $matches);
print_r($matches[1]);
This prints:
Array
(
[0] => page-9000
[1] => rss-latest
)
Here is a non-regex way using explode and array_intersect:
$arr1 = explode(',', 'page-9000,page-template,page-type,page-category-128,image-195,listing-latest,rss-latest,even-more-info,even-more-tags');
$arr2 = explode('|', 'rss-latest|listing-latest-no-category|category-128|page-9000');
print_r(array_intersect($arr1, $arr2));
Output:
Array
(
[0] => page-9000
[6] => rss-latest
)
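Note that array_intersect preserves the keys of the first array, which is why the output above shows keys 0 and 6. If a zero-based list is wanted, a sketch like the following could reindex it with array_values (the variable names are illustrative):

```php
<?php
$tags = explode(',', 'page-9000,page-template,page-type,page-category-128,image-195,listing-latest,rss-latest,even-more-info,even-more-tags');
$wanted = ['rss-latest', 'listing-latest-no-category', 'category-128', 'page-9000'];

// array_intersect compares whole elements, so "category-128" cannot
// match inside "page-category-128".
$found = array_intersect($tags, $wanted);

// Reindex so the keys run 0, 1, ... instead of 0, 6.
$reindexed = array_values($found);

print_r($reindexed);
```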
The (?<=\,) and (?=,) require the presence of , on both sides of the matching pattern. You want to match also at the start/end of string, and this is where you need to either explicitly tell to match either , or start/end of string or use double-negating logic with negated character classes inside negative lookarounds.
You may use
(?<![^,])(?:rss-latest|listing-latest-no-category|category-128|page-9000)(?![^,])
See the regex demo
Here, (?<![^,]) matches the start of string position or a , and (?![^,]) matches the end of string position or ,.
Now, you do not even need a capturing group, you may get rid of its overhead using a non-capturing group, (?:...). preg_match_all won't have to allocate memory for the submatches and the resulting array will be much cleaner.
PHP demo:
$re = '/(?<![^,])(?:rss-latest|listing-latest-no-category|category-128|page-9000)(?![^,])/m';
$str = 'page-9000,page-template,page-type,page-category-128,image-195,listing-latest,rss-latest,even-more-info,even-more-tags';
if (preg_match_all($re, $str, $matches)) {
print_r($matches[0]);
}
// => Array ( [0] => page-9000 [1] => rss-latest )
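If the list of tags is not fixed, the same double-negation pattern could be built at run time. The following is only a sketch: find_tags is a hypothetical helper name, and it assumes the tags may contain regex metacharacters, so each one is passed through preg_quote first.

```php
<?php
// Hypothetical helper: returns the wanted tags that occur as whole
// comma-separated items in $csv.
function find_tags(string $csv, array $wanted): array
{
    // Escape each tag so characters such as "." or "+" are matched literally.
    $quoted = array_map(function ($tag) {
        return preg_quote($tag, '/');
    }, $wanted);

    // (?<![^,]) : preceded by a comma or the start of the string
    // (?![^,])  : followed by a comma or the end of the string
    $pattern = '/(?<![^,])(?:' . implode('|', $quoted) . ')(?![^,])/';

    preg_match_all($pattern, $csv, $matches);
    return $matches[0];
}

$result = find_tags(
    'page-9000,page-template,page-category-128,rss-latest',
    ['rss-latest', 'listing-latest-no-category', 'category-128', 'page-9000']
);
print_r($result);
```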

How to split 123abcd#abcd.com like 123 and abcd#abcd.com in PHP

I have a bunch of lines like that in a text file.
I just want to split the leading number and the email apart, as shown above.
Could someone write a function or something for that, please? I'd be very thankful.
Any other suggestions are also welcome!
You could try the below code.
<?php
$yourstring = "123abcd#abcd.com";
$regex = '~^\d+\K~';
$splits = preg_split($regex, $yourstring);
print_r($splits);
?>
Output:
Array
(
[0] => 123
[1] => abcd#abcd.com
)
Explanation:
^ Asserts that we are at the start.
\d+ Matches one or more digits.
\K discards the previously matched characters from the reported match. So after ^\d+\K, the match is an empty string sitting on the boundary between the leading number and the email id. Splitting at that boundary gives you the desired result.
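For comparison, the same split can be sketched with preg_match and two capture groups instead of \K. This is an alternative technique, not the method above; split_leading_digits is a hypothetical name, and the sketch assumes each line has already had its trailing newline removed.

```php
<?php
// Hypothetical helper: returns [number, rest], or null when the line
// does not start with digits followed by a non-digit.
function split_leading_digits(string $line): ?array
{
    // (\d+)   : the leading run of digits
    // (\D.*)  : the rest, which must begin with a non-digit
    if (preg_match('/^(\d+)(\D.*)$/', $line, $m)) {
        return [$m[1], $m[2]];
    }
    return null;
}

print_r(split_leading_digits('123abcd#abcd.com'));
```

Returning null for lines that do not fit the shape makes it easy to skip or report them when looping over a file.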

Why does this regular expression only capture one word?

I'm trying to learn Regular Expressions. I know the basics, and I'm not terrible at regex, I'm just no pro - hence I've got a question for you guys. If you know regex, I bet it'll be simple.
What I've got currently is this:
/(\w+)\s-{1}\s(\w+)\.{1}(\w{3,4})/
What I'm trying to do is create a little script for myself that tidies up my music collection by formatting all of the filenames. I know there's other stuff out there already, but this is a learning experience for me. I already screwed up all the titles once by replacing things like "Hell Aint A Bad Place To Be" with "Hell Aint a Bad Place To Be". In my wisdom I somehow ended up with "Hell Aint a ad Place to be" (I was looking for an A followed by a space and an uppercase character). Obviously that was a nightmare to fix and it had to be done manually. Needless to say, I'm testing samples first now.
Anyway, the above regex is sort of a stage 1 of many. Eventually I want to build it up, but for now I just need to get the simple bits working.
In the end I'd like to turn:
"arctic Monkeys- a fake tales of a san francisco"
into
"Arctic Monkeys - A Fake Tales of a San Francisco"
I know I'll need lookbehind assertions to detect when a word comes right after a '-', because if that first word is 'a', 'of' etc., which I'd normally lowercase, I need to uppercase it (the above is a bad example for this use case, I know).
Any way of fixing the existing regular expression would be great, and tips on where to look on my cheatsheet to finish the rest off would be great too (I'm not looking for a fully-fledged answer, since I need to learn to do it myself; I just can't figure out why \w+ is only getting one word).
I believe there is a much simpler way of approaching this problem: split the string into words, based on a much simpler regex, and then apply whatever processing you want to those words. This will allow you to perform more complicated transformations on the text in a much cleaner way. Here's an example:
<?php
$song = "arctic Monkeys- a fake tales of a san francisco";
// Split on spaces or - (the - is still present
// because it's only a lookahead match)
$words = preg_split("/([\s]+|(?=-))/", $song);
/*
Output for print_r:
Array
(
[0] => arctic
[1] => Monkeys
[2] => -
[3] => a
[4] => fake
[5] => tales
[6] => of
[7] => a
[8] => san
[9] => francisco
)
*/
print_r($words);
$new_words = array();
foreach ($words as $k => $word) {
$new_words[] = processWord($word, $k, $words);
}
// This will output:
// Arctic Monkeys - A Fake Tales of a San Francisco
echo implode(' ', $new_words);
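// You can add as many processing rules as you want in here, in a very clean way
function processWord($word, $idx, $words) {
// Always capitalize the word right after a dash; the $idx > 0 guard
// keeps the first word from reading the nonexistent $words[-1]
if ($idx > 0 && $words[$idx - 1] == '-') return ucfirst($word);
// Otherwise capitalize only words longer than two characters
return strlen($word) > 2 ? ucfirst($word) : $word;
}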
Here's an example of this code running: http://codepad.org/t6pc8WpR
I'm a little confused about what you're doing, but maybe this will help. Remember that + is 1 or more characters, * is 0 or more. So you probably want to do something like ([\s]*) to match spaces. You don't need to specify the {1} next to a single character.
So maybe something like this:
([\w\s]+)([\s]*)-([\s]*)([\w\s]+)\.([\w]{3,4})
I haven't tested this code, but I think you get the idea.
\w does not match the space character. A working regex might be:
/^(.+?)\s*-\s*(.+)$/
Explanation:
^ - must start at the beginning of the string
(.+?) - match any characters, ungreedily (as few as possible)
\s* - match any amount of whitespace that might exist (including none)
- - match a literal dash
\s* - any whitespace again
(.+) - remaining characters
$ - end of string
The case conversion would then happen in a separate replacing regex.
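A quick sketch of what the two groups capture for the sample from the question (the variable names $artist and $title are only illustrative):

```php
<?php
$song = 'arctic Monkeys- a fake tales of a san francisco';

$artist = null;
$title = null;

// Group 1 lazily takes everything up to the first dash (minus any
// whitespace before it); group 2 takes everything after the dash.
if (preg_match('/^(.+?)\s*-\s*(.+)$/', $song, $m)) {
    $artist = $m[1];
    $title = $m[2];
}

var_dump($artist, $title);
```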
For the first part, \w doesn't match words, it matches word characters. It's equivalent to [A-Za-z0-9_].
Instead, try ([A-Za-z0-9_ ]+) as your first bit (it has an extra space inside the square brackets, and the \s after it is removed).
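The difference is easy to check on a short sample. This sketch uses an assumed input string and compares \w+ with a class that also allows a space:

```php
<?php
$name = 'arctic Monkeys';

// \w+ stops at the first character that is not a word character,
// which here is the space.
preg_match('/\w+/', $name, $single);

// Adding a space to the class lets the match run across both words.
preg_match('/[\w ]+/', $name, $multi);

var_dump($single[0], $multi[0]);
```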
Here's what I have:
<?php
/**
* Formats a string into a title:
* * Pads all dashes with spaces.
* * Uppercase all words with 3 letters or more.
* * Uppercase first word and first words after dashes.
*
* @param string $str
*
* @return string
*/
function format_title($str) {
//Remove all whitespace before and after dashes.
//(The spaces will return in the final product.)
$str = preg_replace("/\s*-\s*/", "-", $str);
//Explode by dash.
$string_split_by_dash = explode("-", $str);
//For each sentence (separated by dashes)
foreach ($string_split_by_dash as &$sentence) {
//Uppercase all words.
$sentence = ucwords($sentence);
//Explode into words (by space)
$words = explode(" ", $sentence);
//For each word
foreach ($words as &$word) {
//If its length is smaller than 3
if (strlen($word) < 3) {
//Lowercase it.
$word = strtolower($word);
}
}
//Implode back into a sentence.
$sentence = implode(" ", $words);
//Uppercase the first word, regardless of length.
$sentence = ucfirst($sentence);
}
//Implode all sentences back together with a space-padded dash.
$str = implode(" - ", $string_split_by_dash);
return $str;
}
$str = "arctic Monkeys- a fake tales of a san francisco";
var_dump(format_title($str));
I'd argue it's more readable (and more documentable) than a regex. It's probably more efficient too, though I didn't check.

Explode a paragraph into sentences in PHP

I have been using
explode(".",$mystring)
to split a paragraph into sentences. However, this doesn't cover sentences that end with different punctuation such as ! ? : ;
Is there a way of using an array as a delimiter instead of a single character? Alternatively, is there another neat way of splitting on various punctuation marks?
I tried
explode(("." || "?" || "!"),$mystring)
hopefully but it didn't work...
You can use preg_split() combined with a PCRE lookbehind assertion to split the string after each occurrence of ., ;, :, ? or ! while keeping the actual punctuation intact:
Code:
$subject = 'abc sdfs. def ghi; this is an.email#addre.ss! asdasdasd? abc xyz';
// split on whitespace between sentences preceded by a punctuation mark
$result = preg_split('/(?<=[.?!;:])\s+/', $subject, -1, PREG_SPLIT_NO_EMPTY);
print_r($result);
Result:
Array
(
[0] => abc sdfs.
[1] => def ghi;
[2] => this is an.email#addre.ss!
[3] => asdasdasd?
[4] => abc xyz
)
You can also add a blacklist of abbreviations (Mr., Mrs., Dr., ...) that should not be treated as sentence endings by inserting a negative lookbehind assertion:
$subject = 'abc sdfs. Dr. Foo said he is not a sentence; asdasdasd? abc xyz';
// split on whitespace preceded by a punctuation mark, unless that mark ends
// a listed abbreviation (the dots are escaped so they match literally)
$result = preg_split('/(?<!Mr\.|Mrs\.|Dr\.)(?<=[.?!;:])\s+/', $subject, -1, PREG_SPLIT_NO_EMPTY);
print_r($result);
Result:
Array
(
[0] => abc sdfs.
[1] => Dr. Foo said he is not a sentence;
[2] => asdasdasd?
[3] => abc xyz
)
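If the abbreviation list needs to grow, the lookbehind could be assembled from an array. This is a sketch under assumptions: split_sentences is a hypothetical helper, each abbreviation is escaped with preg_quote, and it relies on PCRE accepting top-level lookbehind alternatives of different fixed lengths.

```php
<?php
// Hypothetical helper: splits $text into sentences, keeping the
// punctuation, and never splits right after a listed abbreviation.
function split_sentences(string $text, array $abbreviations): array
{
    $quoted = array_map(function ($abbr) {
        return preg_quote($abbr, '/');
    }, $abbreviations);

    // Only add the negative lookbehind when there is something to exclude.
    $guard = $quoted ? '(?<!' . implode('|', $quoted) . ')' : '';

    $pattern = '/' . $guard . '(?<=[.?!;:])\s+/';
    return preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);
}

$subject = 'abc sdfs. Dr. Foo said he is not a sentence; asdasdasd? abc xyz';
$result = split_sentences($subject, ['Mr.', 'Mrs.', 'Dr.']);
print_r($result);
```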
You can do:
preg_split('/\.|\?|!/',$mystring);
or (simpler):
preg_split('/[.?!]/',$mystring);
Assuming that you actually want to keep the punctuation marks in the end result, have you tried:
$mystring = str_replace("?","?---",str_replace(".",".---",str_replace("!","!---",$mystring)));
$tmp = explode("---",$mystring);
This would leave your punctuation marks intact.
preg_split('/\s+|[.?!]/',$string);
A possible problem might be if there is an email address as it could split it onto a new line half way through.
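Use preg_split and give it a regex like [.?!] to split on. (Inside a character class there is no need for | or for escaping the dot; a | there would be treated as a literal pipe character.)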
You can't have multiple delimiters for explode. That's what preg_split() is for. But even then, it splits at the delimiter, so the sentences are returned without their punctuation marks.
You can take preg_split a step further and flag it with PREG_SPLIT_DELIM_CAPTURE to return the delimiters in their own elements, then loop over the returned array to join each sentence with the punctuation mark that follows it, or just use preg_match_all():
preg_match_all('~.*?[?.!]~s', $string, $sentences);
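The PREG_SPLIT_DELIM_CAPTURE route described above could look roughly like this. It is a sketch with an assumed sample string; the pattern wraps the punctuation in a capture group so it is returned as its own element, and the loop glues each sentence back to the mark that followed it.

```php
<?php
// Assumed sample input.
$string = 'Is this it? Yes! It is.';

// With the capture group, the result alternates: text, punctuation, text, ...
// The whitespace after the punctuation is outside the group, so it is dropped.
$parts = preg_split('/([.?!]+)\s*/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);

$sentences = [];
$count = count($parts);
for ($i = 0; $i < $count; $i += 2) {
    // Even indexes hold text, the following odd index (if any) its punctuation.
    $sentence = $parts[$i] . ($parts[$i + 1] ?? '');
    if ($sentence !== '') {
        $sentences[] = $sentence;
    }
}

print_r($sentences);
```

PREG_SPLIT_NO_EMPTY is deliberately left out here, because removing empty text elements would shift the text/punctuation pairing that the loop depends on.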
$mylist = preg_split("/[.?!:;]/", $mystring);
You can try preg_split
$sentences = preg_split("/[.?!:;]+/", $mystring);
Please note that this will remove the punctuation. If you would also like to strip the whitespace that follows each delimiter:
$sentences = preg_split("/[.?!:;]+\s*/", $mystring);

Splitting a string on multiple separators in PHP

I can split a string with a comma using preg_split, like
$words = preg_split('/[,]/', $string);
How can I use a dot, a space and a semicolon to split string with any of these?
PS. I couldn't find any relevant example on the PHP preg_split page, that's why I am asking.
Try this:
<?php
$string = "foo bar baz; boo, bat";
$words = preg_split('/[,.\s;]+/', $string);
var_dump($words);
// -> ["foo", "bar", "baz", "boo", "bat"]
The Pattern explained
[] is a character class; it lists several characters and matches any one of them.
. matches a literal . character; it does not need to be escaped inside a character class. Outside a character class it must be escaped, because there . means "match any character".
\s matches whitespace.
; splits on the semicolon; it does not need to be escaped, because it has no special meaning.
The + at the end treats a run of consecutive delimiters (such as a comma followed by a space) as a single separator, so no empty strings show up in the result.
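To see what the + changes, this sketch runs the same character class with and without it on the sample string from the answer:

```php
<?php
$string = "foo bar baz; boo, bat";

// Without +, "; " and ", " are each two separate delimiters, so an
// empty string appears between them.
$withoutPlus = preg_split('/[,.\s;]/', $string);

// With +, each run of delimiters counts as one separator.
$withPlus = preg_split('/[,.\s;]+/', $string);

print_r($withoutPlus);
print_r($withPlus);
```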
The examples are there, perhaps not literally, but here is a split with multiple delimiter options:
$words = preg_split('/[ ;.,]/', $string);
something like this?
<?php
$string = "blsdk.bldf,las;kbdl aksm,alskbdklasd";
$words = preg_split('/[,\ \.;]/', $string);
print_r( $words );
result:
Array
(
[0] => blsdk
[1] => bldf
[2] => las
[3] => kbdl
[4] => aksm
[5] => alskbdklasd
)
$words = preg_split('/[\,\.\ ]/', $string);
just add these chars to your expression
$words = preg_split('/[;,. ]/', $string);
EDIT: thanks to Igoris Azanovas, escaping dot in character class is not needed ;)
$words = preg_split('/[,\.\s;]/', $string);
