I am trying to break the text by sentences. There are no dots in this text. But it contains capital letters. I use:
<?php preg_match_all('/[A-Z][^A-Z]*?/Usu',$text,$sentences);
But it splits the text at every capital letter, so I get "sentences" such as "S", "M", "S". That is wrong: I do not need to break words such as SMS. Help please.
Some clarification:
I try to break the string before each string of one or more capital letters.
But my real task is more complex. I am trying to format text for readability.
Example: a piece of a vacancy without HTML tags and line breaks: "Desirable: AWS experience Experience with Docker/Kubernetes". I try to get: "Desirable:", "AWS experience" and "Experience with Docker/Kubernetes" (I think I will be able to stick together very short strings after splitting by space and capital letter. Maybe it is a very bad way, of course).
I assume you wish to break a string into pieces, where the break points are zero-width positions that immediately precede a capital letter and do not follow a capital letter. If so, you could use the following regular expression.
(?=(?<![A-Z]|^)[A-Z])
Regex demo
This can be executed as follows:
<?php
$result = preg_split("/(?=(?<![A-Z]|^)[A-Z])/", "now is THE time to BE brave");
print_r($result);
PHP demo
As shown at the link, this returns
Array
(
[0] => now is
[1] => THE time to
[2] => BE brave
)
If the first word of the string were capitalized ("Now"), the first element of the array would be "Now is" (i.e., not an empty string).
PHP's regex engine performs the following operations.
(?= # begin a positive lookahead
(?<! # begin a negative lookbehind
[A-Z] # match a capital letter
| # or
^ # match the beginning of the string
) # end the negative lookbehind
[A-Z] # match a capital letter
) # end positive lookahead
This attempts to match a capital letter in a positive lookahead ([A-Z]), but that match fails if the negative lookbehind matches a capital letter preceding it or the capital letter is at the beginning of the string.
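To see how this behaves on the vacancy text from the question, here is a small sketch. The variable names are only illustrative, and the trimming step is an addition that is not part of the answer above.

```php
<?php
// Split before each run of capitals (pattern taken from the answer above).
$pattern = '/(?=(?<![A-Z]|^)[A-Z])/';
$vacancy = 'Desirable: AWS experience Experience with Docker/Kubernetes';

// PREG_SPLIT_NO_EMPTY drops empty pieces; trim removes the trailing spaces.
$pieces = array_map('trim', preg_split($pattern, $vacancy, -1, PREG_SPLIT_NO_EMPTY));
print_r($pieces);
// "AWS" stays whole, but the text is also cut before "Docker" and before "Kubernetes":
// ["Desirable:", "AWS experience", "Experience with", "Docker/", "Kubernetes"]
```

So abbreviations such as AWS survive, but any capitalized word in mid-sentence still starts a new piece, which would have to be merged back afterwards.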
You really shouldn't be using regex to parse something as complex as natural language. I'd recommend something like IntlBreakIterator instead.
$text = "Sentence 1. Sentence 2! Sentence 3? Sentence; number 4...Sentence, 5.";
$it = IntlBreakIterator::createSentenceInstance("en_US");
$it->setText($text);
$parts = $it->getPartsIterator();
foreach ($parts as $point => $sentence) {
echo "$point => $sentence\n\n\n";
}
Output
0 => Sentence 1.
1 => Sentence 2!
2 => Sentence 3?
3 => Sentence; number 4...
4 => Sentence, 5.
The rules for parsing words/sentences can be complex and daunting to implement in a regular expression. This solution is saner for a syntactically correct corpus. However, if the text has no punctuation, as you say, then there is no sane way to distinguish one sentence from another. Simply attempting to do it by capital letters can yield a lot of false positives, because words such as proper nouns and some abbreviations can be capitalized mid-sentence.
Related
I'm trying to make a regular expression in PHP. I can get it working in other languages but not working with PHP.
I want to validate item names in an array
They can contain upper and lower case letters, numbers, underscores, and hyphens.
They can contain => as an exact string, not separate characters.
They cannot start with =>.
They cannot finish with =>.
My current code:
$regex = '/^[a-zA-Z0-9-_]+$/'; // contains A-Z a-z 0-9 - _
//$regex = '([^=>]$)'; // doesn't end with =>
//$regex = '~.=>~'; // doesn't start with =>
if (preg_match($regex, 'Field_name_true2')) {
echo 'true';
} else {
echo 'false';
};
// Field=>Value-True
// =>False_name
//Bad_name_2=>
Use negative lookarounds. Negative lookahead (?!=>) at the beginning to prohibit beginning with =>, and negative lookbehind (?<!=>) at the end to prohibit ending with =>.
^(?!=>)(?:[a-zA-Z0-9-_]+(=>)?)+(?<!=>)$
DEMO
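A quick way to try the lookaround pattern against the sample names from the question; this is only a sketch, and it writes the character class as [\w-], which is assumed here to be equivalent to the ranges used above.

```php
<?php
// Lookaround pattern from the answer above, with [\w-] standing in for [a-zA-Z0-9-_].
$pattern = '/^(?!=>)(?:[\w-]+(?:=>)?)+(?<!=>)$/';
$samples = ['Field_name_true2', 'Field=>Value-True', '=>False_name', 'Bad_name_2=>'];

$results = [];
foreach ($samples as $name) {
    // preg_match returns 1 on a match and 0 otherwise.
    $results[$name] = preg_match($pattern, $name) === 1;
    echo $name, ' : ', $results[$name] ? 'valid' : 'invalid', "\n";
}
```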
There is absolutely no requirement for lookarounds here.
Anchors and an optional group will suffice.
Demo
/^[\w-]+(?:=>[\w-]+)?$/
        ^^^^^^^^^^^^^-- this whole non-capturing group is optional
This allows full strings consisting exclusively of [0-9a-zA-Z_-] or split ONCE by =>.
The non-capturing group may occur zero or one time.
In other words, => may occur after one or more [\w-] characters, but if it does occur, it MUST be immediately followed by one or more [\w-] characters until the end of the string.
To cover some of the ambiguity in the question requirements:
If foo=>bar=>bam is valid, then use /^[\w-]+(?:=>[\w-]+)*$/ which replaces ? (zero or one) with * (zero or more).
If foo=>=>bar is valid then use /^[\w-]+(?:(?:=>)+[\w-]+)*$/ which replaces => (must occur once) with (?:=>)+ (substring must occur one or more times).
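The three variants can be compared side by side. The following is a rough sketch; the variable names and the helper matches() are made up for illustration.

```php
<?php
$once  = '/^[\w-]+(?:=>[\w-]+)?$/';         // at most one =>
$chain = '/^[\w-]+(?:=>[\w-]+)*$/';         // any number of => separators
$multi = '/^[\w-]+(?:(?:=>)+[\w-]+)*$/';    // also allows consecutive =>=>

// Hypothetical helper: true when $name matches $pattern.
function matches(string $pattern, string $name): bool
{
    return preg_match($pattern, $name) === 1;
}

var_dump(matches($once, 'Field=>Value-True'));  // bool(true)
var_dump(matches($once, '=>False_name'));       // bool(false)
var_dump(matches($once, 'Bad_name_2=>'));       // bool(false)
var_dump(matches($once, 'foo=>bar=>bam'));      // bool(false)
var_dump(matches($chain, 'foo=>bar=>bam'));     // bool(true)
var_dump(matches($multi, 'foo=>=>bar'));        // bool(true)
```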
Well, your character ranges are equivalent to \w (plus the hyphen), so you could use
^(?!=>)(?:(?!=>$)(?:[-\w]|=>))+$
This construct uses a "tempered greedy token", see a demo on regex101.com.
More shiny, complicated and surely over the top, you could use subroutines as in:
(?(DEFINE)
(?<chars>[-\w]) # equals to A-Z, a-z, 0-9, _, -
(?<af>=>) # "arrow function"
(?<item>
(?!(?&af)) # no af at the beginning
(?:(?&af)?(?&chars)++)+
(?!(?&af)) # no af at the end
)
)
^(?&item)$
See another demo on regex101.com
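In PHP the subroutine version needs the x (free-spacing) modifier so that the line breaks and # comments are ignored. A minimal sketch, with a made-up helper name isValidItem():

```php
<?php
$pattern = '~
(?(DEFINE)
    (?<chars>[-\w])             # A-Z, a-z, 0-9, _ and -
    (?<af>=>)                   # the => separator
    (?<item>
        (?!(?&af))              # no => at the beginning
        (?:(?&af)?(?&chars)++)+
        (?!(?&af))              # no => at the end
    )
)
^(?&item)$
~x';

// Hypothetical helper: true when $name is a valid item name.
function isValidItem(string $pattern, string $name): bool
{
    return preg_match($pattern, $name) === 1;
}

foreach (['Field=>Value-True', '=>False_name', 'Bad_name_2=>'] as $name) {
    echo $name, ' : ', isValidItem($pattern, $name) ? 'valid' : 'invalid', "\n";
}
```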
For the example data, you can use
^[a-zA-Z0-9_-]+=>[a-zA-Z0-9_-]+$
The pattern matches:
^ Start of string
[a-zA-Z0-9_-]+ Match 1+ times any of the listed ranges or characters (can not start with =>)
=> Match literally
[a-zA-Z0-9_-]+ Match again 1+ times any of the listed ranges or characters
$ End of string
Regex demo
If you want to allow for optional spaces:
^\h*[a-zA-Z0-9_-]+\h*=>\h*[a-zA-Z0-9_-]+\h*$
Regex demo
Note that [a-zA-Z0-9_-] can be written as [\w-]
I'm still trying to get to grips with regex patterns and just after a little double-checking if someone wouldn't mind obliging!
I have a string which should either contain:
A 10 digit (numbers and letters) licence key, for example: 1234567890 OR
A 25 digit (numbers and letters) licence key, for example: ABCD1EFGH2IJKL3MNOP4QRST5 OR
A 29 digit licence number (25 numbers and letters, separated into 5 groups by hyphens), for example: ABCD1-EFGH2-IJKL3-MNOP4-QRST5
I can match the first two fine, using ctype_alnum and strlen functions. However, for the last one I think I'll need to use regex and preg_match.
I had a go over at regex101.com and came up with the following:
preg_match('^([A-Za-z0-9]{5})+-+([A-Za-z0-9]{5})+-+([A-Za-z0-9]{5})+-([A-Za-z0-9]{5})+-+([A-Za-z0-9]{5})', $str);
Which seems to match what I'm looking for.
I want the string to only contain an exact match for a string beginning with the licence number, and contain nothing other than mixed upper/lower case letters and numbers in any order and hyphens between each group of 5 characters (so a total of 29 characters - I don't want any further matches). No white space, no other characters and nothing else before or after the 29 digit key.
Will the above work, without allowing any other combinations? Will it stop checking at 29 characters? I'm not sure if there is a simpler way to express this in regex?
Thanks for your time!
The main point is that you need to use both ^ (start of string) and $ (end of string) anchors. Also, when you use + after (...), you allow 1 or more repetitions of the whole subpattern inside the (...). So, you need to remove the +s and add the $ anchor. Also, you need regex delimiters for your regex to work in PHP preg_match. I prefer ~ so as not to escape /. Maybe it is not the case here, but this is a habit.
So, the regex can look like
'~^[A-Za-z0-9]{5}(?:-[A-Za-z0-9]{5}){4}$~'
See the regex demo
The (?:-[A-Za-z0-9]{5}){4} matches 4 occurrences of -[A-Za-z0-9]{5} subpattern. The (?:...) is a non-capturing group whose matched text does not get stored in any buffer (unlike the capturing group).
See the IDEONE demo:
$re = '~^[A-Za-z0-9]{5}(?:-[A-Za-z0-9]{5}){4}$~';
$str = "ABCD1-EFGH2-IJKL3-MNOP4-QRST5";
if (preg_match($re, $str, $matches)) {
echo "Matched!";
}
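Putting the three accepted formats together might look like the sketch below. The function name isValidLicenceKey() is hypothetical, the 10 and 25 character cases reuse the ctype_alnum/strlen idea from the question, and \z is used in place of $ as an assumption that a trailing newline should be rejected too.

```php
<?php
// Hypothetical helper: accepts the three key formats listed in the question.
function isValidLicenceKey(string $key): bool
{
    $length = strlen($key);
    if ($length === 10 || $length === 25) {
        // Plain keys: letters and digits only.
        return ctype_alnum($key);
    }
    // Hyphenated key: five groups of five alphanumerics, nothing before or after.
    return preg_match('~^[A-Za-z0-9]{5}(?:-[A-Za-z0-9]{5}){4}\z~', $key) === 1;
}

var_dump(isValidLicenceKey('1234567890'));                      // bool(true)
var_dump(isValidLicenceKey('ABCD1-EFGH2-IJKL3-MNOP4-QRST5'));   // bool(true)
var_dump(isValidLicenceKey('ABCD1-EFGH2-IJKL3-MNOP4-QRST51'));  // bool(false)
```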
How about:
preg_match('/^([a-z0-9]{5})(?:-(?1)){4}$/i', $str);
Explanation:
/ : regex delimiter
^ : beginning of string
( : begin group 1
[a-z0-9]{5} : exactly 5 alphanum.
) : end of group 1
(?: : begin NON capture group
- : a dash
(?1) : same as definition in group 1 (ie. [a-z0-9]{5})
){4} : this group must be repeated 4 times
$ : end of string
/i : regex delimiter with case insensitive modifier
I've researched a little, but I found nothing that relates exactly to what I need and whenever tried to create the expression it is always a little off from what I require.
I attempted something along the lines of [AZaz09]{3,8}\-[AZaz09]{3,8}.
I want the valid result to only allow text-text, where either of the texts can be alphabetical or numeric; the only symbol allowed is -, and it must sit between the two texts.
Each text must be at least three characters long ({3,8}?), then separated by the -.
Therefore for it to be valid some examples could be:
Text-Text
Abc-123
123-Abc
A2C-def4gk
Invalid tests could be:
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%
You need to use anchors and use the - so the characters in the character class are read as a range, not the individual characters.
Try:
^[A-Za-z0-9]{3,8}-[A-Za-z0-9]{3,8}$
Demo: https://regex101.com/r/xH3oM8/1
You could also simplify it a bit with the i modifier and the \d meta character.
(?i)^[a-z\d]{3,8}-[a-z\d]{3,8}$
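To illustrate why the ranges matter, here is a small comparison run on the test strings from the question. It is a sketch: the anchors were added to the asker's class [AZaz09] so that only the character class differs, and the variable names are invented.

```php
<?php
// [AZaz09] lists six literal characters: A, Z, a, z, 0 and 9.
$broken = '/^[AZaz09]{3,8}-[AZaz09]{3,8}$/';
// [A-Za-z0-9] uses ranges, so every ASCII letter and digit is allowed.
$fixed  = '/^[A-Za-z0-9]{3,8}-[A-Za-z0-9]{3,8}$/';

$tests = ['Text-Text', 'Abc-123', '123-Abc', 'A2C-def4gk',
          'Ab-3', 'Abc!-ajr4', 'a-bc3-25aj', 'a?c-b%'];

// preg_grep keeps the entries that match; array_values renumbers the keys.
$valid = array_values(preg_grep($fixed, $tests));
print_r($valid);   // Text-Text, Abc-123, 123-Abc, A2C-def4gk

var_dump(preg_match($broken, 'Text-Text'));  // int(0)
var_dump(preg_match($broken, 'AZa-z09'));    // int(1)
```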
If accented letters should be allowed, or any other letter that exists in the Unicode range (like Greek or Cyrillic letters), then use the u modifier (for UTF-8 support) and \pL to match Unicode letters (and \d for digits):
$string ="
Mañana-déjà
Text-Text
Abc-123
123-Abc
A2C-def4gk
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%";
$regex='/^[\pL\d]{3,}-[\pL\d]{3,}$/mu';
preg_match_all($regex, $string, $matches);
var_export($matches);
Output:
array (
0 =>
array (
0 => 'Mañana-déjà',
1 => 'Text-Text',
2 => 'Abc-123',
3 => '123-Abc',
4 => 'A2C-def4gk',
),
)
NB: the difference with \w is that [\pL\d] will not match an underscore.
You could come up with the following:
<?php
$string ="
Text-Text
Abc-123
123-Abc
A2C-def4gk
Ab-3
Abc!-ajr4
a-bc3-25aj
a?c-b%";
$regex='~
^\w{3,} # at least three word characters at the beginning of the line
- # a dash
\w{3,}$ # at least three word characters at the end of the line
~xm'; # multiline and freespacing mode (for this explanation)
# ~xmu for accented characters
preg_match_all($regex, $string, $matches);
print_r($matches);
?>
As @chris85 pointed out, \w will match an underscore as well. Trincot had a good comment (matching accented characters, that is). To achieve this, simply use the u modifier.
See a demo on regex101.com and a complete code on ideone.com.
You can use this regex
^\w{3,}-\w{3,}$
^ // start of the string
\w{3,} // match "a" to "z", "A" to "Z", 0 to 9 and the underscore; requires at least 3 characters
- // requires "-"
\w{3,} // same as above
$ // end of the string
Regex Demo
And a short one.
^([^\W_]{3,8})-(?1)$
[^\W_] can be used as short for alnum. It subtracts the underscore from \w
(?1) is a subroutine call to the pattern in first group
Demo at regex101
My vote goes to @chris85's answer, which is the most obvious and performant.
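The practical difference between \w and [^\W_] only shows up with underscores (and with the upper length limit). A short sketch with invented variable names:

```php
<?php
$withUnderscore = '/^\w{3,}-\w{3,}$/';        // \w also accepts "_"
$alnumOnly      = '/^([^\W_]{3,8})-(?1)$/';   // letters and digits only, 3 to 8 each

var_dump(preg_match($withUnderscore, 'Abc_-123'));  // int(1)
var_dump(preg_match($alnumOnly, 'Abc_-123'));       // int(0)
var_dump(preg_match($alnumOnly, 'Abc-123'));        // int(1)
var_dump(preg_match($alnumOnly, 'Abcdefghi-123'));  // int(0), first part has 9 characters
```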
This one
^([\w]{3,8}-[\w]{3,8})$
https://regex101.com/r/uS8nB5/1
I'm looking at a way of removing a property number from an address.
For example the address could be - 56 Hello Road
I've managed to use the following code to remove the number and that works
$meta_url = trim(str_replace(range(0,9),'', $row['property_address_1']));
However if the address is 56b Hello Road it leaves the b and returns - b Hello Road
Any idea how I can edit my current code to remove the next corresponding letter?
One way would be to use a regular expression:
preg_replace('/[0-9]+[a-z]?/', '', $row['property_address_1'])
The expression means:
[0-9]+ one or more characters in the range 0-9, followed by
[a-z]? an optional lowercase character in the range a-z (optional, so that a plain number such as 56 is removed too)
First you may want to split your string at spaces, then look at the first item if it fits a certain condition:
<?php
$addressParts = explode(" ",$property_address);
if(preg_match('/\b[0-9]+[a-z]{0,2}\b/', $addressParts[0], $matches)){
unset($addressParts[0]);
}
$property_address = implode(" ",$addressParts);
?>
/\b[0-9]+[a-z]{0,2}\b/ means:
\b: word boundary at start
[0-9]+ : any length of numeric chars
[a-z]{0,2} : min 0 max 2 lowercase letters
\b: word boundary at end
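Another option is a single replacement anchored at the start of the string. This is only a sketch: it assumes the property number always comes first and carries at most one letter, and stripHouseNumber() is a made-up name.

```php
<?php
// Hypothetical helper: removes a leading house number such as "56" or "56b".
function stripHouseNumber(string $address): string
{
    // Leading digits, an optional single letter, then the whitespace that follows.
    return preg_replace('/^\s*\d+[a-z]?\s+/i', '', $address);
}

echo stripHouseNumber('56 Hello Road'), "\n";   // Hello Road
echo stripHouseNumber('56b Hello Road'), "\n";  // Hello Road
echo stripHouseNumber('Hello Road 56'), "\n";   // Hello Road 56 (number is not at the start)
```

Because of the ^ anchor, digits elsewhere in the address are left alone.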
I've written the next regular expression
$pattern = "~\d+[.][\s]*[A-Z]{1}[A-Za-z0-9\s-']+~";
in order to match substrings such as 2.bon jovi - it's my life
The problem is that the only part that is recognized is bon jovi;
neither " - " nor " ' " is recognized by this regular expression.
I'd prefer to know what is wrong with the regular expression that I've written rather than getting a new one.
Your regular expression states that after the period character (can be changed to \.), you will have zero or more white space characters which should then be followed by 1 upper case letter. In your string, you do not have any upper case letters.
Secondly, the - should be placed last when you want to match it. So, changing your regex to this: ~\d+[.][\s]*[A-Z]{1}[A-Za-z0-9\s'-]+~ will match something like so: 2.Bon jovi - it's my life.
On the other hand, you can change it to this: ~\d+[.][\s]*[A-Za-z0-9\s'-]+~ to match something like so: 2.bon jovi - it's my life.
EDIT: Amended as per the comments of Marko D and aleation.
A better regular expression to handle that would be...
$pattern = "~\d+\.\s*[\pL\pP\s]+~";
CodePad.
This will match a number, followed by a ., followed by optional whitespace, followed by one or more Unicode letters, whitespace or punctuation marks.
$pattern = "~\d+\..*~";
$string = "2.bon jovi - it's my life";
preg_match($pattern, $string, $match);
print_r($match);
output: Array ( [0] => 2.bon jovi - it's my life )
So the way I understand this regular expression is:
\d+ // Match any digit, 1 or more times
[.] // Match a dot
[\s]* // Match 0 or more whitespace characters
[A-Z]{1} // Match characters between an UPPERCASE A-Z Range 1 time
[A-Za-z0-9\s-']+ // Match characters between A-Z, a-z, 0-9, whitespace, dash and apostrophe
So straight away, your 'bon jovi' might not get matched as it's lower case and you're only looking for uppercase characters. 'bon jovi' also contains a space so perhaps changing that part of the regular expression to allow for lowercase characters and whitespace might help so you'd end up with:
$pattern = "~\d+[.][\s]*[A-Za-z\s]{1}[A-Za-z0-9\s-']+~";
Note: I quickly tested this on RegExr ( http://gskinner.com/RegExr/ ) and it appeared to match the string fine.
Your regex is as follows.
~ // delimiter
\d+ // 1 or more numbers
[.] // a period
[\s]* // 0 or more whitespace characters
[A-Z]{1} // 1 upper case letter
[A-Za-z0-9\s-\']+ // 1 or more characters, from the character class
~ //delimiter
Comparing that to the string "2.bon jovi" You have:
~ //
\d+ // "2"
[.] // "."
[\s]* // ""
[A-Z]{1} // <- NO MATCH
[A-Za-z0-9\s-\']+ //
~ //
"bon" does not start with a capital letter; it therefore does not match [A-Z]{1}
Cleaner regex
There are a few simple things you can do to clean up your regex
don't use character-classes for one character
don't specify {1}; it's the same as leaving it out
Applying the above to your existing regex you get:
$pattern = "~\d+\.\s*[A-Z][A-Za-z0-9\s-']+~";
Which is slightly easier to read.
Your [A-Z]{1} sub-pattern requires one capital letter, so "2.bon jovi - it's my life" will not match.
And you need to escape the - in the [A-Za-z0-9\s-'] character class, or put it at the start or end, otherwise it is specifying a range.
"~\d+\.[A-Za-z0-9\s'-]+~"
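As pointed out in the comments, it is actually not necessary to escape the - in the character class in your regex. That is only because you happened to precede it with a metacharacter \s that cannot be part of a range. Normally, if you want to match a literal - and you have it in a character class, you must escape it or position it as described above. Note that PHP 7.3 and later (which use PCRE2) are stricter and reject a hyphen placed directly after \s like this with an "invalid range in character class" error, so escaping or moving it is the safer choice.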
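The two points made in the answers above (the required capital letter and the position of the hyphen) can be checked with a few lines. A sketch, with the hyphen moved to the end of the class and illustrative variable names:

```php
<?php
$string = "2.bon jovi - it's my life";

// Requires a capital letter right after the number, dot and optional whitespace.
$requiresCapital = "~\d+\.\s*[A-Z][A-Za-z0-9\s'-]+~";
// No capital required; the hyphen sits last in the class, so it is a literal.
$anyCase = "~\d+\.\s*[A-Za-z0-9\s'-]+~";

var_dump(preg_match($requiresCapital, $string));                      // int(0)
var_dump(preg_match($requiresCapital, "2.Bon jovi - it's my life"));  // int(1)

preg_match($anyCase, $string, $match);
echo $match[0], "\n";  // 2.bon jovi - it's my life
```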