I need to match the trailing non-word characters together with the last word of a string (both optional). The match will be removed from the original string by replacing it with an empty string. My current result is:
Regular expression:
\W* # Matches the optional non-word characters before the last word.
\w* # Matches the optional last word.
$
Working cases: unit tests here
String             Removed      Result
----------------------------------------------------------------------------
"Hello World!"     "!"          "Hello World"
"Hello World"      " World"     "Hello"
"Hello "           " "          "Hello"
"Hello"            "Hello"      ""
"Hello; World!"    "!"          "Hello; World"
"Hello; World"     "; World"    "Hello"
"Hello;"           ";"          "Hello"
Of course, I'm having a problem. I want to accept HTML entities as part of the word, but since they contain non-word characters (the leading ampersand and the ending semicolon), the final semicolon is matched and removed incorrectly.
For now I only expect it to match simple HTML entities, basically &\w+; (like &aaccute;). I'll improve that later (for now, let's use it to keep the answer simple).
What I expect: unit tests here, failing for now
String                    Removed            Result
----------------------------------------------------------------------------
"Hell&aaccute; World!"    "!"                "Hell&aaccute; World"
"Hell&aaccute; World"     " World"           "Hell&aaccute;"
"Hell&aaccute;"           "Hell&aaccute;"    ""
"&aaccute; &aaccute;"     " &aaccute;"       "&aaccute;"
"&aaccute; "              " "                "&aaccute;"
"&aaccute;"               "&aaccute;"        ""
I guess I just need to add the HTML entity expression somewhere so that the first expression (\W*) doesn't match it. But I tried a few things and none of them worked.
I don't know of a way to accomplish your regex match goal as specifically stated. I believe that you'd need a variable-width negative lookbehind to avoid matching unwanted HTML entities, and that doesn't exist in any implementation that I've seen.
But, if your true goal is just to split the strings in the manner you've specified, there are two ways to accomplish that goal.
#1
You can match and consume the preliminary characters as a group, replacing the original string with just the first group match (${result}). ${removed} will have the text matching the removed characters as you described in your question.
^(?<result>.*?(?:(?:&[a-z]+;)|\w)*?)(?<removed>(\W*)((?:&[a-z]+;|\w)*))(?<=.)$ # regex101
Since all matches are optional, the trailing (?<=.) is present to avoid matching completely empty lines. I'm also using a simplified definition of HTML entities as you suggested (e.g., assuming lowercase and ignoring numeric entities such as "&#60;").
All updated unit tests pass.
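As a rough sketch of approach #1 in Python (an assumption on my part, since the pattern above uses .NET-style named groups; Python spells them (?P<name>...)), with a hypothetical helper named strip_last_word:

```python
import re

# Python port of the pattern above; only the named-group syntax changes.
PATTERN = re.compile(
    r"^(?P<result>.*?(?:(?:&[a-z]+;)|\w)*?)"
    r"(?P<removed>(\W*)((?:&[a-z]+;|\w)*))(?<=.)$"
)

def strip_last_word(text):
    """Return (result, removed); the text is left untouched if nothing matches."""
    m = PATTERN.match(text)
    if m is None:  # e.g. the empty string, rejected by the trailing (?<=.)
        return text, ""
    return m.group("result"), m.group("removed")

for s in ["Hell&aaccute; World!", "Hell&aaccute; World", "&aaccute; &aaccute;"]:
    print(repr(s), "->", strip_last_word(s))
```

The same simplified entity definition (&[a-z]+;) applies here, so uppercase or numeric entities are not treated as part of a word.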
#2
Alternatively, you can reverse the strings and use something like this regex to match the desired characters to remove:
^(?<removed>((?:;[a-z]+&|\w)*)((?:[^\w;]|;(?![a-z]+&))*))(?<=.)
Then, after removing the characters, re-reverse the string. ${removed} will have the characters that were removed (as a reversed string). Note that, as of yet, I've only done some preliminary testing on the "reversed" regex.
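The reverse, match, re-reverse steps could look roughly like this in Python (a sketch under the same assumptions: Python's (?P<name>...) group syntax, and a hypothetical helper named strip_last_word_reversed):

```python
import re

# Applied to the reversed string, so an entity such as "&aaccute;" shows up as
# ";etuccaa&" and the text to remove sits at the start of the string.
REVERSED = re.compile(
    r"^(?P<removed>((?:;[a-z]+&|\w)*)((?:[^\w;]|;(?![a-z]+&))*))(?<=.)"
)

def strip_last_word_reversed(text):
    """Return (result, removed), both in normal reading order."""
    rev = text[::-1]
    m = REVERSED.match(rev)
    if m is None:
        return text, ""
    removed = m.group("removed")
    return rev[len(removed):][::-1], removed[::-1]

print(strip_last_word_reversed("Hell&aaccute; World"))
```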
You cannot use \w because it does not include HTML entities as you point out.
Instead, accept any combination of letters and HTML entities, something like this:
([a-zA-Z]*(&[a-zA-Z]+;)*[a-zA-Z]*)+\s([a-zA-Z]*(&[a-zA-Z]+;)*[a-zA-Z]*)+$
https://regex101.com/r/pH7tK2/2
I want to split a string as per the parameters laid out in the title. I've tried a few different things including using preg_match with not much success so far and I feel like there may be a simpler solution that I haven't clocked on to.
I have a regex that matches the "price" mentioned in the title (see below).
/(?=.)\£(([1-9][0-9]{0,2}(,[0-9]{3})*)|[0-9]+)?(\.[0-9]{1,2})?/
And here are a few example scenarios and what my desired outcome would be:
Example 1:
input: "This string should not split as the only periods that appear are here £19.99 and also at the end."
output: n/a
Example 2:
input: "This string should split right here. As the period is not part of a price or at the end of the string."
output: "This string should split right here"
Example 3:
input: "There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price"
output: "There is a price in this string £19.99, but it should only split at this point"
I suggest using
preg_split('~\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?(*SKIP)(*F)|\.(?!\s*$)~u', $string)
See the regex demo.
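The first alternative is your price pattern, \£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?, which is matched and then skipped with (*SKIP)(*F). Otherwise, the second alternative, \.(?!\s*$), matches a . that is not the last non-whitespace character of the string (so a final . is ignored even if trailing whitespace follows it).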
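For readers outside PHP: Python's built-in re module has no (*SKIP)(*F) verbs, but the same "match what you want to ignore first" idea can be sketched with two named alternatives. This is only an illustration; split_on_dots and the group names are made up for the example:

```python
import re

# Alternative 1 consumes a price so that its dot can never be seen on its own;
# alternative 2 is a dot that is not the last non-whitespace character.
PRICE = r"£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?"
TOKEN = re.compile(rf"(?P<price>{PRICE})|(?P<dot>\.(?!\s*$))")

def split_on_dots(text):
    parts, start = [], 0
    for m in TOKEN.finditer(text):
        if m.group("dot") is not None:  # a real split point
            parts.append(text[start:m.start()])
            start = m.end()
        # price matches are consumed and ignored, mimicking (*SKIP)(*F)
    parts.append(text[start:])
    return parts

print(split_on_dots("A price £19.99, then a split here. And the rest."))
```

As with preg_split, a string without a qualifying dot comes back as a single-element list.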
If you really only need to split on the first occurrence of the qualifying dot you can use a matching approach:
preg_match('~^((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+)\.(.*)~su', $string, $match)
See the regex demo. Here,
^ - matches a string start position
((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+) - Group 1: one or more occurrences of your currency pattern or any one char other than a . char
\. - a . char
(.*) - Group 2: the rest of the string.
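The same matching approach carries over to Python almost unchanged (a sketch, not the answerer's code; FIRST_DOT is a name chosen for the example). Group 1 holds the text before the first dot that is not part of a price, Group 2 the rest:

```python
import re

FIRST_DOT = re.compile(
    r"^((?:£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+)\.(.*)",
    re.S,  # let . in Group 2 span line breaks, like the PHP s modifier
)

text = ("There is a price in this string £19.99, but it should only split "
        "at this point. As I want it to ignore periods in a price")
m = FIRST_DOT.match(text)
head, rest = m.group(1), m.group(2)
print(head)
```

Note that this pattern also matches a dot at the very end of the string, so a case like Example 1 would need an extra check.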
To split a text into sentences while avoiding the various pitfalls, like dots or thousands separators in numbers and some abbreviations (like etc.), the best tool is IntlBreakIterator, which is designed to deal with natural language:
$str = 'There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price';
$si = IntlBreakIterator::createSentenceInstance('en-US');
$si->setText($str);
$si->next();
echo substr($str, 0, $si->current());
IntlBreakIterator::createSentenceInstance returns an iterator that gives the indexes of the different sentences in the string.
It takes ?, ! and ... into account too. In addition to the number and price pitfalls, it also works well with this kind of string:
$str = 'John Smith, Jr. was running naked through the garden crying "catch me! catch me!", but no one was chasing him. His psychatre looked at him from the window with a circumspect eye.';
More about rules used by IntlBreakIterator here.
You could simply use this regex:
\.
Since you only have a space after the first sentence (and not a price), this should work just as well, right?
I have a peculiar use case where I need to detect paragraphs that end in !!. Normal occurrences of ! (a single one) are fine in the paragraph, but the block ends when !! is found.
For example:
test foo bar !!
longer paragraph this time!
goes on and on
and then stops !!
Should be detected as two separate matches, one covering the first line, and another (separate) covering lines 2, 3 and 4. This brings it to a total of 2 matches.
(Preferably it should work with multiline-mode, as it's part of a larger regex that employs this mode.)
How would I accomplish this? I tried [^!!]* which to me says, find as many non-!! characters as possible, but I'm not sure how to leverage that, and worse yet it still finds single occurrences of !.
There is a common idiom in regular expressions that is used for escape sequences. (Like "\n" in a string.) You can use the same concept here.
The trick is to match either NOT the first character, or the first character followed by a valid second character.
In your case, that would be:
(?: # this is a package, either A or B, choose one
[^!] # Not a bang
| # or
![^!] # Bang, followed by not-a-bang
)
This pair of alternatives describes all the characters in your paragraph. So you can repeat it either 0 times (*) or one-or-more times (+) depending on what you are doing in the rest of your pattern.
# All together:
(?:[^!]|![^!])* # zero or more
(?:[^!]|![^!])+ # one or more
(Obviously, you can match '!!' at the end if you like...)
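Here is a small Python sketch of the idiom with the closing !! appended, run against the sample text from the question (the variable names are only for illustration):

```python
import re

# Any run of "not a bang" or "bang followed by not-a-bang", then the closing !!
BLOCK = re.compile(r"(?:[^!]|![^!])*!!")

text = (
    "test foo bar !!\n"
    "longer paragraph this time!\n"
    "goes on and on\n"
    "and then stops !!"
)

# strip() drops the line break that separates one block from the next
blocks = [m.group(0).strip() for m in BLOCK.finditer(text)]
for block in blocks:
    print(repr(block))
```

Because [^!] also matches line breaks, no multiline or dot-all flag is needed for a block to span several lines.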
/^([!]?[^!]+[!]?[^!]+)*[!]{2}$/gm
This regex worked for me. It ensures any single ! characters are separated by non-! characters, but there don't have to be any single ! characters. It worked on multiline mode. This also has the added benefit of extracting the text that comes before an occurrence of "!!" since I assume you want to work with it.
/^([!]?[^!]+[!]?[^!]+)*.?[!]{2}$|^([!]?[^!]+[!]?[^!]+)*[^!]?[!]?$/gm
This slightly longer regex captures text that occurs after the final !! (ie, if the file has text between !! and EOF). I wouldn't recommend using the capturing groups though as on my regex checker, they didn't seem to work properly (that may have just been an implementation glitch, however, as the capturing groups look like they should work properly).
Try this:
([\w\s!]+?\!{2})
DEMO
Output:
MATCH 1
1. [0-15] `test foo bar !!`
MATCH 2
1. [15-76] `
longer paragraph this time!
goes on and on
and then stops !!`
or
(?:\n?([\w\s!]+?)\s?\!{2})
DEMO
Output:
MATCH 1
1. [0-12] `test foo bar`
MATCH 2
1. [16-73] `longer paragraph this time!
goes on and on
and then stops`
Try the following regex, which uses a lookbehind and a lookahead
VERSION #1
/(?<=!!|^).*?(?=!!)/gms
Please see https://regex101.com/r/cQ0wC0/2
Result should be
OUTPUT:
test foo bar
longer paragraph this time!
goes on and on
and then stops
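If you want to try VERSION #1 in Python, note that its re module rejects (?<=!!|^) because the two lookbehind alternatives have different widths. A sketch of an equivalent spelling (my adaptation, not part of the original pattern) moves the ^ out of the lookbehind:

```python
import re

# (?:(?<=!!)|^) : either right after "!!" or at the start of a line
PARA = re.compile(r"(?:(?<=!!)|^).*?(?=!!)", re.M | re.S)

text = (
    "test foo bar !!\n"
    "longer paragraph this time!\n"
    "goes on and on\n"
    "and then stops !!"
)

# The raw matches keep the surrounding whitespace, so strip it for display
paragraphs = [m.group(0).strip() for m in PARA.finditer(text)]
print(paragraphs)
```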
VERSION #2
This version is for the case where the OP wants to capture the last paragraph of text after !! even if it does not end with bang signs.
/(?<=!!|^).*?(?=!!)|(?<=!!).*$/gms
Please see demo https://regex101.com/r/cQ0wC0/4
INPUT:
test foo bar !!
longer paragraph this time!
goes on and on
and then stops !!
longer paragraph this time!
goes on and on
OUTPUT:
test foo bar
longer paragraph this time!
goes on and on
and then stops
longer paragraph this time!
goes on and on
I need to find a specific part of the text in a string.
That text needs to have:
12 characters (letters and digits only)
the whole string must contain at least 3 digits
3*4 characters with spaces (e.g. K9X6 6GM6 LM11)
every block from the example above must contain at least 1 digit
words like this, line, spod shouldn't be recognized
So I ended up with this code:
preg_match_all("/(?<!\S)(?i:[a-z\d]{4}|[a-z\d]{12})(?!\S)/", $input_lines, $output_array);
But it doesn't work for all of the requirements. Of course I could use preg_replace or str_replace to remove all the (!,?,#) characters and then count the digits in a loop to check whether there are 4 or more, but I wonder if it is possible to do this with preg_match_all alone...
Here is a string to search in:
?K9X6 6GM6 LM11 // not recognized - but it should be
!K9X6 6GM6 LM11 // not recognized - but it should be
K0X6 0GM7 LM12! // not recognized - but it should be
K1X6 1GM8 LM13# // not recognized - but it should be
K2X6 2GM9 LM14? // not recognized - but it should be
K3X6 3GM0 LM15# // not recognized - but it should be
K4X6 4GM1 LM16* // not recognized - but it should be
K5X65GM2LM17
bla bla bla
this shouldn't be visible
spod also shouldn't be visible
but line below should be!!
K9X66GM6LM11! (see that "!" at the end? Help me with this)
A correct preg_match_all should return this:
K9X6
6GM6
LM11
K9X6
6GM6
LM11
K0X6
0GM7
LM12
K1X6
1GM8
LM13
K2X6
2GM9
LM14
K3X6
3GM0
LM15
K4X6
4GM1
LM16
K5X65GM2LM17
K9X66GM6LM11
working example: http://www.phpliveregex.com/p/bHX
The following should do the trick:
\b(?:(?=.{0,3}?\d)[A-Za-z\d]{4}\s??){3}\b
Demo
[A-Za-z\d]{4} matches 4 letters/digits
(?=.{0,3}?\d) checks there's a digit in these 4 characters
\s?? matches a whitespace character, but tries not to match it if possible
\b makes sure everything isn't contained in a larger word
Note that this will allow strings like K2X6 2GM9LM14, I'm not sure whether you want these to match or not.
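A quick way to check the pattern against some of the sample lines, sketched in Python rather than PHP (the pattern itself is unchanged; CODE and sample are names chosen for the example). Each code comes back as one match, spaces included, rather than as three separate blocks:

```python
import re

CODE = re.compile(r"\b(?:(?=.{0,3}?\d)[A-Za-z\d]{4}\s??){3}\b")

sample = """?K9X6 6GM6 LM11
K0X6 0GM7 LM12!
K5X65GM2LM17
bla bla bla
this shouldn't be visible
K9X66GM6LM11!"""

codes = CODE.findall(sample)
print(codes)

# Splitting each match on whitespace gives the individual 4-character blocks
print([code.split() for code in codes])
```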
I'd like to capture up to four groups of text between <p> and </p>. I can do that using the following regex:
<h5>Trivia<\/h5><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p>
The text to match on:
<h5>Trivia</h5><p>Was discovered by a freelance photographer while sunbathing on Bournemouth Beach in August 2003.</p><p>Supports Southampton FC.</p><p>She has 11 GCSEs and 2 'A' Levels.</p><p>Listens to soul, R&B, Stevie Wonder, Aretha Franklin, Usher Raymond, Michael Jackson and George Michael.</p>
It outputs the four lines of text. It also works as intended if there are more trivia items or <p> occurrences.
But if there are fewer than 4 trivia items or <p> groups, it outputs nothing since it cannot find the fourth group. How do I make that group optional?
I've tried: <h5>Trivia<\/h5><p>(.*?)<\/p>(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)? and that works according to http://gskinner.com/RegExr/ but it doesn't work if I put it inside PHP code. It only detects one group and puts everything in it.
The magic word is either 'escaping' or 'delimiters', read on.
The first regex:
<h5>Trivia<\/h5><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p><p>(.*)<\/p>
worked because you escaped the / characters in tags like </h5> to <\/h5>.
But in your second regex (correctly enclosing each paragraph in an optional non-capturing group, fetching 1 to 5 paragraphs):
<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?
you forgot to escape those / characters.
It should then have been:
$pattern = '/<h5>Trivia<\/h5><p>(.*?)<\/p>(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?(?:<p>(.*?)<\/p>)?/';
The above is assuming you were putting your regex between two / "delimiters" characters (out of conventional habit).
To dive a little deeper into the rabbit hole, note that in PHP the first and last characters of a regular expression are "delimiters", so that modifiers (like case-insensitive etc.) can be added after the closing one.
So instead of escaping your regex, you could also use a ~ character (or #, etc) as a delimiter.
Thus you could also use the same identical (second) regex that you posted and enclose for example like this:
$pattern = '~<h5>Trivia</h5><p>(.*?)</p>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';
Here is a working (web-based) example of that, using # as delimiter (just because we can).
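For comparison, here is a sketch of the optional-group behaviour in Python (an assumption for illustration; the question is about PHP). Python patterns have no delimiters, so / needs no escaping, and groups that did not participate in the match come back as None:

```python
import re

TRIVIA = re.compile(
    r"<h5>Trivia</h5><p>(.*?)</p>"
    r"(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?"
)

# Only two trivia items this time
html = ("<h5>Trivia</h5><p>Supports Southampton FC.</p>"
        "<p>She has 11 GCSEs and 2 'A' Levels.</p>")

m = TRIVIA.search(html)
items = [g for g in m.groups() if g is not None]  # drop the unused groups
print(items)
```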
You can use the question mark to make each <p>...</p> optional:
$pattern = '~<h5>Trivia</h5>(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?(?:<p>(.*?)</p>)?~';
Using the DOM is a good option too.
I was trying to write a regex that allows single hyphens and single spaces only within words, but not at the beginning or at the end of the words.
I thought I had this sorted from the answer I got yesterday, but I just realised there is a small error which I don't quite understand.
Why won't it accept inputs like these:
'forum-category-b forum-category-a'
'forum-category-b Counter-terrorism'
'forum-category-a Preventing'
'forum-category-a Preventing Violent'
'forum-category-a International-Research-and-Publications'
'International-Research-and-Publications forum-category-b forum-category-a'
but it takes,
'forum-category-b'
'Counter-terrorism forum-category-a'
'Preventing forum-category-a'
'Preventing Violent forum-category-a'
'International-Research-and-Publications forum-category-b'
Why is that? How can I fix it? Below is the regex with the initial test, but ideally it should accept all of the input combinations above,
$aWords = array(
'a',
'---stack---over---flow---',
' stack over flow',
'stack-over-flow',
'stack over flow',
'stacoverflow'
);
foreach($aWords as $sWord) {
if (preg_match('/^(\w+([\s-]\w+)?)+$/', $sWord)) {
echo 'pass: ' . $sWord . "\n";
} else {
echo 'fail: ' . $sWord . "\n";
}
}
and it should reject inputs like these below,
---stack---over---flow---
stack-over-flow- stack-over-flow2
stack over flow
Thanks.
Your pattern does not do what you want. Let's break it apart:
^(\w+([\s-]\w+)?)+$
It matches strings that consist solely of one or more sequences of the pattern:
\w+([\s-]\w+)?
...which is a sequence of word characters, followed optionally by one other sequence of word characters, separated by one space or dash character.
In other words, your pattern searches for strings like:
xxx-xxxyyy-yyyzzz zzz
...but you intended to write a pattern that would find:
xxx-xxxxxx-xxxxxx yyy
In your examples, this one is matched:
Counter-terrorism forum-category-a
...but it is interpreted as the following sequence:
(Counter(-terroris)) (m( foru)) (m(-categor)) (y(-a))
As you can see, the pattern did not really find the words you are looking for.
This example is not matched:
forum-category-a Preventing Violent
...since the pattern cannot form groups of "word characters, space-or-dash, word-characters" when it encounters a single word character followed by space or dash:
(forum(-categor)) (y(-a)) <Mismatch: Found " " but expected "\w">
If you would add another character to "forum-category-a", say "forum-category-ax", it would match again, since it could split at the "ax":
(forum(-categor)) (y(-a)) (x( Preventin)) (g( Violent))
What you are actually interested in is a pattern like
^(\w+(-\w+)*)(\s\w+(-\w+)*)*$
...which would find a sequence of words that may contain dashes, separated by spaces:
(forum(-category)(-a)) ( Preventing) ( Violent)
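A short Python check of the corrected pattern against inputs from the question (a sketch; the accepted and rejected lists are just a selection of the examples given above):

```python
import re

# Words that may contain single dashes, separated by single whitespace chars
WORDS = re.compile(r"^(\w+(-\w+)*)(\s\w+(-\w+)*)*$")

accepted = [
    "a",
    "stack-over-flow",
    "stack over flow",
    "forum-category-b forum-category-a",
    "forum-category-a Preventing Violent",
    "International-Research-and-Publications forum-category-b forum-category-a",
]
rejected = [
    "---stack---over---flow---",
    " stack over flow",
    "stack-over-flow- stack-over-flow2",
]

for s in accepted + rejected:
    print("pass" if WORDS.match(s) else "fail", repr(s))
```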
By the way, I tested this using a Python script, and while trying to match your pattern against the example string "International-Research-and-Publications forum-category-b forum-category-a", the regular expression engine seemed to run into an infinite loop (most likely catastrophic backtracking caused by the nested quantifiers, rather than a true infinite loop)...
import re
expr = re.compile(r'^(\w+([\s-]\w+)?)+$')
expr.match('International-Research-and-Publications forum-category-b forum-category-a')
The ([\s-]\w+)? part of your pattern is the issue. It only allows for one repetition (because of the trailing ?). Try changing the last ? to * and see if that helps.
Nope, I still believe that's the problem. The original pattern is looking for "word" or "word[space_hyphen]word" repeated 1+ times. Which is weird because the pattern should fall within another match. But switching the question mark worked for me.
There should be only one answer to this problem:
/^((?<=\w)[ -]\w|[^ -])+$/
There is only one rule as stated, \w[ -]\w, and that's it. It works at per-character granularity and cannot be anything else. Add [^ -] for the rest.