Regex/PHP check if group of characters appears only once


I am trying to validate an input in PHP with REGEX. I want to check whether the input has the %s character group inside it and that it appears only once. Otherwise, the rule should fail.
Here's what I've tried:
preg_match('|^[0-9a-zA-Z_-\s:;,\.\?!\(\)\p{L}(%s){1}]*$|u', $value); (there are also some other rules besides this; I've tried the (%s){1} part and it doesn't work).
I believe there is a very easy solution to this, but I'm not really into regexes... Thank you for your help!

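If I understand your question, you need a positive lookahead. The lookahead causes the expression to match only if it finds a single %s.
preg_match('|^(?=(?:(?!%s).)*%s(?:(?!%s).)*$)[0-9a-zA-Z_\s:;,.?!()%\p{L}-]*$|u', $value);
I'll explain how each part works:
^(?=(?:(?!%s).)*%s(?:(?!%s).)*$) is a zero-width assertion -- (?=regex) a positive lookahead -- (meaning it must match, but does not "eat" any characters). It means that the whole line can have only 1 %s. Note that %s cannot be written as [%s] here, because square brackets form a character class that matches a single % or s rather than the two-character sequence.
[0-9a-zA-Z_\s:;,.?!()%\p{L}-]*$ The remaining part of the regex also looks at the entire string and ensures that the whole string is composed only of the characters in the character class (like your original regex). The % is listed in the class so that the placeholder itself is allowed, and the hyphen is moved to the end of the class so that it is read as a literal character rather than as a range.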

I managed to do this with PHP's substr_count() function, following Johnsyweb's suggestion to use an alternate way to perform the validation and because the suggested regexes seem pretty complicated.
Thank you again!
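A minimal sketch of the substr_count() approach mentioned above. The helper name containsExactlyOnce is hypothetical and chosen only for illustration; substr_count() is the only part taken from the post. It checks the count only, so any character whitelist would still need its own check.

```php
<?php
// Hypothetical helper: true when $needle occurs exactly once in $haystack.
function containsExactlyOnce(string $haystack, string $needle): bool
{
    // substr_count() counts non-overlapping occurrences of $needle.
    return substr_count($haystack, $needle) === 1;
}

$one  = containsExactlyOnce('Hello %s!', '%s');   // true
$none = containsExactlyOnce('Hello world', '%s'); // false
$two  = containsExactlyOnce('%s and %s', '%s');   // false
```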

Alternatively, you can use preg_match_all with a pattern that matches just the %s sequence and check the number of matches (an anchored ^...$ pattern can match at most once, so it cannot be used for counting). If it's 1, then you're ok - something like this:
$result = (preg_match_all('|%s|', $value) == 1);

Try this:
'|^(?=(?:(?!%s).)*%s(?:(?!%s).)*$)[0-9_\s:;,.?!()%\p{L}-]+$|u'
The (%s){1} sequence inside the square brackets probably doesn't do what you think it does, but never mind, the solution is more complex. In fact, {1} should never appear anywhere in a regex. It doesn't ensure that there's only one of something, as many people assume. As a matter of fact, it doesn't do anything; it's pure clutter.
EDIT (in answer to the comment): To ensure that only one of a particular sequence is present in a string, you have to actively examine every single character, classifying it as either part-of-%s or not part-of-%s. To that end, (?:(?!%s).)* consumes one character at a time, after the negative lookahead has confirmed that the character is not the start of %s.
When that part of the lookahead expression quits matching, the next thing in the string has to be %s. Then the second (?:(?!%s).)*$ kicks in to confirm that there are no more %s sequences until the end of the string.
And don't forget that the lookahead expression must be anchored at both ends. Because the lookahead is the first thing after the main regex's start anchor you don't need to add another ^. But the lookahead must end with its own $ anchor.
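The lookahead technique described above can be wrapped in a small function, sketched here under a few assumptions. The name hasExactlyOnePlaceholder is hypothetical. The character class in this sketch includes % so that the placeholder's own characters pass the whitelist, and the s modifier is added so that the dot in the lookahead also crosses the newlines that \s permits; neither choice is confirmed by the post.

```php
<?php
// Hypothetical helper: true when $value contains "%s" exactly once and
// otherwise consists only of whitelisted characters.
function hasExactlyOnePlaceholder(string $value): bool
{
    // Lookahead: any characters that do not start "%s", then "%s",
    // then again only characters that do not start "%s", up to the end.
    $pattern = '|^(?=(?:(?!%s).)*%s(?:(?!%s).)*$)[0-9_\s:;,.?!()%\p{L}-]+$|us';

    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match($pattern, $value) === 1;
}

$ok       = hasExactlyOnePlaceholder('Hello %s!');   // true
$missing  = hasExactlyOnePlaceholder('Hello world'); // false
$repeated = hasExactlyOnePlaceholder('%s and %s');   // false
```

Because % is allowed on its own in this sketch, a string such as '100% sure %s' also passes; tighten the class if a stray % should be rejected.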

If you're not "into" regular expressions, why not solve this with PHP?
One call to the builtin strpos() will tell you if the string has a match. A second call will tell you if it appears more than once.
This will be easier for you to read and for others to maintain.
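The two strpos() calls could look roughly like this. The function name appearsExactlyOnce is hypothetical; the sketch assumes a non-empty needle.

```php
<?php
// Hypothetical helper: true when $needle occurs exactly once in $haystack.
function appearsExactlyOnce(string $haystack, string $needle): bool
{
    // First call: is there any occurrence at all?
    $first = strpos($haystack, $needle);
    if ($first === false) {
        return false;
    }

    // Second call: search again, starting just past the first occurrence.
    $second = strpos($haystack, $needle, $first + strlen($needle));

    return $second === false;
}

$one = appearsExactlyOnce('Hello %s!', '%s'); // true
$two = appearsExactlyOnce('%s and %s', '%s'); // false
```

Note the strict comparison with false: strpos() returns 0 when the match is at the very start of the string.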

Related

Optional Group Expression

Today I was working with regular expressions at work and during some experimentation I noticed that a regex such as (\w|) compiled. This seems to be an optional group but looking online didn't yield any results.
Is there any practical use of having a group that matches something, but otherwise can match anything? What's the difference between that and (\w|.*)? Thanks.

(\w|) is a verbose way of writing \w?, which checks for \w first, then empty string.
I removed the capturing group, since it seems that () is used for its grouping property only. If you actually need the capturing group, use (\w?).
In the same vein, (|\w) is a verbose way of writing \w??, which tries for empty string first, before trying for \w.
(\w|.*) is a different regex altogether. It tries to match (in that order) one word character \w, or 0 or more of any character (except line terminators) .*.
I can't imagine how this regex fragment would be useful, though.
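A short PHP illustration of the equivalences described above, using the single-character subject 'x'. The variable names are arbitrary.

```php
<?php
// (\w|) and \w? try the word character first, so they consume "x".
preg_match('/(\w|)/', 'x', $altWordFirst);
preg_match('/\w?/', 'x', $optionalGreedy);

// (|\w) and \w?? try the empty string first, so they match nothing.
preg_match('/(|\w)/', 'x', $altEmptyFirst);
preg_match('/\w??/', 'x', $optionalLazy);

// $altWordFirst[0] and $optionalGreedy[0] are 'x';
// $altEmptyFirst[0] and $optionalLazy[0] are ''.
```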

Match 'exclamation mark' character 'not immediately preceded by a word'

I want to delete every ! character from a string that is not immediately preceded by a word. To accomplish this task, I was thinking about preg_replace() to perform a Regex match.
That is, I'd like the following blasphemy of a text:
search! query ! !key!words that! acc!ept exclamation! marks!
... to become:
search! query keywords that! accept exclamation! marks!
There is no need to take double+ occurrences into account, since I filter those out using (![!]+) - although if someone knows of a solution that takes double+ occurrences into consideration, I'd be more than glad to welcome it, since it removes the need for an extra lookup.
So far I have (!\b)|(\s+!\s+)|(!\s+!) which - besides being a bit whacky in my opinion - works almost perfectly, but sometimes removes spacing between words, producing the result of
search! querykeywords that! accept exclamation! marks!
EDIT
I need to take accented and/or uppercase characters into consideration when parsing the string.

You want to remove an ! when
there's no word break before it (as in foo !)
or there is a word break after it (as in !foo)
That gives:
\B!|!\b
https://regex101.com/r/xF7bG6/1
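A sketch of how that pattern might be applied with preg_replace(). The function name stripStrayExclamations is hypothetical, and the whitespace-collapsing step is an added assumption (removing a standalone ! leaves two spaces behind). The u modifier is included with accented letters in mind; whether \b treats them as word characters is worth verifying on your own PHP/PCRE build.

```php
<?php
// Hypothetical helper: drop "!" unless it directly follows a word
// and is not directly followed by one.
function stripStrayExclamations(string $text): string
{
    // \B! : no word boundary before the "!" (nothing word-like precedes it)
    // !\b : a word boundary after the "!" (a word character follows it)
    $stripped = preg_replace('/\B!|!\b/u', '', $text) ?? $text;

    // Assumption: collapse the whitespace runs left by removed marks.
    return preg_replace('/\s{2,}/u', ' ', $stripped) ?? $stripped;
}

$input  = 'search! query ! !key!words that! acc!ept exclamation! marks!';
$output = stripStrayExclamations($input);
// $output: 'search! query keywords that! accept exclamation! marks!'
```

Since the second ! in a run such as "wow!!" has no word boundary before it, repeated marks after a word are reduced to one as well.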

([^a-z])\!+|\!+([a-z]), with a replacement of $1$2 should match multiple !'s that are not preceded by a letter (\W) or have a letter immediately after (\w).
If your regular expression language takes positive lookaheads/lookbehinds, then you can use (?<=[^a-z])\!+|\!+(?=[a-z]) with no replacement string.

(PHP) How to find words beginning with a pattern and replace all of them?

I have a string. An example might be "Contact /u/someone on reddit, or visit /r/subreddit or /r/subreddit2"
I want to replace any instance of "/r/x" and "/u/x" with "[/r/x](http://reddit.com/r/x)" and "[/u/x](http://reddit.com/u/x)" basically.
So I'm not sure how to 1) find "/r/" and then expand that to the rest of the word (until there's a space), then 2) take that full "/r/x" and replace with my pattern, and most importantly 3) do this for all "/r/" and "/u/" matches in a single go...
The only way I know to do this would be to write a function to walk the string, character by character, until I found "/", then look for "r" and "/" to follow; then keep going until I found a space. That would give me the beginning and ending characters, so I could do a string replacement; then calculate the new end point, and continue walking the string.
This feels... dumb. I have a feeling there's a relatively simple way to do this, and I just don't know how to google to get all the relevant parts.

A simple preg_replace will do what you want.
Try:
$string = preg_replace('#(/(?:u|r)/[a-zA-Z0-9_-]+)#', '[\1](http://reddit.com\1)', $string);
Here is an example: http://ideone.com/dvz2zB
You should see if you can discover what characters are valid in a Reddit name or in a Reddit username and modify the [a-zA-Z0-9_-] charset accordingly.
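For reference, here is the same call wrapped in a function and applied to the example string from the question. The name linkifyReddit is hypothetical; the pattern and replacement are the ones shown above.

```php
<?php
// Hypothetical helper: turn /u/name and /r/name into Markdown links.
function linkifyReddit(string $text): string
{
    // \1 in the replacement refers to the whole captured "/u/..." or "/r/..." path.
    return preg_replace(
        '#(/(?:u|r)/[a-zA-Z0-9_-]+)#',
        '[\1](http://reddit.com\1)',
        $text
    ) ?? $text;
}

$input  = 'Contact /u/someone on reddit, or visit /r/subreddit or /r/subreddit2';
$linked = linkifyReddit($input);
// $linked: 'Contact [/u/someone](http://reddit.com/u/someone) on reddit, or visit '
//        . '[/r/subreddit](http://reddit.com/r/subreddit) or [/r/subreddit2](http://reddit.com/r/subreddit2)'
```

All matches are replaced in a single pass, and the inserted replacement text is not rescanned.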

You are looking for a regular expression.
A basic pattern starts out as a fixed string. /u/ or /r/ which would match those exactly. This can be simplified to match one or another with /(?:u|r)/ which would match the same as those two patterns. Next you would want to match everything from that point up to a space. You would use a negative character group [^ ] which will match any character that is not a space, and apply a modifier, *, to match as many characters as possible that match that group. /(?:u|r)/[^ ]*
You can take that pattern further and add a lookbehind, (?<= ), to ensure your match is preceded by a space so you're not matching part of a longer string, which results in (?<= )/(?:u|r)/[^ ]*. You wrap all of that in parentheses to make a capturing group: ((?<= )/(?:u|r)/[^ ]*). This will capture the contents within the parentheses to allow for a replacement pattern. You can express your chosen replacement using the \1 reference to the first captured group as [\1](http://reddit.com\1).
In PHP you would pass the matching pattern, replacement pattern, and subject string to the preg_replace function.
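A small sketch of the lookbehind variant, mainly to show two consequences of its design. The function name linkifyAfterSpace and the test string are made up for illustration.

```php
<?php
// Hypothetical helper using the lookbehind pattern from the explanation above.
function linkifyAfterSpace(string $text): string
{
    return preg_replace(
        '#((?<= )/(?:u|r)/[^ ]*)#',
        '[\1](http://reddit.com\1)',
        $text
    ) ?? $text;
}

$result = linkifyAfterSpace('/r/a and /r/b, ok');
// $result: '/r/a and [/r/b,](http://reddit.com/r/b,) ok'
```

The leading /r/a is left alone because nothing precedes it (the lookbehind requires a literal space), and the comma after /r/b is swept into the link because [^ ]* stops only at a space. A narrower character class avoids the second effect.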

In my opinion regex would be overkill for such a simple operation. If you just want to replace instances of "/r/x" with "[/r/x](http://reddit.com/r/x)" and "/u/x" with "[/u/x](http://reddit.com/u/x)", you can use str_replace, although preg_replace would need less code.
str_replace("/r/x","[/r/x](http://reddit.com/r/x)","whatever_string");
Use regex for intricate search-and-replace patterns. You can also use the regular expression generator at http://www.jslab.dk/tools.regex.php if you have something complex to capture in the string.

Regex question mark

To match a string with pattern like:
-TEXT-someMore-String
To get -TEXT-, I came to know that this works:
/-(.+?)-/ // -TEXT-
As far as I know, ? makes the preceding token optional, as in:
colou?r matches both colour and color
I initially put in regex to get -TEXT- part like this:
/-(.+)-/
But it gave -TEXT-someMore-.
How does adding ? make the regex stop at the right place and return -TEXT- correctly? I thought it only made the preceding token optional, as in the colou?r example above, rather than stopping the match at a certain point.

As you say, ? sometimes means "zero or one", but in your regex +? is a single unit meaning "one or more — and preferably as few as possible". (This is in contrast to bare +, which means "one or more — and preferably as many as possible".)
As the documentation puts it:
However, if a quantifier is followed by a question mark,
then it becomes lazy, and instead matches the minimum
number of times possible, so the pattern /\*.*?\*/
does the right thing with the C comments. The meaning of the
various quantifiers is not otherwise changed, just the preferred
number of matches. Do not confuse this use of
question mark with its use as a quantifier in its own right.
Because it has two uses, it can sometimes appear doubled, as
in \d??\d which matches one digit by preference, but can match two if
that is the only way the rest of the pattern matches.
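The difference is easy to see in PHP with the string from the question. This is only an illustrative sketch; the variable names are arbitrary.

```php
<?php
$subject = '-TEXT-someMore-String';

// Greedy: .+ grabs as much as possible, then backs off to the LAST "-".
preg_match('/-(.+)-/', $subject, $greedy);

// Lazy: .+? grabs as little as possible and stops at the FIRST following "-".
preg_match('/-(.+?)-/', $subject, $lazy);

// $greedy[0] is '-TEXT-someMore-', $greedy[1] is 'TEXT-someMore'
// $lazy[0]   is '-TEXT-',          $lazy[1]   is 'TEXT'
```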

Alternatively, you can use the ungreedy modifier (U) to make every quantifier in the regular expression prefer the shortest possible match:
/-(.+)-/U
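A quick sketch of the U modifier in PHP. One thing to keep in mind is that U inverts greediness for the whole pattern, so a quantifier followed by ? becomes greedy again.

```php
<?php
$subject = '-TEXT-someMore-String';

// With U, a bare + is lazy.
preg_match('/-(.+)-/U', $subject, $ungreedy);

// With U, +? flips back to greedy.
preg_match('/-(.+?)-/U', $subject, $flipped);

// $ungreedy[0] is '-TEXT-'
// $flipped[0]  is '-TEXT-someMore-'
```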

? after a token is shorthand for {0,1}, which means zero or one occurrence of that token.
But + is not a token; it is a quantifier, shorthand for {1,}: one up to unlimited occurrences.
A ? after a quantifier puts it into non-greedy mode. In greedy mode, the quantifier matches as much of the string as possible; in non-greedy mode, it matches as little as possible.

Another, perhaps the underlying, error in your regex is that you try to match a number of arbitrary characters via .+?. However, what you really want is probably "any character except -". You can get that via [^-]+. In this case, it doesn't matter whether you do a greedy match or not -- the repeated match will stop as soon as it reaches the second "-" in your string.
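The negated character class can be tried in PHP like this (an illustrative sketch; the variable names are arbitrary):

```php
<?php
$subject = '-TEXT-someMore-String';

// [^-]+ cannot run past a "-", so greedy and lazy give the same result.
preg_match('/-([^-]+)-/', $subject, $greedyClass);
preg_match('/-([^-]+?)-/', $subject, $lazyClass);

// Both $greedyClass[0] and $lazyClass[0] are '-TEXT-'; group 1 is 'TEXT'.
```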

regex validation

I am trying to validate a string of 3 numbers followed by / then 5 more numbers
I thought this would work
(/^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9])/i)
but it doesn't. Any ideas what I'm doing wrong?

Try this
preg_match('#^\d{3}/\d{5}#', $string)
The reason yours is not working is due to the + symbols which match "one or more" of the nominated character or character class.
Also, when using forward-slash delimiters (the characters at the start and end of your expression), you need to escape any forward-slashes in the pattern by prefixing them with a backslash, eg
/foo\/bar/
PHP allows you to use alternate delimiters (as in my answer) which is handy if your expression contains many forward-slashes.

First of all, you're using / as the regexp delimiter, so you can't use it in the pattern without escaping it with a backslash. Otherwise, PHP will think that your pattern ends at the / in the middle (you can see that even StackOverflow's syntax highlighting thinks so).
Second, + means "one or more", so [0-9]+[0-9]+[0-9]+ matches three or more digits rather than exactly three. Each + is greedy and takes as many digits as it can, then backtracks just enough for the rest of the pattern to match, so strings with too many digits are accepted as well.
Third, there's no need to use i, since you're dealing with numbers which aren't upper- or lowercase, so case-sensitivity is a moot point.
Try this instead
/^\d{3}\/\d{5}$/
The \d is shorthand for writing [0-9], and the {3} and {5} mean repeat exactly 3 and exactly 5 times, respectively.
(This pattern is anchored to the start and the end of the string. Your pattern was only anchored to the beginning, and if that was on purpose, then remove the $ from my pattern.)
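As a sketch, the anchored pattern can be wrapped in a validation function and compared with a +-based version of the original attempt. The function name isValidCode is hypothetical.

```php
<?php
// Hypothetical helper: exactly 3 digits, a slash, exactly 5 digits.
function isValidCode(string $value): bool
{
    return preg_match('/^\d{3}\/\d{5}$/', $value) === 1;
}

$good    = isValidCode('123/45678');   // true
$tooLong = isValidCode('1234/567890'); // false

// For comparison: the "+" version (with # delimiters so the slash
// needs no escaping) also accepts the over-long input.
$loose = preg_match('#^[0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9]#', '1234/567890'); // 1
```

One caveat: in PCRE, $ also matches just before a single trailing newline, so "123/45678\n" would pass. Adding the D modifier or using \z instead of $ rules that out if it matters.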

I recently found this site useful for debugging regexes:
http://www.regextester.com/index2.html
It assumes use of /.../ (meaning you should not include those slashes in the regex you paste in).
So, after I put your regex ^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9]) in the Regex box and 123/45678 in the Test box I see no match. When I put a backslash in front of the forward slash in the middle, then it recognizes the match. You can then try matching 1234/567890 and discover it still matches. Then you go through and remove all the plus signs and then it correctly stops matching.
What I particularly like about this site is the way it shows the partial matches in red, allowing you to see how far your regex gets before it stops matching.
