PHP regular expression, how to validate URL segment? - php

I have a problem with regular expressions. I would like to match strings that will represent pages in URL.
I want to match strings like these:
article
article-some
article-some-more
article-some-more-text
a
a-r-t-i-c-l-e
And avoid strings like these:
-article
article-
article--some
article-some--more
So basically all I need is a string that starts with [a-z], ends with [a-z], and can have minus signs in the middle. Multiple minus signs must be allowed, but never two in a row.
I tried this:
^([a-z0-9]+)(\-[a-z0-9]+)*([a-z0-9]+)?$
This works now. I opened a tab with Rubular to paste what I was trying, came up with an idea, and solved the problem.
But anyway, is there any other, more elegant way of doing this?

You can replace 0-9 in your character classes with \d, which stands for 'digit' and (without the u modifier) matches the same characters. You can also remove the last ([a-z0-9]+)?; it is unnecessary because the block immediately before it ends with the same character class. If you're OK with capital letters too (which you may not be, you didn't specify), you can replace the character classes with [A-Za-z\d]. Be careful with \w ('word character'): it already includes digits, so [\w\d] is redundant, and it also matches the underscore.
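As a minimal sketch of the simplified pattern described above, the following wraps it in a function. The name is_valid_segment is made up for illustration, and the D modifier is an addition not mentioned in the thread: it keeps $ from matching just before a trailing newline.

```php
<?php
// Hypothetical helper; the pattern is the question's own minus the redundant tail.
function is_valid_segment(string $segment): bool
{
    // One or more [a-z0-9] runs, joined by single hyphens.
    // D (PCRE_DOLLAR_ENDONLY): $ matches only at the very end of the subject.
    return preg_match('/^[a-z0-9]+(?:-[a-z0-9]+)*$/D', $segment) === 1;
}

// Try a few of the strings from the question.
foreach (['article', 'a-r-t-i-c-l-e', '-article', 'article-', 'article--some'] as $candidate) {
    printf("%-15s %s\n", $candidate, is_valid_segment($candidate) ? 'ok' : 'rejected');
}
```

The group is non-capturing because only the yes/no result is used here.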

Related

Allow Parenthesis and forward slash to this regex

I have this regex
!preg_match("/^[a-z0-9](?:[a-z0-9'. -]*[a-z0-9])?$/i", stripslashes($post['job_title']))
and I want to allow numbers, parentheses, and slashes in this regex, because some job titles can be "Front-end developer/designer" or "Recruitment Staff (HR)". How can I achieve this?
Okay I managed to make a proper regex for this which allows Slashes within but not at the START/END, and also allows parenthesis within and at START/END.
!preg_match("/^[a-z0-9\(\)](?:[a-z0-9\/\(\)'. -]*[a-z0-9\(\)])?$/i", stripslashes($post['job_title']))
Thanks to @anubhava; his reply gave me an idea of how to add things to the regex.
I don't think your intention is being translated to the pattern.
/^[a-z0-9](?:[a-z0-9'. -]*[a-z0-9])?$/i
/^[a-z0-9\(\)](?:[a-z0-9\/\(\)'. -]*[a-z0-9\(\)])?$/i
Both the pattern in your question and the pattern in your answer accept more than the job titles you describe, and both can be written more simply. Inside a character class, parentheses need no escaping, and choosing a delimiter other than / removes the need to escape the slash. Note what the final segment does: the trailing character class inside the optional group is what forces a multi-character string to end in an alphanumeric character (or a parenthesis, in your version); the ? after the group only lets single-character strings through. If you do not need that final-character check, these are suitable replacements:
/^[a-z0-9][a-z0-9'. -]*$/i
~^[a-z0-9()][a-z0-9/()'. -]*$~i
If you do mean to demand that the string ends in an alphanumeric or parenthetical character, keep the optional group and drop only the unnecessary escapes:
~^[a-z0-9()](?:[a-z0-9/()'. -]*[a-z0-9()])?$~i
That said, if you want to ensure that:
hyphens, spaces, and dots only occur in the middle of the string and
all parentheses are properly opened and closed, contain characters between them, and do not occur at the start of the string
etc.
then the best strategy will be "test driven development". Create a large, diverse sample of strings as well as unrealistic strings that you know should fail. Then run your current pattern against all strings. Then analyze which cases do not evaluate as expected and adjust your pattern.
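The test-driven approach could look roughly like the sketch below. The sample strings and the find_mismatches helper are invented for illustration; the pattern is the asker's accepted one, rewritten with ~ delimiters so that slashes and parentheses need no escaping.

```php
<?php
// Asker's accepted pattern with ~ delimiters (only the quote is escaped, for PHP).
const JOB_TITLE_PATTERN = '~^[a-z0-9()](?:[a-z0-9/()\'. -]*[a-z0-9()])?$~i';

// Hypothetical sample set: input => whether we WANT it to be accepted.
$cases = [
    'Front-end developer/designer' => true,
    'Recruitment Staff (HR)'       => true,
    '/designer'                    => false,
    'designer/'                    => false,
    'Manager-'                     => false,
    ')('                           => false, // unbalanced parentheses
];

// Returns the inputs whose actual result differs from the wanted one.
function find_mismatches(string $pattern, array $cases): array
{
    $mismatches = [];
    foreach ($cases as $input => $expected) {
        $actual = preg_match($pattern, $input) === 1;
        if ($actual !== $expected) {
            $mismatches[] = $input;
        }
    }
    return $mismatches;
}

print_r(find_mismatches(JOB_TITLE_PATTERN, $cases));
```

With this sample set the only mismatch reported is ")(", which the pattern accepts even though the parentheses are not properly opened and closed. Growing the list of cases is how the remaining gaps would surface.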

How to check if string contains specific special characters or starting with a space? [duplicate]

I have the following requirements for validating an input field:
It should only contain letters, with spaces allowed between the letters.
It cannot contain spaces at the beginning or end of the string.
It cannot contain any other special character.
I am using following regex for this:
^(?!\s*$)[-a-zA-Z ]*$
But this is allowing spaces at the beginning. Any help is appreciated.
For me the only logical way to do this is:
^\p{L}+(?: \p{L}+)*$
At the start of the string there must be at least one letter. (I replaced your [a-zA-Z] with the Unicode property for letters, \p{L}.) Then there can be a space followed by at least one letter; this part can be repeated.
\p{L}: any kind of letter from any language. See regular-expressions.info
The problem with ^(?!\s*$) in your expression is that the lookahead fails only if there is nothing but whitespace up to the end of the string. If you want to disallow leading whitespace, just remove the end-of-string anchor inside the lookahead: ^(?!\s)[-a-zA-Z ]*$. But this still allows the string to end with whitespace. To avoid that, add a lookbehind at the end of the string: ^(?!\s)[-a-zA-Z ]*(?<!\s)$. But I think a lookaround is not needed for this task.
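The question appears to target Java, but the same \p{L} pattern can be tried in PHP, the language used elsewhere on this page. This is only a sketch: the function name is invented, the u modifier is needed so that the pattern and subject are treated as UTF-8, and the D modifier is an addition that keeps $ from matching before a trailing newline.

```php
<?php
// Hypothetical helper: letters only, words separated by exactly one space.
function is_letters_and_single_spaces(string $input): bool
{
    // u: UTF-8 mode, so \p{L} matches whole letters such as "é".
    // D: $ matches only at the very end of the subject.
    return preg_match('/^\p{L}+(?: \p{L}+)*$/Du', $input) === 1;
}

var_dump(is_letters_and_single_spaces('John Smith'));  // accepted
var_dump(is_letters_and_single_spaces(' John'));       // leading space
var_dump(is_letters_and_single_spaces('John '));       // trailing space
```

No lookaround is involved: leading and trailing spaces fail simply because every space must be followed by at least one letter.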
This should work if you use it with String.matches method. I assume you want English alphabet.
"[a-zA-Z]+(\\s+[a-zA-Z]+)*"
Note that \s will allow all kinds of whitespace characters. In Java, it would be equivalent to
[ \t\n\x0B\f\r]
This covers space (32), horizontal tab (09), line feed (10), vertical tab (11), form feed (12), and carriage return (13).
If you want to specifically allow only space (32):
"[a-zA-Z]+( +[a-zA-Z]+)*"
You can further optimize the regex above by making the capturing group ( +[a-zA-Z]+) non-capturing (with String.matches you are not going to be able to get the words individually anyway). It is also possible to change the quantifiers to make them possessive, since there is no point in backtracking here.
"[a-zA-Z]++(?: ++[a-zA-Z]++)*+"
Try this:
^(((?<!^)\s(?!$)|[-a-zA-Z])*)$
This expression uses a negative lookbehind and a negative lookahead to disallow spaces at the beginning or at the end of the string, and it requires the entire string to match.
I think the problem is there's a ? before the negation of white spaces, which means it is optional
This should work:
[a-zA-Z]{1}([a-zA-Z\s]*[a-zA-Z]{1})?
at least one sequence of letters, then optional string with spaces but always ends with letters
I don't know if words in your accepted string can be separated by more than one space. If they can:
^[a-zA-Z]+(( )+[a-zA-Z]+)*$
If they can't:
^[a-zA-Z]+( [a-zA-Z]+)*$
The string must start with a letter (or several letters), not a space.
The string can contain several words, but every word except the first must have a space before it.
Hope I helped.

Match 'exclamation mark' character 'not immediately preceded by a word'

I want to delete every ! character from a string that is not immediately preceded by a word. To accomplish this task, I was thinking about preg_replace() to perform a Regex match.
That is, I'd like the following blasphemy of a text:
search! query ! !key!words that! acc!ept exclamation! marks!
... to become:
search! query keywords that! accept exclamation! marks!
There is no need to take double+ occurrences into account, since I filter those out using (![!]+) - although if someone knows of a solution that takes double+ occurrences into consideration, I'd be more than glad to welcome it, since it removes the need for an extra lookup.
So far I have (!\b)|(\s+!\s+)|(!\s+!) which - besides being a bit whacky in my opinion - works almost perfectly, but sometimes removes spacing between words, producing the result of
search! querykeywords that! accept exclamation! marks!
EDIT
I need to take accented and/or uppercase characters into consideration when parsing the string.
You want to remove an ! when
there's no word break before it (as in foo !)
or there is a word break after it (as in !foo)
That gives:
\B!|!\b
https://regex101.com/r/xF7bG6/1
([^a-z])\!+|\!+([a-z]), with a replacement of $1$2 should match multiple !'s that are not preceded by a letter (\W) or have a letter immediately after (\w).
If your regular expression language takes positive lookaheads/lookbehinds, then you can use (?<=[^a-z])\!+|\!+(?=[a-z]) with no replacement string.
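The \B!|!\b idea from the first answer can be sketched with preg_replace() as below. The function name is invented. Two things are additions rather than part of the answers: the + quantifiers, so that a run of stray exclamation marks is removed in one go, and a second pass that collapses the double spaces left behind when a free-standing ! is removed. The u modifier is there because the question mentions accented characters; it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: drop every "!" that does not directly follow a word,
// or that is directly followed by one.
function strip_stray_bangs(string $text): string
{
    // \B!+ : "!" run with no word character right before it (as in "foo !").
    // !+\b : "!" run with a word character right after it (as in "!foo").
    // preg_replace returns null on error (e.g. invalid UTF-8); fall back to the input.
    $withoutStray = preg_replace('/\B!+|!+\b/u', '', $text) ?? $text;

    // Removing a free-standing "!" leaves two spaces behind; collapse them.
    // Note this also collapses whitespace runs that were already in the input.
    return trim(preg_replace('/\s{2,}/', ' ', $withoutStray));
}

echo strip_stray_bangs('search! query ! !key!words that! acc!ept exclamation! marks!'), "\n";
```

Doing the whitespace cleanup as a separate step keeps the first pattern simple and avoids the missing-space problem described in the question.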

regex validation

I am trying to validate a string of 3 numbers followed by / then 5 more numbers
I thought this would work
(/^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9])/i)
but it doesn't, any ideas what i'm doing wrong
Try this
preg_match('#^\d{3}/\d{5}#', $string)
The reason yours is not working is due to the + symbols which match "one or more" of the nominated character or character class.
Also, when using forward-slash delimiters (the characters at the start and end of your expression), you need to escape any forward-slashes in the pattern by prefixing them with a backslash, eg
/foo\/bar/
PHP allows you to use alternate delimiters (as in my answer) which is handy if your expression contains many forward-slashes.
First of all, you're using / as the regexp delimiter, so you can't use it in the pattern without escaping it with a backslash. Otherwise, PHP will think that your pattern ends at the / in the middle (you can see that even StackOverflow's syntax highlighting thinks so).
Second, the + means "one or more", so each [0-9]+ can consume several digits. Thanks to backtracking, three digits still match, but so do four or more, so the pattern does not enforce exactly three digits before the slash or exactly five after it.
Third, there's no need to use i, since you're dealing with numbers which aren't upper- or lowercase, so case-sensitivity is a moot point.
Try this instead
/^\d{3}\/\d{5}$/
The \d is shorthand for writing [0-9], and the {3} and {5} means repeat 3 or 5 times, respectively.
(This pattern is anchored to the start and the end of the string. Your pattern was only anchored to the beginning; if that was on purpose, then remove the $ from my pattern.)
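A small sketch of the anchored pattern in use follows. The function name is invented, # is used as the delimiter so the slash needs no escaping, and the D modifier is an addition that stops $ from matching before a trailing newline.

```php
<?php
// Hypothetical helper: exactly three digits, a slash, then exactly five digits.
function is_valid_code(string $value): bool
{
    // {3} and {5} fix the digit counts; + would allow any count of one or more.
    // D: $ matches only at the very end of the subject.
    return preg_match('#^\d{3}/\d{5}$#D', $value) === 1;
}

var_dump(is_valid_code('123/45678'));    // three and five digits
var_dump(is_valid_code('1234/567890'));  // too many digits on both sides
```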
I recently found this site useful for debugging regexes:
http://www.regextester.com/index2.html
It assumes use of /.../ (meaning you should not include those slashes in the regex you paste in).
So, after I put your regex ^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9]) in the Regex box and 123/45678 in the Test box I see no match. When I put a backslash in front of the forward slash in the middle, then it recognizes the match. You can then try matching 1234/567890 and discover it still matches. Then you go through and remove all the plus signs and then it correctly stops matching.
What I particularly like about this particular site is the way it shows the partial matches in red, allowing you to see where your regex is working up to.

Regex for netbios names

I'm having trouble figuring out how to build a regexp for verifying a NetBIOS name. According to the Microsoft standard, these characters are illegal:
\/:*?"<>|
So, that's what I'm trying to detect. My regex looks like this:
^[\\\/:\*\?"\<\>\|]$
But that won't work.
Can anyone point me in the right direction? (not regexlib.com please...)
And if it matters, I'm using php with preg_match.
Thanks
Your regular expression has two problems:
you insist that the match should span the entire string. As Andrzej says, you are only matching strings of length 1.
you are quoting too many characters. In a character class (i.e. []), you only need to quote characters that are special within character classes (hyphen, closing square bracket, caret, backslash), plus the pattern delimiter.
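The following calls work for me. Note the four backslashes: in a single-quoted PHP string they produce the two backslashes that the regex needs to match one literal backslash.
preg_match('/[\\\\\/:*?"<>|]/', "foo"); /* gives 0: does not include invalid characters */
preg_match('/[\\\\\/:*?"<>|]/', "f<oo"); /* gives 1: does include invalid characters */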
As it stands at the moment, your regex will match the start of the string (^), then exactly one of the characters in the square brackets (i.e. the illegal characters), then the end of the string ($).
So this likely isn't working because a string of length > 1 will trivially fail to match the regex, and thus be considered OK.
You likely don't need the start and end anchors (the ^ and $). If you remove these, then the regex should match one of the bracketed characters occurring anywhere on the input text, which is what you want.
(Depending on the exact regex dialect, you may strictly need fewer backslashes within the square brackets, but the extra ones are unlikely to do any harm.)
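Putting the advice above together, an unanchored check might look like the sketch below. The function name is invented, and the set of illegal characters is copied from the question rather than verified against Microsoft's documentation. The ~ delimiter means only the backslash needs escaping.

```php
<?php
// Hypothetical helper: true if the name contains any of \ / : * ? " < > |
function has_illegal_netbios_chars(string $name): bool
{
    // No ^ or $: a single illegal character anywhere in the name is enough.
    // '\\\\' in a single-quoted PHP string is two backslashes, i.e. one
    // literal backslash as far as the regex engine is concerned.
    return preg_match('~[\\\\/:*?"<>|]~', $name) === 1;
}

var_dump(has_illegal_netbios_chars('SERVER01'));  // no illegal characters
var_dump(has_illegal_netbios_chars('f<oo'));      // contains <
```

A name is acceptable under this check when the function returns false, which inverts the sense of the original anchored attempt.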
