PHP RegEx: a Pattern to Validate the Second Level Domain - php

Note: this is a theoretical question about PHP flavor of regex, not a practical question about validation in PHP. I am merely using Domain Names for lack of a better example.
"Second Level Domain" refers to the combination of letters, numbers, period signs, and/or dashes that are placed between http:// or http://www. and .com (.co, .info, .etc) .
I am only interested in second level domains that use English version of Latin alphabet.
This pattern:
[A-Za-z0-9.-]+
matches valid domain names, such as stackoverflow, StackOverflow, stackoverflow.co (as in stackoverflow.co.uk), stack-overflow, or stackoverflow123.
However, the same pattern would also match something like stack...overflow, stack---over--flow, ........ , -------- , or even . and -.
How can that pattern be rewritten, to indicate that period signs and dashes, even though they can be used multiple times in a node,
cannot be used without other symbols,
cannot be placed twice or more side by side with each other,
and cannot be placed in the beginning or end of the node?
Thank you in advance!

I think something like this should do the trick:
^([a-zA-Z0-9]+[.-])*[a-zA-Z0-9]+$
What this tries to do is:
start at the beginning of the string and end at the end of it
one or more letters or digits
followed by either a dot or a hyphen
the group above repeated 0 or more times
followed by one or more letters or digits
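A minimal sketch of how the pattern above could be tried out in PHP. The function name isValidSecondLevelDomain and the sample list are made up for illustration; only the pattern comes from the answer.

```php
<?php
// Hypothetical helper; the pattern is the one from the answer above.
function isValidSecondLevelDomain(string $name): bool
{
    return preg_match('/^([a-zA-Z0-9]+[.-])*[a-zA-Z0-9]+$/', $name) === 1;
}

// Made-up samples based on the examples in the question.
$samples = ['stackoverflow', 'stack-overflow', 'stackoverflow.co',
            'stack...overflow', '-stack', 'stack-', '.', '-'];

foreach ($samples as $sample) {
    printf("%-18s %s\n", $sample, isValidSecondLevelDomain($sample) ? 'valid' : 'invalid');
}
```

Note that in PCRE a plain $ also matches just before a trailing newline, so "stackoverflow\n" would pass this check. If that matters, adding the D modifier (or using \z instead of $) should close the gap.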

Assuming that you are looking for a regex that does not allow two consecutive . or - you can use:
^[a-zA-Z0-9]+([-.][a-zA-Z0-9]+)*$
regexr demo
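The two patterns suggested so far put the separator on opposite sides of the repeated group, but they should accept the same strings. A rough way to convince yourself, assuming a handful of made-up inputs and with the D modifier added here so that a trailing newline is not tolerated:

```php
<?php
// Both patterns from the answers above; variable names are illustrative.
$separatorLast  = '/^([a-zA-Z0-9]+[.-])*[a-zA-Z0-9]+$/D';
$separatorFirst = '/^[a-zA-Z0-9]+([-.][a-zA-Z0-9]+)*$/D';

$inputs = ['stackoverflow123', 'stack-over.flow', 'stack--overflow', '.stack', 'stack.', ''];

$disagreements = [];
foreach ($inputs as $input) {
    $a = preg_match($separatorLast, $input) === 1;
    $b = preg_match($separatorFirst, $input) === 1;
    if ($a !== $b) {
        $disagreements[] = $input; // would indicate the patterns differ
    }
    printf("%-20s %s\n", var_export($input, true), $b ? 'valid' : 'invalid');
}

echo count($disagreements) === 0 ? "patterns agree\n" : "patterns differ\n";
```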

Related

PHP Regex detect repeated character in a word

(preg_match('/(.)\1{3}/', $repeater))
I am trying to create a regular expression which will detect a word that repeats a character 3 or more times throughout the word. I have tried this numerous ways and I can't seem to get the correct output.
If you don't need letters to be contiguous, you can do it with this pattern:
\b\w*?(\w)\w*?\1\w*?\1\w*
otherwise this one should suffice:
\b\w*?(\w)\1{2}\w*
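To see the difference between the two patterns, here is a small sketch using preg_match_all on a made-up sentence. The variable names and the sample text are assumptions, not part of the answer.

```php
<?php
// Made-up sample text.
$text = 'banana bookkeeper aaargh hello';

// Same character three times anywhere in the word (not necessarily adjacent).
preg_match_all('/\b\w*?(\w)\w*?\1\w*?\1\w*/', $text, $anywhere);

// Same character three times in a row.
preg_match_all('/\b\w*?(\w)\1{2}\w*/', $text, $contiguous);

print_r($anywhere[0]);   // whole words matched by the first pattern
print_r($contiguous[0]); // whole words matched by the second pattern
```

Here "banana" (three a's) and "bookkeeper" (three e's) are only picked up by the first pattern, while "aaargh" is picked up by both.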
Try this regex instead
(preg_match('/(.)\1{2,}/', $repeater))
This matches a character that is repeated 3 or more times in a row; see the example here: http://regexr.com/3fk80
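A quick sketch of why the quantifier matters: (.)\1{3} needs the captured character plus three repeats, i.e. four in a row, while (.)\1{2,} is satisfied by three. The helper names below are hypothetical.

```php
<?php
// Hypothetical helper: true when some character occurs three or more times in a row.
function hasTripleRun(string $word): bool
{
    return preg_match('/(.)\1{2,}/', $word) === 1;
}

// The pattern from the question: one character plus three repeats, so four in a row.
function hasQuadRun(string $word): bool
{
    return preg_match('/(.)\1{3}/', $word) === 1;
}

foreach (['hello', 'hellllo', 'yesss', 'aaa'] as $repeater) {
    printf("%-8s triple=%s quad=%s\n",
        $repeater,
        var_export(hasTripleRun($repeater), true),
        var_export(hasQuadRun($repeater), true));
}
```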
Strictly speaking, regular expressions that use backreferences such as \1, \2, ... are not regular expressions in the mathematical sense. The matcher that handles them is also less efficient: it has to remember the text captured by the group in order to match it again later, and on failure it has to backtrack over the length of the captured group.
The canonical way to express a true regular expression that accepts word characters repeated three or more times is
(A{3,}|B{3,}|C{3,}|...|Z{3,}|a{3,}|b{3,}|...|z{3,})
The {3,} operator does not distribute over alternation, so it cannot be pulled out and applied to a single group as you have shown in your question.
For the pedantic, the pure regular expression should be:
(AAAA*|BBBB*|CCCC*|...|ZZZZ*|aaaa*|bbbb*|cccc*|...|zzzz*)
Here you can also use the fact that AAAA* is matched as soon as three As are found, so this regex would be valid as well:
AAA|BBB|CCC|...|ZZZ|aaa|bbb|ccc|...|zzz
but the first version allows you to capture the group that delimits the actual matching sequence.
This approach takes longer to write but is far more efficient when parsing the data string, as it does not backtrack at all and visits each character only once.
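The long alternation does not have to be typed by hand. A sketch that builds it in PHP and compares it with a backreference version restricted to letters; the variable names are made up and the arrow function assumes PHP 7.4 or later:

```php
<?php
// Build (A{3,}|B{3,}|...|Z{3,}|a{3,}|...|z{3,}) programmatically.
$letters = array_merge(range('A', 'Z'), range('a', 'z'));
$alternatives = array_map(fn(string $c): string => $c . '{3,}', $letters);
$pattern = '/(' . implode('|', $alternatives) . ')/';

// Backreference version limited to the same letters, for comparison.
$withBackreference = '/([A-Za-z])\1{2,}/';

foreach (['hello', 'yesss', 'AAAAh', 'aAa'] as $word) {
    $plain   = preg_match($pattern, $word, $m) === 1 ? $m[1] : null;
    $backref = preg_match($withBackreference, $word, $n) === 1 ? $n[0] : null;
    printf("%-6s plain=%s backref=%s\n",
        $word, var_export($plain, true), var_export($backref, true));
}
```

Whether the alternation is actually faster under PHP's PCRE engine, which is itself a backtracking matcher, is something to measure rather than assume.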

Perfect a regex for finding #username tags

I'm using a system to get #twitter-like names, and the following regex is near perfect:
(?<![^\s<>])#([^\s<>]+)
The problem I have found arises when punctuation marks follow the name.
So for example:
Hey #mark ===> matches #mark (This is what we want)
Hey #mark. ===> matches #mark.
Hey #mark, you're nice ===> matches #mark,
Hey #mark!!!! I didn't think of that ===> matches #mark!!!!
Obviously we only want to match the username and not the punctuation marks. The caveat is that some usernames contain periods.
For example, these are all legitimate usernames:
mark.markus
mark#gmail.com
mark_markus#gmail.com
EDIT: We are using a lookbehind. If the above usernames are written with a # in front of them, they should match, but without the # in front an email address should not match. #mark_markus#gmail.com should match mark_markus#gmail.com, but if someone typed plain old mark_markus#gmail.com we don't want gmail.com to match.
Any ideas on how to modify the regex to account for the various punctuation marks that could be used?
how about this:
(?<![\w#])#([\w#]+(?:[.!][\w#]+)*)
I have replaced [^\s<>] with [\w#], which is a bit more restrictive. \w matches letters, numbers, and underscores. If there are any other characters you specifically need to allow, add them to each character class.
The group (?:[.!][\w#]+)* allows periods (and, as written, exclamation marks) inside the username, but only if they are immediately followed by more username characters. Note that (?:...) is a non-capturing group. It is useful when you want to group things for logical purposes, but don't need to capture the result.
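A sketch of the suggested pattern run through preg_match_all. The sample text is stitched together from the examples in the question and is otherwise made up.

```php
<?php
// The pattern from the answer above, wrapped in / delimiters.
$pattern = '/(?<![\w#])#([\w#]+(?:[.!][\w#]+)*)/';

// Made-up sample text built from the examples in the question.
$text = "Hey #mark. Hey #mark, you're nice. Hey #mark!!!! "
      . "Ping #mark.markus and #mark_markus#gmail.com but not mark_markus#gmail.com";

preg_match_all($pattern, $text, $matches);
print_r($matches[1]); // captured usernames, without the leading #
```

Trailing punctuation is left out because the optional group only continues when a period or exclamation mark is immediately followed by another username character. One consequence of allowing ! in that group is that something like #mark!you would be captured as mark!you.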
Update: see a working example.

Regex with negative lookahead to ignore the word "class"

I'm going insane over this. It's so simple, yet I can't figure out the right regex. I need a regex that will match blacklisted words, e.g. "ass".
For example, in this string:
<span class="bob">Blacklisted word was here</span>bass
I tried that regex:
((?!class)ass)
I want it to match the "ass" in the word "bass" but NOT the one in "class".
Instead, this regex flags "ass" in both occurrences. I tried several negative lookaheads I found on Google and none of them works.
NOTE: This is for a CMS, for moderators to easily find potentially bad words, I know you cannot rely on a computer to do the filtering.
If you have lookbehind available, this is trivial (judging by the PHP tag, you probably do; older JavaScript engines, IIRC, do not):
(?<!cl)(ass)
Without lookbehind, you probably need to do something like this:
(?:(?!cl)..|^.?)(ass)
That's ass preceded by any two characters as long as they are not cl, or ass that is zero or one characters after the beginning of the line.
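Both variants can be checked against the string from the question. This is only a sketch; the variable names are made up, and PREG_OFFSET_CAPTURE is used to show where the captured group was found.

```php
<?php
// The sample string from the question.
$html = '<span class="bob">Blacklisted word was here</span>bass';

$withLookbehind    = '/(?<!cl)(ass)/';
$withoutLookbehind = '/(?:(?!cl)..|^.?)(ass)/';

$offsets = [];
foreach ([$withLookbehind, $withoutLookbehind] as $pattern) {
    preg_match_all($pattern, $html, $matches, PREG_OFFSET_CAPTURE);
    // $matches[1] holds [text, byte offset] pairs for the captured (ass) group.
    $offsets[$pattern] = array_column($matches[1], 1);
    foreach ($matches[1] as [$word, $offset]) {
        echo "$pattern found '$word' at offset $offset\n";
    }
}
```

With this input, each pattern should report a single hit, the one inside "bass", and skip the one inside "class".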
Note that this is probably not the best way to implement a blacklist, though. You probably want this:
\bass\b
Which will match the word ass but not any word that includes ass in it (like association or bass or whatever else).
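A small sketch of the word-boundary version. The samples are made up, and the i modifier is an addition here so that capitalised words are caught too.

```php
<?php
// /i is an assumption added for this sketch; the answer's pattern is \bass\b.
$pattern = '/\bass\b/i';

// Made-up samples.
$samples = ['You ass!', 'Ass.', 'bass guitar', 'class="bob"', 'association'];

$flagged = [];
foreach ($samples as $sample) {
    if (preg_match($pattern, $sample) === 1) {
        $flagged[] = $sample;
    }
}
print_r($flagged); // only the samples containing "ass" as a standalone word
```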
It seems to me that you're actually trying to use two lists here: one for words that should be excluded (even if one is a part of some other word), and another for words that should not be changed at all - even though they have the words from the first list as substrings.
The trick here is to know where to use the lookbehind:
/ass(?<!class)/
In other words, the good word negative lookbehind should follow the bad word pattern, not precede it. Then it would work correctly.
You can even get some of them in a row:
/ass(?<!class)(?<!pass)(?<!bass)/
This, though, will skip both passhole and pass. To make it even more bulletproof, we can add word-boundary checks:
/ass(?<!\bclass\b)(?<!\bpass\b)(?<!\bbass\b)/
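To compare the two trailing-lookbehind patterns, here is a sketch over a few made-up words. The array layout and variable names are just for illustration.

```php
<?php
$plain   = '/ass(?<!class)(?<!pass)(?<!bass)/';
$bounded = '/ass(?<!\bclass\b)(?<!\bpass\b)(?<!\bbass\b)/';

$results = [];
foreach (['asshat', 'class', 'pass', 'bass', 'passhole', 'classy'] as $word) {
    $results[$word] = [
        'plain'   => preg_match($plain, $word) === 1,
        'bounded' => preg_match($bounded, $word) === 1,
    ];
    printf("%-9s plain=%-5s bounded=%s\n",
        $word,
        var_export($results[$word]['plain'], true),
        var_export($results[$word]['bounded'], true));
}
```

The plain version lets anything ending in a whitelisted word through at that position, so passhole is not flagged. The bounded version only exempts the whitelisted words when they stand alone, so passhole is flagged, and so is a harmless word like classy.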
UPDATE: of course, it's more efficient to check for parts of the string, with (?<!cl)(?<!b) etc. But my point was that you can still use the whole words from whitelist in the regex.
Then again, perhaps it'd be wise to prepare the whitelists accordingly (so shorter patterns will have to be checked).
Is this what you want? (?<!class)(\w+ass)

How to validate a hyperlink from different links using php

Can you please tell me how to validate a hyperlink among several different hyperlinks? E.g.
I want to fetch these links separately, starting with the bolded address (between the two stars), from a website using Simple HTML DOM:
1 http://**www.website1.com**/1/2/
2 http://**news.website2.com**/s/d
3 http://**website3.com/news**/gds
I know we can do it using preg_match, but I am having a hard time understanding preg_match.
Can anyone give me a preg_match script to validate these websites?
Can you also explain what this means:
preg_match('|^http(s)?://[a-z0-9-]+(.[a-z0-9-]+)*(:[0-9]+)?(/.*)?$|i', $url)
What are those random-looking characters in preg_match? What do they mean?
If you want to learn about regular expression, I think you could get a good start on the regular-expressions.info website.
And if you want to use them more, the book Mastering Regular Expressions is a must read.
Edit: here is a simple walkthrough, though:
the first parameter of preg_match is the regexp string. The second is the string you're testing against. An optional third one is an array in which everything captured is stored.
the | characters delimit your regexp and its options. What is between the two | is the regexp; the i at the end is an option (meaning your regexp is case-insensitive)
the first ^ marks the start of the string you want to match
then (s)? means that you want one or no s character, and you want to "capture" it
[a-z0-9-]+ is one or more alphanumeric characters or hyphens
(.[a-z0-9-]+)* is wrong. It should be (\.[a-z0-9-]+)* to capture any number of sequences formed by a dot followed by at least one alphanumeric character or hyphen
(:[0-9]+)? will capture one or no sequence formed by : followed by one or more digits. It's used to get the URL port
(/.*)? captures the end of the url, a slash followed by any number of any character
$ is the end of your string
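The walkthrough above can be tried out like this. It is only a sketch: the sample URLs are made up (loosely based on the ones in the question), and the variable names are illustrative.

```php
<?php
// The pattern from the question and the version with the escaped dot.
$original  = '|^http(s)?://[a-z0-9-]+(.[a-z0-9-]+)*(:[0-9]+)?(/.*)?$|i';
$corrected = '|^http(s)?://[a-z0-9-]+(\.[a-z0-9-]+)*(:[0-9]+)?(/.*)?$|i';

// Made-up URL with a scheme, host, port and path.
$url = 'https://news.website2.com:8080/s/d';
if (preg_match($corrected, $url, $parts) === 1) {
    // $parts[0] is the whole match; $parts[1..4] are the captured groups.
    print_r($parts);
}

// With the unescaped dot, "_" is accepted as if it were a separator.
$suspicious = 'http://exa_mple';
var_dump(preg_match($original, $suspicious));
var_dump(preg_match($corrected, $suspicious));
```

A repeated group such as (\.[a-z0-9-]+)* only keeps its last repetition, so for this URL the second captured group holds ".com" rather than the whole host.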
Have a look at In search of the perfect URL validation regex.

Is there any way to improve this regular expression for website URLs?

^(https?:\/\/([a-zA-Z0-9\-]{1,64}\.){0,127}([a-zA-Z0-9\-]{3,64})\.\w{2,4}(\/.*)?)?$
I only need to match website urls (without IPs, ports, username/password, etc). Are there any critical flaws in this regex?
Edit: Here's a slightly improved one:
^(https?:\/\/([a-zA-Z0-9\-]{1,64}\.){0,127}([a-zA-Z0-9\-]{1,64})\.\w{2,7}(\/.*)?)?$
I've realized that domain names can't begin or end with a dash. Is there a simple way to not match domains that begin or end with dash?
In the first part you are very restrictive and allow only the characters [a-zA-Z0-9\-], whereas in the last part you allow anything except a newline.
==> In the first part you are missing many valid characters, and in the last part you match anything until the end of the string.
Why not simplify this and match anything that starts with http and has no whitespace till the end?
^https?:\/\/\S+$
To avoid the starting/ending dash in the domain name use lookarounds in your second expression. I also replaced the .* with \S*
^(https?:\/\/([a-zA-Z0-9\-]{1,64}\.){0,127}((?!-)[a-zA-Z0-9\-]{1,64})(?<!-)\.\w{2,7}(\/\S*)?)?$
See it here online on Regexr
Why have you made your complete expression optional with the surrounding (...)? As a result, it will also match an empty string.
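A sketch of the lookaround version in PHP, with a few made-up URLs. The function name isWebsiteUrl is hypothetical; the pattern is the one given above, wrapped in / delimiters.

```php
<?php
// Hypothetical helper around the lookaround pattern from the answer above.
function isWebsiteUrl(string $url): bool
{
    $pattern = '/^(https?:\/\/([a-zA-Z0-9\-]{1,64}\.){0,127}'
             . '((?!-)[a-zA-Z0-9\-]{1,64})(?<!-)\.\w{2,7}(\/\S*)?)?$/';
    return preg_match($pattern, $url) === 1;
}

$samples = [
    'http://example.com/path',
    'http://-example.com',     // leading dash in the last label
    'http://example-.com',     // trailing dash in the last label
    'http://-sub.example.com', // leading dash in a subdomain label
    'http://example.com/a b',  // whitespace in the path
    '',                        // empty string
];

foreach ($samples as $sample) {
    printf("%-28s %s\n", var_export($sample, true), isWebsiteUrl($sample) ? 'match' : 'no match');
}
```

Two things stand out when running this: the empty string matches because the whole expression is optional, and the dash checks only guard the label directly before the top-level domain, so a subdomain label such as -sub still gets through.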
