REGEX URL regular expression [duplicate] - php

Possible Duplicate:
Regular expression for browser Url
Is this regex perfect for any URL?
preg_match_all(
'/([www]+(\.|dot))?[a-zA-Z0-9_\.-]+(\.|dot){1,}[com|net|org|info\.]+((\.|dot){0,}[a-zA-Z]){0,}+/i',
$url, $regp);

Don't use regex for that. If you can't resist, a valid one can be found here:
What is the best regular expression to check if a string is a valid URL?
but that regex is ridiculous. Try to use your framework for that if you can (the Uri class in .NET, for example).

No. In fact it doesn't match URLs at all. It's trying to detect hostnames written in text, like www.example.com.
Its approach is to try to detect some common known TLDs, but:
[com|net|org|info\.]+
is actually a character group, allowing any sequence of characters from the list |.comnetrgif. Probably this was meant:
((com|net|org|info)\.)+
and also [www] is similarly wrong, plus the business with dot doesn't really make any sense.
But this is in general a really bad idea. There are way more TLDs in common use than just those and the 2-letter ccTLDs. Also, many/most of the ccTLDs don't have a second-level domain of com/net/org/info. This expression will fail to match those, and will match a bunch of other stuff that's not supposed to be a hostname.
In fact the task of detecting hostnames is basically impossible to do, since a single word can be a hostname, as can any dot-separated sequence of words. (And since internationalised domain names were introduced, almost anything can be a hostname, e.g. 例え.テスト.)
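To see the character-class problem concretely, here is a minimal sketch (the test strings are made up for illustration) comparing the bracketed form from the question with the grouped alternation it presumably meant:

<?php
// [com|net|org|info\.]+ is a character class: it matches any run of the
// individual characters | . c o m n e t r g i f, in any order.
var_dump(preg_match('/^[com|net|org|info\.]+$/', 'tromfing')); // int(1) - matches garbage

// ((com|net|org|info)\.)+ is grouped alternation: it matches whole labels.
var_dump(preg_match('/^((com|net|org|info)\.)+$/', 'tromfing')); // int(0)
var_dump(preg_match('/^((com|net|org|info)\.)+$/', 'com.'));     // int(1)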

'Any' URL is a tough call. In Australia you have .com.au, in the UK it is .co.uk. Each country has its own set of rules, and they can change. .xxx has just been approved. And non-ASCII characters have been approved now, but I suspect you don't need that.
I would wonder why you want validation that is this tight? Many URLs that are right will be excluded, and it does not exclude all incorrect URLs: www.thisisnotavalidurl.com would still be accepted.
I would suggest
A) using a looser check, just for ([a-zA-Z0-9_.-].)*[a-zA-Z0-9_.-] (or something), just as a sanity check
B) using a DNS lookup to check that the hostname actually resolves, if you want to allow only real URLs (as sketched below).
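A minimal sketch of that two-step approach, using a slightly tightened variant of the loose pattern from A and PHP's checkdnsrr() for the lookup in B; the $url value is a made-up example:

<?php
$url = 'www.example.com'; // hypothetical input

// A) loose sanity check: dot-separated labels of letters, digits, _, . and -
if (!preg_match('/^([a-zA-Z0-9_.-]+\.)*[a-zA-Z0-9_.-]+$/', $url)) {
    die('does not even look like a hostname');
}

// B) check that the name actually resolves (A or MX record)
if (!checkdnsrr($url, 'A') && !checkdnsrr($url, 'MX')) {
    die('hostname does not resolve');
}
echo 'looks like a real hostname';

The DNS check is slow compared to a regex, so in a real form you would typically only run it after the cheap pattern check has passed.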
Oh, and I find this: http://www.fileformat.info/tool/regex.htm to be a really useful tool if I am developing regex, which I am not great at.

[www]+ should be changed to (www)?
(\.|dot){1,} means one or more; maybe you wanted ([a-zA-Z0-9_\.-]+(\.|dot)){1,} instead?

A URL also has a protocol like http, which you're missing. You're also missing a lot of TLDs, as already mentioned.
Something like an escaped space (%20) would also not be recognized.
Port numbers can also appear in a URL (e.g. :80).

No, and you can't create a regex that will parse any URI (or URL or URN). The only way to parse them properly is to read them as per the spec, RFC 3986.
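In PHP the practical alternative is to hand the string to a real parser rather than a hand-rolled regex. A minimal sketch (the example URL is made up; parse_url() is forgiving rather than strictly RFC-compliant, but it beats most ad-hoc patterns):

<?php
$url = 'https://example.com:8080/path?x=1#frag'; // hypothetical input

// parse_url() splits a URL into its components (or returns false on failure).
$parts = parse_url($url);
var_dump($parts); // scheme, host, port, path, query, fragment

// filter_var() does a quick syntactic validation.
var_dump(filter_var($url, FILTER_VALIDATE_URL) !== false); // bool(true)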

Related

Looking for tips to better understand Perl Compatible Regular Expression operators and syntax

My question is about Perl Compatible Regular Expression operators and syntax. I've learned the basic syntax of '/hello/' and that /i means case insensitive. I looked into this at jotform.com and will study it until I have a greater understanding, but I was hoping someone could give me a head start on understanding the syntax and operators in the two PCREs I've posted below. They both work to keep users from posting links in the form textarea, but they are very different in syntax and operators. I just want to know if one regex is preferred over the other. Which is best, and why?
Update: After several months of live testing, it appears that PCRE 1 does not prevent URLs in the PHP contact form, while PCRE 2 does seem to prevent them over the same period.
The two regexes below were originally found at How to prevent spam URLs in a PHP contact form.
Is there a better regex than PCRE 2? Any help or advice would be greatly appreciated.
Thanks.
<?php
// PCRE 1 - Does not work to prevent URLs
if (preg_match('/www\.|http:|https:\/\/[a-z0-9_]+([\-\.]{1}[a-z_0-9]+)*\.[_a-z]{2,5}'.'((:[0-9]{1,5})?\/.*)?$/i', $_POST['message'])) {
    echo 'error please remove URLs';
} else {
    // ....
}

// PCRE 2 - Does work to prevent URLs
if (preg_match("/\b(?:(?:https?|ftp|http):\/\/|www\.)[-a-z0-9+&##\/%?=~_|!:,.;]*[-a-z0-9+&##\/%=~_|]/i", $_POST['message'])) {
    echo 'error please remove URLs';
} else {
    // ....
}
?>
For the sake of offering an answer so that this page can be marked as resolved (instead of abandoned), I'll offer a refinement of the second pattern.
/\b(?:(?:https?|ftp|http):\/\/|www\.)[-a-z0-9+&##\/%?=~_|!:,.;]*[-a-z0-9+&##\/%=~_|]/i
can be rewritten as:
\b(?:(?:f|ht)tps?:\/\/)[-\w+&##\/%?=~|!:,.;]*[-\w+&##\/%=~|]
The first segment matches https, http, ftps, or ftp as a "whole word" (\b) using alternation (|) and the zero or one quantifier (?). Your original pattern requires the "protocol" portion of the url to exist, so I will not change the pattern logic.
The subdomain in your pattern is requiring www. although the subdomain is not required in a valid url and there are valid values other than www. that can be used. I am going to change the pattern logic on this segment to make the subdomain optional and more flexible.
The character class (whitelisted characters) incorporates the characters in www., so the literal match can be omitted from the pattern.
I have reduced the length of both of your character classes by employing \w -- it includes all alphanumeric characters (uppercase and lowercase) as well as the underscore.
Here is a demonstration of what is matched: https://regex101.com/r/TP16iB/1 -- you will find that a valid url like www.example.com is not matched by your preferred pattern nor my pattern. To overcome this, you could hardcode www. as the required subdomain and make the protocol optional, but then you would not be matching variable subdomains. So you see, this is a bit of a rabbit hole where you will need to weigh up how much time you wish to invest versus what your application really needs. Be warned: the more accurate your pattern becomes, the longer and more convoluted it grows. A version that requires either the protocol or a leading www. looks like this:
\b(?:(?:(?:f|ht)tps?:\/\/)|(?:www\.))[-\w+&##\/%?=~|!:,.;]*[-\w+&##\/%=~|]
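For completeness, a sketch of how that last pattern might be dropped into the original contact-form check; the $_POST['message'] variable and the echo message are the asker's, everything else is unchanged from the pattern above:

<?php
$pattern = '/\b(?:(?:(?:f|ht)tps?:\/\/)|(?:www\.))[-\w+&##\/%?=~|!:,.;]*[-\w+&##\/%=~|]/i';

if (preg_match($pattern, $_POST['message'])) {
    echo 'error please remove URLs';
} else {
    // ... process the message as usual
}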

changing www*.com to a clickable URL with REGEX

I'm working on a web page and regex keeps coming up as the best way to handle string manipulation for an issue I'm trying to resolve. Unfortunately, regex is not exactly trivial and I've been having trouble. Any help is appreciated.
I would like to make strings entered from a PHP form into clickable links. I've received help with my first challenge: how to make strings starting with http, https or ftp into clickable links:
function make_links_clickable($message){
    return preg_replace('!(((f|ht)tp(s)?://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '<a href="$1">$1</a>', $message);
}
$message = make_links_clickable($message);
And this works well. When I look at it (and do some research), the best that I can glean from the syntax is that the first piece matches ftp, http, or https, followed by : and //, along with a wide range of combined patterns. I would like to know how I can:
1) Make links starting with www, or ending with .com/.net/.org/etc clickable (like google.com, or www.google.com - leaving out the http://)
2) Change youtube links like
"https://www.youtube.com/watch?v=examplevideo"
into
"<iframe width="560" height="315" src="//www.youtube.com/embed/examplevideo" frameborder="0" allowfullscreen></iframe>"
I think these two cases are basically doing the same kind of thing, but figuring out is not intuitive. Any help would be deeply appreciated.
The first regular expression there is made to match almost everything that follows ftp://, http://, https:// that occurs, so it might be best to implement the others as separate expressions since they'll only be matching hostnames.
For number 1, you'll need to decide how strictly you wish to match different TLDs (.com/.net/etc). For example, you can explicitly match them like this:
(www\.)?[a-z0-9\-]+\.(com|net|org)
However, that will only match URLs that end in .com, .net, or .org. If you want all top-level domains and only the valid ones, you'll need to manually write them all in to the end of that. Alternatively, you can do something like this,
(www\.)?[a-z0-9\-]+\.[a-z]{2,6}
which will accept anything that looks like a URL and ends with a dot followed by any combination of 2 to 6 letters (to cover TLDs like .museum and .travel). However, this will match strings like "fgs.fds". Depending on your application, you may need to add more characters to [a-z] to add support for extended character alphabets.
Edit (2 Aug 14): As pointed out in the comments below, this won't match TLDs like .co.uk. Here's one that will:
(www\.)?[a-z0-9\-]+\.([a-z]{2,3}(\.?[a-z]{2,3})?)
Instead of any string between two and six characters (following a period), this will match any two to three, then another one to three (if present), with or without a dividing period.
It'd be redundant, but you could instead remove the question mark after www in the second option, then do both tests; that way, you can match any string ending in a common TLD, or a string that begins with "www." and is followed by any characters with one period separating them, like "gpspps.cobg". It would still match sites that might not actually exist, but at least it would look like a URL.
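A sketch of how the bare-domain pattern above could be wired into the asker's make_links_clickable() approach; the helper name and the http:// default are assumptions, and the pattern is the permissive 2-to-6-letter TLD version:

<?php
// Hypothetical helper: wrap bare domains (google.com, www.google.com) in links.
function make_bare_domains_clickable($message) {
    return preg_replace(
        '!\b((www\.)?[a-z0-9\-]+\.[a-z]{2,6})\b!i',
        '<a href="http://$1">$1</a>',
        $message
    );
}

echo make_bare_domains_clickable('Try google.com or www.google.com');

Note that this pattern will also match the host part of URLs that already carry a scheme, so in practice you would run it only on text that the ftp/http rule has not already converted, or skip matches that are already inside an <a> tag.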
For the YouTube one, I went a little question mark crazy.
(?i:(?:(?:http(?:s)?://)?(?:www\.)?)?youtu(?:\.be/|be\.com/watch\?(?:[a-z0-9_\-\%\&\=]){0,}?v\=))([a-zA-Z0-9_\-]{11})
EDIT: I just tried to use the above regex in one of my own projects, but I encountered some errors with it. I changed it a little and I think this version may be better:
(?i:(?:(?:http(?:s)?://)?(?:www\.)?)?youtu(?:\.be/|be\.com/watch\?(?:[a-z0-9_\-\%\&\=]){0,})?)(?:v=)?([a-zA-Z0-9_\-]{11})
For those not familiar with regular expressions, parentheses, ( ...regex... ), are stored as groups, which can be selectively picked out of matched strings. Parenthesized groups that begin with ?:, as in most of the ones up there, e.g. (?:www\.), are however not captured as groups. Because the end of that regex was left as a normal "captured" group, ([a-zA-Z0-9_\-]{11}), if you pass the $matches argument to functions like preg_match, you can use $matches[1] to get the YouTube ID of the video, 'examplevide', and then work with it however you'd like. Also note, the regex only matches 11 characters for the ID.
This regex will match pretty much any of the current YouTube URL formats, including incorrect letter casing and out-of-(normal-)order parameters:
http://youtu.be/dQw4w9WgXcQ
https://www.youtube.com/watch?v=dQw4w9WgXcQ
http://www.youtube.com/watch?v=dQw4w9WgXcQ&feature=featured
http://www.youtube.com/watch?feature=featured&v=dQw4w9WgXcQ
http://WWW.YouTube.Com/watch?v=dQw4w9WgXcQ
http://YouTube.Com/watch?v=dQw4w9WgXcQ
www.youtube.com/watch?v=dQw4w9WgXcQ
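A minimal sketch of pulling the video ID out with the captured group and building the embed markup the asker wants; the $message value is a made-up example, the iframe attributes mirror the snippet in the question, and the pattern is the corrected one above (wrapped in ~ delimiters because it contains unescaped slashes):

<?php
$pattern = '~(?i:(?:(?:http(?:s)?://)?(?:www\.)?)?youtu(?:\.be/|be\.com/watch\?(?:[a-z0-9_\-\%\&\=]){0,})?)(?:v=)?([a-zA-Z0-9_\-]{11})~';

$message = 'watch this: https://www.youtube.com/watch?v=dQw4w9WgXcQ';

if (preg_match($pattern, $message, $matches)) {
    $id = $matches[1]; // the 11-character video ID
    echo '<iframe width="560" height="315" src="//www.youtube.com/embed/'
        . htmlspecialchars($id)
        . '" frameborder="0" allowfullscreen></iframe>';
}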

Validate the name of a person in php [duplicate]

I would like to create a regex which validates a name of a person. These should be allowed:
Letters (uppercase and lowercase)
-
spaces
This is pretty easy to create a regex for. The problem is that some people also use special characters in their names. For example, assume a user named gûnther or François. There are a lot of characters like û and ç available and it's hard to list all of these.
Is there an easy way to check for correct human names?
Is there an easy way to check for correct human names?
This has been discussed several times. I'm fairly certain that the only thing people can agree on is that, in order to exist, a name cannot be an empty string, thus:
^.+$
(Yes, I am aware that this is probably not what OP is looking for. I'm just summarizing earlier Q&As.)
/^\pL[\pL '-]*\z/ should do the trick
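A sketch of that pattern in use; note the u modifier so PCRE treats the input as UTF-8, which is what lets \pL match letters like û and ç (the sample names are made up):

<?php
$pattern = "/^\pL[\pL '-]*\z/u";

var_dump(preg_match($pattern, 'François'));      // int(1)
var_dump(preg_match($pattern, "Miles O'Brien")); // int(1)
var_dump(preg_match($pattern, 'R2-D2'));         // int(0) - digits are rejected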
The short answer is no, there is no easy way. You have touched on the biggest issue. There are so many special cases of accents and extra things hanging off letters that it will become a mess to deal with. Additionally, the expression will break down to something like this:
^[CAPITAL_LETERS][ALL_LETERS_AND_SYMBOLS]*$
That is not that helpful because "Abcd" fits that and you have no way to know if someone is incorrectly entering info into the field or if it was a crazy Hollywood parent that actually named their kid that or something like Sandwich or Umbrella.
^.+$
I checked @jensgram's answer, but that regex accepts all strings, so it doesn't solve the problem: the string needs to be a name, and with that pattern it can be anything.
^[A-Z][a-z]+$
My regex only accepts strings where the first character is uppercase and the following characters are lowercase letters. Looking through the other answers, this also seems to be the shortest and simplest regex.
I don't know exactly what you are trying to do (validate user name input?) but basically I would keep it simple - fail the validation if the text contains numbers. And even that's probably pretty shaky.
I had the same problem. First I came up with something like
preg_match("/^[a-zA-Z]{1,}([\s-]*[a-zA-Z\s\'-]*)$/", $name))
but then realized that UTF-8 chars of countries like Sweden, China etc. for example Õ å would not be allowed which was important to my site since it's an international site and don't want to force users not being able to enter their real name.
I though it might be an easier solution instead of trying to figure out how to allow names like O'Malley and Brooks-Schneider and Õsmar (made that one up :) to rather catch chars that you don't want them to enter. For me it was basically to avoid xss JS code being entered. So I use the following regex to filter out all chars that might be harmful.
preg_match("/[~!##\$%\^&\*\(\)=\+\|\[\]\{\};\\:\",\.\<\>\?\/]+/", $name)
That way they can enter any name they want except chars that really aren't part of any name. Hope this might be useful.

Regex to validate URL - Not checking for HTTP?

I know there are tons of questions on here about validating a web address with something like this:
/^[a-zA-Z]+[:\/\/]+[A-Za-z0-9\-_]+\\.+[A-Za-z0-9\.\/%&=\?\-_]+$/i
The only problem is that not everybody uses the http:// (or whatever comes before), so I wanted to find a way to use preg_match() where http is not a must-have but more of a doesn't-really-matter. I modified it to this, but then it rejects the URL if it does have http:// in it:
/^[A-Za-z0-9\-_]+\\.+[A-Za-z0-9\.\/%&=\?\-_]+$/i
I was hoping to validate it on these conditions:
If it has http:// or www then just ignore this
If the .extension is longer than 9 then reject
If it contains no full stops
Anybody got an idea, thanks :)
Can't you just use the built-in filter_var function?
filter_var('example.com', FILTER_VALIDATE_URL);
Not sure about the nine chars extension limit, but I guess you could easily check this in an additional step.
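A sketch of that additional step, combining filter_var with a separate check on the extension length under the asker's rules; the function name is made up, and a scheme is prepended when missing because FILTER_VALIDATE_URL requires one:

<?php
function looks_like_url($input) {
    // FILTER_VALIDATE_URL insists on a scheme, so supply one if it is missing.
    $candidate = preg_match('~^[a-z][a-z0-9+.-]*://~i', $input) ? $input : 'http://' . $input;

    if (filter_var($candidate, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    // Extra rules from the question: at least one dot, extension no longer than 9.
    $host   = parse_url($candidate, PHP_URL_HOST);
    $labels = explode('.', $host);
    if (count($labels) < 2) {
        return false;
    }
    return strlen(end($labels)) <= 9;
}

var_dump(looks_like_url('example.com'));        // bool(true)
var_dump(looks_like_url('http://example.com')); // bool(true)
var_dump(looks_like_url('no-dots-here'));       // bool(false)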
Why not have a stage before the regexp to simply remove the http:// if present? The same would apply to the www. That may make your life a bit easier.
/^(http\://|www\.)/
/^.+?\.\S{0,9}\./
/\./
Those should work for your bullet points?
not everybody uses the http://
They should. Without a scheme it simply isn't a URL, and omitting it can cause weird problems. For example:
www.example.com:8080/file.txt
This is a valid URL with the non-existent scheme www.example.com:.
If you are sure that the normal scheme should be http:, you could try automatically prepending http:// to ‘fix up’ any URL that doesn't begin with https?:, before validation. But you shouldn't allow/keep/return schemeless URLs over the longer term.
Incidentally the current regex you are using is a long way from accurate according to the official URI syntax (see RFC 3986). It will disallow many valid URI characters, not to mention Unicode characters in IRI. If you want a proper validation you should use a real URL-parser; if you just want a quick check for obvious problems you should use something much more permissive. For example just checking for the absence of categorically-invalid characters like space and ".
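If all you want is the quick "obvious problems" check suggested above, a sketch might look like this; the input is made up, the prepended http:// follows the fix-up idea from earlier in the answer, and the rejected characters are just whitespace and the double quote (extend the class to taste):

<?php
// Normalise: prepend http:// when no scheme is present.
$url = 'www.example.com:8080/file.txt'; // hypothetical input
if (!preg_match('~^https?://~i', $url)) {
    $url = 'http://' . $url;
}

// Permissive sanity check: reject only categorically-invalid characters.
if (preg_match('/[\s"]/', $url)) {
    echo 'rejected: contains whitespace or a double quote';
} else {
    echo 'accepted (loosely): ' . $url;
}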

How do I create a regular expression that disallows symbols?

I've got a question regarding regex in general. I'm currently building a registration form where you can enter the full name (given name and family name); however, I can't use [a-zA-Z] as a validation check because that would exclude everyone with a "foreign" character.
What is the best way to make sure that they don't enter a symbol, in both php and javascript?
Thanks in advance!
The correct solution to this problem (in general) is POSIX character classes. In particular, you should be able to use [:alpha:] (or [:alnum:]) to do this.
Though why do you want to prevent users from entering their name exactly as they type it? Are you sure you're in a position to tell them exactly what characters are allowed to be in their names?
You first need to conceptually distinguish between a "foreign" character and a "symbol." You may need to clarify here.
Accounting for other languages means accounting for other code pages and that is really beyond the scope of a simple regexp. It can be done, but on a higher level, the codepages have to work.
If you strictly wanted your regexp to fail on punctuation and symbols, you could use [^[:punct:]], but I'm not sure how the [:punct:] POSIX class reacts to some of the weird Unicode symbols. This would of course stop someone from putting in "John Smythe-Jones" as their name though (as '-' is a punctuation character), so I would probably advise against using it.
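For reference, in PCRE the POSIX classes have to sit inside a bracket expression, and by default they cover ASCII only; a rough sketch (the (*UCP) verb, available in newer PCRE builds, is what I believe extends them to Unicode letters):

<?php
$name = 'John Smythe-Jones';

// POSIX classes must appear inside [...] in PCRE, e.g. [[:alpha:]] or [[:punct:]].
var_dump(preg_match('/[[:punct:]]/', $name)); // int(1) - the hyphen counts as punctuation

// By default these classes are ASCII-only; (*UCP) asks PCRE to use Unicode
// properties instead (assumes a reasonably recent PCRE build).
var_dump(preg_match('/(*UCP)^[[:alpha:] \'-]+$/u', 'François')); // int(1)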
I don’t think that’s a good idea. See How to check real names and surnames - PHP
I don't know how you would account for what is valid or not, and depending on your global reach, you will probably not be able to remove anything without locking out somebody. But a Google search turned this up which may be helpful.
http://nadeausoftware.com/articles/2007/09/php_tip_how_strip_symbol_characters_web_page
You could loop through the input string and use the String.charCodeAt() function to get the integer character code for each character. Set yourself up with a range of acceptable characters and do your comparison.
As noted, POSIX character classes are likely the best bet. But the details of their support (and of the alternatives) vary a great deal with the specific regex variant.
PHP apparently does support them, but JavaScript does not.
This means for JavaScript you will need to use character ranges: /[\u0400-\u04FF]/ matches any one Cyrillic character. Clearly this will take some writing, but note that the XML 1.0 Recommendation (from W3C) includes a listing of a lot of ranges, albeit a few years old now.
One approach might be to have a limited check on the client in JavaScript, and the full check only server side.
