php regexp for national domains - php

There are new national domains and TLDs, such as "http://президент.рф/" for Russian Federation domains, or http://example.新加坡 for Singapore.
Is there a regex to validate these domains?
I have found this one: What is the best regular expression to check if a string is a valid URL?
But when I try to use one of the expressions listed there, PHP fails with this error:
preg_match(): Compilation failed: character value in \x{...} sequence is too large at offset 81
P.S.
1) The compilation error was solved by @OmnipotentEntity.
2) But the main problem, validating international domains, still exists, because the example regexp doesn't validate them well.

Use the "u" modifier to match unicode characters. The example you gave only uses the "i" modifier.
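A minimal sketch of the difference the modifier makes, assuming the source file and the input are UTF-8 encoded. The Cyrillic range used here is only an illustration and is not a statement about which characters the .рф registry permits.

```php
<?php
// Assumes this file and the input string are UTF-8 encoded.
$host = 'президент.рф';

// Illustrative pattern only: two dot-separated runs of Cyrillic code points.
$pattern = '[\x{0400}-\x{04FF}]+\.[\x{0400}-\x{04FF}]+';

// Without "u", PCRE works on single bytes, so \x{0400} cannot be compiled.
// preg_match() then returns false (the warning is suppressed with @).
$withoutU = @preg_match('/^' . $pattern . '$/', $host);

// With "u", pattern and subject are treated as UTF-8, so the same
// expression compiles and can match.
$withU = preg_match('/^' . $pattern . '$/u', $host);

var_dump($withoutU, $withU);
```

The first call is the one that produces the "character value in \x{...} sequence is too large" compilation failure quoted in the question.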

No, there's no regexp to validate those domains. Each TLD has different rules about which Unicode code points are permissible within their IDNs (if any). You would need a very big lookup table which would have to be kept up-to-date to know which specific characters are legal.
Furthermore there are rules about whether left-to-right written characters and right-to-left characters can be combined within a single DNS label.
BTW, the RFCs mentioned in the other comments are obsolete. The recently approved set are RFCs 5890 - 5895.
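If a purely syntactic check is good enough, one possible approach is to convert the name to its ASCII form first and check only that form. The sketch below assumes PHP's intl extension is available for idn_to_ascii(); host_to_ascii() and is_ldh_hostname() are hypothetical helper names, and the check says nothing about which code points a given registry actually allows.

```php
<?php
/**
 * Hypothetical helper: convert a hostname to its ASCII (A-label) form.
 * Returns null when the intl extension is missing or conversion fails.
 */
function host_to_ascii(string $host): ?string
{
    if (!function_exists('idn_to_ascii')) {
        return null; // intl extension not loaded
    }
    $ascii = idn_to_ascii($host, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
    return $ascii === false ? null : $ascii;
}

/**
 * Hypothetical helper: letter-digit-hyphen syntax check on the ASCII form.
 * Requires at least two labels; each label is 1 to 63 characters long and
 * neither starts nor ends with a hyphen.
 */
function is_ldh_hostname(string $ascii): bool
{
    $label = '[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?';
    return preg_match('/^' . $label . '(?:\.' . $label . ')+$/Di', $ascii) === 1;
}

$ascii = host_to_ascii('президент.рф');
var_dump($ascii !== null && is_ldh_hostname($ascii));
```

This only confirms that the name can be encoded and is well-formed at the DNS level; the per-TLD character rules mentioned above would still need a lookup table.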

Related

Looking for tips to better understand Perl Compatible Regular Expression operators and syntax

My question is about Perl Compatible Regular Expression operators and syntax. I've learned the basic syntax of '/hello/' and that /i means case insensitive. I looked into this at jotform.com and will keep studying until I have a better understanding. But I was hoping someone could give me a head start on understanding the syntax and operators in the two PCRE patterns I've posted below. They are both meant to keep users from posting links in the form textarea, but are very different in syntax and operators. I just want to know if one regex is preferred over the other. Which is best and why?
Update: After several months of live testing, it appears that PCRE 1 does not work to prevent URLs in PHP contact form. PCRE 2 does seem to work to prevent URLs in PHP contact form for the same live testing time period.
The 2 regex below were originally found here at How to prevent spam URLs in a PHP contact form
Is there a better regex than PCRE 2? Any help or advice would be greatly appreciated.
Thanks.
<?php
//PCRE 1 - Does not work to prevent URLs
if (preg_match( '/www\.|http:|https:\/\/[a-z0-9_]+([\-\.]{1}[a-z_0-9]+)*\.[_a-z]{2,5}'.'((:[0-9]{1,5})?\/.*)?$/i', $_POST['message']))
{
echo 'error please remove URLs';
}else
{....
//PCRE 2 - Does work to prevent URLs
if (preg_match("/\b(?:(?:https?|ftp|http):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i",$_POST['message']))
{
echo 'error please remove URLs';
}else
{....
?>
For the sake of offering an answer so that this page can be marked as resolved (instead of abandoned), I'll offer a refinement of the second pattern.
/\b(?:(?:https?|ftp|http):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i
can be rewritten as:
\b(?:(?:f|ht)tps?:\/\/)[-\w+&@#\/%?=~|!:,.;]*[-\w+&@#\/%=~|]
The first segment matches https, http, ftps, or ftp as a "whole word" (\b) using alternation (|) and the zero or one quantifier (?). Your original pattern requires the "protocol" portion of the url to exist, so I will not change the pattern logic.
The subdomain in your pattern is requiring www. although the subdomain is not required in a valid url and there are valid values other than www. that can be used. I am going to change the pattern logic on this segment to make the subdomain optional and more flexible.
The character class (whitelisted characters) incorporates the characters in www., so the literal match can be omitted from the pattern.
I have reduced the length of both of your character classes by employing \w -- it includes all alphanumeric characters (uppercase and lowercase) as well as the underscore.
Here is a demonstration of what is matched: https://regex101.com/r/TP16iB/1 -- you will find that a URL written without a protocol, such as www.example.com, is not matched by my pattern. To overcome this, you could accept www. as an alternative to the protocol, but then you would still not be matching other subdomains when the protocol is missing. So you see, this is a bit of a rabbit hole where you will need to weigh up how much time you wish to invest versus what your application really needs. Be warned: the more accurate your pattern becomes, the longer and more convoluted it grows.
\b(?:(?:(?:f|ht)tps?:\/\/)|(?:www\.))[-\w+&@#\/%?=~|!:,.;]*[-\w+&@#\/%=~|]
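To see what the protocol-or-www. variant accepts, here is a small sketch in PHP. contains_url() is a hypothetical helper name, and the i modifier is carried over from the original pattern.

```php
<?php
/**
 * Hypothetical helper: true when $text contains something that looks like
 * a URL starting with a protocol (http, https, ftp, ftps) or with "www.".
 */
function contains_url(string $text): bool
{
    $pattern = '/\b(?:(?:f|ht)tps?:\/\/|www\.)'
             . '[-\w+&@#\/%?=~|!:,.;]*[-\w+&@#\/%=~|]/i';

    return preg_match($pattern, $text) === 1;
}

var_dump(contains_url('see http://example.com/page')); // protocol present
var_dump(contains_url('visit www.example.com'));       // www. present
var_dump(contains_url('example.com'));                 // bare domain
```

A bare domain such as example.com is still not detected, which is the trade-off described above.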

PHP - Matching city/street names in PHP using unicode regex

I have this expression:
'/^([\p{L}\p{Mn}\p{Pd}\'\x{2019}]+|\d+)(\s+([\p{L}\p{Mn}\p{Pd}\'\x{2019}]+|\d+))*$/u'
Its goal is to match names and numbers like "6 de diciembre" or "Mariana de Jesús" (using numbers and unicode characters).
The issue is that it also matches typos like: "6de diciembre" [1]. Mixing numbers and letters in the same word should not be allowed (no, we do not have expressions like "6th" in these cases).
Question: What character classes should I use? I need digits and these unicode letters, but not mixed, not concatenated.
Notes: I posted a similar question regarding this topic before, but the issue was slightly different and cannot expect the same kind of answer.
[1] I can't believe I MUST clarify this point: typos should not be matched - unless explicitly said, a regex is to find an expected regular format in a string
The expression works well. I had a totally different issue in which my validation handler was not called.
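For anyone who wants to check the expression in isolation, here is a minimal sketch. is_street_name() is a hypothetical wrapper name, and the inputs are assumed to be UTF-8.

```php
<?php
/**
 * Hypothetical wrapper around the expression from the question.
 * Each whitespace-separated word must be all letters (plus marks, dashes
 * and apostrophes) or all digits, never a mix of both.
 */
function is_street_name(string $name): bool
{
    $word    = '[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+|\d+';
    $pattern = '/^(' . $word . ')(\s+(' . $word . '))*$/u';

    return preg_match($pattern, $name) === 1;
}

var_dump(is_street_name('6 de diciembre'));
var_dump(is_street_name('Mariana de Jesús'));
var_dump(is_street_name('6de diciembre'));
```

The mixed word "6de" is rejected because after \d+ consumes "6" the pattern requires whitespace or the end of the string, and "d" is neither.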
After a bit of experimentation I noticed that if I shorten the name of the validation handler function, the Drupal 7 form can use it as a handler instead of silently discarding it. Yes, ladies and gentlemen, my handler was named toyotaec_form_webform_client_form_trabaja_con_nosotros_validate and assigned as:
`$form['#validate'][] = 'toyotaec_form_webform_client_form_trabaja_con_nosotros_validate';`.
Slicing the 'con_nosotros_' part out on both sides made it work, and led me to this conclusion.
.: Drupal has a(n absolutely senseless) limit for those identifiers, while PHP has not.
.: Drupal truncates the input when you assign it.
.: Drupal raises no error for a nonexistent function (the truncated name does not exist as a function).
Rantful (but logical) conclusion: For this and many previous issues I conclude that Drupal sucks.

PHP RegEx: a Pattern to Validate the Second Level Domain

Note: this is a theoretical question about PHP flavor of regex, not a practical question about validation in PHP. I am merely using Domain Names for lack of a better example.
"Second Level Domain" refers to the combination of letters, numbers, period signs, and/or dashes that are placed between http:// or http://www. and .com (.co, .info, .etc) .
I am only interested in second level domains that use English version of Latin alphabet.
This pattern:
[A-Za-z0-9.-]+
matches valid domain names, such as stackoverflow, StackOverflow, stackoverflow.co (as in stackoverflow.co.uk), stack-overflow, or stackoverflow123.
However, the same pattern would also match something like stack...overflow, stack---over--flow, ........ , -------- , or even . and -.
How can that pattern be rewritten, to indicate that period signs and dashes, even though they can be used multiple times in a node,
cannot be used without other symbols,
cannot be placed twice or more side by side with each other,
and cannot be placed in the beginning or end of the node?
Thank you in advance!
I think something like this should do the trick:
^([a-zA-Z0-9]+[.-])*[a-zA-Z0-9]+$
What this tries to do is
start at the beginning of string, end at the end
one or more letter or digit
followed by either dot or hyphen
the group above repeated 0 or more times
followed by one or more letter or digit
Assuming that you are looking for a regex that does not allow two consecutive . or - you can use:
^[a-zA-Z0-9]+([-.][a-zA-Z0-9]+)*$
regexr demo
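A quick way to try the pattern against the examples from the question, as a sketch. is_valid_node() is a hypothetical helper name.

```php
<?php
/**
 * Hypothetical helper: one or more alphanumeric runs, joined by a single
 * dot or hyphen, with no separator at the start or the end.
 */
function is_valid_node(string $node): bool
{
    return preg_match('/^[a-zA-Z0-9]+([-.][a-zA-Z0-9]+)*$/', $node) === 1;
}

$samples = ['stackoverflow', 'stack-overflow', 'stackoverflow.co',
            'stack...overflow', 'stack---over--flow', '.', '-', 'stack-'];

foreach ($samples as $sample) {
    printf("%-20s %s\n", $sample, is_valid_node($sample) ? 'valid' : 'invalid');
}
```

Every separator must be followed by at least one letter or digit, which is what rules out doubled, leading and trailing dots or hyphens.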

REGEX URL regular expression [duplicate]

Possible Duplicate:
Regular expression for browser Url
Is this regex perfect for any URL?
preg_match_all(
'/([www]+(\.|dot))?[a-zA-Z0-9_\.-]+(\.|dot){1,}[com|net|org|info\.]+((\.|dot){0,}[a-zA-Z]){0,}+/i',
$url, $regp);
Don't use regex for that. If you can't resist, a valid one can be found here:
What is the best regular expression to check if a string is a valid URL?
but that regex is ridiculous. Try to use your framework for that, if you can (the Uri class in .NET, for example).
No. In fact it doesn't match URLs at all. It's trying to detect hostnames written in text, like www.example.com.
Its approach is to try to detect some common known TLDs, but:
[com|net|org|info\.]+
is actually a character group, allowing any sequence of characters from the list |.comnetrgif. Probably this was meant:
((com|net|org|info)\.)+
and also [www] is similarly wrong, plus the business with dot doesn't really make any sense.
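The difference between the character class and the grouped alternation is easy to see with a couple of inputs. This is only a sketch; both patterns are anchored here so that the whole input has to match.

```php
<?php
// Character class: any run of the characters c o m | n e t r g i f .
$charClass = '/^[com|net|org|info\.]+$/';

// Grouped alternation: one or more of the four words, each followed by a dot.
$group = '/^((com|net|org|info)\.)+$/';

$inputs = ['com.', 'com.net.', 'fing|mo', 'com'];

foreach ($inputs as $input) {
    printf(
        "%-10s class=%d group=%d\n",
        $input,
        preg_match($charClass, $input),
        preg_match($group, $input)
    );
}
```

Note that the grouped version, as written, requires a dot after every word, so a plain "com" does not match it.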
But this is in general a really bad idea. There are way more TLDs in common use than just those and the 2-letter CCTLDs. Also many/most of the CCTLDs don't have a second-level domain of com/net/org/info. This expression will fail to match those, and will match a bunch of other stuff that's not supposed to be a hostname.
In fact the task of detecting hostnames is basically impossible to do, since a single word can be a hostname, as can any dot-separated sequence of words. (And since internationalised domain names were introduced, almost anything can be a hostname, eg. 例え.テスト.)
'any' url is a tough call. In OZ you have .com.au, in the UK it is .co.uk Each country has its own set of rules, and they can change. .xxx has just been approved. And non-ascii characters have been approved now, but I suspect you don't need that.
I wonder why you want validation that is so tight. Many URLs that are right will be excluded, and it does not exclude all incorrect URLs. www.thisisnotavalidurl.com would still be accepted.
I would suggest
A) using a looser check, such as ([a-zA-Z0-9_.-].)*[a-zA-Z0-9_.-] (or something similar), just as a sanity check
B) using a reverse lookup to check if the URL is actually valid if you want to only allow actual real urls.
Oh, and I find this: http://www.fileformat.info/tool/regex.htm to be a really useful tool if I am developing regex, which I am not great at.
[www]+ should be changed for (www)?
(\.|dot){1,} - one or more? Maybe you wanted ([a-zA-Z0-9_\.-]+(\.|dot)){1,}
A URL also has a protocol like http, which you're missing. You're also missing a lot of TLDs, as already mentioned.
Something like an escaped space (%20) would also not be recognized.
Port numbers can also appear in an URL (e.g. :80)
No, and you can't create a REGEX that will parse any URI (or URL or URN) - the only way to parse them properly is to read them as per the spec of RFC-3986
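In PHP, one way to avoid a hand-written regex is to split the string with the built-in parse_url() and then inspect the parts. This is only a sketch: parse_url() does not validate and is not a complete RFC 3986 implementation, and the URL below is a made-up example.

```php
<?php
// Made-up example URL containing the pieces mentioned in the answers:
// protocol, credentials, port, an escaped space, query and fragment.
$url   = 'http://user:secret@www.example.com:8080/a%20b/c?x=1&y=2#top';
$parts = parse_url($url);

// $parts is an associative array with one key per component found:
// scheme, host, port, user, pass, path, query, fragment.
print_r($parts);

// Individual components can then be checked separately, for instance
// restricting the scheme to a known list.
$schemeOk = in_array($parts['scheme'] ?? '', ['http', 'https', 'ftp'], true);
var_dump($schemeOk);
```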

Properly Matching a IDN URL

I need help building a regular expression that can properly match a URL inside free text.
scheme
One of the following: ftp, http, https (is ftps a protocol?)
optional user (and optional pass)
host (with support for IDNs)
support for www and sub-domain(s) (with support for IDNs)
basic filtering of TLDs ([a-zA-Z]{2,6} is enough I think)
optional port number
path (optional, with support for Unicode chars)
query (optional, with support for Unicode chars)
fragment (optional, with support for Unicode chars)
Here is what I could find out about sub-domains:
"A "subdomain" expresses relative dependence, not absolute dependence: for example, wikipedia.org comprises a subdomain of the org domain, and en.wikipedia.org comprises a subdomain of the domain wikipedia.org. In theory, this subdivision can go down to 127 levels deep, and each DNS label can contain up to 63 characters, as long as the whole domain name does not exceed a total length of 255 characters."
Regarding the domain name itself, I couldn't find any reliable source, but I think the regular expression for non-IDNs (I'm not sure how to write an IDN-compatible version) is something like:
[0-9a-zA-Z][0-9a-zA-Z\-]{2,62}
Can someone help me out with this regular expression or point me to a good direction?
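The length limits quoted above are easier to enforce in plain code than inside the regular expression. A minimal sketch, assuming the limits are counted on the ASCII form of the name; within_dns_limits() is a hypothetical helper name.

```php
<?php
/**
 * Hypothetical helper: checks only the limits from the quote above.
 * At most 127 labels, 1 to 63 characters per label, 255 in total.
 * It does not check which characters the labels contain.
 */
function within_dns_limits(string $asciiHost): bool
{
    if ($asciiHost === '' || strlen($asciiHost) > 255) {
        return false;
    }

    $labels = explode('.', $asciiHost);
    if (count($labels) > 127) {
        return false;
    }

    foreach ($labels as $label) {
        $length = strlen($label);
        if ($length < 1 || $length > 63) {
            return false; // empty label (e.g. "a..b") or label too long
        }
    }

    return true;
}

var_dump(within_dns_limits('en.wikipedia.org'));
var_dump(within_dns_limits(str_repeat('a', 64) . '.org'));
```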
John Gruber, of Daring Fireball fame, had a post recently that detailed his quest for a good URL-recognizing regex string. What he came up with was this:
\b(([\w-]+://?|www[.])[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/)))
Which apparently does OK with Unicode-containing URLs, as well. You'd need to modify it slightly to get the rest of what you're looking for -- the scheme, username, password, etc. Alan Storm wrote a piece explaining Gruber's regex pattern, which I definitely needed (regex is so write-once-have-no-clue-how-to-read-ever-again!).
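Here is a sketch of how that pattern could be used from PHP. extract_urls() is a hypothetical helper name, and the ~ delimiter and the u modifier are my additions so that the slashes need no escaping and UTF-8 input is handled.

```php
<?php
/**
 * Hypothetical helper: returns every substring of $text matched by the
 * pattern quoted above. Assumes $text is valid UTF-8.
 */
function extract_urls(string $text): array
{
    $pattern = '~\b(([\w-]+://?|www[.])[^\s()<>]+'
             . '(?:\([\w\d]+\)|([^[:punct:]\s]|/)))~u';

    preg_match_all($pattern, $text, $matches);

    return $matches[0] ?? [];
}

print_r(extract_urls('See http://example.com/path, and (www.example.org).'));
print_r(extract_urls('http://президент.рф/'));
```

The trailing comma, the closing parenthesis and the final full stop are left out of the matches, because the last character of a match must be a slash or a non-punctuation character.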
If you require the protocol and aren't worried too much about false positives, by far the easiest thing to do is match all non-whitespace characters around ://
This will get you most of the way there. If you need it more refined please provide test data.
(ftp|https?)://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?
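For comparison, the simplest approach mentioned above, matching all non-whitespace characters around ://, can be sketched like this. find_scheme_urls() is a hypothetical helper name.

```php
<?php
/**
 * Hypothetical helper: returns every run of non-whitespace characters
 * that contains "://" with at least one character on each side.
 */
function find_scheme_urls(string $text): array
{
    preg_match_all('~\S+://\S+~u', $text, $matches);

    return $matches[0] ?? [];
}

print_r(find_scheme_urls('go to ftp://files.example.com/a now'));
print_r(find_scheme_urls('see (http://example.com).'));
```

The second call shows the kind of false positive to expect: surrounding punctuation such as parentheses and the final full stop ends up inside the match.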
