RegEx extract website url from email address w/ sub-sub-domain - php

We are trying to extract from an email list a valid url for that organization.
abc#charleston.k12.il.us is easy, but sometimes we have
someone#u40gw.effingham.k12.il.us where the u40gw is a subdomain for internal mail.
Another example is someone#mail.meridian223.org or someone#athletics.msstate.edu
What would be the most efficient way to capture the .edu + the preceding name only, without additional subdomains, or in the case of high schools the whole part k12.il.us plus the preceding name only?
Tried so far:
/#(([a-zA-Z0-9]*)([.])([a-zA-Z0-9]*)|#([a-zA-Z0-9]*)([.])([a-zA-Z0-9]*)([.])([a-zA-Z0-9]*)([.])([a-zA-Z0-9]*)([.])([a-zA-Z0-9]*))/

You can try the following regex pattern:
#.*?([^.]+[.]\w{3}|[^.]+[.]k12[.]il[.]us)$
Here you can replace \w{3} with your list of possible extensions, such as org, edu, net and so on. For example:
#.*?([^.]+[.](edu|org|net|info|com)|[^.]+[.]k12[.]il[.]us)$
You can see it working on regexr.com
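A minimal PHP sketch of how the second pattern might be applied, assuming the `#` in the sample addresses stands in for `@`. The helper name `baseDomain` is hypothetical, the `/i` flag and the non-capturing group around the extension list are small additions, and the extension list is the one from the answer.

```php
<?php
// Hypothetical helper: returns the organisation's base domain,
// or null when neither alternative of the pattern matches.
function baseDomain(string $email): ?string
{
    // Same idea as the answer's pattern; the lazy .*? skips extra subdomains.
    $pattern = '/@.*?([^.]+[.](?:edu|org|net|info|com)|[^.]+[.]k12[.]il[.]us)$/i';

    return preg_match($pattern, $email, $m) === 1 ? strtolower($m[1]) : null;
}

$samples = [
    'abc@charleston.k12.il.us',
    'someone@u40gw.effingham.k12.il.us',
    'someone@mail.meridian223.org',
    'someone@athletics.msstate.edu',
];

foreach ($samples as $email) {
    echo $email, ' => ', baseDomain($email), PHP_EOL;
}
```

Addresses whose ending is not in the list (for example a different state's k12 domain) come back as null, so the list would need extending for real data.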

Related

PHP regex to simplify webmail referrer addresses?

I'm trying to aggregate stats on referrers to my site, to give me a simple display of top referrers. Unfortunately referrer data is untrustworthy, and often dirty, so I'm just trying to make a good faith attempt to get something like usable data.
I've already filtered bad urls, and used url_parts to get the host portion of each url. I've then stripped common aliased subdomains, and social media url-shorteners, like t.co or fb.me.
The big issue that remains is webmail. Many webmail providers shunt their users to a sub-sub domain as soon as they log in, for load-balancing. This is easy to filter for mail services like yahoo, as they are all something.something.mail.yahoo.com, so I can just check if the third from last segment is "mail" or a similar subdomain, and strip all previous segments.
But now I am left with the hard cases, subdomains like:
webmaila (like webmaila.juno.com)
email16 (like email16.secureserver.net)
webmailb (like webmailb.netzero.net)
I need to find entries that start with 'mail', 'webmail', 'email', or 'mailbox', followed by any string, and strip the string, leaving me with just the appropriate prefix.
How can I do that?
echo preg_replace('#^(webmaila|email16|webmailb)(.+)?#', '$1', $string);
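The one-liner above lists the three sample subdomains literally. A sketch that instead keys on the four prefixes named in the question might look like this; the function name `collapseWebmailHost` is hypothetical, and it assumes the input is a bare host name. Note that it would also shorten unrelated hosts that merely start with one of the prefixes (mailchimp.com would become mail.com), so it probably belongs after the other filters.

```php
<?php
// Hypothetical helper: reduce "webmaila.juno.com" to "webmail.juno.com".
function collapseWebmailHost(string $host): string
{
    // Longer prefixes are listed first so "mailbox" is not cut down to "mail".
    // [^.]* consumes whatever follows the prefix inside the first label only.
    $result = preg_replace('/^(webmail|mailbox|email|mail)[^.]*(?=\.)/i', '$1', $host);

    return $result ?? $host; // preg_replace returns null on error
}

foreach (['webmaila.juno.com', 'email16.secureserver.net', 'webmailb.netzero.net'] as $host) {
    echo collapseWebmailHost($host), PHP_EOL;
}
```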

regexp not catching all emails

I am trying to get emails in a php program and I am using the following regexp
([a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b)siU
This appears to be working fine for getting your standard emails. Such as me # gmail.com or you # hotmail.com
Where this fails is on emails with ending such as co.uk. Now I have tried adding co.uk in my regexp as such
([a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*#(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|co.uk|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b)siU
But that just gives me the same output as the original regexp. Where the output of the email is, you#co . I also tried just adding in uk. What am I missing on this one? Is it the second period throwing it off?
Ideally I am trying to make it catch all emails with .com .net .org co.uk .au .ca. Basically I am searching for all US, UK, AU and CA emails. Can anyone spot what my mistake is to be able to output non US emails properly like you # yoursite.co.uk instead of you # yoursite.co
Thank you. The spaces in the emails shown for example are only there to get this to post.
Edit: I am not trying to validate the emails. It's a series of emails that can be anything, held in an array, and I am trying to catch only specific ones for a database. Sorry for not making that clear initially.
Edit2: Here is my working string for my issue. Thanks everyone
^([a-zA-Z0-9_\-\.]+)#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$^
Do NOT use regex to validate email addresses. Use libraries that do it correctly.
See I Knew How To Validate An Email Address Until I Read The RFC for more information.
|co.uk|
Try this instead (outside of [], an unescaped dot matches any character except a line break):
|co\.uk|
You have to escape the dot like this.
co\.uk
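Escaping the dot is worth doing, but the truncation to you#co may have another cause that the answers here do not raise: the `U` (ungreedy) modifier in the question's pattern. With `U`, the repeated domain-label group stops after the first label, and `[A-Z]{2}` (case-insensitive because of `i`) then happily matches "co". The sketch below uses simplified stand-ins for the question's sub-patterns and assumes the `#` in the addresses stands for `@`.

```php
<?php
// Simplified stand-ins for the question's sub-patterns (assumptions, not the original regex).
$local  = '[a-z0-9._%+-]+';
$domain = '(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+';
$tld    = '(?:[a-z]{2}|com|org|net)\b';

$text = 'write to you@yoursite.co.uk today';

// Same pattern twice; only the U modifier differs.
preg_match('/' . $local . '@' . $domain . $tld . '/siU', $text, $ungreedy);
preg_match('/' . $local . '@' . $domain . $tld . '/si', $text, $greedy);

echo $ungreedy[0], PHP_EOL; // you@yoursite.co
echo $greedy[0], PHP_EOL;   // you@yoursite.co.uk
```

If that is what is happening, dropping `U` should let the existing two-letter alternative pick up "uk", "au" and "ca" without listing co.uk at all.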
If you are trying to match only valid emails this regexp should do the trick
(\s)?(([^\s]+)\#([^\s]+))
What am I missing on this one?
You are adding terms to a non-capturing group. So how can any output based on your regex contain anything in a non-capturing group? Not to mention the mistake in the term you added that nacholibre mentioned.
Don't use regex for email. You won't like it, and you will fail.
filter_var()
http://php.net/manual/en/function.filter-var.php
FILTER_VALIDATE_EMAIL
http://www.php.net/manual/en/filter.filters.validate.php
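A rough sketch of how `filter_var()` with `FILTER_VALIDATE_EMAIL` could be combined with a plain suffix check to keep only the endings the question lists. The function name `keepAddress` and the sample addresses are made up for illustration, and `@` is assumed where the post shows `#`.

```php
<?php
// Hypothetical helper: true when $email is syntactically valid
// and its domain ends in one of the given suffixes.
function keepAddress(string $email, array $suffixes): bool
{
    if (filter_var($email, FILTER_VALIDATE_EMAIL) === false) {
        return false;
    }

    // Everything after the last "@", lowercased for comparison.
    $domain = strtolower(substr(strrchr($email, '@'), 1));

    foreach ($suffixes as $suffix) {
        $needle = '.' . $suffix;
        if (substr($domain, -strlen($needle)) === $needle) {
            return true;
        }
    }

    return false;
}

$suffixes = ['com', 'net', 'org', 'co.uk', 'au', 'ca'];
$emails   = ['me@gmail.com', 'you@yoursite.co.uk', 'bob@example.de', 'not an email'];

$kept = array_values(array_filter($emails, function ($email) use ($suffixes) {
    return keepAddress($email, $suffixes);
}));

print_r($kept);
```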

Looking for a PHP regex or function to filter variations using . of an email for security

I am getting spam due to gmail allowing the use of . in their emails, so someone like this spammer.
q.i.n.ghu.im.i.n.g.o.u.r#gmail.com
can get through by removing and/or adding another period in his naming structure.
This happens to be on a Joomla install, so I am specifically looking to create a component that I can add to multiple sites, or a simple regex I can add inline to existing code. Also, is anything being done about this? It amounts to a loosely typed email address, which seems crazy to me.
If your goal is to match this address against the others that are equivalent to it (because you've already got them blacklisted) then I'd simply normalize the address to its most basic state before storing it. Lowercase it, split it at the #, and if the right side is "gmail.com" then remove all dots from the left side and put the halves back together.
start with JOE.SCHMOE#GMAIL.COM
lowercase to joe.schmoe#gmail.com
split to joe.schmoe and gmail.com
since right side is gmail.com, remove dots from left
reassemble to joeschmoe#gmail.com
Now you've got the base address that you can block/ban/whatever.
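The steps above can be sketched in PHP roughly as follows. The function name `normalizeEmail` is hypothetical, and `@` is assumed where the post shows `#`.

```php
<?php
// Hypothetical helper implementing the lowercase / split / strip-dots / reassemble steps.
function normalizeEmail(string $email): string
{
    $email = strtolower(trim($email));

    // Split at the last "@"; leave the input alone if there is none.
    $at = strrpos($email, '@');
    if ($at === false) {
        return $email;
    }

    $local  = substr($email, 0, $at);
    $domain = substr($email, $at + 1);

    // Only gmail.com is known (from the question) to ignore dots in the local part.
    if ($domain === 'gmail.com') {
        $local = str_replace('.', '', $local);
    }

    return $local . '@' . $domain;
}

echo normalizeEmail('JOE.SCHMOE@GMAIL.COM'), PHP_EOL; // joeschmoe@gmail.com
```

Storing and comparing the normalized form means a ban on one spelling covers every dotted variant of the same address.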
You could do something simple like: /^(?:[^#]+\.){5,}[^#]+#(?:[^#]+\.)+[^#]+/
This is just a quick toss-up, not meant for validation, but rather a pointer to tell you if an email is sketchy. The key here is the {5,} quantifier, which says that if the part before the # has 5 or more dots (like a.b.c.d.e.f) it will match. In other words, it will be flagged as sketchy.
I hope this helps!
Explanation: http://regex101.com/r/lB5vG3
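For what it's worth, here is how that pattern might be wrapped in PHP, with `@` assumed in place of the `#` shown in the post. The function name `looksSketchy` is made up.

```php
<?php
// Hypothetical helper: flags addresses with five or more dots before the "@".
function looksSketchy(string $email): bool
{
    return preg_match('/^(?:[^@]+\.){5,}[^@]+@(?:[^@]+\.)+[^@]+/', $email) === 1;
}

var_dump(looksSketchy('q.i.n.ghu.im.i.n.g.o.u.r@gmail.com')); // bool(true)
var_dump(looksSketchy('joe.schmoe@gmail.com'));               // bool(false)
```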

regex help... php check entry format

I'm using PHP to develop an application, but I am running into some issues with regex...
I found a few sites that explain it, but for some reason it is over my head. Can someone please help explain regex arguments?
I uploaded a sample of what I am working on here...
First, click on the "+" button at top right to get to the add content view.
Basically, I need it so when you submit from this form, php will check that the values are formatted correctly.
Domain: this can be .com, .co, .biz, .info, etc... User can enter the prefix, like a url, and php gets rid of it... so the ending strings in the array are just domain.com
domain1.com
somedomain.biz
mydomain.co
Redirect: for this one, PHP splits on the ',' so we are left with the IP and the domain key as separate strings. The IP can have 2-3 digits per section, so ###.##.##.### or even ##.##.##.##, and the domain key is a varchar (not so important).
##.##.##.##, domainkey
###.###.###.###, domainkey
Solution for redirect:
(\d{1,3}\.){3}\d{1,3}
/24's: this is similar to the redirect IP, but the end will always end in '0/24'
##.##.###.0/24
##.##.###.0/24
Names: This one should be the easiest. It can only be letters, no numbers, any length.
randomname
thisisaname
May I suggest using some software or even website that allows you to test your regex. Such as:
The Regex Coach
Regexpal
RegExr
Expresso
RegexDesigner
etc
It really depends on how strict you want to get with it and how fancy you want to make your regex.
/((\d{1,3})\.){3}(\d{1,3})(\/\d{2})?/
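As a rough sketch, the formats described in the question could be checked like this. The array keys, the function names and the exact patterns are assumptions for illustration. The patterns only check the shape of the input (for example `\d{1,3}` does not cap each part at 255), and the domain key is assumed to be a run of word characters.

```php
<?php
// Hypothetical pattern table, one entry per form field described in the question.
$patterns = [
    'redirect' => '/^(\d{1,3}\.){3}\d{1,3},\s*\w+$/', // "##.##.##.##, domainkey"
    'cidr24'   => '/^(\d{1,3}\.){3}0\/24$/',          // "##.##.###.0/24"
    'name'     => '/^[a-zA-Z]+$/',                    // letters only, any length
];

// True when $value matches the pattern registered for $type.
function matchesFormat(string $type, string $value, array $patterns): bool
{
    return isset($patterns[$type]) && preg_match($patterns[$type], trim($value)) === 1;
}

// Strips an optional "http://", "https://" and "www." so only "domain.com" is left.
function stripUrlPrefix(string $input): string
{
    return preg_replace('#^(https?://)?(www\.)?#i', '', trim($input)) ?? $input;
}

var_dump(matchesFormat('redirect', '12.34.56.789, domainkey', $patterns));
var_dump(matchesFormat('cidr24', '12.34.567.0/24', $patterns));
var_dump(matchesFormat('name', 'randomname', $patterns));
echo stripUrlPrefix('http://www.domain1.com'), PHP_EOL;
```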

PHP regex for url validation, filter_var is too permissive

First let's define a "URL" according to my requirements.
The only protocols optionally allowed are http:// and https://
then a mandatory domain name like stackoverflow.com
then optionally the rest of url components (path, query, hash, ...)
For reference a list of valid and invalid url's according to my requirements
VALID
stackoverflow.com
stackoverflow.com/questions/ask
https://stackoverflow.com/questions/ask
http://www.amazon.com/Computers-Internet-Books/b/ref=bhp_bb0309A_comint2?ie=UTF8&node=5&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=browse&pf_rd_r=0AH7GM29WF81Q72VPFDH&pf_rd_t=101&pf_rd_p=1273387142&pf_rd_i=283155
amazon.com/Computers-Internet-Books/b/ref=bhp_bb0309A_comint2?ie=UTF8&node=5&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=browse&pf_rd_r=0AH7GM29WF81Q72VPFDH&pf_rd_t=101&pf_rd_p=1273387142&pf_rd_i=283155
http://test-site.com (filter_var reject this!!! I have domain names with dashes )
INVALID
http://www (php filter_var allow this, yes i know is a valid url)
google
http://www..des (php filter_var allow this)
Any url with disallowed characters in the domain name
For completeness here is my php version: 5.3.2-1ubuntu4.2
As a starting point you can use this one, it's for JS, but it's easy to convert it to work for PHP preg_match.
/^(https?\://)?(www\.)?([a-z0-9]([a-z0-9]|(\-[a-z0-9]))*\.)+[a-z]+$/i
For PHP should work this one:
$reg = '#^(https?\://)?(www\.)?([a-z0-9]([a-z0-9]|(\-[a-z0-9]))*\.)+[a-z]+$#i';
This regexp anyway validates only the domain part, but you can work on this or split the url at the 1st slash '/' (after "://") and validate separately the domain part and the rest.
BTW: It would validate also "http://www.domain.com.com" but this is not an error because a subdomain url could be like: "http://www.subdomain.domain.com" and it's valid! And there is almost no way (or at least no operatively easy way) to validate for proper domain tld with a regex because you would have to write inline into your regex all possible domain tlds ONE BY ONE like this:
/^(https?\://)?(www\.)?([a-z0-9]([a-z0-9]|(\-[a-z0-9]))*\.)+(com|it|net|uk|de)$/i
(This last one, for instance, would accept only domains ending in .com, .it, .net, .uk or .de.) New TLDs keep appearing, so you would have to adjust your regex every time one comes out, which is a pain in the neck!
You could use parse_url to break up the address into its components. While it's explicitly not built to validate a URL, analyzing the resulting components and matching them against your requirements would at least be a start.
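One way the two suggestions might be combined, as a sketch only: let `parse_url()` split the address, then apply the domain regex from the earlier answer to the host alone. The function name `isAcceptableUrl` is hypothetical, and prepending "http://" when no scheme is present is an assumption made so that `parse_url()` can find the host.

```php
<?php
// Hypothetical helper: http/https only (scheme optional), host must look like a domain name.
function isAcceptableUrl(string $url): bool
{
    // parse_url() only reports a host when a scheme is present, so add one if missing.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }

    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }

    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false;
    }

    // Domain-only pattern from the answer above, minus the scheme and www parts.
    $domain = '#^([a-z0-9]([a-z0-9]|(\-[a-z0-9]))*\.)+[a-z]+$#i';

    return preg_match($domain, $parts['host']) === 1;
}

foreach (['stackoverflow.com/questions/ask', 'http://test-site.com', 'http://www', 'google'] as $url) {
    echo $url, ' => ', var_export(isAcceptableUrl($url), true), PHP_EOL;
}
```

The path, query and fragment are not checked at all here; they could be validated separately from the other keys `parse_url()` returns.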
It may vary but in most of the cases you don't really need to check the validity of any URL.
If it's vital information and you trust your user enough to let him give it through a URL, you can trust him enough to give a valid URL.
If it isn't vital information, then you just have to check for XSS attempts and display the URL that the user wanted.
You can add manually a "http://" if you don't detect one to avoid navigation problems.
I know, I don't give you an alternative as a solution, but maybe the best way to solve performance & validity problems is just to avoid unnecessary checks.
