I'm trying to aggregate stats on referrers to my site, to give me a simple display of top referrers. Unfortunately referrer data is untrustworthy, and often dirty, so I'm just trying to make a good faith attempt to get something like usable data.
I've already filtered bad URLs and used url_parts to get the host portion of each URL. I've then stripped common aliased subdomains and social media URL shorteners, like t.co or fb.me.
The big issue that remains is webmail. Many webmail providers shunt their users to a sub-subdomain as soon as they log in, for load balancing. This is easy to filter for mail services like Yahoo, as their hosts all look like something.something.mail.yahoo.com, so I can just check whether the third-from-last segment is "mail" or a similar subdomain and strip all previous segments.
But now I am left with the hard cases, subdomains like:
webmaila (like webmaila.juno.com)
email16 (like email16.secureserver.net)
webmailb (like webmailb.netzero.net)
I need to find entries that start with 'mail', 'webmail', 'email', or 'mailbox', followed by any string, and strip the string, leaving me with just the appropriate prefix.
How can I do that?
echo preg_replace('#^(webmail|mailbox|email|mail)[^.]*#', '$1', $string); // webmaila.juno.com -> webmail.juno.com
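As a slightly fuller sketch of the same idea, the function below applies both rules from the question: cut everything in front of a "mail" label that sits third from last, then collapse a leading label such as webmaila or email16 to its prefix. The name normalizeMailHost is made up for illustration, and the code assumes PHP 7 or later. It only rewrites hosts with at least three labels, so a bare domain like mailchimp.com is left alone; that cut-off is my assumption, not something the question specifies.

```php
<?php
// Hypothetical helper: collapse load-balanced webmail hosts to one name.
function normalizeMailHost(string $host): string
{
    $labels = explode('.', strtolower($host));
    $count = count($labels);

    // something.something.mail.yahoo.com -> mail.yahoo.com
    if ($count > 3 && $labels[$count - 3] === 'mail') {
        $labels = array_slice($labels, $count - 3);
    }

    // webmaila.juno.com -> webmail.juno.com (subdomains only, see lead-in)
    if (count($labels) >= 3) {
        // "mailbox" is listed before "mail" so the longer prefix wins.
        $labels[0] = preg_replace('/^(webmail|mailbox|email|mail).*$/', '$1', $labels[0]);
    }

    return implode('.', $labels);
}
```

Hosts that do not start with one of the four prefixes pass through unchanged, so it should be safe to run over the whole referrer list.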
We are trying to extract from an email list a valid url for that organization.
abc@charleston.k12.il.us is easy, but sometimes we have
someone@u40gw.effingham.k12.il.us where the u40gw is a subdomain for internal mail.
Another example is someone@mail.meridian223.org or someone@athletics.msstate.edu
What would be the most efficient way to capture the .edu + the preceding name only, without additional subdomains, or in the case of high schools the whole part k12.il.us plus the preceding name only?
Tried so far:
/@(([a-zA-Z0-9]*)([.])([a-zA-Z0-9]*)|@([a-zA-Z0-9]*)([.])([a-zA-Z0-9]*)([.])([a-zA-Z0-9]*)([.])([a-zA-Z0-9]*)([.])([a-zA-Z0-9]*))/
You can try the following regex pattern:
@.*?([^.]+[.]\w{3}|[^.]+[.]k12[.]il[.]us)$
Here you can replace \w{3} with your list of possible extensions, like org, edu, net, etc. For example:
@.*?([^.]+[.](edu|org|net|info|com)|[^.]+[.]k12[.]il[.]us)$
You can see it working on regexr.com
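In PHP the second pattern could be wrapped up roughly like this. It is only a sketch: organizationDomain is a made-up name, the extension list is the one from the example above, addresses are assumed to use the usual @ separator, and PHP 7.1 or later is assumed for the nullable return type.

```php
<?php
// Hypothetical helper: return the organization's domain for an address,
// or null when the address does not end in a known extension.
function organizationDomain(string $email): ?string
{
    // The lazy .*? skips any extra subdomains in front of the captured name.
    $pattern = '/@.*?([^.]+[.](?:edu|org|net|info|com)|[^.]+[.]k12[.]il[.]us)$/i';

    if (preg_match($pattern, $email, $matches) === 1) {
        return strtolower($matches[1]);
    }

    return null;
}
```

Anything that comes back null can be set aside for a manual look instead of being guessed at.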
I am getting spam because Gmail ignores the . character in addresses, so a spammer like this one:
q.i.n.ghu.im.i.n.g.o.u.r@gmail.com
can get through by removing and/or adding a period in the address.
This happens to be on a Joomla install, so I am specifically looking to create a component that I can add to multiple sites, or a simple regex to add inline to existing code. Also, is anything being done about this? It seems to amount to what could be termed a loosely typed email address, which is crazy to me.
If your goal is to match this address against the others that are equivalent to it (because you've already got them blacklisted) then I'd simply normalize the address to its most basic state before storing it. Lowercase it, split it at the @, and if the right side is "gmail.com" then remove all dots from the left side and put the halves back together.
start with JOE.SCHMOE@GMAIL.COM
lowercase to joe.schmoe@gmail.com
split to joe.schmoe and gmail.com
since right side is gmail.com, remove dots from left
reassemble to joeschmoe@gmail.com
Now you've got the base address that you can block/ban/whatever.
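The steps above could look something like this in PHP. Treat it as a sketch: normalizeEmail is a made-up name, the address is assumed to use the usual @ separator, and only the gmail.com domain is handled, exactly as in the steps.

```php
<?php
// Hypothetical helper: reduce an address to the base form used for blacklisting.
function normalizeEmail(string $email): string
{
    $email = strtolower(trim($email));

    // Split at the last @ so an odd local part cannot confuse the split.
    $at = strrpos($email, '@');
    if ($at === false) {
        return $email; // not an address, leave it alone
    }

    $local = substr($email, 0, $at);
    $domain = substr($email, $at + 1);

    // Gmail ignores dots in the local part, so drop them before comparing.
    if ($domain === 'gmail.com') {
        $local = str_replace('.', '', $local);
    }

    return $local . '@' . $domain;
}
```

Store and compare the normalized form only; the address the user typed can still be kept separately for sending mail.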
You could do something simple like: /^(?:[^@]+\.){5,}[^@]+@(?:[^@]+\.)+[^@]+/
This is just a quick toss-up, not meant for validation, but rather a pointer to tell you whether an email is sketchy. The key here is the {5,} quantifier, which says that if the part before the @ has 5 or more dots (like a.b.c.d.e.f) it will match, in other words be flagged as sketchy.
I hope this helps!
Explanation: http://regex101.com/r/lB5vG3
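If a regex feels heavy for this, counting the dots before the @ gives much the same signal. A rough sketch, with looksSketchy as a made-up name and the threshold of 5 taken from the quantifier above (PHP 7 or later assumed):

```php
<?php
// Hypothetical helper: flag addresses whose local part has many dots.
function looksSketchy(string $email, int $minDots = 5): bool
{
    $at = strrpos($email, '@');
    if ($at === false) {
        return false; // no @ at all, nothing to judge here
    }

    // Count dots in the part before the @ only.
    return substr_count(substr($email, 0, $at), '.') >= $minDots;
}
```

Like the regex, this is a hint for review rather than a validity check; plenty of real addresses contain a dot or two.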
I'm using PHP to develop an application, but I am running into some issues with regex...
I found a few sites that explain it, but for some reason it is over my head. Can someone please help explain regex arguments?
I uploaded a sample of what I am working on here...
First, click on the "+" button at top right to get to the add content view.
Basically, I need it so when you submit from this form, php will check that the values are formatted correctly.
Domain: this can be .com, .co, .biz, .info, etc. The user can enter a prefix, like in a URL, and PHP gets rid of it, so the ending strings in the array are just domain.com
domain1.com
somedomain.biz
mydomain.co
Redirect: with this one, PHP splits on the ',' so we are left with the IP and the domain key as separate strings. The IP can have 2-3 digits per section, so ###.##.##.###, or even ##.##.##.##, and the domain key is a varchar (not so important)
##.##.##.##, domainkey
###.###.###.###, domainkey
Solution for redirect:
(\d{1,3}\.){3}\d{1,3}
/24's: this is similar to the redirect IP, but it will always end in '0/24'
##.##.###.0/24
##.##.###.0/24
Names: This one should be the easiest. It can only be letters, no numbers, any length.
randomname
thisisaname
May I suggest using some software or a website that lets you test your regex, such as:
The Regex Coach
Regexpal
RegExr
Expresso
RegexDesigner
etc
It really depends on how strict you want to get with it and how fancy you want to make your regex.
/((\d{1,3})\.){3}(\d{1,3})(\/\d{2})?/
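To tie the four fields from the question together, here is one way the checks might be written. All function names are made up for illustration, the patterns are on the strict side, and PHP 7.1 or later is assumed, so take it as a sketch to adapt rather than a finished validator.

```php
<?php
// Hypothetical helpers, one per form field described in the question.

// "http://www.domain1.com/page" -> "domain1.com", or null if it is not a domain.
function cleanDomain(string $input): ?string
{
    $host = preg_replace('#^(?:https?://)?(?:www\.)?#i', '', trim($input));
    $host = strtolower(explode('/', $host, 2)[0]);

    return preg_match('/^(?:[a-z0-9-]+\.)+[a-z]{2,}$/', $host) === 1 ? $host : null;
}

// "12.34.56.78, domainkey" -> ['12.34.56.78', 'domainkey'], or null on bad input.
function parseRedirect(string $input): ?array
{
    $parts = array_map('trim', explode(',', $input, 2));
    if (count($parts) !== 2 || $parts[1] === '') {
        return null;
    }
    // filter_var also rejects out-of-range octets such as 999.1.1.1.
    if (filter_var($parts[0], FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) === false) {
        return null;
    }

    return $parts;
}

// True for blocks such as "10.20.130.0/24".
function isSlash24(string $input): bool
{
    if (preg_match('#^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.0/24$#', $input, $m) !== 1) {
        return false;
    }

    return max((int) $m[1], (int) $m[2], (int) $m[3]) <= 255;
}

// Letters only, at least one of them.
function isName(string $input): bool
{
    return preg_match('/^[a-zA-Z]+$/', $input) === 1;
}
```

The ^ and $ anchors matter here: without them a pattern only has to match somewhere inside the input, which lets trailing junk through.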
First, let's define a "URL" according to my requirements.
The only protocols optionally allowed are http:// and https://
then a mandatory domain name like stackoverflow.com
then optionally the rest of url components (path, query, hash, ...)
For reference, a list of valid and invalid URLs according to my requirements:
VALID
stackoverflow.com
stackoverflow.com/questions/ask
https://stackoverflow.com/questions/ask
http://www.amazon.com/Computers-Internet-Books/b/ref=bhp_bb0309A_comint2?ie=UTF8&node=5&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=browse&pf_rd_r=0AH7GM29WF81Q72VPFDH&pf_rd_t=101&pf_rd_p=1273387142&pf_rd_i=283155
amazon.com/Computers-Internet-Books/b/ref=bhp_bb0309A_comint2?ie=UTF8&node=5&pf_rd_m=ATVPDKIKX0DER&pf_rd_s=browse&pf_rd_r=0AH7GM29WF81Q72VPFDH&pf_rd_t=101&pf_rd_p=1273387142&pf_rd_i=283155
http://test-site.com (filter_var rejects this! I have domain names with dashes)
INVALID
http://www (PHP's filter_var allows this; yes, I know it is a valid URL)
google
http://www..des (PHP's filter_var allows this)
Any url with not allowed characters in the domain name
For completeness here is my php version: 5.3.2-1ubuntu4.2
As a starting point you can use this one, it's for JS, but it's easy to convert it to work for PHP preg_match.
/^(https?\://)?(www\.)?([a-z0-9]([a-z0-9]|(\-[a-z0-9]))*\.)+[a-z]+$/i
For PHP, this one should work:
$reg = '#^(https?\://)?(www\.)?([a-z0-9]([a-z0-9]|(\-[a-z0-9]))*\.)+[a-z]+$#i';
This regexp validates only the domain part, though. You can build on it, or split the URL at the first slash '/' (after "://") and validate the domain part and the rest separately.
BTW: It would also validate "http://www.domain.com.com", but this is not an error, because a subdomain URL could look like "http://www.subdomain.domain.com" and that is valid! And there is almost no way (or at least no operationally easy way) to validate for a proper domain TLD with a regex, because you would have to write all possible TLDs inline into your regex ONE BY ONE, like this:
/^(https?\://)?(www\.)?([a-z0-9]([a-z0-9]|(\-[a-z0-9]))*\.)+(com|it|net|uk|de)$/i
(this last one, for instance, would validate only domains ending with .com/.net/.de/.it/.co.uk). New TLDs come out all the time, so you would have to adjust your regex every time a new TLD comes out, and that's a pain in the neck!
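The split-then-validate idea could be sketched as follows. isAcceptableUrl is a made-up name, the domain pattern is the one from this answer with the scheme and www. parts removed (www is just another label here), and the path, query and hash are deliberately not checked at all. PHP 7 or later is assumed.

```php
<?php
// Hypothetical helper: accept a URL when its domain part looks sane.
function isAcceptableUrl(string $url): bool
{
    // Drop an optional http:// or https:// prefix; nothing else is allowed.
    $rest = preg_replace('#^https?://#i', '', $url);

    // Everything before the first "/", "?" or "#" is treated as the domain.
    $domain = preg_split('~[/?#]~', $rest, 2)[0];

    // Labels of letters and digits with single inner dashes, then a letters-only TLD.
    $domainPattern = '#^([a-z0-9]([a-z0-9]|(\-[a-z0-9]))*\.)+[a-z]+$#i';

    return preg_match($domainPattern, $domain) === 1;
}
```

Run against the lists in the question, this accepts the dashed test-site.com and rejects http://www, google and http://www..des.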
You could use parse_url to break up the address into its components. While it's explicitly not built to validate a URL, analyzing the resulting components and matching them against your requirements would at least be a start.
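A rough sketch of that approach follows. One thing to know is that parse_url only fills in the host when a scheme (or a leading //) is present, so scheme-less input gets http:// prepended first. The function name validatedHost and the per-label rule are my own assumptions, not part of parse_url itself, and PHP 7.1 or later is assumed.

```php
<?php
// Hypothetical helper: return the lowercased host if the URL meets the
// requirements from the question, otherwise null.
function validatedHost(string $url): ?string
{
    // Without a scheme, parse_url() would report everything as 'path'.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $url) !== 1) {
        $url = 'http://' . $url;
    }

    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }
    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return null;
    }

    // Require at least two labels, each made of letters, digits and inner dashes.
    $labels = explode('.', $parts['host']);
    if (count($labels) < 2) {
        return null;
    }
    foreach ($labels as $label) {
        if (preg_match('/^[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$/i', $label) !== 1) {
            return null;
        }
    }

    return strtolower($parts['host']);
}
```

The remaining components (path, query, fragment) are available in the same array if they need their own checks.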
It may vary, but in most cases you don't really need to check the validity of a URL.
If it's vital information and you trust your user enough to let him provide it as a URL, you can trust him enough to give a valid URL.
If it isn't vital information, then you just have to check for XSS attempts and display the URL that the user wanted.
You can manually add an "http://" if you don't detect one, to avoid navigation problems.
I know I'm not giving you an alternative as a solution, but maybe the best way to solve performance and validity problems is just to avoid unnecessary checks.
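For that lighter approach, something along these lines may be all that is needed: prepend http:// when no http(s) scheme is present and escape the result before putting it into HTML. displayableLink is a made-up name and PHP 7 or later is assumed.

```php
<?php
// Hypothetical helper: make a user-supplied URL safe to echo into an href.
function displayableLink(string $url): string
{
    $url = trim($url);

    // Force an http(s) scheme; this also defuses inputs like "javascript:...".
    if (preg_match('#^https?://#i', $url) !== 1) {
        $url = 'http://' . $url;
    }

    // Escape quotes and angle brackets so the value cannot break out of the attribute.
    return htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
}
```

The returned string is meant for HTML output only; keep the raw value if it also goes into a database or a redirect header.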
I'm developing a PHP website, and currently my links are in a facebook-ish style, like so
me.com/profile.php?id=123
I'm thinking of moving to something more friendly to crawling search engines (like here at stackoverflow), something like:
me.com/john-adams
But how can I differentiate between two users with the same name? Or, more correctly, how does stackoverflow tell the difference between two questions with the same title?
I was thinking of doing something like
me.com/john-adams-123
and parsing the url.
Any other recommendations?
Stackoverflow does something similar to your me.com/john-adams-123 option, except more like me.com/123/john-adams where the john-adams part actually has no programmatic meaning. The way you're proposing is slightly better because the semantic-content-free numeric ID is farther to the right in the URL.
What I would do is store a unique slug (these SEO-friendly URL components are generally called slugs) in the user table and do the number append thing when necessary to get a unique one.
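A minimal sketch of the slug idea, assuming ASCII names and PHP 7 or later. slugify and uniqueSlug are made-up names, and the $isTaken callback stands in for whatever lookup your user table needs.

```php
<?php
// Hypothetical helper: "John Adams" -> "john-adams".
function slugify(string $text): string
{
    $slug = strtolower(trim($text));
    // Collapse every run of non-alphanumerics into a single dash.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);

    return trim($slug, '-');
}

// Hypothetical helper: append -2, -3, ... until the slug is free.
function uniqueSlug(string $name, callable $isTaken): string
{
    $base = slugify($name);
    $slug = $base;

    for ($n = 2; $isTaken($slug); $n++) {
        $slug = $base . '-' . $n;
    }

    return $slug;
}
```

A unique index on the slug column is still worth having, since two sign-ups at the same moment could otherwise pick the same value.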
In stack overflow's case, it's
http://stackoverflow.com/questions/975240/using-seo-friendly-links
http://stackoverflow.com/questions <- Constant prefix
/975240 <- Unique question id
using-seo-friendly-links <- Any text at all, defaults to title of question.
Facebook, on the other hand, has decided to just make everyone pick a unique ID. Then they are going to use that as a profile page. Something like http://facebook.com/p/username/. They are solving the problem of uniqueness between users, by just requiring it to be some string that the user picks that is unique among all existing users.
SO 'cheats' :-).
The link for your question is "Using SEO-friendly links" but "Using SEO-friendly links" also works.
The part after the number is the SEO friendly bit, but SO doesn't really care what's there. I think it defaults to the question title.
So in your case you could construct a link like:
me.com/123/john-adams
a second john adams would have a different Id and a unique URL like :
me.com/111/john-adams
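On the PHP side, picking the ID out of such a path and ignoring the name could be as small as this. idFromPath is a made-up name and PHP 7.1 or later is assumed.

```php
<?php
// Hypothetical helper: "/123/john-adams" -> 123; the text after the ID is ignored.
function idFromPath(string $path): ?int
{
    // Digits first, then an optional single segment of any text.
    if (preg_match('#^/(\d+)(?:/[^/]*)?/?$#', $path, $m) === 1) {
        return (int) $m[1];
    }

    return null;
}
```

Because only the number is used for the lookup, old links keep working even after a user changes their name.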
I would say that your proposed solution is better than Stack Overflow's, as it preserves content hierarchy:
me.com/john-adams-123
Usage of the unique ID before the username is simply nonsensical.
I would, however, recommend enforcement of content type:
me.com/john-adams-123.html
This will allow for consistent urls while serving a variety of content types.
Additionally, you could use base 36 (hexatrigesimal) for the unique ID, to further reduce the amount of unnecessary cruft in your URL, especially for large numbers, but this is often overkill :D
me.com/john-adams-123.html -> me.com/john-adams-3F.html
me.com/john-adams-1234567890.html -> me.com/john-adams-KF12OI.html
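PHP can do the base 36 conversion with built-ins. A small sketch, with encodeId and decodeId as made-up names and PHP 7 or later assumed:

```php
<?php
// Hypothetical helper: numeric ID -> short base 36 string.
function encodeId(int $id): string
{
    // base_convert() works on strings and returns lowercase digits.
    return strtoupper(base_convert((string) $id, 10, 36));
}

// Hypothetical helper: base 36 string -> numeric ID.
function decodeId(string $code): int
{
    // intval() with base 36 accepts upper- and lowercase letters.
    return intval($code, 36);
}
```

Uppercasing is only cosmetic; if URLs may get lowercased along the way, decoding still works.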
Finally, be sure to utilize 301 redirects on non-conforming accessible URIs to redirect to the "correct" seo-friendly schema to prevent duplicate content penalties.
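The redirect decision itself might look roughly like this. redirectTarget is a made-up name, the $slugForId callback stands in for your own database lookup, and the URL scheme is the /john-adams-123 form without the .html suffix. PHP 7.1 or later is assumed.

```php
<?php
// Hypothetical helper: return the canonical path to redirect to,
// or null when the request is already canonical or the ID is unknown.
function redirectTarget(string $requestPath, callable $slugForId): ?string
{
    // The trailing number is the only part used for the lookup.
    if (preg_match('#-(\d+)$#', $requestPath, $m) !== 1) {
        return null;
    }

    $id = (int) $m[1];
    $slug = $slugForId($id);
    if ($slug === null) {
        return null; // unknown user: let the caller send a 404
    }

    $canonical = '/' . $slug . '-' . $id;

    return $requestPath === $canonical ? null : $canonical;
}
```

When a target comes back, sending it with header('Location: ' . $target, true, 301) followed by exit tells crawlers that the canonical URL is the permanent one.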
I'd go with your style of me.com/john-adams-123, because I think the leftmost part of the URI has more importance in SEO ranking.
Actually, if you are willing to use this on several controllers (not just user profile), you may want to do it more like me.com/john-adams-profile-123 with a rewriting rule redirecting /.+-profile-(\d+) to profile.php?uid=$1 and still be able to use, say, me.com/john-adams-articles-123 for this user's articles...
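If the rewriting is done in PHP rather than in the web server, the same rule could be sketched like this. route is a made-up name, articles.php is a hypothetical second controller alongside profile.php, and PHP 7.1 or later is assumed.

```php
<?php
// Hypothetical router: "/john-adams-profile-123" -> profile.php with uid 123.
function route(string $path): ?array
{
    // Controller keyword in the URL -> script that handles it.
    $scripts = [
        'profile' => 'profile.php',
        'articles' => 'articles.php', // assumed to exist, for illustration
    ];

    if (preg_match('#^/.+-(profile|articles)-(\d+)$#', $path, $m) !== 1) {
        return null;
    }

    return ['script' => $scripts[$m[1]], 'uid' => (int) $m[2]];
}
```

Adding another controller then only means one more entry in the array and one more word in the alternation.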
To avoid dealing with links that contain special characters, you can use this plugin for Zend Framework.
https://github.com/btlagutoli/CharConvert
$filter2 = new Zag_Filter_CharConvert(array(
'onlyAlnum' => true,
'replaceWhiteSpace' => '-'
));
echo $filter2->filter('éééé ááááá ? 90 :');//eeee-aaaaa-90
This can help you deal with strings in other languages.