PHP code to check for repeating characters / bogus text

I'm running a dating site with a field where people enter their profile. I already have a bad-words filter, but now people are entering profiles that are just garbage characters, such as "aaaaaaaaaaaaaaaaaaaa" or "--------------". I'm looking for an effective way of filtering out long words made of repeated characters. Thanks in advance.

This should do it (but it will collapse double characters too, so maybe you need to adjust it a bit):
$text = preg_replace('{(.)\1+}', '$1', $text);
OT: can't believe there are still people who use bad-word filters...
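If the goal is to flag garbage rather than rewrite it, a detection variant may suit better. The following is a minimal sketch, assuming a run of four or more identical characters counts as suspicious; the function name hasLongRun and the threshold are illustrative, not part of the answer above.

```php
<?php
// Sketch: flag text containing a long run of the same character.
// hasLongRun() and the default threshold of 4 are illustrative assumptions.
function hasLongRun(string $text, int $minRun = 4): bool
{
    // (.) captures one character; \1{n,} requires it to repeat n more times.
    // The u modifier treats the input as UTF-8 so multibyte characters count once.
    $pattern = '/(.)\1{' . ($minRun - 1) . ',}/u';
    return preg_match($pattern, $text) === 1;
}

var_dump(hasLongRun('aaaaaaaaaaaaaaaaaaaa'));    // bool(true)
var_dump(hasLongRun('--------------'));          // bool(true)
var_dump(hasLongRun('I love coffee and books')); // bool(false)
```

Ordinary doubled letters ("coffee", "books") stay below the threshold, so normal text is left alone.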

Maybe you need something like a Bayesian spam filter for that kind of stuff.
Particular words have particular probabilities of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word "Viagra" in spam email, but will seldom see it in other email. The filter doesn't know these probabilities in advance, and must first be trained so it can build them up. To train the filter, the user must manually indicate whether a new email is spam or not.
...

You could use a word list and flag each message that has long words (e.g. 5+ characters) not on the list. If the field contains five 8-letter words, none of which are in a dictionary, it's likely not meaningful data.
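The word-list idea could be sketched roughly as follows. The function name, the tiny $dictionary array, and the ASCII-only word splitting are all assumptions for illustration; a real list would be loaded from a file and would need to handle non-English text.

```php
<?php
// Sketch: count long words that are missing from a word list.
// $dictionary is a tiny stand-in; a real list would be loaded from a file.
function countUnknownLongWords(string $text, array $dictionary, int $minLength = 5): int
{
    // Flip the list so membership tests are hash lookups.
    $known = array_flip(array_map('strtolower', $dictionary));
    // Split on anything that is not an ASCII letter.
    $words = preg_split('/[^a-z]+/i', $text, -1, PREG_SPLIT_NO_EMPTY);

    $unknown = 0;
    foreach ($words as $word) {
        if (strlen($word) >= $minLength && !isset($known[strtolower($word)])) {
            $unknown++;
        }
    }
    return $unknown;
}

$dictionary = ['enjoy', 'hiking', 'cooking', 'travel'];
echo countUnknownLongWords('I enjoy hiking and cooking', $dictionary), "\n"; // 0
echo countUnknownLongWords('asdfdsff qwertyui zxcvbnmq', $dictionary), "\n"; // 3
```

The count can then be compared against whatever threshold fits the field, for example flagging a profile for review when most of its long words are unknown.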

Related

preg_match verification of non English email addresses (international domain names)

We all know email address verification is a touchy subject; there are so many opinions on the best way to deal with it without coding for the entire RFC. But since 2009 it's become even more difficult, and I haven't really seen anyone address the issue of IDNs yet.
Here is what I've been using:
preg_match('/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,6}\z/i', $email) // $email holds the address to check
Which will work for most email addresses, but what if I need to match a non-Latin email address? e.g.: bob@china.中國, or bob@russia.рф
Look here for the complete list. (Notice all the non Latin domain extensions at the bottom of the list.)
Information on this subject can be found here and I think what they are saying is these new characters will simply be read as '.xn--fiqz9s' and '.xn--p1ai' on the machine level but I'm not 100% sure.
If it is, does that mean the only change I need to consider making in my code is the following? (For domain extensions like .travelersinsurance and .sandvikcoromant)
preg_match('/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,20}\z/i', $email)
NOTICE: This is not related to the discussion found on this page Using a regular expression to validate an email address
Consider: Every time you make up your own new regex without validating addresses according to the complete RFC spec, you're just making the situation for using "exotic" email addresses on the web worse. You're inventing some new ad-hoc subset or superset of the official RFC spec; that means you will have false positives or false negatives or both. You will either prevent people from using their actual addresses because your regex doesn't account for them correctly, or accept addresses which are actually invalid.
Add to that that even if the address is syntactically valid, that still doesn't mean the address a) actually (still) exists, b) belongs to that user or c) can actually receive email. In the grand scheme of things, validating the syntax is an extremely minor concern.
If you're going to validate the syntax at all, either do a very rough general check which is sure to not reject any valid addresses (e.g. /.+@.+/), or validate according to all RFC rules; don't do some in-between half-assed sort-of-strict-but-not-really validation you just came up with.
I'm gonna stick with the tried and true suggestion that you should send them a verification email. No need for a fancy regex that will need to be updated time and time again. Just assume they know their email address and let them enter it.
That's what I've always done when this situation comes up. If anything I would make them enter their email twice. It'll free you up to spend more time on the important parts of your site/project.
Here is what I eventually came up with.
preg_match('/^[\pL\pM\pN._%+-]+@[\pL\pM\pN.-]+\.[\pL\pM]{2,20}\z/u', $email)
This uses Unicode property escapes like \pL, \pM and \pN to help me deal with characters and numbers from any language. (Inside a character class, quantifiers such as *+ are literal characters, so they are left out of the classes here.)
\pL Any kind of letter from any language, upper or lower case.
\pM A combining mark: a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.).
\pN Any number.
The expression above will work perfectly for normal email addresses like me@mydomain.com and cacophonous email addresses like a.s中3_yÄhমহাজোটেরoo文%网+d-fελληνικά@πyÄhooαράδειγμα.δοκιμή.
It's not that I don't trust people to be able to type in their own email addresses but people do make mistakes and I may use this code in other situations. For example: I need to double check the integrity of an existing list of 10,000 email addresses. Besides, I was always taught to NOT trust user input and to ALWAYS filter.
UPDATE
I just discovered that although this works perfectly when tested on sites like phpliveregex.com, and locally when parsing a normal string for UTF-8 content, it doesn't work properly with email fields, because browsers convert the content of fields of that type to plain ASCII (Punycode). So an email address like bob@china.中國 or bob@russia.рф gets converted to bob@china.xn--fiqz9s or bob@russia.xn--p1ai before the server receives it. The only thing I was really missing from my original filter was allowing hyphens and digits in the domain extension.
Here is the final version:
preg_match('/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z0-9-]{2,20}\z/i', $email); // hyphen kept last in each class so it is not read as a range
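For addresses that do not arrive through a browser email field (for example an existing list), the domain could be converted to its ASCII form before the ASCII-only pattern is applied. This is a rough sketch under that assumption: toAsciiEmail and looksLikeEmail are hypothetical helper names, and the conversion relies on idn_to_ascii() from PHP's intl extension, which may not be installed, so the code falls back to the unconverted domain when it is missing.

```php
<?php
// Sketch: normalise the domain to its ASCII (Punycode) form, then apply
// an ASCII-only pattern. toAsciiEmail() is a hypothetical helper; it uses
// idn_to_ascii() from the intl extension when that is available.
function toAsciiEmail(string $email): ?string
{
    $at = strrpos($email, '@');
    if ($at === false || $at === 0) {
        return null; // no @, or nothing before it
    }
    $local  = substr($email, 0, $at);
    $domain = substr($email, $at + 1);
    if ($domain === '') {
        return null; // nothing after the @
    }

    if (function_exists('idn_to_ascii')) {
        $ascii = idn_to_ascii($domain, IDNA_DEFAULT, INTL_IDNA_VARIANT_UTS46);
        if ($ascii === false) {
            return null; // domain cannot be converted
        }
        $domain = $ascii;
    }
    return $local . '@' . $domain;
}

function looksLikeEmail(string $email): bool
{
    $ascii = toAsciiEmail($email);
    return $ascii !== null
        && preg_match('/^[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z0-9-]{2,20}\z/i', $ascii) === 1;
}

var_dump(looksLikeEmail('bob@example.com')); // bool(true)
var_dump(looksLikeEmail('bob.example.com')); // bool(false)
```

The local part is still restricted to ASCII here, which matches the final pattern above but not every address that exists in the wild.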

Prevent people from submitting a form with meaningless data

I work on a website which allows people to describe how they were treated when they requested support from companies. The issue is that some people are playing with the platform by entering meaningless data like
blabla bal bla bka asdfdsff sdfs sdf
Is there a way to prevent this?
I can't validate the data manually because the website is very dynamic and receives a lot of data.
Thanks
Improve your form validation checks.
For the phone number, make sure it's exactly the appropriate length, and that it doesn't (for example) consist of one repeated digit (i.e. the number 0777777777 will probably be fake).
Calculate the letter usage in a sentence. The most used letters in the English language are e, t and a. If the ratio is completely different (for example if there is no letter e in a 200-letter text), there is a big problem.
Also match the words with a dictionary. For a ratio of unknown words larger than 60% you can consider it to be not valid.
Check for dates, if you're expecting a date that's in the next few days, you shouldn't accept dates for 30 years ago.
Think of the data that you're expecting to receive, and find limits to it, that's the only way. Good luck !
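Two of the checks above could be sketched like this. The function names and the thresholds (six or more repeats, ASCII letters only) are assumptions chosen for illustration, not tuned values.

```php
<?php
// Sketch of two of the checks above. Function names and thresholds are
// illustrative assumptions, not tuned values.

// True when a phone number is one digit repeated after an optional leading 0,
// e.g. 0777777777.
function isRepeatedDigitPhone(string $phone): bool
{
    $digits = preg_replace('/\D+/', '', $phone);
    return preg_match('/^0?(\d)\1{5,}$/', $digits) === 1;
}

// Share of one letter among all ASCII letters in the text (0.0 to 1.0).
function letterShare(string $text, string $letter = 'e'): float
{
    $letters = preg_replace('/[^a-z]/', '', strtolower($text));
    if ($letters === '') {
        return 0.0;
    }
    return substr_count($letters, strtolower($letter)) / strlen($letters);
}

var_dump(isRepeatedDigitPhone('0777777777')); // bool(true)
var_dump(isRepeatedDigitPhone('0712345678')); // bool(false)
printf("%.2f\n", letterShare('blabla bal bla bka asdfdsff sdfs sdf')); // 0.00
printf("%.2f\n", letterShare('The agent never answered my repeated requests'));
```

A share of e close to zero in a long text is only a hint, so it is probably better used to queue a submission for review than to reject it outright.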
Short answer: no.
Long answer: you may want to try to match words against a dictionary. But this is not foolproof, and when the matching is too strict you may get a lot of false positives.
Another way may be to build a blacklist of bogus words and match against that.
Also, you may want to reconsider making that particular field required. When a lot of people fill in bogus data, the form is probably set up wrong.
You can do it to an extent:
Validation on certain fields (phone number, email, numeric/text only fields etc...)
Restrict the user to use pre-defined items, such as drop-downs, check-boxes, rather than just plain text inputs where they have total freedom
Run some checks through the dictionary and determine a desirable percentage of quality that a user submits.
Regardless of what you do, it'll never be 100%. The only (almost!) guaranteed method of correct validation with user input outside of pre-determined values would be to sit someone down and manually check every submitted piece of data. Even then, they're prone to human error and it still wouldn't be 100%.
My advice would be to keep all important fields to values you've already specified yourself with drop-downs, check-boxes, number spinners etc...
Add fields for 'additional comments' on certain items, but keep those fields unnecessary to the main process handling of a submitted form.

PHP - preg_match_all - Loose email match pattern that allows spaces and double @

I am going through our old site files and data that has our members emails and correspondence for 10 years.
I am extracting all of the email addresses (and botched email entries) and adding them to our new sites db.
It was a beginner attempt cms and had no error checking and validation.
So, I am having trouble matching emails with spaces and double @.
jam @ spa ces1.com
jam@spac es2.com
jam@@doubleats.org
I have constructed this loose regex that intentionally allows for a whole bunch of incorrect email formats but, the above three are examples of ones I can't figure out.
Here is my current "working" code:
$pattern1= '([\s]*)([_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*([ ]+|)@([ ]+|)([a-zA-Z0-9-]+\.)+([a-zA-Z]{2,}))([\s]*)';
$pattern2='\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b';
$pattern="/$pattern1|$pattern2/i";
$isago = preg_match_all($pattern,$text,$matches);
if ($isago) {.......
I need another pattern that would allow the three email examples above to be recognized as email addresses. (actual validation comes later)
Also, are there any other patterns I could use that would allow me to recognize possible emails in the files?
Thanks for any help.
For the third case you can change your @ to @{1,2}.
For the first and second you can add a space in your regex pattern1:
$pattern1= '([\s]*)([_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*([ ]+|)@{1,2}([ ]+|)([ a-zA-Z0-9-]+\.)+([a-zA-Z]{2,}))([\s]*)';
$pattern2='\b[A-Z0-9._%+-]+@{1,2}[A-Z0-9.-]+\.[A-Z]{2,4}\b';
This answer is a bit of a joke, I know... but how about this regex:
/[\S ]+@[\S ]+\.[\S ]+/i
Does that work for you? I tested it in a document and it matches the three emails.
For general purposes you should use something like this:
/[A-Za-z0-9\._]+@[A-Za-z0-9\._]+\.[A-Za-z0-9\._]+/i
With that you would match all the emails, even separated by newline or commas.
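One way to combine the ideas in these answers is to match loosely and then normalise each candidate before real validation. The pattern below is an assumption written for the three sample addresses from the question: it tolerates one space on either side of the @, a doubled @, and a single space inside the domain name.

```php
<?php
// Sketch: find loosely formatted addresses, then normalise them.
// The pattern is an illustrative assumption: it tolerates one space on
// either side of the @, a doubled @, and one space inside the domain name.
$pattern = '/[\w.%+-]+ ?@{1,2} ?[a-z0-9-]+(?: [a-z0-9-]+)?(?:\.[a-z0-9-]+)*\.[a-z]{2,}/i';

$text = "jam @ spa ces1.com\njam@spac es2.com\njam@@doubleats.org\n";

preg_match_all($pattern, $text, $matches);

$cleaned = [];
foreach ($matches[0] as $candidate) {
    // Drop the spaces and collapse a doubled @ before real validation.
    $cleaned[] = preg_replace('/@+/', '@', str_replace(' ', '', $candidate));
}

print_r($cleaned); // jam@spaces1.com, jam@spaces2.com, jam@doubleats.org
```

Because a space is allowed inside the domain, the pattern can swallow the word that follows a correct address in running text, so the cleaned results still need the later validation step mentioned in the question.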

Get the actual email message that the person just wrote, excluding any quoted text

There are two pre-existing questions on the site.
One for Python, one for Java.
Java How to remove the quoted text from an email and only show the new text
Python Reliable way to only get the email text, excluding previous emails
I want to be able to do pretty much exactly the same (in PHP). I've created a mail proxy, where two people can correspond with each other by emailing a unique email address.
The problem I am finding, however, is that when a person receives the email and hits reply, I am struggling to accurately capture the text that they have written and discard the quoted text from previous correspondence.
I'm trying to find a solution that will work for both HTML emails and Plaintext email, because I am sending both.
I can also, if it helps, insert a <*****RESPOND ABOVE HERE*******> tag in the emails if necessary, meaning that I can discard everything below it.
What would you recommend I do? Always add that tag to the HTML copy and the plaintext copy then grab everything above it?
I would still then be left with the scenario of knowing how each mail client creates the response. Because for example Gmail would do this:
On Wed, Nov 2, 2011 at 10:34 AM, Message Platform <35227817-7cfa-46af-a190-390fa8d64a23@dev.example.com> wrote:
## In replies all text above this line is added to your message conversation ##
Any suggestions or recommendations of best practices?
Or should I just grab the 50 most popular mail clients and start creating a custom regex for each? Then, for each of these clients, there are also a bazillion different locale settings, since I'm guessing the locale of the user will also influence what is added.
Or should I just remove the preceding line always if it contains a date?.. etc
Unfortunately, you're in for a world of hurt if you want to try to clean up emails meticulously (removing everything that's not part of the actual reply email itself). The ideal way would be to, as you suggest, write up regex for each popular email client/service, but that's a pretty ridiculous amount of work, and I recommend being lazy and dumb about it.
Interestingly enough, even Facebook engineers have trouble with this problem, and Google has a patent on a method for "Detecting quoted text".
There are three solutions you might find acceptable:
Leave It Alone
The first solution is to just leave everything in the message. Most email clients do this, and nobody seems to complain. Of course, online message systems (like Facebook's 'Messages') look pretty odd if they have inception-style replies. One sneaky way to make this work okay is to render the message with any quoted lines collapsed, and include a little link to 'expand quoted text'.
Separate the Reply from the Older Message
The second solution, as you mention, is to put a delineating message at the top of your messages, like --------- please reply above this line ----------, and then strip that line and anything below when processing the replies. Many systems do this, and it's not the worst thing in the world... but it does make your email look more 'automated' and less personal (in my opinion).
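A minimal sketch of the marker approach for the plain-text part of a reply might look like this. The constant REPLY_MARKER and the function extractReplyAboveMarker are hypothetical names, and the HTML part would need to be converted to text (or handled separately) before the same idea applies.

```php
<?php
// Sketch: keep only the text above a reply marker.
// REPLY_MARKER and extractReplyAboveMarker() are hypothetical names.
const REPLY_MARKER = '--------- please reply above this line ----------';

function extractReplyAboveMarker(string $body, string $marker = REPLY_MARKER): string
{
    $pos = strpos($body, $marker);
    if ($pos === false) {
        return trim($body); // marker missing: keep everything
    }
    // Cut from the start of the line that holds the marker, so quote
    // prefixes such as "> " in front of it are dropped as well.
    $lineStart = strrpos(substr($body, 0, $pos), "\n");
    $cut = $lineStart === false ? 0 : $lineStart + 1;
    return trim(substr($body, 0, $cut));
}

$body = "Sounds good, see you then.\n\n"
      . "> --------- please reply above this line ----------\n"
      . "> Are we still on for Friday?\n";

echo extractReplyAboveMarker($body), "\n"; // Sounds good, see you then.
```

Any attribution line the mail client adds above the marker (such as "On ... wrote:") is still left in the result and would need separate handling.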
Strip Out Quoted Text
The last solution is to simply strip out any new line beginning with a >, which is, presumably, a quoted line from the reply email. Most email clients use this method of indicating quoted text. Here's some regex (in PHP) that would do just that:
$clean_text = preg_replace('/(^\w.+:\n)?(^>.*(\n|$))+/mi', '', $message_body);
There are some problems using this simpler method:
Many email clients also allow people to quote earlier emails on purpose, and preface those quoted lines with > as well, so you'll be stripping out those deliberate quotes too.
Usually, there's a line above the quoted email with something like On [date], [person] said. This line is hard to remove, because it's not formatted the same among different email clients, and it may be one or two lines above the quoted text you removed. I've implemented this detection method, with moderate success, in my PHP Imap library.
Of course, testing is key, and the tradeoffs might be worth it for your particular system. YMMV.
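To see what the quote-stripping pattern does in practice, here is a small run against a made-up Gmail-style reply. The sample message is invented for illustration; only the pattern comes from the answer above.

```php
<?php
// Sketch: apply the quote-stripping pattern to a Gmail-style reply.
// The sample message is made up for illustration.
$message_body = "Thanks, that works.\n\n"
              . "On Wed, Nov 2, 2011 at 10:34 AM, Bob wrote:\n"
              . "> Did the fix work?\n"
              . "> Let me know.\n";

// Optional attribution line ending in ":" followed by one or more ">" lines.
$clean_text = preg_replace('/(^\w.+:\n)?(^>.*(\n|$))+/mi', '', $message_body);

echo trim($clean_text), "\n"; // Thanks, that works.
```

The attribution line is only removed when it sits directly above the quoted block and ends in a colon followed by a bare newline, so it may help to normalise CRLF line endings to \n before running the pattern.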
There are many libraries out there that can help you extract the reply/signature from a message:
Ruby: https://github.com/github/email_reply_parser
Python: https://github.com/zapier/email-reply-parser or https://github.com/mailgun/talon
JavaScript: https://github.com/turt2live/node-email-reply-parser
Java: https://github.com/Driftt/EmailReplyParser
PHP: https://github.com/willdurand/EmailReplyParser
I've also read that Mailgun has a service to parse inbound email and POST its content to a URL of your choice. It will automatically strip quoted text from your emails: https://www.mailgun.com/blog/handle-incoming-emails-like-a-pro-mailgun-api-2-0/
Hope this helps!
Possibly helpful: quotequail is a Python library that helps identify quoted text in emails
AFAIK, (standard) emails should quote the whole text by adding a ">" in front of every line, which you could strip by using strstr(). Otherwise, did you try to port that Java example to PHP? It's nothing more than regex.
Even pages like Github and Facebook do have this problem.
Just an idea: You have the text which was originally sent, so you can look for it and remove it and additional surrounding noise from the reply. It is not trivial, because additional line breaks, HTML elements, ">" characters are added by the mail client application.
The regex is definitely better if it works, because it is simple and it perfectly cuts the original text, but if you find that it frequently does not work then this can be a fallback method.
I agree that quoted text in a reply is just text, so there's no fully accurate way to extract it. Anyway, you can use a regex replace like this.
$filteringMessage = preg_replace('/.*\n\n((^>+\s{1}.*$)+\n?)+/mi', '', $message);
Test
https://regex101.com/r/xO8nI1/2

PHP and Regular Expressions question?

I was wondering if the code below is the correct way to check a street address, email address, password, city and URL using preg_match and regular expressions?
And if not how should I fix the preg_match code?
preg_match ('/^[A-Z0-9 \'.-]{1,255}$/i', $trimmed['address']) //street address
preg_match ('/^[\w.-]+@[\w.-]+\.[A-Za-z]{2,6}$/', $trimmed['email']) //email address
preg_match ('/^\w{4,20}$/', $trimmed['password']) //password
preg_match ('/^[A-Z \'.-]{1,255}$/i', $trimmed['city']) //city
preg_match("/^[a-zA-Z]+[:\/\/]+[A-Za-z0-9\-_]+\\.+[A-Za-z0-9\.\/%&=\?\-_]+$/i", $trimmed['url']) //url
Your street address: ^[A-Z0-9 \'.-]{1,255}$
You need not escape the single quote for the regex itself (the backslash is only needed because the pattern sits in a PHP single-quoted string).
The dot inside the character class is a literal dot, not "any character", so the class only allows the characters it lists.
You are allowing a minimum length of 1 and a maximum length of 255. I would suggest increasing the minimum length to something more than 1.
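Your email regex: ^[\w.-]+@[\w.-]+\.[A-Za-z]{2,6}$
Again the . in the character class is a literal dot, which is what you want here. Note that the local part does not allow + and that the {2,6} limit rejects longer top-level domains.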
Your password regex: ^\w{4,20}$
This allows a password of length 4 to 20 that can contain only letters (upper and lower case), digits and underscores. I would suggest allowing special characters too, to make passwords stronger.
Your city regex: ^[A-Z \'.-]{1,255}$
As with the street address, the . in the character class is a literal dot.
It allows a minimum length of 1 (if you want to allow cities with 1-character names this is fine).
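How PCRE treats a dot inside a character class is easy to check directly. The sample address strings below are made up for the demonstration; the class is the street-address one from the question.

```php
<?php
// Inside [...] a dot is a literal dot; outside it matches any character
// except a newline.
var_dump(preg_match('/^[.]$/', '.')); // int(1)
var_dump(preg_match('/^[.]$/', 'a')); // int(0)
var_dump(preg_match('/^.$/', 'a'));   // int(1)

// So the street-address class rejects characters it does not list:
var_dump(preg_match('/^[A-Z0-9 \'.-]{1,255}$/i', '12/345 Foo St')); // int(0)
var_dump(preg_match('/^[A-Z0-9 \'.-]{1,255}$/i', "12 O'Neil St."));  // int(1)
```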
EDIT:
Since you are very new to regex, spend some time on Regular-Expressions.info
This seems overly complicated to me. In particular I can see a few things that won't work:
Your regex will fail for cities with non-ASCII letters in their names, such as "Malmö" or 서울, etc.
Your password validator doesn't allow spaces in the password (which are useful for entering pass-phrases). It doesn't allow punctuation either, which many people like to put in their passwords for added security.
Your address validator won't allow for people who live in apartments (12/345 Foo St).
(Inside a character class "." is a literal dot, so only the listed characters are accepted, and "/" is not one of them.)
And so on. In general, I think over-reliance on regular expressions for validation is not a good thing. You're probably better off allowing anything for those fields and just validating them some other way.
For example, with email addresses: just because an address is valid according to the RFC standard doesn't mean you'll actually be able to send email to it (or that it's the correct email address for the person). The only reliable way to validate an email address is to actually send an email to it and get the person to click on a link or something.
Same thing with URLs: just because it's valid according to the standard doesn't actually mean there's a web page there. You can validate the URL by trying to do an actual request to fetch the page.
But my personal preference would be to just do the absolute minimum verification possible, and leave it at that. Let people edit their profile (or whatever it is you're verifying) in case they make a mistake.
There's not really a 'correct' way to check for any of those things. It depends on what exactly your requirements are.
For e-mail addresses and URLs, I'd recommend using filter_var instead of regexps - just pass it FILTER_VALIDATE_EMAIL or FILTER_VALIDATE_URL.
With the other regexps, note that . inside a character class is a literal dot and needs no escaping. You might also want to consider that the City/Street ones would allow rubbish such as ''''', or just whitespace.
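A short sketch of the filter_var suggestion follows. The sample inputs are made up; filter_var returns the value on success and false on failure.

```php
<?php
// Sketch: built-in validation filters instead of hand-written patterns.
// filter_var() returns the value on success and false on failure.
$inputs = [
    'email' => 'user@example.com',
    'url'   => 'https://example.com/path?x=1',
];

$email = filter_var($inputs['email'], FILTER_VALIDATE_EMAIL);
$url   = filter_var($inputs['url'], FILTER_VALIDATE_URL);

var_dump($email !== false); // bool(true)
var_dump($url !== false);   // bool(true)
var_dump(filter_var('not an email', FILTER_VALIDATE_EMAIL)); // bool(false)
var_dump(filter_var('example', FILTER_VALIDATE_URL));        // bool(false)
```

Passing these filters only says the value is well formed; whether the mailbox or page exists is a separate question, as the other answers point out.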
Please don't assume that you know how an address is made up. There are thousands of cities, towns and villages with characters like & and those from other alphabets.
Just DON'T try to validate an address unless you do it through an API specific to a country (USPS for the US, for example).
And why would you want to limit the characters in a users password? Don't have ANY requirements on the password except for it existing.
Your site will be unusable if you use those regexes.
