PHP preg_match match consecutive newline chars - php

I am trying to prevent certain kinds of posts on my site, which are mostly meant to make it look like they contain some content but are just spam. Specifically, the posts are a few random words, some newline characters, and a random character.
So, I know some legit users might have a use for two newline chars (to create a blank line between paragraphs), but I figure 3+ can be marked as spam.
I tested this regex on regex101 and it works fine, but it is never triggered when I test on my site. Any ideas as to why? When I uncomment the echo line, it shows me the number 4 for my test data, so I know it sees the newlines. Is my regex formed incorrectly?
Test data:
This is a potential
spam post
Code:
//echo substr_count($lowercaseBody, "\n");
if (preg_match('/\n{3,}./', $lowercaseBody)){
error("Stop Spamming my chan you .");
}

The data likely contains CRLFs, not just LFs.
The substr_count test does not care about the interleaved CRs, but your regex pattern does.
Use (\r?\n) instead of \n to allow both CRLFs and LFs (different browsers and operating systems may use different newlines):
if (preg_match('/(\r?\n){3,}./', $lowercaseBody)){
error("Stop Spamming my chan you .");
}
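A minimal sketch of the difference, assuming the post body arrives with CRLF line endings as the answer suspects. The helper name has_excess_newlines is hypothetical, and the group is written as non-capturing, which does not change what the pattern matches:

```php
<?php
// Hypothetical helper: true when the text contains three or more
// consecutive line breaks (LF or CRLF) followed by another character.
function has_excess_newlines(string $body): bool
{
    return preg_match('/(?:\r?\n){3,}./', $body) === 1;
}

$lf   = "This is a potential\n\n\n\nspam post";
$crlf = "This is a potential\r\n\r\n\r\n\r\nspam post";

// The original pattern never sees three LFs in a row in CRLF data.
var_dump(preg_match('/\n{3,}./', $crlf));                 // int(0)

var_dump(has_excess_newlines($lf));                       // bool(true)
var_dump(has_excess_newlines($crlf));                     // bool(true)
var_dump(has_excess_newlines("para one\r\n\r\npara two")); // bool(false)
```

A single blank line between paragraphs still passes, in either line-ending style.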

Related

regex with preg_match anything and all line breaks

i am not too good with regex and can't seem to find the answer
I am writing a class file to check data type and "partially/best possible sanitise" any submitted data as well as performing some other functions too. This is working on all data types (i.e emails, url's phone numbers, int/signed/un-signed, words, passwords, various date formats, basic HTML, etc)
i am having problems with trying to match "anything"* (this is the one data type i dont really need to check, but for consistency, i need it to run through the preg_match, but always want it to return true).
when i say "anything" i want it to match any text, number, symbols AND Line Breaks. It is the line break i am having problems with
i am using :
define('REG_TEXT', '/^(.*)$/');
preg_match(REG_TEXT, $data)
this works fine on the first paragraph, but won't match past any line breaks, so it returns false
an example of what i want this to match (return true) would be:
this is a test match on anything 345 +_)(*&^%$£"!<br><html> <?php echo this i PHP; ?>
and match this too on a new line
and match all this line too
and anything else at all
i am not worried about any code entered into the data at this point, as other areas of my class deal with this (before this stage!).
basically i am after a regex that will match/return true on absolutely anything.
(i dont want to change to preg_match_all as this will break other aspects of the class or require me to add additional code that will be a partial repeat of code that i dont think is needed)
any advice would be greatly welcomed!
thanks
Jon
Use:
'/^(.*)$/ms'
The s modifier is the one that makes . match newlines; the m modifier additionally lets ^ and $ match at line boundaries. See http://php.net/manual/en/reference.pcre.pattern.modifiers.php
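A short sketch of what the modifiers change, using a made-up multi-line sample string:

```php
<?php
// Sample input invented for illustration: three lines with mixed symbols.
$data = "this is a test 345 +_)(*&^%\nand match this too\nand anything else";

// Without modifiers, . stops at the first newline and $ cannot match there.
var_dump(preg_match('/^(.*)$/', $data));          // int(0)

// With s (and m, as in the answer) the whole string is captured.
var_dump(preg_match('/^(.*)$/ms', $data, $m));    // int(1)
var_dump($m[1] === $data);                        // bool(true)
```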

PHP - How to ask if field contains number or + or spaces or empty?

My website form is getting hammered with spam. I have noticed that in the "Phone" field the spam bots always insert text rather than a number, so I would like to add an if statement to the PHP mailer that blocks the email unless the phone field meets the following rules:
1) I want users to be able to leave the field blank, so empty field must be accepted.
2) Must contain "numbers" or "plus sign" or "spaces"
How would I write this in PHP?
Any help is appreciated
EDIT: Just thought, lol, it would be much easier to just check if the field contains alphabetical characters. How would I do this?
EDIT2: Sorted. I used "if (ctype_alpha ($phone) !== false)"
Regular expressions are probably the best way, although not necessarily the easiest to understand at first. But regular expressions are definitely a good thing to learn if you are not familiar with them. My favorite introduction is this site: http://www.zytrax.com/tech/web/regex.htm And this is a good site for interactively building a regex and seeing how it works in realtime: http://www.regexr.com/ I'm sure there are plenty of other similar sites but those are the two I always go back to.
If you search around for a regular expression solution you will find countless possibilities and variations. My personal advice is to keep it simple. I would start with considering how you store the phone number data. I usually just keep the numbers, so I would simplify it by first removing those "allowed" characters and then checking if what's left over is just numbers.
$phone = str_replace(Array('+', ' ', '(', ')'), '', $phone);
That will replace all pluses, spaces, and parentheses with an empty string (i.e. remove them). Then you can check if the string is numeric, and if it is store it, otherwise print/return an error.
if (!is_numeric($phone))
// stop processing and output an error
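As an alternative sketch that follows the two rules from the question directly (blank is accepted; otherwise only digits, plus signs and spaces), a single anchored pattern is enough. The function name is_plausible_phone is hypothetical:

```php
<?php
// Hypothetical helper: accepts an empty string, or a string made up
// only of digits, plus signs and spaces.
function is_plausible_phone(string $phone): bool
{
    // ^ and \z anchor the whole string; * lets the empty string through.
    // \z (rather than $) also rejects a trailing newline.
    return preg_match('/^[0-9+ ]*\z/', $phone) === 1;
}

var_dump(is_plausible_phone(''));                  // bool(true)
var_dump(is_plausible_phone('+44 20 7946 0958'));  // bool(true)
var_dump(is_plausible_phone('buy cheap pills'));   // bool(false)
```

This only checks which characters are present; it does not verify that the number is real.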
First of all, you should use some form of spam blocking, for example a token, a honeypot, a captcha, etc.
In my country, a mobile or landline phone number contains only 9 digits without the country code, which is +XX. So I create an INT(10) field in the database. After the form is submitted, I remove everything except digits.
For example:
$phoneNumber = (int) substr( preg_replace( '#[^\d]+#', '', $_POST['phone_numer'] ), 0, 9 );
This has always worked in my projects.

How to remove offensive words from post by php?

Assume "xyza" is a bad word. I'm using following method to replace offensive words-
$text = str_replace("xyza","(Offensive words detected & removed!)",$text);
This code will replace xyza with "(Offensive words detected & removed!)".
But the problem is case: if someone types XYZA, my code can't detect it. How do I solve this?
No matter what you do, users will find ways to get around your filters. They will use unicode characters (аss, for example, uses a Cyrillic а and will not get captured by any of the regex solutions). They will use spaces, dollar signs, asterisks, whatever you haven't managed to catch yet.
If family-friendliness is essential to your application, have a person review the content before it goes live. Otherwise, add a flag feature so other people can flag offensive content. Better yet, use some sort of machine learning or Bayesian filter to automatically flag potentially offensive posts and have humans check them out manually. People read human languages better than computers.
The problem with whitelists/blacklists is—as other users have pointed out—your users will make it their priority to find ways around your filter for satisfaction rather than using your website for what it was intended for, whatever that may be.
One approach would be to use Google’s undocumented profanity API it created for its “What Do You Love?” website. If you get a response of true then just give the user a message saying their post couldn’t be submitted due to detected profanity.
You could approach this as follows:
<?php
if (isset($_POST['submit'])) {
$result = json_decode(file_get_contents(sprintf('http://www.wdyl.com/profanity?q=%s', urlencode($_POST['comments']))));
if ($result->response == true) {
// profanity detected
}
else {
// save comments to database as normal
}
}
Other answers and comments say that programming is not the best solution to this problem. I agree with them. Those answers should be moved to Moderators - Stack Exchange or Webmasters - Stack Exchange.
Since this is stackoverflow, my answer is going to be based on computer programming.
If you want to use str_replace, do something like this.
For the sake of this post, since some people are offended by actual cusswords, let's pretend that these are bad words:
'fug', 'schnitt', 'dam'.
$text = str_ireplace(" fug ","(Offensive words detected & removed!)",$text);
Notice, it's str_ireplace not str_replace. The i is for "case insensitive".
But that will erroneously match "fuggedaboudit," for example.
If you want to do a more reliable job, you need to use regex.
$bad_text = "Fug dis schnitt, because a schnitter never dam wins a fuggin schnitting darn";
$hit_words = array("fug","schnitt","dam"); // these words are 'hits' that we need to replace. hit words...
array_walk($hit_words, function(&$value, $key) { // this prepares the regex; closures require PHP 5.3+
$value = '~\b' . preg_quote( $value ,'~') . '\b~i'; // \b means word boundary, like space, line-break, period, dash, and many others. Prevents "refudgee" from being matched when searching for "fudge"
});
/*print_r($hit_words);*/
$good_words = array("fudge","shoot","dang");
$good_text = preg_replace($hit_words,$good_words,$bad_text); // does all search/replace actions at once
echo '<br />' . $good_text . '<br />';
That will do all your search/replacements at once. The two arrays should contain the same number of elements, matching up searches and replace terms. It will not match parts of words, only whole words. And of course, determined cussers will find ways of getting their swearing onto your website. But it will stop lazy cussers.
I've decided to add some links to sites that obviously use programming to do a first run through removing profanity. I'll add more as I come across them. Other than yahoo:
1.) Dell.com - replace matching words with <profanity deleted>.
http://en.community.dell.com/support-forums/peripherals/f/3529/t/19502072.aspx
2.) Watson, the supercomputer, apparently developed a cursing problem. How do you tell the difference between cursing and slang? Apparently, it's so hard that the researchers just decided to purge it all. But they could have just used a list of curse words ( exact matching is a subset of regex, I would say) and forbidden their use. That's kind of how it works in real life, anyway.
Watson develops a profanity problem
3.) Content Compliance section of Gmail custom settings in Apps for Business:
Add expressions that describe the content you want to search for in each message
The "Expressions" used can be of several types, including "Advanced content match", which, among other things, allows you to choose "Match type" options very similar to what you'd have in an Excel filter: Starts with, Ends with, Contains, Not contains, Equals, Is Empty, all of which presumably use Regex. But wait, there's more: Matches regex, Not matches regex, Matches any word, Matches all words. So, the mighty Google implements regex filtering options for its business users. Why would it do that, when regex is supposedly so ineffective? Because it actually is effective enough. It is a simple, fast, programming solution that will only fail when people are hell-bent on circumventing it.
Besides that list, I wonder if anyone else has noticed the similarity between weeding out profanity and filtering out spam. Clearly, regex has uses in both arenas but nitpickers who learned by rote that "all regex is bad" will always downvote any answer to any question if regex is even mentioned.
Try googling "how spam filters work". You'll get results like this one that covers SpamAssassin:
http://www.seas.upenn.edu/cets/answers/spamblock-filter.html
Another example where I'm sure regex is used is when communicating via Amazon.com's Amazon Marketplace. You receive emails at your usual email address. So, naturally, when responding to a seller, your email program will include all kinds of sender information, like your email address, cc email addresses, and any you enter into the body. But Amazon.com strips these out "for your protection." Can I find a way around this regex? Probably, but it would take more trouble than it's worth and is therefore effective to a degree. They also keep the emails for 2 years, presumably so that a human can go over them in case of any fraud claims.
SpamAssassin also looks at the subject and body of the message for the same sort of things that a person notices when a message "looks like spam". It searches for strings like "viagra", "buy now", "lowest prices", "click here", etc. It also looks for flashy HTML such as large fonts, blinking text, bright colors, etc.
Regex is not mentioned, but I'm sure it's in use.
Use the str_ireplace function, which is the case-insensitive version of str_replace():
$text = str_ireplace("flip","(Offensive words detected & removed!)", $text);
Use 'str_ireplace' to replace strings regardless of case.
This will probably help you:
$text = 'contains offensive_word .... so on';
$array = array(
'offensive_word' => '****',
'offensive_word2' => '****',
'offensive_word3' => '****',
//.....
);
$text = str_ireplace(array_keys($array),array_values($array), $text);
echo $text;
You should use a regex replacement and add the i flag to the end of your regex so it searches your text regardless of case:
$text = preg_replace("/xyza/i","(Offensive words detected & removed!)", $text);
str_ireplace can also be used if you don't need complex regex rules.
$text = str_ireplace("xyza","(Offensive words detected & removed!)", $text);
In fact, the latter is the preferred way as it's faster than regex manipulation. From PHP docs:
If you don't need fancy replacing rules, you should generally use this function instead of preg_replace() with the i modifier.
BUT, as the commenter pointed out, simple string/regex replacements can break your strings if the substring you're replacing appears as part of another non-offensive word. For this, you could either use word boundaries in your regexes or replace only those words that can't be part of other strings (e.g. the word xyza).
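A minimal sketch of the word-boundary variant mentioned above. The helper name censor_word is hypothetical; preg_quote guards against regex metacharacters in the word:

```php
<?php
// Hypothetical helper: replaces $word only where it stands alone as a
// whole word, ignoring case.
function censor_word(string $text, string $word, string $replacement): string
{
    $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
    return preg_replace($pattern, $replacement, $text);
}

echo censor_word(
    'XYZA is caught, but xyzabc is left alone.',
    'xyza',
    '(Offensive words detected & removed!)'
);
// (Offensive words detected & removed!) is caught, but xyzabc is left alone.
```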

PHP / Regex: Check if string was copy pasted

I am trying to write a validation function for strings where I want to check if the string is a copy+paste work.
Background:
We have a CMS where the user can enter description texts with a minimum of, for example, 200 chars. A lot of users write texts that are too short and get the "you have to use more than 200 letters" error message.
To avoid this, they copy and paste the text or some dummy strings like "AAAAA" to reach the limit.
I am now looking for a function / method / regex to detect such copy+paste strings and prevent them by showing a message.
I know that there is no 100% solution to prevent dummy texts, but we want to reduce it a little bit. Any ideas?
There's not going to be a fast, reliable, undefeatable solution. But I can think of a compromise:
preg_match('/(.{1,4})\1{3,}/', $subject)
would return 1 for strings that contain a sequence of one to four characters occurring at least four times in a row (the captured sequence followed by three or more repeats of it).
So it would match on strings like
AAAAAAA
asdasdasdasd
foo bar baz glglglglglglglgl
It would not detect longer repetitions like
asdfgasdfgasdfgasdfg
but the complexity of the regex will grow exponentially if you try to match longer repeats, so I think four characters are a workable compromise.
Alternatively, you might want to anchor the repeats to the end of the string (which is where most people would put the filler):
preg_match('/(.+)\1{3,}$/', $subject)
but of course, then a string like
LOL OMG!!!!!!!!!!!!!!!!!!!!!!!!!!!.
would not be detected. Your choice :)
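The first pattern can be wrapped in a small check and tried against the examples above. This is a sketch only; the function name looks_like_filler is made up for illustration:

```php
<?php
// Hypothetical helper: true when a run of 1-4 characters occurs at least
// four times in a row anywhere in the string.
function looks_like_filler(string $subject): bool
{
    return preg_match('/(.{1,4})\1{3,}/', $subject) === 1;
}

var_dump(looks_like_filler('AAAAAAA'));                       // bool(true)
var_dump(looks_like_filler('asdasdasdasd'));                  // bool(true)
var_dump(looks_like_filler('foo bar baz glglglglglglglgl'));  // bool(true)
var_dump(looks_like_filler('asdfgasdfgasdfgasdfg'));          // bool(false)
```

The last case is the five-character repeat that the answer says slips through.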

Fetch All URLs from a Page using Regex

Original format:
<a href="http://www.example.com/t434234.html" ...>
1. I need to fetch all URLs of this format:
http://www.example.com/t[ANY CHARACTER].html
ANY CHARACTER is where the value changes from one URL to another. The rest is fixed.
Here is my attempt:
preg_match("#http:\/\/www\.aqarcity\.com\/t[a-zA-Z0-9_]\.html#", $page, $urls);
I get empty results. I don't know where i went wrong...
The problem appears to be that [a-zA-Z0-9_] will only match exactly one character. If you want to match zero or more characters, use [a-zA-Z0-9_]*. For one or more, use [a-zA-Z0-9_]+. For exactly six characters, use [a-zA-Z0-9_]{6}. For e.g. one to six characters, use [a-zA-Z0-9_]{1,6}.
Also note that, since you're using # as the delimiter, you don't need to escape the / characters. As far as I know this will not make your code misbehave, but it'll be easier to read if you remove the backslashes before the slashes.
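Putting both points together, a sketch with a + quantifier and unescaped slashes might look like this. It uses the example.com URL from the question and swaps in preg_match_all, since preg_match stops at the first match and the question asks for all URLs; the sample $page string is invented:

```php
<?php
// Invented sample page containing two matching links and one other link.
$page = '<a href="http://www.example.com/t434234.html" class="x">one</a>'
      . '<a href="http://www.example.com/tabc_9.html">two</a>'
      . '<a href="http://www.example.com/other.html">three</a>';

// + allows one or more characters after the "t"; with # as the delimiter,
// the slashes need no escaping.
preg_match_all('#http://www\.example\.com/t[a-zA-Z0-9_]+\.html#', $page, $matches);

// $matches[0] holds every full match found in $page.
print_r($matches[0]);
```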
Finally, please realize that regular expressions are a rather dangerous way to work with HTML. In this case, you may pick up matching URLs from comments, Javascript code, and other things that aren't links. It is literally impossible to correctly parse HTML with unaugmented regular expressions—they don't have the expressive power necessary to do so. I don't know what sorts of HTML parsers are available for PHP, but you may want to look into them.
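One parser that ships with PHP is DOMDocument (part of the dom extension, which is assumed to be enabled here). A rough sketch that reads href attributes from real anchor elements and only then applies the pattern; the function name extract_links is hypothetical:

```php
<?php
// Hypothetical helper: returns the href of every <a> element whose URL
// matches $pattern. Links inside HTML comments are not elements, so they
// are skipped.
function extract_links(string $html, string $pattern): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them,
    // since real-world markup is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if (preg_match($pattern, $href) === 1) {
            $urls[] = $href;
        }
    }
    return $urls;
}

$html = '<p><a href="http://www.example.com/t434234.html">one</a> '
      . '<!-- <a href="http://www.example.com/t999.html">hidden</a> --> '
      . '<a href="http://www.example.com/about.html">two</a></p>';

print_r(extract_links($html, '#^http://www\.example\.com/t[a-zA-Z0-9_]+\.html$#'));
```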
