PHP/Regex for a smart censor [duplicate] - php

I am looking to build a smart censor in PHP using regex for a message board. Basically, I have an array of the bad words (as regex patterns) along with the substitution to be used for each. I allow for spaces between the letters to prevent bypassing the censor, but I'm hung up on someone wrapping any of a bad word's letters in HTML tags. So, if "shit" is blocked, I can catch "s h i t" with any number of spaces, but if someone does sh<b>i</b>t (with the i wrapped in bold tags), it gets through. That obviously can't happen, so I'm stumped here.
Here is what I have so far:
$bad_words = array('/s\s*h\s*i\s*t/i'=>'s***');
$new_string = preg_replace(array_keys($bad_words), array_values($bad_words), $string);
return $new_string;
I've thought of wrapping $string with strip_tags() but because the rest of the post contents (not just the bad words being sought after) can contain HTML, that will destroy the whole message board post on return. Any help or insight provided would be greatly appreciated!

The fact is, no matter what you add to catch swear words, if somebody wants to find a way around it, they will. And the more you try to stop it, the more false positives you will get.
Even with your current method, if someone enters "Push it to github", you're going to turn it into "Pus*** to github".
Honestly, your best bet is to catch the obvious ones, and have a way to flag a post as obscene.
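For the obvious cases, including the tag-wrapped letters from the question, one option is to allow HTML tags as well as whitespace between the letters, and to require word boundaries around the match so that "Push it" is left alone. This is only a sketch: censor_pattern() and censor() are hypothetical helper names, and the word list is keyed by the plain word rather than by a hand-written regex.

```php
<?php
// Hypothetical helper: build a pattern for $word that tolerates
// whitespace and HTML tags between its letters.
function censor_pattern(string $word): string
{
    $gap = '(?:\s|<[^>]*>)*'; // any run of whitespace or tags
    $letters = array_map(function ($c) {
        return preg_quote($c, '/');
    }, str_split($word));
    // \b on both sides avoids matching inside "Push it".
    return '/\b' . implode($gap, $letters) . '\b/i';
}

// Hypothetical helper: $bad_words maps plain word => replacement.
function censor(string $text, array $bad_words): string
{
    foreach ($bad_words as $word => $replacement) {
        $text = preg_replace(censor_pattern($word), $replacement, $text);
    }
    return $text;
}

echo censor('sh<b>i</b>t happens, so push it to github', ['shit' => 's***']);
// prints "s*** happens, so push it to github"
```

Note that the tags inside the match are removed along with the word, so input such as s<b>hit</b> would leave a dangling closing tag behind, and false positives are still possible. It narrows the gap; it does not close it.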
Some good resources to look at on this site are:
How do you implement a good profanity filter?
and
"bad words" filter

Related

Nasty regex and strange string behavior

I've been struggling with this problem for quite some time now and I just can't seem to find a solution. I have the following regular expression for matching URLs which appears to work flawlessly until I post a bunch of links on new lines without spaces between them.
(http|ftp)+(s)?:(\/\/)((\w|\.|\-)+)(\/)?(\S)+
I tried this in a couple of regex testers and it seems to pick out URLs correctly, unlike the code in my application. That made me think there must be something wrong with the code, so I started debugging. What I found when I echoed the string I'm applying the regular expression to is this:
http://www.google.com/\r\nhttp://www.google.com/\r\nhttp://www.google.com/
I have never seen new lines \r\n appear as text in the browser. This makes me think that there's something else getting its hands on this string. I followed my logic and it turned out that this string comes right from a textarea element into $_POST and is not being manipulated anywhere.
What may be causing those \r\ns to appear as text and how would I go about matching those URLs that users may input separated by new lines?
I'm kind of really desperate over here, I would really appreciate your help guys.
If you are seeing
http://www.google.com/\r\nhttp://www.google.com/\r\nhttp://www.google.com/
when you echo the string, that means that the actual string you are echoing is:
http://www.google.com/\\r\\nhttp://www.google.com/\\r\\nhttp://www.google.com/
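i.e. the backslashes have been escaped, so the string contains the literal characters \r\n rather than actual newlines. Because those characters are not whitespace, (\S)+ runs straight through them, which means that you are only getting a single match in your regex.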
Check out this question: Why are $_POST variables getting escaped in PHP? for reasons why your requests may be getting escaped.
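If the escaping cannot be switched off at the source, a workaround is to turn the literal sequences back into real newlines before matching. The sketch below assumes the input looks exactly like the echoed string above; unescape_newlines() and extract_urls() are hypothetical names, and the URL pattern is a simplified stand-in rather than the one from the question.

```php
<?php
// Hypothetical helper: convert literal backslash sequences into real newlines.
function unescape_newlines(string $text): string
{
    // Single-quoted PHP strings keep the backslash, so '\r\n' is four characters.
    return str_replace(['\r\n', '\n', '\r'], "\n", $text);
}

// Hypothetical helper: return every http(s)/ftp URL found in $text.
function extract_urls(string $text): array
{
    preg_match_all('~(?:https?|ftp)://\S+~i', unescape_newlines($text), $matches);
    return $matches[0];
}

$posted = 'http://www.google.com/\r\nhttp://www.google.com/\r\nhttp://www.google.com/';
echo count(extract_urls($posted)); // prints 3
```

This would also rewrite a URL that legitimately contains a backslash followed by r or n, so fixing the cause of the escaping is the cleaner solution.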

Regular Expression (regex) match of base64_decode concatenated using PHP

So I've been trying to build a regex for the past couple of hours and I'm starting to go crazy wondering if this is even possible or worthwhile.
I have a script that scans PHP files, checking the MD5 sum against known malicious files and searching for certain strings. Most recently I've come across files where, instead of using base64_decode directly in the PHP file, they build the name by concatenating strings into a variable so the scanner doesn't pick it up.
As an example here's the latest one I found:
$a='bas'.'e6'.'4_d'.'ecode';eval($a
So because the scanner searches for base64_decode this file wasn't picked up as they are using PHP to concatenate base64_decode in a variable, and then call the variable.
Forgive me because I've just started with regex, but is it even possible to search for something like this using regex? I mean, I understand and was able to get a regex that would match that exact one, but what if they used this instead:
$a='b'.'ase'.'64_d'.'ecode';eval($a
It wouldn't be picked up because the regex was looking for ' then b then a, etc etc.
I've already added
(eval)\(\$[a-z]
to send me an email as a notice to check the file. I'll have to let it run for a couple of days and see how many false positives show up, but my main concern is with base64_decode.
If someone could please shed some light on this for me and maybe point me in the right direction, I would greatly appreciate it.
Thanks!!
You can use this regexp:
b\W*a\W*s\W*e\W*6\W*4\W*_\W*d\W*e\W*c\W*o\W*d\W*e
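It searches for base64_decode with any run of non-word characters (anything other than letters, digits, and underscore, so quotes, dots, and spaces all qualify) between the letters.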
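The same pattern can be generated for any function name instead of being typed out by hand. A minimal sketch, assuming the scanner already has the file contents in a string; obfuscated_pattern() is a hypothetical helper, and the i flag is added because PHP function names are case-insensitive.

```php
<?php
// Hypothetical helper: build a pattern that matches $needle even when
// non-word characters (quotes, dots, spaces) sit between its characters.
function obfuscated_pattern(string $needle): string
{
    $chars = array_map(function ($c) {
        return preg_quote($c, '/');
    }, str_split($needle));
    return '/' . implode('\W*', $chars) . '/i';
}

$pattern = obfuscated_pattern('base64_decode');

// The two samples from the question, plus a harmless line.
$first  = "\$a='bas'.'e6'.'4_d'.'ecode';eval(\$a";
$second = "\$a='b'.'ase'.'64_d'.'ecode';eval(\$a";
$clean  = 'echo "hello";';

var_dump(preg_match($pattern, $first));  // int(1)
var_dump(preg_match($pattern, $second)); // int(1)
var_dump(preg_match($pattern, $clean));  // int(0)
```

This only catches concatenation of string literals in order. Names built with str_rot13, strrev, chr(), or similar will still get through, so the eval notice remains useful as a second check.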

Is there any logic to validate whether a group of letters could be considered a phonetic word? [duplicate]

Basically, if I'm given a random jumble of letters, I need to check to see if this could phonetically be considered a word.
I'm not looking to validate against a dictionary list, since I don't really care if the letters form an actual word or not. I just need to determine whether or not those letters are in the correct format to be considered a word.
For example:
aaaaaa // Not valid, because there are no consonants
bbbbbb // Not valid, because no vowels
dogcat // Valid, even though it is not a word, because it phonetically makes what could be considered a word
dapmar // Valid, even though nothing about this is a word, it phonetically works
I realize there are going to be exceptions to almost any logic given, so this doesn't have to be an exact science, I would just like to catch the majority, so the most general logic would work for me.
I think it all boils down to whether or not a jumble of letters can be pronounced easily.
Any help is appreciated, thanks!
First, reject any letter repeated three or more times in a row, so ccc is invalid (or maybe apply this to every letter except vowels, so aaaaa, eeeee, uuuuu would be OK). Then check against a list of existing words in your language, but only if you actually need real words; if you're just generating a pronounceable word, I don't think you'll need existing words.
Please also check these: pronounceability algorithm, http://10000ideas.blogspot.fr/2011/07/what-makes-word-pronounceable.html, and Measure the pronounceability of a word?
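The repeated-letter rule can be combined with the two rules implied by the examples in the question (at least one vowel, at least one consonant). This is a rough heuristic sketch, not a real pronounceability measure; looks_pronounceable() is a hypothetical name, only a-z is handled, and y is treated as a consonant.

```php
<?php
// Hypothetical helper: very rough check that $word could be pronounced.
function looks_pronounceable(string $word): bool
{
    $word = strtolower($word);
    if (!preg_match('/^[a-z]+$/', $word)) {
        return false; // only plain letters are handled
    }
    if (!preg_match('/[aeiou]/', $word)) {
        return false; // no vowels, e.g. "bbbbbb"
    }
    if (!preg_match('/[^aeiou]/', $word)) {
        return false; // no consonants, e.g. "aaaaaa"
    }
    if (preg_match('/(.)\1\1/', $word)) {
        return false; // same letter three or more times in a row
    }
    return true;
}

var_dump(looks_pronounceable('dogcat')); // bool(true)
var_dump(looks_pronounceable('aaaaaa')); // bool(false)
```

Strings such as "xqzta" still pass, so anything stricter would need rules about which consonant clusters are allowed, as the linked resources discuss.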
For the amount of time and effort it would take to write code to logically check this, you'd be better off getting a file with as many English words as possible and putting them into an array. That would be your BEST logical check.

regex (preg_match) woes

I'm sorry if this has been asked before, but I just can't get a straight answer from the interwebs today!
I need to validate a form field and check if there are 3 and only 3 (no more, no less), uppercase letters.
My sorry attempts at regex have so far all failed - I thought that
/^[A-Z]{3}$/
would do the job, but nix. Any takers?!
/^[A-Z]{3}$/ will check the string for ...
beginning_of_the_string->three_and_only_three_uppercase_letters->end_of_line
No other characters are valid in the string with this regexp.
However, I only tried it with a JS regexp, and I think it behaves the same in PHP. Could you provide the full code (or at least part of it) of your PHP script?
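One PHP-specific detail, which may or may not be related to the failure described: in PCRE, $ also matches just before a trailing newline, so "ABC\n" passes /^[A-Z]{3}$/. A minimal sketch, assuming the form value arrives as a string and using a hypothetical helper name, adds the D modifier so that $ matches only at the very end:

```php
<?php
// Hypothetical helper: true only for exactly three uppercase ASCII letters.
function is_three_uppercase(string $value): bool
{
    // D (PCRE_DOLLAR_ENDONLY) stops $ from matching before a final newline.
    return preg_match('/^[A-Z]{3}$/D', $value) === 1;
}

var_dump(is_three_uppercase('ABC'));   // bool(true)
var_dump(is_three_uppercase('ABCD'));  // bool(false)
var_dump(is_three_uppercase("ABC\n")); // bool(false)
```

If the field may carry stray spaces from the form, calling trim() on the value first is a common extra step.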

Regular expression for counting sentences in a block of text [duplicate]

Possible Duplicate:
PHP - How to split a paragraph into sentences.
I have a block of text that I would like to separate into sentences, what would be the best way of doing this? I thought of looking for '.','!','?' characters, but I realized there were some problems with this, such as when people use acronyms, or end a sentence with something like !?. What would be the best way to handle this? I figured there would be some regex that could handle this, but I'm open to a non-regex solution if that fits the problem better.
Regex isn't the best solution for this problem. You'd be better served by creating a parsing library: something where you can easily create logic blocks to distinguish one thing from another. You'll need to come up with a set of rules for breaking up the text into the chunks you'd like to see.
"Are you sure?" he asked.
Doesn't that mess things up when using regex? However, with a parser you could actually see
<start quote><capitalization>are you sure<question><end quote>he asked<period>
that with simple rules could say "that's one sentence."
Unfortunately there is no perfect solution for this, for the very reasons you stated. If it is content that you can somehow control or force a specified delimiter after every sentence, that would be ideal. Beyond that, all you can really do is look for (\.|!|\?)+ and maybe even throw in a \s after that, since most people put 1 or 2 spaces between the previous and next sentence.
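A minimal sketch of that terminator-plus-whitespace approach, using a hypothetical split_sentences() helper. The lookbehind keeps the punctuation attached to each sentence, and a run such as !? is treated as a single ending because the split happens only at the whitespace that follows it.

```php
<?php
// Hypothetical helper: split on whitespace that follows ., ! or ?.
function split_sentences(string $text): array
{
    return preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

$sentences = split_sentences('Is it?  Yes! Really!? Done.');
echo count($sentences); // prints 4
```

As expected, abbreviations break it: "Prof. Knuth wrote it." comes back as two pieces.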
I think the biggest problem is the possible existence of acronyms! That is why you must write something like Prof.&nbsp;Knuth in a JavaDoc summary sentence, so that the javadoc generator doesn't think the first sentence ends after "Prof.".
This is a problem I don't know how anyone can reliably handle. The only approximate solution I could imagine is the use of an abbreviation dictionary.
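A rough sketch of the abbreviation-dictionary idea: split on terminators followed by whitespace, then ignore any boundary whose preceding word is a known abbreviation. count_sentences() is a hypothetical name, and the default list is a tiny illustrative sample, not a real dictionary.

```php
<?php
// Hypothetical helper: count sentences, skipping boundaries after known abbreviations.
function count_sentences(
    string $text,
    array $abbreviations = ['Prof.', 'Dr.', 'Mr.', 'Mrs.', 'e.g.', 'i.e.']
): int {
    $chunks = preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $count = 0;
    foreach ($chunks as $i => $chunk) {
        $words = preg_split('/\s+/', $chunk);
        $last = end($words); // word right before the boundary
        $is_final = ($i === count($chunks) - 1);
        if (!$is_final && in_array($last, $abbreviations, true)) {
            continue; // boundary was an abbreviation, not a sentence end
        }
        $count++;
    }
    return $count;
}

echo count_sentences('Prof. Knuth wrote it. Are you sure?'); // prints 2
```

The comparison is case-sensitive, and a sentence that genuinely ends with an abbreviation in mid-text is merged with the next one, so this remains an approximation.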
