I know it can be done for bad words (by checking against an array of preset words), but how can I detect telephone numbers in a long text?
I'm building a website in PHP for a client who needs to stop people from putting their mobile phone numbers in the description field (see craigslist etc.).
He's going to need some moderation anyway, but I was wondering if there is a way to block at least the obvious formats like nnn-nnn-nnnn. I'm not asking to block other weird ways of writing numbers, like HeiGHT*/four*/nine etc.
Welcome to the world of regular expressions. You're basically going to want to use preg_replace to look for (some pattern) and replace with a string.
Here's something to start you off:
$text = preg_replace('/\+?[0-9][0-9()\-\s+]{4,20}[0-9]/', '[blocked]', $text);
this looks for:
a plus symbol (optional), followed by a digit, followed by between 4 and 20 digits, brackets, dashes, spaces or plus signs, followed by a digit
and replaces with the string [blocked].
This catches all the obvious combinations I can think of:
012345 123123
+44 1234 123123
+44(0)123 123123
0123456789
Placename 123456 (although this one will leave 'Placename')
However, it will also strip out any run of 6 or more digits, which might not be desirable!
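One way to soften that side effect, sketched here as an assumption rather than as part of the original answer, is to run the same pattern through preg_replace_callback and only block a match when it contains enough digits. The helper name blockPhoneNumbers and the threshold of 9 digits are hypothetical; a higher threshold means short numbers such as "Placename 123456" are no longer blocked.

```php
<?php
// Hypothetical helper: same pattern as above, but a match is only blocked
// when it contains at least $minDigits digits (9 is an assumed threshold).
function blockPhoneNumbers(string $text, int $minDigits = 9): string
{
    $pattern = '/\+?[0-9][0-9()\-\s+]{4,20}[0-9]/';

    return preg_replace_callback(
        $pattern,
        function (array $match) use ($minDigits): string {
            // Count only the digits inside the matched run.
            $digitCount = preg_match_all('/[0-9]/', $match[0]);

            // Leave short digit runs (order numbers, years, ...) untouched.
            return $digitCount >= $minDigits ? '[blocked]' : $match[0];
        },
        $text
    );
}

echo blockPhoneNumbers('Call +44 1234 123123 now'), "\n";
echo blockPhoneNumbers('Order 123456 shipped'), "\n";
```

The trade-off is the usual one: raising the threshold cuts false positives but lets shorter local numbers through.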
As you may know, you will need regular expressions for this.
I found this pattern that could be useful for your project:
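<?php
// Note: the ^ and $ anchors mean this only matches when $yourText consists of nothing but the phone number.
preg_match("/(^(([\+]\d{1,3})?[ \.-]?[\(]?\d{3}[\)]?)?[ \.-]?\d{3}[ \.-]?\d{4}$)/", $yourText, $matches);
// $matches will contain the array of matched strings
?>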
More information about this pattern can be found here http://gskinner.com/RegExr/?2rirv where you can even test it online. It's a great tool to test regular expressions.
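Since that pattern is anchored with ^ and $, it will not find a number buried inside a longer description. A minimal sketch of one way around that, assuming it is acceptable to simply drop the two anchors and collect every hit with preg_match_all (the function name findPhoneNumbers is hypothetical):

```php
<?php
// Hypothetical helper: the same pattern with the ^ and $ anchors removed,
// so it can find numbers anywhere inside a longer text.
function findPhoneNumbers(string $text): array
{
    $pattern = '/(([\+]\d{1,3})?[ \.-]?[\(]?\d{3}[\)]?)?[ \.-]?\d{3}[ \.-]?\d{4}/';
    preg_match_all($pattern, $text, $matches);

    // A match may begin with an optional separator (e.g. a space), so trim it.
    return array_map('trim', $matches[0]);
}

print_r(findPhoneNumbers('Call me at 555-123-4567 or +1 (555) 987-6543 tonight'));
```

Without the anchors the pattern will also pick up any run of seven or more digits, so the false-positive caveat applies here too.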
preg_match($pattern, $subject) will return 1 (true) if pattern is found in subject, and 0 (false) otherwise.
A pattern to match the example you give might be '/\d{3}-\d{3}-\d{4}/'
However whatever you choose for your pattern will suffer from both false positives and false negatives.
You might also consider looking for words like mob, cell or tel next to the number.
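As a rough sketch of that idea, the check below only fires when a keyword sits directly in front of a digit run. The keyword list, the allowed separators and the function name mentionsPhoneNearKeyword are all assumptions for illustration; a real list would need tuning against actual user data.

```php
<?php
// Hypothetical helper: true when a phone-like digit run directly follows
// a keyword such as "mob", "cell" or "tel" (the keyword list is assumed).
function mentionsPhoneNearKeyword(string $text): bool
{
    $pattern = '/\b(?:mob(?:ile)?|cell|tel(?:ephone)?|phone)\b'  // keyword
             . '[\s:.#-]{0,5}'                                   // short separator
             . '\+?[0-9][0-9()\-\s]{4,20}[0-9]/i';               // digit run

    return preg_match($pattern, $text) === 1;
}

var_dump(mentionsPhoneNearKeyword('Great bike. Tel: 012345 123123'));
var_dump(mentionsPhoneNearKeyword('Item weighs 123456 grams'));
```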
The full details of the PHP pattern matching can be found at http://www.php.net/manual/en/reference.pcre.pattern.syntax.php
Ian
p.s. It can't be done for bad words, as the people in Scunthorpe will tell you.
I think that using too tight a regular expression would make you lose a great number of detections.
You should check for stretches of 10 consecutive characters containing more than 5 digits.
Because of the computational weight, you will probably want an analysis routine that is queued to run after each message is inserted.
After the 6 or more digits have been isolated, replace them as you prefer, including any neighbouring digits.
In any case it is better to preserve the original data, so you can test and train your detection algorithm until it works as well as possible.
Then you can also study your user data to create more complex heuristics, such as case-insensitive numbers written as letters, mixed forms, dot-separated digits, etc.
It's not about writing the perfect regex; it's about approaching the problem statistically and dynamically.
And remember, after you take action, users will change their habits in response, so the stats will change and you will need to learn and update your heuristics.
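The "10 characters containing more than 5 digits" check can be sketched as a sliding window. This is only one possible reading of the heuristic; the function name hasDigitDenseWindow and its defaults are assumptions, and it counts bytes rather than multibyte characters.

```php
<?php
// Hypothetical helper: true when any window of $window consecutive bytes
// contains at least $minDigits digits (defaults: more than 5 digits in 10).
function hasDigitDenseWindow(string $text, int $window = 10, int $minDigits = 6): bool
{
    $length = strlen($text);
    $digitsInWindow = 0;

    for ($i = 0; $i < $length; $i++) {
        // The character entering the window on the right.
        if (ctype_digit($text[$i])) {
            $digitsInWindow++;
        }
        // The character leaving the window on the left.
        if ($i >= $window && ctype_digit($text[$i - $window])) {
            $digitsInWindow--;
        }
        if ($digitsInWindow >= $minDigits) {
            return true;
        }
    }

    return false;
}

var_dump(hasDigitDenseWindow('ring 555 12 34 99'));
var_dump(hasDigitDenseWindow('Built in 1999, sold 2004'));
```

Because it is a single pass over the text, it is cheap enough to run as the first stage of a queued analysis, with the more expensive heuristics applied only to messages it flags.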
I must detect the presence of some words (even polyrematic, like in "bag of words") in a user-submitted string.
I need to find the exact word, not part of it, so the strstr/strpos/stripos family is not an option for me.
My current approach (PHP/PCRE regex) is the following:
\b(first word|second word|many other words)\b
Is there any other better approach? Am I missing something important?
Words are about 1500.
Any help is appreciated
A regular expression the way you're demonstrating will work. It may be challenging to maintain if the list of words grows long or changes.
The method you're using will work in the event that you need to look for phrases with spaces and the list doesn't grow much.
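To ease the maintenance burden, the pattern can be generated from an array instead of being written by hand. A minimal sketch, assuming the words live in a plain PHP array (the name buildWordListPattern is hypothetical):

```php
<?php
// Hypothetical helper: builds \b(word one|word two|...)\b from an array,
// escaping regex metacharacters in each entry with preg_quote.
function buildWordListPattern(array $words): string
{
    $quoted = array_map(
        function (string $word): string {
            return preg_quote($word, '/');
        },
        $words
    );

    return '/\b(?:' . implode('|', $quoted) . ')\b/i';
}

$pattern = buildWordListPattern(['bag of words', 'first word']);
var_dump(preg_match($pattern, 'A bag of words model'));
var_dump(preg_match($pattern, 'A bag of wordsmiths'));
```

Note that \b only behaves as expected for entries that start and end with a word character.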
If there are no spaces in the words you're looking for, you could split the input string on space characters (\s+, see https://www.php.net/manual/en/function.preg-split.php ), then check to see if any of those words are in a Set (https://www.php.net/manual/en/class.ds-set.php) made up of the words you're looking for. This will be a bit more code, but less regex maintenance, so ymmv based on your application.
If the set has entries with spaces, consider using a trie instead. Wiktor Stribiżew suggests: https://github.com/sters/php-regexp-trie
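The split-and-look-up approach for single words might look like the sketch below. It swaps Ds\Set for a plain array used as a hash map (array_flip plus isset), since the ds extension may not be installed; the function name containsListedWord and the punctuation trimming are assumptions.

```php
<?php
// Hypothetical helper: splits on whitespace and looks each token up in a
// hash map. Only suitable for single-word entries (no spaces).
function containsListedWord(string $text, array $wordList): bool
{
    // Keys give constant-time lookups via isset().
    $lookup = array_flip(array_map('strtolower', $wordList));

    $tokens = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        // Drop surrounding punctuation so "badger," still matches "badger".
        $token = trim($token, ".,;:!?\"'()");
        if (isset($lookup[$token])) {
            return true;
        }
    }

    return false;
}

var_dump(containsListedWord('The badger, sadly, left.', ['badger']));
var_dump(containsListedWord('Classic cars', ['class']));
```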
I'm going insane over this, it's so simple, yet I can't figure out the right regex. I need a regex that will match blacklisted words, i.e. "ass".
For example, in this string:
<span class="bob">Blacklisted word was here</span>bass
I tried that regex:
((?!class)ass)
That should match the "ass" in the word "bass" but NOT in "class".
However, this regex flags "ass" in both occurrences. I looked up multiple negative lookaheads on Google and none of them works.
NOTE: This is for a CMS, for moderators to easily find potentially bad words, I know you cannot rely on a computer to do the filtering.
If you have lookbehind available (which, IIRC, JavaScript does not and that seems likely what you're using this for) (just noticed the PHP tag; you probably have lookbehind available), this is very trivial:
(?<!cl)(ass)
Without lookbehind, you probably need to do something like this:
(?:(?!cl)..|^.?)(ass)
That's ass, with any two characters before as long as they are not cl, or ass that's zero or one characters after the beginning of the line.
Note that this is probably not the best way to implement a blacklist, though. You probably want this:
\bass\b
Which will match the word ass but not any word that includes ass in it (like association or bass or whatever else).
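To see how the variants differ, here is a small sketch that counts matches against the string from the question. The helper countMatches is hypothetical; the patterns are the ones discussed above.

```php
<?php
// Hypothetical helper: number of times $pattern matches in $text.
function countMatches(string $pattern, string $text): int
{
    return (int) preg_match_all($pattern, $text);
}

$html = '<span class="bob">Blacklisted word was here</span>bass';

echo countMatches('/ass/', $html), "\n";          // plain substring search
echo countMatches('/(?<!cl)ass/', $html), "\n";   // lookbehind excludes "class"
echo countMatches('/\bass\b/', $html), "\n";      // whole word only
echo countMatches('/\bass\b/', 'What an ass.'), "\n";
```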
It seems to me that you're actually trying to use two lists here: one for words that should be excluded (even if one is a part of some other word), and another for words that should not be changed at all - even though they have the words from the first list as substrings.
The trick here is to know where to use the lookbehind:
/ass(?<!class)/
In other words, the good word negative lookbehind should follow the bad word pattern, not precede it. Then it would work correctly.
You can even get some of them in a row:
/ass(?<!class)(?<!pass)(?<!bass)/
This, though, will skip the ass in both passhole and pass. To make it more bullet-proof, we can add word-boundary checks:
/ass(?<!\bclass\b)(?<!\bpass\b)(?<!\bbass\b)/
UPDATE: of course, it's more efficient to check for parts of the string, with (?<!cl)(?<!b) etc. But my point was that you can still use the whole words from whitelist in the regex.
Then again, perhaps it'd be wise to prepare the whitelists accordingly (so shorter patterns will have to be checked).
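A quick sketch for checking how the two lookbehind variants behave on a few sample words; the helper name isFlagged and the sample words are only illustrative.

```php
<?php
// Hypothetical helper: true when the blacklist pattern fires on $word.
function isFlagged(string $pattern, string $word): bool
{
    return preg_match($pattern, $word) === 1;
}

$plain   = '/ass(?<!class)(?<!pass)(?<!bass)/';
$bounded = '/ass(?<!\bclass\b)(?<!\bpass\b)(?<!\bbass\b)/';

foreach (['class', 'pass', 'passhole', 'assess'] as $word) {
    printf(
        "%-9s plain=%s bounded=%s\n",
        $word,
        var_export(isFlagged($plain, $word), true),
        var_export(isFlagged($bounded, $word), true)
    );
}
```

With the word boundaries in place, only the exact whitelisted words are exempt, so passhole is flagged while pass is not.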
Is this what you want? (?<!class)(\w+ass)
I am looking to implement a system to strip out URLs from text posted by a user.
I know there is no perfect solution and users will still attempt things like:
www dot google dot com
so I know that ultimately any solution will be flawed in some way... all I am looking to do really is reduce the number of people doing it.
Any suggestions, source or approaches appreciated,
Thanks
There are a number of regular expression pattern matchers here. Some of them are quite complex.
I would suggest that running multiple ones may be a good idea.
You need to define exactly what you want to strip out. The looser the definition, the more false positives you will get. The following example will remove any string with 3 letters, followed by a period, more letters, another period and 2-4 more letters:
$text = preg_replace('/[a-z]{3}\.[a-z]+\.[a-z]{2,4}/i', '', $text);
The other end of strictness might be anything that ends on a period and 2-4 letters (like .com):
$text = preg_replace('/[a-z]+\.[a-z]{2,4}/i', '', $text);
Note that the latter will strip out the last word of a sentence, the full stop and the first word of the next sentence if someone forgets to add a space in between the sentences.
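The difference between the two patterns is easy to see side by side. In this sketch the wrapper names stripUrlsStrict and stripUrlsLoose are hypothetical; the patterns themselves are the two shown above.

```php
<?php
// Hypothetical wrappers around the two patterns from this answer.
function stripUrlsStrict(string $text): string
{
    return preg_replace('/[a-z]{3}\.[a-z]+\.[a-z]{2,4}/i', '', $text);
}

function stripUrlsLoose(string $text): string
{
    return preg_replace('/[a-z]+\.[a-z]{2,4}/i', '', $text);
}

// Both remove an ordinary URL-like string...
echo stripUrlsStrict('Visit www.google.com today'), "\n";
echo stripUrlsLoose('Visit www.google.com today'), "\n";

// ...but only the loose one damages a sentence with a missing space.
echo stripUrlsStrict('Hello world.Next sentence'), "\n";
echo stripUrlsLoose('Hello world.Next sentence'), "\n";
```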
Using PHP, how can I verify if a phone # is well formed?
It seems easiest to simply strip all non-numeric data, leaving only the numbers. Then to check if 10 digits exist.
Is this the best and easiest way?
The best? No. Issues I see with this approach:
Some area codes - like 000-###-#### - are not valid. See http://en.wikipedia.org/wiki/List_of_NANP_area_codes
Some exchanges - like ###-555-#### - are not valid. See http://en.wikipedia.org/wiki/555_%28telephone_number%29
Some people will enter a 1 before their number, i.e. 1-###-###-####.
Some people are only reachable at an extension, like ###-###-#### x####.
Some companies tack on extra digits, like 1-800-GO-FLOWERS. The additional digits are simply ignored by the phone system, but a user might expect to be able to enter the whole thing.
International phone numbers are not necessarily 10 digits, even if you discount the country codes.
Good enough? Quite possibly, but that's up to you and your app.
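For reference, the strip-and-count approach from the question, extended to cover two of the issues above, could be sketched like this. It assumes North American numbers only, tolerates a leading 1, and relies on the NANP rule that area codes and exchange codes begin with 2-9; it does not handle extensions, vanity numbers or the 555 exchange. The function name isPlausibleNanpNumber is hypothetical.

```php
<?php
// Hypothetical helper: strips non-digits, drops an optional leading 1, then
// checks for 10 digits whose area code and exchange both start with 2-9.
function isPlausibleNanpNumber(string $input): bool
{
    $digits = preg_replace('/\D/', '', $input);

    // Accept the 1-###-###-#### form by removing the leading 1.
    if (strlen($digits) === 11 && $digits[0] === '1') {
        $digits = substr($digits, 1);
    }

    return preg_match('/^[2-9]\d{2}[2-9]\d{6}$/', $digits) === 1;
}

var_dump(isPlausibleNanpNumber('1-212-555-0123'));
var_dump(isPlausibleNanpNumber('(212) 555-0123'));
var_dump(isPlausibleNanpNumber('000-555-0123'));
```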
You can use a regex for it:
$pattern_phone = "|^[0-9\+][0-9\s+\-]*$|i";
if(!preg_match($pattern_phone,$phone)){
// Something's wrong
}
Haven't tested the regex, so it may not be 100% correct.
Checking for 10 digits after stripping will check the syntax but won't check the validity. For that you'd need to determine what valid numbers are available in the region/country and probably write a regex to match the patterns.
The problem with validating/filtering data like this usually comes down to the answer to this question: "How strict do I want to be?" which then devolves into a series of "feature" questions
Are you going to accept international numbers?
Are you going to accept extensions?
Are you going to allow various formats i.e., (111) 222-3333 vs 111.222.3333
Depending on your business rules, the answers to these questions can vary. But to be the most flexible, I recommend 3 fields to take a phone number
Country Code (optional)
Phone Number
Extension (optional)
All 3 fields can be programmatically limited/filtered to numeric values only. You can then combine them before storing into some parse-able format, or store each value individually.
Answering if something is "the best" thing to do, is nearly impossible (unless you're the one answering your own question).
The way you propose it, stripping all non-digits and then checking whether 10 digits remain, might result in unwanted behaviour for a string like:
George Washington (February 22, 1732 –
December 14, '99) was the commander
of the Continental Army in the
American Revolutionary War and served
as the first President of the United
States of America.
since stripping all non-digits will result in the string 2217321499, which is 10 digits long, but I highly doubt that the entire string should be considered a valid phone number.
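The effect is easy to reproduce. A minimal sketch using preg_replace with \D to drop everything that is not a digit (the variable names are arbitrary):

```php
<?php
// The first sentence of the example text, as a single string.
$text = "George Washington (February 22, 1732 – December 14, '99) was the "
      . "commander of the Continental Army in the American Revolutionary War.";

// Strip every non-digit character, as proposed in the question.
$digits = preg_replace('/\D/', '', $text);

echo $digits, "\n";
echo strlen($digits), "\n";
```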
What format do you need? You can use regular expressions for this.
I want to display the results of a search query on a website with a title and a short description. The short description should be a small part of the page that contains the search term. What I want to do is:
1 strip tags in page
2 find first position of search term
3 from that position, going back find the beginning (if there is one) of that sentence.
4 Start at the found position in step 3 and display e.g. 200 characters from there
I need some help with step 3. I think I need a regex that finds the first capital or dot...
Even that will ultimately fail. Given the sentence "We went to Dr. Smith's office", if your search term is "office", virtually any criterion you use will give you "Smith's office" as your sentence.
The way I would do it is, I would parse the page...
Skip over all the things starting with '<'
When you encounter a "." or [A-Z], start putting text into a buffer until you find another "."
If the buffered string has the search keyword, that's your string! Otherwise, start buffering at the "." you encountered and repeat.
EDIT: As James Curran pointed out, this strategy would fail in some cases... So here's the solution:
What you can do, is to start X number of characters from start of page (after tags)
and then search for your keyword, buffering 2 previous words. When you find it,
do something like this: {X} ... {prev-2} {next-2}
Example: This planet has - or rather had - a problem, which was this: most of the people living on it were unhappy for pretty much of the time. Many solutions were suggested for this problem, but most of these were largely concerned with the movement of small green pieces of paper, which was odd because on the whole it wasn't the small green pieces of paper that were unhappy.
Search Keyword: "suggested"
Result: This planet has - or rather had - a problem ... Many solutions were suggested for this problem...
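For step 3: reverse the substring that ends where you want to search backward from, get the position of the first '.' in it, and subtract that value from the position of your search string.
// Reverse the text before the search term, so the first '.' found is the nearest preceding one.
$offset = strpos(strrev(substr($string, 0, $searchlocation)), '.');
// No earlier '.' means the sentence starts at the beginning of the text.
$startloc = ($offset === false) ? 0 : $searchlocation - $offset;
$finalstring = substr($string, $startloc, 200);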
That may be off by 1, but I think it'll get the job done. Seems like there should be a shorter way to do it.
I think instead of trying to find sentences, I'd think about the amount of context around the search term I would need in words. Then go backwards some fraction of this number of words (or to the beginning) and forward the remaining number of words to select the rest of the context. In this way, you just split the entire corpus on whitespace, find the first occurrence of the term (perhaps using a fuzzy match to find subterms and account for punctuation), and apply the above algorithm. You could even be creative about introducing ellipses if the first non-selected term doesn't end in punctuation, etc.
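A rough sketch of that word-window idea follows. The function name contextSnippet, the default window sizes and the simple substring match via stripos (standing in for the "fuzzy match") are all assumptions.

```php
<?php
// Hypothetical helper: returns the first occurrence of $term with a few
// words of context on either side, or null when the term is not found.
function contextSnippet(string $text, string $term, int $wordsBefore = 5, int $wordsAfter = 10): ?string
{
    $words = preg_split('/\s+/', trim(strip_tags($text)), -1, PREG_SPLIT_NO_EMPTY);

    foreach ($words as $i => $word) {
        // Substring match, so "suggested," still matches "suggested".
        if (stripos($word, $term) === false) {
            continue;
        }

        $start  = max(0, $i - $wordsBefore);
        $length = ($i - $start) + 1 + $wordsAfter;
        $snippet = implode(' ', array_slice($words, $start, $length));

        // Add ellipses where text was cut off on either side.
        if ($start > 0) {
            $snippet = '... ' . $snippet;
        }
        if ($start + $length < count($words)) {
            $snippet .= ' ...';
        }

        return $snippet;
    }

    return null;
}

$page = 'Many solutions were suggested for this problem, but most of these '
      . 'were largely concerned with paper.';
echo contextSnippet($page, 'suggested', 2, 3), "\n";
```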
To save others from thinking they can beat this problem - it can't be done without accepting either false positives or false negatives. To add to what James Curran said, you either declare Smith the start of the sentence in We went to Dr. Smith's office., or you read This sentence is English. So is this one. as a single sentence.
Next to those problems, different forms of abbreviations and Overeager Capitalization Of Every Word Can Kill Your Algorithm Or Regex.
That said, I might as well share the regexes I came up with.
The first regex is simple enough:
(?m)(?:^|[.!?][\t ]+)([A-Z]\S*)
It matches the start of a line or a .!?
This is followed by at least one tab or space, after which a capital letter is matched along with the rest of the word (including dots, to match abbreviations).
The first word of the sentence will be caught in group 1.
The second regex
(?m)[A-Z]\S*\.[^\S\r\n]+[A-Z]|(?:^|[.!?][\t ]+)([A-Z]\S*)
This is the previous regex, prepended with [A-Z]\S*\.[^\S\r\n]+[A-Z]|. This part matches a word starting with a capital, followed by a dot, some whitespace and another capitalized character. Because the first part gets matched, the second part no longer tries to match it (explained in-depth here). The first word of the sentence will again be caught in group 1.
The first regex has false positives: it will wrongly match Smith in the second half of the sentence We went to Dr. Smith's office.
The second regex has false negatives: it will fail to match So in This sentence is English. So is this one.
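Both behaviours can be checked from PHP with preg_match_all. In this sketch the helper sentenceStarts is hypothetical; it collects group 1 and drops the empty entries produced when the second regex matches its first alternative.

```php
<?php
// Hypothetical helper: returns the words captured in group 1, i.e. the
// words each regex considers to be the start of a sentence.
function sentenceStarts(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches);

    // Matches of the first alternative leave group 1 empty; drop those.
    return array_values(array_filter(
        $matches[1],
        function (string $word): bool {
            return $word !== '';
        }
    ));
}

$first  = '/(?m)(?:^|[.!?][\t ]+)([A-Z]\S*)/';
$second = '/(?m)[A-Z]\S*\.[^\S\r\n]+[A-Z]|(?:^|[.!?][\t ]+)([A-Z]\S*)/';

print_r(sentenceStarts($first, "We went to Dr. Smith's office."));
print_r(sentenceStarts($second, "We went to Dr. Smith's office."));
print_r(sentenceStarts($first, 'This sentence is English. So is this one.'));
print_r(sentenceStarts($second, 'This sentence is English. So is this one.'));
```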
Test the regexes here.