removing phone number from a document

I've got a challenge that I am hoping that the SO community is able to help me with.
I am trying to parse a lot of HTML documents in my PHP application to remove personal details, such as names, addresses and phone numbers. I can remove most of these details without too much trouble, but phone numbers are a real problem for me.
My idea is to take the text from these documents and then use a regex to identify the phone numbers and replace them with another value such as 'xxxx'.
I've got two regexes that I am using: one for UK landline numbers and one for UK cell/mobile numbers.
However, when I try to run them against the text, it just returns an empty string.
I am using the following preg_replace code:
$patterns = array(
    '/^(((\+44\s?\d{4}|\(?0\d{4}\)?)\s?\d{3}\s?\d{3})|((\+44\s?\d{3}|\(?0\d{3}\)?)\s?\d{3}\s?\d{4})|((\+44\s?\d{2}|\(?0\d{2}\)?)\s?\d{4}\s?\d{4}))(\s?\#(\d{4}|\d{3}))?$/',
    '/^(\+44\s?7\d{3}|\(?07\d{3}\)?)\s?\d{3}\s?\d{3}$/'
);
$replace = array('xxxxx', 'xxxxx');
// do the search for the numbers.
$updatedContents = preg_replace($patterns, $replace, $htmlContents);
At the moment this is causing me a lot of head scratching as I thought that I had this nailed, but at the moment I can't see what's wrong??
I am sure that it is something really simple.
Thanks,
Grant

You probably don't want to anchor your regular expressions. Remove the ^ from the beginning and the $ from the end.
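As a rough sketch of that suggestion: without the m modifier, ^ and $ only match at the very start and end of the subject, so an anchored pattern can only succeed when the whole document is a single phone number. The patterns below are the ones from the question with the anchors dropped; the sample HTML is made up for illustration.

```php
<?php
// Patterns from the question with the leading ^ and trailing $ removed,
// so they can match a number anywhere inside a larger document.
$patterns = array(
    '/(((\+44\s?\d{4}|\(?0\d{4}\)?)\s?\d{3}\s?\d{3})|((\+44\s?\d{3}|\(?0\d{3}\)?)\s?\d{3}\s?\d{4})|((\+44\s?\d{2}|\(?0\d{2}\)?)\s?\d{4}\s?\d{4}))(\s?\#(\d{4}|\d{3}))?/',
    '/(\+44\s?7\d{3}|\(?07\d{3}\)?)\s?\d{3}\s?\d{3}/',
);

// Hypothetical sample input, not taken from the original question.
$htmlContents = '<p>Call 01234 567 890 or 07123 456 789 today.</p>';

// A single replacement string is applied to every pattern in the array.
$updatedContents = preg_replace($patterns, 'xxxxx', $htmlContents);

echo $updatedContents; // <p>Call xxxxx or xxxxx today.</p>
```

Note that preg_replace returns null when an error occurs, so it may be worth checking for that before using the result.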

Related

PHP - How to ask if field contains number or + or spaces or empty?

My website form is getting hammered with spam. I have noticed that in the "Phone" field the spam bots always insert text rather than a number, so I would like to add an if statement to the PHP mailer that blocks the email if the phone field doesn't meet the following rules:
1) I want users to be able to leave the field blank, so empty field must be accepted.
2) Must contain "numbers" or "plus sign" or "spaces"
How would I write this in PHP?
Any help is appreciated
EDIT: Just thought, lol, it would be much easier to just check if the field contains alphabetical characters. How would I do this?
EDIT2: Sorted. I used "if (ctype_alpha ($phone) !== false)"

Regular expressions are probably the best way, although not necessarily the easiest to understand at first. But regular expressions are definitely a good thing to learn if you are not familiar with them. My favorite introduction is this site: http://www.zytrax.com/tech/web/regex.htm And this is a good site for interactively building a regex and seeing how it works in realtime: http://www.regexr.com/ I'm sure there are plenty of other similar sites but those are the two I always go back to.
If you search around for a regular expression solution you will find countless possibilities and variations. My personal advice is to keep it simple. I would start with considering how you store the phone number data. I usually just keep the numbers, so I would simplify it by first removing those "allowed" characters and then checking if what's left over is just numbers.
$phone = str_replace(Array('+', ' ', '(', ')'), '', $phone);
That will replace all pluses, spaces, and parentheses with an empty string (i.e. remove them). Then you can check if the string is numeric, and if it is store it, otherwise print/return an error.
if (!is_numeric($phone)) {
    // stop processing and output an error
}
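For comparison, here is a minimal sketch of the rule as the question states it (empty is allowed, otherwise only digits, plus signs and spaces), done as a single regex check. The function name isAcceptablePhone is made up for this example.

```php
<?php
// Returns true when $phone is empty or contains only digits, "+" and spaces.
// isAcceptablePhone is a hypothetical helper name, not from the original posts.
function isAcceptablePhone(string $phone): bool
{
    // \A and \z anchor to the absolute start and end of the string;
    // the * quantifier lets the empty string pass.
    return preg_match('/\A[0-9+ ]*\z/', $phone) === 1;
}

var_dump(isAcceptablePhone(''));                // bool(true)
var_dump(isAcceptablePhone('+44 1234 567890')); // bool(true)
var_dump(isAcceptablePhone('buy now'));         // bool(false)
```

Unlike the strip-then-is_numeric approach, this accepts a blank field, which is one of the stated requirements.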

First of all, you should use some spam protection, for example a token, a honeypot, a CAPTCHA, etc.
In my country, a mobile or landline phone number contains only 9 digits without the country code, which is +XX. So I create an INT(10) field in the database. After the form is submitted, I remove everything that is not a digit.
For example:
$phoneNumber = (int) substr( preg_replace( '#[^\d]+#', '', $_POST['phone_numer'] ), 0, 9 );
This has always worked for me across many projects.

How to remove offensive words from post by php?

Assume "xyza" is a bad word. I'm using the following method to replace offensive words:
$text = str_replace("xyza","(Offensive words detected & removed!)",$text);
This code will replace xyza with "(Offensive words detected & removed!)".
But the problem is case: if someone types XYZA, my code can't detect it. How can I solve this?

No matter what you do, users will find ways to get around your filters. They will use unicode characters (аss, for example, uses a Cyrillic а and will not get captured by any of the regex solutions). They will use spaces, dollar signs, asterisks, whatever you haven't managed to catch yet.
If family-friendliness is essential to your application, have a person review the content before it goes live. Otherwise, add a flag feature so other people can flag offensive content. Better yet, use some sort of machine learning or Bayesian filter to automatically flag potentially offensive posts and have humans check them out manually. People read human languages better than computers.

The problem with whitelists/blacklists is—as other users have pointed out—your users will make it their priority to find ways around your filter for satisfaction rather than using your website for what it was intended for, whatever that may be.
One approach would be to use Google’s undocumented profanity API it created for its “What Do You Love?” website. If you get a response of true then just give the user a message saying their post couldn’t be submitted due to detected profanity.
You could approach this as follows:
<?php
if (isset($_POST['submit'])) {
    $result = json_decode(file_get_contents(sprintf('http://www.wdyl.com/profanity?q=%s', urlencode($_POST['comments']))));
    if ($result->response == true) {
        // profanity detected
    }
    else {
        // save comments to database as normal
    }
}

Other answers and comments say that programming is not the best solution to this problem. I agree with them. Those answers should be moved to Moderators - Stack Exchange or Webmasters - Stack Exchange.
Since this is stackoverflow, my answer is going to be based on computer programming.
If you want to use str_replace, do something like this.
For the sake of this post, since some people are offended by actual cusswords, let's pretend that these are bad words:
'fug', 'schnitt', 'dam'.
$text = str_ireplace(" fug ","(Offensive words detected & removed!)",$text);
Notice, it's str_ireplace not str_replace. The i is for "case insensitive".
But that will erroneously match "fuggedaboudit," for example.
If you want to do a more reliable job, you need to use regex.
$bad_text = "Fug dis schnitt, because a schnitter never dam wins a fuggin schnitting darn";
$hit_words = array("fug","schnitt","dam"); // these words are 'hits' that we need to replace
array_walk($hit_words, function(&$value, $key) { // this prepares the regex, requires PHP 5.3+ I think.
    $value = '~\b' . preg_quote($value, '~') . '\b~i'; // \b means word boundary, like space, line-break, period, dash, and many others. Prevents "refudgee" from being matched when searching for "fudge"
});
/*print_r($hit_words);*/
$good_words = array("fudge","shoot","dang");
$good_text = preg_replace($hit_words, $good_words, $bad_text); // does all search/replace actions at once
echo '<br />' . $good_text . '<br />';
That will do all your search/replacements at once. The two arrays should contain the same number of elements, matching up searches and replace terms. It will not match parts of words, only whole words. And of course, determined cussers will find ways of getting their swearing onto your website. But it will stop lazy cussers.
I've decided to add some links to sites that obviously use programming to do a first run through removing profanity. I'll add more as I come across them. Other than yahoo:
1.) Dell.com - replace matching words with <profanity deleted>.
http://en.community.dell.com/support-forums/peripherals/f/3529/t/19502072.aspx
2.) Watson, the supercomputer, apparently developed a cursing problem. How do you tell the difference between cursing and slang? Apparently, it's so hard that the researchers just decided to purge it all. But they could have just used a list of curse words ( exact matching is a subset of regex, I would say) and forbidden their use. That's kind of how it works in real life, anyway.
Watson develops a profanity problem
3.) Content Compliance section of Gmail custom settings in Apps for Business:
Add expressions that describe the content you want to search for in each message
The "Expressions" used can be of several types, including "Advanced content match", which, among other things, allows you to choose "Match type" options very similar to what you'd have in an Excel filter: Starts with, Ends with, Contains, Not contains, Equals, Is Empty, all of which presumably use regex. But wait, there's more: Matches regex, Not matches regex, Matches any word, Matches all words. So, the mighty Google implements regex filtering options for its business users. Why would it do that, when regex is supposedly so ineffective? Because it actually is effective enough. It is a simple, fast, programming solution that will only fail when people are hell-bent on circumventing it.
Besides that list, I wonder if anyone else has noticed the similarity between weeding out profanity and filtering out spam. Clearly, regex has uses in both arenas but nitpickers who learned by rote that "all regex is bad" will always downvote any answer to any question if regex is even mentioned.
Try googling "how spam filters work". You'll get results like this one that covers spam assassin:
http://www.seas.upenn.edu/cets/answers/spamblock-filter.html
Another example where I'm sure regex is used is when communicating via Amazon.com's Amazon Marketplace. You receive emails at your usual email address. So, naturally, when responding to a seller, your email program will include all kinds of sender information, like your email address, cc email addresses, and any you enter into the body. But Amazon.com strips these out "for your protection." Can I find a way around this regex? Probably, but it would take more trouble than it's worth and is therefore effective to a degree. They also keep the emails for 2 years, presumably so that a human can go over them in case of any fraud claims.
SpamAssassin also looks at the subject and body of the message for the same sort of things that a person notices when a message "looks like spam". It searches for strings like "viagra", "buy now", "lowest prices", "click here", etc. It also looks for flashy HTML such as large fonts, blinking text, bright colors, etc.
Regex is not mentioned, but I'm sure it's in use.

Use the str_ireplace function, which is the case-insensitive version of str_replace():
$text = str_ireplace("flip","(Offensive words detected & removed!)", $text);

Use 'str_ireplace' to replace strings regardless of case.
This will probably help you:
$text = 'contains offensive_word .... so on';
$array = array(
    'offensive_word' => '****',
    'offensive_word2' => '****',
    'offensive_word3' => '****',
    //.....
);
$text = str_ireplace(array_keys($array), array_values($array), $text);
echo $text;

You should use regex replacement and need to add the i flag to the end of your regex so it searches your text regardless of case. so..
$text = preg_replace("/xyza/i","(Offensive words detected & removed!)", $text);
str_ireplace can also be used if you don't need complex regex rules.
$text = str_ireplace("xyza","(Offensive words detected & removed!)", $text);
In fact, the latter is the preferred way as it's faster than regex manipulation. From PHP docs:
If you don't need fancy replacing rules, you should generally use this function instead of preg_replace() with the i modifier.
BUT, as the commenter pointed out, simple string/regex replacements can break your strings if the substring you're replacing appears as part of another non-offensive word. For this, you could either use word boundaries in your regexes or replace only those words that can't be part of other strings (e.g. the word xyza).
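A small sketch of the word-boundary variant, using the placeholder word xyza from the question. The sample sentence and the made-up word "xyzable" are only there to show that a longer word containing the same letters is not touched.

```php
<?php
// Hypothetical sample text; "xyzable" stands in for a harmless longer word.
$text = 'XYZA is blocked, but xyzable is left alone.';

// \b on both sides restricts the match to the whole word; i ignores case.
$clean = preg_replace('/\bxyza\b/i', '****', $text);

echo $clean; // **** is blocked, but xyzable is left alone.
```

If the word list comes from user input or a file, passing each word through preg_quote() before building the pattern avoids accidental regex metacharacters.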

Make certain text in outlook incoming e-mails into links?

I'd like to do some operations on incoming e-mails. Namely transform all 6 digit numbers into links which lead to a url based on the number.
I don't want to open a huge can of worms, in terms of APIs or languages besides PHP, this isn't that much of a timesaver, but it would be nice. Anyone done anything like this? Just looking to get pointed in the right direction !

You can use a regex to find your numbers and replace them with your links. Since I do not know your link structure, I made one up.
Here is a simple example:
$str = "Testing 385758 String";
preg_replace( '/(\d{6})/', '$1', $str);
This will turn $str into:
Testing 385758 String
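Here is a minimal sketch of the same idea with the link markup written out in full. The URL https://example.com/ticket/ is purely hypothetical, since the real link structure is not known, and the \b word boundaries are an addition so that longer digit runs are not partially matched.

```php
<?php
$str = 'Testing 385758 String';

// https://example.com/ticket/ is a hypothetical URL; substitute your own.
// $1 in the replacement refers to the six digits captured by (\d{6}).
$linked = preg_replace(
    '/\b(\d{6})\b/',
    '<a href="https://example.com/ticket/$1">$1</a>',
    $str
);

echo $linked;
// Testing <a href="https://example.com/ticket/385758">385758</a> String
```

preg_replace returns the new string rather than modifying $str in place, so the result has to be assigned.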

PHP preg_replace_callback correct escaping to handle hash symbol ('#')

I'm doing some work with a Twitter feed and want to turn any hashtags into a clickable URL.
A hashtag is a hash symbol ('#') immediately followed by a word acting as a search tag - and contains no spaces.
An example would be ...
#Eutechnyx looking to form a tech group in #Shoreditch next year. Game and Web programmers get in touch. #AutoClubRev
There are two tags here, #Shoreditch and #AutoClubRev.
These should respectively become the following links ...
https://twitter.com/#!/search?q=%23Shoreditch
and
https://twitter.com/#!/search?q=%23AutoClubRev
I'm assuming I should be using preg_replace_callback here and not just vanilla preg_replace, as I am trying to take a backreference ($1) and change it, not just display it. But of course I could be wrong. I'm not fussed about which function to use, as long as it does the job and is relatively efficient.
Thanks,
Pete

preg_replace should be able to do it.
$test = "#Eutechnyx looking to form a tech group in #Shoreditch next year. Game and Web programmers get in touch. #AutoClubRev";
echo preg_replace('|#([\w_\d]+)|', '#\1', $test);
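If the captured tag needs to be transformed rather than just echoed back, preg_replace_callback is one way to do it. The sketch below builds the search URL format given in the question and URL-encodes the tag, which turns the leading # into %23. The shortened sample string is an assumption for brevity.

```php
<?php
// Shortened sample based on the tweet in the question.
$test = 'Tech group in #Shoreditch next year.';

$linked = preg_replace_callback(
    '/#(\w+)/',
    function (array $m): string {
        // $m[0] is the full match including "#"; urlencode() turns "#" into "%23".
        $url = 'https://twitter.com/#!/search?q=' . urlencode($m[0]);
        return '<a href="' . $url . '">' . $m[0] . '</a>';
    },
    $test
);

echo $linked;
// Tech group in <a href="https://twitter.com/#!/search?q=%23Shoreditch">#Shoreditch</a> next year.
```

The "#" inside the generated URL is not matched again, because the replaced text is not rescanned.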

Whitelist in php

I have an input for users where they are supposed to enter their phone number. The problem is that some people write their phone number with hyphens and spaces in them. I want to put the input trough a filter to remove such things and store only digits in my database.
I figured that I could do some str_replace() for the whitespaces and special chars.
However I think that a better approach would be to pick out just the digits instead of removing everything else. I think that I have heard the term "whitelisting" about this.
Could you please point me in the direction of solving this in PHP?
Example: I want the input "0333 452-123-4" to result in "03334521234"
Thanks!

This is a non-trivial problem because there are lots of colloquialisms and regional differences. Please refer to What is the best way for converting phone numbers into international format (E.164) using Java? It's Java but the same rules apply.
I would say that unless you need something more fully-featured, keep it simple. Create a list of valid regular expressions and check the input against each until you find a match.
If you want it really simple, simply remove non-digits:
$phone = preg_replace('![^\d]+!', '', $phone);
By the way, just picking out the digits is, by definition, the same as removing everything else. If you mean something different you may want to rephrase that.
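Applied to the example input from the question, the digit-only approach looks like this (a minimal sketch; \D is shorthand for [^\d]):

```php
<?php
// Example input taken from the question.
$phone = '0333 452-123-4';

// Remove every run of non-digit characters.
$digits = preg_replace('/\D+/', '', $phone);

echo $digits; // 03334521234
```

Keeping the result as a string preserves the leading zero, which would be lost if the value were cast to an integer.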

$number = filter_var(str_replace(array("+","-"), '', $number), FILTER_SANITIZE_NUMBER_INT);
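str_replace strips the plus and minus signs first, and filter_var with FILTER_SANITIZE_NUMBER_INT then removes everything except digits, plus signs and minus signs, so only digits are left.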
or you could use preg_replace
$number = preg_replace('/[^0-9]/', '', $number);

You could do it two ways. Iterate through each index in the string, and run is_numeric() on it, or you could use a regular expression on the string.
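The first option could be sketched roughly as follows. The function name digitsOnly is made up for this example, and the regex route is usually shorter.

```php
<?php
// Keeps only the characters of $input that is_numeric() accepts one at a time,
// which for single characters means the digits 0-9.
// digitsOnly is a hypothetical helper name, not from the original posts.
function digitsOnly(string $input): string
{
    $digits = '';
    foreach (str_split($input) as $char) {
        if (is_numeric($char)) {
            $digits .= $char;
        }
    }
    return $digits;
}

echo digitsOnly('0333 452-123-4'); // 03334521234
```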

On the client side, I recommend using some input formatting that you design when creating the form. This works well for zip code or telephone fields. Take a look at this jQuery plugin for reference. It will make things much easier later on the server side.
