Using regex to fix phone numbers in a CSV with PHP


My new phone does not recognize a phone number unless its area code matches the incoming call. Since I live in Idaho where an area code is not needed for in-state calls, many of my contacts were saved without an area code. Since I have thousands of contacts stored in my phone, it would not be practical to manually update them. I decided to write the following PHP script to handle the problem. It seems to work well, except that I'm finding duplicate area codes at the beginning of random contacts.
<?php
//the script can take a while to complete
set_time_limit(200);
function validate_area_code($number) {
//digits are taken one by one out of $number and appended to $numString
$numString = "";
for ($i = 0; $i < strlen($number); $i++) {
$curr = substr($number,$i,1);
//only copy from $number to $numString when the character is numeric
if (is_numeric($curr)) {
$numString = $numString . $curr;
}
}
//add area code "208" to the beginning of any phone number of length 7
if (strlen($numString) == 7) {
return "208" . $numString;
//remove country code (none of the contacts are outside the U.S.)
} else if (strlen($numString) == 11) {
return preg_replace("/^1/","",$numString);
} else {
return $numString;
}
}
//matches any phone number in the csv
$pattern = "/((1? ?\(?[2-9]\d\d\)? *)? ?\d\d\d-?\d\d\d\d)/";
$csv = file_get_contents("contacts2.CSV");
preg_match_all($pattern,$csv,$matches);
foreach ($matches[0] as $key1 => $value) {
/*create a pattern that matches the specific phone number by adding slashes before possible special characters*/
$pattern = preg_replace("/\(|\)|\-/","\\\\$0",$value);
//create the replacement phone number
$replacement = validate_area_code($value);
//add delimiters
$pattern = "/" . $pattern . "/";
$csv = preg_replace($pattern,$replacement,$csv);
}
echo $csv;
?>
Is there a better approach to modifying the CSV? Also, is there a way to minimize the number of passes over the CSV? In the script above, preg_replace is called thousands of times on a very large String.

If I understand you correctly, you just need to prepend the area code to any 7-digit phone number anywhere in this file, right? I have no idea what kind of system you're on, but if you have some decent tools, here are a couple options. And of course, the approaches they take can presumably be implemented in PHP; that's just not one of my languages.
So, how about a sed one-liner? Just look for 7-digit phone numbers, bounded by either beginning of line or comma on the left, and comma or end of line on the right.
sed -r 's/(^|,)([0-9]{3}-[0-9]{4})(,|$)/\1208-\2\3/g' contacts.csv
Or if you want to only apply it to certain fields, perl (or awk) would be easier. Suppose it's the second field:
perl -F, -ane '$"=","; $F[1]=~s/^[0-9]{3}-[0-9]{4}$/208-$&/; print "@F";' contacts.csv
The -F, indicates the field separator; the $" is the list separator, which is used when an array is interpolated into a string and so acts as the output field separator here (yes, it gets assigned once per loop, oh well); the arrays are zero-indexed, so the second field is $F[1]; there's a run-of-the-mill substitution; and you print the results.
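For readers who want to stay in PHP, the field-based idea above might look roughly like the sketch below. It is only an illustration: the function name add_area_code_to_column, the sample rows, and the choice of the second column are assumptions, and it only touches fields written exactly as NNN-NNNN.

```php
<?php
// Hypothetical helper: prepend an assumed default area code to a bare
// 7-digit number (NNN-NNNN) found in one column of a parsed CSV row.
function add_area_code_to_column(array $row, int $column, string $areaCode = '208'): array
{
    if (isset($row[$column])) {
        // Anchored pattern: the field is changed only if it is exactly NNN-NNNN.
        $row[$column] = preg_replace('/^\d{3}-\d{4}$/', $areaCode . '-$0', $row[$column]);
    }
    return $row;
}

// Sample data (made up for the example); a real script would read the file line by line.
$lines = ["Ann,555-1234,Boise", "Bob,208-555-9876,Nampa"];

$out = fopen('php://output', 'w');
foreach ($lines as $line) {
    $row = str_getcsv($line, ',', '"', '\\');
    fputcsv($out, add_area_code_to_column($row, 1), ',', '"', '\\');
}
fclose($out);
```

Parsing each row with str_getcsv() and writing it back with fputcsv() keeps quoted fields intact, which a plain explode(',') would not.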

Ah programs... sometimes a 10-min hack is better.
If it were me... I'd import the CSV into Excel, sort it by something - maybe the length of the phone number or something. Make a new col for the fixed phone number. When you have a group of similarly-fouled numbers, make a formula to fix. Same for the next group. Should be pretty quick, no? Then export to .csv again, omitting the bad col.

A little more digging on my own revealed the issues with the regex in my question. The problem is with duplicate contacts in the csv.
Example:
(208) 555-5555, 555-5555
After the first pass becomes:
2085555555, 2085555555
and after the second pass becomes:
2082085555555, 2082085555555
I worked around this by changing the replacement regex to:
//add escapes for special characters
$pattern = preg_replace("/\(|\)|\-|\./","\\\\$0",$value);
//add delimiters, and optional area code
$pattern = "/(\(?[0-9]{3}\)?)? ?" . $pattern . "/";
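The question also asked whether the number of passes over the CSV can be reduced. One possible approach, sketched below, is to let preg_replace_callback() find and rewrite every number in a single pass, so that an already-fixed number is never matched a second time. The function names normalize_phone and fix_phone_numbers, the sample input, and the slightly stricter pattern are assumptions made for this sketch; like the original script, it will also rewrite any other CSV field that happens to look like a phone number.

```php
<?php
// Hypothetical helper: normalize one matched phone number to 10 digits.
function normalize_phone(string $raw, string $defaultAreaCode = '208'): string
{
    // Keep digits only.
    $digits = preg_replace('/\D/', '', $raw);
    if (strlen($digits) === 7) {
        return $defaultAreaCode . $digits;   // add the missing area code
    }
    if (strlen($digits) === 11 && $digits[0] === '1') {
        return substr($digits, 1);           // drop the U.S. country code
    }
    return $digits;
}

// Hypothetical wrapper: rewrite every phone number in the CSV text in one pass.
function fix_phone_numbers(string $csv): string
{
    // Optional country code, optional area code, then NNN-NNNN or NNNNNNN.
    // The lookarounds stop the pattern from matching inside a longer digit run.
    $pattern = '/(?<!\d)(?:1[ -]?)?(?:\(?[2-9]\d\d\)?[ -]?)?\d{3}-?\d{4}(?!\d)/';
    return preg_replace_callback($pattern, function (array $m) {
        return normalize_phone($m[0]);
    }, $csv);
}

echo fix_phone_numbers("Ann,(208) 555-5555\nBob,555-5555\nCy,1 208 555-1234\n");
```

Because each match is replaced where it was found, a bare 555-5555 can no longer be substituted inside a longer number that contains it, and running the function twice gives the same result as running it once.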

Related

How do I check the amount of times each array object appears in a string, and then save that into a separate array?

Basically what I am trying to do here is get a text input (a paragraph), and then save each word into an array. Then I want to check each word in the array against the original paragraph to see how many times it occurred. By doing this I am hopefully going to be able to check what the topic is. Originally I started this as an open-ended school project, but I am more interested in finding out how to do this for my own sanity.
Here is my code (this is after I requested the text input in html code above):
$paragraph = $_POST['text'];
$paragraph = str_replace('  ',' ',$paragraph);
$paragraph = str_replace('  ',' ',$paragraph);
$paragraph = strtolower($paragraph);
$words = explode(" ",$paragraph);
$count = count($words);
for($x = 0; $x < $count; $x++) {
echo $words[$x];
echo "<br/>";
}
So far I have been able to get the words all lowercase and to replace all the extra spaces in my text, and then subsequently save that to an array. For now I am just displaying the words.
This is where I have run into some problems. I was thinking I could have a multidimensional array where it would be something along the lines of
$words[1]["word"][0]["amount"];
The word would be the actual word in the paragraph, and amount would count how many times it showed up in the paragraph. If anyone has basic concepts for doing this, or there is something I am missing here I would appreciate your help. The main thing I need help with is checking the amount of times each word shows up in the paragraph. I couldn't get this to work (it was within the prior for loop):
substr_count($words[$x],$paragraph)
To recap, I am trying to take a paragraph, save each different word into an array (I have managed to do this successfully) and then save the amount of times the word shows up in the paragraph into a different array (or a multidimensional array). Once I get this data I am going to see which words I used the most, while filtering out filler words like "the" and "a".

You would be better off using preg_replace('/\W+/', ' ', $paragraph); and simplifying the rest of your code to this:
$paragraph = preg_replace('/\W+/', ' ', $paragraph);
$filter = array('the', 'a');
$words = explode(' ',$paragraph);
$countWords = array();
foreach($words as $w)
{
if(trim($w) != "" && array_search($w, $filter) === false)
{
if(!isset($countWords[$w]))
$countWords[$w] = 0;
$countWords[$w] += 1;
}
}
This will give you how many times each word is used. And if you don't care about case, then you can use $countWords[strtolower($w)] instead. Also, with the $filter array I added, you can add whatever words that you don't want to count in there.
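As a side note on the attempt in the question: substr_count() takes the haystack first and the needle second, so the call there has its arguments reversed, and even when corrected it would also count a word inside longer words. A more compact variant of the counting loop above can be sketched with array_count_values(); the function name count_words and the sample sentence are assumptions for this sketch.

```php
<?php
// Hypothetical helper: count how often each word occurs, ignoring filler words.
function count_words(string $paragraph, array $stopWords = ['the', 'a']): array
{
    // Lowercase, then split on runs of non-word characters, dropping empty pieces.
    $words = preg_split('/\W+/', strtolower($paragraph), -1, PREG_SPLIT_NO_EMPTY);

    // Remove the filler words, then count what is left.
    $counts = array_count_values(array_diff($words, $stopWords));

    arsort($counts);   // most frequent words first
    return $counts;
}

print_r(count_words("The cat saw a cat and the dog."));
```

The result is a plain word => count array, so the multidimensional structure described in the question is not needed.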

Temporarily remove labels/tags and re-insert them later on

Consider the following string
$input = "string with {LABELS} between brackets {HERE} and {HERE}";
I want to temporarily remove all labels (= whatever is between curly braces) so that an operation can be performed on the rest of the string:
$string = "string with between brackets and";
For argument's sake, the operation is to prefix every word that starts with 'b' with the word 'yes'.
function operate($string) {
$words = explode(' ', $string);
foreach ($words as $word) {
$output[] = (substr($word, 0, 1) == 'b') ? "yes$word" : $word;
}
return implode(' ', $output);
}
The output of this function would be
"string with yesbetween yesbrackets and"
Now I want to insert the temporarily deleted labels back into place:
"string with {LABELS} yesbetween yesbrackets {HERE} and {HERE}"
My question is: how can I accomplish this? Important: I am not able to alter operate(), so the solution should contain a wrapper function around operate() or something. I have been thinking about this for quite a while now, but am confused as to how to do this. Could you help me out?
Edit: it would be too much to put the actual operate() in this post. It will not really add value (except make the post longer). There is not much difference between the output of operate() here and the real one. I will be able to translate any ideas from here, to the real-world situation :-)

The answer to this depends on whether or not you are able to understand operate(), even if you can't change it.
If you have absolutely no insight into operate(), your problem is simply unsolvable: To reinsert your labels you need one of
Their offset or relative position (You can't know them, if you don't know operate())
A marker for their place (You can't have them, if you don't know how operate() will work on them)
If you have at least some insight into operate(), this becomes something between solvable and easy:
If operate($a . $b)==operate($a) . operate($b), then you just split your original input by the labels, run the non-label parts through operate(), but obviously not the labels, then reassemble
If operate() is guaranteed to leave a placeholder string untouched, and that placeholder is itself guaranteed not to occur in the normal input ("\0" and friends come to mind), then you extract your labels in order, replace them with the placeholder, run the result through operate(), and afterwards replace each placeholder with your saved labels (in order)
Edit
After reading your comments, here are some lines of code
$input = "string with {LABELS} between brackets {HERE} and {HERE}";
//Extract labels and replace with \0
$tmp=preg_split('/(\{.*?\})/',$input,-1,PREG_SPLIT_DELIM_CAPTURE);
$labels=array();
$txt=array();
$islabel=false;
foreach ($tmp as $t) {
if ($islabel) $labels[]=$t;
else $txt[]=$t;
$islabel=!$islabel;
}
$txt=implode("\0",$txt);
//Run through operate()
$txt=operate($txt);
//Reassemble
$txt=explode("\0",$txt);
$result='';
foreach ($txt as $t)
$result.=$t.array_shift($labels);
echo $result;
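The other option mentioned above, where operate($a . $b) == operate($a) . operate($b), could be sketched as follows. The operate() shown here is only a stand-in modelled on the one in the question, and the wrapper name operate_around_labels is an assumption; whether the real operate() can safely be applied to the pieces separately has to be checked first.

```php
<?php
// Stand-in for the question's operate(): prefixes words starting with 'b' with 'yes'.
function operate($string) {
    $output = array();
    foreach (explode(' ', $string) as $word) {
        $output[] = (substr($word, 0, 1) == 'b') ? "yes$word" : $word;
    }
    return implode(' ', $output);
}

// Hypothetical wrapper: run operate() only on the text between the labels.
function operate_around_labels($input) {
    // With PREG_SPLIT_DELIM_CAPTURE the result alternates: text, label, text, label, ...
    $parts = preg_split('/(\{.*?\})/', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {   // even indexes hold plain text, odd indexes hold labels
            $parts[$i] = operate($part);
        }
    }
    return implode('', $parts);
}

echo operate_around_labels("string with {LABELS} between brackets {HERE} and {HERE}");
```

No placeholder is needed in this variant, because the labels never pass through operate() at all.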

Here's what I would do as a first attempt. Split your string into single words, then feed them into operate() one by one, depending on whether the word is 'braced' or not.
$input = "string with {LABELS} between brackets {HERE} and {HERE}";
$inputArray = explode(' ',$input);
foreach($inputArray as $key => $value) {
if(!preg_match('/^{.*}$/',$value)) {
$inputArray[$key] = operate($value);
}
}
$output = implode(' ',$inputArray);

php simplest case regex replacement, but backtraces not working

Hacking up what I thought was the second simplest type of regex (extract a matching string from some strings, and use it) in php, but regex grouping seems to be tripping me up.
Objective
take a ls of files, output the commands to format/copy the files to have the correct naming format.
Resize copies of the files to create thumbnails. (not even dealing with that step yet)
Failure
My code fails at the regex step, because although I just want to filter out everything except a single regex group, when I get the results, it's always returning the group that I want -and- the group before it, even though I in no way requested the first backtrace group.
Here is a fully functioning, runnable version of the code on the online ide:
http://ideone.com/2RiqN
And here is the code (with a cut down initial dataset, although I don't expect that to matter at all):
<?php
// Long list of image names.
$file_data = <<<HEREDOC
07184_A.jpg
Adrian-Chelsea-C08752_A.jpg
Air-Adams-Cap-Toe-Oxford-C09167_A.jpg
Air-Adams-Split-Toe-Oxford-C09161_A.jpg
Air-Adams-Venetian-C09165_A.jpg
Air-Aiden-Casual-Camp-Moc-C09347_A.jpg
C05820_A.jpg
C06588_A.jpg
Air-Aiden-Classic-Bit-C09007_A.jpg
Work-Moc-Toe-Boot-C09095_A.jpg
HEREDOC;
if($file_data){
$files = preg_split("/[\s,]+/", $file_data);
// Split up the files based on the newlines.
}
$rename_candidates = array();
$i = 0;
foreach($files as $file){
$string = $file;
$pattern = '#(\w)(\d+)_A\.jpg$#i';
// Use the second regex group for the results.
$replacement = '$2';
// This should return only group 2 (any number of digits), but instead group 1 is somehow always in there.
$new_file_part = preg_replace($pattern, $replacement, $string);
// Example good end result: <img src="images/ch/ch-07184fs.jpg" width="350" border="0">
// Save the rename results for further processing later.
$rename_candidates[$i]=array('file'=>$file, 'new_file'=>$new_file_part);
// Rename the images into a standard format.
echo "cp ".$file." ./ch/ch-".$new_file_part."fs.jpg;";
// Echo out some commands for later.
echo "<br>";
$i++;
if($i>10){break;} // Just deal with the first 10 for now.
}
?>
Intended result for the regex: 788750
Intended result for the code output (multiple lines of): cp air-something-something-C485850_A.jpg ./ch/ch-485850.jpg;
What's wrong with my regex? Suggestions for simpler matching code would be appreciated as well.

Just a guess:
$pattern = '#^.*?(\w)(\d+)_A\.jpg$#i';
This includes the whole filename in the match. Otherwise preg_replace() will really only substitute the end of each string - it only applies the $replacement expression on the part that was actually matched.
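One thing to watch with the pattern above: because (\w) also matches a digit, a name such as 07184_A.jpg loses its first digit to group 1 and group 2 becomes 7184. A possible alternative, sketched below, is to skip the replacement and capture the digits directly with preg_match(). The function name product_number is an assumption for this sketch.

```php
<?php
// Hypothetical helper: pull the digit run that precedes "_A.jpg" out of a file name.
function product_number(string $file): ?string
{
    // (\d+) captures only the digits; an optional letter prefix such as "C"
    // is simply left outside the group.
    if (preg_match('/(\d+)_A\.jpg$/i', $file, $m)) {
        return $m[1];
    }
    return null;   // the name does not follow the expected format
}

foreach (['07184_A.jpg', 'Adrian-Chelsea-C08752_A.jpg'] as $file) {
    $number = product_number($file);
    if ($number !== null) {
        echo "cp $file ./ch/ch-{$number}fs.jpg;\n";
    }
}
```

Returning null for names that do not match makes it easy to skip or log unexpected files instead of emitting a broken cp command.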

Scan Dir and Explode
You know what? A simpler way to do it in php is to use scandir and explode combo
$dir = scandir('/path/to/directory');
foreach($dir as $file)
{
$ext = pathinfo($file,PATHINFO_EXTENSION);
if($ext!='jpg') continue;
$a = explode('-',$file); //grab the end of the string after the -
$newfilename = end($a); //if there is no dash just take the whole string
$newlocation = './ch/ch-'.str_replace(array('C','_A'),'', basename($newfilename,'.jpg')).'fs.jpg';
echo "#copy($file, $newlocation)\n";
}
#and you are done :)
explode: basically, a filename like blah-2.jpg is turned into array('blah','2.jpg'), and taking the end() of that gets the last element. It's almost the same as array_pop().
Working Example
Here's my ideone code http://ideone.com/gLSxA

Pulling a number and prefix from filename using PHP

I am currently using the following function to grab the product code number for a filename such as "62017 THOR.jpg"
$number = (int) $value;
Leaving me with 62017
The trouble is that some of these files have prefixes which need to be left in place, i.e. "WST 62017.jpg"
So I'm after
WST 62017
not
62017
Could someone help me either redo what I'm using or alter it?

Replace every character except the digits in the image name to get only the number.
$number = preg_replace("~[^0-9]~", "", $value);

If you want to capture everything before the number and the number as well, you can use:
$value = "WST 62017.jpg";
$number = preg_replace('/^(.*?\d*)\..*/',"$1",trim($value));
// $number is "WST 62017"

You could do it like this:
$value = preg_replace('/^(.*\d+).*$/', '\1', $filename);
It should replace everything after the last digit with nothing, leaving everything in front of it in place. Note that you won't be able to cast the result to int, then.
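If the replacement feels hard to follow, the same result can probably be had by matching instead of replacing. The sketch below is one way to do that, under the assumption that the wanted part is "any non-digit prefix followed by the first run of digits"; the function name code_with_prefix is made up for the example.

```php
<?php
// Hypothetical helper: return the optional prefix plus the first run of digits.
function code_with_prefix(string $filename): ?string
{
    // \D* takes any leading non-digits (the prefix), \d+ takes the number itself.
    if (preg_match('/^\D*\d+/', $filename, $m)) {
        return trim($m[0]);
    }
    return null;   // no digits in the file name at all
}

echo code_with_prefix("62017 THOR.jpg"), "\n";
echo code_with_prefix("WST 62017.jpg"), "\n";
```

Unlike the greedy pattern above, this stops at the first number, which matters for names that contain more than one group of digits.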

PHP - smart, error tolerating string comparison

I'm looking for either a routine or a technique for error-tolerant string comparison.
Let's say we have the test string Čakánka - yes, it contains CE characters.
Now, I want to accept any of the following strings as OK:
cakanka
cákanká
ČaKaNKA
CAKANKA
CAAKNKA
CKAANKA
cakakNa
The problem is, that I often switch letters in word, and I want to minimize user's frustration with not being able (i.e. you're in rush) to write one word right.
So, I know how to make ci comparison (just make it lowercase :]), I can delete CE characters, I just can't wrap my head around tolerating few switched characters.
Also, you often put one character not only in wrong place (character=>cahracter), but sometimes shift it by multiple places (character=>carahcter), just because one finger was lazy during writing.
Thank you :]

Not sure (especially about the accents / special characters stuff, which you might have to deal with first), but for characters that are in the wrong place or missing, the levenshtein function, that calculates Levenshtein distance between two strings, might help you (quoting) :
int levenshtein ( string $str1 , string $str2 )
int levenshtein ( string $str1 , string $str2 , int $cost_ins , int $cost_rep , int $cost_del )
The Levenshtein distance is defined as
the minimal number of characters you
have to replace, insert or delete to
transform str1 into str2
Other possibly useful functions could be soundex, similar_text, or metaphone.
And some of the user notes on the manual pages of those functions, especially the manual page of levenshtein might bring you some useful stuff too ;-)
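To make the levenshtein suggestion concrete, here is a rough sketch. The accent map, the function names normalize_word and is_close_enough, and the threshold of 2 are assumptions chosen for the examples in the question; a real application would need a fuller transliteration step and a threshold tuned to the word length.

```php
<?php
// Hypothetical normalizer: strip the accents used in the example, then lowercase.
// The map is an assumption and only covers characters that appear in the question.
function normalize_word(string $word): string
{
    $map = ['Č' => 'c', 'č' => 'c', 'Á' => 'a', 'á' => 'a'];
    return strtolower(strtr($word, $map));
}

// Accept the input when it is within $maxDistance edits of the expected word.
function is_close_enough(string $input, string $expected, int $maxDistance = 2): bool
{
    return levenshtein(normalize_word($input), normalize_word($expected)) <= $maxDistance;
}

foreach (['cákanká', 'CAAKNKA', 'cakakNa', 'zebra'] as $candidate) {
    echo $candidate, ': ', is_close_enough($candidate, 'Čakánka') ? 'accepted' : 'rejected', "\n";
}
```

Plain Levenshtein distance has no "swap" operation, so two switched neighbouring letters cost two edits, and a single letter shifted by several places also costs two (one deletion plus one insertion). That is why the threshold in this sketch is 2 rather than 1.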

You could transliterate the words to latin characters and use a phonetic algorithm like Soundex to get the essence from your word and compare it to the ones you have. In your case that would be C252 for all of your words except the last one that is C250.
Edit: The problem with comparative functions like levenshtein or similar_text is that you need to call them for each pair of input value and possible matching value. That means if you have a database with 1 million entries you will need to call these functions 1 million times.
But functions like soundex or metaphone, that calculate some kind of digest, can help to reduce the number of actual comparisons. If you store the soundex or metaphone value for each known word in your database, you can reduce the number of possible matches very quickly. Later, when the set of possible matching value is reduced, then you can use the comparative functions to get the best match.
Here’s an example:
// building the index that represents your database
$knownWords = array('Čakánka', 'Cakaka');
$index = array();
foreach ($knownWords as $key => $word) {
$code = soundex(iconv('utf-8', 'us-ascii//TRANSLIT', $word));
if (!isset($index[$code])) {
$index[$code] = array();
}
$index[$code][] = $key;
}
// test words
$testWords = array('cakanka', 'cákanká', 'ČaKaNKA', 'CAKANKA', 'CAAKNKA', 'CKAANKA', 'cakakNa');
echo '<ul>';
foreach ($testWords as $word) {
$code = soundex(iconv('utf-8', 'us-ascii//TRANSLIT', $word));
if (isset($index[$code])) {
echo '<li> '.$word.' is similar to: ';
$matches = array();
foreach ($index[$code] as $key) {
similar_text(strtolower($word), strtolower($knownWords[$key]), $percentage);
$matches[$knownWords[$key]] = $percentage;
}
arsort($matches);
echo '<ul>';
foreach ($matches as $match => $percentage) {
echo '<li>'.$match.' ('.$percentage.'%)</li>';
}
echo '</ul></li>';
} else {
echo '<li>no match found for '.$word.'</li>';
}
}
echo '</ul>';

Spelling checkers do something like fuzzy string comparison. Perhaps you can adapt an algorithm based on that reference. Or grab the spell checker guessing code from an open source project like Firefox.
