I've been tasked with standardizing some address information. Toward that goal, I'm breaking the address string into granular values (our address schema is very similar to Google's format).
Progress so far:
I'm using PHP, and am currently breaking out Bldg, Suite, Room#, etc... info.
It was all going great until I encountered Floors.
For the most part, the floor info is represented as "Floor 10" or "Floor 86". Nice & easy.
For everything to that point, I can simply break the string on a string ("room", "floor", etc..)
The problem:
But then I noticed something in my test dataset. There are some cases where the floor is represented more like "2nd Floor".
This made me realize that I need to prepare for a whole slew of variations for the FLOOR info.
There are options like "3rd Floor", "22nd floor", and "1ST FLOOR". Then what about spelled out variants such as "Twelfth Floor"?
Man!! This can become a mess pretty quickly.
My Goal:
I'm hoping someone knows of a library or something that already solves this problem.
In reality, though, I'd be more than happy with some good suggestions/guidance on how one might elegantly handle splitting the strings on such diverse criteria (taking care to avoid false positives such as "3rd St").
First of all, you need an exhaustive list of all possible input formats, and you need to decide how deep you'd like to go.
If you treat spelled-out variants as invalid, you can apply simple regular expressions to capture the number and detect the token (room, floor, ...).
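As a rough sketch of that idea (the helper name extractNumericFloor is made up for illustration, and the pattern assumes the number sits directly next to the word "floor"), something like this picks up "Floor 10", "2nd Floor" and "1ST FLOOR" while ignoring "3rd St":

```php
<?php
// Hypothetical helper: returns the floor number as an int, or null if none is found.
function extractNumericFloor(string $address): ?int
{
    // Either "<digits>[st|nd|rd|th] floor" or "floor <digits>", case-insensitive.
    $pattern = '/\b(\d+)(?:st|nd|rd|th)?\s+floor\b|\bfloor\s+(\d+)\b/i';
    if (!preg_match($pattern, $address, $m)) {
        return null;
    }
    // Group 1 is filled by the "2nd Floor" form, group 2 by the "Floor 10" form.
    return (int) ($m[1] !== '' ? $m[1] : $m[2]);
}

var_dump(extractNumericFloor('12 3rd St, 22nd floor')); // int(22)
var_dump(extractNumericFloor('12 3rd St'));             // NULL
```

Because the ordinal suffix only counts when it is followed by "floor", street names such as "3rd St" fall through. Spelled-out floors would still need separate handling.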
I would start by reading up on regex in PHP. For example:
$floorarray = preg_split("/\sfloor\s/i", $floorstring);
Other useful functions are preg_grep, preg_match, etc
Edit: added a more complete solution.
This solution takes as an input a string describing the floor. It can be of various formats such as:
Floor 102
Floor One-hundred two
Floor One hundred and two
One-hundred second floor
102nd floor
102ND FLOOR
etc
Until I can look at an example input file, I am just guessing from your post that this will be adequate.
<?php
$errorLog = 'error-log.txt'; // a file to catalog bad entries with bad floors
// These are a few example inputs
$addressArray = array('Fifty-second Floor', 'somefloor', '54th floor', '52qd floor',
'forty forty second floor', 'five nineteen hundredth floor', 'floor fifty-sixth second ninth');
foreach ($addressArray as $id => $address) {
$floor = parseFloor($id, $address);
if ( empty($floor) ) {
error_log('Entry '.$id.' is invalid: '.$address."\n", 3, $errorLog);
} else {
echo 'Entry '.$id.' is on floor '.$floor."\n";
}
}
function parseFloor($id, $address)
{
$floorString = implode(preg_split('/(^|\s)floor($|\s)/i', $address));
if ( preg_match('/(^|^\s)(\d+)(st|nd|rd|th)*($|\s$)/i', $floorString, $matchArray) ) {
// floorString contained a valid numerical floor
$floor = $matchArray[2];
} elseif ( ($floor = word2num($floorString)) != FALSE ) { // note assignment op not comparison
// floorString contained a valid english ordinal for a floor
; // No need to do anything
} else {
// floorString did not contain a properly formed floor
$floor = FALSE;
}
return $floor;
}
function word2num( $inputString )
{
$cards = array('zero',
'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten',
'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty');
$cards[30] = 'thirty'; $cards[40] = 'forty'; $cards[50] = 'fifty'; $cards[60] = 'sixty';
$cards[70] = 'seventy'; $cards[80] = 'eighty'; $cards[90] = 'ninety'; $cards[100] = 'hundred';
$ords = array('zeroth',
'first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth', 'ninth', 'tenth',
'eleventh', 'twelfth', 'thirteenth', 'fourteenth', 'fifteenth', 'sixteenth', 'seventeenth', 'eighteenth', 'nineteenth', 'twentieth');
$ords[30] = 'thirtieth'; $ords[40] = 'fortieth'; $ords[50] = 'fiftieth'; $ords[60] = 'sixtieth';
$ords[70] = 'seventieth'; $ords[80] = 'eightieth'; $ords[90] = 'ninetieth'; $ords[100] = 'hundredth';
// break the string at any whitespace, dash, comma, or the word 'and'
$words = preg_split( '/([\s,-](?!and\s)|\sand\s)/i', $inputString ); // hyphen placed last in the class so it is always read as a literal
$sum = 0;
foreach ($words as $word) {
$word = strtolower($word);
$value = array_search($word, $ords); // try the ordinal words
if (!$value) { $value = array_search($word, $cards); } // try the cardinal words
if (!$value) {
// if $value is still false, it's not a known number word, fail and exit
return FALSE;
}
if ($value == 100) { $sum *= 100; }
else { $sum += $value; }
}
return $sum;
}
?>
In the general case, parsing words into numbers is not easy. The best thread that I could find that discusses this is here. It is not nearly as easy as the inverse problem of converting numbers into words. My solution only works for numbers <2000, and it liberally interprets poorly formed constructs rather than tossing an error. Also, it is not resilient against spelling mistakes at all. For example:
forty forty second = 82
five nineteen hundredth = 2400
fifty-sixth second ninth = 67
If you have a lot of inputs and most of them are well formed, throwing errors for spelling mistakes is not really a big deal because you can manually correct the short list of problem entries. Silently accepting bad input, however, could be a real problem depending on your application. Just something to think about when deciding if it is worth it to make the conversion code more robust.
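If silently accepting bad input is a concern, one option is to check the order of the per-word values before summing them. The following is only a sketch: isWellFormedNumber is a made-up helper, it expects the integer values that word2num looks up for each word, and it only covers numbers below 1000 of the form "[unit hundred] [tens] [unit]".

```php
<?php
// Hypothetical check: $values holds the numeric value of each word, in order,
// e.g. "one hundred and two" -> [1, 100, 2].
function isWellFormedNumber(array $values): bool
{
    $n = count($values);
    if ($n === 0) {
        return false;
    }
    $i = 0;
    // Optional "<unit> hundred" prefix.
    if ($n >= 2 && $values[0] >= 1 && $values[0] <= 9 && $values[1] === 100) {
        $i = 2;
    }
    if ($i < $n && $values[$i] >= 20 && $values[$i] <= 90 && $values[$i] % 10 === 0) {
        // A tens word, optionally followed by a single unit.
        $i++;
        if ($i < $n && $values[$i] >= 1 && $values[$i] <= 9) {
            $i++;
        }
    } elseif ($i < $n && $values[$i] >= 1 && $values[$i] <= 19) {
        // A lone unit or teen.
        $i++;
    }
    // Well formed only if every word was consumed.
    return $i === $n;
}

var_dump(isWellFormedNumber([1, 100, 2])); // bool(true)
var_dump(isWellFormedNumber([40, 40, 2])); // bool(false)
```

With a check like this, the three malformed examples above would be rejected and logged instead of being converted.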
Working on a tool to make runway recommendations for flight simulation enthusiasts based off of the real world winds at a given airport. The ultimate goal is to compare, and return a list of available runways in a list, with the smallest wind variance displaying at the top of the list.
I would say that I probably have 95% of what I need, but where it gets slippery is for wind headings that approach 0 degrees (360 on a compass rose).
If runway heading is 029 and wind heading is 360, it is only a difference of 29 degrees, but the formula that I have written displays a difference of 331 degrees.
I have tried experimenting with abs() as part of the comparison but have gotten nowhere. I will link my current results here: https://extendsclass.com/php-bin/7eba5c8
Attempted switching comparisons for wind heading and runway heading (subtracting one from the other, and then the other way around) with the same result.
I am sure that the key lies in some little three line nonsense that I just cannot get the knack of (disadvantage of being a self-taught cowboy coder, I guess).
I saw a post about how to do it in C# from about 11 years ago but I never messed around with that particular deep, dark corner of the programming world.
The code is included below:
<?php
echo "<pre>\n";
//Dummy Runways. Will be replaced by data from AVWX
$rwy_hdgs = array(
"04R" => "029",
"04L" => "029",
"22R" => "209",
"22L" => "209",
"03R" => "029",
"03L" => "029",
"21L" => "216",
"21R" => "216",
"09L" => "089",
"09R" => "089",
"27R" => "269",
"27L" => "269"
);
//Dummy Wind Heading. Will be replaced by data from AVWX
$wind_dir = "360";
$runways = array();
$i = 1;
foreach($rwy_hdgs as $key => $value)
{
$diff = $value - $wind_dir;
$runways[$i]["rwy"] = $key;
$runways[$i]["hdg"] = $value;
$runways[$i]["diff"] = abs($diff);
$i++;
}
//Select "diff" value
$diff = array_column($runways, "diff");
//Sort $runways by difference between wind and runways, with smallest value first
array_multisort($diff, SORT_ASC, $runways);
foreach ($runways as $runway){
echo "Wind Heading: " . $wind_dir . "\n";
echo "Runway: " . $runway["rwy"] . "\n";
echo "Heading: " . $runway["hdg"] . "\n";
echo "Variance: " . $runway["diff"] . "°\n\n";
}
echo "</pre>\n";
?>
When you subtract two angles in a circle, you can either go the "short way" or the "long way" - it's a circle... So you have to calculate both ways and then find out, which one is shorter - and the direction too, because you have a fixed start angle and a fixed target angle:
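function angleDiff($angleStart, $angleTarget) {
$delta = $angleTarget - $angleStart;
$absDelta1 = abs($delta); // the direct way
$absDelta2 = 360 - $absDelta1; // the way that wraps through 0/360
if ($absDelta1 <= $absDelta2) {
return $delta; // direct way is shorter, keep its sign
}
// wrapping is shorter, and it turns in the opposite direction
return ($delta > 0 ? -1 : 1) * $absDelta2;
}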
This should give you positive numbers for clockwise turning and negative numbers for counter-clockwise turning from start to target angle.
Disclaimer: didn't actually test the code, sorry, might have flaws ;)
If I understand correctly, you want to calculate the difference between two angles, always going the short way. So you could do it like this:
$diff = min([abs($a - $b), 360 - abs($a - $b)]);
with $a and $b being the two angles. The result will always be between 0 and 180 degrees.
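To show how that could slot into the runway sort from the question, here is a small sketch. The function name headingDiff and the trimmed-down runway list are mine, and headings are assumed to be whole degrees between 0 and 360.

```php
<?php
// Smallest angle between two compass headings, always 0..180.
function headingDiff(int $a, int $b): int
{
    $d = abs($a - $b) % 360;
    return min($d, 360 - $d);
}

// Sample data for illustration only.
$runwayHeadings = ['04R' => 29, '22L' => 209, '09L' => 89, '27R' => 269];
$windDir = 360;

$variances = [];
foreach ($runwayHeadings as $runway => $heading) {
    $variances[$runway] = headingDiff($heading, $windDir);
}
asort($variances); // smallest variance first, keys preserved

print_r($variances); // 04R => 29, 09L => 89, 27R => 91, 22L => 151
```

The modulo keeps a wind of 360 and a wind of 0 equivalent, which is the case that tripped up the plain subtraction.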
I have a string in the form of "AsKcQsJd" that represents 4 cards from a deck of playing cards. The uppercase value represents the card value (in this case, Ace, King, Queen, and Jack) and the lowercase value represents the suit (in this case, spade, club, spade, diamond).
Say I have another value that tells me what suit I'm looking for. So in this case, I have:
$hand = 'AsKcQsJd';
$suit = 's';
How can I write a regular expression that checks if the hand has an Ace in it, followed by the suit, so in this case 'As' and also any other card that has the suit? Or in 'poker terms', I'm trying to determine if the hand has the 'ace high flush draw' for the suit defined as $suit.
To further explain, I need to check if any combination of the following two cards exist:
AsKs, AsQs, AsJs, AsTs,As9s,As8s,As7s,As6s,As5s,As4s,As3s,As2s
With the added complexity that these cards could occur anywhere in the hand. For example, the string could have As at the front and Ks at the end. That's why I think a regular expression is the best method for determining if the two coexist in the string.
You might use two lookaheads, one for As, and one for [^A]s, like this:
(?=.*As)(?=.*[^A]s)
https://regex101.com/r/8hkWTv/1
$suit = 's';
$re = '/(?=.*A' . $suit . ')(?=.*[^A]' . $suit . ')/';
print($re); // /(?=.*As)(?=.*[^A]s)/
print(preg_match($re, 'AsKcQsJd')); // 1
print(preg_match($re, 'AdKcQsJd')); // 0
print(preg_match($re, 'KsKcQsJd')); // 0
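If you want that as a reusable check, it could be wrapped roughly like this. The function name hasAceHighFlushDraw is made up, and the preg_quote call is just a precaution in case the suit ever comes from untrusted input.

```php
<?php
// Hypothetical wrapper around the two-lookahead pattern.
function hasAceHighFlushDraw(string $hand, string $suit): bool
{
    $s = preg_quote($suit, '/');
    // One lookahead for the ace of the suit, one for any non-ace card of the suit.
    return preg_match('/(?=.*A' . $s . ')(?=.*[^A]' . $s . ')/', $hand) === 1;
}

var_dump(hasAceHighFlushDraw('AsKcQsJd', 's')); // bool(true)
var_dump(hasAceHighFlushDraw('KsQcJdAs', 's')); // bool(true)
var_dump(hasAceHighFlushDraw('AdKcQsJd', 's')); // bool(false)
```

Since both conditions are lookaheads, the position of the ace in the hand doesn't matter.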
I'm not sure regex is the best solution but if that's your cup of tea you can do it pretty easily with alternation like this:
As.*s|s.*As
Or better yet - to capture the actual cards giving you a match:
(As).*(.s)|(.s).*(As)
These basically say - the hand has a spade followed by an ace of spades OR has ace of spades followed by any other spade. https://regex101.com/r/pdwHPQ/1
That said, I'd probably consider building a simple class to parse the hand and give you more flexibility when it comes to answering questions about what cards are present. Whether or not this is worth it really depends a lot on your app. Here's an idea:
$hand = 'AsKh4c5c9h2s';
$cards = new Cards($hand);
$spades = $cards->getCardsBySuit('s');
if (in_array('As',array_keys($spades)) && count($spades) > 1) {
// hand has ace high flush draw
echo 'yep';
}
class Cards {
private $cards = []; // must start as an array, since the constructor assigns by card key
public function __construct($hand) {
foreach (str_split($hand,2) as $card) {
$this->cards[$card] = [
'rank' => substr($card,0,1),
'suit' => substr($card,1,1)
];
}
}
public function getCardsBySuit($suit) {
$response = [];
foreach ($this->cards as $k => $card) {
if ($card['suit'] == $suit) {
$response[$k] = $card;
}
}
return $response;
}
}
I've got a source file that contains some data in a few formats that I need to parse. I'm writing an ETL process that will have to match other data.
Most of the data is in the format city, state (US standard, more or less). Some cities are grouped across heavier population areas with multiple cities combined.
Most of the data looks like this (call this 1):
Elkhart, IN
Some places have multiple cities, delimited by a dash (call this 2):
Hickory-Lenoir-Morganton, NC
It's still not too complicated when the cities are in different states (call this 3):
Steubenville, OH-Weirton, WV
This one threw me for a loop; it makes sense but it flushes the previous formats (call this 4):
Kingsport, TN-Johnson City, TN-Bristol, VA-TN
In that example, Bristol is in both VA and TN. Then there's this (call this 5):
Mayagüez/Aguadilla-Ponce, PR
I'm okay with replacing the slash with a dash and processing the same as a previous example. That contains a diacritic as well and the rest of my data are diacritic-free. I'm okay with stripping the diacritic off, that seems to be somewhat straightforward in PHP.
Then there's my final example (call this 6):
Scranton--Wilkes-Barre--Hazleton, PA
The city name contains a dash so the delimiter between city names is a double dash.
What I'd like to produce is, given any of the above examples and a few hundred other lines that follow the same format, an array of [[city, state],...] for each so I can turn them into SQL. For example, parsing 4 would yield:
[
['Kingsport', 'TN'],
['Johnson City', 'TN'],
['Bristol', 'VA'],
['Bristol', 'TN']
]
I'm using a standard PHP install, I've got preg_match and so on but no PECL libraries. Order is unimportant.
Any thoughts on a good way to do this without a big pile of if-then statements?
I would split the input on '-'s and ','s, then delete empty elements in the array. str_replace followed by explode and array_diff($array, array('')) should do the trick.
Then identify states, either by searching a list or by working on the principle that cities don't tend to have two-letter upper-case names.
Now work through the array. If it's a city, save the name, if it's a state, apply it to the saved cities. Clear the list of cities when you get a city immediately following a state.
Note any exceptions and reformat by hand into a different input.
Hope this helps.
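A rough sketch of those steps, with a couple of assumptions that go beyond the description above: the function name parseCityStates is made up, a state is anything matching two upper-case letters, and lines containing '--' are split on '--' so that hyphenated city names survive.

```php
<?php
// Hypothetical implementation of the token walk described above.
function parseCityStates(string $line): array
{
    // Assumption: a double dash, when present, is the real city separator.
    $delimiter = strpos($line, '--') !== false ? '--' : '-';
    $line = str_replace('/', $delimiter, $line);

    // Split on the delimiter and on commas, dropping empty tokens.
    $tokens = [];
    foreach (explode($delimiter, $line) as $chunk) {
        foreach (explode(',', $chunk) as $token) {
            $token = trim($token);
            if ($token !== '') {
                $tokens[] = $token;
            }
        }
    }

    $pairs = [];
    $cities = [];
    $lastWasState = false;
    foreach ($tokens as $token) {
        if (preg_match('/^[A-Z]{2}$/', $token)) {
            // A state applies to every city saved so far.
            foreach ($cities as $city) {
                $pairs[] = [$city, $token];
            }
            $lastWasState = true;
        } else {
            // A city right after a state starts a new group.
            if ($lastWasState) {
                $cities = [];
            }
            $cities[] = $token;
            $lastWasState = false;
        }
    }
    return $pairs;
}

print_r(parseCityStates('Kingsport, TN-Johnson City, TN-Bristol, VA-TN'));
```

For the example line this yields Kingsport/TN, Johnson City/TN, Bristol/VA and Bristol/TN. Anything that doesn't fit the pattern would still need to be noted and fixed by hand.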
For anyone who's interested, I took the answer from @mike and came up with this:
function SplitLine($line) {
// This is over-simplified, just to cover the given case.
$line = str_replace('ü', 'u', $line);
// Cover case 6.
$delimiter = '-';
if (false !== strpos($line, '--'))
$delimiter = '--';
$line = str_replace('/', $delimiter, $line);
// Case 5 looks like case 2 now.
$parts = explode($delimiter, $line);
$table = array_map(function($part) { return array_map('trim', explode(',', $part)); }, $parts);
// At this point, table contains a grid with missing values.
for ($i = 0; $i < count($table); $i++) {
$row = $table[$i];
// Trivial case (case 1 and 3), go on.
if (2 == count($row))
continue;
if (preg_match('/^[A-Z]{2}$/', $row[0])) {
// Missing city; seek backwards.
$find = $i;
while (2 != count($table[$find]))
$find--;
$table[$i] = [$table[$find][0], $row[0]];
} else {
// Missing state; seek forwards.
$find = $i;
while (2 != count($table[$find]))
$find++;
$table[$i][] = $table[$find][1];
}
}
return $table;
}
It's not pretty and it's slow. It does cover all my cases and since I'm doing an ETL process the speed isn't paramount. There's also no error detection, which works in my particular case.
I'm trying to print the possible words that can be formed from a phone number in php. My general strategy is to map each digit to an array of possible characters. I then iterate through each number, recursively calling the function to iterate over each possible character.
Here's what my code looks like so far, but it's not working out just yet. Any syntax corrections I can make to get it to work?
$pad = array(
array('0'), array('1'), array('abc'), array('def'), array('ghi'),
array('jkl'), array('mno'), array('pqr'), array('stuv'), array('wxyz')
);
function convertNumberToAlpha($number, $next, $alpha){
global $pad;
for($i =0; $i<count($pad[$number[$next]][0]); $i++){
$alpha[$next] = $pad[$next][0][$i];
if($i<strlen($number) -1){
convertNumberToAlpha($number, $next++, $alpha);
}else{
print_r($alpha);
}
}
}
$alpha = array();
convertNumberToAlpha('22', 0, $alpha);
How is this going to be used? This is not a job for a simple recursive algorithm such as what you have suggested, nor even an iterative approach. An average 10-digit number will yield 59,049 (3^10) possibilities, each of which will have to be evaluated against a dictionary if you want to determine actual words.
Many times, the best approach to this is to pre-compile a dictionary which maps 10-digit numbers to various words. Then, your look-up is a constant O(1) algorithm, just selecting by a 10 digit number which is mapped to an array of possible words.
In fact, pre-compiled dictionaries were the way that T9 worked, mapping dictionaries to trees with logarithmic look-up functions.
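A minimal sketch of the pre-compiled idea, assuming a plain word list as the dictionary and the standard keypad layout. The function names and the tiny sample list are only for illustration.

```php
<?php
// Convert a word to the digits that would type it on a phone keypad.
function wordToDigits(string $word): string
{
    return strtr(
        strtolower($word),
        'abcdefghijklmnopqrstuvwxyz',
        '22233344455566677778889999'
    );
}

// Build the lookup once: digit string => list of dictionary words.
function buildDigitIndex(array $words): array
{
    $index = [];
    foreach ($words as $word) {
        $index[wordToDigits($word)][] = $word;
    }
    return $index;
}

$index = buildDigitIndex(['hello', 'world', 'good', 'home', 'gone']);

// Looking up a number is then a single array access.
print_r($index['4663'] ?? []); // good, home, gone
```

The index only has to be built once (and could be cached), so each lookup avoids generating and testing every letter combination.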
The following code should do it. Fairly straight forward: it uses recursion, each level processes one character of input, a copy of current combination is built/passed at each recursive call, recursion stops at the level where last character of input is processed.
function alphaGenerator($input, &$output, $current = "") {
static $lookup = array(
1 => "1", 2 => "abc", 3 => "def",
4 => "ghi", 5 => "jkl", 6 => "mno",
7 => "pqrs", 8 => "tuv", 9 => "wxyz",
0 => "0"
);
$digit = substr($input, 0, 1); // e.g. "4"
$other = substr($input, 1); // e.g. "3556"
$chars = str_split($lookup[$digit], 1); // e.g. "ghi"
foreach ($chars as $char) { // e.g. g, h, i
if ($other === false || $other === '') { // base case (substr() returns "" on current PHP versions, false on older ones)
$output[] = $current . $char;
} else { // recursive case
alphaGenerator($other, $output, $current . $char);
}
}
}
$output = array();
alphaGenerator("43556", $output);
var_dump($output);
Output:
array(243) {
[0]=>string(5) "gdjjm"
[1]=>string(5) "gdjjn"
...
[133]=>string(5) "helln"
[134]=>string(5) "hello"
[135]=>string(5) "hfjjm"
...
[241]=>string(5) "iflln"
[242]=>string(5) "ifllo"
}
You should read Norvig's article on writing a spellchecker in Python: http://norvig.com/spell-correct.html. Although it's a spellchecker, and in Python rather than PHP, it is the same concept of finding words among possible variations, and it might give you some good ideas.
Can anyone suggest a better (or preferred) method to find the match percentage between two strings (i.e. how closely two strings, e.g. names, are related, as a percentage) using fuzzy logic? Can anyone help me write the code? I'm really wondering where to start.
$str1 = 'Hello';
$str2 = 'Hello, World!';
similar_text($str1, $str2, $percentage); // $percentage is passed by reference and receives the similarity in percent
http://php.net/manual/en/function.similar-text.php
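If an edit-distance based percentage suits your data better, PHP's built-in levenshtein() can be normalized in a similar way. This is a different technique from similar_text, and the helper name and the choice to divide by the longer string's length are my own assumptions:

```php
<?php
// Hypothetical helper: 100 means identical, 0 means nothing in common.
function levenshteinPercent(string $a, string $b): float
{
    $maxLen = max(strlen($a), strlen($b));
    if ($maxLen === 0) {
        return 100.0; // two empty strings are treated as identical
    }
    // levenshtein() counts the single-character edits needed to turn $a into $b.
    return (1 - levenshtein($a, $b) / $maxLen) * 100;
}

echo round(levenshteinPercent('kitten', 'sitting'), 2), "\n"; // 57.14
```

Note that this works on bytes, so multi-byte characters count as more than one edit.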
Word Comparator
Here's a comparison based on words - it's a lot faster than character-based ones, plus it often makes more sense to compare human text by words. However, word lengths do matter; this algorithm takes this into consideration, for better results. Check test results at the end; I think they're pretty much what a human would say.
function wordSimilarity($s1,$s2) {
$words1 = preg_split('/\s+/',$s1);
$words2 = preg_split('/\s+/',$s2);
$diffs1 = array_diff($words2,$words1);
$diffs2 = array_diff($words1,$words2);
$diffsLength = strlen(join("",$diffs1).join("",$diffs2));
$wordsLength = strlen(join("",$words1).join("",$words2));
if(!$wordsLength) return 0;
$differenceRate = ( $diffsLength / $wordsLength );
$similarityRate = 1 - $differenceRate;
return $similarityRate;
}
This function gives you a floating point value between 0 and 1 where 1 is total similarity.
Let's see some tests
$test = "this is something you've never done before";
wordSimilarity($test,"this is something you've never done before"); // 1.000
wordSimilarity($test,"this is something"); // 0.588
wordSimilarity($test,"this is nothing you have ever done"); // 0.312
wordSimilarity($test,"leave me alone with lorem ipsum"); // 0.000
wordSimilarity($test,"before you do something you've never done"); // 0.845
wordSimilarity($test,"never have i ever done this"); // 0.448