I am creating a Bible search. The trouble with bible searches is that people often enter different kinds of searches, and I need to split them up accordingly. So i figured the best way to start out would be to remove all spaces, and work through the string there. Different types of searches could be:
Genesis 1:1 - Genesis Chapter 1, Verse 1
1 Kings 2:5 - 1 Kings Chapter 2, Verse 5
Job 3 - Job Chapter 3
Romans 8:1-7 - Romans Chapter 8 Verses 1 to 7
1 John 5:6-11 - 1 John Chapter 5 Verses 6 - 11.
I am not too phased by the different types of searches, But If anyone can find a simpler way to do this or know's of a great way to do this then please tell me how!
Thanks
The easiest thing to do here is to write a regular expression to capture the text, then parse out the captures to see what you got. To start, lets assume you have your test bench:
$tests = array(
'Genesis 1:1' => 'Genesis Chapter 1, Verse 1',
'1 Kings 2:5' => '1 Kings Chapter 2, Verse 5',
'Job 3' => 'Job Chapter 3',
'Romans 8:1-7' => 'Romans Chapter 8, Verses 1 to 7',
'1 John 5:6-11' => '1 John Chapter 5, Verses 6 to 11'
);
So, you have, from left to right:
A book name, optionally prefixed with a number
A chapter number
A verse number, optional, optionally followed by a range.
So, we can write a regex to match all of those cases:
((?:\d+\s)?\w+)\s+(\d+)(?::(\d+(?:-\d+)?))?
And now see what we get back from the regex:
foreach( $tests as $test => $answer) {
// Match the regex against the test case
preg_match( $regex, $test, $match);
// Ignore the first entry, the 2nd and 3rd entries hold the book and chapter
list( , $book, $chapter) = array_map( 'trim', $match);
$output = "$book Chapter $chapter";
// If the fourth match exists, we have a verse entry
if( isset( $match[3])) {
// If there is no dash, it's a single verse
if( strpos( $match[3], '-') === false) {
$output .= ", Verse " . $match[3];
} else {
// Otherwise it's a range of verses
list( $start, $end) = explode( '-', $match[3]);
$output .= ", Verses $start to $end";
}
}
// Here $output matches the value in $answer from our test cases
echo $answer . "\n" . $output . "\n\n";
}
You can see it working in this demo.
I think I understand what you are asking here. You want to devise an algorithm that extracts information (ex. book name, chapter, verse/verses).
This looks to me like a job for pattern matching (ex. regular expressions) because you could then define patterns, extract data for all scenario's that make sense and work from there.
There are actually quite a few variants that could exist - perhaps you should also take a look at natural language processing. Fuzzy string matching on names could provide better results (ex. people misspelling book names).
Best of luck
Try out something based on preg_match_all, like:
$ php -a
Interactive shell
php > $s = '1 kings 2:4 and 1 sam 4-5';
php > preg_match_all("/(\\d*|[^\\d ]*| *)/", $s, $parts);
php > print serialize($s);
Okay Well I am not too sure about regular expressions and I havent yet studied them out, So I am stuck with the more procedural approach. I have made the following (which is still a huge improvement on the code I wrote 5 years ago, which was what I was aiming to achieve) That seems to work flawlessly:
You need this function first of all:
function varType($str) {
if(is_numeric($str)) {return false;}
if(is_string($str)) {return true;}
}
$bible = array("BookNumber" => "", "Book" => "", "Chapter" => "", "StartVerse" => "", "EndVerse" => "");
$pos = 1; // 1 - Book Number
// 2 - Book
// 3 - Chapter
// 4 - ':' or 'v'
// 5 - StartVerse
// 6 - is a dash for spanning verses '-'
// 7 - EndVerse
$scan = ""; $compile = array();
//Divide into character type groups.
for($x=0;$x<=(strlen($collapse)-1);$x++)
{ if($x>=1) {if(varType($collapse[$x]) != varType($collapse[$x-1])) {array_push($compile,$scan);$scan = "";}}
$scan .= $collapse[$x];
if($x==strlen($collapse)-1) {array_push($compile,$scan);}
}
//If the first element is not a number, then it is not a numbered book (AKA 1 John, 2 Kings), So move the position forward.
if(varType($compile[0])) {$pos=2;}
foreach($compile as $val)
{ if(!varType($val))
{ switch($pos)
{ case 1: $bible['BookNumber'] = $val; break;
case 3: $bible['Chapter'] = $val; break;
case 5: $bible['StartVerse'] = $val; break;
case 7: $bible['EndVerse'] = $val; break;
}
} else {switch($pos)
{ case 2: $bible['Book'] = $val; break;
case 4: //Colon or 'v'
case 6: break; //Dash for verse spanning.
}}
$pos++;
}
This will give you an array called 'Bible' at the end that will have all the necessary data within to run on an SQL database or whatever else you might want it for. Hope this helps others.
I know this is crazy talk, but why not just have a form with 4 fields so they can specify:
Book
Chapter
Starting Verse
Ending Verse [optional]
Related
I have a string in the form of "AsKcQsJd" that represents 4 cards from a deck of playing cards. The uppercase value represnts the card value (in this case, Ace, King, Queen, and Jack) and the lowercase value represents the suit (in this case, spade, club, spade, diamond).
Say I have another value that tells me what suit I'm looking for. So in this case, I have:
$hand = 'AsKcQsJd';
$suit = 's';
How can I write a regular expression that checks if the hand has an Ace in it, followed by the suit, so in this case 'As' and also any other card that has the suit? Or in 'poker terms', I'm trying to determine if the hand has the 'ace high flush draw' for the suit defined as $suit.
To further explain, I need to check if any combination of the following two cards exist:
AsKs, AsQs, AsJs, AsTs,As9s,As8s,As7s,As6s,As5s,As4s,As3s,As2s
With the added complexity that these cards could occur anywhere in the hand. For example, the string could have As at the front and Ks at the end. That's why I think a regular expression is the best method for determining if the two coexist in the string.
You might use two lookaheads, one for As, and one for [^A]s, like this:
(?=.*As)(?=.*[^A]s)
https://regex101.com/r/8hkWTv/1
$suit = 's';
$re = '/(?=.*A' . $suit . ')(?=.*[^A]' . $suit . ')/';
print($re); // /(?=.*As)(?=.*[^A]s)/
print(preg_match($re, 'AsKcQsJd')); // 1
print(preg_match($re, 'AdKcQsJd')); // 0
print(preg_match($re, 'KsKcQsJd')); // 0
I'm not sure regex is the best solution but if that's your cup of tea you can do it pretty easily with alternation like this:
As.*s|s.*As
Or better yet - to capture the actual cards giving you a match:
(As).*(.s)|(.s).*(As)
These basically say - the hand has a spade followed by an ace of spades OR has ace of spades followed by any other spade. https://regex101.com/r/pdwHPQ/1
That said, I'd probably consider building a simple class to parse the hand and give you more flexibility when it comes to answering questions about what cards are present. Whether or not this is worth it really depends a lot on your app. Here's an idea:
$hand = 'AsKh4c5c9h2s';
$cards = new Cards($hand);
$spades = $cards->getCardsBySuit('s');
if (in_array('As',array_keys($spades)) && count($spades) > 1) {
// hand has ace high flush draw
echo 'yep';
}
class Cards {
private $cards = '';
public function __construct($hand) {
foreach (str_split($hand,2) as $card) {
$this->cards[$card] = [
'rank' => substr($card,0,1),
'suit' => substr($card,1,1)
];
}
}
public function getCardsBySuit($suit) {
$response = [];
foreach ($this->cards as $k => $card) {
if ($card['suit'] == $suit) {
$response[$k] = $card;
}
}
return $response;
}
}
I've got a source file that contains some data in a few formats that I need to parse. I'm writing an ETL process that will have to match other data.
Most of the data is in the format city, state (US standard, more or less). Some cities are grouped across heavier population areas with multiple cities combined.
Most of the data looks like this (call this 1):
Elkhart, IN
Some places have multiple cities, delimited by a dash (call this 2):
Hickory-Lenoir-Morganton, NC
It's still not too complicated when the cities are in different states (call this 3):
Steubenville, OH-Weirton, WV
This one threw me for a loop; it makes sense but it flushes the previous formats (call this 4):
Kingsport, TN-Johnson City, TN-Bristol, VA-TN
In that example, Bristol is in both VA and TN. Then there's this (call this 5):
Mayagüez/Aguadilla-Ponce, PR
I'm okay with replacing the slash with a dash and processing the same as a previous example. That contains a diacritic as well and the rest of my data are diacritic-free. I'm okay with stripping the diacritic off, that seems to be somewhat straightforward in PHP.
Then there's my final example (call this 6):
Scranton--Wilkes-Barre--Hazleton, PA
The city name contains a dash so the delimiter between city names is a double dash.
What I'd like to produce is, given any of the above examples and a few hundred other lines that follow the same format, an array of [[city, state],...] for each so I can turn them into SQL. For example, parsing 4 would yield:
[
['Kingsport', 'TN'],
['Johnson City', 'TN'],
['Bristol', 'VA'],
['Bristol', 'TN']
]
I'm using a standard PHP install, I've got preg_match and so on but no PECL libraries. Order is unimportant.
Any thoughts on a good way to do this without a big pile of if-then statements?
I would split the input with '-'s and ','s, then delete empty elements in the array. str_replace followed by explode and array_diff (, array ()) should do the trick.
Then identify States - either searching a list or working on the principal that cities don't tend to have 2 upper-case letter names.
Now work through the array. If it's a city, save the name, if it's a state, apply it to the saved cities. Clear the list of cities when you get a city immediately following a state.
Note any exceptions and reformat by hand into a different input.
Hope this helps.
For anyone who's interested, I took the answer from #mike and came up with this:
function SplitLine($line) {
// This is over-simplified, just to cover the given case.
$line = str_replace('ü', 'u', $line);
// Cover case 6.
$delimiter = '-';
if (false !== strpos($line, '--'))
$delimiter = '--';
$line = str_replace('/', $delimiter, $line);
// Case 5 looks like case 2 now.
$parts = explode($delimiter, $line);
$table = array_map(function($part) { return array_map('trim', explode(',', $part)); }, $parts);
// At this point, table contains a grid with missing values.
for ($i = 0; $i < count($table); $i++) {
$row = $table[$i];
// Trivial case (case 1 and 3), go on.
if (2 == count($row))
continue;
if (preg_match('/^[A-Z]{2}$/', $row[0])) {
// Missing city; seek backwards.
$find = $i;
while (2 != count($table[$find]))
$find--;
$table[$i] = [$table[$find][0], $row[0]];
} else {
// Missing state; seek forwards.
$find = $i;
while (2 != count($table[$find]))
$find++;
$table[$i][] = $table[$find][1];
}
}
return $table;
}
It's not pretty and it's slow. It does cover all my cases and since I'm doing an ETL process the speed isn't paramount. There's also no error detection, which works in my particular case.
I've been tasked with standardizing some address information. Toward that goal, I'm breaking the address string into granular values (our address schema is very similar to Google's format).
Progress so far:
I'm using PHP, and am currently breaking out Bldg, Suite, Room#, etc... info.
It was all going great until I encountered Floors.
For the most part, the floor info is represented as "Floor 10" or "Floor 86". Nice & easy.
For everything to that point, I can simply break the string on a string ("room", "floor", etc..)
The problem:
But then I noticed something in my test dataset. There are some cases where the floor is represented more like "2nd Floor".
This made me realize that I need to prepare for a whole slew of variations for the FLOOR info.
There are options like "3rd Floor", "22nd floor", and "1ST FLOOR". Then what about spelled out variants such as "Twelfth Floor"?
Man!! This can become a mess pretty quickly.
My Goal:
I'm hoping someone knows of a library or something that already solves this problem.
In reality, though, I'd be more than happy with some good suggestions/guidance on how one might elegantly handle splitting the strings on such diverse criteria (taking care to avoid false positives such as "3rd St").
first of all, you need to have exhaustive list of all possible formats of the input and decide, how deep you'd like to go.
If you consider spelled out variants as invalid case, you may apply simple regular expressions to capture number and detect the token (room, floor ...)
I would start by reading up on regex in PHP. For example:
$floorarray = preg_split("/\sfloor\s/i", $floorstring)
Other useful functions are preg_grep, preg_match, etc
Edit: added a more complete solution.
This solution takes as an input a string describing the floor. It can be of various formats such as:
Floor 102
Floor One-hundred two
Floor One hundred and two
One-hundred second floor
102nd floor
102ND FLOOR
etc
Until I can look at an example input file, I am just guessing from your post that this will be adequate.
<?php
$errorLog = 'error-log.txt'; // a file to catalog bad entries with bad floors
// These are a few example inputs
$addressArray = array('Fifty-second Floor', 'somefloor', '54th floor', '52qd floor',
'forty forty second floor', 'five nineteen hundredth floor', 'floor fifty-sixth second ninth');
foreach ($addressArray as $id => $address) {
$floor = parseFloor($id, $address);
if ( empty($floor) ) {
error_log('Entry '.$id.' is invalid: '.$address."\n", 3, $errorLog);
} else {
echo 'Entry '.$id.' is on floor '.$floor."\n";
}
}
function parseFloor($id, $address)
{
$floorString = implode(preg_split('/(^|\s)floor($|\s)/i', $address));
if ( preg_match('/(^|^\s)(\d+)(st|nd|rd|th)*($|\s$)/i', $floorString, $matchArray) ) {
// floorString contained a valid numerical floor
$floor = $matchArray[2];
} elseif ( ($floor = word2num($floorString)) != FALSE ) { // note assignment op not comparison
// floorString contained a valid english ordinal for a floor
; // No need to do anything
} else {
// floorString did not contain a properly formed floor
$floor = FALSE;
}
return $floor;
}
function word2num( $inputString )
{
$cards = array('zero',
'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten',
'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen', 'twenty');
$cards[30] = 'thirty'; $cards[40] = 'forty'; $cards[50] = 'fifty'; $cards[60] = 'sixty';
$cards[70] = 'seventy'; $cards[80] = 'eighty'; $cards[90] = 'ninety'; $cards[100] = 'hundred';
$ords = array('zeroth',
'first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth', 'ninth', 'tenth',
'eleventh', 'twelfth', 'thirteenth', 'fourteenth', 'fifteenth', 'sixteenth', 'seventeenth', 'eighteenth', 'nineteenth', 'twentieth');
$ords[30] = 'thirtieth'; $ords[40] = 'fortieth'; $ords[50] = 'fiftieth'; $ords[60] = 'sixtieth';
$ords[70] = 'seventieth'; $ords[80] = 'eightieth'; $ords[90] = 'ninetieth'; $ords[100] = 'hundredth';
// break the string at any whitespace, dash, comma, or the word 'and'
$words = preg_split( '/([\s-,](?!and\s)|\sand\s)/i', $inputString );
$sum = 0;
foreach ($words as $word) {
$word = strtolower($word);
$value = array_search($word, $ords); // try the ordinal words
if (!$value) { $value = array_search($word, $cards); } // try the cardinal words
if (!$value) {
// if temp is still false, it's not a known number word, fail and exit
return FALSE;
}
if ($value == 100) { $sum *= 100; }
else { $sum += $value; }
}
return $sum;
}
?>
In the general case, parsing words into numbers is not easy. The best thread that I could find that discusses this is here. It is not nearly as easy as the inverse problem of converting numbers into words. My solution only works for numbers <2000, and it liberally interprets poorly formed constructs rather than tossing an error. Also, it is not resilient against spelling mistakes at all. For example:
forty forty second = 82
five nineteen hundredth = 2400
fifty-sixth second ninth = 67
If you have a lot of inputs and most of them are well formed, throwing errors for spelling mistakes is not really a big deal because you can manually correct the short list of problem entries. Silently accepting bad input, however, could be a real problem depending on your application. Just something to think about when deciding if it is worth it to make the conversion code more robust.
I'm trying to print the possible words that can be formed from a phone number in php. My general strategy is to map each digit to an array of possible characters. I then iterate through each number, recursively calling the function to iterate over each possible character.
Here's what my code looks like so far, but it's not working out just yet. Any syntax corrections I can make to get it to work?
$pad = array(
array('0'), array('1'), array('abc'), array('def'), array('ghi'),
array('jkl'), array('mno'), array('pqr'), array('stuv'), array('wxyz')
);
function convertNumberToAlpha($number, $next, $alpha){
global $pad;
for($i =0; $i<count($pad[$number[$next]][0]); $i++){
$alpha[$next] = $pad[$next][0][$i];
if($i<strlen($number) -1){
convertNumberToAlpha($number, $next++, $alpha);
}else{
print_r($alpha);
}
}
}
$alpha = array();
convertNumberToAlpha('22', 0, $alpha);
How is this going to be used? This is not a job for a simple recursive algorithm such as what you have suggested, nor even an iterative approach. An average 10-digit number will yield 59,049 (3^10) possibilities, each of which will have to be evaluated against a dictionary if you want to determine actual words.
Many times, the best approach to this is to pre-compile a dictionary which maps 10-digit numbers to various words. Then, your look-up is a constant O(1) algorithm, just selecting by a 10 digit number which is mapped to an array of possible words.
In fact, pre-compiled dictionaries were the way that T9 worked, mapping dictionaries to trees with logarithmic look-up functions.
The following code should do it. Fairly straight forward: it uses recursion, each level processes one character of input, a copy of current combination is built/passed at each recursive call, recursion stops at the level where last character of input is processed.
function alphaGenerator($input, &$output, $current = "") {
static $lookup = array(
1 => "1", 2 => "abc", 3 => "def",
4 => "ghi", 5 => "jkl", 6 => "mno",
7 => "pqrs", 8 => "tuv", 9 => "wxyz",
0 => "0"
);
$digit = substr($input, 0, 1); // e.g. "4"
$other = substr($input, 1); // e.g. "3556"
$chars = str_split($lookup[$digit], 1); // e.g. "ghi"
foreach ($chars as $char) { // e.g. g, h, i
if ($other === false) { // base case
$output[] = $current . $char;
} else { // recursive case
alphaGenerator($other, $output, $current . $char);
}
}
}
$output = array();
alphaGenerator("43556", $output);
var_dump($output);
Output:
array(243) {
[0]=>string(5) "gdjjm"
[1]=>string(5) "gdjjn"
...
[133]=>string(5) "helln"
[134]=>string(5) "hello"
[135]=>string(5) "hfjjm"
...
[241]=>string(5) "iflln"
[242]=>string(5) "ifllo"
}
You should read Norvigs article on writing a spellchecker in Python http://norvig.com/spell-correct.html . Although its a spellchecker and in python not php, it is the same concept around finding words with possible variations, might give u some good ideas.
*I try to count the unique appearances of a substring inside a list of words *
So check the list of words and detect if in any words there are substrings based on min characters that occur multiple times and count them. I don't know any substrings.
This is a working solution where you know the substring but what if you do not know ?
Theres a Minimum Character count where words are based on.
Will find all the words where "Book" is a substring of the word. With below php function.
Wanted outcome instad:
book count (5)
stor count (2)
Given a string of length 100
book bookstore bookworm booking book cooking boring bookingservice.... ok
0123456789... ... 100
your algorithm could be:
Investigate substrings from different starting points and substring lengths.
You take all substrings starting from 0 with a length from 1-100, so: 0-1, 0-2, 0-3,... and see if any of those substrings accurs more than once in the overall string.
Progress through the string by starting at increasing positions, searching all substrings starting from 1, i.e. 1-2, 1-3, 1-4,... and so on until you reach 99-100.
Keep a table of all substrings and their number of occurances and you can sort them.
You can optimize by specifying a minimum and maximum length, which reduces your number of searches and hit accuracy quite dramatically. Additionally, once you find a substring save them in a array of searched substrings. If you encounter the substring again, skip it. (i.e. hits for book that you already counted you should not count again when you hit the next booksubstring). Furthermore you will never have to search strings that are longer than half of the total string.
For the example string you might run additional test for the uniquness of a string.
You'd have
o x ..
oo x 7
bo x 7
ok x 6
book x 5
booking x 2
bookingservice x 1
with disregarding stings shorter than 3 (and longer than half of total textstring), you'd get
book x 5
booking x 2
bookingservice x 1
which is already quite a plausible result.
[edit] This would obviously look through all of the string, not just natural words.
[edit] Normally I don't like writing code for OPs, but in this case I got a bit interested myself:
$string = "book bookshelf booking foobar bar booking ";
$string .= "selfservice bookingservice cooking";
function search($string, $min = 4, $max = 16, $threshhold = 2) {
echo "<pre><br/>";
echo "searching <em>'$string'</em> for string occurances ";
echo "of length $min - $max: <br/>";
$hits = array();
$foundStrings = array();
// no string longer than half of the total string will be found twice
if ($max > strlen($string) / 2) {
$max = strlen($string);
}
// examin substrings:
// start from 0, 1, 2...
for ($start = 0; $start < $max; $start++) {
// and string length 1, 2, 3, ... $max
for ($length = $min; $length < strlen($string); $length++) {
// get the substring in question,
// but search for natural words (trim)
$substring = trim(substr($string, $start, $length));
// if substring was not counted yet,
// add the found count to the hits
if (!in_array($substring, $foundStrings)) {
preg_match_all("/$substring/i", $string, $matches);
$hits[$substring] = count($matches[0]);
}
}
}
// sort the hits array desc by number of hits
arsort($hits);
// remove substring hits with hits less that threshhold
foreach ($hits as $substring => $count) {
if ($count < $threshhold) {
unset($hits[$substring]);
}
}
print_r($hits);
}
search($string);
?>
The comments and variable names should make the code explain itself. $string would come for a read file in your case. This exmaple would output:
searching 'book bookshelf booking foobar bar booking selfservice
bookingservice cooking' for string occurances of length 4 - 16:
Array
(
[ook] => 6
[book] => 5
[boo] => 5
[bookin] => 3
[booking] => 3
[booki] => 3
[elf] => 2
)
Let me know how you implement it :)
This is my first approximation: unfinished, untested, has at least 1 bug, and is written in eiffel. Well I am not going to do all the work for you.
deferred class
SUBSTRING_COUNT
feature
threshold : INTEGER_32 =5
biggest_starting_substring_length(a,b:STRING):INTEGER_32
deferred
end
biggest_starting_substring(a,b:STRING):STRING
do
Result := a.substring(0,biggest_starting_substring_length(a,b))
end
make_list_of_substrings(a,b:STRING)
local
index:INTEGER_32
this_one: STRING
do
from
a_index := b_index + 1
invariant
a_index >=0 and a_index <= a.count
until
a_index >= a.count
loop
this_one := biggest_starting_substring(a.substring (a_index, a.count-1),b)
if this_one.count > threshold then
list.extend (this_one)
end
variant
a.count - a_index
end
end -- biggest_substring
list : ARRAYED_LIST[STRING]
end