I need to identify a fraction from a form field in a recipe database using a Regex.
Ingredients will be entered in a two-part form field: field one is the amount, field two is the ingredient. I then need to break field one into its fractional components to store in the database.
Possible entries include:
1, 1/2, 1 1/2, and any of the previous with words attached such as 1 cup, or 1/2 tbsp.
The hardest case I foresee would be [2 28 oz. cans], where 2 is the number and 28 oz. cans would be the word.
I have found:
(\b[0-9]{1,3}(?:,?[0-9]{3})*(?:.[0-9]{2})?\b)
which sort of works. I am completely new to regex, so I am working by guess and check only, and I am having a hard time making it work for me.
Problem #1: I need to identify the word part as well. The word part can be multiple words, such as 2 large cans, where large cans would be the word part. The above regex identifies the numbers very well, but I can't figure out a way to grab the rest of the form field. For example, 1 1/2 tbsp gives me 1,1,2 but that is all, and I need tbsp as well. I tried to use this regex and use len to cut the original down, subtracting the fraction off the front, but had problems since 1 / 2 and 1/2 are both allowed, so I can't figure out how many characters to subtract (1 / 2 should subtract 6 from the front of the string, 1/2 should subtract 4 from the front of the string, and just looking at the regex results of 1,2 I can't tell how many to subtract).
Problem #2: This isn't so important, but any ideas on how to identify the [2 28 oz cans] problem? The above regex pulls 2,28 out, which is not correct. It should only pull 2 out, and then the rest (28 oz cans) would be the other part that the solution to problem 1 will hopefully find.
Here's a regex that will match mixed numbers, whole numbers, and the rest of the entry (the ingredient, hopefully with any extraneous numbers):
^((\d+( \d+/\d+)?)|(\d+/\d+))( (.+))?$
So, for example, if you had 2 28 ounce cans it would match:
group 1: 2
group 2: 2
group 3:
group 4:
group 5: " 28 ounce cans" (with the leading space)
group 6: 28 ounce cans
The groups you care about are 1 & 6. Group 1 will always contain the amount (as a number, fraction, or number with a fraction) and group 6 will always have the remaining text (the ingredient).
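To make the group numbering concrete, here is a minimal sketch of the same regex used from Python (the question is about PHP, so treat the language choice as an assumption). The function name `split_amount` and the use of `fractions.Fraction` to turn the amount into a number are my own additions, not part of the original answer.

```python
import re
from fractions import Fraction

# Same pattern as above: group 1 is the amount, group 6 is the remaining text.
AMOUNT_RE = re.compile(r"^((\d+( \d+/\d+)?)|(\d+/\d+))( (.+))?$")

def split_amount(entry):
    """Split a form entry into (amount, rest); return None if it does not match."""
    m = AMOUNT_RE.match(entry.strip())
    if m is None:
        return None
    amount_text = m.group(1)      # e.g. "1 1/2", "1/2" or "2"
    rest = m.group(6) or ""       # group 6 is None when no text follows the amount
    # "1 1/2" -> Fraction(1) + Fraction(1, 2); a zero denominator would raise here.
    amount = sum(Fraction(part) for part in amount_text.split())
    return amount, rest

print(split_amount("1 1/2 tbsp"))     # amount 3/2, rest "tbsp"
print(split_amount("2 28 oz. cans"))  # amount 2, rest "28 oz. cans"
```

Note that this pattern does not accept the spaced form 1 / 2 mentioned in the question; allowing optional whitespace around the slash would be a separate change to the regex.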
I've been struggling with the following (algorithmic?) problem for days now. I have a list of persons that I need to group evenly. Each "round" of groupings is stored so that, in the next round, people are grouped together only if they were not matched together during a previous round.
Example:
John
Bob
Laura
Lucy
Michael
Mark
1st round
Group 1
John
Bob
Laura
Group 2
Lucy
Michael
Mark
Now for the 2nd round we have to avoid grouping John, Bob and Laura together (or at least minimize it).
I came up with a solution that works for small cases like this one, where I create a pairing matrix with
[John - Bob] = 1 (number of times they got paired together in previous rounds)
[Bob - Laura] = 1
etc...
I then loop through that matrix. For each person I find the person they have been paired with the fewest times and add these 2 to a list, and so on until everybody has been added to that list.
I then split that list into groups of the desired size (the group size is the only parameter).
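To make the approach above concrete, here is a rough Python sketch of it (my actual code is PHP; the function name `next_round` and the use of a `frozenset` as the matrix key are just for illustration):

```python
from collections import defaultdict
from itertools import combinations

def next_round(people, pair_counts, group_size):
    """Greedy pairing as described above; updates pair_counts in place."""
    remaining = list(people)
    ordered = []
    while remaining:
        person = remaining.pop(0)
        ordered.append(person)
        if remaining:
            # Partner = the remaining person paired least often with `person`.
            partner = min(remaining, key=lambda p: pair_counts[frozenset((person, p))])
            remaining.remove(partner)
            ordered.append(partner)
    # Split the ordered list into groups of the desired size.
    groups = [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]
    # Record every pairing that happened inside each group.
    for group in groups:
        for a, b in combinations(group, 2):
            pair_counts[frozenset((a, b))] += 1
    return groups

roster = ["John", "Bob", "Laura", "Lucy", "Michael", "Mark"]
counts = defaultdict(int)
print(next_round(roster, counts, 3))  # round 1
print(next_round(roster, counts, 3))  # round 2
```

The weak spot shows up quickly: the pairs are chosen carefully, but the final split into groups ignores the matrix, so two people from different "pairs" can land in the same group again.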
I found out that this doesn't work after a few rounds or with larger "rosters".
I'm starting to think that this might be an NP problem, because I'd have to iterate many, many times to find the perfect "list".
Is there an algorithm I should look into? I'm coding this in PHP, but Java or pseudocode works too.
A roster containing 36 persons.
1st round, group size = 6
2nd round, group size = 4
I should not have someone paired with someone else more than once.
With my solution, I have around 10 pairings that happened twice.
I have a web application, written in PHP and incorporating JavaScript and jQuery, that will be used as my company's Inventory Management System (IMS). What I would like to do is build a regular expression from a value entered by the user.
The idea behind this is that most manufacturers' serial number schemas (character length and mixture of alpha and numeric values) are unique to a certain part. So when a part is added to the IMS and the first serial number is scanned into the system, I would like a regex to be built and saved to a database table corresponding to that part type. Whenever a serial number is scanned in the future, the part type should be auto-selected because the serial number matches the schema for that manufacturer. I understand this may not always narrow things down to a single part, so I could return a list of parts that match the schema instead of the user needing to look it up in the catalog.
The basis of my question is: what is the best starting point for a function that deciphers a value given by a user and creates a regex from it? I'm not requesting a full function, just a starting point for how to look at my situation and goal so I can understand where to begin. I've scratched my head long enough, and have started writing functions numerous times just to delete the entire block knowing I was headed for disaster.
Anything in code is possible - is this feasible?
EDIT - ADDED SAMPLE VALUES
DVD-RW (Optical Drives)
1613518L121
1613509L121
1613519L121
VGA Output Cards
0324311071068
0324311071134
COM Expansion Cards
608131234
608131237
Hard Drives
WMAYUJ753738
WMAYUJ072099
WMAYUJ683739
WMAYUJ844900
As you can see, some values are going to be numeric only, of a certain length. Others will have alpha characters at the beginning followed by a series of numbers. Others may have alpha and numeric characters interspersed with each other. In almost every case a simple rule about the length and the alpha/numeric layout is going to be enough to identify a single part type in our list of goods. However, in those cases where more than one expression matches a value, I can simply have the application show a list of the two or more products that match the regex and prompt the user to select the proper part. This, overall, will save time and mistakes in selecting a product type in the WMS database.
Thanks for the comments. I understand I'm not asking a question that has one answer to it. I'm looking for a starting point on how best to step through the string and spit out a corresponding regex that would match the value.
As @Pete says, I think you have set yourself too ambitious a goal. Some thoughts, perhaps overly generalized from your specific needs.
I take it that you want to scan a serial number like 1-56592-487-8 and infer that the regular expression /\d-\d{5}-\d{3}-\d/ matches parts of this type from a given manufacturer. (This happens to be the ISBN-10 for my copy of "Java in a Nutshell." ISBNs are not serial numbers, but work with me.) But you can't infer from a handful of examples what pattern the manufacturer uses. Maybe the first character position is a hex digit (0-F). Maybe the last character is a checksum that can be a digit or X (like ISBNs). Maybe there is a suffix, not always present, that denotes the plant. So you will find yourself building up many patterns for the same manufacturer/part type as new instances of the part come in.
You will also have the reverse problem. A maker of widgets uses the regex /[A-Z]{3}\d{7}/, and a maker of sonic screwdrivers uses the same pattern.
That said, about the best you can do is something like this:
for each character in the scanned serial number
    if it is a capital letter
        add [A-Z] to the regular expression
    else if it is a digit
        add \d to the regular expression
    else
        add the character itself to the regular expression, escaped as necessary
end for
collapse multiple occurrences with the {m,n} interval quantifier
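The pseudocode above could be sketched in Python roughly as follows (your application is PHP, so take this as an illustration only). The names `char_class` and `infer_pattern` are made up for this example, and anchoring the result with ^ and $ is my own assumption.

```python
import re
from itertools import groupby

def char_class(ch):
    """Map one character of the serial number to a regex token."""
    if "A" <= ch <= "Z":
        return "[A-Z]"
    if "0" <= ch <= "9":
        return r"\d"
    return re.escape(ch)  # anything else is kept literally, escaped as necessary

def infer_pattern(serial):
    """Build an anchored regex that matches the shape of `serial`."""
    parts = []
    # groupby collapses runs of identical tokens so we can emit token{n}.
    for token, run in groupby(char_class(c) for c in serial):
        n = len(list(run))
        parts.append(token if n == 1 else f"{token}{{{n}}}")
    return "^" + "".join(parts) + "$"

print(infer_pattern("WMAYUJ753738"))  # ^[A-Z]{6}\d{6}$
print(infer_pattern("1613518L121"))   # ^\d{7}[A-Z]\d{3}$
```

As discussed above, a pattern inferred from one sample is only a guess at the manufacturer's real scheme, so expect to store several patterns per part type and to get collisions between part types.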
The rules for Vehicle Identification Numbers may also be inspiring. Think about how you would infer the rules for VINs, given a handful of examples.
EDIT: Sorry, my sample code was buggy. As a first step on the parts you want to guess, you need this kind of algorithm: longest substring, or this.
You will also need to iterate and apply some masking, as explained above and by David. Note that in the sample below the "L121" of the DVD-RW serials is not guessed, because I required the common part to be at the start of both strings. So you will need to find all the common consecutive subsequences and decide which ones are relevant (probably with some kind of gain-maximizing function).
Using long_substr from the second link:
>>> for x in d:
        for y in d:
            if x == y: continue
            # longest substring shared by the two serial numbers
            common = long_substr([x, y])
            length = len(common)
            # keep it only when it is a prefix of both
            if x.startswith(common) and y.startswith(common):
                print("\t".join((x, y, str(length), common)))
which produces:
0324311071068 0324311071134 10 0324311071
0324311071134 0324311071068 10 0324311071
1613519L121 1613518L121 6 161351
1613519L121 1613509L121 5 16135
WMAYUJ844900 WMAYUJ753738 6 WMAYUJ
WMAYUJ844900 WMAYUJ072099 6 WMAYUJ
WMAYUJ844900 WMAYUJ683739 6 WMAYUJ
WMAYUJ753738 WMAYUJ844900 6 WMAYUJ
WMAYUJ753738 WMAYUJ072099 6 WMAYUJ
WMAYUJ753738 WMAYUJ683739 6 WMAYUJ
1613518L121 1613519L121 6 161351
1613518L121 1613509L121 5 16135
WMAYUJ072099 WMAYUJ844900 6 WMAYUJ
WMAYUJ072099 WMAYUJ753738 6 WMAYUJ
WMAYUJ072099 WMAYUJ683739 6 WMAYUJ
WMAYUJ683739 WMAYUJ844900 6 WMAYUJ
WMAYUJ683739 WMAYUJ753738 6 WMAYUJ
WMAYUJ683739 WMAYUJ072099 6 WMAYUJ
608131237 608131234 8 60813123
1613509L121 1613519L121 5 16135
1613509L121 1613518L121 5 16135
608131234 608131237 8 60813123
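The long_substr helper comes from the linked answer and is not reproduced here. If you just want to try the loop, a brute-force stand-in could look like the sketch below; this is my assumption about its behaviour (longest contiguous substring shared by all the strings), not the linked code itself.

```python
def long_substr(strings):
    """Return the longest contiguous substring common to every string.

    Hypothetical stand-in for the helper from the linked answer;
    brute force, fine for short serial numbers.
    """
    if not strings:
        return ""
    shortest = min(strings, key=len)
    # Try the longest candidates first and return the first one shared by all.
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""

print(long_substr(["WMAYUJ753738", "WMAYUJ072099"]))    # WMAYUJ
print(long_substr(["0324311071068", "0324311071134"]))  # 0324311071
```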
--- first buggy reply starts here
Below is the first part of my reply. It can only help you understand where I went wrong, and may give you some ideas:
A sample using the Longest Common Subsequence (LCS) problem solver on your particular need, which I think could be a first step in the process of guessing what is common.
It is in Python, but for the demo it should be easy to read (or it can be cut and pasted into IDLE, the Python editor), assuming that you use the ActiveState Code Recipes from the first link above.
This has to do with bioinformatics (think of gene alignment).
You will need something to decide which common sequence is the most interesting (maybe one with a minimal length?) and then proceed with masking as already proposed by David or in my comment.
(At first I did not see that the LCS solver was not a consecutive LCS solver, which is what you need. So my first use of the LCS solver is buggy: because it is not contiguous, I get WMAYUJ8 or WMAYUJ7 and not WMAYUJ, which is shorter, since the solver finds the longest run of common characters without requiring them to be consecutive. Again, sorry for that.)
>>> raw = """1613518L121
1613509L121
1613519L121
0324311071068
0324311071134
608131234
608131237
WMAYUJ753738
WMAYUJ072099
WMAYUJ683739
WMAYUJ844900"""
>>> d = dict()
>>> for line in raw.split("\n"):
        if not line.strip(): continue
        value = line.strip()
        d[value] = 1
>>> for x in d:
        for y in d:
            if x == y: continue
            # LCSLength and LCS come from the ActiveState recipe
            length = LCSLength(x, y)
            common = LCS(x, y)
            if length >= 3 and x.startswith(common):
                print("\t".join((x, y, str(length), common)))
which produces:
0324311071068 0324311071134 10 0324311071
0324311071068 608131234 4 0324
0324311071134 0324311071068 10 0324311071
WMAYUJ844900 WMAYUJ753738 7 WMAYUJ8
WMAYUJ753738 WMAYUJ072099 7 WMAYUJ7
608131237 608131234 8 60813123
608131234 608131237 8 60813123
Run spam-detection algorithms (statistical ones such as Bayes, or similar "learning" ones). This may or may not help you, but if it doesn't, I honestly doubt you will ever come up with a useful logical algorithm here.
I have been thinking about this all day and can't seem to figure out a memory-efficient and speedy way.
The problem is:
for example, I have these letters:
e f j l n r r t t u w x (12 letters)
I am looking for this word
TURTLE (6 letters)
How do I find all the possible words in the full set (12 letters) with PHP?
( Or with python, if that might be a lot easier? )
Things I've tried:
Using permutations: I generated all possible strings using a permutation algorithm, put them in an array (only the ones 6 characters long) and did an in_array to check whether any matches one of the words in my array of valid words (in this case containing TURTLE, but sometimes two or three words).
This calculation costs a lot of memory and time, especially when getting permutations of 6+ characters.
Creating a regex (I am bad at this): I wanted to create a regex to check whether 6 of the 12 (input) characters are in a word from the "valid array". The problem is that we don't know which of the 12 letters will be the starting position, or the positions of the other letters.
An example of this would be:
http://drawsomethingwords.net/
I hope you can help me with this problem, as I would really like to fix this.
Thanks for all of your time :)
I've encountered similar problems when writing a crossword editor (e.g., find all words of length 5 with a 'B' in second position). Basically it comes down to:
Process a word list and organize words by length (i.e., a list of all words of length 2, length 3, length 4, etc). The reason is that you often know the length of the word(s) that you wish to search for. If you want to search for words of unknown length, you can repeat a search again for a different word list.
Insert each separate word list into a ternary search tree, which makes searching for words a lot faster. Each node in the tree contains a character and you can descend the tree to search for words. There are also specialized data structures such as a trie, but I have not (yet) explored them.
Now for your problem, you could use the search tree to write a search function such as
function findWords($tree, $letters) {
// ...
}
where tree is the search tree containing the words of the length that you wish to search for and letters is a list of valid characters. In your example, letters would be the string efjlnrrttuwx.
The search tree allows you to search for words, one character at a time, and you can keep track of characters that you have encountered so far. As long as these characters are in the list of valid letters, you keep searching. Once you've encountered a leaf node in the search tree, you have found an existing word which you can add to the result. If you encounter a character which is not in letters (or it has already been used), you can skip that word and continue the search elsewhere in the search tree.
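The search described above can be sketched in a few lines of Python. Note that this is not the Palabra code: for brevity it swaps the ternary search tree for a plain dictionary-based trie, and the names `build_trie` and `find_words` are invented for the example.

```python
from collections import Counter

END = "$"  # marker key meaning "a word ends at this node"

def build_trie(words):
    """Build a nested-dict trie: each key is a character, END marks a word end."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def find_words(tree, letters):
    """Return all words in the tree that can be spelled from `letters`."""
    available = Counter(letters)  # how many of each letter are still unused
    found = []

    def walk(node, prefix):
        if END in node:
            found.append(prefix)
        for ch, child in node.items():
            if ch != END and available[ch] > 0:
                available[ch] -= 1   # use the letter...
                walk(child, prefix + ch)
                available[ch] += 1   # ...and give it back when backtracking

    walk(tree, "")
    return sorted(found)

tree = build_trie(["turtle", "letter", "jungle", "flute"])
print(find_words(tree, "efjlnrrttuwx"))  # ['flute', 'turtle']
```

If you keep one tree per word length, as suggested above, every result has the length you are looking for; with a single tree you would filter the results by length afterwards.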
My crossword editor Palabra contains an implementation of the above steps (a part is done in Python but mostly in C). It works fast enough for Ubuntu's default word list containing roughly 70K words.
There are probably better ways, but this is just off the top of my head:
I assume you have a database of words (i.e. a dictionary). Add fields a-z to the database table. Write a script that counts the occurrences of each letter in a word and writes them to the a-z fields as integers. E.g. for balloon, the table would look like:
id  name     a  b  ...  l  ...  n  o
1   balloon  1  1  ...  2  ...  1  2
Then, when the user enters their letters, you count how many of each character there are and match that against the database.
// User enters 'zqlamonrlob'
// You count the letters:
a b c d e f g h i j k l m n o p q r s t u v w x y z
1 1 0 0 0 0 0 0 0 0 0 2 1 1 2 0 1 1 0 0 0 0 0 0 0 1
// Query the database
$sql = "SELECT `name` FROM `my_table` WHERE `a` <= {$count['a']} AND `b` <= {$count['b']} ...";
That would get you a list of words that use some or all of the letters that the user entered.
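Here is a small self-contained sketch of the same idea using Python's built-in sqlite3 module with an in-memory database, just to show the table layout and the query working end to end. The table name `words` and the helper names are made up for the example, and it binds the counts as query parameters instead of interpolating them into the SQL string.

```python
import sqlite3
import string
from collections import Counter

LETTERS = string.ascii_lowercase

def build_db(words):
    """Create an in-memory table with one count column per letter a-z."""
    conn = sqlite3.connect(":memory:")
    cols = ", ".join(f'"{c}" INTEGER' for c in LETTERS)
    conn.execute(f"CREATE TABLE words (name TEXT, {cols})")
    placeholders = ", ".join("?" for _ in range(len(LETTERS) + 1))
    for word in words:
        counts = Counter(word)
        conn.execute(f"INSERT INTO words VALUES ({placeholders})",
                     [word] + [counts[c] for c in LETTERS])
    return conn

def matching_words(conn, letters):
    """Return words whose per-letter counts all fit within the given letters."""
    have = Counter(letters.lower())
    where = " AND ".join(f'"{c}" <= ?' for c in LETTERS)
    rows = conn.execute(f"SELECT name FROM words WHERE {where} ORDER BY name",
                        [have[c] for c in LETTERS])
    return [row[0] for row in rows]

conn = build_db(["balloon", "moon", "zebra", "lamb"])
print(matching_words(conn, "zqlamonrlob"))  # ['balloon', 'lamb', 'moon']
```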
Here's a regex, just to show it can (but not necessarily should) be done:
preg_match('/^(?:t()|u()|r()|t()|l()|e()|.)+$\1\2\3\4\5\6/i', 'efjlnrrttuwx')
matches.
How does it work? Empty capturing parentheses always match if the preceding letter matches. The backreferences at the end of the regex make sure that each of the characters has participated in the match. Therefore,
preg_match('/^(?:t()|u()|r()|t()|l()|e()|.)+$\1\2\3\4\5\6/i', 'efjlnrrtuwx')
(correctly) will not match because there is only one t in the string but the regex needs two different ts.
The problem is that of course the regex engine has to check many permutations to arrive at this conclusion. While a successful match may be quite fast (175 steps of the regex engine in the first case), an unsuccessful match attempt can be expensive (3816 steps in the second case).
I think you need to approach this problem from the opposite direction.
Loop through your list of words, testing the words with the specified number of characters, to see if the word characters are in the specified character set.
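A minimal sketch of that loop in Python (the question allows PHP or Python); the function name `find_candidates` is just for illustration:

```python
from collections import Counter

def find_candidates(word_list, letters, length):
    """Return words of the given length that can be built from `letters`."""
    have = Counter(letters)
    # Counter subtraction keeps only positive counts, so an empty result
    # means the word needs no letter more often than it is available.
    return [w for w in word_list
            if len(w) == length and not (Counter(w) - have)]

words = ["turtle", "letter", "flute", "runlet"]
print(find_candidates(words, "efjlnrrttuwx", 6))  # ['turtle', 'runlet']
```

This does one cheap letter count per dictionary word instead of generating every permutation of the input letters.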
I'm creating a web app in PHP where people can try to translate words they need to learn for school.
For example, someone needs to translate the Dutch word 'weer' to 'weather' in English, but unfortunately he types 'whether'. Because he almost typed the right word, I want to give him another try, with dots '.' on the places where he made a mistake:
Language A: weer
Language B: weather
Input: whether
Output: w..ther
Or, for example
Language A: fout
Language B: mistake
Input: mitake
Output: mi.take
Or:
Language A: echt
Language B: genuine
Input: genuinely
Output: genuinely (almost good, shorten the word a little bit)
But, if the input differs too much from the desired translation, I don't want to get output like ........
I heard about Levenshtein distance and I think I need an algorithm that is much like that one, but I don't know how to place dots on the right place instead of echoing how many operations are to be done.
So, how can I return the misspelled word with dots on the places where someone made a mistake?
First, take a look at the Levenshtein algorithm on wikipedia.
Then go on and look at the examples and the resulting matrix d on the article page:
          *k*  *i*  *t*  *t*  *e*  *n*
      >0    1    2    3    4    5    6
*s*    1   >1    2    3    4    5    6
*i*    2    2   >1    2    3    4    5
*t*    3    3    2   >1    2    3    4
*t*    4    4    3    2   >1    2    3
*i*    5    5    4    3    2   >2    3
*n*    6    6    5    4    3    3   >2
*g*    7    7    6    5    4    4   >3
The distance is found in the lower right corner of the matrix, d[m, n]. From there it is possible to backtrack along the minimum steps to the upper left of the matrix, d[0, 0]. You just go left, up-left, or up at each step, whichever minimizes the path.
In the example above, you'd find the path marked by the '>' signs:
     s  i  t  t  i  n  g           k  i  t  t  e  n
  0  1  1  1  1  2  2  3        0  1  1  1  1  2  2  3
     ^           ^     ^           ^           ^     ^
     changes in the distance, replace by dots
Now you can find the locations d[i, j] on the minimum path at which the distance changes (marked by ^ in the example above), and for those letters you put a dot in the first (or second) word at position i (or j respectively).
Result:
     s  i  t  t  i  n  g           k  i  t  t  e  n
     ^           ^     ^           ^           ^     ^
     .  i  t  t  .  n  .           .  i  t  t  .  n  .
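Putting the matrix and the backtracking together, a sketch in Python might look like this (your app is PHP, so read it as pseudocode that happens to run). The function names, the `max_edits` threshold and the choice to prefer the diagonal step when several paths are equally short are my own assumptions; extra letters in the input are simply dropped from the hint.

```python
def levenshtein_matrix(a, b):
    """Return the full (len(a)+1) x (len(b)+1) Levenshtein matrix d."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # letter missing from b
                          d[i][j - 1] + 1,         # extra letter in b
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d

def dotted_hint(target, typed, max_edits=2):
    """Return target with dots where typed went wrong, or None if too far off."""
    d = levenshtein_matrix(target, typed)
    i, j = len(target), len(typed)
    if d[i][j] > max_edits:
        return None
    out = []
    while i > 0 or j > 0:  # backtrack from d[m][n] towards d[0][0]
        if i > 0 and j > 0:
            cost = 0 if target[i - 1] == typed[j - 1] else 1
            if d[i][j] == d[i - 1][j - 1] + cost:
                out.append(target[i - 1] if cost == 0 else ".")
                i, j = i - 1, j - 1
                continue
        if i > 0 and d[i][j] == d[i - 1][j] + 1:
            out.append(".")  # a letter of the target is missing from the input
            i -= 1
        else:
            j -= 1           # an extra letter in the input: nothing to show
    return "".join(reversed(out))

print(dotted_hint("weather", "whether"))  # w..ther
print(dotted_hint("mistake", "mitake"))   # mi.take
```

With this sketch genuine / genuinely comes back as plain "genuine", so the "shorten the word" message from the question would need a separate check on the deletions.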
The terminology you are looking for is called "edit distance." Using something like Levenshtein distance will tell you the number of operations needed to transform one string into the other (insertions, deletion, substitutions, etc).
Here is a list of other "editing distance" algorithms.
Once you decide that a word is "close enough" (i.e. it doesn't exceed some threshold of edits needed), you can show where the edits need to occur (by showing dots).
So how do you know where to put the dots?
The interesting thing about the "Levenshtein distance" is that it uses an M x N matrix with one word on each axis (see the sample matrix in the Levenshtein article). Once you create the matrix, you can determine which letters require "additional edits" to be correct. That's where you put the dots. If the letter requires "no additional edits," you simply print the letter. Pretty cool.
I think you need a multi-step process, not just Levenshtein. First I would check whether the input word is a form of the target word. That would catch your third example, with no need to worry about adding dots. You could also use this step to catch synonyms. The next step is to check the length difference of the two strings.
If the difference is 0 you can do a letter-for-letter comparison to place the dots. If you don't want to show all dots, keep a count of the dots placed and display an error message once it goes over the limit. (Sorry, that was incorrect.)
If the difference shows the input is longer, you need to check whether deleting a letter would fix the problem. Here you can use Levenshtein to see how close the words are: if they are too far apart, show your error message; if they are in range, you will need to do the steps of Levenshtein in reverse and mark the changes somehow. I'm not sure how you want to show that a letter needs to be deleted.
If the difference shows the input is shorter, you can use Levenshtein distance to see whether the two words are close enough, or show the error. Then do the steps in reverse again, adding dots for insertions and dots for substitutions.
Actually, the last two steps can be combined into one function that runs through the algorithm, remembers each insertion, deletion or substitution, and changes the output accordingly.