Regex creation based upon input

I have a web application, written in PHP with JavaScript and jQuery, that will be used as my company's Inventory Management System (IMS). What I would like to do is create a regular expression based on a value entered by the user.
The idea behind this is that most manufacturers' serial number schemas (the number of characters and the mixture of alphabetic and numeric values) are unique to a certain part. So when a part is added to the IMS and its first serial number is scanned into the system, I would like a regex to be built and saved to a database table corresponding to that part type. Any time a serial number is scanned in the future, the part type should be auto-selected because the serial number matches that manufacturer's schema. I understand this may not always narrow things down to a single part, so I could instead return a list of parts that match the schema, saving the user from looking it up in the catalog.
The basis of my question is: what is the best starting point for a function that deciphers a value given by a user and creates a regex from it? I'm not requesting a full function, just a way of looking at my situation and goal so I can understand where to begin. I've scratched my head long enough and started writing functions numerous times, only to delete the entire block knowing I was headed for disaster.
Anything in code is possible, but is this feasible?
EDIT - ADDED SAMPLE VALUES
DVD-RW (Optical Drives)
    1613518L121
    1613509L121
    1613519L121
VGA Output Cards
    0324311071068
    0324311071134
COM Expansion Cards
    608131234
    608131237
Hard Drives
    WMAYUJ753738
    WMAYUJ072099
    WMAYUJ683739
    WMAYUJ844900
As you can see, some values are numeric only, with a fixed number of characters. Others have alphabetic characters at the beginning followed by a series of digits. Others may have alphabetic and numeric characters interspersed with each other. In almost every case, a simple rule about the length and the alpha/numeric layout will be enough to identify a single part type in our list of goods. However, in those cases where more than one expression matches a value, I can simply have the application show a list of the two or more products that match and prompt the user to select the proper part. Overall, this will save time and reduce mistakes when selecting a product type in the WMS database.
Thanks for the comments. I understand I'm not asking a question that has a single answer. I'm looking for a starting point on how best to step through the string and produce a corresponding regex that would match the value.

As @Pete says, I think you have set yourself too ambitious a goal. Some thoughts, perhaps overly generalized from your specific needs.
I take it that you want to scan a serial number like 1-56592-487-8 and infer that the regular expression /\d-\d{5}-\d{3}-\d/ matches parts of this type from a given manufacturer. (This happens to be the ISBN-10 for my copy of "Java in a Nutshell." ISBNs are not serial numbers, but work with me.) But you can't infer from a handful of examples what pattern the manufacturer uses. Maybe the first character position is a hex digit (0-F). Maybe the last character is a checksum that can be a digit or X (like ISBNs). Maybe there is a suffix, not always present, that denotes the plant. So you will find yourself building up many patterns for the same manufacturer/part type as new instances of the part come in.
You will also have the reverse problem. A maker of widgets uses the regex /[A-Z]{3}\d{7}/, and a maker of sonic screwdrivers uses the same pattern.
That said, about the best you can do is something like this:
for each character in the scanned serial number
    if it is a capital letter
        add [A-Z] to the regular expression
    else if it is a digit
        add \d to the regular expression
    else
        add the character itself to the regular expression, escaped as necessary
end for
collapse repeated occurrences with the {n} interval quantifier (for example, \d\d\d becomes \d{3})
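A minimal Python sketch of the pseudocode above. Python is used only because another answer on this page uses it; the names char_class and infer_pattern are made up for illustration, and anchoring the result with ^ and $ is an assumption about how the pattern will be used.

```python
import re
from itertools import groupby


def char_class(ch):
    """Map one character to the regex token that generalizes it."""
    if "A" <= ch <= "Z":
        return "[A-Z]"
    if "0" <= ch <= "9":
        return r"\d"
    # Anything else (dashes, dots, ...) is kept literally, escaped if needed.
    return re.escape(ch)


def infer_pattern(serial):
    """Build an anchored regex from a single scanned serial number."""
    parts = []
    # groupby collapses consecutive identical tokens into one run.
    for token, run in groupby(char_class(ch) for ch in serial):
        count = len(list(run))
        parts.append(token if count == 1 else f"{token}{{{count}}}")
    return "^" + "".join(parts) + "$"


print(infer_pattern("WMAYUJ753738"))  # ^[A-Z]{6}\d{6}$
print(infer_pattern("1613518L121"))   # ^\d{7}[A-Z]\d{3}$
```

A pattern inferred from one sample is only as general as that sample, which is exactly the limitation described above: a hex digit that happened to be 0-9 in the first scan will be captured as \d.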
The rules for Vehicle Identification Numbers may also be inspiring. Think about how you would infer the rules for VINs, given a handful of examples.
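To cope with the "reverse problem" (several part types sharing one pattern), the lookup can simply return every part whose stored pattern matches. This is a sketch only: the PART_PATTERNS table stands in for the database table from the question, its patterns were derived by hand from the sample values, and the "Sonic Screwdrivers" row is invented to show a clash.

```python
import re

# Hypothetical stand-in for the database table: part type -> stored pattern.
PART_PATTERNS = {
    "DVD-RW (Optical Drives)": r"^\d{7}[A-Z]\d{3}$",
    "VGA Output Cards": r"^\d{13}$",
    "COM Expansion Cards": r"^\d{9}$",
    "Hard Drives": r"^[A-Z]{6}\d{6}$",
    "Sonic Screwdrivers": r"^[A-Z]{6}\d{6}$",  # invented row with the same pattern
}


def matching_parts(serial, patterns=PART_PATTERNS):
    """Return every part type whose pattern matches the scanned serial."""
    return [part for part, pattern in patterns.items()
            if re.fullmatch(pattern, serial)]


print(matching_parts("1613520L121"))   # one candidate: auto-select it
print(matching_parts("WMAYUJ123456"))  # two candidates: ask the user
print(matching_parts("???"))           # no candidates: treat as a new part
```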

EDIT: sorry, my sample code was buggy. As a first step on the parts you want to guess, you need an algorithm of this kind: longest substring, or this.
You will need to add iteration and some masking, as explained above and by David. Also, in the sample below, the "L121" for DVD-RW is not guessed, because I required both values to start with 'common'. So you will need to find all the common consecutive subsequences and decide which ones are relevant (probably with some kind of gain-maximizing function).
Using long_substr from the second link:
>>> for x in d:
...     for y in d:
...         if x == y: continue
...         common = long_substr([x, y])
...         length = len(common)
...         if x.startswith(common) and y.startswith(common):
...             print("\t".join((x, y, str(length), common)))
which produces:
0324311071068 0324311071134 10 0324311071
0324311071134 0324311071068 10 0324311071
1613519L121 1613518L121 6 161351
1613519L121 1613509L121 5 16135
WMAYUJ844900 WMAYUJ753738 6 WMAYUJ
WMAYUJ844900 WMAYUJ072099 6 WMAYUJ
WMAYUJ844900 WMAYUJ683739 6 WMAYUJ
WMAYUJ753738 WMAYUJ844900 6 WMAYUJ
WMAYUJ753738 WMAYUJ072099 6 WMAYUJ
WMAYUJ753738 WMAYUJ683739 6 WMAYUJ
1613518L121 1613519L121 6 161351
1613518L121 1613509L121 5 16135
WMAYUJ072099 WMAYUJ844900 6 WMAYUJ
WMAYUJ072099 WMAYUJ753738 6 WMAYUJ
WMAYUJ072099 WMAYUJ683739 6 WMAYUJ
WMAYUJ683739 WMAYUJ844900 6 WMAYUJ
WMAYUJ683739 WMAYUJ753738 6 WMAYUJ
WMAYUJ683739 WMAYUJ072099 6 WMAYUJ
608131237 608131234 8 60813123
1613509L121 1613519L121 5 16135
1613509L121 1613518L121 5 16135
608131234 608131237 8 60813123
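The long_substr helper used above comes from a linked recipe that is not reproduced here. The following is a brute-force stand-in with the same name and call shape, written as an assumption about what that helper does (return the longest contiguous substring shared by all inputs); it is fine for short serial numbers but not tuned for long strings.

```python
def long_substr(strings):
    """Return the longest contiguous substring common to all strings."""
    if not strings:
        return ""
    shortest = min(strings, key=len)
    # Try candidate lengths from longest to shortest, left to right,
    # so the first hit is a longest common substring.
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""


print(long_substr(["WMAYUJ844900", "WMAYUJ753738"]))    # WMAYUJ
print(long_substr(["0324311071068", "0324311071134"]))  # 0324311071
```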
--- first buggy reply starts here
Below is the first part of my reply. It can only help you understand where I was wrong, and may give you some ideas:
A sample using a Longest Common Subsequence (LCS) problem solver for your particular need, which I thought could be a first step in guessing what is common.
It is in Python, but for a demo it is easy to read (or can be cut and pasted into IDLE, the Python editor), assuming that you use the ActiveState Code Recipes from the first link above.
This has to do with bioinformatics (think of gene alignment).
You will need something to decide which common sequence is the most interesting (maybe one with a minimal length?) and then proceed with masking, as already proposed by David or in my comment.
(At first I did not see that the LCS solver was not a consecutive LCS solver, while you need it to be! So my first use of the LCS solver is buggy :( Because it is not contiguous, I get WMAYUJ8 or WMAYUJ7 and not WMAYUJ, which is shorter: the solver finds the longest run of common characters without requiring them to be consecutive. Again, sorry for that.)
>>> raw = """1613518L121
... 1613509L121
... 1613519L121
... 0324311071068
... 0324311071134
... 608131234
... 608131237
... WMAYUJ753738
... WMAYUJ072099
... WMAYUJ683739
... WMAYUJ844900"""
>>> d = dict()
>>> for line in raw.split("\n"):
...     if not line.strip(): continue
...     value = line.strip()
...     d[value] = 1
>>> for x in d:
...     for y in d:
...         if x == y: continue
...         length = LCSLength(x, y)
...         common = LCS(x, y)
...         if length >= 3 and x.startswith(common):
...             print("\t".join((x, y, str(length), common)))
which produces:
0324311071068 0324311071134 10 0324311071
0324311071068 608131234 4 0324
0324311071134 0324311071068 10 0324311071
WMAYUJ844900 WMAYUJ753738 7 WMAYUJ8
WMAYUJ753738 WMAYUJ072099 7 WMAYUJ7
608131237 608131234 8 60813123
608131234 608131237 8 60813123

Run spam-detection algorithms (statistical ones like Bayes, or similar "learning" ones). This may or may not help you, but if it doesn't, I honestly doubt you will ever come up with a useful logical algorithm here.

Related

regular expression for fraction number "4 3/4"? [duplicate]

I need to identify a fraction from a form field in a recipe database using a regex.
Ingredients will be entered in a two-part form. Field one is the amount, field two is the ingredient. I then need to break field one into its fractional components to store in the database.
Possible entries include:
1, 1/2, 1 1/2, and any of the previous with words attached, such as 1 cup or 1/2 tbsp.
The hardest case I foresee would be [2 28 oz. cans], where 2 is the number and 28 oz. cans is the word part.
I have found:
(\b[0-9]{1,3}(?:,?[0-9]{3})*(?:.[0-9]{2})?\b)
which sort of works. I am completely new to regex, so I am working by guess and check only, and I am having a hard time making it work for me.
Problem #1: I need to identify the word part as well. The word part can be multiple words, such as 2 large cans, where large cans would be the word part. The above regex identifies the numbers very well, but I can't figure out a way to grab the rest of the form field. For example, 1 1/2 tbsp gives me 1,1,2 but that is all, and I need tbsp as well. I tried to use this regex and use len to cut the original down, subtracting the fraction off the front, but had problems since 1 / 2 and 1/2 are both allowed, so I can't figure out how many characters to subtract (1 / 2 should subtract 6 from the front of the string, 1/2 should subtract 4, and just looking at the regex result of 1,2 I can't tell how many to subtract).
Problem #2: This isn't so important, but any ideas on how to identify the [2 28 oz cans] problem? The above regex pulls 2,28 out, which is not correct. It should only pull 2 out, and then the rest (28 oz cans) would be the other part that the solution to problem 1 will hopefully find.

Here's a regex that will match mixed numbers, whole numbers, and fractions, plus the rest of the entry (the ingredient, hopefully along with any extraneous numbers):
^((\d+( \d+/\d+)?)|(\d+/\d+))( (.+))?$
So for example, if you had 2 28 ounce cans it would match:
group 1: 2
group 2: 2
group 3:
group 4:
group 5: 28 ounce cans (with the leading space)
group 6: 28 ounce cans
The groups you care about are 1 & 6. Group 1 will always contain the amount (as a number, fraction, or number with a fraction) and group 6 will always have the remaining text (the ingredient).
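To see the group numbering in action, here is a small Python sketch that applies the same regex unchanged. The split_amount helper is a made-up name, and stripping surrounding whitespace before matching is an added assumption.

```python
import re

# Same pattern as above; group 1 is the amount, group 6 the remaining text.
AMOUNT_RE = re.compile(r"^((\d+( \d+/\d+)?)|(\d+/\d+))( (.+))?$")


def split_amount(entry):
    """Return (amount, rest) or None if the entry does not start with an amount."""
    match = AMOUNT_RE.match(entry.strip())
    if match is None:
        return None
    return match.group(1), match.group(6)


print(split_amount("1 1/2 tbsp"))     # ('1 1/2', 'tbsp')
print(split_amount("2 28 oz. cans"))  # ('2', '28 oz. cans')
print(split_amount("1/2"))            # ('1/2', None)
```

Note that the spaced form 1 / 2 is not treated as a fraction by this pattern: the amount comes back as 1 and the rest as "/ 2", so that variant would need to be normalized first.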

Anti-forgery unique serial number generation

I am trying to generate a random serial number to put on holographic stickers so that customers can check whether the purchased product is authentic.
Preface:
Once you enter a code and query it, it is nulled, so the next time the same code is entered you receive a message that the product might be fake because the code has already been used.
Considering that I have to build this system for a factory that produces no more than 2 to 3 million pieces a year, it is a bit hard for me to understand how to set everything up, at least the first time…
I thought about a 20-digit code in 4 groups (no letters, because the code must be very easy for the user to read and enter):
12345-67890-98765-43210
This is what I think is the easiest way to do everything:
function mycheckdigit($txt)
{
    ...
    return $myserial;
}

$mycustomcode = "123";
$qty = 20000;
$myfile = fopen("./thefile.txt", "w") or die("Houston we got a problem here");
// using a txt file for a test, should be a DB instead...
for ($i = 0; $i < $qty; $i++) {
    // str_pad() needs the pad string ("0") before the pad type
    $txt = date("y") . $mycustomcode . str_pad(gettimeofday()['usec'], 6, "0", STR_PAD_LEFT) . random_int(1000000, 9999999);
    // here the code to make check digits
    $myserial = mycheckdigit($txt);
    fwrite($myfile, $myserial . "\n");
}
fclose($myfile);
The 1st group identifies the year (e.g. 18) plus a 3-digit custom code
The 2nd group includes the microtime (gettimeofday()['usec'])
The 3rd is completely random
The last group includes 3 random digits, a check digit for group 1 and a check digit for group 2
in short:
Y= year
E= part of the EAN or custom code
M= Microtime generated number (gettimeofday()['usec'])
D= random_int() digits
C= Check Digit
YYEEE-MMMMM-MDDDD-DDDCC
In this way, I have a prefix that changes every year, I can recognize which brand the product belongs to (so I can use a single DB source) and I still have enough random digits to be, maybe, reasonably unique, considering that I only pick numbers between 1,000,000 and 9,999,999 and split them according to the layout above.
Some questions for you:
Do you think I have enough combinations to avoid generating the same code twice in one year, considering 2 million codes? I would rather not look up each code in the DB unless it is really necessary, because that could slow down batch generation (executed in batches during the production process).
Would it be better to add another unique identifier, like the day of the year (001-365), and make the random_int() part 3 digits shorter? Please consider that I will generate codes monthly, not daily (but I think that makes little difference to uniqueness).
Considering that the backend is in PHP, I am thinking of using the mt_rand() function. Would that be a good approach?
UPDATE: After @apokryfos's suggestion, I read more about UUID generation and similar techniques, and found a good compromise in using random_int() instead.
I just need digits, so hex hashes are not useful for my needs and would make things more complicated.
I would avoid complex cryptographic things like RSA keys and so on…
I don't need that level of security and complexity. I just need a way to generate a serial number that is as unique as possible and that cannot easily be guessed and nulled by someone who has not scratched the sticker (so numbers should not be created in sequence, but randomly).

You can play with 11 random digits per year so that's 11 digit numbers 1 to 99999999999 (99.9 billion is a lot more than 2 million) so w.r.t. enough combinations I think you're covered.
However using mt_rand you're likely to get collisions. Here's a way to plan your way to 2 million random numbers before using the database:
<?php
$arr = [];
while (count($arr) < 1000000) {
    $num = mt_rand(1, 99999999999);
    $numStr = str_pad($num, 11, "0", STR_PAD_LEFT); // Force 11 digits
    if (!isset($arr[$numStr])) {
        $arr[$numStr] = true;
    }
}
$keys = array_keys($arr);
The number of collisions is generally low (the first collision occurs at about 300,000 to 500,000 numbers generated), so it's pretty rare.
Each value in the array $keys is an 11-digit number which is random and unique.
This approach is relatively fast, but be aware that it will need quite a bit of memory (more than 128MB).
That being said, a more commonly used method is to generate a universally unique identifier (UUID), which is a lot more likely to be unique and therefore does not really need checking for uniqueness.
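The collision figures quoted for the 11-digit approach can be sanity-checked with the usual birthday-problem approximations. The sketch below is in Python rather than PHP, uses the secrets module as a stand-in for random_int(), and all function names are invented for illustration; the formulas are approximations that assume uniformly random draws.

```python
import math
import secrets


def expected_collisions(count, space):
    """Approximate number of colliding pairs among `count` uniform draws."""
    return count * (count - 1) / (2 * space)


def draws_until_first_collision(space):
    """Approximate expected number of draws before the first repeat."""
    return math.sqrt(math.pi * space / 2)


def unique_codes(count, digits=11):
    """Draw random zero-padded codes from 1..10**digits - 1 until `count` are distinct."""
    seen = set()
    while len(seen) < count:
        number = secrets.randbelow(10 ** digits - 1) + 1
        seen.add(f"{number:0{digits}d}")
    return seen


space = 10 ** 11
print(round(draws_until_first_collision(space)))     # roughly 396,000 draws
print(round(expected_collisions(2_000_000, space)))  # roughly 20 repeats in 2 million
print(len(unique_codes(1000)))                       # 1000
```

With only about 20 repeats expected in 2 million draws, keeping a set of already issued codes and redrawing on a hit costs very little extra work.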

Generate a pseudo random 6 character string from an integer

I am trying to resolve the following problem via PHP. The aim is to generate a unique 6-character string based on an integer seed and containing a predefined range of characters. The second requirement is that the string must appear random (so if code 1 were 100000, it is not acceptable for code 2 to be 100001, and 3 100002)
The range of characters is:
Uppercase A-Z excluding: B, I, O, S and Z
0-9 excluding: 0, 1, 2, 5, 8
So that would be a total of 26 characters if I am not mistaken. My first idea was to encode from base 10 to base 24, starting at number 7962624: compute 7962624 + seed, and then base24-encode that number.
This gives me the characters 0-N. If I replace the resulting string in the following fashion, I then meet the first criteria:
B=P, I=Q, 0=R, 1=T, 2=U, 5=V, 8=W
So at this point, my codes will look something like this:
1=TRRRR, 2=TRRRT, 3=TRRRU
So my question to you gurus is: How can I make a method that behaves consistently (so the return string for a given integer is always the same) and meets the 2 requirements above? I have spent 2 full days on this now and short of dumping 700,000,000 codes into a database and retrieving them randomly I'm all out of ideas.
Stephen

You get a reasonably random looking sequence if you take your input sequence 1,2,3... and apply a linear map modulo a prime number. The number of unique codes is limited to the prime number so you should choose a large one. The resulting codes will be unique as long as you choose a multiplier that's not divisible by the prime.
Here's an example: With 6 characters you can make 26^6 = 308915776 unique strings, so a suitable prime number could be 308915753. This function will therefore generate over 300,000,000 unique codes:
function encode($num) {
    $scrambled = (240049382 * $num + 37043083) % 308915753;
    return base_convert($scrambled, 10, 26);
}
Make sure that you run this on 64bit PHP though, otherwise the multiplication will overflow. On 32bit you'll have to use bcmath. The codes generated for the numbers 1 through 9 are:
n89a2d
hdh4jo
biopb9
5o6k2k
3eek5
k8m9aj
ee4424
8jbojf
2ojjb0
All that's left is filling in the leading 0s that are sometimes missing and replacing the letters and digits so that none of the forbidden characters are produced.
As you can see, there's no obvious pattern, but someone with some time on their hands, enough motivation and access to a few of these codes will be able to find out what's going on. A safer alternative is using an encryption algorithm with a small block size, such as Skip32.
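Those two remaining steps (padding and character replacement) can be folded into the base conversion by indexing straight into the allowed alphabet. This Python sketch reuses the constants from the PHP function above; the ALPHABET ordering is an arbitrary choice made here, not something the question specifies.

```python
# The 26 allowed characters: digits without 0, 1, 2, 5, 8 and
# uppercase letters without B, I, O, S, Z (the order is an assumption).
ALPHABET = "34679ACDEFGHJKLMNPQRTUVWXY"

PRIME = 308915753
MULTIPLIER = 240049382
OFFSET = 37043083


def encode(num, length=6):
    """Scramble num with a linear map mod PRIME and write it in base 26."""
    scrambled = (MULTIPLIER * num + OFFSET) % PRIME
    chars = []
    # Always emit `length` digits, so short values are left-padded
    # with the alphabet's zero digit ("3") automatically.
    for _ in range(length):
        scrambled, digit = divmod(scrambled, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


for n in range(1, 4):
    print(n, encode(n))
```

Python integers do not overflow, so the 64-bit caveat only applies to the PHP version. The same warning about predictability holds: this is obfuscation, not encryption.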

Place dots where a word is misspelled

I'm creating a web app in PHP where people can try to translate words they need to learn for school.
For example, someone needs to translate the Dutch word 'weer' to 'weather' in English, but unfortunately he types 'whether'. Because he almost typed the right word, I want to give him another try, with dots '.' on the places where he made a mistake:
Language A: weer
Language B: weather
Input: whether
Output: w..ther
Or, for example
Language A: fout
Language B: mistake
Input: mitake
Output: mi.take
Or:
Language A: echt
Language B: genuine
Input: genuinely
Output: genuinely (almost good, shorten the word a little bit)
But, if the input differs too much from the desired translation, I don't want to get output like ........
I heard about Levenshtein distance and I think I need an algorithm that is much like that one, but I don't know how to place dots on the right place instead of echoing how many operations are to be done.
So, how can I return the misspelled word with dots on the places where someone made a mistake?

First, take a look at the Levenshtein algorithm on wikipedia.
Then go on and look at the examples and the resulting matrix d on the article page:
          *k*  *i*  *t*  *t*  *e*  *n*
     >0    1    2    3    4    5    6
*s*   1   >1    2    3    4    5    6
*i*   2    2   >1    2    3    4    5
*t*   3    3    2   >1    2    3    4
*t*   4    4    3    2   >1    2    3
*i*   5    5    4    3    2   >2    3
*n*   6    6    5    4    3    3   >2
*g*   7    7    6    5    4    4   >3
The distance is found in the lower right corner of the matrix, d[m, n]. From there it is possible to backtrack along the minimum steps to the upper left corner of the matrix, d[0, 0]. At each step you go left, up-left, or up, whichever minimizes the path.
In the example above, you'd find the path marked by the '>' signs:
  s i t t i n g        k i t t e n
0 1 1 1 1 2 2 3      0 1 1 1 1 2 2 3
  ^       ^   ^        ^       ^   ^
  changes in the distance, replace by dots
Now you can find, along the minimum path, the locations d[i,j] at which the distance changes (marked by ^ in the example above), and for those letters you put a dot in the first (or second) word at position i (or j respectively).
Result:
  s i t t i n g        k i t t e n
  ^       ^   ^        ^       ^   ^
  . i t t . n .        . i t t . n .
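The backtracking described above can be sketched in Python as follows. This is one possible reading, not the only one: the dots are placed in the target word (so a missing letter shows up as a dot and extra typed letters are simply dropped), ties prefer the diagonal step, and the names levenshtein_matrix and dotted_hint are invented here.

```python
def levenshtein_matrix(a, b):
    """Full (len(a)+1) x (len(b)+1) edit-distance matrix."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # letter missing from b
                          d[i][j - 1] + 1,          # extra letter in b
                          d[i - 1][j - 1] + cost)   # match or substitution
    return d


def dotted_hint(target, typed):
    """Return (target with wrong or missing letters as dots, distance)."""
    d = levenshtein_matrix(target, typed)
    i, j = len(target), len(typed)
    out = []
    while i > 0:
        same = j > 0 and target[i - 1] == typed[j - 1]
        if j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if same else 1):
            out.append(target[i - 1] if same else ".")  # diagonal step
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + 1:
            out.append(".")                             # target letter was not typed
            i -= 1
        else:
            j -= 1                                      # extra typed letter, skip it
    return "".join(reversed(out)), d[len(target)][len(typed)]


print(dotted_hint("weather", "whether"))    # ('w..ther', 2)
print(dotted_hint("mistake", "mitake"))     # ('mi.take', 1)
print(dotted_hint("genuine", "genuinely"))  # ('genuine', 2)
```

The returned distance can be compared against a threshold to decide whether to show the hint at all or reject the attempt outright.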

The terminology you are looking for is called "edit distance." Using something like Levenshtein distance will tell you the number of operations needed to transform one string into the other (insertions, deletion, substitutions, etc).
Here is a list of other "editing distance" algorithms.
Once you decide that a word is "close enough" (i.e. it doesn't exceed some threshold of edits needed), you can show where the edits need to occur (by showing dots).
So how do you know where to put the dots?
The interesting thing about the "Levenshtein distance" is it uses a M x N matrix with one word on each axis (see the sample matrix in Levenshtein article). Once you create the matrix, you can determine which letters require "additional edits" to be correct. That's where you put the dots. If the letter requires "no additional edits," you simply print the letter. Pretty cool.
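A "close enough" check can be as small as comparing the edit distance with a fraction of the target's length. The sketch below is in Python; the 0.34 ratio is an arbitrary value picked for illustration, and edit_distance is a plain two-row implementation rather than any particular library call (in PHP the built-in levenshtein() function could fill that role).

```python
def edit_distance(a, b):
    """Levenshtein distance using two rows of the matrix."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                      # deletion
                               current[j - 1] + 1,                   # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def is_close_enough(target, typed, max_ratio=0.34):
    """True if the number of edits is small relative to the target length."""
    return edit_distance(target, typed) <= max_ratio * len(target)


print(is_close_enough("weather", "whether"))  # True  (2 edits, 7 letters)
print(is_close_enough("weather", "xxxxxxx"))  # False (7 edits, 7 letters)
```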

I think you need a multi-step process, not just Levenshtein. First I would check whether the input word is a form of the target word. That would catch your 3rd example, and there is no need to worry about adding dots. You could also use this step to catch synonyms. The next step is to check the length difference of the two strings.
If the difference is 0, you can do a letter-by-letter comparison to place the dots. If you don't want to show all dots, keep a count of the dots placed and display an error message once it goes over the limit. (Sorry, that was incorrect.)
If the difference shows the input is longer, you need to check whether deleting a letter would fix the problem. Here you can use Levenshtein to see how close the words are: if they are too far apart, show your error message; if they are in range, you will need to do the steps of Levenshtein in reverse and mark the changes somehow. I'm not sure how you want to show that a letter needs to be deleted.
If the difference shows the input is shorter, you can use the Levenshtein distance to see whether the two words are close enough, or show the error. Then do the steps in reverse again, adding dots for insertions and dots for substitutions.
Actually, the last two steps can be combined into one function that runs through the algorithm, remembers each insertion, deletion or substitution, and changes the output accordingly.
