Using regex to extract variables from a plain-text form letter? - php

I'm looking for a good example of using Regular Expressions in PHP to "reverse engineer" a form letter (with a known format, of course) that has been pasted into a multiline textbox and sent to a script for processing.
So, for example, let's assume this is the original plain-text input (taken from a USDA press release):
WASHINGTON, April 5, 2010 - North
American Bison Co-Op, a New Rockford,
N.D., establishment is recalling
approximately 25,000 pounds of whole
beef heads containing tongues that may
not have had the tonsils completely
removed, which is not compliant with
regulations that require the removal
of tonsils from cattle of all ages,
the U.S. Department of Agriculture's
Food Safety and Inspection Service
(FSIS) announced today.
For clarity, the fields that are variables are highlighted below:
[pr_city=]WASHINGTON, [pr_date=]April 5, 2010 - [corp_name=]North
American Bison Co-Op, a [corp_city=]New Rockford,
[corp_state=]N.D., establishment is recalling
approximately [amount=]25,000 pounds of [product=]whole
beef heads containing tongues that may
not have had the tonsils completely
removed, which is not compliant with
regulations that require [reason=]the removal
of tonsils from cattle of all ages,
the U.S. Department of Agriculture's
Food Safety and Inspection Service
(FSIS) announced today.
How could I efficiently extract the contents of the
pr_city
pr_date
corp_name
corp_city
corp_state
amount
product
reason
fields from my example?
Any help would be appreciated, thanks.

Well, a regex that works on your example could look like this (line breaks introduced to keep this beast legible, need to be removed prior to use):
/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a
(?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is
recalling approximately (?P<amount>.*?) of (?P<product>.*?),
which is not compliant with regulations that require (?P<reason>.*?),
the U\.S\. Department of Agriculture\'s Food Safety and Inspection
Service \(FSIS\) announced today\.$/
So, in PHP you could do
if (preg_match('/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a (?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is recalling approximately (?P<amount>.*?) of (?P<product>.*?), which is not compliant with regulations that require (?P<reason>.*?), the U\.S\. Department of Agriculture\'s Food Safety and Inspection Service \(FSIS\) announced today\.$/', $subject, $regs)) {
$prcity = $regs['pr_city'];
$prdate = $regs['pr_date'];
// ... etc.
} else {
$result = "";
}
This assumes a couple of things, for instance that there are no line breaks, and that the input is the entire string (and not a larger string from which this part has to be extracted). I've tried to make assumptions about legal values that make some sense, but there is a very real chance that other inputs could break this. So some more test cases are probably needed.
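Since text pasted into a multiline textbox will usually contain line breaks, one way to meet the "no line breaks" assumption is to collapse all whitespace before matching. The following is a minimal sketch under that assumption; the function name extract_fields is hypothetical, and the pattern is the one from the answer above, only split across lines for readability.

```php
<?php
// Hypothetical helper: normalizes whitespace, then applies the pattern from the answer above.
function extract_fields(string $input): ?array
{
    // Collapse every run of whitespace (including line breaks) into a single space.
    $subject = trim(preg_replace('/\s+/', ' ', $input));

    $pattern = '/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a '
        . '(?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is recalling '
        . 'approximately (?P<amount>.*?) of (?P<product>.*?), which is not compliant '
        . 'with regulations that require (?P<reason>.*?), the U\.S\. Department of '
        . 'Agriculture\'s Food Safety and Inspection Service \(FSIS\) announced today\.$/';

    if (!preg_match($pattern, $subject, $regs)) {
        return null; // the text does not follow the expected template
    }

    // Keep only the named groups and drop the numeric duplicates.
    return array_filter($regs, 'is_string', ARRAY_FILTER_USE_KEY);
}
```

The function returns null when the letter does not follow the template, so the caller can tell "no match" apart from empty fields.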

If the surrounding text is constant, then something like this partial regex could do the trick:
preg_match('/^(.*?), (.*?) - (.*?), a (.*?), (.*?), establishment is recalling approximately (.*?), which is not compliant with regulations that require (.*?), the U\.S\. Department of Agriculture\'s Food Safety and Inspection Service \(FSIS\) announced today\./', $text, $matches);
// $matches[1] is 'WASHINGTON'
// $matches[2] is 'April 5, 2010'
// $matches[3] is ... etc...
If the surrounding text changes, you're going to end up with a ton of false matches, no matches, etc. Essentially you'd need an AI to parse and understand press releases.

Edit: Please disregard this crazy answer, as the other two are better. I should probably delete it, but I'm keeping it up for reference.
I have a crazy idea that just might work: build an XML string from the input by adding markups, then parse it. It might look something like this (completely untested) code:
preg_replace('/([^,]*), ([^-]*)- ...etc.../', '<pr_city>\1</pr_city><pr_date>\2</pr_date> ...etc...', $input);
Parsing the XML afterwards is a needlessly complicated process that is best left to the PHP documentation: http://www.php.net/manual/en/function.xml-parse.php .
You could also consider converting it to JSON with this method, then using json_decode() to parse it. In any case, you have to think about what happens when " marks and > symbols appear in the input.
It might be easier to just match and remove one piece of the text at a time.
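A minimal sketch of that "one piece at a time" idea, assuming the fixed delimiters of the example letter: each call takes the text up to the next known delimiter and removes it from the front of the remaining string. The helper name take_until is hypothetical.

```php
<?php
// Hypothetical helper: returns the text before the first occurrence of $delimiter
// and removes that piece (plus the delimiter) from the front of $text.
function take_until(string &$text, string $delimiter): ?string
{
    $pos = strpos($text, $delimiter);
    if ($pos === false) {
        return null; // delimiter missing: the text does not follow the template
    }
    $piece = substr($text, 0, $pos);
    $text = substr($text, $pos + strlen($delimiter));
    return $piece;
}

$text = 'WASHINGTON, April 5, 2010 - North American Bison Co-Op, a New Rockford, N.D., establishment is recalling approximately 25,000 pounds';
$pr_city   = take_until($text, ', ');   // "WASHINGTON"
$pr_date   = take_until($text, ' - ');  // "April 5, 2010"
$corp_name = take_until($text, ', a '); // "North American Bison Co-Op"
```

Because each step only looks for one literal delimiter, a failure points directly at the field where the input stopped following the template.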

php search text file for any wav file names

I have a series of files which contain raw text or JSON data. These files will contain wav file names. All of the wav files have the suffix .wav
Is there any way, using PHP, to search an individual text or JSON file and return an array of any .wav files found?
This example of random text contains 6 .wav files. How would I search it and extract the filenames?
Spoke as as other again ye. Hard on to roof he drew. So sell side newfile.wav ye in mr evil. Longer waited mr of nature seemed. Improving knowledge incommode objection me ye is prevailed playme.wav principle in. Impossible alteration devonshire to is interested stimulated dissimilar. To matter esteem polite do if.
Spot of come to ever test.wav hand as lady meet on. Delicate contempt received two yet advanced. Gentleman as belonging he commanded believing dejection in by. On no am winding chicken so behaved. Its preserved sex enjoyment new way behaviour. Him yet devonshire celebrated welcome.wav especially. Unfeeling one provision are smallness resembled repulsive.
Raising say express had chiefly detract demands she. Quiet led own cause three him. Front no party young abode state up. Saved he do fruit woody of to. Met defective are allowance two perceived listening consulted contained. It chicken oh colonel pressed excited suppose to shortly. He improve started no we manners another.wav however effects. Prospect humoured mistress to by proposal marianne attended. Simplicity the far admiration preference everything. Up help home head spot an he room in.
Talent she for lively eat led sister. Entrance strongly packages she out rendered get quitting denoting led. Dwelling confined improved it he no doubtful raptures. Several carried through an of up attempt gravity. Situation to be at offending elsewhere distrusts if. Particular use for considered projection cultivated. Worth of do doubt shall it their. Extensive existence up me last.wav contained he pronounce do. Excellence inquietude assistance precaution any impression man sufficient.
I've tried this, but I get no results.
$lines = file('test.txt');
foreach ($lines as $line_num => $line) {
    $line = trim($line);
    if (strpos($line, '*.wav') !== false) {
        echo ($line);
    }
}
The above text should return :
newfile.wav
playme.wav
test.wav
welcome.wav
another.wav
last.wav
Thanks
UPDATE:
Using the following:
$text = file_get_contents('test.txt');
preg_match_all('/\w+\.wav/', $text, $matches);
var_dump($matches);
results in an array of :
array(1) {
[0]=>
array(6) {
[0]=>
string(11) "newfile.wav"
[1]=>
string(10) "playme.wav"
[2]=>
string(8) "test.wav"
[3]=>
string(11) "welcome.wav"
[4]=>
string(11) "another.wav"
[5]=>
string(8) "last.wav"
}
}
So I get an array of the wav files within an array. How do I get just the array of wav files? Thanks
This doesn't work correctly for wav files with spaces in their names.
Any ideas?
This tool might help you to design an expression as you wish and test it, maybe something similar to:
([a-z]+\.wav)
You can also add more boundaries to it, if you might want to.
PHP Code
You could also use preg_match_all to do so, maybe something similar to:
$re = '/([a-z]+\.wav)/m';
$str = 'Spoke as as other again ye. Hard on to roof he drew. So sell side newfile.wav ye in mr evil. Longer waited mr of nature seemed. Improving knowledge incommode objection me ye is prevailed playme.wav principle in. Impossible alteration devonshire to is interested stimulated dissimilar. To matter esteem polite do if.
Spot of come to ever test.wav hand as lady meet on. Delicate contempt received two yet advanced. Gentleman as belonging he commanded believing dejection in by. On no am winding chicken so behaved. Its preserved sex enjoyment new way behaviour. Him yet devonshire celebrated welcome.wav especially. Unfeeling one provision are smallness resembled repulsive.
Raising say express had chiefly detract demands she. Quiet led own cause three him. Front no party young abode state up. Saved he do fruit woody of to. Met defective are allowance two perceived listening consulted contained. It chicken oh colonel pressed excited suppose to shortly. He improve started no we manners another.wav however effects. Prospect humoured mistress to by proposal marianne attended. Simplicity the far admiration preference everything. Up help home head spot an he room in.
Talent she for lively eat led sister. Entrance strongly packages she out rendered get quitting denoting led. Dwelling confined improved it he no doubtful raptures. Several carried through an of up attempt gravity. Situation to be at offending elsewhere distrusts if. Particular use for considered projection cultivated. Worth of do doubt shall it their. Extensive existence up me last.wav contained he pronounce do. Excellence inquietude assistance precaution any impression man sufficient. ';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
Test Script for RegEx
const regex = /([a-z]+\.wav)/gm;
const str = `Spoke as as other again ye. Hard on to roof he drew. So sell side newfile.wav ye in mr evil. Longer waited mr of nature seemed. Improving knowledge incommode objection me ye is prevailed playme.wav principle in. Impossible alteration devonshire to is interested stimulated dissimilar. To matter esteem polite do if.
Spot of come to ever test.wav hand as lady meet on. Delicate contempt received two yet advanced. Gentleman as belonging he commanded believing dejection in by. On no am winding chicken so behaved. Its preserved sex enjoyment new way behaviour. Him yet devonshire celebrated welcome.wav especially. Unfeeling one provision are smallness resembled repulsive.
Raising say express had chiefly detract demands she. Quiet led own cause three him. Front no party young abode state up. Saved he do fruit woody of to. Met defective are allowance two perceived listening consulted contained. It chicken oh colonel pressed excited suppose to shortly. He improve started no we manners another.wav however effects. Prospect humoured mistress to by proposal marianne attended. Simplicity the far admiration preference everything. Up help home head spot an he room in.
Talent she for lively eat led sister. Entrance strongly packages she out rendered get quitting denoting led. Dwelling confined improved it he no doubtful raptures. Several carried through an of up attempt gravity. Situation to be at offending elsewhere distrusts if. Particular use for considered projection cultivated. Worth of do doubt shall it their. Extensive existence up me last.wav contained he pronounce do. Excellence inquietude assistance precaution any impression man sufficient. `;
let m;
while ((m = regex.exec(str)) !== null) {
// This is necessary to avoid infinite loops with zero-width matches
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
// The result can be accessed through the `m`-variable.
m.forEach((match, groupIndex) => {
console.log(`Found match, group ${groupIndex}: ${match}`);
});
}
This is why regular expressions were invented.
$text = file_get_contents('test.txt');
preg_match_all('/(\w+\.wav)/', $text, $matches);
var_dump($matches[0]);
Some good resources:
preg_match
preg_replace
regex101.com allows you to test expressions realtime
output:
array(6) {
[0] => string(11) "newfile.wav"
[1] => string(10) "playme.wav"
[2] => string(8) "test.wav"
[3] => string(11) "welcome.wav"
[4] => string(11) "another.wav"
[5] => string(8) "last.wav"
}
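The nested array in the question comes from preg_match_all storing the full matches under index 0. As a small sketch (the function name find_wav_names is hypothetical, and the de-duplication step is an addition not asked for in the question), the flat list can be wrapped up like this:

```php
<?php
// Hypothetical helper: returns a flat, de-duplicated list of .wav names found in $text.
function find_wav_names(string $text): array
{
    // \b stops the match at the end of the extension, so "a.wavy" is not matched.
    preg_match_all('/\w+\.wav\b/', $text, $matches);

    // $matches[0] holds the full matches; array_values() reindexes after array_unique().
    return array_values(array_unique($matches[0]));
}

print_r(find_wav_names('side newfile.wav ye in mr evil. prevailed playme.wav principle newfile.wav'));
```

Note that \w+ cannot match names containing spaces. In free-running prose there is no reliable way to tell where such a name starts, so that case would need some delimiter (quotes, for example) in the source files.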
You are almost there. You can explode the $line on spaces, then go through each word and check whether it ends with a .wav extension. If it does, print the word.
<?php
foreach ($lines as $line_num => $line) {
    $line = trim($line);
    $words = explode(" ",$line);
    foreach($words as $each_word){
        $wav_index = strpos($each_word, '.wav');
        if ($wav_index !== false && $wav_index === strlen($each_word) - 4) { // strict check to make sure string ends with a .wav and not being elsewhere
            echo $each_word,PHP_EOL;
        }
    }
}

Display word endings depend on gender (Polish language issue), PHP

This problem might not occur in English, but it really hurts in Polish. I guess my question is mostly for Polish users, since they might already have a decent solution.
What I mean is that past-tense verbs in Polish differ for male and female subjects, and there are dozens of different variants. If my script needs to display lots and lots of text, it becomes a really painful problem to deal with. Short example (not a very elegant use of language, but good enough for demonstration purposes):
Male: On poszedł i nie znalazł, więc klasnął w dłonie i nagle go coś pożarło.
Female: Ona poszła i nie znalazła, więc klasnęła w dłonie i nagle ją coś pożarło.
I managed to find the following solution: at the beginning of the script, I prepare a variable that looks like this:
$verb[$ending][$sex] = 'something';
//$ending contains - for my convenience - letters that say what kind of ending I am changing, instead of numeric options
//Examples:
$verb['-a']['male'] = '';
$verb['-a']['female'] = 'a';
//works for On=>Ona, znalazł=>znalazła
$verb['al-ela']['male'] = 'ął';
$verb['al-ela']['female'] = 'ęła';
//works for klasnął=>klasnęła
Now add the fact that 99% of the time I don't know from the beginning which sex I am dealing with, and my variable starts to look kind of scary: $verb['al-ela'][$_SESSION['user'.$id]['sex']]. So my final text looks like this:
On'.$verb['-a'][$_SESSION['user'.$id]['sex']].' posz'.$verb['edl-la'][$_SESSION['user'.$id]['sex']].' i nie znalazł'.$verb['-a'][$_SESSION['user'.$id]['sex']].', więc klasn'.$verb['al-ela'][$_SESSION['user'.$id]['sex']].' w dłonie i nagle '.$verb['go-ja'][$_SESSION['user'.$id]['sex']].' coś pożarło.
Yes, sure, this is a rather extreme example, but sometimes the text really does look like that and it is unavoidable.
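One possible way to keep such text readable, shown here only as a sketch: write the sentence once as a template with placeholders and substitute all endings in one pass with strtr(). The placeholder syntax ({-a}, {edl-la}, ...) and the function name gendered are assumptions made for illustration; the endings themselves are taken from the two example sentences above.

```php
<?php
// Endings taken from the example sentences above; the placeholder syntax
// and the function name are assumptions for illustration.
function gendered(string $template, string $sex): string
{
    $endings = [
        '{-a}'     => ['male' => '',    'female' => 'a'],
        '{edl-la}' => ['male' => 'edł', 'female' => 'ła'],
        '{al-ela}' => ['male' => 'ął',  'female' => 'ęła'],
        '{go-ja}'  => ['male' => 'go',  'female' => 'ją'],
    ];

    // Build a placeholder => ending map for the requested sex.
    $map = [];
    foreach ($endings as $placeholder => $forms) {
        $map[$placeholder] = $forms[$sex];
    }

    // strtr() replaces every placeholder in a single pass.
    return strtr($template, $map);
}

$template = 'On{-a} posz{edl-la} i nie znalazł{-a}, więc klasn{al-ela} w dłonie i nagle {go-ja} coś pożarło.';
echo gendered($template, 'female'), PHP_EOL;
```

The session lookup then happens once, at the call site, for example gendered($template, $_SESSION['user'.$id]['sex']), instead of once per ending.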
To make a long story short, here are my questions:
Am I doing it wrong? Is there a better/faster/more handy solution for this type of problem?
Is there a script that might detect/change endings for me without ruining the rest of the text?
I struggled to find a full list of possible ending variations in Polish (for both singular and plural), so I'm creating my own list as I find new options. Perhaps someone has such a list; it might help me create the script from my 2nd question.
Thanks a lot in advance, best regards!

Splitting an array of dynamic length in PHP

So, I am trying to build an HTML form and use the resulting arrays in PHP. Mostly I get what I want, but one thing is tricky, and I have been trying and searching on Google for quite a while now.
The short version of my HTML form:
<html>
<body>
<form action="test.php" method="POST">
<table border="0">
<tr>
<td align="right"><span style="font-family:Verdana">Übersicht</span></td>
<td><textarea name="uebersicht" cols="20" rows="20"></textarea></td>
</tr>
</table>
</form>
</body>
</html>
<?php
$uebersicht = $_POST['uebersicht'];
$uebersicht = preg_split('/\n/', $uebersicht);
for ($start=0; $start < count($uebersicht); $start++) {}
?>
And here's an example of what the text I want to paste into the HTML form looks like:
Samperio (59.)
Brosinski (68.)
0:2
beendet
FSV Mainz 05Borussia Dortmund
Coface Arena, 34.000 Zuschauer
Schiedsrichter: Tobias Stieler aus Hamburg
Reus (18., Aubameyang)
Mkhitaryan (82., Aubameyang)
Mkhitaryan (10.)
It's highly variable in length, depending on goals and cards. Splitting the text into an array isn't hard; I just do $uebersicht = preg_split('/\n/', $uebersicht); as shown in the code.
Now I want to get every player that got a card in this game (every line that looks like Samperio (59.)), the teams (which have to be split, too), every goal (like the line Mkhitaryan (82., Aubameyang)) and every assist (the players in brackets, like Aubameyang) into an individual variable. It's important to know which team the players are on (the first lines are players from FSV Mainz 05, the last ones players from Borussia Dortmund).
The problem here is the variability. Every game looks different, as another example shows:
André (39., Brahimi)
Maicon (52., Neves)
Martins Indi (19.)
Marcano (25.)
Pereira (82.)
Imbula (89.)
2:1
beendet
FC PortoFC Chelsea
Estádio do Dragao, 55.000 Zuschauer
Schiedsrichter: Antonio Miguel Mateu Lahoz aus ESP
Willian (45., )
Cahill (41.)
Azpilicueta (66.)
Matic (79.)
It's not possible to work with fixed line numbers. I really don't know how to proceed here and get what I want. Is it possible to use a trigger, for example something like "new line" combined with "(" or ")"? Or to search for specific characters and get the line in which they are found?
Or is there a solution without preg_split? Without it, the first example looks like this:
Samperio (59.) Brosinski (68.) 0:2 beendet FSV Mainz 05Borussia Dortmund Coface Arena, 34.000 Zuschauer Schiedsrichter: Tobias Stieler aus Hamburg Reus (18., Aubameyang) Mkhitaryan (82., Aubameyang) Mkhitaryan (10.)
Just one long line, but I guess it's harder this way?!
Or maybe PHP isn't the right 'language' for this?
For the record, there is no language that is better for this; it can be done with PHP, JavaScript, Python, whatever you want. It's more about figuring out the process. Once you do that, you can code this in any language you want.
Your best bet is to go through the text line by line and try to determine, for each line, what the content is: a goal, a goal with assist, a penalty, or other metadata.
I would first break it all up into an array by newline, which you've already done. Then, for each array element, run it through another regex that looks for certain preset patterns. A goal, for instance, may be some text followed by an opening and closing bracket with an integer in between.
You will need to learn a little regex to do this, but it's really the only way unless you want to pay stats.com or use another API to get the data already properly formatted.
Also, it may be that the first line is always the team, and the number of goals is what makes the rest of the lines dynamic. That way you can find certain metadata a little more easily.
So, you can say line 1 is always the team name, line 2 is always the location, and everything else is penalties and goals.
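The line-by-line approach could be sketched as follows. It relies on the conventions described in the question rather than on fixed line numbers: "Name (minute., assist)" is a goal, "Name (minute.)" is a card, and the score line separates the first team's events from the second team's. The function name parse_match_report and the shape of the result array are assumptions for illustration.

```php
<?php
// Hypothetical parser: classifies each pasted line by pattern instead of by position.
function parse_match_report(string $text): array
{
    $result = ['score' => null, 'home' => [], 'away' => []];
    $side = 'home'; // lines before the score belong to the first team

    foreach (preg_split('/\R/', trim($text)) as $line) {
        $line = trim($line);
        if (preg_match('/^(\d+):(\d+)$/', $line, $m)) {
            $result['score'] = [(int) $m[1], (int) $m[2]];
            $side = 'away'; // everything after the score belongs to the second team
        } elseif (preg_match('/^(.+?) \((\d+)\.,\s*(.*?)\)$/', $line, $m)) {
            // "Name (minute., assist)": a goal; the assist may be empty
            $result[$side][] = [
                'type'   => 'goal',
                'player' => $m[1],
                'minute' => (int) $m[2],
                'assist' => $m[3] !== '' ? $m[3] : null,
            ];
        } elseif (preg_match('/^(.+?) \((\d+)\.\)$/', $line, $m)) {
            // "Name (minute.)": a card
            $result[$side][] = ['type' => 'card', 'player' => $m[1], 'minute' => (int) $m[2]];
        }
        // Any other line (status, team names, venue, referee) is skipped here.
    }

    return $result;
}
```

Splitting a line such as "FSV Mainz 05Borussia Dortmund" into two team names is deliberately left out: without a separator it would need a list of known team names to compare against.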

United Kingdom (GB) postal code validation without regex

I have tried several regexes and still some valid postal codes sometimes get rejected.
Searching the internet, Wikipedia and SO, I could only find regex validation solutions.
Is there a validation method which does not use regex? In any language, I guess it would be easy to port.
I suppose the easiest would be to compare against a postal code database, yet that would need to be maintained and updated periodically from a reliable source.
Edit: To help future visitors and keep you from posting any more regexes, here's a regex which I have tested (as of 2013-04-24) to work for all postal codes in Code Point (see Mikkel Løkke's answer):
//PHP PCRE (it was on Wikipedia, it isn't there anymore; I might have modified it, don't remember).
$strPostalCode=preg_replace("/[\s]/", "", $strPostalCode);
$bValid=preg_match("/^(GIR 0AA)|(((A[BL]|B[ABDHLNRSTX]?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[HNX]?|F[KY]|G[LUY]?|H[ADGPRSUX]|I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EKL]?|N[EGNPRW]?|O[LX]|P[AEHLOR]|R[GHM]|S[AEGKLMNOPRSTY]?|T[ADFNQRSW]|UB|W[ADFNRSV]|YO|ZE)[1-9]?[0-9]|((E|N|NW|SE|SW|W)1|EC[1-4]|WC[12])[A-HJKMNPR-Y]|(SW|W)([2-9]|[1-9][0-9])|EC[1-9][0-9])[0-9][ABD-HJLNP-UW-Z]{2})$/i", $strPostalCode);
I'm writing this answer based on the wiki page.
When checking the validation part, it seems that there are 6 types of format (A = letter and 9 = digit):
AA9A 9AA                       AA9A9AA                   AA9A9AA
A9A 9AA     Removing space     A9A9AA       order it     AA999AA
A9 9AA    ------------------>  A99AA     ------------->  AA99AA
A99 9AA                        A999AA                    A9A9AA
AA9 9AA                        AA99AA                    A999AA
AA99 9AA                       AA999AA                   A99AA
As we can see, the length may vary from 5 to 7, and we have to take some special cases into account if we want to.
So the function we are coding has to do the following:
1. Remove spaces and convert to uppercase (or lower case).
2. Check if the input is an exception; if it is, it should return valid.
3. Check if the input's length is 4 < length < 8.
4. Check if it's a valid postcode.
The last part is tricky, but we will split it in 3 sections by length for some overview:
Length = 7: AA9A9AA and AA999AA
Length = 6: AA99AA, A9A9AA and A999AA
Length = 5: A99AA
For this we will be using a switch(). From now on it's just a matter of checking character by character if it's a letter or a number on the right place.
So let's take a look at our PHP implementation:
function check_uk_postcode($string){
// Start config
$valid_return_value = 'valid';
$invalid_return_value = 'invalid';
$exceptions = array('BS981TL', 'BX11LT', 'BX21LB', 'BX32BB', 'BX55AT', 'CF101BH', 'CF991NA', 'DE993GG', 'DH981BT', 'DH991NS', 'E161XL', 'E202AQ', 'E202BB', 'E202ST', 'E203BS', 'E203EL', 'E203ET', 'E203HB', 'E203HY', 'E981SN', 'E981ST', 'E981TT', 'EC2N2DB', 'EC4Y0HQ', 'EH991SP', 'G581SB', 'GIR0AA', 'IV212LR', 'L304GB', 'LS981FD', 'N19GU', 'N811ER', 'NG801EH', 'NG801LH', 'NG801RH', 'NG801TH', 'SE18UJ', 'SN381NW', 'SW1A0AA', 'SW1A0PW', 'SW1A1AA', 'SW1A2AA', 'SW1P3EU', 'SW1W0DT', 'TW89GS', 'W1A1AA', 'W1D4FA', 'W1N4DJ');
// Add Overseas territories ?
array_push($exceptions, 'AI-2640', 'ASCN1ZZ', 'STHL1ZZ', 'TDCU1ZZ', 'BBND1ZZ', 'BIQQ1ZZ', 'FIQQ1ZZ', 'GX111AA', 'PCRN1ZZ', 'SIQQ1ZZ', 'TKCA1ZZ');
// End config
$string = strtoupper(preg_replace('/\s/', '', $string)); // Remove the spaces and convert to uppercase.
$exceptions = array_flip($exceptions);
if(isset($exceptions[$string])){return $valid_return_value;} // Check for valid exception
$length = strlen($string);
if($length < 5 || $length > 7){return $invalid_return_value;} // Check for invalid length
$letters = array_flip(range('A', 'Z')); // An array of letters as keys
$numbers = array_flip(range(0, 9)); // An array of numbers as keys
switch($length){
case 7:
if(!isset($letters[$string[0]], $letters[$string[1]], $numbers[$string[2]], $numbers[$string[4]], $letters[$string[5]], $letters[$string[6]])){break;}
if(isset($letters[$string[3]]) || isset($numbers[$string[3]])){
return $valid_return_value;
}
break;
case 6:
if(!isset($letters[$string[0]], $numbers[$string[3]], $letters[$string[4]], $letters[$string[5]])){break;}
if(isset($letters[$string[1]], $numbers[$string[2]]) || isset($numbers[$string[1]], $letters[$string[2]]) || isset($numbers[$string[1]], $numbers[$string[2]])){
return $valid_return_value;
}
break;
case 5:
if(isset($letters[$string[0]], $numbers[$string[1]], $numbers[$string[2]], $letters[$string[3]], $letters[$string[4]])){
return $valid_return_value;
}
break;
}
return $invalid_return_value;
}
Note that I've not added British Forces Post Office and non-geographic codes.
Usage:
echo check_uk_postcode('AE3A 6AR').'<br>'; // valid
echo check_uk_postcode('Z9 9BA').'<br>'; // valid
echo check_uk_postcode('AE3A6AR').'<br>'; // valid
echo check_uk_postcode('EE34 6FR').'<br>'; // valid
echo check_uk_postcode('A23A 7AR').'<br>'; // invalid
echo check_uk_postcode('WA3334E').'<br>'; // invalid
echo check_uk_postcode('A2 AAR').'<br>'; // invalid
As supplied by the UK government.
(GIR 0AA)|((([A-Z-[QVX]][0-9][0-9]?)|(([A-Z-[QVX]][A-Z-[IJZ]][0-9][0-9]?)|(([A-Z-[QVX]][0-9][A-HJKSTUW])|([A-Z-[QVX]][A-Z-[IJZ]][0-9][ABEHMNPRVWXY])))) [0-9][A-Z-[CIKMOV]]{2})
I've built London only postcode based apps using the postcodes I got from HERE. But to be honest, even with London postcodes only, you need a lot more storage than necessary. Sure, the idea is trivial.
Store the postcodes, take the user input or whatever, and see if you get a match. But you are complicating the solution far more than you think. I HAD to use actual postcodes to achieve what I wanted, but for simple validation purposes, as hard as "maintaining" a regex is, storing tens of thousands or hundreds of thousands(if not more) and validating more or less in real-time is a far more difficult task.
If a mini distributed service sounds like a more efficient solution than a regex, go for it, but I'm sure it isn't. Unless you need geo-spatial querying of your own data against UK postcodes or things like that, I doubt DB storage is a feasible solution. Just my 2 cents.
Update
According to this index, there are 1,758,417 postcodes in the UK. I can tell you I am using a few Mongo clusters (Amazon EC2 High Memory Instances) to provide reliable London only services(indexing only London postcodes), and it's quite a pricy thing, even with basic storage.
Admittedly, the app is performing medium complexity geo-spatial queries, but the storage requirements alone are very expensive and demanding.
Bottom line, just stick to regex and be done with it in two minutes.
I'm looking at the Postcodes in the United Kingdom page on Wikipedia right now.
http://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom
The Validation section lists six formats with a combination of letters and numbers. Then there's more information in the notes below that. The first thing that I would try is a BNF type grammar with a tool like GoldParserBuilder. You could describe the basic formats in a more readable format, with efficient parser and lexer automatically generated. In the past, I've successfully used such tools to avoid writing huge, ugly regexes.
From that point, the program has a properly formatted zip code of a known type. At this point, the specific numbers or letters might violate something. Each type of zip code can have a function programmed to look for violations of that specific type. The final product will consist of an automatically generated parser that passes unvalidated, but structured/identified, zip codes to a dedicated validation function. You can then refactor or optimize from there.
(You can also use the grammar itself to enforce or disallow certain literals and combinations. Whatever is more readable or comprehensible for you. Different people gravitate toward different ends of these things.)
Here's a page highlighting the advantages of the GOLD Parsing System. You can use any tool you like: I just promote this one b/c it's good at its job and has steadily improved over many years.
http://www.goldparser.org/about/why-use-gold.htm
I would think the RegEX, while long-winded would probably be the best solution if all you want to do is validate if something could be a valid UK post code.
If you need absolute data, consider using Ordnance Survey OpenData initiative "Code-Point® Open" dataset, which is a CSV of lots of data points in Great Britain (so not Northern Ireland I'm guessing) one of which is postcode. Be aware that the file is 20MB, so you may have to convert it to a more manageable format.
Regexes are hard to debug, hard to port from one regex flavor to another (silent "errors"), and hard to update.
That is true for most regexes, but why don't you just split it up into multiple parts? You can easily split it into six parts for the six different general rules and maybe even more if you take all of the special cases into account.
Creating a well-commented method of 20 lines with simple regexes is easy to debug (one simple regex per line) and also easy to update. The porting problem is the same, but on the other hand you do not need to use some fancy grammar lib.
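As a rough sketch of that idea, here is one simple pattern per general format, using the six formats listed on the Wikipedia page (A = letter, 9 = digit). The function name is hypothetical, and the check covers only the overall shape; it does not encode the special cases or which letters are allowed in which position.

```php
<?php
// One simple pattern per format (A = letter, 9 = digit).
// This checks only the shape, not whether the area or district actually exists.
function matches_uk_postcode_format(string $postcode): bool
{
    $postcode = strtoupper(trim($postcode));
    $formats = [
        '/^[A-Z]{2}[0-9][A-Z] ?[0-9][A-Z]{2}$/', // AA9A 9AA
        '/^[A-Z][0-9][A-Z] ?[0-9][A-Z]{2}$/',    // A9A 9AA
        '/^[A-Z][0-9] ?[0-9][A-Z]{2}$/',         // A9 9AA
        '/^[A-Z][0-9]{2} ?[0-9][A-Z]{2}$/',      // A99 9AA
        '/^[A-Z]{2}[0-9] ?[0-9][A-Z]{2}$/',      // AA9 9AA
        '/^[A-Z]{2}[0-9]{2} ?[0-9][A-Z]{2}$/',   // AA99 9AA
    ];

    foreach ($formats as $format) {
        if (preg_match($format, $postcode) === 1) {
            return true;
        }
    }

    return false;
}
```

Each line can be tested and changed on its own, and stricter per-position letter rules could be added to the individual patterns later.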
Are third party services an option?
http://www.postcodeanywhere.co.uk/address-validation/
GeoNames Database:
http://www.geonames.org/postal-codes/
+1 for the "why care" comments. I have had to use the 'official' regex in various projects and while I have never attempted to break it down, it works and it does the job. I've used it with Java and PHP code without any need to convert it between regex formats.
Is there a reason why you would have to debug it or break it down?
Incidentally, the regex rule used to be found on wikipedia, but it appears to have gone.
Edit: As for the space/no-space debate, the postcode should be valid with or without the space. As the last part of the postcode (after the space) is ALWAYS three characters (a digit followed by two letters), it is possible to insert the space manually, which will then allow you to run it through the regex rule.
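Inserting the space can be sketched like this, assuming the input is otherwise a plausible postcode; the function name normalize_uk_postcode is hypothetical. It removes all whitespace, uppercases the result, and puts a single space before the final three characters.

```php
<?php
// Hypothetical helper: strips whitespace, uppercases, and puts a single space
// before the final three characters (the part after the space).
function normalize_uk_postcode(string $postcode): string
{
    $compact = strtoupper(preg_replace('/\s+/', '', $postcode));
    if (strlen($compact) < 5) {
        return $compact; // too short to split; leave it for the validator to reject
    }

    return substr($compact, 0, -3) . ' ' . substr($compact, -3);
}
```

The normalized string can then be passed to a regex that expects the space.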
Take the list of valid postcodes and check if the one entered is in it.

User Friendly, Easy to Remember Coupon Codes

I want to create coupon codes that users can remember easily. My idea is something like:
squirrel45
nantucket23
That is, a real word chosen randomly from a long dictionary list (preferably compiled for this purpose) combined with 2 random digits. My questions are:
Where can I find such a dictionary list?
Do you see any problems with the system? (security is not ultra important here, just something reasonable is fine)
Can you suggest any good improvements or alternatives?
Fwiw I am not crazy about the Markov word generators because I think their idiosyncrasies would be too hard to remember. I'd like a client to be able to keep the code in his head, and tell it to the merchant when he arrives to redeem it.
Thanks,
Jonah
Word lists are easy to find. Make sure you sanity filter them for foul words ;)
Here's a huge word list that can be easily scrubbed:
http://www.scrabble-assoc.com/boards/dictionary/10-15-20030401.txt
From there you can easily load in words into your database and create your coupon code like so:
$coupon_code = $rand_word . rand(20,99);
After you do this, simply store your coupon code in the database and whenever you make a new code, check it against existing codes before you apply it. Even slim odds are possible odds.
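A minimal sketch of that generate-and-check loop, with an in-memory array standing in for the database lookup. The function name and parameters are hypothetical, and random_int() is used here in place of rand() as a matter of preference.

```php
<?php
// Hypothetical helper: $words is the filtered word list, $existing the codes
// already stored (in practice this check would be a database lookup).
function generate_coupon_code(array $words, array $existing): string
{
    do {
        $word = $words[array_rand($words)];
        $code = $word . random_int(20, 99);
    } while (in_array($code, $existing, true)); // retry on a collision

    return $code;
}

$code = generate_coupon_code(['PIKES', 'PILOT', 'POACH'], ['POACH72']);
```

The loop assumes unused codes still exist. With a short word list and only 80 numeric suffixes per word, the pool can run out, so a real implementation would want a retry limit.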
More word lists in various formats:
http://scrabble.wonderhowto.com/blog/ultimate-scrabble-word-list-resource-0115617/
5-letter words:
http://homepage.ntlworld.com/adam.bozon/Scrabble5.htm
6-letter words:
http://homepage.ntlworld.com/adam.bozon/Scrabble6.htm
7-letter words:
http://homepage.ntlworld.com/adam.bozon/Scrabble7.htm
8-letter words:
http://homepage.ntlworld.com/adam.bozon/Scrabble8.htm
Sample:
PIKES PIKIS PILAF PILAR PILAU PILAW PILEA PILED PILEI PILES PILIS
PILLS PILOT PILUS PIMAS PIMPS PINAS PINCH PINED PINES PINEY PINGO
PINGS PINKO PINKS PINKY PINNA PINNY PINON PINOT PINTA PINTO PINTS
PINUP PIONS PIOUS PIPAL PIPED PIPER PIPES PIPET PIPIT PIQUE PIRNS
PIROG PISCO PISOS PISTE PITAS PITCH PITHS PITHY PITON PIVOT PIXEL
PIXES PIXIE PIZZA PLACE PLACK PLAGE PLAID PLAIN PLAIT PLANE PLANK
PLANS PLANT PLASH PLASM PLATE PLATS PLATY PLAYA PLAYS PLAZA PLEAD
PLEAS PLEAT PLEBE PLEBS PLENA PLEWS PLICA PLIED PLIER PLIES PLINK
PLODS PLONK PLOPS PLOTS PLOTZ PLOWS PLOYS PLUCK PLUGS PLUMB PLUME
PLUMP PLUMS PLUMY PLUNK PLUSH PLYER POACH POCKS POCKY PODGY PODIA
POEMS POESY POETS POGEY POILU POIND POINT POISE POKED POKER POKES
With that you could generate a coupon code POACH72
Concatenating 2 words will increase the security posture of your system.
e.g. squirrel.nantucket.123
The Diceware page has a couple of long word lists, American and International. It also has a useful description of how to meet various levels of security.
