How do I parse these strings? - php

I have some strings from our accounting system I need to process. The accounting system only gives the option to put in the postal code and city in one input field. The data is later exported through xml and imported in a php system.
I'm looking for a way to extract the postal code from the city, however these come in various formats so a simple substr(); is not working
Some examples of the values I need to process are:
1234 ZC ALPHEN AAN DEN RIJN
1234SG UTRECHT
33602 BIELEFELD
W7 3QB LONDON
How do I split the postal code from city for each of these? I already contacted the manufacturer of the accounting system, and they understood my problem and will look into splitting the values in 2 for future calls, but that will take some time.

It's not in keeping with Google's Terms and Conditions unless you're storing this data to be displayed on a Google map, but it is awfully tempting to harness their power because they're just so good at this stuff.
The Geocoding API will be able to handle pretty much any address/postcode combination and variation you can throw at it - with or without spaces, postcode first or last, etc. etc., including different place names ("London", "Londres").
A request to
http://maps.googleapis.com/maps/api/geocode/json?address=2408%20ZC%20ALPHEN%20AAN%20DEN%20RIJN&sensor=false
returns a JSON stream containing, among other things:
"address_components" : [
{
"long_name" : "2408 ZB",
"short_name" : "2408 ZB",
"types" : [ "postal_code" ]
},
{
"long_name" : "Alphen aan den Rijn",
"short_name" : "Alphen aan den Rijn",
"types" : [ "locality", "political" ]
},
...
This page outlines the requirements and limitations for using the service.
Note that the Google API will guess stuff if the data is slightly wrong. Your initial example of 1234 ZC isn't correct and the API will interpolate in an attempt to give you something you can work with. Make sure you explore how the API reacts to incorrect data, and be careful not to shoot yourself in the foot with the results.
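As a rough sketch of how the response above could be consumed in PHP: the helper name find_address_component() is made up for this example, and the JSON is just the fragment quoted above rather than a live API response.

```php
<?php
// Hypothetical helper: return the long_name of the first address component
// whose "types" list contains $type, or null if there is none.
function find_address_component(array $components, string $type): ?string
{
    foreach ($components as $component) {
        if (in_array($type, $component['types'], true)) {
            return $component['long_name'];
        }
    }
    return null;
}

// Fragment of the response quoted above, decoded as an associative array.
$json = '{"address_components":[
    {"long_name":"2408 ZB","short_name":"2408 ZB","types":["postal_code"]},
    {"long_name":"Alphen aan den Rijn","short_name":"Alphen aan den Rijn","types":["locality","political"]}
]}';
$result = json_decode($json, true);

$postalCode = find_address_component($result['address_components'], 'postal_code');
$city = find_address_component($result['address_components'], 'locality');
```

Returning null for a missing component makes it easy to detect inputs the geocoder could not resolve to a postal code.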

If you know the country at the time you are attempting to split the postal code off from the city you could use that to look up a regular expression (or similar piece of data) that corresponds to the correct way to parse out the postal code.
For example, you might map countries to regexes in an array (these regular expressions are just samples -- not rigorously tested):
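$regexMap = array(
// Each pattern needs delimiters; group 1 is the postal code, group 2 the city.
'US' => '/^(\d{5}|\d{5}-\d{4}|\d{9})\s+(.*)$/',
'UK' => '/^([\d\w]{2,4}\s+\d\w{2})\s+(.*)$/',
...
);
$regularExpression = $regexMap[$country];
preg_match($regularExpression, $incomingPostalCodeAndCity, $postalData);
// Index 0 holds the whole match, so the subpatterns start at index 1.
$postalCode = $postalData[1];
$city = $postalData[2];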
While you probably can combine regular expressions for some (many?) countries, postal codes vary enough that you'll probably still need a fairly lengthy list of regexes.
Each regex should be designed to return the postal code as the first subpattern and the city as the second subpattern.
There is some related information in the answers to this question: What is the ultimate postal code and zip regex? (including some lists of postal code regular expressions for various countries).
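Applied to the sample values from the question, the lookup might be sketched as follows. The function name and the NL, DE and UK patterns are assumptions made for this example and are not rigorously tested; they only cover the shapes shown in the question.

```php
<?php
// Hypothetical patterns for the countries in the question's sample data.
// Each one captures the postal code as group 1 and the city as group 2.
function split_postal_code_and_city(string $country, string $value): ?array
{
    $regexMap = array(
        'NL' => '/^(\d{4}\s?[A-Z]{2})\s+(.+)$/i',
        'DE' => '/^(\d{5})\s+(.+)$/',
        'UK' => '/^([A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2})\s+(.+)$/i',
    );
    if (!isset($regexMap[$country])) {
        return null; // no pattern known for this country
    }
    if (preg_match($regexMap[$country], trim($value), $matches) !== 1) {
        return null; // value does not look like "postal code + city"
    }
    return array('postal_code' => $matches[1], 'city' => $matches[2]);
}
```

Returning null instead of guessing lets the caller put unparseable values aside for manual review.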

Related

Highlighting the elastic search giving the results only matched

I'm working with PHP and Elasticsearch 6.5.2, using Postman for testing. When I apply highlighting with two or more fragments under highlight, only a trimmed piece of the content containing the matched keyword is returned. I don't want to trim the content at the Elasticsearch level, so I changed the number of fragments to zero, and Elasticsearch then returned the whole content as expected. But when I check the output in the PHP application, the whole content is shown in bold whenever the matched keyword exists in the url field.
Index:
PUT test/_doc/1
{
"title":"Apply For the admissions graduate and undergraduate",
"url":"https://someurl.com/admissions",
"content": "Engineers play an important role in almost every aspect of modern life. As an engineer in the 21st century, you’ll work in teams to develop ingenious ways to transform the world in which we live. Industrial engineers are in high demand in nearly every industry. Astounding innovations in semiconductor microelectronic engineering will continue to drive productivity and the economy by playing a key role in a wide range of technologies – information, communication, nanotechnology, defense, medicine, and energy.Admission into the microelectronic engineering program is competitive, but our admission process is a personal one. Each application is reviewed holistically for strength of academic preparation, performance on standardized tests, counselor recommendations, and your personal career interests. We seek applicants from a variety of geographical, social, cultural, economic, and ethnic backgrounds."
}
Query:
{
"query":{
"query_string":{
"fields":[
"content"
],
"query":"admissions"
}
},
"highlight":{
"fields":{
"title":{
"pre_tags":[
"<strong>"
],
"post_tags":[
"</strong>"
],
"number_of_fragments":3 //changed to 0 earlier
},
"content":{
"pre_tags":[
"<strong>"
],
"post_tags":[
"</strong>"
],
"fragment_size":150,
"number_of_fragments":3 //changed to 0 earlier
}
}
}
}
Result:
"highlight": {
"content": [
"information, communication, nanotechnology, defense, medicine, and energy.Admission into the microelectronic engineering program is competitive, but our <strong>admission</strong>"
]
}
After looking at the code you shared in the comments, the issue is not in the Elasticsearch response.
The following line is causing it:
<?php echo substr($r['highlight']['content'][0],0,300); ?>
Because you take only the first 300 characters, one of the results is cut off before its closing </b> tag.
If you look at your HTML, one of the highlighted fragments contains the following:
Tourism Management Curriculum Capstone/Exam/Thesis Options <b>Admissi</div>
As you can see, the closing </b> tag is missing for <b>Admissi, and hence everything after it is rendered in bold.
One solution is not to use substr() on the highlighted HTML.
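If the text still has to be shortened on the PHP side, one option is to count only the visible characters and close any highlight tag left open. The following is a minimal sketch under that assumption; truncate_highlight() is a made-up helper, it only understands one simple tag pair, and it counts bytes rather than multibyte characters.

```php
<?php
// Hypothetical helper: keep at most $limit visible characters of a highlighted
// fragment, never cut inside a tag, and close the tag if it is still open.
function truncate_highlight(string $html, int $limit, string $tag = 'b'): string
{
    $open = "<$tag>";
    $close = "</$tag>";
    $out = '';
    $visible = 0;   // characters emitted outside of tags
    $inTag = false; // true while an opened highlight has not been closed
    $i = 0;
    $length = strlen($html);
    while ($i < $length && $visible < $limit) {
        if (substr($html, $i, strlen($open)) === $open) {
            $out .= $open;
            $inTag = true;
            $i += strlen($open);
        } elseif (substr($html, $i, strlen($close)) === $close) {
            $out .= $close;
            $inTag = false;
            $i += strlen($close);
        } else {
            $out .= $html[$i];
            $visible++;
            $i++;
        }
    }
    return $inTag ? $out . $close : $out;
}
```

With the pre_tags and post_tags from the question, the third argument would be 'strong' instead of 'b'.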

United Kingdom (GB) postal code validation without regex

I have tried several regexes and still some valid postal codes sometimes get rejected.
Searching the internet, Wikipedia and SO, I could only find regex validation solutions.
Is there a validation method which does not use regex? In any language, I guess it would be easy to port.
I suppose the easiest would be to compare against a postal code database, yet that would need to be maintained and updated periodically from a reliable source.
Edit: To help future visitors and keep you from posting any more regexes, here's a regex which I have tested (as of 2013-04-24) to work for all postal codes in Code Point (see @Mikkel Løkke's answer):
//PHP PCRE (it was on Wikipedia, it isn't there anymore; I might have modified it, don't remember).
$strPostalCode=preg_replace("/[\s]/", "", $strPostalCode);
// Whitespace is stripped above, so GIR0AA is matched without its space; the outer group makes ^ and $ apply to both alternatives.
$bValid=preg_match("/^(?:(GIR0AA)|(((A[BL]|B[ABDHLNRSTX]?|C[ABFHMORTVW]|D[ADEGHLNTY]|E[HNX]?|F[KY]|G[LUY]?|H[ADGPRSUX]|I[GMPV]|JE|K[ATWY]|L[ADELNSU]?|M[EKL]?|N[EGNPRW]?|O[LX]|P[AEHLOR]|R[GHM]|S[AEGKLMNOPRSTY]?|T[ADFNQRSW]|UB|W[ADFNRSV]|YO|ZE)[1-9]?[0-9]|((E|N|NW|SE|SW|W)1|EC[1-4]|WC[12])[A-HJKMNPR-Y]|(SW|W)([2-9]|[1-9][0-9])|EC[1-9][0-9])[0-9][ABD-HJLNP-UW-Z]{2}))$/i", $strPostalCode);
I'm writing this answer based on the wiki page.
Looking at the validation section, there seem to be six formats (A = letter and 9 = digit):
With space    Spaces removed    Ordered by length
AA9A 9AA      AA9A9AA           AA9A9AA
A9A 9AA       A9A9AA            AA999AA
A9 9AA        A99AA             AA99AA
A99 9AA       A999AA            A9A9AA
AA9 9AA       AA99AA            A999AA
AA99 9AA      AA999AA           A99AA
As we can see, the length may vary from 5 to 7, and we have to take some special cases into account if we want to.
So the function we are coding has to do the following:
Remove spaces and convert to uppercase (or lowercase).
Check whether the input is a known exception; if it is, return valid.
Check that the input's length satisfies 4 < length < 8.
Check whether it's a valid postcode.
The last part is tricky, but we will split it in 3 sections by length for some overview:
Length = 7: AA9A9AA and AA999AA
Length = 6: AA99AA, A9A9AA and A999AA
Length = 5: A99AA
For this we will be using a switch(). From now on it's just a matter of checking, character by character, whether there is a letter or a number in the right place.
So let's take a look at our PHP implementation:
function check_uk_postcode($string){
// Start config
$valid_return_value = 'valid';
$invalid_return_value = 'invalid';
$exceptions = array('BS981TL', 'BX11LT', 'BX21LB', 'BX32BB', 'BX55AT', 'CF101BH', 'CF991NA', 'DE993GG', 'DH981BT', 'DH991NS', 'E161XL', 'E202AQ', 'E202BB', 'E202ST', 'E203BS', 'E203EL', 'E203ET', 'E203HB', 'E203HY', 'E981SN', 'E981ST', 'E981TT', 'EC2N2DB', 'EC4Y0HQ', 'EH991SP', 'G581SB', 'GIR0AA', 'IV212LR', 'L304GB', 'LS981FD', 'N19GU', 'N811ER', 'NG801EH', 'NG801LH', 'NG801RH', 'NG801TH', 'SE18UJ', 'SN381NW', 'SW1A0AA', 'SW1A0PW', 'SW1A1AA', 'SW1A2AA', 'SW1P3EU', 'SW1W0DT', 'TW89GS', 'W1A1AA', 'W1D4FA', 'W1N4DJ');
// Add Overseas territories ?
array_push($exceptions, 'AI-2640', 'ASCN1ZZ', 'STHL1ZZ', 'TDCU1ZZ', 'BBND1ZZ', 'BIQQ1ZZ', 'FIQQ1ZZ', 'GX111AA', 'PCRN1ZZ', 'SIQQ1ZZ', 'TKCA1ZZ');
// End config
$string = strtoupper(preg_replace('/\s/', '', $string)); // Remove the spaces and convert to uppercase.
$exceptions = array_flip($exceptions);
if(isset($exceptions[$string])){return $valid_return_value;} // Check for valid exception
$length = strlen($string);
if($length < 5 || $length > 7){return $invalid_return_value;} // Check for invalid length
$letters = array_flip(range('A', 'Z')); // An array of letters as keys
$numbers = array_flip(range(0, 9)); // An array of numbers as keys
switch($length){
case 7:
if(!isset($letters[$string[0]], $letters[$string[1]], $numbers[$string[2]], $numbers[$string[4]], $letters[$string[5]], $letters[$string[6]])){break;}
if(isset($letters[$string[3]]) || isset($numbers[$string[3]])){
return $valid_return_value;
}
break;
case 6:
if(!isset($letters[$string[0]], $numbers[$string[3]], $letters[$string[4]], $letters[$string[5]])){break;}
if(isset($letters[$string[1]], $numbers[$string[2]]) || isset($numbers[$string[1]], $letters[$string[2]]) || isset($numbers[$string[1]], $numbers[$string[2]])){
return $valid_return_value;
}
break;
case 5:
if(isset($letters[$string[0]], $numbers[$string[1]], $numbers[$string[2]], $letters[$string[3]], $letters[$string[4]])){
return $valid_return_value;
}
break;
}
return $invalid_return_value;
}
Note that I've not added British Forces Post Office and non-geographic codes.
Usage:
echo check_uk_postcode('AE3A 6AR').'<br>'; // valid
echo check_uk_postcode('Z9 9BA').'<br>'; // valid
echo check_uk_postcode('AE3A6AR').'<br>'; // valid
echo check_uk_postcode('EE34 6FR').'<br>'; // valid
echo check_uk_postcode('A23A 7AR').'<br>'; // invalid
echo check_uk_postcode('A23A 7AR').'<br>'; // invalid
echo check_uk_postcode('WA3334E').'<br>'; // invalid
echo check_uk_postcode('A2 AAR').'<br>'; // invalid
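As supplied by the UK government. Note that the [A-Z-[QVX]] character class subtraction is XML Schema/.NET syntax and is not supported by PHP's PCRE: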
(GIR 0AA)|((([A-Z-[QVX]][0-9][0-9]?)|(([A-Z-[QVX]][A-Z-[IJZ]][0-9][0-9]?)|(([A-Z-[QVX]][0-9][A-HJKSTUW])|([A-Z-[QVX]][A-Z-[IJZ]][0-9][ABEHMNPRVWXY])))) [0-9][A-Z-[CIKMOV]]{2})
I've built London only postcode based apps using the postcodes I got from HERE. But to be honest, even with London postcodes only, you need a lot more storage than necessary. Sure, the idea is trivial.
Store the postcodes, take the user input or whatever, and see if you get a match. But you are complicating the solution far more than you think. I HAD to use actual postcodes to achieve what I wanted, but for simple validation purposes, as hard as "maintaining" a regex is, storing tens or hundreds of thousands of postcodes (if not more) and validating against them more or less in real time is a far more difficult task.
If a mini distributed service sounds like a more efficient solution than a regex, go for it, but I'm sure it isn't. Unless you need geo-spatial querying of your own data against UK postcodes or things like that, I doubt DB storage is a feasible solution. Just my 2 cents.
Update
According to this index, there are 1,758,417 postcodes in the UK. I can tell you I am using a few Mongo clusters (Amazon EC2 High Memory Instances) to provide reliable London-only services (indexing only London postcodes), and it's quite a pricey thing, even with basic storage.
Admittedly, the app is performing medium complexity geo-spatial queries, but the storage requirements alone are very expensive and demanding.
Bottom line, just stick to regex and be done with it in two minutes.
I'm looking at the Postcodes in the United Kingdom page on Wikipedia right now.
http://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom
The Validation section lists six formats with a combination of letters and numbers. Then there's more information in the notes below that. The first thing that I would try is a BNF type grammar with a tool like GoldParserBuilder. You could describe the basic formats in a more readable format, with efficient parser and lexer automatically generated. In the past, I've successfully used such tools to avoid writing huge, ugly regexes.
From that point, the program has a properly formatted zip code of a known type. At this point, the specific numbers or letters might violate something. Each type of zip code can have a function programmed to look for violations of that specific type. The final product will consist of an automatically generated parser that passes unvalidated, but structured/identified, zip codes to a dedicated validation function. You can then refactor or optimize from there.
(You can also use the grammar itself to enforce or disallow certain literals and combinations. Whatever is more readable or comprehensible for you. Different people gravitate toward different ends of these things.)
Here's a page highlighting the advantages of the GOLD Parsing System. You can use any tool you like: I just promote this one because it's good at its job and has steadily improved over many years.
http://www.goldparser.org/about/why-use-gold.htm
I would think the regex, while long-winded, would probably be the best solution if all you want to do is check whether something could be a valid UK postcode.
If you need absolute data, consider using Ordnance Survey OpenData initiative "Code-Point® Open" dataset, which is a CSV of lots of data points in Great Britain (so not Northern Ireland I'm guessing) one of which is postcode. Be aware that the file is 20MB, so you may have to convert it to a more manageable format.
Regexes are hard to debug, hard to port from one regex flavor to another (silent "errors"), and hard to update.
That is true for most regexes, but why don't you just split it up into multiple parts? You can easily split it into six parts for the six different general rules and maybe even more if you take all of the special cases into account.
Creating a well-commented method of 20 lines with simple regexes is easy to debug (one simple regex per line) and also easy to update. The porting problem is the same, but on the other hand you do not need to use some fancy grammar lib.
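A minimal sketch of that idea, assuming the six general formats listed earlier in this thread (A = letter, 9 = digit). The function name is invented for this example, and the check only covers the general shape: special cases such as GIR 0AA and the per-position letter restrictions are deliberately left out.

```php
<?php
// One small regex per general format, so each line can be read,
// debugged, and updated on its own.
function matches_uk_postcode_format(string $postcode): bool
{
    // Normalise first: drop whitespace and compare in uppercase.
    $compact = strtoupper(preg_replace('/\s+/', '', $postcode));
    $formats = array(
        '/^[A-Z]{2}\d[A-Z]\d[A-Z]{2}$/', // AA9A 9AA
        '/^[A-Z]\d[A-Z]\d[A-Z]{2}$/',    // A9A 9AA
        '/^[A-Z]\d\d[A-Z]{2}$/',         // A9 9AA
        '/^[A-Z]\d{2}\d[A-Z]{2}$/',      // A99 9AA
        '/^[A-Z]{2}\d\d[A-Z]{2}$/',      // AA9 9AA
        '/^[A-Z]{2}\d{2}\d[A-Z]{2}$/',   // AA99 9AA
    );
    foreach ($formats as $regex) {
        if (preg_match($regex, $compact) === 1) {
            return true;
        }
    }
    return false;
}
```

Further special-case rules could be added as extra entries in the array without touching the existing ones.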
Are third party services an option?
http://www.postcodeanywhere.co.uk/address-validation/
GeoNames Database:
http://www.geonames.org/postal-codes/
+1 for the "why care" comments. I have had to use the 'official' regex in various projects and while I have never attempted to break it down, it works and it does the job. I've used it with Java and PHP code without any need to convert it between regex formats.
Is there a reason why you would have to debug it or break it down?
Incidentally, the regex rule used to be found on wikipedia, but it appears to have gone.
Edit: As for the space/no-space debate, the postcode should be valid with or without the space. As the last part of the postcode (after the space) is ALWAYS three characters (a digit followed by two letters), it is possible to insert the space manually, which will then allow you to run it through the regex rule.
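Inserting the space could look roughly like this. It is only a sketch: normalize_uk_postcode() is a hypothetical name, and the function does no validation of its own.

```php
<?php
// Hypothetical helper: strip all whitespace, uppercase, then put a single
// space in front of the last three characters (the inward code).
function normalize_uk_postcode(string $input): string
{
    $compact = strtoupper(preg_replace('/\s+/', '', $input));
    if (strlen($compact) < 4) {
        return $compact; // too short to contain an outward and inward part
    }
    return substr($compact, 0, -3) . ' ' . substr($compact, -3);
}
```

The normalised value can then be passed to whichever regex expects the space.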
Take the list of valid postcodes and check if the one entered is in it.

User Friendly, Easy to Remember Coupon Codes

I want to create coupon codes that users can remember easily. My idea is something like:
squirrel45
nantucket23
That is, a real word chosen randomly from a long dictionary list (preferably compiled for this purpose) combined with two random digits. My questions are:
Where can I find such a dictionary list?
Do you see any problems with the system? (security is not ultra important here, just something reasonable is fine)
Can you suggest any good improvements or alternatives?
Fwiw I am not crazy about the Markov word generators because I think their idiosyncrasies would be too hard to remember. I'd like a client to be able to keep the code in his head, and tell it to the merchant when he arrives to redeem it.
Thanks,
Jonah
Word lists are easy to find. Make sure you sanity filter them for foul words ;)
Here's a huge word list that can be easily scrubbed:
http://www.scrabble-assoc.com/boards/dictionary/10-15-20030401.txt
From there you can easily load in words into your database and create your coupon code like so:
$coupon_code = $rand_word . rand(20,99);
After you do this, simply store your coupon code in the database and whenever you make a new code, check it against existing codes before you apply it. Even slim odds are possible odds.
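Put together, the generate-and-check loop might be sketched like this. The function name and the in-memory $existingCodes array are assumptions for this example; in practice the existing codes would come from your database, and $words must not be empty.

```php
<?php
// Sketch: pick a random word, append two digits, and retry if the code is
// already taken. Returns null if no free code was found within $maxAttempts.
function generate_coupon_code(array $words, array $existingCodes, int $maxAttempts = 100): ?string
{
    for ($attempt = 0; $attempt < $maxAttempts; $attempt++) {
        $word = strtolower($words[array_rand($words)]);
        $code = $word . random_int(20, 99);
        if (!in_array($code, $existingCodes, true)) {
            return $code;
        }
    }
    return null;
}

// Example: two words from the sample list, one code already in use.
$code = generate_coupon_code(array('POACH', 'PILOT'), array('poach72'));
```

A unique index on the coupon column is still worth having, since two requests could generate the same code at the same time.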
More word lists in various formats:
http://scrabble.wonderhowto.com/blog/ultimate-scrabble-word-list-resource-0115617/
5-letter words:
http://homepage.ntlworld.com/adam.bozon/Scrabble5.htm
6-letter words:
http://homepage.ntlworld.com/adam.bozon/Scrabble6.htm
7-letter words:
http://homepage.ntlworld.com/adam.bozon/Scrabble7.htm
8-letter words:
http://homepage.ntlworld.com/adam.bozon/Scrabble8.htm
Sample:
PIKES PIKIS PILAF PILAR PILAU PILAW PILEA PILED PILEI PILES PILIS
PILLS PILOT PILUS PIMAS PIMPS PINAS PINCH PINED PINES PINEY PINGO
PINGS PINKO PINKS PINKY PINNA PINNY PINON PINOT PINTA PINTO PINTS
PINUP PIONS PIOUS PIPAL PIPED PIPER PIPES PIPET PIPIT PIQUE PIRNS
PIROG PISCO PISOS PISTE PITAS PITCH PITHS PITHY PITON PIVOT PIXEL
PIXES PIXIE PIZZA PLACE PLACK PLAGE PLAID PLAIN PLAIT PLANE PLANK
PLANS PLANT PLASH PLASM PLATE PLATS PLATY PLAYA PLAYS PLAZA PLEAD
PLEAS PLEAT PLEBE PLEBS PLENA PLEWS PLICA PLIED PLIER PLIES PLINK
PLODS PLONK PLOPS PLOTS PLOTZ PLOWS PLOYS PLUCK PLUGS PLUMB PLUME
PLUMP PLUMS PLUMY PLUNK PLUSH PLYER POACH POCKS POCKY PODGY PODIA
POEMS POESY POETS POGEY POILU POIND POINT POISE POKED POKER POKES
With that you could generate a coupon code POACH72
Concatenating 2 words will increase the security posture of your system.
e.g. squirrel.nantucket.123
The Diceware page has a couple of long word lists, American and International. It also has a useful description of how to meet various levels of security.

Converting a Google Directions v3 JSON Array to PHP Array/Variables to Store in MySQL

I'm trying to convert a JSON object/array from the Google Directions API v3 to PHP so I can store it in a MySQL Database.
The array looks something like this (I truncated it pretty heavily…sorry for making it so long…I guess this is also an informative question for people wanting to know what the array string from Google Directions looks like):
{
"status":"OK",
"routes":[{
"summary":"Lakelands Trail State Park",
"legs":[{
"steps":[{
"travel_mode":"BICYCLING",
"start_location":{"za":42.73698,"Ba":-84.4838},
"end_location":{"za":42.74073,"Ba":-84.48378},
"polyline":{"points":"cazcGvvsbOmVC","levels":"BB"},
"duration":{"value":68,"text":"1 min"},
"distance":{"value":417,"text":"0.3 mi"},
"encoded_lat_lngs":"cazcGvvsbOmVC",
"path":[{"za":42.73698,"Ba":-84.4838},{"za":42.740730000000006,"Ba":-84.48378000000001}],
"lat_lngs":[{"za":42.73698,"Ba":-84.4838},{"za":42.740730000000006,"Ba":-84.48378000000001}],
"instructions":"Head north on Abbot Rd toward Elizabeth St",
"start_point":{"za":42.73698,"Ba":-84.4838},
"end_point":{"za":42.74073,"Ba":-84.48378}
},{
//more steps go here
//end steps array
}}],
"duration":{"value":12309,"text":"3 hours 25 mins"},
"distance":{"value":66198,"text":"41.1 mi"},
"start_location":{"za":42.66069,"Ba":-84.07321},
"end_location":{"za":42.27668,"Ba":-83.74076},
"start_address":"E Grand River Ave, Fowlerville, MI 48836, USA",
"end_address":"angell hall, 435 S State St, Ann Arbor, MI 48109, USA",
"via_waypoint":[]
//end leg array
}],
"copyrights":"Map data ©2011 Google",
"warnings":["Bicycling directions are in beta. Use caution – This route may contain streets that aren't suited for bicycling."],
"waypoint_order":[0],
"bounds":{"U":{"b":42.276680000000006,"d":42.740840000000006},"O":{"d":-84.4838,"b":-83.73996000000001}},
"optimized_waypoint_order":[0]}],
"Ef":{"origin":"east lansing, mi","destination":"1139 Angell Hall 435 S. State Street Ann Arbor, MI 48109",
"waypoints":[{"location":"Bloated Goat Saloon, East Grand River Avenue, Fowlerville, MI","stopover":true}],"optimizeWaypoints":false,"travelMode":"BICYCLING"}
Using AJAX, I sent this object to PHP to decode it and save it in a database.
The only problem is that I don't know how to parse the JSON into PHP…I kind of feel like a dog that actually caught the car they were chasing…I don't know what to do next.
So my question:
How do I take the array above and turn it into something PHP can send to a MySQL database (I already have the database structure set up)? I'm not very good with programming languages, so if you could write your response in the most basic way possible, I'd be ever so grateful.
You could just store the JSON string directly in the database in a text field.
Alternatively, if you want to read the individual fields in PHP, use the json_decode function to turn the JSON string into a PHP array (pass true as the second argument to get associative arrays instead of objects).
http://php.net/manual/en/function.json-decode.php
To convert a JSON object to an object / array that can be used within the PHP engine, you would use json_decode(), such as:
$context = json_decode($json_feed, true); // true returns associative arrays rather than stdClass objects
and then use it like an internal array:
if($context["status"] == "OK"){/*...*/}
foreach($context["routes"] as $Route)
{
//have Fun Walking
}
If you wish to send the array to the database in bulk, i.e. without individual fields in the database, then you can use serialize() to create a reversible string that can be stored in the database, and unserialize() to convert it back into a PHP array.
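To store individual fields instead, the decoded array can be walked route by route, leg by leg, and step by step. The sketch below only builds plain rows; the function name and the row keys (instructions, distance_m, duration_s) are invented for this example, and the actual INSERT into your tables is left out.

```php
<?php
// Sketch: flatten routes -> legs -> steps into one row per step.
// In the Directions response, distance values are metres and durations seconds.
function extract_step_rows(string $json): array
{
    $data = json_decode($json, true);
    if (!is_array($data) || ($data['status'] ?? null) !== 'OK') {
        return array(); // invalid JSON or a failed request
    }
    $rows = array();
    foreach ($data['routes'] as $route) {
        foreach ($route['legs'] as $leg) {
            foreach ($leg['steps'] as $step) {
                $rows[] = array(
                    'instructions' => $step['instructions'],
                    'distance_m'   => $step['distance']['value'],
                    'duration_s'   => $step['duration']['value'],
                );
            }
        }
    }
    return $rows;
}
```

Each row can then be bound to a prepared statement, which avoids building SQL strings by hand.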

Using regex to extract variables from a plain-text form letter?

I'm looking for a good example of using Regular Expressions in PHP to "reverse engineer" a form letter (with a known format, of course) that has been pasted into a multiline textbox and sent to a script for processing.
So, for example, let's assume this is the original plain-text input (taken from a USDA press release):
WASHINGTON, April 5, 2010 - North
American Bison Co-Op, a New Rockford,
N.D., establishment is recalling
approximately 25,000 pounds of whole
beef heads containing tongues that may
not have had the tonsils completely
removed, which is not compliant with
regulations that require the removal
of tonsils from cattle of all ages,
the U.S. Department of Agriculture's
Food Safety and Inspection Service
(FSIS) announced today.
For clarity, the fields that are variables are highlighted below:
[pr_city=]WASHINGTON, [pr_date=]April 5, 2010 - [corp_name=]North
American Bison Co-Op, a [corp_city=]New Rockford,
[corp_state=]N.D., establishment is recalling
approximately [amount=]25,000 pounds of [product=]whole
beef heads containing tongues that may
not have had the tonsils completely
removed, which is not compliant with
regulations that require [reason=]the removal
of tonsils from cattle of all ages,
the U.S. Department of Agriculture's
Food Safety and Inspection Service
(FSIS) announced today.
How could I efficiently extract the contents of the
pr_city
pr_date
corp_name
corp_city
corp_state
amount
product
reason
fields from my example?
Any help would be appreciated, thanks.
Well, a regex that works on your example could look like this (line breaks introduced to keep this beast legible, need to be removed prior to use):
/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a
(?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is
recalling approximately (?P<amount>.*?) of (?P<product>.*?),
which is not compliant with regulations that require (?P<reason>.*?),
the U\.S\. Department of Agriculture\'s Food Safety and Inspection
Service \(FSIS\) announced today\.$/
So, in PHP you could do
if (preg_match('/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a (?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is recalling approximately (?P<amount>.*?) of (?P<product>.*?), which is not compliant with regulations that require (?P<reason>.*?), the U\.S\. Department of Agriculture\'s Food Safety and Inspection Service \(FSIS\) announced today\.$/', $subject, $regs)) {
$prcity = $regs['pr_city'];
$prdate = $regs['pr_date'];
... etc.
} else {
$result = "";
}
This assumes a couple of things, for instance that there are no line breaks, and that the input is the entire string (and not a larger string from which this part has to be extracted). I've tried to make assumptions about legal values that make some sense, but there is a very real chance that other inputs could break this. So some more test cases are probably needed.
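The line-break assumption can be relaxed by collapsing all whitespace before matching. Here is a sketch along those lines, using the same pattern as above; parse_recall_notice() is a made-up wrapper, and the remaining caveats about unexpected inputs still apply.

```php
<?php
// Sketch: normalise whitespace, apply the named-group pattern, and return
// only the named fields (or null if the letter does not match).
function parse_recall_notice(string $text): ?array
{
    // Pasted text usually contains hard line breaks; turn every run of
    // whitespace into a single space so the pattern sees one line.
    $subject = preg_replace('/\s+/', ' ', trim($text));
    $pattern = '/^(?P<pr_city>[^,]+), (?P<pr_date>[^-]+) - (?P<corp_name>.*?), a '
        . '(?P<corp_city>[^,]+), (?P<corp_state>[^,]+), establishment is recalling '
        . 'approximately (?P<amount>.*?) of (?P<product>.*?), which is not compliant '
        . 'with regulations that require (?P<reason>.*?), the U\.S\. Department of '
        . 'Agriculture\'s Food Safety and Inspection Service \(FSIS\) announced today\.$/';
    if (preg_match($pattern, $subject, $regs) !== 1) {
        return null;
    }
    // preg_match stores each named group twice (by name and by number);
    // keep only the entries with string keys.
    return array_filter($regs, 'is_string', ARRAY_FILTER_USE_KEY);
}
```

Returning null on a failed match makes it straightforward to flag letters that deviate from the known format.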
If the surrounding text is constant, then something like this partial regex could do the trick:
preg_match('/^(.*?), (.*?) - (.*?), a (.*?), (.*?), establishment is recalling approximately (.*?), which is not compliant with regulations that require (.*?), the U\.S\. Department of Agriculture\'s Food Safety and Inspection Service \(FSIS\) announced today\./', $text, $matches);
// The quote, dots and parentheses in the fixed text are escaped so they match literally.
// $matches[1] is 'WASHINGTON'
// $matches[2] is 'April 5, 2010'
// $matches[3] is ... etc...
If the surrounding text changes, then you're going to end up with a ton of false matches, no matches, etc... Essentially you'd need an AI to parse/understand PR releases.
Edit: Please disregard this crazy answer, as the other two are better. I should probably delete it, but I'm keeping it up for reference.
I have a crazy idea that just might work: build an XML string from the input by adding markups, then parse it. It might look something like this (completely untested) code:
preg_replace('([^,]*), ([^-]*)- ...etc...', '<pr_city>\1</pr_city><pr_date>\2</pr_date> ...etc...');
Parsing the XML afterwards is a needlessly complicated process that is best left to the PHP documentation: http://www.php.net/manual/en/function.xml-parse.php .
You could also consider converting it to JSON with this method, then using json_decode() to parse it. In any case, you have to think about what happens when " marks and > symbols appear in the input.
It might be easier to just match and remove one piece of the text at a time.
