PHP parse street address with abbreviations from text input for search - php

I need to parse a street address in PHP from a string that might contain abbreviations.
This string comes from a text input.
The fields I need to search are:
street (alphanumeric - might have
building (alphanumeric - might have
number (alphanumeric - might have
area (numeric from 1 to 5)
other (unknown field & used to search in all the above fields in the database)
For example, a user submits one of these texts:
street Main Road Bulding H7 Number 5 Area 1
st Main Road bldg H7 Nr 5 Ar 5
stMain bldgh7
ar5 unknown other search parameter
street Main Road h7 2b
street main street str main road
The outcome I would like to see, as an array:
[street]=>Main Road [building]=>h7 [number]=>5 [area]=>1
[street]=>Main Road [building]=>h7 [number]=>5 [area]=>5
[street]=>Main [building]=>h7
[area]=>5 [other]=>unknown other search parameter
[street]=>Main Road [other]=>h7 2b
[street]=>Main Street&&Main Road
My code so far... but it doesn't work with examples 3, 4, 5 and 6:
<?php
//posted address
$address = "str main one bldg 5b other param area 1";
//to replace
$replace = ['street'=>['st','str'],
'building'=>['bldg','bld'],
'number'=>['nr','numb','nmbr']];
//replace
foreach($replace as $field=>$abbrs)
foreach($abbrs as $abbr)
$address = str_replace($abbr.' ',$field.' ',$address);
//fields
$fields = array_keys($replace);
//match
if(preg_match_all('/('.implode('|',array_keys($fields)).')\s+([^\s]+)/si', $address, $matches)) {
//matches
$search = array_combine($matches[1], $matches[2]);
//other
$search['other'] = str_replace($matches[0],"",$address);
}else{
//search in all the fields
$search['other'] = $address;
}
//search
print_r($search);
Code tester: http://ideone.com/j3q4YI

Wow, you've got one hairy mess to clean up. I've toiled for a few hours on this. It works on all of your samples, but I would NOT stake my career on it being perfect on all future cases. There are simply too many variations in addresses. I hope you can understand my process and modify it if/when new samples fail to be captured properly. I'll leave all my debugging comments in place, because I reckon you'll use them for future edits.
$addresses=array(
"street Main Road Bulding H7 Number 5 Area 1",
"st Main Road bldg H7 Nr 5 Ar 5",
"stMain bldgh7",
"ar5 unknown other search parameter",
"street Main Road h7 2b",
"street main street str main road"
);
$regex["area"]="/^(.*?)(ar(?:ea)?\s?)([1-5])(.*?)$/i";
$regex["number"]="/^(.*?)(n(?:umbe)?r\s?)([0-9]+)(.*?)$/i";
$regex["building"]="/^(.*?)(bu?i?ldi?n?g\s?)([^\s]+)(.*?)$/i";
$regex["corner"]="/^(.*?str?(?:eet)?)\s?(str?(?:eet)?.*)$/i"; // 2 streets in string
$regex["street"]="/^(.*?)(str?(?:eet)?\s?)([^\s]*(?:\s?ro?a?d|\s?str?e?e?t?|.*?))(\s?.*?)$/i";
$regex["other"]="/^(.+)$/";
$search=[];
foreach($addresses as $i=>$address){
echo "<br><div><b>$address</b> breakdown:</div>";
foreach($regex as $key=>$rgx){
if(strlen($address)>0){
//echo "<div>addr(",strlen($address),") $address</div>";
if(preg_match($rgx,$address,$matches)){
if($key=="other"){
$search[$i][$key]=$matches[0]; // everything that remains
}elseif($key=="corner"){
$search[$i]["street"]=""; // NOTICE suppression
// loop through both halves of corner address omitting element[0]
foreach(array_diff_key($matches,array('')) as $half){
//echo "half= $half<br>";
if(preg_match($regex["street"],$half,$half_matches)){
//print_r($half_matches);
$search[$i]["street"].=(strlen($search[$i]["street"])>0?"&&":"").ucwords($half_matches[3]);
$address=trim($half_matches[1].$half_matches[4]);
// $matches[2] is the discarded identifier
//echo "<div>$key Found: {$search[$i][$key]}</div>";
//echo "<div>Remaining: $address</div>";
}
}
}else{
$search[$i][$key]=($key=="street"?ucwords($matches[3]):$matches[3]);
$address=trim($matches[1].$matches[4]);
// $matches[2] is the discarded identifier
//echo "<div>$key Found: {$search[$i][$key]}</div>";
//echo "<div>Remaining: $address</div>";
//print_r($matches);
}
}
}else{
break; // address is fully processed
}
}
echo "<pre>";
var_export($search[$i]);
echo "</pre>";
}
The output is an array that satisfies your brief, but the keys are out of order because I captured the address components out of order -- this may not matter to you, so I didn't bother re-sorting it.
street Main Road Bulding H7 Number 5 Area 1 breakdown:
array (
'area' => '1',
'number' => '5',
'building' => 'H7',
'street' => 'Main Road',
)
st Main Road bldg H7 Nr 5 Ar 5 breakdown:
array (
'area' => '5',
'number' => '5',
'building' => 'H7',
'street' => 'Main Road',
)
stMain bldgh7 breakdown:
array (
'building' => 'h7',
'street' => 'Main',
)
ar5 unknown other search parameter breakdown:
array (
'area' => '5',
'other' => 'unknown other search parameter',
)
street Main Road h7 2b breakdown:
array (
'street' => 'Main Road',
'other' => 'h7 2b',
)
street main street str main road breakdown:
array (
'street' => 'Main Street&&Main Road',
)
...boy am I glad this project doesn't belong to me. Good luck!
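If the key order does matter, a small re-sorting step can be bolted on afterwards. This is only a sketch: `order_fields()` and its default order are hypothetical names chosen for illustration, based on the field list in the question.

```php
<?php
// Hypothetical helper: rebuild a parsed result in a fixed key order.
// The default order follows the field list in the question.
function order_fields(array $parsed, array $order = ['street', 'building', 'number', 'area', 'other']): array
{
    $ordered = [];
    foreach ($order as $key) {
        // Copy only the keys that were actually captured.
        if (array_key_exists($key, $parsed)) {
            $ordered[$key] = $parsed[$key];
        }
    }
    return $ordered;
}

// Sample taken from the first breakdown above.
$sorted = order_fields(['area' => '1', 'number' => '5', 'building' => 'H7', 'street' => 'Main Road']);
print_r($sorted);
```

Keys that are not in `$order` are dropped, so extend the list if you add new regex keys.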

Thank you for the help! I thought that I should do something like multiple preg_matches.
I just found a PHP extension that does exactly what I want.
The library is PHP Postal (https://github.com/openvenues/php-postal) and requires libpostal. It takes about 15-20 seconds to load the library when you run PHP; after that, everything works fine.
Total execution time for parsing: 0.00030-0.00060 seconds.
$parsed = Postal\Parser::parse_address("The Book Club 100-106 Leonard St, Shoreditch, London, Greater London, EC2A 4RH, United Kingdom");
foreach ($parsed as $component) {
echo "{$component['label']}: {$component['value']}\n";
}
Output:
house: the book club
house_number: 100-106
road: leonard st
suburb: shoreditch
city: london
state_district: greater london
postcode: ec2a 4rh
country: united kingdom
All I had to do after this was replace my labels and format the address.
Hope this helps others who want to parse an address in PHP.
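The label replacement step could look roughly like the sketch below. It does not call the extension; it works on an array shaped like the parser output shown above. The mapping in `LABEL_MAP` (which libpostal label feeds which search field) and the helper name `map_components()` are assumptions for illustration, not part of PHP Postal.

```php
<?php
// Assumed mapping from libpostal labels to the search fields in the question.
const LABEL_MAP = [
    'road'         => 'street',
    'house'        => 'building',
    'house_number' => 'number',
];

// $components: a list of ['label' => ..., 'value' => ...] entries.
function map_components(array $components, array $labelMap = LABEL_MAP): array
{
    $search = [];
    $other  = [];
    foreach ($components as $component) {
        $label = $component['label'];
        if (isset($labelMap[$label])) {
            $search[$labelMap[$label]] = ucwords($component['value']);
        } else {
            // Anything unmapped goes to the catch-all "other" field.
            $other[] = $component['value'];
        }
    }
    if ($other !== []) {
        $search['other'] = implode(' ', $other);
    }
    return $search;
}

// First four components copied from the output above.
$parsed = [
    ['label' => 'house',        'value' => 'the book club'],
    ['label' => 'house_number', 'value' => '100-106'],
    ['label' => 'road',         'value' => 'leonard st'],
    ['label' => 'suburb',       'value' => 'shoreditch'],
];
print_r(map_components($parsed));
```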

Related

Change parts of a string with values from an array, but only when the whole value can be found

I have to replace strings of multiple country names to their translations in another language. So I created an array of countries, where the keys are the countries in English and the values are the countries in the destination language...
So, let me first put a relevant extract of the array I used:
$countries = array(
//...
'Canada' => 'Καναδάς',
//...
'France' => 'Γαλλία',
//...
'Germany' => 'Γερμανία',
//...
'Korea' => 'Κορέα',
//...
'South Korea' => 'Νότια Κορέα',
//...
'United States' => 'Ηνωμένες Πολιτείες',
//...
'West Germany' => 'Δυτική Γερμανία',
//...
);
And the code I used is this:
$tmp[] = str_replace(array_keys($countries), $countries, $api->getCountry());
And below are two (special case) examples that are giving me a hard time figuring out how to deal with them...
West Germany • France
United States • Canada • South Korea
So the above two examples are replaced like this:
West Γερμανία • Γαλλία
Ηνωμένες Πολιτείες • Καναδάς • South Κορέα
I think it's very obvious what's happening here... The key Germany is found before the key West Germany, so str_replace replaces the Germany part with the translated name of it, and so West remains untranslated... The same happens with Korea, which (alphabetically) happens to be before South Korea...
Moving West Germany and South Korea above Germany and Korea fixes the problem, but I suppose this is not the proper way to deal with it, as the same will happen to East Germany and, generally, to any other country with a multi-word name...
What's the correct way to deal with this in your opinion? TIA
This is a little bit of a cheat but if you're going to use array_keys rather than a loop, you should just presort the countries array by length.
$keys = array_map('strlen', array_keys($countries));
array_multisort($keys, SORT_DESC, $countries);
$tmp[] = str_replace(array_keys($countries), $countries, $api->getCountry());
Here's a little example you can test at: https://www.tehplayground.com/uARSRel47jYICSIA
$countries = array(
'Canada' => 'Καναδάς',
'France' => 'Γαλλία',
'Germany' => 'Γερμανία',
'Korea' => 'Κορέα',
'South Korea' => 'Νότια Κορέα',
'United States' => 'Ηνωμένες Πολιτείες',
'West Germany' => 'Δυτική Γερμανία'
);
$keys = array_map('strlen', array_keys($countries));
array_multisort($keys, SORT_DESC, $countries);
echo str_replace(array_keys($countries), $countries, "West Germany"). "\n";
echo str_replace(array_keys($countries), $countries, "France") . "\n";
echo str_replace(array_keys($countries), $countries, "United States") . "\n";
echo str_replace(array_keys($countries), $countries, "Canada") . "\n";
echo str_replace(array_keys($countries), $countries, "South Korea") . "\n";
Output:
Δυτική Γερμανία
Γαλλία
Ηνωμένες Πολιτείες
Καναδάς
Νότια Κορέα
Update
As it turns out uksort takes care of this in one line:
uksort($countries, function($a, $b) { return strlen($b) - strlen($a); });
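Another option, not mentioned in the answer above, is `strtr()` with an array. With an array argument it tries the longest keys first and never rescans text it has already replaced, so no pre-sorting is needed. A minimal sketch using a subset of the question's array:

```php
<?php
$countries = [
    'Canada'        => 'Καναδάς',
    'France'        => 'Γαλλία',
    'Germany'       => 'Γερμανία',
    'Korea'         => 'Κορέα',
    'South Korea'   => 'Νότια Κορέα',
    'United States' => 'Ηνωμένες Πολιτείες',
    'West Germany'  => 'Δυτική Γερμανία',
];

// Array order does not matter here: "West Germany" wins over "Germany"
// because strtr() prefers the longest matching key at each position.
$first  = strtr('West Germany • France', $countries);
$second = strtr('United States • Canada • South Korea', $countries);

echo $first, "\n";
echo $second, "\n";
```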

PHP: How to check if an entry exists in API?

I am querying the Wikipedia API. Normally I get the following, and I echo out the extract.
array:4 [▼
"pageid" => 13275
"ns" => 0
"title" => "Hungary"
"extract" => """
<p><span></span></p>\n
<p><b>Hungary</b> (<span><span>/<span><span title="/ˈ/ primary stress follows">ˈ</span><span title="'h' in 'hi'">h</span><span title="/ʌ/ short 'u' in 'bud'">ʌ</span><span title="/ŋ/ 'ng' in 'sing'">ŋ</span><span title="'g' in 'guy'">ɡ</span><span title="/ər/ 'er' in 'finger'">ər</span><span title="/i/ 'y' in 'happy'">i</span></span>/</span></span>; Hungarian: <span lang="hu"><i>Magyarország</i></span> <span title="Representation in the International Phonetic Alphabet (IPA)">[ˈmɒɟɒrorsaːɡ]</span>) is a parliamentary constitutional republic in Central Europe. It is situated in the Carpathian Basin and is bordered by Slovakia to the north, Romania to the east, Serbia to the south, Croatia to the southwest, Slovenia to the west, Austria to the northwest, and Ukraine to the northeast. The country's capital and largest city is Budapest. Hungary is a member of the European Union, NATO, the OECD, the Visegrád Group, and the Schengen Area. The official language is Hungarian, which is the most widely spoken non-Indo-European language in Europe.</p>\n
But if the entry does not exist in Wiki then I get this.
array:3 [▼
"ns" => 0
"title" => "Kisfelegyhaza"
"missing" => ""
]
So my question is how do I check if extract exists?
I tried the following but it does not work.
// $wiki_array holds the data received from Wiki
if (array_key_exists('extract',$wiki_array)){
// do something
}
// $wiki_array holds the data received from Wiki
if( isset($wiki_array['extract']) ){
// do something
}
Use isset($var) to check whether a variable is set (that is, defined and not null).
For anyone facing the same problem, here is the solution I used.
foreach($wiki_array['query']['pages'] as $page){
if( isset($page['extract']) ){
echo '<p>';
echo $page['extract'];
echo '</p>';
}
}
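The earlier attempts presumably failed because `extract` is not a top-level key of `$wiki_array`; it sits inside each entry under `query` > `pages`. The sketch below shows the difference on a hand-made array. The page keys and the shortened extract text are placeholders, and `extracts_by_title()` is a hypothetical helper.

```php
<?php
// Hand-made stand-in for the decoded API response (assumed shape).
$wiki_array = [
    'query' => [
        'pages' => [
            'a' => ['pageid' => 13275, 'ns' => 0, 'title' => 'Hungary', 'extract' => '<p>placeholder extract</p>'],
            'b' => ['ns' => 0, 'title' => 'Kisfelegyhaza', 'missing' => ''],
        ],
    ],
];

// Returns title => extract, with null where the page has no extract.
function extracts_by_title(array $response): array
{
    $found = [];
    foreach ($response['query']['pages'] ?? [] as $page) {
        // ?? yields null when the 'extract' key is absent (a "missing" page).
        $found[$page['title']] = $page['extract'] ?? null;
    }
    return $found;
}

var_dump(isset($wiki_array['extract'])); // false: wrong nesting level
print_r(extracts_by_title($wiki_array));
```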

How to gather company data from VIES database via SOAP

I am using the VIES database to gather company data, based on European VAT number for my PHP application.
The things that I need are:
city
street name
house number
postcode
company name
as separate fields, but the VIES database gives me all of it as one string.
Working example:
<?php
try {
$opts = array(
'http' => array(
'user_agent' => 'PHPSoapClient'
)
);
$context = stream_context_create($opts);
$client = new SoapClient(
'http://ec.europa.eu/taxation_customs/vies/checkVatService.wsdl',
array('stream_context' => $context,
'cache_wsdl' => WSDL_CACHE_NONE)
);
$result = $client->checkVat(
array(
'countryCode' => 'PL',
'vatNumber' => '5242106963'
)
);
print_r($result);
} catch (Exception $e) {
echo $e->getMessage();
}
?>
I am receiving:
stdClass Object (
[countryCode] => PL
[vatNumber] => 5242106963
[requestDate] => 2015-02-20+01:00
[valid] => 1
[name] => COCA-COLA HBC POLSKA SPÓŁKA Z OGRANICZONĄ ODPOWIEDZIALNOŚCIĄ
[address] => ANNOPOL 20 03-236 WARSZAWA
)
But I need the address like this:
$street='ANNOPOL';
$number='20';
$city='WARSZAWA';
$postcode='03-236';
Also, please keep in mind that for other companies the street name or city can have more than one word, like "New York", so the easy solution of splitting the data on the spaces between words doesn't work for me.
As you have stated that postal code will be in 99-999 format and assuming the street number (+ any flat identification) will always start with a number, you can use a preg_match to parse the address string:
$result = new stdClass();
$result->address = 'Wita Stwosza 15 M5 31-042 Kraków';
preg_match(
'#'
. '(.*?)' // Lazily match as few chars as possible (street)
. '\s+(\d+(?:.*?))' // Space, then number and possibly flat (number)
. '\s+(\d\d\-\d\d\d)' // Space, then digits/dash/digits (postcode)
. '\s(.*)' // Space, then everything after that (city)
. '#',
$result->address,
$matches
);
list ($street, $number, $postcode, $city) = array_slice($matches, 1);
echo "Street: $street", PHP_EOL;
echo "Number: $number", PHP_EOL;
echo "Postcode: $postcode", PHP_EOL;
echo "City: $city", PHP_EOL;
Output:
Street: Wita Stwosza
Number: 15 M5
Postcode: 31-042
City: Kraków
As far as I can see the VIES data already has newlines built into the result. So you should be able to explode based upon the newline character. Then it will just be a case of working out if the postcode is last or the city.
To confirm what I am saying just:
echo nl2br($result->address);
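If the address really does contain a newline, the split could be sketched as follows. Two assumptions here are not confirmed by the question: that the first line is "street number" and the second is "postcode city", and that the postcode uses the Polish 99-999 format. `split_vies_address()` is a hypothetical helper.

```php
<?php
// Returns the four parts, or null when the layout differs from the assumed one.
function split_vies_address(string $address): ?array
{
    // Split on any newline flavour and drop empty lines.
    $lines = preg_split('/\R/', trim($address), -1, PREG_SPLIT_NO_EMPTY);
    if (count($lines) !== 2) {
        return null;
    }
    // Line 1: street, then whitespace, then a number that starts with a digit.
    // Line 2: postcode in 99-999 form, then whitespace, then the city.
    if (!preg_match('/^(.*?)\s+(\d.*)$/u', $lines[0], $first)
        || !preg_match('/^(\d\d-\d\d\d)\s+(.*)$/u', $lines[1], $second)) {
        return null;
    }
    return [
        'street'   => $first[1],
        'number'   => $first[2],
        'postcode' => $second[1],
        'city'     => $second[2],
    ];
}

print_r(split_vies_address("ANNOPOL 20\n03-236 WARSZAWA"));
```

Returning null instead of guessing makes it easy to fall back to another strategy for records that do not fit.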

how to replace text in a mysql database content array

I'm trying to get rid of unnecessary text in my database content. My code looks like this:
if(mysql_num_rows($result))
$items[] = array();
while($row = mysql_fetch_assoc($result)) {
$items[] = array('id' => $row['id'], 'cat' => $row['cat'], 'type' => $row['type'], 'name' => $row['name'], 'sub_title' => $row['sub_title'], 'display_date' => $row['display_date'], 'slug' => $row['slug'], 'ticket_url' => $row['ticket_url'], 'status' => $row['status'], 'content' => $row['content'], 'display_until' => $row['display_until'], 'photo' => $row['photo'], 'thumb' => $row['thumb']);
$removals = array('\n','\r','\t','<\/div>\r\n');
$spaces = "";
$parsedText = str_replace($removals, $spaces, $items);
}
echo json_encode(array('events'=>$items));
And the content then displays like this:
{"events":[[],{"id":"66","cat":"9","type":"2","name":"Oileán - A Celebration of the Blasket Islands","sub_title":"National Folk Theatre","display_date":"Tues 4th - Thurs 6th May at 8.30pm","slug":"This production celebrates life on the Blasket Islands in times past, exploring the way of life of the islanders and their spirit of survival. Oileán captures the essence of this island community, their traditions and customs, their wealth of song and story, their love of life and their strong kinship with one another. ","ticket_url":"","status":"1","content":"
\r\n\tPresented by the members of the National Folk Theatre of Ireland</strong>, this production celebrates and explores Blasket Island living while also challenging our own notions of identity as contemporary islanders. </div>\r\n
\r\n\t </div>\r\n
\r\n\tPremiered in 2003, Oileán</strong></em> marked the 50th</sup> anniversary of the departure of the Blasket Islanders to the mainland. The Great Blasket Island, located off the coast of West Kerry still retains an almost mystical significance for many, both from Ireland and abroad. The way of life of the islanders and their spirit of survival is framed in this production, which captures the essence of this island community, their traditions and customs, their wealth of song and story, their love of life and their strong kinship with one another. </div>\r\n
\r\n\t </div>\r\n
\r\n\tOileán</i></b> is delivered in the unique Siamsa style through the medium of dance, mime, music and song.</div>\r\n
\r\n\t </div>\r\n
\r\n\t
\r\n\t\t </div>\r\n\t
\r\n\t\tPlease note that due to the popularity of performances by the National Folk Theatre</strong>, some productions may be sold out well in advance and tickets may not be available on-line. However, we often have returns and tickets may be available nearer to the day of a performance</strong>. Please contact us directly by phone on: +353 (0)66 7123055.</em></div>\r\n\t
\r\n\t\t </div>\r\n\t
\r\n\t\t </div>\r\n</div>\r\n","display_until":"20100504","photo":"1269869378-oilean_side.jpg","thumb":"1269869378-oilean_thumb.jpg"},
The above display is the first item in the DB.
I'm trying to replace all the \r, \n, etc. in the above content. How can I go about this? Is what I already have on the right track?
Two things, both marked in the comments below:
if(mysql_num_rows($result))
$items = array(); // not $items[], that would set the first item as an array
while($row = mysql_fetch_assoc($result)) {
$removals = array("\n","\r","\t","<\/div>\r\n");
$spaces = "";
$items[] = array(
'id' => $row['id'],
'cat' => $row['cat'],
'type' => $row['type'],
'name' => $row['name'],
'sub_title' => $row['sub_title'],
'display_date' => $row['display_date'],
'slug' => $row['slug'],
'ticket_url' => $row['ticket_url'],
'status' => $row['status'],
// replace the content here
// you'll want to use preg_replace though, otherwise you'll end up with multiple </div>'s
'content' => str_replace( $removals, $spaces, $row['content'] ),
'display_until' => $row['display_until'],
'photo' => $row['photo'],
'thumb' => $row['thumb']
);
}
echo json_encode(array('events'=>$items));
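One detail worth spelling out: in PHP, single-quoted '\n' is the two characters backslash and n, while double-quoted "\n" is a real newline, which is why the quotes in `$removals` matter. A minimal sketch of that difference, plus a `preg_replace` variant for stripping the control characters; `strip_control_whitespace()` is a hypothetical helper and the sample string is shortened from the question's content.

```php
<?php
// Single quotes keep the backslash; double quotes produce the control character.
var_dump(strlen('\n')); // int(2)
var_dump(strlen("\n")); // int(1)

// Hypothetical helper: remove every CR, LF and TAB from one content string.
// Inside the pattern, PCRE itself interprets \r, \n and \t.
function strip_control_whitespace(string $content): string
{
    return preg_replace('/[\r\n\t]+/', '', $content);
}

$content = "<div>\r\n\tPresented by the members</div>\r\n";
echo strip_control_whitespace($content), "\n"; // <div>Presented by the members</div>
```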

Probably simple php regex

Sorry for the simple question that I could have researched, but I crashed a database today, have been here 12 hours, and want to go home.
I am looping recursively through files trying to extract city, phone number, and email address so that I can match the city and phone to my database entries and update the user's email address. In theory, they could just log in with their email and request to reset their password.
Here's what I need. My file contents look like this:
> Address : 123 main street City : somecity State/Province : somestate
> Zip/Postal Code : 12345 Country : United States Phone : 1231231234 Fax
> : E-Mail : example#example.com ==== CUSTOMER SHIPPING INFORMATION ===
I should note that there is other info before and after the snippet I showed. Can someone please help me with a regex to remove the 3 items? Thanks.
Try something like this, without regex..
$string = 'Address : 123 main street City : somecity State/Province : somestate Zip/Postal Code : 12345 Country : United States Phone : 1231231234 Fax : E-Mail : example#example.com ==== CUSTOMER SHIPPING INFORMATION ===';
$string = str_replace(
array(
' ==== CUSTOMER SHIPPING INFORMATION ===',
'Address',
'City',
'State/Province',
'Zip/Postal Code',
'Country',
'Phone',
'Fax',
'E-Mail'
)
, '', $string);
$string = explode(' : ', $string);
unset($string[0]);
print_r($string);
Result...
Array
(
[0] =>
[1] => 123 main street
[2] => somecity
[3] => somestate
[4] => 12345
[5] => United States
[6] => 1231231234
[7] =>
[8] => example#example.com
)
If there are linebreaks, something like this...
$string = explode("\n", $string);
foreach($string as $value){
list(, $info) = explode(' : ', $value);
echo $info . '<br />';
}
Solution with regex..
$fields = array('City', 'Phone', 'E-mail');
foreach($fields as $field){
preg_match("#$field : (.*?) #is", $string, $matches);
echo "$field : $matches[1]";
echo '<br />';
}
Result:
City : somecity
Phone : 1231231234
E-mail : example#example.com
Something like this:
Address\s*:\s*(.*?)\s*City\s*:\s*(.*?)\s*State/Province\s*:\s*(.*?)\s*Zip/Postal Code\s*:\s*(.*?)\s*Country\s*:\s*(.*?)\s*Phone\s*:\s*(.*?)\s*Fax\s*:\s*(.*?)\s*E-Mail\s*:\s*(.*?)\s
Will work if you rip out the > at the start of each line first.
If you print_r that, you'll see the different components.
Note the "dot matches all" modifier. Might be even easier if you rip out newlines too (after you take out the >).
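Put together in PHP, that approach might look like the sketch below: strip the leading ">" markers, collapse the whitespace, then run the pattern with the `s` modifier. The variable names are illustrative, and the sample text is the snippet from the question.

```php
<?php
$raw = "> Address : 123 main street City : somecity State/Province : somestate\n"
     . "> Zip/Postal Code : 12345 Country : United States Phone : 1231231234 Fax\n"
     . "> : E-Mail : example#example.com ==== CUSTOMER SHIPPING INFORMATION ===";

// Remove the quote marker at the start of each line, then flatten all
// whitespace runs (including the newlines) to single spaces.
$flat = preg_replace('/^>\s?/m', '', $raw);
$flat = preg_replace('/\s+/', ' ', $flat);

// "~" is used as the delimiter because the pattern contains "/".
$pattern = '~Address\s*:\s*(.*?)\s*City\s*:\s*(.*?)\s*State/Province\s*:\s*(.*?)\s*'
         . 'Zip/Postal Code\s*:\s*(.*?)\s*Country\s*:\s*(.*?)\s*Phone\s*:\s*(.*?)\s*'
         . 'Fax\s*:\s*(.*?)\s*E-Mail\s*:\s*(.*?)\s~s';

if (preg_match($pattern, $flat, $m)) {
    // Skip $m[0] (the whole match) and name the eight captures.
    [, $address, $city, $state, $zip, $country, $phone, $fax, $email] = $m;
    echo "City: $city\nPhone: $phone\nE-Mail: $email\n";
}
```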
What about this
$test =
'
> Address : 123 main street City : somecity State/Province : somestate
> Zip/Postal Code : 12345 Country : United States Phone : 1231231234 Fax
> : E-Mail : example#example.com ==== CUSTOMER SHIPPING INFORMATION ===
';
preg_match('#.*City : (.+?) .*? Country : (.+?) Phone : (.+?) .*#si',$test,$res);
var_dump($res);
result
array
0 => string '
> Address : 123 main street City : somecity State/Province : somestate
> Zip/Postal Code : 12345 Country : United States Phone : 1231231234 Fax
> : E-Mail : example#example.com ==== CUSTOMER SHIPPING INFORMATION ===
' (length=217)
1 => string 'somecity' (length=8)
2 => string 'United States' (length=13)
3 => string '1231231234' (length=10)
