I am querying the Wikipedia API. Normally I get the following, and I echo out the extract.
array:4 [▼
"pageid" => 13275
"ns" => 0
"title" => "Hungary"
"extract" => """
<p><span></span></p>\n
<p><b>Hungary</b> (<span><span>/<span><span title="/ˈ/ primary stress follows">ˈ</span><span title="'h' in 'hi'">h</span><span title="/ʌ/ short 'u' in 'bud'">ʌ</span><span title="/ŋ/ 'ng' in 'sing'">ŋ</span><span title="'g' in 'guy'">ɡ</span><span title="/ər/ 'er' in 'finger'">ər</span><span title="/i/ 'y' in 'happy'">i</span></span>/</span></span>; Hungarian: <span lang="hu"><i>Magyarország</i></span> <span title="Representation in the International Phonetic Alphabet (IPA)">[ˈmɒɟɒrorsaːɡ]</span>) is a parliamentary constitutional republic in Central Europe. It is situated in the Carpathian Basin and is bordered by Slovakia to the north, Romania to the east, Serbia to the south, Croatia to the southwest, Slovenia to the west, Austria to the northwest, and Ukraine to the northeast. The country's capital and largest city is Budapest. Hungary is a member of the European Union, NATO, the OECD, the Visegrád Group, and the Schengen Area. The official language is Hungarian, which is the most widely spoken non-Indo-European language in Europe.</p>\n
But if the entry does not exist in Wiki then I get this.
array:3 [▼
"ns" => 0
"title" => "Kisfelegyhaza"
"missing" => ""
]
So my question is how do I check if extract exists?
I tried the following but it does not work.
// $wiki_array holds the data received from the Wikipedia API
if (array_key_exists('extract',$wiki_array)){
// do something
}
// $wiki_array holds the data received from the Wikipedia API
if( isset($wiki_array['extract']) ){
// do something
}
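isset($var) checks whether that variable is set (and not null).
For anyone facing the same problem, here is the solution I used. The pages are nested under query and pages, so the check has to run on each page rather than on the top-level array.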
foreach($wiki_array['query']['pages'] as $page){
if( isset($page['extract']) ){
echo '<p>';
echo $page['extract'];
echo '</p>';
}
}
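As a minimal sketch of the same check, assuming the decoded response is shaped like the two dumps in the question: the helper name pageExtract and the shortened sample data are made up for illustration. It treats a page as unusable when it carries the "missing" key or has no "extract" key.

```php
<?php
// Hypothetical helper: returns the extract of one page, or null when there is none.
function pageExtract(array $page): ?string
{
    // Missing pages carry a "missing" key and no "extract" key.
    if (array_key_exists('missing', $page) || !isset($page['extract'])) {
        return null;
    }
    return $page['extract'];
}

// Sample data shaped like the dumps in the question (extract shortened to a placeholder).
$wiki_array = [
    'query' => [
        'pages' => [
            '13275' => ['pageid' => 13275, 'ns' => 0, 'title' => 'Hungary', 'extract' => '<p><b>Hungary</b> ...</p>'],
            '-1'    => ['ns' => 0, 'title' => 'Kisfelegyhaza', 'missing' => ''],
        ],
    ],
];

foreach ($wiki_array['query']['pages'] as $page) {
    $extract = pageExtract($page);
    echo $page['title'], ': ', $extract === null ? 'no extract' : 'has extract', "\n";
}
```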
I have to replace strings of multiple country names to their translations in another language. So I created an array of countries, where the keys are the countries in English and the values are the countries in the destination language...
So, let me first put a relevant extract of the array I used:
$countries = array(
//...
'Canada' => 'Καναδάς',
//...
'France' => 'Γαλλία',
//...
'Germany' => 'Γερμανία',
//...
'Korea' => 'Κορέα',
//...
'South Korea' => 'Νότια Κορέα',
//...
'United States' => 'Ηνωμένες Πολιτείες',
//...
'West Germany' => 'Δυτική Γερμανία',
//...
);
And the code I used is this:
$tmp[] = str_replace(array_keys($countries), $countries, $api->getCountry());
And below are two (special case) examples that are giving me a hard time figuring out how to deal with them...
West Germany • France
United States • Canada • South Korea
So the above two examples are replaced like this:
West Γερμανία • Γαλλία
Ηνωμένες Πολιτείες • Καναδάς • South Κορέα
I think it's very obvious what's happening here... The key Germany is found before the key West Germany, so str_replace replaces the Germany part with the translated name of it, and so West remains untranslated... The same happens with Korea, which (alphabetically) happens to be before South Korea...
Moving West Germany and South Korea above Germany and Korea fixes the problem, but I suppose this is not the proper way to deal with it, as the same thing will happen with East Germany and, generally, with any other country that has a multi-word name...
What's the correct way to deal with this in your opinion? TIA
This is a little bit of a cheat but if you're going to use array_keys rather than a loop, you should just presort the countries array by length.
$keys = array_map('strlen', array_keys($countries));
array_multisort($keys, SORT_DESC, $countries);
$tmp[] = str_replace(array_keys($countries), $countries, $api->getCountry());
Here's a little example you can test at: https://www.tehplayground.com/uARSRel47jYICSIA
$countries = array(
'Canada' => 'Καναδάς',
'France' => 'Γαλλία',
'Germany' => 'Γερμανία',
'Korea' => 'Κορέα',
'South Korea' => 'Νότια Κορέα',
'United States' => 'Ηνωμένες Πολιτείες',
'West Germany' => 'Δυτική Γερμανία'
);
$keys = array_map('strlen', array_keys($countries));
array_multisort($keys, SORT_DESC, $countries);
echo str_replace(array_keys($countries), $countries, "West Germany"). "\n";
echo str_replace(array_keys($countries), $countries, "France") . "\n";
echo str_replace(array_keys($countries), $countries, "United States") . "\n";
echo str_replace(array_keys($countries), $countries, "Canada") . "\n";
echo str_replace(array_keys($countries), $countries, "South Korea") . "\n";
Output:
Δυτική Γερμανία
Γαλλία
Ηνωμένες Πολιτείες
Καναδάς
Νότια Κορέα
Update
As it turns out uksort takes care of this in one line:
uksort($countries, function($a, $b) { return strlen($b) - strlen($a); });
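Another option that avoids sorting altogether, sketched here with the same sample data: when strtr() is given an array, it tries the longest keys first and does not search text it has already replaced. The variable $translated is just an illustrative name.

```php
<?php
// Same sample data as above, deliberately left in the "wrong" order.
$countries = [
    'Germany'       => 'Γερμανία',
    'West Germany'  => 'Δυτική Γερμανία',
    'France'        => 'Γαλλία',
    'Korea'         => 'Κορέα',
    'South Korea'   => 'Νότια Κορέα',
    'United States' => 'Ηνωμένες Πολιτείες',
    'Canada'        => 'Καναδάς',
];

// strtr() with an array matches the longest key at each position,
// so "West Germany" wins over "Germany" regardless of array order.
$translated = strtr('West Germany • France', $countries);
echo $translated, "\n"; // Δυτική Γερμανία • Γαλλία
```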
I found the same question, but it is not very helpful because its answer fails in many cases. So I'm writing this question in case somebody has a better solution.
These are my example addresses.
[0] => "Skattkarr Varmland SE-65671" //Sweden
[1] => "Rayleigh , Essex SS6 8YJ" //UK
[2] => "Horgen, Zürich 8810" //Switzerland
[3] => "Edmonton Alberta T5A 2L8" //Canada
[4] => "REDDING, CA 96003" //USA
[5] => "New York, NY 96003" //USA
[6] => "New York NY 96003" //USA
I tried a lot, but it fails in many cases.
I can get 2 or 3 to pass, but not all of them, especially when the country changes.
I tried explode(" ",$addr[0]), which gives me the state at index 0 and the city at index 1, but explode(" ",$addr[6]) gives me New as the state and York as the city. For the UK and Canada the zip code will also be wrong.
My last question was marked as a duplicate, but my query is different and the linked question does not help me.
In order to separate these strings into state, city and zipcode, you will need to define rules that can apply to all of your strings.
If we separate them by space, New York is not gonna work, since New York is a city but has a space in the middle.
If we separate them by comma, some of them don't have a comma.
If we separate by both space and comma, we cannot assume the last item will be the zipcode, since T5A 2L8 is a zipcode but will be split in two.
So there is no rule that I can think of that would work with your data. You should start from how these strings can be separated and identified. Try to apply it to the code and we will gladly help you.
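As a starting point only, here is one possible rule set sketched against the seven sample strings: peel a postcode off the end using a few country-specific shapes, then split the remainder on a comma if there is one, otherwise on the last space. The helper name splitAddress and the postcode patterns are assumptions derived from the samples, not a general solution; a two-word state without a comma would still be split wrongly.

```php
<?php
// Hypothetical helper: postcode first, then city/state from what is left.
function splitAddress(string $address): ?array
{
    // Postcode shapes taken only from the sample strings; real data will need more.
    $postcodes = [
        'SE-\d{5}',                         // Sweden, as written in the sample
        '[A-Z]\d[A-Z] ?\d[A-Z]\d',          // Canada, e.g. T5A 2L8
        '[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}', // UK, e.g. SS6 8YJ
        '\d{5}',                            // USA
        '\d{4}',                            // Switzerland
    ];
    $pattern = '/^(.*?)[\s,]+(' . implode('|', $postcodes) . ')$/iu';
    if (!preg_match($pattern, trim($address), $m)) {
        return null; // no known postcode shape at the end
    }

    $rest = $m[1];
    if (strpos($rest, ',') !== false) {
        // A comma, when present, separates city from state.
        list($city, $state) = array_map('trim', explode(',', $rest, 2));
    } else {
        // Otherwise assume the last word is the state.
        $cut   = strrpos($rest, ' ');
        $city  = $cut === false ? $rest : substr($rest, 0, $cut);
        $state = $cut === false ? '' : substr($rest, $cut + 1);
    }

    return ['city' => $city, 'state' => $state, 'zipcode' => $m[2]];
}

print_r(splitAddress('New York NY 96003'));
```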
I tried to use OSM Nominatim to separate and validate the data.
Request
https://nominatim.openstreetmap.org/search?format=json&limit=1&addressdetails=1&q=1088+Burton+Dr.+REDDING,+CA+96003+US
Response
[
{
"place_id": 266720693,
"licence": "Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright",
"osm_type": "way",
"osm_id": 10591437,
"boundingbox": [
"40.591203388592",
"40.591303388592",
"-122.34939939898",
"-122.34929939898"
],
"lat": "40.59125338859231",
"lon": "-122.34934939898274",
"display_name": "1088, Burton Drive, Lancer Hills Estates, Redding, Shasta County, California, 96003, United States",
"class": "place",
"type": "house",
"importance": 0.621,
"address": {
"house_number": "1088",
"road": "Burton Drive",
"neighbourhood": "Lancer Hills Estates",
"city": "Redding",
"county": "Shasta County",
"state": "California",
"postcode": "96003",
"country": "United States",
"country_code": "us"
}
}
]
And this is what I actually want: a breakdown of the address string.
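A rough sketch of reading those fields back out in PHP, assuming the JSON body has already been fetched (the HTTP request itself is left out). The helper name addressParts is made up, and the fallback to town or village is an assumption for places that Nominatim does not label as a city.

```php
<?php
// Hypothetical helper: map the first Nominatim result to the parts asked for.
function addressParts(string $json): ?array
{
    $results = json_decode($json, true);
    if (!is_array($results) || empty($results[0]['address'])) {
        return null; // empty result list or unexpected payload
    }
    $address = $results[0]['address'];

    return [
        // Assumption: smaller places may come back as "town" or "village".
        'city'    => $address['city'] ?? $address['town'] ?? $address['village'] ?? null,
        'state'   => $address['state'] ?? null,
        'zipcode' => $address['postcode'] ?? null,
        'country' => $address['country'] ?? null,
    ];
}

// Shortened copy of the response shown above.
$json = '[{"address":{"house_number":"1088","road":"Burton Drive","city":"Redding",'
      . '"state":"California","postcode":"96003","country":"United States","country_code":"us"}}]';

print_r(addressParts($json));
```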
Looking for a Composer package that can determine the language of a text without a huge dependency (no knowledge base larger than 3 MB) and without third-party services.
The text very often consists of only a few words.
For example, I'd like this package to identify with high accuracy the languages of the following fragments:
text on english
Текст на русском
Текст на русском и some words on english
結城友奈は勇者である -鷲尾須美の章- 第2章 「たましい」
விவேகம்
El aeropuerto se considera
Wunderbar steht er da im Silberhaar.
Ein weiß glänzendes
si les faits n’obéissent pas
4 8 15 16 23 42
Mainly interested in reliable detection of the following languages: English, Russian, German, Spanish, Dutch, Italian, French, Chinese, Japanese, Norwegian, Danish, Indian.
A big plus would be if this package is not outdated or abandoned.
PS: It is important that they do not take much memory when running.
I tested the PHP package Text_LanguageDetect with my examples and some other tests, and I am disappointed with the results...
require_once('libs/languagedetect/Text/LanguageDetect.php');
$l = new Text_LanguageDetect();
$l->setNameMode(2);
1.
$l->detect('text on english', 4); // BAD
=> [
"nl" => 0.244,
"fi" => 0.23111111111111,
"sq" => 0.21933333333333,
"et" => 0.21333333333333,
]
2.
$l->detect('Текст на русском', 4); // OK
=> [
"ru" => 0.36770833333333,
"sr" => 0.30083333333333,
"bg" => 0.29145833333333,
"uk" => 0.22354166666667,
]
3.
$l->detect('Текст на русском и some words on english', 4); // ???
=> [
"ru" => 0.17625,
"sr" => 0.14675,
"" => 0.14608333333333,
"bg" => 0.14341666666667,
]
4.
$l->detect('結城友奈は勇者である -鷲尾須美の章- 第2章 「たましい」', 4); // BAD
=> []
5.
$l->detect('விவேகம்', 4); // BAD
=> []
6.
$l->detect('El aeropuerto se considera', 4); // OK
=> [
"es" => 0.49410256410256,
"pt" => 0.32576923076923,
"it" => 0.30230769230769,
"fr" => 0.25333333333333,
]
7.
$l->detect('Wunderbar steht er da im Silberhaar.', 4); // OK
=> [
"de" => 0.39235294117647,
"da" => 0.34078431372549,
"sv" => 0.31029411764706,
"no" => 0.30147058823529,
]
8.
$l->detect('Ein weiß glänzendes', 4); // OK
=> [
"de" => 0.43947368421053,
"nl" => 0.2259649122807,
"cy" => 0.17456140350877,
"fr" => 0.17070175438596,
]
9.
$l->detect('si les faits n’obéissent pas', 4); // OK
=> [
"fr" => 0.37595238095238,
"pt" => 0.23869047619048,
"la" => 0.22880952380952,
"de" => 0.20511904761905,
]
10.
$l->detect('4 8 15 16 23 42', 4); // OK
=> []
11.
$l->detect('accuracy identified', 4); // BAD
=> [
"la" => 0.19368421052632,
"no" => 0.14491228070175,
"es" => 0.13491228070175,
"ro" => 0.13157894736842,
]
12.
$l->detect('big text', 4); // BAD
=> [
"is" => 0.32708333333333,
"tl" => 0.21208333333333,
"nl" => 0.205,
"vi" => 0.20458333333333,
]
13.
$l->detect('very long text is ok', 4); // OK
=> [
"en" => 0.29383333333333,
"nl" => 0.26883333333333,
"tl" => 0.20583333333333,
"hu" => 0.182,
]
14.
$l->detect('symbols', 4); // BAD
=> [
"de" => 0.068095238095238,
"nl" => 0.049523809523809,
"sw" => 0.044285714285714,
"pl" => 0.040952380952381,
]
15.
$l->detect('language', 4); // BAD
=> [
"da" => 0.34875,
"tl" => 0.33458333333333,
"" => 0.33416666666667,
"id" => 0.28291666666667,
]
I'm really very disappointed that such an old and seemingly time-tested package is not working as it should.
I have a lot of checks on short words, where I need to determine that they are English (or at least get English listed as a possibility), because the input will also contain symbols and other special characters.
It is very convenient that the package returns a list of candidate languages, so I could check the top 4 as in the examples. But I suspect there will be false positives, so I don't dare to use it.
Please advise another solution in PHP. Running a separate process from PHP to check the language would not be acceptable.
Here's a list of libraries that can do language detection that I'm aware of:
langid.py (Python)
language-detection (Java)
Chromium language detector (C++)
Language-detector (Java)
Apache Tika's Language Detector (Java)
None of these require the use of a third-party service, but the performance of most solutions is pretty heavily dependent on the length of the input text - YMMV... But most of them can do dozens of languages, so you should be covered, in theory - and if not, most allow you to train your own model.
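Since the question asks for something that stays inside PHP, one cheap first pass is to look at Unicode scripts before any statistical detection. The sketch below is not taken from any of the libraries listed; the helper name scriptCounts and the choice of scripts are assumptions. It can separate Cyrillic, Japanese, Chinese and Tamil input from Latin-script input, but it cannot tell English from German or French, so Latin text would still need a real detector.

```php
<?php
// Hypothetical helper: count characters per Unicode script, most frequent first.
function scriptCounts(string $text): array
{
    $scripts = ['Latin', 'Cyrillic', 'Han', 'Hiragana', 'Katakana', 'Tamil'];
    $counts  = [];
    foreach ($scripts as $script) {
        // \p{...} with the /u modifier matches characters by Unicode script.
        $n = preg_match_all('/\p{' . $script . '}/u', $text);
        if ($n > 0) {
            $counts[$script] = $n;
        }
    }
    arsort($counts);
    return $counts;
}

print_r(scriptCounts('Текст на русском и some words on english'));
print_r(scriptCounts('4 8 15 16 23 42')); // empty array: no letters at all
```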
I need to parse a street address in PHP from a string that might contain abbreviations.
This string comes from a text input.
The fields I need to search are:
street (alphanumeric - might have
building (alphanumeric - might have
number (alphanumeric - might have
area (numeric from 1 to 5)
other (unknown field & used to search in all the above fields in the database)
For example, a user submits one of these texts:
street Main Road Bulding H7 Number 5 Area 1
st Main Road bldg H7 Nr 5 Ar 5
stMain bldgh7
ar5 unknown other search parameter
street Main Road h7 2b
street main street str main road
The outcome I would like to see as a array:
[street]=>Main Road [building]=>h7 [number]=>5 [area]=>1
[street]=>Main Road [building]=>h7 [number]=>5 [area]=>5
[street]=>Main [building]=>h7
[area]=>5 [other]=>unknown other search parameter
[street]=>Main Road [other]=>h7 2b
[street]=>Main Street&&Main Road
My code so far, which doesn't work with examples 3, 4, 5 and 6:
<?php
//posted address
$address = "str main one bldg 5b other param area 1";
//to replace
$replace = ['street'=>['st','str'],
'building'=>['bldg','bld'],
'number'=>['nr','numb','nmbr']];
//replace
foreach($replace as $field=>$abbrs)
foreach($abbrs as $abbr)
$address = str_replace($abbr.' ',$field.' ',$address);
//fields
$fields = array_keys($replace);
//match
if(preg_match_all('/('.implode('|',array_keys($fields)).')\s+([^\s]+)/si', $address, $matches)) {
//matches
$search = array_combine($matches[1], $matches[2]);
//other
$search['other'] = str_replace($matches[0],"",$address);
}else{
//search in all the fields
$search['other'] = $address;
}
//search
print_r($search);
Code tester: http://ideone.com/j3q4YI
Wow, you've got one hairy mess to clean up. I've toiled for a few hours on this. It works on all of your samples, but I would NOT stake my career on it being perfect on all future cases. There are simply too many variations in addresses. I hope you can understand my process and modify it if/when new samples fail to be captured properly. I'll leave all my debugging comments in place, because I reckon you'll use them for future edits.
$addresses=array(
"street Main Road Bulding H7 Number 5 Area 1",
"st Main Road bldg H7 Nr 5 Ar 5",
"stMain bldgh7",
"ar5 unknown other search parameter",
"street Main Road h7 2b",
"street main street str main road"
);
$regex["area"]="/^(.*?)(ar(?:ea)?\s?)([1-5])(.*?)$/i";
$regex["number"]="/^(.*?)(n(?:umbe)?r\s?)([0-9]+)(.*?)$/i";
$regex["building"]="/^(.*?)(bu?i?ldi?n?g\s?)([^\s]+)(.*?)$/i";
$regex["corner"]="/^(.*?str?(?:eet)?)\s?(str?(?:eet)?.*)$/i"; // 2 streets in string
$regex["street"]="/^(.*?)(str?(?:eet)?\s?)([^\s]*(?:\s?ro?a?d|\s?str?e?e?t?|.*?))(\s?.*?)$/i";
$regex["other"]="/^(.+)$/";
$search=[];
foreach($addresses as $i=>$address){
echo "<br><div><b>$address</b> breakdown:</div>";
foreach($regex as $key=>$rgx){
if(strlen($address)>0){
//echo "<div>addr(",strlen($address),") $address</div>";
if(preg_match($rgx,$address,$matches)){
if($key=="other"){
$search[$i][$key]=$matches[0]; // everything that remains
}elseif($key=="corner"){
$search[$i]["street"]=""; // NOTICE suppression
// loop through both halves of corner address omitting element[0]
foreach(array_diff_key($matches,array('')) as $half){
//echo "half= $half<br>";
if(preg_match($regex["street"],$half,$half_matches)){
//print_r($half_matches);
$search[$i]["street"].=(strlen($search[$i]["street"])>0?"&&":"").ucwords($half_matches[3]);
$address=trim($half_matches[1].$half_matches[4]);
// $matches[2] is the discarded identifier
//echo "<div>$key Found: {$search[$i][$key]}</div>";
//echo "<div>Remaining: $address</div>";
}
}
}else{
$search[$i][$key]=($key=="street"?ucwords($matches[3]):$matches[3]);
$address=trim($matches[1].$matches[4]);
// $matches[2] is the discarded identifier
//echo "<div>$key Found: {$search[$i][$key]}</div>";
//echo "<div>Remaining: $address</div>";
//print_r($matches);
}
}
}else{
break; // address is fully processed
}
}
echo "<pre>";
var_export($search[$i]);
echo "</pre>";
}
The output is an array that satisfies your brief, but the keys are out of order because I captured the address components out of order -- this may not matter to you, so I didn't bother re-sorting it.
street Main Road Bulding H7 Number 5 Area 1 breakdown:
array (
'area' => '1',
'number' => '5',
'building' => 'H7',
'street' => 'Main Road',
)
st Main Road bldg H7 Nr 5 Ar 5 breakdown:
array (
'area' => '5',
'number' => '5',
'building' => 'H7',
'street' => 'Main Road',
)
stMain bldgh7 breakdown:
array (
'building' => 'h7',
'street' => 'Main',
)
ar5 unknown other search parameter breakdown:
array (
'area' => '5',
'other' => 'unknown other search parameter',
)
street Main Road h7 2b breakdown:
array (
'street' => 'Main Road',
'other' => 'h7 2b',
)
street main street str main road breakdown:
array (
'street' => 'Main Street&&Main Road',
)
...boy am I glad this project doesn't belong to me. Good luck!
Thank you for the help! I thought that I should do something like multiple preg_matches.
I just found a PHP extension that does exactly what I want.
The library is PHP Postal (https://github.com/openvenues/php-postal) and requires libpostal. It takes about 15-20 seconds to load the library when you run PHP; after that everything works fine.
Total execution time for parsing: 0.00030-0.00060 seconds.
$parsed = Postal\Parser::parse_address("The Book Club 100-106 Leonard St, Shoreditch, London, Greater London, EC2A 4RH, United Kingdom");
foreach ($parsed as $component) {
echo "{$component['label']}: {$component['value']}\n";
}
Output:
house: the book club
house_number: 100-106
road: leonard st
suburb: shoreditch
city: london
state_district: greater london
postcode: ec2a 4rh
country: united kingdom
All I had to do after this was to replace my labels and format the address.
Hope this will help others who want to parse an address in PHP.
OK, let's say when a user selects a country, they are also assigned a "federation". These federations are pretty much region-centric.
Let's say I have something like this:
function getFederation($country_iso) {
// 6 federations
// afc = asian nations
// caf = african nations
// concacaf = north & central america and Caribbean nations
// conmebol = south america
// ofc = Oceanian nations
// uefa = european nations
$afc = array("Japan", "China", "South Korea");
$caf = array("Cameroon", "Chad", "Ivory Coast");
$concacaf = array("United States" , "Canada", "Mexico");
$conmebol = array("Argentina", "Brazil", "Chile");
$ofc = array("Fiji", "New Zealand", "Samoa");
$uefa = array("Spain", "England", "Montenegro");
/*
PSEUDO-code
If $country_iso is in either of six arrays... mark that as the federation...
*/
return $federation;
}
I know, it says a country's name but when it comes down to it, it will be country's iso like JP instead of Japan, CN instead of China, et cetera.
So, I was wondering, is this a feasible thing or is there a better way you'd think?
How about putting all federations into an array, in order to loop through it? Makes things easier, like so:
function countryToFederation($country_iso) {
$federations = array(
"afc" => array("Japan", "China", "South Korea"),
"caf" => array("Cameroon", "Chad", "Ivory Coast"),
"concacaf" => array("United States" , "Canada", "Mexico"),
"conmebol" => array("Argentina", "Brazil", "Chile"),
"ofc" => array("Fiji", "New Zealand", "Samoa"),
"uefa" => array("Spain", "England", "Montenegro"),
);
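foreach($federations as $name => $countries) {
if(in_array($country_iso, $countries)) {
return $name; // the federation key, e.g. "afc", not the list of countries
}
}
return null; // no federation lists this country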
}
If a country can only belong to one federation, I would create one array instead:
$countryToFederationMap = array(
'Japan' => 'AFC',
'China' => 'AFC',
'Cameroon' => 'CAF',
// ...
);
Then the federation is simply:
return $countryToFederationMap[$country];
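Put together with ISO codes, as the question says the real input will be, the lookup could look roughly like this. The constant name COUNTRY_TO_FEDERATION and the handful of entries are illustrative only; the sketch returns null for a code that is not in the map instead of raising a notice.

```php
<?php
// Hypothetical map keyed by ISO 3166-1 alpha-2 code; only a few entries are filled in.
const COUNTRY_TO_FEDERATION = [
    'JP' => 'AFC',
    'CN' => 'AFC',
    'KR' => 'AFC',
    'CM' => 'CAF',
    'US' => 'CONCACAF',
    'BR' => 'CONMEBOL',
    'NZ' => 'OFC',
    'ES' => 'UEFA',
];

// Returns the federation for a country code, or null when the code is unknown.
function getFederation(string $country_iso): ?string
{
    return COUNTRY_TO_FEDERATION[strtoupper($country_iso)] ?? null;
}

var_dump(getFederation('jp')); // string(3) "AFC"
var_dump(getFederation('XX')); // NULL
```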