I'm not sure if this is an anti-pattern or not, but it feels a bit convoluted, so I'd like to get your opinion on how these cases should be handled:
Let's say we have this data:
$sofas[0]['color'] = 'green';
$sofas[0]['pillows'] = 8;
$sofas[0]['pattern'] = 'moons';
$sofas[1]['color'] = 'green';
$sofas[1]['pillows'] = 8;
$sofas[1]['pattern'] = 'ducks';
$sofas[1]['footrest'] = 'small';
$sofas[2]['color'] = 'green';
$sofas[2]['pillows'] = 8;
$sofas[2]['pattern'] = 'stripes';
$sofas[2]['speakers'] = 'badass';
color, pillows and pattern come from the database, whilst "footrest" and "speakers" have been added by an API.
We can say for the sake of argument that there are 1250 different attributes that can be added by the api like "footrest" and "speakers".
We now want to load some data from the database based on these attributes, for example an image.
So we have a table that looks like this:
ID  , attribute_value   , image
1   , 'color_green'     , 'img0023'
2   , 'pillows_8'       , 'img003'
3   , 'pattern_moons'   , 'img002'
6   , 'pattern_ducks'   , 'img0083'
7   , 'footrest_small'  , 'img0058'
10  , 'pattern_stripes' , 'img0073'
11  , 'speakers_badass' , 'img00pluto'
etc , etc               , etc
So, the way I figure it, I can approach this in two ways:
$sofaSQL="'color_green',
'pillows_8',
'pattern_moons',
'pattern_ducks',
'footrest_small',
'pattern_stripes',
'speakers_badass'";
$sql = "SELECT ID, attribute_value, image
FROM `example`
WHERE attribute_value IN ($sofaSQL)";
and then loop through the array and check if the key + '_' + value matches the rows in the recordset to see what images should be used for sofas[0], sofas[1] and sofas[2].
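The first approach (one IN query, then matching each key + '_' + value back to its sofa) could look roughly like the sketch below. It is written in Python with an in-memory SQLite table standing in for the real database. The table name `example` and its columns come from the question; the function name `images_for_sofas` and the use of bound placeholders are assumptions made for illustration.

```python
import sqlite3

def images_for_sofas(conn, sofas):
    """Fetch every needed image with one query, then map the rows back per sofa."""
    # Distinct "key_value" strings across all sofas, e.g. "pattern_ducks".
    wanted = sorted({f"{key}_{value}" for sofa in sofas for key, value in sofa.items()})
    if not wanted:
        return [{} for _ in sofas]
    placeholders = ",".join("?" for _ in wanted)
    rows = conn.execute(
        "SELECT attribute_value, image FROM example "
        f"WHERE attribute_value IN ({placeholders})",
        wanted,
    ).fetchall()
    image_by_attribute = dict(rows)
    # Each sofa only receives images for the attributes it actually has.
    return [
        {
            key: image_by_attribute[f"{key}_{value}"]
            for key, value in sofa.items()
            if f"{key}_{value}" in image_by_attribute
        }
        for sofa in sofas
    ]

# Stand-in for the real table, filled with the rows shown in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example (ID INTEGER, attribute_value TEXT, image TEXT)")
conn.executemany("INSERT INTO example VALUES (?, ?, ?)", [
    (1, "color_green", "img0023"),
    (2, "pillows_8", "img003"),
    (3, "pattern_moons", "img002"),
    (6, "pattern_ducks", "img0083"),
    (7, "footrest_small", "img0058"),
    (10, "pattern_stripes", "img0073"),
    (11, "speakers_badass", "img00pluto"),
])

sofas = [
    {"color": "green", "pillows": 8, "pattern": "moons"},
    {"color": "green", "pillows": 8, "pattern": "ducks", "footrest": "small"},
    {"color": "green", "pillows": 8, "pattern": "stripes", "speakers": "badass"},
]
result = images_for_sofas(conn, sofas)
print(result[1])
```

The database is hit once no matter how many sofas there are, and the per-sofa matching is a dictionary lookup rather than a scan of the recordset.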
The other option I see would be to prep each sofa with a different sql statement, ie:
$sofaSQL="'color_green',
'pillows_8',
'pattern_moons'";
-add images-
$sofaSQL="'color_green',
'pillows_8',
'pattern_ducks',
'footrest_small'";
-add images-
$sofaSQL="'color_green',
'pillows_8',
'pattern_stripes',
'speakers_badass'";
-add images-
That seems simpler, but it doesn't feel right to hammer the database with a separate request for each item in the array.
So, what would you recommend in this case? Is there a better way of dealing with attributes that are selected randomly or come from an API?
I'm definitely not the worst when it comes down to regex, but this one has got me stumped.
In short, this is the code I currently have.
$aNumbers = array(
'612345678',
'546123465',
'131234567',
'+31(0)612345678'
);
foreach($aNumbers as $sNumber) {
$aMatches = array();
$sNumber = preg_replace('/(\(0\)|[^\d]+)/', '', $sNumber);
preg_match('/(\d{1,2})?(\d{3})(\d{3})(\d{3})$/', $sNumber, $aMatches);
var_dump($sNumber);
var_dump($aMatches);
}
Simply put, I want to match specific formats for telephone numbers to ensure a unified display.
+31(0)612345678
+31(0)131234567
Stripped, both would lose the + and the (0).
Broken down into parts:
31       6    123 456 78
Country  Net  Number
31       13   123 456 78
Country  Net  Number
Now, in some cases the country code (+31, or +1, +222) is optional. The 6 and 13 are always included, but as a fun twist, the following format is also possible:
31       546  123 456
Country  Net  Number
Is this even possible with regex?
I've answered a few of these types of questions, and my strategy is to identify certain portions of formatting or number relationships that convey meaning, and get rid of the rest.
One of my examples that parses non-NANP number formatting uses a list of valid area codes in the parsing expression, and identifies country code when present. It extracts the country code, area code, and then the rest of the number.
For your country, I am assuming the list of area/net/region codes in HansM's answer is either correct or easily replaceable, so I'll guess that this modification of a regex might be useful:
^[ -]*(\+31)?[ -]*[(0)]*[ -]*(7|43|32|45|33|49|39|31|47|34|46|41|90|44|351|353|358)[ -]*((?:\d[ -]*)+)
It will first match the country code, if it is present, and store it in back-reference 1, then skip an optional zero, with or without parentheses (strictly speaking, the character class [(0)]* accepts any run of the characters "(", "0" and ")"). It will then match one of the area/net/region codes and store it in back-reference 2. Finally, it will take one or more digits, mixed with dashes (-) and/or spaces, and store those in back-reference 3.
After this, you could check the third group for validity or reformat it further.
I'm testing it on Regex 101, but I could use a list of acceptable and unacceptable input, and how it should be reformatted when acceptable...
[EDIT]
I've used this list of city codes for the Netherlands and modified the expression thusly:
^[ -]*(\+31)?[ -]*[(0)]*[ -]*([123457]0|23|24|26|35|45|71|73|570)[ -]*((?:\d[ -]*)+)
which performs the following parsing:
input                 (1)    (2)    (3)
--------------------- ------ ------ ---------------
0707123456                   70     7123456
0267-123456                  26     7-123456
0407-12 34 56                40     7-12 34 56
0570123456                   570    123456
07312345                     73     12345
+31(0)734423211       +31    73     4423211
but I still don't know if that's helpful for you
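To try the expression outside Regex 101, a small sketch in Python (the question uses PHP, but the pattern is unchanged) might look like this. The helper name `split_number` and the extra step of stripping spaces and dashes from the third group are assumptions, not part of the answer above.

```python
import re

# The expression from the edit above, split over three lines for readability.
NL_NUMBER = re.compile(
    r"^[ -]*(\+31)?[ -]*[(0)]*[ -]*"
    r"([123457]0|23|24|26|35|45|71|73|570)"
    r"[ -]*((?:\d[ -]*)+)"
)

def split_number(raw):
    """Return (country, area, subscriber) or None when the input does not match."""
    match = NL_NUMBER.match(raw)
    if match is None:
        return None
    country, area, rest = match.groups()
    # Hypothetical clean-up step: drop the separators kept in back-reference 3.
    return country, area, re.sub(r"[ -]", "", rest)

for raw in ["0707123456", "0407-12 34 56", "+31(0)734423211", "12345"]:
    print(raw, "->", split_number(raw))
```

An input that starts with none of the listed codes, such as the last one, yields `None`, which gives a natural place to reject or flag the number.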
[EDIT 2]
Wikipedia has what appears to be a more comprehensive list of codes:
010, 0111, 0113, 0114, 0115, 0117, 0118, 013, 015, 0161, 0162, 0164, 0165, 0166, 0167, 0168, 0172, 0174, 0180, 0181, 0182, 0183, 0184, 0186, 0187, 020, 0222, 0223, 0224, 0226, 0227, 0228, 0229, 023, 024, 0251, 0252, 0255, 026, 0294, 0297, 0299, 030, 0313, 0314, 0315, 0316, 0317, 0318, 0320, 0321, 033, 0341, 0342, 0343, 0344, 0345, 0346, 0347, 0348, 035, 036, 038, 040, 0411, 0412, 0413, 0416, 0418, 043, 045, 046, 0475, 0478, 0481, 0485, 0486, 0487, 0488, 0492, 0493, 0495, 0497, 0499, 050, 0511, 0512, 0513, 0514, 0515, 0516, 0517, 0518, 0519, 0521, 0522, 0523, 0524, 0525, 0527, 0528, 0529, 053, 0541, 0543, 0544, 0545, 0546, 0547, 0548, 055, 0561, 0562, 0566, 0570, 0571, 0572, 0573, 0575, 0577, 0578, 058, 0591, 0592, 0593, 0594, 0595, 0596, 0597, 0598, 0599, 070, 071, 072, 073, 074, 075, 076, 077, 078, 079
which can be used in the code selection portion like this (if you'd prefer it to be more easily read and updated):
10|111|113|114|115|117|118|13|15|161|162|164|165|166|167|168|172|174|180|181|182|183|184|186|187|20|222|223|224|226|227|228|229|23|24|251|252|255|26|294|297|299|30|313|314|315|316|317|318|320|321|33|341|342|343|344|345|346|347|348|35|36|38|40|411|412|413|416|418|43|45|46|475|478|481|485|486|487|488|492|493|495|497|499|50|511|512|513|514|515|516|517|518|519|521|522|523|524|525|527|528|529|53|541|543|544|545|546|547|548|55|561|562|566|570|571|572|573|575|577|578|58|591|592|593|594|595|596|597|598|599|70|71|72|73|74|75|76|77|78|79
or like this (if you'd prefer a more efficient evaluation of the expression):
1([035]|1[134578]|6[124-8]|7[24]|8[0-467])|2([0346]|2[2346-9]|5[125]|9[479])|3([03568]|1[34-8]|2[01]|4[1-8])|4([0356]|1[12368]|7[58]|8[15-8]|9[23579])|5([0358]|[19][1-9]|2[1-5789]|4[13-8]|6[126]|7[0-3578])|7[0-9]
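If you keep both forms around, it is worth checking that they still accept the same codes after every update. A minimal sketch of such a check in Python follows; the names `READABLE`, `COMPACT` and `mismatches` are made up for this example, and the brute-force loop relies on every code in the list having at most three digits.

```python
import re

# The readable alternation, copied from above.
READABLE = (
    "10|111|113|114|115|117|118|13|15|161|162|164|165|166|167|168|172|174|"
    "180|181|182|183|184|186|187|20|222|223|224|226|227|228|229|23|24|251|"
    "252|255|26|294|297|299|30|313|314|315|316|317|318|320|321|33|341|342|"
    "343|344|345|346|347|348|35|36|38|40|411|412|413|416|418|43|45|46|475|"
    "478|481|485|486|487|488|492|493|495|497|499|50|511|512|513|514|515|516|"
    "517|518|519|521|522|523|524|525|527|528|529|53|541|543|544|545|546|547|"
    "548|55|561|562|566|570|571|572|573|575|577|578|58|591|592|593|594|595|"
    "596|597|598|599|70|71|72|73|74|75|76|77|78|79"
)

# The compact form, copied from above.
COMPACT = (
    "1([035]|1[134578]|6[124-8]|7[24]|8[0-467])"
    "|2([0346]|2[2346-9]|5[125]|9[479])"
    "|3([03568]|1[34-8]|2[01]|4[1-8])"
    "|4([0356]|1[12368]|7[58]|8[15-8]|9[23579])"
    "|5([0358]|[19][1-9]|2[1-5789]|4[13-8]|6[126]|7[0-3578])"
    "|7[0-9]"
)

readable_re = re.compile(f"(?:{READABLE})")
compact_re = re.compile(f"(?:{COMPACT})")

# Every candidate code of one to three digits must get the same verdict from both.
mismatches = [
    str(n) for n in range(1000)
    if bool(readable_re.fullmatch(str(n))) != bool(compact_re.fullmatch(str(n)))
]
print("codes that disagree:", mismatches)
```

Editing the readable list and regenerating or re-checking the compact one this way keeps the two from drifting apart.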
I have used the nuget package libphonenumber-csharp.
That has helped me to create a (Dutch) phone number validator. Here is a code snippet; it will not compile without the other parts of my solution, but at least you can get an idea of how to handle this.
public override void Validate()
{
ValidationMessages = new Dictionary<string, string>();
ErrorMessage = string.Empty;
string phoneNumber;
string countryCode = _defaultCountryCode;
// If the phoneNumber is not required, it is allowed to be empty.
// So in that case isValid gets defaultvalue true
bool isValid = (!_isRequired);
if (!string.IsNullOrEmpty(_phoneNumber))
{
var phoneUtil = PhoneNumberUtil.GetInstance();
try
{
phoneNumber = PhoneNumbers.PhoneNumberUtil.Normalize(_phoneNumber);
countryCode = PhoneNumberUtil2.GetRegionCode(phoneNumber, _defaultCountryCode);
PhoneNumber oPhoneNumber = phoneUtil.Parse(phoneNumber, countryCode);
var t1 = oPhoneNumber.NationalNumber;
var t2 = oPhoneNumber.CountryCode;
var formattedNo = phoneUtil.Format(oPhoneNumber, PhoneNumberFormat.E164);
isValid = PhoneNumbers.PhoneNumberUtil.IsViablePhoneNumber(formattedNo);
}
catch (NumberParseException e)
{
var err = e.ToString();
isValid = false;
}
}
if ((isValid) && (!string.IsNullOrEmpty(_phoneNumber)))
{
Regex regexValidator = null;
string regex;
// Additional validations for Dutch phone numbers, as LibPhoneNumber is too lenient when
// deciding whether a number is valid.
switch (countryCode)
{
case "NL":
if (_phoneNumber.StartsWith("0800") || _phoneNumber.StartsWith("0900"))
{
// 0800/0900 numbers
regex = @"((0800|0900)(-| )?[0-9]{4}([0-9]{3})?$)";
regexValidator = new Regex(regex);
isValid = regexValidator.IsMatch(_phoneNumber);
}
else
{
string phoneNumberCheck = _phoneNumber.Replace("(", "").Replace(")", "").Replace("-", "").Replace(" ", "");
regex = @"^(0031|\+31|0)[1-9][0-9]{8}$";
regexValidator = new Regex(regex);
isValid = regexValidator.IsMatch(phoneNumberCheck);
}
break;
}
}
if (!isValid)
{
ErrorMessage = string.Format(TextProvider.Get(TextProviderConstants.ValMsg_IsInAnIncorrectFormat_0),
ColumnInfoProvider.GetLabel(_labelKey));
ValidationMessages.Add(_messageKey, ErrorMessage);
}
}
Also useful might be my class PhoneNumberUtil2 that builds upon the nuget package libphonenumber-csharp:
// Code start
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;
using PhoneNumbers;
namespace ProjectName.Logic.Miscellaneous
{
public class PhoneNumberUtil2
{
/// <summary>
/// Returns the alphanumeric country code for a normalized phonenumber. If a phonenumber does not contain
/// an international numeric country code, the default country code for the website is returned.
/// This works for 17 countries: NL, GB, FR, DE, BE, AT, SE, NO, IT, TR, RU, CH, DK, IE, PT, ES, FI
/// </summary>
/// <param name="normalizedPhoneNumber"></param>
/// <param name="defaultCountryCode"> </param>
/// <returns></returns>
public static string GetRegionCode(string normalizedPhoneNumber, string defaultCountryCode)
{
if (normalizedPhoneNumber.Length > 10)
{
var dict = new Dictionary<string, string>();
dict.Add("7", "RU");
dict.Add("43", "AT");
dict.Add("32", "BE");
dict.Add("45", "DK");
dict.Add("33", "FR");
dict.Add("49", "DE");
dict.Add("39", "IT");
dict.Add("31", "NL");
dict.Add("47", "NO");
dict.Add("34", "ES");
dict.Add("46", "SE");
dict.Add("41", "CH");
dict.Add("90", "TR");
dict.Add("44", "GB");
dict.Add("351", "PT");
dict.Add("353", "IE");
dict.Add("358", "FI");
// First check 3-digits International Calling Codes
if (dict.ContainsKey(normalizedPhoneNumber.Substring(0, 3)))
{
return dict[normalizedPhoneNumber.Substring(0, 3)];
}
// Then 2-digits International Calling Codes
if (dict.ContainsKey(normalizedPhoneNumber.Substring(0, 2)))
{
return dict[normalizedPhoneNumber.Substring(0, 2)];
}
// And finally 1-digit International Calling Codes
if (dict.ContainsKey(normalizedPhoneNumber.Substring(0, 1)))
{
return dict[normalizedPhoneNumber.Substring(0, 1)];
}
}
return defaultCountryCode;
}
}
}
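The lookup in GetRegionCode boils down to trying the longest calling-code prefix first. A compact sketch of the same idea in Python is shown below, using the same 17 codes. The names `CALLING_CODES` and `get_region_code` are invented for this example, and it assumes the input is already normalized to digits with no leading "+" or "00".

```python
# Calling code -> region, the same 17 entries as the C# dictionary above.
CALLING_CODES = {
    "7": "RU", "43": "AT", "32": "BE", "45": "DK", "33": "FR", "49": "DE",
    "39": "IT", "31": "NL", "47": "NO", "34": "ES", "46": "SE", "41": "CH",
    "90": "TR", "44": "GB", "351": "PT", "353": "IE", "358": "FI",
}

def get_region_code(normalized_number, default_region):
    """Return the region for a digits-only number, or the default when unknown."""
    # As in the C# version, numbers of 10 digits or fewer are treated as national.
    if len(normalized_number) > 10:
        # Try 3-digit codes first, then 2-digit, then 1-digit.
        for length in (3, 2, 1):
            region = CALLING_CODES.get(normalized_number[:length])
            if region is not None:
                return region
    return default_region

print(get_region_code("35312345678", "NL"))
print(get_region_code("0612345678", "NL"))
```

Checking the longest prefix first matters once the table grows: with only shorter codes tried first, a number for a three-digit code could be claimed by a one- or two-digit code sharing its leading digits.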
Do you know of a utility or a web site where I can give a US city, state and radial distance in miles as input, and it would return all the cities within that radius?
Thanks!
Here is how I do it.
You can obtain a list of cities, states and zip codes with their latitudes and longitudes.
(I can't recall off the top of my head where we got ours.)
Edit: http://geonames.usgs.gov/domestic/download_data.htm, as someone mentioned above, would probably work.
Then you can write a method to calculate the min and max latitude and longitude based on a radius, and query for all cities between those min and max values. Then loop through the results, calculate the actual distance, and remove any that are not within the radius.
double latitude1 = Double.parseDouble(zipCodes.getLatitude().toString());
double longitude1 = Double.parseDouble(zipCodes.getLongitude().toString());
//Upper reaches of possible boundaries
double upperLatBound = latitude1 + Double.parseDouble(distance)/40.0;
double lowerLatBound = latitude1 - Double.parseDouble(distance)/40.0;
double upperLongBound = longitude1 + Double.parseDouble(distance)/40.0;
double lowerLongBound = longitude1 - Double.parseDouble(distance)/40.0;
//pull back possible matches
SimpleCriteria zipCriteria = new SimpleCriteria();
zipCriteria.isBetween(ZipCodesPeer.LONGITUDE, lowerLongBound, upperLongBound);
zipCriteria.isBetween(ZipCodesPeer.LATITUDE, lowerLatBound, upperLatBound);
List zipList = ZipCodesPeer.doSelect(zipCriteria);
ArrayList acceptList = new ArrayList();
if(zipList != null)
{
for(int i = 0; i < zipList.size(); i++)
{
ZipCodes tempZip = (ZipCodes)zipList.get(i);
double tempLat = Double.parseDouble(tempZip.getLatitude().toString());
double tempLon = Double.parseDouble(tempZip.getLongitude().toString());
// Great-circle distance in miles (spherical law of cosines, Earth radius 3963 miles)
double d = 3963.0 * Math.acos(Math.sin(latitude1 * Math.PI/180) * Math.sin(tempLat * Math.PI/180) + Math.cos(latitude1 * Math.PI/180) * Math.cos(tempLat * Math.PI/180) * Math.cos(tempLon*Math.PI/180 -longitude1 * Math.PI/180));
if(d < Double.parseDouble(distance))
{
acceptList.add(((ZipCodes)zipList.get(i)).getZipCd());
}
}
}
That's an excerpt of my code; hopefully you can see what's happening. I start out with one ZipCodes record (ZipCodes is a table in my DB), then I pull back possible matches, and finally I weed out those that are not within the radius.
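Stripped of the database layer, the two steps (a cheap bounding-box pre-filter, then an exact great-circle check) can be sketched as follows in Python. The function names and the three sample cities are assumptions for illustration; the divisor 40.0 and the radius 3963.0 are taken from the excerpt above, and the clamp before `acos` is an added guard against rounding errors.

```python
import math

EARTH_RADIUS_MILES = 3963.0

def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance via the spherical law of cosines."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_lon = math.radians(lon2 - lon1)
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(delta_lon))
    # Clamp to [-1, 1] so rounding errors cannot push acos out of its domain.
    return EARTH_RADIUS_MILES * math.acos(max(-1.0, min(1.0, cos_angle)))

def places_within(origin, places, radius_miles):
    """Return the names of places whose distance from origin is below the radius."""
    lat0, lon0 = origin
    # Rough box in degrees, using the same generous divisor as the excerpt.
    box = radius_miles / 40.0
    candidates = [
        (name, lat, lon) for name, lat, lon in places
        if abs(lat - lat0) <= box and abs(lon - lon0) <= box
    ]
    return [
        name for name, lat, lon in candidates
        if distance_miles(lat0, lon0, lat, lon) < radius_miles
    ]

new_york = (40.7128, -74.0060)
places = [
    ("Philadelphia", 39.9526, -75.1652),
    ("Newark", 40.7357, -74.1724),
    ("Boston", 42.3601, -71.0589),
]
print(places_within(new_york, places, 100))
```

In a real setup the box comparison would be the WHERE clause of the query, so an index on latitude and longitude can discard most rows before any trigonometry is done.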
Oracle, PostGIS, MySQL with GIS extensions and SQLite with GIS extensions all support this kind of query.
If you don't have the dataset look at:
http://www.geonames.org/
Take a look at this web service advertised on xmethods.net. It requires a subscription to actually use, but claims to do what you need.
The advertised method in question's description:
GetPlacesWithin: Returns a list of geo places within a specified distance from a given place. Parameters: place - place name (65 char max), state - 2 letter state code (not required for zip codes), distance - distance in miles, placeTypeToFind - type of place to look for: ZipCode or City (including any villages, towns, etc).
http://xmethods.net/ve2/ViewListing.po?key=uuid:5428B3DD-C7C6-E1A8-87D6-461729AF02C0
You can obtain a pretty good database of geolocated cities/placenames from http://geonames.usgs.gov. Find an appropriate database dump and import it into your DB; performing the kind of query you need is then pretty straightforward, particularly if your DBMS supports some kind of spatial queries (e.g. Oracle Spatial, MySQL Spatial Extensions, PostGIS or SQL Server 2008).
See also: how to do location based search
I do not have a website, but we have implemented this both in Oracle as a database function and in SAS as a statistics macro. It only requires a database with all cities and their lat and long.
Maybe this can help. The project is configured for kilometers, though; you can change the units in CityDAO.java:
public List<City> findCityInRange(GeoPoint geoPoint, double distance) {
List<City> cities = new ArrayList<City>();
QueryBuilder queryBuilder = geoDistanceQuery("geoPoint")
.point(geoPoint.getLat(), geoPoint.getLon())
//.distance(distance, DistanceUnit.KILOMETERS) original
.distance(distance, DistanceUnit.MILES)
.optimizeBbox("memory")
.geoDistance(GeoDistance.ARC);
SearchRequestBuilder builder = esClient.getClient()
.prepareSearch(INDEX)
.setTypes("city")
.setSearchType(SearchType.QUERY_THEN_FETCH)
.setScroll(new TimeValue(60000))
.setSize(100).setExplain(true)
.setPostFilter(queryBuilder)
.addSort(SortBuilders.geoDistanceSort("geoPoint")
.order(SortOrder.ASC)
.point(geoPoint.getLat(), geoPoint.getLon())
//.unit(DistanceUnit.KILOMETERS)); Original
.unit(DistanceUnit.MILES));
SearchResponse response = builder
.execute()
.actionGet();
scroll:
while (true) {
// Read the hits of the current page; the scroll call below replaces response
for (SearchHit hit : response.getHits().getHits()) {
Map<String, Object> result = hit.getSource();
cities.add(mapper.convertValue(result, City.class));
}
response = esClient.getClient().prepareSearchScroll(response.getScrollId()).setScroll(new TimeValue(60000)).execute().actionGet();
if (response.getHits().getHits().length == 0) {
break scroll;
}
}
return cities;
}
The "LocationFinder\src\main\resources\json\cities.json" file contains all cities in Belgium. You can delete or create entries if you want to. As long as you don't change the names and/or structure, no code changes are required.
Make sure to read the README https://github.com/GlennVanSchil/LocationFinder
In my news page project, I have a database table news with the following structure:
- id: [integer] unique number identifying the news entry, e.g.: *1983*
- title: [string] title of the text, e.g.: *New Life in America No Longer Means a New Name*
- topic: [string] category which should be chosen by the classifier, e.g.: *Sports*
Additionally, there's a table bayes with information about word frequencies:
- word: [string] a word which the frequencies are given for, e.g.: *real estate*
- topic: [string] same content as "topic" field above, e.g.: *Economics*
- count: [integer] number of occurrences of "word" in "topic" (incremented when new documents go to "topic"), e.g: *100*
Now I want my PHP script to classify all news entries and assign one of several possible categories (topics) to them.
Is this the correct implementation? Can you improve it?
<?php
include 'mysqlLogin.php';
$get1 = "SELECT id, title FROM ".$prefix."news WHERE topic = '' LIMIT 0, 150";
$get2 = mysql_abfrage($get1);
// pTOPICS BEGIN
$pTopics1 = "SELECT topic, SUM(count) AS count FROM ".$prefix."bayes WHERE topic != '' GROUP BY topic";
$pTopics2 = mysql_abfrage($pTopics1);
$pTopics = array();
while ($pTopics3 = mysql_fetch_assoc($pTopics2)) {
$pTopics[$pTopics3['topic']] = $pTopics3['count'];
}
// pTOPICS END
// pWORDS BEGIN
$pWords1 = "SELECT word, topic, count FROM ".$prefix."bayes";
$pWords2 = mysql_abfrage($pWords1);
$pWords = array();
while ($pWords3 = mysql_fetch_assoc($pWords2)) {
if (!isset($pWords[$pWords3['topic']])) {
$pWords[$pWords3['topic']] = array();
}
$pWords[$pWords3['topic']][$pWords3['word']] = $pWords3['count'];
}
// pWORDS END
while ($get3 = mysql_fetch_assoc($get2)) {
$pTextInTopics = array();
$tokens = tokenizer($get3['title']);
foreach ($pTopics as $topic=>$documentsInTopic) {
if (!isset($pTextInTopics[$topic])) { $pTextInTopics[$topic] = 1; }
foreach ($tokens as $token) {
echo '....'.$token;
if (isset($pWords[$topic][$token])) {
$pTextInTopics[$topic] *= $pWords[$topic][$token]/array_sum($pWords[$topic]);
}
}
$pTextInTopics[$topic] *= $pTopics[$topic]/array_sum($pTopics); // #documentsInTopic / #allDocuments
}
asort($pTextInTopics); // pick topic with lowest value
if ($chosenTopic = each($pTextInTopics)) {
echo '<p>The text belongs to topic '.$chosenTopic['key'].' with a likelihood of '.$chosenTopic['value'].'</p>';
}
}
?>
The training is done manually and isn't included in this code. If the text "You can make money if you sell real estates" is assigned to the category/topic "Economics", then all words (you, can, make, ...) are inserted into the table bayes with "Economics" as the topic and 1 as the initial count. If the word is already there in combination with the same topic, the count is incremented.
Sample learning data:
word topic count
kaczynski Politics 1
sony Technology 1
bank Economics 1
phone Technology 1
sony Economics 3
ericsson Technology 2
Sample output/result:
Title of the text: Phone test Sony Ericsson Aspen - sensitive Winberry
Politics
....phone
....test
....sony
....ericsson
....aspen
....sensitive
....winberry
Technology
....phone FOUND
....test
....sony FOUND
....ericsson FOUND
....aspen
....sensitive
....winberry
Economics
....phone
....test
....sony FOUND
....ericsson
....aspen
....sensitive
....winberry
Result: The text belongs to topic Technology with a likelihood of 0.013888888888889
Thank you very much in advance!
It looks like your code is correct, but there are a few easy ways to optimize it. For example, you calculate p(word|topic) on the fly for every word, while you could easily calculate these values beforehand. (I'm assuming you want to classify multiple documents here; if you're only doing a single document, I suppose this is okay, since you don't calculate it for words not in the document.)
Similarly, the calculation of p(topic) could be moved outside of the loop.
Finally, you don't need to sort the entire array to find the maximum.
All small points! But that's what you asked for :)
I've written some untested PHP-code showing how I'd implement this below:
<?php
// Get word counts from database
$nWordPerTopic = mystery_sql();
// Calculate p(word|topic) = nWord / sum(nWord for every word)
$nTopics = array();
$pWordPerTopic = array();
foreach($nWordPerTopic as $topic => $wordCounts)
{
// Get total word count in topic
$nTopic = array_sum($wordCounts);
// Calculate p(word|topic)
$pWordPerTopic[$topic] = array();
foreach($wordCounts as $word => $count)
$pWordPerTopic[$topic][$word] = $count / $nTopic;
// Save $nTopic for next step
$nTopics[$topic] = $nTopic;
}
// Calculate p(topic)
$nTotal = array_sum($nTopics);
$pTopics = array();
foreach($nTopics as $topic => $nTopic)
$pTopics[$topic] = $nTopic / $nTotal;
// Classify
foreach($documents as $document)
{
$title = $document['title'];
$tokens = tokenizer($title);
$pMax = -1;
$selectedTopic = null;
foreach($pTopics as $topic => $pTopic)
{
$p = $pTopic;
foreach($tokens as $word)
{
if (!array_key_exists($word, $pWordPerTopic[$topic]))
continue;
$p *= $pWordPerTopic[$topic][$word];
}
if ($p > $pMax)
{
$selectedTopic = $topic;
$pMax = $p;
}
}
}
?>
As for the maths...
You're trying to maximize p(topic|words), so find

$$\arg\max_{\text{topic}} \; p(\text{topic} \mid \text{words})$$

(i.e. the argument topic for which p(topic|words) is the highest)
Bayes' theorem says

$$p(\text{topic} \mid \text{words}) = \frac{p(\text{topic}) \, p(\text{words} \mid \text{topic})}{p(\text{words})}$$

So you're looking for

$$\arg\max_{\text{topic}} \; \frac{p(\text{topic}) \, p(\text{words} \mid \text{topic})}{p(\text{words})}$$

Since p(words) of a document is the same for any topic, this is the same as finding

$$\arg\max_{\text{topic}} \; p(\text{topic}) \, p(\text{words} \mid \text{topic})$$

The naive Bayes assumption (which makes this a naive Bayes classifier) is that

$$p(\text{words} \mid \text{topic}) = p(\text{word}_1 \mid \text{topic}) \cdot p(\text{word}_2 \mid \text{topic}) \cdots$$

So using this, you need to find

$$\arg\max_{\text{topic}} \; p(\text{topic}) \cdot p(\text{word}_1 \mid \text{topic}) \cdot p(\text{word}_2 \mid \text{topic}) \cdots$$

Where

$$p(\text{topic}) = \frac{\text{number of words in topic}}{\text{number of words in total}}$$

And

$$p(\text{word} \mid \text{topic}) = \frac{p(\text{word}, \text{topic})}{p(\text{topic})} = p(\text{word}, \text{topic}) \cdot \frac{1}{p(\text{topic})}$$

$$= \frac{\text{number of times word occurs in topic}}{\text{number of words in total}} \cdot \frac{\text{number of words in total}}{\text{number of words in topic}}$$

$$= \frac{\text{number of times word occurs in topic}}{\text{number of words in topic}}$$
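The arg max above can be sketched in a few lines of Python using the sample learning data from the question. Two things here are not in the original code and are assumptions of this sketch: log-probabilities are summed instead of multiplying raw probabilities (to avoid underflow on long texts), and add-one (Laplace) smoothing is used so that a word a topic has never seen lowers that topic's score instead of being skipped. The names `train` and `classify` are hypothetical.

```python
import math
from collections import defaultdict

def train(rows):
    """Build {topic: {word: count}} from (word, topic, count) rows."""
    counts = defaultdict(dict)
    for word, topic, count in rows:
        counts[topic][word] = counts[topic].get(word, 0) + count
    return dict(counts)

def classify(tokens, counts):
    """Return arg max over topics of log p(topic) + sum of log p(word|topic)."""
    vocabulary = {word for words in counts.values() for word in words}
    total = sum(sum(words.values()) for words in counts.values())
    best_topic, best_score = None, -math.inf
    for topic, words in counts.items():
        n_topic = sum(words.values())
        # p(topic) = number of words in topic / number of words in total
        score = math.log(n_topic / total)
        for token in tokens:
            # Add-one smoothing (an assumption, not in the original code).
            score += math.log((words.get(token, 0) + 1) / (n_topic + len(vocabulary)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

rows = [
    ("kaczynski", "Politics", 1), ("sony", "Technology", 1),
    ("bank", "Economics", 1), ("phone", "Technology", 1),
    ("sony", "Economics", 3), ("ericsson", "Technology", 2),
]
tokens = ["phone", "test", "sony", "ericsson", "aspen", "sensitive", "winberry"]
print(classify(tokens, train(rows)))
```

With smoothing, the topic with the highest score is the one to pick, so there is no need to sort at all: a single pass that remembers the best topic is enough.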