Using a Naive Bayes Classifier to classify tweets: some problems - php

Using, amongst other sources, various posts here on Stack Overflow, I'm trying to implement my own PHP classifier to classify tweets into a positive, neutral and negative class. Before coding, I need to get the process straight. My train of thought and an example are as follows:
Bayes' theorem:
    p(class|words) = p(class) * p(words|class) / p(words)
The assumption that p(words) is the same for every class leads to calculating
    arg max p(class) * p(words|class)
with
    p(words|class) = p(word1|class) * p(word2|class) * ...
and
    p(class) = #words in class / #words in total
and
    p(word|class) = p(word, class) / p(class)
                  = (#times word occurs in class / #words in total) * (#words in total / #words in class)
                  = #times word occurs in class / #words in class
Example:
------+----------------+-----------------+
class | words          | #words in class |
------+----------------+-----------------+
pos   | happy win nice | 3               |
neu   | neutral middle | 2               |
neg   | sad loose bad  | 3               |
------+----------------+-----------------+
p(pos) = 3/8
p(neu) = 2/8
p(neg) = 3/8
Calculate: argmax(sad loose)
p(sad loose|pos) = p(sad|pos) * p(loose|pos) = (0+1)/3 * (0+1)/3 = 1/9
p(sad loose|neu) = p(sad|neu) * p(loose|neu) = (0+1)/3 * (0+1)/3 = 1/9
p(sad loose|neg) = p(sad|neg) * p(loose|neg) = 1/3 * 1/3 = 1/9
p(pos) * p(sad loose|pos) = 3/8 * 1/9 = 0.0416666667
p(neu) * p(sad loose|neu) = 2/8 * 1/9 = 0.0277777778
p(neg) * p(sad loose|neg) = 3/8 * 1/9 = 0.0416666667 <-- should be 100% neg!
As you can see, I have "trained" the classifier with a positive ("happy win nice"), a neutral ("neutral middle") and a negative ("sad loose bad") tweet. In order to prevent problems of having probabilities of zero because of one word missing in all classes, I'm using Laplace (or "add one") smoothing, see "(0+1)".
I basically have two questions:
Is this a correct blueprint for implementation? Is there room for improvement?
When classifying a tweet ("sad loose"), it is expected to be 100% in class "neg" because it only contains negative words. The Laplace smoothing is however making things more complicated: class pos and neg have an equal probability. Is there a workaround for this?

There are two main elements to improve in your reasoning.
First, you should improve your smoothing method:
When applying Laplace smoothing, it should be applied to all counts, not just to those that are zero.
In addition, Laplace smoothing for such cases is usually given by (c+1)/(N+V), where V is the vocabulary size (e.g., see in Wikipedia).
Therefore, using probability function you have defined (which might not be the most suitable, see below):
p(sad loose|pos) = (0+1)/(3+8) * (0+1)/(3+8) = 1/121
p(sad loose|neu) = (0+1)/(2+8) * (0+1)/(2+8) = 1/100
p(sad loose|neg) = (1+1)/(3+8) * (1+1)/(3+8) = 4/121 <-- would become argmax
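The smoothed calculation above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `training`, `likelihood`, `score` and `classify` are made up for this example, N is taken as each class's own word count (so N = 2 for neu), and exact fractions are used so the numbers can be compared with the hand calculation.

```python
from collections import Counter
from fractions import Fraction

# Toy training data from the question: one tweet per class.
training = {
    "pos": "happy win nice".split(),
    "neu": "neutral middle".split(),
    "neg": "sad loose bad".split(),
}
vocabulary = {w for words in training.values() for w in words}  # V = 8
total_words = sum(len(words) for words in training.values())    # 8


def likelihood(words, cls):
    """p(words|class) with add-one smoothing: (c + 1) / (N + V) per word."""
    counts = Counter(training[cls])
    n = len(training[cls])
    p = Fraction(1)
    for w in words:
        p *= Fraction(counts[w] + 1, n + len(vocabulary))
    return p


def score(words, cls):
    """p(class) * p(words|class), with the prior defined as in the question."""
    prior = Fraction(len(training[cls]), total_words)
    return prior * likelihood(words, cls)


def classify(words):
    """Return the class with the highest score (the argmax)."""
    return max(training, key=lambda cls: score(words, cls))


best = classify(["sad", "loose"])
```

Because every count is smoothed, the class that actually contains the words ends up with a strictly larger likelihood than the classes that do not.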
In addition, a more common way of calculating the probability in the first place, would be by:
(number of tweets in class containing term c) / (total number of tweets in class)
For instance, in the limited trainset given above, and disregarding smoothing, p(sad|pos) = 0/1 = 0, and p(sad|neg) = 1/1 = 1. As the trainset grows, the numbers become more meaningful: e.g., if you had 10 tweets for the negative class, with 'sad' appearing in 4 of them, then p(sad|neg) would be 4/10.
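A minimal sketch of this tweet-level estimate, without smoothing. The function name `term_probability` and the ten negative tweets are hypothetical and exist only to reproduce the 4/10 example.

```python
def term_probability(tweets, term):
    """Share of tweets (given as token lists) that contain `term` at least once."""
    if not tweets:
        return 0.0
    containing = sum(1 for tokens in tweets if term in tokens)
    return containing / len(tweets)


# Hypothetical negative class: 10 tweets, 'sad' appears in 4 of them.
negative_tweets = [["sad", "day"]] * 4 + [["bad", "loss"]] * 6
p_sad_given_neg = term_probability(negative_tweets, "sad")
```

Note that a word repeated inside one tweet still counts once here, which is the difference from the word-count estimate used in the question.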
Regarding the actual number outputted by the Naive Bayes algorithm: you shouldn't expect the algorithm to assign actual probability to each class; rather, the category order is of more importance. Concretely, using the argmax would give you the algorithm's best guess for the class, but not the probability for it. Assigning probabilities to NB results is another story; for example, see an article discussing this issue.

Naive Bayes Algorithm with Laplacian Correction
import pandas as pd
# Count the rows where my_list[i] == st and my_list2[i] == st2
def CountFrequency(my_list, my_list2, st, st2):
    counter = 0
    for i in range(len(my_list)):
        if my_list[i] == st and my_list2[i] == st2:
            counter = counter + 1
    return counter
# Read the data set once; the last column holds the class label
df = pd.read_excel('data_set.xlsx')
headers = df.columns
# a[i] is the list of values in column i
a = [df.iloc[:, i].tolist() for i in range(len(df.columns))]
# Calculate Table Info (class priors)
result = []
length = len(a[0])
tableInfo = sorted(set(a[-1]))
for i in range(len(tableInfo)):
    result.append([])
for i in tableInfo:
    classCount = CountFrequency(a[-1], a[-1], i, i)
    print("P(", headers[-1], "=\"", i, "\") =", classCount, "/", length, "=", classCount / length)
# Take User Input and Calculate Columns Info
for i in range(len(df.columns) - 1):
    print("Choose value for attribute ", headers[i], "that you want from list : ")
    c = 1
    b = sorted(set(a[i]))
    for j in b:
        print(c, ":", j)
        c = c + 1
    choose = int(input("Enter Number Of Your Choice : "))
    value = b[choose - 1]
    co = 0
    for k in tableInfo:
        matches = CountFrequency(a[-1], a[i], k, value)
        classCount = CountFrequency(a[-1], a[-1], k, k)
        if matches != 0:
            print("P(", headers[i], "=\"", value, "\"|", headers[-1], "=\"", k, "\") =", matches, "/", classCount, "=", matches / classCount)
            result[co].append(matches / classCount)
        else:
            # Laplacian Correction (this script applies it only when the count is zero)
            corrected = (matches + 1) / (classCount + len(b))
            print("P(", headers[i], "=\"", value, "\"|", headers[-1], "=\"", k, "\") =", matches + 1, "/", classCount + len(b), "=", corrected, " With Laplacian correction ")
            result[co].append(corrected)
        co = co + 1
# Calculate Final Result: product of the conditional probabilities per class
finalResult = [1] * len(tableInfo)
for res in range(len(result)):
    for i in range(len(result[res])):
        finalResult[res] *= result[res][i]
# Print final result
print("#####################################################################")
print("#####################################################################")
print("#####################################################################")
mx = 0
pos = 0
for i in range(len(tableInfo)):
    # Likelihood times class prior
    score = finalResult[i] * (CountFrequency(a[-1], a[-1], tableInfo[i], tableInfo[i]) / length)
    print("P(X | ", headers[-1], "=\"", tableInfo[i], "\") =", score)
    if mx < score:
        mx = score
        pos = i
print("Therefore X belongs To Class (\"", headers[-1], "=", tableInfo[pos], "\")")

Related

Define variables about Player_HP and Damage

The variables below define the damage dealt by the player and by the opponent. After that comes the health part, where each side loses health according to the damage it receives. The remaining health in $player_hp and $opponent_hp is then written to the database (the UPDATE statement sets "player_health = ?").
Error: every time I make an attack, health is removed from the player ($player), but the opponent ($opponent) gains health in the database instead of losing it (it should decrease until it is negative).
Example: $player_damage = 1200
$opponent_damage = 500
$player_hp = 100(hp) - $opponent_damage
$opponent_hp = 100(hp) - $player_damage
The database ends up looking like this:
Player_VIDA = -400
Opponent_VIDA = 600
I've tried using UPDATE player_health = player_health - ?, but the result is the same.
My used code:
$player_damage = rand(5, $player_status['attack']);
$player_damage = $player_damage - ($opponent_status['defense'] * 0.3);
$opponent_damage = rand(5, $opponent_status['attack']);
$opponent_damage = $opponent_damage - ($player_status['defense'] * 0.3);
//add def to the health and subtract with damaged
$player_hp = round($opponent_damage - $player_status['health']);
$opponent_hp = round($player_damage - $opponent_status['health']);
Solution needed: I need to know why $player_hp and $opponent_hp come out positive; they should be subtracted down to negative values.
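Going only by the snippet above, one possible cause is the operand order: the code computes damage minus health, which is positive whenever the damage exceeds the health. A small Python sketch of the intended arithmetic follows. The helper name `remaining_hp` is hypothetical, and clamping the damage at zero is an added assumption, not something the question states.

```python
def remaining_hp(health, raw_damage, defender_defense):
    """Health left after one hit; may go negative, as the question expects."""
    # Assumption: damage is clamped at 0 so a high defense never heals the target.
    damage = max(0, raw_damage - defender_defense * 0.3)
    # Health minus damage. The snippet in the question computes damage minus health.
    # Note: Python's round() and PHP's round() treat exact .5 values differently.
    return round(health - damage)


# Values from the example in the question (defense set to 0 for simplicity).
player_hp = remaining_hp(100, 500, 0)     # player is hit for 500
opponent_hp = remaining_hp(100, 1200, 0)  # opponent is hit for 1200
```

With the operands in this order, both values come out negative once the damage is larger than the remaining health.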

filter data with php or sql?

I'm not sure if this is an anti-pattern or not, but it feels a bit convoluted, so I'd like to get your opinion on how these cases should be handled:
Let's say we have this data:
$sofas[0]['color'] = 'green';
$sofas[0]['pillows'] = 8;
$sofas[0]['pattern'] = 'moons';
$sofas[1]['color'] = 'green';
$sofas[1]['pillows'] = 8;
$sofas[1]['pattern'] = 'ducks';
$sofas[1]['footrest'] = 'small';
$sofas[2]['color'] = 'green';
$sofas[2]['pillows'] = 8;
$sofas[2]['pattern'] = 'stripes';
$sofas[2]['speakers'] = 'badass';
color, pillows and pattern comes from the database, whilst "footrest" and "speakers" have been added on by an api.
We can say for the sake of argument that there are 1250 different attributes that can be added by the api like "footrest" and "speakers".
We now want to load some data from the database based on these attributes, for example an image.
So we have a table that looks like this:
ID  , attribute_value   , image
1   , 'color_green'     , 'img0023'
2   , 'pillows_8'       , 'img003'
3   , 'pattern_moons'   , 'img002'
6   , 'pattern_ducks'   , 'img0083'
7   , 'footrest_small'  , 'img0058'
10  , 'pattern_stripes' , 'img0073'
11  , 'speakers_badass' , 'img00pluto'
etc , etc               , etc
So, the way I figure, I can approach this in two ways:
$sofaSQL="'color_green',
'pillows_8',
'pattern_moons',
'pattern_ducks',
'footrest_small',
'pattern_stripes',
'speakers_badass'";
$sql = "SELECT ID, attribute_value, image
FROM `example`
WHERE attribute_value IN ($sofaSQL)";
and then loop through the array and check if the key + '_' + value matches the rows in the recordset to see what images should be used for sofas[0], sofas[1] and sofas[2].
The other option I see would be to prep each sofa with a different sql statement, ie:
$sofaSQL="'color_green',
'pillows_8',
'pattern_moons'";
-add images-
$sofaSQL="'color_green',
'pillows_8',
'pattern_ducks',
'footrest_small'";
-add images-
$sofaSQL="'color_green',
'pillows_8',
'pattern_stripes',
'speakers_badass'";
-add images-
That seems simpler, but it doesn't feel right to hammer the database with a separate request for each item in the array.
So, what would you recommend in this case? Is there a better way of dealing with attributes that are selected randomly/from an api?
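The first approach (one IN query for all sofas, then matching rows back to each sofa in application code) can be sketched as below. This is an illustration in Python with an in-memory SQLite table standing in for the real database; the function name `images_for_sofas` is hypothetical, and placeholders are used instead of string concatenation, which is an assumption about how the query would be built safely.

```python
import sqlite3


def images_for_sofas(conn, sofas):
    """Fetch all images in one query, then distribute them per sofa."""
    # Every distinct "key_value" string across all sofas.
    wanted = sorted({f"{k}_{v}" for sofa in sofas for k, v in sofa.items()})
    if not wanted:
        return [[] for _ in sofas]
    placeholders = ",".join("?" for _ in wanted)
    rows = conn.execute(
        "SELECT attribute_value, image FROM example "
        f"WHERE attribute_value IN ({placeholders})",
        wanted,
    ).fetchall()
    image_by_attr = dict(rows)
    # Keep each sofa's attribute order; skip attributes without an image.
    return [
        [image_by_attr[f"{k}_{v}"] for k, v in sofa.items()
         if f"{k}_{v}" in image_by_attr]
        for sofa in sofas
    ]


# Demo data copied from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example (ID INTEGER, attribute_value TEXT, image TEXT)")
conn.executemany("INSERT INTO example VALUES (?, ?, ?)", [
    (1, "color_green", "img0023"), (2, "pillows_8", "img003"),
    (3, "pattern_moons", "img002"), (6, "pattern_ducks", "img0083"),
    (7, "footrest_small", "img0058"), (10, "pattern_stripes", "img0073"),
    (11, "speakers_badass", "img00pluto"),
])
sofas = [
    {"color": "green", "pillows": 8, "pattern": "moons"},
    {"color": "green", "pillows": 8, "pattern": "ducks", "footrest": "small"},
    {"color": "green", "pillows": 8, "pattern": "stripes", "speakers": "badass"},
]
images = images_for_sofas(conn, sofas)
```

The lookup dictionary makes the per-sofa matching a constant-time check per attribute, so the database is queried once regardless of how many sofas are in the array.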

Regex for Dutch telephone numbers

I'm definitely not the worst when it comes down to regex, but this one has got me stumped.
In short, this is the code I currently have.
$aNumbers = array(
'612345678',
'546123465',
'131234567',
'+31(0)612345678'
);
foreach($aNumbers as $sNumber) {
$aMatches = array();
$sNumber = preg_replace('/(\(0\)|[^\d]+)/', '', $sNumber);
preg_match('/(\d{1,2})?(\d{3})(\d{3})(\d{3})$/', $sNumber, $aMatches);
var_dump($sNumber);
var_dump($aMatches);
}
Simply put, I want to match specific formats for telephone numbers to ensure a unified display.
+31(0)612345678
+31(0)131234567
Both stripped would be without + and (0).
Cut down in parts:
31       6    123 456 78
Country  Net  Number
31       13   123 456 78
Country  Net  Number
Now, in some cases the +31 (or +1, +222) are optional. The 6 and 13 are always included, but as a fun twist, the following format is also possible:
31       546  123 456
Country  Net  Number
Is this even possible with regex?
I've answered a few of these types of questions, and my strategy is to identify certain portions of formatting or number relationships that convey meaning, and get rid of the rest.
One of my examples that parses non-NANP number formatting uses a list of valid area codes in the parsing expression, and identifies country code when present. It extracts the country code, area code, and then the rest of the number.
For your country, I am assuming the list of area/net/region codes in HansM's answer is either correct or easily replaceable, so I'll guess that this modification of a regex might be useful:
^[ -]*(\+31)?[ -]*[(0)]*[ -]*(7|43|32|45|33|49|39|31|47|34|46|41|90|44|351|353|358)[ -]*((?:\d[ -]*)+)
It will first match the country code, if it is present, and store it in back-reference 1, then skip the trunk prefix (any run of the characters "(", "0" and ")"). It will then match one of the area/net/region codes and store it in back-reference 2. Finally, it will capture one or more digits, optionally mixed with dashes (-) and/or spaces, and store them in back-reference 3.
After this, you could parse the third numbering group for validity or further reformatting.
I'm testing it on Regex 101, but I could use a list of acceptable and unacceptable input, and how it should be reformatted when acceptable...
[EDIT]
I've used this list of city codes for the Netherlands and modified the expression thusly:
^[ -]*(\+31)?[ -]*[(0)]*[ -]*([123457]0|23|24|26|35|45|71|73|570)[ -]*((?:\d[ -]*)+)
which performs the following parsing:
input                 (1)    (2)    (3)
--------------------- ------ ------ ---------------
0707123456                   70     7123456
0267-123456                  26     7-123456
0407-12 34 56                40     7-12 34 56
0570123456                   570    123456
07312345                     73     12345
+31(0)734423211       +31    73     4423211
but I still don't know if that's helpful for you
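To make the parsing table easier to reproduce, here is a small Python sketch that applies the same expression. The names `NL_NUMBER` and `parse_number` are made up for this example, and stripping spaces and dashes from the third group is an added convenience, not part of the original expression.

```python
import re

# The expression from the edit above, split over three lines for readability.
NL_NUMBER = re.compile(
    r'^[ -]*(\+31)?[ -]*[(0)]*[ -]*'
    r'([123457]0|23|24|26|35|45|71|73|570)'
    r'[ -]*((?:\d[ -]*)+)'
)


def parse_number(raw):
    """Return (country, area, subscriber) or None if no listed area code matches."""
    match = NL_NUMBER.match(raw)
    if match is None:
        return None
    country, area, rest = match.groups()
    # Drop the separators that the third group is allowed to contain.
    subscriber = re.sub(r'[ -]', '', rest)
    return country, area, subscriber


for number in ["0707123456", "0407-12 34 56", "+31(0)734423211", "612345678"]:
    print(number, "->", parse_number(number))
```

Inputs whose prefix is not in the code list (such as the mobile-style `612345678` with this short list) return `None`, so the list of codes decides what counts as acceptable.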
[EDIT 2]
Wikipedia has what appears to be a more comprehensive list of codes
010, 0111, 0113, 0114, 0115, 0117, 0118, 013, 015, 0161, 0162, 0164, 0165, 0166, 0167, 0168, 0172, 0174, 0180, 0181, 0182, 0183, 0184, 0186, 0187, 020, 0222, 0223, 0224, 0226, 0227, 0228, 0229, 023, 024, 0251, 0252, 0255, 026, 0294, 0297, 0299, 030, 0313, 0314, 0315, 0316, 0317, 0318, 0320, 0321, 033, 0341, 0342, 0343, 0344, 0345, 0346, 0347, 0348, 035, 036, 038, 040, 0411, 0412, 0413, 0416, 0418, 043, 045, 046, 0475, 0478, 0481, 0485, 0486, 0487, 0488, 0492, 0493, 0495, 0497, 0499, 050, 0511, 0512, 0513, 0514, 0515, 0516, 0517, 0518, 0519, 0521, 0522, 0523, 0524, 0525, 0527, 0528, 0529, 053, 0541, 0543, 0544, 0545, 0546, 0547, 0548, 055, 0561, 0562, 0566, 0570, 0571, 0572, 0573, 0575, 0577, 0578, 058, 0591, 0592, 0593, 0594, 0595, 0596, 0597, 0598, 0599, 070, 071, 072, 073, 074, 075, 076, 077, 078, 079
which can be used in the code selection portion like this (if you'd prefer it to be more easily read and updated):
10|111|113|114|115|117|118|13|15|161|162|164|165|166|167|168|172|174|180|181|182|183|184|186|187|20|222|223|224|226|227|228|229|23|24|251|252|255|26|294|297|299|30|313|314|315|316|317|318|320|321|33|341|342|343|344|345|346|347|348|35|36|38|40|411|412|413|416|418|43|45|46|475|478|481|485|486|487|488|492|493|495|497|499|50|511|512|513|514|515|516|517|518|519|521|522|523|524|525|527|528|529|53|541|543|544|545|546|547|548|55|561|562|566|570|571|572|573|575|577|578|58|591|592|593|594|595|596|597|598|599|70|71|72|73|74|75|76|77|78|79
or like this (if you'd prefer a more efficient evaluation of the expression):
1([035]|1[134578]|6[124-8]|7[24]|8[0-467])|2([0346]|2[2346-9]|5[125]|9[479])|3([03568]|1[34-8]|2[01]|4[1-8])|4([0356]|1[12368]|7[58]|8[15-8]|9[23579])|5([0358]|[19][1-9]|2[1-5789]|4[13-8]|6[126]|7[0-3578])|7[0-9]
I have used the nuget package libphonenumber-csharp.
That has helped me to create a (Dutch) phone number validator. Here is a code snippet; without the other parts of my solution it will not compile, but at least you can get an idea of how to handle this.
public override void Validate()
{
ValidationMessages = new Dictionary<string, string>();
ErrorMessage = string.Empty;
string phoneNumber;
string countryCode = _defaultCountryCode;
// If the phoneNumber is not required, it is allowed to be empty.
// So in that case isValid gets defaultvalue true
bool isValid = (!_isRequired);
if (!string.IsNullOrEmpty(_phoneNumber))
{
var phoneUtil = PhoneNumberUtil.GetInstance();
try
{
phoneNumber = PhoneNumbers.PhoneNumberUtil.Normalize(_phoneNumber);
countryCode = PhoneNumberUtil2.GetRegionCode(phoneNumber, _defaultCountryCode);
PhoneNumber oPhoneNumber = phoneUtil.Parse(phoneNumber, countryCode);
var t1 = oPhoneNumber.NationalNumber;
var t2 = oPhoneNumber.CountryCode;
var formattedNo = phoneUtil.Format(oPhoneNumber, PhoneNumberFormat.E164);
isValid = PhoneNumbers.PhoneNumberUtil.IsViablePhoneNumber(formattedNo);
}
catch (NumberParseException e)
{
var err = e.ToString();
isValid = false;
}
}
if ((isValid) && (!string.IsNullOrEmpty(_phoneNumber)))
{
Regex regexValidator = null;
string regex;
// Additional validations for Dutch phone numbers, as libphonenumber is too lenient when
// deciding whether a number is valid.
switch (countryCode)
{
case "NL":
if (_phoneNumber.StartsWith("0800") || _phoneNumber.StartsWith("0900"))
{
// 0800/0900 numbers
regex = @"((0800|0900)(-| )?[0-9]{4}([0-9]{3})?$)";
regexValidator = new Regex(regex);
isValid = regexValidator.IsMatch(_phoneNumber);
}
else
{
string phoneNumberCheck = _phoneNumber.Replace("(", "").Replace(")", "").Replace("-", "").Replace(" ", "");
regex = @"^(0031|\+31|0)[1-9][0-9]{8}$";
regexValidator = new Regex(regex);
isValid = regexValidator.IsMatch(phoneNumberCheck);
}
break;
}
}
if (!isValid)
{
ErrorMessage = string.Format(TextProvider.Get(TextProviderConstants.ValMsg_IsInAnIncorrectFormat_0),
ColumnInfoProvider.GetLabel(_labelKey));
ValidationMessages.Add(_messageKey, ErrorMessage);
}
}
Also useful might be my class PhoneNumberUtil2 that builds upon the nuget package libphonenumber-csharp:
// Code start
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;
using PhoneNumbers;
namespace ProjectName.Logic.Miscellaneous
{
public class PhoneNumberUtil2
{
/// <summary>
/// Returns the alphanumeric country code for a normalized phonenumber. If a phonenumber does not contain
/// an international numeric country code, the default country code for the website is returned.
/// This works for 17 countries: NL, GB, FR, DE, BE, AT, SE, NO, IT, TR, RU, CH, DK, IE, PT, ES, FI
/// </summary>
/// <param name="normalizedPhoneNumber"></param>
/// <param name="defaultCountryCode"> </param>
/// <returns></returns>
public static string GetRegionCode(string normalizedPhoneNumber, string defaultCountryCode)
{
if (normalizedPhoneNumber.Length > 10)
{
var dict = new Dictionary<string, string>();
dict.Add("7", "RU");
dict.Add("43", "AT");
dict.Add("32", "BE");
dict.Add("45", "DK");
dict.Add("33", "FR");
dict.Add("49", "DE");
dict.Add("39", "IT");
dict.Add("31", "NL");
dict.Add("47", "NO");
dict.Add("34", "ES");
dict.Add("46", "SE");
dict.Add("41", "CH");
dict.Add("90", "TR");
dict.Add("44", "GB");
dict.Add("351", "PT");
dict.Add("353", "IE");
dict.Add("358", "FI");
// First check 3-digits International Calling Codes
if (dict.ContainsKey(normalizedPhoneNumber.Substring(0, 3)))
{
return dict[normalizedPhoneNumber.Substring(0, 3)];
}
// Then 2-digits International Calling Codes
if (dict.ContainsKey(normalizedPhoneNumber.Substring(0, 2)))
{
return dict[normalizedPhoneNumber.Substring(0, 2)];
}
// And finally 1-digit International Calling Codes
if (dict.ContainsKey(normalizedPhoneNumber.Substring(0, 1)))
{
return dict[normalizedPhoneNumber.Substring(0, 1)];
}
}
return defaultCountryCode;
}
}
}

How to find the nearest cities using web services? [duplicate]

Do you know some utility or a web site where I can give US city,state and radial distance in miles as input and it would return me all the cities within that radius?
Thanks!
Here is how I do it.
You can obtain a list of city, st, zip codes and their latitudes and longitudes.
(I can't recall off the top of my head where we got ours)
edit: http://geonames.usgs.gov/domestic/download_data.htm
like someone mentioned above would probably work.
Then you can write a method to calculate the min and max latitudes and longitudes based on a radius, and query for all cities between those min and max values. Then loop through the results, calculate the distance for each, and remove any that are not within the radius.
double latitude1 = Double.parseDouble(zipCodes.getLatitude().toString());
double longitude1 = Double.parseDouble(zipCodes.getLongitude().toString());
//Upper reaches of possible boundaries
double upperLatBound = latitude1 + Double.parseDouble(distance)/40.0;
double lowerLatBound = latitude1 - Double.parseDouble(distance)/40.0;
double upperLongBound = longitude1 + Double.parseDouble(distance)/40.0;
double lowerLongBound = longitude1 - Double.parseDouble(distance)/40.0;
//pull back possible matches
SimpleCriteria zipCriteria = new SimpleCriteria();
zipCriteria.isBetween(ZipCodesPeer.LONGITUDE, lowerLongBound, upperLongBound);
zipCriteria.isBetween(ZipCodesPeer.LATITUDE, lowerLatBound, upperLatBound);
List zipList = ZipCodesPeer.doSelect(zipCriteria);
ArrayList acceptList = new ArrayList();
if(zipList != null)
{
for(int i = 0; i < zipList.size(); i++)
{
ZipCodes tempZip = (ZipCodes)zipList.get(i);
double tempLat = Double.parseDouble(tempZip.getLatitude().toString());
double tempLon = Double.parseDouble(tempZip.getLongitude().toString());
double d = 3963.0 * Math.acos(Math.sin(latitude1 * Math.PI/180) * Math.sin(tempLat * Math.PI/180) + Math.cos(latitude1 * Math.PI/180) * Math.cos(tempLat * Math.PI/180) * Math.cos(tempLon*Math.PI/180 -longitude1 * Math.PI/180));
if(d < Double.parseDouble(distance))
{
acceptList.add(((ZipCodes)zipList.get(i)).getZipCd());
}
}
}
That's an excerpt of my code; hopefully you can see what's happening. I start out with one ZipCodes row (a table in my DB), then I pull back possible matches, and finally I weed out those that are not within the radius.
Oracle, PostGIS, MySQL with GIS extensions and SQLite with GIS extensions all support this kind of query.
If you don't have the dataset look at:
http://www.geonames.org/
Take a look at this web service advertised on xmethods.net. It requires a subscription to actually use, but claims to do what you need.
The advertised method in question's description:
GetPlacesWithin: Returns a list of geo places within a specified distance from a given place. Parameters: place - place name (65 char max), state - 2 letter state code (not required for zip codes), distance - distance in miles, placeTypeToFind - type of place to look for: ZipCode or City (including any villages, towns, etc).
http://xmethods.net/ve2/ViewListing.po?key=uuid:5428B3DD-C7C6-E1A8-87D6-461729AF02C0
You can obtain a pretty good database of geolocated cities/placenames from http://geonames.usgs.gov - find an appropriate database dump, import it into your DB, and performing the kind of query you need is pretty straightforward, particularly if your DBMS supports some kind of spatial queries (e.g. Oracle Spatial, MySQL Spatial Extensions, PostGIS or SQL Server 2008).
See also: how to do location based search
I do not have a website, but we have implemented this both in Oracle as a database function and in SAS as a statistics macro. It only requires a database with all cities and their lat and long.
Maybe this can help. The project is configured in kilometers though. You can modify these in CityDAO.java
public List<City> findCityInRange(GeoPoint geoPoint, double distance) {
List<City> cities = new ArrayList<City>();
QueryBuilder queryBuilder = geoDistanceQuery("geoPoint")
.point(geoPoint.getLat(), geoPoint.getLon())
//.distance(distance, DistanceUnit.KILOMETERS) original
.distance(distance, DistanceUnit.MILES)
.optimizeBbox("memory")
.geoDistance(GeoDistance.ARC);
SearchRequestBuilder builder = esClient.getClient()
.prepareSearch(INDEX)
.setTypes("city")
.setSearchType(SearchType.QUERY_THEN_FETCH)
.setScroll(new TimeValue(60000))
.setSize(100).setExplain(true)
.setPostFilter(queryBuilder)
.addSort(SortBuilders.geoDistanceSort("geoPoint")
.order(SortOrder.ASC)
.point(geoPoint.getLat(), geoPoint.getLon())
//.unit(DistanceUnit.KILOMETERS)); Original
.unit(DistanceUnit.MILES));
SearchResponse response = builder
.execute()
.actionGet();
SearchHit[] hits = response.getHits().getHits();
scroll:
while (true) {
for (SearchHit hit : hits) {
Map<String, Object> result = hit.getSource();
cities.add(mapper.convertValue(result, City.class));
}
response = esClient.getClient().prepareSearchScroll(response.getScrollId()).setScroll(new TimeValue(60000)).execute().actionGet();
if (response.getHits().getHits().length == 0) {
break scroll;
}
}
return cities;
}
The "LocationFinder\src\main\resources\json\cities.json" file contains all cities from Belgium. You can delete or create entries if you want to. As long as you don't change the names and/or structure, no code changes are required.
Make sure to read the README https://github.com/GlennVanSchil/LocationFinder

PHP implementation of Bayes classificator: Assign topics to texts

In my news page project, I have a database table news with the following structure:
- id: [integer] unique number identifying the news entry, e.g.: *1983*
- title: [string] title of the text, e.g.: *New Life in America No Longer Means a New Name*
- topic: [string] category which should be chosen by the classifier, e.g.: *Sports*
Additionally, there's a table bayes with information about word frequencies:
- word: [string] a word which the frequencies are given for, e.g.: *real estate*
- topic: [string] same content as "topic" field above, e.g.: *Economics*
- count: [integer] number of occurrences of "word" in "topic" (incremented when new documents go to "topic"), e.g: *100*
Now I want my PHP script to classify all news entries and assign one of several possible categories (topics) to them.
Is this the correct implementation? Can you improve it?
<?php
include 'mysqlLogin.php';
$get1 = "SELECT id, title FROM ".$prefix."news WHERE topic = '' LIMIT 0, 150";
$get2 = mysql_abfrage($get1);
// pTOPICS BEGIN
$pTopics1 = "SELECT topic, SUM(count) AS count FROM ".$prefix."bayes WHERE topic != '' GROUP BY topic";
$pTopics2 = mysql_abfrage($pTopics1);
$pTopics = array();
while ($pTopics3 = mysql_fetch_assoc($pTopics2)) {
$pTopics[$pTopics3['topic']] = $pTopics3['count'];
}
// pTOPICS END
// pWORDS BEGIN
$pWords1 = "SELECT word, topic, count FROM ".$prefix."bayes";
$pWords2 = mysql_abfrage($pWords1);
$pWords = array();
while ($pWords3 = mysql_fetch_assoc($pWords2)) {
if (!isset($pWords[$pWords3['topic']])) {
$pWords[$pWords3['topic']] = array();
}
$pWords[$pWords3['topic']][$pWords3['word']] = $pWords3['count'];
}
// pWORDS END
while ($get3 = mysql_fetch_assoc($get2)) {
$pTextInTopics = array();
$tokens = tokenizer($get3['title']);
foreach ($pTopics as $topic=>$documentsInTopic) {
if (!isset($pTextInTopics[$topic])) { $pTextInTopics[$topic] = 1; }
foreach ($tokens as $token) {
echo '....'.$token;
if (isset($pWords[$topic][$token])) {
$pTextInTopics[$topic] *= $pWords[$topic][$token]/array_sum($pWords[$topic]);
}
}
$pTextInTopics[$topic] *= $pTopics[$topic]/array_sum($pTopics); // #documentsInTopic / #allDocuments
}
asort($pTextInTopics); // pick topic with lowest value
if ($chosenTopic = each($pTextInTopics)) {
echo '<p>The text belongs to topic '.$chosenTopic['key'].' with a likelihood of '.$chosenTopic['value'].'</p>';
}
}
?>
The training is done manually, it isn't included in this code. If the text "You can make money if you sell real estates" is assigned to the category/topic "Economics", then all words (you,can,make,...) are inserted into the table bayes with "Economics" as the topic and 1 as standard count. If the word is already there in combination with the same topic, the count is incremented.
Sample learning data:
word       topic       count
kaczynski  Politics    1
sony       Technology  1
bank       Economics   1
phone      Technology  1
sony       Economics   3
ericsson   Technology  2
Sample output/result:
Title of the text: Phone test Sony Ericsson Aspen - sensitive Winberry
Politics
....phone
....test
....sony
....ericsson
....aspen
....sensitive
....winberry
Technology
....phone FOUND
....test
....sony FOUND
....ericsson FOUND
....aspen
....sensitive
....winberry
Economics
....phone
....test
....sony FOUND
....ericsson
....aspen
....sensitive
....winberry
Result: The text belongs to topic Technology with a likelihood of 0.013888888888889
Thank you very much in advance!
It looks like your code is correct, but there are a few easy ways to optimize it. For example, you calculate p(word|topic) on the fly for every word, while you could easily calculate these values beforehand. (I'm assuming you want to classify multiple documents here; if you're only doing a single document, I suppose this is okay, since you don't calculate it for words not in the document.)
Similarly, the calculation of p(topic) could be moved outside of the loop.
Finally, you don't need to sort the entire array to find the maximum.
All small points! But that's what you asked for :)
I've written some untested PHP-code showing how I'd implement this below:
<?php
// Get word counts from database
$nWordPerTopic = mystery_sql();
// Calculate p(word|topic) = nWord / sum(nWord for every word)
$nTopics = array();
$pWordPerTopic = array();
foreach($nWordPerTopic as $topic => $wordCounts)
{
// Get total word count in topic
$nTopic = array_sum($wordCounts);
// Calculate p(word|topic)
$pWordPerTopic[$topic] = array();
foreach($wordCounts as $word => $count)
$pWordPerTopic[$topic][$word] = $count / $nTopic;
// Save $nTopic for next step
$nTopics[$topic] = $nTopic;
}
// Calculate p(topic)
$nTotal = array_sum($nTopics);
$pTopics = array();
foreach($nTopics as $topic => $nTopic)
$pTopics[$topic] = $nTopic / $nTotal;
// Classify
foreach($documents as $document)
{
$title = $document['title'];
$tokens = tokenizer($title);
$pMax = -1;
$selectedTopic = null;
foreach($pTopics as $topic => $pTopic)
{
$p = $pTopic;
foreach($tokens as $word)
{
if (!array_key_exists($word, $pWordPerTopic[$topic]))
continue;
$p *= $pWordPerTopic[$topic][$word];
}
if ($p > $pMax)
{
$selectedTopic = $topic;
$pMax = $p;
}
}
}
?>
As for the maths...
You're trying to maximize p(topic|words), so find
    arg max p(topic|words)
(i.e. the argument topic for which p(topic|words) is the highest)
Bayes' theorem says
    p(topic|words) = p(topic) * p(words|topic) / p(words)
So you're looking for
    arg max p(topic) * p(words|topic) / p(words)
Since p(words) of a document is the same for any topic, this is the same as finding
    arg max p(topic) * p(words|topic)
The naive Bayes assumption (which makes this a naive Bayes classifier) is that
    p(words|topic) = p(word1|topic) * p(word2|topic) * ...
So using this, you need to find
    arg max p(topic) * p(word1|topic) * p(word2|topic) * ...
where
    p(topic) = number of words in topic / number of words in total
and
    p(word|topic) = p(word, topic) / p(topic)
                  = (number of times word occurs in topic / number of words in total) * (number of words in total / number of words in topic)
                  = number of times word occurs in topic / number of words in topic
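As a rough illustration of the arg max above, here is a Python sketch run on the sample learning data from the question. Two things in it are assumptions that neither the question nor the answer uses: add-one smoothing, so that unseen words lower a topic's score instead of being skipped, and summing logarithms instead of multiplying probabilities, which avoids underflow on longer texts. All names are hypothetical.

```python
import math
from collections import defaultdict

# Sample learning data from the question: (word, topic, count).
rows = [
    ("kaczynski", "Politics", 1), ("sony", "Technology", 1),
    ("bank", "Economics", 1), ("phone", "Technology", 1),
    ("sony", "Economics", 3), ("ericsson", "Technology", 2),
]

counts = defaultdict(dict)
for word, topic, count in rows:
    counts[topic][word] = count
vocabulary = {word for word, _, _ in rows}
topic_totals = {topic: sum(wc.values()) for topic, wc in counts.items()}
grand_total = sum(topic_totals.values())


def log_score(tokens, topic):
    """log p(topic) + sum of log p(word|topic), with add-one smoothing."""
    score = math.log(topic_totals[topic] / grand_total)
    denominator = topic_totals[topic] + len(vocabulary)
    for token in tokens:
        score += math.log((counts[topic].get(token, 0) + 1) / denominator)
    return score


def classify(tokens):
    """Topic with the highest log score (the arg max)."""
    return max(topic_totals, key=lambda topic: log_score(tokens, topic))


tokens = "phone test sony ericsson aspen sensitive winberry".split()
chosen = classify(tokens)
```

Because the logarithm is monotonic, the topic with the highest log score is the same topic that maximizes the product of probabilities.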
