PHP implementation of a Bayes classifier: assigning topics to texts

In my news page project, I have a database table news with the following structure:
- id: [integer] unique number identifying the news entry, e.g.: *1983*
- title: [string] title of the text, e.g.: *New Life in America No Longer Means a New Name*
- topic: [string] category which should be chosen by the classifier, e.g.: *Sports*
Additionally, there's a table bayes with information about word frequencies:
- word: [string] a word which the frequencies are given for, e.g.: *real estate*
- topic: [string] same content as "topic" field above, e.g.: *Economics*
- count: [integer] number of occurrences of "word" in "topic" (incremented when new documents go to "topic"), e.g: *100*
Now I want my PHP script to classify all news entries and assign one of several possible categories (topics) to them.
Is this the correct implementation? Can you improve it?
<?php
include 'mysqlLogin.php';
$get1 = "SELECT id, title FROM ".$prefix."news WHERE topic = '' LIMIT 0, 150";
$get2 = mysql_abfrage($get1);
// pTOPICS BEGIN
$pTopics1 = "SELECT topic, SUM(count) AS count FROM ".$prefix."bayes WHERE topic != '' GROUP BY topic";
$pTopics2 = mysql_abfrage($pTopics1);
$pTopics = array();
while ($pTopics3 = mysql_fetch_assoc($pTopics2)) {
$pTopics[$pTopics3['topic']] = $pTopics3['count'];
}
// pTOPICS END
// pWORDS BEGIN
$pWords1 = "SELECT word, topic, count FROM ".$prefix."bayes";
$pWords2 = mysql_abfrage($pWords1);
$pWords = array();
while ($pWords3 = mysql_fetch_assoc($pWords2)) {
if (!isset($pWords[$pWords3['topic']])) {
$pWords[$pWords3['topic']] = array();
}
$pWords[$pWords3['topic']][$pWords3['word']] = $pWords3['count'];
}
// pWORDS END
while ($get3 = mysql_fetch_assoc($get2)) {
$pTextInTopics = array();
$tokens = tokenizer($get3['title']);
foreach ($pTopics as $topic=>$documentsInTopic) {
if (!isset($pTextInTopics[$topic])) { $pTextInTopics[$topic] = 1; }
foreach ($tokens as $token) {
echo '....'.$token;
if (isset($pWords[$topic][$token])) {
$pTextInTopics[$topic] *= $pWords[$topic][$token]/array_sum($pWords[$topic]);
}
}
$pTextInTopics[$topic] *= $pTopics[$topic]/array_sum($pTopics); // #documentsInTopic / #allDocuments
}
asort($pTextInTopics); // pick topic with lowest value
if ($chosenTopic = each($pTextInTopics)) {
echo '<p>The text belongs to topic '.$chosenTopic['key'].' with a likelihood of '.$chosenTopic['value'].'</p>';
}
}
?>
The training is done manually, it isn't included in this code. If the text "You can make money if you sell real estates" is assigned to the category/topic "Economics", then all words (you,can,make,...) are inserted into the table bayes with "Economics" as the topic and 1 as standard count. If the word is already there in combination with the same topic, the count is incremented.
Sample learning data:
word topic count
kaczynski Politics 1
sony Technology 1
bank Economics 1
phone Technology 1
sony Economics 3
ericsson Technology 2
Sample output/result:
Title of the text: Phone test Sony Ericsson Aspen - sensitive Winberry
Politics
....phone
....test
....sony
....ericsson
....aspen
....sensitive
....winberry
Technology
....phone FOUND
....test
....sony FOUND
....ericsson FOUND
....aspen
....sensitive
....winberry
Economics
....phone
....test
....sony FOUND
....ericsson
....aspen
....sensitive
....winberry
Result: The text belongs to topic Technology with a likelihood of 0.013888888888889
Thank you very much in advance!

Your code looks correct, but there are a few easy ways to optimize it. For example, you calculate p(word|topic) on the fly for every word, although these values can be calculated once beforehand. (I'm assuming you want to classify multiple documents here. If you only classify a single document, computing on the fly is fine, since you never calculate it for words that are not in the document.)
Similarly, the calculation of p(topic) can be moved outside of the loop.
Finally, you don't need to sort the entire array to find the maximum.
These are all small points, but that's what you asked for :)
Below is some untested PHP code showing how I'd implement this:
<?php
// Get word counts from database as $nWordPerTopic[topic][word] = count
// (mystery_sql() is a placeholder for your own query code)
$nWordPerTopic = mystery_sql();
// Calculate p(word|topic) = nWord / sum(nWord for every word)
$nTopics = array();
$pWordPerTopic = array();
foreach($nWordPerTopic as $topic => $wordCounts)
{
    // Get total word count in topic
    $nTopic = array_sum($wordCounts);
    // Calculate p(word|topic)
    $pWordPerTopic[$topic] = array();
    foreach($wordCounts as $word => $count)
        $pWordPerTopic[$topic][$word] = $count / $nTopic;
    // Save $nTopic for next step
    $nTopics[$topic] = $nTopic;
}
// Calculate p(topic)
$nTotal = array_sum($nTopics);
$pTopics = array();
foreach($nTopics as $topic => $nTopic)
    $pTopics[$topic] = $nTopic / $nTotal;
// Classify ($documents holds the rows to classify, each with a 'title' field)
foreach($documents as $document)
{
    $title = $document['title'];
    $tokens = tokenizer($title);
    $pMax = -1;
    $selectedTopic = null;
    foreach($pTopics as $topic => $pTopic)
    {
        $p = $pTopic;
        foreach($tokens as $word)
        {
            // Words never seen in this topic are skipped
            if (!array_key_exists($word, $pWordPerTopic[$topic]))
                continue;
            $p *= $pWordPerTopic[$topic][$word];
        }
        // Keep track of the best topic so far instead of sorting
        if ($p > $pMax)
        {
            $selectedTopic = $topic;
            $pMax = $p;
        }
    }
    // $selectedTopic now holds the most probable topic for this document
}
?>
As for the maths...
You're trying to maximize p(topic|words), so find
$$\arg\max_{\text{topic}} \; p(\text{topic} \mid \text{words})$$
(i.e. the argument topic for which p(topic|words) is the highest)
Bayes' theorem says
$$p(\text{topic} \mid \text{words}) = \frac{p(\text{topic}) \, p(\text{words} \mid \text{topic})}{p(\text{words})}$$
So you're looking for
$$\arg\max_{\text{topic}} \; \frac{p(\text{topic}) \, p(\text{words} \mid \text{topic})}{p(\text{words})}$$
Since p(words) of a document is the same for any topic, this is the same as finding
$$\arg\max_{\text{topic}} \; p(\text{topic}) \, p(\text{words} \mid \text{topic})$$
The naive Bayes assumption (which makes this a naive Bayes classifier) is that
$$p(\text{words} \mid \text{topic}) = p(\text{word}_1 \mid \text{topic}) \, p(\text{word}_2 \mid \text{topic}) \cdots$$
So using this, you need to find
$$\arg\max_{\text{topic}} \; p(\text{topic}) \, p(\text{word}_1 \mid \text{topic}) \, p(\text{word}_2 \mid \text{topic}) \cdots$$
Where
$$p(\text{topic}) = \frac{\text{number of words in topic}}{\text{number of words in total}}$$
And
$$p(\text{word} \mid \text{topic}) = \frac{p(\text{word}, \text{topic})}{p(\text{topic})} = p(\text{word}, \text{topic}) \cdot \frac{1}{p(\text{topic})}$$
$$= \frac{\text{number of times word occurs in topic}}{\text{number of words in total}} \cdot \frac{\text{number of words in total}}{\text{number of words in topic}}$$
$$= \frac{\text{number of times word occurs in topic}}{\text{number of words in topic}}$$
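As a concrete check of these formulas, the sample learning data from the question reproduces the Technology score shown in its sample output. This is only a sketch, written in Python for brevity; `counts` and `topic_score` are names made up for the illustration, and words that never occurred in the topic are skipped, as in the code above.

```python
from fractions import Fraction

# Sample learning data from the question: (word, topic) -> count
counts = {
    ("kaczynski", "Politics"): 1,
    ("sony", "Technology"): 1,
    ("bank", "Economics"): 1,
    ("phone", "Technology"): 1,
    ("sony", "Economics"): 3,
    ("ericsson", "Technology"): 2,
}

def topic_score(topic, tokens):
    """p(topic) times the product of p(word|topic) for the words seen in the topic."""
    words_in_topic = sum(c for (_, t), c in counts.items() if t == topic)
    words_total = sum(counts.values())
    score = Fraction(words_in_topic, words_total)  # p(topic)
    for token in tokens:
        count = counts.get((token, topic))
        if count is not None:  # unseen words are skipped (assumption, as above)
            score *= Fraction(count, words_in_topic)  # p(word|topic)
    return score

tokens = ["phone", "test", "sony", "ericsson", "aspen", "sensitive", "winberry"]
technology = topic_score("Technology", tokens)
print(technology, float(technology))  # 1/72, roughly 0.0139
```

Here p(Technology) = 4/9 and the three matching words contribute 1/4, 1/4 and 2/4, which gives 1/72.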

Related

Parse strings of numbers and domain/subdomain strings, group, then sum numbers in each group

In PHP, I have an array that shows how many times user clicked on each individual domain like this:
$counts = [
"900,google.com",
"60,mail.yahoo.com",
"10,mobile.sports.yahoo.com",
"40,sports.yahoo.com",
"300,yahoo.com",
"10,stackoverflow.com",
"20,overflow.com",
"5,com.com",
"2,en.wikipedia.org",
"1,m.wikipedia.org",
"1,mobile.sports",
"1,google.co.uk"
];
How can I pass this input as a parameter to a function that returns a data structure containing the number of clicks recorded on each domain AND each subdomain under it? For example, a click on "mail.yahoo.com" counts toward the totals for "mail.yahoo.com", "yahoo.com", and "com". (Subdomains are added to the left of their parent domain. So "mail" and "mail.yahoo" are not valid domains. Note that "mobile.sports" appears as a separate domain near the bottom of the input.)
Sample output (in any order/format):
calculateClicksByDomain($counts) =>
com: 1345
google.com: 900
stackoverflow.com: 10
overflow.com: 20
yahoo.com: 410
mail.yahoo.com: 60
mobile.sports.yahoo.com: 10
sports.yahoo.com: 50
com.com: 5
org: 3
wikipedia.org: 3
en.wikipedia.org: 2
m.wikipedia.org: 1
mobile.sports: 1
sports: 1
uk: 1
co.uk: 1
google.co.uk: 1
The first step I am stuck on is how to get the subdomains from, for example,
"mobile.sports.yahoo.com"
such that result is
[com, yahoo.com, sports.yahoo.com, mobile.sports.yahoo.com]
This code would work:
$counts = [
"900,google.com",
"60,mail.yahoo.com",
"10,mobile.sports.yahoo.com",
"40,sports.yahoo.com",
"300,yahoo.com",
"10,stackoverflow.com",
"20,overflow.com",
"5,com.com",
"2,en.wikipedia.org",
"1,m.wikipedia.org",
"1,mobile.sports",
"1,google.co.uk"
];
function calculateClicksByDomain($dataLines)
{
$output = [];
foreach ($dataLines as $dataLine) {
[$count, $domain] = explode(',', $dataLine);
$nameParts = [];
foreach (array_reverse(explode('.', $domain)) as $namePart) {
array_unshift($nameParts, $namePart);
$domain = implode('.', $nameParts);
$output[$domain] = ($output[$domain] ?? 0) + $count;
}
}
return $output;
}
print_r(calculateClicksByDomain($counts));
See: https://3v4l.org/o5VgJ#v8.0.26
This function walks over each line of the data and explodes it into a count and a domain. After that it explodes the domain by the dots, reverses it, and walks over those name parts. In that loop it reconstructs the various subdomains and counts them into the output.
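The step the question was stuck on, turning one domain into all of its parent domains, can also be isolated in a helper. The sketch below shows the same idea in Python rather than PHP; `domain_suffixes`, `calculate_clicks_by_domain` and the three-line `sample` are made up for the illustration.

```python
def domain_suffixes(domain):
    """Return every parent domain of `domain`, shortest first, ending with the domain itself."""
    parts = domain.split(".")
    return [".".join(parts[i:]) for i in range(len(parts) - 1, -1, -1)]

def calculate_clicks_by_domain(lines):
    """Add each line's click count to the domain and to all of its parents."""
    totals = {}
    for line in lines:
        count, domain = line.split(",")
        for suffix in domain_suffixes(domain):
            totals[suffix] = totals.get(suffix, 0) + int(count)
    return totals

print(domain_suffixes("mobile.sports.yahoo.com"))
# ['com', 'yahoo.com', 'sports.yahoo.com', 'mobile.sports.yahoo.com']

sample = ["60,mail.yahoo.com", "300,yahoo.com", "1,mobile.sports"]
totals = calculate_clicks_by_domain(sample)
print(totals["yahoo.com"])  # 360
```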

Golf league payout to six finishing positions

I have a golf league of 40 individuals. We all throw money in a pot and pay out the first 6 places based on final score.
If there were no ties the pay out would be simple but often we have, for example, 2 people tied for first place, 3 people tied for second, 1 person alone in third, etc. The variations seem endless.
I've been trying to automate the calculated payouts for each place using PHP but have not been successful. Any suggestions, help, or pointing in the right direction would be much appreciated. I noticed that someone else tried to ask a similar question on this site but was not successful in framing the question. I'll try to do a better job.
Here is some data I've been playing around with:
$playerNumber=40;
$totalPoints=100;
Payouts:
$pointsFirst=0.6*$totalPoints;
$pointsSecond=0.2*$totalPoints;
$pointsThird=0.15*$totalPoints;
$pointsFourth=0.03*$totalPoints;
$pointsFifth=0.02*$totalPoints;
$pointsSixth=0.01*$totalPoints;
For the example given above and to pay out six places, we would calculate the payouts as follows:
If two people are tied for first place, we add first and second place points and divide by two.
If three people are tied for second place, we add third, fourth and fifth place points and divide by three.
If one person is alone in third, this person would win sixth place points.
I can count the number of players who are in or tied for a certain place.
$countFirst=2;
$countSecond=3;
$countThird=1;
$countFourth=2;
$countFifth=1;
$countSixth=2;
In this example the player scores would be 72, 72, 73, 73, 73, 74, 75, 75, 76, 77, 77.
At first I thought this was an application for nested arrays. Then I thought perhaps using arrays, array slice, etc, may be a way to go. Each time I end up in the woods. I'm not seeing the logic.
I have used conditional statements for paying out three places but to pay out six places with this method puts me deep in the woods.
Example of payout to three places using conditional statements:
$pointsFirst=0.5*$totalPoints;
$pointsSecond=0.3*$totalPoints;
$pointsThird=0.2*$totalPoints;
if($countFirst>2) {
$ptsA=round($totalPoints/$countFirst,2);
}
elseif($countFirst==2) {
$ptsA=round(($pointsFirst+$pointsSecond)/2,2);
if($countSecond>1) {
$ptsB=round($pointsThird/$countSecond,2);
}
elseif($countSecond==1) {
$ptsB=round($pointsThird,2);
}
}
elseif($countFirst==1) {
$ptsA=round($pointsFirst,2);
if($countSecond>1) {
$ptsB=round(($pointsSecond+$pointsThird)/2,2);
}
elseif($countSecond==1) {
$ptsB=round($pointsSecond,2);
if($countThird>1) {
$ptsC=round($pointsThird/$countThird,2);
}
elseif($countThird==1) {
$ptsC=round($pointsThird,2);
}
}
}
I hope I have been clear in my request. I'll be glad to clarify anything. If anyone has any ideas on how to efficiently automate a payout calculation to six places I will be eternally grateful. Thank-you! Mike
Per request:
$scores=array();
$scores[0]=72;
$scores[1]=72;
$scores[2]=73;
$scores[3]=73;
$scores[4]=73;
$scores[5]=74;
$scores[6]=75;
$scores[7]=75;
$scores[8]=76;
$scores[9]=77;
$scores[10]=77;
$payout=array();
$payout[0]=0.6*$totalPoints;
$payout[1]=0.2*$totalPoints;
$payout[2]=0.15*$totalPoints;
$payout[3]=0.03*$totalPoints;
$payout[4]=0.02*$totalPoints;
$payout[5]=0.01*$totalPoints;
$countScores=array();
$countScores[0]=$countFirst;
$countScores[1]=$countSecond;
$countScores[2]=$countThird;
$countScores[3]=$countFourth;
$countScores[4]=$countFifth;
$countScores[5]=$countSixth;
First, there is a problem with your Payouts. If you add them up you get 1.01 not 1
0.6 (1st) + 0.2 (2nd ) + 0.15 (3rd) + 0.03 (4th) + 0.02 (5th) + 0.01 (6th) = 1.01
Second, it is easier if you make your payouts and counts into arrays.
Change these:
$pointsFirst=0.6*$totalPoints;
$pointsSecond=0.2*$totalPoints;
$pointsThird=0.15*$totalPoints;
$pointsFourth=0.03*$totalPoints;
$pointsFifth=0.02*$totalPoints;
$pointsSixth=0.01*$totalPoints;
$countFirst=2;
$countSecond=3;
$countThird=1;
$countFourth=2;
$countFifth=1;
$countSixth=2;
to these
$payout=array();
$payout[0]=0.6*$totalPoints;
$payout[1]=0.2*$totalPoints;
$payout[2]=0.15*$totalPoints;
$payout[3]=0.03*$totalPoints;
$payout[4]=0.02*$totalPoints;
$payout[5]=0.01*$totalPoints;
$count=array();
$count[0]=2;
$count[1]=3;
$count[2]=1;
$count[3]=2;
$count[4]=1;
$count[5]=2;
Here is the start of one way to do it. Eventually I would turn this into a function so that it can be reused with different payouts and a different number of places (see the phpfiddle examples below).
I see this in 4 steps:
Step 1
// Add together the payments if there are ties - ie. 2 tied for first $payout[0]+$payout[1], etc
$payout_groups = array(); // start a payout array
$payout_groups_key = 0; // array key count
$payout_groups_count = 0; // array counter, use to match the $count array values
for($w=0;$w<count($payout);$w++){ //
if(array_key_exists($payout_groups_key,$payout_groups)){
$payout_groups[$payout_groups_key] += $payout[$w]; // if there are ties, add them together
}
else{
$payout_groups[$payout_groups_key] = $payout[$w]; // else set a new payout level
}
$payout_groups_count++; // increase the counter
if($payout_groups_count == $count[$payout_groups_key]){ // if we merged all the ties, set a new array key and restart the counter
$payout_groups_key++;
$payout_groups_count = 0;
}
}
Step 2
// basic counter to get how many placers/winners. This makes it possible to have ties for 6th (last) place
$x = 0;
$y = 0;
while($y < count($payout)){
$y += $count[$x]; // the $count array values until we reach the amount of places/payouts
$x++;
}
Step 3
// Create array for winnings per placing
$winnings = array(); // start an array
$placings_count = 0; //
$placings_counter = 0;
for($z=0;$z<$y;$z++){
$winnings[$z] = $payout_groups[$placings_count]/$count[$placings_count];
$placings_counter++;
if($placings_counter == $count[$placings_count]){
$placings_count++;
$placings_counter = 0;
}
}
Step 4
// Assign winnings to scorecard
$scoreboard = array();
for($t=0;$t<count($winnings);$t++){
$scoreboard[$t]['score'] = $scores[$t];
$scoreboard[$t]['payout'] = $winnings[$t];
}
You can see this using your defined values at - http://phpfiddle.org/main/code/a1g-qu0
Using the same code, I changed the payout amounts and extended it to seven places - http://phpfiddle.org/main/code/uxi-qgt
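For comparison, the four steps can be folded into one short function. This is only a sketch of the same tie-sharing rule, written in Python; the function name `tie_payouts` is made up, and the payouts are the question's values for `$totalPoints = 100` (which, as noted above, add up to 101).

```python
def tie_payouts(payout, count):
    """Share place payouts among tied players.

    payout: points per finishing place, best place first.
    count:  number of players tied in each score group, best group first.
    Returns one payout per paid player, in finishing order.
    """
    winnings = []
    place = 0  # index of the next place that has not been paid out yet
    for tied in count:
        if place >= len(payout):
            break  # every paid place has been handed out
        pot = sum(payout[place:place + tied])  # the places this group occupies
        winnings.extend([pot / tied] * tied)   # split the pot evenly
        place += tied
    return winnings

winnings = tie_payouts([60, 20, 15, 3, 2, 1], [2, 3, 1, 2, 1, 2])
print(winnings)  # two players at 40.0, three at about 6.67, one at 1.0
```

A tie for the last paid place is handled the same way as in step 2: the slice simply stops at the end of `payout`, so the tied players split whatever is left.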

How do you rearrange text within a string from a MySQL query?

Solution I am looking for:
I would like to rearrange words within the text string results such that the job title is moved from the end of the string to the beginning of the string for each line item.
Currently, I am retrieving data from an external medical database query ($query). However, I cannot make any changes to the database or to the MySQL query statement itself.
The $query is retrieved and I then place the results in a $data array via the following command:
while($row = mysql_fetch_assoc($query)){$data[] = $row;}
I then change all the job titles to uppercase in the $data array as follows:
$job_01 = 'anesthesiologist';
$job_02 = 'dentist';
$job_03 = 'general practitioner';
$job_04 = 'internist';
$job_05 = 'lawyer';
$job_06 = 'manager';
$job_07 = 'pediatrician';
$job_08 = 'psychiatrist';
$replace_01 = 'ANESTHESIOLOGIST';
$replace_02 = 'DENTIST';
$replace_03 = 'GENERAL PRACTITIONER';
$replace_04 = 'INTERNIST';
$replace_05 = 'LAWYER';
$replace_06 = 'MANAGER';
$replace_07 = 'PEDIATRICIAN';
$replace_08 = 'PSYCHIATRIST';
$searchArray = array($job_01, $job_02, $job_03, $job_04, $job_05, $job_06, $job_07, $job_08);
$replaceArray = array($replace_01, $replace_02, $replace_03, $replace_04, $replace_05, $replace_06, $replace_07, $replace_08);
for ($i=0; $i<=count($data)-1; $i++) {
    $line[$i] = str_ireplace($searchArray, $replaceArray, $data[$i]);
}
The final output is in the following line item text string format:
Example Query results (4 line items)
California Long time medical practitioner - ANESTHESIOLOGIST 55yr
New York Specializing in working with semi-passive children - PEDIATRICIAN (doctor) 42yr
Nevada Currently working in a new medical office - PSYCHIATRIST 38yr
Texas Represents the medical-liability industry - LAWYER (attorney) 45yr
I would like to rearrange these results such that I can output the data to my users in the following format by moving the job title to the beginning of each line item as in:
Desired results (usually over 1000 items)
ANESTHESIOLOGIST - California Long time medical practitioner - 55yr
PEDIATRICIAN - New York Specializing in working with semi-passive children - (doctor) 42yr
PSYCHIATRIST - Nevada Currently working in a new medical office - psychiatrist 38yr
LAWYER - Texas Represents the medical-liability industry - lawyer (attorney) 45yr
Ideally, if possible, it would also be nice to have the age moved to the beginning of the text string results as follows:
Ideal Results
55yr - ANESTHESIOLOGIST - California Long time medical practitioner
42yr - PEDIATRICIAN - New York Specializing in working with semi-passive children - (doctor)
38yr - PSYCHIATRIST - Nevada Currently working in a new medical office - psychiatrist
45yr - LAWYER - Texas Represents the medical-liability industry - lawyer (attorney)
You could use a regular expression to extract the parts of each line and rearrange them:
for ($i=0; $i<=count($data)-1; $i++) {
    $line[$i] = str_ireplace($searchArray, $replaceArray, $data[$i]);
    // variant a, complete line
    if(preg_match_all('/(.*)\s+-\s+(.*)\s+(\d+)yr$/', $line[$i],$matches)) {
        $line[$i] = $matches[3][0].'yr - '.$matches[2][0].' - '.$matches[1][0];
    // variant b, a line with age, but no jobtitle
    } elseif(preg_match_all('/(.*)\s+-\s+(\d+)yr$/', $line[$i],$matches)) {
        $line[$i] = $matches[2][0].'yr - '.$matches[1][0];
    // variant c, no age
    } elseif(preg_match_all('/(.*)\s+-\s+(.*)$/', $line[$i],$matches)) {
        $line[$i] = $matches[2][0].' - '.$matches[1][0];
    }
    // in other cases (no age, no jobtitle), the line is not modified at all.
}

Making the leap from PHP to Python

I am a fairly comfortable PHP programmer and have very little Python experience. I am trying to help a buddy with his project. The code is easy enough to write in PHP, and I have most of it ported over, but I need a bit of help completing the translation.
The target is to:
Generate a list of basic objects with uids.
Randomly select a few items to create a second list, keyed by uid, containing new properties.
Test for intersections between the two lists to alter the response accordingly.
The following is a working example of what I am trying to code in Python
<?php
srand(3234);
class Object{ // Basic item description
public $x =null;
public $y =null;
public $name =null;
public $uid =null;
}
class Trace{ // Used to update status or move position
# public $x =null;
# public $y =null;
# public $floor =null;
public $display =null; // Currently all we care about is controlling display
}
##########################################################
$objects = array();
$dirtyItems = array();
#CREATION OF ITEMS########################################
for($i = 0; $i < 10; $i++){
$objects[] = new Object();
$objects[$i]->uid = rand();
$objects[$i]->x = rand(1,30);
$objects[$i]->y = rand(1,30);
$objects[$i]->name = "Item$i";
}
##########################################################
#RANDOM ITEM REMOVAL######################################
foreach( $objects as $item )
if( rand(1,10) <= 2 ){ // Simulate full code with 20% chance to remove an item.
$derp = new Trace();
$derp->display = false;
$dirtyItems[$item->uid] = $derp; //# <- THIS IS WHERE I NEED THE PYTHON HELP
}
##########################################################
display();
function display(){
global $objects, $dirtyItems;
foreach( $objects as $key => $value ){ // Iterate object list
if( !isset($dirtyItems[$value->uid]) ) // Print description
echo "<br />$value->name is at ($value->x, $value->y) ";
else // or Skip if on second list.
echo "<br />Player took item $value->uid";
}
}
?>
So, really I have most of it sorted I am just having trouble with Python's version of an Associative array, to have a list whose keys match the Unique number of Items in the main list.
The output from the above code should look similar to:
Player took item 27955
Player took item 20718
Player took item 10277
Item3 is at (8, 4)
Item4 is at (11, 13)
Item5 is at (3, 15)
Item6 is at (20, 5)
Item7 is at (24, 25)
Item8 is at (12, 13)
Player took item 30326
My Python skills are still coarse, but this is roughly the same code block as above.
I've been looking at and trying to use list functions .insert( ) or .setitem( ) but it is not quite working as expected.
This is my current Python code, not yet fully functional
import random
import math
# Begin New Globals
dirtyItems = {} # This is where we store the object info
class SimpleClass: # This is what we store the object info as
    pass
# End New Globals
# Existing definitions
objects = []
class Object:
    def __init__(self,x,y,name,uid):
        self.x = x # X and Y positioning
        self.y = y #
        self.name = name #What will display on a 'look' command.
        self.uid = uid
def do_items():
    global dirtyItems, objects
    for count in xrange(10):
        X=random.randrange(1,20)
        Y=random.randrange(1,20)
        UID = int(math.floor(random.random()*10000))
        item = Object(X,Y,'Item'+str(count),UID)
        try: #This is the new part, we defined the item, now we see if the player has moved it
            if dirtyItems[UID]:
                print 'Player took ', UID
        except KeyError:
            objects.append(item) # Back to existing code after this
            pass # Any error generated attempting to access means that the item is untouched by the player.
# place_items( )
random.seed(1234)
do_items()
for key in objects:
    print "%s at %s %s." % (key.name, key.x, key.y)
    if random.randint(1, 10) <= 1:
        print key.name, 'should be missing below'
        x = SimpleClass()
        x.display = False
        dirtyItems[key.uid]=x
print ' '
objects = []
random.seed(1234)
do_items()
for key in objects:
    print "%s at %s %s." % (key.name, key.x, key.y)
print 'Done.'
So, sorry for the long post, but I wanted to be thorough and provide both sets of full code. The PHP works perfectly, and the Python is close. If anyone can point me in the correct direction it would be a huge help.
dirtyItems.insert(key.uid,x) is what I tried to use to make a list work as an associative array.
Edit: minor correction.
You're declaring dirtyItems as a list instead of a dictionary. In Python they're distinct types.
Do dirtyItems = {} instead.
Make a dictionary instead of a list:
import random
import math
dirtyItems = {}
Then you can use like:
dirtyItems[key.uid] = x
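A dictionary also replaces the try/except lookup: the `in` operator tests whether a uid has an entry. The sketch below is a minimal, self-contained illustration; the `Trace` class mirrors the one in the PHP version, while the two items and their uids are made up.

```python
class Trace:
    """Per-item overrides; only `display` is used here."""
    def __init__(self, display=True):
        self.display = display

# Made-up items; in the real code these would be Object instances.
objects = [{"uid": 101, "name": "Item0"}, {"uid": 202, "name": "Item1"}]

dirty_items = {}                         # uid -> Trace, like the PHP associative array
dirty_items[202] = Trace(display=False)  # the player took Item1

def describe(item):
    # `in` checks for the key without raising KeyError
    if item["uid"] in dirty_items:
        return "Player took item %d" % item["uid"]
    return "%s is untouched" % item["name"]

lines = [describe(item) for item in objects]
print("\n".join(lines))
```

`dirty_items.get(uid)` is an alternative when the stored object itself is needed; it returns `None` for a missing key.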

Using a Naive Bayes Classifier to classify tweets: some problems

Using, amongst other sources, various posts here on Stack Overflow, I'm trying to implement my own PHP classifier to classify tweets into a positive, neutral and negative class. Before coding, I need to get the process straight. My train of thought and an example are as follows:
Bayes' theorem:
$$p(\text{class} \mid \text{words}) = \frac{p(\text{class}) \, p(\text{words} \mid \text{class})}{p(\text{words})}$$
With the assumption that p(words) is the same for every class, this leads to calculating
$$\arg\max_{\text{class}} \; p(\text{class}) \, p(\text{words} \mid \text{class})$$
with
$$p(\text{words} \mid \text{class}) = p(\text{word}_1 \mid \text{class}) \, p(\text{word}_2 \mid \text{class}) \cdots$$
and
$$p(\text{class}) = \frac{\#\text{words in class}}{\#\text{words in total}}$$
and
$$p(\text{word} \mid \text{class}) = \frac{p(\text{word}, \text{class})}{p(\text{class})} = p(\text{word}, \text{class}) \cdot \frac{1}{p(\text{class})}$$
$$= \frac{\#\text{times word occurs in class}}{\#\text{words in total}} \cdot \frac{\#\text{words in total}}{\#\text{words in class}} = \frac{\#\text{times word occurs in class}}{\#\text{words in class}}$$
Example:
------+----------------+-----------------+
class | words | #words in class |
------+----------------+-----------------+
pos | happy win nice | 3 |
neu | neutral middle | 2 |
neg | sad loose bad | 3 |
------+----------------+-----------------+
p(pos) = 3/8
p(neu) = 2/8
p(neg) = 3/8
Calculate: argmax(sad loose)
p(sad loose|pos) = p(sad|pos) * p(loose|pos) = (0+1)/3 * (0+1)/3 = 1/9
p(sad loose|neu) = p(sad|neu) * p(loose|neu) = (0+1)/3 * (0+1)/3 = 1/9
p(sad loose|neg) = p(sad|neg) * p(loose|neg) = 1/3 * 1/3 = 1/9
p(pos) * p(sad loose|pos) = 3/8 * 1/9 = 0.0416666667
p(neu) * p(sad loose|neu) = 2/8 * 1/9 = 0.0277777778
p(neg) * p(sad loose|neg) = 3/8 * 1/9 = 0.0416666667 <-- should be 100% neg!
As you can see, I have "trained" the classifier with a positive ("happy win nice"), a neutral ("neutral middle") and a negative ("sad loose bad") tweet. In order to prevent problems of having probabilities of zero because of one word missing in all classes, I'm using Laplace (or "add-one") smoothing, see "(0+1)".
I basically have two questions:
Is this a correct blueprint for implementation? Is there room for improvement?
When classifying a tweet ("sad loose"), it is expected to be 100% in class "neg" because it only contains negative words. The Laplace smoothing is however making things more complicated: class pos and neg have an equal probability. Is there a workaround for this?
There are two main elements to improve in your reasoning.
First, you should improve your smoothing method:
When applying Laplace smoothing, it should be applied to all measurements, not just to those with a zero count.
In addition, Laplace smoothing for such cases is usually given by (c+1)/(N+V), where V is the vocabulary size (e.g., see in Wikipedia).
Therefore, using probability function you have defined (which might not be the most suitable, see below):
p(sad loose|pos) = (0+1)/(3+8) * (0+1)/(3+8) = 1/121
p(sad loose|neu) = (0+1)/(2+8) * (0+1)/(2+8) = 1/100
p(sad loose|neg) = (1+1)/(3+8) * (1+1)/(3+8) = 4/121 <-- would become argmax
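The (c+1)/(N+V) rule can be checked on the question's three training tweets with a few lines of Python. This is a sketch only: the function names are made up, N is taken as the number of training words in each class (so N = 2 for neu), and V = 8 is the number of distinct words across all classes.

```python
from fractions import Fraction

# Training tweets from the question, one per class
training = {"pos": "happy win nice", "neu": "neutral middle", "neg": "sad loose bad"}
word_lists = {cls: text.split() for cls, text in training.items()}
vocabulary = {w for words in word_lists.values() for w in words}  # V = 8

def smoothed(word, cls):
    """Laplace-smoothed p(word|class) = (c + 1) / (N + V)."""
    words = word_lists[cls]
    return Fraction(words.count(word) + 1, len(words) + len(vocabulary))

def likelihood(tweet, cls):
    """Product of the smoothed p(word|class) over the words of the tweet."""
    result = Fraction(1)
    for word in tweet.split():
        result *= smoothed(word, cls)
    return result

scores = {cls: likelihood("sad loose", cls) for cls in training}
best = max(scores, key=scores.get)
print(scores["pos"], scores["neg"], best)  # 1/121 4/121 neg
```

Multiplying by the priors 3/8, 2/8 and 3/8 does not change the winner here, since neg already has the largest likelihood and shares the largest prior.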
In addition, a more common way of calculating the probability in the first place, would be by:
(number of tweets in class containing term c) / (total number of tweets in class)
For instance, in the limited trainset given above, and disregarding smoothing, p(sad|pos) = 0/1 = 0, and p(sad|neg) = 1/1 = 1. When the trainset size increases, the numbers would be more meaningful. e.g. if you had 10 tweets for the negative class, with 'sad' appearing in 4 of them, then p(sad|neg) would have been 4/10.
Regarding the actual number outputted by the Naive Bayes algorithm: you shouldn't expect the algorithm to assign actual probability to each class; rather, the category order is of more importance. Concretely, using the argmax would give you the algorithm's best guess for the class, but not the probability for it. Assigning probabilities to NB results is another story; for example, see an article discussing this issue.
Naive Bayes Algorithm with Laplacian Correction
import pandas as pd
#Calculate Frequency Of Each Value
def CountFrequency(my_list,my_list2,st,st2):
    # Count the rows where my_list equals st and my_list2 equals st2
    counter=0
    for i in range(len(my_list)):
        if (my_list[i]==st and my_list2[i]==st2):
            counter=counter+1
    return counter
#Reading headers From File
headers=pd.read_excel('data_set.xlsx').columns
#Reading From File
df = pd.read_excel('data_set.xlsx')
a=[]
for i in range(len(df.columns)):
    a.append([])
for i in range(len(df.columns)):
    for row in df.iterrows():
        a[i].append(row[1][i])
        #print(row[1][i])
#Calculate Table Info
result=[]
length=len(a[0])
tableInfo=sorted(set(a[-1]))
for i in range(len(tableInfo)):
    result.append([])
for i in tableInfo :
    print("P(",headers[-1],"=\"",i,"\") =", CountFrequency(a[-1],a[-1],i,i),"/",length,"=",CountFrequency(a[-1],a[-1],i,i)/length)
#Take User Input and Calculate Columns Info
for i in range(len(df.columns)-1):
    print("Choose value for attribute ",headers[i],"that you want from list : ")
    c=1
    b=sorted(set(a[i]))
    for j in b:
        print(c,":",j)
        c=c+1
    choose=int(input("Enter Number Of Your Choice : "))
    co=0
    for k in tableInfo:
        #Laplacian Correction (applied only when the count is zero)
        if CountFrequency(a[-1],a[i],k,b[choose-1])!=0:
            print("P(",headers[i],"=\"",b[choose-1],"\"|",headers[-1],"=\"",k,"\") =", CountFrequency(a[-1],a[i],k,b[choose-1]),"/",CountFrequency(a[-1],a[-1],k,k),"=",CountFrequency(a[-1],a[i],k,b[choose-1])/CountFrequency(a[-1],a[-1],k,k))
            result[co].append(CountFrequency(a[-1],a[i],k,b[choose-1])/CountFrequency(a[-1],a[-1],k,k))
        else:
            print("P(",headers[i],"=\"",b[choose-1],"\"|",headers[-1],"=\"",k,"\") =", CountFrequency(a[-1],a[i],k,b[choose-1])+1,"/",CountFrequency(a[-1],a[-1],k,k)+len(sorted(set(a[i]))),"=",((CountFrequency(a[-1],a[i],k,b[choose-1])+1)/(CountFrequency(a[-1],a[-1],k,k)+len(sorted(set(a[i])))))," With Laplacian correction ")
            result[co].append(((CountFrequency(a[-1],a[i],k,b[choose-1])+1)/(CountFrequency(a[-1],a[-1],k,k)+len(sorted(set(a[i]))))))
        co=co+1
#Calculate Final Result Laplacian Correction
finalResult=[1]*len(tableInfo)
for res in range(len(result)):
    for i in range(len(result[res])):
        finalResult[res]*=result[res][i]
#Print final result
print("#####################################################################")
print("#####################################################################")
print("#####################################################################")
mx=0
pos=0
for i in range(len(tableInfo)) :
    print("P(X | ",headers[-1],"=\"",tableInfo[i],"\") =",finalResult[i]*(CountFrequency(a[-1],a[-1],tableInfo[i],tableInfo[i])/length))
    if mx<finalResult[i]*(CountFrequency(a[-1],a[-1],tableInfo[i],tableInfo[i])/length):
        mx=finalResult[i]*(CountFrequency(a[-1],a[-1],tableInfo[i],tableInfo[i])/length)
        pos=i
print("ThereFore X belongs To Class (\"",headers[-1],"=",tableInfo[pos],"\")")
