Using bag of words - php

I am looking into implementing bag of words approach when dealing with emails stored as text files. I want to use keywords that could indicate that the email needs reply, analyse the emails with binary (something like 1|0|1|0|0 etc depending if the word is used) and then obtain a feature vectors that I could use with different ML algorithms.
I was thinking about using PHP to obtain the feature vectors but I can't find any existing implementations. Is it even possible to do something like that in PHP?

Yes bag of words makes much sense for making classifiers. i am also doing thesis on text classification and i m using php and mysql for it. i m little bit confused about creating bag of words. But after some time it can be done.

Related

Is there a way for PHP (or jQuery) to check if a string is human readable?

Human readable, meaning the string is a real word. This is essentially a form validation. Ideally I'd like to test the 'texture' of the form responses to determine if an actual user has filled out the form versus someone looking for form vulnerabilities. Possibly using a dictionary look-up on the POSTed data and then giving a threshold of returned 'real words'.
I don't see anything in the PHP docs and the Google machine isn't offering up anything, at least this specific. I suspect that someone out there has written a PHP class or even a jQuery plugin that can do this. Something like so:
$string = "laiqbqi";
is_this_string_human_readable($string);
Any ideas?
This can be done using something called Markov Chains.
Essentially, they read through a large chunk of text in a given language (English, French, Russian, etc.) and determine the probability of one character being after another.
e.g. a "q" has a much lower probability of occurring after a "z" than a vowel such as "a" does.
At a lower level, this is actually implemented as a state machine.
As per Mike's comment, a PHP version of this can be found here.
For flavor, an amusing the Daily WTF article on Markov Chains.

rawurlencode for storing data

I have always used rawurlencode to store user entered data into my mysql databases. The main reason I do this is so that stroing foreign characters is very simple I find. I'd then use rawurldecode to retrieve and display the data.
I read somewhere that rawurlencode was not meant for this purpose. Are there any disadvantages to what I'm doing?
So let's say I have a German address with many characters like umlauts etc. What is the simplest way to store this in a mysql database with no risks of it coming out wrong and being searchable using a search script? So far rawurelencode has been excellent for our system. Perhaps the practise can be improved upon by only encoding foreign letters and not common characters like spaces etc, which is a waste of space I totally agree.
Sure there are.
Let's start with the practical: for a large class of characters you are spending 3 bytes of storage for every byte of data. The description of rawurlencode (and of course the RFC) say that those characters are
all non-alphanumeric characters except -_.~
This means that there is a total of 26 + 26 + 10 (alphanumeric) + 4 (special exceptions) = 66 characters for which you do not waste space.
Then there are also the logical drawbacks: You are not storing the data itself, but rather a representation of the data tailored to URLs. Unless the data itself is URLs, that's not what you should be doing.
Drawbacks I can think of:
Waste of disk space.
Waste of CPU cycles encoding and decoding on every read and every write.
Additional complexity (you can't even inspect data with a MySQL client).
Impossibility to use full text searches.
URL encoding is not necessarily unique (there're at least two RFCs). It may not lead to data loss but it can lead to duplicate data (e.g., unique indexes where two rows actually contain the same piece of data).
You can accidentally encode a non-string piece of data such as a date: 2012-04-20%2013%3A23%3A00
But the main consideration is that such technique is completely arbitrary and unnecessary since MySQL doesn't have the least problem storing the complete Unicode catalogue. You could also decide to swap e's and o's in all strings: Holle, werdl!. Your app would run fine but it would not provide any added value.
Update: As Your Common Sense points out, a SQL clause as basic as ORDER BYis no longer usable. It's not that international chars will be ignored; you'll basically get an arbitrary sort order based on the ASCII code of the % and hexadecimal characters. If you can't SELECT * FROM city ORDER BY city_name reliably, you've rendered your DB useless.
I am using a fork to eat a soup
I am using money bills to fire the coals for BBQ
I am using a kettle to boil eggs.
I am using a microscope to hammer the nails.
Are there any disadvantages to what I'm doing?
YES
You are using a tool not on purpose. This is always a disadvantage.
A sane human being alway using a tool that is intended for the certain job. Not some randomly picked one. Especially if there is no shortage in the right tool supply.
URL encoding is not intended to be used with database, as one can tell from the name. That's alone reason enough for the sane developer. Take a look around: find the proper tool.
There is a thing called "common sense" - a thing widely used in the regular life but for some reason always absent in the php world.
A common sense can warn us: if we're using a wrong tool, it may spoil the work. Sooner or later it will spoil it. No need to ask for the certain details - it's a general rule. We are learning this rule at about age of 5.
Why not to use it while playing with some web thingies too?
Why not to ask yourself a question:
What's wrong with storing foreign characters at all?
urlencode makes stroing foreign characters very simple
Any hardships you encountered without urlencode?
Although I feel that common sense should be enough to answer the question, people always look for the "omen", the proof. Here you are:
Database's job is not limited to just storing and retrieving data. A plain text file can handle such a primitive task as well.
Data manipulations is what we are using databases for.
Most widely used ones are sorting and filtering.
Such a quite intelligent thing as a database can sort and filter data character-insensitive, which is very handy feature. But of course it can be done only if characters being saved as is, not as some random codes.
Sorting texts also may use ordering other than just binary order in the character table. Some umlaut characters may be present at the other parts of the table but database collation will put them in the right place. Of course it can be done only if characters being saved as is, not as some random codes.
Sometimes we have to manipulate the data that already stored in the database. Say, cut some piece from the string and compare with the entered value. How it is supposed to be done with urlencoded data?

PHP Repairing Bad Text

This is something I'm working on and I'd like input from the intelligent people here on StackOverflow.
What I'm attempting is a function to repair text based on combining various bad versions of the same text page. Basically this can be used to combine different OCR results into one with greater accuracy than any of them individually.
I start with a dictionary of 600,000 English words, that's pretty much everything including legal and medical terms and common names. I have this already.
Then I have 4 versions of the text sample.
Something like this:
$text[0] = 'Fir5t text sample is thisline';
$text[1] = 'Fir5t text Smplee is this line.';
$text[2] = 'First te*t sample i this l1ne.';
$text[3] = 'F i r st text s ample is this line.';
I attempting to combine the above to get an output which looks like:
$text = 'First text sample is this line.';
Don't tell me it's impossible, because it is certainly not, just very difficult.
I would very much appreciate any ideas anyone has towards this.
Thank you!
My current thoughts:
Just checking the words against the dictionary will not work, since some of the spaces are in the wrong place and occasionally the word will not be in the dictionary.
The major concern is repairing broken spacings, once this is fixed then then the most commonly occurring dictionary word can be chosen if exists, or else the most commonly occurring non-dictionary word.
Have you tried using a longest common subsequence algorithm? These are commonly seen in the "diff" text comparison tools used in source control apps and some text editors. A diff algorithm helps identify changed and unchanged characters in two text samples.
http://en.wikipedia.org/wiki/Diff
Some years ago I worked on an OCR app similar to yours. Rather than applying multiple OCR engines to one image, I used one OCR engine to analyze multiple versions of the same image. Each of the processed images was the result of applying different denoising technique to the original image: one technique worked better for low contrast, another technique worked better when the characters were poorly formed. A "voting" scheme that compared OCR results on each image improved the read rate for arbitrary strings of text such as "BQCM10032". Other voting schemes are described in the academic literature for OCR.
On occasion you may need to match a word for which no combination of OCR results will yield all the letters. For example, a middle letter may be missing, as in either "w rd" or "c tch" (likely "word" and "catch"). In this case it can help to access your dictionary with any of three keys: initial letters, middle letters, and final letters (or letter combinations). Each key is associated with a list of words sorted by frequency of occurrence in the language. (I used this sort of multi-key lookup to improve the speed of a crossword generation app; there may well be better methods out there, but this one is easy to implement.)
To save on memory, you could apply the multi-key method only to the first few thousand common words in the language, and then have only one lookup technique for less common words.
There are several online lists of word frequency.
http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists
If you want to get fancy, you can also rely on prior frequency of occurrence in the text. For example, if "Byrd" appears multiple times, then it may be the better choice if the OCR engine(s) reports either "bird" or "bard" with a low confidence score. You might load a medical dictionary into memory only if there is a statistically unlikely occurrence of medical terms on the same page--otherwise leave medical terms out of your working dictionary, or at least assign them reasonable likelihoods. "Prosthetics" is a common word; "prostatitis" less so.
If you have experience with image processing techniques such as denoising and morphological operations, you can also try preprocessing the image before passing it to the OCR engine(s). Image processing could also be applied to select areas after your software identifies the words or regions where the OCR engine(s) fared poorly.
Certain letter/letter and letter/numeral substitutions are common. The numeral 0 (zero) can be confused with the letter O, C for O, 8 for B, E for F, P for R, and so on. If a word is found with low confidence, or if there are two common words that could match an incompletely read word, then ad hoc shape-matching rules could help. For example, "bcth" could match either "both" or "bath", but for many fonts (and contexts) "both" is the more likely match since "o" is more similar to "c" in shape. In a long string of words such as a a paragraph from a novel or magazine article, "bath" is a better match than "b8th."
Finally, you could probably write a plugin or script to pass the results into a spellcheck engine that checks for noun-verb agreement and other grammar checks. This may catch a few additional errors. Maybe you could try VBA for Word or whatever other script/app combo is popular these days.
Tackling complex algorithms like this by yourself will probably take longer and be more error prone than using a third party tool - unless you really need to program this yourself, you can check the Yahoo Spelling Suggestion API. They allow 5.000 requests per IP per day, I believe.
Others may offer something similar (I think there's a bing API, too).
UPDATE: Sorry, I just read that they've stopped this service in April 2011. They claim to offer a similar service called "Spelling Suggestion YQL table" now.
This is indeed a rather complicated problem.
When I do wonder how to spell a word, the direct way is to open a dictionary. But what if it is a small complex sentence that I'm trying to spell correctly ? One of my personal trick, which works most of the time, is to call Google. I place my sentence between quotes on Google and count the results. Here is an example : entering "your very smart" on Google gives 13'600k page. Entering "you're very smart" gives 20'000k pages. Then, likely, the correct spelling is "you're very smart". And... indeed it is ;)
Based on this concept, I guess you have samples which, for the most parts, are correctly misspelled (well, maybe not if your develop for a teens gaming site...). Can you try to divide the samples into sub pieces, not going up to the words, and matching these by frequency ? The most frequent piece is the most likely correctly spelled. Prior to this, you can already make a dictionary spellcheck with your 600'000 terms to increase the chance that small spelling mistakes will alredy be corrected. This should increase the frequency of correct sub pieces.
Dividing the sentences in pieces and finding the right "piece-size" is also tricky.
What concerns me a little too : how do you extract the samples and match them together to know the correctly spelled sentence is the same (or very close?). Your question seems to assume you have this, which also seems something very complex for me.
Well, what precedes is just a general tip based on my personal and human experience. Donno if this can help. This is obviously not a real answer and is not meant to be one.
You could try using google n-grams to achieve this.
If you need to get right string only by comparing other. Then Something like this maybe will help.
It not finished yet, but already gives some results.
$text[0] = 'Fir5t text sample is thisline';
$text[1] = 'Fir5t text Smplee is this line.';
$text[2] = 'First te*t sample i this l1ne.';
$text[3] = 'F i r st text s ample is this line.';
function getRight($arr){
$_final='';
$count=count($arr);
// Remove multi spaces AND get string lengths
for($i=0;$i<$count;$i++){
$arr[$i]=preg_replace('/\s\s+/', ' ',$arr[$i]);
$len[$i]=strlen($arr[$i]);
}
// Max length
$_max=max($len);
for($i=0;$i<$_max;$i++){
$_el=array();
for($j=0;$j<$count;$j++){
// Cheking letter counts
$_letter=$arr[$j][$i];
if(isset($_el[$_letter]))$_el[$_letter]++;
else$_el[$_letter]=1;
}
//Most probably count
list($mostProbably) = array_keys($_el, max($_el));
$_final.=$mostProbably;
// If probbaly example is not space
if($_el!=' '){
// THERE NEED TO BE CODE FOR REMOVING SPACE FROM LINES WHERE $text[$i] is space
}
}
return $_final;
}
echo getRight($text);

Is it safe to turn a UUID into a short code? (only use first 8 chars)

We use UUIDs for our primary keys in our db (generated by php, stored in mysql). The problem is that when someone wants to edit something or view their profile, they have this huge, scary, ugly uuid string at the end of the url. (edit?id=.....)
Would it be safe (read: still unique) if we only used the first 8 characters, everything before the first hyphen?
If it is NOT safe, is there some way to translate it into something else shorter for use in the url that could be translated back into the hex to use as a lookup? I know that I can base64 encode it to bring it down to 22 characters, but is there something even shorter?
EDIT
I have read this question and it said to use base64. again, anything shorter?
Shortening the UUID increases the probability of a collision. You can do it, but it's a bad idea. Using only 8 characters means just 4 bytes of data, so you'd expect a collision once you have about 2^16 IDs - far from ideal.
Your best option is to take the raw bytes of the UUID (not the hex representation) and encode it using base64. Or, just don't worry much, because I seriously doubt your users care what's in the URL.
Don't cut a single bit out of that UUID: You have no control over the algorithm that produced it, there are multiple possible implementation, algorithm implementation is subject to change (example: changed with the version of PHP you're using)
If you ask me an UUID in the address bar doesn't look scary or difficult at all, even a simple google search for "UUID" produces worst looking URL's, and everybody's used to looking at google URL's!
If you want nicer looking URL's, take a look at the address bar of this stackoverflow.com article. They're using the article ID followed by the title of the question. Only the ID part is relevant, everything else is there to make it easy on the eyes of readers (go ahead and try it, you can delete anything after the ID, you can replace it with junk - doesn't matter).
It is not safe to truncate uuid's. Also, they are designed to be globally unique, so you aren't going to have luck shortening them. Your best bet is to either assign each user a unique number, or let users pick a custom (unique) string (like a username, or nick name) that can be decoded. So you could have edit?id=.... or edit?name=blah and you then decode name into the uuid in your script.
It depends on how you're generating the UUID - if you're using PHP's uniqid then it's the right-most digits that are more "unique". However, if you're going to truncate the data, then there's no real guarantee that it'll be unique anyway.
Irrespective, I'd say that this is a somewhat sub-optimal approach - is there no way you can use a unique (and ideally meaningful) textual reference string instead of an ID in the query string? (Hard to know without more knowledge of the problem domain, but it's always a better approach in my opinion, even if SEO, etc. isn't a factor.)
If you were using this approach, you could also let MySQL generate the unique IDs, which is probably a considerably more sane approach than attempting to handle this in PHP.
If you're worried about scaring users with the UUID in the URL, why not write it out to a hidden form field instead?

coding inspiration needed - keywords contained within string

I have a particular problem and need to know the best way to go about solving it.
I have a php string that can contain a number of keywords (tags actually). For example:-
"seo, adwords, google"
or
"web development, community building, web design"
I want to create a pool of keywords that are related, so all seo, online marketing related keywords or all web development related keywords.
I want to check the keyword / tag string against these pools of keywords and if for example seo or adwords is contained within the keyword string it is matched against the keyword pool for online marketing and a particular piece of content is served.
I wish to know the best way of coding this. I'm guessing some kind of hash table or array but not sure the best way to approach it.
Any ideas?
Thanks
Jonathan
Three approaches come to my mind, although I'm sure there could be more. Of course in any case I would store the values in a database table (or config file, or whatever depending on your application) so it can be edited easily.
1) Easiest: Convert the list into a regular expression of the form "keyword1|keyword2|keyword3" and see if the input matches.
2) Medium: Add the words to a hashtable, then split the input into words (you may have to use regular expression replacing to remove punctuation) and try to find each word of input in the hashtable.
3) Hardest: This may not work depending on your exact situation, but if all the possible content can be indexed by a search solution (like Apache SOLR, for example) then your list of keywords could be used as a search string and you could return results above a particular level of relevance.
It's hard to know exactly which solution would work best without knowing more about your source data. A large number of keywords may jam up a regular expression, but if it's a short list then it might work great. If your inputs are long then #2 won't work so well because you have to test each and every input word. As always your mileage may vary, so I would start with the easiest solution I thought would work and see if the performance is acceptable.

Categories