I've got a website which lists sports scores. It currently works like this:
Firstname Lastname 1-0 Firstname Lastname
It explodes this on spaces, then explodes the third element (containing the score) on the hyphen.
The problem is that this does not support names with more than two words. If I explode on the hyphen first, it would not support names that contain a hyphen. The results are pasted into a textarea because I have many thousands to add, so I don't want to use multiple input fields; right now I can add matches quickly by listing one result per line. Does anyone have advice on how to make the system handle both multi-word names and special characters? Is there maybe a way to split when it encounters a number, then take the first chunk as the first name, the last as that player's score, and the rest as the last name?
I don't know if there's any way to teach a simple parsing command, or even a regular expression, to do what you want. Some cases will always be ambiguous. For example, if you have the names "Mary Ann Steiner" and "Constantin Van Dyke", the patterns are exactly the same, but one needs to be split (2/1) and the other needs to be split (1/2).
You could possibly find a library that knows how to make educated guesses based on a huge dictionary of known names, but failing that...
I think in this case you need the human brain inputting the data to make some of the decisions, and indicate them during data entry. In my experience using multiple fields isn't that slow if you navigate using the tab key instead of mousing around. You could also enter the data using a delimiter of your own, like:
Mary Ann,Steiner,2-3
Constantin,Van Dyke,4-2
Then you'd run something that explodes those lines based on "," and enters the elements into your db.
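A minimal sketch of that delimiter-based import, assuming one result per line in the form first name, last name, score, as in the two sample lines above. The function name parseDelimitedLine and the array keys are made up for illustration, and the database insert is left out.

```php
<?php
// Hypothetical helper: split one "First,Last,2-3" line into its parts.
function parseDelimitedLine(string $line): ?array
{
    $parts = array_map('trim', explode(',', $line));
    if (count($parts) !== 3) {
        return null; // wrong number of fields
    }
    [$first, $last, $score] = $parts;
    // The score is the only place where "-" is treated as a separator.
    if (!preg_match('/^(\d+)-(\d+)$/', $score, $m)) {
        return null;
    }
    return [
        'first'   => $first,
        'last'    => $last,
        'for'     => (int) $m[1],
        'against' => (int) $m[2],
    ];
}

$textarea = "Mary Ann,Steiner,2-3\nConstantin,Van Dyke,4-2";
foreach (preg_split('/\R/', trim($textarea)) as $line) {
    print_r(parseDelimitedLine($line));
}
```

Because the names are never split on spaces or hyphens, "Mary Ann" and "Van Dyke" survive intact. A name that itself contains a comma would still need a different delimiter.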
If you're copy/pasting or scraping the data from an external site, another option would be to just explode every line using the method you're currently using. This should work for most records, and when it doesn't work, it will be obvious -- the resulting record will have too many elements. You can have your script flag just those records for human intervention.
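The flagging idea could look roughly like this. It assumes the original "Firstname Lastname 1-0 Firstname Lastname" format, so a clean line explodes into exactly five elements. The function name splitResults, the array keys, and the sample names are illustrative only.

```php
<?php
// Sketch: explode every line on spaces and set aside the ones that
// do not produce exactly five elements (two names, a score, two names).
function splitResults(string $text): array
{
    $ok = [];
    $flagged = [];
    foreach (preg_split('/\R/', trim($text)) as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // ignore blank lines
        }
        $parts = preg_split('/\s+/', $line);
        // Expected layout: [first, last, "1-0", first, last]
        if (count($parts) === 5 && preg_match('/^\d+-\d+$/', $parts[2])) {
            $ok[] = $parts;
        } else {
            $flagged[] = $line; // needs a human to decide the split
        }
    }
    return ['ok' => $ok, 'flagged' => $flagged];
}

$result = splitResults("John Smith 1-0 Jane Doe\nMary Ann Steiner 2-3 Constantin Van Dyke");
echo count($result['ok']), " parsed, ", count($result['flagged']), " flagged\n";
```

The flagged lines can then be shown on a review page instead of being inserted blindly.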
Related
I'm attempting what I thought would be a simple exercise, but unless I’m missing a trick, it seems anything but simple.
I'm attempting to clean up user input into a form before saving it. The particular problem I have is with hyphenated town names. For example, take Bourton-on-the-Water. Assume the user has Caps Lock on, or puts spaces next to the hyphens, or makes any other mistake that might come to mind. How do I, within reason, turn it into what it’s meant to be?
You can use trim() to remove whitespace (or other characters) from the beginning and end of a string. You can also use explode() to break strings into parts by a specified character and then recreate your string as you like.
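As a rough sketch of the trim() and explode() approach: the function below lower-cases the input, splits it on hyphens, trims each piece and rebuilds the name. The name normaliseTown and the list of words kept in lower case are assumptions for illustration. The list is incomplete, and names typed with spaces instead of hyphens are not handled.

```php
<?php
// Sketch: normalise a hyphenated town name using explode() and trim().
// $smallWords is an assumed, incomplete list of words kept lower-case.
function normaliseTown(string $input): string
{
    $smallWords = ['on', 'the', 'upon', 'under', 'in', 'by', 'le'];
    $parts = array_map('trim', explode('-', strtolower(trim($input))));
    // Drop empty pieces caused by doubled or trailing hyphens.
    $parts = array_values(array_filter($parts, fn($p) => $p !== ''));
    foreach ($parts as $i => $part) {
        // The first piece is always capitalised; small words elsewhere are not.
        if ($i === 0 || !in_array($part, $smallWords, true)) {
            $parts[$i] = ucfirst($part);
        }
    }
    return implode('-', $parts);
}

echo normaliseTown('  BOURTON - on -THE- water '), "\n"; // Bourton-on-the-Water
```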
I think the only way you can really accomplish this is by improving the way the user inputs their data.
For example use a postcode lookup system that enters an address based on what they type.
Or have an autocomplete from a predefined list of towns (similar to how Facebook shows towns).
To consider every possible permutation of Bourton On The Water / Bourton-On-The-Water etc... is pretty much impossible.
I'm developing a system where users can create their own personal recipes with corresponding ingredients and save them (in MySQL).
The problem is that every time an ingredient is saved, I check whether it already exists in the ingredients table by comparing the ingredient names.
To be able to build proper shopping lists from the recipes, I want to make sure that, for example:
apple - apples - fresh apples
cannot all appear as separate ingredients.
So if "apple" is created first and I then try to save "apples", I want to check whether something similar already exists.
Does an algorithm like the one I'm trying to describe already exist?
Hope you have some input!
While it is possible to use soundex or Levenshtein distance, it would still require finding the key word in the phrase - with 'apple' and 'apples' it would probably work, but with 'dozen of fresh apples' - probably not.
From my experience, in that application nothing beats more manual algorithms:
create a base list of ingredients ("flour", "apple", "ham")
when adding a new recipe, match its ingredient list against the base list, possibly allowing for some fuzziness using Levenshtein or regexes
create a backend page with a list of "original" vs "match", with the possibility to mark wrong matches with a single click
create a simple interface to do a manual matching for bad hits
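The matching step in the list above might be sketched as follows. The function name matchIngredient and the distance threshold are assumptions, not a tested recipe. The input is compared word by word so that a phrase can still hit a base ingredient.

```php
<?php
// Sketch: find the closest entry in a base ingredient list.
// $maxDistance is an assumed threshold; tune it on real data.
function matchIngredient(string $input, array $baseList, int $maxDistance = 2): ?string
{
    $words = preg_split('/\s+/', strtolower(trim($input)));
    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($baseList as $candidate) {
        // Compare each word of the input, so "fresh apples" can still hit "apple".
        foreach ($words as $word) {
            $distance = levenshtein($word, $candidate);
            if ($distance < $bestDistance) {
                $bestDistance = $distance;
                $best = $candidate;
            }
        }
    }
    return $bestDistance <= $maxDistance ? $best : null;
}

$base = ['flour', 'apple', 'ham'];
var_dump(matchIngredient('fresh apples', $base)); // string(5) "apple"
var_dump(matchIngredient('saffron', $base));      // NULL
```

Short words are easy to mismatch with a fixed threshold ("jam" is one edit away from "ham"), which is why the review page for wrong matches matters.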
You might have some luck with MySQL's SOUNDEX() function, assuming that the words are similar enough and, probably, simple enough.
Documentation can be found here: https://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex
Basically, it reduces a given word to a short code representing how it sounds (a standard soundex string is four characters long, although MySQL's SOUNDEX() can return a longer one). The code should be the same for any two words that sound largely the same.
In MySQL you can use the SOUNDEX() function.
If you want to implement it in PHP, there are the levenshtein() and similar_text() functions.
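For reference, here is a small illustration of the two PHP functions on the example words from the question. What counts as "similar enough" is left open here and would have to be decided on real data.

```php
<?php
// levenshtein(): number of single-character edits between two strings.
$distance = levenshtein('apple', 'apples');
echo $distance, "\n"; // 1

// similar_text(): number of matching characters; the third argument
// receives the similarity as a percentage.
$matching = similar_text('apple', 'apples', $percent);
echo $matching, ' ', round($percent, 1), "\n"; // 5 90.9
```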
I'm working on a small PHP/Javascript map application that is basically just pulling a bunch of location names and coordinate points from a MySQL database table and adding them to an HTML canvas. It's supposed to represent locations visited by a character in an ongoing collaborative story, so I'd like to be able to also have the map retrieve the character's position on the map at any given time and display this with a different icon.
The most obvious solution -- since the character, like the map locations, has a name and coordinates -- seems to be just including the character as their own row in the locations table as well, and having the map code recognize their unique information and display them differently. But since the character is not, in and of themselves, a location, storing them in the "locations" table seems weird. Creating a whole new table, "character", just for this one row seems like overkill though.
So I guess my more general question is what is a good way for PHP/MySQL to deal with unique data like this, that is related to existing tables but not closely enough. Do I keep this data in a text file and update that with PHP?
There's nothing particularly "weird" about having a table which is intended to have only a single row. Indeed, using a database for some of your data and a text file for other data would probably be a bit more weird.
Having a table also gives you the possibility of tracking character locations over time, if you ever need that information for any reason. (Better to track it and not use it than need it and not have been tracking it.)
If the position of the character is calculated on the spot and doesn't need to be persisted then you can simply add it programmatically to the results from the database and it would be entirely transparent to both the database and the view. But if it does need to be persisted, a table is probably the way to go.
You should have a separate table for the character positions . . . over time. It would have columns such as:
Character id
Location
Date/time stamp
Eventually, you may want to have more than one character whose location can be shown over time. You may have non-characters; in this case, you'll want to change the name of the table.
There is a big difference between the locations and the character positions. The locations are static, at least once they are defined. The character positions are time dependent. They are a separate entity, and are best served by having their own table.
I have to format a telephone number list, and I'd like to extract and separate the prefix from the number for better viewing.
I have a list of all possible prefixes, but there is no regular pattern.
I mean, I could have these numbers:
00 - 12345 (short prefix)
0000 - 12345 (long prefix)
How can I manage that? Numbers are plain, without any special characters (i.e. without / \ - . , etc.).
Prefixes are like that:
030
031
0321
0322
...
...
Most of the time I have the customer's town (it's not required), so usually I can get the prefix from there, but that's not reliable, since the town and the telephone number might not be linked.
== EDIT ==
The prefix list is 231 entries long. I may add a few more, so take 300 as a safe value.
Moreover, the prefixes come from a single country only (Italy).
I have to save plain numbers without any separator so users can search for them. In fact, if users type separators, they will never be able to find the number again.
More info
A prefix ALWAYS starts with a leading 0, and its length ranges from 2 to 4 digits.
But the more I study this, the more I think I can't work it out :(
Because of the extremely varied telephone number formats used around the world, it's probably going to be tough to correctly parse any phone number that is put into your system.
I'm not certain if it would make your task any easier, but I had the idea that parsing from right to left might be easier for you, since it's the prefix length that's unknown.
What a pain. I would use a logic funnel to narrow possible choices and finally take a best guess.
First, test if the first few numbers can match anything on your prefix list. For some, hopefully only one prefix can possibly be correct.
Then, perhaps you could use the city to eliminate prefixes from entirely different countries.
Finally, you could default to the most popular format for prefixes.
Without any other information, you can't do better than a good guess unless you want to default to no format at all.
I'm really confused. What do you mean, "extract and separate"? My guess is these phone numbers are in a MySQL database, and you get to use PHP. Are you trying to get the prefix from the numbers, and then insert them into a different field in the same row? Are you pulling these numbers from the database, and you would just like to print the prefixes to the screen?
Regardless of what you're trying to do, and taking for granted that you're using PHP and regexes, isn't this essentially what you're looking for?
// Capture the run of digits before the first hyphen.
$telephone_number = '333-12345';
$matched = array();
if (preg_match('~^(\d+)-~', $telephone_number, $matched)) {
    echo $matched[1]; // prints 333
}
OK, I worked it out.
I noticed that no short prefix is the beginning of a longer one.
I mean:
02 -> there will never be a longer prefix such as 021, 022 and so on
so things are pretty easy now:
I get first 4 numbers -> is that in my prefix table?
YES: stop here
NO: get first 3
and so on..
thanks for your help
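A sketch of that longest-first lookup, assuming the prefix list has been loaded into an array keyed by prefix. The function name splitPrefix is hypothetical, and the handful of prefixes below are just the ones mentioned in this thread.

```php
<?php
// Sketch: try the longest possible prefix first, then shorter ones.
// $prefixes is an assumed lookup array keyed by prefix string.
function splitPrefix(string $number, array $prefixes): ?array
{
    for ($length = 4; $length >= 2; $length--) {
        if (strlen($number) <= $length) {
            continue; // nothing would be left for the number itself
        }
        $candidate = substr($number, 0, $length);
        if (isset($prefixes[$candidate])) {
            return [$candidate, substr($number, $length)];
        }
    }
    return null; // no known prefix
}

$prefixes = array_flip(['02', '030', '031', '0321', '0322']);

[$prefix, $rest] = splitPrefix('032112345', $prefixes);
echo "$prefix - $rest\n"; // 0321 - 12345

[$prefix, $rest] = splitPrefix('0212345', $prefixes);
echo "$prefix - $rest\n"; // 02 - 12345
```

The stored number stays plain; the split is only applied when displaying it.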
Let's say I'm collecting tweets from Twitter based on a variety of criteria and storing these tweets in a local MySQL database. I want to be able to compute trending topics, like Twitter does, that can be anywhere from 1 to 3 words in length.
Is it possible to write a script to do something like this with PHP and MySQL?
I've found answers on how to compute which terms are "hot" once you have counts of the terms, but I'm stuck at the first part. How should I store the data in the database, and how can I count the frequency of terms that are 1 to 3 words in length?
My recipe for trending topics:
1. fetch the tweets
2. split each tweet on spaces into an array of n-grams (up to 3-grams if you want phrases of up to 3 words)
3. filter URLs, usernames, common words and junk characters out of each array
4. count the frequency of every unique keyword / phrase
5. mute some junk words / phrases
Yes, you can do it with PHP & MySQL ;)
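Steps 2 to 4 could be sketched like this. It works on an in-memory array of tweets rather than a MySQL table. The function name countNgrams, the tiny stop-word list, the sample tweets and the choice to skip tokens starting with "@" are all assumptions. Removing stop words before building the n-grams is a simplification, since it can join words that were not adjacent in the tweet.

```php
<?php
// Sketch: count 1- to 3-word phrases across a list of tweets.
// $stopWords is a tiny assumed list; a real one would be much longer.
function countNgrams(array $tweets, array $stopWords, int $maxN = 3): array
{
    $counts = [];
    foreach ($tweets as $tweet) {
        $tokens = [];
        foreach (preg_split('/\s+/', strtolower(trim($tweet))) as $token) {
            // Skip URLs and @usernames, then strip punctuation.
            if (preg_match('~^(https?://|@)~', $token)) {
                continue;
            }
            $token = preg_replace('/[^\p{L}\p{N}#]+/u', '', $token);
            if ($token !== '' && !in_array($token, $stopWords, true)) {
                $tokens[] = $token;
            }
        }
        // Build every n-gram of length 1..$maxN and count it.
        for ($n = 1; $n <= $maxN; $n++) {
            for ($i = 0; $i + $n <= count($tokens); $i++) {
                $phrase = implode(' ', array_slice($tokens, $i, $n));
                $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
            }
        }
    }
    arsort($counts); // most frequent first
    return $counts;
}

$tweets = ['Big game tonight http://example.com', '@sam big game was great'];
$top = countNgrams($tweets, ['the', 'was', 'a']);
echo $top['big game'], "\n"; // 2
```

The resulting phrase counts could then be written to a table with a timestamp, so that frequencies can be compared between time windows.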
How about first decomposing your tweets into single-word tokens and calculating the number of occurrences of each word?
Once you have those, you could decompose the tweets into all two-word tokens, calculate the number of occurrences, and finally do the same with all three-word tokens.
You might also want to add some kind of dictionary of words you don't want to count.
What you need is either
document classification, or
automatic tagging
Probably the second one. Only then can you count the tags' popularity over time.
Or do the opposite of Dominik and store a set list of phrases you wish to match, spaces and all. Write them as regex strings. For each row in database (file, sql table, whatever), process regex, find count.
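A minimal sketch of that set-phrase lookup, assuming the rows have already been read into an array. The function name countPhrases, the sample rows and the watch list are made up for illustration.

```php
<?php
// Sketch: count how often each phrase from a fixed list appears.
// $rows stands in for rows read from the database.
function countPhrases(array $rows, array $phrases): array
{
    $counts = array_fill_keys($phrases, 0);
    foreach ($phrases as $phrase) {
        // preg_quote() escapes the phrase; \b keeps matches on word boundaries.
        $pattern = '/\b' . preg_quote($phrase, '/') . '\b/i';
        foreach ($rows as $row) {
            $counts[$phrase] += preg_match_all($pattern, $row);
        }
    }
    return $counts;
}

$rows = ['World Cup final tonight', 'Who wins the world cup?', 'Cup of tea'];
$found = countPhrases($rows, ['world cup', 'final']);
print_r($found); // world cup => 2, final => 1
```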
It depends on which way round you want to do it: count everything and discard what is merely common, thereby finding what is truly trending, or look up a set list of phrases. In the first case you'll find a lot that might not interest you and you'll need an extensive blocklist; in the other you'll need a huge whitelist.
To go beyond that, you need natural language processing tools to determine the meaning of what is said.