I work on a website that lets people describe how they were treated when they requested support from companies. The issue is that some people are abusing the platform by submitting meaningless data like
blabla bal bla bka asdfdsff sdfs sdf
Is there a way to prevent this?
I can't validate the data manually because the website is very dynamic, with a lot of data.
Thanks
Improve your form validation checks.
For the phone number, make sure it's exactly the appropriate length and that it isn't (for example) a single repeated digit (e.g. 0777777777 is probably fake).
Calculate the letter frequency of the text. The most used letter in the English language is e (followed by t). If the ratio is completely different (for example, if there is no letter e in a 200-letter text), there is a big problem.
Also match the words against a dictionary. If the ratio of unknown words is larger than, say, 60%, you can consider the input invalid.
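A minimal PHP sketch of those two heuristics (the thresholds, and the $dictionary word list you would have to supply, are illustrative assumptions):

// Heuristic gibberish check: letter frequency plus unknown-word ratio.
// $dictionary is assumed to be an array of known lowercase words.
function looksLikeGibberish($text, array $dictionary) {
    $letters = preg_replace('/[^a-z]/', '', strtolower($text));
    // In normal English text, 'e' is the most frequent letter (~12%);
    // a long text with almost no 'e' is suspicious.
    if (strlen($letters) >= 200 && substr_count($letters, 'e') / strlen($letters) < 0.02) {
        return true;
    }
    // Flag the input when more than 60% of its words are unknown.
    $words = preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
    if (count($words) === 0) {
        return false;
    }
    $unknown = 0;
    foreach ($words as $word) {
        if (!in_array(preg_replace('/[^a-z]/', '', $word), $dictionary, true)) {
            $unknown++;
        }
    }
    return $unknown / count($words) > 0.6;
}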
Check dates too: if you're expecting a date in the next few days, you shouldn't accept a date from 30 years ago.
Think about the data you're expecting to receive and find limits for it; that's the only way. Good luck!
Short answer: no.
Long answer: you may want to try matching words against a dictionary. But this is not foolproof, and if you make the matching too strict you may get a lot of false positives.
Another way may be to build a blacklist of bogus words and match against that.
Also, you may want to reconsider making that particular field required. When a lot of people fill in bogus data, the form is probably set up wrong.
You can do it to an extent:
Validation on certain fields (phone number, email, numeric/text-only fields, etc.)
Restrict the user to pre-defined items, such as drop-downs and check-boxes, rather than plain text inputs where they have total freedom
Run submissions through a dictionary check and decide on a minimum percentage of recognized words that you'll accept.
Regardless of what you do, it'll never be 100%. The only (almost!) guaranteed method of correct validation with user input outside of pre-determined values would be to sit someone down and manually check every submitted piece of data. Even then, they're prone to human error and it still wouldn't be 100%.
My advice would be to keep all important fields to values you've already specified yourself with drop-downs, check-boxes, number spinners etc...
Add fields for 'additional comments' on certain items, but keep those fields out of the main processing of the submitted form.
I'm attempting what I thought would be a simple exercise, but unless I'm missing a trick, it seems anything but simple.
I'm attempting to clean up user input into a form before saving it. The particular problem I have is with hyphenated town names. For example, take Bourton-on-the-Water. Assume the user has Caps Lock on, or puts spaces next to the hyphens, or any other screw-up that might come to mind. How do I, within reason, turn it into what it's meant to be?
You can use trim() to remove whitespace (or other characters) from the beginning and end of a string. You can also use explode() to break strings into parts by a specified character and then recreate your string as you like.
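For example, a rough sketch of that approach for town names (the list of lowercase joining words is an assumption, and note this will also hyphenate genuinely space-separated names, which is part of why the predefined-list approach suggested below is safer):

// Normalize e.g. "BOURTON - on-THE water" to "Bourton-on-the-Water".
function normalizeTownName($input) {
    $joiners = ['on', 'the', 'of', 'upon', 'under', 'in', 'by', 'le', 'la'];
    // Split on any run of hyphens and/or whitespace.
    $parts = preg_split('/[\s-]+/', trim($input), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($parts as $i => $part) {
        $part = strtolower($part);
        // Keep joining words lowercase unless they start the name.
        $parts[$i] = ($i > 0 && in_array($part, $joiners, true)) ? $part : ucfirst($part);
    }
    return implode('-', $parts);
}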
I think the only way you can really accomplish this is by improving the way the user inputs their data.
For example use a postcode lookup system that enters an address based on what they type.
Or have an autocomplete from a predefined list of towns (similar to how Facebook suggests towns).
To consider every possible permutation of Bourton On The Water / Bourton-On-The-Water etc... is pretty much impossible.
I'm extremely new to Solr so go easy on me :)
I have a field that, for argument's sake, stores a product SKU. If the SKU in a document is 'SKU12345', how would I return the document if the query '1234' is entered?
I have previously tried using solr.EdgeNGramFilterFactory in the field type for the SKU, but unfortunately it only matches from the start of the string!
I want to avoid wildcards to keep performance optimal!
Thankssss :)
If you are new to Solr and you are beginning to implement features like this, I would recommend reading thoroughly through the chapter Understanding Analyzers, Tokenizers, and Filters of the reference guide, since there are several ways to make your query match, and the best choice depends on what you need.
Arun's suggestion is not bad, but N-grams alone are more geared towards finding arbitrary fragments of words. You would need this if you want to do some sort of type-ahead or auto-completion, e.g. a user starts to type within an input field somewhere and you want to suggest previously made input that matches in fragments. If you try to make this match with N-grams alone, your index may become quite large, since you may be required to index all permutations of the words so as not to miss the places where numbers/words start or end.
For your requirement I would tend to suggest the WordDelimiterFilter with splitOnNumerics="1". The input SKU12345 would then be indexed as follows:
SKU12345
12345
SKU
So if a user searches for 12345 this would make a match.
If you also want to match fragments of that - like the 1234 you mentioned - I would then place an N-GramFilter afterwards. You will need to play around with minGramSize and maxGramSize, and you will want to keep the gap between the two values small, since the larger the gap, the bigger your index will become.
e.g.
* minGramSize=4 and maxGramSize=5, gap of 1, few permutations
* minGramSize=1 and maxGramSize=5, gap of 4, more permutations
Which values to pick depends on how short the user input is allowed to be while still making a match.
If the input should only match from the start of a word and not hit fragments in the middle, I would suggest the EdgeN-GramFilter as an even better choice than the N-GramFilter. It only generates fragments from the start of a word, not from the middle, which leads to a further reduction in index size and better performance.
So if you want 2345 to match SKU12345 you need the N-GramFilter; if only input like 1234 needs to match SKU12345, the EdgeN-GramFilter will do.
You can also set side to "back" to generate the ngrams from right to left.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory
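Putting that together, a field type definition in schema.xml might look something like this (a sketch only; the gram sizes are illustrative, and the exact filter attributes available depend on your Solr version):

<fieldType name="sku_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1"
            generateWordParts="1" generateNumberParts="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="5"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1"
            generateWordParts="1" generateNumberParts="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Swap solr.EdgeNGramFilterFactory for solr.NGramFilterFactory if mid-word fragments like 2345 also need to match.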
I am struggling with the decision of whether to split up phone numbers stored in a MySQL database.
One school of thought is to break out the phone as:
area code (123)
prefix (123)
suffix (1234)
Another is to simply place the number in a single field, with whatever formatting is deemed appropriate:
1234567890
(123) 123-4567
123-456-7890
My initial reason for thinking the first would be better is in terms of being able to quickly and easily gather statistical data based on the phone numbers collected from our members (X number of members have a 123 area code for example).
Is there really a 'right' way to do it? I realize that, paired with PHP, I can retrieve and reformat it any way I want, but I'd like to follow best practice.
Thanks for your advice
EDIT
I will only be storing North American phone numbers for the time being.
I vote for one field, processing the data as you put it in so that it's in a known format. I've tried both ways, and the one-field approach seems to generate less code overall.
You want to store it in the most efficient way in the DB, precisely because it's so easy to reformat in PHP. Go for the all-numeric field, with no separators (1231231234) since that would be the best way. If you have international phone numbers, add the country code as well. Then in your code you can format it using regular expressions to look however you want it.
I would store phone numbers as strings, not numbers.
Phone numbers are identifiers that happen to use digits.
Phone numbers starting with zero are valid, but may be interpreted as octal by a programming language.
Strip the phone number to only digits and store the extension in a separate field.
This will allow for uniform formatting later.
For the US, strip the leading '1' digit, and determine formatting based on the length of the string (10 digits for the US).
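A minimal PHP sketch of that normalization (NANP numbers only; the function name is illustrative):

// Strip to digits and remove the leading country code '1' if present.
function normalizeUsPhone($input) {
    $digits = preg_replace('/\D/', '', $input);
    if (strlen($digits) === 11 && $digits[0] === '1') {
        $digits = substr($digits, 1); // 11 digits -> 10 digits
    }
    return $digits; // store this; apply formatting only on output
}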
I'm in the process of building a call-center application (it manages queues of contact information for a group of distributed callers to contact), and the architecture specified one field: no spaces, dashes, etc. After quite a bit of analysis, I agree it seems the best.
Based on the variability of entry for phone numbers (apostrophes, dots, dashes, and combinations of each) I built a simple function that deals with user entry, stripping down all but the numbers themselves, and also a "rebuilder" that reformats the raw number into something that's more visually appealing to the user.
Since they've been helpful to me, here's what I've written so far:
public static function cleanPhoneNumbers($input) {
    // Strip everything except digits.
    return preg_replace("/[^0-9]/", "", $input);
}

public static function formatPhoneNumbers($phone_number) {
    if (strlen($phone_number) == 7) {
        // Local number: 123-4567
        return preg_replace("/([0-9]{3})([0-9]{4})/", "$1-$2", $phone_number);
    } elseif (strlen($phone_number) == 10) {
        // Full NANP number: 123-456-7890
        return preg_replace("/([0-9]{3})([0-9]{3})([0-9]{4})/", "$1-$2-$3", $phone_number);
    } else {
        // Unrecognized length: return unchanged.
        return $phone_number;
    }
}
Some caveats: My app is not available for international customers right now (there's a voip application built into it that we don't want to allow to call outside of the US right now) so I've not taken the time to setup for international possibilities. Also, as this is in progress, I will likely return to refactor and bolster these functions later.
I've found one weakness so far that has been a bit of a pain for me. In my app, I have to disallow calls by timezone based on the time of day (for instance, don't allow someone on the West Coast to be called at 6:00 am when it's 9:00 am in the East). To do that, I have to join a separate area code table to my table with the phone numbers, comparing 3-digit area codes to get the timezone. But I can't simply compare that table's codes to my phone number field, because a 3-digit code will never match a 10-digit number. So I have to deal with additional SQL to get just the first three digits of the number. Not a game-changer, but more work and confusion nonetheless.
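The extra SQL in question is simple enough; a sketch, with hypothetical table and column names:

-- Hypothetical schema: members(phone CHAR(10)), area_codes(code CHAR(3), tz VARCHAR(32))
SELECT m.phone, ac.tz
FROM members m
JOIN area_codes ac ON ac.code = LEFT(m.phone, 3);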
Definitely store them in one field as a text string, and only store the digits. Think of it this way: no matter what the digits are, it's all one telephone number. However, the segmenting of the digits depends on a number of things (locality, how many digits are provided, even personal preference). It's easier to store the one value and change its presentation later with text manipulation.
I think splitting the number into 3 fields is the best option if you want to use area codes as filters; otherwise, you should only use 1 field.
Remember to use ZEROFILL if you plan on storing them as numbers ;)
It really depends on a couple of factors:
is it possible you will have international numbers?
how much area code/city code searching/manipulation will you be doing?
No matter what, I would only store numbers; it's easy enough to format either in MySQL or PHP and add parentheses and dashes.
Unless I was going to do a lot of searching by area code, I would just put the entire phone number into a single field, since I assume most of the time you would be retrieving the entire phone number anyway.
If it's possible that you will take international numbers in the future:
You might want to add a country field, though; that way you won't have to guess what country they are from when dealing with the number.
What you use depends on how you plan to use the data, and where the program will be used.
If you want to efficiently search records by area code, then split out the area code; queries will perform much faster when they're doing simple string comparisons versus string manipulation of the full phone number to get the area code.
HOWEVER, be advised that phone numbers formatted XXX-XXX-XXXX are only found in the US, Canada, and other smaller Caribbean territories that are subject to the NANPA system. Various other world regions (EU, Africa, ASEAN) have very different numbering standards. In such cases, splitting out the equivalent of the "area code" may not make sense. Also, if all you want to do is display a phone number to the user, then just store it as a string.
Whether to store a number with a format or not is mostly personal preference. Storing the raw number allows the formatting to be changed easily; you could go from XXX-XXX-XXXX to (XXX) XXX-XXXX by changing a couple of lines of code instead of reformatting the 10 million numbers you already have. Removing special characters from a phone number is also a relatively simple regex.
Storing without formatting will also save you a few bytes per number and allow you to use a fixed-length field (saving further data overhead inherent in varchars). This may be of use in a mobile app where storage is at a premium; however, that 5-terabyte distributed SQL cluster in your server room is probably not gonna notice much difference between a char(10) and a varchar(15). On the other hand, storing them formatted speeds up displaying the data; you don't have to format it first, just yank it out of the DB and plaster it on the page.
I'm running a dating site, and there is a place where people enter their profile. I already have a bad-words filter, but now I have a problem where people enter a profile that is just garbage characters, or just "aaaaaaaaaaaaaaaaaaaa" or "--------------", etc. I'm looking for an effective way of filtering out long runs of repeated characters. Thanks in advance.
This should do it (but it will collapse double characters too; maybe you need to edit it a bit):
preg_replace('{(.)\1+}','$1',$text);
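If you would rather flag the input than rewrite it, a variant that only detects runs of five or more identical characters might look like this (the threshold is an arbitrary assumption):

// Returns true for profiles containing a run of 5+ identical characters.
function looksLikeRepeatedGarbage($text) {
    return preg_match('/(.)\1{4,}/', $text) === 1;
}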
OT: I can't believe there are still people who use bad-word filters...
Maybe you need some bayesian spam filter-alike filter for that kind of stuff.
Particular words have particular probabilities of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word "Viagra" in spam email, but will seldom see it in other email. The filter doesn't know these probabilities in advance, and must first be trained so it can build them up. To train the filter, the user must manually indicate whether a new email is spam or not.
...
You could use a word list and flag each message that has long words (e.g. 5+ chars) not on the list - if the field contains five 8-letter words, none of which are in a dictionary, it's likely not meaningful data.
Short question: How do I automatically detect whether a CSV file has headers in the first row?
Details: I've written a small CSV parsing engine that places the data into an object that I can access as (approximately) an in-memory database. The original code was written to parse third-party CSV with a predictable format, but I'd like to be able to use this code more generally.
I'm trying to figure out a reliable way to automatically detect the presence of CSV headers, so the script can decide whether to use the first row of the CSV file as keys / column names or start parsing data immediately. Since all I need is a boolean test, I could easily specify an argument after inspecting the CSV file myself, but I'd rather not have to (go go automation).
I imagine I'd have to parse the first 3 to ? rows of the CSV file and look for a pattern of some sort to compare against the headers. I'm having nightmares of three particularly bad cases in which:
The headers include numeric data for some reason
The first few rows (or large portions of the CSV) are null
The headers and data look too similar to tell them apart
If I can get a "best guess" and have the parser fail with an error or spit out a warning if it can't decide, that's OK. If this is something that's going to be tremendously expensive in terms of time or computation (and take more time than it's supposed to save me) I'll happily scrap the idea and go back to working on "important things".
I'm working with PHP, but this strikes me as more of an algorithmic / computational question than something that's implementation-specific. If there's a simple algorithm I can use, great. If you can point me to some relevant theory / discussion, that'd be great, too. If there's a giant library that does natural language processing or 300 different kinds of parsing, I'm not interested.
As others have pointed out, you can't do this with 100% reliability. There are cases where getting it 'mostly right' is useful, however - for example, spreadsheet tools with CSV import functionality often try to figure this out on their own. Here are a few heuristics that would tend to indicate the first line isn't a header (a rough sketch follows the list):
The first row has columns that are not strings or are empty
The first row's columns are not all unique
The first row appears to contain dates or other common data formats (e.g., xx-xx-xx)
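A rough PHP sketch of those checks (assuming $rows is an array of rows already read with fgetcsv()):

// Guess whether row 0 is a header using the heuristics above.
function guessHasHeader(array $rows) {
    if (count($rows) < 2) {
        return false; // not enough data to decide
    }
    $first = $rows[0];
    foreach ($first as $cell) {
        // Header cells should be non-empty, non-numeric strings.
        if ($cell === '' || is_numeric($cell)) {
            return false;
        }
    }
    // Header cells should be unique.
    return count(array_unique($first)) === count($first);
}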
In the most general sense, this is impossible. This is a valid CSV file:
Name
Jim
Tom
Bill
Most CSV readers will just take hasHeader as an option and allow you to pass in your own header if you want. Even in the case you think you can detect - character headers over numeric data - you can run into a catastrophic failure. What if your column is a list of BMW series?
M
3
5
7
You will process this incorrectly. Worst of all, you will lose the best car!
In the purely abstract sense, I don't think there is a foolproof algorithmic answer to your question, since it boils down to: "How do I distinguish dataA from dataB if I know nothing about either of them?" There will always be the potential for dataA to be indistinguishable from dataB. That said, I would start simple and only add complexity as needed.
For example, when examining the first five rows: if, for a given column (or columns), the datatype in rows 2-5 is consistent but differs from the datatype in row 1, there's a good chance that a header row is present (a larger sample size reduces the possibility of error). This would (sorta) solve #1/#3 - perhaps throw an exception if the rows are all populated but the data is indistinguishable, to let the calling program decide what to do next. For #2, simply don't count a row as a row unless and until it pulls non-null data... that would work in all but an empty file (in which case you'd hit EOF). It would never be foolproof, but it might be "close enough".
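A sketch of that column-wise comparison (the type inference is deliberately crude, and the five-row sample size is an assumption):

// Infer a crude "type" for a cell.
function cellType($cell) {
    if ($cell === '') return 'null';
    if (is_numeric($cell)) return 'number';
    return 'string';
}

// Header likely if rows 2-5 agree on a type that differs from row 1.
function headerLikely(array $rows) {
    $sample = array_slice($rows, 1, 4);
    if (count($sample) === 0) return false;
    foreach (array_keys($rows[0]) as $col) {
        $types = array_unique(array_map(function ($r) use ($col) {
            return cellType($r[$col]);
        }, $sample));
        if (count($types) === 1 && reset($types) !== cellType($rows[0][$col])) {
            return true;
        }
    }
    return false;
}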
It really depends on just how "general" you want your tool to be. If the data will always be numeric, you have it easy as long as you assume non-numeric headers (which seems like a pretty fair assumption).
But beyond that, if you don't already know what patterns are present in the data, then you can't really test for them ahead of time.
FWIW, I actually just wrote a script for parsing some stuff out of TSVs, all from the same source. The source's approach to headers/formatting was so scattered that it made sense to just have the script ask me questions from the command line while executing (Is this a header? Which columns are important?). So no automation, but it lets me fly through the data sets I'm working on instead of trying to anticipate each funny formatting case. Also, my answers are saved in a file, so I only have to be involved once per file. Not ideal, but efficient.
This article provides some good guidance:
Basically, you do statistical analysis on columns based on whether the first row contains a string and the rest of the rows numbers, or something like that.
http://penndsg.com/blog/detect-headers/
If your CSV has a header like this:
ID, Name, Email, Date
1, john, john@john.com, 12 jan 2020
then calling filter_var($str, FILTER_VALIDATE_EMAIL) on the header row will fail, since the email addresses appear only in the data rows. So check the first row for an email address (assuming your CSV has email addresses in it).
Second idea.
http://php.net/manual/en/function.is-numeric.php
Check the first row with is_numeric(): a header row most likely does not contain numeric data, but a data row most likely does.
If you know you have dates in your columns, then checking the header row for a date would also work.
Obviously you need to know what type of data you are expecting. I am "expecting" email addresses.
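Putting those checks together, a minimal sketch (assuming each row is an array from fgetcsv(); the date check is a rough heuristic):

// Data rows tend to contain emails, numbers, or dates; header rows tend not to.
function rowLooksLikeData(array $row) {
    foreach ($row as $cell) {
        $cell = trim($cell);
        if (filter_var($cell, FILTER_VALIDATE_EMAIL) !== false) {
            return true; // found an email address
        }
        if (is_numeric($cell)) {
            return true; // found numeric data
        }
        // Parses as a date but isn't a plain word like "May" or "Date".
        if (!ctype_alpha($cell) && strtotime($cell) !== false) {
            return true;
        }
    }
    return false;
}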