How to validate text as not gibberish in PHP? - php

What is the best way to validate a string as not gibberish using PHP?
For example, if I get a string input from a user that must be at least 250 characters long, how can I tell whether they entered legitimate text (e.g. real words) or just gibberish to comply with the minimum characters (e.g. asdlfkjefksjlfkjldskfjelkef)?
I've thought about counting the number of words as one option, but the user could still space out their gibberish (e.g. asdlf kjef ksjlf kjl dskfje lkef), so it needs another kind of check on top of that.
Is there any way to check that at least half of a string contains real dictionary words, or something to that effect?
What is the best solution to this problem?
Thanks.

You cannot do that properly, because "Colorless green ideas sleep furiously": a sentence can consist entirely of real dictionary words and still be meaningless.

You could try a Bloom filter to test whether each word is in a dictionary without holding the whole word list in memory.
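A minimal sketch of what that could look like, assuming a small in-memory filter. The class name, the bit-array size, and the md5-based hashing scheme are illustrative choices, not part of the answer above. A Bloom filter can report false positives but never false negatives, so it only gives a rough "probably a word" signal.

```php
<?php
// Minimal Bloom filter sketch; size and hash count are illustrative assumptions.
final class BloomFilter
{
    private $bits = [];
    private $size;
    private $hashCount;

    public function __construct(int $size = 8192, int $hashCount = 3)
    {
        $this->size = $size;
        $this->hashCount = $hashCount;
    }

    // Derive several bit positions for a word by salting md5 with the hash index.
    private function positions(string $word): array
    {
        $positions = [];
        for ($i = 0; $i < $this->hashCount; $i++) {
            // 7 hex digits fit into a positive integer on every platform.
            $positions[] = hexdec(substr(md5($i . ':' . $word), 0, 7)) % $this->size;
        }
        return $positions;
    }

    public function add(string $word): void
    {
        foreach ($this->positions($word) as $position) {
            $this->bits[$position] = true;
        }
    }

    // True means "possibly present"; false means "definitely not present".
    public function mightContain(string $word): bool
    {
        foreach ($this->positions($word) as $position) {
            if (!isset($this->bits[$position])) {
                return false;
            }
        }
        return true;
    }
}

$filter = new BloomFilter();
foreach (['hello', 'world', 'legitimate', 'text'] as $word) {
    $filter->add($word);
}
var_dump($filter->mightContain('hello'));
```

In practice the filter would be filled once from a full word list and each token of the user input tested against it.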

You can walk through your dictionary, delete all dictionary words from the user input, and then check the length of what remains.
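A rough sketch of that idea follows. It is a variant, not a literal implementation: it compares whole tokens rather than deleting substrings, since naive substring deletion would also strip short words out of the middle of gibberish. The function name and the tiny word list are hypothetical.

```php
<?php
// Hypothetical helper: fraction of tokens in $text that appear in $dictionary.
function dictionaryWordRatio(string $text, array $dictionary): float
{
    // Lowercase and split on anything that is not a letter or an apostrophe.
    $tokens = preg_split("/[^a-z']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if (count($tokens) === 0) {
        return 0.0;
    }

    $lookup = array_flip($dictionary); // constant-time membership checks via isset()
    $known = 0;
    foreach ($tokens as $token) {
        if (isset($lookup[$token])) {
            $known++;
        }
    }
    return $known / count($tokens);
}

// Illustrative word list; a real check would load a full dictionary file.
$dictionary = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog'];
$ratio = dictionaryWordRatio('The quick brown fox asdlf kjef', $dictionary);
// Four of the six tokens are known words here.
```

A threshold such as "at least half of the tokens are known" could then be applied to the returned ratio.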

You could look at Markov chains. Simply put, the idea is that the algorithm determines whether sequences of characters look like they belong together. It won't necessarily tell you that the text is not gibberish, but it should catch things like "ksjhglah".
See Markov text generators.
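One way this might be sketched is with character bigrams: count how often each letter follows another in known-good text, then score new input by the average log-probability of its transitions. The function names, the tiny training corpus, and the add-one smoothing are assumptions made for illustration; a usable model would need far more training text and a tuned threshold.

```php
<?php
// Count character-to-character transitions in text known to be legitimate.
function trainBigrams(string $corpus): array
{
    $counts = [];
    $text = preg_replace('/[^a-z ]+/', '', strtolower($corpus));
    for ($i = 0, $n = strlen($text) - 1; $i < $n; $i++) {
        $a = $text[$i];
        $b = $text[$i + 1];
        $counts[$a][$b] = ($counts[$a][$b] ?? 0) + 1;
    }
    return $counts;
}

// Average log-probability per transition; higher (closer to zero) looks more natural.
function averageLogProbability(string $text, array $counts): float
{
    $text = preg_replace('/[^a-z ]+/', '', strtolower($text));
    $alphabetSize = 27; // a-z plus space
    $total = 0.0;
    $transitions = 0;
    for ($i = 0, $n = strlen($text) - 1; $i < $n; $i++) {
        $row = $counts[$text[$i]] ?? [];
        // Add-one smoothing so unseen transitions get a small non-zero probability.
        $p = (($row[$text[$i + 1]] ?? 0) + 1) / (array_sum($row) + $alphabetSize);
        $total += log($p);
        $transitions++;
    }
    // With no transitions there is no evidence, so treat the input as gibberish.
    return $transitions > 0 ? $total / $transitions : -INF;
}

// Tiny illustrative corpus; real training data should be much larger.
$corpus = 'the quick brown fox jumps over the lazy dog and the cat sat on the mat';
$counts = trainBigrams($corpus);

$realScore = averageLogProbability('the cat', $counts);
$junkScore = averageLogProbability('xqzj kvw', $counts);
```

With this corpus, the real phrase scores higher than the random one; the cut-off between "plausible" and "gibberish" would have to be chosen empirically.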

Related

Sorting user input

I am attempting what I thought would be a simple exercise, but unless I'm missing a trick, it seems anything but simple.
I'm attempting to clean up user input from a form before saving it. The particular problem I have is with hyphenated town names. For example, take Bourton-on-the-Water. Assume the user has Caps Lock on, or puts spaces next to the hyphens, or makes any other mistake that might come to mind. How do I, within reason, turn it into what it's meant to be?
You can use trim() to remove whitespace (or other characters) from the beginning and end of a string. You can also use explode() to break strings into parts by a specified character and then recreate your string as you like.
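As a rough sketch of the trim() and explode() approach, the helper below lowercases the input, splits it on hyphens, trims each piece, and capitalises everything except a short list of minor words. The function name and the minor-word list are hypothetical, and strtolower() only handles ASCII letters, so this is not a complete solution for every place name.

```php
<?php
// Hypothetical helper: tidies a hyphenated place name such as "BOURTON - on -THE-water".
function normalizeTownName(string $input): string
{
    // Words that conventionally stay lowercase inside a place name (illustrative list).
    $minorWords = ['on', 'the', 'in', 'of', 'under', 'upon', 'by'];

    $parts = explode('-', strtolower(trim($input)));
    $result = [];
    foreach ($parts as $index => $part) {
        $part = trim($part);
        if ($part === '') {
            continue; // skip empty pieces produced by doubled hyphens
        }
        // The first piece is always capitalised, even if it is a minor word.
        $isMinor = $index > 0 && in_array($part, $minorWords, true);
        $result[] = $isMinor ? $part : ucwords($part);
    }
    return implode('-', $result);
}

echo normalizeTownName('  BOURTON - on -THE-water '), PHP_EOL; // Bourton-on-the-Water
```

Input typed with spaces instead of hyphens would still need a separate rule, since the code cannot know where hyphens belong.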
I think the only way you can really accomplish this is by improving the way the user inputs their data.
For example use a postcode lookup system that enters an address based on what they type.
Or have an autocomplete from a predefined list of towns (similar to how Facebook shows towns).
To consider every possible permutation of Bourton On The Water / Bourton-On-The-Water etc... is pretty much impossible.

Find common substrings from 2 separate sets

Background: I have a large database of people, and I want to look for duplicates, which is more difficult than it seems. I already do a lot of comparison between the names (which are often spelled in different ways), dates of birth and so on. When two profiles appear to be similar enough to the matching algorithm, they are presented to an operator who will judge.
Most profiles have more than one phone number attached, so I would like to use them to find duplicates. They can be entered as "001-555-123456", but also as "555-123456", "555-123456-7-8", "555-123456 call me in the evening" or anything you might imagine.
My first idea is to strip all non-numeric characters and get the "longest common substring".
There are a lot of algorithms around to find the longest common substring inside a set.
But whenever I compare two profiles A and B, I have two sets of phone numbers. I would like to find the longest common substring between a string in the set A and a string in a set B.
Can you please help me in finding such an algorithm?
I normally program in PHP, a SQL-only solution would be even better, but any other language would go.
As Voitcus said before, you have to clean your data before you start comparing or looking for duplicates. A phone number should follow a strict pattern. Try to adjust the numbers that do not match the pattern so that they do. Then you are able to look for duplicates.
Moreover, you should clean the data before persisting it, maybe into a separate column. Then you don't have to deal with it when looking for duplicates, which avoids performance peaks.
Algorithms like levenshtein() or similar_text() in PHP don't fit this use case well.
In my opinion the best way is to strip all non-numeric characters from the texts containing phone numbers. You can do this in many ways, some regular expression would be the best, but see below.
Then, if possible, work out the country dialing code from the user's country, if it is known. If there is none, assume a default and prepend it to the string. The same probably goes for city (area) codes. You can also look at where the person lives, their zip code, etc.
At the end of this you should have uniform phone numbers which can be easily compared.
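A minimal sketch of such a normalisation step, under loudly stated assumptions: the function name is hypothetical, a leading "00" or "+" is taken to mean the country code is already present, and every other number is assumed to belong to the default country. Real numbering plans have more cases than this.

```php
<?php
// Hypothetical helper: reduce a free-form phone entry to country code + digits.
function normalizePhone(string $raw, string $defaultCountryCode): string
{
    // Keep digits only.
    $digits = preg_replace('/\D+/', '', $raw);

    // Assumption: a leading "+" means the country code is already included.
    if (strpos(ltrim($raw), '+') === 0) {
        return $digits;
    }
    // Assumption: a leading "00" is the international prefix before a country code.
    if (strpos($digits, '00') === 0) {
        return (string) substr($digits, 2);
    }
    // Otherwise assume a national number and prepend the default country code.
    return $defaultCountryCode . $digits;
}

$a = normalizePhone('001-555-123456', '1');
$b = normalizePhone('555-123456', '1');
$c = normalizePhone('+1 555 123456', '1');
// All three entries reduce to the same string and can be compared with ===.
```

Storing the normalised value in its own column, as suggested above, would let duplicates be found with a plain equality comparison.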
The other way is to compare strings with the country (and city) code removed.
About searching for "the longest common substring": once filtered this way, matching numbers should be identical, but you might still need it, e.g. if someone typed "call me after 6 p.m.", because the stray 6 ends up in the digit string. If you're sure that the phone number is always at the beginning, and nobody typed something like 555-SUPERMAN (which translates to 555-78737626), it is also possible to remove everything from the first alphabetic character onward.
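For the set-versus-set comparison asked about in the question, one straightforward sketch is to run the classic dynamic-programming longest-common-substring routine on every pair of digit strings and keep the longest result. The function names are hypothetical, and this brute-force pairing is only reasonable because each profile has a handful of numbers.

```php
<?php
// Dynamic-programming longest common substring of two strings.
function longestCommonSubstring(string $a, string $b): string
{
    $best = '';
    $lenA = strlen($a);
    $lenB = strlen($b);
    $prev = array_fill(0, $lenB + 1, 0);
    for ($i = 1; $i <= $lenA; $i++) {
        $curr = array_fill(0, $lenB + 1, 0);
        for ($j = 1; $j <= $lenB; $j++) {
            if ($a[$i - 1] === $b[$j - 1]) {
                // Extend the common run that ended at the previous characters.
                $curr[$j] = $prev[$j - 1] + 1;
                if ($curr[$j] > strlen($best)) {
                    $best = substr($a, $i - $curr[$j], $curr[$j]);
                }
            }
        }
        $prev = $curr;
    }
    return $best;
}

// Longest digit run shared by any number in $setA and any number in $setB.
function longestCommonAcrossSets(array $setA, array $setB): string
{
    $best = '';
    foreach ($setA as $a) {
        $digitsA = preg_replace('/\D+/', '', $a);
        foreach ($setB as $b) {
            $digitsB = preg_replace('/\D+/', '', $b);
            $candidate = longestCommonSubstring($digitsA, $digitsB);
            if (strlen($candidate) > strlen($best)) {
                $best = $candidate;
            }
        }
    }
    return $best;
}

$profileA = ['001-555-123456', '444-987654'];
$profileB = ['555-123456 call me in the evening', '333-111222'];
$shared = longestCommonAcrossSets($profileA, $profileB); // "555123456"
```

The length of the shared run could then feed into the existing matching score, for example by treating runs of seven or more digits as a strong hint.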
It is also possible to filter such data in the SQL statement. Consider something like SELECT ..., [your trimming function(phone_number)] AS trimmed_phone WHERE (trimmed_phone does not consist of numerical characters only) GROUP BY trimmed_phone. If the trimming function removes only whitespace and special dividers like -, +, . (commonly in use in Germany) and perhaps others, this query would give you all phone numbers that, after trimming, still contain non-numeric characters. Take a look at the results, which will probably be mostly digits and letters. How many of them are there? Maybe they have something in common? Maybe there are some typical phrases you can filter out too?
If such a query does not return very many results, maybe it's easier just to fix them by hand?

Validate the name of a person in php [duplicate]

I would like to create a regex which validates a name of a person. These should be allowed:
Letters (uppercase and lowercase)
-
spaces
This is pretty easy to create a regex for. The problem is that some people also use special characters in their names. For example, assume a user named gûnther or François. There are a lot of characters like û and ç available and it's hard to list all of these.
Is there an easy way to check for correct human names?
This has been discussed several times. I'm fairly certain that the only thing people can agree on is that, in order to exist, a name cannot be an empty string, thus:
^.+$
(Yes, I am aware that this is probably not what OP is looking for. I'm just summarizing earlier Q&As.)
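/^\pL[\pL '-]*\z/u should do the trick (the u modifier makes PHP treat the pattern and the input as UTF-8, so \pL matches letters such as û and ç).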
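A short sketch of how that pattern might be used with preg_match(). It assumes the input is UTF-8 and therefore adds the u modifier, without which \pL would not match multi-byte letters reliably. The function name is hypothetical.

```php
<?php
// Hypothetical helper: a letter first, then letters, spaces, apostrophes or hyphens.
function looksLikeName(string $name): bool
{
    // The u modifier makes the pattern and the subject be treated as UTF-8.
    return preg_match('/^\pL[\pL \'-]*\z/u', $name) === 1;
}

var_dump(looksLikeName('François'));         // bool(true)
var_dump(looksLikeName("O'Malley"));         // bool(true)
var_dump(looksLikeName('Brooks-Schneider')); // bool(true)
var_dump(looksLikeName('R2D2'));             // bool(false)
```

This still accepts strings like "Abcd", so it filters out obviously wrong characters rather than proving that the input is a real name.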
The short answer is no, there is no easy way. You have touched on the biggest issue. There are so many special cases of accents and extra things hanging off letters that it will become a mess to deal with. Additionally, the expression will break down to something like this
^[CAPITAL_LETTERS][ALL_LETTERS_AND_SYMBOLS]*$
That is not that helpful because "Abcd" fits that and you have no way to know if someone is incorrectly entering info into the field or if it was a crazy Hollywood parent that actually named their kid that or something like Sandwich or Umbrella.
^.+$
I checked @jensgram's answer, but that regex accepts any non-empty string, so it doesn't solve the problem: the string needs to be a name, and with that regex it can be anything.
^[A-Z][a-z]+$
My regex only accepts strings where the first character is uppercase and the following characters are lowercase letters. Looking through the other answers, this also seems to be the shortest and simplest regex.
I don't know exactly what you are trying to do (validate user name input?) but basically I would keep it simple - fail the validation if the text contains numbers. And even that's probably pretty shaky.
I had the same problem. First I came up with something like
preg_match("/^[a-zA-Z]{1,}([\s-]*[a-zA-Z\s\'-]*)$/", $name)
but then realized that UTF-8 characters used in countries like Sweden or China, for example Õ and å, would not be allowed. That mattered for my site, since it is international and I don't want to prevent users from entering their real names.
I thought it might be easier, instead of trying to figure out how to allow names like O'Malley, Brooks-Schneider and Õsmar (made that one up :)), to catch the characters that you don't want them to enter. For me it was basically about avoiding XSS JavaScript code being entered. So I use the following regex to filter out all characters that might be harmful.
preg_match("/[~!@#\$%\^&\*\(\)=\+\|\[\]\{\};\\:\",\.\<\>\?\/]+/", $name)
That way they can enter any name they want except chars that really aren't part of any name. Hope this might be useful.

Detect random strings

I am building a script in PHP to detect whether filenames make sense or are completely random. I'm using regular expressions.
A valid filename = sample-image-25.jpg
A random filename = 46347sdga467234626.jpg
I want to check if the filename makes sense or not, if not, I want to alert the user to fix the filename before continuing.
Any help?
I'm not really sure that's possible because I'm not sure it's possible to define "random" in a way the computer will understand sufficiently well.
"umiarkowany" looks random, but it's a perfectly valid word I pulled off the Polish Wikipedia page for South Korea.
My advice is to think more deeply about why this design detail is important, and look for a more feasible solution to the underlying problem.
That takes far too much work. You would have to build a huge array of the most-used words (like a dictionary) and check whether most of the words inside the filename (maybe separated by - or _) are in it, and it will still be very error-prone.
Basically you will need:
explode()
implode()
array_search() or in_array()
Take the string and look for a glue character like "_" or "-" with preg_match(); if there is one, explode the string into an array and compare that array with the dictionary array.
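The steps above might be sketched as follows. The function name, the 0.5 threshold, and the tiny word list are assumptions for illustration; preg_split() is used here in place of a preg_match() check followed by explode(), since it handles both "-" and "_" in one call.

```php
<?php
// Hypothetical helper: true if enough pieces of the filename are dictionary words.
function filenameLooksMeaningful(string $filename, array $dictionary, float $threshold = 0.5): bool
{
    // Drop the extension, then split the base name on the glue characters.
    $base = pathinfo($filename, PATHINFO_FILENAME);
    $pieces = preg_split('/[-_]+/', strtolower($base), -1, PREG_SPLIT_NO_EMPTY);

    // Ignore purely numeric pieces such as the "25" in sample-image-25.
    $words = array_filter($pieces, function ($piece) {
        return !ctype_digit($piece);
    });
    if (count($words) === 0) {
        return false;
    }

    $known = 0;
    foreach ($words as $word) {
        if (in_array($word, $dictionary, true)) {
            $known++;
        }
    }
    return $known / count($words) >= $threshold;
}

// Illustrative word list; a real check would need a much larger dictionary.
$dictionary = ['sample', 'image', 'photo', 'friendly', 'file'];

var_dump(filenameLooksMeaningful('sample-image-25.jpg', $dictionary));    // bool(true)
var_dump(filenameLooksMeaningful('46347sdga467234626.jpg', $dictionary)); // bool(false)
```

As noted elsewhere in this thread, valid words missing from the dictionary will be flagged as random, so the result is better used as a warning than as a hard rejection.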
Or, since almost every word alternates vowels and consonants, you could write a huge script that checks whether most of the words inside the filename can be considered "not randomly generated". But the problem remains the same: why do you need that? Look for a more flexible solution.
Notice:
Consider that even a simple-and-friendly-file.png could be the result of a string generator.
Good luck with that.

Mix two strings into one longer string PHP

I have two strings and I would like to mix the characters from each string into one bigger string, how can I do this in PHP? I can swap chars over but I want something more complicated since it could be guessed.
And please don't say md5() is enough and irreversible. :)
$string1 = '9cb5jplgvsiedji9mi9o6a8qq1';//session_id()
$string2 = '5d41402abc4b2a76b9719d911017c592';//md5()
Thank you for any help.
EDIT: Ah sorry Rob. It would be great if there is a solution where it was just a function I could pass two strings to, and it returned a string.
The returned string must contain both of the previous strings. Not just a concatenation, but the characters of each string mingled into one bigger one.
If you want to make a tamper-proof string which is human readable, add a secure hash to it. MD5 is indeed falling out of favour, so try sha1. For example
$salt="secret";
$hash=sha1($string1.$string2.$salt);
$separator="_";
$str=$string1.$separator.$string2.$separator.$hash;
If you want a string which cannot be read by humans, encrypt it - check out the mcrypt extension which offers a variety of options.
Use one of the SHA variants available through the hash() function. SHA-256 (part of the SHA-2 family) should be sufficient and certainly much better than anything you could come up with.
Unless I am missing something, if you want to combine those values into a unique value, why not do sha1($string1 . $string2);
I'm guessing you want something reversible, so you can get these values back out. A quick-and-dirty technique for obscuring these two strings further would be to base64-encode them:
base64_encode($string1 . $string2);
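If the two values need to be recovered again, a plain concatenation is not enough, because the boundary between them is lost. The sketch below assumes a separator character ("|") that appears in neither input; that choice, and the variable names, are illustrative. Note that base64 is only an encoding and offers no secrecy.

```php
<?php
$string1 = '9cb5jplgvsiedji9mi9o6a8qq1';       // session_id()
$string2 = '5d41402abc4b2a76b9719d911017c592'; // md5()

// Assumption: "|" never occurs in either input, so it can mark the boundary.
$separator = '|';
$encoded = base64_encode($string1 . $separator . $string2);

// Reverse the process: decode, then split at the first separator only.
list($first, $second) = explode($separator, base64_decode($encoded), 2);

var_dump($first === $string1, $second === $string2);
```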
Thank you everyone. I completely forgot about the SHA1 - got too into solving a problem that I forgot what else was out there. :)
Well, if not md5(), then sha1(). :)
Anyway, the possibilities for mangling are endless, pick your poison.
What I would do, if I really wanted to do something like that (which can be useful occasionally), is add another element, chosen at random, shuffle the md5 string by it, and write the random element into the result, too.
For example, let us add a random 2-digit number to each md5 character, split that number into its digits, then append the 1st digit to the resulting string and prepend the 2nd digit to it.
I stumbled upon someplace where something of that kind was done today. I was trying to find some reference to a particular phone number - whether it appears anywhere on the country-local inet or not.
I visited a popular classified ads site, which gives phone numbers of advertisers and you have the option, when you are looking at a particular ad, to find all ads with the same phone number. Now, what they did, however, was that they encoded search string, so you are not searching for ?phone=123123, but something like ?phone==FFYx23=.
If they hadn't done that, I would be able to find out for my own purposes, rather than checking on ads, IF user with phone 123123 has posted any ads on the site.
If you are looking to verify message integrity and authenticity with hashing - you might want to look at HMAC - there are plenty of implementations in PHP using both SHA1 and MD5:
http://en.wikipedia.org/wiki/HMAC
EDIT: In fact, PHP now has a function for this:
http://us3.php.net/manual/en/function.hash-hmac.php
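A brief sketch of how hash_hmac() could be used for this, assuming SHA-256 and an underscore separator as in the earlier salted-hash answer. The key value and the verifyToken() helper are hypothetical; hash_equals() is used for the comparison so that it runs in constant time.

```php
<?php
$string1 = '9cb5jplgvsiedji9mi9o6a8qq1';       // session_id()
$string2 = '5d41402abc4b2a76b9719d911017c592'; // md5()

// Hypothetical secret; in practice this would be a long random value kept server-side.
$secretKey = 'replace-with-a-long-random-secret';

// Sign the readable payload and append the signature.
$payload = $string1 . '_' . $string2;
$token = $payload . '_' . hash_hmac('sha256', $payload, $secretKey);

// Hypothetical helper: recompute the HMAC and compare it to the attached one.
function verifyToken(string $token, string $secretKey): bool
{
    $position = strrpos($token, '_');
    if ($position === false) {
        return false;
    }
    $payload = substr($token, 0, $position);
    $signature = substr($token, $position + 1);
    return hash_equals(hash_hmac('sha256', $payload, $secretKey), $signature);
}

// Changing any character of the token should make verification fail.
$tampered = 'x' . substr($token, 1);
var_dump(verifyToken($token, $secretKey), verifyToken($tampered, $secretKey));
```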
