I need to restrict phone numbers with patterns like:
0123456789,
1111111111,
2222222222,
3333333333,
4444444444,
etc.
I am trying to do it in PHP.
So far I have come up with creating an array and searching in it to restrict those mobile numbers.
Is there any better way to do the same?
Maybe a regex, or maybe in JavaScript.
If you don't want to keep updating a long array of numbers to check against, as you mention yourself, what you basically want is pattern detection / pattern recognition.
This can be everything from trivial to very complicated depending on your previous knowledge.
A small start can be found here... But there are tons of very thick books on the subject ;)
http://en.wikipedia.org/wiki/Pattern_recognition
So the "easiest" way is probably to use an array lookup. The downside is that you must know every single number you wish to blacklist beforehand. A middle way would be to have an array of regexps of invalid formats that you check against, instead of having the actual numbers in the array.
So you could have regexps covering things like the following (a rough sketch is shown after the list):
Numbers too short
Numbers too long
Numbers with only the same digits
etc.
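For example, a minimal sketch of that middle way; the exact length limits and patterns here are assumptions, adjust them to your numbering plan:
<?php
// Hypothetical format blacklist -- tune the patterns to your own rules.
$invalidPatterns = array(
    '/^\d{0,9}$/',    // too short (fewer than 10 digits)
    '/^\d{11,}$/',    // too long (more than 10 digits)
    '/^(\d)\1{9}$/',  // ten copies of the same digit, e.g. 1111111111
    '/^0123456789$/', // the classic ascending sequence
);

function isBlockedNumber($number, array $patterns) {
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $number)) {
            return true; // matches a blacklisted format
        }
    }
    return false;
}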
Depending on available systems and geographical location it might actually be possible to do some kind of look up against some known database of numbers. But that could lead to false positives for unlisted numbers.
For a global, dynamic, working system, both keeping a lookup array and doing lookups against databases could prove very hard to maintain, since the number of data sources you have to keep supporting might grow too large to handle.
If you need to block specific phone numbers, you could do this:
$numberToTest = [...]; // this is your user input number to test
$notAllowed = array(
    '0123456789',
    '1111111111',
    '2222222222',
    // and so on
);

if (in_array($numberToTest, $notAllowed)) {
    // NOT VALID
} else {
    // VALID
}
To match all numbers like 1111111111, 2222222222, etc. you could do this:
if (preg_match('/^(\d)\1{9}$/', $numberToTest)) {
    // NOT VALID
} else {
    // VALID
}
JavaScript with a regexp. isRestricted is set to true when the number is in the restricted list (the pattern is anchored so only exact matches count):
var number = "0123456789";
var isRestricted = number.match( /^(0123456789|1111111111|2222222222|3333333333|4444444444)$/ ) !== null;
I'm sure regexes work pretty much the same way in PHP using preg_match.
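For completeness, a rough PHP equivalent of that JavaScript check might look like this:
$number = '0123456789';
$isRestricted = preg_match('/^(0123456789|1111111111|2222222222|3333333333|4444444444)$/', $number) === 1;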
Related
I am adding a feature to an application that allows authorised oil rig personnel to submit weather reports (for use by our pilots when planning flights) to our system via email. The tricky part is that we want to match these reports to a particular oil platform, but the personnel (and their email accounts) can move between rigs.
We already have a list of waypoints that each have an "aliases" field. Basically if the email subject contains something in the aliases field, we should match the email to that waypoint.
The subject could be "Weather report 10 April # 1100 Rig A for you as requested"
The aliases for that waypoint would be something like
"RRA RPA Rig A RigA"
Keep in mind there is a similar list of aliases for all the other waypoints we have.
Is there a better way of matching than iterating through each word of each alias and checking whether it's a substring of the email subject? Because that sounds like an O(n^2) sort of problem.
The alternative is for us to put a restriction and tell the operators they have to put the rig name at the start or end of the subject.
This sounds more like an algorithms question than a PHP question specifically. Take a look at What is the fastest substring search algorithm?
Well you can transform this into something like an O(n log n) algorithm, but it depends on the implementation specifics of stripos():
define('RIG_ID_1', 123);
define('RIG_ID_2', 456);

function get_rig_id($email_subject) {
    $alias_map = [
        'RRA'   => RIG_ID_1,
        'RPA'   => RIG_ID_1,
        'Rig A' => RIG_ID_1,
        'RigA'  => RIG_ID_1,
        // ...
    ];
    foreach (array_keys($alias_map) as $rig_substr) {
        if (stripos($email_subject, $rig_substr) !== false) {
            return $alias_map[$rig_substr];
        }
    }
    return null;
}
Here each substring is examined by stripos() exactly once. Probably a better solution is to compose these strings into a series of regexes. Internally, the regex engine is able to scan text very efficiently, typically scanning each character only one time:
For example:
<?php
define('RIG_ID_1', 123);
define('RIG_ID_2', 456);

function get_rig_id($email_subject) {
    $alias_map = [
        '/RRA|RPA|Rig\\sA|RigA/i' => RIG_ID_1,
        '/RRB|RPB|Rig\\sB|RigB/i' => RIG_ID_2,
        // ...
    ];
    foreach (array_keys($alias_map) as $rig_regex) {
        if (preg_match($rig_regex, $email_subject)) {
            return $alias_map[$rig_regex];
        }
    }
    return null;
}
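For example, using the sample subject from the question (and assuming 123 is the ID you stored for Rig A), this should return RIG_ID_1:
$subject = 'Weather report 10 April # 1100 Rig A for you as requested';
var_dump(get_rig_id($subject)); // int(123)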
For your purposes the practical solution is very much dependent upon how many rigs you've got and how many substrings per rig. I suspect that unless you're dealing with tens of thousands of rigs, or unless performance is a critical aspect of this application, a naive O(n^2) solution would probably suffice. (Remember that premature optimization is the root of all evil!) A simple benchmark would bear this out.
An even better, and potentially faster, solution would be to set up an Elasticsearch instance, but once again that may be more effort than it's worth when a naive approach would suffice in a fraction of the implementation time.
I want to add a random string as a token for form submission, generated so that it is unique forever. I have spent too much time with Google, but I am confused about which combination to use.
I found so many ways to do this when I googled:
1) Combination of characters and numbers.
2) Combination of characters, numbers and special characters.
3) Combination of characters, numbers, special characters and date/time.
Which combination should I use?
How many characters should the random string have?
If there is any other method which is more secure, please let me know.
Here are some considerations:
Alphabet
The set of characters used can be considered the alphabet for the encoding. It doesn't affect the string's strength by itself, but a larger alphabet (numbers, non-alphanumeric characters, etc.) does allow for shorter strings of similar strength (i.e. keyspace), so it's useful if you are looking for shorter strings.
Input Values
To guarantee that your string is unique, you need to include something which is guaranteed to be unique.
A random value is a good input if you have a good random number generator
Time is a good input to add, but it may not be unique in a high-traffic environment
A user ID is a good input if you assume a user isn't going to create two sessions at the exact same time
A unique ID is something the system guarantees is unique. This is often something that the server will guarantee / verify is unique, either in a single-server deployment or a distributed deployment. A simple way to do this is to add a machine ID and a machine-unique ID. A more complicated way is to assign key ranges to machines and have each machine manage its own key range.
Systems that I've worked with that require absolute uniqueness have added a server-unique ID which guarantees an item is unique. This means the same item on different servers would be seen as different, which was what was wanted in that case.
Approach
Pick one or more input values that match your requirement for uniqueness. If you need absolute uniqueness forever, you need something that you control and that you are sure is unique, e.g. a machine-associated number (one that won't conflict with others in a distributed system). If you don't need absolute uniqueness, you can use a random number combined with another value such as time. If you need randomness, add a random number.
Use an alphabet / encoding that matches your use case. For machine IDs, encodings like hexadecimal and base64 are popular. For machine-readable IDs, I prefer base32 (Crockford) or base36 among case-insensitive encodings, and base58 or base62 among case-sensitive ones. This is because base32, 36, 58 and 62 produce shorter strings and (vs. base64) are safe across multiple uses (e.g. URLs, XML, file names, etc.) and don't require transformation between different use cases.
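As a rough illustration of those points, here is a minimal sketch; it assumes PHP 7+ for random_bytes() and uses a hexadecimal alphabet, so swap in a base32/base62 encoder if you prefer shorter strings:
<?php
// Hypothetical example: combine inputs you control (machine ID, user ID)
// with 128 bits of randomness, then encode in hex.
function makeToken($machineId, $userId) {
    $random = bin2hex(random_bytes(16)); // 32 hex characters
    return $machineId . '-' . $userId . '-' . $random;
}

echo makeToken('srv01', 42); // e.g. srv01-42-3f9c0a...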
You can definitely get a lot fancier depending on your needs, but I'll just throw this out there since it's what I use frequently for stuff like what you are describing:
md5(rand());
It's quick, simple and easy to remember. And since it's hexadecimal it plays nicely with others.
Refer to this SO protected question. It might be what you are looking for.
I think it's better to redirect you to a previously asked question which has more substantive answers. You will find a lot of options.
Try the code below; the function getUniqueToken() returns a unique string of length 10 (by default).
/*
This function will return a unique token string...
*/
function getUniqueToken($tokenLength = 10) {
    $token = "";
    // Combination of characters, numbers and special characters...
    $combinationString = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789*#&$^";
    for ($i = 0; $i < $tokenLength; $i++) {
        $token .= $combinationString[uniqueSecureHelper(0, strlen($combinationString))];
    }
    return $token;
}
/*
This helper function returns a secure random integer in the range [$minVal, $maxVal)...
*/
function uniqueSecureHelper($minVal, $maxVal) {
    $range = $maxVal - $minVal;
    if ($range < 0) return $minVal; // not so random...
    $log = log($range, 2);
    $bytes = (int) ($log / 8) + 1;    // length in bytes
    $bits = (int) $log + 1;           // length in bits
    $filter = (int) (1 << $bits) - 1; // set all lower bits to 1
    do {
        $rnd = hexdec(bin2hex(openssl_random_pseudo_bytes($bytes)));
        $rnd = $rnd & $filter;        // discard irrelevant bits
    } while ($rnd >= $range);
    return $minVal + $rnd;
}
Use this code (two functions); you can increase the string length by passing an int parameter, like getUniqueToken(15).
I used your 2nd idea (a combination of characters, numbers and special characters), which you refined after googling. I hope my example will help you.
You should go for the 3rd option, because it includes the date and time, so it is unique every time.
And for the method, have you tried
str_shuffle($string)
Every time it generates a random string from $string.
Then use
substr($string, $start, $length)
to cut it down.
And if you want the date and time, concatenate them onto the resulting string.
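A small sketch of that idea (note that str_shuffle() is not cryptographically secure, so treat this as illustrative only):
$string = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
$token  = substr(str_shuffle($string), 0, 10); // 10 shuffled characters
$token .= date('YmdHis');                      // append the date and time
echo $token;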
Here is some easily understandable and effective code to generate random strings in PHP. I do not consider predictability concerns important in this context.
<?php
$d = str_shuffle('0123456789');
$C = str_shuffle('ABCDEFGHIJKLMNOPQRSTUVWXYZ');
$m = str_shuffle('abcdefghijklmnopqrstuvwxyz');
$s = str_shuffle('#!$&()*+-_~');
$l = 9; // min 4

$r = substr(str_shuffle($d . $C . $m . $s), 0, $l);
echo $r . '<br>';

// always at least one digit, special, lowercase and uppercase character
$safe = substr($d, 0, 1) . substr($C, 0, 1) . substr($m, 0, 1) . mb_substr($s, 0, 1);
$r = str_shuffle($safe . substr($r, 0, $l - 4));
// this also allows for 0, 1 or 2 of each available character in the string
echo $r;
exit;
?>
For a unique string, use uniqid().
And to make it secure, use a hashing algorithm,
for example:
echo md5(uniqid());
I have several ways to write a phone number:
+5511999999999
55999999999
11999999999
999999999
Is there any library or logical way to compare phone numbers in PHP?
Interesting question.
In order to compare two strings (such as phone numbers or email addresses) you have to be sure they are written in a similar format.
I suggest performing validation on your form input when you get the phone numbers from the user. This will allow you to compare strings that follow the same template, for example: Country code: [+XXX] Area code: [(XXX)] Phone number: [XXX-XXXX]
The user would write something like +44(887)345-5532 (for a UK phone).
If you have a file or list with some phone numbers that you would like to compare, you have to come up with criteria for the country code, area code and actual phone number in order to compare them without mistakes.
When this is done you can split the string and compare each element.
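As a simple sketch of that idea, you could normalise both numbers to digits only and then compare their trailing digits (the suffix length of 9 here is an assumption based on the examples in the question):
function normalisePhone($number) {
    return preg_replace('/\D+/', '', $number); // keep digits only
}

function samePhone($a, $b, $suffixLength = 9) {
    $a = substr(normalisePhone($a), -$suffixLength);
    $b = substr(normalisePhone($b), -$suffixLength);
    return $a !== '' && $a === $b;
}

var_dump(samePhone('+5511999999999', '999999999')); // bool(true)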
There are many open-source ways of doing such validation. Try a framework such as Zend Framework, CakePHP or CodeIgniter and see how it is done there.
Hope that helps.
An interesting approach to the problem could be to consider a percentage of similarity. For this you could use similar_text. I built a function that takes an array of phone numbers and compares each of them to one number, like this:
function similarPhone($search, $phones, $acceptablePercentage)
{
    foreach ($phones as $phone) {
        similar_text($search, $phone, $percentage);
        if ($percentage >= $acceptablePercentage) {
            return true;
        }
    }
    return false;
}
So I use an acceptable percentage of 80% to compare.
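For example (hypothetical input; returns true if any entry is at least 80% similar to the search number):
$phones = array('+5511999999999', '55999999999', '11999999999');
var_dump(similarPhone('999999999', $phones, 80));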
First, please note that I am interested in how something like this would work, and am not intending to build it for a client etc., as I'm sure there may already be open-source implementations.
How do the algorithms that detect plagiarism in uploaded text work? Do they use regex to send all words to an index, strip out known words like 'the', 'a', etc. and then see how many words are the same in different essays? Do they then have a magic number of identical words that flags a possible duplicate? Do they use levenshtein()?
My language of choice is PHP.
UPDATE
I'm thinking of not checking for plagiarism globally, but rather, say, within 30 uploaded essays from a class, in case students have gotten together on a strictly one-person assignment.
Here is an online site that claims to do so: http://www.plagiarism.org/
Good plagiarism detection will apply heuristics based on the type of document (e.g. an essay or program code in a specific language).
However, you can also apply a general solution. Have a look at the Normalized Compression Distance (NCD). Obviously you cannot exactly calculate a text's Kolmogorov complexity, but you can approximate it by simply compressing the text.
A smaller NCD indicates that two texts are more similar. Some compression algorithms will give better results than others. Luckily PHP provides support for several compression algorithms, so you can have your NCD-driven plagiarism detection code running in no time. Below I'll give example code which uses Zlib:
PHP:
function ncd($x, $y) {
    $cx = strlen(gzcompress($x));
    $cy = strlen(gzcompress($y));
    return (strlen(gzcompress($x . $y)) - min($cx, $cy)) / max($cx, $cy);
}

print(ncd('this is a test', 'this was a test'));
print(ncd('this is a test', 'this text is completely different'));
Python:
>>> from zlib import compress as c
>>> def ncd(x, y):
...     cx, cy = len(c(x)), len(c(y))
...     return (len(c(x + y)) - min(cx, cy)) / max(cx, cy)
...
>>> ncd(b'this is a test', b'this was a test')
0.30434782608695654
>>> ncd(b'this is a test', b'this text is completely different')
0.74358974358974361
Note that for larger texts (read: actual files) the results will be much more pronounced. Give it a try and report your experiences!
I think that this problem is complicated, and doesn't have one best solution.
You can detect exact duplication of words at the whole-document level (i.e. someone downloads an entire essay from the web) all the way down to the phrase level. Doing this at the document level is pretty easy: the most trivial solution would be to take the checksum of each document submitted and compare it against a list of checksums of known documents. After that you could try to detect plagiarism of ideas, or find sentences that were copied directly and then changed slightly in order to throw off software like this.
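A trivial sketch of that document-level check, normalising whitespace and case first so cosmetic edits don't defeat it:
function documentFingerprint($text) {
    // Normalise whitespace and case so trivial edits don't change the hash.
    $normalised = strtolower(preg_replace('/\s+/', ' ', trim($text)));
    return sha1($normalised);
}

$knownEssays = array("First known essay text...", "Second known essay text...");
$knownHashes = array_map('documentFingerprint', $knownEssays);

$submitted = "First   known ESSAY text...";
if (in_array(documentFingerprint($submitted), $knownHashes)) {
    // flagged as an exact (whitespace/case-insensitive) duplicate
}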
To get something that works at the phrase level you might need to get more sophisticated if you want any level of efficiency. For example, you could look for differences in writing style between paragraphs, and focus your attention on paragraphs that feel "out of place" compared to the rest of a paper.
There are lots of papers on this subject out there, so I suspect there is no one perfect solution yet. For example, these two papers give introductions to some of the general issues with this kind of software, and have plenty of references that you could dig deeper into if you'd like.
http://ir.shef.ac.uk/cloughie/papers/pas_plagiarism.pdf
http://proceedings.informingscience.org/InSITE2007/IISITv4p601-614Dreh383.pdf
Well, you first of all have to understand what you're up against.
Word-for-word plagiarism should be ridiculously easy to spot. The most naive approach would be to take word tuples of sufficient length and compare them against your corpus. The sufficient length can be incredibly low. Compare Google results:
"I think" => 454,000,000
"I think this" => 329,000,000
"I think this is" => 227,000,000
"I think this is plagiarism" => 5
So even with that approach you have a very high chance to find a good match or two (fun fact: most criminals are really dumb).
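A crude sketch of that word-tuple idea for a small corpus (the tuple length of 4 is an arbitrary choice):
function wordTuples($text, $n = 4) {
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $tuples = array();
    for ($i = 0; $i + $n <= count($words); $i++) {
        $tuples[] = implode(' ', array_slice($words, $i, $n));
    }
    return $tuples;
}

// Fraction of $a's word tuples that also appear in $b.
function tupleOverlap($a, $b, $n = 4) {
    $ta = wordTuples($a, $n);
    $tb = wordTuples($b, $n);
    if (count($ta) === 0) return 0.0;
    return count(array_intersect($ta, $tb)) / count($ta);
}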
If the plagiarist used synonyms, changed word ordering and so on, obviously it gets a bit more difficult. You would have to store synonyms as well and try to normalise grammatical structure a bit to keep the same approach working. The same goes for spelling, of course (i.e. try to match by normalisation or try to account for the deviations in your matching, as in the NCD approaches posted in the other answers).
However the biggest problem is conceptual plagiarism. That is really hard and there are no obvious solutions without parsing the semantics of each sentence (i.e. sufficiently complex AI).
The truth is, though, that you only need to find SOME kind of match. You don't need to find an exact match in order to find a relevant text in your corpus. The final assessment should always be made by a human anyway, so it's okay if you find an inexact match.
Plagiarists are mostly stupid and lazy, so their copies will be stupid and lazy, too. Some put an incredible amount of effort into their work, but those works are often non-obvious plagiarism in the first place, so it's hard to track down programmatically (i.e. if a human has trouble recognising plagiarism with both texts presented side-by-side, a computer most likely will, too). For all the other 80%-or-so, the dumb approach is good enough.
It really depends on "plagiarised from where".
If you are talking about within the context of a single site, that's vastly different from across the web, or the Library of Congress, or ...
http://www.copyscape.com/ pretty much proves it can be done.
The basic concept seems to be:
Do a Google search for some uncommon word sequences
For each result, do a detailed analysis
The detailed analysis portion can certainly be similar, since it is a 1 to 1 comparison, but locating and obtaining source documents is the key factor.
For better results on not-so-big strings:
There are problems with the direct use of the NCD formula on strings or short texts: NCD(X,X) is not zero (!). To remove this artifact, subtract the self-comparison.
See similar_NCD_gzip() demo at http://leis.saocarlos.sp.gov.br/SIMILAR.php
function similar_NCD_gzip($sx, $sy, $prec = 0, $MAXLEN = 90000) {
    # NCD with gzip artifact correction and percentual return.
    # sx, sy = strings to compare.
    # Use $prec=-1 for a result in the range [0-1], $prec=0 for a percentage,
    # $prec=1 or 2, 3... for better precision (not entirely reliable).
    # Use $MAXLEN=-1 or an approximate compressed length.
    # For the NCD definition see http://arxiv.org/abs/0809.2553
    # (c) Krauss (2010).
    $x = $min = strlen(gzcompress($sx));
    $y = $max = strlen(gzcompress($sy));
    $xy = strlen(gzcompress($sx . $sy));
    $a = $sx;
    if ($x > $y) { # swap min/max
        $min = $y;
        $max = $x;
        $a = $sy;
    }
    $res = ($xy - $min) / $max; # NCD definition.

    # Optional correction (for short strings):
    if ($MAXLEN < 0 || $xy < $MAXLEN) {
        $aa = strlen(gzcompress($a . $a));
        $ref = ($aa - $min) / $min;
        $res = $res - $ref; # correction
    }
    return ($prec < 0) ? $res : 100 * round($res, 2 + $prec);
}
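For example, a quick check with the default percentage return:
echo similar_NCD_gzip('this is a test', 'this was a test'); // corrected NCD as a percentage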
Suggestions for an updated title are welcome, as I'm having trouble easily quantifying what I'm trying to do.
This is a web-based form with PHP doing the calculations, though this question probably has an algorithmic or language agnostic answer.
Essentially, there is an Amount field and a Charge Code field.
The Charge Code entered represents shorthand for several 'agents' among whom the Amount is divided. Most cases are single letters; however, there are a couple of cases where this varies, which causes a bit of trouble.
Basically, A = AgentType1, J = AgentType2, L = AgentType3, and as paperwork and user requirements would have it, "A2" is also a valid replacement for "J".
So an Amount of 50 and a Charge Code of "AJ" would result in the Amount being divided by 2 (two agents) and dispersed accordingly. The same goes for a string like "AA2".
I have currently set up a process (that works) that goes like this:
Divide = 0;
RegEx check for AgentType1 in Charge Code:
    Divide++;
    Set This-AgentType-Gets-Return;
RegEx check for AgentType2 in Charge Code:
    Divide++;
    Set This-AgentType-Gets-Return;
... etc ...
Then I divide the Amount by the "Divide" amount, and the result gets divvied up to each AgentType present in the Charge Code.
I know there must be an easier/simpler way to implement this, but it's not coming to me at the moment.
Is there a way to quickly derive the number of AgentTypes involved in the Charge Code, and which they are?
I would probably just do something simple like this:
$valid_codes = array('A', 'J', 'L');
$found = array();

// deal with the special A2 case first, to get it out of the string
// this code could be generalized if more special cases need to be handled
if (stripos($charge_code, 'A2') !== FALSE)
{
    $found['J'] = true;
    $charge_code = str_ireplace('A2', '', $charge_code);
}

foreach ($valid_codes as $code)
{
    if (stripos($charge_code, $code) !== FALSE) // if the code was in the string
    {
        $found[$code] = true;
    }
}
Now you can get the number you need to divide amount by with count($found), and the codes you need to divide between with array_keys($found).
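Continuing from the snippet above, the pay-out step might look something like this:
$amount = 50;
$divide = count($found);     // e.g. 2 for "AJ" or "AA2"
$share  = $amount / $divide; // 25 each
foreach (array_keys($found) as $agent) {
    // credit $share to $agent ('A', 'J' or 'L')
}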
Can you change the charge code field to an array of fields? Something like:
<input type="hidden" name="agent[]" value="A" />
for all your agents would let you do:
$divide = count($_POST["agent"]);
foreach ($_POST["agent"] as $agent) {
    $sum = $_POST["amount"] / $divide;
    // do other stuff
}
Couldn't you match the string with something like this regex
^([A-Z]\d*)*$
and then work through the generated match list? The divisor would just be the length of this list (perhaps after removing duplicates).
For mapping symbols to Agents (why AgentTypes?), you could use a simple associative list or a hashmap (I don't know which constructs are most readily available in PHP).
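A minimal sketch of that approach in PHP; the symbol-to-agent mapping here is an assumption based on the question:
$symbolToAgent = array('A' => 'AgentType1', 'A2' => 'AgentType2', 'J' => 'AgentType2', 'L' => 'AgentType3');

$charge_code = 'AA2';
preg_match_all('/[A-Z]\d*/', $charge_code, $m); // e.g. ["A", "A2"]

$agents = array();
foreach ($m[0] as $symbol) {
    if (isset($symbolToAgent[$symbol])) {
        $agents[$symbolToAgent[$symbol]] = true; // de-duplicate by agent
    }
}
$divide = count($agents); // 2 for "AA2": AgentType1 and AgentType2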