How would you code an anti plagiarism site?

How would you code an anti plagiarism site? - php

First, please note, that I am interested in how something like this would work, and am not intending to build it for a client etc, as I'm sure there may already be open source implementations.
How do the algorithms work which detect plagiarism in uploaded text? Does it use regex to send all words to an index, strip out known words like 'the', 'a', etc and then see how many words are the same in different essays? Does it them have a magic number of identical words which flag it as a possible duplicate? Does it use levenshtein()?
My language of choice is PHP.
UPDATE
I'm thinking of not checking for plagiarism globally, but more say in 30 uploaded essays from a class. In case students have gotten together on a strictly one person assignment.
Here is an online site that claims to do so: http://www.plagiarism.org/

Good plagiarism detection will apply heuristics based on the type of document (e.g. an essay or program code in a specific language).
However, you can also apply a general solution. Have a look at the Normalized Compression Distance (NCD). Obviously you cannot exactly calculate a text's Kolmogorov complexity, but you can approach it be simply compressing the text.
A smaller NCD indicates that two texts are more similar. Some compression
algorithms will give better results than others. Luckily PHP provides support
for several compression algorithms, so you can have your NCD-driven plagiarism
detection code running in no-time. Below I'll give example code which uses
Zlib:
PHP:
function ncd($x, $y) {
$cx = strlen(gzcompress($x));
$cy = strlen(gzcompress($y));
return (strlen(gzcompress($x . $y)) - min($cx, $cy)) / max($cx, $cy);
}
print(ncd('this is a test', 'this was a test'));
print(ncd('this is a test', 'this text is completely different'));
Python:
>>> from zlib import compress as c
>>> def ncd(x, y):
... cx, cy = len(c(x)), len(c(y))
... return (len(c(x + y)) - min(cx, cy)) / max(cx, cy)
...
>>> ncd('this is a test', 'this was a test')
0.30434782608695654
>>> ncd('this is a test', 'this text is completely different')
0.74358974358974361
Note that for larger texts (read: actual files) the results will be much more
pronounced. Give it a try and report your experiences!

I think that this problem is complicated, and doesn't have one best solution.
You can detect exact duplication of words at the whole document level (ie someone downloads an entire essay from the web) all the way down to the phrase level. Doing this at the document level is pretty easy - the most trivial solution would take the checksum of each document submitted and compare it against a list of checksums of known documents. After that you could try to detect plagiarism of ideas, or find sentences that were copied directly then changed slightly in order to throw off software like this.
To get something that works at the phrase level you might need to get more sophisticated if want any level of efficiency. For example, you could look for differences in style of writing between paragraphs, and focus your attention to paragraphs that feel "out of place" compared to the rest of a paper.
There are lots of papers on this subject out there, so I suspect there is no one perfect solution yet. For example, these 2 papers give introductions to some of the general issues with this kind of software,and have plenty of references that you could dig deeper into if you'd like.
http://ir.shef.ac.uk/cloughie/papers/pas_plagiarism.pdf
http://proceedings.informingscience.org/InSITE2007/IISITv4p601-614Dreh383.pdf

Well, you first of all have to understand what you're up against.
Word-for-word plagiarism should be ridiculously easy to spot. The most naive approach would be to take word tuples of sufficient length and compare them against your corpus. The sufficient length can be incredibly low. Compare Google results:
"I think" => 454,000,000
"I think this" => 329,000,000
"I think this is" => 227,000,000
"I think this is plagiarism" => 5
So even with that approach you have a very high chance to find a good match or two (fun fact: most criminals are really dumb).
If the plagiarist used synonyms, changed word ordering and so on, obviously it gets a bit more difficult. You would have to store synonyms as well and try to normalise grammatical structure a bit to keep the same approach working. The same goes for spelling, of course (i.e. try to match by normalisation or try to account for the deviations in your matching, as in the NCD approaches posted in the other answers).
However the biggest problem is conceptual plagiarism. That is really hard and there are no obvious solutions without parsing the semantics of each sentence (i.e. sufficiently complex AI).
The truth is, though, that you only need to find SOME kind of match. You don't need to find an exact match in order to find a relevant text in your corpus. The final assessment should always be made by a human anyway, so it's okay if you find an inexact match.
Plagiarists are mostly stupid and lazy, so their copies will be stupid and lazy, too. Some put an incredible amount of effort into their work, but those works are often non-obvious plagiarism in the first place, so it's hard to track down programmatically (i.e. if a human has trouble recognising plagiarism with both texts presented side-by-side, a computer most likely will, too). For all the other 80%-or-so, the dumb approach is good enough.

It really depends on "plagarised from where".
If you are talking about within the context of a single site, that's vastly different from across the web, or the library of congres, or ...
http://www.copyscape.com/ pretty much proves it can be done.
Basic concept seems to be
do a google search for some uncommon
word sequences
For each result, do a detailed analysis
The detailed analysis portion can certainly be similar, since it is a 1 to 1 comparison, but locating and obtaining source documents is the key factor.

                           (This is a Wiki! Please edit here with corrections or enhancings)
For better results on not-so-big strings:
There are problems with the direct uso of the NCD formula on strings or little texts. NCD(X,X) is not zero (!). To remove this artifact subtract the self comparison.
See similar_NCD_gzip() demo at http://leis.saocarlos.sp.gov.br/SIMILAR.php
function similar_NCD_gzip($sx, $sy, $prec=0, $MAXLEN=90000) {
# NCD with gzip artifact correctoin and percentual return.
# sx,sy = strings to compare.
# Use $prec=-1 for result range [0-1], $pres=0 for percentual,
# $pres=1 or =2,3... for better precision (not a reliable)
# Use MAXLEN=-1 or a aprox. compress lenght.
# For NCD definition see http://arxiv.org/abs/0809.2553
# (c) Krauss (2010).
$x = $min = strlen(gzcompress($sx));
$y = $max = strlen(gzcompress($sy));
$xy= strlen(gzcompress($sx.$sy));
$a = $sx;
if ($x>$y) { # swap min/max
$min = $y;
$max = $x;
$a = $sy;
}
$res = ($xy-$min)/$max; # NCD definition.
# Optional correction (for little strings):
if ($MAXLEN<0 || $xy<$MAXLEN) {
$aa= strlen(gzcompress($a.$a));
$ref = ($aa-$min)/$min;
$res = $res - $ref; # correction
}
return ($prec<0)? $res: 100*round($res,2+$prec);
}

Related

Does a string contain any of a list of substrings in PHP?

I am adding a feature to an application that allows authorised oil rig personnel to submit weather reports (for use by our pilots when planning flights) to our system via email. The tricky part is that we want to match these reports to a particular oil platform, but the personnel (and their email accounts) can move between rigs.
We already have a list of waypoints that each have an "aliases" field. Basically if the email subject contains something in the aliases field, we should match the email to that waypoint.
The subject could be "Weather report 10 April # 1100 Rig A for you as requested"
The aliases for that waypoint would be something like
"RRA RPA Rig A RigA"
Keep in mind there is a similar list of aliases for all the other waypoints we have.
Is there a better way of matching than iterating through each word of each alias and checking if it's a substring of the email subject? Because that sounds like a n^2 sort of problem.
The alternative is for us to put a restriction and tell the operators they have to put the rig name at the start or end of the subject.

This sounds more like an algorithms question than a PHP question specifically. Take a look at What is the fastest substring search algorithm?
Well you can transform this into something like an O(n log n) algorithm, but it depends on the implementation specifics of stripos():
define('RIG_ID_1', 123);
define('RIG_ID_2', 456);
function get_rig_id($email_subject) {
$alias_map = [
'RRA' => RIG_ID_1,
'RPA' => RIG_ID_1,
'Rig A' => RIG_ID_1,
'RigA' => RIG_ID_1,
// ...
];
foreach(array_keys($alias_map) as $rig_substr) {
if(stripos($email_subject, $rig_substr) !== false) {
return $alias_map[$rig_substr];
}
}
return null;
}
Here each substring is examined by stripos() exactly once. Probably a better solution is to compose these strings into a series of regexes. Internally, the regex engine is able to scan text very efficiently, typically scanning each character only one time:
ex.:
<?php
define('RIG_ID_1', 123);
define('RIG_ID_2', 456);
function get_rig_id($email_subject) {
$alias_map = [
'/RRA|RPA|Rig\\sA|RigA/i' => RIG_ID_1,
'/RRB|RPB|Rig\\sB|RigB/i' => RIG_ID_2,
// ...
];
foreach(array_keys($alias_map) as $rig_regex) {
if(preg_match($rig_regex, $email_subject)) {
return $alias_map[$rig_regex];
}
}
return null;
}
For your purposes the practical solution is very much dependent upon how many rigs you've got an how many substrings per rig. I suspect that unless you're dealing with tens of thousands of rigs or unless performance is a critical aspect of this application, a naive O(n^2) solution would probably suffice. (Remember that premature optimization is the root of all evil!) A simple benchmark would bear this out.
An even-better solution -- and potentially faster -- would be to set up an elasticsearch instance, but once again that may be too much effort to go to when a naive approach would suffice in a fraction of the implementation time.

PHP built in functions complexity (isAnagramOfPalindrome function)

I've been googling for the past 2 hours, and I cannot find a list of php built in functions time and space complexity. I have the isAnagramOfPalindrome problem to solve with the following maximum allowed complexity:
expected worst-case time complexity is O(N)
expected worst-case space complexity is O(1) (not counting the storage required for input arguments).
where N is the input string length. Here is my simplest solution, but I don't know if it is within the complexity limits.
class Solution {
// Function to determine if the input string can make a palindrome by rearranging it
static public function isAnagramOfPalindrome($S) {
// here I am counting how many characters have odd number of occurrences
$odds = count(array_filter(count_chars($S, 1), function($var) {
return($var & 1);
}));
// If the string length is odd, then a palindrome would have 1 character with odd number occurrences
// If the string length is even, all characters should have even number of occurrences
return (int)($odds == (strlen($S) & 1));
}
}
echo Solution :: isAnagramOfPalindrome($_POST['input']);
Anyone have an idea where to find this kind of information?
EDIT
I found out that array_filter has O(N) complexity, and count has O(1) complexity. Now I need to find info on count_chars, but a full list would be very convenient for future porblems.
EDIT 2
After some research on space and time complexity in general, I found out that this code has O(N) time complexity and O(1) space complexity because:
The count_chars will loop N times (full length of the input string, giving it a start complexity of O(N) ). This is generating an array with limited maximum number of fields (26 precisely, the number of different characters), and then it is applying a filter on this array, which means the filter will loop 26 times at most. When pushing the input length towards infinity, this loop is insignificant and it is seen as a constant. Count also applies to this generated constant array, and besides, it is insignificant because the count function complexity is O(1). Hence, the time complexity of the algorithm is O(N).
It goes the same with space complexity. When calculating space complexity, we do not count the input, only the objects generated in the process. These objects are the 26-elements array and the count variable, and both are treated as constants because their size cannot increase over this point, not matter how big the input is. So we can say that the algorithm has a space complexity of O(1).
Anyway, that list would be still valuable so we do not have to look inside the php source code. :)

A probable reason for not including this information is that is is likely to change per release, as improvements are made / optimizations for a general case.
PHP is built on C, Some of the functions are simply wrappers around the c counterparts, for example hypot a google search, a look at man hypot, in the docs for he math lib
http://www.gnu.org/software/libc/manual/html_node/Exponents-and-Logarithms.html#Exponents-and-Logarithms
The source actually provides no better info
https://github.com/lattera/glibc/blob/a2f34833b1042d5d8eeb263b4cf4caaea138c4ad/math/w_hypot.c (Not official, Just easy to link to)
Not to mention, This is only glibc, Windows will have a different implementation. So there MAY even be a different big O per OS that PHP is compiled on
Another reason could be because it would confuse most developers.
Most developers I know would simply choose a function with the "best" big O
a maximum doesnt always mean its slower
http://www.sorting-algorithms.com/
Has a good visual prop of whats happening with some functions, ie bubble sort is a "slow" sort, Yet its one of the fastest for nearly sorted data.
Quick sort is what many will use, which is actually very slow for nearly sorted data.
Big O is worst case - PHP may decide between a release that they should optimize for a certain condition and that will change the big O of the function and theres no easy way to document that.
There is a partial list here (which I guess you have seen)
List of Big-O for PHP functions
Which does list some of the more common PHP functions.
For this particular example....
Its fairly easy to solve without using the built in functions.
Example code
function isPalAnagram($string) {
$string = str_replace(" ", "", $string);
$len = strlen($string);
$oddCount = $len & 1;
$string = str_split($string);
while ($len > 0 && $oddCount >= 0) {
$current = reset($string);
$replace_count = 0;
foreach($string as $key => &$char) {
if ($char === $current){
unset($string[$key]);
$len--;
$replace_count++;
continue;
}
}
$oddCount -= ($replace_count & 1);
}
return ($len - $oddCount) === 0;
}
Using the fact that there can not be more than 1 odd count, you can return early from the array.
I think mine is also O(N) time because its worst case is O(N) as far as I can tell.
Test
$a = microtime(true);
for($i=1; $i<100000; $i++) {
testMethod("the quick brown fox jumped over the lazy dog");
testMethod("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
testMethod("testest");
}
printf("Took %s seconds, %s memory", microtime(true) - $a, memory_get_peak_usage(true));
Tests run using really old hardware
My way
Took 64.125452041626 seconds, 262144 memory
Your way
Took 112.96145009995 seconds, 262144 memory
I'm fairly sure that my way is not the quickest way either.
I actually cant see much info either for languages other than PHP (Java for example).
I know a lot of this post is speculating about why its not there and theres not a lot drawing from credible sources, I hope its an partially explained why big O isnt listed in the documentation page though

Optimize PHP script for generating IBAN

I wrote function in PHP that generates croatian IBAN for given bank account. I can easily rewrite for returning any IBAN. Problem is that I think it is not optimized nor elegant. This is the function:
function IBAN_generator($acc){
if(strlen($acc)!=23)
return;
$temp_str=substr($acc,0,3);
$remainder =$temp_str % 97;
for($i=3;$i<=22;$i++)
{
$remainder =$remainder .substr($acc,$i,1);
$remainder = $remainder % 97;
}
$con_num = 98 - $remainder;
if ($con_num<10)
{
$con_num="0".$con_num;
}
$IBAN="HR".$con_num.substr($acc,0,17);
return $IBAN;
}
Is there a better way to generate IBAN?

At first glance it doesn't seem you can make it much faster, it's just simple sequence of string appending.
Unless you have to use it thousands of times and that represents a bottleneck for your application, I'd not waste time make it better, it probably takes a few microseconds, and just upgrading the PHP version would probably make improvements much better than code changes you'd implement.
If you really have to make it faster, possible solutions are
- writing it the function into an extension
- APC op code caching (it should make things generally fast when interpreting the code so globally increase the speed)
- caching results in memory (only if your application runs the same input many times, that is not probably a common case for a simple algorithm like this one)
If you want to play with it and try to make it faster, careful, you could alter the logic and introduce a bug. Always use a unit test, or write some test cases before changing it, always a good practice

You might have a look at this repo:
https://github.com/jschaedl/Iban
On your part is to add the rules for croatia.
After that you should be able to use it for your country.
Greetz

To microoptimize, the substr has O(n) time complexity, so your loop has O(n2). To avoid it, use str_split and access the chars of the generated array by index.
Then, to make it more elegant, the for loop can be replaced with array_reduce.
Also, generally avoid string concatenation in loop, it has O(n2) time and memory complexity. IBAN is short and PHP is not run on microcomputers, so it is not a issue here. But if you ever worked with longer strings, generate an array and then implode it*.
And, of course, if you are consistent with spacing around = and after for/if, it is also more elegant ;-)
* In Javascript I once tried to generate HTML table of Hangul alphabet by iterative string concatenation. It crashed the browser consuming 1GB+ memory.

Restrict Phone numbers with pattern like 0123456789, 8888888888, 9999999999, etc

I need to restrict phone numbers with patterns like:
0123456789,
1111111111,
2222222222,
3333333333,
4444444444,
etc.
I am trying to do it in PHP.
So far i came up with creating an array and searching in it to restrict that mobile no.
Is there any better way to do the same?
Maybe a regex or maybe in javascript.

If you don't want to keep updating a long array of numbers to check against like you mention yourself you basically want to do pattern detection / pattern recognition.
This can be everything from trivial to very complicated depending on your previous knowledge.
A small start can be found here... But there are tons of very thick books on the subject ;)
http://en.wikipedia.org/wiki/Pattern_recognition
So the "easiest" way is probably to use an array look up. The downside is that you must know every single number you wish to blacklist beforehand. A middle way would be to have an array of regexps of invalid formats that you check against instead of having the actual numbers in the array.
So you could have regexps covering things like
Numbers to short
Numbers to long
Numbers with only the same digits
etc
Depending on available systems and geographical location it might actually be possible to do some kind of look up against some known database of numbers. But that could lead to false positives for unlisted numbers.
For a global, dynamic, working system keeping a look up array and doing look ups against databases could both prove to become very hard to handle since the number of known data sources to keep working support for might grow to large to handle.

If you need to block specific phone numbers, you could do this:
$numberToTest = [...]; // this is your user input number to test
$notAllowed = array(
'0123456789',
'1111111111',
'2222222222',
// and so on
);
if(in_array($numberToTest, $notAllowed)) {
// NOT VALID
}
else {
// VALID
}
to match all numbers like 1111111111, 2222222222, etc you could do this:
if(preg_match('/^(\d)\1{9}$/', $numberToTest)) {
// NOT VALID
}
else {
// VALID
}

Javascript with regexp. isRestricted is set to true when the number is in restricted list:
var number = "0123456789";
var isRestricted = number.match( /0123456789|1111111111|2222222222|3333333333|4444444444/g ) != null;
I'm sure regexps works with PHP pretty much the same way using preg_match.

Is any solution the correct solution?

I always think to myself after solving a programming challenge that I have been tied up with for some time, "It works, thats good enough".
I don't think this is really the correct mindset, in my opinion and I think I should always be trying to code with the greatest performance.
Anyway, with this said, I just tried a ProjectEuler question. Specifically question #2.
How could I have improved this solution. I feel like its really verbose. Like I'm passing the previous number in recursion.
<?php
/* Each new term in the Fibonacci sequence is generated by adding the previous two
terms. By starting with 1 and 2, the first 10 terms will be:
1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ...
Find the sum of all the even-valued terms in the sequence which do not exceed
four million.
*/
function fibonacci ( $number, $previous = 1 ) {
global $answer;
$fibonacci = $number + $previous;
if($fibonacci > 4000000) return;
if($fibonacci % 2 == 0) {
$answer = is_numeric($answer) ? $answer + $fibonacci : $fibonacci;
}
return fibonacci($fibonacci, $number);
}
fibonacci(1);
echo $answer;
?>
Note this isn't homework. I left school hundreds of years ago. I am just feeling bored and going through the Project Euler questions

I always think to myself after solving
a programming challenge that I have
been tied up with for some time, "It
works, thats good enough".
I don't think this is really the
correct mindset, in my opinion and I
think I should always be trying to
code with the greatest performance.
One of the classic things presented in Code Complete is that programmers, given a goal, can create an "optimum" computer program using one of many metrics, but its impossible to optimize for all of the parameters at once. Parameters such as
Code Readabilty
Understandability of Code Output
Length of Code (lines)
Speed of Code Execution (performance)
Speed of writing code
Feel free to optimize for any one of these parameters, but keep in mind that optimizing for all of them at the same time can be an exercise in frustration, or result in an overdesigned system.
You should ask yourself: what are your goals? What is "good enough" in this situation? If you're just learning and want to make things more optimized, by all means go for it, just be aware that a perfect program takes infinite time to build, and time is valuable in and of itself.

You can avoid the mod 2 section by doing the operation three times (every third element is even), so that it reads:
$fibonacci = 3*$number + 2*$previous;
and the new input to fibonacci is ($fibonnacci,2*$number+$previous)
I'm not familiar with php, so this is just general algorithm advice, I don't know if it's the right syntax. It's practically the same operation, it just substitutes a few multiplications for moduluses and additions.
Also, make sure that you start with $number as even and the $previous as the odd one that precedes it in the sequence (you could start with $number as 2, $previous as 1, and have the sum also start at 2).

Forget about Fibonacci (Problem 2), i say just advance in Euler. Don't waste time finding the optimal code for every question.
If your answer achieves the One minute rule then you are good to try the next one. After few problems, things will get harder and you will be optimizing the code while you write to achieve that goal

Others on here have said it as well "This is part of the problem with example questions vs real business problems"
The answer to that question is very difficult to answer for a number of reasons:
Language plays a huge role. Some languages are much more suited to some problems and so if you are faced with a mismatch you are going to find your solution "less than eloquent"
It depends on how much time you have to solve the problem, the more time to solve the problem the more likely it is you will come to a solution you like (though the reverse is occasionally true as well too much time makes you over think)
It depends on your level of satisfaction overall. I have worked on several projects where I thought parts where great and coded beautifully, and other parts where utter garbage, but they were outside of what I had time to address.
I guess the bottom line is if you think its a good solution, and your customer/purchaser/team/etc agree then its a good solution for the time. You might change your mind in the future but for now its a good solution.

Use the guideline that the code to solve the problem shouldn't take more than about a minute to execute. That's the most important thing for Euler problems, IMO.
Beyond that, just make sure it's readable - make sure that you can easily see how the code works. This way, you can more easily see how things worked if you ever get a problem like one of the Euler problems you solved, which in turn lets you solve that problem more quickly - because you already know how you should solve it.
You can set other criteria for yourself, but I think that's going above and beyond the intention of Euler problems - to me, the context of the problems seem far more suitable for focusing on efficiency and readability than anything else

I didn't actually test this ... but there was something i personally would have attempted to solve in this solution before calling it "done".
Avoiding globals as much as possible by implementing recursion with a sum argument
EDIT: Update according to nnythm's algorithm recommendation (cool!)
function fibonacci ( $number, $previous, $sum ) {
if($fibonacci > 4000000) { return $sum; }
else {
$fibonacci = 3*$number + 2*$previous;
return fibonacci($fibonnacci,2*$number+$previous,$sum+$fibonacci);
}
}
echo fibonacci(2,1,2);

[shrug]
A solution should be evaluated by the requirements. If all requirements are satisfied, then everything else is moxy. If all requirements are met, and you are personally dissatisfied with the solution, then perhaps the requirements need re-evaluation. That's about as far as you can take this meta-physical question, because we start getting into things like project management and business :S
Ahem, regarding your Euler-Project question, just my two-pence:
Consider refactoring to iterative, as opposed to recursive
Notice every third term in the series is even? No need to modulo once you are given your starting term
For example
public const ulong TermLimit = 4000000;
public static ulong CalculateSumOfEvenTermsTo (ulong termLimit)
{
// sum!
ulong sum = 0;
// initial conditions
ulong prevTerm = 1;
ulong currTerm = 1;
ulong swapTerm = 0;
// unroll first even term, [odd + odd = even]
swapTerm = currTerm + prevTerm;
prevTerm = currTerm;
currTerm = swapTerm;
// begin iterative sum,
for (; currTerm < termLimit;)
{
// we have ensured currTerm is even,
// and loop condition ensures it is
// less than limit
sum += currTerm;
// next odd term, [odd + even = odd]
swapTerm = currTerm + prevTerm;
prevTerm = currTerm;
currTerm = swapTerm;
// next odd term, [even + odd = odd]
swapTerm = currTerm + prevTerm;
prevTerm = currTerm;
currTerm = swapTerm;
// next even term, [odd + odd = even]
swapTerm = currTerm + prevTerm;
prevTerm = currTerm;
currTerm = swapTerm;
}
return sum;
}
So, perhaps more lines of code, but [practically] guaranteed to be faster. An iterative approach is not as "elegant", but saves recursive method calls and saves stack space. Second, unrolling term generation [that is, explicitly expanding a loop] reduces the number of times you would have had to perform modulus operation and test "is even" conditional. Expanding also reduces the number of times your end conditional [if current term is less than limit] is evaluated.
Is it "better", no, it's just "another" solution.
Apologies for the C#, not familiar with php, but I am sure you could translate it fairly well.
Hope this helps, :)
Cheers

It is completely your choice, whether you are happy with a solution or whether you want to improve it further. There are many project Euler problems where a brute force solution would take too long, and where you will have to look for a clever algorithm.
Problem 2 doesn't require any optimisation. Your solution is already more than fast enough.
Still let me explain what kind of optimisation is possible. Often it helps to do some research on the subject. E.g. the wiki page on Fibonacci numbers contains this formula
fib(n) = (phi^n - (1-phi)^n)/sqrt(5)
where phi is the golden ratio. I.e.
phi = (sqrt(5)+1)/2.
If you use that fib(n) is approximately phi^n/sqrt(5) then you can find the index of the largest Fibonacci number smaller than M by
n = floor(log(M * sqrt(5)) / log(phi)).
E.g. for M=4000000, we get n=33, hence fib(33) the largest Fibonacci number smaller than 4000000. It can be observed that fib(n) is even if n is a multiple of 3. Hence the sum of the even Fibonacci numbers is
fib(0) + fib(3) + fib(6) + ... + fib(3k)
To find a closed form we use the formula above from the wikipedia page and notice that
the sum is essentially just two geometric series. The math isn't completely trivial, but using these ideas it can be shown that
fib(0) + fib(3) + fib(6) + ... + fib(3k) = (fib(3k + 2) - 1) /2 .
Since fib(n) has size O(n), the straight forward solution has a complexity of O(n^2).
Using the closed formula above together with a fast method to evaluate Fibonacci numbers
has a complexity of O(n log(n)^(1+epsilon)). For small numbers either solution is of course fine.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.