What is the best strategy to compare two Paragraphs in PHP & MySQL?


I have developed typing software in PHP & MySQL that captures the text typed by candidates at my institute. I am now stuck on how to compare the text each candidate typed against the standard paragraph they were given to type (as a hard copy, though the same text is also stored in the MySQL database). My dilemma is whether to run the Levenshtein distance algorithm in PHP or directly in MySQL for the best performance. I am also worried that an implementation in PHP might produce incorrect results when evaluating the texts. Note that the comparison will be used to rank candidates on the basis of words typed per minute.

The simplest solution would be to use PHP's built-in levenshtein function to compare the two blocks of text. If you wanted to push the processing off to the MySQL database, you could implement the solution listed in the Stack Overflow question "Levenshtein: MySQL + PHP".
Another PHP option might be the similar_text function.
The unfortunate drawback of the PHP levenshtein function is that it cannot handle strings longer than 255 characters (this limit was removed in PHP 8.0). As per the PHP manual:
This function returns the Levenshtein-Distance between the two
argument strings or -1, if one of the argument strings is longer than
the limit of 255 characters.
So, if your paragraphs are longer than that, you may be forced to implement a MySQL solution. I suppose you could break the paragraphs up into 255-character blocks for comparison, though I can't say definitively that this won't "break" the Levenshtein algorithm.
I'm not an expert in linguistics parsing and processing, so I can't speak to whether these are the best solutions (as you mention in your question). They are, however, very straightforward and simple to implement.
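Since the ranking is based on words typed, one way around the 255-character limit is to compute the edit distance over words instead of characters. The following is only a sketch of that idea, not a built-in PHP feature: the function name wordLevenshtein and the sample sentences are made up for illustration, and it assumes words are separated by whitespace and compared case-sensitively.

```php
<?php
// Word-level edit distance: the number of words that must be inserted,
// deleted, or substituted to turn the expected text into the typed text.
// Hypothetical helper, not part of PHP itself.
function wordLevenshtein(string $expected, string $typed): int
{
    $a = preg_split('/\s+/', trim($expected), -1, PREG_SPLIT_NO_EMPTY);
    $b = preg_split('/\s+/', trim($typed), -1, PREG_SPLIT_NO_EMPTY);

    // $prev[$j] holds the distance between the words of $a processed so far
    // and the first $j words of $b.
    $prev = range(0, count($b));
    foreach ($a as $i => $wordA) {
        $curr = [$i + 1];
        foreach ($b as $j => $wordB) {
            $cost = ($wordA === $wordB) ? 0 : 1;
            $curr[] = min(
                $prev[$j + 1] + 1,  // expected word missing from typed text
                $curr[$j] + 1,      // extra word in typed text
                $prev[$j] + $cost   // same word, or a mistyped word
            );
        }
        $prev = $curr;
    }
    return $prev[count($b)];
}

$standard = 'The quick brown fox jumps over the lazy dog';
$typed    = 'The quick brown fix jumps over lazy dog';
echo wordLevenshtein($standard, $typed), "\n"; // 2 (one wrong word, one missing word)
```

The result could then be subtracted from the number of words typed before computing words per minute, but how to weight errors is a policy decision for the institute.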

Related

What is the shortest way to represent all php types in string?

I'm writing a universal system that will hopefully one day apply to medicine, etc. (i.e. it's "scientific").
I figure the best way to go about this is to represent all data in PHP as strings (true would be "true", false would be "false", and so on). The reason for this is that there is exactly one string representation of any value in PHP (e.g. PHP code itself).
I am posting this question in an attempt to accelerate the design process of this program.
Some values are easily translated to string: numbers, booleans, etc.
Some are not: objects, arrays, resources.
I figured the format for transmitting objects and arrays would basically be JSON, but I'm not sure it's a tight fit. It's better than what I currently have (nothing), but at some point I would like to refine it.
Any ideas?

I'm writing a universal system
This is an ambitious goal indeed; so ambitious as to be foolish to attempt.
Now, probably you don't really mean "can do absolutely anything for anyone", but it's relevant to your question that you don't place any limits on what you're trying to represent. That's making your search for a serialization format unnecessarily difficult.
For instance, you mention resources, which PHP uses for things like database connections, open file handles, etc. They are transient pointers to something that exists briefly and then is gone, and serializing them is not only unsupported by PHP, it's close to meaningless.
Instead of trying to cover "everything", you need to think about what types of data you actually need to handle. Maybe you'll mostly be working with classes defined within the system, so you can define whatever format for those you want. Maybe you want to work with arbitrary bags of key-value pairs, in the form of PHP arrays. You might want to leave the way open for future expansion, but that's about flexibility in the format, not having a specific answer right now.
From there, you can look for what properties you want, and shop around:
JSON is a hugely popular "lowest common denominator" format. Its main downside is that it has no representation of specific custom types; everything has to be composed of lists and key-value pairs (I like to say "JSON has no class").
XML is less fashionable than it used to be, but very powerful for defining custom languages and types. It's quite verbose, but compresses well: a lot of modern file formats are actually zip archives containing compressed XML files.
PHP serialization format is really only intended for short-term, in-application purposes, like cache data. It's fairly concise, and closely tied to PHP's type system, but has security problems if users have influence over the data, as noted on the unserialize manual page.
There are even more concise formats that don't limit themselves to human-readable representations, if that was a relevant factor for you.
Obviously, the list is endless...
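To make the "JSON has no class" point concrete, here is a small sketch comparing a JSON round trip with PHP's own serialization. The Reading class and its values are invented purely for the example.

```php
<?php
// Hypothetical class used only to illustrate the round trip.
class Reading
{
    public $value;
    public $unit;

    public function __construct($value, $unit)
    {
        $this->value = $value;
        $this->unit = $unit;
    }
}

$reading = new Reading(120, 'mmHg');

// JSON keeps the data but forgets the class.
$json = json_encode($reading);   // {"value":120,"unit":"mmHg"}
$fromJson = json_decode($json);  // a generic stdClass object

// PHP serialization keeps the class name. Restricting allowed_classes
// limits what unserialize() may instantiate, which matters for the
// security problem mentioned above.
$native = serialize($reading);
$fromNative = unserialize($native, ['allowed_classes' => ['Reading']]);

var_dump($fromJson instanceof Reading);   // bool(false)
var_dump($fromNative instanceof Reading); // bool(true)
```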

I've programmed a solution to this problem. It's a simple class that converts string to int | float | bool | null | string. The idea is that any value that is not a relativistic value (e.g. an array, something that simply holds other values) is represented by a single string. The implications are broad, so I'll do my best to simplify.
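For readers who want a feel for what such a conversion involves, here is a minimal sketch. It is not the code from the linked repository; the function name parseScalar and the exact rules (case-insensitive "true", "false" and "null", integers tried before floats) are assumptions made for this example.

```php
<?php
// Hypothetical helper: interpret a string as null, bool, int, float,
// or fall back to the original string.
function parseScalar(string $s)
{
    $trimmed = trim($s);
    $lower = strtolower($trimmed);

    if ($lower === 'null') {
        return null;
    }
    if ($lower === 'true') {
        return true;
    }
    if ($lower === 'false') {
        return false;
    }

    // filter_var() returns false when the string is not a valid number.
    $int = filter_var($trimmed, FILTER_VALIDATE_INT);
    if ($int !== false) {
        return $int;
    }
    $float = filter_var($trimmed, FILTER_VALIDATE_FLOAT);
    if ($float !== false) {
        return $float;
    }

    return $s; // anything else stays a string
}

var_dump(parseScalar('42'));    // int(42)
var_dump(parseScalar('3.5'));   // float(3.5)
var_dump(parseScalar('true'));  // bool(true)
var_dump(parseScalar('hello')); // string(5) "hello"
```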
Imagine you're making a website, which is basically (and in fact) made of webpages. If a webpage consists of inputs (typically GET and POST form data), and those inputs are string (GET and POST elements are string), all that stands between us and raw php is interpretation of said string.
Or think of it this way: if you model the total potential of php in string, it may not be ultimately how you do things but it works, right here right now. What THAT means is that we can implement it immediately.
The rest of it is left blank, as that is what I mean by "relativistic".
Now, just to cap it all off: if you think about what this implies in form, in the actual PHP code itself, everything comes down to exactly one string per "non-relativistic" value.
So basically what you have is a bunch of PHP. The idea is designed to be semantically AND syntactically as simple and functional as possible (or, at least, completely factorialized). So basically we have one way to represent any potential data in PHP.
Anyways, you can find it here: https://github.com/cinder-brent/Cinder
Cheers:)
-- edit --
Lo and behold, I moved the project. It is now at https://github.com/cinder-brent/Leaf

Using VARCHAR in MySQL for everything! (on small or micro sites)

I tried searching for this as I felt it would be a commonly asked beginner's question, but I could only find things that nearly answered it.
We have a small PHP app that is at most used by 5 people (total, ever) and maybe 2 simultaneously, so scalability isn't a concern.
However, I still like to do things in a best practice manner, otherwise bad habits form into permanent bad habits and spill into code you write that faces more than just 5 people.
Given this context, my question is: is there any strong reason to use anything other than VARCHAR(250+) in MySQL for a small PHP app that is constantly evolving/changing? If I picked INT but that later needed to include characters, it would be annoying to have to go back and change it when I could have just future-proofed it and made it a VARCHAR to begin with. In other words, choosing anything other than VARCHAR with a large character count seems pointlessly limiting for a small app. Is this correct?
Thanks for reading and possibly answering!

If you have the numbers 1 through 12 in VARCHAR, and you need them in numerical order, you get 1,10,11,12,2,3,4,5,6,7,8,9. Is that OK? Well, you could fix it in SQL by saying ORDER BY col+0. Do you like that kludge?
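The same ordering problem can be reproduced in PHP without touching the database. This sketch assumes the values arrive as strings, as they would from a VARCHAR column; SORT_STRING is used to mimic a text comparison, because PHP's default sort would compare numeric strings as numbers.

```php
<?php
// The numbers 1 through 12 as strings.
$values = array_map('strval', range(1, 12));

$asText = $values;
sort($asText, SORT_STRING);     // compare as text, character by character

$asNumbers = $values;
sort($asNumbers, SORT_NUMERIC); // compare as numbers

echo implode(',', $asText), "\n";    // 1,10,11,12,2,3,4,5,6,7,8,9
echo implode(',', $asNumbers), "\n"; // 1,2,3,4,5,6,7,8,9,10,11,12
```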

One of the major drawbacks will be that you will have to add consistency checks in your code. For a small, private database, no problem. But for larger projects...
Using the proper types will do a lot of checks automatically. E.g., are there any wrong characters in the value; is the date valid...
As a bonus, it is easy to add extra constraints when using the right types: is the age less than 110; is the start date earlier than the end date; does the value refer to an existing row in another table?
I prefer to make the types as specific as possible. Although server errors can be nasty and hard to debug, it is way better than having a database that is not consistent.

Probably not a great idea to make a habit out of it, as it will become inefficient with any real amount of data. If you use the text type, the amount of storage space used for the same amount of data will differ depending on your storage engine.
If you do as you suggested, don't forget that all values that would normally be of a numeric type will arrive in PHP as strings. For example, if you store the value "123" as a varchar or text type and retrieve it as $someVar, it is safest to do:
$someVar = intval($someVar);
in PHP before working with it. PHP will convert a numeric string on the fly for plain arithmetic, but strict comparisons (===), strictly typed function arguments and json_encode will all still treat "123" as a string.
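A short sketch of where the string/integer difference actually shows up. The variable names are only for illustration.

```php
<?php
$someVar = '123'; // as it would come back from a VARCHAR column

// Plain arithmetic converts the numeric string automatically.
$sum = $someVar + 1; // int(124)

// Strict comparison and JSON output still see a string.
$strictBefore = ($someVar === 123);           // false
$jsonBefore = json_encode(['n' => $someVar]); // {"n":"123"}

// After an explicit conversion the value behaves as an integer everywhere.
$someVar = intval($someVar);
$strictAfter = ($someVar === 123);            // true
$jsonAfter = json_encode(['n' => $someVar]);  // {"n":123}
```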

As you may already know, VARCHAR columns are variable-length strings. The advantage is that a VARCHAR value only takes as much storage as its content needs, plus a small length prefix.
VARCHAR is stored inline with the table, which makes it faster when the size is reasonable.
If your app needs performance, you can go with CHAR, which can be a little faster than VARCHAR for fixed-length values.

fuzzy searching an array in php

After searching, I found out how to do a fuzzy search on a single string, but I have an array of strings
$search = array("a" => "laptop", "b" => "screen", ...);
that I retrieved from a MySQL database.
Is there any PHP class or function that does fuzzy searching on an array of words, or at least a link with some useful information?
I saw a comment that recommended using PostgreSQL and its fuzzy searching capability, but the company already has a MySQL database.
Are there any recommendations?

You could do this in MySQL since you already have a MySQL database. See "How do I do a fuzzy match of company names in MYSQL with PHP for auto-complete?", which mentions the MySQL Double Metaphone implementation and has an implementation in SQL for MySQL 5.0+.
Edit: Sorry for answering here, as there is more than could fit in a comment…
Since you've already accepted an answer using the PHP levenshtein function, I suggest you try that approach first. Software is iterative; the PHP array search may be exactly what you want, but you have to implement it and test it against your requirements first. As I said in your other question, a find-as-you-type solution might be the simplest solution here, which simply narrows the product list as the user types. There might not be a need to implement any fuzzy searching since you are using the user to do the fuzzy search themselves :-)
For example a user starts typing S, a, m which allows you to narrow the products to those beginning with Sam. So you are always only letting the user select a product you already know is valid.
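A find-as-you-type filter of this kind can be sketched in a few lines. The function name filterByPrefix and the product names are made up for the example, and a case-insensitive prefix match is assumed.

```php
<?php
// Hypothetical helper: keep only the products whose name starts with
// what the user has typed so far (case-insensitive).
function filterByPrefix(array $products, string $typed): array
{
    if ($typed === '') {
        return $products; // nothing typed yet, show everything
    }
    return array_values(array_filter($products, function ($name) use ($typed) {
        return stripos($name, $typed) === 0;
    }));
}

$products = ['Samsung TV', 'Sampler', 'Sony Radio', 'Laptop'];
print_r(filterByPrefix($products, 'Sam')); // Samsung TV, Sampler
```

In a real application the narrowing would more likely happen in the query (for example with LIKE 'Sam%'), but the idea is the same.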

Look at the levenshtein function.
Basically it gives you the difference (in terms of cost) between two strings, i.e. what it costs to transform string A into string B.
Set yourself a threshold Levenshtein distance; any two words whose distance is under that threshold are considered similar.
Also the Bitap algorithm is faster since it can be implemented via bitwise operators, but I believe you will have to implement it yourself, unless there is a PHP lib for it somewhere.
EDIT
To use levenshtein method:
The search string is "maptop", and you set your "cost threshold" to, say, 2. That means you want any words that are at most two string transform operations away from your search string.
So you loop through your array $A of strings until
levenshtein($A[$i], $searchString) <= 2
That will be your match. However you may get more than one word that matches, so it is up to you how you want to handle the extra results.
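Put together, the loop could look like the sketch below. The function name fuzzySearch is hypothetical, and returning all matches sorted by distance (rather than stopping at the first one) is just one way of handling the extra results.

```php
<?php
// Hypothetical helper: return every word within $maxDistance edits of
// $needle, keyed as in the input array, closest match first.
function fuzzySearch(array $words, string $needle, int $maxDistance = 2): array
{
    $matches = [];
    foreach ($words as $key => $word) {
        $distance = levenshtein(strtolower($word), strtolower($needle));
        // Older PHP versions return -1 for strings over 255 characters.
        if ($distance >= 0 && $distance <= $maxDistance) {
            $matches[$key] = $distance;
        }
    }
    asort($matches); // smallest distance first, keys preserved
    return $matches;
}

$search = ['a' => 'laptop', 'b' => 'screen', 'c' => 'desktop'];
print_r(fuzzySearch($search, 'maptop')); // only 'a' matches, with distance 1
```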

Associative Array : PHP/C vs Flex/Flash

In PHP an Associative Array keeps its order.
// this will keep its order in PHP
a['kiwis']
a['bananas']
a['potatoes']
a['peaches']
However, in Flex it doesn't, and there is a perfectly valid explanation for that. I really can't remember how C treats this problem, but I am more inclined to believe it works like PHP, as the array has its space pre-reserved in memory and we can just walk the memory. Am I right?
The real question here is why. Why does the C/PHP interpretation of this differ from Flash/Flex, and what is the main reason Adobe made Flash work this way?
Thank you.

C has no built-in associative array; you roll your own as needed, or choose a pre-existing implementation. As such, a given C implementation may be ordered or unordered.
As to why, the reason is that the advantages are different. Ordered allows you (obviously enough) to depend on that ordering. However, it's wasteful when you don't need that ordering.
Different people will consider the advantage of ordering more or less important than the advantage of not ordering.
The greatest flexibility comes from not ordering, because if you also have some sort of ordered structure (list, linked list, vector would all do) then you can easily create an ordered hashmap out of that (not the optimal solution, but it is easy, so you can't complain you didn't have one given to you). This makes it the obvious choice in something intended, from early in its design, to be general purpose.
On the other hand, the disadvantage of ordering is generally only in terms of performance, so it's the obvious choice for something intended to give relatively wide-ranging support with a small number of types for a new developer to learn.
The march of history sometimes makes these decisions optimal and sometimes sub-optimal, in ways that no developer can really plan for.

For PHP arrays: These beasts are unique constructs and are somewhat complicated. An overview is given in a slashdot response from Kendall Hopkins (scroll down to his answer):
Ken: The PHP array is a chained hash table (lookup of O(c) and O(n) on key collisions)
that allows for int and string keys. It uses 2 different hashing algorithms
to fit the two types into the same hash key space. Also each value stored in
the hash is linked to the value stored before it and the value stored after
(linked list). It also has a temporary pointer which is used to hold the
current item so the hash can be iterated.
In C/C++ there is, as has been said, no "associative array" in the core language. C++ has a map (ordered) in the STL, the new standard library will add an unordered one (unordered_map), and some implementations already shipped a hash_map or gnu_hash_map (unordered, and very good imho).
Furthermore, the "order" of elements in an "ordered" C/C++ map is usually not the "insertion order" (as in PHP), it's the "key sort order" or "string hash value sort order".
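The difference between insertion order and key sort order is easy to see from the PHP side. This sketch reuses the keys from the question; the values are arbitrary.

```php
<?php
$a = [];
$a['kiwis'] = 3;
$a['bananas'] = 5;
$a['potatoes'] = 2;
$a['peaches'] = 7;

// A PHP array iterates in the order the keys were inserted.
$insertionOrder = array_keys($a);
echo implode(', ', $insertionOrder), "\n"; // kiwis, bananas, potatoes, peaches

// Sorting by key gives what a key-ordered map would give.
$sorted = $a;
ksort($sorted);
$keyOrder = array_keys($sorted);
echo implode(', ', $keyOrder), "\n"; // bananas, kiwis, peaches, potatoes
```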
To answer your question: your view of equivalence of PHP and C/C++ associative arrays does not hold, in PHP, they made a design decision in order to provide maximum comfort under a single interface (and failed or succeeded, whatever). In C/C++, there are many different implementations (with advantages and tradeoffs) available.
Regards
rbo

Singular Value Decomposition (SVD) in PHP

I would like to implement Singular Value Decomposition (SVD) in PHP. I know that there are several external libraries which could do this for me. But I have two questions concerning PHP, though:
1) Do you think it's possible and/or reasonable to code the SVD in PHP?
2) If (1) is yes: Can you help me to code it in PHP?
I've already coded some parts of SVD by myself. Here's the code, with comments describing the course of action. Some parts of this code aren't completely correct.
It would be great if you could help me. Thank you very much in advance!

SVD-python is a very clear, parsimonious implementation of the SVD.
It's practically pseudocode and should be fairly easy to understand
and compare/draw on for your PHP implementation, even if you don't know much Python.
That said, as others have mentioned, I wouldn't expect to be able to do very heavy-duty LSA with a PHP implementation on what sounds like a pretty limited web host.
Cheers
Edit:
The module above doesn't do anything all by itself, but there is an example included in the
opening comments. Assuming you downloaded the python module, and it was accessible (e.g. in the same folder), you
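could implement a trivial example as follows:
#!/usr/bin/python
# Requires the svd module from SVD-python in the same folder.
import svd
# An 8x5 example matrix, one inner list per row.
a = [[22., 10.,  2.,   3.,  7.],
     [14.,  7., 10.,   0.,  8.],
     [-1., 13., -1., -11.,  3.],
     [-3., -2., 13.,  -2.,  4.],
     [ 9.,  8.,  1.,  -2.,  4.],
     [ 9.,  1., -7.,   5., -1.],
     [ 2., -6.,  6.,   5.,  1.],
     [ 4.,  5.,  0.,  -2.,  2.]]
# Decompose a into u, the singular values w, and vt.
u, w, vt = svd.svd(a)
print(w)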
Here 'w' contains your list of singular values.
Of course this only gets you part of the way to latent semantic analysis and its relatives.
You usually want to reduce the number of singular values, then employ some appropriate distance
metric to measure the similarity between your documents, or words, or documents and words, etc.
The cosine of the angle between your resultant vectors is pretty popular.
Latent Semantic Mapping (pdf)
is by far the clearest, most concise and informative paper I've read on the remaining steps you
need to work out following the SVD.
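Since the question is about PHP, here is a minimal sketch of the cosine measure mentioned above. The function name cosineSimilarity is made up, and the vectors are assumed to be plain arrays of numbers of equal length (for example rows of the reduced matrices).

```php
<?php
// Cosine of the angle between two vectors given as arrays of numbers.
function cosineSimilarity(array $u, array $v): float
{
    $u = array_values($u);
    $v = array_values($v);
    if (count($u) !== count($v)) {
        throw new InvalidArgumentException('Vectors must have the same length.');
    }

    $dot = 0.0;
    $normU = 0.0;
    $normV = 0.0;
    foreach ($u as $i => $x) {
        $dot += $x * $v[$i];
        $normU += $x * $x;
        $normV += $v[$i] * $v[$i];
    }

    // A zero vector has no direction; treat it as dissimilar to everything.
    if ($normU == 0.0 || $normV == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($normU) * sqrt($normV));
}

echo cosineSimilarity([1, 0], [0, 1]), "\n"; // 0 (orthogonal vectors)
```

Values close to 1 mean the two documents (or words) point in nearly the same direction in the reduced space.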
Edit2: Also note that if you're working with very large term-document matrices (I'm assuming this
is what you are doing), it is almost certainly going to be far more efficient to perform the decomposition
in an offline mode, and then perform only the comparisons in a live fashion in response to requests.
While svd-python is great for learning, svdlibc is more what you would want for such heavy
computation.
Finally, as mentioned in the Bellegarda paper above, remember that you don't have to recompute the
SVD every single time you get a new document or request. Depending on what you are trying to do, you could
probably get away with performing the SVD once every week or so, in an offline mode, on a local machine,
and then uploading the results (size/bandwidth concerns notwithstanding).
Anyway, good luck!

Be careful when you say "I don't care what the time limits are". SVD is an O(N^3) operation (or O(MN^2) if it's a rectangular M*N matrix), which means that you could very easily be in a situation where your problem takes a very long time. If the 100*100 case takes one minute, the 1000*1000 case would take 10^3 minutes, or nearly 17 hours (and probably worse, realistically, as you're likely to be out of cache). With something like PHP, the prefactor (the number multiplying the N^3 in order to calculate the required FLOP count) could be very, very large.
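The arithmetic behind that estimate, written out as a small sketch. The function name estimateMinutes is hypothetical, and the model is pure cubic scaling from one measured run, ignoring the cache effects mentioned above.

```php
<?php
// Extrapolate the run time of an O(N^3) algorithm from one measured run.
function estimateMinutes(float $baseMinutes, int $baseN, int $n): float
{
    return $baseMinutes * pow($n / $baseN, 3);
}

$minutes = estimateMinutes(1.0, 100, 1000); // 1 minute at N=100, scaled to N=1000
$hours = $minutes / 60;

echo $minutes, " minutes\n";            // 1000 minutes
echo round($hours, 1), " hours\n";      // 16.7 hours
```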
Having said that, of course it's possible to code it in PHP -- the language has the required data structures and operations.

I know this is an old Q, but here's my 2-bits:
1) A true SVD is much slower than the calculus-inspired approximations used, eg, in the Netflix prize. See: http://www.sifter.org/~simon/journal/20061211.html
There's an implementation (in C) here:
http://www.timelydevelopment.com/demos/NetflixPrize.aspx
2) C would be faster but PHP can certainly do it.
PHP Architect author Cal Evans: "PHP is a web scripting language... [but] I’ve used PHP as a scripting language for writing the DOS equivalent of BATCH files or the Linux equivalent of shell scripts. I’ve found that most of what I need to do can be accomplished from within PHP. There is even a project to allow you to build desktop applications via PHP, the PHP-GTK project."

Regarding question 1: It definitely is possible. Whether it's reasonable depends on your scenario: How big are your matrices? How often do you intend to run the code? Is it run in a web site or from the command line?
If you do care about speed, I would suggest writing a simple extension that wraps calls to the GNU Scientific Library.

Yes, it's possible, but implementing SVD in PHP isn't the optimal approach. As you can see here, PHP is slower than C and also slower than C++, so it may be better to do it in one of these languages and call the program from PHP to get your results. You can find an implementation of the algorithm here, so you can guide yourself through it.
To call the external program from PHP, you can use:
The exec() Function
The system function is quite useful and powerful, but one of the biggest problems with it is that all resulting text from the program goes directly to the output stream. There will be situations where you might like to format the resulting text and display it in some different way, or not display it at all. For those cases, exec() runs the command without printing anything: it returns the last line of output and can collect every output line into an array passed as its second argument.
The system() Function
The system function in PHP takes a string argument with the command to execute as well as any arguments you wish passed to that command. This function executes the specified command, and dumps any resulting text to the output stream (either the HTTP output in a web server situation, or the console if you are running PHP as a command line tool). The return of this function is the last line of output from the program, if it emits text output.
The passthru() Function
One fascinating function that PHP provides similar to those we have seen so far is the passthru function. This function, like the others, executes the program you tell it to. However, it then proceeds to immediately send the raw output from this program to the output stream with which PHP is currently working (i.e. either HTTP in a web server scenario, or the shell in a command line version of PHP).
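As a sketch of how exec() could be used to collect results from an external program: here the shell's echo command stands in for a hypothetical compiled SVD binary that prints its singular values on one line. The command, the output format and the variable names are all assumptions for illustration.

```php
<?php
// 'echo' is a stand-in for a hypothetical external SVD program that
// prints the singular values separated by spaces.
$command = 'echo 3.5 2.1 0.7';

$outputLines = [];
$exitCode = 0;
$lastLine = exec($command, $outputLines, $exitCode);

if ($lastLine === false || $exitCode !== 0) {
    throw new RuntimeException("Command failed with exit code $exitCode");
}

// Turn the captured text into numbers instead of sending it to the browser.
$singularValues = array_map('floatval', preg_split('/\s+/', trim($lastLine)));
print_r($singularValues);
```

If any part of the command comes from user input, it should be passed through escapeshellarg() first.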

Yes, this is perfectly possible to implement in PHP.
I don't know what a reasonable time frame for execution would be, or how large a matrix it could handle.
I would probably have to implement the algorithm to get a rough idea.
Yes I can help you code it. But why do you need help? Doesn't the code you wrote work?
Just as an aside question. What version of PHP do you use?
