Mathematical formula string to php variables and operators - php

I have the following problem. I find it difficult to explain, as I am a hobby coder. Please forgive me if I commit any major faux pas:
I am working on a baseball database that deals with baseball-specific and non-specific metrics/stats. Most of the data output is pretty simple when the metrics are cumulative or when I only want to display the dataset of one day that has been entered manually or imported from a CSV. (All percentages are calculated and fed into the db as numbers.)
For example, if I put in the stats of
{ Hits:1,
Walks:2,
AB:2,
Bavg:0.500 }
for day one, and
{ Hits:2,
Walks:2,
AB:6,
Bavg:0.333 }
for day two, and then try to get the totals, Hits, Walks and ABs are simple: SUM. But Bavg has to be a formula (Hits/AB). Other (non-baseball-specific) metrics, like vertical jump or 60-yard times, are pretty straightforward too: MAX or MIN.
The user should be able to add his own metrics. So he has to be able to input a formula for the calculation. This calculation is stored in the database table as a column next to the metric name, id, type (cumulative, max, min, calculated).
The PHP script that produces the HTML table is dynamic: it handles whichever metrics, and however many of them, the query returns (the metrics can be part of several categories).
In the end result, I want to replace the values of all metrics of the calculated type with the result of their formula.
My approach is to get the formula from the MySQL table as a string. Then, in PHP, convert a string such as Strikes/Pitches*100 into $strikes/$pitches*100, assuming that is something I could put into a PHP SQL query. However, before it is put into the $strikes/$pitches*100 format, I need to have those variables available and defined. That I'm sure I can do, but I'll cross that bridge when I get there.
Could you point me in the right direction of either how to accomplish that or tell where or what to search for? I'm sure this has been done before somewhere...
I highly appreciate any help!
Clemens

The correct solution has already been given by Vilx-. So I will give you a not-so-correct, dirty solution.
As the correct solution states, eval is evil. But, it is also easy and powerful (as evil often is -- but I'll spare you my "hhh join the Dark Side, Luke hhh" spiel).
And since what you need to do is a very small and simple subset of SQL, you actually can use eval() - or even better its SQL equivalent, plugging user supplied code into a SQL query - as long as you do it safely; and with small requirements, this is possible.
(In the general case it absolutely is not. So keep it in mind - this solution is easy, quick, but does not scale. If the program grows beyond a certain complexity, you'll have to adopt Vilx-'s solution anyway).
You can verify the user-supplied string to ensure that, while it might not be syntactically or logically correct, at least it won't execute arbitrary code.
This is okay:
SELECT SUM(pitch)+AVG(runs)-5*(MIN(balls)) /* or whatever */
and this, while wrong, is harmless too:
SELECT SUM(pitch +
but this absolutely is not (mandatory XKCD reference):
SELECT "Robert'); DROP TABLE Students;--
and this is even worse, since the above would not work on a standard MySQL (that doesn't allow multiple statements by default), while this would:
SELECT SLEEP(3600)
So how do we tell the harmless from the harmful? We start by defining placeholder variables that you can use in your formula. Let us say they will always be in the form {name}. So, get those - which we know to be safe - out of the formula:
$verify = preg_replace('#{[a-z]+}#', '', $formula);
Then, arithmetic operators are also removed; they are safe too.
$verify = preg_replace('#[+*/-]#', '', $verify);
Then numbers and things that look like numbers:
$verify = preg_replace('#[0-9.]+#', '', $verify);
Finally a certain number of functions you trust. The arguments of these functions may have been variables or combinations of variables, and therefore they've been deleted and the function has now no arguments - say, SUM() - or you had nested functions or nested parentheses, like SUM (SUM( ()())).
You keep replacing () (with any spaces inside) with a single space until the replacement no longer finds anything:
for ($old = ''; $old !== $verify; $verify = preg_replace('#\\s*\\(\\s*\\)\\s*#', ' ', $verify)) {
$old = $verify;
}
Now you remove from the result the occurrences of any function you trust, as an entire word:
for ($old = ''; $old !== $verify; $verify = preg_replace('#\\b(SUM|AVG|MIN|MAX)\\b#', ' ', $verify)) {
$old = $verify;
}
The last two steps have to be merged because you might have both nested parentheses and functions, interfering with one another:
for ($old = ''; $old !== $verify; $verify = preg_replace('#\\s*(\\b(SUM|AVG|MIN|MAX)\\b|\\(\\s*\\))\\s*#', ' ', $verify)) {
$old = $verify;
}
And at this point, if you are left with nothing but whitespace, it means the original string was harmless (at worst it could have triggered a division by 0, or a SQL exception if it was syntactically wrong). If instead you're left with something else, the formula is rejected and never saved in the database.
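Put together, the verification steps might look like the sketch below. The function name isHarmlessFormula is hypothetical, the list of trusted functions is the same SUM/AVG/MIN/MAX used above, and the final trim() is there because the replacements leave spaces behind. Treat it as an illustration of the idea rather than a vetted security filter.

```php
<?php
// Hypothetical helper: true when only whitelisted tokens were present.
function isHarmlessFormula(string $formula): bool
{
    // 1. Strip {name} placeholders, 2. arithmetic operators, 3. numbers.
    $verify = preg_replace('#\{[a-z]+\}#', '', $formula);
    $verify = preg_replace('#[+*/-]#', '', $verify);
    $verify = preg_replace('#[0-9.]+#', '', $verify);

    // 4. Repeatedly strip trusted function names and empty parentheses
    //    until the string stops changing.
    do {
        $old = $verify;
        $verify = preg_replace(
            '#\s*(\b(SUM|AVG|MIN|MAX)\b|\(\s*\))\s*#',
            ' ',
            $verify
        );
    } while ($old !== $verify);

    // Anything left over (besides whitespace) was not on the whitelist.
    return trim($verify) === '';
}

var_dump(isHarmlessFormula('SUM({pitch})+AVG({runs})-5*(MIN({balls}))'));
var_dump(isHarmlessFormula('SLEEP(3600)'));
```

The first call passes because every token is a placeholder, operator, number, or trusted function; the second fails because SLEEP survives the stripping.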
When you have a valid formula, you can replace variables using preg_replace_callback() so that they become numbers (or names of columns). You're left with what is either valid, innocuous SQL code, or incorrect SQL code. You can plug this directly into the query, after wrapping it in try/catch to intercept any PDOException or division by zero.
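As a rough sketch of that substitution step, the snippet below maps each {name} placeholder to a backtick-quoted column name. The function name expandPlaceholders and the $columns whitelist are assumptions made for illustration; unknown placeholders are rejected instead of being passed through.

```php
<?php
// Hypothetical helper: turn "{strikes}/{pitches}*100" into SQL column references.
function expandPlaceholders(string $formula, array $columns): string
{
    return preg_replace_callback(
        '#\{([a-z]+)\}#',
        function (array $match) use ($columns): string {
            $name = $match[1];
            if (!isset($columns[$name])) {
                throw new InvalidArgumentException("Unknown metric: $name");
            }
            // Only names from our own whitelist ever reach the query.
            return '`' . $columns[$name] . '`';
        },
        $formula
    );
}

// Assumed mapping from placeholder name to real column name.
$columns = ['strikes' => 'strikes', 'pitches' => 'pitches'];
echo expandPlaceholders('{strikes}/{pitches}*100', $columns), "\n";
```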

I'll assume that the requirement is indeed to allow the user to enter arbitrary formulas. As noted in the comments, this is indeed no small task, so if you can settle for something less, I'd advise doing so. But, assuming that nothing less will do, let's see what can be done.
The simplest idea is, of course, to use PHP's eval() function. You can execute arbitrary PHP code from a string. All you need to do is to set up all necessary variables beforehand and grab the return value.
It does have drawbacks however. The biggest one is security. You're essentially executing arbitrary user-supplied code on your server. And it can do ANYTHING that your own code can. Access files, use your database connections, change variables, whatever. Unless you completely trust your users, this is a security disaster.
Also syntax or runtime errors can throw off the rest of your script. And eval() is pretty slow too, since it has to parse the code every time. Maybe not a big deal in your particular case, but worth keeping an eye on.
All in all, in every language that has an eval() function, it is almost universally considered evil and to be avoided at all costs.
So what's the alternative? Well, a dedicated formula parser/executor would be nice. I've written one a few times, but it's far from trivial. The job is easier if the formula is written in Polish notation or Reverse Polish Notation, but those are a pain to write unless you've practiced. For normal formulas, take a look at the Shunting Yard Algorithm. It's straightforward enough and can be easily adapted to functions and whatnot. But it's still fairly tedious.
So, unless you want to do it as a fun challenge, look for a library that has already done it. There seem to be a bunch of them out there. Search for something along the lines of "arithmetic expression parser library php".
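To give a feel for what the Shunting Yard approach involves, here is a minimal sketch. The function name evaluateFormula and the way variables are passed in are assumptions; it handles only numbers, named variables, the four basic operators and parentheses, and it does not support unary minus or functions.

```php
<?php
// Evaluate an infix formula (e.g. "Strikes / Pitches * 100") without eval().
function evaluateFormula(string $formula, array $vars): float
{
    $precedence = ['+' => 1, '-' => 1, '*' => 2, '/' => 2];

    // Tokenise; reject the input if any character was not consumed.
    preg_match_all('#\d+(?:\.\d+)?|[A-Za-z_]\w*|[-+*/()]#', $formula, $matches);
    $tokens = $matches[0];
    if (implode('', $tokens) !== preg_replace('#\s+#', '', $formula)) {
        throw new InvalidArgumentException('Unexpected character in formula');
    }

    // Shunting yard: reorder the tokens into Reverse Polish Notation.
    $output = [];
    $ops = [];
    foreach ($tokens as $tok) {
        if (is_numeric($tok)) {
            $output[] = (float) $tok;
        } elseif (isset($precedence[$tok])) {
            // Left-associative: pop operators of equal or higher precedence.
            while ($ops && isset($precedence[end($ops)])
                && $precedence[end($ops)] >= $precedence[$tok]) {
                $output[] = array_pop($ops);
            }
            $ops[] = $tok;
        } elseif ($tok === '(') {
            $ops[] = $tok;
        } elseif ($tok === ')') {
            while ($ops && end($ops) !== '(') {
                $output[] = array_pop($ops);
            }
            if (!$ops) {
                throw new InvalidArgumentException('Unbalanced parentheses');
            }
            array_pop($ops); // discard the matching "("
        } elseif (array_key_exists($tok, $vars)) {
            $output[] = (float) $vars[$tok];
        } else {
            throw new InvalidArgumentException("Unknown variable: $tok");
        }
    }
    while ($ops) {
        $op = array_pop($ops);
        if ($op === '(') {
            throw new InvalidArgumentException('Unbalanced parentheses');
        }
        $output[] = $op;
    }

    // Evaluate the RPN sequence with a value stack.
    $stack = [];
    foreach ($output as $tok) {
        if (is_float($tok)) {
            $stack[] = $tok;
            continue;
        }
        if (count($stack) < 2) {
            throw new InvalidArgumentException('Malformed formula');
        }
        $right = array_pop($stack);
        $left = array_pop($stack);
        if ($tok === '/' && $right == 0.0) {
            throw new DivisionByZeroError('Division by zero in formula');
        }
        switch ($tok) {
            case '+': $stack[] = $left + $right; break;
            case '-': $stack[] = $left - $right; break;
            case '*': $stack[] = $left * $right; break;
            case '/': $stack[] = $left / $right; break;
        }
    }
    if (count($stack) !== 1) {
        throw new InvalidArgumentException('Malformed formula');
    }
    return $stack[0];
}

var_dump(evaluateFormula('Hits/AB', ['Hits' => 3, 'AB' => 8]));
```

Because only whitelisted token types are ever interpreted, a formula can fail to parse but cannot run arbitrary code.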

Related

In PHP, does modifying a string copy it or update it in place?

I am going over a script making as many optimizations as possible, micro-optimizations even, but fortunately this question doesn't revolve around the necessity of such methods, more an understanding of what PHP is doing.
$sql = rtrim($sql, ',');
When running this line, what I would like to know is whether internally the value returned is a new string (i.e. a modified copy) or the same value in memory, but updated.
If the line looked like this:
$sql2 = rtrim($sql1, ',');
Then I wouldn't be asking, however because it is a modification of the same variable, I am wondering if PHP overwrites it with a modified copy, or updates the same value in memory.
For performance reasons, I need to run the same operations over a million times in as short a time as possible, which is why I am really obsessing over every tiny detail.
This question isn't just for the example above, but for string manipulation in general.
Answering your specific Q: strings are stored in internal structures called ZVALs, and ZVALs are copied lazily; that is, making a copy just references the same ZVAL and bumps its reference count. Updating the string decrements the reference count on the old ZVAL (and garbage-collects the string when the count reaches zero), and a new ZVAL is created pointing to the new value.
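The lazy copy can be observed from userland, at least roughly, by watching memory_get_usage(). The sketch below is only an illustration: the function name measureCopyOnWrite is made up, and exact byte counts vary between PHP versions, so it reports deltas rather than promising specific numbers.

```php
<?php
// Returns how many bytes each step allocated, as seen by memory_get_usage().
function measureCopyOnWrite(): array
{
    $a = str_repeat('x', 1000000);   // roughly 1 MB of string data

    $before = memory_get_usage();
    $b = $a;                         // lazy copy: both names share one buffer
    $afterCopy = memory_get_usage();

    $b .= ',';                       // first modification forces a real copy
    $afterWrite = memory_get_usage();

    return [
        'copy'  => $afterCopy - $before,
        'write' => $afterWrite - $afterCopy,
    ];
}

print_r(measureCopyOnWrite());
```

The assignment should cost next to nothing, while the first write to the copy should cost about as much as the string itself.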
Now to the general misconception underpinning this Q:
For performance reasons, I need to run the same operations over a million times in as short a time as possible, which is why I am really obsessing over every tiny detail.
A bubble sort is O(N²). A clever bubble sort is still O(N²). A simple change to the algorithm can get you down to O(N log N). Moral: algorithmic optimisations deliver big dividends; micro-optimisations rarely do so and are usually counterproductive, as they can create unmaintainable code.
In the case of SQL optimisation, replacing a loop of statements with a single statement (using a correctly indexed join) can give you an order of magnitude saving in runtime.
Replacing a PHP for loop with an array function call can do likewise.
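As a small illustration of that last point, here is the kind of loop that typically ends in rtrim($sql, ','), next to the built-in array function that does the same job. The function names are made up for the example, and it assumes the values are non-empty and do not themselves end in a comma.

```php
<?php
// Loop version: append each value plus a comma, then trim the trailing comma.
function joinWithLoop(array $values): string
{
    $sql = '';
    foreach ($values as $value) {
        $sql .= $value . ',';
    }
    return rtrim($sql, ',');
}

// Array-function version: one built-in call, no trailing comma to remove.
function joinWithImplode(array $values): string
{
    return implode(',', $values);
}

var_dump(joinWithLoop([1, 2, 3]) === joinWithImplode([1, 2, 3]));
```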

What is the best strategy to compare two paragraphs in PHP & MySQL?

I have already developed typing software, using PHP & MySQL, to capture text typed by candidates in my institute. I am now stuck on a strategic issue: how should I compare the similarity of the text typed by the candidates with the standard paragraph I gave them to type (as a hard copy, though the same text is also stored in the MySQL database)? My dilemma is whether to run the Levenshtein distance algorithm in PHP or directly in MySQL, so that performance is optimized. I am also afraid that an implementation in PHP might evaluate the texts incorrectly. It is worth mentioning that the texts are compared in order to rank candidates on the basis of words typed per minute.
The simplest solution would be to use PHP's built-in levenshtein function to compare the two blocks of text. If you wanted to push the processing off to the MySQL database, you could implement the solution listed in the Stack Overflow question "Levenshtein: MySQL + PHP".
Another PHP option might be the similar_text function.
The unfortunate drawback of the PHP levenshtein function is that it cannot handle strings longer than 255 characters. As per the PHP manual:
This function returns the Levenshtein-Distance between the two
argument strings or -1, if one of the argument strings is longer than
the limit of 255 characters.
So, if your paragraphs are longer than that, you may be forced to implement a MySQL solution. I suppose you could break the paragraphs up into 255-character blocks for comparison (though I can't say definitively that this won't "break" the Levenshtein algorithm).
I'm not an expert in linguistics parsing and processing, so I can't speak to whether these are the best solutions (as you mention in your question). They are, however, very straightforward and simple to implement.
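A minimal sketch of the PHP route could look like this. The helper name typingAccuracy and the scoring formula (one minus distance over the longer length) are assumptions chosen for illustration, not something prescribed by the manual. Both levenshtein() and strlen() work on bytes, so multi-byte text would need extra care.

```php
<?php
// Hypothetical scoring helper: 1.0 means identical, lower means more edits.
function typingAccuracy(string $expected, string $typed): float
{
    $longest = max(strlen($expected), strlen($typed));
    if ($longest === 0) {
        return 1.0; // two empty strings are identical
    }

    $distance = levenshtein($expected, $typed);
    if ($distance < 0) {
        // Where the 255-character limit applies, levenshtein() returns -1.
        throw new LengthException('Input too long for levenshtein()');
    }

    return 1 - $distance / $longest;
}

echo levenshtein('kitten', 'sitting'), "\n";
echo typingAccuracy('kitten', 'sitting'), "\n";
```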

What's faster: MySQL LEFT(*,100) or PHP substr()?

I am building a simple list of the last 10 updated pages from the database. For each record I need to display the name and a shortened/truncated description that is stored as TEXT. For some pages the description can be over 10,000 characters.
Which is better for speed and performance? Or a better way to go about this? I use both Zend and Smarty.
MySQL
SELECT id, name, LEFT(description, 100) FROM pages ORDER BY page_modified DESC LIMIT 10;
PHP
function ShortenText($text) {
    // Change to the number of characters you want to display
    $chars = 100;
    $text = $text." ";
    $text = substr($text,0,$chars);
    $text = substr($text,0,strrpos($text,' '));
    $text = $text."...";
    return $text;
}
Because your question was specifically "faster", not "better", I can say for sure that performing the calculation in the DB is actually faster. "Better" is a much different question, and depending on the use case, @Graydot's suggestion might be better in some cases.
The notion of having the application server marshal data when it doesn't need to is inconsistent with the idea of specialization. Databases are specialized in retrieving data and performing massive calculations on data; that's what they do best. Application servers are meant to orchestrate the flow between persistence, business logic and user interface.
Would you use sum() in a SQL statement or would you retrieve all the values into your app server, then loop and add them up? ABSOLUTELY, performing the sum in the DB is faster... keep in mind the application server is actually a client to the database. If you pull back all that data to the application server for crunching, you are sending bytes of data across the network (or even just across RAM segments) that don't need to be moved... and that all flows via database drivers so there are lots of little code thingies touching and moving the data along.
BUT there is also the question of "better", which is problem-specific. If you have requirements about needing the row-level data, or client-side filtering and re-summing (or letting the user specify how many left characters they want to see in the result set), then it might make sense to do it in the app server so you don't have to keep going back to the database.
You asked specifically "faster" and the answer is "database", but "overall faster" might mean something else and "overall better" something else entirely. As usual, truth is fuzzy and the answer to just about everything is "It depends".
hth
Jon
LEFT in the database.
Less data sent back to the client (far less in this case, a max of 1k vs 100k text)
It's trivial compared to the actual table access, ORDER BY etc
It also doesn't break any rules such as "format in the client": it's simply common sense
Edit: looks like we have a religious war brewing.
If the question asked for complex string manipulation or formatting or non-aggregate calculations then I'd say use php. This is none of these cases.
One thing you can't optimise is the network compared to db+client code.
I agree with gbn, but if you're looking to integrate the ... suffix, you can try:
SELECT id,
       name,
       CASE WHEN CHAR_LENGTH(description) > 25 THEN
           CONCAT(LEFT(description, 25), '...')
       ELSE
           description
       END AS short_description
FROM pages
ORDER BY page_modified DESC
LIMIT 10;
Where 25 is the number of characters the preview text should have. (Note this won't cut at a word boundary, whereas your PHP function cuts at the last space.)
My POV (which may be wrong!) is that PHP is used to parse the stuff from the server, send it to the DB, and then present it to the client. I prefer to use stored procedures in the database - because it is easy to know what queries are going to be executed and to ensure that the business logic is adhered to.
I just think that having these definite lines is a good idea.
Forgot to mention - The database knows more about the structure and nature of the data than a PHP script.
General Rule-of-Thumb:
Keep substring functions out of the WHERE clause because of the scalar nature of having to compare several columns in WHERE clauses.
Use substring functions on columns because there is a significant bottleneck between the database server and the database client.

Is there any performance difference between using " = " and " LIKE "?

To start I would like to mention that I tried Googling this to no avail.
I would like the option of using wildcards in all my columns. Therefore I would like all my SELECT statements to use LIKE instead of =. Let me point out that there is NO user input data in my application, which rules out any concern of injection attacks.
Is there any speed difference between the two if the rest of the query remains identical? (That is, if the right-hand side of the condition contains no wildcards.)
There is no difference.
WHERE firstname LIKE 'Fred'
is not perceptibly different from
WHERE firstname = 'Fred'
So you're free to use "LIKE" rather than "=" in all your cases where you
want the presence of a wildcard character to control whether or not to
invoke the wildcard search.
I can't find the reference but I've seen this mentioned more than once, and it
makes sense. The index strategy will be equivalent either way (it can only match
the same characters on the same indexes) and I have sometimes written queries this
way because presence or absence of a wildcard in a particular invocation is acceptable.
Also, I've never seen a case where someone tried to parse out the presence of a wild
character and invoke the SQL differently based on the circumstances. It would perhaps
be risky to do so because it would be easier to write an inefficient (unSARGable) query with an expression in the wrong place.
Any decent DBMS should detect a non-wildcard string in the like and treat it exactly the same as an =. But even this check will take some time, however minuscule.
As stated, the time taken for this would be minimal and would only happen once per query. The sort of performance problems you really need to watch out for are things that incur a cost per row, such as select to_lower(column_name). In other words, you probably needn't concern yourself with your particular case.
If you had used wildcards, then it would almost certainly be slower, simply because you'd have to check partial columns. A clause like like 'xyz%' wouldn't be too much slower but wildcards anywhere other than at the end of the string would cause more serious problems.
But, if you were using wildcards, you wouldn't have an option - like would be the only possibility.
Bottom line: unless your DBMS is brain-dead, the difference between = and like for non-wildcard strings will be insignificant.
But, as with all database optimisations: measure, don't guess!
I do remain confused by one aspect of your question though. You state:
Let me point out that there is NO user input data in my application.
which I assume is meant to assure us that SQL injection attacks are not possible.
But because of that, surely you know in advance (in the code) whether the query will be a wildcard or non-wildcard one. In which case, why wouldn't you just use the = variant where appropriate and remove all doubt?
And if, as you state in comments, there are no wildcard queries, why would you even consider using like?
Yes, although how much depends.
If the column is indexed, then it can be quite a bit slower. Especially if you compare to something like '%suffix' because the index can't be used at all when the percent sign appears at the start of the search string.
What @Jonathan Wood said. When you ask MySQL, for example, to find values using the wildcard '%' the server has to read every single row in the entire table looking for matches.

Formula calculation mechanism for web

There is a group of simple formulas for calculating some values.
I need to implement this for the web interface (I make this in PHP).
To store formulas I am using a simple format like this: "X1+X2+X3". When I need to make calculations, I call preg_replace in a loop to replace X1, X2, ... with real data (entered by the user or saved earlier; it is not important).
Then I call eval('$calculation_result ='. $trans_formula .';'), where $trans_formula stores the text of the formula with the data substituted.
This mechanism looks primitive and I have a feeling that I'm trying to re-invent the wheel. Perhaps there are some ready-made algorithms, techniques or methods to accomplish this? Not necessarily PHP code. I’ll appreciate even a simple algorithm description.
The first thought that hit me: eval is bad!
How I would approach this problem:
1. I would store the formulae in postfix (Reverse Polish notation).
2. Then I'd write a simple program to evaluate the expression. It's fairly easy to write a postfix evaluator.
This approach will also allow you to check things like value data types and range constraints, if need be. It also eliminates the huge risk of eval.
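For illustration, a postfix evaluator along those lines might look like the sketch below. It assumes the stored formula is space-separated, so "X1+X2+X3" would be stored as "X1 X2 + X3 +"; the function name evaluatePostfix and the $values array are hypothetical.

```php
<?php
// Evaluate a space-separated postfix expression such as "X1 X2 + X3 +".
function evaluatePostfix(string $expression, array $values): float
{
    $stack = [];
    foreach (preg_split('#\s+#', trim($expression)) as $token) {
        if (is_numeric($token)) {
            $stack[] = (float) $token;            // literal number
        } elseif (array_key_exists($token, $values)) {
            $stack[] = (float) $values[$token];   // named variable
        } elseif (in_array($token, ['+', '-', '*', '/'], true)) {
            if (count($stack) < 2) {
                throw new InvalidArgumentException('Not enough operands');
            }
            $right = array_pop($stack);
            $left = array_pop($stack);
            if ($token === '/' && $right == 0.0) {
                throw new DivisionByZeroError('Division by zero');
            }
            switch ($token) {
                case '+': $stack[] = $left + $right; break;
                case '-': $stack[] = $left - $right; break;
                case '*': $stack[] = $left * $right; break;
                case '/': $stack[] = $left / $right; break;
            }
        } else {
            throw new InvalidArgumentException("Unknown token: $token");
        }
    }
    if (count($stack) !== 1) {
        throw new InvalidArgumentException('Malformed expression');
    }
    return $stack[0];
}

var_dump(evaluatePostfix('X1 X2 + X3 +', ['X1' => 1, 'X2' => 2, 'X3' => 3]));
```

The point where a variable is pushed is also a natural place to add the type and range checks mentioned above.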
Cheers!
EDIT in response to your comment to the question:
If your users will be entering their own expressions, you will want to convert them to postfix too. Check out infix to postfix conversion.
Take a look at the evalMath class on PHPClasses.
If the formulas are predetermined, as I gather from your question, it is useless (worse, it is dangerous) to use eval to evaluate them.
Create simple functions and call them, passing the appropriate parameters (after input checking).
For example, your example will be:
<?php
function sumOfThree($x1, $x2, $x3) {
return $x1+$x2+$x3;
}
// and you can call it as usual:
$calculation_result = sumOfThree($first, $second, $third);
?>
You will gain a lot in:
speed: eval is very slow to execute (even for easy functions);
debugging: you can debug your function (and get correct error messages);
security: eval is easily exploitable.
