parse search string

parse search string - php

I have search strings, similar to the one bellow:
energy food "olympics 2010" Terrorism OR "government" OR cups NOT transport
and I need to parse it with PHP5 to detect if the content belongs to any of the following clusters:
AllWords array
AnyWords array
NotWords array
These are the rules i have set:
If it has OR before or after the word or quoted words if belongs to
AnyWord.
If it has a NOT before word or quoted words it belongs to NotWords
If it has 0 or more more spaces before the word or quoted phrase it
belongs to AllWords.
So the end result should be something similar to:
AllWords: (energy, food, "olympics 2010")
AnyWords: (terrorism, "government", cups)
NotWords: (Transport)
What would be a good way to do this?

If you want to do this with Regex, be aware that your parsing will break on stupid user input (the user, not the input =) ).
I'd try the following Regexes.
NotWords:
(?<=NOT\s)\b((?!NOT|OR)\w+|"[^"]+")\b
AllWords:
(?<!OR\s)\b((?!NOT|OR)\w+|"[^"]+")\b(?!\s+OR)
AnyWords:
Well.. the rest. =) They are not that easy to spot, since I do not know how to put "OR behind it or OR in front of it" into regex. Maybe you could join the results from the three regexes
(?<=OR\s)\b((?!NOT|OR)\w+|"[^"]+")\b(?!\s+OR)
(?<=OR\s)\b((?!NOT|OR)\w+|"[^"]+")\b(?=\s+OR)
(?<!OR\s)\b((?!NOT|OR)\w+|"[^"]+")\b(?=\s+OR)
Problems: These require exactly one space between modifier words and expressions. PHP only supports lookbehinds for fixes length expressions, so I see no way around that, sorry. You could just use \b(\w+|"[^"]+")\b to split the input, and parse the resulting array manually.

This is an excellent example of how an test-first driven approach can help you arrive at a solution. It might not be the very best one, but having tests written allow you to refactor with confidence and instantly see if you break any of the existing tests. Anyway, you could set up a few tests like:
public function setUp () {
$this->searchParser = new App_Search_Parser();
}
public function testSingleWordParsesToAllWords () {
$this->searchParser->parse('Transport');
$this->assertEquals(
$this->searchParser->getAllWords(),
array('Transport')
);
$this->assertEquals($this->searchParser->getNotWords(), array());
$this->assertEquals($this->searchParser->getAnyWords());
}
public function testParseOfCombinedSearchString () {
$query = 'energy food "olympics 2010" Terrorism ' .
'OR "government" OR cups NOT transport';
$this->searchParser->parse($query);
$this->assertEquals(
$this->searchParser->getAllWords(),
array('energy', 'food', 'olympics 2010')
);
$this->assertEquals(
$this->searchParser->getNotWords(),
array('Transport')
);
$this->assertEquals(
$this->searchParser->getAnyWords(),
array( 'terrorism', 'government', 'cups')
);
}
Other good tests would include:
testParseTwoWords
testParseTwoWordsWithOr
testParseSimpleWithNot
testParseInvalid
Here you have to decide what invalid input looks like and how you interpret it, i.e:
'NOT Transport': Search for anything that doesn't contain Transport or inform the user that he has to include at least one search term too?
'OR energy': Is it ok to begin with a combinator?
'food OR NOT energy': Does this mean "search for food or anything that doesn't contain energy", or does it mean "search for food and not energy", or doesn't it mean anything? (i.e. throw exception, return false or whatnot)
testParseEmpty
Then, write the tests one by one, and write a simple solution that passes the test. Then refactor and make it right, and run again to see that you still pass the test.
Once a test passes and the code is refactored, then write the next test and repeat the procedure. Add more tests as you find special cases and refactor the code so that it passes all tests. If you break a test, back-up and re-write the code (not the test!) such that it passes.
As for how you can solve this problem, look into preg_match, strtok or rely simply loop through the string adding up tokens as you go.

Related

How to detect if a string contains PHP code? PHP

I am keeping record of every request made to my website. I am very aware of the security measurements that need to be taken before executing any MySQL query that contains data coming from query strings. I clean it as much as possible from injections and so far all tests have been successful using:
htmlspecialchars, strip_tags, mysqli_real_escape_string.
But on the logs of pages visited I find query strings of failed hack attempts that contain a lot of php code:
?1=%40ini_set%28"display_errors"%2C"0"%29%3B%40set_time_limit%280%29%3B%40set_magic_quotes_runtime%280%29%3Becho%20%27->%7C%27%3Bfile_put_contents%28%24_SERVER%5B%27DOCUMENT_ROOT%27%5D.%27/webconfig.txt.php%27%2Cbase64_decode%28%27PD9waHAgZXZhb
In the previous example we can see:
display_errors, set_time_limit, set_magic_quotes_runtime, file_put_contents
Another example:
/?s=/index/%5Cthink%5Capp/invokefunction&function=call_user_func_array&vars[0]=file_put_contents&vars[1][]=ctlpy.php&vars[1][]=<?php #assert($_REQUEST["ysy"]);?>ysydjsjxbei37$
This one is worst, there is even some <?php and $_REQUEST["ysy"] stuff in there. Although I am able to sanitize it, strip tags and encode < or > when I decode the string I can see the type of requests that are being sent.
Is there any way to detect a string that contains php code like:
filter_var($var, FILTER_SANITIZE_PHP);
FYI: This is not a real function, I am trying to give an idea of what I am looking for.
or some sort of function:
function findCode($var){
return ($var contains PHP) ? true : false
}
Again, not real
No need to sanitize, that has been taken care of, just to detect PHP code in a string. I need this because I want to detect them and save them in other logs.
NOTE: NEVER EXECUTE OR EVAL CODE COMING FROM QUERY STRINGS

After reading lots of comments #KIKO Software came up with an ingenious idea by using PHP tokenizer, but it ended up being extremely difficult because the string that is to be analyzed needed to have almost prefect syntax or it would fail.
So the best solution that I came up with is a simple function that tries to find commonly used PHP statements, In my case, especially on query strings with code injection. Another advantage of this solution is that we can modify and add to the list as many PHP statements as we want. Keep in mind that making the list bigger will considerably slow down your script. this functions uses strpos instead of preg_match (regex ) as its proven to perform faster.
This will not find 100% PHP code inside a string, but you can customize it to find as much as is required, never include terms that could be used in regular English, like 'echo' or 'if'
function findInStr($string, $findarray){
$found=false;
for($i=0;$i<sizeof($findarray);$i++){
$res=strpos($string,$findarray[$i]);
if($res !== false){
$found=true;
break;
}
}
return $found;
}
Simply use:
$search_line=array(
'file_put_contents',
'<?=',
'<?php',
'?>',
'eval(',
'$_REQUEST',
'$_POST',
'$_GET',
'$_SESSION',
'$_SERVER',
'exec(',
'shell_exec(',
'invokefunction',
'call_user_func_array',
'display_errors',
'ini_set',
'set_time_limit',
'set_magic_quotes_runtime',
'DOCUMENT_ROOT',
'include(',
'include_once(',
'require(',
'require_once(',
'base64_decode',
'file_get_contents',
'sizeof',
'array('
);
if(findInStr("this has some <?php echo 'PHP CODE' ?>",$search_line)){
echo "PHP found";
}

Functional Programming - Return Transformed array and the count of the array without calculating twice

I'm trying to write more functional code in PHP without any helper libraries.
I need to return some JSON that includes the results of a transformed array AND the count of that array (for convenience on the data consumer end). Since you're not supposed to use variables in FP, I'm stumped on how to get the count of the array without recalculating/remapping the array.
Here's an example of what my code currently looks like:
$duplicates = array_filter( get_results(), 'find_duplicates' );
send_json( array(
"duplicates" => $duplicates,
"numDuplicates" => count( $duplicates )
) );
How can I do the same without storing the results of the filter in a temporary variable to avoid running array_filter() twice?

But first, acknowledge the following...
"Since you're not supposed to use variables in FP..." – that's a ludicrous understanding of functional programming. Variables are used constantly in functional programs. I'm guessing you saw point-free functional programs and then imagined that every program can be expressed in such a way...
the receiver of the JSON could easily get the number of duplicates using JSON.parse(json).duplicates.length because every Array in JavaScript has a length property – it's arguably silly to attach a numDuplicates in the first place. Anyway, let's assume your consumer has a specific API that requires the numDuplicates field...
functional programming is concerned with things like function purity – maybe you've simplified your code in your post (which is bad; don't do that) or that is in fact your actual code. In such a case, get_results() and send_json functions are impure; send_json has an obvious (but unknown) side effect (the return value is not used) — You ask for a functional solution but you have other outstanding non-functional code... so...
There's nothing wrong with the code you have. Sometimes removing a point (variable, or argument), it hurts the readability of the code. In your case, this code is perfectly legible. It is at this point that I feel you're only trying to shorten the code or make it more clever. Your intention is to improve it, but I think you'd actually harm it in this case.
What if I told you...
a variable assignment can be replaced with a lambda? 0_0
(function ($duplicates) {
send_json([
'duplicates' => $duplicates,
'numDuplicates' => count($duplicates)
});
}) (array_filter(get_results(), 'find_duplicates'));
But that made the code longer.. and there's added abstraction which hurts readability T_T In this case, using a normal variable assignment (as in your original code) would've been much better
Combinators
OK, so what if you had some combinators at your disposal to massage the data into the desired shape?
function apply (...$xs) {
return function ($f) use ($xs) {
return call_user_func($f, ...$xs);
};
}
function identity ($x) { return $x; }
// hey look, mom! no points!
send_json(
array_combine(
['duplicates', 'numDuplicates'],
array_map(
apply(
array_filter(get_results(), 'find_duplicates')),
['identity', 'count'])));
Did we achieve anything other than writing the weirdest PHP you or anyone else has probably seen? Not to mention, the input is strangely nested in the middle of the expression...
remarks
I'm nearly certain that you'll be disappointed with this answer (or disagree with me), but I'm also pretty confident that you're not sure what you're looking for. A guess: you saw functional programming that "doesn't use variables" and assumed that's how all programs can and should be written; but that's just not the case. Sometimes using a variable or two can dramatically improve the readability of a given expression.
Anyway, all of this is truly beside the point because attaching numDuplicates is arguably an anti-pattern in JSON anyway (point #2 above).

How can I use php function mb_convert_case and convert only certain words to upper?

I want to pass this input
$string = "AL & JRL buSineSS CENTRE is the best";
Expected Result:
AL & JRL Business Centre Is The Best
I have tried the code below but it converts everything.
mb_convert_case($string, MB_CASE_TITLE, "UTF-8");

So I take it you just want potential acronyms to be ignored, correct? Well, there are a few thoughts. First, you could make a script that ignores anything with 3 or less letters. That's not a great solution, in my opinion. What about "it", "the", etc.? The second is using a dictionary of known words to run ucwords() on. Yuck - that'd be incredibly taxing for such a seemingly simple task!
I'd recommend simply ignoring anything that is all-caps. This way, no matter what the acronym is (or the length), it'll ignore it. Something like this may suffice:
$phrase = "Hello this is a TeSt pHrAse, to be tested ASAP. Thanks.";
$chunks = explode(" ", $phrase);
$result = "";
foreach($chunks as $chunk){
if(!ctype_upper($chunk)) {
$result .= ucwords($chunk) . " ";
} else {
$result .= $chunk . " ";
}
}
$result = rtrim($result);
Result:
Hello This Is A Test Phrase, To Be Tested ASAP. Thanks.
This isn't the most elegant solution, this is just something I've kind of thought about since reading your question. However, if you know your acronyms will be capitalized, this will skip them entirely and only title-case your actual words.
Caveats
The example provided above will not work with an acronym joined to a word by a dash, underscore, etc. This only works on spacing. You can easily tweak the above to your needs, and make it a little more intelligent. However, I wanted to be very clear that this may not fulfill all needs!
Also, this example will come up short in your example phrase. Unfortunately, unless you use a dictionary or count string lengths, this is the closest you'll get. This solution is minimal work for a great deal of functionality. Of course, a dictionary with comparisons would work great (either a dictionary of acronyms or words, either way) - but even then it would be very difficult to keep up to date. Names will throw off a dictionary of words safe to change to title-case. Less commonly used acronyms surely won't be in a dictionary of acronyms. There are endless caveats to all solutions unfortunately. Choose what's best for you.
Hope this helps. If you have any further questions please comment and I'll try the best I can to help.
Randomness
One last thing. I used ucwords(). Feel free to use whatever you want. I'm sure you already know the difference, but check this out:
Best function for Title capitlization?
Always good to know exactly what tool is best for the job. Again, I'm sure you know your own needs and I'm sure you chose the right tool. Just thought it was an interesting read that could help anyone stumbling upon this.
Final Thoughts
You could use a combination of the above examples to custom tailor your own solution. Often it's very satisfactory to combine methods, thus reducing the downsides of each method.
Hope this helps, best of luck to you!

Converting pseudocode into usable (using regular expression?)

As part of the system I am writing, users can create their own custom Rules, to be run when certain events happen.
There are a set number of Objects they can use to create these rules, all of which have a set number of properties and methods:
So as an example of a rule, we could say:
“if this unit award is ‘Distinction’ then set all the criteria on this unit to award ‘Achieved’”
IF UNIT.award equals “Distinction”
THEN UNIT.criteria.set_award(‘A’)
“else if this unit award is ‘Merit’ then set the award of any criteria on this unit whose name starts with either ‘P’ or ‘M’ to ‘Achieved’”
IF UNIT.award equals “Merit”
THEN UNIT.criteria.filter(‘starts’, ‘name’, ‘P’, ‘M’).set_award(‘A’)
“else if this unit award is ‘Pass then set the award of any criteria on this unit whose name starts with ‘P’ to ‘Achieved’”
IF UNIT.award equals “Merit”
THEN UNIT.criteria.filter(‘starts’, ‘name’, ‘P’).set_award(‘A’)
The problem I am having, is I am just not sure how to take that string of object, properties & methods, e.g. “UNIT.criteria.filter(‘starts’, ‘name’, ‘P’).set_award(‘A’)” and convert it into something usable.
The end result I’d like to convert the string to would be something along the lines of:
So I can then convert that into the actual proper objects and return the relevant values or run the relevant methods.
Since there is only a set number of things I need to support (for now at least) and I don’t need anything complex like calculation support or variables, it seems overkill to create a Lexer system, so I was thinking of just using a regular expression to split all the sections.
So using the examples above, I could do a simple split on the “.” character, but if that character is used in a method parameter, e.g. “CRITERION.filter(‘is’, ‘name’, ‘P.1’)” then that screws it up completely.
I could use a less common character to split them, for example a double colon or something “::” but if for whatever reason someone puts that into a parameter it will still cause the same problem. I’ve tried creating a regular expression that splits on the character, only if it’s not between quotes, but I haven’t been able to get it to work.
So basically my question is: would a regular expression be the best way to do this? (If so, could anyone help me with getting it to ignore the specified character if it’s in a method). Or is there another way I could do this that would be easier/better?
Thanks.

I'd think an ORM language like eloquent could do this for you.
But if I had to do this then first I'd split the IF THEN ELSE parts.
Leaving:
UNIT.award equals “Distinction”
UNIT.criteria.filter(‘starts’, ‘name’, ‘P’, ‘M’).set_award(‘A’)
I'm guessing the "equals" could also be "not equals" or "greater" so...
I'd split the first bit around that.
/(?'ident'[a-z.]*?) (?'expression'equals|greater) (?'compare'[0-9a-z\“\”]+)/gi
But an explode around 'equals' will do the same.
Then I'd explode the second part around the dots.
Giving:
UNIT
criteria
filter(a,b,c,d)
set_ward(e)
Pop off the first 2 to get object and property and then a list of possible filters and actions.
But frankly I'd would develop a language that would not mix properties with actions and filters.
Something like:
IF object.prop EQUALS const|var
THEN UPDATE object.prop
WITH const|var [WHERE object.prop filter const|var [AND|OR const|var]]
Eloquent does it straight in php:
DB::table('users')
->where('id', 1)
->update(['votes' => 1]);
So maybe I'd do something like:
THEN object.prop->filter(a,b,c,d)->set('award','A')
This makes it easy to split actions around -> and properties around .
Anyway...
I do my Regex on https://regex101.com/
Hope this helps.

PHP regex parsing - splitting tokens in my own language. Is there a better way?

I am creating my own language.
The goal is to "compile" it to PHP or Javascript, and, ultimately, to interpret and run it on the same language, to make it look like a "middle-level" language.
Right now, I'm focusing on the aspect of interpreting it in PHP and run it.
At the moment, I'm using regex to split the string and extract the multiple tokens.
This is the regex I have:
/\:((?:cons#(?:\d+(?:\.\d+)?|(?:"(?:(?:\\\\)+"|[^"]|(?:\r\n|\r|\n))*")))|(?:[a-z]+(?:#[a-z]+)?|\^?[\~\&](?:[a-z]+|\d+|\-1)))/g
This is quite hard to read and maintain, even though it works.
Is there a better way of doing this?
Here is an example of the code for my language:
:define:&0:factorial
:param:~0:static
:case
:lower#equal:cons#1
:case:end
:scope
:return:cons#1
:scope:end
:scope
:define:~0:static
:define:~1:static
:require:static
:call:static#sub:^~0:~1 :store:~0
:call:&-1:~0 :store:~1
:call:static#sum:^~0:~1 :store:~0
:return:~0
:scope:end
:define:end
This defines a recursive function to calculate the factorial (not so well written, that isn't important).
The goal is to get what is after the :, including the #. :static#sub is a whole token, saving it without the :.
Everything is the same, except for the token :cons, which can take a value after. The value is a numerical value (integer or float, called static or dynamic in the language, respectively) or a string, which must start and end with ", supporting escaping like \". Multi-line strings aren't supported.
Variables are the ones with ~0, using ^ before will get the value to the above :scope.
Functions are similar, being used &0 instead and &-1 points to the current function (no need for ^&-1 here).
Said this, Is there a better way to get the tokens?
Here you can see it in action: http://regex101.com/r/nF7oF9/2

[Update] To issue the pattern being complicated and maintainability, you can split it using PCRE_EXTENDED, and comments:
preg_match('/
# read constant (?)
\:((?:cons#(?:\d+(?:\.\d+)?|
# read a string (?)
(?:"(?:(?:\\\\)+"|[^"]|(?:\r\n|\r|\n))*")))|
# read an identifier (?)
(?:[a-z]+(?:#[a-z]+)?|
# read whatever
\^?[\~\&](?:[a-z]+|\d+|\-1)))
/gx
', $input)
Beware that all space are ignored, except under certain conditions (\n is normally "safe").
Now, if you want to pimp you lexer and parser, then read that:
What does (f)lex [GNU equivalent of LEX] is simply let you pass a list of regexp, and eventually a "group". You can also try ANTLR and PHP Target Runtime to get the work done.
As for you request, I've made a lexer in the past, following the principle of FLEX. The idea is to cycle through the regexp like FLEX does:
$regexp = [reg1 => STRING, reg2 => ID, reg3 => WS];
$input = ...;
$tokens = [];
while ($input) {
$best = null;
$k = null;
for ($regexp as $re => $kind) {
if (preg_match($re, $input, $match)) {
$best = $match[0];
$k = $kind;
break;
}
}
if (null === $best) {
throw new Exception("could not analyze input, invalid token");
}
$tokens[] = ['kind' => $kind, 'value' => $best];
$input = substr($input, strlen($best)); // move.
}
Since FLEX and Yacc/Bison integrates, the usual pattern is to read until next token (that is, they don't do a loop that read all input before parsing).
The $regexp array can be anything, I expected it to be a "regexp" => "kind" key/value, but you can also an array like that:
$regexp = [['reg' => '...', 'kind' => STRING], ...]
You can also enable/disable regexp using groups (like FLEX groups works): for example, consider the following code:
class Foobar {
const FOOBAR = "arg";
function x() {...}
}
There is no need to activate the string regexp until you need to read an expression (here, the expression is what come after the "="). And there is no need to activate the class identifier when you are actually in a class.
FLEX's group permits to read comments, using a first regexp, activating some group that would ignore other regexp, until some matches is done (like "*/").
Note that this approach is a naïve approach: a lexer like FLEX will actually generate an automaton, which use different state to represent your need (the regexp is itself an automaton).
This use an algorithm of packed indexes or something alike (I used the naïve "for each" because I did not understand the algorithm enough) which is memory and speed efficient.
As I said, it was something I made in the past - something like 6/7 years ago.
It was on Windows.
It was not particularly quick (well it is O(N²) because of the two loops).
I think also that PHP was compiling the regexp each times. Now that I do Java, I use the Pattern implementation which compile the regexp once, and let you reuse it. I don't know PHP does the same by first looking into a regexp cache if there was already a compiled regexp.
I was using preg_match with an offset, to avoid doing the substr($input, ...) at the end.
You should try to use the ANTLR3 PHP Code Generation Target, since the ANTLR grammar editor is pretty easy to use, and you will have a really more readable/maintainable code :)

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.