Regular Expression to Isolate a String of Characters in the proper context - php

So, I have a dashboard which I'm currently writing (PHP). The idea is that it is supposed to display data in a database relative to a given url specified. If the user wishes to just grab everything, they simply need to specify "all". If they wish to scrape data for specific URLs AND display everything at once, they will specify additional URLs with the "all" directive.
I discovered a bug, however.
If I have a URL which has the characters "all" in it (such as, say, http://everythingallatonce.com <-- that's just an example - I have no idea if that actually exists), the dashboard's parsing algorithm which takes the instruction given won't work properly. In fact, according to this logic, it will think that the user specified a given URL as well AS the words "all", without actually checking off the "perform scrape?" checkbox, which makes no sense at all (hence, it just throws an exception/dies with an error message).
So far, I just have a function like the following:
function _strExists( $needle, $haystack )
{
    $pos = strpos( $haystack, $needle );
    return ( $pos !== false );
}
I use it to check whether the word "all" exists in the query, like so:
$fetchEverything = _strExists('all', $urls);
What would be a good work around for something like this, to avoid ambiguity between URLs specified which have "all" in them, and the actual query of all by itself? I'm thinking regular expressions, but I'm not sure...
Also, I have considered just using *, but I'd like to avoid that if possible.

If some value for all is being passed in the URL (e.g. all=1), then you should check the $_GET superglobal for its existence (i.e. $_GET['all']).
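If the directive has to stay inside the same string as the URLs, comparing whole entries instead of searching for a substring avoids the false positive. The following is only a minimal sketch: it assumes $urls is a single string whose entries are separated by whitespace or commas (the question does not say), and containsAllDirective() is a hypothetical name.

```php
<?php
// Hypothetical helper: treats $urls as a whitespace- or comma-separated list
// (an assumption) and reports whether the bare keyword "all" is one of the entries.
function containsAllDirective(string $urls): bool
{
    // Split on runs of whitespace and commas; drop empty pieces.
    $tokens = preg_split('/[\s,]+/', trim($urls), -1, PREG_SPLIT_NO_EMPTY);

    // Compare whole tokens, case-insensitively, so a URL that merely
    // contains the letters "all" does not count.
    foreach ($tokens as $token) {
        if (strcasecmp($token, 'all') === 0) {
            return true;
        }
    }
    return false;
}

var_dump(containsAllDirective('all http://example.com'));         // bool(true)
var_dump(containsAllDirective('http://everythingallatonce.com')); // bool(false)
```

Because the comparison is per entry, no regular expression with word boundaries is needed; a \b-based pattern would still match "all" in a host name such as all.example.com.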

Related

PHP regex parsing - splitting tokens in my own language. Is there a better way?

I am creating my own language.
The goal is to "compile" it to PHP or JavaScript and, ultimately, to interpret and run it in the same language, to make it look like a "middle-level" language.
Right now, I'm focusing on interpreting it in PHP and running it.
At the moment, I'm using a regex to split the string and extract the individual tokens.
This is the regex I have:
/\:((?:cons#(?:\d+(?:\.\d+)?|(?:"(?:(?:\\\\)+"|[^"]|(?:\r\n|\r|\n))*")))|(?:[a-z]+(?:#[a-z]+)?|\^?[\~\&](?:[a-z]+|\d+|\-1)))/g
This is quite hard to read and maintain, even though it works.
Is there a better way of doing this?
Here is an example of the code for my language:
:define:&0:factorial
:param:~0:static
:case
:lower#equal:cons#1
:case:end
:scope
:return:cons#1
:scope:end
:scope
:define:~0:static
:define:~1:static
:require:static
:call:static#sub:^~0:~1 :store:~0
:call:&-1:~0 :store:~1
:call:static#sum:^~0:~1 :store:~0
:return:~0
:scope:end
:define:end
This defines a recursive function to calculate the factorial (not so well written, that isn't important).
The goal is to get what is after the :, including the #. :static#sub is a whole token, saving it without the :.
Everything is the same, except for the token :cons, which can take a value after. The value is a numerical value (integer or float, called static or dynamic in the language, respectively) or a string, which must start and end with ", supporting escaping like \". Multi-line strings aren't supported.
Variables are the ones with ~0, using ^ before will get the value to the above :scope.
Functions are similar, using &0 instead, and &-1 points to the current function (no need for ^&-1 here).
That said, is there a better way to get the tokens?
Here you can see it in action: http://regex101.com/r/nF7oF9/2
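[Update] To address the pattern's complexity and maintainability, you can split it across several lines using the PCRE_EXTENDED (x) modifier and add comments:
preg_match_all('/
# read constant (?)
\:((?:cons\#(?:\d+(?:\.\d+)?|
# read a string (?)
(?:"(?:(?:\\\\)+"|[^"]|(?:\r\n|\r|\n))*")))|
# read an identifier (?)
(?:[a-z]+(?:\#[a-z]+)?|
# read whatever
\^?[\~\&](?:[a-z]+|\d+|\-1)))
/x', $input, $matches);
Note that PHP has no g modifier; use preg_match_all to collect every match. Beware that in extended mode all whitespace in the pattern is ignored, except in certain places such as inside a character class (escape sequences like \n are "safe"). An unescaped # outside a character class also starts a comment, so a literal # has to be written as \#.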
Now, if you want to pimp your lexer and parser, read on:
What (f)lex [the GNU equivalent of LEX] does is simply let you pass a list of regexps and, optionally, a "group". You can also try ANTLR and its PHP Target Runtime to get the work done.
As for your request, I've made a lexer in the past, following the principle of FLEX. The idea is to cycle through the regexps like FLEX does:
$regexp = [reg1 => STRING, reg2 => ID, reg3 => WS]; // each pattern must be anchored with ^
$input = ...;
$tokens = [];
while ($input !== '') { // a plain while ($input) would stop early on the string "0"
    $best = null;
    $k = null;
    foreach ($regexp as $re => $kind) {
        // ignore empty matches, otherwise the loop would never advance
        if (preg_match($re, $input, $match) && $match[0] !== '') {
            $best = $match[0];
            $k = $kind;
            break;
        }
    }
    if (null === $best) {
        throw new Exception("could not analyze input, invalid token");
    }
    $tokens[] = ['kind' => $k, 'value' => $best];
    $input = substr($input, strlen($best)); // move.
}
Since FLEX and Yacc/Bison integrate, the usual pattern is to read only until the next token (that is, they don't loop over the whole input before parsing).
The $regexp array can be anything. I expected it to be a "regexp" => "kind" key/value map, but you can also use an array like this:
$regexp = [['reg' => '...', 'kind' => STRING], ...]
You can also enable or disable regexps using groups (the way FLEX groups work). For example, consider the following code:
class Foobar {
    const FOOBAR = "arg";
    function x() {...}
}
There is no need to activate the string regexp until you need to read an expression (here, the expression is what comes after the "="). And there is no need to activate the class identifier regexp when you are already inside a class.
FLEX's groups make it possible to read comments: a first regexp activates a group that ignores the other regexps until some match is made (like "*/").
Note that this is a naïve approach: a lexer like FLEX actually generates an automaton, which uses different states to represent your needs (a regexp is itself an automaton).
This uses an algorithm of packed indexes or something similar (I used the naïve "for each" because I did not understand the algorithm well enough), which is memory and speed efficient.
As I said, it was something I made in the past - something like 6/7 years ago.
It was on Windows.
It was not particularly quick (well, it is O(N²) because of the two loops).
I also think that PHP was compiling the regexp each time. Now that I do Java, I use the Pattern implementation, which compiles the regexp once and lets you reuse it. I don't know whether PHP does the same by first looking in a regexp cache for an already compiled regexp.
I was using preg_match with an offset, to avoid doing the substr($input, ...) at the end.
You should try the ANTLR3 PHP Code Generation Target, since the ANTLR grammar editor is pretty easy to use, and you will end up with much more readable and maintainable code :)
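The offset-based variant mentioned above might look roughly like the sketch below. It is not the answerer's original code: the token kinds and patterns in $rules are made up for a toy input, and tokenize() is a hypothetical name. Each pattern starts with \G, which anchors the match at the offset passed to preg_match (a leading ^ would not do that, since ^ still refers to the start of the whole string).

```php
<?php
// Hypothetical token table for a toy input; kinds and patterns are illustrative only.
// \G anchors each pattern at the offset passed to preg_match().
$rules = [
    'WS'     => '/\G\s+/',
    'NUMBER' => '/\G\d+(?:\.\d+)?/',
    'ID'     => '/\G[a-z]+/',
];

function tokenize(string $input, array $rules): array
{
    $tokens = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        $matched = false;
        foreach ($rules as $kind => $pattern) {
            // The fifth argument tells PCRE where to start, so no substr() copy is needed.
            if (preg_match($pattern, $input, $m, 0, $offset) === 1 && $m[0] !== '') {
                $tokens[] = ['kind' => $kind, 'value' => $m[0]];
                $offset += strlen($m[0]);
                $matched = true;
                break; // first matching rule wins, as in the loop above
            }
        }
        if (!$matched) {
            throw new Exception("could not analyze input at offset $offset");
        }
    }
    return $tokens;
}

print_r(tokenize('abc 42', $rules));
```

For 'abc 42' this yields three tokens: an ID, a WS and a NUMBER. Rule order matters because the first matching rule is taken, not the longest match.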

Is it a bad practice to use a GET parameter (in URL) with no value?

I'm in a little argument with my boss about URLs using GET parameters without value. E.g.
http://www.example.com/?logout
I see this kind of link fairly often on the web, but of course, this doesn't mean it's a good thing. He fears that this is not standard and could lead to unexpected errors, so he'd rather like me to use something like:
http://www.example.com/?logout=yes
In my experience, I've never encountered any problem using empty parameters, and they sometimes make more sense to me (like in this case, where ?logout=no wouldn't make any sense, so the value of "logout" is irrelevant and I would only test for the presence of the parameter server-side, not for its value). (It also looks cleaner.)
However I can't find confirmation that this kind of usage is actually valid and therefore really can't cause any problem ever.
Do you have any link about this?
RFC 2396, "Uniform Resource Identifiers (URI): Generic Syntax", §3.4, "Query Component" is the authoritative source of information on the query string, and states:
The query component is a string of information to be interpreted by
the resource.
[...]
Within a query component, the characters ";", "/", "?", ":", "#",
"&", "=", "+", ",", and "$" are reserved.
RFC 2616, "Hypertext Transfer Protocol -- HTTP/1.1", §3.2.2, "http URL", does not redefine this.
In short, the query string you give ("logout") is perfectly valid.
A value is not required for the key to have an effect. Omitting it doesn't make the URL any less valid either: the URL specification, RFC 1738, does not list a value as a required part of the URL.
If you don't really need a value, it's just a matter of preference.
http://example.com/?logout
Is just as much a valid URL as
http://example.com/?logout=yes
The only difference it makes is that if you want to make sure that the "yes" bit was explicitly set, you can check for its value, like this:
if (isset($_GET['logout']) && $_GET['logout'] == "yes") {
    // Only proceed if the value is explicitly set to yes
}
If you just want to know whether the logout key was set somewhere in the URL, it suffices to list the key with no value assigned to it. You can then check it like this:
if (isset($_GET['logout'])) {
    // Continue regardless of what the value is set to (or if it's left empty)
}
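To see how PHP treats a parameter without a value, the query string can be fed to parse_str(), which, as far as I know, applies the same parsing rules PHP uses to fill $_GET. This is only a small illustration; the variable names $bare and $withValue are made up.

```php
<?php
// Parse the two query strings from the question without needing a web request.
parse_str('logout', $bare);          // as in ?logout
parse_str('logout=yes', $withValue); // as in ?logout=yes

var_dump($bare);                          // the key is present, its value is an empty string
var_dump(isset($bare['logout']));         // bool(true): '' is not null
var_dump(!empty($bare['logout']));        // bool(false): empty() would miss the bare form
var_dump($withValue['logout'] === 'yes'); // bool(true)
```

So a presence test should use isset() (or array_key_exists()), not empty() or a truthiness check, when the bare form is allowed.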
It's perfectly fine, and won't cause any error. Though, nowadays most frameworks are MVC based, so in the URL you need to mention a controller and an action, so it looks more like /users/logout (BTW, also StackOverflow uses that URL to log users out ;).
The statement that it may cause errors to me sounds like your applications manually access the raw $_GET, and I definitely think that building apps without a framework (which usually provides an MVC stack and a router/dispatcher) is the real dangerous thing here.

Understanding 'parse_str' in PHP

I'm a PHP newbie trying to find a way to use parse_str to parse a number of URLs from a database (note: not from the request, they are already stored in a database, don't ask... so $_GET won't work).
So I'm trying this:
$parts = parse_url('http://www.jobrapido.se/?w=teknikinformat%C3%B6r&l=malm%C3%B6&r=auto');
parse_str($parts['query'], $query);
return $query['w'];
Please note that here I am just supplying an example URL, in the real application the URL will be passed in as a parameter from the database. And if I do this it works fine. However, I don't understand how to use this function properly, and how to avoid errors.
First of all, here I used "w" as the index to return, because I could clearly see it was in the query. But how do these things work? Is there a set of specific values I can use to get the entire query string? I mean, if I look further, I can see "l" and "r" here as well...
Sure I could extract those too and concatenate the result, but will these value names be arbitrary, or is there a way to know exactly which ones to extract? Of course there's the "q" value, which I originally thought would be the only one I would need, but apparently not. It's not even in the example URL, although I know it's in lots of others.
So how do I do this? Here's what I want:
1. Extract all parts of the query string that give me a readable output of the search string part of the URL (so in the above it would be "teknikinformatör Malmö auto"). Note that I would need to translate the URL encoding to Swedish characters; is there any easy way to do that in PHP?
2. Handle errors so that if the above doesn't work for some reason, the method should only return an empty string, thus not breaking the code. Because at this point, if I were to use the above with an actual parameter, $url, passed in instead of the example URL, I would get errors, because many of the URLs do not have the "w" parameter, some may be empty fields in the database, some may be malformed, etc. So how can I handle such errors stably, and just return a value if the parsing works, and return empty string otherwise?
There seems to be a very strange problem that occurs that I cannot see during debugging. I put this test code in just to see what is going on:
function getQuery($url)
{
    try
    {
        $parts = parse_url($url);
        parse_str($parts['query'], $query);
        if (isset($query['q'])) {
            /* return $query['q']; */
            return '';
        }
    } catch (Exception $e) {
        return '';
    }
}
Now, obviously in the real code I would want something like the commented out part to be returned. However, the puzzling thing is this:
With this code, as far as I see, every path should lead to returning an empty string. But this does not work - it gives me a completely empty grid in the result page. No errors or anything during debugging, and objects look fine when I step through them during debugging.
However, if I remove everything from this method except return ''; then it works fine - of course the field in the grid where the query is supposed to be is empty, but all the other fields have all the information as they should. So this was just a test. But how is it possible that code that should only be able to return an empty string does not work, while the one that only returns an empty string and does nothing else does work? I'm thoroughly confused...
The meaning of the query parameters is entirely up to the application that handles the URL, so there is no "right" parameter - it might be w, q, or searchquery. You can heuristically search for the most common variables (=guess), or return an array of all arguments. It depends on what you're trying to achieve.
parse_str already decodes the URL encoding. Note that URL encoding is a way to encode bytes, not characters, so the result depends on what character encoding the application expects. Usually (and in this example query), that should be UTF-8 everywhere, so your first point should be covered.
Test whether the value exists, and if not, return the empty string, like this:
$heuristicFields = array('q', 'w', 'searchquery');
foreach ($heuristicFields as $hf) {
    if (isset($query[$hf])) return $query[$hf];
}
return '';
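Your function returns null when the URL is valid but has no q parameter, because no return statement is reached, and it runs into errors (i.e., displays warning messages) when the URL is obviously invalid or has no query string. Those warnings are not exceptions, so the try...catch block has no effect.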
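Putting these points together, a defensive version of the function could look like the sketch below. It is one possible arrangement, not the only one: the list of candidate fields is the same guess as above, passed in as a hypothetical second parameter, and every failure path returns an empty string.

```php
<?php
// Hypothetical rewrite of getQuery(); the default field list is a heuristic guess.
function getQuery($url, array $fields = ['q', 'w', 'searchquery']): string
{
    // Empty database fields and non-string values are rejected up front.
    if (!is_string($url) || $url === '') {
        return '';
    }
    // parse_url() returns false for seriously malformed URLs
    // and null when the URL has no query component.
    $queryString = parse_url($url, PHP_URL_QUERY);
    if (!is_string($queryString)) {
        return '';
    }
    parse_str($queryString, $query); // also decodes the percent-encoding
    foreach ($fields as $field) {
        if (isset($query[$field]) && is_string($query[$field])) {
            return $query[$field];
        }
    }
    return '';
}

// Prints the decoded value of "w" as UTF-8 bytes: teknikinformatör
echo getQuery('http://www.jobrapido.se/?w=teknikinformat%C3%B6r&l=malm%C3%B6&r=auto'), "\n";
```

No try...catch is needed here, because each case that would otherwise raise a warning is checked before it can occur.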
It turned out the problem was with Swedish characters - if I used utf8_encode() on the value before returning it, it worked fine.

Check query string (PHP)

I use a query string, for example test.php?var=1.
How can I check if a user types anything after that, like another string...
I try to redirect to index.php if any other string (query string) follows my var query string.
Is it possible to check this?
For example:
test.php?var=12134 (This is a good link..)
test.php?a=23&var=123 (this is a bad link, redirect to index..)
test.php?var=123132&a=23 (this is a bad link, redirect to index..)
I'm not sure I fully understand what you want, but if you're not interested in the positioning of the parameters this should work:
if ( isset($_GET['var']) && count($_GET) > 1 ) {
    //do something if var and another parameter is given
}
Look in $_SERVER['QUERY_STRING'].
Similar to Tom Haigh’s answer, you could also get the difference of the arguments you expect and those you actually get:
$argKeys = array_keys($_GET);
$additionalArgKeys = array_diff($argKeys, array('var'));
var_dump($additionalArgKeys);
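The array_diff idea can be wrapped in a small check that decides whether to redirect. This is a sketch under assumptions: hasOnlyExpectedParams() is a hypothetical name, and parse_str() stands in for $_GET so the example runs without a web server.

```php
<?php
// Hypothetical helper: true when $params contains exactly the expected keys.
function hasOnlyExpectedParams(array $params, array $expected = ['var']): bool
{
    $keys    = array_keys($params);
    $extra   = array_diff($keys, $expected);   // keys that were sent but not expected
    $missing = array_diff($expected, $keys);   // keys that were expected but not sent
    return $extra === [] && $missing === [];
}

// parse_str() is used here in place of $_GET.
parse_str('var=12134', $good);
parse_str('var=123132&a=23', $bad);

var_dump(hasOnlyExpectedParams($good)); // bool(true)
var_dump(hasOnlyExpectedParams($bad));  // bool(false)

// In test.php the check would be applied to the real request, for example:
// if (!hasOnlyExpectedParams($_GET)) { header('Location: index.php'); exit; }
```

The order of the parameters does not matter with this approach, which matches the examples in the question where both orderings count as bad links.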
test.php?a=23?var=123 (this is a bad link, redirect to index..)
In this case, only one variable is sent, named "a" and containing the value "23?var=123", so it shouldn't be a problem for you.
test.php?var=123132&a=23 (this is a bad link, redirect to index..)
In this case you have two variables sent, ("a" and "var").
In general you can check the $_GET array to see how many variables have been sent and act accordingly, by using count($_GET).
I think you are trying to get rid of unwanted parameters. This is usually done for security reasons.
There won't be a problem, however, if you preinitialize every variable you use and only access request data through $_GET['var'], $_POST['var'] or $_REQUEST['var'].

PHP - securing parameters passed in the URL

I have an application which makes decisions based on part of URL:
if ( isset($this->params['url']['url']) ) {
    $url = $this->params['url']['url'];
    $url = explode('/',$url);
    $id = $this->Provider->getProviderID($url[0]);
    $this->providerName = $url[0]; //set the provider name
    return $id;
}
This happens to be in a Cake app, so $this->params['url'] contains an element of the URL. I then use that element to decide which data to use in the rest of my app. My question is...
What's the best way to secure this input so that people can't pass in anything nasty?
thanks,
Other comments here are correct, in AppController's beforeFilter validate the provider against the providers in your db.
However, if all URLs should be prefixed with a provider string, you are going about extracting it from the URL the wrong way by looking in $this->params['url'].
This kind of problem is exactly what the Router class, and its ability to pass params to an action, is for. Check out the manual page in the cookbook http://book.cakephp.org/view/46/Routes-Configuration. You might try something like:
Router::connect('/:provider/:controller/:action');
You'll also see in the manual the ability to validate the provider param in the route itself by a regex - if you have a small definite list of known providers, you can hard code these in the route regex.
By setting up a route that captures this part of the URL it becomes instantly available in $this->params['provider'], but even better than that is the fact that the html helper link() method automatically builds correctly formatted URLs, e.g.
$html->link('label', array(
    'controller' => 'xxx',
    'action' => 'yyy',
    'provider' => 'zzz'
));
This returns a link like /zzz/xxx/yyy
What are valid provider names? Test if the URL parameter is one, otherwise reject it.
Hopefully you're aware that there is absolutely no way to prevent the user from submitting absolutely anything, including provider names they're not supposed to use.
I'd re-iterate Karsten's comment: define "anything nasty"
What are you expecting the parameter to be? If you're expecting it to be a URL, use a regex to validate URLs. If you're expecting an integer, cast it to an integer. Same goes for a float, boolean, etc.
These PHP functions might be helpful though:
www.php.net/strip_tags
www.php.net/ctype_alpha
The parameter will be a provider name, i.e. an alphanumeric string. I think the answer is basically to use ctype_alnum() (ctype_alpha() would reject digits) in combination with a check that the provider name is a valid one, based on other application logic.
thanks for the replies
Also, if you have a known set of allowable URLs, a good idea is to whitelist those allowed URLs. You could even do that dynamically by having a DB table that contains the allowed URLs -- pull that from the database, make a comparison to the URL parameter passed. Alternatively, you could whitelist patterns (say you have allowed domains that can be passed, but the rest of the url changes... You can whitelist the domain and/ or use regexps to determine validity).
At the very least, make sure you use strip_tags, or the built-in mysql escape sequences (if using PHP5, parameterizing your SQL queries solves these problems).
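The whitelist idea from this thread can be sketched as follows. Everything named here is hypothetical: validProviderName() is not a CakePHP function, and in a real application the list of known providers would be loaded from the providers table rather than hard-coded.

```php
<?php
// Hypothetical validator: returns the provider name if it is acceptable, null otherwise.
function validProviderName($candidate, array $knownProviders): ?string
{
    // ctype_alnum() accepts only non-empty strings made of letters and digits.
    if (!is_string($candidate) || !ctype_alnum($candidate)) {
        return null;
    }
    // Strict comparison against the whitelist; anything else is rejected.
    return in_array($candidate, $knownProviders, true) ? $candidate : null;
}

// Illustrative whitelist; these provider names are made up.
$known = ['acme', 'globex2'];

var_dump(validProviderName('acme', $known));             // string(4) "acme"
var_dump(validProviderName('acme/../etc', $known));      // NULL
var_dump(validProviderName("acme' OR '1'='1", $known));  // NULL
```

The format check is cheap and runs first; the whitelist lookup is what actually decides, so input that looks harmless but names an unknown provider is still rejected.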
It would be more cake-like to use the Sanitize class. In this case Sanitize::escape() or Sanitize::paranoid() seem appropriate.
