PHP glob-style matching

To keep things short, I wrote an access-control system.
One of the requirements of this system is to check whether a canonical/normalized path can be accessed or not by matching it against a pattern.
My first thought was PCRE (the preg_* functions), but the patterns are file-based, i.e. similar to those accepted by glob(). Basically, they are just patterns containing ? (match any single character) or * (match any sequence of characters).
So in simple terms, I need to recreate glob()'s matching functionality in PHP.
Sample code:
function path_matches($path, $pattern){
// ... ?
}
path_matches('path/index.php', 'path/*'); // true
path_matches('path2/', 'path/*'); // false
path_matches('path2/test.php', 'path2/*.php'); // true
A possible solution would be to convert $pattern into a regular expression and then use preg_match(). Is there any other way, though?
NB: The reason I can't accept regular expressions directly is that the patterns will be written by non-programmers.

Use fnmatch(), which seems to do the trick.

Converting to a regex seems like the best solution to me. All you need to do is convert * to .*, ? to . and preg_quote. However it's not as simple as it may seem because it's a bit of a chicken-and-egg problem in terms of the order in which you do things.
I don't like this solution but it's the best I can come up with: use a regex to generate the regex.
function path_matches($path, $pattern, $ignoreCase = FALSE) {
    // Escape every regex metacharacter, except * and ? which become wildcards
    $expr = preg_replace_callback('/[\\\\^$.[\\]|()?*+{}\\-\\/]/', function($matches) {
        switch ($matches[0]) {
            case '*':
                return '.*';
            case '?':
                return '.';
            default:
                return '\\'.$matches[0];
        }
    }, $pattern);
    // Anchor the expression so the pattern has to match the whole path
    $expr = '/^'.$expr.'$/';
    if ($ignoreCase) {
        $expr .= 'i';
    }
    return (bool) preg_match($expr, $path);
}
EDIT Added case-sensitivity option.

There is already a function in PHP, included since PHP 4.3.0.
fnmatch() checks if the passed string would match the given shell wildcard pattern.
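As a rough illustration of the fnmatch() approach, the question's path_matches() could be a thin wrapper like the sketch below. The $stopAtSlash parameter is hypothetical and not part of the original post; it maps to the FNM_PATHNAME flag, with which * and ? do not match a slash. Keep in mind that fnmatch() also understands bracket expressions such as [abc], which goes beyond the ? and * described in the question.

```php
<?php
// Sketch only: wraps fnmatch() in the signature used in the question.
// $stopAtSlash is a hypothetical parameter added for illustration.
function path_matches(string $path, string $pattern, bool $stopAtSlash = false): bool
{
    $flags = $stopAtSlash ? FNM_PATHNAME : 0;
    // Note the argument order: fnmatch() takes the pattern first.
    return fnmatch($pattern, $path, $flags);
}

var_dump(path_matches('path/index.php', 'path/*'));           // bool(true)
var_dump(path_matches('path2/', 'path/*'));                   // bool(false)
var_dump(path_matches('path2/test.php', 'path2/*.php'));      // bool(true)
var_dump(path_matches('path/sub/index.php', 'path/*', true)); // bool(false)
```

Without the flag, the last call would return true, because * is then allowed to match across directory separators.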

From the PHP documentation for glob(). I think preg_match is the best solution anyway.
http://php.net/manual/en/function.glob.php
<?php
function match_wildcard( $wildcard_pattern, $haystack ) {
    $regex = str_replace(
        array("\*", "\?"), // wildcard chars (as escaped by preg_quote)
        array('.*','.'), // regexp chars
        preg_quote($wildcard_pattern, '/') // also escape the "/" delimiter
    );
    return preg_match('/^'.$regex.'$/is', $haystack);
}
$test = "foobar and blob\netc.";
var_dump(
match_wildcard('foo*', $test), // TRUE
match_wildcard('bar*', $test), // FALSE
match_wildcard('*bar*', $test), // TRUE
match_wildcard('**blob**', $test), // TRUE
match_wildcard('*a?d*', $test), // TRUE
match_wildcard('*etc**', $test) // TRUE
);
?>

I think this should work for turning glob-patterns into regex-patterns:
function glob2regex($globPatt) {
return '/'.preg_replace_callback('/./u', function($m) {
switch($m[0]) {
case '*': return '.*';
case '?': return '.';
}
return preg_quote($m[0],'/');
}, $globPatt).'\z/AsS';
}
You might want to use [^\\/]* for * instead if you want to prevent * from matching over directory names.
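A minimal sketch of that variant, assuming "/" is the only directory separator and that paths are plain ASCII (the loop works byte by byte). The name glob_to_regex() is made up for this example.

```php
<?php
// Sketch only: glob_to_regex() is a hypothetical helper name.
function glob_to_regex(string $glob): string
{
    $regex = '';
    $length = strlen($glob);
    for ($i = 0; $i < $length; $i++) {
        $char = $glob[$i];
        if ($char === '*') {
            $regex .= '[^/]*';   // any run of characters except "/"
        } elseif ($char === '?') {
            $regex .= '[^/]';    // exactly one character except "/"
        } else {
            $regex .= preg_quote($char, '#');
        }
    }
    // \A and \z anchor the match to the whole string.
    return '#\A' . $regex . '\z#';
}

var_dump(preg_match(glob_to_regex('path/*'), 'path/index.php'));      // int(1)
var_dump(preg_match(glob_to_regex('path/*'), 'path/sub/index.php'));  // int(0)
var_dump(preg_match(glob_to_regex('path2/*.php'), 'path2/test.php')); // int(1)
```

Using # as the delimiter avoids having to escape the slashes that paths naturally contain.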

Related

check words with preg_match

I have some words separated by | and I have tried to use preg_match to detect whether the list contains a target word or not.
I have used this:
<?php
$c_words = 'od|lom|pod|dyk';
$my_word = 'od'; // only od not pod or other word
if (preg_match('/$my_word/', $c_words))
{
echo 'ok';
}
?>
But it doesn't work correctly.
Please help.

No need for regular expressions. The functions explode($delimiter, $str); and in_array($needle, $haystack); will do everything for you.
// splits words into an array
$array = explode('|', $c_words);
// check if "$my_word" exists in the array.
if(in_array($my_word, $array)) {
// YEP
} else {
// NOPE
}
Apart from that, your regular expression would match other words containing the same sequence too.
preg_match('/my/', 'myword|anotherword'); // true
preg_match('/another/', 'myword|anotherword'); // true
That's exactly why you shouldn't use regular expressions in this case.

Variables are not interpolated inside single-quoted strings, so you need to use either
preg_match("/$my_word/", $c_words);
Or – and I find that cleaner :
preg_match('/' .$my_word. '/', $c_words);
But for something as simple as that I'm not sure I'd use a regex at all; a simple if (strpos($c_words, $my_word) !== false) should be enough, although like the unanchored regex it also matches substrings such as the "od" in "pod".
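If a regex is still wanted, one way to match only whole list entries is to require a | or the string boundary on both sides of the word. This is a sketch under that assumption; list_contains_word() is a hypothetical helper name.

```php
<?php
// Sketch only: list_contains_word() is a hypothetical helper name.
function list_contains_word(string $list, string $word): bool
{
    // preg_quote() neutralises regex metacharacters in $word;
    // (?:^|\|) and (?:\||$) require a "|" or the string boundary on each side.
    $pattern = '/(?:^|\|)' . preg_quote($word, '/') . '(?:\||$)/';
    return preg_match($pattern, $list) === 1;
}

var_dump(list_contains_word('od|lom|pod|dyk', 'od')); // bool(true)
var_dump(list_contains_word('lom|pod|dyk', 'od'));    // bool(false)
```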

You are using preg_match() the wrong way. Since you're using | as a delimiter, the word list itself can serve as the pattern and the word as the subject:
if (preg_match('/^(?:'.$c_words.')$/', $my_word))
{
echo 'ok';
}
Read the documentation for preg_match().

If string contains forward slash

How do I make an if statement that checks whether the string contains a forward slash?
$string = "Test/Test";
if($string .......)
{
mysql_query("");
}
else
{
echo "the value contains an invalid character";
}

You can use strpos(), which finds the forward slash in the string, but you need to compare its result strictly against false. Alternatively, you can use strstr(). It's short and simple code, and gets the job done!
if(strstr($string, '/')){
//....
}
For those who live and die by the manual: when the haystack is very large, or the needle is very small, I measured strstr() to be quicker, despite what the manual says.
Example timings (in seconds):
Using strpos(): 0.00043487548828125
Using strstr(): 0.00023317337036133

if(strpos($string, '/') !== false) {
// string contains /
}
From the PHP manual of strstr:
Note:
If you only want to determine if a particular needle occurs within
haystack, use the faster and less memory intensive function strpos()
instead.

Use strpos()
If it doesn't return false, the character was matched.
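A short sketch of why the comparison has to be strict: strpos() returns the offset of the match, and offset 0 is a valid result that loosely equals false. The helper name contains_slash() is made up for this example.

```php
<?php
// Sketch only: contains_slash() is a hypothetical helper name.
function contains_slash(string $string): bool
{
    // strpos() returns the offset of the first "/" (possibly 0) or false.
    return strpos($string, '/') !== false;
}

var_dump(strpos('/etc', '/'));         // int(0), which is "falsy"
var_dump(contains_slash('/etc'));      // bool(true)
var_dump(contains_slash('Test/Test')); // bool(true)
var_dump(contains_slash('Test'));      // bool(false)
```

On PHP 8.0 and later, str_contains($string, '/') expresses the same check directly.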

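A loose comparison of the strpos() result with false (== or !=) did not work for me: strpos() returns 0 when the slash is the first character, and 0 == false. The strict comparison is what works:
if (strpos($t, '/') === false) {
    echo "No forward slash!";
}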

Finding string and replacing with same case string

I need help while trying to spin articles. I want to find text and replace synonymous text while keeping the case the same.
For example, I have a dictionary like:
hello|hi|howdy|howd'y
I need to find all hello and replace with any one of hi, howdy, or howd'y.
Assume I have a sentence:
Hello, guys! Shouldn't you say hello to me when I say HELLO?
After my operation it will be something like:
hi, guys! Shouldn't you say howd'y to me when I say howdy?
Here, I lost the case. I want to maintain it! It should actually be:
Hi, guys! Shouldn't you say howd'y to me when I say HOWDY?
My dictionary size is about 5000 lines
hello|hi|howdy|howd'y
go|come
salaries|earnings|wages
shouldn't|should not
...

I'd suggest using preg_replace_callback with a callback function that examines the matched word to see if (a) the first letter is not capitalized, or (b) the first letter is the only capitalized letter, or (c) the first letter is not the only capitalized letter, and then replace with the properly modified replacement word as desired.
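A minimal sketch of that idea for a single dictionary entry, assuming the replacement is stored in lowercase and only three casings matter (all caps, capitalized, anything else). The names match_case() and replace_keep_case() are invented here, and picking a random synonym is left out so the result stays predictable.

```php
<?php
// Sketch only: match_case() and replace_keep_case() are hypothetical names.
function match_case(string $found, string $replacement): string
{
    if ($found === strtoupper($found) && $found !== strtolower($found)) {
        return strtoupper($replacement);   // HELLO -> HOWDY
    }
    if (ctype_upper($found[0])) {
        return ucfirst($replacement);      // Hello -> Howdy
    }
    return $replacement;                   // hello -> howdy
}

function replace_keep_case(string $text, string $word, string $replacement): string
{
    // \b keeps the match to whole words; the i flag finds every casing.
    $pattern = '/\b' . preg_quote($word, '/') . '\b/i';
    return preg_replace_callback(
        $pattern,
        function (array $m) use ($replacement): string {
            return match_case($m[0], $replacement);
        },
        $text
    );
}

echo replace_keep_case('Hello, guys! Say hello when I say HELLO?', 'hello', 'howdy'), "\n";
// Howdy, guys! Say howdy when I say HOWDY?
```

For a 5000-line dictionary, it would probably be worth building one alternation per batch of words instead of scanning the text once per entry, but that is beyond this sketch.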

You can find your string and do two tests:
$outputString = 'hi';
// Test for all-uppercase first: an all-uppercase word also equals its own ucfirst()
if ( $foundString == strtoupper($foundString) ) {
    $outputString = strtoupper($outputString);
} else if ( $foundString == ucfirst($foundString) ) {
    $outputString = ucfirst($outputString);
} else {
    // do not modify string's case
}

Here's a solution for retaining the case (upper, lower or capitalized):
// Assumes $replace is already lowercase
function convertCase($find, $replace) {
if (ctype_upper($find) === true)
return strtoupper($replace);
else if (ctype_upper($find[0]) === true)
return ucfirst($replace);
else
return $replace;
}
$find = 'hello';
$replace = 'hi';
// Find the word in all cases that it occurs in
while (($pos = stripos($input, $find)) !== false) {
// Extract the word in its current case
$found = substr($input, $pos, strlen($find));
// Replace all occurrences of this case
$input = str_replace($found, convertCase($found, $replace), $input);
}

You could try the following function. Be aware that it will only work with ASCII strings, as it uses some of the useful properties of ASCII upper and lower case letters. However, it should be extremely fast:
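function preserve_case($old, $new) {
    // Bytes of the mask are 0x20 where $old is lowercase, 0x00 elsewhere
    $mask = strtoupper($old) ^ $old;
    // Extend the mask with its last byte when $new is longer than $old
    $mask .= str_repeat(substr($mask, -1), max(0, strlen($new) - strlen($old)));
    // Trim the mask when $new is shorter, so the result keeps $new's length
    return strtoupper($new) | substr($mask, 0, strlen($new));
}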
echo preserve_case('Upper', 'lowercase');
// Lowercase
echo preserve_case('HELLO', 'howdy');
// HOWDY
echo preserve_case('lower case', 'UPPER CASE');
// upper case
echo preserve_case('HELLO', "howd'y");
// HOWD'Y
This is my PHP version of the clever little perl function:
How do I substitute case insensitively on the LHS while preserving case on the RHS?

"string" != "string"

I'm writing my own template system. I want to change
<title>{site('title')}</title>
into a call to the function "site" with the parameter "title". Here's the code:
private function replaceFunc($subject)
{
foreach($this->func as $t)
{
$args = explode(", ", preg_replace('/\{'.$t.'\(\'([a-zA-Z,]+)\'\)\}/', '$1', $subject));
$subject = preg_replace('/\{'.$t.'\([a-zA-Z,\']+\)\}/', call_user_func_array($t, $args), $subject);
}
return $subject;
}
Here's site:
function site($what)
{
global $db;
$s = $db->askSingle("SELECT * FROM ".DB_PREFIX."config");
switch($what)
{
case 'title':
return 'Title of page';
break;
case 'version':
return $s->version;
break;
case 'themeDir':
return 'lolmao';
break;
default:
return false;
}
}
I've tried to compare $what (which in this case should be "title") with "title". The MD5 hashes are different, strcmp gives -1, and both "==" and "===" return false. What is wrong? ($what is a string. I can't replace call_user_func_array with call_user_func, because later I'll be using multiple arguments.)
Edit:
Strlen $what - strlen title
403 - 5
Heh - looks like I haven't cut the rest ;)
var_dump
string(403) "
title"
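The 403-character value is consistent with preg_replace() returning the whole subject, with only the tag swapped for '$1', rather than just the captured argument. One possible way around that is to let preg_replace_callback() hand over the captured pieces directly. The sketch below assumes a whitelist of callable names; render_template() and demo_site() are hypothetical stand-ins, not the original class or site() function.

```php
<?php
// Sketch only: demo_site() stands in for the question's site() function.
function demo_site(string $what): string
{
    return $what === 'title' ? 'Title of page' : '';
}

// Sketch only: render_template() is a hypothetical helper name.
function render_template(string $subject, array $allowedFunctions): string
{
    if ($allowedFunctions === []) {
        return $subject;
    }
    $quoted = array_map(function (string $name): string {
        return preg_quote($name, '/');
    }, $allowedFunctions);
    // Group 1 is the function name, group 2 the raw argument list.
    $pattern = '/\{(' . implode('|', $quoted) . ')\(\'([a-zA-Z, ]+)\'\)\}/';
    return preg_replace_callback($pattern, function (array $m): string {
        $args = explode(', ', $m[2]);
        return (string) call_user_func_array($m[1], $args);
    }, $subject);
}

echo render_template("<title>{demo_site('title')}</title>", ['demo_site']), "\n";
// <title>Title of page</title>
```

Because the callback only ever sees the captured argument, no surrounding markup or whitespace can leak into $what.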

MD5 are different. Strcmp gives -1,
"==", and "===" return false.
Throw in var_dump() and strlen()
And this function for especially hard cases:
function dump(&$str) {
$i=0;
while (isset($str[$i])) echo strtoupper(dechex(ord($str[$i++])));
}

Have you tried to trim the whitespaces?
$what = trim($what);
Maybe there is a leading or trailing whitespace character. Also make sure both strings use the same case:
$what = strtolower(trim($what)); // trim and lower

Are you sure that there aren't any whitespaces? Use trim() to get rid of them. If the md5s are different the strings are different. var_dump(str_split($what)) will output the string char by char, if a whitespace isn't your problem maybe this helps.

I've tried to compare $what (which is for this case "title") with "title". MD5 are different.
That would suggest that $what is not "title". You should put in some debugging statements in there:
function site($what) {
var_dump($what);
die();
}
Check there's no extra spaces or characters you weren't expecting.

Coalescing regular expressions in PHP

Suppose I have the following two strings containing regular expressions. How do I coalesce them? More specifically, I want to have the two expressions as alternatives.
$a = '# /[a-z] #i';
$b = '/ Moo /x';
$c = preg_magic_coalesce('|', $a, $b);
// Desired result should be equivalent to:
// '/ \/[a-zA-Z] |Moo/'
Of course, doing this as string operations isn't practical because it would involve parsing the expressions, constructing syntax trees, coalescing the trees and then outputting another regular expression equivalent to the tree. I'm completely happy without this last step. Unfortunately, PHP doesn't have a RegExp class (or does it?).
Is there any way to achieve this? Incidentally, does any other language offer a way? Isn't this a pretty normal scenario? Guess not. :-(
Alternatively, is there a way to check efficiently if either of the two expressions matches, and which one matches earlier (and if they match at the same position, which match is longer)? This is what I'm doing at the moment. Unfortunately, I do this on long strings, very often, for more than two patterns. The result is slow (and yes, this is definitely the bottleneck).
EDIT:
I should have been more specific, sorry. $a and $b are variables; their content is outside of my control! Otherwise, I would just coalesce them manually. Therefore, I can't make any assumptions about the delimiters or regex modifiers used. Notice, for example, that my first expression uses the i modifier (ignore casing) while the second uses x (extended syntax). Therefore, I can't just concatenate the two, because the second expression does not ignore casing and the first doesn't use the extended syntax (and any whitespace therein is significant!).

I see that porneL actually described a bunch of this, but this handles most of the problem. It cancels modifiers set in previous sub-expressions (which the other answer missed) and sets modifiers as specified in each sub-expression. It also handles non-slash delimiters (I could not find a specification of what characters are allowed here so I used ., you may want to narrow further).
One weakness is it doesn't handle back-references within expressions. My biggest concern with that is the limitations of back-references themselves. I'll leave that as an exercise to the reader/questioner.
// Pass as many expressions as you'd like
function preg_magic_coalesce() {
$active_modifiers = array();
$expression = '/(?:';
$sub_expressions = array();
foreach(func_get_args() as $arg) {
// Determine modifiers from sub-expression
if(preg_match('/^(.)(.*)\1([eimsuxADJSUX]+)$/', $arg, $matches)) {
$modifiers = preg_split('//', $matches[3]);
if($modifiers[0] == '') {
array_shift($modifiers);
}
if($modifiers[(count($modifiers) - 1)] == '') {
array_pop($modifiers);
}
$cancel_modifiers = $active_modifiers;
foreach($cancel_modifiers as $key => $modifier) {
if(in_array($modifier, $modifiers)) {
unset($cancel_modifiers[$key]);
}
}
$active_modifiers = $modifiers;
} elseif(preg_match('/(.)(.*)\1$/', $arg)) {
$cancel_modifiers = $active_modifiers;
$active_modifiers = array();
}
// If expression has modifiers, include them in sub-expression
$sub_modifier = '(?';
$sub_modifier .= implode('', $active_modifiers);
// Cancel modifiers from preceding sub-expression
if(count($cancel_modifiers) > 0) {
$sub_modifier .= '-' . implode('-', $cancel_modifiers);
}
$sub_modifier .= ')';
$sub_expression = preg_replace('/^(.)(.*)\1[eimsuxADJSUX]*$/', $sub_modifier . '$2', $arg);
// Properly escape slashes
$sub_expression = preg_replace('/(?<!\\\)\//', '\\\/', $sub_expression);
$sub_expressions[] = $sub_expression;
}
// Join expressions
$expression .= implode('|', $sub_expressions);
$expression .= ')/';
return $expression;
}
Edit: I've rewritten this (because I'm OCD) and ended up with:
function preg_magic_coalesce($expressions = array(), $global_modifier = '') {
if(!preg_match('/^((?:-?[eimsuxADJSUX])+)$/', $global_modifier)) {
$global_modifier = '';
}
$expression = '/(?:';
$sub_expressions = array();
foreach($expressions as $sub_expression) {
$active_modifiers = array();
// Determine modifiers from sub-expression
if(preg_match('/^(.)(.*)\1((?:-?[eimsuxADJSUX])+)$/', $sub_expression, $matches)) {
$active_modifiers = preg_split('/(-?[eimsuxADJSUX])/',
$matches[3], -1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);
}
// If expression has modifiers, include them in sub-expression
if(count($active_modifiers) > 0) {
$replacement = '(?';
$replacement .= implode('', $active_modifiers);
$replacement .= ':$2)';
} else {
$replacement = '$2';
}
$sub_expression = preg_replace('/^(.)(.*)\1(?:(?:-?[eimsuxADJSUX])*)$/',
$replacement, $sub_expression);
// Properly escape slashes if another delimiter was used
$sub_expression = preg_replace('/(?<!\\\)\//', '\\\/', $sub_expression);
$sub_expressions[] = $sub_expression;
}
// Join expressions
$expression .= implode('|', $sub_expressions);
$expression .= ')/' . $global_modifier;
return $expression;
}
It now uses (?modifiers:sub-expression) rather than (?modifiers)sub-expression|(?cancel-modifiers)sub-expression but I've noticed that both have some weird modifier side-effects. For instance, in both cases if a sub-expression has a /u modifier, it will fail to match (but if you pass 'u' as the second argument of the new function, that will match just fine).

Strip delimiters and flags from each. This regex should do it:
/^(.)(.*)\1([imsxeADSUXJu]*)$/
Join expressions together. You'll need non-capturing parenthesis to inject flags:
"(?$flags1:$regexp1)|(?$flags2:$regexp2)"
If there are any back references, count capturing parenthesis and update back references accordingly (e.g. properly joined /(.)x\1/ and /(.)y\1/ is /(.)x\1|(.)y\2/ ).
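The first two steps might look roughly like the sketch below. It is deliberately naive: it keeps only the i, m, s and x flags as inline options, re-delimits with a plain str_replace(), and does not renumber back-references. The names split_regex() and join_alternatives() are made up for this example.

```php
<?php
// Sketch only: split_regex() and join_alternatives() are hypothetical names.
function split_regex(string $regex): ?array
{
    if (preg_match('/^(.)(.*)\1([imsxADSUXJu]*)$/s', $regex, $m) !== 1) {
        return null;
    }
    [, $delimiter, $body, $flags] = $m;
    if ($delimiter !== '/') {
        // Naive re-delimiting: unescape the old delimiter, escape "/".
        $body = str_replace('\\' . $delimiter, $delimiter, $body);
        $body = str_replace('/', '\\/', $body);
    }
    return [$body, $flags];
}

function join_alternatives(array $regexes): ?string
{
    $parts = [];
    foreach ($regexes as $regex) {
        $split = split_regex($regex);
        if ($split === null) {
            return null;
        }
        [$body, $flags] = $split;
        // This sketch keeps only i, m, s and x; other flags are dropped.
        $inline = preg_replace('/[^imsx]/', '', $flags);
        $parts[] = '(?' . $inline . ':' . $body . ')';
    }
    return '/' . implode('|', $parts) . '/';
}

echo join_alternatives(['# /[a-z] #i', '/ Moo /x']), "\n";
// /(?i: \/[a-z] )|(?x: Moo )/
```

Each sub-expression keeps its own flags because (?flags:...) limits them to that group.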

EDIT
I’ve rewritten the code! It now contains the changes that are listed as follows. Additionally, I've done extensive tests (which I won’t post here because they’re too many) to look for errors. So far, I haven’t found any.
The function is now split into two parts: There’s a separate function preg_strip which takes a regular expression and returns an array containing the bare expression (without delimiters) and an array of modifiers. This might come in handy (it already has, in fact; this is why I made this change).
The code now correctly handles back-references. This was necessary for my purpose after all. It wasn’t difficult to add, the regular expression used to capture the back-references just looks weird (and may actually be extremely inefficient, it looks NP-hard to me, but that’s only an intuition and only applies in weird edge cases). By the way, does anyone know a better way of checking for an uneven number of matches than my way? Negative lookbehinds won't work here because they only accept fixed-length strings instead of regular expressions. However, I need the regex here to test whether the preceding backslash is actually escaped itself.
Additionally, I don’t know how good PHP is at caching anonymous create_function use. Performance-wise, this might not be the best solution but it seems good enough.
I’ve fixed a bug in the sanity check.
I’ve removed the cancellation of obsolete modifiers since my tests show that it isn't necessary.
By the way, this code is one of the core components of a syntax highlighter for various languages that I’m working on in PHP since I’m not satisfied with the alternatives listed elsewhere.
Thanks!
porneL, eyelidlessness, amazing work! Many, many thanks. I had actually given up.
I've built upon your solution and I'd like to share it here. I didn't implement re-numbering back-references since this isn't relevant in my case (I think …). Perhaps this will become necessary later, though.
Some Questions …
One thing, @eyelidlessness: Why do you feel the necessity to cancel old modifiers? As far as I see it, this isn't necessary since the modifiers are only applied locally anyway.
Ah yes, one other thing. Your escaping of the delimiter seems overly complicated. Care to explain why you think this is needed? I believe my version should work as well but I could be very wrong.
Also, I've changed the signature of your function to match my needs. I also think that my version is more generally useful. Again, I might be wrong.
BTW, you should now realize the importance of real names on SO. ;-) I can't give you real credit in the code. :-/
The Code
Anyway, I'd like to share my result so far because I can't believe that nobody else ever needs something like that. The code seems to work very well. Extensive tests are yet to be done, though. Please comment!
And without further ado …
/**
* Merges several regular expressions into one, using the indicated 'glue'.
*
* This function takes care of individual modifiers so it's safe to use
* <em>different</em> modifiers on the individual expressions. The order of
* sub-matches is preserved as well. Numbered back-references are adapted to
* the new overall sub-match count. This means that it's safe to use numbered
* back-references in the individual expressions!
* If {@link $names} is given, the individual expressions are captured in
* named sub-matches using the contents of that array as names.
* Matching pair-delimiters (e.g. <code>"{…}"</code>) are currently
* <strong>not</strong> supported.
*
* The function assumes that all regular expressions are well-formed.
* Behaviour is undefined if they aren't.
*
* This function was created after a {@link https://stackoverflow.com/questions/244959/
* StackOverflow discussion}. Much of it was written or thought of by
* “porneL” and “eyelidlessness”. Many thanks to both of them.
*
* @param string $glue A string to insert between the individual expressions.
* This should usually be either the empty string, indicating
* concatenation, or the pipe (<code>|</code>), indicating alternation.
* Notice that this string might have to be escaped since it is treated
* like a normal character in a regular expression (i.e. <code>/</code>)
* will end the expression and result in an invalid output.
* @param array $expressions The expressions to merge. The expressions may
* have arbitrary different delimiters and modifiers.
* @param array $names Optional. This is either an empty array or an array of
* strings of the same length as {@link $expressions}. In that case,
* the strings of this array are used to create named sub-matches for the
* expressions.
* @return string A string representing a regular expression equivalent to the
* merged expressions. Returns <code>FALSE</code> if an error occurred.
*/
function preg_merge($glue, array $expressions, array $names = array()) {
// … then, a miracle occurs.
// Sanity check …
$use_names = ($names !== null and count($names) !== 0);
if (
$use_names and count($names) !== count($expressions) or
!is_string($glue)
)
return false;
$result = array();
// For keeping track of the names for sub-matches.
$names_count = 0;
// For keeping track of *all* captures to re-adjust backreferences.
$capture_count = 0;
foreach ($expressions as $expression) {
if ($use_names)
$name = str_replace(' ', '_', $names[$names_count++]);
// Get delimiters and modifiers:
$stripped = preg_strip($expression);
if ($stripped === false)
return false;
list($sub_expr, $modifiers) = $stripped;
// Re-adjust backreferences:
// We assume that the expression is correct and therefore don't check
// for matching parentheses.
$number_of_captures = preg_match_all('/\([^?]|\(\?[^:]/', $sub_expr, $_);
if ($number_of_captures === false)
return false;
if ($number_of_captures > 0) {
// NB: This looks NP-hard. Consider replacing.
$backref_expr = '/
( # Only match when not escaped:
[^\\\\] # guarantee an even number of backslashes
(\\\\*?)\\2 # (twice n, preceded by something else).
)
\\\\ (\d) # Backslash followed by a digit.
/x';
$sub_expr = preg_replace_callback(
$backref_expr,
create_function(
'$m',
'return $m[1] . "\\\\" . ((int)$m[3] + ' . $capture_count . ');'
),
$sub_expr
);
$capture_count += $number_of_captures;
}
// Last, construct the new sub-match:
$modifiers = implode('', $modifiers);
$sub_modifiers = "(?$modifiers)";
if ($sub_modifiers === '(?)')
$sub_modifiers = '';
$sub_name = $use_names ? "?<$name>" : '?:';
$new_expr = "($sub_name$sub_modifiers$sub_expr)";
$result[] = $new_expr;
}
return '/' . implode($glue, $result) . '/';
}
/**
* Strips a regular expression string of its delimiters and modifiers.
* Additionally, normalize the delimiters (i.e. reformat the pattern so that
* it could have used '/' as delimiter).
*
* @param string $expression The regular expression string to strip.
* @return array An array whose first entry is the expression itself, the
* second an array of modifiers. If the argument is not a valid regular
* expression, returns <code>FALSE</code>.
*
*/
function preg_strip($expression) {
if (preg_match('/^(.)(.*)\\1([imsxeADSUXJu]*)$/s', $expression, $matches) !== 1)
return false;
$delim = $matches[1];
$sub_expr = $matches[2];
if ($delim !== '/') {
// Replace occurrences of the escaped delimiter with its unescaped
// version and escape the new delimiter.
$sub_expr = str_replace("\\$delim", $delim, $sub_expr);
$sub_expr = str_replace('/', '\\/', $sub_expr);
}
$modifiers = $matches[3] === '' ? array() : str_split(trim($matches[3]));
return array($sub_expr, $modifiers);
}
PS: I've made this posting community wiki editable. You know what this means …!

I'm pretty sure it's not possible to just put regexps together like that in any language - they could have incompatible modifiers.
I'd probably just put them in an array and loop through them, or combine them by hand.
Edit: If you're doing them one at a time as described in your edit, you may be able to run the second one on a substring (from the start up to the earliest match). That might help things.

function preg_magic_coalesce($split, $re1, $re2) {
$re1 = rtrim($re1, "\/#is");
$re2 = ltrim($re2, "\/#");
return $re1.$split.$re2;
}

You could do it the alternative way like this:
$a = '# /[a-z] #i';
$b = '/ Moo /x';
$a_matched = preg_match($a, $text, $a_matches);
$b_matched = preg_match($b, $text, $b_matches);
if ($a_matched && $b_matched) {
// Index 0 holds the full match; the patterns have no capture groups
$a_pos = strpos($text, $a_matches[0]);
$b_pos = strpos($text, $b_matches[0]);
if ($a_pos == $b_pos) {
if (strlen($a_matches[0]) == strlen($b_matches[0])) {
// $a and $b matched the exact same string
} else if (strlen($a_matches[0]) > strlen($b_matches[0])) {
// $a and $b started matching at the same spot but $a is longer
} else {
// $a and $b started matching at the same spot but $b is longer
}
} else if ($a_pos < $b_pos) {
// $a matched first
} else {
// $b matched first
}
} else if ($a_matched) {
// $a matched, $b didn't
} else if ($b_matched) {
// $b matched, $a didn't
} else {
// neither one matched
}
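The same comparison can be sketched for any number of patterns by asking preg_match() for offsets with PREG_OFFSET_CAPTURE, which avoids the extra strpos() calls. This is only an illustration; earliest_match() is a hypothetical helper name, and ties at the same offset go to the longer match, as the question asks.

```php
<?php
// Sketch only: earliest_match() is a hypothetical helper name.
function earliest_match(array $patterns, string $text): ?array
{
    $best = null;
    foreach ($patterns as $index => $pattern) {
        if (preg_match($pattern, $text, $m, PREG_OFFSET_CAPTURE) !== 1) {
            continue; // no match (or an error) for this pattern
        }
        // With PREG_OFFSET_CAPTURE, $m[0] is [matched text, byte offset].
        [$match, $offset] = $m[0];
        $isBetter = $best === null
            || $offset < $best['offset']
            || ($offset === $best['offset'] && strlen($match) > strlen($best['match']));
        if ($isBetter) {
            $best = ['index' => $index, 'offset' => $offset, 'match' => $match];
        }
    }
    return $best;
}

print_r(earliest_match(['/b+/', '/ab/'], 'xxabbb'));
// index 1 wins: "ab" starts at offset 2, "bbb" only at offset 3
```

This still runs every pattern over the text, so it does not remove the cost the question complains about; it only makes the bookkeeping simpler.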
