Matching initialized integer values - php

Let's say, for instance, I have this string:
var a=23434,bc=3434,erd=5656,ddfeto='dsf3df34dff3',eof='sdfwerwer34',wer=4554;
How should I match all the initializations assigned as integer? Here's my current try, but I don't understand why it's matching everything.
$pattern = '/var (.*=\d)/';
preg_match_all($pattern,$page,$matches);
EDIT: I'm trying to match each initialization:
1 => a=23434
2 => bc=3434
and so on...
EDIT: Here's an update on my try:
$pattern = '/[^v^a^r] (.*=\d+),/';
preg_match_all($pattern,$page,$matches);
0 => 'var a=23434,bc=3434,erd=5656,'
1 => 'a=23434,bc=3434,erd=5656'

The function is using "greedy" matching. You don't want that. In PHP, you can either follow your wildcard with a ? to specify non-greedy matching, as in:
$pattern = '/var (.*?=\d)/';
or using the U flag as documented here, as in:
$pattern = '/var (.*=\d)/U';
which will make all wildcards use non-greedy matching.
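To see the difference concretely, here is a small sketch that runs the greedy pattern and both non-greedy variants against the sample string from the question. The variable names ($greedy, $lazy, $ungreedy) are made up for this illustration.

```php
<?php
$page = "var a=23434,bc=3434,erd=5656,ddfeto='dsf3df34dff3',eof='sdfwerwer34',wer=4554;";

// Greedy: .* grabs as much as it can, then backs off to the LAST "=digit".
preg_match('/var (.*=\d)/', $page, $greedy);

// Lazy quantifier: .*? stops at the FIRST "=digit".
preg_match('/var (.*?=\d)/', $page, $lazy);

// U flag: inverts the greediness of every quantifier in the pattern.
preg_match('/var (.*=\d)/U', $page, $ungreedy);

echo $greedy[1], "\n";   // runs all the way to "wer=4"
echo $lazy[1], "\n";     // a=2
echo $ungreedy[1], "\n"; // a=2
```

Note that \d without a + only takes a single digit, which is why the non-greedy versions stop at a=2 rather than a=23434.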
EDIT: Also, since you're including "var", you would probably need to change it to
$pattern = '/var (.*?=\d)*/';
or
$pattern = '/var (.*=\d)*/U';
to match any number of (.*=\d) patterns.
EDIT: Update per discussion:
PHP
$page = "var a=23434,bc=3434,erd=5656,ddfeto='dsf3df34dff3',eof='sdfwerwer34',wer=4554;";
$pattern = '/([a-zA-Z]+=\d+)/';
preg_match_all($pattern,$page,$matches);
print_r($matches[1]);
Produces
Array
(
[0] => a=23434
[1] => bc=3434
[2] => erd=5656
[3] => wer=4554
)
Note: This filters out the entries that have the RHS enclosed in single quotes. If you don't want that, let us know.
EDIT #2: My answer to your question exceeded the size of the comment box so I edited my answer.
The [a-zA-Z] expression matches only alphabetical characters of either case. Note that the updated code also removed the "ungreedy" modifier, so the quantifiers are greedy again. With greedy matching, the . would "eat" too much. Go ahead and play around with the code: see what happens when you change [a-zA-Z]+ to .*. It is a good opportunity to get more familiar with regex.
Since the . "eats" too much, we need to restrict it from matching all characters to matching the ones we want. We could have used something like
$pattern = '/([^\s,]*=\d+)/';
where the [^\s,]* would match any number of non-whitespace, non-comma characters. This would also have worked for your test cases.
But in this case, we can say confidently what the characters we want to include are, so instead of "blacklisting" characters, we'll "whitelist" them. In this case we specify that we want to match any alphabetical character of either case.
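As a rough comparison of the two approaches, the sketch below runs the "whitelist" and "blacklist" patterns on the sample string and on one extra input. The extra input ($extra) is a made-up edge case, not something from the question.

```php
<?php
$page = "var a=23434,bc=3434,erd=5656,ddfeto='dsf3df34dff3',eof='sdfwerwer34',wer=4554;";

// Whitelist: only letters are allowed in front of the "=".
preg_match_all('/[a-zA-Z]+=\d+/', $page, $white);
// Blacklist: anything except whitespace and commas is allowed.
preg_match_all('/[^\s,]*=\d+/', $page, $black);

// Hypothetical edge case: a name containing an underscore and a digit.
$extra = 'var item_2=7;';
preg_match_all('/[a-zA-Z]+=\d+/', $extra, $whiteExtra);
preg_match_all('/[^\s,]*=\d+/', $extra, $blackExtra);

print_r($white[0]);      // a=23434, bc=3434, erd=5656, wer=4554
print_r($black[0]);      // same four entries
print_r($whiteExtra[0]); // empty: "item_2" is not purely alphabetical
print_r($blackExtra[0]); // item_2=7
```

Both patterns agree on the test string; they only diverge once names contain characters outside a-z and A-Z.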
As is the case with many things, especially in programming, there are many ways to skin a cat. There are a number of alternative regex patterns that would also have worked for your test cases. It's up to you to understand the limits of each, how they will perform on edge cases, and how maintainable they are, and make a decision.

You don't have to use regex:
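$string = rtrim(substr($string, 4), ';'); // remove the leading 'var ' and the trailing ';'
$pairs = explode(',', $string); // split using the comma
foreach ($pairs as $pair) {
    list($key, $value) = explode('=', $pair, 2);
    // $value is always a string here, so is_int() would never be true;
    // ctype_digit() checks that the string consists of digits only
    if (ctype_digit($value)) {
        // this is an integer
    } else {
        // not an integer
    }
}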

Try this regex
$pattern = '/([a-zA-Z]+=\d+)/';

Related

Problems with Router / URL Matcher class

Since regular expressions aren't really my specialty, I need help with this little problem (in PHP).
I want to match a given url with an array of defined routes, e.g.:
$definedRoute = '/admin/user/[:id]/edit';
$url = '/admin/user/37/edit';
In my class, it would be like this, I imagine (getRoutes() returns an array of defined routes):
foreach ($this->getRoutes() as $route) {
$pattern = '~' . preg_replace('~\[\:[a-z]+\]~', '[a-z0-9]+',
str_replace('/', '\/', $route['definition'])) . '~';
if (preg_match($pattern, $url)) {
$parameters = $this->getRouteParameters($route['definition']);
(new $route['class']())->{$route['method']}($parameters);
// die? break?
}
}
I went about it like this: replace every occurrence of a named parameter like [:id] with a regex for lowercase letters and numbers, e.g. [a-z0-9]+.
This would actually work, but in some cases it would match multiple routes and therefore the wrong ones. Also, the route ~\/~ would match in most cases. But every URL should only be matched once.
Edit #1: the problem is: routes get matched multiple times. How can I prevent this?
Can someone enlighten me?
I don't know if this will cover every conceivable case, but you can use preg_match_all or preg_match with a single combined pattern instead of iterating over the patterns. It should also improve performance.
This makes the order of the alternatives (left to right) important, which is harder to control with an array and a loop (it is possible, just uglier). We can then sort the routes by complexity, like this:
//this is intentionally in the opposite order of what I want it.
$routes = ['definition' => ['\/', '\/admin\/user\/[a-z0-9]+\/edit']];
//the more / separators, the closer to the beginning we want it, so the more complex regexes go first.
uasort($routes['definition'], function($a,$b){
    //count the number of / in the route
    //note the <=> spaceship (as it's called) is only available in PHP7+
    return substr_count($b, '/') <=> substr_count($a, '/');
});
$url = '/admin/user/37/edit';
//in regex the pipe | is OR
preg_match_all('~^('.implode('|', $routes['definition']).')~i', $url, $matches);
print_r($matches);
Sandbox
Output:
Array
(
[0] => Array
(
[0] => /admin/user/37/edit
)
[1] => Array
(
[0] => /admin/user/37/edit
)
)
Which is correct in this case. Even if you do get multiple matches, you can compare their lengths with strlen and take the longest one as the "best" match. This is pretty simple using strlen and probably sorting by length, so I will leave it up to you.
However I wouldn't call this a guarantee of it working 100% of the time, it's just the first thing that came to me.
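For what it's worth, the "longest match wins" idea could look roughly like the sketch below. The route list and the $best structure are assumptions made for the example, and the routes are deliberately left unsorted and anchored only at the start.

```php
<?php
// Hypothetical route regexes, unsorted on purpose.
$routes = ['/', '/admin', '/admin/user/[a-z0-9]+/edit'];
$url = '/admin/user/37/edit';

$best = null;
foreach ($routes as $route) {
    // "~" is the delimiter, so "/" needs no escaping inside the pattern.
    if (preg_match('~^(?:' . $route . ')~i', $url, $m) === 1) {
        // Keep whichever route consumed the most characters of the URL.
        if ($best === null || strlen($m[0]) > strlen($best['match'])) {
            $best = ['route' => $route, 'match' => $m[0]];
        }
    }
}
print_r($best);
```

All three routes match the start of the URL here, but only the last one covers all of it, so it ends up in $best.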
Another Idea
Another point is that you are not anchoring the match to the start and end of the string. In theory the route could/should match the entire string, so with my example above you can add ^ and $ like this:
preg_match('~^('.implode('|', $routes['definition']).')$~i', $url, $matches);
This will ensure a full match and in this case ~\/~ will not match even if the array is not sorted, as you can see in the sandbox below.
Sandbox
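Applied to the loop from the question, anchoring might look like the following sketch. The $definitions list is hypothetical, and the static parts of each definition are assumed to contain no regex metacharacters (otherwise they would need escaping, e.g. with preg_quote).

```php
<?php
// Hypothetical route definitions in the question's [:name] syntax.
$definitions = ['/', '/admin/user/[:id]/edit'];
$url = '/admin/user/37/edit';

$matched = [];
foreach ($definitions as $definition) {
    // Swap each [:name] placeholder for a character class, then anchor both ends.
    $regex = '~^' . preg_replace('~\[:[a-z]+\]~', '[a-z0-9]+', $definition) . '$~i';
    if (preg_match($regex, $url) === 1) {
        $matched[] = $definition;
    }
}
print_r($matched); // only the /admin/user/[:id]/edit definition
```

With both anchors in place the "/" route no longer matches longer URLs, so the order of the definitions stops mattering for this input.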
That said, it's not inconceivable that you would only have/need a partial match. This is up to you and how you build your routes and URLs. You can of course use just the ^ start anchor for a "begins with" type of match, but you would need to sort the routes in that case.
Preg Match vs Preg Match All
Preg match will also work, but it will only return the first match. So if the pattern matches multiple times, you cannot compare the matches to find the best one. This may be fine if you use ^ and $.
Hope it helps.

Match all substrings that end with 4 digits using regular expressions

I am trying to split a string in PHP, which looks like this:
ABCDE1234ABCD1234ABCDEF1234
Into an array of strings which, in this case, would look like this:
ABCDE1234
ABCD1234
ABCDEF1234
So the pattern is "an undefined number of letters, and then 4 digits, then an undefined number of letters and 4 digits etc."
I'm trying to split the string using preg_split like this:
$pattern = "#[0-9]{4}$#";
preg_split($pattern, $stringToSplit);
And it returns an array containing the full string (not split) in the first element.
I'm guessing the problem here is my regex as I don't fully understand how to use them, and I am not sure if I'm using it correctly.
So what would be the correct regex to use?
You don't want preg_split, you want preg_match_all:
$str = 'ABCDE1234ABCD1234ABCDEF1234';
preg_match_all('/[a-z]+[0-9]{4}/i', $str, $matches);
var_dump($matches);
Output:
array(1) {
[0]=>
array(3) {
[0]=>
string(9) "ABCDE1234"
[1]=>
string(8) "ABCD1234"
[2]=>
string(10) "ABCDEF1234"
}
}
PHP uses PCRE-style regexes which let you do lookbehinds. You can use this to see if there are 4 digits "behind" you. Combine that with a lookahead to see if there's a letter ahead of you, and you get this:
(?<=\d{4})(?=[a-z])
Notice the dotted lines on the Debuggex Demo page. Those are the points you want to split on.
In PHP this would be:
var_dump(preg_split('/(?<=\d{4})(?=[a-z])/i', 'ABCDE1234ABCD1234ABCDEF1234'));
Use the principle of contrast:
\D+\d{4}
# requires at least one non digit
# followed by exactly four digits
See a demo on regex101.com.
In PHP this would be:
<?php
$string = 'ABCDE1234ABCD1234ABCDEF1234';
$regex = '~\D+\d{4}~';
preg_match_all($regex, $string, $matches);
?>
See a demo on ideone.com.
I'm no good at regex so here is the road less traveled:
<?php
$s = 'ABCDE1234ABCD1234ABCDEF1234';
$nums = range(0,9);
$num_hit = 0;
$i = 0;
$arr = array();
foreach(str_split($s) as $v)
{
if(isset($nums[$v]))
{
++$num_hit;
}
if(!isset($arr[$i]))
{
$arr[$i] = '';
}
$arr[$i].= $v;
if($num_hit === 4)
{
++$i;
$num_hit = 0;
}
}
print_r($arr);
First, why is your attempted pattern not delivering the desired output? Because the $ anchor restricts the match to the end of the string, so the function explodes the string using only the final four digits as the "delimiter" (the characters that are consumed while dividing the string into separate parts).
Your result:
array (
0 => 'ABCDE1234ABCD1234ABCDEF', // an element of characters before the last four digits
1 => '', // an empty element containing the non-existent characters after the four digits
)
In plain English, to fix your pattern, you must:
Not consume any characters while exploding and
Ensure that no empty elements are generated.
My snippet is at the bottom of this post.
Second, there seems to be some debate about what regex function to use (or even if regex is a preferable tool).
My stance is that using a non-regex method will require a long-winded block of lines which will be equally if not more difficult to read than a regex pattern. Regex lets you generate your result in one line and not in an unsightly fashion. So let's dispose of iterated sets of conditions for this task.
Now the critical concern is whether this task is simply "extracting" data from a consistent and valid string (case "A"), or if it is "validating AND extracting" data from a string (case "B") because the input cannot be 100% trusted to be consistent/correct.
In case A, you needn't concern yourself with producing valid elements in the output, so preg_split() or preg_match_all() are good candidates.
In case B, preg_split() would not be advisable, because it only hunts for delimiting substrings -- it remains ignorant of all other characters in the string.
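For case B, one possible approach (a sketch, not the only way) is to validate the whole string first and only then extract. The helper name extractChunks() is invented for this example.

```php
<?php
/**
 * Hypothetical helper: returns the chunks, or null when the input is not
 * made up entirely of "letters followed by four digits" groups.
 */
function extractChunks(string $input): ?array
{
    // Validate the whole string first (\z is the very end of the subject).
    if (preg_match('/^(?:[a-z]+\d{4})+\z/i', $input) !== 1) {
        return null;
    }
    // The input is known to be well formed, so plain extraction is safe now.
    preg_match_all('/[a-z]+\d{4}/i', $input, $matches);
    return $matches[0];
}

$good = extractChunks('ABCDE1234ABCD1234ABCDEF1234');
$bad  = extractChunks('ABCDE1234??ABCD12');

var_export($good); // the three chunks
var_export($bad);  // NULL
```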
Assuming this task is case A, then a decision is still pending about the better function to call. Well, both functions generate an array, but preg_match_all() creates a multidimensional array while you desire a flat array (like preg_split() provides). This means you would need to add a new variable to the global scope ($matches) and append [0] to the array to access the desired fullstring matches. To someone who doesn't understand regex patterns, this may border on the bad practice of using "magic numbers".
For me, I strive to code for Directness and Accuracy, then Efficiency, then Brevity and Clarity. Since you're not likely to notice any performance drops while performing such a small operation, efficiency isn't terribly important. I just want to make some comparisons to highlight the cost of a pattern that leverages only look-arounds or a pattern that misses an opportunity to greedily match predictable characters.
/(?<=\d{4})(?=[a-z])/i 79 steps (Demo)
~\d{4}\K~ 25 steps (Demo)
/[a-z]+[0-9]{4}\K/i 13 steps (Demo)
~\D+[0-9]{4}\K~ 13 steps (Demo)
~\D+\d{4}\K~ 13 steps (Demo)
FYI, \K is a metacharacter that means "restart the fullstring match", in other words "forget/release all previously matched characters up to this point". This effectively ensures that no characters are lost while splitting.
Suggested technique: (Demo)
var_export(
preg_split(
'~\D+\d{4}\K~', // pattern
'ABCDE1234ABCD1234ABCDEF1234', // input
0, // make unlimited explosions
PREG_SPLIT_NO_EMPTY // exclude empty elements
)
);
Output:
array (
0 => 'ABCDE1234',
1 => 'ABCD1234',
2 => 'ABCDEF1234',
)

I have a PHP regex, how do I add a condition for the number of characters?

I have a regular expression that I'm using in PHP:
$word_array = preg_split(
'/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/',
urldecode($path), NULL, PREG_SPLIT_NO_EMPTY
);
It works great. It takes a chunk of URL parameters like:
/2009/06/pagerank-update.html
and returns an array like:
array(4) {
[0]=>
string(4) "2009"
[1]=>
string(2) "06"
[2]=>
string(8) "pagerank"
[3]=>
string(6) "update"
}
The only thing I need is for it to also not return strings that are less than 3 characters. So the "06" string is garbage and I'm currently using an if statement to weed them out.
The magic of the split. My original assumption was technically not correct (albeit a solution easier to come to). So let's check your split pattern:
(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)
I re-arranged it a bit. The outer parenthesis is not necessary and I moved the single characters into a character class at the end:
html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]
So much for sorting things out up front. Let's call this pattern the split pattern, s for short, and define it.
You want to match all parts that are not of those characters from the split-at pattern and at minimum three characters.
I could achieve this with the following pattern, including support of the correct split sequences and unicode support.
$pattern = '/
(?(DEFINE)
(?<s> # define subpattern which is the split pattern
html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|
[\\/._=?&%+-] # a little bit optimized with a character class
)
)
(?:(?&s)) # consume the subpattern (URL starts with \/)
\K # capture starts here
(?:(?!(?&s)).){3,} # ensure this is not the split pattern, take 3 characters minimum
/ux';
Or in smaller:
$path = '/2009/06/pagerank-update.htmltesthtmltest%C3%A4shtml';
$subject = urldecode($path);
$pattern = '/(?(DEFINE)(?<s>html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\\/._=?&%+-]))(?:(?&s))\K(?:(?!(?&s)).){3,}/u';
$word_array = preg_match_all($pattern, $subject, $m) ? $m[0] : [];
print_r($word_array);
Result:
Array
(
[0] => 2009
[1] => pagerank
[2] => update
[3] => test
[4] => testä
)
The same principle can be used with preg_split as well. It's a little bit different:
$pattern = '/
(?(DEFINE) # define subpattern which is the split pattern
(?<s>
html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|
[\/._=?&%+-]
)
)
(?:(?!(?&s)).){3,}(*SKIP)(*FAIL) # three or more is okay
|(?:(?!(?&s)).){1,2}(*SKIP)(*ACCEPT) # two or one is none
|(?&s) # otherwise, split at the split pattern
/ux';
Usage:
$word_array = preg_split($pattern, $subject, 0, PREG_SPLIT_NO_EMPTY);
Result:
Array
(
[0] => 2009
[1] => pagerank
[2] => update
[3] => test
[4] => testä
)
These routines work as asked for. But this does have its price with performance. The cost is similar to the old answer.
Related questions:
Antimatch with Regex
Split string by delimiter, but not if it is escaped
Old answer, doing a two-step processing (first splitting, then filtering)
Because you are using a split routine, it will split - regardless of the length.
So what you can do is filter the result. You can do that again with a regular expression (preg_filter), for example one that drops everything shorter than three characters:
$word_array = preg_filter(
'/^.{3,}$/', '$0',
preg_split(
'/(\/|\.|-|_|=|\?|\&|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|%|\+)/',
urldecode($path),
NULL,
PREG_SPLIT_NO_EMPTY
)
);
Result:
Array
(
[0] => 2009
[2] => pagerank
[3] => update
)
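Note that preg_filter keeps the original keys, hence the gap between 0 and 2 above. If a gap-free list is needed, a variation of the same two-step idea is sketched below; it swaps preg_filter for array_filter plus array_values, and uses -1 (the documented "no limit" value) for the limit argument. strlen counts bytes, so this assumes single-byte characters.

```php
<?php
$path = '/2009/06/pagerank-update.html';

// Step 1: split exactly as before.
$parts = preg_split(
    '/html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org|[\/._=?&%+-]/',
    urldecode($path),
    -1,
    PREG_SPLIT_NO_EMPTY
);

// Step 2: keep parts of at least three bytes, then renumber the keys from 0.
$word_array = array_values(array_filter($parts, function ($part) {
    return strlen($part) >= 3;
}));

print_r($word_array);
```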
I'm guessing you're building a URL router of some kind.
Detecting which parameters are useful and which are not should not be part of this code. It may vary per page whether a short parameter is relevant.
In this case, couldn't you just ignore element 1? Your page (or 'handler') should know which parameters it wants to be called with; it should do the triage.
I would think that if you were trying to derive meaning from the URLs, you would actually want to write clean URLs in such a way that you don't need a complex regex to derive the value.
In many cases this involves using server redirect rules and a front controller or request router.
So what you build are clean URLs like
/value1/value2/value3
Without any .html,.php, etc. in the URL at all.
It seems to me that you are not addressing the problem at the point of entry into the system (i.e the web server) adequately so as to make your URL parsing as simple as it should be.
How about trying preg_match_all() instead of preg_split()?
The pattern (using the Assertions):
/([a-z0-9]{3,})(?<!htm|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org)/iu
The function call:
$pattern = '/([a-z0-9]{3,})(?<!htm|html|shtml|www|php|cgi|htm|aspx|asp|index|com|net|org)/iu';
$subject = '/2009/06/pagerank-update.html';
preg_match_all($pattern, $subject, $matches);
print_r($matches);
You can try the function here: functions-online.com/preg_match_all.html
Hope this helps
Don't use a regex to break apart that path. Just use explode.
$dirs = explode( '/', urldecode($path) );
Then, if you need to break apart an individual element of the array, do that, like on your "pagerank-update" element at the end.
EDIT:
The key is that you have two different problems. First you want to break apart the path elements on slashes. Then, you want to break up the filename into smaller parts. Don't try to cram everything into one regex that tries to do everything.
Three discrete steps:
$dirs = explode...
Weed out arguments < 3 chars
Break up file argument at the end
It is far clearer if you break up your logic into discrete logical chunks rather than trying to make the regex do everything.
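The three steps could be sketched like this. The use of pathinfo() to drop the extension and the [-_.] separator set for the filename are assumptions made for the example, not something prescribed above.

```php
<?php
$path = '/2009/06/pagerank-update.html';

// Step 1: break the path apart on slashes.
$dirs = explode('/', urldecode($path));

// Step 3 (prepared early): take the file argument off the end and split it.
$file = array_pop($dirs);                          // pagerank-update.html
$name = pathinfo($file, PATHINFO_FILENAME);        // pagerank-update
$words = array_merge($dirs, preg_split('/[-_.]/', $name));

// Step 2: weed out arguments shorter than 3 characters (including empty ones).
$words = array_values(array_filter($words, function ($word) {
    return strlen($word) >= 3;
}));

print_r($words);
```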

preg_replace or regex string translation

I found some partial help but cannot seem to fully accomplish what I need. I need to be able to do the following:
I need a regular expression to replace any 1 to 3 character words between two words that are longer than 3 characters with a match-anything expression:
For example:
walk to the beach ==> walk(.*)beach
If the 1 to 3 character word is not preceded by a word that's longer than 3 characters then I want to translate that 1 to 3 letter word to '<word> ?'
For example:
on the beach ==> on ?the ?beach
The simpler the rule the better (of course, if there's an alternative, more complicated version that's more performant, then I'll take that as well, since I anticipate heavy usage eventually).
This will be used in a PHP context most likely with preg_replace. Thus, if you can put it in that context then even better!
By the way so far I have got the following:
$string = preg_replace('/\s+/', '(.*)', $string);
$string = preg_replace('/\b(\w{1,3})(\.*)\b/', '${1} ?', $string);
but that results in:
walk to the beach ==> 'walk(.*)to ?beach'
which is not what I want. 'on the beach' seems to translate correctly.
I think you will need two replacements for that. Let's start with the first requirement:
$str = preg_replace('/(\w{4,})(?: \w{1,3})* (?=\w{4,})/', '$1(.*)', $str);
Of course, you need to replace those \w (which match letters, digits and underscores) with a character class of what you actually want to treat as a word character.
The second one is a bit tougher, because matches cannot overlap and lookbehinds cannot be of variable length. So we have to run this multiple times in a loop:
do
{
$str = preg_replace('/^\w{0,3}(?: \w{0,3})* (?!\?)/', '$0?', $str, -1, $count);
} while($count);
Here we match everything from the beginning of the string, as long as it's only up-to-3-letter words separated by spaces, plus one trailing space (only if it is not already followed by a ?). Then we put all of that back in place, and append a ?.
Update:
After all the talk in the comments, here is an updated solution.
After running the first line, we can assume that the only words of three letters or fewer left will be at the beginning or at the end of the string. All others will have been collapsed to (.*). Since you want every space between those words to be followed by ?, you do not even need a loop (in fact these are the only spaces left):
$str = preg_replace('/ /', ' ?', $str);
(Do this right after my first line of code.)
This would give the following two results (in combination with the first line):
let us walk on the beach now go => let ?us ?walk(.*)beach ?now ?go
let us walk on the beach there now go => let ?us ?walk(.*)beach(.*)there ?now ?go
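Put together, the two replacements can be wrapped up as in the sketch below. The function name toPattern() is made up for the example.

```php
<?php
// Hypothetical wrapper around the two replacements from this answer.
function toPattern(string $str): string
{
    // Collapse runs of short words between two long words into (.*).
    $str = preg_replace('/(\w{4,})(?: \w{1,3})* (?=\w{4,})/', '$1(.*)', $str);
    // Every space that is left gets an optional marker appended.
    return preg_replace('/ /', ' ?', $str);
}

echo toPattern('walk to the beach'), "\n";               // walk(.*)beach
echo toPattern('on the beach'), "\n";                    // on ?the ?beach
echo toPattern('let us walk on the beach now go'), "\n"; // let ?us ?walk(.*)beach ?now ?go
```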

How can I extract all integer values from string using a regex?

I am trying to learn regex. I have the string:
$x = "5ft2inches";
How can I read [5,2] into an array using a regex?
If you are assuming that the string will be of the form "{number}ft{number}inches" then you can use preg_match():
preg_match('/(\d+)ft(\d+)inches/', $string, $matches);
(\d+) will match a string of one or more digits. The parentheses will tell preg_match() to place the matched numbers into the $matches variable (the third argument to the function). The function will return 1 if it made a match, or 0 if it didn't.
Here is what $matches looks like after a successful match:
Array
(
[0] => 5ft2inches
[1] => 5
[2] => 2
)
The entire matched string is the first element, then the parenthesized matches follow. So to make your desired array:
$array = array($matches[1], $matches[2]);
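A slightly more defensive version of the same idea, as a sketch: it checks the return value before touching $matches, anchors the pattern to the whole string, and casts the captured strings to integers. The anchors and the casts are additions for illustration, not part of the answer above.

```php
<?php
$x = '5ft2inches';

// preg_match() returns 1 on a match, 0 on no match (and false on error).
if (preg_match('/^(\d+)ft(\d+)inches$/', $x, $matches) === 1) {
    // Captured groups are strings, so cast them to get real integers.
    $array = [(int) $matches[1], (int) $matches[2]];
} else {
    $array = [];
}

var_export($array); // the integers 5 and 2
```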
Assuming PHP, any reason no one has suggested split?
$numbers = preg_split('/[^0-9]+/', $x, -1, PREG_SPLIT_NO_EMPTY);
In Perl:
#!/usr/bin/perl
use strict;
use warnings;
my $x = "5ft2inches";
my %height;
@height{qw(feet inches)} = ($x =~ /^([0-9]+)ft([0-9]+)inches$/);
use Data::Dumper;
print Dumper \%height;
Output:
$VAR1 = {
'feet' => '5',
'inches' => '2'
};
Or, using split:
@height{qw(feet inches)} = split /ft|inches/, $x;
The regular expression is simply /[0-9]+/ but how to get it into an array depends entirely on what programming language you're using.
With Regular Expressions, you can either extract your data in a contextless way, or a contextful way.
I.e., if you match for any digits: (\d+) (NB: this assumes that your language honors \d as the shortcut for 'any digit')
You can then extract each group, but you might not know that your string was actually "5 2inches" instead of "6ft2inches" OR "29Cabbages1Fish4Cows".
If you add context: (\d+)ft(\d+)inches
You know for sure what you've extracted (Because otherwise you'd not get a match) and can refer to each group in turn to get the feet and inches.
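In PHP (an assumption, since the question does not name a language), the difference between the two styles might be sketched as follows. The array names are invented for the example.

```php
<?php
$inputs = ['5ft2inches', '29Cabbages1Fish4Cows'];

$contextless = [];
$contextful = [];
foreach ($inputs as $input) {
    // Contextless: grab every run of digits, whatever surrounds it.
    preg_match_all('/\d+/', $input, $all);
    $contextless[$input] = $all[0];

    // Contextful: only accept the exact feet-and-inches shape.
    $contextful[$input] = preg_match('/^(\d+)ft(\d+)inches$/', $input, $m) === 1
        ? [$m[1], $m[2]]
        : null;
}

print_r($contextless); // digits are found in both strings
var_export($contextful); // only the first string yields a result
```

The contextless pattern happily returns 29, 1 and 4 for the second string, while the contextful one rejects it outright.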
If you're not always going to have a pair of numbers to extract, you'll need to make the various components optional. Check out This Regular Expression Cheat Sheet (his other cheat sheets are nifty too) for more info.
You don't mention the language you are using, so here is the general solution: You don't "extract" the numbers, you replace everything except numbers with an empty string.
In C#, this would look like
string numbers = Regex.Replace(dirtyString, "[^0-9]", "");
Have to watch out for double digit numbers.
/\d{1,2}/
might work a little better for feet and inches. The max value of '2' should be upped to whatever is appropriate.
use `/[\d]/`
