Erroneous Matches with Regular Expression - php

$regexp = '/(?:<input\stype="hidden"\sname="){1}([a-zA-Z0-9]*)(?:"\svalue="1"\s\/>)/';
$response = '<input type="hidden" name="7d37dddd0eb2c85b8d394ef36b35f54f" value="1" />';
preg_match($regexp, $response, $matches);
echo $matches[1]; // Outputs: 7d37dddd0eb2c85b8d394ef36b35f54f
So I'm using this regular expression to search for an authentication token on a webpage implementing Joomla in order to perform a scripted login.
I've got all this working but am wondering what is wrong with my regular expression as it always returns 2 items.
Array ( [0] => [1] => 7d37dddd0eb2c85b8d394ef36b35f54f)
Also the name of the input I'm checking for changes every page load both in length and name.

Nothing is wrong. Item [0] always contains the entire match. From the docs (emphasis mine):
If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.
Your regex (overlooking the fact that you are working on HTML with regexes in the first place, which you know you shouldn't) is a bit too complicated.
$regexp = '#<input\s+type="hidden"\s+name="([0-9a-f]*)"\s+value="1"\s*/>#i';
You don't need the non-capturing groups at all.
You use \s, which matches exactly one whitespace character. \s+ is probably better.
Using a delimiter other than / makes escaping forward slashes in the regex unnecessary.
Making the regex case-insensitive could be useful, too.
The auth token looks like a hex string, so matching 0-9 and a-f is enough; the full a-z range is unnecessary.
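To make the $matches layout concrete, here is a minimal sketch that runs the simplified pattern against the sample tag from the question. The token value is just the example data from the question, and $token is an illustrative variable name.

```php
<?php
// Sample markup copied from the question; the token is example data only.
$response = '<input type="hidden" name="7d37dddd0eb2c85b8d394ef36b35f54f" value="1" />';
$regexp = '#<input\s+type="hidden"\s+name="([0-9a-f]*)"\s+value="1"\s*/>#i';

$token = null;
if (preg_match($regexp, $response, $matches) === 1) {
    // $matches[0] holds the whole matched tag,
    // $matches[1] holds the text of the first capture group.
    $token = $matches[1];
}

echo $token, PHP_EOL;
```

Two array entries are therefore expected for a pattern with one capture group; only index 1 is the token.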

As per the manual entry for preg_match:
If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.

Related

(PHP) How to find words beginning with a pattern and replace all of them?

I have a string. An example might be "Contact /u/someone on reddit, or visit /r/subreddit or /r/subreddit2"
I want to replace any instance of "/r/x" and "/u/x" with "[/r/x](http://reddit.com/r/x)" and "[/u/x](http://reddit.com/u/x)" basically.
So I'm not sure how to 1) find "/r/" and then expand that to the rest of the word (until there's a space), then 2) take that full "/r/x" and replace with my pattern, and most importantly 3) do this for all "/r/" and "/u/" matches in a single go...
The only way I know to do this would be to write a function to walk the string, character by character, until I found "/", then look for "r" and "/" to follow; then keep going until I found a space. That would give me the beginning and ending characters, so I could do a string replacement; then calculate the new end point, and continue walking the string.
This feels... dumb. I have a feeling there's a relatively simple way to do this, and I just don't know how to google to get all the relevant parts.
A simple preg_replace will do what you want.
Try:
$string = preg_replace('#(/(?:u|r)/[a-zA-Z0-9_-]+)#', '[\1](http://reddit.com\1)', $string);
Here is an example: http://ideone.com/dvz2zB
You should find out which characters are valid in a subreddit name and in a Reddit username, and modify the [a-zA-Z0-9_-] character class accordingly.
You are looking for a regular expression.
A basic pattern starts out as a fixed string: /u/ or /r/, each of which matches exactly that text. The two can be combined into /(?:u|r)/, which matches either one. Next you want to match everything from that point up to a space. Use a negated character class, [^ ], which matches any character that is not a space, and apply the * quantifier to match as many such characters as possible: /(?:u|r)/[^ ]*
You can take that pattern further and add a lookbehind, (?<= ), to ensure your match is preceded by a space so you're not matching part of a longer string, which results in (?<= )/(?:u|r)/[^ ]*. Wrap all of that in parentheses to make a capturing group: ((?<= )/(?:u|r)/[^ ]*). This captures the matched text so the replacement can refer to it. You can express your chosen replacement using the \1 reference to the first captured group as [\1](http://reddit.com\1).
In PHP you would pass the matching pattern, replacement pattern, and subject string to the preg_replace function.
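Put together, the steps described above might look like the following sketch. The subject string is the one from the question; $linked is an illustrative name, and $1 is used as the equivalent of \1 in the replacement.

```php
<?php
$string = 'Contact /u/someone on reddit, or visit /r/subreddit or /r/subreddit2';

// Lookbehind for a space, then /u/ or /r/, then everything up to the next space.
$pattern = '#((?<= )/(?:u|r)/[^ ]*)#';

// $1 refers to the first capture group.
$replacement = '[$1](http://reddit.com$1)';

$linked = preg_replace($pattern, $replacement, $string);
echo $linked, PHP_EOL;
```

Two limitations of this particular pattern are worth keeping in mind: the lookbehind requires a preceding space, so a name at the very start of the string is not matched, and [^ ]* also swallows trailing punctuation such as a comma.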
In my opinion regex would be overkill for such a simple operation. If you just want to replace a fixed instance of "/r/x" with "[/r/x](http://reddit.com/r/x)" and "/u/x" with "[/u/x](http://reddit.com/u/x)", you can use str_replace, although preg_replace will need less code.
str_replace("/r/x","[/r/x](http://reddit.com/r/x)","whatever_string");
Use regex for intricate search-and-replace. You can also use the regular expression generator at http://www.jslab.dk/tools.regex.php if you have something complex to capture in the string.

PHP Regular Expression - Extract Data

I have a long string, and am trying to extract specific data that is delimited in that string by specific words.
For example, here is a subset of the string:
Current Owner 123 Capital Calculated
I am looking to extract
123 Capital
and as you can see it is surrounded by "Current Owner" (with a bunch of arbitrary spaces) to the left and "Calculated" (again with arbitrary spaces) to the right.
I tried this, but I'm a bit new at RegEx. Can anyone help me create a more effective RegEx?
preg_match("/Owner[.+]Calculated/",$inputString,$owner);
Thanks!
A character class defines a set of characters and means "match one character from this set", so [.+] matches a single literal dot or plus sign. Place the dot . and the quantifier inside a capturing group instead, and enable the s modifier, which lets the dot match newlines as well.
preg_match('/Owner(.+?)Calculated/s', $inputString, $owner);
echo trim($owner[1]);
Note: + is a greedy quantifier, meaning it will match as much as it can while still allowing the remainder of the regex to match. Use +? instead to prevent greediness; it means "one or more, preferably as few as possible".
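The difference between + and +? only shows up when the closing word occurs more than once. The sketch below assumes a made-up input with two records (the second record, "456 Holdings", is hypothetical and not from the question) to illustrate it.

```php
<?php
// Hypothetical input with two "Owner ... Calculated" records.
$inputString = 'Current Owner   123 Capital   Calculated ... Owner   456 Holdings   Calculated';

// Greedy: .+ runs to the LAST "Calculated" in the string.
preg_match('/Owner(.+)Calculated/s', $inputString, $greedy);

// Lazy: .+? stops at the FIRST "Calculated" after "Owner".
preg_match('/Owner(.+?)Calculated/s', $inputString, $lazy);

echo trim($greedy[1]), PHP_EOL;
echo trim($lazy[1]), PHP_EOL;
```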
You can use lookarounds as
(?<=Owner)\s*.*?(?=\s+Calculated)
Example usage
$str = "Current Owner 123 Capital Calculated ";
preg_match("/(?<=Owner)\s*.*?(?=\s+Calculated)/", $str, $matches);
print_r($matches);
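This gives the following output (the match still includes the whitespace that \s* consumed after "Owner", so trim() the result if needed):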
Array ( [0] => 123 Capital )
Hope this helps, group index #1 is your target:
Owner\s+(\d+\s+\w+)\s+Calculated
You may also want to try a tool like RegExr to help you learn/tinker.

Finding match, removing the bits I don't want, and then putting it back in

I'm trying to parse through a file and find a particular match, filter it in some way, and then print that data back into the file with some of the characters removed. I've been trying different things for a couple of hours with preg_split and preg_replace, but my regular expression knowledge is limited so I haven't made much progress.
I have a large file that has many instances like this [something]{title:value}. I want to find everything between "[" and "}" and remove everything besides the "something" bit.
After that part's done I want to find everything between "{" and "}" on everything left like {title:value} and then remove everything besides the "value" part. I'm sure there is some simple method to do this, so even just a resource on how to get started would be helpful.
Not sure if I get your meaning right (and haven't touched PHP for months), what about this?
$matches = array();
preg_match_all("/\[(.*?)\]\{.*?:(.*?)\}/", $str, $matches);
$something = $matches[1]; // $something stores all texts in the "something" part
$value = $matches[2]; // $value stores all texts in the "value" part
Doc for preg_match_all
For the regex pattern \[(.*?)\]\{.*?:(.*?)\}:
We escape all the [, ], { and } with a backslash because these characters have a special meaning in regex and need an escape to match the literal character.
.*? is a lazy match-all, which matches any characters only until the next token in the pattern can match. It is used instead of .* so that it won't run past the closing symbols.
(.*?) is a capturing group, getting what we need; PHP will put those matches in the $matches array.
So the entire thing is: match the [ character, then any string until the ] character and put it in capturing group 1, then the ]{ characters, then any string until the : character (no capturing group because we don't care about it), then the : character, then any string until the } character and put it in capturing group 2.
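As a quick check of the walkthrough above, here is a small sketch that applies the same pattern to a made-up sample string (the sample text and its titles are assumptions for illustration).

```php
<?php
// Hypothetical sample text containing two [something]{title:value} instances.
$str = 'intro [first]{title:alpha} middle [second]{name:beta} end';

preg_match_all('/\[(.*?)\]\{.*?:(.*?)\}/', $str, $matches);

$something = $matches[1]; // every "something" part, in order of appearance
$value     = $matches[2]; // every "value" part, in the same order

print_r($something);
print_r($value);
```

With the default flags, preg_match_all groups results by capture group, so $matches[1][0] and $matches[2][0] belong to the same instance.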
You can do it in one shot:
$txt = preg_replace('~\[\K[^]]*(?=])|{[^:}]+:\K[^}]+(?=})~', '', $txt);
\K discards from the match result everything that was matched to its left.
The lookahead (?=...) (followed by) performs a check but adds nothing to the match result.
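One thing to watch with the \K version: because only the text after \K counts as the match, replacing it with an empty string removes the "something" and "value" text itself and leaves the brackets behind. If the goal is the opposite (keep those parts and drop the surrounding markup), a capture-group variant is one possible approach. The sketch below assumes that reading of the question; the sample text and the second pattern are illustrative, not taken from the answer.

```php
<?php
// Hypothetical sample text.
$txt = 'a [something]{title:value} b {title:other} c';

// The \K pattern from the answer: the matched (and removed) text is only
// what follows \K, i.e. the bracket contents and the value.
$blanked = preg_replace('~\[\K[^]]*(?=])|{[^:}]+:\K[^}]+(?=})~', '', $txt);

// Assumed alternative: capture the parts to keep and replace each whole
// construct with its captured text. A group that did not take part in the
// match is substituted as an empty string.
$kept = preg_replace(
    '~\[([^\]]*)\]\{[^}]*\}|\{[^:}]+:([^}]+)\}~',
    '$1$2',
    $txt
);

echo $blanked, PHP_EOL;
echo $kept, PHP_EOL;
```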

What is the use of '\G' anchor in regex?

I'm having a difficulty with understanding how \G anchor works in PHP flavor of regular expressions.
I'm inclined to think (even though I may be wrong) that \G is used instead of ^ in situations when multiple matches of the same string are taking place.
Could someone please show an example of how \G should be used, and explain how and why it works?
UPDATE
\G forces the pattern to only return matches that are part of a continuous chain of matches. From the first match each subsequent match must be preceded by a match. If you break the chain the matches end.
<?php
$pattern = '#(match),#';
$subject = "match,match,match,match,not-match,match";
preg_match_all( $pattern, $subject, $matches );
// Outputs "match" 5 times: without \G the engine skips "not-" and still
// matches the "match," inside "not-match,"
foreach ( $matches[1] as $match ) {
    echo $match . '<br />';
}
echo '<br />';
$pattern = '#(\Gmatch),#';
$subject = "match,match,match,match,not-match,match";
preg_match_all( $pattern, $subject, $matches );
// Outputs "match" only 4 times: at "not-match" the chain is broken,
// so matching stops there
foreach ( $matches[1] as $match ) {
    echo $match . '<br />';
}
?>
This is straight from the docs
The fourth use of backslash is for certain simple assertions. An assertion specifies a condition that has to be met at a particular point in a match, without consuming any characters from the subject string. The use of subpatterns for more complicated assertions is described below. The backslashed assertions are
\G
first matching position in subject
The \G assertion is true only when the current matching position is at the start point of the match, as specified by the offset argument of preg_match(). It differs from \A when the value of offset is non-zero.
http://www.php.net/manual/en/regexp.reference.escape.php
You will have to scroll down that page a bit but there it is.
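The difference from \A that the manual mentions can be seen with the offset argument of preg_match(). This is a minimal sketch; the subject string and variable names are made up for illustration.

```php
<?php
$subject = 'xxfoo';

// Start matching at offset 2, where "foo" begins.
// \G is true at the start offset, so this matches.
$withG = preg_match('/\Gfoo/', $subject, $m1, 0, 2);

// \A only matches at the real start of the subject, so with a
// non-zero offset it cannot match.
$withA = preg_match('/\Afoo/', $subject, $m2, 0, 2);

// \G also prevents the engine from scanning ahead: at offset 1 the
// next character is "x", so there is no match.
$skipped = preg_match('/\Gfoo/', $subject, $m3, 0, 1);

var_dump($withG, $withA, $skipped);
```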
There is a really good example in Ruby, but it works the same in PHP.
How the Anchor \z and \G works in Ruby?
\G will match the match boundary, which is either the beginning of the string, or the point where the last character of last match is consumed.
It is particularly useful when you need to do complex tokenization, while also making sure that the tokens are valid.
Example problem
Let us take the example of tokenizing this input:
input 'some input in quote' more input '\'escaped quote\'' lots#_$of_fun ' \' \\ ' crazy'stuff'
Into these tokens (I use ~ to denote end of string):
input~
some input in quote~
more~
input~
'escaped quote'~
lots#_$of_fun~
' \ ~
crazy~
stuff~
The string consists of a mix of:
Singly quoted string, which allows the escape of \ and ', and spaces are conserved. Empty string can be specified using singly quoted string.
OR unquoted string, which consists of a sequence of non-white-space characters, and does not contain \ or '.
Space between 2 unquoted string will delimit them. Space is not necessary to delimit other cases.
For the sake of simplicity, let us assume the input does not contain new line (in real case, you need to consider it). It will add to the complexity of the regex without demonstrating the point.
The RAW regex for singly quoted string is '(?:[^\\']|\\[\\'])*+'
And the RAW regex for unquoted string is [^\s'\\]++
You don't need to care too much about the two pieces of regex above, though.
The solution below with \G can make sure that when the engine fails to find any match, all characters from the beginning of the string to the position of the last match have been consumed. Since it cannot skip characters, the engine will stop matching when it fails to find a valid match for either token specification, rather than grabbing random stuff in the rest of the string.
Construction
At the first step of construction, we can put together this regex:
\G(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++))
Or simply put (this is not regex - just to make it easier to read):
\G(Singly_quote_regex|Unquoted_regex)
This will match the first token only, since when it attempts matching for the 2nd time, the match stops at the space before 'some input....
We just need to add a bit to allow for 0 or more space, so that in the subsequent match, the space at the position left off by the last match is consumed:
\G *+(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++))
The regex above will now correctly identify the tokens, as seen here.
The regex can be further modified so that it returns the rest of the string when the engine fails to retrieve any valid token:
\G *+(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++)|((?s).+$))
Since the alternation is tried in order from left to right, the last alternative ((?s).+$) will be matched if and only if the string ahead doesn't make up a valid single quoted or unquoted token. This can be used to check for errors.
The first capturing group will contain the text inside single quoted string, which needs extra processing to turn into the desired text (it is not really relevant here, so I leave it as an exercise to the readers). The second capturing group will contain the unquoted string. And the third capturing group acts as an indicator that the input string is not valid.
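For PHP, the final regex could be wired up roughly as in the sketch below. The tokenize() function, the TOKEN_PATTERN constant and the shape of the returned array are hypothetical names chosen for illustration; the unescaping step is one possible way to do the "extra processing" mentioned above.

```php
<?php
// The final regex from the text, kept in a nowdoc so that no PHP-level
// escaping is needed. TOKEN_PATTERN is an illustrative name.
const TOKEN_PATTERN = <<<'REGEX'
~\G *+(?:'((?:[^\\']|\\[\\'])*+)'|([^\s'\\]++)|((?s).+$))~
REGEX;

// Hypothetical helper: returns the tokens plus the unparsed remainder
// (or null when the whole input was valid).
function tokenize(string $input): array
{
    $tokens = [];
    preg_match_all(TOKEN_PATTERN, $input, $matches, PREG_SET_ORDER);

    foreach ($matches as $m) {
        if (isset($m[3]) && $m[3] !== '') {
            // Group 3 matched: the rest of the input is not a valid token.
            return ['tokens' => $tokens, 'error' => $m[3]];
        }
        if (isset($m[2]) && $m[2] !== '') {
            // Group 2: unquoted token, used as-is.
            $tokens[] = $m[2];
        } else {
            // Group 1: quoted token; turn \\ into \ and \' into '.
            $tokens[] = strtr($m[1] ?? '', ['\\\\' => '\\', '\\\'' => "'"]);
        }
    }

    return ['tokens' => $tokens, 'error' => null];
}

$input = <<<'TXT'
foo 'bar baz' qux'it\'s'
TXT;

$result = tokenize($input);
print_r($result['tokens']);
```

Because every match must start where the previous one ended, an unterminated quote is reported through the third group instead of being silently skipped.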
Demo for the final regex
Conclusion
The above example demonstrates one scenario where \G is used in tokenization. There may be other uses that I haven't come across.

Get word from string - PHP

I am trying to extract a word that matches a specific pattern from various strings.
The strings vary in length and content.
For example:
I want to extract any word that begins with jac from the following strings and populate an array with the full words:
I bought a jacket yesterday.
Jack is going home.
I want to go to Jacksonville.
The resulting array should be [jacket,Jack,Jacksonville]
I have been trying to use preg_match() but for some reason it won't work. Any suggestions???
$q = "jac";
$str = "jacket";
preg_match($q,$str,$matches);
print $matches[1];
This returns null :S. I dunno what the problem is.
You can use preg_match as:
preg_match("/\b(jac.+?)\b/i", $string, $matches);
See it
You've got to read the manual a few hundred times and it will eventually come to you.
Otherwise, what you're trying to capture can be expressed as "look for 'jac' followed by 0 or more letters* and make sure it's not preceded by a letter" which gives you: /(?<!\\w)(jac\\w*)/i
Here's an example with preg_match_all() so that you can capture all the occurrences of the pattern, not just the first:
$q = "/(?<!\\w)(jac\\w*)/i";
$str = "I bought a jacket yesterday.
Jack is going home.
I want to go to Jacksonville.";
preg_match_all($q,$str,$matches);
print_r($matches[1]);
Note: by "letter" I mean any "word character." Officially, it includes numbers and other "word characters." Depending on the exact circumstances, one may prefer \w (word character) or \b (word boundary.)
You can include extra characters by using a character class. For instance, in order to match any word character as well as single quotes, you can use [\w'] and your regexp becomes:
$q = "/(?<!\\w)(jac[\\w']*)/i";
Alternatively, you can add an optional 's to your existing pattern, so that you capture "jac" followed by any number of word characters optionally followed by "'s"
$q = "/(?<!\\w)(jac\\w*(?:'s)?)/i";
Here, the ?: inside the parentheses means that you don't actually need to capture their content (because they're already inside a pair of parentheses, it's unnecessary), and the ? after the parentheses means that the match is optional.
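For comparison, here is a sketch of the \b (word boundary) variant mentioned in the note above. It also shows why the attempt in the question returned nothing: a PCRE pattern in PHP needs delimiters (so "jac" on its own is rejected), and $matches[1] only exists when the pattern contains a capture group. The extra word "hijack" in the sample text is an assumption added to show that the boundary check works.

```php
<?php
// Sample sentences from the question, plus a hypothetical "hijack"
// to show that words merely containing "jac" are not picked up.
$str = "I bought a jacket yesterday.\nJack is going home.\nI want to go to Jacksonville.\nDo not hijack this.";

// Delimiters (/.../), a word boundary before "jac", then any word characters.
// No capture group is used, so the words are in $matches[0].
preg_match_all('/\bjac\w*/i', $str, $matches);

$words = $matches[0];
print_r($words);
```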
