php non-greedy regex problem - php

demo:
$str = 'bcs >Hello >If see below!';
$repstr = preg_replace('/>[A-Z0-9].*?see below[^,\.<]*/','',$str);
echo $repstr;
What I want this tiny programme to output is "bcs >Hello ", but in fact it outputs only "bcs ".
What's wrong with my pattern?

I think the problem is that you're misinterpreting how a non-greedy quantifier acts. Once it's in operation, yes, it stops earlier than it would otherwise. But it isn't aware of what comes before it (or, potentially, of the text that comes later either). It's only concerned with its current position. Hence, the regular expression you posted will match all of:
">Hello >If see below!"
Let's see how this works:
/>[A-Z0-9].*?see below[^,\.<]*/
The regex first looks for ">" in "bcs >Hello >If see below!", and finds the first one, which is the one right before "Hello". Ok, let's check the next part of the expression:
[A-Z0-9]
The next char is a H, which matches the pattern [A-Z0-9]. Still good! Next:
.*?
Now we match non-newline characters, as few as possible, until we reach the first place where the rest of the expression, "see below[^,.<]*", can match. If we had used a plain greedy quantifier, we would have matched through multiple occurrences of "see below[^,.<]*" up to the last possible one. (So if your string had continued, and there had been other text matching that pattern, it would have been captured as well.) The non-greedy quantifier doesn't mean that your whole pattern will return the smallest possible match of all possible matches in the string. It only dictates how that particular quantifier behaves; the match still starts at the leftmost position where the whole pattern can succeed, which here is the first ">".
You might want to try the following pattern then:
/>[A-Z0-9][^>]*?see below[^,\.<]*/
Hopefully this clears it up!
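To see the difference in practice, here is a minimal sketch that runs both patterns against the string from the question. The variable names $lazyDot and $noGt are mine, not the asker's.

```php
<?php
// Subject string from the question.
$str = 'bcs >Hello >If see below!';

// Original pattern: the match starts at the first ">" and the lazy .*?
// expands until "see below" can match, swallowing ">Hello " on the way.
$lazyDot = preg_replace('/>[A-Z0-9].*?see below[^,\.<]*/', '', $str);

// Suggested pattern: [^>]*? cannot cross another ">", so the attempt that
// starts at the first ">" fails and the match begins at ">If" instead.
$noGt = preg_replace('/>[A-Z0-9][^>]*?see below[^,\.<]*/', '', $str);

var_dump($lazyDot); // string(4) "bcs "
var_dump($noGt);    // string(11) "bcs >Hello "
```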

Why don't you write it like this:
$str = 'bcs >Hello >If see below!';
$repstr = preg_replace('/>If see below[^,\.<]*/','',$str);
echo $repstr;

This might be a good alternative to what you have.
The problem with your regexp is that instead of selecting what you want, you are selecting what you don't want and replacing that with an empty string.
The best approach, in my opinion, is to select what you want, which is what the code below does. What you end up with is whatever the first sub-pattern matched; if the pattern doesn't match, you get your string back unchanged.
$str = 'bcs >Hello >If see below!';
$repstr = preg_replace('/^([\w]+ >[\w]+).*?see below.*?$/i', '$1', $str);
var_dump($repstr);
I hope this helps.

Related

Regex negative match in middle of string

I have these strings (checking every line separately):
adminUpdate
adminDeleteEx
adminEditXe
adminExclude
listWebsite
listEx
Now I want to match anything that starts with admin but does not end with Ex (case-insensitive)
So after applying regex I must match:
adminUpdate
adminEditXe
adminExclude
My current regex is /^admin[a-z]+(?!ex)$/gi but it matches anything that starts with admin
Just a slight change:
/^admin[a-z]+(?<!ex)$/gi
               ^
Turn your look-ahead into a look-behind.
It is quite hard to explain in the current form. Basically, you need to reach the end of the string $ for the regex to match, and when it happens, (?!ex) is at the end of the string, so it can't see anything ahead. However, since we are at the end of the string, we can use a look-behind (?<!ex) to check whether the string ends with ex or not.
Since look-around is zero-width, we can swap the position of (?<!ex) and $ without changing the meaning (it does change how the engine searches for the matched string, though):
/^admin[a-z]+$(?<!ex)/gi
It is counter-intuitive to write it this way, but easier to see where my argument goes.
Another way to look at it: because (?!ex) and $ are both zero-width assertions, they are checked at the same position, and being at the end of the string ($) implies there is nothing ahead to see.
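The argument above can be checked with a small sketch in PHP. It assumes the six names from the question are held in an array and uses preg_grep to filter them; the g flag is dropped because PHP has no such modifier, and the variable names are my own.

```php
<?php
$names = ['adminUpdate', 'adminDeleteEx', 'adminEditXe', 'adminExclude', 'listWebsite', 'listEx'];

// Look-ahead placed right before $: at the end of the string nothing lies
// ahead, so (?!ex) always succeeds and every admin* name is accepted.
$lookahead = array_values(preg_grep('/^admin[a-z]+(?!ex)$/i', $names));

// Look-behind at the same position inspects the last two characters instead.
$lookbehind = array_values(preg_grep('/^admin[a-z]+(?<!ex)$/i', $names));

print_r($lookahead);  // adminUpdate, adminDeleteEx, adminEditXe, adminExclude
print_r($lookbehind); // adminUpdate, adminEditXe, adminExclude
```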
^(?!.*Ex$)admin.*$
Try this. See demo.
https://regex101.com/r/sH8aR8/23
$re = '/^(?!.*Ex$)admin.*$/mi'; // m: ^ and $ match at every line; i: case-insensitive, as the question asks
$str = "adminUpdate\nadminDeleteEx\nadminEditXe\nadminExclude\nlistWebsite\nlistEx";
preg_match_all($re, $str, $matches);

Getting String in Text

I can't seem to figure out exactly how to get a certain portion of text I need out of a larger block. I need to be able to grab a string between a : and a ( in a regular block of text while parsing an email. An example is below. I know a bit about preg_match and I figured that was the answer, but can't seem to get that to work either. Any help would be appreciated as searches of here and Google have turned up nothing.
For GC3J26P: Wise Lake II (Traditional)
I need just the text between the : and the beginning parentheses. Thanks for any help.
Since you say you've tried and not managed to get anything going, try this:
$str = "For GC3J26P: Wise Lake II (Traditional)";
preg_match('/(?<=:).*(?=\()/', $str, $matches);
if ($matches) echo $matches[0];
If you're new to the murky - yet beautiful - world of REGEX, allow me to explain what's happening here.
It's all about the pattern. Our pattern defines what is and is not acceptable in what we match - i.e. capture.
More than that, it even states what should immediately precede and follow our match. These are called look-behind and look-ahead assertions, respectively. These assertions are anchors for our matching points - they do not contribute to the captured match itself.
So our pattern translates as:
1) begin the match after a colon (but do not include the colon in the match)
2) then allow and capture any character (barring line breaks), zero or more times
3) match up to (but not including) an opening bracket
Our pattern is what's called greedy. In our case, this means that, should the sub-string you wish to match itself contain a colon or bracket, this will be no problem and won't break things. As long as there is a valid match available starting from SOME colon, and up to SOME opening bracket, all's fine. (Note: greedy behaviour can be modified, if required).
There's far, far more to REGEX and you either love it or hate it. If you're interested, I suggest reading up on it. It's very satisfying once you get into it.
And with that, it's bed time.
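As a rough illustration of the look-around pattern and of the greedy behaviour described above, here is a short sketch. The second subject string, $tricky, is a made-up example of mine with an extra pair of brackets; it is not from the question.

```php
<?php
$str = 'For GC3J26P: Wise Lake II (Traditional)';

// The look-behind and look-ahead keep ":" and "(" out of the match,
// but the surrounding spaces are still part of it, hence trim().
preg_match('/(?<=:).*(?=\()/', $str, $m);
$name = trim($m[0]);
echo $name, "\n"; // Wise Lake II

// Hypothetical subject with two opening brackets after the colon.
$tricky = 'For GC3J26P: Lake (North) Loop (Traditional)';

// Greedy .* runs to the LAST "(" that still lets the look-ahead succeed.
preg_match('/(?<=:).*(?=\()/', $tricky, $greedy);

// Lazy .*? stops at the FIRST "(" instead.
preg_match('/(?<=:).*?(?=\()/', $tricky, $lazy);

var_dump($greedy[0]); // " Lake (North) Loop "
var_dump($lazy[0]);   // " Lake "
```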
You can try this:
$string = 'For GC3J26P: Wise Lake II (Traditional)';
preg_match('/\:(.*?)\(/', $string, $matches);
echo $matches[0]; // : Wise Lake II (
echo $matches[1]; // Wise Lake II
Here it is:
I used a 'named sub-pattern' (named it "name", but could be called almost anything)...
$str="For GC3J26P: Wise Lake II (Traditional)";
preg_match('/:(?P<name>[\S\s]+)\(/', $str, $matches);
echo $matches["name"];

Need php regex between 2 sets of chars

I need a regular expression for php that outputs everything between <!--:en--> and <!--:-->.
So for <!--:en-->STRING<!--:--> it would output just STRING.
EDIT: oh, and the following <!--:--> needs to be the first one after <!--:en-->, because there are more in the text.
The one you want is actually not too complicated:
/<!--:en-->(.*?)<!--:-->/gi
Your matches will be in capture group 1.
Explanation:
The .*? is a lazy quantifier. Basically, it means "keep matching until you find the shortest string that will still fit this pattern." This is what will cause the matching to stop at the first instance of <!--:-->, rather than sucking up everything until the last <!--:--> in the document.
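Note that PHP has no g modifier: preg_match returns the first match and preg_match_all returns every match. Usage is something like preg_match_all('/<!--:en-->(.*?)<!--:-->/is', $input, $matches), after which the captured strings are in $matches[1]. (The s modifier lets . match newlines, in case the text spans several lines.)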
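A minimal sketch of the lazy versus greedy behaviour in PHP follows. The sample input, including the <!--:de--> block, is an assumption of mine, made up to show several closing markers in one text.

```php
<?php
// Hypothetical input with two English blocks and one German block.
$input = '<!--:en-->Hello<!--:--><!--:de-->Hallo<!--:--><!--:en-->World<!--:-->';

// Lazy: each match ends at the first <!--:--> after <!--:en-->.
preg_match_all('/<!--:en-->(.*?)<!--:-->/is', $input, $lazy);

// Greedy: a single match that runs to the last <!--:--> in the text.
preg_match('/<!--:en-->(.*)<!--:-->/is', $input, $greedy);

print_r($lazy[1]);      // Hello, World
var_dump($greedy[1]);   // "Hello<!--:--><!--:de-->Hallo<!--:--><!--:en-->World"
```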
If you have just that input
$input = '<!--:en-->STRING<!--:-->';
You can try with
$output = strip_tags($input);
Try:
^<!--:en-->(.*)<!--:-->$
I don't think any of the other characters need to be escaped.
<!--:en--\b[^>]*>(.*?)<!--:-->
This will match the things between your tags. This will break if you nest your tags, but you didn't say you were doing that :)

Regular expression anchor text for a link

I am trying to pull the anchor text from a link that is formatted this way:
<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>
I want only the anchor text for the link : "i_want_this"
"variable_text" varies according to the filename so I need to ignore that.
I am using this regex:
<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>
This, of course, matches the complete link.
PHP's preg_* functions are built on PCRE (Perl Compatible Regular Expressions), which is very close to Perl's own regex syntax. If you want to know a lot about regex, visit perlretut.org. Also, look into regex tools like Expresso.
For your use, know that regex quantifiers are greedy by default. That means that when you specify something, followed by anything (any number of repetitions), followed by something else, the "anything" keeps going as far as it can and stops only at the last place where that second something still matches.
to be more clear, what you want is this:
<a href="
any character, any number of times (regex = .* )
">
any character, any number of times (regex = .* )
</a>
Beyond that, you want to capture the second group of "any character, any number of times". You can do that using what are called capture groups (anything inside parentheses is captured as a group for later reference; referring back to a group is called a back reference).
I would also look into named subpatterns, too - with those, you can reference your choice with a human readable string rather than an array index. The syntax for those in PHP is (?P<name>pattern) where name is the name you want and pattern is the actual regex. I'll use that below.
So all that being said, here's the "lazy web" for your regex:
<?php
$str = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
$regex = '/(<a href\=".*">)(?P<target>.*)(<\/a>)/';
preg_match($regex, $str, $matches);
print $matches['target'];
?>
//This should output "i_want_this"
Oh, and one final thought. Depending on what you are doing exactly, you may want to look into SimpleXML instead of using regex for this. This would probably require that the tags we see are just snippets of a larger whole, as SimpleXML requires well-formed XML (or XHTML).
I'm sure someone will probably have a more elegant solution, but I think this will do what you want done.
Where:
$subject = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
Option 1:
$pattern1 = '/(<a href=")(.*)(">)(.*)(<\/a>)/i';
preg_match($pattern1, $subject, $matches1);
print($matches1[4]);
Option 2:
$pattern2 = '(<a href=")(.*)(">)(.*)(</a>)';
ereg($pattern2, $subject, $matches2); // note: ereg() was removed in PHP 7; prefer preg_match() as in Option 1
print($matches2[4]);
Do not use regex to parse HTML. Use a DOM parser. Specify the language you're using, too.
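A minimal sketch of the DOM approach in PHP, assuming the DOM extension is available and that the markup looks like the link implied by the regex in the question (the href value is taken from that regex; the variable names are mine):

```php
<?php
// Markup assumed from the question's regex.
$html = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Take the text of the first link whose href starts with the known prefix.
$anchorText = null;
foreach ($doc->getElementsByTagName('a') as $a) {
    if (strpos($a->getAttribute('href'), '/en/browse/file/') === 0) {
        $anchorText = $a->textContent;
        break;
    }
}

echo $anchorText, "\n"; // i_want_this
```

The prefix check replaces the "variable_text" part of the regex, so the file name itself never has to be matched.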
Since it's in a captured group and since you claim it's matching, you should be able to reference it through $1 or \1 depending on the language.
$blah = preg_match( $pattern, $subject, $matches );
print_r($matches);
The thing to remember is that a regex returns everything you searched for if it matches. You need to specify that you only care about the part you've surrounded in parentheses (the anchor text). I'm not sure what language you're using the regex in, but here's an example in Ruby:
string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)
puts data # => outputs '<a href="/en/browse/file/variable_text">i_want_this</a>'
If you specify what you want in parentheses, you can reference it:
string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)[1]
puts data # => outputs 'i_want_this'
Perl will have you use $1 instead of [1] like this:
$string = '<a href="/en/browse/file/variable_text">i_want_this</a>';
$string =~ m/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/;
$data = $1;
print $data . "\n";
Hope that helps.
I'm not 100% sure if I understand what you want. This will match the content between the anchor tags. The URL must start with /en/browse/file/, but may end with anything.
#(.*?)#
I used # as a delimiter as it made it clearer. It'll also help if you put them in single quotes instead of double quotes so you don't have to escape anything at all.
If you want to limit to numbers instead, you can use:
#(.*?)#
If it should have just 5 numbers:
#(.*?)#
If it should have between 3 and 6 numbers:
#(.*?)#
If it should have more than 2 numbers:
#(.*?)#
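The descriptions above can be sketched as concrete patterns. The sub-patterns used for the last path segment ([^"]*, \d+, \d{5}, \d{3,6}, \d{3,}) and the sample link are my assumptions, chosen to fit each description; they are not taken from the answer.

```php
<?php
// Hypothetical link whose last path segment is five digits.
$link = '<a href="/en/browse/file/12345">i_want_this</a>';

// One candidate pattern per description, all using # as the delimiter.
$patterns = [
    'anything'    => '#<a href="/en/browse/file/[^"]*">(.*?)</a>#',
    'digits only' => '#<a href="/en/browse/file/\d+">(.*?)</a>#',
    'exactly 5'   => '#<a href="/en/browse/file/\d{5}">(.*?)</a>#',
    '3 to 6'      => '#<a href="/en/browse/file/\d{3,6}">(.*?)</a>#',
    'more than 2' => '#<a href="/en/browse/file/\d{3,}">(.*?)</a>#',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    // Group 1 holds the anchor text when the pattern matches.
    $results[$label] = preg_match($pattern, $link, $m) ? $m[1] : null;
}

print_r($results); // every entry is "i_want_this" for this sample link
```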
This should work:
<a href="[^"]*">([^<]*)
This says: take EVERYTHING you find until you meet a "
[^"]*
Same here: take everything with you until you meet a <
[^<]*
The parentheses around [^<]*
([^<]*)
group it, so you can collect that data in PHP! If you look up preg_match in the PHP manual you will see many fine examples there!
Good luck!
And for your concrete example:
<a href="/en/browse/file/variable_text">([^<]*)
I use
[^<]*
because in some examples...
.*?
can be extremely slow! You shouldn't use that if you can use
[^<]*
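A short sketch of this pattern in PHP, assuming the markup implied by the question's regex (the href value is taken from that regex):

```php
<?php
// Markup assumed from the question's regex.
$str = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';

// [^"]* consumes the href value and [^<]* consumes the anchor text;
// neither class can run past its terminating character.
preg_match('/<a href="[^"]*">([^<]*)/', $str, $m);

echo $m[1], "\n"; // i_want_this
```

Because each negated class stops by construction, the engine never has to backtrack across the rest of the document looking for a later quote or bracket.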
You should use the tool Expresso for creating regular expression... Pretty handy..
http://www.ultrapico.com/Expresso.htm

regular expression to strip attributes and values from html tags

Hi guys, I'm very new to regex. Can you help me with this?
I have a string like this "<input attribute='value' >" where attribute='value' could be anything, and I want to do a preg_replace to get just <input />
How do I specify a wildcard to replace any number of any characters in a string?
like this? preg_replace("/<input.*>/",$replacement,$string);
Many thanks
What you have:
.*
will match "any character, and as many as possible".
What you mean is
[^>]+
which translates to "any character that's not a >, and there must be at least one",
or alternatively,
.*?
which means
"any character, but only enough to make this rule work"
BUT DON'T
Parsing HTML with regexps is Bad
Use any of the existing HTML parsers, DOM libraries, anything, just NOT NAïVE REGEX
For example:
<foo attr=">">
Will get grabbed wrongly by regex as
'<foo attr=" ' with following text of '">'
Which will lead you to this regex:
<[a-zA-Z]+( [a-zA-Z]+=['"][^"']*['"])*> etc etc
at which point you'll discover this lovely gem:
<foo attr="'>\'\"">
and your head will explode.
( the syntax highlighter verifies my point, and incorrectly matches thinking i've ended the tag. )
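The failure mode can be reproduced with a few lines of PHP. This is only a sketch: the sample tag is my own, and the DOM part assumes the DOM extension is available.

```php
<?php
// Hypothetical tag whose attribute value contains ">".
$html = '<input value=">" name="x">';

// The negated class stops at the ">" inside the attribute value,
// so the rest of the tag is left behind as stray text.
$byRegex = preg_replace('/<input[^>]*>/', '<input />', $html);
echo $byRegex, "\n"; // <input />" name="x">

// A DOM parser understands quoting, so the attributes can be removed safely.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$input = $doc->getElementsByTagName('input')->item(0);
while ($input->attributes->length > 0) {
    // Always remove the first remaining attribute until none are left.
    $input->removeAttribute($input->attributes->item(0)->nodeName);
}
$remaining = $input->attributes->length;
echo $doc->saveHTML($input), "\n";
```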
Some people were close... but not 100%:
This:
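preg_replace("/<input[^>]*>/", $replacement, $string);
should be this:
preg_replace("/<input[^>]*?>/", $replacement, $string);
You don't want that to be a greedy match. (Strictly speaking, [^>] can never consume a ">", so both versions match the same text here; the lazy form is harmless but not required.)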
preg_replace("/<input[^>]*>/", $replacement, $string);
// [^>] means "any character except the greater than symbol / right tag bracket"
This is really basic stuff, you should catch up with some reading. :-)
If I understand the question correctly, you have the code:
preg_replace("/<input.*>/",$replacement,$string);
and you want us to tell you what you should use for $replacement to delete what was matched by .*
You have to go about this the other way around. Use capturing groups to capture what you want to keep, and reinsert that into the replacement. E.g.:
preg_replace("/(<input).*(>)/","$1$2",$string);
Of course, you don't really need capturing groups here, as you're only reinserting literal text. But the above shows the technique, in case you want to do this in a situation where the tag can vary. This is a better solution:
preg_replace("/<input [^>]*>/","<input />",$string);
The negated character class is more specific than the dot. This regex will work if there are two HTML tags in the string. Your original regex won't.
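The last claim is easy to check with a small sketch. The sample string with two input tags is my own example, not from the question.

```php
<?php
// Hypothetical string containing two input tags with other markup between.
$string = "<input type='text' name='a' ><p>Hi</p><input type='text' name='b' >";

// Greedy dot: .* runs to the last ">" in the string, so everything
// from the first "<input" onwards is replaced in one go.
$greedy = preg_replace('/<input.*>/', '<input />', $string);

// Negated class: each tag is matched and replaced on its own.
$negated = preg_replace('/<input [^>]*>/', '<input />', $string);

var_dump($greedy);  // "<input />"
var_dump($negated); // "<input /><p>Hi</p><input />"
```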
