how can i write this regex? ungreedy related - php

I'm sorry for the poor title, but it is a very generic question
I have to match this pattern
;AAAAAAA(BBBBBB,CCCCC,DDDDDD)
AAAAA = all characters starting from ";" to "(" (both ;( not included)
BBBBB = all characters starting from "(" to "," (both (, not included)
CCCCC = all characters starting from "," to "," (both ,, not included)
DDDDD = all characters starting from "," to ")" (both ,) not included)
The "all characters between x and y" is a problem that kills me every time
:(
I'm using PHP and I have to match all occurrences of this pattern (preg_match_all) that also, sadly, can be on multiple lines
Thank you in advance!

I would recommend you do not use an ungreedy quantifier, but instead make all repetitions mutually exclusive with their delimiters. What does this mean? It means, for instance, that A can be any character except (. Giving this regex:
;([^(]*)[(]([^,]*),([^,]*),([^)]*)[)]
Where the last [)] is not even necessary.
The PHP code would then look like this:
preg_match_all('/;([^(]*)[(]([^,]*),([^,]*),([^)]*)[)]/', $input, $matches);
$fullMatches = $matches[0];
$arrayOfAs = $matches[1];
$arrayOfBs = $matches[2];
$arrayOfCs = $matches[3];
$arrayOfDs = $matches[4];
As the comments show, my escaping technique is a matter of taste. This regex is of course equal to:
;([^(]*)\(([^,]*),([^,]*),([^)]*)\)
But I think that looks a lot more mismatched/unbalanced than the other variant. Take your pick!
Finally, on the question of why this approach is better than using ungreedy (lazy) quantifiers. Here is some good, general reading. Basically, when you use ungreedy quantifiers, the engine still has to backtrack. It tries one repetition first, then notices that the ( after it doesn't match. So it has to go back into the repetition and consume another character. But then the ( still doesn't match, so it goes back to the repetition again. With the negated character class, however, the engine consumes as much as possible the first time it enters the repetition. Once all non-( characters are consumed, it can match the following ( right away.
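A minimal sketch of the negated-class pattern in use. The sample input is made up for illustration; its second record spans two lines, which works without the s modifier because a negated class such as [^,] also matches newlines.

```php
<?php
// Hypothetical sample input; the second record contains a line break.
$input = "x;foo(1,2,3) y;bar(a,\nb,c)";

// Each repetition excludes its own closing delimiter, so no lazy quantifier is needed.
$pattern = '/;([^(]*)\(([^,]*),([^,]*),([^)]*)\)/';

// PREG_SET_ORDER groups the results per match instead of per capture group.
preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[0] is the full match; $m[1]..$m[4] are the A, B, C and D parts.
    echo json_encode(array_slice($m, 1)), "\n";
}
```

PREG_SET_ORDER is a design choice here, not a requirement: the default ordering gives one array per group, as in the snippet above with $matches[1] to $matches[4].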

You could use something like this code:
preg_match_all('/;(.*?)\((.*?),(.*?),(.*?)\)/s',$text,$matches);
See it on ideone.com.
Basically, you can use .*? (question mark being ungreedy), make sure to escape the parentheses, and you may need the s modifier to have it work on multiple lines.
Variables would be in an array: $matches

Related

match the following pattern

I have a lot of (p. #) like (p. 13) (p. 234) in a string and I want to remove them. I used the following pattern to match but it doesn't work
preg_replace('/\(p\.*\)/','',$string);
\( to escape (
p is p
\. to escape .
I need some help here. Thank you.
This is the regular expression you're looking for:
\(p\.\s+\d+\)
Or, in your code:
preg_replace('/\(p\.\s+\d+\)/', '', $string);
Here's a fiddle.
You have nothing in your regexp to match the page number. You're just matching something like (p.......).
preg_replace('/\(p\.\s*\d+\s*\)/', '', $string);
The * means to match 0 or more of the preceding item, so it would be working on the . character in your example. Perhaps you might need something like:
preg_replace('/\(p\. ?[0-9]+\)/','',$string);
This matches:
(p.
Then a space, which the ? makes optional
Then one or more digits 0-9, due to the +
Then )
Hope this helps
First of all, preg_replace is a function, not a procedure: it returns the result instead of modifying the string you pass in. If you want to see any changes, you need to write:
$str = preg_replace($pattern, $replacement, $str);
In your pattern you wrote \.* which means a literal . zero or more times; I assume that is not what you want. I assume you wanted to write \..* (a literal . followed by zero or more characters). But this doesn't work either, for two reasons: it doesn't check that the characters are digits, and since the * quantifier is greedy, .* will match all the characters until the end of the line and then backtrack to the last closing parenthesis.
The best option is probably Barmar's pattern, which checks that there is at least one digit (using the + quantifier), or a stricter version of it (exactly one space instead of a variable number of spaces):
/\(p\. \d+\)/
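A short sketch that puts these points together: the result of preg_replace is assigned, and the pattern requires digits. The sample string is invented, and the leading \s* is an addition of mine (not part of the patterns above) so that no double space is left behind.

```php
<?php
// Hypothetical sample text with two page references.
$string = 'See Smith (p. 13) and Jones (p.234) for details.';

// \s* before the parenthesis also removes the space in front of each reference.
// \(p\.  literal "(p."   \s*  optional whitespace   \d+  one or more digits   \)  literal ")"
$clean = preg_replace('/\s*\(p\.\s*\d+\)/', '', $string);

echo $clean, "\n";
```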

preg_replace pattern to remove pNUMBERxNUMBER

Im trying to locate a pattern with preg_replace() and remove it...
I have a string, that contains this: p130x130/ and these numbers vary, they can be higher, or lower ... what I need to do is locate that string, and remove it, whole thing.
I've been trying to use this:
preg_replace('/p+[0-9]+x+[0-9]"/', '', $str);
but that doesnt work for some reason. Would any of you know the correct regexp?
Kind regards
You need to first remove the + quantifier after p then switch the + quantifier from after x and place it after your character class (e.g. x[0-9]+), also remove the quote " inside of your expression, which to me looks like a typo here. You can also use a different delimiter to avoid escaping the ending slash.
$str = preg_replace('~p[0-9]+x[0-9]+/~', '', $str);
If the ending slash is by mistake a typo as well, then this is what you're looking for.
$str = preg_replace('/p[0-9]+x[0-9]+/', '', $str);
Regex to match p130x130/ is,
p[0-9]+x[0-9]+\/
Try this:
$str = preg_replace("/p[0-9]+?x[0-9]+?\//is","",$str);
As mentioned by the comment I have to explain the code as I'm a teacher now.
I've used "/" as a delimiter, but you can use different characters to avoid slashing.
The part that says [0-9]+ matches any digit from 0 to 9 at least once, and more if possible. If I had put [0-9]*? then it would have matched an empty string too (as * means 0 or more, not 1 or more like +), which is probably not what you wanted anyway.
I've put the ? at the end to make it non-greedy, just a habit of mine but I don't think it's needed. (I used ereg a lot previously).
Anyway, it's going to find 0-9 until it hits an x, and then it does another match for more numbers until it hits a single forward slash. I've backslashed that slash because my delimiter is a slash also and I didn't want it to end there.
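As a quick check, here is a sketch that runs the greedy and the lazy variant against an invented URL (the URL is only an assumption for illustration). Both produce the same result, since the digits are bounded by x and / either way.

```php
<?php
// Hypothetical input containing the pNUMBERxNUMBER/ segment.
$str = 'https://example.com/images/p130x130/photo.jpg';

// ~ as delimiter, so the trailing slash needs no escaping.
$greedy = preg_replace('~p[0-9]+x[0-9]+/~', '', $str);

// Lazy quantifiers with / as delimiter; the slash must be escaped here.
$lazy = preg_replace('/p[0-9]+?x[0-9]+?\//', '', $str);

echo $greedy, "\n";
var_dump($greedy === $lazy);
```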

regexp solution for php

I have tried to work this out myself (even bought a Kindle book!), but I am struggling with backreferences in php.
What I want is like the following example:
var $html = "hello %world|/worldlink/% again";
output:
hello world again
I tried stuff like:
preg_replace('/%([a-z]+)|([a-z]+)%/', '\1', $html);
but with no joy.
Any ideas please? I am sure someone will post the exact answer but I would like an explanation as well please - so that I don't have to keep asking these questions :)
The slashes "/" are not included in your allowed range [a-z]. Instead use
preg_replace('/%([a-z]+)\|([a-z\/]+)%/', '\1', $html);
Your expression:
'/%([a-z]+)|([a-z]+)%/'
Is only capturing one thing. The | in the middle means "OR". You're trying to capture both, so you don't need an OR in there. You want a literal | symbol so you need to escape it:
'/%([a-z]+)\|([a-z\/]+)%/'
The / character also needs to be included in your char set, and escaped as above.
Your regex (/%([a-z]+)|([a-z]+)%/) reads this way:
Match % followed by + (= one or more) a-z characters (and store this into backreference #1).
Or (the |):
Match + (= one or more) a-z characters (and store this into backreference #2) followed by a %.
What you are looking for is:
preg_replace('~%([a-z]+)[|]([a-z/]+)%~', '$1', $html);
Basically I just escaped the | regex meta character (you can do this by either surrounding it with [] like I did or just prepending a backwards slash \, personally I find the former easier to read), and added a / to the second capture group.
I also changed your delimiters from / to ~ because tildes are much more unlikely to appear in strings, if you want to keep using / as your delimiter you also have to escape their occurrences in your regex.
It's also recommended that you use the $ syntax instead of \ in your replacement backreferences:
$replacement may contain references of the form \\n or (since PHP 4.0.4) $n, with the latter form being the preferred one.
Here is a version that works according to the OPs data/information provided (using a non-slash delimiter to avoid escaping slashes):
preg_replace('#%([a-z]+)\|([a-z/]+)%#', '\1', $html);
Using a non slash delimiter, would alleviate the need to escape slashes.
Outputs:
hello world again
The Explanation
Why yours did not work. First up the | is an OR operator, and, in your example, should be escaped. Second up, since you are using /'s or expect slashes it is better to use a non-slash delimiter, such as #. Third up, the slash needed to be added to list of allowed matches. As stated before you may want to include a bit more options, as any type of word with numbers underscores periods hyphens will fail / break the script. Hopefully that is the explanation you were looking for.
Here's what works for me:
preg_replace('/%([a-z]+)\|([a-z\/]+)%/', '\1', $html);
Your regular expression doesn't escape the |, and doesn't include the proper characters for the URL.
Here's a basic live example supporting only a-z and slashes:
preg_replace('/%([a-z]+)\|([a-z\/]+)%/', '\1', $html);
In reality, you're going to want to change those [a-z]+ blocks to something more expressive. Do some searches for URL-matching regular expressions, and pick one that fits what you want.
$html = "hello %world|/worldlink/% again";
echo preg_replace('/([A-Za-z_ ]*)%(.+)\|(.+)%([A-Za-z_ ]*)/', '$1$2$4', $html);
output:
hello world again
here is a working code : http://www.ideone.com/0qhZ8
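To make the effect of the unescaped | concrete, this sketch runs the original pattern and an escaped one side by side on the question's input. With the alternation, only the %world part matches, so the rest of the token is left in place.

```php
<?php
$html = 'hello %world|/worldlink/% again';

// Unescaped |: the pattern means "%letters" OR "letters%", never the whole token.
$broken = preg_replace('/%([a-z]+)|([a-z]+)%/', '$1', $html);

// Escaped |, and / allowed in the second group; ~ avoids escaping the slashes.
$fixed = preg_replace('~%([a-z]+)\|([a-z/]+)%~', '$1', $html);

echo $broken, "\n"; // hello world|/worldlink/% again
echo $fixed, "\n";  // hello world again
```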

preg_match basics question

Got some trouble with my preg_match.
The code.
$text = "tel: 012 213 123. mobil: 0303 11234 \n address: street 14";
$regex_string = '/(tel|Tel|TEL)[\s|:]+(.+)[\.|\n]/';
preg_match($regex_string , $text, $match);
And I get this result in $match[2]
"012 213 123. mobil: 023 123 123"
First question.
I want the regex to stop at the . (dot) but it doesn't.
Can someone explain why it doesn't?
Second question.
preg_match uses () to get their match.
Is it possible to skip the parentheses surrounding the different "Tel" and still get the same functionality?
Thnx all stackoverflow is great :D
This should do:
/tel(?:\s|:)+([^.]+)(?:\.|$)/i
+ is a greedy quantifier, which means it'll match as many characters as possible.
To your second question: in this particular case you just need to use case-insensitive match (i flag). Generally, you could use (?:...) syntax, example of which you could see in the end match. Square brackets are used for character classes.
If you're simply trying to extract a phone number out of that line, and it's guaranteed to be 11 numbers, you could simply use this:
$text = 'tel: 012 213 123. mobil: 0303 11234';
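$phone_number = substr(preg_replace('/[^\d]/', '', $text), 0, 11);
With your example, $phone_number would be 01221312303. Note that the tel number in this example has only 9 digits, so the last two digits come from the mobil number.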
How this works is any non-digit is replaced with an empty string, removing it, and then the first 11 numbers are returned.
No idea - your regex works for me (I get "012 213 123" in $match[2] with your code). The fact that the mobile phone differs between the two might indicate that it's not really the output of your code; check again.
Some other things - if you happen to have more dots in the line ("tel: xxx. phone: xxx. fax: xxx" for example), you will get bad results - use non-greedy operators ("get least chunk that matches" .*? instead of "get biggest chunk that matches" .*) or limit the repeated characters ("any number of non-periods" [^.]*). Also, you could spare yourself the trouble by making the regex case-insensitive (unless you really hate people typing "tEl").
Your other question: (?:stuff) will match "stuff" just like (stuff), but will not capture it.
Useful link: http://www.regular-expressions.info/
Why do you have pipes in your character classes [\.|\n] and [\s|:]? Character classes (stuff in square brackets []) are by definition like an OR relationship, so you don't need the pipe... unless you really are trying to match pipe |.
As for question #1, I'm not sure what's causing your problem, but usually this has to do with greedy quantifiers. The + in (.+) is greedy, so it matches as much as it can while still letting the entire pattern match. Since a period . matches any character other than newline characters, it can match a literal period, and so it does. To make a quantifier non-greedy you can append a question mark ?.
For your second question: regex uses parentheses both to group things and to store them. If you want to group (tel|Tel|TEL) but not store it in $match, you can put ?: right after the opening parenthesis:
(?:tel|Tel|TEL)
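Both points can be checked with a small sketch. The test strings are invented for illustration.

```php
<?php
// Inside a character class, | is a literal pipe, not "or":
// this class accepts whitespace, a pipe, or a colon.
$pipeMatches = preg_match('/^[\s|:]$/', '|');

// (?:...) groups the alternatives without capturing them,
// so the number lands in group 1 and there is no group 2.
preg_match('/(?:tel|Tel|TEL)[\s:]+([^.]+)/', 'Tel: 012 213 123. fax: 1', $m);

var_dump($pipeMatches, $m[1], count($m));
```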
Do you mean you want to match only the number, so you don't have to strip off the tel: and the dot? Try this:
/tel[:\s]+\K[^.]+/i
The i makes it case-insensitive.
[:\s] matches a colon or whitespace (the | doesn't mean "or" in a character class, it just matches a |).
[^.]+ matches one or more non-dots; it stops matching when it sees a dot or the end of the line, so you don't have to match the dot if you don't want it in the result.
Finally, \K means "forget about whatever you've matched so far and pretend the match really started here"--a little gem of a feature that's only available in Perl and PHP (that I know of).
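A sketch comparing the question's greedy pattern with the two fixes suggested in the answers, using the $text from the question. The variable names $greedy, $lazy and $reset are mine.

```php
<?php
$text = "tel: 012 213 123. mobil: 0303 11234 \n address: street 14";

// Greedy (.+) from the question: it runs to the end of the line
// and the closing class then matches the newline, not the first dot.
preg_match('/(tel|Tel|TEL)[\s|:]+(.+)[\.|\n]/', $text, $greedy);

// Lazy (.+?) stops at the first dot.
preg_match('/tel[\s:]+(.+?)\./i', $text, $lazy);

// Negated class plus \K: the whole match is just the number.
preg_match('/tel[:\s]+\K[^.]+/i', $text, $reset);

var_dump($greedy[2], $lazy[1], $reset[0]);
```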

Including new lines in PHP preg_replace function

I'm trying to match a string that may appear over multiple lines. It starts and ends with a specific string:
{a}some string
can be multiple lines
{/a}
Can I grab everything between {a} and {/a} with a regex? It seems the . doesn't match new lines, but I've tried the following with no luck:
$template = preg_replace( $'/\{a\}([.\n]+)\{\/a\}/', 'X', $template, -1, $count );
echo $count; // prints 0
It matches . or \n when they're on their own, but not together!
Use the s modifier:
$template = preg_replace('/\{a\}(.+)\{\/a\}/s', 'X', $template, -1, $count);
// the trailing s makes . match newlines; [.\n] is replaced by a plain . because a dot inside [] is literal
echo $count;
I think you've got more problems than just the dot not matching newlines, but let me start with a formatting recommendation. You can use just about any punctuation character as the regex delimiter, not just the slash ('/'). If you use another character, you won't have to escape slashes within the regex. I understand '%' is popular among PHPers; that would make your pattern argument:
'%\{a\}([.\n]+)\{/a\}%'
Now, the reason that regex didn't work as you intended is because the dot loses its special meaning when it appears inside a character class (the square brackets)--so [.\n] just matches a dot or a linefeed. What you were looking for was (?:.|\n), but I would have recommended matching the carriage-return as well as the linefeed:
'%\{a\}((?:.|[\r\n])+)\{/a\}%'
That's because the word "newline" can refer to the Unix-style "\n", Windows-style "\r\n", or older-Mac-style "\r". Any given web page may contain any of those or a mixture of two or more styles; a mix of "\n" and "\r\n" is very common. But with /s mode (also known as single-line or DOTALL mode), you don't need to worry about that:
'%\{a\}(.+)\{/a\}%s'
However, there's another problem with the original regex that's still present in this one: the + is greedy. That means, if there's more than one {a}...{/a} sequence in the text, the first time your regex is applied it will match all of them, from the first {a} to the last {/a}. The simplest way to fix that is to make the + ungreedy (a.k.a, "lazy" or "reluctant") by appending a question mark:
'%\{a\}(.+?)\{/a\}%s'
Finally, I don't know what to make of the '$' before the opening quote of your pattern argument. I don't do PHP, but that looks like a syntax error to me. If someone could educate me in this matter, I'd appreciate it.
From http://www.regular-expressions.info/dot.html:
"The dot matches a single character, without caring what that character is. The only exception are newline characters."
You will need to add a trailing /s flag to your expression.
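To see what the s modifier and the lazy quantifier each contribute, here is a sketch on an invented template containing two {a}...{/a} blocks, the first of which spans two lines.

```php
<?php
// Hypothetical template with two blocks; the first contains a line break.
$template = "{a}first\nblock{/a} middle {a}second{/a}";

// Greedy: one match from the first {a} to the last {/a}.
$greedy = preg_replace('%\{a\}(.+)\{/a\}%s', 'X', $template, -1, $greedyCount);

// Lazy: each block is matched on its own.
$lazy = preg_replace('%\{a\}(.+?)\{/a\}%s', 'X', $template, -1, $lazyCount);

// Without s, the dot cannot cross the line break, so only the second block matches.
$noFlag = preg_replace('%\{a\}(.+?)\{/a\}%', 'X', $template, -1, $noFlagCount);

var_dump($greedyCount, $lazyCount, $noFlagCount);
```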
