Non greedy match does not work - php

I want a non-greedy match using the .*? pattern. However, I came across a sample string for which the non-greedy match does not seem to work. This is the code and the sample string:
preg_match_all('/\<w:t.*?\>\<w:p\>/', '<w:t xml:space="preserve"></w:t></w:r><w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve">Text 1 </w:t></w:r><w:r><w:rPr><w:b/><w:u w:val="single"/><w:color w:val="ff0000"/></w:rPr><w:t xml:space="preserve"></w:t></w:r><w:r><w:rPr><w:b/><w:u w:val="single"/><w:color w:val="ff0000"/><w:i/></w:rPr><w:t xml:space="preserve">Text 2</w:t></w:r><w:r><w:t xml:space="preserve"></w:t></w:r><w:r><w:t xml:space="preserve"></w:t></w:r><w:r><w:t xml:space="preserve"></w:t></w:r></w:p></w:t></w:r></w:p><w:p w:rsidRDefault="004D3323" w:rsidP="003F03B1"><w:r><w:t><w:p>', $match);
But if I print_r the $match variable, I see that this pattern matches the whole string. What I want is to match only strings such as:
"<w:t><w:p>" and "<w:t any text may go here><w:p>"
So, what did I do wrong and how can I fix it? Thanks!

Use this regex instead:
<w:t[^>]*><w:p>
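[^>]* matches any character except >, so the match cannot run past the end of the <w:t ...> tag. The lazy .*? does not help here because a match always begins at the leftmost possible position: the engine starts at the first <w:t and keeps extending .*? until ><w:p> finally appears.
See https://regex101.com/r/nuMzTk/1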
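To make the difference concrete, here is a minimal sketch that runs both patterns against a shortened, made-up sample in the same shape as the string from the question (the sample string and variable names are assumptions for illustration only):

```php
<?php
// Shortened, hypothetical sample shaped like the string in the question.
$xml = '<w:t xml:space="preserve">Text 1</w:t><w:r><w:t><w:p>';

// Lazy dot: the match starts at the first "<w:t" and grows until "><w:p>" shows up.
preg_match_all('/<w:t.*?><w:p>/', $xml, $lazy);

// Negated class: the part after "<w:t" can never cross a ">".
preg_match_all('/<w:t[^>]*><w:p>/', $xml, $negated);

print_r($lazy[0]);    // one match covering the whole sample
print_r($negated[0]); // one match: <w:t><w:p>
```

The lazy quantifier only keeps the match short from its starting point onward; it does not move the starting point to the right.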

Related

How to match these strings using preg_match

preg_match('/"\'<>&/', 'misiek"')
Why doesn't it work?
As stated in the comments, it does exactly what you told it to do. Your pattern simply checks whether the string contains the exact substring "'<>& anywhere (the backslash only escapes the quote inside the PHP string).
So with your pattern, the following strings would result in a match:
"'<>&
LOREM "'<>& IPSUM
Both of these contain the sequence you searched for. However, LO"R'EM<>IPS&UM would not match, because the pattern looks for the complete sequence, not for the individual characters.
If you change your pattern to:
/["\'<>&]/
Then you look for a set of characters instead. preg_match returns 1 if any one of the characters inside the brackets is found.
misiek - would not match in this case
LO"R'EM<>IPS&UM - would match
mis&iek - would match
You can test your regex patterns as well as build them on this site:
https://regex101.com
There you'll also find the available modifiers you can use and how / why to use them.
Good luck!
I am guessing: could it be that you want to match a string containing at least one of the characters listed in your regular expression? In that case you should do the following:
$res=preg_match('/["\'<>&]/' , 'misiek"');
And the result should be positive ($res===1), see here:
http://rextester.com/KYNGYI23753
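As a quick check of the two answers above, the following sketch runs the literal-sequence pattern and the character-class pattern against the sample strings mentioned in this thread (the variable names are only illustrative):

```php
<?php
$sequence = '/"\'<>&/';   // the five characters in exactly this order
$anyOf    = '/["\'<>&]/'; // any single one of the five characters

$samples = ['misiek', 'misiek"', 'mis&iek', 'LOREM "\'<>& IPSUM', 'LO"R\'EM<>IPS&UM'];

$results = [];
foreach ($samples as $s) {
    // preg_match returns 1 on a match and 0 on no match.
    $results[$s] = [preg_match($sequence, $s), preg_match($anyOf, $s)];
    printf("%-20s sequence=%d anyOf=%d\n", $s, $results[$s][0], $results[$s][1]);
}
```

Only the string that contains the full sequence matches the first pattern, while every string containing at least one of the characters matches the second.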

Regex - Match characters but don't include within results

I have got the following Regex, which ALMOST works...
(?:^https?:\/\/)(?:www|[a-z]+)\.([^.]+)
I need the result to be the only result, or within the same position in the Array.
For example, http://m.facebook.com/ matches perfectly; there is only one group.
However, if I change it to http://facebook.com/, then I get com/ in place of where facebook should be. So I need to have (?:www|[a-z]+) as an optional check really.
Edit:
What I expect is just to match facebook, if ANY of the strings are as follows:
http://www.facebook.com
http://facebook.com
http://m.facebook.com
And obviously the https counterparts.
This is my Regex now
(?:^https?:\/\/)(?:www)?\.?([^.]+)
This is close; however, it matches the m when I try http://m.facebook.com
https://regex101.com/r/GDapY5/1
"So I need to have (?:www|[a-z]+) as an optional check really."
A ? after a subpattern is what makes it optional: it means "match zero or one" of that thing, so your subpattern would be something like this:
(?:www|[a-z]+)?
If you're simply trying to get the second level domain, I wouldn't bother with regex, because you'll be constantly adjusting it to handle special cases you come across. Just split on dots and take the penultimate value:
$domain = array_reverse(explode('.', parse_url($str)['host']))[1];
Or:
$domain = array_reverse(explode('.', parse_url($str, PHP_URL_HOST)))[1];
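The split-on-dots idea can be wrapped in a small helper. This is only a sketch: the function name secondLevelLabel is made up here, and it assumes the host always ends in a single-label suffix such as .com.

```php
<?php
// Hypothetical helper: returns the second-to-last host label, or null if there is none.
function secondLevelLabel(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null; // no host component (or an unparsable URL)
    }
    $labels = explode('.', $host);
    $count = count($labels);
    return $count >= 2 ? $labels[$count - 2] : null;
}

foreach (['http://www.facebook.com', 'http://facebook.com', 'https://m.facebook.com/'] as $url) {
    echo $url, ' => ', secondLevelLabel($url), "\n";
}
```

Checking the parse_url result before indexing avoids a warning when the input has no host part.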
Perhaps you could make the first m. part optional with (?:\w+\.)?.
Instead of a capturing group you could use \K to reset the starting point of the reported match.
Then match one or more word characters \w+ and use a positive lookahead to assert that what follows is a dot (?=\.)
For example:
^https?://(?:www)?(?:\w+\.)?\K\w+(?=\.)
Edit: Or you could match for m. or www. using an alternation:
^https?://(?:m\.|www\.)?\K\w+(?=\.)
Demo Php
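A short sketch of the \K variant in PHP, using ~ as the delimiter so the slashes in :// need no escaping (the URL list and variable names are just for illustration):

```php
<?php
// \K drops everything matched so far from the reported match,
// so $m[0] holds only the word before the next dot.
$pattern = '~^https?://(?:m\.|www\.)?\K\w+(?=\.)~';

$urls = ['http://www.facebook.com', 'http://facebook.com', 'https://m.facebook.com'];

$found = [];
foreach ($urls as $url) {
    $found[$url] = preg_match($pattern, $url, $m) ? $m[0] : null;
    echo $url, ' => ', $found[$url], "\n";
}
```

Because of \K there is no capturing group to look up; the result is always in $m[0].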

php regexp: can't exclude one element

I am trying to set up a fairly complex regexp, but there is one element from the exclusion list that I cannot get it to reject.
My regular expression is:
1234567-8_abc((?!_ABC|_DEFGHI)[\w]?)*(\.ios|\.and)
What I have to exclude is:
1234567-8_abc.ios
1234567-8_abc_DEFGHI.ios
1234567-8_abc_ABC.ios
Instead, what I have to include is:
1234567-8_abc_1UP.ios
1234567-8_abc_FI.ios
1234567-8_abc_gmg.ios
1234567-8_abc_1UP.and
1234567-8_abc_FI.and
1234567-8_abc_gmg.and
1234567-8_abc_ddd.and
1234567-8_abc_qwert.ios
1234567-8_abc_88.ios
Well, I can't exclude the first option (1234567-8_abc.ios).
I tried it here.
How can I achieve this?
Thank you!
You can use this pattern:
1234567-8_abc_[^_.]++(?<!_ABC|_DEFGHI)\.(?:ios|and)
Note: I assume that the substring between _ and .ios (or .and) contains neither a dot nor an underscore.
The possessive quantifier ++ is there so that the pattern fails faster, with fewer backtracking steps.
This regex matches your examples in PHP:
1234567-8_abc_((?!ABC|DEFGHI)[\w]?)*(\.ios|\.and)
Add a negative lookahead like the one below:
1234567-8_abc(?!_ABC|_DEFGHI)\w+(\.ios|\.and)
DEMO
(?!_ABC|_DEFGHI) is a negative lookahead that asserts that the text following _abc is not _ABC or _DEFGHI. The \w+ then requires one or more word characters before .ios or .and, so the pattern won't match the string 1234567-8_abc.ios.
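To check the pattern 1234567-8_abc(?!_ABC|_DEFGHI)\w+(\.ios|\.and) against the lists from the question, here is a small sketch. The ^ and $ anchors are an addition of mine (an assumption that each file name is tested as a whole string), and the variable names are illustrative.

```php
<?php
// ^ and $ are added here (an assumption) so each file name is tested as a whole.
$pattern = '/^1234567-8_abc(?!_ABC|_DEFGHI)\w+(\.ios|\.and)$/';

$exclude = ['1234567-8_abc.ios', '1234567-8_abc_DEFGHI.ios', '1234567-8_abc_ABC.ios'];
$include = [
    '1234567-8_abc_1UP.ios', '1234567-8_abc_FI.ios', '1234567-8_abc_gmg.ios',
    '1234567-8_abc_1UP.and', '1234567-8_abc_FI.and', '1234567-8_abc_gmg.and',
    '1234567-8_abc_ddd.and', '1234567-8_abc_qwert.ios', '1234567-8_abc_88.ios',
];

// preg_grep keeps only the entries that match the pattern.
$matched = array_values(preg_grep($pattern, array_merge($exclude, $include)));
print_r($matched);
```

Only the nine names from the include list survive the filter.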
1234567-8_abc(?:(?!_ABC|_DEFGHI)\w)+(\.ios|\.and)
Try this. Your regex left the \w after 1234567-8_abc optional; I just made it compulsory. See the demo:
http://regex101.com/r/bB8jY7/1

How to write the regular expression to get the following pattern in PHP?

There is a website, and I would like to get every string that matches the pattern <td> (any content) </td>.
So I wrote this:
preg_match("/<td>.*</td>/", $web , $matches);
die(var_dump($matches));
That returns null. How do I fix the problem? Thanks for helping.
OK.
I guess you are just not escaping properly: the / in </td> has to be escaped, because / is also the pattern delimiter.
Also use a group to capture the cell content.
<td>(.*)<\/td>
should do. You can try this regex on your given text here. Don't forget the global flag if you are matching ALL the td's (preg_match_all in PHP).
Usually parsing HTML with regex is not a good idea; try to use a DOM parser instead.
Example -> http://simplehtmldom.sourceforge.net/
Test the above regex with
$web = file_get_contents('http://www.w3schools.com/html/html_tables.asp');
preg_match_all("/<td>(.*)<\/td>/", $web, $matches);
print_r($matches);
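For the DOM-parser suggestion in this answer, a minimal sketch with PHP's bundled DOMDocument class could look like this. It assumes the dom extension is enabled, and the HTML snippet is a made-up stand-in for the downloaded page.

```php
<?php
// Hypothetical HTML snippet standing in for the downloaded page.
$web = '<table><tr><td>Alfreds</td><td>Germany</td></tr></table>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($web);

// Collect the text content of every <td> element.
$cells = [];
foreach ($doc->getElementsByTagName('td') as $td) {
    $cells[] = $td->textContent;
}
print_r($cells);
```

Unlike the regex, this also handles cells that carry attributes or span several lines.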
Lazy Quantifier, Different Delimiter
You need .*? rather than .*, otherwise you can overshoot the closing </td>. Also, your / delimiter needed to be escaped when it appeared in </td>. We can replace it with another one that doesn't need escaping.
Do this:
$regex = '~<td>.*?</td>~';
preg_match_all($regex, $web, $matches);
print_r($matches[0]);
Explanation
The ~ is just an esthetic tweak: you can use any delimiter you like around your regex pattern, and in general ~ is more versatile than /, which needs to be escaped more often, for instance in </td>.
The star quantifier in .*? is made "lazy" by the ? so that the dot only matches as many characters as needed to allow the next token to match (shortest match). Without the ?, the .* first matches the whole string, then backtracks only as far as needed to allow the next token to match (longest match).
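The greedy and lazy behaviours can be compared side by side on a tiny, made-up table row (the sample string and variable names are assumptions for illustration):

```php
<?php
// Hypothetical one-line table row with two cells.
$web = '<tr><td>a</td><td>b</td></tr>';

preg_match_all('~<td>.*</td>~', $web, $greedy); // runs to the last </td>
preg_match_all('~<td>.*?</td>~', $web, $lazy);  // stops at the nearest </td>

print_r($greedy[0]); // a single match spanning both cells
print_r($lazy[0]);   // one match per cell
```

With the greedy version both cells end up in one match, which is the overshoot described above.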

Regex ignore URL already in HTML tags

I'm having a little problem with my Regex
I've made a custom BBcode for my website, however I also want URLs to be parsed too.
I'm using preg_replace and this is the pattern used to identify URLS:
/([\w]+:\/\/[\w-?&;#~=\.\/\#]+[\w\/])/is
Which works great, however if a URL is within a [img][/img] block, the above pattern also picks it up and produces a result like this:
//[img]http://url.com/toimg.jeg[/img] will produce this result:
<img src="<a href="http://url.com/toimg.jeg" target="_blank">/>
//When it should produce:
<img src="http://url.com/toimg.jeg"/>
I tried using this:
/([^"][\w]+:\/\/[\w-?&;#~=\.\/\#]+[\w\/][^"])/is
With no luck.
Any help will be appreciated.
Edit:
For the solution, see the 2nd comment on stema's answer.
Try this
(?<!href=")(\b[\w]+:\/\/[\w-?&;#~=\.\/\#]+[\w\/])
See it here on Regexr
To make it more general, you can simplify your lookbehind to check only for =":
(?<!=")(\b[\w]+:\/\/[\w-?&;#~=\.\/\#]+[\w\/])
See it on Regexr
(?<!href=") is a negative lookbehind assertion; it ensures that there is no href=" before your pattern.
\b is a word boundary that anchors the start of your link to a change from a non-word to a word character. Without it the lookbehind would be useless, because the pattern would simply match from "ttp://..." on.
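Here is a minimal sketch of the general (?<!=") variant used with preg_replace. The sample text and replacement markup are made up for illustration. As far as I know, PHP 7.3 and later (PCRE2) reject an unescaped - directly after \w inside a character class as an invalid range, so the sketch escapes it as \-; treat that as an assumption about your PHP version.

```php
<?php
// Same idea as the pattern above, with the hyphen escaped inside the class.
$pattern = '/(?<!=")(\b\w+:\/\/[\w\-?&;#~=.\/]+[\w\/])/i';

// Hypothetical input: one bare URL and one URL that is already inside an attribute.
$text = 'Visit http://example.com/page and <img src="http://url.com/toimg.jeg"/>';

$out = preg_replace($pattern, '<a href="$1" target="_blank">$1</a>', $text);
echo $out, "\n";
```

The bare URL gets wrapped in a link, while the one preceded by =" is left alone.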
