Getting all URLs on multiple lines - php

I'm trying to get all these URLs from a website, but I only seem to be able to get the first URL. How can I match all the URLs?
So far I've tried
auto">(.*?)<\/pre>
and:
auto">(.*?)\s<\/pre>
I've tried adding several modifiers such as m and i, but it didn't seem to help.
This is what I'm searching:
auto">http://url-one.com
http://url-two.com
http://url-three.com
http://url-four.com
http://url-five.com</pre>
Can someone help me understand what I am missing?

Quick Answer
As Jonny5 hinted in his comment, . does not match newline characters by default, so (.*?) will not match beyond the first line without the s regex modifier. His suggestion is therefore the quick answer:
/auto">(.*?)<\/pre>/s
You can check out his Regex101 demo or related PHP code...
$re = "/auto\">(.*?)<\\/pre>/s";
$str = "auto\">http://url-one.com\nhttp://url-two.com\nhttp://url-three.com\nhttp://url-four.com\nhttp://url-five.com</pre>";
preg_match($re, $str, $matches);
...for reference.
Digging Deeper
However, there is a little more going on here.
i and m Modifiers
First, regardless of whether you use the i or m modifier(s), no single line of the sample text has both auto"> at its beginning and <\/pre> at its end, so the pattern cannot match line by line as written. You would have to group each of them and follow it with a quantifier to make it optional (e.g. (?:auto">)? and (?:<\/pre>)?) to match each line of the sample text.
m Requires Matching Globally
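Second, matching line by line would necessitate matching globally, and further tweaks to the pattern to avoid the last URL match ending with </pre>. (Note that the m modifier only changes how ^ and $ behave, so it has no effect on the pattern below, which uses neither anchor.)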
/(?:auto">)?(.+)(?=(?:\n|<\/pre>))/m
You can also check out a second Regex101 demo of this twist or try it out in PHP:
$re = "/(?:auto\">)?(.+)(?=(?:\\n|<\\/pre>))/m";
$str = "auto\">http://url-one.com\nhttp://url-two.com\nhttp://url-three.com\nhttp://url-four.com\nhttp://url-five.com</pre>";
preg_match_all($re, $str, $matches); // NOTE: preg_match_all to match globally
Which Approach to Choose
The choice between simply adding the s modifier or tweaking the pattern, adding the m modifier, and matching globally mostly comes down to whether you want a single match with all the URLs (separated by newlines) or many matches, each with one of the URLs.
The latter yields the matches below...
MATCH 1
1. [6-24] `http://url-one.com`
MATCH 2
1. [25-43] `http://url-two.com`
MATCH 3
1. [44-64] `http://url-three.com`
MATCH 4
1. [65-84] `http://url-four.com`
MATCH 5
1. [85-104] `http://url-five.com`
...versus the single match that the original pattern and the s modifier yield:
MATCH 1
1. [6-104] `http://url-one.com
http://url-two.com
http://url-three.com
http://url-four.com
http://url-five.com`
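If the goal is an array of URLs rather than one multi-line capture, the single match can also be split afterwards. A minimal sketch, assuming the sample string from above; the variable names $m and $urls are only illustrative:

```php
<?php
$str = "auto\">http://url-one.com\nhttp://url-two.com\nhttp://url-three.com\nhttp://url-four.com\nhttp://url-five.com</pre>";

$urls = [];
// The s modifier lets . match newlines, so group 1 holds the whole block.
if (preg_match('/auto">(.*?)<\/pre>/s', $str, $m) === 1) {
    // \R matches any newline sequence; PREG_SPLIT_NO_EMPTY drops blank lines.
    $urls = preg_split('/\R/', trim($m[1]), -1, PREG_SPLIT_NO_EMPTY);
}

print_r($urls);
```

This keeps the original pattern untouched and leaves the line handling to preg_split, which may be easier to read than the lookahead variant.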

Related

Weird PHP Regex Preg_Match Bug?

My PHP version is PHP 7.2.24-0ubuntu0.18.04.7 (cli). However it looks like this problem occurs with all versions I've tested.
I've encountered a very weird bug when using preg_match. Anyone know a fix?
The first section of code here works, the second one doesn't. But the regex itself is valid. For some reason the something_happened word is causing it to fail.
$one = ' (branch|leaf)';
echo "ONE:\n";
preg_match('/(?:\( ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/', $one, $matches, PREG_OFFSET_CAPTURE);
print_r($matches); // this works
$two = 'something_happened (branch|leaf)';
echo "\nTWO:\n";
preg_match('/(?:\( ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/', $two, $matches2, PREG_OFFSET_CAPTURE);
print_r($matches2); // this doesn't work
It seems somehow related to the word something_happened. If I change this word it works.
The regex is matching 2 or more type names separated by | that may or may not be surrounded in (), and each type name may or may not be preceded by any number of [] (or [some number] or [!some number]) and *.
Try it and see for yourself! Please let me know if you know how to fix it!
The problem lies in the (?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+ group: the + quantifier quantifies a group with many subsequent optional patterns, and that creates too many options to match a string before the subsequent patterns.
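To check whether the failure comes from the regex engine giving up rather than from a genuine non-match, you can inspect preg_last_error() after the call. A minimal sketch: tryMatch is a hypothetical helper name, and which error you see (if any) depends on your PCRE settings such as pcre.backtrack_limit and pcre.jit.

```php
<?php
// Hypothetical helper: run preg_match() and describe the outcome.
function tryMatch(string $pattern, string $subject): string
{
    $result = preg_match($pattern, $subject, $m);
    if ($result === false) {
        // preg_last_error() returns one of the PREG_*_ERROR constants.
        switch (preg_last_error()) {
            case PREG_BACKTRACK_LIMIT_ERROR:
                return 'backtrack limit exhausted';
            case PREG_JIT_STACKLIMIT_ERROR:
                return 'JIT stack limit exhausted';
            default:
                return 'other PCRE error';
        }
    }
    return $result === 1 ? 'matched: ' . $m[0] : 'no match';
}

// The pattern from the question, unchanged.
$pattern = '/(?:\( ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?: ?\| ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?: ?\))?/';
echo tryMatch($pattern, 'something_happened (branch|leaf)'), "\n";
```

preg_match() returns false (not 0) when the engine aborts, so an empty $matches array alone does not tell the two cases apart.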
In PHP, you can work around the problem by using either of the following:
1. Possessive quantifier:
'/(?:\(\ ?)?((?:(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)++(?:\ ?\|\ ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?:\ ?\))?/'
Note the ++ at the end of the group mentioned.
2. Atomic group:
'/(?:\(\ ?)?((?>(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+(?:\ ?\|\ ?(?:\**\[(?:!?\d+)?\])*\**[A-Za-z_]\w*)+)(?:\ ?\))?/'
See this regex demo. Note the (?>...) syntax.
Also note the escaped spaces (\ ) in the patterns above: they allow the regex to be used with the x (extended) flag, which lets you break a long pattern across several lines and indent it so that issues are easier to track down. With x, all literal whitespace and # characters must be escaped, but that is a minor inconvenience when it comes to debugging long patterns like this.
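For illustration, here is how the possessive-quantifier variant might be laid out with the x flag. This is a sketch of one possible formatting, not the answer's original layout, and the comments are only an interpretation of each part. Quantifiers are kept directly next to the item they apply to, and the / delimiter is avoided inside the comments because it would end the pattern.

```php
<?php
$pattern = '/
    (?: \( \ ? )?                          # optional opening parenthesis and space
    (
        (?:                                # first type name
            (?: \** \[ (?:!?\d+)? \] )*    # any number of [], [n] or [!n], with optional stars
            \** [A-Za-z_] \w*
        )++                                # possessive: never backtrack into the name
        (?:                                # one or more further names, each after a pipe
            \ ? \| \ ?
            (?: \** \[ (?:!?\d+)? \] )*
            \** [A-Za-z_] \w*
        )+
    )
    (?: \ ? \) )?                          # optional space and closing parenthesis
/x';

$ok = preg_match($pattern, 'something_happened (branch|leaf)', $m);
print_r($m);
```

With the name group made possessive, each starting position in something_happened fails after a single attempt, so the engine reaches (branch|leaf) quickly.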

Regex - Match characters but don't include within results

I have got the following Regex, which ALMOST works...
(?:^https?:\/\/)(?:www|[a-z]+)\.([^.]+)
I need the result to be the only result, or within the same position in the Array.
So, for example, http://m.facebook.com/ matches perfectly: there is only 1 group.
However, if I change it to http://facebook.com/, then I get com/ in place of where facebook should be. So I really need (?:www|[a-z]+) to be an optional check.
Edit:
What I expect is just to match facebook, if ANY of the strings are as follows:
http://www.facebook.com
http://facebook.com
http://m.facebook.com
And obviously the https counterparts.
This is my Regex now
(?:^https?:\/\/)(?:www)?\.?([^.]+)
This is close; however, it matches the m when I try http://m.facebook.com
https://regex101.com/r/GDapY5/1
So I need to have (?:www|[a-z]+) as an optional check really.
A ? at the end of a pattern is generally used for "optional" bits -- it means "match zero or one" of that thing, so your subpattern would be something like this:
(?:www|[a-z]+)?
If you're simply trying to get the second level domain, I wouldn't bother with regex, because you'll be constantly adjusting it to handle special cases you come across. Just split on dots and take the penultimate value:
$domain = array_reverse(explode('.', parse_url($str)['host']))[1];
Or:
$domain = array_reverse(explode('.', parse_url($str, PHP_URL_HOST)))[1];
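The one-liners above raise notices or errors when the URL has no host or the host has no dot. A slightly more defensive sketch of the same idea follows; secondLevelLabel is a hypothetical helper name, not something from the original answer.

```php
<?php
// Hypothetical helper: return the penultimate dot-separated label of the host.
function secondLevelLabel(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return null; // malformed URL or no host component
    }
    $labels = explode('.', $host);
    $count = count($labels);
    return $count >= 2 ? $labels[$count - 2] : null;
}

foreach (['http://www.facebook.com', 'http://facebook.com', 'https://m.facebook.com'] as $url) {
    echo secondLevelLabel($url), "\n";
}
```

Like the original one-liner, this takes the penultimate label literally, so a host such as www.bbc.co.uk yields co rather than bbc.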
Perhaps you could make the first m. part optional with (?:\w+\.)?.
Instead of a capturing group you could use \K to reset the starting point of the reported match.
Then match one or more word characters \w+ and use a positive lookahead to assert that what follows is a dot (?=\.)
For example:
^https?://(?:www)?(?:\w+\.)?\K\w+(?=\.)
Edit: Or you could match for m. or www. using an alternation:
^https?://(?:m\.|www\.)?\K\w+(?=\.)
Demo Php
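As a quick check, the alternation pattern can be run against the three URL shapes from the question. A minimal sketch: the ~ delimiter is my choice so that the slashes in :// need no escaping, and $names is only an illustrative variable.

```php
<?php
$pattern = '~^https?://(?:m\.|www\.)?\K\w+(?=\.)~';
$urls = ['http://www.facebook.com', 'http://facebook.com', 'https://m.facebook.com'];

$names = [];
foreach ($urls as $url) {
    // Because of \K, the full match $m[0] starts after the optional prefix.
    $names[] = preg_match($pattern, $url, $m) === 1 ? $m[0] : null;
}

print_r($names);
```

Since \K resets the start of the reported match, the domain name ends up in $m[0] and no capturing group is needed.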

Non greedy match does not work

I want to implement a non-greedy match using the .*? pattern. However, I came across a sample string which shows that the non-greedy match does not work. This is the code and the sample string:
preg_match_all('/\<w:t.*?\>\<w:p\>/', '<w:t xml:space="preserve"></w:t></w:r><w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve">Text 1 </w:t></w:r><w:r><w:rPr><w:b/><w:u w:val="single"/><w:color w:val="ff0000"/></w:rPr><w:t xml:space="preserve"></w:t></w:r><w:r><w:rPr><w:b/><w:u w:val="single"/><w:color w:val="ff0000"/><w:i/></w:rPr><w:t xml:space="preserve">Text 2</w:t></w:r><w:r><w:t xml:space="preserve"></w:t></w:r><w:r><w:t xml:space="preserve"></w:t></w:r><w:r><w:t xml:space="preserve"></w:t></w:r></w:p></w:t></w:r></w:p><w:p w:rsidRDefault="004D3323" w:rsidP="003F03B1"><w:r><w:t><w:p>', $match);
But if I print_r the $match variable, I see that this pattern matches the whole string. However, what I want is to match only such strings as:
"<w:t><w:p>" and "<w:t any text may go here><w:p>"
So, what did I do wrong, and how can I fix it? Thanks!
Use this regex instead:
<w:t[^>]*><w:p>
[^>]* matches any run of characters except >, so the match cannot run past the end of the <w:t ...> tag.
see https://regex101.com/r/nuMzTk/1
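The lazy quantifier only affects how far a match extends, not where it starts: the engine still begins at the leftmost <w:t and lets .*? grow across other tags until ><w:p> is found. A shortened, made-up sample in the same shape as the question's string shows the difference:

```php
<?php
// Shortened, hypothetical sample; not the original document from the question.
$xml = '<w:t xml:space="preserve">Text 1</w:t><w:r><w:t><w:p><w:t any text><w:p>';

// Lazy dot: the first match starts at the very first <w:t and spans several tags.
preg_match_all('/<w:t.*?><w:p>/', $xml, $lazy);

// Negated class: each match stays inside a single <w:t ...> tag.
preg_match_all('/<w:t[^>]*><w:p>/', $xml, $negated);

print_r($lazy[0]);
print_r($negated[0]);
```

Both patterns find two matches here, but only the negated character class returns the short <w:t><w:p> and <w:t any text><w:p> pairs the question asks for.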

Move multiple letters in string using regex

Using a regular expression I want to move two letters in a string.
W28
L36
W29-L32
Should be changed to:
28W
36L
29W-32L
The numbers vary between 25 and 44. The letters that need to be moved are always "W" and/or "L" and the "W" is always first when they both exist in the string.
I need to do this with a single regular expression using PHP. Any ideas would be awesome!
EDIT:
I'm new to regular expressions and tried a lot of things without success. The closest I came was using "/\b(W34)\b/" for each possibility. I also found something about using variables in the replace function but had no luck using these.
Your regex \b(W34)\b matches exactly W34 as a whole word. You need a character class to match W or L, some alternatives to match the numeric range, and capturing groups so that the parts can be reordered.
You can use the following regex replacement:
$re = '/\b([WL])(2[5-9]|3[0-9]|4[0-4])\b/';
$str = "W28\nL36\nW29-L32";
$result = preg_replace($re, "$2$1", $str);
echo $result;
See IDEONE demo
Here, ([WL]) matches and captures either W or L into group 1, and (2[5-9]|3[0-9]|4[0-4]) matches integers from 25 to 44 and captures them into group 2. Backreferences in the replacement string ($2$1) reverse the order of the groups.
And here is a regex demo in case you want to adjust it later.
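To see that the numeric alternatives really limit the swap to 25 through 44, the same pattern can be tried on values at and just outside the boundaries. A small sketch; the sample values are my own and $converted is only an illustrative variable.

```php
<?php
$re = '/\b([WL])(2[5-9]|3[0-9]|4[0-4])\b/';

// Boundary values plus two that fall just outside the 25-44 range.
$samples = ['W25', 'L44', 'W24', 'L45', 'W29-L32'];

$converted = [];
foreach ($samples as $s) {
    // Single quotes keep $2$1 as backreferences without any PHP interpolation.
    $converted[$s] = preg_replace($re, '$2$1', $s);
}

print_r($converted);
```

Values outside the range are left unchanged because none of the three alternatives can match them as a whole word.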

regex to match everything until it hits uppercase

I found the following pattern in this question: regex to match everything until it finds 2 upper case characters?
^.*(?=\b(?:[^\sA-Z]*[A-Z]){2})
However, my question is slightly different from the OP's.
I want to match everything up to the upper case in the following string.
The pattern should match everything until the lookahead finds 2 uppercase characters, and then also match everything from the 1st uppercase character up to the start of the 2nd uppercase character.
So, continuing from the OP's example, for
Http is an HttpHeader
I want to get Http is an Http
instead of Http is an, which is what the OP gets in the linked thread.
Seems overly complicated to me.
preg_match( '/[^A-Z]+/', $str, $res );
preg_match('/[^A-Z]*([A-Z]{1}[^A-Z]*[A-Z]{1}[^A-Z]*)/', $str, $res);
Use this pattern: ^.*?(?=\b(?:[^\sA-Z]*[A-Z]){2}).+?(?=[A-Z]) Demo
([A-Z].*?\w+(?=[A-Z]))
You can use the regex above. It is simple and fast. See the matched groups here: Live demo
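As a quick comparison, two of the patterns suggested above can be run against the sample string from the question. This is only a sketch for that one input; the variable names $a and $b are illustrative.

```php
<?php
$str = 'Http is an HttpHeader';

// Lookahead-based pattern: the full match is the result.
preg_match('/^.*?(?=\b(?:[^\sA-Z]*[A-Z]){2}).+?(?=[A-Z])/', $str, $a);

// Negated-class pattern: the result is in capturing group 1.
preg_match('/[^A-Z]*([A-Z]{1}[^A-Z]*[A-Z]{1}[^A-Z]*)/', $str, $b);

echo $a[0], "\n";
echo $b[1], "\n";
```

Both give Http is an Http for this sample, but they are not equivalent in general: the second one simply runs from the first capital to just before the third, regardless of word boundaries.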
