regex to fetch url from string - php

I am working on a task in PHP.
I want a regex that will extract URLs from a string such as
$str = "this is my friend's website http://example1.com I think it is coll some text example.com some text t.com/2000 some text rs.500 some text http://www some text";
How can I extract the following with the help of a regex?
http://example1.com
example.com
t.com/2000
http://www
rs.500 must not be matched!
I need a regex that can handle any kind of link.
Please help me with that.

This regex is what you're looking for:
(https?:\/\/\S+)|([a-z]+\.[a-z]+(?:\/\S+)?)
It's basically the two regexes https?:\/\/\S+ and [a-z]+\.[a-z]+(?:\/\S+)? placed into capturing groups and combined with an OR (|), so that a global search extracts every URL of either kind.
https?:\/\/\S+ finds URLs that are prefixed with http:// or https:// by matching:
The literal string "http" http, followed by
An optional "s" s? followed by
A colon and two forward slashes :\/\/, followed by
One or more non-whitespace characters \S+
If https?:\/\/\S+ doesn't match, then [a-z]+\.[a-z]+(?:\/\S+)? kicks in and finds URLs that are not prefixed with http:// or https:// and whose domain name and top-level domain consist only of lowercase letters (which is why rs.500 is skipped) by matching:
One or more lowercase letters [a-z]+, followed by
A dot \., followed by
One or more lowercase letters [a-z]+, followed by
An optional group, which consists of
A forward slash \/, followed by
One or more non-whitespace characters \S+
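As a minimal sketch of how this could be used in PHP (the ~ delimiter and the variable names $pattern and $urls are my own choices, not part of the question), preg_match_all performs the global search:

```php
<?php
// Sample string from the question.
$str = "this is my friend's website http://example1.com I think it is coll some text "
     . "example.com some text t.com/2000 some text rs.500 some text http://www some text";

// Same pattern as above, with ~ as the delimiter so the slashes need no escaping.
$pattern = '~(https?://\S+)|([a-z]+\.[a-z]+(?:/\S+)?)~';

// preg_match_all finds every match; $matches[0] holds the full matched strings.
preg_match_all($pattern, $str, $matches);
$urls = $matches[0];

print_r($urls);
```

With the sample string, $urls should contain http://example1.com, example.com, t.com/2000 and http://www, while rs.500 is skipped.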

How to check the URL's structure using PHP preg_match?

All my site's URLs have the following structure:
https://www.example.com/section/item
where section is a word and item is a number.
So, possible URLs are:
https://www.example.com
https://www.example.com/section
https://www.example.com/section/item
By .htaccess, all requests go to index.php (route).
I want to show a 404 error message if user types:
https://www.example.com/section/item/somethingelse
In order to check the URL's structure, how can I change the pattern properly in the following function?
function isValidURL($url) {
    return preg_match('|^http(s)?://[a-z0-9-]+(.[a-z0-9-]+)*(:[0-9]+)?(/.*)?$|i', $url);
}
Thanks.
If section is a word (and cannot contain digits) and item is a number, you could match word characters other than digits using [^\W\d]+, and one or more digits using \d+.
As in the example data there are optional parts, you could replace (/.*)?$ with (?:/[^\W\d]+(?:/\d+)?)?$.
Explanation
(?: Non capturing group
/[^\W\d]+ For section, match 1+ times a word char except a digit
(?:/\d+)? For item, optionally match / and 1+ digits
)? Close non capturing group and make it optional
If section can also be a word that consists only of digits, you could use \w+ instead.
The pattern might look like
^https?://[a-z0-9-]+(?:\.[a-z0-9-]+)*(?::[0-9]+)?(?:/[^\W\d]+(?:/\d+)?)?$
Note that the dot is escaped so that it matches literally.
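A possible version of the function with that pattern is sketched below. The ~ delimiter, the type hints and the test URLs with /section/42 are assumptions added for illustration.

```php
<?php
// Sketch of isValidURL() using the pattern suggested above.
function isValidURL(string $url): bool
{
    // Host, optional port, then optionally /section and optionally /item.
    $pattern = '~^https?://[a-z0-9-]+(?:\.[a-z0-9-]+)*(?::[0-9]+)?(?:/[^\W\d]+(?:/\d+)?)?$~i';

    // preg_match returns 1 on a match, 0 on no match.
    return preg_match($pattern, $url) === 1;
}

var_dump(isValidURL('https://www.example.com'));                          // valid
var_dump(isValidURL('https://www.example.com/section'));                  // valid
var_dump(isValidURL('https://www.example.com/section/42'));               // valid
var_dump(isValidURL('https://www.example.com/section/42/somethingelse')); // invalid
```

The router in index.php could then send a 404 response whenever the function returns false.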

regex for extracting all urls from string excluding period for terminating strings

I'm trying to extract URLs from strings. I have various posts whose messages contain URLs. I've written a pattern to match them, but it doesn't work properly. I asked a similar question before but forgot to include this case, so I'm asking a new question for it.
Tried Pattern
\b(\.?)(?:https?://)?(?:(?i:[a-z]+\.)+)[^\s,]+\b
Code
for ($i = 0; $i < $resultcount; $i++) {
    $pattern = '%\b(\.?)(?:https?://)?(?:(?i:[a-z]+\.)+)[^\s,]+\b%';
    $message = (string)$result[$i]['message'];
    preg_match_all($pattern, $message, $match);
    print_r($match);
}
An example of one of my posts:
"This is just a post to test regex for extracting URL.
http://google.com, https://www.youtube.com/watch?v=dlw32af
https://instagram.com/oscar/ en.wikipedia.org"
A post may or may not have commas between multiple URLs, and there may be no space between a preceding string and a URL, as in
sometext.http://google.com
Thank you people :)
This will match strings that are properly encoded and shaped like an HTTP URL, except those that fall into the IDN (internationalized domain name) category:
(?i)(?:https?://[^"'\s<>(){}]++|[a-z0-9](?<=\b.)[a-z0-9-]*+(?:\.[a-z-]{2,}+)++(?=[/?"'()\s]|:\d++|\Z)[^"'\s<>(){}]*+)
So you should not expect
ftp://username:password#ftpserver/folder/
to be matched.
In your initial question you did not specify that each "word"
(a part of the URL) can contain something other than letters.
Note that your regex contains [a-z], which suggests that you
want to match only URLs whose "words" are composed entirely
of letters, without any digits, hyphens or underscores.
Try the following regex:
(?:https?:\/\/)?(?i)[a-z][a-z0-9_-]*(?:[.\/](?!http)[a-z0-9_-]+)+\/?(?:\?[^\s,.]+)?
Description:
(?:https?:\/\/)? - Optional protocol name.
(?i) - Turn on case insensitive option.
[a-z][a-z0-9_-]* - The first "word" of the URL (first a letter,
then any number of letter, digit, underscore or minus chars).
(?:[.\/] - Non-capturing group: either a dot or a slash,
(?!http) - then a negative lookahead, so that the match does not run on into
a URL starting with http that directly follows a dot (or a slash),
[a-z0-9_-]+)+ - then the next "word" (no requirement to start
with a letter). The whole non-capturing group is repeated one or more times.
\/? - Optional slash, terminating the part before query string (if any).
(?:\?[^\s,.]+)? - Optional non-capturing group for query string.
It starts from ? and then a sequence of chars other than space,
comma or dot.
The above regex does not match the trailing dot, just as you wish.
Note:
Because I tested this regex on regex101.com, I escaped the / characters
in it. With a different delimiter in PHP you can omit the escaping.
Following your comment, I changed the regex so that a "word" can also contain
digits, underscores and hyphens.
Note also that - as the first or last character between [...] stands
for itself (as opposed to - between two other characters, where it means
a range).
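To try the regex above against the sample post, something like the following sketch could be used. The % delimiter (so the slashes are left unescaped) and the variable names $post, $glued, $found and $foundGlued are assumptions for illustration.

```php
<?php
// The regex from this answer, with % as delimiter so "/" needs no escaping.
$pattern = '%(?:https?://)?(?i)[a-z][a-z0-9_-]*(?:[./](?!http)[a-z0-9_-]+)+/?(?:\?[^\s,.]+)?%';

// Sample post from the question.
$post = "This is just a post to test regex for extracting URL.\n"
      . "http://google.com, https://www.youtube.com/watch?v=dlw32af\n"
      . "https://instagram.com/oscar/ en.wikipedia.org";

preg_match_all($pattern, $post, $match);
$found = $match[0];
print_r($found);

// The "no space between text and URL" case from the question.
$glued = 'sometext.http://google.com';
preg_match_all($pattern, $glued, $match);
$foundGlued = $match[0];
print_r($foundGlued);
```

For the sample post this should list the four URLs without the comma after google.com, and for the glued string only http://google.com.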

regex for matching url pattern not working

I am having a hard time creating a regular expression that matches all URLs built on a particular pattern, such as http://subdomain.example.com/reply/aoo/spo/4429163785
Here is the regex I have written so far:
"/(^https?)(\:\/\/)([a-z]+)(\.craigslist\.org/reply)\/([a-z]{3}\/spo)\/([0-9]{10}$)/"
Please help me improve my regex.
"/(^https?)(\:\/\/)([a-z]+)(\.craigslist\.org\/reply)\/([a-z]{3}\/spo)\/([0-9]{10}$)/"
You forgot to escape the / after the .org part, right before reply.
Change the delimiter to something else so you don't need to escape the slashes at all. Also, don't capture those parts of the string where no variations are allowed (e.g. .craigslist.org/reply/):
"~^(https?)://([a-z]+)\\.craigslist\\.org/reply/([a-z]{3})/spo/([0-9]{10})$~"
Explanation:
~ The opening delimiter
^ Match the beginning of the string
(https?) Match either http or https – capture it
:// Match a colon followed by two forward slashes
([a-z]+) Match one or more lowercase characters – capture it
\\. Match a period
craigslist Match the characters exactly as given
\\. Match a period
org/reply/ Match the characters exactly as given
([a-z]{3}) Match three lowercase characters – capture it
/spo/ Match the characters exactly as given
([0-9]{10}) Match ten numerical characters – capture it
$ Match the ending of the string
~ The closing delimiter
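Here is a short sketch of that pattern in use. The test URL is a made-up example on the craigslist.org domain used in the regex, and the pattern is written in single quotes, so a single backslash is enough to escape the dots.

```php
<?php
// Same pattern as above, single-quoted: \. instead of \\.
$pattern = '~^(https?)://([a-z]+)\.craigslist\.org/reply/([a-z]{3})/spo/([0-9]{10})$~';

// Hypothetical URL following the structure from the question.
$url = 'http://subdomain.craigslist.org/reply/aoo/spo/4429163785';

// $m[1] = scheme, $m[2] = subdomain, $m[3] = three-letter code, $m[4] = ten-digit id.
$matched = preg_match($pattern, $url, $m);

if ($matched === 1) {
    print_r($m);
} else {
    echo "URL does not follow the expected pattern\n";
}
```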

correction required for regular expression to get site name

Problem: Extract anything between http://www. and .com, or between http:// and .com.
Solution:
<?php
$url1='http://www.examplehotel.com';
//$url2='http://test-hotel-1.com';
$pattern='#^http://([^/]+).com#i';
preg_match($pattern, $url1, $matches);
print_r($matches);
?>
When $url1 is matched, it should return the string 'examplehotel'.
When $url2 is matched, it should return the string 'test-hotel-1'.
It works correctly for $url2 but not for $url1.
In my pattern I want to allow [http://] or [http://www.]. I added (http://)+(www.)+, but the matches are not what I expected :(
May I know where I am going wrong?
Try this one:
$pattern='#^http://(?:www\.)?([^\.]+)\.com#i';
Or, in your own pattern, just make www. optional (it may or may not appear) and escape the dot before com:
$pattern='#^http://(?:www\.)?([^/]+)\.com#i';
The problem is that you are matching everything from the two slashes to the .com. If there is a www., you are matching that too, inside your capturing group.
The solution is to match www. optionally before your capturing group, like this:
^http://(?:www\.)?([^/]+)\.com
        ^^^^^^^^^^       ^^
(?:www\.)? This is a non capturing group, i.e. the content is not stored in the result. The ? at the end makes it optional.
\. will match a literal ".". . is a special character in regex and means "Any character".
Regarding your attempts with [http://] and so on: square brackets create a character class, which matches a single one of the characters inside the brackets. To group characters, use a capturing group () or a non-capturing group (?:).
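Put together, the corrected pattern could be wrapped like this. It is only a sketch, and the helper name extractSiteName is hypothetical:

```php
<?php
// Hypothetical helper: returns the part between http://(www.) and .com, or null.
function extractSiteName(string $url): ?string
{
    // (?:www\.)? consumes an optional "www." outside the capturing group.
    if (preg_match('#^http://(?:www\.)?([^/]+)\.com#i', $url, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

$url1 = 'http://www.examplehotel.com';
$url2 = 'http://test-hotel-1.com';

var_dump(extractSiteName($url1)); // expected: examplehotel
var_dump(extractSiteName($url2)); // expected: test-hotel-1
```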
preg_match_all('%https?://(?:www\.)?(.*?)\.com%i', $url, $result, PREG_PATTERN_ORDER);
print_r($result[1]);

regex to filter out numbers in seo url

I have some urls like these below
http://www.bla-bla.com/hello-world/blah/1345346-asfasdf.html
http://www.bla-bla.com/hello-world/454536556-asdf-rtrthr-dssdfg.html
http://www.bla-bla.com/hello-world/bla/how/what/26609768-nmbbasdf.html
If the URL has a slash followed by numbers, I need to return just the numbers,
so the result must be
1345346
454536556
26609768
How can I strip everything but the numbers from the URLs?
If those are the only numbers in your URL, you can simply use /\d+/, which stands for "Any digit one or more times".
If you need to specifically group out the numbers in the final part of the string, you can use something more like this: /\/(\d+).*\.html$/, which stands for "A group of digits, following a literal forward slash '/', followed by any characters and .html at the end of a string", and capture group 1 would contain it.
As requested in a comment: to get the numbers preceded by a forward slash / and followed by a hyphen -, use /(?<=\/)\d+(?=\-)/, which can be broken down as:
(?<=\/) # Lookbehind: require a forward slash before the digits, without including it in the match.
\d+ # Match one or more digits (0-9).
(?=\-) # Lookahead: require a hyphen after the digits, without including it in the match.
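Applied to the three URLs from the question, the lookaround version might be used as in this sketch (the ~ delimiter and the variable names $urls and $ids are assumptions):

```php
<?php
$urls = [
    'http://www.bla-bla.com/hello-world/blah/1345346-asfasdf.html',
    'http://www.bla-bla.com/hello-world/454536556-asdf-rtrthr-dssdfg.html',
    'http://www.bla-bla.com/hello-world/bla/how/what/26609768-nmbbasdf.html',
];

$ids = [];
foreach ($urls as $url) {
    // Digits that directly follow a "/" and are directly followed by a "-".
    if (preg_match('~(?<=/)\d+(?=-)~', $url, $m) === 1) {
        $ids[] = $m[0];
    }
}

print_r($ids);
```

Because the slash and the hyphen sit inside lookarounds, $m[0] contains only the digits and no capturing group is needed.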
Try using this as your regular expression: /\/([0-9]+)/
