preg_replace how change part of uri - php

I am trying to change all the links of a html with php preg_replace.
All the uris have the following form
http://example.com/page/58977?forum=60534#comment-60534
I want to change it to:
http://example.com/60534
which means removing everything after "page" and before "comment-", including these two strings.
I tried the following, but it returns no changes:
$result = preg_replace("/^.page.*.comment-.$/", "", $html);
but it seems that my regex syntax is not correct, as it returns the html unchanged.
Could you please help me with this?

The ^ is an anchor that only matches the start of the string, and $ only matches at the end. In order to match you should not anchor the regular expression:
$result = preg_replace("/page.*?comment-/", "", $html);
Note that this could match things that are not URLs. You may want to be more specific as to what will be replaced, for example you might want to only replace links starting with either http: or https: and that don't contain whitespace.

You probably simply need this: http://php.net/manual/en/function.parse-url.php
This function parses a URL and returns an associative array containing any of the various components of the URL that are present.

Alternate way without using regular expression.
Uses parse_url()
<?php
$url = 'http://example.com/page/58977?forum=60534#comment-60534';
$array = parse_url($url);
parse_str($array['query'], $query);
$http = ($array['scheme']) ? $array['scheme'].'://' : NULL;
echo $http.$array['host'].'/'.$query['forum'];
?>
Demo: http://codepad.org/xB3kO588

Related

Extract only numbers from link with codeception

I have this link, and i need to work only with the numbers from that link.
How would i extract them?
I didn't find any answer that would work with codepcetion.
https://www.my-website.com/de/booking/extras#tab-nav-extras-1426
I tired something like this.
$I->grabFromCurrentUrl('\d+');
But i won't work.
Any ideas ?
Staying within the framework:
The manual clearly says that:
grabFromCurrentUrl
Executes the given regular expression against the current URI and
returns the first capturing group. If no parameters are provided, the
full URI is returned.
Since you didn't used any capturing groups (...), nothing is returned.
Try this:
$I->grabFromCurrentUrl('~(\d+)$~');
The $ at the end is optional, it just states that the string should end with the pattern.
Also note that the opening and closing pattern delimiters you would normally use (/) are replaced by tilde (~) characters for convenience, since the input string has a great chance to contain multiple forward slashes. Custom pattern delimiters are completely standard in regexp, as #Naktibalda pointed it out in this answer.
You can use parse_url() to parse entire URL and then extract the part which is most interested for you. After that you can use regex to extract only numbers from the string.
$url = "https://www.my-website.com/de/booking/extras#tab-nav-extras-1426";
$parsedUrl = parse_url($url);
$fragment = $parsedUrl['fragment']; // Contains: tab-nav-extras-1426
$id = preg_replace('/[^0-9]/', '', $fragment);
var_dump($id); // Output: string(4) "1426"
A variant using preg_match() after parse_url():
$url = "https://www.my-website.com/de/booking/extras#tab-nav-extras-1426";
preg_match('/\d+$/', parse_url($url)['fragment'], $id);
var_dump($id[0]);
// Outputs: string(4) "1426"

PHP: Check if string is URL [duplicate]

I'm not very good at regular expressions at all.
I've been using a lot of framework code to date, but I'm unable to find one that is able to match a URL like http://www.example.com/etcetc, but it is also is able to catch something like www.example.com/etcetc and example.com/etcetc.
For matching all kinds of URLs, the following code should work:
<?php
$regex = "((https?|ftp)://)?"; // SCHEME
$regex .= "([a-z0-9+!*(),;?&=$_.-]+(:[a-z0-9+!*(),;?&=$_.-]+)?#)?"; // User and Pass
$regex .= "([a-z0-9\-\.]*)\.(([a-z]{2,4})|([0-9]{1,3}\.([0-9]{1,3})\.([0-9]{1,3})))"; // Host or IP address
$regex .= "(:[0-9]{2,5})?"; // Port
$regex .= "(/([a-z0-9+$_%-]\.?)+)*/?"; // Path
$regex .= "(\?[a-z+&\$_.-][a-z0-9;:#&%=+/$_.-]*)?"; // GET Query
$regex .= "(#[a-z_.-][a-z0-9+$%_.-]*)?"; // Anchor
?>
Then, the correct way to check against the regex is as follows:
<?php
if(preg_match("~^$regex$~i", 'www.example.com/etcetc', $m))
var_dump($m);
if(preg_match("~^$regex$~i", 'http://www.example.com/etcetc', $m))
var_dump($m);
?>
Courtesy: Comments made by splattermania in the PHP manual: preg_match
RegEx Demo in regex101
This worked for me in all cases I had tested:
$url_pattern = '/((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:#\-_=#]+\.([a-zA-Z0-9\&\.\/\?\:#\-_=#])*/';
Tests:
http://test.test-75.1474.stackoverflow.com/
https://www.stackoverflow.com
https://www.stackoverflow.com/
http://wwww.stackoverflow.com/
http://wwww.stackoverflow.com
http://test.test-75.1474.stackoverflow.com/
http://www.stackoverflow.com
http://www.stackoverflow.com/
stackoverflow.com/
stackoverflow.com
http://www.example.com/etcetc
www.example.com/etcetc
example.com/etcetc
user:pass#example.com/etcetc
example.com/etcetc?query=aasd
example.com/etcetc?query=aasd&dest=asds
http://stackoverflow.com/questions/6427530/regular-expression-pattern-to-match-url-with-or-without-http-www
http://stackoverflow.com/questions/6427530/regular-expression-pattern-to-match-url-with-or-without-http-www/
Every valid Internet URL has at least one dot, so the above pattern will simply try to find any at least two strings chained by a dot and has valid characters that URL may have.
Try this:
/^http:\/\/|(www\.)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/
It works exactly like the people want.
It takes with or with out http://, https://, and www.
You can use a question mark after a regular expression to make it conditional so you would want to use:
http:\/\/(www\.)?
That will match anything that has either http://www. or http:// (with no www.)
You could just use a replace method to remove the above, thus getting you the domain. It depends on what you need the domain for.
Try something like this:
.*([\w-]+\.)+[a-z]{2,5}(/[\w-]+)*
Use:
/(https?://)?((?:(\w+-)*\w+)\.)+(?:[a-z]{2})(\/?\w?-?=?_?\??&?)+[\.]?([a-z0-9\?=&_\-%#])?/g
It matches something.com, http(s):// or www. It does not match other [something]:// URLs though, but for my purpose that's not necessary.
The regex matches e.g.:
http://foo.co.uk/
www.regex.com/foo.html?q=bar$some=thi-ng,regex
regex.foo.com/blog
You can try this:
r"(http[s]:\/\/)?([\w-]+\.)+([a-z]{2,5})(\/+\w+)? "
Selection:
may be start with http:// or https:// (optional)
anything (word) end with dot (.)
followed by 2 to 5 character [a-z]
followed by "/[anything]" (optional)
followed by space
Try this
$url_reg = /(ftp|https?):\/\/(\w+:?\w*#)?(\S+)(:[0-9]+)?(\/([\w#!:.?+=&%#!\/-])?)?/;
I have been using the following, which works for all my test cases, as well as fixes any issues where it would trigger at the end of a sentence preceded by a full-stop (end.), or where there were single character initials, such as 'C.C. Plumbing'.
The following regex contains multiple {2,}s, which means two or more matches of the previous pattern.
((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:#\-_=#]{2,}\.([a-zA-Z0-9\&\.\/\?\:#\-_=#]){2,}
Matches URLs such as, but not limited to:
https://example.com
http://example.com
example.com
example.com/test
example.com?value=test
Does not match non-URLs such as, but not limited to:
C.C Plumber
A full-stop at the end of a sentence.
Single characters such as a.b or x.y
Please note: Due to the above, this will not match any single character URLs, such as: a.co, but it will match if it is preceded by a URL scheme, such as: http://a.co.
I was getting so many issues getting the answer from anubhava to work due to recent PHP allowing $ in strings and the preg match wasn't working.
Here is what I used:
// Regular expression
$re = '/((https?|ftp):\/\/)?([a-z0-9+!*(),;?&=.-]+(:[a-z0-9+!*(),;?&=.-]+)?#)?([a-z0-9\-\.]*)\.(([a-z]{2,4})|([0-9]{1,3}\.([0-9]{1,3})\.([0-9]{1,3})))(:[0-9]{2,5})?(\/([a-z0-9+%-]\.?)+)*\/?(\?[a-z+&$_.-][a-z0-9;:#&%=+\/.-]*)?(#[a-z_.-][a-z0-9+$%_.-]*)?/i';
// Match all
preg_match_all($re, $blob, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
// The first element of the array is the full match
This PHP Composer package URL highlight is doing a good job in PHP:
<?php
use VStelmakh\UrlHighlight\UrlHighlight;
$urlHighlight = new UrlHighlight();
$matches = $urlHighlight->getUrls($string);
?>
If it does not have to be regex, you could always use the validate filters that are in PHP.
filter_var('http://example.com', FILTER_VALIDATE_URL);
filter_var (mixed $variable [, int $filter = FILTER_DEFAULT [, mixed $options ]]);
Types of Filters
Validate Filters
Regex if you want to ensure a URL starts with HTTP/HTTPS:
https?:\/\/(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&//=]*)
If you do not require the HTTP protocol:
[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&//=]*)

how to make regex string to match domain and url [duplicate]

I'm not very good at regular expressions at all.
I've been using a lot of framework code to date, but I'm unable to find one that is able to match a URL like http://www.example.com/etcetc, but it is also is able to catch something like www.example.com/etcetc and example.com/etcetc.
For matching all kinds of URLs, the following code should work:
<?php
$regex = "((https?|ftp)://)?"; // SCHEME
$regex .= "([a-z0-9+!*(),;?&=$_.-]+(:[a-z0-9+!*(),;?&=$_.-]+)?#)?"; // User and Pass
$regex .= "([a-z0-9\-\.]*)\.(([a-z]{2,4})|([0-9]{1,3}\.([0-9]{1,3})\.([0-9]{1,3})))"; // Host or IP address
$regex .= "(:[0-9]{2,5})?"; // Port
$regex .= "(/([a-z0-9+$_%-]\.?)+)*/?"; // Path
$regex .= "(\?[a-z+&\$_.-][a-z0-9;:#&%=+/$_.-]*)?"; // GET Query
$regex .= "(#[a-z_.-][a-z0-9+$%_.-]*)?"; // Anchor
?>
Then, the correct way to check against the regex is as follows:
<?php
if(preg_match("~^$regex$~i", 'www.example.com/etcetc', $m))
var_dump($m);
if(preg_match("~^$regex$~i", 'http://www.example.com/etcetc', $m))
var_dump($m);
?>
Courtesy: Comments made by splattermania in the PHP manual: preg_match
RegEx Demo in regex101
This worked for me in all cases I had tested:
$url_pattern = '/((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:#\-_=#]+\.([a-zA-Z0-9\&\.\/\?\:#\-_=#])*/';
Tests:
http://test.test-75.1474.stackoverflow.com/
https://www.stackoverflow.com
https://www.stackoverflow.com/
http://wwww.stackoverflow.com/
http://wwww.stackoverflow.com
http://test.test-75.1474.stackoverflow.com/
http://www.stackoverflow.com
http://www.stackoverflow.com/
stackoverflow.com/
stackoverflow.com
http://www.example.com/etcetc
www.example.com/etcetc
example.com/etcetc
user:pass#example.com/etcetc
example.com/etcetc?query=aasd
example.com/etcetc?query=aasd&dest=asds
http://stackoverflow.com/questions/6427530/regular-expression-pattern-to-match-url-with-or-without-http-www
http://stackoverflow.com/questions/6427530/regular-expression-pattern-to-match-url-with-or-without-http-www/
Every valid Internet URL has at least one dot, so the above pattern will simply try to find any at least two strings chained by a dot and has valid characters that URL may have.
Try this:
/^http:\/\/|(www\.)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?$/
It works exactly like the people want.
It takes with or with out http://, https://, and www.
You can use a question mark after a regular expression to make it conditional so you would want to use:
http:\/\/(www\.)?
That will match anything that has either http://www. or http:// (with no www.)
You could just use a replace method to remove the above, thus getting you the domain. It depends on what you need the domain for.
Try something like this:
.*([\w-]+\.)+[a-z]{2,5}(/[\w-]+)*
Use:
/(https?://)?((?:(\w+-)*\w+)\.)+(?:[a-z]{2})(\/?\w?-?=?_?\??&?)+[\.]?([a-z0-9\?=&_\-%#])?/g
It matches something.com, http(s):// or www. It does not match other [something]:// URLs though, but for my purpose that's not necessary.
The regex matches e.g.:
http://foo.co.uk/
www.regex.com/foo.html?q=bar$some=thi-ng,regex
regex.foo.com/blog
You can try this:
r"(http[s]:\/\/)?([\w-]+\.)+([a-z]{2,5})(\/+\w+)? "
Selection:
may be start with http:// or https:// (optional)
anything (word) end with dot (.)
followed by 2 to 5 character [a-z]
followed by "/[anything]" (optional)
followed by space
Try this
$url_reg = /(ftp|https?):\/\/(\w+:?\w*#)?(\S+)(:[0-9]+)?(\/([\w#!:.?+=&%#!\/-])?)?/;
I have been using the following, which works for all my test cases, as well as fixes any issues where it would trigger at the end of a sentence preceded by a full-stop (end.), or where there were single character initials, such as 'C.C. Plumbing'.
The following regex contains multiple {2,}s, which means two or more matches of the previous pattern.
((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:#\-_=#]{2,}\.([a-zA-Z0-9\&\.\/\?\:#\-_=#]){2,}
Matches URLs such as, but not limited to:
https://example.com
http://example.com
example.com
example.com/test
example.com?value=test
Does not match non-URLs such as, but not limited to:
C.C Plumber
A full-stop at the end of a sentence.
Single characters such as a.b or x.y
Please note: Due to the above, this will not match any single character URLs, such as: a.co, but it will match if it is preceded by a URL scheme, such as: http://a.co.
I was getting so many issues getting the answer from anubhava to work due to recent PHP allowing $ in strings and the preg match wasn't working.
Here is what I used:
// Regular expression
$re = '/((https?|ftp):\/\/)?([a-z0-9+!*(),;?&=.-]+(:[a-z0-9+!*(),;?&=.-]+)?#)?([a-z0-9\-\.]*)\.(([a-z]{2,4})|([0-9]{1,3}\.([0-9]{1,3})\.([0-9]{1,3})))(:[0-9]{2,5})?(\/([a-z0-9+%-]\.?)+)*\/?(\?[a-z+&$_.-][a-z0-9;:#&%=+\/.-]*)?(#[a-z_.-][a-z0-9+$%_.-]*)?/i';
// Match all
preg_match_all($re, $blob, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
// The first element of the array is the full match
This PHP Composer package URL highlight is doing a good job in PHP:
<?php
use VStelmakh\UrlHighlight\UrlHighlight;
$urlHighlight = new UrlHighlight();
$matches = $urlHighlight->getUrls($string);
?>
If it does not have to be regex, you could always use the validate filters that are in PHP.
filter_var('http://example.com', FILTER_VALIDATE_URL);
filter_var (mixed $variable [, int $filter = FILTER_DEFAULT [, mixed $options ]]);
Types of Filters
Validate Filters
Regex if you want to ensure a URL starts with HTTP/HTTPS:
https?:\/\/(www\.)?[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&//=]*)
If you do not require the HTTP protocol:
[-a-zA-Z0-9#:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()#:%_\+.~#?&//=]*)

Extracting URLs from a JSON-like string

I need to extract the first URL from some content. The content may be like this:
({items:[{url:"http://cincinnati.ebayclassifieds.com/",name:"Cincinnati"},{url:"http://dayton.ebayclassifieds.com/",name:"Dayton"}],error:null});
or may contain only a link
({items:[{url:"http://portlandor.ebayclassifieds.com/",name:"Portland (OR)"}],error:null});
currently I have :
$pattern = "/\:\[\{url\:\"(.*)\"\,name/";
preg_match_all($pattern, $htmlContent, $matches);
$URL = $matches[1][0];
however it works only if there is a single link so I need a regex which should work for the both cases.
You can use this REGEX:
$pattern = "/url\:\"([^\"]+)\"/";
Worked for me :)
Hopefully this should work for you
<?php
$str = '({items:[{url:"http://cincinnati.ebayclassifieds.com/",name:"Cincinnati"},{url:"http://dayton.ebayclassifieds.com/",name:"Dayton"}],error:null});'; //The string you want to extract the 1st URL from
$match = ""; //Define the match variable
preg_match("%(((ht|f)tp(s?))\://)?(www.|[a-zA-Z].)[a-zA-Z0-9\-\.]+\.(com|edu|gov|mil|net|org|biz|info|name|museum|us|ca|uk)(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\;\?\'\\\+&\%\$#\=~_\-]+))*%",$str,$match); //I Googled for the best Regular expression for URLs and found the one included in the preg_match
echo $match[0]; //Return the first item in the array (the first URL returned)
?>
This is the website that I found the regular expression on: http://regexlib.com/Search.aspx?k=URL
like the others have said, json_decode should work for you aswell
That smells like JSON to me. Try using http://php.net/json_decode
Looks like JSON to me, visit http://php.net/manual/en/book.json.php and use json_decode().

How to remove all links from a mixed string with php

i have a variable in php can be like this : $string = "hello check out this http://xx.xx/xxx & thanks !!";
i want a function to strip that link and remove it from the string, the url can be with WWW. or without it.
also, this variable can contains multiple urls .
Thanks
You could use regular expressions to do it:
$string = preg_replace('\b(https?|ftp|file)://[-A-Z0-9+&##/%?=~_|!:,.;]*[-A-Z0-9+&##/%=~_|]', '', $string);
That will find every instance of a URL in the string and replace it with a blank string.
You can use a regular expression to match and replace all URL formats, as follows:
$string = preg_replace("/(^¦\s)(http:\/\/)?(www\.)?[\.0-9a-zA-Z\-_~]+\.(com¦net¦org¦info¦name¦biz¦.+\.\w\w)((\/)?[0-9a-zA-Z\.\-_~#]+)?\b/ie","",$string);

Categories