PHP regex: match URLs, excluding several of them

I'm having trouble matching URLs. For example, we have these URLs (strings):
/, /news, /news/1-addsf, /articles, /guides etc.
The task: match all of them except those that start with "/news" (with or without a continuation such as "/1-addsf"; I need both regexps), "/articles", and "/".
I tried things like this:
#([^\/news.*]|[^\/articles])#is
#\/[^(news.*|articles)]#is
#^\/(^news|^articles)#is
and many, many other variants.
I think there is something I don't know, or I'm bad at googling, but I can't find anything on this question.
I need a working regexp. Thanks!
P.S. Sorry for my English.

It seems like you want something like:
#^/(?!news|articles).*#is
The above regex matches all the URL strings except the ones which start with /news or /articles. Note that it still matches the bare /; change .* to .+ if that one has to be excluded too.
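A quick way to check the lookahead against the URLs from the question. This is only a sketch: it uses .+ instead of .* so that the bare / is rejected as well, and the variable names are made up for the example. Keep in mind that (?!news|articles) is a prefix test, so something like /newsletter would be rejected too.

```php
<?php
// URLs taken from the question.
$urls = ['/', '/news', '/news/1-addsf', '/articles', '/guides'];

// Negative lookahead rejects anything starting with /news or /articles;
// .+ (instead of .*) requires at least one character after the slash,
// which also rejects the bare "/".
$pattern = '#^/(?!news|articles).+#is';

$kept = array_values(array_filter($urls, static function ($url) use ($pattern) {
    return preg_match($pattern, $url) === 1;
}));

print_r($kept);
```

If only whole path segments should be excluded, a lookahead such as (?!(?:news|articles)(?:/|$)) is one possible refinement.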

Related

Make links clickable with preg_replace

I want to make my links clickable automatically, but it doesn't work.
Here's my code:
$val['message'] = preg_replace('#https?://(w{3}.)?([a-zA-Z0-9_-]{1,20}(.[a-zA-Z0-9_-]{1,10}))(/[a-zA-Z0-9_-]{1,12}(/[a-zA-Z0-9_-]{1,12}))?(/([a-zA-Z0-9_-]{1,20})(.[a-zA-Z0-9_-]{1,7}))?(\?[a-zA-Z0-9_-]{1,7}=[a-zA-Z0-9_-]{1,7}(&[a-zA-Z0-9_-]{1,7}=[a-zA-Z0-9_-]{1,7}))?#is', '$0', $val['message']);
(Here is the same pattern, split across lines:)
'https?://
(w{3}.)?
([a-zA-Z0-9_-]{1,20}(.[a-zA-Z0-9_-]{1,10}))
(/[a-zA-Z0-9_-]{1,12}(/[a-zA-Z0-9_-]{1,12}))?
(/([a-zA-Z0-9_-]{1,20})
(.[a-zA-Z0-9_-]{1,7}))?
(\?[a-zA-Z0-9_-]{1,7}=[a-zA-Z0-9_-]{1,7}
(&[a-zA-Z0-9_-]{1,7}=[a-zA-Z0-9_-]{1,7}))?
I also tried this:
$val['message'] = preg_replace("#(([\w]+?://[\w#$%&~.-;:=,?#[]+])(/[\w#$%&~/.-;:=,?#[]+])?)#is", "$1", $val['message']);
but it doesn't work with links like https://www.youtube.com/watch?v=videolink
Try this regex; it worked for me:
(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?
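A minimal sketch of how that pattern could be plugged into preg_replace. The function name linkify and the choice to HTML-escape the text before wrapping the URLs are assumptions of this example, not something from the answer above, and it makes no attempt to handle trailing punctuation or brackets.

```php
<?php
// Hypothetical helper: wraps URLs matched by the pattern above in <a> tags.
function linkify(string $text): string
{
    // The pattern from the answer, wrapped in ~ delimiters.
    $pattern = '~(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?~';

    // Escape the text first so that the only markup in the result
    // is the anchor tags added below.
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    // $0 is the whole matched URL.
    return preg_replace($pattern, '<a href="$0">$0</a>', $escaped);
}

echo linkify('see https://www.youtube.com/watch?v=videolink now');
```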
Why does everyone like to try to make their own regex for this? Linkifying links is hard work with lots of edge cases, not to mention what should or shouldn't be included in the link, e.g.
Are you talking about youtube.com?
I like the ASP.net language
I wonder what www.stackoverflow.com counts as a link
Parentheses are a particular pain in the butt (example: http://example.com/?auth=gH;2($Hd)DA0;QAb)
Aside: in the last line above, StackOverflow's preview links everything up to the last closing bracket, but after submission it only links up to the first bracket. That helps prove my point about how hard this is to get right and consistent, though!
Best to use something established, example:
https://github.com/misd-service-development/php-linkify
For something a bit more quick n dirty:
http://buildinternet.com/2010/05/how-to-automatically-linkify-text-with-php-regular-expressions/

RegExp to match a segment of a URL

I'm trying to use RegExp to match a segment of a URL.
The URL in question is this:
http://www.example.com/news/region/north-america/
As I need this regex for the WordPress URL Rewrite API, the subject will only be the path section of the URL:
news/region/north-america
In the above example I need to be able to extract the north-america portion of the path, however when pagination is used the path becomes something like this:
news/region/north-america/page/2
Where I still only need to extract the north-america portion.
The RegExp I've come up with is as follows:
^news/region/(.*?)/(.*?)?/?(.*?)?$
However, this does not match news/region/north-america, only news/region/north-america/page/2
From what I can tell I need to make the trailing slash after north-america optional, but adding /? doesn't seem to work.
Try this:
preg_match('/news\/region\/(.*?)\//',"http://www.example.com/news/region/north-america/page/2",$matches);
$matches[1] will then contain "north-america".
You should match using this regex:
^news/region/([^/]+)
This captures north-america in group 1 (the full match is news/region/north-america), even when the path becomes news/region/north-america/page/2
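To see what the [^/]+ version captures, here is a small sketch. The helper name regionSlug is made up for illustration, and the subject is assumed to be the path without a leading slash, as described in the question.

```php
<?php
// Hypothetical helper: returns the segment right after news/region/, or null.
function regionSlug(string $path): ?string
{
    // [^/]+ stops at the next slash, so pagination segments are ignored.
    if (preg_match('#^news/region/([^/]+)#', $path, $m) === 1) {
        return $m[1];
    }
    return null;
}

var_dump(regionSlug('news/region/north-america'));
var_dump(regionSlug('news/region/north-america/page/2'));
```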
georg's suggested rule works like a charm:
^news/region/(.*?)(?:/(.*?)/(.*?))?$
For those interested in the application of this regex, I used it in the WP Rewrite API to grab the custom taxonomy and page number (if present) and assign the relevant matches to the WP rewrite:
$newRules['news/region/(.*?)(?:/(.*?)/(.*?))?$'] = 'index.php?region=$matches[1]&forcetemplate=news&paged=$matches[3]';
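Outside of WordPress, the same rule can be tried with plain preg_match to see which groups end up holding the region and the page number. This is a sketch only; parseRegionPath and the array keys are names invented for the example.

```php
<?php
// Hypothetical helper: splits a path into region and optional page number.
function parseRegionPath(string $path): ?array
{
    $pattern = '#^news/region/(.*?)(?:/(.*?)/(.*?))?$#';
    if (preg_match($pattern, $path, $m) !== 1) {
        return null;
    }
    // Group 3 is absent from $m when the optional part did not take part
    // in the match, hence the ?? fallback.
    return ['region' => $m[1], 'paged' => $m[3] ?? null];
}

print_r(parseRegionPath('news/region/north-america'));
print_r(parseRegionPath('news/region/north-america/page/2'));
```

Note that with a trailing slash (news/region/north-america/) the lazy first group ends up including that slash, so the path may need trimming first.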

PHP preg_match: check if a language is defined in the URL

I would like to test for a language match in a URL.
The URL will look like: http://www.domainname.com/en/#m=4&guid=%some_param%
I want to check if there is an existing language code within the URL. I was thinking of something along these lines:
^(.*:)\/\/([a-z\-.]+)(:[0-9]+)?(.*)$
or
^(http|https:)\/\/([a-z\-.]+)(:[0-9]+)?(.*)$
I'm not that sharp with regex. Can anyone help or point me in the right direction?
[https]+://[a-z-]+.([a-z])+/
try this,
http://www.regexr.com/ is an easy site for creating regexes
If you know the data you are testing is a URL, then I would not bother adding all of the URL parts to the regex. Keep it simple, like:
/\/[a-z]{2}\//
That looks for a two-letter combination between two forward slashes. If you need to capture the language code, wrap it in parentheses:
/\/([a-z]{2})\//
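As a rough sketch of how the capturing version could be used: the helper name detectLanguage and the list of supported codes are assumptions made for this example. The whitelist is there because any two-letter path segment (for instance /js/) would otherwise count as a language.

```php
<?php
// Hypothetical helper: returns the language code found in the URL, or null.
function detectLanguage(string $url, array $supported = ['en', 'fr', 'de']): ?string
{
    // Same idea as the regex above, with # as delimiter to avoid escaping slashes.
    if (preg_match('#/([a-z]{2})/#', $url, $m) === 1
        && in_array($m[1], $supported, true)) {
        return $m[1];
    }
    return null;
}

var_dump(detectLanguage('http://www.domainname.com/en/#m=4&guid=%some_param%'));
```

Only the first two-letter segment is checked here, so a URL like /js/en/ would need a stricter pattern.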

Can't use OR (|) in a PHP regular expression

I'm a newbie here. I'm facing a weird problem in using regex in PHP.
$result = "some very long long string with different kind of links";
$regex='/<.*?href.*?="(.*?net.*?)"/'; //this is the regex rule
preg_match_all($regex,$result,$parts);
Here I'm trying to get the links from the result string, but this only gives me the links which contain .net. I also want to get the links which contain .com. For this I tried this code:
$regex='/<.*?href.*?="(.*?net|com.*?)"/';
But it shows nothing.
Sorry for my bad English.
Thanks in advance.
Update 1:
Now I'm using this:
$regex='/<.*?href.*?="(.*?)"/';
This rule grabs all the links from the string, but it is not perfect, because it also grabs other substrings like "javascript".
The | character applies to everything within the capturing group, so (.*?net|com.*?) will match either .*?net or com.*?. I think what you want is (.*?(net|com).*?).
If you do not want the extra capturing group, you can use (.*?(?:net|com).*?).
You could also use (.*?net.*?|.*?com.*?), but this is not recommended because of the unnecessary repetition.
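Here is a small sketch of the grouped alternation in action. The sample HTML is invented for the example, and the pattern is a slightly tightened variant (it uses [^"]* instead of .*? so a match cannot run past the closing quote), not the exact expression from above.

```php
<?php
// Made-up input for illustration.
$html = '<a href="http://example.net/a">A</a> '
      . '<a href="http://example.com/b">B</a> '
      . '<a href="http://example.org/c">C</a> '
      . '<a href="javascript:void(0)">D</a>';

// (?:net|com) limits the alternation to the two suffixes;
// without the group, | would split the whole expression in two.
$regex = '/<a\s[^<>]*href\s*=\s*"([^"]*\.(?:net|com)\b[^"]*)"/i';

preg_match_all($regex, $html, $parts);
$links = $parts[1];

print_r($links);
```

Only the .net and .com links end up in $links; the .org link and the javascript: pseudo-link are skipped.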
Your regex gets interpreted as .*?net or com.*?. You'll want (.*?(net|com).*?).
Try this:
$regex='/<.*?href.*?="(.*?\.(?:net|com)\b.*?)"/i';
or better:
$regex='/<a .*?href\s*+=\s*+"\K.*?\.(?:net|com)\b[^"]*+/i';
<.*?href
is a problem. This will match from the first < on the current line to the first href, regardless of whether they belong to the same tag.
Generally, it's unwise to try and parse HTML with regexes; if you absolutely insist on doing that, at least be a bit more specific (but still not perfect):
$regex='/<[^<>]*href[^<>=]*="(?:[^"]*(net|com)[^"]*)"/';

PHPCrawl: exclude URLs ending with ?query=

I'm playing with PHPCrawl and I'd like to know whether it is possible to exclude from crawling all URLs with parameters (whether they are .html or .php), like:
domain.com/article.html?showComment=1289420017718
Add a non-follow match pattern for any URL containing a question mark:
$crawler->addNonFollowMatch(".*\?.*");
I just found that this works better:
$crawler->addNonFollowMatch("/\?/");
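The pattern itself can be checked without the crawler. The sketch below does not call PHPCrawl at all; it only runs the /\?/ expression through preg_match, and hasQueryString is a name made up for the example.

```php
<?php
// Hypothetical helper: true when the URL contains a literal question mark.
function hasQueryString(string $url): bool
{
    // \? matches a literal "?"; an unescaped ? would be a quantifier.
    return preg_match('/\?/', $url) === 1;
}

var_dump(hasQueryString('domain.com/article.html?showComment=1289420017718'));
var_dump(hasQueryString('domain.com/article.html'));
```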
