PHP preg_match: check if a language is defined in a URL

I would like to test for a language match in a URL.
The URL will look like this: http://www.domainname.com/en/#m=4&guid=%some_param%
I want to check whether the URL contains a language code. I was thinking of something along these lines:
^(.*:)\/\/([a-z\-.]+)(:[0-9]+)?(.*)$
or
^(http|https:)\/\/([a-z\-.]+)(:[0-9]+)?(.*)$
I'm not that sharp with regex. Can anyone help or point me in the right direction?

Try this:
[https]+://[a-z-]+.([a-z])+/
http://www.regexr.com/ is an easy site for building and testing regexes.

If you know the data you are testing is a URL, then I would not bother adding all of the URL parts to the regex. Keep it simple, like: /\/[a-z]{2}\// That looks for a two-letter combination between two forward slashes. If you need to capture the language code, wrap it in parentheses: /\/([a-z]{2})\//
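A minimal sketch of how that capturing pattern could be used with preg_match. The helper name findLanguageCode is hypothetical (it does not come from the question), and the sketch assumes language codes are always two lowercase letters.

```php
<?php
// Hypothetical helper: returns the two-letter code, or null when none is found.
function findLanguageCode($url)
{
    // Same pattern as in the answer, with '#' as the delimiter so "/" needs no escaping.
    if (preg_match('#/([a-z]{2})/#', $url, $matches) === 1) {
        return $matches[1]; // first capture group: the language code
    }
    return null;
}

var_dump(findLanguageCode('http://www.domainname.com/en/#m=4&guid=%some_param%'));
var_dump(findLanguageCode('http://www.domainname.com/about/'));
```

Because the pattern is not anchored, any two-letter path segment (or a two-letter host name) would also be reported as a language code, so checking the result against a list of supported languages is probably worthwhile.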

Related

PHP regex match excluded several urls

I'm having trouble matching URLs. For example, we have these URLs (strings):
/, /news, /news/1-addsf, /articles, /guides etc.
The task: match all of them except those that start with "/news" (with or without a continuation such as "/1-addsf"; I need a regexp for both cases), "/articles", and "/" itself.
I tried things like this:
#([^\/news.*]|[^\/articles])#is
#\/[^(news.*|articles)]#is
#^\/(^news|^articles)#is
and many, many other variants.
I think I'm missing something or am bad at googling, but I can't find anything on this question.
I need a working regexp. Thanks!
P.S. Sorry for my English.
Seems like you want something like,
#^/(?!news|articles).*#is
The above regex matches all the URL strings except the ones that start with /news or /articles.
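As a rough check, the pattern can be run over the sample URLs from the question with preg_grep. The helper name keepAllowed and the second, stricter pattern are assumptions added here for illustration: the stricter one puts an extra `$` alternative in the lookahead so that the bare "/" is rejected as well, which the question also asked for.

```php
<?php
// Hypothetical helper: returns the URLs that match the given pattern.
function keepAllowed($pattern, array $urls)
{
    // preg_grep() keeps the original keys, so re-index the result.
    return array_values(preg_grep($pattern, $urls));
}

$urls = ['/', '/news', '/news/1-addsf', '/articles', '/guides'];

// Pattern from the answer: rejects anything starting with /news or /articles.
$fromAnswer = '#^/(?!news|articles).*#is';

// Assumed variant: the extra "$" alternative also rejects the bare "/".
$stricter = '#^/(?!$|news|articles).*#is';

print_r(keepAllowed($fromAnswer, $urls));
print_r(keepAllowed($stricter, $urls));
```

Both patterns also reject paths such as /newsletter, because the lookahead only tests the prefix.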

RegExp to match a segment of a URL

I'm trying to use RegExp to match a segment of a URL.
The URL in question is this:
http://www.example.com/news/region/north-america/
As I need this regex for the WordPress URL Rewrite API, the subject will only be the path section of the URL:
news/region/north-america
In the above example I need to be able to extract the north-america portion of the path, however when pagination is used the path becomes something like this:
news/region/north-america/page/2
Where I still only need to extract the north-america portion.
The RegExp I've come up with is as follows:
^news/region/(.*?)/(.*?)?/?(.*?)?$
However, this does not match news/region/north-america, only news/region/north-america/page/2.
From what I can tell I need to make the trailing slash after north-america optional, but adding /? doesn't seem to work.
Try this:
preg_match('/news\/region\/(.*?)\//',"http://www.example.com/news/region/north-america/page/2",$matches);
$matches[1] will then contain "north-america".
You should match using this regex:
^news/region/([^/]+)
This captures north-america in the first group, even when the URI becomes news/region/north-america/page/2.
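A short sketch of that pattern applied to both forms of the path. The helper name regionFromPath is hypothetical, and the input is assumed to be the path without a leading slash, as described in the question.

```php
<?php
// Hypothetical helper: extracts the region segment from a rewrite path.
function regionFromPath($path)
{
    // [^/]+ stops at the next slash, so "/page/2" is never part of the capture.
    if (preg_match('#^news/region/([^/]+)#', $path, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

foreach (['news/region/north-america', 'news/region/north-america/page/2'] as $path) {
    echo $path, ' => ', regionFromPath($path), PHP_EOL;
}
```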
georg's suggested rule works like a charm:
^news/region/(.*?)(?:/(.*?)/(.*?))?$
For those interested in the application of this regex, I used it in the WP Rewrite API to grab the custom taxonomy and page number (if present) and assign the relevant matches to the WP rewrite:
$newRules['news/region/(.*?)(?:/(.*?)/(.*?))?$']='index.php?region=$matches[1]&forcetemplate=news&paged=$matches[3]';

Can't use OR( | ) in php Regular expression

I'm a newbie here. I'm facing a weird problem in using regex in PHP.
$result = "some very long long string with different kind of links";
$regex='/<.*?href.*?="(.*?net.*?)"/'; //this is the regex rule
preg_match_all($regex,$result,$parts);
In this code I'm trying to get the links from the result string, but it only gives me the links that contain .net. I also want to get the links that contain .com. For this I tried this code:
$regex='/<.*?href.*?="(.*?net|com.*?)"/';
But it shows nothing.
Sorry for my bad English.
Thanks in advance.
Update 1:
Now I'm using this:
$regex='/<.*?href.*?="(.*?)"/';
This rule grabs all the links from the string, but it is not perfect, because it also grabs other substrings like "javascript".
The | character applies to everything within the capturing group, so (.*?net|com.*?) will match either .*?net or com.*?. I think what you want is (.*?(net|com).*?).
If you do not want the extra capturing group, you can use (.*?(?:net|com).*?).
You could also use (.*?net.*?|.*?com.*?), but this is not recommended because of the unnecessary repetition.
Your regex gets interpreted as .*?net or com.*?. You'll want (.*?(net|com).*?).
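To illustrate the grouping point, here is a small sketch that limits the alternation to the TLD with a non-capturing group. The helper name extractLinks, the sample HTML, and the tightened pattern (which uses [^"]* instead of .*? inside the attribute value) are assumptions for illustration, not the asker's exact code.

```php
<?php
// Hypothetical helper: returns href values that contain ".net" or ".com".
function extractLinks($html)
{
    // (?:net|com) confines the alternation to the TLD; [^"]* keeps the
    // match inside a single double-quoted attribute value.
    $regex = '/<a\s[^<>]*href\s*=\s*"([^"]*\.(?:net|com)\b[^"]*)"/i';
    preg_match_all($regex, $html, $parts);
    return $parts[1]; // contents of the first capture group for every match
}

$html = '<a href="http://example.net/a">x</a> '
      . '<a href="http://example.com/b">y</a> '
      . '<a href="javascript:void(0)">z</a>';

print_r(extractLinks($html));
```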
Try this:
$regex='/<.*?href.*?="(.*?\.(?:net|com)\b.*?)"/i';
or better:
$regex='/<a .*?href\s*+=\s*+"\K.*?\.(?:net|com)\b[^"]*+/i';
<.*?href
is a problem. This will match from the first < on the current line to the first href, regardless of whether they belong to the same tag.
Generally, it's unwise to try and parse HTML with regexes; if you absolutely insist on doing that, at least be a bit more specific (but still not perfect):
$regex='/<[^<>]*href[^<>=]*="(?:[^"]*(net|com)[^"]*)"/';

URL Beautification using .htaccess or php?

In search of more user-friendly and search-engine-friendly URLs, I have beautified my URLs.
The .htaccess Apache rule that achieves this (thanks to Laurence Gonsalves):
RewriteRule ^([a-z][a-z])/(.*) /$2?ln=$1 [L]
which makes this possible:
/uk/somepage instead of /somepage?ln=uk
/de/somepage instead of /somepage?ln=de
/ja/somepage instead of /somepage?ln=ja
Now the difficult part: previously, the language of the current page was changed with a normal link like href="?ln=de" or href="?ln=it". How can I achieve that now, so that the current page stays the same and only the leading two lowercase letters that indicate the language change?
In other words, how do I make the link change /uk/contact to /de/contact once the German (de) language flag is clicked? Both PHP solutions that rewrite the URL and .htaccess solutions are accepted.
I found out that $_SERVER['REQUEST_URI'] will output /uk/somepage, but I can't write the PHP code that splits it into its components and inserts a new language code like "de", so that I can put the result into a normal href on a German flag, etc. Thanks for any and all clues/answers!
You'd probably want to look at something like explode or regular expressions to strip out the non-language part of the URL (e.g., /contact) and just add it again to a new string containing the language identifier.
Maybe this could get you started:
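<?php
function changeLanguageLink($language_id)
{
    $uri = $_SERVER['REQUEST_URI'];
    // Strip a leading two-letter language segment such as /uk/ or /de/.
    $link = preg_replace('#^/[a-z]{2}/(.*)#', '/$1', $uri);
    // Prepend the new language code, keeping the leading slash.
    return '/' . $language_id . $link;
}
?>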
Change language to UK
Well, you can split the request URI using explode() (split() also did this, but it was deprecated in PHP 5.3 and removed in PHP 7).
$uri_bits=explode('/', $_SERVER['REQUEST_URI']);
In theory the language identifier will be in $uri_bits[1] ([0] will contain a zero-length string, but you should check this by print_r()-ing the array). Of course, you should test that $uri_bits[1] exists and that it is the language identifier. The simplest way to do that would be:
if(isset($uri_bits[1]) && $uri_bits[1]==$_GET['ln'])
Then you can change that element and concatenate the bits again using implode():
$uri_bits[1]="de";
$url_german=implode('/', $uri_bits);
At least that's how I'd do it.
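The explode()/implode() approach could be sketched as a self-contained function. The name switchLanguage is hypothetical, and the sketch assumes that any two-letter first segment is a language code, so a page that is itself called /me would be misread.

```php
<?php
// Hypothetical helper: returns $requestUri with its language segment set to $newLang.
function switchLanguage($requestUri, $newLang)
{
    // Separate the path from any query string so "?x=1" survives untouched.
    $parts = explode('?', $requestUri, 2);
    $bits = explode('/', $parts[0]);

    // For "/uk/contact", $bits is ['', 'uk', 'contact'].
    if (isset($bits[1]) && preg_match('/^[a-z]{2}$/', $bits[1]) === 1) {
        $bits[1] = $newLang;                   // replace the existing code
    } else {
        array_splice($bits, 1, 0, [$newLang]); // no code present: insert one
    }

    $path = implode('/', $bits);
    return isset($parts[1]) ? $path . '?' . $parts[1] : $path;
}

echo switchLanguage('/uk/contact', 'de'), PHP_EOL;
```

In a template this would be called with $_SERVER['REQUEST_URI'] as the first argument, for example in the href of each language flag.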

PHP regex for filtering out urls from specific domains for use in a vBulletin plug-in

I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to php let alone regexes.
Basically, I'm trying to put together a regex and use preg_replace to find any URLs that are from these domains and replace the entire link with a message saying that they aren't allowed. I want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] BB tags.
As for the regex, I think I would need it to find URLs that meet the following conditions:
1. Starts with http or an anchor tag. I believe that the URLs in [CODE] tags could be processed the same as the plain-text URLs, and it's fine if the replacement ends up inside the [CODE] tag afterward.
2. Could contain any number of any characters before the domain/word
3. Has the domain somewhere in the middle
4. Could contain any number of any characters after the domain
5. Ends with one of a number of extensions such as (html|htm|rar|zip|001) or with a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.
You can use parse_url on the URLs and inspect the associative array it returns. That lets you filter by domain or apply even finer-grained control.
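A minimal sketch of the parse_url idea. The function name isBlockedUrl is hypothetical, and the block list here uses full domain names such as domain1.com, which is an assumption (the question lists bare names like domain1).

```php
<?php
// Hypothetical helper: true when the URL's host is a blocked domain or a subdomain of one.
function isBlockedUrl($url, array $blockedDomains)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // relative or malformed URL: no host to check
    }
    $host = strtolower($host);

    foreach ($blockedDomains as $domain) {
        $domain = strtolower($domain);
        // Exact match, or the host ends with ".<domain>" (a subdomain).
        if ($host === $domain || substr($host, -strlen($domain) - 1) === '.' . $domain) {
            return true;
        }
    }
    return false;
}

var_dump(isBlockedUrl('http://www.domain1.com/files/a.rar', ['domain1.com', 'domain2.com']));
var_dump(isBlockedUrl('http://notdomain1.com/', ['domain1.com', 'domain2.com']));
```

Comparing hosts this way avoids accidental matches on look-alike names, which a loose regex over the whole URL can produce.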
I think you can avoid that overhead by using the built-in filter_var function, which is available since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);
Hmm, my first guess: you put $filterthese directly inside a single-quoted string. Single quotes don't allow variable substitution. Also, $filterthese is an array, so it should first be joined:
$filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but those points seem worth a check to me.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )(([a-z0-9-]+\.)*)?(?#
domains to block: )('.implode("|", $filterthese).')(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';
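For comparison, here is a reduced sketch that only handles plain-text URLs and builds the alternation with preg_quote. The function name filterLinks, the sample message, and the assumption that each blocked name is followed directly by a TLD are all illustrative; anchor tags and [CODE] blocks are not handled here.

```php
<?php
// Hypothetical helper: replaces plain-text URLs on blocked domains with $replacement.
function filterLinks($message, array $domains, $replacement)
{
    // preg_quote() escapes regex metacharacters; '!' is the delimiter used below.
    $quoted = array_map(function ($d) {
        return preg_quote($d, '!');
    }, $domains);

    $regex = '!https?://(?:[a-z0-9-]+\.)*'        // scheme and optional subdomains
           . '(?:' . implode('|', $quoted) . ')'  // one of the blocked names
           . '\.[a-z]{2,}'                        // assumed: name is followed by a TLD
           . '(?:/[^\s"\'<>\[\]]*)?!i';           // optional path

    return preg_replace($regex, $replacement, $message);
}

$filterthese = ['domain1', 'domain2', 'domain3'];
$message = 'Get it at http://www.domain1.com/files/a.rar or see http://example.org/page.html';
echo filterLinks($message, $filterthese, 'LINKS HAVE BEEN FILTERED MESSAGE'), PHP_EOL;
```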
