I'm playing with PHPCrawl and I'd like to know whether it is possible to exclude from crawling all URLs with parameters (whether they are .html or .php), like
domain.com/article.html?showComment=1289420017718
Add a non-follow match pattern for any URL containing a question mark:
$crawler->addNonFollowMatch(".*\?.*")
I just found myself that this works better:
$crawler->addNonFollowMatch("/\?/");
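To see what the delimited pattern does outside of PHPCrawl, here is a minimal sketch that applies the same regex to a list of URLs with PHP's own preg_grep. The sample URLs are made up for illustration, and this only mimics the filtering; it does not call PHPCrawl itself.

```php
<?php
// Hypothetical sample URLs, used only to illustrate the pattern.
$urls = [
    'http://domain.com/article.html',
    'http://domain.com/article.html?showComment=1289420017718',
    'http://domain.com/index.php?id=5',
    'http://domain.com/about.php',
];

// Keep only the URLs that do NOT contain a question mark.
// PREG_GREP_INVERT returns the entries that fail to match the pattern.
$withoutParams = array_values(preg_grep('/\?/', $urls, PREG_GREP_INVERT));

print_r($withoutParams);
```

Only the two URLs without a query string remain, which is the set the crawler would still be allowed to follow.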
Related
I'm having trouble matching URLs. For example,
we have these URLs (strings):
/, /news, /news/1-addsf, /articles, /guides etc.
The task: match all of them except those that start with "/news" (with or without a continuation such as "/1-addsf"; I need regexes for both cases), "/articles", and "/" itself.
I tried things like this:
#([^\/news.*]|[^\/articles])#is
#\/[^(news.*|articles)]#is
#^\/(^news|^articles)#is
and many, many other variants.
I think I'm missing something or am bad at googling, but I can't find an answer to this question.
I need a working regex. Thanks!
P.S. Sorry for my English.
It seems you want something like:
#^/(?!news|articles).*#is
The above regex matches all URL strings except those that start with /news or /articles.
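A quick way to check the negative lookahead is to run it over the sample paths from the question. The sketch below does that with preg_grep. The second pattern, which adds `|$` to the lookahead so that the bare "/" is rejected too, is my own assumed variant and not part of the answer.

```php
<?php
// Sample paths taken from the question.
$paths = ['/', '/news', '/news/1-addsf', '/articles', '/guides'];

// Pattern from the answer: reject anything starting with /news or /articles.
$pattern = '#^/(?!news|articles).*#is';
$allowed = array_values(preg_grep($pattern, $paths));

// Assumed variant (not from the answer): "|$" in the lookahead also rejects
// the bare "/" path, which the question asked to exclude as well.
$strictPattern = '#^/(?!news|articles|$).*#is';
$allowedStrict = array_values(preg_grep($strictPattern, $paths));

print_r($allowed);
print_r($allowedStrict);
```

Note that the lookahead is a prefix test, so a path such as /newsletter would be rejected as well.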
I'm trying to use RegExp to match a segment of a URL.
The URL in question is this:
http://www.example.com/news/region/north-america/
As I need this regex for the WordPress URL Rewrite API, the subject will only be the path section of the URL:
news/region/north-america
In the above example I need to be able to extract the north-america portion of the path, however when pagination is used the path becomes something like this:
news/region/north-america/page/2
Where I still only need to extract the north-america portion.
The RegExp I've come up with is as follows:
^news/region/(.*?)/(.*?)?/?(.*?)?$
However, this does not match news/region/north-america, only news/region/north-america/page/2
From what I can tell I need to make the trailing slash after north-america optional, but adding /? doesn't seem to work.
Try this:
preg_match('/news\/region\/(.*?)\//',"http://www.example.com/news/region/north-america/page/2",$matches);
$matches[1] will then contain "north-america".
You should match using this regex:
^news/region/([^/]+)
This will match news/region/north-america, with north-america in the first capture group, even when the URI becomes news/region/north-america/page/2
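As a rough check of this rule, the sketch below wraps the pattern in a small helper and returns the first capture group. The function name extractRegion is hypothetical, and `#` is used as the delimiter so the slashes need no escaping.

```php
<?php
// Hypothetical helper: returns the region segment, or null if the path
// does not start with news/region/.
function extractRegion(string $path): ?string
{
    // [^/]+ stops at the next slash, so trailing /page/2 is ignored.
    if (preg_match('#^news/region/([^/]+)#', $path, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

var_dump(extractRegion('news/region/north-america'));
var_dump(extractRegion('news/region/north-america/page/2'));
```

Because the character class excludes slashes, no optional trailing slash is needed in the pattern.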
georg's suggested rule works like a charm:
^news/region/(.*?)(?:/(.*?)/(.*?))?$
For those interested in the application of this regex, I used it in the WP Rewrite API to grab the custom taxonomy and page number (if present) and assign the relevant matches to the WP rewrite:
$newRules['news/region/(.*?)(?:/(.*?)/(.*?))?$'] = 'index.php?region=$matches[1]&forcetemplate=news&paged=$matches[3]';
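To illustrate which groups the rule fills in, here is a small sketch that runs the same pattern through preg_match outside of WordPress. The helper name parseRegionPath and the returned array keys are hypothetical; they only mirror the region and paged query variables used above.

```php
<?php
// Hypothetical helper that mimics what the rewrite rule captures.
function parseRegionPath(string $path): ?array
{
    $pattern = '#^news/region/(.*?)(?:/(.*?)/(.*?))?$#';
    if (preg_match($pattern, $path, $m) !== 1) {
        return null;
    }
    // preg_match drops trailing groups that did not take part in the match,
    // so group 3 may be missing when there is no pagination.
    return ['region' => $m[1], 'paged' => $m[3] ?? ''];
}

print_r(parseRegionPath('news/region/north-america'));
print_r(parseRegionPath('news/region/north-america/page/2'));
```

The optional non-capturing group is what lets the same rule serve both the plain and the paginated path.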
I would like to test for a language match in a URL.
The URL will look like: http://www.domainname.com/en/#m=4&guid=%some_param%
I want to check whether there is a language code within the URL. I was thinking of something along these lines:
^(.*:)\/\/([a-z\-.]+)(:[0-9]+)?(.*)$
or
^(http|https:)\/\/([a-z\-.]+)(:[0-9]+)?(.*)$
I'm not that sharp with regex. Can anyone help or point me in the right direction?
Try this; it captures the language code in the first group:
^https?://[a-z0-9.-]+/([a-z]{2})/
http://www.regexr.com/ is an easy site for creating and testing regexes.
If you know the data you are testing is a url then I would not bother adding all of the url parts to the regex. Keep it simple like: /\/[a-z]{2}\// That looks for a two letter combination between two forward slashes. If you need to capture the language code then wrap it in parentheses: /\/([a-z]{2})\//
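The capturing version of that pattern could be used as in the sketch below. The function name detectLanguage is hypothetical, and the approach assumes, as the answer does, that the input is already known to be a URL.

```php
<?php
// Hypothetical helper: returns the first two-letter segment found between
// two slashes, or null when there is none.
function detectLanguage(string $url): ?string
{
    if (preg_match('/\/([a-z]{2})\//', $url, $m) === 1) {
        return $m[1];
    }
    return null;
}

var_dump(detectLanguage('http://www.domainname.com/en/#m=4&guid=%some_param%'));
var_dump(detectLanguage('http://www.domainname.com/#m=4'));
```

Keep in mind that any two-letter path segment will match, so it may be worth checking the result against a list of supported language codes.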
So I've got this URL regex:
/(?:((?:[^-/"':!=a-z0-9_#]|^|\:))((https?://)((?:[^\p{P}\p{Lo}\s].-|[^\p{P}\p{Lo}\s])+.[a-z]{2,}(?::[0-9]+)?)(/(?:(?:([a-z0-9!*';:=+\$/%#[]-_,~]+))|#[a-z0-9!*';:=+\$/%#[]-_,~]+/|[.\,]?(?:[a-z0-9!*';:=+\$/%#[]-_~]|,(?!\s)))*[a-z0-9=#/]?)?(\?[a-z0-9!*'();:&=+\$/%#[]-_.,~]*[a-z0-9_&=#/])?))/iux
What it's currently matching:
http://www.google.com
http://google.com
I need it to also match:
www.google.com
google.com
I tried making the protocol part of the regex optional by slapping a ? at the end "(https?:\/\/)?" but that didn't do anything.
Ideas?
I'd look for something in the language that you are using to do this. URLs are tough to match with a regex. If you insist, I changed yours to make the (https?://) optional. I did not check it though.
/(?:((?:[^-/"':!=a-z0-9_#]|^|\:))((https?://)?((?:[^\p{P}\p{Lo}\s].-|[^\p{P}\p{Lo}\s])+.[a-z]{2,}(?::[0-9]+)?)(/(?:(?:([a-z0-9!*';:=+\$/%#[]-_,~]+))|#[a-z0-9!*';:=+\$/%#[]-_,~]+/|[.\,]?(?:[a-z0-9!*';:=+\$/%#[]-_~]|,(?!\s)))*[a-z0-9=#/]?)?(\?[a-z0-9!*'();:&=+\$/%#[]-_.,~]*[a-z0-9_&=#/])?))/iux
I got this example from RFC 3986 (Appendix B) and was directed there by this comment. Still, I'd recommend using something built into whatever language you are using rather than a regex.
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
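For illustration, here is a sketch of that RFC 3986 expression in PHP. Per the RFC, groups 2, 4, 5, 7 and 9 hold the scheme, authority, path, query and fragment. The function name splitUri and the array keys are my own. The expression splits a reference into parts rather than validating it, so a bare www.google.com simply ends up in the path.

```php
<?php
// Hypothetical helper around the RFC 3986 Appendix B expression.
function splitUri(string $uri): array
{
    // "~" is used as the delimiter because the pattern contains / and #.
    $pattern = '~^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?~';
    preg_match($pattern, $uri, $m);

    // Trailing groups that did not match are absent from $m, hence "?? ''".
    return [
        'scheme'    => $m[2] ?? '',
        'authority' => $m[4] ?? '',
        'path'      => $m[5] ?? '',
        'query'     => $m[7] ?? '',
        'fragment'  => $m[9] ?? '',
    ];
}

print_r(splitUri('http://www.ics.uci.edu/pub/ietf/uri/#Related'));
print_r(splitUri('www.google.com'));
```

This shows why the expression alone cannot tell a host without a scheme from a relative path.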
Since you are using PHP, have you considered using parse_url? It may return false on seriously malformed URLs.
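A minimal sketch of the parse_url route, covering the inputs without a protocol from the question, might look like this. The helper name hostOf is hypothetical, and prepending http:// to inputs without a scheme is my assumption about what the asker wants.

```php
<?php
// Hypothetical helper: returns the host of a URL, or null if none is found.
function hostOf(string $input): ?string
{
    // Assumption: inputs without a scheme are meant as http URLs.
    // Without a scheme, parse_url would treat "google.com" as a path.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $input) !== 1) {
        $input = 'http://' . $input;
    }

    $parts = parse_url($input);
    if ($parts === false || !isset($parts['host'])) {
        return null;
    }
    return $parts['host'];
}

var_dump(hostOf('http://www.google.com/a?b=1'));
var_dump(hostOf('www.google.com'));
var_dump(hostOf('google.com'));
```

Note that parse_url does not validate the host, so further checks may still be needed depending on the use case.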
I'm looking for a way to mimic something I've seen, but I'm not sure where to start or how to search for it.
Let's say my page is:
foo.com/ and my index page can take an argument: index.php?id=5
What I want to do is create the following:
foo.com/5/ rather than index.php?id=5, so that the URL path itself passes the parameters. This would hide the fact that it's a PHP page and clean up the URL a bit.
Is this possible?
Cheers
You'll want to look into URL rewriting. With the commonly used Apache webserver, this is accomplished with mod_rewrite.
Or, if rewriting is not available, use URLs like /?5/123/
and parse the query string in PHP.
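Parsing such a query string could be sketched as follows. For a request to /?5/123/ the raw query string is 5/123/, which can be split on slashes. The function name segmentsFromQuery is hypothetical.

```php
<?php
// Hypothetical helper: splits a slash-separated query string into segments.
function segmentsFromQuery(string $queryString): array
{
    $parts = explode('/', trim($queryString, '/'));
    // Drop empty segments, e.g. from doubled slashes or an empty query.
    return array_values(array_filter($parts, fn($p) => $p !== ''));
}

// In a real page the input would come from the server, for example:
// $segments = segmentsFromQuery($_SERVER['QUERY_STRING'] ?? '');
print_r(segmentsFromQuery('5/123/'));
```

The segments are plain strings, so they should be validated (for example, cast to int) before use.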
Something like this should suit:
RewriteRule ^pages/([A-Za-z_-]*)(/?)$ /index.php?page=$1
Broken down, we're looking for a URL that starts with pages, has any combination of letters, underscores and hyphens, and an optional trailing forward slash, and passing that to /index.php to handle.
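To see what the rule's pattern captures, the same regex can be tried in PHP, since mod_rewrite also uses Perl-compatible expressions. This is only a sketch of the matching step, not a replacement for the rewrite rule. The function name pageFromPath is hypothetical, and the path is given without a leading slash, as it would be in an .htaccess context.

```php
<?php
// Hypothetical helper: returns what the rule would pass as $1, or null.
function pageFromPath(string $path): ?string
{
    if (preg_match('#^pages/([A-Za-z_-]*)(/?)$#', $path, $m) === 1) {
        return $m[1];
    }
    return null;
}

var_dump(pageFromPath('pages/about-us/'));
var_dump(pageFromPath('pages/about-us'));
var_dump(pageFromPath('pages/5'));
```

The last call returns null because the character class has no digits. For the numeric foo.com/5/ case in the question, a pattern along the lines of ^([0-9]+)/?$ would presumably be needed instead; that variant is my assumption, not part of the answer.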
Yes, mod_rewrite is the best option: you can create an .htaccess file if you do not want to write a custom function to handle your URLs.