Regex is my bête noire. Can anyone help me isolate a string from a URL?
I want to get the page name from a URL entered in an input form, which could appear in any of the following ways:
https://www.facebook.com/PAGENAME?sk=wall&filter=2
http://www.facebook.com/PAGENAME?sk=wall&filter=2
www.facebook.com/PAGENAME
facebook.com/PAGENAME?sk=wall
... and so on.
I can't seem to find a way to isolate the string after .com/ but before ? (if present at all). Is it preg_match, replace or split?
If anyone can recommend a particularly clear and introductory regex guide they found useful, it'd be appreciated.
You can use the parse_url function and then get the last segment from the path of the url:
$parts=parse_url($url);
$path_parts=explode("/", $parts["path"]);
$page=$path_parts[count($path_parts)-1];
For learning and testing regexes I found RegExr, an online tool, very useful: http://gskinner.com/RegExr/
But as others mentioned, parsing the url with appropriate functions might be better in this case.
I think you can use the PHP function parse_url directly instead of a regex.
Use something like:
substr(parse_url('https://www.facebook.com/PAGENAME?sk=wall&filter=2', PHP_URL_PATH), 1);
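One thing to watch with the parse_url approaches above: when the input has no scheme (www.facebook.com/PAGENAME), parse_url treats the whole string as a path. Here is a rough sketch that works around that by prepending a scheme first. The function name facebook_page_name is made up for illustration, and taking the first path segment as the page name is an assumption about how these URLs are shaped.

```php
<?php
// Hypothetical helper (not from the answers above): returns the first
// path segment of a URL, or null when there is none.
function facebook_page_name(string $url): ?string
{
    $url = trim($url);

    // parse_url() sees "www.facebook.com/PAGENAME" as a bare path,
    // so add a scheme when the input does not start with one.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = 'http://' . $url;
    }

    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null; // no path at all, or a URL parse_url() rejects
    }

    // Drop leading/trailing slashes and keep the first segment.
    $segments = explode('/', trim($path, '/'));
    return $segments[0] !== '' ? $segments[0] : null;
}

echo facebook_page_name('facebook.com/PAGENAME?sk=wall'), "\n"; // PAGENAME
```

The query string never reaches the result because parse_url splits it off before the path is inspected.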
Related
I would like to test for a language match in a url.
Url will be like : http://www.domainname.com/en/#m=4&guid=%some_param%
I want to check whether the URL contains a language code. I was thinking of something along these lines:
^(.*:)\/\/([a-z\-.]+)(:[0-9]+)?(.*)$
or
^(http|https:)\/\/([a-z\-.]+)(:[0-9]+)?(.*)$
I'm not that sharp with regex. Can anyone help or point me in the right direction?
[https]+://[a-z-]+.([a-z])+/
Try this.
http://www.regexr.com/ is an easy site for building regexes.
If you know the data you are testing is a url then I would not bother adding all of the url parts to the regex. Keep it simple like: /\/[a-z]{2}\// That looks for a two letter combination between two forward slashes. If you need to capture the language code then wrap it in parentheses: /\/([a-z]{2})\//
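To make the "keep it simple" idea concrete, here is a minimal sketch in PHP. It differs slightly from the pattern above: it runs the check against the path only (via parse_url) and anchors it to the first segment, so a two-letter host name cannot match by accident. The function name url_language_code is hypothetical, and it assumes language codes are exactly two lowercase letters.

```php
<?php
// Hypothetical helper: returns the two-letter code from the first path
// segment (e.g. "/en/"), or null when the URL has none.
function url_language_code(string $url): ?string
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null; // no path component to inspect
    }

    // A slash, two lowercase letters, then another slash or end of path.
    if (preg_match('#^/([a-z]{2})(?:/|$)#', $path, $m) === 1) {
        return $m[1];
    }
    return null;
}

echo url_language_code('http://www.domainname.com/en/#m=4&guid=%some_param%'), "\n"; // en
```

If the list of supported languages is known, comparing the captured value against that list with in_array would be stricter than accepting any two letters.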
I have this regular expression:
preg_match_all("/<a\s.*?href\s*=\s*['|\"](.*?)(?=#|\"|')/si", $data, $matches);
It finds all URLs and works fine, but how can I modify it to find only URLs that contain a question mark?
Example:
0123
And preg_match_all will return:
http://site.com/index.php?id=1
http://site.com/calc/index.php?id=1&scheme=Venus
preg_match_all("#<a\s*href\s*=[\'\"]([^\'\"]+\?[^\'\"]+)[\'\"]#si", $data, $matches);
Try this.
Don't try to make everything happen in one regex. Use your existing method, and then separately check the URL that you get back to see if it has a question mark in it.
That said, don't use regular expressions to parse HTML. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged.
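For what it's worth, both suggestions (parse the HTML properly, then check for the question mark separately) can be sketched with PHP's bundled DOM extension. This is only an illustration: the function name links_with_query and the sample markup are made up, and it assumes the DOM extension is enabled.

```php
<?php
// Hypothetical helper: returns the href of every <a> whose URL
// contains a question mark.
function links_with_query(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml's warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // The separate, simple check: keep URLs that contain "?".
        if (strpos($href, '?') !== false) {
            $urls[] = $href;
        }
    }
    return $urls;
}

// Sample markup invented for this example.
$data = '<a href="http://site.com/index.php?id=1">0</a> <a href="http://site.com/about.php">1</a>';
print_r(links_with_query($data));
```

The parser also decodes entities, so an href written with &amp;amp; in the HTML source comes back with a plain &amp;.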
Andy Lester gave you the right answer about what to do.
Here's your regex though:
<a\s.*?href\s*=\s*['|\"](.*?\?.*?)(?=#|\"|')
as seen here:
http://rubular.com/r/LHi11VMMR9
I'm a newbie in PHP regex patterns, so i tried to make a pattern for this URL:
$turl=http://ss-3.domian.com/screenshot/50/18/screenshot_multiple/501800/501800_multiple_1_extra_large.jpg
I just want to retrieve 3 things: "3", "50/18", "501800"
So I used this code:
preg_match('#http://ss-(.*?).domain.com/screenshot/(.*?)/screenshot_multiple/(.*?)/(.*?)_multiple_1_extra_large\.jpg#',$turl,$t_url)
So $t_url[1] should be 3, $t_url[2] should be 50/18 and $t_url[3] should be 501800, right?
<?php
$turl = 'http://ss-3.domain.com/screenshot/50/18/screenshot_multiple/501800/501800_multiple_1_extra_large.jpg';
preg_match_all('#http://ss\-([^\.]*)\.domain\.com/[^/]+/([^/]*)/([^/]*)/[^/]*/([^/]*)/([^/]*)#msi',$turl,$match);
// For testing
var_dump($match);
?>
You had a typo (domian) in the search string and it wasn't in quotes. This sort of URL is likely to change, so I've made it as generic as possible while still keeping the shape. I think if we knew your problem we would reconsider using regex if possible. Also, reading the function declarations in php.net is a big help and will give you a good understanding of their applications.
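If you do want exactly the three values from the question ("3", "50/18", "501800"), a stricter pattern is one option. The sketch below assumes every captured part is numeric and that the URL always has this shape; the function name screenshot_parts and the array keys are invented for the example.

```php
<?php
// Hypothetical helper: pulls the server number, the two-level folder
// and the id out of a screenshot URL, or returns null on no match.
function screenshot_parts(string $url): ?array
{
    // Dots are escaped so they match literally; \d+ assumes digits only.
    $pattern = '#^http://ss-(\d+)\.domain\.com/screenshot/(\d+/\d+)/screenshot_multiple/(\d+)/#';

    if (preg_match($pattern, $url, $m) !== 1) {
        return null;
    }
    return ['server' => $m[1], 'folder' => $m[2], 'id' => $m[3]];
}

$turl = 'http://ss-3.domain.com/screenshot/50/18/screenshot_multiple/501800/501800_multiple_1_extra_large.jpg';
print_r(screenshot_parts($turl));
```

Using preg_match rather than preg_match_all keeps the result flat, since there is only one URL to match.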
Does anyone know an up-to-date regular expression for validating URLs? I found a few on Google, but they all allowed junk URLs (e.g. www.google_com) when testing.
My regular expression knowledge is not so vast, so I would hate to put something together that would fail under pressure.
Thanks.
You can use the filter functions in PHP
$filtered = filter_var($url, FILTER_VALIDATE_URL);
http://uk3.php.net/manual/en/function.filter-var.php
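A small sketch of how that might be wrapped up. Note that FILTER_VALIDATE_URL requires a scheme, so input like www.google_com is rejected, but it also accepts schemes other than http and https. The extra scheme check and the function name is_valid_http_url are additions for illustration, not part of filter_var itself.

```php
<?php
// Hypothetical wrapper: true only for syntactically valid http(s) URLs.
function is_valid_http_url(string $url): bool
{
    // filter_var() returns false when validation fails.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    // Optional extra step: restrict to the schemes a web form expects.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return in_array($scheme, ['http', 'https'], true);
}

var_dump(is_valid_http_url('http://www.google.com')); // bool(true)
var_dump(is_valid_http_url('www.google_com'));        // bool(false)
```

This only checks syntax. It says nothing about whether the host exists or the page responds.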
Not every problem should be answered with a regex.
http://php.net/manual/en/function.parse-url.php
I'm trying to write a regex in PHP that, given a line like
<a href="mypage.php?(some junk)&p=12345&(other junk)" other link stuff>Text</a>
returns only "p=12345", or even just "12345". Note that the (some junk)& and the &(other junk) may or may not be present.
Can I do this with one expression, or will I need more than one? I can't seem to work out how to do it in one, which is what I would like if at all possible. I'm also open to other methods of doing this, if you have a suggestion.
Thanks
Perhaps a better tactic than using a regular expression in this case is to use parse_url.
You can use that to get the query (what comes after the ? in your URL) and split on the '&' character and then the '=' to put things into a nice dictionary.
Use parse_url and parse_str:
$url = 'mypage.php?(some junk)&p=12345&(other junk)';
$parsed_url = parse_url($url);
parse_str($parsed_url['query'], $parsed_str);
echo $parsed_str['p'];
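The snippet above assumes the URL always has a query string and always contains p. A slightly more defensive version might look like this; the function name query_param is hypothetical.

```php
<?php
// Hypothetical helper: returns one query-string parameter from a URL,
// or null when the URL has no query or the parameter is absent.
function query_param(string $url, string $name): ?string
{
    $query = parse_url($url, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null; // no "?" part in the URL
    }

    // parse_str() fills $params with the decoded key/value pairs.
    parse_str($query, $params);

    // Array-style parameters (p[]=1) are skipped in this sketch.
    return isset($params[$name]) && is_string($params[$name])
        ? $params[$name]
        : null;
}

echo query_param('mypage.php?(some junk)&p=12345&(other junk)', 'p'), "\n"; // 12345
```

It works the same whether or not the junk before and after p=12345 is present, which was the requirement in the question.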