I have this regular expression:
preg_match_all("/<a\s.*?href\s*=\s*['|\"](.*?)(?=#|\"|')/si", $data, $matches);
to find all urls, it works fine, BUT how can I modificate it to find urls with question marks ONLY?
Example:
0123
And preg_match_all will return:
http://site.com/index.php?id=1
http://site.com/calc/index.php?id=1&scheme=Venus
preg_match_all("#<a\s*href\s*=[\'\"]([^\'\"]+\?[^\'\"]+)[\'\"]#si", $data, $matches);
Try this.
Don't try to make everything happen in one regex. Use your existing method, and then separately check the URL that you get back to see if it has a question mark in it.
That said, don't use regular expressions to parse HTML. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged.
Andy Lester gave you the answer with right thing to do.
Here's your regex though:
<a\s.*?href\s*=\s*['|\"](.*?\?.*?)(?=#|\"|')
as seen here:
http://rubular.com/r/LHi11VMMR9
Related
I'm using preg_match to match the first contact page link within some HTML markup.
I have spent many hours investigating, reading PHP regex documents, debugging, and trying to find a similar solution on StackOverflow. There is a lot of advice for regex, just could not find it specific to my subpattern issue.
Example HTML:-
<ul class='root dropdown'><li class="item1 current-item-root first-item current-item">Home</li><li class="item2">Contact Us</li><li class="item3 parent category-page"><a
Instead of returning
/contact-us
it returns
"/">Home</a></li><li class="item2"><a href="/contact-us
Here is the code:-
preg_match( '/href.{1,5}"(?P<link>.{0,50}contact.{0,20})"/isxU', $input_line, $output_array);
I expected the regex U setting to make {0,50} non-greedy, but it's grabbing too much text.
The code is designed to pick up href links in various formats like below:-
/contact
/contact-us
websitename.com/contact-me
Here is a working example:-
https://www.phpliveregex.com/p/Dh2
Many thanks for your help. The answer was to exclude any other quotes captured in the sub-pattern, which was part of your example. The best and most fantastic part of your answer was to direct me to use https://regex101.com/ . That's an amazing tool for regex, great highlighting and explanations of the expression.
My answer:-
href="(?<link>[^"]{0,50}contact.*[^"]{0,50})"
I have been struggling for a while now to make the following work. Basically, I'd like to be able to extract a URL from an expression contained in an HTML template, as follows:
{rssfeed:url(http://www.example.com/feeds/posts/default)}
The idea is that, when this is found, the URL is extracted, and an RSS feed parser is used to get the RSS and insert it here. It all works, for example, if I hardcode the URL in my PHP code, but I just need to get this regex figured out so the template is actually flexible enough to be useful in many situations.
I've tried at least ten different regex expressions, mostly found here on SO, but none are working. The regex doesn't even need to validate the URL; I just want to find it and extract it, and the delimiters for the URL don't need to be parens, either.
Thank you!
Could this work for you?
'#((https?://)?([-\w]+\.[-\w\.]+)+\w(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)*)#'
I use it to match URLs in text.
Example:
$subject = "{rssfeed:url(http://www.example.com/feeds/posts/default)}";
$pattern ='#((https?://)?([-\w]+\.[-\w\.]+)+\w(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)*)#';
preg_match_all($pattern, $subject, $matches);
print($matches[1][0]);
Output:
http://www.example.com/feeds/posts/default
Note:
There is also a nice article on Daring Fireball called An Improved Liberal, Accurate Regex Pattern for Matching URLs that could be interesting for you.
/\{rssfeed\:url\(([^)]*)\)\}/
preg_match_all('/\{rssfeed\:url\(([^)]*)\)\}/', '{rssfeed:url(http://www.example.com/feeds/posts/default)}', $matches, PREG_PATTERN_ORDER);
print_r($matches[1]);
you should be able to get ALL the urls on the content available in $matches[1]..
Note: this will only get urls with the {rssfeed:url()} format, not all the urls in the content.
you can try this here: http://www.spaweditor.com/scripts/regex/index.php
I've got a problem with regexp function, preg_replace(), in PHP.
I want to get viewstate from html's input, but it doesn't work properly.
This code:
$viewstate = preg_replace('/^(.*)(<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value=")(.*[^"])("\s+name="__VIEWSTATE">)(.*)$/u','^\${3}$',$html);
Returns this:
%0D%0A%0D%0A%3C%21DOCTYPE+html+PUBLIC+%22-%2F%2FW3C%2F%2FDTD+XHTML+1.0+Transitional%2F%2FEN%22+%22http%3A%2F%2Fwww.w3.org%2FTR%2Fxhtml1%2FDTD%2Fxhtml1-transitional.dtd%22%3E%0D%0A%0D%0A%3Chtml+xmlns%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2Fxhtml%22+%3E%0D%0A%3Chead%3E%3Ctitle%3E%0D%0A%09Strava.cz%0D%0A%3C%2Ftitle%3E%3Clink+rel%3D%22shortcut+icon%22+href%3D%22..%2FGrafika%2Ffavicon.ico%22+type%3D%22image%2Fx-icon%22+%2F%3E%3Clink+rel%3D%22stylesheet%22+type%3D%22text%2Fcss%22+media%3D%22screen%22+href%3D%22..%2FStyly%2FZaklad.css%22+%2F%3E%0D%0A++++%3Cstyle+type%3D%22text%2Fcss%22%3E%0D%0A++++++++.style1%0D%0A++++++++%7B%0D%0A++++++++++++width%3A+47px%3B%0D%0A++++++++%7D%0D%0A++++++++.style2%0D%0A++++++++%7B%0D%0A++++++++++++width%3A+64px%3B%0D%0A++++++++%7D%0D%0A++++%3C%2Fstyle%3E%0D%0A%0D%0A%3Cscript+type%3D%22text%2Fjavascript%22%3E%0D%0A%0D%0A++var+_gaq+%3D+_gaq+%7C%7C+%5B%5D%3B%0D%0A++_gaq.push%28%5B
EDIT: Sorry, I left this question for a long time. Finally I used DOMDocument.
To be sure i'd split this match into two phases:
Find the relevant input element
Get the value
Because you cannot be certain what the attributes order in the element will be.
if(preg_match('/<input[^>]+name="__VIEWSTATE"[^>]*>/i', $input, $match))
$value = preg_replace('/.*value="([^"]*)".*/i', '$1', $match[0]);
And, of course, always consider DOM and DOMXpath over regex for parsing html/xml.
You should only capture when you're planning on using the data. So most () are obsolete in that regexp pattern. Not a cause for failure but I thought I'd mention it.
Instead of using [^"] to mark that you don't want that character you could use the non-greedy modifier - ?. This makes sure the pattern is matching as little as it can. Since you have name="__VIEWSTATE" following the value this should be safe.
Let's put this in practice and simplify the pattern some. This works as you want:
'/.*<input\s+id="__VIEWSTATE"\s+type="hidden"\s+value="(.+?)"\s+name="__VIEWSTATE">.*/'
I would strongly recommend checking out an alternative to regexp for DOM operations. This makes certain your code works also if the attributes changes order. Plus it's so much nicer to work with.
The main mistake was the use of funciton preg_replace, witch returns the subject - neither the matched pattern nor the replacement. Thank you for your ideas and for the recommendation of DOMDocument. m93a
http://www.php.net/manual/en/function.preg-replace.php#refsect1-function.preg-replace-returnvalues
Regex is my bete noire, can anyone help me isolate a string from a URL?
I want to get the page name from a URL which could appear in any of the following ways from an input form:
https://www.facebook.com/PAGENAME?sk=wall&filter=2
http://www.facebook.com/PAGENAME?sk=wall&filter=2
www.facebook.com/PAGENAME
facebook.com/PAGENAME?sk=wall
... and so on.
I can't seem to find a way to isolate the string after .com/ but before ? (if present at all). Is it preg_match, replace or split?
If anyone can recommend a particularly clear and introductory regex guide they found useful, it'd be appreciated.
You can use the parse_url function and then get the last segment from the path of the url:
$parts=parse_url($url);
$path_parts=explode("/", $parts["path"]);
$page=$path_parts[count($path_parts)-1];
For learning and testing regexes I found RegExr, an online tool, very useful: http://gskinner.com/RegExr/
But as others mentioned, parsing the url with appropriate functions might be better in this case.
I think you can use this php function (parse_url) directly instead of using regex.
Use smth like:
substr(parse_url('https://www.facebook.com/PAGENAME?sk=wall&filter=2', PHP_URL_PATH), 1);
This question already has answers here:
Closed 12 years ago.
Possible Duplicates:
Identifying if a URL is present in a string
Php parse links/emails
I'm working on some PHP code which takes input from various sources and needs to find the URLs and save them somewhere. The kind of input that needs to be handled is as follows:
http://www.youtube.com/watch?v=IY2j_GPIqRA
Try google: http://google.com! (note exclamation mark is not part of the URL)
Is http://somesite.com/ down for anyone else?
Output:
http://www.youtube.com/watch?v=IY2j_GPIqRA
http://google.com
http://somesite.com/
I've already borrowed one regular expression from the internet which works, but unfortunately wipes the query string out - not good!
Any help putting together a regular expression, or perhaps another solution to this problem, would be appreciated.
Jan Goyvaerts, Regex Guru, has addressed this issue in his blog. There are quite a few caveats, for example extracting URLs inside parentheses correctly. What you need exactly depends on the "quality" of your input data.
For the examples you provided, \b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&##/%=~_|$?!:,.]*[A-Z0-9+&##/%=~_|$] works when used in case-insensitive mode.
So to find all matches in a multiline string, use
preg_match_all('/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Z0-9+&##\/%=~_|$?!:,.]*[A-Z0-9+&##\/%=~_|$]/i', $subject, $result, PREG_PATTERN_ORDER);
$result = $result[0];
Why not try this one. It is the first result of Googling "URL regular expression".
((https?|ftp|gopher|telnet|file|notes|ms-help):((\/\/)|(\\\\))+[\w\d:##%\/;$()~_?\+-=\\\.&]*)
Not PHP, but it should work, I just slightly modified it by escaping forward slashes.
source