Using PHP's preg_match_all to extract a URL

I have been struggling for a while now to make the following work. Basically, I'd like to be able to extract a URL from an expression contained in an HTML template, as follows:
{rssfeed:url(http://www.example.com/feeds/posts/default)}
The idea is that, when this is found, the URL is extracted, and an RSS feed parser is used to get the RSS and insert it here. It all works, for example, if I hardcode the URL in my PHP code, but I just need to get this regex figured out so the template is actually flexible enough to be useful in many situations.
I've tried at least ten different regular expressions, mostly found here on SO, but none of them work. The regex doesn't even need to validate the URL; I just want to find it and extract it, and the delimiters for the URL don't need to be parens, either.
Thank you!

Could this work for you?
'#((https?://)?([-\w]+\.[-\w\.]+)+\w(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)*)#'
I use it to match URLs in text.
Example:
$subject = "{rssfeed:url(http://www.example.com/feeds/posts/default)}";
$pattern ='#((https?://)?([-\w]+\.[-\w\.]+)+\w(:\d+)?(/([-\w/_\.]*(\?\S+)?)?)*)#';
preg_match_all($pattern, $subject, $matches);
print($matches[1][0]);
Output:
http://www.example.com/feeds/posts/default
Note:
There is also a nice article on Daring Fireball called An Improved Liberal, Accurate Regex Pattern for Matching URLs that could be interesting for you.

/\{rssfeed\:url\(([^)]*)\)\}/
preg_match_all('/\{rssfeed\:url\(([^)]*)\)\}/', '{rssfeed:url(http://www.example.com/feeds/posts/default)}', $matches, PREG_PATTERN_ORDER);
print_r($matches[1]);
All of the matched URLs should then be available in $matches[1].
Note: this only picks up URLs in the {rssfeed:url()} format, not every URL in the content.
You can try it here: http://www.spaweditor.com/scripts/regex/index.php
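Since the question wants the tag replaced by the feed output, extraction and substitution can happen in one pass with preg_replace_callback. This is only a sketch: renderFeed() is a hypothetical stand-in for whatever RSS parser you already have, and expandFeedTags() is a name made up for the example.

```php
<?php
// Hypothetical stand-in for the RSS parser mentioned in the question.
// It only returns a marker string so the sketch stays self-contained.
function renderFeed(string $url): string
{
    return '[feed: ' . $url . ']';
}

// Replaces every {rssfeed:url(...)} tag with the output of renderFeed().
function expandFeedTags(string $template): string
{
    // Capture everything between "url(" and the next ")".
    $pattern = '/\{rssfeed:url\(([^)]*)\)\}/';

    return preg_replace_callback(
        $pattern,
        function (array $m): string {
            return renderFeed(trim($m[1]));
        },
        $template
    );
}

$template = '<div>{rssfeed:url(http://www.example.com/feeds/posts/default)}</div>';
echo expandFeedTags($template), "\n";
// prints: <div>[feed: http://www.example.com/feeds/posts/default]</div>
```

Because the capture is [^)]*, a URL that itself contains a closing paren would be cut short; a different delimiter would avoid that.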

Related

extracting only image src, not other 'src' tags in html with php

I've been able to use preg_match on getting the src of any image tags, but I only really need the src of images with class 'wp-post-image' in this case. However, this code is returning nothing for me
$pattern = '<img(?:[^>]+src="(.+?)"[^>]+(?:id|class)="image"|[^>]+(?:id|class)="wp-post-image"[^>]+src="(.+?)")
';
preg_match($pattern,$results[$k]['description'], $matches);
$results[$k]['image'] = $matches[0];
print_r($results[$k]['image']);
The old version returns all image matches which includes 4 that have the class I'm looking for so maybe my syntax is just wrong?
old version that returned all images:
$pattern = '%<img.*?src=["\'](.*?)["\'].*?/>%i';
preg_match($pattern,$results[$k]['description'], $matches);
$src = $matches[0];
//print_r($src);

Asking to parse HTML with regex on SO will get you flamed. Not without reason, but flamed nonetheless.
If you insist on using regex (which, if for nothing else, is good practice), I suggest using a regex sandbox to test out patterns on sample text. One I use is https://regex101.com/ .
The old version (which you say worked) is looking for either single or double quotes around the src attribute. The new version is only looking for double quotes, which is possibly why it's failing.
Rather than trying to write a more complicated regex, it may be easier to use your old regex -- which grabs all the image links -- along with an expanded capture, and then look through the captured links to sort out the ones you need:
$pattern = '%(<img.*?src=["\'].*?["\'].*?/>)%i';
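A rough sketch of that two-step idea follows. The function name imageSourcesByClass() and the sample markup are assumptions made for the example, and it relies on the tags being self-closing, as the pattern above requires.

```php
<?php
// Returns the src of every <img ... /> tag that carries the given class.
function imageSourcesByClass(string $html, string $class): array
{
    // Step 1: capture each complete <img ... /> tag with the expanded pattern.
    preg_match_all('%(<img.*?src=["\'].*?["\'].*?/>)%i', $html, $tags);

    // The class must appear as a whole word inside the class attribute.
    $classPattern = '/\bclass=["\'][^"\']*(?<![\w-])'
        . preg_quote($class, '/')
        . '(?![\w-])[^"\']*["\']/i';

    $sources = [];
    foreach ($tags[1] as $tag) {
        // Step 2: skip tags without the wanted class, then pull out src.
        if (!preg_match($classPattern, $tag)) {
            continue;
        }
        if (preg_match('/\bsrc=["\']([^"\']*)["\']/i', $tag, $m)) {
            $sources[] = $m[1];
        }
    }
    return $sources;
}

// Sample markup is assumed; the question does not show the real description HTML.
$html = '<p><img class="wp-post-image" src="/a.jpg" />'
      . '<img class="thumb" src="/b.jpg" />'
      . "<img src='/c.jpg' class='wp-post-image alignleft' /></p>";

print_r(imageSourcesByClass($html, 'wp-post-image'));
// contains '/a.jpg' and '/c.jpg'
```

Checking the class in a second step means attribute order no longer matters, which is what made the single combined pattern so awkward.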

PHP preg_match subpattern captures too much text (too greedy)

I'm using preg_match to match the first contact page link within some HTML markup.
I have spent many hours investigating, reading PHP regex documents, debugging, and trying to find a similar solution on StackOverflow. There is a lot of advice for regex, just could not find it specific to my subpattern issue.
Example HTML:-
<ul class='root dropdown'><li class="item1 current-item-root first-item current-item"><a href="/">Home</a></li><li class="item2"><a href="/contact-us">Contact Us</a></li><li class="item3 parent category-page"><a
Instead of returning
/contact-us
it returns
"/">Home</a></li><li class="item2"><a href="/contact-us
Here is the code:-
preg_match( '/href.{1,5}"(?P<link>.{0,50}contact.{0,20})"/isxU', $input_line, $output_array);
I expected the regex U setting to make {0,50} non-greedy, but it's grabbing too much text.
The code is designed to pick up href links in various formats like below:-
/contact
/contact-us
websitename.com/contact-me
Here is a working example:-
https://www.phpliveregex.com/p/Dh2

Many thanks for your help. The answer was to exclude any other quotes captured in the sub-pattern, which was part of your example. The best and most fantastic part of your answer was to direct me to use https://regex101.com/ . That's an amazing tool for regex, great highlighting and explanations of the expression.
My answer:-
href="(?<link>[^"]{0,50}contact.*[^"]{0,50})"
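For anyone landing here later: the U modifier only flips greediness. The engine still starts at the leftmost href it can, so a dot that is allowed to match quotes will happily run from the first link into the second one. A minimal sketch of the quote-excluding version, where findContactLink() and the sample markup are assumptions made for the example:

```php
<?php
// Returns the first href value containing "contact", or null if there is none.
function findContactLink(string $html): ?string
{
    // [^"]* cannot cross a closing quote, so the capture stays inside one href value.
    $pattern = '/href\s*=\s*"(?P<link>[^"]*contact[^"]*)"/i';

    if (preg_match($pattern, $html, $m)) {
        return $m['link'];
    }
    return null;
}

// Assumed markup, in the spirit of the menu from the question.
$html = '<ul><li><a href="/">Home</a></li>'
      . '<li><a href="/contact-us">Contact Us</a></li>'
      . '<li><a href="/about">About</a></li></ul>';

echo findContactLink($html), "\n"; // prints: /contact-us
```

This only handles double-quoted attributes; single-quoted ones would need a second alternative.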

PHP Regular expression: Get all urls with question mark

I have this regular expression:
preg_match_all("/<a\s.*?href\s*=\s*['|\"](.*?)(?=#|\"|')/si", $data, $matches);
to find all URLs. It works fine, BUT how can I modify it to find ONLY URLs that contain a question mark?
Example:
0123
And preg_match_all will return:
http://site.com/index.php?id=1
http://site.com/calc/index.php?id=1&scheme=Venus

preg_match_all("#<a\s*href\s*=[\'\"]([^\'\"]+\?[^\'\"]+)[\'\"]#si", $data, $matches);
Try this.

Don't try to make everything happen in one regex. Use your existing method, and then separately check the URL that you get back to see if it has a question mark in it.
That said, don't use regular expressions to parse HTML. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged.
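The first suggestion (keep the existing extraction, then check each URL separately) could look roughly like this. It reuses the pattern from the question, lightly tidied, so it inherits the regex-on-HTML caveats just mentioned; urlsWithQuery() and the sample links are assumptions made for the example.

```php
<?php
// Extracts href values with the question's pattern, then keeps those containing "?".
function urlsWithQuery(string $data): array
{
    // Step 1: the existing extraction, unchanged in spirit.
    preg_match_all('/<a\s.*?href\s*=\s*[\'"](.*?)(?=#|"|\')/si', $data, $matches);

    // Step 2: a plain string check instead of a more complicated regex.
    $withQuery = array_filter($matches[1], function (string $url): bool {
        return strpos($url, '?') !== false;
    });

    return array_values($withQuery);
}

// Assumed sample input; the question's original example markup is not available.
$data = '<a href="http://site.com/index.php?id=1">0</a>'
      . '<a href="http://site.com/about.php">1</a>'
      . '<a href="http://site.com/calc/index.php?id=1&scheme=Venus">2</a>';

print_r(urlsWithQuery($data));
```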

Andy Lester's answer describes the right thing to do.
Here's your regex though:
<a\s.*?href\s*=\s*['|\"](.*?\?.*?)(?=#|\"|')
as seen here:
http://rubular.com/r/LHi11VMMR9

Extract URL from string

I'm trying to find a reliable solution to extract a url from a string of characters. I have a site where users answer questions and in the source box, where they enter their source of information, I allow them to enter a url. I want to extract that url and make it a hyperlink. Similar to how Yahoo Answers does it.
Does anyone know a reliable solution that can do this?
All the solutions I have found work for some URLs but not for others.
Thanks

John Gruber has spent a fair amount of time perfecting the "one regex to rule them all" for link detection. Used with preg_replace(), as mentioned in the other answers, the following regex should be one of the most accurate methods, if not the most accurate, for detecting a link:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
If you only wanted to match HTTP/HTTPS:
(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

$string = preg_replace('/https?:\/\/[^\s"<>]+/', '<a href="$0">$0</a>', $string);
It only matches http/https, but that's really the only protocol you want to turn into a link. If you want others, you can change it like this:
$string = preg_replace('/(https?|ssh|ftp):\/\/[^\s"]+/', '<a href="$0">$0</a>', $string);

There are a lot of edge cases with URLs. For example, a URL may contain brackets, or may not include a protocol. That's why a regex alone is not enough.
I created a PHP library that deals with many of these edge cases: Url highlight.
You can extract URLs from a string or highlight them directly.
Example:
<?php
use VStelmakh\UrlHighlight\UrlHighlight;
$urlHighlight = new UrlHighlight();
// Extract urls
$urlHighlight->getUrls("This is example http://example.com.");
// return: ['http://example.com']
// Make urls as hyperlinks
$urlHighlight->highlightUrls('Hello, http://example.com.');
// return: 'Hello, <a href="http://example.com">http://example.com</a>.'
For more details see readme. For covered url cases see test.

Yahoo! Answers does a fairly good job of link identification when the link is written properly and separate from other text, but it isn't very good at separating trailing punctuation. For example The links are http://example.com/somepage.php, http://example.com/somepage2.php, and http://example.com/somepage3.php. will include commas on the first two and a period on the third.
But if that is acceptable, then patterns like this should do it:
\<http:[^ ]+\>
It looks like Stack Overflow's parser is better. Is it open source?
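If the trailing punctuation is not acceptable, one option is to trim it inside a callback before building the link. A minimal sketch, assuming plain-text input and http/https URLs only; linkify() is a name made up for the example.

```php
<?php
// Turns http/https URLs in plain text into links, leaving sentence punctuation outside.
function linkify(string $text): string
{
    return preg_replace_callback(
        '~https?://[^\s"<>]+~i',
        function (array $m): string {
            // Split off punctuation that most likely belongs to the sentence, not the URL.
            $url = rtrim($m[0], '.,;:!?');
            $trailing = (string) substr($m[0], strlen($url));
            $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
            return '<a href="' . $safe . '">' . $safe . '</a>' . $trailing;
        },
        $text
    );
}

echo linkify('The links are http://example.com/somepage.php, http://example.com/somepage2.php.'), "\n";
// prints: The links are <a href="http://example.com/somepage.php">http://example.com/somepage.php</a>, <a href="http://example.com/somepage2.php">http://example.com/somepage2.php</a>.
```

A URL that genuinely ends in one of the trimmed characters would lose it, so this is a trade-off rather than a fix for every case.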

This code worked for me.
function makeLink($string){
/*** make sure there is an http:// on all URLs ***/
$string = preg_replace("/([^\w\/])(www\.[a-z0-9\-]+\.[a-z0-9\-]+)/i", "$1http://$2",$string);
/*** make all URLs links ***/
$string = preg_replace("/([\w]+:\/\/[\w\-?&;#~=\.\/\@]+[\w\/])/i","<a target=\"_blank\" href=\"$1\">$1</a>",$string);
/*** make all emails hot links ***/
$string = preg_replace("/([\w\-?&;#~=\.\/]+\@(\[?)[a-zA-Z0-9\-\.]+\.([a-zA-Z]{2,3}|[0-9]{1,3})(\]?))/i","<a href=\"mailto:$1\">$1</a>",$string);
return $string;
}

PHP regex for filtering out urls from specific domains for use in a vBulletin plug-in

I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to php let alone regexes.
Basically, I'm trying to put together a regex and use a preg_replace to find any urls that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] bb tags.
As for regex, I would need it to find URLS with the following, I think:
1. Starts with http or an anchor tag. I believe that the URLs in [CODE] tags could be processed the same as the plain text URLs and it's fine if the replacement ends up inside the [CODE] tag afterward.
2. Could contain any number of any characters before the domain/word
3. Has the domain somewhere in the middle
4. Could contain any number of any characters after the domain
5. Ends with one of a number of extensions such as (html|htm|rar|zip|001) or in a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.

You can use parse_url on the URLs and look at the associative array it returns. That lets you filter by domain, or apply even finer-grained control.
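A minimal sketch of the host check, assuming the URLs have already been pulled out of the post. isBlockedUrl() and the .example domains are placeholders, since the question only names the sites as 'domain1' and so on.

```php
<?php
// True when the URL's host is one of the blocked domains or a subdomain of one.
function isBlockedUrl(string $url, array $blockedDomains): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false; // no host found, so there is nothing to compare
    }
    $host = strtolower($host);

    foreach ($blockedDomains as $domain) {
        $domain = strtolower($domain);
        // Match the domain itself or anything ending in ".domain".
        if ($host === $domain || substr($host, -strlen($domain) - 1) === '.' . $domain) {
            return true;
        }
    }
    return false;
}

// Placeholder domains for illustration only.
$blocked = ['domain1.example', 'domain2.example'];

var_dump(isBlockedUrl('http://www.domain1.example/files/a.rar', $blocked)); // bool(true)
var_dump(isBlockedUrl('http://notdomain1.example/', $blocked));             // bool(false)
```

Comparing the host rather than the whole string avoids false hits when a blocked name merely appears in a path or query string.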

I think you can avoid that overhead by using the built-in filter_var function, which has been available since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);

Hmm, my first guess: you put $filterthese directly inside a single-quoted string, and single quotes don't allow variable substitution. Also, $filterthese is an array, so it should be joined first:
$filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but those points seem worth checking.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )([a-z0-9-]+\.)*(?#
domains to block: )('.implode("|", $filterthese).')(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';
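To try the implode idea in isolation, here is a deliberately simpler sketch that only handles plain-text URLs (no anchor tags). filterLinks() is a made-up name, the sample message is assumed, and the pattern blocks a URL if the name appears anywhere in it, which may be broader than you want.

```php
<?php
// Replaces every plain-text URL that contains one of the given names.
function filterLinks(string $message, array $names, string $replacement): string
{
    // Escape each name for a pattern delimited by "!", then join with "|".
    $alternatives = implode('|', array_map(
        function (string $name): string {
            return preg_quote($name, '!');
        },
        $names
    ));

    // Concatenation is needed here: a variable inside single quotes stays literal.
    $regex = '!https?://[^\s"\'<>]*(?:' . $alternatives . ')[^\s"\'<>]*!i';

    return preg_replace($regex, $replacement, $message);
}

$filterthese = ['domain1', 'domain2', 'domain3'];
$message = 'See http://www.domain2.com/file.rar and http://example.org/ok.html';

echo filterLinks($message, $filterthese, 'LINKS HAVE BEEN FILTERED MESSAGE'), "\n";
// prints: See LINKS HAVE BEEN FILTERED MESSAGE and http://example.org/ok.html
```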
