"catching" links in regex using php ignoring inline js - php

I'm stuck trying to write a regex in PHP that catches a link and its content from an HTML page (which I have no control over) and replaces it with a link of mine.
For example:
<a style="position:absolute;more_styles:more;" href="http://www.google.co.il/" class="something">This is the content</a>
Becomes:
<a style="position:absolute;more_styles:more;" href="my_function('http://www.google.co.il/')" class="something">This is the content</a>
This is the regex that I wrote:
$content = preg_replace('|<a(.*?)href=[\"\'](.*?)[\"\'][^>]*>(.*?)</a>|i','$3',$content);
This works well with all the links except links like:
<a href="http://google.co.il" onclick="if(MSIE_VER()>=4){this.style.behavior='url(#default#homepage)';this.setHomePage('http://www.google.co.il')}" class='brightgrey rightbar' style='font-size:12px'><b>Make me the home page!</b></a>
Obviously, the regexp stops at "MSIE_VER()>" because of the "[^>]*" part, and I get the wrong content when I use "$3".
I tried almost every option to make this work but no luck.
Any thoughts?
Thank you all in advance..

First of all, your code does something different from adding my_function: it replaces the whole link with the third capture group ('$3'), i.e. the link's content. There are several ways to achieve your declared goal (i.e. wrapping every href in my_function); the most pragmatic would be:
$content = preg_replace('|href=[\"\'](.*?)[\"\']|i',"href=\"my_function('$1')\"",$content);
If you need a more careful approach, I would use:
$content = preg_replace('|(<a.*?)href=[\"\'](.*?)[\"\'](.*?</a>)|i',"$1href=\"my_function('$2')\"$3",$content);
Last but not least, if you actually need to remove the tag rather than what you described, let me know; there are a million ways to do it.

By default, .* takes everything it can, so it swallows the onclick attribute as well, because the regex still matches. Replace "." with [^"]: this tells the regex to take everything except " (which cannot appear in a URL).
$content = preg_replace('|<a(.*?)href=[\"\']([^"]*?)[\"\'][^>]*>(.*?)</a>|i','$3',$content);
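The `>` inside the onclick value is what ends the tag match too early. One way around that, as a minimal sketch and assuming every attribute value is quoted, is to consume each quoted value as a single unit while scanning the opening tag. The helper name rewrite_links is hypothetical; my_function comes from the question. The rewritten href always gets double quotes.

```php
<?php
// Hypothetical helper: wraps the quoted href value of every <a> tag in
// my_function('...'). Assumes attribute values are always quoted. Quoted
// values are consumed whole, so a ">" inside onclick="..." does not end
// the tag early.
function rewrite_links(string $html): string
{
    // Opening <a ...> tag: plain characters, "double-quoted" or 'single-quoted' values.
    $tag = '~<a\b((?:[^>"\']|"[^"]*"|\'[^\']*\')*)>~i';

    return preg_replace_callback($tag, function (array $m): string {
        // $m[1] holds the attributes of one tag; rewrite only its first href.
        $attrs = preg_replace(
            '~(\shref\s*=\s*)(["\'])(.*?)\2~i',
            '$1"my_function(\'$3\')"',
            $m[1],
            1
        );
        return '<a' . $attrs . '>';
    }, $html);
}

$sample = '<a href="http://google.co.il" onclick="if(MSIE_VER()>=4){x()}" class=\'c\'><b>Home</b></a>';
echo rewrite_links($sample), PHP_EOL;
```

Only the opening tag is touched, so the link text and the closing </a> pass through unchanged. For anything beyond simple pages, an HTML parser is likely to be more robust than a regex.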

Related

Regex not preceded by href="

So I am adding [embed][/embed] around YouTube links in a WordPress environment, because if you use fields other than the normal content editor for content input in the backend, WordPress won't do this automatically (even if you apply the the_content filter).
So, I found this regex, which works perfectly for my application:
$firstalinea = preg_replace('/\s*[a-zA-Z\/\/:\.]*youtu(be.com\/watch\?v=|.be\/)([a-zA-Z0-9\-_]+)([a-zA-Z0-9\/\*\-\_\?\&\;\%\=\.]*)/i', '[embed]https://www.youtube.com/watch?v=$2[/embed]', $firstalinea);
Except for one thing: if someone places a link to a YouTube video instead of wanting to embed it, the regex replaces that too, and then the link no longer works.
Link
So, how do I make the regex NOT match if it is preceded by href=" ?
Thanks!
Solved it:
$re = '/(?<!href=\")(http:\/\/|https:\/\/)(?:www\.)?youtu(be.com\/watch\?v=|.be\/)([a-zA-Z0-9\-_]+)([a-zA-Z0-9\/\*\-\_\?\&\;\%\=\.]*)/i';
$firstalinea = preg_replace($re, '[embed]https://www.youtube.com/watch?v=$3[/embed]', $firstalinea);
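As a quick check of the negative lookbehind, here is a small sketch that runs the pattern above on a made-up sample string. The wrapper name embed_youtube and the sample text are assumptions, not part of the original code.

```php
<?php
// Hypothetical wrapper around the pattern from the solution above.
function embed_youtube(string $text): string
{
    // (?<!href=\") rejects any URL that directly follows href="
    $re = '/(?<!href=\")(http:\/\/|https:\/\/)(?:www\.)?youtu(be.com\/watch\?v=|.be\/)([a-zA-Z0-9\-_]+)([a-zA-Z0-9\/\*\-\_\?\&\;\%\=\.]*)/i';
    return preg_replace($re, '[embed]https://www.youtube.com/watch?v=$3[/embed]', $text);
}

// Made-up input: one bare URL and one URL inside an anchor tag.
$sample = 'Watch https://youtu.be/abc123 or <a href="https://www.youtube.com/watch?v=xyz789">Link</a>';
echo embed_youtube($sample), PHP_EOL;
// Watch [embed]https://www.youtube.com/watch?v=abc123[/embed] or <a href="https://www.youtube.com/watch?v=xyz789">Link</a>
```

Note that the lookbehind only covers the exact text href=" so a link written as href='...' or with spaces around the = would still be replaced.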

PHP Regex to get a name from a url tag

There are a lot of regexes for getting links or values from <a href tags, but what about extracting a value from URL tags like this?
$text = '[URL="http://google.com"]ANY THING[/URL]';
If I want to get the value ANY THING from this URL tag, what regex can I use?
You can use this one
'/\[URL[^]]+\](?P<name>[^\[]+)\[\/URL\]/'
But you should probably learn why it works. Here is a good tester that shows that regex at work:
https://regex101.com/r/hS0sO5/1
Traditionally this is called BBCode (Bulletin Board Code):
https://en.wikipedia.org/wiki/BBCode
There are full PHP implementations of this sort of thing that you can use instead of regex.
If you want both (with the URL optional), you can use this one:
'/\[URL(?:\=\"(?P<url>[^"]+)\")?\](?P<name>[^\[]+)\[\/URL\]/'
And here is that one at work
https://regex101.com/r/hS0sO5/2
That last one does require the " around the URL; I have seen them written with ' or with no quotes at all.
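To show how the named groups are read back in PHP, here is a short sketch using the second pattern with preg_match. The function name parse_bbcode_url is made up for illustration.

```php
<?php
// Hypothetical helper: returns the url and name of the first [URL] tag,
// or null when the text contains no such tag.
function parse_bbcode_url(string $text): ?array
{
    $pattern = '/\[URL(?:\=\"(?P<url>[^"]+)\")?\](?P<name>[^\[]+)\[\/URL\]/';
    if (!preg_match($pattern, $text, $m)) {
        return null;
    }
    // Named groups are available under their names; an optional group
    // that did not take part in the match comes back as an empty string.
    return ['url' => $m['url'], 'name' => $m['name']];
}

$tag = parse_bbcode_url('[URL="http://google.com"]ANY THING[/URL]');
echo $tag['name'], PHP_EOL; // ANY THING
echo $tag['url'], PHP_EOL;  // http://google.com
```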

Adding negative lookback to this regex pattern in php

I have spent the entire day trying to figure out how to get this code to only affect the first instance it runs across. Eventually, I learned about negative lookbehind and tried to implement that.
I have tried every possible arrangement except, of course, the correct one. I discovered regex101, which is really cool, but ultimately didn’t help me find the solution.
$content = preg_replace('/<img[^>]+./','', get_the_content_with_format());
This will be used in wordpress to strip out the first image on a page (moving it above the written content), but leave the rest in so that there can be images used in the post description.
Be easy on me, please. This is my first question here and I really am not a programmer.
Update: Because l’L'l asked, this is the entire chunk of relevant code.
<?php
//this will remove the images from the content editor
// it will not remove links from images, so if an image has a link, you will end up with an empty line.
$content = preg_replace('/<img[^>]+./','', get_the_content_with_format());
//this IF statement checks if $content has any value left after the images were removed
// If so, it will echo the div below it; if not, it won't do anything.
if($content != ""):?>
<div class="portfolio-box">
<?php echo do_shortcode( $content ) ?>
</div>
<?php endif; ?>
I’ve tried both of the solutions offered here but, for whatever reason, they didn’t work.
And, thank you guys very much for helping, by the way.
You could just anchor it at the beginning of the string (with ^), capture everything up to the first image (with (.*?)), and replace all of that with the content before the image:
$content = preg_replace('/^(.*?)<img[^>]*>/s','$1', get_the_content_with_format());
Note that I also added the s modifier so that the dot (.) matches newlines, and that the pattern ends with > so the tag's closing bracket is removed as well.
If you just want to replace the first occurrence of the regex match, pass 1 as the fourth parameter (the limit), which means that at most one match is replaced.
See http://php.net/manual/de/function.preg-replace.php
In your example, this would look like:
$content = preg_replace('/<img[^>]+./','', get_the_content_with_format(), 1);
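A small sketch of the limit parameter on a made-up HTML string. get_the_content_with_format() is a theme function, so it is replaced here by a plain string, and the name strip_first_image is hypothetical.

```php
<?php
// Hypothetical helper: removes only the first <img ...> tag.
function strip_first_image(string $html): string
{
    // [^>]+ runs up to the closing bracket and the final "." consumes it.
    // The fourth argument (1) limits the replacement to one match.
    return preg_replace('/<img[^>]+./', '', $html, 1);
}

$html = '<p>Intro</p><img src="a.jpg"><p>Text</p><img src="b.jpg">';
echo strip_first_image($html), PHP_EOL;
// <p>Intro</p><p>Text</p><img src="b.jpg">
```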

Can't use OR( | ) in php Regular expression

I'm a newbie here. I'm facing a weird problem in using regex in PHP.
$result = "some very long long string with different kind of links";
$regex='/<.*?href.*?="(.*?net.*?)"/'; //this is the regex rule
preg_match_all($regex,$result,$parts);
In this code I'm trying to get the links from the result string, but it only gives me the links that contain .net. I also want the links that contain .com, so I tried this:
$regex='/<.*?href.*?="(.*?net|com.*?)"/';
But it returns nothing.
Sorry for my bad English.
Thanks in advance.
Update 1:
Now I'm using this:
$regex='/<.*?href.*?="(.*?)"/';
This rule grabs all the links from the string, but it is not perfect, because it also grabs other substrings such as "javascript".
The | character applies to everything within the capturing group, so (.*?net|com.*?) will match either .*?net or com.*?. I think what you want is (.*?(net|com).*?).
If you do not want the extra capturing group, you can use (.*?(?:net|com).*?).
You could also use (.*?net.*?|.*?com.*?), but this is not recommended because of the unnecessary repetition.
Your regex gets interpreted as .*?net or com.*?. You'll want (.*?(net|com).*?).
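To illustrate the grouped alternation, here is a sketch that wraps net|com in a non-capturing group, so that group 1 holds the whole URL. The pattern is a stricter variant written for this example (it uses [^"]* so it cannot run past the closing quote), and the function name and sample HTML are made up.

```php
<?php
// Hypothetical helper: returns the href values of <a> tags whose URL
// contains ".net" or ".com".
function extract_net_com_links(string $html): array
{
    // (?:net|com) limits the alternation to the TLD; without the group,
    // the | would split the whole capture into two unrelated branches.
    $regex = '/<a\s[^<>]*href\s*=\s*"([^"]*\.(?:net|com)\b[^"]*)"/i';
    preg_match_all($regex, $html, $parts);
    return $parts[1];
}

$html = '<a href="http://example.net/a">A</a> '
      . '<a href="http://example.com/b">B</a> '
      . '<a href="http://example.org/c">C</a> '
      . '<a href="javascript:void(0)">J</a>';
print_r(extract_net_com_links($html));
// The array holds http://example.net/a and http://example.com/b
```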
Try this:
$regex='/<.*?href.*?="(.*?\.(?:net|com)\b.*?)"/i';
or better:
$regex='/<a .*?href\s*+=\s*+"\K.*?\.(?:net|com)\b[^"]*+/i';
<.*?href
is a problem. This will match from the first < on the current line to the first href, regardless of whether they belong to the same tag.
Generally, it's unwise to try and parse HTML with regexes; if you absolutely insist on doing that, at least be a bit more specific (but still not perfect):
$regex='/<[^<>]*href[^<>=]*="([^"]*(?:net|com)[^"]*)"/';

PHP regex for filtering out urls from specific domains for use in a vBulletin plug-in

I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to php let alone regexes.
Basically, I'm trying to put together a regex and use a preg_replace to find any urls that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] bb tags.
As for the regex, I think it would need to find URLs that:
1. Start with http or an anchor tag. I believe that the URLs in [CODE] tags could be processed the same way as the plain-text URLs, and it's fine if the replacement ends up inside the [CODE] tag afterward.
2. Could contain any number of any characters before the domain/word
3. Have the domain somewhere in the middle
4. Could contain any number of any characters after the domain
5. End with one of a number of extensions such as (html|htm|rar|zip|001) or with a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.
You can use parse_url on the URLs and look into the hashmap it returns. That allows you to filter for domains or even finer-grained control.
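A minimal sketch of the parse_url idea, assuming the URLs have already been extracted from the post. The helper name is_blocked_url and the host names are placeholders.

```php
<?php
// Hypothetical helper: true when the URL's host equals a blocked host
// or is a subdomain of one.
function is_blocked_url(string $url, array $blockedHosts): bool
{
    // With PHP_URL_HOST, parse_url returns only the host part
    // (null when it is missing, false for seriously malformed URLs).
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false;
    }
    $host = strtolower($host);
    foreach ($blockedHosts as $blocked) {
        $blocked = strtolower($blocked);
        // Exact match, or the host ends with ".<blocked>".
        if ($host === $blocked
            || substr($host, -strlen($blocked) - 1) === '.' . $blocked) {
            return true;
        }
    }
    return false;
}

var_dump(is_blocked_url('http://www.domain1.com/file.zip', ['domain1.com'])); // bool(true)
var_dump(is_blocked_url('http://example.org/', ['domain1.com']));             // bool(false)
```

Comparing whole host names avoids false positives such as notdomain1.com, which a plain substring search would flag.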
I think you can avoid the overhead of this by using the built-in filter_var function.
It has been available since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);
Hmm, my first guess: you put $filterthese directly inside a single-quoted string, and single quotes don't allow variable substitution. Also, $filterthese is an array, which should first be joined:
$filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but those points seem worth checking to me.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )([a-z0-9-]+\.)*(?#
domains to block: )('.implode("|", $filterthese).')(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';
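The same idea of joining the domain list into one alternation can be sketched as a small function. This is a simplified variant that only handles the URL itself (inside an anchor tag, just the href value is replaced). The function name, the domain names and the replacement text are placeholders.

```php
<?php
// Hypothetical helper: replaces every URL whose host contains one of the
// blocked names with a fixed message.
function filter_links(string $message, array $names, string $replacement): string
{
    if ($names === []) {
        return $message; // nothing to block
    }
    // preg_quote escapes regex metacharacters and the "~" delimiter.
    $quoted = array_map(function ($name) {
        return preg_quote($name, '~');
    }, $names);
    $alternation = implode('|', $quoted);

    // scheme, optional subdomains, blocked name, rest of host, optional path
    $regex = '~https?://(?:[a-z0-9-]+\.)*(?:' . $alternation . ')'
           . '\.[a-z0-9.-]+(?:/[^\s"\'<>\[\]]*)?~i';

    return preg_replace($regex, $replacement, $message);
}

$post = 'See http://www.domain1.com/file.zip and http://example.org/ok';
echo filter_links($post, ['domain1', 'domain2'], '[filtered]'), PHP_EOL;
// See [filtered] and http://example.org/ok
```

Building the pattern by concatenation sidesteps the single-quote interpolation problem described above.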
