Regex not preceded by href=" - php

So I am adding [embed][/embed] around YouTube links in a WordPress environment, since if you use fields other than the normal content editor for content input in the backend, WordPress won't do this automatically (even if you apply the the_content filter).
So, I found this regex, which works perfectly for my application:
$firstalinea = preg_replace('/\s*[a-zA-Z\/\/:\.]*youtu(be.com\/watch\?v=|.be\/)([a-zA-Z0-9\-_]+)([a-zA-Z0-9\/\*\-\_\?\&\;\%\=\.]*)/i', '[embed]https://www.youtube.com/watch?v=$2[/embed]', $firstalinea);
Except for one thing: if someone places a link to a YouTube video instead of wanting to embed it, the URL inside the link gets replaced as well, and then the link does not work anymore.
Link
So, how can I make the regex NOT match if the URL is preceded by href=" ?
Thanks!

Solved it:
$re = '/(?<!href=\")(http:\/\/|https:\/\/)(?:www\.)?youtu(be.com\/watch\?v=|.be\/)([a-zA-Z0-9\-_]+)([a-zA-Z0-9\/\*\-\_\?\&\;\%\=\.]*)/i';
$firstalinea = preg_replace($re, '[embed]https://www.youtube.com/watch?v=$3[/embed]', $firstalinea);
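A quick way to check the negative lookbehind is to run it on a string that contains both a bare URL and a real link. This is only a sketch: the sample text and video IDs are made up, and the pattern is a slightly tidied variant of the one above (escaped dots, non-capturing groups, ~ as delimiter), so the video ID ends up in $1 instead of $3.

```php
<?php
// Made-up sample input: one bare YouTube URL and one URL inside an href.
$firstalinea = 'Watch https://www.youtube.com/watch?v=abc123XYZ_- and '
    . '<a href="https://youtu.be/def456">this link</a>.';

// (?<!href=") rejects any match that starts right after href="
$re = '~(?<!href=")https?://(?:www\.)?youtu(?:be\.com/watch\?v=|\.be/)'
    . '([a-zA-Z0-9_-]+)[a-zA-Z0-9/*_?&;%=.-]*~i';

$result = preg_replace(
    $re,
    '[embed]https://www.youtube.com/watch?v=$1[/embed]',
    $firstalinea
);

echo $result, PHP_EOL;
```

Only the bare URL should be wrapped in [embed]; the href value stays as it was. Note that the lookbehind only covers double-quoted href attributes.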

Related

Regex to update URLs after a content migration

I recently moved some old content to a new site and updated some URL structures. I need to do a find-replace on the entire database to update some old links. This would be easy if I knew regex, but I don't, so I'm hoping this is easy for the SO gurus.
Note: This is PHP regex.
Find:
https://api.floodmagazine.com/{number}/{string}/
Result:
https://api.floodmagazine.com/789/foo-bar/
https://api.floodmagazine.com/12345/foo-bar-1/
Replace with:
https://floodmagazine.com/$1/$2/
Result:
https://floodmagazine.com/789/foo-bar/
https://floodmagazine.com/12345/foo-bar-1/
It's not as easy as just doing a search for the sub-domain (api.floodmagazine.com) because there are URLs in the DB that need that sub-domain to remain (images, for example). So the /{number}/{string}/ part is an important way to find only the URLs that need to be changed.
I just need the regex part, I'm using WP Migrate for the database updating part.
Thanks for the help!
https:\/\/api\.floodmagazine\.com\/([0-9]+)\/([A-Za-z0-9._+-]+)\/?
That should work. On regex101 you have to escape /, so I kept that here. That may not be necessary in your tooling.
You can omit the last ? if you don’t want the trailing slash to be optional.
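For reference, here is a minimal sketch of how that pattern behaves in PHP. The sample string is invented, and the image path in it is only an assumption about what an untouched URL might look like; the pattern uses ~ as delimiter so the slashes need no escaping.

```php
<?php
// Same idea as the pattern above, written with ~ delimiters.
$re = '~https://api\.floodmagazine\.com/([0-9]+)/([A-Za-z0-9._+-]+)/?~';

// Made-up content: one post URL and one (hypothetical) image URL.
$content = 'Post: https://api.floodmagazine.com/789/foo-bar/ '
    . 'Image: https://api.floodmagazine.com/wp-content/uploads/photo.jpg';

// $1 is the number, $2 is the slug.
$result = preg_replace($re, 'https://floodmagazine.com/$1/$2/', $content);

echo $result, PHP_EOL;
```

The post URL is rewritten, while the image URL keeps the api sub-domain because its first path segment is not a number.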
This should grab all the URLs you describe:
(https://api\.floodmagazine\.com)(\/)[0-9]*(\/)[A-Za-z0-9-]*(\/)
To avoid URL errors due to WordPress inconsistency, you can use this PHP code generated with regex101:
$re = '/https?:\/\/([^\/]+)\/([^\/]+)\/([^\/]+)\/?/m';
$str = 'https://api.floodmagazine.com/789/foo-bar/';
$subst = 'https://floodmagazine.com/$2/$3/';
$result = preg_replace($re, $subst, $str);
This regex captures the domain, the ID, and the post name. It can also handle special cases such as non-HTTPS URLs and special characters,
and it returns the result expected in your example.

Hide the Youtube URL in post content on WordPress while using get_the_content()

In my website, I'm using get_the_content() to display the contents of my posts. But my post's video URL, which is a YouTube URL here, is also getting displayed along with my content.
I tried using preg_replace() to strip the string that contains the YouTube URL. But I want to do it dynamically, by matching the beginning of the string
'https://www.youtube.com/watch?v' and replacing it with an empty string.
I'm looking for exactly something like this.
$description = preg_replace('https://www.youtube.com/watch?v=f0cReHKUiJM', '', get_the_content());
Please share your ideas on how can I accomplish this.
Thanks in advance.
If you literally want to remove all YouTube URLs from your get_the_content() string (assuming the href value is double-quoted), you can use this:
~https?:/{2}(?:w{3}\.)?youtube\.com[^"]+~
This can be implemented as:
$description=preg_replace('~https?:/{2}(?:w{3}\.)?youtube\.com[^"]+~','',get_the_content());
But this will leave the "shell" of the link behind (<a href="">...</a>).
You could go one step further and remove tags and retain the text within:
~<a.*?href="https?://(?:www\.)?youtube\.com[^>]*>(.*?)</a>~
Implementation:
$description=preg_replace('~<a.*?href="https?://(?:www\.)?youtube\.com[^>]*>(.*?)</a>~','\1',get_the_content());
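To try the tag-stripping idea outside WordPress, something like the following sketch can be used. The $content string is a made-up stand-in for get_the_content(), and the pattern is a variant that uses [^>]* instead of .*? before href so the match cannot run past the end of the opening tag.

```php
<?php
// Stand-in for get_the_content(), which only exists inside WordPress.
$content = '<p>Intro text.</p>'
    . '<p><a href="https://www.youtube.com/watch?v=f0cReHKUiJM">Watch the video</a></p>';

// Replace the whole <a>...</a> element with just its inner text ($1).
$description = preg_replace(
    '~<a\b[^>]*href="https?://(?:www\.)?youtube\.com[^>]*>(.*?)</a>~s',
    '$1',
    $content
);

echo $description, PHP_EOL;
```

Like the patterns above, this assumes a double-quoted href and a youtube.com host; youtu.be links or single-quoted attributes would need the pattern extended.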
The truth is, you haven't provided any full string samples from get_the_content() or explained if there were variations in the url or a tag structure (extra attributes or manner of quoting). These factors will likely impact the pattern design. My answer is mostly based on assumptions. Maybe it will help you. If it doesn't, it would be best if you updated your question.

Trying to stop regex at a tag

I know there are other posts with a similar name but I've looked through them and they haven't helped me resolve this.
I'm trying to get my head around regex and preg_match. I am going through a body of text and each time a link exists I want it to be extracted. I'm currently using the following:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
which works fine until it finds one that has <br after it. Then I get the url plus the <br which means it doesn't work correctly. How can I have it so that it stops at the < without including it?
Also, I have been looking everywhere for a clear explanation of using regex and I'm still confused by it. Has anyone any good guides on it for future reference?
\S* is too broad. In particular, I could inject into your code with a URL like:
http://hax.hax/"><script>alert('HAAAAAAAX!');</script>
You should only allow characters that are allowed in URLs:
[-A-Za-z0-9._~:/?#\[\]@!$&'()*+,;=]*
Some of these characters are only allowed in specific places (such as ?), so if you want better validation you will need more cleverness.
Instead of a bare \S, also exclude the open tag char by using a negated class (keeping whitespace excluded too):
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[^\s<]*)?/";
You might even want to be more restrictive by only allowing characters valid in URLs:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[a-zA-Z_\-\.%\?&]*)?/";
(or some more characters)
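Here is a small sketch of the extraction with preg_match_all, assuming a path class that stops at whitespace and at <. The sample text and hosts are invented, and the pattern is a compact rewrite of the one from the question rather than the exact original.

```php
<?php
// Made-up text with a URL directly followed by <br>.
$text = 'Visit http://example.com/page<br>and https://foo.org/a?b=1 now';

// The path part [^\s<]* stops at whitespace or at the start of a tag.
$reg_exUrl = '~(?:https?|ftps?)://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,3}(?:/[^\s<]*)?~';

preg_match_all($reg_exUrl, $text, $matches);
$urls = $matches[0]; // full matches only

print_r($urls);
```

The first URL should come back without the trailing <br. This still accepts characters such as quotes in the path, so it is not a substitute for escaping the URL before output.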
You could use the one presented here:
http://regex101.com/r/zV1uI7
At the bottom of that page it is explained step by step.

"catching" links in regex using php ignoring inline js

I'm stuck trying to make a regex in PHP that catches the link and its content from a html page (which I have no control over) and replaces it with a link of mine.
i.e.:
<a style="position:absolute;more_styles:more;" href="http://www.google.co.il/" class="something">This is the content</a>
Becomes:
<a style="position:absolute;more_styles:more;" href="my_function('http://www.google.co.il/')" class="something">This is the content</a>
This is the regex that I wrote:
$content = preg_replace('|<a(.*?)href=[\"\'](.*?)[\"\'][^>]*>(.*?)</a>|i','$3',$content);
This works well with all the links except links like:
<a href="http://google.co.il" onclick="if(MSIE_VER()>=4){this.style.behavior='url(#default#homepage)';this.setHomePage('http://www.google.co.il')}" class='brightgrey rightbar' style='font-size:12px'><b>Make me the home page!</b></a>
Obviously, the regexp stops at "MSIE_VER()>" because of the "[^>]*" part, and I get the wrong content when I use "$3".
I tried almost every option to make this work but no luck.
Any thoughts?
Thank you all in advance..
First of all, your code is trying to do something different from adding my_function: it tries to remove the starting tag and replace it with the URL only. There are several ways to achieve your declared goal (i.e. substituting my_function into all hrefs); the most pragmatic would be:
$content = preg_replace('|href=[\"\'](.*?)[\"\']|i',"href=\"my_function('$1')\"",$content);
If you need a more prudent approach, then I would use:
$content = preg_replace('|(<a.*?)href=[\"\'](.*?)[\"\'](.*?</a>)|i',"$1href=\"my_function('$2')\"$3",$content);
Last but not least, if you need to remove the tag rather than what you have written, let me know; there are a million ways to do it.
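As a rough check of the href-only approach, here is a sketch run against a shortened version of the problematic link. The onclick body is abbreviated and go() is a made-up placeholder; the pattern is a small variant that uses a backreference (\1) so the closing quote has to match the opening one.

```php
<?php
// Shortened, made-up version of the link with a ">" inside onclick.
$content = '<a href="http://google.co.il" onclick="if(MSIE_VER()>=4){go()}" '
    . "class='x'><b>Home</b></a>";

// Group 1 is the quote character, group 2 is the URL.
$result = preg_replace(
    '|href=(["\'])(.*?)\1|i',
    "href=\"my_function('$2')\"",
    $content
);

echo $result, PHP_EOL;
```

Because only the href attribute is matched, the ">" inside onclick never comes into play. The pattern would also match href= appearing in plain text, so it assumes the input is mostly markup.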
By default .* will take everything it can, e.g. it takes the onclick argument, because the regex is still valid. Replace "." with [^\"]; this tells the regexp to take everything excluding " (which cannot be in a URL):
$content = preg_replace('|<a(.*?)href=[\"\']([^"]*?)[\"\'][^>]*>(.*?)</a>|i','$3',$content);

PHP regex for filtering out urls from specific domains for use in a vBulletin plug-in

I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to php let alone regexes.
Basically, I'm trying to put together a regex and use a preg_replace to find any urls that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] bb tags.
As for regex, I would need it to find URLS with the following, I think:
1. Starts with http or an anchor tag. I believe that the URLs in [CODE] tags could be processed the same as the plain text URLs and it's fine if the replacement ends up inside the [CODE] tag afterward.
2. Could contain any number of any characters before the domain/word
3. Has the domain somewhere in the middle
4. Could contain any number of any characters after the domain
5. Ends with one of a number of extensions such as (html|htm|rar|zip|001) or in a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.
You can use parse_url on the URLs and inspect the associative array it returns. That lets you filter by domain or apply even finer-grained control.
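A minimal sketch of the parse_url idea might look like this. The function name is_blocked_url and the domain names are made up for illustration; only parse_url and the PHP_URL_HOST constant come from PHP itself.

```php
<?php
// Hypothetical helper: true when the URL's host is a blocked domain
// or a subdomain of one.
function is_blocked_url(string $url, array $blockedDomains): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no host could be extracted
    }
    $host = strtolower($host);
    foreach ($blockedDomains as $domain) {
        $suffix = '.' . strtolower($domain);
        // Exact match, or host ends with ".domain"
        if ($host === strtolower($domain)
            || substr($host, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

var_dump(is_blocked_url('http://files.domain1.com/a.zip', array('domain1.com')));
```

This only decides whether a single URL is blocked; finding the URLs inside a post would still need a regex or an HTML parser.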
I think you can avoid the overhead of this by using the built-in filter_var function.
It has been available since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);
Hmm, my first guess: you put $filterthese directly inside a single-quoted string, and single quotes don't allow for variable substitution. Also, $filterthese is an array, which should first be joined:
$filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but those points seem worth a check to me.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )(([a-z0-9-]+\.)*)?(?#
domains to block: )('.implode("|", $filterthese).')(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';
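Put together, a runnable sketch of the filter could look like the following. It is not the exact regex above: it assumes the block list holds full domains (the names here are placeholders), escapes them with preg_quote, and uses two simpler patterns, one for whole anchor elements and one for plain-text URLs. $message stands in for $this->post['message'].

```php
<?php
// Placeholder block list; real entries would be the file-sharing domains.
$filterthese = array('domain1.com', 'domain2.net');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';

// Escape each domain so its dots are literal, then join with "|".
$domains = implode('|', array_map(function ($d) {
    return preg_quote($d, '!');
}, $filterthese));

$patterns = array(
    // Whole <a> elements whose href points at a blocked domain.
    '!<a\b[^>]*href=["\']?https?://(?:[a-z0-9-]+\.)*(?:' . $domains . ')\b[^>]*>.*?</a>!is',
    // Remaining plain-text URLs (including those inside [CODE] tags).
    '!https?://(?:[a-z0-9-]+\.)*(?:' . $domains . ')\b[^\s"\'<>\[\]]*!i',
);

// Stand-in for $this->post['message'].
$message = 'See <a href="http://www.domain1.com/file.zip">this</a> or '
    . 'http://domain2.net/a.rar and http://example.org/ok.html';

// Patterns are applied in order: anchors first, then bare URLs.
$message = preg_replace($patterns, $replacement, $message);

echo $message, PHP_EOL;
```

Both blocked links should be replaced by the message, while the example.org URL is left alone. File extensions are not checked here; any URL on a blocked domain is filtered.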
