Check for multiple patterns with preg_replace? - php

I am looking for a solution for validating submitted posts. I want to check whether someone submits (within the post):
An iframe embedding a YouTube or Vimeo video, so that the width used in the iframe can be replaced with the correct one
A URL, to be replaced by a clickable HTML link
An image URL, to be replaced by an HTML image tag
I was able to find the correct regexes for each of these requirements, but using three separate preg_replace calls causes interference. For example, detecting a URL will also detect the URL inside the iframe.
I have searched for a solution, both on Stack Overflow and on the rest of the internet. But I am not an expert, so perhaps someone could help me out or direct me to the right tutorial/website/how-to...

What you can do is first match the iframes with preg_match, and then replace them with a placeholder.
Then you can do the replacements for urls/images. Then, replace the iframe placeholders back with the iframes you matched earlier.
You can generate unique sequential placeholders by using preg_replace_callback, so that you get to run some code to increment a $placeholder_id for each replacement.
This is a general strategy that can often greatly simplify complex parsing.
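A minimal sketch of that placeholder strategy follows. The function name formatPost, the %%IFRAME_n%% placeholder format, and both regexes are assumptions made for illustration; they are not the asker's actual patterns.

```php
<?php
// Hypothetical helper: hides iframes behind placeholders, linkifies bare URLs,
// then restores the original iframes.
function formatPost(string $post): string
{
    $iframes = [];

    // 1. Swap each iframe for a unique sequential placeholder and remember the original.
    $post = preg_replace_callback(
        '~<iframe\b[^>]*>.*?</iframe>~is',
        function (array $m) use (&$iframes): string {
            $key = '%%IFRAME_' . count($iframes) . '%%';
            $iframes[$key] = $m[0];
            return $key;
        },
        $post
    );

    // 2. Run the other replacements; URLs inside the iframes are out of reach now.
    $post = preg_replace('~https?://[^\s<>"\']+~i', '<a href="$0">$0</a>', $post);

    // 3. Put the original iframes back in place of their placeholders.
    return str_replace(array_keys($iframes), array_values($iframes), $post);
}
```

The image replacement would slot into step 2 in the same way, and any rewriting of the iframe width could happen inside the callback before the iframe is stored.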

You can simply pass an array of URL patterns to preg_replace(), together with a matching array of replacements, like this:
$pattern_array = array(
    '/somepattern/',
    '/someotherpattern/',
    '/yetanotherpattern/',
);
$replacement_array = array(
    'somereplacement',
    'someotherreplacement',
    'yetanotherreplacement'
);
$result = preg_replace($pattern_array, $replacement_array, $subject_string);
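One caveat worth keeping in mind with the array form: the patterns are applied one after another, and each pattern runs on the output of the previous one. So this form alone does not stop a later pattern from matching text that an earlier replacement produced. A small illustration with made-up patterns:

```php
<?php
// Illustrative patterns only: the second pattern also sees the output of the first.
$patterns     = ['/cat/', '/dog/'];
$replacements = ['dog', 'bird'];

$result = preg_replace($patterns, $replacements, 'cat and dog');

echo $result, "\n"; // prints "bird and bird"
```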

Related

Hide the Youtube URL in post content on WordPress while using get_the_content()

In my website, I'm using get_the_content() to display the contents of my posts. But my post's video URL, a YouTube link in this case, is also displayed along with my content.
I tried using preg_replace() to strip the string that holds the YouTube URL. But I want to do it dynamically, by matching on the beginning of the string
'https://www.youtube.com/watch?v' and replacing the whole URL with an empty string.
I'm looking for something like this:
$description = preg_replace('https://www.youtube.com/watch?v=f0cReHKUiJM', '', get_the_content());
Please share your ideas on how can I accomplish this.
Thanks in advance.
If you literally want to remove all youtube urls from your get_the_content() string, (assuming the href value is double quoted) you can use this:
~https?:/{2}(?:w{3}\.)?youtube\.com[^"]+~
This can be implemented as:
$description=preg_replace('~https?:/{2}(?:w{3}\.)?youtube\.com[^"]+~','',get_the_content());
But this will leave the "shell" of the link (...).
You could go one step further and remove tags and retain the text within:
~<a.*?href="https?://(?:www\.)?youtube\.com[^>]*>(.*?)</a>~
Implementation:
$description = preg_replace('~<a.*?href="https?://(?:www\.)?youtube\.com[^>]*>(.*?)</a>~', '\1', get_the_content());
The truth is, you haven't provided any full string samples from get_the_content() or explained if there were variations in the url or a tag structure (extra attributes or manner of quoting). These factors will likely impact the pattern design. My answer is mostly based on assumptions. Maybe it will help you. If it doesn't, it would be best if you updated your question.
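To make the difference between the two patterns concrete, here is a sketch that runs both on a made-up string. The $content value is an assumption standing in for whatever get_the_content() returns; real post content may be structured differently.

```php
<?php
// Assumed sample content; in WordPress this would come from get_the_content().
$content = '<p>Intro text</p><a href="https://www.youtube.com/watch?v=f0cReHKUiJM">Watch the video</a>';

// Pattern 1: removes only the URL, leaving the surrounding tag behind.
$urlOnly = preg_replace('~https?:/{2}(?:w{3}\.)?youtube\.com[^"]+~', '', $content);

// Pattern 2: removes the whole <a> tag but keeps the link text.
$tagRemoved = preg_replace(
    '~<a.*?href="https?://(?:www\.)?youtube\.com[^>]*>(.*?)</a>~',
    '$1',
    $content
);

echo $urlOnly, "\n";    // <p>Intro text</p><a href="">Watch the video</a>
echo $tagRemoved, "\n"; // <p>Intro text</p>Watch the video
```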

PHP Regex to get a name from a url tag

There are plenty of regexes for getting links or a value out of <a href tags, but what about extracting a value from URL tags like this one?
$text = '[URL="http://google.com"]ANY THING[/URL]';
If I want to get the value ANY THING from this URL tag, what regex can I use?
You can use this one
'/\[URL[^]]+\](?P<name>[^\[]+)\[\/URL\]/'
But you should probably learn why it works. Here is a good tester that shows that regex at work:
https://regex101.com/r/hS0sO5/1
Traditionally this is called BBCode (Bulletin Board Code):
https://en.wikipedia.org/wiki/BBCode
There are full PHP implementations of this sort of thing that you can use instead of a regex.
If you want both (with the URL being optional), you can use this one:
'/\[URL(?:\=\"(?P<url>[^"]+)\")?\](?P<name>[^\[]+)\[\/URL\]/'
And here is that one at work
https://regex101.com/r/hS0sO5/2
That last one does require the " around the URL; I have seen these tags written with ' or with no quotes at all.
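As a sketch of how the second pattern might be used from PHP, the named groups can be read straight out of the match array. The helper name parseUrlTag is hypothetical.

```php
<?php
// Hypothetical helper: returns ['url' => ..., 'name' => ...] for a [URL] tag, or null if none is found.
function parseUrlTag(string $text): ?array
{
    $pattern = '/\[URL(?:\=\"(?P<url>[^"]+)\")?\](?P<name>[^\[]+)\[\/URL\]/';

    if (preg_match($pattern, $text, $m) !== 1) {
        return null;
    }

    return [
        // The url group is optional, so it is empty when the tag has no ="..." part.
        'url'  => $m['url'] !== '' ? $m['url'] : null,
        'name' => $m['name'],
    ];
}

print_r(parseUrlTag('[URL="http://google.com"]ANY THING[/URL]'));
```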

How do you validate wikipedia URLs like these in PHP?

Url: https://en.m.wikipedia.org/wiki/Professional_Tax
Is not being validated with Regex:
function isValidURL($url) {
return preg_match('|^(http(s)?://)?[a-z0-9-]+\.(.[a-z0-9-]+)+(:[0-9]+)?(/.*)?$|i', $url);
}
So the purpose of this is: We have a whole lot of urls embedded inside posts (forum) - we want to create a script which will basically keep track of which urls are still good. For this we need to extract the URLs from the posts and create a database - which can be checked at intervals for their status codes.
To match this URL, you can use:
^https?\:\/\/([\w\.]+)wikipedia\.org\/wiki\/([\w]+\_?)+
This only matches the URL. Checking which URLs are still good (if I understand correctly, that means still live) is not a job for a regex.
Here is an alternative regex for most URLs:
(?<![#\w])(((http|https)(:\/\/))?([\w\-_]{2,})(([\.])([\w\-_]*)){1,})([\w.,#?^=%&:\/~+#-]*[\w#?^=%&\/~+#-])
You can experiment with regex here: https://regex101.com/
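For the Wikipedia case specifically, a stricter check could look like the sketch below. The function name and the exact pattern are assumptions: it accepts any subdomain depth under wikipedia.org and any non-empty /wiki/ path, which may be looser or tighter than what you need.

```php
<?php
// Hypothetical helper: true for http(s) URLs on wikipedia.org (any subdomains) under /wiki/.
function isWikipediaArticleUrl(string $url): bool
{
    return preg_match('~^https?://(?:[a-z0-9-]+\.)*wikipedia\.org/wiki/\S+$~i', $url) === 1;
}

var_dump(isWikipediaArticleUrl('https://en.m.wikipedia.org/wiki/Professional_Tax')); // bool(true)
var_dump(isWikipediaArticleUrl('https://example.com/wiki/Professional_Tax'));        // bool(false)
```

Requiring each subdomain label to end in a dot keeps look-alike hosts such as evilwikipedia.org from passing.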

Parsing link from javascript function

I'm trying to parse a direct link out of a javascript function within a page. I'm able to parse the html info I need, but am stumped on the javascript part. Is this something that is achievable with php and possibly regex?
function videoPoster() {
document.getElementById("html5_vid").innerHTML =
"<video x-webkit-airplay='allow' id='html5_video' style='margin-top:"
+ style_padding
+ "px;' width='400' preload='auto' height='325' controls onerror='cantPlayVideo()' "
+ "<source src='http://video-website.com/videos/videoname.mp4' type='video/mp4'>";
}
What I need to pull out is the link "http://video-website.com/videos/videoname.mp4". Any help or pointers would be greatly appreciated!
/http:\/\/.*\.mp4/ will give you all characters between http:// and .mp4, inclusive. Note that the slashes in the URL have to be escaped when / is also the regex delimiter.
See it in action.
If you need the session id, use something like /http:\/\/.*\.mp4\?sessionid=\d+/ (the ? must be escaped, otherwise it makes the preceding 4 optional).
In general, no. Nothing short of a full javascript parser will always extract urls, and even then you'll have trouble with urls that are computed nontrivially.
In practice, it is often best to use the simplest capturing regexp that works for the code you actually need to parse. In this case:
['"](http://[^'"]*)['"]
If you have to enter that regexp as a string, beware of escaping.
If you ever have unescaped quotation marks in urls, this will fail. That's valid but rare. Whoever is writing the stuff you're parsing is unlikely to use them because they make referring to the urls in javascript a pain.
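As a sketch of the escaping issue in PHP: using ~ as the delimiter avoids escaping the slashes, and inside a single-quoted PHP string only the single quotes need a backslash. The $js value below is an assumed one-line excerpt of the function from the question.

```php
<?php
// Assumed excerpt of the JavaScript source to search.
$js = "<source src='http://video-website.com/videos/videoname.mp4' type='video/mp4'>";

// The regexp from above, written as a PHP string with ~ as the delimiter.
$pattern = '~[\'"](http://[^\'"]*)[\'"]~';

$url = null;
if (preg_match($pattern, $js, $m) === 1) {
    $url = $m[1]; // the text between the quotes
}

echo $url, "\n"; // http://video-website.com/videos/videoname.mp4
```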
For your specific case, this should work, provided that none of the characters in the URL are escaped.
preg_match("/src='([^']*)'/", $html, $matches);
$url = $matches[1];
See the preg_match() manual page. You should probably add error handling, ensuring that the function returns 1 (that the regex matched) and possibly performing some additional checks as well (such as ensuring that the URL begins with http:// and contains .mp4?).
(As with all Web scraping techniques, the owner or maintainer of the site you are scraping may make a future change that breaks your script, and you should be prepared for that.)
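The error handling described above could be sketched like this. The function name extractVideoUrl and the choice to return null on any failure are assumptions, not part of the original answer.

```php
<?php
// Hypothetical helper: returns the first src='...' URL, or null if it is missing or looks wrong.
function extractVideoUrl(string $html): ?string
{
    // preg_match returns 1 on a match, 0 on no match and false on error.
    if (preg_match("/src='([^']*)'/", $html, $matches) !== 1) {
        return null;
    }

    $url = $matches[1];

    // Additional sanity checks: expected scheme and file type.
    if (strpos($url, 'http://') !== 0 || strpos($url, '.mp4') === false) {
        return null;
    }

    return $url;
}
```

Calling code can then treat null as "the page layout has probably changed" and log it instead of failing silently.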
The following captures every URL found in a src attribute of your HTML:
$matches=array();
if (preg_match_all('/src=["\'](?P<urls>https?:\/\/[^"\']+)["\']/', $html, $matches)){
print_r($matches['urls']);
}
If you want to do the same in JavaScript, you could use this:
var matches;
if (matches=html.match(/src=["'](https?:\/\/[^"']+)["']/g)){
//gives you all matches, but they are still including the src=" and " parts, so you would
//have to run every match again against the regex without the g modifier
}

PHP regex for filtering out urls from specific domains for use in a vBulletin plug-in

I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to php let alone regexes.
Basically, I'm trying to put together a regex and use a preg_replace to find any urls that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] bb tags.
As for regex, I would need it to find URLS with the following, I think:
1. Starts with http or an anchor tag. I believe that the URLs in [CODE] tags could be processed the same as the plain text URLs and it's fine if the replacement ends up inside the [CODE] tag afterward.
2. Could contain any number of any characters before the domain/word
3. Has the domain somewhere in the middle
4. Could contain any number of any characters after the domain
5. Ends with one of a number of extensions such as (html|htm|rar|zip|001) or in a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.
You can use parse_url on the URLs and look into the associative array it returns. That allows you to filter by domain or apply even finer-grained control.
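A minimal sketch of the parse_url approach, assuming the URLs have already been extracted from the post. The function name isBlockedUrl and the example domain names are hypothetical.

```php
<?php
// Hypothetical helper: true when the URL's host is a blocked domain or a subdomain of one.
function isBlockedUrl(string $url, array $blockedDomains): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // malformed URL or no host part
    }

    $host = strtolower($host);
    foreach ($blockedDomains as $domain) {
        $domain = strtolower($domain);
        // Exact match, or the host ends with ".domain".
        if ($host === $domain || substr($host, -strlen($domain) - 1) === '.' . $domain) {
            return true;
        }
    }

    return false;
}

var_dump(isBlockedUrl('http://www.domain1.com/file.rar', ['domain1.com'])); // bool(true)
```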
I think you can avoid the overhead of this by using the built-in filter_var function.
It has been available since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);
Hmm, my first guess: you put $filterthese directly inside a single-quoted string. Single quotes don't allow variable substitution. Also, $filterthese is an array, which should first be joined:
$filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but those points seem worth a check to me.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )([a-z0-9-]+\.)*(?#
domains to block: )('.implode("|", $filterthese).')(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';
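For reference, here is a self-contained sketch along the same lines. It is a variation rather than the exact pattern above: it assumes href values are quoted, assumes the link text sits on one line, and also swallows the link text and closing tag. The function name filterLinks is hypothetical.

```php
<?php
// Hypothetical helper: replaces plain or hyperlinked URLs on the listed domains with a message.
function filterLinks(string $message, array $filterthese, string $replacement): string
{
    // Quote each name so characters such as "." are matched literally, then join with "|".
    $domains = implode('|', array_map(function ($name) {
        return preg_quote($name, '!');
    }, $filterthese));

    $regex = '!(?:<a[^>]+href=["\']?)?'     // optional start of an <a> tag
           . 'https?://'
           . '(?:[a-z0-9-]+\.)*'             // optional subdomains
           . '(?:' . $domains . ')'          // domains to block
           . '[^\s"\'<>]*'                   // rest of the host and the path
           . '(?:["\'][^>]*>.*?</a>)?!i';   // optional rest of the <a> element

    return preg_replace($regex, $replacement, $message);
}

$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
echo filterLinks('Get it at http://www.domain1.com/file.rar now', $filterthese, $replacement), "\n";
```

In the plug-in, the result would be assigned back to $this->post['message'] as in the question.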
