regular expression for replacing all links but css and js - php

i want to download a site an replace all links on that site to an internal link.
that's easy:
$page=file_get_contents($url);
$local=$_SERVER['HTTP_HOST'].$_SERVER['PHP_SELF'];
$page=preg_replace('/href="(.+?)"/','href="http://'.$local.'?href=\\1"',$page);
but i want to exclude all css files and js files from replacing, so i tried:
$regex='/href="(.+?(?!(\.js|\.css)))"/';
$page=preg_replace($regex,'href="http://'.$local.'?href=\\1"',$page);
but that didnt work,
what am i doing wrong?
i thought
?!
is a negative lookahead

To answer your regex question, you need a lookbehind there and better limit the match with a character class:
$regex = '/href="([^"]+(?<!\.js|\.css))"/';
The charclass first matches the whole link content, then asserts that this didn't end in .js or .css.
You might want to augment the whole match with <a\s[^>]*? even, so it really just finds anything that looks like a link.
Another option would be using domdocument or querypath for such tasks, which is usually tedious and more code, but simpler to add programmatic conditions to:
htmlqp->find("a") FOREACH $a->attr("href", "http:/...".$a->attr("href"))
// would need a real foreach and an if and stuff..

Related

Ommitting a specific pattern from a regex statement

I've spent the last couple of days trying to figure out how to resolve this particular issue and posting on SO, but no dice so far. I think this is probably easier than I've been making it to be, but I need some help;
Here is a pretty basic regex statement that linkifies pretty much any link. It's not the only regex pattern I have, so I've included a piece that skips over the link if it includes the specific pattern "img.youtube.com/vi/" It works great;
$message = preg_replace("#(((f|ht)tp(s)?://)?!(img.youtube.com/vi/)[-a-zA-Z?-??-?()0-9#:%_+.~\#?&;//=,])+#i", "<a href=$1 target='_blank'><b>$1</b></a>", $message);
I do not want this to linkify any url with .jpeg, jpg, gif, or any popular image format, I have another expression that will embed those kinds of links (and it works fine, too). So, I need to find a way to get this expression to reject those kinds of links.
I've gotten advice on negative lookarounds, matching to specific strings, but none of them seem to work so far. I need to find a way to get this regex to ignore any URL that ends with .jpeg and so forth;
So, the regex statement above already has an example of a string that disqualifies certain URLs - ?!(img.youtube.com/vi/). This seems like that's all I need to do, but where do I put it and how does it look? The + symbol in the statement makes it so that the regex will scrutinize the string all the way to the end of it, using the matching characters of [-a-zA-Z?-??-?()0-9#:%_+.~#?&;//=,]. So, this matching string should probably be put somewhere before the + symbol. Does it go in "?!(img.youtube.com/vi/)" ? In my mind, it should probably look like this;
$message = preg_replace("#(((f|ht)tp(s)?://)?!(img.youtube.com/vi/|/^\.jpeg$/|/^\.jpg$/|/^\gif$/)[-a-zA-Z?-??-?()0-9#:%_+.~\#?&;//=,])+#i",
"<a href=$1 target='_blank'><b>$1</b></a>", $message);
Any help is appreciated.
I answer and also clean up your regexp
(?i)((?:f|ht)tps?://((?!img|jpe?g|gif|png|bmp))(?:([-a-z0-9()#:%_+.~#?&;/=,])(?2))+(?!(?3)))
Now the img etc you don't want is in the neg lookahead and you can add a things you don't like.
$good="http://www.google.com/";
$bad="http://img.google.com/";
$r="#(?i)((?:f|ht)tps?://((?!img|jpe?g|gif|png|bmp))(?:([-a-z0-9()#:%_+.~\#?&;/=,])(?2))+(?!(?3)))#";
$rep="<a href=$1 target='_blank'><b>$1</b></a>";
echo preg_replace($r,$rep,$good);
echo preg_replace($r,$rep,$bad);
You can try here http://ideone.com/419yfm
Just remove this part of the regex:
img|
<?php
$good="http://www.google.com/";
$bad="http://img.google.com/";
$r="#(?i)((?:f|ht)tps?://((?!jpe?g|gif|png|bmp))(?:([-a-z0-9()#:%_+.~\#?&;/=,])(?2))+(?!(?3)))#";
$rep="<a href=$1 target='_blank'><b>$1</b></a>";
echo preg_replace($r,$rep,$good); echo "\n";
echo preg_replace($r,$rep,$bad);
?>
DEMO

Trying to stop regex at a tag

I know there are other posts with a similar name but I've looked through them and they haven't helped me resolve this.
I'm trying to get my head around regex and preg_match. I am going through a body of text and each time a link exists I want it to be extracted. I'm currently using the following:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
which works fine until it finds one that has <br after it. Then I get the url plus the <br which means it doesn't work correctly. How can I have it so that it stops at the < without including it?
Also, I have been looking everywhere for a clear explanation of using regex and I'm still confused by it. Has anyone any good guides on it for future reference?
\S* is too broad. In particular, I could inject into your code with a URL like:
http://hax.hax/"><script>alert('HAAAAAAAX!');</script>
You should only allow characters that are allowed in URLs:
[-A-Za-z0-9._~:/?#[]#!$&'()*+,;=]*
Some of these characters are only allowed in specific places (such as ?) so if you want better validation you will need more cleverness
Instead of \S exclude the open tag char from the class:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[^<]*)?/";
You might even want to be more restrictive by only allowing characters valid in URLs:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[a-zA-Z_\-\.%\?&]*)?/";
(or some more characters)
You could use this one as presented on the:
http://regex101.com/r/zV1uI7
On the bottom of the site you got it explained step by step.

PHP Regex URL parsing issues preg_replace

I have a custom markup parsing function that has been working very well for many years. I recently discovered a bug that I hadn't noticed before and I haven't been able to fix it. If anyone can help me with this that'd be awesome. So I have a custom built forum and text based MMORPG and every input is sanitized and parsed for bbcode like markup. It'll also parse out URL's and make them into legit links that go to an exit page with a disclaimer that you're leaving the site... So the issue that I'm having is that when I user posts multiple URL's in a text box (let's say \n delimited) it'll only convert every other URL into a link. Here's the parser for URL's:
$markup = preg_replace("/(^|[^=\"\/])\b((\w+:\/\/|www\.)[^\s<]+)" . "((\W+|\b)([\s<]|$))/ei", '"$1".shortURL("$2")."$4"', $markup);
As you can see it calls a PHP function, but that's not the issue here. Then entire text block is passed into this preg_replace at the same time rather than line by line or any other means.
If there's a simpler way of writing this preg_replace, please let me know
If you can figure out why this is only parsing every other URL, that's my ultimate goal here
Example INPUT:
http://skylnk.co/tRRTnb
http://skylnk.co/hkIJBT
http://skylnk.co/vUMGQo
http://skylnk.co/USOLfW
http://skylnk.co/BPlaJl
http://skylnk.co/tqcPbL
http://skylnk.co/jJTjRs
http://skylnk.co/itmhJs
http://skylnk.co/llUBAR
http://skylnk.co/XDJZxD
Example OUTPUT:
http://skylnk.co/tRRTnb
<br>http://skylnk.co/hkIJBT
<br>http://skylnk.co/vUMGQo
<br>http://skylnk.co/USOLfW
<br>http://skylnk.co/BPlaJl
<br>http://skylnk.co/tqcPbL
<br>http://skylnk.co/jJTjRs
<br>http://skylnk.co/itmhJs
<br>http://skylnk.co/llUBAR
<br>http://skylnk.co/XDJZxD
<br>
e flag in preg_replace is deprecated. You can use preg_replace_callback to access the same functionality.
i flag is useless here, since \w already matches both upper case and lower case, and there is no backreference in your pattern.
I set m flag, which makes the ^ and $ matches the beginning and the end of a line, rather than the beginning and the end of the entire string. This should fix your weird problem of matching every other line.
I also make some of the groups non-capturing (?:pattern) - since the bigger capturing groups have captured the text already.
The code below is not tested. I only tested the regex on regex tester.
preg_replace_callback(
"/(^|[^=\"\/])\b((?:\w+:\/\/|www\.)[^\s<]+)((?:\W+|\b)(?:[\s<]|$))/m",
function ($m) {
return "$m[1]".shortURL($m[2])."$m[3]";
},
$markup
);

PHP regex for filtering out urls from specific domains for use in a vBulletin plug-in

I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm a newb to php let alone regexes.
Basically, I'm trying to put together a regex and use a preg_replace to find any urls that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] bb tags.
As for regex, I would need it to find URLS with the following, I think:
Starts with http or an anchor tag. I believe that the URLS in [CODE] tags could be processed the same as the plain text URLS and it's fine if the replacement ends up inside the [CODE] tag afterward.
Could contain any number of any characters before the domain/word
Has the domain somewhere in the middle
Could contain any number of any characters after the domain
Ends with a number of extentions such as (html|htm|rar|zip|001) or in a closing anchor tag.
I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:
<?php
$filterthese = array('domain1', 'domain2', 'domain3');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>
I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTM (though I've read up a bit and I'm going to continue).
Thanks.
You can use parse_url on the URLs and look into the hashmap it returns. That allows you to filter for domains or even finer-grained control.
I think you can avoid the overhead of this in using the filter_var built-in function.
You may use this feature since PHP 5.2.0.
$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);
Hmm, my first guess: You put $filterthese directly inside a single-quoted string. That single quotes don't allow for variable substitution. Also, the $filterthese is an array, that should first be joined:
var $filterthese = implode("|", $filterthese);
Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but that points seem worth a check to me.
Edit: OK, on re-checking your provided source, I think the regexp line should read like this:
$regex = '!(?#
possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
offending link: )https?://(?#
possible subdomains: )(([a-z0-9-]+\.)*\.)?(?#
domains to block: )('.implode("|", $filterthese).')(?#
possible path: )(/[^ "\'>]*)?(?#
possible "a" tag [end]: )(["\']?[^>]*>)?!';

PHP Regex negation

I have a web bot which extracts some data from a website. The problem is that the html content is sent without line brakes so it's a little bit harder to match certain things so I need to extract everything that is between td tags. Here's a string example:
<a class="a" href="javascript:ow(19623507)">**-**-**-***.cstel.net</a> (<b><font color="#3300cc">Used</font></b>)</td><td><a class="a" href="javascript:ow(19623507)">**-**-**-***.cstel.net</a> (<b><font color="#3300cc">Used</font></b>)</td>
And my regex so far:
<a\s+class="a"\s+href="javascript:ow\((.*?)\)">.+</a>(?!<td>).+</td>
But my regex matches the whole line instead of matching all contents. Any ideas?
Don't waste your time on regexes. Use DOM and XPath.
DOMDocument::loadHTML($html)->getElementsByTagName('a')
Have you tried changing .+ to .+? ?
Can you determine where the proper line breaks SHOULD be? If so, it might be easier to first replace those tokens with a proper line break and then use the pattern you have (assuming that pattern works - I haven't tried it).
Your pattern looks VERY specific, but perhaps it works fine for what you are doing.

Categories