Regex pattern to detect a link but not an image - php

I've been fooling around with regex for a while. A few days ago I started modifying a regex pattern I found some time ago. It detects all hyperlinks; my version should detect only hyperlinks, not images.
http://domain.com/someimage.jpg
shouldn't be detected. But my version still matches part of an image URL, and I don't know how to solve this.
The original regex:
/(https?)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,10}(\/\S*)?/i
Link to my version:
http://regexr.com/38rv9
Please help. Thanks!

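You just need to require whitespace at the end. (In PHP, drop the g flag and use preg_match_all instead; preg_* does not accept g as a modifier.)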
/((https?)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,10}(\/(?:(\S(?!jpg|jpeg|png|gif))*))?)\s/ig

I would accomplish this by making sure what is clicked by the user does NOT end with an image file extension. You mention you are using php; have ONE condition statement that matches your original regex:
/(https?)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,10}(\/\S*)?/i
but does not match any common image file extension at the END of the expression:
/\.(jpg|jpeg|gif|png|tif)$/i
This would work for any text string that precedes the image file extension; preg_match will be useful to accomplish this.
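The two-step check described above could look roughly like this. It is only a sketch: the function name isNonImageLink is made up for illustration, the link pattern is anchored with ^ and $ on the assumption that you test one URL at a time, and the type hints assume PHP 7 or later. A URL with a query string after the extension (image.jpg?x=1) would still count as a link here.

```php
<?php
// Hypothetical helper: true when $url looks like a hyperlink but not an image.
function isNonImageLink(string $url): bool
{
    // The original pattern from the question, anchored so the whole string must match.
    $linkPattern  = '/^https?:\/\/[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,10}(\/\S*)?$/i';
    // A common image extension at the very end of the string.
    $imagePattern = '/\.(jpg|jpeg|gif|png|tif)$/i';

    return preg_match($linkPattern, $url) === 1
        && preg_match($imagePattern, $url) !== 1;
}

var_dump(isNonImageLink('http://domain.com/page.html'));     // bool(true)
var_dump(isNonImageLink('http://domain.com/someimage.jpg')); // bool(false)
```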

Related

Trying to stop regex at a tag

I know there are other posts with a similar name but I've looked through them and they haven't helped me resolve this.
I'm trying to get my head around regex and preg_match. I am going through a body of text and each time a link exists I want it to be extracted. I'm currently using the following:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
which works fine until it finds one that has <br after it. Then I get the url plus the <br which means it doesn't work correctly. How can I have it so that it stops at the < without including it?
Also, I have been looking everywhere for a clear explanation of using regex and I'm still confused by it. Has anyone any good guides on it for future reference?
\S* is too broad. In particular, I could inject into your code with a URL like:
http://hax.hax/"><script>alert('HAAAAAAAX!');</script>
You should only allow characters that are allowed in URLs:
[-A-Za-z0-9._~:/?#\[\]@!$&'()*+,;=]*
Some of these characters are only allowed in specific places (such as ?), so stricter validation needs a more elaborate pattern.
Instead of \S exclude the open tag char from the class:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[^<]*)?/";
You might even want to be more restrictive by only allowing characters valid in URLs:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/[a-zA-Z_\-\.%\?&]*)?/";
(or some more characters)
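A quick way to see the difference is to run the pattern over a sample string. This is a sketch with made-up sample text; note that it excludes whitespace as well as < from the path (an addition to the pattern above, which on its own would run on past spaces until the next <).

```php
<?php
// Made-up sample text containing a URL directly followed by a <br /> tag.
$text = 'See http://example.com/page<br />and ftp://files.example.org/a/b.txt for details.';

// Same idea as above, but the path stops at "<" and at whitespace.
$regExUrl = '/(?:https?|ftps?):\/\/[a-zA-Z0-9\-.]+\.[a-zA-Z]{2,3}(?:\/[^<\s]*)?/';

preg_match_all($regExUrl, $text, $matches);

// $matches[0] holds the full matches, without the trailing "<br".
print_r($matches[0]);
```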
You could use the one presented here:
http://regex101.com/r/zV1uI7
At the bottom of that page it is explained step by step.

Extracting links using regex via php

I need a little help with a regexp I'm writing. I have some text where links are encoded like
[NameLink ->link] or [NameLink->link]
The link can be without http:// or www.
Tried to get it using this pattern
/\[.{1,}\-\>.{1,}\]/
but if there are 2 such encoded links in a row then it doesn't separate them and takes also the content between two links. Can someone tell me what's the problem? Thank you
Use +? instead of {1,}. Also, have a read on greedy vs. nongreedy.
You may want to strip spaces using \s* around your .+?; this allows for both [NameLink -> link] and [NameLink ->link].
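Put together, that suggestion might look like the following sketch. The sample text and variable names are invented for illustration.

```php
<?php
// Made-up sample with two encoded links in a row.
$text = 'Read [Docs ->example.com/docs] and then [Home->www.example.com] today.';

// +? is lazy: each group takes as little as possible, so a match ends at the
// first "]" instead of the last one on the line.
// \s* on both sides of the arrow keeps surrounding spaces out of the captures.
preg_match_all('/\[(.+?)\s*->\s*(.+?)\]/', $text, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[1], ' => ', $m[2], PHP_EOL;
}
```

With the greedy {1,} version, the same input yields a single match running from the first [ to the last ].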

Looking for a regex to get the data stored within the angle <> brackets

OK, I tried to google it but couldn't find a solution, so I am asking here. I am trying to save HTML tags into a variable in PHP. I am trying to use preg_match but cannot find the right pattern (regex). I did find one regex, '\s*(.*?)\s*>\s*'. It works on the functions-online site where I tried it and gives me the whole tag, i.e. <body>, but when I run it in my program I get
preg_match(): Delimiter must not be alphanumeric or backslash
It would be helpful if anyone could sort out this issue, and even better if anyone could give the regex to get the data within the angle brackets (HTML tags).
Please also let me know if there is another method to store the html tags in php i.e.
<body>
then $var=body
RegEx match open tags except XHTML self-contained tags <-- Read the 1st answer if you are considering "parsing" HTML with regexes
You need to add so called delimiters: '/\s*(.*?)\s*>\s*/'
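As a small illustration of the delimiter rule: the first and last character of the pattern string (here /) mark where the regex starts and ends. The sketch below assumes the pattern is meant to begin with < and to capture only the tag name; the sample HTML is made up.

```php
<?php
$html = '<body class="home"><p>Hello</p></body>';

// "/" opens and closes the pattern; the "i" after the closing "/" is a modifier.
// A pattern string starting with a backslash or a letter triggers the
// "Delimiter must not be alphanumeric or backslash" warning.
preg_match_all('/<\s*([a-z][a-z0-9]*)[^>]*>/i', $html, $matches);

// $matches[0] holds the full opening tags, $matches[1] just the tag names.
print_r($matches[1]);
```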
OK, thanks to the link provided by killerx I found a regex that could be used. It is not the best method, but it should work for my task:
'\'<([a-z]+)[^>]*(?<!/)>\''
This should work. It puts the full tag in one array element and the tag name in the other.
Thanx a ton for helping me out

regular expression for replacing all links but css and js

I want to download a site and replace all links on that site with internal links.
that's easy:
$page=file_get_contents($url);
$local=$_SERVER['HTTP_HOST'].$_SERVER['PHP_SELF'];
$page=preg_replace('/href="(.+?)"/','href="http://'.$local.'?href=\\1"',$page);
but i want to exclude all css files and js files from replacing, so i tried:
$regex='/href="(.+?(?!(\.js|\.css)))"/';
$page=preg_replace($regex,'href="http://'.$local.'?href=\\1"',$page);
but that didn't work. What am I doing wrong? I thought ?! was a negative lookahead.
To answer your regex question, you need a lookbehind there and better limit the match with a character class:
$regex = '/href="([^"]+(?<!\.js|\.css))"/';
The charclass first matches the whole link content, then asserts that this didn't end in .js or .css.
You might want to augment the whole match with <a\s[^>]*? even, so it really just finds anything that looks like a link.
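Here is a small sketch of the lookbehind version in action. The $local value and the sample markup are placeholders, not taken from the question.

```php
<?php
$local = 'example.test/proxy.php'; // hypothetical stand-in for HTTP_HOST . PHP_SELF
$page  = '<a href="page.html">A</a> <link href="style.css"> <a href="app.js">B</a>';

// [^"]+ consumes the whole attribute value; the lookbehind then rejects the
// match if that value ends in .js or .css.
$regex = '/href="([^"]+(?<!\.js|\.css))"/';
$page  = preg_replace($regex, 'href="http://' . $local . '?href=$1"', $page);

echo $page, PHP_EOL;
```

Only the page.html link is rewritten; the .css and .js references are left alone.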
Another option would be using domdocument or querypath for such tasks, which is usually tedious and more code, but simpler to add programmatic conditions to:
htmlqp->find("a") FOREACH $a->attr("href", "http:/...".$a->attr("href"))
// would need a real foreach and an if and stuff..
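For comparison, a rough DOMDocument version of the same idea might look like this. It is a sketch only: it assumes the dom extension is available, uses a placeholder $local value and made-up markup, and URL-encodes the original href (which the regex version does not do).

```php
<?php
$local = 'example.test/proxy.php'; // hypothetical stand-in for HTTP_HOST . PHP_SELF
$html  = '<html><body><a href="page.html">A</a><a href="app.js">B</a></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    // Programmatic condition: leave .js and .css targets untouched.
    if (preg_match('/\.(js|css)$/i', $href)) {
        continue;
    }
    $a->setAttribute('href', 'http://' . $local . '?href=' . urlencode($href));
}

echo $doc->saveHTML();
```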

Complex PHP/Perl regular expression for emoticons

I've checked google for help on this subject but all the answers keep overlooking a fatal flaw in the replacement method.
Essentially I have a set of emoticons such as :) LocK :eek and so on and need to replace them with image tags. The problem I'm having is identifying that a particular emoticon is not part of a word and is alone on a line. For example, on our site we allow 'quick links', which are not included in the smiley replacement and take the format go:forum, user:Username and so on. Pretty much all answers I've read don't allow for this possibility and as such break these links (i.e. go<img src="image.gif" />orum). I've tried experimenting with different ways to get around this, checking for the start of the line, spaces/newline characters and so on, but I've not had much luck.
Any help with this problem would be greatly appreciated. Oh also I'm using PHP 5 and the preg_% functions.
Thanks,
Rupert S.
Edit 18/04/2011:
Thanks for your help peeps :) I have created the final regex, which I thought I'd share with everyone. I had a couple of problems with special space characters, including newline, but it's now working like a dream. The final regex is:
(?<=\s|\A|\n|\r|\t|\v|\<br \/\>|\<br\>)(:S)(?=\s|\Z|$|\n|\r|\t|\v|\<br \/\>|\<br\>)
To complete the comment into an answer: The simplest workaround would be to assert that the emoticons are always surrounded by whitespace.
(?<=\s|^)[<:-}]+(?=\s|$)
The \s covers normal spaces and line breaks. Just to be safe ^ and $ cover occurrences at the start or very end of the text subject. The assertions themselves do not match, so can be ignored in the replacement string/callback.
If you want to do all the replace in one single preg_replace, try this:
preg_replace('/(?<=^|\s)(:\)|:eek)(?=$|\s)/e'
,"'$1'==':)'?'<img src=\"smile.gif\"/>':('$1'==':eek'?'<img src=\"eek.gif\"/>':'$1')"
,$input);
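Note that the /e modifier was deprecated in PHP 5.5 and removed in PHP 7. A sketch of the same single-pass replacement with preg_replace_callback follows; the $smileys map and the sample input are made up for illustration.

```php
<?php
// Hypothetical emoticon-to-image map.
$smileys = [
    ':)'   => '<img src="smile.gif" />',
    ':eek' => '<img src="eek.gif" />',
];

// Quote each emoticon so characters like ")" are taken literally.
$alternatives = implode('|', array_map(function ($s) {
    return preg_quote($s, '/');
}, array_keys($smileys)));

// Same whitespace assertions as above, built around the quoted alternatives.
$pattern = '/(?<=^|\s)(' . $alternatives . ')(?=$|\s)/';

$input  = 'Hi :) see go:forum and :eek';
$output = preg_replace_callback($pattern, function ($m) use ($smileys) {
    // $m[1] is the emoticon that matched; look up its image tag.
    return $smileys[$m[1]];
}, $input);

echo $output, PHP_EOL;
```

The quick link go:forum is left alone because neither emoticon matches there, and an emoticon glued to a word fails the lookbehind.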
