I have the following code to convert links to hyperlinks in a string.
$text_block = preg_replace('$(\s|^)(https?://[a-z0-9_./?=&-]+)(?![^<>]*>)$i', ' $2 ',$text_block);
However, if the link has a period at the end, like "do a search on http://google.com.", the regex includes the period in the link. How can I change the regex above so that it does not include a trailing period if one is present?
EDIT: For clarification - the $text_block is a large block of text that may contain many links. The regex needs to parse through the block of text and find and convert all of them.
EDIT 2: As pointed out below in the comments, I guess you'd have to account for domains like ".co.uk". So I guess it would have to look for and remove the last period that is followed by whitespace, if present... gets tricky. Any ideas?
A not particularly elegant, but purely regex, solution is:
$(\s|^)(https?://[a-z0-9_./?=&-]+[a-z0-9_/?=&-])(?![^<>]*>)$i
This just ensures that the last character is any of the valid characters except the period.
PHP
$text_block = preg_replace('$(\s|^)(https?://[a-z0-9_./?=&-]+[a-z0-9_/?=&-])(?![^<>]*>)$i', ' $2 ',$text_block);
Working on RegExr
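As a rough illustration of that pattern applied to a block of text (a sketch only): the helper name linkify_block and the anchor markup in the replacement are assumptions, and ~ is used as the delimiter instead of $ purely for readability.

```php
<?php
// Hypothetical helper: wraps bare http(s) URLs in anchor tags.
function linkify_block(string $text_block): string
{
    // Same character classes as the pattern above. The final class omits
    // the period, so a matched URL can never end in ".".
    $pattern = '~(\s|^)(https?://[a-z0-9_./?=&-]+[a-z0-9_/?=&-])(?![^<>]*>)~i';

    // $1 puts back the leading whitespace that the pattern consumed.
    return preg_replace($pattern, '$1<a href="$2">$2</a>', $text_block);
}

echo linkify_block('do a search on http://google.com.'), "\n";
```

Periods inside the URL (as in .co.uk) are still allowed; only a period in the final position is left outside the link.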
Try this:
$output = rtrim($string, '.');
I'm parsing an external feed which contains location and date inside post title which I want to get rid of, so:
This happened on Date in Location
I need to find " on " (space, "on", space) and remove everything from there to the end of the line, and do the same for " in " (space, "in", space).
I googled a bit, but regex is really unfathomable for me so I'd appreciate any help.
Thanks!
Well, a literal "on" does match exactly. Then tell the regex engine to match everything after: ".*". (Note, that the . doesn't match newlines, so it works as needed.)
In the case of "in" you need an alternative, which is marked by parentheses () and the vertical bar |: "(on|in)". You could also make that a bit tighter with character classes []: "[oi]n".
With that we arrive at this regex:
/ [oi]n .*/
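A minimal sketch of that regex used with preg_replace; the helper name strip_location_suffix is made up for illustration.

```php
<?php
// Hypothetical helper: removes " on ..." or " in ..." through the end of the line.
function strip_location_suffix(string $title): string
{
    // " [oi]n " matches " on " or " in "; ".*" then consumes the rest of the
    // line (the dot does not match a newline by default).
    return preg_replace('/ [oi]n .*/', '', $title);
}

echo strip_location_suffix('This happened on Date in Location'), "\n"; // This happened
```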
To the end of the line? Then I suppose:
preg_replace("/(?:on|in).*?(\n|$)/", "", 'This happened on Date in Location');
Would do it.
Use a positive lookbehind if you want to remove everything after the on and in but not the on and in themselves.
(?<=\son\s).*
and
(?<=\sin\s).*
http://regexr.com?30ops
Here's what I have so far:
/(^|\s)(http:\/\/(\S+)(?!(.png|.gif|.jpg)($|\.\s|\.$|\s)))($|\.\s|\.$|\s)/i
And I'm replacing it like so:
'$1$2$6'
Sometimes, my users type something like this: http://google.com. <- How do I avoid including that final period without parsing out other periods that are in URLs?
Also, in case you're wondering what the .gif .png etc is for, I'm parsing out images to automatically create elements.
Edit:
This is for PHP.
This is for a forum where users post lots of things including links. It successfully handles every situation except for punctuation after the URL.
Edit 2:
Parse out might be the wrong word. I'm not trying to remove the punctuation, just separate it from the URL so I can display a working link to my users.
Edit 3:
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
I haven't tested it fully yet, but it seems to work. I'll make it a solution after I've tested. Or if someone else wants points, feel free to test and I'll vote for your solution.
So updated solution:
/\b(http:\/\/(\S+(?<!\.)(?=(?:$|\s|\.(?:$|\s)))))(?<!(?:\.(?:png|gif|jpg)))/i
See it here online on Regexr
I replaced your (^|\s) with \b, a word boundary, which is exactly what you want here.
I changed your (\S+) to (\S+(?<!\.)(?=(?:$|\s|\.(?:$|\s)))). Basically, I match non-whitespace characters until $|\s|\.(?:$|\s) is ahead and there is no dot on the left (the (?<!\.) part).
The lookaround that follows (the image-extension check) needs to be a lookbehind, because the URL has already been consumed at that point.
Then I cleaned up your brackets and alternations a bit and used some non-capturing groups (the groups that start with (?:).
So for your test string "users type something like this: http://google.com. <- How do I avoid", it will match http://google.com, with http://google.com in the first group and google.com in the second group.
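To see the updated pattern in action, here is a small sketch. The helper name link_non_image_urls and the anchor markup in the replacement are assumptions, not part of the original answer.

```php
<?php
// Hypothetical helper: links http:// URLs, leaving trailing periods and image URLs alone.
function link_non_image_urls(string $post): string
{
    // Pattern from the answer above: group 1 is the whole URL without a
    // trailing period; URLs ending in .png/.gif/.jpg are rejected by the
    // final negative lookbehind.
    $pattern = '/\b(http:\/\/(\S+(?<!\.)(?=(?:$|\s|\.(?:$|\s)))))(?<!(?:\.(?:png|gif|jpg)))/i';

    return preg_replace($pattern, '<a href="$1">$1</a>', $post);
}

echo link_non_image_urls('something like this: http://google.com. <- done'), "\n";
```

Image URLs are left untouched here so that a separate pass can turn them into image elements.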
PHP solution:
$line = 'http://www.google.com.';
echo preg_replace(
"/(\s*)((http:\/\/)?(\S+?(.png|.gif|.jpg)?))(\W*)$/i",
'$1$2$6',
$line), "\n";
I want to find any pattern matching: ###-##-####
and replace the ###-##, with ***-**
but leave the -####
I tried this below, but nothing is being replaced at all.
preg_replace('/(^[\d]{3})(-)([\d]{2})(-[\d]{4}$)/','\2\4',$myText);
Any help is appreciated
Update, here is my entire code string as it currently stands, after trying a few of the suggestions below. I am comparing the second echo output to the first... and the social numbers all remain the same.
Also, as was mentioned below, my string contains more than just a social... it is thousands of characters long, which I think is my real issue. Sorry if I didn't make that clear in the beginning.
//Make the CSC credit report request.
$strCscResponse = $Csc->makeRequest($strFixedFormatRecord);
echo "<br/><br/><pre>" . $strCscResponse . "</pre><br/><br/>";
$strCscResponse = str_replace("!", " ", $strCscResponse);
$strCscResponse = preg_replace('/^\d{3}-\d{2}(-\d{4})$/','***-**$1',$strCscResponse);
echo "<br/><br/><pre>" . $strCscResponse . "</pre><br/><br/>";
update
I'd like to mark all the answers as "the answer", just because I didn't clarify that the string has more than just a social in it. Thank you for the help with this issue; embarrassingly enough, it has been driving me wild for a couple of days now.
There is one possible problem: you might not be matching the right string. If you are trying to find SSNs buried in a large block of text, note that the ^ and $ anchors only match at the beginning and end of the string (or of each line, with the m modifier). To find SSNs inside a long string, you need to get rid of those anchors.
The other potential problem is that you seem to want to replace things with asterisks, but you do not include asterisks in your replacement expression. You need to use a replacement expression like
`***-**\4`
Try this regex:
(\d{3})(-)(\d{2})(-\d{4})
Try this:
preg_replace('/^\d{3}-\d{2}(-\d{4})$/','***-**$1',$myText);
You have ^ and $ in your pattern, but I see no m modifier, so this will only match if ###-##-#### is the entire string.
[\d] can be shortened to \d.
Your \2\4 will leave --####; if you wanted ***-**-####, you can simply use ***-**\4.
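Putting the points above together, a sketch for masking every SSN inside a long string might look like this. The helper name mask_ssns is made up, and the \b word boundaries are an assumption standing in for the removed ^ and $ anchors.

```php
<?php
// Hypothetical helper: masks the first five digits of every ###-##-#### in the text.
function mask_ssns(string $text): string
{
    // No ^ or $ anchors, so matches are found anywhere in the string.
    // \b keeps the pattern from matching inside a longer run of digits.
    return preg_replace('/\b\d{3}-\d{2}(-\d{4})\b/', '***-**$1', $text);
}

echo mask_ssns("SSN 123-45-6789, alt 987-65-4321."), "\n";
```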
I know I've seen this done in a lot of places, but I need something a little different from the norm. Sadly, when I search for this, it gets buried in posts about just turning the link into an HTML anchor tag. I want the PHP function to strip out the "http://" or "https://" from the link, as well as anything after the domain, so basically I am looking to turn A into B.
A: http://www.youtube.com/watch?v=spsnQWtsUFM
B: www.youtube.com
If it helps, here is my current PHP regex replace function.
ereg_replace("[[:alpha:]]+://[^<>[:space:]]+[[:alnum:]/]", "\\0", htmlspecialchars($body, ENT_QUOTES)));
It would probably also be helpful to say that I have absolutely no understanding in regular expressions. Thanks!
EDIT: When I entered a comment like this blahblah https://www.facebook.com/?sk=ff&ap=1 blah I get html like this<a class="bwl" href="blahblah https://www.facebook.com/?sk=ff&ap=1 blah">www.facebook.com</a> which doesn't work at all as it is taking the text around the link with it. It works great if someone only comments a link however. This is when I changed the function to this
preg_replace("#^(.*)//(.*)/(.*)$#",'<a class="bwl" href="\0">\2</a>', htmlspecialchars($body, ENT_QUOTES));
This is the simplest and cleanest way:
$str = 'http://www.youtube.com/watch?v=spsnQWtsUFM';
preg_match("#//(.+?)/#", $str, $matches);
$site_url = $matches[1];
EDIT: I assume that the $str had been checked to be a URL in the first place, so I left that out. Also, I assume that all the URLs will contain either 'http://' or 'https://'. In case the url is formatted like this www.youtube.com/watch?v=spsnQWtsUFM or even youtube.com/watch?v=spsnQWtsUFM, the above regexp won't work!
EDIT2: I'm sorry, I didn't realize that you were trying to replace all the URLs in a whole block of text. In that case, this should work the way you want it:
$str = preg_replace('#(\A|[^=\]\'"a-zA-Z0-9])(http[s]?://(.+?)/[^()<>\s]+)#i', '\\1\\3', $str);
I am not a regex whizz either,
^(.*)//(.*)/(.*)$
\2
was what worked for me when I tried to use as find and replace in programmer's notepad.
^(.*)// should extract the protocol - referred to as \1 in the second line.
(.*)/ should extract everything till the first / - referred to as \2 in the second line.
(.*)$ captures everything till the end of the string - referred to as \3 in the second line.
Added later
^(.*)( )(.*)//(.*)/(.*)( )(.*)$
\1\2\4 \7
This should be a bit better, but will only replace just 1 URL
The \0 is replaced by the entire matched string, whereas \x (where x is a number other than 0 starting at 1) will be replaced by each subpart of your matched string based on what you wrap in parentheses and the order those groups appear. Your solution is as follows:
ereg_replace("[[:alpha:]]+://([^<>[:space:]]+[:alnum:]*)[[:alnum:]/]", "\\1
I haven't been able to test this though so let me know if it works.
I think this should do it (I haven't tested it):
preg_match('/^http[s]?:\/\/(.+?)\/.*/i', $main_url, $matches);
$final_url = ''.$matches[1].'';
I'm surprised no one remembers PHP's parse_url function:
$url = 'http://www.youtube.com/watch?v=spsnQWtsUFM';
echo parse_url($url, PHP_URL_HOST); // displays "www.youtube.com"
I think you know what to do from there.
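One way to go from there, sketched under a few assumptions: the helper name linkify_with_host and the URL-matching pattern are made up for illustration, and the bwl class is taken from the markup in the question. Each URL found in the escaped text is replaced individually, so the surrounding text is not swallowed.

```php
<?php
// Hypothetical helper: links each http(s) URL, showing only its host as the link text.
function linkify_with_host(string $body): string
{
    // Escape first, as the question does, so user text cannot inject markup.
    $escaped = htmlspecialchars($body, ENT_QUOTES);

    return preg_replace_callback(
        '#\bhttps?://[^\s<>"]+#i',
        function (array $m): string {
            $host = parse_url($m[0], PHP_URL_HOST);
            if (!is_string($host) || $host === '') {
                return $m[0]; // could not parse a host: leave the text untouched
            }
            return '<a class="bwl" href="' . $m[0] . '">' . $host . '</a>';
        },
        $escaped
    );
}

echo linkify_with_host('blahblah https://www.facebook.com/?sk=ff&ap=1 blah'), "\n";
```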
$result = preg_replace('%(http[s]?://)(\S+)%', '\2', $subject);
The code with regex does not work completely.
I made this code. It is much more comprehensive, but it works:
See the result here: http://cht.dk/data/php-scripts/inc_functions_links.php
See the source code here: http://cht.dk/data/php-scripts/inc_functions_links.txt
I have a load of user-submitted content. It is HTML, and may contain URLs. Some of them will be <a>'s already (if the user is good) but sometimes users are lazy and just type www.something.com or at best http://www.something.com.
I can't find a decent regex to capture URLs but ignore ones that are immediately to the right of either a double quote or '>'. Anyone got one?
Jan Goyvaerts, creator of RegexBuddy, has written a response to Jeff Atwood's blog that addresses the issues Jeff had and provides a nice solution.
\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]
In order to ignore matches that occur right next to a " or >, you could add (?<![">]) to the start of the regex, so you get
(?<![">])\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]
This will match full addresses (http://...) and addresses that start with www. or ftp. - you're out of luck with addresses like ars.userfriendly.org...
This thread is old as the hills, but I came across it while working on my own problem: That is, convert any urls into links, but leave alone any that are already within anchor tags. After a while, this is what has popped out:
(?!(?!.*?<a)[^<]*<\/a>)(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@/%=~_|$?!:,.]*[A-Z0-9+&@/%=~_|$]
With the following input:
http://www.google.com
http://google.com
www.google.com
<p>http://www.google.com<p>
this is a normal sentence. let's hope it's ok.
www.google.com
This is the output of a preg_replace:
http://www.google.com
http://google.com
www.google.com
<p>http://www.google.com<p>
this is a normal sentence. let's hope it's ok.
www.google.com
Just wanted to contribute back to save somebody some time.
I made a slight modification to the Regex contained in the original answer:
(?<![.*">])\b(?:(?:https?|ftp|file)://|[a-z]\.)[-A-Z0-9+&@/%=~_|$?!:,.]*[A-Z0-9+&@/%=~_|$]
which allows for more subdomains, and also runs a more full check on tags. To apply this to PHP's preg replace, you can use:
$convertedText = preg_replace( '#(?<![.*">])\b(?:(?:https?|ftp|file)://|[a-z]\.)[-A-Z0-9+&@/%=~_|$?!:,.]*[A-Z0-9+&@/%=~_|$]#i', '\0', $originalText );
Note, I removed # from the regex, in order to use it as a delimiter for preg_replace. It's pretty rare that # would be used in a URL anyway.
Obviously, you can modify the replacement text, and remove target="_blank", or add rel="nofollow" etc.
Hope that helps.
To skip existing ones just use a look-behind - add (?<!href=") to the beginning of your regular expression, so it would look something like this:
/(?<!href=")http:\/\/\S*/
Obviously this isn't a complete solution for finding all types of URLs, but this should solve your problem of messing with existing ones.
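A small sketch of the look-behind idea in PHP. The helper name link_unlinked_urls, the ~ delimiter, and the tighter [^\s<>"]+ URL body are assumptions made for the example.

```php
<?php
// Hypothetical helper: links bare URLs but skips ones already inside href="...".
function link_unlinked_urls(string $html): string
{
    // (?<!href=") rejects a URL that directly follows href=" ;
    // [^\s<>"]+ stops at whitespace, tag brackets and quotes.
    return preg_replace(
        '~(?<!href=")\bhttps?://[^\s<>"]+~i',
        '<a href="$0">$0</a>',
        $html
    );
}

echo link_unlinked_urls('<a href="http://a.com/">site</a> and http://b.com/page'), "\n";
```

Note that a URL used as the visible text of an existing link is not protected by this look-behind, so it would still be wrapped a second time.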
if (preg_match('/\b(?<!=")(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[A-Z0-9+&@#\/%=~_|](?!.*".*>)(?!.*<\/a>)/i', $subject)) {
# Successful match
} else {
# Match attempt failed
}
Shameless plug: You can look here (regular expression replace a word by a link) for inspiration.
The question asked to replace some word with a certain link, unless there already was a link. So the problem you have is more or less the same thing.
All you need is a regex that matches a URL (in place of the word). The simplest assumption would be like this: a URL (optionally) starts with "http://", "ftp://" or "mailto:" and lasts as long as there are no white-space characters, line breaks, tag brackets or quotes.
Beware, long regex ahead. Apply case-insensitively.
(href\s*=\s*['"]?)?((?:http://|ftp://|mailto:)?[^.,<>"'\s\r\n\t]+(?:\.(?![.<>"'\s\r\n])[^.,!<>"'\s\r\n\t]+)+)
Be warned - this will also match URLs that are technically invalid, and it will recognize things.formatted.like.this as a URL. Whether it is too lenient depends on your data. I can fine-tune the regex if you have examples where it returns false positives.
The regex will produce two match groups. Group 2 will contain the matched thing, which is most likely a URL. Group 1 will either contain an empty string or an 'href="'. You can use it as an indicator that this match occurred inside an href parameter of an existing link, so you don't have to touch that one.
Once you confirm that this does the right thing for you most of the time (with user supplied data, you can never be sure), you can do the rest in two steps, as I proposed it in the other question:
Make a link around every URL there is (unless there is something in match group 1!) This will produce double nested <a> tags for things that have a link already.
Scan for incorrectly nested <a> tags, removing the innermost one
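The first of those two steps can be sketched as follows. The helper name wrap_unlinked_urls, the ~ delimiter, and the choice to prefix scheme-less matches with http:// are assumptions; the second step (cleaning up nested tags) is not covered here.

```php
<?php
// Hypothetical helper: wraps URL-like text in anchors unless it sits in an href.
function wrap_unlinked_urls(string $html): string
{
    // The regex from above, split across two lines for readability.
    $pattern = '~(href\s*=\s*[\'"]?)?((?:http://|ftp://|mailto:)?[^.,<>"\'\s\r\n\t]+'
             . '(?:\.(?![.<>"\'\s\r\n])[^.,!<>"\'\s\r\n\t]+)+)~i';

    return preg_replace_callback($pattern, function (array $m): string {
        if ($m[1] !== '') {
            return $m[0]; // group 1 matched: already inside an href, leave it alone
        }
        $url = $m[2];
        // Assumption: treat matches without a scheme (www.example.com) as http.
        $href = preg_match('~^(?:http://|ftp://|mailto:)~i', $url) ? $url : 'http://' . $url;
        return '<a href="' . $href . '">' . $url . '</a>';
    }, $html);
}

echo wrap_unlinked_urls('See www.example.com and <a href="http://example.org/x">this</a>'), "\n";
```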