Search and Replace URL - Regex?

Search and Replace URL - Regex? - php

I'm trying to create a WordPress shortcode (the WordPress part of it isn't that relevant) that will search within some specified text for a link and replace it with one that I specify. For example:
[scode]Click on this link[scode]
[scode]Click on this link[scode]
...will be changed to:
[scode]Click on this link[scode]
I'm trying to put together a function that will search for links and replace them with the one that I specify. Here's what I have right now:
// Adds [hide] shortcode for hiding content from non-registered users.
function hide_text( $atts,$content) {
if ( is_user_logged_in () ) {
return $content;
}
else {
$pattern = '(?<=href=("|\'))[^"\']+(?=("|\'))';
$newurl = "http://replacementurl.com";
$content = preg_replace($pattern,$newurl,$content);
echo $content;
}
}
add_shortcode( 'hide', 'hide_text' );
This just crashes the site, though. I'm not a PHP expert (much less an expert on regex), but are there at least any glaring irregularities in my code?
UPDATE:
I ran debug on the site and found out from the log that there was an extra } in there. Now the site isn't crashing, but the content being echoed is blank... Code updated above

There is syntax error in your pattern, change it to:
$pattern = "(?<=href=(\"|'))[^\"']+(?=(\"|'))";
Errors:
$pattern = "(?<=href=("|'))[^"']+(?=("|'))";
^-- ^--not escaped

http://replcaement url.com Pretty sure this is spelled incorrectly.
and there isn't an ; at the end of the line.
Looks like you've done the regex correctly for the most part you also need to escape some reserved characters look at #Akam's answer.
I suggest using preg quotes.
(?<=href=("|'))[^"']+(?=("|'))
Edit live on Debuggex

Related

using preg_replace() with wordpress filter the_content

I am trying to append a php query to the end of links from my website to a sponsor site using the_content filter and preg_replace(). I have tested my regex expression on regexr.com and it works. I have also used print_r() to test that the function is being called but for some reason the links are not being changed in practice. Here is the code I am having trouble with
add_filter('the_content', 'linkAppend');
function linkAppend($content) {
global $referalString;
preg_replace('/\/\/(www|launch)?\.?(solarwinds\.com)\/[^"]*/g','$&?cmp='.$referalString, $content);
return $content;
}
If someone could point me in the right direction, or let me know where I went wrong, I would be greatly appreciate it.

Turns out I had multiple problems, it was pointed out to me that I wasn't assigning the output of my preg_replace( ) to $content and then I found and solved issues with the global flag on the regular expression not being valid in php and me not accounting for links without a / after the .com. Final fix looks like this:
add_filter('the_content', 'linkAppend');
function linkAppend($content) {
global $referalString;
$content = preg_replace('/\/\/(www|launch)?\.?(solarwinds\.com)\/?[^"]*/m','$0?cmp='.$referalString, $content);
return $content;
}

URL detection problems

i'm currently having some problems with detecting urls and making them clickable.
Until now it always worked fine, probably because we always tested this with real urls, but now the website is live, we're having some problems.
This was the code we used to detect them before
$content = preg_replace('!(((f|ht)tp://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '$1', $content);
$content = eregi_replace('([[:space:]()[{}])(www.[-a-zA-Z0-9#:%_\+.~#?&//=]+)', '\\1\\2', $content);
It was doing a great job for normal urls, but some urls are giving problems:
- hk.linkedin.com
- www.test.com
- test.com
Also notice that some urls don't have http in fron of them.
I'm really not that good with regex, so I would very much appreciate it if somebody could help me figure this out.

What exactly you wanted to get. In this example, I can see blatant lack of understanding for regular expressions... but then, I see this exact code used in few codes according to Google Code Search. But those were made to find URLs in middle of text (not always what looks like URL is URL, but if it contains http:// or www it's sure that's URL.
Not everything needs to be done only using regular expressions. Those are helpful, but sometimes they make additional problems.
One of problems in regular expressions is that they don't have conditionals on result. You can use multiple regular expressions, but there is chance that something will be done wrongly (like affecting what previous regular expression has done). Just look at this. It assigns additional function (you can use e modifier, but it may make code unreadable).
<?php
$content = preg_replace_callback('{\b(?:(https?|ftp)://)?(\S+[.]\S+)\b}i',
'addHTTP', $content);
function addHTTP($matches) {
if(empty($matches[1])) {
return 'http://' . $matches[2] . '';
}
else {
return '' . $matches[2] . '';
}
}
Or two regular expressions (little harder to understand)...
$content = preg_replace('{\b(?:(?:https?|ftp)://)\S+[.]\S+\b}i',
'$0', $content);
$content = preg_replace('{\b(?<!["\'=><.])[-a-zA-Zа-яА-Яа-яА-Я()0-9#:%_+.~#?&;//=]+[.][-a-zA-Zа-яА-Яа-яА-Я()0-9#:%_+.~#?&;//=]+(?!["\'=><.])\b}i',
'http://$0', $content);
Also, you should avoid using target="". Users don't expect that new window will appear when clicking the link. After user will click such link he might wonder why "Go left" button doesn't work (hint: new window caused it to disappear). If somebody really wants to open link in new window he will do it yourself (it's not hard...).
Note that usually such stuff is linked with other helpers like this. For example, Stack Overflow uses some kind of Markdown modification which does more intelligent renaming, like changing plain text lists to HTML lists... But that all depends on what you need. If you only need processing links, you can try using those regexpes, but well...

Replace anchor text with PHP (and regular expression)

I have a string that contains a lot of links and I would like to adjust them before they are printed to screen:
I have something like the following:
replace_this
and would like to end up with something like this
replace this
Normally I would just use something like:
echo str_replace("_"," ",$url);
In in this case I can't do that as the URL contains underscores so it breaks my links, the thought was that I could use regular expression to get around this.
Any ideas?

Here's the regex: <a(.+?)>.+?<\/a>.
What I'm doing is preserving the important dynamic stuff within the anchor tag, and and replacing it with the following function:
preg_replace('/<a(.+?)>.+?<\/a>/i',"<a$1>REPLACE</a>",$url);

This will cover most cases, but I suggest you review to make sure that nothing unexpected was missed or changed.
pattern = "/_(?=[^>]*<)/";
preg_replace($pattern,"",$url);

You can use this regular expression
(>(.*)<\s*/)
along with preg_replace_callback .
EDIT :
$replaced_text = preg_replace_callback('~(>(.*)<\s*/)~g','uscore_replace', $text);
function uscore_replace($matches){
return str_replace('_','',$matches[1]); //try this with 1 as index if it fails try 0, I am not entirely sure
}

PHP: finding, replacing, shortening, and prettifying user links with <a> tags, ellipses, and link icons

When a user enters a URL, e.g. http://www.google.com, I would like to be able to parse that text using PHP, find any links, and replace them with <a> tags that include the original URL as an HREF.
In other words, http://www.google.com will become
http://www.google.com
I'd like to be able to do this for all URLs of these forms (with .com interchangeable with any TLD):
http://www.google.com
www.google.com
google.com
docs.google.com
What's the most performant way to do this? I could try writing some really fancy regex, but I doubt that's the best method available to me.
For bonus points, I'd also like to prepend http:// to any URL lacking it, and strip the display text itself down to something of the form http://www.google.com/reallyLongL... and display an external link icon afterwards.

Trying to find links in the format domain.com is going to be a pain in the butt. It would require keeping track of all TLDs and using them in the search.if you didnt the end of the last sentence i typed and the beginning of this sentence would be a link to http://search.if. Even if you did .in is a valid TLD and a common word.
I'd recommend telling your users they have to begin links with www. or http:// then write a simple regex to capture them and add the links.

www.google.com
This is not a URL, it's a hostname. It's generally not a good idea to start marking up bare hostnames in arbitrary text, because in the general case any word or sequence of dot-separated words is a perfectly valid hostname. That means you up with horrible hacks like looking for leading www. (and you'll get questions like “why can I link to www.stackoverflow.com but not stackoverflow.com?”) or trailing TLDs (which gets more and more impractical as more new TLDs are introduced; “why can I like to ncm.com but not ncm.museum?”), and you'll often mark up things that aren't supposed to be links.
I could try writing some really fancy regex
Well I can't see how you'd do it without regex.
The trick is coping with markup. If you can have <, & and " characters in the input, you mustn't let them into HTML output. If your input is plain text, you can do that by calling htmlspecialchars() before applying a simple replacement on a pattern like that in nico's answer.
(If the input already contains markup, you've got problems and you'd probably need an HTML parser to determine which bits are markup to avoid adding more markup inside of. Similarly, if you're doing more processing after this, inserting more tags, those steps are may have the same difficulty. In ‘bbcode’-like languages this often leads to bugs and security problems.)
Another problem is trailing punctuation. It's common for people to put a full stop, comma, close bracket, exclamation mark etc after a link, which aren't supposed to be part of the link but which are actually valid characters. It's useful to strip these off and not put them in the link. But then you break Wiki links that end in ), so maybe you want to not treat ) as a trailing character if there's a ( in the link, or something like that. This sort of thing can't be done in a simple regex replace, but you can in a replacement callback function.

HTML Purifier has a built-in linkify function to save you all the headaches.
It's other features are also simply too useful to pass up if you're dealing with any kind of user input that you also have to display.

Not so fancy regexps that should work
/\b(https?:\/\/[^\s+\"\<\>]+)/ig
/\b(www.[^\s+\"\<\>]+)/ig
Note that the last two would be impossible to do correctly as you cannot distinguish google.com from something like this.Where I finish one sentence and don't put a space after the full stop.
As for shortening the URLs, having your URL in $url:
if (strlen($url) > 20) // Or whatever length you like
{
$shortURL = substr($url, 0, 20)."…";
}
else
{
$shortURL = $url;
}
echo '<a href="'.$url.'" >'.$shortURL.'</a>';

From http://www.exorithm.com/algorithm/view/markup_urls
function markup_urls ($text)
{
// split the text into words
$words = preg_split('/([\s\n\r]+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
$text = "";
// iterate through the words
foreach($words as $word) {
// chopword = the portion of the word that will be replaced
$chopword = $word;
$chopword = preg_replace('/^[^A-Za-z0-9]*/', '', $chopword);
if ($chopword <> '') {
// linkword = the text that will replace chopword in the word
$linkword='';
// does it start with http://abc. ?
if (preg_match('/^(http:\/\/)[a-zA-Z0-9_]{2,}.*/', $chopword)) {
$chopword = preg_replace('/[^A-Za-z0-9\/]*$/', '', $chopword);
$linkword = ''.$chopword.'';
// does it equal abc.def.ghi ?
} else if (preg_match('/^[a-zA-Z]{2,}\.([a-zA-Z0-9_]+\.)+[a-zA-Z]{2,}(\/.*)?/', $chopword)) {
$chopword = preg_replace('/[^A-Za-z0-9\/]*$/', '', $chopword);
$linkword = ''.$chopword.'';
// does it start with abc#def.ghi ?
} else if (preg_match('/^[a-zA-Z0-9_\.]+\#([a-zA-Z0-9_]{2,}\.)+[a-zA-Z]{2,}.*/', $chopword)) {
$chopword = preg_replace('/[^A-Za-z0-9]*$/', '', $chopword);
$linkword = ''.$chopword.'';
}
// replace chopword with linkword in word (if linkword was set)
if ($linkword <> '') {
$word = str_replace($chopword, $linkword, $word);
}
}
// append the word
$text = $text.$word;
}
return $text;
}

I got this working exactly the way I want here:
<?php
$input = <<<EOF
http://www.example.com/
http://example.com
www.example.com
http://iamanextremely.com/long/link/so/I/will/be/trimmed/down/a/bit/so/i/dont/mess
/up/text/wrapping.html
EOF;
function trimlong($match)
{
$url = $match[0];
$display = $url;
if ( strlen($display) > 30 ) {
$display = substr($display,0,30)."...";
}
return ''.$display.' <img src="http://static.goalscdn.com/img/external-link.gif" height="10" width="11" />';
}
$output = preg_replace_callback('#(http://|www\\.)[^\\s<]+[^\\s<,.]#i',
array($this,'trimlong'),$input);
echo $output;

preg_replace() help in PHP

Consider this string
hello awesome <a href="" rel="external" title="so awesome is cool"> stuff stuff
What regex could I use to match any occurence of awesome which doesn't appear within the title attribute of the anchor?
So far, this is what I've came up with (it doesn't work sadly)
/[^."]*(awesome)[^."]*/i
Edit
I took Alan M's advice and used a regex to capture every word and send it to a callback. Thanks Alan M for your advice. Here is my final code.
$plantDetails = end($this->_model->getPlantById($plantId));
$botany = new Botany_Model();
$this->_botanyWords = $botany->getArray();
foreach($plantDetails as $key=>$detail) {
$detail = preg_replace_callback('/\b[a-z]+\b/iU', array($this, '_processBotanyWords'), $detail);
$plantDetails[$key] = $detail;
}
And the _processBotanyWords()...
private function _processBotanyWords($match) {
$botanyWords = $this->_botanyWords;
$word = $match[0];
if (array_key_exists($word, $botanyWords)) {
return '' . $word . '';
} else {
return $word;
}
}
Hope this well help someone else some day! Thanks again for all your answers.

This subject comes up pretty much every day here and basically the issue is this: you shouldn't be using regular expressions to parse or alter HTML (or XML). That's what HTML/XML parsers are for. The above problem is just one of the issues you'll face. You may get something that mostly works but there'll still be corner cases where it doesn't.
Just use an HTML parser.

Asssuming this is related to the question you posted and deleted a little while ago (that was you, wasn't it?), it's your fundamental approach that's wrong. You said you were generating these HTML links yourself by replacing words from a list of keywords. The trouble is that keywords farther down the list sometimes appear in the generated title attributes and get replaced by mistake--and now you're trying to fix the mistakes.
The underlying problem is that you're replacing each keyword using a separate call to preg_replace, effectively processing the entire text over and over again. What you should do is process the text once, matching every single word and looking it up in your list of keywords; if it's on the list, replace it. I'm not set up to write/test PHP code, but you probably want to use preg_replace_callback:
$text = preg_replace_callback('/\b[A-Za-z]+\b/', "the_callback", $text);
"the_callback" is the name of a function that looks up the word and, if it's in the list, generates the appropriate link; otherwise it returns the matched word. It may sound inefficient, processing every word like this, but in fact it's a great deal more efficient than your original approach.

Sure, using a parsing library is the industrial-strength solution, but we all have times were we just want to write something in 10 seconds and be done. Next time you want to process the meaty text of a page, ignoring tags, try just run your input through strip_tags first. This way you will get only the plain, visible text and your regex powers will again reign supreme.

This is so horrible I hesitate to post it, but if you want a quick hack, reverse the problem--instead of finding the stuff that isn't X, find the stuff that IS, change it, do the thing and change it back.
This is assuming you're trying to change awesome (to "wonderful"). If you're doing something else, adjust accordingly.
$string = 'Awesome is the man who <b>awesome</b> does and awesome is.';
$string = preg_replace('#(title\s*=\s*\"[^"]*?)awesome#is', "$1PIGDOG", $string);
$string = preg_replace('#awesome#is', 'wonderful', $string);
$string = preg_replace('#pigdog#is', 'awesome', $string);
Don't vote me down. I know it's hack.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Search and Replace URL - Regex? - php

There is syntax error in your pattern, change it to: $pattern = "(?<=href=(\"|'))[^\"']+(?=(\"|'))"; Errors: $pattern = "(?<=href=("|'))[^"']+(?=("|'))"; ^-- ^--not escaped

Related

using preg_replace() with wordpress filter the_content

URL detection problems

Replace anchor text with PHP (and regular expression)

PHP: finding, replacing, shortening, and prettifying user links with <a> tags, ellipses, and link icons

preg_replace() help in PHP

Categories

Resources