Automatically convert keywords to links in php


I am trying to convert specific keywords in a text, which are stored in an array, into links.
Example text:
$text='This text contains many keywords, but also formated keywords.'
So now I want to convert the word keywords into a link pointing to #keywords.
I used a very simple preg_replace call:
preg_replace('/keywords/i',' keywords ',$text);
but obviously it also converts the string that is already formatted as a link, so I get messy HTML like:
$text='This text contains many keywords, but also formated keywords" title="keywords">keywords</a>.'
Expected result:
$text='This text contains many keywords, but also formated keywords.'
Any suggestions?
THX
EDIT
We are one step away from the perfect function, but it still does not work well in this case:
$text='This text contains many keywords, but also formated
keywords.'
In this case it also replaces the word keywords inside the href, so we again get messy code like:
keywords.com/keywords" title="keywords">keywords</a>

I'm not great with regular expressions, but maybe this one will work:
/[^#>"]keywords/i
What I think it will do is ignore any instances of #keywords, >keywords, and "keywords and find the rest.
EDIT:
After testing it out, it looks like that replaces the space before the word as well, and doesn't work if keywords is the beginning of the string. It also didn't preserve original capitalization. I have tested this one, and it works perfectly for me:
$string = "Keywords and keywords, plus some more keywords with the original keywords.";
$string = preg_replace("/(?<![#>\"])keywords/i", "$0", $string);
echo $string;
The first three are replaced, preserving the original capitalization, and the last one is left untouched. This one uses a negative lookbehind and backreferences.
EDIT 2:
OP edited question. With the new example provided, the following regex will work:
$string = 'This text contains many keywords, but also formated keywords.';
$string = preg_replace("/(?<![#>\".\/])keywords/i", "$0", $string);
echo $string;
// outputs: This text contains many keywords, but also formated keywords.
This will replace all instances of keywords that are not preceded by #, >, ", ., or /.
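For illustration, here is a minimal sketch of the lookbehind approach with the link markup written out. The function name link_keyword and the exact anchor markup are assumptions (the markup in the original examples did not survive), and the keyword is assumed to be a plain word with no $ or backslash in it.

```php
<?php
// Hypothetical helper: link every occurrence of $keyword that is not
// directly preceded by #, >, ", . or / (a rough sign it sits inside a link).
function link_keyword(string $text, string $keyword): string
{
    $pattern = '/(?<![#>".\/])' . preg_quote($keyword, '/') . '/i';
    // $0 is the text exactly as matched, so the original capitalization is kept.
    return preg_replace($pattern, '<a href="#' . $keyword . '">$0</a>', $text);
}

$text = 'Keywords and keywords, plus <a href="#keywords">keywords</a>.';
echo link_keyword($text, 'keywords'), "\n";
```

The first two occurrences get wrapped and the already linked one is skipped. An occurrence inside an attribute that is preceded by a space (for example title="some keywords") would still be matched, so this only covers the simple cases.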

Here is the problem:
The keyword could be inside the href, the title, or the text of the link, and anywhere in there (like if the keyword was sanity and you already had href="insanity"). Or even worse, you could have a non-keyword link that happens to contain a keyword, something like:
Click here to find more keywords and such!
In the above example, even though it fits every other possible criteria (it's got spaces before and after being the easiest one to test for), it still would result in a link within a link, which I think breaks the internet.
Because of this, you need to use lookaheads and lookbehinds to check if the keyword is wrapped in a link. But there is one catch: in PCRE a lookbehind has to have a fixed length (meaning no open-ended wildcards such as * or +).
I thought I'd be the hero and show you the easy fix for your issue, which would be something to the effect of:
'/(?<!\<a.?>)[list|of|keywords](?!\<\/a>)/'
Except you can't do that because the lookbehind in this case has that wildcard. Without it, you end up with a super greedy expression.
So my proposed alternative is to use regex to find all link elements, then str_replace to swap them out with a placeholder, and then restore the original links from the placeholders at the end.
Here's how I did it:
$text='This text contains many keywords, but also formated keywords.';
$keywords = array('text', 'formatted', 'keywords');
// First, find every existing link element (non-greedy, so each match stops at the nearest </a>)
preg_match_all('/<a\b[^>]*>.*?<\/a>/is', $text, $links);
$links = array_unique($links[0]); // Cleaning up array for next step.
// Second, swap out all matches with a placeholder, and build restore array:
$restore_links = array(); // stays empty if the text contains no links
foreach($links as $count => $link) {
    $link_key = "xxx_{$count}_xxx";
    $restore_links[$link_key] = $link;
    $text = str_replace($link, $link_key, $text);
}
// Third, we build a nice replacement array for the keywords:
foreach($keywords as $keyword) {
$keyword_links[$keyword] = "<a href='#$keyword'>$keyword</a>";
}
// Merge the restore links to the bottom of the keyword links for one mass replacement:
$keyword_links = array_merge($keyword_links, $restore_links);
$text = str_replace(array_keys($keyword_links), $keyword_links, $text);
echo $text;
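As a rough sketch, the same placeholder idea can be wrapped in a function and run on a text that already contains a link. The function name and the sample markup are assumptions, since the original example's HTML did not survive. It also links the keywords in one regex pass rather than with sequential str_replace calls, so a keyword cannot match inside markup inserted for an earlier keyword.

```php
<?php
// Hypothetical helper: link keywords everywhere except inside existing <a> elements.
function link_keywords_outside_links(string $text, array $keywords): string
{
    if (!$keywords) {
        return $text; // nothing to link
    }

    // 1. Park every existing <a>...</a> element behind a placeholder.
    $restore = [];
    $text = preg_replace_callback('/<a\b[^>]*>.*?<\/a>/is', function ($m) use (&$restore) {
        $key = 'xxx_' . count($restore) . '_xxx';
        $restore[$key] = $m[0];
        return $key;
    }, $text);

    // 2. Link whole-word keywords in what is left, in a single pass.
    $alternatives = implode('|', array_map(fn($k) => preg_quote($k, '/'), $keywords));
    $text = preg_replace_callback('/\b(?:' . $alternatives . ')\b/i', function ($m) {
        return "<a href='#" . strtolower($m[0]) . "'>" . $m[0] . '</a>';
    }, $text);

    // 3. Put the original links back.
    return strtr($text, $restore);
}

$text = 'This text contains many keywords, but also '
      . '<a href="#keywords" title="keywords">formatted keywords</a>.';
echo link_keywords_outside_links($text, ['text', 'formatted', 'keywords']), "\n";
```

This assumes the text never contains the literal placeholder strings (xxx_0_xxx and so on) and that links are not nested. For anything beyond simple markup, an HTML parser is the safer tool.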

You can change your RegEx so that it only targets keywords with a space in front, since the already formatted keywords are not preceded by a space. Here is an example.
$text = preg_replace('/ keywords/i',' keywords',$text);

Related

How to use preg_replace to hide parts of title

I tried explaining this before, but I don't think anyone understood.
I have a lot of titles which are all different. What I'm trying to do is get rid of every "Lyrics" word in the title, and also everything that appears before the "-" symbol.
I managed to do the lyrics part with str_ireplace:
<?php echo str_ireplace('lyrics', '', get_the_title()); ?>
How can I do the second part, what can I apply to the code to make it so?
Example of what I want it to do:
"random title - more random title lyrics" turns to "more random title"
Once applied, the code would delete the "random title - " part from every single title on my website.
I was previously given this by someone here; I don't know if it helps:
$string = preg_replace("/[^-]+-(.*) Lyrics/", "$1", $string);

If you're trying to do it all in one regexp, that's probably asking too much. Why not break it up into two steps: remove everything up through - (and following spaces)
$string = preg_replace('/^[^-]*-\s*/', '', $string);
and then remove every "lyrics" (case insensitive), if any
$string = preg_replace('/lyrics/i', '', $string);
This can get more complicated -- what if there are no or multiple hyphens, what if there are no or multiple "lyrics", what do you want to do about spaces or punctuation after the hyphen and around "lyrics", can "lyrics" be part of another word, what about arbitrary capitalization of "lyrics", etc.? You need to make sure it works as expected in all these cases.
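Putting the two steps together, a minimal sketch could look like this. The function name clean_title is made up for the example, and matching "lyrics" only as a whole word plus tidying the leftover whitespace are extra choices of this sketch, not part of the steps above.

```php
<?php
// Hypothetical helper combining both steps.
function clean_title(string $title): string
{
    // Step 1: drop everything up to and including the first hyphen, plus any spaces after it.
    $title = preg_replace('/^[^-]*-\s*/', '', $title);
    // Step 2: remove the whole word "lyrics", case-insensitively.
    $title = preg_replace('/\blyrics\b/i', '', $title);
    // Tidy up: collapse runs of whitespace and trim the ends.
    return trim(preg_replace('/\s+/', ' ', $title));
}

echo clean_title('random title - more random title lyrics'), "\n";
```

A title without a hyphen is left alone by step 1, so only the word "lyrics" is removed in that case.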

Preg_replace cyrillic word or phrase - not both

I am writing a PHP function which is supposed to convert certain keywords into links. It uses Cyrillic words in UTF-8.
So I came up with this:
function keywords($text){
$keywords = Db::get('keywords'); //array with words and corresponding links
foreach ($keywords as $value){
$keyword = $value['keyword'];
$link = $value['link'];
$text = preg_replace('/(?<!\pL)('.preg_quote($keyword, '/').')(?!\pL)/iu', '<a href="'.$link.'">$1</a>', $text);
}
return $text;
}
So far this runs like a charm, but now I want to replace phrases with links, and those phrases may contain other keywords. For example, I want the word "car" to link to one place, and "blue car" to another.
Any ideas?

As written in the comment, I am posting this as an answer, hoping it is useful to you.
You could first replace each keyword in the text with a placeholder and then, once the entire text has been parsed, substitute the placeholders with the real linked words.
For example, take the phrase:
"I have a car, a blue car."
Assume the keyword list is already ordered from longest to shortest, so we check "blue car" first. We find it in the text, so we put in the placeholder and obtain:
"I have a car, a [[1]]."
The second keyword in the list is "car"; after substitution in the text, we obtain:
"I have a [[2]], a [[1]]."
Finally, when all keywords have been substituted, you only have to replace the placeholders in their order using the preg_replace in your function, and get the text with links.
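Here is a rough sketch of those steps. The function name, the keyword-to-URL array shape and the example URLs are assumptions; the [[n]] placeholders follow the example above. It hands out one placeholder per match so that each match keeps its original capitalization.

```php
<?php
// Hypothetical helper: $links maps keyword/phrase => URL.
function link_keywords(string $text, array $links): string
{
    // Longest keywords first, so "blue car" is claimed before "car".
    uksort($links, fn($a, $b) => mb_strlen($b) <=> mb_strlen($a));

    $restore = [];
    foreach ($links as $keyword => $url) {
        // Same letter-boundary check as in the question; /u keeps it UTF-8 safe.
        $pattern = '/(?<!\pL)' . preg_quote($keyword, '/') . '(?!\pL)/iu';
        $text = preg_replace_callback($pattern, function ($m) use (&$restore, $url) {
            $token = '[[' . (count($restore) + 1) . ']]';
            $restore[$token] = '<a href="' . htmlspecialchars($url) . '">' . $m[0] . '</a>';
            return $token; // later keywords cannot match inside a placeholder
        }, $text);
    }

    // Swap every placeholder for its link in one pass.
    return strtr($text, $restore);
}

echo link_keywords('I have a car, a blue car.', [
    'car'      => '/car',
    'blue car' => '/blue-car',
]), "\n";
```

This assumes the keywords consist of letters and spaces and that the text itself never contains something that looks like [[1]].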

Different results between preg_replace & preg_match_all

I have a forum that supports hashtags. I'm using the following line to convert all hashtags into links. I'm using the (^|\(|\s|>) pattern to avoid picking up named anchors in URLs.
$str=preg_replace("/(^|\(|\s|>)(#(\w+))/","$1$2",$str);
I'm using this line to pick up hashtags and store them in a separate field when the user posts their message. It picks up all hashtags EXCEPT those at the start of a new line.
preg_match_all("/(^|\(|\s|>)(#(\w+))/",$Content,$Matches);
Using the m & s modifiers doesn't make any difference. What am I doing wrong in the second instance?
Edit: the input text could be plain text or HTML. Example of problem input:
#startoftextreplacesandmatches #afterwhitespacereplacesandmatches <b>#insidehtmltagreplacesandmatches</b> :)
#startofnewlinereplacesbutdoesnotmatch :(

Your replace operation has a problem which you have evidently not yet come across - it will allow unescaped HTML special characters through. The reason I know this is because your regex allows hashtags to be prefixed with >, which is a special character.
For that reason, I recommend you use this code to do the replacement, which will double up as the code for extracting the tags to be inserted into the database:
$hashtags = array();
$expr = '/(?:(?:(^|[(>\s])#(\w+))|(?P<notag>.+?))/';
$str = preg_replace_callback($expr, function($matches) use (&$hashtags) {
if (isset($matches['notag']) && $matches['notag'] !== '') { // not empty(): a lone "0" counts as empty
// This takes care of HTML special characters outside hashtags
return htmlspecialchars($matches['notag']);
} else {
// Handle hashtags
$hashtags[] = $matches[2];
return htmlspecialchars($matches[1]).'#'.htmlspecialchars($matches[2]).'';
}
}, $str);
After the above code has been run, $str will contain the modified string, properly escaped for direct output, and $hashtags will be populated with all the tags matched.
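For reference, a self-contained sketch of the same one-pass idea with the link markup written out. The function name and the /tag/ URL scheme are made up for the example, and the input is assumed to be valid UTF-8 (the /u flag keeps multibyte characters whole before they reach htmlspecialchars).

```php
<?php
// Hypothetical helper: escapes the text, links hashtags, and collects them in $hashtags.
function link_hashtags(string $text, array &$hashtags): string
{
    $hashtags = [];
    // Either a hashtag after start-of-text, "(", ">" or whitespace (a newline counts),
    // or any other single character.
    $expr = '/(^|[(>\s])#(\w+)|(?P<other>.)/su';
    return preg_replace_callback($expr, function ($m) use (&$hashtags) {
        if (isset($m['other']) && $m['other'] !== '') {
            return htmlspecialchars($m['other']); // ordinary character: just escape it
        }
        $hashtags[] = $m[2];
        $tag = htmlspecialchars($m[2]);
        return htmlspecialchars($m[1]) . '<a href="/tag/' . $tag . '">#' . $tag . '</a>';
    }, $text);
}

$tags = [];
echo link_hashtags("#one (#two)\n#three", $tags), "\n";
print_r($tags);
```

Because the replacement and the collection happen in the same callback, the two can no longer disagree about which hashtags were found.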

How to ignore regex matches wrapped by a particular string?

I had a great idea for some functionality on a project and I've tried to implement it to the best of my ability but I need a little help achieving the desired effect. The page in question is: http://dev.favorcollective.com/guidelines/ (just to provide some context)
I'm using php's preg_replace to go through a particular page's contents (giant string) and I'm having it search for glossary terms and then I wrap the terms with a bit of html that enables dynamic glossary definition tooltips.
Here is my current code:
function annotate($content)
{
global $glossary_terms;
$search = array();
$replace = array();
$count=1;
foreach ($glossary_terms as $term):
array_push($search,'/\b('.preg_quote($term['term'],'/').')[?=a-zA-Z]*/i');
$id = "annotation-".$count;
$replacement = ''.$term['term'].'<span id="'.$id.'" style="display:none;"><span class="term">'.$term['term'].'</span><span class="definition">'.$term['def'].'</span></span>';
array_push($replace,(string)$replacement);
$count++;
endforeach;
return preg_replace($search, $replace, $content);
}
• But what if I want to ignore matches inside of <h#> </h#> tags?
• I also have a particular string that I do not want a specific term to match within. For example, I want the word "proficiency" to match any time it is NOT used in the context of "ACTFL Proficiency Guidelines" how would I go about adding exceptions to my regular expression? Is that even an option?
• Finally, how can I return the matched text as a variable? Currently when I match for a term ending in 's' or 'ing' (on purpose) my script prints the matched term rather than the original string that was matched (i.e. it's replacing "descriptions" with "description"). Is there any way to do that?

not a php guy (c#), but here goes. I assume that:
'/\b('.preg_quote($term['term'],'/').')[?=a-zA-Z]*/i' will map to this far more readable pattern:
/\b(ESCAPED_TERM)[?=a-zA-Z]*/i
so, as far as excluding <h#> type tags, regex is OK only if you can assume your data is the simple, non-nested case: <h#>TERM</h#>. If you can, you can use a negative lookahead assertion on the closing tag:
/\b(ESCAPED_TERM)(?!<\/h\d>)[?=a-zA-Z]*/i
you can use a lookahead with a lookbehind to handle your special case:
/\b(ESCAPED_TERM|(?<!ACTFL )Proficiency(?!\sGuidelines))(?!<\/h\d>)[?=a-zA-Z]*/i
note: if you have a bunch of these special cases, PHP's x modifier (PCRE_EXTENDED) makes the pattern ignore unescaped whitespace, which lets you put each token on its own line (literal spaces in the pattern then have to be written as \s or escaped).

Regular expressions are awesome, wonderful, magical. But everything has its limits.
That's why it's nice to have a language like PHP to provide the extra functionality. :)
Can you strip out headers with a non-greedy regexp?
$content = preg_replace('/<h[1-6]>.*?<\/h[1-6]>/sim', "", $content);
If non-greedy evaluations aren't working, what about just assuming that there won't be any other HTML inside your headers?
$content = preg_replace('/<h[1-6]>[^<]*<\/h[1-6]>/im', "", $content);
Also, you might want to use sprintf to simplify your replacement:
/*
1 get_bloginfo('url')
2 preg_replace( '/\s+/', '', $term['term']).
3 $id
4 $term['term']
5 $term['def']
*/
$rfmt = '%4$s<span id="%3$s" style="display:none;"><span class="term">%4$s</span><span class="definition">%5$s</span></span>';
...
$replacement = sprintf($rfmt, get_bloginfo('url'), preg_replace( '/\s+/', '', $term['term']), $id, $term['term'], $term['def'] );
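One way to leave the headers in place instead of stripping them is to split the content on header elements and annotate only the pieces in between. This is a sketch under assumptions: the function name is made up, the tooltip markup is reduced to a single span, and $0 in the replacement is what keeps the text exactly as it was matched ("Descriptions" stays "Descriptions").

```php
<?php
// Hypothetical helper: wrap glossary terms, but never inside <h1>..<h6> elements.
function annotate_outside_headers(string $content, array $terms): string
{
    if (!$terms) {
        return $content;
    }
    // With PREG_SPLIT_DELIM_CAPTURE the captured headers land at the odd indexes.
    $parts = preg_split('/(<h[1-6]\b[^>]*>.*?<\/h[1-6]>)/is', $content, -1, PREG_SPLIT_DELIM_CAPTURE);

    // One combined pattern, so a term cannot match inside markup added for another term.
    $alternatives = implode('|', array_map(fn($t) => preg_quote($t, '/'), $terms));
    $pattern = '/\b(?:' . $alternatives . ')[a-z]*/i';

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            continue; // a header: leave untouched
        }
        // $0 is the full matched text, including any trailing "s" or "ing".
        $parts[$i] = preg_replace($pattern, '<span class="term">$0</span>', $part);
    }
    return implode('', $parts);
}

$html = '<h2>Proficiency levels</h2><p>Descriptions of proficiency.</p>';
echo annotate_outside_headers($html, ['description', 'proficiency']), "\n";
```

Terms can still match inside tag attributes in the non-header parts, so this only holds for fairly plain markup.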

PHP preg_replace Expertise Sought

I'm creating some custom BBcode for a forum. I'm trying to get the regular expression right, but it has been eluding me for two days. Any expert advice is welcome.
The input (e.g. sample forum post):
[quote=Bob]I like Candace. She is nice.[/quote]
I agree, she is very nice. I like Ashley, too, and especially [Ryan] when he's drinking.
Essentially, I want to encase any names (from a specified list) in [user][/user] BBcode... except, of course, those being quoted, because doing that causes some terrible parsing errors. Below is an example of how I want the output to be.
The desired output:
[quote=Bob]I like [user]Candace[/user]. She is nice.[/quote]
I agree, she is very nice. I like [user]Ashley[/user], too, and especially [[user]Ryan[/user]] when he's drinking.
My current code:
$searchArray = array(
'/(?i)(Ashley|Bob|Candace|Ryan|Tim)/'
);
$replaceArray = array(
"[user]\\0[/user]"
);
$text = preg_replace($searchArray, $replaceArray, $input);
$input is of course set to the post contents (i.e. the first example listed above). How can I achieve the results I want? I don't want the regex to match when a name is preceded by an equals sign (=), but putting a [^=] in front of the names in the regex will make it match any non-equals sign character (i.e. spaces), which then messes up the formatting.
Update
The problem is that by using \1 instead of \0 it is omitting the first character before the names (because anything but = is matched). The output results in this:
[quote=Bob]I like[user]Candace[/user]. She is nice.[/quote]
I agree, she is very nice. I like[user]Ashley[/user], too, and especially [user]Ryan[/user]] when he's drinking.

You were on the right track with the [^=] idea. You can put it outside the capture group, and instead of \\0 which is the full match, use \\1 and \\2 i.e. the first & second capture groups
$searchArray = array(
'/(?i)([^=])(Ashley|Bob|Candace|Ryan|Tim)/'
);
$replaceArray = array(
"\\1[user]\\2[/user]"
);
$text = preg_replace($searchArray, $replaceArray, $input);
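A possible variant, offered as a sketch rather than the answer's method: a negative lookbehind (?<!=) checks for the equals sign without consuming a character, so a name at the very start of the post can still match and no extra capture group is needed. The \b word boundaries are an extra assumption of this sketch, added so that "Tim" does not match inside "sometimes". The function name tag_users is made up.

```php
<?php
// Hypothetical helper: wrap listed names in [user] tags unless preceded by "=".
function tag_users(string $input): string
{
    $pattern = '/(?<!=)\b(Ashley|Bob|Candace|Ryan|Tim)\b/i';
    return preg_replace($pattern, '[user]$1[/user]', $input);
}

$input = "[quote=Bob]I like Candace. She is nice.[/quote]\n"
       . "I agree, she is very nice. I like Ashley, too, and especially [Ryan] when he's drinking.";
echo tag_users($input), "\n";
```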
