How to ignore regex matches wrapped by a particular string? - php

I had a great idea for some functionality on a project and I've tried to implement it to the best of my ability but I need a little help achieving the desired effect. The page in question is: http://dev.favorcollective.com/guidelines/ (just to provide some context)
I'm using PHP's preg_replace to go through a particular page's contents (one giant string), search for glossary terms, and wrap each term in a bit of HTML that enables dynamic glossary-definition tooltips.
Here is my current code:
function annotate($content)
{
    global $glossary_terms;
    $search = array();
    $replace = array();
    $count=1;
    foreach ($glossary_terms as $term):
        array_push($search,'/\b('.preg_quote($term['term'],'/').')[?=a-zA-Z]*/i');
        $id = "annotation-".$count;
        $replacement = ''.$term['term'].'<span id="'.$id.'" style="display:none;"><span class="term">'.$term['term'].'</span><span class="definition">'.$term['def'].'</span></span>';
        array_push($replace,(string)$replacement);
        $count++;
    endforeach;
    return preg_replace($search, $replace, $content);
}
• But what if I want to ignore matches inside of <h#> </h#> tags?
• I also have a particular string within which I do not want a specific term to match. For example, I want the word "proficiency" to match any time it is NOT used in the context of "ACTFL Proficiency Guidelines". How would I go about adding exceptions to my regular expression? Is that even an option?
• Finally, how can I return the matched text as a variable? Currently, when I match a term ending in 's' or 'ing' (on purpose), my script prints the glossary term rather than the original string that was matched (i.e. it's replacing "descriptions" with "description"). Is there any way to do that?

not a php guy (c#), but here goes. I assume that:
'/\b('.preg_quote($term['term'],'/').')[?=a-zA-Z]*/i' will map to this far more readable pattern:
/\b(ESCAPED_TERM)[?=a-zA-Z]*/i
As far as excluding <h#> tags goes, regex is OK only if you can assume your data is the simple, non-nested case: <h#>TERM</h#>. If you can, use a negative lookahead assertion for the closing tag:
/\b(ESCAPED_TERM)[?=a-zA-Z]*+(?!<\/h\d>)/i
The lookahead goes after the suffix, and the possessive *+ keeps the engine from giving back part of the suffix just to satisfy the lookahead.
You can use a lookahead with a lookbehind to handle your special case:
/\b(ESCAPED_TERM|(?<!ACTFL )Proficiency(?!\sGuidelines))[?=a-zA-Z]*+(?!<\/h\d>)/i
Note: if you have a bunch of these special cases, PHP's x modifier (PCRE_EXTENDED) ignores whitespace in the pattern, which lets you put each token on its own line. Literal spaces then have to be escaped, as in ACTFL\ .
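A minimal sketch of how these pieces might fit together in PHP, and of how $0 in the replacement answers the third question by reusing the text that actually matched. The term list, the sample sentence, and the span markup are assumptions made up for illustration, and the sketch assumes single-word terms (under the x modifier a space inside a term would be ignored).

```php
<?php
// Hypothetical sample data; not taken from the question's glossary.
$terms = ['description', 'proficiency'];
$text  = 'Level descriptions follow the ACTFL Proficiency Guidelines; proficiency varies.';

foreach ($terms as $term) {
    // The x modifier allows whitespace and # comments inside the pattern.
    $pattern = '/
        \b
        (?<!ACTFL\ )        # skip a term that directly follows ACTFL
        ' . preg_quote($term, '/') . '
        [a-z]*+             # optional suffix, matched possessively
    /ix';

    // $0 is the full matched text, so a plural keeps its ending.
    $text = preg_replace($pattern, '<span class="term">$0</span>', $text);
}

echo $text, "\n";
```

Here the lookbehind applies to every term, which is harmless for this sample; a real glossary would probably attach it only to the term that needs the exception.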

Regular expressions are awesome, wonderful, magical. But everything has its limits.
That's why it's nice to have a language like PHP to provide the extra functionality. :)
Can you strip out headers with a non-greedy regexp?
$content = preg_replace('/<h[1-6]>.*?<\/h[1-6]>/sim', "", $content);
If non-greedy evaluations aren't working, what about just assuming that there won't be any other HTML inside your headers?
$content = preg_replace('/<h[1-6]>[^<]*<\/h[1-6]>/im', "", $content);
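If the headers have to stay in the output, one alternative is to let the regex consume them and then discard that match with the PCRE verbs (*SKIP)(*FAIL). This is only a sketch: the sample markup, the term, and the em wrapper are assumptions, and it still relies on headings not being nested.

```php
<?php
// Assumed sample input, made up for illustration.
$content = '<h2>Proficiency levels</h2><p>Each proficiency level has a description.</p>';

// Alternative 1 consumes a whole heading, then (*SKIP)(*FAIL) rejects that
// match and tells the engine not to retry inside it.
// Alternative 2 is the text we actually want to annotate.
$pattern = '~<h[1-6]\b[^>]*>.*?</h[1-6]>(*SKIP)(*FAIL)|\bproficiency[a-z]*~is';

$result = preg_replace($pattern, '<em>$0</em>', $content);

echo $result, "\n";
// <h2>Proficiency levels</h2><p>Each <em>proficiency</em> level has a description.</p>
```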
Also, you might want to use sprintf to simplify your replacement:
/*
1 get_bloginfo('url')
2 preg_replace( '/\s+/', '', $term['term']).
3 $id
4 $term['term']
5 $term['def']
*/
$rfmt = '%4$s<span id="%3$s" style="display:none;"><span class="term">%4$s</span><span class="definition">%5$s</span></span>';
...
$replacement = sprintf($rfmt, get_bloginfo('url'), preg_replace( '/\s+/', '', $term['term']), $id, $term['term'], $term['def'] );

Related

How do i replace an expression with metacharacters using preg_replace in PHP?

I've tried several ways of replacing an excerpt of text that contains metacharacters with another text that contains metacharacters in PHP. I know it seems very simple, but I've tried many times and couldn't get it to work.
pattern: $Sel = 'anyvalues';
replacement: $Sel = 'one other';
subject: <?php $Sel = 'anyvalues';
I tried the addslashes() and preg_quote() functions in various orders, but it doesn't work. I need it to work as shown rather than through a workaround.
Each of these values lives in a different file: the pattern in one file, the replacement in another, and the subject in a third.
Have you wrapped your pattern in delimiters?
$pattern = '$Sel = "anyvalues_containing_delimiter_/";';
$replacement = '$Sel = "one other";';
$subject = '<?php $Sel = "anyvalues_containing_delimiter_/";';
echo preg_replace(sprintf('/%s/', preg_quote($pattern, '/')), $replacement, $subject);
The contents of the pattern are the only ones that need escaping, hence the preg_quote call. Its second argument is the delimiter the pattern is wrapped in (which is what sprintf does here), so any occurrences of the delimiter within the pattern are escaped as well.
Edit: I've modified the snippet to cover occurrences of the delimiter within file contents.
Second edit: moved code from linked snippet into the answer itself.
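One case the snippet above does not cover is a replacement that contains a dollar sign followed by a digit, or a backslash, since preg_replace reads those as backreferences. A hedged sketch of one way to handle that, with made-up values standing in for the three files:

```php
<?php
// Hypothetical values; in the question each one comes from a separate file.
$pattern     = '$Sel = "any/values";';
$replacement = '$Sel = "costs $10";';
$subject     = '<?php $Sel = "any/values";';

// preg_quote() escapes regex metacharacters plus the chosen delimiter.
$regex = '/' . preg_quote($pattern, '/') . '/';

// Escape backslashes and dollar signs so the replacement stays literal
// ("$10" would otherwise be read as a reference to group 10).
$safeReplacement = strtr($replacement, ['\\' => '\\\\', '$' => '\\$']);

$viaRegex = preg_replace($regex, $safeReplacement, $subject);

// For a purely literal search, str_replace() needs no escaping at all.
$viaStrReplace = str_replace($pattern, $replacement, $subject);

echo $viaRegex, "\n";
```

When nothing in the search text is meant to be a pattern, str_replace is probably the simpler choice.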

Searching page and replacing some elements

I have 2 sets of tags on page, first is
{tip}tooltip text{/tip}
and second is
{tip class="someClass"}tooltip text{/tip}
I need to replace those with
<span class="tooltip"><i>?</i><em>tooltip text</em></span>
I don't know how to deal with adding a new class to the <span> tag. (The tooltip class is always present.)
This is my regex /\{tip.*?(?:class="([a-z]+)")?\}(.*?)\{\/tip\}/.
I guess I need to check array indexes for class value, but those are different, depending on {tip} tag version. Do I need two regular expressions, one for each version, or there is some way to extract and replace class value?
php code:
$regex = "/\{tip.*?(?:class=\"([a-z]+)\")?\}(.*?)\{\/tip\}/";
$matches = null;
preg_match_all($regex, $article->text, $matches);
if (is_array($matches)) {
foreach ($matches as $match) {
$article->text = preg_replace(
$regex,
"<span class=tooltip \$1"."><i>?</i><em>"."\$2"."</em></span>",
$article->text
);
}
}
Here's your answer (I've also made it a bit more robust):
{tip(?:\s+class\s*=\s*"([a-zA-Z\s]+)")?}([^{]*){\/tip}
PCRE (which PHP uses, if memory serves) will automatically pick up that the first capture group (which grabs the classes) is empty in the first case, and just substitute the empty string in the replacement. The second case is self-explanatory.
Your replacement code, then, will look like this:
$article->text = preg_replace(
'/{tip(?:\s+class\s*=\s*"([a-zA-Z\s]+)")?}([^{]*){\/tip}/',
'<span class="tooltip $1"><i>?</i><em>$2</em></span>',
$article->text
);
You don't need to check whether the regex matches beforehand. That's implied by preg_replace, which performs the regex match and then replaces any text matched by the pattern with the replacement. If there are no matches, no replacement occurs.
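When the class attribute is absent, a plain replacement string leaves a trailing space inside the class value. If that matters, a callback can build the attribute instead. A sketch under assumed input (the sample sentence and the class name "wide" are made up):

```php
<?php
// Assumed sample text containing both tag variants from the question.
$text = 'A {tip}plain{/tip} and a {tip class="wide"}styled{/tip} tip.';

$result = preg_replace_callback(
    '/\{tip(?:\s+class\s*=\s*"([a-zA-Z\s]+)")?\}([^{]*)\{\/tip\}/',
    function (array $m): string {
        // $m[1] is an empty string when no class attribute was given.
        $classes = trim('tooltip ' . $m[1]);
        return '<span class="' . $classes . '"><i>?</i><em>' . $m[2] . '</em></span>';
    },
    $text
);

echo $result, "\n";
```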

Replace from one custom string to another custom string

How can I replace a string starting with 'a' and ending with 'z'?
basically I want to be able to do the same thing as str_replace but be indifferent to the values in between two strings in a 'haystack'.
Is there a built in function for this? If not, how would i go about efficiently making a function that accomplishes it?
That can be done with Regular Expression (RegEx for short).
Here is a simple example:
$string = 'coolAfrackZInLife';
$replacement = 'Stuff';
$result = preg_replace('/A.*Z/', $replacement, $string);
echo $result;
The above example will print coolStuffInLife
A little explanation of the given RegEx /A.*Z/:
- The slashes indicate the beginning and end of the RegEx;
- A and Z are the start and end characters between which you need to replace;
- . matches any single character (except a newline);
- * repeats the preceding token zero or more times (in our case, any run of characters);
- You can use + instead of *, which will match only if there is at least one character in between.
Take a look at Rubular.com for a simple way to test your RegExes. It also provides a short RegEx reference.
$string = "I really want to replace aFGHJKz with booo";
$new_string = preg_replace('/a[a-zA-Z]+z/', 'boo', $string);
echo $new_string;
Be wary of the regex, are you wanting to find the first z or last z? Is it only letters that can be between? Alphanumeric? There are various scenarios you'd need to explain before I could expand on the regex.
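To make the "first z or last z" question concrete, here is a small sketch comparing a greedy quantifier, a lazy one, and a letters-only character class. The sample string is an assumption chosen so that it contains two candidate endings.

```php
<?php
// Assumed sample string with two possible closing "z" characters.
$string = 'xx aBCz -- aDEz yy';

// Greedy: .* runs as far as it can, so the match ends at the LAST z.
$greedy = preg_replace('/a.*z/', 'boo', $string);

// Lazy: .*? stops at the FIRST z after each a.
$lazy = preg_replace('/a.*?z/', 'boo', $string);

// Restricting what may sit in between (uppercase letters only here).
$letters = preg_replace('/a[A-Z]*z/', 'boo', $string);

echo $greedy, "\n";   // xx boo yy
echo $lazy, "\n";     // xx boo -- boo yy
echo $letters, "\n";  // xx boo -- boo yy
```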
use preg_replace so you can use regex patterns.

Automatically convert keywords to links in php

I am trying to convert specific keywords in text, which are stored in array, to the links.
Example text:
$text='This text contains many keywords, but also formated keywords.'
So now I want to convert the word keywords to the #keywords.
I used the very simple preg_replace function
preg_replace('/keywords/i',' keywords ',$text);
but obviously it converts to link also the string already formatted as a link, so I get a messy html like:
$text='This text contains many keywords, but also formated keywords" title="keywords">keywords</a>.'
Expected result:
$text='This text contains many keywords, but also formated keywords.'
Any suggestions?
THX
EDIT
We are one step from the perfect function, but still not working well in this case:
$text='This text contains many keywords, but also formated
keywords.'
In this case it replaces also the word keywords in the href, so we again get the messy code like
keywords.com/keywords" title="keywords">keywords</a>
I'm not great with regular expressions, but maybe this one will work:
/[^#>"]keywords/i
What I think it will do is ignore any instances of #keywords, >keywords, and "keywords and find the rest.
EDIT:
After testing it out, it looks like that replaces the space before the word as well, and doesn't work if keywords is the beginning of the string. It also didn't preserve original capitalization. I have tested this one, and it works perfectly for me:
$string = "Keywords and keywords, plus some more keywords with the original keywords.";
$string = preg_replace("/(?<![#>\"])keywords/i", "$0", $string);
echo $string;
The first three are replaced, preserving the original capitalization, and the last one is left untouched. This one uses a negative lookbehind and backreferences.
EDIT 2:
OP edited question. With the new example provided, the following regex will work:
$string = 'This text contains many keywords, but also formated keywords.';
$string = preg_replace("/(?<![#>\".\/])keywords/i", "$0", $string);
echo $string;
// outputs: This text contains many keywords, but also formated keywords.
This will replace all instances of keywords that are not preceded by #, >, ", ., or /.
Here is the problem:
The keyword could be inside the href, the title, or the text of the link, and anywhere in there (like if the keyword was sanity and you already had href="insanity"). Or even worse, you could have a non-keyword link that happens to contain a keyword, something like:
Click here to find more keywords and such!
In the above example, even though it fits every other possible criteria (it's got spaces before and after being the easiest one to test for), it still would result in a link within a link, which I think breaks the internet.
Because of this, you need lookaheads and lookbehinds to check whether the keyword is wrapped in a link. But there is one catch: a lookbehind cannot contain unbounded quantifiers such as * or + (meaning no wildcards).
I thought I'd be the hero and show you the easy fix for your issue, which would be something to the effect of:
'/(?<!\<a.*?>)(?:list|of|keywords)(?!\<\/a>)/'
Except you can't do that because the lookbehind in this case has that wildcard. Without it, you end up with a super greedy expression.
So my proposed alternative is to use regex to find all link elements, then str_replace to swap them out with placeholders, and then restore the original links at the end.
Here's how I did it:
$text='This text contains many keywords, but also formated keywords.';
$keywords = array('text', 'formatted', 'keywords');
$restore_links = array();
// First, find every existing link element. The lazy .*? stops at the nearest </a>,
// so two links are never merged into one match. Links without a keyword are
// protected too, which is harmless.
preg_match_all('/<a\b.*?<\/a>/is', $text, $links);
$links = array_unique($links[0]); // Cleaning up array for next step.
// Second, swap out all matches with a placeholder, and build restore array:
foreach($links as $count => $link) {
    $link_key = "xxx_{$count}_xxx";
    $restore_links[$link_key] = $link;
    $text = str_replace($link, $link_key, $text);
}
// Third, we build a nice replacement array for the keywords:
foreach($keywords as $keyword) {
    $keyword_links[$keyword] = "<a href='#$keyword'>$keyword</a>";
}
// Merge the restore links to the bottom of the keyword links for one mass replacement:
$keyword_links = array_merge($keyword_links, $restore_links);
$text = str_replace(array_keys($keyword_links), $keyword_links, $text);
echo $text;
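A variation on the same idea that avoids placeholders: split the text at existing links and only touch the pieces in between. This is a sketch, not the answer's method; the sample text, the example.com URL, and the #keyword link format are assumptions.

```php
<?php
// Assumed sample input: one bare keyword and one that is already linked.
$text = 'Many keywords here, plus <a href="http://example.com/keywords" title="keywords">keywords</a>.';
$keywords = ['keywords'];

// With PREG_SPLIT_DELIM_CAPTURE the pieces alternate:
// even indexes are plain text, odd indexes are complete <a>...</a> elements.
$parts = preg_split('~(<a\b[^>]*>.*?</a>)~is', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

// Build a real alternation group, escaping each keyword for the ~ delimiter.
$alternation = implode('|', array_map(function ($k) {
    return preg_quote($k, '~');
}, $keywords));

foreach ($parts as $i => $part) {
    if ($i % 2 === 0) {
        // Only plain-text pieces are linked; $0 keeps the original capitalization.
        $parts[$i] = preg_replace('~\b(?:' . $alternation . ')\b~i', '<a href="#$0">$0</a>', $part);
    }
}

$result = implode('', $parts);
echo $result, "\n";
```

Because the existing link is never passed to the keyword regex, neither its href, its title, nor its text can be rewritten.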
You can change your RegEx so that it only targets keywords with a space in front, since the formatted keywords do not contain a space. Here is an example.
$text = preg_replace('/ keywords/i',' keywords',$text);

Regular expression anchor text for a link

I am trying to pull the anchor text from a link that is formatted this way:
<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>
I want only the anchor text for the link : "i_want_this"
"variable_text" varies according to the filename so I need to ignore that.
I am using this regex:
<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>
This is matching of course the complete link.
PHP's preg_* functions use PCRE (Perl Compatible Regular Expressions). If you want to know a lot about regex, visit perlretut.org. Also, look into Regex generators like exspresso.
For your use, know that regex is greedy. That means that when you specify that you want something, followed by anything (any repetitions), followed by something, it will match as much as it can and stop only at the last place where that second something occurs.
to be more clear, what you want is this:
<a href="
any character, any number of times (regex = .* )
">
any character, any number of times (regex = .* )
</a>
Beyond that, you want to capture the second group of "any character, any number of times". You can do that using what are called capture groups (anything inside parentheses is captured as a group for later reference, also called a back reference).
I would also look into named subpatterns. With those, you can reference your choice with a human-readable string rather than an array index. The syntax for those in PHP is (?P<name>pattern), where name is the name you want and pattern is the actual regex. I'll use that below.
So all that being said, here's the "lazy web" for your regex:
<?php
$str = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
$regex = '/(<a href\=".*">)(?P<target>.*)(<\/a>)/';
preg_match($regex, $str, $matches);
print $matches['target']; // This should output "i_want_this"
?>
Oh, and one final thought. Depending on what you are doing exactly, you may want to look into SimpleXML instead of using regex for this. This would probably require that the tags we see are just snippets of a larger whole, as SimpleXML requires well-formed XML (or XHTML).
I'm sure someone will probably have a more elegant solution, but I think this will do what you want done.
Where:
$subject = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
Option 1:
$pattern1 = '/(<a href=")(.*)(">)(.*)(<\/a>)/i';
preg_match($pattern1, $subject, $matches1);
print($matches1[4]);
Option 2 (ereg() was removed in PHP 7.0, so this only runs on PHP 5):
$pattern2 = '(<a href=")(.*)(">)(.*)(</a>)';
ereg($pattern2, $subject, $matches2);
print($matches2[4]);
Do not use regex to parse HTML. Use a DOM parser. Specify the language you're using, too.
Since it's in a captured group and since you claim it's matching, you should be able to reference it through $1 or \1 depending on the language.
$blah = preg_match( $pattern, $subject, $matches );
print_r($matches);
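For the DOM-parser route, a minimal sketch using PHP's bundled DOM extension might look like the following. The markup is assumed from the question's description of the link, and the XPath filter on the href prefix is an illustrative choice rather than something the question specifies.

```php
<?php
// Assumed markup, modelled on the link described in the question.
$html = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);              // the fragment is wrapped in <html><body>

// Select every link whose href starts with the fixed part of the path.
$xpath = new DOMXPath($doc);
$texts = [];
foreach ($xpath->query('//a[starts-with(@href, "/en/browse/file/")]') as $a) {
    $texts[] = $a->textContent;
}

print_r($texts);
```

The variable part of the URL never has to be matched at all, which sidesteps the greedy/lazy question entirely.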
The thing to remember is that a regex returns everything it matched. You need to specify that you only care about the part you've surrounded in parentheses (the anchor text). I'm not sure what language you're using the regex in, but here's an example in Ruby:
string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)
puts data # => outputs '<a href="/en/browse/file/variable_text">i_want_this</a>'
If you specify what you want in parentheses, you can reference it:
string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)[1]
puts data # => outputs 'i_want_this'
Perl will have you use $1 instead of [1] like this:
$string = '<a href="/en/browse/file/variable_text">i_want_this</a>';
$string =~ m/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/;
$data = $1;
print $data . "\n";
Hope that helps.
I'm not 100% sure if I understand what you want. This will match the content between the anchor tags. The URL must start with /en/browse/file/, but may end with anything.
#<a href="/en/browse/file/[^"]+">(.*?)</a>#
I used # as a delimiter as it made it clearer. It'll also help if you put them in single quotes instead of double quotes so you don't have to escape anything at all.
If you want to limit to numbers instead, you can use:
#<a href="/en/browse/file/\d+">(.*?)</a>#
If it should have just 5 numbers:
#<a href="/en/browse/file/\d{5}">(.*?)</a>#
If it should have between 3 and 6 numbers:
#<a href="/en/browse/file/\d{3,6}">(.*?)</a>#
If it should have more than 2 numbers:
#<a href="/en/browse/file/\d{3,}">(.*?)</a>#
This should work:
<a href="[^"]*">([^<]*)
This says: take EVERYTHING you find until you meet a "
[^"]*
Same again: take everything until you meet a <
[^<]*
The parentheses around [^<]*
([^<]*)
group it, so you can collect that data in PHP. If you look up preg_match in the PHP manual, you will see many fine examples there.
Good luck!
And for your concrete example:
<a href="/en/browse/file/variable_text">([^<]*)
I use
[^<]*
because in some cases
.*?
can be extremely slow. You shouldn't use it if you can use
[^<]*
You could also use the tool Expresso for creating regular expressions. It's pretty handy.
http://www.ultrapico.com/Expresso.htm
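Put into PHP, the negated-class pattern could be used roughly like this. The input string is an assumption modelled on the question, and the href is loosened to [^"]* after the fixed prefix so the variable part does not have to be known.

```php
<?php
// Assumed input; the href and anchor text mirror the question's example.
$html = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';

// [^"]* cannot run past the closing quote and [^<]* cannot run past the
// next tag, so the engine has very little to backtrack over.
$anchorText = null;
if (preg_match('~<a href="/en/browse/file/[^"]*">([^<]*)~', $html, $m)) {
    $anchorText = $m[1];
}

echo $anchorText, "\n"; // i_want_this
```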
