PHP:preg_replace function - php

$text = "
<tag>
<html>
HTML
</html>
</tag>
";
I want to replace all the text present inside the tags with htmlspecialchars(). I tried this:
$regex = '/<tag>(.*?)<\/tag>/s';
$code = preg_replace($regex,htmlspecialchars($regex),$text);
But it doesn't work.
I am getting the output as htmlspecialchars of the regex pattern. I want to replace it with htmlspecialchars of the data matching with the regex pattern.
what should i do?

You're replacing the match with the pattern itself, you're not using the back-references and the e-flag, but in this case, preg_replace_callback would be the way to go:
$code = preg_replace_callback($regex,'htmlspecialchars',$text);
This will pass the mathces groups to htmlspecialchars, and use its return value as replacement. The groups might be an array, in which case, you can try either:
function replaceCallback($matches)
{
if (is_array($matches))
{
$matches = implode ('', array_slice($matches, 1));//first element is full string
}
return htmlspecialchars($matches);
}
Or, if your PHP version permits it:
preg_replace_callback($expr, function($matches)
{
$return = '';
for ($i=1, $j = count($matches); $i<$j;$i++)
{//loop like this, skips first index, and allows for any number of groups
$return .= htmlspecialchars($matches[$i]);
}
return $return;
}, $text);
Try any of the above, until you find simething that works... incidentally, if all you want to remove is <tag> and </tag>, why not go for the much faster:
echo htmlspecialchars(str_replace(array('<tag>','</tag>'), '', $text));
That's just keeping it simple, and it'll almost certainly be faster, too.
See the quickest, easiest way in action here

If you want to isolate the actual contents as defined by your pattern, you could use preg_match($regex,$text,$hits);. This will give you an array of hits those bits that were between the paratheses in the pattern, starting at $hits[1], $hits[0] contains the whole matched string). You can then start manipulating these found matches, possibly using htmlspecialchars ... and combine them again into $code.

Related

preg_replace with arrays of different sizes

I know that the answer is probably simple and I am just not seeing it.
This code gets me an array of "tags" in the layout (eg [[tag]]), and I have an array of replacements that comes in with the request ($this->data). My first inclination was to use preg_match_all to get an array of all the tags and just pass in both arrays:
if(isset($this->layout))
{
ob_start();
include(VIEWS.'layouts/'.$this->layout.'.phtml');
$this->layout = ob_get_contents();
preg_match_all('/\[\[.*\]\]/', $this->layout, $tags);
print preg_replace($tags, $this->data, $this->layout);
}
But the arrays are not the same length (most of the time). The layout might reuse some tags, and the passed in data variables might not include some tags in the layout.
I feel like there must be a more efficient way to do this than doing a foreach and building the output in iterations.
This project is way too small to implement a full templating engine like Smarty or Twig... it is actually just a few pages and a few replacements. My client just wants a simple way to add things like page titles and email recipients, etc.
Any advice would be appreciated. Like I said, I am positive that it is something simple that I am overlooking.
EDIT:
$this->data is an array of replacement text that look like:
tag => replacement_text
EDIT 2:
If I user preg_match_all('/\[\[(.*)\]\]/', $this->layout, $tags); the array includes JUST the tags (no [[]]), I just need a way to match them up with the array of replacement strings in $this->data.
You can simply use str_replace for this job, creating an array of search strings and replacements from $this->data:
$search = array_map(function ($s) { return "[[$s]]"; }, array_keys($this->data));
$replacements = array_values($this->data);
echo str_replace($search, $replacements, $this->layout);
Demo on 3v4l.org
You don't need to get $tags by matching $this->layout, the information is all in the keys of $this->data. You just need to add [[...]] around the keys.
$tags = array_map(function($key) {
return '/\[\[' . preg_quote($key) . '\]\]';
}, array_keys($this->data));
Another solution is to use preg_replace_callback() to look the tag up in $this->data;
echo preg_replace_callback('/\[\[(.*?)\]\]/', function($matches) {
return $this->data[$matches[1]] ?? $matches[0];
}, $this->layout);
Note that I changed the regexp to use a non-greedy quantifier; your regexp will match from the beginning of the first tag to the end of the last tag.
If the tag isn't found in $this->data, ?? $matches[0] leaves it unchanged.
You could make use of preg_replace_callback:
$result = preg_replace_callback('/\[\[(?<tag>.*?)]]/', function ($matches) {
return $this->data[$matches['tag']] ?? $matches[0];
}, $this->layout);
Demo: https://3v4l.org/I9Vvh
Shorter PHP 7.4 version:
$result = preg_replace_callback('/\[\[(?<tag>.*?)]]/', fn($matches) =>
$this->data[$matches['tag']] ?? $matches[0], $this->layout);
Edited with the ?? $matches[0] (courtesy of #Barmar) -- this is basically the same answer, just leaving it in case the PHP 7.4 syntax is useful.

How to improve my algorithm?/seaching and replacing words in a formated text/

I have a source of html, and an array of keywords. I'm trying to find all words which begin with any keyword in the keywords array and wrap it in a link tag.
For example, the keyword array has two values: [ABC, DEF]. It should match ABCDEF, DEFAD, etc. and wrap each word with hyperlink markup.
Here is the code I've got so far:
$_keys = array('ABC', 'DEF');
$text = 'Some ABCDD <strong>HTML</strong> text. DEF';
function search_and_replace(($key,$text)
{
$words = preg_split('/\s+/', trim($text)); //to seprate words in $_text
for($words as $word)
{
if(strpos($word,$key) !== false)
{
if($word.startswith($key))
{
str_replace($word,''.$word.',$_text);
}
}
}
return text;
}
for($_keys as $_key)
{
$text = search_and_replace($key,$text);
}
My questions:
Would this algorithm work?
How would I modify this to work with UTF-8?
How can I recognize hyperlinks in the html and ignore them (don't want to put a hyperlink in a hyperlink).
Is this algorithm safe?
is the algorithm "true"? ( I'm reading "accurate")
No, it is not. Since str_replace functions as follows
a string or an array with all occurrences of search in subject
replaced with the given replace value.
The string you're matching is not the only one being replaced. Using your example, if you ran this function against your data set, you'd end up wrapping each occurrence of ABC in multiple tags ( just run your code to see it, but you'll have to fix syntax errors).
work with UTF-8 Alphabets?
Not sure, but as written, I don't think so. See Preg_Replace and UTF8. PREG functions should be multibyte safe.
I want to igonre all words in each a tag for search operetion
That's awefully hard. You'll have to avoid <a ...>word</a>, which starts to make a big mess fast. Regex matching HTML reliably is a fool's errand.
Probably the best would be to interpret the webpage as XML or HTML. Have you considered doing this in javascript? Why do it on the server side? The advantage of JS is twofold - one, it runs on the client side, so you're offloading / distributing the work, and two, since the DOM is already interpreted, you can find all text nodes and replace them fairly easily. In fact, I was helping a frend working on a chrome extension to to almost exactly what you're describing; you could modify it to do what you're looking for easily.
a better alternative method?
Definitely. What you're showing here is one of the worse methods of doing this. I'd push for you to use preg_replace ( another answer has a good start for the regex you'd want, matching word breaks tather than whitespace) but since you want to avoid changing some elements, I'm thinking now that doing this in JS client-side is far better.
In order to maximize your performance you should look into Trie (same as Retrieval Tree) data structure. (http://en.wikipedia.org/wiki/Trie) If I were you I would first build a Trie containing the words in the HTML page. At this step you could also check if the word is inside an <a> tag and if it this then do not add it to the Trie. You can easily do that with a Regex match
How about regex?
preg_match_all("/\b".$word."\B*\b/",$matches);
foreach($matches as $each) {
print($each[0]);
}
(Sorry, my PHP is a bit rusty)
For a simple task like this PHP regular expressions will serve well. The idea is to find all hyperlinks ( and optionally some other HTML elements ) and replace them with unique tokens. After that we are free to seek and replace desired keywords, and in the end we will restore the removed HTML elements back.
$_keys = array( 'ABC', 'DEF', 'ABČ' );
$text =
'Some <a href="#" >ABC</a> ABCDđD <strong>ABCDEF</strong> text. DEF
<p class="test">
PHP is <em>the</em> most ABCwidely used
langČuage ABC for ABČogr ammDEFing on the webABC DEFABC.
</p>';
// array for holding html items replaced with tokens
$tokens = array();
$id = 0;
// we will replace all links and strong elements (a|strong)
$text = preg_replace_callback( '/<(a|strong)[^>]*>.*?<\/\1\s*>/s',
function( $matches ) use ( &$tokens, &$id )
{
// store matches into the tokens array
$tokens[ '#'.++$id.'#' ] = $matches[0];
// replace matches with the unique id
return '#'.$id.'#';
},
$text
);
echo htmlentities( $text );
/* - outputs: Some #1# ABCDđD #2# text. DEF <p class="test"> #3# is <em>the</em> most ABCwidely used langČuage ABC for pćrogrABCamming on the webABC DEFABC. </p>
- note the #1# #2# #3# tokens
*/
// wrap the words that starts with items in $_keys array ( with u(PCRE_UTF8) modifier )
$text = preg_replace( '/\b('. implode( '|', $_keys ) . ')\w*\b/u', '$0', $text );
// replace the tokens with values
$text = str_replace( array_keys($tokens), array_values($tokens), $text );
echo $text;
Info about UTF-8 strings in PHP regex:

preg_match acting very strange

I am using preg_match() to extract pieces of text from a variable, and let's say the variable looks like this:
[htmlcode]This is supposed to be displayed[/htmlcode]
middle text
[htmlcode]This is also supposed to be displayed[/htmlcode]
i want to extract the contents of the [htmlcode]'s and input them into an array. i am doing this by using preg_match().
preg_match('/\[htmlcode\]([^\"]*)\[\/htmlcode\]/ms', $text, $matches);
foreach($matches as $value){
return $value . "<br />";
}
The above code outputs
[htmlcode]This is supposed to be displayed[/htmlcode]middle text[htmlcode]This is also supposed to be displayed[/htmlcode]
instead of
[htmlcode]This is supposed to be displayed[/htmlcode]
[htmlcode]This is also supposed to be displayed[/htmlcode]
and if have offically run out of ideas
As explained already; the * pattern is greedy. Another thing is to use preg_match_all() function. It'll return you a multi-dimension array of matched content.
preg_match_all('#\[htmlcode\]([^\"]*?)\[/htmlcode\]#ms', $text, $matches);
foreach( $matches[1] as $value ) {
And you'll get this: http://codepad.viper-7.com/z2GuSd
A * grouper is greedy, i.e. it will eat everything until last [/htmlcode]. Try replacing * with non-greedy *?.
* is by default greedy, ([^\"]*?) (notice the added ?) should make it lazy.
What do lazy and greedy mean in the context of regular expressions?
Look at this piece of code:
preg_match('/\[htmlcode\]([^\"]*)\[\/htmlcode\]/ms', $text, $matches);
foreach($matches as $value){
return $value . "<br />";
}
Now, if your pattern works fine and all is ok, you should know:
return statement will break all loops and will exit the function.
The first element in matches is the whole match, the whole string. In your case $text
So, what you did is returned the first big string and exited the function.
I suggest you can check for desired results:
$matches[1] and $matches[2]

php replace regular expression instead of string replace

I'm trying to give my client the ability to call a function that has various code snippets by inserted a short code in their WYSIWYG editor.
For example, they will write something like...
[getSnippet(1)]
This will call my getSnippet($id) php function and output the appropriate 'chunk'.
It works when I hard code the $id like this...
echo str_replace('[getSnippet(1)]',getSnippet(1),$rowPage['sidebar_details']);
However, I really want to make the '1' dynamic. I'm sort of on the right track with something like...
function getSnippet($id) {
if ($id == 1) {
echo "car";
}
}
$string = "This [getSnippet(1)] is a sentence.This is the next one.";
$regex = '#([getSnippet(\w)])#';
$string = preg_replace($regex, '. \1', $string);
//If you want to capture more than just periods, you can do:
echo preg_replace('#(\.|,|\?|!)(\w)#', '\1 \2', $string);
Not quite working :(
Firstly in your regex you need to add literal parentheses (the ones you have just capture \w but that will not match the parentheses themselves):
$regex = '#(\[getSnippet\((\w)\)\])#';
I also escaped the square brackets, otherwise they will open a character class. Also be aware that this captures only one character for the parameter!
But I recommend you use preg_replace_callback, with a regex like this:
function getSnippet($id) {
if ($id == 1) {
return "car";
}
}
function replaceCallback($matches) {
return getSnippet($matches[1]);
}
$string = preg_replace_callback(
'#\[getSnippet\((\w+)\)\]#',
'replaceCallback',
$string
);
Note that I changed the echo in your getSnippet to a return.
Within the callback $matches[1] will contain the first captured group, which in this case is your parameter (which now allows for multiple characters). Of course, you could also adjust you getSnippet function to read the id from the $matches array instead of redirecting through the replaceCallback.
But this approach here is slightly more flexible, as it allows you to redirect to multiple functions. Just as an example, if you changed the regex to #\[(getSnippet|otherFunction)\((\w+)\)\]# then you could find two different functions, and replaceCallback could find out the name of the function in $matches[1] and call the function with the parameter $matches[2]. Like this:
function getSnippet($id) {
...
}
function otherFunction($parameter) {
...
}
function replaceCallback($matches) {
return $matches[1]($matches[2]);
}
$string = preg_replace_callback(
'#\[(getSnippet|otherFunction)\((\w+)\)\]#',
'replaceCallback',
$string
);
It really depends on where you want to go with this. The important thing is, there is no way of processing an arbitrary parameter in a replacement without using preg_replace_callback.

Get more backreferences from regexp than parenthesis

Ok this is really difficult to explain in English, so I'll just give an example.
I am going to have strings in the following format:
key-value;key1-value;key2-...
and I need to extract the data to be an array
array('key'=>'value','key1'=>'value1', ... )
I was planning to use regexp to achieve (most of) this functionality, and wrote this regular expression:
/^(\w+)-([^-;]+)(?:;(\w+)-([^-;]+))*;?$/
to work with preg_match and this code:
for ($l = count($matches),$i = 1;$i<$l;$i+=2) {
$parameters[$matches[$i]] = $matches[$i+1];
}
However the regexp obviously returns only 4 backreferences - first and last key-value pairs of the input string. Is there a way around this? I know I can use regex just to test the correctness of the string and use PHP's explode in loops with perfect results, but I'm really curious whether it's possible with regular expressions.
In short, I need to capture an arbitrary number of these key-value; pairs in a string by means of regular expressions.
You can use a lookahead to validate the input while you extract the matches:
/\G(?=(?:\w++-[^;-]++;?)++$)(\w++)-([^;-]++);?/
(?=(?:\w++-[^;-]++;?)++$) is the validation part. If the input is invalid, matching will fail immediately, but the lookahead still gets evaluated every time the regex is applied. In order to keep it (along with the rest of the regex) in sync with the key-value pairs, I used \G to anchor each match to the spot where the previous match ended.
This way, if the lookahead succeeds the first time, it's guaranteed to succeed every subsequent time. Obviously it's not as efficient as it could be, but that probably won't be a problem--only your testing can tell for sure.
If the lookahead fails, preg_match_all() will return zero (false). If it succeeds, the matches will be returned in an array of arrays: one for the full key-value pairs, one for the keys, one for the values.
regex is powerful tool, but sometimes, its not the best approach.
$string = "key-value;key1-value";
$s = explode(";",$string);
foreach($s as $k){
$e = explode("-",$k);
$array[$e[0]]=$e[1];
}
print_r($array);
Use preg_match_all() instead. Maybe something like:
$matches = $parameters = array();
$input = 'key-value;key1-value1;key2-value2;key123-value123;';
preg_match_all("/(\w+)-([^-;]+)/", $input, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
$parameters[$match[1]] = $match[2];
}
print_r($parameters);
EDIT:
to first validate if the input string conforms to the pattern, then just use:
if (preg_match("/^((\w+)-([^-;]+);)+$/", $input) > 0) {
/* do the preg_match_all stuff */
}
EDIT2: the final semicolon is optional
if (preg_match("/^(\w+-[^-;]+;)*\w+-[^-;]+$/", $input) > 0) {
/* do the preg_match_all stuff */
}
No. Newer matches overwrite older matches. Perhaps the limit argument of explode() would be helpful when exploding.
what about this solution:
$samples = array(
"good" => "key-value;key1-value;key2-value;key5-value;key-value;",
"bad1" => "key-value-value;key1-value;key2-value;key5-value;key-value;",
"bad2" => "key;key1-value;key2-value;key5-value;key-value;",
"bad3" => "k%ey;key1-value;key2-value;key5-value;key-value;"
);
foreach($samples as $name => $value) {
if (preg_match("/^(\w+-\w+;)+$/", $value)) {
printf("'%s' matches\n", $name);
} else {
printf("'%s' not matches\n", $name);
}
}
I don't think you can do both validation and extraction of data with one single regexp, as you need anchors (^ and $) for validation and preg_match_all() for the data, but if you use anchors with preg_match_all() it will only return the last set matched.

Categories