Regex pattern - match word that starts with # - php

My mobile application works like a forum on a mobile platform (WP7, Silverlight, .NET). One feature is tagging other users by writing the "#" character followed by the username.
On the server side, in PHP, I parse the text to find the tags and replace them with a more readable string such as [tag](display name)|(user id)[/tag], but that's not important here.
To match tags, I first replace all special characters with a space, so that punctuation around a tag, as in .... #name, ....., does not get in the way. Then I collapse any runs of multiple spaces that the previous step may have created. Finally I split the text on whitespace and check whether each word begins with the "#" character.
This is of course not the best method, but it's what I've managed so far. There is a weak point: newline characters make my code fail. For example:
Hello, this is my first line
since I go to second and then I tag
#Jonh
who is a good boy
In a case like this, the code below fails.
Here $resp is the text to parse.
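if (strpos($resp, '#') !== false) {
    // replace everything except letters, digits, underscore, space and # with a space
    $new_str = preg_replace('/[^a-zA-Z0-9_ \#]/', ' ', $resp);
    // collapse runs of whitespace into a single space
    $new_str = preg_replace('/\s+/', ' ', $new_str);
    foreach (explode(' ', $new_str) as $word) {
        if (strpos($word, '#') === 0) {
            //found my tag!
        }
    }
}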
What would you advise to do?

Rather than using regex to replace everything you don't want to match, you should be able to immediately match any word with an # before it.
$subject = "..blah.. #name, ..blah..#hello,blah";
$pattern = '/#\w+/';
preg_match_all($pattern, $subject, $matches);
print_r($matches);
Output:
Array ( [0] => Array ( [0] => #name [1] => #hello ) )
/#\w+/ assumes that only letters, digits and underscores (that's what \w matches) are valid in a name, e.g. #user123_xd. If you also want to allow, for example, the dash (e.g. #user1a-12), the pattern becomes /#[\w-]+/
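As a quick check against the multi-line example from the question, here is a minimal sketch. The sample text is copied from the question; the variable name $tags and the added capture group are illustrative choices, not part of the original answer.

```php
<?php
// Sample text from the question; the tag sits on its own line.
$resp = "Hello, this is my first line\nsince I go to second and then I tag\n#Jonh\nwho is a good boy";

// \w+ stops at the first non-word character, so newlines and punctuation
// end the tag without any pre-cleaning of the text.
preg_match_all('/#(\w+)/', $resp, $matches);

// $matches[0] holds the full matches ("#Jonh"), $matches[1] the bare names.
$tags = $matches[1];
print_r($tags);
```

Capturing the name without the "#" makes it easier to look the user up afterwards.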

Related

Simple pattern works fine with preg_match_all; how do I use it with preg_replace?

Thanks for your help.
My goal is to use preg_replace with a pattern to remove some very simple strings.
Using only preg_replace, on this string or others, I need to remove everything between the tag name in <tag and the next > symbol. The pattern is simple:
$x = '#<\w+(\s+[^>]*)>#is';
$s = 'DATA<td class="td1">111</td><td class="td2">222</td>DATA';
preg_match_all($x, $s, $Q);
print_r($Q[1]);
[1] => Array
(
[0] => class="td1"
[1] => class="td2"
)
This works great!
Now I try to remove those strings using the same pattern:
$new_string = '';
$Q = preg_replace($x, "\\1$new_string", $s);
print_r($Q);
The result is completely different.
What is wrong with my use of preg_replace?
Using only preg_replace(), how can I remove these strings?
(I could use foreach(...) to remove each string, but where is the error in my code?)
The result I expect for this input:
$s = 'DATA<td class="td1">111</td><td class="td2">222</td>DATA';
is this output:
$Q = 'DATA<td>111</td><td>222</td>DATA';
Let's break down your RegEx, #<\w+(\s+[^>]*)>#is, and see if that helps.
# // Start delimiter
< // Literal `<` character
\w+ // One or more word-characters, a-z, A-Z, 0-9 or _
( // Start capturing group
\s+ // One or more spaces
[^>]* // Zero or more characters that are not the literal `>`
) // End capturing group
> // Literal `>` character
# // End delimiter
is // Ignore case and `.` matches all characters including newline
Given the input DATA<td class="td1">DATA this matches <td class="td1"> and captures class="td1". The difference between match and capture is very important.
When you use preg_match you'll see the entire match at index 0, and any subsequent captures at incrementing indexes.
When you use preg_replace the entire match will be replaced. You can use the captures, if you so choose, but you are replacing the match.
I'm going to say that again: whatever you pass as the replacement string will replace the entirety of the found match. If you say $1 or \\1, you are saying replace the entire match with just the capture.
Going back to the sample after the breakdown, using $1 is the equivalent of calling:
str_replace('<td class="td1">', ' class="td1"', $string);
which you can see here: https://3v4l.org/ZkPFb
To your question "how to change [0] by $new_string", you are doing it correctly, it is your RegEx itself that is wrong. To do what you are trying to do, your pattern must capture the tag itself so that you can say "replace the HTML tag with all of the attributes with just the tag".
As one of my comments noted, this is where you'd invert the capturing. You aren't interested in capturing the attributes; you are throwing those away. Instead, you are interested in capturing the tag itself:
$string = 'DATA<td class="td1">DATA';
$pattern = '#<(\w+)\s+[^>]*>#is';
echo preg_replace($pattern, '<$1>', $string);
Demo: https://3v4l.org/oIW7d
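Applied to the full input string from the question, the same pattern can be sketched as follows. The variable names $s and $result are illustrative.

```php
<?php
$s = 'DATA<td class="td1">111</td><td class="td2">222</td>DATA';

// Capture only the tag name; the attributes are matched but not captured,
// so they vanish when the whole match is replaced by '<$1>'.
$pattern = '#<(\w+)\s+[^>]*>#is';
$result = preg_replace($pattern, '<$1>', $s);

echo $result; // DATA<td>111</td><td>222</td>DATA
```

Closing tags such as </td> are left alone because \w+ cannot match the "/" that follows "<".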

Replace whole words from blacklist array instead of partial matches

I have an array of words
$banned_names = array('about','access','account');
The actual array is very long and contains bad words, so to avoid breaking any rules I have only added an example. The issue I'm having is the following:
$title = str_ireplace($filterWords, '****', $dn1['title']);
This works however, one of my filtered words is 'rum' and if I was to post the word 'forum' it will display as 'fo****'
So I need to only replace the word with **** if it matches the exact word from the array, if I was to give an example the phrase "Lets check the forum and see if anyone has rum", would be "Lets check the forum and see if anyone has ****".
Similar to the other answers but this uses \b in regex to match word boundaries (whole words). It also creates the regex-compatible banned list on the fly before passing to preg_replace_callback().
$dn1['title'] = 'access forum';
$banned_names = array('about','access','account','rum');
$banned_list = array_map(function($r) { return '/\b' . preg_quote($r, '/') . '\b/'; }, $banned_names);
$title = preg_replace_callback($banned_list, function($m) {
return $m[0][0].str_repeat('*', strlen($m[0])-1);
}, $dn1['title']);
echo $title; //a***** forum
You can use regex with \W to match a "non-word" character:
var_dump(preg_match('/\Wrum\W/i', 'the forum thing')); // returns 0 i.e. doesn't match
var_dump(preg_match('/\Wrum\W/i', 'the rum thing')); // returns 1 i.e. matches
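The preg_replace() function takes an array of patterns, much like str_replace() takes an array of search strings, but you'll have to adjust the list to include the pattern delimiters and the \W on both sides. Note that \W must match an actual character, so a banned word at the very start or end of the text is missed, and the neighbouring characters are replaced along with the word. You could store the full patterns statically in your list: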
$banlist = ['/\Wabout\W/i','/\Waccess\W/i', ... ];
preg_replace($banlist, '****', $text);
Or adjust the array on the fly to add those bits.
You can use preg_replace() to look for your needles with a beginning/end of string tag after converting each string in your haystack to an array of strings, so you'll be matching on full words. Alternatively you can add spaces and continue to use str_ireplace() but that option would fail if your word is the first or last word in the string being checked.
Adding spaces (will miss first/last word, not recommended):
You'll have to modify your filtering array first of course. And yes the foreach could be simpler, but I hope this makes clear what I'm doing/why.
foreach($filterWords as $key => $value){
$filterWords[$key] = " ".$value." ";
}
$title = str_ireplace($filterWords, "****", $dn1['title']);
OR
Breaking up long string (recommended):
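foreach($filterWords as $key => $value){
    // anchor each pattern to the start and end of a single word; preg_quote() escapes regex metacharacters
    $filterWords[$key] = "/^".preg_quote($value, "/")."$/i";
}
// preg_replace() returns an array when given an array subject, so join the words back together
$title = implode(" ", preg_replace($filterWords, "****", explode(" ", $dn1['title'])));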
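The word-boundary idea from the answers above can also be sketched with one combined pattern instead of an array of patterns. This is an illustrative variant rather than any answerer's exact code; $quoted, $pattern and $clean are made-up names, and for a very long ban list a single huge alternation may be less practical than the array form.

```php
<?php
$banned_names = ['about', 'access', 'account', 'rum'];

// Escape each word, then join them into one alternation wrapped in \b...\b.
$quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $banned_names);
$pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';

$text = 'Lets check the forum and see if anyone has rum';
$clean = preg_replace($pattern, '****', $text);

echo $clean; // Lets check the forum and see if anyone has ****
```

Unlike \W, \b is a zero-width assertion, so it works at the start and end of the text and does not consume the characters next to the word.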

Loop over array and replacing with regex returns empty string

I have a String which contains substrings which I have to replace. The substrings are stored in an array. When I loop through the array everything works fine, until the array has more than 120 entries.
foreach ( $activeTags as $k => $v ) {
$find = $activeTags[$k]['Tag']['tag'];
$replace = 'that';
$pattern = "/\#\#[a-zA-Z][a-zA-Z]\#\#.*\b$find\b.*\#\#END_[a-zA-Z][a-zA-Z]\#\#|$find/";
$sText = '<p>Do not replace ##DS## this ##END_DS## replace this.</p>';
$sText = preg_replace_callback($pattern, function($match) use($find, $replace){
if($match[0] == $find){
return($replace);
}else{
return($match[0]);
}
}, $sText);
}
When count($activeTags) == 121, I only get an empty string.
Does anyone have an idea why this happens?
Try this improved pattern:
$pattern = "~##([a-zA-Z]{2})##.*?\b$find\b.*?##END_\\1##|$find~s";
Discussion
The s flag makes the dot (.) match newlines as well. Your example mentions p tags, so I guess the text is an HTML fragment. Since newlines are allowed in HTML, I have added the s flag. Moreover, I have made the pattern more readable by:
using custom pattern delimiters: / becomes ~, so nothing in the pattern needs escaping
replacing duplicated subpatterns: [a-zA-Z][a-zA-Z] becomes [a-zA-Z]{2}
taking advantage of the ##DS## ... ##END_DS## pairing: a backreference matches whatever was captured by the first group. It is written \\1 here because inside a double-quoted PHP string a bare \1 would be read as an octal character escape.
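The question reports an empty string once the array grows. One possible explanation, which is an assumption and not something the thread confirms, is that the preg_* call fails and returns null, which then prints as an empty string. A minimal sketch of how to detect that, using a hypothetical search term and a single-quoted pattern so that \b and \1 reach the regex engine unchanged:

```php
<?php
$find = 'this';       // hypothetical search term
$replace = 'that';
$sText = '<p>Do not replace ##DS## this ##END_DS## replace this.</p>';

// preg_quote() keeps regex metacharacters in $find from altering the pattern.
$q = preg_quote($find, '~');
$pattern = '~##([a-zA-Z]{2})##.*?\b' . $q . '\b.*?##END_\1##|' . $q . '~s';

$result = preg_replace_callback($pattern, function ($m) use ($find, $replace) {
    // Only a bare occurrence (outside a ## block) equals $find exactly.
    return $m[0] === $find ? $replace : $m[0];
}, $sText);

if ($result === null) {
    // On failure preg_replace_callback() returns null; report the error code
    // and keep the previous text instead of overwriting it.
    echo 'PCRE error code: ' . preg_last_error();
} else {
    $sText = $result;
    echo $sText;
}
```

Checking the return value before assigning it back means one failing pattern cannot silently wipe out the text for the remaining iterations.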

replace multiple spaces, tabs and newlines into one space except commented text

I need to replace multiple spaces, tabs and newlines with one space, except inside commented text in my HTML.
For example the following code:
<br/> <br>
<!--
this is a comment
-->
<br/> <br/>
should turn into
<br/><br><!--
this is a comment
--><br/><br/>
Any ideas?
The new solution
After thinking a bit, I came up with the following solution with pure regex. Note that this solution will delete the newlines/tabs/multi-spaces instead of replacing them:
$new_string = preg_replace('#(?(?!<!--.*?-->)(?: {2,}|[\r\n\t]+)|(<!--.*?-->))#s', '$1', $string);
echo $new_string;
Explanation
(? # If
(?!<!--.*?-->) # There is no comment
(?: {2,}|[\r\n\t]+) # Then match 2 spaces or more, or newlines or tabs
| # Else
(<!--.*?-->) # Match and group it (group #1)
) # End if
So basically, at any position where no comment starts, the regex tries to match spaces/tabs/newlines. If it finds them, group 1 is not set, so the replacement '$1' is empty and the whitespace is deleted. Where a comment does start, the whole comment is matched and captured, and it is replaced by itself (lol), which leaves it unchanged.
The old solution
I came up with a new strategy. This code requires PHP 5.3+:
$new_string = preg_replace_callback('#(?(?!<!--).*?(?=<!--|$)|(<!--.*?-->))#s', function($m){
if(!isset($m[1])){ // If group 1 does not exist (the comment)
return preg_replace('#\s+#s', ' ', $m[0]); // Then replace with 1 space
}
return $m[0]; // Else return the matched string
}, $string);
echo $new_string; // Output
Explaining the regex:
(? # If
(?!<!--) # Lookahead if there is no <!--
.*? # Then match anything (ungreedy) until ...
(?=<!--|$) # Lookahead, check for <!-- or end of string
| # Or
(<!--.*?-->) # Match and group a comment, this will make for us a group #1
)
# The s modifier is to match newlines with . (dot)
Note: What you are asking for and what you have provided as expected output are a bit contradictory. Anyway, if you want to remove the whitespace instead of replacing it with one space, just edit the code from '#\s+#s', ' ', $m[0] to '#\s+#s', '', $m[0].
It's much simpler to do this in several runs (as is done for instance in php markdown).
Step1: preg_replace_callback() all comments with something unique while keeping their original values in a keyed array -- ex: array('comment_placeholder:' . md5('comment') => 'comment', ...)
Step2: preg_replace() white spaces as needed.
Step3: str_replace() comments back where they originally were using the keyed array.
The approach you're leaning towards (splitting the string and only processing the non-comment parts) works fine too.
There almost certainly is a means to do this with pure regex, using ugly look-behinds, but not really recommended: the regex might yield backtracking related errors, and the comment replacement step allows you to process things further if needed without worrying about the comments themselves.
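The three steps above can be sketched roughly as follows. The sample input, the $stash array and the exact placeholder format are assumptions made for illustration; the sketch also assumes the placeholder text does not otherwise occur in the document.

```php
<?php
$html = "<br/>  <br>\n<!--\n  this is   a comment\n-->\n<br/>   <br/>";

// Step 1: swap each comment for a unique placeholder and remember the original.
$stash = [];
$work = preg_replace_callback('/<!--.*?-->/s', function ($m) use (&$stash) {
    $key = 'comment_placeholder:' . md5($m[0]);
    $stash[$key] = $m[0];
    return $key;
}, $html);

// Step 2: collapse whitespace runs in what is left.
$work = preg_replace('/\s+/', ' ', $work);

// Step 3: put the comments back untouched.
$output = str_replace(array_keys($stash), array_values($stash), $work);
echo $output;
```

Identical comments share one placeholder, which is harmless because they also share the same original text.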
I’d do the following:
split the input into comment and non-comment parts
do replacement on the non-comment parts
put everything back together
Example:
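// with PREG_SPLIT_DELIM_CAPTURE the comments land at odd indexes, the text between them at even ones
$parts = preg_split('/(<!--(?:(?!-->).)*-->)/s', $input, -1, PREG_SPLIT_DELIM_CAPTURE);
foreach ($parts as $i => &$part) {
    if ($i % 2 === 0) {
        // non-comment part
        $part = preg_replace('/\s+/', ' ', $part);
    } else {
        // comment part: leave untouched
    }
}
unset($part); // break the reference left over from the loop
$output = implode('', $parts);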
You can use this:
$pattern = '~\s*+(<br[^>]*>|<!--(?>[^-]++|-(?!->))*-->)\s*+~';
$replacement = '$1';
$result = preg_replace($pattern, $replacement, $subject);
This pattern captures br tags and comments, and matches spaces around. Then it replaces the match by the capture group.

Identifying a random repeating pattern in a structured text string

I have a string that has the following structure:
ABC_ABC_PQR_XYZ
Where PQR has the structure:
ABC+JKL
and
ABC itself is a string that can contain alphanumeric characters and a few other characters like "_", "-", "+", "." and follows no set structure:
eg.qWe_rtY-asdf or pkl123
so, in effect, the string can look like this:
qWe_rtY-asdf_qWe_rtY-asdf_qWe_rtY-asdf+JKL_XYZ
My goal is to find out what string constitutes ABC.
I was initially just using
$arrString = explode("_",$string);
to return $arrString[0] before I was made aware that ABC ($arrString[0]) itself can contain underscores, thus rendering it incorrect.
My next attempt was exploding it on "_" anyway and then comparing each of the exploded string parts with the first string part until I get a semblance of a pattern:
function getPatternABC($string)
{
$count = 0;
$pattern ="";
$arrString = explode("_", $string);
foreach($arrString as $expString)
{
if(strcmp($expString,$arrString[0])!==0 || $count==0)
{
$pattern = $pattern ."_". $arrString[$count];
$count++;
}
else break;
}
return substr($pattern,1);
}
This works great - but I wanted to know if there was a more elegant way of doing this using regular expressions?
Here is the regex solution:
'^([a-zA-Z0-9_+-]+)_\1_\1\+'
What this does is match (starting from the beginning of the string) the longest possible sequence consisting of the characters inside the square brackets (edit that per your spec). The sequence must appear exactly twice, each time followed by an underscore, and then must appear once more followed by a plus sign (this is actually the first half of PQR with the delimiter before JKL). The rest of the input is ignored.
You will find ABC captured as capture group 1.
So:
$input = 'qWe_rtY-asdf_qWe_rtY-asdf_qWe_rtY-asdf+JKL_XYZ';
$result = preg_match('/^([a-zA-Z0-9_+-]+)_\1_\1\+/', $input, $matches);
if ($result) {
echo $matches[1];
}
See it in action.
Sure, just make a regular expression that matches your pattern. In this case, something like this:
preg_match('/^([a-zA-Z0-9_+.-]+)_\1_\1\+JKL_XYZ$/', $string, $match);
Your ABC is in $match[1].
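Run against the sample string from the question, the backreference approach can be checked with a short sketch. The variable $abc and the null fallback are illustrative; the trailing JKL_XYZ part is left unanchored here because those names are placeholders in the question.

```php
<?php
$string = 'qWe_rtY-asdf_qWe_rtY-asdf_qWe_rtY-asdf+JKL_XYZ';

// Group 1 is ABC; each \1 forces the next copy to be identical to it.
if (preg_match('/^([a-zA-Z0-9_+.-]+)_\1_\1\+/', $string, $match)) {
    $abc = $match[1];
} else {
    $abc = null; // input does not follow the ABC_ABC_ABC+... layout
}

var_dump($abc);
```

The greedy group first grabs as much as it can and then backtracks until the two backreferences and the plus sign line up, which is what makes underscores inside ABC unproblematic.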
If the presence of underscores in these strings has a low frequency, it may be worth checking to see if a simple explode() will do it before bothering with regex.
<?php
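$str = 'ABC_ABC_PQR_XYZ';
if (substr_count($str, '_') == 3) {
    // exactly three underscores means ABC itself contains none;
    // index the array directly, since reset() expects a variable passed by reference
    $abc = explode('_', $str)[0];
} else {
    $abc = regexy_function($str); // placeholder for one of the regex approaches above
}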
?>
