I have a website where users can have custom actions when a keyword is detected in a sentence. How I currently do matches is like the following:
$output = array();
preg_match('/\b' . $keyword . '\b/', $phrase, $output);
If I find a match if(count($output) > 0) { then the custom action is ran. This is for spoken sentences so it is for things like operator, we have a custom one called [silence] so when silence is detected it runs an action.
However when the keyword contains brackets for example: [silence] the regex fails because it has square brackets. I have tried escaping both like \b\[silence\]\b However this does not detect a match.
Also this is in PHP
Thanks in advance,
Joe
The "word boundary" expression matches if the next character is a part of a word, and [ isn't (it is not a letter)
From Regex tutorial :
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a “whole words only” search using a regular expression in the form of \bword\b. A “word character” is a character that can be used to form words. All characters that are not “word characters” are “non-word characters”.
So you need to "rewrite" the \b expression that can suit your need, like :
(?<=[\s\.,;])\[silence\](?=[\s\.,;])
First, a non-matching "delimiter character" (space, dot, comma, ... You probably need to add a few more), followed by your expression, followed by a non-matching delimiter character again.
Related
I have a regular expression to escape all special characters in a search string. This works great, however I can't seem to get it to work with word boundaries. For example, with the haystack
add +
or
add (+)
and the needle
+
the regular expression /\+/gi matches the "+". However the regular expression /\b\+/gi doesn't. Any ideas on how to make this work?
Using
add (plus)
as the haystack and /\bplus/gi as the regex, it matches fine. I just can't figure out why the escaped characters are having problems.
\b is a zero-width assertion: it doesn't consume any characters, it just asserts that a certain condition holds at a given position. A word boundary asserts that the position is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one. (A "word character" is a letter, a digit, or an underscore.) In your string:
add +
...there's a word boundary at the beginning because the a is not preceded by a word character, and there's one after the second d because it's not followed by a word character. The \b in your regex (/\b\+/) is trying to match between the space and the +, which doesn't work because neither of those is a word character.
Try changing it to:
/\b\s?+/gi
Edit:
Extend this concept as far as you want. If you want the first + after any word boundary:
/\b[^+]*+/gi
Boundaries are very conditional assertions; what they anchor depends on what they touch. See this answer for a detailed explanation, along with what else you can do to deal with it.
What I have now: preg_match("[\W|_]",$string), which matches any non-word character and underscores. However, I only want to match strings containing \w and individual spaces in the middle of the string (as opposed to $string starting or ending with any number of spaces), but not underscores. Thanks for your help!
Examples that should be matched: Example 123 or One Two Three.
Examples that should be rejected: example& or (starting with one ore more spaces, and multiple spaces between "Example" and "of) Example of foo.
Ah, so you don't need to catch the results of the match - just to test whether or not the string matches some pattern. That can be done with...
$pattern = '/^[A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?$/i';
... but that's destined to fail if you want to cover letters outside of ASCII range. You should use this instead then:
$pattern = '/^[\p{L}0-9](?:[\p{L}0-9 ]*[\p{L}0-9])?$/u';
Check the demo to see that in action.
Consider the following strings
breaking out a of a simple prison
this is b moving up
following me is x times better
All strings are lowercased already. I would like to remove any "loose" a-z characters, resulting in:
breaking out of simple prison
this is moving up
following me is times better
Is this possible with a single regex in php?
$str = "breaking out a of a simple prison
this is b moving up
following me is x times better";
$res = preg_replace("#\\b[a-z]\\b ?#i", "", $str);
echo $res;
How about:
preg_replace('/(^|\s)[a-z](\s|$)/', '$1', $string);
Note this also catches single characters that are at the beginning or end of the string, but not single characters that are adjacent to punctuation (they must be surrounded by whitespace).
If you also want to remove characters immediately before punctuation (e.g. 'the x.'), then this should work properly in most (English) cases:
preg_replace('/(^|\s)[a-z]\b/', '$1', $string);
As a one-liner:
$result = preg_replace('/\s\p{Ll}\b|\b\p{Ll}\s/u', '', $subject);
This matches a single lowercase letter (\p{Ll}) which is preceded or followed by whitespace (\s), removing both. The word boundaries (\b) ensure that only single letters are indeed matched. The /u modifier makes the regex Unicode-aware.
The result: A single letter surrounded by spaces on both sides is reduced to a single space. A single letter preceded by whitespace but not followed by whitespace is removed completely, as is a single letter only followed but not preceded by whitespace.
So
This a is my test sentence a. o How funny (what a coincidence a) this is!
is changed to
This is my test sentence. How funny (what coincidence) this is!
You could try something like this:
preg_replace('/\b\S\s\b/', "", $subject);
This is what it means:
\b # Assert position at a word boundary
\S # Match a single character that is a “non-whitespace character”
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
\b # Assert position at a word boundary
Update
As raised by Radu, because I've used the \S this will match more than just a-zA-Z. It will also match 0-9_. Normally, it would match a lot more than that, but because it's preceded by \b, it can only match word characters.
As mentioned in the comments by Tim Pietzcker, be aware that this won't work if your subject string needs to remove single characters that are followed by non word characters like test a (hello). It will also fall over if there are extra spaces after the single character like this
test a hello
but you could fix that by changing the expression to \b\S\s*\b
Try this one:
$sString = preg_replace("#\b[a-z]{1}\b#m", ' ', $sString);
Can someone explain what this function
preg_replace('/&\w;/', '', $buf)
does? I have looked at various tutorials and found that it replaces the pattern /&\w;/ with string ''. But I can't understand the pattern /&\w;/. What does it represent?
Similarly in
preg_match_all("/(\b[\w+]+\b)/", $buf, $words)
I can't understand what does the string "/(\b[\w+]+\b)/" represents.
Please help. Thanks in advance :)
The explanation of your first expression is simple, it is:
& # Match the character “&” literally
\w # Match a single character that is a “word character” (letters, digits, and underscores)
; # Match the character “;” literally
The second one is:
( # Match the regular expression below and capture its match into backreference number 1
\b # Assert position at a word boundary
[\w+] # Match a single character present in the list below
# A word character (letters, digits, and underscores)
# The character “+”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\b # Assert position at a word boundary
)
The preg_replace function makes use of regular expressions. Regular expressions allow you to find patterns in text in a really powerful way.
To be able to use functions like preg_replace or preg_match I recommend you to take a look first at how regular expressions work.
You can gather a lot of info on this site http://www.regular-expressions.info/
And you can use software tools to help you understand the regex (like RegexBuddy)
In regular expressions, \w stands for any "word" character. That is: a-z, A-Z, 0-9 and underscore. \b stands for "word boundary", that is the beginning and end of a word (a series of word characters).
So, /&\w;/ is a regular expression to match the & sign, followed by a series of word characters, followed by a ;. For example, &foobar; would match, and preg_replace will replace it with an empty string.
In that same manner, /(\b[\w+]+\b)/ matches a word boundary, followed by multiple word characters, followed by another word boundary. The words are captured separately using the parenthesis. So, this regular expression will simply return the words in a string as an array.
I'm still kinda new to using Regular Expressions, so here's my plight. I have some rules for acceptable usernames and I'm trying to make an expression for them.
Here they are:
1-15 Characters
a-z, A-Z, 0-9, and spaces are acceptable
Must begin with a-z or A-Z
Cannot end in a space
Cannot contain two spaces in a row
This is as far as I've gotten with it.
/^[a-zA-Z]{1}([a-zA-Z0-9]|\s(?!\s)){0,14}[^\s]$/
It works, for the most part, but doesn't match a single character such as "a".
Can anyone help me out here? I'm using PCRE in PHP if that makes any difference.
Try this:
/^(?=.{1,15}$)[a-zA-Z][a-zA-Z0-9]*(?: [a-zA-Z0-9]+)*$/
The look-ahead assertion (?=.{1,15}$) checks the length and the rest checks the structure:
[a-zA-Z] ensures that the first character is an alphabetic character;
[a-zA-Z0-9]* allows any number of following alphanumeric characters;
(?: [a-zA-Z0-9]+)* allows any number of sequences of a single space (not \s that allows any whitespace character) that must be followed by at least one alphanumeric character (see PCRE subpatterns for the syntax of (?:…)).
You could also remove the look-ahead assertion and check the length with strlen.
make everything after your first character optional
^[a-zA-Z]?([a-zA-Z0-9]|\s(?!\s)){0,14}[^\s]$
The main problem of your regexp is that it needs at least two characters two have a match :
one for the [a-zA-Z]{1} part
one for the [^\s] part
Beside this problem, I see some parts of your regexp that could be improved :
The [^\s] class will match any character, except spaces : a dot or semi-colon will be accepted, try to use the [a-zA-Z0-9] class here to ensure the character is a correct one.
You can delete the {1} part at the beginning, as the regexp will match exactly one character by default