What is going on with this preg_replace? - php

What is going on with the \w character type? At the moment it outputs an array called $replace that has all the names, but with only the first letter of each first name. I don't really understand what it's doing to get to this point. \w is any word character, but that doesn't help me.
<?php
$rappers = array('Drake Themotto', 'Tom Ford', 'Lil Wayne');
$replace = preg_replace('/(\w)\w* (\w)/', '\1 \2', $rappers);
print_r($replace);
?>

From left to right your regex contains:
A group with one word character
Zero or more word characters
A space
A group with one word character
For "Drake Themotto" this means:
The first group \1 will be "D"
The following word characters "rake" match but will not be stored
The space will not be stored
The second group \2 will be "T"
For the replacement this means that the matching part of your string is "Drake T". This matching string will be replaced by "\1 \2", which is "D T" in this case.
After that, there are some other characters, "hemotto". Your regex does not mention them, and it contains neither a $ to anchor the match at the end of the string (in which case the regex would not match at all) nor another \w* to match (and here: remove) the remaining characters, so this rest is simply ignored. Because you only replace the part that matched, "ignored" means that nothing is replaced here and the rest is kept unchanged after the replacement, giving "D Themotto".
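The walkthrough above can be checked with a short sketch. The second pattern, with a trailing \w*, is an illustration of the "another \w*" case mentioned in the answer and is not part of the original question; the variable names are made up for this example.

```php
<?php
$rappers = array('Drake Themotto', 'Tom Ford', 'Lil Wayne');

// Original pattern: the match covers the first name, the space and the
// first letter of the last name, so the rest of the last name is untouched.
$initialOnly = preg_replace('/(\w)\w* (\w)/', '\1 \2', $rappers);
print_r($initialOnly); // D Themotto, T Ford, L Wayne

// With a trailing \w* the match also covers the rest of the last name,
// so that part is replaced (removed) as well.
$bothInitials = preg_replace('/(\w)\w* (\w)\w*/', '\1 \2', $rappers);
print_r($bothInitials); // D T, T F, L W
```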

Related

How do I escape the brackets in a mysql REGEXP [duplicate]

I have a regular expression to escape all special characters in a search string. This works great, however I can't seem to get it to work with word boundaries. For example, with the haystack
add +
or
add (+)
and the needle
+
the regular expression /\+/gi matches the "+". However the regular expression /\b\+/gi doesn't. Any ideas on how to make this work?
Using
add (plus)
as the haystack and /\bplus/gi as the regex, it matches fine. I just can't figure out why the escaped characters are having problems.
\b is a zero-width assertion: it doesn't consume any characters, it just asserts that a certain condition holds at a given position. A word boundary asserts that the position is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one. (A "word character" is a letter, a digit, or an underscore.) In your string:
add +
...there's a word boundary at the beginning because the a is not preceded by a word character, and there's one after the second d because it's not followed by a word character. The \b in your regex (/\b\+/) is trying to match between the space and the +, which doesn't work because neither of those is a word character.
Try changing it to:
/\b\s?\+/gi
Edit:
Extend this concept as far as you want. If you want the first + after any word boundary:
/\b[^+]*\+/gi
Boundaries are very conditional assertions; what they anchor depends on what they touch. See this answer for a detailed explanation, along with what else you can do to deal with it.
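A rough way to see where the boundary does and does not exist, sketched in PHP (the question uses /gi flags from another regex flavour; PHP has no g flag, so preg_match() is used here only to report whether a match exists). The "add+" case is an extra example, not from the question.

```php
<?php
// preg_match() returns 1 if the pattern matches somewhere, 0 if it does not.
$plain     = preg_match('/\+/i', 'add +');         // literal plus is found
$bounded   = preg_match('/\b\+/i', 'add +');       // no boundary between " " and "+"
$inParens  = preg_match('/\b\+/i', 'add (+)');     // no boundary between "(" and "+"
$word      = preg_match('/\bplus/i', 'add (plus)'); // boundary between "(" and "p"
$glued     = preg_match('/\b\+/i', 'add+');        // boundary between "d" and "+"
$withSpace = preg_match('/\b\s?\+/i', 'add +');    // boundary after "d", optional space, then "+"

var_dump($plain, $bounded, $inParens, $word, $glued, $withSpace);
// int(1) int(0) int(0) int(1) int(1) int(1)
```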

Regex - Escaping square brackets along with boundary

I have a website where users can have custom actions when a keyword is detected in a sentence. How I currently do matches is like the following:
$output = array();
preg_match('/\b' . $keyword . '\b/', $phrase, $output);
If I find a match, if(count($output) > 0) {, then the custom action is run. This is for spoken sentences, so it is for things like operator. We also have a custom one called [silence], so when silence is detected it runs an action.
However when the keyword contains brackets for example: [silence] the regex fails because it has square brackets. I have tried escaping both like \b\[silence\]\b However this does not detect a match.
Also this is in PHP
Thanks in advance,
Joe
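The "word boundary" expression \b only matches between a word character and a non-word character, and neither [ nor ] is a word character (they are not letters, digits or underscores). So \b\[ can only match if [ directly follows a word character, and \]\b only if ] is directly followed by one, which is not the case when [silence] is surrounded by spaces.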
From Regex tutorial :
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a “whole words only” search using a regular expression in the form of \bword\b. A “word character” is a character that can be used to form words. All characters that are not “word characters” are “non-word characters”.
So you need to "rewrite" the \b expression that can suit your need, like :
(?<=[\s\.,;])\[silence\](?=[\s\.,;])
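First, a lookbehind for a delimiter character that is not part of the match (space, dot, comma, ...; you probably need to add a few more), followed by your expression, followed by a lookahead for a delimiter character, again not part of the match. Note that, as written, this requires a delimiter on both sides, so it will not match a keyword at the very start or end of the phrase.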
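One possible variation, sketched under assumptions: instead of listing allowed delimiters, it uses negative lookarounds ("no word character directly before or after"), which also works at the start and end of the phrase, and it runs the keyword through preg_quote() so the brackets are escaped automatically. The function name containsKeyword is made up for this example, and this is a different technique from the positive lookarounds shown above.

```php
<?php
// Hypothetical helper: true if $keyword occurs in $phrase with no word
// character directly before or after it.
function containsKeyword(string $keyword, string $phrase): bool
{
    // preg_quote() escapes regex metacharacters such as [ and ];
    // the second argument also escapes the / delimiter.
    $pattern = '/(?<!\w)' . preg_quote($keyword, '/') . '(?!\w)/';
    return preg_match($pattern, $phrase) === 1;
}

var_dump(containsKeyword('[silence]', 'hello [silence] operator')); // bool(true)
var_dump(containsKeyword('[silence]', '[silence]'));                // bool(true)
var_dump(containsKeyword('operator', 'the operator, please'));      // bool(true)
var_dump(containsKeyword('operator', 'cooperators'));               // bool(false)
```

Unlike a delimiter list, this accepts any non-word character as a neighbour (for example a dash or a quote), which may or may not be what you want.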

Php Regex to insert character after first all-capital letter word in a string

I'm trying to use a preg_replace or similar php function to:
- identify the first all capital letter word in a string,
- and insert a character directly after it (a dash or semi-colon will do)
- the all capital letter word should be 3 characters long or more.
So far I have the regular expression:
/(?<!\ )([^A-Z{3,}])/
But, this isn't working in terms of only words that are 3+ characters. I'm also not sure I have it 'strictly' only looking at the very first word.
I believe that once I have the regex sorted out - this
$string = "LONDON On November 12th twelve people...";
$replaced_string = preg_replace('/myregex/',': ', $string);
will output as the following
LONDON: On November 12th twelve people...
It's a fairly simple regex, really:
$replacedString = preg_replace('/\b([A-Z]{3,})\b/', '$1:', $string);
It works like this:
\b: word boundary. This detects the start and end of a "word"
([A-Z]{3,}): Match 3 or more upper-case characters. The parentheses capture this part of the match, so we can use it in the replacement string
\b: Another word boundary
Replace this match with:
'$1:': the $1 refers back to the first captured group (the 3 or more upper-case characters). To this, we're adding a colon. The space that already follows the word in the string is not part of the match, so it stays where it is; adding another space in the replacement would double it. That will be our replacement string
This will add the colon after all upper-case words of 3 or more characters. To replace only 1 word, just pass a limit to preg_replace:
$replaced = preg_replace('/\b([A-Z]{3,})\b/', '$1:', $string, 1);
Where that last argument is the maximum number of matches you wish to replace. -1 for all, 1 for 1, 2 for 2, etc...
Judging by your sample string, the upper-case words are city names. It's possible for city names to contain a dash, or even a space. To address this, you might want to match all strings containing upper-case chars, dashes and spaces:
$replaceAll = preg_replace('/\b([A-Z -]{2,}[A-Z])\b/', '$1:', $string);
What changed:
([A-Z -]{2,}: The captured match starts with upper-case chars (2 or more, not 3), but also matches spaces and dashes.
[A-Z]): The last character of the captured group must be an upper-case character; this avoids capturing the trailing spaces or dashes. The result is that we capture stuff like "NEW YORK" or "FOO-TOWN", but not "ON - Something".
The rest is the same as before. If you want to allow for other characters that might occur (like a dot), just add them to the first part of the capturing group. The most complete pattern will probably be something like this:
$replaced = preg_replace('/\b([A-Z][A-Z .-]+[A-Z])\b/', '$1:', $string);
This ensures the captured group starts and ends with an upper-case character, and contains any number of upper-case chars, spaces, dots and dashes in between. So this will match something like "ST. LEWIS", too
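A small sketch of how the limit argument and the multi-word pattern behave. The replacement here is '$1:' (colon only), so that the space already present after the word is kept rather than doubled; the second and third input strings are made-up examples.

```php
<?php
$string = "LONDON On November 12th twelve people...";

// Limit of 1: only the first all-caps word of 3+ letters gets the colon.
$first = preg_replace('/\b([A-Z]{3,})\b/', '$1:', $string, 1);
echo $first, "\n"; // LONDON: On November 12th twelve people...

// Without a limit every such word is changed; with a limit of 1 only the first.
$two  = "LONDON On November NASA said";
$all  = preg_replace('/\b([A-Z]{3,})\b/', '$1:', $two);
$once = preg_replace('/\b([A-Z]{3,})\b/', '$1:', $two, 1);
echo $all, "\n";  // LONDON: On November NASA: said
echo $once, "\n"; // LONDON: On November NASA said

// The multi-word pattern keeps "NEW YORK" together and stops before "On".
$city = preg_replace('/\b([A-Z][A-Z .-]+[A-Z])\b/', '$1:', "NEW YORK On November 12th", 1);
echo $city, "\n"; // NEW YORK: On November 12th
```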

Difference between regular expressions

I'm trying to work out what the differences are between these two:
preg_match('-^[^'.$inv.']+\.?$-' , $name
preg_match('-['.$inv.']-', $name
Thanks
To make it easier to exemplify, assume $inv = 'a'…
-^[^a]+\.?$- needs to match the whole string, because of the caret and the dollar signs. The string is expected to start with a character other than "a", followed by 0 or more characters that are still not "a"s. The last character in this string, however, can be a dot (hence the question mark after the dot)
-[a]- will match the first "a" in the string and it will stop looking as soon as it finds a match because you're using preg_match() and not preg_match_all().
Your first pattern does not make much sense with $inv = 'a', though, since the optional \. is redundant: a dot is already matched by [^a] (in plain English: a dot is already not an "a").
[EDIT] The first pattern can actually mean something when there's a dot in the character class.
First off, be careful with $inv: depending on its content, it could be possible to inject into the regular expression. To avoid that issue, use preg_quote().
That said, the first regex will be :
^ <-- the given string must begin with
[ <-- one of those characters
^ <-- inverse the accepted characters (instead of accepted characters, the following characters will be those that are not accepted)
$inv <-- characters
] <-- end of the list of characters (here not accepted characters)
+ <-- at least one character must be matched, more are accepted
\. <-- a '.'
? <-- the previous '.' isn't mandatory
$ <-- the given string must end here
If $inv = 'abc.' it will match:
def
def.
d
d.
It won't match:
., because the . isn't accepted by the [^abc.] group, even though there is \.? later, at least one character must be before a .
de.s, because the . isn't accepted in the [^abc.] group, it is only possible to have it at the end of the given string thanks to \.?
a
deb
testc
teskopkl;;[!##$b., because of the b
an empty string, at least one character must be matched with '[^'.$inv.']+'
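If $inv does not contain a dot, it could be simplified into '^[^'.$inv.']+$', since the character class then already matches a trailing dot (don't forget the preg_quote though). With a dot in $inv, as in the example above, the \.? is what allows the trailing dot.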
The second one will be:
[ <-- one of those characters
$inv <-- characters
] <-- end of the list of characters (here accepted characters)
If $inv = 'abc.' it will match
any string containing at least one of the letters a, b, c or .
It won't match any string which doesn't contain a, b, c or ..
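The match lists above can be reproduced with a short sketch, assuming $inv = 'abc.' and using preg_quote() with the - delimiter as suggested. The variable names $noneOf, $anyOf and $results are made up for this example.

```php
<?php
$inv    = 'abc.';
$quoted = preg_quote($inv, '-');          // escapes the dot: abc\.

$noneOf = '-^[^' . $quoted . ']+\.?$-';   // first pattern from the question
$anyOf  = '-[' . $quoted . ']-';          // second pattern from the question

$results = array();
foreach (array('def', 'def.', 'd', 'd.', '.', 'de.s', 'a', 'deb', 'testc', '') as $name) {
    // Each entry holds [result of first pattern, result of second pattern].
    $results[$name] = array(preg_match($noneOf, $name), preg_match($anyOf, $name));
}
print_r($results);
```

For 'def' the first pattern matches and the second does not; for 'deb' it is the other way round; 'def.' is matched by both, because the trailing dot is allowed by \.? in the first pattern and is one of the listed characters in the second.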
In plain English, the first one is looking for an entire line which begins with one or more characters not included with the $inv string, and ending with an optional period.
The second one simply tries to match one character as specified by the value for $inv.
The first pattern matches a line containing none of the characters in $inv, optionally ending the line with a period.
The second pattern matches anything containing any of the characters in $inv.
- is the pattern delimiter, marking the beginning and end of the expression. It can technically be any character, but is most often /.
^ denotes the beginning of the string
[ ] encapsulates a set of characters to be matched
[^ ] encapsulates a set of characters that should not be matched, any other character is considered to be a match.
+ denotes that the previous character or set of characters should be matched one or more times.
. normally matches any character, which is why it is escaped as \. here to indicate a literal period character.
? denotes that the previous character should be matched zero or one time.
$ denotes the end of a string.
['.$inv.']
Let's go with the second one to begin with, since it's the simpler one.
This simply matches a string containing any single one of the characters contained within the string in the variable $inv.
It could contain anything else before or after that character from $inv.
^[^'.$inv.']+\.?$
Now the first one:
This matches a string that contains anything except the characters in $inv (the ^ inside the [] is a negative match).
The match that isn't part of $inv must be at the start of the string (the ^ outside the [] matches the start of the string).
The string can contain as many matching characters as it likes (one or more; that's the + sign after the [])
After that, it may optionally have a dot (the \.? is an optional dot character).
And nothing else after that (the $ matches the end of the string).
Note that in both cases, if $inv contains any regex reserved characters, it will fail (or do something unexpected). You should use preg_quote() to avoid this.
So... uh, they're completely different expressions. Not so much "what's the difference between them" as "what's the same about them". Answer: not much.
The first matches a string that consists, from start to end, only of characters not in $inv, followed by one or zero periods where the string must end.
The second matches any string containing at least one character from $inv.
Essentially they are near opposites, except the first allows for a possible . at the end.

Regex to remove single characters from string

Consider the following strings
breaking out a of a simple prison
this is b moving up
following me is x times better
All strings are lowercased already. I would like to remove any "loose" a-z characters, resulting in:
breaking out of simple prison
this is moving up
following me is times better
Is this possible with a single regex in php?
$str = "breaking out a of a simple prison
this is b moving up
following me is x times better";
$res = preg_replace("#\\b[a-z]\\b ?#i", "", $str);
echo $res;
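To see what this pattern does per string, here is a small sketch that applies it to each example separately. The 'plan b' input is an extra, made-up case showing that a single letter at the very end leaves a trailing space behind.

```php
<?php
$lines = array(
    'breaking out a of a simple prison',
    'this is b moving up',
    'following me is x times better',
);

// \b[a-z]\b matches a letter with no word character on either side;
// the optional space after it is removed too, so no double space is left.
$cleaned = preg_replace('#\b[a-z]\b ?#i', '', $lines);
print_r($cleaned);

// Caveat: at the end of the string there is no following space to remove,
// so the space before the letter stays.
$trailing = preg_replace('#\b[a-z]\b ?#i', '', 'plan b');
var_dump($trailing); // string(5) "plan "
```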
How about:
preg_replace('/(^|\s)[a-z](\s|$)/', '$1', $string);
Note this also catches single characters that are at the beginning or end of the string, but not single characters that are adjacent to punctuation (they must be surrounded by whitespace).
If you also want to remove characters immediately before punctuation (e.g. 'the x.'), then this should work properly in most (English) cases:
preg_replace('/(^|\s)[a-z]\b/', '$1', $string);
As a one-liner:
$result = preg_replace('/\s\p{Ll}\b|\b\p{Ll}\s/u', '', $subject);
This matches a single lowercase letter (\p{Ll}) which is preceded or followed by whitespace (\s), removing both. The word boundaries (\b) ensure that only single letters are indeed matched. The /u modifier makes the regex Unicode-aware.
The result: A single letter surrounded by spaces on both sides is reduced to a single space. A single letter preceded by whitespace but not followed by whitespace is removed completely, as is a single letter only followed but not preceded by whitespace.
So
This a is my test sentence a. o How funny (what a coincidence a) this is!
is changed to
This is my test sentence. How funny (what coincidence) this is!
You could try something like this:
preg_replace('/\b\S\s\b/', "", $subject);
This is what it means:
\b # Assert position at a word boundary
\S # Match a single character that is a “non-whitespace character”
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
\b # Assert position at a word boundary
Update
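As raised by Radu, because I've used \S this will match more than just a-zA-Z. It will also match 0-9 and _. Normally it would match a lot more than that, but because it sits between two \b assertions it mostly matches word characters. The exception is a non-word character that directly follows a word character and is followed by whitespace and a word, such as the "!" in "hi! there".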
As mentioned in the comments by Tim Pietzcker, be aware that this won't work if your subject string needs to remove single characters that are followed by non word characters like test a (hello). It will also fall over if there are extra spaces after the single character like this
test a   hello
but you could fix that by changing the expression to \b\S\s*\b
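A quick sketch of the difference between the two expressions, using made-up variable names and an input with three spaces after the single character:

```php
<?php
$single = 'test a hello';     // one space after the single character
$extra  = 'test a   hello';   // three spaces after the single character

// \s matches exactly one whitespace character, so the second \b must come
// right after it; with extra spaces there is no boundary at that position.
$singleFixed = preg_replace('/\b\S\s\b/', '', $single);
$extraKept   = preg_replace('/\b\S\s\b/', '', $extra);

// \s* consumes all the whitespace up to the next word.
$extraFixed  = preg_replace('/\b\S\s*\b/', '', $extra);

var_dump($singleFixed); // string(10) "test hello"
var_dump($extraKept);   // string(14) "test a   hello" (unchanged)
var_dump($extraFixed);  // string(10) "test hello"
```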
Try this one:
$sString = preg_replace("#\b[a-z]{1}\b#m", ' ', $sString);
