PHP preg_replace skipping where match overlaps - php

I've been Googling this regex behavior all afternoon.
$str = ' b c d w i e f g h this string';
echo preg_replace('/\s[bcdefghjklmnopqrstuvwxyzBCDEFGHJKLMNOPQRSTUVWXYZ]{1}\s/', ' ', $str);
I want to remove all instances of a single character by itself (except for A and I) and leave one space in its place. This code appears to work on every other match. I suspect this is because the matches overlap each other.
I suspect a lookaround would be appropriate here, but I've never used them before and could use a snippet.
EDIT:
Just to avoid confusion about what I am trying to accomplish. I want to turn the above string into this:
$str = ' i this string';
Notice all single-letter characters that are NOT "A" or "I" have been removed.

You can use lookarounds instead. They are zero-length assertions, so they do not consume the surrounding spaces. Because the spaces are no longer part of the match, the replacement is an empty string. The {1} is redundant; you can remove it.
echo preg_replace('/(?<=\s)[bcdefghjklmnopqrstuvwxyzBCDEFGHJKLMNOPQRSTUVWXYZ](?=\s)/', '', $str);
You can use ranges and the case-insensitive flag (?i) to avoid typing all those characters:
echo preg_replace('/(?i)(?<=\s)[B-HJ-Z](?=\s)/', '', $str);
Word boundaries will also work here:
echo preg_replace('/(?i)\b[B-HJ-Z]\b/', '', $str);
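To make the overlap problem concrete, here is a minimal sketch that runs the question's string through the consuming pattern and through the lookaround version. The final whitespace-collapsing pass is an addition of this sketch (not part of the answer above), included because removing only the letter leaves its neighbouring spaces in place.

```php
<?php
$str = ' b c d w i e f g h this string';

// Consuming pattern: each match swallows the trailing space, so the next
// single letter has no leading space left to match against.
$consuming = preg_replace('/\s[b-hj-z]\s/i', ' ', $str);

// Lookarounds assert the surrounding whitespace without consuming it,
// so every single letter (except i/I) is found.
$stripped = preg_replace('/(?<=\s)[b-hj-z](?=\s)/i', '', $str);

// Assumption of this sketch: collapse the leftover runs of spaces
// in a second pass to get one space between words.
$collapsed = preg_replace('/\s+/', ' ', $stripped);

echo $consuming, "\n"; // " c w i f h this string" (every other letter survives)
echo $collapsed, "\n"; // " i this string"
```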

You can try this:
/(?<=\s)[b-hj-zB-HJ-Z](?=\s)/
or modify the ranges if you don't need both i and I.

You can use this:
echo preg_replace('~(?i)\b[B-HJ-Z]\b~', ' ', $str);
Notice: instead of using spaces to delimit the single letter, I use word boundaries, which are the zero-width positions between a character from [a-zA-Z0-9_] and any other character (or the start or end of the string). This is more general than a space and also covers, for example, punctuation symbols.

Related

Using preg_replace() with search words that may have special characters [duplicate]

Regular Expressions are completely new to me and having done much searching my expression for testing purposes is this:
preg_replace('/\b0.00%\b/','- ', '0.00%')
It yields 0.00% when what I want is - .
preg_replace('/\b0.00%\b/','- ', '50.00%') yields 50.00%, which is what I want, so this is fine.
But clearly the expression is not working, as in the first example it does not replace 0.00% with -.
I can think of workarounds with if(){} for testing length/content of string but presume the replace will be most efficient
The word boundary after % requires a word char (letter, digit or _) to appear right after it, so there is no replacement taking place here.
You need to replace the word boundaries with unambiguous boundaries defined with the help of (?<!\w) and (?!\w) lookarounds that will fail the match if the keywords are preceded or followed with word characters:
$value='0.00%';
$str = 'Price: 0.00%';
echo preg_replace('/(?<!\w)' . preg_quote($value, '/') . '(?!\w)/i', '- ', $str);
See the PHP demo
Output: Price: -
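A small sketch of the same idea wrapped in a function, checked against the two inputs from the question. The helper name replaceWholeValue is hypothetical and only used for illustration.

```php
<?php
// Hypothetical helper: replace a literal value only when it is not glued
// to a word character (letter, digit or _) on either side.
function replaceWholeValue(string $value, string $replacement, string $subject): string
{
    // preg_quote() escapes regex metacharacters such as "." in the value.
    $pattern = '/(?<!\w)' . preg_quote($value, '/') . '(?!\w)/';
    return preg_replace($pattern, $replacement, $subject);
}

$zero  = replaceWholeValue('0.00%', '- ', '0.00%');   // whole value: replaced
$fifty = replaceWholeValue('0.00%', '- ', '50.00%');  // preceded by "5": left alone
$other = replaceWholeValue('0.00%', '- ', '0x00%');   // "." is literal, so no match

echo $zero, "\n", $fifty, "\n", $other, "\n";
```

Unlike \b, the lookarounds do not require a word character on one side, so a value that ends in % still matches at the end of the string.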
preg_replace has three arguments as you probably already know. The regular expression pattern to match, the replacement value, and the string to search (in that order).
It appears that your pattern has word boundaries \b on either end of the value 0.00%, which are not really needed. The trailing \b is what prevents the match: % is not a word character, so there is no word boundary between % and the end of the string. Try it without the \b and use something like the start-of-string ^ and end-of-string $ anchors instead.

PHP - Regex Remove anything that is not alphanumeric by keeping some exception

I want to remove anything that is not alphanumeric regardless of lowercase or uppercase and replace with ' '. But some exceptions are there.
exceptions are
'.!?
But allowing single quote is a headache and I already searched a lot in Stack-overflow didn't find any answer for my requirement.
$text = preg_replace( '/[^\da-z !\' ?.]/i', ' ', $text );
I tried the above regex but it's replacing single quotes also. But i need to keep that and replace all other non alpha-numeral characters with empty space. Can somebody help me with this?
For eg:
$string_input = "So one of the secrets of producing link-worthy! * content is to write quality content that’s share-worthy!"
$string_output = "So one of the secrets of producing link worthy! content is to write quality content that’s share-worthy!"
You can use the NOT-pattern in regex:
<?php
echo implode(' ', preg_split('#[^a-z0-9\.\?\'!]#i', $input));
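This splits the string on every character that is not allowed and joins the pieces back together with a space. (preg_replace with the same character class and ' ' as the replacement gives the same result.)
Explaining the regex:
# is the delimiter
[] defines a character class
^ at the start of the class negates it: any character NOT listed is matched
a-z the letters a to z
0-9 the digits 0 to 9
\. \? \' ! the punctuation to keep (the escapes are harmless; \' is needed only because the PHP string is single-quoted)
i flag to make the match case-insensitive.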
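For comparison, a minimal sketch of the same allow-list applied with preg_replace. The sample strings are made up for illustration. Note that the question's sample text contains a typographic apostrophe (U+2019), which is a different character from the ASCII single quote; the second call assumes you want to keep it too and therefore adds it to the class together with the u (UTF-8) flag.

```php
<?php
$text = "link-worthy! * content that's share-worthy?";

// Replace every character outside the allow-list with a space. Inside a
// character class, ".", "?" and "!" are literal and need no escaping.
$clean = preg_replace("/[^a-z0-9 .!?']/i", ' ', $text);

// Same idea, also keeping the curly apostrophe U+2019 (needs the u flag).
$curly      = "that\u{2019}s share-worthy";
$cleanCurly = preg_replace("/[^a-z0-9 .!?'\u{2019}]/iu", ' ', $curly);

echo $clean, "\n";
echo $cleanCurly, "\n";
```

Each removed character becomes its own space, so "! * c" ends up with three spaces in a row; collapse them afterwards if that matters.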

How to replace all occurrences of \ only if they are not followed by n?

Seems to be relatively simple task that gets me stuck in one PHP application. I have a string which has a bunch of \n in it. Those are fine and need to stay there. However, there are also single occurrences of the character \ and those I need to replace or remove, let's just say with empty character without removing the ones that are followed by n.
The best I came up with was to first replace all \n with something else, some weird character, then replace the remaining \ with empty space and then convert back the weird character to \n. However, that seems to be a waste of time and resources, besides, nothing guarantees me that I'll find weird enough character that will never be encountered in the rest of the string...
Any tips?
You need
$s = preg_replace('~\\\\(?!n)~', '', $s);
See the PHP demo:
$s = '\\n \\t \\';
$s = preg_replace('~\\\\(?!n)~', '', $s);
echo $s; // => \n t
We need 4 backslashes to pass 2 literal backslashes to the regex engine to match 1 literal backslash in the input string. The (?!n) is a negative lookahead that fails all matches of a backslash that is immediately followed with n.
You should be able to do this with a negative lookahead assertion:
\\(?!n)
The \\ looks for the backslash, the (?!n) asserts that the next character is not an n, but does not match the character.
To use this in PHP:
$text = 'foo\nbar\nb\az\n';
$newtext = preg_replace('/\\\\(?!n)/', '', $text);
Details here: https://regex101.com/r/F2qhAP/1
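As a quick check of the snippet above, this sketch runs it and compares the result; the only thing added is the expected value.

```php
<?php
// Single-quoted, so every backslash below is a literal character.
$text = 'foo\nbar\nb\az\n';

// Four backslashes in PHP source become two in the pattern, which match
// one literal backslash; (?!n) rejects it when the next character is "n".
$newtext = preg_replace('/\\\\(?!n)/', '', $text);

echo $newtext, "\n"; // foo\nbar\nbaz\n
```

Only the backslash before "a" is removed; the three \n sequences are untouched.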

Regex: remove non-alphanumeric chars, multiple whitespaces and trim() all together

I have a $text to strip off all non-alphanumeric chars, replace multiple white spaces and newline by single space and eliminate beginning and ending space.
This is my solution so far.
$text = '
some- text!!
for testing?
'; // $text to format
//strip off all non-alphanumeric chars
$text = preg_replace("/[^a-zA-Z0-9\s]/", "", $text);
//Replace multiple white spaces by single space
$text = preg_replace('/\s+/', ' ', $text);
//eliminate beginning and ending space
$finalText = trim($text);
/* result: $finalText ="some text for testing";
without non-alphanumeric chars, newline, extra spaces and trim()med */
Is it possible to combine/achieve all these in one regular expression? as I would get the desired result in one line as below
$finalText = preg_replace(some_reg_expression, $replaceby, $text);
thanks
Edit: clarified with a test string
Of course you can. That is very easy.
The re will look like:
((?<= )\s*)|[^a-zA-Z0-9\s]|(\s*$)|(^\s*)
I have no PHP at hand, I have used Perl (just to test the re and show that it works) (you can play with my code here):
$ cat test.txt
a b c d
a b c e f g fff f
$ cat 1.pl
while(<>) {
s/((?<= )\s*)|[^a-zA-Z0-9\s]|(\s*$)|(^\s*)//g;
print $_,"\n";
}
$ cat test.txt | perl 1.pl
a b c d
a b c e f g fff f
For PHP it will be the same.
What does the RE do?
((?<= )\s*) # all spaces that have at least one space before them
|
[^a-zA-Z0-9\s] # all non-alphanumeric characters
|
(\s*$) # all spaces at the end of string
|
(^\s*) # all spaces at the beginning of string
The only tricky part here is ((?<= )\s*), lookbehind assertion. You remove spaces if and only if the substring of spaces has a space before.
When you want to know how lookahead/lookbehind assertions work, please take a look at http://www.regular-expressions.info/lookaround.html.
Update from the discussion:
What happens when $text ='some ? ! ? text';?
Then the resulting string contains multiple spaces between "some" and "text".
It is not so easy to solve the problem, because one needs positive lookbehind assertions of variable length, and that is not possible at the moment. One cannot simply check for spaces, because the preceding character may be a non-alphanumeric character that will itself be removed (for example, in " !" the "!" sign will be removed, but the RE knows nothing about that). One needs something like (?<=[^a-zA-Z0-9\s]* )\s*, but that unfortunately will not work because PCRE does not support variable-length lookbehind assertions.
I do not think that you can achieve that with one regex. You would basically need to stick in an if/else condition, which is not possible through regular expressions alone.
You would basically need one regex to remove non-alphanumeric characters and another one to collapse the spaces, which is basically what you are already doing.
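A minimal sketch of that two-step approach, using the two patterns from the question and run against the edge case discussed above. The function name normalizeText is hypothetical.

```php
<?php
// Hypothetical helper; the two patterns are the ones from the question.
function normalizeText(string $text): string
{
    $text = preg_replace('/[^a-zA-Z0-9\s]/', '', $text); // drop non-alphanumerics
    $text = preg_replace('/\s+/', ' ', $text);           // collapse whitespace runs
    return trim($text);                                  // strip leading/trailing space
}

$edgeCase = normalizeText('some ? ! ? text');
$sample   = normalizeText("\nsome- text!!\nfor testing?\n");

echo $edgeCase, "\n"; // some text
echo $sample, "\n";   // some text for testing
```

Because the whitespace is collapsed only after the punctuation is gone, the spaces that surrounded "? ! ?" merge into one.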
Check whether this is what you are looking for:
$patterns = array('/[^a-zA-Z0-9\s]/', '/\s+/');
$replace = array('', ' ');
$finalText = trim(preg_replace($patterns, $replace, $text));
It may need some modification; just let me know if this is what you want to do.
For your own sanity, you will want to keep regular expressions that you can still understand and edit later on :)
$text = trim(preg_replace(array(
"/[^a-zA-Z0-9\s]/", // remove all non-space, non-alphanumeric characters
'/\s+/', // replace each run of white space (including newlines) with a single space
), array(
'',
' ',
), $originalText)); // trim last: removing characters can expose leading/trailing spaces
$text =~ s/([^a-zA-Z0-9\s].*?)//g;
Doesn't have to be any harder than this.

Remove number then a space from the start of a string

How would I go about removing numbers and a space from the start of a string?
For example, from '13 Adam Court, Cannock' remove '13 '
Because everyone else is going the \d+\s route I'll give you the brain-dead answer
$str = preg_replace("#^[0-9]+ #","",$str);
Word to the wise: don't use / as your delimiter in regex; you will experience the dreaded leaning-toothpick problem when trying to match file paths or something like http://
:)
Use the same regex I gave in my JavaScript answer, but apply it using preg_replace():
preg_replace('/^\d+\s+/', '', $str);
Try this one :
^\d+ (.*)$
Like this :
preg_replace('/^\d+ (.*)$/', '$1', $string);
Resources :
preg_replace
regular-expressions.info
On the same topic :
Regular expression to remove number, then a space?
regular expression for matching number and spaces.
I'd use
/^\d+\s+/
It looks for a number of any size in the beginning of a string ^\d+
Then looks for a patch of whitespace after it \s+
When you use a backslash before certain letters it represents something...
\d represents a digit 0,1,2,3,4,5,6,7,8,9.
\s represents a whitespace character (space, tab, newline).
Add a plus sign (+) to the end and you can have...
\d+ a series of digits (number)
\s+ multiple spaces (typos etc.)
The same regex I gave you on your other question still applies. You just have to use preg_replace() instead.
Search for /^[\s\d]+/ and replace with the empty string. Eg:
$str = preg_replace('/^[\s\d]+/', '', $str);
This will remove digits and spaces in any order from the beginning of the string. For something that removes only a number followed by spaces, see BoltClock's answer.
If the input strings all have the same expected format and you will get the same result from left-trimming all numbers and spaces (no matter the order of their occurrence at the front of the string), then you don't actually need to fire up the regex engine.
I love regex, but know not to use it unless it provides a valuable advantage over a non-regex technique. Regex is often slower than non-regex techniques.
Use ltrim() with a character mask that includes spaces and digits.
Code: (Demo)
var_export(
ltrim('420 911 90210 666 keep this part', ' 0..9')
);
Output:
'keep this part'
It wouldn't matter if the string started with a space either. ltrim() will greedily remove all instances of spaces or numbers from the start of the string until it can't anymore.
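To show where the two approaches agree and where they differ, here is a short sketch comparing the anchored regex with ltrim(). The second input string is a made-up example, not taken from the question.

```php
<?php
$address = '13 Adam Court, Cannock';

// Anchored regex: one number followed by whitespace, at the start only.
$viaRegex = preg_replace('/^\d+\s+/', '', $address);

// ltrim() with a character mask: strips any mix of spaces and digits.
$viaLtrim = ltrim($address, ' 0..9');

// Hypothetical input where a second number follows the first.
$multi      = '13 21 Jump Street';
$multiRegex = preg_replace('/^\d+\s+/', '', $multi); // stops after "13 "
$multiLtrim = ltrim($multi, ' 0..9');                // strips "13 21 "

echo $viaRegex, "\n", $viaLtrim, "\n", $multiRegex, "\n", $multiLtrim, "\n";
```

For the question's address both give the same result; pick the regex if only the first number should go.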
