Regex: remove non-alphanumeric chars, multiple whitespaces and trim() all together - php

I have a $text from which I want to strip all non-alphanumeric characters, replace runs of whitespace and newlines with a single space, and remove leading and trailing whitespace.
This is my solution so far.
$text = '
some- text!!
for testing?
'; // $text to format
//strip off all non-alphanumeric chars
$text = preg_replace("/[^a-zA-Z0-9\s]/", "", $text);
//Replace multiple white spaces by single space
$text = preg_replace('/\s+/', ' ', $text);
//eliminate beginning and ending space
$finalText = trim($text);
/* result: $finalText ="some text for testing";
without non-alphanumeric chars, newline, extra spaces and trim()med */
Is it possible to achieve all of this with one regular expression, so that I get the desired result in one line, as below?
$finalText = preg_replace(some_reg_expression, $replaceby, $text);
thanks
Edit: clarified with a test string

Of course you can. That is very easy.
The re will look like:
((?<= )\s*)|[^a-zA-Z0-9\s]|(\s*$)|(^\s*)
I have no PHP at hand, I have used Perl (just to test the re and show that it works) (you can play with my code here):
$ cat test.txt
a b c d
a b c e f g fff f
$ cat 1.pl
while(<>) {
s/((?<= )\s*)|[^a-zA-Z0-9\s]|(\s*$)|(^\s*)//g;
print $_,"\n";
}
$ cat test.txt | perl 1.pl
a b c d
a b c e f g fff f
For PHP it will be the same.
What does the regex do?
((?<= )\s*) # all spaces that have at least one space before them
|
[^a-zA-Z0-9\s] # all non-alphanumeric characters
|
(\s*$) # all spaces at the end of string
|
(^\s*) # all spaces at the beginning of string
The only tricky part here is ((?<= )\s*), lookbehind assertion. You remove spaces if and only if the substring of spaces has a space before.
If you want to know how lookahead/lookbehind assertions work, please take a look at http://www.regular-expressions.info/lookaround.html.
Update from the discussion:
What happens when $text ='some ? ! ? text';?
Then the resulting string contains multiple spaces between "some" and "text".
This problem is not easy to solve, because it needs a positive lookbehind assertion of variable length, and that is not possible at the moment. Checking only for a preceding space is not enough, because the preceding character may be a non-alphanumeric character that is itself removed. For example, in " !" the "!" is removed, but the regex knows nothing about that when it reaches the next space. One would need something like (?<=[^a-zA-Z0-9\s]* )\s*, but that unfortunately does not work because PCRE does not support variable-length lookbehind assertions.
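The limitation described above can be reproduced in PHP. The following is a minimal sketch, assuming the pattern behaves in PHP's PCRE as it does in the Perl test; the variable names $single and $threeStep are illustrative only.

```php
<?php
// Input from the discussion: punctuation separated by single spaces.
$text = 'some ? ! ? text';

// Single-pattern attempt from this answer.
$single = preg_replace('/((?<= )\s*)|[^a-zA-Z0-9\s]|(\s*$)|(^\s*)/', '', $text);

// Three-step approach from the question.
$threeStep = preg_replace('/[^a-zA-Z0-9\s]/', '', $text);
$threeStep = preg_replace('/\s+/', ' ', $threeStep);
$threeStep = trim($threeStep);

var_dump($single);    // still contains a run of spaces between the words
var_dump($threeStep); // string(9) "some text"
```

The single pattern decides about each space by looking at the original input, where the neighbouring character is still punctuation, so the spaces around the removed characters survive.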

I do not think that you can achieve that with one regex. You would need an if/else condition, which is not possible with regular expressions alone.
You need one regex to remove the non-alphanumeric characters and another one to collapse the spaces, which is what you are already doing.

Check whether this is what you are looking for:
$patterns = array ('/[^a-zA-Z0-9\s]/','/\s+/');
$replace = array ("", ' ');
$finalText = trim( preg_replace($patterns, $replace, $text) );
It may need some modification; let me know whether this is what you want to do.

For your own sanity, you will want to keep regular expressions that you can still understand and edit later on :)
$text = trim(preg_replace(array(
"/[^a-zA-Z0-9\s]/", // remove all non-space, non-alphanumeric characters
'/\s+/', // collapse each run of whitespace (including a lone newline) into a single space
), array(
'',
' ',
), $originalText)); // trim last, so spaces exposed by removed punctuation are also stripped
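As a rough sketch, the array form can be wrapped in a small helper and checked against the test string from the question. The function name normalizeText() is hypothetical, and the scalar type declarations assume PHP 7 or later.

```php
<?php
// normalizeText() is a hypothetical helper name, not part of PHP.
function normalizeText(string $text): string
{
    $text = preg_replace(
        ['/[^a-zA-Z0-9\s]/', '/\s+/'], // patterns are applied in order
        ['', ' '],
        $text
    );
    return trim($text); // trim last, after punctuation at the edges is gone
}

$text = "\nsome- text!!\nfor testing?\n";
echo normalizeText($text), "\n"; // some text for testing
```

Trimming after the replacements matters for input such as '! foo', where removing the punctuation exposes a leading space.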

$text =~ s/([^a-zA-Z0-9\s].*?)//g;
Doesn't have to be any harder than this.

Related

PHP: Count number of spaces in a multiple space span

I'm scanning a form field entry ($text) for spaces and replacing the spaces with a blank spot using preg_replace.
$text=preg_replace('/\s/',' ',$text);
This works great, except when there are multiple consecutive spaces in a line. They are all treated as a blank.
I can use this if I know the amount of spaces there will be:
$text=preg_replace('/ {2,}/','**' ,$text);
However I will never be sure of how many spaces the input could be.
Sample Input 1: This is a test.
Sample Input 2: This is a test.
Sample Input 3: This is a test.
Using both preg_replace statements above I get:
Sample Output 1: This is a test.
Sample Output 2: This**is a test.
Sample Output 3: This**is a test.
How would I go about scanning the input for consecutive spaces, counting them and setting that count to a variable to place inside the preg_replace statement for multiple spaces?
Or is there another way of doing this that I am clearly missing?
*Note: Using &nbsp; for the replacement works to maintain the extra spaces, but I cannot replace every space with &nbsp;. When I do, it breaks the word-wrap in my output: the string never ends, so it wraps in the middle of words instead of before or after a word.
If you want to replace multiple spaces with a single space, you could use:
$my_result = preg_replace('!\s+!', ' ', $text);
You can use an alternation with two lookarounds to check if there's a whitespace before or after:
$text = preg_replace('~\s(?:(?=\s)|(?<=\s\s))~', '*', $text);
details:
\s # a whitespace
(?:
(?=\s) # followed by 1 whitespace
| # OR
(?<=\s\s) # preceded by 2 whitespaces (including the previous)
)
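A short sketch of the pattern above in use. The input is assumed to contain two consecutive spaces after "This"; $marked is an illustrative variable name.

```php
<?php
$text = 'This  is a test.'; // two spaces after "This"

// Each whitespace character that belongs to a run of two or more is replaced.
$marked = preg_replace('~\s(?:(?=\s)|(?<=\s\s))~', '*', $text);

echo $marked, "\n"; // This**is a test.
```

Single spaces are left alone because they are neither followed by whitespace nor preceded by two whitespace characters.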
Use preg_replace_callback to count the found spaces.
$text = 'This  is a test.';
print preg_replace_callback('/ {1,}/',function($a){
return str_repeat('*',strlen($a[0]));
},$text);
Result: This**is*a*test.
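The same callback idea can keep word-wrap working. This is only a sketch, assuming the output is HTML and the extra spaces should be rendered as &nbsp; entities: the first space of each run stays a normal, breakable space and only the remaining ones are converted. The variable name $html is illustrative.

```php
<?php
$text = 'This   is a test.'; // three spaces after "This"

$html = preg_replace_callback('/ {2,}/', function (array $m): string {
    // Keep one breakable space, turn the rest of the run into &nbsp;
    return ' ' . str_repeat('&nbsp;', strlen($m[0]) - 1);
}, $text);

echo $html, "\n"; // This &nbsp;&nbsp;is a test.
```

Because the pattern only matches runs of two or more spaces, single spaces between words are never touched.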

How to correctly replace multiple white spaces with a single white space in PHP?

I was scouring through SO answers and found that the solution that most gave for replacing multiple spaces is:
$new_str = preg_replace("/\s+/", " ", $str);
But in many cases the whitespace characters include Unicode characters such as line feed, form feed, carriage return, non-breaking space, etc. This wiki notes that Unicode defines twenty-five characters as whitespace.
So how do we replace all these characters as well using regular expressions?
When passing u modifier, \s becomes Unicode-aware. So, a simple solution is to use
$new_str = preg_replace("/\s+/u", " ", $str);
^^
See the PHP online demo.
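A minimal sketch of the u modifier at work, assuming the input is valid UTF-8. The non-breaking space is written as its UTF-8 byte sequence so the example does not depend on the source file's encoding.

```php
<?php
// "\xC2\xA0" is the UTF-8 encoding of U+00A0 (non-breaking space).
$str = "a\xC2\xA0 \xC2\xA0b";

$new_str = preg_replace('/\s+/u', ' ', $str);

var_dump($new_str); // string(3) "a b"
```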
The first thing to do is to read this explanation of how unicode can be treated in regex. Coming specifically to PHP, we need to first of all include the PCRE modifier 'u' for the engine to recognize UTF characters. So this would be:
$pattern = "/<our-pattern-here>/u";
The next thing is to note that in PHP unicode characters have the pattern \x{00A0} where 00A0 is hex representation for non-breaking space. So if we want to replace consecutive non-breaking spaces with a single space we would have:
$pattern = "/\x{00A0}+/u";
$new_str = preg_replace($pattern," ",$str);
And if we were to include other types of spaces mentioned in the wiki like:
\x{000D} carriage return
\x{000C} form feed
\x{0085} next line
Our pattern becomes:
$pattern = "/[\x{00A0}\x{000D}\x{000C}\x{0085}]+/u";
This pattern is not a performance problem: a character class followed by + is matched in one linear pass, without trying combinations of the characters.
An alternative that gives the same result is to first replace each occurrence of these characters with a normal space, and then replace multiple spaces with a single normal space. For the first step we drop the [ ]+ and instead separate the characters with the or operator | :
$pattern = "/\x{00A0}|\x{000D}|\x{000C}|\x{0085}/u";
$new_str = preg_replace($pattern," ",$str); // we have one-to-one replacement of character by a normal space, so 5 unicode chars give 5 normal spaces
$final_str = preg_replace("/\s+/", " ", $new_str); // multiple normal spaces now become single normal space
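A pattern that matches all Unicode whitespace is [\pZ\pC]. It is broader than whitespace alone: \pZ covers all Unicode separators, and \pC also covers control, format and other non-printing characters.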
If you're parsing user input in UTF-8 and need to normalize it, it's important to base your match on that list. So to answer your question that would be:
$new_str = preg_replace("/[\pZ\pC]+/u", " ", $str);
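The difference between [\pZ\pC] and \s shows up with characters that are invisible but not whitespace. The sketch below uses U+200B (zero width space, general category Cf) as an example; the variable names are illustrative.

```php
<?php
// "\xE2\x80\x8B" is U+200B ZERO WIDTH SPACE (category Cf, not whitespace).
$str = "a\xE2\x80\x8Bb";

$viaClass = preg_replace('/[\pZ\pC]+/u', ' ', $str);
$viaS     = preg_replace('/\s+/u', ' ', $str);

var_dump($viaClass);      // string(3) "a b"
var_dump($viaS === $str); // bool(true): \s leaves U+200B alone
```

Whether replacing such characters is desirable depends on the input, so the broader class is a design choice rather than a drop-in replacement for \s.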

PHP preg_replace skipping where match overlaps

I've been Googling this regex behavior all afternoon.
$str = ' b c d w i e f g h this string';
echo preg_replace('/\s[bcdefghjklmnopqrstuvwxyzBCDEFGHJKLMNOPQRSTUVWXYZ]{1}\s/', ' ', $str);
I want to remove all instances of a single character by itself (except for A and I) and leave one space in its place. This code appears to work on every other match. I suspect this is because the matches overlap each other.
I suspect a lookaround would be appropriate here, but I've never used them before and could use a snippet.
EDIT:
Just to avoid confusion about what I am trying to accomplish. I want to turn the above string into this:
$str = ' i this string';
Notice all single-letter characters that are NOT "A" or "I" have been removed.
You can use look-arounds instead. They are zero-length matches, and hence do not consume the spaces. The {1} quantifier is redundant, so you can remove it.
echo preg_replace('/(?<=\s)[bcdefghjklmnopqrstuvwxyzBCDEFGHJKLMNOPQRSTUVWXYZ](?=\s)/', '', $str);
You can make use of ranges and the case-insensitive flag (?i) here, to reduce the pain of typing all those characters:
echo preg_replace('/(?i)(?<=\s)[B-HJ-Z](?=\s)/', '', $str);
Word boundaries will also work here:
echo preg_replace('/(?i)\b[B-HJ-Z]\b/', '', $str);
You can try this:
/(?<=\s)[b-hj-zB-HJ-Z](?=\s)/
or modify the ranges if you don't need both i and I.
You can use this:
echo preg_replace('~(?i)\b[B-HJ-Z]\b~', ' ', $str);
Notice: instead of using spaces to delimit the single letter, I use word boundaries, which are the zero-width positions between a character from [a-zA-Z0-9_] and any other character. This is more general than a space and also covers, for example, punctuation symbols.
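To arrive at the exact output asked for in the question, one possible sketch removes the single letters and then collapses the leftover spaces. The second preg_replace call is an addition for illustration and is not part of the answers above.

```php
<?php
$str = ' b c d w i e f g h this string';

// Remove stand-alone letters other than A/I, then collapse the leftover spaces.
$out = preg_replace('~(?i)\b[B-HJ-Z]\b~', '', $str);
$out = preg_replace('/\s+/', ' ', $out);

var_dump($out); // string(14) " i this string"
```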

Regex to remove ALL single characters from a string

I need a Regular Expression to remove ALL single characters from a string, not just single letters or numbers
The string is:
"A Future Ft Casino Karate Chop ( Prod By Metro )"
it should come out as:
"Future Ft Casino Karate Chop Prod By Metro"
The expression I am using at the moment (in PHP), correctly removes the single 'A' but leaves the single '(' and ')'
This is the code I am using:
$string = preg_replace('/\b\w\b\s?/', '', $string);
Try this:
(^| ).( |$)
Breakdown:
1. (^| ) -> Beginning of line or space
2. . -> Any character
3. ( |$) -> Space or End of line
Actual code:
$string = preg_replace('/(^| ).( |$)/', '$1', $string);
Note: I'm not familiar with the workings of PHP regex, so the code might need a slight tweak depending on how the actual regex needs declared.
As m.buettner pointed out, there will be a trailing white space here with this code. A trim would be needed to clear it out.
Edit: Arnis Juraga pointed out that this would not clear out multiple consecutive single characters: a b c would be reduced to b. If this is an issue, use this regex:
(^| ).(( ).)*( |$)
The (( ).)* added to the middle will look for any space followed by any character, 0 or more times. The downside is that this will end up with double spaces where a series of single characters was located.
Meaning this:
The a b c dog
Will become this:
The dog
After performing the replacement to get single individual characters, you would need to use the following regex to locate the double spaces, then replace with a single space
( ){2}
A slightly more efficient version that does not require capturing would be using lookarounds. It's a bit less intuitive due to the multiple negative logic:
$string = preg_replace('/(?<!\S).(?!\S)\s*/', '', $input);
This will remove any character that is neither preceded nor followed by a non-whitespace character (so only those that are between whitespace or at the string boundaries). It will also include all trailing whitespace in the match, so as to leave only the preceding whitespace if there is any. The caveat is, that just like Nick's answer the ) at the end of the string will leave a trailing whitespace (because it is in front of the character). This can easily be solved by trimming the string.
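Applied to the string from the question, the lookaround version combined with trim() can be sketched as follows.

```php
<?php
$input = 'A Future Ft Casino Karate Chop ( Prod By Metro )';

// Remove every character that stands alone between whitespace or string
// boundaries, together with the whitespace after it, then trim the result.
$string = trim(preg_replace('/(?<!\S).(?!\S)\s*/', '', $input));

echo $string, "\n"; // Future Ft Casino Karate Chop Prod By Metro
```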

Regex of non breaking space in php

input:
$string = "a  b c  d   e";
I have a string in PHP and need to replace some of its spaces with the non-breaking space code.
output:
"a \xc2\xa0b c \xc2\xa0d \xc2\xa0\xc2\xa0e"
A single space, and the first space of any run, must not be replaced with \xc2\xa0.
When two spaces appear ("  "), the output is " \xc2\xa0": the first space is kept and the second is replaced.
When three spaces appear ("   "), the output is " \xc2\xa0\xc2\xa0": the first space is kept and the second and third are replaced.
The input string is arbitrary.
Any ideas using a regular expression or another PHP function?
Thank you very much.
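$str = preg_replace('/(?<= ) /', "\xc2\xa0", $str);
The lookbehind (?<= ) checks that a space precedes the match, and the pattern itself matches exactly one space. Only the matched space is replaced, not the one seen by the lookbehind, so the first space of each run is kept and every following space becomes its own non-breaking space, however long the run is. A quantifier such as {1,2} or + should be avoided here, because it would replace the whole matched run with a single \xc2\xa0.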
$s = preg_replace('~(?<= ) ~', "\xc2\xa0", $s);
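A quick check of the lookbehind approach against the expected output from the question. The input is assumed to contain runs of two, one, two and three spaces; $nbsp and $result are illustrative names.

```php
<?php
$string = 'a  b c  d   e'; // runs of 2, 1, 2 and 3 spaces
$nbsp   = "\xc2\xa0";      // UTF-8 bytes of U+00A0; double quotes are required

// Replace every space that is directly preceded by another space.
$result = preg_replace('~(?<= ) ~', $nbsp, $string);

var_dump($result === "a \xc2\xa0b c \xc2\xa0d \xc2\xa0\xc2\xa0e"); // bool(true)
```

The replacement must be written in a double-quoted string: in single quotes PHP does not interpret \xc2\xa0 as bytes, and the literal backslash sequence would be inserted instead.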
