Regular expressions - remove all non-alpha-numeric characters CRLF problem


First off, if it's not clear from the tag, I'm doing this in PHP - but that probably doesn't matter much.
I have this code:
$inputStr = strip_tags($inputStr);
$inputStr = preg_replace("/[^a-zA-Z\s]/", " ", $inputStr);
This seems to remove all HTML tags and virtually all special and non-alphabetic characters perfectly. The one problem is that, for some reason, it doesn't filter out carriage return/line feed pairs (just that combination).
If I add this line:
$inputStr = preg_replace("/\s+/", " ", $inputStr);
at the end, however, it works great. Can someone tell me:
Why doesn't the first preg_replace filter out the CR/LFs?
What is this second preg_replace actually doing? I understand the first one for the most part, but the second one confuses me: it works, but I don't know why.
Can I combine them into 1 line somehow?

You told it to remove everything except letters and whitespace. Newlines are whitespace, so they don't get removed. You could use \h instead of \s so that only horizontal whitespace (spaces and tabs) is kept.
It simply means "replace every sequence of one or more whitespace characters (\s+) with a single space."
preg_replace("/[^A-Za-z]+/", " ", ...) might do.
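A minimal sketch of the three points above, using a made-up input string (the sample text is an assumption, not taken from the question). It shows the CR/LF pair surviving the original pattern and the single-pattern version collapsing every run of non-letters:

```php
<?php
// Hypothetical sample input: a tag, punctuation, digits and a CRLF pair.
$inputStr = strip_tags("<p>Hello,\r\nworld! 123</p>"); // "Hello,\r\nworld! 123"

// Original pattern: \s inside the negated class means whitespace is kept,
// so the "\r\n" pair passes through untouched.
$kept = preg_replace("/[^a-zA-Z\s]/", " ", $inputStr);
var_dump(strpos($kept, "\r\n") !== false); // bool(true)

// Combined pattern: every run of non-letters (CRLF included) becomes one space.
$combined = preg_replace("/[^A-Za-z]+/", " ", $inputStr);
var_dump($combined); // string(12) "Hello world "
```

Note the trailing space in the combined result; wrapping the call in trim() would remove it if that matters.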

Your first regex is removing all characters that are not letters or whitespace. CRLFs are whitespace, so they aren't filtered out.
The second one is replacing whitespace with a space character. Essentially it condenses sequences of whitespace into a single space (due to the quantifier being greedy).
I suggest removing the \s from the first regex, see if that works.

\s matches whitespace such as \n.
It is replacing all whitespace characters with a space.
You could make it one unreadable line, but probably not one regex.

Related

Regex Matches white space but not tab (php)

How do I write a regex that matches whitespace but not tabs or newlines?
Thanks, everyone.
[[:blank:]]{2,} <-- This isn't good for me, because it matches spaces or tabs but not newlines.

As per my original comment, you can use this.
Code
See regex in use here
Note: The link contains whitespace characters: tab, newline, and space. Only space is matched.
[^\S\t\n\r]
So your regex would be [^\S\t\n\r]{2,}
Explanation
[^\S\t\n\r] Match any character not present in the set.
\S Matches any non-whitespace character. Since it's a double negative it will actually match any whitespace character. Adding \t, \n, and \r to the negated set ensures we exclude those specific characters as well. Basically, this regex is saying:
Match any whitespace character except \t\n\r
This principle in regex is often used with word characters \w to negate the underscore _ character: [^\W_]
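As a rough illustration of the double-negative class, here is a small sketch with an invented test string (the sample values are assumptions). Only runs of plain spaces are collapsed; tabs and newlines are left alone:

```php
<?php
// Hypothetical sample: double space, double tab, double newline, triple space.
$text = "a  b\t\tc\n\nd   e";

// [^\S\t\n\r] = whitespace that is not a tab, LF or CR; {2,} = two or more.
$result = preg_replace('/[^\S\t\n\r]{2,}/', ' ', $text);
var_dump($result === "a b\t\tc\n\nd e"); // bool(true)

// Same principle with \W: [^\W_] = word characters except the underscore.
$masked = preg_replace('/[^\W_]+/', '#', 'foo_bar1');
var_dump($masked); // string(3) "#_#"
```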

[ ]{2,} works normally (not sure about php)
or even / {2,}/

regex: remove all text within "double-quotes" (multiline included)

I'm having a hard time removing text within double-quotes, especially those spread over multiple lines:
$file=file_get_contents('test.html');
$replaced = preg_replace('/"(\n.)+?"/m','', $file);
I want to remove ALL text within double-quotes (included). Some of the text within them will be spread over multiple lines.
I read that newlines can be \r\n and \n as well.

Try this expression:
"[^"]+"
Also make sure you replace globally (usually with a g flag; in PHP, preg_replace already replaces every match by default).

Another edit: daalbert's solution is best: a quote followed by one or more non-quotes ending with a quote.
I would make one slight modification if you're parsing HTML: make it 0 or more non-quote characters...so the regex will be:
"[^"]*"
EDIT:
On second thought, here's a better one:
"[\S\s]*?"
This says: "a quote followed by either a non-whitespace character or white-space character any number of times, non-greedily, ending with a quote"
The one below uses capture groups when it isn't necessary...and the use of a wildcard here isn't explicit about showing that wildcard matches everything but the new-line char...so it's more clear to say: "either a non-whitespace char or whitespace char" :) -- not that it makes any difference in the result.
there are many regexes that can solve your problem but here's one:
"(.*?(\s)*?)*?"
this reads as:
find a quote optionally followed by: (any number of characters that are not new-line characters non-greedily, followed by any number of whitespace characters non-greedily), repeated any number of times non-greedily
Greedy means it will go to the end of the string and try matching there. If it can't find a match, it backs up one character from the end and tries again, and so on. Non-greedy means it will consume as few characters as possible while still satisfying the pattern.
great link on regex: http://www.regular-expressions.info
great link to test regexes: http://regexpal.com/
Remember that your regex may have to change slightly based on what language you're using to search using regex.

You can use single-line mode (also known as dotall), in which the dot matches even newlines (whatever they are):
/".+?"/s
You are using multiline mode, which only changes the meaning of ^ and $ from beginning/end of the string to beginning/end of each line. You don't need it here.
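To make the difference between the flags concrete, here is a short sketch on an invented two-line string (the sample text is an assumption):

```php
<?php
// Hypothetical input with a quoted span that crosses a line break.
$file = "keep \"line one\nline two\" keep";

// No flag: the dot stops at "\n", so the quoted span never matches.
$noFlag = preg_replace('/".+?"/', '', $file);
// /m only changes ^ and $, which this pattern does not use, so still no match.
$multi = preg_replace('/".+?"/m', '', $file);
// /s (dotall): the dot also matches "\n", so the whole quoted span is removed.
$dotAll = preg_replace('/".+?"/s', '', $file);

var_dump($noFlag === $file);       // bool(true)
var_dump($multi === $file);        // bool(true)
var_dump($dotAll === "keep  keep"); // bool(true)
```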

"[^"]+"

Something like below. s is dotall mode where . will match even newline:
/".+?"/s

$replaced = preg_replace('/"[^"]*"/s','', $file);
will do this for you. Note, however, that it does not handle escaped double quotes: for example, A "test \" quoted string" B will result in A quoted string" B (the leftover text keeps its leading space), not in A B as you might expect.
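If escaped quotes do matter, one possible workaround is a pattern that treats a backslash plus the following character as a unit. This is a sketch under the assumption that backslash is the escape character in your data; it is not part of the answer above:

```php
<?php
$input = 'A "test \" quoted string" B';

// Plain version: stops at the first quote, even though it is escaped.
$plain = preg_replace('/"[^"]*"/', '', $input);
var_dump($plain); // string(19) "A  quoted string" B"

// Escape-aware version (assumes backslash escaping): inside the quotes,
// accept either a character that is not a quote or backslash, or a
// backslash followed by any character.
$aware = preg_replace('/"(?:[^"\\\\]|\\\\.)*"/s', '', $input);
var_dump($aware); // string(4) "A  B"
```

The four backslashes in the PHP string literal reach PCRE as two, which is how a single literal backslash is written in a regex.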

regex: delete white characters

I am trying to collapse runs of more than one whitespace character in my string:
$content = preg_replace('/\s+/', " ", $content); //in some cases it doesn't work
but when I wrote
$content = preg_replace('/\s\s+/', " ", $content); //works fine
Could somebody explain why?
When I write /\s+/, it should match every run of one or more whitespace characters, so why doesn't it work?
Thanks

What is the minimum number of whitespace characters you want to match?
\s+ is equivalent to \s\s* -- one mandatory whitespace character followed by any number more of them.
\s\s+ is equivalent to \s\s\s* -- two mandatory whitespace characters followed by any number more (if this is what you want, it might be clearer as \s{2,}).
Also note that $content = preg_replace('/\s+/', " ", $content); will replace any single spaces in $content with a single space. In other words, if your string only contains single spaces, the result will be no change.
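A small sketch of the difference, on an invented string that mixes a lone tab, a double space and a lone newline (the sample text is an assumption):

```php
<?php
$content = "one\ttwo  three\nfour";

// \s+ also matches single whitespace characters, so the lone tab and the
// lone newline are converted to spaces as well.
$onePlus = preg_replace('/\s+/', ' ', $content);
var_dump($onePlus); // string(18) "one two three four"

// \s{2,} (same as \s\s+) only touches runs of two or more, so the lone
// tab and newline survive and only the double space is collapsed.
$twoPlus = preg_replace('/\s{2,}/', ' ', $content);
var_dump($twoPlus === "one\ttwo three\nfour"); // bool(true)
```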

I just wanted to add that the reason your /\s+/ worked sometimes and not others is that regex quantifiers are greedy by default: \s+ tries to match one or more whitespace characters, and as many as it can. I think that is where you got tripped up in finding a solution.
Sorry I'm not yet able to add comments, or I would have just added this comment to Daniel's answer, which is good.

Are you using the Ungreedy option (/U)? It doesn't say so in your code, but if so, it would explain why the first preg_replace() is replacing each single space with a single space (no change). In that case, the second preg_replace() would be replacing each double space with a single space. If you try the second one on a string of four spaces and the result is a double space, I would suspect ungreediness.
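The four-space test suggested above can be sketched like this (the variable names are only for illustration):

```php
<?php
$fourSpaces = "a    b"; // four spaces between the letters

// Default (greedy): \s\s+ swallows all four spaces in one match.
$greedy = preg_replace('/\s\s+/', ' ', $fourSpaces);
var_dump($greedy); // string(3) "a b"

// With /U the quantifier becomes lazy: each match takes only two spaces,
// so four spaces turn into two.
$ungreedy = preg_replace('/\s\s+/U', ' ', $fourSpaces);
var_dump($ungreedy); // string(4) "a  b"

// With /U, \s+ matches one space at a time, so the string is unchanged.
$single = preg_replace('/\s+/U', ' ', $fourSpaces);
var_dump($single === $fourSpaces); // bool(true)
```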

try preg_replace("/([\s]{2,})/", " ", $text)

Matching duplicate whitespace with preg_replace

I'm writing a WordPress plugin, and one of the features is removing duplicate whitespace.
My code looks like this:
return preg_replace('/\s\s+/u', ' ', $text, -1, $count);
I don't understand why I need the u modifier. I've seen other plugins that use preg_replace and don't need to modify it for Unicode. I believe I have a default installation of WordPress.
Without the modifier, the code replaces all the spaces with Unicode replacement glyphs instead of spaces. With the u modifier, I don't get the glyphs, and it doesn't replace all the whitespace.
Each gap below has from 1 to 10 spaces. The regex only removes one space from each group.
Before:
This sentence has extra space. This doesn’t. Extra space, Lots of extra space.
After:
This sentence has extra space. This doesn’t. Extra space, Lots of extra space.
$count = 9
How can I make the regex replace the whole match with the one space?
Update: If I try this with regular php, it works fine
$new_text = preg_replace('/\s\s+/', ' ', $text, -1, $count);
It only breaks when I use it within the wordpress plugin.
I'm using this function in a filter:
function jje_test( $text ) {
$new_text = preg_replace('/\s\s+/', ' ', $text, -1, $count);
echo "Count: $count";
return $new_text;
}
add_filter('the_content', 'jje_test');
I have tried:
Removing all other filters on the_content
remove_all_filters('the_content');
Changing the priority of the filter added to the_content, earlier or later
All kinds of permutations of \s+, \s\s+, [ ]+ etc.
Even replacing all single spaces with an empty string, will not replace the spaces

This will replace all sequences of two or more spaces, tabs, and/or line breaks with a single space:
return preg_replace('/[\p{Z}\s]{2,}/u', ' ', $text);
You need the /u flag if $text holds text encoded as UTF-8. Even if there are no Unicode characters in your regex, PCRE has to interpret $text correctly.
I added \p{Z} to the character class because PCRE only matches ASCII characters when using shorthands such as \s, even when using /u. Adding \p{Z} makes sure all Unicode whitespace is matched. There might be other spaces such as non-breaking spaces in your string.
I'm not sure if using echo in a WordPress filter is a good idea.
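A brief sketch of the pattern on a string containing a non-breaking space. The sample text is invented, and the "\u{00A0}" escape assumes PHP 7.0 or later. The second call illustrates why the encoding of $text matters: with /u, a subject that is not valid UTF-8 makes preg_replace() return null.

```php
<?php
// Space, non-breaking space (U+00A0), space between the letters.
$text = "a \u{00A0} b";

// \p{Z} covers Unicode separators such as U+00A0; /u makes PCRE read
// both the pattern and the subject as UTF-8.
$result = preg_replace('/[\p{Z}\s]{2,}/u', ' ', $text);
var_dump($result); // string(3) "a b"

// A subject that is not valid UTF-8 fails under /u instead of matching.
$broken = preg_replace('/\s+/u', ' ', "bad \xFF bytes");
var_dump($broken);                                   // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```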

The u modifier simply puts it into UTF-8 mode, which is useful if you need to do anything specific with characters that have a code point above 0x7f. You can still work on UTF-8 encoded strings without using that modifier, you just won't be able to specifically match or transform such characters easily.
There are some whitespace characters in Unicode that are above 0x7f. It's pretty rare to encounter them in most data. But you may see, for example, a non-breaking space character, which is Unicode U+00A0, or some rarer characters.
I don't know why using it would cause Unicode "replacement" glyphs to be output. I'd say it would be a problem elsewhere... what character encoding are you outputting your script as?

To answer jjeaton's follow-up question in the comments to my first reply, the following replaces each sequence of spaces, tabs, and/or line breaks with the first character in that sequence. Effectively, this deletes the second and following whitespace characters in each sequence of two or more whitespace characters. A run of spaces is replaced with a single space, a run of tabs is replaced with a single tab, etc. A run of a space and a tab (in that order) is replaced with a space, and a run of a tab and a space is replaced with a tab, etc.
return preg_replace('/([\p{Z}\s])[\p{Z}\s]+/u', '$1', $text);
This regex works by first matching one whitespace character and capturing it with a capturing group, followed by one or more further whitespace characters. The replacement text simply reinserts the text matched by the first (and only) capturing group.
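For illustration, here is the capture-group version applied to an invented string with one run that starts with a space and one that starts with a tab (the sample text is an assumption):

```php
<?php
// Runs: space+tab+tab between "a" and "b", tab+space between "b" and "c".
$text = "a \t\tb\t c";

// $1 is the first whitespace character of each run; the rest is dropped.
$result = preg_replace('/([\p{Z}\s])[\p{Z}\s]+/u', '$1', $text);

var_dump($result === "a b\tc"); // bool(true)
```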

Don't know about any modifiers, but this did the trick:
<?php
$text = ' Hi, my name is Andrés. ';
echo preg_replace(array('/^\s+/', '/\s+$/', '/\s{2,}/'), ' ', $text);
/*
Hi, my name is Andrés.
*/
?>

preg_replace('!\s+!', ' ', 'This sentence has extra space. This doesn’t. Extra space, Lots of extra space.');

Matching a space in regex

How can I match a space character in a PHP regular expression?
I mean like "gavin schulz", the space in between the two words. I am using a regular expression to make sure that I only allow letters, number and a space. But I'm not sure how to find the space. This is what I have right now:
$newtag = preg_replace("/[^a-zA-Z0-9s|]/", "", $tag);

If you're looking for a space, that would be " " (one space).
If you're looking for one or more, it's "  *" (that's two spaces and an asterisk) or " +" (one space and a plus).
If you're looking for common spacing, use "[ X]" or "[ X][ X]*" or "[ X]+" where X is the physical tab character (and each is preceded by a single space in all those examples).
These will work in every* regex engine I've ever seen (some of which don't even have the one-or-more "+" character, ugh).
If you know you'll be using one of the more modern regex engines, "\s" and its variations are the way to go. In addition, I believe word boundaries match start and end of lines as well, important when you're looking for words that may appear without preceding or following spaces.
For PHP specifically, this page may help.
From your edit, it appears you want to remove all invalid characters. The start of this is (note the space inside the regex):
$newtag = preg_replace ("/[^a-zA-Z0-9 ]/", "", $tag);
#                                    ^ space here
If you also want trickery to ensure there's only one space between each word and none at the start or end, that's a little more complicated (and probably another question) but the basic idea would be:
$newtag = preg_replace ("/ +/", " ", $newtag); # convert all multispaces to space
$newtag = preg_replace ("/^ /", "", $newtag); # remove space from start
$newtag = preg_replace ("/ $/", "", $newtag); # and end
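The steps above could be wrapped up roughly as follows. The function name normalizeTag is hypothetical, and trim() is swapped in for the two anchored replacements since it does the same job for plain spaces:

```php
<?php
// Hypothetical helper: keep letters, digits and spaces, collapse runs of
// spaces, and drop spaces at either end.
function normalizeTag(string $tag): string
{
    $tag = preg_replace('/[^a-zA-Z0-9 ]/', '', $tag); // strip invalid characters
    $tag = preg_replace('/ +/', ' ', $tag);           // multispaces to one space
    return trim($tag, ' ');                           // remove leading/trailing space
}

var_dump(normalizeTag("  gavin   schulz!! ")); // string(12) "gavin schulz"
```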

Cheat Sheet
Here is a small cheat sheet of everything you need to know about whitespace in regular expressions:
[[:blank:]]
Space or tab only, not newline characters. It is the same as writing [ \t].
[[:space:]] & \s
[[:space:]] and \s are the same. Both match any whitespace character: spaces, newlines, tabs, etc.
\v
Matches vertical Unicode whitespace.
\h
Matches horizontal whitespace, including Unicode characters. It will also match spaces, tabs, non-breaking/mathematical/ideographic spaces.
x (eXtended flag)
Ignore all whitespace in the pattern. Keep in mind that this is a flag, so you add it to the end of the regex, like /hello/x.
For example, if you write an expression like /hello world/x, it will match helloworld, but not hello world. The extended flag also allows comments in your regex.
Example
/helloworld #hello this is a comment/x
If you need to match a space, you can use "\ " (a backslash followed by a space).
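A few of the cheat-sheet entries, tried out in PHP on invented strings (the sample values are assumptions):

```php
<?php
$sample = "a \t b\nc"; // space, tab, space, then a newline

// [[:blank:]] and \h stop at the newline; \s does not.
$blank = preg_replace('/[[:blank:]]+/', '_', $sample);
$horiz = preg_replace('/\h+/', '_', $sample);
$space = preg_replace('/\s+/', '_', $sample);
var_dump($blank === "a_b\nc"); // bool(true)
var_dump($horiz === "a_b\nc"); // bool(true)
var_dump($space);              // string(5) "a_b_c"

// x flag: unescaped whitespace in the pattern is ignored.
var_dump(preg_match('/hello world/x', 'helloworld'));   // int(1)
var_dump(preg_match('/hello world/x', 'hello world'));  // int(0)
// An escaped space still matches a literal space under x.
var_dump(preg_match('/hello\ world/x', 'hello world')); // int(1)
```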

To match exactly the space character, you can use the octal escape \040 or the hexadecimal escape \x20 (both are the character code of the space, written in octal and hex respectively).
Here is the regex syntax reference: https://www.regular-expressions.info/nonprint.html.

In Perl the switch is \s (whitespace).

I am using a regex to make sure that I only allow letters, number and a space
Then it is as simple as adding a space to what you've already got:
$newtag = preg_replace("/[^a-zA-Z0-9 ]/", "", $tag);
(note, I removed the s| which seemed unintentional? Certainly the s was redundant; you can restore the | if you need it)
If you specifically want *a* space, as in only a single one, you will need a more complex expression than this, and might want to consider a separate non-regex piece of logic.

It seems to me that using a regex in this case would be overkill. Why not just use strpos to find the space character? Also, there's nothing special about the space character in regular expressions; you should be able to search for it the same way you would search for any other character. That is, unless you enabled the option that ignores whitespace in the pattern (the x flag), which would hardly be necessary in this case.
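A minimal sketch of the strpos approach, using the name from the question. Note the strict !== false check, since a space at position 0 would otherwise be mistaken for "not found":

```php
<?php
$name = "gavin schulz";

// strpos returns the zero-based offset of the first space, or false.
$pos = strpos($name, " ");
var_dump($pos); // int(5)

if ($pos !== false) {
    $first = substr($name, 0, $pos);  // text before the space
    $last  = substr($name, $pos + 1); // text after the space
    var_dump($first, $last); // string(5) "gavin", string(6) "schulz"
}
```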

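You can also use \b for a word boundary. Note that \b only means "word boundary" outside a character class; inside brackets it is a backspace character, so use \w for the word characters instead. For the name I would use something like this:
\b\w+\b\s+\b\w+\b
EDIT Modifying this to be a regex in Perl example
if( $fullname =~ /\b(\w+)\b\s+\b(\w+)\b/ ) {
$first_name = $1;
$last_name = $2;
}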
EDIT AGAIN Based on what you want:
$new_tag = preg_replace("/[\s\t]/","",$tag);

Use it like this to allow whitespace (\s also matches tabs and newlines, not only a single space).
$newtag = preg_replace("/[^a-zA-Z0-9\s]/", "", $tag);

I'm trying out [[:space:]] in an instance where it looks like bloggers in WordPress are using non-standard space characters. It looks like it will work.

This matches tires better because not all vendors use the same size format. I deal with many vendors all doing size in different format. This is my expression for now
/^[\d][\d](?:\d)?(?:\-|\/|\s)?([?:\d]+)?(?:\.)?(?:\d)?(?:\d)?(?:R|-|\s)?[1-3]([?:[\d]+)?(?:\.)?([?:\d])?(?:\s|-)/img
will catch all
35-12.50-22 HAIDA[AA]
35-12-22 HAIDA[AA]
35/35R20
35/35r20
thus uis a test
rrrrr
awdg
3345588
225-45-17 ACCELERA[AC]
195 50 16 KELLY
1955016 KELLY
CP671"
158 Buckshot
165-40-16-ACHILLES
11-24.5-16-LEAO-LLA08
11-24.5-LEAO-D37
11-22.5-14-LINGLONG-LLD37
11-22.5-HAPPYROAD[AA]
