Regex Entire Input Matches Pattern - php

How can I make a regex pattern for use with the PHP preg_replace function that removes all characters which do not match a certain pattern? For example:
[a-zA-Z0-9]

You can negate the character set by using ^:
[^a-zA-Z0-9]
The ^ only negates the existing character set [...] it is in, and it only applies when it is the first character inside the set. You can read more about negated character sets here
So, finally:
preg_replace('/[^a-zA-Z0-9]/', '', $input);
Edit:
As noted in the comments below, you can also add the + quantifier so that each run of consecutive invalid characters is replaced in a single match:
preg_replace('/[^a-zA-Z0-9]+/', '', $input);
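As a minimal sketch of the call above in context, the snippet below wraps it in a small helper. The function name stripNonAlnum is only an illustrative choice and does not come from the original answer.

```php
<?php
// Hypothetical helper: remove every character that is not an ASCII letter or digit.
function stripNonAlnum(string $input): string
{
    // The + groups each run of unwanted characters into a single match.
    // preg_replace returns null on error, so fall back to an empty string.
    return preg_replace('/[^a-zA-Z0-9]+/', '', $input) ?? '';
}

echo stripNonAlnum('Hello, World! 123'), "\n"; // HelloWorld123
```

Note that without the u modifier the pattern works byte by byte, so accented letters in UTF-8 input are removed as well.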

Related

Regular expression alphanumeric with dash and underscore and space, but not at the beginning or at the end of the string [duplicate]

I want to design an expression for not allowing whitespace at the beginning and at the end of a string, but allowing in the middle of the string.
The regex I've tried is this:
\^[^\s][a-z\sA-Z\s0-9\s-()][^\s$]\
This should work:
^[^\s]+(\s+[^\s]+)*$
If you want to include character restrictions:
^[-a-zA-Z0-9-()]+(\s+[-a-zA-Z0-9-()]+)*$
Explanation:
The starting ^ and ending $ anchor the match to the start and end of the string.
Considering the first regex I gave, [^\s]+ means at least one non-whitespace character and \s+ means at least one whitespace character. Note also that the parentheses () group the second and third fragments together, and the * at the end means zero or more repetitions of this group.
So the expression reads: begins with at least one non-whitespace character, followed by any number of groups of at least one whitespace character followed by at least one non-whitespace character.
For example, if the input is 'A' then it matches, because it satisfies the "begins with at least one non-whitespace character" condition. The input 'AA' matches for the same reason. The input 'A A' also matches: the first A satisfies the "at least one non-whitespace character" condition, and ' A' then matches as one group of at least one whitespace character followed by at least one non-whitespace character.
' A' does not match because the "begins with at least one non-whitespace character" condition is not satisfied. 'A ' does not match because the trailing space is not followed by a non-whitespace character, so it cannot be part of a complete group.
If you want to restrict which characters to accept, see the second regex. It allows only a-z, A-Z, 0-9, the hyphen and () in each group, with whitespace permitted only between groups.
Regex playground: http://www.regexr.com/
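The five inputs discussed in the explanation can be checked with a short PHP sketch. The helper name hasNoOuterWhitespace is hypothetical, and the D modifier is an addition that is not part of the original answer: in PCRE a plain $ also matches just before a final newline, and D turns that off.

```php
<?php
// Hypothetical helper: true when $s is non-empty and has no leading or trailing whitespace.
function hasNoOuterWhitespace(string $s): bool
{
    // D makes $ match only at the very end of the subject.
    return preg_match('/^[^\s]+(\s+[^\s]+)*$/D', $s) === 1;
}

foreach (['A', 'AA', 'A A', ' A', 'A '] as $s) {
    printf("'%s' => %s\n", $s, hasNoOuterWhitespace($s) ? 'match' : 'no match');
}
```

The first three inputs match and the last two do not, as described in the explanation.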
This RegEx will allow neither white-space at the beginning nor at the end of your string/word.
^[^\s].+[^\s]$
Any string of at least three characters (and without line breaks) that doesn't begin or end with whitespace will be matched.
Explanation:
^ denotes the beginning of the string.
\s denotes white-spaces and so [^\s] denotes NOT white-space. You could alternatively use \S to denote the same.
. denotes any character except a line break.
+ is a quantifier which denotes one or more times. That means the character which + follows can be repeated one or more times.
You can use this as RegEx cheat sheet.
In cases when you have a specific pattern, say, ^[a-zA-Z0-9\s()-]+$, that you want to adjust so that spaces at the start and end were not allowed, you may use lookaheads anchored at the pattern start:
^(?!\s)(?![\s\S]*\s$)[a-zA-Z0-9\s()-]+$
^^^^^^^^^^^^^^^^^^^^
Here,
(?!\s) - a negative lookahead that fails the match if there is a whitespace char immediately at the start of the string (it is checked right after ^)
(?![\s\S]*\s$) - a negative lookahead, also checked at the start of the string (the preceding lookahead consumes no text), that fails the match if any zero or more chars, as many as possible ([\s\S]*, which in JavaScript can also be written [^]*), are followed by a whitespace char at the end of the string ($).
In JS, you may use the following equivalent regex declarations:
var regex = /^(?!\s)(?![\s\S]*\s$)[a-zA-Z0-9\s()-]+$/
var regex = /^(?!\s)(?![^]*\s$)[a-zA-Z0-9\s()-]+$/
var regex = new RegExp("^(?!\\s)(?![^]*\\s$)[a-zA-Z0-9\\s()-]+$")
var regex = new RegExp(String.raw`^(?!\s)(?![^]*\s$)[a-zA-Z0-9\s()-]+$`)
If you know there are no linebreaks, [\s\S] and [^] may be replaced with .:
var regex = /^(?!\s)(?!.*\s$)[a-zA-Z0-9\s()-]+$/
See the regex demo.
JS demo:
var strs = ['a b c', ' a b b', 'a b c '];
var regex = /^(?!\s)(?![\s\S]*\s$)[a-zA-Z0-9\s()-]+$/;
for (var i = 0; i < strs.length; i++) {
  console.log('"', strs[i], '"=>', regex.test(strs[i]));
}
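For PHP, a rough equivalent of the JS demo might look like the sketch below. It uses [\s\S] because [^] is a JavaScript-specific construct that PCRE does not treat as "any character". The helper name isTrimmedAllowed and the D modifier are additions for this sketch, not part of the original answer.

```php
<?php
// Hypothetical helper: same character whitelist as the JS demo,
// with no whitespace allowed at the start or end.
function isTrimmedAllowed(string $s): bool
{
    // D stops $ from matching before a final newline.
    $regex = '/^(?!\s)(?![\s\S]*\s$)[a-zA-Z0-9\s()-]+$/D';
    return preg_match($regex, $s) === 1;
}

foreach (['a b c', ' a b b', 'a b c '] as $s) {
    echo '"', $s, '" => ', var_export(isTrimmedAllowed($s), true), "\n";
}
```

Only the first input passes; the other two are rejected by the first and second lookahead respectively.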
If the string must be at least 1 character long, newlines are allowed in the middle together with any other characters, and the first and last characters can really be anything except whitespace (including ##$!...), then you are looking for:
^\S$|^\S[\s\S]*\S$
explanation and unit tests: https://regex101.com/r/uT8zU0
This worked for me:
^[^\s].+[a-zA-Z]+[a-zA-Z]+$
Hope it helps.
How about:
^\S.+\S$
This will match any string of at least three characters that doesn't begin or end with any kind of whitespace.
^[^\s].+[^\s]$
That's it! It allows any string that contains any character (apart from \n) without whitespace at the beginning or end. If you want to allow \n in the middle, either use the s (dotall) modifier or replace .+ with [\s\S]+ (inside a character class the dot is literal, so [.\n]+ would not work).
pattern="^[^\s]+[-a-zA-Z\s]+([-a-zA-Z]+)*$"
This disallows whitespace at the start. Note that, as written, it still accepts trailing whitespace, because [-a-zA-Z\s]+ can match up to the end of the string and the final group is optional.
This is the regex for no whitespace at the beginning or at the end, and at most one whitespace character in between. It also works without a 3-character minimum:
\^([^\s]*[A-Za-z0-9]\s{0,1})[^\s]*$\ - just remove {0,1} and add * in order to have limitless space between.
As a modification of @Aprillion's answer, I prefer:
^\S$|^\S[ \S]*\S$
It will not match a space at the beginning, end, or both.
It matches any number of spaces between a non-whitespace character at the beginning and end of a string.
It also matches only a single non-whitespace character (unlike many of the answers here).
It will not match any newline (\n), \r, \t, \f, nor \v in the string (unlike Aprillion's answer). I realize this isn't explicit to the question, but it's a useful distinction.
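The difference between the two variants can be illustrated with a small PHP sketch. The function names are hypothetical, and the D modifier is an addition that keeps $ from matching before a final newline.

```php
<?php
// Variant that allows any whitespace (including newlines and tabs) in the middle.
function matchesAnyInside(string $s): bool
{
    return preg_match('/^\S$|^\S[\s\S]*\S$/D', $s) === 1;
}

// Variant that allows only plain spaces in the middle.
function matchesSpaceInside(string $s): bool
{
    return preg_match('/^\S$|^\S[ \S]*\S$/D', $s) === 1;
}

foreach (['a', 'a b', "a\tb", "a\nb", ' a', 'a '] as $s) {
    printf(
        "%s => any: %s, space-only: %s\n",
        json_encode($s),
        var_export(matchesAnyInside($s), true),
        var_export(matchesSpaceInside($s), true)
    );
}
```

Both variants accept 'a' and 'a b' and reject ' a' and 'a '; they differ only on the inputs containing a tab or a newline.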
Letters and numbers divided only by one space. Also, no spaces allowed at beginning and end.
/^[a-z0-9]+( [a-z0-9]+)*$/gi
I found a reliable way to do this is just to specify what you do want to allow for the first character and check the other characters as normal e.g. in JavaScript:
RegExp("^[a-zA-Z][a-zA-Z- ]*$")
So that expression accepts only a single letter at the start, and then any number of letters, hyphens or spaces thereafter.
Use /^[^\s].([A-Za-z]+\s)*[A-Za-z]+$/. It accepts only one space between words and no spaces at the beginning or end.
If we do not need a specific set of valid characters (that is, we accept characters from any language) and just want to prevent spaces at the start and end, the simplest pattern is:
/^(?! ).*[^ ]$/
Try on HTML Input:
input:invalid {box-shadow:0 0 0 4px red}
/* Note: ^ and $ are removed from the pattern because the HTML input already matches the pattern against the entire value. */
<input pattern="(?! ).*[^ ]">
Explanation
^ Start of string
(?!...) Negative lookahead: what follows must not match ...
A literal space, or \s (space, tab and line-break characters)
(?! ) Do not accept a space as the first character of what follows (.*)
. Any character (except \n and \r line breaks)
* Zero or more (length of the set)
[^ ] Set/class matching any character except a space
$ End of string
Try it live: https://regexr.com/6e1o4
^[^0-9 ]{1}([a-zA-Z]+\s{1})+[a-zA-Z]+$
- for no more than one whitespace character between words, and no spaces at the start or end.
^[^0-9 ]{1}([a-zA-Z ])+[a-zA-Z]+$
- allows more than one whitespace character between words, and no spaces at the start or end.
Other answers introduce a limit on the length of the match. This can be avoided using Negative lookaheads and lookbehinds:
^(?!\s)([a-zA-Z0-9\s])*?(?<!\s)$
This starts by checking that the first character is not whitespace ^(?!\s). It then captures the characters you want a-zA-Z0-9\s non greedily (*?), and ends by checking that the character before $ (end of string/line) is not \s.
Check that lookaheads/lookbehinds are supported in your platform/browser.
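In PHP (PCRE), both lookaheads and lookbehinds are available, so the pattern can be tried as in the sketch below. The helper name is hypothetical and the D modifier is an addition. One consequence worth noting is that this pattern also accepts the empty string, since neither lookaround requires a character to be present.

```php
<?php
// Hypothetical helper around the lookahead/lookbehind pattern from the answer.
function isTrimmedAlnum(string $s): bool
{
    // (?!\s) guards the first character, (?<!\s) guards the last one.
    return preg_match('/^(?!\s)([a-zA-Z0-9\s])*?(?<!\s)$/D', $s) === 1;
}

foreach (['', 'a', 'a b', ' a', 'a ', 'a!'] as $s) {
    printf("%s => %s\n", json_encode($s), var_export(isTrimmedAlnum($s), true));
}
```

If an empty value should be rejected, changing *? to +? is one way to require at least one character.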
Here you go,
\b^[^\s][a-zA-Z0-9]*\s+[a-zA-Z0-9]*\b
\b refers to word boundary
\s+ means allowing white-space one or more at the middle.
(^(\s)+|(\s)+$)
This expression will match the leading and trailing whitespace of the article.

Regex expression to allow only characters used for writing articles

I am looking to construct a regex that allows the characters used for writing articles, such as:
Alphabets: a-zA-Z
Numbers: 0-9
Special characters: -.,+*/'´"!#%&/()=?#£$€{[]}_:;
Spaces: newlines(enter space) and spaces
My initial attempt, using PHP, looked like this:
preg_replace('/[^a-zA-Z,0-9],.;+- /', ,'', $input)
But the line above didn't work
Edit: second attempt to escape the characters to avoid messing up the expression:
preg_replace('/[^a-zA-Z,0-9]\-\.\,\+\*\/\'\´\"\!\#\%\&\/\(\)\=\?\#\£\$\€\{\[\]\}\_\:\;/', '', $input)
The preg_replace function expects three parameters, not two. A regex, the replacement value, and then the string it should match against.
Additionally, your regex should have all the characters inside the character class; otherwise you are matching the character class followed by the literal characters after it, which likely don't occur in the input. Outside a character class an unescaped + is a quantifier, so ;+ matches one or more consecutive semicolons rather than a literal +.
preg_replace('/[^a-zA-Z0-9,.;+-]+/', '', $input)
Another regex you could potentially use would be:
preg_replace('/[^[:print:]]+/u', '', $input)
This will remove anything that is not a visible character or a space (that is, control characters). Note that line breaks are control characters, so they are removed as well.
You can read more here: https://www.regular-expressions.info/posixbrackets.html
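One way to build the full whitelist from the question without hand-escaping every symbol is to pass the punctuation through preg_quote and place the result inside a negated character class. This is only a sketch: the function name stripUnsupported is hypothetical, the ~ delimiter is chosen because it is not in the allowed list, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: keep letters, digits, whitespace (including newlines)
// and the punctuation listed in the question; remove everything else.
function stripUnsupported(string $input): string
{
    // Punctuation taken from the question's list.
    $allowedPunctuation = '-.,+*/\'´"!#%&()=?£$€{[]}_:;';

    // preg_quote escapes the regex metacharacters (including - and ]),
    // which is also safe inside a character class.
    $pattern = '~[^a-zA-Z0-9\s' . preg_quote($allowedPunctuation, '~') . ']+~u';

    return preg_replace($pattern, '', $input) ?? '';
}

echo stripUnsupported('Price: 5€ ~ok~ <tag>'), "\n"; // Price: 5€ ok tag
```

The u modifier matters here because ´, £ and € are multi-byte characters in UTF-8; without it the class would operate on individual bytes.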

How do I check if a string is composed only of letters and numbers? (PHP) [duplicate]

Title says it all: I am checking to see if a user's username contains anything that isn't a number or letter, such as €{¥]^}+<€, punctuation, spaces or even things like âæłęč. Is this possible in php?
You can use the ctype_alnum() function in PHP.
From the manual..
Check for alphanumeric character(s)
Returns TRUE if every character in text is either a letter or a digit, FALSE otherwise.
var_dump(ctype_alnum("æøsads")); // false
var_dump(ctype_alnum("123asd")); // true
Live demo at https://3v4l.org/5etr7
PHP does REGEX
What you want to do is fairly trivial, PHP has a number of regex functions
Testing a String For a Character
If all you want is to know IF a string contains non-alphanumeric characters, then just use preg_match():
preg_match( '/[^A-Za-z0-9]/', $userName );
This will return 1 if the username contains anything other than alphanumeric characters (A-Z, a-z or 0 to 9), and 0 if it does not contain any non-alphanumeric character. (No * quantifier is used here: with *, the pattern would also match the empty string and therefore always return 1.)
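As a small sketch of how such a check could be wrapped for reuse, the two helpers below test for "contains something invalid" and for "consists only of valid characters". Both function names are hypothetical, and the D modifier in the second pattern is an addition so that a trailing newline is not accepted.

```php
<?php
// Hypothetical helper: true if at least one byte is not an ASCII letter or digit.
function containsNonAlnum(string $userName): bool
{
    return preg_match('/[^A-Za-z0-9]/', $userName) === 1;
}

// Hypothetical helper: true if the whole (non-empty) string is ASCII letters and digits.
function isAlnumOnly(string $userName): bool
{
    return preg_match('/^[A-Za-z0-9]+$/D', $userName) === 1;
}

var_dump(containsNonAlnum('123asd')); // bool(false)
var_dump(containsNonAlnum('æøsads')); // bool(true)
var_dump(isAlnumOnly('123asd'));      // bool(true)
```

The two helpers disagree only on the empty string, which contains nothing invalid but is also not a usable username.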
Regex Pattern Elements
PCRE patterns open and close with a delimiter such as a slash (/), and the whole pattern is passed as a quoted string: '/myPattern/'. Some other key features are:
[ brackets contain match sets ]
[a-z] // means match any lowercase letter
This pattern means check the current character in the $String relative to the pattern in these brackets, in this case match any lowercase letter a to z.
^ Caret (Meta-Character)
[^a-z] // means no lowercase letters
If the caret ^ (aka hat) is the first character inside brackets, it NEGATES the pattern inside the brackets, so [^A7] means match anything EXCEPT uppercase A and the numeral 7. (Note: when outside brackets, the caret ^ means the start of the string.)
\w\W\d\D\s\S. Meta-Characters (WildCards)
\w // match all alphanumeric
An escaped (i.e. preceded by a backslash \) lowercase w means match any "word" character, i.e. alphanumeric and the underscore _; this is shorthand for [A-Za-z0-9_]. The uppercase \W is the NOT word character, equivalent to [^A-Za-z0-9_] or [^\w].
. // (dot) match ANY single character except return/newline
\w // match any word character [A-Za-z0-9_]
\W // NOT any word character [^A-Za-z0-9_]
\d // match any digit [0-9]
\D // NOT any digit [^0-9].
\s // match any whitespace (tab, space, newline)
\S // NOT any whitespace
.*+?| Meta-Characters (Quantifiers)
These modify the behavior outside of a set []
* // match previous character or [set] zero or more times,
// so .* means match everything (including nothing) until reaching a return/newline.
+ // match previous at least one or more times.
? // match previous only zero or one time (i.e. optional).
| // means logical OR eg.: com|net means match either literal "com" or "net"
Not shown: capture groups, backreferences, substitution (the real power of regex). See https://www.phpliveregex.com/#tab-preg-match for more including a live pattern-match playground that is based on the PHP functions, and delivers results as arrays.
Back To Your StringCleaning
So for your pattern, to match all characters that are not letters or numbers (including underscores) you need either: '/[^A-Za-z0-9]+/' or '/[\W_]+/'
Strip Search
If instead you want to STRIP all the non-alpha characters from a string then use preg_replace( $Regex, $Replacement, $StringToClean )
<?php
$username = 'Svéñ déGööfinøff';
echo preg_replace('/[\W_]+/', '', $username);
?>
The output is: SvdGfinff
If you'd prefer to replace certain accented letters with standard Latin ones to keep the names reasonably readable, then I believe you'd need a lookup table (array). There is one ready to use at the PHP site.
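A lookup table of that kind can be applied with strtr before stripping. The sketch below uses a deliberately tiny, illustrative table that covers only the letters in the example name; it is not the table from the PHP site, and the function name cleanUsername is hypothetical.

```php
<?php
// Hypothetical helper: transliterate a few accented letters, then strip
// everything that is not an ASCII letter or digit.
function cleanUsername(string $name): string
{
    // Illustrative, incomplete lookup table; extend it for the letters you expect.
    $map = ['é' => 'e', 'ñ' => 'n', 'ö' => 'o', 'ø' => 'o'];

    $ascii = strtr($name, $map);

    return preg_replace('/[\W_]+/', '', $ascii) ?? '';
}

echo cleanUsername('Svéñ déGööfinøff'), "\n"; // SvendeGoofinoff
```

Letters that are missing from the table are still removed by the final preg_replace, so the result is always plain ASCII.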

Regex to strip out everything but words and numbers (and latin chars)

I'm trying to clean a POST string used in an Ajax request (sanitizing before a DB query) so that it allows only alphanumeric characters, spaces (one between words, not multiple), "-", and Latin characters like "ç" and "é", but without success. Can anyone help or point me in the right direction?
This is the regex I'm using so far:
$string = preg_replace('/^[a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû-]+$/', '', mb_strtolower(utf8_encode($_POST['q'])));
Thank you.
$regEx = '/^[^\w\p{L}-]+$/iu';
\w - matches alphanumerics
\p{L} - matches a single Unicode Code Point in the 'Letters' category (see the Unicode Categories section here).
- at the end of the character class matches a single hyphen.
^ in the character classes negates the character class, so that the regex will match the opposite of the character class (anything you do not specify).
+ outside of the character class says match 1 or more characters
^ and $ outside of the character class will cause the engine to only accept matches that start at the beginning of a line and goes until the end of the line.
After the pattern, the i modifier says ignore case and the u modifier tells the pattern matching engine that we're going to be sending UTF-8 data its way. The g modifier originally present has been removed since it's not necessary in PHP (instead, global matching depends on which matching function is called).
$string = mb_strtolower(utf8_encode($_POST['q']));
// PHP has no g modifier (preg_replace already replaces every match); u enables UTF-8 matching
$string = preg_replace('/[^a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû-]+/u', '', $string);
$string = preg_replace('/ +/', ' ', $string);
Why not just use mysql_real_escape_string?
$string = preg_replace('/[^a-z0-9 àáâãäåçèéêëìíîïðñòóôõöøùúû\-]/u', '', mb_strtolower(utf8_encode($_POST['q']), 'UTF-8'));
$string = preg_replace( '/ +/', ' ', $string );
should do the trick. Note that
the character class is negated by putting ^ inside the character class
you need the u flag when dealing with unicode strings either in the pattern or in the subject
it's better to specify the character set explicitly in mb_* functions because otherwise they will fall back on your system defaults, and that may not be UTF-8.
escaping the hyphen (\- instead of -) is not strictly required when it is the last character of the character class, but it makes the intent explicit
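Putting these notes together, a sanitizing helper could look roughly like the sketch below. It is not the answer's exact code: it swaps the explicit list of accented letters for \p{L} (any Unicode letter), assumes the input is already UTF-8 and that the mbstring extension is available, and adds a final trim. The name sanitizeQuery is hypothetical.

```php
<?php
// Hypothetical helper; assumes UTF-8 input and the mbstring extension.
function sanitizeQuery(string $raw): string
{
    // Name the encoding explicitly instead of relying on system defaults.
    $lower = mb_strtolower($raw, 'UTF-8');

    // Drop everything except letters (any script), digits, spaces and hyphens.
    $clean = preg_replace('/[^\p{L}0-9 \-]+/u', '', $lower) ?? '';

    // Collapse runs of spaces into one, then trim the ends.
    return trim(preg_replace('/ +/', ' ', $clean) ?? '');
}

echo sanitizeQuery('  Café   AÇAÍ -- <script>!'), "\n"; // café açaí -- script
```

This only restricts the character set; it does not replace parameterized queries or proper escaping when the value is later used in a database query.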

Replace placeholders which start with # then whole word

I need to replace words that start with hash mark (#) inside a text.
Well I know how I can replace whole words.
preg_replace("/\b".$variable."\b/", $value, $text);
Because \b only matches at the boundary between a word character and a non-word character, a word that starts with a hash mark won't be replaced.
I have this html which contains #companyName type of variables which I replace with a value.
\b matches between an alphanumeric character (shorthand \w) and a non-alphanumeric character (\W), counting underscores as alphanumeric. This means, as you have seen, that it won't match before a # (unless that's preceded by an alnum character).
I suggest that you only surround your query word with \b if it starts and ends with an alnum character.
So, perhaps something like this (although I don't know any PHP, so this may be syntactically completely wrong):
if (preg_match('/^\w/', $variable)) {
    $variable = '\b'.$variable;
}
if (preg_match('/\w$/', $variable)) {
    $variable = $variable.'\b';
}
preg_replace('/'.$variable.'/', $value, $text);
All \b does is match a change between non-word and word characters. Since you know $variable starts with non-word characters, you just need to precede the match by a non-word character (\W).
However, since you are replacing, you either need to make the non-word match zero-width, i.e. a look-behind:
preg_replace("/(?<=\\W)".$variable."\\b/", $value, $text);
or incorporate the matched character into the replacement text:
preg_replace("/(\\W)".$variable."\\b/", '${1}'.$value, $text);
Why not just
preg_replace("/#\b".$variable."\b/", $value, $text);
The following expression can also be used to mark boundaries for words containing non-word characters:
preg_replace("/(^|\s|\W)".$variable."($|\s|\W)/", $value, $text);
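A variation on these ideas, offered only as a sketch, is to use negative lookarounds on both sides so that nothing outside the placeholder is consumed, and to pass the placeholder through preg_quote in case it contains regex metacharacters. The function name replacePlaceholder is hypothetical, and the callback is used so that the replacement value is inserted literally.

```php
<?php
// Hypothetical helper: replace a placeholder such as "#companyName"
// only where it is not glued to word characters on either side.
function replacePlaceholder(string $text, string $placeholder, string $value): string
{
    // (?<!\w) and (?!\w) are zero-width, so neighbouring characters are kept.
    $pattern = '/(?<!\w)' . preg_quote($placeholder, '/') . '(?!\w)/';

    // A callback avoids $1 or \1 inside $value being read as backreferences.
    $result = preg_replace_callback(
        $pattern,
        function () use ($value) {
            return $value;
        },
        $text
    );

    return $result ?? $text;
}

echo replacePlaceholder('Welcome to #companyName, not #companyNameLtd.', '#companyName', 'Acme'), "\n";
// Welcome to Acme, not #companyNameLtd.
```

Unlike the (?<=\W) look-behind, the negative form (?<!\w) also matches a placeholder at the very start of the text.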
