I am looking to construct a regular expression that allows the characters commonly used for writing articles, such as:
Alphabets: a-zA-Z
Numbers: 0-9
Special characters: -.,+*/'´"!#%&/()=?#£$€{[]}_:;
Spaces: newlines(enter space) and spaces
My initial attempt, using PHP, looked like this:
preg_replace('/[^a-zA-Z,0-9],.;+- /', ,'', $input)
But the line above didn't work.
Edit: second attempt to escape the characters to avoid messing up the expression:
preg_replace('/[^a-zA-Z,0-9]\-\.\,\+\*\/\'\´\"\!\#\%\&\/\(\)\=\?\#\£\$\€\{\[\]\}\_\:\;/', '', $input)
The preg_replace function expects three parameters, not two: a regex, the replacement value, and the string the regex is applied to.
Additionally, all the allowed characters must sit inside the character class. Otherwise the regex matches one character from the class followed by the literal characters after it, a sequence that is unlikely to occur in the input. Outside a character class an unescaped + is a quantifier, so ;+ means one or more consecutive semicolons rather than a semicolon or a plus sign.
preg_replace('/[^a-zA-Z0-9,.;+-]+/', '', $input)
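To show how the full list from the question could fit into one class, here is a minimal sketch. The function name keep_article_chars is made up for illustration, and the non-ASCII symbols are written as code points, which assumes UTF-8 input and the u modifier.

```php
<?php
// Hypothetical helper: strips every character that is not on the allow-list.
function keep_article_chars(string $input): string
{
    // One negated class holds the whole allow-list. The hyphen comes last so it
    // is literal, and the non-ASCII symbols are written as code points
    // (U+00B4 acute accent, U+00A3 pound, U+20AC euro), which needs the u modifier.
    $pattern = '~[^a-zA-Z0-9\s.,+*/\'"!#%&()=?{}\[\]_:;$\x{B4}\x{A3}\x{20AC}-]+~u';

    // preg_replace() returns null on error (for example invalid UTF-8 input).
    return preg_replace($pattern, '', $input) ?? '';
}

// The angle brackets are not on the allow-list, so they are removed.
echo keep_article_chars("Price: 5\u{20AC} <b>now</b>!\n");
```

Using ~ as the delimiter avoids having to escape the / that is part of the allow-list.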
Another regex you could potentially use would be:
preg_replace('/[^[:print:]]+/u', '', $input)
This removes everything that is not a visible character or a space, i.e. it strips control characters. Note that newlines and tabs count as control characters, so they are removed as well.
You can read more here: https://www.regular-expressions.info/posixbrackets.html
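A small sketch of what the [:print:] version does to control characters. The function name strip_non_printable is hypothetical; the pattern is the one from the answer.

```php
<?php
// Hypothetical helper wrapping the [:print:] pattern from the answer.
function strip_non_printable(string $input): string
{
    // Anything outside the POSIX "print" class is deleted.
    return preg_replace('/[^[:print:]]+/u', '', $input) ?? '';
}

// The newline, the tab and the bell character (0x07) are all control
// characters, so none of them survive.
var_dump(strip_non_printable("Line one\nLine\ttwo\x07"));
```

If line breaks must be kept, as the question asks, one option would be to add \s to the negated class: /[^[:print:]\s]+/u.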
Related
So I have this regex that works on regex101.com
(?:[^\#\\S\\+]*)
It matches the first from first#second.
Whenever I try to use my regex with PHP's preg_replace I don't get the result I expect.
So far I tried it via preg_quote():
\(\?\:\[\^\\#\\S\\\+\]\*\)
And tried it with escaping the original \\ with 4 \'s:
\(\?\:\[\^\\#\\\\S\\\\\+\]\*\)
Still no success. Am I doing something fundamentally wrong?
I'm just using:
preg_replace("/$regex/", "", $string);
All my other regexes that don't need so many escape chars work perfectly that way.
When you use (?:[^\#\\S\\+]*) in a preg function in PHP, in either a single- or double-quoted string literal, the \\S is parsed as a non-whitespace pattern. [^\S] is equal to \s, i.e. it matches whitespace.
The preg_quote() function is only meant to turn an arbitrary string into a literal one for a regex: it escapes all chars that are special regex metacharacters / operators (like (, ), [, etc.), so you should not use it here.
While you could use a regex to match 1+ chars other than whitespace and # from the start of a string like preg_match('~^[^#\s]+~', $s, $match), you can just explode your input string with # and get the 0th item.
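The two suggestions can be sketched side by side. The variable names are made up for illustration; the last line shows that the original pattern only ever matches whitespace.

```php
<?php
$s = 'first#second';

// Regex route: one or more characters from the start that are neither '#'
// nor whitespace.
preg_match('~^[^#\s]+~', $s, $match);
$viaRegex = $match[0] ?? '';

// Plain string route: split on '#' and take the 0th item.
$viaExplode = explode('#', $s)[0];

// The original pattern boils down to "whitespace only", so used with
// preg_replace it merely strips spaces.
$original = preg_replace('/(?:[^\#\\S\\+]*)/', '', 'first # second');

var_dump($viaRegex, $viaExplode, $original);
```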
In PHP it is a common practice to treat strings as immutable. Sometimes there's a need to modify a string "in-place".
We go with the additional array creation approach.
This array should contain every single letter from the source string.
There's a function for that in PHP (str_split). One issue: it works on bytes, so it doesn't handle multibyte encodings well.
There's also a mb_split function which takes a regex as an input parameter for separator sequence. So
mb_split('.', '123')
returns ['', '', '', ''].
BUT:
mb_split('', '123')
returns ['123'].
So I believe there should be a counterpart regex which matches the empty position between any two multi-byte characters.
So for '123' it should match
'1~2', '2~3'
where ~ marks an actual match. That is just like \b, but between any two characters.
Is there a regex hack to do so?
Use
preg_match_all('~\X~u', $s, $arr)
The $arr[0] will contain all the characters. The \X pattern matches any Unicode grapheme. The /u modifier is necessary to make the regex engine treat the input string as a Unicode string and make the pattern Unicode aware.
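A minimal sketch of the \X approach. The sample string is an assumption chosen to show the difference between code points and graphemes: it starts with "a" followed by a combining diaeresis (U+0308).

```php
<?php
// "a" + U+0308 is two code points but a single grapheme.
$s = "a\u{0308}bc";

preg_match_all('~\X~u', $s, $arr);
$chars = $arr[0];

// Three elements: the combined "a" with diaeresis, then "b", then "c".
var_dump(count($chars));
```

A byte-based str_split() on the same string would return five one-byte pieces, because U+0308 takes two bytes in UTF-8.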
See the PHP demo.
I use PHP.
My string can look like this
This is a string-test width åäö and some über+strange characters: _like this?
Question
Is there a way to remove non-alphanumeric characters and replace them with a space? Here are some non-alphanumeric characters:
-
+
:
_
?
I've read many threads about it but they don't support other languages, like this one:
preg_replace("/[^A-Za-z0-9 ]/", '', $string);
Requirements
My list of non-letter characters might not be complete.
My content contains characters from different languages, like åäöü. There could be many more.
The non-alphanumeric characters should be replaced with a space. Otherwise the words would be glued to each other.
You can try this:
preg_replace('~[^\p{L}\p{N}]++~u', ' ', $string);
\p{L} stands for all alphabetic characters (whatever the alphabet).
\p{N} stands for numbers.
With the u modifier, the pattern and the subject string are treated as UTF-8, so each multibyte character is handled as one Unicode character.
Or this:
preg_replace('~\P{Xan}++~u', ' ', $string);
\p{Xan} matches Unicode letters and digits.
\P{Xan} matches everything that is not a Unicode letter or digit. (Be careful: that includes whitespace too, which you can preserve with ~[^\p{Xan}\s]++~u )
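Applied to the sample string from the question, both patterns can be compared directly. This is only a sketch: the non-ASCII letters are written as \u{...} escapes so the snippet does not depend on the file encoding.

```php
<?php
// The question's sample string, with the non-ASCII letters as escapes.
$string = "This is a string-test width \u{E5}\u{E4}\u{F6} and some "
        . "\u{FC}ber+strange characters: _like this?";

// Variant 1: anything that is neither a letter nor a number.
$a = preg_replace('~[^\p{L}\p{N}]++~u', ' ', $string);

// Variant 2: the same set expressed with the PCRE-specific Xan property.
$b = preg_replace('~\P{Xan}++~u', ' ', $string);

// Each run of separators becomes one space, so only the trailing "?" leaves
// a space that still needs trimming.
echo trim($a), "\n";
```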
If you want a more specific set of allowed letters, replace \p{L} with ranges from the Unicode table.
Example:
preg_replace('~[^a-zÀ-ÖØ-öÿŸ\d]++~ui', ' ', $string);
Why use a possessive quantifier (++) here?
~\P{Xan}+~u gives the same result as ~\P{Xan}++~u. The difference is that with the first the engine records each backtracking position (which we don't need), whereas with the second it doesn't (as in an atomic group). The result is a small performance gain.
I think it's good practice to use possessive quantifiers and atomic groups whenever possible.
However, the PCRE regex engine automatically makes a quantifier possessive in obvious situations (example: a+b => a++b), unless auto-possessification has been disabled with the PCRE_NO_AUTO_POSSESS option. (http://www.pcre.org/pcre.txt)
More information about possessive quantifiers and atomic groups is available here (possessive quantifiers) and here (atomic groups) or here.
Are you perhaps looking for \W?
Something like:
/[\W_]+/
Matches runs of non-alphanumeric characters and underscores.
\w matches any word character (letters, digits, underscore).
\W matches anything not in \w.
So \W matches any non-alphanumeric character, and you add the underscore explicitly because \W doesn't match underscores.
EDIT: This makes your line of code become:
preg_replace("/[\W_]+/", ' ', $string);
The ' ' means that each run of matching characters (those that are neither letters nor digits) becomes a single space. Use + rather than *: a * quantifier also matches the empty string, so a space would be inserted at every position in the string.
reEDIT: You might additionally want another preg_replace to collapse consecutive spaces into a single space, otherwise you may end up with runs of spaces in a result such as:
This is a string test width and some ber strange characters like this
Note that the non-ASCII letters (åäö, ü) are gone too, because without the u modifier \w only covers ASCII letters.
You can use:
preg_replace("/\s+/", ' ', $string);
And lastly, trim any leading and trailing spaces.
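The steps from this answer can be put together as a rough sketch. The function name words_only is hypothetical, and the first pattern uses + rather than * because * also matches the empty string. As written it is only suitable for ASCII input.

```php
<?php
// Hypothetical helper combining the three steps described in the answer.
function words_only(string $string): string
{
    // 1. Runs of non-word characters and underscores become one space.
    $string = preg_replace('/[\W_]+/', ' ', $string);
    // 2. Collapse any remaining whitespace runs (tabs, newlines, spaces).
    $string = preg_replace('/\s+/', ' ', $string);
    // 3. Drop leading and trailing spaces.
    return trim($string);
}

echo words_only("string-test  some+strange characters: _like this?"), "\n";
```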
I am not entirely sure which variety of regex you are using. However, POSIX regexes allow you to express an alphabetical class, where [:alpha:] represents any alphabetic character.
So try:
preg_replace("/[^[:alpha:]0-9 ]/", '', $string);
Actually, I forgot about [:alnum:] - that makes it simpler:
preg_replace("/[^[:alnum:] ]/", '', $string);
\p{xx} is what you are looking for, I believe, see here
So, try:
preg_replace("/\P{L}+/u", ' ', $string);
I have a regular expression that allows only specific characters from the name fields in an HTML form, namely letters, white space, single quotes, hyphens and periods. Here is the pattern:
return mb_ereg_match("^[\w\s'-\.]+$", $name);
Problem is this pattern, for some reason, returns true when there are literal asterisks in $name. This shouldn't be possible unless I'm missing something. I've done multiple searches on literal asterisks and all I found was the "\*" pattern for intentionally matching them.
The same pattern in preg_match() also returns a match when passed a string like "*John".
What the heck am I missing?
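In a double-quoted PHP string it is safer to put a double backslash in front of these codes, so that PHP passes a literal backslash on to the regex engine, which then reads it as the start of the escape sequence.
You also need to escape the -, otherwise it accepts all characters "between" ' and . (a range that includes the asterisk).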
return mb_ereg_match("^[\\w\\s'\\-\\.]+$", $name);
Have a look at a working case (using preg_match): http://ideone.com/E8afAM
When enclosed in square brackets, the hyphen acts as a special character that denotes a range. In your case, it matches all characters in the range from ' to . (which includes *).
Escaping the hyphen should return the desired result:
^[\w\s'\-\.]+$
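The effect of the accidental range can be checked with preg_match, which the question says behaves the same way. This is just a sketch with made-up variable names.

```php
<?php
// '-\. is read as a range from ' (0x27) to . (0x2E), which covers ( ) * + , - .
$unescaped = "/^[\\w\\s'-\\.]+$/";

// With the hyphen escaped, only the listed characters are allowed.
$escaped = "/^[\\w\\s'\\-\\.]+$/";

var_dump(preg_match($unescaped, '*John')); // the asterisk slips through the range
var_dump(preg_match($escaped, '*John'));   // rejected once the hyphen is literal
```

Placing the hyphen first or last in the class (for example [\w\s'.-]) would also make it literal without escaping.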
I have a regular expression that allows only specific characters from the name fields in an HTML form, namely letters, white space, single quotes, hyphens and periods.
You are overlooking that \w is not just a letter character. php.net says:
A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word".
And, the perl definition is:
A \w matches a single alphanumeric character (an alphabetic character, or a decimal digit) or a connecting punctuation character, such as an underscore ("_").
The connecting punctuation character should mean only _ as far as I can tell, but this may be a bug in the multibyte extension.
If you use mb_ereg_match only to match whole Unicode strings, give preg_match's /u modifier and the Unicode character properties feature a try (available since PHP 5.1.0).
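A minimal sketch of that suggestion, assuming the intent is letters only (no digits or underscores) plus whitespace, apostrophe, hyphen and period. The function name is_valid_name is made up for illustration.

```php
<?php
// Hypothetical validator: \p{L} replaces \w so that digits and underscores
// are rejected while letters from any script are accepted.
function is_valid_name(string $name): bool
{
    return preg_match('/^[\p{L}\s\'\-.]+$/u', $name) === 1;
}

var_dump(is_valid_name("Mary-Ann O'Neil")); // letters, space, hyphen, apostrophe
var_dump(is_valid_name('*John'));           // asterisk is not in the class
```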
Is this a correct syntax for preg_replace (regular expression) to remove ?ajax=true or &ajax=true from a string?
echo preg_replace('/(\?|&)ajax=true/', '', $string);
So for example /hello/hi?ajax=true will give me /hello/hi and /hello/hi?ajax=true will give me /hello/hi
Do I need to escape &?
Why don't you try it?
You don't need to escape "&". It is not a special character in regex.
Your expression should work; what you are using is an alternation. But if the alternation contains only single characters, a character class is more readable.
echo preg_replace('/[?&]ajax=true/', '', $string);
[?&] is a character class, it will match one character out of the characters listed between the square brackets.
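A quick sketch of the character-class version in use. The function name strip_ajax_flag is hypothetical.

```php
<?php
// Hypothetical helper: removes "?ajax=true" or "&ajax=true" from a URL.
function strip_ajax_flag(string $url): string
{
    // Fall back to the untouched URL if preg_replace() reports an error.
    return preg_replace('/[?&]ajax=true/', '', $url) ?? $url;
}

echo strip_ajax_flag('/hello/hi?ajax=true'), "\n";
echo strip_ajax_flag('/hello/hi?a=1&ajax=true'), "\n";
```

One caveat: if ajax=true is the first of several parameters, as in /hello/hi?ajax=true&a=1, the result is /hello/hi&a=1, which is no longer a valid query string, so that case would need separate handling.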
I think your expression is OK. You can add (?i) to make the match case-insensitive. The result should be something like:
echo preg_replace('/(\?|&)(?i)ajax=true/', '', $string);
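For comparison, the inline (?i) and the trailing i modifier can be sketched side by side. The sample URL is an assumption; both calls should produce the same result here.

```php
<?php
$url = '/hello/hi?AJAX=True';

// Inline option: case-insensitivity applies from (?i) to the end of the pattern.
$inline = preg_replace('/(\?|&)(?i)ajax=true/', '', $url);

// Pattern modifier: the i flag after the closing delimiter covers the whole pattern.
$flag = preg_replace('/[?&]ajax=true/i', '', $url);

var_dump($inline, $flag);
```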