I have a regex that I use to check strings for validity.
Recently, however, I also want to use the same regex to replace every character that the regex considers invalid with a character (let's say x).
My regex to match these types of strings is: '#^[\pL\'\’\d][\pL\.\-\ \'\/\,\’\d]*$#iu'
This allows the first character to be a letter of any language, a digit, or one of a few special characters. All following characters may be the same, plus a few more special characters.
This is what I do (nothing special).
preg_replace($regex, 'x', $string);
Things I tried include trying to negate the regex:
'(?![\pL\'\’\d][\pL\.\-\ \'\/\,\’\d]*)'
'[^\pL\'\’\d][^\pL\.\-\ \'\/\,\’\d]*'
I've also tried splitting up the string into the firstchar and the rest of the string and split the regex in 2.
$validationRegex1 = '[^\pL\'\’\d]';
$validationRegex2 = '[^\pL\.\-\ \'\/\,\’\d]*';
$fixedStr1 = (string) preg_replace($validationRegex1, 'x', $firstChar)
. (string) preg_replace($validationRegex2, 'x', $theRest);
But this also did not seem to work.
I've experimented a bit with this online tool: https://www.functions-online.com/preg_replace.html
Does anyone know what I am overlooking?
Examples of strings and their expected results
'-' should become 'x'.
'Random-morestuff' stays 'Random-morestuff'
'Random%morestuff' should become 'Randomxmorestuff'
'Rândôm' stays 'Rândôm'
Just an idea but if I got you right, you could use
(?(DEFINE)
(?<first>[\pL\d'’])
(?<other>[-\ \pL\d.'/,’])
)
\b(?&first)(?&other)+\b(*SKIP)(*FAIL)|.
This needs to be replaced by x. You do not have to escape everything in a character class, I changed this accordingly.
See a demo on regex101.com.
A bit more explanation: The (?(DEFINE)...) thingy lets you define subroutines that can be used afterwards and is just syntactic sugar in this case (maybe a bit showing off, really). As you have stated that other characters are allowed depending on their positions, I just called them first and other. The \b marks a word boundary, that is a boundary between \w (usually [a-zA-Z0-9_]) and \W (not \w). All of these "words" are allowed, so we let the engine "forget" what has been matched with the (*SKIP)(*FAIL) mechanism and match any other character on the right side of the alternation (|). See how (*SKIP)(*FAIL) works here on SO.
Use
$fixedStr1 = preg_replace('/[\p{L}\'\’\d][\p{L}\.\ \'\/\,\’\d-]*(*SKIP)(*FAIL)|./u', 'x', $input_string);
See regex proof.
This makes the regex fail on (and skip past) every run of valid characters, and replaces each character that appears anywhere else.
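As a quick check, here is a minimal sketch that runs the pattern from this answer against the four example strings from the question. It assumes the file is saved as UTF-8 (needed for ’ and â together with the u modifier); the variable names $inputs and $results are made up for illustration.

```php
<?php
// Pattern copied from the answer above. Runs of valid characters are matched
// and then discarded via (*SKIP)(*FAIL); anything left is matched by the dot.
$pattern = '/[\p{L}\'\’\d][\p{L}\.\ \'\/\,\’\d-]*(*SKIP)(*FAIL)|./u';

// Hypothetical test list, taken from the examples in the question.
$inputs = ['-', 'Random-morestuff', 'Random%morestuff', 'Rândôm'];

$results = [];
foreach ($inputs as $input) {
    // Every character outside a valid run is replaced by x.
    $results[$input] = preg_replace($pattern, 'x', $input);
    echo $input, ' => ', $results[$input], PHP_EOL;
}
```

Only '-' and 'Random%morestuff' should change here; the other two strings consist of a single valid run and are skipped as a whole.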
Related
I am trying to grab the text after the last number in the string and grab the whole string if it doesn't contain numbers.
The best regex I could come up with is:
([^\d\s]*)$
However, I found that \s and \d aren't supported in MySQL REGEXP. [[:space:]] can be used instead of \s, but I'm not sure what \d is equivalent to.
This is what I'm trying to accomplish:
'1/2 Oz' returns 'Oz'
'2 3/4 Oz' returns 'Oz'
'As needed' returns 'As needed'
This is the regex you will need:
/^.*?(\d+(?=\D*$)\s*)/
And just replace matched text with empty string ""
PHP code:
$s = preg_replace('/^.*?(\d+(?=\D*$)\s*)/', '', 'Foo Oz');
//=> Foo Oz
$s = preg_replace('/^.*?(\d+(?=\D*$)\s*)/', '', '1/2 Oz');
//=> Oz
Live Demo: http://ideone.com/u887D7
First of all, you could simply avoid the class, and use a range instead:
[^0-9[:space:]]*$
But there is one for digits as well (which may actually include non-ASCII digits). The documentation has a list of these. They are called POSIX bracket expressions by the way.
[^[:digit:][:space:]]*$
However, the general problem with this approach is that it doesn't allow for spaces later on in the string (like the one between As and needed). To get those, but still avoid capturing trailing spaces after digits, make sure the first character is neither space nor digit, then match the rest of the string as non-digits. In addition, make the whole thing optional, to ensure that it still works with strings ending in a digit.
([^[:digit:][:space:]][^[:digit:]]*)?$
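The question is about MySQL, but the idea can be tried out in PHP, since preg_match also accepts POSIX bracket expressions. This is only a sketch under that assumption: MySQL's REGEXP may behave differently in details, and trailingText is a hypothetical helper name. Note that the tail uses [^[:digit:]] with the doubled brackets.

```php
<?php
// Hypothetical helper: returns the text after the last digit (without leading
// spaces), or the whole string when it contains no digits at all.
function trailingText(string $s): string
{
    // First character: neither digit nor space. Rest: any non-digits.
    // The group is optional, so the pattern always matches at the end.
    preg_match('/([^[:digit:][:space:]][^[:digit:]]*)?$/', $s, $m);
    return $m[0];
}

foreach (['1/2 Oz', '2 3/4 Oz', 'As needed'] as $example) {
    echo $example, ' => ', trailingText($example), PHP_EOL;
}
```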
I need to strip any non-alphanumeric characters from the end of strings using PHP's preg_replace:
'Word One, Two, -', 'Word One, Two,[space]', 'Word One, Two,' and 'Word One, Two' should all become 'Word One, Two'.
I have tried preg_replace('/(.+)\\W+$/', '$1', 'Word One, Two, -'); but this only strips the last non-word character. I also tried '/(.+)\\W*$/' as I assumed this would make it work if 0 or 1 non-word characters are found (as I need) but it then doesn't match at all. I think I need to make the \W greedy but I'm not sure how. Any ideas? Also, please feel free to explain to me what I am doing wrong so I don't find myself haunting the SO regex tag ;-)
This is because (.+) eats up all the other characters, including non-word characters. The regex engine starts matching the string and starts out with all characters in the capturing group. Only then does it notice that the \W at the end of the string won't fit, so it backs up, tentatively allowing a single character to be matched by the \W. But a single character is all that's needed to satisfy the \W+, so it just stops and just strips that single character. That's also the reason why (.+)\W*$ doesn't work at all, because \W* is content with matching nothing at all.
Use
preg_replace('/\\W+$/', '', $foo);
instead. This avoids the problem by just replacing trailing non-word characters without even trying to match something else.
Another option would be
preg_replace('/(.+?)\\W+$/', '$1', $foo);
which would use a lazy quantifier (+?) for the capturing group. This quantifier tries satisfying the match while matching as little as possible (as opposed to + which tries to match as much as possible as we saw above). But generally I'd avoid replacing parts of the match by themselves if you can avoid it. To strip things from a string you certainly don't need to match more than you need to strip.
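To see the difference side by side, here is a small sketch using one of the strings from the question. The variable names are made up for illustration.

```php
<?php
$input = 'Word One, Two, -';

// Greedy group: (.+) grabs everything and backs off by a single character,
// so only the final '-' is stripped and the trailing space survives.
$greedy = preg_replace('/(.+)\W+$/', '$1', $input);

// No group at all: only the trailing run of non-word characters is matched.
$stripped = preg_replace('/\W+$/', '', $input);

// Lazy group: (.+?) stops as early as possible, leaving the rest to \W+.
$lazy = preg_replace('/(.+?)\W+$/', '$1', $input);

var_dump($greedy, $stripped, $lazy);
```

The last two calls should give the same result; the version without the group simply does less work to get there.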
What your regex is doing is looking for the maximum possible amount of any character, while still keeping at least one non-word at the end.
What you need to do is just drop the (.+), and use:
preg_replace("/\W+$/","",$input);
I'm trying to parse product names that have multiple abbreviations for sizes. For example, medium can be
m, medium, med
I tried a simple
preg_match('/m|medium|med/i',$prod_name,$matches);
which works fine for 'product m xyz'. However, when I try 'product s/m abc' I'm getting a false-positive match.
I also tried
preg_match('/\bm\b|\bmedium\b|\bmed\b/i',$prod_name,$matches);
to force it to be found in a word, but the m in s/m is still being matched. I'm assuming this is due to the engine treating '/' in the name as a word delimiter?
So to sum up, I need to match 'm' in a string, but not 's/m' or 'small', etc.. Any help is appreciated.
%\b(?<![/-])(m|med|medium)(?![/-])\b%
You can use negative lookbehind or lookahead to exclude the offending separators. This means "m"/"med"/"medium" which is its own word, but not preceded or followed by a slash or a dash. It also works on the beginning and end of string, since negative lookahead/lookbehind do not force a matching character to be present.
If you only want to delimit on whitespace, you can use the positive version:
%\b(?<=\s|^)(m|med|medium)(?=\s|$)\b%
("m"/"med"/"medium" which is preceded by whitespace or the start of the string, and followed by whitespace or the end of the string)
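A rough sketch of how the negative-lookaround version behaves on a few names. The i modifier is an assumption carried over from the question's own attempts, and the product names in $names are made up for illustration.

```php
<?php
// Negative lookbehind/lookahead exclude a slash or dash next to the size word.
$pattern = '%\b(?<![/-])(m|med|medium)(?![/-])\b%i';

// Hypothetical product names.
$names = ['product m xyz', 'product med xyz', 'product s/m abc', 'small shirt'];

$found = [];
foreach ($names as $name) {
    // Store the matched size word, or null when nothing matched.
    $found[$name] = preg_match($pattern, $name, $matches) === 1 ? $matches[1] : null;
}
var_dump($found);
```

The 'm' in 's/m' is rejected by the lookbehind, and the 'm' in 'small' is rejected by \b because it sits in the middle of a word.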
I always think of these things in ERE first. And according to re_format(7) ERE's word boundaries, [[:<:]] and [[:>:]] match the null string at the beginning and end of a word respectively. So ... since preg should understand ERE notation, I might go with:
/[[:<:]](m(ed(ium)?)?)[[:>:]]/
Or for easier reading, perhaps:
/[[:<:]](m|med|medium)[[:>:]]/
In PHP though, you can use PREG instead of ERE. In PREG, \b indicates a word boundary, so:
preg_match('/\b(m(ed(ium)?)?)\b/', $prod_name, $matches);
Try this, it should match medium, med, and m.
medium|med|^m$
I am trying to validate a string of 3 numbers followed by / then 5 more numbers
I thought this would work
(/^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9])/i)
but it doesn't. Any ideas what I'm doing wrong?
Try this
preg_match('#^\d{3}/\d{5}#', $string)
The reason yours is not working is due to the + symbols which match "one or more" of the nominated character or character class.
Also, when using forward-slash delimiters (the characters at the start and end of your expression), you need to escape any forward-slashes in the pattern by prefixing them with a backslash, eg
/foo\/bar/
PHP allows you to use alternate delimiters (as in my answer) which is handy if your expression contains many forward-slashes.
First of all, you're using / as the regexp delimiter, so you can't use it in the pattern without escaping it with a backslash. Otherwise, PHP will think that your pattern ends at the / in the middle (you can see that even StackOverflow's syntax highlighting thinks so).
Second, the + means "one or more" and is greedy, so the first [0-9]+ initially grabs all the leading digits. The engine then backtracks to let the next two match, but the pattern still accepts more than three digits before the slash (and more than five after it).
Third, there's no need to use i, since you're dealing with numbers which aren't upper- or lowercase, so case-sensitivity is a moot point.
Try this instead
/^\d{3}\/\d{5}$/
The \d is shorthand for writing [0-9], and the {3} and {5} means repeat 3 or 5 times, respectively.
(This pattern is anchored to the start and the end of the string. Your pattern was only anchored to the beginning, and if that was on purpose, then remove the $ from my pattern)
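A short sketch comparing the two patterns. $loose is the question's pattern with the inner slash escaped and the i modifier dropped so that it compiles; the variable names are made up for illustration.

```php
<?php
// Exactly three digits, a slash, exactly five digits, nothing else.
$strict = '/^\d{3}\/\d{5}$/';

// The question's pattern, with the slash escaped. Each + means "one or more",
// so longer runs of digits are accepted as well.
$loose = '/^([0-9]+[0-9]+[0-9]+\/[0-9]+[0-9]+[0-9]+[0-9]+[0-9])/';

$checks = [
    'strict 123/45678'   => preg_match($strict, '123/45678'),
    'strict 1234/567890' => preg_match($strict, '1234/567890'),
    'loose 123/45678'    => preg_match($loose, '123/45678'),
    'loose 1234/567890'  => preg_match($loose, '1234/567890'),
];
var_dump($checks);
```

Only the strict pattern should reject the longer string.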
I recently found this site useful for debugging regexes:
http://www.regextester.com/index2.html
It assumes use of /.../ (meaning you should not include those slashes in the regex you paste in).
So, after I put your regex ^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9]) in the Regex box and 123/45678 in the Test box I see no match. When I put a backslash in front of the forward slash in the middle, then it recognizes the match. You can then try matching 1234/567890 and discover it still matches. Then you go through and remove all the plus signs and then it correctly stops matching.
What I particularly like about this particular site is the way it shows the partial matches in red, allowing you to see where your regex is working up to.
When I try preg_match with the following expression: /.{0,5}/, it still matches strings longer than 5 characters.
It does, however, work properly when I try it in an online regexp matcher.
The site you reference, myregexp.com, is focussed on Java.
Java has a specific function for matching an exact pattern, without needing to use anchor characters. This is the function which myregexp.com uses.
In most other languages, in order to match an exact pattern, you would need to add the anchoring characters ^ and $ at the start and end of the pattern respectively, otherwise the regex assumes it only needs to find the matched pattern somewhere within the string, rather than the whole string being the match.
This means that without the anchors, your pattern will match any string, of any length, because whatever the string, it will contain within it somewhere a match for "zero to five of any character".
So in PHP, and Perl, and virtually any other language, you need your pattern to look like this:
/^.{0,5}$/
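To illustrate the difference, a minimal sketch (the variable names are made up for illustration):

```php
<?php
$long = 'abcdefgh';

// Without anchors the pattern matches a part of the string: here the first
// five characters end up in $m[0].
$unanchored = preg_match('/.{0,5}/', $long, $m);

// With anchors the whole string has to fit, so eight characters are rejected.
$anchored = preg_match('/^.{0,5}$/', $long);

// A string of up to five characters still passes.
$short = preg_match('/^.{0,5}$/', 'abc');

var_dump($unanchored, $m[0], $anchored, $short);
```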
Having explained all that, I would make one final observation though: this specific pattern really doesn't need to be a regular expression -- you could achieve the same thing with strlen(). In addition, the dot character in regex may not work exactly as you expect: it typically matches almost any character; some characters, including new line characters, are excluded by default, so if your string contains five characters, but one of them is a new line, it will fail your regex when you might have expected it to pass. With this in mind, strlen() would be a safer option (or mb_strlen() if you expect to have unicode characters).
If you need to match any character in regex, and the default behaviour of the dot isn't good enough, there are two options: One is to add the s modifier at the end of the expression (ie it becomes /^.{0,5}$/s). The s modifier tells regex to include new line characters in the dot "any character" match.
The other option (which is useful for languages that don't support the s modifier) is to use an expression and its negative together in a character class - eg [\s\S] - instead of the dot. \s matches any white space character, and \S is a negative of \s, so any character not matched by \s. So together in a character class they match any character. It's more long winded and less readable than a dot, but in some languages it's the only way to be sure.
You can find out more about this here: http://www.regular-expressions.info/dot.html
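Here is a small sketch of the newline case described above, assuming a five-character string with a newline in the middle (the variable names are made up for illustration):

```php
<?php
// Five characters, one of them a newline (double quotes so \n is interpreted).
$s = "ab\ncd";

// Default dot: does not match the newline, so the anchored pattern fails.
$plain = preg_match('/^.{0,5}$/', $s);

// s modifier: the dot matches newlines too.
$dotall = preg_match('/^.{0,5}$/s', $s);

// Character class alternative for engines without the s modifier.
$class = preg_match('/^[\s\S]{0,5}$/', $s);

// Plain length check, no regex involved.
$lengthOk = strlen($s) <= 5;

var_dump($plain, $dotall, $class, $lengthOk);
```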
Hope that helps.
You need to anchor it with ^$. These symbols match the beginning and end of the string respectively, so it must be 0-5 characters between the beginning and end. Leaving out the anchors will match anywhere in the string so it could be longer.
/^.{0,5}$/
For better readability, I would probably also enclose the . in (), but that's kind of subjective.
/^(.){0,5}$/