how to detect certain ending words in a mention - php

I have the following regular expression to detect mentions and extract them into string:
preg_match_all('/(?<=^|\s)#([^#\s]+)/'
this works well for detecting strings like this:
#ajksdh
#kajshd123
#12398asdd
however I wanted to make an exception so that it doesn't detect mention strings that end with 'rb', so the following shouldn't be matched
#72rb
#80rb
so the format is some numbers followed by 'rb'. Is this even possible?

Step 1
To exclude strings ending with rb, just add a closing boundary and a negative lookbehind:
(?<=^|\s)#([^#\s]+)(?<!rb)\b
See demo
Step 2
What this is missing is that [^#\s] does not really define what you want (I am guessing). At the moment, it matches any character other than # and whitespace, including punctuation and Japanese characters, for instance. This is probably closer to what you want:
(?<=^|\s)#((?:(?!#)\w)+)(?<!rb)\b
See demo
Fine-Tuning
If instead of just \w you want to allow more characters, let me know which, and we can tune this. For instance, to allow all ASCII characters except space, we could use:
(?<=^|\s)#((?:(?!#)[!-~])+)(?<!rb)\b
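As a rough sketch of how the Step 2 pattern could be used from PHP (the function name extractMentions and the sample text are made up for illustration), preg_match_all collects the captured names in group 1:

```php
<?php
// Hypothetical helper: returns mention names, skipping any that end in "rb".
function extractMentions(string $text): array
{
    // (?<=^|\s)       the # must be at the start or follow whitespace
    // ((?:(?!#)\w)+)  capture one or more word characters
    // (?<!rb)\b       the mention must end at a word boundary, not right after "rb"
    $pattern = '/(?<=^|\s)#((?:(?!#)\w)+)(?<!rb)\b/';
    preg_match_all($pattern, $text, $matches);
    return $matches[1];
}

print_r(extractMentions('#ajksdh #kajshd123 #72rb #80rb #12398asdd'));
// prints ajksdh, kajshd123 and 12398asdd; #72rb and #80rb are skipped
```

Note that the lookbehind rejects every mention ending in rb (#herb, for example), not only digits followed by rb. If only the numeric form should be excluded, a negative lookahead such as (?!\d+rb\b) placed right after the # is one possible alternative.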

Related

Allow Parenthesis and forward slash to this regex

I have this regex
!preg_match("/^[a-z0-9](?:[a-z0-9'. -]*[a-z0-9])?$/i", stripslashes($post['job_title']))
and I want to allow numbers, parentheses, and slashes in this regex, because some job titles can be "Front-end developer/designer" or "Recruitment Staff (HR)". How can I achieve this?
Okay I managed to make a proper regex for this which allows Slashes within but not at the START/END, and also allows parenthesis within and at START/END.
!preg_match("/^[a-z0-9\(\)](?:[a-z0-9\/\(\)'. -]*[a-z0-9\(\)])?$/i", stripslashes($post['job_title']))
Thanks to #anubhava his reply gave me an idea how to add stuff in the regex
I don't think your intention is being translated to the pattern.
/^[a-z0-9](?:[a-z0-9'. -]*[a-z0-9])?$/i
/^[a-z0-9\(\)](?:[a-z0-9\/\(\)'. -]*[a-z0-9\(\)])?$/i
In both the pattern in your question and the pattern in your answer, check what the third segment (the final character class) really does. The trailing ? makes the whole non-capturing group optional, not just that last class. So the pattern accepts either a single character or a longer string whose last character comes from the final class, and that final class is what stops a title from ending in a hyphen, space, dot, apostrophe, or slash. If you do not need that restriction on the last character, these simpler patterns are suitable replacements:
/^[a-z0-9][a-z0-9'. -]*$/i
~^[a-z0-9()][a-z0-9/()'. -]*$~i
If you mean to demand that the string ends in an alphanumeric or parenthetical character, keep the optional group as it is. Removing the ? before the $ would only add a minimum length of two characters.
That said, if you want to ensure that:
hyphens, spaces, and dots only occur in the middle of the string and
all parentheses are properly opened and closed, contain characters between them, and do not occur at the start of the string
etc.
then the best strategy will be "test driven development". Create a large, diverse sample of strings as well as unrealistic strings that you know should fail. Then run your current pattern against all strings. Then analyze which cases do not evaluate as expected and adjust your pattern.
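A minimal sketch of that test-driven approach, using the pattern from the asker's own answer (rewritten with ~ delimiters). The sample strings, the expected results, and the names JOB_TITLE_PATTERN, JOB_TITLE_CASES and failingCases are illustrative assumptions, not part of the original code:

```php
<?php
// The asker's pattern, rewritten with ~ delimiters so / needs no escaping.
const JOB_TITLE_PATTERN = '~^[a-z0-9()](?:[a-z0-9/()\'. -]*[a-z0-9()])?$~i';

// Hypothetical sample set: input => whether it should be accepted.
const JOB_TITLE_CASES = [
    'Front-end developer/designer' => true,
    'Recruitment Staff (HR)'       => true,
    'X'                            => true,
    '/designer'                    => false, // slash at the start
    'designer/'                    => false, // slash at the end
    'a-'                           => false, // hyphen at the end
    'Dev@ops'                      => false, // @ is not allowed
    ''                             => false, // empty string
];

// Returns the inputs whose actual result differs from the expectation.
function failingCases(string $pattern, array $cases): array
{
    $failures = [];
    foreach ($cases as $input => $expected) {
        $actual = preg_match($pattern, (string) $input) === 1;
        if ($actual !== $expected) {
            $failures[] = $input;
        }
    }
    return $failures;
}

// An empty array means every case behaved as expected.
var_dump(failingCases(JOB_TITLE_PATTERN, JOB_TITLE_CASES));
```

Whenever the pattern changes, rerun the table; any input that shows up in the result is a case to analyze.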

preg_replace does not replace what I want

I have this regex that matches strings that I want to check on validity.
However recently I want to use this same regex to replace every character that is not valid to the regex with a character (let's say x).
My regex to match these types of strings is: '#^[\pL\'\’\d][\pL\.\-\ \'\/\,\’\d]*$#iu'
This allows the first character to be a letter of any language, a digit, or one of a few specific special characters. All following characters may come from the same set plus a few more special characters.
This is what I do (nothing special).
preg_replace($regex, 'x', $string);
Things I tried include trying to negate the regex:
'(?![\pL\'\’\d][\pL\.\-\ \'\/\,\’\d]*)'
'[^\pL\'\’\d][^\pL\.\-\ \'\/\,\’\d]*'
I've also tried splitting up the string into the firstchar and the rest of the string and split the regex in 2.
$validationRegex1 = '[^\pL\'\’\d]';
$validationRegex2 = '[^\pL\.\-\ \'\/\,\’\d]*';
$fixedStr1 = (string) preg_replace($validationRegex1, 'x', $firstChar)
. (string) preg_replace($validationRegex2, 'x', $theRest);
But this also did not seem to work.
I've experimented a bit with this online tool: https://www.functions-online.com/preg_replace.html
Does anyone know what I am overlooking?
Examples of strings and their expected results
'-' should become 'x'.
'Random-morestuff' stays 'Random-morestuff'
'Random%morestuff' should become 'Randomxmorestuff'
'Rândôm' stays 'Rândôm'
Just an idea but if I got you right, you could use
(?(DEFINE)
(?<first>[\pL\d'’])
(?<other>[-\ \pL\d.'/,’])
)
\b(?&first)(?&other)+\b(*SKIP)(*FAIL)|.
This needs to be replaced by x. You do not have to escape everything in a character class, I changed this accordingly.
See a demo on regex101.com.
A bit more explanation: The (?(DEFINE)...) thingy lets you define subroutines that can be used afterwards and is just syntactic sugar in this case (maybe a bit of showing off, really). As you have stated that different characters are allowed depending on their position, I just called them first and other. The \b marks a word boundary, that is, a boundary between \w (usually [a-zA-Z0-9_]) and \W (not \w). All of these "words" are allowed, so we let the engine "forget" what has been matched with the (*SKIP)(*FAIL) mechanism and match any other character on the right side of the alternation (|). See how (*SKIP)(*FAIL) works here on SO.
Use
$fixedStr1 = preg_replace('/[\p{L}\'\’\d][\p{L}\.\ \'\/\,\’\d-]*(*SKIP)(*FAIL)|./u', 'x', $input_string);
See regex proof.
This fails (and skips past) every match that forms a valid run of characters, and replaces every character that appears anywhere else.
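To check this against the examples from the question, here is a small sketch. The wrapper name maskInvalidChars is made up, redundant escapes inside the character classes are dropped, and valid UTF-8 input is assumed:

```php
<?php
// Hypothetical wrapper around the (*SKIP)(*FAIL) pattern; assumes valid UTF-8 input.
function maskInvalidChars(string $input): string
{
    // Left branch: a valid run (valid first character, then valid followers)
    // is matched and then discarded with (*SKIP)(*FAIL).
    // Right branch: any other character is matched and replaced by x.
    $pattern = '/[\p{L}\'’\d][\p{L}. \'\/,’\d-]*(*SKIP)(*FAIL)|./u';
    return preg_replace($pattern, 'x', $input);
}

foreach (['-', 'Random-morestuff', 'Random%morestuff', 'Rândôm'] as $sample) {
    echo $sample, ' => ', maskInvalidChars($sample), "\n";
}
```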

Validate unicode textarea for minimum length

I have to validate Russian text (utf8) entered in textarea field of the form. The number of characters (no spaces, no empty lines) should be at least 500. The text should be checked with regex and can have many lines.
I have tried:
#^.{500}.*#
This does impose some restriction. However, it seems that this pattern does not respect Unicode: 260 Russian characters are enough to pass the check. I cannot figure out how to:
check unicode characters
do not count white spaces
do not count empty lines
Okay, so firstly . by default matches bytes, because the input string is interpreted as ASCII. Using Unicode mode changes that (as Esailija correctly pointed out), so that . correctly matches (Unicode) characters:
#^.{500}#u
You don't need the trailing .*, because there is no need to match the full string in PHP. Note that this does not match if there is a line-break within the first 500 characters, because . does not match line-breaks (you should add the s modifier as well, to change that).
For the second requirement to exclude whitespace from the count, you could do something like this:
#^(?:\s*\S){500}#u
That subgroup matches as many whitespace characters as possible, and then one non-whitespace character. And that together has to be matched 500 times. Hence, you only get one repetition per non-whitespace character, as required.
Note that there is no need for the s modifier for this to work under all circumstances, because we don't use the dot (.).
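A small sketch of that counting pattern with the minimum made configurable, so it can be tried on short strings. The helper name hasMinNonWhitespace and the sample text are illustrative assumptions:

```php
<?php
// Hypothetical helper: true if $text contains at least $min non-whitespace characters.
function hasMinNonWhitespace(string $text, int $min): bool
{
    // (?:\s*\S){N}: each repetition skips whitespace, then consumes one
    // non-whitespace character. The u modifier makes the engine work on
    // UTF-8 characters instead of bytes.
    $pattern = sprintf('#^(?:\s*\S){%d}#u', $min);
    return preg_match($pattern, $text) === 1;
}

$text = "привет\n\n  мир"; // 9 letters plus line breaks and spaces
var_dump(hasMinNonWhitespace($text, 9));  // bool(true)
var_dump(hasMinNonWhitespace($text, 10)); // bool(false)
```

For the form in the question, the call would be hasMinNonWhitespace($input, 500).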
There is one caveat, though, which is explained in this article. With Unicode, some characters are made up of multiple code points. For instance, à can be written as the character a followed by another code point (U+0300, the combining grave accent), which is a combining mark. So while there are two Unicode code points, they still form only one character. However, . matches code points (because it doesn't distinguish between combining marks and "stand-alone" characters). I suppose that will not affect your situation, since Cyrillic doesn't normally use accents. But it's worth being aware of. If it is relevant for you, you might want to look into a more advanced solution like Ωmega's.
You need the u flag to activate UTF-8 awareness in preg_ functions:
$regex = '#^.{500}.*#u';
If you just want to check that it is at least 500 characters long, you can use mb_strlen:
mb_internal_encoding("UTF-8");
// Strip whitespace (including line breaks) and zero-width characters before counting.
$input_without_whitespace = preg_replace( '/[\x{0009}\x{000A}\x{000B}\x{000C}\x{000D}\x{0020}\x{00A0}\x{FEFF}\x{200C}\x{200D}]/u', "", $input );
if( mb_strlen( $input_without_whitespace ) >= 500 ) {
    // the text is long enough
}
Use regex pattern
/(?>\s*+\P{M}\p{M}*){500}/u
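To illustrate how this pattern treats combining marks, here is a sketch with the repetition count made configurable. The helper name hasMinCharacters and the sample string are assumptions, and the "\u{...}" escape requires PHP 7 or later:

```php
<?php
// Hypothetical helper: counts a base character plus its combining marks as one.
function hasMinCharacters(string $text, int $min): bool
{
    // \s*+    skip whitespace possessively
    // \P{M}   one code point that is not a combining mark
    // \p{M}*  any combining marks attached to it
    $pattern = sprintf('/(?>\s*+\P{M}\p{M}*){%d}/u', $min);
    return preg_match($pattern, $text) === 1;
}

$text = "a\u{0300} b"; // "à" written as a + U+0300, then a space and "b"
var_dump(hasMinCharacters($text, 2)); // bool(true)
var_dump(hasMinCharacters($text, 3)); // bool(false)
```

By contrast, a pattern that counts every non-whitespace code point, such as #^(?:\s*\S){3}#u, does match this string, because the combining mark is counted separately.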

regex validation

I am trying to validate a string of 3 numbers followed by / then 5 more numbers
I thought this would work
(/^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9])/i)
but it doesn't, any ideas what i'm doing wrong
Try this
preg_match('#^\d{3}/\d{5}#', $string)
The reason yours is not working is due to the + symbols which match "one or more" of the nominated character or character class.
Also, when using forward-slash delimiters (the characters at the start and end of your expression), you need to escape any forward-slashes in the pattern by prefixing them with a backslash, eg
/foo\/bar/
PHP allows you to use alternate delimiters (as in my answer) which is handy if your expression contains many forward-slashes.
First of all, you're using / as the regexp delimiter, so you can't use it in the pattern without escaping it with a backslash. Otherwise, PHP will think that your pattern ends at the / in the middle (you can see that even StackOverflow's syntax highlighting thinks so).
Second, the + means "one or more", so [0-9]+[0-9]+[0-9]+ matches three or more digits rather than exactly three. The engine backtracks, so 123 still gets through, but so does 1234.
Third, there's no need to use i, since you're dealing with numbers which aren't upper- or lowercase, so case-sensitivity is a moot point.
Try this instead
/^\d{3}\/\d{5}$/
The \d is shorthand for writing [0-9], and the {3} and {5} means repeat 3 or 5 times, respectively.
(This pattern is anchored to the start and the end of the string. Your pattern was only anchored to the beginning, and if that was on purpose, then remove the $ from my pattern.)
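For reference, a minimal sketch of the anchored pattern wrapped in a function; the name isValidCode and the sample values are made up for illustration:

```php
<?php
// Hypothetical helper: true only for exactly 3 digits, a slash, and 5 digits.
function isValidCode(string $value): bool
{
    // The / inside the pattern is escaped because / is also the delimiter.
    return preg_match('/^\d{3}\/\d{5}$/', $value) === 1;
}

var_dump(isValidCode('123/45678'));   // bool(true)
var_dump(isValidCode('1234/567890')); // bool(false)
var_dump(isValidCode('123/4567'));    // bool(false)
```

One detail to be aware of: by default $ also matches just before a trailing newline. If that matters, the D modifier (or \z in place of $) makes the end anchor strict.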
I recently found this site useful for debugging regexes:
http://www.regextester.com/index2.html
It assumes use of /.../ (meaning you should not include those slashes in the regex you paste in).
So, after I put your regex ^([0-9]+[0-9]+[0-9]+/[0-9]+[0-9]+[0-9]+[0-9]+[0-9]) in the Regex box and 123/45678 in the Test box I see no match. When I put a backslash in front of the forward slash in the middle, then it recognizes the match. You can then try matching 1234/567890 and discover it still matches. Then you go through and remove all the plus signs and then it correctly stops matching.
What I particularly like about this site is the way it shows the partial matches in red, allowing you to see how far your regex gets before it stops matching.

regex: remove all text within "double-quotes" (multiline included)

I'm having a hard time removing text within double-quotes, especially those spread over multiple lines:
$file=file_get_contents('test.html');
$replaced = preg_replace('/"(\n.)+?"/m','', $file);
I want to remove ALL text within double-quotes (included). Some of the text within them will be spread over multiple lines.
I read that newlines can be \r\n and \n as well.
Try this expression:
"[^"]+"
Also make sure you replace globally. In PHP, preg_replace replaces every match by default, so no g flag is needed.
Another edit: daalbert's solution is best: a quote followed by one or more non-quotes ending with a quote.
I would make one slight modification if you're parsing HTML: make it 0 or more non-quote characters...so the regex will be:
"[^"]*"
EDIT:
On second thought, here's a better one:
"[\S\s]*?"
This says: "a quote followed by either a non-whitespace character or white-space character any number of times, non-greedily, ending with a quote"
The one below uses capture groups when it isn't necessary...and the use of a wildcard here isn't explicit about showing that wildcard matches everything but the new-line char...so it's more clear to say: "either a non-whitespace char or whitespace char" :) -- not that it makes any difference in the result.
there are many regexes that can solve your problem but here's one:
"(.*?(\s)*?)*?"
this reads as:
find a quote optionally followed by: (any number of characters that are not new-line characters non-greedily, followed by any number of whitespace characters non-greedily), repeated any number of times non-greedily
Greedy means the quantifier first goes as far as it can, to the end of the string if possible, and tries to match from there. If it can't find a match, it gives back one character from the end and tries again, and so on. Non-greedy means it will take as few characters as possible while still matching the criteria.
great link on regex: http://www.regular-expressions.info
great link to test regexes: http://regexpal.com/
Remember that your regex may have to change slightly based on what language you're using to search using regex.
You can use single line mode (also known as dotall) and the dot will match even newlines (whatever they are):
/".+?"/s
You are using multiline mode, which simply changes the meaning of ^ and $ from beginning/end of string to beginning/end of line. You don't need it here.
"[^"]+"
Something like below. s is dotall mode where . will match even newline:
/".+?"/s
$replaced = preg_replace('/"[^"]*"/s','', $file);
will do this for you. Note, however, that it won't allow for escaped double quotes: A "test \" quoted string" B will result in A quoted string" B (with a leading space before quoted), not in A B as you might expect.
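A short sketch that exercises both the multi-line case and the escaped-quote caveat. The helper name stripQuoted and the sample strings are assumptions for illustration:

```php
<?php
// Hypothetical helper: removes every "..." span, including ones that cross line breaks.
function stripQuoted(string $text): string
{
    // [^"] also matches \r and \n, so no s or m modifier is needed.
    return preg_replace('/"[^"]*"/', '', $text);
}

// A quoted span containing \r\n is removed as a whole.
echo stripQuoted("keep \"drop\r\nthis\" keep"), "\n";

// An escaped quote ends the span early, so part of the text is left behind.
echo stripQuoted('A "test \" quoted string" B'), "\n";
```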
