I am trying to do a preg_match to filter unwanted spam queries and I would like to match any word that is listed in the preg_match and filter it if it has no space after it.
So for example if I have the word balloon in the preg_match then I want to filter anything like "balloon1" or "balloond" or "balloonedfbdg" etc and allow anything with a space after balloon like "balloon big", "balloon small" etc.
I have a lot of queries from google that take a single word and add a whole bunch of crap to it that I want to filter out. It is only a few words but it is irritating for me enough to come here and find an answer to fix this.
I already use a preg_match for some of the spam queries using regular expressions but I do not know how to match something that is not spaced and allow something that has a space.
Any help is appreciated, Thanks.
Your Expression: /(balloon|otherwordone|othertwo)[^\s]/i
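This matches the listed words when they are followed by any non-whitespace character (\s is whitespace). A listed word at the very end of the string is not matched, because [^\s] has to consume a character.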
Edit: Using \B (not a word boundary):
/(balloon|otherwordone|othertwo)\B/i
This prevents common sentence symbols from triggering the regex (like dot, comma).
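To make the difference between the two expressions concrete, here is a minimal sketch. It uses Python's re module as a stand-in for PHP's preg_match (both constructs behave the same way there); the helper name is_spam and the sample queries are made up for illustration.

```python
import re

# Words to filter; "otherwordone" and "othertwo" are the placeholders from the answer.
WORDS = r"(balloon|otherwordone|othertwo)"

# Variant 1: a listed word followed by any non-whitespace character.
followed_by_non_space = re.compile(WORDS + r"[^\s]", re.IGNORECASE)
# Variant 2: a listed word followed by another word character (no word boundary).
followed_by_word_char = re.compile(WORDS + r"\B", re.IGNORECASE)


def is_spam(query, pattern):
    """Return True when the query contains a listed word glued to more text."""
    return pattern.search(query) is not None


# Hypothetical sample queries.
for q in ["balloon1", "balloonedfbdg", "balloon big", "big balloon.", "balloon"]:
    print(q, is_spam(q, followed_by_non_space), is_spam(q, followed_by_word_char))
```

Both variants flag "balloon1" and let "balloon big" through; only the first one flags "big balloon.", which is the punctuation case the \B version is meant to avoid.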
I have a regex that will be used to match #users tags.
I use lookaround assertions, letting punctuation and white space characters surround the tags.
There is an added complication, there are a type of bbcodes that represent html.
I have two types of bbcodes, inline (^B bold ^b) and blocks (^C center ^c).
The inline ones have to be skipped over to reach the previous or next character.
And the blocks are allowed to surround a tag, just like punctuation.
I made a regex that does work. What I want to do now is lower the number of steps it takes at every character that is not going to be a match.
At first I thought I could write a regex that just looks for # and only evaluates the lookarounds once it finds one. That worked without the inline bbcodes, but since a lookbehind cannot be quantified, I cannot put ((\^[BIUbiu])++)* inside it, and the regex ends up taking many more steps.
How could I make my regex more efficient, with fewer steps?
Here is a simplified version of it, in the Regex101 link there is the full regex.
(?<=[,\.:=\^ ]|\^[CJLcjl])((\^[BIUbiu])++)*#([A-Za-z0-9\-_]{2,25})((\^[BIUbiu])++)*(?=[,\.:=\^ ]|\^[CJLcjl])
https://regex101.com/r/lTPUOf/4/
A rule of thumb:
Do not let engine make an attempt on matching each single one character if
there are some boundaries.
The quote originally comes from this answer. The following regular expression cuts the number of steps significantly, from ~20000 to ~900, thanks to the left side of the outermost alternation:
(?:[^#^]++|[#^]{2,}+)(*SKIP)(*F)
|
(?<=([HUGE-CHARACTER-CLASS])|\^[cjleqrd])
(\^[34biu78])*+#([a-z\d][\w-.]{0,25}[a-z\d])(\^[34biu78])*+(?=(?1))
I don't care much about the step counts regex101 reports, because they won't hold in your own environment, and it isn't obvious which steps are real or which are missed. But in this case the logic of the regex is clear and the difference is large, so the comparison makes sense.
What is the logic?
We first try to match what is probably not wanted at all, throw it away, and then look at the parts that may match our pattern. [^#^]++ matches everything up to a # or ^ symbol (the characters of interest), and [#^]{2,}+ keeps the engine from taking extra steps before finding out it's going nowhere. So we make it fail as soon as possible.
You can use the i flag instead of spelling out the uppercase forms of letters (this may have a small performance impact, however).
See live demo here
I'm going insane over this. It's so simple, yet I can't figure out the right regex. I need a regex that will match blacklisted words, e.g. "ass".
For example, in this string:
<span class="bob">Blacklisted word was here</span>bass
I tried that regex:
((?!class)ass)
It should match the "ass" in the word "bass" but NOT the one in "class".
Instead, this regex flags "ass" in both occurrences. I checked multiple negative lookaheads on Google and none of them works.
NOTE: This is for a CMS, for moderators to easily find potentially bad words, I know you cannot rely on a computer to do the filtering.
If you have lookbehind available (I first assumed JavaScript, which IIRC lacks it, but just noticed the PHP tag, so you probably do), this is trivial:
(?<!cl)(ass)
Without lookbehind, you probably need to do something like this:
(?:(?!cl)..|^.?)(ass)
That's ass preceded by any two characters as long as they are not cl, or ass sitting zero or one characters after the beginning of the line.
Note that this is probably not the best way to implement a blacklist, though. You probably want this:
\bass\b
Which will match the word ass but not any word that includes ass in it (like association or bass or whatever else).
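As a rough check of the three patterns above against the string from the question, here is a small sketch. It uses Python's re module rather than PHP (the lookbehind, lookahead and \b constructs work the same way), and the helper flagged_offsets is a made-up name.

```python
import re

# The sample string from the question.
html = '<span class="bob">Blacklisted word was here</span>bass'

with_lookbehind = re.compile(r"(?<!cl)(ass)")
without_lookbehind = re.compile(r"(?:(?!cl)..|^.?)(ass)")
whole_word = re.compile(r"\bass\b")


def flagged_offsets(pattern, text):
    """Return the start offset of every flagged "ass".

    pattern.groups is 1 for the capturing patterns and 0 for whole_word,
    so this reads group 1 when it exists and the whole match otherwise.
    """
    return [m.start(pattern.groups) for m in pattern.finditer(text)]


print(flagged_offsets(with_lookbehind, html))     # only the "ass" in "bass"
print(flagged_offsets(without_lookbehind, html))  # same offset as above
print(flagged_offsets(whole_word, html))          # nothing: no standalone word
```

The first two patterns skip "class" and flag "bass", while the \b version flags neither, since neither contains "ass" as a standalone word.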
It seems to me that you're actually trying to use two lists here: one for words that should be excluded (even if one is a part of some other word), and another for words that should not be changed at all - even though they have the words from the first list as substrings.
The trick here is to know where to use the lookbehind:
/ass(?<!class)/
In other words, the good word negative lookbehind should follow the bad word pattern, not precede it. Then it would work correctly.
You can even get some of them in a row:
/ass(?<!class)(?<!pass)(?<!bass)/
This, though, will skip both passhole and pass, so passhole goes unflagged. To make it even more bullet-proof, we can add word-boundary checks:
/ass(?<!\bclass\b)(?<!\bpass\b)(?<!\bbass\b)/
UPDATE: of course, it's more efficient to check for parts of the string, with (?<!cl)(?<!b) etc. But my point was that you can still use the whole words from whitelist in the regex.
Then again, perhaps it'd be wise to prepare the whitelists accordingly (so shorter patterns will have to be checked).
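The effect of putting the whitelist lookbehinds after the bad word, with and without word boundaries, can be sketched like this. Python's re is used here in place of PHP (it accepts these fixed-width lookbehinds too); flagged_words and the sample text are illustrative only.

```python
import re

# Whitelisted words checked as plain suffixes of what precedes the match end.
simple = re.compile(r"ass(?<!class)(?<!pass)(?<!bass)")
# Same idea, but the whitelisted word must stand alone (word boundaries).
bounded = re.compile(r"ass(?<!\bclass\b)(?<!\bpass\b)(?<!\bbass\b)")


def flagged_words(pattern, text):
    """Return the whitespace-separated words in which the pattern matches."""
    return [word for word in text.split() if pattern.search(word)]


sample = "class pass bass passhole asshat"  # hypothetical test words
print(flagged_words(simple, sample))   # "passhole" slips through here
print(flagged_words(bounded, sample))  # "passhole" is flagged here
```

Without the boundaries, anything containing "pass" is let through; with them, only the exact whitelisted words are.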
Is this what you want? (?<!class)(\w+ass)
Can you please tell me how to tell one hyperlink apart from other hyperlinks? E.g.
I want to fetch these links separately from a website using simple html dom, going by the bolded part of the address (between two stars):
1 http://**www.website1.com**/1/2/
2 http://**news.website2.com**/s/d
3 http://**website3.com/news**/gds
I know we can do it using preg_match, but I am having a hard time understanding preg_match.
Can anyone give me a preg_match script to validate these websites?
And can you also explain what this means:
preg_match('|^http(s)?://[a-z0-9-]+(.[a-z0-9-]+)*(:[0-9]+)?(/.*)?$|i', $url)
What are those random-looking characters in preg_match? What do they mean?
If you want to learn about regular expression, I think you could get a good start on the regular-expressions.info website.
And if you want to use them more, the book Mastering Regular Expressions is a must read.
Edit: here is a simple walkthrough, though:
the first parameter of preg_match is the regexp string. The second is the string you're testing against. An optional third one is an array in which everything captured is stored.
the | are used to delimit your regexp and its options. What is between the two | is the regexp, the i at the end is an option (meaning your regexp is case insensitive)
the first ^ marks the start of the string you want to match
http is matched literally, then (s)? means that you want one or no s character, and you want to "capture it"
[a-z0-9-]+ is one or more (not zero) letters, digits or hyphens
(.[a-z0-9-]+)* is wrong, because an unescaped . matches any character. It should be (\.[a-z0-9-]+)* to capture any number of sequences formed by a dot then at least one letter, digit or hyphen
(:[0-9]+)? will capture one or no sequence formed by : followed by one or more digits. It's used to get the url port
(/.*)? captures the end of the url, a slash followed by any number of any characters
$ is the end of your string
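The walkthrough above can be tried out with a short sketch. It uses Python's re module instead of preg_match, with the dot escaped as suggested; the function split_url, its dictionary keys and the :8080 port are made up for illustration.

```python
import re

# Same pattern as in the question, with the dot escaped as suggested above.
URL_RE = re.compile(
    r"^http(s)?://[a-z0-9-]+(\.[a-z0-9-]+)*(:[0-9]+)?(/.*)?$",
    re.IGNORECASE,
)


def split_url(url):
    """Return the captured pieces of the URL, or None when it does not match."""
    m = URL_RE.match(url)
    if m is None:
        return None
    secure, last_label, port, path = m.groups()
    return {
        "secure": secure is not None,  # (s)? captured an "s"
        "last_label": last_label,      # a repeated group keeps its last repetition
        "port": port,                  # includes the leading ":" when present
        "path": path,                  # includes the leading "/" when present
    }


print(split_url("http://www.website1.com/1/2/"))
print(split_url("https://news.website2.com:8080/s/d"))
print(split_url("ftp://website3.com"))
```

Note that the pattern only checks the overall shape of the URL. Telling the three sites apart would still need a comparison on the host part.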
Have a look at In search of the perfect URL validation regex.
I have a regular expression in my PHP script like this:
/(\b$term|$term\b)(?!([^<]+)?>)/iu
This matches the word contained in $term, as long as there's a word boundary before or after and it's not inside a HTML tag.
However, this doesn't work in non-ASCII cases, for example with Russian text. Is there a way to make it work?
I can get almost as good result with
/(\s$term|$term\s)(?!([^<]+)?>)/iu
but this is obviously more limited and since this regexp is about highlighting search terms, it has the problem of including the space in the highlight.
I've read this StackOverflow question about the problem, but it doesn't help - doesn't work correctly. In that example the captures are the other way around (capture text outside the search term, when I need to capture the search term).
Any way to make this work? Thanks!
You could use zero-width lookahead/lookbehind assertions to assert that the characters to the left and right of what you're matching are non-letters.
The \b is certainly defined to work perfectly well on Unicode, as is required by UTS#18. What are you saying it is not doing? What are the exact text strings involved?
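A minimal sketch of the lookaround idea, assuming Python's re module rather than PHP: re has no \p{L}, so [^\W\d_] is used as a rough stand-in for "any letter" (in PHP with the u flag one would presumably write \p{L} instead). The HTML-tag check from the question is left out, and term_pattern and the Russian sample text are illustrative.

```python
import re

# Rough "any letter" class: word characters minus digits and underscore.
LETTER = r"[^\W\d_]"


def term_pattern(term):
    """Match term when there is no letter right before it OR right after it."""
    t = re.escape(term)
    return re.compile(rf"(?<!{LETTER}){t}|{t}(?!{LETTER})", re.IGNORECASE)


text = "кот, котик и скот"  # hypothetical sample text
spans = [m.span() for m in term_pattern("кот").finditer(text)]
print(spans)
```

Because the assertions are zero-width, only the term itself is captured, so no surrounding space would end up in a highlight.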
Just a simple regex I don't know how to write.
The regex has to make sure a string matches all 3 words. I see how to make it match any of the 3:
/advancedbrain|com_ixxocart|p\=completed/
but I need to make sure that all 3 words are present in the string.
Here are the words
advancebrain
com_ixxocart
p=completed
Use lookahead assertions:
^(?=.*advancebrain)(?=.*com_ixxocart)(?=.*p=completed)
will match if all three terms are present.
You might want to add \b word boundaries around your search terms to ensure that they are matched as complete words and not as substrings of other words (like advancebraindeath), if you need to avoid this:
^(?=.*\badvancebrain\b)(?=.*\bcom_ixxocart\b)(?=.*\bp=completed\b)
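Here is a quick sketch of the lookahead approach, using Python's re module in place of PHP and the spelling of the three terms from the question. The function has_all_terms and the sample strings are made up for illustration.

```python
import re

# One lookahead per required term; each scans ahead from the start of the string.
ALL_THREE = re.compile(r"^(?=.*advancebrain)(?=.*com_ixxocart)(?=.*p=completed)")


def has_all_terms(s):
    """Return True only when all three terms occur somewhere in s, in any order."""
    return ALL_THREE.search(s) is not None


# Hypothetical query strings.
print(has_all_terms("option=com_ixxocart&p=completed&ref=advancebrain"))
print(has_all_terms("option=com_ixxocart&p=completed"))
```

Since . does not match a newline by default, this assumes the three terms sit on a single line.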
^(?=.*?p=completed)(?=.*?advancebrain)(?=.*?com_ixxocart).*$
Spent too long testing and refining =/ Oh well.. Will still post my answer
Use lookahead:
(?=.*\badvancebrain)(?=.*\bcom_ixxocart)(?=.*\bp=completed)
Order won't matter. All three are required.