I have looked at many questions here (and many more websites) and some provided hints but none gave me a definitive answer. I know regular expressions but I am far from being a guru. This particular question deals with regex in PHP.
I need to locate words in a text that are not surrounded by a hyperlink of a given class. For example, I might have
This elephant is green and this elephant is blue while this elephant is red.
I would need to match the second and third elephants but not the first (identified by test class "no_check"). Note that there could be more attributes than just href and class within hyperlinks. I came up with
((?<!<a .*class="no_check".*>)\belephant\b)
which works beautifully in regex test software but not in PHP.
Any help is greatly appreciated. If you cannot provide a regular expression but can find some sort of PHP code logic that would circumvent the need for it, I would be equally grateful.
If variable-width negative look-behind is not available, a quick and dirty solution is to reverse the string in memory, use variable-width negative look-ahead instead, and then reverse the string again.
But you may be better off using an HTML parser.
I think the simplest approach would be to match either a complete <a> element with a "no_check" attribute, or the word you're searching for. For example:
<a [^<>]*class="no_check"[^<>]*>.*?</a>|(\belephant\b)
If it was the word you matched, it will be in capture group #1; if not, that group should be empty or null.
Of course, by "simplest approach" I really meant the simplest regex approach. Even simpler would be to use an HTML parser.
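For what it's worth, here is a minimal sketch of how that alternation might be used from PHP with preg_replace_callback. The sample HTML, the href values and the /animals/elephant target are all made up for illustration; the original markup of the example sentence is not shown in the question.

```php
<?php
// Hypothetical sample input: the first "elephant" sits inside a no_check link.
$html = 'This <a href="/a" class="no_check">elephant</a> is green and '
      . 'this elephant is blue while this elephant is red.';

// Either consume a whole no_check link, or capture the bare word in group 1.
$pattern = '~<a\s[^<>]*class="no_check"[^<>]*>.*?</a>|(\belephant\b)~is';

$result = preg_replace_callback(
    $pattern,
    function (array $m) {
        // Group 1 is only set when the bare word matched, not a protected link.
        if (isset($m[1]) && $m[1] !== '') {
            return '<a href="/animals/elephant">' . $m[1] . '</a>';
        }
        return $m[0]; // leave the existing no_check link untouched
    },
    $html
);

echo $result, PHP_EOL;
```

Because the link alternative comes first and consumes the whole element, the word inside it is never offered to the second alternative. This still assumes the class attribute is written exactly as class="no_check" and that links are not nested, which is one reason a parser is more robust.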
I ended up using a mixed solution. It turns out that I had to parse a text for specific keywords and check if they were already part of a link and if not add them to a hyperlink. The solutions provided here were very interesting but not exactly tailored enough for what I needed.
The idea of using an HTML parser was a good one though and I am currently using one in another project. So hats off to both Alan Moore and Eric Strom for suggesting that solution.
Related
I must detect the presence of some words (even polyrematic, like in "bag of words") in a user-submitted string.
I need to find the exact word, not part of it, so the strstr/strpos/stripos family is not an option for me.
My current approach (PHP/PCRE regex) is the following:
\b(first word|second word|many other words)\b
Is there any other better approach? Am I missing something important?
Words are about 1500.
Any help is appreciated
A regular expression the way you're demonstrating will work. It may be challenging to maintain if the list of words grows long or changes.
The method you're using will work in the event that you need to look for phrases with spaces and the list doesn't grow much.
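If the list lives in an array or a database, one way to reduce the maintenance burden is to generate the alternation instead of writing it by hand. A rough sketch, assuming a made-up word list:

```php
<?php
// Hypothetical word list; in practice this would hold the ~1500 entries.
$words = ['bag', 'bag of words', 'first word', 'second word'];

// Longest first, so "bag of words" is tried before "bag".
usort($words, function ($a, $b) {
    return strlen($b) - strlen($a);
});

// Escape each entry so characters like "." or "+" are matched literally.
$quoted = array_map(function ($w) {
    return preg_quote($w, '/');
}, $words);

$pattern = '/\b(?:' . implode('|', $quoted) . ')\b/iu';

$input = 'A Bag of Words model is not the same as a handbag.';
preg_match_all($pattern, $input, $matches);

print_r($matches[0]);
```

The sort matters because PCRE takes the first alternative that succeeds: with "bag" listed before "bag of words", the shorter entry would win.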
If there are no spaces in the words you're looking for, you could split the input string on space characters (\s+, see https://www.php.net/manual/en/function.preg-split.php ), then check to see if any of those words are in a Set (https://www.php.net/manual/en/class.ds-set.php) made up of the words you're looking for. This will be a bit more code, but less regex maintenance, so ymmv based on your application.
If the set has entries with spaces, consider using a trie instead. Wiktor Stribiżew suggests: https://github.com/sters/php-regexp-trie
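The split-and-look-up idea for single-word lists could look roughly like this. It is only a sketch: the word list is invented, a plain array with the words as keys stands in for Ds\Set (which needs the ds extension), and the split is on any non-letter, non-digit run rather than just \s+ so that punctuation does not stick to the words.

```php
<?php
// Hypothetical single-word list; array keys give constant-time lookups.
$wanted = array_flip(['elephant', 'giraffe', 'zebra']);

$input = 'The Elephant, not the elephants, met a zebra.';

// Split on anything that is not a letter, digit or apostrophe.
// strtolower() only folds ASCII letters, which is enough for this sample.
$tokens = preg_split("/[^\\p{L}\\p{N}']+/u", strtolower($input), -1, PREG_SPLIT_NO_EMPTY);

$found = [];
foreach ($tokens as $token) {
    if (isset($wanted[$token])) {
        $found[$token] = true; // record each hit once
    }
}

print_r(array_keys($found));
```

Note that "elephants" is not reported, because the lookup is an exact match on the whole token.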
I'm not an expert at regular expressions by any stretch of the imagination! I understand the basics of how regex comes together, but through a regular expression, can I search for two words that could appear anywhere within a piece of text?
i.e. the words hot and weather. Could be written as:
the weather was hot
during the hot weather
the weather has become even hotter
Is it possible that a regex can be created to pick up all three scenarios, but not (for example) the picture was shot in poor weather?
Any help would be appreciated. I'm working in PHP 5.6, by the way. Alternatively, is there a better way to do it that I haven't thought of?
If you just need those two words you could have the regex search with an optional list of word endings.
For Example:
(\bweather\b.*\bhot(ter|test)?\b|\bhot(ter|test)?\b.*\bweather\b)
But if you need to build the regex from a user's input you would want to have a full list of possible endings:
(s|er|est|ier|iest|ter|test|etc|etc)?
Example:
(\bweather(s|er|est|ier|iest|ter|test|etc|etc)?\b.*\bhot(s|er|est|ier|iest|ter|test|etc|etc)?\b|\bhot(s|er|est|ier|iest|ter|test|etc|etc)?\b.*\bweather(s|er|est|ier|iest|ter|test|etc|etc)?\b)
The only problem is it would miss endings like silly being silliest or see being saw without adding additional logic to look at the original words.
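As a variation on the pattern above (not the same regex), two look-aheads avoid having to spell out both word orders. This is only a sketch; the function name is made up and the suffix list is the same hand-written (ter|test) one, so it has the same limits with irregular forms.

```php
<?php
// Returns true when both words occur, in either order, as whole words.
function mentionsHotWeather($text)
{
    // Each look-ahead scans the whole string from the start for one word.
    $pattern = '/^(?=.*\bhot(?:ter|test)?\b)(?=.*\bweather\b)/is';
    return preg_match($pattern, $text) === 1;
}

$samples = [
    'the weather was hot',
    'during the hot weather',
    'the weather has become even hotter',
    'the picture was shot in poor weather',
];

foreach ($samples as $s) {
    echo $s, ' => ', mentionsHotWeather($s) ? 'match' : 'no match', PHP_EOL;
}
```

The \b in front of hot is what keeps "shot" from matching.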
I have a regex created from a list in a database to match names for types of buildings in a game. The problem is typos: sometimes those writing instructions for their team in the game will misspell a building name, and obviously the regex will then not pick it up (e.g. spelling "University" as "Unversity").
Are there any suggestions on making a regex match misspellings of 1 or 2 letters?
The regex is dynamically generated and run on a local machine that's able to handle a lot more load, so as a last resort I could algorithmically create versions of each word with a letter missing and then others with letters added in.
I'm using PHP but I'd hope that any solution to this issue would not be PHP specific.
Allow me to introduce you to the Levenshtein Distance, a measure of the difference between strings as the number of transformations needed to convert one string to the other.
It's also built into PHP.
So, I'd split the input file by non-word characters, and measure the distance between each word and your target list of buildings. If the distance is below some threshold, assume it was a misspelling.
I think you'd have more luck matching this way than trying to craft regex's for each special case.
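A minimal sketch of that approach using PHP's built-in levenshtein(). The building names, the message and the threshold of 2 are assumptions for illustration, not values from the question.

```php
<?php
// Hypothetical building list and team message.
$buildings = ['University', 'Barracks', 'Market'];
$message   = 'Everyone upgrade the Unversity and the Markt today';

$maxDistance = 2; // tolerate up to two single-character edits

// Split the input on runs of non-word characters.
$words = preg_split('/\W+/', $message, -1, PREG_SPLIT_NO_EMPTY);

$hits = [];
foreach ($words as $word) {
    foreach ($buildings as $building) {
        // Compare case-insensitively by lowercasing both sides.
        $d = levenshtein(strtolower($word), strtolower($building));
        if ($d <= $maxDistance) {
            $hits[$word] = $building; // typed word => assumed intended name
        }
    }
}

print_r($hits);
```

A fixed threshold can produce false positives for very short names, so it may be worth scaling it with the length of the building name. Multi-word names would need to be compared against word pairs rather than single tokens.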
Google's implementation of "did you mean" by looking at previous results might also help:
How do you implement a "Did you mean"?
What is Soundex() ? – Teifion (28 mins ago)
A soundex is similar to the levenshtein function Triptych mentions. It is a means of comparing strings. See: http://us3.php.net/soundex
You could also look at metaphone and similar_text. I would have put this in a comment but I don't have enough rep yet to do that. :D
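For comparison, here is a small sketch of those three built-ins applied to the misspelling from the question. The variable names are just for illustration.

```php
<?php
$expected = 'University';
$typed    = 'Unversity';

// Phonetic keys: equal keys suggest the two words sound alike.
$sameSoundex   = soundex($expected) === soundex($typed);
$sameMetaphone = metaphone($expected) === metaphone($typed);

// similar_text() returns the number of matching characters and
// writes a similarity percentage into its third argument.
$common = similar_text($expected, $typed, $percent);

var_dump($sameSoundex, $sameMetaphone);
printf("common chars: %d, similarity: %.1f%%\n", $common, $percent);
```

soundex() and metaphone() are tuned for English pronunciation, so they may be less useful for invented game names than an edit-distance measure.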
Back in the day we sometimes used Soundex() for these problems.
You're in luck; the algorithms folks have done lots of work on approximate matching of regular expressions. The oldest of these tools is probably agrep, originally developed at the University of Arizona and now available in a nice open-source version. You simply tell agrep how many mistakes you are willing to tolerate and it matches from there. It can also match other blocks of text besides lines. The link above has links to a newer, GPLed version of agrep and also a number of language-specific libraries for approximate matching of regular expressions.
This might be overkill, but Peter Norvig of Google has written an excellent article on writing a spell checker in Python. It's definitely worth a read and might apply to your case.
At the end of the article, he's also listed contributed implementations of the algorithm in various other languages.
I give an example to easily describe the problem.
Input text:
Wayne Rooney is an English footballer who plays as a striker for Manchester United. Rooney became the youngest player to play for England when he earned his first cap in a friendly against Australia. Theo Walcott broke Rooney's appearance record by 36 days in May 2006.
Input keyword: wayne rooney
Expected output (keyword count): 3 (wayne rooney, rooney, rooney's)
So, it doesn't only count "wayne rooney", but also other similar words.
I searched SO and found this regex:
$keyword_count = preg_match_all("/(\w*(?:wayne|rooney)\w*)/i", $source, $res);
But it gives me 4 as the output. It counts "wayne rooney" as two different keywords.
Could anyone help me to construct the correct formula?
Is regex really the most efficient solution for this? I have a high volume of text to search. Is there any other solution, for example a text mining library for PHP?
Thanks a lot.
Try this regex:
(?i)(\b(?:wayne(?:'s)?\s*)?rooney(?:'s)?\b)
If you only have a limited number of fixed rules for parsing the string, a regex is an appropriate way to solve your problem. In the general case you should use other methods (possibly several regexes).
Maybe this could be helpful or an alternative to regex:
http://php.net/manual/en/function.levenshtein.php
http://en.wikipedia.org/wiki/Levenshtein_distance
For this special case you can do something like this
Wayne(?:\sRooney[\w']*)?|Rooney[\w']*
See it here on Regexr
It says: Search for Wayne Rooney OR Rooney (each can be followed by [\w']*), but for the first part the (?:\sRooney[\w']*)? is optional.
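Plugged into preg_match_all with the text from the question, it might look like the sketch below. The leading \b, the \s+ and the /i flag are small additions of mine so that it is case-insensitive and does not start in the middle of a word.

```php
<?php
$source = 'Wayne Rooney is an English footballer who plays as a striker for '
        . 'Manchester United. Rooney became the youngest player to play for '
        . 'England when he earned his first cap in a friendly against Australia. '
        . "Theo Walcott broke Rooney's appearance record by 36 days in May 2006.";

// "Wayne" optionally followed by "Rooney...", or "Rooney..." on its own.
// The full-name branch comes first so "Wayne Rooney" is consumed as one match.
$pattern = "/\\bWayne(?:\\s+Rooney[\\w']*)?|\\bRooney[\\w']*/i";

$keyword_count = preg_match_all($pattern, $source, $res);

echo $keyword_count, PHP_EOL; // 3
print_r($res[0]);
```

Since matches cannot overlap, the "Rooney" in "Wayne Rooney" is not counted a second time.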
Are you simply trying to match a single known name from a piece of text, or are you actually trying to identify something matching "known people" or "names"?
If the latter then you could be using something like OpenCalais constrained to a known type 'people' (who knows, maybe there is a 'footballers' taxonomy).
Here is an analysis of similar tools.
I wouldn't call myself a master of regex; I pretty much just know the basics. I've been playing around with it, but I can't seem to get the desired result. So if someone would help me, I would really appreciate it!
I'm trying to check whether unwanted words exist in a string. I'm working on a math project, and I'm gonna be using eval() to calculate the string, so I need to make sure it's safe.
The string may contain (just for example now, i'll add more functions later) the following words: (read the comments)
floor() // spaces or numbers are allowed between the () chars. If possible, i'd also like to allow other math functions inside, so it'd look like: floor( floor(8)*1 ).
It may contain any digit, any math sign (+ - * /) and dots/commas (,.) anywhere in the string
Just to be clear, here's another example: If a string like this is passed, i do not want it to pass:
9*9 + include('somefile') / floor(2) // Just a random example on something that's not allowed
Now that i think about it, it looks kind of complicated. I hope you can at least give me some hints.
Thanks in advance,
-Anthony
Edit: This is a bit off-topic, but if you know a better way of calculating math functions, please suggest it. I've been looking for a safe math class/function that calculates an input string, but i haven't found one yet.
Please do not use eval() for this.
My standard answer to this question whenever it crops up:
Don't use eval (especially if the formula contains user input) or reinvent the wheel by writing your own formula parser.
Take a look at the evalMath class on PHPClasses. It should do everything that you want in a nice safe sandbox.
To rephrase your problem, you want to allow only a specific set of characters, plus certain predefined words. The alternation operator (pipe symbol) is your friend in this case:
^([0-9\+\-\*\/\.\,\(\) ]|floor|ceiling|other|functions)*$
Of course, using eval is inherently dangerous, and it is difficult to guarantee that this regex will offer full protection in a language with syntax as expansive as PHP.
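As a rough sketch of how such a whitelist check could be wired up in PHP: the function name and the list of allowed function names are assumptions, and the pattern is anchored so that the whole string has to consist of permitted tokens. It deliberately does not call eval().

```php
<?php
// Hypothetical whitelist of function names; extend as needed.
$allowedFunctions = ['floor', 'ceil'];

function looksLikeAllowedExpression($expr, array $allowedFunctions)
{
    // Escape each name so it is matched literally inside the pattern.
    $names = implode('|', array_map(function ($f) {
        return preg_quote($f, '/');
    }, $allowedFunctions));

    // ^...$ with /D: every character must be a digit, an operator,
    // punctuation from the list, a space, or part of an allowed name.
    $pattern = '/^(?:[0-9+\-*\/.,() ]|' . $names . ')*$/D';

    return preg_match($pattern, $expr) === 1;
}

var_dump(looksLikeAllowedExpression('floor( floor(8)*1 )', $allowedFunctions));
var_dump(looksLikeAllowedExpression("9*9 + include('somefile') / floor(2)", $allowedFunctions));
```

This only checks which tokens appear, not whether the expression is well formed: an empty string or something like "floor floor ((" also passes. It is a filter in front of a proper expression parser, not a replacement for one.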