RegEx to match expression even if it is not 100% the same - php

How would you write a regular expression pattern that matches a string even if it is 90% accurate?
For example:
$search_string = "Two years in, the <a href='site.com'>company</a> has expanded to 35 cities, five of which are outside the U.S. ";
$subject = "Two years in,the company has expanded to 35 cities, five of which are outside the U.S.";
The end result is that the $search_string matches the $subject and returns true even though they are not 100% the same.

You can make parts of the regex pattern optional. For example:
$search_string = "A tiny little bear";
$regex = "/A (?:[a-zA-Z]+ )?little bear/";
The ? after the group makes the whole group optional, and [a-zA-Z]+ matches one or more letters inside it. The trailing space sits inside the group so that "A little bear" matches as well.
Thus, with preg_match, the validation does not have to be 100% restrictive.
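A minimal runnable sketch of the optional-group idea. The helper name matchesBear is made up for illustration, and the ^ and $ anchors are an added assumption so that the whole string has to fit the pattern:

```php
<?php
// The optional group also consumes its trailing space, so the pattern
// matches with or without the extra word.
$regex = '/^A (?:[a-zA-Z]+ )?little bear$/';

// Hypothetical helper: true when $text fits $regex.
function matchesBear(string $regex, string $text): bool
{
    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match($regex, $text) === 1;
}

var_dump(matchesBear($regex, 'A tiny little bear'));      // bool(true)
var_dump(matchesBear($regex, 'A little bear'));           // bool(true)
var_dump(matchesBear($regex, 'A very tiny little bear')); // bool(false)
```

Only one optional word is allowed here; two extra words fail, which shows that this is tolerance for a known gap rather than a general "90% similar" test.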

In case anyone comes around looking for the right way to do it:
$search_string = "Two years in, the <a href='site.com'>company</a> has expanded to 35 cities, five of which are outside the U.S. ";
$subject = "Two years in,the company has expanded to 35 cities, five of which are outside the U.S.";
similar_text ($search_string,$subject,$sim);
echo 'text is: ' .round($sim). '% similar';
Result:
text is: 85% similar
You can use the result to decide what counts as a match in your particular circumstances, like so:
similar_text($search_string,$subject,$sim);
if($sim >=85){
echo 'MATCH';
}
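If the markup and spacing differences are just noise, one option is to normalise both strings before comparing them. This is only a sketch: the helper name normalisedSimilarity is hypothetical, and stripping tags with strip_tags and collapsing whitespace is an assumption about what should be ignored.

```php
<?php
// Hypothetical helper: drop markup and whitespace noise, then compare.
function normalisedSimilarity(string $a, string $b): float
{
    $clean = function (string $s): string {
        $s = strip_tags($s);                 // remove <a ...> and </a>
        $s = preg_replace('/\s+/', ' ', $s); // collapse runs of whitespace
        return trim($s);
    };
    // similar_text() writes the percentage into its third argument.
    similar_text($clean($a), $clean($b), $percent);
    return $percent;
}

$search_string = "Two years in, the <a href='site.com'>company</a> has expanded to 35 cities, five of which are outside the U.S. ";
$subject = "Two years in,the company has expanded to 35 cities, five of which are outside the U.S.";

echo 'text is: ' . round(normalisedSimilarity($search_string, $subject)) . '% similar';
```

After normalisation the two strings differ only by one space, so the score ends up much closer to 100 and a stricter threshold can be used.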

Just for grins, I tried this out using Perl.
All the warnings about using regex to parse HTML apply
(it should not be used on HTML).
This splits the search string on HTML tags, entities, or whitespace.
After that, the parts are joined with .*? using the modifiers (?is).
This is not a true partial-matching substring regex, because
it requires all the parts to exist.
It does, however, overcome the distance or content between them.
Possibly, with a little algorithm work, it could be tweaked so
that parts are optional, in the form of clustering.
use strict;
use warnings;
my $search_string = "Two years in, the <a href='site.com'>company</a> has expanded to 35 cities, five of which are outside the U.S. ";
my $subject = "Two years in,the company has expanded to 35 cities, five of which are outside the U.S.";
## Trim leading/trailing whitespace from $search_string
$search_string =~ s/^\s+|\s+$//g;
## Split the $search_string on html tags or entities or whitespaces ..
my @SearchParts = split m~
\s+|
(?i)[&%](?:[a-z]+|(?:\#(?:[0-9]+|x[0-9a
-f]+)));|<(?:script(?:\s+(?:"[\S\s]*?"|'
[\S\s]*?'|[^>]*?)+)?\s*>[\S\s]*?</script
\s*|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:(?:
(?:"[\S\s]*?")|(?:'[\S\s]*?'))|(?:[^>]*?
))+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE
[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:-
-[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTI
TY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>
~x, $search_string;
## Escape the metacharacters from SearchParts (grep also drops empty parts)
@SearchParts = grep { $_ = quotemeta } @SearchParts;
## Join the SearchParts into a regex
my $rx = '(?si)(?:' . ( join '.*?', @SearchParts ) . ')';
## Try to match SearchParts in the $subject
if ( $subject =~ /$rx/ )
{
print "Match in subject:\n'$&' \n";
}
Output:
Match in subject:
'Two years in,the company has expanded to 35 cities, five of which are outside the U.S.'
Edit:
As a side note, each element of @SearchParts could be split once more
(on each character), joining with .*?.
This would get into the realm of a true partial match.
It is not quite there, though, as each character is still required to match.
The order is maintained, but each one would have to be optional.
Usually, without capture groups, there is no way to tell the percentage
of letters actually matched.
If you were to use Perl, however, it's fairly easy to count in a
regex code construct (?{..}), where a counter can be incremented.
I guess at that point it becomes non-portable. Better to use C++.

Related

PHP: Split a string at the first period that isn't the decimal point in a price or the last character of the string

I want to split a string as per the parameters laid out in the title. I've tried a few different things including using preg_match with not much success so far and I feel like there may be a simpler solution that I haven't clocked on to.
I have a regex that matches the "price" mentioned in the title (see below).
/(?=.)\£(([1-9][0-9]{0,2}(,[0-9]{3})*)|[0-9]+)?(\.[0-9]{1,2})?/
And here are a few example scenarios and what my desired outcome would be:
Example 1:
input: "This string should not split as the only periods that appear are here £19.99 and also at the end."
output: n/a
Example 2:
input: "This string should split right here. As the period is not part of a price or at the end of the string."
output: "This string should split right here"
Example 3:
input: "There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price"
output: "There is a price in this string £19.99, but it should only split at this point"
I suggest using
preg_split('~\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?(*SKIP)(*F)|\.(?!\s*$)~u', $string)
See the regex demo.
The pattern matches your pattern, \£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})? and skips it with (*SKIP)(*F), else, it matches a non-final . with \.(?!\s*$) (even if there is trailing whitespace chars).
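A small sketch of how that split behaves on the three examples from the question. The helper name textBeforeFirstSplit and the limit of 2 passed to preg_split are assumptions added for illustration, and the file is assumed to be saved as UTF-8 because of the £ sign:

```php
<?php
// Pattern from the answer: prices are matched and thrown away with
// (*SKIP)(*F); what remains to split on is a dot that is not final.
$pattern = '~\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?(*SKIP)(*F)|\.(?!\s*$)~u';

// Hypothetical helper: text before the first qualifying dot, or null.
function textBeforeFirstSplit(string $pattern, string $string): ?string
{
    $parts = preg_split($pattern, $string, 2); // limit 2: stop after one split
    if ($parts === false || count($parts) < 2) {
        return null; // error, or nothing to split on
    }
    return $parts[0];
}

$examples = [
    'This string should not split as the only periods that appear are here £19.99 and also at the end.',
    'This string should split right here. As the period is not part of a price or at the end of the string.',
    'There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price',
];

foreach ($examples as $example) {
    var_dump(textBeforeFirstSplit($pattern, $example));
}
```

The first example yields NULL (the "n/a" case), and the other two yield the text up to the first sentence-ending dot.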
If you really only need to split on the first occurrence of the qualifying dot you can use a matching approach:
preg_match('~^((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+)\.(.*)~su', $string, $match)
See the regex demo. Here,
^ - matches a string start position
((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+) - Group 1: one or more occurrences of your currency pattern or any one char other than a . char
\. - a . char
(.*) - Group 2: the rest of the string.
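The matching approach can be tried out as below. This is a sketch using the example strings from the question; the variable names are illustrative only.

```php
<?php
$pattern = '~^((?:\£(?:[1-9]\d{0,2}(?:,\d{3})*|[0-9]+)?(?:\.\d{1,2})?|[^.])+)\.(.*)~su';

$string = 'There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price';

if (preg_match($pattern, $string, $match)) {
    echo $match[1], "\n"; // text before the first non-price dot
    echo $match[2], "\n"; // everything after that dot
}

// A string whose only non-price dot is the final one still matches,
// but Group 2 is then empty.
$onlyFinalDot = 'This string should not split as the only periods that appear are here £19.99 and also at the end.';
preg_match($pattern, $onlyFinalDot, $finalMatch);
var_dump($finalMatch[2] === ''); // bool(true)
```

Unlike the preg_split version, this pattern does not exclude a final period by itself, so a caller would probably want to treat an empty (or whitespace-only) Group 2 as "no split".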
To split a text into sentences while avoiding pitfalls such as dots or thousands separators in numbers and some abbreviations (like etc.), the best tool is IntlBreakIterator, which is designed to deal with natural language:
$str = 'There is a price in this string £19.99, but it should only split at this point. As I want it to ignore periods in a price';
$si = IntlBreakIterator::createSentenceInstance('en-US');
$si->setText($str);
$si->next();
echo substr($str, 0, $si->current());
IntlBreakIterator::createSentenceInstance returns an iterator that gives the indexes of the different sentences in the string.
It takes ?, ! and ... into account too. In addition to handling the pitfalls of numbers and prices, it also works well with this kind of string:
$str = 'John Smith, Jr. was running naked through the garden crying "catch me! catch me!", but no one was chasing him. His psychatre looked at him from the window with a circumspect eye.';
More about rules used by IntlBreakIterator here.
You could simply use this regex:
\.
Since you only have a space after the first sentence (and not a price), this should work just as well, right?

RegEx with preg_match to find and replace a SIMILAR string

I am using regular expressions with preg_replace() in order to find and replace a sentence in a piece of text. The $search_string contains plain text + HTML tags + &nbsp; elements. The problem is that the &nbsp; elements are only sometimes converted to white space at run time, which makes it difficult to find and replace using str_replace(). So I'm trying to build a pattern that is equal to the search string and will match anything like it, whether or not it contains the &nbsp; elements.
For example:
$search_string = 'Two years in, the company has expanded to 35 cities, five of which are outside the U.S. Plus, in April, ClassPass acquired its main competitor, Fitmob.';
$pattern = $search_string(BUT IGNORE THE &nbsp; elements in the subject)
$subject = "text text text text text". $search_string . "text text text text text";
Using A regular expression to exclude a word/string, I've tried:
$pattern = '`^/(?!\&nbsp;)'.$search_string.'`';
$output = preg_replace($pattern, $replacement_string,$subject);
The end result should be that if the $subject contains a string that is like my $search_string but without the &nbsp; elements, it will still match and be replaced with $replacement_string.
EDIT:
The actual values:
$subject = file_get_contents("http://venturebeat.com/2015/11/10/sources-classpass-raises-30-million-from-google-ventures-and-others/");
$search_string = "Two years in, the company has expanded to 35 cities, five of which are outside the U.S. Plus, in April, ClassPass acquired its main competitor, Fitmob.";
$replacement_string = "<span class='smth'>Two years in, the company has expanded to 35 cities, five of which are outside the U.S. Plus, in April, ClassPass acquired its main competitor, Fitmob.</span>";
Not a very efficient way of doing it, but it should work out for you:
preg_replace('/Two.*?years.*?in.*?the.*?company.*?has.*?expanded.*?to.*?35.*?cities.*?five.*?of.*?which.*?are.*?outside.*?the.*?U\.S\..*?Plus.*?in.*?April.*?ClassPass.*?acquired.*?its.*?main.*?competitor.*?Fitmob\./', '<span class=\'smth\'>$0</span>', $subject);
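Rather than typing such a pattern by hand, it could be generated from the search string. The sketch below is an assumption-laden illustration: buildTolerantPattern is a made-up helper, it splits on whitespace only, and it adds the s modifier so that .*? can also cross line breaks.

```php
<?php
// Hypothetical helper: quote each whitespace-separated piece of the
// sentence and join the pieces with .*? so anything may sit between them.
function buildTolerantPattern(string $sentence): string
{
    $pieces = preg_split('/\s+/', trim($sentence), -1, PREG_SPLIT_NO_EMPTY);
    $quoted = array_map(function (string $piece): string {
        return preg_quote($piece, '/'); // escape metacharacters and the delimiter
    }, $pieces);
    return '/' . implode('.*?', $quoted) . '/s';
}

$search_string = 'Two years in, the company has expanded to 35 cities';
// Sample subject with a tag and an entity between some of the words.
$subject = 'text text Two years in, the <b>company</b> has expanded to 35&nbsp;cities text';

$output = preg_replace(
    buildTolerantPattern($search_string),
    '<span class=\'smth\'>$0</span>',
    $subject
);
echo $output;
```

Because .*? accepts anything between the pieces, the pattern can stretch across unrelated text on a large page; a stricter joiner (whitespace, entities and tags only) would be safer if the noise is known in advance.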

How to combine these two regex passes into one?

I have a few thousand strings that have one of these two forms:
SomeT1tle-ThatL00ks L1k3.this - $3.57 KnownWord
SomeT1tle-ThatL00ks L1k3.that - 4.5% KnownWord
The SomeT1tle-ThatL00ks L1ke.this part may contain uppercase and lowercase characters, digits, periods, dashes, and spaces. It is always followed by a space-dash-space pattern.
I want to pull out the Title (the part before the space-dash-space separator) and the Amount, which is right before KnownWord.
So for these two strings I'd like:
SomeT1tle-ThatL00ks L1k3.this, $3.57 and
SomeT1tle-ThatL00ks L1k3.that, 4.5%.
This code works (using Perl equivalent Regular Expressions)
$my_string = "SomeT1tle-ThatL00ks L1k3.this - $3.57 KnownWord";
$pattern_title = "/^(.*?)\x20\x2d\x20/";
$pattern_amount = "/([0-9.$%]+) KnownWord$/";
preg_match_all($pattern_title, $my_string, $matches_title);
preg_match_all($pattern_amount, $my_string, $matches_amount);
echo $matches_title[1][0] . " " . $matches_amount[1][0] . "<br>";
I tried putting both patterns together:
$pattern_together_doesnt_work = "/^(.*?)\x20\x2d\x20([0-9.$%]+) KnownWord$/";
but the first part of the pattern always matches the whole thing, even with the "lazy" part (.*? rather than .*). I can't negative-match spaces and dashes, because the title itself can contain either.
Any hints?
Use this pattern
/^(.*?)\x20\x2d\x20([0-9.$%]+) KnownWord$/
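For reference, a sketch of the combined pattern used with preg_match. Note that index 0 of the result always holds the entire match, and that is possibly what looked like the first part "matching the whole thing"; the title and amount are in groups 1 and 2. The helper name extractTitleAndAmount is hypothetical, and the test strings are single-quoted so PHP never tries to interpolate the $ sign.

```php
<?php
// Hypothetical helper: returns the title and amount, or null if the
// line does not have the expected shape.
function extractTitleAndAmount(string $line): ?array
{
    // Group 1: lazy title; group 2: amount made of digits, dots, $ and %.
    $pattern = '/^(.*?) - ([0-9.$%]+) KnownWord$/';
    if (preg_match($pattern, $line, $m) !== 1) {
        return null;
    }
    // $m[0] is the whole line; $m[1] and $m[2] are the captured groups.
    return ['title' => $m[1], 'amount' => $m[2]];
}

$lines = [
    'SomeT1tle-ThatL00ks L1k3.this - $3.57 KnownWord',
    'SomeT1tle-ThatL00ks L1k3.that - 4.5% KnownWord',
];

foreach ($lines as $line) {
    $result = extractTitleAndAmount($line);
    echo $result['title'] . ', ' . $result['amount'] . "\n";
}
```

With preg_match_all, as in the question, the same groups would be found one level deeper, at $matches[1][0] and $matches[2][0].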

preg_replace or regex string translation

I found some partial help but cannot seem to fully accomplish what I need. I need to be able to do the following:
I need a regular expression to replace any 1-to-3-character words between two words that are longer than 3 characters with a match-anything expression:
For example:
walk to the beach ==> walk(.*)beach
If the 1 to 3 character word is not preceded by a word that's longer than 3 characters then I want to translate that 1 to 3 letter word to '<word> ?'
For example:
on the beach ==> on ?the ?beach
The simpler the rule the better (of course, if there's an alternative, more complicated version that's more performant, I'll take that as well, as I anticipate heavy usage eventually).
This will be used in a PHP context most likely with preg_replace. Thus, if you can put it in that context then even better!
By the way so far I have got the following:
$string = preg_replace('/\s+/', '(.*)', $string);
$string = preg_replace('/\b(\w{1,3})(\.*)\b/', '${1} ?', $string);
but that results in:
walk to the beach ==> 'walk(.*)to ?beach'
which is not what I want. 'on the beach' seems to translate correctly.
I think you will need two replacements for that. Let's start with the first requirement:
$str = preg_replace('/(\w{4,})(?: \w{1,3})* (?=\w{4,})/', '$1(.*)', $str);
Of course, you need to replace those \w (which match letters, digits and underscores) with a character class of what you actually want to treat as a word character.
The second one is a bit tougher, because matches cannot overlap and lookbehinds cannot be of variable length. So we have to run this multiple times in a loop:
do
{
$str = preg_replace('/^\w{0,3}(?: \w{0,3})* (?!\?)/', '$0?', $str, -1, $count);
} while($count);
Here we match everything from the beginning of the string, as long as it's only up-to-3-letter words separated by spaces, plus one trailing space (only if it is not already followed by a ?). Then we put all of that back in place, and append a ?.
Update:
After all the talk in the comments, here is an updated solution.
After running the first line, we can assume that the only words of up to 3 letters left will be at the beginning or at the end of the string. All others will have been collapsed to (.*). Since you want to follow every space between those with ?, you do not even need a loop (in fact, these are the only spaces left):
$str = preg_replace('/ /', ' ?', $str);
(Do this right after my first line of code.)
This would give the following two results (in combination with the first line):
let us walk on the beach now go => let ?us ?walk(.*)beach ?now ?go
let us walk on the beach there now go => let ?us ?walk(.*)beach(.*)there ?now ?go
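Put together, the two steps of the updated solution might look like this. The function name toLoosePattern is made up for illustration, and str_replace stands in for the preg_replace('/ /', ...) call since a literal space needs no regex:

```php
<?php
// Hypothetical wrapper around the two replacements described above.
function toLoosePattern(string $phrase): string
{
    // Step 1: collapse short words that sit between two long words into (.*).
    $collapsed = preg_replace('/(\w{4,})(?: \w{1,3})* (?=\w{4,})/', '$1(.*)', $phrase);
    // Step 2: every space that is still left gets an optional marker.
    return str_replace(' ', ' ?', $collapsed);
}

echo toLoosePattern('walk to the beach'), "\n"; // walk(.*)beach
echo toLoosePattern('on the beach'), "\n";      // on ?the ?beach
echo toLoosePattern('let us walk on the beach there now go'), "\n";
// let ?us ?walk(.*)beach(.*)there ?now ?go
```

As noted in the answer, \w would need to be swapped for whatever character class should count as a word character in the real data.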

PHP preg_match with regex: only single hyphens and spaces between words continue

I was trying to write a regex that allows single hyphens and single spaces only within words but not at the beginning or at the end of the words.
I thought I had this sorted from the answer I got yesterday, but I just realised there is a small error which I don't quite understand.
Why won't it accept inputs like these,
'forum-category-b forum-category-a'
'forum-category-b Counter-terrorism'
'forum-category-a Preventing'
'forum-category-a Preventing Violent'
'forum-category-a International-Research-and-Publications'
'International-Research-and-Publications forum-category-b forum-category-a'
but it takes,
'forum-category-b'
'Counter-terrorism forum-category-a'
'Preventing forum-category-a'
'Preventing Violent forum-category-a'
'International-Research-and-Publications forum-category-b'
Why is that? How can I fix it? Below is the regex with the initial test, but ideally it should accept all the input combinations above,
$aWords = array(
'a',
'---stack---over---flow---',
' stack over flow',
'stack-over-flow',
'stack over flow',
'stacoverflow'
);
foreach($aWords as $sWord) {
if (preg_match('/^(\w+([\s-]\w+)?)+$/', $sWord)) {
echo 'pass: ' . $sWord . "\n";
} else {
echo 'fail: ' . $sWord . "\n";
}
}
and reject inputs like the ones below,
---stack---over---flow---
stack-over-flow- stack-over-flow2
stack over flow
Thanks.
Your pattern does not do what you want. Let's break it apart:
^(\w+([\s-]\w+)?)+$
It matches strings that consist solely of one or more sequences of the pattern:
\w+([\s-]\w+)?
...which is a sequence of word characters, followed optionally by one other sequence of word characters, separated by one space or dash character.
In other words, your pattern searches for strings like:
xxx-xxxyyy-yyyzzz zzz
...but you intend to write a pattern that would find:
xxx-xxxxxx-xxxxxx yyy
In your examples, this one is matched:
Counter-terrorism forum-category-a
...but it is interpreted as the following sequence:
(Counter(-terroris)) (m( foru)) (m(-categor)) (y(-a))
As you can see, the pattern did not really find the words you are looking for.
This example is not matched:
forum-category-a Preventing Violent
...since the pattern cannot form groups of "word characters, space-or-dash, word-characters" when it encounters a single word character followed by space or dash:
(forum(-categor)) (y(-a)) <Mismatch: Found " " but expected "\w">
If you would add another character to "forum-category-a", say "forum-category-ax", it would match again, since it could split at the "ax":
(forum(-categor)) (y(-a)) (x( Preventin)) (g( Violent))
What you are actually interested in is a pattern like
^(\w+(-\w+)*)(\s\w+(-\w+)*)*$
...which would find a sequence of words that may contain dashes, separated by spaces:
(forum(-category)(-a)) ( Preventing) ( Violent)
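A quick check of the corrected pattern against inputs from the question, sketched in PHP. The helper name isValidWordList is hypothetical, and the double-space input is an added test case for the "single spaces only" rule:

```php
<?php
// Hypothetical helper: words may contain single hyphens and are
// separated by exactly one whitespace character.
function isValidWordList(string $input): bool
{
    return preg_match('/^(\w+(-\w+)*)(\s\w+(-\w+)*)*$/', $input) === 1;
}

$inputs = [
    'forum-category-b forum-category-a',
    'forum-category-a Preventing Violent',
    'International-Research-and-Publications forum-category-b forum-category-a',
    '---stack---over---flow---',
    ' stack over flow',
    'stack-over-flow- stack-over-flow2',
    'stack  over', // two spaces in a row
];

foreach ($inputs as $input) {
    echo (isValidWordList($input) ? 'pass: ' : 'fail: ') . $input . "\n";
}
```

The first three inputs pass and the remaining four fail, because each separator now has to be followed directly by a word character.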
By the way, I tested this using a Python script, and while trying to match your pattern against the example string "International-Research-and-Publications forum-category-b forum-category-a", the regular expression engine seemed to run into an infinite loop (most likely catastrophic backtracking caused by the nested quantifiers rather than a true infinite loop):
import re
expr = re.compile(r'^(\w+([\s-]\w+)?)+$')
expr.match('International-Research-and-Publications forum-category-b forum-category-a')
The part of your pattern ([\s-]\w+)? is the issue. It only allows for one repetition (the trailing ?). Try changing the last ? to * and see if that helps.
Nope, I still believe that's the problem. The original pattern is looking for "word" or "word[space_hyphen]word" repeated 1+ times. Which is weird because the pattern should fall within another match. But switching the question mark worked for me.
There should be only one answer to this problem:
/^((?<=\w)[ -]\w|[^ -])+$/
There is only one rule, as stated: \w[ -]\w, and that's it. It works at per-character granularity and cannot be anything else. Add the [^ -] for the rest.
