Exclude a few words from a simple regex in PHP - php

I'm categorizing a few folders on my drives and I want to weed out low quality files using this regex (this works):
xvid|divx|480p|320p|DivX|XviD|DIVX|XVID|XViD|DiVX|DVDSCR|PDTV|pdtv|DVDRip|dvdrip|DVDRIP
Some files are in high definition but still have DVD or XviD in their filenames, along with 1080p, 720p, 1080i or 720i. I need a single regex that matches the one above but excludes filenames containing 1080p, 720p, 1080i or 720i.

Use two regexes.
Use the first one to check whether the filename matches:
1080p|720p|1080i|720i
If it doesn't (that is, no match is found for the above), check for matches against:
xvid|divx|480p|320p|DivX|XviD|DIVX|XVID|XViD|DiVX|DVDSCR|PDTV|pdtv|DVDRip|dvdrip|DVDRIP
Regular expressions don't support inverse matching directly. You could use negative look-arounds, but I wouldn't say they're appropriate for this task. A negative lookahead that covers cases such as 1080p-divx still misses divx-10bit-1080p, so a single simple regex isn't a good fit here.
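As a minimal sketch of the two-step check described above: the function name isLowQuality is hypothetical, and the i flag is an assumption that replaces the hand-written case variants in the original pattern.

```php
<?php
// Hypothetical helper: true when a filename carries a low-quality marker
// and none of the HD markers.
function isLowQuality(string $filename): bool
{
    $hd  = '/1080p|720p|1080i|720i/i';
    $low = '/xvid|divx|480p|320p|dvdscr|pdtv|dvdrip/i';

    // Step 1: any HD marker rules the file out immediately.
    if (preg_match($hd, $filename) === 1) {
        return false;
    }
    // Step 2: only now look for the low-quality markers.
    return preg_match($low, $filename) === 1;
}

var_dump(isLowQuality('movie.divx-10bit-1080p.mkv')); // bool(false)
var_dump(isLowQuality('movie.XviD.avi'));             // bool(true)
```

Because the HD check runs first, the order of the markers inside the filename does not matter.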

You can use a negative lookahead for this
^(?!.*(?:1080p|720p|1080i|720i)).*(?:xvid|divx|480p|320p|DivX|XviD|DIVX|XVID|XViD|DiVX|DVDSCR|PDTV|pdtv|DVDRip|dvdrip|DVDRIP)
This will match on your search strings, but fail if there is also 1080p|720p|1080i|720i in the string.
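For illustration, the lookahead pattern can be run over a list of filenames with preg_grep. The filenames below are made up, and the i flag is an assumption used to shorten the alternation.

```php
<?php
// Same idea as the pattern above, made case-insensitive with the i flag.
$pattern = '/^(?!.*(?:1080p|720p|1080i|720i)).*(?:xvid|divx|480p|320p|dvdscr|pdtv|dvdrip)/i';

// Made-up sample filenames.
$files = [
    'old.show.PDTV.avi',
    'film.DVDRip.1080p.mkv',
    'film.720p.XviD.mkv',
    'clip.DivX.avi',
    'notes.txt',
];

// preg_grep keeps the original keys, so re-index the result.
$lowQuality = array_values(preg_grep($pattern, $files));
print_r($lowQuality); // old.show.PDTV.avi and clip.DivX.avi
```

The lookahead is anchored at the start of the string, so it rejects an HD marker whether it appears before or after the low-quality marker.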

You can do it like this:
<?php
$subjects = array(
    'Arrival of the train at La Ciotat station.avi',
    'Gardenator II - multi - DVDrip - 720i.mkv',
    'The adventures of Roberto the bear - divx.avi',
    'Tokyo’s Ginza District - dvdrip.mkv',
);
// excl: the low-quality markers; keep: any text that never forms 1080p/1080i/720p/720i
$pattern = '~(?(DEFINE)(?<excl>(?>d(?>vd(?>rip|scr)|ivx)|pdtv|xvid|320p|480p)))
(?(DEFINE)(?<keep>(?>[^17]+?|1(?!080[ip])|7(?!20[ip]))))
^\g<keep>*\g<excl>\g<keep>*$ ~ix';
foreach ($subjects as $subject) {
    if (preg_match($pattern, $subject)) {
        echo $subject . "\n";
    }
}
The main benefit is that it avoids testing a lookahead at every character.

Related

preg - Difference between Search Patterns with [] and without

It seems I am not able to understand something very basic with preg regex Patterns in PHP.
What is the difference between these Regex Patterns:
\b([A-Z...]...)
[\b]{1}([A-Z...]...)
The pattern should start with a word boundary, but why is the result different when I put it in []{1}?
The first one works as I expected, but the second does not. The problem is that I want to put more into the [], so that the pattern can start with a word boundary OR a lowercase character [a-z].
Thank you!
Example Text:
Race1529/05/201512:45K4 Senior Men 1000m
LaneName(s)NFBib(s)TimeRank250m500m750m
152
Martin SCHUBERT / Lukas REUSCHENBACH155
11
153
151Kostja STROINSKI / Kai SPENNER
03:07.740
GER
8
I want to find the names of the racers. Sometimes they have a word-break (\b) at the beginning, sometimes not. (But i need the word-break.)
$pattern = '#\b(['.$GB.$KB.'\s\-]{2,40})\s(['.$GB.'\'\-\s]{2,40})[0-9]{0,5}#';
($GB is a variable with all Uppercase Letters, $KB with lower case letters)
preg_match_all gives me all racers where the Name has a word-break at the beginning. (In this example Schubert, Reuschenbach, Spenner) but of course not Stroinski. So, I try this:
$pattern = '#[\b0-9]+(['.$GB.$KB.'\s\-]{2,40})\s(['.$GB.'\'\-\s]{2,40})[0-9]{0,5}#';
It does not work. Even if I remove the 0-9 and only put [\b]{1} at the beginning, it doesn't find any match.
I don't see the difference between \b and [\b]{1}. It seems to be a very basic misunderstanding.
[\b] is a character class that matches only a backspace character (\u0008).
See the PHP regex reference:
note that "\b" has a different meaning, namely the backspace character, inside a character class
Also, .{1} is equivalent to . on its own: the {1} quantifier is always redundant and only makes sense when patterns are built dynamically from variables.
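A small sketch of the difference, using one name from the example text. The last pattern, with the alternation (?:\b|[0-9]+), is an assumption about what the asker wants (a word boundary or leading digits), not something taken from the answer.

```php
<?php
$text = '151Kostja STROINSKI';

// \b outside a class is a zero-width word boundary; there is none between "1" and "K".
$boundary = preg_match('/\bKostja/', $text);                 // 0

// [\b] is a character class containing the backspace character (0x08).
$classOnText = preg_match('/[\b]Kostja/', $text);            // 0
$classOnBackspace = preg_match('/[\b]Kostja/', "\x08Kostja"); // 1

// "Boundary OR digits" needs an alternation group, not a character class.
$alternation = preg_match('/(?:\b|[0-9]+)Kostja/', $text);   // 1

var_dump($boundary, $classOnText, $classOnBackspace, $alternation);
```

A character class can only hold things that consume a character, so a zero-width assertion like \b has to be combined with other options through (?:...|...).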

replace all punctuations except for abbreviations

I have a regex in PHP that replaces everything I don't want with spaces
/[^a-z0-9\p{L}]/siu
But there is this one exception, I want to keep punctuations for abbreviations.
Example:
F.B.I.Federal.Bureau.of.Investigation => 'F B I Federal Bureau of Investigation'
S.W.A.T.Team => 'S W A T Team'
Should be:
F.B.I.Federal.Bureau.of.Investigation => 'F.B.I. Federal Bureau of Investigation'
S.W.A.T.Team => 'S.W.A.T. Team'
PHP code:
$s = "F.B.I.Federal.Bureau.of.Investigation";
return preg_replace('/[^a-z0-9\p{L}]/siu', " ", $s);
So the logic is that it should check the second character of the first match and, if that is a '.' character, not replace it.
I'm not sure whether this is possible with regex; if not, I would appreciate an alternative in PHP.
Actually, there are many types of abbreviations, and as Jon Stirling says, no solution here works 100% of the time, since you would need a complete list of possible abbreviations to filter out. You may have a look at a fancy regex solution by @ndn and take the part of the pattern related to abbreviations from there.
If you need to only handle patterns like in the question, you may consider using
'~(\b(?:\p{Lu}\.){2,})|[^0-9\p{L}]~u'
or - if D.Word should also be treated as an abbreviation:
'~(\b(?:\p{Lu}\.)+)|[^0-9\p{L}]~u'
and replace with '$1 '.
Pattern details:
(\b(?:\p{Lu}\.)+) - Group 1 (later referenced with the $1 backreference): 1 or more consecutive occurrences of any Unicode uppercase letter followed by a dot
| - or
[^0-9\p{L}] - any char that is neither an ASCII digit nor a Unicode letter.
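Applied to the strings from the question, the first pattern (two or more letter-dot pairs) can be used as in the sketch below. The function name cleanTitle is hypothetical.

```php
<?php
// Hypothetical helper: replace unwanted characters with spaces,
// but keep dotted abbreviations such as F.B.I. intact.
function cleanTitle(string $s): string
{
    // Group 1 captures the abbreviation; when the second alternative matches,
    // group 1 is empty, so '$1 ' becomes a single space.
    return preg_replace('~(\b(?:\p{Lu}\.){2,})|[^0-9\p{L}]~u', '$1 ', $s);
}

echo cleanTitle('F.B.I.Federal.Bureau.of.Investigation'), "\n";
// F.B.I. Federal Bureau of Investigation
echo cleanTitle('S.W.A.T.Team'), "\n";
// S.W.A.T. Team
```

Putting the abbreviation alternative first matters: the engine tries it before the generic "anything else" branch at every position.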
And here is a variant of the regex with @ndn's abbreviations:
'~\b((?:[Ee]tc|St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd|pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs|\p{Lu}(?:\.\p{Lu})+)\.)|[^0-9\p{L}]~'
If you do not want to remove -, ( and ), add them to the negated character class: replace [^0-9\p{L}] with [^0-9\p{L}()-].
Feel free to update by adding more abbreviations or enhance by shrinking the alternatives.

How to match all words but "stop" in a string by regex

Another regex question. I use PHP and have a string: fdjkaljfdlstopfjdslafdj. As you can see, there is a stop in the middle. I want to replace everything except that stop. I tried [^stop], but that also skips the s near the end of the string.
My Solution
Thanks for everyone's help here.
I also figured out a solution using only regex (within the scope of my regex knowledge; PCRE verbs are too advanced for me), but it needs two steps. I don't want to mix PHP functions in, because sometimes the job is outside of code, e.g. batch-renaming files in Total Commander.
Take the string xxxfooeoropwfoo,skfhlk;afoofsjre,jhgfs,vnhufoolsjunegpq. For example, I want to keep every foo in this string and replace each run of non-foo text with ---.
First, I find all the non-foo text between two foos: (?<=foo).+?(?=foo).
The string turns into xxxfoo---foo---foo---foolsjunegpq, with only the non-foo text at both ends left.
Then I use [^-]+(?=foo)|(?<=foo)[^-]+.
This gives ---foo---foo---foo---foo---. Everything but foo has been turned into ---.
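For readers who do want to run the two steps in PHP, a sketch could look like this. The function name keepOnlyFoo is hypothetical, and the approach assumes the input contains no - characters of its own, since step 2 relies on [^-].

```php
<?php
// Hypothetical helper reproducing the two-step replacement.
// Assumes the input itself contains no "-" characters.
function keepOnlyFoo(string $s): string
{
    // Step 1: collapse the text between two foos.
    $s = preg_replace('/(?<=foo).+?(?=foo)/', '---', $s);
    // Step 2: collapse what is left before the first and after the last foo.
    return preg_replace('/[^-]+(?=foo)|(?<=foo)[^-]+/', '---', $s);
}

echo keepOnlyFoo('xxxfooeoropwfoo,skfhlk;afoofsjre,jhgfs,vnhufoolsjunegpq'), "\n";
// ---foo---foo---foo---foo---
```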
I just don't want to include "stop".
You can skip it by using the PCRE verbs (*SKIP)(*F). Try it like this:
stop(*SKIP)(*F)|.
or sequence: (stop)(*SKIP)(*F)|(?:(?!(?1)).)+
or for words: stop(*SKIP)(*F)|\w+
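A sketch of the sequence variant used with preg_replace. The helper name replaceAllBut is hypothetical, and the s flag is an assumption added so that . also crosses newlines.

```php
<?php
// Hypothetical helper: replace every run of text that is not $keep.
function replaceAllBut(string $keep, string $replacement, string $subject): string
{
    $k = preg_quote($keep, '~');
    // First branch: match the kept text, then discard it and skip past it.
    // Second branch: consume characters as long as the kept text does not start here.
    $pattern = '~' . $k . '(*SKIP)(*F)|(?:(?!' . $k . ').)+~s';
    return preg_replace($pattern, $replacement, $subject) ?? $subject;
}

echo replaceAllBut('stop', '---', 'fdjkaljfdlstopfjdslafdj'), "\n";
// ---stop---
```

Note that preg_replace still interprets $1-style references in the replacement string, so a literal $ there would need escaping.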
[^stop] doesn't mean any text that is NOT stop. It just means any single character that is not one of the 4 characters inside [...], which in this case are s, t, o and p.
Better to split on the text you don't want to match:
$s = 'fdjkaljfdlstopfjdslafdjstopfoobar';
php> $arr = preg_split('/stop/', $s);
php> print_r($arr);
Array
(
[0] => fdjkaljfdl
[1] => fjdslafdj
[2] => foobar
)
You can generalize this to any pattern:
(?<neg>stop)(*SKIP)(*FAIL)|(?s:.)+?(?=\Z|(?&neg))
Just put the pattern you don't want in the neg group.
This regex will try to do the following for any character position:
Match the pattern you don't want. If it matches, discard it with (*SKIP)(*FAIL) and restart another match at this position.
If the pattern you don't want doesn't match at a particular position, then match anything, until either:
You reach the end of the input string (\Z)
Or the pattern you don't want immediately follows the current matching position ((?&neg))
This approach is slower than a manually tuned expression. You can get better performance, at the cost of repeating yourself, by inlining the pattern and avoiding the (?&neg) subroutine call:
stop(*SKIP)(*FAIL)|(?s:.)+?(?=\Z|stop)
But of course, the best approach would be to use the features provided by your language: match the string you don't want, then use code to discard it and keep everything else.
In PHP, you can use the PREG_OFFSET_CAPTURE flag to tell the preg_match_all function to provide you the offsets of each match.
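A minimal sketch of that offset-based approach: match the unwanted text, then slice out everything between the matches. The function name partsOutside is hypothetical.

```php
<?php
// Hypothetical helper: return the pieces of $subject that lie outside
// every match of $pattern.
function partsOutside(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);

    $parts = [];
    $pos = 0;
    // With PREG_OFFSET_CAPTURE each match is a [text, byteOffset] pair.
    foreach ($matches[0] as [$text, $offset]) {
        if ($offset > $pos) {
            $parts[] = substr($subject, $pos, $offset - $pos);
        }
        $pos = $offset + strlen($text);
    }
    if ($pos < strlen($subject)) {
        $parts[] = substr($subject, $pos);
    }
    return $parts;
}

print_r(partsOutside('/stop/', 'fdjkaljfdlstopfjdslafdjstopfoobar'));
// fdjkaljfdl, fjdslafdj, foobar
```

The offsets are byte offsets, which is why substr and strlen are used here rather than their multibyte counterparts.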

Detect cloth sizes with regex

I am trying to detect with regex, strings that have a pattern of {any_number}{x-}{large|medium|small} for a site with clothing I am building in PHP.
I have managed to match the sizes against a preconfigured set of strings by using:
$searchFor = '7x-large';
$regex = '/\b'.$searchFor.'\b/';
//Basically, it's finding the letters
//surrounded by a word-boundary (the \b bits).
//So, to find the position:
preg_match($regex, $opt_name, $match, PREG_OFFSET_CAPTURE);
I even managed to detect weird sizes like 41 1/2 with regex, but I am not an expert and I am having a hard time on this.
I have come up with
preg_match("/^(?<![\/\d])([xX\-])(large|medium|small)$/", '7x-large', $match);
but it won't work.
Could you pinpoint what I am doing wrong?
It sounds like you also want to match half sizes. You can use something like this:
$theregex = '~(?i)^\d+(?:\.5)?x-(?:large|medium|small)$~';
if (preg_match($theregex, $yourstring,$m)) {
// Yes! It matches!
// the match is $m[0]
}
else { // nah, no luck...
}
Note that the (?i) makes it case-insensitive.
This also assumes you are validating that an entire string conforms to the pattern. If you want to find the pattern as a substring of a larger string, remove the ^ and $ anchors:
$theregex = '~(?i)\d+(?:\.5)?x-(?:large|medium|small)~';
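To make the difference between the two forms concrete, here is a small sketch. The helper names isSize and findSize are hypothetical.

```php
<?php
// Hypothetical helper: does the whole string look like a size?
function isSize(string $s): bool
{
    return preg_match('~(?i)^\d+(?:\.5)?x-(?:large|medium|small)$~', $s) === 1;
}

// Hypothetical helper: return the first size found inside a longer string, or null.
function findSize(string $s): ?string
{
    if (preg_match('~(?i)\d+(?:\.5)?x-(?:large|medium|small)~', $s, $m) === 1) {
        return $m[0];
    }
    return null;
}

var_dump(isSize('7x-large'));              // bool(true)
var_dump(isSize('Shirt 3x-small blue'));   // bool(false), extra text around the size
var_dump(findSize('Shirt 3x-small blue')); // string(8) "3x-small"
```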
Look at the specification you have and build it up piece by piece. You want "{any_number}{x-}{large|medium|small}".
"{any_number}" would be \d+. This does not allow fractional numbers such as 12.34, but the question does not specify whether they are required.
"{x-}" is a simple string x-
"{large|medium|small}" is a choice between three alternatives large|medium|small.
Joining the pieces together gives \d+x-(large|medium|small). Note the brackets around the alternation; without them the expression would be interpreted as (\d+x-large)|medium|small.
You mention "weird sizes like 41 1/2" but without specifying how "weird" the numbers to be matched are. You need a precise specification of what you include in "weird" before you can extend the regular expression.
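The effect of the brackets can be checked quickly. In this sketch the ^ and $ anchors are an addition for the demonstration, and the sample strings are made up.

```php
<?php
// With the group, the alternation only covers the size word.
$grouped = '~^\d+x-(large|medium|small)$~';
// Without it, the pattern means: "^\d+x-large" OR "medium" OR "small$".
$ungrouped = '~^\d+x-large|medium|small$~';

$groupedOnMedium   = preg_match($grouped, 'medium');           // 0
$ungroupedOnMedium = preg_match($ungrouped, 'a medium pizza'); // 1, matches "medium" alone
$groupedOnSize     = preg_match($grouped, '7x-small', $m);     // 1, $m[1] is "small"

var_dump($groupedOnMedium, $ungroupedOnMedium, $groupedOnSize, $m[1]);
```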

Rotation in PHP's regex

How can you match the following words by PHP, either by regex/globbing/...?
Examples
INNO, heppeh, isi, pekkep, dadad, mum
My attempt would be to make a regex which has 3 parts:
1st match: [a-zA-Z]*
[a-zA-Z]?
rotation of the 1st match // Problem here!
The part 3 is the problem, since I do not know how to rotate the match.
This suggests to me that regex is not the best solution here, since it would be very inefficient for long words.
I think regex are a bad solution. I'd do something with the condition like: ($word == strrev($word)).
Regexes are not suitable for finding palindromes of arbitrary length.
However, if you are trying to find all of the palindromes in a large set of text, you could use regex to find a list of things that might be palindromes, and then filter that list to find the words that actually are palindromes.
For example, you can use a regex to find all words such that the first X characters are the reverse of the last X characters (for some small fixed value of X, like 2 or 3), and then run a secondary filter against all the matches to see if the whole word is in fact a palindrome.
In PHP once you get the string you want to check (by regex or split or whatever) you can just:
if ($string == strrev($string)) // it's a palindrome!
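As a slightly fuller sketch of the strrev check: the function name isPalindrome is hypothetical, and lowercasing the word first is an assumption so that mixed-case input is compared case-insensitively. strrev works on bytes, so this only suits single-byte text.

```php
<?php
// Hypothetical helper: case-insensitive palindrome test for single-byte strings.
function isPalindrome(string $word): bool
{
    $w = strtolower($word);
    return $w !== '' && $w === strrev($w);
}

$words = ['heppeh', 'isi', 'pekkep', 'dadad', 'mum', 'hello'];
// Keep only the words that read the same in both directions.
$palindromes = array_values(array_filter($words, 'isPalindrome'));
print_r($palindromes); // heppeh, isi, pekkep, dadad, mum
```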
I think this regexp can work:
$re = '~([a-z])(.?|(?R))\1~';
