Match a string with two different fixed lengths with preg_match

Is it possible to match a string of two different lengths with preg_match? And if yes, how?
I’m looking for something like this:
preg_match("/^[a-zA-Z0-9]{13|25}$/", $string);
As in, return true if $string has a length of exactly either 13 or 25 characters.
P.S.: I know that {13,25} is valid syntax, but it means {min,max}, and I’m not interested in matching within an interval.

This is a fast way:
preg_match('/^[a-zA-Z0-9]{13}([a-zA-Z0-9]{12})?$/', $string);

Something like
preg_match("/^([a-zA-Z0-9]{13}|[a-zA-Z0-9]{25})$/", $string);
The alternation ([a-zA-Z0-9]{13}|[a-zA-Z0-9]{25}) matches a string of either length 13 or length 25.
Example : http://regex101.com/r/bJ9vV5/1
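To see that the two patterns above accept the same strings, both can be run against inputs of several lengths. This is only a sketch: the helper names and the test strings are made up for illustration.

```php
<?php
// Hypothetical helpers wrapping the two patterns from the answers above.
function matchesOptionalTail(string $s): bool {
    // 13 characters, optionally followed by 12 more (13 + 12 = 25).
    return preg_match('/^[a-zA-Z0-9]{13}([a-zA-Z0-9]{12})?$/', $s) === 1;
}

function matchesAlternation(string $s): bool {
    // Exactly 13 characters or exactly 25 characters.
    return preg_match('/^([a-zA-Z0-9]{13}|[a-zA-Z0-9]{25})$/', $s) === 1;
}

// Print the verdict of both helpers for a few lengths.
foreach ([12, 13, 14, 25, 26] as $len) {
    $s = str_repeat('a', $len);
    printf(
        "%2d: %s %s\n",
        $len,
        var_export(matchesOptionalTail($s), true),
        var_export(matchesAlternation($s), true)
    );
}
```

Note that `$` also matches just before a single trailing newline, so if that matters, the `D` modifier (or `\z` instead of `$`) is probably what you want.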

I know your question was about doing it with regex, but it's generally best practice to avoid regex whenever possible. A few reasons why:
You should benchmark to be certain, but in most cases, built-in functions will out-perform regex.
Regex is not (thoroughly) understood by a lot of coders.
Regex is usually less flexible when new "business rules" / requirements get thrown into the mix. For example, what if you needed to do one thing if the length is 13, and something different if it's 25? Or maybe do something if it's the right chars but the wrong length? You will not be able to code for these things with regex alone. (The solution I present below doesn't address these "what ifs" either, but the difference is that you now have the ability to separate the checks as needed.)
So here is a non-regex approach.
if ( in_array(strlen($string), array(13, 25)) && ctype_alnum($string) ) {
    // good
} else {
    // bad
}
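As a rough illustration of separating the checks, the length test and the character test can be split so that each outcome gets its own branch. The function name and the returned labels are hypothetical, chosen only for this sketch.

```php
<?php
// Hypothetical helper: report which rule a string satisfies or violates.
function classifyCode(string $s): string {
    // Character rule first: letters and digits only.
    if (!ctype_alnum($s)) {
        return 'bad characters';
    }
    // Length rule: each accepted length can now be handled separately.
    switch (strlen($s)) {
        case 13:
            return 'short code';
        case 25:
            return 'long code';
        default:
            return 'wrong length';
    }
}

echo classifyCode(str_repeat('a', 13)), "\n";
echo classifyCode(str_repeat('a', 25)), "\n";
echo classifyCode('abc123'), "\n";
echo classifyCode('abc-123'), "\n";
```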

Related

Adding regex to detect a word that repeats 3 characters?

I have searched and searched but I cannot find anything quite exactly like what I need. I have a code:
$repeater = "pompom";
if (preg_match('/([a-zA-Z])\1{3}/', $repeater)) {
    echo "Yes, $repeater does repeat 3 characters.<br>";
}
else {
    echo "No, $repeater does not repeat 3 characters.<br>";
}
(I can barely understand regex as it is... so just ignore my current regex.. it's just a mixture of randomness I began to type.)
Anyhow, I need the regex code to return
true for words like
pompom
grugru
mopmop
cancan
etc...
and return false for words like
coocoo
daadaa
allall
giigii
etc.
The regex must detect and return true for any word that has 3 different characters that repeat more than once in that word.
This must work for words that have characters that are not necessarily in sequence with one another. I have found solutions to that. Words such as "cooo" or "pooool" are not what I need to apply this regex to. Note: This must return true only for words that have 3 or more different letters that are repeated more than once, such as pompom.
This should return false for words like coocoo because there are only 2 different letters in the word.
Again, please ignore my current regex it was just what I had when I decided to ask for some help. I've tried probably 200 different methods, all wrong of course :].
Any help would be nice, maybe we can figure this out together I just need some ideas to bounce off of.

The following regex will perform as requested:
^((.)(?!\2)(.)(?!\2)(?!\3).)\1$
https://regex101.com/r/eHKzWB/3
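Read left to right, the pattern captures the first character in group 2, uses a negative lookahead to require that the second character (group 3) differs from it, requires the third character to differ from both, and finally demands that the whole three-character block (group 1) repeats once. A small sketch that runs it over the words from the question follows; the function name is an assumption for illustration.

```php
<?php
// Hypothetical wrapper around the pattern from the answer above.
function repeatsThreeDistinct(string $word): bool {
    // Group 2 = first char, group 3 = second char, group 1 = the 3-char block.
    return preg_match('/^((.)(?!\2)(.)(?!\2)(?!\3).)\1$/', $word) === 1;
}

$words = ['pompom', 'grugru', 'mopmop', 'cancan',
          'coocoo', 'daadaa', 'allall', 'giigii'];

foreach ($words as $word) {
    echo $word, ': ', repeatsThreeDistinct($word) ? 'true' : 'false', "\n";
}
```

As written, the pattern only accepts six-character words made of one block repeated exactly twice.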

PHP Regex Check if two strings share two common characters

I'm just getting to know regular expressions, but after doing quite a bit of reading (and learning quite a lot), I still have not been able to figure out a good solution to this problem.
Let me be clear, I understand that this particular problem might be better solved not using regular expressions, but for the sake of brevity let me just say that I need to use regular expressions (trust me, I know there are better ways to solve this).
Here's the problem. I'm given a big file, each line of which is exactly 4 characters long.
This is a regex that defines "valid" lines:
"/^[AB][CD][EF][GH]$/m"
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
What I'm trying to do is given one of those lines, match all other lines that contain 2 or more common characters.
The below example assumes the following:
$line is always a valid format
BigFileOfLines.txt contains only valid lines
Example:
// Matches all other lines in string that share 2 or more characters in common
// with "$line"
function findMatchingLines($line, $subject) {
    $regex = "magic regex I'm looking for here";
    $matchingLines = array();
    preg_match_all($regex, $subject, $matchingLines);
    return $matchingLines;
}
// Example Usage
$fileContents = file_get_contents("BigFileOfLines.txt");
$matchingLines = findMatchingLines("ACFG", $fileContents);
/*
* Desired return value (Note: this is an example set, there
* could be more or less than this)
*
* BCEG
* ADFG
* BCFG
* BDFG
*/
One way I know that will work is to have a regex like the following (this regex would only work for "ACFG"):
"/^(?:AC.{2}|.CF.|.{2}FG|A.F.|A.{2}G|.C.G)$/m"
This works alright, and performance is acceptable. What bothers me about it, though, is that I have to generate it based on $line, where I'd rather have it be ignorant of what the specific parameter is. Also, this solution doesn't scale terribly well if the code is later modified to match, say, 3 or more characters, or if the size of each line grows from 4 to 16.
It just feels like there's something remarkably simple that I'm overlooking. Also seems like this could be a duplicate question, but none of the other questions I've looked at really seem to address this particular problem.
Thanks in advance!
Update:
It seems that the norm with Regex answers is for SO users to simply post a regular expression and say "This should work for you."
I think that's kind of a halfway answer. I really want to understand the regular expression, so if you can include in your answer a thorough (within reason) explanation of why that regular expression:
A. Works
B. Is the most efficient (I feel there are a sufficient number of assumptions that can be made about the subject string that a fair amount of optimization can be done).
Of course, if you give an answer that works, and nobody else posts the answer *with* a solution, I'll mark it as the answer :)
Update 2:
Thank you all for the great responses, a lot of helpful information, and a lot of you had valid solutions. I chose the answer I did because after running performance tests, it was the best solution, averaging equal runtimes with the other solutions.
The reasons I favor this answer:
The regular expression given provides excellent scalability for longer lines
The regular expression looks a lot cleaner, and is easier for mere mortals such as myself to interpret.
However, a lot of credit goes to the below answers as well for being very thorough in explaining why their solution is the best. If you've come across this question because it's something you're trying to figure out, please give them all a read, helped me tremendously.

Why don't you just use this regex $regex = "/.*[$line].*[$line].*/m";?
For your example, that translates to $regex = "/.*[ACFG].*[ACFG].*/m";

This is a regex that defines "valid" lines:
/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m
In english, each line has either A or B at position 0, either C or D
at position 1, either E or F at position 2, and either G or H at
position 3. I can assume that each line will be exactly 4 characters
long.
That's not what that regex means. That regex means that each line has either A or B or a pipe at position 0, C or D or a pipe at position 1, etc; [A|B] means "either 'A' or '|' or 'B'". The '|' only means 'or' outside of character classes.
Also, {1} is a no-op; lacking any quantifier, everything has to appear exactly once. So a correct regex for the above English is this:
/^[AB][CD][EF][GH]$/
or, alternatively:
/^(A|B)(C|D)(E|F)(G|H)$/
That second one has the side effect of capturing the letter in each position, so that the first captured group will tell you whether the first character was A or B, and so on. If you don't want the capturing, you can use non-capture grouping:
/^(?:A|B)(?:C|D)(?:E|F)(?:G|H)$/
But the character-class version is by far the usual way of writing this.
As to your problem, it is ill-suited to regular expressions; by the time you deconstruct the string, stick it back together in the appropriate regex syntax, compile the regex, and do the test, you would probably have been much better off just doing a character-by-character comparison.
I would rewrite your "ACFG" regex thus: /^(?:AC..|A.F.|A..G|.CF.|.C.G|..FG)$/, but that's just appearance; I can't think of a better solution using regex. (Although as Mike Ryan indicated, it would be better still with the common prefixes factored out, as /^(?:A(?:C..|.F.|..G)|.C(?:F.|.G)|..FG)$/, but that's still the same solution, just in a more efficiently-processed form.)
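For comparison, the character-by-character approach mentioned above could look roughly like this. It is a sketch under the assumption that the lines have already been read into an array; both function names are invented for the example.

```php
<?php
// Hypothetical helper: count positions at which two strings hold the same character.
function commonPositions(string $a, string $b): int {
    $n = min(strlen($a), strlen($b));
    $count = 0;
    for ($i = 0; $i < $n; $i++) {
        if ($a[$i] === $b[$i]) {
            $count++;
        }
    }
    return $count;
}

// Hypothetical helper: keep the other lines sharing at least $min positions with $line.
function findMatchingLinesByPosition(string $line, array $lines, int $min = 2): array {
    return array_values(array_filter($lines, function ($candidate) use ($line, $min) {
        return $candidate !== $line && commonPositions($line, $candidate) >= $min;
    }));
}

$lines = ['BCEG', 'ADFG', 'BCFG', 'BDFG', 'BDEH', 'ACFG'];
print_r(findMatchingLinesByPosition('ACFG', $lines));
```

Changing the threshold or the line length needs no new pattern here, only a different `$min` or longer input strings.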

You've already answered how to do it with a regex, and noted its shortcomings and inability to scale, so I don't think there's any need to flog the dead horse. Instead, here's a way that'll work without the need for a regex:
function findMatchingLines($line) {
    static $file = null;
    if (!$file) $file = file("BigFileOfLines.txt");
    $search = str_split($line);
    foreach ($file as $l) {
        $test = str_split($l);
        $matches = count(array_intersect($search, $test));
        if ($matches >= 2) // define number of matches required here - optionally make it an argument
            return true;
    }
    // no matches
    return false;
}

There are 6 possibilities that at least two characters match out of 4: MM.., M.M., M..M, .MM., .M.M, and ..MM ("M" meaning a match and "." meaning a non-match).
So, you need only to convert your input into a regex that matches any of those possibilities. For an input of ACFG, you would use this:
"/^(AC..|A.F.|A..G|.CF.|.C.G|..FG)$/m"
This, of course, is the conclusion you're already at--so good so far.
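Generating that pattern from the input line is mechanical: emit one alternative per pair of positions, keeping the two characters at those positions and replacing the rest with dots. The following is a minimal sketch of that idea; `buildPairRegex` is a made-up name, and it only handles the "at least two matching positions" case.

```php
<?php
// Hypothetical generator: one alternative for each pair of positions kept from $line.
function buildPairRegex(string $line): string {
    $n = strlen($line);
    $alternatives = [];
    for ($i = 0; $i < $n - 1; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            // Start with all wildcards, then pin the two chosen positions.
            $parts = array_fill(0, $n, '.');
            $parts[$i] = preg_quote($line[$i], '/');
            $parts[$j] = preg_quote($line[$j], '/');
            $alternatives[] = implode('', $parts);
        }
    }
    return '/^(?:' . implode('|', $alternatives) . ')$/m';
}

$regex = buildPairRegex('ACFG');
echo $regex, "\n";

preg_match_all($regex, "BCEG\nADFG\nBCFG\nBDFG\nBDEH", $m);
print_r($m[0]);
```

The number of alternatives grows with the number of position pairs, which is the scaling concern raised in the question.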
The key issue is that Regex isn't a language for comparing two strings, it's a language for comparing a string to a pattern. Thus, either your comparison string must be part of the pattern (which you've already found), or it must be part of the input. The latter method would allow you to use a general-purpose match, but does require you to mangle your input.
function findMatchingLines($line, $subject) {
    $regex = '/(?<=^([AB])([CD])([EF])([GH])[.\n]+)'
           . '(\1\2..|\1.\3.|\1..\4|.\2\3.|.\2.\4|..\3\4)/m';
    $matchingLines = array();
    preg_match_all($regex, $line . "\n" . $subject, $matchingLines);
    return $matchingLines;
}
What this function does is prepend the line you want to match against to your input string, then use a pattern that compares each line after the first line (that's the + after [.\n] working) back to the first line's 4 characters.
If you also want to validate those matching lines against the "rules", just replace the . in each pattern with the appropriate character class (\1\2[EF][GH], etc.).
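One caveat: PCRE, the engine behind PHP's preg functions, generally rejects lookbehinds of unbounded length, so a lookbehind containing `+` may fail to compile. The same "reference line inside the subject" idea can be sketched with a lookahead instead, assuming the reference line is appended to the subject rather than prepended. The function name below is hypothetical.

```php
<?php
// Sketch: append the reference line, then let every line look ahead to it.
function findMatchingLinesLookahead(string $line, string $subject): array {
    // The lookahead skips to the final line and captures its four characters;
    // the alternation then compares the current line against those captures.
    $regex = '/^(?=(?s:.*)\n(.)(.)(.)(.)\z)'
           . '(?:\1\2..|\1.\3.|\1..\4|.\2\3.|.\2.\4|..\3\4)$/m';
    preg_match_all($regex, rtrim($subject, "\n") . "\n" . $line, $matches);
    return $matches[0];
}

print_r(findMatchingLinesLookahead('ACFG', "BCEG\nADFG\nBCFG\nBDFG\nBDEH\n"));
```

The appended line itself is never reported, because no newline follows it for the lookahead to find. Every line rescans the rest of the subject, so this is likely to be slow on a big file; it is meant to show the technique rather than to be fast.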

People may be confused by your first regex. You give:
"/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m"
And then say:
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
But that's not what that regex means at all.
This is because the | operator has the lowest precedence here. So, what that regex really says, in English, is: Either A or | or B in the first position, OR C or | or D anywhere in the line, OR E or | or F anywhere in the line, OR G or | or H in the last position.
This is because [A|B] means a character class with one of the three given characters (including the |), because {1} means one character (it is also completely superfluous and could be dropped), and because each outer | alternates between everything around it. In my English expression above, each capitalized OR stands for one of your alternating |'s. (And I started counting positions at 1, not 0 -- I didn't feel like typing the 0th position.)
To get your English description as a regex, you would want:
/^[AB][CD][EF][GH]$/
The regex will go through and check the first position for A or B (in the character class), then check C or D in the next position, etc.
--
EDIT:
You want to test for only two of these four characters matching.
Very strictly speaking, and picking up from @Mark Reed's answer, the fastest regex (after it's been parsed) is likely to be:
/^(A(C..|.E.|..G)|.C(E.|.G)|..EG)$/
as compared to:
/^(AC..|A.E.|A..G|.CE.|.C.G|..EG)$/
This is because of how the regex implementation steps through text. You first test if A is in the first position. If that succeeds, then you test the sub-cases. If that fails, then you're done with all those possible cases (of which there are 3). If you don't yet have a match, you then test if C is in the 2nd position. If that succeeds, then you test for the two subcases. And if none of those succeed, you test for EG in the 3rd and 4th positions.
This regex is specifically created to fail as fast as possible. Listing each case out separately means that, to fail, you would have to test 6 different cases (each of the six alternatives), instead of 3 cases (at a minimum). And in cases of A not being in the first position, you would immediately go on to test the 2nd position, without hitting the first position two more times. Etc.
(Note that I don't know exactly how PHP compiles regex's -- it's possible that they compile to the same internal representation, though I suspect not.)
--
EDIT: One additional point. "Fastest regex" is a somewhat ambiguous term. Fastest to fail? Fastest to succeed? And given what possible range of sample data of succeeding and failing rows? All of these would have to be clarified to really determine what criteria you mean by fastest.

Here's something that uses Levenshtein distance instead of regex and should be extensible enough for your requirements:
$lines = array_map('rtrim', file('file.txt')); // load file into array removing \n
$common = 2; // number of common characters required
$match = 'ACFG'; // string to match
$matchingLines = array_filter($lines, function ($line) use ($common, $match) {
    // error checking here if necessary - $line and $match must be same length
    return (levenshtein($line, $match) <= (strlen($line) - $common));
});
var_dump($matchingLines);

I bookmarked the question yesterday in the evening to post an answer today, but seems that I'm a little late ^^ Here is my solution anyways:
/^[^ACFG]*+(?:[ACFG][^ACFG]*+){2}$/m
It looks for two occurrences of one of the ACFG characters surrounded by any other characters. The loop is unrolled and uses possessive quantifiers, to improve performance a bit.
Can be generated using:
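function getRegexMatchingNCharactersOfLine($line, $num) {
    // Concatenate $num explicitly: inside double quotes, "{$num}" would be
    // interpolated without the braces the quantifier needs.
    return "/^[^$line]*+(?:[$line][^$line]*+){" . $num . "}$/m";
}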

How to solve €25.99 vs 25,99€ preg_match problem?

If I have these strings:
$string1 = "This book costs €25.99 in our shop.";
and on the other side
$string2 = "This book costs 25,99€ in our shop.";
How can I get "€25.99" or "25,99€" using preg_match? What would the code look like?
Please notice that there are 2 ways of writing the euro symbol. The correct way in the EU is to write the symbol after the number, like 25,99€, using a comma as the decimal separator. However, a lot of people in the US stick to the dollar convention (€25.99) with a dot as the decimal separator.
How can I check for both cases and get the value with the symbol in the cleanest and most efficient way?

Here's the raw regex: €\d+(?:[,.]\d+)?|\d+(?:[,.]\d+)?€
preg_match ( "/€\d+(?:[,.]\d+)?|\d+(?:[,.]\d+)?€/" , $string1, $matches)
If you want to consider optional spaces between euro and the value, use this:
preg_match ( "/€ ?\d+(?:[,.]\d+)?|\d+(?:[,.]\d+)? ?€/" , $string1, $matches)
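Put together, the second pattern can be wrapped in a small function and tried on both strings from the question. This is a sketch: the function name is made up, and the `u` modifier is added here on the assumption that the source file and the input are UTF-8.

```php
<?php
// Hypothetical helper: return the first euro amount found in $text, or null.
function extractEuroPrice(string $text): ?string {
    // Symbol before the number, or number before the symbol, with an optional space.
    $pattern = '/€ ?\d+(?:[,.]\d+)?|\d+(?:[,.]\d+)? ?€/u';
    return preg_match($pattern, $text, $matches) === 1 ? $matches[0] : null;
}

echo extractEuroPrice("This book costs €25.99 in our shop."), "\n"; // €25.99
echo extractEuroPrice("This book costs 25,99€ in our shop."), "\n"; // 25,99€
var_dump(extractEuroPrice("This book is free."));                   // NULL
```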

agent-j's pattern is on the right track, but I would do something slightly more restrictive:
/€\d+(?:[.,]\d{2})?|\d+(?:[.,]\d{2})?€/
The only difference is that the decimal part is limited to 2 places, if it exists. I don't think you want to allow something like 99,999€, especially since that could mean "99 thousand, 999 euros" if written in the American style.
What I think you're trying to get at in your reference to the cleanest and most efficient way is that the above pattern seems awkward and redundant when you look at it. It's basically the \d+(?:[.,]\d{2})? portion repeated twice, with the € symbol switching sides. This feels wrong, but it isn't. You can't really get around it without bringing in just as much complexity, if not more. Even if you try to get around it with fancy lookarounds, it's going to look something like this:
/^(?=.*€)€?\d+(?:[.,]\d{2})?((?<!€.*)€)?$/
Clearly not an improvement. Sometimes the most obvious solution is the best one, even if it makes you feel dirty.
Note: If you want to get really crazy with it, you can try a variation (caution: untested, and I haven't done much PHP in a while):
$inner = '(?:\d{1,3}(?:([.,])\d{3})*(?:(?!\g{-1})[.,]\d{2})?|\d*(?:[.,]\d{2})?)';
Usage:
preg_match ( "/€" . $inner . "|" . $inner . "€/", $string1, $matches)
That should also accept things like 99,999.99; 999999,99; 9.999.999,99; .99; etc.

Check for both cases:
/([$€]?[\d.,]+[$€]?)/u
The ? makes the [$€] optional (literally '0 or 1 of...'), so you'd have to check for the degenerate case where there's just a bare number with no currency symbol at all. The u modifier makes the multibyte € count as a single character inside the character class.

php regular expression to filter out junk

So I have an interesting problem: I have a string, and for the most part I know what to expect:
http://www.someurl.com/st=????????
Except in this case, the ?'s are either upper case letters or numbers. The problem is, the string has garbage mixed in: the string is broken up into 5 or 6 pieces, and in between there's lots of junk: unprintable characters, foreign characters, as well as plain old normal characters. In short, stuff that's apt to look like this: Nyþ=mî;ëMÝ×nüqÏ
Usually the last 8 characters (the ?'s) are together right at the end, so at the moment I just have PHP grab the last 8 chars and hope for the best. Occasionally, that doesn't work, so I need a more robust solution.
The problem is technically unsolvable, but I think the best solution is to grab characters from the end of the string while they are upper case or numeric. If I get 8 or more, assume that is correct. Otherwise, find the st= and grab as many characters going forward as I need to fill up the 8 character quota. Is there a regex way to do this, or will I need to roll up my sleeves and go nested-loop style?
update:
To clear up some confusion, I get an input string that's like this:
[garbage]http:/[garbage]/somewe[garbage]bsite.co[garbage]m/something=[garbage]????????
except the garbage is in unpredictable locations in the string (except the end is never garbage), and has unpredictable length (at least, I have been able to find patterns in neither). Usually the ?s are all together hence me just grabbing the last 8 chars, but sometimes they aren't which results in some missing data and returned garbage :-\

$var = '†http://þ=www.ex;üßample-website.î;ëcomÝ×ü/joy_hÏere.html'; // test case
$clean = join(
    array_filter(
        str_split($var, 1),
        function ($char) {
            return (
                array_key_exists(
                    $char,
                    array_flip(array_merge(
                        range('A','Z'),
                        range('a','z'),
                        range((string)'0',(string)'9'),
                        array(':','.','/','-','_')
                    ))
                )
            );
        }
    )
);
Hah, that was a joke. Here's a regex for you:
$clean = preg_replace('/[^A-Za-z0-9:.\/_-]/','',$var);
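The fallback strategy described in the question (take the trailing run of upper-case letters and digits, then top it up with the characters right after st=) could be sketched as follows. The function name, the fixed length of 8, and the assumption that the st= marker itself survives intact are all illustrative, and ordinary characters in the garbage can still fool it.

```php
<?php
// Hypothetical helper implementing the heuristic from the question.
function extractCode(string $raw, int $length = 8): ?string {
    // 1. Take the unbroken run of upper-case letters/digits at the very end.
    preg_match('/[A-Z0-9]*\z/', $raw, $m);
    $tail = $m[0];
    if (strlen($tail) >= $length) {
        return substr($tail, -$length);
    }

    // 2. Otherwise fill the missing characters from right after "st=".
    $pos = strpos($raw, 'st=');
    if ($pos === false) {
        return null;
    }
    $needed = $length - strlen($tail);
    if (!preg_match('/^[A-Z0-9]{' . $needed . '}/', substr($raw, $pos + 3), $head)) {
        return null; // not enough clean characters after the marker
    }
    return $head[0] . $tail;
}

var_dump(extractCode('http://www.someurl.com/st=ABCD1234'));
var_dump(extractCode('http://www.someurl.com/st=ABþîCD1234'));
```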

As stated, the problem is unsolvable. If the garbage can contain "plain old normal characters", and the garbage can fall at the end of the string, then you cannot know whether the target string from this sample is "ABCDEFGH" or "BCDEFGHI":
__http:/____/somewe___bsite.co____m/something=__ABCDEFGHI__

What do these values represent? If you want to retain all of it, just without having to deal with garbage in your database, maybe you should hex-encode it using bin2hex().

You can use this regular expression :
if (preg_match('/[\'^£$%&*()}{@#~?><>,|=_+¬-]/', $string) == 1)

Rotation in PHP's regex

How can you match the following words by PHP, either by regex/globbing/...?
Examples
INNO, heppeh, isi, pekkep, dadad, mum
My attempt would be to make a regex which has 3 parts:
1st match: [a-zA-Z]*
[a-zA-Z]?
rotation of the 1st match // Problem here!
Part 3 is the problem, since I do not know how to rotate the match.
This suggests to me that regex is not the best solution here, since it would be very inefficient for long words.

I think regexes are a bad solution here. I'd do something with a condition like: ($word == strrev($word)).

Regexes are not suitable for finding palindromes of an arbitrary length.
However, if you are trying to find all of the palindromes in a large set of text, you could use regex to find a list of things that might be palindromes, and then filter that list to find the words that actually are palindromes.
For example, you can use a regex to find all words such that the first X characters are the reverse of the last X characters (for some small fixed value of X, like 2 or 3), and then run a secondary filter against all the matches to see if the whole word is in fact a palindrome.

In PHP once you get the string you want to check (by regex or split or whatever) you can just:
if ($string == strrev($string)) // it's a palindrome!
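Wrapped in a function and applied to a word list, the strrev check might look like this. It is a minimal sketch: the function name is made up, and because strrev works on bytes it is only reliable for single-byte (for example ASCII) words.

```php
<?php
// Hypothetical helper: a word is a palindrome if it equals its own reverse.
function isPalindrome(string $word): bool {
    return $word !== '' && $word === strrev($word);
}

$words = ['INNO', 'heppeh', 'isi', 'pekkep', 'dadad', 'mum'];

// Keep only the words that read the same in both directions.
print_r(array_values(array_filter($words, 'isPalindrome')));
```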

I think this regexp can work:
$re = '~([a-z])(.?|(?R))\1~';
