So I've been working on a little project to write a syntax highlighter for a game's scripting language. It's all gone off without a hitch, except for one part: the numbers.
Take these lines for example
(5:42) Set database entry {healthpoints2} to the value 100.
(5:140) Move the user to position (29,40) on the map.
I want to highlight that 100 on the end, without highlighting the (5:42) or the 2 in the braces. The numbers won't always be in the same place, and there won't always only be one number.
I basically need a regexp to say:
"Match any numbers that aren't anywhere between {} and don't match the (#:#) pattern."
I've been at this for a day and a half now and I'm pulling out my hair trying to figure it out. Help with this would be greatly appreciated!
I've already looked at regular-expressions.info, and tried playing around with RegexBuddy, but i'm just not getting it :c
Edit: By request, here's some more lines copied right from the script editor.
(0:7) When somebody moves into position (**10** fhejwkfhwjekf **20**,
(0:20) When somebody rolls exactly **10** on **2** dice of **6** sides,
(0:31) When somebody says {...},
(3:3) within the diamond (**5**,**10**) - **20** //// **25**,
(3:14) in a line starting at (#, #) and going # more spaces northeast.
(5:10) play sound # to everyone who can see (#,#).
(5:14) move the user to (#,#) if there's nobody already there.
(5:272) set message ~msg to be the portion of message ~msg from position # to position #.
(5:302) take variable %var and add # to it.
(5:600) set database entry {...} about the user to #.
(5:601) set database entry {...} about the user named {...} to #.
You might kick yourself when you see this solution...
Assuming this desired number will always be used in a sentence, it should always have a space preceding it.
$pattern = '/ [0-9]+/s';
If the preceding space isn't always present, let me know and I'll update the answer.
Here's the updated regex to match the 2 examples in your question:
$pattern = '/[^:{}0-9]([0-9,]+)[^:{}0-9]/s';
3nd update to account for your question revisions:
$pattern = '/[^:{}0-9a-z#]([0-9]+[, ]?[0-9]*)[^:{}0-9a-z#]/s';
So you don't highlight the number in things like
{update 29 testing}
you might want to pre strip the braces, like so:
$pattern = '/[^:{}0-9a-z#]([0-9]+[, ]?[0-9]*)[^:{}0-9a-z#]/s';
$str = '(0:7) Hello {update 29 testing} 123 Rodger alpha charlie 99';
$tmp_str = preg_replace('/{[^}]+}/s', '', $str);
preg_match($pattern, $tmp_str, $matches);
(\d+,\s?\d+)|(?<![\(\{:]|:\d)\d+(?![\)\}])
http://regexr.com?30omd
Would this work?
Related
$my_string = '88888805';
echo preg_replace("/(^.|.$)(*SKIP)(*F)|(.)/","*",$,my_string);
This shows the first and last number like thus 8******5
But how can i show this number like this 888888**. (The last 2 number is hidden)
Thank you!
From this: 8******5
To: 888888**
I'm not sure if you have worked on this Regex pattern to do something unique. However, I will provide you with a general one that should fit your question without using your current pattern.
$my_string = '88888805';
echo preg_replace("/([0-9]+)[0-9]{2}$/","$1**",$,my_string);
Explanation:
The ([0-9]+) will match all digits, this could be replaced with \d+, it's between brackets to be captured as we are going to use it in the results.
[0-9]{2} is going to match the last 2 digits, again, it can be replaced with \d{2}, it's outside the brackets because we don't want to include them in the result. the $ after that is to indicate the end of the test, it's optional anyways.
Results:
Input: 88888805
Output: 888888**
echo preg_replace("/(.{2}$)(*SKIP)(*F)|(.)/","*",$my_string);
If it for a uni assignment, you'd probably want to do this. Basically says, don't match if its the last two characters, otherwise match.
I am trying to retrieve matches from a comma separated list that is located inside parenthesis using regular expression. (I also retrieve the version number in the first capture group, though that's not important to this question)
What's worth noting is that the expression should ideally handle all possible cases, where the list could be empty or could have more than 3 entries = 0 or more matches in the second capture group.
The expression I have right now looks like this:
SomeText\/(.*)\s\(((,\s)?([\w\s\.]+))*\)
The string I am testing this on looks like this:
SomeText/1.0.4 (debug, OS X 10.11.2, Macbook Pro Retina)
Result of this is:
1. [6-11] `1.0.4`
2. [32-52] `, Macbook Pro Retina`
3. [32-34] `, `
4. [34-52] `Macbook Pro Retina`
The desired result would look like this:
1. [6-11] `1.0.4`
2. [32-52] `debug`
3. [32-34] `OS X 10.11.2`
4. [34-52] `Macbook Pro Retina`
According to the image above (as far as I can see), the expression should work on the test string. What is the cause of the weird results and how could I improve the expression?
I know there are other ways of solving this problem, but I would like to use a single regular expression if possible. Please don't suggest other options.
When dealing with a varying number of groups, regex ain't the best. Solve it in two steps.
First, break down the statement using a simple regex:
SomeText\/([\d.]*) \(([^)]*)\)
1. [9-14] `1.0.4`
2. [16-55] `debug, OS X 10.11.2, Macbook Pro Retina`
Then just explode the second result by ',' to get your groups.
Probably the \G anchor works best here for binding the match to an entry point. This regex is designed for input that is always similar to the sample that is provided in your question.
(?<=SomeText\/|\G(?!^))[(,]? *\K[^,)(]+
(?<=SomeText\/|\G) the lookbehind is the part where matches should be glued to
\G matches where the previous match ended (?!^) but don't match start
[(,]? *\ matches optional opening parenthesis or comma followed by any amount of space
\K resets beginning of the reported match
[^,)(]+ matches the wanted characters, that are none of ( ) ,
Demo at regex101 (grab matches of $0)
Another idea with use of capture groups.
SomeText\/([^(]*)\(|\G(?!^),? *([^,)]+)
This one without lookbehind is a bit more accurate (it also requires the opening parenthesis), of better performance (needs fewer steps) and probably easier to understand and maintain.
SomeText\/([^(]*)\( the entry anchor and version is captured here to $1
|\G(?!^),? *([^,)]+) or glued to previous match: capture to $2 one or more characters, that are not , ) preceded by optional space or comma.
Another demo at regex101
Actually, stribizhev was close:
(?:SomeText\/([^() ]*)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\))
Just had to make that one class expect at least one match
(?:SomeText\/([0-9.]+)\s*\(|(?!^)\G),?\s*([^(),]+)(?=[^()]*\)) is a little more clear as long as the version number is always numbers and periods.
I wanted to come up with something more elegant than this (though this does actually work):
SomeText\/(.*)\s\(([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?([^\,]+)?\,?\s?\)
Obviously, the
([^\,]+)?\,?\s?
is repeated 6 times.
(It can be repeated any number of times and it will work for any number of comma-separated items equal to or below that number of times).
I tried to shorten the long, repetitive list of ([^\,]+)?\,?\s? above to
(?:([^\,]+)\,?\s?)*
but it doesn't work and my knowledge of regex is not currently good enough to say why not.
This should solve your problem. Use the code you already have and add something like this. It will determine where commas are in your string and delete them.
Use trim() to delete white spaces at the start or the end.
$a = strpos($line, ",");
$line = trim(substr($line, 55-$a));
I hope, this helps you!
i have huge string that i need to separate information. Some parts of it vary and some dont. The difficulty i am facing is that i cant find a symbol or something on which i could get the match i want. So here is the string:
$str = "01;01;283;Póvoa do Vâle do Trigo;15315100 01;01;249;Alcafaz;;;;;;;;;;;3750;011;AGADÃO 01;01;2504;Caselho;;;;;;;;;;;3750;012;AGADÃO _ "15" '' ghdhghg AND IT CONTINUES
so if we look at the first part of the string (01;01;283;Póvoa do Vale do Trigo;15315100), what i want to stay with is:
01;01;283
and remove the rest of the stuff
in every case, but looking at the first example... :
the 01 is always a number never superior to 2 (not 040 or 150505 or 4075)
the same for the next 01 never superior to 2 (not 405 or 1565 or 425)
then the 283 is the number that can be bigger, it varies (it can be 300 or 17581 or 40755794)
essentially in the end i want only the beginning of each part like:
01;01;283
01;01;249
01;01;2504
05,80,104258
94,76,56789124
sorry for any misspelling i am Portuguese
i forget to say that this separated parts will then go to an array! so the regular expression should not match for example like this:
15315100 01;01;249
so i cant use .+ for example
I AM USING PREG_REPLACE
Try this:
/(\d+;\d+;\d+)/
Should work.
Try the following. The regex is in the match_all line.
$str = "***01;01;283***;Póvoa do Vâle do Trigo;15315100 ***01;01;249***;Alcafaz;;;;;;;;;;;3750;011;AGADÃO ";
preg_match_all("/\*\*\*[01][0-9];[01][0-9];[0-9]*\*\*\*.*?/", $str, $matches);
print_r($matches);
((?:\d\d;){2}\d+)
DEMO
And maybe it would be easier to just get everything between ***XXX***
\*([\d;]+)\*
DEMO
I'm just getting to know regular expressions, but after doing quite a bit of reading (and learning quite a lot), I still have not been able to figure out a good solution to this problem.
Let me be clear, I understand that this particular problem might be better solved not using regular expressions, but for the sake of brevity let me just say that I need to use regular expressions (trust me, I know there are better ways to solve this).
Here's the problem. I'm given a big file, each line of which is exactly 4 characters long.
This is a regex that defines "valid" lines:
"/^[AB][CD][EF][GH]$/m"
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
What I'm trying to do is given one of those lines, match all other lines that contain 2 or more common characters.
The below example assumes the following:
$line is always a valid format
BigFileOfLines.txt contains only valid lines
Example:
// Matches all other lines in string that share 2 or more characters in common
// with "$line"
function findMatchingLines($line, $subject) {
$regex = "magic regex I'm looking for here";
$matchingLines = array();
preg_match_all($regex, $subject, $matchingLines);
return $matchingLines;
}
// Example Usage
$fileContents = file_get_contents("BigFileOfLines.txt");
$matchingLines = findMatchingLines("ACFG", $fileContents);
/*
* Desired return value (Note: this is an example set, there
* could be more or less than this)
*
* BCEG
* ADFG
* BCFG
* BDFG
*/
One way I know that will work is to have a regex like the following (the following regex would only work for "ACFG":
"/^(?:AC.{2}|.CF.|.{2}FG|A.F.|A.{2}G|.C.G)$/m"
This works alright, performance is acceptable. What bothers me about it though is that I have to generate this based off of $line, where I'd rather have it be ignorant of what the specific parameter is. Also, this solution doesn't scale terrible well if later the code is modified to match say, 3 or more characters, or if the size of each line grows from 4 to 16.
It just feels like there's something remarkably simple that I'm overlooking. Also seems like this could be a duplicate question, but none of the other questions I've looked at really seem to address this particular problem.
Thanks in advance!
Update:
It seems that the norm with Regex answers is for SO users to simply post a regular expression and say "This should work for you."
I think that's kind of a halfway answer. I really want to understand the regular expression, so if you can include in your answer a thorough (within reason) explanation of why that regular expression:
A. Works
B. Is the most efficient (I feel there are a sufficient number of assumptions that can be made about the subject string that a fair amount of optimization can be done).
Of course, if you give an answer that works, and nobody else posts the answer *with* a solution, I'll mark it as the answer :)
Update 2:
Thank you all for the great responses, a lot of helpful information, and a lot of you had valid solutions. I chose the answer I did because after running performance tests, it was the best solution, averaging equal runtimes with the other solutions.
The reasons I favor this answer:
The regular expression given provides excellent scalability for longer lines
The regular expression looks a lot cleaner, and is easier for mere mortals such as myself to interpret.
However, a lot of credit goes to the below answers as well for being very thorough in explaining why their solution is the best. If you've come across this question because it's something you're trying to figure out, please give them all a read, helped me tremendously.
Why don't you just use this regex $regex = "/.*[$line].*[$line].*/m";?
For your example, that translates to $regex = "/.*[ACFG].*[ACFG].*/m";
This is a regex that defines "valid" lines:
/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m
In english, each line has either A or B at position 0, either C or D
at position 1, either E or F at position 2, and either G or H at
position 3. I can assume that each line will be exactly 4 characters
long.
That's not what that regex means. That regex means that each line has either A or B or a pipe at position 0, C or D or a pipe at position 1, etc; [A|B] means "either 'A' or '|' or 'B'". The '|' only means 'or' outside of character classes.
Also, {1} is a no-op; lacking any quantifier, everything has to appear exactly once. So a correct regex for the above English is this:
/^[AB][CD][EF][GH]$/
or, alternatively:
/^(A|B)(C|D)(E|F)(G|H)$/
That second one has the side effect of capturing the letter in each position, so that the first captured group will tell you whether the first character was A or B, and so on. If you don't want the capturing, you can use non-capture grouping:
/^(?:A|B)(?:C|D)(?:E|F)(?:G|H)$/
But the character-class version is by far the usual way of writing this.
As to your problem, it is ill-suited to regular expressions; by the time you deconstruct the string, stick it back together in the appropriate regex syntax, compile the regex, and do the test, you would probably have been much better off just doing a character-by-character comparison.
I would rewrite your "ACFG" regex thus: /^(?:AC|A.F|A..G|.CF|.C.G|..FG)$/, but that's just appearance; I can't think of a better solution using regex. (Although as Mike Ryan indicated, it would be better still as /^(?:A(?:C|.E|..G))|(?:.C(?:E|.G))|(?:..EG)$/ - but that's still the same solution, just in a more efficiently-processed form.)
You've already answered how to do it with a regex, and noted its shortcomings and inability to scale, so I don't think there's any need to flog the dead horse. Instead, here's a way that'll work without the need for a regex:
function findMatchingLines($line) {
static $file = null;
if( !$file) $file = file("BigFileOfLines.txt");
$search = str_split($line);
foreach($file as $l) {
$test = str_split($l);
$matches = count(array_intersect($search,$test));
if( $matches > 2) // define number of matches required here - optionally make it an argument
return true;
}
// no matches
return false;
}
There are 6 possibilities that at least two characters match out of 4: MM.., M.M., M..M, .MM., .M.M, and ..MM ("M" meaning a match and "." meaning a non-match).
So, you need only to convert your input into a regex that matches any of those possibilities. For an input of ACFG, you would use this:
"/^(AC..|A.F.|A..G|.CF.|.C.G|..FG)$/m"
This, of course, is the conclusion you're already at--so good so far.
The key issue is that Regex isn't a language for comparing two strings, it's a language for comparing a string to a pattern. Thus, either your comparison string must be part of the pattern (which you've already found), or it must be part of the input. The latter method would allow you to use a general-purpose match, but does require you to mangle your input.
function findMatchingLines($line, $subject) {
$regex = "/(?<=^([AB])([CD])([EF])([GH])[.\n]+)"
+ "(\1\2..|\1.\3.|\1..\4|.\2\3.|.\2.\4|..\3\4)/m";
$matchingLines = array();
preg_match_all($regex, $line + "\n" + $subject, $matchingLines);
return $matchingLines;
}
What this function does is pre-pend your input string with the line you want to match against, then uses a pattern that compares each line after the first line (that's the + after [.\n] working) back to the first line's 4 characters.
If you also want to validate those matching lines against the "rules", just replace the . in each pattern to the appropriate character class (\1\2[EF][GH], etc.).
People may be confused by your first regex. You give:
"/^[A|B]{1}|[C|D]{1}|[E|F]{1}|[G|H]{1}$/m"
And then say:
In english, each line has either A or B at position 0, either C or D at position 1, either E or F at position 2, and either G or H at position 3. I can assume that each line will be exactly 4 characters long.
But that's not what that regex means at all.
This is because the | operator has the highest precedence here. So, what that regex really says, in English, is: Either A or | or B in the first position, OR C or | or D in the first position, OR E or | or F in the first position, OR G or '|orH` in the first position.
This is because [A|B] means a character class with one of the three given characters (including the |. And because {1} means one character (it is also completely superfluous and could be dropped), and because the outer | alternate between everything around it. In my English expression above each capitalized OR stands for one of your alternating |'s. (And I started counting positions at 1, not 0 -- I didn't feel like typing the 0th position.)
To get your English description as a regex, you would want:
/^[AB][CD][EF][GH]$/
The regex will go through and check the first position for A or B (in the character class), then check C or D in the next position, etc.
--
EDIT:
You want to test for only two of these four characters matching.
Very Strictly speaking, and picking up from #Mark Reed's answer, the fastest regex (after it's been parsed) is likely to be:
/^(A(C|.E|..G))|(.C(E)|(.G))|(..EG)$/
as compared to:
/^(AC|A.E|A..G|.CE|.C.G|..EG)$/
This is because of how the regex implementation steps through text. You first test if A is in the first position. If that succeeds, then you test the sub-cases. If that fails, then you're done with all those possible cases (or which there are 3). If you don't yet have a match, you then test if C is in the 2nd position. If that succeeds, then you test for the two subcases. And if none of those succeed, you test, `EG in the 3rd and 4th positions.
This regex is specifically created to fail as fast as possible. Listing each case out separately, means to fail, you would have test 6 different cases (each of the six alternatives), instead of 3 cases (at a minimum). And in cases of A not being the first position, you would immediately go to test the 2nd position, without hitting it two more times. Etc.
(Note that I don't know exactly how PHP compiles regex's -- it's possible that they compile to the same internal representation, though I suspect not.)
--
EDIT: On additional point. Fastest regex is a somewhat ambiguous term. Fastest to fail? Fastest to succeed? And given what possible range of sample data of succeeding and failing rows? All of these would have to be clarified to really determine what criteria you mean by fastest.
Here's something that uses Levenshtein distance instead of regex and should be extensible enough for your requirements:
$lines = array_map('rtrim', file('file.txt')); // load file into array removing \n
$common = 2; // number of common characters required
$match = 'ACFG'; // string to match
$matchingLines = array_filter($lines, function ($line) use ($common, $match) {
// error checking here if necessary - $line and $match must be same length
return (levenshtein($line, $match) <= (strlen($line) - $common));
});
var_dump($matchingLines);
I bookmarked the question yesterday in the evening to post an answer today, but seems that I'm a little late ^^ Here is my solution anyways:
/^[^ACFG]*+(?:[ACFG][^ACFG]*+){2}$/m
It looks for two occurrences of one of the ACFG characters surrounded by any other characters. The loop is unrolled and uses possessive quantifiers, to improve performance a bit.
Can be generated using:
function getRegexMatchingNCharactersOfLine($line, $num) {
return "/^[^$line]*+(?:[$line][^$line]*+){$num}$/m";
}
This is a homework assignment and my first experience using RegEx. I am starting to grasp the syntax and symbols used and can do some simple pattern matching/manipulation, but can't quite foresee how to achieve some of the goals of this assignment.
I have been given a text file that is formatted like this:
Steve Blenheim:238-923-7366:95 Latham Lane, Easton, PA 83755:11/12/56:20300
Betty Boop:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23:14500
Igor Chevsky:385-375-8395:3567 Populus Place, Caldwell, NJ 23875:6/18/68:23400
Norma Corder:397-857-2735:74 Pine Street, Dearborn, MI 23874:3/28/45:245700
There are about 50 lines of names and corresponding info, each entry is on a new line and each 'field' is separated by a colon. Mostly I need to find specific things from the file and print them on a webpage but I don't quite understand.
Here is one problem I solved:
$myFile = "datebook.txt";
$data = file($myFile);//I have used this to place all data in an array, but it may be necessary to place the data into a string?
//1) Print all lines containing the pattern Street (case insensitive).
$pattern = "/street/i";
$linesFound = preg_grep($pattern, $data);
echo "<pre>", print_r($linesFound, true), "</pre>";
Here are some I have not and specific questions regarding them:
2) Print the first and last names in which the first name starts with a letter ‘B’.
How do I only search for first names and not last names, city names, etc?
How do I print the full name and only the full name?
5) Print Lori Gortz’s name and address.
I understand how to find the pattern 'Lori Gortz' but how do I return her address as well?
11) Print lines that end in exactly five digits.
12) Print the file with the first and last names reversed.
14) Give everyone a $250.00 raise.
Don't know how to do any of these. I assume the last number for each entry is their salary.
Any help is appreciated. Please respond with an explanation of the code as well, thank you.
Check the RegEx quick reference, I think you'll figure out most of your tasks there. For example, Lori's address would be a string after the number after the second colon and before the second coma (in her line, of course).
The best way to do all the task would be to go over each line and make an array with all the elements. That way you could easy replace names, increase salaries, check if it ends with 5 digits, etc.
You can also try this online tester. Good luck.
Edit:
Little help for a start:
^[A-z ]* this gets full names
^[A-z]* this gets first names
etc...
Edit2:
See what this code does:
$line = "Betty Boop:245-836-8357:635 Cutesy Lane, Hollywood, CA 91464:6/23/23:14500";
$regex = "/\s|:/";
$result = preg_split($regex, $line);
:)
I don't want to do all of them, but here's some hints.. For question 2:
^[A-Z]* B.*$
^ basically means a new line.
[A-Z]* means any number of characters from A-Z
Next we match a space
Next we match a B
The .* means any number of other characters.
Lastly, we match with an end of line using $
This can definitely be improved and made more flexible, but I'll let you do that..