PHP: Regex and the caret

So I'm trying to make a regex to include the course contents of a text, but exclude the 3 digit numbers followed by a period and some text. Basically I'm trying to divide the course text into individual courselines, so that I end up with an array where every element has the courseinfo of one class.
For example, suppose we have:
$text = "295. Student-Initiated Courses or Projects. (1-2)
Open to students who have completed the first-year curriculum. Clinical work, field work, legal assistance, individual research and writing, writing or editing for profes¬sional journals, student-taught courses, or other legal projects of a serious, educational nature. Requires the approval of the Law 295 Administrator and the Dean.
296. Legal Dissertation. (8-13)"
and this giant regex:
$lineDelimiter = '/(?:[0-9]{3}(?:\.5|\-[1-5])?[A-Z]?)(?:\-[0-9]{3}(?:\.5|\-[0-9])? [A-Z]?)?\.\s*.+\.\s*(?:(?:\([0-9]+\-*[0-9]*\))(?:\s*or\s*\([0-9]+\-*[0-9]*\))?)?\s*(?:Prerequisite)?.+(?:\n.+)?\.\n?(?:\s*Mr\.\s.+,?|\s*Ms\.\s.+,?|\s*Dr\.\s.+,?|\s*The\sFaculty.*,?)*[^(?:[0-9]{3}\..+)]/';
The very last part of that giant regex, which consists of
'/[^(?:[0-9]{3}\..+)]/'
errors when I preg_match_all.
I'm trying to exclude the "296. Legal Dissertation. (8-13)" part so that it will be
"295. Student-Initiated Courses or Projects. (1-2)
Open to students who have completed the first-year curriculum. Clinical work, field work, legal assistance, individual research and writing, writing or editing for profes¬sional journals, student-taught courses, or other legal projects of a serious, educational nature. Requires the approval of the Law 295 Administrator and the Dean."

If you want to match everything except the last part with the number and the text you could try this:
'/([\s\S]+)(?=\d{3}\..+)/'
[\s\S]+ matches one or more of any character, whitespace or non-whitespace, so it also crosses line breaks.
(?=...) is a positive lookahead. It does not consume what is inside the parentheses; it only requires that the text matched so far is followed by it.
\d{3}\..+ matches three digits followed by a dot and at least one more character on the same line.
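A minimal sketch of the lookahead on a shortened copy of the sample text. As far as I can tell, the original [^(?:[0-9]{3}\..+)] fails because square brackets start a character class: the class ends at the first ], and the trailing ) is then left unmatched. The preg_split() variant at the end is my own assumption about the wider goal (one array element per course) and is not part of the answer above.

```php
<?php
// Shortened version of the sample text from the question.
$text = "295. Student-Initiated Courses or Projects. (1-2)\n"
      . "Open to students who have completed the first-year curriculum. "
      . "Requires the approval of the Law 295 Administrator and the Dean.\n"
      . "296. Legal Dissertation. (8-13)";

// Greedy [\s\S]+ takes everything, then backtracks until the lookahead
// sees three digits, a dot and more text on the same line.
preg_match('/([\s\S]+)(?=\d{3}\..+)/', $text, $m);
$firstCourse = rtrim($m[1]);

// Assumed alternative: split at every line break that is directly
// followed by "ddd. ", which yields one element per course.
$courses = preg_split('/\R(?=\d{3}\.\s)/', $text);
```

Note that "Law 295 Administrator" does not trigger a split, because the 295 there is not followed by a dot.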

Related

Preg_match is "ignoring" a capture group delimiter

We have thousands of structured filenames stored in our database, and unfortunately many hundreds have been manually altered to names that do not follow our naming convention. Using regex, I'm trying to match the correct file names in order to identify all the misnamed ones.
The files are all relative to a meeting agenda, and use the date, meeting type, Agenda Item#, and description in the name.
Our naming convention is yyyymmdd_aa[_bbb]_ccccc.pdf where:
yyyymmdd is a date (and may optionally use underscores such as yyyy_mm_dd)
aa is a 2-3 character Meeting Type code
bbb is an optional Agenda Item
ccccc is a freeform variable length description of the file (alphanumeric only)
Example filenames:
20200225_RM_agenda.pdf
20200225_RM_2_memo.pdf
20200225_SS1_3c_presenTATION.pdf
20200225_CA_4d_SiGnEd.pdf
20200225_RM_5_Order1234.pdf
2021_02_25_EV_Notice.pdf
The regex I'm using to match these files is below (regex demo):
/^(\d{4}[_]?\d{2}[_]?\d{2})_(\w{2,3})_([a-z0-9]{1,3})_?(.+)?.pdf/i
The Problem:
In general, it's working fine, BUT if the Agenda Number ("bbb") is NOT in the filename, the regex captures and returns the first 3 characters of the description. It seems to me that the 3rd capture group _([a-z0-9]{1,3})_ is saying 1-3 alphanumeric characters between underscores, but I don't know how to "force the underscore delimiters", or otherwise tell it that the group may not be there, and that it's now looking at the descriptive text. This can be seen in the demo code where the first and last filenames do not use an Agenda Number.
Any assistance is appreciated.
The ? quantifier applies only to the token immediately before it, which is either a single character or a group. So the expression ([a-z0-9]{1,3})_? makes the underscore optional, but not the preceding group. The solution is to move the underscore inside the parentheses and make the whole group optional.
^(\d{4}[_]?\d{2}[_]?\d{2})_(\w{2,3})_([a-z0-9]{1,3}_)?(.+)?.pdf
Additionally, the [_]? can be simplified to just _?, file name periods should be escaped (otherwise they are a wildcard), and I personally like to name my groups using (?<name>) syntax. Putting that all together you get:
^(?<date>\d{4}_?\d{2}_?\d{2})_(?<meeting_type>\w{2,3})_(?<agenda>[a-z0-9]{1,3}_)?(?<description>.+)?\.pdf$
Demo here: https://regex101.com/r/BUKCih/1
Updated:
I've made some updates based on the comments. I added $ to the end to force "end of filename" as @Chris Maurer said. This stops file.pdf.txt from getting through. I also made a sub-group and moved the name into that group, which allows the trailing underscore to not be included in the named-group. I'm going to leave Chris's other comment about tightening the last matching group alone, although I do agree with it, and the OP might find a couple of non-conforming files if they use [a-z0-9]+ or similar. I don't remember off-hand if PHP supports POSIX but if so [:alnum:] could be used too.
^(?<date>\d{4}_?\d{2}_?\d{2})_(?<meeting_type>\w{2,3})_((?<agenda>[a-z0-9]{1,3})_)?(?<description>.+)?\.pdf$
Updated demo here: https://regex101.com/r/ebmxkF/1
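A short sketch of how the final pattern could be used from PHP, with the i flag from the question added. The helper name parseAgendaFilename() and the returned array layout are hypothetical, chosen only for illustration.

```php
<?php
// Final pattern from the answer, with the case-insensitive flag from the question.
$pattern = '/^(?<date>\d{4}_?\d{2}_?\d{2})_(?<meeting_type>\w{2,3})_((?<agenda>[a-z0-9]{1,3})_)?(?<description>.+)?\.pdf$/i';

// parseAgendaFilename() is a hypothetical helper, not part of the original answer.
function parseAgendaFilename(string $pattern, string $filename): ?array
{
    if (!preg_match($pattern, $filename, $m)) {
        return null; // the name does not follow the convention
    }
    $agenda = $m['agenda'] ?? '';
    return [
        'date'         => $m['date'],
        'meeting_type' => $m['meeting_type'],
        // An optional group that did not take part in the match is empty or missing.
        'agenda'       => $agenda !== '' ? $agenda : null,
        'description'  => $m['description'] ?? '',
    ];
}

$withAgenda    = parseAgendaFilename($pattern, '20200225_SS1_3c_presenTATION.pdf');
$withoutAgenda = parseAgendaFilename($pattern, '20200225_RM_agenda.pdf');
$rejected      = parseAgendaFilename($pattern, '20200225_RM_agenda.pdf.txt');
```

Filenames for which the helper returns null are the candidates for the misnamed-file list.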

How to Retrieve Overlapping Matches with Complex Regex and Preg_Match_All in PHP

Have read the following which have some overlap (pun intended!) with the issue I am facing:
preg_match_all how to get *all* combinations? Even overlapping ones
Overlapping matches with preg_match_all and pattern ending with repeated character
However, I don’t really know how to apply their answers to my issue which is a little more complicated.
My regex that I use with preg_match_all():
/.{240}[^\[]Order[^ ][^\(].{9}/u
With the following string:
56A.  Subject to the provisions of this Act, any decision of the Court or the Appeal Board shall be final and conclusive, and no decision or order of the Court or the Appeal Board shall be challenged, appealed against, reviewed, quashed or called into question in any court and shall not be subject to any Quashing Order, Prohibiting Order, Mandatory Order or injunction in any court on any account.[20/99; 42/2005]
I intended it to match exactly 3 times. The first match has “Quashing Order” 9 characters before the end. The second match has “Prohibiting Order” 9 characters before the end. The third match has “Mandatory Order” 9 characters before the end.
However, as expected it’s only matching the first one, as the expected matches are overlapping.
I applied what I read in the other posts, I tried this:
(?=(.{240}[^\[]Order[^ ][^\(].{9}))
I still don’t get what I need.
How do I solve this?
You can use
\w+\s+Order\b
See the regex demo.
Regex details
\w+ - one or more word chars
\s+ - 1 or more whitespaces
Order\b - a whole word Order, as \b is a word boundary.
You will need to use a positive look-behind assertion for .{240}, just like the answer you found suggests using a positive look-ahead assertion for .{9}:
/(?<=.{240})[^\[]Order[^ ][^\(](?=.{9})/u
This regex matches your string only twice because of [^ ], as @bobblebubble said: "Mandatory Order" is followed by a space. Adjust that part as necessary.
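To illustrate why overlapping matches get lost, here is a small sketch with a shortened string and a 12-character window on each side instead of 240 and 9. The window sizes and the sample string are assumptions for the example only. Wrapping the whole pattern in a capturing lookahead means nothing is consumed, so every start position is tried.

```php
<?php
$s = 'subject to any Quashing Order, Prohibiting Order, Mandatory Order or injunction in any court';

// Plain pattern: each match consumes its context, so the middle "Order"
// no longer has 12 unconsumed characters in front of it and is skipped.
preg_match_all('/.{12}Order.{12}/', $s, $plain);

// Same pattern inside a capturing lookahead: the match itself is empty,
// and the overlapping windows are collected in capture group 1.
preg_match_all('/(?=(.{12}Order.{12}))/', $s, $overlap);

$plainCount   = count($plain[0]);
$overlapCount = count($overlap[1]);
```

With this sample, $plain[0] holds two windows and $overlap[1] holds three.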

Convert text in specific format into real PHP code assignments

I'm having trouble converting text in a specific format into real, working PHP code.
My text file:
#T1:The German sociologist Max Weber once proposed
#S:Jos Bleau
#C:jos.bleau#domain.com
#L:"He used to be so conservative," she says, throwing up her hands in mock exasperation. "We used to have the worst arguments right here at this table. I was part of the first group of public city school teachers that struck to form a union, and Richard was very angry with me. He saw unions as corrupt. He was also very opposed to social security. He thought people could make much more money investing it on their own. Who knew that within 10 years he would become so idealistic
#R:At first, <#Ri>Stallman viewed these notices<#$p> with alarm. Rare was the software program that didn't borrow source code from past programs, and yet, with a single stroke of the president's pen, Congress had given programmers and companies the power to assert individual authorship over communally built programs. It also injected a dose of formality into what had otherwise been an informal system.
The AI Lab of the 1970s was by all accounts a special place. Cutting-edge projects and top-flight researchers gave it an esteemed position in the world of computer science. The internal hacker culture and its anarchic policies lent a rebellious mystique as well. Only later, when many of the lab's scientists and software superstars had departed, would hackers fully realize the unique and ephemeral world they had once inhabited.
As a single parent for nearly a decade-she and Richard's father, Daniel Stallman, were married in 1948, divorced in 1958, and split custody of their son afterwards-Lippman can attest to her son's aversion to authority. She can also attest to her son's lust for knowledge. It was during the times when the two forces intertwined, Lippman says, that she and her son experienced their biggest battles.
#ST:Fusions
#R:Such mythological descriptions, while extreme, underline an important fact. The ninth floor of 545 Tech Square was more than a workplace for many. For hackers such as Stallman, it was home.
The belief in individual freedom over arbitrary authority extended to school as well. Two years ahead of his classmates by age 11, Stallman endured all the usual frustrations of a gifted public-school student. It wasn't long after the puzzle incident that his mother attended the first in what would become a long string of parent-teacher conferences.
#ST:Fusions
#R:The belief in individual freedom over arbitrary authority extended to school as well. Two years ahead of his classmates by age 11, Stallman endured all the usual frustrations of a gifted public-school student. It wasn't long after the puzzle incident that his mother attended the first in what would become a long string of parent-teacher conferences.
#BV:Thirty years later, Breidbart remembers
#CP:(Picture: Credit – Jos Bleau) or #CP:(Picture: Thanks)
The expected output I need (Half pseudo code; Unescaped quotes):
<?php
$title1 = 'The German sociologist Max Weber once proposed';
$signature = 'Jos Bleau';
$email = 'jos.bleau#domain.com';
$lead = '"He used to be so conservative," she says, throwing up her hands in mock exasperation. "We used to have the worst arguments right here at this table. I was part of the first group of public city school teachers that struck to form a union, and Richard was very angry with me. He saw unions as corrupt. He was also very opposed to social security. He thought people could make much more money investing it on their own. Who knew that within 10 years he would become so idealistic';
$text[] = 'At first, <#Ri>Stallman viewed these notices<#$p> with alarm. Rare was the software program that didn't borrow source code from past programs, and yet, with a single stroke of the president's pen, Congress had given programmers and companies the power to assert individual authorship over communally built programs. It also injected a dose of formality into what had otherwise been an informal system.
The AI Lab of the 1970s was by all accounts a special place. Cutting-edge projects and top-flight researchers gave it an esteemed position in the world of computer science. The internal hacker culture and its anarchic policies lent a rebellious mystique as well. Only later, when many of the lab's scientists and software superstars had departed, would hackers fully realize the unique and ephemeral world they had once inhabited.
As a single parent for nearly a decade-she and Richard's father, Daniel Stallman, were married in 1948, divorced in 1958, and split custody of their son afterwards-Lippman can attest to her son's aversion to authority. She can also attest to her son's lust for knowledge. It was during the times when the two forces intertwined, Lippman says, that she and her son experienced their biggest battles.';
$subtitle[] = 'Fusions';
//etc...
?>
Note:
The names like $title1 and #T1 are completely unrelated to each other and $title1 is just used as example. It could also be $xy or something else
If #XY appears more than once in the file then the values should be added as array element, else as simple assignment
I don't know whether preg_split() is the right direction. Can I do it with that, or do I have to use other functions to accomplish this?
Explanation
First we get the data from the text file into a variable with file_get_contents() and also initialize our $output array, where each element is a line in the output, with a php tag <?php.
You can also modify $lookup with shortcut => variable name elements, where you can define which #XY: gets replaced with which variable name. If not defined the shortcut will be used as variable name.
Now that we have prepared some stuff we match each #XY: with the corresponding data with preg_match_all().
Regular Expression
/#(\w+):(.*?)(?=#\w+:)/s
\w matches any word character [a-zA-Z0-9_], which is the XY part of #XY:, and we keep it with a capturing group
+ is a quantifier and says that \w should match 1 or more times
(.*?) matches any characters lazily, that is, as few as needed
With the s flag, . also matches newlines
(?=#\w+:) makes sure (.*?) stops right before the next #XY: and goes no further. ?= is a positive lookahead: it checks whether the regex in the parentheses (#\w+:) can be matched at that point, without consuming it
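A small sketch of the pattern on a made-up four-entry sample. As far as I can tell, the lookahead requires a following #XY:, so the very last entry in the data is not matched. The variant with |\z is my own assumption about how to include it, and is not part of the answer.

```php
<?php
// Made-up sample in the same format as the text file.
$sample = "#T1:Title text\n#R:First paragraph\nsecond line\n#ST:Fusions\n#R:Last paragraph\n";

// Pattern from the answer: every entry must be followed by another #XY:.
preg_match_all('/#(\w+):(.*?)(?=#\w+:)/s', $sample, $m);

// Assumed variant: also accept the end of the string (\z) as a terminator.
preg_match_all('/#(\w+):(.*?)(?=#\w+:|\z)/s', $sample, $all);

// How often each shortcut occurs decides between array and plain assignment.
$counts = array_count_values($all[1]);
$lastValue = trim($all[2][3]);
```

Here $m[1] holds T1, R and ST, while $all[1] also contains the final R.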
We also preemptively save the amount each shortcut appears in the data with array_count_values().
Now that we have matched all data which we want we can loop through all shortcuts, which are saved in $m[1]. In the foreach loop we simply check if you have defined a lookup variable name or if we use the shortcut as variable name.
Then we simply add each assignment as new element to the output array. Where you have to note three things:
Complex (curly) syntax is used, so that you don't get problems with invalid variable names, see: How can I access a property with an invalid name?
Depending on how many times a shortcut appeared in the data we decide if it should be added as array element or normal assignment. If the shortcut appears more than once in the data it will be adding the value as array element else as simple string assignment
We use trim() to remove spaces, new lines, ... from the start and end of the string. And we use addslashes(), so we don't get problems with quotes
Done. Depending on how you want to output the result, you can save it to a file with file_put_contents() or just print out the array.
Code
<?php
// Read the input and start the generated file with an opening PHP tag.
$text = file_get_contents("test.txt");
$output = ["<?php"];
$lookup = []; // Example: ["ST" => "subtitle"]

// $m[1] holds every shortcut, $m[2] the value that belongs to it.
preg_match_all("/#(\w+):(.*?)(?=#\w+:)/s", $text, $m);
// Count how often each shortcut occurs.
$variableShortcutCount = array_count_values($m[1]);

foreach ($m[1] as $key => $variableShortcut) {
    // Use the lookup name if one is defined, otherwise the shortcut itself.
    $name = isset($lookup[$variableShortcut]) ? $lookup[$variableShortcut] : $variableShortcut;
    // Shortcuts that occur more than once are appended to an array.
    $suffix = $variableShortcutCount[$variableShortcut] > 1 ? '"}[]' : '"}';
    $output[] = '${"' . $name . $suffix . " = '" . addslashes(trim($m[2][$key])) . "';";
}

// Output to file
//file_put_contents("output.txt", implode(PHP_EOL, $output));

// Output to browser
echo "<pre><code>";
highlight_string(implode(PHP_EOL, $output));
?>
output:
<?php
${"T1"} = 'The German sociologist Max Weber once proposed';
${"S"} = 'Jos Bleau';
${"C"} = 'jos.bleau#domain.com';
${"L"} = '\"He used to be so conservative,\" she says, throwing up her hands in mock exasperation. \"We used to have the worst arguments right here at this table. I was part of the first group of public city school teachers that struck to form a union, and Richard was very angry with me. He saw unions as corrupt. He was also very opposed to social security. He thought people could make much more money investing it on their own. Who knew that within 10 years he would become so idealistic';
${"R"}[] = 'At first, <#Ri>Stallman viewed these notices<#$p> with alarm. Rare was the software program that didn\'t borrow source code from past programs, and yet, with a single stroke of the president\'s pen, Congress had given programmers and companies the power to assert individual authorship over communally built programs. It also injected a dose of formality into what had otherwise been an informal system.
The AI Lab of the 1970s was by all accounts a special place. Cutting-edge projects and top-flight researchers gave it an esteemed position in the world of computer science. The internal hacker culture and its anarchic policies lent a rebellious mystique as well. Only later, when many of the lab\'s scientists and software superstars had departed, would hackers fully realize the unique and ephemeral world they had once inhabited.
As a single parent for nearly a decade-she and Richard\'s father, Daniel Stallman, were married in 1948, divorced in 1958, and split custody of their son afterwards-Lippman can attest to her son\'s aversion to authority. She can also attest to her son\'s lust for knowledge. It was during the times when the two forces intertwined, Lippman says, that she and her son experienced their biggest battles.';
${"subtitle"}[] = 'Fusions';
${"R"}[] = 'Such mythological descriptions, while extreme, underline an important fact. The ninth floor of 545 Tech Square was more than a workplace for many. For hackers such as Stallman, it was home.
The belief in individual freedom over arbitrary authority extended to school as well. Two years ahead of his classmates by age 11, Stallman endured all the usual frustrations of a gifted public-school student. It wasn\'t long after the puzzle incident that his mother attended the first in what would become a long string of parent-teacher conferences.';
${"subtitle"}[] = 'Fusions';
${"R"}[] = 'The belief in individual freedom over arbitrary authority extended to school as well. Two years ahead of his classmates by age 11, Stallman endured all the usual frustrations of a gifted public-school student. It wasn\'t long after the puzzle incident that his mother attended the first in what would become a long string of parent-teacher conferences.';
${"BV"} = 'Thirty years later, Breidbart remembers';
${"CP"} = '(Picture: Credit – Jos Bleau) or';

Is there a regex symbol to match one, the other, or both (if possible)?

I want to highlight a group of words, they can appear single or in a row. I'd like them to be highlighted together if they appear one after the other, and if they don't, they should also be highlighted, like the normal behavior. For instance, if I want to highlight the words:
results as
And the subject is:
real time results: shows results as you type
I'd like the result to be:
real time results: shows <span class="highlighted"> results as </span> you type
The whitespaces are also a headache, because I tried using an or expression:
( results )|( as )
with whitespaces to prevent highlighting words like bass, crash, and so on. But since the whitespace after results is the same as the whitespace before as, the regexp ignores it and only highlights results.
It can be used to highlight many words, so combinations such as
( (one) (two) )|( (two) (one) )|( one )|( two )
are not an option :(
Then I thought there might be an operator that works like | but matches both if possible, else one or the other.
Using spaces to ensure you match full words is the wrong approach. That's what word boundaries are for: \b matches a position between a word and a non-word character (where word characters usually are letters, digits and underscores). To match combinations of your desired words, you can simply put them all in an alternation (like you already do), and repeat as often as possible. Like so:
(?:\bresults\b\s*|\bas\b\s*)+
This assumes that you want to highlight the first and separate results in your example as well (which would satisfy your description of the problem).
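A sketch of how this could look in PHP. The helper name highlightWords() is hypothetical, and the pattern is a slight variant of the one above: it requires whitespace only between consecutive words, so the trailing space stays outside the span. That variation is my assumption, not something the answer specifies.

```php
<?php
// highlightWords() is a hypothetical helper for illustration.
function highlightWords(string $subject, array $words): string
{
    // Escape each word so that regex metacharacters in it are taken literally.
    $alternatives = implode('|', array_map(function ($w) {
        return preg_quote($w, '/');
    }, $words));

    // One whole word, optionally followed by more whole words separated by whitespace.
    $word    = '\b(?:' . $alternatives . ')\b';
    $pattern = '/' . $word . '(?:\s+' . $word . ')*/';

    return preg_replace($pattern, '<span class="highlighted">$0</span>', $subject);
}

$html = highlightWords('real time results: shows results as you type', ['results', 'as']);
```

The first, standalone "results" gets its own span, and "results as" is wrapped as one unit.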
Perhaps you do not need to match a string of words next to each other. Why not just apply your highlighting like so:
real time results: shows <span class="highlighted">results</span> <span class="highlighted">as</span> you type
The only real difference is that the space between the words is not highlighted, but it's a clean and easy compromise that will save you hours of work and doesn't seem to hurt the UX in the least (in my opinion).
In that case, you could just use alternation:
\b(results|as)\b
(\b being the word boundary anchor)
If you really don't like the space between words not being highlighted, you could write a jQuery function to find "highlighted" spans separated by only white space and then combine them (a "second stage" to achieve your UX design goals).
Update
(OK... so merging spans is actually kind of difficult via jQuery. See Find text between two tags/nodes)

Regex to detect word abbreviations

I'm currently working on a CSV that has information about Portugal's administrative areas and postal codes, but the file doesn't follow any strict format, which means sometimes there are entire strings in uppercase, along with other issues.
The issue I want to solve is as follows: some areas have an abbreviation at the end of the name, related to its parent's administrative level, that I want to remove. As far as I can see, these are the rules:
Abbreviations don't take more than 3 characters in length (always 3 characters so far);
The first character may be any letter, case insensitive;
The last 2 characters are always consonants (e.g. Z, B, M, P, ..);
(edit) the abbreviations always occur as the last word in a string;
(edit 2) - The strings are always UTF-8
The purpose is to remove these abbreviations from the area names.
Sounds simple enough..
/\b[a-z][ZBMP]{2}\b/i
This would match any abbreviation as described. Add letters to the second character class ([ZBMP]) to complete the match.
It only matches if the abbreviation is not part of another word (that's the job of the \b's).
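A hedged sketch of the removal step in PHP. It assumes the full consonant set (with y counted as a consonant), anchors the match to the end of the string because the question says the abbreviation is always the last word, and uses \p{L} with the u flag for the first letter since the strings are UTF-8. The helper name stripAbbreviation() and the sample names are made up.

```php
<?php
// stripAbbreviation() is a hypothetical helper; the sample names below are invented.
function stripAbbreviation(string $areaName): string
{
    // Whitespace, any letter, two consonants, then the end of the string.
    $pattern = '/\s+\p{L}[b-df-hj-np-tv-z]{2}\s*$/iu';

    // preg_replace() returns null on error (e.g. invalid UTF-8); keep the input then.
    return preg_replace($pattern, '', $areaName) ?? $areaName;
}

$cleaned   = stripAbbreviation('Example Town LSB');
$unchanged = stripAbbreviation('Example Town');
```

A real three-letter last word whose final two letters are consonants would also be removed, so the results are worth checking against the data.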
