How to emulate lookbehind within a regex with non-fixed width

I'm running into a problem with a regex: I need to parse PHP source files (classes especially), find the constants defined in them, and output them.
Those constants may be documented in comment blocks, which is why I gave up on Reflection: retrieving constants via Reflection only returns their names and values.
I managed to build the two separate parts of the regex (one for the comment block, the other for the const declaration), but I can't link them successfully: the match for the very first constant in the file also swallows everything declared before it, back to the very first comment block.
My regex is as follows (I'm no regex god, so feel free to criticize):
((\t\ )*(/\*+(.|\n)*\*/)\R+)?([\t| ]*(?|(public|protected|private)\s*)?const\s+([a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*)\s*=\s*(.*);)
There goes the sample test : Regex101
In case the initial code disappears :
/**
*
*/
class Test {
/**
*
*/
public const LOL = "damn";
/**
*
*/
private const TEST = 5;
public const plop = "dong";
}
I looked here and there for tips and learnt about positive lookbehind, but from what I understood, it only works with fixed-width patterns.
I'm running out of ideas.

I would favor a multi-step approach: separate every class, then look for comments (if any) and for the constants. In terms of regex, the first step can be achieved via
class\h*(?P<classname>\w+)[^{}]* # look for class literally and capture the name
(\{
(?:[^{}]*|(?2))* # the whole block matches the class content
\})
See a demo on regex101.com.
Now, to the comments and constants
^\h*
(?:(?P<comment>\Q/*\E(?s:.*?)\Q*/\E)(?s:.*?))?
(?:public|protected|private)\h*const\h*
(?P<key>\w+)\h*=\h*(?P<value>[^;]+)
See a demo for this step on regex101.com as well.
The last step would be to clean the comments:
^\h*/?\*+\h*/?
See a demo for the cleansing on regex101.com.
Lastly, you'll need two loops:
preg_match_all($regex_class, $source, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
    preg_match_all($const_class, $match[0], $constants, PREG_SET_ORDER);
    foreach ($constants as $constant) {
        $comment = preg_replace($clean_comment, '', $constant["comment"]);
        # find the actual values here
        echo "Class: {$match["classname"]}, Constant Name: {$constant["key"]}, Constant Value: {$constant["value"]}, Comment: $comment\n";
    }
}
An overall demo can be found on ideone.com.
Mind the individual regex modifiers in the demo and source code (especially verbose and multiline!).
You can do it in an array as well:
$result = [];
preg_match_all($regex_class, $source, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
    preg_match_all($const_class, $match[0], $constants, PREG_SET_ORDER);
    foreach ($constants as $constant) {
        $comment = trim(preg_replace($clean_comment, '', $constant["comment"]));
        $result[$match["classname"]][] = array('name' => $constant["key"], 'value' => $constant['value'], 'comment' => $comment);
    }
}
print_r($result);
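The loops above rely on $regex_class, $const_class and $clean_comment, which are only defined in the linked demo. As a rough, self-contained sketch, the three patterns from this answer could be wired up as follows. The delimiters, the m modifier, the sample docblock texts and the extra protected alternative (which the question's own pattern accepts) are my assumptions, not taken from the demo:

```php
<?php
// Sample input modeled on the question's class; the docblock texts are made up.
$source = <<<'SRC'
/**
 * Header comment (must not leak into the first constant)
 */
class Test {
    /**
     * Greeting
     */
    public const LOL = "damn";
    /**
     * A number
     */
    private const TEST = 5;
    public const plop = "dong";
}
SRC;

// Step 1: class name plus the balanced {...} body; (?2) recurses into group 2.
$regex_class = '~class\h*(?P<classname>\w+)[^{}]*(\{(?:[^{}]*|(?2))*\})~';

// Step 2: optional comment, then the constant. "m" lets ^ match at every line start.
$const_class = '~^\h*(?:(?P<comment>\Q/*\E(?s:.*?)\Q*/\E)(?s:.*?))?'
             . '(?:public|protected|private)\h*const\h*'
             . '(?P<key>\w+)\h*=\h*(?P<value>[^;]+)~m';

// Step 3: strip the leading "/**", " * " and " */" decorations line by line.
$clean_comment = '~^\h*/?\*+\h*/?~m';

$result = [];
preg_match_all($regex_class, $source, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
    // $match[0] starts at "class", so the header comment is never seen here.
    preg_match_all($const_class, $match[0], $constants, PREG_SET_ORDER);
    foreach ($constants as $constant) {
        $comment = trim(preg_replace($clean_comment, '', $constant['comment']));
        $result[$match['classname']][] = [
            'name'    => $constant['key'],
            'value'   => $constant['value'],
            'comment' => $comment,
        ];
    }
}
print_r($result);
```

Because the class is cut out first, no lookbehind is needed: the comment is simply an optional prefix of each constant match. Note that a class body containing braces inside strings or comments would confuse the first pattern.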

You can do it without positive lookbehind:
You have to match a comment, immediately followed by a const declaration:
(?:(?:^/\*\*$\s+)((?:^ ?\*.*$\s*?)+)(?:\s+^\*/$\s+))?^\s+(public|protected|private) const (\S+)\s+= ([^;]+);
The first group will allow you to retrieve the documentation:
Comment part
(?:^/\*\*$\s+) finds the beginning of a block comment
((?:^ ?\*.*$\s*?)+) represents the group containing the content of your comments
(?:\s+^\*/$\s+) end of the comment
Declaration part:
^\s+ to skip the whitespace at the beginning of the line
(public|protected|private) const a group to determine visibility
(\S+)\s+= ([^;]+); groups for the name and value

Related

preg_match_all multine code scanning matches with a first word match

First time posting a question here. I am currently writing a tool to reverse-engineer PHP 7 code in order to create UML class diagrams, and for that purpose I'm using preg_match_all to extract sections of code from the source files. So far so good, but I must admit I don't yet fully understand how regular expressions work. Still, I was able to create complex patterns, but this one beats me.
What I want is to match the use clause in class bodies in order to get the trait names. It can be in one of the following formats (as described in https://www.php.net/manual/en/language.oop5.traits.php):
1)
use trait;
use othertrait;
2)
use trait, othertrait, someothertrait;
or
use trait, othertrait, someothertrait { "conflict_resolutions" }
I don't care about conflict resolutions yet so I can drop these.
so far I have the following regex pattern:
class usetrait_finder {
use finder;
function __construct( string $source ){
$this->source = $source;
$this->pattern = "/";
$this->pattern .= "(?:(use\s+|,)\s*(?<traitname>[a-zA-Z0-9_]*))";
$this->pattern .= "";
$this->pattern .= "/ms";
$this->matches($source);
}
function get_trait_name(): string {
return $this->matches["traitname"][$this->current_key];
}
}
which matches the mentioned cases, but I know it's a cheat because the word "use" must appear at least once, at the start. I wrote a PHPUnit test to check every normal case, and the following test doesn't pass:
// tag "use" must be present at least once
function test_invalid_source_2(){
$source = "function sarasa(
sometrait
,
someothertrait )
{ anything }
function test() {}
}";
$finder = new usetrait_finder( $source );
var_dump( $finder->matches($source)[0] );
$this->assertEquals( false, $finder->more_elements() );
}
the var_dump output is:
array(1) {
[0]=>
string(16) ",
someothertrait"
}
My expected output is an empty string or null, since the word "use" is not the first thing matched on the first line. Of course the other tests must still pass, and each entry in $matches["traitname"] should hold a single trait name.
usetrait_finder is here:
https://github.com/rudymartinb/classtree/blob/master/src/usetrait_finder.php
"finder" trait section of above source is here (not really important but it won't hurt mentioning):
https://github.com/rudymartinb/classtree/blob/master/src/traits/finder.php
full test case is here: https://github.com/rudymartinb/classtree/blob/master/tests/usetrait_finder_Test.php
thank you in advance

Calling a variable that is nowhere else

In the following code, from Beginning PHP and MySQL 5e, the acronym function is never called with $matches explicitly, and $matches is never assigned anywhere, yet the definition of acronym uses it in isset($acronyms[$matches[1]]). How does isset know what $matches is in the first place?
The code is below, and I have tested that it works.
I just cannot follow how an arbitrary name such as $matches gets its value.
// This function will add the acronym's long form
// directly after any acronyms found in $matches
function acronym($matches) {
$acronyms = array(
'WWW' => 'World Wide Web',
'IRS' => 'Internal Revenue Service',
'PDF' => 'Portable Document Format');
if (isset($acronyms[$matches[1]]))
return $acronyms[$matches[1]] . " (" . $matches[1] . ")";
else
return $matches[1];
}
// The target text
$text = "The <acronym>IRS</acronym> offers tax forms in
<acronym>PDF</acronym> format on the <acronym>WWW</acronym>.";
// Add the acronyms' long forms to the target text
$newtext = preg_replace_callback("/<acronym>(.*)<\/acronym>/U", 'acronym',
$text);
print_r($newtext);
The output is:
The Internal Revenue Service (IRS) offers tax forms inPortable Document Format (PDF) format on the World Wide Web (WWW).
Reminder: The input, for the function preg_replace_callback is:
The <acronym>IRS</acronym> offers tax forms in <acronym>PDF</acronym> format on the <acronym>WWW</acronym>.

preg_replace_callback() is designed to call your function with a well-defined argument. See the manual entry for this function:
A callback that will be called and passed an array of matched elements in the subject string. The callback should return the replacement string. This is the callback signature:
handler ( array $matches ) : string
So your function acronym() will get an array with the matches from the regex. Keep in mind that you are not calling the acronym() function by yourself, the function preg_replace_callback() does that for you (with the argument defined in the documentation).
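To make this visible, here is a small sketch that records every array preg_replace_callback() hands to the callback. The pattern, the input string and the $calls variable are made up for illustration:

```php
<?php
$calls = [];

$result = preg_replace_callback(
    '/<b>(.*?)<\/b>/',
    // PHP calls this closure once per match and supplies $matches itself.
    function (array $matches) use (&$calls): string {
        $calls[] = $matches;             // [0] = whole match, [1] = first group
        return strtoupper($matches[1]);  // the return value replaces the match
    },
    'one <b>two</b> three <b>four</b>'
);

echo $result, "\n"; // one TWO three FOUR
print_r($calls);    // two arrays: ['<b>two</b>', 'two'] and ['<b>four</b>', 'four']
```

The parameter name is arbitrary; calling it $matches is only a convention. What matters is that the callback accepts one array parameter.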

Using preg_replace_callback to identify and manipulate latex code

I have latex + html code somewhere in the following form:
...some text1.... \[latex-code1\]....some text2....\[latex-code2\]....etc
Firstly I want to obtain the latex codes in an array codes[] to be able to send them to a server for rendering, so that
code[0]=latex-code1, code[1]=latex-code2, etc
Secondly, I want to modify this text so that it looks like:
...some text1.... <img src="root/1.png">....some text2....<img src="root/2.png">....etc
i.e, the i-th latex code fragment is replaced by the link to the i-th rendered image.
I have been trying to do this with preg_replace_callback and preg_match_all but being new to PHP haven't been able to make it work. Please advise.

If you're looking for codez:
$html = '...some text1.... \[latex-code1\]....some text2....\[latex-code2\]....etc';
$codes = array();
$count = 0;
$replace = function($matches) use (&$codes, &$count) {
list(, $codes[]) = $matches;
return sprintf('<img src="root/%d.png">', ++$count);
};
$changed = preg_replace_callback('~\\\\\\[(.+?)\\\\\\]~', $replace, $html);
echo "Original: $html\n";
echo "Changed : $changed\n\nLatex Codes: ", print_r($codes, 1), "Count: ", $count;
I don't know which part gave you problems. If it's the regex pattern: the characters in your markers need heavy escaping, once for PHP and once for PCRE, which is why there are so many backslashes.
Another tricky part is the callback function, because it needs to collect the codes as well as keep a counter. The example does this with an anonymous function that takes variable references in its use clause, which makes $codes and $count available (and writable) inside the callback.

Need to extract special tags and replace them based upon their contents using regular expression

I'm working on a simple templating system. Basically I'm setting it up such that a user would enter text populated with special tags of the form: <== variableName ==>
When the system would display the text it would search for all tags of the form mentioned and replace the variableName with its corresponding value from a database result.
I think this would require a regular expression but I'm really messed up in REGEX here. I'm using php btw.
Thanks for the help guys.

A rather quick and dirty hack here:
<?php
$teststring = "Hello <== tag ==>";
$values = array();
$values['tag'] = "world";
function replaceTag($name)
{
global $values;
return $values[$name];
}
echo preg_replace('/<== ([a-z]*) ==>/e','replaceTag(\'$1\')',$teststring);
Output:
Hello world
Simply place your 'variables' in the variable array and they will be replaced.
The e modifier tells the regular expression engine to eval the replacement; [a-z] lets you name the "variables" using the characters a-z (you could use [a-z0-9] if you wanted to include digits). Other than that it's pretty much standard PHP.
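The e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so on current versions the same idea has to go through preg_replace_callback(). A minimal sketch, where renderTemplate is a hypothetical helper name and leaving unknown tags untouched is my own choice rather than part of the answer:

```php
<?php
// Hypothetical helper: replace every "<== name ==>" tag with $values['name'].
function renderTemplate(string $template, array $values): string
{
    $rendered = preg_replace_callback(
        '/<== ([a-z]*) ==>/',
        function (array $m) use ($values): string {
            // Assumption: tags without a value are left as they are.
            return array_key_exists($m[1], $values) ? (string) $values[$m[1]] : $m[0];
        },
        $template
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $rendered ?? $template;
}

echo renderTemplate('Hello <== tag ==>', ['tag' => 'world']), "\n"; // Hello world
```

Passing $values through the closure's use clause also removes the need for the global variable.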

Very useful - Pointed me to what I was looking for...
Replacing tags in a template e.g.
<<page_title>>, <<meta_description>>
with corresponding request variables, e.g.
$_REQUEST['page_title'], $_REQUEST['meta_description'],
using a modified version of the code posted:
$html_output=preg_replace('/<<(\w+)>>/e', '$_REQUEST[\'$1\']', $template);
Easy to change this to replace template tags with values from a DB etc...

If you are doing a simple replace, then you don't need to use a regexp. You can just use str_replace() which is quicker.
(I'm assuming your '<== ' and ' ==>' are delimiting your template var and are replaced with your value?)
$subject = str_replace('<== '.$varName.' ==>', $varValue, $subject);
And to cycle through all your template vars...
$tplVars = array();
$tplVars['ONE'] = 'This is One';
$tplVars['TWO'] = 'This is Two';
// etc.
// $subject is your original document
foreach ($tplVars as $varName => $varValue) {
$subject = str_replace('<== '.$varName.' ==>', $varValue, $subject);
}
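As a possible variation on the loop above, strtr() can do all replacements in a single pass once the tags are collected into one array. The sample $subject is made up, and the tag format is the one assumed in this answer:

```php
<?php
$tplVars = [
    'ONE' => 'This is One',
    'TWO' => 'This is Two',
];

// $subject stands in for your original document.
$subject = 'First: <== ONE ==>, second: <== TWO ==>';

// Build a map from the full tag text to its replacement value.
$pairs = [];
foreach ($tplVars as $varName => $varValue) {
    $pairs['<== ' . $varName . ' ==>'] = $varValue;
}

$output = strtr($subject, $pairs);
echo $output, "\n"; // First: This is One, second: This is Two
```

Unlike repeated str_replace() calls, strtr() does not rescan text it has already replaced, so a value that happens to contain another tag is not expanded a second time.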

Need a regex to add css class to first and last list item

UPDATE:
Thank you all for your input. Some additional information.
It's really just a small chunk of markup (20 lines) I'm working with, and I had aimed to leverage a regex to do the work.
I also do have the ability to hack up the script (an ecommerce one) to insert the classes as the navigation is built. I wanted to limit the number of hacks I have in place to keep things easier on myself when I go to update to the latest version of the software.
With that said, I'm pretty aware of my situation and the various options available to me. The first part of my regex works as expected. I posted really more or less to see if someone would say, "hey dummy, this is easy just change this....."
After coming close with a few of my efforts, it's more of the principle at this point. To just know (and learn) a solution exists for this problem. I also hate being beaten by a piece of code.
ORIGINAL:
I'm trying to leverage regular expressions to add a CSS class to the first and last list items within an ordered list. I've tried a bunch of different ways but can't produce the results I'm looking for.
I've got a regular expression for the first list item but can't seem to figure a correct one out for the last. Here is what I'm working with:
$patterns = array('/<ul+([^<]*)<li/m', '/<([^<]*)(?<=<li)(.*)<\/ul>/s');
$replace = array('<ul$1<li class="first"','<li class="last"$2$3</ul>');
$navigation = preg_replace($patterns, $replace, $navigation);
Any help would be greatly appreciated.

Jamie Zawinski would have something to say about this...
Do you have a proper HTML parser? I don't know if there's anything like hpricot available for PHP, but that's the right way to deal with it. You could at least employ hpricot to do the first cleanup for you.
If you're actually generating the HTML -- do it there. It looks like you want to generate some navigation and have a .first and .last kind of thing on it. Take a step back and try that.

+1 to generating the right html as the best option.
But a completely different approach, which may or may not be acceptable to you: you could use javascript.
This uses jquery to make it easy ...
$(document).ready(
    function() {
        // :first-child / :last-child must be applied to the li elements, not to the ul
        $('#id-of-ul > li:first-child').addClass('first');
        $('#id-of-ul > li:last-child').addClass('last');
    }
);
As I say, this may or may not be of any use in this case, but I think it's a valid solution to the problem in some cases.
PS: You say ordered list, then give ul in your example. ol = ordered list, ul = unordered list

You wrote:
$patterns = array('/<ul+([^<]*)<li/m','/<([^<]*)(?<=<li)(.*)<\/ul>/s');
First pattern:
ul+ => you search something like ullll...
The m modifier is useless here, since you don't use ^ nor $.
Second pattern:
Using .* along with s is "dangerous", because you might select the whole document up to the last /ul of the page...
And well, I would just drop s modifier and use: (<li\s)(.*?</li>\s*</ul>) with replace: '$1class="last" $2'
In view of above remarks, I would write the first expression: <ul.*?>\s*<li
Although I am tired of seeing the Jamie Zawinski quote each time there is a regex question, Dustin is right in pointing you to an HTML parser (or just generating the right HTML from the start!): regexes and HTML don't mix well, because HTML syntax is complex, and unless you act on well-known machine-generated output with very predictable results, something is likely to break in some cases.

I don't know if anyone cares any longer, but I have a solution that works in my simple test case (and I believe it should work in the general case).
First, let me point out two things: while PhiLho is right that the s modifier is "dangerous", since dots may match everything up to the end of the document, this may very well be what you want. It only becomes a problem with pages that are not well formed. Be careful with any such regex on large, manually written pages.
Second, PHP gives backslashes a special meaning even in single-quoted strings. Most regexes work either way, but you should always double-escape them, just in case.
Now, here's my code:
<?php
$navigation='<ul>
<li>Coffee</li>
<li>Tea</li>
<li>Milk</li>
<li>Beer</li>
<li>Water</li>
</ul>';
$patterns = array('/<ul.*?>\\s*<li/',
'/<li((.(?<!<li))*?<\\/ul>)/s');
$replace = array('$0 class="first"',
'<li class="last"$1');
$navigation = preg_replace($patterns, $replace, $navigation);
echo $navigation;
?>
This will output
<ul>
<li class="first">Coffee</li>
<li>Tea</li>
<li>Milk</li>
<li>Beer</li>
<li class="last">Water</li>
</ul>
This assumes no line feeds inside the opening <ul...> tag. If there are any, use the s modifier on the first expression too.
The magic happens in (.(?<!<li))*?. This matches any character (the dot) that does not complete the string <li (the negative lookbehind is checked after the dot has matched), repeated any number of times (the *) in a non-greedy fashion (the ?).
Of course, the whole thing would have to be expanded if there is a chance the list items already have the class attribute set. Also, if there is only one list item, it will match twice, giving it two such attributes. At least for xhtml, this would break validation.

You could load the navigation in a SimpleXML object and work with that. This prevents you from breaking your markup with some crazy regex :)
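A rough sketch of what that could look like, assuming the navigation markup is well-formed XML with a single root ul (real-world HTML often is not, in which case this fails to load). addClass is a hypothetical helper, and the sample list is made up:

```php
<?php
// Hypothetical helper: append a class name to an element's class attribute.
function addClass(SimpleXMLElement $el, string $class): void
{
    $existing = isset($el['class']) ? trim((string) $el['class']) : '';
    $el['class'] = $existing === '' ? $class : $existing . ' ' . $class;
}

$navigation = '<ul>
<li>Coffee</li>
<li class="hot">Tea</li>
<li>Water</li>
</ul>';

$ul = simplexml_load_string($navigation);
if ($ul === false) {
    throw new RuntimeException('Navigation markup is not well-formed XML');
}

// Only the direct li children of the root ul are counted.
$count = count($ul->li);
if ($count > 0) {
    addClass($ul->li[0], 'first');
    addClass($ul->li[$count - 1], 'last');
}

// Print each list item with its (possibly updated) attributes.
foreach ($ul->li as $li) {
    echo $li->asXML(), "\n";
}
```

With a single list item, the helper produces one class attribute containing both names instead of two attributes, and existing classes are kept.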

As a preface .. this is waaay over-complicating things in most use-cases. Please see other answers for more sanity :)
Here is a little PHP class I wrote to solve a similar problem. It adds 'first', 'last' and any other classes you want. It will handle li's with no "class" attribute as well as those that already have some class(es).
<?php
/**
* Modify list items in pre-rendered html.
*
* Usage Example:
* $replaced_text = ListAlter::addClasses($original_html, array('cool', 'awsome'));
*/
class ListAlter {
private $classes = array();
private $classes_found = FALSE;
private $count = 0;
private $total = 0;
// No public instances.
private function __construct() {}
/**
* Adds 'first', 'last', and any extra classes you want.
*/
static function addClasses($html, $extra_classes = array()) {
$instance = new self();
$instance->classes = $extra_classes;
$total = preg_match_all('~<li([^>]*?)>~', $html, $matches);
$instance->total = $total ? $total : 0;
return preg_replace_callback('~<li([^>]*?)>~', array($instance, 'processListItem'), $html);
}
private function processListItem($matches) {
$this->count++;
$this->classes_found = FALSE;
$processed = preg_replace_callback('~(\w+)="(.*?)"~', array($this, 'appendClasses'), $matches[0]);
if (!$this->classes_found) {
$classes = $this->classes;
if ($this->count == 1) {
$classes[] = 'first';
}
if ($this->count == $this->total) {
$classes[] = 'last';
}
if (!empty($classes)) {
$processed = rtrim($matches[0], '>') . ' class="' . implode(' ', $classes) . '">';
}
}
return $processed;
}
private function appendClasses($matches) {
array_shift($matches);
list($name, $value) = $matches;
if ($name == 'class') {
$value = array_filter(explode(' ', $value));
$value = array_merge($value, $this->classes);
if ($this->count == 1) {
$value[] = 'first';
}
if ($this->count == $this->total) {
$value[] = 'last';
}
$value = implode(' ', $value);
$this->classes_found = TRUE;
}
return sprintf('%s="%s"', $name, $value);
}
}
