Regex on File Names - php

I have a function called getContents(), which accepts a regex for the file names it finds.
I scan the js folder for javascript files, with the following two regex patterns:
$js['head'] = "/(\.head\.js\.php)|(\.head\.js)|(\.h.js)/";
$js['foot'] = "/(\.foot\.js\.php)|(\.foot\.js)|(\.f.js)|(\.js)^(\.head\.js)/";
I have a naming system whereby the file name determines where the JavaScript file gets loaded: in the <head> tag or in the footer of the HTML page. All files are loaded at the bottom of the page unless you specify otherwise (.head.js, for example).
A few days ago I noticed that the $js['foot'] array was also including .head.js files, causing them to be loaded twice. So I added ^(\.head\.js) and it worked! It stopped the .head.js files being added to the footer array. I was quite pleased with myself, because I suck at regex. However, it now seems that standard .js files (any normal .js files) aren't being loaded into the $js['foot'] array. Why is this? If I remove the ^(\.head\.js) part, it loads them.
To be clear, I want the $js['foot'] array to load files ending with:
.foot.js.php
.foot.js
.f.js
.js
And IGNORE all:
.head.js.php
.head.js
.h.js
Can someone correct my regex above to do this? I thought the ^ operator meant NOT, but I was wrong!

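^(\.head\.js) in the middle of the pattern does not negate anything, because ^ is an anchor that matches the start of the string. An alternative such as (\.js)^(\.head\.js) can therefore never match, which is why plain .js files stopped showing up.
You actually need a negative lookbehind assertion to stop the footer regex from matching .head.js:
$js['head'] = '/\.head\.js(?:\.php)?$|\.h\.js$/';
$js['foot'] = '/\.foot\.js(?:\.php)?$|(?<!\.head|\.h)\.js$/';
The $ anchors tie each alternative to the end of the file name, and the dots inside the lookbehind keep names like dash.js from being rejected.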
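As a quick check of the lookbehind idea, here is a minimal sketch that runs a footer and a header pattern over a few sample names. The helper names isHeadScript() and isFootScript() and the sample file names are made up for illustration, and the patterns assume that only the end of the file name matters.

```php
<?php
// Hypothetical helpers, not part of the original getContents() code.

// True for names ending in .head.js, .head.js.php or .h.js
function isHeadScript(string $fileName): bool
{
    return preg_match('/\.head\.js(?:\.php)?$|\.h\.js$/', $fileName) === 1;
}

// True for names ending in .foot.js, .foot.js.php, .f.js or plain .js,
// but not when the .js is directly preceded by ".head" or ".h"
function isFootScript(string $fileName): bool
{
    return preg_match('/\.foot\.js(?:\.php)?$|(?<!\.head|\.h)\.js$/', $fileName) === 1;
}

foreach (['app.js', 'app.f.js', 'app.foot.js.php', 'app.head.js', 'app.h.js'] as $name) {
    printf("%-16s head=%d foot=%d\n", $name, isHeadScript($name), isFootScript($name));
}
```

PCRE accepts a lookbehind whose alternatives have different lengths as long as each alternative is itself fixed-length, which is what makes (?<!\.head|\.h) legal here.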

Related

Targeting specific PHP tag with regex

All my WordPress websites have recently been hacked, and a very long PHP line has been added at the top of every PHP file.
It looks like this (just a sample of the entire code):
<?php $gqmtlkp = '~ x24<!%o:!>! x242178}527}88:}35csboe))1/35.)1/14+9**-)1/2986+7452]88]5]48]32M3]317]445]212]445]43]321]y]252]18y]#>q%
The problem is that the code is generated and is different in every file. But I noticed that every copy contains
explode(chr((729-609))
Can someone help me build a regex that will target the first PHP tag (optional) containing the following (the numbers vary):
explode(chr((xxx-xxx))
so that I can automatically remove it from every file?
Thanks a lot for your help.
Based on my understanding of your request, you're looking to match the following format: <?php (optional) followed by explode(chr((xxx-xxx))). Your sample was missing a third closing parenthesis for the explode() call, so I added it. If that's not right, just remove the last \) portion.
Try this: /(<\?php)? explode\(chr\(\([0-9]{3}-[0-9]{3}\)\)\)/
I'm not sure whether the space after the optional opening PHP tag is necessary; you can adjust it from there.
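Before deleting anything automatically, it may be safer to first list which files carry the marker. The sketch below only detects the explode(chr((xxx-xxx)) signature quoted in the question. The function name hasInjectedMarker() is hypothetical, and using \d+ instead of exactly three digits is an assumption about how much the numbers vary.

```php
<?php
// Hypothetical helper: reports whether a PHP source string contains the
// explode(chr((NNN-NNN)) marker described in the question.
function hasInjectedMarker(string $source): bool
{
    // \d+ is an assumption; the answer above uses exactly three digits.
    return preg_match('/explode\(chr\(\(\d+-\d+\)\)/', $source) === 1;
}

$infected = '<?php $a = explode(chr((729-609)), $b);';
$clean    = '<?php $parts = explode(",", $csv);';

var_dump(hasInjectedMarker($infected)); // bool(true)
var_dump(hasInjectedMarker($clean));    // bool(false)
```

Running this over file_get_contents() of each file gives a list to review by hand before any replacement is applied.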

regular expression for replacing all links but css and js

I want to download a site and replace all links on that site with internal links.
That's easy:
$page=file_get_contents($url);
$local=$_SERVER['HTTP_HOST'].$_SERVER['PHP_SELF'];
$page=preg_replace('/href="(.+?)"/','href="http://'.$local.'?href=\\1"',$page);
But I want to exclude all CSS and JS files from the replacement, so I tried:
$regex='/href="(.+?(?!(\.js|\.css)))"/';
$page=preg_replace($regex,'href="http://'.$local.'?href=\\1"',$page);
But that didn't work. What am I doing wrong? I thought ?! is a negative lookahead.
To answer your regex question, you need a lookbehind there and better limit the match with a character class:
$regex = '/href="([^"]+(?<!\.js|\.css))"/';
The character class first matches the whole link content; the lookbehind then asserts that it didn't end in .js or .css.
You might even want to prefix the whole pattern with <a\s[^>]*? so it really only finds things that look like a link.
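A small sketch of the lookbehind version in use. The function name rewriteHrefs() and the host example.test are placeholders; the pattern and the replacement string follow the ones shown above.

```php
<?php
// Hypothetical wrapper around the lookbehind pattern from the answer.
function rewriteHrefs(string $page, string $local): string
{
    // [^"]+ takes the whole attribute value; the lookbehind then rejects
    // values that end in .js or .css.
    $regex = '/href="([^"]+(?<!\.js|\.css))"/';
    return preg_replace($regex, 'href="http://' . $local . '?href=$1"', $page);
}

$html = '<a href="page.html">page</a><link href="style.css">';
echo rewriteHrefs($html, 'example.test/index.php'), "\n";
```

Because the closing quote must follow immediately, the engine cannot "rescue" a .css link by backtracking to a shorter value, which is what went wrong with the lookahead attempt.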
Another option would be using DOMDocument or QueryPath for such tasks, which is usually more tedious and more code, but makes it simpler to add programmatic conditions:
htmlqp->find("a") FOREACH $a->attr("href", "http:/...".$a->attr("href"))
// would need a real foreach and an if and stuff..
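For comparison, here is roughly what the "real foreach and an if" could look like with PHP's built-in DOMDocument instead of QueryPath. The function name rewriteAnchors() and the prefix URL are illustrative assumptions, and the sketch only touches <a> elements.

```php
<?php
// Hypothetical helper: prefixes every <a href> except .js/.css targets.
function rewriteAnchors(string $html, string $prefix): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate imperfect real-world HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // Skip anchors without an href and links to scripts or stylesheets
        if ($href === '' || preg_match('/\.(?:js|css)$/i', $href)) {
            continue;
        }
        $a->setAttribute('href', $prefix . $href);
    }
    return $doc->saveHTML();
}

echo rewriteAnchors('<p><a href="page.html">x</a></p>', 'http://localhost/?href=');
```

Note that saveHTML() returns a full document (doctype, html and body elements are added), so the output is not byte-for-byte the input with changed links.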

Optional regular expression segment, but list of requirements if present?

I have a small routing engine in PHP. I'm trying to allow it to optionally match different "formats", such as requests to "/user/profile.json" or "/user/profile.xml". However, it should also match just a plain "/user/profile".
So, if the format is present, it must be ".json" or ".xml". But it isn't required to be present at all.
Here is what I have so far:
#^GET /something/([a-zA-Z0-9\.\-_]+)(\.(html|json))?$#
Obviously, this doesn't work. This allows any "format" to be requested since the entire format segment is optional. How can I keep it optional, but constrain the formats that can be requested?
^GET /something/([a-zA-Z0-9._-]+)(\.(html|json))?$
allows dots in the first character class, so any file extension is legal. I expect you did that on purpose so filenames with dots in them are possible.
However, this means that if a filename contains a dot, it must end in either .html or .json. Right?
So change the regex to (using the \w shorthand for [A-Za-z0-9_]):
^GET /something/([\w.-]+\.(html|json)|[\w-]+)$
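A short sketch of how that pattern behaves on a few requests. The function name matchRoute() and the returned array shape are made up for illustration.

```php
<?php
// Hypothetical helper: returns the captured path and format, or null.
function matchRoute(string $request): ?array
{
    $pattern = '#^GET /something/([\w.-]+\.(html|json)|[\w-]+)$#';
    if (!preg_match($pattern, $request, $m)) {
        return null;
    }
    // Group 2 is only set when the first alternative matched
    return ['path' => $m[1], 'format' => $m[2] ?? null];
}

var_dump(matchRoute('GET /something/profile'));      // path "profile", format NULL
var_dump(matchRoute('GET /something/profile.json')); // path "profile.json", format "json"
var_dump(matchRoute('GET /something/profile.csv'));  // NULL
```

The second alternative deliberately leaves the dot out of its character class, so a name with any dot in it has to go through the first alternative and end in an allowed extension.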
Alternative suggestion:
Instead of putting the desired output format into the URL, have the client specify it via the Accept Header in the HTTP Request (where it belongs). Content negotiation is baked into the HTTP protocol, so you do not have to reinvent it via URLs. Technically, it is wrong to put the format into the URL. Your URIs should point to the resource itself and not the resource representation.
Also see W3C: Content Negotiation: why it is useful, and how to make it work
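As a rough illustration of the Accept-header approach, the sketch below picks a format from a header string. It is deliberately simplified: it ignores q-values and wildcards and just returns the first supported media type it sees. The function name pickFormat() and the list of supported types are assumptions.

```php
<?php
// Hypothetical helper: maps an Accept header to a response format.
// Simplification: q-values and wildcards such as */* are ignored.
function pickFormat(string $acceptHeader, string $default = 'html'): string
{
    $supported = ['application/json' => 'json', 'text/html' => 'html'];

    foreach (explode(',', $acceptHeader) as $part) {
        // Drop parameters such as ";q=0.9" and normalise the media type
        $mediaType = strtolower(trim(explode(';', $part)[0]));
        if (isset($supported[$mediaType])) {
            return $supported[$mediaType];
        }
    }
    return $default;
}

echo pickFormat('application/json'), "\n";                      // json
echo pickFormat('text/html,application/xhtml+xml;q=0.9'), "\n"; // html
```

In a real request the header would come from $_SERVER['HTTP_ACCEPT'], and a complete implementation would sort the candidates by their q-values first.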
The issue arises from the fact that most extensions are alphanumeric, yet your regex allows dots along with those characters:
#^GET /something/[a-zA-Z0-9\.\-_]+(\.(html|json))?$#
The problem section is [a-zA-Z0-9\.\-_]+. An extension such as .csv makes it through because it still matches that character class.
If something has dots in its file name, then by default it has a file extension (intentional or unintentional). The file My.Finance.Documents has the extension ".Documents" even though you'd assume it to be a text file or something else.
I hate doing it, but I think you might want to have a larger conditional in your regex, something along the lines of (this is an example, I haven't tested it):
#^GET /something/([^\.]+|.*\.(?:html|json))$#
Basically, if the file name has no dots in it, it's OK. If it does have a dot in it (which guarantees it has an extension), it must end with .html or .json.

If just the index.php loads as its generic unparsed self, what exactly is [not] happening?

I visited a client's site today, and I'm getting the actual content of their index.php file itself rather than their website. The function of the index.php file says:
This file loads and executes the parser. *
Assuming this is not happening, what would be some common reasons for that?
If Apache and PHP are configured correctly, so that .php files go through the PHP interpreter, the thing I would check is whether the PHP files use short open tags ("<?") instead of the standard "<?php" open tags. The php.ini files recommended for recent PHP versions turn short tags off, and their use is discouraged. If this is the case, look for the "short_open_tag" line in php.ini and set it to "On" or, preferably and if time allows, change the tags in the code. Although the second option is better in the long run, it can be time-consuming and error-prone if done manually.
I have done such a thing in the past with a site-wide find/replace operation, and the general procedure is this:
Find all "<?=" and replace with "~|~|~|~|" or some other unusual string that is extremely unlikely to turn up in real code.
Find all "<?php" and replace with "$#$#$#"
Find all "<?" in the site and replace with "$#$#$#"
Find all "$#$#$#" and replace with "<?php " (the trailing space is advised).
Find all "~|~|~|~|" and replace with "<?php echo " (the trailing space is necessary).
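The five find/replace steps above can be sketched as a single function applied to one file's contents. The function name expandShortTags() is hypothetical, and the same caveat applies as for the manual procedure: a blind replacement will also hit things like an XML declaration, so review the result.

```php
<?php
// Hypothetical helper that applies the find/replace steps in order.
function expandShortTags(string $source): string
{
    $echoMark = '~|~|~|~|'; // placeholder for short echo tags
    $openMark = '$#$#$#';   // placeholder for every other open tag

    $source = str_replace('<?=', $echoMark, $source);   // step 1
    $source = str_replace('<?php', $openMark, $source); // step 2
    $source = str_replace('<?', $openMark, $source);    // step 3
    $source = str_replace($openMark, '<?php ', $source); // step 4
    return str_replace($echoMark, '<?php echo ', $source); // step 5
}

echo expandShortTags('<? echo 1;'), "\n";
echo expandShortTags('<?= $name;'), "\n";
```

Protecting "<?php" with a placeholder in step 2 is what keeps step 3 from turning existing long tags into "<?phpphp".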

PHP - I need some help with my Regex

I've created a simple template 'engine' in PHP to substitute PHP-generated data into the HTML page. Here's how it works:
In my main template file, I have variables like so:
<title><!-- %{title}% --></title>
I then assign data into those variables for the main page load
$assign = array (
'title' => 'my website - '
);
I then have separate template blocks that get loaded for the content pages. The above really just handles the header and the footer. In one of these 'content template files', I have variables like so:
<!-- %{title=content page}% -->
Once this gets executed, the main template data is edited to include the content page variables resulting in:
<title>my website - content page</title>
It does this with the following code:
if (preg_match('/<!-- %{title=\s*(.*?)}% -->/s', $string, $matches)) {
    // Find variable names in the form of %{varName=new data to append}%
    // If found, append that new data to the existing data
    $string = preg_replace('/<!-- %{title=\s*(.*?)}% -->/s', '', $string);
    $varData[$i] .= $matches[1];
}
This basically removes the template variables and then assigns the variable data to the existing variable. Now, this all works fine. What I'm having issues with is nesting template variables. If I do something like:
<!-- %{title=content page (author: <!-- %{name}% -->) -->
The pattern, at times, messes up the opening and closing tags of each variable.
How can I fix my regular expression to prevent this?
Thank you.
The answer is you don't do this with regex. Regular expressions describe regular languages. When you start nesting things, the language is no longer regular. It is, at a minimum, a context-free language ("CFL"). CFLs can only be processed (assuming they're unambiguous) with a stack.
Specifically, regular languages can be represented with a finite state machine ("FSM"). CFLs require a pushdown automaton ("PDA").
An example of the difference is nested tags in HTML:
<div>
<div>inner</div>
</div>
My advice is don't write your own template language. That's been done. Many times. Use Smarty or something in Zend, Kohana or whatever. If you do write your own, do it properly. Parse it.
Why are you rolling your own template engine? If you want this kind of complexity, there's a lot of places that have already come up with solutions for it. You should just plug in Smarty or something like that.
If you're asking what I think you're asking, it's literally impossible. If I read your question correctly, you want to match arbitrarily-nested <!-- ... --> sequences with particular things inside. Unfortunately, regular expressions can only match certain classes of strings; any regular expression can match only a regular language. One well-known language which is not regular is the language of balanced parentheses (also known as the Dyck language), which is exactly what you're trying to match. In order to match arbitrarily-nested comment strings, you need a more powerful tool. I'm fairly sure there are pre-existing PHP template engines; you might look into one of those.
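To make the "more powerful tool" concrete, here is a minimal sketch of a balance check for nested <!-- ... --> markers. With only one kind of bracket, the stack reduces to a depth counter. The function name commentsAreBalanced() is hypothetical, and the regex is used only to split the text into tokens, not to do the matching itself.

```php
<?php
// Hypothetical helper: checks that every <!-- has a matching -->.
function commentsAreBalanced(string $text): bool
{
    // Tokenise: collect every opening and closing marker in order
    preg_match_all('/<!--|-->/', $text, $m);

    $depth = 0;
    foreach ($m[0] as $token) {
        $depth += ($token === '<!--') ? 1 : -1;
        if ($depth < 0) {
            return false; // a close appeared before its open
        }
    }
    return $depth === 0;
}

var_dump(commentsAreBalanced('<!-- a <!-- b --> -->')); // bool(true)
var_dump(commentsAreBalanced('<!-- a <!-- b -->'));     // bool(false)
```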
To resolve your problem you should
replace preg_match() with preg_match_all();
find the pattern, and replace them from the last one to the first one;
use a more restrictive pattern like '/<!-- %{title=\s*([^}]*?)}% -->/s'.
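The three suggestions above could be combined roughly as follows. This is a sketch under assumptions: the function name extractTitleParts() and its return shape are invented for illustration, and it only handles the title variable from the question.

```php
<?php
// Hypothetical helper: collects every appended title fragment and strips
// the tags, working from the last match back to the first.
function extractTitleParts(string $html): array
{
    $pattern = '/<!-- %{title=\s*([^}]*?)}% -->/s';
    preg_match_all($pattern, $html, $matches, PREG_OFFSET_CAPTURE);

    $parts = [];
    foreach ($matches[1] as $capture) {
        $parts[] = $capture[0]; // [0] is the text, [1] its offset
    }

    // Removing from the end keeps the earlier offsets valid
    foreach (array_reverse($matches[0]) as [$tag, $offset]) {
        $html = substr_replace($html, '', $offset, strlen($tag));
    }
    return [$html, $parts];
}

[$clean, $parts] = extractTitleParts('a<!-- %{title=one}% -->b<!-- %{title= two}% -->c');
echo $clean, ' / ', implode(',', $parts), "\n"; // abc / one,two
```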
I've done something similar in the past, and I have encountered the same nesting issue you did. In your case, what I would do is repeatedly search your text for matches (rather than searching once and looping through the matches) and extract the strings you want by searching for anything that doesn't include your closing string.
In your case, it would probably look like this:
/(<!--((?:(?!<!--|-->).)*?)-->)/s
Regexes like this are a nightmare to explain, but basically, ((?:(?!<!--|-->).)*?) will find any string that includes neither your opening nor your closing tag (let's call that AAA). It will be inside a matching group that is, itself, your template tag, (<!--AAA-->).
I'm convinced this sort of templating method is the wrong way to do things, but I've never known enough to do it better. It's always bothered me in ASP and ColdFusion that you had to nest your scripting tags inside HTML and when I started to do it myself, I considered it a personal failure.
Most Regexes I do now are in JavaScript and so I may be missing some of the awesome nuances PHP has via Perl. I'd be happy if someone can write this more cleanly.
I too have run into this problem in the past, although I didn't use regular expressions.
If instead you search from right to left for the opening tag, <!-- %{ in your syntax, using strrpos (PHP5+), then search forwards for the first occurrence of the next closing tag, and then replace that chunk first, you will end up replacing the inner-most nested variables first. This should resolve your problem.
You can also do it the other way around and find the first occurrence of a closing tag, and work backwards to find its corresponding opening tag.
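The right-to-left idea can be sketched like this. It assumes a simplified syntax where a variable looks like <!-- %{name}% --> and its value comes from a lookup array; the function name resolveNested() and that array are hypothetical. Unknown names are replaced with an empty string, and a value that itself contains an opening tag would be expanded again.

```php
<?php
// Hypothetical helper: replaces the innermost variable first by always
// starting from the LAST opening tag in the text.
function resolveNested(string $text, array $values): string
{
    $open  = '<!-- %{';
    $close = '}% -->';

    while (($start = strrpos($text, $open)) !== false) {
        // First closing tag after the last opening tag
        $end = strpos($text, $close, $start);
        if ($end === false) {
            break; // unbalanced input: leave the rest untouched
        }
        $nameStart = $start + strlen($open);
        $name  = substr($text, $nameStart, $end - $nameStart);
        $value = $values[$name] ?? '';
        $text  = substr($text, 0, $start) . $value . substr($text, $end + strlen($close));
    }
    return $text;
}

echo resolveNested('A <!-- %{outer <!-- %{x}% -->}% --> B', ['x' => '1', 'outer 1' => 'done']), "\n";
```

Since nothing can open after the last opening tag, the chunk between it and the next closing tag never contains another variable, which is why no nesting logic is needed inside the loop.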
