Replace slashes / with dashes - in markdown files - php

I have ~300 markdown files held within a single Git repository. I need to change the format of all the internal links within these documents. Internal links are links that do not leave the repository. They look something like this:
Checkout the [new plugin](/developers/tools/plugin/install-the-plugin)
guide if you're stuck. If you know what you're doing head on over to
the [examples section](/developers/examples/plugin-tutorials) and get
your hands dirty.
I need to change all the internal links so that they:
Don't contain /developers/
Have all their remaining slashes / converted to dashes -.
The example above should look something like this:
Checkout the [new plugin](tools-plugin-install-the-plugin) guide if
you're stuck. If you know what you're doing head on over to the
[examples section](examples-plugin-tutorials) and get your hands dirty.
One caveat is that I don't want to target images. Images look identical to links, just with an exclamation mark ! at the start:
![Plugin Logo](/developers/tools/plugin/images/logo.png)
I've looked into things and it looks like sed is a way forward in terms of tools. I've managed to build the following regex that captures the links I'm looking for:
\]\(\/developers\/.*\)
Annoyingly, this regex doesn't ignore the ![]() image syntax. I was able to get PHP to return the locations of each hit on each page, but then I wasn't able to do a find-and-replace on the slashes / within those results.
Any ideas or pointers would be greatly appreciated.

You may do it with a single PHP regex:
$text = preg_replace('~!\[[^][]*]\([^()]*\)(*SKIP)(*F)|(?:\G(?!\A)|(?<=]\()/developers/)([^()/]*)/(?=[^()]*\))~', '$1-', $text);
See the regex demo
Details
!\[[^][]*]\([^()]*\)(*SKIP)(*F) - match !, [, any 0+ chars other than [ and ], then a ](, 0+ chars other than ( and ), ) and then omit the match and go on to search for the next match at the end of the current failed match
| - or
(?:\G(?!\A)|(?<=]\()/developers/) - end of the previous successful match (\G(?!\A)) or (|) a /developers/ string preceded with ](
([^()/]*) - Group 1 ($1): any 0+ chars other than (, ) and /
/ - a / char
(?=[^()]*\)) - ...that is followed with any 0+ chars other than ( and ) and then a ).
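A minimal sketch of the pattern run over a shortened version of the sample text from the question. The wrapper name rewriteInternalLinks is hypothetical and only there for illustration; the pattern itself is the one from the answer, split across lines for readability.

```php
<?php
// Hypothetical helper wrapping the preg_replace call from the answer.
function rewriteInternalLinks(string $text): string
{
    $pattern = '~!\[[^][]*]\([^()]*\)(*SKIP)(*F)'              // match an image ![...](...) and skip it
             . '|(?:\G(?!\A)|(?<=]\()/developers/)([^()/]*)/'  // link prefix, or continue after the last match
             . '(?=[^()]*\))~';                                // only while still inside the (...) part
    return preg_replace($pattern, '$1-', $text);
}

$markdown = 'Checkout the [new plugin](/developers/tools/plugin/install-the-plugin) guide. '
          . '![Plugin Logo](/developers/tools/plugin/images/logo.png)';

// The link is rewritten, the image path is left alone.
echo rewriteInternalLinks($markdown), PHP_EOL;
```

One thing to watch: a link with a single path segment after /developers/, such as (/developers/intro), has no further slash for the pattern to match, so it is left unchanged and would need separate handling.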

Related

PHP (preg_replace) regex strip image sizes from filename

I'm working on an open-source plugin for WordPress and frankly facing an odd issue.
Consider the following filenames:
/wp-content/uploads/buddha_-800x600-2-800x600.jpg
/wp-content/uploads/cutlery-tray-800x600-2-800x600.jpeg
/wp-content/uploads/custommade-wallet-800x600-2-800x600.jpeg
/wp-content/uploads/UI-paths-800x800-1.jpg
The current regex I have:
(-[0-9]{1,4}x[0-9]{1,4}){1}
This will remove both matches from the filename, for example buddha_-800x600-2-800x600.jpg will become buddha_-2.jpg which is invalid.
I have tried a variety of regex:
.*(-\d{1,4}x\d{1,4}) // will strip out everything
(-\d{1,4}x\d{1,4}){1}|.*(-\d{1,4}x\d{1,4}){1} // same as above
(-\d{1,4}x\d{1,4}){1}|(-\d{1,4}x\d{1,4}){1} // will strip out all size matches
Unfortunately, my knowledge of regex is quite limited. Can someone advise how to achieve the goal, please?
The goal is to remove only what is relevant, which would result in:
/wp-content/uploads/buddha_-800x600-2.jpg
/wp-content/uploads/cutlery-tray-800x600-2.jpeg
/wp-content/uploads/custommade-wallet-800x600-2.jpeg
/wp-content/uploads/UI-paths-1.jpg
Much appreciated!
You can use a capture group with a backreference to match strings where there are 2 of the same parts and replace that with a single part.
Or match the dimensions to be removed.
((-\d+x\d+)-\d+)\2|-\d+x\d+
( Capture group 1
(-\d+x\d+) Capture group 2, match - 1+ digits x and 1+ digits
-\d+ Match - and 1+ digits
)\2 Close group 1, followed by a backreference to what is captured in group 2
| Or
-\d+x\d+ Match the dimensions format
Regex demo | Php demo
For example
$pattern = '~((-\d+x\d+)-\d+)\2|-\d+x\d+~';
$strings = [
"/wp-content/uploads/buddha_-800x600-2-800x600.jpg",
"/wp-content/uploads/cutlery-tray-800x600-2-800x600.jpeg",
"/wp-content/uploads/custommade-wallet-800x600-2-800x600.jpeg",
"/wp-content/uploads/UI-paths-800x800-1.jpg",
];
foreach ($strings as $s) {
echo preg_replace($pattern, '$1', $s) . PHP_EOL;
}
Output
/wp-content/uploads/buddha_-800x600-2.jpg
/wp-content/uploads/cutlery-tray-800x600-2.jpeg
/wp-content/uploads/custommade-wallet-800x600-2.jpeg
/wp-content/uploads/UI-paths-1.jpg
I would try something like this. You can test it yourself. Here is the code:
$a = [
'/wp-content/uploads/buddha_-800x600-2-800x600.jpg',
'/wp-content/uploads/cutlery-tray-800x600-2-800x600.jpeg',
'/wp-content/uploads/custommade-wallet-800x600-2-800x600.jpeg',
'/wp-content/uploads/UI-paths-800x800-1.jpg'
];
foreach($a as $img)
echo preg_replace('#-\d+x\d+((-\d+|)\.[a-z]{3,4})#i', '$1', $img).'<br>';
It looks for a trailing -(number)x(number), optionally followed by -(number), right before (dot)(extension), and keeps only the part after the size.
This is a clear case of « Match the rejection, revert the match ».
So, you just have to think about the pattern you are searching to remove:
[0-9]+x[0-9]+
which is simply (much condensed):
\d+x\d+
The next step is to build the groups extractor:
^(.*[^0-9])[0-9]+x[0-9]+([^x]*\.[a-z]+)$
We added the extension of the file as a suffix for the extract.
The rejection of the "x" char is a (bad…) trick to ensure the match of the last size only. It won’t work in the case of an alphanumeric suffix between the size and the extension (toto-800x1024-ex.jpg for instance).
And then, the replacement string:
$1$2
For clarity, of course, we are only working on a successfully extracted filename. But if you want to treat the whole string, the pattern becomes:
^/(.*[^0-9])[0-9]+x[0-9]+([^/x]*\.[a-z]+)$
If you want to split the filename and the folder name:
^/(.*/)([^/]+[^0-9])[0-9]+x[0-9]+([^/x]*)(\.[a-z]+)$
^/(.*/)([^/]+\D)\d+x\d+([^/x]*)(\.[a-z]+)$
$folder=$1;
$filename="$1$2";
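A sketch of the group-extractor idea applied to the four filenames from the question. Run exactly as written, ^(.*[^0-9])[0-9]+x[0-9]+([^x]*\.[a-z]+)$ keeps the hyphen in front of the size, because the [^0-9] at the end of group 1 captures it (UI-paths-800x800-1.jpg comes out as UI-paths--1.jpg). The variant below moves that hyphen outside group 1; this is an adjustment made for this example, not part of the answer above.

```php
<?php
// Variant of the group extractor: the hyphen sits outside group 1,
// so it is dropped together with the size. [^x]* still ensures that
// only the last size before the extension is removed.
$pattern = '~^(.*)-\d+x\d+([^x]*\.[a-z]+)$~';

$files = [
    '/wp-content/uploads/buddha_-800x600-2-800x600.jpg',
    '/wp-content/uploads/cutlery-tray-800x600-2-800x600.jpeg',
    '/wp-content/uploads/custommade-wallet-800x600-2-800x600.jpeg',
    '/wp-content/uploads/UI-paths-800x800-1.jpg',
];

$cleaned = [];
foreach ($files as $file) {
    // $1 is everything before the size, $2 the optional suffix plus extension.
    $cleaned[] = preg_replace($pattern, '$1$2', $file);
}

echo implode(PHP_EOL, $cleaned), PHP_EOL;
```

The limitation mentioned above still applies: any x between the size and the extension defeats the [^x]* trick.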

Regex can't limit search range

I have the following problem:
I have a pattern like this:
/(?<=template=")(.*?)(.*\/)/gm
And a text like this:
template="test/widgets/glasgow.phtml"}}
My regex should find the path in front of the file name; I need to cut it out so that the result looks like this:
template="glasgow.phtml"}}
That works fine, but the problem is that I sometimes have a text that looks like this:
block="core/template" template="test/widgets/getcallus.phtml"}}</p>
It cuts out everything up to the </.
This is what gets cut out:
test/widgets/getcallus.phtml"}}</
Instead of:
test/widgets/
I have tried to limit the end with $, but it doesn't do anything.
I am testing it on regexr.com
https://regexr.com/50hi2
You may use the following pattern:
template="\K[^"\/]*\/[^"\/]*\/
See the regex demo. In PHP, you may get rid of backslashes if you specify another regex delimiter:
$regex = '~template="\K[^"/]*/[^"/]*/~';
Details
template=" - literal text
\K - match reset operator
[^"\/]* - 0 or more chars other than / and "
\/ - a / char
[^"\/]* - 0 or more chars other than / and "
\/ - a / char
It is equal to template="\K(?:[^"\/]*\/){2}, where (?:...){2} repeats the non-capturing group sequence of patterns twice.
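A short sketch of the \K pattern used with preg_replace on the two inputs from the question. Replacing the match with an empty string is an assumption about how the pattern would be applied; the {2} form is used here, which is equivalent to the expanded one.

```php
<?php
// \K discards 'template="' from the reported match, so only the two
// leading path segments are replaced (here: removed).
$pattern = '~template="\K(?:[^"/]*/){2}~';

$inputs = [
    'template="test/widgets/glasgow.phtml"}}',
    'block="core/template" template="test/widgets/getcallus.phtml"}}</p>',
];

$results = [];
foreach ($inputs as $input) {
    $results[] = preg_replace($pattern, '', $input);
}

echo implode(PHP_EOL, $results), PHP_EOL;
```

Because [^"/] can never cross the closing quote, the match cannot run on to the </ as the original pattern did.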
Be careful with (.*?)(.*\/)
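This pattern is prone to heavy backtracking, the kind of behaviour behind ReDoS vulnerabilities: the two adjacent quantifiers can split the n chars before the last / in many different ways, and the engine may try them all before giving up.
To keep a regex close to yours, you can use
/(?<=template=")([^"]*?\/)*([^"]*)"/
([^"]*?\/)* reads as many blocks of "chars other than / and ", followed by a /" as possible.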
https://regex101.com/r/SMSv5R/2
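A sketch of how this second pattern could be used for the replacement. The replacement string '$2"' (keep the file name, put the closing quote back) is an assumption, since the answer only gives the pattern.

```php
<?php
// Group 1 repeats over each "segment/" block, group 2 is the file name.
// The closing quote is part of the match, so it is re-added in the replacement.
$pattern = '~(?<=template=")([^"]*?/)*([^"]*)"~';

$inputs = [
    'template="test/widgets/glasgow.phtml"}}',
    'block="core/template" template="test/widgets/getcallus.phtml"}}</p>',
    'template="plain.phtml"}}', // no directory part: should stay as it is
];

$results = [];
foreach ($inputs as $input) {
    $results[] = preg_replace($pattern, '$2"', $input);
}

echo implode(PHP_EOL, $results), PHP_EOL;
```

Unlike the fixed {2} version, this one strips a path of any depth.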

How can I write a regexp that recursively matches RESTful path?

Regexp being not my strength, I would like some help on this one, if it is even possible:
I need to create a regexp that recursively matches a RESTful path. The purpose is to create a Symfony route matching this regexp. Here are some examples of what I mean by RESTful path:
/resources
/resources/123
/resources/123/children-resources
/resources/123/children-resources/123
/resources/123/children-resources/123/grandchildren-resources
And so on...
Basically, I would like this pattern to repeat itself indefinitely, one or more times:
^\/[a-z]+(\-[a-z]+)*(\/[0-9]+)?$
Note that to access a child resource, the identifier of the parent resource must be present.
I made a short list of unit tests (for two-level paths only to start) here:
https://regex101.com/r/Hxg0m4/2/tests
I searched questions on the same subject, but none were really relevant to my question. I also tried some modifications on the regexp above - like using the + sign at the end of the regexp, or use (?R)... It never passed my unit tests.
Any help will be gladly appreciated.
P.S: This is my first question on stackoverflow, please don't hesitate to tell me how to better formulate my question.
This recursive pattern should work:
^(\/[a-z]+(?:-[a-z]+)*(?:$|\/\d+(?:$|(?1))))
Explanation:
^ // assert start of string
(
\/ // start with a slash
[a-z]+(?:-[a-z]+)* // followed by a word
(?: // then, either:
$ // end of string
| // or:
\/ // a slash
\d+ // followed by digits
(?: // then, either:
$ // end of string
| // or:
(?1) // recurse the entire pattern (except the start of string anchor)
)
)
)
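A quick way to try the recursive pattern from PHP with preg_match. The invalid paths in the list are made up for illustration; they are not taken from the linked unit tests.

```php
<?php
// Same pattern as above, with ~ as delimiter so slashes need no escaping.
// (?1) re-enters capture group 1, i.e. everything after the ^ anchor.
$pattern = '~^(/[a-z]+(?:-[a-z]+)*(?:$|/\d+(?:$|(?1))))~';

$paths = [
    '/resources',
    '/resources/123',
    '/resources/123/children-resources',
    '/resources/123/children-resources/123/grandchildren-resources',
    '/resources/children-resources', // child without parent id
    '/resources/123/',               // trailing slash
    '/resources/123/456',            // two ids in a row
];

foreach ($paths as $path) {
    $verdict = preg_match($pattern, $path) === 1 ? 'match' : 'no match';
    printf("%-65s %s\n", $path, $verdict);
}
```

The first four paths should be reported as matches and the last three as non-matches.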

Capturing key value pairs from a url string with a regex pattern

I'm trying to use regex to parse a string like the below:
/subject=hello±#text=something that may contain\#hello.com or a normal sla/sh±#date=blah/somethingelseI don't want to capture after the first/
into:
subject = hello
text =something that may contain\#hello.com or a normal sla/sh
date = blah
Ideally I'd like to be able to split the string after the first '/' by something like '±#' - and only that combination in that order.
I've looked around and at the minute have the below:
([^/±#,= ]+)=([^±#,= ]+)
But this doesn't match only '±#' - it matches either # or ±.
It also doesn't cope with the escaped #. (Instead I get: text= something that may contain\ ).
Is there a better way to do this?
Thanks
Try this:
(?:\/|(?<=±#))(.*?=.*?)(?:±#|$|\/(?!.*±#))
See live demo
An important part is the negative look ahead after the trailing slash /(?!.*±#) - this means "match a slash, but only if ±# doesn't appear in the input after it".
Given this input:
/subject=hello±#text=something that may contain\#hello.com or a normal sla/sh±#date=blah/somethingelseI don't want to capture after the first/
It produces matches whose group 1 are:
subject=hello
text=something that may contain\#hello.com or a normal sla/sh
date=blah
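A sketch of the same pattern in PHP, turning the matches into an associative array. The question does not name a language, so PHP, the u flag, the \x{B1} escape for ± and the explode step are all assumptions made for this example.

```php
<?php
// \x{B1} is the ± sign; the u flag makes PCRE treat pattern and subject as UTF-8.
$pattern = '~(?:/|(?<=\x{B1}#))(.*?=.*?)(?:\x{B1}#|$|/(?!.*\x{B1}#))~u';

$input = "/subject=hello\u{B1}#text=something that may contain\\#hello.com or a normal sla/sh"
       . "\u{B1}#date=blah/somethingelseI don't want to capture after the first/";

preg_match_all($pattern, $input, $matches);

$pairs = [];
foreach ($matches[1] as $pair) {
    // Split on the first = only, so values may themselves contain =.
    [$key, $value] = explode('=', $pair, 2);
    $pairs[$key] = $value;
}

print_r($pairs);
```

The array should hold the three keys subject, text and date, with the text value keeping both its escaped \# and its inner slash.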

Can you help simplify my ip range matching regex?

I have a regex that will match IP addresses.
It looks like:
^(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?).(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?).(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?).(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|*|25[0-5]-25[0-5]|2[0-4][0-9]-25[0-5]|2[0-4][0-9]-2[0-4][0-9]|[01]?[0-9][0-9]?-25[0-5]|[01]?[0-9][0-9]?-2[0-4][0-9]|[01]?[0-9][0-9]?-[01]?[0-9][0-9])$
which you will mostly recognise from many other posts here on SO. However, I have modified it to also match the range form XXX.XXX.XXX.XXX-XXY.
However it now seems a little complex, particularly the final () capture. I would like some help to simplify this regex if possible.
Just to be clear
aaaa - not matched
999.1.1.1 - not matched
1.1.1.999 - not matched
192.168.2.1 - matched
192.168.2.* - matched
192.168.2.10-20 - matched
EDIT
I forgot to mention that I need the existing capture groups as well.
You could perhaps use optional groups (?: ... )? instead and use another grouping for the first 3 parts of the IP?
^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5](?:-25[0-5])?|
2[0-4][0-9](?:-(?:25[0-5]|2[0-4][0-9]))?|
[01]?[0-9][0-9]?(?:-(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))?|
\*)$
regex101 demo
Updated with capture groups
^((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))\.
((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))\.
((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))\.
(25[0-5](?:-25[0-5])?|
2[0-4][0-9](?:-(?:25[0-5]|2[0-4][0-9]))?|
[01]?[0-9][0-9]?(?:-(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))?|
\*)$
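A sketch of the multi-line version used from PHP. It assumes the x (extended) flag, so that the line breaks inside the pattern are ignored, and shows what ends up in the four capture groups.

```php
<?php
// The x flag lets the pattern span several lines; whitespace is ignored.
$pattern = '~
^((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))\.
((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))\.
((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))\.
(25[0-5](?:-25[0-5])?|
2[0-4][0-9](?:-(?:25[0-5]|2[0-4][0-9]))?|
[01]?[0-9][0-9]?(?:-(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))?|
\*)$
~x';

foreach (['192.168.2.10-20', '192.168.2.*', '1.1.1.999'] as $candidate) {
    if (preg_match($pattern, $candidate, $m) === 1) {
        // $m[1]..$m[3] are the first three octets, $m[4] is the last part.
        printf("%s => [%s] [%s] [%s] [%s]\n", $candidate, $m[1], $m[2], $m[3], $m[4]);
    } else {
        printf("%s => not matched\n", $candidate);
    }
}
```

The whole range, for example 10-20, lands in group 4 as a single string, so splitting it into start and end would be a further step.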
This works -
^(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(?:(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(?:\-(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))?|(\*))$
As can be seen here
This should work and is a bit shorter:
^((25[0-5]|2[0-4]\d|[01]?\d{1,2})\.){3}(\*|(25[0-5]|2[0-4]\d|[01]?\d{1,2}))(\-(25[0-5]|2[0-4]\d|[01]?\d{1,2}))?$
See:
http://regex101.com/r/sD9iZ0
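A small check of the shorter pattern against the six cases listed in the question, written in PHP as an assumption since the question names no language.

```php
<?php
$pattern = '~^((25[0-5]|2[0-4]\d|[01]?\d{1,2})\.){3}'
         . '(\*|(25[0-5]|2[0-4]\d|[01]?\d{1,2}))'
         . '(\-(25[0-5]|2[0-4]\d|[01]?\d{1,2}))?$~';

// Expected verdicts are the ones given in the question.
$cases = [
    'aaaa'            => false,
    '999.1.1.1'       => false,
    '1.1.1.999'       => false,
    '192.168.2.1'     => true,
    '192.168.2.*'     => true,
    '192.168.2.10-20' => true,
];

$verdicts = [];
foreach ($cases as $candidate => $expected) {
    $verdicts[$candidate] = preg_match($pattern, $candidate) === 1;
    printf("%-16s %s\n", $candidate, $verdicts[$candidate] ? 'matched' : 'not matched');
}
```

Two caveats with this shorter form: the optional range part comes after the whole (\*|...) group, so an input like 192.168.2.*-20 is accepted as well, and because the first three octets share one repeated group, only the last of them stays captured.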
