Regex: Rename Files

I am trying to rename a bunch of image files.
They are named inconsistently, but there is some logic to it.
They all start with an Id number
After the Id there may be some of the following (Items To Be Removed):
a space
2 letters
a dash -
These will appear in various orders and sometimes more than once, for the space or dash.
The filenames may have any of these items but not necessarily all of them.
Some filenames do have all 3 items.
They may have an additional _ after this
Then they may have a number {Index}
Finally they end in .ext where ext = jpg|png|gif...
Here are some example filenames:
1227.jpg
1227_1.jpg
2200 WH-1.jpg
2200WH 2.jpg
2200 WH2.jpg
2201_BK 1.png
2203 RD_1.jpg
I am trying to remove/replace the mentioned items so the filenames are as follows:
ID.ext or ID_{index}.ext
So the above list would turn into:
1227.jpg
1227_1.jpg
2200_1.jpg
2200_2.jpg
2201_1.png
2203_1.jpg
I have tried writing a few expressions but am a little stumped on this one.
I am working on a PHP project though other languages would be fine for this script.

Pattern: /^\d+\K[-a-z_ ]+/i
Replace: _
Basically only match when there are one or more characters between the id and the index. Simple.
/ #pattern delimiter
^ #start of string
\d+ #one or more digits
\K #reset the start of the match, so the digits matched so far are kept and only what follows is replaced
[-a-z_ ]+ #match one or more hyphens, letters, underscores, or spaces
/ #pattern delimiter
i #make the pattern case-insensitive
Code:
$images=['1227.jpg','1227_1.jpg','2200 WH-1.jpg','2200WH 2.jpg','2200 WH2.jpg','2201_BK 1.png','2203 RD_1.jpg'];
var_export(preg_replace('/^\d+\K[-a-z_ ]+/i','_',$images));
Output:
array (
0 => '1227.jpg',
1 => '1227_1.jpg',
2 => '2200_1.jpg',
3 => '2200_2.jpg',
4 => '2200_2.jpg',
5 => '2201_1.png',
6 => '2203_1.jpg',
)
Question extension solution:
You can do it with two patterns and replacements on a single preg_replace() call or you can use preg_replace() then str_replace() to mop up the dangling underscores. This will come down to personal coding preference. (It could also be done with a preg_replace_callback() that checks if there is an index number in the image name before adding the underscore, but that will make a more convoluted snippet.)
Codes:
$images=['1227.jpg','1227_1.jpg','2200 WH-1.jpg','2200WH 2.jpg','2200 WH2.jpg','2201_BK 1.png','2203 RD_1.jpg','2200 WH.jpg','3000_01.jpg'];
foreach($images as $image){
echo str_replace('_.','.',preg_replace('/^\d+\K[-a-z_ ]+0*/i','_',$image)),"\n";
}
Or
$images=['1227.jpg','1227_1.jpg','2200 WH-1.jpg','2200WH 2.jpg','2200 WH2.jpg','2201_BK 1.png','2203 RD_1.jpg','2200 WH.jpg','3000_01.jpg'];
foreach($images as $image){
echo preg_replace(['~^\d+\K[-a-z_ ]+0*~i','~_\.~'],['_','.'],$image),"\n";
}
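The preg_replace_callback() route mentioned above could look roughly like the sketch below. The function name normalizeImageName and the exact pattern are illustrative assumptions, not part of the original answer. The callback only writes the underscore when an index number was actually captured, so no dangling underscore needs mopping up afterwards.

```php
<?php
// Hypothetical helper: rebuild the name from its captured parts.
function normalizeImageName(string $name): string
{
    return preg_replace_callback(
        // 1: id, then the items to drop and any leading zeros, 2: index (may be empty), 3: extension
        '/^(\d+)[-a-z_ ]*0*(\d*)(\.\w+)$/i',
        function (array $m): string {
            // Add the underscore only when an index was found
            return $m[1] . ($m[2] !== '' ? '_' . $m[2] : '') . $m[3];
        },
        $name
    );
}

foreach (['2200 WH-1.jpg', '2200 WH.jpg', '3000_01.jpg'] as $image) {
    echo normalizeImageName($image), "\n";
}
```

Names that do not fit the pattern are returned unchanged, because preg_replace_callback() leaves a non-matching subject as it is.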

I would do it with the following pattern:
(\d{4})([^0-9.]*)(\d\.)
And with a substitution of $1_$3.
Step by step:
(\d{4}) - Check for the 1st 4 digits.
([^0-9.]*) - Check for everything that is not a number or a period after the ID.
(\d\.) - Check for ending number and period before extension (This is so we can properly place the underscore)
With this substitution, the 4-digit ID is kept, every character between the ID and the index that is not a digit or a period is dropped, and an underscore is placed between the ID and the index. The period is captured together with the index, so $3 writes it back. If there is no index after the ID, the pattern does not match and the filename is left unchanged, so no underscore is added.
You can view this on Regex101 for a very detailed step-by-step of what is going on.
In PHP this would be:
preg_replace('/(\d{4})([^0-9.]*)(\d\.)/', '$1_$3', $string);
Output:
1227.jpg
1227_1.jpg
2200_1.jpg
2200_2.jpg
2200_2.jpg
2201_1.png
2203_1.jpg

Not a PHP person but the regular expression I would use is:
/(\d+).*?(\d?)\.(.*)/
This will capture the first set of numbers, skip the middle part, capture the number on the end if present, then capture the file extension.
Then in ruby I would do the following:
id, index, extension = my_file_name.match(/(\d+).*?(\d?)\.(.*)/).captures
new_name = id.to_s
new_name += "_#{index}" unless index.empty?
new_name += ".#{extension}"
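For readers staying in PHP, the same capture-and-rebuild idea might be sketched as follows. The function name renameByCaptures is hypothetical, and the ^ and $ anchors are an addition to the pattern above, made on the assumption that the whole filename should be consumed.

```php
<?php
// Hypothetical PHP port of the capture-and-rebuild approach.
function renameByCaptures(string $name): string
{
    // 1: id, 2: optional single-digit index, 3: extension
    if (!preg_match('/^(\d+).*?(\d?)\.(.*)$/', $name, $m)) {
        return $name; // leave names that do not fit untouched
    }
    [, $id, $index, $extension] = $m;

    $newName = $id;
    if ($index !== '') {
        $newName .= "_$index";
    }
    return $newName . ".$extension";
}

echo renameByCaptures('2203 RD_1.jpg'), "\n";
```

As in the Ruby version, only the last digit before the period is taken as the index, so a multi-digit index would be truncated.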

Related

PHP (preg_replace) regex strip image sizes from filename

I'm working on an open-source plugin for WordPress and am frankly facing an odd issue.
Consider the following filenames:
/wp-content/uploads/buddha_-800x600-2-800x600.jpg
/wp-content/uploads/cutlery-tray-800x600-2-800x600.jpeg
/wp-content/uploads/custommade-wallet-800x600-2-800x600.jpeg
/wp-content/uploads/UI-paths-800x800-1.jpg
The current regex I have:
(-[0-9]{1,4}x[0-9]{1,4}){1}
This will remove both matches from the filename, for example buddha_-800x600-2-800x600.jpg will become buddha_-2.jpg which is invalid.
I have tried a variety of regex:
.*(-\d{1,4}x\d{1,4}) // will strip out everything
(-\d{1,4}x\d{1,4}){1}|.*(-\d{1,4}x\d{1,4}){1} // same as above
(-\d{1,4}x\d{1,4}){1}|(-\d{1,4}x\d{1,4}){1} // will strip out all size matches
Unfortunately my knowledge of regex is quite limited; can someone advise how to achieve the goal, please?
The goal is to remove only what is relevant, which would result in:
/wp-content/uploads/buddha_-800x600-2.jpg
/wp-content/uploads/cutlery-tray-800x600-2.jpeg
/wp-content/uploads/custommade-wallet-800x600-2.jpeg
/wp-content/uploads/UI-paths-1.jpg
Much appreciated!

You can use a capture group with a backreference to match strings where there are 2 of the same parts and replace that with a single part.
Or match the dimensions to be removed.
((-\d+x\d+)-\d+)\2|-\d+x\d+
( Capture group 1
(-\d+x\d+) Capture group 2, match - 1+ digits x and 1+ digits
-\d+ Match - and 1+ digits
)\2 Close group 1, followed by a backreference to what is captured in group 2
| Or
-\d+x\d+ Match the dimensions format
For example
$pattern = '~((-\d+x\d+)-\d+)\2|-\d+x\d+~';
$strings = [
"/wp-content/uploads/buddha_-800x600-2-800x600.jpg",
"/wp-content/uploads/cutlery-tray-800x600-2-800x600.jpeg",
"/wp-content/uploads/custommade-wallet-800x600-2-800x600.jpeg",
"/wp-content/uploads/UI-paths-800x800-1.jpg",
];
foreach ($strings as $s) {
echo preg_replace($pattern, '$1', $s) . PHP_EOL;
}
Output
/wp-content/uploads/buddha_-800x600-2.jpg
/wp-content/uploads/cutlery-tray-800x600-2.jpeg
/wp-content/uploads/custommade-wallet-800x600-2.jpeg
/wp-content/uploads/UI-paths-1.jpg

I would try something like this. You can test it yourself. Here is the code:
$a = [
'/wp-content/uploads/buddha_-800x600-2-800x600.jpg',
'/wp-content/uploads/cutlery-tray-800x600-2-800x600.jpeg',
'/wp-content/uploads/custommade-wallet-800x600-2-800x600.jpeg',
'/wp-content/uploads/UI-paths-800x800-1.jpg'
];
foreach($a as $img)
echo preg_replace('#-\d+x\d+((-\d+|)\.[a-z]{3,4})#i', '$1', $img).'<br>';
It checks for ending -(number)x(number)(dot)(extension)

This is a clear case of « Match the rejection, revert the match ».
So, you just have to think about the pattern you are searching to remove:
[0-9]+x[0-9]+
which is simply (much condensed):
\d+x\d+
The next step is to build the groups extractor:
^(.*)-[0-9]+x[0-9]+([^x]*\.[a-z]+)$
We added the extension of the file as a suffix for the extract.
The rejection of the "x" char is a (bad…) trick to ensure the match of the last size only. It won’t work in the case of an alphanumeric suffix between the size and the extension (toto-800x1024-ex.jpg for instance).
And then, the replacement string:
$1$2
For clarity, of course, we are only working on a successfully extracted filename. But if you want to treat the whole string, the pattern becomes:
^/(.*)-[0-9]+x[0-9]+([^/x]*\.[a-z]+)$
If you want to split the filename and the folder name:
^/(.*/)([^/]+)-[0-9]+x[0-9]+([^/x]*)(\.[a-z]+)$
^/(.*/)([^/]+)-\d+x\d+([^/x]*)(\.[a-z]+)$
$folder=$1;
$filename="$2$3$4";
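As a rough PHP sketch of this group-extraction idea, the two helpers below strip the last size and optionally split folder from filename. The function names are hypothetical, and the patterns are a variant that also consumes the hyphen in front of the size, on the assumption that the expected output from the question is wanted. They inherit the limitation noted above: an "x" between the size and the extension prevents a match.

```php
<?php
// Hypothetical helper: drop the last "-<width>x<height>" before the extension.
function stripTrailingSize(string $path): string
{
    return preg_replace('~^(.*)-\d+x\d+([^/x]*\.[a-z]+)$~i', '$1$2', $path);
}

// Hypothetical helper: same idea, but return folder and filename separately.
function splitAndStripSize(string $path): ?array
{
    if (!preg_match('~^(.*/)([^/]+)-\d+x\d+([^/x]*)(\.[a-z]+)$~i', $path, $m)) {
        return null; // no size found in the filename
    }
    return ['folder' => $m[1], 'filename' => $m[2] . $m[3] . $m[4]];
}

echo stripTrailingSize('/wp-content/uploads/UI-paths-800x800-1.jpg'), "\n";
print_r(splitAndStripSize('/wp-content/uploads/buddha_-800x600-2-800x600.jpg'));
```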

split a value into two and then reverse the value in php

I have a value like 73b6424b. I want to split it into two parts, 73b6 and 424b, then swap the two parts and concatenate them to get 424b73b6. I have already done it this way:
$substr_device_value = '73b6424b';
$first_value = substr($substr_device_value, 0, 4);
$second_value = substr($substr_device_value, 4, 4);
$final_value = $second_value.$first_value;
I am looking for an easier way than what I have done. Is it possible? If yes, please suggest an approach.

You may use
preg_replace('~^(.{4})(.{4})$~', '$2$1', $s)
Details
^ - matches the string start position
(.{4}) - captures any 4 chars into Group 1 ($1)
(.{4}) - captures any 4 chars into Group 2 ($2)
$ - end of string.
The '$2$1' replacement pattern swaps the values.
NOTE: If you want to pre-validate the data before swapping, you may replace . pattern with a more specific one, say, \w to only match word chars, or [[:alnum:]] to only match alphanumeric chars, or [0-9a-z] if you plan to only match strings containing digits and lowercase ASCII letters.
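A minimal sketch of the pre-validated variant from the note, assuming the value is always 8 hexadecimal characters (the function name swapHalves and the [0-9a-f] class are illustrative choices, not from the answer):

```php
<?php
// Hypothetical helper: swap the two 4-character halves, or return null
// when the input is not exactly 8 hex characters.
function swapHalves(string $s): ?string
{
    if (!preg_match('~^([0-9a-f]{4})([0-9a-f]{4})$~i', $s, $m)) {
        return null;
    }
    return $m[2] . $m[1];
}

var_dump(swapHalves('73b6424b')); // string(8) "424b73b6"
var_dump(swapHalves('73b6'));     // NULL
```

Returning null instead of the unchanged input makes a failed validation visible to the caller, which preg_replace() alone would not do.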

preg - Difference between Search Patterns with [] and without

It seems I am not able to understand something very basic with preg regex Patterns in PHP.
What is the difference between these Regex Patterns:
\b([A-Z...]...)
[\b]{1}([A-Z...]...)
The Pattern should start with a word boundary, but why is the result different, when I put it in []{1} ??
The first one works like I expected, but the second not. The problem is, that I want to put more into the [], so that the pattern can start with a word boundary OR a small character [a-z].
Thank you!
Example Text:
Race1529/05/201512:45K4 Senior Men 1000m
LaneName(s)NFBib(s)TimeRank250m500m750m
152
Martin SCHUBERT / Lukas REUSCHENBACH155
11
153
151Kostja STROINSKI / Kai SPENNER
03:07.740
GER
8
I want to find the names of the racers. Sometimes they have a word-break (\b) at the beginning, sometimes not. (But i need the word-break.)
$pattern = '#\b(['.$GB.$KB.'\s\-]{2,40})\s(['.$GB.'\'\-\s]{2,40})[0-9]{0,5}#';
($GB is a variable with all Uppercase Letters, $KB with lower case letters)
preg_match_all gives me all racers where the Name has a word-break at the beginning. (In this example Schubert, Reuschenbach, Spenner) but of course not Stroinski. So, I try this:
$pattern = '#[\b0-9]+(['.$GB.$KB.'\s\-]{2,40})\s(['.$GB.'\'\-\s]{2,40})[0-9]{0,5}#';
This does not work. Even if I remove the 0-9 and only put [\b]{1} at the beginning, it doesn't find any hit.
I don't see the difference between \b and [\b]{1}. It seems to be a very basic misunderstanding.

The [\b] is a character class that only matches a backspace char (\u0008).
See PHP regex reference:
note that "\b" has a different meaning, namely the backspace character, inside a character class
Also, .{1} is equivalent to . on its own: the {1} limiting quantifier is always redundant and only makes sense when your patterns are built dynamically from variables.
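The difference can be checked directly, and "word boundary OR preceded by a digit" can be written as an alternation with a lookbehind instead of a character class. The name pattern below is a deliberately simplified stand-in for the one in the question (plain ASCII letters only), so treat it as an assumption.

```php
<?php
// [\b] is a character class containing the backspace character (0x08).
$matchesBackspace = preg_match('/[\b]/', "\x08"); // 1
$matchesInText    = preg_match('/[\b]/', 'a b');  // 0: there is no backspace in the text
$boundaryInText   = preg_match('/\b/', 'a b');    // 1: outside a class, \b is a word boundary

// An assertion such as \b cannot live inside [...]; use an alternation instead.
$pattern = '/(?:\b|(?<=\d))([A-Z][a-z]+)\s([A-Z]+)/';
preg_match_all($pattern, '151Kostja STROINSKI / Kai SPENNER', $m);

print_r($m[1]); // first names:  Kostja, Kai
print_r($m[2]); // family names: STROINSKI, SPENNER
```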

php regex negative lookahead

I have a dictionary of 4 letter words. I want to write a regex to go through the dictionary and matches all words given a set of letters.
Suppose I pass in a,b,l,l. I want to find all words with exactly those letters.
I know I could do /[abl]{4}/ but that will also match words with 2 a's or 2 b's.
I feel like I need to do a negative look ahead. Something like:
[l|(ab)(?!\1)]{4}
The attempt here is that I want a word that starts with l or a or b and not followed by a or b.

First, you need to anchor your pattern to describe where the string begins and ends:
for a whole string (^ start of the string, $ end of the string):
^[abl]{4}$
or to find words in a larger text, use word-boundaries (limit between a character from [A-Za-z0-9_] and something else):
\b[abl]{4}\b
Then you need to say that l must occur two times (or that a and b must each occur only once, but that is more complicated):
for a whole string:
^(?=.*l.*l)[abl]{4}$
in a larger text:
\b(?=\w*l\w*l)[abl]{4}\b
To avoid two a's or two b's, you can use another lookahead:
for a whole string:
^(?=.*l.*l)(?=l*al*b|l*bl*a)[abl]{4}$
in a larger text:
\b(?=\w*l\w*l)(?=l*al*b|l*bl*a)[abl]{4}\b
About [l|(ab)(?!\1)]: in a character class, special regex characters and sequences of characters lose their special meaning and are seen as literals. So [l|(ab)(?!\1)] is essentially the same as [)(!|?abl] plus whatever \1 stands for. (Inside a character class \1 is not a backreference; PCRE reads it as an octal escape, that is, the character with code 1.)
Note that with several constraints the pattern quickly becomes ugly. You should consider another approach: catch all words with \b[abl]{4}\b and filter them in a second step (using count_chars, for example).
$str ='abll labl ball aabl lblabla 1234';
$dict = 'abll';
$count = count_chars($dict);
$result = [];
if (preg_match_all('~\b[abl]{4}\b~', $str, $matches)) {
$result = array_filter($matches[0], function ($i) use ($count) {
return $count == count_chars($i);
});
}
print_r($result);

If you want to specify the letters dynamically and then generate a regexp that does all the work, this will be very expensive.
Simple approach: generate a simple regexp like /^[abl]{4}$/, get all the words from the dictionary that match it, and then validate each word separately by checking the letter counts.
More efficient approach: index the words in your dictionary by their sorted list of letters, like this:
word: apple | index: aelpp
word: pale | index: aelp
And so on. To get all the words for a list of letters, simply sort those letters and look for an exact match with the "index" value.
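The sorted-letters index could be sketched like this in PHP. The function names letterKey and buildIndex and the sample word list are made up for illustration.

```php
<?php
// Hypothetical helper: the "index" value of a word is its letters in sorted order.
function letterKey(string $word): string
{
    $letters = str_split($word);
    sort($letters);
    return implode('', $letters);
}

// Hypothetical helper: group dictionary words by their sorted-letter key.
function buildIndex(array $words): array
{
    $index = [];
    foreach ($words as $word) {
        $index[letterKey($word)][] = $word;
    }
    return $index;
}

$index = buildIndex(['ball', 'bell', 'pale', 'leap', 'apple']);

// Look up every word made of exactly the letters a, b, l, l
print_r($index[letterKey('abll')] ?? []); // ball
```

The index is built once; after that each lookup is a single array access, with no regex involved.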

Edit: So for 47 letters it would be
\b(?:((?(1)(?!))l1)|((?(2)(?!))l2)|...|((?(47)(?!))l47)){47}\b
Letters can be duplicates, say 4 a's and 15 r's (but no more), etc ...
( immune to permutations )
To match out-of-order items only once, use a conditional that allows each item to match once, but no more.
It's not complicated, and it is immune to permutations.
Works every time!
\b(?:((?(1)(?!))a)|((?(2)(?!))b)|((?(3)(?!))l)|((?(4)(?!))l)){4}\b
Expanded
\b
(?:
( # (1)
(?(1)(?!))
a
)
|
( # (2)
(?(2)(?!))
b
)
|
( # (3)
(?(3)(?!))
l
)
|
( # (4)
(?(4)(?!))
l
)
){4}
\b

Search a String for Alpha Numeric Characters in a Pattern

I have a string that contains 5 words. One of the words is a Ham Radio Call Sign and can be any one of the thousands of call signs in the US. To extract the Call Sign from the string I need to use the patterns below. The Call Sign can be in any of the 5 positions in the string. The number is never the first character and never the last character. The string is actually put together from an array, since it is originally read from a text file.
$string = $word[1] $word[2] $word[3] etc....
So the search can be either done on the whole string or each piece of the array.
Patterns:
1 Number and 3 Letters Example: AB4C A4BC
1 Number and 4 Letters Example: A4BCD
1 Number and 5 Letters Example: AB4CDE
I have tried everything I can think of and searched until I can't search any more. I am sure I am overthinking this.

A two-step regular expression like this would do it:
$str = "hello A4AB there BC5AD";
$signs = array();
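// Step 1: grab whole words of 4 to 6 characters that start and end with a capital letter
preg_match_all('/\b[A-Z][A-Z\d]{2,4}[A-Z]\b/', $str, $possible_signs);
foreach ($possible_signs[0] as $possible_sign) {
    // Step 2: keep only the candidates that contain exactly one digit
    if (preg_match('/^\D+\d\D+$/', $possible_sign)) {
        $signs[] = $possible_sign;
    }
}
print_r($signs); // Array ( [0] => A4AB [1] => BC5AD )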
Explanation
This is a regular expression approach, using two patterns. I don't think it could be done with one and still satisfy the exact requirements of the matching rules.
The first pattern enforces the following requirements:
substring starts and ends with a capital letter
substring contains only other capital letters or numbers between the first and last letter
substring is, overall, not more than 6 characters long
What I can't do in that same pattern, for complex REGEX reasons I won't go into (unless someone knows a way and can correct me), is enforce that only one number is contained.
@jeroen's answer does enforce this in a single pattern, but in turn does not enforce the correct length of the substring. Either way, we need a second pattern.
So after grabbing the initial matches, we loop over the results. We then apply each to a second pattern that enforces simply that there is only one number in the substring.
If so, we green-light the substring and it's added to the $signs array.
Hope this helps.
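One possible way to fold both checks into a single pattern is a lookahead that fixes the overall length before the "letters, one digit, letters" part is matched. This is only a sketch, assuming call signs are whole words of 4 to 6 uppercase letters and digits; the function name findCallSigns is hypothetical.

```php
<?php
// Hypothetical helper: return every word that looks like a call sign.
function findCallSigns(string $text): array
{
    // (?=[A-Z\d]{4,6}\b) : the whole word is 4 to 6 uppercase letters/digits
    // [A-Z]+\d[A-Z]+     : exactly one digit, never first and never last
    $pattern = '/\b(?=[A-Z\d]{4,6}\b)[A-Z]+\d[A-Z]+\b/';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

print_r(findCallSigns('hello A4AB there BC5AD')); // A4AB, BC5AD
```

The lookahead checks the length without consuming characters, so the main part of the pattern is free to enforce the single-digit rule on the same word.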

It depends on what the other words can contain, but you could use a regular expression like:
#\b[a-z]+\d[a-z]+\b#i
\b a word boundary
[a-z]+ one or more letters
\d one number
i case insensitive
You can make it more restrictive by using {1,3} instead of + for the letters so that you have a sequence of 1 to 3 letters.
The complete expression would be something like:
$success = preg_match('#\b[a-z]+\d[a-z]+\b#i', $input_string, $matches);
where $matches[0] will contain the matched value, see the manual.
