PHP (preg_replace) regex strip image sizes from filename - php

I'm working on an open-source plugin for WordPress and am facing an odd issue.
Consider the following filenames:
/wp-content/uploads/buddha_-800x600-2-800x600.jpg
/wp-content/uploads/cutlery-tray-800x600-2-800x600.jpeg
/wp-content/uploads/custommade-wallet-800x600-2-800x600.jpeg
/wp-content/uploads/UI-paths-800x800-1.jpg
The current regex I have:
(-[0-9]{1,4}x[0-9]{1,4}){1}
This will remove both matches from the filename, for example buddha_-800x600-2-800x600.jpg will become buddha_-2.jpg which is invalid.
I have tried a variety of regex:
.*(-\d{1,4}x\d{1,4}) // will strip out everything
(-\d{1,4}x\d{1,4}){1}|.*(-\d{1,4}x\d{1,4}){1} // same as above
(-\d{1,4}x\d{1,4}){1}|(-\d{1,4}x\d{1,4}){1} // will strip out all size matches
Unfortunately, my knowledge of regex is quite limited. Can someone advise how to achieve the goal, please?
The goal is to remove only what is relevant, which would result in:
/wp-content/uploads/buddha_-800x600-2.jpg
/wp-content/uploads/cutlery-tray-800x600-2.jpeg
/wp-content/uploads/custommade-wallet-800x600-2.jpeg
/wp-content/uploads/UI-paths-1.jpg
Much appreciated!

You can use a capture group with a backreference to match strings where there are 2 of the same parts and replace that with a single part.
Or match the dimensions to be removed.
((-\d+x\d+)-\d+)\2|-\d+x\d+
( Capture group 1
(-\d+x\d+) Capture group 2, match - 1+ digits x and 1+ digits
-\d+ Match - and 1+ digits
)\2 Close group 1, followed by a backreference to what is captured in group 2
| Or
-\d+x\d+ Match the dimensions format
Regex demo | Php demo
For example
$pattern = '~((-\d+x\d+)-\d+)\2|-\d+x\d+~';
$strings = [
"/wp-content/uploads/buddha_-800x600-2-800x600.jpg",
"/wp-content/uploads/cutlery-tray-800x600-2-800x600.jpeg",
"/wp-content/uploads/custommade-wallet-800x600-2-800x600.jpeg",
"/wp-content/uploads/UI-paths-800x800-1.jpg",
];
foreach ($strings as $s) {
echo preg_replace($pattern, '$1', $s) . PHP_EOL;
}
Output
/wp-content/uploads/buddha_-800x600-2.jpg
/wp-content/uploads/cutlery-tray-800x600-2.jpeg
/wp-content/uploads/custommade-wallet-800x600-2.jpeg
/wp-content/uploads/UI-paths-1.jpg

I would try something like this. You can test it yourself. Here is the code:
$a = [
'/wp-content/uploads/buddha_-800x600-2-800x600.jpg',
'/wp-content/uploads/cutlery-tray-800x600-2-800x600.jpeg',
'/wp-content/uploads/custommade-wallet-800x600-2-800x600.jpeg',
'/wp-content/uploads/UI-paths-800x800-1.jpg'
];
foreach($a as $img)
echo preg_replace('#-\d+x\d+((-\d+|)\.[a-z]{3,4})#i', '$1', $img).'<br>';
It checks for a trailing -(number)x(number), optionally followed by -(number), right before the (dot)(extension).

This is a clear case of « Match the rejection, revert the match ».
So, you just have to think about the pattern you are searching to remove:
[0-9]+x[0-9]+
or, written more compactly:
\d+x\d+
The next step is to build the groups extractor:
^(.*[^0-9])[0-9]+x[0-9]+([^x]*\.[a-z]+)$
We added the extension of the file as a suffix for the extract.
The rejection of the "x" char is a (bad…) trick to ensure the match of the last size only. It won’t work in the case of an alphanumeric suffix between the size and the extension (toto-800x1024-ex.jpg for instance).
And then, the replacement string:
$1$2
For clarity, the patterns above assume the filename has already been extracted from the path. If you want to handle the whole path, the pattern becomes:
^/(.*[^0-9])[0-9]+x[0-9]+([^/x]*\.[a-z]+)$
If you want to split the filename and the folder name:
^/(.*/)([^/]+[^0-9])[0-9]+x[0-9]+([^/x]*)(\.[a-z]+)$
^/(.*/)([^/]+\D)\d+x\d+([^/x]*)(\.[a-z]+)$
$folder=$1;
$filename="$2$3$4";
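A minimal PHP sketch of this capture-and-rebuild approach, for illustration only. The helper name stripLastSize is hypothetical. One assumption differs from the patterns above: as written, group 1 ends with the hyphen that precedes the size, so this variant moves that hyphen out of the group and removes it together with the dimensions.

```php
<?php
// Hypothetical helper: removes only the last "-WIDTHxHEIGHT" part of a path.
function stripLastSize(string $path): string
{
    // $1 = everything before the last size (the greedy .* finds the last one),
    // $2 = optional suffix plus the extension.
    // [^/x]* keeps the suffix inside the file name and, like the answer's
    // pattern, relies on the suffix not containing an "x".
    $pattern = '~^(.*)-\d+x\d+([^/x]*\.[a-z]+)$~i';

    return preg_replace($pattern, '$1$2', $path);
}

$paths = [
    '/wp-content/uploads/buddha_-800x600-2-800x600.jpg',
    '/wp-content/uploads/cutlery-tray-800x600-2-800x600.jpeg',
    '/wp-content/uploads/UI-paths-800x800-1.jpg',
];

foreach ($paths as $path) {
    echo stripLastSize($path), PHP_EOL;
}
```

Paths that contain no size are returned unchanged, because preg_replace() leaves the subject alone when the pattern does not match.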

Related

Formatting camel case to readable in PHP while skipping abbreviations

So I am stuck. I have looked at tons of answers on here, but none seem to resolve my last problem.
Through an API with JSON, I receive an equipment list in a camelcase format. I can not change that.
I need this camelcase to be translated into normal language -
So far I have gotten most words separated through:
$string = "SomeEquipmentHere";
$spaced = preg_replace('/([A-Z])/', ' $1', $string);
var_dump($spaced);
string ' Some Equipment Here' (length=20)
$trimmed = trim($spaced);
var_dump($trimmed);
string 'Some Equipment Here' (length=19)
This works fine, but some of the equipment names contain abbreviations:
"ABSBrakes" - this would require ABS and separated from Brakes
I can't check for several uppercases next to each other since it will then keep ABS and Brakes together - there are more like these, ie: "CDRadio"
So what I want is for the output to be:
"ABS Brakes"
Is there a way to format it so that, if there are uppercase letters next to each other, a space is only added before the last uppercase letter of that sequence?
I am not strong in regex.
EDIT
Both contributions are awesome - people coming here later should read both answers
The remaining problems are the following patterns:
"ServiceOK" becomes "Service O K"
"ESP" becomes "ES P"
Strings consisting purely of an uppercase abbreviation are handled by a function that counts lowercase letters; if there are none, it skips the preg_replace() call.
But as Flying wrote in the comments on his answer, there could potentially be a lot of instances not covered by his regex, and an answer could be impossible - I don't know if this could be a challenge for the regex.
Possibly by adding some "If there is not a lowercase after the uppercase, there should not be inserted a space" rule
Here is a single-call pattern that doesn't use any anchors, capture groups, or references in the replacement string: /(?:[a-z]|[A-Z]+)\K(?=[A-Z]|\d+)/
Pattern&Replace Demo
Code: (Demo)
$tests = [
'SomeEquipmentHere',
'ABSBrakes',
'CDRadio',
'Valve14',
];
foreach ($tests as $test) {
echo preg_replace('/(?:[a-z]|[A-Z]+)\K(?=[A-Z]|\d+)/',' ',$test),"\n";
}
Output:
Some Equipment Here
ABS Brakes
CD Radio
Valve 14
This is a better method because there is nothing to mop up. If there are new strings to consider (that break my method), please leave them in a comment so that I can update my pattern.
Pattern Explanation:
/ #start the pattern
(?:[a-z] #match 1 lowercase letter
| #or
[A-Z]+) #1 or more uppercase letters
\K #restart the fullstring match (forget the past)
(?=[A-Z] #look-ahead for 1 uppercase letter
| #or
\d+) #1 or more digits
/ #end the pattern
Edit:
There are some other patterns that may provide better accuracy including:
/(?:[a-z]|\B[A-Z]+)\K(?=[A-Z]\B|\d+)/
Granted, the above pattern will not properly handle ServiceOK
Demo Link Word Boundaries Link
or this pattern with an anchor:
/(?!^)(?=[A-Z][a-z]+|(?<=\D)\d)/
The above pattern will accurately split: SomeEquipmentHere, ABSBrakes, CDRadio, Valve14, ServiceOK, ESP as requested by the OP.
Demo Link
*Note: Pattern accuracy can be improved as more sample strings are provided.
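As a rough sketch, the anchored pattern can be wrapped in a small function and run over all six sample strings. The function name spaceCamelCase is hypothetical. With this pattern, ServiceOK and ESP come back unchanged, because a space is only inserted before an uppercase letter that is followed by lowercase letters, or before the first digit of a number.

```php
<?php
// Hypothetical wrapper around the anchored pattern from the answer.
function spaceCamelCase(string $name): string
{
    // Zero-width match at every position except the start of the string
    // that is followed by an uppercase letter plus lowercase letters,
    // or by a digit that is preceded by a non-digit.
    return preg_replace('/(?!^)(?=[A-Z][a-z]+|(?<=\D)\d)/', ' ', $name);
}

$tests = ['SomeEquipmentHere', 'ABSBrakes', 'CDRadio', 'Valve14', 'ServiceOK', 'ESP'];

foreach ($tests as $test) {
    echo spaceCamelCase($test), "\n";
}
```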
Here is how it can be solved:
$tests = [
'SomeEquipmentHere',
'ABSBrakes',
'CDRadio',
'Valve14',
];
foreach ($tests as $test) {
echo trim(preg_replace('/\s+/', ' ', preg_replace('/([A-Z][a-z]+)|([A-Z]+(?=[A-Z]))|(\d+)/', '$1 $2 $3', $test)));
echo "\n";
}
Related test on regex101.
UPDATE: Added example for additional question

Group regex with fix part

$txt = "toto1 555.4545.555.999.7465.432.674";
$rgx = "/([\w]+)\s([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)\.([0-9]+)/";
preg_match($rgx, $txt, $res);
var_dump($res);
I would like to simplify this pattern by avoiding the repeated "([0-9]+)", because I don't know how many of them there will be.
Can anyone tell me how?
Here is a direct answer to the question, as you have stated it:
/[\w]+\s[0-9]+(?:\.[0-9]+)+/
However, note that I have removed all of the numbered capture groups. This could be problematic, depending on what you're actually trying to achieve.
It is not possible to "count" with capture groups in regular expressions, so you would need to write some other code (i.e. not just one match, with one regex, and using back-references) to deal with this if you wish to run any queries like "What digits appear after the fifth "."?"
There are two ways you can do this. If you just need to verify that the string matches the pattern, this regex will do the job: \w+\s(?:[0-9]+\.?)+
However, if you need to split the string into its component parts (in my interpretation, the beginning word followed by the sequence of dot-separated numbers), then you could use this pattern: (\w+)\s((?:[0-9]+\.?)+)
The second pattern will return the beginning word, toto1 in group 1, followed by the decimal separated numbers in group 2 555.4545.555.999.7465.432.674 which you could then split in PHP if required: $sequence = explode('.', $matches[2]);
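Put together, the capture-then-explode approach might look like the sketch below. The function name splitWordAndNumbers and the array keys are hypothetical; the pattern is the second one from this answer.

```php
<?php
// Hypothetical helper: returns the leading word and the list of numbers,
// or null when the string does not match the expected shape.
function splitWordAndNumbers(string $txt): ?array
{
    if (!preg_match('/(\w+)\s((?:[0-9]+\.?)+)/', $txt, $matches)) {
        return null;
    }

    return [
        'word'    => $matches[1],
        // Group 2 holds the whole dot-separated run; split it in PHP.
        'numbers' => explode('.', $matches[2]),
    ];
}

$parts = splitWordAndNumbers('toto1 555.4545.555.999.7465.432.674');
print_r($parts);

// The number after the fifth "." is simply index 5 of the list.
echo $parts['numbers'][5], "\n";
```

Because the numbers end up in an ordinary array, questions such as "which digits come after the fifth dot" become a plain index lookup instead of a regex problem.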
What you need can be obtained with a preg_split with a regex matching 1 or more whitespaces or dots:
$txt = "toto1 555.4545.555.999.7465.432.674";
$rgx = '/[\s.]+/';
$res = preg_split($rgx, $txt);
print_r($res);
See the PHP demo
If you need a regex approach, you can use a \G based regex with preg_match_all:
'~(?|([\w]+)|(?!\A)\G[\s.]*([0-9]+))~'
See the regex demo and a PHP demo:
$txt = "toto1 555.4545.555.999.7465.432.674";
$rgx = '~(?|(\w+)|(?!\A)\G[\s.]*([0-9]+))~';
preg_match_all($rgx, $txt, $res);
print_r($res[1]);
Pattern details:
The (?|...) is a branch reset group to reset group IDs in all the branches
(\w+) - Group 1 matches 1+ word chars
| - or (then goes Branch 2)
(?!\A)\G - the end of the previous successful match
[\s.]* - zero or more whitespaces or dots
([0-9]+) - Group 1 (again!) matching 1 or more digits.

Detect cloth sizes with regex

I am trying to detect with regex, strings that have a pattern of {any_number}{x-}{large|medium|small} for a site with clothing I am building in PHP.
I have managed to match the sizes against a preconfigured set of strings by using:
$searchFor = '7x-large';
$regex = '/\b'.$searchFor.'\b/';
//Basically, it's finding the letters
//surrounded by a word-boundary (the \b bits).
//So, to find the position:
preg_match($regex, $opt_name, $match, PREG_OFFSET_CAPTURE);
I even managed to detect weird sizes like 41 1/2 with regex, but I am not an expert and I am having a hard time on this.
I have come up with
preg_match("/^(?<![\/\d])([xX\-])(large|medium|small)$/", '7x-large', $match);
but it won't work.
Could you pinpoint what I am doing wrong?
It sounds like you also want to match half sizes. You can use something like this:
$theregex = '~(?i)^\d+(?:\.5)?x-(?:large|medium|small)$~';
if (preg_match($theregex, $yourstring,$m)) {
// Yes! It matches!
// the match is $m[0]
}
else { // nah, no luck...
}
Note that the (?i) makes it case-insensitive.
This also assumes you are validating that an entire string conforms to the pattern. If you want to find the pattern as a substring of a larger string, remove the ^ and $ anchors:
$theregex = '~(?i)\d+(?:\.5)?x-(?:large|medium|small)~';
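For the substring case, a small sketch using preg_match_all() to collect every size mentioned in a longer text. The function name findSizes and the sample product text are made up for illustration.

```php
<?php
// Hypothetical helper: returns every size found inside a longer string.
function findSizes(string $text): array
{
    // Same pattern as above, without the ^ and $ anchors.
    $theregex = '~(?i)\d+(?:\.5)?x-(?:large|medium|small)~';
    preg_match_all($theregex, $text, $m);

    return $m[0]; // full matches only
}

print_r(findSizes('Shirt 7x-Large, also in 2.5X-small'));
```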
Look at the specification you have and build it up piece by piece. You want "{any_number}{x-}{large|medium|small}".
"{any_number}" would be \d+. This does not allow fractional numbers such as 12.34, but the question does not specify whether they are required.
"{x-}" is a simple string x-
"{large|medium|small}" is a choice between three alternatives large|medium|small.
Joining the pieces together gives \d+x-(large|medium|small). Note the brackets around the alternation; without them the expression would be interpreted as (\d+x-large)|medium|small.
You mention "weird sizes like 41 1/2" but without specifying how "weird" the numbers to be matched are. You need a precise specification of what you include in "weird" before you can extend the regular expression.
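The piece-by-piece construction can be sketched in PHP as follows. The variable names and the isClothSize helper are hypothetical, and the ^ and $ anchors are an added assumption that the whole string must be a size.

```php
<?php
// Build the pattern from the three parts of the specification.
$number    = '\d+';                   // {any_number}
$separator = 'x-';                    // {x-}
$size      = '(large|medium|small)';  // {large|medium|small}, bracketed on purpose

// Assumption: the entire string must be a size, hence the anchors.
$pattern = '/^' . $number . $separator . $size . '$/';

// Hypothetical helper for illustration.
function isClothSize(string $value, string $pattern): bool
{
    return preg_match($pattern, $value) === 1;
}

var_dump(isClothSize('7x-large', $pattern));
var_dump(isClothSize('medium', $pattern));

// Without the brackets, "medium" on its own is accepted by the second alternative.
var_dump(preg_match('/^\d+x-large|medium|small$/', 'medium') === 1);
```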

preg_replace or regex string translation

I found some partial help but cannot seem to fully accomplish what I need. I need to be able to do the following:
I need a regular expression to replace any 1 to 3 character words between two words that are longer than 3 characters with a match-anything expression:
For example:
walk to the beach ==> walk(.*)beach
If the 1 to 3 character word is not preceded by a word that's longer than 3 characters then I want to translate that 1 to 3 letter word to '<word> ?'
For example:
on the beach ==> on ?the ?beach
The simpler the rule the better (of course, if there's a more complicated alternative that performs better, I'll take that as well, since I anticipate heavy usage eventually).
This will be used in a PHP context most likely with preg_replace. Thus, if you can put it in that context then even better!
By the way so far I have got the following:
$string = preg_replace('/\s+/', '(.*)', $string);
$string = preg_replace('/\b(\w{1,3})(\.*)\b/', '${1} ?', $string);
but that results in:
walk to the beach ==> 'walk(.*)to ?beach'
which is not what I want. 'on the beach' seems to translate correctly.
I think you will need two replacements for that. Let's start with the first requirement:
$str = preg_replace('/(\w{4,})(?: \w{1,3})* (?=\w{4,})/', '$1(.*)', $str);
Of course, you need to replace those \w (which match letters, digits and underscores) with a character class of what you actually want to treat as a word character.
The second one is a bit tougher, because matches cannot overlap and lookbehinds cannot be of variable length. So we have to run this multiple times in a loop:
do
{
$str = preg_replace('/^\w{0,3}(?: \w{0,3})* (?!\?)/', '$0?', $str, -1, $count);
} while($count);
Here we match everything from the beginning of the string, as long as it's only up-to-3-letter words separated by spaces, plus one trailing space (only if it is not already followed by a ?). Then we put all of that back in place, and append a ?.
Update:
After all the talk in the comments, here is an updated solution.
After running the first line, we can assume that the only words of three letters or fewer that are left will be at the beginning or at the end of the string. All others will have been collapsed to (.*). Since you want to append ? to every space between those, you do not even need a loop (in fact these are the only spaces left):
$str = preg_replace('/ /', ' ?', $str);
(Do this right after my first line of code.)
This would give the following two results (in combination with the first line):
let us walk on the beach now go => let ?us ?walk(.*)beach ?now ?go
let us walk on the beach there now go => let ?us ?walk(.*)beach(.*)there ?now ?go
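The two steps of the updated solution can be combined into one function, sketched below. The name toLoosePattern is hypothetical; both preg_replace() calls are the ones from this answer.

```php
<?php
// Hypothetical helper combining the two replacements from the answer.
function toLoosePattern(string $str): string
{
    // Step 1: collapse short words between two long words into (.*).
    $str = preg_replace('/(\w{4,})(?: \w{1,3})* (?=\w{4,})/', '$1(.*)', $str);

    // Step 2: every space that is left gets an optional marker.
    return preg_replace('/ /', ' ?', $str);
}

$samples = [
    'walk to the beach',
    'on the beach',
    'let us walk on the beach there now go',
];

foreach ($samples as $sample) {
    echo $sample, ' => ', toLoosePattern($sample), "\n";
}
```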

Regex: Rename Files

I am trying to rename a bunch of image files.
They are named inconsistently; however, there is some logic to the names:
They all start with an Id number
After the Id there may be some of the following (Items To Be Removed):
a space
2 letters
a dash -
These will appear in various orders and sometimes more than once, for the space or dash.
The filenames may have any of these items but not necessarily all of them.
Some filenames do have all 3 items.
They may have an additional _ after this
Then they may have a number {Index}
Finally they end in .ext where ext = jpg|png|gif...
Here are some example filenames:
1227.jpg
1227_1.jpg
2200 WH-1.jpg
2200WH 2.jpg
2200 WH2.jpg
2201_BK 1.png
2203 RD_1.jpg
I am trying to remove/replace the mentioned items so the filenames are as follows:
ID.ext or ID_{index}.ext
So the above list would turn into:
1227.jpg
1227_1.jpg
2200_1.jpg
2200_2.jpg
2201_1.png
2203_1.jpg
I have tried writing a few expressions but am a little stumped on this one.
I am working on a PHP project though other languages would be fine for this script.
Pattern: /^\d+\K[-a-z_ ]+/i
Replace: _
(Pattern Demo)
Basically only match when there are one or more characters between the id and the index. Simple.
/ #pattern delimiter
^ #start of string
\d+ #one or more digits
\K #restart fullstring match so that the fullstring match is replaced
[-a-z_ ]+ #match one or more hyphens, letters, underscores, or spaces
/ #pattern delimiter
i #make the pattern case-insensitive
Code: (Demo)
$images=['1227.jpg','1227_1.jpg','2200 WH-1.jpg','2200WH 2.jpg','2200 WH2.jpg','2201_BK 1.png','2203 RD_1.jpg'];
var_export(preg_replace('/^\d+\K[-a-z_ ]+/i','_',$images));
Output:
array (
0 => '1227.jpg',
1 => '1227_1.jpg',
2 => '2200_1.jpg',
3 => '2200_2.jpg',
4 => '2200_2.jpg',
5 => '2201_1.png',
6 => '2203_1.jpg',
)
Question extension solution: (Demo) (Demo)
You can do it with two patterns and replacements on a single preg_replace() call or you can use preg_replace() then str_replace() to mop up the dangling underscores. This will come down to personal coding preference. (It could also be done with a preg_replace_callback() that checks if there is an index number in the image name before adding the underscore, but that will make a more convoluted snippet.)
Codes:
$images=['1227.jpg','1227_1.jpg','2200 WH-1.jpg','2200WH 2.jpg','2200 WH2.jpg','2201_BK 1.png','2203 RD_1.jpg','2200 WH.jpg','3000_01.jpg'];
foreach($images as $image){
echo str_replace('_.','.',preg_replace('/^\d+\K[-a-z_ ]+0*/i','_',$image)),"\n";
}
Or
$images=['1227.jpg','1227_1.jpg','2200 WH-1.jpg','2200WH 2.jpg','2200 WH2.jpg','2201_BK 1.png','2203 RD_1.jpg','2200 WH.jpg','3000_01.jpg'];
foreach($images as $image){
echo preg_replace(['~^\d+\K[-a-z_ ]+0*~i','~_\.~'],['_','.'],$image),"\n";
}
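For comparison, the preg_replace_callback() variant mentioned above could be sketched like this. The function name normalizeImageName and the exact pattern are assumptions made for illustration, not part of the original answer. The callback only adds the underscore when an index number was actually captured.

```php
<?php
// Hypothetical helper: rewrites "ID<junk><index>.ext" to "ID_<index>.ext",
// or to "ID.ext" when there is no index.
function normalizeImageName(string $image): string
{
    return preg_replace_callback(
        // $m[1] = id, then the unwanted characters and leading zeros,
        // $m[2] = index (may be empty)
        '~^(\d+)[-a-z_ ]*0*(\d*)~i',
        function (array $m): string {
            return $m[2] === '' ? $m[1] : $m[1] . '_' . $m[2];
        },
        $image
    );
}

$images = ['1227.jpg', '2200 WH-1.jpg', '2201_BK 1.png', '2200 WH.jpg', '3000_01.jpg'];

foreach ($images as $image) {
    echo normalizeImageName($image), "\n";
}
```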
I would do it with the following pattern:
(\d{4})([^0-9.]*)(\d\.)
And with a substitution of $1_$3.
Step by step:
(\d{4}) - Check for the 1st 4 digits.
([^0-9.]*) - Check for everything that is not a number or a period after the ID.
(\d\.) - Check for ending number and period before extension (This is so we can properly place the underscore)
Adding the substitution means that the 4 digit number will be added to the beginning, all non-number (or period) characters will be removed, and an underscore will be added between the $1 and whatever is left. If there is nothing after the ID, no underscore will be added, then the period is added inside the substitution as well.
You can view this on Regex101 for a very detailed step-by-step of what is going on.
In PHP this would be:
preg_replace('/(\d{4})([^0-9.]*)(\d\.)/', '$1_$3', $string);
Output:
1227.jpg
1227_1.jpg
2200_1.jpg
2200_2.jpg
2200_2.jpg
2201_1.png
2203_1.jpg
Not a PHP person but the regular expression I would use is:
/(\d+).*?(\d?)\.(.*)/
This will capture the first set of numbers, skip the middle part, capture the number on the end if present, then capture the file extension.
Then in Ruby I would do the following:
id, index, extension = my_file_name.match(/(\d+).*?(\d?)\.(.*)/).captures
new_name = id.to_s
new_name += "_#{index}" unless index.empty?
new_name += ".#{extension}"
