Multiple regular expressions in PHP - php

Is there a better way to handle running multiple regular expressions in PHP (or in general)?
I have the code below to replace non-alphanumeric characters with dashes. Once the replacement has happened, I want to collapse any runs of multiple dashes into a single dash.
$slug = preg_replace('/[^A-Za-z0-9]/i', '-', $slug);
$slug = preg_replace('/\-{2,}/i', '-', $slug);
Is there a neater way to do this, i.e. set up a single regex that replaces one pattern and then the other?
(I'm like a kid with a fork in a socket when it comes to regex)

You can eliminate the second preg_replace by saying what you really mean in the first one:
$slug = preg_replace('/[^a-z0-9]+/i', '-', $slug);
What you really mean is "replace all sequences of one or more non-alphanumeric characters with a single hyphen", and that's what /[^a-z0-9]+/i does. Also, you don't need to list both upper- and lower-case letters when the regex is case-insensitive, so I took that out.
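A quick way to convince yourself that the single pattern does the same job as the two-step version is to run both on the same input. This is only a sketch: the sample string and the variable names $twoStep and $oneStep are made up for illustration.

```php
<?php
// Hypothetical sample input, chosen only for illustration.
$input = 'Hello,  World! -- PHP & regex';

// Two-step approach from the question: replace each character, then collapse runs.
$twoStep = preg_replace('/[^A-Za-z0-9]/i', '-', $input);
$twoStep = preg_replace('/\-{2,}/i', '-', $twoStep);

// Single pass: each run of one or more non-alphanumerics becomes one hyphen.
$oneStep = preg_replace('/[^a-z0-9]+/i', '-', $input);

echo $oneStep, "\n";               // Hello-World-PHP-regex
var_dump($twoStep === $oneStep);   // bool(true)
```

If the input can start or end with punctuation, both versions leave a leading or trailing hyphen; trim($oneStep, '-') is one way to drop it.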

No. What you have is the appropriate way to deal with this problem.
Consider it from this angle: regular expressions are meant to find a pattern (a single pattern) and deal with it somehow. As such, by trying to deal with more than one pattern at a time, you're only giving yourself headaches. It's best as is, for everyone involved.

If $slug doesn't already contain multiple hyphens, you can avoid the second preg_replace call by writing the first one like this:
$slug = preg_replace('/[^a-z0-9]+-?/i', '-', $slug);
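This finds a run of non-alphanumeric characters, optionally followed by a hyphen, and replaces the matched text with a single hyphen, so there is no need for a second preg_replace call. Note that a hyphen is itself non-alphanumeric, so [^a-z0-9]+ already consumes it and the trailing -? is redundant.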
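Since a hyphen is not in a-z0-9, the negated class matches hyphens too, which suggests the optional -? adds nothing. A small check, assuming a made-up input that mixes spaces and hyphens (the variable names are illustrative only):

```php
<?php
// Hypothetical input that mixes spaces and hyphens.
$input = 'foo - bar--baz';

// Pattern from this answer, with the optional trailing hyphen.
$withOptionalHyphen = preg_replace('/[^a-z0-9]+-?/i', '-', $input);

// Same pattern without the optional hyphen.
$plain = preg_replace('/[^a-z0-9]+/i', '-', $input);

echo $withOptionalHyphen, "\n";             // foo-bar-baz
var_dump($withOptionalHyphen === $plain);   // bool(true)
```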

Related

preg_replace pattern to remove pNUMBERxNUMBER

I'm trying to locate a pattern with preg_replace() and remove it.
I have a string that contains this: p130x130/. The numbers vary; they can be higher or lower. What I need to do is locate that substring and remove the whole thing.
I've been trying to use this:
preg_replace('/p+[0-9]+x+[0-9]"/', '', $str);
but that doesn't work for some reason. Would any of you know the correct regexp?
Kind regards
You need to first remove the + quantifier after p, then move the + quantifier from after x to after the character class that follows it (e.g. x[0-9]+). Also remove the quote " inside your expression, which looks like a typo. You can also use a different delimiter to avoid escaping the ending slash.
$str = preg_replace('~p[0-9]+x[0-9]+/~', '', $str);
If the ending slash is a typo as well, then this is what you're looking for:
$str = preg_replace('/p[0-9]+x[0-9]+/', '', $str);
Regex to match p130x130/ is,
p[0-9]+x[0-9]+\/
Try this:
$str = preg_replace("/p[0-9]+?x[0-9]+?\//is","",$str);
As mentioned by the comment I have to explain the code as I'm a teacher now.
I've used "/" as a delimiter, but you can use a different character to avoid having to escape slashes.
The part that says [0-9]+ matches any digit from 0 to 9 at least once, and more if possible. If I had put [0-9]*? then it would have matched an empty string too (as * means to match 0 or more, not 1 or more like +), which is probably not what you wanted anyway.
I've put the ? at the end to make it non-greedy, just a habit of mine but I don't think it's needed. (I used ereg a lot previously).
Anyway, it's going to find 0-9 until it hits an x, and then it does another match for more numbers until it hits a single forward slash. I've backslashed that slash because my delimiter is a slash also and I didn't want it to end there.
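To see the suggested patterns in action, here is a small sketch. The URL is a made-up example; the only part taken from the question is the p130x130/ segment. It also compares the greedy and non-greedy variants, which should agree here because the literal x and / pin down where each run of digits ends.

```php
<?php
// Hypothetical URL-like string containing the size segment to strip.
$str = 'https://example.com/images/p130x130/photo.jpg';

// '~' as the delimiter means the trailing slash needs no escaping.
$greedy = preg_replace('~p[0-9]+x[0-9]+/~', '', $str);

// Non-greedy quantifiers, as in the last answer.
$lazy = preg_replace('~p[0-9]+?x[0-9]+?/~', '', $str);

echo $greedy, "\n";              // https://example.com/images/photo.jpg
var_dump($greedy === $lazy);     // bool(true)
```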

Regex word boundary alternative

I was using the standard \b word boundary. However, it doesn't quite deal with the dot (.) character the way I want it to.
So the following regex:
\b(\w+)\b
will match cats and dogs in cats.dogs if I have a string that says cats and dogs don't make cats.dogs.
I need a word boundary alternative that will match a whole word only if:
it does not contain the dot(.) character
it is encapsulated by at least one space( ) character on each side
Any ideas?!
P.S. I need this for PHP
You could try using (?<=\s) before and (?=\s) after in place of the \b to ensure that there is a space before and after the word. However, you might also want to allow for the word being at the start or end of the string, with (?<=\s|^) and (?=\s|$).
This will automatically exclude "words" with a . in them, but it would also exclude a word at the end of a sentence since there is no space between it and the full stop.
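A minimal sketch of the lookaround version, run against the sentence from the question. The variable names are illustrative; note that don't is skipped as well, because the apostrophe is neither a word character nor whitespace.

```php
<?php
// Sentence from the question.
$text = "cats and dogs don't make cats.dogs";

// A word must be preceded by whitespace or the start of the string,
// and followed by whitespace or the end of the string.
preg_match_all('/(?<=\s|^)\w+(?=\s|$)/', $text, $matches);

print_r($matches[0]);   // cats, and, dogs, make
```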
What you are trying to match can be done easily with array and string functions.
$parts = explode(' ', $str);
$res = array_filter($parts, function($e){
    return $e !== "" && strpos($e, ".") === false;
});
I recommend this method as it saves time; spending a few hours hunting for a good regex solution is quite unproductive.

PHP Regex Help (converting from preg_match_all to preg_replace)

I'm having a bit of difficulty converting some regex from being used in preg_match_all to being used in preg_replace.
Basically, via regex only, I would like to match uppercase characters that are preceded by either a space, the beginning of the text, or a hyphen. This is not a problem; I have the following for this, which works well:
preg_match_all('/(?<= |\A|-)[A-Z]/',$str,$results);
echo '<pre>' . print_r($results,true) . '</pre>';
Now, what I'd like to do, is to use preg_replace to only return the string with the uppercase characters that match my criteria above. If I port the regex straight into preg_replace, then it obviously replaces the characters I want to keep.
Any help would be much appreciated :)
Also, I'm fully aware regex isn't the best solution for this in terms of efficiency, but nonetheless I would like to use preg_replace.
According to De Morgan's laws,
if you want to keep letters that are
A-Z, and
preceded by [space], \A, or -
then you'd want to remove characters that are
not A-Z, or
not preceded by [space], \A, or -
Perhaps this (replace match with empty string)?
/[^A-Z]|(?<! |\A|-)./
See example here.
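The De Morgan reasoning can be checked by comparing what preg_match_all keeps with what the inverted pattern leaves behind. This is a sketch on a made-up input; McDonald is included so that one uppercase letter (the D) is not preceded by a space, the start of the text, or a hyphen.

```php
<?php
// Hypothetical input; the "D" in McDonald should NOT be kept.
$str = 'Hello McDonald World-Wide';

// Original approach: collect the characters to keep.
preg_match_all('/(?<= |\A|-)[A-Z]/', $str, $results);
$kept = implode('', $results[0]);

// Inverted approach: remove everything that does not meet the criteria.
$stripped = preg_replace('/[^A-Z]|(?<! |\A|-)./', '', $str);

echo $stripped, "\n";            // HMWW
var_dump($kept === $stripped);   // bool(true)
```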
I think it will be something like this:
$sString = preg_replace('#.*?(?<= |\A|-)([A-Z])([a-z]+)#m',"$1", $sString);

Splitting large strings into words in php

I have a long string in PHP consisting of several paragraphs, each with several sentences (it is pretty much a small document). I want to split the whole thing into words by removing any symbols/characters that are not relevant. For example, remove commas, spaces, new lines, full stops, exclamation marks and anything else that might be irrelevant, so as to end up with only words.
Is there an easy way of doing this in one go, for example by using a regular expression and the preg_split function or do I have to use the explode function a number of times: eg first get all the sentences (by removing '.', '!' etc). Then get words by removing ',' and spaces etc etc.
I would not like to use the explode function on all the possible characters that are irrelevant since it is time consuming and I may accidentally omit some of all those possible characters.
I would like to find a more automatic way. I think a well-defined regular expression might do the work, but again I would need to specify all the possible characters, and I also have no idea how to write regular expressions in PHP.
So what can you suggest to me ?
Do you want to remove punctuation characters, etc and then split the words into an array? Or just strip it so there are only letters and spaces? Not exactly sure what you're trying to achieve, but the following might help:
<?php
$string = "This is a sentence! It has *lots* of #$#king random non-word characters. Wouldn't you like to strip them?";
$words = preg_replace("/[^\w\ _]+/", '', $string); // strip all punctuation characters, newlines, etc.
$words = preg_split("/\s+/", $words); // split by left over spaces
var_dump($words);
Either way, it gives you the general idea of using regular expressions to manipulate text as needed. My example has two parts, this way words like "wouldn't" aren't split into two words like other answers have suggested.
To be unicode compatible, you should use this one:
preg_split('/\PL+/u', $string, -1, PREG_SPLIT_NO_EMPTY);
which splits on characters that are not letters.
Have a look here to see the Unicode character properties.
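Here is a small sketch of the \PL split on text with accented letters. It assumes the script file and the input are UTF-8 encoded (the /u modifier requires valid UTF-8); the sample sentence is made up. Note that the apostrophe is not a letter either, so It's comes out as two pieces.

```php
<?php
// Hypothetical UTF-8 input with accented letters and punctuation.
$string = "Héllo, wörld! It's fine.";

// Split on runs of non-letters; PREG_SPLIT_NO_EMPTY drops empty pieces
// such as the one after the final full stop.
$words = preg_split('/\PL+/u', $string, -1, PREG_SPLIT_NO_EMPTY);

print_r($words);   // Héllo, wörld, It, s, fine
```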
Just use preg_replace() and define a regular expression to match on the different characters you wish to replace and provide a replacement character to replace them with.
http://php.net/manual/en/function.preg-replace.php
For the characters you wish to search on you can define those in a PHP array as seen in the PHP manual.
Your answer is in the domain of regular expressions and would probably be very difficult to get right. You could get something that works well in almost all cases but there would be exceptions.
This might help:
http://www.regular-expressions.info/wordboundaries.html

RegEx string "preg_replace"

I need to do a "find and replace" on about 45k lines of a CSV file and then put this into a database.
I figured I should be able to do this with PHP and preg_replace but can't seem to figure out the expression...
The lines consist of one field and are all in the following format:
"./1/024/9780310320241/SPSTANDARD.9780310320241.jpg" or "./t/fla/8204909_flat/SPSTANDARD.8204909_flat.jpg"
The first part will always be a period, the second part will always be one alphanumeric character, the third will always be three alphanumeric characters and the fourth should always be between 1 and 13 alphanumeric characters.
I came up with the following which seems to be right however I will openly profess to not knowing very much at all about regular expressions, it's a little new to me! I'm probably making a whole load of silly mistakes here...
$pattern = "/^(\.\/[0-9a-zA-Z]{1}\/[0-9a-zA-Z]{3}\/[0-9a-zA-Z]{1,13}\/)$/";
$new = preg_replace($pattern, " ", $i);
Anyway any and all help appreciated!
Thanks,
Phil
The only mistake I can see is the end-of-string anchor $, which should be removed. Your expression is also missing the _ character:
/^(\.\/[0-9a-zA-Z]{1}\/[0-9a-zA-Z]{3}\/[0-9a-zA-Z_]{1,13}\/)/
A more general pattern would be to just exclude the /:
/^(\.\/[^\/]{1}\/[^\/]{3}\/[^\/]{1,13}\/)/
You should use PHP's built-in parser to extract the values from the CSV before matching any patterns.
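As a rough sketch of "parse first, match second": str_getcsv() turns one CSV line into an array of fields, and the pattern is then applied to the unquoted value. The sample line comes from the question; the \w-based pattern and the empty replacement string are assumptions made for this example (the question replaces with a space). For a whole file, fgetcsv() in a loop plays the same role.

```php
<?php
// One raw CSV line, using a sample value from the question.
$line = '"./1/024/9780310320241/SPSTANDARD.9780310320241.jpg"';

// Parse the CSV first; separator, enclosure and escape are passed explicitly.
$fields = str_getcsv($line, ',', '"', '\\');

// Then strip the directory prefix from the unquoted field value.
$fileName = preg_replace('#^\./\w/\w{3}/\w{1,13}/#', '', $fields[0]);

echo $fileName, "\n";   // SPSTANDARD.9780310320241.jpg
```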
I'm not sure I understand what you're asking. Do you mean every line in the file looks like that, and you want to process all of them? If so, this regex would do the trick:
'#^.*/#'
That simply matches everything up to and including the last slash, which is what your regex would do if it weren't for that rogue '$' everyone's talking about. If there are other lines in other formats that you want to leave alone, this regex will probably suit your needs:
'#^\./\w/\w{3}/\w{1,13}/#'
Notice how I changed the regex delimiter from '/' to '#' so I don't have to escape the slashes inside. You can use almost any punctuation character for the delimiters (but of course they both have to be the same).
The $ means the end of the string. So your pattern would match ./1/024/9780310320241/ and ./t/fla/8204909_flat/ if they were alone on their line. Remove the $ and it will match the first four parts of your string, replacing them with a space.
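The effect of the trailing $ is easy to see by running both variants over the two sample lines from the question. This is only a sketch: it uses # as the delimiter and \w for the character classes, and replaces with an empty string rather than a space so the difference is obvious.

```php
<?php
// The two sample values from the question.
$lines = [
    './1/024/9780310320241/SPSTANDARD.9780310320241.jpg',
    './t/fla/8204909_flat/SPSTANDARD.8204909_flat.jpg',
];

$anchored = '#^\./\w/\w{3}/\w{1,13}/$#';  // with $: the file name after the prefix prevents a match
$prefix   = '#^\./\w/\w{3}/\w{1,13}/#';   // without $: matches just the directory prefix

$withAnchor = [];
$withoutAnchor = [];
foreach ($lines as $line) {
    $withAnchor[]    = preg_replace($anchored, '', $line); // left unchanged
    $withoutAnchor[] = preg_replace($prefix, '', $line);   // prefix removed
}

print_r($withoutAnchor);
// SPSTANDARD.9780310320241.jpg, SPSTANDARD.8204909_flat.jpg
```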
$pattern = "/(\.\/[0-9a-z]{1}\/[0-9a-z]{3}\/[0-9a-z\_]+\.(jpg|bmp|jpeg|png))\n/is";
I just saw that your example string doesn't end with /, so maybe you should remove it from the end of your pattern. Also, the underscore is used in the filename and should be in the character class.
