Regex unicode - route CodeIgniter - PHP

I am trying to get a regex to work, but I am starting to go in circles.
The regex will be used in CodeIgniter for routing, something like:
$route['([\p{Ll}\p{Cyrillic}0-9\s\-]+)-(\d+).html'] = "blog/$2";
I've got a regex that does what I need to:
$pattern = "/^[\p{Ll}0-9\s\-]+$/u";
But for some reason it doesn't work in the pattern below:
$str="asdбв-37.html";
$pattern = "#^([\p{Ll}\p{Cyrillic}0-9\s\-]+)-(\d+).html#";
$result = (bool) preg_match($pattern, $str);
if($result)
echo "$str is composed of Cyrillic and alphanumeric characters\n";
My end goal is to check that every character, from any language, is lowercase; that is why I used \p{Ll}.

The pattern that works on its own fails for asdбв-37.html because its character class doesn't allow periods. Try adding them:
^[a-zA-Z\p{Cyrillic}0-9\s.-]+$
Also, you don't need to escape the - when it sits at the beginning or end of a character class; in those positions it is already literal.
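A second likely cause, offered as an assumption rather than something the thread confirms: the failing pattern has no u modifier, so PCRE reads the subject byte by byte and the two-byte Cyrillic letters never match \p{Ll} or \p{Cyrillic}. The sketch below adds the modifier and escapes the dot before html. The helper name matchSlug and the returned array keys are hypothetical.

```php
<?php
// Hypothetical helper for illustration; name and return shape are assumptions.
function matchSlug(string $path): ?array
{
    // The trailing "u" makes PCRE treat pattern and subject as UTF-8,
    // so \p{Ll} and \p{Cyrillic} see whole characters instead of bytes.
    $pattern = '#^([\p{Ll}\p{Cyrillic}0-9\s-]+)-(\d+)\.html$#u';

    if (preg_match($pattern, $path, $m) !== 1) {
        return null; // no match (or a PCRE error)
    }

    return ['slug' => $m[1], 'id' => $m[2]];
}

var_dump(matchSlug('asdбв-37.html'));
```

Note that \p{Cyrillic} also accepts uppercase Cyrillic letters, so a strictly lowercase check would rely on \p{Ll} alone.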

Related

How to replace (.) dots by (·) interpuncts in PHP?

There's a French typographic rule to write properly some words in a genderless way, by adding an (·) interpunct between letters.
A few authors on my website are however typing a simple (.) dot instead.
As a solution, I'd like to create a PHP function that replaces every dot placed between two lowercase letters with an interpunct. But my PHP skills are rather limited… Here is what I'm looking for:
REPLACE THIS:
$string = "T.N.T.: Chargé.e des livreur.se.s."
BY THIS:
$string = "T.N.T.: Chargé·e des livreur·se·s."
Could someone help me please?
Thank you.
Use preg_replace with a pattern that matches three groups: a lowercase letter (including accented French letters), a dot, and another lowercase letter. Then use the first and third captured groups in the replacement, with an interpunct between them:
$string = "T.N.T.: Chargé.e des livreur.se.s.";
$pattern = '/([a-zàâçéèêëîïôûùüÿñæœ])(\.)([a-zàâçéèêëîïôûùüÿñæœ])/u'; // u: treat pattern and subject as UTF-8
$replacement = '$1·$3'; //captured first and third group, and interpunct in the middle
//results in "T.N.T.: Chargé·e des livreur·se·s."
$string_replaced = preg_replace($pattern, $replacement, $string);
More about preg_replace:
https://www.php.net/manual/en/function.preg-replace.php
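A variation on the pattern above, as a minimal sketch: lookarounds test the neighbouring letters without consuming them, so chains such as a.b.c are handled in one pass, and \p{Ll} with the u modifier covers any lowercase letter rather than a hand-written list. The function name dotsToInterpuncts is hypothetical, and the sketch assumes the input is UTF-8 with precomposed accented letters.

```php
<?php
// Hypothetical helper; assumes UTF-8 input with precomposed accents.
function dotsToInterpuncts(string $text): string
{
    // (?<=\p{Ll}) : the previous character is a lowercase letter
    // \.          : a literal dot (the only thing that gets replaced)
    // (?=\p{Ll})  : the next character is a lowercase letter
    $result = preg_replace('/(?<=\p{Ll})\.(?=\p{Ll})/u', '·', $text);

    // preg_replace returns null on error (e.g. invalid UTF-8); fall back to the input.
    return $result ?? $text;
}

echo dotsToInterpuncts("T.N.T.: Chargé.e des livreur.se.s."), "\n";
```

Acronyms written in lowercase (for example "e.g.") would still be converted, which is the side effect the next answer warns about.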
You could use str_replace() if you know the grammar rules surrounding the dots you want to replace. For instance, if every dot between é and e is concerned, then you can do:
$bodytag = str_replace("é.e", "é·e", $sourceText);
But you will always risk some side effects. For instance if there is an acronym you don't want to be replaced with this pattern. I don't think there is any magic way to avoid this.
More specifically
I'd like to create a function to replace in PHP strings each dots which are placed between two lowercase letters by interpuncts.
This can be achieved with preg_replace() and the appropriate REGEX
See this post

PHP preg_match_all no character or not a certain character

Right now the test seems to be working for avoiding the characters that I don't want but it's only returning a count of 2. I know why, I just don't know how to address it. The problem is the last ? is being excluded because the actual match for the 2nd match is (?+ so it's not matching the 3rd since there is no "starting" character for that pattern, it would just be ?).
$pattern = "/([^\w\d'\"`]\?[^\w\d'\"`])/";
$subject = "`test` = ? and `other` = (?+?)";
$count = preg_match_all($pattern, $subject, $matches);
echo "Count: $count\n"; // echoes 2 instead of 3
Basically, I want to count up all the parameters used, so match all ? in the $subject with a ? not surrounded by letters, numbers, quotes, and ticks.
This is the actual pattern that matters:
[^\w\d'\"'`]
Update:
For others, miken32's solution is to convert the above pattern to:
(?=[^\w\d'\"'`])
Try using a lookahead assertion:
$pattern = "/((?<=[^\w\d'\"`])\?(?=[^\w\d'\"`]))/";
It will look ahead without moving the search forward.
Edited to add the lookbehind assertion as well.
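The positive lookarounds above still require some character on each side, so a ? at the very start or end of the string is not counted. A possible variant, sketched here under that assumption, uses negative lookarounds, which also succeed when there is no neighbouring character at all. The function name countPlaceholders is hypothetical, and \d is dropped because \w already includes digits.

```php
<?php
// Hypothetical helper for illustration.
function countPlaceholders(string $sql): int
{
    // (?<![\w'"`]) : not preceded by a word character, quote, or backtick
    // \?           : a literal question mark
    // (?![\w'"`])  : not followed by a word character, quote, or backtick
    $pattern = '/(?<![\w\'"`])\?(?![\w\'"`])/';

    // preg_match_all returns the number of matches, or false on error.
    return (int) preg_match_all($pattern, $sql);
}

echo countPlaceholders("`test` = ? and `other` = (?+?)"), "\n";
```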

Regular Expression to check if string ends with one underscore and two letters with php

I'm trying to check if a string ends with one _ and two letters on an old system with PHP. I've checked here on Stack Overflow for answers and found one that wanted to do the same, but with one . and two digits.
I tried to change it to work with my needs, and I got this:
\\.*\\_\\a{2,2}$
Then I went to php and tried this:
$regex = '(\\.*\\_\\a{2,2}$)';
echo preg_match($regex, $key);
But this always returns an error, saying the following:
preg_match(): Delimiter must not be alphanumeric or backslash
I gather this happens because I can't use the backslashes like that. How can I do this correctly? And is my regex correct? (I don't know how to form these expressions or how they work.)
You can use this regex with delimiters:
$regex = '/_[a-z]{2}$/i';
You're getting that error because in PHP every regex needs a delimiter (note the use of / above; it can be another character, such as ~).
^.*_[a-zA-Z]{2}$
This should do it for you.
$re = "/^.*_[a-zA-Z]{2}$/";
$str = "abc_ac";
preg_match($re, $str);
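Both answers can be wrapped in a small helper, sketched below. The function name is hypothetical. It uses \z instead of $ as a deliberate choice, because $ also matches just before a trailing newline; that is stricter than the patterns in the answers.

```php
<?php
// Hypothetical helper for illustration.
function endsWithUnderscoreAndTwoLetters(string $key): bool
{
    // "/" is the delimiter, "i" makes [a-z] case-insensitive,
    // and \z anchors at the very end of the string.
    return preg_match('/_[a-z]{2}\z/i', $key) === 1;
}

var_dump(endsWithUnderscoreAndTwoLetters('abc_ac'));
```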

A more efficient string cleaning Regex in PHP

Okay, I was hoping someone could help me with a little regex-fu.
I am trying to clean up a string.
Basically, I am:
Replacing all characters except A-Za-z0-9 with a replacement.
Replacing consecutive duplicates of the replacement with a single instance of the replacement.
Trimming the replacement from the beginning and end of the string.
Example Input:
(&&(%()$()#&#&%&%%(%$+-_The dog jumped over the log*(&)$%&)#)##%&)&^)##)
Required Output:
The+dog+jumped+over+the+log
I am currently using this very discombobulated code and just know there is a much more elegant way to accomplish this....
function clean($string, $replace){
$ok = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
$ok .= $replace;
$pattern = "/[^".preg_quote($ok, "/")."]/";
return trim(preg_replace('/'.preg_quote($replace.$replace).'+/', $replace, preg_replace($pattern, $replace, $string)),$replace);
}
Could a Regex-Fu Master please grace me with a simpler/more efficient solution?
A much better solution suggested and explained by Botond Balázs and hakre:
function clean($string, $replace, $skip=""){
// Escape $skip
$escaped = preg_quote($replace.$skip, "/");
// Regex pattern
// Replace all consecutive occurrences of "Not OK"
// characters with the replacement
$pattern = '/[^A-Za-z0-9'.$escaped.']+/';
// Execute the regex
$result = preg_replace($pattern, $replace, $string);
// Trim and return the result
return trim($result, $replace);
}
I'm not a "regex ninja" but here's how I would do it.
function clean($string, $replace){
// Remove all "not OK" characters from the beginning and the end:
$result = preg_replace('/^[^A-Za-z0-9]+/', '', $string);
$result = preg_replace('/[^A-Za-z0-9]+$/', '', $result);
// Replace all consecutive occurrences of "not OK"
// characters with the replacement:
$result = preg_replace('/[^A-Za-z0-9]+/', $replace, $result);
return $result;
}
I guess this could be simplified further, but when dealing with regexes, clarity and readability are often more important than being clever or writing super-optimal code.
Let's see how it works:
/^[^A-Za-z0-9]+/:
^ matches the beginning of the string.
[^A-Za-z0-9] matches all non-alphanumeric characters
+ means "match one or more of the previous thing"
/[^A-Za-z0-9]+$/:
same thing as above, except $ matches the end of the string
/[^A-Za-z0-9]+/:
same thing as above, except it matches mid-string too
EDIT: OP is right that the first two can be replaced with a call to trim():
function clean($string, $replace){
// Replace all consecutive occurrences of "not OK"
// characters with the replacement:
$result = preg_replace('/[^A-Za-z0-9]+/', $replace, $string);
return trim($result, $replace);
}
I don't want to sound super-clever, but I would not call it regex-fu.
What you do is actually pretty much in the right direction because you use preg_quote; many others are not even aware of that function.
However, it is probably in the wrong place, because you quote characters for use inside a character class, and that has (similar but) different quoting rules from the rest of a regex.
Additionally, regular expressions have been designed with a case like yours in mind. That is probably the part where you look for a wizard, so let's see some options for making your negative character class more compact (I leave the generation out to make this more visible):
[^0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]
There are constructs like 0-9, A-Z and a-z that can represent exactly that. As you can see, - is a special character inside a character class: it is not meant literally but denotes a range of characters (from-to):
[^0-9A-Za-z]
So that is already more compact and represents the same. There are also notations like \d and \w which might be handy in your case. But I take the first variant for a moment, because I think it's already pretty visible what it does.
The other part is the repetition. Let's see, there is + which means one or more. So you want to replace one or more of the non-matching characters. You use it by adding it at the end of the part that should match one or more times (and by default it's greedy, so if there are 5 characters, those 5 will be taken, not 4):
[^0-9A-Za-z]+
I hope this is helpful. Another step would be to also just drop the non-matching characters at the beginning and end, but it's early in the morning and I'm not that fluent with that.
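The remaining step, dropping the replacement at the beginning and end, can be handled with trim() as the earlier answer suggests. A minimal sketch that combines both ideas follows; the function name cleanString is hypothetical, and it assumes $replace is a single character with no special meaning in a replacement string (such as $ or a backslash).

```php
<?php
// Hypothetical helper; assumes $replace is one plain character such as "+".
function cleanString(string $string, string $replace): string
{
    // Collapse every run of non-alphanumeric characters into one replacement.
    $result = preg_replace('/[^0-9A-Za-z]+/', $replace, $string);

    // Strip the replacement character from both ends.
    return trim($result, $replace);
}

$input = '(&&(%()$()#&#&%&%%(%$+-_The dog jumped over the log*(&)$%&)#)##%&)&^)##)';
echo cleanString($input, '+'), "\n";
```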

Making a url regex global

I've been searching for a regex to replace plain-text URLs in a string (the string can contain more than one URL) by:
url
and I found this:
http://mathiasbynens.be/demo/url-regex
I would like to use diegoperini's regex (which, according to the tests, is the best):
_^(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?$_iuS
But I want to make it global so that it replaces all the URLs in a string.
When I use this:
/_(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?_iuS/g
It does not work. How do I make this regex global, and what do the underscore at the beginning and the "_iuS" at the end mean?
I would like to use it with php so I am using:
preg_replace($regex, '$0', $examplestring);
The underscores are the regex delimiters, the i, u and S are pattern modifiers :
i (PCRE_CASELESS)
If this modifier is set, letters in the pattern match both upper and lower
case letters.
u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible
with Perl. Pattern and subject strings are treated as UTF-8. An invalid subject
will cause the preg_* function to match nothing; an invalid pattern will trigger
an error of level E_WARNING.
S
When a pattern is going to be used several times, it is worth spending more
time analyzing it in order to speed up the time taken for matching. If this
modifier is set, then this extra analysis is performed. At present, studying
a pattern is useful only for non-anchored patterns that do not have a single
fixed starting character.
For more information see http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
When you added / ... /g, you added a second pair of regex delimiters plus the modifier g, which does not exist in PCRE; that's why it did not work. preg_replace already replaces every match in the subject, so no global flag is needed.
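To illustrate the point about delimiters and the missing g flag, here is a minimal sketch. It uses a deliberately simplified URL pattern (an assumption made for readability, not diegoperini's regex) and wraps each match in square brackets as a placeholder replacement. The function name bracketUrls is hypothetical.

```php
<?php
// Hypothetical helper; the pattern is a simplified stand-in, not diegoperini's regex.
function bracketUrls(string $text, int $limit = -1, ?int &$count = null): string
{
    // "_" is the delimiter on both sides; "i" and "u" are modifiers.
    // There are no ^ or $ anchors, so URLs anywhere in the text can match.
    $pattern = '_https?://\S+_iu';

    // With the default limit of -1, preg_replace replaces every match.
    return preg_replace($pattern, '[$0]', $text, $limit, $count);
}

$text = 'See http://example.com/a and https://example.org/b for details.';
echo bracketUrls($text, -1, $n), "\n"; // all matches; $n holds the number replaced
echo bracketUrls($text, 1), "\n";      // only the first match
```

The fourth argument of preg_replace is the way to restrict the number of replacements, which is the opposite of what a g flag does in other regex flavours.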
I agree with @verdesmarald and used this pattern in the following function:
$string = preg_replace_callback(
"_(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)(?:\.(?:[a-z\x{00a1}-\x{ffff}0-9]+-?)*[a-z\x{00a1}-\x{ffff}0-9]+)*(?:\.(?:[a-z\x{00a1}-\x{ffff}]{2,})))(?::\d{2,5})?(?:/[^\s]*)?_iuS",
// An anonymous function replaces create_function(), which was removed in PHP 8.0.
function (array $match): string {
// Build a short label: lowercase, without the scheme or a "www." prefix.
$m = trim(strtolower($match[0]));
$m = str_replace(["http://", "https://", "ftp://", "www."], "", $m);
// Truncate long labels to 25 characters.
if (strlen($m) > 25)
{
$m = substr($m, 0, 25) . "...";
}
return $m;
}, $string);
return $string;
It seems to do the trick and resolves an issue I was having. As @verdesmarald said, removing the ^ and $ characters allowed the pattern to work even in my preg_replace_callback().
The only thing that concerns me is how efficient the pattern is. If used in a busy, high-traffic web app, could it cause a bottleneck?
UPDATE
The above regex pattern breaks if there is a trailing dot at the end of the path section of a URL, like so: http://www.mydomain.com/page.. To solve this I modified the final part of the regex pattern by adding ^. to the character class, so that it reads [^\s^.]. As I read it, this means: do not match a trailing space or dot.
In my tests so far it seems to be working fine.
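An alternative worth considering, sketched here as an assumption rather than the poster's approach: keep the path pattern as it was and strip trailing dots inside the callback, so dots in the middle of a path (as in page.html) remain part of the URL. The pattern is again a simplified stand-in, the square brackets are placeholder output, and the function name wrapUrls is hypothetical.

```php
<?php
// Hypothetical helper; simplified pattern and placeholder output for illustration.
function wrapUrls(string $text): string
{
    return preg_replace_callback(
        '_https?://\S+_i',
        function (array $m): string {
            // Separate any trailing dots from the URL itself.
            $url = rtrim($m[0], '.');
            $trailing = substr($m[0], strlen($url));

            // Wrap only the URL; put the trailing dots back after it.
            return '[' . $url . ']' . $trailing;
        },
        $text
    );
}

echo wrapUrls('Go to http://www.mydomain.com/page.. now'), "\n";
```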
