A more efficient string cleaning Regex in PHP

A more efficient string cleaning Regex in PHP - php

Okay, I was hoping someone could help me with a little regex-fu.
I am trying to clean up a string.
Basically, I am:
Replacing all characters except A-Za-z0-9 with a replacement.
Replacing consecutive duplicates of the replacement with a single instance of the replacement.
Trimming the replacement from the beginning and end of the string.
Example Input:
(&&(%()$()#&#&%&%%(%$+-_The dog jumped over the log*(&)$%&)#)##%&)&^)##)
Required Output:
The+dog+jumped+over+the+log
I am currently using this very discombobulated code and just know there is a much more elegant way to accomplish this....
function clean($string, $replace){
$ok = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
$ok .= $replace;
$pattern = "/[^".preg_quote($ok, "/")."]/";
return trim(preg_replace('/'.preg_quote($replace.$replace).'+/', $replace, preg_replace($pattern, $replace, $string)),$replace);
}
Could a Regex-Fu Master please grace me with a simpler/more efficient solution?
A much better solution suggested and explained by Botond Balázs and hakre:
function clean($string, $replace, $skip=""){
// Escape $skip
$escaped = preg_quote($replace.$skip, "/");
// Regex pattern
// Replace all consecutive occurrences of "Not OK"
// characters with the replacement
$pattern = '/[^A-Za-z0-9'.$escaped.']+/';
// Execute the regex
$result = preg_replace($pattern, $replace, $string);
// Trim and return the result
return trim($result, $replace);
}

I'm not a "regex ninja" but here's how I would do it.
function clean($string, $replace){
/// Remove all "not OK" characters from the beginning and the end:
$result = preg_replace('/^[^A-Za-z0-9]+/', '', $string);
$result = preg_replace('/[^A-Za-z0-9]+$/', '', $result);
// Replace all consecutive occurrences of "not OK"
// characters with the replacement:
$result = preg_replace('/[^A-Za-z0-9]+/', $replace, $result);
return $result;
}
I guess this could be simplified more but when dealing with regexes, clarity and readability is often more important than being clever or writing super-optimal code.
Let's see how it works:
/^[^A-Za-z0-9]+/:
^ matches the beginning of the string.
[^A-Za-z0-9] matches all non-alphanumeric characters
+ means "match one or more of the previous thing"
/[^A-Za-z0-9]+$/:
same thing as above, except $ matches the end of the string
/[^A-Za-z0-9]+/:
same thing as above, except it matches mid-string too
EDIT: OP is right that the first two can be replaced with a call to trim():
function clean($string, $replace){
// Replace all consecutive occurrences of "not OK"
// characters with the replacement:
$result = preg_replace('/[^A-Za-z0-9]+/', $replace, $result);
return trim($result, $replace);
}

I don't want to sound super-clever, but I would not call it regex-foo.
What you do is actually pretty much in the right direction because you use preg_quote, many others are not even aware of that function.
However probably at the wrong place. Wrong place because you quote for characters inside a character class and that has (similar but) different rules for quoting in a regex.
Additionally, regular expressions have been designed with a case like yours in mind. That is probably the part where you look for a wizard, let's see some options how to make your negative character class more compact (I keep the generation out to make this more visible):
[^0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz]
There are constructs like 0-9, A-Z and a-z that can represent exactly that. As you can see - is a special character inside a character class, it is not meant literal but as having some characters from-to:
[^0-9A-Za-z]
So that is already more compact and represents the same. There are also notations like \d and \w which might be handy in your case. But I take the first variant for a moment, because I think it's already pretty visible what it does.
The other part is the repetition. Let's see, there is + which means one or more. So you want to replace one or more of the non-matching characters. You use it by adding it at the end of the part that should match one or more times (and by default it's greedy, so if there are 5 characters, those 5 will be taken, not 4):
[^0-9A-Za-z]+
I hope this is helpful. Another step would be to also just drop the non-matching characters at the beginning and end, but it's early in the morning and I'm not that fluent with that.

Related

How to remove repeated sequence of characters in a string?

Imagine if:
$string = "abcdabcdabcdabcdabcdabcdabcdabcd";
How do I remove the repeated sequence of characters (all characters, not just alphabets) in the string so that the new string would only have "abcd"? Perhaps running a function that returns a new string with removed repetitions.
$new_string = remove_repetitions($string);
The possible string before removing the repetition is always like above. I don’t know how else to explain since English is not my first language. Other examples are,
$string = “EqhabEqhabEqhabEqhabEqhab”;
$string = “o=98guo=98guo=98gu”;
Note that I want it to work with other sequence of characters as well. I tried using Regex but I couldn't figure out a way to accomplish it. I am still new to php and Regex.

For details : https://algorithms.tutorialhorizon.com/remove-duplicates-from-the-string/
In different programming have a different way to remove the same or duplicate character from a string.
Example: In PHP
<?php
$str = "Hello World!";
echo count_chars($str,3);
?>
OutPut : !HWdelor
https://www.w3schools.com/php/func_string_count_chars.asp

Here, if we wish to remove the repeating substrings, I can't think of a way other than knowing what we wish to collect since the patterns seem complicated.
In that case, we could simply use a capturing group and add our desired output in it the remove everything else:
(abcd|Eqhab|guo=98)
I'm guessing it should be simpler way to do this though.
Test
$re = '/.+?(abcd|Eqhab|guo=98)\1.+/m';
$str = 'abcdabcdabcdabcdabcdabcdabcdabcd
EqhabEqhabEqhabEqhabEqhab
o98guo=98guo=98guo=98guo=98guo=98guo=98guo98';
$subst = '$1';
$result = preg_replace($re, $subst, $str);
echo $result;
Demo

You did not tell what exactly to remove. A "sequnece of characters" can be as small as just 1 character.
So this simple regex should work
preg_replace ( '/(.)(?=.*?\1)/g','' 'abcdabcdabcdabcdabcdabcd');

PHP preg_match regular expression for find date in string

I try to make system that can detect date in some string, here is the code :
$string = "02/04/16 10:08:42";
$pattern = "/\<(0?[1-9]|[12][0-9]|3[01])\/\.- \/\.- \d{2}\>/";
$found = preg_match($pattern, $string);
if ($found) {
echo ('The pattern matches the string');
} else {
echo ('No match');
}
The result i found is "No Match", i don't think that i used correct regex for the pattern. Can somebody tell me what i must to do to fix this code

First of all, remove all gibberish from the pattern. This is the part you'll need to work on:
(/0?[1-9]|[12][0-9]|3[01]/)
(As you said, you need the date only, not the datetime).
The main problem with the pattern, that you are using the logical OR operators (|) at the delimiters. If the delimiters are slashes, then you need to replace the tube characters with escaped slashes (/). Note that you need to escape them, because the parser will not take them as control characters. Like this: \/.
Now, you need to solve some logical tasks here, to match the numbers correctly and you're good to go.
(I'm not gonna solve the homework for you :) )
These articles will help you to solve the problem tough:
Character classes
Repetition opetors
Special characters
Pipe character (alternation operator)
Good luck!

In your comment you say you are looking for yyyy, but the example says yy.
I made a code for yy because that is what you gave us, you can easily change the 2 to a 4 and it's for yyyy.
preg_match("/((0|1|2|3)[0-9])\/\d{2}\/\d{2}/", $string, $output_array);
Echo $output_array[1]; // date
Edit:
If you use this pattern it will match the time too, thus make it harder to match wrong.
((0|1|2|3)[0-9])/\d{2}/\d{2}\s+\d{2}:\d{2}:\d{2}
http://www.phpliveregex.com/p/fjP
Edit2:
Also, you can skip one line of code.
You first preg_match to $found and then do an if $found.
This works too:
If(preg_match($pattern, $string, $found))}{
Echo $found[1];
}Else{
Echo "nothing found";
}
With pattern and string as refered to above.
As you can see the found variable is in the preg_match as the output, thus if there is a match the if will be true.

Regex preg_match change only special character between quote

$value = "abrak'adabra' baba";
$pattern = array();
$replacement = array();
$pattern[] = '/(\'[^\']+\')|(a)/e';
$replacement = "strlen('\\2') ? 'i' : '\\0'";
The code above change abrak'adabra' baba into ibrik'adabra' bibi
What I want to do is to change abrak'adabra' baba into abrak'idibri' baba. How to do that?
Honestly I don't even really understand the regex pattern above.
There are what I know and I don't know about the code:
In $pattern say: (any word which contain has two quotes and no quote between) or (character "a"). In the replacement, php code such a strlen will works because /e modifier will be used. But I can't understand why is it an "or" logic there.
If length of the second part in the pattern (the a character) is more than zero, than replace it with "i", else do something else (I don't understand what \0 means)
I'll appreciate any help. This regex stuff has frustating me :(

Using the e modifier (eval) in patterns is dangerous, as someone could potentially execute malicious code on your server (see the manual's section on that for more).
Instead, if you need to do extra processing on matched items, you can use preg_replace_callback:
// Find all characters between single quotes
$result = preg_replace_callback('/\'(.*?)\'/', function($matches){
// Replace 'a' with 'i' in found matches
return '\''.str_replace('a', 'i', $matches[1]).'\'';
}, $value);
If all you're doing is replacing a with i between the quotes, there may be more optimal ways to go about it, but this way you have room for more advanced processing on the strings found between quotes.

Replace from one custom string to another custom string

How can I replace a string starting with 'a' and ending with 'z'?
basically I want to be able to do the same thing as str_replace but be indifferent to the values in between two strings in a 'haystack'.
Is there a built in function for this? If not, how would i go about efficiently making a function that accomplishes it?

That can be done with Regular Expression (RegEx for short).
Here is a simple example:
$string = 'coolAfrackZInLife';
$replacement = 'Stuff';
$result = preg_replace('/A.*Z/', $replacement, $string);
echo $result;
The above example will return coolStuffInLife
A little explanation on the givven RegEx /A.*Z/:
- The slashes indicate the beginning and end of the Regex;
- A and Z are the start and end characters between which you need to replace;
- . matches any single charecter
- * Zero or more of the given character (in our case - all of them)
- You can optionally want to use + instead of * which will match only if there is something in between
Take a look at Rubular.com for a simple way to test your RegExs. It also provides short RegEx reference

$string = "I really want to replace aFGHJKz with booo";
$new_string = preg_replace('/a[a-zA-z]+z/', 'boo', $string);
echo $new_string;
Be wary of the regex, are you wanting to find the first z or last z? Is it only letters that can be between? Alphanumeric? There are various scenarios you'd need to explain before I could expand on the regex.

use preg_replace so you can use regex patterns.

How to strip this part of my string?

$string = "Hot_Chicks_call_me_at_123456789";
How can I strip away so that I only have the numberst after the last letter in the string above?
Example, I need a way to check a string and remove everything in front of (the last UNDERSCORE FOLLOWED by the NUMBERS)
Any smart solutions for this?
Thanks
BTW, it's PHP!

Without using a regular expression
$string = "Hot_Chicks_call_me_at_123456789";
echo end( explode( '_', $string ) );

If it always ends in a number you can just match /(\d+)$/ with regex, is the formatting consistent? Is there anything between the numbers like dashes or spaces?
You can use preg_match for the regex part.
<?php
$subject = "abcdef_sdlfjk_kjdf_39843489328";
preg_match('/(\d+)$/', $subject, $matches);
if ( count( $matches ) > 1 ) {
echo $matches[1];
}
I only recommend this solution if speed isn't an issue, and if the formatting is completely consistent.

PHP's PCRE Regular Expression engine was built for this kind of task
$string = "Hot_Chicks_call_me_at_123456789";
$new_string = preg_replace('{^.*_(\d+)$}x','$1',$string);
//same thing, but with whitespace ignoring and comments turned on for explanations
$new_string = preg_replace('{
^.* #match any character at start of string
_ #up to the last underscore
(\d+) #followed by all digits repeating at least once
$ #up to the end of the string
}x','$1',$string);
echo $new_string . "\n";

To be a bit churlish, your stated specification would suggest the following algorithm:
def trailing_number(s):
results = list()
for char in reversed(s):
if char.isalpha(): break
if char.isdigit(): results.append(char)
return ''.join(reversed(results))
It returns only the digits from the end of the string up to the first letter it encounters.
Of course this example is in Python, since I don't know PHP nearly as well. However it should be easily translated as the concept is easy enough ... reverse the string (or iterate from the end towards the beginning) and accumulate digits until you find a letter and break (or fall out of the loop at the beginning of the string).
In C it would be more efficient to use something a bit like for(x=strlen(s);x>s;x--) to walk backwards through the string, saving a pointer to the most recently encountered digit until we break or drop out of the loop at the beginning of the string. Then return the pointer into the middle of the string where our most recent (leftmost) digit was found.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

A more efficient string cleaning Regex in PHP - php

Related

How to remove repeated sequence of characters in a string?

PHP preg_match regular expression for find date in string

Regex preg_match change only special character between quote

Replace from one custom string to another custom string

How to strip this part of my string?

Categories

Resources