Regular expression to remove trailing chars - php

I'm looking for a regular expression in PHP that could transform incoming strings like this:
abaisser_negation_pronominal_question => abaisser_n_p_q
abaisser_pronominal_question => abaisser_p_q
abaisser_negation_question => abaisser_n_q
abaisser_negation_pronominal => abaisser_n_p
abaisser_negation_voix_passive_pronominal => abaisser_n_v_p_p
abaisser => abaisser
With PHP code close to something like:
$line=preg_replace("/<h3>/im", "", $line);
How would you do it?

You can use:
$input = preg_replace('/(_[A-Za-z])[^_\n]*/', '$1', $input);
Explanation:
This regex searches for (_[A-Za-z])[^_\n]*, which means an underscore followed by a single letter, then zero or more characters up to (but not including) the next underscore or newline.
It captures the first part, (_[A-Za-z]), in group 1, which is available as $1 in the replacement.
The replacement is $1, so only the underscore and the first letter are kept.
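As a quick check of the explanation above, here is a minimal sketch that runs the pattern over some of the sample strings from the question. The helper name abbreviate_suffixes is made up for this illustration and is not part of the original answer.

```php
<?php
// Hypothetical helper name; it only wraps the pattern from the answer.
function abbreviate_suffixes(string $input): string
{
    // Keep "_" plus the first letter ($1) and drop the rest of the segment.
    return preg_replace('/(_[A-Za-z])[^_\n]*/', '$1', $input);
}

// Sample inputs from the question mapped to the outputs it asks for.
$samples = [
    'abaisser_negation_pronominal_question'     => 'abaisser_n_p_q',
    'abaisser_pronominal_question'              => 'abaisser_p_q',
    'abaisser_negation_voix_passive_pronominal' => 'abaisser_n_v_p_p',
    'abaisser'                                  => 'abaisser',
];

foreach ($samples as $in => $expected) {
    $out = abbreviate_suffixes($in);
    echo $in, ' => ', $out, ($out === $expected ? '' : '  (unexpected)'), PHP_EOL;
}
```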

You could use \K or positive lookbehind.
$input = preg_replace('~_.\K[^_\n]*~', '', $input);
The pattern _. matches an _ and the character that follows it. \K then resets the start of the reported match, so those two characters are not part of what gets replaced. Next, [^_\n]* matches zero or more characters that are neither _ nor a newline, that is, everything up to the next _ or \n. Removing those characters gives you the desired output.
$input = preg_replace('~(?<=_.)[^_\n]*~', '', $input);
It looks behind for an _ and the character following it, then matches all the characters up to the next underscore or newline character.
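To see that the \K version and the lookbehind version behave the same on the question's data, here is a small sketch. The variable names are only for illustration.

```php
<?php
$input = 'abaisser_negation_voix_passive_pronominal';

// \K resets the match start, so only the tail of each segment is removed.
$withK = preg_replace('~_.\K[^_\n]*~', '', $input);

// The lookbehind asserts "_" plus one character without consuming them.
$withLookbehind = preg_replace('~(?<=_.)[^_\n]*~', '', $input);

echo $withK, PHP_EOL;          // abaisser_n_v_p_p
echo $withLookbehind, PHP_EOL; // abaisser_n_v_p_p
```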

You can use regex
$input = preg_replace('/_(.)[^\n_]+/', '_$1', $input);
It captures the character after _, matches up to the next \n or _, and replaces the whole match with _$1, that is, _ plus the captured character.

$line = preg_replace("/_([a-z])([a-z]*)/i", "_$1", $line);

Related

regex capture certain characters only

I'm currently dealing with a bit of a problem. This is my string: "all-days"
I need some assistance creating a regex to capture the first character, the dash, and the first character after the dash. I'm a bit of a newbie to regex, so forgive me.
Here is what I've got so far: (^.)
"capture the first character, the dash and also the first character after the dash"
With preg_match function:
$s = "all-days";
preg_match('/^(.)[^-]*(-)(.)/', $s, $m);
unset($m[0]);
print_r($m);
The output:
Array
(
[1] => a
[2] => -
[3] => d
)
It's not regex, but if you just want a solution by other means, it can be achieved with explode, array_walk and implode:
$string = 'all-days-with-my-style';
$arr = explode("-",$string);
array_walk($arr, function (&$a) {
    $a = $a[0]; // keep only the first character of each part
});
echo implode("-",$arr);
Live demo : https://eval.in/882846
Output is : a-d-w-m-s
I assume your string only contains word characters and hyphens, and doesn't have consecutive hyphens.
To remove everything except the first character, the hyphens, and the first character after each hyphen, remove every run of word characters that doesn't start at a word boundary:
$result = preg_replace('~\B\w+~', '', 'all-days');
If you only want to match these characters, just catch each character after a word boundary:
if ( preg_match_all('~\b.~', 'all-days', $matches) )
print_r($matches[0]);
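Both snippets above can be tried side by side. This is a small sketch, using the longer sample string from the explode answer for the replacement and the original "all-days" for the match; the variable names are illustrative.

```php
<?php
$s = 'all-days-with-my-style';

// Remove every run of word characters that does not start at a word boundary.
$short = preg_replace('~\B\w+~', '', $s);
echo $short, PHP_EOL; // a-d-w-m-s

// Collect each character that directly follows a word boundary.
preg_match_all('~\b.~', 'all-days', $matches);
$firsts = implode('', $matches[0]);
echo $firsts, PHP_EOL; // a-d
```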
Code
\b(\w|-\b)
For more precision, the following can be used (note that it uses Unicode property classes, so it doesn't work in every regex flavor, but it does in PHP). This will only match letters, not numbers and underscores. It uses a positive lookbehind and a positive lookahead, but you can understand it if you break it apart one piece at a time.
(\b\p{L}|(?<=\p{L})-(?=\p{L}))
Explanation
\b Assert position at a word boundary
(\w|-\b) Capture the following into capture group 1
    \w Match any word character
    | Or
    - Match the - character literally
    \b Assert position at a word boundary
\b:
Asserts that the position in the string matches one of the following:
    ^\w Assert position at the start of the string and match a word character
    \w$ Match a word character and assert its position as the last position in the string
    \W\w Match any non-word character, followed by a word character
    \w\W Match any word character, followed by a non-word character
\w:
Means a word character (usually defined by any character in the set a-zA-Z0-9_, however, some languages also accept Unicode characters that represent any letter, number, or underscore \p{L}\p{N}_).
For more precision (depending on the use-case), you can specify [a-zA-Z] (for ASCII letters), \p{L} for Unicode letters, or [a-z] with the i flag for ASCII characters with the case-insensitive flag enabled in regex.
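The difference between the \w-based pattern and the letter-only pattern shows up as soon as digits are involved. The sketch below uses a made-up subject string ('all-days 9-to-5') that is not from the original question, and adds the u modifier to the second pattern so that \p{L} also works on UTF-8 input.

```php
<?php
$s = 'all-days 9-to-5'; // hypothetical input containing digits

// \w-based pattern: digits count as word characters.
preg_match_all('~\b(\w|-\b)~', $s, $ascii);
$asciiResult = implode('', $ascii[0]);
echo $asciiResult, PHP_EOL; // a-d9-t-5

// Letter-only pattern: hyphens are kept only between two letters.
preg_match_all('~(\b\p{L}|(?<=\p{L})-(?=\p{L}))~u', $s, $letters);
$letterResult = implode('', $letters[0]);
echo $letterResult, PHP_EOL; // a-dt
```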

php regex: if line doesn't end with... remove line

I have a string stored in variable $text:
$text = '
I should not be removed.
I should not be removed.
I should not be removed?
I should not be removed!
I should be removed
I should be removed-
I should not be removed?
';
I want to remove all lines in the string that do not end with ., ? or !. How do I do this effectively? Maybe a preg_replace() approach?
If there is no whitespace at the end of the lines, you can use
'~^.*(?<![.?!])$\R?~m'
Explanation:
^ - start of line (as /m modifier indicates the multiline mode when ^ and $ match start and end of line, not string)
.* - any characters but a newline up to...
(?<![.?!])$ - the end of the line that is not preceded by a ., ! or ?
\R? - optional line break
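Here is a minimal sketch of this first pattern in use. The sample text is shortened and made up for the illustration, and it has no trailing whitespace, which is what this pattern assumes.

```php
<?php
// Hypothetical sample text without trailing whitespace.
$text = "Keep me.\nDrop me\nKeep me too?\nDrop me as well-\nKeep!";

// m: ^ and $ match at line boundaries; \R? also consumes the line break.
$result = preg_replace('~^.*(?<![.?!])$\R?~m', '', $text);

echo $result, PHP_EOL;
// Keep me.
// Keep me too?
// Keep!
```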
To ignore the trailing whitespace, use a lookahead based regex:
'~^(?!.*[.?!]\h*$).*$\R?~m'
Explanation:
^ - start of a line
(?!.*[.?!]\h*$) - a negative lookahead that fails the match if the line ends with a ., ? or ! followed by optional horizontal whitespace (\h*)
.*$ - any characters but a newline, 0 or more occurrences, up to the end of the line
\R? - optional newline sequence (optional, as the last line may not be followed with a newline character).
PHP code demo:
$re = '~^(?!.*[.?!]\h*$).*$\R?~m';
$str = "I should not be removed. \nI should not be removed.\nI should not be removed?\nI should not be removed! \nI should be removed\nI should be removed-\nI should not be removed? ";
$result = preg_replace($re, "", $str);
echo $result;
If you need to ignore the whitespace and punctuation, just add a [\p{P}\h] character class to the lookahead:
^(?!.*[.?!][\p{P}\h]*$).*$\R?
See demo. Now, the lookahead looks like (?!.*[.?!][\p{P}\h]*$). It fails a match if there is a ., ?, or ! followed by punctuation (\p{P}) or horizontal whitespace (\h), zero or more occurrences (*).
AND FINAL UPDATE: If you need to also ignore all non-word symbols (including Unicode letters) and all HTML entities, you can use
'~^(?!.*[.?!](&\w+;|\W)*$).*$\R?~m'
Lines that end with a . followed by whitespace or an HTML entity do not get removed.
The difference here is (&\w+;|\W)* that matches 0 or more substrings starting with & and followed by 1 or more word characters (letters [A-Za-z], digits ([0-9]) or an underscore) and then a semi-colon, or non-word characters (\W). You can unroll the pattern as [^\w&]*(?:&\w+;\W*)* so that the regex performance might improve.
Note that \W here also matches Unicode letters and symbols outside ASCII, since the /u modifier is not used.
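A short sketch of the entity-aware pattern follows. The sample lines are invented for the illustration; the third one has an entity but no sentence-ending punctuation, so it is the only one that should be dropped.

```php
<?php
$re  = '~^(?!.*[.?!](&\w+;|\W)*$).*$\R?~m';

// Hypothetical sample lines, not taken from the question.
$str = "Ends with a period.&nbsp;\n"
     . "Ends with a quote!\"\n"
     . "No final punctuation&nbsp;\n"
     . "Plain question?";

$result = preg_replace($re, '', $str);
echo $result, PHP_EOL;
```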

php - regular expression matching first occurrence

I need to match the first occurrence of the following pattern: it starts with \s or (, then NIC, followed by any characters, followed by # or ., followed by 5 or 6 digits.
Regular expression used :
preg_match('/[\\s|(]NIC.*[#|.]\d{5,6}/i', trim($test), $matches1);
Example:
$test = "(NIC.123456"; // works correctly
$test = "(NIC.123456 oldnic#65703 checking"; // produces (NIC.123456 oldnic#65703
But it needs to be only (NIC.123456. What is the problem?
You need to add ? after the quantifier to make the match non-greedy. Here .* matches as much as possible, so the match runs to the last # or . that is followed by 5 or 6 digits.
You also don't need to double-escape \\s here; \s is enough. Inside a character class, just list the characters you want; the pipe | is not an alternation operator there and would be matched literally.
Also note that your expression will match strings like (NIC_CCC.123456. To avoid this, you can use a word boundary \b, which matches the boundary between a word character and a non-word character.
preg_match('/(?<=^|\s)\(nic\b.*?[#.]\d{5,6}/i', $test, $match);
Regular expression:
(?<= look behind to see if there is:
^ the beginning of the string
| OR
\s whitespace (\n, \r, \t, \f, and " ")
) end of look-behind
\( '('
nic 'nic'
\b the boundary between a word char (\w) and not a word char
.*? any character except \n (0 or more times, matching the least amount possible)
[#.] any character of: '#', '.'
\d{5,6} digits (0-9) (between 5 and 6 times)
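The effect of the lazy quantifier can be seen by running both patterns on the second test string from the question. This is only a sketch; the character class in the first pattern is the cleaned-up form of the asker's original.

```php
<?php
$test = '(NIC.123456 oldnic#65703 checking';

// Greedy: .* runs to the last "#" or "." that is followed by 5-6 digits.
preg_match('/[\s(]NIC.*[#.]\d{5,6}/i', $test, $greedy);

// Lazy: .*? stops at the first one.
preg_match('/(?<=^|\s)\(nic\b.*?[#.]\d{5,6}/i', $test, $lazy);

echo $greedy[0], PHP_EOL; // (NIC.123456 oldnic#65703
echo $lazy[0], PHP_EOL;   // (NIC.123456
```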
Have you tried using
$test1 = explode(" ", $test);
and then $test1[0] to display your result?

Regex to remove single characters from string

Consider the following strings
breaking out a of a simple prison
this is b moving up
following me is x times better
All strings are lowercased already. I would like to remove any "loose" a-z characters, resulting in:
breaking out of simple prison
this is moving up
following me is times better
Is this possible with a single regex in php?
$str = "breaking out a of a simple prison
this is b moving up
following me is x times better";
$res = preg_replace("#\\b[a-z]\\b ?#i", "", $str);
echo $res;
How about:
preg_replace('/(^|\s)[a-z](\s|$)/', '$1', $string);
Note this also catches single characters that are at the beginning or end of the string, but not single characters that are adjacent to punctuation (they must be surrounded by whitespace).
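Applied to the three strings from the question, the whitespace-bounded pattern can be checked like this (a sketch; the array and variable names are illustrative):

```php
<?php
$lines = [
    'breaking out a of a simple prison',
    'this is b moving up',
    'following me is x times better',
];

$cleaned = [];
foreach ($lines as $line) {
    // $1 puts back the leading whitespace (or nothing at the start of the string).
    $cleaned[] = preg_replace('/(^|\s)[a-z](\s|$)/', '$1', $line);
}

echo implode(PHP_EOL, $cleaned), PHP_EOL;
// breaking out of simple prison
// this is moving up
// following me is times better
```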
If you also want to remove characters immediately before punctuation (e.g. 'the x.'), then this should work properly in most (English) cases:
preg_replace('/(^|\s)[a-z]\b/', '$1', $string);
As a one-liner:
$result = preg_replace('/\s\p{Ll}\b|\b\p{Ll}\s/u', '', $subject);
This matches a single lowercase letter (\p{Ll}) which is preceded or followed by whitespace (\s), removing both. The word boundaries (\b) ensure that only single letters are indeed matched. The /u modifier makes the regex Unicode-aware.
The result: A single letter surrounded by spaces on both sides is reduced to a single space. A single letter preceded by whitespace but not followed by whitespace is removed completely, as is a single letter only followed but not preceded by whitespace.
So
This a is my test sentence a. o How funny (what a coincidence a) this is!
is changed to
This is my test sentence. How funny (what coincidence) this is!
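The transformation above can be reproduced with a few lines, shown here as a sketch:

```php
<?php
$subject = 'This a is my test sentence a. o How funny (what a coincidence a) this is!';

// Remove a lone lowercase letter together with the whitespace before or after it.
$result = preg_replace('/\s\p{Ll}\b|\b\p{Ll}\s/u', '', $subject);

echo $result, PHP_EOL; // This is my test sentence. How funny (what coincidence) this is!
```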
You could try something like this:
preg_replace('/\b\S\s\b/', "", $subject);
This is what it means:
\b # Assert position at a word boundary
\S # Match a single character that is a “non-whitespace character”
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
\b # Assert position at a word boundary
Update
As raised by Radu, because I've used the \S this will match more than just a-zA-Z. It will also match 0-9_. Normally, it would match a lot more than that, but because it's preceded by \b, it can only match word characters.
As mentioned in the comments by Tim Pietzcker, be aware that this won't work if the single character to remove is followed by a non-word character, as in test a (hello). It will also fail if there are extra spaces after the single character, like this:
test a   hello
but you could fix that by changing the expression to \b\S\s*\b
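The extra-spaces case can be checked with a short sketch. The subject string here uses three spaces after the single letter, which is an assumption about what the example was meant to show.

```php
<?php
$subject = 'test a   hello'; // three spaces after the single letter

// Original pattern: exactly one whitespace character, so nothing matches.
$before = preg_replace('/\b\S\s\b/', '', $subject);

// Adjusted pattern: any amount of whitespace after the single character.
$after = preg_replace('/\b\S\s*\b/', '', $subject);

echo $before, PHP_EOL; // test a   hello
echo $after, PHP_EOL;  // test hello
```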
Try this one:
$sString = preg_replace("#\b[a-z]{1}\b#m", ' ', $sString);

Replace placeholders which start with # then whole word

I need to replace words that start with hash mark (#) inside a text.
Well I know how I can replace whole words.
preg_replace("/\b".$variable."\b/", $value, $text);
Because \b only matches at the edge of a run of word characters, a word starting with a hash mark won't be replaced.
I have this html which contains #companyName type of variables which I replace with a value.
\b matches between an alphanumeric character (shorthand \w) and a non-alphanumeric character (\W), counting underscores as alphanumeric. This means, as you have seen, that it won't match before a # (unless that's preceded by an alnum character).
I suggest that you only surround your query word with \b if it starts and ends with an alnum character.
So, perhaps something like this (although I don't know any PHP, so this may be syntactically completely wrong):
if (preg_match('/^\w/', $variable))
$variable = '\b'.$variable;
if (preg_match('/\w$/', $variable))
$variable = $variable.'\b';
preg_replace('/'.$variable.'/', $value, $text);
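A runnable take on the same idea might look like the sketch below. The helper name placeholder_pattern, the sample text and the replacement value are all made up for this illustration, and preg_quote is added as an assumption that the placeholder should be matched literally.

```php
<?php
// Hypothetical helper: adds \b only on the sides that touch a word character.
function placeholder_pattern(string $variable): string
{
    // Escape regex metacharacters (and the "/" delimiter) in the placeholder.
    $pattern = preg_quote($variable, '/');
    if (preg_match('/^\w/', $variable)) {
        $pattern = '\b' . $pattern;
    }
    if (preg_match('/\w$/', $variable)) {
        $pattern .= '\b';
    }
    return '/' . $pattern . '/';
}

$text   = 'Welcome to #companyName, not #companyNameOld.';
$result = preg_replace(placeholder_pattern('#companyName'), 'Acme', $text);

echo $result, PHP_EOL; // Welcome to Acme, not #companyNameOld.
```

The trailing \b is what keeps #companyNameOld from being partially replaced.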
All \b does is match a change between non-word and word characters. Since you know $variable starts with non-word characters, you just need to precede the match by a non-word character (\W).
However, since you are replacing, you either need to make the non-word match zero-width, i.e. a look-behind:
preg_replace("/(?<=\\W)".$variable."\\b/", $value, $text);
or incorporate the matched character into the replacement text:
preg_replace("/(\\W)".$variable."\\b/", '${1}'.$value, $text);
Why not just
preg_replace("/#\b".$variable."\b/", $value, $text);
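The following expression can also be used to mark boundaries for words containing non-word characters. The captured boundary characters are put back via ${1} and ${2} so that they are not lost in the replacement:
preg_replace("/(^|\s|\W)".$variable."($|\s|\W)/", '${1}'.$value.'${2}', $text);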
