php - regular expression matching first occurrence

I need to match the first occurrence of the following pattern: it starts with \s or (, then NIC, followed by any characters, followed by # or ., followed by 5 or 6 digits.
Regular expression used :
preg_match('/[\\s|(]NIC.*[#|.]\d{5,6}/i', trim($test), $matches1);
Example:
$test = "(NIC.123456"; // works correctly
$test = "(NIC.123456 oldnic#65703 checking"; // produces the result (NIC.123456 oldnic#65703
But it needs to be only (NIC.123456. What is the problem?

You need to add the ? quantifier after .* to make the match non-greedy (lazy). Here .* is greedy, so it matches as much as possible.
You also don't need to double escape \\s here; you can just use \s. And you can list the characters directly inside the character class without the pipe |, which is a literal character there, not a delimiter.
Also note that your expression will match strings like (NIC_CCC.123456. To avoid this, you can use a word boundary \b, which matches the boundary between a word character and a non-word character.
preg_match('/(?<=^|\s)\(nic\b.*?[#.]\d{5,6}/i', $test, $match);
Regular expression:
(?<= look behind to see if there is:
^ the beginning of the string
| OR
\s whitespace (\n, \r, \t, \f, and " ")
) end of look-behind
\( '('
nic 'nic'
\b the boundary between a word char (\w) and not a word char
.*? any character except \n (0 or more times, matching the least amount possible)
[#.] any character of: '#', '.'
\d{5,6} digits (0-9) (between 5 and 6 times)
See live demo
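As a minimal sketch of the greedy versus lazy difference described above (it reuses the $test string from the question with the simplified [\s(] and [#.] character classes; the names $greedy and $lazy are only illustrative):

```php
<?php
$test = "(NIC.123456 oldnic#65703 checking";

// Greedy: .* runs as far as it can while [#.]\d{5,6} can still match after it.
preg_match('/[\s(]NIC.*[#.]\d{5,6}/i', $test, $greedy);

// Lazy: .*? stops at the first place where [#.]\d{5,6} matches.
preg_match('/[\s(]NIC.*?[#.]\d{5,6}/i', $test, $lazy);

echo $greedy[0], "\n"; // (NIC.123456 oldnic#65703
echo $lazy[0], "\n";   // (NIC.123456
```

Only the ? after .* differs between the two calls.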

Have you tried using
$test1 = explode(" ", $test);
and then using $test1[0] as your result?

Related

Why regex with lookaheads doesn't match?

I need (in PHP) to split a sentence by a word that cannot be the first or the last one in the sentence. Say the word is "pression" and here is my regex
/^.+?(?=[\s\.\,\:\;])pression(?=[\s\.\,\:\;]).+$/i
Live here: https://regex101.com/r/CHAhKj/1/
First, it doesn't match.
Next, is it at all possible to split that way? I tried a simplified example
print_r(preg_split('/^.+pizza.+$/', 'my pizza is cool'));
live here http://sandbox.onlinephpfunctions.com/code/10b674900fc1ef44ec79bfaf80e83fe1f4248d02
and it prints an array of 2 empty strings, when I expect
['my ', ' is cool']
I need (in PHP) to split a sentence by the word that cannot be the first or the last one in the sentence
You may use this regex:
(?<=[^\s.?]\h)pression(?=\h[^\s.?])
RegEx Demo
RegEx Details:
(?<=[^\s.?]\h): Lookbehind to assert that before the current position we have a character that is not a whitespace, not a dot and not a ?, followed by a horizontal whitespace.
pression: Match the word pression
(?=\h[^\s.?]): Lookahead to assert that after the current position we have a horizontal whitespace followed by a character that is not a whitespace, not a dot and not a ?
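A small sketch of how this pattern could be used with preg_split, assuming the word is "pression". The sample sentences and the names $re, $middle and $first are made up for illustration:

```php
<?php
$re = '~(?<=[^\s.?]\h)pression(?=\h[^\s.?])~';

// The word is in the middle of the sentence, so the string is split around it.
$middle = preg_split($re, 'my pression is cool');

// The word is the first one: the lookbehind fails, so nothing is split.
$first = preg_split($re, 'pression is cool');

print_r($middle); // two pieces: "my " and " is cool"
print_r($first);  // one piece: the whole string
```

Because the lookarounds do not consume the surrounding spaces, the spaces stay in the returned pieces.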
First, ^.+?(?=[\s\.\,\:\;])pression(?=[\s\.\,\:\;]).+$ can't match any string at all because the (?=[\s\.\,\:\;])p part requires p to be also either a whitespace char, or a ., ,, : or ;, which invalidates the whole match at once.
Second, the ^.+pizza.+$ pattern does not ensure the pizza matched is not the first or last word in a sentence, as . matches whitespace, too. It does not return anything meaningful either: preg_split removes each match and returns the pieces around it, and since this pattern matches the whole string, the two empty values are the empty strings 1) before the match, at the start of the string, and 2) after the match, at the end.
That said, all you need is:
preg_match('~^(.*?\w\W+)pression(\W+\w.*)$~is', $text, $m)
See the regex demo. Details:
^ - start of string
(.*?\w\W+) - Capturing group 1: any zero or more chars, as few as possible, then a word char and then one or more non-word chars
pression - a word
(\W+\w.*) - Capturing group 2: one or more non-word chars, a word char, and then any zero or more chars as many as possible
$ - end of string.
s makes the . match across lines and i flag makes the pattern match in a case insensitive way.
See the PHP demo:
$text = "You can use any regular expression pression inside the lookahead ";
if (preg_match('~^(.*?\w\W+)pression(\W+\w.*)$~is', $text, $m)) {
echo $m[1] . " << | >> " . $m[2];
}
// => You can use any regular expression << | >> inside the lookahead

How to use preg_replace to remove excessive single spaces

We are extracting text from PDF files, and there is a high frequency of results that contain malformed text. Specifically, spaces are added between the characters of a word, e.g. SEATTLE is returned as S E A T T L E.
Is there a regular expression for preg_replace that can remove the spaces from a run of n single-character "words"? Specifically, remove spaces from any occurrence of a string of more than 3 single alpha characters separated by spaces?
I've googled this for a while, but can't even imagine how to construct the expression. As expressed in a comment, I don't want ALL spaces removed, but only when there is an occurrence of >3 single alpha characters, e.g. Welcome to the Greater S E A T T L E area should become Welcome to the Greater SEATTLE area. The result is to be used in full text searching, so case sensitivity is not a concern.
You may use a simple approach with a preg_replace_callback. Match '~\b[A-Za-z](?: [A-Za-z]){2,}\b~' and str_replace spaces in the anonymous function:
$regex = '~\b[A-Za-z](?: [A-Za-z]){2,}\b~';
$result = preg_replace_callback($regex, function($m) {
return str_replace(" ", "", $m[0]);
}, $s);
See the regex demo.
To only match sequences of uppercase letters, remove a-z from the pattern:
$regex = '~\b[A-Z](?: [A-Z]){2,}\b~';
And another thing: there may be soft/hard spaces, tabs, or other kinds of whitespace. In that case, use \h instead of the literal space and add the u modifier:
$regex = '~\b[A-Za-z](?:\h[A-Za-z]){2,}\b~u';
Finally, to match any Unicode letter, use \p{L} (to only match uppercase ones, \p{Lu}) instead of [a-zA-Z]:
$regex = '~\b\p{L}(?:\h\p{L}){2,}\b~u';
NOTE: It will most probably fail to work in some cases, e.g. when there are one-letter words. You will have to handle those cases separately/manually. Anyway, there is no safe regex-only way to fix OCR issues.
Pattern details
\b - a word boundary
[A-Za-z] - a single letter
(?: [A-Za-z]){2,} - 2 or more occurrences of a space (\h matches any kind of horizontal whitespace) followed by [A-Za-z], a single letter
\b - a word boundary
When using the u modifier, \h becomes Unicode-aware.
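A runnable sketch of the callback approach above, using the sentence from the question. The second input is a made-up example that shows the {2,} threshold leaving a run of only two single letters alone:

```php
<?php
$regex = '~\b[A-Za-z](?: [A-Za-z]){2,}\b~';

// Strip the spaces inside every matched run of single letters.
$collapse = function (string $s) use ($regex): string {
    return preg_replace_callback($regex, function ($m) {
        return str_replace(' ', '', $m[0]);
    }, $s);
};

$fixed = $collapse('Welcome to the Greater S E A T T L E area');
$untouched = $collapse('Plan A B starts today');

echo $fixed, "\n";     // Welcome to the Greater SEATTLE area
echo $untouched, "\n"; // Plan A B starts today
```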
You could do this in one go:
(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)
See live demo here
Explanation:
(?i: # Start of non-capturing group with case-insensitive modifier on
(?<!\S) # Negative lookbehind to ensure there is no leading non-whitespace character
([a-z]) + # Capture one letter and at least one space
((?1)) # Capture one letter in 2nd capturing group
| # Or
\G(?!\A) + # Start match from where previous match ends
# with matching spaces
((?1))\b # Match a letter at word boundary
) # End of non-capturing group
PHP code:
$str = preg_replace('~(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)~', '$1$2$3', $str);
You may use this pure regex approach with lookarounds and \G:
$re = '~\b(?:(?=(?:\pL\h+){3}\pL\b)|(?<!^)\G)(\pL)\h+(?=\pL\b)~';
$repl = preg_replace($re, '$1', $str);
RegEx Demo
RegEx Details:
\b: Match word boundary
(?:: Start non-capture group
(?=(?:\pL\h+){3}\pL\b): Lookahead to assert we have at least 4 single letters (that is, more than 3) separated by 1+ horizontal whitespace characters
|: OR
(?<!^)\G: \G asserts position at the end of the previous match. (?<!^) ensures we don't match start of the string for the first match
): End non-capture group
(\pL): Match a single letter and capture it
\h+: Followed by 1+ horizontal whitespace
(?=\pL\b): Assert that we only have a single letter ahead
In the replacement we use $1 which is the letter left of whitespace we capture

php regex: if line doesn't end with... remove line

I have a string stored in variable $text:
$text = '
I should not be removed.
I should not be removed.
I should not be removed?
I should not be removed!
I should be removed
I should be removed-
I should not be removed?
';
I want to remove all lines in the string that do not end with ., ? or !. How do I do this effectively? Maybe a preg_replace() approach?
If there is no whitespace at the end of the lines, you can use
'~^.*(?<![.?!])$\R?~m'
See regex demo
Explanation:
^ - start of line (as /m modifier indicates the multiline mode when ^ and $ match start and end of line, not string)
.* - any characters but a newline up to...
(?<![.?!])$ - the end of the line that is not preceded with a . or ! or ?
\R? - optional line break
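A minimal sketch of this first pattern in use, assuming the lines have no trailing whitespace. The sample text and the $text and $result names are illustrative only:

```php
<?php
$text = "Keep me.\nRemove me\nKeep me?\nRemove me-\nKeep me!";

// Drop every line whose last character is not ., ? or !,
// together with the line break that follows it (if any).
$result = preg_replace('~^.*(?<![.?!])$\R?~m', '', $text);

echo $result;
// Keep me.
// Keep me?
// Keep me!
```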
To ignore the trailing whitespace, use a lookahead based regex:
'~^(?!.*[.?!]\h*$).*$\R?~m'
See regex demo
Explanation:
^ - start of a line
(?!.*[.?!]\h*$) - a negative lookahead that fails the match if there is a ., ? or ! at the end of the line, optionally followed by horizontal whitespace (\h*)
.*$ - any characters but a newline, 0 or more occurrences, up to the end of the line
\R? - optional newline sequence (optional, as the last line may not be followed with a newline character).
PHP code demo:
$re = '~^(?!.*[.?!]\h*$).*$\R?~m';
$str = "I should not be removed. \nI should not be removed.\nI should not be removed?\nI should not be removed! \nI should be removed\nI should be removed-\nI should not be removed? ";
$result = preg_replace($re, "", $str);
echo $result;
If you need to ignore the whitespace and punctuation, just add a [\p{P}\h] character class to the lookahead:
^(?!.*[.?!][\p{P}\h]*$).*$\R?
See demo. Now, the lookahead looks like (?!.*[.?!][\p{P}\h]*$). It fails a match if there is a ., ?, or ! followed by punctuation (\p{P}) or horizontal whitespace (\h), zero or more occurrences (*).
AND FINAL UPDATE: If you need to also ignore all non-word symbols (including Unicode letters) and all HTML entities, you can use
'~^(?!.*[.?!](&\w+;|\W)*$).*$\R?~m'
See another regex demo and an IDEONE demo. The lines ending with . followed by an HTML entity (for example, &nbsp;) do not get removed.
The difference here is (&\w+;|\W)* that matches 0 or more substrings starting with & and followed by 1 or more word characters (letters [A-Za-z], digits ([0-9]) or an underscore) and then a semi-colon, or non-word characters (\W). You can unroll the pattern as [^\w&]*(?:&\w+;\W*)* so that the regex performance might improve.
Note that \W here also matches the bytes of non-ASCII (Unicode) letters and symbols, since the /u modifier is not used.

Regular expression to remove trailing chars

I'm looking for a regular expression in Php that could transform incoming strings like this:
abaisser_negation_pronominal_question => abaisser_n_p_q
abaisser_pronominal_question => abaisser_p_q
abaisser_negation_question => abaisser_n_q
abaisser_negation_pronominal => abaisser_n_p
abaisser_negation_voix_passive_pronominal => abaisser_n_v_p_p
abaisser => abaisser
With the Php code close to something like:
$line=preg_replace("/<h3>/im", "", $line);
How would you do?
You can use:
$input = preg_replace('/(_[A-Za-z])[^_\n]*/', '$1', $input);
RegEx Demo
Explanation:
This regex searches for (_[A-Za-z])[^_\n]*, which means an underscore followed by a single letter, and then any characters up to (but not including) the next underscore or newline
It captures the first part (_[A-Za-z]) in group $1
The replacement is $1, which leaves the underscore and the first letter in the resulting string
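A short sketch that runs this replacement over three of the inputs from the question. The $inputs and $outputs arrays are just illustrative names:

```php
<?php
$inputs = [
    'abaisser_negation_pronominal_question',
    'abaisser_negation_voix_passive_pronominal',
    'abaisser',
];

// Keep each underscore plus the first letter after it; drop the rest of the word.
$outputs = preg_replace('/(_[A-Za-z])[^_\n]*/', '$1', $inputs);

print_r($outputs);
// abaisser_n_p_q, abaisser_n_v_p_p, abaisser
```

preg_replace accepts an array as the subject and returns an array, so no loop is needed here.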
You could use \K or positive lookbehind.
$input = preg_replace('~_.\K[^_\n]*~', '', $input);
The pattern _. in the above regex matches an _ and the character following it. \K then discards what has been matched so far (the _ plus the following character) from the reported match, so these two characters are not replaced. Next, [^_\n]* matches zero or more characters other than _ or a newline, that is, everything up to the next _ or \n. Replacing those characters with an empty string gives you the desired output.
DEMO
$input = preg_replace('~(?<=_.)[^_\n]*~', '', $input);
It looks behind for the _ and the character following the _, and matches all the characters up to the next underscore or newline character.
DEMO
You can use regex
$input = preg_replace('/_(.)[^\n_]+/', '_$1', $input);
DEMO
It captures the character after _ and matches until \n or _ is encountered. The match is replaced with _$1, which means _ plus the captured character.
$line = preg_replace("/_([a-z])([a-z]*)/i", "_$1", $line);

how to use preg_split() in php?

Can anybody explain to me how to use preg_split() function?
I didn't understand the pattern parameter like this "/[\s,]+/".
for example:
I have this subject: is is. and I want the results to be:
array (
0 => 'is',
1 => 'is',
)
so it will ignore the space and the full stop. How can I do that?
preg means "Pcre REGexp", which is kind of redundant, since "PCRE" means "Perl Compatible Regexp".
Regexps are a nightmare to the beginner. I still don't fully understand them and I've been working with them for years.
Basically the example you have there, broken down is:
"/[\s,]+/"
/ = start or end of pattern string
[ ... ] = a character class: matches any one of the characters listed inside
+ = one or more of the preceding character or group
\s = any whitespace character (space, tab, newline)
, = the literal comma character
So you have a search pattern that means "split on any part of the string that is a run of one or more whitespace characters and/or commas".
Other common characters are:
. = any single character (except a newline, by default)
* = zero or more of the preceding character or group
^ (at start of pattern) = the start of the string
$ (at end of pattern) = the end of the string
^ (at the start of [...]) = "NOT" any of the listed characters
For PHP there is good information in the official documentation.
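A small sketch of the breakdown above in action. The fruit string is a made-up sample, and the second call is one possible way (an assumption, not the only option) to handle the "is is." subject from the question by putting the full stop into the character class:

```php
<?php
// Split on runs of whitespace and/or commas.
$words = preg_split('/[\s,]+/', 'apples, pears   plums,grapes');

// For the subject in the question: split on whitespace and full stops,
// and drop the empty piece left after the trailing full stop.
$asked = preg_split('/[\s.]+/', 'is is.', -1, PREG_SPLIT_NO_EMPTY);

print_r($words); // apples, pears, plums, grapes
print_r($asked); // is, is
```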
This should work:
$words = preg_split("/(?<=\w)\b\s*[!?.]*/", 'is is.', -1, PREG_SPLIT_NO_EMPTY);
echo '<pre>';
print_r($words);
echo '</pre>';
The output would be:
Array
(
[0] => is
[1] => is
)
Before I explain the regex, a note on PREG_SPLIT_NO_EMPTY: with this flag, preg_split returns only the non-empty pieces. This ensures that the array $words contains only real data and no empty strings, which can otherwise appear when the pattern matches at the very start or end of the string or when two matches are adjacent.
And the explanation of that regex can be broken down like this using this tool:
NODE EXPLANATION
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
\w word characters (a-z, A-Z, 0-9, _)
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
[!?.]* any character of: '!', '?', '.' (0 or more
times (matching the most amount possible))
A nicer explanation can be found by entering the full regex pattern /(?<=\w)\b\s*[!?.]*/ in this other tool:
(?<=\w) Positive Lookbehind - Assert that the regex below can be matched
\w match any word character [a-zA-Z0-9_]
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
\s* match any white space character [\r\n\t\f ]
Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
[!?.]* match a single character in the list !?. literally
That last regex explanation can be boiled down by a human (also known as me) to the following:
Split at any word boundary that follows a word character, consuming any whitespace and any of the punctuation marks !, ? and . that come right after it.
PHP's str_word_count may be a better choice here.
str_word_count($string, 2) will return an array of all words in the string, keyed by their position in the string and including duplicates.
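A quick sketch of this suggestion applied to the subject from the question, showing both array formats. The $list and $positions names are illustrative:

```php
<?php
$string = 'is is.';

// Format 1: a plain list of the words found.
$list = str_word_count($string, 1);

// Format 2: the same words, keyed by their position in the string.
$positions = str_word_count($string, 2);

print_r($list);      // [0] => is, [1] => is
print_r($positions); // [0] => is, [3] => is
```

If sequential keys are wanted, as in the array the question asks for, format 1 is the closer fit.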
Documentation says:
The preg_split() function operates exactly like split(), except that
regular expressions are accepted as input parameters for pattern.
So, the following code...
<?php
$ip = "123 ,456 ,789 ,000";
$iparr = preg_split ("/[\s,]+/", $ip);
print "$iparr[0] <br />";
print "$iparr[1] <br />" ;
print "$iparr[2] <br />" ;
print "$iparr[3] <br />" ;
?>
This will produce the following result:
123
456
789
000
So, if you have this subject: is is and you want:
array (
0 => 'is',
1 => 'is',
)
you need to change your regex to "/[\s]+/"
unless your subject is is ,is, in which case you need the regex you already have, "/[\s,]+/"
