I have a string stored in variable $text:
$text = '
I should not be removed.
I should not be removed.
I should not be removed?
I should not be removed!
I should be removed
I should be removed-
I should not be removed?
';
I want to remove all lines in the string that do not end with ., ? or !. How do I do this effectively? Maybe a preg_replace() approach?
If there is no whitespace at the end of the lines, you can use
'~^.*(?<![.?!])$\R?~m'
See regex demo
Explanation:
^ - start of line (as /m modifier indicates the multiline mode when ^ and $ match start and end of line, not string)
.* - any characters but a newline up to...
(?<![.?!])$ - the end of the line that is not preceded with a ., ! or ?
\R? - optional line break
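A minimal sketch of the lookbehind pattern in use, assuming no line has trailing whitespace. The sample text and the variable names are made up for illustration:

```php
<?php
// Hypothetical sample: the 2nd and 4th lines lack a final ., ? or !
$text = "Keep me.\nRemove me\nKeep me too?\nRemove me-\nKeep!";

// m flag: ^ and $ match at line boundaries.
// \R? also consumes the line break, so no empty line is left behind.
$clean = preg_replace('~^.*(?<![.?!])$\R?~m', '', $text);

echo $clean, PHP_EOL;
```

Only the three lines that end with ., ? or ! should remain.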
To ignore the trailing whitespace, use a lookahead based regex:
'~^(?!.*[.?!]\h*$).*$\R?~m'
See regex demo
Explanation:
^ - start of a line
(?!.*[.?!]\h*$) - a negative lookahead that fails the match if the line ends with a ., ? or !, optionally followed by horizontal whitespace (\h*)
.*$ - any characters but a newline, 0 or more occurrences, up to the end of the line
\R? - optional newline sequence (optional, as the last line may not be followed with a newline character).
PHP code demo:
$re = '~^(?!.*[.?!]\h*$).*$\R?~m';
$str = "I should not be removed. \nI should not be removed.\nI should not be removed?\nI should not be removed! \nI should be removed\nI should be removed-\nI should not be removed? ";
$result = preg_replace($re, "", $str);
echo $result;
If you need to ignore the whitespace and punctuation, just add a [\p{P}\h] character class to the lookahead:
^(?!.*[.?!][\p{P}\h]*$).*$\R?
See demo. Now, the lookahead looks like (?!.*[.?!][\p{P}\h]*$). It fails a match if there is a ., ?, or ! followed by punctuation (\p{P}) or horizontal whitespace (\h), zero or more occurrences (*).
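A small sketch of the punctuation-tolerant variant, assuming closing quotes or brackets may follow the final mark. The sample text is invented for illustration:

```php
<?php
// Hypothetical sample: lines 1 and 3 end with ./? followed by more punctuation
$text = "She said \"stop.\"\nNo ending here\nReally?) \nTrailing dash-";

// [\p{P}\h]* lets any punctuation or horizontal whitespace follow the ., ? or !
$clean = preg_replace('~^(?!.*[.?!][\p{P}\h]*$).*$\R?~m', '', $text);

echo $clean;
```

The second and fourth lines contain no ., ? or ! at all, so they are the ones that get removed.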
AND FINAL UPDATE: If you need to also ignore all non-word symbols (including Unicode letters) and all HTML entities, you can use
'~^(?!.*[.?!](&\w+;|\W)*$).*$\R?~m'
See another regex demo and an IDEONE demo. Lines that end with a . followed by an HTML entity do not get removed.
The difference here is (&\w+;|\W)* that matches 0 or more substrings starting with & and followed by 1 or more word characters (letters [A-Za-z], digits ([0-9]) or an underscore) and then a semi-colon, or non-word characters (\W). You can unroll the pattern as [^\w&]*(?:&\w+;\W*)* so that the regex performance might improve.
Note that \W here also matches non-ASCII letters and symbols, because the /u modifier is not used.
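A rough sketch of the entity-aware variant, assuming the text carries HTML entities such as &nbsp; or &quot; after the final mark. The sample lines are hypothetical:

```php
<?php
// Hypothetical sample: lines 1 and 3 end with ./! followed by an HTML entity
$text = "Done.&nbsp;\nNot done &amp;\nOk!&quot; \nNo mark here";

// (&\w+;|\W)* skips entities and non-word characters after the ., ? or !
$clean = preg_replace('~^(?!.*[.?!](&\w+;|\W)*$).*$\R?~m', '', $text);

echo $clean;
```

The second line ends with an entity but has no ., ? or ! before it, so it is removed along with the last line.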
I want to split a string only at whitespace that does not have a certain delimiter (: in my case) before it. E.g.:
$string = "Time: 10:40 Request: page.php Action: whatever this is Refer: Facebook";
Then from something like this I want to achieve an array such that:
$array = ["Time: 10:40", "Request: page.php", "Action: whatever this is", "Refer: Facebook"];
I've tried the following so far:
$split = preg_split('/(:){0}\s/', $visit);
But this is still splitting at every occurrence of a whitespace character.
Edit: I think I asked the wrong question, however "whatever this is" should stay as a single string
Edit 2: The bits before the colons are known and stay the same, maybe incorporating those somehow makes the task easier (of not splitting at whitespace characters in strings that should stay together)?
You can use a lookahead in your split regex:
/\h+(?=[A-Z][a-z]*: )/
RegEx Demo
Regex \h+(?=[A-Z][a-z]*: ) matches 1+ horizontal whitespace characters that are followed by a word starting with an uppercase letter, then a colon and a space.
You can do it like this:
$string = "Time: 10:40 Request: page.php Action: whatever this is Refer: Facebook";
$split = preg_split('/\h+(?=[A-Z][a-z]*:)/', $string);
print_r($split); // dd() is a Laravel helper; print_r() works in plain PHP
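Since Edit 2 of the question says the labels before the colons are known, a stricter variant could list them explicitly in the lookahead. This is only a sketch under that assumption; the $labels array and variable names are mine, not from the question:

```php
<?php
$string = "Time: 10:40 Request: page.php Action: whatever this is Refer: Facebook";

// Assumption: the complete set of labels is known in advance
$labels = ['Time', 'Request', 'Action', 'Refer'];

// preg_quote() guards against regex metacharacters inside a label
$alternation = implode('|', array_map('preg_quote', $labels));

// Split on whitespace only when one of the known labels plus a colon follows
$parts = preg_split('/\h+(?=(?:' . $alternation . '):)/', $string);

print_r($parts);
```

This way a value that happens to start with a capitalised word and a colon is not mistaken for a new field.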
Another option could be to match what is before the colon and then match upon the next part that starts with a space, non whitespace chars and colon:
\S+:\h+.*?(?=\h+\S+:)\K\h+
\S+: Match 1+ non-whitespace chars, then a colon
\h+ Match 1+ times a horizontal whitespace char
.*? Match any char except a newline non greedy
(?=\h+\S+:) Positive lookahead, assert what is on the right is 1+ horizontal whitespace chars, 1+ non whitespace chars and a colon
\K\h+ Forget what was matched using \K and match 1+ horizontal whitespace chars
Regex demo | php demo
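For completeness, a short sketch of how the \K pattern could be passed to preg_split(), using the string from the question. Only the whitespace matched after \K is treated as the delimiter:

```php
<?php
$string = "Time: 10:40 Request: page.php Action: whatever this is Refer: Facebook";

// Everything before \K is matched but dropped from the reported match,
// so only the trailing \h+ acts as the split delimiter
$parts = preg_split('/\S+:\h+.*?(?=\h+\S+:)\K\h+/', $string);

print_r($parts);
```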
We are extracting text from PDF files, and a high proportion of the results contain malformed text. Specifically, spaces are added between the characters of a word, e.g. SEATTLE is being returned as S E A T T L E.
Is there a RegEx expression for preg_replace that can remove any spaces in the case of n number of single character "words"? Specifically, remove spaces from any occurrence of a string that is more than 3 single alpha characters and is separated by spaces?
I've googled this for a while, but can't even imagine how to construct the expression. As expressed in a comment, I don't want ALL spaces removed, but only when there is an occurrence of >3 single alpha characters, e.g. Welcome to the Greater S E A T T L E area should become Welcome to the Greater SEATTLE area. The result is to be used in full text searching, so case sensitivity is not a concern.
You may use a simple approach with a preg_replace_callback. Match '~\b[A-Za-z](?: [A-Za-z]){2,}\b~' and str_replace spaces in the anonymous function:
$regex = '~\b[A-Za-z](?: [A-Za-z]){2,}\b~';
$result = preg_replace_callback($regex, function($m) {
return str_replace(" ", "", $m[0]);
}, $s);
See the regex demo.
To only match sequences of uppercase letters, remove a-z from the pattern:
$regex = '~\b[A-Z](?: [A-Z]){2,}\b~';
And another thing: there may be soft/hard spaces, tabs, other kind of whitespace. Then, use
$regex = '~\b[A-Za-z](?:\h[A-Za-z]){2,}\b~u';
// changed: \h instead of the literal space, plus the u modifier
Finally, to match any Unicode letter, use \p{L} (to only match uppercase ones, \p{Lu}) instead of [a-zA-Z]:
$regex = '~\b\p{L}(?:\h\p{L}){2,}\b~u';
NOTE: It will most probably fail to work in some cases, e.g. when there are one-letter words. You will have to handle those cases separately/manually. Anyway, there is no safe regex-only way to fix OCR issues.
Pattern details
\b - a word boundary
[A-Za-z] - a single letter
(?: [A-Za-z]){2,} - 2 or more occurrences of:
    " " - a space (\h matches any kind of horizontal whitespace)
    [A-Za-z] - a single letter
\b - a word boundary
When using the u modifier, \h becomes Unicode-aware.
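A self-contained sketch of the callback approach, run on the sentence from the question and on a made-up second sentence that shows the one-letter-word caveat from the NOTE. The $join closure is just a helper name chosen for this example:

```php
<?php
$regex = '~\b[A-Za-z](?: [A-Za-z]){2,}\b~';

// Hypothetical helper: strips the spaces inside every matched run of single letters
$join = function (string $s) use ($regex): string {
    return preg_replace_callback($regex, function ($m) {
        return str_replace(' ', '', $m[0]);
    }, $s);
};

$fixed  = $join('Welcome to the Greater S E A T T L E area');
// Caveat: the genuine one-letter word "a" gets glued to the spaced-out word
$caveat = $join('I need a B I G favor');

echo $fixed, PHP_EOL;
echo $caveat, PHP_EOL;
```

The second call is why results like this usually need a manual or dictionary-based check afterwards.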
You could do this in one go:
(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)
See live demo here
Explanation:
(?i: # Start of non-capturing group with case-insensitive modifier on
(?<!\S) # Negative lookbehind to ensure there is no leading non-whitespace character
([a-z]) + # Capture one letter and at least one space
((?1)) # Capture one letter in 2nd capturing group
| # Or
\G(?!\A) + # Start match from where previous match ends
# with matching spaces
((?1))\b # Match a letter at word boundary
) # End of non-capturing group
PHP code:
$str = preg_replace('~(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)~', '$1$2$3', $str);
You may use this pure regex approach with lookarounds and \G:
$re = '~\b(?:(?=(?:\pL\h+){3}\pL\b)|(?<!^)\G)(\pL)\h+(?=\pL\b)~';
$repl = preg_replace($re, '$1', $str);
RegEx Demo
RegEx Details:
\b: Match word boundary
(?:: Start non-capture group
(?=(?:\pL\h+){3}\pL\b): Lookahead to assert we have at least 4 single letters separated by 1+ horizontal whitespace characters
|: OR
(?<!^)\G: \G asserts position at the end of the previous match. (?<!^) ensures we don't match start of the string for the first match
): End non-capture group
(\pL): Match a single letter and capture it
\h+: Followed by 1+ horizontal whitespace
(?=\pL\b): Assert that we only have a single letter ahead
In the replacement we use $1 which is the letter left of whitespace we capture
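A quick sketch to check the behaviour described above: a run of four or more single letters is joined, while a shorter run is left alone. The second input string is made up for illustration:

```php
<?php
$re = '~\b(?:(?=(?:\pL\h+){3}\pL\b)|(?<!^)\G)(\pL)\h+(?=\pL\b)~';

// Seven single letters in a row: every inner space is dropped
$fixed = preg_replace($re, '$1', 'Welcome to the Greater S E A T T L E area');

// Only three single letters: the lookahead never succeeds, so nothing changes
$short = preg_replace($re, '$1', 'a b c');

echo $fixed, PHP_EOL;
echo $short, PHP_EOL;
```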
I need to match the first occurrence of the following pattern: it starts with \s or (, then NIC, followed by any characters, followed by # or ., followed by 5 or 6 digits.
Regular expression used :
preg_match('/[\\s|(]NIC.*[#|.]\d{5,6}/i', trim($test), $matches1);
Example:
$test = "(NIC.123456"; // works correctly
$test = "(NIC.123456 oldnic#65703 checking"; // produces the result (NIC.123456 oldnic#65703
But it needs to be only (NIC.123456. What is the problem?
You need to follow .* with ? to make the match non-greedy. As written, .* matches as much as possible.
You also don't need to double escape \\s here; \s is enough. Inside a character class the pipe | is a literal character, not an alternation operator, so just list the characters you want.
Also note that your expression will match strings like (NIC_CCC.123456. To avoid this, use a word boundary \b, which matches the position between a word character and a non-word character.
preg_match('/(?<=^|\s)\(nic\b.*?[#.]\d{5,6}/i', $test, $match);
Regular expression:
(?<= look behind to see if there is:
^ the beginning of the string
| OR
\s whitespace (\n, \r, \t, \f, and " ")
) end of look-behind
\( '('
nic 'nic'
\b the boundary between a word char (\w) and not a word char
.*? any character except \n (0 or more times)
[#.] any character of: '#', '.'
\d{5,6} digits (0-9) (between 5 and 6 times)
See live demo
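To see the difference side by side, here is a small sketch that runs a greedy pattern (the question's pattern with the character classes cleaned up) and the non-greedy one from this answer against the second test string:

```php
<?php
$test = "(NIC.123456 oldnic#65703 checking";

// Greedy .* runs to the LAST "#/. plus 5-6 digits" it can find
preg_match('/[\s(]NIC.*[#.]\d{5,6}/i', $test, $greedy);

// Lazy .*? stops at the FIRST one
preg_match('/(?<=^|\s)\(nic\b.*?[#.]\d{5,6}/i', $test, $lazy);

echo $greedy[0], PHP_EOL;
echo $lazy[0], PHP_EOL;
```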
Have you tried using
$test1 = explode(" ", $test);
and then using $test1[0] to display your result?
I am writing a script to clean up a file line-by-line with non-ascii characters, but I am having trouble with a regex pattern. I need a regex pattern that matches any line that starts with an asterisk, may have an equals, and will contain non-ascii characters and spaces. I know how to match a non-ascii character, but not in the same set as other positively defined characters.
Here is a sample line that I need to match:
* = Ìÿð ÿð
Here is the pattern I have so far:
/\*[^[:ascii:]]+[\r\n]/
This will match lines that start with asterisk and containing non-ascii characters, but not if the line has spaces or equals in it.
Try the following expression:
^\*\s*=?\s*[[:^ascii:]\s]+[\r\n]*$
This matches the start of the line ^, then a literal asterisk \*, then zero or more whitespace characters \s*, followed by an optional equals sign =?, then zero or more whitespace characters \s*. If the input holds several lines at once, add the m modifier so that ^ and $ match at line boundaries.
Next, the character class [[:^ascii:]\s]+ matches one or more characters that are either non-ASCII or whitespace; check the docs for the character class syntax.
Finally the expression matches a combination of carriage returns and newlines which may end the line.
Regex101 Demo
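A minimal sketch of how the expression could be used for a line-by-line cleanup, assuming the lines are already in an array. The sample lines, the / delimiters and the variable names are my own choices for illustration:

```php
<?php
$pattern = '/^\*\s*=?\s*[[:^ascii:]\s]+[\r\n]*$/';

// Hypothetical input; \u{...} escapes (PHP 7+) spell out the non-ASCII sample line
$lines = [
    "* = \u{CC}\u{FF}\u{F0} \u{FF}\u{F0}", // asterisk, equals, non-ASCII: should be dropped
    "* = plain ascii",                     // ASCII text after the equals: kept
    "keep me",                             // no leading asterisk: kept
];

// PREG_GREP_INVERT returns the lines that do NOT match the pattern
$kept = array_values(preg_grep($pattern, $lines, PREG_GREP_INVERT));

print_r($kept);
```

Because each array element is a single line, the m modifier is not needed here.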
Maybe this - (edit: changed after reread )
# ^\*(?=.*[^\0-\177])
^
\*
(?= .* [^\0-\177] )
Consider the following strings
breaking out a of a simple prison
this is b moving up
following me is x times better
All strings are lowercased already. I would like to remove any "loose" a-z characters, resulting in:
breaking out of simple prison
this is moving up
following me is times better
Is this possible with a single regex in php?
$str = "breaking out a of a simple prison
this is b moving up
following me is x times better";
$res = preg_replace("#\\b[a-z]\\b ?#i", "", $str);
echo $res;
How about:
preg_replace('/(^|\s)[a-z](\s|$)/', '$1', $string);
Note this also catches single characters that are at the beginning or end of the string, but not single characters that are adjacent to punctuation (they must be surrounded by whitespace).
If you also want to remove characters immediately before punctuation (e.g. 'the x.'), then this should work properly in most (English) cases:
preg_replace('/(^|\s)[a-z]\b/', '$1', $string);
As a one-liner:
$result = preg_replace('/\s\p{Ll}\b|\b\p{Ll}\s/u', '', $subject);
This matches a single lowercase letter (\p{Ll}) which is preceded or followed by whitespace (\s), removing both. The word boundaries (\b) ensure that only single letters are indeed matched. The /u modifier makes the regex Unicode-aware.
The result: A single letter surrounded by spaces on both sides is reduced to a single space. A single letter preceded by whitespace but not followed by whitespace is removed completely, as is a single letter only followed but not preceded by whitespace.
So
This a is my test sentence a. o How funny (what a coincidence a) this is!
is changed to
This is my test sentence. How funny (what coincidence) this is!
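A short sketch that runs the one-liner on the test sentence above and on the three strings from the question (preg_replace() accepts an array of subjects):

```php
<?php
$pattern = '/\s\p{Ll}\b|\b\p{Ll}\s/u';

$sentence = 'This a is my test sentence a. o How funny (what a coincidence a) this is!';
$result = preg_replace($pattern, '', $sentence);

// The strings from the question, processed in one call
$lines = [
    'breaking out a of a simple prison',
    'this is b moving up',
    'following me is x times better',
];
$cleaned = preg_replace($pattern, '', $lines);

echo $result, PHP_EOL;
print_r($cleaned);
```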
You could try something like this:
preg_replace('/\b\S\s\b/', "", $subject);
This is what it means:
\b # Assert position at a word boundary
\S # Match a single character that is a “non-whitespace character”
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
\b # Assert position at a word boundary
Update
As raised by Radu, because I've used the \S this will match more than just a-zA-Z. It will also match 0-9_. Normally, it would match a lot more than that, but because it's preceded by \b, it can only match word characters.
As mentioned in the comments by Tim Pietzcker, be aware that this won't work if your subject string needs to remove single characters that are followed by non word characters like test a (hello). It will also fall over if there are extra spaces after the single character like this
test a    hello
but you could fix that by changing the expression to \b\S\s*\b
Try this one:
$sString = preg_replace("#\b[a-z]{1}\b#m", ' ', $sString);