preg_replace doesn't match with two spaces between words - php

I need to format uppercase words as bold, but it doesn't work if there are two spaces between the words.
Is there any way to make the regex match only words that end with a colon?
$str = "BAKA NO TEST: hey";
$str = preg_replace('~[A-Z]{4,}\s[A-Z]\s{2,}(?:\s[A-Z]{4,})?:?~', '<b>$0</b>', $str);
Expected output: <b>BAKA NO TEST:</b> hey
But it returns <b>BAKA</b> NO TEST: hey
The original $str is a multiline text with many lowercase and uppercase words, but I need to change only some of them.

You can do it like this:
$txt = preg_replace('~[A-Z]+(?:\s[A-Z]+)*:~', '<b>$0</b>', $txt);
Explanations:
[A-Z]+ # one or more uppercase letters
(?: # open a non-capturing group
\s # a whitespace character (space, tab, newline, ...)
[A-Z]+ # one or more uppercase letters
)* # close the group and repeat it zero or more times
: # a literal colon
If you want a more tolerant pattern, you can replace \s with \s+ to allow more than one space between words.
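A minimal sketch of the more tolerant variant, with \s+ inside the group. The sample text and the variable names are assumptions made for illustration, not the asker's real data:

```php
<?php
// Uppercase words separated by one or more whitespace characters, ending in a colon.
$pattern = '~[A-Z]+(?:\s+[A-Z]+)*:~';

// Two spaces between NO and TEST; OTHER has no colon and should stay untouched.
$txt = "BAKA NO  TEST: hey\nOTHER words here";

$result = preg_replace($pattern, '<b>$0</b>', $txt);
echo $result, PHP_EOL;
```

Because \s also matches newlines, a run of uppercase words could in principle extend across a line break in multiline text; using \h+ instead would restrict the separator to horizontal whitespace.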

Unless you have some good reason to use that regexp, try something simpler, like:
/([A-Z\s]+):/
Also, just so you know, you can use an asterisk to allow zero or more whitespace characters: \s*

Related

php Regex file path | second matching after specific character

I am trying to extract file paths, without the drive letter, from inside text. For example, from AvastSecureBrowserElevationService; C:\Program Files (x86)\AVAST Software\Browser\Application\elevation_service.exe [X] I want to extract :\Program Files (x86)\AVAST Software\Browser\Application\elevation_service.exe.
My current regex looks like this, but it stops at the first space, and file names can contain spaces.
(?<=:\\)([^ ]*)
The solution I came up with is to match up to the first space character after a dot, because there is very little chance of a directory name with a space after a dot, and I will always do a quick manual check. But I do not know how to write this as a regex.
You may use this regex for this purpose:
(?<=[a-zA-Z]):[^.]+\.\S+
RegEx Details:
(?<=[a-zA-Z]): Lookbehind to assert we have an English letter before :
:: Match literal :
[^.]+: Match 1+ non-dot characters
\.: Match literal .
\S+: Match 1+ non-whitespace characters
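A minimal sketch of this pattern applied with preg_match to the sample line from the question. The variable names are illustrative:

```php
<?php
// Sample line from the question (single quotes, so \\ is one literal backslash).
$line = 'AvastSecureBrowserElevationService; C:\\Program Files (x86)\\AVAST Software\\Browser\\Application\\elevation_service.exe [X]';

// A colon preceded by a letter, then everything up to the first dot,
// then the dot and the non-whitespace characters after it.
$pattern = '/(?<=[a-zA-Z]):[^.]+\.\S+/';

$path = preg_match($pattern, $line, $m) ? $m[0] : null;
echo $path, PHP_EOL;
```

Because [^.]+ stops at the first dot, a directory name that contains a dot would end the match early, so the manual check mentioned in the question still seems worthwhile.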
Here the pattern consumes the entire string and captures only the part we want to output, so preg_replace can replace the whole match with that capture:
.+C(:\\.+\..+?)\s.+
Test
$re = '/.+C(:\\\\.+\..+?)\s.+/m'; // '\\\\' in a single-quoted PHP string is the regex \\, which matches one literal backslash
$str = 'AvastSecureBrowserElevationService; C:\\Program Files (x86)\\AVAST Software\\Browser\\Application\\elevation_service.exe [X]';
$subst = '$1';
$result = preg_replace($re, $subst, $str);
echo $result;
You can use the following regex:
[A-Z]\K:.+\.\w+
It will match any capital letter followed by :, then any string of characters ending with ., followed by at least one word character.
\K removes from the match what comes before it.
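As a rough illustration of the \K behaviour, here is the pattern run against the sample line from the question (variable names are illustrative):

```php
<?php
$line = 'AvastSecureBrowserElevationService; C:\\Program Files (x86)\\AVAST Software\\Browser\\Application\\elevation_service.exe [X]';

// [A-Z] has to match the drive letter, but \K drops it from the reported match.
$pattern = '/[A-Z]\K:.+\.\w+/';

$path = preg_match($pattern, $line, $m) ? $m[0] : null;
echo $path, PHP_EOL;
```

Note that .+ is greedy, so the match runs to the last dot on the line that is followed by a word character; trailing text that contains such a dot would be included.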

How can I split a string by white spaces that are not preceded by a certain character?

I want to split a string only at white spaces that do not have a certain delimiter (: in my case) before them. E.g.:
$string = "Time: 10:40 Request: page.php Action: whatever this is Refer: Facebook";
Then from something like this I want to achieve an array such that:
$array = ["Time: 10:40", "Request: page.php", "Action: whatever this is", "Refer: Facebook"];
I've tried the following so far:
$split = preg_split('/(:){0}\s/', $visit);
But this is still splitting at every occurrence of a white space.
Edit: I think I asked the wrong question, however "whatever this is" should stay as a single string
Edit 2: The bits before the colons are known and stay the same, maybe incorporating those somehow makes the task easier (of not splitting at whitespace characters in strings that should stay together)?
You can use a lookahead in your split regex:
/\h+(?=[A-Z][a-z]*: )/
Regex \h+(?=[A-Z][a-z]*: ) matches 1+ horizontal whitespace characters that are followed by a word starting with an uppercase letter, then a colon and a space.
You can do it like this:
$string = "Time: 10:40 Request: page.php Action: whatever this is Refer: Facebook";
$split = preg_split('/\h+(?=[A-Z][a-z]*:)/', $string);
print_r($split); // print_r is core PHP; dd() is only available with Laravel
Another option could be to match what is before the colon and then match upon the next part that starts with a space, non whitespace chars and colon:
\S+:\h+.*?(?=\h+\S+:)\K\h+
\S+: Match 1+ non-whitespace chars, then a colon
\h+ Match 1+ times a horizontal whitespace char
.*? Match any char except a newline non greedy
(?=\h+\S+:) Positive lookahead, assert what is on the right is 1+ horizontal whitespace chars, 1+ non whitespace chars and a colon
\K\h+ Forget what was matched using \K and match 1+ horizontal whitespace chars
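Edit 2 of the question mentions that the labels before the colons are known in advance. A minimal sketch under that assumption, with the labels placed in the lookahead as an alternation; the $labels list is hypothetical, taken from the sample string, and would need to match the real data:

```php
<?php
$string = "Time: 10:40 Request: page.php Action: whatever this is Refer: Facebook";

// Hypothetical list of known labels; adjust to the real field names.
$labels = ['Time', 'Request', 'Action', 'Refer'];

// Quote each label for use in a regex, then join them into an alternation.
$alternation = implode('|', array_map(function ($label) {
    return preg_quote($label, '/');
}, $labels));

// Split on horizontal whitespace only when a known label and a colon follow.
$split = preg_split('/\h+(?=(?:' . $alternation . '):)/', $string);

print_r($split);
```

Listing the labels explicitly avoids splitting on capitalised words that happen to be followed by a colon inside a value, at the cost of having to maintain the list.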

How to use preg_replace to remove excessive single spaces

We are extracting text from PDF files, and a high proportion of the results contain malformed text. Specifically, spaces are added between the characters of a word, e.g. SEATTLE is returned as S E A T T L E.
Is there a RegEx expression for preg_replace that can remove any spaces in the case of n number of single character "words"? Specifically, remove spaces from any occurrence of a string that is more than 3 single alpha characters and is separated by spaces?
I've googled this for a while, but can't even imagine how to construct the expression. As expressed in a comment, I don't want ALL spaces removed, but only when there is an occurrence of >3 single alpha characters, e.g. Welcome to the Greater S E A T T L E area should become Welcome to the Greater SEATTLE area. The result is to be used in full text searching, so case sensitivity is not a concern.
You may use a simple approach with a preg_replace_callback. Match '~\b[A-Za-z](?: [A-Za-z]){2,}\b~' and str_replace spaces in the anonymous function:
$regex = '~\b[A-Za-z](?: [A-Za-z]){2,}\b~';
$result = preg_replace_callback($regex, function($m) {
return str_replace(" ", "", $m[0]);
}, $s);
To only match sequences of uppercase letters, remove a-z from the pattern:
$regex = '~\b[A-Z](?: [A-Z]){2,}\b~';
One more thing: there may be soft/hard spaces, tabs, or other kinds of whitespace. In that case, use \h instead of the literal space, together with the u modifier:
$regex = '~\b[A-Za-z](?:\h[A-Za-z]){2,}\b~u';
Finally, to match any Unicode letter, use \p{L} (to only match uppercase ones, \p{Lu}) instead of [a-zA-Z]:
$regex = '~\b\p{L}(?:\h\p{L}){2,}\b~u';
NOTE: It will most probably fail to work in some cases, e.g. when there are one-letter words. You will have to handle those cases separately/manually. Anyway, there is no safe regex-only way to fix OCR issues.
Pattern details
\b - a word boundary
[A-Za-z] - a single letter
(?: [A-Za-z]){2,} - 2 or more occurrences of
- a space (\h matches any kind of horizontal whitespace)
[A-Za-z] - a single letter
\b - a word boundary
When using the u modifier, \h becomes Unicode-aware.
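A minimal sketch that runs the callback approach on the sentence from the question and on one made-up input showing the caveat from the note above. The $collapse helper and the second sentence are assumptions for illustration:

```php
<?php
// Three or more single ASCII letters separated by single spaces.
$regex = '~\b[A-Za-z](?: [A-Za-z]){2,}\b~';

// Hypothetical helper: drop the spaces inside a matched run of letters.
$collapse = function (array $m): string {
    return str_replace(' ', '', $m[0]);
};

$result = preg_replace_callback($regex, $collapse, 'Welcome to the Greater S E A T T L E area');
echo $result, PHP_EOL;

// Caveat: genuine one-letter words in a row are merged as well.
$merged = preg_replace_callback($regex, $collapse, 'Options A B C are listed');
echo $merged, PHP_EOL;
```

The trailing \b is what keeps the "a" of "area" out of the match: the engine backtracks to the last letter that is followed by a word boundary.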
You could do this in one go:
(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)
Explanation:
(?i: # Start of non-capturing group with case-insensitive modifier on
(?<!\S) # Negative lookbehind to ensure there is no leading non-whitespace character
([a-z]) + # Capture one letter and at least one space
((?1)) # Capture one letter in 2nd capturing group
| # Or
\G(?!\A) + # Start match from where previous match ends
# with matching spaces
((?1))\b # Match a letter at word boundary
) # End of non-capturing group
PHP code:
$str = preg_replace('~(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)~', '$1$2$3', $str);
You may use this pure regex approach with lookarounds and \G:
$re = '~\b(?:(?=(?:\pL\h+){3}\pL\b)|(?<!^)\G)(\pL)\h+(?=\pL\b)~';
$repl = preg_replace($re, '$1', $str);
RegEx Details:
\b: Match word boundary
(?:: Start non-capture group
(?=(?:\pL\h+){3}\pL\b): Lookahead to assert that at least 4 single letters, separated by 1+ horizontal whitespace, follow
|: OR
(?<!^)\G: \G asserts position at the end of the previous match. (?<!^) ensures we don't match start of the string for the first match
): End non-capture group
(\pL): Match a single letter and capture it
\h+: Followed by 1+ horizontal whitespace
(?=\pL\b): Assert that we only have a single letter ahead
In the replacement we use $1, the captured letter to the left of the whitespace.

Regex to get only characters without space inside special tags

I have 2 texts in a string:
%Juan%
%Juan Gonzalez%
And I want to get only %Juan% and not the one with the space. I have been trying several regexes without luck. I currently use:
/%(.*)%/U
but it gets both. I tried adding and playing with [^\s], but it doesn't work.
Any help please?
The issue is that . matches any character but a newline. The /U ungreedy mode only makes .* lazy and it captures a text from the % up to the first % to the right of the first %.
If your strings contain one pair of %...%, you may use
/%(\S+)%/
The \S+ pattern matches 1+ characters other than whitespace.
If you have multiple %...% pairs, you may use
/%([^\h%]+)%/
Here, [^\h%] is a negated character class that matches any character other than horizontal whitespace (\h) and the % symbol.
PHP demo:
$re = '/%([^\h%]+)%/';
$str = "%Juan%\n%Juan Gonzalez%";
preg_match_all($re, $str, $matches);
print_r($matches[1]);

Regex to remove single characters from string

Consider the following strings
breaking out a of a simple prison
this is b moving up
following me is x times better
All strings are lowercased already. I would like to remove any "loose" a-z characters, resulting in:
breaking out of simple prison
this is moving up
following me is times better
Is this possible with a single regex in php?
$str = "breaking out a of a simple prison
this is b moving up
following me is x times better";
$res = preg_replace("#\\b[a-z]\\b ?#i", "", $str);
echo $res;
How about:
preg_replace('/(^|\s)[a-z](\s|$)/', '$1', $string);
Note this also catches single characters that are at the beginning or end of the string, but not single characters that are adjacent to punctuation (they must be surrounded by whitespace).
If you also want to remove characters immediately before punctuation (e.g. 'the x.'), then this should work properly in most (English) cases:
preg_replace('/(^|\s)[a-z]\b/', '$1', $string);
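To make the difference between the two patterns concrete, here is a small sketch on a made-up input (the sentence and variable names are illustrative):

```php
<?php
$string = 'remove the x.';

// Requires whitespace (or a string edge) on both sides, so "x." is left alone.
$strict = preg_replace('/(^|\s)[a-z](\s|$)/', '$1', $string);

// Only requires a word boundary after the letter, so the "x" before "." is removed.
$loose = preg_replace('/(^|\s)[a-z]\b/', '$1', $string);

echo $strict, PHP_EOL;
echo $loose, PHP_EOL;
```

Since $1 puts the leading whitespace back, the second result keeps a space before the period.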
As a one-liner:
$result = preg_replace('/\s\p{Ll}\b|\b\p{Ll}\s/u', '', $subject);
This matches a single lowercase letter (\p{Ll}) which is preceded or followed by whitespace (\s), removing both. The word boundaries (\b) ensure that only single letters are indeed matched. The /u modifier makes the regex Unicode-aware.
The result: A single letter surrounded by spaces on both sides is reduced to a single space. A single letter preceded by whitespace but not followed by whitespace is removed completely, as is a single letter only followed but not preceded by whitespace.
So
This a is my test sentence a. o How funny (what a coincidence a) this is!
is changed to
This is my test sentence. How funny (what coincidence) this is!
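The example above can be reproduced with a short script. This is only a sketch, and the second input is made up to show the case of a leading single letter:

```php
<?php
$pattern = '/\s\p{Ll}\b|\b\p{Ll}\s/u';

$subject = 'This a is my test sentence a. o How funny (what a coincidence a) this is!';
$result = preg_replace($pattern, '', $subject);
echo $result, PHP_EOL;

// A single letter at the start is caught by the second alternative.
$leading = preg_replace($pattern, '', 'a test');
echo $leading, PHP_EOL;
```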
You could try something like this:
preg_replace('/\b\S\s\b/', "", $subject);
This is what it means:
\b # Assert position at a word boundary
\S # Match a single character that is a “non-whitespace character”
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
\b # Assert position at a word boundary
Update
As raised by Radu, because I've used the \S this will match more than just a-zA-Z. It will also match 0-9_. Normally, it would match a lot more than that, but because it's preceded by \b, it can only match word characters.
As mentioned in the comments by Tim Pietzcker, be aware that this won't work if your subject string needs to remove single characters that are followed by non word characters like test a (hello). It will also fall over if there are extra spaces after the single character like this
test a   hello
but you could fix that by changing the expression to \b\S\s*\b
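A quick sketch comparing the two expressions on an input with extra spaces after the single character (the sample string is an assumption):

```php
<?php
$subject = 'test a   hello'; // three spaces after the single character

// \s matches exactly one space, and no word boundary follows it, so nothing changes.
$before = preg_replace('/\b\S\s\b/', '', $subject);

// \s* consumes all the spaces up to the next word, so "a" and the spaces are removed.
$after = preg_replace('/\b\S\s*\b/', '', $subject);

echo $before, PHP_EOL;
echo $after, PHP_EOL;
```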
Try this one:
$sString = preg_replace("#\b[a-z]{1}\b#m", ' ', $sString);
