php Regex file path | second matching after specific character


I am trying to extract file paths, without the drive letter, from inside text. For example, from AvastSecureBrowserElevationService; C:\Program Files (x86)\AVAST Software\Browser\Application\elevation_service.exe [X] I want to extract :\Program Files (x86)\AVAST Software\Browser\Application\elevation_service.exe.
My current regex looks like this, but it stops at any space, and file names can contain spaces.
(?<=:\\)([^ ]*)
The solution I figured out is to match up to the first space character after a dot, because there is very little chance of a directory name having a space right after a dot, and I will always do a quick manual check anyway. But I do not know how to write this as a regex.

You may use this regex for this purpose:
(?<=[a-zA-Z]):[^.]+\.\S+
RegEx Demo
RegEx Details:
(?<=[a-zA-Z]): Lookbehind to assert we have an English letter before :
:: Match literal :
[^.]+: Match 1+ non-dot characters
\.: Match literal .
\S+: Match 1+ non-whitespace characters
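A minimal sketch of how this pattern could be applied with preg_match, using the sample line from the question. The variable names are only illustrative. The subject is single-quoted so that the backslashes in the path are kept literally.

```php
<?php
// Sample line from the question; single quotes keep the backslashes as-is.
$line = 'AvastSecureBrowserElevationService; C:\Program Files (x86)\AVAST Software\Browser\Application\elevation_service.exe [X]';

// Colon preceded by a letter, then everything up to the first dot,
// then the dot and the non-whitespace run after it (the extension).
$pattern = '/(?<=[a-zA-Z]):[^.]+\.\S+/';

$path = null;
if (preg_match($pattern, $line, $m)) {
    $path = $m[0];
    echo $path, "\n";
    // prints :\Program Files (x86)\AVAST Software\Browser\Application\elevation_service.exe
}
```

Because [^.]+ stops at the first dot, a directory name that itself contains a dot would cut the match short, so the manual check mentioned in the question is still worthwhile.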

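Here we would consume the entire string, capture the part we wish to output, and use preg_replace to keep only that capture:
.+C(:\\.+\..+?)\s.+
Test
$re = '/.+C(:\\\\.+\..+?)\s.+/m'; // in a single-quoted PHP string, \\\\ is needed to give the regex \\ (one literal backslash)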
$str = 'AvastSecureBrowserElevationService; C:\\Program Files (x86)\\AVAST Software\\Browser\\Application\\elevation_service.exe [X]';
$subst = '$1';
$result = preg_replace($re, $subst, $str);
echo $result;
Demo

You can use the following regex:
[A-Z]\K:.+\.\w+
It will match any capital letter followed by :, then any string of characters ending with ., followed by at least one word character.
\K removes from the match what comes before it.
Demo
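A small sketch comparing the \K pattern with the lookbehind pattern from the other answer. The second input line is a made-up example (not from the question) chosen to show one difference: .+ is greedy, so the \K pattern runs up to the last dot on the line, while [^.]+ stops at the first one.

```php
<?php
$withK      = '/[A-Z]\K:.+\.\w+/';          // pattern from this answer
$lookbehind = '/(?<=[a-zA-Z]):[^.]+\.\S+/'; // pattern from the other answer

// Line from the question: both patterns give the same result here.
$sample = 'AvastSecureBrowserElevationService; C:\Program Files (x86)\AVAST Software\Browser\Application\elevation_service.exe [X]';
preg_match($withK, $sample, $a);
preg_match($lookbehind, $sample, $b);
echo $a[0] === $b[0] ? "same\n" : "different\n"; // prints same

// Hypothetical line with a second dot after the path.
$tricky = 'X; C:\Tools\run.exe v1.2';
preg_match($withK, $tricky, $c);
preg_match($lookbehind, $tricky, $d);
echo $c[0], "\n"; // prints :\Tools\run.exe v1.2
echo $d[0], "\n"; // prints :\Tools\run.exe
```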

Related

Why regex with lookaheads doesn't match?

I need (in PHP) to split a sentence by a word that cannot be the first or the last one in the sentence. Say the word is "pression" and here is my regex
/^.+?(?=[\s\.\,\:\;])pression(?=[\s\.\,\:\;]).+$/i
Live here: https://regex101.com/r/CHAhKj/1/
First, it doesn't match.
Next, I wonder whether it is possible to split that way at all. I tried a simplified example
print_r(preg_split('/^.+pizza.+$/', 'my pizza is cool'));
live here http://sandbox.onlinephpfunctions.com/code/10b674900fc1ef44ec79bfaf80e83fe1f4248d02
and it prints an array of 2 empty strings, when I expect
['my ', ' is cool']

I need (in PHP) to split a sentence by the word that cannot be the first or the last one in the sentence
You may use this regex:
(?<=[^\s.?]\h)pression(?=\h[^\s.?])
RegEx Demo
RegEx Details:
(?<=[^\s.?]\h): Lookbehind to assert that right before the current position we have a character that is not a whitespace, not a dot and not a ?, followed by a horizontal whitespace.
pression: Match word pression
(?=\h[^\s.?]): Lookahead to assert that right after the current position we have a horizontal whitespace followed by a character that is not a whitespace, not a dot and not a ?.
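As a rough sketch, the lookaround pattern above can be passed straight to preg_split. The i flag and the test sentences are assumptions added for illustration; the first sentence is borrowed from the other answer's demo.

```php
<?php
$pattern = '/(?<=[^\s.?]\h)pression(?=\h[^\s.?])/i';

// "pression" in the middle of the sentence: the string is split around it.
// The "pression" inside "expression" is skipped, because the character
// right before it is "x", not a horizontal whitespace.
$parts = preg_split($pattern, 'You can use any regular expression pression inside the lookahead');
print_r($parts);
// [0] => "You can use any regular expression "
// [1] => " inside the lookahead"

// "pression" as the first word: nothing precedes it, so there is no split
// and the whole string comes back as the only element.
$unsplit = preg_split($pattern, 'pression is the first word here');
print_r($unsplit);
```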

First, ^.+?(?=[\s\.\,\:\;])pression(?=[\s\.\,\:\;]).+$ can't match any string at all because the (?=[\s\.\,\:\;])p part requires p to be also either a whitespace char, or a ., ,, : or ;, which invalidates the whole match at once.
Second, the ^.+pizza.+$ pattern does not ensure that the matched pizza is not the first or last word in a sentence, as . matches whitespace, too. It also does not return anything meaningful: preg_split removes each match and returns what is left around it, and since this pattern matches the whole string, all that is left is the empty string before the match and the empty string after it.
That said, all you need is:
preg_match('~^(.*?\w\W+)pression(\W+\w.*)$~is', $text, $m)
See the regex demo. Details:
^ - start of string
(.*?\w\W+) - Capturing group 1: any zero or more chars, as few as possible, then a word char and then one or more non-word chars
pression - a word
(\W+\w.*) - Capturing group 2: one or more non-word chars, a word char, and then any zero or more chars as many as possible
$ - end of string.
s makes the . match across lines and i flag makes the pattern match in a case insensitive way.
See the PHP demo:
$text = "You can use any regular expression pression inside the lookahead ";
if (preg_match('~^(.*?\w\W+)pression(\W+\w.*)$~is', $text, $m)) {
echo $m[1] . " << | >> " . $m[2];
}
// => You can use any regular expression << | >> inside the lookahead

How to use preg_replace to remove excessive single spaces

We are extracting text from PDF files, and there is a high frequency of results that contain malformed text. Specifically adding spaces between the characters of a word. e.g. SEATTLE is being returned as S E A T T L E.
Is there a RegEx expression for preg_replace that can remove any spaces in the case of n number of single character "words"? Specifically, remove spaces from any occurrence of a string that is more than 3 single alpha characters and is separated by spaces?
I've googled this for a while, but can't even imagine how to construct the expression. As expressed in a comment, I don't want ALL spaces removed, but only when there is an occurrence of >3 single alpha characters, e.g. Welcome to the Greater S E A T T L E area should become Welcome to the Greater SEATTLE area. The result is to be used in full text searching, so case sensitivity is not a concern.

You may use a simple approach with a preg_replace_callback. Match '~\b[A-Za-z](?: [A-Za-z]){2,}\b~' and str_replace spaces in the anonymous function:
$regex = '~\b[A-Za-z](?: [A-Za-z]){2,}\b~';
$result = preg_replace_callback($regex, function($m) {
return str_replace(" ", "", $m[0]);
}, $s);
See the regex demo.
To only match sequences of uppercase letters, remove a-z from the pattern:
$regex = '~\b[A-Z](?: [A-Z]){2,}\b~';
And another thing: there may be soft/hard spaces, tabs, or other kinds of whitespace. In that case, replace the literal space with \h and add the u modifier:
$regex = '~\b[A-Za-z](?:\h[A-Za-z]){2,}\b~u';
Finally, to match any Unicode letter, use \p{L} (to only match uppercase ones, \p{Lu}) instead of [a-zA-Z]:
$regex = '~\b\p{L}(?:\h\p{L}){2,}\b~u';
NOTE: It will most probably fail to work in some cases, e.g. when there are one-letter words. You will have to handle those cases separately/manually. Anyway, there is no safe regex-only way to fix OCR issues.
Pattern details
\b - a word boundary
[A-Za-z] - a single letter
(?: [A-Za-z]){2,} - 2 or more occurrences of a space (or, with \h, any kind of horizontal whitespace) followed by a single letter
\b - a word boundary
When using the u modifier, \h becomes Unicode-aware.
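A short sketch of the callback approach wrapped in a helper. The function name joinSpacedLetters is hypothetical, and the second input is a made-up sentence that shows the one-letter-word problem mentioned in the NOTE above.

```php
<?php
// Hypothetical helper: joins runs of 3 or more single letters separated by one space.
function joinSpacedLetters(string $s): string
{
    return preg_replace_callback(
        '~\b[A-Za-z](?: [A-Za-z]){2,}\b~',
        function (array $m): string {
            // $m[0] is the whole run, e.g. "S E A T T L E"
            return str_replace(' ', '', $m[0]);
        },
        $s
    );
}

$fixed = joinSpacedLetters('Welcome to the Greater S E A T T L E area');
echo $fixed, "\n"; // prints Welcome to the Greater SEATTLE area

// The article "a" is a single letter too, so it is glued to the run.
$glued = joinSpacedLetters('Is this a B I G deal');
echo $glued, "\n"; // prints Is this aBIG deal
```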

You could do this in one go:
(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)
See live demo here
Explanation:
(?i: # Start of non-capturing group with case-insensitive modifier on
(?<!\S) # Negative lookbehind to ensure there is no leading non-whitespace character
([a-z]) + # Capture one letter and at least one space
((?1)) # Capture one letter in 2nd capturing group
| # Or
\G(?!\A) + # Start match from where previous match ends
# with matching spaces
((?1))\b # Match a letter at word boundary
) # End of non-capturing group
PHP code:
$str = preg_replace('~(?i:(?<!\S)([a-z]) +((?1))|\G(?!\A) +((?1))\b)~', '$1$2$3', $str);

You may use this pure regex approach with lookarounds and \G:
$re = '~\b(?:(?=(?:\pL\h+){3}\pL\b)|(?<!^)\G)(\pL)\h+(?=\pL\b)~';
$repl = preg_replace($re, '$1', $str);
RegEx Demo
RegEx Details:
\b: Match word boundary
(?:: Start non-capture group
(?=(?:\pL\h+){3}\pL\b): Lookahead to assert we have at least 4 single letters (i.e. more than 3) separated by 1+ horizontal whitespace
|: OR
(?<!^)\G: \G asserts position at the end of the previous match. (?<!^) ensures we don't match start of the string for the first match
): End non-capture group
(\pL): Match a single letter and capture it
\h+: Followed by 1+ horizontal whitespace
(?=\pL\b): Assert that we only have a single letter ahead
In the replacement we use $1 which is the letter left of whitespace we capture

Regex to get only characters without space inside special tags

I have 2 texts in a string:
%Juan%
%Juan Gonzalez%
And I want to get only %Juan% and not the one with the space. I have been trying several regexes without luck. I currently use:
/%(.*)%/U
but it gets both. I tried adding and playing with [^\s], but it doesn't work.
Any help please?

The issue is that . matches any character but a newline. The /U ungreedy mode only makes .* lazy and it captures a text from the % up to the first % to the right of the first %.
If your strings contain one pair of %...%, you may use
/%(\S+)%/
See the regex demo
The \S+ pattern matches 1+ characters other than whitespace.
If you have multiple %...% pairs, you may use
/%([^\h%]+)%/
See another regex demo. Here, [^\h%] is a negated character class that matches any character but a horizontal whitespace and the % symbol.
PHP demo:
$re = '/%([^\h%]+)%/';
$str = "%Juan%\n%Juan Gonzalez%";
preg_match_all($re, $str, $matches);
print_r($matches[1]);

How to use search and replace all the matching words in a sentence in php

I have to search and replace all the words starting with # and # in a sentence. Can you please let me know the best way to do this in PHP. I tried with
preg_replace('/(\#+|\#+).*?(?=\s)/','--', $string);
This handles only one word in a sentence. I want all the matches to be replaced.
I cannot use the g flag here like in Perl.

preg_replace replaces all matches by default. If it is not doing so, it is an issue with your pattern or the data.
Try this pattern instead:
(?<!\S)[##]+\w+
(?<!\S) - do not match if the pattern is preceded by a non-whitespace character.
[##]+ - match one or more of # and #.
\w+ - match one or more word characters (letter, numbers, underscores). This will preserve punctuation. For example, #foo. would be replaced by --.. If you don't want this, you could use \S+ instead, which matches all characters that are not whitespace.

A word starting with a character implies that it has a space right before this character. Try something like that:
/(?<!\S)[##].*(?=[^a-z])/
Why not use (?=\s)? Because if there is some punctuation right after the word, it's not part of the word. Note: you can replace [^a-z] with any list of characters not allowed in your word.
Be careful though, there are two particular cases where that doesn't work. You have to use 3 preg_replace calls in a row; the two others are for words that begin and end the string:
/^[##].*(?=[^a-z])/
/(?<!\S)[##].*$/

Try this :
$string = "#Test let us meet_me#noon see #Prasanth";
$new_pro_name = preg_replace('/(?<!\S)(#\w+|#\w+)/','--', $string);
echo $new_pro_name;
This replaces all the words starting with # OR #
Output: -- let us meet_me#noon see --
If you want to replace the word after # OR # even if it is in the middle of a word:
$string = "#Test let us meet_me#noon see #Prasanth";
$new_pro_name = preg_replace('/(#\w+|#\w+)/','--', $string);
echo $new_pro_name;
Output: -- let us meet_me-- see --

Regex to remove single characters from string

Consider the following strings
breaking out a of a simple prison
this is b moving up
following me is x times better
All strings are lowercased already. I would like to remove any "loose" a-z characters, resulting in:
breaking out of simple prison
this is moving up
following me is times better
Is this possible with a single regex in php?

$str = "breaking out a of a simple prison
this is b moving up
following me is x times better";
$res = preg_replace("#\\b[a-z]\\b ?#i", "", $str);
echo $res;

How about:
preg_replace('/(^|\s)[a-z](\s|$)/', '$1', $string);
Note this also catches single characters that are at the beginning or end of the string, but not single characters that are adjacent to punctuation (they must be surrounded by whitespace).
If you also want to remove characters immediately before punctuation (e.g. 'the x.'), then this should work properly in most (English) cases:
preg_replace('/(^|\s)[a-z]\b/', '$1', $string);

As a one-liner:
$result = preg_replace('/\s\p{Ll}\b|\b\p{Ll}\s/u', '', $subject);
This matches a single lowercase letter (\p{Ll}) which is preceded or followed by whitespace (\s), removing both. The word boundaries (\b) ensure that only single letters are indeed matched. The /u modifier makes the regex Unicode-aware.
The result: A single letter surrounded by spaces on both sides is reduced to a single space. A single letter preceded by whitespace but not followed by whitespace is removed completely, as is a single letter only followed but not preceded by whitespace.
So
This a is my test sentence a. o How funny (what a coincidence a) this is!
is changed to
This is my test sentence. How funny (what coincidence) this is!
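A quick sketch running this one-liner against the three strings from the question. The loop and variable names are only for illustration.

```php
<?php
$pattern = '/\s\p{Ll}\b|\b\p{Ll}\s/u';

$inputs = [
    'breaking out a of a simple prison',
    'this is b moving up',
    'following me is x times better',
];

$outputs = [];
foreach ($inputs as $input) {
    // Each loose letter is removed together with one adjacent whitespace character.
    $outputs[] = preg_replace($pattern, '', $input);
}

print_r($outputs);
// [0] => breaking out of simple prison
// [1] => this is moving up
// [2] => following me is times better
```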

You could try something like this:
preg_replace('/\b\S\s\b/', "", $subject);
This is what it means:
\b # Assert position at a word boundary
\S # Match a single character that is a “non-whitespace character”
\s # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
\b # Assert position at a word boundary
Update
As raised by Radu, because I've used the \S this will match more than just a-zA-Z. It will also match 0-9_. Normally, it would match a lot more than that, but because it's preceded by \b, it can only match word characters.
As mentioned in the comments by Tim Pietzcker, be aware that this won't work if your subject string needs to remove single characters that are followed by non word characters like test a (hello). It will also fall over if there are extra spaces after the single character like this
test a hello
but you could fix that by changing the expression to \b\S\s*\b

Try this one:
$sString = preg_replace("#\b[a-z]{1}\b#m", ' ', $sString);
