parse search string for phrases and keywords - php

i need to parse a search string for keywords and phrases in php, for example
string 1: value of "measured response" detect goal "method valuation" study
will yield: value,of,measured response,detect,goal,method valuation,study
i also need it to work if the string has:
no phrases enclosed in quotes,
any number of phrases enclosed in quotes with any number of keywords outside the quotes,
only phrases in quotes,
only space-separated keywords.
i'm leaning towards using preg_match with the pattern '/(\".*\")/' to get the phrases into an array, then remove the phrases from the string, then finally work the keywords into the array. i just can't pull everything together!
i'm also thinking of replacing spaces outside quotes with commas. then explode them to an array. if that's a better option, how do i do that with preg_replace?
is there a better way to go about this? help! thanks much, everyone

preg_match_all('/(?<!")\b\w+\b|(?<=")\b[^"]+/', $subject, $result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[0]); $i++) {
# Matched text = $result[0][$i];
}
This should yield the results you are looking for.
Explanation :
# (?<!")\b\w+\b|(?<=")\b[^"]+
#
# Match either the regular expression below (attempting the next alternative only if this one fails) «(?<!")\b\w+\b»
# Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!")»
# Match the character “"” literally «"»
# Assert position at a word boundary «\b»
# Match a single character that is a “word character” (letters, digits, etc.) «\w+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Assert position at a word boundary «\b»
# Or match regular expression number 2 below (the entire match attempt fails if this one fails to match) «(?<=")\b[^"]+»
# Assert that the regex below can be matched, with the match ending at this position (positive lookbehind) «(?<=")»
# Match the character “"” literally «"»
# Assert position at a word boundary «\b»
# Match any character that is NOT a “"” «[^"]+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
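To check the pattern against the sample string from the question, here is a quick sketch. The variable names are mine, not part of the answer.

```php
<?php
// Sample input from the question.
$subject = 'value of "measured response" detect goal "method valuation" study';

// Keywords: words not directly preceded by a quote.
// Phrases: text directly following an opening quote, up to the next quote.
preg_match_all('/(?<!")\b\w+\b|(?<=")\b[^"]+/', $subject, $result);

$terms = $result[0];
print_r($terms);
```

Because the keyword branch is \w+, a keyword containing punctuation (for example co-op) would come back as two separate matches, so this fits best when keywords are plain words.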

There is no need to use a regular expression: the built-in function str_getcsv can explode a string with any given delimiter, enclosure and escape characters.
Really, it is as simple as:
// where $string is the string to parse
$array = str_getcsv($string, ' ', '"');
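A slightly fuller sketch of the same idea follows. The helper name split_search_terms is hypothetical, and the filtering step is my assumption about what a search form would want when the user types repeated spaces.

```php
<?php
// Hypothetical helper name; wraps str_getcsv() as suggested above.
function split_search_terms(string $search): array
{
    // Space as delimiter, double quote as enclosure; the escape character
    // is passed explicitly rather than relying on the default.
    $fields = str_getcsv($search, ' ', '"', '\\');

    // Repeated spaces can produce empty fields, so drop them and reindex.
    return array_values(array_filter(
        $fields,
        static fn ($field) => $field !== null && $field !== ''
    ));
}

print_r(split_search_terms('value of "measured response" detect  goal'));
```

The explicit fourth argument is only there to avoid depending on the default escape character; the two-argument-plus-enclosure call from the answer parses the sample string the same way.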

$s = 'value of "measured response" detect goal "method valuation" study';
preg_match_all('~(?|"([^"]+)"|(\S+))~', $s, $matches);
print_r($matches[1]);
output:
Array
(
[0] => value
[1] => of
[2] => measured response
[3] => detect
[4] => goal
[5] => method valuation
[6] => study
)
The trick here is to use a branch-reset group: (?|...|...). It's just like an alternation contained in a non-capturing group - (?:...|...) - except that within each branch the capturing-group numbers start at the same number. (For more info, see the PCRE docs and search for DUPLICATE SUBPATTERN NUMBERS.)
Thus, the text we're interested in is always captured in group #1. You can retrieve the contents of group #1 for all matches via $matches[1]. (That's assuming the PREG_PATTERN_ORDER flag is set; I didn't specify it like @FailedDev did because it's the default. See the PHP docs for details.)
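To see how the branch-reset pattern behaves on the other input shapes the question lists, here is a small sketch. The wrapper name parse_terms and the sample strings are made up for illustration.

```php
<?php
// Hypothetical wrapper around the branch-reset pattern shown above.
function parse_terms(string $search): array
{
    // Group 1 holds either the text inside quotes or a bare keyword.
    preg_match_all('~(?|"([^"]+)"|(\S+))~', $search, $matches);
    return $matches[1];
}

// Input shapes in the spirit of the question's list.
$cases = [
    'plain keywords only',
    'one "quoted phrase" between keywords',
    '"only phrases" "in quotes"',
];
foreach ($cases as $case) {
    echo implode(',', parse_terms($case)), PHP_EOL;
}
```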


Validate string to contain only qualifying characters and a specific optional substring in the middle

I'm trying to make a regular expression in PHP. I can get it working in other languages but not working with PHP.
I want to validate item names in an array
They can contain upper and lower case letters, numbers, underscores, and hyphens.
They can contain => as an exact string, not separate characters.
They cannot start with =>.
They cannot finish with =>.
My current code:
$regex = '/^[a-zA-Z0-9-_]+$/'; // contains A-Z a-z 0-9 - _
//$regex = '([^=>]$)'; // doesn't end with =>
//$regex = '~.=>~'; // doesn't start with =>
if (preg_match($regex, 'Field_name_true2')) {
echo 'true';
} else {
echo 'false';
};
// Field=>Value-True
// =>False_name
//Bad_name_2=>
Use negative lookarounds. Negative lookahead (?!=>) at the beginning to prohibit beginning with =>, and negative lookbehind (?<!=>) at the end to prohibit ending with =>.
^(?!=>)(?:[a-zA-Z0-9-_]+(=>)?)+(?<!=>)$
There is absolutely no requirement for lookarounds here.
Anchors and an optional group will suffice.
/^[\w-]+(?:=>[\w-]+)?$/
        ^^^^^^^^^^^^^-- this whole non-capturing group is optional
This allows full strings consisting exclusively of [0-9a-zA-Z_-] or split ONCE by =>.
The non-capturing group may occur zero or one time.
In other words, => may occur after one or more [\w-] characters, but if it does occur, it MUST be immediately followed by one or more [\w-] characters until the end of the string.
To cover some of the ambiguity in the question requirements:
If foo=>bar=>bam is valid, then use /^[\w-]+(?:=>[\w-]+)*$/ which replaces ? (zero or one) with * (zero or more).
If foo=>=>bar is valid then use /^[\w-]+(?:(?:=>)+[\w-]+)*$/ which replaces => (must occur once) with (?:=>)+ (substring must occur one or more times).
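As a quick check, the first pattern can be run against the sample values from the question. This is only a sketch; the helper name is_valid_item_name is hypothetical.

```php
<?php
// Hypothetical helper name; the pattern is the one from this answer.
function is_valid_item_name(string $name): bool
{
    return preg_match('/^[\w-]+(?:=>[\w-]+)?$/', $name) === 1;
}

// Sample values taken from the question's comments.
$samples = ['Field_name_true2', 'Field=>Value-True', '=>False_name', 'Bad_name_2=>'];
foreach ($samples as $name) {
    echo $name, ': ', is_valid_item_name($name) ? 'true' : 'false', PHP_EOL;
}
```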
Well, your character ranges (apart from the hyphen) are equivalent to \w, so you could use
^(?!=>)(?:(?!=>$)(?:[-\w]|=>))+$
This construct uses a "tempered greedy token", see a demo on regex101.com.
More shiny, complicated and surely over the top, you could use subroutines (the x flag is needed for the whitespace and comments) as in:
(?(DEFINE)
(?<chars>[-\w]) # equals to A-Z, a-z, 0-9, _, -
(?<af>=>) # "arrow function"
(?<item>
(?!(?&af)) # no af at the beginning
(?:(?&af)?(?&chars)++)+
(?!(?&af)) # no af at the end
)
)
^(?&item)$
See another demo on regex101.com
For the example data, you can use
^[a-zA-Z0-9_-]+=>[a-zA-Z0-9_-]+$
The pattern matches:
^ Start of string
[a-zA-Z0-9_-]+ Match 1+ times any of the listed ranges or characters (can not start with =>)
=> Match literally
[a-zA-Z0-9_-]+ Match again 1+ times any of the listed ranges or characters
$ End of string
If you want to allow for optional spaces:
^\h*[a-zA-Z0-9_-]+\h*=>\h*[a-zA-Z0-9_-]+\h*$
Note that [a-zA-Z0-9_-] can be written as [\w-]

Why regex with lookaheads doesn't match?

I need (in PHP) to split a sentence by the word that cannot be the first or the last one in the sentence. Say the word is "pression" and here is my regex
/^.+?(?=[\s\.\,\:\;])pression(?=[\s\.\,\:\;]).+$/i
Live here: https://regex101.com/r/CHAhKj/1/
First, it doesn't match.
Next, I wonder: is it possible at all to split that way? I tried a simplified example
print_r(preg_split('/^.+pizza.+$/', 'my pizza is cool'));
live here http://sandbox.onlinephpfunctions.com/code/10b674900fc1ef44ec79bfaf80e83fe1f4248d02
and it prints an array of 2 empty strings, when I expect
['my ', ' is cool']
I need (in PHP) to split a sentence by the word that cannot be the first or the last one in the sentence
You may use this regex:
(?<=[^\s.?]\h)pression(?=\h[^\s.?])
RegEx Details:
(?<=[^\s.?]\h): Lookbehind to assert that before the current position we have a character that is not a whitespace, not a dot and not a ?, followed by a horizontal space.
pression: Match word pression
(?=\h[^\s.?]): Lookahead to assert that after the current position we have a horizontal space followed by a character that is not a whitespace, not a dot and not a ?
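Since the question asks for a split, here is a sketch of how this pattern could be passed to preg_split. The sample sentences and variable names are mine.

```php
<?php
// Split on "pression" only when it is surrounded by other words.
$pattern = '/(?<=[^\s.?]\h)pression(?=\h[^\s.?])/i';

// The word sits in the middle of the sentence: the string is split there.
$middle = preg_split($pattern, 'my pression is cool');

// The word is the first one: the lookbehind fails, so nothing is split.
$first = preg_split($pattern, 'pression is cool');

var_export($middle);
var_export($first);
```

Here $middle holds the two pieces around the word ('my ' and ' is cool'), while $first comes back as a single-element array containing the untouched input.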
First, ^.+?(?=[\s\.\,\:\;])pression(?=[\s\.\,\:\;]).+$ can't match any string at all because the (?=[\s\.\,\:\;])p part requires p to be also either a whitespace char, or a ., ,, : or ;, which invalidates the whole match at once.
Second, the ^.+pizza.+$ pattern does not ensure that the pizza matched is not the first or last word in a sentence, as . matches whitespace, too. It also does not return anything meaningful with preg_split, because preg_split removes whatever the pattern matches and returns the pieces around it: here the pattern matches the whole string, so the two pieces are the empty strings before and after the match.
That said, all you need is:
preg_match('~^(.*?\w\W+)pression(\W+\w.*)$~is', $text, $m)
See the regex demo. Details:
^ - start of string
(.*?\w\W+) - Capturing group 1: any zero or more chars, as few as possible, then a word char and then one or more non-word chars
pression - a word
(\W+\w.*) - Capturing group 2: one or more non-word chars, a word char, and then any zero or more chars as many as possible
$ - end of string.
s makes the . match across lines and i flag makes the pattern match in a case insensitive way.
See the PHP demo:
$text = "You can use any regular expression pression inside the lookahead ";
if (preg_match('~^(.*?\w\W+)pression(\W+\w.*)$~is', $text, $m)) {
echo $m[1] . " << | >> " . $m[2];
}
// => You can use any regular expression << | >> inside the lookahead

PHP regular expressions to filter results

I have a list of several email addresses which look like the following
smtp:email1#myemail.com
smtp:email2#something.myemail.com
SMTP:email3#myemail.com
X400: //some random line
Is there any way I can only get the emails which only end in myemail.com? So from the above, this would be
email1#myemail.com
email3#myemail.com
So it should get rid of any random lines, and it should also ignore it if there is anything else in the string e.g. something.
I have managed to get some data by doing
([a-zA-Z]+)(#)
Probably not the best way, but it gets me what's in front of the # sign. Any help filtering these out is appreciated.
Thanks
You may want to use a regex to filter only emails from domain myemail.com:
<?php
$emailList = <<< LOL
smtp:email1#myemail.com
smtp:email2#something.myemail.com
SMTP:email3#myemail.com
X400: //some random line
LOL;
preg_match_all('/smtp:(.*?#myemail\.com)$/im', $emailList , $matches, PREG_PATTERN_ORDER);
print_r($matches[1]);
/*
Array
(
[0] => email1#myemail.com
[1] => email3#myemail.com
)
*/
Demo:
http://ideone.com/hcd0aa
Regex Explanation:
smtp:(.*?#myemail\.com)$
Options: Case insensitive; Exact spacing; Dot doesn’t match line breaks; ^$ match at line breaks; Greedy quantifiers
Match the character string “smtp:” literally «smtp:»
Match the regex below and capture its match into backreference number 1 «(.*?#myemail\.com)»
Match any single character that is NOT a line break character «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character string “#myemail” literally «#myemail»
Match the character “.” literally «\.»
Match the character string “com” literally «com»
Assert position at the end of a line (at the end of the string or before a line break) «$»
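If the addresses are already in a PHP array rather than one multi-line string, a per-element variant could look like the sketch below. The $lines array and variable names are assumptions for illustration.

```php
<?php
// Hypothetical input: one entry per array element instead of a heredoc.
$lines = [
    'smtp:email1#myemail.com',
    'smtp:email2#something.myemail.com',
    'SMTP:email3#myemail.com',
    'X400: //some random line',
];

$emails = [];
foreach ($lines as $line) {
    // Same pattern as above; the m flag is not needed for single lines.
    if (preg_match('/smtp:(.*?#myemail\.com)$/i', $line, $m)) {
        $emails[] = $m[1];
    }
}
print_r($emails);
```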

How to match multiple substrings that occur after a specific substring?

I am trying to read out the server names from a nginx config file.
I need to use regex on a line like this:
server_name    this.com www.this.com someother-example.com;
I am using PHP's preg_match_all() and I've tried different things so far:
/^(?:server_name[\s]*)(?:(.*)(?:\s*))*;$/m
// no output
/^(?:server_name[\s]*)((?:(?:.*)(?:\s*))*);$/m
// this.com www.this.com someother-example.com
But I can't find the right one to list the domains as separate values.
[
0 => 'this.com',
1 => 'www.this.com',
2 => 'someother-example.com'
]
as Bob's your uncle wrote:
(?:server_name|\G(?!^))\s*\K[^;|\s]+
Does the trick!
The plain English requirement is to extract the space-delimited strings that immediately follow server_name then several spaces.
The dynamic duo of \G (start from the start / continue from the end of the last match) and \K (restart the fullstring match) will be the heroes of the day.
Code: (Demo)
$string = "server_name    this.com www.this.com someother-example.com;";
var_export(preg_match_all('~(?:server_name +|\G(?!^) )\K[^; ]+~', $string, $out) ? $out[0] : 'no matches');
Output:
array (
0 => 'this.com',
1 => 'www.this.com',
2 => 'someother-example.com',
)
Pattern Explanation:
(?: # start of non-capturing group (to separate piped expressions from end of the pattern)
server_name + # literally match "server_name" followed by one or more spaces
| # OR
\G(?!^) # continue searching for matches immediately after the previous match, then match a single space
) # end of the non-capturing group
\K # restart the fullstring match (aka forget any previously matched characters in "this run through")
[^; ]+ # match one or more characters that are NOT a semicolon or a space
The reason that you see \G(?!^) versus just \G (which, for the record, will work just fine on your sample input) is because \G can potentially match from two different points by its default behavior. https://www.regular-expressions.info/continue.html
If you were to use the naked \G version of my pattern AND add a single space to the front of the input string, you would not make the intended matches. \G would successfully start at the beginning of the string, then match the single space, then server_name via the negated character class [^; ].
For this reason, disabling \G's "start at the start of the string" ability makes the pattern more stable/reliable/accurate.
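The difference can be reproduced with a short sketch. The input below is my own variation of the sample line, with a leading space added and single spaces used for brevity.

```php
<?php
// Input with a leading space, as described above.
$string = ' server_name this.com www.this.com;';

// "Naked" \G: allowed to match at the very start of the string.
preg_match_all('~(?:server_name +|\G )\K[^; ]+~', $string, $naked);

// \G(?!^): only continues from the end of a previous match.
preg_match_all('~(?:server_name +|\G(?!^) )\K[^; ]+~', $string, $guarded);

var_export($naked[0]);
var_export($guarded[0]);
```

With the naked \G, server_name itself shows up as the first match; with \G(?!^), only the two domains are returned.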
preg_match_all() returns the number of full pattern matches and fills the array passed as its third argument with the matches. The first element [0] is a collection of fullstring matches (what is matched regardless of capture groups). If there are any capture groups, they begin from [1] and increment with each new group.
Because you need to match server_name before targeting the substrings to extract, using capture groups would mean a bloated output array and an unusable [0] subarray of fullstring matches.
To extract the desired space-delimited substrings and omit server_name from the results, \K is used to "forget" the characters that are matched prior to finding the desired substrings. https://www.regular-expressions.info/keep.html
Without the \K to purge the unwanted leading characters, the output would be:
array (
0 => 'server_name    this.com',
1 => ' www.this.com',
2 => ' someother-example.com',
)
If anyone is comparing my answer to user3776824's or HamZa's:
I am electing to be very literal with space character matching. There are 4 spaces after server_name, so I could have used an exact quantifier {4} but opted for a bit of flexibility here. \s* isn't ideal because there will always be one or more spaces to match. I don't have a problem with \s, but to be clear, it matches spaces, tabs, newlines, and carriage returns.
I am using (?!^) -- a negative lookahead -- versus (?<!^) -- a negative lookbehind -- because it does the same job with one less character. You will more commonly see \G(?!^) from experienced regex craftsmen.
There is never a need to use "alternative" syntax (|) within a character class to separate values. user3776824's pattern will actually exclude pipes in addition to semicolons and spaces -- though I don't expect any negative impact in the outcome based on the sample data. The pipe in the pattern simply should not be written.

how to extract a certain digit from a String using regular expression in php?

I have a String (filename): s_113_2.3gp
How can I extract the number that appears after the second underscore? In this case it's '2' but in some cases that can be a few digits number.
Also the number of digits that appears after the first underscore can vary so the length of this String is not constant.
You can use a capturing group:
preg_match('/_(\d+)\.\w+$/', $str, $matches);
$number = $matches[1];
\d+ represents 1 or more digits. The parentheses around that capture it, so you can later retrieve it with $matches[1]. The . needs to be escaped, because otherwise it would match any character but line breaks. \w+ matches 1 or more word characters (digits, letters, underscores). And finally the $ represents the end of the string and "anchors" the regular expression (otherwise you would get problems with strings containing multiple .).
This also allows for arbitrary file extensions.
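Wrapped in a function, the capturing-group version could look like the sketch below. The helper name trailing_number and the extra filenames are hypothetical.

```php
<?php
// Hypothetical helper: returns the number before the extension, or null.
function trailing_number(string $filename): ?int
{
    // Capture the digits between the last underscore and the extension.
    if (preg_match('/_(\d+)\.\w+$/', $filename, $matches)) {
        return (int) $matches[1];
    }
    return null;
}

var_dump(trailing_number('s_113_2.3gp'));
var_dump(trailing_number('s_7_1234.mp4'));
var_dump(trailing_number('no-number.txt'));
```

Checking the return value of preg_match before reading $matches[1] avoids an undefined-index notice when a filename does not follow the expected shape.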
As Ωmega pointed out below there is another possibility, that does not use a capturing group. With the concept of lookarounds, you can avoid matching _ at the start and the \.\w+$ at the end:
preg_match('/(?<=_)\d+(?=\.\w+$)/', $str, $matches);
$number = $matches[0];
However, I would recommend profiling, before applying this rather small optimization. But it is something to keep in mind (or rather, to read up on!).
Using regex lookaround it is very short code:
$n = preg_match('/(?<=_)\d+(?=\.)/', $str, $m) ? $m[0] : "";
...which reads: find one or more digits \d+ that are between underscore (?<=_) and period (?=\.)
