Match/extract all characters between 2 strings

I want to extract John Doe from the string \n*DRIVGo*\nVolledige naam: John Doe\nTelefoonnummer: 0612345678\nIP: 94.214.168.86\n
So I guess the regex pattern needs to extract all characters between 'Volledige naam:' and '\n'. Is there anyone who can help me out?

You can use this regex to capture the name in group 1:
naam:\s+([a-zA-Z ]+)
Since the name can contain only letters and spaces, the character class [a-zA-Z ]+ is used.
Sample PHP code:
$str = "\n*DRIVGo*\nVolledige naam: John Doe\nTelefoonnummer: 0612345678\nIP: 94.214.168.86\n";
preg_match('/naam:\s+([a-zA-Z ]+)/', $str, $matches);
print_r($matches[1]);
Prints,
John Doe

You can use
^Volledige naam:\s*\K.+
in multiline mode. That is
^ # start of line
Volledige naam:\s*\K # Volledige naam:, whitespace, then "forget" what's been matched
.+ # rest of the line
In PHP:
<?php
$string = <<<DATA
*DRIVGo*
Volledige naam: John Doe
Telefoonnummer: 0612345678
IP: 94.214.168.86
DATA;
$regex = '~^Volledige naam:\s*\K.+~m';
if (preg_match($regex, $string, $match)) {
print_r($match);
}
?>
See a demo on ideone.com as well as on regex101.com.
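As a generalisation of the two regex answers above, here is a minimal sketch of a reusable "between two strings" helper. The function name extractBetween and its signature are hypothetical and not part of either answer; it relies only on preg_quote and preg_match, and it assumes the first occurrence of the start marker is the one you want.

```php
<?php
// Hypothetical helper (name and signature are illustrative only).
// Returns the text between $start and the next $end, or null if not found.
function extractBetween(string $text, string $start, string $end): ?string
{
    // preg_quote escapes regex metacharacters in the markers; '~' is the delimiter.
    $pattern = '~' . preg_quote($start, '~') . '\s*(.*?)\s*' . preg_quote($end, '~') . '~s';
    return preg_match($pattern, $text, $m) === 1 ? $m[1] : null;
}

$str = "\n*DRIVGo*\nVolledige naam: John Doe\nTelefoonnummer: 0612345678\nIP: 94.214.168.86\n";
$name = extractBetween($str, 'Volledige naam:', "\n");
echo $name, "\n"; // John Doe
```

The lazy quantifier (.*?) is what makes the match stop at the first line break after the label instead of the last one.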

The required string always starts right after the position returned by indexOf(':') and ends at the position returned by a second indexOf call for the line break, which uses the first result as its offset. (This assumes the first call does not report "not found"; if the second call does, the complete segment is not contained in the string.)
A regular expression seems less useful here because the source string does not vary in a way that requires an automaton.
Consider a simple split('\n') operation (optionally limited to the number of parts you need), which can be followed by further such calls if necessary to obtain the desired value without any underlying engine.
The logic is the same as what a regex does for you in its underlying implementation, but the associated cost in memory and performance is usually justified only in certain scenarios (for instance code page or locale conversions, or finding words with incorrect declension or punctuation), none of which seem to apply here.
Consider a parser construct with fields and methods that can obtain (point to) the data and verify its integrity when required; in most cases this will also let you quickly serialize and deserialize the results.
Finally, since you indicated your language is PHP: the equivalent of indexOf is strpos, and the following code demonstrates various ways to solve this problem without regex.
$str = "\n*DRIVGo*\nVolledige naam: John Doe\nTelefoonnummer: 0612345678\nIP: 94.214.168.86\n";
$search = chr(10);
$parts = explode($search, $str);
$partsCount = count($parts);
print_r($parts);
if($partsCount > 1) print($parts[1]); //*DRIVGo*
print('-----Same results via different methodology------');
$groupStart = 0;
$groupEnd = $groupStart;
$max = strlen($str);
//While the groupEnd has not approached the length of str
while ($groupEnd <= $max &&
($groupStart = strpos($str, $search, $groupStart)) !== false && // find search in str starting at groupStart, assign result to groupStart (strpos returns false, not -1, when nothing is found)
($groupEnd = strpos($str, $search, $groupEnd + 1)) !== false) // find search in str starting at groupEnd + 1, assign result to groupEnd
{
//Show the start, end, length and resulting substring
print_r([$groupStart, $groupEnd, $groupEnd - $groupStart, substr($str, $groupStart, $groupEnd - $groupStart)]);
//advance the parsing
$groupStart = $groupEnd;
}
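The code above walks over every line; the two-call strpos idea described at the start of this answer can also be applied directly to the one field that was asked for. This is only a sketch under the assumption that the label 'Volledige naam:' appears once and that the value ends at the next line break (or at the end of the string); the variable names are illustrative.

```php
<?php
$str = "\n*DRIVGo*\nVolledige naam: John Doe\nTelefoonnummer: 0612345678\nIP: 94.214.168.86\n";
$label = 'Volledige naam:';
$name = null;

// First call: locate the label. strpos returns false when it is missing.
$start = strpos($str, $label);
if ($start !== false) {
    $start += strlen($label);
    // Second call: find the line break, searching from the end of the label.
    $end = strpos($str, "\n", $start);
    if ($end === false) {
        $end = strlen($str); // no trailing line break: take the rest of the string
    }
    $name = trim(substr($str, $start, $end - $start));
}

echo $name, "\n"; // John Doe
```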

Related

Replace (add) words case sensitive from arrays

I am new to PHP and especially to regex.
My goal is to automatically enrich texts with hints for "keywords" that are listed in arrays.
This is how far I have come:
$pattern = array("/\bexplanations\b/i",
"/\btarget\b/i",
"/\bhints\b/i",
"/\bhint\b/i",
);
$replacement = array("explanations <i>(Erklärungen)</i>",
"target <i>Ziel</i>",
"hints <i>Hinsweise</i>",
"hint <i>Hinweis</i>",
);
$string = "Target is to add some explanations (hints) from an array to
this text. I am thankful for every hint.";
echo preg_replace($pattern, $replacement, $string);
returns:
target <i>Ziel</i> is to add some explanations <i>(Erklärungen)</i> (hints <i>Hinsweise</i>) from an array to this text. I am thankful for every hint <i>Hinweis</i>
1) In general, I wonder if there are more elegant solutions (possibly without replacing the original word)?
At a later stage the arrays will contain more than 1000 items and will come from MariaDB.
2) How can I achieve case-sensitive treatment of a word such as "Targets"?
(without doubling the length of my arrays).
Sorry for my English and many thanks in advance.

If you plan to grow the array and the text may be long, processing the whole text once per word isn't an efficient approach. Likewise, with a large array, building one giant alternation of all the words isn't practical.
But if you store all the translations in an associative array and split the text on word-boundaries, you can do it in one pass:
// Translation array with all keys lowercase
$trans = [ 'explanations' => 'Erklärungen',
'target' => 'Ziel',
'hints' => 'Hinsweise',
'hint' => 'Hinweis'
];
$parts = preg_split('~\b~', $text);
$partsLength = count($parts);
// All words are in the odd indexes
for ($i=1; $i<$partsLength; $i+=2) {
$lcWord = strtolower($parts[$i]);
if (isset($trans[$lcWord]))
$parts[$i] .= ' <i>(' . $trans[$lcWord] . ')</i>';
}
$result = implode('', $parts);
Actually the limitation here is that you can't use a key that contains a word-boundary (if you want to translate a whole expression with several words for instance), but if you want to handle this case, you can use preg_match_all in place of preg_split and build a pattern that tests these special cases before, something like:
preg_match_all('~mushroom pie\b|\w+|\W*~iS', $text, $m);
$parts = &$m[0];
$partsLength = count($parts);
$i = 1 ^ preg_match('~^\w~', $parts[0]);
for (; $i<$partsLength; $i+=2) {
...
(if you have a lot of exceptions (too many) other strategies are possible.)
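The preg_match_all variant is only outlined above, so here is one possible complete sketch of it. It is a simplification, not the answer's exact loop: it uses \W+ instead of \W* and checks every token against the array instead of stepping over alternate indexes. The multi-word key 'mushroom pie' and its translation are hypothetical sample data.

```php
<?php
// Hypothetical sample data: keys are lowercase, the translations are illustrative.
$trans = [
    'mushroom pie' => 'Pilzpastete',
    'hint'         => 'Hinweis',
];
$text = 'A hint: Mushroom pie is tasty.';

// Multi-word expressions come first so they win over the generic \w+ branch.
preg_match_all('~mushroom pie\b|\w+|\W+~i', $text, $m);
$parts = $m[0];

foreach ($parts as $i => $part) {
    $key = strtolower($part);
    if (isset($trans[$key])) {
        // Keep the original word (and its case) and append the hint.
        $parts[$i] .= ' <i>(' . $trans[$key] . ')</i>';
    }
}

$result = implode('', $parts);
echo $result, "\n";
```

Checking every token costs one extra array lookup per separator, which keeps the sketch short; the index-parity loop from the answer avoids those lookups.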

Enclose the search words in parentheses in the regex patterns and use backreferences in the replacements.
See this PHP demo:
$pattern = array("/\b(explanations)\b/i", "/\b(target)\b/i", "/\b(hints)\b/i", "/\b(hint)\b/i", );
$replacement = array('$1 <i>(Erklärungen)</i>', '$1 <i>Ziel</i>', '$1 <i>Hinsweise</i>', '$1 <i>Hinweis</i>', );
$string = "Target is to add some explanations (hints) from an array to this text. I am thankful for every hint.";
echo preg_replace($pattern, $replacement, $string);
That way, the replacement keeps each word in the actual case used in the text.
Note that it is very important that the patterns go in descending order, with longer patterns coming before shorter ones (first Targets, then Target, etc.).

PHP trouble with preg_match

I thought I had this working; however after further evaluation it seems it's not working as I would have hoped it was.
I have a query pulling back a string. The string is a comma separated list just as you see here:
(1,145,154,155,158,304)
Nothing has been added or removed.
I have a function that I thought I could use preg_match to determine if the user's id was contained within the string. However, it appears that my code is looking for any part.
preg_match('/'.$_SESSION['MyUserID'].'/',$datafs['OptFilter_1']))
using the same it would look like such
preg_match('/1/',(1,145,154,155,158,304)) I would think. After testing if my user id is 4 the current code returns true and it shouldn't. What am I doing wrong? As you can see the id length can change.

It's better to put all your IDs into an array and then check whether the desired ID exists:
<?php
$str = "(1,145,154,155,158,304)";
$str = str_replace(array("(", ")"), "", $str);
$arr = explode(',', $str);
if(in_array($_SESSION['MyUserID'], $arr))
{
// ID exists
}
If you do want a regular expression, although it isn't recommended here, the regex below will match your ID if it is present:
preg_match("#[,(]$ID[,)]#", $str)
Explanations:
[,(] # a comma , or opening-parenthesis ( character
$ID # your ID
[,)] # a comma , or closing-parenthesis ) character
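To make the difference from the original substring match concrete, here is a small sketch of the array approach wrapped in a function. The name idInList is hypothetical, and the sketch assumes the list always has the "(a,b,c)" shape shown in the question.

```php
<?php
// Hypothetical helper: the name idInList is not from the original answer.
function idInList(string $list, int $id): bool
{
    // Strip the surrounding parentheses, then split on commas.
    $ids = array_map('trim', explode(',', trim($list, '()')));
    // Strict comparison on strings: "4" matches only the element "4",
    // never a part of "145", "154" or "304".
    return in_array((string) $id, $ids, true);
}

$list = '(1,145,154,155,158,304)';
var_dump(idInList($list, 4));   // bool(false)
var_dump(idInList($list, 145)); // bool(true)
```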

Regex to match specific string not enclosed by another, different specific string

I need a regex to match a string not enclosed by another different, specific string. For instance, in the following situation it would split the content into two groups: 1) The content before the second {Switch} and 2) The content after the second {Switch}. It wouldn't match the first {Switch} because it is enclosed by {my_string}'s. The string will always look like shown below (i.e. {my_string}any content here{/my_string})
Some more
{my_string}
Random content
{Switch} //This {Switch} may or may not be here, but should be ignored if it is present
More random content
{/my_string}
Content here too
{Switch}
More content
So far I've gotten what is below which I know isn't very close at all:
(.*?)\{Switch\}(.*?)
I'm just not sure how to use the [^] (not operator) with a specific string versus different characters.

It really seems you're trying to use a regular expression to parse a grammar - something that regular expressions are really bad at doing. You might be better off writing a parser to break down your string into the tokens that build it, and then processing that tree.
Perhaps something like http://drupal.org/project/grammar_parser might help.

Try this simple function:
function find_content($doc) {
$temp = $doc;
preg_match_all('~{my_string}.*?{/my_string}~is', $temp, $x);
$i = 0;
while (isset($x[0][$i])) {
$temp = str_replace($x[0][$i], "{REPL:$i}", $temp);
$i++;
}
$res = explode('{Switch}', $temp);
foreach ($res as &$part)
foreach($x[0] as $id=>$content)
$part = str_replace("{REPL:$id}", $content, $part);
return $res;
}
Use it this way
$content_parts = find_content($doc); // $doc is your input document
print_r($content_parts);
Output (your example)
Array
(
[0] => Some more
{my_string}
Random content
{Switch} //This {Switch} may or may not be here, but should be ignored if it is present
More random content
{/my_string}
Content here too
[1] =>
More content
)
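A different technique, not used in the answer above, can produce the same split in a single call: the PCRE backtracking verbs (*SKIP)(*FAIL). The first alternative consumes a whole {my_string}...{/my_string} block and then forces a failure, so any {Switch} inside it is skipped; the second alternative is the delimiter that preg_split actually splits on. This is a sketch that assumes the blocks are never nested.

```php
<?php
$doc = "Some more\n{my_string}\nRandom content\n{Switch}\nMore random content\n{/my_string}\nContent here too\n{Switch}\nMore content";

// Left branch: match an enclosed block, then discard it with (*SKIP)(*FAIL).
// Right branch: the {Switch} marker that is left to split on.
$pattern = '~\{my_string\}.*?\{/my_string\}(*SKIP)(*FAIL)|\{Switch\}~s';
$parts = preg_split($pattern, $doc);

print_r($parts);
```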

You can try positive lookahead and lookbehind assertions (http://www.regular-expressions.info/lookaround.html)
It might look something like this:
$content = 'string of text before some random content switch text some more random content string of text after';
$before = preg_quote('string of text before', '/');
$switch = preg_quote('switch text', '/');
$after = preg_quote('string of text after', '/');
// The first group is lazy so that it stops at the switch text instead of swallowing it.
if (preg_match('/(?<=' . $before . ')(.*?)' . $switch . '(.*)(?=' . $after . ')/', $content, $matches)) {
// $matches[1] == ' some random content '
// $matches[2] == ' some more random content '
}

$regex = '~(?:(?!\{my_string\})(.*?))(\{Switch\})(?:(.*?)(?!\{my_string\}))~';
/* if "my_string" and "Switch" aren't wrapped by "{" and "}" just remove "\{" and "\}" */
$yourNewString = preg_replace($regex,"$1",$yourOriginalString);
This might work. I can't test it now, but I'll update later!
I don't know if this is what you're looking for, but to negate more than one character, the regex syntax is:
(?!yourString)
and it is called "negative lookahead assertion".
/Edit:
This should work and return true:
$stringMatchesYourRulesBoolean = preg_match('~(.*?)('.$my_string.')(.*?)(?<!'.$my_string.') ?('.$switch.') ?(?!'.$my_string.')(.*?)('.$my_string.')(.*?)~',$yourString);

Have a look at PHP PEG. It is a little parser written in PHP. You can write your own grammar and parse it. It's going to be very simple in your case.
The grammar syntax and the way of parsing is all explained in the README.md
Extracts from the readme:
token* - Token is optionally repeated
token+ - Token is repeated at least one
token? - Token is optionally present
Tokens may be :
- bare-words, which are recursive matchers - references to token rules defined elsewhere in the grammar,
- literals, surrounded by `"` or `'` quote pairs. No escaping support is provided in literals.
- regexs, surrounded by `/` pairs.
- expressions - single words (match \w+)
Sample grammar: (file EqualRepeat.peg.inc)
class EqualRepeat extends Packrat {
/* Any number of a followed by the same number of b and the same number of c characters
* aabbcc - good
* aaabbbccc - good
* aabbc - bad
* aabbacc - bad
*/
/*Parser:Grammar1
A: "a" A? "b"
B: "b" B? "c"
T: !"b"
X: &(A !"b") "a"+ B !("a" | "b" | "c")
*/
}

using preg_split to fetch terms between two markers in string

ok I have two strings.
(I use this for a language library system to allow translators to provide translations with placeholders).
In the first string, there are two instances. note that it's not always a single instance, some cases it will be none, one, two, or more.
This is a {[John Doe]} and this is {[Jane Doe]}
and then I have a string that is stored like this:
C'est {[1]} et c'est {[2]}
(translation)
This is a {[1]} and this is a {[2]}
So what I need to do is take the first string, take everything between {[ and ]}, and match each instance by position, i.e. the first instance in the first string goes with {[1]} in the second string, and so on. Keep in mind that the reason I am using {[1]} and {[2]} is that in some languages terms may appear in a different order for grammatical accuracy, but they are still terms that don't need translation themselves (names).
So the question is: how do I do this? I am thinking of preg_split and then matching index+1 of each part with the second string. That part I can handle. The problem I am having is getting the right regex.
this is as close as I could get it..
preg_split('/[(\{\[).*(\]\})]/', $str, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
that returns an array of everything before and after each instance of {[ and ]} when I am just trying to get the contents of inbetween the two..
EDIT: solution derived from NikiC's answer.
function lang($str){
$nwStr = $str;
preg_match_all('(\{\[(.+?)\]\})', $str, $placeholders);
foreach ($placeholders[0] as $mk => $match) {
$pos = $mk+1;
$nwStr = str_replace("$match","{[$pos]}",$nwStr);
}
$result = preg_replace_callback('(\{\[(\d+)\]\})', function ($matches) use ($placeholders) {
$n = $matches[1]-1;
return $placeholders[1][$n];
}, $translation);
return $result;
}
basically what i am doing here is first looping through to replace the matches with the placeholders so that I can match the proper placeholder text in my language files. (i.e. create the right label string out of the input string)

First grab the placeholders from the string:
preg_match_all('(\{\[(.+?)\]\})', $string, $matches);
$placeholders = $matches[1];
Now replace with a callback:
$result = preg_replace_callback('(\{\[(\d+)\]\})', function ($matches) use ($placeholders) {
$n = $matches[1] - 1; // placeholders are numbered from 1, array indexes from 0
return $placeholders[$n];
}, $translation);
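Put together with the strings from the question, the two steps can be sketched as follows. This assumes the placeholders in the translation are numbered from 1, so placeholder number n maps to array index n - 1; an unknown number is left untouched, which is a choice made for this sketch rather than something the answer specifies.

```php
<?php
$string = 'This is a {[John Doe]} and this is {[Jane Doe]}';
$translation = "C'est {[1]} et c'est {[2]}";

// Step 1: collect the terms between {[ and ]} (group 1 of each match).
preg_match_all('(\{\[(.+?)\]\})', $string, $matches);
$placeholders = $matches[1];

// Step 2: swap each numbered placeholder for the term with that position.
$result = preg_replace_callback('(\{\[(\d+)\]\})', function ($m) use ($placeholders) {
    $n = (int) $m[1] - 1;             // {[1]} refers to index 0
    return $placeholders[$n] ?? $m[0]; // keep the placeholder if no term exists
}, $translation);

echo $result, "\n"; // C'est John Doe et c'est Jane Doe
```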

You're almost there. PREG_SPLIT_DELIM_CAPTURE captures the groups between ( and ), so this:
preg_split('/(\{\[.*\]\})/U', $str, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
should work better. I also added the U modifier so that * is ungreedy.
Edit: also, you have a pair of [ and ] which definitely don't belong there!
Another thing, you probably want to have the parts between the {[...]} construct, so this is better:
preg_split('/\{\[(.*)\]\}/U', $str, -1, PREG_SPLIT_DELIM_CAPTURE);
By removing the PREG_SPLIT_NO_EMPTY, you now know for certain that you will find the tagged parts at odd indexes.
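A short check of the odd-index claim, using the first string from the question. The loop and the $terms variable are illustrative additions.

```php
<?php
$str = 'This is a {[John Doe]} and this is {[Jane Doe]}';
$parts = preg_split('/\{\[(.*)\]\}/U', $str, -1, PREG_SPLIT_DELIM_CAPTURE);

// Even indexes hold the surrounding text, odd indexes hold the captured terms.
$terms = [];
for ($i = 1; $i < count($parts); $i += 2) {
    $terms[] = $parts[$i];
}

print_r($parts);
print_r($terms);
```

Because the string ends with a placeholder, the last element of $parts is an empty string; that empty element is what keeps the terms on odd indexes.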

Regular Expressions: how to do "option split" replaces

Those regular expressions drive me crazy. I'm stuck with this one:
test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not
Task:
Remove all [[ and ]], and if there is an option split, choose the last option, so the output should be:
test1:link test2:silver test3:out1insideout2 test4:this|not
I came up with (PHP)
$text = preg_replace("/\\[\\[|\\]\\]/",'',$text); // remove [[ or ]]
This works for part 1 of the task, but I think I should do the option split before that. My best attempt:
$text = preg_replace("/\\[\\[(.*\|)(.*?)\\]\\]/",'$2',$text);
Result:
test1:silver test3:[[out1[[inside]]out2]] this|not
I'm stuck. Could someone with a few free minutes help me? Thanks!

I think the easiest way to do this would be multiple passes. Use a regular expression like:
\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]
This will replace option strings to give you the last option from the group. If you run it repeatedly until it no longer matches, you should get the right result (the first pass will replace [[out1[[inside]]out2]] with [[out1insideout2]] and the second will ditch the brackets).
Edit 1: By way of explanation,
\[\[ # Opening [[
(?: # A non-capturing group (we don't want this bit)
[^\[\]] # Non-bracket characters
* # Zero or more of them
\| # A literal '|' character representing the end of the discarded options
)? # This group is optional: if there is only one option, it won't be present
( # The group we're actually interested in ($1)
[^\[\]] # All the non-bracket characters
+ # Must be at least one
) # End of $1
\]\] # End of the grouping.
Edit 2: Changed expression to ignore ']' as well as '[' (it works a bit better like that).
Edit 3: There is no need to know the number of nested brackets as you can do something like:
$oldtext = "";
$newtext = $text;
while ($newtext != $oldtext)
{
$oldtext = $newtext;
$newtext = preg_replace('/\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]/', '$1', $oldtext);
}
$text = $newtext;
Basically, this keeps running the regular expression replace until the output is the same as the input.
Note that I don't know PHP, so there are probably syntax errors in the above.
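For reference, the repeat-until-stable loop can also be written in PHP with the optional count argument of preg_replace, which reports how many replacements a pass made. The function name resolveOptions is hypothetical; the pattern is the one explained above.

```php
<?php
// Hypothetical helper name; the pattern is the one from this answer.
function resolveOptions(string $text): string
{
    $pattern = '/\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]/';
    do {
        // $count receives the number of replacements made in this pass.
        $text = preg_replace($pattern, '$1', $text, -1, $count);
    } while ($count > 0);
    return $text;
}

$text = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$resolved = resolveOptions($text);
echo $resolved, "\n"; // test1:link test2:silver test3:out1insideout2 test4:this|not
```

Each pass resolves the innermost brackets only, so the number of passes is the nesting depth plus one final pass that finds nothing.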

This is impossible to do in one regular expression since you want to keep content in multiple "hierarchies" of the content. It would be possible otherwise, using a recursive regular expression.
Anyways, here's the simplest, most greedy regular expression I can think of. It should only replace if the content matches your exact requirements.
You will need to escape all backslashes when putting it into a string (\ becomes \\.)
\[\[((?:[^][|]+|(?!\[\[|]])[^|])++\|?)*]]
As others have already explained, you use this with multiple passes. Keep looping while there are matches, performing replacement (only keeping match group 1.)
Difference from other regular expressions here is that it will allow you to have single brackets in the content, without breaking:
test1:[[link]] test2:[[gold|si[lv]er]]
test3:[[out1[[in[si]de]]out2]] test4:this|not
becomes
test1:[[link]] test2:si[lv]er
test3:out1in[si]deout2 test4:this|not

Why try to do it all in one go? Remove the [[]] first and then deal with the options; do it in two lines of code.
When trying to get something going favour clarity and simplicity.
Seems like you have all the pieces.

Why not just simply remove any brackets that are left?
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$str = preg_replace('/\\[\\[(?:[^|\\]]+\\|)+([^\\]]+)\\]\\]/', '$1', $str);
$str = str_replace(array('[', ']'), '', $str);

Well, I didn't stick to just regex, because I'm of a mind that trying to do stuff like this with one big regex leads you to the old joke about "Now you have two problems". However, give something like this a shot:
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$reg = '/(.*?):(.*?)( |$)/';
preg_match_all($reg, $str, $m);
foreach($m[2] as $pos => $match) {
if (strpos($match, '|') !== FALSE && strpos($match, '[[') !== FALSE ) {
$opt = explode('|', $match); $match = $opt[count($opt)-1];
}
$m[2][$pos] = str_replace(array('[', ']'),'', $match );
}
foreach($m[1] as $k=>$v) $result[$k] = $v.':'.$m[2][$k];

This is C# using only verbatim (non-escaped) strings, hence you will have to double the backslashes in other languages.
String input = "test1:[[link]] " +
"test2:[[gold|silver]] " +
"test3:[[out1[[inside]]out2]] " +
"test4:this|not";
String step1 = Regex.Replace(input, @"\[\[([^|]+)\|([^\]]+)\]\]", @"[[$2]]");
String step2 = Regex.Replace(step1, @"\[\[|\]\]", String.Empty);
// Prints "test1:silver test3:out1insideout2 test4:this|not"
Console.WriteLine(step2);

$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$s = preg_split("/\s+/",$str);
foreach ($s as $k=>$v){
$v = preg_replace("/\[\[|\]\]/","",$v);
$j = explode(":",$v);
$j[1]=preg_replace("/.*\|/","",$j[1]);
print implode(":",$j)."\n";
}

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.
