get substring between 2 characters in php - php

I'm building a mention system like the ones on Twitter and Instagram, where you simply write #johndoe.
What I'm trying to do is extract the name between "#" and any of these characters: ? , ] : (space)
As an example, here's my string:
hey #johnDoe check out this event, be sure to bring #janeDoe:,#johnnyappleSeed?, #johnCitizen] , and #fredNerk
How can I get an array of janeDoe, johnnyappleSeed, johnCitizen, fredNerk without the characters ? , ] : attached to them?
I know I have to use a variation of preg_match, but I don't have a strong understanding of it.

This is what you've asked for: /\#(.*?)\s/
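This is what you really want: /#(\w+)/ (\w+ stops at the first character that is not a letter, digit, or underscore, so the trailing ?, ], : and , are never captured)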
Put either one into preg_match_all() and evaluate the results array.

preg_match_all("/\#(.*?)\s/", $string, $result_array);

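// PHP has no "g" modifier; preg_match_all() already finds every match
$check_hash = preg_match_all("/#[a-zA-Z0-9]*/", $string_to_match_against, $matches);
You could then do something like
// $matches[0] holds the full matches; $matches itself is an array of arrays
foreach ($matches[0] as $mention) {
    echo $mention . "<br />";
}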
UPDATE: Just realized you were looking to remove the invalid characters. Updated script should do it.

How about:
$str = 'hey #johnDoe check out this event, be sure to bring #janeDoe:,#johnnyappleSeed?, #johnCitizen] , and #fredNerk';
preg_match_all('/#(.*?)(?:[?, \]: ]|$)/', $str, $m);
print_r($m);
output:
Array
(
[0] => Array
(
[0] => #johnDoe
[1] => #janeDoe:
[2] => #johnnyappleSeed?
[3] => #johnCitizen]
[4] => #fredNerk
)
[1] => Array
(
[0] => johnDoe
[1] => janeDoe
[2] => johnnyappleSeed
[3] => johnCitizen
[4] => fredNerk
)
)
explanation:
The regular expression:
(?-imsx:#(.*?)(?:[?, \]: ]|$))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
# '#'
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
[?, \]: ] any character of: '?', ',', ' ', '\]',
':', ' '
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
$ before an optional \n, and the end of
the string
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
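If mention names only ever contain letters, digits, and underscores (an assumption; the question does not define the allowed characters), a character-class pattern avoids listing the terminators at all. A minimal sketch, where extractMentions is a hypothetical helper name:

```php
<?php
// Hypothetical helper: returns the names that follow "#" in $text.
// Assumes a name consists only of word characters (letters, digits, underscore).
function extractMentions(string $text): array
{
    // \w+ stops at the first non-word character, so ?, ], : and , are never captured.
    preg_match_all('/#(\w+)/', $text, $matches);
    return $matches[1]; // group 1 holds the names without the leading "#"
}

$str = 'hey #johnDoe check out this event, be sure to bring #janeDoe:,#johnnyappleSeed?, #johnCitizen] , and #fredNerk';
print_r(extractMentions($str));
```

If names may contain other characters (a hyphen, for instance), the class inside the group would need to be widened accordingly.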

Related

Finding sentences between characters

I am trying to find sentences between pipe | and dot ., e.g.
| This is one. This is two.
The regex pattern I use :
preg_match_all('/(:\s|\|+)(.*?)(\.|!|\?)/s', $file0, $matches);
So far I could not manage to capture both sentences. The regex I use captures only the first sentence.
How can I solve this problem?
EDIT: As can be seen from the regex, I am trying to find the sentences BETWEEN (: or |) AND (. or ! or ?).
A colon or pipe indicates the starting point for sentences.
The sentences might be:
: Sentence one. Sentence two. Sentence three.
| Sentence one. Sentence two?
| Sentence one. Sentence two! Sentence three?
I would keep it simple and just match on:
\s*([^.|]+)
This says to match any run of content that contains no pipes or full stops, skipping optional whitespace before each sentence; the sentence itself ends up in capture group 1.
$input = "| This is one. This is two.";
preg_match_all('/\s*([^.|]+)/', $input, $matches);
print_r($matches[1]);
This prints:
Array
(
[0] => This is one
[1] => This is two
)
This does the job:
$str = '| This is one. This is two.';
preg_match_all('/(?:\s|\|)+(.*?)(?=[.!?])/', $str, $m);
print_r($m);
Output:
Array
(
[0] => Array
(
[0] => | This is one
[1] => This is two
)
[1] => Array
(
[0] => This is one
[1] => This is two
)
)
Another option is to use \G, which asserts the position at the end of the previous match, to get consecutive matches. Each sentence is captured in a group, followed by a dot and 0+ horizontal whitespace chars.
(?:\|\h*|\G(?!^))([^.\r\n]+)\.\h*
In parts
(?: Non capturing group
\|\h* Match | and 0+ horizontal whitespace chars
| Or
\G(?!^) Assert position at the end of previous match
) Close group
( Capture group 1
[^.\r\n]+ Match 1+ times any char other than . or a newline
) Close group
\.\h* Match 1 . and 0+ horizontal whitespace chars
For example
$re = '/(?:\|\h*|\G(?!^))([^.\r\n]+)\.\h*/';
$str = '| This is one. This is two.
John loves Mary.| This is one. This is two.';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
print_r($matches);
Output
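Array
(
[0] => Array
(
[0] => | This is one.
[1] => This is one
)
[1] => Array
(
[0] => This is two.
[1] => This is two
)
[2] => Array
(
[0] => | This is one.
[1] => This is one
)
[3] => Array
(
[0] => This is two.
[1] => This is two
)
)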
To keep it simple, find everything between | and . and then split:
$input = "John loves Mary. | This is one. This is two. | Sentence 1. Sentence 2.";
preg_match_all('/\|\s*([^|]+)\./', $input, $matches);
if ($matches) {
foreach($matches[1] as $match) {
print_r(preg_split('/\.\s*/', $match));
}
}
Prints:
Array
(
[0] => This is one
[1] => This is two
)
Array
(
[0] => Sentence 1
[1] => Sentence 2
)
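The question's edit asks for sentences that start after a colon or a pipe and end at ., ! or ?. The "find the segment, then break it up" idea above can be extended to cover that. This is only a sketch: sentencesAfterMarkers is a hypothetical name, and it assumes that ":" and "|" never occur inside a sentence.

```php
<?php
// Hypothetical helper: collects sentences that follow a ":" or "|" marker.
// Assumes markers never occur inside a sentence (e.g. no "12:30").
function sentencesAfterMarkers(string $text): array
{
    $sentences = [];
    // Step 1: grab each segment that starts at a marker and runs to the next marker or the end.
    preg_match_all('/[:|]([^:|]+)/', $text, $segments);
    foreach ($segments[1] as $segment) {
        // Step 2: inside a segment, capture the text up to each ., ! or ? (leading whitespace skipped).
        preg_match_all('/\s*([^.!?]+)[.!?]/', $segment, $m);
        $sentences = array_merge($sentences, $m[1]);
    }
    return $sentences;
}

print_r(sentencesAfterMarkers('| Sentence one. Sentence two! Sentence three?'));
```

Text that appears before the first marker, such as "John loves Mary." in the earlier example, is ignored by step 1.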

split string by spaces and colon but not if inside quotes

Given a string like this:
$str = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf";
the desired result is:
[0] => Array (
[0] => dateto:'2015-10-07 15:05'
[1] => xxxx
[2] => datefrom:'2015-10-09 15:05'
[3] => yyyy
[4] => asdf
)
what I get with:
preg_match_all("/\'(?:[^()]|(?R))+\'|'[^']*'|[^(),\s]+/", $str, $m);
is:
[0] => Array (
[0] => dateto:'2015-10-07
[1] => 15:05'
[2] => xxxx
[3] => datefrom:'2015-10-09
[4] => 15:05'
[5] => yyyy
[6] => asdf
)
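I also tried preg_split("/[\s]+/", $str), but I have no clue how to skip whitespace when the value is between quotes. Can anyone show me how, and also please explain the regex? Thank you!
I would use the PCRE verbs (*SKIP)(*F). The first alternative matches a quoted part and then discards it, so only whitespace outside quotes is used as a delimiter: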
preg_split("~'[^']*'(*SKIP)(*F)|\s+~", $str);
Often, when you want to split a string, preg_split isn't the best approach (that seems a little counterintuitive, but it's true most of the time). A more efficient way is to find all the items (with preg_match_all) using a pattern that describes everything that is not the delimiter (whitespace here):
$pattern = <<<'EOD'
~(?=\S)[^'"\s]*(?:'[^']*'[^'"\s]*|"[^"]*"[^'"\s]*)*~
EOD;
if (preg_match_all($pattern, $str, $m))
$result = $m[0];
pattern details:
~ # pattern delimiter
(?=\S) # the lookahead assertion only succeeds if there is a non-
# white-space character at the current position.
# (This lookahead is useful for two reasons:
# - it allows the regex engine to quickly find the start of
# the next item without having to test each branch of the
# following alternation at each position in the string
# until one succeeds.
# - it ensures that there's at least one non-white-space.
# Without it, the pattern may match an empty string.
# )
[^'"\s]* #"'# all that is not a quote or a white-space
(?: # optional quoted parts
'[^']*' [^'"\s]* #"# single quotes
|
"[^"]*" [^'"\s]* # double quotes
)*
~
Note that with this somewhat long pattern, the five items of your example string are found in only 60 steps. You can use this shorter, simpler pattern too:
~(?:[^'"\s]+|'[^']*'|"[^"]*")+~
but it's a little less efficient.
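For your example, you can use preg_split with a negative lookbehind (?<!\d), which splits on whitespace that is not preceded by a digit. This works only because every space inside quotes in your data follows a digit: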
<?php
$str = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf";
$matches = preg_split('/(?<!\d)(\s)/', $str);
print_r($matches);
Output:
Array
(
[0] => dateto:'2015-10-07 15:05'
[1] => xxxx
[2] => datefrom:'2015-10-09 15:05'
[3] => yyyy
[4] => asdf
)
Demo:
http://ideone.com/EP06Nt
Regex Explanation:
(?<!\d)(\s)
Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\d)»
Match a single character that is a “digit” «\d»
Match the regex below and capture its match into backreference number 1 «(\s)»
Match a single character that is a “whitespace character” «\s»
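To compare the three approaches above side by side, here is a small sketch. The first two are run on the question's string. The third is tried on a made-up string ("name:'John Doe' xxxx", not from the question) whose quoted space follows a letter rather than a digit, to show where the lookbehind shortcut stops working.

```php
<?php
$str = "dateto:'2015-10-07 15:05' xxxx datefrom:'2015-10-09 15:05' yyyy asdf";

// Approach 1: split on whitespace, skipping anything inside single quotes.
$bySplit = preg_split("~'[^']*'(*SKIP)(*F)|\s+~", $str);

// Approach 2: match the items themselves (runs of non-space text plus quoted parts).
$pattern = <<<'EOD'
~(?=\S)[^'"\s]*(?:'[^']*'[^'"\s]*|"[^"]*"[^'"\s]*)*~
EOD;
preg_match_all($pattern, $str, $m);
$byMatch = $m[0];

// Approach 3: the digit lookbehind, on a made-up string whose quoted space follows a letter.
$other = "name:'John Doe' xxxx";
$byLookbehind = preg_split('/(?<!\d)\s/', $other);

print_r($bySplit);      // five items, quoted values kept whole
print_r($byMatch);      // same five items
print_r($byLookbehind); // the quoted value is cut in two: "name:'John" and "Doe'"
```

The first two approaches look at the quotes themselves, so they do not depend on what the quoted values happen to contain.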

PHP - Regex match curly brackets within other regex expression

I am trying to figure out how to match other parts of the stuff I need but can't seem to get it to work.
This is what I have so far:
preg_match_all("/^(.*?)(?:.\(([\d]+?)[\/I^\(]*?\))(?:.\((.*?)\))?/m",$data,$r, PREG_SET_ORDER);
Example text:
INPUT - Each line represents a line inside a text file.
-------------------------------------------------------------------------------------
"!?Text" (1234) 1234-4321
"#1 Text" (1234) 1234-????
#2 Text (1234) {Some text (#1.1)} 1234
Text (1234) 1234
Some Other Text: More Text here 1234-4321 (1234) (V) 1234
What I want to do:
I want to also match things in curly brackets and stuff in brackets of curly brackets.
I can't seem to get it to work considering that things in curly brackets + brackets may not always be within the line.
Essentially first (1234) will be a year and I only want to match it once, however in the last string example it also matches (V) but I don't want it to.
Desirable output:
Array
(
[0] => "!?Text" (1234)
[1] => "!?Text"
[2] => 1234
)
Array
(
[0] => "#1 Text" (1234)
[1] => "#1 Text"
[2] => 1234
)
Array
(
[0] => "#2 Text" (1234)
[1] => "#2 Text"
[2] => 1234
[3] => Some text (#1.1) // Matches things within curly brackets if there are any.
[4] => Some text // Extracts text before brackets
[5] => #1.1 // Extracts text within brackets (if any because brackets may not be within curly brackets.)
)
Array
(
[0] => Text (1234)
[1] => Text
[2] => 1234
)
Array // (My current regular expression gives me a 4th match with value 'V', which it shouldn't do)
(
[0] => Some Other Text: More Text here 1234-4321 (1234) (V)
[1] => Some Other Text: More Text here 1234-4321
[2] => 1234
)
What about using:
^((.*?) *\((\d+)\))(?: *\{((.*?) *\((.+?)\)) *\})?
NODE EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
.*? any character except \n (0 or more
times (matching the least amount
possible))
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
* ' ' (0 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
\( '('
--------------------------------------------------------------------------------
( group and capture to \3:
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
) end of \3
--------------------------------------------------------------------------------
\) ')'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
* ' ' (0 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
\{ '{'
--------------------------------------------------------------------------------
( group and capture to \4:
--------------------------------------------------------------------------------
( group and capture to \5:
--------------------------------------------------------------------------------
.*? any character except \n (0 or more
times (matching the least amount
possible))
--------------------------------------------------------------------------------
) end of \5
--------------------------------------------------------------------------------
* ' ' (0 or more times (matching the
most amount possible))
--------------------------------------------------------------------------------
\( '('
--------------------------------------------------------------------------------
( group and capture to \6:
--------------------------------------------------------------------------------
.+? any character except \n (1 or more
times (matching the least amount
possible))
--------------------------------------------------------------------------------
) end of \6
--------------------------------------------------------------------------------
\) ')'
--------------------------------------------------------------------------------
) end of \4
--------------------------------------------------------------------------------
* ' ' (0 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
\} '}'
--------------------------------------------------------------------------------
)? end of grouping
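As a rough check, the pattern can be applied to the example lines one at a time. In this sketch parseTitleLine is a hypothetical helper name, and preg_match is used per line instead of preg_match_all with the m flag. When the optional curly-bracket part is absent, PHP leaves the trailing unmatched groups out of the result, so those lines yield only three groups.

```php
<?php
// Hypothetical helper: applies the pattern above to one line and returns the
// capture groups, or null when the line has no "(digits)" part.
function parseTitleLine(string $line): ?array
{
    $re = '/^((.*?) *\((\d+)\))(?: *\{((.*?) *\((.+?)\)) *\})?/';
    if (!preg_match($re, $line, $m)) {
        return null;
    }
    return array_slice($m, 1); // drop the full match, keep the groups that took part
}

$lines = [
    '"!?Text" (1234) 1234-4321',
    '#2 Text (1234) {Some text (#1.1)} 1234',
    'Some Other Text: More Text here 1234-4321 (1234) (V) 1234',
];
foreach ($lines as $line) {
    print_r(parseTitleLine($line));
}
```

For the last line the result stops at the year, so "(V)" is not captured.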

Regular expression match, extracting only wanted segments of string

I am trying to extract three segments from a string. As I am not particularly good with regular expressions, I think what I have done could probably be done better.
I would like to extract the three ANYTHING_HERE parts of the following string:
SOMETEXT: ANYTHING_HERE (Old=ANYTHING_HERE, New=ANYTHING_HERE)
Some examples could be:
ABC: Some_Field (Old=,New=123)
ABC: Some_Field (Old=ABCde,New=1234)
ABC: Some_Field (Old=Hello World,New=Bye Bye World)
So the above would return the following matches:
$matches[0] = 'Some_Field';
$matches[1] = '';
$matches[2] = '123';
So far I have the following code:
preg_match_all('/^([a-z]*\:(\s?)+)(.+)(\s?)+\(old=(.+)\,(\s?)+new=(.+)\)/i',$string,$matches);
The issue with the above is that it returns a match for each separate segment of the string. I do not know how to ensure the string is the correct format using a regular expression without catching and storing the match if that makes sense?
So my question, if not already clear, is: how can I retrieve just the segments that I want from the above string?
You don't need preg_match_all. You can use this preg_match call:
$s = 'SOMETEXT: ANYTHING_HERE (Old=ANYTHING_HERE1, New=ANYTHING_HERE2)';
if (preg_match('/[^:]*:\s*(\w*)\s*\(Old=(\w*),\s*New=(\w*)/i', $s, $arr))
print_r($arr);
OUTPUT:
Array
(
[0] => SOMETEXT: ANYTHING_HERE (Old=ANYTHING_HERE1, New=ANYTHING_HERE2
[1] => ANYTHING_HERE
[2] => ANYTHING_HERE1
[3] => ANYTHING_HERE2
)
if(preg_match_all('/([a-z]*)\:\s*.+\(Old=(.+),\s*New=(.+)\)/i',$string,$matches)) {
print_r($matches);
}
Example:
$string = 'ABC: Some_Field (Old=Hello World,New=Bye Bye World)';
Will match:
Array
(
[0] => Array
(
[0] => ABC: Some_Field (Old=Hello World,New=Bye Bye World)
)
[1] => Array
(
[0] => ABC
)
[2] => Array
(
[0] => Hello World
)
[3] => Array
(
[0] => Bye Bye World
)
)
The problem is that you're using more parentheses than you need, and thus capturing more segments of the input than you wish.
E.g., each (\s?)+ segment should just be \s*
The regex that you're looking for is:
[^:]+:\s*(.+)\s*\(old=(.*)\s*,\s*new=(.*)\)
In PHP:
preg_match_all('/[^:]+:\s*(.+)\s*\(old=(.*)\s*,\s*new=(.*)\)/i',$string,$matches);
A useful tool can be found here: http://www.myregextester.com/index.php
This tool offers an "Explain" checkbox (as well as a "PHP" checkbox and "i" flag checkbox which you'll want to select) which provides a full explanation of the regex as well. For posterity, I've included the explanation below as well:
NODE EXPLANATION
----------------------------------------------------------------------
(?i-msx: group, but do not capture (case-insensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
[^:]+ any character except: ':' (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
.+ any character except \n (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
\( '('
----------------------------------------------------------------------
old= 'old='
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
, ','
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
new= 'new='
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
\) ')'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
What about something simpler like ^_^
[:=]\s*([\w\s]*)
:\s*([^(\s]+)\s*\(Old=([^,]*),New=([^)]*)
Let me know if you want an explanation.
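Putting the suggestions in this thread together: one way to validate the whole line while storing only the three wanted segments is to anchor the pattern and put capturing parentheses around just those segments. This is a sketch, not one of the answers' exact patterns: parseChange is a hypothetical helper, and the lazy quantifiers are my own choice to keep trailing spaces out of the first group.

```php
<?php
// Hypothetical helper: returns [field, old, new], or null if $line is not in the expected format.
function parseChange(string $line): ?array
{
    // Only three groups capture; everything else is matched but not stored.
    // The lazy .+? and .*? stop as early as possible, so the space before "("
    // and the comma before "New=" stay out of the captures.
    $re = '/^[^:]+:\s*(.+?)\s*\(Old=(.*?),\s*New=(.*)\)$/i';
    return preg_match($re, $line, $m) ? array_slice($m, 1) : null;
}

print_r(parseChange('ABC: Some_Field (Old=Hello World,New=Bye Bye World)'));
```

If a part of the pattern needs grouping but should not be stored, a non-capturing group (?:...) can be used instead of plain parentheses.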

regular expression end tag = start tag

Take a look at this regular expression:
(?:\(?")(.+)(?:"\)?)
This regex would match e.g
"a"
("a")
but also
"a)
How can I say that the starting character(s) [in this case " or ("] must correspond to the ending character(s)? There must be a simpler solution than this, right?
"(.+)"|(?:\(")(.+)(?:"\))
I don't think there's a good way to do this specifically with regex, so you are stuck doing something like this:
/(?:
"(.+)"
|
\(" (.+) "\)
)/x
how about:
(\(?)(")(.+)\2\1
explanation:
(?-imsx:(\(?)(")(.+)\2\1)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
\(? '(' (optional (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
.+ any character except \n (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
\2 what was matched by capture \2
----------------------------------------------------------------------
\1 what was matched by capture \1
----------------------------------------------------------------------
) end of grouping
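Note that a backreference such as \1 matches the same text the group captured, so after an opening "(" it expects another "(" rather than ")". PCRE, the engine behind PHP's preg_* functions, also offers conditional subpatterns, which can require the closing parenthesis only when the opening one was present. A minimal sketch, with anchors added so that the whole string has to match (an assumption about the intended use) and isBalancedQuote as a hypothetical name:

```php
<?php
// Hypothetical helper: true for "a" and ("a"), false for mismatched forms.
function isBalancedQuote(string $s): bool
{
    // (\()?     optional "(" captured as group 1
    // "(.+)"    the quoted content (group 2)
    // (?(1)\))  if group 1 took part in the match, require ")"; otherwise require nothing
    return preg_match('/^(\()?"(.+)"(?(1)\))$/', $s) === 1;
}

var_dump(isBalancedQuote('"a"'));   // bool(true)
var_dump(isBalancedQuote('("a")')); // bool(true)
var_dump(isBalancedQuote('"a")'));  // bool(false)
var_dump(isBalancedQuote('("a"'));  // bool(false)
```

Without the ^ and $ anchors the pattern would still find the inner "a" in the last two strings, so the anchors matter for validation.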
You can use backreferences. They are a standard feature of PCRE, the engine behind PHP's preg_* functions, and of most other regex flavors:
preg_match('/<([^>]+)>(.+)<\/\1>/', $subject, $matches) (the \1 references whatever the first group captured; use single quotes, because "\1" in a double-quoted PHP string is an octal escape)
This uses the first capture as the condition for the closing tag. It matches <a>something</a> but not <h2>something</a>.
However, in your case you would need to turn the "(" captured by the first group into a ")", which a backreference cannot do.
Update: replace ( and ) with <BRACE> and <END_BRACE>. Then you can match using /<([^>]+)>(.+)<END_\1>/. Do this for all the bracket pairs you need: (), [], {}, <>, and so on.
For example, "(a) is as nice as [f]" becomes "<BRACE>a<END_BRACE> is as nice as <BRACKET>f<END_BRACKET>", and the regex captures both if you use preg_match_all:
$returnValue = preg_match_all('/<([^>]+)>(.+)<END_\\1>/', '<BRACE>a<END_BRACE> is as nice as <BRACKET>f<END_BRACKET>', $matches);
leads to
array (
0 =>
array (
0 => '<BRACE>a<END_BRACE>',
1 => '<BRACKET>f<END_BRACKET>',
),
1 =>
array (
0 => 'BRACE',
1 => 'BRACKET',
),
2 =>
array (
0 => 'a',
1 => 'f',
),
)
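The tag-matching backreference from this answer can be checked with a short sketch. It uses a lazy (.+?) rather than the greedy (.+), which is my own variation, and it also shows why the quoting style of the pattern string matters in PHP.

```php
<?php
// Single quotes keep \1 intact, so PCRE sees a backreference to group 1.
$re = '/<([^>]+)>(.+?)<\/\1>/';

var_dump(preg_match($re, '<a>something</a>'));  // int(1)
var_dump(preg_match($re, '<h2>something</a>')); // int(0)

// In a double-quoted PHP string, "\1" is an octal escape for the byte 0x01,
// so the backreference has to be written as "\\1" there.
var_dump("\1" === "\x01"); // bool(true)
var_dump("\\1" === '\1');  // bool(true)
```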
