PHP - Regex match curly brackets within other regex expression - php

I am trying to figure out how to match other parts of the stuff I need but can't seem to get it to work.
This is what I have so far:
preg_match_all("/^(.*?)(?:.\(([\d]+?)[\/I^\(]*?\))(?:.\((.*?)\))?/m",$data,$r, PREG_SET_ORDER);
Example text:
INPUT - Each line represents a line inside a text file.
-------------------------------------------------------------------------------------
"!?Text" (1234) 1234-4321
"#1 Text" (1234) 1234-????
#2 Text (1234) {Some text (#1.1)} 1234
Text (1234) 1234
Some Other Text: More Text here 1234-4321 (1234) (V) 1234
What I want to do:
I want to also match things in curly brackets and stuff in brackets of curly brackets.
I can't seem to get it to work considering that things in curly brackets + brackets may not always be within the line.
Essentially first (1234) will be a year and I only want to match it once, however in the last string example it also matches (V) but I don't want it to.
Desirable output:
Array
(
[0] => "!?Text" (1234)
[1] => "!?Text"
[2] => 1234
)
Array
(
[0] => "#1 Text" (1234)
[1] => "#1 Text"
[2] => 1234
)
Array
(
[0] => "#2 Text" (1234)
[1] => "#2 Text"
[2] => 1234
[3] => Some text (#1.1) // Matches things within curly brackets if there are any.
[4] => Some text // Extracts text before brackets
[5] => #1.1 // Extracts text within brackets (if any because brackets may not be within curly brackets.)
)
Array
(
[0] => Text (1234)
[1] => Text
[2] => 1234
)
Array // (My current regular expression gives me a 4th match with value 'V', which it shouldn't do)
(
[0] => Some Other Text: More Text here 1234-4321 (1234) (V)
[1] => Some Other Text: More Text here 1234-4321
[2] => 1234
)

What about using:
^((.*?) *\((\d+)\))(?: *\{((.*?) *\((.+?)\)) *\})?
NODE EXPLANATION
--------------------------------------------------------------------------------
^ the beginning of the string
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
( group and capture to \2:
--------------------------------------------------------------------------------
.*? any character except \n (0 or more
times (matching the least amount
possible))
--------------------------------------------------------------------------------
) end of \2
--------------------------------------------------------------------------------
* ' ' (0 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
\( '('
--------------------------------------------------------------------------------
( group and capture to \3:
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
) end of \3
--------------------------------------------------------------------------------
\) ')'
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
(?: group, but do not capture (optional
(matching the most amount possible)):
--------------------------------------------------------------------------------
* ' ' (0 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
\{ '{'
--------------------------------------------------------------------------------
( group and capture to \4:
--------------------------------------------------------------------------------
( group and capture to \5:
--------------------------------------------------------------------------------
.*? any character except \n (0 or more
times (matching the least amount
possible))
--------------------------------------------------------------------------------
) end of \5
--------------------------------------------------------------------------------
* ' ' (0 or more times (matching the
most amount possible))
--------------------------------------------------------------------------------
\( '('
--------------------------------------------------------------------------------
( group and capture to \6:
--------------------------------------------------------------------------------
.+? any character except \n (1 or more
times (matching the least amount
possible))
--------------------------------------------------------------------------------
) end of \6
--------------------------------------------------------------------------------
\) ')'
--------------------------------------------------------------------------------
) end of \4
--------------------------------------------------------------------------------
* ' ' (0 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
\} '}'
--------------------------------------------------------------------------------
)? end of grouping
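As a rough illustration, the pattern can be run over the question's sample lines with the m modifier and PREG_SET_ORDER. This is only a sketch: the variable names and the nowdoc input are made up for the example, and the ?? operator assumes PHP 7 or later. The groups for the {...} part may be absent for lines that have none, hence the fallback.

```php
<?php
// Sample lines from the question (nowdoc, so nothing is interpolated).
$data = <<<'TXT'
"!?Text" (1234) 1234-4321
"#1 Text" (1234) 1234-????
#2 Text (1234) {Some text (#1.1)} 1234
Text (1234) 1234
Some Other Text: More Text here 1234-4321 (1234) (V) 1234
TXT;

// Same pattern as above; "m" makes ^ match at the start of every line.
$pattern = '/^((.*?) *\((\d+)\))(?: *\{((.*?) *\((.+?)\)) *\})?/m';
preg_match_all($pattern, $data, $rows, PREG_SET_ORDER);

foreach ($rows as $m) {
    $title = $m[2];          // text before the year
    $year  = $m[3];          // digits inside the first (...)
    $extra = $m[4] ?? null;  // content of {...}, if the line has one
    printf("%s | %s | %s\n", $title, $year, $extra ?? '-');
}
```

Because the optional group only accepts a {...} block directly after the year, the (V) in the last line is never captured.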

Related

Getting part of string after space

I'm receiving a string from the Wikipedia API which looks like this:
{{Wikibooks|Wikijunior:Countries A-Z|France}} {{Sister project links|France}} * [http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]] * [http://ucblibraries.colorado.edu/govpubs/for/france.htm France] at ''UCB Libraries GovPubs'' *{{dmoz|Regional/Europe/France}} * [http://www.britannica.com/EBchecked/topic/215768/France France] ''Encyclopædia Britannica'' entry * [http://europa.eu/about-eu/countries/member-countries/france/index_en.htm France] at the [[European Union|EU]] *{{Wikiatlas|France}} *{{osmrelation-inline|1403916}} * [http://www.ifs.du.edu/ifs/frm_CountryProfile.aspx?Country=FR Key Development Forecasts for France] from [[International Futures]] ;Economy *{{INSEE|National Institute of Statistics and Economic Studies}} * [http://stats.oecd.org/Index.aspx?QueryId=14594 OECD France statistics]
I have to use both the actual URLs and the description of each URL. So for example, for
[http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]]
I need to have "http://www.bbc.co.uk/news/world-europe-17298730" and also "France] from the [[BBC News]] " but without the brackets, like so "France from the BBC News".
I managed to get the first parts, by doing the following:
if(preg_match_all('/\[http(.*?)\s/',$result,$extmatch)) {
$mt= str_replace("[[","",$extmatch[1]);
But I don't know how to go around getting the second part (I'm quite weak at regex unfortunately :-( ).
Any ideas?
A solution not using regex:
Explode the string at '*'
Ditch the parts starting with '{';
Remove all the brackets
Explode the String at 'space'
The first part is the link
Glue back together the rest for the description
The code:
$parts=explode('*',$str);
$links=array();
foreach($parts as $k=>$v){
$parts[$k]=ltrim($v);
if(substr($parts[$k],0,1)!=='['){
unset($parts[$k]);
continue;
}
$parts[$k]=preg_replace('/\[|\]/','',$parts[$k]);
$subparts=explode(' ',$parts[$k]);
$links[$k][0]=$subparts[0];
unset($subparts[0]);
$links[$k][1]=implode(' ',$subparts);
}
echo '<pre>'.print_r($links,true).'</pre>';
The result:
Array
(
[1] => Array
(
[0] => http://www.bbc.co.uk/news/world-europe-17298730
[1] => France from the BBC News
)
[2] => Array
(
[0] => http://ucblibraries.colorado.edu/govpubs/for/france.htm
[1] => France at ''UCB Libraries GovPubs''
)
[4] => Array
(
[0] => http://www.britannica.com/EBchecked/topic/215768/France
[1] => France ''Encyclopædia Britannica'' entry
)
[5] => Array
(
[0] => http://europa.eu/about-eu/countries/member-countries/france/index_en.htm
[1] => France at the European Union|EU
)
[8] => Array
(
[0] => http://www.ifs.du.edu/ifs/frm_CountryProfile.aspx?Country=FR
[1] => Key Development Forecasts for France from International Futures ;Economy
)
[10] => Array
(
[0] => http://stats.oecd.org/Index.aspx?QueryId=14594
[1] => OECD France statistics
)
)
PHP:
$input = "{{Wikibooks|Wikijunior:Countries A-Z|France}} {{Sister project links|France}} * [http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]] * [http://ucblibraries.colorado.edu/govpubs/for/france.htm France] at ''UCB Libraries GovPubs'' *{{dmoz|Regional/Europe/France}} * [http://www.britannica.com/EBchecked/topic/215768/France France] ''Encyclopædia Britannica'' entry * [http://europa.eu/about-eu/countries/member-countries/france/index_en.htm France] at the [[European Union|EU]] *{{Wikiatlas|France}} *{{osmrelation-inline|1403916}} * [http://www.ifs.du.edu/ifs/frm_CountryProfile.aspx?Country=FR Key Development Forecasts for France] from [[International Futures]] ;Economy *{{INSEE|National Institute of Statistics and Economic Studies}} * [http://stats.oecd.org/Index.aspx?QueryId=14594 OECD France statistics]";
$regex = '/\[(http\S+)\s+([^\]]+)\](?:\s+from(?:\s+the)?\s+\[\[(.*?)\]\])?/';
preg_match_all($regex, $input, $matches, PREG_SET_ORDER);
var_dump($matches);
Output:
array(6) {
[0]=>
array(4) {
[0]=>
string(78) "[http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]]"
[1]=>
string(47) "http://www.bbc.co.uk/news/world-europe-17298730"
[2]=>
string(6) "France"
[3]=>
string(8) "BBC News"
}
...
...
...
...
...
}
Explanation:
\[ (?# match [ literally)
( (?# start capture group)
http (?# match http literally)
\S+ (?# match 1+ non-whitespace characters)
) (?# end capture group)
\s+ (?# match 1+ whitespace characters)
( (?# start capture group)
[^\]]+ (?# match 1+ non-] characters)
) (?# end capture group)
\] (?# match ] literally)
(?: (?# start non-capturing group)
\s+ (?# match 1+ whitespace characters)
from (?# match from literally)
(?: (?# start non-capturing group)
\s+ (?# match 1+ whitespace characters)
the (?# match the literally)
)? (?# end optional non-capturing group)
\s+ (?# match 1+ whitespace characters)
\[\[ (?# match [[ literally)
( (?# start capturing group)
.*? (?# lazily match 0+ characters)
) (?# end capturing group)
\]\] (?# match ]] literally)
)? (?# end optional non-capturing group)
Let me know if you need a more thorough explanation, but my comments above should help. If you have any specific questions I'd be more than happy to help. Link below will help you visualize what the expression is doing.
Regex101
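If the whole description is wanted with the brackets stripped (as in "France from the BBC News"), one possible variation is to capture everything up to the next "*" and remove the square brackets afterwards. This is a sketch only: extractLinks is a hypothetical helper name, and it assumes that descriptions never contain a "*" themselves.

```php
<?php
// Hypothetical helper: returns [url, description] pairs from wiki-style text.
function extractLinks(string $text): array
{
    $links = [];
    // URL = "http" plus non-whitespace; description = everything up to the next "*".
    if (preg_match_all('/\[(http\S+)\s+([^*]*)/', $text, $found, PREG_SET_ORDER)) {
        foreach ($found as $m) {
            // Drop the square brackets and surrounding whitespace from the description.
            $links[] = [$m[1], trim(str_replace(['[', ']'], '', $m[2]))];
        }
    }
    return $links;
}

// Shortened version of the input from the question.
$sample = "* [http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]] "
        . "* [http://stats.oecd.org/Index.aspx?QueryId=14594 OECD France statistics]";
print_r(extractLinks($sample));
```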

Regular expression match, extracting only wanted segments of string

I am trying to extract three segments from a string. As I am not particularly good with regular expressions, I think what I have done could probably be done better.
I would like to extract the bold parts of the following string:
SOMETEXT: ANYTHING_HERE (Old=ANYTHING_HERE,
New=ANYTHING_HERE)
Some examples could be:
ABC: Some_Field (Old=,New=123)
ABC: Some_Field (Old=ABCde,New=1234)
ABC: Some_Field (Old=Hello World,New=Bye Bye World)
So the above would return the following matches:
$matches[0] = 'Some_Field';
$matches[1] = '';
$matches[2] = '123';
So far I have the following code:
preg_match_all('/^([a-z]*\:(\s?)+)(.+)(\s?)+\(old=(.+)\,(\s?)+new=(.+)\)/i',$string,$matches);
The issue with the above is that it returns a match for each separate segment of the string. I do not know how to ensure the string is the correct format using a regular expression without catching and storing the match if that makes sense?
So my question, if not already clear, is: how can I retrieve just the segments that I want from the above string?
You don't need preg_match_all. You can use this preg_match call:
$s = 'SOMETEXT: ANYTHING_HERE (Old=ANYTHING_HERE1, New=ANYTHING_HERE2)';
if (preg_match('/[^:]*:\s*(\w*)\s*\(Old=(\w*),\s*New=(\w*)/i', $s, $arr))
print_r($arr);
OUTPUT:
Array
(
[0] => SOMETEXT: ANYTHING_HERE (Old=ANYTHING_HERE1, New=ANYTHING_HERE2
[1] => ANYTHING_HERE
[2] => ANYTHING_HERE1
[3] => ANYTHING_HERE2
)
if(preg_match_all('/([a-z]*)\:\s*.+\(Old=(.+),\s*New=(.+)\)/i',$string,$matches)) {
print_r($matches);
}
Example:
$string = 'ABC: Some_Field (Old=Hello World,New=Bye Bye World)';
Will match:
Array
(
[0] => Array
(
[0] => ABC: Some_Field (Old=Hello World,New=Bye Bye World)
)
[1] => Array
(
[0] => ABC
)
[2] => Array
(
[0] => Hello World
)
[3] => Array
(
[0] => Bye Bye World
)
)
The problem is that you're using more parentheses than you need, and thus capturing more segments of the input than you wish.
E.g., each (\s?)+ segment should just be \s*
The regex that you're looking for is:
[^:]+:\s*(.+)\s*\(old=(.*)\s*,\s*new=(.*)\)
In PHP:
preg_match_all('/[^:]+:\s*(.+)\s*\(old=(.*)\s*,\s*new=(.*)\)/i',$string,$matches);
A useful tool can be found here: http://www.myregextester.com/index.php
This tool offers an "Explain" checkbox (as well as a "PHP" checkbox and "i" flag checkbox which you'll want to select) which provides a full explanation of the regex as well. For posterity, I've included the explanation below as well:
NODE EXPLANATION
----------------------------------------------------------------------
(?i-msx: group, but do not capture (case-insensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
[^:]+ any character except: ':' (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
.+ any character except \n (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
\( '('
----------------------------------------------------------------------
old= 'old='
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
, ','
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
new= 'new='
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
\) ')'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
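A minimal sketch of the same idea with preg_match, returning only the three wanted segments. It is a variation rather than the exact pattern above: the quantifiers are made lazy and the pattern is anchored so that the field name does not pick up the trailing space. parseChange is a hypothetical helper name, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns [field, old, new] or null if the line is malformed.
function parseChange(string $line): ?array
{
    // Lazy groups stop at the first "(old=", ",new=" and final ")" respectively.
    $pattern = '/^[^:]+:\s*(.+?)\s*\(old=(.*?)\s*,\s*new=(.*?)\)$/i';
    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    // $m[0] is the whole match; keep only the three captured segments.
    return array_slice($m, 1);
}

print_r(parseChange('ABC: Some_Field (Old=,New=123)'));
print_r(parseChange('ABC: Some_Field (Old=Hello World,New=Bye Bye World)'));
```

Checking the return value for null doubles as the format validation the question asks about.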
What about something simpler like ^_^
[:=]\s*([\w\s]*)
:\s*([^(\s]+)\s*\(Old=([^,]*),New=([^)]*)
Let me know if you want an explanation.

get substrings from string with parentheses, brackets and hyphen

Regex is not my strongest suit and I'm having a bit of trouble with this situation.
I have the following string:
locale (district - town) [parish]
I need to extract the following information:
1 - locale
2 - district
3 - town
And I have these solutions:
1 - locale
preg_match("/([^(]*)\s/", $input_line, $output_array);
2 - district
preg_match("/.*\(([^-]*)\s/", $input_line, $output_array);
3 - town
preg_match("/.*\-\s([^)]*)/", $input_line, $output_array);
And these seem to work fine.
However, the string may be presented like any of these:
localeA(localeB) (district - town) [parish]
locale (district - townA(townB)) [parish]
locale (district - townA-townB) [parish]
Locale can also include parentheses of its own.
Town can include parentheses and/or a hyphen of its own.
Which makes it difficult to extract the right information. In the 3 scenarios above I would have to extract:
localeA(localeB) + district + town
locale + district + townA(townB)
locale + district + townA-townB
I find it hard to deal with all these scenarios. Can you help me out?
Thanks in advance
If locale, district and town contain no spaces:
preg_match("/^\s*(\S+)\s*\((\S+)\s*-\s*(\S+)\)/", $input_line, $output_array);
explanation:
The regular expression:
(?-imsx:^\s*(\S+)\s*\((\S+)\s*-\s*(\S+)\))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
^ the beginning of the string
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
\S+ non-whitespace (all but \n, \r, \t, \f,
and " ") (1 or more times (matching the
most amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
\( '('
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
\S+ non-whitespace (all but \n, \r, \t, \f,
and " ") (1 or more times (matching the
most amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
- '-'
----------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
\S+ non-whitespace (all but \n, \r, \t, \f,
and " ") (1 or more times (matching the
most amount possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
\) ')'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
Not sure what exactly your rules and edge cases are, but this works for the examples provided
preg_match('#^(.+?) \((.+?) - (.+?)\) \[(.+)\]$#',$str,$matches);
Gives these results (when run for each example string in $str):
Array
(
[0] => locale (district - town) [parish]
[1] => locale
[2] => district
[3] => town
[4] => parish
)
Array
(
[0] => localeA(localeB) (district - town) [parish]
[1] => localeA(localeB)
[2] => district
[3] => town
[4] => parish
)
Array
(
[0] => locale (district - townA(townB)) [parish]
[1] => locale
[2] => district
[3] => townA(townB)
[4] => parish
)
Array
(
[0] => locale (district - townA-townB) [parish]
[1] => locale
[2] => district
[3] => townA-townB
[4] => parish
)
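The pattern could be wrapped in a small function that names the parts and reports lines that do not fit. This is a sketch under the assumption that the separators are always exactly " (", " - " and ") [" as in the examples; a district that itself contains " - " would be split in the wrong place. parseLocation is a hypothetical name, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: splits "locale (district - town) [parish]" into named parts.
function parseLocation(string $line): ?array
{
    if (!preg_match('#^(.+?) \((.+?) - (.+?)\) \[(.+)\]$#', $line, $m)) {
        return null; // line does not follow the expected layout
    }
    return [
        'locale'   => $m[1],
        'district' => $m[2],
        'town'     => $m[3],
        'parish'   => $m[4],
    ];
}

print_r(parseLocation('localeA(localeB) (district - town) [parish]'));
print_r(parseLocation('locale (district - townA(townB)) [parish]'));
```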

get substring between 2 characters in php

I'm using a mentioning system like the ones on Twitter and Instagram, where you simply write #johndoe.
What I'm trying to do is extract the name between "#" and any of these characters: ? , ] : or a space.
As an example, here's my string:
hey #johnDoe check out this event, be sure to bring #janeDoe:,#johnnyappleSeed?, #johnCitizen] , and #fredNerk
How can I get an array of janeDoe, johnnyappleSeed, johnCitizen, fredNerk without the characters ? , ] : attached to them?
I know I have to use a variation of preg_match, but I don't have a strong understanding of it.
This is what you've asked for: /\#(.*?)\s/
This is what you really want: /#(\w+)/
Put either one into preg_match_all() and evaluate the results array.
preg_match_all("/\#(.*?)\s/", $string, $result_array);
$check_hash = preg_match_all("/#([a-zA-Z0-9]+)/", $string_to_match_against, $matches);
You could then do something like
foreach ($matches[1] as $name){
echo $name."<br />";
}
UPDATE: Just realized you were looking to remove the invalid characters. Updated script should do it.
How about:
$str = 'hey #johnDoe check out this event, be sure to bring #janeDoe:,#johnnyappleSeed?, #johnCitizen] , and #fredNerk';
preg_match_all('/#(.*?)(?:[?, \]: ]|$)/', $str, $m);
print_r($m);
output:
Array
(
[0] => Array
(
[0] => #johnDoe
[1] => #janeDoe:
[2] => #johnnyappleSeed?
[3] => #johnCitizen]
[4] => #fredNerk
)
[1] => Array
(
[0] => johnDoe
[1] => janeDoe
[2] => johnnyappleSeed
[3] => johnCitizen
[4] => fredNerk
)
)
explanation:
The regular expression:
(?-imsx:#(.*?)(?:[?, \]: ]|$))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
# '#'
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
[?, \]: ] any character of: '?', ',', ' ', '\]',
':', ' '
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
$ before an optional \n, and the end of
the string
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
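An alternative sketch is to whitelist the characters a name may contain instead of listing every possible terminator. It assumes that names consist only of letters, digits and underscores (which is what \w matches); the variable names are illustrative.

```php
<?php
$str = 'hey #johnDoe check out this event, be sure to bring #janeDoe:,#johnnyappleSeed?, #johnCitizen] , and #fredNerk';

// \w+ stops at the first character that cannot be part of a name,
// so no trailing ? , ] : or space ends up in the capture.
preg_match_all('/#(\w+)/', $str, $m);
$names = $m[1];

print_r($names);
// Array ( [0] => johnDoe [1] => janeDoe [2] => johnnyappleSeed [3] => johnCitizen [4] => fredNerk )
```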

regular expression end tag = start tag

Take a look at this regular expression:
(?:\(?")(.+)(?:"\)?)
This regex would match e.g
"a"
("a")
but also
"a)
How can I require that the closing delimiter corresponds to the opening one (in this case " with ", or (" with ")? There must be a simpler solution than this, right?
"(.+)"|(?:\(")(.+)(?:"\))
I don't think there's a good way to do this specifically with regex, so you are stuck doing something like this:
/(?:
"(.+)"
|
\( (.+) \)
)/x
how about:
(\(?)(")(.+)\2\1
explanation:
(?-imsx:(\(?)(")(.+)\2\1)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
\(? '(' (optional (matching the most amount
possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
( group and capture to \3:
----------------------------------------------------------------------
.+ any character except \n (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
) end of \3
----------------------------------------------------------------------
\2 what was matched by capture \2
----------------------------------------------------------------------
\1 what was matched by capture \1
----------------------------------------------------------------------
) end of grouping
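One caveat: \1 repeats the text that group 1 captured, so after an opening "(" it expects another "(" rather than a ")". PCRE, which PHP uses, also offers conditional subpatterns, which can require the ")" only when the "(" was present. The following is a sketch of that alternative, not the pattern explained above; isBalancedQuote is a hypothetical helper name, and the pattern is anchored so that partial matches do not count.

```php
<?php
// Hypothetical helper: true only when the delimiters are "..." or ("...").
function isBalancedQuote(string $s): bool
{
    // (\()?     optional "(" captured as group 1
    // "(.+)"    the quoted content
    // (?(1)\))  if group 1 took part in the match, a ")" must follow
    return preg_match('/^(\()?"(.+)"(?(1)\))$/', $s) === 1;
}

var_dump(isBalancedQuote('"a"'));    // bool(true)
var_dump(isBalancedQuote('("a")'));  // bool(true)
var_dump(isBalancedQuote('"a)'));    // bool(false)
var_dump(isBalancedQuote('("a"'));   // bool(false)
```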
You can use backreferences. They are not specific to PHP; PCRE and most other regex flavours support them:
preg_match('/<([^>]+)>(.+)<\/\1>/', $str, $m); (the \1 refers to whatever the first capture group matched; $str stands for your input)
The first captured group then serves as the condition for the closing tag. This matches <a>something</a> but not <h2>something</a>. Use a single-quoted pattern: inside a double-quoted PHP string, "\1" is read as an octal character escape.
However, in your case you would need to turn the "(" matched by the first group into a ")", which a backreference cannot do.
Update: a workaround is to replace ( and ) with <BRACE> and <END_BRACE> first. Then you can match using /<([^>]+)>(.+)<END_\1>/. Do the same for every delimiter pair you need: () [] {} <> and so on.
For example, "(a) is as nice as [f]" becomes "<BRACE>a<END_BRACE> is as nice as <BRACKET>f<END_BRACKET>", and the regex captures both if you use preg_match_all:
$returnValue = preg_match_all('/<([^>]+)>(.+)<END_\\1>/', '<BRACE>a<END_BRACE> is as nice as <BRACKET>f<END_BRACKET>', $matches);
leads to
array (
0 =>
array (
0 => '<BRACE>a<END_BRACE>',
1 => '<BRACKET>f<END_BRACKET>',
),
1 =>
array (
0 => 'BRACE',
1 => 'BRACKET',
),
2 =>
array (
0 => 'a',
1 => 'f',
),
)
