I'm receiving a string from the Wikipedia API which looks like this:
{{Wikibooks|Wikijunior:Countries A-Z|France}} {{Sister project links|France}} * [http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]] * [http://ucblibraries.colorado.edu/govpubs/for/france.htm France] at ''UCB Libraries GovPubs'' *{{dmoz|Regional/Europe/France}} * [http://www.britannica.com/EBchecked/topic/215768/France France] ''Encyclopædia Britannica'' entry * [http://europa.eu/about-eu/countries/member-countries/france/index_en.htm France] at the [[European Union|EU]] *{{Wikiatlas|France}} *{{osmrelation-inline|1403916}} * [http://www.ifs.du.edu/ifs/frm_CountryProfile.aspx?Country=FR Key Development Forecasts for France] from [[International Futures]] ;Economy *{{INSEE|National Institute of Statistics and Economic Studies}} * [http://stats.oecd.org/Index.aspx?QueryId=14594 OECD France statistics]
I have to use both the actual URLs and the description of each URL. So for example, for
[http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]]
I need to have "http://www.bbc.co.uk/news/world-europe-17298730" and also "France] from the [[BBC News]] " but without the brackets, like so "France from the BBC News".
I managed to get the first parts, by doing the following:
if (preg_match_all('/\[http(.*?)\s/', $result, $extmatch)) {
    $mt = str_replace("[[", "", $extmatch[1]);
}
But I don't know how to go around getting the second part (I'm quite weak at regex unfortunately :-( ).
Any ideas?
A solution not using regex:
Explode the string at '*'
Ditch the parts starting with '{';
Remove all the brackets
Explode the String at 'space'
The first part is the link
Glue back together the rest for the description
The code:
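// $str holds the wikitext shown in the question
$parts = explode('*', $str);
$links = array();
foreach ($parts as $k => $v) {
    $parts[$k] = ltrim($v);
    // Skip every part that does not start with an external link
    if (substr($parts[$k], 0, 1) !== '[') {
        unset($parts[$k]);
        continue;
    }
    // Remove all square brackets
    $parts[$k] = preg_replace('/\[|\]/', '', $parts[$k]);
    // The first token is the URL, the rest is the description
    $subparts = explode(' ', $parts[$k]);
    $links[$k][0] = $subparts[0];
    unset($subparts[0]);
    $links[$k][1] = implode(' ', $subparts);
}
echo '<pre>' . print_r($links, true) . '</pre>';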
The result:
Array
(
[1] => Array
(
[0] => http://www.bbc.co.uk/news/world-europe-17298730
[1] => France from the BBC News
)
[2] => Array
(
[0] => http://ucblibraries.colorado.edu/govpubs/for/france.htm
[1] => France at ''UCB Libraries GovPubs''
)
[4] => Array
(
[0] => http://www.britannica.com/EBchecked/topic/215768/France
[1] => France ''Encyclopædia Britannica'' entry
)
[5] => Array
(
[0] => http://europa.eu/about-eu/countries/member-countries/france/index_en.htm
[1] => France at the European Union|EU
)
[8] => Array
(
[0] => http://www.ifs.du.edu/ifs/frm_CountryProfile.aspx?Country=FR
[1] => Key Development Forecasts for France from International Futures ;Economy
)
[10] => Array
(
[0] => http://stats.oecd.org/Index.aspx?QueryId=14594
[1] => OECD France statistics
)
)
PHP:
$input = "{{Wikibooks|Wikijunior:Countries A-Z|France}} {{Sister project links|France}} * [http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]] * [http://ucblibraries.colorado.edu/govpubs/for/france.htm France] at ''UCB Libraries GovPubs'' *{{dmoz|Regional/Europe/France}} * [http://www.britannica.com/EBchecked/topic/215768/France France] ''Encyclopædia Britannica'' entry * [http://europa.eu/about-eu/countries/member-countries/france/index_en.htm France] at the [[European Union|EU]] *{{Wikiatlas|France}} *{{osmrelation-inline|1403916}} * [http://www.ifs.du.edu/ifs/frm_CountryProfile.aspx?Country=FR Key Development Forecasts for France] from [[International Futures]] ;Economy *{{INSEE|National Institute of Statistics and Economic Studies}} * [http://stats.oecd.org/Index.aspx?QueryId=14594 OECD France statistics]";
$regex = '/\[(http\S+)\s+([^\]]+)\](?:\s+from(?:\s+the)?\s+\[\[(.*?)\]\])?/';
preg_match_all($regex, $input, $matches, PREG_SET_ORDER);
var_dump($matches);
Output:
array(6) {
[0]=>
array(4) {
[0]=>
string(78) "[http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]]"
[1]=>
string(47) "http://www.bbc.co.uk/news/world-europe-17298730"
[2]=>
string(6) "France"
[3]=>
string(8) "BBC News"
}
...
...
...
...
...
}
Explanation:
\[ (?# match [ literally)
( (?# start capture group)
http (?# match http literally)
\S+ (?# match 1+ non-whitespace characters)
) (?# end capture group)
\s+ (?# match 1+ whitespace characters)
( (?# start capture group)
[^\]]+ (?# match 1+ non-] characters)
) (?# end capture group)
\] (?# match ] literally)
(?: (?# start non-capturing group)
\s+ (?# match 1+ whitespace characters)
from (?# match from literally)
(?: (?# start non-capturing group)
\s+ (?# match 1+ whitespace characters)
the (?# match the literally)
)? (?# end optional non-capturing group)
\s+ (?# match 1+ whitespace characters)
\[\[ (?# match [[ literally)
( (?# start capturing group)
.*? (?# lazily match 0+ characters)
) (?# end capturing group)
\]\] (?# match ]] literally)
)? (?# end optional non-capturing group)
Let me know if you need a more thorough explanation, but my comments above should help. If you have any specific questions I'd be more than happy to help. Link below will help you visualize what the expression is doing.
Regex101
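The pattern above only captures a trailing [[...]] link when it is introduced by "from". As a rough sketch of a more general variant (the per-item splitting, the `$links` structure, and the shortened `$input` are my own assumptions, not part of the answer above), you could match the external link once per list item and then clean the remaining wiki markup from the description. It assumes no URL contains a literal `*`:

```php
<?php
// Shortened sample input (three of the list items from the question).
$input = "* [http://www.bbc.co.uk/news/world-europe-17298730 France] from the [[BBC News]] "
       . "* [http://europa.eu/about-eu/countries/member-countries/france/index_en.htm France] at the [[European Union|EU]] "
       . "* [http://stats.oecd.org/Index.aspx?QueryId=14594 OECD France statistics]";

$links = [];
foreach (explode('*', $input) as $item) {
    // Keep only list items that contain an external link: [url label] trailing text
    if (!preg_match('/\[(http\S+)\s+([^\]]*)\](.*)/s', $item, $m)) {
        continue;
    }
    $text = $m[2] . $m[3];
    // [[Target|Label]] becomes Label, [[Target]] becomes Target
    $text = preg_replace('/\[\[(?:[^\]|]*\|)?([^\]]*)\]\]/', '$1', $text);
    // Drop the '' emphasis markers
    $text = str_replace("''", '', $text);
    $links[] = ['url' => $m[1], 'text' => trim($text)];
}
print_r($links);
```

The first entry comes out as the URL plus "France from the BBC News", which is the form asked for in the question. For a piped link such as [[European Union|EU]] the sketch keeps the label (EU), since that is the text MediaWiki displays.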
I'm looking to split a string by spaces, unless there is the string " NOT ", in which case I would only want to split by the space before the "NOT", and not after the "NOT".
Example:
"cancer disease NOT brain NOT sickle"
should become:
["cancer", "disease", "NOT brain", "NOT sickle"]
Here is what I have so far, but it is incorrect:
$splitKeywordArr = preg_split('/[^(NOT)]( )/', "cancer disease NOT brain NOT sickle");
It results in:
["cance", "diseas", "NOT brai", "NOT sickle"]
I know why it is incorrect, but I don't know how to fix it.
You may use
<?php
$text = "cancer disease NOT brain NOT sickle";
$pattern = "~NOT\s+(*SKIP)(*FAIL)|\s+~";
print_r(preg_split($pattern, $text));
?>
Which yields
Array
(
[0] => cancer
[1] => disease
[2] => NOT brain
[3] => NOT sickle
)
See a demo on ideone.com.
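One thing to watch with this pattern: NOT is matched anywhere, so a word that merely ends in "NOT" also protects the whitespace after it. A small sketch of the difference, assuming you only want the standalone word to count (the `\b` and the sample text are my additions):

```php
<?php
$text = "KNOT theory NOT knots";

// Pattern from the answer: "NOT" inside "KNOT" also triggers (*SKIP)(*FAIL)
$loose = preg_split('~NOT\s+(*SKIP)(*FAIL)|\s+~', $text);

// With a word boundary, only the standalone word NOT is skipped
$strict = preg_split('~\bNOT\s+(*SKIP)(*FAIL)|\s+~', $text);

print_r($loose);
print_r($strict);
```

The first call keeps "KNOT theory" together, while the second splits it into "KNOT" and "theory". Both keep "NOT knots" as one element.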
You might also match optional repetitions of the word NOT followed by 1+ word characters in case the word occurs multiple times after each other.
(?:\bNOT\h+)*\w+
The pattern matches:
(?: Non capture group
\bNOT\h+ A word boundary, match NOT and 1 or more horizontal whitespace chars
)* Close non capture group and optionally repeat
\w+ Match 1+ word characters
Regex demo | Php demo
$str = "cancer disease NOT brain NOT sickle";
preg_match_all('/(?:\bNOT\h+)*\w+/', $str, $matches);
print_r($matches[0]);
Output
Array
(
[0] => cancer
[1] => disease
[2] => NOT brain
[3] => NOT sickle
)
I'm not good at Regex and I've been trying for hours now so I hope you can help me. I have this text:
&#x271D;his is *&#x271D;he* *in&#x271D;erne&#x271D;*
I need to capture (using PREG_OFFSET_CAPTURE) only the &#x271D; in a word surrounded with *, so I only need to capture the last three &#x271D; in this example. The output array should look something like this:
[0] => Array
(
[0] => &#x271D;
[1] => 17
)
[1] => Array
(
[0] => &#x271D;
[1] => 32
)
[2] => Array
(
[0] => &#x271D;
[1] => 44
)
I've tried using (&#x271D;) but of course this will select all instances, including the words without asterisks. Then I've tried \*[^ ]*(&#x271D;)[^ ]*\* but this only gives me the last instance in one word. I've tried many other variations but all were wrong.
To clarify: The asterisk can be at all places in the string, but always at the beginning and end of a word. The opening asterisk is always preceded by a space except at the beginning of the string, and the closing asterisk is always followed by a space except at the end of the string. I must add that punctuation marks can be inside these asterisks. &#x271D; is exactly (and only) what I need to capture and can be at any position in a word.
You could make use of the \G anchor to get iterative matches between the *. The anchor matches either at the start of the string, or at the end of the previous match.
(?:\*|\G(?!^))[^&*]*(?>&(?!#)[^&*]*)*\K&#x271D;(?=[^*]*\*)
Explanation
(?: Non capture group
\* Match *
| Or
\G(?!^) Assert the end of the previous match, not at the start
) Close non capture group
[^&*]* Match 0+ times any char except & and *
(?> Atomic group
&(?!#) Match & only when not directly followed by #
[^&*]* Match 0+ times any char except & and *
)* Close atomic group and repeat 0+ times
\K Clear the match buffer (forget what is matched until now)
&#x271D; Match literally
(?=[^*]*\*) Positive lookahead, assert a * at the right
Regex demo | Php demo
For example
$re = '/(?:\*|\G(?!^))[^&*]*(?>&(?!#)[^&*]*)*\K&#x271D;(?=[^*]*\*)/m';
$str = '&#x271D;his is *&#x271D;he* *in&#x271D;erne&#x271D;*';
preg_match_all($re, $str, $matches, PREG_OFFSET_CAPTURE);
print_r($matches[0]);
Output
Array
(
[0] => Array
(
[0] => &#x271D;
[1] => 16
)
[1] => Array
(
[0] => &#x271D;
[1] => 31
)
[2] => Array
(
[0] => &#x271D;
[1] => 43
)
)
Note that each offset is 1 less than the expected value because string offsets start counting at 0. See PREG_OFFSET_CAPTURE
If you want to match more variations, you could use a non capturing group and list the ones that you would accept to match. If you don't want to cross newline boundaries you can exclude matching those in the negated character class.
(?:\*|\G(?!^))[^&*\r\n]*(?>&(?!#)[^&*\r\n]*)*\K&#(?:x271D|169);(?=[^*\r\n]*\*)
Regex demo
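If the single-pattern approach feels hard to maintain, a two-step alternative is to find the *word* spans first and then search inside each one. This is only a sketch under my own assumptions: the input holds the HTML entity &#x271D; as text, and a starred word contains no whitespace. The names `$needle`, `$spans` and `$offsets` are made up for the example:

```php
<?php
$needle = '&#x271D;';
$str = '&#x271D;his is *&#x271D;he* *in&#x271D;erne&#x271D;*';

$offsets = [];
// Step 1: every *word* span together with its byte offset in $str
preg_match_all('/\*[^*\s]+\*/', $str, $spans, PREG_OFFSET_CAPTURE);
foreach ($spans[0] as [$span, $start]) {
    // Step 2: every occurrence of the needle inside that span
    $pos = 0;
    while (($pos = strpos($span, $needle, $pos)) !== false) {
        $offsets[] = $start + $pos;
        $pos += strlen($needle);
    }
}
print_r($offsets);
```

For the sample string this collects the 0-based offsets 16, 31 and 43, and it skips the occurrence at the start of the string because that one is not inside asterisks.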
https://www.tehplayground.com/KWmxySzbC9VoDvP9
Why is the first string matched?
$list = [
'3928.3939392', // Should not be matched
'4.239,99',
'39',
'3929',
'2993.39',
'393993.999'
];
foreach($list as $str){
preg_match('/^(?<![\d.,])-?\d{1,3}(?:[,. ]?\d{3})*(?:[^.,%]|[.,]\d{1,2})-?(?![\d.,%]|(?: %))$/', $str, $matches);
print_r($matches);
}
output
Array
(
[0] => 3928.3939392
)
Array
(
[0] => 4.239,99
)
Array
(
[0] => 39
)
Array
(
[0] => 3929
)
Array
(
[0] => 2993.39
)
Array
(
)
You seem to want to match the numbers as standalone strings, and thus, you do not need the lookarounds, you only need to use anchors.
You may use
^-?(?:\d{1,3}(?:[,. ]\d{3})*|\d*)(?:[.,]\d{1,2})?$
See the regex demo
Details
^ - start of string
-? - an optional -
(?: - start of a non-capturing alternation group:
\d{1,3}(?:[,. ]\d{3})* - 1 to 3 digits, followed with 0+ sequences of ,, . or space and then 3 digits
| - or
\d* - 0+ digits
) - end of the group
(?:[.,]\d{1,2})? - an optional sequence of . or , followed with 1 or 2 digits
$ - end of string.
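A quick way to check the pattern against the list from the question (the `$accepted` variable is just for this sketch):

```php
<?php
$pattern = '/^-?(?:\d{1,3}(?:[,. ]\d{3})*|\d*)(?:[.,]\d{1,2})?$/';

$list = [
    '3928.3939392', // should not be matched
    '4.239,99',
    '39',
    '3929',
    '2993.39',
    '393993.999',
];

// Keep only the strings the pattern accepts
$accepted = array_values(array_filter($list, function ($str) use ($pattern) {
    return preg_match($pattern, $str) === 1;
}));
print_r($accepted);

// Every part of the pattern is optional, so the empty string matches as well
var_dump(preg_match($pattern, '') === 1);
```

This keeps 4.239,99, 39, 3929 and 2993.39 and rejects the first and last entries. Since the empty string is accepted too, you may want to test for it separately if your input can be empty.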
I have been sitting for hours to figure out a regExp for a preg_match_all function in php.
My problem is that I want two different things from the string.
Say you have the string "Code is fun [and good for the brain.] But the [brain is] tired."
What I need from this is an array of all the words outside of the brackets, with the text inside each pair of brackets kept together as one string.
Something like this
[0] => Code
[1] => is
[2] => fun
[3] => and good for the brain.
[4] => But
[5] => the
[6] => brain is
[7] => tired.
Help much appreciated.
You could try the below regex also,
(?<=\[)[^\]]*|[.\w]+
DEMO
Code:
<?php
$data = "Code is fun [and good for the brain.] But the [brain is] tired.";
$regex = '~(?<=\[)[^\]]*|[.\w]+~';
preg_match_all($regex, $data, $matches);
print_r($matches);
?>
Output:
Array
(
[0] => Array
(
[0] => Code
[1] => is
[2] => fun
[3] => and good for the brain.
[4] => But
[5] => the
[6] => brain is
[7] => tired.
)
)
The first alternative, (?<=\[)[^\]]*, uses a lookbehind to match all the characters inside the square brackets [], and the second, [.\w]+, matches one or more word characters or dots in the remaining string.
You can use the following regex:
(?:\[([\w .!?]+)\]+|(\w+))
The regex contains two alternations: one to match everything inside the two square brackets, and one to capture every other word.
This assumes that the part inside the square brackets doesn't contain any characters other than letters, digits, _, spaces, !, ., and ?. In case you need to allow more punctuation, it should be easy enough to add it to the character class.
If you don't want to be that specific about what should be captured, then you can use a negated character class instead — specify what not to match instead of specifying what to match. The expression then becomes: (?:\[([^\[\]]+)\]|(\w+))
Explanation:
(?: # Begin non-capturing group
\[ # Match a literal '['
( # Start capturing group 1
[\w .!?]+ # Match everything in between '[' and ']'
) # End capturing group 1
\] # Match literal ']'
| # OR
( # Begin capturing group 2
\w+ # Match rest of the words
) # End capturing group 2
) # End non-capturing group
Demo
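Because this pattern puts the bracketed text in group 1 and the plain words in group 2, the two groups still have to be merged to get the flat array from the question. A minimal sketch using the negated-character-class version (the merging loop and `$result` are my addition):

```php
<?php
$data = "Code is fun [and good for the brain.] But the [brain is] tired.";

preg_match_all('/\[([^\[\]]+)\]|(\w+)/', $data, $matches, PREG_SET_ORDER);

$result = [];
foreach ($matches as $m) {
    // Group 2 exists only when the plain-word alternative matched;
    // otherwise fall back to the bracketed text in group 1.
    $result[] = $m[2] ?? $m[1];
}
print_r($result);
```

Note that \w+ does not include the trailing dot, so the last element is "tired" rather than "tired.". Swapping (\w+) for something like ([.\w]+) would keep it.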
I have two conditions in my regex (regex used on php)
(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))|(Dossier N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+))
When I test the 1st condition with the following I get 4 match groups 1 2 3 and 4
BIOLOGIQUES 47 131002 / 4302
Please see the 1st condition here http://www.rubular.com/r/a6zQS8Wth6
But when I test with the second condition, the matched groups are 5, 6, 7 and 8
Dossier N° : 47 131002 / 4302
The second condition here : http://www.rubular.com/r/eYzBJq1rIW
Is there a way to always have 1 2 3 and 4 match groups in the second condition too?
Since the parts of both regexps that match the numbers are the same, you can do the alternation just for the beginning, instead of around the entire regexp:
preg_match('/((?:BIOLOGIQUES|Dossier N.\s+:)\s+(\d+)\s+(\d+)\s+\/\s+(\d+))/u', $content, $match);
Use the u modifier to match UTF-8 characters correctly.
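A short check of that pattern against both sample lines; `$samples` and `$numbers` are names I made up for the sketch, and the degree sign is written as an escape so the snippet does not depend on the file encoding:

```php
<?php
$pattern = '/((?:BIOLOGIQUES|Dossier N.\s+:)\s+(\d+)\s+(\d+)\s+\/\s+(\d+))/u';

$samples = [
    'BIOLOGIQUES 47 131002 / 4302',
    "Dossier N\u{00B0} : 47 131002 / 4302", // \u{00B0} is the degree sign
];

$numbers = [];
foreach ($samples as $content) {
    if (preg_match($pattern, $content, $match)) {
        // Group 1 is the whole record, groups 2 to 4 are the three numbers
        $numbers[] = array_slice($match, 2);
    }
}
print_r($numbers);
```

Both lines yield the same three captures (47, 131002, 4302) in groups 2, 3 and 4, with the full record in group 1.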
I assume your regex is compressed. If the dot is meant to abbreviate the middle initial, it should be escaped. The suggestion below factors the pattern out like Barmar's does. If you don't want to capture the different names, remove the parentheses around them.
Sorry, it looks like you intend it to be a dot metachar. Just remove the \ from it.
# (?:(BIOLOGIQUES)|(Dossier\ N\.\s+:))\s+((\d+)\s+(\d+)\s+\/\s+(\d+))
(?:
( BIOLOGIQUES ) # (1)
| ( Dossier\ N \. \s+ : ) # (2)
)
\s+
( # (3 start)
( \d+ ) # (4)
\s+
( \d+ ) # (5)
\s+ \/ \s+
( \d+ ) # (6)
) # (3 end)
Edit: the regex should be factored, but if the alternatives become too different, a way to reuse the same capture group numbers is a branch reset group, (?|...).
Here is your original code with some annotations using branch reset.
(?|(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))|(Dossier\ N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+)))
(?|
br 1 ( # (1 start)
BIOLOGIQUES \s+
2 ( \d+ ) # (2)
\s+
3 ( \d+ ) # (3)
\s+ \/ \s+
4 ( \d+ ) # (4)
1 ) # (1 end)
|
br 1 ( # (1 start)
Dossier\ N . \s+ : \s+
2 ( \d+ ) # (2)
\s+
3 ( \d+ ) # (3)
\s+ \/ \s+
4 ( \d+ ) # (4)
1 ) # (1 end)
)
Or, you could factor it AND use branch reset.
# (?|(BIOLOGIQUES\s+)|(Dossier\ N.\s+:\s+))(?:(\d+)\s+(\d+)\s+\/\s+(\d+))
(?|
br 1 ( BIOLOGIQUES \s+ ) # (1)
|
br 1 ( Dossier\ N . \s+ : \s+ ) # (1)
)
(?:
2 ( \d+ ) # (2)
\s+
3 ( \d+ ) # (3)
\s+ \/ \s+
4 ( \d+ ) # (4)
)
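To see what the branch reset changes in practice, here is a small comparison of the original alternation and the (?|...) version on the second sample line. It is only a sketch: the variable names are mine, and I added the u modifier so the dot can match the degree sign as one character:

```php
<?php
$content = "Dossier N\u{00B0} : 47 131002 / 4302"; // \u{00B0} is the degree sign

// Original alternation: the second branch fills groups 5 to 8
$plain = '/(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))|(Dossier N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+))/u';
preg_match($plain, $content, $a);

// Branch reset: both branches number their groups 1 to 4
$reset = '/(?|(BIOLOGIQUES\s+(\d+)\s+(\d+)\s+\/\s+(\d+))|(Dossier N.\s+:\s+(\d+)\s+(\d+)\s+\/\s+(\d+)))/u';
preg_match($reset, $content, $b);

print_r($a);
print_r($b);
```

With the plain alternation, groups 1 to 4 come back as empty strings and the numbers sit in groups 6 to 8. With the branch reset, the same numbers are in groups 2 to 4 regardless of which branch matched.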