Extracting some content from given string

This is the content piece:
This is content that is a sample.
[md] Special Content Piece [/md]
This is some more content.
What I want is a preg_match_all expression such that it can fetch and give me the following from the above content:
[md] Special Content Piece [/md]
I have tried this:
$pattern ="/\[^[a-zA-Z][0-9\-\_\](.*?)\[\/^[a-zA-Z][0-9\-\_]\]/";
preg_match_all($pattern, $content, $matches);
But it gives a blank array. Could someone help?

$pattern = "/\[md\](.*?)\[\/md\]/";
or, more generally:
$pattern = "/\[[a-zA-Z0-9\-\_]+\](.*?)\[\/[a-zA-Z0-9\-\_]+\]/";
or even better
$pattern = "/\[\w+\](.*?)\[\/\w+\]/";
and to match the start tag with the end tag:
$pattern = '/\[(\w+)\](.*?)\[\/\1\]/'; // single quotes: in a double-quoted string "\1" is an octal escape, not a back reference
(Just note that the "tag" name is then returned in the match array.)
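To see the last pattern in action, here is a small sketch run against the sample text from the question. The variable names are only illustrative, and the pattern is written in single quotes on the assumption that the back reference should reach the regex engine untouched.

```php
<?php
// Sample input taken from the question.
$content = "This is content that is a sample.\n"
         . "[md] Special Content Piece [/md]\n"
         . "This is some more content.";

// Single quotes keep \1 as a regex back reference; inside double quotes
// PHP would read "\1" as an octal character escape.
$pattern = '/\[(\w+)\](.*?)\[\/\1\]/';

preg_match_all($pattern, $content, $matches);

print_r($matches[0]); // whole matches: "[md] Special Content Piece [/md]"
print_r($matches[1]); // tag names: "md"
print_r($matches[2]); // text between the tags: " Special Content Piece "
```

The tag name ends up in group 1 and the inner text in group 2, which is the side effect the note above mentions.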

You can use this:
$pattern = '~\[([^]]++)]\K[^[]++(?=\[/\1])~';
explanation:
~ #delimiter of the pattern
\[ #literal opening square bracket (must be escaped)
( #open the capture group 1
[^]]++ #all characters that are not ] one or more times
) #close the capture group 1
] #literal closing square bracket (no need to escape)
\K #reset all the match before
[^[]++ #all characters that are not [ one or more times
(?= #open a lookahead assertion (this doesn't consume characters)
\[/ #literal opening square bracket and slash
\1 #back reference to the group 1
] #literal closing square bracket
) #close the lookahead
~
Interest of this pattern:
The result is the whole match, because \K resets everything matched before it and because the lookahead assertion that follows the part you are looking for doesn't consume characters and so is not part of the match.
The character classes are negated, so they are shorter to write and permissive (you don't need to care which characters appear inside).
The pattern checks that the opening and closing tags are the same, using a capture group and a back reference.
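A quick way to check these points is to run the pattern on the sample text from the question. This is only a sketch; the variable names are illustrative.

```php
<?php
$content = "This is content that is a sample.\n"
         . "[md] Special Content Piece [/md]\n"
         . "This is some more content.";

$pattern = '~\[([^]]++)]\K[^[]++(?=\[/\1])~';

preg_match_all($pattern, $content, $matches);

// $matches[0] holds only the text between the tags: \K dropped the
// opening tag and the lookahead never consumed the closing tag.
var_export($matches[0]);

// Group 1 is still filled in and holds the tag name.
var_export($matches[1]);
```

Here $matches[0] contains the single string ' Special Content Piece ' and $matches[1] contains 'md'.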
Limits:
This expression doesn't deal with nested structures (which you didn't ask for). If you need that, please edit your question.
For nested structures you can use:
(?=(\[([^]]++)](?<content>(?>[^][]++|(?1))*)\[/\2]))
If attributes are allowed in your bbcode:
(?=(\[([^]\s]++)[^]]*+](?<content>(?>[^][]++|(?1))*)\[/\2]))
If self-closing bbcode tags are allowed:
(?=((?:\[([^][]++)](?<content>(?>[^][]++|(?1))*)\[/\2])|\[[^/][^]]*+]))
Notes:
A lookahead means in other words: "followed by"
I use possessive quantifiers (++) instead of simple greedy quantifiers (+) to tell the regex engine that it doesn't need to backtrack (a performance gain), and atomic groups (i.e. (?>..)) for the same reason.
In the patterns for nested structures slashes are not escaped; to use them you must choose a delimiter that is not a slash (~, #, `).
The patterns for nested structures use recursion (i.e. (?1)); you can find more information about this feature here and here.
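As a rough illustration of the first nested pattern, the sketch below wraps it in ~ delimiters and runs it on a made-up string with one tag inside another. The input string and the $found array are assumptions for the example only.

```php
<?php
// Hypothetical input with a [foo] tag nested inside an [md] tag.
$content = 'Intro [md] Special Content [foo]Piece[/foo] [/md] outro';

// "~" is the delimiter because the pattern contains unescaped slashes.
$pattern = '~(?=(\[([^]]++)](?<content>(?>[^][]++|(?1))*)\[/\2]))~';

preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);

$found = [];
foreach ($matches as $m) {
    // The overall match is empty (the pattern is one big lookahead),
    // so the useful data lives in the capture groups:
    // group 2 is the tag name, "content" is the text between the tags.
    $found[$m[2]] = $m['content'];
}

print_r($found);
```

Because each match is zero-width, the outer tag and the inner tag are both reported: 'md' maps to ' Special Content [foo]Piece[/foo] ' and 'foo' maps to 'Piece'.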

Update:
If you're likely to be working with nested "tags", I'd probably go for something like this:
$pattern = '/(\[\s*([^\]]++)\s*\])(?=(.*?)(\[\s*\/\s*\2\s*\]))/';
Which, as you can probably tell, is not unlike what CasimiretHippolyte suggested (only his regex, AFAICT, won't capture outer tags in a scenario like the following):
This is content that is a sample.
[md] Special Content [foo]Piece[/foo] [/md]
This is some more content.
Whereas, with this expression, $matches looks like:
array (
0 =>
array (
0 => '[md]',
1 => '[foo]',
),
1 =>
array (
0 => '[md]',
1 => '[foo]',
),
2 =>
array (
0 => 'md',
1 => 'foo',
),
3 =>
array (
0 => ' Special Content [foo]Piece[/foo] ',
1 => 'Piece',
),
4 =>
array (
0 => '[/md]',
1 => '[/foo]',
),
)
A rather simple pattern to match all substrings looking like this [foo]sometext[/foo]
$pattern = '/(\[[^\/\]]+\])([^\]]+)(\[\s*\/\s*[^\]]+\])/';
if (preg_match_all($pattern, $content, $matches))
{
echo '<pre>';
print_r($matches);
echo '</pre>';
}
Output:
array (
0 =>
array (
0 => '[md] Special Content Piece [/md]',
),
1 =>
array (
0 => '[md]',
),
2 =>
array (
0 => ' Special Content Piece ',
),
3 =>
array (
0 => '[/md]',
),
)
How this pattern works: it's divided into three groups.
The first: (\[[^\/\]]+\]) matches an opening [ and a closing ], with everything in between that is neither a closing bracket nor a forward slash.
The second: ([^\]]+) matches every char after the first group that is not ]
The third: (\[\s*\/\s*[^\]]+\]) matches an opening [, followed by zero or more spaces, a forward slash, again followed by zero or more spaces, then any other chars that aren't ], and finally the closing ]
If you want to match a specific end-tag, but keeping the same three groups (with a fourth), use this (slightly more complex) expression:
$pattern = '/(\[\s*([^\]]+?)\s*\])(.+?)(\[\s*\/\s*\2\s*\])/';
This'll return:
array (
0 =>
array (
0 => '[md] Special Content Piece [/md]',
),
1 =>
array (
0 => '[md]',
),
2 =>
array (
0 => 'md',
),
3 =>
array (
0 => ' Special Content Piece ',
),
4 =>
array (
0 => '[/md]',
),
)
Note that group 2 (the one we used in the expression as \2) is the "tagname" itself.
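If you would rather work with tag/content pairs than with the raw group arrays, something along these lines might do. It is a sketch only: extractTagged() is a hypothetical helper name, the input string is made up, and trimming the content is my own choice.

```php
<?php
// Hypothetical helper: returns one ['tag' => ..., 'content' => ...] entry
// per matched [tag]...[/tag] pair, using the four-group pattern above.
function extractTagged(string $content): array
{
    $pattern = '/(\[\s*([^\]]+?)\s*\])(.+?)(\[\s*\/\s*\2\s*\])/';
    $result  = [];

    // PREG_SET_ORDER gives one array per match instead of one per group.
    if (preg_match_all($pattern, $content, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $result[] = ['tag' => $m[2], 'content' => trim($m[3])];
        }
    }

    return $result;
}

$found = extractTagged('intro [md] Special Content Piece [/md] and [b]bold[/b] text');
print_r($found);
```

Note that . does not match newlines by default, so content spanning several lines would need the s modifier.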

Related

How Can I Split a String in One Character Increments but also Keep The String between Brackets Intact?

So, say I have string like:
abcd[efg]hi[jkl]m
What I want is to split the string in such a way that each character is placed in its own index in an array, except for the characters within the brackets. They should remain grouped together. In other words, I want to create an array from the input string as such:
[0] => a
[1] => b
[2] => c
[3] => d
[4] => efg
[5] => h
[6] => i
[7] => jkl
[8] => m
I know I can split the input string in the way shown below using preg_split('/[\[]*[\][]/U', $value, -1):
[0] => abcd
[1] => efg
[2] => hi
[3] => jkl
[4] => m
However, I do not know the regular expression I'd need to use to get me exactly what I want. What regular expression will give me my desired solution?

Effectively, you split on every "optionally occurring" square-braced expression. IOW, if a square braced expression is encountered, split on the whole expression (without consuming it) and if there is no expression encountered, then split on that position in the string.
The pattern can look like either of these: (Regex Demo)
~\[([^\]]+)]|~ # (shorter pattern, 44 steps on the sample input string)
or
~(?:\[([^\]]+)])?~ # (longer pattern, 64 steps on the sample input string)
Both preg flags are essential to ensure that no substrings are lost and no empty elements are generated during the process.
Code: (Demo)
$string = 'abcd[efg]hi[jkl]m';
var_export(
preg_split(
'~\[([^\]]+)]|~',
$string,
0,
PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE
)
);
Output:
array (
0 => 'a',
1 => 'b',
2 => 'c',
3 => 'd',
4 => 'efg',
5 => 'h',
6 => 'i',
7 => 'jkl',
8 => 'm',
)
Pattern Breakdown:
~ #starting pattern delimiter
\[ #literal opening square brace
( #open capturing group
[^\]]+ #match one or more characters that are not a closing square brace
) #close capturing group
] #literal closing square brace
| #OR operator, to allow a match on current position (not a brace expression)
~
The literal braces are written outside of the inner capture group so that they are discarded/consumed during the splitting. Normally, the inner capture group would be consumed as well, but the PREG_SPLIT_DELIM_CAPTURE flag demands that that substring should remain after the splitting.
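One caveat worth sketching: without the u modifier the empty alternative splits byte by byte, so multibyte characters would be cut apart. The variant below assumes UTF-8 input and adds u; splitKeepingBrackets() is a hypothetical helper name, not something from the question.

```php
<?php
// Hypothetical helper: splits into single characters but keeps each
// bracketed group together. The "u" modifier is an addition that makes
// the empty alternative advance one UTF-8 character at a time.
function splitKeepingBrackets(string $string): array
{
    $parts = preg_split(
        '~\[([^\]]+)]|~u',
        $string,
        -1,
        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE
    );

    // preg_split() returns false on error (for example invalid UTF-8).
    return $parts === false ? [] : $parts;
}

var_export(splitKeepingBrackets('abcd[efg]hi[jkl]m'));

// "\u{E9}" and "\u{E7}" are two-byte characters, written as escapes so
// the example does not depend on the file encoding.
var_export(splitKeepingBrackets("\u{E9}[ab]\u{E7}"));
```

For the plain ASCII sample the result is the same as in the answer above; for the second string each accented character stays in one piece.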

Regex: Capturing multiple instances in one word group

I'm not good at Regex and I've been trying for hours now so I hope you can help me. I have this text:
&#x271D;his is *&#x271D;he* *in&#x271D;erne&#x271D;*
I need to capture (using PREG_OFFSET_CAPTURE) only the &#x271D; in a word surrounded with *, so I only need to capture the last three &#x271D; in this example. The output array should look something like this:
[0] => Array
(
[0] => &#x271D;
[1] => 17
)
[1] => Array
(
[0] => &#x271D;
[1] => 32
)
[2] => Array
(
[0] => &#x271D;
[1] => 44
)
I've tried using (&#x271D;) but of course this will select all instances, including those in words without asterisks. Then I've tried \*[^ ]*(&#x271D;)[^ ]*\* but this only gives me the last instance in one word. I've tried many other variations but all were wrong.
To clarify: The asterisk can be at all places in the string, but always at the beginning and end of a word. The opening asterisk is always preceded by a space except at the beginning of the string, and the closing asterisk is always followed by a space except at the end of the string. I must add that punctuation marks can be inside these asterisks. &#x271D; is exactly (and only) what I need to capture and can be at any position in a word.

You could make use of the \G anchor to get iterative matches between the *. The anchor matches either at the start of the string, or at the end of the previous match.
(?:\*|\G(?!^))[^&*]*(?>&(?!#)[^&*]*)*\K&#x271D;(?=[^*]*\*)
Explanation
(?: Non capture group
\* Match *
| Or
\G(?!^) Assert the end of the previous match, not at the start
) Close non capture group
[^&*]* Match 0+ times any char except & and *
(?> Atomic group
&(?!#) Match & only when not directly followed by #
[^&*]* Match 0+ times any char except & and *
)* Close atomic group and repeat 0+ times
\K Clear the match buffer (forget what is matched until now)
&#x271D; Match literally
(?=[^*]*\*) Positive lookahead, assert a * at the right
For example
$re = '/(?:\*|\G(?!^))[^&*]*(?>&(?!#)[^&*]*)*\K&#x271D;(?=[^*]*\*)/m';
$str = '&#x271D;his is *&#x271D;he* *in&#x271D;erne&#x271D;*';
preg_match_all($re, $str, $matches, PREG_OFFSET_CAPTURE);
print_r($matches[0]);
Output
Array
(
[0] => Array
(
[0] => &#x271D;
[1] => 16
)
[1] => Array
(
[0] => &#x271D;
[1] => 31
)
[2] => Array
(
[0] => &#x271D;
[1] => 43
)
)
Note that the offset is 1 less than expected because string offsets start counting at 0. See PREG_OFFSET_CAPTURE
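If you need the 1-based positions from the question, you can shift the offsets after matching. The sketch below assumes the cross is present in the input as the HTML entity &#x271D; (as the offsets and the entity-based pattern suggest); $zeroBased and $oneBased are illustrative names.

```php
<?php
// Assumption: the input contains the literal entity text "&#x271D;".
$re  = '/(?:\*|\G(?!^))[^&*]*(?>&(?!#)[^&*]*)*\K&#x271D;(?=[^*]*\*)/m';
$str = '&#x271D;his is *&#x271D;he* *in&#x271D;erne&#x271D;*';

preg_match_all($re, $str, $matches, PREG_OFFSET_CAPTURE);

$zeroBased = [];
$oneBased  = [];
foreach ($matches[0] as $match) {
    // Each entry is [matched text, zero-based byte offset].
    $zeroBased[] = $match[1];
    $oneBased[]  = $match[1] + 1;
}

print_r($zeroBased);
print_r($oneBased);
```

Keep in mind that these are byte offsets, so they only equal character positions while everything before the match is single-byte text.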
If you want to match more variations, you could use a non capturing group and list the ones that you would accept to match. If you don't want to cross newline boundaries you can exclude matching those in the negated character class.
(?:\*|\G(?!^))[^&*\r\n]*(?>&(?!#)[^&*\r\n]*)*\K&#(?:x271D|169);(?=[^*\r\n]*\*)

Can preg_match() capture unknown number of occurrences?

Let's say I'm having the following string:
$string = 'cats[Garfield,Tom,Azrael]';
I need to capture the following strings:
cats
Garfield
Tom
Azrael
That string can be any word-like text, followed by brackets with the list of comma-separated word-like entries. I tried the following:
preg_match('#^(\w+)\[(\w+)(?:,(\w+))*\]$#', $string, $matches);
The problem is that $matches ignores Tom, matching only the first and the last cat.
Now, I know how to do that with more calls, perhaps combining preg_match() and explode(), so the question is not how to do it in general.
The question is: can that be done in single preg_match(), so I could validate and match on one go?

The underlying question seems to be: is it possible to extract each occurrence of a repeated capture group?
The answer is no.
However, several workarounds exist:
The most understandable uses two steps: you capture the full list and then you split it. Something like:
$str = 'cats[Garfield,Tom,Azrael,Supermatou]';
if ( preg_match('~(?<item>\w+)\[(?<list>\w+(?:,\w+)*)]~', $str, $m) )
$result = [ $m['item'], explode(',', $m['list']) ];
(or any structure you want)
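Since the question asks to validate and match in one go, the two-step idea could be wrapped up roughly like this. parseItemList() is a hypothetical name, and the \A and \z anchors are my addition so that the whole string has to fit the format.

```php
<?php
// Hypothetical helper: returns [item, [elements...]] or null when the
// string does not have the form word[word,word,...].
function parseItemList(string $str): ?array
{
    // \A and \z (an assumption, not in the pattern above) force the
    // entire string to match, which makes this a validation as well.
    if (!preg_match('~\A(?<item>\w+)\[(?<list>\w+(?:,\w+)*)]\z~', $str, $m)) {
        return null;
    }

    return [$m['item'], explode(',', $m['list'])];
}

var_export(parseItemList('cats[Garfield,Tom,Azrael]'));
var_export(parseItemList('cats[Garfield,,Tom]')); // empty element: rejected
```

The first call yields 'cats' plus the three names, the second yields NULL because of the empty element.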
Another workaround uses preg_match_all in conjunction with the \G anchor that matches either the start of the string or the position after a successful match:
$pattern = '~(?:\G(?!\A),|(?<item>\w+)\[(?=[\w,]+]))(?<elt>\w+)~';
if ( preg_match_all($pattern, $str, $matches) )
print_r($matches);
This design ensures that all elements are between the brackets.
To obtain a more flat result, you can also write it like this:
$pattern = '~\G(?!\A)[[,]\K\w+|\w+(?=\[[\w,]+])~';
details of this last pattern:
~
# first alternative (can't be the first match)
\G (?!\A) # position after the last successful match
# (the negative lookahead discards the start of the string)
[[,] # an opening bracket or a comma
\K # return the whole match from this position
\w+ # an element
| # OR
# second alternative (the first match)
\w+ # the item
(?= # lookahead to check forward if the format is correct
\[ # opening bracket
[\w,]+ # word characters and comma (feel free to be more descriptive
# like \w+(?:,\w+)* or anything you want)
] # closing bracket
)
~
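Here is a small run of that last pattern, as a sketch: once on the string from the question and once on a string with the closing bracket missing (a made-up case) to show that the lookahead then prevents any match.

```php
<?php
$pattern = '~\G(?!\A)[[,]\K\w+|\w+(?=\[[\w,]+])~';

// Well-formed input: the item first, then every element in turn.
preg_match_all($pattern, 'cats[Garfield,Tom,Azrael]', $valid);
print_r($valid[0]);

// Closing bracket missing: the lookahead fails for the item, so the
// \G branch never gets a starting point and nothing is returned.
preg_match_all($pattern, 'cats[Garfield,Tom', $invalid);
print_r($invalid[0]);
```

The first call gives a flat list of four strings (cats, Garfield, Tom, Azrael); the second gives an empty array.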

Why not a simple preg_match_all:
$string = 'cats[Garfield,Tom,Azrael], entity1[child11,child12,child13], entity2:child21&child22&child23';
preg_match_all('#\w+#', $string, $matches);
print_r($matches);
Output:
Array
(
[0] => Array
(
[0] => cats
[1] => Garfield
[2] => Tom
[3] => Azrael
[4] => entity1
[5] => child11
[6] => child12
[7] => child13
[8] => entity2
[9] => child21
[10] => child22
[11] => child23
)
)

Regex to split string with the last occurrence of a dot, colon or underscore

We have thousands of rows of data containing article numbers in all sorts of formats, and I need to split off the main article number from a size indicator. There is (almost) always a dot, dash or underscore before the last few characters (not always 2).
In short: the data is main article number + size indicator; the separator differs but is always one of the three characters .-_
Question: how do I split main article number + size indicator? My regex below, which I built based on some Googling, isn't working.
preg_match('/^(.*)[\.-_]([^\.-_]+)$/', $sku, $matches);
Sample data + expected result
AR.110052.15-40 [AR.110052.15 & 40]
BI.533.41-41 [BI.533.41 & 41]
CG.00554.000-39 [CG.00554.000 & 39]
LL.PX00.SC004-40 [LL.PX00.SC004 & 40]
LOS.HAPPYSOCKS.1X [LOS.HAPPYSOCKS & 1X]
MI.PMNH300043-XXXXL [MI.PMNH300043 & XXXXL]

You need to move the - to the end of character class to make the regex engine parse it as a literal hyphen:
^(.*)[._-]([^._-]+)$
See the regex demo. Actually, even ^(.+)[._-](.+)$ will work.
^ - matches the start of string
(.*) - Group 1 capturing any 0+ chars as many as possible up to the last...
[._-] - either . or _ or -
([^._-]+) - Group 2: one or more chars other than ., _ and -
$ - end of string.
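To make the difference concrete, here is a short sketch. It first shows that .-_ inside a character class is read as a range that does not even contain the hyphen, then applies the corrected pattern to a few of the sample SKUs. The $parts array is just an illustrative name.

```php
<?php
// Inside a character class, ".-_" is a range from "." (0x2E) to "_" (0x5F),
// so the original class does not contain the hyphen (0x2D) at all.
$rangeMatchesHyphen = preg_match('/[\.-_]/', '-'); // 0
$fixedMatchesHyphen = preg_match('/[._-]/', '-');  // 1

$skus  = ['AR.110052.15-40', 'LOS.HAPPYSOCKS.1X', 'MI.PMNH300043-XXXXL'];
$parts = [];
foreach ($skus as $sku) {
    if (preg_match('/^(.*)[._-]([^._-]+)$/', $sku, $m)) {
        // Group 1 is the main article number, group 2 the size indicator.
        $parts[$sku] = [$m[1], $m[2]];
    }
}

print_r($parts);
```

The range also swallows digits and uppercase letters, which is why the negated class in the original pattern could not match a size like 40.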

Use preg_split() instead of preg_match() because:
this isn't a validation task, it is an extraction task and
preg_split() returns the exact desired array compared to preg_match() which carries the unnecessary fullstring match in its returned array.
Limit the number of elements produced (like you would with explode()'s limit parameter).
No capture groups are needed at all.
Greedily match zero or more characters, then just before matching the latest occurring delimiter, restart the fullstring match with \K. This will effectively use the matched delimiter as the character to explode on and it will be "lost" in the explosion.
Code: (Demo)
$strings = [
'AR.110052.15-40',
'BI.533.41-41',
'CG.00554.000-39',
'LL.PX00.SC004-40',
'LOS.HAPPYSOCKS.1X',
'MI.PMNH300043-XXXXL',
];
foreach ($strings as $string) {
var_export(preg_split('~.*\K[._-]~', $string, 2));
echo "\n";
}
Output:
array (
0 => 'AR.110052.15',
1 => '40',
)
array (
0 => 'BI.533.41',
1 => '41',
)
array (
0 => 'CG.00554.000',
1 => '39',
)
array (
0 => 'LL.PX00.SC004',
1 => '40',
)
array (
0 => 'LOS.HAPPYSOCKS',
1 => '1X',
)
array (
0 => 'MI.PMNH300043',
1 => 'XXXXL',
)

preg_match_all for words in and outside of brackets

I have been sitting for hours to figure out a regExp for a preg_match_all function in php.
My problem is that I want two different things from the string.
Say you have the string "Code is fun [and good for the brain.] But the [brain is] tired."
What I need from this is an array of all the words outside of the brackets, and the text in the brackets together as one string.
Something like this
[0] => Code
[1] => is
[2] => fun
[3] => and good for the brain.
[4] => But
[5] => the
[6] => brain is
[7] => tired.
Help much appreciated.

You could try the below regex also,
(?<=\[)[^\]]*|[.\w]+
Code:
<?php
$data = "Code is fun [and good for the brain.] But the [brain is] tired.";
$regex = '~(?<=\[)[^\]]*|[.\w]+~';
preg_match_all($regex, $data, $matches);
print_r($matches);
?>
Output:
Array
(
[0] => Array
(
[0] => Code
[1] => is
[2] => fun
[3] => and good for the brain.
[4] => But
[5] => the
[6] => brain is
[7] => tired.
)
)
The first alternative, (?<=\[)[^\]]*, uses a lookbehind to match all the characters inside the square brackets [], and the second, [.\w]+, matches one or more word characters or dots from the remaining string.

You can use the following regex:
(?:\[([\w .!?]+)\]|(\w+))
The regex contains two alternatives: one to match everything inside the two square brackets, and one to capture every other word.
This assumes that the part inside the square brackets doesn't contain any characters other than alphabets, digits, _, !, ., and ?. In case you need to add more punctuation, it should be easy enough to add them to the character class.
If you don't want to be that specific about what should be captured, then you can use a negated character class instead: specify what not to match instead of specifying what to match. The expression then becomes: (?:\[([^\[\]]+)\]|(\w+))
Explanation:
(?: # Begin non-capturing group
\[ # Match a literal '['
( # Start capturing group 1
[\w .!?]+ # Match everything in between '[' and ']'
) # End capturing group 1
\] # Match literal ']'
| # OR
( # Begin capturing group 2
\w+ # Match rest of the words
) # End capturing group 2
) # End non-capturing group
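Because the two alternatives capture into different groups, the results need to be merged to get one flat list. A possible sketch, using the negated-class version of the expression and illustrative variable names:

```php
<?php
$data    = 'Code is fun [and good for the brain.] But the [brain is] tired.';
$pattern = '~(?:\[([^\[\]]+)\]|(\w+))~';

preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);

$result = [];
foreach ($matches as $m) {
    // Group 1 is filled for bracketed text, group 2 for a plain word;
    // when group 2 matched, group 1 is an empty string.
    $result[] = ($m[1] ?? '') !== '' ? $m[1] : $m[2];
}

print_r($result);
```

One thing to be aware of: \w+ does not include the trailing dot, so the last element comes out as "tired" rather than "tired."; using [.\w]+ for the second group would keep it.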
