Extract string in brackets when there are other brackets embedded in quotes - php

I want to extract this bracketed part from a string:
[list items='["one","two"]' ok="no" b="c"]
I am using the following preg_match call:
preg_match('~\[([a-zA-Z0-9_]+)[ ]+([a-zA-Z0-9]+=[^\[]+)\]~s', $string,$match)
But I have trouble with the brackets that appear within quotes.
I have two files
theme.html
[list items=""one","[x]tw"'o"" ok="no" b="c""/]
#book
[button text="t'"extB1" name="ok"'" /]
Asdfz " s wr aw3r '
[button text="t"'extB2" name="no"'" /]
file.php
$string=file_get_contents('theme.html');
for (;;) {
if (!preg_match('~\[([a-zA-Z0-9_]+)[ ]+([a-zA-Z0-9]+=[^\[]+)\]~s', $string,$match)) {
exit;
}
$string=str_replace($match[0], '', $string);
echo "<pre><br>";
print_r($match);
echo "<br></pre>";
}
and this is output:
<pre><br>Array
(
[0] => [button text="textB1" name="ok"]
[1] => button
[2] => text="textB1" name="ok"
)
<br></pre>
<pre><br>Array
(
[0] => [button text="textB2" name="no"]
[1] => button
[2] => text="textB2" name="no"
)
<br></pre>
As you can see the output does not include
[list items='["one","two"]' ok="no" b="c"]
I know the problem is caused by the embedded square brackets, but I don't know how I can correct the code to ignore them.

You could use this variation of your preg_match call:
if (!preg_match('~\[(\w+)\s+(\w+=(?:\'[^\']*\'|[^\[])+?)\]~s', $string, $match))
With \'[^\']*\' it detects the presence of a quote and will grab all characters until the next quote, without blocking on an opening bracket. Only if that cannot be matched, will it go for the part you had: [^\[])+. I added a ? to that, to make it non-greedy, which makes sure it will not grab a closing ].
Note also that [a-zA-Z0-9_] can be shortened to \w, and [ ] can be written as \s, which also allows other white-space characters; I believe that is OK here.
See it run on eval.in.
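As a quick check, here is a minimal sketch that runs the pattern above against the sample tag from the question. The variable names are only illustrative.

```php
<?php
// Sample tag from the question; the single-quoted attribute value contains brackets.
$string = '[list items=\'["one","two"]\' ok="no" b="c"]';

// Same pattern as above: a single-quoted run is consumed as one unit,
// so the brackets inside it cannot end the match early.
$pattern = '~\[(\w+)\s+(\w+=(?:\'[^\']*\'|[^\[])+?)\]~s';

if (preg_match($pattern, $string, $match)) {
    echo $match[1], "\n"; // list
    echo $match[2], "\n"; // items='["one","two"]' ok="no" b="c"
}
```

Only single-quoted values are protected this way; a bracket inside a double-quoted value would still stop the match.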
Alternative: match complete lines only
If the quotes can appear anywhere without guarantee that closing brackets appear within quotes, then the above will not work.
Instead we could require that the match must span a complete line in the text:
if (!preg_match('~^\s*\[(\w+)\s+(\w+=.*?)\]\s*$~sm', $string, $match))
See it run on eval.in.
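A minimal sketch of the line-based variant, assuming every tag sits on a line of its own. The input text is made up for illustration and uses preg_match_all instead of the loop from the question.

```php
<?php
// Illustrative input: each tag is on its own line, with other text in between.
$string = <<<'TXT'
[list items='["one","two"]' ok="no" b="c"]
#book
[button text="B1" name="ok"]
TXT;

// The m flag lets ^ and $ match at line boundaries, so the closing ]
// must be the last non-white-space character on its line.
$pattern = '~^\s*\[(\w+)\s+(\w+=.*?)\]\s*$~sm';

preg_match_all($pattern, $string, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[1], ' => ', $m[2], "\n";
}
// list => items='["one","two"]' ok="no" b="c"
// button => text="B1" name="ok"
```

Because of the s flag the dot also matches newlines, so a tag whose closing ] is missing would run on until a later line that ends in ].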

Related

Need Help in using preg_split() in php

Can anyone explain how to use preg_split() function to split below mentioned string
String [
date=2017-05-31 time=14:12:05 devname=FGT3HD3914801291 devid=FGT3HD3914801449 logid=0316013056 type=utm subtype=webfilter eventtype=ftgd_blk level=warning vd="root" policyid=63 sessionid=9389050 user="" srcip=172.30.10.90 srcport=53542 srcintf="port5" dstip=50.7.146.50 dstport=80 dstintf="port2" proto=6 service=HTTP hostname="noblockweb.org" profile="IT ADMIN" action=blocked reqtype=direct url="/wpad.dat?1925450516382f9869bdfee527b429fb23737930" sentbyte=126 rcvdbyte=325 direction=outgoing msg="URL belongs to a denied category in policy" method=domain cat=55 catdesc="Meaningless Content" crscore=10 crlevel=medium
]
I need output in following structure
Array
(
[0] => date=2017-05-31
[1] => time=14:12:05
.
.
.
.
[20]=> msg="URL belongs to a denied category in policy"
.
.
)
preg_split may not be the right tool for this. preg_match_all is better suited to this kind of splitting.
<?php
$str = 'date=2017-05-31 time=14:12:05 devname=FGT3HD3914801291 devid=FGT3HD3914801449 logid=0316013056 type=utm subtype=webfilter eventtype=ftgd_blk level=warning vd="root" policyid=63 sessionid=9389050 user="" srcip=172.30.10.90 srcport=53542 srcintf="port5" dstip=50.7.146.50 dstport=80 dstintf="port2" proto=6 service=HTTP hostname="noblockweb.org" profile="IT ADMIN" action=blocked reqtype=direct url="/wpad.dat?1925450516382f9869bdfee527b429fb23737930" sentbyte=126 rcvdbyte=325 direction=outgoing msg="URL belongs to a denied category in policy" method=domain cat=55 catdesc="Meaningless Content" crscore=10 crlevel=medium';
preg_match_all('/ ?(\w+\=(("[^"]*")|([^ ]*)))/',$str,$matches);
print_r($matches[1]);
' ?' - matches a space character. The question mark makes it optional, so that the first item (which has no leading space) also matches.
The parentheses capture the matching parts and group sub-expressions. The first pair of parentheses surrounds the part we are interested in, which is why $matches[1] is used. $matches[0] contains the whole matched part, including the possible leading space.
(...|...) - the bar character means that either the part to its left or the part to its right may match.
("[^"]*") - this matches quoted strings. [^"] means: match any character that is not a quotation mark ". The square brackets define a character class; if the caret ^ is the first character in the class, the class is negated.
([^ ]*) - matches everything that is not a space character, possibly with zero length.
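To see the pieces working together, here is a small sketch that applies the same pattern to a shortened log line. The shortened input is an assumption made for readability; the pattern is unchanged.

```php
<?php
// Shortened, illustrative version of the log line from the question.
$str = 'date=2017-05-31 vd="root" user="" profile="IT ADMIN" action=blocked';

// Same pattern as above: key, "=", then either a quoted value or a run of non-spaces.
preg_match_all('/ ?(\w+\=(("[^"]*")|([^ ]*)))/', $str, $matches);

print_r($matches[1]);
// Array
// (
//     [0] => date=2017-05-31
//     [1] => vd="root"
//     [2] => user=""
//     [3] => profile="IT ADMIN"
//     [4] => action=blocked
// )
```

The quoted alternative comes first, so a value such as "IT ADMIN" is kept whole even though it contains a space.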

preg_replace all line breaks between specific BBCode tags

I'm programming a wiki with BBCode-like editing syntax.
I want the user to be allowed to enter line breaks that resolve to <br> tags.
Up to this point there is no problem.
Now I also have the following lines, which should be converted into a table:
[table]
[row]
[col]Column1[/col]
[col]Column2[/col]
[col]Column3[/col]
[/row]
[/table]
All those line breaks, that were entered when formatting the editable BBCode above are creating <br> tags that are forced to be rendered in front of the html-table.
My goal is to remove all line breaks between [table] and [/table] in my parser function using php's preg_replace without breaking the possibility to enter normal text using newlines.
This is my parsing function so far:
function richtext($text)
{
$text = htmlspecialchars($text);
$expressions = array(
# Poor attempts
'/\[table\](\r\n*)|(\r*)|(\n*)\[\/table\]/' => '',
'/\[table\]([^\n]*?\n+?)+?\[\/table\]/' => '',
'/\[table\].*?(\r+).*?\[\/table\]/' => '',
# Line breaks
'/\r\n|\r|\n/' => '<br>'
);
foreach ($expressions as $pattern => $replacement)
{
$text = preg_replace($pattern, $replacement, $text);
}
return $text;
}
It would be great if you could also explain a bit what the regex is doing.
Style
First of all, you don't need the foreach loop: preg_replace accepts arrays for both the pattern and the replacement arguments (for example array_keys($expressions) and array_values($expressions)). See Example #2: http://www.php.net/manual/en/function.preg-replace.php
Answer
Use this regex to remove all line breaks between two tags (here table and row):
(\[table\][^\r\n]*)(\r\n)*([^\r\n]*\[row\])
The tricky part is the replacement (see also: preg_replace() Only Specific Part Of String):
$result = preg_replace('/(\[table\][^\r\n]*)(\r\n)*([^\r\n]*\[row\])/', '$1$3', $subject);
Instead of replacing the whole match with '', you put the first and third groups back with '$1$3', so only the second group ((\r\n)*) is dropped. Note that this pattern only matches \r\n line breaks, not a bare \n or \r.
Example
[table] // This will also work with multiple line breaks
[row]
[col]Column1[/col]
[col]Column2[/col]
[col]Column3[/col]
[/row]
[/table]
With the regex, this will output:
[table] [row]
[col]Column1[/col]
[col]Column2[/col]
[col]Column3[/col]
[/row]
[/table]
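The pattern above only joins [table] with the [row] that follows it. To remove every line break inside a table block, which is what the question asks for, a different technique can be used: preg_replace_callback. The sketch below is not part of the answer above, and the helper name stripTableLineBreaks is hypothetical.

```php
<?php
// Hypothetical helper (the name is illustrative, not from the answer above).
// Removes every line break inside each [table]...[/table] block and
// leaves line breaks elsewhere in the text untouched.
function stripTableLineBreaks(string $text): string
{
    $result = preg_replace_callback(
        '/\[table\].*?\[\/table\]/s',   // one whole table block, matched lazily
        function (array $m): string {
            // \R matches any line break sequence (\r\n, \n or \r)
            return preg_replace('/\R/', '', $m[0]);
        },
        $text
    );

    return $result ?? $text; // fall back to the input if the regex engine fails
}

$input = "Intro\n[table]\n[row]\n[col]Column1[/col]\n[/row]\n[/table]\nOutro";
echo stripTableLineBreaks($input);
// Intro
// [table][row][col]Column1[/col][/row][/table]
// Outro
```

In a parser like richtext() this step would have to run before the rule that turns line breaks into <br>, so that only the breaks outside tables are converted.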

Using RegEx to Capture All Links & In Between Text From A String

<Link to: http://www.someurl(.+)> maybe some text here(.*) <Link: www.someotherurl(.+)> maybe even more text(.*)
Given that this is all on one line, how can I match or better yet extract all full urls and text? ie. for this example I wish to extract:
http://www.someurl(.+) . maybe some text here(.*) . www.someotherurl(.+) . maybe even more text(.*)
Basically, <Link.*:.* would start each link capture and > would end it. Then all text after the first capture would be captured as well up until zero or more occurrences of the next link capture.
I have tried:
preg_match_all('/<Link.*?:.*?(https|http|www)(.+?)>(.*?)/', $v1, $m4);
but I need a way to capture the text after the closing >. The problem is that there may or may not be another link after the first one (of course there could also be no links to begin with!).
$string = "<Link to: http://www.someurl(.+)> maybe some text here(.*) <Link: www.someotherurl(.+)> maybe even more text(.*)";
$string = preg_split('~<link(?: to)?:\s*([^>]+)>~i',$string,-1,PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
echo "<pre>";
print_r($string);
output:
Array
(
[0] => http://www.someurl(.+)
[1] => maybe some text here(.*)
[2] => www.someotherurl(.+)
[3] => maybe even more text(.*)
)
You can use this pattern:
preg_match_all('~<link\b[^:]*:\s*\K(?<link>[^\s>]++)[^>]*>\s*(?<text>[^<]++)~i',
$txt, $matches, PREG_SET_ORDER);
foreach($matches as $match) {
printf("<br/>link: %s\n<br/>text: %s", $match['link'], $match['text']);
}

What's the right pattern for this hidden input

I have this field Returned by curl_exec:
<input name="NUMBER_R" type="hidden" value="1500000">
1500000 is a random number and may change; the other parts are constant.
I tried:
preg_match ('/<input name="NUMBER_R" type="hidden" value="([^"]*)" \/>/', $result, $number)
and also:
preg_match ('/<input name=\'NUMBER_R\' type=\'hidden\' value=\'(\\d+)\'>/ims', $result, $number)
but no luck...
Here is the full code:
$result=curl_exec($cid);
curl_close($cid);
$number = array();
if (preg_match ('REGEX', $result, $number))
{
echo $number[1];
}
EDIT 1:
Sorry i forgot [1] in echo $number[1];
Also 1500000 is a random number and may change
Description
This regex will find the input tag which has the attributes name="number_r" and type="hidden" in any order. It then pulls out the contents of the value attribute. It requires the value text to be all digits.
<input\b\s+(?=[^>]*name=(["'])number_r\1)(?=[^>]*type=(["'])hidden\2)[^>]*value=(["'])(\d+)\3[^>]*>
<input\b\s+ consume the open bracket and the tag name, ensure there is a word break and white space
(?=[^>]*name=(["'])number_r\1) look ahead to ensure this tag includes the correct name attribute
(?=[^>]*type=(["'])hidden\2) look ahead to ensure this tag also includes the type attribute
[^>]*value= move the cursor forward until we find the value= attribute
(["']) capture the open quote
(\d+) capture the substring and require it to be all digits
\3 match the correct close quote. This can be omitted, as you've already received the desired substring.
[^>]*> match the rest of the characters in the tag. This can also be omitted, as you've already received the desired substring.
Groups
Group 0 gets the entire input tag
Group 1 gets the open quote for name, which is back referenced to ensure the correct close quote is captured
Group 2 gets the open quote for type, which is back referenced to ensure the correct close quote is captured
Group 3 gets the open quote for value, which is back referenced to ensure the correct close quote is captured
Group 4 gets the value in the attribute named value
PHP Code Example:
<?php
$sourcestring='<input name="NUMBER_R" type="hidden" value="1500000">';
preg_match('/<input\b\s+(?=[^>]*name=(["\'])number_r\1)(?=[^>]*type=(["\'])hidden\2)[^>]*value=(["\'])(\d+)\3[^>]*>/im',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
$matches Array:
(
[0] => <input name="NUMBER_R" type="hidden" value="1500000">
[1] => "
[2] => "
[3] => "
[4] => 1500000
)
Try using DOM and XPath to get that.
$xml = new DomDocument;
$xml->loadXml('<input name="NUMBER_R" type="hidden" value="1500000" />');
$xpath = new DomXpath($xml);
// traverse all results
foreach ($xpath->query('//input[@name="NUMBER_R"]') as $rowNode) {
var_dump($rowNode->getAttribute('value'));
}
Tested: http://codepad.viper-7.com/8dwu9f
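One caveat: loadXML expects well-formed XML, and the tag in the question is not self-closed, so real curl output may fail to parse. A minimal sketch using DOMDocument::loadHTML instead, assuming the DOM extension is available; the surrounding markup in $result is made up as a stand-in for the curl_exec() result.

```php
<?php
// Stand-in for the curl_exec() result; the surrounding markup is made up.
$result = '<html><body><form><input name="NUMBER_R" type="hidden" value="1500000"></form></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($result);            // lenient HTML parser, accepts unclosed <input>
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$node  = $xpath->query('//input[@name="NUMBER_R" and @type="hidden"]')->item(0);

// item(0) returns null when nothing matched
$number = $node instanceof DOMElement ? $node->getAttribute('value') : null;
echo $number, "\n"; // 1500000
```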

Regex to replace reg trademark

I need some help with regex:
I have some HTML output and I need to wrap all the registered trademark symbols (®) in <sup></sup>.
I can not insert the <sup> tag in title and alt properties and obviously I don't need to wrap regs that are already superscripted.
The following regex matches text that is not part of a HTML tag:
(?<=^|>)[^><]+?(?=<|$)
An example of what I'm looking for:
$original = `<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>`
The filtered string should output:
<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>
thanks a lot for your time!!!
Well, here is a simple way, if you agree to following limitation:
Those regs that are already processed have the </sup> following right after the ®
echo preg_replace('#®(?!\s*</sup>|[^<]*>)#','<sup>®</sup>', $s);
The logic behind it is:
we replace only those ® which are not followed by </sup> and...
which are not followed by a > symbol without an opening < symbol before it
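A minimal sketch that applies this replacement to the example string from the question. The variable names $s and $out are illustrative.

```php
<?php
// Example input from the question: one bare ®, one already inside <sup>,
// and one inside an alt attribute.
$s = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';

// The negative lookahead skips a ® that is directly followed by </sup>,
// and a ® that sits inside a tag (a > comes before the next <).
$out = preg_replace('#®(?!\s*</sup>|[^<]*>)#', '<sup>®</sup>', $s);

echo $out, "\n";
// <div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>
```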
I would really use an HTML parser in place of regular expressions, since HTML is not regular and will present more edge cases than you can dream of (ignoring your contextual limitations that you've identified above).
You don't say what technology you're using. If you post that up, someone can undoubtedly recommend the appropriate parser.
Regex is not enough for what you want. First you must write code to identify when content is a value of an attribute or a text node of an element. Then you must go through all that content and use some replace method. I am not sure what it is in PHP, but in JavaScript it would look something like:
content[i].replace(/\®/g, "<sup>®</sup>");
I agree with Brian that regular expressions are not a good way to parse HTML, but if you must use regular expressions, you could try splitting the string into tokens and then running your regexp on each token.
I'm using preg_split to split the string on HTML tags, as well as on the phrase <sup>®</sup>. This leaves tokens that are either tags or plain text in which no ® is already superscripted. Then, for each text token, ® can be replaced with <sup>®</sup>:
$regex = '/(<sup>®<\/sup>|<.*?>)/i';
$original = '<div>asd® asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>';
// we need to capture the tags so that the string can be rebuilt
$tokens = preg_split($regex, $original, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
/* $tokens => Array
(
[0] => <div>
[1] => asd® asdasd. asd
[2] => <sup>®</sup>
[3] => asd
[4] => <img alt="qwe®qwe" />
[5] => </div>
)
*/
foreach ($tokens as &$token)
{
if ($token[0] == "<") continue; // Skip tokens that are tags
$token = str_replace('®', '<sup>®</sup>', $token);
}
unset($token); // break the reference left over from the loop
$tokens = join("", $tokens); // reassemble the string
// $tokens => "<div>asd<sup>®</sup> asdasd. asd<sup>®</sup>asd <img alt="qwe®qwe" /></div>"
Note that this is a naive approach, and if the output isn't formatted as expected it might not parse like you'd like (again, regular expression is not good for HTML parsing ;) )
