regex question redux regarding definition list - php

Trying to figure out a way to throw out attributes in this data that do not have any values. Thanks for helping.
My current regex code , thanks to Tomalak looks like this
Regex find
([^=|]+)=([^|]+)(?:\||$)
Regex replace
<dt>$1</dt><dd>$2</dd>
Data looks like this
Bristle Material=|Wire Material=Steel|Dia.=4 in|Grit=|Bristle Diam=|Wire Size=0.0095 in|Arbor Diam=|Arbor Thread - TPI or Pitch=1/2 - 3/8 in|No. of Knots=|Face Width=1/2 in|Face Plate Thickness=7/16 in|Trim Length=7/8 in|Stem Diam=|Speed=6000 rpm [Max]|No. of Rows=|Color=|Hub Material=|Structure=|Tool Shape=|Applications=Cleaning rust, scale and dirt, Light Deburring, Edge Blending, Roughening for adhesion, Finish preparation prior to plating or painting|Applicable Materials=|Type=|Used With=Straight Grinders, Bench/Pedestal Grinders, Right Angle Grinders|Packing Type=|Quantity=1 per pack|Wt.=
End result should like this
<dt>Wire Material</dt><dd>Steel</dd><dt>Dia.</dt><dd>4 in</dd><dt>Wire Size</dt><dd>0.0095 in</dd>
Not this
Bristle Material=|<dt>Wire Material</dt><dd>Steel</dd><dt>Dia.</dt><dd>4 in</dd>Grit=|Bristle Diam=|<dt>Wire Size</dt><dd>0.0095 in

Here is how you can do it in PHP without using regular expressions:
$parts_list = explode('|', "Bristle Material=|Wire M....");
$parts = "";
foreach( $parts_list as $part ){
$p = explode('=', $part);
if(!empty($p[1])) $parts .= "<dt>$p[0]</dt>\n<dd>$p[1]</dd>\n";
}
echo $parts;
And here is how you can do it with regular expressions:
$parts = preg_replace(
array('/([^=|]*)=(?:\||$)/','/([^=|]*)=([^|]+)(?:\||$)/'),
array('', '<dt>$1</dt><dd>$2</dd>'),
$inputString
);
echo $parts;
Update
This is using a special replace feature of the PHP preg_replace which takes an array of regex expressions, and an array of replacement strings. The array() syntax of the function basically equates to this:
If I can match this: /([^=|]*)=(?:\||$)/ then replace it with an empty string.
If I can match this: /([^=|]*)=([^|]+)(?:\||$)/ then replace it with <dt>$1</dt><dd>$2</dd>
To test it in a Regex editor, you would run the first expression first (/([^=|]*)=(?:\||$)/) then run the second expression on the result of the first expression.

([^=|]*)=([^|]*)(?:\||$)
to skip the ones with out a value, try this:
(?:[^=|]*=|([^=|]*)=([^|]+))(?:\||$)

looks like you want preg_match here rather than preg_replace
preg_match_all('~([^|]+)=([^|\s][^|]*)~', $str, $matches, PREG_SET_ORDER);
foreach($matches as $match)
echo "<dt>{$match[1]}</dt><dd>{$match[2]}</dd>\n";

Related

preg_match acting very strange

I am using preg_match() to extract pieces of text from a variable, and let's say the variable looks like this:
[htmlcode]This is supposed to be displayed[/htmlcode]
middle text
[htmlcode]This is also supposed to be displayed[/htmlcode]
i want to extract the contents of the [htmlcode]'s and input them into an array. i am doing this by using preg_match().
preg_match('/\[htmlcode\]([^\"]*)\[\/htmlcode\]/ms', $text, $matches);
foreach($matches as $value){
return $value . "<br />";
}
The above code outputs
[htmlcode]This is supposed to be displayed[/htmlcode]middle text[htmlcode]This is also supposed to be displayed[/htmlcode]
instead of
[htmlcode]This is supposed to be displayed[/htmlcode]
[htmlcode]This is also supposed to be displayed[/htmlcode]
and if have offically run out of ideas
As explained already; the * pattern is greedy. Another thing is to use preg_match_all() function. It'll return you a multi-dimension array of matched content.
preg_match_all('#\[htmlcode\]([^\"]*?)\[/htmlcode\]#ms', $text, $matches);
foreach( $matches[1] as $value ) {
And you'll get this: http://codepad.viper-7.com/z2GuSd
A * grouper is greedy, i.e. it will eat everything until last [/htmlcode]. Try replacing * with non-greedy *?.
* is by default greedy, ([^\"]*?) (notice the added ?) should make it lazy.
What do lazy and greedy mean in the context of regular expressions?
Look at this piece of code:
preg_match('/\[htmlcode\]([^\"]*)\[\/htmlcode\]/ms', $text, $matches);
foreach($matches as $value){
return $value . "<br />";
}
Now, if your pattern works fine and all is ok, you should know:
return statement will break all loops and will exit the function.
The first element in matches is the whole match, the whole string. In your case $text
So, what you did is returned the first big string and exited the function.
I suggest you can check for desired results:
$matches[1] and $matches[2]

Get data out of string

I am going to parse a log file and I wonder how I can convert such a string:
[5189192e][game]: kill killer='0:Tee' victim='1:nameless tee' weapon=5 special=0
into some kind of array:
$log['5189192e']['game']['killer'] = '0:Tee';
$log['5189192e']['game']['victim'] = '1:nameless tee';
$log['5189192e']['game']['weapon'] = '5';
$log['5189192e']['game']['special'] = '0';
The best way is to use function preg_match_all() and regular expressions.
For example to get 5189192e you need to use expression
/[0-9]{7}e/
This says that the first 7 characters are digits last character is e you can change it to fits any letter
/[0-9]{7}[a-z]+/
it is almost the same but fits every letter in the end
more advanced example with subpatterns and whole details
<?php
$matches = array();
preg_match_all('\[[0-9]{7}e\]\[game]: kill killer=\'([0-9]+):([a-zA-z]+)\' victim=\'([0-9]+):([a-zA-Z ]+)\' weapon=([0-9]+) special=([0-9])+\', $str, $matches);
print_r($matches);
?>
$str is string to be parsed
$matches contains the whole data you needed to be pared like killer id, weapon, name etc.
Using the function preg_match_all() and a regex you will be able to generate an array, which you then just have to organize into your multi-dimensional array:
here's the code:
$log_string = "[5189192e][game]: kill killer='0:Tee' victim='1:nameless tee' weapon=5 special=0";
preg_match_all("/^\[([0-9a-z]*)\]\[([a-z]*)\]: kill (.*)='(.*)' (.*)='(.*)' (.*)=([0-9]*) (.*)=([0-9]*)$/", $log_string, $result);
$log[$result[1][0]][$result[2][0]][$result[3][0]] = $result[4][0];
$log[$result[1][0]][$result[2][0]][$result[5][0]] = $result[6][0];
$log[$result[1][0]][$result[2][0]][$result[7][0]] = $result[8][0];
$log[$result[1][0]][$result[2][0]][$result[9][0]] = $result[10][0];
// $log is your formatted array
You definitely need a regex. Here is the pertaining PHP function and here is a regex syntax reference.

Regular expressions for parsing "abc.q=dfd" strings

I have string as abvd.qweqw.sdfs.a=aqwrwewrwerrew. I need to parse this string and get piece before = and after =. Symbol . can occur many times. So, please, tell me, which regular expression can I use for parsing? Thank you.
Purely based on your example:
/([a-z.]+)=([a-z]+)/
Edit
But actually:
/([a-z_.]+)=(.*)/i
The results are in memory groups 1 and 2. In code:
if (preg_match('/^([a-z_.]+)=(.*)/i', $str, $matches)) {
// $matches[1] contains part before =
// $matches[2] contains part after =
}
Btw, I've tweaked the expression by anchoring it (using ^). If that doesn't work, just remove it from the expression.
You can use simple string function for that.
list($first, $second) = explode('=', 'abvd.qweqw.sdfs.a=aqwrwewrwerrew);
Thats the full code.
<?php
if(preg_match('/([a-z.]+)=([a-z]+)/', "abvd.qweqw.sdfs.a=aqwrwewrwerrew", $matches)){
print $matches[1]."\n";
print $matches[2]."\n";
}
?>

Get more backreferences from regexp than parenthesis

Ok this is really difficult to explain in English, so I'll just give an example.
I am going to have strings in the following format:
key-value;key1-value;key2-...
and I need to extract the data to be an array
array('key'=>'value','key1'=>'value1', ... )
I was planning to use regexp to achieve (most of) this functionality, and wrote this regular expression:
/^(\w+)-([^-;]+)(?:;(\w+)-([^-;]+))*;?$/
to work with preg_match and this code:
for ($l = count($matches),$i = 1;$i<$l;$i+=2) {
$parameters[$matches[$i]] = $matches[$i+1];
}
However the regexp obviously returns only 4 backreferences - first and last key-value pairs of the input string. Is there a way around this? I know I can use regex just to test the correctness of the string and use PHP's explode in loops with perfect results, but I'm really curious whether it's possible with regular expressions.
In short, I need to capture an arbitrary number of these key-value; pairs in a string by means of regular expressions.
You can use a lookahead to validate the input while you extract the matches:
/\G(?=(?:\w++-[^;-]++;?)++$)(\w++)-([^;-]++);?/
(?=(?:\w++-[^;-]++;?)++$) is the validation part. If the input is invalid, matching will fail immediately, but the lookahead still gets evaluated every time the regex is applied. In order to keep it (along with the rest of the regex) in sync with the key-value pairs, I used \G to anchor each match to the spot where the previous match ended.
This way, if the lookahead succeeds the first time, it's guaranteed to succeed every subsequent time. Obviously it's not as efficient as it could be, but that probably won't be a problem--only your testing can tell for sure.
If the lookahead fails, preg_match_all() will return zero (false). If it succeeds, the matches will be returned in an array of arrays: one for the full key-value pairs, one for the keys, one for the values.
regex is powerful tool, but sometimes, its not the best approach.
$string = "key-value;key1-value";
$s = explode(";",$string);
foreach($s as $k){
$e = explode("-",$k);
$array[$e[0]]=$e[1];
}
print_r($array);
Use preg_match_all() instead. Maybe something like:
$matches = $parameters = array();
$input = 'key-value;key1-value1;key2-value2;key123-value123;';
preg_match_all("/(\w+)-([^-;]+)/", $input, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
$parameters[$match[1]] = $match[2];
}
print_r($parameters);
EDIT:
to first validate if the input string conforms to the pattern, then just use:
if (preg_match("/^((\w+)-([^-;]+);)+$/", $input) > 0) {
/* do the preg_match_all stuff */
}
EDIT2: the final semicolon is optional
if (preg_match("/^(\w+-[^-;]+;)*\w+-[^-;]+$/", $input) > 0) {
/* do the preg_match_all stuff */
}
No. Newer matches overwrite older matches. Perhaps the limit argument of explode() would be helpful when exploding.
what about this solution:
$samples = array(
"good" => "key-value;key1-value;key2-value;key5-value;key-value;",
"bad1" => "key-value-value;key1-value;key2-value;key5-value;key-value;",
"bad2" => "key;key1-value;key2-value;key5-value;key-value;",
"bad3" => "k%ey;key1-value;key2-value;key5-value;key-value;"
);
foreach($samples as $name => $value) {
if (preg_match("/^(\w+-\w+;)+$/", $value)) {
printf("'%s' matches\n", $name);
} else {
printf("'%s' not matches\n", $name);
}
}
I don't think you can do both validation and extraction of data with one single regexp, as you need anchors (^ and $) for validation and preg_match_all() for the data, but if you use anchors with preg_match_all() it will only return the last set matched.

Regular Expressions: how to do "option split" replaces

those reqular expressions drive me crazy. I'm stuck with this one:
test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not
Task:
Remove all [[ and ]] and if there is an option split choose the later one so output should be:
test1:link test2:silver test3:out1insideout2 test4:this|not
I came up with (PHP)
$text = preg_replace("/\\[\\[|\\]\\]/",'',$text); // remove [[ or ]]
this works for part1 of the task. but before that I think I should do the option split, my best solution:
$text = preg_replace("/\\[\\[(.*\|)(.*?)\\]\\]/",'$2',$text);
Result:
test1:silver test3:[[out1[[inside]]out2]] this|not
I'm stuck. may someone with some free minutes help me? Thanks!
I think the easiest way to do this would be multiple passes. Use a regular expression like:
\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]
This will replace option strings to give you the last option from the group. If you run it repeatedly until it no longer matches, you should get the right result (the first pass will replace [[out1[[inside]]out2]] with [[out1insideout2]] and the second will ditch the brackets.
Edit 1: By way of explanation,
\[\[ # Opening [[
(?: # A non-matching group (we don't want this bit)
[^\[\]] # Non-bracket characters
* # Zero or more of anything but [
\| # A literal '|' character representing the end of the discarded options
)? # This group is optional: if there is only one option, it won't be present
( # The group we're actually interested in ($1)
[^\[\]] # All the non-bracket characters
+ # Must be at least one
) # End of $1
\]\] # End of the grouping.
Edit 2: Changed expression to ignore ']' as well as '[' (it works a bit better like that).
Edit 3: There is no need to know the number of nested brackets as you can do something like:
$oldtext = "";
$newtext = $text;
while ($newtext != $oldtext)
{
$oldtext = $newtext;
$newtext = preg_replace(regexp,replace,$oldtext);
}
$text = $newtext;
Basically, this keeps running the regular expression replace until the output is the same as the input.
Note that I don't know PHP, so there are probably syntax errors in the above.
This is impossible to do in one regular expression since you want to keep content in multiple "hierarchies" of the content. It would be possible otherwise, using a recursive regular expression.
Anyways, here's the simplest, most greedy regular expression I can think of. It should only replace if the content matches your exact requirements.
You will need to escape all backslashes when putting it into a string (\ becomes \\.)
\[\[((?:[^][|]+|(?!\[\[|]])[^|])++\|?)*]]
As others have already explained, you use this with multiple passes. Keep looping while there are matches, performing replacement (only keeping match group 1.)
Difference from other regular expressions here is that it will allow you to have single brackets in the content, without breaking:
test1:[[link]] test2:[[gold|si[lv]er]]
test3:[[out1[[in[si]de]]out2]] test4:this|not
becomes
test1:[[link]] test2:si[lv]er
test3:out1in[si]deout2 test4:this|not
Why try to do it all in one go. Remove the [[]] first and then deal with options, do it in two lines of code.
When trying to get something going favour clarity and simplicity.
Seems like you have all the pieces.
Why not just simply remove any brackets that are left?
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$str = preg_replace('/\\[\\[(?:[^|\\]]+\\|)+([^\\]]+)\\]\\]/', '$1', $str);
$str = str_replace(array('[', ']'), '', $str);
Well, I didn't stick to just regex, because I'm of a mind that trying to do stuff like this with one big regex leads you to the old joke about "Now you have two problems". However, give something like this a shot:
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not'; $reg = '/(.*?):(.*?)( |$)/';
preg_match_all($reg, $str, $m);
foreach($m[2] as $pos => $match) {
if (strpos($match, '|') !== FALSE && strpos($match, '[[') !== FALSE ) {
$opt = explode('|', $match); $match = $opt[count($opt)-1];
}
$m[2][$pos] = str_replace(array('[', ']'),'', $match );
}
foreach($m[1] as $k=>$v) $result[$k] = $v.':'.$m[2][$k];
This is C# using only using non-escaped strings, hence you will have to double the backslashes in other languages.
String input = "test1:[[link]] " +
"test2:[[gold|silver]] " +
"test3:[[out1[[inside]]out2]] " +
"test4:this|not";
String step1 = Regex.Replace(input, #"\[\[([^|]+)\|([^\]]+)\]\]", #"[[$2]]");
String step2 = Regex.Replace(step1, #"\[\[|\]\]", String.Empty);
// Prints "test1:silver test3:out1insideout2 test4:this|not"
Console.WriteLine(step2);
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$s = preg_split("/\s+/",$str);
foreach ($s as $k=>$v){
$v = preg_replace("/\[\[|\]\]/","",$v);
$j = explode(":",$v);
$j[1]=preg_replace("/.*\|/","",$j[1]);
print implode(":",$j)."\n";
}

Categories