Multiple patterns within regex - php

I have a json and I need to match all "text" keys as well as the "html" keys.
For example, the json could be like below:
[{
"layout":12,
"text":"Lorem",
"html":"<div>Ipsum</div>"
}]
Or it could be like below:
[{
"layout":12,
"settings":{
"text":"Lorem",
"atts":{
"html":"<div>Ipsum</div>"
}
}
}]
The json is not always using the same structure so I have to match the keys and get their values using preg_match_all. I have tried the following to get the value of the "text" key:
preg_match_all('|"text":"([^"]*)"|',$json,$match_txt,PREG_SET_ORDER);
The above works fine for matching a single key. When it comes to matching a second key ("html" in this case) it just doesn't work. I have tried the following:
preg_match_all('|"text|html":"([^"]*)"|',$json,$match_txt,PREG_SET_ORDER);
Can you please give me some hints why the OR operator (text|html) doesn't work? Strangely, the above (multi-pattern) regex works fine when I test it in an online tester but it doesn't work in my php files.

Fixing text|html
You should put text|html in a group. Without the group, the alternation splits the whole pattern in two, so it looks for either "text or html":"...".
|"(text|html)":"([^"]*)"|
Delimiters
This won't currently work with your delimiters though as you use the pipe (|) inside of the expression. You should change your delimiters to something else, here I've used /.
/"(text|html)":"([^"]*)"/
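Keeping the pipe as your delimiter is not a practical option here. Escaping the inner pipe only lets the pattern compile: an escaped pipe matches a literal | character, so the alternation is lost and the following matches only the literal key "text|html".
|"(text\|html)":"([^"]*)"|
preg_quote() does not help with the pattern as a whole either, because it escapes every metacharacter, including the parentheses and brackets. It is the right tool for the literal parts, such as key names that come from a variable.
$keys = implode('|', array_map(function ($k) { return preg_quote($k, '/'); }, ['text', 'html']));
preg_match_all('/"(' . $keys . ')":"([^"]*)"/', $json, $match_txt, PREG_SET_ORDER);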
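As a quick check of the grouped pattern with / delimiters, here is a minimal sketch that runs it against the nested example from the question. The names $pairs and $m are illustrative only, and the pattern still assumes that no value contains an escaped quote.

```php
<?php
// Nested sample taken from the question.
$json = '[{"layout":12,"settings":{"text":"Lorem","atts":{"html":"<div>Ipsum</div>"}}}]';

// Group 1 captures the key name, group 2 the raw (still JSON-encoded) value.
preg_match_all('/"(text|html)":"([^"]*)"/', $json, $matches, PREG_SET_ORDER);

$pairs = [];
foreach ($matches as $m) {
    $pairs[$m[1]] = $m[2];
}

print_r($pairs);
```

Capturing the key name in group 1 is what makes it possible to tell the two kinds of match apart afterwards.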
Parsing JSON
Although that regex will work, its matches need additional parsing, and it breaks on values that contain escaped quotes. Since the input is JSON, it makes more sense to decode it and search the result with a recursive function.
json_decode() decodes a JSON string into the corresponding PHP data types. In the example below I pass true as the second argument, which returns associative arrays where you would normally get objects.
findKeyData() checks each key at the current level and calls itself on every nested array until it finds the specified key, then returns that key's value. If the key is not found anywhere, it returns null.
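function findKeyData($data, $key) {
    foreach ($data as $k => $v) {
        if (is_array($v)) {
            // Search nested arrays; a separate variable avoids overwriting $data.
            $found = findKeyData($v, $key);
            if (! is_null($found)) {
                return $found;
            }
        }
        // Strict comparison: with ==, a numeric index such as 0 equals any non-numeric string on PHP < 8.
        if ($k === $key) {
            return $v;
        }
    }
    return null;
}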
$json1 = json_decode('[{
"layout":12,
"text":"Lorem",
"html":"<div>Ipsum</div>"
}]', true);
$json2 = json_decode('[{
"layout":12,
"settings":{
"text":"Lorem",
"atts":{
"html":"<div>Ipsum</div>"
}
}
}]', true);
var_dump(findKeyData($json1, 'text')); // Lorem
var_dump(findKeyData($json1, 'html')); // <div>Ipsum</div>
var_dump(findKeyData($json2, 'text')); // Lorem
var_dump(findKeyData($json2, 'html')); // <div>Ipsum</div>
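Since the question asks for all "text" and "html" keys, while findKeyData() stops at the first hit, a collecting variant may be closer to what is needed. The following is only a sketch: collectKeyData() is a hypothetical helper that is not part of the answer above, and it returns [key, value] pairs in document order.

```php
<?php
// Hypothetical helper: gather every value stored under any of $keys, at any depth.
function collectKeyData(array $data, array $keys): array
{
    $found = [];
    foreach ($data as $k => $v) {
        // Strict check so numeric list indexes never match a string key.
        if (in_array($k, $keys, true)) {
            $found[] = [$k, $v];
        }
        if (is_array($v)) {
            $found = array_merge($found, collectKeyData($v, $keys));
        }
    }
    return $found;
}

$decoded = json_decode('[{"text":"A","settings":{"text":"B","atts":{"html":"<p>C</p>"}}}]', true);
$all = collectKeyData($decoded, ['text', 'html']);
print_r($all);
```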

preg_match_all('/"(?:text|html)":"([^"]*)"/',$json,$match_txt,PREG_SET_ORDER);
print $match_txt[0][0]." with group 1: ".$match_txt[0][1]."\n";
print $match_txt[1][0]." with group 1: ".$match_txt[1][1]."\n";
returns:
$ php -f test.php
"text":"Lorem" with group 1: Lorem
"html":"<div>Ipsum</div>" with group 1: <div>Ipsum</div>
The enclosing parentheses are needed: (?:text|html). Without them the alternation splits the whole pattern in two, and I couldn't get it to work on https://regex101.com either. ?: means the content of the parentheses will not be captured (i.e., it is not available in the results).
I also replaced the pipe (|) delimiter with forward slashes, since you also have a pipe inside the regex. Escaping the inner pipe (\|) is not an alternative, because an escaped pipe matches a literal | instead of acting as the OR operator.

I don't see any reason to use a regex to parse a valid json string:
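// array_walk_recursive() takes its array by reference, so decode into a variable first.
$data = json_decode($json, true);
array_walk_recursive($data, function ($v, $k) {
    if (in_array($k, ['text', 'html'], true)) {
        echo "$k -> $v\n";
    }
});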

You use the pipe | character as the delimiter, and I think this breaks your regexp. Does it work with another delimiter, and with the alternation grouped, like
preg_match_all('#"(?:text|html)":"([^"]*)"#',$json,$match_txt,PREG_SET_ORDER);
?

Related

A quicker PHP regex test

I need to sanitize incoming JSON that typically bears the form
'["P4950Zp550","P4950Zp575","P4950Zp600","P5000Zp550","P5000Zp575","P5000Zp600","P4975Zp550","P4975Zp600"]'
with the number of digits following each P|M, p|m varying between 3 and 5. json_decoding this and then applying the test
preg_match('/(P|M){1}[0-9]{3,5}Z(p|m){1}[0-9]{3,5}/',$value)
eight times in a foreach loop (I always have eight values in the array) would be a trivial matter. However, I am wondering whether there is a regex I could write that does this in one go, without first having to json_decode the incoming string. The regex I have created here is at the limit of my knowledge of regular expressions.
Decode the JSON and then use a loop:
$json = '["P4950Zp550","P4950Zp,575","P4950Zp600","P5000Zp550","P5000Zp,575","P5000Zp600","P4975Zp550","P4975Zp600"]';
$array = json_decode($json, true);
foreach ($array as $value) {
    if (!preg_match('/^[PM]\d{3,5}Z[pm]\d{3,5}$/',$value)) {
        echo "Invalid value: $value<br>\n";
    }
}
Trying to parse the original JSON with a regexp is a bad idea.
If you truly, deeply want to validate the 8-element array in a single pass while it is still a json string, you can use this:
Pattern: ~^\["[PM]\d{3,5}Z[pm]\d{3,5}(?:","[PM]\d{3,5}Z[pm]\d{3,5}){7}"]$~
This matches the first element, then the seven that follow, all wrapped in the quotes and brackets of the JSON array.
Code:
$string = '["P4950Zp550","P4950Zp575","P4950Zp600","P5000Zp550","P5000Zp575","P5000Zp600","P4975Zp550","P4975Zp600"]';
if (preg_match('~^\["[PM]\d{3,5}Z[pm]\d{3,5}(?:","[PM]\d{3,5}Z[pm]\d{3,5}){7}"]$~', $string)) {
    echo "pass";
} else {
    echo "fail";
}
// outputs: pass
Sometimes, you just want to bulk validate input :]
See if this helps
(\"[PM]\d{3,5}Z[pm](\,)?\d{3,5}\"(\,)?)*
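Here the expression is enclosed in ( )*, which groups the inner expression and matches any number of occurrences (*) of it. You can include the square brackets as well if you prefer. Note that without ^ and $ anchors the * also lets the pattern match an empty string, so anchor it if you use it for validation.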
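If the array length is not fixed at eight, the two ideas above can be combined into an anchored pattern that accepts one or more items. This is a sketch under the assumption that the incoming JSON contains no whitespace between tokens; $item and $pattern are illustrative names.

```php
<?php
// One array item, e.g. P4950Zp550.
$item = '[PM]\d{3,5}Z[pm]\d{3,5}';

// A JSON list of one or more quoted items, anchored at both ends.
$pattern = '~^\["' . $item . '"(?:,"' . $item . '")*\]$~';

$good = '["P4950Zp550","M123Zm12345"]';
$bad  = '["P4950Zp550","P4950Zp,575"]';

var_dump(preg_match($pattern, $good)); // int(1)
var_dump(preg_match($pattern, $bad));  // int(0)
```

Building the pattern from a single $item string keeps the per-item rule in one place, so changing the digit range only needs one edit.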

Problems with Empty Strings. Empty() not working

I'm building a form that will create a Word doc. In one part the user creates a list and separates the list lines with " | " (vertical bar) as a delimiter. I'm trying to explode() a string like this: "First line| Second line| Third and last line |". As you can see, I placed a vertical bar delimiter after the last line, because the user will probably make this mistake, and it generates an empty line in the list.
I'm trying to avoid this error by using something like this:
$lines = explode("|",$lines);
for($a=0;$a<count($lines);$a++)
{
    if(!empty($lines[$a]) or !ctype_space($lines[$a]))
    {
        //generate the line inside the Word Doc
    }
}
This code works when I create the string myself while testing, but not when the string comes from a form: it keeps generating an empty list line inside the Word doc.
When I var_dump() the $lines array it shows the last key as: [2]=> string(0) ""
I'm using Laravel and the form was created with the Form:: facade (I don't know if this matters, probably not).
If you could help me, I'd appreciate it.
Alternatively just use array_filter with the callback of trim to remove elements that are empty or contain spaces before you iterate.
<?php
$string = '|foo|bar|baz|bat|';
$split = explode('|', $string);
$split = array_filter($split, 'trim');
var_export($split);
Output:
array (
1 => 'foo',
2 => 'bar',
3 => 'baz',
4 => 'bat',
)
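However, this also removes values you might want to keep: array_filter() drops every element for which the callback returns a falsy value, so a line containing only "0" is removed too.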
You could just trim off your pipes to begin with:
<?php
$string = '|foo|bar|baz|bat|';
$string = trim($string, '|');
$split = explode('|', $string);
var_export($split);
The output contains the same four values, this time keyed 0 to 3.
Or use rtrim() if only the trailing pipe should be removed.
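If blank entries can also appear in the middle of the list, one option is to trim every piece and compare against the empty string explicitly. This is a sketch with a made-up input string; unlike array_filter() with 'trim' as the callback, it keeps a line that contains only "0".

```php
<?php
$string = 'First line| 0 |   | Third and last line |';

// Trim each piece, then drop only the pieces that are truly empty strings.
$lines = array_values(array_filter(
    array_map('trim', explode('|', $string)),
    function ($line) {
        return $line !== '';
    }
));

var_export($lines);
```

array_values() reindexes the result from 0, which matters if the list is later walked with a for loop.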
You may want to use PHP's && (and) rather than or.
For reference, see Logical Operators.
You only want to output the line if both empty() and ctype_space() return false. With your current code, blank strings will pass your if test because ctype_space() returns false even though empty() does not. Strings made up entirely of spaces will also pass because empty() returns false even though ctype_space() does not.
To correct the logic:
if(!empty($lines[$a]) && !ctype_space($lines[$a])) { ... }
Alternatively, I'd suggest trimming white space from the string before checking empty():
$lines = explode("|",$lines);
if (!empty($lines)) {
    foreach ($lines as $line) {
        if (!empty(trim($line))) {
            // output this line
        }
    }
}
Also, see 'AND' vs '&&' as operator.
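To see the difference between the two conditions, the following sketch evaluates both against an empty string, a whitespace-only string and a normal string. The sample values and the $results array are illustrative.

```php
<?php
// Compare the original condition (or) with the corrected one (&&).
$samples = ['', '   ', 'text'];
$results = [];

foreach ($samples as $s) {
    $withOr  = (!empty($s) or !ctype_space($s));
    $withAnd = (!empty($s) && !ctype_space($s));
    $results[$s] = [$withOr, $withAnd];
    printf("%-8s or:%s and:%s\n", var_export($s, true), var_export($withOr, true), var_export($withAnd, true));
}
```

Only the && version rejects both the empty string and the whitespace-only string.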

Regular expression match formatted text in PHP

I have formatted text like this:
Record
name=aaa
age=16
info=blabla bla
Record
name=bbb
age=15
info=foo bar foo bar
I would like to convert it into arrays with a regular expression in PHP. So far I've tried:
preg_match_all("/Record.*\n(?m:^(.+)=(.+)$)+/",$text,$matches);
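But it only catches "Record name=aaa" and "Record name=bbb".
I'm wondering why the + does not work in this case. How should I form my pattern here?
You have not matched the newlines after the first one. Move the \n inside the (?m:...) section. Note that a repeated capturing group keeps only its last iteration, so you will still need a second step to collect every key=value pair.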
This will do it.
$data = array_values(array_map(
    function($e){
        preg_match_all('/(.*?)=([^\r\n]*)/', $e, $m);
        return array_combine($m[1], $m[2]);
    },
    array_filter(explode("Record", $text))
));
First it splits the whole data on the Record delimiter using explode, and array_filter drops the empty chunks. Then, for each chunk, it extracts the key-value pairs using preg_match_all and builds an associative array with array_combine. Finally, array_values reindexes the result from 0.
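A variant of the same split-then-extract idea, sketched here against the sample data from the question, splits only on lines that consist of the word Record, so a value that happens to contain "Record" is left alone. It uses preg_split() in place of explode(), which is a substitution on my part, and the variable names are illustrative.

```php
<?php
$text = "Record\nname=aaa\nage=16\ninfo=blabla bla\nRecord\nname=bbb\nage=15\ninfo=foo bar foo bar\n";

$records = [];
// Split on lines that consist solely of the word "Record".
foreach (preg_split('/^Record\s*$/m', $text, -1, PREG_SPLIT_NO_EMPTY) as $chunk) {
    // In multiline mode, ^ and $ anchor each key=value line.
    if (preg_match_all('/^([^=\r\n]+)=(.*)$/m', $chunk, $m)) {
        // trim() removes a trailing \r if the text uses Windows line endings.
        $records[] = array_combine($m[1], array_map('trim', $m[2]));
    }
}

print_r($records);
```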

php: "sscanf" to 'consume' a string but allows a missing parameter

This is for an osCommerce contribution called
("Automatically add multiple products with attribute to cart from external source")
This existing code uses sscanf to 'explode' a string that represents a
- product ID,
- a productOption,
- and quantity:
sscanf('28{8}17[1]', '%d{%d}%d[%f]',
$productID, // 28
$productOptionID, $optionValueID, //{8}17 <--- Product Options!!!
$productQuantity //[1]
);
This works great if there is only 1 'set' of Product Options (e.g. {8}17).
But this procedure needs to be adapted so that it can handle multiple Product Options, and put them into an array, e.g.:
'28{8}17{7}15{9}19[1]' //array(8=>17, 7=>15, 9=>19)
OR
'28{8}17{7}15[1]' //array(8=>17, 7=>15)
OR
'28{8}17[1]' //array(8=>17)
Thanks in advance. (I'm a pascal programmer)
You should not try to do complex recursive parses with one sscanf. Stick it in a loop. Something like:
<?php
$str = "28{8}17{7}15{9}19[1]";
#$str = "28{8}17{7}15[1]";
#$str = "28{8}17[1]";
sscanf($str,"%d%s",$prod,$rest);
printf("Got prod %d\n", $prod);
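// Continue only while all three parts were read; otherwise $rest is not updated and the loop never ends.
while (sscanf($rest,"{%d}%d%s",$opt,$id,$rest) === 3)
{
    printf("opt=%d id=%d\n",$opt,$id);
}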
sscanf($rest,"[%d]",$quantity);
printf("Got qty %d\n",$quantity);
?>
Regular expressions may be an interesting alternative:
$a = '28{8}17{7}15{9}19[1]';
$matches = null;
preg_match_all('~\\{[0-9]{1,3}\\}[0-9]{1,3}~', $a, $matches);
To get the remaining parts:
$id = (int) $a; // ;)
$quantity = substr($a, strrpos($a, '[')+ 1, -1);
A small update, following the comment:
$a = '28{8}17{7}15{9}19[1]';
$matches = null;
preg_match_all('~\\{([0-9]{1,3})\\}([0-9]{1,3})~', $a, $matches, PREG_SET_ORDER);
$result = array();
foreach ($matches as $entry) {
    $result[$entry[1]] = $entry[2];
}
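The pieces above can be combined into one function that also rejects malformed input. This is a minimal sketch: parseCartString() is a hypothetical name, it assumes at least one option pair is present, and it treats the quantity as an integer (the original sscanf format used %f).

```php
<?php
// Hypothetical helper: returns id, options and quantity, or null when the shape is wrong.
function parseCartString(string $s): ?array
{
    // Whole-string check first: id, one or more {option}value pairs, [quantity].
    if (!preg_match('~^(\d+)((?:\{\d+\}\d+)+)\[(\d+)\]$~', $s, $m)) {
        return null;
    }

    // Then pull the individual pairs out of the already validated middle part.
    preg_match_all('~\{(\d+)\}(\d+)~', $m[2], $pairs, PREG_SET_ORDER);

    $options = [];
    foreach ($pairs as $pair) {
        $options[(int) $pair[1]] = (int) $pair[2];
    }

    return ['id' => (int) $m[1], 'options' => $options, 'quantity' => (int) $m[3]];
}

var_export(parseCartString('28{8}17{7}15{9}19[1]'));
var_export(parseCartString('28{8}17{7[1]')); // missing bracket: NULL
```

Validating the whole string before extracting means a dropped bracket yields null instead of a partial result.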
sscanf() is not the ideal tool for this task because it doesn't handle recurring patterns and I don't see any real benefit in type casting or formatting the matched subexpressions.
If this was purely a text extraction task (in other words your incoming data was guaranteed to be perfectly formatted and valid), then I could have recommended a cute solution that used strtr() and parse_str() to quickly generate a completely associative multi-dimensional output array.
However, when you commented "with sscanf I had an infinite loop if there is a missing bracket in the string (because it looks for open and closing {}s). Or if I leave out a value. But with your regex solution, if I drop a bracket or leave out a value", then this means that validation is an integral component of this process.
For that reason, I'll recommend a regex pattern that both validates the string and breaks the string into its meaningful parts. There are several logical aspects to the pattern but the hero here is the \G metacharacter that allows the pattern to "continue" matching where the pattern last finished matching in the string. This way we have an array of continuous fullstring matches to pull data from when creating your desired multidimensional output.
The pattern ^\d+(?=.*\[\d+]$)|\G(?!^)(?:{\K\d+}\d+|\[\K\d+(?=]$)) in preg_match_all() generates the following type of output in the fullstring element ([0]):
[id], [option0, option1, ...](optional), [quantity]
The first branch in the pattern (^\d+(?=.*\[\d+]$)) validates that the string starts with the id number and ends with a number wrapped in square brackets, which represents the quantity.
The second branch begins with the "continue" metacharacter and contains two logical branches itself. The first matches an option expression (and forgets the leading { thanks to \K) and the second matches the number in the quantity expression, which may have more than one digit.
To create the associative array of options, target the "middle" elements (if there are any), then split the strings on the lingering } and assign these values as key-value pairs.
This is a direct solution because it only uses one preg_ call and it does an excellent job of validating and parsing the variable length data.
Code:
if (!preg_match_all('~^\d+(?=.*\[\d+]$)|\G(?!^)(?:{\K\d+}\d+|\[\K\d+(?=]$))~', $test, $m)) {
    echo "invalid input";
} else {
    var_export(
        [
            'id' => array_shift($m[0]),
            'quantity' => array_pop($m[0]),
            'options' => array_reduce(
                $m[0],
                function($result, $string) {
                    [$key, $result[$key]] = explode('}', $string, 2);
                    return $result;
                },
                []
            )
        ]
    );
}

Get more backreferences from regexp than parenthesis

Ok this is really difficult to explain in English, so I'll just give an example.
I am going to have strings in the following format:
key-value;key1-value;key2-...
and I need to extract the data to be an array
array('key'=>'value','key1'=>'value1', ... )
I was planning to use regexp to achieve (most of) this functionality, and wrote this regular expression:
/^(\w+)-([^-;]+)(?:;(\w+)-([^-;]+))*;?$/
to work with preg_match and this code:
for ($l = count($matches),$i = 1;$i<$l;$i+=2) {
    $parameters[$matches[$i]] = $matches[$i+1];
}
However the regexp obviously returns only 4 backreferences - first and last key-value pairs of the input string. Is there a way around this? I know I can use regex just to test the correctness of the string and use PHP's explode in loops with perfect results, but I'm really curious whether it's possible with regular expressions.
In short, I need to capture an arbitrary number of these key-value; pairs in a string by means of regular expressions.
You can use a lookahead to validate the input while you extract the matches:
/\G(?=(?:\w++-[^;-]++;?)++$)(\w++)-([^;-]++);?/
(?=(?:\w++-[^;-]++;?)++$) is the validation part. If the input is invalid, matching will fail immediately, but the lookahead still gets evaluated every time the regex is applied. In order to keep it (along with the rest of the regex) in sync with the key-value pairs, I used \G to anchor each match to the spot where the previous match ended.
This way, if the lookahead succeeds the first time, it's guaranteed to succeed every subsequent time. Obviously it's not as efficient as it could be, but that probably won't be a problem--only your testing can tell for sure.
If the lookahead fails, preg_match_all() will return 0 (it returns false only on error). If it succeeds, the matches will be returned in an array of arrays: one for the full key-value pairs, one for the keys, one for the values.
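As a sketch of how that pattern could be used, the following runs it against one valid and one invalid input and builds the key => value array. It passes PREG_SET_ORDER so that each match arrives as one [full, key, value] entry; the sample strings and the $results array are assumptions for illustration.

```php
<?php
$pattern = '/\G(?=(?:\w++-[^;-]++;?)++$)(\w++)-([^;-]++);?/';
$results = [];

foreach (['key-value;key1-value1;key2-value2;', 'key-value;broken;key2-value2'] as $input) {
    $parameters = [];
    // Zero matches means the lookahead rejected the input at the very first position.
    if (preg_match_all($pattern, $input, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $parameters[$match[1]] = $match[2];
        }
    }
    $results[$input] = $parameters;
    var_export($parameters);
    echo "\n";
}
```

Because \G pins the first attempt to the start of the string, an invalid pair anywhere in the input yields no matches at all and not a partial list.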
Regex is a powerful tool, but sometimes it's not the best approach.
$string = "key-value;key1-value";
$s = explode(";",$string);
foreach($s as $k){
    $e = explode("-",$k);
    $array[$e[0]]=$e[1];
}
print_r($array);
Use preg_match_all() instead. Maybe something like:
$matches = $parameters = array();
$input = 'key-value;key1-value1;key2-value2;key123-value123;';
preg_match_all("/(\w+)-([^-;]+)/", $input, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
    $parameters[$match[1]] = $match[2];
}
print_r($parameters);
EDIT:
to first validate if the input string conforms to the pattern, then just use:
if (preg_match("/^((\w+)-([^-;]+);)+$/", $input) > 0) {
    /* do the preg_match_all stuff */
}
EDIT2: if the final semicolon is optional, use:
if (preg_match("/^(\w+-[^-;]+;)*\w+-[^-;]+$/", $input) > 0) {
    /* do the preg_match_all stuff */
}
No. Newer matches overwrite older matches. Perhaps the limit argument of explode() would be helpful when exploding.
What about this solution:
$samples = array(
"good" => "key-value;key1-value;key2-value;key5-value;key-value;",
"bad1" => "key-value-value;key1-value;key2-value;key5-value;key-value;",
"bad2" => "key;key1-value;key2-value;key5-value;key-value;",
"bad3" => "k%ey;key1-value;key2-value;key5-value;key-value;"
);
foreach($samples as $name => $value) {
    if (preg_match("/^(\w+-\w+;)+$/", $value)) {
        printf("'%s' matches\n", $name);
    } else {
        printf("'%s' does not match\n", $name);
    }
}
I don't think you can do both validation and extraction of data with one single regexp, as you need anchors (^ and $) for validation and preg_match_all() for the data, but if you use anchors with preg_match_all() it will only return the last set matched.
