Parse text and populate associative array from two substrings per line - php

Given a large string of text, I want to search for the following patterns:
#key: value
So an example is:
some crazy text
more nonesense
#first: first-value;
yet even more non-sense
#second: second-value;
finally more non-sense
The output should be:
array("first" => "first-value", "second" => "second-value");

<?php
$string = 'some crazy text
more nonesense
#first: first-value;
yet even more non-sense
#second: second-value;
finally more non-sense';
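// "#" is part of the pattern, so a different delimiter (~) is used for the regex.
preg_match_all('~#(.*?): (.*?);~is', $string, $matches);
$return = array();
$count = count($matches[0]);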
for($i = 0; $i < $count; $i++)
{
$return[$matches[1][$i]] = $matches[2][$i];
}
print_r($return);
?>
Link http://ideone.com/fki3U
Array (
[first] => first-value
[second] => second-value )

Tested in PHP 5.3:
// set-up test string and final array
$myString = "#test1: test1;#test2: test2;";
$myArr = array();
// do the matching
preg_match_all('/#([^:]+):\s*([^;]+);/', $myString, $matches);
// put elements of $matches in array here
$actualMatches = count($matches[0]);
for ($i=0; $i<$actualMatches; $i++) {
$myArr[$matches[1][$i]] = $matches[2][$i];
}
print_r($myArr);
The reasoning behind this is as follows:
The regex creates two capture groups. One capture group is the key, the
other the data for that key. The capture groups are the portions of the regex
inside parentheses, i.e., (...). The \s* after the colon keeps the leading
space out of the captured value.
$actualMatches is the number of matches found. preg_match_all returns
$matches[0] with the full matches plus one further sub-array per capture group,
so count($matches[0]) gives the match count; count($matches) would only count
the sub-arrays.
Demo.

Match whole qualifying lines starting with # and ending with ;.
Capture the substring that does not contain any colons as the first group and capture the substring between the space after the colon and the semicolon at the end of the line.
By using the any character dot in the second capture group, the substring may contain a semicolon without damaging any extracted data.
Call array_combine() to form key-value relationships between the two capture groups.
Code: (Demo)
preg_match_all(
'/^#([^:]+): (.+);$/m',
$text,
$m
);
var_export(array_combine($m[1], $m[2]));
Output:
array (
'first' => 'first-value',
'second' => 'second-value',
)

You can try looping over the string line by line (explode and foreach) and checking whether each line starts with a # (substr). If it does, explode the line by :.
http://php.net/manual/en/function.explode.php
http://nl.php.net/manual/en/control-structures.foreach.php
http://nl.php.net/manual/en/function.substr.php
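
The loop described above can be sketched as follows. This is a minimal sketch, not the answerer's code: the function name parseTaggedLines() is hypothetical, and it assumes each qualifying line looks like "#key: value;" with the key free of colons.

```php
<?php
// Hypothetical helper illustrating the explode/foreach/substr approach.
function parseTaggedLines(string $text): array
{
    $result = [];
    foreach (explode("\n", $text) as $line) {
        $line = trim($line);                      // also drops a trailing "\r"
        if (substr($line, 0, 1) !== '#') {
            continue;                             // not a "#key: value" line
        }
        // Split on the first colon only, so the value may contain colons.
        $parts = explode(':', substr($line, 1), 2);
        if (count($parts) !== 2) {
            continue;                             // "#" line without a colon
        }
        // Trim whitespace and the terminating semicolon from the value.
        $result[trim($parts[0])] = rtrim(trim($parts[1]), ';');
    }
    return $result;
}

$text = "some crazy text\n#first: first-value;\nyet even more non-sense\n#second: second-value;";
print_r(parseTaggedLines($text));
```

Splitting with a limit of 2 is a design choice here: it keeps any later colons inside the value instead of cutting it short.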

Depending on what your input string looks like, you might be able to simply use parse_ini_string, or make some small changes to the string then use the function.
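
One way the "small changes" could look is sketched below. It is only an illustration under assumptions: the helper name parseViaIni() is made up, only lines of the form "#key: value;" are kept, and keys are assumed not to be INI reserved words or to contain special INI characters.

```php
<?php
// Hypothetical helper: rewrite qualifying lines as "key=value" and let
// parse_ini_string() build the array.
function parseViaIni(string $text): array
{
    $iniLines = [];
    foreach (preg_split('/\R/', $text) as $line) {
        // Keep only "#key: value;" lines and drop the "#" and trailing ";".
        if (preg_match('/^#([^:=]+): (.+);$/', $line, $m)) {
            $iniLines[] = $m[1] . '=' . $m[2];
        }
    }
    // INI_SCANNER_RAW leaves the values as plain strings.
    $parsed = parse_ini_string(implode("\n", $iniLines), false, INI_SCANNER_RAW);
    return $parsed === false ? [] : $parsed;
}

$text = "some crazy text\n#first: first-value;\nmore text\n#second: second-value;";
print_r(parseViaIni($text));
```

Since the lines already have to be filtered with a regex, this route mostly pays off when the input is close to INI syntax to begin with.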

Related

PHP: preg_replace only first matching string in array

I've started using preg_replace in PHP and I wonder how I can replace only the first matching array key with the specified array value. I set the limit parameter of preg_replace to 1, but it still makes more than one change. I also split my string into single words and examine them one by one:
<?php
$internal_message = 'Hey, this is awesome!';
$words = array(
'/wesome(\W|$)/' => 'wful',
'/wful(\W|$)/' => 'wesome',
'/^this(\W|$)/' => 'that',
'/^that(\W|$)/' => 'this'
);
$splitted_message = preg_split("/[\s]+/", $internal_message);
$words_num = count($splitted_message);
for($i=0; $i<$words_num; $i++) {
$splitted_message[$i] = preg_replace(array_keys($words), array_values($words), $splitted_message[$i], 1);
}
$message = implode(" ", $splitted_message);
echo $message;
?>
I want this to be on output:
Hey, that is awful
(one suffix change, one word change and stops)
Not this:
Hey, this is awesome
(two suffix changes, two word changes and back to original word & suffix...)
Maybe I can simplify this code? I also can't change the order of the array keys and values, because more suffixes and single words will be added soon. I'm fairly new to PHP and I'd be thankful for any help ;>
You can use plain-text keys in the associative array, build a dynamic regex pattern from them, and use preg_replace_callback to replace every found value with its replacement in a single pass.
$internal_message = 'Hey, this is awesome!';
$words = array(
'wesome' => 'wful',
'wful' => 'wesome',
'this' => 'that',
'that' => 'this'
);
$rx = '~(?:' . implode("|", array_keys($words)) . ')\b~';
echo "$rx\n";
$message = preg_replace_callback($rx, function($m) use ($words) {
return isset($words[$m[0]]) ? $words[$m[0]] : $m[0];
}, $internal_message);
echo $message;
// => Hey, that is awful!
See the PHP demo.
The regex is
~(?:wesome|wful|this|that)\b~
The (?:wesome|wful|this|that) is a non-capturing group that matches any of the values inside, and \b is a word boundary, a non-consuming pattern that ensures there is no letter, digit or _ after the suffix.
preg_replace_callback scans the string once. Each match is passed to the anonymous function (function($m)) together with the $words array (use ($words)). If $words contains the matched text as a key (isset($words[$m[0]])), the corresponding value is returned ($words[$m[0]]); otherwise the match itself is returned unchanged ($m[0]). Because replaced text is never scanned again, a replacement cannot be swapped back by a later rule.
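
If the keys may ever contain regex metacharacters, the same idea can be wrapped in a small function that escapes them first. This is a sketch only; swapWords() is a hypothetical name and the preg_quote() step is an addition that the answer above does not include.

```php
<?php
// Hypothetical wrapper around the single-pass replacement shown above.
function swapWords(string $message, array $words): string
{
    // Escape each key so characters such as "." or "+" are matched literally.
    $alternatives = array_map(function ($word) {
        return preg_quote($word, '~');
    }, array_keys($words));

    $rx = '~(?:' . implode('|', $alternatives) . ')\b~';

    // Each match is looked up once; replaced text is never re-scanned.
    return preg_replace_callback($rx, function ($m) use ($words) {
        return $words[$m[0]] ?? $m[0];
    }, $message);
}

$words = [
    'wesome' => 'wful',
    'wful'   => 'wesome',
    'this'   => 'that',
    'that'   => 'this',
];
echo swapWords('Hey, this is awesome!', $words), "\n";
```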

How to get equal parts of multiple strings/array?

I have the following problem: an xls file contains one column with codes. Each code has a prefix and a unique part, like this:
- VIP-AX757
- VIP-QBHE6
- CODE-IUEF7
- CODE-QDGF3
- VIP-KJQFB
- ...
How can I get the equal parts of strings or of an array? Ideally I would get an array like this:
- $result[VIP] = 3;
- $result[CODE] = 2;
An array with the found prefix and the sum of cells with that prefix. But the result is not so important at the moment.
I couldn't find a solution for getting the equal parts of two strings: how do I compare "VIP-AX757" and "VIP-QBHE6" and get a result that says "VIP-" is the shared prefix of these two strings?
Hope someone has an idea.
thx!
-drum roll- Time for a one-liner!
$result = array_count_values(array_map(function($v) {list($a) = explode("-",$v); return $a;},$input));
(Assumes $input is your array of codes)
If you are using PHP 5.4 or newer (you should be), then:
$result = array_count_values(array_map(function($v) {return explode("-",$v)[0];},$input));
Tested in PHP CLI:
If the prefix is always followed by a '-' then you can do something like this:-
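$result = array();
foreach ($codes as $code) {
$tmp = explode("-",$code);
// initialise the counter on first sight to avoid an "undefined index" notice
if (!isset($result[$tmp[0]])) {
$result[$tmp[0]] = 0;
}
$result[$tmp[0]] += 1;
}
print_r($result);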
Depends on the variability of the data, but something like:
preg_match_all('/^([^-]+)/m', $string, $matches);
$result = array_count_values($matches[1]);
print_r($result);
If you don't know that there is an - after the prefix but the prefix is always letters then:
preg_match_all('/^([A-Z]+)/im', $string, $matches);
$result = array_count_values($matches[1]);
Otherwise you'll have to define exactly what the prefix can contain if it's not the delimiter.
Since you stated via comment to Niet that you don't have a reliable delimiter, then we can only write a pattern that identifies your targeted substrings based on their location in each line.
I recommend preg_match_all() with no capture group, a start of the line anchor, and a multi-line pattern modifier (m).
I've written a preg_split() alternative, but the pattern is a little "clunkier" because of the way I'm handling the line returns.
Code: (Demo)
$string = 'VIP-AX757
VIP-QBHE6
CODE-IUEF7
CODE-QDGF3
VIP-KJQFB';
var_export(array_count_values(preg_match_all('~^[A-Z]+~m', $string, $out) ? $out[0] : []));
echo "\n\n";
var_export(array_count_values(preg_split('~[^A-Z][^\r\n]+\R?~', $string, -1, PREG_SPLIT_NO_EMPTY)));
Output:
array (
'VIP' => 3,
'CODE' => 2,
)
array (
'VIP' => 3,
'CODE' => 2,
)
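
The question also asks how to find the part that two strings have in common at the start when no delimiter is known. A minimal sketch follows; commonPrefix() is a hypothetical helper, not taken from any of the answers, and it compares bytes, so it assumes single-byte (ASCII) codes.

```php
<?php
// Hypothetical helper: longest shared leading substring of two strings.
function commonPrefix(string $a, string $b): string
{
    $max = min(strlen($a), strlen($b));
    $i = 0;
    // Advance while both strings have the same byte at position $i.
    while ($i < $max && $a[$i] === $b[$i]) {
        $i++;
    }
    return substr($a, 0, $i);
}

echo commonPrefix('VIP-AX757', 'VIP-QBHE6'), "\n";
```

Note that two codes can share more than the real prefix by coincidence (for example "VIP-AX757" and "VIP-AB123" share "VIP-A"), so a known delimiter or character class remains the more reliable basis for counting.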

Efficient way to parse this string into array in PHP?

Background
I have an array which I create by splitting a string based on every occurrence of 0d0a using preg_split('/(?<=0d0a)(?!$)/').
For example:
$string = "78781110d0a78782220d0a";
will be split into:
Array ( [0] => 78781110d0a [1] => 78782220d0a )
A valid array element has to start with 7878 and end with 0d0a.
The Problem
But sometimes, there's an additional 0d0a in the string which splits into an extra and invalid array element, i.e., that doesn't begin with 7878.
Take this string for example:
$string = "78781110d0a2220d0a78783330d0a";
This is split into:
Array ( [0] => 78781110d0a [1] => 2220d0a [2] => 78783330d0a )
But it should actually be:
Array ( [0] => 78781110d0a2220d0a [1] => 78783330d0a)
My Solution
I've written the following (messy) code to get around this:
$data = Array('78781110d0a','2220d0a','78783330d0a');
$i = 0; //count for $data array;
$j = 0; //count for $dataFixed array;
$dataFixed = $data;
foreach($data as $packet) {
if (substr($packet,0,4) != "7878") { //if packet doesn't start with 7878, do some fixing
if ($i != 0) { //skip the first packet: there is no previous packet to merge it into
$j++;
if ((substr(strtolower($packet), -4, 4) == "0d0a")) { //only merge packets that end with 0d0a; anything else is 'mostly' not valid and is discarded
$dataFixed[$i-$j] = $dataFixed[$i-$j] . $packet;
}
unset($dataFixed[$i-$j+1]);
$dataFixed = array_values($dataFixed);
}
}
$i++;
}
Description
I first copy the array to another array, $dataFixed. In a foreach loop over the $data array, I check whether each element starts with 7878. If it doesn't, I join it with the previous element. I then unset the current element in $dataFixed and reindex the array with array_values.
But I'm not very confident about this solution. Is there a better, more efficient way?
UPDATE
What if the input string doesn't end in 0d0a like it's supposed to? It will stick to the previous array element.
For example, in the string 78781110d0a2220d0a78783330d0a0000, 0000 should be separated as another array element.
Add a positive lookahead (?=7878) after the lookbehind to form:
preg_split('/(?<=0d0a)(?=7878)/',$string)
Note: I removed (?!$) because I wasn't sure what that was for, based on your example data.
For example, this code:
$string = "78781110d0a2220d0a78783330d0a";
$array = preg_split('/(?<=0d0a)(?=7878)/',$string);
print_r($array);
Results in:
Array ( [0] => 78781110d0a2220d0a [1] => 78783330d0a )
UPDATE:
Based on your revised question of having possible random characters at the end of the input string, you can add three lines to make a complete program of:
$string = "78781110d0a2220d0a787830d0a330d0a0000";
$array = preg_split('/(?<=0d0a)(?=7878)/',$string);
$temp = preg_split('/(7878.*0d0a)/',$array[count($array)-1],-1,PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);
$array[count($array)-1] = $temp[0];
if(count($temp)>1) { $array[] = $temp[1]; }
print_r($array);
We basically do the initial splitting, then split the last element of the resulting array by the expected data format, keeping the delimiter using PREG_SPLIT_DELIM_CAPTURE. The PREG_SPLIT_NO_EMPTY ensures we won't get an empty array element if the input string doesn't end in random characters.
UPDATE 2:
Based on your comment below where it seems you're implying there might be random characters between any of the desired matches, and you want these random characters preserved, you could do this:
$string = "0078781110d0a2220d0a2220d0a0000787830d0a330d0a000078781110d0a2220d0a0000787830d0a330d0a0000";
$split1 = preg_split('/(7878.*?0d0a)/',$string,-1,PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);
$result = array();
foreach($split1 as $e){
$split2 = preg_split('/(.*0d0a)/',$e,-1,PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);
foreach($split2 as $el){
// test if $el doesn't start with 7878 and ends with 0d0a
if(strpos($el,'7878') !== 0 && substr($el,-4) == '0d0a'){
//if(preg_match('/^(?!7878).*0d0a$/',$el) === 1){
$result[ count($result)-1 ] = $result[ count($result)-1 ] . $el;
} else {
$result[] = $el;
}
}
}
print_r($result);
The strategy employed here is different than above. First we split the input string based on the delimiter that matches your desired data, using the nongreedy regex .*?. At this point we have some strings that contain the ending of a desired value and some garbage at the end, so we split again based on the last occurrence of "0d0a" with the greedy regex .*0d0a. We then append any of those resulting values that don't start with "7878" but end with "0d0a" to the previous value, as this should repair the first and second halves that got split because it contained an extra "0d0a".
I provided two methods for the innermost if statement, one using regular expressions. The regex one is marginally slower in my testing, so I've left that one commented out.
I might still not have your full requirements, so you'll have to let me know if it works and perhaps provide your full dataset.
I think you are using a delimiter, "0d0a", which also happens to be part of the content. It is not possible to avoid junk data as long as the delimiter can also occur inside the content. The delimiter must be unique.
Possible solutions:
Change the delimiter to something that doesn't occur as part of your data (000000, #!.;).
If you know the exact length each array item may have, use it. Judging by your examples, that is not possible.
The solutions given in the other answers only consider the sample data you have shared. If you are confident about what the string will contain, those solutions are good enough to use. Otherwise they give you no guarantee.
Best solution: fix the delimiter first, then use regex or explode, whichever you prefer.
Why don't you use preg_match_all instead? You no longer need the zero-width assertions (the lookaheads and lookbehinds) that preg_split requires to split the string without removing the matched text, and you can just find the matches you're looking for:
Updated
<?php
$string = "00787817878110d0a22278780d0a78783330d0a00";
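// (?:(?!7878).)*$ means "no further 7878 before the end of the string";
// a character class such as [^(7878)] would only exclude the single characters (, 7, 8 and ).
preg_match_all('/7878.*?0d0a(?=7878|(?:(?!7878).)*$)/', $string, $arr);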
print_r($arr);
?>
This gives $arr[0] => ( [0] => 787817878110d0a22278780d0a, [1] => 78783330d0a ). It strips leading and trailing garbage characters (whatever comes before the first 7878 or after the last 0d0a).
So $arr[0] would be the array of values that you are looking for.
See example on ideone
Works with multiple 7878 values and multiple 0d0a values (even though that's ridiculous).
Update
If splitting is more your style, why not avoid regular expressions altogether?
<?php
$string = "787817878110d0a22278780d0a78783330d0a";
$arr = explode('0d0a7878', $string);
$string = implode('0d0a,7878', $arr);
$arr = explode(',', $string);
print_r($arr);
?>
Here we split the string by the delimiter 0d0a7878, which is what @CharlieGorichanaz's solution is doing, and props to him for the quick, accurate solution. We then add a comma, because who doesn't love comma-separated values? Finally we explode again on the commas to get an array of the desired values. Performance-wise, this ought to be faster than using regular expressions. See example.

preg_match multiple responses

<?php
$string = "Movies and Stars I., 32. part";
$pattern = "((IX|IV|V?I{0,3}[\.]))";
if(preg_match($pattern, $string, $x) == false)
{
print "NAPAKA!";
}
else
{
print_r($x);
}
?>
And the response is:
Array ( [0] => I. [1] => I. )
I should get only 1 response... Why do I get multiple responses?
The element at index 0 is the whole matched string. The element at index 1 is the contents of the first capture group, i.e. the content inside the parentheses. In this case, they just happen to be the same. Just use $x[0] to get the value you're looking for.
The nested parentheses should, in this instance, be a "non-capturing" subpattern.
$pattern = "~(?:IX|IV|V?I{0,3}[\.])~";
Try that. It tells the regex compiler not to capture the contents of those parentheses into the array.
In fact, looking at your regex, you don't even need those parentheses. Make your regex this:
$pattern = "~IX|IV|V?I{0,3}[\.]~";
That should also work.
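
The difference can be checked side by side. A small sketch, using the string from the question; the variable names $capturing and $nonCapturing are chosen here for illustration.

```php
<?php
$string = 'Movies and Stars I., 32. part';

// Capturing group: index 0 is the full match, index 1 repeats it as group 1.
preg_match('~(IX|IV|V?I{0,3}\.)~', $string, $capturing);

// Non-capturing group: only the full match (index 0) is returned.
preg_match('~(?:IX|IV|V?I{0,3}\.)~', $string, $nonCapturing);

print_r($capturing);
print_r($nonCapturing);
```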
Your pattern has multiple groups in it -> the () brackets tell you what to capture in your match.
Try this:
$pattern = "(IX|IV|V?I{0,3}[\.])";
If you have a hard time identifying the wanted groups in the result you can name them as specified in the php.net documentation.
That would look something like this:
$pattern = "~(?P<groupname>IX|IV|V?I{0,3}[\.])~";
You get the whole matched string at index 0 and one further result for every pair of parentheses (). That is helpful for getting groups, e.g.
preg_match('~([0-9]+)([a-z]+)~','12abc',$x);
$x is ([0]=>12abc [1]=>12 [2]=>abc)
In your case you can simply delete one pair of () (the other pair is used as delimiters).

how to parse this: char(x)?

I use YAML to get a MySQL schema and I need to parse only strings of this kind: CHAR(60), VARCHAR(90), etc.
Parsing result would be like this:
array('CHAR', 60);
array('VARCHAR', 90);
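The following regex will do it. If these don't occur one per line, you should also add a \b word boundary after the opening slash. A \b before the closing slash would not help: ) is not a word character, so a boundary there would only match when a letter, digit or underscore follows directly.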
$s = "VARCHAR(90)";
$matches = array();
preg_match("/([A-Z]+)\(([0-9]+)\)/", $s, $matches);
// Then use the matched values in your array; cast the length to int to get 90 rather than "90".
$result = array($matches[1], (int) $matches[2]);
EDIT: Had the $matches array keys wrong the first time. Should be 1 & 2, not 0 & 1.
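
When several types occur in one string, the same pattern can be applied with preg_match_all. A minimal sketch under assumptions: parseColumnTypes() is a hypothetical name, and the sample schema text below is made up for illustration.

```php
<?php
// Hypothetical helper: collect every TYPE(length) occurrence in a string.
function parseColumnTypes(string $schema): array
{
    // \b keeps the match from starting in the middle of a longer word.
    preg_match_all('/\b([A-Z]+)\((\d+)\)/', $schema, $matches, PREG_SET_ORDER);

    $types = [];
    foreach ($matches as $m) {
        // $m[1] is the type name, $m[2] the length as a numeric string.
        $types[] = [$m[1], (int) $m[2]];
    }
    return $types;
}

$schema = "name: CHAR(60)\nemail: VARCHAR(90)";
print_r(parseColumnTypes($schema));
```

PREG_SET_ORDER groups the results per match, which makes it straightforward to build one array('TYPE', length) pair per occurrence.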
