How to find sequenced pattern of characters in a string with PHP?

Let's say I have a random block of text:
EAMoAAQAABwEBAAAAAAAAAAAAAAABAgMFBgcIBAkBAQABBQEBAAAAAAAAAAAAAAAGAgMEBQcBCBAAAQMDAgMEBQcIBQgGCwEAAQACAxEEBSEGMRIHQVFhE3GBIhQIkaGxwTJCI9FScoKSojMV8GLCUxbhstKDo7M0ZHOTJEQlF/HiQ2PDVHSExEUmGBEBAAIBAgMDCAgCCgMBAQEAAAECAxEEITEFQRIGUWFxgZGhIhPwscHRMlIUB0Jy4fGCkqLCI1MVFrLSQ2IzF//aAAwDAQACEQMRAD8A7+QEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEEDwXkzpxHgusxi7NrnXF3G0NBLhzAkAeAqVH934r6bt57uTPSJ8ne1n2Rqycezy35VlRttwYu5DXNlLOcczOdpHM3hUUqtLs/wBxulZonXJ8vjp8caa+eOa5k6flrPLVcIbm3n/gytf4NcCVKtj1XbbqNcOSuT+W0W+pi3x2rzjRWWxUCAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAggV5It2Uy8GNYAWmW6kr5MDftO8T3BRXxR4s2/SccTb48lvw0jnPnn8tfP6o1Ze02ds08OERzlid+/P5Orp5BHEeFuxxa0Dxpx9a+fOu+Iup9Tmfm30p+Ss92vr/N6bat/t67fDyjWfLLG79pt45YpAA8NdUAg9ngolTFNbedtqWi0avVicv5bLKFr2kSRltHaahrXCnylZcd6k208rDy4ItxlkUr5+XnZE1zxq0h3KfUQqv1GWsxeI0tHKY1rPtjRgVivKZU7HebrS491ybX+TWnO7V7PEn7w+f0rpPhb9zdxt7Rj3szkx/n/AI6+n88f4vTyebno8Wr3qTGvun7mawSxzsbNC4Pje0Oa9pqCD2grv+3z0zUi9Ji1bRrEx2wjtqzWdJ5wqq8pEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQU
SPECIFICATIONS:
patternABC >= 2 characters = groupABC IF groupABC occurs more than once
groupABC + (groupABC)n = sequence WHERE n >= 1 AND sequence > 6 characters
** A sequence needs to be > 6 characters in order to be evaluated
BREAKDOWN:
How do I find any repeating patterns that occur in sequence?
QEBAQEBAQEBAQEBAQEBAQEBA
I also want to count how many times each group repeats:
QEBA QEBA QEBA QEBA QEBA QEBA = 6
Also the sequence must be > 6 characters in order to be evaluated:
NO GOOD: AA AA AA
GOOD: AA AA AA AA
It would be ideal if the output could be stored in an associative array, with duplicate entries removed:
QEBA => 6, AA => 4, QEBA => 3, AA => 8, (QEBA => 6)<- REMOVE
Does anyone have the time & the inclination to tackle this problem?
You rock if you do!

$str = 'EAMoAAQAABwEBAAAAAAAAAAAAAAABAgMFBgcIBAkBAQABBQEBAAAAAAAAAAAAAAAGAgMEBQcBCBAAAQMDAgMEBQcIBQgGCwEAAQACAxEEBSEGMRIHQVFhE3GBIhQIkaGxwTJCI9FScoKSojMV8GLCUxbhstKDo7M0ZHOTJEQlF/HiQ2PDVHSExEUmGBEBAAIBAgMDCAgCCgMBAQEAAAECAxEEITEFQRIGUWFxgZGhIhPwscHRMlIUB0Jy4fGCkqLCI1MVFrLSQ2IzF//aAAwDAQACEQMRAD8A7+QEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEEDwXkzpxHgusxi7NrnXF3G0NBLhzAkAeAqVH934r6bt57uTPSJ8ne1n2Rqycezy35VlRttwYu5DXNlLOcczOdpHM3hUUqtLs/wBxulZonXJ8vjp8caa+eOa5k6flrPLVcIbm3n/gytf4NcCVKtj1XbbqNcOSuT+W0W+pi3x2rzjRWWxUCAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAggV5It2Uy8GNYAWmW6kr5MDftO8T3BRXxR4s2/SccTb48lvw0jnPnn8tfP6o1Ze02ds08OERzlid+/P5Orp5BHEeFuxxa0Dxpx9a+fOu+Iup9Tmfm30p+Ss92vr/N6bat/t67fDyjWfLLG79pt45YpAA8NdUAg9ngolTFNbedtqWi0avVicv5bLKFr2kSRltHaahrXCnylZcd6k208rDy4ItxlkUr5+XnZE1zxq0h3KfUQqv1GWsxeI0tHKY1rPtjRgVivKZU7HebrS491ybX+TWnO7V7PEn7w+f0rpPhb9zdxt7Rj3szkx/n/AI6+n88f4vTyebno8Wr3qTGvun7mawSxzsbNC4Pje0Oa9pqCD2grv+3z0zUi9Ji1bRrEx2wjtqzWdJ5wqq8pEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQU';
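preg_match_all( '/(\S{2,}?)\1+/', $str, $matches );
// Remove duplicate sequences (array_unique() preserves the original keys)
$matches[0] = array_unique( $matches[0] );
$results = array();
foreach ( $matches[0] as $key => $value ) {
    // Only evaluate sequences longer than 6 characters
    if ( strlen( $value ) > 6 ) {
        $repeated = $matches[1][$key];
        // Number of times the group repeats within the sequence
        $results[] = array( $repeated => count( explode( $repeated, $value ) ) - 1 );
    }
}
print_r($results);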
/*
[AA] => 7
[QEBA] => 93
[CAgI] => 18
[EBAQ] => 18
*/
The above assumes a sequence is composed of non-space characters.
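As a rough sketch of the same idea wrapped in a reusable function: the name findRepeatedGroups and the [group, count] pair format are assumptions made for illustration, not part of the answer above. It uses substr_count() instead of explode() to count repetitions, and stores pairs rather than an associative array because the same group can legitimately appear with different counts.

```php
<?php
// Hypothetical helper: returns a list of [group, repetitions] pairs,
// skipping sequences of 6 characters or fewer and dropping exact duplicates.
function findRepeatedGroups(string $text): array
{
    $results = [];
    // Same pattern as above: a group of 2+ non-space characters, repeated at least once
    preg_match_all('/(\S{2,}?)\1+/', $text, $matches, PREG_SET_ORDER);
    foreach ($matches as [$sequence, $group]) {
        if (strlen($sequence) <= 6) {
            continue; // the sequence must be > 6 characters to be evaluated
        }
        $pair = [$group, substr_count($sequence, $group)];
        if (!in_array($pair, $results, true)) {
            $results[] = $pair;
        }
    }
    return $results;
}

var_export(findRepeatedGroups('xQEBAQEBAQEBAy AAAAAA AAAAAAAA'));
```

In this small input, QEBA repeats 3 times and is kept, AAAAAA is only 6 characters long and is skipped, and AAAAAAAA is reported as AA repeated 4 times.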

Get the sequences with preg_match_all('/(?:(.{6,})\1)/',$inputText,$sequences)
(note: sequences will be saved in $sequences)
Explained RegEx demo: http://regex101.com/r/rW4nE2
Use array_unique() to get rid of duplicates.
Loop through each sequence and:
Get the groups with preg_match_all('/(.+?)(\1)(\1)?/',$sequence,$groups)
Explained RegEx demo: http://regex101.com/r/pC3pB7
Use count() if you need to.

Related

Parse strictly formatted text containing multiple entries with no delimiting character

I have a string containing multiple product orders which have been joined together without a delimiter.
I need to parse the input string and convert sets of three substrings into separate rows of data.
I tried splitting the string using the split() and strstr() functions, but could not generate the desired result.
How can I convert this statement into different columns?
RM is Malaysian Ringgit
From this statement:
"2 x Brew Coffeee Panas: RM7.42 x Tongkat Ali Ais: RM8.6"
Into separate rows:
2 x Brew Coffeee Panas: RM7.4
2 x Tongkat Ali Ais: RM8.6
And then insert these 2 rows into this table in the DB:
Table: Products
Product Name | Quantity | Total Amount (RM)
Brew Coffeee Panas | 2 | 7.4
Tongkat Ali Ais | 2 | 8.6
*Note: the "total amount" substrings will reliably have a numeric value with precision to one decimal place.
You could use regex if your string format is consistent. Here's an expression that could do that:
(\d) x (.+?): RM(\d+\.\d)
Basic usage
$re = '/(\d) x (.+?): RM(\d+\.\d)/';
$str = '2 x Brew Coffeee Panas: RM7.42 x Tongkat Ali Ais: RM8.6';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
var_export($matches);
Which gives
array (
0 =>
array (
0 => '2 x Brew Coffeee Panas: RM7.4',
1 => '2',
2 => 'Brew Coffeee Panas',
3 => '7.4',
),
1 =>
array (
0 => '2 x Tongkat Ali Ais: RM8.6',
1 => '2',
2 => 'Tongkat Ali Ais',
3 => '8.6',
),
)
Group 0 will always be the full match, after that the groups will be quantity, product and price.
Try it online
Capture one or more digits
Match the space, x, space
Capture one or more non-colon characters until the first occurring colon
Match the colon, space, then RM
Capture the float value that has a max decimal length of 1. (The OP says in a comment under the question: "it only take one decimal place for the amount".)
There are no "lazy quantifiers" in my pattern, so the regex can move most swiftly.
This regex pattern is as Accurate as the sample data and requirement explanation allows, as Efficient as it can be because it only contains greedy quantifiers, as Concise as it can be thanks to the negated character class, and as Readable as the pattern can be made because there are no superfluous characters.
Code: (Demo)
var_export(
preg_match_all('~(\d+) x ([^:]+): RM(\d+\.\d)~', $string, $m)
? array_slice($m, 1) // omit the fullstring matches
: [] // if there are no matches
);
Output:
array (
0 =>
array (
0 => '2',
1 => '2',
),
1 =>
array (
0 => 'Brew Coffeee Panas',
1 => 'Tongkat Ali Ais',
),
2 =>
array (
0 => '7.4',
1 => '8.6',
),
)
You can add the PREG_SET_ORDER argument to the preg_match_all() call to aid in iterating the matches as rows.
preg_match_all('~(\d+) x ([^:]+): RM(\d+\.\d)~', $string, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
echo '<tr><td>' . implode('</td><td>', array_slice($match, 1)) . '</td></tr>';
}
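If the goal is to insert the rows into the Products table, named capture groups can make each match self-describing. The following is a minimal sketch under the same format assumptions as the pattern above; the function name parseOrders and the array keys name, quantity and total are made up for illustration and are not taken from the question.

```php
<?php
// Hypothetical helper: turns the undelimited order string into one row per product.
function parseOrders(string $input): array
{
    $rows = [];
    preg_match_all(
        '~(?<quantity>\d+) x (?<name>[^:]+): RM(?<total>\d+\.\d)~',
        $input,
        $matches,
        PREG_SET_ORDER
    );
    foreach ($matches as $m) {
        $rows[] = [
            'name'     => $m['name'],
            'quantity' => (int) $m['quantity'],
            'total'    => (float) $m['total'],
        ];
    }
    return $rows;
}

var_export(parseOrders('2 x Brew Coffeee Panas: RM7.42 x Tongkat Ali Ais: RM8.6'));
```

Each returned row could then be bound to a prepared INSERT statement. Like the pattern it is based on, this only works if every total really has exactly one decimal place.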
You can use a regex like this:
/(\d+)\sx\s([^:]+):\sRM(\d+\.?\d?)(?=\d|$)/
Explanation:
(\d+) captures one or more digits
\s matches a whitespace character
([^:]+): captures one or more non : characters that come before a : character (you can also use something like [a-zA-Z0-9\s]+): if you know exactly which characters can exist before the : character - in this case lower case and upper case letters, digits 0 through 9 and whitespace characters)
(\d+\.?\d?) captures one or more digits, followed by a . and another digit if they exist
(?=\d|$) is a positive lookahead which matches a digit after the main expression without including it in the result, or the end of the string
You can also add the PREG_SET_ORDER flag to preg_match_all() to group the results:
PREG_SET_ORDER
Orders results so that $matches[0] is an array of first set of matches, $matches[1] is an array of second set of matches, and so on.
Code example:
<?php
$txt = "2 x Brew Coffeee Panas: RM7.42 x Tongkat Ali Ais: RM8.62 x B026 Kopi Hainan Kecil: RM312 x B006 Kopi Hainan Besar: RM19.5";
$pattern = "/(\d+)\sx\s([^:]+):\sRM(\d+\.?\d?)(?=\d|$)/";
if(preg_match_all($pattern, $txt, $matches, PREG_SET_ORDER)) {
print_r($matches);
}
?>
Output:
Array
(
[0] => Array
(
[0] => 2 x Brew Coffeee Panas: RM7.4
[1] => 2
[2] => Brew Coffeee Panas
[3] => 7.4
)
[1] => Array
(
[0] => 2 x Tongkat Ali Ais: RM8.6
[1] => 2
[2] => Tongkat Ali Ais
[3] => 8.6
)
[2] => Array
(
[0] => 2 x B026 Kopi Hainan Kecil: RM31
[1] => 2
[2] => B026 Kopi Hainan Kecil
[3] => 31
)
[3] => Array
(
[0] => 2 x B006 Kopi Hainan Besar: RM19.5
[1] => 2
[2] => B006 Kopi Hainan Besar
[3] => 19.5
)
)
See it live here php live editor and here regex tester.
The first thing I would do is perform a simple replacement using preg_replace to insert a delimiter after each price, with the aid of a back-reference to the captured item. This relies on the known format of a single decimal place: anything beyond that single decimal digit forms part of the next item (the quantity, in this case).
$str="2 x Brew Coffeee Panas: RM7.42 x Tongkat Ali Ais: RM8.625 x Koala Kebabs: RM15.23 x Fried Squirrel Fritters: RM32.4";
# qty price
# 2 7.4
# 2 8.6
# 25 15.2
# 3 32.4
/*
Our RegEx to find the decimal precision,
to split the string apart and the quantity
*/
$pttns=(object)array(
    'repchar' => '#(RM\d{1,}\.\d{1})#',
    'splitter' => '#(\|)#',
    # the space after "x" is matched outside group 3 so the item has no leading space
    'combo' => '#^((\d{1,}) x) (.*): RM(\d{1,}\.\d{1})$#'
);
# create a new version of the string with our specified delimiter - the PIPE
$str = preg_replace( $pttns->repchar, '$1|', $str );
# split the string into pieces - discard empty items
$a=array_filter( preg_split( $pttns->splitter, $str ) );
# iterate through the pieces - find the quantity, item & price
foreach($a as $piece){
    preg_match($pttns->combo,$piece,$matches);
    $qty=$matches[2];
    $item=$matches[3];
    $price=$matches[4];
    # print the price with %s; %d would truncate 7.4 to 7
    printf('%s %d %s<br />',$item,$qty,$price);
}
Which yields:
Brew Coffeee Panas 2 7.4
Tongkat Ali Ais 2 8.6
Koala Kebabs 25 15.2
Fried Squirrel Fritters 3 32.4

How to split repeated chars and numbers with preg_split?

I'm trying to solve a problem where I need to split a string into runs of repeated characters and runs of digits.
$code = preg_split('/(.)(?!\1|$)\K/', $code);
I tried this one, but it also splits between digits that are not repeated; I only want that behaviour for letters.
I have the string 'FFF86C6'.
I need the array (FFF, 86, C, 6);
the pattern '/(.)(?!\1|$)\K/' returns (FFF, 8, 6, C, 6).
Do you have any idea how to do this?
You can use this regex with preg_match_all:
([A-Za-z])(\1*)|\d+
It looks for a letter, followed by some number of the same character, or some digits. By using preg_match_all we find all matches in the string. Usage in PHP:
$string = "FFF86CR6";
$pieces = preg_match_all('/([A-Za-z])(\1*)|\d+/', $string, $matches);
print_r($matches[0]);
Output:
Array (
[0] => FFF
[1] => 86
[2] => C
[3] => R
[4] => 6
)
Demo on 3v4l.org
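For reference, here is a small sketch that applies the same idea to the exact string from the question. The function name splitRuns is hypothetical, and the second capture group from the pattern above is dropped because only the full matches are used.

```php
<?php
// Hypothetical helper: returns runs of one repeated letter, or runs of digits.
function splitRuns(string $code): array
{
    // ([A-Za-z])\1* : a letter followed by zero or more copies of itself
    // \d+           : any run of digits, repeated or not
    preg_match_all('/([A-Za-z])\1*|\d+/', $code, $matches);
    return $matches[0];
}

var_export(splitRuns('FFF86C6')); // FFF, 86, C, 6
```

Characters that are neither letters nor digits are silently skipped by this pattern, which may or may not be what you want.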

Split string after each number

I have a database full of strings that I'd like to split into an array. Each string contains a list of directions that begin with a letter (U, D, L, R for Up, Down, Left, Right) and a number to tell how far to go in that direction.
Here is an example of one string.
$string = "U29R45U2L5D2L16";
My desired result:
['U29', 'R45', 'U2', 'L5', 'D2', 'L16']
I thought I could just loop through the string, but I don't know how to tell whether the number is one or more digits in length.
You can use preg_split to break up the string, splitting on something which looks like a U,L,D or R followed by numbers and using the PREG_SPLIT_DELIM_CAPTURE to keep the split text:
$string = "U29R45U2L5D2L16";
print_r(preg_split('/([UDLR]\d+)/', $string, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY));
Output:
Array (
[0] => U29
[1] => R45
[2] => U2
[3] => L5
[4] => D2
[5] => L16
)
Demo on 3v4l.org
A regular expression should help you:
<?php
$string = "U29R45U2L5D2L16";
preg_match_all("/[A-Z]\d+/", $string, $matches);
var_dump($matches);
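If the direction and the distance are needed separately later on, capture groups can split them during the same pass. This is only a sketch: the function name parseDirections and the direction/distance keys are assumptions, and the letter class is narrowed to U, D, L and R as described in the question.

```php
<?php
// Hypothetical helper: splits "U29R45..." into direction/distance pairs.
function parseDirections(string $path): array
{
    preg_match_all('/([UDLR])(\d+)/', $path, $matches, PREG_SET_ORDER);
    return array_map(
        fn(array $m) => ['direction' => $m[1], 'distance' => (int) $m[2]],
        $matches
    );
}

var_export(parseDirections('U29R45U2L5D2L16'));
```

The arrow function syntax requires PHP 7.4 or later; on older versions a regular anonymous function does the same job.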
Because this task is about text extraction and not about text validation, you can merely split on the zero-width position after one or more digits. In other words, match one or more digits, then forget them with \K so that they are not consumed while splitting.
Code: (Demo)
$string = "U29R45U2L5D2L16";
var_export(
preg_split(
'/\d+\K/',
$string,
0,
PREG_SPLIT_NO_EMPTY
)
);
Output:
array (
0 => 'U29',
1 => 'R45',
2 => 'U2',
3 => 'L5',
4 => 'D2',
5 => 'L16',
)

Split 4 digit numbers

I want to split a number that has 4 digits before the decimal point and 3 digits after it.
Inputs:
Input 1 : 5546.263
Input 2 : 03739.712 /*(sometimes there is one leading zero)*/
Result: (array)
Result of input 1 : 0 => 55 , 1 => 46.263
Result of input 2 : 0 => 37 , 1 => 39.712
P.S.: The inputs are GPS data and always have 4 digits before the decimal point and 3 digits after it, and sometimes have a leading zero.
You could use the following function:
function splitNum($num) {
$num = ltrim($num, '0');
$part1 = substr($num, 0, 2);
$part2 = substr($num, 2);
return array($part1, $part2);
}
Test case 1:
print_r( splitNum('5546.263') );
Output:
Array
(
[0] => 55
[1] => 46.263
)
Test case 2:
print_r( splitNum('03739.712') );
Output:
Array
(
[0] => 37
[1] => 39.712
)
Demo!
^0*([0-9]{2})([0-9\.]+) should work just fine and do what you want:
$input = '03739.712';
if (preg_match('/^0*([0-9]{2})([0-9\.]+)/', $input, $matches)) {
$result = array((int)$matches[1], (float)$matches[2]);
}
var_dump($result); //array(2) { [0]=> int(37) [1]=> float(39.712) }
Regex autopsy:
^ - the string MUST start here
0* - the character '0' repeated 0 or more times
([0-9]{2}) - a capturing group matching a digit between 0 and 9 repeated exactly 2 times
([0-9\.]+) - a capturing group matching a digit between 0 and 9 OR a period repeated 1 or more times
Optionally you can add $ to the end to specify that "the string MUST end here"
Note: Since we cast to an int in the first match, you can omit the 0* part, but if you plan NOT to cast it, then leave it in.
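A stricter variant is sketched below, assuming the format really is always an optional single leading zero, four digits, a period, and three digits. The function name splitGpsValue is hypothetical. Unlike ltrim($num, '0'), it strips at most one padding zero, so a four-digit value that itself begins with 0 is still split correctly, and anything that does not fit the format is rejected.

```php
<?php
// Hypothetical helper: returns [first two digits, remainder] as strings,
// or null when the input does not match the expected format.
function splitGpsValue(string $value): ?array
{
    // 0?            : at most one padding zero
    // (\d{2})       : the first two digits
    // (\d{2}\.\d{3}): the last two digits plus the 3-digit decimal part
    if (preg_match('/^0?(\d{2})(\d{2}\.\d{3})$/', $value, $m) !== 1) {
        return null;
    }
    return [$m[1], $m[2]];
}

var_export(splitGpsValue('03739.712')); // '37' and '39.712'
```

Cast the parts to int and float afterwards if numeric values are needed.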

Replacing based on position in string

Is there a way using regex to replace characters in a string based on position?
For instance, one of my rewrite rules for a project I’m working on is “replace o with ö if o is the next-to-last vowel and even numbered (counting left to right).”
So, for example:
heabatoik would become heabatöik (o is the next-to-last vowel, as well as the fourth vowel)
habatoik would not change (o is the next-to-last vowel, but is the third vowel)
Is this possible using preg_replace in PHP?
Starting with the beginning of the subject string, you want to match 2n + 1 vowels followed by an o, but only if the o is followed by exactly one more vowel:
$str = preg_replace(
'/^((?:(?:[^aeiou]*[aeiou]){2})*)' . # 2n vowels, n >= 0
'([^aeiou]*[aeiou][^aeiou]*)' . # odd-numbered vowel
'o' . # even-numbered vowel is o
'(?=[^aeiou]*[aeiou][^aeiou]*$)/', # exactly one more vowel
'$1$2ö',
'heaeafesebatoik');
To do the same but for an odd-numbered o, match 2n leading vowels rather than 2n + 1:
$str = preg_replace(
'/^((?:(?:[^aeiou]*[aeiou]){2})*)' . # 2n vowels, n >= 0
'([^aeiou]*)' . # followed by non-vowels
'o' . # odd-numbered vowel is o
'(?=[^aeiou]*[aeiou][^aeiou]*$)/', # exactly one more vowel
'$1$2ö',
'habatoik');
If one doesn't match, then it performs no replacement, so it's safe to run them in sequence if that's what you're trying to do.
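To check the first pattern against both examples from the question, it can be wrapped in a small function. This is a sketch only; the name umlautEvenPenultimateO is made up, and like the pattern above it treats only lowercase a, e, i, o, u as vowels.

```php
<?php
// Hypothetical wrapper around the "even-numbered, next-to-last o" pattern.
function umlautEvenPenultimateO(string $word): string
{
    $pattern = '/^((?:(?:[^aeiou]*[aeiou]){2})*)'     // 2n vowels, n >= 0
             . '([^aeiou]*[aeiou][^aeiou]*)'          // odd-numbered vowel
             . 'o'                                    // even-numbered vowel is o
             . '(?=[^aeiou]*[aeiou][^aeiou]*$)/';     // exactly one more vowel
    // preg_replace() returns null on error; fall back to the unchanged word
    return preg_replace($pattern, '$1$2ö', $word) ?? $word;
}

echo umlautEvenPenultimateO('heabatoik'), "\n"; // heabatöik (o is the 4th vowel)
echo umlautEvenPenultimateO('habatoik'), "\n";  // habatoik (o is the 3rd vowel)
```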
You can use preg_match_all to split the string into vowel/non-vowel parts and process that.
e.g. something like
preg_match_all("/([aeiou])|([^aeiou]+)/",
$in,
$out, PREG_PATTERN_ORDER);
Depending on your specific needs, you may need to modify the placement of ()*+? in the regex.
I'd like to expand on Schmitt's answer. (I don't have enough points to add a comment; I'm not trying to steal his thunder.) I would use the PREG_OFFSET_CAPTURE flag, as it returns not only the vowels but also their locations. This is my solution:
const LETTER = 0;   // index of the matched vowel in each entry
const LOCATION = 1; // index of its byte offset in the string
$string = 'heabatoik';
preg_match_all('/[aeiou]/', $string, $out, PREG_OFFSET_CAPTURE);
$vowels = $out[0];
$count = count($vowels);
// the next-to-last vowel is vowel number $count - 1 (counting from 1),
// so it is even-numbered exactly when $count is odd
if ($count >= 2 && $count % 2 == 1 && $vowels[$count - 2][LETTER] == 'o') {
    $string = substr_replace($string, 'ö', $vowels[$count - 2][LOCATION], 1);
}
note:
preg_match_all('/[aeiou]/', 'heabatoik', $out, PREG_OFFSET_CAPTURE);
print_r($out);
Array
(
[0] => Array
(
[0] => Array
(
[0] => e
[1] => 1
)
[1] => Array
(
[0] => a
[1] => 2
)
[2] => Array
(
[0] => a
[1] => 4
)
[3] => Array
(
[0] => o
[1] => 6
)
[4] => Array
(
[0] => i
[1] => 7
)
)
)
This is how I would do it:
$str = 'heabatoik';
$vowels = preg_replace('#[^aeiou]+#i', '', $str);
$length = strlen($vowels);
if ( $length % 2 && $vowels[$length - 2] == 'o' ) {
$str = preg_replace('#o([^o]+)$#', 'ö$1', $str);
}
