Cannot match the last group of numbers using PHP preg_match() - php

preg_match_all("/(\d{12})(?:,|$)/","111762396541,561572500056,561729950637,561135281443",$matches);
var_dump($matches):
array (size=2)
0 =>
array (size=4)
0 => string '561762396543,' (length=13)
1 => string '561572500056,' (length=13)
2 => string '561729950637,' (length=13)
3 => string '561135281443' (length=12)
1 =>
array (size=4)
0 => string '561762396543' (length=12)
1 => string '561572500056' (length=12)
2 => string '561729950637' (length=12)
3 => string '561135281443' (length=12)
But I want the $matches like this:
array (size=4)
0 => string '561762396543,' (length=13)
1 => string '561572500056,' (length=13)
2 => string '561729950637,' (length=13)
3 => string '561135281443' (length=12)
I want to match groups of numbers (12 digits each) plus a trailing comma if there is one. The exception is the last group, which does not need a comma because it reaches the end of the line.

Try this instead:
preg_match_all("/(\d{12}(?:,|$))/","111762396541,561572500056,561729950637,561135281443",$matches);
When the $ is inside character class brackets [ ], it matches a literal $ character, not the end of the line.
EDIT: If you want to include the comma in your matches, then just use the above code sample and look at $matches[0].
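As a minimal sketch of the suggestion above (the variable name $input is only for illustration), moving the comma-or-end alternative inside the capturing parentheses means each full match carries its comma where one exists:

```php
<?php
$input = "111762396541,561572500056,561729950637,561135281443";

// The (?:,|$) alternative sits inside the capture group, so the trailing
// comma (when present) is part of both $matches[0] and $matches[1].
preg_match_all('/(\d{12}(?:,|$))/', $input, $matches);

print_r($matches[0]);
```

Single quotes are used around the pattern here so that PHP does not try to interpret $ or backslashes itself.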
If you want a simpler syntax, the word boundary \b also works here: it matches at the position between the last digit and a comma or the end of the line, without consuming the comma:
preg_match_all("/(\d{12}\b)/","111762396541,561572500056,561729950637,561135281443",$matches);
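For comparison, here is a quick sketch of the \b variant on the same input ($input is again just an illustrative name). Because \b is a zero-width assertion, the matches contain only the digits:

```php
<?php
$input = "111762396541,561572500056,561729950637,561135281443";

// \b asserts a word boundary after the 12th digit; it consumes nothing.
preg_match_all('/(\d{12}\b)/', $input, $matches);

print_r($matches[0]);
```

So this form is a fit when the commas are not wanted in the result.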

Related

PHP preg_split() pattern

I need help finding a PCRE pattern using preg_split().
I'm using the regex pattern below to split a string based on its starting three-character code and semicolons. The pattern works fine in JavaScript, but now I need to use it in PHP. I tried preg_split(), but I am just getting back junk.
// Each group will begin with a three letter code, have three segments separated by a semi-colon. The string will not be terminated with a semi-colon.
// Sample input
$string_to_split = "AAA;RED;111;BBB;BLUE;22;CCC;GREEN;33;DDD;WHITE;44";
// This works in JS
// https://regex101.com
$pattern = "/[AAA|BBB|CCC|DDD][^;]*;[^;]*[;][^;]*/gi";
Match 1
Full match 0-11 `AAA;RED;111`
Match 2
Full match 12-23 `BBB;BLUE;22`
Match 3
Full match 24-36 `CCC;GREEN;33`
Match 4
Full match 37-49 `DDD;WHITE;44`
$pattern = "/[AAA|BBB|CCC|DDD][^;]*;[^;]*[;][^;]*/";
$split = preg_split($pattern, $string_to_split);
returns
array(5)
0:""
1:";"
2:";"
3:";"
4:""
Based on the additional information in your comments on the answers, I have updated my answer to be specific to your source format.
You might want something like this:
$subject = "AAA;RED;111;AAA;Oh my dog;12.34;AAA;Oh Long John;.4556;BBB;Oh Long Johnson;1.2323;BBB;Oh Don Piano;.33;CCC;Why I eyes ya;1.445;CCC;All the live long day;2.3343;DDD;Faith Hilling;.89";
$pattern = '/(?<=;|^)(AAA|BBB|CCC|DDD);([^;]*);((?:\d*\.)?\d+)(?=;|$)/';
preg_match_all($pattern, $subject,$matches);
var_dump($matches);
giving you
array (size=4)
0 =>
array (size=8)
0 => string 'AAA;RED;111' (length=11)
1 => string 'AAA;Oh my dog;12.34' (length=19)
2 => string 'AAA;Oh Long John;.4556' (length=22)
3 => string 'BBB;Oh Long Johnson;1.2323' (length=26)
4 => string 'BBB;Oh Don Piano;.33' (length=20)
5 => string 'CCC;Why I eyes ya;1.445' (length=23)
6 => string 'CCC;All the live long day;2.3343' (length=32)
7 => string 'DDD;Faith Hilling;.89' (length=21)
1 =>
array (size=8)
0 => string 'AAA' (length=3)
1 => string 'AAA' (length=3)
2 => string 'AAA' (length=3)
3 => string 'BBB' (length=3)
4 => string 'BBB' (length=3)
5 => string 'CCC' (length=3)
6 => string 'CCC' (length=3)
7 => string 'DDD' (length=3)
2 =>
array (size=8)
0 => string 'RED' (length=3)
1 => string 'Oh my dog' (length=9)
2 => string 'Oh Long John' (length=12)
3 => string 'Oh Long Johnson' (length=15)
4 => string 'Oh Don Piano' (length=12)
5 => string 'Why I eyes ya' (length=13)
6 => string 'All the live long day' (length=21)
7 => string 'Faith Hilling' (length=13)
3 =>
array (size=8)
0 => string '111' (length=3)
1 => string '12.34' (length=5)
2 => string '.4556' (length=5)
3 => string '1.2323' (length=6)
4 => string '.33' (length=3)
5 => string '1.445' (length=5)
6 => string '2.3343' (length=6)
7 => string '.89' (length=3)
The start marker should occur at the start of the string or immediately after a semicolon, so we use a lookbehind for either the start or a semicolon:
(?<=;|^)
We look for an alternative of AAA,BBB,CCC or DDD and capture it:
(AAA|BBB|CCC|DDD)
After a semicolon we look for any character except a semicolon. The quantifier * means 0 or more times. Use + if you want at least 1.
;([^;]*)
After the next semicolon we look for a number. This has to be split into two parts to enforce a valid format. We first look for 0 or more digits followed by a dot:
(?:\d*\.)?
where (?:) means a non-capturing group.
After that we look for at least one digit: \d+
We capture both parts of the number with parentheses placed after the semicolon:
;((?:\d*\.)?\d+)
This matches "1234", ".1234", "1.234", "12.34" and "123.4", but not "1234." or "1.2.3".
Finally we want this to immediately occur before a semicolon or the end of string. Thus we do a lookahead:
(?=;|$)
Lookaheads and lookbehinds are zero-width assertions: the text they check is not included in the match or in the captured groups.
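The list of accepted and rejected number formats can be checked with a small sketch. It takes only the numeric sub-pattern from the walk-through and, as an assumption for testing it in isolation, anchors it with ^ and $ in place of the surrounding semicolons; $samples and $accepted are illustrative names:

```php
<?php
// Numeric sub-pattern from the answer, anchored to the whole string.
$number = '/^(?:\d*\.)?\d+$/';

$samples  = ['1234', '.1234', '1.234', '12.34', '123.4', '1234.', '1.2.3'];
$accepted = [];
foreach ($samples as $s) {
    $ok = preg_match($number, $s) === 1;
    if ($ok) {
        $accepted[] = $s;
    }
    printf("%-6s %s\n", $s, $ok ? 'match' : 'no match');
}
```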
I've modified your pattern a little, and added a couple of flags to preg_split.
The PREG_SPLIT_NO_EMPTY flag will exclude empty matches from the result, and PREG_SPLIT_DELIM_CAPTURE will include the captured value in the result.
$split = preg_split('/([abcd]{3};[^;]+;\d+);?/i', $string, -1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE);
Result:
Array
(
[0] => AAA;RED;111
[1] => BBB;BLUE;22
[2] => CCC;GREEN;33
[3] => DDD;WHITE;44
)
Alternatively, and more suitably, you can use preg_match_all with the following pattern.
preg_match_all('/([abcd]{3};[^;]+;\d+);?/i', $string, $matches);
print_r($matches[0]);
Result:
Array
(
[0] => AAA;RED;111
[1] => BBB;BLUE;22
[2] => CCC;GREEN;33
[3] => DDD;WHITE;44
)
You don't want to split your string but to match elements, so use preg_match_all:
$str = "AAA;RED;111;AAA;Oh my dog;2.34;AAA;Oh Long John;.4556;BBB;Oh Long Johnson;1.2323;BBB;Oh Don Piano;.33;CCC;Why I eyes ya;1.445;CCC;All the live long day;2.3343;DDD;Faith Hilling;.89";
$res = preg_match_all('/(?:AAA|BBB|CCC|DDD);[^;]*;[^;]*;?/', $str, $m);
print_r($m[0]);
Output:
Array
(
[0] => AAA;RED;111;
[1] => AAA;Oh my dog;2.34;
[2] => AAA;Oh Long John;.4556;
[3] => BBB;Oh Long Johnson;1.2323;
[4] => BBB;Oh Don Piano;.33;
[5] => CCC;Why I eyes ya;1.445;
[6] => CCC;All the live long day;2.3343;
[7] => DDD;Faith Hilling;.89
)
Explanation:
/ : regex delimiter
(?:AAA|BBB|CCC|DDD) : non-capturing group: AAA or BBB or CCC or DDD
; : a semicolon
[^;]* : 0 or more characters that are not a semicolon
; : a semicolon
[^;]* : 0 or more characters that are not a semicolon
;? : optional semicolon
/ : regex delimiter
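To see why the pattern from the question returned only semicolons, a short sketch may help. Square brackets form a character class, so [AAA|BBB|CCC|DDD] matches a single character (A, B, C, D or a literal |), whereas a group with | is an alternation between whole codes. In addition, preg_split() throws away whatever the pattern matches and returns the text between matches. The variable names below are illustrative:

```php
<?php
$s = "AAA;RED;111;BBB;BLUE;22;CCC;GREEN;33;DDD;WHITE;44";

// Character class: ONE character out of A, B, C, D or a literal |
preg_match('/[AAA|BBB|CCC|DDD]/', $s, $class);

// Alternation: one of the four three-letter codes
preg_match('/(?:AAA|BBB|CCC|DDD)/', $s, $alt);

// The question's pattern matches each record, so splitting on it
// leaves only the separators (and empty strings at both ends).
$pieces = preg_split('/[AAA|BBB|CCC|DDD][^;]*;[^;]*[;][^;]*/', $s);

var_dump($class[0], $alt[0], $pieces);
```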

How to limit a variable search to a single line of text?

Considering this sample text:
grupo1, tiago1A, bola1A, mola1A, tijolo1A, pedro1B, bola1B, mola1B, tijolo1B, raimundo1C, bola1C, mola1C, tijolo1C, joao1D, bola1D, mola1D, tijolo1D, felipe1E, bola1E, mola1E, tijolo1E,
grupo2, tiago2A, bola2A, mola2A, tijolo2A, pedro2B, bola2B, mola2B, tijolo2B, raimundo2C, bola2C, mola2C, tijolo2C, joao2D, bola2D, mola2D, tijolo2D, felipe2E, bola2E, mola2E, tijolo2E,
grupo3, tiago3A, bola3A, mola3A, tijolo3A, pedro3B, bola3B, mola3B, tijolo3B, raimundo3C, bola3C, mola3C, tijolo3C, joao3D, bola3D, mola3D, tijolo3D, felipe3E, bola3E, mola3E, tijolo3E,
grupo4, tiago4A, bola4A, mola4A, tijolo4A, pedro4B, bola4B, mola4B, tijolo4B, raimundo4C, bola4C, mola4C, tijolo4C, joao4D, bola4D, mola4D, tijolo4D, felipe4E, bola4E, mola4E, tijolo4E,
grupo5, tiago5A, bola5A, mola5A, tijolo5A, pedro5B, bola5B, mola5B, tijolo5B, raimundo5C, bola5C, mola5C, tijolo5C, joao5D, bola5D, mola5D, tijolo5D, felipe5E, bola5E, mola5E, tijolo5E,
I would like to capture the 20 values that follow grupo3 and store them in groups of 4.
I am using this:
/grupo3,((.*?),(.*?),(.*?),(.*?)),/
but this only returns the first 4 comma separated values after grupo3.
I need to generate this array structure:
Match 1
Group 1 tiago3A
Group 2 bola3A
Group 3 mola3A
Group 4 tijolo3A
Match 2
Group 1 pedro3B
Group 2 bola3B
Group 3 mola3B
Group 4 tijolo3B
Match 3
Group 1 raimundo3C
Group 2 bola3C
Group 3 mola3C
Group 4 tijolo3C
Match 4
Group 1 joao3D
Group 2 bola3D
Group 3 mola3D
Group 4 tijolo3D
Match 5
Group 1 felipe3E
Group 2 bola3E
Group 3 mola3E
Group 4 tijolo3E
You can try the following:
/,(.*?),(.*?),(.*?),(.*?),.*?$/m
The m after the closing delimiter is the multi-line flag, and the $ before it matches the end of each line.
Edit: To get each set of 4 elements from the third line only:
/grupo3,((.*?),(.*?),(.*?),(.*?)), ((.*?),(.*?),(.*?),(.*?)), ((.*?),(.*?),(.*?),(.*?)), ((.*?),(.*?),(.*?),(.*?)), ((.*?),(.*?),(.*?),(.*?)),/
And you can get the desired output in PHP like:
preg_match('/grupo3,((.*?),(.*?),(.*?),(.*?)), ((.*?),(.*?),(.*?),(.*?)), ((.*?),(.*?),(.*?),(.*?)), ((.*?),(.*?),(.*?),(.*?)), ((.*?),(.*?),(.*?),(.*?)),/', $str, $matches);
$groups = [];
unset($matches[0]);
$matches = array_values($matches);
$count = count($matches);
$j=0;
for($i=1;$i<$count;$i++)
{
if($i%5 == 0)
{
$j++;
continue;
}
$groups[$j][] = $matches[$i];
}
var_dump($groups);
Output will be something like:
array (size=5)
0 =>
array (size=4)
0 => string ' tiago3A' (length=8)
1 => string ' bola3A' (length=7)
2 => string ' mola3A' (length=7)
3 => string ' tijolo3A' (length=9)
1 =>
array (size=4)
0 => string 'pedro3B' (length=7)
1 => string ' bola3B' (length=7)
2 => string ' mola3B' (length=7)
3 => string ' tijolo3B' (length=9)
2 =>
array (size=4)
0 => string 'raimundo3C' (length=10)
1 => string ' bola3C' (length=7)
2 => string ' mola3C' (length=7)
3 => string ' tijolo3C' (length=9)
3 =>
array (size=4)
0 => string 'joao3D' (length=6)
1 => string ' bola3D' (length=7)
2 => string ' mola3D' (length=7)
3 => string ' tijolo3D' (length=9)
4 =>
array (size=4)
0 => string 'felipe3E' (length=8)
1 => string ' bola3E' (length=7)
2 => string ' mola3E' (length=7)
3 => string ' tijolo3E' (length=9)
Please forgive the lateness of this answer. This is the comprehensive answer with a clean, direct solution that I would have posted earlier if this page had not been put on hold. It is as refined a solution as I can devise without knowing more about how your input data is generated or accessed.
The input:
$text='grupo1, tiago1A, bola1A, mola1A, tijolo1A, pedro1B, bola1B, mola1B, tijolo1B, raimundo1C, bola1C, mola1C, tijolo1C, joao1D, bola1D, mola1D, tijolo1D, felipe1E, bola1E, mola1E, tijolo1E,
grupo2, tiago2A, bola2A, mola2A, tijolo2A, pedro2B, bola2B, mola2B, tijolo2B, raimundo2C, bola2C, mola2C, tijolo2C, joao2D, bola2D, mola2D, tijolo2D, felipe2E, bola2E, mola2E, tijolo2E,
grupo3, tiago3A, bola3A, mola3A, tijolo3A, pedro3B, bola3B, mola3B, tijolo3B, raimundo3C, bola3C, mola3C, tijolo3C, joao3D, bola3D, mola3D, tijolo3D, felipe3E, bola3E, mola3E, tijolo3E,
grupo4, tiago4A, bola4A, mola4A, tijolo4A, pedro4B, bola4B, mola4B, tijolo4B, raimundo4C, bola4C, mola4C, tijolo4C, joao4D, bola4D, mola4D, tijolo4D, felipe4E, bola4E, mola4E, tijolo4E,
grupo5, tiago5A, bola5A, mola5A, tijolo5A, pedro5B, bola5B, mola5B, tijolo5B, raimundo5C, bola5C, mola5C, tijolo5C, joao5D, bola5D, mola5D, tijolo5D, felipe5E, bola5E, mola5E, tijolo5E,';
The method:
var_export(preg_match('/^grupo3, \K.*(?=,)/m',$text,$out)?array_chunk(explode(', ',$out[0]),4):'fail');
Use preg_match() to extract the single line, then use explode() to split the string on "comma space", then use array_chunk() to store in an array of 5 subarrays containing 4 elements each.
The pattern targets grupo3, at the start of the line, then restarts the full match using \K then greedily matches every non-newline character and stops just before the last comma in the line. The positive lookahead (?=,) doesn't store the final comma in the full string match.
My method does not retain any leading and trailing spaces, just the values themselves.
Output:
array (
0 =>
array (
0 => 'tiago3A',
1 => 'bola3A',
2 => 'mola3A',
3 => 'tijolo3A',
),
1 =>
array (
0 => 'pedro3B',
1 => 'bola3B',
2 => 'mola3B',
3 => 'tijolo3B',
),
2 =>
array (
0 => 'raimundo3C',
1 => 'bola3C',
2 => 'mola3C',
3 => 'tijolo3C',
),
3 =>
array (
0 => 'joao3D',
1 => 'bola3D',
2 => 'mola3D',
3 => 'tijolo3D',
),
4 =>
array (
0 => 'felipe3E',
1 => 'bola3E',
2 => 'mola3E',
3 => 'tijolo3E',
),
)
p.s. If the search term ($needle) is to be dynamic, you can use something like this to achieve the same result:
$needle='grupo3';
// if the needle may include any regex-sensitive characters, use preg_quote($needle,'/') in place of $needle
var_export(preg_match('/^'.$needle.', \K.*(?=,)/m',$text,$out)?array_chunk(explode(', ',$out[0]),4):'fail');
/* or this is equivalent...
if(preg_match('/^'.$needle.', \K.*(?=,)/m',$text,$out)){
$singles=explode(', ',$out[0]);
$groups=array_chunk($singles,4);
var_export($groups);
}else{
echo 'fail';
}
*/
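The preg_quote() remark can be illustrated with a small sketch. The input below is hypothetical (labels containing parentheses, which are regex metacharacters), and $pattern and $groups are illustrative names; the rest follows the method shown above:

```php
<?php
// Hypothetical input whose labels contain regex metacharacters.
$text   = "grupo(1), a, b, c, d,\ngrupo(2), e, f, g, h, i, j, k, l,";
$needle = 'grupo(2)';

// preg_quote() escapes the parentheses (and the / delimiter, if present),
// so the needle is matched literally.
$pattern = '/^' . preg_quote($needle, '/') . ', \K.*(?=,)/m';

$groups = preg_match($pattern, $text, $out)
    ? array_chunk(explode(', ', $out[0]), 4)
    : [];

var_export($groups);
```

Without the escaping, (2) would be read as a capture group and the literal parentheses in the text would not match.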

Regular expression to match | but not ||

My goal is to split a string such as a|b||c|d into a, b||c and d.
I tried several methods, but I end up splitting the string at the double pipe anyway:
Lookbehind:
var_dump(preg_split("/\\|(?<!\\|\\|)/", 'a|b||c|d'));
array (size=4)
0 => string 'a' (length=1)
1 => string 'b' (length=1)
2 => string '|c' (length=2)
3 => string 'd' (length=1)
Lookahead:
var_dump(preg_split("/(?!\\|\\|)\\|/", 'a|b||c|d'));
array (size=4)
0 => string 'a' (length=1)
1 => string 'b|' (length=2)
2 => string 'c' (length=1)
3 => string 'd' (length=1)
How can I just ignore double pipes?
Just split your input according to the below regex which uses negative lookarounds.
(?<!\|)\|(?!\|)
| is a special meta character in regex which acts like a logical OR or alternation operator. To match a literal | symbol, you need to escape the | in your regex like \|
You can use this regex for splitting:
(?<!\|)\|(?!\|)
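A minimal sketch of this regex with preg_split(), using the string from the question ($parts is an illustrative name). A pipe is used as a split point only when there is no pipe directly before it and none directly after it:

```php
<?php
// (?<!\|) : not preceded by a pipe
// \|      : a literal pipe
// (?!\|)  : not followed by a pipe
$parts = preg_split('/(?<!\|)\|(?!\|)/', 'a|b||c|d');

var_dump($parts);
```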

What array function should I use for creating an index?

Hello, I am trying to create an index of all the words on an HTML page that my crawler parses.
So far I have managed to break the HTML page down into an array of words and filter out all the stop words.
At this stage I have a few problems.
The array of words from the parsed HTML page contains repeated words. I want to keep those, because I still have to record how many times each word appeared on the page.
The array looks like this.
$wordsFromHTML =
array (size=119)
0 => string 'web' (length=3)
1 => string 'giants' (length=6)
2 => string 'vryheid' (length=7)
3 => string 'news' (length=4)
4 => string 'access' (length=6)
5 => string 'mails' (length=5)
6 => string 'mobile' (length=6)
7 => string 'february' (length=8)
8 => string 'access' (length=6)
9 => string 'mails' (length=5)
10 => string 'web' (length=3)
11 => string 'february' (length=8)
12 => string 'access' (length=6)
13 => string 'mails' (length=5)
14 => string 'desktop' (length=7)
15 => string 'february' (length=8)
16 => string 'hosting' (length=7)
17 => string 'web' (length=3)
18 => string 'giants' (length=6)
19 => string 'vryheid' (length=7)
20 => string 'february' (length=8)
22 => string 'us' (length=2)
Now I want to save all the words from $wordsFromHTML to $indexArray, which is my final index.
It should look like this.
$indexArray = array('web'=>array('url'=>array(0,10,17)))
The problem is how to keep adding the positions (the $wordsFromHTML keys) of each repeated word from the $wordsFromHTML array to the final index array.
The index array should contain only unique words. If a word that already exists comes in again, the existing entry for the same URL should be reused and the new position appended to it.
Hope you understand my question.
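One way the structure described above could be built is sketched below. It is only an illustration under stated assumptions: $url is a placeholder for the crawled page's address, and the word list is the first eleven entries of the array shown above. Appending with [] creates the nested arrays the first time a word is seen and adds a position every time it is seen again:

```php
<?php
// Placeholder URL; in the crawler this would be the page being indexed.
$url = 'http://example.com/';

// First eleven words from the sample array, keyed by position.
$wordsFromHTML = ['web', 'giants', 'vryheid', 'news', 'access', 'mails',
                  'mobile', 'february', 'access', 'mails', 'web'];

$indexArray = [];
foreach ($wordsFromHTML as $position => $word) {
    // One entry per unique word; positions accumulate per URL.
    $indexArray[$word][$url][] = $position;
}

var_export($indexArray['web']);
```

The number of occurrences of a word on a page is then count($indexArray[$word][$url]).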

Smarty value of array extraction using date_format result on other array value as a key

I have two arrays in a smarty template: $months and $contract.
{$months|var_dump} gets this:
array (size=12)
1 => string 'января' (length=12)
2 => string 'февраля' (length=14)
3 => string 'марта' (length=10)
4 => string 'апреля' (length=12)
5 => string 'мая' (length=6)
6 => string 'июня' (length=8)
7 => string 'июля' (length=8)
8 => string 'августа' (length=14)
9 => string 'сентября' (length=16)
10 => string 'октября' (length=14)
11 => string 'ноября' (length=12)
12 => string 'декабря' (length=14)
The array values are the Russian names of the months in the genitive case.
{$contract|var_dump} gets this
'date_till' => '1355518365' (length=10)
So I first need to derive a month number from $contract.date_till. That is usually done like this:
{$contract.date_till|date_format:"%m"}
Now the question is: how do I look up a month name in the $months array using the month number derived from $contract.date_till with date_format?
I've tried many variants described in the Smarty manuals, but none of them work. For example, this one doesn't:
{$months[{$contract.date_till|date_format:"%m"}]}
{assign var=monthNo value=$contract.date_till|date_format:"%m"}
{$months.$monthNo}
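This will give you the month name for the given date. Note that %m is zero-padded ("01" to "09"), so for January through September the value will not match the integer keys 1 to 9 unless it is converted to an integer first.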
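If the lookup can be done in PHP before the data reaches the template, a sketch like the following is another option. It assumes the same $months and $contract data as in the question; $monthName is a hypothetical variable that would then be assigned to Smarty. The 'n' format character yields the month number without a leading zero:

```php
<?php
$months = [1 => 'января', 2 => 'февраля', 3 => 'марта', 4 => 'апреля',
           5 => 'мая', 6 => 'июня', 7 => 'июля', 8 => 'августа',
           9 => 'сентября', 10 => 'октября', 11 => 'ноября', 12 => 'декабря'];
$contract = ['date_till' => '1355518365'];

// gmdate() works in UTC; use date() instead if the server's local time is wanted.
$monthNo   = (int) gmdate('n', (int) $contract['date_till']);
$monthName = $months[$monthNo];

echo $monthName, "\n";
```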
