PHP preg_match_all regex - php

If I have a string like: 10/10/12/12
I'm using:
$string = '10/10/12/12';
preg_match_all('/[0-9]+\/[0-9]+/', $string, $results);
This only seems to match 10/10, and 12/12. I also want to match 10/12. Is it because after the 10/10 is matched that is removed from the picture? So after the first match it'll only match things from /12/12?
If I want to match all 10/10, 10/12, 12/12, what should my regex look like? Thanks.
Edit: I did this
$arr = explode('/', $string);
$count = count($arr) - 1;
$newarr = array();
for ($i = 0; $i < $count; $i++)
{
$newarr[] = $arr[$i].'/'.$arr[$i+1];
}

I'd advise against using a regular expression here. Instead you could, for example, first split on the slash using explode, then iterate over the parts, checking for two consecutive parts that both consist only of digits.
Your regular expression doesn't work because a match consumes the characters it matches: the search for the next match starts just after where the previous match ended.
If you really want to use regular expressions, you can use a zero-width match such as a lookahead to avoid consuming the characters, and put a capturing group inside the lookahead. With the pattern below, the overall match is the first number plus the slash (10/) and group 1 holds the number that follows, so each pair is the two joined together.
'#[0-9]+/(?=([0-9]+))#'
See it working online: ideone
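A minimal sketch of how the lookahead pattern above could be used to rebuild the pairs. The variable names `$m` and `$pairs` are only for illustration; the idea is that the overall match and the captured group are concatenated:

```php
<?php
$string = '10/10/12/12';

// Each match consumes "number/" only; the lookahead captures the
// following number without consuming it, so it can start the next match.
preg_match_all('#[0-9]+/(?=([0-9]+))#', $string, $m, PREG_SET_ORDER);

$pairs = [];
foreach ($m as $match) {
    // $match[0] is e.g. "10/", $match[1] is the number seen by the lookahead
    $pairs[] = $match[0] . $match[1];
}

print_r($pairs); // 10/10, 10/12, 12/12
```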

Related

Find words length 8 to 10, starting with S

I want to read a file and show the count of words having length 8 to 10 and starting with S.
I am getting all the count of the file but not getting how to apply condition for length and starting with S.
I am new in php if anyone has an idea then let me know..
Below is my code:
<?php
$count = 0;
$file = fopen("data.txt", "r");
while (($line = fgets($file)) !== false) {
$words = explode(" ", $line);
$count = $count + count($words);
}
echo "Number of words present in given file: " . $count;
fclose($file);
?>
I also need to know, how we do this for a CSV file.
To find words, it's probably a bit more complicated because we might not have spaces between words and we also have punctuation.
I know that you are new to PHP and I expect you don't know what regular expressions are, so my answer might be rather complicated, but I'll try to explain it. Regular expressions are used to search for or replace things in strings. They are a very powerful search tool, and learning to use them pays off in any programming language.
Counting the words
Splitting on spaces might not be sufficient. There might be tabs or other characters, so we could split the string using a regular expression, but this might also get complicated. Instead we'll use a regular expression to match all the words inside the line. This can be done like this:
$nbr_words = preg_match_all('/[\p{L}-]+/u', $line, $matches, PREG_SET_ORDER, 0);
Here's the running example
The text could contain accents and punctuation, like this:
En-tête: Attention, en français les mots contiennent des caractères accentués.
This will return 10 matches. It would also work if you had tabs instead of spaces.
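As a quick check of that count, here is a minimal sketch, assuming the script itself is saved as UTF-8 (the variable names are illustrative):

```php
<?php
$line = "En-tête: Attention, en français les mots contiennent des caractères accentués.";

// preg_match_all() returns the number of full matches it found
$nbr_words = preg_match_all('/[\p{L}-]+/u', $line, $matches);

echo $nbr_words, "\n"; // 10
print_r($matches[0]);  // the individual words, hyphenated ones kept whole
```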
Now, what does this regular expression mean?
Let's see it in action on regex101
Explanation:
\p{L} matches any Unicode letter in any language, such as a, b, ü or é, and only letters, so , or ´ won't be matched.
[] is used to define a list of possible chars. So [abc] would mean the letter “a”, “b” or “c”. You can also set ranges like [a-z]. If you want to say “a”, “b” or “-”, then you have to put the “-” char at the beginning or the end, like this: [ab-]. As words can have hyphens, like week-end, self-service or après-midi, we have to match Unicode letters or hyphens, leading to [\p{L}-].
This Unicode letter or hyphen must occur one or more times. To do that, we'll use the + operator. This leads us to [\p{L}-]+.
The regular expression has some flags to change some settings. I have set the u flag for Unicode. In PHP, you start your regular expression with a delimiter (usually a /, but it could be ~ or another symbol), then you put your pattern, you finish with the same delimiter, and you add the flags. So you could write ~[\p{L}-]+~u or #[\p{L}-]+#u and it would be the same.
Counting words starting with S and 8-10 long
We'll use a regular expression again: /(?<=\P{L}|^)s[\p{L}-]{7,9}(?=\P{L}|$)/ui
A test case on regex101
This one is a bit more complicated:
We'll use the u flag for Unicode and the i flag for case-insensitive matching, as we want to match both s and S.
Then, searching for a word of 8 to 10 chars is like searching for an s followed by 7 to 9 Unicode letters. To say that you want something 7 to 9 times, you use {7,9} after the element you are searching for. So this becomes [\p{L}-]{7,9} to say we want any Unicode letter or hyphen 7 to 9 times. If we add the s in front, we get s[\p{L}-]{7,9}. This will match sex-appeal and SARS-CoV, but not sos.
Now, a bit more complicated: we only want to match if this word is preceded by a non-letter or the beginning of the string. This is to avoid matching struction in the word obstruction. This can be solved with a positive lookbehind (?<= something ), where the something is \P{L} for a Unicode non-letter or (using the pipe | operator) the beginning of the string with the ^ operator. This leads to this positive lookbehind: (?<=\P{L}|^)
The same goes for what comes after the word. It should be a non-letter or the end of the string. This is done with a positive lookahead (?= something ), where something is \P{L} to match a Unicode non-letter or $ to match the end of the string. This leads to this positive lookahead: (?=\P{L}|$)
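To see the pieces above working together, here is a small sketch on a made-up sample line (the sample text and variable names are assumptions for illustration only):

```php
<?php
$line = "Saturday sex-appeal obstruction sos";

// Lookbehind: start of string or a non-letter before the "s".
// Lookahead: a non-letter or end of string after the word.
$pattern = '/(?<=\P{L}|^)s[\p{L}-]{7,9}(?=\P{L}|$)/ui';

$nbr_s_words = preg_match_all($pattern, $line, $matches);

echo $nbr_s_words, "\n"; // 2
print_r($matches[0]);    // Saturday, sex-appeal
```

"obstruction" is skipped because its s is preceded by a letter, and "sos" is too short.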
Integrating into your code
<?php
$total_words = 0;
$total_s_words = 0;
$file = fopen("data.txt", "r");
while (($line = fgets($file)) !== false) {
$nbr_words = preg_match_all('/[\p{L}-]+/u', $line, $matches, PREG_SET_ORDER, 0);
if ($nbr_words) $total_words += $nbr_words;
$nbr_s_words = preg_match_all('/(?<=\P{L}|^)s[\p{L}-]{7,9}(?=\P{L}|$)/ui', $line, $matches, PREG_SET_ORDER, 0);
if ($nbr_s_words) $total_s_words += $nbr_s_words;
}
print "Number of words present in given file: $total_words\n";
print "Number of words starting with 's' and 8-10 chars long: $total_s_words\n";
fclose($file);
?>
A working online example
As mentioned in the comments, strlen() gives the length of a string. If you are using PHP 8, you can use str_starts_with() to check whether the string starts with a given prefix. In older versions you can use strpos(), substr() or [0] to get the character in the first position (e.g. $word[0]).
Since you have an array of words, you'll want to loop through it and check each one, something like:
foreach($words as $word) {
if(strlen($word) >= 8 && strlen($word) <= 10) {
//count words between 8 and 10
}
if(str_starts_with($word, 'S')) {
//count words starting with S
}
}
If you want words that are both between 8 and 10 characters and start with S at the same time, you can just combine the two above if statements.
References for these functions:
https://www.php.net/manual/en/function.strlen.php
https://www.php.net/manual/en/function.str-starts-with.php
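Combining the two checks could look roughly like this. The helper name `count_s_words` is hypothetical, and `strncmp()` is used instead of `str_starts_with()` only so the sketch also runs on PHP versions before 8:

```php
<?php
// Hypothetical helper: counts words that start with "S"
// and are 8 to 10 bytes long (strlen() counts bytes).
function count_s_words(array $words): int
{
    $count = 0;
    foreach ($words as $word) {
        $len = strlen($word);
        if ($len >= 8 && $len <= 10 && strncmp($word, 'S', 1) === 0) {
            $count++;
        }
    }
    return $count;
}

// Splitting on any whitespace keeps the trailing newline out of the last word.
$words = preg_split('/\s+/', "Saturday sos Sunshine obstruction Strawberries\n", -1, PREG_SPLIT_NO_EMPTY);
echo count_s_words($words), "\n"; // 2
```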
You have to use strlen() and substr().
Example code below
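<?php
$count = 0;
$file = fopen("data.txt", "r");
while (($line = fgets($file)) !== false) {
    // Split on any whitespace so the newline kept by fgets()
    // is not counted as part of the last word on the line
    $words = preg_split('/\s+/', $line, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
        // strlen() gives the length of the word in bytes
        // (use mb_strlen() if words can contain multibyte characters)
        $len = strlen($word);
        if ($len >= 8 && $len <= 10) {
            // Check the first character; if it is S, increment the counter
            if (substr($word, 0, 1) == "S") {
                $count++;
            }
        }
    }
}
echo "Number of words starting with S and 8 to 10 characters long: " . $count;
fclose($file);
?>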

Make two simple regex's into one

I am trying to make a regex that will look behind .txt and then behind the "-" and get the first digit .... in the example, it would be a 1.
$record_pattern = '/.txt.+/';
preg_match($record_pattern, $decklist, $record);
print_r($record);
.txt?n=chihoi%20%283-1%29
I want to write this as one expression but can only seem to do it as two. This is the first time working with regex's.
You can use this:
$record_pattern = '/\.txt.+-(\d)/';
Now, the first group contains what you want.
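A short sketch of reading that group with preg_match, using the sample string from the question (the variable name `$digit` is an assumption):

```php
<?php
$decklist = '.txt?n=chihoi%20%283-1%29';
$record_pattern = '/\.txt.+-(\d)/';

if (preg_match($record_pattern, $decklist, $record)) {
    // $record[0] is the whole match, $record[1] is the captured digit
    $digit = $record[1];
    echo $digit, "\n"; // 1
}
```

Note that `.+` is greedy, so with several hyphens in the string this picks the digit after the last hyphen that is followed by a digit.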
Your regex would be,
\.txt[^-]*-\K\d
You don't need any groups. The pattern matches from .txt up to the first literal -. The \K then discards everything matched so far from the reported match; in our case that is the .txt?n=chihoi%20%283- string. Matching then continues with the first digit right after the -, which is all that gets returned.
DEMO
Your PHP code would be,
<?php
$mystring = ".txt?n=chihoi%20%283-1%29";
$regex = '~\.txt[^-]*-\K\d~';
if (preg_match($regex, $mystring, $m)) {
$yourmatch = $m[0];
echo $yourmatch; // => 1
}
?>

preg_match match all starting words

I am trying to get all matched patterns from a list of words;
$pattern = '/^(ab|abc|abcd|asdf)/';
preg_match_all($pattern, 'abcdefgh', $matches);
I want to get 'ab, abc and abcd'
But this return only 'ab'. It works if I loop through patterns after exploding them.
Is there any way to solve it though single match?
Regular expressions consume characters as they are matching through the string, so they can't natively find overlapping matches.
You can use extended features like lookahead assertions together with capturing groups, but that requires an ugly construction:
preg_match_all(
'/^
(?:(?=(ab)))?
(?:(?=(abc)))?
(?:(?=(abcd)))?
(?:(?=(asdf)))?
/x',
$subject, $result, PREG_SET_ORDER);
for ($matchi = 0; $matchi < count($result); $matchi++) {
for ($backrefi = 0; $backrefi < count($result[$matchi]); $backrefi++) {
# Matched text = $result[$matchi][$backrefi];
}
}
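The snippet above leaves `$subject` undefined and only outlines the loop. A runnable sketch under the assumption that the subject is the string from the question, collecting the non-empty groups into an illustrative `$prefixes` array:

```php
<?php
$subject = 'abcdefgh';

// Each optional group wraps a lookahead, so nothing is consumed and
// every alternative is tested from the start of the string.
$pattern = '/^
    (?:(?=(ab)))?
    (?:(?=(abc)))?
    (?:(?=(abcd)))?
    (?:(?=(asdf)))?
/x';

preg_match_all($pattern, $subject, $result, PREG_SET_ORDER);

$prefixes = [];
foreach ($result as $match) {
    // Index 0 is the (empty) overall match; the rest are the capture groups
    foreach (array_slice($match, 1) as $group) {
        if ($group !== '') {
            $prefixes[] = $group;
        }
    }
}

print_r($prefixes); // ab, abc, abcd
```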

Regex Preg_match_all match all pattern

Here is my concern,
I have a string and I need to extract characters two by two.
$str = "abcdef" should return array('ab', 'bc', 'cd', 'de', 'ef'). I want to use preg_match_all instead of loops. Here is the pattern I am using.
$str = "abcdef";
preg_match_all('/[\w]{2}/', $str);
The thing is, it returns Array('ab', 'cd', 'ef'). It misses 'bc' and 'de'.
I have the same problem if I want to extract a certain number of words
$str = "ab cd ef gh ij";
preg_match_all('/([\w]+ ){2}/', $str); // returns array('ab cd', 'ef gh'), I'm also missing the last part
What am I missing? Or is it simply not possible to do so with preg_match_all?
For the first problem, what you want is to match overlapping strings, and this requires a zero-width (non-consuming) look-around to grab the characters:
/(?=(\w{2}))/
The regex above will capture the match in the first capturing group.
DEMO
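A minimal sketch of that first pattern in PHP; the overlapping pairs end up in group 1, since the overall match is always empty:

```php
<?php
$str = "abcdef";

// The lookahead consumes nothing, so a match is attempted at every position
preg_match_all('/(?=(\w{2}))/', $str, $m);

print_r($m[1]); // ab, bc, cd, de, ef
```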
For the second problem, it seems that you also want overlapping strings. Using the same trick:
/(?=(\b\w+ \w+\b))/
Note that \b is added to check the boundary of the word. Since the match does not consume text, the next match will be attempted at the next index (which is in the middle of the first word), instead of at the end of the 2nd word. We don't want to capture from middle of a word, so we need the boundary check.
Note that \b's definition is based on \w, so if you ever change the definition of a word, you need to emulate the word boundary with look-ahead and look-behind with the corresponding character set.
DEMO
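The word-pair pattern can be tried the same way; this sketch uses the sample string from the question:

```php
<?php
$str = "ab cd ef gh ij";

// \b keeps the lookahead from capturing a pair that starts mid-word
preg_match_all('/(?=(\b\w+ \w+\b))/', $str, $m);

print_r($m[1]); // "ab cd", "cd ef", "ef gh", "gh ij"
```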
In case you need a non-regex solution, try this:
<?php
$str = "abcdef";
$len = strlen($str);
$arr = array();
for($count = 0; $count < ($len - 1); $count++)
{
$arr[] = $str[$count].$str[$count+1];
}
print_r($arr);
?>
See Codepad.

Regular expression for between two dynamic patterns

I want to find anything that matches
[^1] and [/^1]
Eg if the subject is like this
sometext[^1]abcdef[/^1]somemoretext[^2]12345[/^2]
I want to get back an array with abcdef and 12345 as the elements.
I read this
And I wrote this code and I am unable to advance past searching between []
<?php
$test = '[12345]';
getnumberfromstring($test);
function getnumberfromstring($text)
{
$pattern= '~(?<=\[)(.*?)(?=\])~';
$matches= array();
preg_match($pattern, $text, $matches);
var_dump($matches);
}
?>
Your test checks the string '[12345]', which does not follow the rule of having an "opening" of [^digit] and a "closing" of [/^digit]. Also, you're using preg_match when you should be using preg_match_all.
Try this:
<?php
$test = 'sometext[^1]abcdef[/^1]somemoretext[^2]12345[/^2]';
getnumberfromstring($test);
function getnumberfromstring($text)
{
$pattern= '/(?<=\[\^\d\])(.*?)(?=\[\/\^\d\])/';
$matches= array();
preg_match_all($pattern, $text, $matches);
var_dump($matches);
}
?>
That other answer doesn't really apply to your case; your delimiters are more complex and you have to use part of the opening delimiter to match the closing one. Also, unless the numbers inside the tags are limited to one digit, you can't use a lookbehind to match the first one. You have to match the tags in the normal way and use a capturing group to extract the content. (Which is how I would have done it anyway. Lookbehind should never be the first tool you reach for.)
'~\[\^(\d+)\](.*?)\[/\^\1\]~'
The number from the opening delimiter is captured in the first group and the backreference \1 matches the same number, thus ensuring that the delimiters are correctly paired. The text between the delimiters is captured in group #2.
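A brief sketch of that pattern applied to the subject from the question (the variable names are illustrative):

```php
<?php
$subject = 'sometext[^1]abcdef[/^1]somemoretext[^2]12345[/^2]';

// Group 1: the tag number, reused by \1 in the closing tag.
// Group 2: the text between the paired tags.
preg_match_all('~\[\^(\d+)\](.*?)\[/\^\1\]~', $subject, $m);

print_r($m[2]); // abcdef, 12345
```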
I have tested following code in php 5.4.5:
<?php
$foo = 'sometext[^1]abcdef[/^1]somemoretext[^2]12345[/^2]';
function getnumberfromstring($text)
{
$matches= array();
# match [^1]...[/^1], [^2]...[/^2]
preg_match_all('/\[\^(\d+)\]([^\[\]]+)\[\/\^\1\]/', $text, $matches, PREG_SET_ORDER);
for($i = 0; $i < count($matches); ++$i)
printf("%s\n", $matches[$i][2]);
}
getnumberfromstring($foo);
?>
output:
abcdef
12345
