I want to read a file and show the count of words that are 8 to 10 characters long and start with S.
I can get the total word count of the file, but I don't know how to apply the conditions for the length and for starting with S.
I am new to PHP, so if anyone has an idea, please let me know.
Below is my code:
<?php
$count = 0;
$file = fopen("data.txt", "r");
while (($line = fgets($file)) !== false) {
$words = explode(" ", $line);
$count = $count + count($words);
}
echo "Number of words present in given file: " . $count;
fclose($file);
?>
I also need to know how to do this for a CSV file.
Finding words is probably a bit more complicated, because there might not be spaces between words and there is also punctuation.
I know that you are new to PHP, and I expect you don't know what regular expressions are, so my answer might seem rather complicated, but I'll try to explain it. Regular expressions are used to search for or replace things in strings. They are a very powerful search tool, and learning to use them is useful in any programming language.
Counting the words
Splitting on spaces might not be sufficient. There might be tabs or other characters, so we could split the string using a regular expression, but this might also get complicated. Instead, we'll use a regular expression to match all the words inside the line. This can be done like this:
$nbr_words = preg_match_all('/[\p{L}-]+/u', $line, $matches, PREG_SET_ORDER, 0);
Here's the running example
The text could contain accents and punctuation, like this:
En-tête: Attention, en français les mots contiennent des caractères accentués.
This will return 10 matches. It would also work if you had tabs instead of spaces.
Now, what does this regular expression mean?
Let's see it in action on regex101
Explanation:
\p{L} is to find any unicode letter, such as a, b, ü or é but only letters in any language. So , or ´ won't be matched.
[] is used to define a list of possible chars. So [abc] would mean the letter "a", "b" or "c". You can also set ranges like [a-z]. If you want to say "a", "b" or "-", then you have to put the "-" char at the beginning or the end, like this: [ab-]. As words can have hyphens, like week-end, self-service or après-midi, we have to match Unicode letters or hyphens, leading to [\p{L}-].
This Unicode letter or hyphen must appear one or more times. To do that, we'll use the + operator. This leads us to [\p{L}-]+.
The regular expression has some flags to change some settings. I have set the u flag for Unicode. In PHP, you start your regular expression with a delimiter (usually a /, but it could be ~ or whatever), then you put your pattern, you finish with the same delimiter, and you add the flags. So you could write ~[\p{L}-]+~u or #[\p{L}-]+#u and it would be the same.
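To make the pieces above concrete, here is a minimal sketch that runs the word pattern on the sample sentence. It only assumes that the script file is saved as UTF-8; the variable names are reused from the snippet above.

```php
<?php
// Sample line from above (assumed to be UTF-8 encoded).
$line = "En-tête: Attention, en français les mots contiennent des caractères accentués.";

// preg_match_all() returns the number of full matches; $matches[0] holds the words.
$nbr_words = preg_match_all('/[\p{L}-]+/u', $line, $matches);

echo $nbr_words, "\n";   // 10
print_r($matches[0]);    // En-tête, Attention, en, français, ...
```

Without the PREG_SET_ORDER flag, the matched words are all grouped in $matches[0], which is handy if you want to inspect them and not just count them.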
Counting words starting with S and 8-10 long
We'll use a regular expression again: /(?<=\P{L}|^)s[\p{L}-]{7,9}(?=\P{L}|$)/ui
A test case on regex101
This one is a bit more complicated:
We'll use the u flag for Unicode and the i flag for case-insensitive matching, as we want to match s and also S in uppercase.
Then, searching for a word of 8 to 10 chars is like searching for an s followed by 7 to 9 Unicode letters. To say that you want something 7 to 9 times, you use {7,9} after the element you are searching for. So this becomes [\p{L}-]{7,9} to say we want any Unicode letter or hyphen 7 to 9 times. If we add the s in front, we get s[\p{L}-]{7,9}. This will match sex-appeal and SARS-CoV, but not sos.
now, a bit more complicated. We only want to match if this word is preceded by a non-letter or the beginning of the string. This is to avoid matching struction in the word obstruction. This can be solved with a positive lookbehind (?<= something ) and the something is \P{L} for a unicode non-letter or (use the pipe | operator) the beginning of a string with the ^ operator. This leads to this positive lookbehind: (?<=\P{L}|^)
same thing for what is after the word. It should be a non-letter or the end of the string. This is done with a positive lookahead (?= something ) where something is \P{L} to match a unicode non-letter or $ to match the end of a string. This leads to this positive lookahead: (?=\P{L}|$)
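Here is a small sketch of the complete pattern in action. The sample line is made up for this illustration; it contains three words that qualify, plus a few that don't (too short, or with the s in the middle of a word).

```php
<?php
$pattern = '/(?<=\P{L}|^)s[\p{L}-]{7,9}(?=\P{L}|$)/ui';

// Hypothetical sample line, invented for this example.
$line = "Students say obstruction is sometimes a sex-appeal subject, sos!";

$nbr_s_words = preg_match_all($pattern, $line, $matches);

echo $nbr_s_words, "\n";   // 3
print_r($matches[0]);      // Students, sometimes, sex-appeal
```

Note that obstruction is skipped because its s is preceded by a letter, and subject is skipped because it is only 7 chars long.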
Integrating it into your code
<?php
$total_words = 0;
$total_s_words = 0;
$file = fopen("data.txt", "r");
while (($line = fgets($file)) !== false) {
$nbr_words = preg_match_all('/[\p{L}-]+/u', $line, $matches, PREG_SET_ORDER, 0);
if ($nbr_words) $total_words += $nbr_words;
$nbr_s_words = preg_match_all('/(?<=\P{L}|^)s[\p{L}-]{7,9}(?=\P{L}|$)/ui', $line, $matches, PREG_SET_ORDER, 0);
if ($nbr_s_words) $total_s_words += $nbr_s_words;
}
print "Number of words present in given file: $total_words\n";
print "Number of words starting with 's' and 8-10 chars long: $total_s_words\n";
fclose($file);
?>
A working online example
As mentioned in the comments, strlen() gives the length of a string. If you are using PHP 8, you can use str_starts_with() to check whether a string starts with a given letter. In older versions, you can use strpos() or substr(), or read the first character directly with [0] (e.g. $word[0]).
Since you have an array of words, you'll want to loop through it and check each one, something like:
foreach($words as $word) {
if(strlen($word) >= 8 && strlen($word) <= 10) {
//count words between 8 and 10
}
if(str_starts_with($word, 'S')) {
//count words starting with S
}
}
If you want words that are both between 8 and 10 characters and start with S at the same time, you can just combine the two above if statements.
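As a rough sketch of that combination, the two checks can go into a single condition. The function name countLongSWords() and the sample words are made up for this example, and str_starts_with() assumes PHP 8 or later.

```php
<?php
// Hypothetical helper: counts words that are 8 to 10 characters long AND start with "S".
// Requires PHP 8+ for str_starts_with(); on older versions, $word[0] === 'S' is one alternative.
function countLongSWords(array $words): int
{
    $count = 0;
    foreach ($words as $word) {
        $len = strlen($word);
        if ($len >= 8 && $len <= 10 && str_starts_with($word, 'S')) {
            $count++;
        }
    }
    return $count;
}

echo countLongSWords(['Saturday', 'Sun', 'saturday', 'Statistics', 'Supercalifragilistic']); // 2
```

Keep in mind that strlen() counts bytes, so words with accented characters may appear longer than they look.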
References for these functions:
https://www.php.net/manual/en/function.strlen.php
https://www.php.net/manual/en/function.str-starts-with.php
You have to use strlen() and substr().
Example code below
<?php
$count = 0;
$file = fopen("data.txt", "r");
while (($line = fgets($file)) !== false) {
// fgets() keeps the trailing newline, so trim the line before splitting it
$words = explode(" ", trim($line));
foreach($words as $word) {
// strlen() will give the length of the string/word
$len = strlen($word);
if($len >= 8 && $len <= 10) {
// Check the first character, if S then increment the counter
if(substr($word, 0, 1) == "S")
$count++;
}
}
}
echo "Number of words starting with S and 8 to 10 characters long: " . $count;
fclose($file);
?>
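For the CSV part of the question, one possible approach is to read the file with fgetcsv() and apply the same checks to every field. This is only a sketch under some assumptions: the helper name countSWordsInCsv() is made up, the file is comma-separated with double-quote enclosures, and a field may contain several words separated by whitespace.

```php
<?php
// Hypothetical helper: counts words starting with "S" that are 8 to 10 characters long in a CSV file.
function countSWordsInCsv(string $path): int
{
    $count = 0;
    $file = fopen($path, 'r');
    if ($file === false) {
        return 0;
    }
    // fgetcsv() returns one row at a time as an array of field values.
    while (($row = fgetcsv($file, 0, ',', '"', '\\')) !== false) {
        foreach ($row as $field) {
            // A field may hold several words, so split it on whitespace.
            $words = preg_split('/\s+/', trim((string) $field), -1, PREG_SPLIT_NO_EMPTY);
            foreach ($words as $word) {
                $len = strlen($word);
                if ($len >= 8 && $len <= 10 && substr($word, 0, 1) == "S") {
                    $count++;
                }
            }
        }
    }
    fclose($file);
    return $count;
}
```

Calling countSWordsInCsv("data.csv") would then give the count for that file; the file name here is just a placeholder.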
Related
How would I go about splitting the word:
oneTwoThreeFour
into an array so that I can get:
one Two Three Four
with preg_match ?
I tried this but it just gives the whole word
$words = preg_match("/[a-zA-Z]*(?:[a-z][a-zA-Z]*[A-Z]|[A-Z][a-zA-Z]*[a-z])[a-zA-Z]*\b/", $string, $matches);
You can use preg_split as:
$arr = preg_split('/(?=[A-Z])/',$str);
See it
I'm basically splitting the input string just before each uppercase letter. The regex used, (?=[A-Z]), matches the point just before an uppercase letter.
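A short sketch of that split, for illustration. The PREG_SPLIT_NO_EMPTY flag is an addition that is not in the line above; it simply drops any empty piece that might otherwise show up. The second call shows what happens with consecutive capitals.

```php
<?php
// Split just before every uppercase letter.
$parts = preg_split('/(?=[A-Z])/', 'oneTwoThreeFour', -1, PREG_SPLIT_NO_EMPTY);
print_r($parts); // one, Two, Three, Four

// Consecutive capitals are split letter by letter with this pattern.
$caps = preg_split('/(?=[A-Z])/', 'hasConsecutiveCAPS', -1, PREG_SPLIT_NO_EMPTY);
print_r($caps);  // has, Consecutive, C, A, P, S
```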
You can also use preg_match_all as:
preg_match_all('/((?:^|[A-Z])[a-z]+)/',$str,$matches);
Explanation:
( - Start of capturing parenthesis.
(?: - Start of non-capturing parenthesis.
^ - Start anchor.
| - Alternation.
[A-Z] - Any one capital letter.
) - End of non-capturing parenthesis.
[a-z]+ - one or more lowercase letters.
) - End of capturing parenthesis.
I know that this is an old question with an accepted answer, but IMHO there is a better solution:
<?php // test.php Rev:20140412_0800
$ccWord = 'NewNASAModule';
$re = '/(?#! splitCamelCase Rev:20140412)
# Split camelCase "words". Two global alternatives. Either g1of2:
(?<=[a-z]) # Position is after a lowercase,
(?=[A-Z]) # and before an uppercase letter.
| (?<=[A-Z]) # Or g2of2; Position is after uppercase,
(?=[A-Z][a-z]) # and before upper-then-lower case.
/x';
$a = preg_split($re, $ccWord);
$count = count($a);
for ($i = 0; $i < $count; ++$i) {
printf("Word %d of %d = \"%s\"\n",
$i + 1, $count, $a[$i]);
}
?>
Note that this regex, (like codaddict's '/(?=[A-Z])/' solution - which works like a charm for well formed camelCase words), matches only a position within the string and consumes no text at all. This solution has the additional benefit that it also works correctly for not-so-well-formed pseudo-camelcase words such as: StartsWithCap and: hasConsecutiveCAPS.
Input:
oneTwoThreeFour
StartsWithCap
hasConsecutiveCAPS
NewNASAModule
Output:
Word 1 of 4 = "one"
Word 2 of 4 = "Two"
Word 3 of 4 = "Three"
Word 4 of 4 = "Four"
Word 1 of 3 = "Starts"
Word 2 of 3 = "With"
Word 3 of 3 = "Cap"
Word 1 of 3 = "has"
Word 2 of 3 = "Consecutive"
Word 3 of 3 = "CAPS"
Word 1 of 3 = "New"
Word 2 of 3 = "NASA"
Word 3 of 3 = "Module"
Edited: 2014-04-12: Modified regex, script and test data to correctly split: "NewNASAModule" case (in response to rr's comment).
While ridgerunner's answer works great, it seems not to work with all-caps substrings that appear in the middle of a sentence. I use the following, and it seems to deal with these just fine:
function splitCamelCase($input)
{
return preg_split(
'/(^[^A-Z]+|[A-Z][^A-Z]+)/',
$input,
-1, /* no limit on the number of pieces */
PREG_SPLIT_NO_EMPTY /*don't return empty elements*/
| PREG_SPLIT_DELIM_CAPTURE /*don't strip anything from output array*/
);
}
Some test cases:
assert(splitCamelCase('lowHigh') == ['low', 'High']);
assert(splitCamelCase('WarriorPrincess') == ['Warrior', 'Princess']);
assert(splitCamelCase('SupportSEELE') == ['Support', 'SEELE']);
assert(splitCamelCase('LaunchFLEIAModule') == ['Launch', 'FLEIA', 'Module']);
assert(splitCamelCase('anotherNASATrip') == ['another', 'NASA', 'Trip']);
A functionized version of @ridgerunner's answer.
/**
* Converts camelCase string to have spaces between each word.
* @param $camelCaseString
* @return string
*/
function fromCamelCase($camelCaseString) {
$re = '/(?<=[a-z])(?=[A-Z])/x';
$a = preg_split($re, $camelCaseString);
return join(" ", $a);
}
$string = preg_replace( '/([a-z0-9])([A-Z])/', "$1 $2", $string );
The trick is that the pattern matches every place where a lowercase letter or digit ($1) is immediately followed by an uppercase letter ($2), and the replacement puts a space between the two.
For example:
In helloWorld, $1 matches "o" and $2 matches "W", so "oW" becomes "o W" and in short you get "hello World". HelloWorld likewise becomes "Hello World", because the leading "H" has no lowercase letter in front of it. Then you can lowercase the words, uppercase the first word, explode them on the space, or use a _ or some other character to keep them separate.
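A quick sketch of the replacement on a few inputs. The sample strings are taken from elsewhere on this page; single quotes are used for the replacement so that $1 and $2 are clearly literal backreferences.

```php
<?php
$spaced = preg_replace('/([a-z0-9])([A-Z])/', '$1 $2', 'helloWorld');
echo $spaced, "\n"; // hello World

// Explode on the inserted space to get an array of words.
$words = explode(' ', preg_replace('/([a-z0-9])([A-Z])/', '$1 $2', 'oneTwoThreeFour'));
print_r($words);    // one, Two, Three, Four

// An acronym followed by a word is not separated by this pattern.
$acronym = preg_replace('/([a-z0-9])([A-Z])/', '$1 $2', 'newNASAModule');
echo $acronym, "\n"; // new NASAModule
```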
Short and simple.
When determining the best pattern for your project, you will need to consider the following pattern factors:
Accuracy (Robustness) -- whether the pattern is correct in all cases and is reasonably future-proof
Efficiency -- the pattern should be direct, deliberate, and avoid unnecessary labor
Brevity -- the pattern should use appropriate techniques to avoid unnecessary character length
Readability -- the pattern should be kept as simple as possible
The above factors also happen to be in the hierarchical order that I strive to obey. In other words, it doesn't make much sense to me to prioritize 2, 3, or 4 when 1 doesn't quite satisfy the requirements. Readability is at the bottom of the list for me because in most cases I can follow the syntax.
Capture Groups and Lookarounds often impact pattern efficiency. The truth is, unless you are executing this regex on thousands of input strings, there is no need to toil over efficiency. It is perhaps more important to focus on pattern readability which can be associated with pattern brevity.
Some patterns below will require some additional handling/flagging by their preg_ function, but here are some pattern comparisons based on the OP's sample input:
preg_split() patterns:
/^[^A-Z]+\K|[A-Z][^A-Z]+\K/ (21 steps)
/(^[^A-Z]+|[A-Z][^A-Z]+)/ (26 steps)
/[^A-Z]+\K(?=[A-Z])/ (43 steps)
/(?=[A-Z])/ (50 steps)
/(?=[A-Z]+)/ (50 steps)
/([a-z]{1})[A-Z]{1}/ (53 steps)
/([a-z0-9])([A-Z])/ (68 steps)
/(?<=[a-z])(?=[A-Z])/x (94 steps) ...for the record, the x is useless.
/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/ (134 steps)
preg_match_all() patterns:
/[A-Z]?[a-z]+/ (14 steps)
/((?:^|[A-Z])[a-z]+)/ (35 steps)
I'll point out that there is a subtle difference between the output of preg_match_all() and preg_split(). preg_match_all() will output a 2-dimensional array, in other words, all of the fullstring matches will be in the [0] subarray; if there is a capture group used, those substrings will be in the [1] subarray. On the other hand, preg_split() only outputs a 1-dimensional array and therefore provides a less bloated and more direct path to the desired output.
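The difference in output shape can be seen with a small sketch. It uses the OP's sample input and two of the patterns listed above; PREG_SPLIT_NO_EMPTY is added to the preg_split() call to drop any empty pieces.

```php
<?php
$input = 'oneTwoThreeFour';

// preg_match_all(): fullstring matches in $matches[0], capture group 1 in $matches[1].
preg_match_all('/((?:^|[A-Z])[a-z]+)/', $input, $matches);
print_r($matches);

// preg_split(): a flat list of the pieces.
$pieces = preg_split('/(?=[A-Z])/', $input, -1, PREG_SPLIT_NO_EMPTY);
print_r($pieces);
```

With preg_match_all() you still have to pick the subarray you want (here $matches[0]), whereas the result of preg_split() can be used as it is.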
Some of the patterns are insufficient when dealing with camelCase strings that contain an ALLCAPS/acronym substring in them. If this is a fringe case that is possible within your project, it is logical to only consider patterns that handle these cases correctly. I will not be testing TitleCase input strings because that is creeping too far from the question.
New Extended Battery of Test Strings:
oneTwoThreeFour
hasConsecutiveCAPS
newNASAModule
USAIsGreatAgain
Suitable preg_split() patterns:
/[a-z]+\K|(?=[A-Z][a-z]+)/ (149 steps) *I had to use [a-z] for the demo to count properly
/(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/ (547 steps)
Suitable preg_match_all() pattern:
/[A-Z]?[a-z]+|[A-Z]+(?=[A-Z][a-z]|$)/ (75 steps)
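To check the preg_match_all() pattern against the extended battery, a loop like the following can be used. This is just a sketch; the $results array is introduced here only to collect the output.

```php
<?php
$pattern = '/[A-Z]?[a-z]+|[A-Z]+(?=[A-Z][a-z]|$)/';
$tests = ['oneTwoThreeFour', 'hasConsecutiveCAPS', 'newNASAModule', 'USAIsGreatAgain'];

$results = [];
foreach ($tests as $test) {
    // $out[0] holds the fullstring matches, i.e. the individual words.
    $results[$test] = preg_match_all($pattern, $test, $out) ? $out[0] : [];
    echo $test, ' => ', implode(' ', $results[$test]), "\n";
}
// oneTwoThreeFour => one Two Three Four
// hasConsecutiveCAPS => has Consecutive CAPS
// newNASAModule => new NASA Module
// USAIsGreatAgain => USA Is Great Again
```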
Finally, my recommendations based on my pattern principles / factor hierarchy. Also, I recommend preg_split() over preg_match_all() (despite the preg_match_all() patterns having fewer steps) as a matter of directness to the desired output structure. (Of course, choose whatever you like.)
Code: (Demo)
$noAcronyms = 'oneTwoThreeFour';
var_export(preg_split('~^[^A-Z]+\K|[A-Z][^A-Z]+\K~', $noAcronyms, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
var_export(preg_match_all('~[A-Z]?[^A-Z]+~', $noAcronyms, $out) ? $out[0] : []);
Code: (Demo)
$withAcronyms = 'newNASAModule';
var_export(preg_split('~[^A-Z]+\K|(?=[A-Z][^A-Z]+)~', $withAcronyms, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
var_export(preg_match_all('~[A-Z]?[^A-Z]+|[A-Z]+(?=[A-Z][^A-Z]|$)~', $withAcronyms, $out) ? $out[0] : []);
I took cool guy Ridgerunner's code (above) and made it into a function:
echo deliciousCamelcase('NewNASAModule');
function deliciousCamelcase($str)
{
$formattedStr = '';
$re = '/
(?<=[a-z])
(?=[A-Z])
| (?<=[A-Z])
(?=[A-Z][a-z])
/x';
$a = preg_split($re, $str);
$formattedStr = implode(' ', $a);
return $formattedStr;
}
This will return: New NASA Module
Another option is matching /[A-Z]?[a-z]+/ - if you know your input is on the right format, it should work nicely.
[A-Z]? would match an uppercase letter (or nothing). [a-z]+ would then match all following lowercase letters, until the next match.
Working example: https://regex101.com/r/kNZfEI/1
You can split on a "glide" from lowercase to uppercase thus:
$parts = preg_split('/([a-z]{1})[A-Z]{1}/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);
//PREG_SPLIT_DELIM_CAPTURE to also return bracketed things
var_dump($parts);
Annoyingly you will then have to rebuild the words from each corresponding pair of items in $parts
Hope this helps
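One way the rebuilding step could look is sketched below. Note an assumption: with the pattern exactly as written above, the uppercase letter is part of the delimiter but is not captured, so it would be lost. This sketch therefore uses a variant that captures both letters, ([a-z])([A-Z]), and the function name rebuildFromGlideSplit() is made up for the example.

```php
<?php
// Hypothetical helper: splits on lowercase-to-uppercase "glides" and rebuilds the words.
function rebuildFromGlideSplit(string $string): array
{
    // Variant pattern: both letters of the glide are captured so neither is lost.
    $parts = preg_split('/([a-z])([A-Z])/', $string, -1, PREG_SPLIT_DELIM_CAPTURE);

    $words = [];
    $current = '';
    $n = count($parts);
    // $parts repeats in triples: text, captured lowercase letter, captured uppercase letter.
    for ($i = 0; $i < $n; $i += 3) {
        $current .= $parts[$i];
        if (isset($parts[$i + 1])) {
            // The lowercase letter closes the current word...
            $words[] = $current . $parts[$i + 1];
            // ...and the uppercase letter opens the next one.
            $current = $parts[$i + 2];
        }
    }
    $words[] = $current;
    return $words;
}

print_r(rebuildFromGlideSplit('oneTwoThreeFour')); // one, Two, Three, Four
```

PREG_SPLIT_NO_EMPTY is deliberately left out here, because the loop relies on the pieces always coming in groups of three.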
First of all codaddict thank you for your pattern, it helped a lot!
I needed a solution that works in case a single-letter word such as the article 'a' exists:
e.g. thisIsACamelcaseSentence.
I found the solution in doing a two step preg_match and made a function with some options:
/*
* input: 'thisIsACamelCaseSentence' output: 'This Is A Camel Case Sentence'
* options $case: 'allUpperCase'[default] >> 'This Is A Camel Case Sentence'
* 'allLowerCase' >> 'this is a camel case sentence'
* 'firstUpperCase' >> 'This is a camel case sentence'
* @return: string
*/
function camelCaseToWords($string, $case = null){
$case = $case ?? 'allUpperCase';
// Find first occurrences of two capitals
preg_match_all('/((?:^|[A-Z])[A-Z]{1})/',$string, $twoCapitals);
// Split them with the 'zzzzzz' string. e.g. 'AZ' turns into 'AzzzzzzZ'
foreach($twoCapitals[0] as $match){
$firstCapital = $match[0];
$lastCapital = $match[1];
$temp = $firstCapital.'zzzzzz'.$lastCapital;
$string = str_replace($match, $temp, $string);
}
// Now split words
preg_match_all('/((?:^|[A-Z])[a-z]+)/', $string, $words);
$output = "";
$i = 0;
foreach($words[0] as $word){
switch($case){
case 'allUpperCase':
$word = ucfirst($word);
break;
case 'allLowerCase':
$word = strtolower($word);
break;
case 'firstUpperCase':
($i == 0) ? $word = ucfirst($word) : $word = strtolower($word);
break;
}
// remove the 'zzzzzz' from a word if it has it
$word = str_replace('zzzzzz','', $word);
$output .= $word." ";
$i++;
}
return $output;
}
Feel free to use it, and in case there is an 'easier' way to do this in one step please comment!
Full function based on @codaddict's answer:
function splitCamelCase($str) {
$splitCamelArray = preg_split('/(?=[A-Z])/', $str);
return ucwords(implode(' ', $splitCamelArray));
}
Considering this input string:
"this is a Test String to get the last index of word with an uppercase letter in PHP"
How can I get the position of the last uppercase letter (in this example, the position of the first "P", not the last "P", of the word "PHP")?
I think this regex works. Give it a try.
https://regex101.com/r/KkJeho/1
$pattern = "/.*\s([A-Z])/";
//$pattern = "/.*\s([A-Z])[A-Z]+/"; pattern to match only all caps word
Edit: to address what Wiktor wrote in the comments, I think you could use str_replace() to replace all newlines in the input string with spaces before applying the regex.
That should make the regex treat the input as a single line and still give the correct output.
Not tested, though.
To find the position of the letter/word:
$str = "this is a Test String to get the last index of word with an uppercase letter in PHP";
$pattern = "/.*\s([A-Z])(\w+)/";
//$pattern = "/.*\s([A-Z])([A-Z]+)/"; pattern to match only all caps word
preg_match($pattern, $str, $match);
$letter = $match[1];
$word = $match[1] . $match[2];
$position = strrpos($str, $match[1].$match[2]);
echo "Letter to find: " . $letter . "\nWord to find: " . $word . "\nPosition of letter: " . $position;
https://3v4l.org/sJilv
If you also want to consider a non-regex version: You can try splitting the string at the whitespace character, iterating the resulting string array backwards and checking if the current string's first character is an upper case character, something like this (you may want to add index/null checks):
<?php
$str = "this is a Test String to get the last index of word with an uppercase letter in PHP";
$explodeStr = explode(" ",$str);
$i = count($explodeStr) - 1;
$characterCount=0;
while($i >= 0) {
$firstChar = $explodeStr[$i][0];
if($firstChar == strtoupper($firstChar)){
echo $explodeStr[$i]. ' at index: ';
// index = total length - length of this word - length of everything after it
$idx = strlen($str) - strlen($explodeStr[$i]) - $characterCount;
echo $idx;
break;
}
$characterCount += strlen($explodeStr[$i]) + 1; //+1 for whitespace
$i--;
}
This prints "PHP at index: 80", and 80 is indeed the index of the first P in PHP (counting whitespace).
Andreas' pattern looks pretty solid, but this will find the position faster...
.* \K[A-Z]{2,}
Pattern Demo
Here is the PHP implementation: Demo
$str='this is a Test String to get the last index of word with an uppercase letter in PHP test';
var_export(preg_match('/.* \K[A-Z]{2,}/',$str,$out,PREG_OFFSET_CAPTURE)?$out[0][1]:'fail');
// 80
If you want to see a condensed non-regex method, this will work:
Code: Demo
$str='this is a Test String to get the last index of word with an uppercase letter in PHP test';
$allcaps=array_filter(explode(' ',$str),'ctype_upper');
echo "Position = ",strrpos($str,end($allcaps));
Output:
Position = 80
This assumes that there is an all caps word in the input string. If there is a possibility of no all-caps words, then a conditional would sort it out.
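Such a conditional might look like the sketch below. The wrapper function lastAllCapsPosition() and the null return value are choices made for this example, not part of the snippet above.

```php
<?php
// Hypothetical helper: returns the position of the last all-caps word, or null if there is none.
function lastAllCapsPosition(string $str): ?int
{
    $allcaps = array_filter(explode(' ', $str), 'ctype_upper');
    if (!$allcaps) {
        return null; // no all-caps word found
    }
    return strrpos($str, end($allcaps));
}

var_export(lastAllCapsPosition('this is a Test String to get the last index of word with an uppercase letter in PHP test')); // 80
echo "\n";
var_export(lastAllCapsPosition('no capitals here')); // NULL
```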
Edit, after re-reading the question, I am unsure what exactly makes PHP the targeted substring -- whether it is because it is all caps, or just the last word to start with a capitalized letter.
If just the last word starting with an uppercase letter then this pattern will do: /.* \K[A-Z]/
If the word needs to be all caps, then it is possible that \b word boundaries may be necessary.
Some more samples and explanation from the OP would be useful.
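Both interpretations can be tried side by side with a sketch like this. The second pattern is only one possible way to add \b word boundaries and is an assumption on my part; the variable names are made up for the example.

```php
<?php
$str = 'this is a Test String to get the last index of word with an uppercase letter in PHP test';

// Last word that merely starts with an uppercase letter.
preg_match('/.* \K[A-Z]/', $str, $out, PREG_OFFSET_CAPTURE);
echo $out[0][1], "\n"; // 80

// Last all-caps word of two or more letters, delimited by word boundaries.
preg_match('/.*\b\K[A-Z]{2,}\b/', $str, $caps, PREG_OFFSET_CAPTURE);
echo $caps[0][0], ' at ', $caps[0][1], "\n"; // PHP at 80

// Without an all-caps word, the first pattern falls back to the last capitalized word.
preg_match('/.* \K[A-Z]/', 'this is a Test String', $other, PREG_OFFSET_CAPTURE);
echo $other[0][1], "\n"; // 15
```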
Another edit, you can declare a set of characters to exclude and use just two string functions. I am using a-z and a space with rtrim() then finding the right-most space, and adding 1 to it.
$str='this is a Test String to get the last index of word with an uppercase letter in PHP test';
echo strrpos(rtrim($str,'abcdefghijklmnopqrstuvwxyz '),' ')+1;
// 80
I am currently working on a web app which is using a specific string to call a function. Here is a sample string:
$string = "translate from-to word for translate";
First I need to validate the string, and it should be like the above $string. How should I validate the string?
Then I need to extract 3 substrings from $string.
The word that precedes the hyphen. (To be named: $source)
The word that follows the hyphen. (To be named: $target)
The text (not including the first space) that follows $target to the end of the string. (To be named: $translate)
This is my coding attempt to get the from and to:
$found = false;
$source ="";
$target = "";
$next = 3;
$prev = 1;
for($i=0;$i<strlen($string);$i++){
if($found== false){
if($string[$i] == "-"){
$found = true;
while($string[$i+$prev] != " "){
$target .= $string[$i+$prev];
$prev +=1;
}
/*$next -=1;
while($string[$i-$next] != " " && $next > 0){
$source .= $string[$i-$next];
$next -=1;
}*/
}
}
}
From that code, I can only return the $target, which contains the to after the -. I don't know how to get $source.
Please show me the fastest way to get the from as $source and to as $target.
Then I need to get word for translate (all of the string after from-to).
So the result should be
$target = "to";
$source = "from";
$translate = "word for translate";
Finally, if the $string has more than one hyphen, like translate from-to from-to test-test word for translate, it should return false.
Note: to and from are arbitrary strings.
Consider the following possible input strings:
translate from-to word for translate (1 hyphen, no accents or non-English characters)
translate dari-ke dari-ke word for translate (2 hyphens)
translate clé-solution word for translate (1 hyphen, accented character used)
translate goodbye-さようなら word for translate (1 hyphen , Japanese characters used)
A case-insensitive pattern like: /^[a-z]+? ([a-z]+)-([a-z]+?) ([a-z ]+)$/i will perform as requested on the first two sample strings with high efficiency, but not the last two.
Using the "word character" (\w) to match the substrings (instead of case-insensitive [a-z]) will perform as intended with the first two samples, but also allows 0-9 and _ as valid characters. This means a slight drop in pattern accuracy (this may be of no noticeable consequence to your project).
If you are translating strings that may go beyond English characters, it can be simpler / more forgiving to use a "negated character class" for matching. If you want to allow letters beyond a-z, like accented and other multibyte characters, then [^-] will offer a broad allowance of characters (at the expense of allowing many unwanted letters too). Here is a demo of this kind of pattern.
It is important to only write "capture groups" for substrings that you want to subsequently use. For this reason, I do not capture the leading substring translate.
list() is a handy "language construct" to assign variable names to array values. Notice that the first element (the fullstring match) is not assigned to a variable. This is why list()'s parameters start with a comma. If you don't wish to leverage the convenience of list(), then you can manually assign the three variable names over three lines like this:
$source=$out[1];
$target=$out[2];
$translate=$out[3];
Code: (Demo)
$strings=[
"translate from-to word for translate",
"translate dari-ke dari-ke word for translate",
"translate clé-solution word for translate",
"translate goodbye-さようなら word for translate"
];
foreach($strings as $string){
if(preg_match('/^[a-z]+? ([^-]+)-([^-]+?) ([a-z ]+)$/i',$string,$out)){
list(,$source,$target,$translate)=$out;
echo "source=$source; target=$target; translate=$translate";
}else{
var_export(false); // $found=false;
}
echo "<br>";
}
Output:
source=from; target=to; translate=word for translate
false
source=clé; target=solution; translate=word for translate
source=goodbye; target=さようなら; translate=word for translate
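If the [^-] class feels too permissive, a possible middle ground is to match Unicode letters with \p{L} and the u flag. This is a sketch of my own under a few assumptions: the input is valid UTF-8, accented letters are stored as single code points, and the helper name parseTranslateCommand() is made up. Digits and underscores are rejected by this variant.

```php
<?php
// Hypothetical helper: returns [source, target, translate] or false when the string is invalid.
function parseTranslateCommand($string)
{
    // \p{L} matches any Unicode letter; the u flag makes PCRE treat the input as UTF-8.
    if (preg_match('/^\p{L}+ (\p{L}+)-(\p{L}+) ([\p{L} ]+)$/u', $string, $out)) {
        return [$out[1], $out[2], $out[3]];
    }
    return false;
}

var_export(parseTranslateCommand("translate goodbye-さようなら word for translate"));
echo "\n";
var_export(parseTranslateCommand("translate dari-ke dari-ke word for translate")); // false
```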
While regex provides a much more concise method with fewer function calls, this is a non-regex method:
if(substr_count($string,'-')!=1){
var_export(false); // $found=false;
}else{
// ltrim() takes a character mask, not a prefix, so split off the leading word with explode() instead
$array=explode(' ',$string,3);
list($source,$target)=explode('-',$array[1]);
$translate=$array[2];
echo "source=$source; target=$target; translate=$translate";
}
If I understand your question correctly, this can be done with a regular expression:
<?php
$string = "translate from-to word for translate";
$result = preg_match("/^([\w ]+?) (\w+)-(\w+) ([\w ]+)$/", $string, $matches);
if ($result) {
print_r($matches);
$source = $matches[2];
$target = $matches[3];
$translate = $matches[4];
} else {
echo "No match";
}
Output:
Array
(
[0] => translate from-to word for translate
[1] => translate
[2] => from
[3] => to
[4] => word for translate
)
Here is an explanation of the regular expression.
Here is my concern,
I have a string and I need to extract characters two by two.
$str = "abcdef" should return array('ab', 'bc', 'cd', 'de', 'ef'). I want to use preg_match_all instead of loops. Here is the pattern I am using.
$str = "abcdef";
preg_match_all('/[\w]{2}/', $str);
The thing is, it returns Array('ab', 'cd', 'ef'). It misses 'bc' and 'de'.
I have the same problem if I want to extract a certain number of words
$str = "ab cd ef gh ij";
preg_match_all('/([\w]+ ){2}/', $str); // returns array('ab cd', 'ef gh'), I'm also missing the last part
What am I missing? Or is it simply not possible to do so with preg_match_all?
For the first problem, what you want to do is match overlapping string, and this requires zero-width (not consuming text) look-around to grab the character:
/(?=(\w{2}))/
The regex above will capture the match in the first capturing group.
DEMO
For the second problem, it seems that you also want overlapping string. Using the same trick:
/(?=(\b\w+ \w+\b))/
Note that \b is added to check the boundary of the word. Since the match does not consume text, the next match will be attempted at the next index (which is in the middle of the first word), instead of at the end of the 2nd word. We don't want to capture from middle of a word, so we need the boundary check.
Note that \b's definition is based on \w, so if you ever change the definition of a word, you need to emulate the word boundary with look-ahead and look-behind with the corresponding character set.
DEMO
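As a sketch, both patterns can be run on the strings from the question like this. Since the lookahead itself matches nothing, the results are read from capture group 1; the variable names are just for illustration.

```php
<?php
// Overlapping pairs of characters: the pairs land in capture group 1.
preg_match_all('/(?=(\w{2}))/', 'abcdef', $pairs);
print_r($pairs[1]);     // ab, bc, cd, de, ef

// Overlapping pairs of words.
preg_match_all('/(?=(\b\w+ \w+\b))/', 'ab cd ef gh ij', $wordPairs);
print_r($wordPairs[1]); // ab cd, cd ef, ef gh, gh ij
```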
In case you need a non-regex solution, try this...
<?php
$str = "abcdef";
$len = strlen($str);
$arr = array();
for($count = 0; $count < ($len - 1); $count++)
{
$arr[] = $str[$count].$str[$count+1];
}
print_r($arr);
?>
See Codepad.
If I have a string like: 10/10/12/12
I'm using:
$string = '10/10/12/12';
preg_match_all('/[0-9]+\/[0-9]+/', $string, $results);
This only seems to match 10/10, and 12/12. I also want to match 10/12. Is it because after the 10/10 is matched that is removed from the picture? So after the first match it'll only match things from /12/12?
If I want to match all 10/10, 10/12, 12/12, what should my regex look like? Thanks.
Edit: I did this
$arr = explode('/', $string);
$count = count($arr) - 1;
$newarr = array();
for ($i = 0; $i < $count; $i++)
{
$newarr[] = $arr[$i].'/'.$arr[$i+1];
}
I'd advise not using regular expressions. Instead, you could, for example, first split on the slash using explode. Then iterate over the parts, checking for two consecutive parts which both consist of only digits.
The reason why your regular expression doesn't work is because the match consumes the characters it matches. Searching for the next match starts from just after where the previous match ended.
If you really want to use regular expressions you can use a zero-width match such as a lookahead to avoid consuming the characters, and put a capturing match inside the lookahead.
'#[0-9]+/(?=([0-9]+))#'
See it working online: ideone
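Here is a sketch of how the results could be collected. With the pattern above, the match itself consumes the first number and the slash, and group 1 holds the following number, so the pair has to be put back together. The second pattern is a variant of my own (an assumption, not from the answer) that captures the whole pair inside the lookahead.

```php
<?php
$string = '10/10/12/12';

// Pattern from above: the match consumes "number/", group 1 holds the next number.
preg_match_all('#[0-9]+/(?=([0-9]+))#', $string, $m);
$pairs = [];
foreach ($m[0] as $i => $prefix) {
    $pairs[] = $prefix . $m[1][$i];
}
print_r($pairs); // 10/10, 10/12, 12/12

// Variant: capture the whole pair inside the lookahead; \b avoids starting mid-number.
preg_match_all('#(?=(\b[0-9]+/[0-9]+))#', $string, $n);
print_r($n[1]);  // 10/10, 10/12, 12/12
```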