PHP Regex 2 words per group - php

I've been wondering: is it possible to group every 2 words using regex? For 1 word I use this:
((?:\w'|\w|-)+)
This works great, but I need it for 2 words (or even more later on).
But if I use this one:
((?:\w'|\w|-)+) ((?:\w'|\w|-)+) it will make groups of 2 but not really how i want it. And when it encounters a special char it will start over.
Let me give you an example:
If I use it on this text: This is an . example text using & my / Regex expression
It will make groups of
This is
example text
regex expression
and I want groups like this:
This is
is an
an example
example text
text using
using my
my regex
regex expression
It is okay if it resets after a . So that it won't match hello . guys together for example.
Is this even possible to accomplish? I've just started experimenting with regex, so I don't quite know what is possible with it.
If this isn't possible could you point me in a direction that I should take with my problem?
Thanks in advance!

Regex is overkill for this. Simply collect the words, then create the pairs:
$a = array('one', 'two', 'three', 'four');
$pairs = array();
$prev = null;
foreach ($a as $word) {
    if ($prev !== null) {
        $pairs[] = "$prev $word";
    }
    $prev = $word;
}
Live demo: http://ideone.com/8dqAkz
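For the "collect the words" step, here is a minimal sketch, assuming words are runs of word characters, apostrophes and hyphens (the same characters as the pattern in the question) and that pairs should reset at a period. The function name wordPairs is made up for illustration.

```php
<?php
// Hypothetical helper: overlapping word pairs that never span a period.
function wordPairs(string $text): array
{
    $pairs = [];
    // Splitting on "." makes the pairing restart after every period.
    foreach (explode('.', $text) as $segment) {
        // Collect words; stray symbols such as & and / are simply skipped.
        preg_match_all("/[\w'-]+/", $segment, $m);
        $words = $m[0];
        for ($i = 0; $i + 1 < count($words); $i++) {
            $pairs[] = $words[$i] . ' ' . $words[$i + 1];
        }
    }
    return $pairs;
}

print_r(wordPairs('This is an . example text using & my / Regex expression'));
```

With the sample text this should give "This is", "is an", then "example text" through "Regex expression". A lone - or ' would still count as a word under this character class.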

Try this:
$samp = "This is an . example text using & my / Regex expression";
// remove anything other than letters and spaces
$samp = preg_replace('/[^A-Z ]/i', "", $samp);
// collapse runs of whitespace into single spaces
$samp = trim(preg_replace('/\s+/', ' ', $samp));
// split the sentence into words
$jk = explode(" ", $samp);
$i = count($jk);
// combine neighbouring words into overlapping pairs
$array = array();
for ($j = 0; $j < $i - 1; $j++)
{
    $array[] = $jk[$j] . " " . $jk[$j + 1];
}
print_r($array);
EDIT
for your question
I've changed the regex like this: "/[^A-Z0-9-' ]/i" so it doesn't
mess up words like 'you're' and '9-year-old' for example. But by doing
this when there is a separate - or ' in my text, it will treat those
as separate words. I know why it does this but is it preventable?
change the regex like this
preg_replace('/[^A-Z0-9 ]+[^A-Z0-9\'-]/i', "", $samp)

First, strip out non-word characters (replace each run of \W with a single space, so the words stay separated). Then perform your match. Many problems can be made simpler by breaking them down. Regexes are no exception.
Alternatively, strip out non-word characters, condense whitespace into single spaces, then use explode on space and array_chunk to group your words into pairs.
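One thing to keep in mind with array_chunk is that it produces non-overlapping groups, while the question asks for overlapping ones. A rough sketch of the difference, which also covers the "more than 2 words" case; slidingGroups is a hypothetical name, not a built-in:

```php
<?php
// Hypothetical helper: every run of $n consecutive words, overlapping.
function slidingGroups(array $words, int $n): array
{
    $groups = [];
    for ($i = 0; $i + $n <= count($words); $i++) {
        // array_slice takes $n words starting at offset $i.
        $groups[] = implode(' ', array_slice($words, $i, $n));
    }
    return $groups;
}

$words = ['This', 'is', 'an', 'example'];
print_r(array_chunk($words, 2));   // [This, is] and [an, example]
print_r(slidingGroups($words, 2)); // "This is", "is an", "an example"
print_r(slidingGroups($words, 3)); // "This is an", "is an example"
```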

Related

How can I remove the numeric suffix in PHP?

For example, if I want to get rid of the repeating numeric suffix from the end of an expression like this:
some_text_here_1
Or like this:
some_text_here_1_5
and I finally want to receive something like this:
some_text_here
What's the best and most flexible solution?
$newString = preg_replace("/_?\d+$/","",$oldString);
It is using regex to match an optional underscore (_?) followed by one or more digits (\d+), but only if they are the last characters in the string ($) and replacing them with the empty string.
To match any number of _number suffixes, just wrap the whole regex (except the $) in a group and put a + after it:
$newString = preg_replace("/(_?\d+)+$/","",$oldString);
If you only want to remove a numeric suffix if it is after an underscore (e.g. you want some_text_here14 to not be changed, but some_text_here_14 to be changed), then it should be:
$newString = preg_replace("/(_\d+)+$/","",$oldString);
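To see how the three patterns differ, here is a small comparison run. The labels and sample inputs are only for illustration; the patterns are the ones given above.

```php
<?php
// The three patterns from this answer, keyed by a descriptive label.
$patterns = [
    'single, underscore optional'   => '/_?\d+$/',
    'repeated, underscore optional' => '/(_?\d+)+$/',
    'repeated, underscore required' => '/(_\d+)+$/',
];
$inputs = ['some_text_here_1', 'some_text_here_1_5', 'some_text_here14'];

foreach ($patterns as $label => $pattern) {
    foreach ($inputs as $input) {
        // Print which pattern turned which input into what.
        printf("%-30s %-20s => %s\n", $label, $input, preg_replace($pattern, '', $input));
    }
}
```

Only the last pattern leaves some_text_here14 untouched, and only the first one leaves a leftover _1 on some_text_here_1_5.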
Updated to fix more than one suffix
Strrpos is far better than regex on such a simple string problem.
$str = "some_text_here_13_15";
while (is_numeric(substr($str, strrpos($str, "_") + 1))) {
    $str = substr($str, 0, strrpos($str, "_"));
}
echo $str;
strrpos finds the last "_" in $str; if the part after it is numeric, that part (and the underscore) is removed, and the loop repeats.
https://3v4l.org/OTdb9
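A slightly more defensive variant of the same loop, as a sketch: it stops when no underscore is left (strrpos returns false in that case) and uses ctype_digit instead of is_numeric, on the assumption that only plain digit runs should count as a suffix. The function name is made up for illustration.

```php
<?php
// Hypothetical helper: strip trailing "_<digits>" groups without regex.
function stripNumericSuffixNoRegex(string $str): string
{
    // Loop while there is an underscore and everything after it is digits.
    while (($pos = strrpos($str, '_')) !== false
        && ctype_digit((string) substr($str, $pos + 1))) {
        $str = substr($str, 0, $pos);
    }
    return $str;
}

echo stripNumericSuffixNoRegex('some_text_here_13_15'), "\n"; // some_text_here
echo stripNumericSuffixNoRegex('12345'), "\n";                // 12345 (no underscore, unchanged)
```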
Just to give you an idea of what I mean by regex not being a good solution here, this is the performance:
Regex:
https://3v4l.org/Tu8o2/perf#output
0.027 seconds for 100 runs.
My code with added numeric check:
https://3v4l.org/dkAqA/perf#output
0.003 seconds for 100 runs.
Oddly enough, this new code performs even better than before. In these runs the regex version was roughly nine times slower.
You be the judge on what is best.
First you'll want to do a preg_replace() in order to remove all digits by using the regex /\d+/. Then you'll also want to trim any underscores from the right using rtrim(), providing _ as the second parameter.
I've combined the two in the following example:
$string = "some_text_here_1";
echo rtrim(preg_replace('/\d+/', '', $string), '_'); // some_text_here
I've also created an example of this at 3v4l here.
Hope this helps! :)
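One thing worth checking with the approach above is what happens when digits also appear in the middle of the string. A quick comparison against an end-anchored pattern, as a sketch (both function names are hypothetical):

```php
<?php
// The approach above: drop every digit, then trim trailing underscores.
function stripAllDigits(string $s): string
{
    return rtrim(preg_replace('/\d+/', '', $s), '_');
}

// Anchored alternative: only "_<digits>" groups at the very end are removed.
function stripTrailingNumbers(string $s): string
{
    return preg_replace('/(?:_\d+)+$/', '', $s);
}

foreach (['some_text_here_1', 'abc_123_def_4'] as $s) {
    echo $s, ' => ', stripAllDigits($s), ' | ', stripTrailingNumbers($s), "\n";
}
// some_text_here_1 => some_text_here | some_text_here
// abc_123_def_4 => abc__def | abc_123_def
```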
$reg = '#_\d+$#';
$replace = '';
echo preg_replace($reg, $replace, $string);
This would give:
abc_def_ghi_123 > abc_def_ghi
abc_def_1 > abc_def
abc_def_ghi > abc_def_ghi
abc_def_ > abc_def_
abc_123_def > abc_123_def
To handle a case like abc_def_123_345 > abc_def,
one could change the line to:
$reg = '#(?:_\d+)+$#';

Using preg_replace to modify first space, but not inside a group of words using PHP

I would like to replace commas and potential spaces (i.e. that user can type or not) of an expression using preg_replace.
$expression = 'alfa,beta, gamma gmm, delta dlt, epsilon psln';
but I was unable to format the output as I want:
'alfa|beta|gamma gmm|delta dlt|epsilon psln'
Amongst others I tried this:
preg_replace (/,\s+/, '|', $expression);
and although it was the closest I got, it's not yet right. With code above I receive:
alfa,beta|gamm|gmm|delt|dlt|epsilo|psl|
Then I tried this (with | = OR):
preg_replace (/,\s+|,/, '|', $expression);
and although I solved the problem with the comma, it is still wrong:
alfa|beta|gamm|gmm|delt|dlt|epsilo|psl|
What should I do to only delete space after comma and not inside the word-group?
Many thanks in advance!
Use ,\s* instead of ,\s+ and replace the matched characters with the | symbol. ,\s+ matches a comma followed by one or more spaces, so it misses commas that have no space after them. Making the spaces optional (zero or more) also matches those bare commas.
Code:
<?php
$string = 'alfa,beta, gamma gmm, delta dlt, epsilon psln';
$pattern = "~,\s*~";
$replacement = "|";
echo preg_replace($pattern, $replacement, $string);
?>
Output:
alfa|beta|gamma gmm|delta dlt|epsilon psln
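If users might also type a space before the comma, the same idea extends to both sides. A small sketch, assuming that stray spaces around commas and at the ends of the string should all go; commasToPipes is a made-up name:

```php
<?php
// Hypothetical helper: turn a comma-separated list into a pipe-separated one.
function commasToPipes(string $list): string
{
    // \s*,\s* swallows optional whitespace on both sides of each comma.
    return preg_replace('~\s*,\s*~', '|', trim($list));
}

echo commasToPipes('alfa ,beta, gamma gmm , delta dlt,epsilon psln');
// alfa|beta|gamma gmm|delta dlt|epsilon psln
```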
How about using regular PHP functions to achieve this?
<?php
$expression = 'alfa,beta, gamma gmm, delta dlt, epsilon psln';
$pieces = explode(',', $expression);
foreach($pieces as $k => $v)
$pieces[$k] = trim($v);
$result = implode('|', $pieces);
echo $result;
?>
Output:
alfa|beta|gamma gmm|delta dlt|epsilon psln
This will distinguish between spaces at start/end of piece and spaces in pieces.
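The same explode/trim/implode idea can be written without the explicit loop by using array_map. This is just a more compact sketch of the code above, not a different technique:

```php
<?php
$expression = 'alfa,beta, gamma gmm, delta dlt, epsilon psln';
// Split on commas, trim each piece, then join the pieces with "|".
$result = implode('|', array_map('trim', explode(',', $expression)));
echo $result; // alfa|beta|gamma gmm|delta dlt|epsilon psln
```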

PHP replace exact word

Here is my problem:
Using preg_replace('#\b(word)\b#','****',$text);
Where in text I have word\word and word, the preg_replace above replaces both word\word and word so my resulting string is ***\word and ***.
I want my string to look like : word\word and ***.
Is this possible? What am I doing wrong???
LATER EDIT
I have an array with urls, I foreach that array and preg_replace the text where url is found, but it's not working.
For instance, I have http://www.link.com and http://www.link.com/something
If I have http://www.link.com it also replaces http://www.link.com/something.
You are effectively specifying that you don't want certain characters to count as word boundary. Therefore you need to specify the "boundaries" yourself, something like this:
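preg_replace('#(^|[^\w\\\\])(word)([^\w\\\\]|$)#', '$1***$3', $text);
What this does is search for the word surrounded by string boundaries or non-word characters other than the backslash \. The backslash is written four times because both the PHP string and the regex treat it as an escape character, and $1 and $3 put the surrounding characters back in place. Therefore it will match .word, but not .word\ and not \word. If you need to exclude other characters from matching, just add them inside the brackets.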
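A variation on the same idea, sketched with lookarounds so the neighbouring characters are only inspected and not consumed. It also runs the search text through preg_quote, which matters for the URL case from the edit. The helper name and the choice to treat / as a "glue" character are assumptions for illustration.

```php
<?php
// Hypothetical helper: replace $needle only where it is not directly next to
// a word character, a backslash, or a forward slash.
function replaceStandalone(string $needle, string $replacement, string $text): string
{
    // Escape regex metacharacters (and the # delimiter) in the search text.
    $quoted = preg_quote($needle, '#');
    // Four backslashes in PHP source become one literal backslash in the class.
    $pattern = '#(?<![\w\\\\/])' . $quoted . '(?![\w\\\\/])#';
    return preg_replace($pattern, $replacement, $text);
}

echo replaceStandalone('word', '***', 'word\\word and word'), "\n";
// word\word and ***
echo replaceStandalone('http://www.link.com', '[link]',
    'http://www.link.com and http://www.link.com/something'), "\n";
// [link] and http://www.link.com/something
```

Note that the replacement string is still interpreted by preg_replace, so a replacement containing $ or \ followed by digits would need escaping.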
You could just use str_replace("word\word", "word\word and"); I don't really see why you would need preg_replace in the case given above.
Here is a simple solution that doesn't use a regex. It will ONLY replace occurrences of 'word' where it stands alone as a word.
<?php
$text = "word\word word cat dog";
$new_text = "";
$words = explode(" ",$text); // split the string into separate 'words'
$inc = 0; // loop counter
foreach($words as $word){
if($word == "word"){ // if the current word in the array of words matches the criteria, replace it
$words[$inc] = "***";
}
$new_text.= $words[$inc]." ";
$inc ++;
}
echo $new_text; // gives 'word\word *** cat dog'
?>

preg replace complete word using partial patterns in PHP

I am using preg_replace($oldWords, $newWords, $string); to replace an array of words.
I wish to replace all words starting with foo into hello, and all words starting with bar into world
i.e foo123 should change to hello , foobar should change to hello, barx5 should change to world, etc.
If my arrays are defined as:
$oldWords = array('/foo/', '/bar/');
$newWords = array('hello', 'world');
then foo123 changes to hello123 and not hello. similarly barx5 changes to worldx5 and not world
How do I replace the complete matched word?
Thanks.
This is actually pretty simple if you understand regex, as well as how preg_replace works.
Firstly, your replacement arrays are incorrectly formed. What is:
$oldWords = array('\foo\', '\bar\');
Should instead be:
$oldWords = array('/foo/', '/bar/');
As the backslash in php escapes the character after it, meaning your strings were getting turned into non-strings, and it was messing up the rest of your code.
As to your actual question, however, you can achieve the desired effect with this:
$oldWords = array('/foo\w*/', '/bar\w*/');
\w matches any word character, while * is a quantifier meaning zero or more repetitions.
Adding those two items causes the regex to match foo plus any number of word characters directly after it, and that whole match is what preg_replace then replaces.
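Here is a runnable sketch of that fix. The \b at the start is my own addition (an assumption about what "words starting with foo" should mean), so that foo or bar in the middle of a word is left alone; the sample string is made up.

```php
<?php
// \b anchors the match at the start of a word; \w* extends it to the word's end.
$oldWords = ['/\bfoo\w*/', '/\bbar\w*/'];
$newWords = ['hello', 'world'];

$string = 'foo123 foobar barx5 rebar';
$result = preg_replace($oldWords, $newWords, $string);
echo $result; // hello hello world rebar
```

The patterns are applied one after another, so foobar is already hello by the time the bar pattern runs.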
One way to do it is to loop through the array and check each word. Since we are only checking the first three letters, I would use substr() instead of a regex, because regex functions are slower.
foreach ($oldWords as &$word) { // here $oldWords holds plain words, taken by reference so they can be changed
    $prefix = substr($word, 0, 3); // first three characters
    if ($prefix === 'foo') {
        $word = 'hello';
    }
    else if ($prefix === 'bar') {
        $word = 'world';
    }
}
unset($word); // drop the reference left over from the loop

Identifying a random repeating pattern in a structured text string

I have a string that has the following structure:
ABC_ABC_PQR_XYZ
Where PQR has the structure:
ABC+JKL
and
ABC itself is a string that can contain alphanumeric characters and a few other characters like "_", "-", "+", "." and follows no set structure:
eg.qWe_rtY-asdf or pkl123
so, in effect, the string can look like this:
qWe_rtY-asdf_qWe_rtY-asdf_qWe_rtY-asdf+JKL_XYZ
My goal is to find out what string constitutes ABC.
I was initially just using
$arrString = explode("_",$string);
to return $arrString[0] before I was made aware that ABC ($arrString[0]) itself can contain underscores, thus rendering it incorrect.
My next attempt was exploding it on "_" anyway and then comparing each of the exploded string parts with the first string part until I get a semblance of a pattern:
function getPatternABC($string)
{
$count = 0;
$pattern ="";
$arrString = explode("_", $string);
foreach($arrString as $expString)
{
if(strcmp($expString,$arrString[0])!==0 || $count==0)
{
$pattern = $pattern ."_". $arrString[$count];
$count++;
}
else break;
}
return substr($pattern,1);
}
This works great - but I wanted to know if there was a more elegant way of doing this using regular expressions?
Here is the regex solution:
'^([a-zA-Z0-9_+-]+)_\1_\1\+'
What this does is match (starting from the beginning of the string) the longest possible sequence consisting of the characters inside the square brackets (edit that per your spec). The sequence must appear exactly twice, each time followed by an underscore, and then must appear once more followed by a plus sign (this is actually the first half of PQR with the delimiter before JKL). The rest of the input is ignored.
You will find ABC captured as capture group 1.
So:
$input = 'qWe_rtY-asdf_qWe_rtY-asdf_qWe_rtY-asdf+JKL_XYZ';
$result = preg_match('/^([a-zA-Z0-9_+-]+)_\1_\1\+/', $input, $matches);
if ($result) {
    echo $matches[1];
}
See it in action.
Sure, just make a regular expression that matches your pattern. In this case, something like this:
preg_match('/^([a-zA-Z0-9_+.-]+)_\1_\1\+JKL_XYZ$/', $string, $match);
Your ABC is in $match[1].
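Wrapped up as a function, the backreference approach might look like the sketch below. The function name is hypothetical, and the character class is an assumption based on the characters listed in the question; it returns null when the string does not follow the ABC_ABC_ABC+ layout.

```php
<?php
// Hypothetical helper: extract the repeated ABC part, or null if not found.
function extractAbc(string $input): ?string
{
    // \1 must repeat exactly what group 1 captured, so the engine backtracks
    // until it finds a prefix that appears three times in the right places.
    if (preg_match('/^([a-zA-Z0-9_+.-]+)_\1_\1\+/', $input, $m) === 1) {
        return $m[1];
    }
    return null;
}

echo extractAbc('qWe_rtY-asdf_qWe_rtY-asdf_qWe_rtY-asdf+JKL_XYZ'), "\n"; // qWe_rtY-asdf
echo extractAbc('pkl123_pkl123_pkl123+JKL_XYZ'), "\n";                   // pkl123
var_dump(extractAbc('no_repeat_here'));                                  // NULL
```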
If the presence of underscores in these strings has a low frequency, it may be worth checking to see if a simple explode() will do it before bothering with regex.
<?php
$str = 'ABC_ABC_PQR_XYZ';
if (substr_count($str, '_') == 3)
    $abc = explode('_', $str)[0]; // no underscores inside ABC, so the first piece is ABC
else
    $abc = regexy_function($str); // placeholder for the regex-based fallback
?>
