For example, I have the following string:
#kirbypanganja[Kirby Panganja] elow #kyraminerva[Kyra] test #watever[watever ever evergreen]
I want to get the substrings that match the #username[Full Name] format. I'm really new to regex. I'm using the following code:
$mention_regex = '/#([A-Za-z0-9_]+)/i';
preg_match_all($mention_regex, $content, $matches);
var_dump($matches);
where $content is the string above.
What regex should I use so that I get an array of substrings in the #username[Full Name] format?
You can use:
#[^]]+]
i.e.:
$string = "#kirbypanganja[Kirby Panganja] elow #kyraminerva[Kyra] test #watever[watever ever evergreen]";
preg_match_all('/#[^]]+]/', $string, $result);
print_r($result[0]);
Output:
Array
(
[0] => #kirbypanganja[Kirby Panganja]
[1] => #kyraminerva[Kyra]
[2] => #watever[watever ever evergreen]
)
Regex: /#[A-Za-z0-9_]+\[[a-zA-Z\s]+\]/
This matches a # followed by one or more letters, digits, or underscores, then a pair of square brackets containing only letters and whitespace.
Example: #thanSomeCharacters[Some Name Can contain space]
Try this code snippet:
<?php
$content='#kirbypanganja[Kirby Panganja] elow #kyraminerva[Kyra] test #watever[watever ever evergreen]';
$mention_regex = '/#[A-Za-z0-9_]+\[[a-zA-Z\s]+\]/i';
preg_match_all($mention_regex, $content, $matches);
print_r($matches);
I'll start with a very direct, one-liner method that I believe is best and then discuss the other options...
Code (Demo):
$string = "#kirbypanganja[Kirby Panganja] elow #kyraminerva[Kyra] test #watever[watever ever evergreen]";
$result = preg_split('/]\K[^#]+/', $string, 0, PREG_SPLIT_NO_EMPTY);
var_export($result);
Output:
array (
0 => '#kirbypanganja[Kirby Panganja]',
1 => '#kyraminerva[Kyra]',
2 => '#watever[watever ever evergreen]',
)
Pattern (Demo):
] #match a literal closing square bracket
\K #forget the matched closing square bracket
[^#]+ #match 1 or more non-hash characters
My pattern takes 12 steps, which is the same step efficiency as Pedro's pattern.
There are two benefits to the coder by using preg_split():
it returns the substrings directly instead of a match count, and it does not require an output variable like preg_match_all(), which means it can be used as a one-liner without a condition statement.
it returns a one-dimensional array, versus a two-dimensional array like preg_match_all(). This means the entire returned array is instantly ready to unpack without any subarray accessing.
In case you are wondering what the 3rd and 4th parameters are in preg_split(), the 0 value means return an unlimited amount of substrings. This is the default behavior, but it is used as a placeholder for parameter 4. PREG_SPLIT_NO_EMPTY effectively removes any empty substrings that would have been generated by trying to split at the start or end of the input string.
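To see what PREG_SPLIT_NO_EMPTY changes, here is a minimal sketch using a shortened input of my own (not the OP's string) that has trailing text after the last mention. The trailing text is matched as a delimiter, which would otherwise leave an empty string at the end of the result:

```php
<?php
// Hypothetical shortened input with text after the last closing bracket.
$string = '#a[A] x #b[B] tail';

// Without the flag, the final delimiter (" tail") leaves an empty last element.
$withEmpty = preg_split('/]\K[^#]+/', $string);

// With the flag, empty substrings are dropped from the result.
$noEmpty = preg_split('/]\K[^#]+/', $string, 0, PREG_SPLIT_NO_EMPTY);

var_export($withEmpty);
echo "\n";
var_export($noEmpty);
```

Note that text before the first # is not removed by this pattern, so the approach assumes the input starts with a mention.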
That concludes my recommended method. Now I'll take a moment to compare the other answers currently posted on this page, and then present some non-regex methods which I do not recommend.
The most popular and intuitive method is to use a regex pattern with preg_match_all(). Both Sahil and Pedro have opted for this course of action. Let's compare the patterns that they've chosen...
Sahil's pattern /#[A-Za-z0-9_]+\[[a-zA-Z\s]+\]/i correctly matches the desired substrings in 18 steps, but contains redundancies such as the i modifier/flag despite already listing A-Za-z in the character classes. Also, [A-Za-z0-9_] is more simply expressed as \w.
Pedro's pattern /#[^]]+]/ correctly matches the desired substrings in 12 steps.
By all comparisons, Pedro's method is superior to Sahil's because it has equal accuracy, higher efficiency, and increased pattern brevity. If you want to use preg_match_all(), you will not find a more refined regex pattern than Pedro's.
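If the username and the full name are also needed separately (an assumption on my part; the question only asks for the whole substring), capture groups can be added to a pattern in the same spirit as Pedro's. A minimal sketch, where the $mentions variable name is my own:

```php
<?php
$string = "#kirbypanganja[Kirby Panganja] elow #kyraminerva[Kyra] test #watever[watever ever evergreen]";

// Group 1: the username (\w+); group 2: everything up to the closing bracket.
preg_match_all('/#(\w+)\[([^]]+)]/', $string, $matches, PREG_SET_ORDER);

// Build a username => full name map from the per-match arrays.
$mentions = [];
foreach ($matches as $m) {
    $mentions[$m[1]] = $m[2];
}

var_export($mentions);
```

The capture groups add a little work for the regex engine, so this is only worthwhile when the parts are actually needed.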
That said, there are other ways to extract the desired substrings. First, the more tedious way that doesn't involve regex that I would never recommend...
Regex-free method: strpos() & substr()
$result = [];
while (($start = strpos($string, '#')) !== false) {
$result[] = substr($string, $start, ($stop = strpos($string, ']') + 1) - $start);
$string = substr($string, $stop);
}
var_export($result);
Coders should always entertain the idea of a non-regex method when dissecting strings, but as you can see from this code above, it just isn't sensible for this case. It requires four function calls on each iteration and it isn't the easiest thing to read. So let's dismiss this method.
Here is another way that provides the correct result...
$result = [];
foreach (explode('#', $string) as $v) {
if ($v) {
$result[] = '#' . substr($v, 0, strrpos($v, ']') + 1);
}
}
It makes fewer function calls compared to the previous regex-free method, but it is still too much handling for such a simple task.
At this point, it is clear that the most sensible methods use regex. And there is nothing wrong with choosing preg_match_all() -- if this were my project, I might elect to use it. However, it is important to consider the directness of preg_split(). This function is just like explode() but with the ability to use a regex pattern. This question is a perfect stage for preg_split() because the substrings that should be omitted can also be used as the delimiter between the desired substrings.
I need help extracting all the substrings between curly brackets that are found inside a specific string.
I found some solutions in javascript but I need it for PHP.
$string = "www.example.com/?foo={foo}&test={test}";
$subStrings = HELPME($string);
print_r($subStrings);
The result should be:
array( [0] => foo, [1] => test )
I tried playing with preg_match but I got confused.
I'd appreciate it if whoever manages to get it working with preg_match could also explain the logic behind it.
You could use this regex to capture the strings between {}
\{([^}]*)\}
Explanation:
\{ Matches a literal {
([^}]*) Captures zero or more characters that are not }, so it captures everything up to the next } symbol.
\} Matches a literal }
Your code would be,
<?php
$regex = '~\{([^}]*)\}~';
$string = "www.example.com/?foo={foo}&test={test}";
preg_match_all($regex, $string, $matches);
var_dump($matches[1]);
?>
Output:
array(2) {
[0]=>
string(3) "foo"
[1]=>
string(4) "test"
}
Regex Pattern: \{(\w+)\}
The parentheses () capture the matched text. This pattern captures any run of word characters that is enclosed in {...}.
Sample code:
$regex = '/\{(\w{1,})\}/';
$testString = ''; // Fill this in
preg_match_all($regex, $testString, $matches);
// the $matches variable contains the list of matches
If you want to capture any type of character inside the {...}, try the regex pattern below.
Regex : \{(.*?)\}
Sample code:
$regex = '/\{(.{0,}?)\}/';
$testString = ''; // Fill this in
preg_match_all($regex, $testString, $matches);
// the $matches variable contains the list of matches
<?php
$string = "www.example.com/?foo={foo}&test={test}";
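// preg_match() stops at the first match; preg_match_all() collects every match
$found = preg_match_all('/\{([^}]*)\}/', $string, $subStrings);
if ($found) {
    print_r($subStrings[1]); // captured text without the braces
} else {
    echo 'NOPE !!';
}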
The parse_url() function parses a URL and returns its components, including the query string.
Try This:
preg_match_all("/\{.*?\}/", $string, $subStrings);
var_dump($subStrings[0]);
Good Luck!
You can use the expression (?<=\{).*?(?=\}) to match any string of text enclosed in {}.
$string = "www.example.com/?foo={foo}&test={test}";
preg_match_all("/(?<=\{).*?(?=\})/",$string,$matches);
print_r($matches[0]);
Regex explained:
(?<=\{) is a positive lookbehind, asserting that the matched text is preceded by a {.
Similarly, (?=\}) is a positive lookahead asserting that it is followed by a }. .* matches 0 or more characters of any type, and the ? in .*? makes it match the least possible amount of characters. (Meaning that in {foo} and {bar} it matches foo and bar separately, as opposed to foo} and {bar.)
$matches[0] contains an array of all the matched strings.
I see answers here using regular expressions with capture groups, lookarounds, and lazy quantifiers. All of these techniques will slow down the pattern -- granted, the performance difference is very unlikely to be noticeable in the majority of use cases. Because we are meant to offer solutions that are suitable to more scenarios than just the posted question, I'll offer a few solutions that deliver the expected result and explain the differences using the OP's www.example.com/?foo={foo}&test={test} string assigned to $url. For information about the function calls, please see the PHP manual. For an in-depth breakdown of the regex patterns, I recommend using regex101.com -- a free online tool that allows you to test patterns against strings, see the results as both highlighted text and a grouped list, and get a character-by-character breakdown of how the regex engine is interpreting your pattern.
#1 Because your input string is a url, a non-regex technique is appropriate because php has native functions to parse it: parse_url() with parse_str(). Unfortunately, your requirements go beyond extracting the query string's values, you also wish to re-index the array and remove the curly braces from the values.
parse_str(parse_url($url, PHP_URL_QUERY), $assocArray);
$values = array_map(function($v) {return trim($v, '{}');}, array_values($assocArray));
var_export($values);
While this approach is deliberate and makes fair use of native functions that were built for these jobs, it ends up making longer, more convoluted code which is somewhat unpleasant in terms of readability. Nonetheless, it provides the desired output array and should be considered as a viable process.
#2 preg_match_all() is a super brief and highly efficient technique to extract the values. One drawback with using regular expressions is that the regex engine is completely "unaware" of any special meanings that a formatted input string may have. In this case, I don't see any negative impacts, but when hiccups do arise, often the solution is to use a parser that is "format/data-type aware".
var_export(preg_match_all('~\{\K[^}]*~', $url, $matches) ? $matches[0] : []);
Notice that my pattern does not need capture groups or lookarounds; nor does my answer suffer from the use of a lazy quantifier. \K is used to "restart the fullstring match" (in other words, forget any matched characters up to that point). All of these features mean that the regex engine can traverse the string with peak efficiency. If there are downsides to using the function, they are:
that a multi-dimensional array is generated while you only want a one-dimensional array
that the function creates a reference variable instead of returning the results
#3 preg_split() most closely aligns with the plain-English intent of your task AND it provides the exact output as its return value.
var_export(preg_split('~(?:(?:^|})[^{]*{)|}[^{]*$~', $url, 0, PREG_SPLIT_NO_EMPTY));
My pattern, while admittedly unsavoury to the novice regex pattern designer AND slightly less efficient because it is making "branched" matches (|), basically says: "Split the string at the following delimiters:
from the start of the string or from a }, including all non-{ characters, then the first encountered { (this is the end of the delimiter).
from the last }, including all non-{ characters until the end of the string."
In PHP I have string with nested brackets:
bar[foo[test[abc][def]]bar]foo
I need a regex that matches the inner bracket-pairs first, so the order in which preg_match_all finds the matching bracket-pairs should be:
[abc]
[def]
[test[abc][def]]
[foo[test[abc][def]]bar]
All texts may vary.
Is this even possible with preg_match_all ?
This is not possible with regular expressions. No matter how complex your regex, it will always return the left-most match first.
At best, you'd have to use multiple regexes, but even then you're going to have trouble because regexes can't really count matching brackets. Your best bet is to parse this string some other way.
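As one way to "parse this string some other way", here is a minimal sketch of a stack-based scan. It is not from the answer above; the function name innerFirstBrackets is my own. Opening-bracket positions are pushed onto a stack, and each closing bracket pops the most recent one, so inner pairs are emitted before the pairs that contain them:

```php
<?php
// Hypothetical helper: returns every balanced [...] pair, inner pairs first.
function innerFirstBrackets(string $s): array
{
    $stack = []; // positions of '[' that have not been closed yet
    $pairs = [];
    $length = strlen($s);

    for ($i = 0; $i < $length; $i++) {
        if ($s[$i] === '[') {
            $stack[] = $i;
        } elseif ($s[$i] === ']' && $stack) {
            // Close the most recently opened bracket.
            $start = array_pop($stack);
            $pairs[] = substr($s, $start, $i - $start + 1);
        }
    }

    return $pairs;
}

print_r(innerFirstBrackets('bar[foo[test[abc][def]]bar]foo'));
```

Unmatched closing brackets are silently skipped and unclosed opening brackets are ignored in this sketch; a stricter version could throw instead.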
It is not evident from your question what kind of "structure of matches" you want, but you can use simple arrays. Try
preg_match_all('#\[([a-z\)\(]+?)\]#',$original,$m);
that, for $original = 'bar[foo[test[abc][def]]bar]foo' returns an array with "abc" and "def", the inner ones.
For your output, you need a loop for the "parsing task".
PCRE with preg_replace_callback is better for parsing.
Perhaps this loop is a good clue for your problem,
$original = 'bar[foo[test[abc][def]]bar]foo';
for( $aux=$oldAux=$original;
$oldAux!=($aux=printInnerBracket($aux));
$oldAux=$aux
);
print "\n-- $aux";
function printInnerBracket($s) {
return preg_replace_callback(
'#\[([a-z\)\(]+?)\]#', // the only one regular expression
function($m) {
print "\n$m[0]";
return "($m[1])";
},
$s
);
}
Result (the callback print):
[abc]
[def]
[test(abc)(def)]
[foo(test(abc)(def))bar]
-- bar(foo(test(abc)(def))bar)foo
I would like to know how I could transform the given string into the specified array:
String
all ("hi there \(option\)", (this, that), other) another
Result wanted (Array)
[0] => all,
[1] => Array(
[0] => "hi there \(option\)",
[1] => Array(
[0] => this,
[1] => that
),
[2] => other
),
[2] => another
This is used for a kind of console that I'm making on PHP.
I tried to use preg_match_all but, I don't know how I could find parentheses inside parentheses in order to "make arrays inside arrays".
EDIT
All other characters that are not specified on the example should be treated as String.
EDIT 2
I forgot to mention that all parameters outside the parentheses should be delimited by the space character.
The 10,000ft overview
You need to do this with a small custom parser: code that takes input of this form and transforms it to the form you want.
In practice I find it useful to group parsing problems like this in one of three categories based on their complexity:
Trivial: Problems that can be solved with a few loops and humane regular expressions. This category is seductive: if you are even a little unsure if the problem can be solved this way, a good rule of thumb is to decide that it cannot.
Easy: Problems that require building a small parser yourself, but are still simple enough that it doesn't quite make sense to bring out the big guns. If you need to write more than ~100 lines of code then consider escalating to the next category.
Involved: Problems for which it makes sense to go formal and use an already existing, proven parser generator¹.
I classify this particular problem as belonging into the second category, which means that you can approach it like this:
Writing a small parser
Defining the grammar
To do this, you must first define -- at least informally, with a few quick notes -- the grammar that you want to parse. Keep in mind that most grammars are defined recursively at some point. So let's say our grammar is:
The input is a sequence
A sequence is a series of zero or more tokens
A token is either a word, a string or an array
Tokens are separated by one or more whitespace characters
A word is a sequence of alphabetic characters (a-z)
A string is an arbitrary sequence of characters enclosed within double quotes
An array is a series of one or more tokens separated by commas and enclosed within parentheses
You can see that we have recursion in one place: a sequence can contain arrays, and an array is also defined in terms of a sequence (so it can contain more arrays etc).
Treating the matter informally as above is easier as an introduction, but reasoning about grammars is easier if you do it formally.
Building a lexer
With the grammar in hand you now need to break the input down into tokens so that it can be processed. The component that takes user input and converts it to individual pieces defined by the grammar is called a lexer. Lexers are dumb; they are only concerned with the "outside appearance" of the input and do not attempt to check that it actually makes sense.
Here's a simple lexer I wrote to parse the above grammar (don't use this for anything important; may contain bugs):
$input = 'all ("hi there", (this, that) , other) another';
$tokens = array();
$input = trim($input);
while($input) {
switch (substr($input, 0, 1)) {
case '"':
if (!preg_match('/^"([^"]*)"(.*)$/', $input, $matches)) {
die; // TODO: error: unterminated string
}
$tokens[] = array('string', $matches[1]);
$input = $matches[2];
break;
case '(':
$tokens[] = array('open', null);
$input = substr($input, 1);
break;
case ')':
$tokens[] = array('close', null);
$input = substr($input, 1);
break;
case ',':
$tokens[] = array('comma', null);
$input = substr($input, 1);
break;
default:
list($word, $input) = array_pad(
preg_split('/(?=[^a-zA-Z])/', $input, 2),
2,
null);
$tokens[] = array('word', $word);
break;
}
$input = trim($input);
}
print_r($tokens);
Building a parser
Having done this, the next step is to build a parser: a component that inspects the lexed input and converts it to the desired format. A parser is smart; in the process of converting the input it also makes sure that the input is well-formed by the grammar's rules.
Parsers are commonly implemented as state machines (also known as finite state machines or finite automata) and work like this:
The parser has a state; this is usually a number in an appropriate range, but each state is also described with a more human-friendly name.
There is a loop that reads lexed tokens one at a time. Based on the current state and the value of the token, the parser may decide to do one or more of the following:
take some action that affects its output
change its state to some other value
decide that the input is badly formed and produce an error
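As a rough illustration of the parsing step, here is a minimal sketch that consumes tokens in the same [type, value] shape the lexer above produces. Note that it swaps in a small recursive-descent function rather than an explicit numbered state machine, and the name parseTokens is my own. It treats commas purely as separators and does not validate their placement:

```php
<?php
// Hypothetical parser: turns a flat token list into nested arrays.
function parseTokens(array $tokens, int &$pos = 0, bool $nested = false): array
{
    $result = [];
    $count = count($tokens);

    while ($pos < $count) {
        [$type, $value] = $tokens[$pos++];
        switch ($type) {
            case 'word':
            case 'string':
                $result[] = $value;
                break;
            case 'open':
                // Recurse; the nested call consumes tokens up to the matching 'close'.
                $result[] = parseTokens($tokens, $pos, true);
                break;
            case 'close':
                if (!$nested) {
                    throw new UnexpectedValueException('Unbalanced closing parenthesis');
                }
                return $result;
            case 'comma':
                break; // separator only; placement is not checked in this sketch
            default:
                throw new UnexpectedValueException("Unknown token type: $type");
        }
    }

    if ($nested) {
        throw new UnexpectedValueException('Missing closing parenthesis');
    }
    return $result;
}

// Tokens for: all ("hi there", (this, that), other) another
$tokens = [
    ['word', 'all'], ['open', null], ['string', 'hi there'], ['comma', null],
    ['open', null], ['word', 'this'], ['comma', null], ['word', 'that'], ['close', null],
    ['comma', null], ['word', 'other'], ['close', null], ['word', 'another'],
];
$tree = parseTokens($tokens);
print_r($tree);
```

The two exceptions correspond to the "decide that the input is badly formed and produce an error" step.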
¹ Parser generators are programs whose input is a formal grammar and whose output is a lexer and a parser you can "just add water" to: just extend the code to perform "take some action" depending on the type of token; everything else is already taken care of. A quick search on this subject led to the question "PHP Lexer and Parser Generator?".
There's no question that you should write a parser if you are building a syntax tree. But if you just need to parse this sample input, regex might still be a suitable tool:
<?php
$str = 'all, ("hi there", (these, that) , other), another';
$str = preg_replace('/\, /', ',', $str); // get rid of extra spaces
/*
 * get rid of undefined constants by surrounding them with quotes
 */
$str = preg_replace('/(\w+),/', '\'$1\',', $str);
$str = preg_replace('/(\w+)\)/', '\'$1\')', $str);
$str = preg_replace('/,(\w+)/', ',\'$1\'', $str);
$str = str_replace('(', 'array(', $str);
$str = 'array('.$str.');';
echo '<pre>';
eval('$res = '.$str); //eval is evil.
print_r($res); //print the result
Note: If the input is malformed, this regex approach will definitely fail. I am offering this solution only in case you need a quick script. Writing a lexer and parser is time-consuming work that requires a lot of research.
As far as I know, the parentheses problem is a Chomsky language class 2, while regular expressions are equivalent to Chomsky language class 3, so there should be no regular expression, which solves this problem.
But I read something not long ago:
This PCRE pattern solves the parentheses problem (assume the PCRE_EXTENDED option is set so that white space is ignored): \( ( (?>[^()]+) | (?R) )* \)
With delimiters and without spaces: /\(((?>[^()]+)|(?R))*\)/.
This is from Recursive Patterns (PCRE) - PHP manual.
There is an example on that manual, which solves nearly the same problem you specified!
You, or others might find it and proceed with this idea.
I think the best solution is to write a sick recursive pattern with preg_match_all. Sadly I'm not in the power to do such madness!
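For anyone who wants to try the recursive pattern quoted above, here is a minimal sketch. I use a simplified version of the question's input without the escaped \( and \), because this pattern knows nothing about escapes or quoted strings:

```php
<?php
// Simplified input (assumption): no escaped parentheses inside the string.
$input = 'all ("hi there", (this, that), other) another';

// (?R) recurses into the whole pattern, so nested pairs must be balanced.
preg_match_all('/\(((?>[^()]+)|(?R))*\)/', $input, $matches);

var_export($matches[0]);
```

This only finds the outermost balanced group. Building the nested arrays would still require applying the pattern again to the text inside each match, or a small parser.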
First, I want to thank everyone that helped me on this.
Unfortunately, I can't accept multiple answers because, if I could, I would give to you all because all answers are correct for different types of this problem.
In my case, I just needed something simple and dirty and, following #palindrom and #PLB answers, I've got the following working for me:
$str=transformEnd(transformStart($string));
$str = preg_replace('/([^\\\])\(/', '$1array(', $str);
$str = 'array('.$str.');';
eval('$res = '.$str);
print_r($res); //print the result
function transformStart($str){
$match=preg_match('/(^\(|[^\\\]\()/', $str, $positions, PREG_OFFSET_CAPTURE);
if (count($positions[0]))
$first=($positions[0][1]+1);
if ($first>1){
$start=substr($str, 0,$first);
preg_match_all("/(?:(?:\"(?:\\\\\"|[^\"])+\")|(?:'(?:\\\'|[^'])+')|(?:(?:[^\s^\,^\"^\']+)))/is",$start,$results);
if (count($results[0])){
$start=implode(",", $results[0]).",";
} else {
$start="";
}
$temp=substr($str, $first);
$str=$start.$temp;
}
return $str;
}
function transformEnd($str){
$match=preg_match('/(^\)|[^\\\]\))/', $str, $positions, PREG_OFFSET_CAPTURE);
if (($total=count($positions)) && count($positions[$total-1]))
$last=($positions[$total-1][1]+1);
if ($last==null)
$last=-1;
if ($last<strlen($str)-1){
$end=substr($str,$last+1);
preg_match_all("/(?:(?:\"(?:\\\\\"|[^\"])+\")|(?:'(?:\\\'|[^'])+')|(?:(?:[^\s^\,^\"^\']+)))/is",$end,$results);
if (count($results[0])){
$end=",".implode(",", $results[0]);
} else {
$end="";
}
$temp=substr($str, 0,$last+1);
$str=$temp.$end;
}
if ($last==-1){
$str=substr($str, 1);
}
return $str;
}
The other answers are also helpful for anyone looking for a better way to do this.
Again, thank you all =D.
Here is the algorithm as pseudocode. Hopefully you can work out how to implement it in PHP:
function Parser([receives] input:string) returns Array
define Array returnValue;
for each integer i from 0 to length of input string do
character = ith character from input string.
if character is '('
returnValue.Add(Parser(substring of input after i)); // recursive call
else if character is '"'
returnValue.Add(substring of input from i to the next '"')
else if character is whitespace
continue
else
returnValue.Add(substring of input from i to the next space or end of input)
increment i to the index actually consumed
return returnValue
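The pseudocode above could be turned into PHP roughly as follows. This is a sketch under my own assumptions, not the answerer's code: the function name parseArgs is hypothetical, commas are treated like whitespace, a closing parenthesis ends the current nesting level, and escaped quotes or parentheses (such as \( in the question) are not handled.

```php
<?php
// Hypothetical recursive parser working directly on the characters.
function parseArgs(string $input, int &$i = 0): array
{
    $result = [];
    $length = strlen($input);

    while ($i < $length) {
        $ch = $input[$i];
        if ($ch === '(') {
            $i++;
            $result[] = parseArgs($input, $i); // recursive call for the nested group
        } elseif ($ch === ')') {
            $i++;
            return $result; // end of the current group
        } elseif ($ch === '"') {
            $end = strpos($input, '"', $i + 1);
            if ($end === false) {
                throw new UnexpectedValueException('Unterminated string');
            }
            $result[] = substr($input, $i + 1, $end - $i - 1);
            $i = $end + 1;
        } elseif ($ch === ',' || ctype_space($ch)) {
            $i++; // separators carry no value
        } else {
            // Bare word: read up to the next separator, quote, or parenthesis.
            $n = strcspn($input, " \t\r\n\v\f,()\"", $i);
            $result[] = substr($input, $i, $n);
            $i += $n;
        }
    }

    return $result;
}

print_r(parseArgs('all ("hi there", (this, that), other) another'));
```

Quoted strings are stored without their surrounding quotes here; keep the quotes in the substr() call if the output should match the question's example exactly.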
I want to know if this works:
replace ( with Array(
Use regex to put comma after words or parentheses without comma
preg_replace( '/[^,]\s+/', ',', $string )
eval( "\$result = Array( $string )" )
If the string values are fixed, it can be done somehow like this:
$ar = explode('("', $st);
$ar[1] = explode('",', $ar[1]);
$ar[1][1] = explode(',', $ar[1][1]);
$ar[1][2] = explode(')',$ar[1][1][2]);
unset($ar[1][1][2]);
$ar[2] =$ar[1][2][1];
unset($ar[1][2][1]);
I have a list of words in an array. What is the fastest way to check if any of these words exists in a string?
Currently, I am checking the existence of array elements one by one through a foreach loop by stripos. I am curious if there is a faster method, like what we do for str_replace using an array.
Regarding your additional comment, you could split your string into single words using explode() or preg_split() and then check this array against the needles array using array_intersect(). That way all the work is done only once.
<?php
$haystack = "Hello Houston, we have a problem";
$haystacks = preg_split("/\b/", $haystack);
$needles = array("Chicago", "New York", "Houston");
$intersect = array_intersect($haystacks, $needles);
$count = count($intersect);
var_dump($count, $intersect);
I could imagine that array_intersect() is pretty fast. But it depends what you really want (matching words, matching fragments, ..)
my personal function:
function wordsFound($haystack,$needles) {
return preg_match('/\b('.implode('|',$needles).')\b/i',$haystack);
}
//> Usage:
if (wordsFound('string string string',array('words')))
Notice: if you work with exotic UTF-8 strings, you need to replace \b with the corresponding UTF-8-aware word boundary.
Notice2: be sure to enter only a-z0-9 chars in $needles (thanks to MonkeyMonkey); otherwise you need to preg_quote() them first.
Notice3: this function is case-insensitive thanks to the i modifier.
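Following up on the second notice, here is a minimal sketch of the same idea with each needle passed through preg_quote(). The name wordsFoundSafe is my own, and the sketch makes no attempt to address the UTF-8 word-boundary issue from the first notice:

```php
<?php
// Hypothetical variant: escapes regex metacharacters in every needle.
function wordsFoundSafe(string $haystack, array $needles): bool
{
    if ($needles === []) {
        return false; // an empty alternation would match at any word boundary
    }

    $quoted = array_map(function ($needle) {
        return preg_quote($needle, '/'); // '/' is the delimiter used below
    }, $needles);

    return preg_match('/\b(?:' . implode('|', $quoted) . ')\b/i', $haystack) === 1;
}

var_dump(wordsFoundSafe('see a.b here', ['a.b'])); // the dot is matched literally
var_dump(wordsFoundSafe('see axb here', ['a.b']));
```

Because \b needs a word character on one side, needles that start or end with punctuation may still not behave as expected.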
In general, regular expressions are slower than basic string functions like stripos(). But I think it really depends on the situation. If you really need the maximum performance, I suggest running some tests with real-world data.
Okay, here's what I'm trying to do: I'm trying to use PHP to develop what's essentially a tiny subset of a markdown implementation, not worth using a full markdown class.
I need essentially do a str_replace, but alternate the replace string for every occurrence of the needle, so as to handle the opening and closing HTML tags.
For example, italics are a pair of asterisks like *this*, and code blocks are surrounded by backticks like `this`.
I need to replace the first occurrence of a pair of the characters with the opening HTML tag corresponding, and the second with the closing tag.
Any ideas on how to do this? I figured some sort of regular expression would be involved...
Personally, I'd loop through each occurrence of * or \ with a counter, and replace the character with the appropriate HTML tag based on the count (for example, if the count is even and you hit an asterisk, replace it with <em>, if it's odd then replace it with </em>, etc).
But if you're sure that you only need to support a couple simple kinds of markup, then a regular expression for each might be the easiest solution. Something like this for asterisks, for example (untested):
preg_replace('/\*([^*]+)\*/', '<em>\\1</em>', $text);
And something similar for backslashes.
What you're looking for is more commonly handled by a state machine or lexer/parser.
This is ugly but it works. Catch: only for one pattern type at a time.
$input = "Here's some \\italic\\ text and even \\some more\\ wheee";
$output = preg_replace_callback( "/\\\/", 'replacer', $input );
echo $output;
function replacer( $matches )
{
static $toggle = 0;
if ( $toggle )
{
$toggle = 0;
return "</em>";
}
$toggle = 1;
return "<em>";
}
I created an alternative to str_replace, because the PHP manual for str_replace says that:
If search and replace are arrays, then str_replace() takes a value
from each array and uses them to search and replace on subject.
If replace has fewer values than search, then an empty string is used
for the rest of replacement values.
If search is an array and replace is a string, then this replacement
string is used for every value of search.
The converse would not make sense, though.
But the converse DOES make sense if the same needle appears several times in your haystack, such as '?' in a prepared statement (e.g. PHP's MySQLi extension), and you need to write a log or diagnostic report of what's going on as it runs through the parameters, substituting the parameters in the query string to make a 'cut and paste' version of the query for testing elsewhere.
Occurrences of needle are replaced left-to-right with the values in the replace array. If there are more occurrences of needle than there are replacements, it resets the replace array pointer. This means that for the OP's use, the needle would be "*", and the replacement would be an array with two values, "<I>" and "</I>".
function str_replace_seriatim(string $needle, array $replace, string $haystack) {
$occurrences = substr_count($haystack, $needle);
for ($i = 0; $i < $occurrences; $i++) {
$substitute = current($replace);
$pos = strpos($haystack, $needle);
if ($pos !== FALSE) $haystack = substr_replace($haystack, $substitute, $pos, strlen($needle));
if ((next($replace) === FALSE)) reset($replace);
}
return $haystack;
}
To do the whole lot in one function call, I suppose that one could expand on this a little, taking an array ($pincushion) of needles and a multidimensional array as the replacement, but I'm not sure if that isn't more work than just multiple function calls.
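For comparison, the same left-to-right cycling could also be sketched in a single pass with preg_replace_callback(). This is an alternative technique, not the function above, and the name replaceCycling is my own:

```php
<?php
// Hypothetical single-pass variant: cycles through $replacements per occurrence.
function replaceCycling(string $needle, array $replacements, string $haystack): string
{
    $replacements = array_values($replacements); // ensure 0-based keys
    $count = count($replacements);
    if ($count === 0 || $needle === '') {
        return $haystack; // nothing sensible to do
    }

    $i = 0;
    return preg_replace_callback(
        '/' . preg_quote($needle, '/') . '/',
        function (array $match) use (&$i, $replacements, $count): string {
            // The modulo wraps around when the replacements run out.
            return $replacements[$i++ % $count];
        },
        $haystack
    );
}

echo replaceCycling('*', ['<em>', '</em>'], 'some *italic* and *more* text');
```

One behavioural difference: because the haystack is scanned only once, a replacement value that itself contains the needle is not replaced again.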