Get the number of matched characters in a regex group - php

I may be pushing the boundaries of Regular Expressions, but who knows...
I'm working in php.
In something like:
preg_replace('/(?:\n|^)(={3,6})([^=]+)(\1)/','<h#>$2</h#>', $input);
Is there a way to figure out how many '=' (={3,6}) matched, so I can backreference it where the '#'s are?
Effectively turning:
===Heading 3=== into <h3>Heading 3</h3>
====Heading 4==== into <h4>Heading 4</h4>
...

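You can use the e modifier (note that it was deprecated in PHP 5.5.0 and removed in PHP 7.0.0, so this only works on older versions):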
preg_replace('/(?:\n|^)(={3,6})([^=]+)(\1)/e',
"'<h'.strlen('$1').'>'.'$2'.'</h'.strlen('$1').'>'", $input);

No, PCRE can't do that. You should instead use preg_replace_callback and do some character counting then:
preg_replace_callback('/(?:\n|^)(={3,6})([^=]+)(\1)/', 'cb_headline', $input);
function cb_headline($m) {
list(, $markup, $text) = $m;
$n = strlen($markup);
return "<h$n>$text</h$n>";
}
Additionally you might want to be forgiving with the trailing === signs. Don't use a backreference but allow a variable number.
You might also wish to use the /m flag for your regex, so you can keep ^ in place of the more complex (?:\n|^) assertion.
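The two suggestions above (tolerating a mismatched number of trailing signs, and using the /m flag) might be combined roughly as follows. This is only a sketch: the function name convert_headings is made up for illustration, and excluding \n from the heading text is an added assumption so that one heading cannot span several lines.

```php
<?php
// Hypothetical helper (name not from the thread): turns ===Heading=== markup into <hN> tags.
function convert_headings(string $input): string
{
    // /m lets ^ match at the start of every line, replacing (?:\n|^).
    // (={3,6}) captures the opening run; =* accepts any number of trailing signs.
    return preg_replace_callback(
        '/^(={3,6})([^=\n]+)=*/m',
        function (array $m): string {
            $level = strlen($m[1]); // how many '=' the first group matched
            return "<h{$level}>{$m[2]}</h{$level}>";
        },
        $input
    );
}

echo convert_headings("===Heading 3===\n====Heading 4==");
```

The heading level is taken only from the opening run, so a line such as ====Heading 4== still becomes an h4.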

It is very simple with the e modifier; there is no need for preg_replace_callback (on PHP versions before 7.0.0, where the e modifier is still available):
$str = '===Heading 3===';
echo preg_replace('/(?:\n|^)(={3,6})([^=]+)(\1)/e',
'implode("", array("<h", strlen("$1"), ">$2</h", strlen("$1"), ">"));',
$str);
or this way
echo preg_replace('/(?:\n|^)(={3,6})([^=]+)(\1)/e',
'"<h".strlen("$1").">$2</h".strlen("$1").">"',
$str);

I would do it like this:
<?php
$input = '===Heading 3===';
$h_tag = preg_replace_callback('#(?:\n|^)(={3,6})([^=]+)(\1)#', 'paragraph_replace', $input);
var_dump($h_tag);
function paragraph_replace($matches) {
$length = strlen($matches[1]);
return "<h{$length}>". $matches[2] . "</h{$length}>";
}
?>
Output:
string(18) "<h3>Heading 3</h3>"

Related

PHP exploding url from text, possible?

I need to extract the YouTube URL from this line:
[embed]https://www.youtube.com/watch?v=L3HQMbQAWRc[/embed]
Is it possible? I need to remove [embed] & [/embed].
preg_match is what you need.
<?php
$str = "[embed]https://www.youtube.com/watch?v=L3HQMbQAWRc[/embed]";
preg_match("/\[embed\](.*)\[\/embed\]/", $str, $matches);
echo $matches[1]; //https://www.youtube.com/watch?v=L3HQMbQAWRc
$string = '[embed]https://www.youtube.com/watch?v=L3HQMbQAWRc[/embed]';
$string = str_replace(['[embed]', '[/embed]'], '', $string);
See str_replace
why not use str_replace? :) Quick & Easy
http://php.net/manual/de/function.str-replace.php
Just for good measure, you can also use positive lookbehinds and lookaheads in your regular expressions:
(?<=\[embed\])(.*)(?=\[\/embed\])
You'd use it like this:
$string = "[embed]https://www.youtube.com/watch?v=L3HQMbQAWRc[/embed]";
$pattern = '/(?<=\[embed\])(.*)(?=\[\/embed\])/';
preg_match($pattern, $string, $matches);
echo $matches[1];
Here is an explanation of the regex:
(?<=\[embed\]) is a Positive Lookbehind - it asserts that the match is immediately preceded by [embed], without making [embed] part of the match.
(.*) is a Capturing Group - . matches any character (except a newline) and the quantifier * repeats it between zero and unlimited times, as many times as possible. This is what is matched between the two lookarounds. These are the droids you're looking for.
(?=\[\/embed\]) is a Positive Lookahead - it asserts that the match is immediately followed by [/embed], again without consuming it.
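Because lookarounds do not consume text, the whole match (index 0) is already the bare URL. A minimal sketch of that idea, assuming there may be several embeds in one string; the helper name extract_embeds and the lazy .*? are additions for illustration, not part of the answer above:

```php
<?php
// Hypothetical helper: returns every URL wrapped in [embed]...[/embed].
function extract_embeds(string $text): array
{
    // The lookbehind and lookahead assert the tags without consuming them,
    // so $matches[0] holds the URLs themselves. .*? is lazy so that two
    // embeds on the same line are not merged into one match.
    preg_match_all('/(?<=\[embed\]).*?(?=\[\/embed\])/', $text, $matches);
    return $matches[0];
}

print_r(extract_embeds('[embed]https://www.youtube.com/watch?v=L3HQMbQAWRc[/embed]'));
```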

How to remove everything before the second occurrence of an underscore

I couldn't find the solution using search.
I am looking for a PHP solution to remove all characters BEFORE the second occurrence of an underscore (including the underscore).
For example:
this_is_a_test
Should output as:
a_test
I currently have this code, but it only removes everything up to the first occurrence:
preg_replace('/^[^_]*.s*/', '$1', 'this_is_a_test');
Using a slightly different approach,
$s='this_is_a_test';
echo implode('_', array_slice( explode( '_', $s ),2 ) );
/* outputs */
a_test
preg_replace('/^.*_.*_(.*)$/U', '$1', 'this_is_a_test');
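Note the U modifier, which tells the regex to take as few characters for .* as possible.
You can also use explode and implode along with array_splice, like so (array_splice takes its array by reference, so the exploded parts go into a variable first):
$str = "this_is_a_test";
$parts = explode('_', $str);
echo implode('_', array_splice($parts, 2)); //a_test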
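Why go the complicated way? Here is a suggestion using strrpos and substr. Note that it locates the last underscore and steps back one character, so it gives the expected result only when the last underscore is the third one and exactly one character separates it from the second, as in this input: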
<?php
$str = "this_is_a_test";
$str_pos = strrpos($str, "_");
echo substr($str, $str_pos-1);
?>
Try this one.
<?php
$string = 'this_is_a_test';
$explode = explode('_', $string, 3);
echo $explode[2];
?>
I'm still in favor of a regular expression in this case:
preg_replace('/^.*?_.*?_/', '', 'this_is_a_test');
Or (which looks more complex here but is easily adjustable to N..M underscores):
preg_replace('/^(?:.*?_){2}/', '', 'this_is_a_test');
The question mark in .*? makes the match non-greedy, and the pattern has been expanded from the original post to match up through the second underscore.
Since the goal is to remove text, the matched portion is simply replaced with an empty string; there is no need for a capture group or for a backreference such as $1 as the replacement value.
If the input doesn't include two underscores, nothing is removed; this can be adjusted very easily with the second regular expression if the rules are further clarified.
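To illustrate how the second pattern adjusts to a different number of underscores, here is a small sketch. The function name strip_segments and its $n parameter are invented for this example:

```php
<?php
// Hypothetical helper: removes everything up to and including the $n-th underscore.
function strip_segments(string $s, int $n): string
{
    // (?:.*?_){n} lazily matches exactly $n segments, each ending in '_'.
    // If the input has fewer than $n underscores, nothing matches and
    // the string is returned unchanged.
    return preg_replace('/^(?:.*?_){' . $n . '}/', '', $s);
}

echo strip_segments('this_is_a_test', 2);
```

Passing 3 instead of 2 would strip one more segment, and an input with too few underscores comes back untouched.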

Regex: Using capture data further in the regex

I want to parse some text that starts with ":" and may be surrounded by parentheses that end the match, so:
"abcd:(someText)efgh" and
"abcd:someText"
will return someText.
But I have a problem making the parentheses optional.
I wrote this, but it does not work:
$reg = '#:([\\(]){0,1}([a-z]+)$1#i';
$v = 'abc:(someText)def';
var_dump(preg_match($reg,$v,$matches));
var_dump($matches);
The $1 makes it fail.
I don't know how to tell it:
If there is a "(" at the beginning, there must be a ")" at the end.
You can't test whether the count of one thing equals the count of another. That is a limitation of regular expressions in the formal sense, which can only describe regular languages (http://en.wikipedia.org/wiki/Regular_language). To enforce what you asked for in general, namely that every '(' has a matching ')', you would need a context-free language (http://en.wikipedia.org/wiki/Context-free_language).
Anyway, you can use this regex:
'/:(\([a-z]+\)|[a-z]+)/i'
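For the single optional pair of parentheses in the question, PCRE (which PHP's preg functions use) also offers conditional subpatterns, which go beyond strictly regular languages. The following is a sketch under that assumption; the helper name text_after_colon is hypothetical, and it does not handle nested parentheses:

```php
<?php
// Hypothetical helper: returns the text after ':' with or without parentheses,
// or null when nothing matches (including an unclosed "(").
function text_after_colon(string $s): ?string
{
    // (\()?     optionally captures an opening parenthesis as group 1
    // ([a-z]+)  the text itself, group 2
    // (?(1)\))  conditional: only if group 1 matched, require a closing ")"
    if (preg_match('/:(\()?([a-z]+)(?(1)\))/i', $s, $m)) {
        return $m[2];
    }
    return null;
}

var_dump(text_after_colon('abcd:(someText)efgh'));
var_dump(text_after_colon('abcd:someText'));
```

Unlike the alternation above, the captured text never includes the parentheses, because they sit outside group 2.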
To have different subpatterns write their match to the same element of the $matches array, you can use named subpatterns with the inline option J, which allows duplicate names. The element's key in $matches is the subpattern's name:
$pattern = '~(?J:.+:\((?<text>[^)]+)\).*|.+:(?<text>.+))~';
$texts = array(
'abc:(someText)def',
'abc:someText'
);
foreach($texts as $text)
{
preg_match($pattern, $text, $matches);
echo $text, ' -> ', $matches['text'], '<br>';
}
Result:
abc:(someText)def -> someText
abc:someText -> someText
This regex matches either :word or (word); groups 1 and 2 hold the respective results.
if (preg_match('/:([a-z]+)|\(([a-z]+)\)/i', $subject, $regs)) {
$result = ($regs[1])?$regs[1]:$regs[2];
} else {
$result = "";
}
regex: with look-behind
"(?<=:\(|:)[^()]+"
test with grep:
kent$ echo "abcd:(someText)efgh
dquote> abcd:someOtherText"|grep -Po "(?<=:\(|:)[^()]+"
someText
someOtherText
Try this:
.+:\((.+)\).*|.+:(.+)
If $1 is empty, there are no parentheses and $2 holds your text.

Masking all but first letter of a word using Regex

I'm attempting to create a bad word filter in PHP that will analyze the word and match against an array of known bad words, but keep the first letter of the word and replace the rest with asterisks. Example:
fook would become f***
shoot would become s****
The only part I don't know is how to keep the first letter in the string, and how to replace the remaining letters with something else while keeping the same string length.
$string = preg_replace("/\b(". $word .")\b/i", "***", $string);
Thanks!
$string = 'fook would become';
$word = 'fook';
$string = preg_replace("~\b". preg_quote($word, '~') ."\b~i", $word[0] . str_repeat('*', strlen($word) - 1), $string);
var_dump($string);
$string = preg_replace("/\b(".preg_quote($word[0], '/').")".preg_quote(substr($word, 1), '/')."\b/i", '$1'.str_repeat('*', strlen($word) - 1), $string);
This can be done in many ways, with very weird auto-generated regexps...
But I believe using preg_replace_callback() would end up being more robust
<?php
# as already pointed out, your words *may* need sanitization
foreach($words as $k=>$v)
$words[$k]=preg_quote($v,'/');
# and to be collapsed into a **big regexpy goodness**
$words=implode('|',$words);
# after that, a single preg_replace_callback() would do
$string = preg_replace_callback('/\b('. $words .')\b/i', "my_beloved_callback", $string);
function my_beloved_callback($m)
{
$len=strlen($m[1])-1;
return $m[1][0].str_repeat('*',$len);
}
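The fragment above leaves $words and $string undefined. A self-contained sketch of the same callback idea might look like this; the function name mask_words and the sample word list are assumptions for illustration, and strlen counts bytes, so multi-byte words would need the mb_* functions instead:

```php
<?php
// Hypothetical helper: masks every listed word except for its first letter.
function mask_words(string $text, array $words): string
{
    // Escape each word so regex metacharacters are treated literally.
    $quoted = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $words);
    $pattern = '/\b(' . implode('|', $quoted) . ')\b/i';

    return preg_replace_callback($pattern, function (array $m): string {
        // Keep the first character as typed and mask the rest one-for-one,
        // so the result has the same length as the original word.
        return $m[1][0] . str_repeat('*', strlen($m[1]) - 1);
    }, $text);
}

echo mask_words('fook would become', ['fook', 'shoot']);
```

Because the callback works on the matched text rather than on the list entry, the original capitalisation of the first letter is preserved.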
Here is a Unicode-friendly regular expression for PHP:
function lowercase_except_first_letter($s) {
// the pattern skips the first letter of each word and passes every remaining letter to the callback
// \W in the lookbehind keeps the first letter even for words in quotes and brackets
return preg_replace_callback('/(?<!^|\s|\W)(\w)/u', function($m) {
return mb_strtolower($m[1]);
}, $s);
}

Regex, get string value between two characters

I'd like to return string between two characters, # and dot (.).
I tried to use a regex but cannot get it working:
(#(.*?).)
Anybody?
Your regular expression almost works, you just forgot to escape the period. Also, in PHP you need delimiters:
'/#(.*?)\./s'
The s is the DOTALL modifier.
Here's a complete example of how you could use it in PHP:
$s = 'foo#bar.baz';
$matches = array();
$t = preg_match('/#(.*?)\./s', $s, $matches);
print_r($matches[1]);
Output:
bar
Try this regular expression:
#([^.]*)\.
The expression [^.]* will match any number of any character other than the dot. And the plain dot needs to be escaped as it’s a special character.
This approach avoids regex and is quick to use:
function get_string_between ($str,$from,$to) {
$string = substr($str, strpos($str, $from) + strlen($from));
if (strstr ($string,$to,TRUE) !== FALSE) {
$string = strstr ($string,$to,TRUE);
}
return $string;
}
If you're learning regex, you may want to analyse those too:
#\K[^.]++(?=\.)
(?<=#)[^.]++(?=\.)
Both these regular expressions use possessive quantifiers (++). Use them whenever you can, to prevent needless backtracking. Also, by using lookaround constructions (or \K), we can match the part between the # and the . in $matches[0].
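As a rough sketch of how the \K variant could be used from PHP, with the result read from $matches[0] rather than from a capture group (the helper name between_hash_and_dot is invented for this example):

```php
<?php
// Hypothetical helper: text between the first '#' and the next '.', or null.
function between_hash_and_dot(string $s): ?string
{
    // \K drops the '#' from the reported match; [^.]++ is possessive, so the
    // engine never backtracks into it; (?=\.) requires a dot to follow.
    if (preg_match('/#\K[^.]++(?=\.)/', $s, $m)) {
        return $m[0];
    }
    return null;
}

var_dump(between_hash_and_dot('foo#bar.baz'));
```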
