Writing a simple preg_replace in PHP


I'm not much of a coder, but I need to write a simple preg_replace statement in PHP that will help me with a WordPress plugin. Basically, I need some code that will search for a string, pull out the video ID, and return the embed code with the video id inserted into it.
In other words, I'm searching for this:
[youtube=http://www.youtube.com/watch?v=VIDEO_ID_HERE&hl=en&fs=1]
And want to replace it with this (keeping the video id the same):
param name="movie" value="http://www.youtube.com/v/VIDEO_ID_HERE&hl=en&fs=1&rel=0
If possible, I'd be forever grateful if you could explain how you've used the various slashes, carets, and Kleene stars in the search pattern, i.e. translate it from grep to English so I can learn. :-)
Thanks!
Mike

BE CAREFUL! If this is a BBCode-style system with user input, these other two solutions would leave you vulnerable to XSS attacks.
You have several ways to protect yourself against this. Have the regex explicitly disallow the characters that could get you in trouble (or, allow only those valid for a youtube video id), or actually sanitize the input and use preg_match instead, which I will illustrate below going off of RoBorg's regex.
<?php
$input = "[youtube=http://www.youtube.com/watch?v=VIDEO_ID_HERE&hl=en&fs=1]";
if ( preg_match('/\[youtube=.*?v=(.*?)&.*?\]/i', $input, $matches ) )
{
    $sanitizedVideoId = urlencode( strip_tags( $matches[1] ) );
    echo 'param name="movie" value="http://www.youtube.com/v/' . $sanitizedVideoId . '&hl=en&fs=1&rel=0';
} else {
    // Not valid input
}
Here's an example of this type of attack in action
<?php
$input = "[youtube=http://www.youtube.com/watch?v=\"><script src=\"http://example.com/xss.js\"></script>&hl=en&fs=1]";
// Is vulnerable to XSS
echo preg_replace('/\[youtube=.*?v=(.*?)&.*?\]/i', 'param name="movie" value="http://www.youtube.com/v/$1&hl=en&fs=1&rel=0', $input );
echo "\n";
// Prevents XSS
if ( preg_match('/\[youtube=.*?v=(.*?)&.*?\]/i', $input, $matches ) )
{
    $sanitizedVideoId = urlencode( strip_tags( $matches[1] ) );
    echo 'param name="movie" value="http://www.youtube.com/v/' . $sanitizedVideoId . '&hl=en&fs=1&rel=0';
} else {
    // Not valid input
}

$str = preg_replace('/\[youtube=.*?v=([a-z0-9_-]+?)&.*?\]/i', 'param name="movie" value="http://www.youtube.com/v/$1&hl=en&fs=1&rel=0', $str);
/ - Start of RE
\[ - A literal [ ([ is a special character so it needs escaping)
youtube= - Make sure we've got the right tag
.*? - Any old rubbish, but don't be greedy; stop when we reach...
v= - ...this text
([a-z0-9_-]+?) - Take some more text (just a-z 0-9 _ and -), and don't be greedy. Capture it using (). This will get put in $1
&.*?\] - the junk up to the ending ]
/i - end the RE and make it case-insensitive for the hell of it
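To see the breakdown above in action, here is a minimal sketch. The sample strings and the video ID abc_123-XYZ are made up for illustration; only the pattern and replacement come from the answer.

```php
<?php
// Sample input in the question's format; the video ID is a made-up placeholder.
$str = 'Intro [youtube=http://www.youtube.com/watch?v=abc_123-XYZ&hl=en&fs=1] outro';

$pattern     = '/\[youtube=.*?v=([a-z0-9_-]+?)&.*?\]/i';
$replacement = 'param name="movie" value="http://www.youtube.com/v/$1&hl=en&fs=1&rel=0';

// The whole [youtube=...] tag is replaced; $1 carries the captured ID across.
$result = preg_replace($pattern, $replacement, $str);
echo $result, "\n";

// An "ID" containing characters outside a-z 0-9 _ - does not match,
// so the tag is left exactly as it was.
$evil       = '[youtube=http://www.youtube.com/watch?v="><script>&hl=en&fs=1]';
$evilResult = preg_replace($pattern, $replacement, $evil);
echo $evilResult, "\n";
```

Because the capture group only accepts characters that are plausible in a video ID, the second input comes back unchanged rather than being turned into embed code.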

I would avoid regular expressions in this case if at all possible, because who guarantees that the query string in the first URL will always be in that format?
I'd use parse_url($originalURL, PHP_URL_QUERY), split the returned query string on '&', and then loop through the resulting array to find the 'name=value' pair for the v part of the query string.
Something like:
$originalURL = 'http://www.youtube.com/watch?v=VIDEO_ID_HERE&hl=en&fs=1';
$videoId = '';
foreach( explode( '&', (string) parse_url( $originalURL, PHP_URL_QUERY ) ) as $keyvalue )
{
    if ( strlen( $keyvalue ) > 2 && substr( $keyvalue, 0, 2 ) == 'v=' )
    {
        $videoId = substr( $keyvalue, 2 );
        break;
    }
}
$newURL = sprintf( 'http://www.youtube.com/v/%s/whatever/else', urlencode( $videoId ) );
p.s. written in the SO textbox, untested.
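A variation on the same idea, as a rough sketch: PHP's parse_str() can decode the query string into an associative array, which avoids the manual loop. The function name extract_video_id() is hypothetical and only used for illustration.

```php
<?php
// Hypothetical helper name; not part of the answer above.
function extract_video_id($url)
{
    $query = parse_url($url, PHP_URL_QUERY);   // e.g. "v=VIDEO_ID_HERE&hl=en&fs=1"
    if (!is_string($query)) {
        return null;                           // no query string, or a malformed URL
    }
    parse_str($query, $params);                // decode into an associative array
    return (isset($params['v']) && is_string($params['v'])) ? $params['v'] : null;
}

$videoId = extract_video_id('http://www.youtube.com/watch?v=VIDEO_ID_HERE&hl=en&fs=1');
if ($videoId !== null) {
    echo 'http://www.youtube.com/v/' . urlencode($videoId) . '&hl=en&fs=1&rel=0', "\n";
}
```

This does not care where v sits in the query string, and urlencode() still guards the value before it is put into the embed URL.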

$embedString = 'youtube=http://www.youtube.com/watch?v=VIDEO_ID_HERE&hl=en&fs=1';
preg_match('/v=([^&]*)/',$embedString,$matches);
echo 'param name="movie" value="http://www.youtube.com/v/'.$matches[1].'&hl=en&fs=1&rel=0';
Try that.
The regex /v=([^&]*)/ works this way:
it searches for v=
it then saves the match to the pattern inside the parentheses to $matches
[^&] tells it to match any character except the ampersand ('&')
* tells it we want anywhere from 0 to any number of those characters in the match

A warning. If the text after .*? isn't found immediately, the regex engine will continue to search over the whole line, possibly jumping to the next [youtube...] tag. It is often better to use [^\]]*? to limit the search inside the brackets.
Based on RoBorg's answer:
$str = preg_replace('/\[youtube=[^\]]*?v=([^\]]*?)&[^\]]*?\]/i', ...)
[^\]] will match any character except ']'.
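A small sketch of the difference, using two made-up tags (FIRST and SECOND are placeholder IDs). The first tag has no & after the ID, which is what lets the lazy .*? wander into the next tag.

```php
<?php
// First tag has no "&" after the ID; the second tag is well-formed.
$str = '[youtube=http://www.youtube.com/watch?v=FIRST] and '
     . '[youtube=http://www.youtube.com/watch?v=SECOND&hl=en]';

$loose  = '/\[youtube=.*?v=(.*?)&.*?\]/i';
$strict = '/\[youtube=[^\]]*?v=([^\]]*?)&[^\]]*?\]/i';

preg_match($loose, $str, $m1);
preg_match($strict, $str, $m2);

echo $m1[1], "\n";  // the capture spills across the first ] into the second tag
echo $m2[1], "\n";  // the capture stays inside a single tag
```

With the loose pattern the capture is "FIRST] and [youtube=http://www.youtube.com/watch?v=SECOND"; with the bracket-limited pattern the malformed first tag is skipped and the capture is just "SECOND".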

Related

Regex only for specific domain name in URL

As much as I've tried I can't seem to find the correct regex to locate what I'm after here.
I only want to select the first instance of the url that matches the domain www.myweb.com from the following...
Some text https://www.myweb.com/page/cat/323123442321-rghe432 and then another https://www.adifferentsite.com/fsdhjss/erwr
I need to completely ignore the second url www.adifferentsite.com and only work with the first one that matches www.myweb.com, ignoring any other possible instances of www.myweb.com
Once the first matching domain is discovered I need to store the rest of the url that comes after it...
page/cat/323123442321-rghe432
...into a new variable $newvar, so...
$newvar = 'page/cat/323123442321-rghe432';
I'm trying :
return preg_replace_callback( '/http://www.myweb.com/\/[0-9a-zA-Z]+/', array( __CLASS__, 'my_callback' ), $newvar );
I've read tons of documents on how to detect url's but can't find anything about detecting a specific url.
I really can't grasp how to formulate regex so this formula is incorrect. Any help would be greatly appreciated.
EDIT Edited the question to be a bit more specific and hopefully a bit easier to resolve.

You can use a preg_replace_callback and pass an array into the anonymous function (or just your custom callback function) to fill it with all the necessary URL parts.
Here is a demo:
$rests = array();
$re = '~\b(https?://)www\.myweb\.com/(\S+)~';
$str = "Some text https://www.myweb.com/page/cat/323123442321-rghe432 and then another https://www.adifferentsite.com/fsdhjss/erwr";
echo $result = preg_replace_callback($re, function ($m) use (&$rests) {
    array_push($rests, $m[2]);
    return $m[1] . "embed.myweb.com/" . $m[2];
}, $str) . PHP_EOL;
print_r($rests);
Results:
Some text https://embed.myweb.com/page/cat/323123442321-rghe432 and then another https://www.adifferentsite.com/fsdhjss/erwr
Array
(
[0] => page/cat/323123442321-rghe432
)
A couple of words:
'~\b(https?://)www\.myweb\.com/(\S+)~' has ~ as a regex delimiter, so you do not have to escape /
It is declared with a single-quoted literal, so you do not have to use double-escaping for \\S
It matches and captures into capturing groups 2 substrings: \b(https?://) (that matches a whole word http or https followed by ://) and (\S+) (that matches 1 or more non-whitespace characters). These capturing groups are marked with (...) in the pattern and can be accessed via $matches[n] where n is the id of the capturing group.
UPDATE
If you only need to replace the first occurrence of the URL, pass the limit argument to the preg_replace_callback:
$rest = "";
$re = '~\b(https?://)www\.myweb\.com/(\S+\b)~';
$str = "Some text https://www.myweb.com/page/cat/323123442321-rghe432, another http://www.myweb.com/page/cat/323123442321-rghe432 and then another https://www.adifferentsite.com/fsdhjss/erwr";
echo $result = preg_replace_callback($re, function ($m) use (&$rest) {
    $rest = $m[2];
    return $m[1] . "embed.myweb.com/" . $m[2];
}, $str, 1) . PHP_EOL;
//       ^ LIMIT HERE
echo $rest;
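If the goal is only to store the rest of the first matching URL in $newvar, without replacing anything, a plain preg_match() may be enough. This is a sketch under that assumption, reusing the pattern from the answer above.

```php
<?php
$str = 'Some text https://www.myweb.com/page/cat/323123442321-rghe432 '
     . 'and then another https://www.adifferentsite.com/fsdhjss/erwr';

$newvar = null;
// preg_match() stops at the first match, so later URLs are ignored.
if (preg_match('~\bhttps?://www\.myweb\.com/(\S+)~', $str, $m)) {
    $newvar = $m[1];   // everything after the domain, up to the next whitespace
}
echo $newvar, "\n";
```

Here $newvar ends up as 'page/cat/323123442321-rghe432'; the second URL never matches because its domain is different.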

Php preg_match pattern

I have a form field that I want to check if the user submitted a correct pattern. I tried it this way.
// $car_plate should in XXX-1111, or XXX-111(three letters in uppercase followed by a dash and four or three numbers)
<?php
$car_plate = $values['car_plate'];
if (!preg_match('[A-Z]{3}-[0-9]{3|4}$', $car_plate)) {
    $this->errors ="Pattern for plate number is XXX-1111 or XXX-111";
} else {
    // code to submit
}
?>
The following car_plate numbers are in the correct format (AAA-456, AGC-4567, WER-123), but the check always returns the error. What is the correct way?

An alternative to TimoSta's answer:
/^[a-zA-Z]{3}-?\d{3,4}$/
This allows the user to enter the letters in lowercase and to skip the dash.
You can format the data later like this:
$input = 'abc1234';
if ( preg_match( '/^([a-zA-Z]{3})-?(\d{3,4})$/', $input, $matches ) )
{
    $new_input = strtoupper( $matches[1] ) . '-' . $matches[2];
    echo $new_input;
}
outputs: ABC-1234

Looks like you're a little bit off with your regular expression.
Try this one:
/^[A-Z]{3}-[0-9]{3,4}$/
In PHP, you have to enclose your regular expression with delimiters, in this case the slashes. In addition to that, {3|4} is not valid, the correct syntax is {3,4} as you can see in the docs covering repetition.
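A quick way to check the corrected pattern is to run it over a few plates. The first three samples are the ones from the question; the last three are made-up counterexamples.

```php
<?php
$pattern = '/^[A-Z]{3}-[0-9]{3,4}$/';
$samples = array('AAA-456', 'AGC-4567', 'WER-123', 'abc-123', 'AB-1234', 'ABC-12345');

$results = array();
foreach ($samples as $plate) {
    // preg_match() returns 1 on a match, 0 on no match
    $results[$plate] = (preg_match($pattern, $plate) === 1);
    echo $plate, ' => ', $results[$plate] ? 'valid' : 'invalid', "\n";
}
```

The three plates from the question are reported as valid and the other three as invalid. Note that $ also matches just before a trailing newline; if that matters for the form input, the D modifier (or \z instead of $) makes the end anchor strict.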

preg_replace everything but content within bbcode

I'm trying to replace everything in my content with empty space except the content within my bbcode (and the bbcode itself).
This is my code to eliminate my bbcode.
The BBCode is just a little helper to identify important content.
$content = preg_replace ( '/\[lang_chooser\](.*?)\[\/lang_chooser\]/is' , '$1' , $content );
Isn't it possible to just negate this code?
$content = preg_replace ( '/^[\[lang_chooser\](.*?)\[\/lang_chooser\]]/is' , '' , $content );
Cheers & thanks for your help!
EDIT
here is my solution (sorry, I can't answer my own question at the moment)
$firstOcc = stripos($content, '[lang_chooser]');
$lastOcc = stripos($content, '[/lang_chooser]');
$content = substr($content, $firstOcc, $lastOcc + strlen('[/lang_chooser]') - $firstOcc);
$content = preg_replace('/' . addcslashes('[lang_chooser](.*?)[/lang_chooser]', '/[]') . '/is', '$1', $content);
I think it's not the best solution, but it's working for the moment.
Maybe there is a better way to do it ;-)

The ^ character does not negate except for in character classes. It means match the beginning of the string (or the line if you are in multiline mode).
It is possible to have negative look aheads and look backs, but not to negate entire regular expressions I think.
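A short sketch of the two meanings of ^, using made-up sample strings:

```php
<?php
// Outside a character class, ^ anchors the match to the start of the string.
$startsWithTag = preg_match('/^\[lang_chooser\]/', '[lang_chooser]en[/lang_chooser]');
$notAtStart    = preg_match('/^\[lang_chooser\]/', 'text [lang_chooser]en[/lang_chooser]');

// Inside a character class, ^ negates the class: [^\]] means "any character except ]".
preg_match('/\[([^\]]+)\]/', 'text [lang_chooser]en', $m);

var_dump($startsWithTag); // int(1): the tag is at the very start
var_dump($notAtStart);    // int(0): the tag exists, but not at the start
var_dump($m[1]);          // string "lang_chooser": everything up to the first ]
```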
If you just want to replace a string by part of that string, use preg_match and assign the matches array to your text
if( preg_match ( '/(\[lang_chooser\].*?\[\/lang_chooser\])/is', $content, $matches ) )
echo $matches[ 0 ]; // should have what you want
For readability I use addcslashes to escape the / and [:
if( preg_match ( '/' . addcslashes( '([lang_chooser].*?[/lang_chooser])', '/[]' ) . '/is', $content, $matches ) )
The best part of addcslashes is that you can take any regular expression (from a variable, from a search box value, from config) and safely call preg functions without worrying about what delimiter to use.
You probably also want the u modifier for unicode compliance unless for some strange reason you don't use utf-8:
if( preg_match ( '/' . addcslashes( '([lang_chooser].*?[/lang_chooser])', '/[]' ) . '/isu', $content, $matches ) )
In the meantime I improved the addcslashes approach a bit. It lets you use string literals in regular expressions without worrying about meta characters. Xeoncross pointed out preg_quote. It might still be nice to have an escape class like this, so you can take a fixed delimiter from somewhere to keep your code neater. You might also want to add other regex flavors at some point, or be able to catch future changes to preg_quote without changing the rest of your codebase. It currently only supports PCRE:
class Escape
{
    /*
     * Escapes meta characters in strings in order to put them in regular expressions.
     *
     * usage:
     * preg_replace( '/' . Escape::pcre( $text ) . '/u', $replacement, $string );
     */
    public static function pcre( $string )
    {
        return preg_quote( $string, '/' );
    }
}
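For comparison, a minimal sketch that calls preg_quote() directly on the two bbcode tags and keeps only the tagged block. The sample $content is made up for illustration.

```php
<?php
$content = 'intro [lang_chooser]Deutsch | English[/lang_chooser] outro';

// preg_quote() escapes regex metacharacters; the second argument also escapes the delimiter.
$open  = preg_quote('[lang_chooser]', '/');
$close = preg_quote('[/lang_chooser]', '/');

$kept = '';
if (preg_match('/' . $open . '(.*?)' . $close . '/isu', $content, $matches)) {
    $kept = $matches[0];   // the bbcode tags plus everything between them
}
echo $kept, "\n";
```

Only the literal tag text goes through preg_quote(); the (.*?) part is written as a regex on purpose, so it must not be escaped.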

how to make a string lowercase without changing url

I'm using mb_strtolower to make a string lowercase, but sometimes the text contains URLs with uppercase characters. When I use mb_strtolower, the URLs change too and stop working.
How can I convert the string to lowercase without changing the URLs?

Since you have not posted your string, this can be only generally answered.
Whenever you use a function on a string to make it lower-case, the whole string will be made lower-case. String functions are aware of strings only, they are not aware of the contents written within these strings specifically.
In your scenario you do not want to lowercase the whole string I assume. You want to lowercase only parts of that string, other parts, the URLs, should not be changed in their case.
To do so, you must first parse your string into these two different parts, let's call them text and URLs. Then you need to apply the lowercase function only on the parts of type text. After that you need to combine all parts together again in their original order.
If the content of the string is semantically simple, you can split the string at spaces. Then you can check each part, if it begins with http:// or https:// (is_url()?) and if not, perform the lowercase operation:
$text = 'your content http://link.me/now! might differ';
$fragments = explode(' ', $text);
foreach($fragments as &$fragment) {
    if (is_not_url($fragment))
        $fragment = strtolower($fragment); // or mb_strtolower
}
unset($fragment); // remove reference
$lowercase = implode(' ', $fragments);
For this code to work, you need to define the is_not_url() function. Additionally, the original text must be simple enough that rudimentary parsing on the space separator is sufficient.
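One possible definition of is_not_url(), as a rough sketch: it assumes that a fragment counts as a URL only when it starts with http:// or https://, which is an assumption for illustration and not part of the answer. The sample text is made up.

```php
<?php
// Hypothetical implementation of the is_not_url() helper the loop above expects.
function is_not_url($fragment)
{
    // Treat anything starting with http:// or https:// (in any case) as a URL.
    return preg_match('~^https?://~i', $fragment) !== 1;
}

$text = 'Your Content http://Link.me/Now! Might DIFFER';
$fragments = explode(' ', $text);
foreach ($fragments as &$fragment) {
    if (is_not_url($fragment)) {
        $fragment = strtolower($fragment); // or mb_strtolower
    }
}
unset($fragment); // remove reference
$lowercase = implode(' ', $fragments);
echo $lowercase, "\n";
```

This leaves http://Link.me/Now! untouched and lowercases the remaining words. URLs that are glued to punctuation or brackets would need a smarter split than explode(' ').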
Hopefully this example helps you get along with coding and understanding your problem.

Here you go, iterative, but as fine as possible.
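function strtolower_sensitive ( $input ) {
    $regexp = "#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i";
    $hist = array();
    if ( preg_match_all( $regexp, $input, $matches, PREG_SET_ORDER ) ) {
        foreach ( $matches as $i => $match ) {
            // Swap each URL for a lowercase placeholder that strtolower() leaves untouched
            $u = $match[1];
            $n = "sxxx" . ( $i + 1 ) . "x";
            $input = preg_replace( '#' . preg_quote( $u, '#' ) . '#', $n, $input, 1 );
            $hist[] = array( $u, $n );
        }
    }
    $input = strtolower( $input ); // or mb_strtolower
    // Put the original URLs back in place of their placeholders
    foreach ( $hist as $h ) {
        $input = str_replace( $h[1], $h[0], $input );
    }
    return $input;
}
Pass your string as $input; the return value is your answer. =)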

Regular Expressions: how to do "option split" replaces

Those regular expressions drive me crazy. I'm stuck with this one:
test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not
Task:
Remove all [[ and ]] and, if there is an option split, choose the last one, so the output should be:
test1:link test2:silver test3:out1insideout2 test4:this|not
I came up with (PHP)
$text = preg_replace("/\\[\\[|\\]\\]/",'',$text); // remove [[ or ]]
This works for part 1 of the task, but before that I think I should do the option split. My best solution:
$text = preg_replace("/\\[\\[(.*\|)(.*?)\\]\\]/",'$2',$text);
Result:
test1:silver test3:[[out1[[inside]]out2]] this|not
I'm stuck. Could someone with a few free minutes help me? Thanks!

I think the easiest way to do this would be multiple passes. Use a regular expression like:
\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]
This will replace option strings to give you the last option from the group. If you run it repeatedly until it no longer matches, you should get the right result (the first pass will replace [[out1[[inside]]out2]] with [[out1insideout2]] and the second will ditch the brackets).
Edit 1: By way of explanation,
\[\[ # Opening [[
(?: # A non-capturing group (we don't want this bit)
[^\[\]] # Non-bracket characters
* # Zero or more of them
\| # A literal '|' character representing the end of the discarded options
)? # This group is optional: if there is only one option, it won't be present
( # The group we're actually interested in ($1)
[^\[\]] # All the non-bracket characters
+ # Must be at least one
) # End of $1
\]\] # End of the grouping.
Edit 2: Changed expression to ignore ']' as well as '[' (it works a bit better like that).
Edit 3: There is no need to know the number of nested brackets as you can do something like:
$regexp = '/\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]/';
$oldtext = "";
$newtext = $text;
while ($newtext != $oldtext)
{
    $oldtext = $newtext;
    $newtext = preg_replace($regexp, '$1', $oldtext);
}
$text = $newtext;
Basically, this keeps running the regular expression replace until the output is the same as the input.
Note that I don't know PHP, so the above may need adjusting.
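Put together as a self-contained sketch, using the test string from the question. The function name resolve_option_splits() is hypothetical; the pattern is the one explained above.

```php
<?php
// Hypothetical function name; wraps the repeat-until-stable loop described above.
function resolve_option_splits($text)
{
    $regexp = '/\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]/';
    do {
        $before = $text;
        // Each pass resolves the innermost [[...]] groups, keeping the last option.
        $text = preg_replace($regexp, '$1', $before);
    } while ($text !== $before);
    return $text;
}

$input = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
echo resolve_option_splits($input), "\n";
```

For this input the loop runs three passes: the first resolves [[link]], [[gold|silver]] and the inner [[inside]], the second removes the now-flat [[out1insideout2]], and the third changes nothing, which ends the loop. The result is "test1:link test2:silver test3:out1insideout2 test4:this|not".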

This is impossible to do in one regular expression since you want to keep content in multiple "hierarchies" of the content. It would be possible otherwise, using a recursive regular expression.
Anyways, here's the simplest, most greedy regular expression I can think of. It should only replace if the content matches your exact requirements.
You will need to escape all backslashes when putting it into a string (\ becomes \\.)
\[\[((?:[^][|]+|(?!\[\[|]])[^|])++\|?)*]]
As others have already explained, you use this with multiple passes. Keep looping while there are matches, performing replacement (only keeping match group 1.)
Difference from other regular expressions here is that it will allow you to have single brackets in the content, without breaking:
test1:[[link]] test2:[[gold|si[lv]er]]
test3:[[out1[[in[si]de]]out2]] test4:this|not
becomes
test1:link test2:si[lv]er
test3:out1in[si]deout2 test4:this|not

Why try to do it all in one go? Remove the [[]] first and then deal with the options; do it in two lines of code.
When trying to get something going, favour clarity and simplicity.
Seems like you have all the pieces.

Why not just simply remove any brackets that are left?
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$str = preg_replace('/\\[\\[(?:[^|\\]]+\\|)+([^\\]]+)\\]\\]/', '$1', $str);
$str = str_replace(array('[', ']'), '', $str);

Well, I didn't stick to just regex, because I'm of a mind that trying to do stuff like this with one big regex leads you to the old joke about "Now you have two problems". However, give something like this a shot:
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$reg = '/(.*?):(.*?)( |$)/';
preg_match_all($reg, $str, $m);
foreach($m[2] as $pos => $match) {
    if (strpos($match, '|') !== FALSE && strpos($match, '[[') !== FALSE ) {
        $opt = explode('|', $match);
        $match = $opt[count($opt)-1];
    }
    $m[2][$pos] = str_replace(array('[', ']'),'', $match );
}
foreach($m[1] as $k=>$v) $result[$k] = $v.':'.$m[2][$k];

This is C# using only non-escaped (verbatim) strings, hence you will have to double the backslashes in other languages.
String input = "test1:[[link]] " +
"test2:[[gold|silver]] " +
"test3:[[out1[[inside]]out2]] " +
"test4:this|not";
String step1 = Regex.Replace(input, @"\[\[([^|]+)\|([^\]]+)\]\]", @"[[$2]]");
String step2 = Regex.Replace(step1, @"\[\[|\]\]", String.Empty);
// Prints "test1:silver test3:out1insideout2 test4:this|not"
Console.WriteLine(step2);

$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$s = preg_split("/\s+/",$str);
foreach ($s as $k=>$v){
    $v = preg_replace("/\[\[|\]\]/","",$v);
    $j = explode(":",$v);
    $j[1]=preg_replace("/.*\|/","",$j[1]);
    print implode(":",$j)."\n";
}
