preg_replace everything but content within bbcode

I'm trying to replace everything in my content with empty space except the content within my bbcode (and the bbcode itself).
This is my code to eliminate my bbcode.
The BBCode is just a little helper to identify important content.
$content = preg_replace ( '/\[lang_chooser\](.*?)\[\/lang_chooser\]/is' , '$1' , $content );
Isn't it possible to just negate this code?
$content = preg_replace ( '/^[\[lang_chooser\](.*?)\[\/lang_chooser\]]/is' , '' , $content );
Cheers & thanks for your help!
EDIT
here is my solution (sorry, I can't answer my own question at the moment)
$firstOcc = stripos($content, '[lang_chooser]');
$lastOcc = stripos($content, '[/lang_chooser]');
$content = substr($content, $firstOcc, $lastOcc + strlen('[/lang_chooser]') - $firstOcc);
$content = preg_replace('/' . addcslashes('[lang_chooser](.*?)[/lang_chooser]', '/[]') . '/is', '$1', $content);
I think it's not the best solution, but it's working for the moment.
Maybe there is a better way to do it ;-)

The ^ character negates only inside character classes. Anywhere else it anchors the match to the beginning of the string (or of the line if you are in multiline mode).
It is possible to use negative lookaheads and lookbehinds, but I don't think you can negate an entire regular expression.
If you just want to replace a string by part of that string, use preg_match and take the text you want from the matches array:
if( preg_match ( '/(\[lang_chooser\].*?\[\/lang_chooser\])/is', $content, $matches ) )
echo $matches[ 0 ]; // should have what you want
For readability I use addcslashes to escape the / and [:
if( preg_match ( '/' . addcslashes( '([lang_chooser].*?[/lang_chooser])', '/[]' ) . '/is', $content, $matches ) )
The best part of addcslashes is that you can take any regular expression (from a variable, from a search box value, from config) and safely call preg functions without worrying about what delimiter to use.
You probably also want the u modifier for unicode compliance unless for some strange reason you don't use utf-8:
if( preg_match ( '/' . addcslashes( '([lang_chooser].*?[/lang_chooser])', '/[]' ) . '/isu', $content, $matches ) )
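If the content can contain more than one [lang_chooser] block, a possible variation is to collect every match and join them, which effectively drops everything outside the tags. This is only a sketch: the function name keep_lang_chooser_blocks and the sample string are made up for illustration and are not part of the question.

```php
<?php
// Hypothetical helper: keeps every [lang_chooser]...[/lang_chooser] block
// (tags included) and discards all text outside of them.
function keep_lang_chooser_blocks($content)
{
    // "i" = case-insensitive tags, "s" = let . match newlines inside a block
    preg_match_all('/\[lang_chooser\].*?\[\/lang_chooser\]/is', $content, $matches);

    // $matches[0] holds the full text of each match, in document order
    return implode('', $matches[0]);
}

$sample = 'intro [lang_chooser]en[/lang_chooser] middle [LANG_CHOOSER]de[/lang_chooser] outro';
echo keep_lang_chooser_blocks($sample), "\n";
// prints: [lang_chooser]en[/lang_chooser][LANG_CHOOSER]de[/lang_chooser]
```

If no block is present, the function returns an empty string, so the caller may want to check for that case.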
In the meantime I improved the addcslashes approach a bit. It lets you use string literals in regular expressions without worrying about meta characters. Xeoncross pointed out preg_quote. It might still be nice to have an escape class like this, so you can take a fixed delimiter from somewhere and keep your code neater. You might also want to add other regex flavors at some point, or be able to absorb future changes to preg_quote without changing the rest of your codebase. It currently supports only PCRE:
class Escape
{
    /*
     * Escapes meta characters in strings so they can be put in regular expressions.
     *
     * usage:
     * preg_replace( '/' . Escape::pcre( $text ) . '/u', $replacement, $string );
     */
    static function pcre( $string )
    {
        return preg_quote( $string, '/' );
    }
}
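As a rough illustration of what preg_quote does for the bbcode tags from the question, the sketch below builds the pattern from the literal tag strings. The variable names and the sample content are assumptions chosen for the example.

```php
<?php
// Literal tags as they appear in the content (taken from the question).
$open  = '[lang_chooser]';
$close = '[/lang_chooser]';

// preg_quote() escapes regex meta characters; its second argument
// additionally escapes the chosen delimiter, here "/".
$pattern = '/' . preg_quote($open, '/') . '(.*?)' . preg_quote($close, '/') . '/is';

$content  = 'before [lang_chooser]English[/lang_chooser] after';
$stripped = preg_replace($pattern, '$1', $content);

echo $pattern, "\n";  // prints: /\[lang_chooser\](.*?)\[\/lang_chooser\]/is
echo $stripped, "\n"; // prints: before English after
```

Only the literal parts go through preg_quote; the (.*?) group is added afterwards so that it keeps its regex meaning.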

Related

html_entity_decode in specific regular expression for a preg_replace

I have these preg_replace patterns and replacements:
$patterns = array(
"/<br\W*?\/>/",
"/<strong>/",
"/<*\/strong>/",
"/<h1>/",
"/<*\/h1>/",
"/<h2>/",
"/<*\/h2>/",
"/<em>/",
"/<*\/em>/",
'/(?:\<code*\>([^\<]*)\<\/code\>)/',
);
$replacements = array(
"\n",
"[b]",
"[/b]",
"[h1]",
"[/h1]",
"[h2]",
"[/h2]",
"[i]",
"[/i]",
'[code]***HTML DECODE HERE***[/code]',
);
In my string I want to html_entity_decode the content between these tags:
<code> &lt; &gt; </code> but keep my array structure for preg_replace,
so this: <code> &lt; &gt; </code> will become this: [code] < > [/code]
Any help will be very appreciated, thanks!

You cannot encode this in the replacement string. As PoloRM suggested, you could use preg_replace_callback specifically for your last replacement instead:
function decode_html($matches)
{
return '[code]'.html_entity_decode($matches[1]).'[/code]';
}
$str = '<code> &lt; &gt; </code>';
$str = preg_replace_callback('/(?:\<code*\>([^\<]*)\<\/code\>)/', 'decode_html', $str);
Equivalently, using create_function (note that create_function is deprecated as of PHP 7.2 and removed in PHP 8.0):
$str = preg_replace_callback(
'/(?:\<code*\>([^\<]*)\<\/code\>)/',
create_function(
'$matches',
'return \'[code]\'.html_entity_decode($matches[1]).\'[/code]\';'
),
$str
);
Or, as of PHP 5.3.0:
$str = preg_replace_callback(
'/(?:\<code*\>([^\<]*)\<\/code\>)/',
function ($matches) {
return '[code]'.html_entity_decode($matches[1]).'[/code]';
},
$str
);
But note that in all three cases, your pattern is not really optimal. Firstly, you don't need to escape those < and > (but that is just for readability). Secondly, your first * allows infinite repetition (or omission) of the letter e. I suppose you wanted to allow attributes. Thirdly, you cannot include other tags within your <code> (because [^<] will not match them). In this case maybe you should go with ungreedy repetition instead (I also changed the delimiter for convenience):
~(?:<code[^>]*>(.*?)</code>)~
As you can already see, this is still far from perfect (in terms of correctly matching the HTML in the first place). Hence, the obligatory reminder: don't use regex to parse HTML. You will be much better off using a DOM parser. PHP ships with a built-in one, and there is also this very convenient-to-use 3rd-party one.
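If you stay with regular expressions, one way to keep the array structure for the simple tags and still decode the content of <code> is to run the array-based preg_replace first and the callback afterwards. The sketch below assumes a shortened pattern list and a made-up function name html_to_bbcode; it uses the ungreedy pattern from above with an added s modifier so that code blocks can span several lines.

```php
<?php
// Shortened, illustrative version of the arrays from the question.
$patterns     = array('~<br\W*?/>~', '~<strong>~', '~</strong>~');
$replacements = array("\n", '[b]', '[/b]');

// Hypothetical helper: plain replacements first, then the <code> callback.
function html_to_bbcode($html, array $patterns, array $replacements)
{
    $html = preg_replace($patterns, $replacements, $html);

    return preg_replace_callback(
        '~<code[^>]*>(.*?)</code>~s',
        function ($matches) {
            // Only the captured content is entity-decoded
            return '[code]' . html_entity_decode($matches[1]) . '[/code]';
        },
        $html
    );
}

$input = '<strong>Hi</strong><br /><code class="php"> &lt; &gt; </code>';
echo html_to_bbcode($input, $patterns, $replacements);
// prints "[b]Hi[/b]", a newline, then "[code] < > [/code]"
```

The limitations mentioned above still apply: this does not handle nested or malformed HTML.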

Check out this:
http://www.php.net/manual/en/function.preg-replace-callback.php
You can create a callback function that applies the html_entity_decode functionality on your match.

how to make a string lowercase without changing url

I'm using mb_strtolower to make a string lowercase, but sometimes the text contains URLs with upper-case characters. When I use mb_strtolower, the URLs are of course changed as well and stop working.
How can I convert the string to lowercase without changing the URLs?

Since you have not posted your string, this can be only generally answered.
Whenever you use a function on a string to make it lower-case, the whole string will be made lower-case. String functions are aware of strings only, they are not aware of the contents written within these strings specifically.
In your scenario you do not want to lowercase the whole string I assume. You want to lowercase only parts of that string, other parts, the URLs, should not be changed in their case.
To do so, you must first parse your string into these two different parts, let's call them text and URLs. Then you need to apply the lowercase function only on the parts of type text. After that you need to combine all parts together again in their original order.
If the content of the string is structurally simple, you can split it at spaces. Then check whether each part begins with http:// or https:// (an is_url() check, or its negation as used here) and, if it does not, lowercase it:
$text = 'your content http://link.me/now! might differ';
$fragments = explode(' ', $text);
foreach ($fragments as &$fragment) {
    if (is_not_url($fragment)) {
        $fragment = strtolower($fragment); // or mb_strtolower
    }
}
unset($fragment); // remove reference
$lowercase = implode(' ', $fragments);
For this code to work, you need to define the is_not_url() function. Additionally, the original text must be simple enough that splitting it on spaces is a reasonable way to parse it.
Hopefully this example helps you along with coding and understanding your problem.
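A minimal sketch of the missing piece could look like the following. The body of is_not_url() is an assumption (it only checks for an http:// or https:// prefix), and lowercase_except_urls() is a made-up wrapper name; mb_strtolower is used when the mbstring extension is available, with strtolower as a fallback.

```php
<?php
// Assumed definition: a fragment counts as a URL if it starts with http:// or https://
function is_not_url($fragment)
{
    return preg_match('~^https?://~i', $fragment) !== 1;
}

// Hypothetical wrapper around the split / lowercase / join steps.
function lowercase_except_urls($text)
{
    $fragments = explode(' ', $text);
    foreach ($fragments as &$fragment) {
        if (is_not_url($fragment)) {
            $fragment = function_exists('mb_strtolower')
                ? mb_strtolower($fragment, 'UTF-8')
                : strtolower($fragment);
        }
    }
    unset($fragment); // remove reference

    return implode(' ', $fragments);
}

echo lowercase_except_urls('Your Content http://link.me/Now! Might Differ'), "\n";
// prints: your content http://link.me/Now! might differ
```

Punctuation attached to a URL (such as the "!" above) stays part of the URL fragment, which is harmless here because that fragment is left untouched.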

Here you go, iterative, but as fine as possible.
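function strtolower_sensitive ( $input ) {
    // The "e" (eval) modifier is dropped: it was only meaningful for preg_replace and has since been removed from PHP.
    $regexp = "#((http|https|ftp)://(\S*?\.\S*?))(\s|\;|\)|\]|\[|\{|\}|,|\"|'|:|\<|$|\.\s)#i";
    if(preg_match_all($regexp, $input, $matches, PREG_SET_ORDER)) {
        $hist = array();
        foreach ( $matches as $i => $match ) {
            // Swap each URL (group 1, without the trailing delimiter) for a lowercase-safe placeholder.
            $n = "sxxx" . $i . "xxxs";
            $input = str_replace( $match[1], $n, $input );
            $hist[] = array( $match[1], $n );
        }
        $input = strtolower($input);
        foreach ( $hist as $h ) {
            // Put the original URLs back in place of the placeholders.
            $input = str_replace( $h[1], $h[0], $input );
        }
    }
    return $input;
}
$input is your string, and the return value will be your answer. =)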

PHP preg_match_all question

I have a question about a regular expression that is giving me grief. I have a list of items that is separated by tags. I am trying to extract everything between two particular tags (which occur multiple times). Here is a sample of the list I am parsing:
<ResumeResultItem_V3>
<ResumeTitle>Johnson</ResumeTitle>
<RecentEmployer>University of Phoenix</RecentEmployer>
<RecentJobTitle>Advisor</RecentJobTitle>
<RecentPay>40000</RecentPay>
</ResumeResultItem_V3>
<ResumeResultItem_V3>
<ResumeTitle>ResumeforJake</ResumeTitle>
<RecentEmployer>APEX</RecentEmployer>
<RecentJobTitle>Consultant</RecentJobTitle>
<RecentPay>66000</RecentPay>
</ResumeResultItem_V3>
I'm trying to get everything in between "ResumeResultItem_V3" as a blob of text, but I can't seem to get the expression right.
Here is the code I have so far:
$test = "(<ResumeResultItem_V3>)";
$test2 = "(<\/ResumeResultItem_V3>)";
preg_match_all("/" . $test . "(\w+)" . $test2 . "/", $xml, $matches);
foreach ($matches[0] as $match) {
echo $match;
echo "<br /><br />";
}
How can I fix this?

I'm making assumptions about your XML structure, but I really think you need an example using an XML parser, like SimpleXML.
$xml = new SimpleXMLElement( $file );
foreach( $xml->ResumeResultItem_V3 as $ResumeResultItem_V3 )
echo (string)$ResumeResultItem_V3;

You are probably better off with SimpleXML for extracting the data here.
But to also answer the regex question: \w+ only matches word characters. In this case you want it to match pretty much everything in between the delimiters, which .*? can be used for.
preg_match_all("/$test(.*?)$test2/s", $xml, $matches);
This only works with the /s modifier though, because without it . does not match newlines.

Ignoring that you probably ought to use an XML parser, and that PHP has one you can use...
The issue is that \w+ matches word characters, not any character. Whitespace and most punctuation (including < and >) aren't word characters, so your match fails. Instead you need to match any character (.) one or more times (+), and because a greedy match would run from the first opening tag to the last closing tag, you need the ? modifier to make it non-greedy. Your expression should work if you change \w+ to .+? and, since . does not match newlines by default, add the s modifier:
preg_match_all('/' . $test . '(.+?)' . $test2 . '/s', $xml, $matches);
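To make the effect of the s modifier concrete, here is a small sketch on a shortened version of the sample data. The variable names and the trimming step are assumptions added for the example.

```php
<?php
// Shortened sample in the shape of the list from the question.
$xml = "<ResumeResultItem_V3>\n"
     . "<ResumeTitle>Johnson</ResumeTitle>\n"
     . "<RecentPay>40000</RecentPay>\n"
     . "</ResumeResultItem_V3>\n"
     . "<ResumeResultItem_V3>\n"
     . "<ResumeTitle>ResumeforJake</ResumeTitle>\n"
     . "<RecentPay>66000</RecentPay>\n"
     . "</ResumeResultItem_V3>";

// Without "s" the dot stops at newlines, so nothing matches.
$countWithoutS = preg_match_all('~<ResumeResultItem_V3>(.*?)</ResumeResultItem_V3>~', $xml);

// With "s" each item is captured as one blob of text in group 1.
preg_match_all('~<ResumeResultItem_V3>(.*?)</ResumeResultItem_V3>~s', $xml, $matches);
$blobs = array_map('trim', $matches[1]);

echo $countWithoutS, "\n"; // prints: 0
echo count($blobs), "\n";  // prints: 2
echo $blobs[0], "\n";      // prints the two inner lines of the first item
```

Using ~ as the delimiter avoids having to escape the / in the closing tag.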

If you can use the output as an array with 1 item for each of the "text blob" matches, try this:
<?php
$text =
"<ResumeResultItem_V3>
<ResumeTitle>Johnson</ResumeTitle>
<RecentEmployer>University of Phoenix</RecentEmployer>
<RecentJobTitle>Advisor</RecentJobTitle>
<RecentPay>40000</RecentPay>
</ResumeResultItem_V3>
<ResumeResultItem_V3>
<ResumeTitle>ResumeforJake</ResumeTitle>
<RecentEmployer>APEX</RecentEmployer>
<RecentJobTitle>Consultant</RecentJobTitle>
<RecentPay>66000</RecentPay>
</ResumeResultItem_V3>";
$matches = preg_split("/<\/ResumeResultItem_V3>/",preg_replace("/<ResumeResultItem_V3>/","",$text));
print_r($matches);
?>
Results in:
Array
(
[0] =>
<ResumeTitle>Johnson</ResumeTitle>
<RecentEmployer>University of Phoenix</RecentEmployer>
<RecentJobTitle>Advisor</RecentJobTitle>
<RecentPay>40000</RecentPay>
[1] =>
<ResumeTitle>ResumeforJake</ResumeTitle>
<RecentEmployer>APEX</RecentEmployer>
<RecentJobTitle>Consultant</RecentJobTitle>
<RecentPay>66000</RecentPay>
[2] =>
)

PHP Formatting Regex - BBCode

To be honest, I suck at regex so much, I would use RegexBuddy, but I'm working on my Mac and sometimes it doesn't help much (for me).
Well, what I need is a function in PHP:
function replaceTags($n)
{
$n = str_replace("[[", "<b>", $n);
$n = str_replace("]]", "</b>", $n);
}
This is a bad example, though, in case someone didn't close a tag with ]] or open it with [[. Anyway, could you help with a regex for:
[[ ]] = Bold format
** ** = Italic format
(( )) = h2 heading
Those are all I need, thanks :)
P.S - Is there any software like RegexBuddy available for Mac (Snow Leopard)?

function replaceTags($n)
{
$n = preg_replace("/\[\[(.*?)\]\]/", "<strong>$1</strong>", $n);
$n = preg_replace("/\*\*(.*?)\*\*/", "<em>$1</em>", $n);
$n = preg_replace("/\(\((.*?)\)\)/", "<h2>$1</h2>", $n);
return $n;
}
I should probably provide a little explanation: Each special character is preceded by a backslash so it's not treated as regex instructions ("[", "(", etc.). The "(.*?)" captures all characters between your delimiters ("[[" and "]]", etc.). What's captured is then output in the replacements string in place of "$1".
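For a quick check of how the three patterns behave, the same function can be run on a made-up sample string. The sample inputs are assumptions; the function body is the one shown above.

```php
<?php
function replaceTags($n)
{
    $n = preg_replace("/\[\[(.*?)\]\]/", "<strong>$1</strong>", $n);
    $n = preg_replace("/\*\*(.*?)\*\*/", "<em>$1</em>", $n);
    $n = preg_replace("/\(\((.*?)\)\)/", "<h2>$1</h2>", $n);
    return $n;
}

echo replaceTags('[[Bold]] then **italic** and ((Heading))'), "\n";
// prints: <strong>Bold</strong> then <em>italic</em> and <h2>Heading</h2>

// An opening delimiter without its closing partner is simply left alone.
echo replaceTags('[[oops'), "\n";
// prints: [[oops
```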

The same reason you can't do this with str_replace() applies to preg_replace() as well. Tag-pair style parsing requires a lexer/parser if you want to yield 100% accuracy and cover for input errors.
Regular expressions can't handle unclosed tags, nested tags, that sort of thing.
That all being said, you can get 50% of the way there with very little effort.
$test = "this is [[some]] test [[content for **you** to try, ((does [[it]])) **work?";
echo convertTags( $test );
// only handles validly formatted, non-nested input
function convertTags( $content )
{
return preg_replace(
array(
"/\[\[(.*?)\]\]/"
, "/\*\*(.*?)\*\*/"
, "/\(\((.*?)\)\)/"
)
, array(
"<strong>$1</strong>"
, "<em>$1</em>"
, "<h2>$1</h2>"
)
, $content
);
}

Modifiers could help too :)
http://lv.php.net/manual/en/reference.pcre.pattern.modifiers.php
U (PCRE_UNGREEDY) This modifier inverts the "greediness" of the quantifiers so that they are not greedy by default, but become greedy if followed by ?. It is not compatible with Perl. It can also be set by a (?U) modifier setting within the pattern or by a question mark behind a quantifier (e.g. .*?).
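The difference is easiest to see side by side. In this sketch the sample text and variable names are assumptions; it compares a greedy quantifier, the same quantifier made lazy with ?, and the greedy form under the U modifier.

```php
<?php
$text = '[[a]] and [[b]]';

// Greedy: .* runs to the last "]]" in the string
$greedy = preg_replace('/\[\[(.*)\]\]/', '<strong>$1</strong>', $text);

// Lazy via "?": stops at the first "]]"
$lazy = preg_replace('/\[\[(.*?)\]\]/', '<strong>$1</strong>', $text);

// U modifier: quantifiers are ungreedy by default, so plain .* behaves like .*?
$ungreedy = preg_replace('/\[\[(.*)\]\]/U', '<strong>$1</strong>', $text);

echo $greedy, "\n";   // prints: <strong>a]] and [[b</strong>
echo $lazy, "\n";     // prints: <strong>a</strong> and <strong>b</strong>
echo $ungreedy, "\n"; // prints: <strong>a</strong> and <strong>b</strong>
```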

Writing a simple preg_replace in PHP

I'm not much of a coder, but I need to write a simple preg_replace statement in PHP that will help me with a WordPress plugin. Basically, I need some code that will search for a string, pull out the video ID, and return the embed code with the video id inserted into it.
In other words, I'm searching for this:
[youtube=http://www.youtube.com/watch?v=VIDEO_ID_HERE&hl=en&fs=1]
And want to replace it with this (keeping the video id the same):
param name="movie" value="http://www.youtube.com/v/VIDEO_ID_HERE&hl=en&fs=1&rel=0
If possible, I'd be forever grateful if you could explain how you've used the various slashes, carets, and Kleene stars in the search pattern, i.e. translate it from grep to English so I can learn. :-)
Thanks!
Mike

BE CAREFUL! If this is a BBCode-style system with user input, these other two solutions would leave you vulnerable to XSS attacks.
You have several ways to protect yourself against this. Have the regex explicitly disallow the characters that could get you in trouble (or, allow only those valid for a youtube video id), or actually sanitize the input and use preg_match instead, which I will illustrate below going off of RoBorg's regex.
<?php
$input = "[youtube=http://www.youtube.com/watch?v=VIDEO_ID_HERE&hl=en&fs=1]";
if ( preg_match('/\[youtube=.*?v=(.*?)&.*?\]/i', $input, $matches ) )
{
$sanitizedVideoId = urlencode( strip_tags( $matches[1] ) );
echo 'param name="movie" value="http://www.youtube.com/v/' . $sanitizedVideoId . '&hl=en&fs=1&rel=0';
} else {
// Not valid input
}
Here's an example of this type of attack in action
<?php
$input = "[youtube=http://www.youtube.com/watch?v=\"><script src=\"http://example.com/xss.js\"></script>&hl=en&fs=1]";
// Is vulnerable to XSS
echo preg_replace('/\[youtube=.*?v=(.*?)&.*?\]/i', 'param name="movie" value="http://www.youtube.com/v/$1&hl=en&fs=1&rel=0', $input );
echo "\n";
// Prevents XSS
if ( preg_match('/\[youtube=.*?v=(.*?)&.*?\]/i', $input, $matches ) )
{
$sanitizedVideoId = urlencode( strip_tags( $matches[1] ) );
echo 'param name="movie" value="http://www.youtube.com/v/' . $sanitizedVideoId . '&hl=en&fs=1&rel=0';
} else {
// Not valid input
}

$str = preg_replace('/\[youtube=.*?v=([a-z0-9_-]+?)&.*?\]/i', 'param name="movie" value="http://www.youtube.com/v/$1&hl=en&fs=1&rel=0', $str);
/ - Start of RE
\[ - A literal [ ([ is a special character so it needs escaping)
youtube= - Make sure we've got the right tag
.*? - Any old rubbish, but don't be greedy; stop when we reach...
v= - ...this text
([a-z0-9_-]+?) - Take some more text (just a-z, 0-9, _ and -), and don't be greedy. Capture it using (). This will get put in $1
&.*?\] - the junk up to the ending ]
/i - end the RE and make it case-insensitive for the hell of it
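As a small sketch of how the restricted character class behaves, the pattern can be tried on a well-formed tag and on a tag carrying markup in the id position. The function name youtube_tag_to_url is hypothetical, and the replacement is reduced to the bare URL for brevity.

```php
<?php
// Hypothetical wrapper around the pattern explained above.
function youtube_tag_to_url($str)
{
    return preg_replace(
        '/\[youtube=.*?v=([a-z0-9_-]+?)&.*?\]/i',
        'http://www.youtube.com/v/$1&hl=en&fs=1&rel=0',
        $str
    );
}

$good = '[youtube=http://www.youtube.com/watch?v=VIDEO_ID_HERE&hl=en&fs=1]';
$bad  = '[youtube=http://www.youtube.com/watch?v="><script src="http://example.com/xss.js"></script>&hl=en&fs=1]';

echo youtube_tag_to_url($good), "\n";
// prints: http://www.youtube.com/v/VIDEO_ID_HERE&hl=en&fs=1&rel=0

// The id contains characters outside a-z, 0-9, _ and -, so the tag is not converted.
var_dump(youtube_tag_to_url($bad) === $bad); // bool(true)
```

An unconverted tag is still raw user input, so the surrounding text needs the usual HTML escaping (for example with htmlspecialchars) before it is output.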

I would avoid regular expressions in this case if at all possible, because who guarantees that the query string in the first URL will always be in that format?
I'd use parse_url($originalURL, PHP_URL_QUERY), split the returned query string on &, and then loop through the pieces to find the 'name=value' pair for the v part of the query string.
Something like:
$originalURL = 'http://www.youtube.com/watch?v=VIDEO_ID_HERE&hl=en&fs=1';
// parse_url() with PHP_URL_QUERY returns the query string, so split it into pairs first
foreach( explode( '&', parse_url( $originalURL, PHP_URL_QUERY ) ) as $keyvalue )
{
    if ( strlen( $keyvalue ) > 2 && substr( $keyvalue, 0, 2 ) == 'v=' )
    {
        $videoId = substr( $keyvalue, 2 );
        break;
    }
}
$newURL = sprintf( 'http://www.youtube.com/v/%s/whatever/else', urlencode( $videoId ) );
P.S. Written in the SO textbox, untested.
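A possible alternative to looping over the pairs by hand is to let parse_str() decode the query string into an array. This is a sketch under that assumption; the function name youtube_video_id is made up for the example.

```php
<?php
// Hypothetical helper: returns the value of the "v" parameter, or null if absent.
function youtube_video_id($url)
{
    // With PHP_URL_QUERY, parse_url() returns just the query string
    // (or null when the URL has none).
    $query = parse_url($url, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null;
    }

    // parse_str() fills $params with the decoded name => value pairs
    parse_str($query, $params);

    return (isset($params['v']) && is_string($params['v'])) ? $params['v'] : null;
}

$videoId = youtube_video_id('http://www.youtube.com/watch?v=VIDEO_ID_HERE&hl=en&fs=1');
echo $videoId, "\n"; // prints: VIDEO_ID_HERE

if ($videoId !== null) {
    echo 'http://www.youtube.com/v/' . rawurlencode($videoId) . '&hl=en&fs=1&rel=0', "\n";
}
```

This does not depend on the order of the parameters, which addresses the concern about the query string format.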

$embedString = 'youtube=http://www.youtube.com/watch?v=VIDEO_ID_HERE&hl=en&fs=1';
preg_match('/v=([^&]*)/',$embedString,$matches);
echo 'param name="movie" value="http://www.youtube.com/v/'.$matches[1].'&hl=en&fs=1&rel=0';
Try that.
The regex /v=([^&]*)/ works this way:
it searches for v=
it then saves the match to the pattern inside the parentheses to $matches
[^&] tells it to match any character except the ampersand ('&')
* tells it we want anywhere from 0 to any number of those characters in the match

A warning: if the text after .*? isn't found immediately, the regex engine will keep searching along the rest of the line, possibly jumping to the next [youtube...] tag. It is often better to use [^\]]*? to keep the search inside the brackets.
Based on RoBorg's answer:
$str = preg_replace('/\[youtube=[^\]]*?v=([^\]]*?)&[^\]]*?\]/i', ...)
[^\]] will match any character except ']'.
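The following sketch illustrates the difference on a made-up input where the first tag has no v= parameter. The sample string and variable names are assumptions; the two patterns are the ones discussed in this thread.

```php
<?php
$input = '[youtube=http://a/watch?x=1] text [youtube=http://b/watch?v=ID2&fs=1]';

// .*? may run past the first "]" while looking for "v="
preg_match('/\[youtube=.*?v=(.*?)&.*?\]/i', $input, $loose);

// [^\]]*? cannot leave the current tag, so only the second tag matches
preg_match('/\[youtube=[^\]]*?v=([^\]]*?)&[^\]]*?\]/i', $input, $strict);

echo $loose[0], "\n";
// prints the whole input: the match starts in the first tag and ends in the second

echo $strict[0], "\n";
// prints: [youtube=http://b/watch?v=ID2&fs=1]
```

Both patterns capture ID2 here, but a replacement based on the first one would also swallow the first tag and the text between the two.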
