PHP preg_replace Expertise Sought - php

I'm creating some custom BBcode for a forum. I'm trying to get the regular expression right, but it has been eluding me for two days. Any expert advice is welcome.
The input (e.g. sample forum post):
[quote=Bob]I like Candace. She is nice.[/quote]
I agree, she is very nice. I like Ashley, too, and especially [Ryan] when he's drinking.
Essentially, I want to encase any names (from a specified list) in [user][/user] BBcode... except, of course, those being quoted, because doing that causes some terrible parsing errors. Below is an example of how I want the output to be.
The desired output:
[quote=Bob]I like [user]Candace[/user]. She is nice.[/quote]
I agree, she is very nice. I like [user]Ashley[/user], too, and especially [[user]Ryan[/user]] when he's drinking.
My current code:
$searchArray = array(
'/(?i)(Ashley|Bob|Candace|Ryan|Tim)/'
);
$replaceArray = array(
"[user]\\0[/user]"
);
$text = preg_replace($searchArray, $replaceArray, $input);
$input is of course set to the post contents (i.e. the first example listed above). How can I achieve the results I want? I don't want the regex to match when a name is preceded by an equals sign (=), but putting a [^=] in front of the names in the regex will make it match any non-equals sign character (i.e. spaces), which then messes up the formatting.
Update
The problem is that by using \1 instead of \0 it is omitting the first character before the names (because anything but = is matched). The output results in this:
[quote=Bob]I like[user]Candace[/user]. She is nice.[/quote]
I agree, she is very nice. I like[user]Ashley[/user], too, and especially [user]Ryan[/user]] when he's drinking.

You were on the right track with the [^=] idea. You can put it in its own capture group, and instead of \\0, which is the full match, use \\1 and \\2, i.e. the first and second capture groups:
$searchArray = array(
'/(?i)([^=])(Ashley|Bob|Candace|Ryan|Tim)/'
);
$replaceArray = array(
"\\1[user]\\2[/user]"
);
$text = preg_replace($searchArray, $replaceArray, $input);
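As an alternative to capturing the preceding character, here is a minimal sketch that uses a negative lookbehind, so nothing before the name is consumed and a name at the very start of the post can still match. It assumes the sample post from the question; the \b word boundaries are an addition that is not in the original pattern.

```php
<?php
// Sample post from the question.
$input = "[quote=Bob]I like Candace. She is nice.[/quote]\n"
       . "I agree, she is very nice. I like Ashley, too, and especially [Ryan] when he's drinking.";

// (?<!=) asserts "not preceded by =" without consuming a character.
// \b (an assumption, not in the original) keeps names from matching inside longer words.
$pattern = '/(?<!=)\b(Ashley|Bob|Candace|Ryan|Tim)\b/i';

$text = preg_replace($pattern, '[user]$1[/user]', $input);
echo $text, "\n";
```

Because the lookbehind is zero-width, the replacement only needs the name itself ($1), and the spacing around each name is left untouched.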

Related

preg_replace putting the final result in the wrong place

I'm having issues getting preg_replace to work right. I'm trying to create my own custom markdown. The output is almost what I want, but the problem is that the user input ends up outside of the blockquote. Here is an example of what I am talking about.
Here's my code.
<?php
$user_input = '> My quote';
$syntax = array(
'/>\s+(.*?)/is'
);
$replace_with_html = array(
'<blockquote><h3>Quote</h3><p>$1</p></blockquote>'
);
$replaced = preg_replace($syntax, $replace_with_html, $user_input);
print($replaced);
Here's the user input.
> My quote
And here is the result.
<blockquote><h3>Quote</h3><p></p></blockquote>My quote
What I want is
<blockquote><h3>Quote</h3><p>My quote</p></blockquote>
As you can see, the user input is in the wrong place (at the end of the final HTML code). Is there a way to fix this and place it within the paragraph tags?
You don't need to make arrays, use this:
$user_input = '> My quote';
$syntax = '/>\s+(.*)/s';
$replace_with_html = '<blockquote><h3>Quote</h3><p>$1</p></blockquote>';
$replaced = preg_replace($syntax, $replace_with_html, $user_input);
print($replaced);
This works the same way:
$user_input = '> My quote';
$syntax = ['/>\s+(.*)/s'];
$replace_with_html = ['<blockquote><h3>Quote</h3><p>$1</p></blockquote>'];
$replaced = preg_replace($syntax, $replace_with_html, $user_input);
print($replaced);
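Either way, you want the .* in the pattern to be greedy, so remove the ? after it.
Without this adjustment, you're only replacing the >\s+ part of the pattern, because a lazy .*? at the end of a pattern matches as little as possible, which is nothing.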
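To see the difference side by side, here is a small sketch that runs the lazy pattern from the question and the greedy one against the same input. The variable names $template, $lazy and $greedy are just for illustration.

```php
<?php
$user_input = '> My quote';
$template   = '<blockquote><h3>Quote</h3><p>$1</p></blockquote>';

// Lazy: (.*?) is the last thing in the pattern, so it is satisfied by an empty match.
$lazy = preg_replace('/>\s+(.*?)/is', $template, $user_input);

// Greedy: (.*) takes everything up to the end of the string.
$greedy = preg_replace('/>\s+(.*)/s', $template, $user_input);

echo $lazy, "\n";
echo $greedy, "\n";
```

The first line printed reproduces the misplaced output from the question, and the second is the wanted result.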
That said, let me solve some problems that you haven't encountered yet...
How do you know where to stop quoting?
What if someone wants to use > to mean "greater than"?
Consider this new pattern and how it may help you tackle some future challenges:
/^>\s+(\S+(?:\s\S+)*)/m
The ^ anchor with the m flag requires the > to be at the start of a line. After the > and one or more whitespace characters, the pattern matches one or more non-whitespace characters, optionally followed by any number of repetitions of: a single whitespace character (this can be a space/tab/return/newline) then one or more non-whitespace characters.
Effectively this says, you want to continue matching "quote" text until there are 2 or more consecutive whitespace characters (or else to the end of the string).
This adjustment should give your users the ability to accurately and conveniently quote-format their text while leaving innocent > characters alone.
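A quick way to check that behaviour is a sketch like the following. The sample text is made up for illustration: a quoted line, a blank line, then a sentence that uses > as "greater than".

```php
<?php
// Hypothetical input: one quote, a blank line, and a "greater than" sign.
$user_input = "> My quote\n\nNot a quote: 5 > 3";

$pattern  = '/^>\s+(\S+(?:\s\S+)*)/m';
$template = '<blockquote><h3>Quote</h3><p>$1</p></blockquote>';

// Only a > at the start of a line opens a quote; the quote ends at the
// first run of two whitespace characters (here the blank line).
$replaced = preg_replace($pattern, $template, $user_input);
echo $replaced, "\n";
```

Note that a single newline counts as one whitespace character, so a quote followed directly by another line of text would keep going onto that line.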

Regex for PHP seems simple but is killing me

I'm trying to make a replace in a string with a regex, and I really hope the community can help me.
I have this string :
031,02a,009,a,aaa,AZ,AZE,02B,975,135
And my goal is to remove the opposite of this regex
[09][0-9]{2}|[09][0-9][A-Za-z]
i.e.
a,aaa,AZ,AZE,135
(to see it in action : http://regexr.com?3795f )
My final goal is to preg_replace the first string to only get
031,02a,009,02B,975
(to see it in action : http://regexr.com?3795f )
I'm open to all solutions, but I admit that I'd really like to make this work with a preg_replace if possible (it has become something of a personal challenge).
Thanks for all help !
As @Taemyr pointed out in comments, my previous solution (using a lookbehind assertion) was incorrect, as it would consume 3 characters at a time even though the substrings aren't always 3 characters long.
Let's use a lookahead assertion instead to get around this:
'/(^|,)(?![09][0-9]{2}|[09][0-9][A-Za-z])[^,]*/'
The above matches the beginning of the string or a comma, then checks that what follows does not match one of the two forms you've specified to keep, and given that this condition passes, matches as many non-comma characters as possible.
However, this is identical to @anubhava's solution, meaning it has the same weakness, in that it can leave a leading comma in some cases. See this Ideone demo.
Calling ltrim on the result to strip the comma is the clean way to go there, but then again, if you were looking for the "clean way to go," you wouldn't be trying to use a single preg_replace to begin with, right? Your question is whether it's possible to do this without using any other PHP functions.
The answer is yes. We can take
'/(^|,)foo/'
and distribute the alternation,
'/^foo|,foo/'
so that we can tack on the extra comma we wish to capture only in the first case, i.e.
'/^foo,|,foo/'
That's going to be one hairy expression when we substitute foo with our actual regex, isn't it? Thankfully, PHP supports recursive patterns, so that we can rewrite the above as
'/^(foo),|,(?1)/'
And there you have it. Substituting foo for what it is, we get
'/^((?![09][0-9]{2}|[09][0-9][A-Za-z])[^,]*),|,(?1)/'
which indeed works, as shown in this second Ideone demo.
Let's take some time here to simplify your expression, though. [0-9] is equivalent to \d, and you can use case-insensitive matching by adding /i, like so:
'/^((?![09]\d{2}|[09]\d[a-z])[^,]*),|,(?1)/i'
You might even compact the inner alternation:
'/^((?![09]\d(\d|[a-z]))[^,]*),|,(?1)/i'
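For anyone who wants to try the recursive version without the demo links, here is a small sketch. The second input string is made up to exercise the leading-item branch; the variable names are only for illustration.

```php
<?php
// (?1) re-runs the first capture group, so the "reject" sub-pattern is written once.
$pattern = '/^((?![09]\d{2}|[09]\d[a-z])[^,]*),|,(?1)/i';

// The string from the question: every rejected item sits after a kept one.
$fromQuestion = preg_replace($pattern, '', '031,02a,009,a,aaa,AZ,AZE,02B,975,135');

// Hypothetical string with a rejected item in first position.
$leadingItem = preg_replace($pattern, '', 'a,031,02a,AZ,009');

echo $fromQuestion, "\n";
echo $leadingItem, "\n";
```

One caveat worth checking against your own data: ^ only matches at offset 0, so if the list starts with two or more rejected items in a row, it looks like only the first of them is removed by the first branch.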
Try it in more steps:
$newList = array();
foreach (explode(',', $list) as $element) {
// Keep only the elements that match one of the two wanted forms
if (preg_match('/^(?:[09][0-9]{2}|[09][0-9][A-Za-z])$/', $element)) {
$newList[] = $element;
}
}
$list = implode(',', $newList);
You still have your regex, see! Personal challenge completed.
Try matching what you want to keep and then joining it with commas:
preg_match_all('/[09][0-9]{2}|[09][0-9][A-Za-z]/', $input, $matches);
// $matches[0] holds the full matches; $matches itself is an array of arrays
$result = implode(',', $matches[0]);
The problem you'll be facing with preg_replace is the extra commas you'll have to strip, because you don't just want to remove aaa, you actually want to remove aaa, or ,aaa. Now what happens when you have things to remove both at the beginning and at the end of the string? You can't just say "I'll just strip the comma before", because that might lead to an extra comma at the beginning of the string, and vice versa. So basically, unless you want to mess with lookaheads and/or lookbehinds, you'd better do this in two steps.
This should work for you:
$s = '031,02a,009,a,aaa,AZ,AZE,02B,975,135';
echo ltrim(preg_replace('/(^|,)(?![09][0-9]{2}|[09][0-9][A-Za-z])[^,]+/', '', $s), ',');
OUTPUT:
031,02a,009,02B,975
Try this:
preg_replace('/(^|,)[1-8a-z][^,]*/i', '', $string);
This will remove all substrings starting with the start of the string or a comma, followed by a non-allowed first character, up to but excluding the following comma.
As per @GeoffreyBachelet's suggestion, to remove residual commas, you should do:
trim(preg_replace('/(^|,)[1-8a-z][^,]*/i', '', $string), ',');

PHP to capitalize all letters (including after a slash) except for certain words

I want to use PHP to clean up some titles by capitalizing each word, including those following a slash. However, I do not want to capitalize the words 'and', 'of', and 'the'.
Here are two example strings:
accounting technology/technician and bookkeeping
orthopedic surgery of the spine
Should correct to:
Accounting Technology/Technician and Bookkeeping
Orthopedic Surgery of the Spine
Here's what I currently have. I'm not sure how to combine the implosion with the preg_replace_callback.
// Will capitalize all words, including those following a slash
$major = implode('/', array_map('ucwords',explode('/',$major)));
// Is supposed to selectively capitalize words in the string
$major = preg_replace_callback("/[a-zA-Z]+/",'ucfirst_some',$major);
function ucfirst_some($match) {
$exclude = array('and','of','the');
if ( in_array(strtolower($match[0]),$exclude) ) return $match[0];
return ucfirst($match[0]);
}
Right now it capitalizes all words in the string, including the ones I don't want it to.
Well, I was going to try a recursive call to ucfirst_some(), but your code appears to work just fine without the first line. ie:
<?php
$major = 'accounting technology/technician and bookkeeping';
$major = preg_replace_callback("/[a-zA-Z]+/",'ucfirst_some',$major);
echo ucfirst($major);
function ucfirst_some($match) {
$exclude = array('and','of','the');
if ( in_array(strtolower($match[0]),$exclude) ) return $match[0];
return ucfirst($match[0]);
}
Prints the desired Accounting Technology/Technician and Bookkeeping.
Your regular expression matches strings of letters already, you don't seem to need to worry about the slashes at all. Just be aware that a number or symbol [like a hyphen] in the middle of a word will cause the capitalization as well.
Also, disregard the people harping on you about your $exclude array not being complete enough, you can always add in more words as you come across them. Or just Google for a list.
It should be noted that there is no single, agreed-upon "correct" way to determine what should or should not be capitalized in this way.
You also want to make sure that words like an and the are capitalized when they are used at the start of a sentence.
Note: I cannot think of any terms like this that start with of or and, but it is easier to fix things like that before odd data creeps into your program.
There is a code snippet out there that I have used before at
http://codesnippets.joyent.com/posts/show/716
It is referred on the php.net function page for ucwords in the comments section
http://php.net/manual/en/function.ucwords.php#84920
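Pulling the answers above together, here is a minimal sketch that wraps the callback in one function and always capitalizes the first character of the title. The function name title_case_some and its $exclude parameter are made up for this example, and it assumes PHP 7 or later for the type declarations.

```php
<?php
// Hypothetical helper: capitalize every run of letters except the excluded
// words, then force the first character of the whole title to upper case.
function title_case_some(string $title, array $exclude = ['and', 'of', 'the']): string
{
    $result = preg_replace_callback(
        '/[a-zA-Z]+/',
        function (array $match) use ($exclude): string {
            $word = $match[0];
            // Excluded words are returned unchanged.
            return in_array(strtolower($word), $exclude, true) ? $word : ucfirst($word);
        },
        $title
    );

    return ucfirst($result);
}

echo title_case_some('accounting technology/technician and bookkeeping'), "\n";
echo title_case_some('orthopedic surgery of the spine'), "\n";
echo title_case_some('the spine'), "\n";
```

Since the pattern only matches runs of letters, the slash needs no special handling, and the final ucfirst covers an excluded word in first position.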

Automatically convert keywords to links in php

I am trying to convert specific keywords in text, which are stored in array, to the links.
Example text:
$text='This text contains many keywords, but also formated keywords.'
So now I want to convert the word keywords to the #keywords.
I used the very simple preg_replace function
preg_replace('/keywords/i',' keywords ',$text);
but obviously it converts to link also the string already formatted as a link, so I get a messy html like:
$text='This text contains many keywords, but also formated keywords" title="keywords">keywords</a>.'
Expected result:
$text='This text contains many keywords, but also formated keywords.'
Any suggestions?
THX
EDIT
We are one step from the perfect function, but still not working well in this case:
$text='This text contains many keywords, but also formated
keywords.'
In this case it replaces also the word keywords in the href, so we again get the messy code like
keywords.com/keywords" title="keywords">keywords</a>
I'm not great with regular expressions, but maybe this one will work:
/[^#>"]keywords/i
What I think it will do is ignore any instances of #keywords, >keywords, and "keywords and find the rest.
EDIT:
After testing it out, it looks like that replaces the space before the word as well, and doesn't work if keywords is the beginning of the string. It also didn't preserve original capitalization. I have tested this one, and it works perfectly for me:
$string = "Keywords and keywords, plus some more keywords with the original keywords.";
$string = preg_replace("/(?<![#>\"])keywords/i", "$0", $string);
echo $string;
The first three are replaced, preserving the original capitalization, and the last one is left untouched. This one uses a negative lookbehind and backreferences.
EDIT 2:
OP edited question. With the new example provided, the following regex will work:
$string = 'This text contains many keywords, but also formated keywords.';
$string = preg_replace("/(?<![#>\".\/])keywords/i", "$0", $string);
echo $string;
// outputs: This text contains many keywords, but also formated keywords.
This will replace all instances of keywords that are not preceded by #, >, ", ., or /.
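Here is a rough sketch of the lookbehind idea with the link markup written out. The anchor format (href="#keywords") and the sample sentence are assumptions for illustration, since the exact markup is not shown above.

```php
<?php
// Hypothetical input: two plain keywords and one that is already a link.
$text = 'Keywords and keywords, plus <a href="#keywords">keywords</a>.';

// (?<![#>"]) skips any occurrence directly preceded by #, > or ",
// which covers the href value and the existing link text.
// $0 re-inserts the match as written, preserving its capitalization.
$linked = preg_replace('/(?<![#>"])keywords/i', '<a href="#keywords">$0</a>', $text);

echo $linked, "\n";
```

preg_replace makes a single pass over the original string, so the links it inserts are not scanned again.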
Here is the problem:
The keyword could be inside the href, the title, or the text of the link, and anywhere in there (like if the keyword was sanity and you already had href="insanity"). Or even worse, you could have a non-keyword link that happens to contain a keyword, something like:
Click here to find more keywords and such!
In the above example, even though it fits every other possible criteria (it's got spaces before and after being the easiest one to test for), it still would result in a link within a link, which I think breaks the internet.
Because of this, you need to use lookaheads and lookbehinds to check if the keyword is wrapped in a link. But there is one catch: lookbehinds must match a fixed-length pattern (meaning no open-ended wildcards such as .*).
I thought I'd be the hero and show you the easy fix for your issue, which would be something to the effect of:
'/(?<!\<a.?>)[list|of|keywords](?!\<\/a>)/'
Except you can't do that because the lookbehind in this case has that wildcard. Without it, you end up with a super greedy expression.
So my proposed alternative is to use regex to find all link elements, then str_replace to swap them out with placeholders, and then restore the original links in place of the placeholders at the end.
Here's how I did it:
$text='This text contains many keywords, but also formated keywords.';
$keywords = array('text', 'formatted', 'keywords');
// Build an alternation group such as (?:text|formatted|keywords)
$keyword_list_pattern = '(?:' . implode('|', array_map('preg_quote', $keywords)) . ')';
// First, get all matching keywords that are inside link elements
preg_match_all('/<a.*?' . $keyword_list_pattern . '.*?<\/a>/', $text, $links);
$links = array_unique($links[0]); // Cleaning up array for next step.
$restore_links = array(); // Stays empty when the text contains no links
// Second, swap out all matches with a placeholder, and build restore array:
foreach($links as $count => $link) {
$link_key = "xxx_{$count}_xxx";
$restore_links[$link_key] = $link;
$text = str_replace($link, $link_key, $text);
}
// Third, we build a nice replacement array for the keywords:
foreach($keywords as $keyword) {
$keyword_links[$keyword] = "<a href='#$keyword'>$keyword</a>";
}
// Merge the restore links to the bottom of the keyword links for one mass replacement:
$keyword_links = array_merge($keyword_links, $restore_links);
$text = str_replace(array_keys($keyword_links), $keyword_links, $text);
echo $text;
You can change your RegEx so that it only targets keywords with a space in front, since the formatted keywords do not contain a space. Here is an example.
$text = preg_replace('/ keywords/i',' keywords',$text);

Complex(?) Name Matching Regex for vBulletin

I'm creating some custom BBcode for a forum. I'm trying to get the regular expression right, but it has been eluding me for weeks. Any expert advice is welcome.
Sample input (a very basic example):
[quote=Bob]I like Candace. She is nice.[/quote]
Ashley Ryan Thomas
Essentially, I want to encase any names (from a specified list) in [user][/user] BBcode... except, of course, those being quoted, because doing that causes some terrible parsing errors.
The desired output:
[quote=Bob]I like [user]Candace[/user]. She is nice.[/quote]
[user]Ashley[/user] [user]Ryan[/user] [user]Thomas[/user]
My current code:
$searchArray = array(
'/(?i)([^=]|\b|\s|\/|\r|\n|\t|^)(Ashley|Bob|Candace|Ryan|Thomas)(\s|\r|\n|\t|,|\.(\b|\s|\.|$)|;|:|\'|"|-|!|\?|\)|\/|\[|$)/'
);
$replaceArray = array(
"\\1[user]\\2[/user]\\3"
);
$text = preg_replace($searchArray, $replaceArray, $input);
What it currently produces:
[quote=Bob]I like [user]Candace[/user]. She is nice.[/quote]
[user]Ashley[/user] Ryan [user]Thomas[/user]
Notice that Ryan isn't encapsulated by [user] tags. Also note that much of the additional regex matching characters were added on an as-needed basis as they cropped up on the forums, so removing them will simply make it fail to match in other situations (i.e. a no-no). Unless, of course, you spot a glaring error in the regex itself, in which case please do point it out.
Really, though, any assistance would be greatly appreciated! Thank you.
It's quite simply that you are matching delimiters (\s|\r|...) at both ends of the searched names. The poor Ashley and Ryan share a single space character in your test string. But the regex can only match it once - as left or right border.
The solution here is to use assertions. Enclose the left list in (?<= ) and the right in (?= ) so they become:
(?<=[^=]|\b|\s|\/|^)
(?=\s|,|\.(\b|\s|\.|$)|;|:|\'|"|-|!|\?|\)|\/|\[|$)
Btw, \s already contains \r|\n|\t so you can probably remove that.
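To make the shared-delimiter problem concrete, here is a cut-down sketch that compares a consuming pattern with the assertion-based one. The delimiter lists are reduced to whitespace and the string boundaries to keep it readable, so this is an illustration and not a drop-in replacement for the full regex.

```php
<?php
$input = 'Ashley Ryan Thomas';
$names = 'Ashley|Ryan|Thomas';

// Consuming version: the space after Ashley is eaten by the first match,
// so it is no longer available as the left delimiter for Ryan.
$consuming = preg_replace(
    '/(\s|^)(' . $names . ')(\s|$)/',
    '$1[user]$2[/user]$3',
    $input
);

// Assertion version: the delimiters are only checked, never consumed.
$asserting = preg_replace(
    '/(?<=\s|^)(' . $names . ')(?=\s|$)/',
    '[user]$1[/user]',
    $input
);

echo $consuming, "\n";
echo $asserting, "\n";
```

The first result skips Ryan exactly as in the question, while the second wraps all three names.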
Since you don't really need to match the spaces on either side (just make sure they're there, right?) try replacing your search expression with this:
$searchArray = array(
'/\b(Ashley|Bob|Candace|Ryan|Thomas)\b/i'
);
$replaceArray = array(
'[user]$1[/user]'
);
$text = preg_replace($searchArray, $replaceArray, $input);
