Use regex to replace text while stripping newlines and quotes

This is now mostly academic as I can achieve the same result other ways, but… it’s been bugging me, and I’m sure it is possible somehow with regex.
I want to use PHP’s preg_replace to replace content thus:
Content: “String <tag>This is some content, which contains newlines and quotation marks.</tag> and other unrelated content”.
Regex: /<tag>(.*)<\/tag>/sU
Replace: “String of other content, including matched pattern $1”
However the problem is, I want to strip out any newlines and/or quotation marks found between the elements. What regex would allow me to do this?

PHP's preg_replace() processes the subject in a single pass. You can actually specify an array of patterns and replacements, however only one will match on each part of the subject string. There certainly is no solution using a single regex, since this problem is not amongst the regular languages. Theoretical computer science teaches that you need a stateful automaton for such a task. A regex is too primitive.

Not easy, but possible.
Try this PHP code:
function myFn($a, $b, $c) {
$b = preg_replace("!(?:\\\'|[\"\n\r])!", '', $b);
return "BEGIN " . $b . " END";
}
$s = "abc <tag>def \n ghi 'jkl' mno \"pqr\" stu</tag> vwx";
$s = preg_replace('!(<tag>)(.*?)(</tag>)!ise', 'myFn("$1", "$2", "$3")', $s);
print $s;
Output:
abc BEGIN def ghi jkl mno pqr stu END vwx
Test this code here.
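The e modifier used above was deprecated in PHP 5.5 and removed in PHP 7.0, so on current versions the same idea is usually written with preg_replace_callback(). What follows is only a minimal sketch, assuming the same <tag> markers and the BEGIN/END wrapper from the answer; the function name stripInsideTag is made up for illustration:

```php
<?php
// Hypothetical helper (not from the original answer): strips quotes and
// line breaks found between <tag> and </tag>.
function stripInsideTag(string $subject): string
{
    return preg_replace_callback(
        '!<tag>(.*?)</tag>!is',              // s: the dot also matches newlines
        function (array $m): string {
            // $m[1] holds the text between the tags.
            $inner = preg_replace('/[\'"\r\n]/', '', $m[1]);
            return 'BEGIN ' . $inner . ' END';
        },
        $subject
    );
}

$s = "abc <tag>def \n ghi 'jkl' mno \"pqr\" stu</tag> vwx";
echo stripInsideTag($s), "\n";
```

Because the callback receives the raw match, there is no need to strip the backslash-escaped quotes that the e modifier introduced.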

As arkascha pointed out, this is not really a problem that can easily be done in one pass.
It could be done in one step in Perl:
use strict;
use warnings;
my $string = "blah <tag> foo \"bar \n </tag> baz";
$string =~ s/(?<=\<tag\>)([^<]+)(?=\<\/tag\>)/$_=$1;s|[\n\"]||gs;$_/ges;
print $string;
This takes advantage of the fact that Perl lets you use code to generate the replacement string.
I don't know whether something similar could be done in PHP. This is not a good real-world code design anyway. But it is fun.

Related

Re-ordering Strings in PHP

I have a document full of hex colours, as shown below.
#123 is a nice colour but #321 is also fine. However, #fe4918 isn't a bad either.
I'd like to rotate them round, so that #123 would become #231, effectively changing the colour scheme. #fe4918 would become #4918fe.
I know that with regular expressions, one can select the hash tags but not much else.

You could use a regex to do it...
preg_replace('/#([\da-f])([\da-f])([\da-f])(?:([\da-f])([\da-f])([\da-f]))?/i', '#$2$5$3$6$1$4', $str)
CodePad
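It works by matching three or six case-insensitive hexadecimal digits and then reordering them using the matched groups. Note that it rotates three-digit colours as requested (#123 becomes #231), but for six-digit colours it interleaves single digits (#fe4918 becomes #e148f9) instead of rotating the pairs.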
Alternatively you could match it with a simple regex and callback with preg_replace_callback() and use strrev(), but I think the above example is clear enough.

You can use the following to match:
#([\da-f]{2})([\da-f]{2})([\da-f]{2})|#([\da-f])([\da-f])([\da-f])
And replace with:
#\2\5\3\6\1\4
See RegEX DEMO

You can use a branch reset group to handle the two cases with the same capture group numbers:
$str = preg_replace('~#(?|([a-f\d]{2})([a-f\d]{4})|([a-f\d])([a-f\d]{2}))~i',
'#$2$1', $str);
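To see the branch reset group at work, here is a small sketch that wraps the pattern above in a function and runs it on the sample sentence from the question. The function name rotateHexBranchReset is an assumption added for illustration:

```php
<?php
// Hypothetical wrapper around the branch-reset pattern from the answer above.
function rotateHexBranchReset(string $str): string
{
    // (?| ... ) restarts group numbering in each alternative, so both the
    // six-digit and the three-digit case fill groups 1 and 2.
    return preg_replace(
        '~#(?|([a-f\d]{2})([a-f\d]{4})|([a-f\d])([a-f\d]{2}))~i',
        '#$2$1',
        $str
    );
}

$text = "#123 is a nice colour but #321 is also fine. However, #fe4918 isn't a bad either.";
echo rotateHexBranchReset($text), "\n";
// #231 is a nice colour but #213 is also fine. However, #4918fe isn't a bad either.
```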

You can use a combination of a regex and strrev():
#([a-f0-9]+)
In PHP this would be:
<?php
$string = "#123 is a nice colour but #321 is also fine. However, #fe4918 isn't a bad either.";
$regex = '~#([a-f0-9]+)~';
$string = preg_replace_callback(
$regex,
function($match) {
return '#'.strrev($match[1]);
},
$string
);
echo $string;
// #321 is a nice colour but #123 is also fine. However, #8194ef isn't a bad either.
?>
You can do this in regex alone, but the above logic seems very clear (and maintainable in a few months as well).
See a demo on ideone.com.
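Note that strrev() reverses the digits, whereas the question asks for a rotation. If a rotation is what is wanted, the same callback approach can move the first colour component to the end instead. A rough sketch, assuming only three-digit and six-digit colours occur; the function name rotateHexCallback is made up here:

```php
<?php
// Hypothetical helper: rotates #rgb to #gbr and #rrggbb to #ggbbrr.
function rotateHexCallback(string $text): string
{
    return preg_replace_callback(
        '~#([a-f0-9]{6}|[a-f0-9]{3})\b~i',
        function (array $m): string {
            $hex = $m[1];
            $size = intdiv(strlen($hex), 3);   // 1 for #rgb, 2 for #rrggbb
            // Move the first component to the end.
            return '#' . substr($hex, $size) . substr($hex, 0, $size);
        },
        $text
    );
}

$string = "#123 is a nice colour but #321 is also fine. However, #fe4918 isn't a bad either.";
echo rotateHexCallback($string), "\n";
// #231 is a nice colour but #213 is also fine. However, #4918fe isn't a bad either.
```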

Make me understand preg_replace

I've been looking all over the internet for some useful information and I think I found too much. I'm trying to understand regular expressions but don't get it.
Let's say, for instance, $data="A bunch of text [link=123] another bunch of text.", and it should get replaced with "< a href=\"123.html\">123< /a>".
I've been trying around a lot with code similar to this:
$find = "/[link=[([0-9])]/";
$replace = "< a href=\"$1\">$1< /a>";
echo preg_replace ($find, $replace, $data);
but the output is always the same as the original $data.
I think I have to see something relevant to my problem to understand the basics.

Remove the extra [] around the (), and add + after the [0-9] to quantify it. Also, escape the [] that make up the tag itself.
$find = "/\[link=(\d+)\]/"; // "\d" is equivalent to "[0-9]"
$replace = "<a href=\"$1.html\">$1</a>";
echo preg_replace($find,$replace,$data);

The regex would be \[link=([\d]+)\]
A good source for a quick overview of regular expressions can be found here: http://www.regular-expressions.info/
If you are really interested in the power of regular expressions, you should buy this book: Mastering Regular Expressions
A good program to test your regex on a Windows client is: RegEx-Trainer

You are missing the + quantifier, so your pattern only matches if a single digit follows link=.
There is also an extra pair of [..], so the outer [...] is treated as a character class.
You also forgot to escape the opening [ and the closing ].
Solution:
$find = "/\[link=([0-9]+)\]/";

<?php
$data= "A bunch of text [link=123] another bunch of text.";
$find = '/\[link=([0-9]+?)\]/';
echo preg_replace($find, '<a href="$1.html">$1</a>', $data);
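Putting the fixes from the answers together, a complete run might look like the sketch below. The replacement string is reconstructed from the output the question asks for, so treat it as an assumption rather than the posters' exact code:

```php
<?php
$data = 'A bunch of text [link=123] another bunch of text.';

// \[ and \] match the literal brackets; (\d+) captures one or more digits.
$find = '/\[link=(\d+)\]/';

// $1 refers to the captured digits (assumed target format: "<id>.html").
$replace = '<a href="$1.html">$1</a>';

$result = preg_replace($find, $replace, $data);
echo $result, "\n";
// A bunch of text <a href="123.html">123</a> another bunch of text.
```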

Regex pattern matching literal repeated \n

Given a literal string such as:
Hello\n\n\n\n\n\n\n\n\n\n\n\nWorld
I would like to reduce the repeated \n's to a single \n.
I'm using PHP, and have been playing around with a bunch of different regex patterns. So here's a simple example of the code:
$testRegex = '/(\\n){2,}/';
$test = 'Hello\n\n\n\n\n\n\n\n\nWorld';
$test2 = preg_replace($testRegex ,'\n',$test);
echo "<hr/>test regex<hr/>".$test2;
I'm new to PHP, not that new to regex, but it seems '\n' conforms to special rules. I'm still trying to nail those down.
Edit: I've placed the literal code I have in my php file here, if I do str_replace() I can get good things to happen, but that's not a complete solution obviously.

To match a literal \n with regex, your string literal needs four backslashes to produce a string with two backslashes that’s interpreted by the regex engine as an escape for one backslash.
$testRegex = '/(\\\\n){2,}/';
$test = 'Hello\n\n\n\n\n\n\n\n\n\n\n\nWorld';
$test2 = preg_replace($testRegex, '\n', $test);
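The two layers of escaping are easier to see side by side. The sketch below contrasts a literal backslash followed by n (single-quoted PHP string) with a real newline character (double-quoted string); the variable names are illustrative only:

```php
<?php
// Single quotes: the string contains the two characters "\" and "n", repeated.
$literal = 'Hello\n\n\n\nWorld';
// '\\\\' in PHP source is the two-character string \\, which the regex
// engine reads as one escaped backslash.
$literalCollapsed = preg_replace('/(\\\\n){2,}/', '\n', $literal);
echo $literalCollapsed, "\n";   // Hello\nWorld  (backslash and n, not a line break)

// Double quotes: the string contains real newline characters.
$real = "Hello\n\n\n\nWorld";
// Here the regex engine itself interprets \n as a newline.
$realCollapsed = preg_replace('/\n{2,}/', "\n", $real);
echo $realCollapsed, "\n";      // Hello and World on two separate lines
```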

Perhaps you need to double up the escape in the regular expression?
$pattern = "/\\n+/";
$awesome_string = preg_replace($pattern, "\n", $string);

Edit: Just read your comment on the accepted answer. Doesn't apply, but is still useful.
If you're intending on expanding this logic to include other forms of white-space too:
$output = preg_replace('%(\s)+%', '$1', $input);
Reduces each run of white-space characters to a single white-space character (the last one in the run).

It does conform to special rules. You can add the multiline modifier, m, but note that it only changes how ^ and $ match, so the pattern behaves the same without it. Your pattern would look like
$pattern = '/(\n)+/m';
which should provide you with the matches. See the doc for all modifiers and their detailed meaning.
Since you're trying to reduce all newlines to one, the pattern above should work with the rest of your code. Good luck!

Try this regular expression:
/[\n]*/

PHP preg_replace to turn **xyz** to <b>xyz</b>

I decided to, for fun, make something similar to markdown. With my small experiences with Regular Expressions in the past, I know how extremely powerful they are, so they will be what I need.
So, if I have this string:
Hello **bold** world
How can I use preg_replace to convert that to:
Hello <b>bold</b> world
I assume something like this?
$input = "Hello **bold** world";
$output = preg_replace("/(\*\*).*?(\*\*/)", "<b></b>", $input);

Close:
$input = "Hello **bold** world";
$output = preg_replace("/\*\*(.*?)\*\*/", "<b>$1</b>", $input);
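A quick way to check that the lazy quantifier keeps separate bold spans apart is to run the pattern on a string with two of them. A small sketch; the function name boldify is an assumption for illustration:

```php
<?php
// Hypothetical helper wrapping the pattern from the answer above.
function boldify(string $input): string
{
    // \*\* matches the literal asterisks; (.*?) lazily captures the text
    // between them, so each **...** pair is handled on its own.
    return preg_replace('/\*\*(.*?)\*\*/', '<b>$1</b>', $input);
}

echo boldify('Hello **bold** world and **more** text'), "\n";
// Hello <b>bold</b> world and <b>more</b> text
```

With a greedy (.*) instead, the first ** would pair with the last one and everything in between would end up inside a single <b> element.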

I believe there is a PHP package for rendering Markdown. Rather than rolling your own, try using an existing set of code that's been written and tested.

Mmm I guess this could work
$output = preg_replace('/\*\*(.*?)\*\*/', '<b>$1</b>', $input);
You find all sequences **something** and then you substitute the entire sequence found with the bold tag and inside it (the $1) the first captured group (the brackets in the expression).

$output = preg_replace("/\*\*(.*?)\*\*/", "<b>$1</b>", $input);

Regular expression anchor text for a link

I am trying to pull the anchor text from a link that is formatted this way:
<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>
I want only the anchor text for the link : "i_want_this"
"variable_text" varies according to the filename so I need to ignore that.
I am using this regex:
<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>
This is matching of course the complete link.

PHP uses PCRE (Perl Compatible Regular Expressions), which is pretty close to Perl's regex. If you want to know a lot about regex, visit perlretut.org. Also, look into regex generators like exspresso.
For your use, know that regex is greedy. That means that when you specify that you want something, followed by anything (any repetitions), followed by something, it will match as much as possible and keep going until the last occurrence of that second something.
To be clearer, what you want is this:
<a href="
any character, any number of times (regex = .* )
">
any character, any number of times (regex = .* )
</a>
Beyond that, you want to capture the second group of "any character, any number of times". You can do that using what are called capture groups (anything inside parentheses is captured as a group for later reference, also called a backreference).
I would also look into named subpatterns: with those, you can reference your match with a human-readable string rather than an array index. The syntax for those in PHP is (?P<name>pattern), where name is the name you want and pattern is the actual regex. I'll use that below.
So all that being said, here's the "lazy web" for your regex:
<?php
$str = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
$regex = '/(<a href\=".*">)(?P<target>.*)(<\/a>)/';
preg_match($regex, $str, $matches);
print $matches['target'];
?>
//This should output "i_want_this"
Oh, and one final thought. Depending on what you are doing exactly, you may want to look into SimpleXML instead of using regex for this. This would probably require that the tags we see are just snippets of a larger whole, as SimpleXML requires well-formed XML (or XHTML).

I'm sure someone will probably have a more elegant solution, but I think this will do what you want done.
Where:
$subject = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
Option 1:
$pattern1 = '/(<a href=")(.*)(">)(.*)(<\/a>)/i';
preg_match($pattern1, $subject, $matches1);
print($matches1[4]);
Option 2:
$pattern2 = '(<a href=")(.*)(">)(.*)(</a>)';
ereg($pattern2, $subject, $matches2); // ereg() was removed in PHP 7, so this only runs on older versions
print($matches2[4]);

Do not use regex to parse HTML. Use a DOM parser. Specify the language you're using, too.
Since it's in a captured group and since you claim it's matching, you should be able to reference it through $1 or \1 depending on the language.
$blah = preg_match( $pattern, $subject, $matches );
print_r($matches);

The thing to remember is that regexes return everything you searched for if they match. You need to specify that you only care about the part you've surrounded in parentheses (the anchor text). I'm not sure what language you're using the regex in, but here's an example in Ruby:
string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)
puts data # => outputs '<a href="/en/browse/file/variable_text">i_want_this</a>'
If you specify what you want in parentheses, you can reference it:
string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)[1]
puts data # => outputs 'i_want_this'
Perl will have you use $1 instead of [1] like this:
$string = '<a href="/en/browse/file/variable_text">i_want_this</a>';
$string =~ m/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/;
$data = $1;
print $data . "\n";
Hope that helps.

I'm not 100% sure if I understand what you want. This will match the content between the anchor tags. The URL must start with /en/browse/file/, but may end with anything.
#<a href="/en/browse/file/[^"]*">(.*?)</a>#
I used # as a delimiter as it made it clearer. It'll also help if you put them in single quotes instead of double quotes so you don't have to escape anything at all.
If you want to limit to numbers instead, you can use:
#<a href="/en/browse/file/\d+">(.*?)</a>#
If it should have just 5 numbers:
#<a href="/en/browse/file/\d{5}">(.*?)</a>#
If it should have between 3 and 6 numbers:
#<a href="/en/browse/file/\d{3,6}">(.*?)</a>#
If it should have more than 2 numbers:
#<a href="/en/browse/file/\d{3,}">(.*?)</a>#

This should work:
<a href="[^"]*">([^<]*)
This says: take EVERYTHING you find until you meet "
[^"]*
Same again! Take everything until you meet <
[^<]*
The parentheses around [^<]*
([^<]*)
group it, so you can collect that data in PHP! If you look at preg_match in the PHP manual, you will see many fine examples there!
Good luck!
And for your concrete example:
<a href="/en/browse/file/variable_text">([^<]*)
I use
[^<]*
because in some examples...
.*?
can be extremely slow! You shouldn't use that if you can use
[^<]*
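As a rough illustration of the negated character classes in PHP, the sketch below pulls the anchor text out with preg_match(). The sample HTML is reconstructed from the regex in the question, and the function name extractAnchorText is made up, so treat both as assumptions:

```php
<?php
// Hypothetical helper: returns the anchor text of the first link whose
// href starts with /en/browse/file/, or null if there is none.
function extractAnchorText(string $html): ?string
{
    // [^"]* stops at the closing quote of href; [^<]* stops at the next tag.
    if (preg_match('~<a href="/en/browse/file/[^"]*">([^<]*)</a>~', $html, $m)) {
        return $m[1];
    }
    return null;
}

$html = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
echo extractAnchorText($html), "\n";
// i_want_this
```

Using ~ as the delimiter avoids having to escape the slashes in the path.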

You should use the tool Expresso for creating regular expression... Pretty handy..
http://www.ultrapico.com/Expresso.htm

We Keep Coding
