I keep running into regular expressions but have never really tried to understand or use them. My current project now forces me to use one, so I need someone who can give me the correct regex to replace a simple string. Basically, I'm replacing a small part of a longtext value retrieved from a database. The longtext is one or more paragraphs containing anchors in this form:
Example
So the question is: how do I replace the value of the title attribute? Please note that the text may contain two or more anchor tags, so I'd like to be able to target each of them specifically.
EDIT:
I'd like to use pure PHP on this. I think I know how to do this using js/jquery.
$doc = new DOMDocument();
$doc->loadHTML('Example');
$anchors = $doc->getElementsByTagName('a');
foreach ($anchors as $anchor)
{
$anchor->setAttribute('target', '_blank');
}
$html = $doc->saveHTML();
echo $html;
Description
You could do this with the following regex
(<a\b[^>]*?\btitle=(['"]))(.*?)\2
Summary
( start capture group 1
<a\b consume open angle bracket and an a followed by a word break
[^>]*? consume all non close angle bracket characters up to... this forces the regex to stay inside the anchor tag
\btitle= consume a word break and title=, the break helps do some additional checking
(['"]) capture group 2, ensure that an opening single or double quote is being used
) close capture group 1
(.*?) start capture group 3, and non greedy consume to collect all text inside the quotes
\2 reference back to the string from capture group 2: if you used a single quote to open the value, then a single quote will be required to close the value. The same applies if you used a double quote.
In the replace command I'm simply replacing the entire found string from <a to the close quote with: group capture 1, followed by the desired text NewValue followed by the close quote from group capture 2.
PHP example
<?php
$sourcestring="Example";
echo preg_replace('/(<a\b[^>]*?\btitle=([\'"]))(.*?)\2/im','\1NewValue\2',$sourcestring);
?>
$sourcestring after replacement:
Example
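For comparison, here is a minimal Python sketch of the same replacement. It assumes the anchors carry a quoted title attribute; the function name replace_title, the which parameter and the sample string are illustrative and not part of the original answer. Using a callback makes it possible to target one anchor by position, which the question asked for:

```python
import re

# Same pattern as above: group 1 = everything up to and including the opening
# quote, group 2 = the quote character, group 3 = the old title value.
TITLE_RE = re.compile(r"""(<a\b[^>]*?\btitle=(['"]))(.*?)\2""", re.IGNORECASE)


def replace_title(html, new_value, which=None):
    """Replace title values; `which` (0-based) targets one anchor, None means all."""
    counter = -1

    def swap(match):
        nonlocal counter
        counter += 1
        if which is not None and counter != which:
            return match.group(0)  # leave this anchor untouched
        return match.group(1) + new_value + match.group(2)

    return TITLE_RE.sub(swap, html)


# Hypothetical sample text with two anchors.
sample = '<a href="/x" title="Old">Example</a> and <a href="/y" title=\'Other\'>Two</a>'
print(replace_title(sample, "NewValue", which=1))
```

Building the replacement in a callback also avoids the ambiguity of a replacement string such as \1 followed by a value that starts with a digit.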
Disclaimer
Since parsing text via a html parser is not the desired solution, I'll skip the usual soap box disclaimer about parsing html with Regex.
$string=preg_replace(
'#<a (.*)title="(.*)"([^>]*)>(.*)</a>#iU',
'<a $1title="'.$replacement.'"$3>$4</a>',
$string);
Note that the i at the end of the expression makes it case insensitive, and the U makes it ungreedy.
Related
I am using the following regular expression to remove HTML tags from a string. It works, except that it leaves the closing tag behind. If I attempt to remove the tag around blah, it leaves the </a>.
I do not know Regular Expression syntax at all and fumbled through this. Can someone with RegEx knowledge please provide me with a pattern that will work.
Here is my code:
string sPattern = @"<\/?!?(img|a)[^>]*>";
Regex rgx = new Regex(sPattern);
Match m = rgx.Match(sSummary);
string sResult = "";
if (m.Success)
sResult = rgx.Replace(sSummary, "", 1);
I am looking to remove the first occurrence of the <a> and <img> tags.
Using a regular expression to parse HTML is fraught with pitfalls. HTML is not a regular language and hence can't be 100% correctly parsed with a regex. This is just one of many problems you will run into. The best approach is to use an HTML / XML parser to do this for you.
Here is a link to a blog post I wrote awhile back which goes into more details about this problem.
http://blogs.msdn.com/b/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx
That being said, here's a solution that should fix this particular problem. It in no way is a perfect solution though.
var pattern = @"<(img|a)[^>]*>(?<content>[^<]*)<";
var regex = new Regex(pattern);
var m = regex.Match(sSummary);
if ( m.Success ) {
    // The named group holds the text between the opening tag and the next "<".
    sResult = m.Groups["content"].Value;
}
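As a rough illustration of what "remove the first tag, including its closing tag" can look like, here is a small Python sketch. It is not the answer's method: it assumes the closing tag (if any) is the first matching one after the opening tag, and the name remove_first_tag is made up for this example.

```python
import re

# Opening <a ...> or <img ...> tag; \b keeps it from matching e.g. <abbr>.
OPEN_TAG = re.compile(r"<(img|a)\b[^>]*>", re.IGNORECASE)


def remove_first_tag(summary):
    """Drop the first <a>/<img> opening tag and the first matching closing tag after it."""
    match = OPEN_TAG.search(summary)
    if match is None:
        return summary
    head, tail = summary[:match.start()], summary[match.end():]
    closing = re.compile(r"</%s\s*>" % re.escape(match.group(1)), re.IGNORECASE)
    # count=1: only the first closing tag after the opening tag is removed.
    return head + closing.sub("", tail, count=1)


print(remove_first_tag('see <a href="blah">blah</a> now'))
```

Like any regex approach, this breaks on nested or malformed markup; it only covers the simple case described in the question.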
To turn this:
'<td>mamma</td><td><strong>papa</strong></td>'
into this:
'mamma papa'
You need to replace the tags with spaces:
.replace(/<[^>]*>/g, ' ')
and reduce any duplicate spaces into single spaces:
.replace(/\s{2,}/g, ' ')
then trim away leading and trailing spaces with:
.trim();
This means that your tag-removal function looks like this:
function removeTags(string){
return string.replace(/<[^>]*>/g, ' ')
.replace(/\s{2,}/g, ' ')
.trim();
}
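The same three steps (replace tags with spaces, collapse runs of whitespace, trim) can be sketched in Python as follows; remove_tags is just an illustrative name:

```python
import re


def remove_tags(text):
    """Replace every tag with a space, collapse whitespace runs, then trim."""
    without_tags = re.sub(r"<[^>]*>", " ", text)
    collapsed = re.sub(r"\s{2,}", " ", without_tags)
    return collapsed.strip()


print(remove_tags("<td>mamma</td><td><strong>papa</strong></td>"))
```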
To also remove the spaces between tags, you can use the following method, which combines a regex with a trim of the spaces at the start and end of the input HTML:
public static string StripHtml(string inputHTML)
{
const string HTML_MARKUP_REGEX_PATTERN = @"<[^>]+>\s+(?=<)|<[^>]+>";
inputHTML = WebUtility.HtmlDecode(inputHTML).Trim();
string noHTML = Regex.Replace(inputHTML, HTML_MARKUP_REGEX_PATTERN, string.Empty);
return noHTML;
}
So for the following input:
<p> <strong> <em><span style="text-decoration:underline;background-color:#cc6600;"></span><span style="text-decoration:underline;background-color:#cc6600;color:#663333;"><del> test text </del></span></em></strong></p><p><strong><span style="background-color:#999900;"> test 1 </span></strong></p><p><strong><em><span style="background-color:#333366;"> test 2 </span></em></strong></p><p><strong><em><span style="text-decoration:underline;background-color:#006600;"> test 3 </span></em></strong></p>
The output will be only the text without spaces between html tags or space before or after html:
" test text test 1 test 2 test 3 ".
Please notice that the spaces before test text are from the <del> test text </del> html and the space after test 3 is from the <em><span style="text-decoration:underline;background-color:#006600;"> test 3 </span></em></strong></p> html.
Strip off HTML Elements (with/without attributes)
/<\/?[\w\s]*>|<.+[\W]>/g
This will strip off all HTML elements and leave behind the text. This works well even for malformed HTML elements (i.e. elements that are missing closing tags)
Reference and example (Ex.10)
So the HTML parser everyone's talking about is Html Agility Pack.
If it is clean XHTML, you can also use System.Xml.Linq.XDocument or System.Xml.XmlDocument.
You can use:
Regex.Replace(source, "<[^>]*>", string.Empty);
If you need to find only the opening tags you can use the following regex, which will capture the tag type as $1 (a or img) and the content (including closing tag if there is one) as $2:
(?:<(a|img)(?:\s[^>]*)?>)((?:(?!<\1)[\s\S])*)
In case you have also closing tag you should use the following regex, which will capture the tag type as $1 (a or img) and the content as $2:
(?:<(a|img)(?:\s[^>]*)?>)\s*((?:(?!<\1)[\s\S])*)\s*(?:<\/\1>)
Basically you just need to use replace function on one of above regex, and return $2 in order to get what you wanted.
A short explanation of the regex:
( ) - is used for capturing whatever matches the regex inside the brackets. The order of the capturing is the order of: $1, $2 etc.
?: - is used after an opening bracket "(" for not capturing the content inside the brackets.
\1 - is copying capture number 1, which is the tag type. I had to capture the tag type so closing tag will be consistent to the opening one and not something like: <img src=""> </a>.
\s - is white space, so after opening tag <img there will be at least 1 white space in case there are attributes (so it won't match <imgs> for example).
[^>]* - is looking for anything but the chars inside, which in this case is >, and * means for unlimited times.
?! - is looking for anything but the string inside, kinda similar to [^>] just for string instead of single chars.
[\s\S] - is used almost like . but allow any whitespaces (which will also match in case there are new lines between the tags). If you are using regex "s" flag, then you can use . instead.
Example of using with closing tag:
https://regex101.com/r/MGmzrh/1
Example of using without closing tag:
https://regex101.com/r/MGmzrh/2
Regex101 also has some explanation of what I did :)
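A small Python sketch of the closing-tag variant, for readers who want to try it outside regex101. The function name unwrap and the sample inputs are assumptions made for this example; the pattern itself is the one given above, and the replacement keeps only group 2.

```python
import re

# Group 1 = tag name (a or img), group 2 = content between the tags.
# (?!<\1) stops the content from running into another opening tag of the same type.
PAIR_RE = re.compile(r"(?:<(a|img)(?:\s[^>]*)?>)\s*((?:(?!<\1)[\s\S])*)\s*(?:<\/\1>)")


def unwrap(html):
    """Replace each matched <a>...</a> (or <img>...</img>) pair with its content."""
    return PAIR_RE.sub(r"\2", html)


print(unwrap('Click <a href="x">here</a> now'))
```

Note that group 2 is greedy, so trailing whitespace before the closing tag stays in the captured content.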
You can use already existing libraries to strip off the html tags. One good one being Chilkat C# Library.
If all you're trying to do is remove the tags (and not figure out where the closing tag is), I'm really not sure why people are so fraught about it.
This Regex seems to handle anything I can throw at it:
<([\w\-/]+)( +[\w\-]+(=(('[^']*')|("[^"]*")))?)* *>
To break it down:
<([\w\-/]+) - match the beginning of the opening or closing tag. if you want to handle invalid stuff, you can add more here
( +[\w\-]+(=(('[^']*')|("[^"]*")))?)* - this bit matches attributes [0, N] times (* at the end)
+[\w\-]+ - is space(s) followed by an attribute name
(=(('[^']*')|("[^"]*")))? - not all attributes have assignment (?)
('[^']*')|("[^"]*") - of the attributes that do have assignment, the value is a string with either single or double quotes. It's not allowed to skip over a closing quote to make things work
*> - the whole thing ends with any number of spaces, then the closing bracket
Obviously this will mess up if someone throws super invalid html at it, but it works for anything valid I've come up with yet. Test it out here:
const regex = /<([\w\-/]+)( +[\w\-]+(=(('[^']*')|("[^"]*")))?)* *>/g;
const byId = (id) => document.getElementById(id);
function replace() {
console.log(byId("In").value)
byId("Out").innerText = byId("In").value.replace(regex, "CUT");
}
Write your html here: <br>
<textarea id="In" rows="8" cols="50"></textarea><br>
<button onclick="replace()">Replace all tags with "CUT"</button><br>
<br>
Output:
<div id="Out"></div>
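If you would rather try the pattern without a browser, here is a Python sketch of the same test. The name cut_tags and the sample strings are illustrative; the regex is copied from above.

```python
import re

# Tag name (or /name), then zero or more attributes with optional quoted
# values, then optional spaces and the closing bracket.
TAG_RE = re.compile(r"""<([\w\-/]+)( +[\w\-]+(=(('[^']*')|("[^"]*")))?)* *>""")


def cut_tags(html):
    """Replace every matched tag with the literal text CUT."""
    return TAG_RE.sub("CUT", html)


print(cut_tags('<p class="x">hi</p>'))
# Because attribute values are matched as quoted strings, a ">" inside a
# value does not end the tag early.
print(cut_tags('<a title="x > y">t</a>'))
```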
Remove image from the string, using a regular expression in c# (image search performed by image id)
string PRQ = "<td valign=\"top\" style=\"width: 400px;\" align=\"left\"><img id=\"llgo\" src=\"http://test.Logo.png\" alt=\"logo\"></td>";
var regex = new Regex("(<img(.+?)id=\"llgo\"(.+?))src=\"([^\"]+)\"");
PRQ = regex.Replace(PRQ, match => match.Groups[1].Value + "");
Why not try a reluctant quantifier?
htmlString.replaceAll("<\\S*?>", "")
(It's Java but the main thing is to show the idea)
Simple way,
String html = "<a>Rakes</a> <p>paroladasdsadsa</p> My Name Rakes";
html = html.replaceAll("(<[\\w]+>)(.+?)(</[\\w]+>)", "$2");
System.out.println(html);
Here is the extension method I've been using for quite some time.
public static class StringExtensions
{
public static string StripHTML(this string htmlString, string htmlPlaceHolder) {
const string pattern = @"<.*?>";
string sOut = Regex.Replace(htmlString, pattern, htmlPlaceHolder, RegexOptions.Singleline);
sOut = sOut.Replace("&nbsp;", String.Empty);
sOut = sOut.Replace("&amp;", "&");
sOut = sOut.Replace("&gt;", ">");
sOut = sOut.Replace("&lt;", "<");
return sOut;
}
}
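A comparable sketch in Python, where the entity handling can be delegated to the standard library's html.unescape instead of a chain of string replacements. The function name strip_html and the default placeholder are assumptions; dropping &nbsp; before decoding is a choice made here to keep non-breaking spaces out of the result.

```python
import html
import re


def strip_html(html_string, placeholder=""):
    """Replace tags with `placeholder`, drop &nbsp;, then decode remaining entities."""
    # re.DOTALL plays the role of RegexOptions.Singleline: "." also matches newlines.
    text = re.sub(r"<.*?>", placeholder, html_string, flags=re.DOTALL)
    text = text.replace("&nbsp;", "")
    return html.unescape(text)


print(strip_html("<p>Tom &amp; Jerry&nbsp;&lt;3</p>"))
```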
This piece of code could help you out easily removing any html tags:
import re
string = '<a href="blah">blah</a>'  # sample input
replaced_string = re.sub(r'<a.*href="blah">.*</a>', '', string)  # remember, sub takes 3 arguments
Output is an empty string.
Here's an extension method I created using a simple regular expression to remove HTML tags from a string:
/// <summary>
/// Converts an Html string to plain text, and replaces all br tags with line breaks.
/// </summary>
/// <returns></returns>
/// <remarks></remarks>
// Must be declared inside a static class to work as an extension method.
public static string ToPlainText(this string s)
{
s = s.Replace("<br>", "\r\n");
s = s.Replace("<br />", "\r\n");
s = s.Replace("<br/>", "\r\n");
s = Regex.Replace(s, "<[^>]*>", string.Empty);
return s;
}
Hope that helps.
Select everything except what is matched by this:
(?:<span.*?>|<\/span>|<p.*?>|<\/p>)
<a href="/search?hl=en&pwst=1&sa=X&ei=RCPqTqkHycryA_bK_f0J&ved=0CCUQvwUoAQ&q=psychology&spell=1" class=spell><b><i>psychology</i></b></a>
Hi, I'm looking to create a regex which matches this anchor and returns the inner text of it.
This is what I've been trying as a regex but without success.
'/<a[^>]+class=\"spell\"[^>]*>(.*?)<\/a>/isU'
It's probably something really silly. Thanks.
Problem was missing quotes surrounding the class. Not proper html markup but I neglected to notice so I just changed my regex to have quotes as optional.
Final regex:
'/<a[^>]+class=\"?spell\"?[^>]*>(.*?)<\/a>/is'
The regex looks OK, although you don't need to escape the quotes. Perhaps PHP doesn't like it if you use unnecessary escapes, although I doubt it. The problem is more likely the way you're using the regex. Did you access group number 1?
if (preg_match('%<a[^>]+class="spell"[^>]*>(.*?)</a>%', $subject, $regs)) {
$result = $regs[1];
}
Your problem might be the combination of (.*?) and /isU modifier. That U alters the meaning of ? making your match group (.*) greedy actually. Then you will match parts beyond the <\/a> end marker, until it encounters another.
If you remove the /U it works as expected. With your given input text, at least.
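Python's re module has no U modifier, but the effect described here can be reproduced by comparing a lazy group with a greedy one. This is only a sketch of the behaviour; the sample text with a second anchor is made up to make the overrun visible.

```python
import re

# Hypothetical input: the anchor from the question followed by a second anchor.
text = '<a class=spell><b><i>psychology</i></b></a> and <a href="#">other</a>'

lazy = re.search(r'<a[^>]+class="?spell"?[^>]*>(.*?)</a>', text, re.I | re.S)
greedy = re.search(r'<a[^>]+class="?spell"?[^>]*>(.*)</a>', text, re.I | re.S)

# Lazy stops at the first </a>; greedy runs on to the last one, which is what
# (.*?) effectively becomes under PCRE's /U.
print(lazy.group(1))
print(greedy.group(1))
```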
Here are two options to fix your expression:
For starters, you can simplify your expression to:
class=\"spell\"[^>]*>(.*?)<\/a>
This captures
<b><i>psychology</i></b>
in Group 1. I assume this is what you want to achieve.
Then, if you want to capture "psychology" without the bold and italic tags, you can use:
class=\"spell\"[^>]*>\s*<(\w+)>?\s*<(\w+)>?\s*(.*?)<\/\2>\s*<\/\1>\s*<\/a>
This captures "psychology" in group 3.
In group 1, you will find the first optional tag, whether it be "b", "strong" or nothing.
In group 2, you will find the second optional tag, which was "i" in your example.
The multiple instances of \s* allow for optional space between the tags.
Is this what you were looking for?
I have a string:
$string = 'This is my big <span class="big-string">string</span>';
I cannot figure out how to write a regular expression that will replace the 'b' in 'big' without replacing the 'b' in 'big-string'. I need to replace all occurrences of a substring except when that substring appears in an HTML tag.
Any help is appreciated!
Edit
Maybe some more info will help. I'm working on an autocomplete feature that highlights whatever you're searching for in the current result set. Currently, if you have typed 'aut' in the search dialog, the results look like this: <b>aut</b>omotive
The problem appears when I search for 'auto b'. First I replace all occurrences of 'auto' with '<b>auto</b>' then I replace all occurrences of 'b' with '<b>b</b>'. Unfortunately this second sweep changes '<b>auto</b>' to '<<b>b</b>>auto</<b>b</b>>'
Please do not try to parse HTML using regular expressions. Just load up the HTML in a DOM, walk over the text nodes and do a simple str_replace. You'll thank me around debugging time.
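To give an idea of what "walk over the text nodes" means in practice, here is a minimal Python sketch built on the standard library's html.parser. The class name Highlighter and the function highlight are made up for this example. It re-emits tags unchanged and only touches text, and it matches all search terms in a single pass so that an inserted <b> is never searched again. Comments and doctype declarations are dropped, and script/style contents are treated like ordinary text, so treat it as a starting point only.

```python
import re
from html.parser import HTMLParser


class Highlighter(HTMLParser):
    """Wrap search terms in <b>...</b>, but only inside text nodes."""

    def __init__(self, terms):
        # convert_charrefs=False so entities are passed through untouched.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag text, attributes intact

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(self.pattern.sub(lambda m: "<b>%s</b>" % m.group(0), data))

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def highlight(source, terms):
    parser = Highlighter(terms)
    parser.feed(source)
    parser.close()
    return "".join(parser.out)


print(highlight('This is my big <span class="big-string">string</span>', ["big"]))
print(highlight("automotive b", ["auto", "b"]))
```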
Is there a guarantee that 'big' won't be immediately preceded by "? If so, then s/([^"])b/$1foo/ should replace the b in question with foo.
If you insist upon using a regex, this one will do a pretty decent job:
$re = '/# (Crudely) match a sub-string NOT in an HTML tag.
big # The sub-string to be matched.
(?= # Assert we are not inside an HTML tag.
[^<>]* # Consume all non-<> up to...
(?:<\w+ # either an HTML start tag,
| $ # or the end of string.
) # End group of valid alternatives.
) # End "not-in-html-tag" lookahead assertion.
/ix';
Caveats: This regex has very real limitations. The HTML must not have any angle brackets in the tag attributes. This regex also finds the target substring inside other parts of the HTML file such as comments, scripts and stylesheets, and this may not be desirable.
I've always had a difficult time with regular expressions. I've searched for help with this, but I can't quite find what I'm looking for.
I have blocks of text that follow this pattern:
[php]
... any type of code sample here
[/php]
I need to:
check for the square brackets, which can contain any one of 20-30 programming language names (php, ruby, etc.).
grab all the code between the opening and closing brackets.
I have worked out the following regular expression:
#\[([a-z]+)\]([^\[/]*)\[/([a-z]+)\]#i
Which matches everything pretty well. However, it breaks when the code sample contains square brackets. How do I modify it so that any character between those opening/closing braces will be matched for later use?
This is the regex you want. It also requires the opening and closing tags to match, so a php tag can only be closed by a php tag.
/\[(\w+)\](.*?)\[\/\1\]/s
Or if you wanted to explicitly match the tags you could use...
$langs = array('php', 'python', ...);
$langs = implode('|', array_map('preg_quote', $langs));
preg_match_all('/\[(' . $langs . ')\](.*?)\[\/\1\]/s', $str, $matches);
The following will work:
\[([a-z]+)\].*\[/\1\]
If you don't want the greediness, you can do:
\[([a-z]+)\].*?\[/\1\]
All you have to do is to check that both the closing and opening tags have the same text (in this case, that both are the same programming language), and you do that with \1, telling it to match the previously matched Group number 1: ([a-z]+)
Why don't you use something like below:
\[php\].*?\[/php\]
I don't understand why you want to use [a-z]+ for the tags, there should be php or a limited amount of other tags. Just keep it simple.
Actually you can use:
\[(php)\].*?\[/(\1)\]
so that you can match the opening and closing tags. Otherwise you will be matching random opening and closing. Add others like, I don't know, js etc as php|js etc.
Use a backreference to refer to a match already made in the regular expression:
\[(\w+)\].*?\[/\1\]
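A short Python sketch of the backreference approach, with a capture group added around the body so the code can be used later. The sample text and the name extract_blocks are assumptions. The DOTALL flag lets the body span several lines, and square brackets inside the code no longer break the match:

```python
import re

# Group 1 = language name, group 2 = code; \1 forces the closing tag to match.
BLOCK_RE = re.compile(r"\[(\w+)\](.*?)\[/\1\]", re.DOTALL)


def extract_blocks(text):
    """Return a list of (language, code) tuples for every [lang]...[/lang] block."""
    return BLOCK_RE.findall(text)


# Hypothetical sample containing square brackets inside the code.
sample = "[php]\n$a = $b[0];\n[/php]\nand [ruby]puts x[1][/ruby]"
print(extract_blocks(sample))
```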
I am using preg_match() for some string extraction.
$str = "<aa>Let's find the stuff qwe in between <id>12345</id> these two previous brackets</h>";
$do = preg_match("/qwe(.*)12345/", $str, $matches);
which is working just fine and gives the following result
$match[0]=qwe in between 12345
$match[1]=in between
but I am using same logic to extract from the following string.
<text>
<src><![CDATA[<TEXTFORMAT LEADING="2"><P ALIGN="LEFT"><FONT FACE="Arial" SIZE="36" COLOR="#999999" LETTERSPACING="0" KERNING="0">r1 text 1 </FONT></P></TEXTFORMAT>]]></src>
<width>45%</width>
<height>12%</height>
<left>30.416666666666668%</left>
<top>3.0416666666666665%</top>
<begin>2s</begin>
<dur>10s</dur>
<transIn>fadeIn</transIn>
<transOut>fadeOut</transOut>
<id>E2159292994B083ACA7ABC7799BBEF3F7198FFA2</id>
</text>
I want to extract the string from
r1text1
to
</id>
The Regular expression I currently Have is:
preg_match('/r1text1(.*)</id\>/', $metadata], $matches);
where $metadata is the above string..
$matches does not return anything....
For some reason...how do i do it?
Thanks in advance
If you want to extract the text, you will probably want to use preg_match. The following might work:
preg_match('#\<P[^\>]*\>\<FONT[^\>]*\>(.*\</id\>)#', $string, $matches)
Whatever gets matched in the parentheses can be found later in the $matches array. In this case, that is everything between a <P> tag followed by a <FONT> tag and </id>, including the latter.
Above regex is untested but might give you a general idea of how to do it. Adapt if your needs are a bit different :)
Even though I don't know why you would match the regex against an incomplete XML fragment (starting within a <![CDATA[ and ending right before the closing XML tag </id>), you do have three obvious problems with your regex:
As Amri said: you have to escape the / character in the closing XML tag because you use / as the pattern delimiter. By the way, you don't have to escape the > character. That gives you: '/r1text1(.*)<\/id>/' Alternatively you can change the pattern delimiter to # for example: '#r1text1(.*)</id>#' (I will use the first pattern to further develop the expression).
As Rich Adams already said: the text in your example data is "r1_text_1" (_ is a space character) but you match against '/r1text1(.*)<\/id>/'. You have to include the spaces in your regex or allow for an uncertain number of spaces, such as '/r1(?:\s*)text(?:\s*)1(.*)<\/id>/' (the ?: is the syntax for non-capturing subpatterns)
The . (dot) in your regex does not match newlines by default. You have to add the s (PCRE_DOTALL) pattern modifier to let the . (dot) match against newlines as well: '/r1(?:\s*)text(?:\s*)1(.*)<\/id>/s'
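The effect of the three fixes can be checked quickly with a Python sketch. Python has no pattern delimiters, so the / needs no escaping there; re.DOTALL corresponds to the s modifier. The shortened metadata string is a stand-in for the XML in the question.

```python
import re

# Shortened stand-in for the question's data: spaces in "r1 text 1" and a
# newline before the <id> element.
metadata = 'r1 text 1 </FONT></P>\n<id>E21</id>'

# Without DOTALL the dot cannot cross the newline, so nothing matches.
no_dotall = re.search(r"r1\s*text\s*1(.*)</id>", metadata)

# With optional whitespace and DOTALL the capture reaches the closing </id>.
with_dotall = re.search(r"r1\s*text\s*1(.*)</id>", metadata, re.DOTALL)

print(no_dotall)
print(repr(with_dotall.group(1)))
```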
You probably need to parse your string/file and extract the value inside the FONT tag, then insert the value into the id tag.
Try googling for php parsing.
Try this:
preg_match('/r1text1(.*)<\/id\>/', $metadata, $matches);
You are using / as the pattern delimiter, but your pattern also contains a / in </id>. You can use \ as the escape character.
In the sample you have "r1 text 1 ", yet your regular expression has "r1text1". The regular expression doesn't match because there are spaces in the string you are trying to match it against. You should include the spaces in the regular expression.