I'm using the following code to filter out urls from a block of HTML text in PHP.
preg_replace('#<a(?![^>]+?href="?http://keepthisdomain.com/foo/bar"?).*?>(.*?)</a>#i', '\1', $text);
It's intended to strip every link whose URL does not match the specified pattern. However, I also want to keep all tags that have the attribute rel="shadowbox[a]" set.
How can I modify this preg_replace to do that?
You are better off not using regex at all and using a parser instead, for the reasons set forth in this answer.
That said, you can do it with regex, but it's tricky:
preg_replace('#<a(?![^>]+?\bhref="?http://keepthisdomain\.com/foo/bar"?|[^>]+\brel="shadowbox\[a\]").*?>(.*?)</a>#i', '\1', $text);
Details on the regex:
<a(?![^>]+?\bhref="?http://keepthisdomain\.com/foo/bar"?|[^>]+\brel="shadowbox\[a\]").*?>(.*?)</a>
Out of the following four tags, only the third would be replaced:
foo // left alone
foo // left alone
foo // REPLACED
foo // left alone
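A minimal sketch of that behaviour, using four hypothetical tags (the example.com URLs are assumptions for illustration only):

```php
<?php
$pattern = '#<a(?![^>]+?\bhref="?http://keepthisdomain\.com/foo/bar"?|[^>]+\brel="shadowbox\[a\]").*?>(.*?)</a>#i';

// Hypothetical sample tags, one per array element.
$tags = [
    '<a href="http://keepthisdomain.com/foo/bar">foo</a>',                    // href matches: left alone
    '<a href="http://keepthisdomain.com/foo/bar" rel="shadowbox[a]">foo</a>', // both match: left alone
    '<a href="http://example.com/other">foo</a>',                             // neither matches: replaced
    '<a href="http://example.com/other" rel="shadowbox[a]">foo</a>',          // rel matches: left alone
];

// preg_replace accepts an array subject and returns an array of results.
$result = preg_replace($pattern, '\1', $tags);
print_r($result);
```

Only the third element should come back as plain `foo`; the others are returned unchanged because the negative lookahead rejects them.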
Edited with a minor tweak to make it match a literal . in .com, using \.
Related
I'm trying to load information from HTML files into a database, and I just discovered that a link's text can look like this:
channel crosstalk: <60dB
Therefore my regular expression doesn't find that link:
preg_match_all('|<a href="/blabla/([0-9]+)"[^>]*>([^<]*)</a>|Uis',$html,$matches);
This is a part of big regular expression, I just simplified it for example.
It's hard to tell what you are trying to pull. Are you looking for the entire link? Or are you looking to grab parts from the link (hence the parentheses)? Here is a solution for getting the individual contents in the link:
preg_match_all( '#(.*?)#i', $html, $matches);
The first element of matches will be the entire link, while the other elements will be the sub parts.
Or here is one for just the entire link:
preg_match_all( "#(<a.*>.*</a>)#i", $html, $matches );
Or here is a slightly modified version of yours. Your original fails because [^<]* only accepts text without an angle bracket between the opening and closing a tags, and this link's text contains one. Note that the U modifier already makes quantifiers lazy, so the group is written as (.*); adding ? under U would flip it back to greedy:
preg_match_all( '|<a href="/blabla/([0-9]+)"[^>]*>(.*)</a>|Uis', $html, $matches );
Again, I'm not 100% sure of the exact results you are looking for, but maybe this will get you going and you can make modifications as needed.
You can use this regex to extract href and link text.
<a[^>]+?href="(.*?)"[^>]*>(.*?)</a>
Group 1: href
Group 2: link text
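Here is a rough sketch of how that could be used from PHP. The markup is hypothetical, the pattern assumes double-quoted href values, and it uses `[^>]*` after the href value so the tag may also end right after it:

```php
<?php
// Hypothetical markup; note the raw "<" inside the link text.
$html = '<p><a class="ext" href="/blabla/42" title="spec">channel crosstalk: <60dB</a></p>';

// PREG_SET_ORDER groups the results per match: [0] whole link, [1] href, [2] text.
preg_match_all('#<a[^>]+?href="(.*?)"[^>]*>(.*?)</a>#is', $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[1], ' => ', $m[2], "\n";
}
```

Because group 2 is `(.*?)` rather than `[^<]*`, a stray `<` in the link text does not stop the match.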
This is the fundamental issue with trying to regex HTML. This is not really valid HTML, because content that is not meant to be interpreted as markup should be written as HTML entities (i.e. &lt; instead of <). You won't always be able to control that, though.
In your case, something like this works for regex:
|.*?|Uis
The matching group gets shifted. This also allows nested tags (like <a><b><i></i></b></a>).
Keep in mind that the ungreedy (U) modifier you used makes every quantifier lazy by default, so you can be a little more lax in your regex matching. If you wanted to do this without the U modifier, you'd probably need a negative lookahead:
|(?:(?!).)*</a>|is
I am migrating a forum from phpBB to native PHP, and I need to parse some BBCode tags that have a uid inside. This is the code to turn them into regular BBCode without the uid:
$regex = "#\[quote:(.*)=(.*)\](.+)\[/quote:(.+)\]#isU";
$text = 'outside sample
[quote:c1891a7ad3]
text with link https://www.facebook.com/groups/35688476100/?fref=ts [/quote:c1891a7ad3]
outside text
[quote:c1891a7ad3="Budi"]
written by me , - budi
[/quote:c1891a7ad3]';
echo preg_replace($regex, '[quote=$2]$3[\quote]', $text);
but the result is not what I expect, which is:
"outside sample
[quote:c1891a7ad3]
text with link https://www.facebook.com/groups/35688476100/?fref=ts [/quote:c1891a7ad3]
outside text
[quote="Budi"]
written by me , - budi
[\quote]"
How should the regex be modified to yield expected result?
You have a mismatch between the pattern and the actual string you test against. In the pattern, you have / in [/quote] and in the string, you have \ ([\quote:c1891a7ad3]).
So, if your actual string in fact has /, all you need to fix is the (.*) part as the dot matches any character (including ]) and thus can overmatch even with lazy matching.
So, use
$regex = "#\[quote:([^]]*)=([^]]*)\](.+)\[/quote:([^]]+)\]#isU";
See IDEONE demo
In this regex, I am using a negated character class [^]]* that matches 0 or more characters other than ]. It makes sure we only match text inside [...]. By contrast, (.*) matches c1891a7ad3]
text with link https://www.facebook.com/groups/35688476100/?fref, so we need to restrict it this way.
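As a quick check, here is a minimal sketch that runs the corrected pattern on the sample text. It assumes the closing tag in the input uses a forward slash, and it writes `[/quote]` in the replacement:

```php
<?php
$regex = '#\[quote:([^]]*)=([^]]*)\](.+)\[/quote:([^]]+)\]#isU';

// Nowdoc keeps the sample text literal, including the double quotes.
$text = <<<'BB'
outside sample
[quote:c1891a7ad3]
text with link https://www.facebook.com/groups/35688476100/?fref=ts [/quote:c1891a7ad3]
outside text
[quote:c1891a7ad3="Budi"]
written by me , - budi
[/quote:c1891a7ad3]
BB;

// $2 is the author part, $3 the quoted body.
$out = preg_replace($regex, '[quote=$2]$3[/quote]', $text);
echo $out, "\n";
```

The first quote block has no `=` before its closing `]`, so `[^]]*` cannot reach the `=` in `?fref=ts` and that block is left untouched. Only the `="Budi"` block is rewritten.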
<a href="/search?hl=en&pwst=1&sa=X&ei=RCPqTqkHycryA_bK_f0J&ved=0CCUQvwUoAQ&q=psychology&spell=1" class=spell><b><i>psychology</i></b></a>
Hi, I'm looking to create a regex which matches this anchor and returns the inner text of it.
This is what I've been trying as a regex but without success.
'/<a[^>]+class=\"spell\"[^>]*>(.*?)<\/a>/isU'
It's probably something really silly. Thanks.
The problem was the missing quotes around the class value. It's not proper HTML markup, but I failed to notice, so I just changed my regex to make the quotes optional.
Final regex:
'/<a[^>]+class=\"?spell\"?[^>]*>(.*?)<\/a>/is'
The regex looks OK, although you don't need to escape the quotes. Perhaps PHP doesn't like it if you use unnecessary escapes, although I doubt it. The problem is more likely the way you're using the regex. Did you access group number 1?
if (preg_match('%<a[^>]+class="spell"[^>]*>(.*?)</a>%', $subject, $regs)) {
    $result = $regs[1];
}
Your problem might be the combination of (.*?) and the /isU modifiers. The U modifier inverts the meaning of ?, which actually makes your match group (.*?) greedy. It will then match beyond the first <\/a> end marker, all the way to the last one it can find.
If you remove the /U it works as expected. With your given input text, at least.
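To see the effect, here is a small sketch with a hypothetical input that contains two matching links:

```php
<?php
// Hypothetical input with two links on one line.
$html = '<a class="spell">one</a> and <a class="spell">two</a>';

// With U, the "?" flips (.*?) back to greedy, so group 1 runs to the last </a>.
preg_match('/<a[^>]+class="spell"[^>]*>(.*?)<\/a>/isU', $html, $withU);

// Without U, (.*?) is lazy and stops at the first </a>.
preg_match('/<a[^>]+class="spell"[^>]*>(.*?)<\/a>/is', $html, $withoutU);

var_dump($withU[1], $withoutU[1]);
```

With a single link in the input both variants capture the same text, which is why the problem only shows up on larger pages.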
Here are two options to fix your expression:
For starters, you can simplify your expression to:
class=\"spell\"[^>]*>(.*?)<\/a>
This captures
<b><i>psychology</i></b>
in Group 1. I assume this is what you want to achieve.
Then, if you want to capture "psychology" without the bold and italic tags, you can use:
class=\"spell\"[^>]*>\s*<(\w+)>?\s*<(\w+)>?\s*(.*?)<\/\2>\s*<\/\1>\s*<\/a>
This captures "psychology" in group 3.
In group 1, you will find the first optional tag, whether it be "b", "strong" or nothing.
In group 2, you will find the second optional tag, which was "i" in your example.
The multiple instances of \s* allow for optional space between the tags.
Is this what you were looking for?
I have a string:
$string = 'This is my big <span class="big-string">string</span>';
I cannot figure out how to write a regular expression that will replace the 'b' in 'big' without replacing the 'b' in 'big-string'. I need to replace all occurrences of a substring except when that substring appears in an HTML tag.
Any help is appreciated!
Edit
Maybe some more info will help. I'm working on an autocomplete feature that highlights whatever you're searching for in the current result set. Currently if you have typed 'aut' in the search dialog, then the results look like this: <b>aut</b>omotive
The problem appears when I search for 'auto b'. First I replace all occurrences of 'auto' with '<b>auto</b>' then I replace all occurrences of 'b' with '<b>b</b>'. Unfortunately this second sweep changes '<b>auto</b>' to '<<b>b</b>>auto</<b>b</b>>'
Please do not try to parse HTML using regular expressions. Just load up the HTML in a DOM, walk over the text nodes and do a simple str_replace. You'll thank me around debugging time.
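A minimal sketch of that approach with PHP's DOM extension. The function name `highlightOutsideTags` is made up for this example, and it assumes ASCII input so that byte offsets from `stripos` line up with the character offsets `splitText` expects. Since the replacement here inserts markup, the sketch splits the text node and wraps the match in a real `<b>` element instead of calling `str_replace`:

```php
<?php
// Hypothetical helper: wrap every occurrence of $needle in <b>, text nodes only.
function highlightOutsideTags(string $html, string $needle): string
{
    if ($needle === '') {
        return $html;
    }

    $doc = new DOMDocument();
    // Wrap the fragment in one <div> so the document has a single root element.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    $xpath = new DOMXPath($doc);

    // Copy the node list first because the loop below modifies the tree.
    foreach (iterator_to_array($xpath->query('//text()')) as $node) {
        while (($pos = stripos($node->data, $needle)) !== false) {
            $match = $node->splitText($pos);              // node starting at the needle
            $rest  = $match->splitText(strlen($needle));  // node holding the text after it
            $bold  = $doc->createElement('b');
            $match->parentNode->replaceChild($bold, $match);
            $bold->appendChild($match);
            $node = $rest;                                // keep searching in the remainder
        }
    }

    // Serialize only the children of the wrapper <div>.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$highlighted = highlightOutsideTags('This is my big <span class="big-string">string</span>', 'b');
echo $highlighted, "\n";
```

Because only text nodes are touched, attribute values such as `big-string` and tag names such as `<b>` are never rewritten, so running it once per search term avoids the `<<b>b</b>>` problem.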
Is there a guarantee that 'big' won't be immediately preceded by "? If so, then s/([^"])b/$1foo/ should replace the b in question with foo.
If you insist upon using a regex, this one will do a pretty decent job:
$re = '/# (Crudely) match a sub-string NOT in an HTML tag.
    big              # The sub-string to be matched.
    (?=              # Assert we are not inside an HTML tag.
      [^<>]*         # Consume all non-<> up to...
      (?:<\w+        # either an HTML start tag,
      | $            # or the end of string.
      )              # End group of valid alternatives.
    )                # End "not-in-html-tag" lookahead assertion.
    /ix';
Caveats: This regex has very real limitations. The HTML must not have any angle brackets in the tag attributes. This regex also finds the target substring inside other parts of the HTML file such as comments, scripts and stylesheets, and this may not be desirable.
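For illustration, here is a short sketch that applies a compact, comment-free form of the same pattern to the string from the question. The `<b>$0</b>` replacement is an assumption based on the highlighting use case:

```php
<?php
// Same pattern as above, written without the x-mode comments.
$re = '/big(?=[^<>]*(?:<\w+|$))/i';

$string = 'This is my big <span class="big-string">string</span>';

// $0 is the whole match, so the matched text keeps its original casing.
$highlighted = preg_replace($re, '<b>$0</b>', $string);
echo $highlighted, "\n";

// A further limitation: text directly followed by a closing tag is skipped,
// because the lookahead only accepts a start tag or the end of the string.
$skipped = preg_match($re, 'a <i>big</i> deal');
```

The `big` inside the class attribute is left alone because the lookahead runs into `>` before it finds a start tag or the end of the string.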
I am trying to pull the anchor text from a link that is formatted this way:
<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>
I want only the anchor text for the link : "i_want_this"
"variable_text" varies according to the filename so I need to ignore that.
I am using this regex:
<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>
This is matching of course the complete link.
PHP's preg functions use PCRE (Perl Compatible Regular Expressions), which is very close to Perl's regex syntax. If you want to know a lot about regex, visit perlretut.org. Also, look into regex generators like exspresso.
For your use, know that regex quantifiers are greedy by default. That means that when you specify something, followed by anything (any number of repetitions), followed by something else, the match keeps going as far as it can and stops at the last possible occurrence of that second something, not the first.
to be more clear, what you want is this:
<a href="
any character, any number of times (regex = .* )
">
any character, any number of times (regex = .* )
</a>
Beyond that, you want to capture the second group of "any character, any number of times". You can do that using what are called capture groups (capture anything inside of parentheses as a group for reference later, also called back references).
I would also look into named subpatterns, too - with those, you can reference your choice with a human readable string rather than an array index. The syntax for those in PHP is (?P<name>pattern), where name is the name you want and pattern is the actual regex. I'll use that below.
So all that being said, here's the "lazy web" for your regex:
<?php
$str = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
$regex = '/(<a href\=".*">)(?P<target>.*)(<\/a>)/';
preg_match($regex, $str, $matches);
print $matches['target']; // This should output "i_want_this"
?>
Oh, and one final thought. Depending on what you are doing exactly, you may want to look into SimpleXML instead of using regex for this. This would probably require that the tags we see are just snippets of a larger whole, as SimpleXML requires well-formed XML (or XHTML).
I'm sure someone will probably have a more elegant solution, but I think this will do what you want done.
Where:
$subject = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';
Option 1:
$pattern1 = '/(<a href=")(.*)(">)(.*)(<\/a>)/i';
preg_match($pattern1, $subject, $matches1);
print($matches1[4]);
Option 2:
$pattern2 = '(<a href=")(.*)(">)(.*)(</a>)';
ereg($pattern2, $subject, $matches2); // note: ereg() was removed in PHP 7.0; prefer Option 1
print($matches2[4]);
Do not use regex to parse HTML. Use a DOM parser. Specify the language you're using, too.
Since it's in a captured group and since you claim it's matching, you should be able to reference it through $1 or \1 depending on the language.
$blah = preg_match( $pattern, $subject, $matches );
print_r($matches);
The thing to remember is that a regex match returns everything you searched for if it matches. You need to specify that you only care about the part you've surrounded in parentheses (the anchor text). I'm not sure what language you're using the regex in, but here's an example in Ruby:
string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)
puts data # => outputs '<a href="/en/browse/file/variable_text">i_want_this</a>'
If you specify what you want in parentheses, you can reference it:
string = '<a href="/en/browse/file/variable_text">i_want_this</a>'
data = string.match(/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/)[1]
puts data # => outputs 'i_want_this'
Perl will have you use $1 instead of [1] like this:
$string = '<a href="/en/browse/file/variable_text">i_want_this</a>';
$string =~ m/<a href=\"\/en\/browse\/file\/variable_text\">(.*?)<\/a>/;
$data = $1;
print $data . "\n";
Hope that helps.
I'm not 100% sure if I understand what you want. This will match the content between the anchor tags. The URL must start with /en/browse/file/, but may end with anything.
#(.*?)#
I used # as a delimiter as it made it clearer. It'll also help if you put them in single quotes instead of double quotes so you don't have to escape anything at all.
If you want to limit to numbers instead, you can use:
#(.*?)#
If it should have just 5 numbers:
#(.*?)#
If it should have between 3 and 6 numbers:
#(.*?)#
If it should have more than 2 numbers:
#(.*?)#
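As a rough sketch of what those variants could look like: the base path /en/browse/file/ comes from the question, while the exact sub-patterns for the last path segment are my assumption of one way to express each case:

```php
<?php
// Assumed sub-patterns for the last path segment of the href.
$variants = [
    'anything'      => '[^"]*',
    'digits only'   => '\d+',
    'exactly 5'     => '\d{5}',
    '3 to 6 digits' => '\d{3,6}',
    'more than 2'   => '\d{3,}',
];

// Hypothetical link whose file name is a 5-digit number.
$subject = '<a href="/en/browse/file/12345">i_want_this</a>';

$results = [];
foreach ($variants as $label => $segment) {
    $pattern = '#<a href="/en/browse/file/' . $segment . '">(.*?)</a>#';
    // Store the anchor text on success, null when the variant does not match.
    $results[$label] = preg_match($pattern, $subject, $m) ? $m[1] : null;
}
print_r($results);

// A textual file name is rejected by the digits-only variant.
$textual = '<a href="/en/browse/file/variable_text">i_want_this</a>';
$digitsOnText = preg_match('#<a href="/en/browse/file/\d+">(.*?)</a>#', $textual);
```

Since the closing `"` follows the digits directly, `\d{5}` really means exactly five digits here and not "at least five".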
This should work:
<a href="[^"]*">([^<]*)
This says: take EVERYTHING you find until you meet "
[^"]*
Same here: take everything until you meet <
[^<]*
The parentheses around [^<]*
([^<]*)
group it, so you can collect that data in PHP! If you look up preg_match in the PHP manual, you will see many fine examples there!
Good luck!
And for your concrete example:
<a href="/en/browse/file/variable_text">([^<]*)
I use
[^<]*
because in some examples...
.*?
can be extremely slow! You shouldn't use it if you can use
[^<]*
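A minimal usage sketch in PHP, assuming the link sits inside the h3 from the question with an href of /en/browse/file/variable_text:

```php
<?php
// Assumed subject: the h3 from the question with its anchor tag in place.
$subject = '<h3><b>File</b> : <a href="/en/browse/file/variable_text">i_want_this</a></h3>';

// [^"]* skips the variable href; ([^<]*) captures text up to the next "<".
if (preg_match('#<a href="[^"]*">([^<]*)#', $subject, $m)) {
    $anchorText = $m[1];
    echo $anchorText, "\n";
}
```

The negated class stops at the first `<`, so it only fits link text that contains no nested tags or raw `<` characters.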
You should use the tool Expresso for creating regular expressions. It's pretty handy:
http://www.ultrapico.com/Expresso.htm