<a href="/search?hl=en&pwst=1&sa=X&ei=RCPqTqkHycryA_bK_f0J&ved=0CCUQvwUoAQ&q=psychology&spell=1" class=spell><b><i>psychology</i></b></a>
Hi, I'm looking to create a regex which matches this anchor and returns the inner text of it.
This is what I've been trying as a regex but without success.
'/<a[^>]+class=\"spell\"[^>]*>(.*?)<\/a>/isU'
It's probably something really silly. Thanks.
Problem was missing quotes surrounding the class. Not proper html markup but I neglected to notice so I just changed my regex to have quotes as optional.
Final regex:
'/<a[^>]+class=\"?spell\"?[^>]*>(.*?)<\/a>/is'
The regex looks OK, although you don't need to escape the quotes. Perhaps PHP doesn't like it if you use unnecessary escapes, although I doubt it. The problem is more likely the way you're using the regex. Did you access group number 1?
if (preg_match('%<a[^>]+class="spell"[^>]*>(.*?)</a>%', $subject, $regs)) {
$result = $regs[1];
}
Your problem might be the combination of (.*?) and /isU modifier. That U alters the meaning of ? making your match group (.*) greedy actually. Then you will match parts beyond the <\/a> end marker, until it encounters another.
If you remove the /U it works as expected. With your given input text, at least.
Here are two options to fix your expression:
For starters, you can simplify your expression to:
class=\"spell\"[^>]*>(.*?)<\/a>
This captures
<b><i>psychology</i></b>
in Group 1. I assume this is what you want to achieve.
Then, if you want to capture "psychology" without the bold and italic tags, you can use:
class=\"spell\"[^>]*>\s*<(\w+)>?\s*<(\w+)>?\s*(.*?)<\/\2>\s*<\/\1>\s*<\/a>
This captures "psychology" in group 3.
In group 1, you will find the first optional tag, whether it be "b", "strong" or nothing.
In group 2, you will find the second optional tag, which was "i" in your example.
The multiple instances of \s* allow for optional space between the tags.
Is this what you were looking for?
Related
What I'm trying to do is find all the matches within a content block, but ignore anything that is inside tags, for use inside preg_replace_callback().
For example:
test
test title
test
In this case, I want the first line to match, and the third line to match, but NOT the url match, nor the title match in between the a tags.
I've got a regex that I feel like is close:
#(?!<.*?)(\btest\b)(?![^<>]*?>)#si
(and this will not match the url part)
But how do I modify the regex to also exclude the "test" between a and /a?
If it's always the same pattern you can use [A-Z] or a combination like [A-Za-z]
I ended up solving it myself. This regex pattern will do what I wanted:
#(?!<a[^>]*?>)(\btest\b)(?![^<]*?<\/a>)#si
The regex works perfectly but the problem is it also include the next occurrence instead of ending with the first occurrence then start again from the
Regex : (?=<appView)\s{0,1}(.*)(?<=<\/appView>)
String: <appView></appView> <appView></appView>
But my problem is it eat matches the whole word like
(Match 1)<appView></appView> <appView></appView>
I want it to search the group differently but i cant make it work.
Desired output : (Match 1) <appView></appView> (Match 2)<appView></appView>
\s{0,1} equals \s? You need to use (.*?) to be lazy instead of (.*)
Use this pattern: ~(?=<appView)\s?(.*?)(?<=</appView>)~
Demo Link
*note, you don't have to escape / in the closing tag if you use something other than a slash as your pattern delimiter. I am using ~ at the beginning and end of my pattern to avoid escaping.
I fully recommend to switch from regex to an actual sequential xml parser. Regex is aweful for parsing xml based files, for example because of the problems below.
That said, you can "fix" your regex by using ([^<>]*). This will match all characters without < or >, which will make sure that no other tags are nested inside. If done with all tags, you cannot match something like <appview><unclosedTag></appView>, because it is invalid. If you can be certain that the structure is correct, this is slightly less of an issue.
Another problem your approach has is that if you have nested tags like so: <appView> something <appView> something else </appView> else </appView>, your approach will make you end up with [replaced] else </appView>.
I am currently working on protecting my AJAX Chat against exploits by checking all text in PHP before it is passed to the client. So far I have been successful with my mission except for one part where I require to match sets of image tags.
Overall I wish to have it pick up any instance of there being a newline character between a set tags which I have sort of managed, but the solution I have is greedy and matches newline characters outside of tags as well if there are multiple sets of tags.
At the moment I have the following which works if I wanted to match just [img]{newline}[/img]
if(preg_match('/\[\bimg\].*\x0A.*\[\/\bimg\]/', $text)){ //code }
But if I wanted to do [img]image.jpg[/img]{newline}[img]image.jpg[/img], it only sees the very first and end tags which I do not want.
So now I ask, how do you make it match each set of tags properly?
Edit: For clarification. Any newline characters inside tags are bad, so I want to detect them. Any newline characters outside tags are good and I want to ignore them. The reason being, if the client processes a newline character inside of a tag, it crashes.
Just make it ungreedy by putting ? after the two .*
But note that your current solution will not match this:
[img]
look, two newlines!
[/img]
I'm not sure why you want to do this, but you can make . match newlines by adding the s modifier to your regex. Then it's just "(\[img\](.*?)\[/img\])is" to match it, and you can even capture that group and individually check it for newlines if you want.
Try setting the s modifier, like this:
if (preg_match('/\[\bimg\].*\x0A.*\[\/\bimg\]/s', $text)) { code }
See also the PHP Documentation for Regex modifiers
I need a regular expression for php that outputs everything between <!--:en--> and <!--:-->.
So for <!--:en-->STRING<!--:--> it would output just STRING.
EDIT: oh and the following <!--:--> nedds to be the first one after <!--:en--> becouse there are more in the text..
The one you want is actually not too complicated:
/<!--:en-->(.*?)<!--:-->/gi
Your matches will be in capture group 1.
Explanation:
The .*? is a lazy quantifier. Basically, it means "keep matching until you find the shortest string that will still fit this pattern." This is what will cause the matching to stop at the first instance of <!--:-->, rather than sucking up everything until the last <!--:--> in the document.
Usage is something like preg_match("/<!--:en-->(.*?)<!--:-->/gi", $input) if I recall my PHP correctly.
If you have just that input
$input = '<!--:en-->STRING<!--:-->';
You can try with
$output = strip_tags($input);
Try:
^< !--:en-- >(.*)< !--:-- >$
I don't think any of the other characters need to be escaped.
<!--:en--\b[^>]*>(.*?)<!--:-->
This will match the things between your tags. This will break if you nest your tags, but you didnt say you were doing that :)
Hi Guys I'm very new to regex, can you help me with this.
I have a string like this "<input attribute='value' >" where attribute='value' could be anything and I want to get do a preg_replace to get just <input />
How do I specify a wildcard to replace any number of any characters in a srting?
like this? preg_replace("/<input.*>/",$replacement,$string);
Many thanks
What you have:
.*
will match "any character, and as many as possible.
what you mean is
[^>]+
which translates to "any character, thats not a ">", and there must be at least one
or altertaively,
.*?
which means
"any character, but only enough to make this rule work"
BUT DONT
Parsing HTML with regexps is Bad
use any of the existing html parsers, DOM librarys, anything, Just NOT NAïVE REGEX
For example:
<foo attr=">">
Will get grabbed wrongly by regex as
'<foo attr=" ' with following text of '">'
Which will lead you to this regex:
`<[a-zA-Z]+( [a-zA-Z]+=['"][^"']['"])*)> etc etc
at which point you'll discover this lovely gem:
<foo attr="'>\'\"">
and your head will explode.
( the syntax highlighter verifies my point, and incorrectly matches thinking i've ended the tag. )
Some people were close... but not 100%:
This:
preg_replace("<input[^>]*>", $replacement, $string);
should be this:
preg_replace("<input[^>]*?>", $replacement, $string);
You don't want that to be a greedy match.
preg_replace("<input[^>]*>", $replacement, $string);
// [^>] means "any character except the greater than symbol / right tag bracket"
This is really basic stuff, you should catch up with some reading. :-)
If I understand the question correctly, you have the code:
preg_replace("/<input.*>/",$replacement,$string);
and you want us to tell you what you should use for $replacement to delete what was matched by .*
You have to go about this the other way around. Use capturing groups to capture what you want to keep, and reinsert that into the replacement. E.g.:
preg_replace("/(<input).*(>)/","$1$2",$string);
Of course, you don't really need capturing groups here, as you're only reinserting literal text. Bet the above shows the technique, in case you want to do this in a situation where the tag can vary. This is a better solution:
preg_replace("/<input [^>]*>/","<input />",$string);
The negated character class is more specific than the dot. This regex will work if there are two HTML tags in the string. Your original regex won't.