I wrote a regex to extract the href from an anchor tag.
My regex is:
<a.*?href="(.*?)">blah<\/a> //dot is matching all
As I understand it, this starts matching at <a and continues until the first href. It then captures the URL in the href up to the first " and finally matches blah.
But it matches across multiple anchor tags when the last of them ends with blah, for example:
abc
def
blah
I expected it to grab only the last URL, since the regex fits that tag exactly.
To answer the question, you can swap your dot for a negated character class, so that the match cannot run past the > that closes the opening tag (or past the " that closes the attribute value):
<a[^>]*href="([^"]*)">def<\/a>
This (in theory) ensures that the regex pattern will only match inside a particular tag.
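To see the difference concretely, here is a small sketch comparing the two patterns. The sample markup and its href values are made up for illustration, so treat it as a demo under assumed input rather than the asker's exact data.

```php
<?php
// Hypothetical sample input: three anchors, only the last one has the text "blah".
$html = '<a href="/one">abc</a> <a href="/two">def</a> <a href="/three">blah</a>';

// Lazy dot: the match still starts at the first <a, so group 1 can span several tags.
preg_match('~<a.*?href="(.*?)">blah</a>~s', $html, $lazy);

// Negated classes: [^>]* cannot leave the opening tag, [^"]* cannot leave the attribute value.
preg_match('~<a[^>]*href="([^"]*)">blah</a>~', $html, $strict);

echo $lazy[1], PHP_EOL;    // starts at /one and runs through the later tags
echo $strict[1], PHP_EOL;  // /three
```

A lazy quantifier only prefers the shortest match from a given starting position. It does not stop the engine from starting at the first <a and expanding until the rest of the pattern fits.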
To not answer your question: it's often not a great idea to parse HTML with regex, unless you can be extremely sure of exactly how it's formatted. You might want to look into the PHP DOM.
What I'm trying to do is find all the matches within a content block, but ignore anything that is inside tags, for use inside preg_replace_callback().
For example:
test
test title
test
In this case, I want the first line to match, and the third line to match, but NOT the url match, nor the title match in between the a tags.
I've got a regex that I feel like is close:
#(?!<.*?)(\btest\b)(?![^<>]*?>)#si
(and this will not match the url part)
But how do I modify the regex to also exclude the "test" between a and /a?
If it's always the same pattern you can use [A-Z] or a combination like [A-Za-z]
I ended up solving it myself. This regex pattern will do what I wanted:
#(?!<a[^>]*?>)(\btest\b)(?![^<]*?<\/a>)#si
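As a rough sketch of how that pattern behaves inside preg_replace_callback(), the snippet below runs it over an invented content block. The link markup is an assumption, and uppercasing is only a stand-in for whatever the real callback does.

```php
<?php
// Hypothetical content block: "test" appears in plain text, in attributes, and in link text.
$content = "test\n<a href=\"http://test.example/\" title=\"test\">test title</a>\ntest";

$result = preg_replace_callback(
    '#(?!<a[^>]*?>)(\btest\b)(?![^<]*?<\/a>)#si',
    function (array $m) {
        return strtoupper($m[1]);   // stand-in for the real replacement logic
    },
    $content
);

echo $result, PHP_EOL;
```

Only the first and last lines are changed. The trailing lookahead does the work here: any "test" that can reach </a> without crossing another < is rejected. This also means link text containing nested tags (for example <a><b>test</b></a>) would not be protected.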
I'm trying to load information from HTML files into a database, and I found that a link can look like this:
channel crosstalk: <60dB
Therefore my regular expression doesn't find that link:
preg_match_all('|<a href="/blabla/([0-9]+)"[^>]*>([^<]*)</a>|Uis',$html,$matches);
This is part of a larger regular expression; I simplified it for the example.
It's hard to tell what you are trying to pull. Are you looking for the entire link? Or are you looking to grab parts of the link (hence the parentheses)? Here is a solution for getting the individual contents of the link:
preg_match_all( '#(.*?)#i', $html, $matches);
The first element of matches will be the entire link, while the other elements will be the sub parts.
Or here is one for just the entire link:
preg_match_all( "#(<a.*?>.*?</a>)#i", $html, $matches ); // lazy quantifiers, so one match does not swallow several links
Or here is a slightly modified version of yours. Your pattern currently fails because ([^<]*) only allows characters other than < between the opening and closing a tags, and this link's text contains an angle bracket. Note that under the U modifier a plain .* is already lazy, while .*? would become greedy:
preg_match_all( '|<a href="/blabla/([0-9]+)"[^>]*>(.*)</a>|Uis', $html, $matches );
Again, I'm not 100% sure of the exact results you are looking for, but maybe this will get you going and you can make modifications as needed.
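Here is a small sketch of why the negated class fails on such a link and what relaxing it changes. The markup is invented (the question's real link is not shown), and the relaxed pattern relies on the fact that the U modifier inverts greediness, so a plain .* is lazy there.

```php
<?php
// Hypothetical input: the first link's text contains a raw "<", as in the question.
$html = '<a href="/blabla/42" class="spec">channel crosstalk: <60dB</a> '
      . '<a href="/blabla/43">gain</a>';

// Original pattern: ([^<]*) cannot cross the "<" inside the first link's text.
preg_match_all('|<a href="/blabla/([0-9]+)"[^>]*>([^<]*)</a>|Uis', $html, $strict, PREG_SET_ORDER);

// Relaxed pattern: under U, (.*) is lazy and stops at the first </a>.
preg_match_all('|<a href="/blabla/([0-9]+)"[^>]*>(.*)</a>|Uis', $html, $relaxed, PREG_SET_ORDER);

echo count($strict), ' vs ', count($relaxed), PHP_EOL;
echo $relaxed[0][2], PHP_EOL;
```

With this input the original pattern finds only the second link, while the relaxed one finds both.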
You can use this regex to extract href and link text.
<a[^>]+?href="(.*?)"[^>]*>(.*?)</a>
Group 1: href
Group 2: link text
This is the fundamental issue with trying to regex HTML. This is not really good HTML, because content that is not meant to be interpreted as HTML should be written as HTML entities (i.e. &lt; instead of <). You won't always be able to control that, though.
In your case, something like this works for regex:
|.*?|Uis
The matching group gets shifted. This also allows nested tags (like <a><b><i></i></b></a>).
Keep in mind that the ungreedy (U) modifier you used means that you can be a little more lax in your regex matching. If you wanted to do this without the U modifier, you would probably need a negative lookahead:
|(?:(?!).)*</a>|is
I've searched for this but couldn't find a solution that worked for me.
I need a regex pattern that will match all text except HTML tags, so I can convert it to Cyrillic (which would obviously ruin the entire HTML if applied to the tags =))
So, for example:
<p>text1</p>
<p>text2 <span class="theClass">text3</span></p>
I need to match text1, text2, and text3, so something like
preg_match_all("/pattern/", $text, $matches)
and then I would just iterate over the matches, or if it can be done with preg_replace, to replace text1/2/3, with textA/B/C, that would be even better.
As you probably know, regex is not a great choice for this (the general advice here will be to use a DOM parser).
However, if you need a quick regex solution, you can use this:
<[^>]*>(*SKIP)(*F)|[^<]+
How this works is that on the left the <[^>]*> matches complete <tags>, then the (*SKIP)(*F) causes the regex to fail and the engine to advance to the position in the string that follows the last character of the matched tag.
This is an application of a general technique for excluding patterns from matches: match what you do not want first, discard it, and let the last alternative match what you do want.
If you don't want to allow the matches to span several lines, add \r\n to the negated character class that does your matching, like so:
<[^>]*>(*SKIP)(*F)|[^<\r\n]+
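A minimal sketch of this pattern in PHP, using the markup from the question. The uppercasing callback is only a placeholder for a real Latin-to-Cyrillic conversion, which is not shown here.

```php
<?php
$text = "<p>text1</p>\n<p>text2 <span class=\"theClass\">text3</span></p>";

// Tags are matched and thrown away by (*SKIP)(*F); only text between tags is returned.
$pattern = '~<[^>]*>(*SKIP)(*F)|[^<\r\n]+~';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]);

$converted = preg_replace_callback($pattern, function (array $m) {
    return strtoupper($m[0]);   // placeholder for the real transliteration
}, $text);

echo $converted, PHP_EOL;
```

Note that the second match keeps its trailing space ("text2 "), since whitespace other than line breaks is part of the text run.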
How about this RegEx:
/(?<=>)[\w\s]+(?=<)/g
Maybe this one (in Ruby):
/(?<!<)(?<!<\/)(?<![<\/\w+])([[:alpha:]])+(?!>)/
Enjoy !
Please use PHP's DOMDocument class to parse XML content (see the PHP documentation).
I'm trying to read an HTML file and capture all anchor tags that match a specific URL pattern in order to display those links on another page. The pattern looks like this:
https://docs.google.com/file/d/aBunchOfLettersAndNumbers/edit?usp=drive_web
I'm lousy with RegEx. I've tried a bunch of things and read a bunch of answers here on Stack Overflow, but I'm not hitting on the correct syntax.
Here's what I have now:
preg_match ('/<a href="https:\/\/docs.google.com\/file\/d\/(.*)<\/a>/', $file, $matches)
When I test this on an HTML page with two matching anchor tags, the first result includes the first and second match and everything in between, while the second result includes part of the first match, part of the second match, and everything in between.
While I'd be happy to capture matching anchor tags along with the inner HTML, I'd be even happier if I could generate a multidimensional array with the HREF attribute of each matching anchor tag, along with the matching inner HTML (so I can format the links myself, without having to use even more RegEx to get rid of unwanted attributes). Would I use preg_match_all for that? What would that look like?
Am I even on the right path here, or should I be using DOM and XPath queries to find this stuff?
Thanks.
Oh jeez, I can't believe every answer here uses "/" delimiters. If your pattern has slashes in it, use something else for the sake of readability.
Here's a better answer (you may need to tweak if your anchors may have additional attributes other than href):
$hrefPattern = "(?P<href>https://docs\.google\.com/file/d/[a-z0-9]+/edit\?usp=drive_web)";
$innerPattern = "(?P<inner>.*?)";
$anchorPattern = "<a href=\"$hrefPattern\">$innerPattern</a>";
preg_match_all("#$anchorPattern#i", $file, $matches);
This will give you something like:
[
0 => ['<a href="https://docs.google.com/file/d/foo/edit?usp=drive_web"><span>More foo</span></a>'],
"href" => ["https://docs.google.com/file/d/foo/edit?usp=drive_web"],
"inner" => ["<span>More foo</span>"]
]
And absolutely, you should use the DOM for this.
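For comparison, a DOM-based version might look roughly like the sketch below. The sample markup, the $links array shape, and the choice of starts-with() as the URL test are assumptions for illustration.

```php
<?php
// Hypothetical sample input; in the question this would be the contents of $file.
$file = '<p><a href="https://docs.google.com/file/d/abc123/edit?usp=drive_web" class="x">'
      . '<span>More foo</span></a> <a href="https://example.com/">Other</a></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep warnings about imperfect HTML quiet
$doc->loadHTML($file);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//a[starts-with(@href, "https://docs.google.com/file/d/")]') as $a) {
    $inner = '';
    foreach ($a->childNodes as $child) {
        $inner .= $doc->saveHTML($child);   // serialize each child to rebuild the inner HTML
    }
    $links[] = ['href' => $a->getAttribute('href'), 'inner' => $inner];
}

print_r($links);
```

Extra attributes on the anchor, attribute order, and quoting style no longer matter, which is the main reason to prefer this over a pattern.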
Replace (.*) with (.*?) - use lazy quantification:
preg_match('/<a href="https:\/\/docs.google.com\/file\/d\/(.*?)<\/a>/', $file, $matches);
You could use the following regular expression:
/<a.*?href="(https:\/\/docs\.google\.com\/file\/d\/.*?)".*?>(.*?)<\/a>/
Which would give you the URL from the href and the innerHTML.
Break down
<a.*?href=" Matches the opening a tag and any characters up to href="
(https:\/\/docs\.google\.com\/file\/d\/.*?)" Matches (and captures) the URL up to the end of the href (i.e. up to the closing ")
.*?> Matches any remaining characters up to the > that ends the opening a tag
(.*?)<\/a> Matches (and captures) the innerHTML up to the closing a tag (i.e. </a>).
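To get the multidimensional array the question asks for, the same pattern can be passed to preg_match_all with PREG_SET_ORDER. This is a sketch with an invented page fragment; the pattern is rewritten with # delimiters only so the slashes need no escaping.

```php
<?php
// Hypothetical page fragment with two matching links.
$file = '<a href="https://docs.google.com/file/d/abc123/edit?usp=drive_web">First</a> '
      . '<a class="doc" href="https://docs.google.com/file/d/def456/edit?usp=drive_web" target="_blank"><b>Second</b></a>';

$pattern = '#<a.*?href="(https://docs\.google\.com/file/d/.*?)".*?>(.*?)</a>#';

// PREG_SET_ORDER groups the results per match instead of per capture group.
preg_match_all($pattern, $file, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    $links[] = ['href' => $m[1], 'inner' => $m[2]];
}

print_r($links);
```

One caveat: because <a.*? is a lazy dot, a non-matching anchor earlier in the page can still become the start of a match that runs into a later link.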
Dave,
The DOM would be better. But here is the Regex that works.
$url = 'href="https://docs.google.com/file/d/aBunchOfLettersAndNumbers/edit?usp=drive_web"';
preg_match ('/href="https:\/\/docs.google.com\/file\/d\/(.*?)"/', $url, $matches);
Results:
array (size=2)
0 => string 'href="https://docs.google.com/file/d/aBunchOfLettersAndNumbers/edit?usp=drive_web"' (length=82)
1 => string 'aBunchOfLettersAndNumbers/edit?usp=drive_web' (length=44)
You can add the HTML tags back in, but most importantly, the pattern in your question's preg_match line didn't include the closing > of the opening tag, which threw it off, and it needed (.*?) instead of (.*). The added ? makes the quantifier lazy, so it matches as few characters as possible. A plain (.*) is greedy and matches as many characters as it can.
I have a string:
$string = 'This is my big <span class="big-string">string</span>';
I cannot figure out how to write a regular expression that will replace the 'b' in 'big' without replacing the 'b' in 'big-string'. I need to replace all occurrences of a substring except when that substring appears in an HTML tag.
Any help is appreciated!
Edit
Maybe some more info will help. I'm working on an autocomplete feature that highlights whatever you're searching for in the current result set. Currently, if you have typed 'aut' in the search dialog, the results look like this: <b>aut</b>omotive
The problem appears when I search for 'auto b'. First I replace all occurrences of 'auto' with '<b>auto</b>', then I replace all occurrences of 'b' with '<b>b</b>'. Unfortunately this second sweep changes '<b>auto</b>' to '<<b>b</b>>auto</<b>b</b>>'.
Please do not try to parse HTML using regular expressions. Just load up the HTML in a DOM, walk over the text nodes and do a simple str_replace. You'll thank me around debugging time.
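A minimal sketch of that approach, assuming the fragment from the question. The wrapper div, its id, and the replacement word 'small' are invented for illustration.

```php
<?php
$string = 'This is my big <span class="big-string">string</span>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
// Wrap the fragment so its contents can be pulled back out afterwards.
$doc->loadHTML('<div id="wrap">' . $string . '</div>');
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Only text nodes are touched, so attribute values such as class="big-string" are left alone.
foreach ($xpath->query('//text()') as $textNode) {
    $textNode->nodeValue = str_replace('big', 'small', $textNode->nodeValue);
}

$result = '';
foreach ($xpath->query('//div[@id="wrap"]')->item(0)->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}

echo $result, PHP_EOL;
```

Wrapping a replacement in new markup (such as <b> for highlighting) takes a bit more work, since the text node then has to be split and new elements inserted. The selection of text nodes stays the same.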
Is there a guarantee that 'big' won't be immediately preceded by "? If so, then s/([^"])b/$1foo/ should replace the b in question with foo.
If you insist upon using a regex, this one will do a pretty decent job:
$re = '/# (Crudely) match a sub-string NOT in an HTML tag.
    big       # The sub-string to be matched.
    (?=       # Assert we are not inside an HTML tag.
      [^<>]*  # Consume all non-<> up to...
      (?:<\w+ # either an HTML start tag,
      | $     # or the end of string.
      )       # End group of valid alternatives.
    )         # End "not-in-html-tag" lookahead assertion.
    /ix';
Caveats: This regex has very real limitations. The HTML must not have any angle brackets in the tag attributes. This regex also finds the target substring inside other parts of the HTML file such as comments, scripts and stylesheets, and this may not be desirable.
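Applied to the string from the question, the pattern can be tried out like this. The replacement word 'small' is just an example.

```php
<?php
$string = 'This is my big <span class="big-string">string</span>';

$re = '/# (Crudely) match a sub-string NOT in an HTML tag.
    big       # The sub-string to be matched.
    (?=       # Assert we are not inside an HTML tag.
      [^<>]*  # Consume all non-<> up to...
      (?:<\w+ # either an HTML start tag,
      | $     # or the end of string.
      )       # End group of valid alternatives.
    )         # End "not-in-html-tag" lookahead assertion.
    /ix';

$result = preg_replace($re, 'small', $string);
echo $result, PHP_EOL;
```

One more limitation that follows from the lookahead as written: it accepts only a start tag or the end of the string, so text that is immediately followed by a closing tag (for example the big in <b>big</b>) is not matched.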