Matching a Specific URL Pattern with PHP

I'm trying to read an HTML file and capture all anchor tags that match a specific URL pattern in order to display those links on another page. The pattern looks like this:
https://docs.google.com/file/d/aBunchOfLettersAndNumbers/edit?usp=drive_web
I'm lousy with RegEx. I've tried a bunch of things and read a bunch of answers here on Stack Overflow, but I'm not hitting on the correct syntax.
Here's what I have now:
preg_match ('/<a href="https:\/\/docs.google.com\/file\/d\/(.*)<\/a>/', $file, $matches)
When I test this on an HTML page with two matching anchor tags, the first result includes the first and second match and everything in between, while the second result includes part of the first match, part of the second match, and everything in between.
While I'd be happy to capture matching anchor tags along with the inner HTML, I'd be even happier if I could generate a multidimensional array with the HREF attribute of each matching anchor tag, along with the matching inner HTML (so I can format the links myself, without having to use even more RegEx to get rid of unwanted attributes). Would I use preg_match_all for that? What would that look like?
Am I even on the right path here, or should I be using DOM and XPath queries to find this stuff?
Thanks.

Oh jeez, I can't believe every answer here uses "/" delimiters. If your pattern has slashes in it, use a different delimiter for the sake of readability.
Here's a better answer (you may need to tweak it if your anchors can have attributes other than href):
$hrefPattern = "(?P<href>https://docs\.google\.com/file/d/[a-z0-9]+/edit\?usp=drive_web)";
$innerPattern = "(?P<inner>.*?)";
$anchorPattern = "<a href=\"$hrefPattern\">$innerPattern</a>";
preg_match_all("#$anchorPattern#i", $file, $matches);
This will give you something like:
[
0 => ['<a href="https://docs.google.com/file/d/foo/edit?usp=drive_web"><span>More foo</span></a>'],
"href" => ["https://docs.google.com/file/d/foo/edit?usp=drive_web"],
1 => ["https://docs.google.com/file/d/foo/edit?usp=drive_web"],
"inner" => ["<span>More foo</span>"],
2 => ["<span>More foo</span>"]
]
And absolutely, you should use the DOM for this.
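For comparison, here is a minimal sketch of the DOM route using PHP's bundled DOMDocument and DOMXPath classes. The sample markup in $file and the variable names are made up for illustration; it assumes the links you want are exactly the anchors whose href starts with the Google Docs file prefix.

```php
<?php
// Hypothetical sample input; in the question this would be the HTML read into $file.
$file = '<p><a href="https://docs.google.com/file/d/abc123/edit?usp=drive_web"><span>More foo</span></a> '
      . '<a href="https://example.com/">Other</a></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid; silence parser warnings
$doc->loadHTML($file);
libxml_clear_errors();

$xpath  = new DOMXPath($doc);
$prefix = 'https://docs.google.com/file/d/';
$links  = [];

// Select only anchors whose href begins with the wanted prefix.
foreach ($xpath->query('//a[starts-with(@href, "' . $prefix . '")]') as $a) {
    // Rebuild the inner HTML by serializing each child node.
    $inner = '';
    foreach ($a->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    $links[] = ['href' => $a->getAttribute('href'), 'inner' => $inner];
}

print_r($links);
```

Each entry of $links holds the href and the inner HTML, so other attributes on the anchor never need to be stripped by hand.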

Replace (.*) with (.*?) - use lazy quantification:
preg_match('/<a href="https:\/\/docs.google.com\/file\/d\/(.*?)<\/a>/', $file, $matches);
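To see the difference, here is a small sketch that runs the greedy and the lazy version against the same input. The two-link sample string is invented for the demonstration, and the pattern uses # as its delimiter purely to avoid escaping the slashes.

```php
<?php
// Hypothetical input with two matching anchors.
$file = '<a href="https://docs.google.com/file/d/abc/edit?usp=drive_web">One</a> and '
      . '<a href="https://docs.google.com/file/d/def/edit?usp=drive_web">Two</a>';

// Greedy: (.*) grabs as much as it can, so it runs to the LAST </a>.
preg_match('#<a href="https://docs\.google\.com/file/d/(.*)</a>#', $file, $greedy);

// Lazy: (.*?) grabs as little as it can, so it stops at the FIRST </a>.
preg_match('#<a href="https://docs\.google\.com/file/d/(.*?)</a>#', $file, $lazy);

echo $greedy[1], "\n";
echo $lazy[1], "\n";
```

The lazy capture still contains the rest of the href and the closing quote, which is why the other answers split the href and the link text into separate groups.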

You could use the following regular expression:
/<a.*?href="(https:\/\/docs\.google\.com\/file\/d\/.*?)".*?>(.*?)<\/a>/
Which would give you the URL from the href and the innerHTML.
Break down
<a.*?href=" Matches the opening a tag and any characters up until href="
(https:\/\/docs\.google\.com\/file\/d\/.*?)" Matches (and captures) until the end of the href (i.e. until the closing ")
.*?> Matches all characters to the end of the a tag >
(.*?)<\/a> Matches (and captures) the innerHTML until the closing a tag (i.e. </a>).
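As a rough sketch of how that pattern could produce the multidimensional array asked for in the question, the snippet below uses preg_match_all with the PREG_SET_ORDER flag. The sample HTML and the $links array layout are assumptions for illustration; the s modifier is added so that link text spanning several lines still matches.

```php
<?php
// Hypothetical input: two matching anchors, the second with an extra attribute.
$file = '<a href="https://docs.google.com/file/d/abc123/edit?usp=drive_web">First</a> text '
      . '<a class="x" href="https://docs.google.com/file/d/def456/edit?usp=drive_web"><b>Second</b></a>';

// Same pattern as above, with # as the delimiter so the slashes need no escaping.
$pattern = '#<a.*?href="(https://docs\.google\.com/file/d/.*?)".*?>(.*?)</a>#s';

// PREG_SET_ORDER groups the results per match instead of per capture group.
preg_match_all($pattern, $file, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    $links[] = ['href' => $m[1], 'inner' => $m[2]];
}

print_r($links);
```

Be aware that <a.*?href= can skip over a non-matching anchor to reach a later href, so on messy pages the DOM approach is safer.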

Dave,
The DOM would be better. But here is the Regex that works.
$url = 'href="https://docs.google.com/file/d/aBunchOfLettersAndNumbers/edit?usp=drive_web"';
preg_match ('/href="https:\/\/docs.google.com\/file\/d\/(.*?)"/', $url, $matches);
Results:
array (size=2)
0 => string 'href="https://docs.google.com/file/d/aBunchOfLettersAndNumbers/edit?usp=drive_web"' (length=82)
1 => string 'aBunchOfLettersAndNumbers/edit?usp=drive_web' (length=44)
You can add the HTML tags back in, but most importantly, the preg_match line in your question didn't contain the closing > of the opening tag, which threw it off, and it needed (.*?) instead of (.*). On its own, (.*) is greedy: it matches as many characters as possible, so it runs all the way to the last </a> in the file. The added ? makes it lazy, so it stops at the first point where the rest of the pattern can match.

Related

PHP preg_replace_callback match string but exclude urls

What I'm trying to do is find all the matches within a content block, but ignore anything that is inside tags, for use inside preg_replace_callback().
For example:
test
test title
test
In this case, I want the first line to match, and the third line to match, but NOT the url match, nor the title match in between the a tags.
I've got a regex that I feel like is close:
#(?!<.*?)(\btest\b)(?![^<>]*?>)#si
(and this will not match the url part)
But how do I modify the regex to also exclude the "test" between <a> and </a>?
If it's always the same pattern you can use [A-Z] or a combination like [A-Za-z]
I ended up solving it myself. This regex pattern will do what I wanted:
#(?!<a[^>]*?>)(\btest\b)(?![^<]*?<\/a>)#si
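Here is a short sketch of that pattern inside preg_replace_callback(). The sample content (including the test.com URL) and the <b> wrapper in the callback are invented for the demonstration; only the pattern comes from the post above.

```php
<?php
// Hypothetical content block: plain "test", a link containing "test", plain "test".
$content = "test\n<a href=\"http://test.com\">test title</a>\ntest";

$pattern = '#(?!<a[^>]*?>)(\btest\b)(?![^<]*?<\/a>)#si';

// Wrap every match that is NOT followed by "...</a>" (with no "<" in between).
$result = preg_replace_callback($pattern, function (array $m): string {
    return '<b>' . $m[1] . '</b>';
}, $content);

echo $result, "\n";
```

The trailing negative lookahead does the real work: both the "test" in the URL and the "test" in the link text are followed by text that reaches </a> without crossing another <, so they are skipped.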

preg_match link text with less-than sign in it

I'm trying to extract information from HTML files into a DB, and I just found that a link's text can look like this:
channel crosstalk: <60dB
Therefore my regular expression doesn't find that link:
preg_match_all('|<a href="/blabla/([0-9]+)"[^>]*>([^<]*)</a>|Uis',$html,$matches);
This is a part of big regular expression, I just simplified it for example.
It's hard to tell what you are trying to pull. Are you looking for the entire link? Or are you looking to grab parts from the link (hence the parenthesis)? Here is a solution for getting the individual contents in the link:
preg_match_all( '#(.*?)#i', $html, $matches);
The first element of matches will be the entire link, while the other elements will be the sub parts.
Or here is one for just the entire link:
preg_match_all( "#(<a.*>.*</a>)#i", $html, $matches );
Or here is a slightly modified version of yours. Yours currently isn't matching because ([^<]*) says the link contents may not contain an angle bracket, and this link's contents do. Note that I dropped the U modifier: with U set, (.*?) would become greedy and run on to the last </a>.
preg_match_all( '|<a href="/blabla/([0-9]+)"[^>]*>(.*?)</a>|is', $html, $matches );
Again, not 100% sure the exact results you are looking for, but maybe this will get your going and you can make modifications as needed.
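A quick way to check a pattern like this is to run it on a sample that contains the awkward link text. The sketch below assumes made-up ids (12 and 34) and a made-up class attribute, and uses a lazy (.*?) without the U modifier.

```php
<?php
// Hypothetical input: the first link text contains a bare "<".
$html = '<a href="/blabla/12" class="spec">channel crosstalk: <60dB</a> '
      . '<a href="/blabla/34">other</a>';

// (.*?) accepts any characters, including "<", up to the first </a>.
preg_match_all('|<a href="/blabla/([0-9]+)"[^>]*>(.*?)</a>|is', $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    echo $m[1], ' => ', $m[2], "\n";
}
```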
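You can use this regex to extract the href and the link text (it uses [^>]*? rather than [^>]+? after the href, so that anchors with no further attributes also match):
<a[^>]+?href="(.*?)"[^>]*?>(.*?)</a>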
Group 1: href
Group 2: link text
This is the fundamental issue with trying to regex HTML. This is not really good HTML, because content that is not meant to be interpreted as HTML should be escaped as HTML entities (i.e. &lt; instead of <). You won't always be able to handle that though.
In your case, something like this works for regex:
|.*?|Uis
The matching group gets shifted. This also allows nested tags (like <a><b><i></i></b></a>).
Keep in mind that the U (ungreedy) modifier you used inverts the greediness of every quantifier, which means that you can be a little more lax in your regex matching. If you wanted to do this without the U modifier you'd maybe need to use a negative lookahead:
|(?:(?!).)*</a>|is

What's a smart approach to parsing "forum-style" tags within a string of random HTML?

So I'm working with some pretty awesome HTML strings stored in our DB and I need to be able to parse out the string between the "forum-style" youtube tags as in the example below. I have a solution, but it feels a bit hackish. I'm thinking there's probably a more elegant way to handle this problem.
<?php
$video_string = '<p><span style="font-size: 12px;"><span style="font-family: verdana,geneva,sans-serif;">[youtube]KbI_7IHAsyw[/youtube]<br /></span></span></p>';
$matches = array();
preg_match('/\][_A-Za-z0-9]+\[/', $video_string, $matches);
$yt_vid_key = substr($matches[0], 1, strlen($matches[0]) - 2 );
I'd change the regex a bit:
'/\[youtube\](.*?)\[\/youtube\]/is'
Adding the 'youtube' part to not replace ALL bb-codes - only the right ones.
I've also added the '?' to make the regex lazy (in case there are multiple YT videos in one post).
I added the pattern modifiers i and s so that the match is case-insensitive and the dot also matches newlines (for multiline strings).
Edit:
You may also rather want to use preg_replace, it'll be a bit less code that way.
Try this:
preg_match('!\[youtube\]([_A-Za-z0-9]+?)\[/youtube\]!',$subject, $matches);
$yt_vid_key = $matches[1];
If you expect multiple occurrences, use preg_match_all instead.
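A minimal sketch of the preg_match_all variant follows. The second video key and the mixed-case tag are invented sample data, and the i modifier is an addition so that [YouTube] is accepted as well.

```php
<?php
// Hypothetical post containing two forum-style youtube tags.
$video_string = '<p>[youtube]KbI_7IHAsyw[/youtube]<br />[YouTube]abc123XYZ_0[/YouTube]</p>';

// Group 1 captures the key between the tags; "!" is the delimiter so "/" needs no escaping.
preg_match_all('!\[youtube\]([_A-Za-z0-9]+)\[/youtube\]!i', $video_string, $matches);

$yt_vid_keys = $matches[1];   // one entry per tag found
print_r($yt_vid_keys);
```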
All of the answers provided here are correct if you don't expect nested tags. If you do, then you have to come up with a way to match the tags properly, which can't really be done with a plain regex, so you will have to write some code to handle it.
Here is some pseudocode to help you out:
find the opening tag, then look for its matching close tag
openTags = 1 (the opening tag just found)
closeTags = 0
position = end of the opening tag
do {
move through the string: increase position
if an open tag starts at position: openTags++
if a close tag starts at position: closeTags++, positionOfCloseTag = position
} while (openTags > closeTags);
When the loop ends, positionOfCloseTag is the close tag that matches the opening tag.
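One possible PHP rendering of that idea is sketched below. The function name extractBalanced, the [quote] tags, and the sample text are all hypothetical; instead of stepping one character at a time it jumps between tag positions with strpos and keeps a depth counter.

```php
<?php
// Returns the text between the first $open tag and its matching $close tag,
// or null if no balanced pair exists.
function extractBalanced(string $s, string $open, string $close): ?string
{
    $start = strpos($s, $open);
    if ($start === false) {
        return null;
    }
    $pos = $start + strlen($open);
    $contentStart = $pos;
    $depth = 1;                          // the opening tag just found

    while ($depth > 0) {
        $nextOpen  = strpos($s, $open, $pos);
        $nextClose = strpos($s, $close, $pos);
        if ($nextClose === false) {
            return null;                 // ran out of close tags: unbalanced input
        }
        if ($nextOpen !== false && $nextOpen < $nextClose) {
            $depth++;                    // a nested open tag comes first
            $pos = $nextOpen + strlen($open);
        } else {
            $depth--;                    // a close tag comes first
            if ($depth === 0) {
                return substr($s, $contentStart, $nextClose - $contentStart);
            }
            $pos = $nextClose + strlen($close);
        }
    }
    return null;
}

$text   = 'a [quote]outer [quote]inner[/quote] tail[/quote] z';
$result = extractBalanced($text, '[quote]', '[/quote]');
echo $result, "\n";
```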

PHP Regex of Anchor with Class to get Inner Text

<a href="/search?hl=en&pwst=1&sa=X&ei=RCPqTqkHycryA_bK_f0J&ved=0CCUQvwUoAQ&q=psychology&spell=1" class=spell><b><i>psychology</i></b></a>
Hi, I'm looking to create a regex which matches this anchor and returns the inner text of it.
This is what I've been trying as a regex but without success.
'/<a[^>]+class=\"spell\"[^>]*>(.*?)<\/a>/isU'
It's probably something really silly. Thanks.
Problem was missing quotes surrounding the class. Not proper html markup but I neglected to notice so I just changed my regex to have quotes as optional.
Final regex:
'/<a[^>]+class=\"?spell\"?[^>]*>(.*?)<\/a>/is'
The regex looks OK, although you don't need to escape the quotes. Perhaps PHP doesn't like it if you use unnecessary escapes, although I doubt it. The problem is more likely the way you're using the regex. Did you access group number 1?
if (preg_match('%<a[^>]+class="spell"[^>]*>(.*?)</a>%', $subject, $regs)) {
$result = $regs[1];
}
Your problem might be the combination of (.*?) and /isU modifier. That U alters the meaning of ? making your match group (.*) greedy actually. Then you will match parts beyond the <\/a> end marker, until it encounters another.
If you remove the /U it works as expected. With your given input text, at least.
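The effect is easy to reproduce once there is a second anchor after the first. In the sketch below the sample HTML is invented, and the pattern is the final regex from above written with % delimiters and without the unnecessary quote escapes.

```php
<?php
// Hypothetical input: two anchors with an unquoted class attribute.
$html = '<a href="/s" class=spell><b><i>psychology</i></b></a> and '
      . '<a href="/t" class=spell>other</a>';

// Without U, (.*?) is lazy and stops at the first </a>.
preg_match('%<a[^>]+class="?spell"?[^>]*>(.*?)</a>%is', $html, $withoutU);

// With U, greediness is inverted: (.*?) becomes greedy and runs to the last </a>.
preg_match('%<a[^>]+class="?spell"?[^>]*>(.*?)</a>%isU', $html, $withU);

echo $withoutU[1], "\n";
echo $withU[1], "\n";
```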
Here are two options to fix your expression:
For starters, you can simplify your expression to:
class=\"?spell\"?[^>]*>(.*?)<\/a>
This captures
<b><i>psychology</i></b>
in Group 1. I assume this is what you want to achieve. (The quotes are optional because your sample markup has class=spell without them.)
Then, if you want to capture "psychology" without the bold and italic tags, you can use:
class=\"?spell\"?[^>]*>\s*<(\w+)>?\s*<(\w+)>?\s*(.*?)<\/\2>\s*<\/\1>\s*<\/a>
This captures "psychology" in group 3.
In group 1, you will find the name of the first wrapping tag, whether it be "b" or "strong".
In group 2, you will find the name of the second wrapping tag, which was "i" in your example.
Note that both wrapping tags must be present for this pattern to match; the ? only makes each tag's closing > optional.
The multiple instances of \s* allow for optional space between the tags.
Is this what you were looking for?

PHP/Perl Regular expression help!

I have a string:
$string = 'This is my big <span class="big-string">string</span>';
I cannot figure out how to write a regular expression that will replace the 'b' in 'big' without replacing the 'b' in 'big-string'. I need to replace all occurrences of a substring except when that substring appears in an HTML tag.
Any help is appreciated!
Edit
Maybe some more info will help. I'm working on an autocomplete feature that highlights whatever you're searching for in the current result set. Currently, if you have typed 'aut' in the search dialog, then the results look like this: <b>aut</b>omotive
The problem appears when I search for 'auto b'. First I replace all occurrences of 'auto' with '<b>auto</b>' then I replace all occurrences of 'b' with '<b>b</b>'. Unfortunately this second sweep changes '<b>auto</b>' to '<<b>b</b>>auto</<b>b</b>>'
Please do not try to parse HTML using regular expressions. Just load up the HTML in a DOM, walk over the text nodes and do a simple str_replace. You'll thank me around debugging time.
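A minimal sketch of that approach with DOMDocument and DOMXPath is shown below. The wrapping <p> and the replacement word 'small' are made up for the demonstration. It only rewrites text, so the class attribute is left alone; wrapping matches in new <b> elements would need node splitting rather than a plain str_replace.

```php
<?php
// Hypothetical input based on the string from the question.
$html = '<p>This is my big <span class="big-string">string</span></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate imperfect markup
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// //text() selects text nodes only, so tag names and attributes are never touched.
foreach ($xpath->query('//text()') as $textNode) {
    $textNode->nodeValue = str_replace('big', 'small', $textNode->nodeValue);
}

$result = $doc->saveHTML();
echo $result;
```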
Is there a guarantee that 'big' won't be immediately preceded by "? If so, then s/([^"])b/$1foo/ should replace the b in question with foo.
If you insist upon using a regex, this one will do a pretty decent job:
$re = '/# (Crudely) match a sub-string NOT in an HTML tag.
big # The sub-string to be matched.
(?= # Assert we are not inside an HTML tag.
[^<>]* # Consume all non-<> up to...
(?:<\/?\w+ # either an HTML start or end tag,
| $ # or the end of string.
) # End group of valid alternatives.
) # End "not-in-html-tag" lookahead assertion.
/ix';
Caveats: This regex has very real limitations. The HTML must not have any angle brackets in the tag attributes. This regex also finds the target substring inside other parts of the HTML file such as comments, scripts and stylesheets, and this may not be desirable.
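For a quick trial, here is a compact sketch of the same lookahead used with preg_replace. It is written on one line without the x modifier, the replacement word 'small' is invented, and it assumes the lookahead should accept a closing tag as well as a start tag so that text directly before </span> can still match.

```php
<?php
$string = 'This is my big <span class="big-string">string</span>';

// Match "big" only when the text after it reaches the next tag (or the end of
// the string) without crossing a "<" or ">" first, i.e. when we are outside a tag.
$re = '/big(?=[^<>]*(?:<\/?\w+|$))/i';

$result = preg_replace($re, 'small', $string);
echo $result, "\n";
```

The "big" inside class="big-string" is followed by a > before any <, so the lookahead fails there and the attribute is left untouched.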
