Negative lookbehind and greedy quantifiers in PHP

I'm using a regex to find any URLs and link them accordingly. However, I do not want to linkify any URLs that are already linked so I'm using lookbehind to see if the URL has an href before it.
This fails, though, because PCRE (and therefore PHP) does not allow unbounded quantifiers such as \s* inside a lookbehind. Lookaheads have no such restriction.
Here's the regex for the match:
/\b(?<!href\s*=\s*[\'\"])((?:http:\/\/|www\.)\S*?)(?=\s|$)/i
What's the best way around this problem?
EDIT:
I have yet to test it, but I think the trick to doing it in a single regex is to use conditional expressions within the regex, which are supported by PCRE. It would look something like this:
/(href\s*=\s*[\'\"])?(?(1)^|)((?:http:\/\/|www\.)\w[\w\d\.\/]*)(?=\s|$)/i
The key point is that if the href is captured, the match is immediately thrown out due to the conditional (?(1)^|), which is guaranteed to not match.
There's probably something wrong with it. I'll test it out tomorrow.

I tried doing the same thing the other way round: ensure that the URL doesn't end in ">:
/((?:http:\/\/|www\.)(?:[^"\s]|"[^>]|(*FAIL))*?)(?=\s|$)/i
But for me that looks pretty hacky, I'm sure you can do better.
My second approach is more similar to yours (and thus is more precise):
/href\s*=\s*"[^"]*"(*SKIP)(*FAIL)|((?:http:\/\/|www\.)\S*?)(?=\s|$)/i
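If I find an href=, I (*SKIP)(*FAIL). This means that the match attempt fails and the regex engine resumes at the position it had reached when it encountered the (*SKIP), instead of retrying from the next character, so the already-linked URL is never examined on its own.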
But that's no less hacky and I'm sure there is a better alternative.
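For what it's worth, here is a minimal sketch of how the (*SKIP)(*FAIL) pattern above could be plugged into preg_replace_callback. The sample input in $html and the way the link is built are my own assumptions, not part of the original answer, and the pattern uses ~ as the delimiter so the slashes need no escaping.

```php
<?php
// Made-up sample input: one URL already inside an href, one bare URL.
$html = 'See <a href="http://example.com/a">docs</a> and http://example.org/b too';

// First alternative consumes href="..." and then forces a failure,
// so only URLs outside of an href attribute reach the second alternative.
$pattern = '~href\s*=\s*"[^"]*"(*SKIP)(*FAIL)|((?:http://|www\.)\S*?)(?=\s|$)~i';

$linked = preg_replace_callback($pattern, function (array $m): string {
    $url = $m[1];
    // Prepend a scheme for "www." style matches so the href is usable.
    $href = stripos($url, 'http://') === 0 ? $url : 'http://' . $url;
    return '<a href="' . htmlspecialchars($href) . '">' . htmlspecialchars($url) . '</a>';
}, $html);

echo $linked, "\n";
```

Note that a URL used as the visible text of an existing link (between <a ...> and </a>) is not protected by this pattern; only the href attribute itself is skipped.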

Finding "every URL that isn't part of a link" is quite difficult negative logic. It may be easier to find every URL, then every URL that's a link, and remove each of the latter from the former list.
As far as finding which URLs are a part of a link, try:
/<a([\s]+[\w="]+)*[\s]+href[\s]*=[\s]*"([\w\s:\/.?+&=]+)"([\s]+[\w="]+)*>/i
I tested it with http://regexpal.com/ to be sure. It looks for the <a first, then it allows for any number of parameters, followed by href, followed by any other number of parameters. If it doesn't have the href, it's not a link. If it isn't an <a> tag, it's not a link. Since this is just the list of what we want to remove from the other list (of URLs), I simplified the definition of a URL to [\w\s:/.?+&=]+. As far as generating a list of URLs, you'll want something smarter.
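The "two lists" idea could be sketched in PHP roughly like this. The sample $html and the simplified "any URL" pattern are assumptions for illustration; the link pattern is the one from this answer, written with ~ as the delimiter.

```php
<?php
// Made-up sample input: one linked URL, one bare URL.
$html = 'Visit <a href="http://example.com/a">here</a> or http://example.org/b now';

// List 1: every URL-like token (deliberately simplistic).
preg_match_all('~(?:http://|www\.)[^\s"<>]+~i', $html, $all);

// List 2: every URL that sits in the href of an <a> tag.
preg_match_all(
    '~<a(?:\s+[\w="]+)*\s+href\s*=\s*"([\w\s:/.?+&=]+)"(?:\s+[\w="]+)*>~i',
    $html,
    $linked
);

// Remove the linked URLs from the full list.
$unlinked = array_values(array_diff($all[0], $linked[1]));

print_r($unlinked);
```

One caveat of this design: array_diff compares by value, so a URL that appears both inside a link and as bare text elsewhere is dropped from the result entirely.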

I don't have a better regex, but if you do not find one, I would suggest doing the task in two passes. First, find and remove all links, and then search for URLs. This would be easier and possibly faster.
(To find and replace in one go, you can use something like http://www.satya-weblog.com/2010/08/php-regex-find-and-replace-any-word-string-or-text-at-one-go.html.)
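A rough sketch of the two-pass idea, assuming a made-up $html string and deliberately simple patterns (neither is from the original answer):

```php
<?php
// Made-up sample input: a linked URL and a bare "www." address.
$html = 'Visit <a href="http://example.com/a">http://example.com/a</a> or www.example.org now';

// Pass 1: drop every complete <a ...>...</a> element.
$withoutLinks = preg_replace('~<a\b[^>]*>.*?</a>~is', '', $html);

// Pass 2: search the remaining text for URLs.
preg_match_all('~(?:http://|www\.)\S+~i', $withoutLinks, $m);
$bareUrls = $m[0];

print_r($bareUrls);
```

This only yields the list of unlinked URLs; since pass 1 changes the text, the match positions no longer line up with the original string.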

Related

RegEx for URL pattern(s), redirect after re-launch

I have been struggling and testing for the last two hours and simply cannot wrap my head around the whole RegEx-stuff enough in order to find a proper solution to this...
I am trying to redirect a couple of URLs from our old site to the new one due to a recent re-launch.
This is the current state of things / a demo of my RegEx
Essentially it looks like this:
.+(\/es|\/de|\/en)?\/(legal)(.+)?
My problem is that a URL like https://example.com/es/projects/legal-yeah is also being matched, which does make sense looking at the rule but is not what I want to achieve...
How can I perform a test which only matches URLs where there is nothing in between the first part for the language string (de/en/es/empty) and the second part (/legal)?
Thanks so much for sharing your thoughts on this, appreciate it!
By using an end-of-line anchor $ and explicitly adding (\/.+)? after legal you can achieve what you need:
.+(\/es|\/de|\/en)?\/(legal)(\/.+)?$
https://regex101.com/r/HsIDkQ/8
This final RegEx rule matches the URLs as I intended: it ignores any other occurrences of the "legal" string (in this case) that might appear in another URL on some other level, and it is 'fuzzy' enough to include all the language cases, even when no language string appears at all.
Solution
The trick in the end was to force the rule to look for a TLD in front of the other stuff so it would only allow for first-level URLs to be included.
UPDATE: My first solution turned out not to work, since the redirection engine / plugin only uses the URL path, not including the domain (see GitHub issue), so I can't match the dot as the required predecessor.
Now the rule pays attention to the start of the string and accepts nothing other than the language string in front of the targeted URL slug, which in turn removes false positives.
Thanks to @Xatenev who pointed me in the right direction!
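The final rule itself isn't quoted above, so the following is only a guess at what a start-anchored, path-only version might look like, tested in PHP. The pattern in $rule and the sample paths are assumptions.

```php
<?php
// Hypothetical rule: anchored at the start of the path, an optional
// language prefix, then /legal, then either the end or a further sub-path.
$rule = '~^(?:/(?:es|de|en))?/legal(?:/.+)?$~';

$paths = [
    '/legal',
    '/de/legal',
    '/en/legal/imprint',
    '/es/projects/legal-yeah',
    '/es/projects/legal',
];

$results = [];
foreach ($paths as $path) {
    // preg_match returns 1 on a match, 0 otherwise.
    $results[$path] = preg_match($rule, $path) === 1;
    echo str_pad($path, 26), $results[$path] ? 'match' : 'no match', "\n";
}
```

Because of the ^ anchor, nothing but the optional language segment can come before /legal, which is what rules out paths like /es/projects/legal-yeah.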

how to extract web address from a string in php

I know how to extract web address from url like this:
https://www.youtube.com/watch?v=j__wz7NtNgM
I can extract "youtube.com" from it.
I have no idea how to extract web address from a string like this
my fav website is youtube.com snel ip
How to extract "youtube.com" from it?
You can use different functions, for example:
preg_match
trim
parse_url
I suggest using the parse_url function, since it is the easiest to work with.
To check the URL with the preg_match function, you have to write out the whole
regex for a URL. With the trim function you can strip parts off the ends,
e.g. the http://xxx part or the part after it.
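As a small sketch of the parse_url route (the stripping of a leading "www." is my own assumption about what "youtube.com" should mean here):

```php
<?php
$url = 'https://www.youtube.com/watch?v=j__wz7NtNgM';

// parse_url with PHP_URL_HOST returns just the host part of a full URL.
$host = parse_url($url, PHP_URL_HOST);

// Drop a leading "www." to get the bare domain.
$domain = preg_replace('/^www\./i', '', $host);
echo $domain, "\n";

// parse_url needs something URL-shaped; plain text has no host to return.
$freeText = 'my fav website is youtube.com snel ip';
$hostFromText = parse_url($freeText, PHP_URL_HOST);
var_dump($hostFromText);
```

So parse_url covers the first case in the question, but for a sentence like the second one some kind of pattern matching is still needed.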
Even though you’re implementing it in PHP, it’s rather a question of regular expression, right?
Just like a complete URL, you can match the fragment of hostname and TLD from any text. The former’s got the advantage that, to put it a bit bluntly, it starts with https? which can’t be mistaken so easily. On the other hand, it’s hard to tell apart if it’s a web address or if someone just missed hitting space:
my fav website is youtube.com snel ip and it blows.museum is closed, right?
One possible trade-off is by detecting addresses that start with either a protocol or a dubdubdub:
(https?:\/\/([a-z]+\.)*|www\.)([a-z0-9]+\.[a-z]{2,})(\/)?
That’s a bit safer, but it won’t match your example. So, another imperfect way is to detect links if they have some sort of boundary around them:
(^|\b|\s)([a-z0-9]+\.[a-z]{2,})(\b|\W|$)
You could narrow mismatches by making a whitelist of TLDs like (com|net), but I’d not do that; remember there’s IDNs. If you wish to support something like http://موقع.وزارة-الاتصالات.مصر/, it gets a bit more sophisticated.
The regular expressions above do work, though their intent is to be a mere lead to further adjust to your needs and to propose another solution respectively.
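To make the boundary-based variant concrete, here is a short PHP sketch that runs it over the ambiguous sentence from above. Using preg_match_all and reading the second capturing group are my choices for the illustration.

```php
<?php
// The boundary-based pattern from above, case-insensitive.
$pattern = '/(^|\b|\s)([a-z0-9]+\.[a-z]{2,})(\b|\W|$)/i';

$text = 'my fav website is youtube.com snel ip and it blows.museum is closed, right?';

// Group 2 holds the "hostname.tld" candidate of each match.
preg_match_all($pattern, $text, $m);
$candidates = $m[2];

print_r($candidates);
```

The result contains youtube.com as wanted, but also blows.museum, which shows the false-positive problem described above.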

Regular expressions - Reference the first match in a search

I don't quite know how to describe my problem in a short title so I am sorry if the title for this question is a bit misleading.
But I really don't know what the thing I am looking for is called or if it is even possible.
I am trying to use a regular expression to find everything between a set of matching tags in HTML.
This was easy for me when I was testing with static tags because I could just search for everything in between two pieces of text such as \{myTag\}(someExpression)\{\/myTag\}
My problem comes with the fact that 'myTag' could be anything.
I just don't know how (or if it is even possible) to match the starting tag with the ending tag when that text is variable.
I thought I had seen some kind of referencing system in regular expressions before where you can use the dollar sign and a number, but I don't know if you can use this within the search itself.
I originally thought that perhaps I could write something like: \{(.*?)\}(someExpression)\{\/${1}\} but I have no idea if this would actually work or if it is possible (let alone if it is correct).
I hope this question makes sense as I'm not really sure how to ask it.
Mainly because, like I said, I don't know if this has a name or if it is possible, and I am also a total beginner at regular expressions.
And if it makes any difference, the language I am doing this in is PHP, with the preg_replace_callback function.
Any help would be greatly appreciated.
Try this:
\{([^}]*)\}(someExpression)\{\/\1\}
but be aware that you need to make sure someExpression doesn't match ending tags as well (like for example .* would). And of course, if tags are nested, then all bets are off, and you'll need a different regex (or a parser).
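A minimal sketch of that backreference with preg_replace_callback, assuming a made-up input string and using a lazy (.*?) in place of someExpression:

```php
<?php
// Made-up sample: two properly paired tags and one mismatched pair.
$text = 'a {b}bold{/b} and {i}italic{/i} but {b}broken{/i}';

// \1 must repeat exactly what the first group captured, so {b} only
// pairs with {/b}. The lazy (.*?) stops at the first matching end tag.
$pattern = '~\{([^}]*)\}(.*?)\{/\1\}~s';

$out = preg_replace_callback($pattern, function (array $m): string {
    // $m[1] is the tag name, $m[2] the content between the tags.
    return '<' . $m[1] . '>' . $m[2] . '</' . $m[1] . '>';
}, $text);

echo $out, "\n";
```

The mismatched {b}broken{/i} is left untouched because no {/b} follows it. Nested tags of the same name would still be paired incorrectly, as noted above.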
It kind of depends on your case. If you know it's just an HTML snippet and there is a specific pattern you can search the HTML for, then you can use a regex to find and replace the pattern. But it seems to me you are trying to parse the HTML, so the issue would be nested tags. You should check out http://php.net/manual/en/function.preg-replace.php because that seems like a much easier function to use than the one with the callback.
As a note about backreferences (references back to a capturing group): you can use $i or \i depending on the language you are using. In PHP's PCRE functions, \1 works inside the pattern itself, and $1 or \1 works in the replacement string.

Regular Expressions - Where Angels Fear to Tread

I've just started studying regular expressions in PHP, but I'm having a terrible time following some of the tutorials on the WWW and cannot seem to find anything addressing my current needs. Perhaps I'm trying to learn too much too fast. This aspect of PHP is entirely new to me.
What I'm trying to create is a regular expression to replace all HTML code in between the nth occurrence of <TAG> and </TAG> with any code I choose.
My ultimate goal is to make an Internet filter in PHP through which I can view a web page stripped of certain content (or replaced with sanitized content) between any specified set of tags <TAG>...</TAG> within the page, where <TAG>...</TAG> represents any valid paired HTML tags, such as <B>...</B> or <SPAN>...</SPAN> or <DIV>...</DIV>, etc, etc.
For example, if the page has a porn ad contained in the 5th <DIV>...</DIV> block within the page, what regular expression could be invoked to target and replace that code with something else, like xxxxxxx, but only the 5th <DIV> block within the page and nothing else?
The entire web page is contained within a single text string and the filtered result should also be a single string of text.
I'm not sure, but I think the code to do this could have a format similar to:
$FilteredPage = preg_replace("REG EXPRESSION", "xxxxxxxx", $OriginalPage);
The "REG EXPRESSION" to invoke is what I need to know and the "xxxxxxxx" represents the text to replace the code between the tags targeted by "REG EXPRESSION".
Regular expressions are obviously the work of Satan!
Any general suggestions or perhaps a couple of working examples which I could study and experiment with would be greatly appreciated.
Thanks, Jay
Firstly, are you using the right tool for the job? Regex is a text matching engine, not a fully blown parser - perhaps a dedicated HTML parser will give better results.
Secondly, when approaching any programming problem, try to simplify your problem and build it brick by brick rather than just jumping straight to a final solution. For example, you could:
Start with a simple block of normal english text, and try to match and replace (for example) every occurrence of the word "and".
When that works, wrap it in a loop of PHP that can count up to 5 and only replace the 5th occurrence. Why use regex to count when PHP is so much better at that task?
Then modify your regex to match your 5th HTML tag (which is a bit harder because <> are special characters and need escaping)
By approaching the problem in steps, you will be able to get each part working in turn and build a solid solution that you understand.
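The first two steps could be sketched like this, letting PHP do the counting inside a callback. The helper name replaceNth and the sample sentence are made up for illustration.

```php
<?php
// Hypothetical helper: replace only the $n-th match of $pattern.
function replaceNth(string $pattern, string $replacement, string $subject, int $n): string
{
    $count = 0;
    return preg_replace_callback(
        $pattern,
        function (array $m) use (&$count, $replacement, $n): string {
            $count++;
            // Leave every match as it was, except the $n-th one.
            return $count === $n ? $replacement : $m[0];
        },
        $subject
    );
}

$result = replaceNth('/\band\b/', 'AND', 'salt and pepper and bread and butter', 2);
echo $result, "\n";
```

Once this works on plain text, the same helper can be pointed at a tag pattern for the third step.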
This has been done to death, but please, don't use a regex to parse HTML. Just stop, give up... It is not worth the kittens god will kill for you doing it. Use a real HTML or XML parser.
On a more constructive note, look at XPath as a technology better suited to describing the HTML nodes you might want to replace... or phpQuery and QueryPath.
The reason god kills kittens when you parse HTML with a regex:
HTML is not a regular language, thus a regex can only ever parse very limited HTML. HTML is a context free language, and as such can only be properly parsed with a context free parser.
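For comparison, a parser-based version of the "replace the nth div" task might look roughly like this, using PHP's bundled DOM extension. The sample markup, $n, and the replacement text are assumptions.

```php
<?php
// Made-up sample page with three <div> blocks.
$html = '<html><body><div>one</div><div>two</div><div>three</div></body></html>';
$n = 2; // which <div> to blank out, counted in document order

libxml_use_internal_errors(true); // keep warnings about sloppy HTML quiet
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// (//div)[n] selects the n-th <div> of the whole document.
$nodes = $xpath->query('(//div)[' . $n . ']');

if ($nodes->length > 0) {
    $target = $nodes->item(0);
    // Remove everything inside the div, then insert the placeholder text.
    while ($target->firstChild) {
        $target->removeChild($target->firstChild);
    }
    $target->appendChild($doc->createTextNode('xxxxxxx'));
}

$result = $doc->saveHTML();
echo $result;
```

Unlike the regex approach, this keeps working when the div has attributes or contains nested tags, since the parser tracks the element structure.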
Edit: thank you @Andrew Grimm, this is said much better than I could, as evidenced by the first answer with well over four thousand upvotes!
RegEx match open tags except XHTML self-contained tags
OK, a few ground rules.
Don't post a question like that: putting the whole question in preformatted text will only keep people away.
Regular expressions are awesome!
If you want to consider options, look at how to read HTML as an XML document and parse it using XPath.
@tobyodavies is pretty much correct; I'll include the answer in case you want to do it anyway.
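Now, to your problem. With this one:
$regex = "#<div>(.+?)</div>#si";
You should be OK using that expression and counting the occurrences, much like this (note that it only matches a bare <div> without attributes and does not handle nested divs):
preg_match_all($regex, $htmlcontent, $matches, PREG_SET_ORDER);
Suppose you only need the 5th one. $matches[$i][0] is the whole string of the match at index $i, and indexes start at 0, so the 5th match is $matches[4]:
if (count($matches) >= 5)
{
    $myMatch = $matches[4][0];     // the whole 5th <div>...</div>
    $matchedText = $matches[4][1]; // only the text between the tags
}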
Good luck in your efforts...

Preg_replace regex, newlines, connection resets

I have mixed html, custom code, and regular text I need to examine and change frequently on several, long wiki pages. I'm working with a proprietary wiki-like application and have no control over how the application functions or validates user input. The layout of pages that users add must follow a very specific standard layout and always include very specific text in only certain places - a standard which frequently changes. If users add pages that are so far out of the standard, they will be deleted.
I do not have the resources to manually proof-read and correct all these pages, so automation is the only solution. The fact that all this is obviously a complete waste of time when alternative platforms to do exactly what's needed here exist is already understood.
I've built a PHP-based API to automate this post-validation and frequent restandardization process for me. I've been able to set up regex patterns to handle all this mixed text, and they all work fine for handling single lines. The problem I have is this: poorly formed regex against long text with line breaks can lead to unexpected results, such as connection resets. I have no access to server-side logs to troubleshoot. How do I overcome this?
This is just one example of what I currently have: {column} and {section} tags I'm searching for below can have any number of attributes, and wrap any text. {section} may or may not exist and may or may not be one or more lines under {column}, but it has to be wrapped inside {column}. {column} itself may or may not exist, and if it doesn't, I don't care as I then have some default text inserted later on down the script. I want to grab the inner section contents and wrap it in an html div tag instead. I can't recall the exact pattern I'm using offhand at the moment, but it's close enough...
$pattern = "/\{column:id=summary([|]?([a-zA-Z0-9-_ ]+[:][a-zA-Z0-9-_ ]+[ ]?))\}(.*)({section([|]([a-zA-Z0-9-_ ]+[:][a-zA-Z0-9-_ ]+[ ]?))\}(.*)\{section\}(.*))?{column\}/s";
$replacement = "{html}<div id='summary'>$7</div>{html}";
$text = preg_replace($pattern, $replacement, $subject);
Handling the {column} and {section} attributes and passing only valid HTML parameters to the new html div or a subtext of it is itself a challenge, but my main focus above right now is getting that (.*) value within {section} above without causing a connection reset. Any pointers?
This probably isn't what you're looking for, but: don't use a regex! You're trying to parse some very structured, very complex text, and to do so, you should really use a parser. I don't know what's available for PHP (you can Google just as well as I can, and I'm in no position to make any particular recommendation) but I'm sure something exists.
As for what's causing a connection reset, my only guess is that, since you mention problems with "long text", you're having a memory allocation issue. I don't think your regex will have an unexpectedly huge performance cost, though it might in the non-matching case. But your best option, if you can, is probably to scrap the regex technique and switch to a real parser.
I found the likely source of the crashing issue: catastrophic backtracking (http://www.regular-expressions.info/catastrophic.html). So if refining patterns to handle that doesn't work (and if anyone has any patterns to suggest, please do share), switching to some other text parser solution would be best.
The only real problem I can see is all those (.*)s. In /s mode, each (.*) initially slurps up the whole page, only to have to backtrack most of the way. Change them all to (.*?) (i.e., switch to reluctant quantifiers) and it should work much faster.
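A cut-down sketch of the reluctant-quantifier version, with an explicit check for a failed preg_replace. The pattern is heavily simplified (the section is required here and the attribute syntax is reduced to [^}]*), and the sample $subject is made up.

```php
<?php
// Made-up sample: a {section} wrapped inside a {column}.
$subject = "{column:id=summary}\nintro\n{section}\nKeep this text\n{section}\noutro\n{column}";

// Every wildcard is reluctant (.*?), so the engine grows each piece
// step by step instead of slurping the whole text and backtracking.
$pattern = '~\{column:id=summary[^}]*\}.*?\{section[^}]*\}(.*?)\{section\}.*?\{column\}~s';
$replacement = '{html}<div id="summary">$1</div>{html}';

$text = preg_replace($pattern, $replacement, $subject);

if ($text === null) {
    // preg_replace returns null on error, e.g. when the backtrack limit is hit.
    $reason = preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR
        ? 'backtrack limit reached'
        : 'PCRE error code ' . preg_last_error();
    error_log('Pattern failed: ' . $reason);
    $text = $subject; // fall back to the unchanged input
}

echo $text, "\n";
```

Checking for null at least turns a silent failure into something you can log, which may help when there is no access to server-side logs.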
