The objective of this regular expression is to check whether a web page contains backlink(s) to a given domain and, if so, whether every such <a> tag carries a rel="nofollow" attribute. The result should be True if all matching links have rel="nofollow", and False if any of them lacks it.
From any web page I want to check whether anything like this is present:
<a ... href="http://www.mysite.com/xyz...." ... >
Additionally, I want to check that every such link carries the rel="nofollow" attribute.
The domain www.mysite.com is known, and I want to check for it even within comments or anywhere else in the page.
I could do the above myself, but I can't think of an optimized way to do it with a single pattern.
One unoptimized way is to find all occurrences of <a> tags with href pointing at mysite.com and see whether even a single match lacks rel=nofollow.
Is there a smart, single-line way of writing this as one regular expression pattern?
PS: I don't want to parse the DOM, since it's risky to miss a backlink due to a parsing error, and Google's DOM parser could be different. I want human attention only on those pages whose links can cause a backlink penalty from search engines. If a link within a comment is flagged as a backlink and takes away some human attention, no problem. But links from, say, a porn site must be caught at any cost. Finally, I want to prepare a list of spam links which I can submit to Google Webmaster's Disavow tool. This exercise is a must for every webmaster roughly once a month for every site. And I can't afford this kind of paid service: www.linkdetox.com
Usually, parsing HTML with regex is a bad idea (here's the famous reason why). You risk weird bugs as regex aren't able to fully parse HTML.
However, if your input is "safe" (i.e. not changing a lot, or you're prepared for weird errors), then to answer your question: when you're on the <a> tag you can use something like this to catch links with the href you want and without rel="nofollow":
#<a\s+(?![^>]*rel\s*=\s*(['"])\s*nofollow\s*\1)[^>]*href\s*=\s*(["'])http://www.mysite.com[][\w-.~:/%?##!$&'()*+,;=]*\2[^>]*>
<a\s+ # start of the a tag followed by at least a space
(?! # negative look-ahead: if there isn't...
[^>]* # anything except tag closing bracket
rel\s*=\s* # 'rel=', with spaces allowed
(['"]) # capture the opening quote
\s*nofollow\s* # nofollow
\1 # closing quote is the same as captured opening one
) # end of negative look ahead
[^>]* # anything but a closing tag
href\s*=\s* # 'href=', with spaces allowed
(["']) # capture opening quote
http://www.mysite.com # the fixed part of your url
[][\w-.~:%/?##!$&'()*+,;=]* # url-allowed characters
\2 # closing quote
[^>]*> # "checks" that the tag is ending
Demo: http://regex101.com/r/hC8lV9
Disclaimer
This isn't meant to check whether your input is well-formed or not, this assumes it is well formed. This won't account for stuff like escaped > or escaped quotes, and you very probably will need to adapt it to your needs. Basically, no regex will give a complete answer.
If you need to deal with varied input or with potentially malformed HTML, a parser will do a much safer and better job than a regex.
However, I'm putting this one here to give you an idea of what can be done on this subject, since in a very strict and narrowly defined context a regex can actually be a relevant solution.
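To show the pattern in action, here is a minimal sketch of running it from Python's re module (the page snippet is a made-up example; the character class is transcribed from the breakdown above):

```python
import re

# Hypothetical page snippet: one link with rel="nofollow", one without.
html = ('<p><a href="http://www.mysite.com/xyz" rel="nofollow">ok</a> '
        '<a href="http://www.mysite.com/other">needs attention</a></p>')

# Same idea as the pattern above: an <a> tag with no rel="nofollow"
# anywhere inside it, but with an href starting with the known domain.
pattern = re.compile(
    r'<a\s+'
    r'(?![^>]*rel\s*=\s*([\'"])\s*nofollow\s*\1)'  # no rel="nofollow" in the tag
    r'[^>]*href\s*=\s*(["\'])http://www\.mysite\.com'
    r'[][\w\-.~:/%?#!$&\'()*+,;=]*\2'              # URL-allowed characters
    r'[^>]*>',
    re.IGNORECASE,
)

# A match means at least one backlink is missing rel="nofollow".
print(bool(pattern.search(html)))  # True
```

If `pattern.search()` returns a match, the page has at least one backlink that would need human attention.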
First of all, do not use regular expressions for parsing the DOM of a web page. PHP has its own Document Object Model implementation, which does the whole job. Just have a look at http://de1.php.net/manual/en/class.domdocument.php and http://de1.php.net/manual/en/class.domxpath.php.
Regular Expression
<a(?=[^>]*?rel=nofollow)(?=[^>]*?href="http:\/\/www\.mysite\.com\/.*?")[^>]*?>
How it works
It uses positive lookaheads to validate that the tag contains both rel=nofollow and an href pointing at http://www.mysite.com.
Online demo: http://regex101.com/r/pX0yF5
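A quick sanity check of this pattern, transcribed for Python's re (note that, as written, it matches links that do carry rel=nofollow, spelled without quotes):

```python
import re

# The lookahead version from the answer above, transcribed for Python.
pat = re.compile(
    r'<a(?=[^>]*?rel=nofollow)'
    r'(?=[^>]*?href="http://www\.mysite\.com/.*?")'
    r'[^>]*?>'
)

print(bool(pat.search('<a rel=nofollow href="http://www.mysite.com/x">')))  # True
print(bool(pat.search('<a href="http://www.mysite.com/x">')))               # False
```

Because lookaheads all anchor at the same position, the attribute order inside the tag does not matter.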
If you’ve been doing any kind of reading about link building, then you’ve probably seen people mentioning nofollow and dofollow links. These are very important terms to understand when you are trying to build great links back to your site in order to increase your search engine rankings. But, to the person who is new to all of this, it may be kind of confusing. I am going to help break it down for you.
To tell the spiders to crawl a link, you don't have to do anything: an ordinary link without a rel="nofollow" attribute will be crawled by default.
Related
I have a regex that will be used to match #users tags.
I use lookaround assertions, letting punctuation and whitespace characters surround the tags.
There is an added complication: there is a type of bbcode that represents HTML.
I have two types of bbcodes, inline (^B bold ^b) and blocks (^C center ^c).
The inline ones have to be skipped over to reach the previous or next character.
And the blocks are allowed to surround a tag, just like punctuation.
I made a regex that does work. What I want to do now is reduce the number of steps it takes on every character that isn't going to be part of a match.
At first I thought I could write a regex that would just look for #, and when found, start checking the lookarounds. That worked without the inline bbcodes, but since a lookbehind cannot be quantified, it's more difficult: I cannot add ((\^[BIUbiu])++)* inside it, and doing so produces many more steps.
How can I make my regex more efficient, with fewer steps?
Here is a simplified version of it, in the Regex101 link there is the full regex.
(?<=[,\.:=\^ ]|\^[CJLcjl])((\^[BIUbiu])++)*#([A-Za-z0-9\-_]{2,25})((\^[BIUbiu])++)*(?=[,\.:=\^ ]|\^[CJLcjl])
https://regex101.com/r/lTPUOf/4/
A rule of thumb:
Do not let the engine attempt a match at each single character if
there are some boundaries.
The quote originally comes from this answer. The following regular expression reduces the step count significantly, from ~20000 to ~900, thanks to the left side of the outermost alternation:
(?:[^#^]++|[#^]{2,}+)(*SKIP)(*F)
|
(?<=([HUGE-CHARACTER-CLASS])|\^[cjleqrd])
(\^[34biu78])*+#([a-z\d][\w-.]{0,25}[a-z\d])(\^[34biu78])*+(?=(?1))
Actually, I don't care much about the number of steps reported by regex101, because it won't hold exactly in your own environment, and it isn't obvious which steps are real or which are missed. But in this case, since the logic of the regex is clear and the difference is large, it is meaningful.
What is the logic?
We first try to match what is probably not desired at all, throw it away, and then look for parts that may match our pattern. [^#^]++ matches up to a # or ^ symbol (the characters of interest), and [#^]{2,}+ prevents the engine from taking extra steps before finding out it's going nowhere. So we make it fail as soon as possible.
You can use the i flag instead of defining uppercase forms of letters (this may have a small impact, however).
See live demo here
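Python's re has neither possessive quantifiers nor (*SKIP)(*F), but the same "consume the junk first, then fail fast" idea can be approximated with an alternation whose first branches swallow uninteresting runs, keeping only matches where the capture group participated. A simplified sketch (not the full pattern above):

```python
import re

# Branch 1 gobbles runs without # or ^; branch 2 gobbles runs of 2+ markers;
# branch 3 is the real target: a #tag of 2-25 word characters.
pattern = re.compile(r'[^#^]+|[#^]{2,}|(#[A-Za-z0-9_-]{2,25})')

text = 'plain text ## noise #tag more words'
tags = [m.group(1) for m in pattern.finditer(text) if m.group(1)]
print(tags)  # ['#tag']
```

The junk branches advance the scan in big chunks, so the engine never retries character by character inside uninteresting text.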
I have a rich text editor for news messages.
The frontend shows one paragraph, and the user can read the full message once they click "read more".
However, this recognition is currently done with <div></div> tags, while the editor works with <br> tags (two for a paragraph).
My current regex is:
"/<div>([^`]*?)<\/div>/is"
How can I extend this to also recognize two <br> tags right after each other? (Note: those br tags might contain attributes.)
As discussed above, beware that using regex to parse HTML, especially for "complex" problems, is generally a bad idea. The following is not a perfect solution, but may be good enough to the simple requirements you've given above:
/(?<=<div>).*?(?=<\/div>)|(?<=<br>\s*<br>).*?(?=<div>|<br>\s*<br>)/is
The (?<=...) and (?=...) are lookbehinds/lookaheads, i.e. they assert that those sections of the pattern are present, but they are not included in the match result.
I have also used \s* to help catch scenarios where the user types something like:
<br> <br>
Or:
<br>
<br>
...But as I say, this is still not a perfect solution. If you find the pattern gets too complex, then seriously consider using an XML parser instead. (Or, how about just letting the user enter new lines, and converting these into paragraphs for them? ... Or even, just use an existing WYSIHTML5 library, or a markdown library?)
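If lookbehinds get awkward (several engines reject variable-length ones like (?<=<br>\s*<br>)), splitting on the separators is a simpler route. A Python sketch under the same assumptions:

```python
import re

html = '<div>first paragraph</div>second block<br class="x">\n<br>third block'

# Treat <div>/</div> boundaries and runs of two <br> tags (possibly with
# attributes and whitespace between them) as paragraph separators.
separator = re.compile(r'</?div>|<br[^>]*>\s*<br[^>]*>', re.I)
paragraphs = [p.strip() for p in separator.split(html) if p.strip()]
print(paragraphs)  # ['first paragraph', 'second block', 'third block']
```

Splitting keeps the pattern flat, which is easier to maintain than nested lookarounds.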
I'm working on a web page, and regex keeps coming up as the best way to handle string manipulation for an issue I'm trying to resolve. Unfortunately, regex is not exactly trivial and I've been having trouble. Any help is appreciated.
I would like to turn strings entered from a PHP form into clickable links. I've already received help with my first challenge: how to make strings starting with http, https, or ftp into clickable links:
function make_links_clickable($message){
    return preg_replace('!(((f|ht)tp(s)?://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '<a href="$1">$1</a>', $message);
}
$message = make_links_clickable($message);
And this works well. When I look at it (and do some research), the best I can glean from the syntax is that the first piece matches ftp, http, and https, the colon, and // along with a wide range of combined patterns. I would like to know how I can:
1) Make links starting with www, or ending with .com/.net/.org/etc clickable (like google.com, or www.google.com - leaving out the http://)
2) Change youtube links like
"https://www.youtube.com/watch?v=examplevideo"
into
"<iframe width="560" height="315" src="//www.youtube.com/embed/examplevideo" frameborder="0" allowfullscreen></iframe>"
I think these two cases are basically doing the same kind of thing, but figuring out is not intuitive. Any help would be deeply appreciated.
The first regular expression there is made to match almost everything that follows ftp://, http://, https:// that occurs, so it might be best to implement the others as separate expressions since they'll only be matching hostnames.
For number 1, you'll need to decide how strictly you wish to match different TLDs (.com/.net/etc). For example, you can explicitly match them like this:
(www\.)?[a-z0-9\-]+\.(com|net|org)
However, that will only match URLs that end in .com, .net, or .org. If you want all top-level domains and only the valid ones, you'll need to manually write them all in to the end of that. Alternatively, you can do something like this,
(www\.)?[a-z0-9\-]+\.[a-z]{2,6}
which will accept anything that looks like a URL ending in a dot plus any combination of 2 to 6 letters (covering longer TLDs like .museum and .travel). However, this will also match strings like "fgs.fds". Depending on your application, you may need to add more characters to [a-z] to support extended character alphabets.
Edit (2 Aug 14): As pointed out in the comments below, this won't match TLDs like .co.uk. Here's one that will:
(www\.)?[a-z0-9\-]+\.([a-z]{2,3}(\.?[a-z]{2,3})?)
Instead of any string between two and six characters (following a period), this will match any two to three, then another one to three (if present), with or without a dividing period.
It'd be redundant, but you could instead remove the question mark after www in the second option and then run both tests; that way you match any string ending in a common TLD, or any string that begins with "www." followed by characters with a single separating period, like "gpspps.cobg". It would still match sites that might not actually exist, but at least anything it matches looks like a URL.
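To illustrate, a small Python check of the two-to-three-letter variant above (the sample strings are made up):

```python
import re

# The .co.uk-friendly pattern from the edit above.
pat = re.compile(r'(www\.)?[a-z0-9\-]+\.([a-z]{2,3}(\.?[a-z]{2,3})?)')

print(bool(pat.search('visit www.example.co.uk today')))  # True
print(bool(pat.search('google.com')))                     # True
print(bool(pat.search('no url here')))                    # False
```

As the answer warns, it will also accept look-alike strings such as "fgs.fds", so treat it as a heuristic rather than validation.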
For the YouTube one, I went a little question mark crazy.
(?i:(?:(?:http(?:s)?://)?(?:www\.)?)?youtu(?:\.be/|be\.com/watch\?(?:[a-z0-9_\-\%\&\=]){0,}?v\=))([a-zA-Z0-9_\-]{11})
EDIT: I just tried to use the above regex in one of my own projects, but I encountered some errors with it. I changed it a little and I think this version may be better:
(?i:(?:(?:http(?:s)?://)?(?:www\.)?)?youtu(?:\.be/|be\.com/watch\?(?:[a-z0-9_\-\%\&\=]){0,})?)(?:v=)?([a-zA-Z0-9_\-]{11})
For those not familiar with regular expressions: parentheses, ( ...regex... ), are stored as groups, which can be selectively picked out of matched strings. Groups that begin with ?:, as in most of the ones above, e.g. (?:www\.), are not captured. Because the end of that regex was left as a normal "captured" group, ([a-zA-Z0-9_\-]{11}), you can use the $matches argument of functions like preg_match and then read $matches[1] to get the YouTube ID of the video, 'examplevide', and work with it however you'd like. Also note that the regex only matches 11 characters for the ID.
This regex will match pretty much any of the current youtube url formats including incorrect cases, and out of (normal) order parameters:
http://youtu.be/dQw4w9WgXcQ
https://www.youtube.com/watch?v=dQw4w9WgXcQ
http://www.youtube.com/watch?v=dQw4w9WgXcQ&feature=featured
http://www.youtube.com/watch?feature=featured&v=dQw4w9WgXcQ
http://WWW.YouTube.Com/watch?v=dQw4w9WgXcQ
http://YouTube.Com/watch?v=dQw4w9WgXcQ
www.youtube.com/watch?v=dQw4w9WgXcQ
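For reference, a slightly simplified Python transcription that pulls the 11-character ID out of those formats (simplified: it keeps only the host and v= handling, not every quirk of the full pattern):

```python
import re

# Optional scheme and www, then either youtu.be/ID or youtube.com/watch
# with v= anywhere in the query string; case-insensitive for WWW.YouTube.Com.
pat = re.compile(
    r'(?:https?://)?(?:www\.)?youtu(?:\.be/|be\.com/watch\?\S*?v=)'
    r'([a-zA-Z0-9_-]{11})',
    re.IGNORECASE,
)

for url in ('http://youtu.be/dQw4w9WgXcQ',
            'http://www.youtube.com/watch?feature=featured&v=dQw4w9WgXcQ',
            'http://WWW.YouTube.Com/watch?v=dQw4w9WgXcQ'):
    print(pat.search(url).group(1))  # dQw4w9WgXcQ each time
```

The lazy \S*? lets v= appear after other query parameters, which covers the out-of-order case in the list above.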
Taking this thread a step further, can someone tell me what the difference is between these two regular expressions? They both seem to accomplish the same thing: pulling a link out of HTML.
Expression 1:
'/(https?://)?(www.)?([a-zA-Z0-9_%]*)\b.[a-z]{2,4}(.[a-z]{2})?((/[a-zA-Z0-9_%])+)?(.[a-z])?/'
Expression 2:
'/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/si'
Which one would be better to use? And how could I modify one of those expressions to match only links that contain certain words, and to ignore any matches that do not contain those words?
Thanks.
The difference is that expression 1 looks for valid, full URIs following the specification. So you get every full URL that appears anywhere in the code. This is not really suited to getting all links, because it doesn't match the relative URLs that are very often used, and it grabs every URL, not only those that are link targets.
The second looks for <a> tags and gets the content of the href attribute. So this one will get you every link. Except for one error* in that expression, it is quite safe to use and will work well enough to get you every link – it accounts for enough of the variations that can appear, such as whitespace or other attributes.
*However there is one error in that expression, as it does not look for the closing quote of the href attribute, you should add that or you might match weird things:
/<a.*?href\s*=\s*["\']([^"\'>]+)["\'][^>]*>.*?<\/a>/si
edit in response to the comment:
To look for word inside of the link url, use:
/<a.*?href\s*=\s*["\']([^"\'>]*word[^"\'>]*)["\'][^>]*>.*?<\/a>/si
To look for word inside of the link text, use:
/<a.*?href\s*=\s*["\']([^"\'>]+)["\'][^>]*>.*?word.*?<\/a>/si
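As a worked example, the url-word variant in Python (the sample HTML and the word "word" are placeholders):

```python
import re

html = ('<a href="http://example.com/word/page">match me</a> '
        '<a href="http://example.com/other">skip me</a>')

# Only links whose URL contains "word" are captured.
pat = re.compile(
    r'<a.*?href\s*=\s*["\']([^"\'>]*word[^"\'>]*)["\'][^>]*>.*?</a>',
    re.S | re.I,
)

print([m.group(1) for m in pat.finditer(html)])  # ['http://example.com/word/page']
```

Links whose URL lacks the word fall through without matching, so only the wanted hrefs end up in the result list.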
In the majority of cases I'd strongly recommend using an HTML parser (such as this one) to get these links. Using regular expressions to parse HTML is going to be problematic since HTML isn't regular and you'll have no end of edge cases to consider.
See here for more info.
/<a.*?href\s*=\s*["']([^"']+)[^>]*>.*?<\/a>/si
You have to be very careful with .*, even in the non-greedy form. . easily matches more than you bargained for, especially in dotall mode. For example:
<a name="foo">anchor</a>
...
Matches from the start of the first <a to the end of the second.
Not to mention cases like:
<a href="a"></a >
or:
<a href="a'b>c">
or:
<a data-href="a" title="b>c" href="realhref">
or:
<!-- <a href="notreallyalink"> -->
and many many more fun edge cases. You can try to refine your regex to catch more possibilities, but you'll never get them all, because HTML cannot be parsed with regex (tell your friends)!
HTML+regex is a fool's game. Do yourself a favour. Use an HTML parser.
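The over-match is easy to reproduce; a quick Python demonstration of the first edge case above:

```python
import re

# A nameless-href anchor followed by a real link: with dotall, the lazy .*?
# walks straight through the first tag's ">" and into the second tag.
html = '<a name="foo">anchor</a> ... <a href="/real">link</a>'
pat = re.compile(r'<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?</a>', re.S | re.I)

m = pat.search(html)
print(m.group(0))  # the match starts at <a name="foo">, not at the real link
```

The captured href is the right one, but the match itself swallows the named anchor and everything between the two tags.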
At a brief glance, the first one is rubbish but seems to be trying to match a link as plain text; the second one matches an HTML element.
A system I am writing uses Markdown to modify links, but I also want to make plain links active, so that typing http://www.google.com would become an active link. To do this, I am using a regex replacement to find urls, and rewrite them in Markdown syntax. The problem is that I can not get the regex to not also parse links already in Markdown syntax.
I'm using the following code:
$value = preg_replace('#((?!\()https?://([-\w\.]+)+(:\d+)?(/([\w/_\.]*(\?\S+)?)?)?)#', '[$1]($1)', $value);
This works well for plain links, such as http://www.google.com, but I need it to ignore links already in the Markdown format. I thought the (?!\() section would prevent it from matching URLs that follow a parenthesis, but it seems that I am in error.
I realize that even this is not an ideal solution (if it worked), but this is pushing beyond my regex abilities.
I think (?<!\() is what you meant. If the match position is at the beginning of http://www.google.com, it's not the next character you need to check, but the previous one. In other words you need a negative lookbehind, not a negative lookahead.
Regexes are notoriously bad at stuff like this; you might end up with all sorts of clever HTML exploits you could never have thought of. IMO you should modify the Markdown script to flag Markdown URLs as it sees them, so you can then ignore the flagged URLs and find the rest with a very simple search that leaves no complexity to exploit.
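Putting the suggested lookbehind to work, a minimal Python sketch (the URL pattern here is simplified from the one in the question):

```python
import re

# Negative lookbehind: skip URLs directly preceded by "(" — i.e. URLs that
# are already the target part of a Markdown link.
pat = re.compile(r'(?<!\()https?://[-\w.]+(?:/\S*)?')

text = 'plain http://www.google.com and [x](http://example.com/page)'
out = pat.sub(lambda m: '[{0}]({0})'.format(m.group(0)), text)
print(out)
```

Only the bare URL gets wrapped in Markdown syntax; the one already inside [x](...) is left untouched.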