I am planning to write an SEO tool, and I want to know how I can find the pages of a static/dynamic website from a link.
I will just have a domain like www.yahoo.com, and my system should find all pages that exist on that host.
Are there any techniques to do that? I can use any language, but I think .NET will really speed things up.
I think you would almost certainly have to parse the page source for href= references.
You could request the URL using System.Net.WebRequest.Create(uri) and then run a Regex over the response stream.
I would certainly be interested if there were an easier way in .NET.
You cannot just "magically" find all pages that exist on the domain, unless there is a sitemap (which won't exist most of the time).
Here is what you can do:
1. Brute force - a bad idea, as it will just take a very, very long time.
2. Regex over source code - use regular expressions to extract the links inside tags.
Option 2 is your best bet, as it will give you all the links on a page. I would consider adding recursive functionality so that you "spider" out and perform the same regex operation on every page found from the seed.
Here is the algorithm:
1. Start with a seed (e.g. www.yahoo.com).
2. Run the regex over the source code of this page and store all links in a data structure.
3. Recursively call step 1 on each link found in step 2. You might want to restrict this to links that live on the seed domain (i.e. start with or contain www.yahoo.com), as well as excluding links to pages that you've already visited.
A tree data structure with a visitor design pattern would be ideal for this type of implementation.
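For what it's worth, here is a minimal PHP sketch of that spidering loop (the question mentions .NET, but the pattern is identical in any language). The href regex, the use of file_get_contents, and the page cap are simplifications of my own, not a production-ready crawler:

<?php
// Breadth-first "spider" restricted to the seed's host.
// The regex and the $maxPages cap are illustrative simplifications.
function crawl($seed, $maxPages = 100) {
    $host    = parse_url($seed, PHP_URL_HOST);
    $queue   = array($seed);
    $visited = array();

    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;                      // skip pages we've already seen
        }
        $visited[$url] = true;

        $html = @file_get_contents($url);  // step 1: fetch the page
        if ($html === false) {
            continue;
        }

        // step 2: regex out every href on the page
        preg_match_all('#href="([^"]+)"#i', $html, $matches);

        // step 3: only queue links that live on the seed domain
        foreach ($matches[1] as $link) {
            if (strpos($link, $host) !== false && !isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($visited);
}

print_r(crawl('http://www.yahoo.com'));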
I'm building a simple web crawler and I'm trying to filter links based on whether or not they've been seen before. The issue is that two links might point to the same page but differ by a trailing slash or query arguments. I would also like to filter out mailto: links. Is there any known, straightforward way to do this? I'm currently working in PHP.
Edit:
I used Net_URL2.php to normalize the URLs after viewing this:
How do I apply URL normalization rules in PHP?
Short answer: no, there's no straightforward way to do that. Have a read of this article about URL normalization to find out some of the reasons why it is hard to accomplish.
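For the simple cases mentioned in the question (trailing slashes, query arguments, mailto: links), a rough parse_url-based sketch like the one below can serve as a "good enough" key for a seen-before check. It is nowhere near full RFC 3986 normalization, which is exactly why a library such as Net_URL2 is the safer choice:

<?php
// Rough, partial normalization -- not a substitute for Net_URL2.
function normalize_url($url) {
    if (stripos($url, 'mailto:') === 0) {
        return null;                              // drop mailto links entirely
    }
    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return null;
    }
    $scheme = isset($parts['scheme']) ? strtolower($parts['scheme']) : 'http';
    $host   = strtolower($parts['host']);
    $path   = isset($parts['path']) ? rtrim($parts['path'], '/') : '';
    $query  = isset($parts['query']) ? '?' . $parts['query'] : '';
    return $scheme . '://' . $host . ($path === '' ? '/' : $path) . $query;
}

// "http://Example.com/foo/" and "http://example.com/foo" now map to the same key.
$seen = array();
$key  = normalize_url('http://Example.com/foo/');
if ($key !== null && !isset($seen[$key])) {
    $seen[$key] = true;
}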
I am writing PHP code that uses a regex to get all the links from a page, and I need to extend it to get the links from the entire website.
I guess the extracted URLs should be checked again in turn, and so on, so that the script visits all the URLs of the site, not only the one given page.
I know that anything is possible, but how would I go about this? Thank you for your guidance.
Hmm, to ensure that you get all the pages that Google has found, what about crawling Google instead? Just search for "site:domain.com" and then retrieve anything that follows this pattern:
<h3 class="r"><a href="http://domain.com/.*?" class=l
(You'll have to escape the right characters as well; the '.*?' is the part of the regex that captures the URLs that Google finds.)
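If you want to try it, the sketch below applies that pattern with preg_match_all. Keep in mind that Google's result markup has changed many times since this was written and that simple automated requests are often blocked, so treat it purely as an illustration (domain.com is a placeholder):

<?php
// Illustration only: Google's markup changes over time and automated requests may be blocked.
$html = @file_get_contents('http://www.google.com/search?q=site%3Adomain.com');
if ($html !== false) {
    preg_match_all('#<h3 class="r"><a href="(http://domain\.com/[^"]*)" class=l#i', $html, $matches);
    print_r($matches[1]);   // the result URLs captured by the group
}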
Anyways, that's just a suggestion for an alternative approach.
So, your regex grabs all the links. You cycle through a loop of those links, grab each with cURL, run that through your regex, wash, rinse, repeat.
Might want to make sure to put some sort of URL depth counter in there, lest you end up parsing The Internet.
Might also want to make sure you don't re-check links you've already followed, lest you end up at the end of Infinite Recursion Street.
Might also want to look at threading, lest it take 100,000 years.
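A rough sketch of that loop with cURL, a depth counter, and a seen-list (all names here are made up for illustration):

<?php
// Depth-limited recursive fetch: cURL + regex, with a "seen" list so we never
// follow the same link twice and a depth cap so we don't parse The Internet.
function fetch($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? '' : $html;
}

function spider($url, $depth, array &$seen) {
    if ($depth <= 0 || isset($seen[$url])) {
        return;                               // depth counter + already-followed check
    }
    $seen[$url] = true;

    preg_match_all('#href="([^"]+)"#i', fetch($url), $m);
    foreach ($m[1] as $link) {
        spider($link, $depth - 1, $seen);
    }
}

$seen = array();
spider('http://example.com/', 3, $seen);
print_r(array_keys($seen));

Threading proper is awkward in plain PHP; running several worker processes (or using curl_multi_exec for parallel requests) is the usual workaround.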
This will get URLs from url() (CSS) and from href and src attributes (links, images, scripts):
#(?:href|src)="([^"]+)|url\(["']?(.*?)["']?\)#i
They will be captured in groups 1 and 2.
Be aware that some URLs can be relative, so you have to make them absolute before requesting them.
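A short sketch of how that regex could be used, plus a very crude way of absolutizing relative URLs (the base handling below is a deliberate oversimplification: it ignores root-relative paths, ../ segments and <base> tags):

<?php
$base   = 'http://example.com/page.html';        // where the HTML/CSS came from
$source = file_get_contents($base);

preg_match_all('#(?:href|src)="([^"]+)|url\(["\']?(.*?)["\']?\)#i', $source, $m);
$urls = array_filter(array_merge($m[1], $m[2])); // merge groups 1 and 2, drop empties

foreach ($urls as $url) {
    if (!preg_match('#^https?://#i', $url)) {
        // crude: resolve against the directory of $base only
        $url = dirname($base) . '/' . ltrim($url, './');
    }
    echo $url, "\n";
}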
Normally, you do not have the kind of access to the underlying server that would let you retrieve a list of all pages on the site.
So you just need to do what Google does: Get all links from the page and then scan those links for additional links.
I'm developing a PHP-based web application in which you have a form with textarea inputs that can accept links via anchor tags. But when I tested it after adding a hyperlink as follows, the link pointed to a non-existent local subdirectory:
link
I realized that this was because I had not appended http:// before the link.
There might be cases where a user inputs a link just as I did above. In such cases I don't want the link to point the way it did above. Is there any possible solution, such as automatically prepending http:// to the link when it isn't there? How do I do that?
Edit:
Please consider that the anchor tags are mixed in with other plain text, and this is making things harder to work with.
I'd go for something like this:
if (!parse_url($url, PHP_URL_SCHEME)) {
    $url = 'http://' . $url;
}
This is an easy and stable way to check for the presence of a scheme in a URL, and it allows other schemes (e.g. ftp, https) that may be entered.
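For the case in the edit (href attributes buried in a blob of plain text), you can apply the same check inside preg_replace_callback. A hedged sketch; the href regex is deliberately simplistic and won't catch every way an attribute can be written:

<?php
// Find href="..." attributes in mixed text/HTML and prepend a scheme where missing.
function fix_hrefs($text) {
    return preg_replace_callback(
        '#href="([^"]+)"#i',
        function ($m) {
            $url = $m[1];
            if (!parse_url($url, PHP_URL_SCHEME)) {   // same check as above
                $url = 'http://' . $url;
            }
            return 'href="' . $url . '"';
        },
        $text
    );
}

echo fix_hrefs('Visit <a href="www.example.com">link</a> for details.');
// Visit <a href="http://www.example.com">link</a> for details.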
What you're talking about involves two steps, URL detection and URL normalization. First you'll have to detect all the URLs in the string being parsed and store them in a data structure for further processing, such as an array. Then you need to iterate over the array and normalize each URL in turn, before attempting to store them.
Unfortunately, both detection and normalization can be problematic, as a URL has a quite complicated structure. http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/ makes some suggestions, but as the page itself says, no regex URL detection is ever perfect.
There are examples of regular expressions that can detect URLs available from various sites, but in my experience none of them are completely reliable.
As for normalization, Wikipedia has an article on the subject which may be a good starting point. http://en.wikipedia.org/wiki/URL_normalization
I need some library that would be able to keep my URLs indexed and described. So I want to say to it something like:
Index this new URL "www.bla-bla.com/new_url" with some keywords
or something like that. And I want to be sure that, if I tell my library about my new URL, Google and the others will definitely find it as soon as possible, and people will be able to find this URL on the web.
Do you know any such libs?
I do not know of any libraries that will achieve this, but I think you need to do some reading on Search Engine Optimisation. From my understanding (and please correct me if I am wrong), when the Googlebot comes to your website to index it, it will check for a file called sitemap.xml. In this file you define properties as follows:
<url>
  <loc>http://www.myhost.com/mypage.html</loc>
  <lastmod>YYYY-MM-DD</lastmod>
  <changefreq>monthly</changefreq>
  <priority>1.00</priority>
</url>
As far as I know, you cannot specify particular keywords for a particular page here. The use of META tags can, to "some" (arguably) extent, influence this. The main influence will be the actual content of the page.
I would recommend the use of Google's "Webmaster Tools", which will give you feedback and errors about the indexing of your site. You can add your site to Google and join a queue for indexing.
There are several automated sitemap generators, which I have had no experience with, so I can not comment on those.
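For a small site you can also just generate the file yourself. A minimal sketch; the $pages list is a placeholder you would fill from your own database or crawler:

<?php
// Minimal sitemap.xml generator; $pages is a placeholder list of your URLs.
$pages = array(
    'http://www.myhost.com/mypage.html',
    'http://www.myhost.com/another.html',
);

$xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
$xml .= '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
foreach ($pages as $page) {
    $xml .= "  <url>\n";
    $xml .= '    <loc>' . htmlspecialchars($page) . "</loc>\n";
    $xml .= '    <lastmod>' . date('Y-m-d') . "</lastmod>\n";
    $xml .= "    <changefreq>monthly</changefreq>\n";
    $xml .= "    <priority>1.00</priority>\n";
    $xml .= "  </url>\n";
}
$xml .= "</urlset>\n";

file_put_contents('sitemap.xml', $xml);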
There is no way to (immediately and on-demand) manipulate the search results in any search engine. It will always take at least a week for your site to be indexed (maybe even longer).
I'm making an online dictionary. Now I have two options:
1) Use AJAX for retrieving results
2) Just use some regular PHP scripts
If I choose the first option, the online dictionary will most likely have a single page, and it will be fast. If I choose the second option, I'll have more pages (a separate page for each entry with its own URL), but it will be slower. Personally I like the second option; I don't really like too much AJAX on pages.
What is your opinion? Cons and pros (for this certain case)?
Thank you.
If you use the second solution (i.e. several URLs, one per page/definition), your users will be able to bookmark URLs for each specific page, use those to come back to your site or link to it, or send them to other people, or whatever they want -- which is good.
If you have one and only one page for your whole website, people cannot link to specific pages/definitions; they have no way to come back to one specific word; and that's bad :-(
Same goes for search engines, by the way (even if that's not as important): if they only see one page, they will not index your content well... and you probably want your site indexed.
So, in your case, I would probably go with several distinct URLs (even if the corresponding pages are all generated by the same PHP script, of course).
And the same thing goes for search results: you probably want to give people the ability to link to search results, don't you?
About the "speed" thing: well, with Ajax, you'll send one request to the server. Without Ajax, you'll still send one request (for a bigger page, I admit), plus the ones for the images, CSS, JS, and all.
You should read a bit about front-end optimization (see Yahoo's Exceptional Performance pages, for instance); it'll help quite a lot with that ;-)
You could potentially use .htaccess (assuming you're on a Linux server running Apache) to give a unique URL to each page while keeping one page for displaying the content.
For instance:
http://yourdomain.com/s/sausage would point to http://yourdomain.com/page.php?id=426
That would then pull from a database and show you record 426, which would be the one for sausage, hurray. You'd probably want to build an .htaccess generator for this, or use a system of slugs and regular expressions.
The .htaccess would look something like this:
RewriteEngine On
RewriteRule ^s/([^/]+)$ page.php?slug=$1 [L]
This will send anything after http://yourdomain.com/s/ to page.php as a slug variable; page.php then looks up that slug in your database.
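And a hedged sketch of what page.php might do with that slug (the DSN, credentials, and the entries table/columns are made-up assumptions):

<?php
// page.php -- look up the dictionary entry for the slug passed in by the rewrite rule.
$pdo  = new PDO('mysql:host=localhost;dbname=dictionary', 'user', 'pass');
$stmt = $pdo->prepare('SELECT title, definition FROM entries WHERE slug = ?');
$stmt->execute(array(isset($_GET['slug']) ? $_GET['slug'] : ''));
$row  = $stmt->fetch(PDO::FETCH_ASSOC);

if ($row) {
    echo '<h1>' . htmlspecialchars($row['title']) . '</h1>';
    echo '<p>'  . htmlspecialchars($row['definition']) . '</p>';
} else {
    header('HTTP/1.0 404 Not Found');
    echo 'No such entry.';
}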
Not sure if that's in any way helpful.