I have been building a tool from scratch to generate a visual graph of the webpages in a particular domain. If one page links to another, that link is denoted by an edge in the graph. My project is to investigate how web developers link their pages inside a particular website. My aim is to run this tool on around 100 non-profit websites and analyse the results.
There's a catch:
Some pages are not linked to by any other page on the internet (they are standalone pages). Is there any way I can get a list of such webpages in a particular domain, or in a particular directory of a domain?
Example: say we have www.example.com/abc/xyz.asp.
xyz.asp is not linked to by any other page on the internet, and directory listing at the parent directory (www.example.com/abc/) is disabled. How do I get to know that a webpage exists at that particular location?
I'm particularly interested in ASP and PHP domains. My assumption is that linked pages will form a cluster and standalone pages will be left alone, like stars in the sky. After generating the graph I need to calculate some coefficients.
Related
I need to get a limited list of all the pages that belong to some website, in PHP. What would the code look like? Limited means something like function(some url, limit of pages).
There is no standard way to do this. Some web sites publish an XML sitemap and link to it from robots.txt, but most do not.
You may be able to assemble a partial list of pages on a site by crawling the site, e.g. requesting one page on the site, searching for links to other pages, and requesting those pages as well. However, this is not guaranteed to find all pages on a site -- some may not be reachable from the home page! -- and is a complex process.
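For the function(some url, limit of pages) shape asked about above, a minimal sketch might look like the following. It assumes PHP with allow_url_fopen enabled; crawl(), the queue handling, and the crude link resolution are all illustrative, not a production crawler (no robots.txt handling, no politeness delay):

    <?php
    // Hypothetical crawl($startUrl, $pageLimit): breadth-first crawl of one
    // host, stopping after $pageLimit pages have been fetched.
    function crawl($startUrl, $pageLimit)
    {
        $host    = parse_url($startUrl, PHP_URL_HOST);
        $queue   = array($startUrl);
        $visited = array();

        while ($queue && count($visited) < $pageLimit) {
            $url = array_shift($queue);
            if (isset($visited[$url])) {
                continue;
            }
            $html = @file_get_contents($url);
            if ($html === false) {
                continue;           // unreachable page, skip it
            }
            $visited[$url] = true;

            // Pull href values and keep only links on the same host.
            // (Very crude: relative paths are resolved against the root.)
            if (preg_match_all('/href=["\']([^"\'#]+)["\']/i', $html, $m)) {
                foreach ($m[1] as $link) {
                    $abs = parse_url($link, PHP_URL_SCHEME)
                         ? $link
                         : "http://$host/" . ltrim($link, '/');
                    if (parse_url($abs, PHP_URL_HOST) === $host) {
                        $queue[] = $abs;
                    }
                }
            }
        }
        return array_keys($visited);
    }

    print_r(crawl('http://www.example.com/', 50));

Note that pages nothing links to (the "standalone pages" from the first question) never enter the queue, which is exactly why crawling alone cannot enumerate them.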
You can manually create PHP pages in your directory.
e.g.
Index.php
About.php
Contact.php
But with PHP frameworks like Laravel, the pages do not exist as files; they are in the database and are fetched when the user visits the page.
e.g.
If a person visits http://mywebsite.com/contact, the framework will look in the database for a page named 'contact' and then output it to the user.
But how does Google (or any other search engine) find those pages if they only exist in the database?
Google can index these fine, as they are "server-side" generated. Files do not need to exist on disk for Google to be able to index a page; the content just needs to be produced at the server-side level.
Where Google has issues indexing is if your site is "client-side" based and uses something like AJAX to pull the content into the browser. A search engine spider can't execute JavaScript, so it never finds the content. However, Google has defined some guidelines for getting this kind of content indexed in its Webmaster guidelines.
You have a static website address, www.domain.com, and that is real. Once Google comes to know that there is a website named www.domain.com, it will visit the site; now that the Google crawler is on your website, it will look for the links available on the home page of www.domain.com, and those will be crawled in turn. That's simple.
In Laravel, pages DO NOT exist in the database, although they might be dynamically generated.
As pointed out by @expodax,
Google will index LINKS for your web app, and links (URIs) are generated in accordance with your routes.php file (found in app/Http/routes.php).
In essence, Google will index the links / URIs available to the end user; it DOES NOT depend upon how you've organized the files in your web app.
For detailed documentation about routes in Laravel (how they can be generated or used), please check this: http://laravel.com/docs/5.0/routing
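As a concrete illustration, a routes.php along these lines (Laravel 5.0 syntax; 'contact' is just the example page from the question, not a route your app necessarily has) is what actually defines the URIs Google can discover:

    <?php
    // app/Http/routes.php -- each entry below is a crawlable URI,
    // regardless of whether a matching file exists on disk.
    Route::get('/', function () {
        return view('home');
    });

    Route::get('contact', function () {
        // Resolved at request time (from a view, a database row, etc.);
        // a crawler only ever sees the rendered HTML at /contact.
        return view('contact');
    });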
A sitemap is a file where you can list the web pages of your site to tell Google and other search engines about the organization of your site content. Search engine web crawlers like Googlebot read this file to crawl your site more intelligently.
If you want to generate a sitemap for your Laravel application, you can do it manually or you can use a package like this: https://github.com/RoumenDamianoff/laravel-sitemap
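If you go the manual route, a sitemap is just XML in the standard sitemap namespace. A minimal hand-rolled generator could look like this; the $urls array is a hypothetical stand-in for however your application enumerates its pages (routes, database rows, etc.):

    <?php
    // sitemap.php -- emits a bare-bones sitemap.xml.
    $urls = array(
        'http://mywebsite.com/',
        'http://mywebsite.com/contact',
        'http://mywebsite.com/about',
    );

    header('Content-Type: application/xml');
    echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($urls as $url) {
        echo '  <url><loc>' . htmlspecialchars($url) . '</loc></url>' . "\n";
    }
    echo '</urlset>' . "\n";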
I have created an advertising website in PHP and MySQL. I have nearly 200 files for each location; these 200 files are, for example, for selling cars, bikes, etc. In all the titles, heads, and keywords I used a variable x, which is the location. Then I used a script to open each of the 200 files, replace x with the location name, and save the result under a different name, for example location1_websitename_cars.php. There are more than a million locations, so I created 200 × a million files like these. But I cannot host my website economically due to the file-number limitation on shared hosting servers.
My intention in replicating the 200 files for each location was that the Google search engine could find my pages when a user searches for the location name as a keyword. As per my understanding, Google crawls through the existing pages on the server and finds the location name as a keyword, and this results in the inclusion of the webpage in search results. Since this approach won't work with shared hosting, I changed strategy.
I am able to generate the files required for a location dynamically, according to the location the user selects on the home page of my website. In this case I just need to store 200 files on my server, and all pages would be accessible from the home page of the website. But I don't know whether those pages would be accessible from a Google search. For example, if a user types "location1 www.mywebsite.com cars", that PHP page won't be displayed, as the page doesn't exist on the server; it is to be dynamically created.
To put it simply: is there a way of including my website's pages in Google search results if those pages don't exist on the server? They would be dynamically created once the user selects some input and submits it from the home page.
Search engines won't have any problem following your dynamically created pages, but you will first need to create links to those pages from another page that the search engines already know about (e.g. your home page). Once you link to your dynamic pages, they can be indexed.
The more pages and sites (especially high-value sites) that link to your pages, the higher up in the search results your pages will be (of course there are other factors that affect this as well). Also, if you want to test any of this without wading through pages upon pages of search results, google: "site:www.yourwebsite.com yoursearchterms"
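To make the "one script instead of 200 × a million files" idea concrete, here is a hedged sketch. cars.php and the location query parameter are hypothetical names; the point is that one script serves every location, and each location URL (e.g. cars.php?location=location1), once linked from the home page or a locations index, is an ordinary crawlable page:

    <?php
    // cars.php -- replaces the per-location copies of the cars file.
    // In real use, validate $_GET['location'] against your locations table.
    $location = isset($_GET['location']) ? $_GET['location'] : 'unknown';
    $location = htmlspecialchars($location);
    ?>
    <html>
    <head><title>Cars for sale in <?= $location ?></title></head>
    <body>
      <h1>Cars for sale in <?= $location ?></h1>
      <!-- query and render the car listings for this location here -->
    </body>
    </html>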
Google uses URLs as identifiers for pages, not the files on the server.
To detect URLs, Google uses robots that follow links on the web (<a>, <link>, etc.).
If you want your pages to be found and indexed by Google, do not worry about your files on the server but about your URLs and internal linking. You need to create navigation to all the possible pages to let robots access them.
NB: URLs with parameters work, but it is preferable to rewrite them.
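For example, with Apache's mod_rewrite (assuming it is enabled; the paths reuse the hypothetical cars.php from the sketch above), a parameterised URL can be hidden behind a cleaner one in .htaccess:

    # /cars/location1 is served by cars.php?location=location1
    RewriteEngine On
    RewriteRule ^cars/([a-z0-9-]+)/?$ cars.php?location=$1 [L,QSA]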
Should the sitemap page on the site and the sitemap.xml that we upload along with the website files be the same?
Should the sitemap page depict everything in sitemap.xml?
I want to create a sitemap page for all the pages in my website. The website contains nearly 500 pages and uses a PHP Smarty template system. Where can I find an SEO-friendly sitemap generation script for PHP?
Should the sitemap page depict everything in sitemap.xml?
No, the XML map is for the search engines and not for your website's users.
In some cases it also contains only a partial list of pages, plus links to pages, to help the search engines index the site.
The HTML map is for the users and should contain relevant links to the major categories and important pages on your site.
Sitemap generation scripts for PHP can be found on Google's wiki page about sitemap generators at http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators
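For the HTML map aimed at users, even a small hand-written script will do; the $sections array here is a hypothetical stand-in for your real categories (you could equally pass it to a Smarty template):

    <?php
    // sitemap-page.php -- human-readable sitemap, distinct from sitemap.xml.
    $sections = array(
        'Products' => array('/cars.php' => 'Cars', '/bikes.php' => 'Bikes'),
        'Company'  => array('/about.php' => 'About', '/contact.php' => 'Contact'),
    );
    foreach ($sections as $heading => $links) {
        echo '<h2>' . htmlspecialchars($heading) . "</h2>\n<ul>\n";
        foreach ($links as $href => $title) {
            echo '  <li><a href="' . htmlspecialchars($href) . '">'
               . htmlspecialchars($title) . "</a></li>\n";
        }
        echo "</ul>\n";
    }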
I am creating a classifieds website.
I'm storing all the ads in a MySQL database, in different tables.
Is it possible to find these ads somehow through Google's search engine?
Is it possible to create meta information about each ad so that Google finds them?
How do major companies do this?
I have thought about auto-generating an HTML page for each ad inserted, but 500 thousand auto-generated HTML pages doesn't really sound like a good solution!
Any thoughts and ideas?
UPDATE:
Here is my basic website so far:
(ALL PHP BASED)
I have a search engine which searches the database for records.
After finding and displaying search results, you can click on a result (an 'ad') and then PHP fetches the info from the database and displays it. Simple!
In the 'put ad' section of my site, you can put your own ad into the MySQL database.
I need to know how I should make Google find the ads on my website as well, as I don't think the Google crawler can search my database just because users can.
Please explain your answers more thoroughly so that I understand fully how this works!
Thank you
Google doesn't find database records. Google finds web pages. If you want your classifieds to be found then they'll need to be on a Web page of some kind. You can help this process by giving Google a site map/index of all your classifieds.
I suggest you take a look at Google Basics and Creating and Submitting Sitemaps. Basically, the idea is to spoon-feed Google every URL you want Google to find. So if you reference your classifieds this way:
http://www.mysite.com/classified?id=1234
then you create a list of every URL required to find every classified, and yes, this might be hundreds of thousands or even millions.
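A hedged sketch of generating that list, streamed straight out of MySQL so a million rows never sit in memory at once (the classifieds table, id column, and connection details are hypothetical):

    <?php
    // sitemap.php -- one <loc> per classified.
    $db = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');
    header('Content-Type: application/xml');
    echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($db->query('SELECT id FROM classifieds') as $row) {
        echo '  <url><loc>http://www.mysite.com/classified?id='
           . (int) $row['id'] . "</loc></url>\n";
    }
    echo '</urlset>' . "\n";

Note that a single sitemap file is capped at 50,000 URLs, so at the hundreds-of-thousands scale you would split the output into several files listed in a sitemap index.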
The above assumes a single classified per page. You can of course put 5, 10, 50 or 100 on a single page and then create a smaller set of URLs for Google to crawl.
Whatever you do, however, remember this: your sitemap should reflect how your site is used. Every URL Google finds (or that you give it) will appear in the index. So don't give Google a URL that a user couldn't reach by using the site normally, or that you don't want a user to use.
So while 50 classifieds per page might mean fewer requests from Google, if that's not how you want users to use your site (or a view you want to provide), then you'll have to do it some other way.
Just remember: Google indexes Web pages not data.
How would you normally access these classifieds? You're not just keeping them locked up in the database, are you?
Google sees your website like any other visitor would see it. If you have a normal database-driven site, there's some unique URL for each classified where it is displayed. If there's a link to it somewhere, Google will find it.
If you want Google to index your site, you need to put all your pages on the web and link between them.
You do not have to auto-generate a static HTML page for everything; all pages can be dynamically created (JSP, ASP, PHP, what have you), but they need to be accessible to a web crawler.
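A minimal sketch of such a dynamically created page in PHP (table, columns, and connection details are hypothetical; the URL shape matches the classified?id=1234 example suggested above):

    <?php
    // classified.php -- renders one ad straight from the database.
    $db   = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');
    $stmt = $db->prepare('SELECT title, body FROM classifieds WHERE id = ?');
    $stmt->execute(array(isset($_GET['id']) ? (int) $_GET['id'] : 0));
    $ad = $stmt->fetch(PDO::FETCH_ASSOC);
    if (!$ad) {
        header('HTTP/1.0 404 Not Found');
        exit('Ad not found');
    }
    ?>
    <html>
    <head><title><?= htmlspecialchars($ad['title']) ?></title></head>
    <body>
      <h1><?= htmlspecialchars($ad['title']) ?></h1>
      <p><?= nl2br(htmlspecialchars($ad['body'])) ?></p>
    </body>
    </html>

As far as a crawler is concerned this is an ordinary web page; it neither knows nor cares that the content came out of MySQL.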
Google can find you no matter where you try to hide. Even if you can somehow fit yourself into a mysql table. Because they're Google. :-D
Seriously, though, they use a bot to periodically spider your site, so you mostly just need to make the data in your database available as web pages on your site, and make your site bot-friendly (use an appropriate robots.txt file, provide a search-engine-friendly site map, etc.). You also need to make sure they can find your site in the first place, so make sure it's linked to by other sites -- preferably sites with lots of traffic.
If your site only displays specific results in response to search terms, you'll have a harder time. You may want to make full lists of the records available for people without search terms (paged appropriately if you have lots of data), as sketched below.
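One hedged way to provide those full lists (the table name and the 50-per-page choice are arbitrary): browseable, paged index pages that a crawler can walk link by link without ever typing a search term:

    <?php
    // browse.php -- paged list of every ad, crawlable via the Next link.
    $db      = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');
    $page    = isset($_GET['page']) ? max(1, (int) $_GET['page']) : 1;
    $perPage = 50;

    $stmt = $db->prepare('SELECT id, title FROM classifieds ORDER BY id LIMIT ? OFFSET ?');
    $stmt->bindValue(1, $perPage, PDO::PARAM_INT);
    $stmt->bindValue(2, ($page - 1) * $perPage, PDO::PARAM_INT);
    $stmt->execute();

    foreach ($stmt as $row) {
        echo '<a href="/classified?id=' . (int) $row['id'] . '">'
           . htmlspecialchars($row['title']) . "</a><br>\n";
    }
    // The Next link is what lets a bot reach every page in sequence.
    echo '<a href="?page=' . ($page + 1) . '">Next</a>';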
First, create a PHP file that pulls the index plus a human-readable reference for all records.
That is your main page, broken out into categories (as in the case of Craigslist.com: by country and state).
Then each category link feeds the selected value back to the PHP script, regardless of the number of levels, until finally reaching the ad itself.
So, if the selected category contains more categories (like states contain cities), then display the next list of categories; else, display the list of ads for that city.
This will give Google.com a way to index a site (aka a MySQL DB) dynamically without creating static content for the millions (billions or trillions) of records involved.
This is just an idea of how to get Google.com to index a database.
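A sketch of that idea under some assumed schema (a categories table with a parent_id column, and a category_id on each ad; all names hypothetical). One script either lists subcategories or, at a leaf, lists the ads:

    <?php
    // browse.php -- Craigslist-style drill-down over categories.
    $db    = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');
    $catId = isset($_GET['cat']) ? (int) $_GET['cat'] : 0;   // 0 = top level

    $stmt = $db->prepare('SELECT id, name FROM categories WHERE parent_id = ?');
    $stmt->execute(array($catId));
    $children = $stmt->fetchAll(PDO::FETCH_ASSOC);

    if ($children) {
        // More categories below (states containing cities): list them.
        foreach ($children as $cat) {
            echo '<a href="?cat=' . (int) $cat['id'] . '">'
               . htmlspecialchars($cat['name']) . "</a><br>\n";
        }
    } else {
        // Leaf category: list its ads so each one is one link away.
        $stmt = $db->prepare('SELECT id, title FROM classifieds WHERE category_id = ?');
        $stmt->execute(array($catId));
        foreach ($stmt as $ad) {
            echo '<a href="/classified?id=' . (int) $ad['id'] . '">'
               . htmlspecialchars($ad['title']) . "</a><br>\n";
        }
    }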