How can I get a list of webpages (sitemap) with PHP?

I need to get a limited list of all the pages that belong to some website, in PHP. How would the code look? Limited means something like function(some url, limit of pages).

There is no standard way to do this. Some web sites publish an XML sitemap and link to it from robots.txt, but most do not.
You may be able to assemble a partial list of pages on a site by crawling the site, e.g. requesting one page on the site, searching for links to other pages, and requesting those pages as well. However, this is not guaranteed to find all pages on a site -- some may not be reachable from the home page! -- and is a complex process.
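A minimal sketch of that crawling approach in PHP, shaped like the function(some url, limit of pages) signature the question asks for. The name crawl_site is made up, and this is a sketch only: a real crawler would also need politeness delays, robots.txt handling, and relative-URL resolution.

    <?php
    // Breadth-first crawl: fetch pages starting from $startUrl, collect
    // same-host links, and stop after $limit pages have been fetched.
    function crawl_site(string $startUrl, int $limit): array
    {
        $host  = parse_url($startUrl, PHP_URL_HOST);
        $queue = [$startUrl];
        $seen  = [$startUrl => true];
        $pages = [];

        while ($queue && count($pages) < $limit) {
            $url  = array_shift($queue);
            $html = @file_get_contents($url); // requires allow_url_fopen
            if ($html === false) {
                continue; // skip unreachable pages
            }
            $pages[] = $url;

            $doc = new DOMDocument();
            @$doc->loadHTML($html); // suppress warnings from malformed HTML
            foreach ($doc->getElementsByTagName('a') as $a) {
                $href = $a->getAttribute('href');
                // Absolute same-host links only; relative URLs would need resolving.
                if (parse_url($href, PHP_URL_HOST) === $host && !isset($seen[$href])) {
                    $seen[$href] = true;
                    $queue[] = $href;
                }
            }
        }
        return $pages;
    }

    print_r(crawl_site('http://www.example.com/', 50));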

Related

Crawling websites and dynamic urls

Do search engine robots crawl my dynamically generated URLs? By this I mean HTML pages generated by PHP based upon GET variables in the URL. The links would look like this:
http://www.mywebsite.com/view.php?name=something
http://www.mywebsite.com/view.php?name=somethingelse
I have tried crawling my website with a test crawler found here: http://robhammond.co/tools/seo-crawler, but it only visits my view page once, with just one variable in the header.
Most of the content on my website is generated from the database based on these GET variables, so I would really like the search engines to crawl those pages.
Some search engines do, and some don't. Google, for one, does include dynamically generated pages: https://support.google.com/webmasters/answer/35769?hl=en
Be sure to check your robots.txt file to ensure files you do not want the crawlers to see are blocked, and that files you do want indexed are not blocked.
Also, ensure that all pages you want indexed are linked from other pages, provide a sitemap, or submit individual URLs to the search engine(s) you want to index your site.
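For reference, a minimal robots.txt along those lines might look like this (the blocked path is just an example); the Sitemap line is how a sitemap is advertised to crawlers:

    User-agent: *
    Disallow: /admin/

    Sitemap: http://www.mywebsite.com/sitemap.xml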
Yes, search engines will crawl those pages, assuming they can find them. The best thing to do is simply to create links to those pages on your website, ideally reachable, or at least traversable, from the home page.

SEO for dynamic webpages

I have created an advertising website in PHP and MySQL. I have nearly 200 files for each location; these 200 files are, for example, for selling cars, bikes, etc. In all the titles, heads, and keywords I used a variable x, which is the location. Then I used a scripting language to open each of the 200 files, replace x with the location name, and save it under a different name, for example location1_websitename_cars.php. There are more than a million locations, so I created 200 × a million files like these. But I cannot host my website economically due to the file-number limitation on shared hosting servers.
My intention in replicating the 200 files for each location was that the Google search engine could find my pages when a user searches for the location name as a keyword. As per my understanding, Google crawls through the existing pages on the server and finds the location name as a keyword, and this results in the inclusion of the webpage in search results. Since this approach won't work with shared hosting, I changed strategy.
I am able to generate the files required for a location dynamically, according to the location the user selects from the home page of my website. In this case I just need to store 200 files on my server, and all pages would be accessible from the home page of the website. But I don't know whether those pages would be accessible from Google search. For example, if a user types "location1 www.mywebsite.com cars", that PHP page won't be displayed, as the page doesn't exist on the server; it is to be dynamically created.
To put it simply: is there a way of including my website's pages in Google search results if those pages don't exist on the server? They would be dynamically created once the user selects some input and submits it from the home page.
Search engines won't have any problem following your dynamically created pages, but you will need to first create links to those pages from another page that the search engines do know about (e.g. your home page). Once you link to your dynamic pages, they can be indexed.
The more pages and sites (especially high-value sites) that link to your pages, the higher up in the search results your pages will be (of course there are other factors that affect this as well). Also, if you want to test any of this without wading through pages upon pages of search results, google: "site:www.yourwebsite.com yoursearchterms"
Google uses URLs as identifiers for pages, not the files on the server.
To discover URLs, Google uses robots that follow links on the web (<a>, <link>, etc.).
If you want your pages to be found and indexed by Google, do not worry about the files on your server; focus on your URLs and internal linking. You need to create navigation to all the possible pages so robots can access them.
NB: URLs with parameters work, but it is preferable to rewrite them.
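For example, with Apache's mod_rewrite in an .htaccess file, a rule along these lines (the URL pattern, the script name ads.php, and the parameter names are all illustrative) would expose the parameterized page under a crawl-friendly URL:

    RewriteEngine On
    # e.g. /location1/cars -> /ads.php?location=location1&category=cars
    RewriteRule ^([^/]+)/([^/]+)/?$ ads.php?location=$1&category=$2 [L,QSA]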

Crawling unreferenced URLs

I have been building a tool from scratch to generate a visual graph of the webpages in a particular domain. If a page links to another page, that is denoted by an edge in the graph. My project is to investigate how web developers link their pages inside a particular website. My aim is to run this tool on around 100 non-profit websites and analyse the results.
There's a catch:
Some pages are not linked to by any other page on the internet (they are standalone pages). Is there any way I can get a list of such webpages in a particular domain name, or in a particular directory of a domain name?
Example: say we have www.example.com/abc/xyz.asp
xyz.asp is not linked to by any other page on the internet, and directory listing at the parent directory (www.example.com/abc/) is disabled. How do I get to know that a webpage exists in that particular location?
I'm particularly interested in ASP and PHP domains. My assumption is that linked pages will form a cluster and standalone pages will be left alone, like stars in the sky. After generating the graph I need to calculate some coefficients.

How would a completely dynamic (every page dynamically generated) web site remain search engine-friendly?

Most completely dynamic web sites allow nearly every page to be found, crawled, and indexed by search engines. How would this be properly implemented to allow a completely dynamic web site to be search-engine-friendly? Note that there is no directory structure; users can type in complex URLs (www.example.com/news/recent), but the folder structure doesn't actually exist. It is all handled by .htaccess, which submits the entered URL to the main web application for page generation.
Search engines access websites in nearly the same way as a visitor. If the search engine's web crawler gets to www.example.com/news/recent, it will index the result, which will then be searchable.
Most websites have static links to point to content, so the top news article might be at www.example.com/news/recent, but it could also be at www.example.com/news/9234. That gives search engines somewhere permanent to link to. The search engine doesn't care that www.example.com/news/9234 really loads www.example.com/pages/newsitems.php?item=9234; that's all hidden.
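To illustrate the .htaccess handling the question describes, the usual front-controller rewrite with Apache's mod_rewrite looks roughly like this (index.php as the entry point and the path parameter name are assumptions):

    RewriteEngine On
    # Send anything that is not a real file or directory to the application,
    # e.g. /news/recent becomes index.php?path=news/recent
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule ^(.*)$ index.php?path=$1 [L,QSA]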
Another handy way is through sitemaps, which provide the search engine a direct list/map of pages on the website, including pages whose URLs are more complicated or less pretty.

Generating a sitemap.xml file for dynamic sites in PHP

How can one crawl a site for all unique links and write an XML file to the root of the respective domain? I want something like: when I call mydomain.com/generatesitemap.php, this file crawls all the links in the domain and writes them to the file sitemap.xml. Is this possible in PHP with cURL?
It depends on your site. If it is a simple site, then the task is simple. Grab your site's root page via curl or file_get_contents, preg_match all the links (see here for reference: http://www.mkyong.com/regular-expressions/how-to-extract-html-links-with-regular-expression/), then recursively grab all the links that are inside your site, skipping links that have already been processed.
The task becomes more complicated when JavaScript comes into play. If navigation is built with JavaScript, it will be difficult to obtain the links. There could be other navigation tricks too, like a select box used as a dropdown menu.
The task could be even more complicated if you have pages with query strings. Say you have a catalogue section, and the URLs are like this:
/catalogue
/catalogue?section=books
/catalogue?section=papers
/catalogue?section=magazines
Is it one page or not?
And what about this one?
/feedback
/feedback?mode=sent
So you should take care of these cases; one common approach is sketched below.
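One way to deal with this is to normalize query strings before deciding whether a URL counts as a new page. A rough sketch follows; treating mode as ignorable while section stays significant is purely an assumption chosen to match the examples above, and you would tune the list per site:

    <?php
    // Reduce a URL to a canonical form so that duplicates like
    // /feedback and /feedback?mode=sent collapse into one page,
    // while /catalogue?section=books stays distinct from /catalogue.
    function normalize_url(string $url, array $ignoredParams = ['mode']): string
    {
        $parts = parse_url($url);
        $query = [];
        if (isset($parts['query'])) {
            parse_str($parts['query'], $query);
            foreach ($ignoredParams as $param) {
                unset($query[$param]); // drop parameters that don't change content
            }
            ksort($query);             // stable order: ?a=1&b=2 == ?b=2&a=1
        }
        $normalized = $parts['path'] ?? '/';
        if ($query) {
            $normalized .= '?' . http_build_query($query);
        }
        return $normalized;
    }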
There are many examples of such crawlers to be found via a Google search. Look at this one, for instance:
http://phpcrawl.cuab.de/
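For the simple-site case, a rough sketch of such a generatesitemap.php, using cURL to fetch pages, DOMDocument instead of a regex to pull out links, and XMLWriter for the output. The domain and the 500-page safety cap are placeholders:

    <?php
    // generatesitemap.php -- crawl same-host links starting at the root
    // and write them to sitemap.xml. A sketch for simple sites only.

    function fetch(string $url): ?string
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        $html = curl_exec($ch);
        curl_close($ch);
        return $html === false ? null : $html;
    }

    $root  = 'http://mydomain.com/';
    $host  = parse_url($root, PHP_URL_HOST);
    $queue = [$root];
    $seen  = [$root => true];
    $urls  = [];

    while ($queue && count($urls) < 500) { // hard cap as a safety net
        $url = array_shift($queue);
        if (($html = fetch($url)) === null) {
            continue;
        }
        $urls[] = $url;

        $doc = new DOMDocument();
        @$doc->loadHTML($html); // tolerate malformed HTML
        foreach ($doc->getElementsByTagName('a') as $a) {
            $href = $a->getAttribute('href');
            // Absolute same-host links only; relative URLs would need resolving.
            if (parse_url($href, PHP_URL_HOST) === $host && !isset($seen[$href])) {
                $seen[$href] = true;
                $queue[] = $href;
            }
        }
    }

    // Write the collected URLs in sitemap-protocol format.
    $xml = new XMLWriter();
    $xml->openURI('sitemap.xml');
    $xml->startDocument('1.0', 'UTF-8');
    $xml->startElement('urlset');
    $xml->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
    foreach ($urls as $u) {
        $xml->startElement('url');
        $xml->writeElement('loc', $u);
        $xml->endElement();
    }
    $xml->endElement();
    $xml->endDocument();
    $xml->flush();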
