How can one crawl a site for all unique links and write an XML file to the root of that domain? I want something like this: when I call mydomain.com/generatesitemap.php, the script crawls all the links in the domain and writes them to sitemap.xml. Is this possible in PHP with cURL?
It depends on your site. If it is a simple site, then the task is simple: grab your site's root page via curl or file_get_contents, preg_match all the links (see here for reference: http://www.mkyong.com/regular-expressions/how-to-extract-html-links-with-regular-expression/), then recursively grab all the links that are inside your site, skipping links that have already been processed.
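To illustrate, here is a rough sketch of such a crawler in PHP with cURL. The regex link extraction is a simplification (a DOM parser would be more robust), relative links are skipped for brevity, and $base is a placeholder for your own domain:

<?php
// A minimal sketch, assuming plain HTML links and no JavaScript navigation.
$base    = 'http://mydomain.com'; // placeholder
$queue   = array('/');
$visited = array();

while ($queue) {
    $path = array_shift($queue);
    if (isset($visited[$path])) {
        continue;
    }
    $visited[$path] = true;

    $ch = curl_init($base . $path);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html === false) {
        continue;
    }

    // Extract href values; a real DOM parser would be more reliable.
    preg_match_all('/href="([^"#]+)"/i', $html, $matches);
    foreach ($matches[1] as $link) {
        // Normalise absolute in-domain links to a path.
        if (strpos($link, $base) === 0) {
            $link = substr($link, strlen($base));
        }
        if ($link === '' || $link[0] !== '/') {
            continue; // skip external and page-relative links for brevity
        }
        if (!isset($visited[$link])) {
            $queue[] = $link;
        }
    }
}

// Write the result as sitemap.xml in the document root.
$xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
     . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
foreach (array_keys($visited) as $path) {
    $xml .= "  <url><loc>" . htmlspecialchars($base . $path) . "</loc></url>\n";
}
$xml .= "</urlset>\n";
file_put_contents($_SERVER['DOCUMENT_ROOT'] . '/sitemap.xml', $xml);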
The task becomes more complicated when JavaScript comes into play. If navigation relies on JavaScript, it will be difficult to obtain the links. There could be other navigation tricks too, like a select combobox used as a dropdown menu.
The task can be even more complicated if you have pages with query strings. Say you have a catalogue section, and the URLs look like this:
/catalogue
/catalogue?section=books
/catalogue?section=papers
/catalogue?section=magazines
Is it one page or not?
And what about this one?
/feedback
/feedback?mode=sent
So you should take care of these cases.
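One pragmatic way to handle this is to canonicalize each URL before checking whether it has already been processed, keeping only the query parameters that actually select different content. A sketch, with a hypothetical per-path whitelist:

<?php
// A sketch of query-string normalisation before de-duplication.
// The whitelist below is hypothetical; adapt it to your own URLs.
function canonicalize($url) {
    // Parameters that produce genuinely different pages:
    $keep = array('/catalogue' => array('section'));

    $parts = parse_url($url);
    $path  = isset($parts['path']) ? $parts['path'] : '/';
    $query = array();
    if (isset($parts['query'])) {
        parse_str($parts['query'], $query);
    }

    $allowed = isset($keep[$path]) ? $keep[$path] : array();
    $query   = array_intersect_key($query, array_flip($allowed));
    ksort($query); // stable ordering, so ?a=1&b=2 equals ?b=2&a=1

    return $path . ($query ? '?' . http_build_query($query) : '');
}

echo canonicalize('/catalogue?section=books'), "\n"; // /catalogue?section=books
echo canonicalize('/feedback?mode=sent'), "\n";      // /feedback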
There are many examples of such crawlers to be found via a Google search. Look at this one, for instance:
http://phpcrawl.cuab.de/
Related
Do search engine robots crawl my dynamically generated URLs? By this I mean HTML pages generated by PHP based upon GET variables in the URL. The links look like this:
http://www.mywebsite.com/view.php?name=something
http://www.mywebsite.com/view.php?name=somethingelse
I have tried crawling my website with a test crawler found here: http://robhammond.co/tools/seo-crawler but it only visits my view page once, with just one variable in the header.
Most of the content on my website is generated by these GET variables from the database so I would really like the search engines to crawl those pages.
Some search engines do, and some don't. Google for one does include dynamically generated pages: https://support.google.com/webmasters/answer/35769?hl=en
Be sure to check your robots.txt file to ensure files you do not want the crawlers to see are blocked, and that files you do want indexed are not blocked.
Also, ensure that all pages you want indexed are linked via other pages, that you have a sitemap, or submit individual URLs to the search engine(s) you want to index your site.
Yes, search engines will crawl those pages, assuming they can find them. The best thing to do is simply to create links to those pages on your website, preferably accessible, or at least traversable, from the home page.
Let me know if this question needs more clarification.
I am a front-end developer, and I usually use Wordpress with lots of custom fields to put together a CMS for clients.
A current client wants a design portfolio site that initially presents a grid of images that link to projects, but instead of loading a new page on click, the new content loads and fades in smoothly.
I figure the simplest way to do this kind of thing is to load everything up front on one page with ajax (a loading screen is OK), and then just show/hide/move content with jQuery.
The request I am having trouble with is being able to have specific URLs for different projects and images. The client wants a URL scheme like this:
http://collins1.com/work/bp-helios-house/3
where the number at the end causes a specific image to load in the given project. It seems like this would be simple enough using PHP variables, like:
http://www.whatever.com?project=3&image=2
And using those to manipulate the initial AJAX load.
But how is this accomplished using a more traditional (pretty) URL structure like the example? If I am building the site as one page that loads content, won't the browser attempt to load that URL as a separate page and just come up with a 404?
Bonus: How do you change the URL in the address bar to create these links as the user navigates the site without reloading the page?
Thanks!
What you see there is called URL routing: basically, a server rule that rewrites the URL in a proper way, depending on the server and scripting language used.
For example, the URL
http://server.com/foo/bar
may be rewritten internally to
http://server.com/index.php?foo=bar
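To make that concrete, here is a minimal sketch of the PHP side, assuming Apache with mod_rewrite funnelling every request into a single index.php (the rewrite rule, the URL layout, and the render() function are placeholders):

<?php
// index.php -- a bare-bones front controller. It assumes an Apache
// rewrite rule along these lines:
//
//   RewriteEngine On
//   RewriteCond %{REQUEST_FILENAME} !-f
//   RewriteRule ^ index.php [L]
//
// so a request for /work/bp-helios-house/3 reaches this script unchanged.

$path  = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);
$parts = array_values(array_filter(explode('/', $path), 'strlen'));

// /work/bp-helios-house/3 -> section=work, project=bp-helios-house, image=3
$section = isset($parts[0]) ? $parts[0] : 'home';
$project = isset($parts[1]) ? $parts[1] : null;
$image   = isset($parts[2]) ? (int) $parts[2] : null;

// Hand the values to whatever renders the page, exactly as if they had
// arrived as ?project=...&image=... (render() is a hypothetical function).
render($section, $project, $image);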
If you need a lightweight framework to handle this, take a look at www.slimframework.com
If you want real pretty URLs you are going to need a server-side framework for URL routing, which will require you to get into PHP or Ruby on Rails. If you want a pure front-end solution, you can fake it in JavaScript using hash fragments. For how to do this, see:
http://backbonejs.org/#Router
http://www.asual.com/jquery/address/
http://benalman.com/projects/jquery-hashchange-plugin/
I would like to know, from others' experience, the best way to create sitemaps with CodeIgniter. I have looked at some plugins/libraries, but they all check the database for the pages. What happens if some pages on the site are static and not dynamic?
Is there any way to crawl the site using PHP and creating an XML file with the results?
A tool I have used previously for my projects is http://enarion.net/tools/phpsitemapng/download/ which is a free tool for creating sitemaps and supports functionality such as cron jobs.
What is my next step? How can I achieve this?
Well, your problem lies in the fact that you have both dynamic and static pages. So a crawler would work, but you'd have to generate a list of links to all dynamic pages. Then your crawler could hit that list and have access to all dynamic pages, and then hit the directories where you have your static pages.
However, the docs on the phpsitemapng tool that you mention state that it will crawl a live website. So, if you have links to all of your pages accessible from those pages, then that will do what you need.
Scans files on website (slower, but will also find dynamic generated files and links)
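As an illustration of the static-plus-dynamic approach, here is a rough sketch that merges a hand-maintained list of static pages with dynamic pages pulled from the database and writes sitemap.xml (the table, column, and connection details are hypothetical):

<?php
// Merge hand-listed static pages with dynamic ones from the database,
// then write sitemap.xml. Names below are placeholders.
$base = 'http://example.com';
$urls = array('/', '/about', '/contact'); // static pages, listed by hand

$pdo = new PDO('mysql:host=localhost;dbname=mysite', 'user', 'pass');
foreach ($pdo->query('SELECT slug FROM pages') as $row) {
    $urls[] = '/pages/' . $row['slug']; // dynamic pages from the DB
}

$xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
     . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
foreach (array_unique($urls) as $u) {
    $xml .= "  <url><loc>" . htmlspecialchars($base . $u) . "</loc></url>\n";
}
$xml .= "</urlset>\n";
file_put_contents('sitemap.xml', $xml);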
I have a classifieds website.
The index.html has a form:
<form action="php_page" target="iframe" etc...>
The iframe displays the results, and the php_page builds the results for the iframe. Basically, the php_page builds a table containing the results from a MySQL db and outputs it.
My problem is that this doesn't get indexed by google.
How can I solve this?
The reason I used an Iframe in the first place was to avoid page-reloading when hitting submit.
Ajax couldn't be used due to various reasons I won't go into here.
Any ideas what to do?
Thanks
UPDATE:
I also have a sitemap with URLs to all the classifieds, but I don't think this guarantees that Google will spider those URLs.
Trying to make the google spider crawl the results of a search form is not really the right approach.
Assuming you want google.com users to find your classified ads by searching Google, the best approach is to create a set of static HTML pages from the ads and link to them (not invisibly) from elsewhere on your site (probably best from the home page, but such a link can be in a footer or somewhere else unobtrusive).
They can also be linked to from your sitemap XML (you do have a sitemap XML file don't you?)
Note: the <iframe> doesn't really come into this. Or Ajax.
There is no way to make any webspider fill out and submit forms.
Workaround: Every night, create a dump of the database and save the HTML to a file. Create a link from index.html to that file. Use CSS classes to make the link invisible. This way, Google will pick it up but users won't see it.
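As a sketch of that workaround (table and column names are hypothetical, and the paths would need to match your setup):

<?php
// generate_ads_page.php -- run nightly from cron, e.g.:
//   0 3 * * * php /var/www/generate_ads_page.php
// Table and column names are placeholders.
$pdo  = new PDO('mysql:host=localhost;dbname=classifieds', 'user', 'pass');
$html = "<html><body><h1>All classifieds</h1>\n<ul>\n";
foreach ($pdo->query('SELECT id, title FROM ads') as $ad) {
    $html .= '<li><a href="/ad.php?id=' . (int) $ad['id'] . '">'
           . htmlspecialchars($ad['title']) . "</a></li>\n";
}
$html .= "</ul>\n</body></html>";
file_put_contents('/var/www/html/all-ads.html', $html);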
How is it possible to generate a list of all the pages of a given website programmatically using PHP?
What I'm basically trying to achieve is to generate something like a sitemap, as a nested unordered list with links to all the pages contained in a website.
If all pages are linked to one another, then you can use a crawler or spider to do this.
If there are pages that are not all linked you will need to come up with another method.
You can try this:
Add an "image bug/web beacon/web
bug" to each page you tracked as
follows:
OR
alternatively, add a JavaScript function to each page that makes a call to /scripts/logger.php. You can use any of the JavaScript libraries that make this super simple, like jQuery, MooTools, or YUI.
Create the logger.php script and have it save the request's originating URL somewhere, like a file or a database.
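The beacon itself could be as simple as <img src="/scripts/logger.php" width="1" height="1" alt=""> on every tracked page, and logger.php might look roughly like this (a minimal sketch, logging to a flat file):

<?php
// logger.php -- minimal sketch. Records which page requested the beacon,
// using the Referer header (note: not every client sends it).
$page = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : 'unknown';
file_put_contents(__DIR__ . '/visited.log',
                  date('c') . "\t" . $page . "\n",
                  FILE_APPEND | LOCK_EX);

// Return a 1x1 transparent GIF so the <img> tag renders cleanly.
header('Content-Type: image/gif');
echo base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');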
Pros:
- Fairly simple
Cons:
- Requires edits to each page
- Pages that aren't visited don't get logged
Some other techniques that don't really fit your need to do it programmatically, but may be worth considering, include:
- Create a spider or crawler
- Use a ripper such as cURL or Teleport Plus
- Use Google Analytics (similar to the image bug technique)
- Use a log analyzer like Webstats or a freeware UNIX webstats analyzer
You can easily list the files with the glob function. But if the pages use includes/requires and other stuff to mix multiple files into "one page", you'll need to import the Google "site:mysite.com" search results instead... or just create a table with the URL of every page :P
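For the glob approach, a rough sketch (note it lists files on disk, not URLs, so pages assembled from includes won't appear as separate entries; the web root path is a placeholder):

<?php
// List candidate "pages" (.php/.html files) under the web root with glob(),
// recursing into subdirectories.
function listPages($dir) {
    $pages = glob($dir . '/*.{php,html}', GLOB_BRACE);
    foreach (glob($dir . '/*', GLOB_ONLYDIR) as $sub) {
        $pages = array_merge($pages, listPages($sub));
    }
    return $pages;
}

foreach (listPages('/var/www/html') as $file) {
    echo $file . "\n";
}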
Maybe this can help:
http://www.xml-sitemaps.com/ (SiteMap Generator)