One codebase, multiple domains: how to create sitemap files - PHP

I have created a site that uses one codebase, but multiple domains access that codebase.
The content served up shows different CSS and imagery depending on the domain.
The question I'm running into is: how do I generate a sitemap file for each domain?
I have looked at using http://www.xml-sitemaps.com/ and their script, but that will only work for one domain.
Other than writing my own code to do the site scraping, I don't see any other route. Do you know of another solution so I don't have to start from scratch? Ideally I would love to hit the ground running.
Note: the script needs to crawl the site. Thoughts?

Creating multiple sitemaps for a single codebase is a challenging job, but not an impossible one. I am assuming that you are using some kind of framework for the website.
There are several problems that come up when building such a thing:
1) How to identify which request is coming from which website, so that the sitemap is created for the specific site the request was received on.
2) If you can identify which request is coming from which website, then your website is dynamic. How do you record these parameters?
3) Where to store such a huge amount of data. With multiple sites' requests/parameters, the database would have to be large enough to store all those requests.
4) If you somehow manage the huge database, the next problem is submitting such a huge XML file to the search engines.
5) The sitemap will keep growing daily, so the time needed to create it will go up, and the crawl requests hitting the website will also grow daily.
6) If your sitemap grows huge and the same pages are submitted for different websites, the content will be marked as spam, along with the websites.
There are also problems which cannot be foreseen or predicted, so it is a risky thing. Now, how to do it.
Solutions
For problems 1 and 2, use PHP's $_SERVER superglobal, which provides information about the server and execution environment, such as request parameters, the hostname, the requested host and many other things.
For problems 3 to 6, use text files to store the requests, one file per domain, holding the request details. The file should be flushed after a particular interval, i.e. daily, weekly, etc.
While creating the sitemap, read the file and keep only the unique parameters, so that the sitemap doesn't include the same URL more than once.
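The domain detection and the per-domain request log might look roughly like this; it is only a sketch, assuming a writable logs/ directory and one sitemap-<host>.xml file per domain (all of the paths and names are invented for illustration):
<?php
// Sketch only: log each request against the domain that served it.
// Assumes a writable logs/ directory; all paths and file names are illustrative.
$host = $_SERVER['HTTP_HOST'];                 // which domain this request came in on
$uri  = strtok($_SERVER['REQUEST_URI'], '?');  // drop the query string for cleaner entries
file_put_contents('logs/' . $host . '.txt', $uri . PHP_EOL, FILE_APPEND | LOCK_EX);

// Later (e.g. from a daily cron job), build one sitemap per domain from the unique URLs.
function buildSitemap(string $host): string
{
    $uris = array_unique(file('logs/' . $host . '.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . PHP_EOL
          . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . PHP_EOL;
    foreach ($uris as $uri) {
        $xml .= '  <url><loc>https://' . $host . htmlspecialchars($uri) . '</loc></url>' . PHP_EOL;
    }
    return $xml . '</urlset>' . PHP_EOL;
}
file_put_contents('sitemap-' . $host . '.xml', buildSitemap($host));
?>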
Warning: it is highly recommended that you don't do this if the domains serve the same pages, as it will likely trigger duplicate-content/spam detection and the websites would soon be identified and marked as spam.

Assumptions built into this answer:
The URI elements trailing the domain name are the same on each domain, for all pages.
i.e. http://site-one.com/page/1 is the same as http://site-two.com/page/1
You are able to manipulate the file provided by xml-sitemaps. This matters if you need to regenerate it on a continuous basis, in which case you would need a script to perform the following for each href.
If you don't mind using the service you mentioned at http://www.xml-sitemaps.com, then by far the easiest way to do this would be to use that service and then change all absolute URLs to relative URLs. You can rewrite any link that looks like
http://www.example.com/category/page
as a relative link at
/category/page
In short, that leading slash is the key, telling the browser to 'use the current domain'. You can do a find and replace on all instances of http://www.example.com/, converting each to / plus the remaining string of URI elements.
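If you take that route, one way to get a sitemap file per domain is to clone the file xml-sitemaps.com produced for one domain and swap the host for each of the others; a rough sketch (the file names and the domain list are placeholders, and note that the sitemap protocol itself wants absolute URLs, so the relative-URL trick above is for your page templates rather than the sitemap file):
<?php
// Rough sketch: clone one generated sitemap per domain by swapping the host.
// sitemap.xml, the output names and the domain list are placeholder values.
$source  = file_get_contents('sitemap.xml');   // generated for http://www.example.com
$domains = ['site-one.com', 'site-two.com'];

foreach ($domains as $domain) {
    $copy = str_replace('http://www.example.com', 'http://' . $domain, $source);
    file_put_contents('sitemap-' . $domain . '.xml', $copy);
}
?>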

Related

AJAX: is dynamically loading data from the same PHP file in separate calls OK?

I have an HTML/PHP/jQuery/MySQL web application.
It's an HTML Bootstrap base, with jQuery, other libraries and my custom scripts on the front end.
On the backend, I have several PHP files to serve the data.
For this example, say I have a CONTACTS php page where I need to display several data sets:
1) List of contacts
2) List of groups
3) list of tags associated with contacts
I have a backend php file in: engine/contacts.php
This is the PHP script that serves the contacts data as requested, based on GET flags, e.g.:
engine/contacts.php?list=contacts
engine/contacts.php?list=groups
engine/contacts.php?list=tags
Sure, I could serve them up in one call, but by design each part of the web page (contacts, groups, or tags) is a separate dataset, and this way, if one dataset is updated, I can refresh only that part. E.g.:
user adds a group, then JS will AJAX-load:
engine/contacts.php?list=groups
to update the groups area (div).
So, basically, on page load three separate JS calls are fired at the same time to load data from the same contacts.php file.
Is this an OK practice? I mean, it should be, because I see lots of sites doing the same.
And how does this impact the server side? Will the server execute this PHP file one request at a time? Would it be better if I separated the files? Like:
contacts.php
contacts_groups.php
contacts_tags.php
and called them simultaneously?
The reason I ask is that I'm currently debugging some performance issues. Simply put, I have a very lightweight PHP/MySQL web application with an HTML5/jQuery front end. The datasets being handled are minimal and the database tables have fewer than 50 rows.
But somehow my application is hitting resource limits on the shared host, particularly the 1 GB RAM limit, and I have reproduced this situation on a standalone domain with no other users and it still hits the limits.
I have gone through the PHP scripts and can't find anything. I do have loops, yes, but they are thoughtfully done and terminate after a few iterations.
I'm out of ideas, so I'm just trying to explore what else I can poke at.
Would appreciate some guidance, thanks.
I think that if you use an OOP structure, you can have a method handling each request on the backend. The best way, though, is to use an MVC framework that dispatches requests via URL routing to dedicated methods:
engine/contacts/contacts
engine/contacts/groups
engine/contacts/tags
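Short of a full framework, even a small dispatcher inside engine/contacts.php keeps the three list types in separate methods; a sketch only, in which ContactsController and its stub methods are hypothetical names rather than anything from the question:
<?php
// engine/contacts.php - illustrative dispatcher, not the asker's actual code.
// ContactsController and its stub methods are invented names.
class ContactsController
{
    public function contacts(): array { /* query the contacts table */ return []; }
    public function groups(): array   { /* query the groups table */   return []; }
    public function tags(): array     { /* query the tags table */     return []; }
}

$list       = $_GET['list'] ?? 'contacts';
$allowed    = ['contacts', 'groups', 'tags'];
$controller = new ContactsController();

header('Content-Type: application/json');
if (in_array($list, $allowed, true)) {
    echo json_encode($controller->$list());   // e.g. ?list=groups calls groups()
} else {
    http_response_code(400);
    echo json_encode(['error' => 'unknown list type']);
}
?>
Whether the three datasets live in one file or three, each AJAX call is still a separate request that PHP handles independently, so splitting the files will not by itself change the server load.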

Many HTTP requests (API) vs everything in a single PHP file

I need some advice on website design.
Let's take Twitter as an example for my question. Let's say I am making Twitter. Now on home_page.php, I need both data about tweets (tweet id, who tweeted, tweet time, etc.) and data about the user (userId, username, user profile pic).
Now, to display all this, I have two options in mind:
1) Making separate PHP files like tweets.php and userDetails.php. Using AJAX queries, I can get the data onto home_page.php.
2) Adding all the PHP code (connecting to the DB, fetching data) to home_page.php itself.
With option one, I need to make many HTTP requests, which (I think) will add load to the network, so it might slow down the website.
But with option two, I will have a defined REST API, which will be good for adding more features in the future.
Please give me some advice on picking the best one. Also, I am still a learner, so if there are more ways of implementing this, please share.
In number 1 you're reliant on JavaScript, which doesn't follow progressive enhancement or graceful degradation; if a user doesn't have JS they will see zero content, which is obviously bad.
Split your code into manageable PHP files to make it easier to read, and require them all in one main PHP file (a sketch follows below); this won't cost any extra HTTP requests, because all the includes are done server side and one page is sent back.
You can add additional JavaScript to grab more "tweets" like Twitter does, but don't make the main functionality rely on JavaScript.
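A minimal sketch of that layout, assuming file names like home_page.php, inc/tweets.php and inc/user_details.php (all invented here for illustration):
<?php
// home_page.php - illustrative only; the include paths and helper names are made up.
// Each include fetches its own dataset and defines a render_*() helper.
require __DIR__ . '/inc/db.php';            // shared PDO connection in $pdo
require __DIR__ . '/inc/tweets.php';        // defines render_tweets($pdo)
require __DIR__ . '/inc/user_details.php';  // defines render_user_details($pdo)
?>
<!DOCTYPE html>
<html>
  <body>
    <div id="user"><?php render_user_details($pdo); ?></div>
    <div id="tweets"><?php render_tweets($pdo); ?></div>
    <!-- Optional JS can refresh #tweets via AJAX later, but the page works without it. -->
  </body>
</html>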
Don't think of PHP applications as a collection of PHP files that map to different URLs. A single PHP file should handle all your requests and include functionality as needed.
In network programming, it's usually good to minimize the number of network requests, because each request introduces an overhead beyond the time it takes for the raw data to be transmitted (due to protocol-specific information being transmitted and the time it takes to establish a connection for example).
Don't rely on JavaScript. JavaScript can be used for usability enhancements, but must not be used to provide essential functionality of your application.
Adding to Kiee's answer:
It can also depend on the size of your content. If your tweets and user info are very large, the single PHP file will take considerable time to prepare and deliver the response. In that case, go for a "minimal viable response" (i.e. the last 10 tweets + the 10 most popular users, or similar).
But what you definitely will have to do is create an API to bring your page to life, no matter which approach you use...

How are web pages scraped, and how do you protect against someone doing it?

I'm not talking about extracting text or downloading a web page.
But I see people downloading whole websites. For example, there is a directory called "example" that isn't even linked on the website; how do I know it's there? How do I download ALL pages of a website? And how do I protect against that?
For example, there is "directory listing" in Apache; how do I get the list of directories under the root if there is already an index file?
This question is not language-specific. I would be happy with just a link that explains the techniques involved, or a detailed answer.
Ok, so to answer your questions one by one: how do you know that a 'hidden' (unlinked) directory is on the site? Well, you don't, but you can check the most common directory names and see whether they return HTTP 200 or 404... With a couple of threads you would be able to check even thousands a minute. That being said, you should always consider the number of requests you are making in relation to the specific website and the amount of traffic it handles, because for small to mid-sized websites this could cause connectivity issues or even a short DoS, which of course is undesirable. You can also use search engines to look for unlinked content; it may have been discovered by the search engine by accident, or there might have been a link to it from another site, etc. (for instance, google site:targetsite.com will list all the indexed pages).
How you download all pages of a website has already been answered: essentially you go to the base link, parse the HTML for links, images and other content that points to on-site content, and follow them. Further, you deconstruct links into their directories and check for indexes. You will also brute-force common directory and file names.
You really can't effectively protect against bots unless you limit the user experience. For instance, you could limit the number of requests per minute; but if you have an AJAX site, a normal user will also produce a large number of requests, so that really isn't the way to go. You can check the user agent and whitelist only 'regular' browsers; however, most scraping scripts will identify themselves as regular browsers, so that won't help you much either. Lastly, you can blacklist IPs, but that is not very effective: there are plenty of proxies, onion routing and other ways to change your IP.
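For the request-per-minute idea, a very rough sketch of such a throttle in PHP, assuming the APCu extension is available and with an arbitrary limit of 60 requests per minute per IP:
<?php
// Crude per-IP rate-limit sketch; assumes the APCu extension is enabled.
// The 60-requests-per-minute threshold is an arbitrary example value.
$ip  = $_SERVER['REMOTE_ADDR'];
$key = 'rl:' . $ip . ':' . floor(time() / 60);  // one counter per IP per minute

apcu_add($key, 0, 60);        // create the counter with a 60-second TTL if missing
$count = apcu_inc($key);

if ($count !== false && $count > 60) {
    http_response_code(429);  // Too Many Requests
    header('Retry-After: 60');
    exit('Rate limit exceeded.');
}
?>
As the answer notes, this mostly inconveniences normal users of an AJAX-heavy site, so treat it as a speed bump rather than real protection.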
You will get a directory listing only if a) it is not forbidden in the server config and b) there isn't a default index file (on Apache, index.html or index.php by default).
In practical terms, it is a good idea not to make it easier for the scraper, so make sure your website's search function is properly sanitized etc. (it doesn't return all records on an empty query, it filters the % sign if you are using MySQL LIKE syntax...). And of course use a CAPTCHA if appropriate, but it must be properly implemented, not a simple "what is 2 + 2" or a couple of letters in a common font on a plain background.
Another protection against scraping might be using referer checks to allow access to certain parts of the website; however, it is better to just forbid access to any parts of the website you don't want public on the server side (using .htaccess, for example).
Lastly, in my experience scrapers usually have only basic JS parsing capabilities, so implementing some kind of check in JavaScript could work; however, you'd also be excluding all visitors with JS switched off (or with NoScript or a similar browser plugin) or with an outdated browser.
To fully "download" a site you need a web crawler that, in addition to following the URLs, also saves their content. The application should be able to:
Parse the "root" URL
Identify all the links to other pages in the same domain
Access and download those pages, and all the ones contained in these child pages
Remember which links have already been parsed, in order to avoid loops
A search for "web crawler" should provide you with plenty of examples.
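As a bare-bones illustration of those four points, here is a sketch of such a loop; it assumes absolute same-host links, a writable pages/ directory, and ignores robots.txt, politeness delays and error handling, so treat it as a starting point only:
<?php
// Minimal breadth-first crawler sketch. Not production code: no robots.txt
// handling, no throttling, no retries, and relative links are simply skipped.
$start   = 'http://example.com/';
$host    = parse_url($start, PHP_URL_HOST);
$queue   = [$start];
$visited = [];

while ($url = array_shift($queue)) {
    if (isset($visited[$url])) {
        continue;                                  // remember parsed links, avoid loops
    }
    $visited[$url] = true;

    $html = @file_get_contents($url);              // fetch the page
    if ($html === false) {
        continue;
    }
    file_put_contents('pages/' . md5($url) . '.html', $html);   // save its content

    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    foreach ($doc->getElementsByTagName('a') as $a) {           // collect same-domain links
        $href = $a->getAttribute('href');
        if (parse_url($href, PHP_URL_HOST) === $host) {
            $queue[] = $href;
        }
    }
}
?>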
I don't know of counter-measures you could adopt to avoid this: in most cases you WANT bots to crawl your website, since it's how search engines will learn about your site.
I suppose you could look at traffic logs, and if you identify (by IP address) some repeat offenders, you could blacklist them, preventing access to the server.

How to get all webpages on a domain

I am making a simple web spider, and I was wondering if there is a way, triggered from my PHP code, to get all the webpages on a domain...
e.g. let's say I wanted to get all the webpages on Stackoverflow.com. That means that it would get:
https://stackoverflow.com/questions/ask
pulling webpages from an adult site -- how to get past the site agreement?
https://stackoverflow.com/questions/1234214/
Best Rails HTML Parser
And all the links. How can I get that? Or is there an API or DIRECTORY that can enable me to get that?
Also, is there a way I can get all the subdomains?
Btw, how do crawlers crawl websites that don't have sitemaps or syndication feeds?
Cheers.
If a site wants you to be able to do this, they will probably provide a Sitemap. Using a combination of a sitemap and following the links on pages, you should be able to traverse all the pages on a site - but this is really up to the owner of the site, and how accessible they make it.
If the site does not want you to do this, there is nothing you can do to work around it. HTTP does not provide any standard mechanism for listing the contents of a directory.
You would need to hack the server sorry.
What you can do, if you own the domain www.my-domain.com, is put a PHP file there that you use as an on-demand request file. That PHP file will need some sort of code that can look at the folders FTP-wise. PHP can connect to an FTP server, so that's a way to go :)
http://dk1.php.net/manual/en/book.ftp.php
With PHP you can read the directories and folders and return that as an array. Best I can do.
As you have said, you must follow all the links.
To do this, you must start by retrieving stackoverflow.com. Easy: file_get_contents("http://stackoverflow.com").
Then parse its contents, looking for links: <a href="question/ask">. Not so easy.
You store those new URLs in a database and then parse those afterwards, which will give you a whole new set of URLs; parse those too. Soon enough you'll have the vast majority of the site's content, including things like sub1.stackoverflow.com. This is called crawling, and it is quite simple to implement, although not so simple to retrieve useful information from once you have all that data.
If you are only interested in one particular domain, be sure to dismiss links to external sites.
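The 'not so easy' parsing step can be handled with DOMDocument rather than regular expressions; a sketch, with deliberately simplistic relative-URL handling:
<?php
// Sketch of the link-extraction step only. The relative-URL resolution is naive
// and just glues path-relative hrefs onto the base URL.
function extract_links(string $html, string $baseUrl): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);                    // suppress warnings on messy markup

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue;                          // skip empty and fragment-only links
        }
        if (parse_url($href, PHP_URL_SCHEME) === null) {
            $href = rtrim($baseUrl, '/') . '/' . ltrim($href, '/');   // make it absolute
        }
        $links[] = $href;
    }
    return array_unique($links);
}

$html = file_get_contents('http://stackoverflow.com');
print_r(extract_links($html, 'http://stackoverflow.com'));
?>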
No, not the way you are asking.
However, provided you have a clear goal in mind, you may be able to:
use a "primary" request to get the objects of interest. Some sites provide JSON, XML, ... apis to list such objects (e.g SO can list questions this way). Then use "per-object" requests to fetch information specific to one object
fetch information from other open (or paid) sources, e.g. search engines, directories, "forensic" tools such as SpyOnWeb
reverse engineer the structure of the site, e.g. you know that /item/<id> gets you to the page of item whose ID is <id>
ask the webmaster
Please note that some of these solutions may be in violation of the site's terms of use. Anyway, these are just pointers, off the top of my head.
You can use WinHTTrack, but it is polite not to hammer other people's websites.
I just use it to find broken links and make a snapshot.
If you do start hammering other people's sites, they will take measures. Some of them will not be nice (i.e. hammering yours).
Just be polite.

Google Sitemap - Should I provision for load control / caching?

I have a community site which has around 10,000 listings at the moment. I am adopting a new URL strategy, something like
example.com/products/category/some-product-name
As part of the strategy, I am implementing a sitemap. Google already has a good index of my site, but the URLs will change. I use a PHP framework which accesses the DB for each product listing.
I am concerned about the performance effects of supplying 10,000 new URLs to Google; should I be?
A possible solution I'm looking at is rendering my PHP-generated pages to static HTML pages. I already have this functionality elsewhere on the site. That way, Google would index 10,000 HTML pages. The beauty of this system is that if a user arrives via Google on that HTML page, as soon as they start navigating around the site, they jump straight back into the PHP version.
My problem with this method is that I would have to append .html to my nice clean URLs...
example.com/products/category/some-product-name.html
Am I going about this the wrong way?
Edit 1:
I want to cut down on PHP and MySQL overhead. Creating the HTML pages is just a method of caching in preparation for the load spike as the search engines crawl those pages. Are there better ways?
Unless I'm missing something, I think you don't need to worry about it. I'm assuming that your list of product names doesn't change all that often -- on a scale of a day or so, not every second. The Google site-map should be read in a second or less, and the crawler isn't going to crawl you instantly after you update. I'd try it without any complications and measure the effect before you break your neck optimizing.
You shouldn't be worried about 10,000 new links, but you might want to analyze your current Google traffic to see how fast Google would crawl them. Caching is always a good idea (see: Memcache, or even generating static files?).
For example, I currently get about 5 requests/second from Googlebot, which means Google would crawl those 10,000 pages in a good half hour. But consider this:
Redirect all existing links to the new locations
By doing this, you ensure that links already indexed by Google and other search engines are almost immediately rewritten. The current Google rank is migrated to the new link (additional links start with a score of 0). A sketch of such a redirect follows after this answer.
Google Analytics
We have noticed that Google uses Analytics data to crawl pages that it usually wouldn't find with normal crawling (JavaScript redirects, links to logged-in user content). Chances are Google will pick up on your URL change very quickly, but see the first point above.
Sitemap
The rule of thumb for the sitemap files, in our case, is to keep them updated only with the latest content. Keeping 10,000 links, or even all of your links, in there is pretty pointless. How will you update this file?
It's a love & hate relationship between me and the Google crawler these days, since the links most used by users are pretty well cached, but the things the Google crawler crawls usually are not. This is the reason Google causes 6x the load in 1/6th of the requests.
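A sketch of that redirect step, assuming the old listing URLs can be mapped onto the new /products/... scheme by a lookup you already have (the map_old_to_new() function below is purely illustrative):
<?php
// Illustrative 301 redirect for old listing URLs; map_old_to_new() stands in for
// whatever lookup (database, array, rewrite rules) maps an old path to a new one.
function map_old_to_new(string $oldPath): ?string
{
    // e.g. '/listing.php?id=123' -> '/products/category/some-product-name'
    return null;   // placeholder: return the new path, or null if unknown
}

$new = map_old_to_new($_SERVER['REQUEST_URI']);
if ($new !== null) {
    header('Location: https://example.com' . $new, true, 301);   // permanent redirect
    exit;
}
http_response_code(410);   // listings that no longer exist
?>
A permanent (301) redirect is what tells the search engines to transfer the old URL's ranking to the new one.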
Not an answer to your main question.
You don't have to append .html. You can leave the URLs as they are. If you can't find a better way to route the clean URL to the HTML file (which does not have to have an .html suffix), you can output it via PHP with readfile.
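For instance, a front controller could check for a pre-rendered file and fall back to the normal PHP rendering; a rough sketch, assuming the cached pages live under a cache/ directory keyed by the request path (both the directory and the hashing scheme are invented for illustration):
<?php
// Rough sketch: serve a pre-rendered HTML file when one exists, keeping the
// clean URL. The cache/ directory and the md5 keying are illustrative only.
$path      = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);  // e.g. /products/category/some-product-name
$cacheFile = __DIR__ . '/cache/' . md5($path) . '.html';

if (is_file($cacheFile)) {
    header('Content-Type: text/html; charset=utf-8');
    readfile($cacheFile);       // static copy, no DB work
    exit;
}

// Otherwise fall through to the normal PHP/MySQL rendering, and optionally
// write the rendered output to $cacheFile for next time.
?>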
I am concerned about the performance effects of supplying 10,000 new URLs to Google; should I be?
Performance effects on Google's servers? I wouldn't worry about it.
Performance effects on your own servers? I also wouldn't worry about it. I doubt you'll get much more traffic than you used to, you'll just get it sent to different URLs.
