Curl Check if a domain is root


Hello I am trying to make a little spider.
While I was building it I came across a problem where I need to check if a link is a root domain link or a subdomain link.
For example:
http://www.domain.com or
http://domain.com
http://domain.com/index.php
http://domain.com/default.php
http://domain.com/index.html
http://domain.com/default.html
and so on are all the same.
So I need a function that takes the URL string as input and checks whether it is the root (or home page, whatever you like to call it) of a site.

As noted in comments, this is really a basic aspect of coding the spider. If you intend to code a general purpose spider, you'll need to add means to resolve URLs and detect if they point to the same content and in what way (through a redirect or simply through duplicate content), as well as what kind of content they point to.
You need at least to handle:
relative paths
GET variables that are significant to the web page in one way or another but do not produce differences in the content.
Malformed URLs.
JavaScript related information in the href attribute.
Links to non-HTML material: direct download links to PDFs, images, etc. (detecting these by extension isn't always enough, given that PHP scripts can deliver images).
These are just some of the aspects, but it all comes down to this: the kind of detection you're after has to be a fundamental part of the spider if you intend to use it in any kind of generic manner.
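For the narrow check the question asks for, a minimal string-only sketch could look like the following. It assumes that "root" means an empty path or one of a few well-known index file names; the function name is_root_url and the list of index files are my own assumptions, and nothing here fetches the page or follows redirects.

```php
<?php
// Hypothetical helper: returns true when $url looks like a site's home page.
// The list of index file names is an assumption; extend it for your targets.
function is_root_url(string $url): bool
{
    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return false; // malformed or relative URL
    }
    if (!empty($parts['query'])) {
        return false; // GET variables may select a different page
    }
    // parse_url() omits 'path' when the URL has none, e.g. http://domain.com
    $path = strtolower($parts['path'] ?? '/');
    $indexFiles = [
        '/', '/index.php', '/index.html', '/index.htm',
        '/default.php', '/default.html', '/default.htm',
    ];
    return in_array($path, $indexFiles, true);
}
```

This only covers the examples in the question. Relative paths, redirects, and duplicate content still need the handling described above.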

Related

Convert URL into one standard format

Here are a few URLs:
http://sub.example.com/?feed=atom&hello=world
http://www.sub.example.com/?feed=atom&hello=world
http://sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom#123
As you can see, they all lead to the exact same page, but the URL format is different. Here are two other basic examples:
http://example.com/hello/
http://example.com/hello
Both are the same.
I want to convert the URL into one standard format so that when I store the URL in the database, I can easily check whether the URL string already exists in the database.
Because of the various ways of how the URL can be formatted, this can be puzzling.
What's the definitive approach to converting URL into one standard format? Maybe parse_url() route...?
Edit
As outlined in the comments, there is no definitive solution to this, but the aim is to get as close as possible with what we have without "retrieving" the page. Please read comments before posting an answer to this bounty.

After you parse_url:
Remove the www prefix from the domain name
If the path is not empty - remove the trailing slash from it
Sort query parameters alphabetically by their name - if there are any
Combine these parts in order to get a canonical URL.
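The steps above can be sketched roughly like this. The function name canonical_url is hypothetical, and dropping the fragment and lowercasing the scheme and host are my own assumptions on top of the listed steps. Note that parse_str() rewrites dots and spaces in parameter names to underscores, so this is an approximation, not a definitive normalizer.

```php
<?php
// Hypothetical sketch of the steps above; returns null for unparsable input.
function canonical_url(string $url): ?string
{
    $p = parse_url($url);
    if ($p === false || empty($p['host'])) {
        return null;
    }
    $scheme = strtolower($p['scheme'] ?? 'http');
    $host = strtolower($p['host']);
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4); // remove the www prefix
    }
    $port = isset($p['port']) ? ':' . $p['port'] : '';
    $path = $p['path'] ?? '';
    if ($path !== '') {
        $path = rtrim($path, '/'); // remove the trailing slash
    }
    $query = '';
    if (!empty($p['query'])) {
        parse_str($p['query'], $params);
        ksort($params); // sort parameters alphabetically by name
        $query = '?' . http_build_query($params);
    }
    // Assumption: the fragment (#...) is dropped on purpose.
    return $scheme . '://' . $host . $port . $path . $query;
}
```

With this, the example URLs from the question collapse to the same string, which can then be stored and compared in the database.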

I had the same issue with a feature for saving report configurations. In our system, users can design their own sales reports (like JQL in Jira); for that, we use GET parameters as conditions and the fragment identifier (after #) as the layout setup, like this:
http://example.com/report.php?since=20180101&until=20180806#sort=amount&color=blue
For our system, the order of the GET parameters and of the parameters after # is irrelevant: you reach the same report configuration whether "until" or "since" comes first, so for us they are the same request.
Given this, subdomains are out of scope, because you must handle them with rewrite techniques (like mod_rewrite with a 301 in Apache) or maintain a list of domain exceptions at the software level. Also, different domains can point to different websites, so you must decide whether this is a good idea; for subdomains, "www" is very easy to figure out, but other cases will take you time.
The server side can help with the variables in the query section. For example, in PHP you can use parse_str() on $_SERVER['QUERY_STRING'] to get an array, sort it by key with ksort(), and finally compare the arrays to see whether they are the same request (for example with array_diff_assoc()).
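A minimal sketch of that server-side comparison, assuming both query strings are available as plain strings (the function name same_query is hypothetical):

```php
<?php
// Hypothetical helper: true when two query strings carry the same
// parameters and values, regardless of the order they appear in.
function same_query(string $a, string $b): bool
{
    parse_str($a, $paramsA);
    parse_str($b, $paramsB);
    ksort($paramsA); // order by parameter name
    ksort($paramsB);
    return $paramsA === $paramsB;
}
```

For example, same_query('since=20180101&until=20180806', 'until=20180806&since=20180101') treats both orderings as the same request.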
Unfortunately, the server side alone is not enough, because the browser does not send the content after the hash (#) to the server, and we still have not considered other problems, such as an included script name, protocols, or ports:
http://www.sub.example.com/index.php?hello=world&feed=atom
https://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com:8081/?hello=world&feed=atom
In my experience, the closest solution is JavaScript: handle the URL, parse the query section into an array, compare, and do the same with the fragment identifier. If you need the result on the server side, every page load must be followed by an Ajax request sending this data to the server.
Apologies in advance for length of my answer, but it is what I had to go through in order to solve the same problems you have. Greetings!
Get protocol, domain, and port from URL
How can I get query string values in JavaScript?
How do I get the fragment identifier (value after hash #) from a URL?

Adding the preferred <link rel="canonical" ... > tag to the HTML head is the only reliable solution for tying unique content to a single SEF URL. See Google's documentation on consolidating duplicate URLs, which possibly answers the whole question more authoritatively and reliably than I ever could.
The idea of knowing the canonical URL, or resolving a bunch of external URLs, without parsing those servers' .htaccess rewrite rules or the HTML head does not appear workable (simply because one can maintain a table of URL aliases, which makes it impossible to guess how an HTTP request might have been rewritten).
This question might belong on https://webmasters.stackexchange.com/search?q=cannonical.

Since the question is marked „PHP“ I assume you are in the backend.
There are enough answers on how you can compare URLs (protocol, host, port, path, list of request params), where the path is case sensitive and the protocol and host are not. Changing the order of request parameters is, strictly speaking, also changing the URL.
My impression is that you want to differentiate by the RESOURCE which the server is serving (http://www.sub.example.com/ serves the same resource as http://sub.example.com/, or .../hello serves the same resource as .../hello/).
You should know perfectly well at the backend level which resource is served, since you (the backend) know what you are serving. Find the perfect ID for the resource and use it.
PS: the URL is not a good identifier for that. But if you must use it, use a sanitized version. Sanitize for your purpose: normalize to your preferred host, strip or add slashes at the end of paths, drop things like /../ from the path (a security issue anyway), and put the request params in a fixed order, whatever is right for your purpose.
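As an illustration of the path part of that sanitization, here is a small sketch that drops "." and ".." segments, collapses repeated slashes, and strips the trailing slash. The function name normalize_path is hypothetical, and stripping (instead of adding) the trailing slash is just one of the choices mentioned above.

```php
<?php
// Hypothetical helper: normalizes the path component of a URL.
function normalize_path(string $path): string
{
    $kept = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '' || $segment === '.') {
            continue; // skip empty segments (double slashes) and "."
        }
        if ($segment === '..') {
            array_pop($kept); // step up one level, never above the root
            continue;
        }
        $kept[] = $segment;
    }
    return '/' . implode('/', $kept);
}
```

For instance, normalize_path('/a/b/../c/') gives '/a/c', and a request for '/../../etc/passwd' cannot climb above the root.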
Best regards, iPirat

This is a case of duplicate URLs, and you can avoid this kind of duplicate by using a URL factory that redirects every URL that is not the proper one to the proper URL.
And the same thing is explained in this article:
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38
Any other URLs leading to the same page are 301 redirected to the proper version of the URL.
This is a best practice in search engine optimization (SEO). Here are a couple of examples.
Consider the URLs of this website. For example, the wrong links to this page are
https://stackoverflow.com/questions/51685850
https://stackoverflow.com/questions/51685850/convert-url-into-one-s
https://stackoverflow.com/questions/51685850/
If you go to the above wrong URLs of this page, you'll be redirected to the proper URL which is
https://stackoverflow.com/questions/51685850/convert-url-into-one-standard-format
And if you change the title of this question, all other URLs are 301 redirected to the proper URL. The idea here is the 301 redirection which tells the search engines to replace the old URL with the new one otherwise the search engines find different URLs providing the same content.
The real deal here is the id of the question, 51685850. This id is used to create the proper URL with the information from the database. With the URL factory that is created in the article in the link provided, you do not even need to store URLs in the database.
You can read more on duplicate content here:
https://moz.com/learn/seo/duplicate-content
The same rules are applied to tinywebhut.com as well, the wrong URLs are
https://www.tinywebhut.com/remove-duplicate-38
https://www.tinywebhut.com/some-text-38
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38/
In the above URLs the ID, 38, is appended to the end of the URL, and if you go to any of these URLs, you'll be 301 redirected to the proper version of the URL, which is
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38
I didn't make any functions to explain this here because it is already done in this article:
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38
You can achieve the goal with a couple of really simple functions, and you can apply the same idea to remove other duplicate URLs such as /about.php, /about, /about.php/, /about/ and so on. To achieve this you just need to add a little more code to your existing functions.
One alternative is to add a canonical tag. Even if more than one URL leads to the same page, you just need to add the canonical tag with a link to the proper URL.
<link rel="canonical" href="https://stackoverflow.com/questions/51685850/convert-url-into-one-standard-format" />
This way you are telling the search engines that the multiple URLs should be considered as one and the search engines add the link used in the canonical tag in their search results. You can read more on canonicalization here:
https://moz.com/learn/seo/canonicalization
But still, the best way to get rid of duplicate content is the 301 redirect. If you have a 301 redirect like the one I described at the beginning, all problems are solved without surprises.

My original answer assumes that the pages are all owned by the OP, as per the line "As you can see, they all lead to the exact same page but the URL format is different...". I am adapting the answer to handle multiple options and adding a list of assumptions you can and cannot make about URLs.
As others have pointed out there is no definitive easy answer to this if you do not know that the page(s) are the same. However, if you follow these assumptions, you should be safe standardizing some things:
CAN ASSUME
Query strings with the same values point to the same location regardless of order. Example: https://example.com/?fruit=apple&color=red is the same as https://example.com/?color=red&fruit=apple
301 redirects to a specific source can be followed. If you receive a 301 redirect response, follow the redirect and use that URL. You can safely assume that if a URL actually does point to the same page, and page rank is optimized, then you can follow it.
If there is a single <link rel="canonical"> tag in the HTML, that too can be used to cover the canonical link (see below for why).
CANNOT ASSUME
Any URL is guaranteed to be the same as any other URL, if they are different (by URL in this case I am talking about anything before the query string).
http://example.com can be different from https://example.com can be different from http://www.example.com or https://www.example.com. There is no restriction against showing a different website when putting "www" or leaving it out. That's why page rank on search engines is really damaged here.
Any two URLs, even if they currently have exactly the same content, will keep exactly the same content. An example would be https://example.com/test and https://sub.example.com/test. Both may feasibly be set to the same generic test page content. In the future, https://sub.example.com/test may be changed. You can't assume it won't be.
If you own the site
First, redirect all traffic to the URL format you want: Do you want www.example.com or example.com or sub.example.com? Do you want a trailing slash or not? Do this redirect first, either using server rules or PHP. This is also highly beneficial for search page rank (if that matters to you).
An example of this would be something like this:
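// $_SERVER['HTTPS'] may be unset, or 'off' on some servers
$https = !empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off';
$path = rtrim($_SERVER['PHP_SELF'], '/');
if (!$https || 'example.com' !== $_SERVER['HTTP_HOST'] || $path !== $_SERVER['PHP_SELF']) {
    header('HTTP/1.1 301 Moved Permanently');
    // PHP_SELF already starts with a slash, so none is added after the host
    header('Location: https://example.com' . $path);
    exit;
}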
Finally, to manage any remaining SEO concerns, you can add this HTML tag:
`<link rel="canonical" href="<?php echo $url; ?>">`
Whether you own the site or not, you can standardize query order
Even if you don't control the site, you can assume that query order does not matter. To standardize this, take your query and rebuild the parameters, appending it to your normalized URL.
function getSortedQuery()
{
    $url = [];
    parse_str($_SERVER['QUERY_STRING'], $url);
    ksort($url);
    return http_build_query($url);
}
$url = $_SERVER['HTTP_HOST'].$_SERVER['PHP_SELF'].'?'.getSortedQuery();
Another option is to grab the contents of the page and see if there is a <link rel="canonical"> string, and use that string to log your data. This is a bit more costly as it requires a full page load.
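A rough sketch of pulling that tag out of HTML you have already downloaded, using PHP's DOM extension. The function name find_canonical is hypothetical, and following the "single tag" rule from the list above, it returns null when there is no canonical link or more than one.

```php
<?php
// Hypothetical helper: returns the href of the page's only
// <link rel="canonical"> tag, or null if there is none or several.
function find_canonical(string $html): ?string
{
    if (trim($html) === '') {
        return null;
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = [];
    foreach ($doc->getElementsByTagName('link') as $link) {
        // rel can hold several space-separated values
        $rels = preg_split('/\s+/', strtolower(trim($link->getAttribute('rel'))));
        $href = $link->getAttribute('href');
        if (in_array('canonical', $rels, true) && $href !== '') {
            $found[] = $href;
        }
    }
    return count($found) === 1 ? $found[0] : null;
}
```

The returned href may be relative, so it still has to be resolved against the page URL before you store it.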
To repeat, do make sure you grab 301 redirects as they are not suggestions, but directives, as to the end result URL.
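One way to do that, sketched under a few assumptions: the fetcher is passed in as a callable so the redirect logic can be tried without network access, curl_head is a hypothetical curl-based fetcher, and only 301 responses are followed (308 could be added the same way).

```php
<?php
// Hypothetical fetcher: returns [statusCode, redirectUrlOrNull] for a URL.
// Needs the curl extension, but only when it is actually called.
function curl_head(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY => true,          // headers only
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => false, // we follow redirects ourselves
        CURLOPT_TIMEOUT => 10,
    ]);
    curl_exec($ch);
    $status = (int) curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
    $location = curl_getinfo($ch, CURLINFO_REDIRECT_URL) ?: null;
    return [$status, $location];
}

// Follows 301 redirects until a non-301 response, a loop, or $maxHops.
function resolve_permanent(string $url, callable $fetch, int $maxHops = 10): string
{
    $seen = [];
    for ($i = 0; $i < $maxHops; $i++) {
        if (isset($seen[$url])) {
            break; // redirect loop
        }
        $seen[$url] = true;
        [$status, $location] = $fetch($url);
        if ($status !== 301 || $location === null) {
            break; // not a permanent redirect: this is the effective URL
        }
        $url = $location;
    }
    return $url;
}
```

In a crawler you would call resolve_permanent($url, 'curl_head') and store the result.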
One final suggestion
I might recommend using two columns, one being "canonical_url" and another being "effective_url". Sometimes a URL works and then later becomes a 301 redirect. This is just my take but I would like to know these things.

All of the answers have great information. Assuming you are using an Apache-like server, for the URL bit, I would use .htaccess (or, preferably, if you can change it - the equivalent server Apache config file) to do the rewrites. For a simple example:
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_HOST} ^www\.example\.com$
RewriteRule (.*) http://example.com/$1 [R=Permanent]
In this example, the "R=Permanent" DOES do a redirect. This is usually not a big issue as, a) it tells the browser to remember the redirect, and b) your internal links are presumably relative, so protocol (http or https) and the server (example.com or whatever) are preserved. So generally the redirect will be once per session or less - time well spent, IMO, to avoid doing all this in PHP.
I guess you could use it to rewrite the order of the query bits as well, though when the query bits are significant, I tend to (not recommending you do, just sayin') add them to my path (e.g. rewrite ".../blah/atom" to ".../blah.php?feed=atom"). At any rate, there are loads of rewrite tricks available, and I recommend you read about them in the Apache mod_rewrite documentation.
If you do go this route, be sure to carefully think through what you want to happen - once you start mucking with the URL's, you are usually stuck with your decisions for a long while.

As several have pointed out, while the URLs you show may currently point to the same content, there is no way to tell if they will in the future. A change in either protocol or hostname can get you different sets of content, even example.com vs. www.example.com, even if served up by the same machine at the same IP. Not common, but it can happen...
So if I wanted to maintain a list of URLs, I would store the protocol, hostname, directory path, filename if present (aka "whatever came after the last slash before a question mark"), and the GET arguments as a set of key/value pairs sorted by key.
And then don't forget that you can go to https://www.google.com and not have anything BUT the protocol and hostname...

Avoid passing the parameters in the url. Pass your parameters to the web page using JSON.

How to mask url having many sub directories and files?

I have a website with many directories and files. I want to hide all the subdirectory and file names, so that https://example.com/folder_01/file.php appears as https://example.com. I was able to hide a single folder name using a rewrite rule in .htaccess on an Apache server. I also tried a frameset, but the browser reports an unsafe script when I run the website. Can anyone help me?
Thanks in advance.

This isn't possible.
A URL is how the browser asks the server for something.
If you want different things, then they need different URLs.
If what you desired was possible, then somehow the server would have to know that if my browser asked for / then it meant "The picture of the cat" while also knowing that if my browser asks for / then it means "The picture of the dog".
It would be like stopping at a fast food drivethru where you had never been before, didn't know anyone who worked there, and asking for "My usual" and expecting them to be able to know what that was.
You mentioned using frames, which is an old and very horrible hack that will keep a constant URL displayed in the browser's address bar but has no real effect beyond making life more difficult for the user.
They can still look at the Network tab in their browser's developer tools to see the real URL.
They can still right click a link and "Open in new tab" to escape the frames.
Links from search engines will skip right past the frames and index the URLs of the underlying pages which have actual content.
URLs are a fundamental part of the WWW. Don't try to break them. You'll only hurt your site.

Using /index.php as a simple template and document router for a website?

I need to put together a simple/small website fairly quickly for internal use (no external access), which will contain a few forms and simple interactive pages.
I don't really want the maintenance overhead of a CMS, but so that each page has a consistent look and feel, etc, it would be useful to be able to have each page be based on a common template to wrap around the unique page content, to include the HTML head, title, site navigation, footer, etc.
One approach would be to include various snippets via PHP within each individual page, but that involves repetition in each page, and doesn't scale well if I decide I might need to substantially change things later.
The alternative approach would be to use the main DocumentRoot index.php file as the template, and instead have it include the requested page 'within' itself (so that each of the other pages is actually really only a partial file defining variables for the 'title' and 'body' (for the main page body content)).
I see that I can use $_SERVER['PATH_INFO'] to extract the desired file path (relative to the DocumentRoot) and $_SERVER['REQUEST_URI'] to get the whole request string (in case there might be any GET parameters); for the actual content of the index page itself I could have it include an alternatively-named file instead; and there must be some way in Apache rewrite rules that it would be possible to elide out the index.php from the eventual URIs, ..but I haven't yet thought through very much further than this.
I am sure that this must be a scenario encountered many many times before. I could well spend a couple of days trying to think this through and re-invent my own wheel, but I don't really have the time to do this, and it probably wouldn't be a very good use of it in any case.
Does anybody have some existing "quickie" code that would be "good enough" for this, or know of something "formally" published (at which point does a working quick hack become an actual software package?!)?
Thanks for any advice.

You could send the requested page as a URL parameter to your index.php, like 'index.php?page=requested_page', and put the following include in your index.php at the position where you want the content to appear:
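<?php
// basename() strips directory parts, so ?page=../../etc/passwd cannot escape ./pages
$page = basename($_GET['page'] ?? '');
if (is_file("./pages/" . $page . ".php")) {
    include "./pages/" . $page . ".php";
}
?>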

I'm not sure if any effort you will put into this will be less than the effort of implementing a simple CMS. Most systems have simple setups and do most of the work for you. Any time you save on not implementing a CMS will cost you later on maintenance. Don't use 'internal use' as an excuse to make bad software. Unless it's a one-time solution that will be disregarded in a few weeks, you (or another developer) will have to extend and maintain the software.

Redirect all requests to index:
RewriteEngine on
RewriteBase /
RewriteRule ^(?!index\.php$|(?:css|js|media)(?:/|$)). index.php [L]
Something like that will stop redirect loops on the index and send everything except requests to static content folders (change to suit) to your handler in the index. Access the URL via $_SERVER['REQUEST_URI'].
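On the PHP side, the handler in index.php could then map the request URI to a page file. This is only a sketch under my own assumptions: the function name resolve_page, the pages directory, and 'home' as the default page are all hypothetical, and the page files are expected to set $title and $body as described in the question.

```php
<?php
// Hypothetical resolver for index.php: maps a request URI such as
// /reports/sales?year=2018 to "<pagesDir>/reports/sales.php".
// Returns null when the page is unknown or the name looks unsafe.
function resolve_page(string $requestUri, string $pagesDir): ?string
{
    $path = parse_url($requestUri, PHP_URL_PATH);
    if (!is_string($path)) {
        return null;
    }
    $name = trim($path, '/');
    if ($name === '') {
        $name = 'home'; // assumption: the index content lives in home.php
    }
    // Allow only simple segments, so ".." and odd characters are rejected.
    if (!preg_match('#^[a-z0-9_-]+(/[a-z0-9_-]+)*$#i', $name)) {
        return null;
    }
    $file = $pagesDir . '/' . $name . '.php';
    return is_file($file) ? $file : null;
}

// Possible use inside index.php:
//   $file = resolve_page($_SERVER['REQUEST_URI'], __DIR__ . '/pages');
//   if ($file === null) { http_response_code(404); } else { require $file; }
//   ...then print the shared head, navigation, $body and footer.
```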

Form submission URL - with no "slash" in the `action` tag?

I've created a form with this action tag inside:
action="/folder/"
My common sense says that now I should create a directory named "folder",
and place an index.php file inside that would process the form.
But I'm not sure if that is the common way that this is done.
For example, google.com have this action tag for their search button:
action="/search"
No slash at the end.
But in order for my form to work, I must have the slash, or else the browser would not get to my index.php file.
So how is it usually done?
Is it ok to make the form my way? any drawbacks?
In what way can I cut the slash from my form and still make it work? (maybe something with .htaccess?)
Thanks

I doubt very much that they are using folders with PHP files inside them; they will be using a routing standard to make clean URLs.
"Clean URLs, RESTful URLs, user-friendly URLs or SEO-friendly URLs are purely structural URLs that do not contain a query string [e.g., action=delete&id=91] and instead contain only the path of the resource (after the scheme [e.g., http] and the authority [e.g., example.org]). This is often done for aesthetic, usability, or search engine optimization (SEO) purposes. Other reasons for designing a clean URL structure for a website or web service include ensuring that individual web resources remain under the same URL for years, which makes the World Wide Web a more stable and useful system, and to make them memorable, logical, easy to type, human-centric, and long-lived."
http://en.wikipedia.org/wiki/Clean_URL
I noticed you used the PHP tag in this question, so I have added a link to Symfony routing, which might give you an idea of how you could implement such a feature in your project.
http://symfony.com/doc/current/book/routing.html

how to fake url detection by php

I'm working on a script that indexes and downloads a whole website from a URL sent by the user.
For example, when a user submits a domain like http://example.com, I copy all the links on the index page, then download the pages those links point to, and repeat the process from the start for each of them.
I do this part with curl and regular expressions to download the pages and extract the links.
However, some websites generate fake URLs. For example, if you go to http://example.com?page=12, it has links to http://example.com?page=12&id=10 or http://example.com?page=13, and so on.
This creates a loop, and the script can't finish downloading the site.
Is there any way to detect these kinds of pages?
P.S.: I think Google, Yahoo, and other search engines face this kind of problem too, but their databases are clean and their search results don't show this kind of data.

Some pages may use GET variables and be perfectly valid (like as you've mentioned here, ?page=12 and ?page=13 may be acceptable). So what I believe you're actually looking for here is a unique page.
It's not possible however to detect these straight from their URL. ?page=12 may point to exactly the same thing as ?page=12&id=1 does; they may not. The only way to detect one of these is to download it, compare the download to pages you've already got, and as a result find out if it really is one you haven't seen yet. If you have seen it before, don't crawl its links.
Minor side note here: Make sure you block websites from a different domain, otherwise you may accidentally start crawling the whole web :)
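A minimal sketch of both ideas, assuming an exact content hash is good enough to call two downloads "the same page" (pages that differ only by a timestamp or a counter will still look new). The class name SeenPages and the function same_host are hypothetical.

```php
<?php
// Hypothetical registry of page bodies the crawler has already seen.
final class SeenPages
{
    /** @var array<string, bool> sha1 of body => true */
    private array $hashes = [];

    // Returns true the first time a body is seen, false for repeats.
    public function isNew(string $body): bool
    {
        $hash = sha1($body);
        if (isset($this->hashes[$hash])) {
            return false;
        }
        $this->hashes[$hash] = true;
        return true;
    }
}

// True when $url is on the same host as the URL the crawl started from.
function same_host(string $url, string $startUrl): bool
{
    $a = parse_url($url, PHP_URL_HOST);
    $b = parse_url($startUrl, PHP_URL_HOST);
    return is_string($a) && is_string($b) && strcasecmp($a, $b) === 0;
}
```

In the crawl loop you would skip any link where same_host() is false, and only extract links from a downloaded page when isNew() returns true.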
