Here are a few URLs:
http://sub.example.com/?feed=atom&hello=world
http://www.sub.example.com/?feed=atom&hello=world
http://sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom#123
As you can see, they all lead to the exact same page but the URL format is different. Here is two other basic examples:
http://example.com/hello/
http://example.com/hello
Both are the same.
I want to convert the URL into one standard format so that when I store the URL in the database, I can easily check whether if the URL string already exists in the database.
Because of the various ways of how the URL can be formatted, this can be puzzling.
What's the definitive approach to converting URL into one standard format? Maybe parse_url() route...?
Edit
As outlined in the comments, there is no definitive solution to this, but the aim is to get as close as possible with what we have without "retrieving" the page. Please read comments before posting an answer to this bounty.
After you parse_url:
Remove the www prefix from the domain name
If the path is not empty - remove the trailing slash from it
Sort query parameters alphabetically by their name - if there are any
Combine these parts in order to get a canonical URL.
I had the same issue for a reports-configuration-save functionality. In our system, users can design his own reports of sales (like JQL of Jira); for that, we use get params as conditions, and fragment identifier (after #) as layout setup, like this:
http://example.com/report.php?since=20180101&until=20180806#sort=amount&color=blue
For our system, order of GET or after # params are irrelevant as well you reach the same report configuration if set param "until" first than "since", so for us are the same request.
Considering this, subdomains are out of discussion, cause you must solve this using rewrite techniques (like mod_rewrite with 301 in Apache) or create a pool of domain exceptions to do this at software level. Also, different domains can point into different websites, so you must decide if is a good idea; in subdos "www" is very easy to figured it out, but it will toke you time in another cases.
Server side can help to get vars in query section. For example, in PHP you can use function parse_str and $_SERVER['QUERY_STRING'] to get array, and then, you will need use asort() to order it to finnaly compare if are the same request (array_diff function).
Unfortunately, server side is not an option since have no capability to get after hash (#) content, and we still without consider another problems, like scriptname included, protocols or ports:
http://www.sub.example.com/index.php?hello=world&feed=atom
https://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com:8081/?hello=world&feed=atom
In my personal experience, the most close solution is JavaScript, for handling url, parsing query section as array, compare them and do the same with fragment identifier. If you need to use it in server side, every load page will must be followed with an ajax request sending this data to the server.
Apologies in advance for length of my answer, but it is what I had to go through in order to solve the same problems you have. Greetings!
Get protocol, domain, and port from URL
Get protocol, domain, and port from URL
How can I get query string values in JavaScript?
How can I get query string values in JavaScript?
How do I get the fragment identifier (value after hash #) from a URL?
How do I get the fragment identifier (value after hash #) from a URL?
adding the preferred <link rel="canonical" ... > tag into the HTML headers is the only reliable solution, in order to reference unique content to a single SEF URL. see Google's documentation, concerning Consolidate duplicate URLs, which possibly answers the whole question more autoritative and reliable, than I ever could.
the idea of being able to know of the canonical URL or to resolve a bunch externals URLs, without parsing those server's .htaccess rewrite-rules or the HTML headers, does not appear to be applicable (simply because one can maintain a table with URL aliases, which subsequently do not permit guessing how a HTTP request might have been re-written).
this question might belong to https://webmasters.stackexchange.com/search?q=cannonical.
Since the question is marked „PHP“ I assume you are in the backend.
There are enough answers how you can compare URLs (protocol, host, port, path, list of request params) where path is case sensitive, protocol and host are not. Changing the order of request parameters is strictly speaking also changing the URL.
My impression is that you want to differentiate by the RESOURCE which the server is serving (http://www.sub.example.com/ serves the same resource as http://sub.example.com/ or .../hello serves the same resource as .../hello/)
Which resource is served, you should perfectly know on the backend level, since you (the backend) know what you are serving. Find the perfect ID for the resource and use it.
PS: the URL is not a good identifier for that. But if you must use it, just use a sanitized version (sanitization for your purpose => sanitize to your preferred host, strip or add slashes at end of paths, drop things like /../ from path (security issue anyway), bring the request params in a certain order, whatever is right for your purpose.
Best regards, iPirat
It's the case with duplicate URLs and you can avoid these kind of duplicate URLs using a URL factory redirecting all URLs which are not proper to the proper URL.
And the same thing is explained in this article:
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38
Any other URLs leading to the same page are 301 redirected to the proper version of the URLs.
This is the best practice of Search Engine Optimization(SEO). Here I'm going to give you a couple of examples.
You can consider the URLs of this website, for example the wrong links of this page are
https://stackoverflow.com/questions/51685850
https://stackoverflow.com/questions/51685850/convert-url-into-one-s
https://stackoverflow.com/questions/51685850/
If you go to the above wrong URLs of this page, you'll be redirected to the proper URL which is
https://stackoverflow.com/questions/51685850/convert-url-into-one-standard-format
And if you change the title of this question, all other URLs are 301 redirected to the proper URL. The idea here is the 301 redirection which tells the search engines to replace the old URL with the new one otherwise the search engines find different URLs providing the same content.
The real deal here is the id of the question, 51685850. This id is used to create the proper URL with the information from the database. With the URL factory that is created in the article in the link provided, you do not even need to store URLs in the database.
You can read more on duplicate content here:
https://moz.com/learn/seo/duplicate-content
The same rules are applied to tinywebhut.com as well, the wrong URLs are
https://www.tinywebhut.com/remove-duplicate-38
https://www.tinywebhut.com/some-text-38
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38/
In the above URLs the ID is appended to the end of the URL which is 38 and if you go to any of these URLs, you'll be 301 redirected to the proper version of the URLs which is
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38
I didn't make any functions to explain this here because it is already done in this article:
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38
You can achieve the goal with a couple of really simple functions and you can apply the same idea to remove other duplicate URLs such as /about.php, /about, /about.php/, /about/ and so on. And to achieve this you just need a little more code to your existing functions.
One alternative is adding canonical tag, for example, even if you have more than one URL to go the same page, you just need to apply canonical tag and add the link to the proper URL.
<link rel="canonical" href="https://stackoverflow.com/questions/51685850/convert-url-into-one-standard-format" />
This way you are telling the search engines that the multiple URLs should be considered as one and the search engines add the link used in the canonical tag in their search results. You can read more on canonicalization here:
https://moz.com/learn/seo/canonicalization
But still the best way to get rid of duplicate content is the 301 redirect. If you have a 301 redirect like I talked at the beginning, all problems are solved without surprises.
My original answer assumes that the pages are all owned by the OP, as per the line "As you can see, they all lead to the exact same page but the URL format is different...". I am adapting the answer to handle multiple options and adding a list of assumptions you can and cannot make about URLs.
As others have pointed out there is no definitive easy answer to this if you do not know that the page(s) are the same. However, if you follow these assumptions, you should be safe standardizing some things:
CAN ASSUME
Query strings with the same values point to the same location regardless of order. Example: https://example.com/?fruit=apple&color=red is the same as https://example.com/?color=red&fruit=apple
301 redirects to a specific source can be followed. If you receive a 301 redirect response, follow the redirect and use that URL. You can safely assume that if a URL actually does point to the same page, and page rank is optimized, then you can follow it.
If there is a single <link rel="canonical"> tag in the HTML, that too can be used to cover the canonical link (see below for why).
CANNOT ASSUME
Any URL is guaranteed to be the same as any other URL, if they are different (by URL in this case I am talking about anything before the query string).
http://example.com can be different from https://example.com can be different from http://www.example.com or https://www.example.com. There is no restriction against showing a different website when putting "www" or leaving it out. That's why page rank on search engines is really damaged here.
Any two URLs, even if they currently have exactly the same content, will keep exactly the same content. An example would be https://example.com/test and https://sub.example.com/test. Both may feasibly be set to the same generic test page content. In the future, https://sub.example.com/test may be changed. You can't assume it won't be.
If you own the site
Redirect all traffic in the first part of the URL format you want: Do you want www.example.com or example.com or sub.example.com? Do you want a trailing slash or not? Redirect this first, either using server rules or PHP. This is also highly beneficial for search page rank (if that matters to you).
An example of this would be something like this:
if (!$_SERVER['HTTPS'] || 'example.com' !== $_SERVER['HTTP_HOST'] || rtrim($_SERVER['PHP_SELF'], '/') !== $_SERVER['PHP_SELF']) {
header('HTTP/1.1 301 Moved Permanently');
header('Location: '. 'https://example.com/'.rtrim($_SERVER['PHP_SELF']), '/'));
exit;
}
Finally, to manage any remaining SEO concerns, you can add this HTML tag:
`<link rel="canonical" href="<?php echo $url; ?>">`
Whether you own the site or not, you can standardize query order
Even if you don't control the site, you can assume that query order does not matter. To standardize this, take your query and rebuild the parameters, appending it to your normalized URL.
function getSortedQuery()
{
$url = [];
parse_str($_SERVER['QUERY_STRING'], $url);
ksort($url);
return http_build_query($url);
}
$url = $_SERVER['HTTP_HOST'].$_SERVER['PHP_SELF'].'?'.getSortedQuery();
Another option is to grab the contents of the page and see if there is a <link rel="canonical"> string, and use that string to log your data. This is a bit more costly as it requires a full page load.
To repeat, do make sure you grab 301 redirects as they are not suggestions, but directives, as to the end result URL.
One final suggestion
I might recommend using two columns, one being "canonical_url" and another being "effective_url". Sometimes a URL works and then later becomes a 301 redirect. This is just my take but I would like to know these things.
All of the answers have great information. Assuming you are using an Apache-like server, for the URL bit, I would use .htaccess (or, preferably, if you can change it - the equivalent server Apache config file) to do the rewrites. For a simple example:
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_HOST} ^www\.example\.com$
RewriteRule (.*) http://example.com/$1 [R=Permanent]
In this example, the "R=Permanent" DOES do a redirect. This is usually not a big issue as, a) it tells the browser to remember the redirect, and b) your internal links are presumably relative, so protocol (http or https) and the server (example.com or whatever) are preserved. So generally the redirect will be once per session or less - time well spent, IMO, to avoid doing all this in PHP.
I guess you could use it to rewrite the order of the query bits as well, though when the query bits are significant, I tend to (not recommending you do, just sayin') add them to my path (eg rewrite ".../blah/atom" to ".../blah.php?feed=atom"). At any rate, there are loads of rewrite tricks available, and I recommend you read about them in
Apache mod_rewrite.
If you do go this route, be sure to carefully think through what you want to happen - once you start mucking with the URL's, you are usually stuck with your decisions for a long while.
As several have pointed out, while the URLs you show may currently point to the same content, there is no way to tell if they will in the future. A change in either protocol or hostname can get you different sets of content, even example.com vs. www.example.com, even if served up by the same machine at the same IP. Not common, but it can happen...
So if I were wanting to maintain a list of URLs, I would store protocol, hostname, directory path, filename if present (aka "whatever came after the last slash before a questionmark"), and a sorted on key set of key/value pairs for the GET arguments
And then don't forget that you can go to https://www.google.com and not have anything BUT the protocol and hostname...
Avoid passing the parameters in the url. Pass your parameters to the web page using JSON.
I've searched a lot and found even more topics on CI's multilanguage URLs. But most of the answers have no longer working links to CI wiki or doesn't explain a bit of what I want.
My default is a non English language, but I want controllers and methods named in English. I also don't want language segment appear in my URL (at least not for default language).
Eg.:
Default uri - example.com/blogas/irasas/123
English uri - example.com/blog/item/123 or example.com/en/blog/item/123
Links in both languages should call Blog::item()
Generating uris should be similar - site_url('blog/item/123'); should result a link in user language:
Default - example.com/blogas/irasas/123
English - example.com/blog/item/123 or example.com/en/blog/item/123
User language should be set in a cookie or session preferably.
Is there any way to achieve this?
Using latest CI 2.2.0
I believe this solution for 2.1 should work, but I can't figure out how URI's should be translated (where to put translations) and how to remove language part from URI.
May be I am wrong but what I understood is that you are trying to generate different urls based on active language using same words in site_url. If this is correct you can simply translate your words using language helper.
site_url(lang('blog_item_url') . '123);
And in your english langugae file.
$lang['blog_item_url'] = "blog/item/";
And in your other langugae
$lang['blog_item_url'] = "blogas/irasas/";
And define routes in route.php in configurations.
$route['blog/item/(:num)'] = "blog/item/$1";
$route['blogas/irasas/(:num)'] = "blog/item/$1";
I Just started exploring CakePHP. From what I have observed, In CakePHP, All of the following are the same.
http://example.com/page
http://example.com/page/
http://example.com/page/index
Is this not considered duplicate content by the search engines?
If yes, how to fix this?
They are only duplicate content if you link to them differently, thus making the different urls visible.
Usually, using cake's internal routing you can only get one of those three versions. Always.
But if someone would get hold of the wrong url and does link it from somewhere it might actually be followed by google and indexed wrongly. So yes, there is a possibility.
So
use htaccess to prevent the / (or vica versa) and 301 redirect to the other one
use canonical tag in your layout to always route to the correctly routed url - not matter what the full url including /index etc might currently look like).
Details and code examples:
http://www.dereuromark.de/2012/12/29/cakephp-and-seo/
Wordpress uses friendly seo url without using htaccess.
Can any explain this to me please how they do it.
The only way i can think of is to do something like this.
domain.com/index.php/nnn/mmmm/
But wordpress does not use index.php
I know they are not using htaccess.
Please let me know.
Thank you.
WordPress actually has a single .htaccess file that they don't need to change which redirects all requests to index.php
index.php then looks at the permalinks rules and runs a few database queries to determine which page to send you.
So for instance if the permalink rule is %postdate%/%postname% (may not be the actual WordPress permalink variables. I haven't been using WordPress for too long) then it would just use regular expressions (or combinations of substr() and strpos()) to put %postdate% and %postname% into variables. Next it runs a simple database query for any item matching that date and that name. If nothing is found, redirect to search. If you find more than one, list them all (like a category page). If you find one and only one, send that page.
As far as actually "sending" the page, that's just a matter of settings certain variables (such as $the_post['content']) and then include()'ing the proper theme file.
include()'ing the theme file is a simple if() statement.
if(file_exists("wp-content/themes/<your_theme>/$the_post['type'].php")){
include("wp-content/themes/<your_theme>/$the_post['type'].php");
} else {
include("wp-content/themes/<your_theme>/index.php");
}
Mind you these aren't the exact variable names or the exact functions as they occur. This is just a very simplified version to give the general idea of how these systems work.
Wordpress does use .htaccess files. You can modify the permalinks in the permalinks section your wordpress site. You can see information on it here.
If you are using IIS you can use this method
Wordpress does actually use .htaccess. Make sure you have hidden files displayed! If you're FTP-ing, there'll be a setting in your client.
To answer the main question:
No.
As for the other points of yours:
Wordpress uses friendly seo url without using htaccess.
USUALLY it does not.
USUALLY it requires the .htaccess to make nice urls work.
However, as Wordpress' Codex says:
If you are using IIS 7 and have admin rights on your server, you can
use Microsoft's URL Rewrite Module instead. Though not
completely compatible with mod_rewrite, it does support WordPress's
pretty permalinks.
So they are always using some kind of rewrite module anyway.
Can any explain this to me please how they do it.
There is one interesting thing to explain:
There is only one .htaccess file.
Which would not be so interesting if it would contain some kind of general rule rewriting urls to some.php's GET parameters (domain.com/verynice => domain.com/index.php?page=verynice).
But that is not happening.
What is happening is that
It rewrites ANY url that is NOT a specific FILE or DIRECTORY to index.php EXCEPT the index.php (to prevent infinite loop).
So urls like domain.com/blah/blah and domain.com/lol/trololo are both actually requesting the index.php - the only difference is in the requested URI.
Using superglobal variables like $_SERVER['REQUEST_URI'] they get the requested URI and after that they parse the requested URI and store the results in varibles accordingly.
The only way i can think of is to do something like this.
domain.com/index.php/nnn/mmmm/
In common explanation of the URL this is NOT request to the index.php file itself.
It is a request for the "mmmm" folder in folder "nnn" which is in folder "index.php".
However apache usually understand this as a request for the index.php file .
For other files e.g. my.file domain.com/my.file is a file request, domain.com/my.file/ is a folder request (notice the slash / at the end)
But wordpress does not use index.php
That is NOT TRUE.
Especially if you are using nice-urls the index.php is practicaly fundamental for Wordpress.
I know they are not using htaccess.
This point of yours could meant 3 things:
They are using the htaccess but not for processing the URL (which I've explained above)
They are not using it at all - not true, unless they would be using the MS' URL Rewrite Module
You don't know. You did not provide any information about how did you find out they are not using it, so we can't know if you really know or you just think that :)
Please let me know.
I let you know. And I hope I've answered all your questions :), if not - use comments :)
Thank you.
You are welcome.
I was happy to help :) ☺ ☺ ☺ ☺ ☺
I am trying to use SEO-friendly URLs for a website. The website displays information in hierarchical manner. For instance, if my website was about cars I would want the URL 'http://example.com/ford' to show all Ford models stored in my 'cars' table. Then the URL 'http://example.com/ford/explorer' would show all the Ford Explorer Models (V-6, V-8, Premium, etc).
Now for the question:
Is mod_rewrite used to rewrite a query string style URL into a semantic URL, or the other way around? In other words, when I use JavaScript to set window.location=$URL, should the URL be the query string version 'http://example.com/page.php?type=ford&model=explorer' OR do I internally use 'http://example.com/ford/explorer' which then gives me access to the query string variables?
Hopefully, someone can see my confusion with this issue. For what it's worth, I do have unique 'slugs' made for all my top level categories (obviously, the site isn't about cars).
Thanks for all the help so far. I got the rewrite working but it is affecting other paths on the site (CSS, JavaScript, images). I using the correct path structure for all these includes (ie '/images/car.png' and '/css/main.css'). Do I have to use an absolute path ('http://example.com/css/main.css') for all files? Thanks!
Generally, people who use mod_rewrite use the terminology like this:
I want mod_rewrite to rewrite A to be B.
What this means is that any request from the outside world for page A gets rewritten to file B on the server.
You want the outside world to see URLs that look like
A) http://example.com/ford/explorer
but your web server wants them to look like
B) http://example.com/page.php?type=ford&model=explorer
I would say you want to rewrite (A) to look like (B), or you want to rewrite the semantic URL into a query string URL.
Since all the links on your page are clicked on by the user and/or requested by the browser, you want them to look like (A). This includes links that javascript uses in window.location. They can and should look like (A).
once you have set up mod_rewrite then your links should point to the mod_rewritten version of the URL (in your example: http://mysite.com/ford/explorer). Internally in your system you will still reference the variables as if they are traditional query string name value pairs though.
Its also worth pointing out that Google is now starting to advocate more logical URLs from a search engine point of view, i.e. a query string over mod rewrite
Does that mean I should avoid
rewriting dynamic URLs at all? That's
our recommendation, unless your
rewrites are limited to removing
unnecessary parameters, or you are
very diligent in removing all
parameters that could cause problems.
If you transform your dynamic URL to
make it look static you should be
aware that we might not be able to
interpret the information correctly in
all cases
http://googlewebmastercentral.blogspot.com/2008/09/dynamic-urls-vs-static-urls.html
also worth looking at:
http://googlewebmastercentral.blogspot.com/2009/08/optimize-your-crawling-indexing.html