Here are a few URLs:
http://sub.example.com/?feed=atom&hello=world
http://www.sub.example.com/?feed=atom&hello=world
http://sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom#123
As you can see, they all lead to the exact same page, but the URL format is different. Here are two other basic examples:
http://example.com/hello/
http://example.com/hello
Both are the same.
I want to convert the URL into one standard format so that when I store the URL in the database, I can easily check whether the URL string already exists in the database.
Because of the various ways a URL can be formatted, this can be puzzling.
What's the definitive approach to converting a URL into one standard format? Maybe the parse_url() route...?
Edit
As outlined in the comments, there is no definitive solution to this, but the aim is to get as close as possible with what we have without "retrieving" the page. Please read comments before posting an answer to this bounty.
After you parse_url:
Remove the www prefix from the domain name
If the path is not empty - remove the trailing slash from it
Sort query parameters alphabetically by their name - if there are any
Combine these parts in order to get a canonical URL.
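The steps above can be sketched in PHP roughly as follows. This is a minimal sketch, not a definitive implementation: canonicalizeUrl() is a hypothetical name, dropping the fragment and defaulting the scheme to http are assumptions, and parse_str() rewrites dots and spaces in parameter names to underscores, which may or may not matter for your data.

```php
<?php
// Sketch only: canonicalizeUrl() is a hypothetical helper, not a built-in function.
function canonicalizeUrl(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return $url; // leave anything we cannot parse untouched
    }

    $scheme = strtolower($parts['scheme'] ?? 'http'); // assumption: default to http
    $host = strtolower($parts['host']);
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4); // step 1: remove the www prefix
    }
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';

    $path = $parts['path'] ?? '';
    if ($path !== '') {
        $path = rtrim($path, '/'); // step 2: remove the trailing slash
    }

    $query = '';
    if (isset($parts['query']) && $parts['query'] !== '') {
        parse_str($parts['query'], $params);
        ksort($params); // step 3: sort parameters by name
        $query = '?' . http_build_query($params);
    }

    // Assumption: the fragment is dropped because it is never sent to the server.
    return $scheme . '://' . $host . $port . $path . $query;
}

echo canonicalizeUrl('http://www.sub.example.com/?hello=world&feed=atom#123'), "\n";
// http://sub.example.com?feed=atom&hello=world
```

With this sketch, all of the URL variants listed in the question collapse to the same string, which can then be stored and looked up in the database.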
I had the same issue with a feature for saving report configurations. In our system, users can design their own sales reports (like JQL in Jira); for that, we use GET params as conditions, and the fragment identifier (after #) as layout setup, like this:
http://example.com/report.php?since=20180101&until=20180806#sort=amount&color=blue
For our system, the order of the GET params, or of the params after #, is irrelevant: you reach the same report configuration whether the param "until" is set before or after "since", so for us these are the same request.
Given this, subdomains are out of the discussion, because you must handle them either with rewrite techniques (like mod_rewrite with a 301 in Apache) or by maintaining a pool of domain exceptions at the software level. Also, different domains can point to different websites, so you must decide whether this is a good idea; with subdomains, "www" is very easy to figure out, but other cases will take you time.
The server side can help to get the variables in the query section. For example, in PHP you can use the parse_str function and $_SERVER['QUERY_STRING'] to get an array; you then need to sort it (ksort() orders it by parameter name) and finally compare the arrays to see whether they are the same request (for example with array_diff).
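A minimal sketch of that server-side comparison, assuming both query strings are available as plain strings (in a live request, one of them would be $_SERVER['QUERY_STRING']). sameQuery() is a hypothetical name, and ksort() is used here so that parameters are ordered by name before comparing.

```php
<?php
// Hypothetical helper: true when two query strings carry the same parameters,
// regardless of the order in which they appear.
function sameQuery(string $a, string $b): bool
{
    parse_str($a, $paramsA);
    parse_str($b, $paramsB);
    ksort($paramsA); // order by parameter name
    ksort($paramsB);
    return $paramsA === $paramsB;
}

var_dump(sameQuery('since=20180101&until=20180806', 'until=20180806&since=20180101')); // bool(true)
var_dump(sameQuery('since=20180101', 'since=20180102'));                               // bool(false)
```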
Unfortunately, the server side alone is not an option, since it has no way to read the content after the hash (#), and we still have not considered other problems, like an included script name, protocols, or ports:
http://www.sub.example.com/index.php?hello=world&feed=atom
https://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com:8081/?hello=world&feed=atom
In my personal experience, the closest solution is JavaScript: use it to handle the URL, parse the query section into an array, compare, and do the same with the fragment identifier. If you need the result on the server side, every page load must be followed by an Ajax request sending this data to the server.
Apologies in advance for the length of my answer, but it is what I had to go through to solve the same problems you have. Greetings!
Get protocol, domain, and port from URL
How can I get query string values in JavaScript?
How do I get the fragment identifier (value after hash #) from a URL?
Adding the preferred <link rel="canonical" ... > tag to the HTML head is the only reliable solution for tying unique content to a single SEF URL. See Google's documentation on consolidating duplicate URLs, which possibly answers the whole question more authoritatively and reliably than I ever could.
The idea of being able to know the canonical URL, or to resolve a bunch of external URLs, without parsing those servers' .htaccess rewrite rules or the HTML head does not appear to be workable (simply because one can maintain a table of URL aliases, which makes it impossible to guess how an HTTP request might have been rewritten).
This question might belong on https://webmasters.stackexchange.com/search?q=cannonical.
Since the question is tagged "PHP", I assume you are on the backend.
There are enough answers on how you can compare URLs (protocol, host, port, path, list of request params), where the path is case sensitive and the protocol and host are not. Strictly speaking, changing the order of the request parameters also changes the URL.
My impression is that you want to differentiate by the RESOURCE that the server is serving (http://www.sub.example.com/ serves the same resource as http://sub.example.com/, or .../hello serves the same resource as .../hello/).
On the backend level you should know perfectly well which resource is served, since you (the backend) know what you are serving. Find the perfect ID for the resource and use it.
PS: the URL is not a good identifier for that. But if you must use it, just use a sanitized version (sanitization for your purpose => sanitize to your preferred host, strip or add slashes at the end of paths, drop things like /../ from the path (a security issue anyway), bring the request params into a fixed order, whatever is right for your purpose).
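As an illustration of the "drop things like /../ from the path" step, here is a minimal sketch. removeDotSegments() is a hypothetical helper; as a simplifying assumption it also collapses repeated slashes and drops any trailing slash, which may not suit every site.

```php
<?php
// Hypothetical helper: resolve "." and ".." segments in a URL path.
// Assumption: duplicate slashes and a trailing slash are dropped as well.
function removeDotSegments(string $path): string
{
    $output = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($output); // step back one level, never above the root
        } elseif ($segment !== '.' && $segment !== '') {
            $output[] = $segment;
        }
    }
    return '/' . implode('/', $output);
}

echo removeDotSegments('/a/b/../c/./d'), "\n";     // /a/c/d
echo removeDotSegments('/../../etc/passwd'), "\n"; // /etc/passwd
```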
Best regards, iPirat
This is a case of duplicate URLs, and you can avoid this kind of duplication by using a URL factory that redirects every URL that is not the proper one to the proper URL.
And the same thing is explained in this article:
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38
Any other URLs leading to the same page are 301 redirected to the proper version of the URLs.
This is the best practice for Search Engine Optimization (SEO). Here I'm going to give you a couple of examples.
Consider the URLs of this website. For example, the wrong links to this page are:
https://stackoverflow.com/questions/51685850
https://stackoverflow.com/questions/51685850/convert-url-into-one-s
https://stackoverflow.com/questions/51685850/
If you go to the above wrong URLs of this page, you'll be redirected to the proper URL which is
https://stackoverflow.com/questions/51685850/convert-url-into-one-standard-format
And if you change the title of this question, all other URLs are 301 redirected to the proper URL. The idea here is the 301 redirection which tells the search engines to replace the old URL with the new one otherwise the search engines find different URLs providing the same content.
The real deal here is the id of the question, 51685850. This id is used to create the proper URL with the information from the database. With the URL factory that is created in the article in the link provided, you do not even need to store URLs in the database.
You can read more on duplicate content here:
https://moz.com/learn/seo/duplicate-content
The same rules are applied to tinywebhut.com as well. The wrong URLs are:
https://www.tinywebhut.com/remove-duplicate-38
https://www.tinywebhut.com/some-text-38
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38/
In the above URLs the ID, which is 38, is appended to the end of the URL. If you go to any of these URLs, you'll be 301 redirected to the proper version of the URL, which is
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38
I didn't make any functions to explain this here because it is already done in this article:
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38
You can achieve the goal with a couple of really simple functions and you can apply the same idea to remove other duplicate URLs such as /about.php, /about, /about.php/, /about/ and so on. And to achieve this you just need a little more code to your existing functions.
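To make the idea concrete without reproducing the article, here is a minimal sketch of such functions. It is not the article's code: properPath(), redirectTarget(), and the $titles array (standing in for a database lookup) are hypothetical, and the ID is assumed to be the trailing number of the path.

```php
<?php
// Sketch only: hypothetical helpers, not the code from the linked article.

// Build the one proper path for a record from its ID and current title.
function properPath(int $id, string $title): string
{
    $slug = trim(preg_replace('/[^a-z0-9]+/', '-', strtolower($title)), '-');
    return '/' . $slug . '-' . $id;
}

// Return the path to 301-redirect to, or null when no redirect is needed.
function redirectTarget(string $requestPath, array $titlesById): ?string
{
    // Assumption: the ID is the trailing number of the requested path.
    if (!preg_match('/(\d+)\/?$/', $requestPath, $m)) {
        return null;
    }
    $id = (int) $m[1];
    if (!isset($titlesById[$id])) {
        return null; // unknown ID: the caller would answer 404 instead
    }
    $proper = properPath($id, $titlesById[$id]);
    return $requestPath === $proper ? null : $proper;
}

// $titles stands in for a database lookup (an assumption for this sketch).
$titles = [38 => 'Remove duplicate URLs from your website'];
echo redirectTarget('/some-text-38', $titles) ?? 'no redirect', "\n";
echo redirectTarget('/remove-duplicate-urls-from-your-website-38', $titles) ?? 'no redirect', "\n";
```

When redirectTarget() returns a path, the page would send it with header('Location: ' . $target, true, 301) and exit; because the proper path is rebuilt from the ID and the current title, changing the title automatically redirects the old URLs.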
One alternative is adding a canonical tag. Even if more than one URL leads to the same page, you just need to add the canonical tag with a link to the proper URL.
<link rel="canonical" href="https://stackoverflow.com/questions/51685850/convert-url-into-one-standard-format" />
This way you are telling the search engines that the multiple URLs should be considered as one and the search engines add the link used in the canonical tag in their search results. You can read more on canonicalization here:
https://moz.com/learn/seo/canonicalization
But still, the best way to get rid of duplicate content is the 301 redirect. If you have a 301 redirect like the one I described at the beginning, all problems are solved without surprises.
My original answer assumes that the pages are all owned by the OP, as per the line "As you can see, they all lead to the exact same page but the URL format is different...". I am adapting the answer to handle multiple options and adding a list of assumptions you can and cannot make about URLs.
As others have pointed out there is no definitive easy answer to this if you do not know that the page(s) are the same. However, if you follow these assumptions, you should be safe standardizing some things:
CAN ASSUME
Query strings with the same values point to the same location regardless of order. Example: https://example.com/?fruit=apple&color=red is the same as https://example.com/?color=red&fruit=apple
301 redirects to a specific source can be followed. If you receive a 301 redirect response, follow the redirect and use that URL. You can safely assume that if a URL actually does point to the same page, and page rank is optimized, then you can follow it.
If there is a single <link rel="canonical"> tag in the HTML, that too can be used to cover the canonical link (see below for why).
CANNOT ASSUME
Any URL is guaranteed to be the same as any other URL, if they are different (by URL in this case I am talking about anything before the query string).
http://example.com can be different from https://example.com can be different from http://www.example.com or https://www.example.com. There is no restriction against showing a different website when putting "www" or leaving it out. That's why page rank on search engines is really damaged here.
Any two URLs, even if they currently have exactly the same content, will keep exactly the same content. An example would be https://example.com/test and https://sub.example.com/test. Both may feasibly be set to the same generic test page content. In the future, https://sub.example.com/test may be changed. You can't assume it won't be.
If you own the site
Redirect all traffic to the first part of the URL in the format you want: Do you want www.example.com or example.com or sub.example.com? Do you want a trailing slash or not? Redirect this first, either using server rules or PHP. This is also highly beneficial for search page rank (if that matters to you).
An example of this would be something like this:
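// Redirect to the canonical scheme, host, and slash-free path.
$isHttps = !empty($_SERVER['HTTPS']) && $_SERVER['HTTPS'] !== 'off';
$path = rtrim($_SERVER['PHP_SELF'], '/') ?: '/'; // keep "/" for the site root
if (!$isHttps || 'example.com' !== $_SERVER['HTTP_HOST'] || $path !== $_SERVER['PHP_SELF']) {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: https://example.com' . $path);
    exit;
}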
Finally, to manage any remaining SEO concerns, you can add this HTML tag:
`<link rel="canonical" href="<?php echo $url; ?>">`
Whether you own the site or not, you can standardize query order
Even if you don't control the site, you can assume that query order does not matter. To standardize this, take your query and rebuild the parameters, appending it to your normalized URL.
function getSortedQuery()
{
    $url = [];
    parse_str($_SERVER['QUERY_STRING'], $url);
    ksort($url);
    return http_build_query($url);
}
$url = $_SERVER['HTTP_HOST'].$_SERVER['PHP_SELF'].'?'.getSortedQuery();
Another option is to grab the contents of the page and see if there is a <link rel="canonical"> string, and use that string to log your data. This is a bit more costly as it requires a full page load.
To repeat, do make sure you grab 301 redirects as they are not suggestions, but directives, as to the end result URL.
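For the canonical-tag option, a minimal sketch of pulling the link out of already-fetched HTML might look like this. extractCanonical() is a hypothetical helper, it assumes PHP's DOM extension is available, and following the "single tag" rule above it returns null when zero or several canonical links are found. Fetching the page (and following its 301 redirects) is left out.

```php
<?php
// Hypothetical helper: return the href of the single <link rel="canonical">, or null.
// Assumes the DOM extension (DOMDocument) is installed.
function extractCanonical(string $html): ?string
{
    if (trim($html) === '') {
        return null; // DOMDocument::loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = [];
    foreach ($doc->getElementsByTagName('link') as $link) {
        $rel = strtolower(trim($link->getAttribute('rel')));
        $href = $link->getAttribute('href');
        if ($rel === 'canonical' && $href !== '') {
            $found[] = $href;
        }
    }

    // Only trust the tag when exactly one is present.
    return count($found) === 1 ? $found[0] : null;
}

$html = '<html><head><link rel="canonical" href="https://example.com/a"></head><body></body></html>';
echo extractCanonical($html), "\n"; // https://example.com/a
```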
One final suggestion
I might recommend using two columns, one being "canonical_url" and another being "effective_url". Sometimes a URL works and then later becomes a 301 redirect. This is just my take but I would like to know these things.
All of the answers have great information. Assuming you are using an Apache-like server, for the URL bit, I would use .htaccess (or, preferably, if you can change it - the equivalent server Apache config file) to do the rewrites. For a simple example:
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_HOST} ^www\.example\.com$
RewriteRule (.*) http://example.com/$1 [R=Permanent]
In this example, the "R=Permanent" DOES do a redirect. This is usually not a big issue as, a) it tells the browser to remember the redirect, and b) your internal links are presumably relative, so protocol (http or https) and the server (example.com or whatever) are preserved. So generally the redirect will be once per session or less - time well spent, IMO, to avoid doing all this in PHP.
I guess you could use it to rewrite the order of the query bits as well, though when the query bits are significant, I tend to (not recommending you do, just sayin') add them to my path (eg rewrite ".../blah/atom" to ".../blah.php?feed=atom"). At any rate, there are loads of rewrite tricks available, and I recommend you read about them in the Apache mod_rewrite documentation.
If you do go this route, be sure to carefully think through what you want to happen - once you start mucking with the URLs, you are usually stuck with your decisions for a long while.
As several have pointed out, while the URLs you show may currently point to the same content, there is no way to tell if they will in the future. A change in either protocol or hostname can get you different sets of content, even example.com vs. www.example.com, even if served up by the same machine at the same IP. Not common, but it can happen...
So if I wanted to maintain a list of URLs, I would store the protocol, hostname, directory path, filename if present (aka "whatever came after the last slash before a question mark"), and a set of key/value pairs for the GET arguments, sorted by key.
And then don't forget that you can go to https://www.google.com and not have anything BUT the protocol and hostname...
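One way to sketch that decomposition in PHP is shown below. urlParts() and its array keys are hypothetical names chosen for illustration, and a missing path is treated as "/" so that a bare protocol-plus-hostname URL still yields a record.

```php
<?php
// Hypothetical helper: split a URL into the pieces suggested above for storage.
function urlParts(string $url): array
{
    $p = parse_url($url);
    if ($p === false) {
        $p = []; // seriously malformed URL: fall back to empty parts
    }

    $path = $p['path'] ?? '/'; // assumption: no path means the site root
    $lastSlash = strrpos($path, '/');
    $cut = $lastSlash === false ? 0 : $lastSlash + 1;

    parse_str($p['query'] ?? '', $args);
    ksort($args); // GET arguments sorted by key

    return [
        'protocol'  => strtolower($p['scheme'] ?? ''),
        'hostname'  => strtolower($p['host'] ?? ''),
        'directory' => (string) substr($path, 0, $cut),
        'filename'  => (string) substr($path, $cut), // empty when the path ends in "/"
        'arguments' => $args,
    ];
}

print_r(urlParts('http://www.sub.example.com/index.php?hello=world&feed=atom'));
```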
Avoid passing the parameters in the url. Pass your parameters to the web page using JSON.
What method can you recommended for creating search engine-friendly URLs? When coding in PHP that is. Ideally I would like something like:
http://www.example.com/article/523544
So it doesn't display the file it's opening (eg article.php)
It is quite necessary to generate SEO-friendly URLs so that most search engines can easily index them. The most interesting part is that the URL can easily correlate to the page content, and the user can craft a pretty URL from the keywords they want the page to rank for on different search engines (e.g. google.com, google.co.in, bing.com).
The best example of pretty links is WordPress. It actually stores the dynamic page URLs in the database itself. When the pretty URL is requested, the .htaccess rules are applied internally and the request is redirected to the original dynamic page in the system.
Some basic tips from Google and SEOmoz may help you.
Some topics on SO: mod_rewrite, nice url.
Edit:
You need to place a .htaccess file in your document root that includes the following rules:
RewriteEngine on
RewriteRule ^article/([0-9]+)?$ article.php?id=$1 [L]
Make sure mod_rewrite is enabled in Apache and that you are allowed to use it.
If you read some questions in SO in this topic it will help you understand how mod_rewrite works.
To make your urls more search engine friendly you may want to use 'slugs' so you need to sanitize your article titles like in this url.
Ideally your URL needs to contain something about the topic of the page. You gave the example of http://www.example.com/article/523544; while this is better than using standard query strings, it's still not ideal, as all that any search engine can see from it is that it's an article.
It's important to remember that the segment (a segment is the string between each slash) closest to the domain is the most important:
http://www.example.com/most-important/next-important/less-important/
I personally always try to use the following URL structure, and keep my page/article titles unique:
http://www.example.com/this-wonderful-article
Notice the use of dashes rather than underscores; this is generally regarded as the preferred method. Using this method I usually generate and save the article's slug ('this-wonderful-article') in the database, and then search for that instead of an ID.
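A minimal sketch of generating such a slug from a title is shown here. slugify() is a hypothetical helper, and it assumes ASCII titles; accented or non-Latin characters would need transliteration first.

```php
<?php
// Hypothetical helper: turn an article title into a dash-separated slug.
// Assumption: titles are ASCII; other characters are simply replaced by dashes.
function slugify(string $title): string
{
    $slug = strtolower($title);
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug); // runs of other characters become one dash
    return trim($slug, '-');                          // no leading or trailing dash
}

echo slugify('This Wonderful Article!'), "\n"; // this-wonderful-article
```

The resulting slug would be stored alongside the article so that the lookup can be done on the slug column.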
I appreciate that sometimes it's very difficult to use just a slug, especially with a larger website. You may have multiple articles with the same title, or the website may have user-submitted content over which you have no control. If this is the case, you can use the ID without any worries, but just be sure to include the title of the article in the URL. Eg: http://www.example.com/this-wonderful-article/29587
If you're looking for a method of using these URLs then I'd suggest looking at some mod_rewrite tutorials. Personally I use a framework that does the majority of the legwork for me such as CodeIgniter (http://www.codeigniter.com), or you could use something like the Zend Framework or CakePHP. If you're only doing articles then it might be worth looking into a sturdy CMS like WordPress, although this depends largely on your requirements.
I am working on a PHP application. Everything works perfectly; there is only one problem.
I have enabled SEO-friendly URLs, which rewrite the actual URLs to virtual URLs (I know you guys know it).
Ex : hxxp://www.website.com/index.php?page=about-us
To
hxxp://www.website.com/page/about-us/
What I want to achieve is this: if the SEO URLs / mod_rewrite are disabled, the user should still be able to access the direct/actual URLs.
In brief, if mod_rewrite is enabled, the web application should automatically use the SEO-friendly URLs; otherwise it should go with the default URLs.
You would have to replace all occurrences of links with a function that checks if mod_rewrite is available, or more likely, a config value. It would then return the appropriate link.
getLink("?page=about-us")
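A minimal sketch of what such a function could look like follows. The second parameter stands in for the config value (or mod_rewrite check) mentioned above, the /name/value/ layout mirrors the example URLs in the question, and both are assumptions; array-valued parameters are not handled.

```php
<?php
// Hypothetical implementation of getLink(); $prettyUrls would come from a config value.
function getLink(string $query, bool $prettyUrls): string
{
    if (!$prettyUrls) {
        return '/index.php' . $query; // default, query-string style URL
    }

    parse_str(ltrim($query, '?'), $params);
    $path = '';
    foreach ($params as $name => $value) {
        // Assumption: every value is a scalar, not an array.
        $path .= '/' . rawurlencode((string) $name) . '/' . rawurlencode((string) $value);
    }
    return $path . '/';
}

echo getLink('?page=about-us', true), "\n";  // /page/about-us/
echo getLink('?page=about-us', false), "\n"; // /index.php?page=about-us
```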
Use an <IfModule> to avoid breaking other .htaccess directives and/or causing 500 internal server errors if Apache doesn't understand your rules. Also add a single non-rewriting RewriteRule (before all others):
<IfModule mod_rewrite.c>
    RewriteEngine On
    # The next rule does no rewriting, but sets an environment variable.
    RewriteRule .* - [E=RewriteCapable:On]
</IfModule>
In your file (store as setting or check on places generating/outputting urls):
if (isset($_SERVER['RewriteCapable'])) {
    // make fancy urls
} else {
    // kludgy old-style urls
}
Hmm.. this may need thought about a bit to get the correct solution.. follow me here if you will :)
SEO URLs were primarily introduced to (1) include human readable text in the URLs and (2) to get rid of the GET parameters.
To look at point (2) for a moment, this was the primary driver initially, because people used about.php?id=1, id=2 ... id=3457348 to get the same page listed in the search engines multiple times, which of course got detected and stopped, then sometimes people would pass a session id=24234234 which would also get stopped as being a duplicate page (rightfully as it uses HTTP as a stateful protocol when it's not).
With a URL, everything from the first character up to the # of a #fragment defines a resource (from an HTTP perspective), so when several different URLs all resolve to the same 'page' they are indeed rightly considered duplicates.
So, by negating the GET parameters you solve this problem, which now isn't a problem by the way and hasn't been for a long time, there's no reason not to use GET params properly other than vanity.
So, really you solve no problem but have instead introduced a new problem, in that you want '/page/about-us' and '?page=about-us' to both go to the same 'page' which means you've got duplicate resources again and this could be detected and you could get penalised.
Thus, by introducing 'SEO URLs' you've actually created the problem SEO URLs were 'invented' to counteract.
This only leaves the point about human-readable words in the URL. URLs are supposed to be transparent, so they don't count for anything in reality, but some still like them - so I'd have to ask what's wrong with using '/?/page/about-us'... and if you don't like that, then what's wrong with creating a fixed file with the filesystem path '/page/about-us' which simply includes your index.php with the right variables set?
Of course you can create duplicate pages and have both SEO friendly urls and GET param URLs but as you can see that won't be SEO friendly now will it?
Something to chew on :)
I'm trying to make a clean url for a blog on a dynamic website, but I think that the problem is that I don't know how to plan the website schema.
I read about how to use mod_rewrite, and all I found is how to turn "http://www.website.com/?category&date&post-title" into "http://www.website.com/category/date/post-title". That works OK for me.
The problem is that if my URL looks like "http://www.website.com/blog/?id=34", this method won't work, as far as I understand it.
So, I have two questions:
1. Is there a way to use mod_rewrite (maybe read from a txt file) to read the post title of my blog and rewrite my url by date and post-title?
2. Should I rewrite my website to query the data from one index file in the homepage and use mod_rewrite to write the nice URL? Should I also query the date and the title of the post instead of just the post ID?
mod_rewrite is used to rewrite incoming requests; it has nothing to do with the URLs in your pages. You have to change those URLs by hand.
Yes, it's the most common practice to query the data from one index file.
No, you can't use mod_rewrite to write the nice URL.
Yes, an ID must be present in the URL along with the title. Your engine will just throw the title away and use only the ID to retrieve an article.
Take a look at SO urls for an example
What you're talking about is commonly referred to as routing, and lots of examples exist of different ways to do it with PHP. The most common approach uses the front controller pattern, which in the simple case means rewriting all URLs to a single PHP file and then having that file determine what content to show dynamically based on the URL.
The most popular PHP frameworks (CakePHP, Symfony, CodeIgniter, etc.) all have routing code in them which you might be able to use or which might serve as inspiration. Alternatively, this article covers lots of the basics if you want to do it yourself: http://www.phpaddiction.com/tags/axial/url-routing-with-php-part-one/
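As a rough illustration of the front controller idea, here is a minimal sketch of the lookup that the single PHP file would perform once every request has been rewritten to it. route(), the URL patterns, and the handler names are all hypothetical; real frameworks do considerably more.

```php
<?php
// Sketch only: route() and the handler names are hypothetical.
function route(string $uri): array
{
    // Pattern => handler name; named groups become parameters.
    $routes = [
        '#^/blog/(?<id>\d+)(?:/[a-z0-9-]*)?$#' => 'show_post',
        '#^/blog$#' => 'list_posts',
    ];

    // Ignore the query string and any trailing slash.
    $path = '/' . trim((string) parse_url($uri, PHP_URL_PATH), '/');

    foreach ($routes as $pattern => $handler) {
        if (preg_match($pattern, $path, $matches)) {
            // Keep only the named captures (string keys).
            $params = array_filter($matches, 'is_string', ARRAY_FILTER_USE_KEY);
            return ['handler' => $handler, 'params' => $params];
        }
    }
    return ['handler' => 'not_found', 'params' => []];
}

print_r(route('/blog/34/my-first-post'));
```

In this sketch the title part of the path is optional and ignored; only the ID is used to pick the post, so the front controller would call route($_SERVER['REQUEST_URI']) and dispatch on the returned handler name.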
RewriteMap allows you to do all sorts of dynamic rewriting (text file, script, etc).
I am trying to use SEO-friendly URLs for a website. The website displays information in hierarchical manner. For instance, if my website was about cars I would want the URL 'http://example.com/ford' to show all Ford models stored in my 'cars' table. Then the URL 'http://example.com/ford/explorer' would show all the Ford Explorer Models (V-6, V-8, Premium, etc).
Now for the question:
Is mod_rewrite used to rewrite a query string style URL into a semantic URL, or the other way around? In other words, when I use JavaScript to set window.location=$URL, should the URL be the query string version 'http://example.com/page.php?type=ford&model=explorer' OR do I internally use 'http://example.com/ford/explorer' which then gives me access to the query string variables?
Hopefully, someone can see my confusion with this issue. For what it's worth, I do have unique 'slugs' made for all my top level categories (obviously, the site isn't about cars).
Thanks for all the help so far. I got the rewrite working but it is affecting other paths on the site (CSS, JavaScript, images). I am using the correct path structure for all these includes (ie '/images/car.png' and '/css/main.css'). Do I have to use an absolute path ('http://example.com/css/main.css') for all files? Thanks!
Generally, people who use mod_rewrite use the terminology like this:
I want mod_rewrite to rewrite A to be B.
What this means is that any request from the outside world for page A gets rewritten to file B on the server.
You want the outside world to see URLs that look like
A) http://example.com/ford/explorer
but your web server wants them to look like
B) http://example.com/page.php?type=ford&model=explorer
I would say you want to rewrite (A) to look like (B), or you want to rewrite the semantic URL into a query string URL.
Since all the links on your page are clicked on by the user and/or requested by the browser, you want them to look like (A). This includes links that JavaScript uses in window.location. They can and should look like (A).
Once you have set up mod_rewrite, your links should point to the mod_rewritten version of the URL (in your example: http://mysite.com/ford/explorer). Internally in your system you will still reference the variables as if they were traditional query string name/value pairs, though.
It's also worth pointing out that Google is now starting to advocate more logical URLs from a search engine point of view, i.e. a query string over mod_rewrite:
"Does that mean I should avoid rewriting dynamic URLs at all? That's our recommendation, unless your rewrites are limited to removing unnecessary parameters, or you are very diligent in removing all parameters that could cause problems. If you transform your dynamic URL to make it look static you should be aware that we might not be able to interpret the information correctly in all cases."
http://googlewebmastercentral.blogspot.com/2008/09/dynamic-urls-vs-static-urls.html
also worth looking at:
http://googlewebmastercentral.blogspot.com/2009/08/optimize-your-crawling-indexing.html