Convert URL into one standard format

Here are a few URLs:
http://sub.example.com/?feed=atom&hello=world
http://www.sub.example.com/?feed=atom&hello=world
http://sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com/?hello=world&feed=atom#123
As you can see, they all lead to the exact same page, but the URL format differs. Here are two other basic examples:
http://example.com/hello/
http://example.com/hello
Both are the same.
I want to convert each URL into one standard format so that, when I store it in the database, I can easily check whether the URL string already exists there.
Because a URL can be formatted in so many ways, this is puzzling.
What's the definitive approach to converting a URL into one standard format? Maybe the parse_url() route?
Edit
As outlined in the comments, there is no definitive solution to this, but the aim is to get as close as possible with what we have without "retrieving" the page. Please read comments before posting an answer to this bounty.

After you parse_url:
Remove the www prefix from the domain name
If the path is not empty - remove the trailing slash from it
Sort query parameters alphabetically by their name - if there are any
Combine these parts in order to get a canonical URL.
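The steps above can be sketched roughly as follows. This is only a minimal illustration: canonicalizeUrl() is a hypothetical helper name, lowercasing the scheme and host and dropping the fragment are extra assumptions not stated in the steps, and parse_str() rewrites some characters in parameter names (dots and spaces become underscores).

```php
<?php
// Sketch: canonicalize a URL following the steps listed above.
// canonicalizeUrl() is a hypothetical helper name, not a PHP built-in.
function canonicalizeUrl(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null; // not an absolute URL we can work with
    }

    $scheme = strtolower($parts['scheme'] ?? 'http');
    // Step 1: remove a leading "www." from the host name.
    $host = preg_replace('/^www\./', '', strtolower($parts['host']));
    $port = isset($parts['port']) ? ':' . $parts['port'] : '';

    // Step 2: remove the trailing slash from the path.
    $path = rtrim($parts['path'] ?? '', '/');

    // Step 3: sort query parameters alphabetically by name.
    $query = '';
    if (isset($parts['query']) && $parts['query'] !== '') {
        parse_str($parts['query'], $params);
        ksort($params);
        $query = '?' . http_build_query($params);
    }

    // Step 4: combine the parts; the fragment is dropped (assumption).
    return $scheme . '://' . $host . $port . $path . $query;
}
```

With this sketch, all six URLs from the question collapse to the same string, which can then be stored under a unique index.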

I had the same issue with a feature for saving report configurations. In our system, users can design their own sales reports (like JQL in Jira); we use GET parameters as conditions and the fragment identifier (after #) for the layout setup, like this:
http://example.com/report.php?since=20180101&until=20180806#sort=amount&color=blue
For our system, the order of the GET parameters, or of the parameters after #, is irrelevant: you reach the same report configuration whether "until" comes before "since" or after it, so for us these are the same request.
With that in mind, subdomains are outside the scope of this discussion, because you have to solve them either with rewrite techniques (such as mod_rewrite with a 301 in Apache) or by maintaining a pool of domain exceptions at the software level. Also, different domains can point to different websites, so you must decide whether this is a good idea; for subdomains, "www" is very easy to figure out, but other cases will take you time.
The server side can help with the variables in the query section. For example, in PHP you can use parse_str() on $_SERVER['QUERY_STRING'] to get an array; then you need to sort it with asort() and finally compare the arrays (with array_diff()) to check whether they are the same request.
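A minimal sketch of that comparison follows. It swaps in ksort() and a strict array comparison for the asort()/array_diff() pair mentioned above, since sorting by key is enough to make equal parameter sets compare equal; sameQuery() is a hypothetical helper name.

```php
<?php
// Sketch: decide whether two query strings carry the same parameters,
// regardless of the order in which they appear.
// sameQuery() is a hypothetical helper name.
function sameQuery(string $a, string $b): bool
{
    parse_str($a, $paramsA);
    parse_str($b, $paramsB);

    // Sort both arrays by parameter name so the order no longer matters.
    ksort($paramsA);
    ksort($paramsB);

    return $paramsA === $paramsB;
}
```

In a request handler, one of the two arguments would typically be $_SERVER['QUERY_STRING'] and the other a stored query string.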
Unfortunately, the server side alone is not enough, since it has no way of reading the content after the hash (#), and we still have not considered other problems, such as an included script name, protocols, or ports:
http://www.sub.example.com/index.php?hello=world&feed=atom
https://www.sub.example.com/?hello=world&feed=atom
http://www.sub.example.com:8081/?hello=world&feed=atom
In my personal experience, the closest solution is JavaScript: handle the URL, parse the query section into an array, compare, and do the same with the fragment identifier. If you need the result on the server side, every page load must be followed by an Ajax request that sends this data to the server.
Apologies in advance for the length of my answer, but it is what I had to go through to solve the same problems you have. Greetings!
Get protocol, domain, and port from URL
How can I get query string values in JavaScript?
How do I get the fragment identifier (value after hash #) from a URL?

Adding the preferred <link rel="canonical" ... > tag to the HTML head is the only reliable solution for tying unique content to a single SEF URL. See Google's documentation on consolidating duplicate URLs, which probably answers the whole question more authoritatively and reliably than I ever could.
The idea of determining the canonical URL, or of resolving a bunch of external URLs, without parsing those servers' .htaccess rewrite rules or their HTML head does not appear to be workable, simply because a site can maintain a table of URL aliases, which makes it impossible to guess how an HTTP request might have been rewritten.
This question might belong on https://webmasters.stackexchange.com/search?q=cannonical.

Since the question is tagged "PHP", I assume you are working in the backend.
There are enough answers on how to compare URLs (protocol, host, port, path, list of request parameters), where the path is case-sensitive and the protocol and host are not. Strictly speaking, changing the order of the request parameters also changes the URL.
My impression is that you want to differentiate by the RESOURCE that the server is serving (http://www.sub.example.com/ serves the same resource as http://sub.example.com/, or .../hello serves the same resource as .../hello/).
You should know perfectly well at the backend level which resource is served, since you (the backend) know what you are serving. Find the perfect ID for the resource and use it.
PS: the URL is not a good identifier for that. But if you must use it, just use a sanitized version (sanitization for your purpose => sanitize to your preferred host, strip or add slashes at the end of paths, drop things like /../ from the path (a security issue anyway), bring the request parameters into a certain order, whatever is right for your purpose).
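As one illustration of the "drop things like /../ from the path" step, here is a simplified sketch. removeDotSegments() is a hypothetical helper name, the function assumes an absolute path (one starting with "/"), and it is not a complete implementation of the algorithm in RFC 3986.

```php
<?php
// Sketch: resolve "." and ".." segments in an absolute URL path.
// removeDotSegments() is a hypothetical helper name.
function removeDotSegments(string $path): string
{
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '.') {
            continue;               // "." refers to the current directory
        }
        if ($segment === '..') {
            if (count($out) > 1) {  // never climb above the root
                array_pop($out);
            }
            continue;
        }
        $out[] = $segment;
    }

    $result = implode('/', $out);
    // An absolute path that collapses to nothing is the root itself.
    return ($result === '' && $path !== '' && $path[0] === '/') ? '/' : $result;
}
```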
Best regards, iPirat

This is a case of duplicate URLs, and you can avoid such duplicates by using a URL factory that redirects every improper URL to the proper one.
The same thing is explained in this article:
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38
Any other URLs leading to the same page are 301 redirected to the proper version of the URLs.
This is a best practice in search engine optimization (SEO). Here are a couple of examples.
Consider the URLs of this website. For example, the wrong links to this page are
https://stackoverflow.com/questions/51685850
https://stackoverflow.com/questions/51685850/convert-url-into-one-s
https://stackoverflow.com/questions/51685850/
If you go to the above wrong URLs of this page, you'll be redirected to the proper URL which is
https://stackoverflow.com/questions/51685850/convert-url-into-one-standard-format
And if you change the title of this question, all other URLs are 301 redirected to the proper URL. The idea here is the 301 redirect, which tells the search engines to replace the old URL with the new one; otherwise the search engines find different URLs providing the same content.
The real deal here is the id of the question, 51685850. This id is used to create the proper URL with the information from the database. With the URL factory that is created in the article in the link provided, you do not even need to store URLs in the database.
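To make the idea concrete, here is a rough sketch of such a URL factory. It is not the code from the linked article: slugify(), properPath(), and redirectTarget() are hypothetical helper names, and the /questions/ID/slug layout is simply copied from the URLs above.

```php
<?php
// Sketch of a "URL factory": the numeric ID identifies the resource and the
// slug is regenerated from the title stored in the database.
// All three function names are hypothetical.
function slugify(string $title): string
{
    $slug = strtolower($title);
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);
    return trim($slug, '-');
}

function properPath(int $id, string $title): string
{
    return '/questions/' . $id . '/' . slugify($title);
}

// Returns the path to 301-redirect to, or null if the request is already proper.
function redirectTarget(string $requestPath, int $id, string $title): ?string
{
    $proper = properPath($id, $title);
    return $requestPath === $proper ? null : $proper;
}
```

A front controller could call redirectTarget() with the requested path and, when it returns a path, send it with header('Location: ' . $target, true, 301) and exit.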
You can read more on duplicate content here:
https://moz.com/learn/seo/duplicate-content
The same rules apply to tinywebhut.com as well; the wrong URLs are
https://www.tinywebhut.com/remove-duplicate-38
https://www.tinywebhut.com/some-text-38
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38/
In the above URLs, the ID (38) is appended to the end of the URL, and if you go to any of these URLs, you'll be 301 redirected to the proper version, which is
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38
I didn't make any functions to explain this here because it is already done in this article:
https://www.tinywebhut.com/remove-duplicate-urls-from-your-website-38
You can achieve the goal with a couple of really simple functions, and you can apply the same idea to remove other duplicate URLs such as /about.php, /about, /about.php/, /about/ and so on. To achieve this, you just need to add a little more code to your existing functions.
One alternative is adding a canonical tag: even if more than one URL leads to the same page, you just need to add a canonical tag that links to the proper URL.
<link rel="canonical" href="https://stackoverflow.com/questions/51685850/convert-url-into-one-standard-format" />
This way you are telling the search engines that the multiple URLs should be considered as one and the search engines add the link used in the canonical tag in their search results. You can read more on canonicalization here:
https://moz.com/learn/seo/canonicalization
But the best way to get rid of duplicate content is still the 301 redirect. If you have a 301 redirect like the one I described at the beginning, all problems are solved without surprises.

My original answer assumes that the pages are all owned by the OP, as per the line "As you can see, they all lead to the exact same page but the URL format is different...". I am adapting the answer to handle multiple options and adding a list of assumptions you can and cannot make about URLs.
As others have pointed out there is no definitive easy answer to this if you do not know that the page(s) are the same. However, if you follow these assumptions, you should be safe standardizing some things:
CAN ASSUME
Query strings with the same values point to the same location regardless of order. Example: https://example.com/?fruit=apple&color=red is the same as https://example.com/?color=red&fruit=apple
301 redirects to a specific source can be followed. If you receive a 301 redirect response, follow the redirect and use that URL. You can safely assume that if a URL actually does point to the same page, and page rank is optimized, then you can follow it.
If there is a single <link rel="canonical"> tag in the HTML, that too can be used to discover the canonical link (see below for why).
CANNOT ASSUME
Any URL is guaranteed to be the same as any other URL, if they are different (by URL in this case I am talking about anything before the query string).
http://example.com can be different from https://example.com can be different from http://www.example.com or https://www.example.com. There is no restriction against showing a different website when putting "www" or leaving it out. That's why page rank on search engines is really damaged here.
Any two URLs, even if they currently have exactly the same content, will keep exactly the same content. An example would be https://example.com/test and https://sub.example.com/test. Both may feasibly be set to the same generic test page content. In the future, https://sub.example.com/test may be changed. You can't assume it won't be.
If you own the site
Redirect all traffic to the first part of the URL in the format you want: Do you want www.example.com or example.com or sub.example.com? Do you want a trailing slash or not? Redirect this first, using either server rules or PHP. This is also highly beneficial for search page rank (if that matters to you).
An example of this would be something like this:
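if (empty($_SERVER['HTTPS']) || 'example.com' !== $_SERVER['HTTP_HOST'] || rtrim($_SERVER['PHP_SELF'], '/') !== $_SERVER['PHP_SELF']) {
    // Not HTTPS, wrong host, or a trailing slash: redirect permanently.
    header('HTTP/1.1 301 Moved Permanently');
    // PHP_SELF already starts with "/", so no extra slash after the host.
    header('Location: https://example.com' . rtrim($_SERVER['PHP_SELF'], '/'));
    exit;
}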
Finally, to manage any remaining SEO concerns, you can add this HTML tag:
`<link rel="canonical" href="<?php echo $url; ?>">`
Whether you own the site or not, you can standardize query order
Even if you don't control the site, you can assume that query order does not matter. To standardize this, take your query and rebuild the parameters, appending it to your normalized URL.
function getSortedQuery()
{
    $url = [];
    parse_str($_SERVER['QUERY_STRING'], $url);
    ksort($url);
    return http_build_query($url);
}
$url = $_SERVER['HTTP_HOST'].$_SERVER['PHP_SELF'].'?'.getSortedQuery();
Another option is to grab the contents of the page and see if there is a <link rel="canonical"> string, and use that string to log your data. This is a bit more costly as it requires a full page load.
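Assuming the page has already been downloaded into a string, extracting the tag could look roughly like this. findCanonicalLink() is a hypothetical helper name, the sketch relies on PHP's DOM extension being available, and it only trusts the tag when exactly one is present, in line with the assumption above.

```php
<?php
// Sketch: pull the canonical URL out of an already-downloaded HTML document.
// findCanonicalLink() is a hypothetical helper; fetching the page is not shown.
function findCanonicalLink(string $html): ?string
{
    if ($html === '') {
        return null;
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = $xpath->query('//link[@rel="canonical"]/@href');

    // Only trust the tag when there is exactly one of them.
    if ($links === false || $links->length !== 1) {
        return null;
    }
    return $links->item(0)->nodeValue;
}
```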
To repeat, do make sure you grab 301 redirects as they are not suggestions, but directives, as to the end result URL.
One final suggestion
I might recommend using two columns, one being "canonical_url" and another being "effective_url". Sometimes a URL works and then later becomes a 301 redirect. This is just my take but I would like to know these things.

All of the answers have great information. Assuming you are using an Apache-like server, for the URL part I would use .htaccess (or, preferably, if you can change it, the equivalent Apache server config file) to do the rewrites. For a simple example:
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_HOST} ^www\.example\.com$
RewriteRule (.*) http://example.com/$1 [R=Permanent]
In this example, the "R=Permanent" DOES do a redirect. This is usually not a big issue as, a) it tells the browser to remember the redirect, and b) your internal links are presumably relative, so protocol (http or https) and the server (example.com or whatever) are preserved. So generally the redirect will be once per session or less - time well spent, IMO, to avoid doing all this in PHP.
I guess you could use it to rewrite the order of the query bits as well, though when the query bits are significant, I tend to (not recommending you do, just sayin') add them to my path (e.g. rewrite ".../blah/atom" to ".../blah.php?feed=atom"). At any rate, there are loads of rewrite tricks available, and I recommend you read about them in Apache mod_rewrite.
If you do go this route, be sure to carefully think through what you want to happen: once you start mucking with the URLs, you are usually stuck with your decisions for a long while.

As several have pointed out, while the URLs you show may currently point to the same content, there is no way to tell if they will in the future. A change in either protocol or hostname can get you different sets of content, even example.com vs. www.example.com, even if served up by the same machine at the same IP. Not common, but it can happen...
So if I wanted to maintain a list of URLs, I would store the protocol, hostname, directory path, filename if present (aka "whatever came after the last slash before a question mark"), and a key-sorted set of key/value pairs for the GET arguments.
And then don't forget that you can go to https://www.google.com and not have anything BUT the protocol and hostname...
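A rough sketch of splitting a URL into those pieces is shown below. splitUrlForStorage() and the array keys are hypothetical names chosen for illustration; how the pieces map to database columns is left open.

```php
<?php
// Sketch: break a URL into the pieces described above so that each one can
// be stored separately. splitUrlForStorage() is a hypothetical helper name.
function splitUrlForStorage(string $url): array
{
    $parts = parse_url($url);
    if ($parts === false) {
        $parts = [];
    }
    $path  = $parts['path'] ?? '';
    $slash = strrpos($path, '/');

    // GET arguments, sorted by key.
    parse_str($parts['query'] ?? '', $args);
    ksort($args);

    return [
        'protocol'  => $parts['scheme'] ?? '',
        'hostname'  => $parts['host'] ?? '',
        // Everything up to and including the last slash.
        'directory' => $slash === false ? '' : substr($path, 0, $slash + 1),
        // Whatever follows the last slash; may be empty.
        'filename'  => $slash === false ? $path : (string) substr($path, $slash + 1),
        'arguments' => $args,
    ];
}
```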

Avoid passing the parameters in the url. Pass your parameters to the web page using JSON.

Related

Multiple URLs for the same content in CakePHP

I just started exploring CakePHP. From what I have observed, in CakePHP all of the following are the same:
http://example.com/page
http://example.com/page/
http://example.com/page/index
Is this not considered duplicate content by the search engines?
If yes, how to fix this?

They are only duplicate content if you link to them differently, thus making the different URLs visible.
Usually, using Cake's internal routing, you can only get one of those three versions. Always.
But if someone gets hold of the wrong URL and links to it from somewhere, it might actually be followed by Google and indexed wrongly. So yes, there is a possibility.
So:
Use .htaccess to prevent the trailing / (or vice versa) and 301 redirect to the other one.
Use a canonical tag in your layout that always points to the correctly routed URL, no matter what the full URL (including /index etc.) might currently look like.
Details and code examples:
http://www.dereuromark.de/2012/12/29/cakephp-and-seo/

How to automatically append "http://" before links before saving them to a database?

I'm developing a PHP-based web-application in which you have a form with textarea inputs that can accept links via anchor tags. But when I tested it after adding a hyperlink as follows, it pointed to a non-existent local subdirectory:
link
I realized that this was because I had not prepended http:// to the link.
There might be cases where a user inputs the link just as I did above. In such cases, I don't want the link to point where it did above. Is there any possible solution, such as automatically prepending http:// to the link when it is missing? How do I do that?
----------------------------------------Edit---------------------------------------------
Please consider that the anchor tags are amidst other plaintext and this is making things harder to work with.

I'd go for something like this:
if (!parse_url($url, PHP_URL_SCHEME)) {
    $url = 'http://' . $url;
}
This is an easy and stable way to check for the presence of a protocol in a URL, and allows others (e.g. ftp, https) that may be entered.

What you're talking about involves two steps, URL detection and URL normalization. First you'll have to detect all the URLs in the string being parsed and store them in a data structure for further processing, such as an array. Then you need to iterate over the array and normalize each URL in turn, before attempting to store them.
Unfortunately, both detection and normalization can be problematic, as a URL has a quite complicated structure. http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/ makes some suggestions, but as the page itself says, no regex URL detection is ever perfect.
There are examples of regular expressions that can detect URLs available from various sites, but in my experience none of them are completely reliable.
As for normalization, Wikipedia has an article on the subject which may be a good starting point. http://en.wikipedia.org/wiki/URL_normalization
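For the narrower case in the question, where the links sit in anchor tags amid other text, one possible sketch is shown below. addMissingScheme() is a hypothetical helper name, the regular expression only handles double-quoted href attributes, and skipping values that start with "/" or "#" is an assumption made so that site-relative links and in-page anchors stay untouched.

```php
<?php
// Sketch: add "http://" to href values that have no scheme, leaving the
// surrounding text untouched. addMissingScheme() is a hypothetical helper.
function addMissingScheme(string $text): string
{
    $result = preg_replace_callback(
        '/href="([^"]*)"/i',
        function (array $m): string {
            $url = $m[1];
            // Skip empty values, site-relative paths, and in-page anchors.
            $skip = $url === '' || $url[0] === '/' || $url[0] === '#';
            if (!$skip && !parse_url($url, PHP_URL_SCHEME)) {
                $url = 'http://' . $url;
            }
            return 'href="' . $url . '"';
        },
        $text
    );

    // preg_replace_callback() returns null on failure; fall back to the input.
    return $result ?? $text;
}
```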

How To Make PHP Application To Use Actual URL's If Mod-Rewrite Is Disabled?

I am working on a PHP application. Everything works perfectly; the only problem is this:
I have enabled SEO-friendly URLs, which rewrite the actual URLs to virtual URLs (I know you guys know it).
Ex : hxxp://www.website.com/index.php?page=about-us
To
hxxp://www.website.com/page/about-us/
What I want to achieve is this: if the SEO URLs / mod_rewrite are disabled, the user should still be able to access the direct/actual URLs.
In brief, if mod_rewrite is enabled, the web application should automatically use the SEO-friendly URLs; otherwise it should fall back to the default URLs.

You would have to replace all occurrences of links with a function that checks if mod_rewrite is available, or more likely, a config value. It would then return the appropriate link.
getLink("?page=about-us")
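A minimal sketch of what such a getLink() function might look like follows. The $useRewrite flag stands in for the config value, and both the /index.php prefix and the /key/value/ layout are assumptions taken from the example URLs in the question; only flat (non-array) parameters are handled.

```php
<?php
// Sketch of the getLink() idea: build either a friendly or a plain URL from
// the same query string. $useRewrite stands in for a config value.
function getLink(string $query, bool $useRewrite): string
{
    $query = ltrim($query, '?');
    if (!$useRewrite) {
        return '/index.php?' . $query;
    }

    // Turn key=value pairs into /key/value/ path segments.
    parse_str($query, $params);
    $path = '';
    foreach ($params as $key => $value) {
        $path .= '/' . rawurlencode((string) $key) . '/' . rawurlencode((string) $value);
    }
    return $path . '/';
}
```

Every place that outputs a link would then call getLink() instead of hard-coding either URL style.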

Use an <IfModule> to avoid breaking other .htaccess directives and/or causing 500 Internal Server Errors if Apache doesn't understand your rules. Also add a single non-rewriting RewriteRule (before all others):
<IfModule mod_rewrite.c>
RewriteEngine On
#The next rule does no rewriting, but sets an environment variable.
RewriteRule .* - [E=RewriteCapable:On]
</IfModule>
In your file (store as setting or check on places generating/outputting urls):
if (isset($_SERVER['RewriteCapable'])) {
    // make fancy URLs
} else {
    // kludgy old-style URLs
}

Hmm... this may need a bit of thought to get the correct solution. Follow me here if you will :)
SEO URLs were primarily introduced to (1) include human readable text in the URLs and (2) to get rid of the GET parameters.
To look at point (2) for a moment, this was the primary driver initially, because people used about.php?id=1, id=2 ... id=3457348 to get the same page listed in the search engines multiple times, which of course got detected and stopped, then sometimes people would pass a session id=24234234 which would also get stopped as being a duplicate page (rightfully as it uses HTTP as a stateful protocol when it's not).
With a URL, everything from the first character up to the # of a #fragment defines a resource (from an HTTP perspective), so when several different URLs all resolve to the same 'page', they are indeed rightly considered duplicates.
So, by negating the GET parameters you solve this problem, which now isn't a problem by the way and hasn't been for a long time, there's no reason not to use GET params properly other than vanity.
So, really you solve no problem but have instead introduced a new problem, in that you want '/page/about-us' and '?page=about-us' to both go to the same 'page' which means you've got duplicate resources again and this could be detected and you could get penalised.
Thus, by introducing 'SEO URLs' you've actually created the problem SEO URLs were 'invented' to counteract.
This only leaves the point about human-readable words in the URL. URLs are supposed to be transparent, so they don't count for anything in reality, but some people still like them. So I'd have to ask what's wrong with using '/?/page/about-us'... and if you don't like that, then what's wrong with creating a fixed file with the filesystem path '/page/about-us' which simply includes your index.php with the right variables set?
Of course you can create duplicate pages and have both SEO friendly urls and GET param URLs but as you can see that won't be SEO friendly now will it?
Something to chew on :)

PHP: Parse different styles of friendly urls

I am writing my own small framework and I am going to implement friendly URLs there. mod_rewrite is great, but I want to be able to handle several types of friendly URLs at once:
/index.php?ac=user&an=showprofile (fallback variant, the worst)
/index.php/user/showprofile (supposedly, can be disabled by security settings)
index.php?user/showprofile (optional, not needed)
/user/showprofile (ideal, but requires mod_rewrite or dirty ErrorDocument tricks)
I would like all the variants to be supported at once so that old links generated with whatever scheme remain valid forever. Should I write my own parse functions for this, or maybe I missed some library/script that can do that? Extracting algorithms from big frameworks like Symfony or Zend is quite difficult. There are also many different non-obvious cases, like correctly handling UTF-8 encoded URLs or magic_quotes_runtime, etc.

If you can both programmatically distinguish between all different types of URLs and normalize them to one base form you can just write a simple tokenizer function that normalizes the different types and you can use the normalized type to get the actual destination. I've done this, but not without mod_rewrite. Pretty sure it can be done without it though.
I usually have one index file that parses whatever url and then does a bunch of request handling and routing to get the output, without having any url map directly to any file. Just mod_rewrite everything to that index file and parse $_SERVER['REQUEST_URI'].
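A small sketch of such a tokenizer for the four URL styles in the question is shown below. resolveRoute() is a hypothetical helper name, the "ac"/"an" parameter names come from the question, and falling back to 'index' for missing segments is an assumption.

```php
<?php
// Sketch: reduce the four URL styles from the question to one
// [controller, action] pair. resolveRoute() is a hypothetical helper name.
function resolveRoute(string $requestUri): array
{
    $path  = (string) parse_url($requestUri, PHP_URL_PATH);
    $query = (string) parse_url($requestUri, PHP_URL_QUERY);

    // Style 1: /index.php?ac=user&an=showprofile
    parse_str($query, $params);
    if (isset($params['ac'], $params['an'])) {
        return [$params['ac'], $params['an']];
    }

    // Style 3 keeps the route in the query string: index.php?user/showprofile
    $route = strpos($query, '/') !== false ? $query : $path;

    // Styles 2 and 4: drop a leading "index.php" so both look alike.
    $route    = (string) preg_replace('#^/?index\.php#', '', $route);
    $segments = array_values(array_filter(explode('/', $route), 'strlen'));

    return [$segments[0] ?? 'index', $segments[1] ?? 'index'];
}
```

The index file would call resolveRoute($_SERVER['REQUEST_URI']) and dispatch on the returned pair, so old links in any of the four styles keep working.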

The first through third should be passed through to PHP for processing ($_GET, $_SERVER["REQUEST_URI"] and $_SERVER["QUERY_STRING"]).
The fourth can be done with mod_rewrite using RewriteCond !-f and an appropriate RewriteRule.

You should avoid allowing all different types of URLs. If all those URLs you posted show the same content then SEs will see it as duplicate content and you will end up "splitting" your ranking. In other words, each of the four pages will rank a quarter as well as if everything was focused on one URL...if that makes sense.
Just choose one URL format that's simple for users, and stick with it. Personally I would prefer the last one, with mod_rewrite. Your question implies there is something "wrong" with using mod_rewrite, which there isn't.
You can also use things like $_SERVER["REQUEST_URI"] to detect what URL is being requested and do a 301 redirect to the official URL, if that's even necessary.

REALLY basic mod_rewrite question

I am trying to use SEO-friendly URLs for a website. The website displays information in hierarchical manner. For instance, if my website was about cars I would want the URL 'http://example.com/ford' to show all Ford models stored in my 'cars' table. Then the URL 'http://example.com/ford/explorer' would show all the Ford Explorer Models (V-6, V-8, Premium, etc).
Now for the question:
Is mod_rewrite used to rewrite a query string style URL into a semantic URL, or the other way around? In other words, when I use JavaScript to set window.location=$URL, should the URL be the query string version 'http://example.com/page.php?type=ford&model=explorer' OR do I internally use 'http://example.com/ford/explorer' which then gives me access to the query string variables?
Hopefully, someone can see my confusion with this issue. For what it's worth, I do have unique 'slugs' made for all my top level categories (obviously, the site isn't about cars).
Thanks for all the help so far. I got the rewrite working, but it is affecting other paths on the site (CSS, JavaScript, images). I am using the correct path structure for all these includes (i.e. '/images/car.png' and '/css/main.css'). Do I have to use an absolute path ('http://example.com/css/main.css') for all files? Thanks!

Generally, people who use mod_rewrite use the terminology like this:
I want mod_rewrite to rewrite A to be B.
What this means is that any request from the outside world for page A gets rewritten to file B on the server.
You want the outside world to see URLs that look like
A) http://example.com/ford/explorer
but your web server wants them to look like
B) http://example.com/page.php?type=ford&model=explorer
I would say you want to rewrite (A) to look like (B), or you want to rewrite the semantic URL into a query string URL.
Since all the links on your page are clicked on by the user and/or requested by the browser, you want them to look like (A). This includes links that javascript uses in window.location. They can and should look like (A).

Once you have set up mod_rewrite, your links should point to the mod_rewritten version of the URL (in your example: http://mysite.com/ford/explorer). Internally, your system will still reference the variables as if they were traditional query string name/value pairs, though.
It's also worth pointing out that Google is now starting to advocate more logical URLs from a search engine point of view, i.e. a query string over mod_rewrite:
"Does that mean I should avoid rewriting dynamic URLs at all? That's our recommendation, unless your rewrites are limited to removing unnecessary parameters, or you are very diligent in removing all parameters that could cause problems. If you transform your dynamic URL to make it look static you should be aware that we might not be able to interpret the information correctly in all cases"
http://googlewebmastercentral.blogspot.com/2008/09/dynamic-urls-vs-static-urls.html
also worth looking at:
http://googlewebmastercentral.blogspot.com/2009/08/optimize-your-crawling-indexing.html
