Right, Good afternoon all (well, it is afternoon here in the UK!)
I am in the process of writing a (PHP/MySQL) site that uses friendly URLs.
I have set up my .htaccess (mod_rewrite enabled) and have a basic script that splits the path on "/" and handles everything after "?" in the same script. That is, I can work out whether a user has requested example.com/about, example.com/about/the-team or example.com/join/?hash=abc123etc etc.
My question is how do I handle variable length URLs such as (examples):
example.com/about (node only)
example.com/about/the-team (node + seo-page-title)
example.com/projects (node only)
example.com/projects/project-x (node + sub-node)
example.com/projects/project-x/specification (node + sub-node + seo-friendly-title)
example.com/news/article/new-article (node + sub-node + seo-friendly-title)
example.com/join/?hash=abc123etc&this=that (node + query pair)
BUT, the "nodes" (first argument), "sub-nodes" (second argument) or "seo-friendly page titles" may be missing or unknown (database controlled) so I cannot put the processing in .htaccess specifically. Remember: I have already (I think!) got a working htaccess to forwards everything correctly to my PHP processing script. Everything not found will be forwarded to a CMS "404".
I think my client will have a maximum of THREE arguments (and then everything else will be after "?").
Has anyone tried this, or can anyone suggest a starting point for a database structure, or a way to work out which of the above possibilities has been requested?
I have tried this in a previous project, but I always ended up writing the CMS so that it forced the user (while adding pages) to supply at least a node OR a node + sub-node + seo-friendly-title, which I would like to get away from...
I don't want a script that will put too much strain on database searches by trying to find every single possibility of the arguments until a match is found... or is this the only way if I want to implement what I'm asking?
Many Thanks!
You can cater for different numbers of matches like this:
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^([^/]+)/([^/]+)/([^/]+)/?$ /content.php?part1=$1&part2=$2&part3=$3 [L,QSA,NC]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^([^/]+)/([^/]+)/?$ /content.php?part1=$1&part2=$2 [L,QSA,NC]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^([^/]+)/?$ /content.php?part1=$1 [L,QSA,NC]
Where [^/]+ matches one or more characters other than '/', and because each term is enclosed in ( ) brackets it can be used in the rewritten URL as $1, $2 and $3. The rules go from most to least specific so that, with the [L] flag, a three-segment request isn't swallowed by the one-segment rule, and the RewriteCond lines stop requests for real files (such as content.php itself) from being rewritten again.
QSA carries over any original query string parameters and correctly appends them to the rewritten URL.
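For example, with the rules above in place, a request like the following (draft=1 is just a made-up extra parameter) would reach the script as:

/projects/project-x/specification?draft=1
    becomes /content.php?part1=projects&part2=project-x&part3=specification&draft=1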
How you match up the parts with things that you know about is up to you but I imagine that something like this would be sensible:
$knownNodes = array(
    'about',
    'projects',
    'news',
    'join',
);
$knownSubNodes = array(
    'the-team',
    'project-x',
    'article',
);
// $part1, $part2, etc. come from the rewritten query string ($_GET['part1'], ...)
$node = FALSE;
$subNode = FALSE;
$seoLinks = array();

if (isset($part1)) {
    if (in_array($part1, $knownNodes)) {
        $node = $part1;
    }
    else {
        $seoLinks[] = $part1;
    }
}
if (isset($part2)) {
    if (in_array($part2, $knownSubNodes)) {
        $subNode = $part2;
    }
    else {
        $seoLinks[] = $part2;
    }
}
if (isset($part3)) {
    $seoLinks[] = $part3;
}
if (isset($part4)) {
    $seoLinks[] = $part4;
}
Obviously the list of nodes and subNodes could be pulled from a DB rather than being hard-coded. The exact details of how you match up the known things with the free text is really up to you.
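If the whole friendly path is stored against each page, a single indexed lookup can resolve the request without probing every node/sub-node combination (the concern raised in the question). A minimal sketch, assuming a hypothetical pages table with an indexed path column and a template column (table name, column names and credentials are all made up here):

$path = trim(parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH), '/');

$pdo  = new PDO('mysql:host=localhost;dbname=cms', 'user', 'pass'); // placeholder credentials
$stmt = $pdo->prepare('SELECT id, template FROM pages WHERE path = ? LIMIT 1');
$stmt->execute(array($path));
$page = $stmt->fetch(PDO::FETCH_ASSOC);

if ($page === false) {
    include 'cms_404.php';      // hypothetical CMS 404 handler
} else {
    include $page['template'];  // e.g. 'app/projects/specification.inc.php'
}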
In which structure does the PHP script get the information?
If the structure for 'example.com/news/article/new-article' is
$_GET['a'] = news
$_GET['b'] = article
$_GET['c'] = new-article
you could check whether $_GET['c'] is empty; if it is, the real page is $_GET['b'], and so on...
Another way is to have $_GET['a'] return something like 'news_article_new-article';
in that case you have a unique name for the DB search.
I hope I understood you correctly.
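To illustrate that idea, a small sketch, assuming the rewrite passes the segments as $_GET['a'], $_GET['b'] and $_GET['c'] (the parameter names are only illustrative):

$a = isset($_GET['a']) ? $_GET['a'] : '';
$b = isset($_GET['b']) ? $_GET['b'] : '';
$c = isset($_GET['c']) ? $_GET['c'] : '';

if ($c !== '') {
    $page = $c;   // node + sub-node + seo title
} elseif ($b !== '') {
    $page = $b;   // node + sub-node
} else {
    $page = $a;   // node only
}

// Alternative: build one unique key for a single DB lookup.
$key = implode('_', array_filter(array($a, $b, $c))); // e.g. 'news_article_new-article'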
Related
A user will be directed from a website to a landing page that will have a query string in the URL i.e. www.sitename.com?foo=bar&bar=foo. What I want to do, is then append that query string to all links on the page, preferably whether they were generated by WordPress or not (i.e. hard coded or not) and done server-side.
The reason for this is that their goal destination has to have the query string in the URL. I could use cookies, but I'd rather not, since they bring other problems for my specific use case.
I have explored the possibility of using .htaccess in conjunction with $_SERVER['QUERY_STRING'] to no avail. My understanding of .htaccess isn't great, but in my mind I assumed it would be possible to rewrite the current URL to be current URL + the variable that stores $_SERVER['QUERY_STRING'].
I've also explored add_rewrite_rule but couldn't find a logical way to achieve what I want.
Here's the Javascript solution I have, but as I said, I'd like a server-side solution:
const links = document.querySelectorAll('a');
links.forEach(link => {
    if (!link.host.includes(location.host)) {
        return;
    }

    const url = new URL(link.href);
    const combined = Array.from(url.searchParams.entries()).reduce((agg, [key, val]) => {
        agg.set(key, val);
        return agg;
    }, (new URL(location.href)).searchParams);

    const nextUrl = [link.protocol, '//', link.host, link.pathname].join('');
    link.href = (new URL(`${nextUrl}?${combined.toString()}`)).toString();
});
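For comparison, a server-side sketch of the same idea, assuming WordPress: it filters post content and appends the current request's query string to internal links with add_query_arg(). A content filter only covers links inside post/page content, so hard-coded theme links would still need extra filters or the JavaScript above; the host check is a simple heuristic, not part of any official API.

add_filter('the_content', function ($content) {
    if (empty($_GET)) {
        return $content;
    }
    return preg_replace_callback(
        '/href="([^"]+)"/i',
        function ($matches) {
            $url = $matches[1];
            // Only touch links pointing at this site (simple heuristic).
            if (strpos($url, home_url()) !== 0 && substr($url, 0, 1) !== '/') {
                return $matches[0];
            }
            return 'href="' . esc_url(add_query_arg($_GET, $url)) . '"';
        },
        $content
    );
});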
I'm struggling to make an AJAX-based website SEO-friendly. As recommended in tutorials on the web, I've added "pretty" #! href attributes to links (for example a "contact" link) and, in the div where content is loaded with AJAX by default, a PHP fallback script for crawlers:
$files = glob('./pages/*.php');
foreach ($files as &$file) {
    $file = substr($file, 8, -4); // strip the leading './pages/' and the trailing '.php'
}

if (isset($_GET['site'])) {
    if (in_array($_GET['site'], $files)) {
        include("./pages/" . $_GET['site'] . ".php");
    }
}
I have a feeling that at the beginning I need to additionally cut the _escaped_fragment_= part from (...)/index.php?_escaped_fragment_=site=about, because otherwise the script won't be able to GET the site value from the URL. Am I right?
But anyway, how do I know that the crawler transforms pretty links (those with #!) into ugly links (containing ?_escaped_fragment_=)? I've been told that it happens automatically and that I don't need to provide this mapping, but Fetch as Googlebot doesn't give me any information about what happens to the URL.
Googlebot will automatically query ?_escaped_fragment_= URLs.
So from www.example.com/index.php#!site=about
Googlebot will query: www.example.com/index.php?_escaped_fragment_=site=about
On the PHP side you will get it as $_GET['_escaped_fragment_'] = "site=about"
If you want to get the value of the "site" you need to do something like this:
if (isset($_GET['_escaped_fragment_'])) {
    $escaped = explode("=", $_GET['_escaped_fragment_']);
    if (isset($escaped[1]) && in_array($escaped[1], $files)) {
        include("./pages/" . $escaped[1] . ".php");
    }
}
Take a look at the documentation:
https://developers.google.com/webmasters/ajax-crawling/docs/specification
Is it possible to reduce the size of a link (in text form) by PHP or JS?
E.g. I might have links like these:
http://www.example.com/index.html <- Redirects to the root
http://www.example.com/folder1/page.html?start=true <- Redirects to page.html
http://www.example.com/folder1/page.html?start=false <- Redirects to page.html?start=false
The purpose is to find out whether the link can be shortened and still point to the same location. In these examples the first two links can be reduced, because the first points to the root and the second has parameters that can be omitted.
The third link is the case where the parameters can't be omitted, meaning it can't be reduced beyond removing the http://.
So the above links would be reduced like this:
Before: http://www.example.com/index.html
After: www.example.com
Before: http://www.example.com/folder1/page.html?start=true
After: www.example.com/folder1/page.html
Before: http://www.example.com/folder1/page.html?start=false
After: www.example.com/folder1/page.html?start=false
Is this possible by PHP or JS?
Note:
www.example.com is not a domain I own or have access to besides through the URL. The links are potentially unknown, and I'm looking for something like an automatic link shortener that can work by getting the URL and nothing else.
Actually I was thinking of something like a linkchecker that could check if the link works before and after the automatic trim, and if it doesn't then the check will be done again at a less trimmed version of the link. But that seemed like overkill...
Since you want to do this automatically, and you don't know how the parameters change the behaviour, you will have to do it by trial and error: try to remove parts from a URL and see whether the server responds with a different page.
In the simplest case this could work somehow like this:
<?php
$originalUrl = "http://stackoverflow.com/questions/14135342/reduce-link-url-size";
$originalContent = file_get_contents($originalUrl);
$trimmedUrl = $originalUrl;
while ($trimmedUrl) {
    $trialUrl = dirname($trimmedUrl);
    $trialContent = file_get_contents($trialUrl);
    if ($trialContent == $originalContent) {
        $trimmedUrl = $trialUrl;
    } else {
        break;
    }
}
echo "Shortest equivalent URL: " . $trimmedUrl;
// output: Shortest equivalent URL: http://stackoverflow.com/questions/14135342
?>
For your usage scenario, your code would be a bit more complicated, as you would have to test for each parameter in turn to see if it is necessary. For a starting point, see the parse_url() and parse_str() functions.
A word of caution: this code is very slow, as it performs lots of requests for every URL you want to shorten. Also, it will likely fail to shorten many URLs because the server might include things like timestamps in the response. This makes the problem very hard, and that's the reason why companies like Google have many engineers thinking about stuff like this :).
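Building on the parse_url() / parse_str() pointer, here is a rough sketch of the per-parameter test. It shares the speed and false-negative caveats just mentioned, and the helper name is made up:

function shortenQuery($url) {
    $original = file_get_contents($url);
    $parts    = parse_url($url);
    parse_str(isset($parts['query']) ? $parts['query'] : '', $params);

    foreach (array_keys($params) as $name) {
        $candidate = $params;
        unset($candidate[$name]);               // try the URL without this parameter
        $query = http_build_query($candidate);
        $trial = $parts['scheme'] . '://' . $parts['host'] . $parts['path']
               . ($query !== '' ? '?' . $query : '');
        if (file_get_contents($trial) == $original) {
            $params = $candidate;               // the parameter made no difference, keep it dropped
        }
    }

    $query = http_build_query($params);
    return $parts['scheme'] . '://' . $parts['host'] . $parts['path']
         . ($query !== '' ? '?' . $query : '');
}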
Yeah, that's possible:
JS:
var url = 'http://www.example.com/folder1/page.html?start=true';
url = url.replace('http://','').replace('?start=true','').replace('/index.html','');
php:
$url = 'http://www.example.com/folder1/page.html?start=true';
$url = str_replace(array('http://', '?start=true', '/index.html'), "", $url);
(Each item in the array() will be replaced with "")
Here is a JS function for you.
function trimURL(url, trimToRoot, trimParam) {
    var myRegexp = /(http:\/\/|https:\/\/)(.*)/g;
    var match = myRegexp.exec(url);
    url = match[2];
    //alert(url); // www.google.com

    if (trimParam === true) {
        url = url.split('?')[0];
    }
    if (trimToRoot === true) {
        url = url.split('/')[0];
    }
    return url;
}
alert(trimURL('https://www.google.com/one/two.php?f=1'));
alert(trimURL('https://www.google.com/one/two.php?f=1', true));
alert(trimURL('https://www.google.com/one/two.php?f=1', false, true));
Fiddle: http://jsfiddle.net/5aRpQ/
I have a self-built MVC framework with a router that maps URLs so that the common example.com/controller/action form is used. I'm running into issues when my application is deployed within a sub-directory, such as
example.com/my_app/controller/action/?var=value
My router thinks my_app is the name of the controller now and controller is the method.
My current solution is to manually ask for the sub-directory name in a config file at install time. I'd like to do this automatically. See my question below and let me know if I'm going about solving this the wrong way and asking the wrong question.
My question:
If I have two paths, how do I find the pieces common to the end of one and the beginning of the other, and remove them from the second?
A = /var/www/my_app/pub
B = /my_app/pub/cntrl/actn
What's your quickest one-liner to remove /my_app/pub from B and be left with /cntrl/actn?
Basically I'm looking for a Perl-esque way of getting the common-denominator-like string.
Thanks for any input
my @physical_parts = split qr{/}, $physical_path;
my @logical_parts  = split qr{/}, $logical_path;

my @physical_suffix;
my @logical_prefix;
my $found = 0;
while (@physical_parts && @logical_parts) {
    unshift @physical_suffix, pop(@physical_parts);
    push @logical_prefix, shift(@logical_parts);
    if (@physical_suffix ~~ @logical_prefix) {
        $found = 1;
        last;
    }
}
The way I would solve this is by adding this logic to the front controller (the file to which your server sends all non-existent file requests, usually index.php).
$frontControllerPath = $_SERVER['SCRIPT_NAME'];
$frontControllerPathLength = strlen($frontControllerPath);
$frontControllerFileName = basename($frontControllerPath);
$frontControllerFileNameLength = strlen($frontControllerFileName);
$subdirectoryLength = $frontControllerPathLength - $frontControllerFileNameLength;
$url = substr($_SERVER['REQUEST_URI'], $subdirectoryLength - 1);
Here's a codepad demo.
What does this do? If the front controller is located (relative to the www root) in /subdir/myapp/, then its $_SERVER['SCRIPT_NAME'] would be /subdir/myapp/index.php. The actual request URI is contained in $_SERVER['REQUEST_URI']. Let's say, for example, that it is /subdir/myapp/controller/action?extras=stuff. To remove the subdirectory prefix we need to find its length. That is found by subtracting the length of the script's filename (retrieved with basename()) from the length of the script's path relative to the www root.
File that receives request: /subdir/myapp/index.php (length = 23)
Filename: index.php (length = 9)
23 - 9 = 14 characters of subdirectory prefix to remove

/subdir/myapp/controller/action?extras=stuff
             ^
             cut off everything before here (the -1 in the substr() call keeps this leading slash)
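As a purely hypothetical continuation (not part of the original answer), the cleaned $url could then be split into controller and action for the router:

$path     = parse_url($url, PHP_URL_PATH);                  // drop the query string
$segments = array_values(array_filter(explode('/', $path)));

$controller = isset($segments[0]) ? $segments[0] : 'home';  // e.g. 'controller'
$action     = isset($segments[1]) ? $segments[1] : 'index'; // e.g. 'action'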
Is this a terrible way to include pages based on the URL? (using mod_rewrite through index.php)
if ($url === '/index.php/' . $user['username']) {
    include('app/user/page.inc.php');
}
// Upload *
else if ($url === '/index.php/' . $user['username'] . '/Upload') {
    include('app/user/upload.inc.php');
}
// Another page *
else if ($url === '/index.php/AnotherPage') {
    include('page/another_page.inc.php');
}
I'm using $_GET['variables'] through mod_rewrite for
RewriteRule ^(.+)$ index.php?user=$1 [NC]
and a couple of other base pages. But those only cover the first argument on base files. The above if / else examples are also case-sensitive, which is really not good.
What are your thoughts on this?
How would I mod_rewrite these 2nd / 3rd etc. arguments off of the index.php?
Would that be totally SEO incompatible with the aforementioned example?
I don't fully understand your question, per se.
What do you mean by "these 2nd / 3rd etc. argument"?
You can do the same steps in a more readable/maintainable manner as follows:
$urls = array(
    '/index.php/' . $user['username'] => 'app/user/page.inc.php',
    '/index.php/' . $user['username'] . '/Upload' => 'app/user/upload.inc.php',
    '/index.php/AnotherPage' => 'page/another_page.inc.php'
);
$url = $urls[$url];
If the '.inc.php' is consistent, you can remove that from each item of the array and add it at the end:
$url = $urls[$url] . '.inc.php';
Along the same lines, you can write the array in reverse (switch the keys and values in above array) and use preg_grep to search it. This will allow you to search the url without being case sensitive, as well as allowing wildcards.
$url = key(preg_grep("#$url#i", $urls));
(# is used as the pattern delimiter here so the slashes in the URL don't terminate the pattern.)
See Here for a live interactive example.
Note that this is far less efficient, though for wildcard matches it is the best way.
(And for most pages, the inefficiency is livable.)
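Putting the reversed-array idea together, a small sketch (the paths are made up; preg_quote() is used so literal URL characters don't break the pattern, which means the stored entries are matched literally rather than as wildcards):

$pages = array(
    'app/user/page.inc.php'     => '/index.php/' . $user['username'],
    'app/user/upload.inc.php'   => '/index.php/' . $user['username'] . '/Upload',
    'page/another_page.inc.php' => '/index.php/AnotherPage',
);

// Case-insensitive match of the requested URL against the values;
// key() returns the file for the first match, or NULL if nothing matched.
$matches = preg_grep('#^' . preg_quote($url, '#') . '$#i', $pages);
$file    = key($matches);

if ($file !== null) {
    include $file;
}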