Dynamic hierarchical pages and SEO

Cheers everyone!
Please bear with me, I really did do some research on this, but I couldn't come to a final solution, hence I'm here to hear your opinions.
What I want to build is a small i18n-CMS with dynamic hierarchical pages such as:
domain.tld/en/I/am/a/path
I want to find the least performance-intensive way to get beautiful, SEO- and human-friendly URLs.
I use a closure table, so there are two tables in the database, one for the page nodes and one for the path tree, plus another table for the localised pages, each of which references a certain page node (three tables in total).
My different solutions so far:
Sure I could make an algorithm, that goes through all the different request segments and checks if there is an English "path" under an "a" under an "am" under an "I", but this seems very unwise considering a multitude of page-hits.
Or is it?
Positive: I wouldn't need to save the path anywhere, because it would be calculated. So moving pages around wouldn't need to recalculate the path and save it again.
I could simply save the whole path to the database, as VARCHAR(2000) or something and then just check if there is a page with path "I/am/a/path" in English language and get that one.
This seems to be rather messy.
As I do it now. Currently I add an "ID" at the end of my path. Such as:
domain.tld/en/I/am/a/path.1
So if you enter "domain.tld/en.1" you get forwarded to the one with the right slug. But here again I need to save the slug to the database, for each single page.
Also, I would love to get rid of the ID (could I do this with mod_rewrite and .htaccess?)
Any more insights on this one? I'm not a web developer, so I'm not really sure regarding performance.
Kindest regards,
Meren

It seems to me that page requests will happen a million times more often than an editor changing a page address, so I would definitely go with the save-to-db option. What you can do is create an extra field in which you save the 'slug' for that page; in combination with .htaccess you can then rewrite requests for the 'slug' addresses to the right page. For example, in http://www.fuuu.com/futest-fu , 'futest-fu' is a slug which could be rewritten to an ID number (or anything you want it to be). Among others, WordPress works this way. Check out this discussion for some insights: http://wordpress.org/support/topic/where-are-the-permalinks-slug-stored-in-the-database
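A minimal sketch of the save-to-db idea, assuming the full path is stored per language and can be resolved with one keyed lookup. The `$pagesByPath` array is a hypothetical stand-in for the localised-page table (in a real CMS it would be an indexed column queried with a single SELECT), and `resolvePage()` is an illustrative name, not part of any framework.

```php
<?php
// Hypothetical stand-in for the localised-page table:
// language => [stored full path => page node id].
$pagesByPath = [
    'en' => ['I/am/a/path' => 1, 'about' => 2],
    'de' => ['ich/bin/ein/pfad' => 1],
];

/**
 * Split a request URI such as "/en/I/am/a/path" into language and path,
 * then resolve it with one lookup. Returns the page node id or null.
 */
function resolvePage(string $requestUri, array $pagesByPath): ?int
{
    // Drop any query string and the surrounding slashes.
    $path = trim((string) parse_url($requestUri, PHP_URL_PATH), '/');
    if ($path === '') {
        return null;
    }

    // The first segment is the language, the rest is the stored path.
    $parts = explode('/', $path, 2);
    $lang = $parts[0];
    $slugPath = $parts[1] ?? '';

    return $pagesByPath[$lang][$slugPath] ?? null;
}

var_dump(resolvePage('/en/I/am/a/path', $pagesByPath));
```

The cost per request is one lookup regardless of depth; the price is that moving a page means rewriting the stored paths of that page and its descendants.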

Related

How to compare strings based on character similarity in SQL?

I'm working on redirecting people if they type a "not really wrong url".
For example I have a good URL http://www.website.com/category/foo-bar-if-bar-foo/.
This one works, so if a user enters my website with it, I can retrieve the corresponding article.
But if someone enters my website with a not-really-wrong URL like http://www.website.com/category/foo-bar-foo/ because another website has referenced a wrong URL, I should redirect them to the right one instead of returning a 404 status code...
So how should I do this? And, most importantly, should I do this?
I'm currently using Eloquent with Laravel 4.2.
Thank you in advance.
EDIT
I was wrong about stackoverflow, thanks for your comment. It uses the unique ID of a post.
EDIT 2
I looked at the SOUNDEX function in SQL. It's really good if there is a small difference, like a character or two missing. But if my URL is as broken as in my example, it obviously doesn't work anymore. Thanks anyway, it's going to be useful.

Just thinking off the top of my head, you could create a SQL table (with Full-Text indexing enabled) containing all your paths (it might already exist).
In the event that a 404 is triggered, hijack that and do a MATCH (Full Text Search) and return the path with the highest scoring MATCH (you can also consider using a score threshold to prevent nonsensical matches).
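To illustrate the "best match above a threshold" idea without a database, here is a rough sketch that scores slugs by shared words in plain PHP. Note that this swaps in a simple word-overlap (Jaccard) score; it is not MySQL's full-text MATCH scoring, and the function names and the 0.5 threshold are assumptions for illustration only.

```php
<?php
/** Split a slug like "foo-bar-if-bar-foo" into its unique lowercase words. */
function slugWords(string $slug): array
{
    return array_unique(array_filter(explode('-', strtolower($slug)), 'strlen'));
}

/** Jaccard similarity of two slugs: shared words / all distinct words (0.0 to 1.0). */
function slugScore(string $requested, string $known): float
{
    $a = slugWords($requested);
    $b = slugWords($known);
    $union = array_unique(array_merge($a, $b));
    if (count($union) === 0) {
        return 0.0;
    }
    return count(array_intersect($a, $b)) / count($union);
}

/** Return the best-scoring known slug, or null if nothing reaches the threshold. */
function closestSlug(string $requested, array $knownSlugs, float $threshold = 0.5): ?string
{
    $best = null;
    $bestScore = 0.0;
    foreach ($knownSlugs as $known) {
        $score = slugScore($requested, $known);
        if ($score > $bestScore) {
            $best = $known;
            $bestScore = $score;
        }
    }
    return $bestScore >= $threshold ? $best : null;
}

var_dump(closestSlug('foo-bar-foo', ['foo-bar-if-bar-foo', 'something-else']));
```

When a match is found, the 404 handler would answer with a redirect to it; when the function returns null, the normal 404 page is the safer response.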

How to properly store images (with numeric names) in server's filesystem

I have a lot of records in the database, and each record will have an image, I'm pretty confused about how to store the images.
I want the access route to be something like /img/record-id.jpg (i.e. /img/15178.jpg).
Alright, storing all images inside /img/ isn't a good thing, because there will be many.
In this question it is suggested to reverse the name of the image, so the example above would be stored under /img/78/51/15178.jpg. The suggestion won't give further info (and for me it's not obvious) about other scenarios. What will happen (this is asked in the last comment for the answer) if the id is a low number like 5, 15, 128, 1517?
Leaving that aside, let's remember I want the path to be /img/15178.jpg. I'd redirect the request using Apache, but for that I'd have to type at least 3 or more rules for different id numbers:
^/img/(\d)(\.jpg)$ /img/$1$2
^/img/(\d\d)(\.jpg)$ /img/$1/$1$2
^/img/(\d\d)(\d\d)(\.jpg)$ /img/$1/$2/$1$2$3
And so on?
This doesn't seem to be a nice solution, although it would work just fine.
Another option I can think of: take the MD5 of the image, store it in its respective record, redirect the request to a PHP script and let it take care of the output.
The script would look up the MD5 for the id in the database, build the actual route out of the hash and output the image. This solution is neater, but it involves the database and PHP just to output an image, which sounds like a little too much.
I really don't know what to do here. Mind giving me some advice?

You have already written the perfect answer! Professionals do it exactly like you (or the guy in the linked question) say: build a deep directory structure that fits your needs. I have done this with 16 million pictures, and it worked perfectly.
I did it like this:
/firstCharacter/secondCharacter/...
Files with short names, like 5.jpg, will be in /5/5.jpg
EDIT: to keep performance at its best, I'm totally against any further PHP actions, like salt, md5, etc. Keep it straight and simple.
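The /firstCharacter/secondCharacter/... layout could be computed with a small helper like the one below. This is only a sketch: the function name, the default depth of 2 and the /img root are assumptions, and it expects positive integer ids.

```php
<?php
/**
 * Build the storage path for an image id, using one directory per leading
 * character of the id. Ids shorter than $depth simply get fewer levels.
 */
function shardedImagePath(int $id, int $depth = 2, string $root = '/img'): string
{
    $name = (string) $id;
    // e.g. "15178" with depth 2 gives ["1", "5"]; "5" gives ["5"].
    $dirs = str_split(substr($name, 0, $depth));
    return $root . '/' . implode('/', $dirs) . '/' . $name . '.jpg';
}

echo shardedImagePath(15178), "\n"; // /img/1/5/15178.jpg
echo shardedImagePath(5), "\n";     // /img/5/5.jpg
```

Because the path is derived from the id alone, the same helper can be used when saving a file and when building its URL, with no database or hash lookup involved.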

PHP Detect Pages Genre/Category

I was wondering if there was any sort of way to detect a page's genre/category.
Possibly there is a way to find keywords or something?
Unfortunately I don't have any idea so far, so I don't have any code to show you.
But if anybody has any ideas at all, let me know.
Thanks!
EDIT @Nican
Perhaps there is a way to set, let's say, 10 categories (Entertainment, Funny, Tech).
Then creating keywords for these categories (Funny = Laughter, Funny, Joke etc).
Then searching through a webpage (maybe using cURL) for these keywords and assigning it to the right category.
Hope that makes sense.

What you are talking about is basically what Google Adsense and similar services do, and it's based on analyzing the content of a page and matching it to topics. Generally, this kind of stuff is beyond what you would call simple programming / development and would require significant resources to be invested to get it to work "right".
A basic system might work along the following lines:
Get page content
Get X most commonly used words (omitting stuff like "and" "or" etc.)
Get words used in headings
Assign weights to different words according to a set of factors (is used in heading, is used in more than one paragraph, is used in link anchors)
Match the filtered words against a database of words related to a specific "category"
If cumulative score > threshold, classify site as belonging to category
Rinse and repeat
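The steps above can be sketched in a few lines of PHP. Everything here is an assumption for illustration: the keyword lists, the stop words, the weight of 2 for heading words and the threshold of 3 are made up, and fetching the page (for example with cURL) is left out so the function just takes an HTML string.

```php
<?php
// Hypothetical keyword lists per category; a real system would need far larger ones.
$categories = [
    'Funny' => ['laughter', 'funny', 'joke'],
    'Tech'  => ['php', 'server', 'database'],
];
$stopWords = ['and', 'or', 'the', 'a', 'is', 'of', 'to'];

/** Return the best-scoring category name, or null if no score reaches the threshold. */
function classifyPage(string $html, array $categories, array $stopWords, int $threshold = 3): ?string
{
    if (!$categories) {
        return null;
    }
    $scores = array_fill_keys(array_keys($categories), 0);

    // Collect heading text separately so it can be weighted higher.
    preg_match_all('/<h[1-6][^>]*>(.*?)<\/h[1-6]>/is', $html, $m);
    $headings = implode(' ', $m[1]);

    // Every word counts 1 as page text; heading words earn 2 extra points.
    $weighted = [[$html, 1], [$headings, 2]];
    foreach ($weighted as [$markup, $weight]) {
        $text = strtolower(preg_replace('/<[^>]+>/', ' ', $markup));
        foreach (str_word_count($text, 1) as $word) {
            if (in_array($word, $stopWords, true)) {
                continue;
            }
            foreach ($categories as $name => $keywords) {
                if (in_array($word, $keywords, true)) {
                    $scores[$name] += $weight;
                }
            }
        }
    }

    arsort($scores); // highest score first
    $best = array_key_first($scores);
    return $scores[$best] >= $threshold ? $best : null;
}

$html = '<h1>Best joke ever</h1><p>A funny joke about a database.</p>';
var_dump(classifyPage($html, $categories, $stopWords));
```

Even this toy version shows where the real effort goes: the quality of the result depends almost entirely on the keyword lists and the weights, not on the code.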

Folksonomy may be a way of accomplishing what you're looking for:
http://en.wikipedia.org/wiki/Folksonomy
For instance, in Drupal they have a Folksonomy module:
http://drupal.org/node/19697 (Note this module appears to be dead, see http://drupal.org/taxonomy/term/71)
Couple that with a tag cloud generator, and you may get somewhere:
http://drupal.org/project/searchcloud
Plus, a little more complexity may be able to derive mapped relationships to other terms, especially if you control the structure of the tagging options.
http://intranetblog.blogware.com/blog/_archives/2008/5/22/3707044.html
EDIT
In general, the type of system you're trying to build relies on unique word values on a page. So you would need to...
Get unique word values from your content (index values or create a bot to crawl your site)
Remove all words and symbols you can't use (at, the, or, and, etc...)
Count the number of times the unique words appear on the page
Add them to some type of datastore so you can call them based on the relationships you're mapping
If you have a root label system in place, associate those values with the word counts on the page (such as a query or derived table)
This is very general, and there are a number of ways this can be implemented/interpreted. Folksonomies are meant to "crowdsource" much of the effort for you, in a "natural way", as long as you have a user base that will contribute.

How can I have Code Igniter URI segments for multiple variables?

I'm writing an app that allows you to filter database results based on Location and Category.
If someone was to search for Liverpool under the Golf category the URI would be /index.php/search/Liverpool/Golf.
Should someone want to search by Location but not category, they would be sent to /index.php/search/Liverpool
However, should someone want to filter only by category they would be unable to use /index.php/search/Golf because that would be caught by the location search.
Is there a best practice way to have /index.php/search/Golf be recognised? Some best practice as to what else to add to the URI to make these two queries distinct? /index.php/search/category/Golf perhaps?
Though that is beginning to show characteristics of /index.php?search&category=Golf which is exactly what I'm trying to avoid.

Try using $this->uri->uri_to_assoc(n)
described here http://codeigniter.com/user_guide/libraries/uri.html (half way down on page)
Basically, you structure your URL like this:
mysite.com/index.php/search/location/liverpool/category/golf
NOTE: the parameters are optional, so you don't have to have both in there all the time. You can just as well do
mysite.com/index.php/search/location/liverpool/
and
mysite.com/index.php/search/category/golf
This way it will return FALSE if the element you are looking for does not exist.
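To show what the key/value URL scheme amounts to, here is a plain-PHP sketch that turns such a path into an associative array. It is not CodeIgniter's implementation of uri_to_assoc(); the function name and the "number of leading segments to skip" parameter are assumptions made for this example.

```php
<?php
/**
 * Turn "/search/location/liverpool/category/golf" into
 * ['location' => 'liverpool', 'category' => 'golf'].
 * $skip is the number of leading segments (controller, method, ...) to ignore.
 * A key without a value is mapped to false.
 */
function segmentsToAssoc(string $path, int $skip): array
{
    $segments = array_values(array_filter(explode('/', trim($path, '/')), 'strlen'));
    $pairs = array_slice($segments, $skip);

    $assoc = [];
    for ($i = 0; $i < count($pairs); $i += 2) {
        $assoc[$pairs[$i]] = $pairs[$i + 1] ?? false;
    }
    return $assoc;
}

$params = segmentsToAssoc('/search/category/golf', 1);
$location = $params['location'] ?? false; // false: no location filter given
$category = $params['category'] ?? false; // "golf"
var_dump($location, $category);
```

Since each value is found by its key rather than by its position, /search/category/golf can no longer be mistaken for a location search.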

It would probably be best to keep your URI segments relevant no matter what they are searching for.
index.php/LOCATION/CATEGORY
If they are not interested in a location then pass a filler to the system:
index.php/anywhere/golf
Then in your code you just check for that specific string, "anywhere", to determine if they only want to see the activity. I assume that you are going to be directing them with either links or forms (and that they aren't typing the URI string themselves), so you should be safe in just passing information that you expect and testing against that.

I use the format suggested by Tom above and then do something along the lines of below to determine the value of the parameters.
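$segment_array = $this->uri->segment_array();
// array_search() returns the segment number of 'location', or FALSE if it is absent
$is_location_searched = array_search('location', $segment_array);
if ($is_location_searched !== FALSE && $this->uri->segment($is_location_searched + 1))
{
    // the value is the segment right after the 'location' key
    $location = $this->uri->segment($is_location_searched + 1);
}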

Have a look at http://lucenebook.com/#/p:solr/s:wiki and click around a bit on the left-hand navigation. Pay close attention to what happens in the url when you do. I really like this scheme for many reasons.
It's SEO-friendly.
"Curious" people can mix/match the urls and it still resolves to a proper search.
It just looks good!
Of course, the trick is really in the code, in how you build the thing. It took me a few weeks to sort it out, but I finally have my own version of that site. Just not ajax based, because I like search engines better than ajax. Ajax don't pay the bills.

Using SEO-friendly links

I'm developing a PHP website, and currently my links are in a facebook-ish style, like so
me.com/profile.php?id=123
I'm thinking of moving to something more friendly to crawling search engines
(like here at stackoverflow), something like:
me.com/john-adams
But how can I differentiate between two users with the same name, or more correctly, how does stackoverflow tell the difference between two questions with the same title?
I was thinking of doing something like
me.com/john-adams-123
and parsing the url.
Any other recommendations?

Stackoverflow does something similar to your me.com/john-adams-123 option, except more like me.com/123/john-adams where the john-adams part actually has no programmatic meaning. The way you're proposing is slightly better because the semantic-content-free numeric ID is farther to the right in the URL.
What I would do is store a unique slug (these SEO-friendly URL components are generally called slugs) in the user table and do the number append thing when necessary to get a unique one.
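A small sketch of the "append a number when necessary" idea. The `$taken` array stands in for a lookup against the slug column of the user table, and both function names are made up for this example; a unique index on that column would still be needed to guard against two sign-ups racing for the same slug.

```php
<?php
/** Lowercase the name and collapse everything except a-z and 0-9 into single dashes. */
function slugify(string $name): string
{
    $slug = preg_replace('/[^a-z0-9]+/', '-', strtolower(trim($name)));
    return trim($slug, '-');
}

/**
 * Return a slug for $name that is not in $taken, appending -2, -3, ...
 * only when the plain slug is already in use.
 */
function uniqueSlug(string $name, array $taken): string
{
    $base = slugify($name);
    $slug = $base;
    $n = 2;
    while (in_array($slug, $taken, true)) {
        $slug = $base . '-' . $n++;
    }
    return $slug;
}

echo uniqueSlug('John Adams', []), "\n";             // john-adams
echo uniqueSlug('John Adams', ['john-adams']), "\n"; // john-adams-2
```

With this approach the first John Adams gets the clean URL and only later ones carry a suffix.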

In stack overflow's case, it's
http://stackoverflow.com/questions/975240/using-seo-friendly-links
http://stackoverflow.com/questions <- Constant prefix
/975240 <- Unique question id
using-seo-friendly-links <- Any text at all, defaults to title of question.
Facebook, on the other hand, has decided to just make everyone pick a unique ID. Then they are going to use that as a profile page. Something like http://facebook.com/p/username/. They are solving the problem of uniqueness between users, by just requiring it to be some string that the user picks that is unique among all existing users.

SO 'cheats' :-).
The link for your question is http://stackoverflow.com/questions/975240/using-seo-friendly-links, but the same link with different text after the number also works.
The part after the number is the SEO friendly bit, but SO doesn't really care what's there. I think it defaults to the question title.
So in your case you could construct a link like:
me.com/123/john-adams
A second John Adams would have a different ID and a unique URL like:
me.com/111/john-adams
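As a sketch of how such an /id/slug URL could be handled: only the number is used for the lookup, and the text after it is ignored. The `$users` array is a hypothetical stand-in for the user table, `routeProfile()` is an illustrative name, and redirecting to the stored slug when the text differs is an extra assumption of this example rather than something described above.

```php
<?php
// Hypothetical user records: id => stored slug.
$users = [123 => 'john-adams', 111 => 'john-adams'];

/**
 * Resolve "/123/john-adams". Only the id decides which user is shown;
 * a wrong or missing slug leads to a 301 to the canonical URL.
 */
function routeProfile(string $path, array $users): array
{
    if (!preg_match('#^/(\d+)(?:/([^/]*))?/?$#', $path, $m)) {
        return ['status' => 404];
    }
    $id = (int) $m[1];
    if (!isset($users[$id])) {
        return ['status' => 404];
    }

    $canonical = '/' . $id . '/' . $users[$id];
    if ($path !== $canonical) {
        return ['status' => 301, 'location' => $canonical];
    }
    return ['status' => 200, 'id' => $id];
}

print_r(routeProfile('/123/anything-at-all', $users));
```

Two users with the same name never collide here, because the slug is decoration and the id carries the identity.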

I would say that your proposed solution is better than stackoverflow's, as it preserves content hierarchy:
me.com/john-adams-123
Usage of the unique ID before the username is simply nonsensical.
I would, however, recommend enforcement of content type:
me.com/john-adams-123.html
This will allow for consistent urls while serving a variety of content types.
Additionally, you could make use of sexatrigesimal for the unique id, to further reduce the amount of unnecessary cruft in your URL, especially for high end numbers, but this is often overkill :D
me.com/john-adams-123.html -> me.com/john-adams-3F.html
me.com/john-adams-1234567890.html -> me.com/john-adams-KF12OI.html
Finally, be sure to utilize 301 redirects on non-conforming accessible URIs to redirect to the "correct" seo-friendly schema to prevent duplicate content penalties.
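The base-36 shortening shown above can be done with PHP's built-in base_convert(). This is a minimal sketch; the two wrapper names are made up, and note that base_convert() can lose precision on very large numbers, so it is only suitable for ids of ordinary size.

```php
<?php
/** Encode a numeric id in base 36 (digits 0-9 and letters A-Z). */
function encodeId(int $id): string
{
    return strtoupper(base_convert((string) $id, 10, 36));
}

/** Decode a base-36 string back into the numeric id. */
function decodeId(string $code): int
{
    return (int) base_convert(strtolower($code), 36, 10);
}

echo encodeId(123), "\n";        // 3F
echo encodeId(1234567890), "\n"; // KF12OI
echo decodeId('KF12OI'), "\n";   // 1234567890
```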

I'd go with your style of me.com/john-adams-123, because I think the leftmost part of the URI has more importance in SEO ranking.
Actually, if you are willing to use this on several controllers (not just user profile), you may want to do it more like me.com/john-adams-profile-123 with a rewriting rule redirecting /.+-profile-(\d+) to profile.php?uid=$1 and still be able to use, say, me.com/john-adams-articles-123 for this user's articles...
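If the rewriting is done in PHP rather than in the web server, the same pattern could look roughly like this. The function name is an assumption, and `articles.php` is a hypothetical second script that mirrors `profile.php`.

```php
<?php
/**
 * Map "/john-adams-profile-123" to the script and user id it stands for.
 * Returns null when the path does not follow the name-controller-id scheme.
 */
function routeFriendlyUrl(string $path): ?array
{
    // Anything, then "-profile-" or "-articles-", then the numeric id.
    if (preg_match('#^/.+-(profile|articles)-(\d+)$#', $path, $m)) {
        return ['script' => $m[1] . '.php', 'uid' => (int) $m[2]];
    }
    return null;
}

print_r(routeFriendlyUrl('/john-adams-profile-123'));
print_r(routeFriendlyUrl('/john-adams-articles-123'));
```

As in the rewrite rule, the name part is never inspected; only the controller keyword and the trailing id are used.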

To avoid having to deal with links that contain special characters, you can use this plugin for Zend Framework.
https://github.com/btlagutoli/CharConvert
$filter2 = new Zag_Filter_CharConvert(array(
'onlyAlnum' => true,
'replaceWhiteSpace' => '-'
));
echo $filter2->filter('éééé ááááá ? 90 :'); // eeee-aaaaa-90
This can help you deal with strings in other languages.
