Within my application I want to avoid ID numbers in the URLs if possible, so the best way to do this would be to create a unique version of the title that's valid in a URL.
SO does much the same, but because it allows duplicate question titles, it keeps the ID within the URI:
http://stackoverflow.com/questions/3637971/how-to-edit-onchange-attribute-in-a-select-tag-using-jquery
WordPress has implemented such a feature as well.
My question is:
What's the best way to accomplish this while sticking to the URI RFC and keeping search engines happy?
The Drupal Path/Pathauto modules do this, so I'd check that implementation. For a quick hit, if there are titles that reduce to duplicates:
CaseySoftware is awesome
CaseySoftware is awesome!
They would become:
caseysoftware-is-awesome
caseysoftware-is-awesome-0
You will definitely need to scrub out punctuation, and you may want to do the same to common short words like "a", "the" and "is".
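A minimal Python sketch of that scrubbing step, using only the standard library. The names `slugify`, `drop_stop_words` and `STOP_WORDS` are illustrative assumptions and are not taken from Pathauto:

```python
import re

# Hypothetical stop-word list; extend it to suit your content.
STOP_WORDS = frozenset({"a", "the", "is"})


def slugify(title, drop_stop_words=False):
    """Lowercase the title, drop punctuation, and join the words with hyphens."""
    # Keep only runs of ASCII letters and digits; everything else is a separator.
    words = re.findall(r"[a-z0-9]+", title.lower())
    if drop_stop_words:
        # Fall back to the full word list if every word was a stop word.
        words = [w for w in words if w not in STOP_WORDS] or words
    return "-".join(words)


print(slugify("CaseySoftware is awesome!"))                        # caseysoftware-is-awesome
print(slugify("CaseySoftware is awesome!", drop_stop_words=True))  # caseysoftware-awesome
```

Dropping stop words shortens the slug, but it also makes collisions between similar titles more likely, so it is probably best left optional.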
To keep search engines happy
You should put this in your <head>:
<link rel="canonical" href="http://yoursite.com/page/uniqueTitle/"/>
This tells search engines that all pages declaring that canonical URL are the same page.
For example, this page has the following line:
<link rel="canonical" href="http://stackoverflow.com/questions/3637990/foolproof-unique-title-for-urls">
If you change the title, that value will stay the same. This is how search engines really know it's all the same page.
How to generate
As for how those URLs are generated, you should stick to lowercase alphanumeric characters ([a-z0-9]) and replace spaces with "-".
"Friendly URLs — Possibly all of what makes a good URL structure" is a nice article about that topic, and it includes a short example implementation in Python.
To make the URLs really unique without having to use a numeric ID everywhere, I'd try to generate my new URL, see if it already exists (shouldn't occur very often), and only if it does, append a short sequence number at the end.
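A rough sketch of that "check first, append a number only on collision" idea in Python. The `taken` set is an assumption standing in for a database lookup, and `unique_slug` is a hypothetical helper name:

```python
def unique_slug(base, taken):
    """Return `base` if it is free, otherwise `base-N` for the first free N."""
    if base not in taken:
        return base
    n = 0
    while f"{base}-{n}" in taken:
        n += 1
    return f"{base}-{n}"


taken = set()
for title_slug in ["caseysoftware-is-awesome", "caseysoftware-is-awesome"]:
    slug = unique_slug(title_slug, taken)
    taken.add(slug)
    print(slug)
# caseysoftware-is-awesome
# caseysoftware-is-awesome-0
```

With a real database, two requests could pick the same free slug at the same moment, so a unique index on the slug column plus a retry is a sensible safety net.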
Concerning search engine optimization, I wonder what the best practice is for writing parameters in a URL. Should I include parameter names? Does an ID value have a negative effect on search engines?
Here are some options that come to mind:
/project/pid/171/name/my_funny_name
/project/171/my_funny_name
/project/my_funny_name
A good rule is that fewer parameters are better. If you need a numerical ID, the second option is quite good. If you force the my_funny_name part to be unique, you can rely on it alone as the ID. However, keep in mind that if you change the name, the URL will break.
Also remember to avoid multiple names for the same content, like /project/171/my_funny_name and /project/171/my_old_name. Try using <link rel="canonical" href="http://example.com/project/171/my_funny_name">
In the URL, you want AT MINIMUM the keywords that people will search for to find the page in question. The IDs, in your case, should not have a negative effect.
Google SEO
Having an id value in your url won't have a negative effect, but you should definitely add words before or after your id for SEO.
What I personally like is a "logical" structure in the URL, with the ID at the beginning, like this:
/123_my-category/456_my-great-stuff
Concerning the underscore, you should use "-" instead: it is treated as a word separator, whereas "_" may help readability but is treated as if the words were joined together.
I'm really a major novice at RegEx and could do with some help.
I have a long string containing lots of URLs and other text, and one of the URLs has /find/ in it, i.e.:
1. http://www.example.com/not/index.html
2. http://www.example.com/sat/index.html
3. http://www.example.com/find/index.html
4. http://www.example.com/rat/mine.html
5. http://www.example.com/mat/find.html
What sort of RegEx would I use to return the URL that is number 3 in that list but not return me number 5 as well? I suppose basically what I'm looking for is a way of returning a whole word that contains a specific set of letters and / in order.
TIA
I would assume you want preg_match("%/find/%",$input); or similar.
EDIT: To get the full line, use:
preg_match("%^.*?/find/.*$%m",$input);
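For comparison, here is a small Python sketch that returns just the URL rather than the whole line. It assumes URLs contain no whitespace, and the sample text is the list from the question:

```python
import re

text = """\
1. http://www.example.com/not/index.html
2. http://www.example.com/sat/index.html
3. http://www.example.com/find/index.html
4. http://www.example.com/rat/mine.html
5. http://www.example.com/mat/find.html
"""

# \S* grabs the non-whitespace characters on either side of "/find/",
# so the match is the whole URL and nothing else on the line.
urls = re.findall(r"\S*/find/\S*", text)
print(urls)  # ['http://www.example.com/find/index.html']
```

Number 5 is skipped because it contains "/find." rather than "/find/".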
I'd suggest using RegExr to build regular expressions.
You can type in a sample list (like the one above) and use a palette to create a RegExp and test it in realtime. The program is available both online and as downloadable Adobe AIR package.
Unfortunately I cannot access their site now, so I'm attaching the AIR package of the downloadable version.
I really recommend it, since it helped a RegExp newbie like me design even the most complex patterns.
However, for your question, I think that just
\/find\/
works if you want a yes/no result (i.e. whether or not the text contains /find/); to obtain the full line instead, use
.*\/find\/.*
In addition to Kolink's answer, in case you wanted to regex match the whole URI:
This is by no means an exhaustive regex for URIs, but it is a good starting point. I threw in a few options at key points, like .com, .net, and .org. In reality you'll have a fairly hard time matching URIs with regular expressions because they vary so much, but you can come very close.
The regex from the above link:
/(https?:\/\/)?(www\.)?([a-zA-Z0-9-_]+)\.(com|org|net)\/(find)\/([a-zA-Z0-9-_]+)\.(html|php|aspx)?/is
I was wondering if there is any way to detect a page's genre/category.
Possibly there is a way to find keywords or something?
Unfortunately I don't have any idea so far, so I don't have any code to show you.
But if anybody has any ideas at all, let me know.
Thanks!
EDIT #Nican
Perhaps there is a way to set, let's say, 10 categories (Entertainment, Funny, Tech).
Then create keywords for these categories (Funny = Laughter, Funny, Joke, etc.).
Then search through a webpage (maybe fetched with cURL) for these keywords and assign it to the right category.
Hope that makes sense.
What you are talking about is basically what Google Adsense and similar services do, and it's based on analyzing the content of a page and matching it to topics. Generally, this kind of stuff is beyond what you would call simple programming / development and would require significant resources to be invested to get it to work "right".
A basic system might work along the following lines:
Get page content
Get X most commonly used words (omitting stuff like "and" "or" etc.)
Get words used in headings
Assign weights to different words according to a set of factors (is used in heading, is used in more than one paragraph, is used in link anchors)
Match the filtered words against a database of words related to a specific "category"
If cumulative score > threshold, classify site as belonging to category
Rinse and repeat
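The steps above can be sketched roughly as follows in Python. Everything here is an assumption made for illustration: the stop-word list, the tiny `CATEGORY_KEYWORDS` "database", the heading weight and the threshold. Fetching the page and pulling out its headings is left out.

```python
import re
from collections import Counter

STOP_WORDS = frozenset({"and", "or", "the", "a", "of", "to", "in", "is"})

# Hypothetical keyword database: category -> related words.
CATEGORY_KEYWORDS = {
    "Funny": {"laughter", "funny", "joke"},
    "Tech": {"software", "computer", "programming"},
}


def tokenize(text):
    """Lowercase words with the stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def classify(body, headings, top_n=10, heading_weight=3, threshold=3):
    """Return the categories whose cumulative keyword score exceeds the threshold."""
    # One point per occurrence, limited to the top_n most common body words.
    weights = Counter(dict(Counter(tokenize(body)).most_common(top_n)))
    # Words that appear in headings count for more.
    for word in tokenize(headings):
        weights[word] += heading_weight
    scores = {
        category: sum(weights[w] for w in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    return [category for category, score in scores.items() if score > threshold]


body = "A joke a day keeps the doctor away. This joke is funny."
print(classify(body, headings="Joke of the day"))  # ['Funny']
```

A real system would need far larger keyword lists and tuned weights, which is where most of the effort mentioned above goes.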
Folksonomy may be a way of accomplishing what you're looking for:
http://en.wikipedia.org/wiki/Folksonomy
For instance, in Drupal they have a Folksonomy module:
http://drupal.org/node/19697 (Note this module appears to be dead, see http://drupal.org/taxonomy/term/71)
Couple that with a tag cloud generator, and you may get somewhere:
http://drupal.org/project/searchcloud
Plus, a little more complexity may be able to derive mapped relationships to other terms, especially if you control the structure of the tagging options.
http://intranetblog.blogware.com/blog/_archives/2008/5/22/3707044.html
EDIT
In general, the type of system you're trying to build relies on unique word values on a page. So you would need to...
Get unique word values from your content (index values or create a bot to crawl your site)
Remove all words and symbols you can't use (at, the, or, and, etc...)
Count the number of times the unique words appear on the page
Add them to some type of datastore so you can call them based on the relationships you're mapping
If you have a root label system in place, associate those values with the word counts on the page (such as a query or derived table)
This is very general, and there are a number of ways this can be implemented/interpreted. Folksonomies are meant to "crowdsource" much of the effort for you, in a "natural way", as long as you have a user base that will contribute.
If I have a forum using tags to categorize posts, is it possible to automatically add tags according to content and titles after posts are created?
Thank you very much
The simplest way to do this would be to have a table of known tags. Iterate over each word in the post, and if the word is in the tag table, add it to the list. To make this slightly more effective, you could store each tag in both its display and stemmed versions (e.g., algorithms and algorithm). Then compare the stemmed words in the post with the stemmed tag names. See Porter's stemming algorithm for a simple way to do that (for English words).
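A small Python sketch of the tag-table idea. Note the loud assumption: `crude_stem` is a deliberately naive suffix stripper standing in for a real Porter stemmer, and `KNOWN_TAGS` is a made-up tag table.

```python
import re


def crude_stem(word):
    """Very rough stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical tag table, stored as stemmed form -> display form.
KNOWN_TAGS = ["algorithms", "sorting", "php"]
TAGS_BY_STEM = {crude_stem(tag): tag for tag in KNOWN_TAGS}


def suggest_tags(post):
    """Return the known tags whose stem matches the stem of any word in the post."""
    words = re.findall(r"[a-z0-9]+", post.lower())
    stems = {crude_stem(w) for w in words}
    return sorted(tag for stem, tag in TAGS_BY_STEM.items() if stem in stems)


print(suggest_tags("Which algorithm should I use to sort a list in PHP?"))
# ['algorithms', 'php', 'sorting']
```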
A more effective solution would be using something like TF-IDF and associate vectors with each tag. Create a vector for the new post and compare it to each tag vector using cosine similarity. Whichever tags are above a certain threshold would be added to the post. I've never used it for auto-tagging, but in my experience it is a very effective matching tool when dealing with non-spammy data. (i.e., People aren't trying to cheat or fool the system.)
Both of these methods assume that you already have some sort of tag dictionary built to start things off. You could guess at tag names by looking at which uncommon words (need a frequency table for that) are used frequently in the post.
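The TF-IDF and cosine-similarity approach can be sketched in plain Python as below. This is one possible reading of the idea, not a reference implementation: the `TAGGED_TEXT` corpus, the smoothed IDF formula and the 0.5 threshold are all assumptions chosen to keep the example small.

```python
import math
import re
from collections import Counter

# Hypothetical training data: tag -> text of posts already carrying that tag.
TAGGED_TEXT = {
    "regex": "regex pattern match string pattern regex",
    "seo": "seo url slug search engine url",
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(documents):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; terms unknown to the corpus are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


IDF = idf_table(list(TAGGED_TEXT.values()))
TAG_VECTORS = {tag: tfidf_vector(text, IDF) for tag, text in TAGGED_TEXT.items()}


def suggest_tags(post, threshold=0.5):
    """Return the tags whose vector is similar enough to the post's vector."""
    vector = tfidf_vector(post, IDF)
    return sorted(tag for tag, tv in TAG_VECTORS.items() if cosine(vector, tv) >= threshold)


print(suggest_tags("How do I match a url with a regex pattern?"))  # ['regex']
```

The threshold is the knob to tune: in this toy corpus the "seo" tag scores about 0.35 for the same post, so a lower threshold would attach both tags.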
Try this auto-tagging PHP code:
http://www.dangrossman.info/2008/04/07/auto-tagging-content-with-open-calais/
There are a number of ways to go about this. Simple keyword matching or TF-IDF, as konforce suggests, are viable options. Others include:
Use Yahoo's term extraction webservice to extract significant terms from the text.
Use the Google Prediction API. Train it on a corpus of already tagged posts, then ask it to predict the tags of new posts.
I'm developing a PHP website, and currently my links are in a facebook-ish style, like so
me.com/profile.php?id=123
I'm thinking of moving to something more friendly to crawling search engines
(like here at stackoverflow), something like:
me.com/john-adams
But how can I differentiate between two users with the same name? Or, more to the point, how does Stack Overflow tell the difference between two questions with the same title?
I was thinking of doing something like
me.com/john-adams-123
and parsing the url.
Any other recommendations?
Stackoverflow does something similar to your me.com/john-adams-123 option, except more like me.com/123/john-adams where the john-adams part actually has no programmatic meaning. The way you're proposing is slightly better because the semantic-content-free numeric ID is farther to the right in the URL.
What I would do is store a unique slug (these SEO-friendly URL components are generally called slugs) in the user table and do the number append thing when necessary to get a unique one.
In stack overflow's case, it's
http://stackoverflow.com/questions/975240/using-seo-friendly-links
http://stackoverflow.com/questions <- Constant prefix
/975240 <- Unique question id
using-seo-friendly-links <- Any text at all, defaults to title of question.
Facebook, on the other hand, has decided to just make everyone pick a unique ID. Then they are going to use that as a profile page. Something like http://facebook.com/p/username/. They are solving the problem of uniqueness between users, by just requiring it to be some string that the user picks that is unique among all existing users.
SO 'cheats' :-).
The link for your question ends in using-seo-friendly-links, but the same link with any other text after the ID also works.
The part after the number is the SEO friendly bit, but SO doesn't really care what's there. I think it defaults to the question title.
So in your case you could construct a link like:
me.com/123/john-adams
A second John Adams would have a different ID and therefore a unique URL, like:
me.com/111/john-adams
I would say that your proposed solution is better than Stack Overflow's, as it preserves content hierarchy:
me.com/john-adams-123
Usage of the unique ID before the username is simply nonsensical.
I would, however, recommend enforcement of content type:
me.com/john-adams-123.html
This will allow for consistent urls while serving a variety of content types.
Additionally, you could make use of sexatrigesimal for the unique id, to further reduce the amount of unnecessary cruft in your URL, especially for high end numbers, but this is often overkill :D
me.com/john-adams-123.html -> me.com/john-adams-3F.html
me.com/john-adams-1234567890.html -> me.com/john-adams-KF12OI.html
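A short Python sketch of the base-36 conversion behind those examples. `to_base36` is a hypothetical helper name; decoding needs no helper because the built-in `int` accepts a base argument.

```python
import string

DIGITS = string.digits + string.ascii_uppercase  # "0"-"9" then "A"-"Z": 36 symbols


def to_base36(number):
    """Encode a non-negative integer using the digits 0-9 and A-Z."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return "0"
    encoded = ""
    while number:
        number, remainder = divmod(number, 36)
        encoded = DIGITS[remainder] + encoded
    return encoded


print(to_base36(123))         # 3F
print(to_base36(1234567890))  # KF12OI
print(int("KF12OI", 36))      # 1234567890
```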
Finally, be sure to utilize 301 redirects on non-conforming accessible URIs to redirect to the "correct" seo-friendly schema to prevent duplicate content penalties.
I'd go with your style of me.com/john-adams-123, because I think the leftmost part of the URI has more importance in SEO ranking.
Actually, if you are willing to use this on several controllers (not just user profile), you may want to do it more like me.com/john-adams-profile-123 with a rewriting rule redirecting /.+-profile-(\d+) to profile.php?uid=$1 and still be able to use, say, me.com/john-adams-articles-123 for this user's articles...
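To illustrate how such a rewriting rule picks the URL apart, here is a Python sketch of the same pattern. The `ROUTES` table and the `articles.php` script name are assumptions; a real site would express this in its web server's rewrite configuration instead.

```python
import re

# Hypothetical routing table: controller keyword in the URL -> script that handles it.
ROUTES = {"profile": "profile.php", "articles": "articles.php"}
PATTERN = re.compile(r"/.+-(profile|articles)-(\d+)")


def rewrite(path):
    """Map a friendly path to the internal script URL, or None if it does not match."""
    match = PATTERN.fullmatch(path)
    if match is None:
        return None
    controller, uid = match.groups()
    return f"{ROUTES[controller]}?uid={uid}"


print(rewrite("/john-adams-profile-123"))   # profile.php?uid=123
print(rewrite("/john-adams-articles-123"))  # articles.php?uid=123
print(rewrite("/about"))                    # None
```

Only the trailing number is used for the lookup, so the name part can change without breaking old links.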
To avoid dealing with links that contain special characters, you can use this plugin for Zend Framework:
https://github.com/btlagutoli/CharConvert
$filter2 = new Zag_Filter_CharConvert(array(
    'onlyAlnum' => true,
    'replaceWhiteSpace' => '-'
));
echo $filter2->filter('éééé ááááá ? 90 :'); // eeee-aaaaa-90
This can help you deal with strings in other languages.
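Outside Zend Framework, a similar accent-folding effect can be approximated with Unicode normalization. The Python sketch below is an assumption about the general technique, not a port of the plugin, and `ascii_slug` is a hypothetical name:

```python
import re
import unicodedata


def ascii_slug(text):
    """Fold accented Latin letters to ASCII and join the remaining words with hyphens."""
    # NFKD splits "é" into "e" plus a combining accent; the ASCII encode drops the accent.
    folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return "-".join(re.findall(r"[A-Za-z0-9]+", folded))


print(ascii_slug("éééé ááááá ? 90 :"))  # eeee-aaaaa-90
```

Characters with no ASCII decomposition (for example "ß", or non-Latin scripts) are simply dropped by this approach, so a proper transliteration table is needed for those.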