CakePHP urls with unique ids - php

I have seen URLs such as this on some CakePHP websites: http://sample.com/posts/WordPress_get_URL_based_on_page_post_name-O8C
What would the O8C part be? In my current setup I pass the title and ID in the URL to give each item a nice URL, e.g. http://driz.co.uk/cake/portfolio/view/NA_Software-4, but my ID is just a number. How would I change it to get a 3-character ID that mixes numbers and letters?
Thanks

I guess the tiny number is just a short slug.
If you already use integers for your records, I don't see the point of adding extra overhead to create tiny slugs. Also, the tiny slug won't always have 3 characters once you have a decent number of records. Tiny slugs make the most sense if you need a short URL, for example in emails, on Twitter, and in similar use cases.
However, if you want to use them, the CakeDC Utils plugin comes with a TinySluggable behavior.
https://github.com/CakeDC/utils
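As a rough illustration of how an integer ID can become a short mixed letter/number string, here is a minimal Python sketch of base-62 encoding. It is not the TinySluggable implementation; the alphabet order and the function names encode_id and decode_id are assumptions made for this example.

```python
import string

# Assumed alphabet: digits, then upper-case, then lower-case letters (62 symbols).
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
BASE = len(ALPHABET)


def encode_id(number: int) -> str:
    """Turn a non-negative integer ID into a short alphanumeric string."""
    if number < 0:
        raise ValueError("ID must be non-negative")
    if number == 0:
        return ALPHABET[0]
    chars = []
    while number:
        number, remainder = divmod(number, BASE)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode_id(text: str) -> int:
    """Reverse encode_id; raises ValueError for characters outside ALPHABET."""
    number = 0
    for char in text:
        number = number * BASE + ALPHABET.index(char)
    return number
```

With this alphabet, three characters cover IDs 0 to 238327; the next ID needs a fourth character, which is why the slug length cannot stay fixed as the table grows.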

Related

Filtering within REST api

We're currently in the process of building a RESTful API. Now, it's a matter of what the best way of tackling filtering is.
We have /products. /products returns all given products you have access to. Now, let's say you want the products where the description matches exactly 'No description'. You'd get /products?description=No+description.
Now, ideally we would have more filter options. Show only products where the stock is more than or equal to 1, but less than 10. Show only products where the name ends in black, or starts with white. What is the best practice of doing this? Would we use logical operators in the URL, how would we escape wild cards?
Current state of affairs is:
/products?product_name=%25black will find all products with names ending in black.
or
/products?product_name=white%25 will find all products with names starting with white.
%25 is the encoded form of %. So far so good.
But what if someone wants to find a product where the name matches the literal % character? Or wants to find products with stock? Would it be best to introduce min_stock and max_stock, or is it possible (or do we even want to?) to use logical operators (?stock=>=1&stock=<=5). Is there a standard for handling URLs or situations like this?
Are we overthinking? Is it possible? Should we not do filtering our end, but let users figure it out themselves?
The REST paradigm is about resources (everything you access is a resource) and human readability. That's why you make your listing URL plural.
That said, if you want to filter in several different ways (with =, LIKE, regex...), I think you have two possibilities:
First, create three different filters: product_name_exact, product_name_like, product_name_regex. This resembles the Python/Django way of filtering, and it's quite elegant.
Second, create one query field plus a query_mode field; this is roughly how the Bing API works.
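To make the first option concrete, here is a minimal Python sketch that splits suffixed parameter names into a field, an operator and a value. The operator list (including min and max for the stock range from the question) and the function name parse_filters are assumptions for illustration, not an established standard.

```python
from urllib.parse import parse_qsl

# Hypothetical operator suffixes; the names are illustrative only.
OPERATORS = ("exact", "like", "regex", "min", "max")


def parse_filters(query_string: str) -> list:
    """Split 'field_operator=value' pairs into (field, operator, value) tuples."""
    filters = []
    for key, value in parse_qsl(query_string):
        field, _, suffix = key.rpartition("_")
        if field and suffix in OPERATORS:
            filters.append((field, suffix, value))
        else:
            # No recognised suffix: treat the whole key as an exact match.
            filters.append((key, "exact", value))
    return filters
```

Because the operator lives in the parameter name, a % inside an exact filter value is just a literal character, and only the like operator needs to define its own wildcard and escape rules.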

Using MySQL LIKE to match a whole string

I have been doing a bit of searching round StackOverflow and the Interweb and I have not had much luck.
I have a URL which looks like this...
nr/online-marketing/week-in-review-mobile-google-and-facebook-grab-headlines
I am getting the article name from the URL and replacing the '-' with ' ' to give me:
week in review mobile google and facebook grab headlines
At this point this is all the information I have on the article, so I need to use it to query the database for the rest of the article information. The problem is that this string does not match the actual headline of the article; in this instance the actual headline is:
Week in review: Mobile, Google+ and Facebook grab headlines
As you can see, it includes extra punctuation, so I need to find a way of using MySQL LIKE to match the article.
Hope someone can help. A standard SELECT * FROM table WHERE field LIKE $name does not work. I'm hoping to find a way of doing it without splitting up each individual word, but if that's what it comes down to then so be it!
Thanks.
Try MySQL MyISAM engine's full-text search. In your case the query will be:
SELECT * FROM table
WHERE MATCH (title) AGAINST ('week in review mobile google and facebook grab headlines');
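That requires a FULLTEXT index on title, which on the MySQL version documented below means converting the table to MyISAM (InnoDB gained FULLTEXT support in MySQL 5.6). Also, depending on the size of the table, test the performance of the query.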
See more info under:
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
This really seems more like a database design issue. If you're using long text fields as a form of primary key, it could lead to duplicates or synchronization problems.
One potential solution is to give each entry a unique identifier (perhaps an int, or a uniqueidentifier field if MySQL supports that), and use that field to map the actual headline to the URL.
Another potential solution is to create a table that associates each headline with its URL and use that table for lookups. This will incur a little extra overhead, but will ensure that special characters in the title never affect the lookup process.
As for a way to do this with your current design, you may be able to do some kind of regular expression search by tokenizing each word individually and then searching for an entry that includes all tokens, but I'm fairly certain that MySQL doesn't provide this functionality in a basic command.
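The lookup-table idea can be sketched as follows: derive the URL part from the headline once, store it, and later look the article up by that stored value instead of matching against the headline. This is a minimal Python sketch; slugify, store_article and the in-memory dictionary are hypothetical stand-ins for an indexed slug column.

```python
import re


def slugify(headline: str) -> str:
    """Lower-case the headline and collapse every non-alphanumeric run to '-'."""
    return re.sub(r"[^a-z0-9]+", "-", headline.lower()).strip("-")


# Hypothetical in-memory stand-in for a table with an indexed `slug` column.
articles_by_slug = {}


def store_article(article_id: int, headline: str) -> str:
    """Save the article under its slug and return the slug for use in URLs."""
    slug = slugify(headline)
    articles_by_slug[slug] = (article_id, headline)
    return slug
```

The punctuation is dropped only when the slug is generated, so the lookup becomes an exact match on the slug and the original headline stays untouched.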

PHP Detect Pages Genre/Category

I was wondering if there is any way to detect a page's genre/category.
Possibly there is a way to find keywords or something?
Unfortunately I don't have any idea so far, so I don't have any code to show you.
But if anybody has any ideas at all, let me know.
Thanks!
EDIT #Nican
Perhaps there is a way to set up, let's say, 10 categories (Entertainment, Funny, Tech).
Then create keywords for these categories (Funny = Laughter, Funny, Joke, etc.).
Then search through a web page (maybe using cURL) for these keywords and assign it to the right category.
Hope that makes sense.
What you are talking about is basically what Google Adsense and similar services do, and it's based on analyzing the content of a page and matching it to topics. Generally, this kind of stuff is beyond what you would call simple programming / development and would require significant resources to be invested to get it to work "right".
A basic system might work along the following lines:
Get page content
Get X most commonly used words (omitting stuff like "and" "or" etc.)
Get words used in headings
Assign weights to different words according to a set of factors (is used in heading, is used in more than one paragraph, is used in link anchors)
Match the filtered words against a database of words related to a specific "category"
If cumulative score > threshold, classify site as belonging to category
Rinse and repeat
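The steps above can be sketched in a few lines of Python. The stop words, the category keywords, the heading weight and the threshold are all made-up values for illustration; a real system would need far larger word lists and tuning.

```python
import re
from collections import Counter

# Hypothetical stop words and category keywords, for illustration only.
STOP_WORDS = {"and", "or", "the", "a", "of", "to", "in", "is"}
CATEGORY_KEYWORDS = {
    "Funny": {"laughter", "funny", "joke"},
    "Tech": {"software", "computer", "php"},
}


def classify(body: str, headings: str = "", threshold: int = 3,
             heading_weight: int = 3) -> list:
    """Return the categories whose weighted keyword score exceeds the threshold."""
    def words(text):
        return [w for w in re.findall(r"[a-z]+", text.lower())
                if w not in STOP_WORDS]

    scores = Counter(words(body))
    for word in words(headings):
        scores[word] += heading_weight  # words used in headings count extra

    matched = []
    for category, keywords in CATEGORY_KEYWORDS.items():
        # Counter returns 0 for words that never appeared on the page.
        if sum(scores[keyword] for keyword in keywords) > threshold:
            matched.append(category)
    return matched
```

Fetching the page and separating headings from body text is left out here; the function only covers the counting, weighting and matching steps.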
Folksonomy may be a way of accomplishing what you're looking for:
http://en.wikipedia.org/wiki/Folksonomy
For instance, in Drupal they have a Folksonomy module:
http://drupal.org/node/19697 (Note this module appears to be dead, see http://drupal.org/taxonomy/term/71)
Couple that with a tag cloud generator, and you may get somewhere:
http://drupal.org/project/searchcloud
Plus, a little more complexity may be able to derive mapped relationships to other terms, especially if you control the structure of the tagging options.
http://intranetblog.blogware.com/blog/_archives/2008/5/22/3707044.html
EDIT
In general, the type of system you're trying to build relies on unique word values on a page. So you would need to...
Get unique word values from your content (index values or create a bot to crawl your site)
Remove all words and symbols you can't use (at, the, or, and, etc...)
Count the number of times the unique words appear on the page
Add them to some type of datastore so you can call them based on the relationships you're mapping
If you have a root label system in place, associate those values with the word counts on the page (such as a query or derived table)
This is very general, and there are a number of ways this can be implemented/interpreted. Folksonomies are meant to "crowdsource" much of the effort for you, in a "natural way", as long as you have a user base that will contribute.

Find duplicate content using MySQL and PHP

I am facing a problem on developing my web app, here is the description:
This webapp (still in alpha) is based on user generated content (usually short articles although their length can become quite large, about one quarter of screen), every user submits at least 10 of these articles, so the number should grow pretty fast. By nature, about 10% of the articles will be duplicated, so I need an algorithm to fetch them.
I have come up with the following steps:
On submission, compute the length of the text and store it in a separate table (article_id, length). The problem is that the articles are encoded using PHP's special_entities() function, and users post content with slight modifications (someone will miss a comma or an accent, or even skip some words)
Then retrieve all the entries from the database with length in the range new_post_length +/- 5% (should I use another threshold, keeping in mind the human factor in article submission?)
Fetch the first 3 keywords and compare them against the articles fetched in step 2
Having a final array with the most probable matches, compare the new entry using PHP's levenshtein() function
This process must be executed on article submission, not using cron. However I suspect it will create heavy loads on the server.
Could you provide any idea please?
Thank you!
Mike
Text similarity/plagiarism/duplicate detection is a big topic. There are many algorithms and solutions.
Levenshtein will not work in your case. You can only use it on small texts (due to its complexity it would kill your CPU).
Some projects use the "adaptive local alignment of keywords" (you will find info on that on Google).
Also, you can check this (Check the 3 links in the answer, very instructive):
Cosine similarity vs Hamming distance
Hope this will help.
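For reference, cosine similarity over word counts is cheap enough to run at submission time on a small candidate set. The Python sketch below is one possible reading of that idea, not a method taken from the linked answer; the accent and punctuation stripping in normalise is an added assumption aimed at the "missed comma or accent" case from the question.

```python
import math
import re
import unicodedata
from collections import Counter


def normalise(text: str) -> list:
    """Strip accents and punctuation so small edits do not change the tokens."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    without_accents = "".join(c for c in decomposed
                              if not unicodedata.combining(c))
    return re.findall(r"[a-z0-9]+", without_accents)


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the two word-count vectors (0.0 to 1.0)."""
    counts_a = Counter(normalise(text_a))
    counts_b = Counter(normalise(text_b))
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Word order is ignored, so this catches dropped punctuation and a few skipped words but not reordered paraphrases; the cut-off score for "duplicate" has to be chosen by testing on real articles.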
I'd like to point out that git, the version control system, has excellent algorithms for detecting duplicate or near-duplicate content. When you make a commit, it will show you the files modified (regardless of rename), and what percentage changed.
It's open source, and largely written in small, focused C programs. Perhaps there is something you could use.
You could design your app to reduce the load by not having to check text strings and keywords against all other posts in the same category. What if you had the users submit the third party content they are referencing as urls? See Tumblr implementation-- basically there is a free-form text field so each user can comment and create their own narrative portion of the post content, but then there are formatted fields also depending on the type of reference the user is adding (video, image, link, quote, etc.) An improvement on Tumblr would be letting the user add as many/few types of formatted content as they want in any given post.
Then you are only checking against known types like a url or embed video code. Combine that with rexem's suggestion to force users to classify by category or genre of some kind, and you'll have a much smaller scope to search for duplicates.
Also if you can give each user some way of posting to their own "stream" then it doesn't matter if many people duplicate the same content. Give people some way to vote up from the individual streams to a main "front page" level stream so the community can regulate when they see duplicate items. Instead of a vote up/down like Digg or Reddit, you could add a way for people to merge/append posts to related posts (letting them sort and manage the content as an activity on your app rather than making it an issue of behind the scenes processing).

Using SEO-friendly links

I'm developing a PHP website, and currently my links are in a facebook-ish style, like so
me.com/profile.php?id=123
I'm thinking of moving to something more friendly to crawling search engines
(like here at stackoverflow), something like:
me.com/john-adams
But how can I differentiate between two users with the same name, or more correctly, how does Stack Overflow tell the difference between two questions with the same title?
I was thinking of doing something like
me.com/john-adams-123
and parsing the url.
Any other recommendations?
Stackoverflow does something similar to your me.com/john-adams-123 option, except more like me.com/123/john-adams where the john-adams part actually has no programmatic meaning. The way you're proposing is slightly better because the semantic-content-free numeric ID is farther to the right in the URL.
What I would do is store a unique slug (these SEO-friendly URL components are generally called slugs) in the user table and do the number append thing when necessary to get a unique one.
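A minimal Python sketch of the "append a number when necessary" step, assuming the set of slugs already in use is available; unique_slug and the choice to start numbering at 2 are illustrative assumptions.

```python
def unique_slug(base: str, taken: set) -> str:
    """Return base if it is free, otherwise base-2, base-3, ... until one is free."""
    if base not in taken:
        return base
    suffix = 2
    while f"{base}-{suffix}" in taken:
        suffix += 1
    return f"{base}-{suffix}"
```

In a real application the check and the insert can race with another signup, so a unique index on the slug column is still needed as the final guard.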
In stack overflow's case, it's
http://stackoverflow.com/questions/975240/using-seo-friendly-links
http://stackoverflow.com/questions <- Constant prefix
/975240 <- Unique question id
using-seo-friendly-links <- Any text at all, defaults to title of question.
Facebook, on the other hand, has decided to just make everyone pick a unique ID. Then they are going to use that as a profile page. Something like http://facebook.com/p/username/. They are solving the problem of uniqueness between users, by just requiring it to be some string that the user picks that is unique among all existing users.
SO 'cheats' :-).
The link for your question ends in its title, "Using SEO-friendly links", but the same link with different text after the question number also works.
The part after the number is the SEO-friendly bit, but SO doesn't really care what's there. I think it defaults to the question title.
So in your case you could construct a link like:
me.com/123/john-adams
A second John Adams would have a different ID and a unique URL like:
me.com/111/john-adams
I would say that your proposed solution is better than Stack Overflow's, as it preserves content hierarchy:
me.com/john-adams-123
Usage of the unique ID before the username is simply nonsensical.
I would, however, recommend enforcement of content type:
me.com/john-adams-123.html
This will allow for consistent urls while serving a variety of content types.
Additionally, you could use sexatrigesimal (base 36) for the unique ID to further reduce the cruft in your URL, especially for large numbers, but this is often overkill :D
me.com/john-adams-123.html -> me.com/john-adams-3F.html
me.com/john-adams-1234567890.html -> me.com/john-adams-KF12OI.html
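The two conversions above can be checked with a short Python sketch. The helper name to_base36 is an assumption; decoding uses Python's built-in int with base 36.

```python
import string

# 0-9 followed by A-Z gives the 36 symbols used in the examples above.
DIGITS = string.digits + string.ascii_uppercase


def to_base36(number: int) -> str:
    """Encode a non-negative integer using digits 0-9 and letters A-Z."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return "0"
    encoded = ""
    while number:
        number, remainder = divmod(number, 36)
        encoded = DIGITS[remainder] + encoded
    return encoded
```

Here to_base36(123) gives 3F and to_base36(1234567890) gives KF12OI, and int("KF12OI", 36) turns the short form back into the numeric ID.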
Finally, be sure to utilize 301 redirects on non-conforming accessible URIs to redirect to the "correct" seo-friendly schema to prevent duplicate content penalties.
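One way to decide when to send that 301 is to compare the requested path with the canonical one rebuilt from the ID. The Python sketch below assumes URLs shaped like /john-adams-123 and a hypothetical slugs_by_id mapping standing in for a database lookup.

```python
import re

# Assumed URL shape: /<slug>-<numeric id>, as in me.com/john-adams-123.
PATH_PATTERN = re.compile(r"^/(?P<slug>.*)-(?P<id>\d+)$")


def canonical_redirect(path: str, slugs_by_id: dict):
    """Return (301, canonical_path) for a non-canonical alias, else None."""
    match = PATH_PATTERN.match(path)
    if not match:
        return None
    record_id = int(match.group("id"))
    canonical_slug = slugs_by_id.get(record_id)
    if canonical_slug is None:
        return None  # unknown ID; the caller would answer with a 404
    canonical_path = f"/{canonical_slug}-{record_id}"
    if path == canonical_path:
        return None  # already canonical, serve the page
    return (301, canonical_path)
```

Only the numeric ID is trusted for the lookup, so stale or misspelled slugs still reach the right record and are redirected to the single canonical URL.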
I'd go with your style of me.com/john-adams-123, because I think the leftmost part of the URI has more importance in SEO ranking.
Actually, if you are willing to use this on several controllers (not just user profile), you may want to do it more like me.com/john-adams-profile-123 with a rewriting rule redirecting /.+-profile-(\d+) to profile.php?uid=$1 and still be able to use, say, me.com/john-adams-articles-123 for this user's articles...
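The rewriting rule can be tried out in isolation before putting it into the web server configuration. This Python sketch anchors the pattern from the paragraph above and also captures the controller name; articles.php and the function name rewrite are assumptions for illustration.

```python
import re

# Pattern from the answer, anchored, with the controller name captured too.
RULE = re.compile(r"^/.+-(profile|articles)-(\d+)$")


def rewrite(path: str):
    """Map /john-adams-profile-123 to profile.php?uid=123, or None if no match."""
    match = RULE.match(path)
    if not match:
        return None
    controller, user_id = match.groups()
    return f"{controller}.php?uid={user_id}"
```

A name that itself ends in -profile or -articles still resolves correctly, because the controller is always taken from the segment right before the trailing number.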
To avoid dealing with links that contain special characters, you can use this plugin for Zend Framework.
https://github.com/btlagutoli/CharConvert
$filter2 = new Zag_Filter_CharConvert(array(
'onlyAlnum' => true,
'replaceWhiteSpace' => '-'
));
echo $filter2->filter('éééé ááááá ? 90 :'); // eeee-aaaaa-90
This can help you deal with strings in other languages.
