How does similar_text work? - php

I just found the similar_text function and was playing around with it, but the percentage output always surprises me. See the examples below.
I tried to find information on the algorithm used, as mentioned in the PHP documentation for similar_text():
<?php
$p = 0;
similar_text('aaaaaaaaaa', 'aaaaa', $p);
echo $p . "<hr>";
//66.666666666667
//Since 5 out of 10 chars match, I would expect a 50% match
similar_text('aaaaaaaaaaaaaaaaaaaa', 'aaaaa', $p);
echo $p . "<hr>";
//40
//5 out of 20 > not 25% ?
similar_text('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaa', $p);
echo $p . "<hr>";
//9.5238095238095
//5 out of 100 > not 5% ?
//Example from PHP.net
//Why is turning the strings around changing the result?
similar_text('PHP IS GREAT', 'WITH MYSQL', $p);
echo $p . "<hr>"; //27.272727272727
similar_text('WITH MYSQL', 'PHP IS GREAT', $p);
echo $p . "<hr>"; //18.181818181818
?>
Can anybody explain how this actually works?
Update:
Thanks to the comments, I found that the percentage is actually calculated as the number of similar characters * 200 / (length1 + length2):
Z_DVAL_PP(percent) = sim * 200.0 / (t1_len + t2_len);
So that explains why the percentages are higher than expected. With a string matching 5 out of 95 characters it comes out at 10, which I can use.
similar_text('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaa', $p);
echo $p . "<hr>";
//10
//5 out of 95 = 5 * 200 / (5 + 95) = 10
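The formula can be checked against the numbers above with a small sketch. It is written in Python rather than PHP, and `similarity_percent` is a hypothetical helper name; the only thing taken from the PHP source is the expression `sim * 200.0 / (t1_len + t2_len)` and the zero-length guard.

```python
def similarity_percent(sim: int, len1: int, len2: int) -> float:
    """Percentage as computed in PHP_FUNCTION(similar_text).

    `sim` is the number of matching characters, `len1` and `len2`
    are the lengths of the two input strings.
    """
    if len1 + len2 == 0:
        # The PHP source returns 0 when both strings are empty.
        return 0.0
    return sim * 200.0 / (len1 + len2)


# The examples from the question: 5 matching characters each time.
for length in (10, 20, 95, 100):
    print(length, similarity_percent(5, length, 5))
```

The percentage is relative to the average length of both strings, not to the longer one, which is why 5 out of 10 gives about 66.67 rather than 50.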
But I still can't figure out why PHP returns a different result when the strings are swapped. The JS code provided by dfsq doesn't do this. Looking at the PHP source code, I can only find a difference in the following line, but I'm not a C programmer. Some insight into what the difference is would be appreciated.
In JS:
for (l = 0;(p + l < firstLength) && (q + l < secondLength) && (first.charAt(p + l) === second.charAt(q + l)); l++);
In PHP: (php_similar_str function)
for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
Source:
/* {{{ proto int similar_text(string str1, string str2 [, float percent])
Calculates the similarity between two strings */
PHP_FUNCTION(similar_text)
{
char *t1, *t2;
zval **percent = NULL;
int ac = ZEND_NUM_ARGS();
int sim;
int t1_len, t2_len;
if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ss|Z", &t1, &t1_len, &t2, &t2_len, &percent) == FAILURE) {
return;
}
if (ac > 2) {
convert_to_double_ex(percent);
}
if (t1_len + t2_len == 0) {
if (ac > 2) {
Z_DVAL_PP(percent) = 0;
}
RETURN_LONG(0);
}
sim = php_similar_char(t1, t1_len, t2, t2_len);
if (ac > 2) {
Z_DVAL_PP(percent) = sim * 200.0 / (t1_len + t2_len);
}
RETURN_LONG(sim);
}
/* }}} */
/* {{{ php_similar_str
*/
static void php_similar_str(const char *txt1, int len1, const char *txt2, int len2, int *pos1, int *pos2, int *max)
{
char *p, *q;
char *end1 = (char *) txt1 + len1;
char *end2 = (char *) txt2 + len2;
int l;
*max = 0;
for (p = (char *) txt1; p < end1; p++) {
for (q = (char *) txt2; q < end2; q++) {
for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
if (l > *max) {
*max = l;
*pos1 = p - txt1;
*pos2 = q - txt2;
}
}
}
}
/* }}} */
/* {{{ php_similar_char
*/
static int php_similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
int sum;
int pos1, pos2, max;
php_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);
if ((sum = max)) {
if (pos1 && pos2) {
sum += php_similar_char(txt1, pos1,
txt2, pos2);
}
if ((pos1 + max < len1) && (pos2 + max < len2)) {
sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
txt2 + pos2 + max, len2 - pos2 - max);
}
}
return sum;
}
/* }}} */
Source in Javascript: similar text port to javascript

This was actually a very interesting question, thank you for giving me a puzzle that turned out to be very rewarding.
Let me start out by explaining how similar_text actually works.
Similar Text: The Algorithm
It's a recursion-based divide and conquer algorithm. It works by first finding the longest common substring between the two inputs and then breaking the problem into subsets around that substring.
The examples in your question all perform only one iteration of the algorithm, except for the ones taken from the php.net comments, which are also the ones that give different results.
Here is a simple example to understand the main issue behind similar_text and hopefully give some insight into how it works.
Similar Text: The Flaw
eeeefaaaaafddddd
ddddgaaaaagbeeee
Iteration 1:
Max = 5
String = aaaaa
Left : eeeef and ddddg
Right: fddddd and gbeeee
I hope the flaw is already apparent. It will only check directly to the left and to the right of the longest matched string in both input strings. Take this example:
$s1='eeeefaaaaafddddd';
$s2='ddddgaaaaagbeeee';
echo similar_text($s1, $s2).'|'.similar_text($s2, $s1);
// outputs 5|5, this is due to Iteration 2 of the algorithm
// it will fail to find a matching string in both left and right subsets
To be honest, I'm uncertain how this case should be treated. It can be seen that only 2 characters are different in the string.
But both eeee and dddd are on opposite ends of the two strings, and I'm uncertain what NLP enthusiasts or other literary experts have to say about this specific situation.
Similar Text: Inconsistent results on argument swapping
The different results you were experiencing based on input order are due to the way the algorithm actually behaves (as mentioned above).
I'll give a final explanation of what's going on.
echo similar_text('test','wert'); // 1
echo similar_text('wert','test'); // 2
On the first case, there's only one Iteration:
test
wert
Iteration 1:
Max = 1
String = t
Left : (empty) and wer
Right: est and (empty)
We only have one iteration because empty/null strings return 0 on recursion. So this ends the algorithm and we have our result: 1
On the second case, however, we are faced with multiple Iterations:
wert
test
Iteration 1:
Max = 1
String = e
Left : w and t
Right: rt and st
We already have a common string of length 1. The algorithm on the left subset will end in 0 matches, but on the right:
rt
st
Iteration 1:
Max = 1
String = t
Left : r and s
Right: (empty) and (empty)
This will lead to our new and final result: 2
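The walkthrough above can be reproduced with a short sketch. This is a Python transcription of the recursion in php_similar_char and php_similar_str as quoted in the question, not PHP's own code; the function name `similar_text` is reused only for readability, and the strict `>` comparison is what keeps the first longest match found.

```python
def similar_text(first: str, second: str) -> int:
    """Count matching characters following the recursion in php_similar_char."""
    if not first or not second:
        return 0  # empty subsets contribute nothing

    # Find the longest common substring. Because of the strict ">",
    # ties keep the earliest position in `first`, then in `second`.
    best_len = pos1 = pos2 = 0
    for i in range(len(first)):
        for j in range(len(second)):
            k = 0
            while (i + k < len(first) and j + k < len(second)
                   and first[i + k] == second[j + k]):
                k += 1
            if k > best_len:
                best_len, pos1, pos2 = k, i, j

    if best_len == 0:
        return 0

    # Recurse only on what lies left of the match and right of the match.
    left = similar_text(first[:pos1], second[:pos2])
    right = similar_text(first[pos1 + best_len:], second[pos2 + best_len:])
    return best_len + left + right


print(similar_text('test', 'wert'))  # 1
print(similar_text('wert', 'test'))  # 2
```

The outer loop runs over the first argument, so the first string decides which of several equally long matches is chosen. That choice fixes the left and right subsets, which is where the asymmetry comes from.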
I thank you for this very informative question and the opportunity to dabble in C++ again.
Similar Text: JavaScript Edition
The short answer is: the JavaScript code is not implementing the correct algorithm
sum += this.similar_text(first.substr(0, pos2), second.substr(0, pos2));
Obviously it should be first.substr(0,pos1)
Note: The JavaScript code has been fixed by eis in a previous commit. Thanks @eis
Demystified!

It would indeed seem the function uses different logic depending on the parameter order. I think there are two things at play.
First, see this example:
echo similar_text('test','wert'); // 1
echo similar_text('wert','test'); // 2
It seems to be that it is testing "how many times any distinct char on param1 is found in param2", and thus result would be different if you swap the params around. It has been reported as a bug, which has been closed as "working as expected".
Now, the above is the same for both the PHP and JavaScript implementations: parameter order has an impact, so saying that the JS code wouldn't do this is wrong. This is argued in the bug entry to be intended behaviour.
Second, what doesn't seem correct is the MYSQL/PHP word example. With that, the JavaScript version gives 3 regardless of the order of params, whereas PHP gives 2 and 3 (and due to that, the percentage differs as well). Now, the phrases "PHP IS GREAT" and "WITH MYSQL" should have 5 characters in common, regardless of which way you compare: H, I, S and T, one each, plus one for the space. In order they have 3 characters, 'H', ' ' and 'S', so if you look at the ordering, the correct answer should be 3 both ways. I modified the C code to a runnable version, and added some output, so one can see what is happening there (codepad link):
#include<stdio.h>
/* {{{ php_similar_str
*/
static void php_similar_str(const char *txt1, int len1, const char *txt2, int len2, int *pos1, int *pos2, int *max)
{
char *p, *q;
char *end1 = (char *) txt1 + len1;
char *end2 = (char *) txt2 + len2;
int l;
*max = 0;
for (p = (char *) txt1; p < end1; p++) {
for (q = (char *) txt2; q < end2; q++) {
for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
if (l > *max) {
*max = l;
*pos1 = p - txt1;
*pos2 = q - txt2;
}
}
}
}
/* }}} */
/* {{{ php_similar_char
*/
static int php_similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
int sum;
int pos1, pos2, max;
php_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);
if ((sum = max)) {
if (pos1 && pos2) {
printf("txt here %s,%s\n", txt1, txt2);
sum += php_similar_char(txt1, pos1,
txt2, pos2);
}
if ((pos1 + max < len1) && (pos2 + max < len2)) {
printf("txt here %s,%s\n", txt1+ pos1 + max, txt2+ pos2 + max);
sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
txt2 + pos2 + max, len2 - pos2 - max);
}
}
return sum;
}
/* }}} */
int main(void)
{
printf("Found %d similar chars\n",
php_similar_char("PHP IS GREAT", 12, "WITH MYSQL", 10));
printf("Found %d similar chars\n",
php_similar_char("WITH MYSQL", 10,"PHP IS GREAT", 12));
return 0;
}
The resulting output is:
txt here PHP IS GREAT,WITH MYSQL
txt here P IS GREAT, MYSQL
txt here IS GREAT,MYSQL
txt here IS GREAT,MYSQL
txt here GREAT,QL
Found 3 similar chars
txt here WITH MYSQL,PHP IS GREAT
txt here TH MYSQL,S GREAT
Found 2 similar chars
So one can see that on the first comparison, the function found 'H', ' ' and 'S', but not 'T', and got the result of 3. The second comparison found 'I' and 'T' but not 'H', ' ' or 'S', and thus got the result of 2.
The reason for these results can be seen from the output: the algorithm takes the first letter in the first string that the second string contains, counts that, and throws away the chars before that from the second string. That is why it misses the characters in between, and that is what causes the difference when you change the order.
What happens there might be intentional or it might not. However, that's not how the JavaScript version works. If you print out the same things in the JavaScript version, you get this:
txt here: PHP, WIT
txt here: P IS GREAT, MYSQL
txt here: IS GREAT, MYSQL
txt here: IS, MY
txt here: GREAT, QL
Found 3 similar chars
txt here: WITH, PHP
txt here: W, P
txt here: TH MYSQL, S GREAT
Found 3 similar chars
showing that the JavaScript version does it in a different way. The JavaScript version finds 'H', ' ' and 'S' in the same order in the first comparison, and the same 'H', ' ' and 'S' in the second one, so in this case the order of params doesn't matter.
As the JavaScript is meant to duplicate the code of the PHP function, it needs to behave identically, so I submitted a bug report based on the analysis of @Khez, along with the fix, which has now been merged.

first String = aaaaaaaaaa = 10 letters
second String = aaaaa = 5 letters
first five letters are similar
a+a
a+a
a+a
a+a
a+a
a
a
a
a
a
( <similar_letters> * 200 ) / (<letter_count_first_string> + <letter_count_second_string>)
( 5 * 200 ) / (10 + 5);
= 66.6666666667

Description
int similar_text ( string $first , string $second [, float &$percent ] )
This calculates the similarity between two strings as described in Oliver [1993]. Note that this implementation does not use a stack as in Oliver's pseudo code, but recursive calls which may or may not speed up the whole process. Note also that the complexity of this algorithm is O(N**3) where N is the length of the longest string.
Parameters
first
The first string.
second
The second string.
percent
By passing a reference as third argument, similar_text() will calculate the similarity in percent for you.

Related

How to make shorten URL like bit.ly [duplicate]

I want to create a URL shortener service where you can write a long URL into an input field and the service shortens the URL to "http://www.example.org/abcdef".
Instead of "abcdef" there can be any other string with six characters containing a-z, A-Z and 0-9. That makes 56~57 billion possible strings.
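The size of that key space is a one-line calculation. The sketch below is only a check of the arithmetic; the variable names are illustrative.

```python
# 26 lowercase letters + 26 uppercase letters + 10 digits
ALPHABET_SIZE = 26 + 26 + 10
CODE_LENGTH = 6

# Every position can hold any of the 62 symbols independently.
total = ALPHABET_SIZE ** CODE_LENGTH
print(f"{total:,}")  # 56,800,235,584
```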
My approach:
I have a database table with three columns:
id, integer, auto-increment
long, string, the long URL the user entered
short, string, the shortened URL (or just the six characters)
I would then insert the long URL into the table. Then I would select the auto-increment value for "id" and build a hash of it. This hash should then be inserted as "short". But what sort of hash should I build? Hash algorithms like MD5 create too long strings. I don't use these algorithms, I think. A self-built algorithm will work, too.
My idea:
For "http://www.google.de/" I get the auto-increment id 239472. Then I do the following steps:
short = '';
if divisible by 2, add "a"+the result to short
if divisible by 3, add "b"+the result to short
... until I have divisors for a-z and A-Z.
That could be repeated until the number isn't divisible any more. Do you think this is a good approach? Do you have a better idea?
Due to the ongoing interest in this topic, I've published an efficient solution to GitHub, with implementations for JavaScript, PHP, Python and Java. Add your solutions if you like :)
I would continue your "convert number to string" approach. However, you will realize that your proposed algorithm fails if your ID is a prime and greater than 52.
Theoretical background
You need a Bijective Function f. This is necessary so that you can find an inverse function g('abc') = 123 for your f(123) = 'abc' function. This means:
There must be no x1, x2 (with x1 ≠ x2) that will make f(x1) = f(x2),
and for every y you must be able to find an x so that f(x) = y.
How to convert the ID to a shortened URL
Think of an alphabet we want to use. In your case, that's [a-zA-Z0-9]. It contains 62 letters.
Take an auto-generated, unique numerical key (the auto-incremented id of a MySQL table for example).
For this example, I will use 125₁₀ (125 with a base of 10).
Now you have to convert 125₁₀ to X₆₂ (base 62).
125₁₀ = 2×62¹ + 1×62⁰ = [2,1]
This requires the use of integer division and modulo. A pseudo-code example:
digits = []
while num > 0
remainder = modulo(num, 62)
digits.push(remainder)
num = divide(num, 62)
digits = digits.reverse
Now map the indices 2 and 1 to your alphabet. This is how your mapping (with an array for example) could look like:
0 → a
1 → b
...
25 → z
...
52 → 0
61 → 9
With 2 → c and 1 → b, you will receive cb₆₂ as the shortened URL.
http://shor.ty/cb
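The conversion steps above can be sketched in Python as follows. The names `ALPHABET` and `encode` are illustrative, and mapping 0 to the first letter is an assumption, since the pseudo-code above produces no digits at all for 0.

```python
import string

# a-z, A-Z, 0-9: the 62-letter alphabet from the mapping above
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits


def encode(num: int, alphabet: str = ALPHABET) -> str:
    """Convert a non-negative integer ID to a string in base len(alphabet)."""
    if num < 0:
        raise ValueError("ID must be non-negative")
    if num == 0:
        return alphabet[0]  # assumption: 0 maps to the first letter
    base = len(alphabet)
    digits = []
    while num > 0:
        # Integer division and modulo in one step.
        num, remainder = divmod(num, base)
        digits.append(alphabet[remainder])
    # Digits were collected least significant first, so reverse them.
    return "".join(reversed(digits))


print(encode(125))  # cb
```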
How to resolve a shortened URL to the initial ID
The reverse is even easier. You just do a reverse lookup in your alphabet.
e9a₆₂ will be resolved to "4th, 61st, and 0th letter in the alphabet".
e9a₆₂ = [4,61,0] = 4×62² + 61×62¹ + 0×62⁰ = 19158₁₀
Now find your database-record with WHERE id = 19158 and do the redirect.
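The reverse lookup can be sketched the same way. Again, `ALPHABET` and `decode` are illustrative names; the alphabet must be the same one that was used for encoding.

```python
import string

# a-z, A-Z, 0-9: must match the alphabet used for encoding
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits


def decode(code: str, alphabet: str = ALPHABET) -> int:
    """Convert a short code back to the integer ID it was built from."""
    base = len(alphabet)
    num = 0
    for char in code:
        # str.index raises ValueError for characters outside the alphabet.
        num = num * base + alphabet.index(char)
    return num


print(decode("e9a"))  # 19158
```

Accumulating `num * base + digit` from left to right gives the same result as summing the powers of 62, without computing any powers.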
Example implementations (provided by commenters)
C++
Python
Ruby
Haskell
C#
CoffeeScript
Perl
Why would you want to use a hash?
You can just use a simple translation of your auto-increment value to an alphanumeric value. You can do that easily by using some base conversion. Say your character space (A-Z, a-z, 0-9, etc.) has 62 characters: convert the id to a base-62 number and use the characters as the digits.
public class UrlShortener {
private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
private static final int BASE = ALPHABET.length();
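public static String encode(int num) {
// Zero never enters the loop below, so map it to the first symbol explicitly.
if ( num == 0 ) {
return String.valueOf( ALPHABET.charAt(0) );
}
StringBuilder sb = new StringBuilder();
while ( num > 0 ) {
sb.append( ALPHABET.charAt( num % BASE ) );
num /= BASE;
}
return sb.reverse().toString();
}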
public static int decode(String str) {
int num = 0;
for ( int i = 0; i < str.length(); i++ )
num = num * BASE + ALPHABET.indexOf(str.charAt(i));
return num;
}
}
Not an answer to your question, but I wouldn't use case-sensitive shortened URLs. They are hard to remember, usually unreadable (many fonts render 1 and l, 0 and O and other characters so similarly that it is nearly impossible to tell them apart) and downright error-prone. Try to use lower or upper case only.
Also, try to have a format where you mix the numbers and characters in a predefined form. There are studies that show that people tend to remember one form better than others (think phone numbers, where the numbers are grouped in a specific form). Try something like num-char-char-num-char-char. I know this will lower the combinations, especially if you don't have upper and lower case, but it would be more usable and therefore useful.
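How much the fixed pattern lowers the number of combinations can be estimated with a short calculation. The sketch assumes a single-case alphabet of 26 letters and 10 digits; the variable names are illustrative.

```python
DIGITS = 10
LETTERS = 26  # assumption: one case only

# num-char-char-num-char-char: two digit positions, four letter positions
patterned = DIGITS * LETTERS * LETTERS * DIGITS * LETTERS * LETTERS

# Six free positions over the same 36 symbols, for comparison
unrestricted = (DIGITS + LETTERS) ** 6

print(f"{patterned:,}")     # 45,697,600
print(f"{unrestricted:,}")  # 2,176,782,336
```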
My approach: Take the Database ID, then Base36 Encode it. I would NOT use both Upper AND Lowercase letters, because that makes transmitting those URLs over the telephone a nightmare, but you could of course easily extend the function to be a base 62 en/decoder.
Here is my PHP 5 class.
<?php
class Bijective
{
public $dictionary = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
public function __construct()
{
$this->dictionary = str_split($this->dictionary);
}
public function encode($i)
{
if ($i == 0)
return $this->dictionary[0];
$result = array();
$base = count($this->dictionary);
while ($i > 0)
{
$result[] = $this->dictionary[($i % $base)];
$i = floor($i / $base);
}
$result = array_reverse($result);
return join("", $result);
}
public function decode($input)
{
$i = 0;
$base = count($this->dictionary);
$input = str_split($input);
foreach($input as $char)
{
$pos = array_search($char, $this->dictionary);
$i = $i * $base + $pos;
}
return $i;
}
}
A Node.js and MongoDB solution
MongoDB creates each new ObjectId as a 12-byte value with a known format:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id,
a 3-byte counter (in your machine), starting with a random value.
Example (I choose a random sequence)
a1b2c3d4e5f6g7h8i9j1k2l3
a1b2c3d4 represents the seconds since the Unix epoch,
e5f6g7 represents the machine identifier,
h8i9 represents process id
j1k2l3 represents the counter, starting with a random value.
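The field boundaries in the example above can be expressed as string slices. This sketch follows the layout as described in this answer; `split_object_id` is a hypothetical helper, not part of MongoDB or Mongoose, and it only checks the length because the example sequence is not real hexadecimal.

```python
def split_object_id(hex_id: str) -> dict:
    """Split a 24-character ObjectId string into the fields described above."""
    if len(hex_id) != 24:
        raise ValueError("expected 24 characters (12 bytes, two characters each)")
    return {
        "timestamp": hex_id[0:8],    # 4 bytes
        "machine": hex_id[8:14],     # 3 bytes
        "process": hex_id[14:18],    # 2 bytes
        "counter": hex_id[18:24],    # 3 bytes
    }


parts = split_object_id("a1b2c3d4e5f6g7h8i9j1k2l3")
print(parts["counter"])  # j1k2l3
```

The counter is the last six characters, which is the part the snippet in this answer takes with `slice(-6)`.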
Since the counter is unique as long as we store the data on the same machine, we can be confident that it will not be duplicated.
So the short URL will be the counter. Here is a code snippet, assuming that your server is running properly.
const mongoose = require('mongoose');
const Schema = mongoose.Schema;
// Create a schema
const shortUrl = new Schema({
long_url: { type: String, required: true },
short_url: { type: String, required: true, unique: true },
});
const ShortUrl = mongoose.model('ShortUrl', shortUrl);
// The user can request to get a short URL by providing a long URL using a form
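app.post('/shorten', function (req, res) {
  // The submit form has an input with longURL as its name attribute.
  const longUrl = req.body["longURL"];
  // Create a new ShortUrl document; Mongoose assigns its _id on construction.
  const newUrl = new ShortUrl({
    long_url: longUrl,
    short_url: "",
  });
  // Use the last six characters of the ObjectId (the counter) as the short URL.
  const shortUrl = newUrl._id.toString().slice(-6);
  newUrl.short_url = shortUrl;
  console.log(newUrl);
  newUrl.save(function (err) {
    if (err) {
      return res.status(500).send("could not save the URL");
    }
    console.log("the new URL is added");
    res.send(shortUrl);
  });
});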
I keep incrementing an integer sequence per domain in the database and use Hashids to encode the integer into a URL path.
static hashids = Hashids(salt = "my app rocks", minSize = 6)
I ran a script to see how long it takes until it exhausts the character length. For six characters it can do 164,916,224 links and then goes up to seven characters. Bitly uses seven characters. Under five characters looks weird to me.
Hashids can decode the URL path back to an integer, but a simpler solution is to use the entire short link sho.rt/ka8ds3 as a primary key.
Here is the full concept:
function addDomain(domain) {
table("domains").insert("domain", domain, "seq", 0)
}
function addURL(domain, longURL) {
seq = table("domains").where("domain = ?", domain).increment("seq")
shortURL = domain + "/" + hashids.encode(seq)
table("links").insert("short", shortURL, "long", longURL)
return shortURL
}
// GET /:hashcode
function handleRequest(req, res) {
shortURL = req.host + "/" + req.param("hashcode")
longURL = table("links").where("short = ?", shortURL).get("long")
res.redirect(301, longURL)
}
C# version:
public class UrlShortener
{
private static String ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
private static int BASE = 62;
public static String encode(int num)
{
StringBuilder sb = new StringBuilder();
while ( num > 0 )
{
sb.Append( ALPHABET[( num % BASE )] );
num /= BASE;
}
StringBuilder builder = new StringBuilder();
for (int i = sb.Length - 1; i >= 0; i--)
{
builder.Append(sb[i]);
}
return builder.ToString();
}
public static int decode(String str)
{
int num = 0;
for ( int i = 0, len = str.Length; i < len; i++ )
{
num = num * BASE + ALPHABET.IndexOf( str[(i)] );
}
return num;
}
}
You could hash the entire URL, but if you just want to shorten the id, do as marcel suggested. I wrote this Python implementation:
https://gist.github.com/778542
Take a look at https://hashids.org/ it is open source and in many languages.
Their page outlines some of the pitfalls of other approaches.
If you don't want to re-invent the wheel ... http://lilurl.sourceforge.net/
// simple approach
$original_id = 56789;
$shortened_id = base_convert($original_id, 10, 36);
$un_shortened_id = base_convert($shortened_id, 36, 10);
import string

# a-z, A-Z, 0-9 (62 symbols)
alphabet = string.ascii_lowercase + string.ascii_uppercase + string.digits

def encode(i, a=alphabet):
    '''Takes an integer and returns it in the given base with mappings for upper/lower case letters and numbers 0-9.'''
    try:
        i = int(i)
    except Exception:
        raise TypeError("Input must be an integer.")
    if i < 0:
        raise ValueError("Input must be non-negative.")
    if i == 0:
        return a[0]
    base = len(a)
    digits = []
    while i > 0:
        # Peel off the least significant digit on each pass.
        i, remainder = divmod(i, base)
        digits.append(a[remainder])
    return ''.join(reversed(digits))

def decode(s, a=alphabet):
    '''Takes a base 62 string in our alphabet and returns it in base10.'''
    try:
        s = str(s)
    except Exception:
        raise TypeError("Input must be a string.")
    return sum(a.index(c) * pow(len(a), p) for p, c in enumerate(reversed(s)))
Here's my version for whomever needs it.
Why not just translate your id to a string? You just need a function that maps a digit between, say, 0 and 61 to a single letter (upper/lower case) or digit. Then apply this to create, say, 4-letter codes, and you've got 14.7 million URLs covered.
Here is a decent URL encoding function for PHP...
// From http://snipplr.com/view/22246/base62-encode--decode/
private function base_encode($val, $base=62, $chars='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ') {
$str = '';
do {
$i = fmod($val, $base);
$str = $chars[$i] . $str;
$val = ($val - $i) / $base;
} while($val > 0);
return $str;
}
Don't know if anyone will find this useful - it is more of a 'hack n slash' method, yet is simple and works nicely if you want only specific chars.
$dictionary = "abcdfghjklmnpqrstvwxyz23456789";
$dictionary = str_split($dictionary);
// Encode
$str_id = '';
$base = count($dictionary);
while($id > 0) {
$rem = $id % $base;
$id = ($id - $rem) / $base;
$str_id .= $dictionary[$rem];
}
// Decode
$id_ar = str_split($str_id);
$id = 0;
for($i = count($id_ar); $i > 0; $i--) {
$id += array_search($id_ar[$i-1], $dictionary) * pow($base, $i - 1);
}
Did you omit O, 0, and i on purpose?
I just created a PHP class based on Ryan's solution.
<?php
$shorty = new App_Shorty();
echo 'ID: ' . 1000;
echo '<br/> Short link: ' . $shorty->encode(1000);
echo '<br/> Decoded Short Link: ' . $shorty->decode($shorty->encode(1000));
/**
* A nice shortening class based on Ryan Charmley's suggestion; see the link on Stack Overflow below.
* @author Svetoslav Marinov (Slavi) | http://WebWeb.ca
* @see http://stackoverflow.com/questions/742013/how-to-code-a-url-shortener/10386945#10386945
*/
class App_Shorty {
/**
* Explicitly omitted: i, o, 1, 0 because they are confusing. Also use only lowercase ... as
* dictating this over the phone might be tough.
* @var string
*/
private $dictionary = "abcdfghjklmnpqrstvwxyz23456789";
private $dictionary_array = array();
public function __construct() {
$this->dictionary_array = str_split($this->dictionary);
}
/**
* Gets ID and converts it into a string.
* @param int $id
*/
public function encode($id) {
$str_id = '';
$base = count($this->dictionary_array);
while ($id > 0) {
$rem = $id % $base;
$id = ($id - $rem) / $base;
$str_id .= $this->dictionary_array[$rem];
}
return $str_id;
}
/**
* Converts /abc into an integer ID
* @param string
* @return int $id
*/
public function decode($str_id) {
$id = 0;
$id_ar = str_split($str_id);
$base = count($this->dictionary_array);
for ($i = count($id_ar); $i > 0; $i--) {
$id += array_search($id_ar[$i - 1], $this->dictionary_array) * pow($base, $i - 1);
}
return $id;
}
}
?>
public class TinyUrl {
private final String characterMap = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
private final int charBase = characterMap.length();
public String covertToCharacter(int num){
StringBuilder sb = new StringBuilder();
while (num > 0){
sb.append(characterMap.charAt(num % charBase));
num /= charBase;
}
return sb.reverse().toString();
}
public int covertToInteger(String str){
int num = 0;
for(int i = 0 ; i< str.length(); i++)
num += characterMap.indexOf(str.charAt(i)) * Math.pow(charBase , (str.length() - (i + 1)));
return num;
}
}
class TinyUrlTest{
public static void main(String[] args) {
TinyUrl tinyUrl = new TinyUrl();
int num = 122312215;
String url = tinyUrl.covertToCharacter(num);
System.out.println("Tiny url: " + url);
System.out.println("Id: " + tinyUrl.covertToInteger(url));
}
}
This is what I use:
import string

# Generate a [0-9a-zA-Z] string
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode_id(id_number, alphabet=ALPHABET):
    """Convert an integer to a string."""
    if id_number == 0:
        return alphabet[0]
    alphabet_len = len(alphabet) # Cache
    result = ''
    while id_number > 0:
        id_number, mod = divmod(id_number, alphabet_len)
        result = alphabet[mod] + result
    return result

def decode_id(id_string, alphabet=ALPHABET):
    """Convert a string to an integer."""
    alphabet_len = len(alphabet) # Cache
    return sum([alphabet.index(char) * pow(alphabet_len, power) for power, char in enumerate(reversed(id_string))])
It's very fast and can take long integers.
For a similar project, to get a new key, I make a wrapper function around a random string generator that calls the generator until I get a string that hasn't already been used in my hashtable. This method will slow down once your name space starts to get full, but as you have said, even with only 6 characters, you have plenty of namespace to work with.
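A minimal sketch of that wrapper, assuming an in-memory set stands in for the hashtable of used keys. The names `new_key` and `used` are hypothetical; a real service would check its own storage instead.

```python
import secrets
import string

# a-z, A-Z, 0-9
ALPHABET = string.ascii_letters + string.digits


def new_key(used: set, length: int = 6) -> str:
    """Draw random strings until one is found that is not in `used`."""
    while True:
        key = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if key not in used:
            # Record the key so later calls cannot return it again.
            used.add(key)
            return key


used_keys = set()
print(new_key(used_keys))
```

The loop needs more retries on average as the name space fills up, which is the slowdown described above.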
I have a variant of the problem, in that I store web pages from many different authors and need to prevent discovery of pages by guesswork. So my short URLs add a couple of extra digits to the Base-62 string for the page number. These extra digits are generated from information in the page record itself and they ensure that only 1 in 3844 URLs are valid (assuming 2-digit Base-62). You can see an outline description at http://mgscan.com/MBWL.
Very good answer, I have created a Golang implementation of the bjf:
package bjf
import (
"math"
"strings"
"strconv"
)
const alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
func Encode(num string) string {
n, _ := strconv.ParseUint(num, 10, 64)
t := make([]byte, 0)
/* Special case */
if n == 0 {
return string(alphabet[0])
}
/* Map */
for n > 0 {
r := n % uint64(len(alphabet))
t = append(t, alphabet[r])
n = n / uint64(len(alphabet))
}
/* Reverse */
for i, j := 0, len(t) - 1; i < j; i, j = i + 1, j - 1 {
t[i], t[j] = t[j], t[i]
}
return string(t)
}
func Decode(token string) int {
r := int(0)
p := float64(len(token)) - 1
for i := 0; i < len(token); i++ {
r += strings.Index(alphabet, string(token[i])) * int(math.Pow(float64(len(alphabet)), p))
p--
}
return r
}
Hosted at github: https://github.com/xor-gate/go-bjf
Implementation in Scala:
class Encoder(alphabet: String) extends (Long => String) {
val Base = alphabet.size
override def apply(number: Long) = {
def encode(current: Long): List[Int] = {
if (current == 0) Nil
else (current % Base).toInt :: encode(current / Base)
}
encode(number).reverse
.map(current => alphabet.charAt(current)).mkString
}
}
class Decoder(alphabet: String) extends (String => Long) {
val Base = alphabet.size
override def apply(string: String) = {
def decode(current: Long, encodedPart: String): Long = {
if (encodedPart.size == 0) current
else decode(current * Base + alphabet.indexOf(encodedPart.head),encodedPart.tail)
}
decode(0,string)
}
}
Test example with Scala test:
import org.scalatest.{FlatSpec, Matchers}
class DecoderAndEncoderTest extends FlatSpec with Matchers {
val Alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
"A number with base 10" should "be correctly encoded into base 62 string" in {
val encoder = new Encoder(Alphabet)
encoder(127) should be ("cd")
encoder(543513414) should be ("KWGPy")
}
"A base 62 string" should "be correctly decoded into a number with base 10" in {
val decoder = new Decoder(Alphabet)
decoder("cd") should be (127)
decoder("KWGPy") should be (543513414)
}
}
Function based on Xeoncross's class
function shortly($input){
$dictionary = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','0','1','2','3','4','5','6','7','8','9'];
if($input===0)
return $dictionary[0];
$base = count($dictionary);
if(is_numeric($input)){
$result = [];
while($input > 0){
$result[] = $dictionary[($input % $base)];
$input = floor($input / $base);
}
return join("", array_reverse($result));
}
$i = 0;
$input = str_split($input);
foreach($input as $char){
$pos = array_search($char, $dictionary);
$i = $i * $base + $pos;
}
return $i;
}
Here is a Node.js implementation that, like bit.ly, generates a highly random seven-character string.
It uses Node.js crypto to generate a highly random pool of 25 bytes (50 hex characters) and then picks seven characters from that pool, rather than randomly selecting seven characters directly.
var crypto = require("crypto");
exports.shortURL = new function () {
this.getShortURL = function () {
var sURL = '',
_rand = crypto.randomBytes(25).toString('hex'),
_base = _rand.length;
for (var i = 0; i < 7; i++)
sURL += _rand.charAt(Math.floor(Math.random() * _rand.length));
return sURL;
};
}
My Python 3 version
base_list = list("0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
base = len(base_list)
def encode(num: int):
result = []
if num == 0:
result.append(base_list[0])
while num > 0:
result.append(base_list[num % base])
num //= base
print("".join(reversed(result)))
def decode(code: str):
num = 0
code_list = list(code)
for index, code in enumerate(reversed(code_list)):
num += base_list.index(code) * base ** index
print(num)
if __name__ == '__main__':
encode(341413134141)
decode("60FoItT")
For a quality Node.js / JavaScript solution, see the id-shortener module, which is thoroughly tested and has been used in production for months.
It provides an efficient id / URL shortener backed by pluggable storage defaulting to Redis, and you can even customize your short id character set and whether or not shortening is idempotent. This is an important distinction that not all URL shorteners take into account.
In relation to other answers here, this module implements Marcel Jackwerth's excellent accepted answer above.
The core of the solution is provided by the following Redis Lua snippet:
local sequence = redis.call('incr', KEYS[1])
local chars = '0123456789ABCDEFGHJKLMNPQRSTUVWXYZ_abcdefghijkmnopqrstuvwxyz'
local remaining = sequence
local slug = ''
while (remaining > 0) do
local d = (remaining % 60)
local character = string.sub(chars, d + 1, d + 1)
slug = character .. slug
remaining = (remaining - d) / 60
end
redis.call('hset', KEYS[2], slug, ARGV[1])
return slug
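To see the encoding step in isolation, here is a minimal Python sketch of the same base-60 loop, without Redis. The function name `slug_for` is made up for illustration, and the counter is passed in as an argument instead of coming from `incr`.

```python
# Same 60-character alphabet as the Lua snippet above.
CHARS = "0123456789ABCDEFGHJKLMNPQRSTUVWXYZ_abcdefghijkmnopqrstuvwxyz"


def slug_for(sequence):
    """Encode a positive counter value as a base-60 slug (hypothetical helper)."""
    slug = ""
    remaining = sequence
    while remaining > 0:
        # Peel off the least significant base-60 digit and prepend its character.
        remaining, d = divmod(remaining, len(CHARS))
        slug = CHARS[d] + slug
    return slug


print(slug_for(1), slug_for(59), slug_for(60))
```

As in the Lua version, a counter of 0 would produce an empty slug; that case does not arise there because `incr` starts at 1.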
Why not just generate a random string and append it to the base URL? This is a very simplified version of doing this in C#.
static string chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
static string baseUrl = "https://google.com/";
private static string RandomString(int length)
{
char[] s = new char[length];
Random rnd = new Random();
for (int x = 0; x < length; x++)
{
s[x] = chars[rnd.Next(chars.Length)];
}
Thread.Sleep(10);
return new String(s);
}
Then just append the random string to the baseUrl:
string tinyURL = baseUrl + RandomString(5);
Remember that this is a very simplified version, and the RandomString method could create duplicate strings. In production you would want to account for duplicates to ensure that every URL is unique. I have some code that handles duplicate strings by querying a database table, which I could share if anyone is interested.
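The duplicate check mentioned above could look roughly like the following Python sketch. It is only an illustration: the dictionary `store` stands in for the database table, and the names `random_slug`, `shorten` and `max_attempts` are assumptions, not part of the answer's code.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def random_slug(length=5):
    """Return a random string of letters and digits."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def shorten(long_url, store, length=5, max_attempts=10):
    """Pick a slug not yet present in `store` (a stand-in for a database table)."""
    for _ in range(max_attempts):
        slug = random_slug(length)
        if slug not in store:
            store[slug] = long_url
            return slug
    raise RuntimeError("no free slug found; consider a longer length")


store = {}
print("https://google.com/" + shorten("https://example.com/some/long/path", store))
```

With a real database, the lookup and the insert would have to happen atomically (for example through a unique constraint), otherwise two concurrent requests could still claim the same slug.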
These are my initial thoughts; more thinking can be done, or a simulation can be run, to see whether this works well or needs improvement:
My answer is to remember the long URL in the database, and use the ID 0 to 9999999999999999 (or however large the number is needed).
But the ID 0 to 9999999999999999 can be an issue, because
it can be shorter if we use hexadecimal, or even base62 or base64. (base64 just like YouTube using A-Z a-z 0-9 _ and -)
if it increases from 0 to 9999999999999999 uniformly, then hackers can visit them in that order and know what URLs people are sending each other, so it can be a privacy issue
We can do this:
have one server allocate 0 to 999 to one server, Server A, so now Server A has 1000 of such IDs. So if there are 20 or 200 servers constantly wanting new IDs, it doesn't have to keep asking for each new ID, but rather asking once for 1000 IDs
for the ID 1, for example, reverse the bits. So 000...00000001 becomes 10000...000, so that when converted to base64, it will be non-uniformly increasing IDs each time.
use XOR to flip the bits for the final IDs. For example, XOR with 0xD5AA96...2373 (like a secret key), and some of the bits will be flipped (wherever the secret key has a 1 bit, the corresponding bit of the ID is flipped). This will make the IDs even harder to guess and appear more random
Following this scheme, the single server that allocates the IDs can form the IDs, and so can the 20 or 200 servers requesting the allocation of IDs. The allocating server has to use a lock / semaphore to prevent two requesting servers from getting the same batch (or if it is accepting one connection at a time, this already solves the problem). So we don't want the line (queue) to be too long for waiting to get an allocation. So that's why allocating 1000 or 10000 at a time can solve the issue.
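The bit reversal and XOR steps can be sketched in Python as follows. This assumes 64-bit IDs, and `SECRET_KEY` is an arbitrary made-up value used only for illustration (it is not the key elided above); the function names are hypothetical as well.

```python
BITS = 64
SECRET_KEY = 0x5DEECE66D1234567  # hypothetical key, chosen only for this example


def reverse_bits(value, width=BITS):
    """Reverse the order of the lowest `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def obfuscate(seq_id):
    """Turn a sequential ID into a scattered-looking public ID."""
    return reverse_bits(seq_id) ^ SECRET_KEY


def deobfuscate(public_id):
    """Recover the sequential ID from a public ID."""
    return reverse_bits(public_id ^ SECRET_KEY)


for seq_id in (1, 2, 3):
    print(seq_id, hex(obfuscate(seq_id)))
```

Both steps are reversible, so any server that knows the key can map a public ID back to the sequential one. This only obscures the ordering; it is not a substitute for real encryption.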

Algorithm for difference of products of large integers

I'm searching for an algorithm to solve differences of the type ab-cd, where a, b, c, and d are integers at the edge of the type capacity, i.e. ab overflows or loses digits depending on the actual representation on the machine. I cannot use arbitrary precision math; one of the platforms will be a SQL database.
I consider something like decomposing the product into (a'+a'')b-(c'+c'')d and then somehow iterate the way down. But probably there is a much more efficient method or at least a clever idea how to do the decomposition. Unfortunately in most cases a,b; c,d; a,c; b,d are coprime, so reduction at least is not simple.
Any ideas?
WARNING
This method is only partially functional. There are cases that it can't solve.
Taken from your text:
I'm searching for an algorithm to solve differences of the type ab-cd,
where a, b, c, and d are integers at the edge of the type capacity,
As I understand you want to calculate (a * b) - (c * d) avoiding a numeric overflow. And you want to solve this with an algorithm.
The first thing we need to recognize is that the result of (a * b) - (c * d) may not fit in the data type. I'll not try to solve those cases.
So, I'll search for different ways to calculate "ab-cd". What I've found is this:
(a * b) - (c * d) = ((a - c) * b) - (c * (d - b))
You can re-order the variables to get different products, thereby increasing the chance of finding a form that lets you calculate the result without the dreaded numeric overflow:
((a - d) * b) - (d * (c - b))
((b - c) * a) - (c * (d - a))
((a - c) * b) - (c * (d - b))
((b - d) * c) - (b * (c - a))
((a - d) * c) - (a * (c - b))
((b - c) * d) - (b * (d - a))
((a - c) * d) - (a * (d - b))
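As a sanity check, the rearrangements listed above can be verified numerically. The Python sketch below (the function name `rearrangements` is made up) evaluates each of them and compares the result with a * b - c * d. Python integers do not overflow, so this only confirms the algebra, not the overflow behaviour.

```python
def rearrangements(a, b, c, d):
    """Evaluate each rearrangement of a*b - c*d listed above."""
    return [
        (a - d) * b - d * (c - b),
        (b - c) * a - c * (d - a),
        (a - c) * b - c * (d - b),
        (b - d) * c - b * (c - a),
        (a - d) * c - a * (c - b),
        (b - c) * d - b * (d - a),
        (a - c) * d - a * (d - b),
    ]


a, b, c, d = 243, 244, 242, 245
expected = a * b - c * d
print(all(value == expected for value in rearrangements(a, b, c, d)))
```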
Also notice that these are still differences of products, meaning that you can apply them recursively until you find one that works. For example:
Starting with:
(a * b) - (c * d)
=>
Using the transformation:
((a - d) * b) - (d * (c - b))
=>
By substitution:
(e * b) - (d * f)
=>
Rinse and repeat:
((e - f) * b) - (f * (d - b))
Of course we need to make sure we aren't going to run into a numeric overflow by doing this. Thankfully it is also possible to test if a particular product will cause a numeric overflow (without actually doing the product) with the following approach:
var max = MaxValue;
var min = MinValue;
if (a == 0 || b == 0)
{
return false;
}
else
{
var lim = a < 0 != b < 0 ? min : max;
if ((a < 0 == b < 0) == a < 0)
{
return lim / a > b;
}
else
{
return lim / a < b;
}
}
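It is also possible to test whether a particular difference a - b will cause a numeric overflow (without actually computing the difference). Note that, unlike the product test above, which returns true when the product overflows, the following returns true when the difference is safe to compute: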
var max = MaxValue;
var min = MinValue;
if (a < 0 == b < 0)
{
return true;
}
else
{
if (a < 0)
{
if (b > 0)
{
return min + b < a;
}
else
{
return min - b < a;
}
}
else
{
if (b > 0)
{
return max - b > a;
}
else
{
return max + b > a;
}
}
}
With that it is possible to pick an expression from the eight above that will allow you to calculate without the numeric overflow.
But... sometimes none of those work. And it seems that there are cases where not even their combinations work (i.e. rinse and repeat doesn't work)*. Maybe there are other identities that can complete the picture.
*: I tried using some heuristics to explore the combinations and also tried random exploration. There is a risk that I didn't pick good heuristics and didn't have "luck" with the random search. That's why I can't tell for sure.
I want to think that I've made some progress... But with respect to the original problem I've ultimately failed. Maybe I'll get back to this problem when I have more time... or maybe I'll just play video games.
The standard way I know of to address this type of issues is to do what humans do with numbers beyond one digit, which is the limit of our natural counting with fingers. We carry numbers forward.
For example, let's say the limit of numbers in your numeric calculator is 256 (2^8). To get the difference of (243*244)-(242*245), we would need to decompose the numbers into
Label | Part 1 (shifted 2 right) | Part 2 (remainder)
a     | 2                        | 43
b     | 2                        | 44
c     | 2                        | 42
d     | 2                        | 45
You'd need an array to store the individual digits of the result, or a string. I think an array is faster, but a string more convenient and visible (for debugging).
(a*b)-(c*d)
=>   (a1*b1 shift4 + a1*b2 shift2 + a2*b1 shift2 + a2*b2)
   - (c1*d1 shift4 + c1*d2 shift2 + c2*d1 shift2 + c2*d2)
=>   987654321 (right-aligned string positioning)
   +     4xxxx
   +      88xx
   +      86xx
   +      1892
   -     4xxxx
   -      90xx
   -      84xx
   -      1890
   ==========
             2
A naive implementation would work through each step independently, pushing each digit into place and carrying it forward where necessary. There are probably tomes of literature about optimizing these algorithms, such as breaking this into array slots of 2 digits each (since your register of number-limit 256 can handle the addition of 2 2-digit numbers easily).
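The decomposition in the example above can be sketched in Python as follows. This is only an illustration of the grouping by "shift": `BASE`, `split` and `diff_of_products` are names chosen here, and because Python integers never overflow, the sketch shows the arithmetic rather than a real fixed-width implementation with carries.

```python
BASE = 100  # two decimal digits per part, as in the example above


def split(n):
    """Split n into (high part, low part) so that n == high * BASE + low."""
    return divmod(n, BASE)


def diff_of_products(a, b, c, d):
    """Compute a*b - c*d from partial products, grouping terms by shift."""
    a1, a2 = split(a)
    b1, b2 = split(b)
    c1, c2 = split(c)
    d1, d2 = split(d)
    high = a1 * b1 - c1 * d1                        # "shift4" terms
    mid = a1 * b2 + a2 * b1 - c1 * d2 - c2 * d1     # "shift2" terms
    low = a2 * b2 - c2 * d2                         # unshifted terms
    return high * BASE * BASE + mid * BASE + low


print(diff_of_products(243, 244, 242, 245))  # prints 2
```

Subtracting matching partial products before combining them keeps the intermediate values small when the two products are close, which is the situation in the example.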
If your products are near the limits of Int32 you can use Int64.
You can use the BC Math functions to work with large numbers; they work on both 32-bit and 64-bit systems.
Example Of Large Numbers
$a = "4543534543543534543543543543545";
$b = "9354354546546756765756765767676";
$c = "5654656565656556565654656565656";
$d = "4556565656546546546546546356435" ;
var_dump(calculate($a, $b, $c, $d));
Output
string '257010385579862137851193415136408786476450997824338960635377204776397393100227657735978132009487561885957134796870587800' (length=120)
Function Used
function calculate($a, $b, $c, $d)
{
return bcmul(bcmul(bcmul(bcsub($a, $c),bcsub($a, $d)),bcsub($b, $c)),bcsub($b, $d));
}
After playing a little bit more I found a simpler algorithm following my original idea. It may be somewhat slower than the combined multiplication because it requires real multiplication and division instead of only shifts and addition, but I didn't benchmark it so far concerning the performance in an abstract language.
The idea is the following rewrite ab-cd = (a'+q*d)b-cd = a'b-(c-qb)d = a'b-c'd
The algorithm seems to converge the fastest if you order ab-cd so that a>b and c>d, i.e. reduce the biggest numbers and maximize q.
q=(int)floor((a>c)? a/d : c/b);
a -= q*d;
c -= q*b;
Now reorder and start again. You can finish as soon as all numbers are small enough for safe multiplication, any number becomes smaller than 2 or even negative, or you find the same value for any of the numbers on both sides.
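A single reduction step might look like the following Python sketch. It assumes all four inputs are positive integers, and the name `reduce_step` is made up for illustration. Python integers cannot overflow, so the sketch only demonstrates that the step leaves ab-cd unchanged.

```python
def reduce_step(a, b, c, d):
    """One step of the rewrite a*b - c*d = (a - q*d)*b - (c - q*b)*d.

    Assumes a, b, c and d are positive integers.
    """
    # Order each product so that its larger factor comes first.
    if a < b:
        a, b = b, a
    if c < d:
        c, d = d, c
    # Choose q as described above.
    q = a // d if a > c else c // b
    return a - q * d, b, c - q * b, d


print(reduce_step(243, 244, 242, 245))  # prints (2, 243, 2, 242)
```

In this example one step is enough: the reduced form 2*243 - 2*242 has the same value, 2, on both sides, so the result is 2*(243 - 242) = 2.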

Implement a function calculating the number of positive integers up to and including n divisible by at least one of the primes in a given array

I do not really know C++, but I need to translate this algorithm to PHP. Could you help me? The std::transform(...) line is especially unclear.
The task is:
Implement a function calculating the number of positive integers up to and including n divisible by at least one of the primes in a given array. The caller will ensure that this array is sorted and only contains unique primes, so your implementation may take advantage of these assumptions and doesn't need to check whether they actually hold true.
There is a very efficient algorithm for counting these numbers for any values of n, as long as the list of divisors remains relatively short.
#include <algorithm>
#include <functional>
#include <iostream>
#include <ostream>
#include <vector>
std::vector<signed int> gen_products_of_n_divisors(
const std::vector<signed int>::const_iterator &start,
const std::vector<signed int>::const_iterator &end,
signed int n)
{
if (n == 1)
{
return std::vector<signed int>(start, end);
}
std::vector<signed int> products;
for (std::vector<signed int>::const_iterator i = start;
i != end; ++i)
{
std::vector<signed int> sub_products =
gen_products_of_n_divisors(i + 1, end, n - 1);
products.resize(products.size() + sub_products.size());
std::transform(sub_products.begin(), sub_products.end(),
products.end() - sub_products.size(),
std::bind1st(std::multiplies<signed int>(), *i));
}
return std::vector<signed int>(products);
}
signed int count_divisibles(signed int n,
const std::vector<signed int> &divisors)
{
signed int total_count = 0;
for (signed int i = 1;
i <= static_cast<signed int>(divisors.size()); ++i)
{
std::vector<signed int> products =
gen_products_of_n_divisors(divisors.begin(),
divisors.end(), i);
signed int sign = 2 * (i % 2) - 1;
for (
std::vector<signed int>::iterator j =
products.begin();
j != products.end(); ++j)
{
total_count += sign * n / (*j);
}
}
return total_count;
}
int main()
{
std::vector<signed int> a;
a.push_back(3);
a.push_back(5);
a.push_back(7);
a.push_back(11);
a.push_back(13);
a.push_back(17);
a.push_back(19);
std::cout << count_divisibles(1000000, a) << std::endl;
}
It will be easier to understand Toolbox's std::transform reference and his or her explanation of how sub-products (products of members of subsets of the set of divisors) are formed, if you are familiar with the Inclusion–exclusion principle. In effect, sub-products that are products of an odd number of numbers add to the total number of divisors, while those that are products of an even number of numbers subtract from it. This may be more obvious in the following translation to C of the C++ program in question.
In the program, note that 1<<nDiv is 2^nDiv (with ^ denoting exponentiation here). There are 2^k subsets in the power set of a set of k elements. Each distinct subset corresponds to a distinct binary ID#. (ID#="identity number"). A set element is a member of a subset if the bit for that element is set in the ID# of the subset. The program toggles sign from -1 to 1 or from 1 to -1 to keep track of even or odd number of bits.
A real program (vs a toy demo like this) should check for overflow when it computes product in the innermost loop of count_divisibles().
// translation to C of C++ program in question
#include <stdlib.h>
#include <stdio.h>
int count_divisibles(int n, int *divisors, int nDiv) {
int total_count = 0;
int i, it, j, sign, product;
for (i=1; i < 1<<nDiv; ++i) {
product = 1;
sign = -1;
for (j=0, it=i; j<nDiv; ++j, it=it/2) {
if (it & 1) {
product *= divisors[j];
sign = -sign;
}
}
total_count += sign * n/product;
}
return total_count;
}
int main(void) {
int a[] = {3,5,7,11,13,17};
int nDiv = sizeof a / sizeof a[0];
int hi, c, k;
for (hi=1000000; hi; hi/=200) {
for (k=0; k<nDiv; ++k) {
c = count_divisibles(hi, a, k);
printf ("count_divisibles(%d, a, %d) = %6d a[%d]=%d\n",
hi, k, c, k, a[k]);
}
c = count_divisibles(hi, a, nDiv);
printf ("count_divisibles(%d, a, %d) = %6d\n", hi, nDiv, c);
}
return 0;
}
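For comparison, the same inclusion-exclusion count can be written compactly in Python. This is a sketch rather than a translation of either program above: it enumerates subsets with `itertools.combinations` instead of bit masks or recursion, and `math.prod` requires Python 3.8 or later.

```python
from itertools import combinations
from math import prod


def count_divisibles(n, primes):
    """Count integers in 1..n divisible by at least one of the given primes."""
    total = 0
    for size in range(1, len(primes) + 1):
        # Subsets of odd size are added, subsets of even size are subtracted.
        sign = 1 if size % 2 else -1
        for subset in combinations(primes, size):
            total += sign * (n // prod(subset))
    return total


print(count_divisibles(10, [2, 3]))  # prints 7 (2, 3, 4, 6, 8, 9, 10)
```

Because Python integers have arbitrary precision, the overflow concern mentioned above does not apply to this version.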

First position of true value(1) from a bit pattern

For example, if the pattern is as follows:
bit       [10010] [1011] [1000]
position   54321   4321   4321
result        2       1   4
I want to get the result from right to left position as [2] [1] [4]
If I understand your question correctly, you are looking for a function that returns the index of the least significant 1-bit in an integer. If so, check whether your platform implements the function ffs() ("find first set"). On Linux, you can do man ffs to get the full documentation. On other programming platforms the function may be named differently, e.g. in NVIDIA's CUDA, it exists as a device function __ffs().
Assuming the bit pattern is represented by an int you could do something like
if(bitPattern == 0) {
return 0;
}
int count = 1;
while(bitPattern % 2 == 0) {
bitPattern >>= 1;
count++;
}
return count;
$n = log($x & (~$x+1))/log(2)
~x + 1 is exactly the same as -x in two's complement, so why use the more complex and slower form?
There are also many bit hacks that find the integer log2 of x quickly, instead of using the much slower floating-point log as above. No slow division is needed either. Since x & -x isolates the lowest set bit, which is a power of 2, you can use the following function to get its log2:
unsigned int log2p2(unsigned int v) // 32-bit value to find the log2 of (v must be a power of 2)
{
    static const unsigned int b[] = {0xAAAAAAAA, 0xCCCCCCCC, 0xF0F0F0F0,
                                     0xFF00FF00, 0xFFFF0000};
    unsigned int r = (v & b[0]) != 0;
    int i;
    for (i = 4; i > 0; i--) // unroll for speed...
    {
        r |= ((v & b[i]) != 0) << i;
    }
    return r; // zero-based index of the single set bit
}
There are many other ways to calculate log2x which you can find here
So your code is now simply
log2p2(x & -x);
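For reference, the same idea fits in one line of Python. This sketch uses `int.bit_length()` on the isolated lowest bit, which gives the one-based position asked for in the question; the function name `first_set_position` is made up.

```python
def first_set_position(x):
    """Return the 1-based position of the lowest set bit, or 0 if x is 0."""
    # x & -x keeps only the lowest set bit; bit_length() gives its position.
    return (x & -x).bit_length()


for pattern in (0b10010, 0b1011, 0b1000):
    print(bin(pattern), first_set_position(pattern))
```

For the three patterns from the question this prints positions 2, 1 and 4.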

PHP: How to output list like this: AA, AB, AC, all the way to ZZZY, ZZZZ, ZZZZA etc

I'm trying to write a function that'll convert an integer to a string like this, but I can't figure out the logic... :(
1 = a
5 = e
27 = aa
28 = ab
etc...
Can anyone help? I'm really niffed that I can't wrap my head around how to write this... :(
Long list of them here:
/*
* Convert an integer to a string of uppercase letters (A-Z, AA-ZZ, AAA-ZZZ, etc.)
*/
function num2alpha($n)
{
for($r = ""; $n >= 0; $n = intval($n / 26) - 1)
$r = chr($n%26 + 0x41) . $r;
return $r;
}
/*
* Convert a string of uppercase letters to an integer.
*/
function alpha2num($a)
{
$l = strlen($a);
$n = 0;
for($i = 0; $i < $l; $i++)
$n = $n*26 + ord($a[$i]) - 0x40;
return $n-1;
}
I'll add this answer to sum up the comments regarding the misuse of base-26.
A common first reaction when confronted with this problem is to think "There are 26 letters, so this must be base-26! All I need to do is map each letter to its corresponding number".
But this is not base-26. It's easy to see why: there is no zero!
In base-26, the number twenty-six is the first number with two digits, and is written "10". In this counting system, twenty-six has a single digit, "Z", and the first two-digit number is twenty-seven.
But what if we make A=0, ..., Z=25? This way we have a zero and the first two-digit number becomes twenty-six. So far so good. How do we write twenty-six now? That's "AA". But... isn't A=0? Ooops! A = AA = AAA = "0" = "00" = "000".
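A numbering without a zero digit like this is sometimes called bijective base-26. A minimal Python sketch of the conversion, using lowercase letters and 1 = a as in the question (the names `to_alpha` and `from_alpha` are made up):

```python
def to_alpha(n):
    """Convert a positive integer to letters: 1 -> a, 26 -> z, 27 -> aa."""
    letters = ""
    while n > 0:
        # Shift by one so that the digits run from 1 to 26 instead of 0 to 25.
        n, remainder = divmod(n - 1, 26)
        letters = chr(ord("a") + remainder) + letters
    return letters


def from_alpha(letters):
    """Inverse of to_alpha: a -> 1, z -> 26, aa -> 27."""
    n = 0
    for ch in letters:
        n = n * 26 + (ord(ch) - ord("a") + 1)
    return n


print([to_alpha(n) for n in (1, 5, 26, 27, 28)])  # ['a', 'e', 'z', 'aa', 'ab']
```

Subtracting one before each division is what removes the need for a zero digit.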
You will have to use base_convert to convert your numbers to base 26:
base_convert(35, 10, 26);
That gives you the individual digits, each in the range 0 - p, so 35 becomes 19 (1 * 26 + 9). Then you have to map the individual digits to your desired set, so 1 => a, 9 => i, a => j, etc., and 19 becomes ai.
Well, you're pretty much converting from base 10 to base 26. Base 10 has digits 0-9, whereas base 26 can be expressed with "digits" A-Z. Conversion from base-10 is easy - see e.g. this: http://www.mathsisfun.com/base-conversion-method.html
Edit: actually, base-26 fails to account for multiple equivalent ways to write 0 ( 0 = 00 = 000).
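#include <iostream>
#include <string>
using namespace std;

// Convert a positive integer to letters: 1 -> A, 26 -> Z, 27 -> AA.
void convert(int number)
{
    string str = "";
    while(number)
    {
        char ch;
        ch = (number - 1) % 26 + 65;
        str = ch + str;
        number = (number-1) / 26;
    }
    cout << str << endl;
}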
