How can I easily encode and "compress" a URL/e-mail address to a string in PHP?
The string should be:

difficult for the user to decode
as short as possible (compressed)
different after encoding, even for similar URLs
not stored in a database
easy to decode/uncompress by a PHP script

Example (input -> output):

stackoverflow.com/1/ -> "n3uu399"
stackoverflow.com/2/ -> "ojfiejfe8"
Not very short, but you could zip it with a password and encode it using base64. Note that zip is not too safe when it comes to passwords, but it should be OK if your encrypted value is intended to have a short lifetime.
Note that whatever you do, you won't be able to generate a reasonably safe encoding unless you agree to store some inaccessible information locally. Whatever you do, take it as given that anyone can access the pseudo-encrypted data given enough time, be it by reverse engineering your algorithm, brute-forcing your passwords, or whatever else is necessary.
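For instance, a minimal sketch of that approach, assuming PHP 7.2+ with a libzip build that supports AES encryption (the function names are real, but whether encryption is available depends on your build):

function encode_url($url, $password) {
    $tmp = tempnam(sys_get_temp_dir(), 'zip');
    $zip = new ZipArchive();
    $zip->open($tmp, ZipArchive::OVERWRITE);
    $zip->addFromString('u', $url);                                   // store the URL as a single entry
    $zip->setEncryptionName('u', ZipArchive::EM_AES_256, $password);  // password-protect it
    $zip->close();
    $encoded = base64_encode(file_get_contents($tmp));
    unlink($tmp);
    return $encoded;  // long, but opaque without the password
}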
You could make your own text compression system based on common strings: if the URL starts with 'http://www.', the first character of the shortened URL is 'a'; if it starts with 'https://www.', the first character is 'b'... (repeat for popular variants); if none match, the first letter is 'z' and the URL follows in a coded pattern.
Then, if the next three letters are 'abc', the second letter is 'a', and so on. You'll need a list of which letter pairs/triplets are most common in URLs and work out the most popular 26/50 etc. (depending on which characters you want to use), and you should be able to compress the URL entirely in PHP (without using a database); see the sketch below. People will only be able to reverse it by either knowing your letter pair/triplet list (your mapping list) or by manually reverse-engineering it.
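A toy sketch of the prefix part of the idea (the table below is an illustration, not a real frequency analysis):

$prefixes = array(
    'a' => 'http://www.',
    'b' => 'https://www.',
    'c' => 'http://',
    'd' => 'https://',
);

function shorten_prefix($url, $prefixes) {
    foreach ($prefixes as $code => $prefix) {
        if (strpos($url, $prefix) === 0) {
            // known prefix: replace it with its one-letter code
            return $code . substr($url, strlen($prefix));
        }
    }
    return 'z' . $url; // no known prefix
}

Decoding is the mirror image: look at the first letter and expand it from the same table. The same trick can then be repeated on the remainder with a pair/triplet table.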
Here is a simple implementation that may or may not fulfil your needs:
Input/Output:
test@test.com -> cyUzQTEzJTNBJTIydGVzdCU0MHRlc3QuY29tJTIyJTNC -> test@test.com
http://test.com/ -> cyUzQTE2JTNBJTIyaHR0cCUzQSUyRiUyRnRlc3QuY29tJTJGJTIyJTNC -> http://test.com/
Code:
function encode($in) {
    return base64_encode(rawurlencode(serialize($in)));
}

function decode($in) {
    return unserialize(rawurldecode(base64_decode($in)));
}
shrug
You need to be more specific about your inputs and outputs and what you expect from each.
You could also use gzcompress/gzuncompress instead of serialize/unserialize, etc.
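For example, a sketch of that variant (gzcompress() shortens long, repetitive strings, while base64 keeps the binary output printable; the result only gets shorter for inputs long enough to compress well):

function encode($in) {
    return base64_encode(gzcompress($in));
}

function decode($in) {
    return gzuncompress(base64_decode($in));
}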
If you have access to a database, you could do a relational lookup, i.e. a table with two fields, one holding the original URL and the other holding the shortened URL.
To make the shortened URL you could do something like the following:
$str = "a b c d e f g h i j k l m n o p q r s t u v w x y z";
$str = explode(" ", $str);
$len = 5;
$url = '';
for ($i = 0; $i < $len; $i++) {
    $pos = rand(0, count($str) - 1);
    $url .= $str[$pos];
}
This is just an idea that I have thought up; the code isn't tested.
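For what it's worth, here is a sketch with the lookup and a uniqueness check added (the urls table and its columns are made up for illustration):

function make_short_code(PDO $db, $url, $len = 5) {
    do {
        // random lowercase code (distinct letters, good enough for a sketch)
        $code = substr(str_shuffle('abcdefghijklmnopqrstuvwxyz'), 0, $len);
        $stmt = $db->prepare('SELECT 1 FROM urls WHERE short = ?');
        $stmt->execute(array($code));
    } while ($stmt->fetch()); // retry on collision
    $db->prepare('INSERT INTO urls (original, short) VALUES (?, ?)')
       ->execute(array($url, $code));
    return $code;
}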
(Can't paste the exact question as the contest is over and I am unable to access the question. Sorry.)
Hello, recently I took part in a programming contest (PHP). I tested the code on my PC and got the desired output, but when I ran my code on the contest website and Ideone, I got the wrong output. This is the second time the same thing has happened: the same PHP code but different output.
It takes input from the command line. The purpose is to count the substrings that contain only the characters 'A', 'B', 'C', 'a', 'b', 'c'.
For example: Consider the string 'AaBbCc' as CLI input.
Substrings: A,a,B,b,C,c,Aa,AaB,AaBb,AaBbC,AaBbCc,aB,aBb,aBbC,aBbCc,Bb,BbC,BbCc,bC,bCc,Cc.
Total substrings: 21, which is the correct output.
My machine:
Windows 7 64 Bit
PHP 5.3.13 (Wamp Server)
Following is the code:
<?php
$stdin = fopen('php://stdin', 'r');
while (true) {
    $t = fread($stdin, 3);
    $t = trim($t);
    $t = (int)$t;
    while ($t--) {
        $sLen = 0;
        $subStringsNum = 0;
        $searchString = "";
        $searchString = fread($stdin, 20);
        $sLen = strlen($searchString);
        $sLen = strlen(trim($searchString));
        for ($i = 0; $i < $sLen; $i++) {
            for ($j = $i; $j < $sLen; $j++) {
                if (preg_match("/^[A-C]+$/i", substr($searchString, $i, $sLen - $j))) { $subStringsNum++; }
            }
        }
        echo $subStringsNum . "\n";
    }
    die;
}
?>
Input:
2
AaBbCc
XxYyZz
Correct Output (My PC):
21
0
Ideone/Contest Website Output:
20
0
You have to keep in mind that your code is also processing the newline characters.
On Windows systems, a newline is composed of two characters, whose escaped representation is \r\n. On UNIX systems, including Linux, only \n is used, and classic Mac OS used \r instead.
Since you are reading from standard input, the code is susceptible to those platform differences; even if it were a file, you would be enforcing the platform convention by using the flag "r" when creating the file handle instead of "rb", explicitly declaring that you don't want to read the file in binary-safe mode.
You can see in this Ideone.com version of your code how the PHP script gives the expected output when you enforce the newline symbols used by your home system, while this other version using UNIX newlines gives the "wrong" output.
I suppose you should be using fgets() to read each string separately instead of fread(), and then trim() them to remove those characters before processing.
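A minimal sketch of that fix (the substring enumeration is an equivalent start/length form of yours):

<?php
$stdin = fopen('php://stdin', 'r');
$t = (int) trim(fgets($stdin));      // first line: number of test cases
while ($t--) {
    $s = trim(fgets($stdin));        // trim() strips \n as well as \r\n
    $count = 0;
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        for ($j = $i; $j < $len; $j++) {
            if (preg_match('/^[A-C]+$/i', substr($s, $i, $j - $i + 1))) {
                $count++;
            }
        }
    }
    echo $count . "\n";
}
?>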
I tried to analyse this code, and this is what I know:
It seems there are no problems with the input strings. If there were any, it would be impossible to get the result 20.
I don't see any problem with the loops. I usually use pre-incrementation, but it shouldn't affect the result at all.
There are only two possibilities I can see that would cause the unexpected result:
One of the loop iterations isn't executed. It could only be the last run of the inner loop (when $i == 5 and then $j == 5, because that loop runs just once), which would match the difference between 21 and 20.
preg_match doesn't match the string in one of the occurrences (there are 21 preg_match checks, and one of them, possibly the last, doesn't match).
If I had to choose, I would go for the first possible cause. If I were you, I would contact the contest's author and ask about the PHP version and the possibility of testing other code. The most important thing is how many times preg_match() is launched at all, 20 or 21 (a simple echo or an extra counter would tell us that; see the sketch below), and what strings preg_match() checks. Only this way can you find out why this code doesn't work, in my opinion.
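For instance, a sketch of that instrumentation, as a drop-in for the inner loops of the original code (addcslashes() makes any stray \r or \n visible):

$calls = 0;
for ($i = 0; $i < $sLen; $i++) {
    for ($j = $i; $j < $sLen; $j++) {
        $chunk = substr($searchString, $i, $sLen - $j);
        $calls++;
        echo $calls . ': [' . addcslashes($chunk, "\r\n") . "]\n";
        if (preg_match("/^[A-C]+$/i", $chunk)) { $subStringsNum++; }
    }
}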
It would be nice if you could post here any info when you find out something more.
PS. Of course I also get the result 21, so it's hard to say what could be wrong.
My question is quite simple: I use gettext to translate URLs, so I only have the translated version of the URL string.
I would like to know if there is an easy way to get the base string back from the translated string.
What I had in mind was to automatically add the translated name to a database and alias it with the base string each time my _u($string) function is used.
What I have currently:
function _u($string)
{
    if (empty($string))
        return '';
    else
        return dgettext('Urls', $string);
}
What I was thinking about (pseudo-code):
function _u($string)
{
    if (empty($string))
        return '';
    $translation = dgettext('Urls', $string);
    MySQL: REPLACE INTO ... base = $string, translation = $translation; (translation = primary key)
    return $translation;
}

function url_base($translation)
{
    $row = SELECT ... FROM ... WHERE translation = $translation;
    return $base;
}
Although this doesn't seem to be the best possible way to do it, and if I remove the REPLACE part in production, I might miss a link or two that I never visited.
EDIT: What I am mostly looking for is the parsing part of gettext. I must not miss any of the possible URLs, so if you have another solution it would need to include a parser (based on what I'm looking for).
EDIT 2: Another difficulty has just been added: we must find the URL in any translation and map it back to the "base" language for the system to parse the URL.
Actually, the most straightforward way I can think of would be to decode the .mo files used for the translation, through a call to the msgunfmt utility.
Once you have the plaintext catalog, you can save it in any other kind of database and will then be able to do reverse searches.
But perhaps better, you could create additional domain(s) ("ReverseUrlsIT") in which to store the translated URL as key and the base string as value (provided the mapping is fully two-way, that is!).
At that point you can use dgettext to recover the base string from the translated string, provided that you know the language of the translated string.
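A sketch of that lookup, assuming you have compiled the reverse catalog (translated URL as msgid, base string as msgstr) into its own domain as described above:

function url_base($translation)
{
    if (empty($translation))
        return '';
    // dgettext() returns its argument unchanged on a miss,
    // so an unknown URL falls through as-is
    return dgettext('ReverseUrlsIT', $translation);
}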
Update
"This is the main point of using gettext and I would drop it anytime if I could find another parser/library/tool that could help with that"
The gettext family of functions, after all is said and done, is little more than a keystore database system with (maybe) a parser that is a little more powerful than printf, to handle plurals and adjective/noun inversions (violin virtuoso in English becomes virtuoso di violino in Italian).
At the cost of adding to the database complexity (and load), you can build a keystore leveraging whatever persistence layer you have handy (gettext is file-based, after all):
TABLE LanguageDomain {
    PRIMARY KEY ldId;
    varchar(?)  ldValue;
}
# e.g.
#   39  it_IT
#   44  en_GB
#   01  en_US

TABLE Shorthand {
    PRIMARY KEY shId;
    varchar(?)  shValue;
}
# e.g.
#   1  CAMERA
#   2  BED

TABLE Translation {
    KEY         t_ldId, t_shId;
    varchar(?)  t_Value;   // or one value for singular form, one for plural...
}
# e.g.
#   44 1    Camera
#   39 1    Macchina fotografica
#   01 1    Camera
#   44 2    Bed
#   39 2    Letto
#   01 2    Bed
#   01 137  Behavior
#   44 137  Behaviour    # "American and English have many things in common..."
#   01 979  Cookie
#   44 979  Biscuit      # "...except of course the language" (O. Wilde)
function translate($string, $arguments = array())
{
    GLOBAL $languageDomain, $pdo;   // $pdo: a PDO connection, assumed available
    // First recover the main string
    $stmt = $pdo->prepare(
        'SELECT t_Value FROM Translation AS t
         JOIN LanguageDomain AS l ON (t.t_ldId = l.ldId AND l.ldValue = :LangDom)
         JOIN Shorthand AS s ON (t.t_shId = s.shId AND s.shValue = :String)'
    );
    $stmt->execute(array(':LangDom' => $languageDomain, ':String' => $string));
    $Result = $stmt->fetchColumn();
    if (empty($arguments))
        return $Result;
    // Now run replacement of arguments - if any
    $replacements = array();
    foreach ($arguments as $n => $argument)
        $replacements["\${$n}"] = translate($argument);
    // Now replace '$1' with the translation of the first argument, etc.
    return str_replace(array_keys($replacements), array_values($replacements), $Result);
}
This would allow you to easily add one more language domain, and even to run queries such as "Which terms in English have not yet been translated into German?" (i.e., which rows have a NULL value when LEFT JOINing the subset of the Translation table with the English domain id against the subset with the German domain id); see the sketch below.
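A sketch of that query against the schema above (domain id 44 = en_GB follows the sample data; 49 = de_DE is an assumption added for the example):

SELECT s.shValue
FROM Shorthand AS s
JOIN Translation AS en ON (en.t_shId = s.shId AND en.t_ldId = 44)
LEFT JOIN Translation AS de ON (de.t_shId = s.shId AND de.t_ldId = 49)
WHERE de.t_Value IS NULL;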
This system is interoperable with PO files, which is important if you need to outsource the translation to someone using the standard tools of the trade. But you can just as easily output a query directly to TMX format, eliminating duplicates; in some cases this can really cut down your translation costs, since several services overcharge for input in "strange" formats such as Excel, and will either overcharge for "deduping" or charge for each duplicate as if it were an original.
<?xml version="1.0" ?>
<tmx version="1.4">
  <header
    creationtool="MySQLgetText"
    creationtoolversion="0.1-20120827"
    datatype="PlainText"
    segtype="sentence"
    adminlang="en-us"
    srclang="EN"
    o-tmf="ABCTransMem">
  </header>
  <body>
    <tu tuid="BED" datatype="plaintext">
      <tuv xml:lang="en">
        <seg>bed</seg>
      </tuv>
      <tuv xml:lang="it">
        <seg>letto</seg>
      </tuv>
    </tu>
    <tu tuid="CAMERA" datatype="plaintext">
      <tuv xml:lang="en">
        <seg>camera</seg>
      </tuv>
      <tuv xml:lang="it">
        <seg>macchina fotografica</seg>
      </tuv>
    </tu>
  </body>
</tmx>
This is going to be a nice little brainbender I think. It is a real life problem, and I am stuck trying to figure out how to implement it. I don't expect it to be a problem for years, and at that point it will be one of those "nice problems to have".
So, I have documents in my search engine index. The documents can have a number of fields; however, each field is limited to only 100kb.
I would like to store the IDs, of particular sites which have access to this document. The site id count is low, so it is never going to get up into the extremely high numbers.
So example, this document here can be accessed by sites which have an ID of 7 and 10.
Document: {
docId: "1239"
text: "Some Cool Document",
access: "7 10"
}
Now, because the "access" field is limited to 100kb, that means that if you were to take consecutive IDs, only 18917 unique IDs could be stored.
Reference:
http://codepad.viper-7.com/Qn4N0K
<?php
$ids = range(1,18917);
$ids = implode(" ", $ids);
echo mb_strlen($ids, '8bit') / 1024 . "kb";
?>
// Output
99.9951171875kb
In my application, a particular site, say with site ID 7, performs a search, and it will have access to "Some Cool Document".
So now, my question is: is there any way I could somehow fit more IDs into that field?
I've thought about proper encoding, like applying something such as a Huffman tree, but seeing as each document has different IDs, it would be impossible to apply a single encoding set to every document.
Perhaps I could use something like tokenized Roman numerals?
Anyway, I'm open to ideas.
I should add that I want to keep all IDs in the same field for as long as possible. Searching over a second field will incur a considerable performance hit, so I will only switch to using a second access2 field when I have milked the access field for as long as possible.
Edit:
Convert to Hex
<?php
function hexify(&$item) {
    $item = dechex($item);
}
$ids = range(1, 21353);
array_walk($ids, "hexify");
$ids = implode(" ", $ids);
echo mb_strlen($ids, '8bit') / 1024 . "kb";
?>
This raises the limit to 21353 consecutive IDs, an improvement of about 12.8%.
Important Caveat
I think the fact that my fields can only store UTF-encoded characters makes it next to impossible to get anything more out of it.
Where did 18917 come from? 100kb is a lot of space.
You have 100,000 or so bytes. Each byte can take one of 256 values if you store raw binary.
If you encode as hex, you get 16^100,000 possible field values, which is a very large number, and that's just hex encoding.
What about base64? You stuff 3 bytes into a 4-character space (a small loss), but you get 64 values per character, i.e. 64^100,000 possible field values. That's a big number.
You won't have any problems. Just do a simple hex encoding.
EDIT:
TL;DR
Say you use base64. Each character then carries 6 bits of information instead of the roughly 3.3 bits of a decimal digit, so you fit about 1.8 times more data in the same spot. No compression needed.
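To make that concrete, here is a sketch using fixed-width binary packing: pack('N*') stores each ID in 4 raw bytes, so after base64 each ID costs about 5.4 characters no matter how large it gets (decimal gets worse as IDs grow), and the output stays plain ASCII, so the UTF caveat is satisfied:

<?php
$ids = range(1, 18917);
$packed = base64_encode(pack('N*', ...$ids));   // PHP 5.6+: ... spreads the array
echo mb_strlen($packed, '8bit') / 1024 . "kb";
// and back:
$ids = array_values(unpack('N*', base64_decode($packed)));
?>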
How about using data compression?
$ids = range(1,18917);
$ids = implode(" ", $ids);
$ids = gzencode($ids);
echo mb_strlen($ids, '8bit') / 1024 . "kb"; // 41.435546875kb
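Decoding is the reverse, with gzdecode() pairing with gzencode(). But note the caveat above about the field only storing UTF-encoded characters: gzencode() output is raw binary, so in practice it may need a base64 wrapper, which gives back about a third of the gain:

$ids = explode(" ", gzdecode($ids));                  // back to the original list
// binary-safe variant for a text-only field:
$stored = base64_encode(gzencode(implode(" ", $ids)));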
If I have a list of Katakana names, what is the best way to sort them?
Also, is it more common to sort names based on {first name} {last name} or {last name} {first name}?
Another question: how do we get the first-character Hiragana representation of a Katakana name, like how the iPhone's contact list is sorted?
Thanks.
In Japan it is common (if not expected) that a person's first name appear after their surname when written: {last} {first}. But this would also depend on the context. In a less formal context it would be acceptable for a name to appear {first} {last}.
http://en.wikipedia.org/wiki/Japanese_name
Not that it matters, but why would the names of individuals be written in Katakana and not in the traditional Kanji?
I think it's
sort($array, SORT_LOCALE_STRING);
Provide more information if that's not your case.
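A sketch, assuming the OS actually ships a Japanese locale under that name (SORT_LOCALE_STRING compares with the current LC_COLLATE setting, so it must be set first):

setlocale(LC_COLLATE, 'ja_JP.UTF-8');
sort($array, SORT_LOCALE_STRING);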
This answer talks about using the system locale to sort Unicode strings in PHP. Besides threading issues, it is also dependent on your vendor having supplied you with a correct locale for what you want to use. I’ve had so much trouble with that particular issue that I’ve given up using vendor locales altogether.
If you’re worried about different pronunciations of Unihan ideographs, then you probably need access to the Unihan database — or its moral equivalent. A smaller subset may suffice.
For example, I know that in Perl, the JIS X 0208 standard is used when the Japanese "ja" locale is selected in the constructor for Unicode::Collate::Locale. This doesn’t depend on the system locale, so you can rely on it.
I’ve also had good luck in Perl with Lingua::JA::Romanize::Japanese, as that’s somewhat friendlier to use than accessing Unicode::Unihan directly.
Back to PHP. This article observes that you can’t get PHP to sort Japanese correctly.
I’ve taken his set of strings and run it through Perl’s sort, and I indeed get a different answer than he gets. If I use the default or English locale, I get in Perl what he gets in PHP. But if I use the Japanese locale for the collation module — which has nothing to do with the system locale and is completely thread-safe — then I get a rather different result. Watch:
JA Sort = EN Sort
------------------------------------------------------------
Java Java
NVIDIA NVIDIA
Windows ファイウォール Windows ファイウォール
インターネット オプション インターネット オプション
キーボード キーボード
システム システム
タスク タスク
フォント フォント
プログラムの追加と削除 プログラムの追加と削除
マウス マウス
メール メール
音声認識 ! 地域と言語オプション
画面 ! 日付と時刻
管理ツール ! 画面
自動更新 ! 管理ツール
地域と言語オプション ! 自動更新
電源オプション 電源オプション
電話とモデムのオプション 電話とモデムのオプション
日付と時刻 ! 音声認識
I don’t know whether this will help you at all, because I don’t know how to get at the Perl bits from PHP (can you?), but here is the program that generates that. It uses a couple of non-standard modules installed from CPAN to do its business.
#!/usr/bin/env perl
#
# jsort - demo showing how Perl sorts Japanese in a
#         different way than PHP does.
#
# Data taken from http://www.localizingjapan.com/blog/2011/02/13/sorting-in-japanese-—-an-unsolved-problem/
#
# Program by Tom Christiansen <tchrist@perl.com>
# Saturday, April 9th, 2011

use utf8;
use 5.10.1;
use strict;
use autodie;
use warnings;
use open qw[ :std :utf8 ];

use Unicode::Collate::Locale;
use Unicode::GCString;

binmode(DATA, ":utf8");
my @data = <DATA>;
chomp @data;

my $ja_sorter = new Unicode::Collate::Locale locale => "ja";
my $en_sorter = new Unicode::Collate::Locale locale => "en";

my @en_data = $en_sorter->sort(@data);
my @ja_data = $ja_sorter->sort(@data);

my $gap   = 8;
my $width = 0;
for my $datum (@data) {
    my $columns = width($datum);
    $width = $columns if $columns > $width;
}
my $bar = "-" x ( 2 + 2 * $width + $gap );
$width = -($width + $gap);

say justify($width => "JA Sort"), "= ", "EN Sort";
say $bar;
for my $i ( 0 .. $#data ) {
    my $same = $ja_data[$i] eq $en_data[$i] ? " " : "!";
    say justify($width => $ja_data[$i]), $same, " ", $en_data[$i];
}

sub justify {
    my($len, $str) = @_;
    my $alen = abs($len);
    my $cols = width($str);
    my $spacing = ($alen > $cols) && " " x ($alen - $cols);
    return ($len < 0)
        ? $str . $spacing
        : $spacing . $str
}

sub width {
    return 0 unless @_;
    my $str = shift();
    return 0 unless length $str;
    return Unicode::GCString->new($str)->columns;
}

__END__
__END__
システム
画面
Windows ファイウォール
インターネット オプション
キーボード
メール
音声認識
管理ツール
自動更新
日付と時刻
タスク
プログラムの追加と削除
フォント
電源オプション
マウス
地域と言語オプション
電話とモデムのオプション
Java
NVIDIA
Hope this helps. It shows that it is, at least theoretically, possible.
EDIT
This answer from How can I use Perl libraries from PHP? references a PHP package to do that for you. So if you don’t find a PHP library with the needed Japanese sorting support, you should be able to use the Perl module. The only one you need is Unicode::Collate::Locale. It comes standard as of release 5.14 (really 5.13.4, but that’s a development version), but you can always install it from CPAN if you have an earlier version of Perl.
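If you end up shelling out instead, a crude bridge might look like this, assuming the Perl program above is adapted to read its strings from STDIN rather than the DATA section and to print the ja-sorted result one per line:

$descriptors = array(0 => array('pipe', 'r'), 1 => array('pipe', 'w'));
$proc = proc_open('perl jsort.pl', $descriptors, $pipes);
fwrite($pipes[0], implode("\n", $strings));   // hand the strings to Perl
fclose($pipes[0]);
$sorted = explode("\n", trim(stream_get_contents($pipes[1])));
fclose($pipes[1]);
proc_close($proc);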
I have a text file that reads:
9123 Bellvue Court
5931 Walnut Creek rd.
Andrew
Bailey
Chris
Drew
Earl
Fred
Gerald
Henry
Ida
Jake
Koman
Larry
Manny
Nomar
Omar
Perry
Quest
Raphael
State
Telleman
Uruvian
Vixan
Whales
Xavier
Yellow
Zebra
What I need to do is create an A-Z listing, so:
# A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
When you click on a letter, it brings up a table with only the words beginning with that letter, and only the words beginning with numbers if you click the # sign.
I was thinking of using a regular expression to accomplish this, but I don't want to create 27 different pages. So is there a way to pass the letter at the end of the URL, i.e. something like this:
http://mywebsite/directory.php?letter=A
A very simple approach:
Read in the text file:
$inputfile = file('words.txt');
Then, AFTER sanitizing the input ($letter = $_GET['letter']), you can build a regex:
$regex = '/^'.$letter.'/i';
and filter the rows you want to show:
$result = preg_grep($regex, $inputfile);
The rest is then simply a matter of outputting nice HTML (or whatever the output shall be); a combined sketch follows below.
Keep in mind: when the pages are read frequently, it is a lot faster to have the file stored in a database. You should also take a look at caching mechanisms if load becomes a problem at some point in the future.
Edit: I forgot to mention: to get the # working, you need to add a line along the lines of:
if ($letter == '#') $letter = '[0-9]';
to get the regex working again.
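Putting the pieces together, a sketch of directory.php (the file name and the single-letter whitelist fallback are assumptions):

<?php
$inputfile = file('words.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
$letter = isset($_GET['letter']) ? $_GET['letter'] : 'A';
if ($letter == '#') {
    $letter = '[0-9]';                            // digits bucket
} elseif (!preg_match('/^[A-Za-z]$/', $letter)) {
    $letter = 'A';                                // reject anything but a single letter
}
foreach (preg_grep('/^' . $letter . '/i', $inputfile) as $word) {
    echo htmlspecialchars($word) . "<br>\n";
}
?>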
Yes.
You can access that variable to determine what to filter on, along these lines:
$letter = $_GET["letter"];
$arrayCount = preg_match_all('/^' . $letter . '.*$/m', $textFileContents, $matches);
Something like that should work (preg_match_all with the /m modifier matches every line that starts with the letter and returns the count).
That'd be mad unless you only have a few names in the file.
Unless you have to be terribly dynamic, tell cron to cache 26 text files (a.htm etc.) from your central file each hour/day.
Once a day does me; I've educated my users to understand that this is how their site behaves.
(The A-Z is created from about 10 different applications' content.)