Encoding unique IDs to a maximum document size of 100kb - php

This is going to be a nice little brain-bender, I think. It is a real-life problem, and I am stuck trying to figure out how to implement it. I don't expect it to be a problem for years, and at that point it will be one of those "nice problems to have".
So, I have documents in my search engine index. The documents can have a number of fields, but each field is limited to 100kb.
I would like to store the IDs of the particular sites which have access to each document. The site ID count is low, so it is never going to get up into the extremely high numbers.
For example, this document here can be accessed by the sites with IDs 7 and 10.
Document: {
    docId: "1239",
    text: "Some Cool Document",
    access: "7 10"
}
Now, because the "access" field is limited to 100kb, storing consecutive IDs as space-separated decimal numbers means only 18917 unique IDs fit.
Reference:
http://codepad.viper-7.com/Qn4N0K
<?php
$ids = range(1,18917);
$ids = implode(" ", $ids);
echo mb_strlen($ids, '8bit') / 1024 . "kb";
?>
// Output
99.9951171875kb
In my application, the site with ID 7 runs a search, and it will have access to "Some Cool Document".
So now, my question is: is there any way I could somehow fit more IDs into that field?
I've thought about proper encoding, and applying something like a Huffman Tree, but seeing as each document has different IDs, it would be impossible to apply a single encoding set to every document.
Perhaps I could use something like tokenized Roman numerals?
Anyway, I'm open to ideas.
I should add that I want to keep all IDs in the same field for as long as possible. Searching over a second field will have a considerable performance hit, so I will only switch to using a second access2 field once I have milked the access field for as long as possible.
Edit:
Convert to Hex
<?php
function hexify(&$item) {
    $item = dechex($item);
}
$ids = range(1, 21353);
array_walk($ids, "hexify");
$ids = implode(" ", $ids);
echo mb_strlen($ids, '8bit') / 1024 . "kb";
?>
This raises the capacity to 21353 consecutive IDs.
So that is up about 12.8%.
Important Caveat
I think the fact that my fields can only store UTF encoded characters makes it next to impossible to get anything more out of it.

Where did 18917 come from? 100kb is a lot of space.
You have 100,000 or so bytes, and each byte can hold one of 256 values if you store it as a raw number.
If you encode as hex, each character carries one of 16 values, so you get 16^100,000 possible strings, and that's just hex encoding.
What about base64? You stuff 3 bytes into a 4-byte space (a little loss), but each character carries one of 64 values, giving 64^100,000. That's a big number.
You won't have any problems. Just do a simple hex encoding.
EDIT:
TL;DR
Let's say you use base64 over raw bytes. You could fit considerably more data in the same spot. No compression needed.
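To make that concrete, here is a hedged sketch (the function names are illustrative) that packs each ID into 3 raw big-endian bytes and base64-encodes the blob, assuming IDs stay below 2^24 (~16.7M):

```php
<?php
// Sketch: pack each ID into 3 big-endian bytes, then base64 the whole blob.
// Assumes IDs stay below 2^24; the function names are illustrative.
function encodeIds(array $ids): string {
    $blob = '';
    foreach ($ids as $id) {
        $blob .= substr(pack('N', $id), 1); // drop the high byte of the 32-bit int
    }
    return base64_encode($blob); // plain ASCII, so it is safe in a UTF text field
}

function decodeIds(string $field): array {
    $ids = array();
    foreach (str_split(base64_decode($field), 3) as $chunk) {
        $ids[] = unpack('N', "\x00" . $chunk)[1]; // restore the dropped high byte
    }
    return $ids;
}

$field = encodeIds(range(1, 25000));
echo strlen($field) / 1024 . "kb"; // 4 base64 chars per ID: prints 97.65625kb
```

Unlike space-separated decimal, the cost per ID is fixed at 4 characters, so roughly 25,600 IDs fit in 100kb no matter how large the individual ID values get.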

How about using data compression?
$ids = range(1,18917);
$ids = implode(" ", $ids);
$ids = gzencode($ids);
echo mb_strlen($ids, '8bit') / 1024 . "kb"; // 41.435546875kb
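For completeness, a round-trip sketch: gzdecode() reverses gzencode(). One caveat worth hedging: the compressed output is raw binary, so if the field really only accepts UTF text you would need a further base64 pass, which gives back about a third of the savings.

```php
<?php
// Round trip: compress the space-separated list, then restore it verbatim.
$ids = implode(' ', range(1, 18917));
$packed = gzencode($ids);      // binary output, ~41kb for this input
$restored = gzdecode($packed); // exact inverse of gzencode
var_dump($ids === $restored);  // bool(true)
```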

Related

Is it possible to read a certain number of characters after a certain number of digits from the start of the file

I'm setting up a web app where users can choose the starting point and the number of characters to read from a text file containing 1 billion digits of pi.
I have looked, but I can't find any similar problems. Because I don't know in advance what the starting digit will be, I can't use the other solutions I found.
Here is the function written in Python:
def pi(left: int, right: int):
    f.seek(left + 1)
    return f.read(right)
For example, entering 700 as the starting point and 9 as the number of characters should return "Pi(700,9): 542019956".
Use fseek to move the file pointer to the position you need, and fread to read the amount of characters you need - just like your Python sample code.
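A minimal sketch of that fseek/fread approach (the +1 offset mirrors the question's Python sample; the file name is illustrative):

```php
<?php
// Sketch: seek to the requested position and read a fixed number of characters.
function pi_digits(string $path, int $left, int $right): string {
    $f = fopen($path, 'rb');
    fseek($f, $left + 1);        // move the file pointer, as in the Python sample
    $digits = fread($f, $right); // read $right characters from there
    fclose($f);
    return $digits;
}

// Toy demo with a small digits file standing in for the 1-billion-digit one.
file_put_contents('pi_digits.txt', '31415926535897932384');
echo pi_digits('pi_digits.txt', 0, 5); // prints 14159
```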
Actually, this capability is built in to file_get_contents.
$substr = file_get_contents('pi_file.txt', false, null, 700, 9);
A handy feature of that function that I learned about just now after using it for the past 7 years.

How can I write data into text file at fastest speed? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 9 years ago.
Improve this question
It takes quite a long time to randomly generate 4 billion IP addresses and write them to a text file. Do you guys have a better idea to finish this faster?
This is my code:
$ip_long = array(
    array('607649792', '608174079'),     //36.56.0.0-36.63.255.255
    array('1038614528', '1039007743'),   //61.232.0.0-61.237.255.255
    array('1783627776', '1784676351'),   //106.80.0.0-106.95.255.255
    array('2035023872', '2035154943'),   //121.76.0.0-121.77.255.255
    array('2078801920', '2079064063'),   //123.232.0.0-123.235.255.255
    array('-1950089216', '-1948778497'), //139.196.0.0-139.215.255.255
    array('-1425539072', '-1425014785'), //171.8.0.0-171.15.255.255
    array('-1236271104', '-1235419137'), //182.80.0.0-182.92.255.255
    array('-770113536', '-768606209'),   //210.25.0.0-210.47.255.255
    array('-569376768', '-564133889'),   //222.16.0.0-222.95.255.255
);
$rand_key = mt_rand(0, 9);
$handle = fopen('ip_data.dat', 'a+');
for ($i = 0; $i < 4000000000; $i++) {
    $ip = long2ip(mt_rand($ip_long[$rand_key][0], $ip_long[$rand_key][1]));
    fwrite($handle, decbin(ip2long($ip)) . "\r\n");
}
Because of the restricted ranges you identified, the total number of distinct values you can generate is considerably smaller than 4 billion (depending on the value of $rand_key, which is only evaluated once, it's never more than about 79*256*256 ≈ 5M), so you are going to get lots of duplicates. That being the case, you will be much faster if you generate an array of strings, one for each valid IP address in the range. Then pick a random string from that list and append it to a buffer string; write the buffer when it reaches a typical block size, reset it to "" and repeat.
More importantly, I question how sensible it is to use decbin: it turns your IP string into lots of ones and zeros, so a single IP address takes 32 bytes (plus \r\n, that's 34). Multiply by 4G, and you have 120G+. That's actually a lot more data than the 50 GB that @Jon was computing above...
If you store the IP address as a binary number instead there will be just four bytes per number - and you might as well leave the CRLF off at that point. It will be faster to write, faster to read. So the suggestion becomes:
Create an array with the range of valid values (this is a range of integers; you'd want to treat them as unsigned, but PHP has no unsigned type)
Pick a random value from the array (random index)
Put the value at that random index into another array (of predefined size; 2048 elements is good)
Do a binary write of this array when you filled it
Repeat
What you end up with is a file full of random binary numbers each of which represents an ip address. It's very fast (although it still involves writing a 16G file - but that's as small as it can get).
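The steps above can be sketched like this (the range choice, file name, and the 100,000 count are illustrative; a real run would loop far longer):

```php
<?php
// Sketch: write random IPs as raw 4-byte big-endian integers,
// flushing a 2048-element buffer with a single fwrite each time it fills.
$lo = ip2long('36.56.0.0');
$hi = ip2long('36.63.255.255');

$handle = fopen('ip_data.bin', 'wb');
$buffer = array();
for ($i = 0; $i < 100000; $i++) {
    $buffer[] = mt_rand($lo, $hi);
    if (count($buffer) === 2048) {
        fwrite($handle, pack('N*', ...$buffer)); // one binary write per block
        $buffer = array();
    }
}
if ($buffer) {
    fwrite($handle, pack('N*', ...$buffer)); // flush the remainder
}
fclose($handle);
// Exactly 4 bytes per address: 100,000 IPs -> a 400,000-byte file.
```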
Since you're writing 50GB of data, your disk is most likely your bottleneck. A few suggestions
Stop calling fwrite so often - build up roughly 1000 values, then fwrite them all at once, try the same with 10000 values, measure the performance
Use C... or better yet assembly
Buy a hard drive with higher RPM or solid state memory
Don't run heavy functions in loops. Appending to a variable and then writing once is much better. Note that the amount of data you write in the example is still extremely heavy, so I chose a merely "immense" number here to make timing differences show up while tweaking the function.
$ip_long = array(
    array('607649792', '608174079'),     //36.56.0.0-36.63.255.255
    array('1038614528', '1039007743'),   //61.232.0.0-61.237.255.255
    array('1783627776', '1784676351'),   //106.80.0.0-106.95.255.255
    array('2035023872', '2035154943'),   //121.76.0.0-121.77.255.255
    array('2078801920', '2079064063'),   //123.232.0.0-123.235.255.255
    array('-1950089216', '-1948778497'), //139.196.0.0-139.215.255.255
    array('-1425539072', '-1425014785'), //171.8.0.0-171.15.255.255
    array('-1236271104', '-1235419137'), //182.80.0.0-182.92.255.255
    array('-770113536', '-768606209'),   //210.25.0.0-210.47.255.255
    array('-569376768', '-564133889'),   //222.16.0.0-222.95.255.255
);
$rand_key = mt_rand(0, 9);
$ip = '';
for ($i = 0; $i < 40000; $i++) {
    $ip .= decbin(ip2long(long2ip(mt_rand($ip_long[$rand_key][0], $ip_long[$rand_key][1])))) . "\r\n";
}
$handle = fopen('ip_data.dat', 'a+');
fwrite($handle, $ip);

PHP convert 12-digit hex to 6

I am parsing an XML file supplied by some software. Part of the parsing is extracting colors from some attributes. The problem I have is that the color is a 12-digit hex value, i.e.,
<Text AdornmentStyle="0" Background="#FFFFFFFFFFFF" Color="#DD6B08C206A2" Font="Courier Final Draft" RevisionID="0" Size="12" Style="">Test</Text>
As you can see, the colors are 12 digits long. I need to get the 6-digit color so I can display it correctly in HTML.
Has anyone come across this before?
Hope you can advise.
Never seen a 12-digit hex color string before. It must be using 2 bytes per channel, which means that if you convert it, you're going to lose a bit of information.
I believe the color is in the format #RRRRGGGGBBBB, so take each group of 4 hex digits and divide by (16^4 / 16^2) = 256, rounding if necessary. That should do it.
...and if that doesn't give you the right color, try CMYK like cypher suggests: #CCCMMMYYYKKK (12 bits per channel).
e.g., to convert DD6B08C206A2 do:
0xDD6B / 0x100 = 0xDD
0x08C2 / 0x100 = 0x08
0x06A2 / 0x100 = 0x06
Put those back together and you get #DD0806.
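That per-channel division can be wrapped in a small helper; a sketch, assuming the #RRRRGGGGBBBB layout (keeping the high byte of each 16-bit channel is the same as dividing by 0x100):

```php
<?php
// Sketch: collapse a 12-digit #RRRRGGGGBBBB color to a 6-digit #RRGGBB
// by keeping the high byte of each 16-bit channel.
function hex12to6(string $color): string {
    $out = '';
    foreach (str_split(ltrim($color, '#'), 4) as $channel) {
        $out .= substr($channel, 0, 2); // high byte == value / 0x100
    }
    return '#' . strtoupper($out);
}

echo hex12to6('#DD6B08C206A2'); // prints #DD0806
```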

(PHP) randomly insert a 10 word sentence into a large text document

I have large text files, 140k or larger, full of paragraphs of text, and I need to insert a sentence into each file at random intervals, but only if the file contains more than 200 words.
The sentence I need to insert randomly throughout the larger document is 10 words long.
I have full control over the server running my LAMP site so I can use PHP or a linux command line application if one exists which would do this for me.
Any ideas of how best to tackle this would be greatly appreciated.
Thanks
Mark
You could use str_word_count() to get the number of words in the string. From there, determine if you want to insert the string or not. As for inserting it "at random," that could be dangerous. Do you mean to suggest you want to insert it in a couple random areas? If so, load the contents of the file in as an array with file() and insert your sentence anywhere between $file[0] and count($file);
The following code should do the trick to locate and insert strings into random locations. From there you would just need to re-write the file. This is a very crude way and does not take into account punctuation or anything like that, so some fine-tuning will most likely be necessary.
$save = array();
$words = str_word_count(file_get_contents('somefile.txt'), 1);
if (count($words) <= 200) {
    $save = $words;
} else {
    foreach ($words as $word) {
        $save[] = $word;
        $rand = rand(0, 1000);
        if ($rand >= 100 && $rand <= 200) {
            $save[] = 'some string';
        }
    }
}
$save = implode(' ', $save);
This generates a random number and checks if it's between 100 and 200 inclusive and, if so, puts in the random string. You can change the range of the random number and that of the check to increase or decrease how many are added. You could also implement a counter to do something like make sure there are at least x words between each string.
Again, this doesn't take into account punctuation or anything and just assumes all words are separated by spaces. So some fine tuning may be necessary to perfect it, but this should be a good starting point.

Encoding & compression of URL in PHP

How can I easily encode and "compress" a URL/e-mail address into a string in PHP?
The string should be:
- difficult to decode by the user
- as short as possible (compressed)
- similar URLs should be different after encoding
- not stored in a database
- easy to decode/uncompress by a PHP script
e.g. input -> output:
stackoverflow.com/1/ -> "n3uu399"
stackoverflow.com/2/ -> "ojfiejfe8"
Not very short but you could zip it with a password and encode it using base64. Note that zip is not too safe when it comes to passwords, but should be ok if your encrypted value is intended to have a short lifetime.
Note that whatever you do, you won't be able to generate a somewhat safe encoding unless you agree to store some inaccessible information locally. This means, whatever you do, take it as given that anyone can access the pseudo-encrypted data with enough time, be it by reverse engineering your algorithm, brute-forcing your passwords, or whatever else is necessary.
You could make your own text compression system based on common strings: if the URL starts with 'http://www.', the first character of the shortened URL is 'a'; if it starts with 'https://www.', the first character is 'b'... (repeat for popular variants); if not, the first letter is 'z' and the URL follows in a coded pattern.
Then, if the next three letters are 'abc', the second letter is 'a', etc. You'll need a list of which letter pairs/triplets are most common in URLs, work out the most popular 26/50 etc. (depending on which characters you want to use), and you should be able to do some compression on the URL entirely in PHP (without using a database). People will only be able to reverse it by knowing your pair/triplet mapping list or by manually reverse-engineering it.
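A minimal sketch of that prefix idea (the mapping table here is illustrative, not a tuned frequency list, and a real version would add the pair/triplet stage described above):

```php
<?php
// Sketch: swap well-known URL prefixes for one-character codes ('z' = no match).
$prefixes = array(
    'https://www.' => 'b', // longest prefixes first so they match before shorter ones
    'http://www.'  => 'a',
    'https://'     => 'd',
    'http://'      => 'c',
);

function shorten(string $url, array $prefixes): string {
    foreach ($prefixes as $prefix => $code) {
        if (strncmp($url, $prefix, strlen($prefix)) === 0) {
            return $code . substr($url, strlen($prefix));
        }
    }
    return 'z' . $url; // unknown prefix: pass through with a marker
}

function lengthen(string $short, array $prefixes): string {
    $map = array_flip($prefixes);
    $code = $short[0];
    $rest = substr($short, 1);
    return isset($map[$code]) ? $map[$code] . $rest : $rest;
}

echo shorten('http://www.example.com/1/', $prefixes); // prints aexample.com/1/
```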
Here is a simple implementation that may or may not fulfil your needs:
Input/Output:
test@test.com cyUzQTEzJTNBJTIydGVzdCU0MHRlc3QuY29tJTIyJTNC test@test.com
http://test.com/ cyUzQTE2JTNBJTIyaHR0cCUzQSUyRiUyRnRlc3QuY29tJTJGJTIyJTNC http://test.com/
Code:
function encode($in) {
    return base64_encode(rawurlencode(serialize($in)));
}

function decode($in) {
    return unserialize(rawurldecode(base64_decode($in)));
}
shrug
You need to be more specific about your inputs and outputs and what you expect from each.
You could also use gzcompress/gzuncompress instead of serialize/unserialize, etc.
If you have access to a database, you could do a relational lookup, i.e. have two fields: one holding the original URL and the second holding the compressed URL.
To make the second URL you could do something like the following:
$str = "a b c d e f g h i j k l m n o p q r s t u v w x y z";
$str = explode(" ", $str);
$len = 5;
$url = '';
for ($i = 0; $i < $len; $i++) {
    $pos = rand(0, count($str) - 1);
    $url .= $str[$pos];
}
This is just an idea I thought up; the code isn't tested.
