Mirc control codes to html, through php - php

I realize this has been asked before, on this very forum no less, but the proposed solution was not reliable for me.
I have been working on this for a week or more by now, and I stayed up 'till 3am yesterday working on it... But I digress, let me get to the issue at hand:
For those unaware, mIRC uses ASCII control codes to control character color, underline, weight, and italics. The codes, in hexadecimal, are 03 for color, 02 for bold, 1F for underline, 1D for italic, and 16 for reverse (white text on a black background).
As an example of the form this data is going to come in, here are two samples (written as escape sequences because those characters will not print):
\x034this text is red\x033this text is green\x03 \x02bold text\x02
\x034,3this text is red with a green background\x03
Et-cetera.
Below are the two functions I have attempted to modify for my own use, but have returned unreliable results. Before I get into that code, to be specific on 'unreliable', sometimes the code would parse, other times there would still be control codes left in the text, and I can't figure out why. Anyway;
function mirc2html($x) {
$c = array("FFF","000","00007F","009000","FF0000","7F0000","9F009F","FF7F00","FFFF00","00F800","00908F","00FFFF","0000FF","FF00FF","7F7F7F","CFD0CF");
$x = preg_replace("/\x02(.*?)((?=\x02)\x02|$)/", "<b>$1</b>", $x);
$x = preg_replace("/\x1F(.*?)((?=\x1F)\x1F|$)/", "<u>$1</u>", $x);
$x = preg_replace("/\x1D(.*?)((?=\x1D)\x1D|$)/", "<i>$1</i>", $x);
$x = preg_replace("/\x03(\d\d?),(\d\d?)(.*?)(?(?=\x03)|$)/e", "'</span><span style=\"color: #'.\$c[$1].'; background-color: #'.\$c[$2].';\">$3</span>'", $x);
$x = preg_replace("/\x03(\d\d?)(.*?)(?(?=\x03)|$)/e", "'</span><span style=\"color: #'.\$c[$1].';\">$2</span>'", $x);
//$x = preg_replace("/(\x0F|\x03)(.*?)/", "<span style=\"color: #000; background-color: #FFF;\">$2</span>", $x);
//$x = preg_replace("/\x16(.*?)/", "<span style=\"color: #FFF; background-color: #000;\">$1</span>", $x);
//$x = preg_replace("/\<\/span\>/","",$x,1);
//$x = preg_replace("/(\<\/span\>){2}/","</span>",$x);
return $x;
}
function color_rep($matches) {
$matches[2] = ltrim($matches[2], "0");
$bindings = array(0=>'white',1=>'black',2=>'blue',3=>'green',4=>'red',5=>'brown',6=>'purple',7=>'orange',8=>'yellow',9=>'lightgreen',10=>'#00908F',
11=>'lightblue',12=>'blue',13=>'pink',14=>'grey',15=>'lightgrey');
$preg = preg_match_all('/(\d\d?),(\d\d?)/',$matches[2], $col_arr);
//print_r($col_arr);
$fg = isset($bindings[$matches[2]]) ? $bindings[$matches[2]] : 'transparent';
if ($preg == 1) {
$fg = $bindings[$col_arr[1][0]];
$bg = $bindings[$col_arr[2][0]];
}
else {
$bg = 'transparent';
}
return '<span style="color: '.$fg.'; background: '.$bg.';">'.$matches[3].'</span>';
}
And, in case it is relevant, where the code is called:
$logln = preg_replace_callback("/(\x03)(\d\d?,\d\d?|\d\d?)(\s?.*?)(?(?=\x03)|$)/","color_rep",$logln);
Sources: First, Second
I've of course also attempted to look at the methods done by various php/ajax based irc clients, and there hasn't been any success there. As to doing this mirc-side, I've looked there as well, and although the results have been more reliable than php, the data sent to the server increases exponentially to the point that the socket times out on upload, so it isn't a viable option.
As always, any help in this matter would be appreciated.

You should divide the problem, for example with a tokenizer. A tokenizer will scan the input string and turn the special parts into named tokens, so the rest of your script can identify them. Usage example:
$mirc = "\x034this text is red\x033this text is green\x03 \x02bold text\x02
\x034,3this text is red with a green background\x03";
$tokenizer = new Tokenizer($mirc);
while(list($token, $data) = $tokenizer->getNext())
{
switch($token)
{
case 'color-fgbg':
printf('<%s:%d,%d>', $token, $data[1], $data[2]);
break;
case 'color-fg':
printf('<%s:%d>', $token, $data[1]);
break;
case 'color-reset':
case 'style-bold':
printf('<%s>', $token);
break;
case 'catch-all':
echo $data[0];
break;
default:
throw new Exception(sprintf('Unknown token <%s>.', $token));
}
}
This does not do much yet beyond identifying the interesting parts and their (sub-)values, as the output demonstrates:
<color-fg:4>this text is red<color-fg:3>this text is green<color-reset> <style-bold>bold text<style-bold>
<color-fgbg:4,3>this text is red with a green background<color-reset>
It should be relatively easy for you to modify the loop above and handle the states like opening/closing color and font-variant tags like bold.
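One way to handle those states is to remember the current colours and the bold flag, and to wrap each run of plain text in a span that reflects them. The following is a minimal sketch, assuming only the codes shown above (\x03 colours and \x02 bold). mircToHtml() and its inline pattern are hypothetical stand-ins for the Tokenizer class, and the palette is copied from the question's mirc2html():

```php
<?php
// Hypothetical helper: converts one mIRC-formatted string to HTML.
// Only \x03 (colour) and \x02 (bold) are handled; other codes pass through.
function mircToHtml(string $line): string
{
    // Palette taken from the question's mirc2html() function.
    $palette = ['FFF', '000', '00007F', '009000', 'FF0000', '7F0000', '9F009F', 'FF7F00',
                'FFFF00', '00F800', '00908F', '00FFFF', '0000FF', 'FF00FF', '7F7F7F', 'CFD0CF'];

    // Longest alternative first: colour (optional background), reset, bold, text.
    $pattern = '/\x03(\d{1,2})(?:,(\d{1,2}))?|\x03|\x02|[^\x02\x03]+/';
    preg_match_all($pattern, $line, $tokens, PREG_SET_ORDER);

    $fg = null;
    $bg = null;
    $bold = false;
    $html = '';

    foreach ($tokens as $t) {
        if (isset($t[1])) {                 // colour code: update state only
            $fg = (int) $t[1];
            if (isset($t[2])) {
                $bg = (int) $t[2];
            }
        } elseif ($t[0] === "\x03") {       // bare \x03 resets the colours
            $fg = $bg = null;
        } elseif ($t[0] === "\x02") {       // \x02 toggles bold
            $bold = !$bold;
        } else {                            // plain text: emit with current state
            $css = [];
            if ($fg !== null && isset($palette[$fg])) {
                $css[] = 'color:#' . $palette[$fg];
            }
            if ($bg !== null && isset($palette[$bg])) {
                $css[] = 'background-color:#' . $palette[$bg];
            }
            if ($bold) {
                $css[] = 'font-weight:bold';
            }
            $text = htmlspecialchars($t[0]);
            $html .= $css ? '<span style="' . implode(';', $css) . '">' . $text . '</span>' : $text;
        }
    }
    return $html;
}

echo mircToHtml("\x034,3this text is red with a green background\x03"), "\n";
```

Because every text run gets its own complete span, no tag is ever left open, which avoids the stray closing tags the regex-only approach has to clean up afterwards.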
The tokenizer itself defines a set of tokens and tries to find them one after the other at a certain offset (starting at the beginning of the string). The tokens are defined by regular expressions:
/**
* regular expression based tokenizer,
* first token wins.
*/
class Tokenizer
{
private $subject;
private $offset = 0;
private $tokens = array(
'color-fgbg' => '\x03(\d{1,2}),(\d{1,2})',
'color-fg' => '\x03(\d{1,2})',
'color-reset' => '\x03',
'style-bold' => '\x02',
'catch-all' => '.|\n',
);
public function __construct($subject)
{
$this->subject = (string) $subject;
}
...
As this private array shows, the tokens are simple regular expressions, and each one gets its name from its array key. That's the name used in the switch statement above.
The getNext() function looks for a token at the current offset and, if one is found, advances the offset and returns the token including all subgroup matches. Because PREG_OFFSET_CAPTURE is used, the more detailed $matches array is simplified (offsets removed), as the main routine normally does not need to know about offsets.
The principle is easy here: The first pattern wins. So you need to place the pattern that matches most (in sense of string length) on top to have this working. In your case, the largest one is the token for the foreground and background color, <color-fgbg>.
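To see why the order matters, here is a small sketch that tries the two colour patterns in both orders against the same input. firstToken() is a hypothetical helper written for this illustration only; it is not part of the Tokenizer class:

```php
<?php
// Hypothetical helper: returns the name and matched text of the first
// pattern that matches at the very start of $subject, or null.
function firstToken(array $tokens, string $subject): ?array
{
    foreach ($tokens as $name => $pattern) {
        // The "A" modifier anchors the match at the start of the subject.
        if (preg_match("~$pattern~A", $subject, $m) === 1) {
            return [$name, $m[0]];
        }
    }
    return null;
}

$longestFirst = [
    'color-fgbg' => '\x03(\d{1,2}),(\d{1,2})',
    'color-fg'   => '\x03(\d{1,2})',
];
$shortestFirst = array_reverse($longestFirst, true);

$subject = "\x034,3red on green";

[$name, $text] = firstToken($longestFirst, $subject);
printf("%s consumed %d bytes\n", $name, strlen($text)); // color-fgbg consumed 4 bytes

[$name, $text] = firstToken($shortestFirst, $subject);
printf("%s consumed %d bytes\n", $name, strlen($text)); // color-fg consumed 2 bytes
```

With the shorter pattern first, the ",3" is left behind and would later be emitted as plain text, which is the kind of leftover the question describes.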
In case no token can be found, NULL is returned. Here is the getNext() function:
...
/**
* @return array|null
*/
public function getNext()
{
if ($this->offset >= strlen($this->subject))
return NULL;
foreach($this->tokens as $name => $token)
{
if (FALSE === $r = preg_match("~$token~", $this->subject, $matches, PREG_OFFSET_CAPTURE, $this->offset))
throw new RuntimeException(sprintf('Pattern for token %s failed (regex error).', $name));
if ($r === 0)
continue;
if ($matches[0][1] !== $this->offset)
continue;
$data = array();
foreach($matches as $match)
{
list($data[]) = $match;
}
$this->offset += strlen($data[0]);
return array($name, $data);
}
return NULL;
}
...
So the tokenization of the string is now encapsulated in the Tokenizer class, and the parsing of the tokens is something you can do on your own in some other part of your application. That should make it easier for you to change the styling (HTML output, CSS-based HTML output, or something different like BBCode or Markdown) and to support new codes in the future. If something is missing, it is also easier to fix, because the cause is either an unrecognized code or a gap in the transformation.
The full example as gist: Tokenizer Example of Mirc Color and Style (bold) Codes.
Related resources:
Very rudimentary, regex based tokenizer routine example
http://www.mirc.com/colors.html
http://en.wikipedia.org/wiki/Control_key

Related

Regular expression for determining specific characteristics of a string (that is a poker hand)

I have a string in the form of "AsKcQsJd" that represents 4 cards from a deck of playing cards. The uppercase value represents the card value (in this case, Ace, King, Queen, and Jack) and the lowercase value represents the suit (in this case, spade, club, spade, diamond).
Say I have another value that tells me what suit I'm looking for. So in this case, I have:
$hand = 'AsKcQsJd';
$suit = 's';
How can I write a regular expression that checks if the hand has an Ace in it, followed by the suit, so in this case 'As' and also any other card that has the suit? Or in 'poker terms', I'm trying to determine if the hand has the 'ace high flush draw' for the suit defined as $suit.
To further explain, I need to check if any combination of the following two cards exist:
AsKs, AsQs, AsJs, AsTs,As9s,As8s,As7s,As6s,As5s,As4s,As3s,As2s
With the added complexity that these cards could occur anywhere in the hand. For example, the string could have As at the front and Ks at the end. That's why I think a regular expression is the best method for determining if the two coexist in the string.
You might use two lookaheads, one for As, and one for [^A]s, like this:
(?=.*As)(?=.*[^A]s)
https://regex101.com/r/8hkWTv/1
$suit = 's';
$re = '/(?=.*A' . $suit . ')(?=.*[^A]' . $suit . ')/';
print($re); // /(?=.*As)(?=.*[^A]s)/
print(preg_match($re, 'AsKcQsJd')); // 1
print(preg_match($re, 'AdKcQsJd')); // 0
print(preg_match($re, 'KsKcQsJd')); // 0
I'm not sure regex is the best solution but if that's your cup of tea you can do it pretty easily with alternation like this:
As.*s|s.*As
Or better yet - to capture the actual cards giving you a match:
(As).*(.s)|(.s).*(As)
These basically say - the hand has a spade followed by an ace of spades OR has ace of spades followed by any other spade. https://regex101.com/r/pdwHPQ/1
That said, I'd probably consider building a simple class to parse the hand and give you more flexibility when it comes to answering questions about what cards are present. Whether or not this is worth it really depends a lot on your app. Here's an idea:
$hand = 'AsKh4c5c9h2s';
$cards = new Cards($hand);
$spades = $cards->getCardsBySuit('s');
if (in_array('As',array_keys($spades)) && count($spades) > 1) {
// hand has ace high flush draw
echo 'yep';
}
class Cards {
private $cards = [];
public function __construct($hand) {
foreach (str_split($hand,2) as $card) {
$this->cards[$card] = [
'rank' => substr($card,0,1),
'suit' => substr($card,1,1)
];
}
}
public function getCardsBySuit($suit) {
$response = [];
foreach ($this->cards as $k => $card) {
if ($card['suit'] == $suit) {
$response[$k] = $card;
}
}
return $response;
}
}
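If the full class is more than you need, the same check can be sketched as a single function. hasAceHighFlushDraw() is a hypothetical name, and the sketch assumes every card is exactly two characters (rank, then suit), as in the question:

```php
<?php
// Hypothetical helper: true when $hand holds the ace of $suit plus at least
// one other card of the same suit. Assumes two characters per card.
function hasAceHighFlushDraw(string $hand, string $suit): bool
{
    $hasAce = false;
    $others = 0;

    foreach (str_split($hand, 2) as $card) {
        // Skip incomplete cards and cards of another suit.
        if (strlen($card) !== 2 || $card[1] !== $suit) {
            continue;
        }
        if ($card[0] === 'A') {
            $hasAce = true;
        } else {
            $others++;
        }
    }
    return $hasAce && $others > 0;
}

var_dump(hasAceHighFlushDraw('AsKcQsJd', 's')); // bool(true)
var_dump(hasAceHighFlushDraw('AdKcQsJd', 's')); // bool(false)
var_dump(hasAceHighFlushDraw('KsKcQsJd', 's')); // bool(false)
```

The three calls use the same hands as the lookahead example, so the results can be compared directly.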

Extract code from nested curly braces, including multiple internal curly braces in PHP

So, I got some ideas off here about how to do this and took on board some of the code suggestions; I have LaTeX files with components in the form
{upper}{lower} where upper could be anything from plain text to LaTeX including its own nested {} and lower could be blank or substantial latex. Desired output is a pair of PHP strings $upper and $lower that contain only the content of the two parent braces.
$upperlowerQ='some string'; // in format {upper}{lower}
$qparts=nestor($upperlowerQ);
$upper=$qparts[0];
$lower=$qparts[1];
function nestor($subject) {
$result = false;
preg_match_all('~[^{}]+|\{(?<nested>(?R)*)\}~', $subject, $matches);
foreach($matches['nested'] as $match) {
if ($match != "") {
$result[] = $match;
$nesty = nestor($match);
if ($nesty)
$result = array_merge($result,$nesty);
}
}
return $result;
}
This function works for about 95% of my data (this upper/lower splitting is called in a loop for about 1,000 times) but it is failing on a few. An example of something it fails on looks like this:
{Draw an example of a reciprocal graph in the form $y=\frac{a}{x}$}{
\begin{tikzpicture}
\begin{axis}[xmin=-8,xmax=8,ymin=-5,ymax=12,samples=50,grid=both,grid style={gray!30},xtick={-8,...,8},ytick={-5,...,12},axis x line = bottom,
axis y line = left, axis lines=middle]
\end{axis}
\end{tikzpicture}\par
%ans: smooth reciprocal function plotted.
}
which gives:
$upper as Draw an example of a reciprocal graph in the form $y=\frac{a}{x}$ (which is correct) but $lower as a, which is the numerator of the fraction in the upper part... any ideas appreciated. It is always $lower that is wrong... $upper seems correct.
For any future readers, @Jonny5's response above worked perfectly. eval.in
Added from comments
Try a regex like {((?:[^}{]+|(?R))*)} to extract only what is inside the outer { }, and use the match count returned by preg_match_all to check that exactly 2 items are matched.
$upper = ""; $lower = "";
if(preg_match_all('/{((?:[^}{]+|(?R))*)}/', $str, $out) == 2) {
$upper=$out[1][0]; $lower=$out[1][1];
}
See test at eval.in
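If the recursive pattern is hard to maintain, the same split can be sketched without a regex by tracking brace depth. This is only an illustration: topLevelGroups() is a hypothetical name, it assumes the braces are balanced, and it does not treat escaped braces (\{ or \}) or braces inside LaTeX comments specially:

```php
<?php
// Hypothetical helper: returns the contents of each top-level {...} group.
// Assumes balanced braces; escaped braces are not handled.
function topLevelGroups(string $subject): array
{
    $groups = [];
    $depth = 0;
    $start = 0;
    $length = strlen($subject);

    for ($i = 0; $i < $length; $i++) {
        if ($subject[$i] === '{') {
            if ($depth === 0) {
                $start = $i + 1;   // content begins after the outer brace
            }
            $depth++;
        } elseif ($subject[$i] === '}' && $depth > 0) {
            $depth--;
            if ($depth === 0) {
                $groups[] = substr($subject, $start, $i - $start);
            }
        }
    }
    return $groups;
}

$str = '{Draw $y=\frac{a}{x}$}{\begin{axis}[grid style={gray!30}]\end{axis}}';
$parts = topLevelGroups($str);
if (count($parts) === 2) {
    [$upper, $lower] = $parts;
    echo $upper, "\n", $lower, "\n";
}
```

Nested braces such as \frac{a}{x} only change the depth counter, so they stay inside $upper instead of leaking into $lower.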

Shortest possible encoded string with a decode possibility (shorten URL) using only PHP

I'm looking for a method that encodes a string to the shortest possible length and lets it be decodable (pure PHP, no SQL). I have a working script, but I'm unsatisfied with the length of the encoded string.
Scenario
Link to an image (it depends on the file resolution I want to show to the user):
www.mysite.com/share/index.php?img=/dir/dir/hi-res-img.jpg&w=700&h=500
Encoded link (so the user can't guess how to get the larger image):
www.mysite.com/share/encodedQUERYstring
So, basically I'd like to encode only the search query part of the URL:
img=/dir/dir/hi-res-img.jpg&w=700&h=500
The method I use right now will encode the above query string to:
y8xNt9VPySwC44xM3aLUYt3M3HS9rIJ0tXJbcwMDtQxbUwMDAA
The method I use is:
$raw_query_string = 'img=/dir/dir/hi-res-img.jpg&w=700&h=500';
$encoded_query_string = base64_encode(gzdeflate($raw_query_string));
$decoded_query_string = gzinflate(base64_decode($encoded_query_string));
How do I shorten the encoded result and still have the possibility to decode it using only PHP?
I suspect that you will need to think more about your method of hashing if you don't want it to be decodable by the user. The issue with Base64 is that a Base64 string looks like a base64 string. There's a good chance that someone that's savvy enough to be looking at your page source will probably recognise it too.
Part one:
a method that encodes an string to shortest possible length
If you're flexible on your URL vocabulary/characters, this will be a good starting place. Since gzip makes a lot of its gains using back references, there is little point as the string is so short.
Consider your example - you've only saved 2 bytes in the compression, which are lost again in Base64 padding:
Non-gzipped: string(52) "aW1nPS9kaXIvZGlyL2hpLXJlcy1pbWcuanBnJnc9NzAwJmg9NTAw"
Gzipped: string(52) "y8xNt9VPySwC44xM3aLUYt3M3HS9rIJ0tXJbcwMDtQxbUwMDAA=="
If you reduce your vocabulary size, this will naturally allow you better compression. Let's say we remove some redundant information.
Take a look at the functions:
function compress($input, $ascii_offset = 38){
$input = strtoupper($input);
$output = '';
//We can try for a 4:3 (8:6) compression (roughly), 24 bits for 4 characters
foreach(str_split($input, 4) as $chunk) {
$chunk = str_pad($chunk, 4, '=');
$int_24 = 0;
for($i=0; $i<4; $i++){
//Shift the output to the left 6 bits
$int_24 <<= 6;
//Add the next 6 bits
//Discard the leading ASCII chars, i.e make
$int_24 |= (ord($chunk[$i]) - $ascii_offset) & 0b111111;
}
//Here we take the 4 sets of 6 apart in 3 sets of 8
for($i=0; $i<3; $i++) {
$output = pack('C', $int_24) . $output;
$int_24 >>= 8;
}
}
return $output;
}
And
function decompress($input, $ascii_offset = 38) {
$output = '';
foreach(str_split($input, 3) as $chunk) {
//Reassemble the 24 bit ints from 3 bytes
$int_24 = 0;
foreach(unpack('C*', $chunk) as $char) {
$int_24 <<= 8;
$int_24 |= $char & 0b11111111;
}
//Expand the 24 bits to 4 sets of 6, and take their character values
for($i = 0; $i < 4; $i++) {
$output = chr($ascii_offset + ($int_24 & 0b111111)) . $output;
$int_24 >>= 6;
}
}
//Make lowercase again and trim off the padding.
return strtolower(rtrim($output, '='));
}
It is basically a removal of redundant information, followed by the compression of 4 bytes into 3. This is achieved by effectively having a 6-bit subset of the ASCII table. This window is moved so that the offset starts at useful characters and includes all the characters you're currently using.
With the offset I've used, you can use anything from ASCII 38 to 101 (a 6-bit window of 64 values). This gives you a resulting string of 30 bytes, that's a 9-byte (23%) compression! Unfortunately, you'll need to make it URL-safe (probably with base64), which brings it back up to 40 bytes.
I think at this point, you're pretty safe to assume that you've reached the "security through obscurity" level required to stop 99.9% of people. Let's continue though, to the second part of your question
so the user can't guess how to get the larger image
It's arguable that this is already solved with the above, but you need to pass this through a secret on the server, preferably with PHP's OpenSSL interface. The following code shows the complete usage flow of functions above and the encryption:
$method = 'AES-256-CBC';
$secret = base64_decode('tvFD4Vl6Pu2CmqdKYOhIkEQ8ZO4XA4D8CLowBpLSCvA=');
$iv = base64_decode('AVoIW0Zs2YY2zFm5fazLfg==');
$input = 'img=/dir/dir/hi-res-img.jpg&w=700&h=500';
var_dump($input);
$compressed = compress($input);
var_dump($compressed);
$encrypted = openssl_encrypt($compressed, $method, $secret, false, $iv);
var_dump($encrypted);
$decrypted = openssl_decrypt($encrypted, $method, $secret, false, $iv);
var_dump($decrypted);
$decompressed = decompress($decrypted);
var_dump($decompressed);
The output of this script is the following:
string(39) "img=/dir/dir/hi-res-img.jpg&w=700&h=500"
string(30) "<��(��tJ��#�xH��G&(�%��%��xW"
string(44) "xozYGselci9i70cTdmpvWkrYvGN9AmA7djc5eOcFoAM="
string(30) "<��(��tJ��#�xH��G&(�%��%��xW"
string(39) "img=/dir/dir/hi-res-img.jpg&w=700&h=500"
You'll see the whole cycle: compression → encryption → Base64 encode/decode → decryption → decompression. The output is about as short as you can realistically get with this approach.
Everything aside, I feel obliged to conclude this with the fact that it is theoretical only, and this was a nice challenge to think about. There are definitely better ways to achieve your desired result - I'll be the first to admit that my solution is a little bit absurd!
Instead of encoding the URL, output a thumbnail copy of the original image. Here's what I'm thinking:
1. Create a "map" for PHP by naming your pictures (the actual file names) using random characters. random_bytes() is a great place to start.
2. Embed the desired resolution within the randomized URL string from step 1.
3. Use the imagecopyresampled() function to copy the original image into the resolution you would like before outputting it to the client's device.
So for example:
Filename example (from bin2hex(random_bytes(6))): a1492fdbdcf2.jpg
Resolution desired: 800x600. My new link could look like:
http://myserver.com/?800a1492fdbdcf2600 or maybe http://myserver.com/?a1492800fdbdc600f2 or maybe even http://myserver.com/?800a1492fdbdcf2=600 depending on where I choose to embed the resolution within the link
PHP would know that the file name is a1492fdbdcf2.jpg, grab it, use the imagecopyresampled to copy to the resolution you want, and output it.
Theory
In theory we need a short input character set and a large output character set.
I will demonstrate it by the following example. We have the number 2468 as integer with 10 characters (0-9) as character set. We can convert it to the same number with base 2 (binary number system). Then we have a shorter character set (0 and 1) and the result is longer:
100110100100
But if we convert to a hexadecimal number (base 16), with a character set of 16 (0-9 and A-F), we get a shorter result:
9A4
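The two conversions can be reproduced with PHP's built-in base_convert(), which is enough for a small number like this one (it is not suitable for the arbitrarily large numbers needed further down):

```php
<?php
$decimal = '2468';

// Smaller character set (2 symbols) gives a longer result.
$binary = base_convert($decimal, 10, 2);

// Larger character set (16 symbols) gives a shorter result.
// base_convert() returns lowercase letters, hence strtoupper().
$hex = strtoupper(base_convert($decimal, 10, 16));

printf("base 10: %s (%d characters)\n", $decimal, strlen($decimal)); // base 10: 2468 (4 characters)
printf("base 2:  %s (%d characters)\n", $binary, strlen($binary));   // base 2:  100110100100 (12 characters)
printf("base 16: %s (%d characters)\n", $hex, strlen($hex));         // base 16: 9A4 (3 characters)
```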
Practice
So in your case we have the following character set for the input:
$inputCharacterSet = "0123456789abcdefghijklmnopqrstuvwxyz=/-.&";
In total 41 characters: Numbers, lower cases and the special chars = / - . &
The character set for output is a bit tricky. We want to use URL-safe characters only. I've grabbed them from here: Characters allowed in GET parameter
So our output character set is (73 characters):
$outputCharacterSet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz~-_.!*'(),$";
Numbers, lower and upper cases and some special characters.
We have more characters in our set for the output than for the input. Theory says we can shorten our input string. Check!
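How much shorter can be estimated before writing any conversion code: a string of n symbols over an alphabet of size a needs at most ceil(n * log(a) / log(b)) symbols over an alphabet of size b. A small sketch of that arithmetic, where encodedLength() is a hypothetical helper and not part of the conversion code:

```php
<?php
// Hypothetical helper: upper bound on the number of output symbols needed
// to represent $inputLength symbols of base $inputBase in base $outputBase.
function encodedLength(int $inputLength, int $inputBase, int $outputBase): int
{
    return (int) ceil($inputLength * log($inputBase) / log($outputBase));
}

$input = 'img=/dir/dir/hi-res-img.jpg&w=700&h=500';

// 39 input characters, 41-character input set, 73-character output set.
echo encodedLength(strlen($input), 41, 73), "\n"; // 34
```

The bound is reached for this input, so a result of 34 characters is the most this pair of character sets can offer for a 39-character string.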
Coding
Now we need an encode function from base 41 to base 73. For that case I don't know a PHP function. Luckily we can grab the function 'convBase' from here: Convert an arbitrarily large number from any base to any base
<?php
function convBase($numberInput, $fromBaseInput, $toBaseInput)
{
if ($fromBaseInput == $toBaseInput) return $numberInput;
$fromBase = str_split($fromBaseInput, 1);
$toBase = str_split($toBaseInput, 1);
$number = str_split($numberInput, 1);
$fromLen = strlen($fromBaseInput);
$toLen = strlen($toBaseInput);
$numberLen = strlen($numberInput);
$retval = '';
if ($toBaseInput == '0123456789')
{
$retval = 0;
for ($i = 1;$i <= $numberLen; $i++)
$retval = bcadd($retval, bcmul(array_search($number[$i-1], $fromBase), bcpow($fromLen, $numberLen-$i)));
return $retval;
}
if ($fromBaseInput != '0123456789')
$base10 = convBase($numberInput, $fromBaseInput, '0123456789');
else
$base10 = $numberInput;
if ($base10<strlen($toBaseInput))
return $toBase[$base10];
while($base10 != '0')
{
$retval = $toBase[bcmod($base10,$toLen)] . $retval;
$base10 = bcdiv($base10, $toLen, 0);
}
return $retval;
}
Now we can shorten the URL. The final code is:
$input = 'img=/dir/dir/hi-res-img.jpg&w=700&h=500';
$inputCharacterSet = "0123456789abcdefghijklmnopqrstuvwxyz=/-.&";
$outputCharacterSet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz~-_.!*'(),$";
$encoded = convBase($input, $inputCharacterSet, $outputCharacterSet);
var_dump($encoded); // string(34) "BhnuhSTc7LGZv.h((Y.tG_IXIh8AR.$!t*"
$decoded = convBase($encoded, $outputCharacterSet, $inputCharacterSet);
var_dump($decoded); // string(39) "img=/dir/dir/hi-res-img.jpg&w=700&h=500"
The encoded string has only 34 characters.
Optimizations
You can optimize the character count in several ways:
Reduce the length of the input string. Do you really need the overhead of URL parameter syntax? Maybe you can format your string as follows:
$input = '/dir/dir/hi-res-img.jpg,700,500';
This reduces the input itself and the input character set. Your reduced input character set is then:
$inputCharacterSet = "0123456789abcdefghijklmnopqrstuvwxyz/-.,";
Final output:
string(27) "E$AO.Y_JVIWMQ9BB_Xb3!Th*-Ut"
string(31) "/dir/dir/hi-res-img.jpg,700,500"
Reduce the input character set ;-). Maybe you can exclude some more characters?
You can encode the numbers to characters first. Then your input character set can be reduced by 10!
Increase your output character set. The set I gave was googled within two minutes; maybe you can use more URL-safe characters.
Security
Heads up: there is no cryptographic logic in the code. If somebody guesses the character sets, they can decode the string easily. You can shuffle the character sets (once), which makes it a bit harder for an attacker, but it is not really safe. Maybe it's enough for your use case anyway.
Reading from the previous answers and below comments, you need a solution to hide the real path of your image parser, giving it a fixed image width.
Step 1: http://www.example.com/tn/full/animals/images/lion.jpg
You can achieve a basic "thumbnailer" by taking advantage of .htaccess
RewriteEngine on
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule tn/(full|small)/(.*) index.php?size=$1&img=$2 [QSA,L]
Your PHP file:
$basedir = "/public/content/";
$filename = realpath($basedir.$_GET["img"]);
## Check that file is in $basedir
## realpath() returns false when the file does not exist
if ($filename === false || strncmp($filename, $basedir, strlen($basedir)) !== 0) die("Bad file path");
switch ($_GET["size"]) {
case "full":
$width = 700;
$height = 500;
## You can also use getimagesize() to test if the image is landscape or portrait
break;
default:
$width = 350;
$height = 250;
break;
}
## Here is your old code for resizing images.
## Note that the "tn" directory can exist and store the actual reduced images
This lets you use the URL www.example.com/tn/full/animals/images/lion.jpg to view the reduced-size image.
This has the advantage for SEO to preserve the original file name.
Step 2: http://www.example.com/tn/full/lion.jpg
If you want a shorter URL, if the number of images you have is not too much, you can use the basename of the file (e.g., "lion.jpg") and recursively search. When there is a collision, use an index to identify which one you want (e.g., "1--lion.jpg")
function matching_files($filename, $base) {
$directory_iterator = new RecursiveDirectoryIterator($base);
$iterator = new RecursiveIteratorIterator($directory_iterator);
$regex_iterator = new RegexIterator($iterator, "#" . preg_quote($filename, "#") . "\$#");
$regex_iterator->setFlags(RegexIterator::USE_KEY);
// create_function() was removed in PHP 8.0, so a closure is used instead
return array_map(function ($a) { return $a->getPathname(); }, iterator_to_array($regex_iterator, false));
}
function encode_name($filename) {
$files = matching_files(basename($filename), realpath('public/content'));
$tot = count($files);
if (!$tot)
return NULL;
if ($tot == 1)
return $filename;
return "/tn/full/" . array_search(realpath($filename), $files) . "--" . basename($filename);
}
function decode_name($filename) {
$i = 0;
if (preg_match("#^([0-9]+)--(.*)#", $filename, $out)) {
$i = $out[1];
$filename = $out[2];
}
$files = matching_files($filename, realpath('public/content'));
return $files ? $files[$i] : NULL;
}
echo $name = encode_name("gallery/animals/images/lion.jpg").PHP_EOL;
## --> returns lion.jpg
## You can use with the above solution the URL http://www.example.com/tn/lion.jpg
echo decode_name(basename($name)).PHP_EOL;
## -> returns the full path on disk to the image "lion.jpg"
Original post:
Basically, if you add some formatting in your example, your shortened URL is in fact longer:
img=/dir/dir/hi-res-img.jpg&w=700&h=500 // 39 characters
y8xNt9VPySwC44xM3aLUYt3M3HS9rIJ0tXJbcwMDtQxbUwMDAA // 50 characters
Using base64_encode will always result in longer strings. And gzcompress needs to store at least one occurrence of each distinct character, so it is not a good solution for small strings.
So doing nothing (or a simple str_rot13) is clearly the first option to consider if you want to shorten the result you had previously.
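The growth from Base64 is easy to check: every 3 input bytes become 4 output characters, padding included. A short sketch using the query string from the question:

```php
<?php
$raw = 'img=/dir/dir/hi-res-img.jpg&w=700&h=500';
$encoded = base64_encode($raw);

// Base64 maps each group of 3 bytes (the last one possibly partial) to 4 characters.
$expected = 4 * (int) ceil(strlen($raw) / 3);

printf("%d bytes in, %d characters out (formula says %d)\n",
    strlen($raw), strlen($encoded), $expected); // 39 bytes in, 52 characters out (formula says 52)
```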
You can also use a simple character replacement method of your choice:
$raw_query_string = 'img=/dir/dir/hi-res-img.jpg&w=700&h=500';
$from = "0123456789abcdefghijklmnopqrstuvwxyz&=/ABCDEFGHIJKLMNOPQRSTUVWXYZ";
// The following line is the result of str_shuffle($from)
$to = "0IQFwAKU1JT8BM5npNEdi/DvZmXuflPVYChyrL4R7xc&SoG3Hq6ks=e9jW2abtOzg";
echo strtr($raw_query_string, $from, $to) . "\n";
// Result: EDpL4MEu4MEu4NE-u5f-EDp.dmprYLU00rNLA00 // 39 characters
Reading from your comment, you really want "to prevent anyone to gets a high-resolution image".
The best way to achieve that is to generate a checksum with a private key.
Encode:
$secret = "ujoo4Dae";
$raw_query_string = 'img=/dir/dir/hi-res-img.jpg&w=700&h=500';
$encoded_query_string = $raw_query_string . "&k=" . hash("crc32", $raw_query_string . $secret);
Result: img=/dir/dir/hi-res-img.jpg&w=700&h=500&k=2ae31804
Decode:
if (preg_match("#(.*)&k=([^=]*)$#", $encoded_query_string, $out)
&& (hash("crc32", $out[1].$secret) == $out[2])) {
$decoded_query_string = $out[1];
}
This does not hide the original path, but this path has no reason to be public. Your "index.php" can output your image from the local directory once the key has been checked.
If you really want to shorten your original URL, you have to consider the acceptable characters in the original URL to be restricted. Many compression methods are based on the fact that you can use a full byte to store more than a character.
There are many ways to shorten URLs. You can look up how other services, like TinyURL, shorten their URLs. Here is a good article on hashes and shortening URLs: URL Shortening: Hashes In Practice
You can use the PHP function mhash() to apply hashes to strings (it is deprecated as of PHP 8.1; hash() is the current equivalent).
And if you scroll down to "Available Hashes" on the mhash website, you can see what hashes you can use in the function (although I would check what PHP versions have which functions): mhash - Hash Library
I think this would be better done by not obscuring at all. You could quite simply cache returned images and use a handler to provide them. This requires the image sizes to be hard coded into the PHP script. When you get new sizes, you can just delete everything in the cache as it is 'lazy loaded'.
1. Get the image from the request
This could be this: /thumbnail.php?image=img.jpg&album=myalbum. It could even be made to be anything using rewrite and have a URL like: /gallery/images/myalbum/img.jpg.
2. Check to see if a temporary version does not exist
You can do this using is_file().
3. Create it if it does not exist
Use your current resizing logic to do it, but don't output the image. Save it to the temporary location.
4. Read the temporary file contents to the stream
It pretty much just outputs it.
Here is an untested code example...
<?php
// Assuming we have a request /thumbnail.php?image=img.jpg&album=myalbum
// These are placeholder file names and locations. You need to adapt them to your system.
$image = basename($_GET['image']); // The file name (basename() strips any directory part)
$album = basename($_GET['album']); // The album
$temp_folder = sys_get_temp_dir(); // Temporary directory to store images
// (this should really be a specific cache path)
$image_gallery = "images"; // Root path to the image gallery
$width = 700;
$height = 500;
$real_path = "$image_gallery/$album/$image";
$temp_path = "$temp_folder/$album/$image";
if(!is_file($temp_path))
{
// Read in the image
$contents = file_get_contents($real_path);
// Resize however you are doing it now.
$thumb_contents = resizeImage($contents, $width, $height);
// Write to the temporary file
file_put_contents($temp_path, $thumb_contents);
}
$type = 'image/jpeg';
header('Content-Type:' . $type);
header('Content-Length: ' . filesize($temp_path));
readfile($temp_path);
?>
Short words about "security"
You simply won't be able to secure your link if there is no "secret password" stored somewhere: as long as the URI carries all information to access your resource, then it will be decodable and your "custom security" (they are opposite words btw) will be broken easily.
You can still put a salt in your PHP code (like $mysalt="....long random string...") since I doubt you want an eternal security (such approach is weak because you cannot renew the $mysalt value, but in your case, a few years security sounds sufficient, since anyway, a user can buy one picture and share it elsewhere, breaking any of your security mechanism).
If you want to have a safe mechanism, use a well-known one (as a framework would carry), along with authentication and user rights management mechanism (so you can know who's looking for your image, and whether they are allowed to).
Security has a cost. If you don't want to afford its computing and storing requirements, then forget about it.
Secure by signing the URL
If you want to prevent users from easily bypassing the check and getting the full-resolution picture, you can simply sign the URI (but really, for safety, use something that already exists instead of the quick draft example below):
$salt = '....long random string...';
$params = array('img' => '...', 'h' => '...', 'w' => '...');
$p = http_build_query($params);
// hash_hmac() is swapped in for the original password_hash() call here:
// bcrypt only accepts a cost of 4-31, and the 'salt' option is ignored since PHP 8.0
$check = hash_hmac('sha256', $p, $salt);
$uri = http_build_query(array_merge($params, array('sig' => $check)));
Decoding:
$sig = (isset($_GET['sig']) && is_string($_GET['sig'])) ? $_GET['sig'] : '';
$params = $_GET;
unset($params['sig']);
// Same as previous
$salt = '....long random string...';
$p = http_build_query($params);
$check = hash_hmac('sha256', $p, $salt);
// hash_equals() compares in constant time
if (!hash_equals($check, $sig)) throw new DomainException('Invalid signature');
See hash_hmac and hash_equals
Shorten smartly
"Shortening" with a generic compression algorithm is useless here because the headers will be longer than the URI, so it will almost never shorten it.
If you want to shorten it, be smart: don't give the relative path (/dir/dir) if it's always the same (or give it only if it's not the main one). Don't give the extension if it's always the same (or give it when it's not png if almost everything is in png). Don't give the height because the image carries the aspect ratio: you only need the width. Give it in x100px if you do not need a pixel-accurate width.
A lot has been said about how encoding doesn't help security, so I am just concentrating on the shortening and aesthetics.
Rather than thinking of it as a string, you could consider it as three individual components. Then if you limit your code space for each component, you can pack things together a lot smaller.
E.g.,
path - Only consisting of the 26 characters (a-z) and / - . (Variable length)
width - Integer (0 - 65k) (Fixed length, 16 bits)
height - Integer (0 - 65k) (Fixed length, 16 bits)
I'm limiting the path alphabet to a maximum of 31 distinct characters, so we can use five-bit groupings (the remaining value, 0, is reserved as a null).
Pack your fixed length dimensions first, and append each path character as five bits. It might also be necessary to add a special null character to fill up the end byte. Obviously you need to use the same dictionary string for encoding and decoding.
See the code below.
This shows that by limiting what you encode and how much you can encode, you can get a shorter string. You could make it even shorter by using only 12 bit dimension integers (max 4095), or even removing parts of the path if they are known such as base path or file extension (see last example).
<?php
function encodeImageAndDimensions($path, $width, $height) {
$dictionary = str_split("abcdefghijklmnopqrstuvwxyz/-."); // Maximum 31 characters, please
if ($width >= pow(2, 16)) {
throw new Exception("Width value is too high to encode with 16 bits");
}
if ($height >= pow(2, 16)) {
throw new Exception("Height value is too high to encode with 16 bits");
}
// Pack width, then height first
$packed = pack("nn", $width, $height);
$path_bits = "";
foreach (str_split($path) as $ch) {
$index = array_search($ch, $dictionary, true);
if ($index === false) {
throw new Exception("Cannot encode character outside of the allowed dictionary");
}
$index++; // Add 1 due to index 0 meaning NULL rather than a.
// Work with a bit string here rather than using complicated binary bit shift operators.
$path_bits .= str_pad(base_convert($index, 10, 2), 5, "0", STR_PAD_LEFT);
}
// Remaining space left?
$modulo = (8 - (strlen($path_bits) % 8)) %8;
if ($modulo >= 5) {
// There is space for a null character to fill up to the next byte
$path_bits .= "00000";
$modulo -= 5;
}
// Pad with zeros
$path_bits .= str_repeat("0", $modulo);
// Split in to nibbles and pack as a hex string
$path_bits = str_split($path_bits, 4);
$hex_string = implode("", array_map(function($bit_string) {
return base_convert($bit_string, 2, 16);
}, $path_bits));
$packed .= pack('H*', $hex_string);
return base64_url_encode($packed);
}
function decodeImageAndDimensions($str) {
$dictionary = str_split("abcdefghijklmnopqrstuvwxyz/-.");
$data = base64_url_decode($str);
$decoded = unpack("nwidth/nheight/H*path", $data);
$path_bit_stream = implode("", array_map(function($nibble) {
return str_pad(base_convert($nibble, 16, 2), 4, "0", STR_PAD_LEFT);
}, str_split($decoded['path'])));
$five_pieces = str_split($path_bit_stream, 5);
$real_path_indexes = array_map(function($code) {
return base_convert($code, 2, 10) - 1;
}, $five_pieces);
$real_path = "";
foreach ($real_path_indexes as $index) {
if ($index == -1) {
break;
}
$real_path .= $dictionary[$index];
}
$decoded['path'] = $real_path;
return $decoded;
}
// These do a bit of magic to get rid of the double equals sign and obfuscate a bit. It could save an extra byte.
function base64_url_encode($input) {
$trans = array('+' => '-', '/' => ':', '*' => '$', '=' => 'B', 'B' => '!');
return strtr(str_replace('==', '*', base64_encode($input)), $trans);
}
function base64_url_decode($input) {
$trans = array('-' => '+', ':' => '/', '$' => '*', 'B' => '=', '!' => 'B');
return base64_decode(str_replace('*', '==', strtr($input, $trans)));
}
// Example usage
$encoded = encodeImageAndDimensions("/dir/dir/hi-res-img.jpg", 700, 500);
var_dump($encoded); // string(27) "Arw!9NkTLZEy2hPJFnxLT9VA4A$"
$decoded = decodeImageAndDimensions($encoded);
var_dump($decoded); // array(3) { ["width"] => int(700) ["height"] => int(500) ["path"] => string(23) "/dir/dir/hi-res-img.jpg" }
$encoded = encodeImageAndDimensions("/another/example/image.png", 4500, 2500);
var_dump($encoded); // string(28) "EZQJxNhc-iCy2XAWwYXaWhOXsHHA"
$decoded = decodeImageAndDimensions($encoded);
var_dump($decoded); // array(3) { ["width"] => int(4500) ["height"] => int(2500) ["path"] => string(26) "/another/example/image.png" }
$encoded = encodeImageAndDimensions("/short/eg.png", 300, 200);
var_dump($encoded); // string(19) "ASwAyNzQ-VNlP2DjgA$"
$decoded = decodeImageAndDimensions($encoded);
var_dump($decoded); // array(3) { ["width"] => int(300) ["height"] => int(200) ["path"] => string(13) "/short/eg.png" }
$encoded = encodeImageAndDimensions("/very/very/very/very/very-hyper/long/example.png", 300, 200);
var_dump($encoded); // string(47) "ASwAyN2LLO7FlndiyzuxZZ3Yss8Rm!ZbY9x9lwFsGF7!xw$"
$decoded = decodeImageAndDimensions($encoded);
var_dump($decoded); // array(3) { ["width"] => int(300) ["height"] => int(200) ["path"] => string(48) "/very/very/very/very/very-hyper/long/example.png" }
$encoded = encodeImageAndDimensions("only-file-name", 300, 200);
var_dump($encoded); //string(19) "ASwAyHuZnhksLxwWlA$"
$decoded = decodeImageAndDimensions($encoded);
var_dump($decoded); // array(3) { ["width"] => int(300) ["height"] => int(200) ["path"] => string(14) "only-file-name" }
In your question you state that it should be pure PHP and not use a database, and there should be a possibility to decode the strings. So bending the rules a bit:
The way I am interpreting this question is that we don't care that much about security, but we do want the shortest hashes that lead back to images.
We can also take "decode possibility" with a pinch of salt by using a one-way hashing algorithm.
We can store the hashes inside a JSON object, then store the data in a file, so all we have to do at the end of the day is string matching:
```
class FooBarHashing {
private $hashes;
private $handle;
/**
* In production this should be outside the web root
* to stop pesky users downloading it and getting hold of all the keys.
*/
private $file_name = './my-image-hashes.json';
public function __construct() {
$this->hashes = $this->get_hashes();
}
public function get_hashes() {
// Create the file if it does not exist yet (touch() leaves no open handle behind).
if (! file_exists($this->file_name)) {
touch($this->file_name);
}
$this->handle = fopen($this->file_name, "r");
$hashes = [];
if (filesize($this->file_name) > 0) {
$contents = fread($this->handle, filesize($this->file_name));
// Decode to an associative array; fall back to an empty list on invalid JSON
$hashes = json_decode($contents, true) ?: [];
}
return $hashes;
}
public function __destruct() {
// Close the file handle
fclose($this->handle);
}
private function update() {
// LOCK_EX stops two requests from interleaving their writes
$res = file_put_contents($this->file_name, json_encode($this->hashes), LOCK_EX);
if (false === $res) {
//throw new Exception('Could not write to file');
return false;
}
return true;
}
public function add_hash($image_file_name) {
$new_hash = md5($image_file_name, false);
if (! in_array($new_hash, array_keys($this->hashes) ) ) {
$this->hashes[$new_hash] = $image_file_name;
return $this->update();
}
//throw new Exception('File already exists');
}
public function resolve_hash($hash_string='') {
if (in_array($hash_string, array_keys($this->hashes))) {
return $this->hashes[$hash_string];
}
//throw new Exception('File not found');
}
}
```
Usage example:
<?php
// Include our class
require_once('FooBarHashing.php');
$hashing = new FooBarHashing;
// You will need to add the query string you want to resolve first.
$hashing->add_hash('img=/dir/dir/hi-res-img.jpg&w=700&h=500');
// Then when the user requests the hash the query string is returned.
echo $hashing->resolve_hash('65992be720ea3b4d93cf998460737ac6');
So the end result is a string that is only 32 chars long, which is way shorter than the 52 we had before.
From the discussion in the comments section it looks like what you really want is to protect your original high-resolution images.
Having that in mind, I'd suggest to actually do that first using your web server configuration (e.g., Apache mod_authz_core or Nginx ngx_http_access_module) to deny access from the web to the directory where your original images are stored.
Note that the server will only deny access to your images from the web, but you will still be able to access them directly from your PHP scripts. Since you are already displaying images using some "resizer" script, I'd suggest putting a hard limit there and refusing to resize images to anything bigger than that (e.g., something like $width = min(1000, $_GET['w'])).
I know this does not answer your original question, but I think this would be the right solution to protect your images. And if you still want to obfuscate the original name and resizing parameters, you can do that however you see fit without worrying that someone might figure out what’s behind it.
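The hard limit mentioned above could look something like this minimal sketch. MAX_WIDTH, the default of 700 and the clampWidth helper are assumptions made for illustration, not part of any existing resizer script.

```php
<?php
// Hypothetical upper bound for any resized image, in pixels.
const MAX_WIDTH = 1000;

function clampWidth($requested, int $default = 700): int
{
    // Accept only plain positive integers; anything else falls back to the default.
    $w = filter_var(
        $requested,
        FILTER_VALIDATE_INT,
        array('options' => array('min_range' => 1))
    );
    if ($w === false) {
        return $default;
    }
    // Never resize to anything wider than the hard limit.
    return min(MAX_WIDTH, $w);
}
```

In the resizer script this would be called as $width = clampWidth($_GET['w'] ?? null); so that a missing or malformed parameter never reaches the image functions.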
I'm afraid you won't be able to shorten the query string better than any known
compression algorithm. As mentioned in other answers, a compressed
version will be shorter than the original by only a few (around 4-6) characters.
Moreover, the original string can be decoded relatively easily (as opposed to decoding SHA-1 or MD5, for instance).
I suggest shortening URLs by means of Web server configuration. You might
shorten it further by replacing image path with an ID (store ID-filename
pairs in a database).
For example, the following Nginx configuration accepts
URLs like /t/123456/700/500/4fc286f1a6a9ac4862bdd39a94a80858, where
the first number (123456) is supposed to be an image ID from database;
700 and 500 are image dimensions;
the last part is an MD5 hash protecting from requests with different dimensions.
# Adjust maximum image size
# image_filter_buffer 5M;
server {
listen 127.0.0.13:80;
server_name img-thumb.local;
access_log /var/www/img-thumb/logs/access.log;
error_log /var/www/img-thumb/logs/error.log info;
set $root "/var/www/img-thumb/public";
# /t/image_id/width/height/md5
location ~* "(*UTF8)^/t/(\d+)/(\d+)/(\d+)/([a-zA-Z0-9]{32})$" {
include fastcgi_params;
fastcgi_pass unix:/tmp/php-fpm-img-thumb.sock;
fastcgi_param QUERY_STRING image_id=$1&w=$2&h=$3&hash=$4;
fastcgi_param SCRIPT_FILENAME /var/www/img-thumb/public/t/resize.php;
image_filter resize $2 $3;
error_page 415 = /empty;
break;
}
location = /empty {
empty_gif;
}
location / { return 404; }
}
The server accepts only URLs of specified pattern, forwards request to /public/t/resize.php script with modified query string, then resizes the image generated by PHP with the image_filter module. In case of error, returns an empty GIF image.
The image_filter is optional, and it is included only as an example. Resizing can be performed fully on PHP side. With Nginx, it is possible to get rid of PHP part, by the way.
The PHP script is supposed to validate the hash as follows:
// Store this in some configuration file.
$salt = '^sYsdfc_sd&9wa.';
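$w = $_GET['w'];
$h = $_GET['h'];
// $image_id must be defined before it is used in the hash
$image_id = (int)$_GET['image_id'];
$true_hash = md5($w . $h . $salt . $image_id);
// hash_equals() compares the two strings in constant time
if (!hash_equals($true_hash, (string)$_GET['hash'])) {
die('invalid hash');
}
$filename = fetch_image_from_database($image_id);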
$img = imagecreatefrompng($filename);
header('Content-Type: image/png');
imagepng($img);
imagedestroy($img);
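For completeness, the page that links to the thumbnail has to build the URL with the same hash. A minimal sketch, assuming a hypothetical thumbUrl helper and the same salt as in the validation script:

```php
<?php
// Same value as in the validating script; inlined here only for the sketch.
$salt = '^sYsdfc_sd&9wa.';

function thumbUrl(int $image_id, int $w, int $h, string $salt): string
{
    // The fields must be concatenated in the same order as in the validation code.
    $hash = md5($w . $h . $salt . $image_id);
    return "/t/$image_id/$w/$h/$hash";
}

echo thumbUrl(123456, 700, 500, $salt), "\n";
```

Because width and height are concatenated without a separator, 700 and 500 hash the same as 70 and 0500. Adding a delimiter between the fields on both sides would close that gap.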
I don't think the resulting URL can be shortened much more than in your own example. But I suggest a few steps to obfuscate your images better.
First I would remove everything you can from the base URL you are zipping and Base64 encoding, so instead of
img=/dir/dir/hi-res-img.jpg&w=700&h=500
I would use
s=hi-res-img.jpg,700,500,062c02153d653119
Where those last 16 chars are a hash to validate that the URL being opened is the same one you offered in your code, and that the user is not trying to trick the high-resolution image out of the system.
Your index.php that serves the images would start like this:
function myHash($sRaw) { // returns a 16-characters dual hash
return hash('adler32', $sRaw) . strrev(hash('crc32', $sRaw));
} // These two hash algorithms are suggestions; there are more for you to choose from.
// s=hi-res-img.jpg,700,500,062c02153d653119
$aParams = explode(',', $_GET['s']);
if (count($aParams) != 4) {
die('Invalid call.');
}
list($sFileName, $iWidth, $iHeight, $sHash) = $aParams;
$sRaw = session_id() . $sFileName . $iWidth . $iHeight;
if ($sHash != myHash($sRaw)) {
die('Invalid hash.');
}
After this point you can send the image, since the user opening it had a valid link.
Note that using session_id as part of the raw string that makes the hash is optional, but it would make it impossible for users to share a valid URL, as the URL would be bound to the session. If you want the URLs to be shareable, just remove session_id from that call.
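The code above only validates a link. The side that generates the link could be sketched as follows; buildImageParam is a hypothetical helper, myHash is repeated so the block runs on its own, and file names are assumed to contain no commas.

```php
<?php
function myHash($sRaw) { // returns a 16-character dual hash
    return hash('adler32', $sRaw) . strrev(hash('crc32', $sRaw));
}

// Hypothetical helper: builds the value of the "s" parameter.
// Pass session_id() as the last argument to bind the link to a session.
function buildImageParam($sFileName, $iWidth, $iHeight, $sSessionId = '')
{
    // Same concatenation order as in the validating script.
    $sRaw = $sSessionId . $sFileName . $iWidth . $iHeight;
    return implode(',', array($sFileName, $iWidth, $iHeight, myHash($sRaw)));
}

echo 'index.php?s=' . buildImageParam('hi-res-img.jpg', 700, 500), "\n";
```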
I would wrap the resulting URL the same way you already do, zip + Base64. The result would be even bigger than your version, but more difficult to see through the obfuscation, and therefore protecting your images from unauthorised downloads.
If you want only to make it shorter, I do not see a way of doing it without renaming the files (or their folders), or without the use of a database.
The file database solution proposed will surely create problems of concurrency - unless you always have no or very few people using the system simultaneously.
You say that you want the size in the URL so that, if you decide some day that the preview images are too small, you can increase it. The solution here is to hard-code the image size into the PHP script and eliminate it from the URL.
If you want to change the size in the future, change the hardcoded values in the PHP script (or in a config.php file that you include into the script).
You've also said that you are already using files to store image data as a JSON object, like: name, title, description. Exploiting this, you don't need a database and can use the JSON file name as the key for looking up the image data.
When the user visits a URL like this:
www.mysite.com/share/index.php?ax9v
You load ax9v.json from the location you are already storing the JSON files, and within that JSON file the image's real path is stored. Then load the image, resize it according to the hardcoded size in your script and send it to the user.
Drawing from the conclusions in URL Shortening: Hashes In Practice, to get the smallest search string part of the URL you would need to iterate valid character combinations as new files are uploaded (e.g., the first one is "AAA" then "AAB", "AAC", etc.) instead of using a hashing algorithm.
Your solution would then have only three characters in the string for the first 238,328 photos you upload (62^3, using a-z, A-Z and 0-9).
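One way to iterate the combinations is to keep a running upload counter and write it in base 62. The sketch below is only an illustration of that idea; CODE_ALPHABET, idToCode and codeToId are hypothetical names, and the counter itself would have to be stored somewhere (a file, for instance).

```php
<?php
// 62 symbols, ordered so that counter 0 maps to "AAA", 1 to "AAB", and so on.
const CODE_ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';

function idToCode(int $n, int $minLength = 3): string
{
    $alphabet = CODE_ALPHABET;
    $base = strlen($alphabet);
    $code = '';
    do {
        // Take the least significant base-62 digit and prepend it.
        $code = $alphabet[$n % $base] . $code;
        $n = intdiv($n, $base);
    } while ($n > 0);
    // Left-pad with the zero symbol so early codes are still three characters.
    return str_pad($code, $minLength, $alphabet[0], STR_PAD_LEFT);
}

function codeToId(string $code): int
{
    $alphabet = CODE_ALPHABET;
    $base = strlen($alphabet);
    $n = 0;
    // Assumes $code only contains symbols from the alphabet.
    foreach (str_split($code) as $ch) {
        $n = $n * $base + strpos($alphabet, $ch);
    }
    return $n;
}
```

The code only grows to four characters once the counter reaches 238,328.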
I had started to prototype a PHP solution on PhpFiddle, but the code disappeared (don't use PhpFiddle).

PHP: How to break a string by words within a character limit and near line breaks

I am using a terrible wrapper of PDFLib that doesn't handle the problem PDFLib has with cells that exceed the character limit (which is around 1600 characters per cell).
So I need to break a large paragraph into smaller strings that fit neatly into the cells, without breaking up words, and as close to the end of the line as possible.
I am completely stumped about how to do this efficiently (I need it to run in a reasonable amount of time).
Here is my code, which cuts the block up into substrings based on character length alone, ignoring the word and line requirements I stated above:
SPE_* functions are static functions from the wrapper class,
SetNextCellStyle calls are used to draw a box around the outline of the cells
BeginRow is required to start a row of text.
EndRow is required to end a row of text, it must be called after BeginRow, and if the preset number of columns is not completely filled, an error is generated.
AddCell adds the string to the second parameter number of columns.
function SPE_divideText($string,$cols,$indent,$showBorders=false)
{
$strLim = 1500;
$index = 0;
$maxIndex = round((strlen($string) / 1500-.5));
$retArr= array();
while(substr($string, $strLim -1500,$strLim)!=FALSE)
{
$retArr[$index] = substr($string, $strLim -1500,$strLim);
$strLim+=1500;
SPE_BeginRow();
SPE_SetNextCellStyle('cell-padding', '0');
if($indent>0)
{
SPE_Empty($indent);
}
if($showBorders)
{
SPE_SetNextCellStyle('border-left','1.5');
SPE_SetNextCellStyle('border-right','1.5');
if($index == 0)
{
SPE_SetNextCellStyle('border-top','1.5');
}
if($index== $maxIndex)
{
SPE_SetNextCellStyle('border-bottom','1.5');
}
}
SPE_AddCell($retArr[$index],$cols-$indent);
SPE_EndRow();
$index++;
}
}
Thanks in advance for any help!
Something like this should work.
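function substr_at_word_boundary($string, $chars = 100)
{
// Takes up to $chars characters, then extends lazily to the next word boundary,
// so the result can run slightly past $chars. "." does not match line breaks
// here, so the match also stops at the first newline.
if (!preg_match('/^.{0,' . $chars . '}(?:.*?)\b/iu', $string, $matches)) {
return ''; // no word boundary found (e.g. empty string) or invalid UTF-8
}
return $matches[0];
}
$string = substr_at_word_boundary($string, 1600);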
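To cut the whole paragraph into cell-sized pieces rather than only the first one, a loop along these lines may help. This is a minimal sketch, not a drop-in replacement for SPE_divideText: split_at_word_boundaries is a hypothetical name, the limit is counted in bytes (so it assumes single-byte text), and a single word longer than the limit is cut hard.

```php
<?php
/**
 * Splits $text into chunks of at most $limit bytes, cutting at the last
 * whitespace character inside the window whenever there is one.
 */
function split_at_word_boundaries(string $text, int $limit): array
{
    $chunks = array();
    $text = trim($text);
    while (strlen($text) > $limit) {
        // Look one byte past the limit so whitespace right at the cut counts.
        $window = substr($text, 0, $limit + 1);
        if (preg_match('/^(.*)\s/s', $window, $m) && $m[1] !== '') {
            // Greedy match: $m[1] ends just before the last whitespace character.
            $cut = strlen($m[1]);
        } else {
            // No whitespace in the window: a single long word, cut it hard.
            $cut = $limit;
        }
        $chunks[] = rtrim(substr($text, 0, $cut));
        $text = ltrim(substr($text, $cut));
    }
    if ($text !== '') {
        $chunks[] = $text;
    }
    return $chunks;
}
```

Each returned chunk could then be passed to SPE_AddCell in its own row, with the top and bottom borders applied to the first and last array elements.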

Split a large string into an array, but the split point cannot break a tag

I wrote a script that sends chunks of text off to Google to translate, but sometimes the text (which is HTML source code) ends up being split in the middle of an HTML tag, and Google returns the code incorrectly.
I already know how to split the string into an array, but is there a better way to do this while ensuring the output string does not exceed 5000 characters and does not split on a tag?
UPDATE: Thanks to the answer, this is the code I ended up using in my project, and it works great:
function handleTextHtmlSplit($text, $maxSize) {
//our collection array
$niceHtml[] = '';
// Splits on tags, but also includes each tag as an item in the result
$pieces = preg_split('/(<[^>]*>)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
//the current position of the index
$currentPiece = 0;
//start assembling a group until it gets to max size
foreach ($pieces as $piece) {
//make sure string length of this piece will not exceed max size when inserted
if (strlen($niceHtml[$currentPiece] . $piece) > $maxSize) {
//advance current piece
//will put overflow into next group
$currentPiece += 1;
//create empty string as value for next piece in the index
$niceHtml[$currentPiece] = '';
}
//insert piece into our master array
$niceHtml[$currentPiece] .= $piece;
}
//return array of nicely handled html
return $niceHtml;
}
Note: haven't had a chance to test this (so there may be a minor bug or two), but it should give you an idea:
function get_groups_of_5000_or_less($input_string) {
// Splits on tags, but also includes each tag as an item in the result
$pieces = preg_split('/(<[^>]*>)/', $input_string,
-1, PREG_SPLIT_DELIM_CAPTURE);
$groups[] = '';
$current_group = 0;
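// Compare against null explicitly: preg_split can yield '' (or "0"), which is falsy and would end the loop early
while (($cur_piece = array_shift($pieces)) !== null) {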
$piecelen = strlen($cur_piece);
if(strlen($groups[$current_group]) + $piecelen > 5000) {
// Adding the next piece whole would go over the limit,
// figure out what to do.
if($cur_piece[0] == '<') {
// Tag goes over the limit, just put it into a new group
$groups[++$current_group] = $cur_piece;
} else {
// Non-tag goes over the limit, split it and put the
// remainder back on the list of un-grabbed pieces
$grab_amount = 5000 - strlen($groups[$current_group]);
$groups[$current_group] .= substr($cur_piece, 0, $grab_amount);
$groups[++$current_group] = '';
array_unshift($pieces, substr($cur_piece, $grab_amount));
}
} else {
// Adding this piece doesn't go over the limit, so just add it
$groups[$current_group] .= $cur_piece;
}
}
return $groups;
}
Also note that this can split in the middle of regular words - if you don't want that, then modify the part that begins with // Non-tag goes over the limit to choose a better value for $grab_amount. I didn't bother coding that in since this is just supposed to be an example of how to get around splitting tags, not a drop-in solution.
Why not strip the HTML tags from the string before sending it to Google? PHP has a strip_tags() function that can do this for you.
preg_split with a good regex would do it for you.
