I am building a real estate feed for a portal, and it tells me the maximum length of a string should be 20,000 bytes (20 KB). I have never run across this before.
How can I measure the byte size of a varchar string, so that I can then use a while loop to trim it down?
You can use mb_strlen() to get the byte length by passing an encoding in which every character is a single byte, without worrying about whether the string is multibyte or single-byte.
For example, as drake127 says in a comment on the mb_strlen manual page, you can use the '8bit' encoding:
<?php
$string = 'Cién cañones por banda';
echo mb_strlen($string, '8bit');
?>
You can run into problems with the strlen function because PHP has an option to overload strlen so that it actually calls mb_strlen. See http://php.net/manual/en/mbstring.overload.php for more information. (That option was deprecated in PHP 7.2 and removed in PHP 8.0.)
To trim the string by byte length without splitting it in the middle of a multibyte character, you can use:
mb_strcut(string $str, int $start [, int $length [, string $encoding ]] )
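As a minimal sketch of how mb_strcut() could be applied to the byte limit from the question: the helper name trimToBytes() is hypothetical, and the sketch assumes the text is valid UTF-8 and that mbstring.func_overload is not enabled, so that strlen() counts bytes.

```php
<?php
// Hypothetical helper: trim $text to at most $maxBytes bytes without
// leaving a partial multibyte character at the end.
function trimToBytes(string $text, int $maxBytes, string $encoding = 'UTF-8'): string
{
    // strlen() counts bytes (assuming mbstring.func_overload is off).
    if (strlen($text) <= $maxBytes) {
        return $text;
    }
    // mb_strcut() works on byte offsets but moves the cut back to a
    // character boundary if it would fall inside a character.
    return mb_strcut($text, 0, $maxBytes, $encoding);
}

// 'Cién cañones por banda', written with escapes so the example does not
// depend on how this file is encoded.
$sample  = "Ci\u{00E9}n ca\u{00F1}ones por banda";
$trimmed = trimToBytes($sample, 3);

echo $trimmed, ' (', strlen($trimmed), " bytes)\n";
```

With a limit of 3 bytes the cut would land inside the two-byte "é", so the result is shorter than the limit rather than ending in half a character. For the feed in the question the call would be trimToBytes($description, 20000), where $description is whatever variable holds the text.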
You have to figure out whether the string is ASCII-encoded or uses a multi-byte encoding.
In the former case, you can just use strlen.
In the latter case, you need to find the number of bytes per character.
The strlen documentation gives an example of how to do it: http://www.php.net/manual/en/function.strlen.php#72274
Do you mean byte size or string length?
Byte size is measured with strlen(), whereas string length is queried using mb_strlen(). You can use substr() to trim a string to X bytes (note that this will break the string if it has a multi-byte encoding - as pointed out by Darhazer in the comments) and mb_substr() to trim it to X characters in the encoding of the string.
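The difference between the two measurements, and between substr() and mb_substr(), can be sketched as follows. The sample word and the variable names are illustrative only, and the sketch assumes UTF-8 text and PHP 7 or later (for the \u{...} escapes).

```php
<?php
// "boršč" written with escapes so the example does not depend on the
// encoding of this file.
$word = "bor\u{0161}\u{010D}";

echo strlen($word), "\n";             // size in bytes
echo mb_strlen($word, 'UTF-8'), "\n"; // length in characters

// Cutting after 4 bytes ends inside the two-byte "š".
$byteCut = substr($word, 0, 4);
// Cutting after 4 characters keeps "š" whole.
$charCut = mb_substr($word, 0, 4, 'UTF-8');

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // is the byte cut still valid UTF-8?
var_dump(mb_check_encoding($charCut, 'UTF-8')); // is the character cut still valid UTF-8?
```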
PHP's strlen() function returns the number of bytes in a string, which equals the number of characters only for single-byte text such as ASCII.
strlen('borsc') -> 5 (bytes)
strlen('boršč') -> 7 (bytes)
$limit_in_kBytes = 20000;
$pointer = 0;
while(strlen($your_string) > (($pointer + 1) * $limit_in_kBytes)){
$str_to_handle = substr($your_string, ($pointer * $limit_in_kBytes ), $limit_in_kBytes);
// here you can handle (0 - n) parts of string
$pointer++;
}
$str_to_handle = substr($your_string, ($pointer * $limit_in_kBytes), $limit_in_kBytes);
// here you can handle last part of string
.. or you can use a function like this:
function parseStrToArr($string, $limit_in_kBytes){
$ret = array();
$pointer = 0;
while(strlen($string) > (($pointer + 1) * $limit_in_kBytes)){
$ret[] = substr($string, ($pointer * $limit_in_kBytes ), $limit_in_kBytes);
$pointer++;
}
$ret[] = substr($string, ($pointer * $limit_in_kBytes), $limit_in_kBytes);
return $ret;
}
$arr = parseStrToArr($your_string, $limit_in_kBytes = 20000);
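The substr()-based chunking above can split a multibyte character across two chunks. A possible multibyte-safe variant, sketched here with the hypothetical name splitByBytes(), uses mb_strcut() and advances by the number of bytes actually taken. It assumes valid UTF-8 input and a limit of at least 4 bytes, the longest UTF-8 sequence.

```php
<?php
// Hypothetical helper: split $text into chunks of at most $maxBytes bytes,
// never cutting inside a multibyte character. Assumes valid UTF-8 input.
function splitByBytes(string $text, int $maxBytes, string $encoding = 'UTF-8'): array
{
    if ($maxBytes < 4) {
        // A single UTF-8 character can occupy up to 4 bytes.
        throw new InvalidArgumentException('maxBytes must be at least 4');
    }

    $chunks = [];
    $offset = 0;
    $total  = strlen($text); // total size in bytes

    while ($offset < $total) {
        // mb_strcut() takes byte offsets but respects character boundaries.
        $chunk = mb_strcut($text, $offset, $maxBytes, $encoding);
        if ($chunk === '') {
            break; // defensive: avoid looping forever on unexpected input
        }
        $chunks[] = $chunk;
        $offset  += strlen($chunk); // advance by the bytes actually taken
    }

    return $chunks;
}

$text   = str_repeat("bor\u{0161}\u{010D} ", 10); // "boršč " repeated
$chunks = splitByBytes($text, 5);

echo count($chunks), " chunks\n";
```

Because each chunk ends on a character boundary, some chunks come out shorter than the limit, so the number of chunks can be higher than strlen($text) divided by the limit.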
Further to PhoneixS's answer on getting the correct length of a string in bytes: since mb_strlen() is slower than strlen(), for best performance one can check the "mbstring.func_overload" ini setting so that mb_strlen() is used only when it is really required:
$content_length = ini_get('mbstring.func_overload') ? mb_strlen($content , '8bit') : strlen($content);
I have a string in the Arabic language like:
على احمد يوسف
Now I need to cut this string and output it like:
...على احمد يو
I tried this function:
function short_name($str, $limit) {
if ($limit < 3) {
$limit = 3;
}
if (strlen($str) > $limit) {
if (preg_match('/\p{Arabic}/u', $str)) {
return substr($str, 0, $limit - 3) . '...';
}
else {
return '...'.substr($str, 0, $limit - 3);
}
}
else {
return $str;
}
}
The problem is that sometimes it displays a symbol like this at the end of the string:
...�على احمد يو
Why does this happen?
The symbol displayed after the cut is the result of substr() cutting in the middle of a character, resulting in an invalid character.
You need to use the Multibyte String functions, such as mb_strlen() and mb_substr(), to handle Arabic strings.
You also need to make sure the internal encoding for those functions is set to UTF-8. You can set this globally at the top of your script:
mb_internal_encoding('UTF-8');
Which leads to this:
strlen('على احمد يوسف') returns 24, the size in octets
mb_strlen('على احمد يوسف') returns 13, the size in characters
Note that mb_strlen('على احمد يوسف') would also return 24 if the internal encoding was still set to ISO-8859-1, which was the default in older PHP versions.
Answer:
return '...'.mb_substr($str, 0, $limit - 3, "UTF-8"); // UTF-8 is optional
Background:
ISO 8859-1 is an 8-bit character set, and it does not include Arabic. substr() works on individual bytes, that is, on 8-bit units. Characters with code points above 255 (Arabic, Cyrillic, Korean, etc.) need more than 8 bits to represent, for example 16 or sometimes even 32 bits; in UTF-8 each Arabic letter occupies two bytes. Cutting at $limit - 3 bytes can therefore land in the middle of a character and leave an undisplayable partial character behind. Especially if you are going to use a lot of multibyte strings, make sure you use the correct string functions, such as mb_strlen().
Try this function:
public static function shorten_arabic_text($text, $length)
{
mb_internal_encoding('UTF-8');
// Count and cut in characters, not bytes, so no character is split.
$out = mb_strlen($text) > $length ? mb_substr($text, 0, $length) . " ..." : $text;
return $out;
}
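Putting these answers together, the short_name() function from the question could be reworked roughly as below. This is only a sketch: the name shortName() is hypothetical, the ellipsis is always appended in logical order (how it is displayed in right-to-left text is left to the renderer), and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical rewrite of short_name(): $limit is counted in characters
// and includes the three dots of the ellipsis.
function shortName(string $str, int $limit): string
{
    $limit = max($limit, 3);

    // Measure in characters, not bytes.
    if (mb_strlen($str, 'UTF-8') <= $limit) {
        return $str;
    }

    // mb_substr() cuts on character boundaries, so no stray bytes remain.
    return mb_substr($str, 0, $limit - 3, 'UTF-8') . '...';
}

$name  = 'على احمد يوسف'; // assumes this file is saved as UTF-8
$short = shortName($name, 10);

echo $short, "\n";
```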
I have a hardware unit that, when asked for some data, returns a string which, when exploded on spaces, gives an array of values:
$bytes = array(
'03',
'80',
'A0',
'01' // and others, total of 240 entries
);
These actually represent bytes: 0x03, 0x80, 0xA0, 0x01. I need to transform them into their actual values.
I have tried, in a loop, $value = 0x{$byte}, $value = {'0x' . $byte} and others, to no avail.
I also tried unpack, but I don't know what format to apply; I am kind of clueless about bytes.
It seems like a basic issue, yet I cannot wrap my head around it.
How can I dynamically transform them into their actual integer values?
Use chr if you want a string (convert the hex string to an integer first, otherwise '80' is read as decimal 80):
$value = chr(hexdec($byte));
Use hexdec if you want an integer:
$value = hexdec($byte);
In PHP, bytes are the same as one-character long strings, with the following escaping:
$byte = "\x03";
There is a function that can help you: chr().
This function takes as its parameter the integer code of the byte you want to obtain. A string such as "0x03" is not interpreted as a hexadecimal number (hexadecimal strings are no longer treated as numeric since PHP 7), so convert the hex digits with hexdec() first:
$code = "03";
$byte = chr(hexdec($code));
to obtain the "\x03" byte.
On the other hand, as mentioned by @chumkiu, if you are trying to obtain integer values, the following code will work:
$code = "03";
$int = hexdec($code);
I think something like this will be sufficient:
foreach($bytes as $byte)
{
echo hexdec($byte);
}
See also the hexdec manual.
If $string is the raw data (hex digits separated by spaces), then you can extract the binary data like this:
$binary = pack('H*',str_replace(' ','',$string));
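The approaches from these answers can be compared side by side in a short sketch. The sample input is just the four values from the question, and the variable names are illustrative.

```php
<?php
// Raw data as returned by the device: hex digits separated by spaces.
$string = '03 80 A0 01';
$bytes  = explode(' ', $string);

// One integer per entry.
$ints = array_map('hexdec', $bytes);

// A binary string with one byte per entry; hex2bin() gives the same
// result as pack('H*', ...) for an even number of hex digits.
$binary = hex2bin(implode('', $bytes));

// Going the other way: unpack the binary string into unsigned byte values.
// unpack() returns a 1-based array, so re-index it from 0.
$fromBinary = array_values(unpack('C*', $binary));

print_r($ints);
echo bin2hex($binary), "\n";
```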
I receive data from a PUSH service. The data is compressed with gzcompress(). At the very beginning it carries an int giving the length of the data it contains; this prefix is added after the gzcompress() call. So sample data would be:
187xœËHÍÉÉ,
Which is produced by
echo '187'.gzcompress('Hello');
Now, I don't know the length of the int: it could be 1 digit or it could be 10 digits. I also don't know what the first character of the compressed data will be, so I cannot use it to find where the string begins.
Any ideas on how to retrieve/subtract the int?
$length_value=???
$string_value=???
Assuming that the compressed data would NEVER start with a digit, then a regex would be easiest:
$string = '187xœËHÍÉÉ,';
preg_match('/^(\d+)/', $string, $matches);
$number = $matches[0];
$compressed_data = substr($string, strlen($number)); // everything after the digits
If the compressed data DOES start with a digit, then you're going to end up with corrupt data - you'll have absolutely no way of differentiating where the 'length' value stops and the compressed data starts, e.g.
$compressed = '123foo';
$length = '6';
$your_string = '6123foo';
Ok - is that a string of length 61, with compressed data 23foo? or 612 + 3foo?
You could use preg_match() to catch the integer at the start of the string.
http://php.net/manual/en/function.preg-match.php
You could do:
$contents = "187xœËHÍÉÉ,";
$length = (int)$contents;
$startingPosition = strlen((string)$length);
$original = gzuncompress(substr($contents, $startingPosition), $length);
But I feel this may fail if the first compressed byte is a number.
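A minimal sketch of the same idea without a regex, using strspn() to count the leading digits. It relies on the assumption already stated above, that the compressed data never begins with a digit; in the sample from the question the compressed part begins with "x". The variable names follow the question, and the zlib extension is assumed to be available.

```php
<?php
// Build a sample message the same way as in the question.
$message = '187' . gzcompress('Hello');

// Count how many leading bytes are ASCII digits.
$digitCount = strspn($message, '0123456789');

// Split the message into the numeric prefix and the compressed payload.
$length_value = (int) substr($message, 0, $digitCount);
$string_value = gzuncompress(substr($message, $digitCount));

echo $length_value, "\n";
echo $string_value, "\n";
```

If the sending side can be changed, a delimiter or a fixed-width length field would remove the ambiguity described above.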
I am trying to limit the characters of a string. Additionally, if the string is less than the required characters, I want to add padding to it.
function create_string($string, $length) {
$str_len = strlen($string);
if($str_len > $length) {
//if string is greater than max length, then strip it
$str = substr($string, 0, $length);
} else {
//if string is less than the required length, pad it with what it needs to be the length
$remaining = $length-$str_len;
$str = str_pad($string, $remaining);
}
return $str;
}
My input is
"Nik's Auto Salon"
which is 16 characters. The second parameter is 40.
However, this string is returned:
"Nik's Auto Salon "
which has only eight characters of padding added onto it. That doesn't seem right.
I also tried this string:
Gold Package Mobile Car Detail
With this input, it returns a string with NO padding added, even though the phrase is shorter than the length of 45 that I passed as the second parameter.
How can I make this function work according to my specifications?
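str_pad doesn't add spaces equal to its second parameter; it pads the string TO the length given in the second parameter. This isn't very clear even in the documentation. That explains both results: for the 16-character string, $remaining is 24, so the string is padded to 24 characters (eight spaces); for the 30-character string, $remaining is 15, which is less than the string's own length, so nothing is added.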
Try this instead (and take out the line where you calculate $remaining):
$str = str_pad($string, $length);
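With that change, truncating and padding can be combined in one expression. The sketch below uses the hypothetical name fitToLength(). Note that substr() and str_pad() both count bytes, so this assumes single-byte text such as the examples in the question.

```php
<?php
// Hypothetical helper: return $string cut or space-padded to exactly
// $length bytes.
function fitToLength(string $string, int $length): string
{
    // substr() trims strings that are too long; str_pad() then pads
    // anything shorter up TO $length (its second argument is the target
    // length, not the number of characters to add).
    return str_pad(substr($string, 0, $length), $length);
}

$padded = fitToLength("Nik's Auto Salon", 40);
$longer = fitToLength('Gold Package Mobile Car Detail', 45);

echo '[', $padded, "]\n";
echo '[', $longer, "]\n";
```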
I have a unique problem with multibyte character strings and need to be able to shuffle, with some fair degree of randomness, a long UTF-8 encoded multibyte string in PHP without dropping or losing or repeating any of the characters.
In the PHP manual under str_shuffle there is a multi-byte function (the first user submitted one) that doesn't work: If I use a string with for example all the Japanese hiragana and katakana of string length (ex) 120 chars, I am returned a string that's 119 chars or 118 chars. Sometimes I've seen duplicate chars even though the original string doesn't have them. So that's not functional.
To make this more complex, I also need to include if possible Japanese UTF-8 newlines and line feeds and punctuation.
Can anyone with experience dealing in multiple languages with UTF-8 mb strings help? Does PHP have any built in functions to do this? str_shuffle is EXACTLY what I want. I just need it to also work on multibyte chars.
Thanks very much!
Try splitting the string using mb_strlen and mb_substr to create an array, then using shuffle before joining it back together again. (Edit: As also demonstrated in @Frosty Z's answer.)
An example from the PHP interactive prompt:
php > $string = "Pretend I'm multibyte!";
php > $len = mb_strlen($string);
php > $sploded = array();
php > while($len-- > 0) { $sploded[] = mb_substr($string, $len, 1); }
php > shuffle($sploded);
php > echo join('', $sploded);
rmedt tmu nIb'lyi!eteP
You'll want to be sure to specify the encoding, where appropriate.
This should do the trick, too. I hope.
class StringHelper // renamed: "String" cannot be used as a class name in PHP 7 and later
{
public function mbStrShuffle($string)
{
$chars = $this->mbGetChars($string);
shuffle($chars);
return implode('', $chars);
}
public function mbGetChars($string)
{
$chars = [];
for($i = 0, $length = mb_strlen($string, 'UTF-8'); $i < $length; ++$i)
{
$chars[] = mb_substr($string, $i, 1, 'UTF-8');
}
return $chars;
}
}
I like to use this function (mb_str_split() requires PHP 7.4 or later):
function mb_str_shuffle($multibyte_string = "abcčćdđefghijklmnopqrsštuvwxyzžß,.-+'*?=)(/&%$#!~ˇ^˘°˛`˙´˝") {
$characters_array = mb_str_split($multibyte_string);
shuffle($characters_array);
return implode('', $characters_array); // or join('', $characters_array); if you have a death wish (JK)
}
Split string into an array of multibyte characters
Shuffle the good guy array who doesn't care about his residents being multibyte
Join the shuffled array together into a string
Of course, I normally wouldn't have a default value for the function's parameter.
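Where mb_str_split() is not available, the same three steps can be sketched with preg_split() and the /u modifier, which splits a UTF-8 string into its characters. The names mbShuffle() and sortedChars() are hypothetical, and the sample text assumes this file is saved as UTF-8. The second helper is only there to check that no character was dropped or repeated.

```php
<?php
// Hypothetical helper: shuffle the characters of a UTF-8 string.
function mbShuffle(string $text): string
{
    // With the /u modifier, an empty pattern splits between characters.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    shuffle($chars);
    return implode('', $chars);
}

// Hypothetical check helper: the characters of $text in sorted order.
function sortedChars(string $text): array
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    sort($chars);
    return $chars;
}

// Hiragana, katakana, Japanese punctuation and a newline.
$original = "あいうえお、カキクケコ。\n";
$shuffled = mbShuffle($original);

// true when both strings contain exactly the same characters.
var_dump(sortedChars($original) === sortedChars($shuffled));
```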