I'm looking for PHP code or a library that I can call with a text and that will tell me:
Which encoding I need to use in order to send this text as an SMS (7, 8 or 16 bit)
How many SMS messages I will need to send this text (it must be smart enough to account for the "segmentation information", as in http://ozekisms.com/index.php?owpn=612)
Do you know of any existing code or library that will do this for me?
Again, I'm not looking to send or convert SMS, just to get information about the text.
Update:
OK, I wrote the code below and it seems to be working fine. Let me know if you have a better or more optimized solution or library.
$text = '\@£$¥èéùìòÇØøÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ -./0123456789:;<=>?¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑܧ¿abcdefghijklmnopqrstuvwxyzäöñüà^{}[~]|€' ; //"\\". //'"';//' ';
print $text . "\n";
print isGsm7bit($text). "\n";
print getNumberOfSMSsegments($text). "\n";
function getNumberOfSMSsegments($text,$MaxSegments=6){
/*
http://en.wikipedia.org/wiki/SMS
Larger content (concatenated SMS, multipart or segmented SMS, or "long SMS") can be sent using multiple messages,
in which case each message will start with a user data header (UDH) containing segmentation information.
Since UDH is part of the payload, the number of available characters per segment is lower:
153 for 7-bit encoding,
134 for 8-bit encoding and
67 for 16-bit encoding.
The receiving handset is then responsible for reassembling the message and presenting it to the user as one long message.
While the standard theoretically permits up to 255 segments,[35] 6 to 8 segment messages are the practical maximum,
and long messages are often billed as equivalent to multiple SMS messages. See concatenated SMS for more information.
Some providers have offered length-oriented pricing schemes for messages, however, the phenomenon is disappearing.
*/
$TotalSegment=0;
$textlen = mb_strlen($text);
if($textlen==0) return false; //I can see most mobile devices will not allow you to send empty sms, with this check we make sure we don't allow empty SMS
if(isGsm7bit($text)){ //7-bit
$SingleMax=160;
$ConcatMax=153;
}else{ //UCS-2 Encoding (16-bit)
$SingleMax=70;
$ConcatMax=67;
}
if($textlen<=$SingleMax){
$TotalSegment = 1;
}else{
$TotalSegment = ceil($textlen/$ConcatMax);
}
if($TotalSegment>$MaxSegments) return false; //SMS is very big.
return $TotalSegment;
}
function isGsm7bit($text){
    $gsm7bitChars = "\\@£\$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑܧ¿abcdefghijklmnopqrstuvwxyzäöñüà^{}[~]|€";
    $textlen = mb_strlen($text);
    for ($i = 0; $i < $textlen; $i++){
        // Compare whole characters rather than single bytes, and check strictly:
        // mb_strpos() returns 0 (not false) when the match is at the first position
        if (mb_strpos($gsm7bitChars, mb_substr($text, $i, 1)) === false){return false;}
    }
    return true;
}
I'm adding some extra information here because the previous answer isn't quite correct.
These are the issues:
You need to specify the string's encoding when calling the mb_* functions (mb_strlen in particular), otherwise it may be guessed incorrectly
In 7-bit GSM encoding, the characters of the basic character set extension (^{}\[~]|€) require 14 bits each to encode (an escape code plus the character), so they count as two characters each.
In UCS-2 encoding, you have to be wary of emoji and other characters outside the 16-bit BMP, because...
GSM with UCS-2 counts 16-bit characters, so if you have a 💩 character (U+1F4A9), and your carrier and phone sneakily support UTF-16 and not just UCS-2, it will be encoded as a surrogate pair of 16-bit characters in UTF-16, and thus be counted as TWO 16-bit characters toward your string length. mb_strlen will count this as a single character only.
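To make the emoji point concrete, here is a minimal sketch (assuming the mbstring extension and PHP 7 or later for the "\u{...}" escape) that compares the character count with the number of 16-bit UTF-16 code units. The variable names are only illustrative.

```php
<?php
$poo = "\u{1F4A9}"; // PILE OF POO, outside the 16-bit BMP

// mb_strlen() counts code points, so the emoji is one character.
$chars = mb_strlen($poo, 'UTF-8');

// In UTF-16 the same code point becomes a surrogate pair.
// 'n*' unpacks the bytes as unsigned 16-bit big-endian integers.
$utf16 = mb_convert_encoding($poo, 'UTF-16BE', 'UTF-8');
$units = unpack('n*', $utf16);

printf("%d character, %d UTF-16 code units\n", $chars, count($units));
printf("surrogates: %04X %04X\n", $units[1], $units[2]);
```

The two code units are what count toward the 70/67 limits if the message really is sent as UTF-16.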
How to count 7-bit characters:
What I've come up with so far is the following to count 7-bit characters:
// Internal encoding must be set to UTF-8,
// and the input string must be UTF-8 encoded for this to work correctly
function count_gsm_string($str)
{
    // Basic GSM character set (one 7-bit encoded char each)
    // Note: the $ must be escaped, otherwise PHP tries to interpolate a variable
    $gsm_7bit_basic = "@£\$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑܧ¿abcdefghijklmnopqrstuvwxyzäöñüà";
    // Extended set (requires escape code before character thus 2x7-bit encodings per)
    $gsm_7bit_extended = "^{}\\[~]|€";
    $len = 0;
    for($i = 0; $i < mb_strlen($str); $i++) {
        $c = mb_substr($str, $i, 1);
        if(mb_strpos($gsm_7bit_basic, $c) !== FALSE) {
            $len++;
        } else if(mb_strpos($gsm_7bit_extended, $c) !== FALSE) {
            $len += 2;
        } else {
            return -1; // cannot be encoded as GSM, immediately return -1
        }
    }
    return $len;
}
How to count 16-bit characters:
Convert the string into its UTF-16 representation with mb_convert_encoding($str, 'UTF-16', 'UTF-8'); this preserves emoji characters.
Do not convert into UCS-2, as that conversion is lossy with mb_convert_encoding.
Count the bytes with count(unpack('C*', $utf16str)) and divide by two to get the number of 16-bit UCS-2 characters that count toward the GSM multipart length.
*caveat emptor, a word on counting bytes:
Do not use strlen to count the number of bytes. While it may work, strlen is often overloaded in PHP installations with a multibyte-capable version, and is also a candidate for API change in the future
Avoid mb_strlen($str, 'UCS-2'). While it does currently work, and will return, correctly, 2 for a pile of poo character (as it looks like two 16-bit UCS-2 characters), its stablemate mb_convert_encoding is lossy when converting from >16-bit to UCS-2. Who's to say that mb_strlen won't be lossy in the future?
Avoid mb_strlen($str, '8bit') / 2. It also currently works, and is recommended in a PHP docs comment as a way to count bytes. But IMO it suffers from the same issue as the above UCS-2 technique.
That leaves the safest current way (IMO) as unpacking into a byte array, and counting that.
So, what does this look like?
// Internal encoding must be set to UTF-8,
// and the input string must be UTF-8 encoded for this to work correctly
function count_ucs2_string($str)
{
    $utf16str = mb_convert_encoding($str, 'UTF-16', 'UTF-8');
    // C* unpacks each byte as an unsigned char (an 8-bit integer);
    // which option you choose doesn't actually matter as long as you get one value per byte
    $byteArray = unpack('C*', $utf16str);
    return count($byteArray) / 2;
}
Putting it all together:
function multipart_count($str)
{
$one_part_limit = 160; // use a constant i.e. GSM::SMS_SINGLE_7BIT
$multi_limit = 153; // again, use a constant
$max_parts = 3; // ... constant
$str_length = count_gsm_string($str);
if($str_length === -1) {
$one_part_limit = 70; // ... constant
$multi_limit = 67; // ... constant
$str_length = count_ucs2_string($str);
}
if($str_length <= $one_part_limit) {
// fits in one part
return 1;
} else if($str_length > ($max_parts * $multi_limit)) {
// too long
return -1; // or throw exception, or false, etc.
} else {
// divide the string length by multi_limit and round up to get number of parts
return ceil($str_length / $multi_limit);
}
}
Turned this into a library...
https://bitbucket.org/solvam/smstools
page 1 : 160 byte
page 2 : 146 byte
page 3 : 153 byte
page 4 : 153 byte
page 5 : 153 byte, ....
So regardless of language :
// strlen($text) returns the length in bytes
$count = 0;
$len = strlen($text);
if ($len > 306) {
$len = $len - 306;
$count = (int) ceil($len / 153) + 2; // round up, so an exact multiple of 153 does not add an extra page
} else if($len>160){
$count = 2;
}else{
$count = 1;
}
In PHP I am calculating the Levenshtein distance using the levenshtein() function. For simple characters it works as expected, but for characters with diacritics, as in this example
echo levenshtein('à', 'a');
it returns "2". In this case only one replacement has to be done, so I expect it to return "1".
Am I missing something?
I thought it may be useful to have this comment from the PHP manual posted as an answer to this question, so here it is:-
The levenshtein function processes each byte of the input string individually. Then for multibyte encodings, such as UTF-8, it may give misleading results.
Example with a french accented word :
- levenshtein('notre', 'votre') = 1
- levenshtein('notre', 'nôtre') = 2 (huh ?!)
You can easily find a multibyte compliant PHP implementation of the levenshtein function but it will be of course much slower than the C implementation.
Another option is to convert the strings to a single-byte (lossless) encoding so that they can feed the fast core levenshtein function.
Here is the conversion function I used with a search engine storing UTF-8 strings, and a quick benchmark. I hope it will help.
<?php
// Convert an UTF-8 encoded string to a single-byte string suitable for
// functions such as levenshtein.
//
// The function simply uses (and updates) a tailored dynamic encoding
// (in/out map parameter) where non-ascii characters are remapped to
// the range [128-255] in order of appearance.
//
// Thus it supports up to 128 different multibyte code points max over
// the whole set of strings sharing this encoding.
//
function utf8_to_extended_ascii($str, &$map)
{
// find all multibyte characters (cf. utf-8 encoding specs)
$matches = array();
if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches))
return $str; // plain ascii string
// update the encoding map with the characters not already met
foreach ($matches[0] as $mbc)
if (!isset($map[$mbc]))
$map[$mbc] = chr(128 + count($map));
// finally remap non-ascii characters
return strtr($str, $map);
}
// Didactic example showing the usage of the previous conversion function but,
// for better performance, in a real application with a single input string
// matched against many strings from a database, you will probably want to
// pre-encode the input only once.
//
function levenshtein_utf8($s1, $s2)
{
$charMap = array();
$s1 = utf8_to_extended_ascii($s1, $charMap);
$s2 = utf8_to_extended_ascii($s2, $charMap);
return levenshtein($s1, $s2);
}
?>
Results (for about 6000 calls)
- reference time core C function (single-byte) : 30 ms
- utf8 to ext-ascii conversion + core function : 90 ms
- full php implementation : 3000 ms
The default PHP levenshtein(), like many PHP functions, is not multibyte aware. So, when processing strings with Unicode characters, it handles each byte separately: replacing the two-byte à with the one-byte a counts as two edits.
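A small demonstration of the byte-wise behaviour, assuming the source is UTF-8 and the mbstring extension is available (the variable name is only illustrative):

```php
<?php
$accented = "\u{E0}"; // "à", stored in UTF-8 as the two bytes C3 A0

echo strlen($accented), "\n";             // length in bytes
echo mb_strlen($accented, 'UTF-8'), "\n"; // length in characters
echo bin2hex($accented), "\n";            // the raw bytes as hex

// Byte-wise view: replace C3 with "a", then delete A0, hence two edits.
$distance = levenshtein($accented, 'a');
echo $distance, "\n";
```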
There is no multibyte version (i.e. mb_levenshtein()) so you have two options:
1) Re-implement the function yourself, using mb_ functions. Possible example code from a Gist:
<?php
function levenshtein_php($str1, $str2){
$length1 = mb_strlen( $str1, 'UTF-8');
$length2 = mb_strlen( $str2, 'UTF-8');
if( $length1 < $length2) return levenshtein_php($str2, $str1);
if( $length1 == 0 ) return $length2;
if( $str1 === $str2) return 0;
$prevRow = range( 0, $length2);
$currentRow = array();
for ( $i = 0; $i < $length1; $i++ ) {
$currentRow=array();
$currentRow[0] = $i + 1;
$c1 = mb_substr( $str1, $i, 1, 'UTF-8') ;
for ( $j = 0; $j < $length2; $j++ ) {
$c2 = mb_substr( $str2, $j, 1, 'UTF-8' );
$insertions = $prevRow[$j+1] + 1;
$deletions = $currentRow[$j] + 1;
$substitutions = $prevRow[$j] + (($c1 != $c2)?1:0);
$currentRow[] = min($insertions, $deletions, $substitutions);
}
$prevRow = $currentRow;
}
return $prevRow[$length2];
}
2) Convert your string's Unicode characters to ASCII. If you are specifically wanting to calculate Levenshtein differences from diacritic characters to non-diacritics, though, this is probably not what you want.
I have to debug an old PHP script from a developer who has left the company. I understand most of the code, except the following function. My question: What does...
if($seq == 0x03 || $seq == 0x30)
...mean in context of extracting the signature out of an X.509 certificate?
public function extractSignature($certPemString) {
$bin = $this->ConvertPemToBinary($certPemString);
if(empty($certPemString) || empty($bin))
{
return false;
}
$bin = substr($bin,4);
while(strlen($bin) > 1)
{
$seq = ord($bin[0]);
if($seq == 0x03 || $seq == 0x30)
{
$len = ord($bin[1]);
$bytes = 0;
if ($len & 0x80)
{
$bytes = ($len & 0x0f);
$len = 0;
for ($i = 0; $i < $bytes; $i++)
{
$len = ($len << 8) | ord($bin[$i + 2]);
}
}
if($seq == 0x03)
{
return substr($bin,3 + $bytes, $len);
}
else
{
$bin = substr($bin,2 + $bytes + $len);
}
}
else
{
return false;
}
}
return false;
}
An X.509 certificate contains data in multiple sections (called Tag-Length-Value triplets). Each section starts with a Tag byte, which indicates the data format of the section. You can see a list of these data types here.
0x03 is the Tag byte for the BIT STRING data type, and 0x30 is the Tag byte for the SEQUENCE data type.
So this code is designed to handle the BIT STRING and SEQUENCE data types. If you look at this part:
if($seq == 0x03)
{
return substr($bin,3 + $bytes, $len);
}
else // $seq == 0x30
{
$bin = substr($bin,2 + $bytes + $len);
}
you can see that the function is designed to skip over Sequences (0x30), until it finds a Bit String (0x03), at which point it returns the value of the Bit String.
You might be wondering why the magic number is 3 for Bit String and 2 for Sequence. That is because in a Bit String, the first value byte is a special extra field which indicates how many bits are unused in the last byte of the data. (For example, if you're sending 13 bits of data, it will take up 2 bytes = 16 bits, and the "unused bits" field will be 3.)
Next issue: the Length field. When the length of the Value is less than 128 bytes, the length is simply specified using a single byte (the most significant bit will be 0). If the length is 128 or greater, then the first length byte has bit 7 set, and the remaining 7 bits indicate how many following bytes contain the length (in big-endian order). More description here. The parsing of the length field happens in this section of the code:
$len = ord($bin[1]);
$bytes = 0;
if ($len & 0x80)
{
// length is greater than 127!
$bytes = ($len & 0x0f);
$len = 0;
for ($i = 0; $i < $bytes; $i++)
{
$len = ($len << 8) | ord($bin[$i + 2]);
}
}
After that, $bytes contains the number of extra bytes used by the length field, and $len contains the length of the Value field (in bytes).
Did you spot the error in the code? Remember,
If the length is 128 or greater, then the first length byte has bit 7
set, and the remaining 7 bits indicate how many following bytes
contain the length.
but the code says $bytes = ($len & 0x0f), which only takes the lower 4 bits of the byte! It should be:
$bytes = ($len & 0x7f);
Of course, this error is only a problem for extremely long messages: it will work fine as long as the length value will fit within 0x0f = 15 bytes, meaning the data has to be less than 256^15 bytes. That's about a trillion yottabytes, which ought to be enough for anybody.
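For reference, the tag and length parsing described above can be sketched as a standalone function using the corrected 0x7f mask. The name read_der_header and its return shape are made up for this illustration, and there is no bounds checking, so treat it as a sketch rather than a replacement for a real ASN.1 parser.

```php
<?php
// Hypothetical helper: reads the tag and length of one DER element at $offset.
// Returns [tag, header length in bytes, value length in bytes].
function read_der_header(string $der, int $offset = 0): array
{
    $tag = ord($der[$offset]);
    $len = ord($der[$offset + 1]);
    $extra = 0;
    if ($len & 0x80) {
        // Long form: the low 7 bits give the number of length bytes that follow.
        $extra = $len & 0x7f;
        $len = 0;
        for ($i = 0; $i < $extra; $i++) {
            $len = ($len << 8) | ord($der[$offset + 2 + $i]);
        }
    }
    return [$tag, 2 + $extra, $len];
}

// Short form: BIT STRING (0x03) with a 5-byte value.
print_r(read_der_header("\x03\x05\x00\x01\x02\x03\x04"));
// Long form: SEQUENCE (0x30) whose length 0x0100 = 256 is stored in two bytes.
print_r(read_der_header("\x30\x82\x01\x00"));
```

For a BIT STRING, the first value byte is the "unused bits" field, which is why the original function skips one more byte than the header length before returning the data.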
As Pateman says above, you just have a logical if, we're just checking if $seq is either 0x30 or 0x03.
I have a feeling you already know that though, so here goes. $seq is the first byte of the certificate, which is probably either the version of the certificate or the magic number to denote that the file is a certificate (also known as "I'm guessing this because 10:45 is no time to start reading RFCs").
In this case, we're comparing against 0x30 and 0x03. These numbers are expressed in hexadecimal (as is every number starting with 0x), which is base-16. This is just really a very convenient shorthand for binary, as each hex digit corresponds to exactly four binary bits. A quick table is this:
0 = 0000
1 = 0001
2 = 0010
3 = 0011
...
...
E = 1110
F = 1111
Equally well, we could have said if($seq == 3 || $seq == 48), but hex is just much easier to read and understand in this case.
I'd hazard a guess that it's a byte-order-independent check for version identifier '3' in an x.509 certificate. See RFC 1422, p7. The rest is pulling the signature byte-by-byte.
ord() gets the value of the ASCII character you pass it. In this case it's checking to see if the ASCII character is either a 0 or end of text (according to this ASCII table).
0x03 and 0x30 are hex values. Look that up and you'll have what $seq is matching to
I want to convert the following raw mail subject to normal UTF-8 text:
=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=
The real text for that is:
Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet
My first approach to convert this:
$mime = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=';
mb_internal_encoding("UTF-8");
echo mb_decode_mimeheader($mime);
This gives me the following result:
Schuker_hat_sich_vom_Übungsabend_(01.01.2012)_abgemeldet
(Questions here: What am I doing wrong? Why do those underscores occur?)
My second approach to convert this:
$mime = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=';
echo imap_utf8($mime);
This gives me the following (correct) result:
Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet
Why does this work? Which method should I rely on?
The reason I ask is that I previously asked another question about decoding mail subjects, where mb_decode_mimeheader was the solution, whereas here imap_utf8 would be the way to go. How can I make sure everything is decoded correctly for both of these examples:
=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=
and
=?UTF-8?B?UmU6ICMyLUZpbmFsIEFjY2VwdGFuY2UgdGVzdCB3aXRoIG5ldyB0ZXh0IHdpdGggU2xvdg==?=
=?UTF-8?B?YWsgaW50ZXJwdW5jdGlvbnMgIivEvsWhxI3FpcW+w73DocOtw6khxYgi?=
Should give me the expected results:
Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet
and
Re: #2-Final Acceptance test with new text with Slovak interpunctions "+ľščťžýáíé!ň"
Based on the hbit response, I've improved the imapUtf8() function to convert the subject text to UTF-8 using the charset information. The result is something like:
function imapUtf8($str){
$convStr = '';
$subLines = preg_split('/[\r\n]+/', $str);
for ($i=0; $i < count($subLines); $i++) {
$convLine = '';
$linePartArr = imap_mime_header_decode($subLines[$i]);
for ($j=0; $j < count($linePartArr); $j++) {
if ($linePartArr[$j]->charset === 'default') {
if ($linePartArr[$j]->text != " ") {
$convLine .= ($linePartArr[$j]->text);
}
} else {
$convLine .= iconv($linePartArr[$j]->charset, 'UTF-8', $linePartArr[$j]->text);
}
}
$convStr .= $convLine;
}
return $convStr;
}
This function works for both examples:
function imapUtf8($str){
$convStr = '';
$subLines = preg_split('/[\r\n]+/',$str); // split multi-line subjects
for($i=0; $i < count($subLines); $i++){ // go through lines
$convLine = '';
$linePartArr = imap_mime_header_decode(trim($subLines[$i])); // split and decode by charset
for($j=0; $j < count($linePartArr); $j++){
$convLine .= ($linePartArr[$j]->text); // append sub-parts of line together
}
$convStr .= $convLine; // append to whole subject
}
return $convStr; // return converted subject
}
Tests:
$sub1 = '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?=';
$sub2 = '=?UTF-8?B?UmU6ICMyLUZpbmFsIEFjY2VwdGFuY2UgdGVzdCB3aXRoIG5ldyB0ZXh0IHdpdGggU2xvdg==?= =?UTF-8?B?YWsgaW50ZXJwdW5jdGlvbnMgIivEvsWhxI3FpcW+w73DocOtw6khxYgi?=';
echo imapUtf8($sub1);
echo imapUtf8($sub2);
Result:
Schuker hat sich vom Übungsabend (01.01.2012) abgemeldet
Re: #2-Final Acceptance test with new text with Slovak interpunctions "+ľščťžýáíé!ň"
It's also in the comments in the manual for mb_decode_mimeheader, and I actually assume it is a bug. None in the database, so I'd file it as a new one.
However, AFAIK imap_mime_header_decode will cope with both your encodings without a problem, so that will keep your code going.
About the mysterious underscore in the Subject header field:
RFC2047 4.2(2) states explicitly:
The 8-bit hexadecimal value 20 (e.g., ISO-8859-1 SPACE) may be
represented as "_" (underscore, ASCII 95.). (This character may
not pass through some internetwork mail gateways, but its use
will greatly enhance readability of "Q" encoded data with mail
readers that do not support this encoding.) Note that the "_"
always represents hexadecimal 20, even if the SPACE character
occupies a different code position in the character set in use.
The encoding rules for the Subject line are documented in that same RFC 2047.
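To illustrate the underscore rule without the imap extension, here is a rough sketch of an RFC 2047 decoder built only on core functions and mbstring. The function name decode_encoded_words is made up for this example, it assumes every charset label is one that mbstring recognises, and it does not cover every corner of the RFC.

```php
<?php
// Hypothetical helper: decodes RFC 2047 encoded-words in a header value to UTF-8.
function decode_encoded_words(string $header): string
{
    // Whitespace between two adjacent encoded-words is not part of the text.
    $header = preg_replace('/(\?=)\s+(=\?)/', '$1$2', $header);

    return preg_replace_callback(
        '/=\?([^?]+)\?([QqBb])\?([^?]*)\?=/',
        function (array $m): string {
            [, $charset, $encoding, $payload] = $m;
            if (strtoupper($encoding) === 'B') {
                $bytes = base64_decode($payload);
            } else {
                // In Q-encoding "_" always stands for hexadecimal 20 (space).
                $bytes = quoted_printable_decode(str_replace('_', ' ', $payload));
            }
            return mb_convert_encoding($bytes, 'UTF-8', $charset);
        },
        $header
    );
}

echo decode_encoded_words(
    '=?utf-8?Q?Schuker_hat_sich_vom_=C3=9Cbungsabend_(01.01.2012)_abgem?= =?utf-8?Q?eldet?='
), "\n";
```

The str_replace step is what turns the underscores back into spaces.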
I would like to write a function that takes in 3 characters and increments it and returns the newly incremented characters as a string.
I know how to increase a single letter to the next one but how would I know when to increase the second letters and then stop and then increase the first letter again to have a sequential increase?
So if AAA is passed, return AAB. If
AAZ is passed return ABA (hard part).
I would appreciate help with the logic and what php functions will be useful to use.
Even better, has someone done this already, or is there a class available to do this?
Thanks all for any help
Character/string increment works in PHP (though decrement doesn't)
$x = 'AAZ';
$x++;
echo $x; // 'ABA'
You can do it with the ++ operator.
$i = 'aaz';
$i++;
print $i;
aba
However this implementation has some strange things:
for($i = 'a'; $i < 'z'; $i++) print "$i ";
This will print out letters from a to y.
for($i = 'a'; $i <= 'z'; $i++) print "$i ";
This will print out letters from a to z, then continue with aa and end with yz.
As proposed in PHP RFC: Strict operators directive
(currently Under Discussion):
Using the increment function on a string will throw a TypeError when strict_operators is enabled.
Whether or not the RFC gets merged, PHP will sooner or later go that direction of adding operator strictness. Therefore, you should not be incrementing strings.
a-z/A-Z ranges
If you know your letters will stay in the range a-z/A-Z (and never go past z/Z), you can convert the letter to its ASCII code, increment it, and convert it back to a letter.
Use ord() and chr():
$letter = 'A';
$letterAscii = ord($letter);
$letterAscii++;
$letter = chr($letterAscii); // 'B'
ord() converts the letter into ASCII num representation
that num representation is incremented
using chr() the number gets converted back to the letter
As discovered in the comments, be careful: this walks the ASCII table, so from Z (ASCII 90) it does not go to AA, but to [ (ASCII 91).
Going beyond z/Z
If you dare to go further and want z to become aa, this is what I came up with:
final class NextLetter
{
private const ASCII_UPPER_CASE_BOUNDARIES = [65, 91];
private const ASCII_LOWER_CASE_BOUNDARIES = [97, 123];
public static function get(string $previous) : string
{
$letters = str_split($previous);
$output = '';
$increase = true;
while (! empty($letters)) {
$letter = array_pop($letters);
if ($increase) {
$letterAscii = ord($letter);
$letterAscii++;
if ($letterAscii === self::ASCII_UPPER_CASE_BOUNDARIES[1]) {
$letterAscii = self::ASCII_UPPER_CASE_BOUNDARIES[0];
$increase = true;
} elseif ($letterAscii === self::ASCII_LOWER_CASE_BOUNDARIES[1]) {
$letterAscii = self::ASCII_LOWER_CASE_BOUNDARIES[0];
$increase = true;
} else {
$increase = false;
}
$letter = chr($letterAscii);
if ($increase && empty($letters)) {
$letter .= $letter;
}
}
$output = $letter . $output;
}
return $output;
}
}
I'm giving you also 100% coverage if you intend to work with it further. It tests against original string incrementation ++:
/**
* @dataProvider letterProvider
*/
public function testIncrementLetter(string $givenLetter) : void
{
$expectedValue = $givenLetter;
self::assertSame(++$expectedValue, NextLetter::get($givenLetter));
}
/**
* @return iterable<array<string>>
*/
public function letterProvider() : iterable
{
yield ['A'];
yield ['a'];
yield ['z'];
yield ['Z'];
yield ['aaz'];
yield ['aaZ'];
yield ['abz'];
yield ['abZ'];
}
To increment or decrement within the 7-bit, 128-character ASCII range, the safest way is:
$CHAR = "l";
echo chr(ord($CHAR)+1)." ".chr(ord($CHAR)-1);
/* m k */
So it is normal to get a backtick when decrementing a, as the ASCII table shows.
Print the whole ascii range:
for ($i = 0;$i < 127;$i++){
echo chr($i);
}
/* !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ */
More infos about ANSI 7 bits ASCII: man ascii
To increment or decrement in the 8-bit extended range (256 characters):
This is where it starts to depend on the host machine's charset, but those charsets are all available on modern machines. From PHP, the safest is to use the php-mbstring extension: https://www.php.net/manual/en/function.mb-chr.php
Extended ASCII (EASCII or high ASCII) character encodings are
eight-bit or larger encodings that include the standard seven-bit
ASCII characters, plus additional characters. https://en.wikipedia.org/wiki/Extended_ASCII
More info, as example: man iso_8859-9
ISO 8859-1 West European languages (Latin-1)
ISO 8859-2 Central and East European languages (Latin-2)
ISO 8859-3 Southeast European and miscellaneous languages (Latin-3)
ISO 8859-4 Scandinavian/Baltic languages (Latin-4)
ISO 8859-5 Latin/Cyrillic
ISO 8859-6 Latin/Arabic
ISO 8859-7 Latin/Greek
ISO 8859-8 Latin/Hebrew
ISO 8859-9 Latin-1 modification for Turkish (Latin-5)
ISO 8859-10 Lappish/Nordic/Eskimo languages (Latin-6)
ISO 8859-11 Latin/Thai
ISO 8859-13 Baltic Rim languages (Latin-7)
ISO 8859-14 Celtic (Latin-8)
ISO 8859-15 West European languages (Latin-9)
ISO 8859-16 Romanian (Latin-10)
Example, we can find the € symbol in ISO 8859-7:
244 164 A4 € EURO SIGN
To increment or decrement across Unicode code points:
Here is a way to generate the Unicode charset, by generating HTML entities and converting them to UTF-8.
for ($x = 0; $x < 262144; $x++){
echo html_entity_decode("&#".$x.";",ENT_NOQUOTES,"UTF-8");
}
Same stuff, but the range goes up to (16^4 * 4)!
echo html_entity_decode('&#33;',ENT_NOQUOTES,'UTF-8');
/* ! */
echo html_entity_decode('&#34;',ENT_NOQUOTES,'UTF-8');
/* " */
To retrieve the Unicode € symbol, using the base-10 decimal representation of the character:
echo html_entity_decode('&#8364;',ENT_NOQUOTES,'UTF-8');
/* € */
The same symbol, using the base16 hexadecimal representation:
echo html_entity_decode('&#'.hexdec("20AC").';',ENT_NOQUOTES,'UTF-8');
/* € */
The first 32 code points are reserved for special control characters; they print as garbage (�����) but do have a meaning.
You are looking at a number representation problem. This is base 26 (or however many symbols your alphabet has). Let's call the base b.
Assign a number to each letter in alphabet (A=1, B=2, C=3).
Next, figure out your input "number": The representation "ABC" means A*b^2 + B*b^1 + C*b^0
Use this formula to find the number (int). Increment it.
Next, convert it back to your number system: Divide by b^2 to get third digit, the remainder (modulo) by b^1 for second digit, the remainder (modulo) by b^0 for last digit.
This might help: How to convert from base10 to any other base.
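The steps above can be sketched in PHP as follows. Note two assumptions that differ slightly from the description: letters are numbered from A = 0 (so that the string is a plain base-26 number), and the width is kept fixed, so ZZZ wraps around to AAA. The function name increment_letters is made up for this example, and only upper-case A-Z is handled.

```php
<?php
// Hypothetical helper: increments a fixed-width string of letters A-Z by
// treating it as a base-26 number with A = 0 ... Z = 25.
function increment_letters(string $s): string
{
    $base = 26;
    $width = strlen($s);

    // Letters -> integer: "ABC" means A*26^2 + B*26^1 + C*26^0.
    $n = 0;
    foreach (str_split($s) as $ch) {
        $n = $n * $base + (ord($ch) - ord('A'));
    }

    $n++;

    // Integer -> letters, least significant digit first.
    $out = '';
    for ($i = 0; $i < $width; $i++) {
        $out = chr(ord('A') + $n % $base) . $out;
        $n = intdiv($n, $base);
    }
    return $out;
}

echo increment_letters('AAA'), "\n";
echo increment_letters('AAZ'), "\n";
```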
You could use the ASCII codes for alpha numerics. From there you increment and decrement to get the previous/next character.
You could split your string in single characters and then apply the transformations on these characters.
Just some thoughts to get you started.
<?php
$values[] = 'B';
$values[] = 'A';
$values[] = 'Z';
foreach($values as $value ){
if($value == 'Z'){
$value = '-1';
}
$op = ++$value;
echo $op;
}
?>
I have these methods in c# that you could probably convert to php and modify to suit your needs, I'm not sure Hexavigesimal is the exact name for these though...
#region Hexavigesimal (Excel Column Name to Number)
public static int FromHexavigesimal(this string s)
{
int i = 0;
s = s.Reverse();
for (int p = s.Length - 1; p >= 0; p--)
{
char c = s[p];
i += c.toInt() * (int)Math.Pow(26, p);
}
return i;
}
public static string ToHexavigesimal(this int i)
{
StringBuilder s = new StringBuilder();
while (i > 26)
{
int r = i % 26;
if (r == 0)
{
i -= 26;
s.Insert(0, 'Z');
}
else
{
s.Insert(0, r.toChar());
}
i = i / 26;
}
return s.Insert(0, i.toChar()).ToString();
}
public static string Increment(this string s, int offset)
{
return (s.FromHexavigesimal() + offset).ToHexavigesimal();
}
private static char toChar(this int i)
{
return (char)(i + 64);
}
private static int toInt(this char c)
{
return (int)c - 64;
}
#endregion
EDIT
I see by the other answers that in PHP you can use ++ instead, nice!
I have a long "binary string" like the output of PHP's pack function.
How can I convert this value to base62 (0-9a-zA-Z)?
The built in maths functions overflow with such long inputs, and BCmath doesn't have a base_convert function, or anything that specific. I would also need a matching "pack base62" function.
I think there is a misunderstanding behind this question. Base conversion and encoding/decoding are different. The output of base64_encode(...) is not a large base64 number. It's a series of discrete base64 values, produced group by group by the encoding function. That is why BC Math does not work: BC Math is concerned with single large numbers, not strings that are in reality groups of small numbers that represent binary data.
Here's an example to illustrate the difference:
base64_encode(1234) = "MTIzNA=="
base64_convert(1234) = "TS" //if the base64_convert function existed
base64 encoding breaks the input up into groups of 3 bytes (3*8 = 24 bits), then converts each sub-segment of 6 bits (2^6 = 64, hence "base64") to the corresponding base64 character (values are "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/", where A = 0, / = 63).
In our example, base64_encode() treats "1234" as a string of 4 characters, not an integer (because base64_encode() does not operate on integers). Therefore it outputs "MTIzNA==", because (in US-ASCII/UTF-8/ISO-8859-1) "1234" is 00110001 00110010 00110011 00110100 in binary. This gets broken into 001100 (12 in decimal, character "M") 010011 (19 in decimal, character "T") 001000 ("I") 110011 ("z") 001101 ("N") 00. Since the last group isn't complete, it gets padded with 0's and the value is 000000 ("A"). Because everything is done by groups of 3 input characters, there are 2 groups: "123" and "4". The last group is padded with ='s to make it 3 chars long, so the whole output becomes "MTIzNA==".
Converting to base64, on the other hand, takes a single integer value and converts it into a single base64 value. For our example, 1234 (decimal) is "TS" (base64), if we use the same string of base64 values as above. Working backward, and left-to-right: T = 19 (column 1), S = 18 (column 0), so (19 * 64^1) + (18 * 64^0) = 19 * 64 + 18 = 1234 (decimal). The same number can be represented as "4D2" in hexadecimal (base16): (4 * 16^2) + (D * 16^1) + (2 * 16^0) = (4 * 256) + (13 * 16) + (2 * 1) = 1234 (decimal).
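The difference can be checked with a few lines of PHP. The function int_to_base64_number below is a made-up stand-in for the non-existent base64_convert, limited to ordinary non-negative PHP integers:

```php
<?php
// Hypothetical helper: writes a non-negative integer as a base-64 *number*
// using the same alphabet as base64 encoding (A = 0 ... / = 63).
function int_to_base64_number(int $n): string
{
    $alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
    $digits = '';
    do {
        $digits = $alphabet[$n % 64] . $digits;
        $n = intdiv($n, 64);
    } while ($n > 0);
    return $digits;
}

echo base64_encode('1234'), "\n";      // encodes the four characters "1234"
echo int_to_base64_number(1234), "\n"; // converts the number 1234
```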
Unlike encoding, which takes a string of characters and changes it, base conversion does not alter the actual number, just changes its presentation. The hexadecimal (base16) "FF" is the same number as decimal (base10) "255", which is the same number as "11111111" in binary (base2). Think of it like currency exchange, if the exchange rate never changed: $1 USD has the same value as £0.79 GBP (exchange rate as of today, but pretend it never changes).
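A quick way to convince yourself of this in PHP, using only built-in functions:

```php
<?php
$n = 255;

// Three presentations of the same integer.
echo dechex($n), "\n"; // ff
echo decbin($n), "\n"; // 11111111
echo $n, "\n";         // 255

// Parsing the presentations gives back the very same number.
var_dump(hexdec('ff') === 255);                // bool(true)
var_dump(bindec('11111111') === hexdec('ff')); // bool(true)
```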
In computing, integers are typically operated on as binary values (because it's easy to build 1-bit arithmetic units and then stack them together to make 32-bit/etc. arithmetic units). To do something as simple as "255 + 255" (decimal), the computer needs to first convert the numbers to binary ("11111111" + "11111111") and then perform the operation in the Arithmetic Logic Unit (ALU).
Almost all other uses of bases are purely for the convenience of humans (presentational) - computers display their internal value 11111111 (binary) as 255 (decimal) because humans are trained to operate on decimal numbers. The function base64_convert() doesn't exist as part of the standard PHP repertoire because it's not often useful to anyone: not many humans read base64 numbers natively. By contrast, binary 1's and 0's are sometimes useful for programmers (we can use them like on/off switches!), and hexadecimal is convenient for humans editing binary data because an entire 8-bit byte can be represented unambiguously as 00 through FF, without wasting too much space.
You may ask, "if base conversion is just for presentation, why does BC Math exist?" That's a fair question, and also exactly why I said "almost" purely for presentation: typical computers are limited to 32-bit or 64-bit wide numbers, which are usually plenty big enough. Sometimes you need to operate on really, really big numbers (RSA moduli for example), which don't fit in those registers. BC Math solves this problem by acting as an abstraction layer: it converts huge numbers into long strings of text. When it's time to do some operation, BC Math painstakingly breaks the long strings of text up into small chunks which the computer can handle. It's much, much slower than native operations, but it can handle arbitrary-sized numbers.
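To give a feel for the idea, here is a rough sketch of schoolbook addition on decimal strings. This is not BC Math's actual implementation, just an illustration of arithmetic on numbers stored as text; add_decimal_strings() is a made-up name, and the sketch assumes both inputs are non-negative integers written with digits only.

```php
<?php
// Hypothetical helper for illustration; not how BC Math works internally.
// Assumes $a and $b contain only the digits 0-9 (no sign, no decimal point).
function add_decimal_strings(string $a, string $b): string
{
    $result = '';
    $carry = 0;
    $i = strlen($a) - 1;
    $j = strlen($b) - 1;
    // Walk both strings from the least significant digit, carrying as we go.
    while ($i >= 0 || $j >= 0 || $carry > 0) {
        $sum = $carry;
        if ($i >= 0) {
            $sum += (int) $a[$i--];
        }
        if ($j >= 0) {
            $sum += (int) $b[$j--];
        }
        $result = ($sum % 10) . $result;
        $carry = intdiv($sum, 10);
    }
    return $result === '' ? '0' : $result;
}

// One more than the largest signed 64-bit integer:
echo add_decimal_strings('9223372036854775807', '1'), "\n"; // 9223372036854775808
echo add_decimal_strings('255', '255'), "\n";               // 510
```

Each step only ever handles numbers below 20, so the native integer width never matters, at the cost of one loop iteration per digit.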
Here is a function base_conv() that can convert between completely arbitrary bases, expressed as arrays of strings. Each array element represents a single "digit" in that base, and its array index is the digit's value, thus also allowing multi-character digits (it is your responsibility to avoid ambiguity). Note the argument order: the target base comes before the source base.
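// Convert $val from the base described by $baseFrom to the base described by $baseTo.
// Each base is an array of digit strings; the array index is the digit's value.
function base_conv($val, &$baseTo, &$baseFrom)
{
    return base_arr_to_str(base_conv_arr(base_str_to_arr((string) $val, $baseFrom), count($baseTo), count($baseFrom)), $baseTo);
}

// Repeated long division on an array of digit values: divide by the target base,
// collect the remainder, and continue with the quotient until it is zero.
function base_conv_arr($val, $baseToDigits, $baseFromDigits)
{
    $valCount = count($val);
    $result = array();
    do
    {
        $divide = 0; // running remainder
        $newlen = 0; // number of quotient digits written back into $val
        for ($i = 0; $i < $valCount; ++$i)
        {
            $divide = $divide * $baseFromDigits + $val[$i];
            if ($divide >= $baseToDigits)
            {
                $val[$newlen ++] = (int) ($divide / $baseToDigits);
                $divide = $divide % $baseToDigits;
            }
            else if ($newlen > 0)
            {
                // leading zeros of the quotient are skipped, interior ones are kept
                $val[$newlen ++] = 0;
            }
        }
        $valCount = $newlen;
        // remainders come out least significant first, so prepend
        array_unshift($result, $divide);
    }
    while ($newlen != 0);
    return $result;
}

// Map an array of digit values to their string representation in $base.
function base_arr_to_str($arr, &$base)
{
    $str = '';
    foreach ($arr as $digit)
    {
        $str .= $base[$digit];
    }
    return $str;
}

// Split a string into digit values by matching the digits of $base
// against the start of the string, in array order.
function base_str_to_arr($str, &$base)
{
    $arr = array();
    // empty('0') is true in PHP, so the string '0' needs its own check
    while ($str === '0' || !empty($str))
    {
        foreach ($base as $index => $digit)
        {
            if (mb_substr($str, 0, $digitLen = mb_strlen($digit)) === $digit)
            {
                $arr[] = $index;
                $str = mb_substr($str, $digitLen);
                continue 2;
            }
        }
        // no digit of $base matches the start of the remaining string
        throw new Exception('Input contains a digit that is not in the source base');
    }
    return $arr;
}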
Examples:
$baseDec = str_split('0123456789');
$baseHex = str_split('0123456789abcdef');
echo base_conv(255, $baseHex, $baseDec); // ff
echo base_conv('ff', $baseDec, $baseHex); // 255
// multi-character base:
$baseHelloworld = array('hello ', 'world ');
echo base_conv(37, $baseHelloworld, $baseDec); // world hello hello world hello world
echo base_conv('world hello hello world hello world ', $baseDec, $baseHelloworld); // 37
// ambiguous base:
// don't do this! base_str_to_arr() won't know how to decode e.g. '11111'
// (well it does, but the result might not be what you'd expect;
// It matches digits sequentially so '11111' would be array(0, 0, 1)
// here (matched as '11', '11', '1' since they come first in the array))
$baseAmbiguous = array('11', '1', '111');
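If you want to catch such a base up front, one simple (and stricter than necessary) check is to reject any base in which one digit is a prefix of another. When no digit is a prefix of another, at most one digit can match at any position, so sequential matching cannot go wrong. The helper name base_has_ambiguous_digits() is made up for this sketch and is not part of the functions above.

```php
<?php
// Hypothetical helper: returns true if some digit is a prefix of
// (or identical to) another digit in the same base.
function base_has_ambiguous_digits(array $base): bool
{
    foreach ($base as $i => $a) {
        foreach ($base as $j => $b) {
            // Compare only the first strlen($a) bytes of $b against $a.
            if ($i !== $j && strncmp($b, $a, strlen($a)) === 0) {
                return true;
            }
        }
    }
    return false;
}

var_dump(base_has_ambiguous_digits(array('11', '1', '111')));       // bool(true)
var_dump(base_has_ambiguous_digits(array('hello ', 'world ')));     // bool(false)
var_dump(base_has_ambiguous_digits(str_split('0123456789abcdef'))); // bool(false)
```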