Parse PHP String Based On Number of Characters

Parse PHP String Based On Number of Characters - php

I'm starting to work on a small script that takes a string, counts the number of characters, then, based on the number of characters, splits/breaks the string apart and sends/emails 110 characters at a time.
What would be the proper logic/PHP to use to:
1) Count the number of characters in the string
2) Preface each message with (1/3) (2/3) (3/3), etc...
3) And only send 110 characters at a time.
I know I'd probably have to use strlen to count the characters, and some type of loop to loop through, but I'm not quite sure how to go about it.
Thanks!

You could use str_split, if you're not concerned with where you break the strings.
Else, if you are concerned with this (and want to, say, split only on a whitespace), you could do something like:
// $str is the string you want to chop up.
$split = preg_split('/(.{0,110})\s/',
$str,
0,
PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
With this array you could then simply do:
$count = count($split);
foreach ($split as $key => $message) {
$part = sprintf("(%d/%d) %s", $key+1, $count, $message);
// $part is now one of your messages;
// do what you wish with it here.
}

use str_split() and iterate over the resulting array.

From the top of my head, should work as is, but doesn't have to. Logic is ok though.
foreach ($messages as $msg) {
$len = strlen($msg);
if ($len > 110) {
$parts = ceil($len / 100);
for ($i = 1; $i <= $parts; $i++) {
$part = $i . '/' . $parts . ' ' . substr($msg, 0, 110);
$msg = substr($msg, 109);
your_sending_func($part);
}
} else {
your_sending_func($msg);
}
}

Related

PHP: Shift characters of string by 5 spaces? So that A becomes F, B becomes G etc

How can I shift characters of string in PHP by 5 spaces?
So say:
A becomes F
B becomes G
Z becomes E
same with symbols:
!##$%^&*()_+
so ! becomes ^
% becomes )
and so on.
Anyway to do this?

The other answers use the ASCII table (which is good), but I've got the impression that's not what you're looking for. This one takes advantage of PHP's ability to access string characters as if the string itself is an array, allowing you to have your own order of characters.
First, you define your dictionary:
// for simplicity, we'll only use upper-case letters in the example
$dictionary = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
Then you go through your input string's characters and replace each of them with it's $position + 5 in the dictionary:
$input_string = 'STRING';
$output_string = '';
$dictionary_length = strlen($dictionary);
for ($i = 0, $length = strlen($input_string); $i < $length; $i++)
{
$position = strpos($dictionary, $input_string[$i]) + 5;
// if the searched character is at the end of $dictionary,
// re-start counting positions from 0
if ($position > $dictionary_length)
{
$position = $position - $dictionary_length;
}
$output_string .= $dictionary[$position];
}
$output_string will now contain your desired result.
Of course, if a character from $input_string does not exist in $dictionary, it will always end up as the 5th dictionary character, but it's up to you to define a proper dictionary and work around edge cases.

Iterate over characters and, get ascii value of each character and get char value of the ascii code shifted by 5:
function str_shift_chars_by_5_spaces($a) {
for( $i = 0; $i < strlen($a); $i++ ) {
$b .= chr(ord($a[$i])+5);};
}
return $b;
}
echo str_shift_chars_by_5_spaces("abc");
Prints "fgh"

Iterate over string, character at a time
Get character its ASCII value
Increase by 5
Add to new string
Something like this should work:
<?php
$newString = '';
foreach (str_split('test') as $character) {
$newString .= chr(ord($character) + 5);
}
echo $newString;
Note that there is more than one way to iterate over a string.

PHP has a function for this; it's called strtr():
$shifted = strtr( $string,
"ABCDEFGHIJKLMNOPQRSTUVWXYZ",
"FGHIJKLMNOPQRSTUVWXYZABCDE" );
Of course, you can do lowercase letters and numbers and even symbols at the same time:
$shifted = strtr( $string,
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789!##$%^&*()_+",
"FGHIJKLMNOPQRSTUVWXYZABCDEfghijklmnopqrstuvwxyzabcde5678901234^&*()_+!##$%" );
To reverse the transformation, just swap the last two arguments to strtr().
If you need to change the shift distance dynamically, you can build the translation strings at runtime:
$shift = 5;
$from = $to = "";
$sequences = array( "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz",
"0123456789", "!##$%^&*()_+" );
foreach ( $sequences as $seq ) {
$d = $shift % strlen( $seq ); // wrap around if $shift > length of $seq
$from .= $seq;
$to .= substr($seq, $d) . substr($seq, 0, $d);
}
$shifted = strtr( $string, $from, $to );

PHP: trim word OR part of it from begining/end of string

I need to trim words from begining and end of string. Problem is, sometimes the words can be abbreviated ie. only first three letters (followed by dot).
I tried hard to find suitable regular expression. Basicaly I need to chatch three or more initial characters up to length of replacement, but I cannot find regular expression, that will match variable length and will keep order of characters.
For example, if I need to trim 'insurance' from sentence 'insur. companies are rich', then pattern \^[insurance]{3,9}\ comes to my mind, but this pattern will also catch words like 'sensace', because order of characters (and their occurance) inside [] is not important for regexp.
Also, at end of string, I need remove serial-numbers, that are abbreviated from beginig - say 'XK-25F14' is sometimes presented as '25F14'. So I decided to go purely with character by character comparison.
Therefore I end with following php function
function trimWords($s, $dirt, $case_insensitive = false, $reverse = true)
{
$pos = 0;
$func = $case_insensitive ? 'strncasecmp' : 'strncmp';
// Get number of initial characters, that match in both strings
while ($func($s, $dirt, $pos + 1) === 0)
$pos++;
// If more than 2 initial characters match, then remove the match
if ($pos > 2)
$s = substr($s, $pos);
// Reverse $s and $dirt so it will trim from the end of string
$s = strrev($s);
if ($reverse)
return trimWords($s, strrev($dirt), $case_insensitive, false);
// After second run return back-reversed string
return trim($s, ' .-');
}
I'm happy with this function, but it has one drawback. It trims only one occurence of word. How to make it trim more occurances, i.e. remove both 'insurance ' from 'Insurance insur. companies'.
And I'm also curious, it realy does not exists such regular expression, that will match variable length and will respect order of characters in pattern?
Final solution
Thanks to mrhobo I have ended with function based on regular expression. This function can be easily improved and shall also be the most efficient for this task.
I have modified my previous function and it is two times quicker than regexp, but it can remove only one word per single run, so to be able to remove word from begin and end, it has to runs itself twice and performance is same as regexp and to remove more than one occurance of word, it has to runs itself multiple times, which will then be more and more slower.
The final function goes like this.
function trimWords($string, $word, $case_insensitive = false, $min_abbrv = 3)
{
$exc = substr($word, $min_abbrv);
$pat = null;
$i = strlen($exc);
while ($i--)
$pat = '(?>'.preg_quote($exc[$i], '#').$pat.')?';
$pat = substr($word, 0, $min_abbrv).$pat;
$pat = '#(?<begin>^)?(?:\W*\b'.$pat.'\b\W*)+(?(begin)|$)#';
if ($case_insensitive)
$pat .= 'i';
return preg_replace($pat, '', $string);
}
NOTE: with this function, it does not matter, if abbreviation ends with dot or not, it wipes out any shorter form of word and also removes all nonword characters around the word.
EDIT: I just tried create replace pattern like insu(r|ra|ran|ranc|rance) and function with atomic groups is faster by ~30% and with longer words it could be possibly even more efficient.

Matching a word and all possible abbreviations from the nth letter isn't quite an easy task in regex.
Here is how I would do it for the word insurance from the 4th letter:
insu(?>r(?>a(?>n(?>c(?>(?<last>e))?)?)?)?)?(?(last)|\.)
http://regex101.com/r/aL2gV4
It works by using atomic groups to force the regex engine as far as possible forward past the last 'rance' letters using the nested pattern (?>a(?>b)?)?. If the last letter letter is matched we're not dealing with an abbreviation thus no dot is required, otherwise the dot is required. This is coded by (?(last)|\.).
To trim, I would create a function to build the above regex for an abbreviation. Then you can write a while loop that replaces each of the abbreviation regexes with empty space until there are no more matches.
Non regex version
Here is my non regex version that removes multiple words and abbreviated words from a string:
function trimWords($str, $word, $min_abbrv, $case_insensitive = false) {
$len = 0;
$word_len = strlen($word);
$strlen = strlen($str);
$cmp = $case_insensitive ? strncasecmp : strncmp;
for ($i = 0; $i < $strlen; $i++) {
if ($cmp($str[$i], $word[$len], $i) == 0) {
$len++;
} else if ($len > 0) {
if ($len == $word_len || ($len >= $min_abbrv && ($dot = $str[$i] == '.'))) {
$i -= $len;
$len += $dot;
$str = substr($str, 0, $i) . substr($str, $i+$len);
$strlen = strlen($str);
$dot = 0;
}
$len = 0;
}
}
return $str;
}
Example:
$string = 'ins. <- "ins." / insu. insuranc. insurance / insurance. <- "."';
echo trimWords($string, 'insurance', 4);
Output is:
ins. <- "ins." / / . <- "."

I wrote function that constructs regular expression pattern according to mrhobo and also simple test and benchmarked it against my function with pure PHP string comparison.
Here is the code:
$string = 'Insur. companies are nasty rich';
$dirt = 'insurance';
$cycles = 500000;
$start = microtime(true);
$i = $cycles;
while ($i) {
$i--;
regexpStyle($string, $dirt, true);
}
$stop = microtime(true);
$i = $cycles;
while ($i) {
$i--;
trimWords($string, $dirt, true);
}
$end = microtime(true);
$res1 = $stop - $start;
$res2 = $end - $stop;
$winner = $res1 < $res2 ? '<<<' : '>>>';
echo 'regexp: '.$res1.' '.$winner.' string operations: '.$res2;
function trimWords($s, $dirt, $case_insensitive = false, $reverse = true)
{
$pos = 0;
$func = $case_insensitive ? 'strncasecmp' : 'strncmp';
// Get number of initial characters, that match in both strings
while ($func($s, $dirt, $pos + 1) === 0)
$pos++;
// If more than 2 initial characters match, then remove the match
if ($pos > 2)
$s = substr($s, $pos);
// After second run return back-reversed string
return trim($s, ' .-');
}
function regexpStyle($s, $dirt, $case_insensitive, $min_abbrev = 3)
{
$ss = substr($dirt, $min_abbrev);
$arr = str_split($ss);
$patt = '(?>(?<last>'.array_pop($arr).'))?';
$i = count($arr);
while ($i)
$patt = '(?>'.$arr[--$i].$patt.')?';
$patt = '#^'.substr($dirt, 0, $min_abbrev).$patt.'(?(last)|\.)#';
$patt .= $case_insensitive ? 'i' : null;
return trim(preg_replace($patt, '', $s));
}
and the winner is... moment of silence... it is...
a draw
regexp: 8.5169589519501 >>> string operations: 8.0951890945435
but I have strong feeling that regexp approach could be better utilized.

reorder / rewrap bbcodes

I'm trying to reorder the BBCodes but I failed
so
[̶b̶]̶[̶i̶]̶[̶u̶]̶f̶o̶o̶[̶/̶b̶]̶[̶/̶u̶]̶[̶/̶i̶]̶ ̶-̶ ̶w̶r̶o̶n̶g̶ ̶o̶r̶d̶e̶r̶ ̶ ̶
I̶ ̶w̶a̶n̶t̶ ̶i̶t̶ ̶t̶o̶ ̶b̶e̶:̶ ̶
̶[̶b̶]̶[̶i̶]̶[̶u̶]̶f̶o̶o̶[̶/̶u̶]̶[̶/̶i̶]̶[̶/̶b̶]̶ ̶-̶ ̶r̶i̶g̶h̶t̶ ̶o̶r̶d̶e̶r̶
PIC:
I tried with
<?php
$string = '[b][i][u]foo[/b][/u][/i]';
$search = array('/\[b](.+?)\[\/b]/is', '/\[i](.+?)\[\/i]/is', '/\[u](.+?)\[\/u]/is');
$replace = array('[b]$1[/b]', '[i]$1[/i]', '[u]$1[/u]');
echo preg_replace($search, $replace, $string);
?>
OUTPUT: [b][i][u]foo[/b][/u][/i]
any suggestions ? thanks!

phew, spent awhile thinking of the logic to do this. (feel free to put it in a function)
this only works for the scenario given. Like other users have commented it's impossible. You shouldn't be doing this. Or even on server side. I'd use a client side parser just to throw a syntax error.
supports [b]a[i]b[u]foo[/b]baa[/u]too[/i]
and bbcode with custom values [url=test][i][u]foo[/url][/u][/i]
Will break with
[b] bold [/b][u] underline[/u]
And [b] bold [u][/b] underline[/u]
//input string to be reorganized
$string = '[url=test][i][u]foo[/url][/u][/i]';
echo $string . "<br />";
//search for all opentags (including ones with values
$tagsearch = "/\[([A-Za-z]+)[A-Za-z=._%?&:\/-]*\]/";
preg_match_all($tagsearch, $string, $tags);
//search for all close tags to store them for later
$closetagsearch = "/(\[\/([A-Za-z]+)\])/is";
preg_match_all($closetagsearch, $string, $closetags);
//flip the open tags for reverse parsing (index one is just letters)
$tags[1] = array_reverse($tags[1]);
//create temp var to store new ordered string
$temp = "";
//this is the last known position in the original string after a match
$last = 0;
//iterate through each char of the input string
for ($i = 0, $len = strlen($string); $i < $len; $i++) {
//if we run out of tags to replace/find stop looping
if (empty($tags[1]) || empty($closetags[1]))
continue;
//this is the part of the string that has no matches
$good = substr($string, $last, $i - $last);
//next closing tag to search for
$next = $closetags[1][0];
//how many chars ahead to compare against
$scope = substr($string, $i, strlen($next));
//if we have a match
if ($scope === "$next") {
//add to the temp variable with a modified
//version of an open tag letter to become a close tag
$temp .= $good . substr_replace("[" . $tags[1][0] . "]", "/", 1, 0);
//remove the first key/value in both arrays
array_shift($tags[1]);
array_shift($closetags[1]);
//update the last known unmatched char
$last += strlen($good . $scope);
}
}
echo $temp;
Please also note: it might be the users intention to nest the tags out of order :X

Increment through A-Za-z0-9 in expanding string with PHP

I'm trying to generate strings in PHP with a group of valid characters, cycling through them and appending an extra character on the end of the string, until maximum length is reached. For example, desired output:
a,b,c,d,e,f,aa,ab,ac,ad,ae,af,ba,bb,bc,bd,be,bf,ca,cb..etc
This is my PHP function so far:
<?php
$chars = Array('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z',
'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','X','Y','Z',
'1','2','3','4','5','6','7','8','9','0');
$maxlen = 10;
$input = $chars[0];
while (1):
echo buildInput($maxlen, $chars, $input) . "\n";
endwhile;
function buildInput($maxlen, $chars, $previous)
{
if (array_search(substr($previous, -1), $chars) == sizeof($chars) - 1):
// end of input cycle reached, add another character
$previous = $previous . $chars[0];
endif;
if (strlen($previous) > $maxlen):
die('Max length reached');
endif;
// Remove last character, and append incremented char
$input = substr($previous, 0, -1);
$input = $input . $chars[array_search(substr($previous, -1), $chars)+1];
return $input;
}
?>
It only increments the last character of the string which gets to 0, then appends 'a' and starts over but without trying all the other possible permutations.
Could someone help me with a better method?

Is this the kind of thing you want?
<?php
$chars = array('a','b','c');
$max_length = 3;
function build($base_arr, $ctr) {
global $chars;
global $max_length;
$combos = array();
foreach ($base_arr as $base) {
foreach ($chars as $char) {
echo $base, $char, '<br />';
$combos[] = $base.$char;
}
}
if ($ctr < $max_length) {
build($combos, $ctr + 1);
}
}
foreach ($chars as $char) {
echo $char, '<br />';
}
build($chars, 2);
?>
It'll give you: a, b, c, aa, ab, ac, ba, bb, bc, ca, cb, cc, ..., bcc, caa, cab, cac, cba, cbb, cbc, cca, ccb, ccc.
Your array is so large, though, that using this method on it would take up way too much memory to work. Out of 62 characters (A-Z, a-z, 0-9), the number of possible 10-character permutations is 8.4 x 10^17; so hopefully, you'll be able to find a more efficient method or figure out a way to get the result you want without having to cycle through such a large array. I hope you find what you're looking for!

If you limit yourself to 0-9,a-z (only lower case), then you could use base_convert for this and do it in one line:
for($i = 0; $i < 1000; $i++) echo base_convert($i, 10, 36) . '<br/>';
Here's a demo.

This will print 200 letters: a,b,c,d,...,aa,...,cq
The buildString function will build our string from the least significant number (right) to the most significant (left). By performing a modulus division, you will find the array position of the next character. Add this character to the front of your string, and divide
your number by the size of your array (which is the base number in your character based number system), ignoring the rest.
To explain the method using our normal 10-based number system and the input of 123, you would simply pick the last digit, 3, divide the input by 10, pick the last digit 2, divide the input by 10, pick the last digit 1, divide the input by 10. The input is now 0 and your output is ready...
$chars = array('a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q'
,'r','s','t','u','v','w','x','y','z','A','B','C','D','E','F','G','H','I','J'
,'K','L','M','N','O','P','Q','R','S','T','U','V','X','Y','Z','1','2','3','4'
,'5','6','7','8','9','0');
$numChars = count($chars);
// Output numbers from 1 to 200 (a to cq)
for($i = 1; $i <= 200; $i++) {
echo buildString($i).'<br>';
}
// Will also work fine for large numbers - output "dxSA"
echo buildString(1000000).'<br>';
function buildString($int) {
global $chars;
global $numChars;
$output = '';
while($int) {
$output = $chars[($int-1) % $numChars] . $output;
$int = floor(($int-1) / $numChars);
}
return $output;
}

If you have access to gmp extension and PHP 5.3.2+ this will work for the charset you specified:
$result = strtr(
gmp_strval(gmp_init($i, 10), 62),
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz',
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
);

Performance of tokenizing CSS in PHP

This is a noob question from someone who hasn't written a parser/lexer ever before.
I'm writing a tokenizer/parser for CSS in PHP (please don't repeat with 'OMG, why in PHP?'). The syntax is written down by the W3C neatly here (CSS2.1) and here (CSS3, draft).
It's a list of 21 possible tokens, that all (but two) cannot be represented as static strings.
My current approach is to loop through an array containing the 21 patterns over and over again, do an if (preg_match()) and reduce the source string match by match. In principle this works really good. However, for a 1000 lines CSS string this takes something between 2 and 8 seconds, which is too much for my project.
Now I'm banging my head how other parsers tokenize and parse CSS in fractions of seconds. OK, C is always faster than PHP, but nonetheless, are there any obvious D'Oh! s that I fell into?
I made some optimizations, like checking for '#', '#' or '"' as the first char of the remaining string and applying only the relevant regexp then, but this hadn't brought any great performance boosts.
My code (snippet) so far:
$TOKENS = array(
'IDENT' => '...regexp...',
'ATKEYWORD' => '#...regexp...',
'String' => '"...regexp..."|\'...regexp...\'',
//...
);
$string = '...CSS source string...';
$stream = array();
// we reduce $string token by token
while ($string != '') {
$string = ltrim($string, " \t\r\n\f"); // unconsumed whitespace at the
// start is insignificant but doing a trim reduces exec time by 25%
$matches = array();
// loop through all possible tokens
foreach ($TOKENS as $t => $p) {
// The '&' is used as delimiter, because it isn't used anywhere in
// the token regexps
if (preg_match('&^'.$p.'&Su', $string, $matches)) {
$stream[] = array($t, $matches[0]);
$string = substr($string, strlen($matches[0]));
// Yay! We found one that matches!
continue 2;
}
}
// if we come here, we have a syntax error and handle it somehow
}
// result: an array $stream consisting of arrays with
// 0 => type of token
// 1 => token content

Use a lexer generator.

The first thing I would do would be to get rid of the preg_match(). Basic string functions such as strpos() are much faster, but I don't think you even need that. It looks like you are looking for a specific token at the front of a string with preg_match(), and then simply taking the front length of that string as a substring. You could easily accomplish this with a simple substr() instead, like this:
foreach ($TOKENS as $t => $p)
{
$front = substr($string,0,strlen($p));
$len = strlen($p); //this could be pre-stored in $TOKENS
if ($front == $p) {
$stream[] = array($t, $string);
$string = substr($string, $len);
// Yay! We found one that matches!
continue 2;
}
}
You could further optimize that by pre-calculating the length of all your tokens and storing them in the $TOKENS array, so that you don't have to call strlen() all the time. If you sorted $TOKENS into groups by length, you could reduce the number of substr() calls further as well, as you could take a substr($string) of the current string being analyzed just once for each token length, and run through all the tokens of that length before moving on to the next group of tokens.

the (probably) faster (but less memory friendly) approach would be to tokenize the whole stream at once, using one big regexp with alternatives for each token, like
preg_match_all('/
(...string...)
|
(#ident)
|
(#ident)
...etc
/x', $stream, $tokens);
foreach($tokens as $token)...parse

Don't use regexp, scan character by character.
$tokens = array();
$string = "...code...";
$length = strlen($string);
$i = 0;
while ($i < $length) {
$buf = '';
$char = $string[$i];
if ($char <= ord('Z') && $char >= ord('A') || $char >= ord('a') && $char <= ord('z') || $char == ord('_') || $char == ord('-')) {
while ($char <= ord('Z') && $char >= ord('A') || $char >= ord('a') && $char <= ord('z') || $char == ord('_') || $char == ord('-')) {
// identifier
$buf .= $char;
$char = $string[$i]; $i ++;
}
$tokens[] = array('IDENT', $buf);
} else if (......) {
// ......
}
}
However, that makes the code unmaintainable, therefore, a parser generator is better.

It's an old post but still contributing my 2 cents on this.
one thing that seriously slows down the original code in the question is the following line :
$string = substr($string, strlen($matches[0]));
instead of working on the entire string, take just a part of it (say 50 chars) which are enough for all the possible regexes. then, apply the same line of code on it. when this string shrinks below a preset length, load some more data to it.

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Parse PHP String Based On Number of Characters - php

use str_split() and iterate over the resulting array.

Related

PHP: Shift characters of string by 5 spaces? So that A becomes F, B becomes G etc

PHP: trim word OR part of it from begining/end of string

reorder / rewrap bbcodes

Increment through A-Za-z0-9 in expanding string with PHP

Performance of tokenizing CSS in PHP

Categories

Resources