Without Regex: String Between Quotes? - php

I'm creating a word-replacement script. I've run into a roadblock with ignoring strings between quotes and haven't been able to find a decent solution here that didn't involve Regex.
I have a working snippet that cycles through every character in the string, figures out whether the most recent quotation mark was an opening or closing quote (whether single or double), and ignores escaped quotes. The problem is that, for it to be 100% accurate, it has to run every time the string changes (because of how it works, that could be well over 60K times in a single function call), and given how long the strings can be, the code takes too long even on a fairly short script.
Is there a fast way to figure out whether a string is between open and close quotes (Single and double)? Ignoring escaped " and '. Or, do you have suggestions on how to optimize the snippet to make it run significantly faster? Removing this function, the process runs at almost the preferred speed (Instant).
As an exercise, consider copying and pasting the snippet into the script with a variable containing text. For example $thisIsAQuote = "This is a quote."; And, from that point, everything should replace correctly, except $thisIsAQuote should retain its exact text.
But here's the issue: Other solutions I've found will treat everything between "This is a quote." and ... $this->formatted[$i - 1] != " ... as if it's still between quotes. Because as far as those solutions are concerned, the last quote in "This is a quote." and the first quote in the if-check are open and close quotes. Another obvious issue is that some strings contain words with apostrophes. Apostrophes shouldn't be treated as single-quotes, but in all solutions I've found, they are.
In other words, they're "unaware" solutions.
$quoteClosed = true;
$singleQuoteClosed = true;
$codeLength = mb_strlen($this->formatted);
if ($codeLength == false)
    return;
for ($i = 0; $i < $codeLength; $i++)
{
    if ((!$quoteClosed || !$singleQuoteClosed) && ($this->formatted[$i] == '"' || $this->formatted[$i] == "'"))
    {
        if (!$quoteClosed && $this->formatted[$i - 1] != "\\")
            $quoteClosed = true;
        else if (!$singleQuoteClosed && $this->formatted[$i - 1] != "\\")
            $singleQuoteClosed = true;
    }
    else if ($this->formatted[$i] == '"' && ($i <= 0 || $this->formatted[$i - 1] != "\\"))
    {
        if ($quoteClosed && $singleQuoteClosed)
            $quoteClosed = false;
    }
    else if ($this->formatted[$i] == "'" && ($i <= 0 || $this->formatted[$i - 1] != "\\"))
    {
        if ($singleQuoteClosed && $quoteClosed)
            $singleQuoteClosed = false;
    }
    if ($quoteClosed && $singleQuoteClosed)
        $this->quoted[$i] = 0;
    else
        $this->quoted[$i] = 1;
}
If there isn't a way to make the above more efficient, is there a non-Regex way to quickly replace all substrings in an array with substrings in a second array without missing any across an entire string?
substr_replace and str_replace only seem to replace "some" pieces of the overall string, which is why the number of iterations are in place. It cycles through a while loop until either strpos deems a string nonexistent (Which it never seems to do ... I may be using it wrong), or it cycles through 10K times, whichever occurs first.
Running the above snippet -once- per round would solve the speed issue, but that leaves the "full-replacement" issue and, of course, staying aware that it should avoid replacing anything within quotes.
for ($a = 0; $a < count($this->keys); $a++)
{
    $escape = 0;
    if ($a > count($this->keys) - 5)
        $this->formatted = $this->decodeHTML($this->formatted);
    while (strpos($this->formatted, $this->keys[$a]) !== false)
    {
        $valid = strpos($this->formatted, $this->keys[$a]);
        if ($valid === false || $this->quoted[$valid] === 1)
            break;
        $this->formatted = substr_replace($this->formatted, $this->answers[$a], $valid, mb_strlen($this->keys[$a]));
        $this->initializeQuoted();
        $escape++;
        if ($escape >= 10000)
            break;
    }
    if ($a > count($this->keys) - 5)
        $this->formatted = html_entity_decode($this->formatted);
}
$this->quoted = array();
$this->initializeQuoted();
return $this->formatted;
'keys' and 'answers' are arrays containing words of various lengths. 'formatted' is the new string with the changed information. 'initializeQuoted' is the above snippet. I use htmlentities and html_entity_decode to help get rid of whitespaces with key/answer replacements.
Ignore the magic numbers (5s and 10K).

If I understand you correctly then you can do this:
$replacements = [
    "test" => "banana",
    "Test" => "Banana"
];
$brackets = [[0]];
$lastOpenedQuote = null;
for ($i = 0; $i < strlen($string); $i++) {
    if ($string[$i] == "\\") { $i++; continue; } // Skip escaped chars
    if ($string[$i] === $lastOpenedQuote) {
        // Closing quote: end the quoted segment here, start a new one after it
        $lastOpenedQuote = null;
        $brackets[count($brackets)-1][] = $i;
        $brackets[] = [ $i+1 ];
    } elseif ($lastOpenedQuote === null && ($string[$i] == "\"" || $string[$i] == "'")) {
        // Opening quote: end the unquoted segment before it, start a quoted one
        $lastOpenedQuote = $string[$i];
        $brackets[count($brackets)-1][] = $i-1;
        $brackets[] = [ $i ];
    }
}
$brackets[count($brackets)-1][] = strlen($string)-1;
$bits = [];
foreach ($brackets as $index => $pair) {
    $bits[$index] = substr($string, $pair[0], $pair[1]-$pair[0]+1);
    // Empty segments occur when the string starts or ends with a quote
    if ($bits[$index] != "" && $bits[$index][0] != "\"" && $bits[$index][0] != "'") {
        $bits[$index] = str_replace(array_keys($replacements), array_values($replacements), $bits[$index]);
    }
}
$result = implode("", $bits); // Reassemble the string from its segments
Check it out at: http://sandbox.onlinephpfunctions.com/code/0453cb7941f1dcad636043fceff30dc0965541ee
If performance is still an issue, keep in mind that this goes through each character of the string once and does the minimum number of checks each time, so it will be hard to reduce further. If you need something faster, you may have to rethink the approach from the bottom up, e.g. by doing some of the splitting progressively on the client side instead of processing the whole string on the server side.
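For what it's worth, the same single-pass idea can be wrapped in one function so that the scan and the replacement happen together and nothing has to be re-scanned after each replacement. This is only a sketch: replaceOutsideQuotes is a hypothetical name, and it uses strtr() with an array, which replaces all keys in one pass (longest key first) and never re-replaces text it has already produced. It does not solve the apostrophe problem from the question: a word like don't would still open a single-quoted segment.

```php
<?php
// Hypothetical helper (not from the answer above): applies $replacements
// only to the parts of $text that are outside single or double quotes.
function replaceOutsideQuotes(string $text, array $replacements): string
{
    $out = '';
    $segmentStart = 0;   // where the segment currently being collected begins
    $openQuote = null;   // the quote character we are inside of, or null
    $length = strlen($text);

    for ($i = 0; $i < $length; $i++) {
        $char = $text[$i];
        if ($char === '\\') {
            $i++;        // skip the escaped character
            continue;
        }
        if ($openQuote === null && ($char === '"' || $char === "'")) {
            // Opening quote: flush the unquoted segment with replacements applied.
            $out .= strtr(substr($text, $segmentStart, $i - $segmentStart), $replacements);
            $segmentStart = $i;
            $openQuote = $char;
        } elseif ($char === $openQuote) {
            // Closing quote: copy the quoted segment untouched.
            $out .= substr($text, $segmentStart, $i - $segmentStart + 1);
            $segmentStart = $i + 1;
            $openQuote = null;
        }
    }

    // Flush whatever is left; an unterminated quote is left untouched.
    $tail = substr($text, $segmentStart);
    $out .= $openQuote === null ? strtr($tail, $replacements) : $tail;
    return $out;
}

echo replaceOutsideQuotes('$test = "test"; // Test', ['test' => 'banana', 'Test' => 'Banana']), "\n";
// prints: $banana = "test"; // Banana
```

Because the string is walked once and each segment is handled once, there is no need for a per-character $quoted map or for re-running the quote detection after every replacement.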

I was just working on this. Hope this gives you some additional ideas.
MATCH: ["]([\w\s\(\)\.\d\_\-\[\]\{\}]+|\s*)["]
REPLACE: ""
<?xml version="1.0" encoding="UTF-8"?>
<NotepadPlus>
<ScintillaContextMenu>
<!--
NOTES: BLAH
-->
[WEBSITE]
https://github.com/notepad-plus-plus/notepad-plus-plus/blob/master/PowerEditor/installer/nativeLang/english.xml
-->
<Item MenuId="Tools" MenuItemName="Generate..."/>
<Item MenuEntryName="Edit" FolderName="Remove Lines" MenuItemName="Remove Empty Lines" ItemNameAs="Empty Lines"/>
<Item MenuEntryName="Plugins" FolderName="Remove Lines" MenuItemName="Remove duplicate lines" ItemNameAs="Duplicate Lines (Plugin)"/>
<Item MenuEntryName="Edit" FolderName="Remove Lines" MenuItemName="Remove Consecutive Duplicate Lines" ItemNameAs="Duplicate Lines"/>
<Item MenuEntryName="Search" FolderName="Add Style Tokens" MenuItemName="Using 1st Style" ItemNameAs="1"/>
<Item id="45003" Foldername="Convert" ItemNameAs="Macintosh (CR)"/>
<Item id="0" FolderName="XML Tools"/>
<Item MenuEntryName="Plugins" FolderName="XML Tools" MenuItemName="Options..." ItemNameAs="Options"/>
</ScintillaContextMenu>
</NotepadPlus>
Let me know if you come up with anything else.

Related

What is a workaround for colour coding in wordwrap()

So Minecraft uses section signs (§) for colour coding so for example, light green is §a (a is the color code id for green). An important note to remember is that these are VISUALLY ignored in-game. I'm using wordwrap() to make text look centred however these section signs get in the way of that because they're visually not there yet still considered as characters by the function itself.
Here's my attempt: if you take a look, I tried to count the number of occurrences the section sign was found and multiplied it by two for the colour code character. I later then realized that this is inefficient because this affects the entire line of code and not just a specific bit. This basically means that this would make the length of other colour coded lines look odd since they have more or less colour coding in them. I also tried a rather dumb alternative where I'd use constants but I quickly realized that wasn't going to work. Let me know if anything is unclear. Thanks in advance.
$line = "§r§7This is the §eAuction House§7! In the §eAuction House§7, you can sell and purchase items from other Luriders who have auctioned their items. The §eAuction House §7is a great way to make some cash by simply selling items that other players might be interested in buying.";
public static function itemLineOptimizer(string $line, int $width = 40)
{
$width += substr_count($line, '§') * 2;
return wordwrap($line, $width, "\n");
}
Console Output:
string(281) "§r§7This is the §eAuction House§7! In the §eAuction
House§7, you can sell and purchase items from other
Luriders who have auctioned their items. The §eAuction
House §7is a great way to make some cash by simply
selling items that other players might be interested in
buying."
In-Game Output:
Nowhere near as efficient as IMSoP's approach, but it is an alternative method I wanted to share. I replaced the section signs with a placeholder, removed the colour codes, word-wrapped the text, then added the codes back in their correct places. It looks a bit complicated at first, but it's quite simple. Every line has its details commented.
function itemLineOptimizer(string $line, int $width = 40)
{
    $line = str_replace("§", "&", $line); // Since section signs aren't just one byte, we make our lives easier and replace them with a one-byte symbol; I went with "&"
    $colourCoding = array(); // Straightforward
    $split = str_split($line); // Splitting the line into an array per character
    foreach ($split as $key => $char) { // every character has a $key (position) and the character itself: $char
        if ($char === "&") { // Check if it's a section sign / symbol chosen
            array_push($colourCoding, [$key, $split[$key + 1]]); // add to $colourCoding an array consisting of the position of the sign and the colour code, which is the character right after it
            unset($split[$key]); // remove sign
            unset($split[$key + 1]); // remove colour
        }
    }
    // Now we've removed all colour coding from the line and saved it in $colourCoding
    $bland = wordwrap(implode("", $split), $width, "\n"); // $bland is the now colourless wordwrapped line
    foreach ($colourCoding as $array) { // Lastly we add the section signs back in their positions
        $key = $array[0]; // position
        $colour = $array[1]; // colour
        $lineBreak = substr_count($bland, "§"); // Check for section signs already inside this line: they interfere with future loops since the correct position is different
        $bland = substr_replace($bland, "§".$colour, $key + $lineBreak, 0); // Adding the colour coding back to its correct position
    }
    return $bland; // Straightforward
}
$line = "§r§7This is the §eAuction House§7! In the §eAuction House§7, you can sell and purchase items from other Luriders who have auctioned their items. The §eAuction House §7is a great way to make some cash by simply selling items that other players might be interested in buying.";
var_dump(wordwrap($line, 40), itemLineOptimizer($line, 40));
One way to approach this, which I thought might be interesting, is to take the internal implementation of wordwrap and adapt it to our needs.
So I found the definition in the source, and in particular the special-case algorithm for handling a single-character line break, which is all we need here and saves us from having to understand all the other modes.
It works by copying the string, and then walking through it character by character, tracking when it last saw a space, and when it last saw or inserted a newline character. It then over-writes spaces with newline characters in place, without having to touch the rest of the string.
I first translated that literally into PHP (mostly a case of adding $ in front of each variable, and removing some special type handling macros), giving this:
function my_word_wrap($text, $linelength)
{
    $newtext = $text;
    $breakchar = "\n";
    $laststart = $lastspace = 0;
    $string_length = strlen($text);
    for ($current = 0; $current < $string_length; $current++) {
        if ( $text[$current] == $breakchar ) {
            $laststart = $lastspace = $current + 1;
        }
        elseif ( $text[$current] == ' ' ) {
            if ($current - $laststart >= $linelength) {
                $newtext[$current] = $breakchar;
                $laststart = $current + 1;
            }
            $lastspace = $current;
        }
        elseif ($current - $laststart >= $linelength && $laststart != $lastspace) {
            $newtext[$lastspace] = $breakchar;
            $laststart = $lastspace + 1;
        }
    }
    return $newtext;
}
Two of those if statements include this condition which tracks how many characters we've seen since the last line break: $current - $laststart >= $linelength. What we could do is subtract from that the number of invisible characters we've seen, so they don't contribute to the "width" of lines: $current - $laststart - $invisibles >= $linelength.
Next, we need to detect section signs. My immediate guess was to use $text[$current] == '§', but that doesn't work because we're working in byte offsets, and § is not a single byte. Assuming UTF-8, it's specifically the pair of bytes which in hexadecimal are C2 A7, so we need to test the current and next character for that pair: $text[$current] == "\xC2" && $text[$current+1] == "\xA7".
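If you want to check the byte claim yourself, here is a small sketch. It writes the section sign as a Unicode escape ("\u{A7}", PHP 7+) so the result doesn't depend on how your source file is encoded. visibleLength is a hypothetical helper, not part of wordwrap, and it assumes every section sign is followed by exactly one single-byte colour code and that the rest of the text is single-byte.

```php
<?php
$section = "\u{A7}";           // the section sign, encoded as UTF-8
var_dump(strlen($section));    // int(2): two bytes, not one
var_dump(bin2hex($section));   // string(4) "c2a7"

// Hypothetical helper: byte length of a line minus the invisible colour codes.
// Each code costs three bytes: two for the section sign, one for the colour.
function visibleLength(string $line): int
{
    return strlen($line) - 3 * substr_count($line, "\xC2\xA7");
}

var_dump(visibleLength("\u{A7}aHi"));  // int(2): only "Hi" is displayed
```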
Now we can detect the invisible characters, we can increment our $invisibles counter. Since § is two bytes, and the following character is also invisible, we want to increment the counter by three, and also move the $current pointer an extra two steps:
elseif ( $text[$current] == "\xC2" && $text[$current+1] == "\xA7" ) {
    $invisibles += 3;
    $current += 2;
}
Finally, we need to reset the $invisibles counter whenever we insert a newline, or see an existing one - in other words, everywhere we reset $laststart.
So, the final result looks like this:
function special_word_wrap($text, $linelength)
{
    $newtext = $text;
    $breakchar = "\n";
    $laststart = $lastspace = $invisibles = 0;
    $string_length = strlen($text);
    for ($current = 0; $current < $string_length; $current++) {
        if ( $text[$current] == $breakchar ) {
            $laststart = $lastspace = $current + 1;
            $invisibles = 0;
        }
        elseif ( $text[$current] == ' ' ) {
            if ($current - $laststart - $invisibles >= $linelength) {
                $newtext[$current] = $breakchar;
                $laststart = $current + 1;
                $invisibles = 0;
            }
            $lastspace = $current;
        }
        elseif ( $text[$current] == "\xC2" && $text[$current+1] == "\xA7" ) {
            $invisibles += 3;
            $current += 2;
        }
        elseif ($current - $laststart - $invisibles >= $linelength && $laststart != $lastspace) {
            $newtext[$lastspace] = $breakchar;
            $laststart = $lastspace + 1;
            $invisibles = 0;
        }
    }
    return $newtext;
}
Here's a live demo of it in action with your sample input.
Not the most elegant, and probably not the most efficient way to do it, but I enjoyed the exercise, even if it's not what you were hoping for. :)

Regex replace characters in data

I am trying to clean special characters out of some junked-up data (allowing a few), but some still get through. I found a regex snippet earlier, but it does not remove some characters, like asterisks.
$clean_body = $raw_text;
$clean_title = preg_replace("/[^!&\/A-Za-z0-9_ ]/","", $clean_body);
$clean_title = substr($clean_title, 0, 64);
$clean_body = nl2br($clean_body);
if ($nid) {
$node = node_load($nid);
unset($node->field_category);
} else {
$node = new stdClass();
$node->type = 'article';
node_object_prepare($node);
}
$split_title = str_split($clean_title);
foreach ($split_title as $key => $character) {
if ($key > 15) {
if ($character == ' ' && !preg_match("/[^!&\/,.-]/", $split_title[$key - 1])) {
$node->title = html_entity_decode(substr(strip_tags($clean_title), 0, $key - 1)) . '...';
}
}
}
The first part attempts to clean out anything in the raw text that isn't normal punctuation or alpha numeric. Then, I split the title into an array and look for a space. What I want to do is create a title that is at least 15 characters long, and truncates on a space (leaving whole words intact) without stopping on a punctuation character. This is the part I am having trouble with.
Some titles still come out as ***************** or ** HOW TO MAKE $$$$$$ BLOGGING **, when the first title should not even have *'s, and the section should be HOW TO MAKE..., for example.
What about "/[^!&\/\w\s]/ui" ?
Works fine on my machine
Your problem (or, one of them anyhow) is this logic:
if ($key > 15) {
if ($character == ' ' && !preg_match("/[^!&\/,.-]/", $split_title[$key - 1])) {
$node->title = html_entity_decode(substr(strip_tags($clean_title), 0, $key - 1)) . '...';
}
}
You're only setting $node->title if these conditions match when iterating the characters in the $split_title array.
What happens when they don't match? $node->title doesn't get set (or overwritten? You didn't give much context, so I can't tell).
Using this as a test:
$clean_body = '** HOW TO MAKE $$$$$$ BLOGGING **';
You can see that these conditions do not match, so $node->title does not get set (or overwritten).
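One way around that is to always return a title, falling back to the untruncated text when no suitable space is found. The sketch below is an assumption about what you intended, not a drop-in fix: truncateTitle is a hypothetical helper, it cuts at the first space after position 15 whose preceding character is not one of the punctuation characters from your pattern, and it stops at the first hit instead of letting later matches overwrite earlier ones.

```php
<?php
// Hypothetical helper: truncate on a word boundary after $minLength characters,
// never stopping right after a punctuation character.
function truncateTitle(string $title, int $minLength = 15): string
{
    $length = strlen($title);
    for ($i = $minLength + 1; $i < $length; $i++) {
        // A usable break point is a space whose previous character is not punctuation.
        if ($title[$i] === ' ' && !preg_match('/[!&\/,.-]/', $title[$i - 1])) {
            return substr($title, 0, $i) . '...';
        }
    }
    return $title; // fallback: no break point found, keep the whole title
}

echo truncateTitle('How to make money blogging today'), "\n"; // How to make money...
echo truncateTitle('Short'), "\n";                            // Short
```

You would then assign the result unconditionally, e.g. $node->title = truncateTitle($clean_title);, so the title can never keep a stale or uncleaned value.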

How to remove text between "matching" parentheses?

When I read the alt (technically title) text of this XKCD comic, I became curious whether every article on Wikipedia eventually points to the Philosophy article. So I began writing a web application in PHP that displays which article each article "points" to.
(PS: don't worry about traffic; I'll use it privately and won't send too many requests to the Wikipedia servers.)
To do this, I have to remove text between parentheses and in italics, and then get the first link. The rest can be done with PHP Simple HTML DOM Parser, but removing the text between parentheses is the problem.
If there were no parentheses inside parentheses, I could use this regex: \([^\)]+\). However, some articles, like the one about the German language, have nested parentheses (for example: German (Deutsch [ˈdɔʏtʃ] ( listen)) is..), and the regex above can't handle these cases, since [^\)]*\) stops at the first closing parenthesis, not the matching one. (In that particular case it isn't a problem, since there's no text between the two closing parentheses, but it becomes a big problem when there's a link between them.)
One dirty solution I can think is this:
$s = "content of a wikipedia article";
$depth = 0;
$s2 = "";
for ($i = 0; $i < strlen($s); $i++) {
    $c = substr($s, $i, 1);
    if ($c == '(') $depth++;
    if ($c == ')') { if ($depth > 0) $depth--; continue; }
    if ($depth == 0) $s2 .= $c;
}
$s = $s2;
However, I don't like this solution, since it chops the string into single characters, which looks unnecessary...
Are there other ways to remove text inside a pair of (matching) parentheses?
For example, I want to make this text:
blah(asdf(foo)bar(lol)asdf)blah
into this:
blahblah
but not like this:
blahbarasdf)blah
Edit: from a comment on Emil Vikström's answer, I realized that the approach above (removing text between parentheses) may remove a link that itself contains parentheses. However, I would still like an answer to the problem above, since I've run into similar problems before.
So my question is still: how do I remove text between matching parentheses?
You can check out recursive patterns, which should be able to solve the problem.
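For reference, a recursive pattern for this could look roughly like the following. It's a sketch, with stripParentheses as a hypothetical name: (?R) re-enters the whole pattern, so each opening parenthesis is paired with its matching closing one, and [^()]++ consumes runs of non-parenthesis characters without backtracking.

```php
<?php
// Hypothetical helper: remove every balanced (...) group, including nested ones.
function stripParentheses(string $text): string
{
    // \( ... \) containing either non-parenthesis runs or a nested balanced group.
    $result = preg_replace('/\((?:[^()]++|(?R))*\)/', '', $text);
    return $result ?? $text; // preg_replace returns null on error; fall back to the input
}

echo stripParentheses('blah(asdf(foo)bar(lol)asdf)blah'), "\n"; // blahblah
```

Unbalanced parentheses are simply left in place, because the pattern only matches complete pairs.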
When I read the comic I didn't have the willpower to get my head around recursive patterns, so I simplified it to find a link and only then check if it's in parenthesis. Here's my solution:
//Fetch links
$matches = array();
preg_match_all('!<a [^>]*href="/wiki/([^:"#]+)["#].*>!Umsi', $text, $matches);
$links = $matches[1];
//Find first link not within parentheses
$found = false;
foreach ($links as $l) {
    // Pass the delimiter to preg_quote so a "!" in a link name is escaped too
    if (preg_match('!\([^)]+/wiki/'.preg_quote($l, '!').'.+\)!Umsi', $text)) {
        continue;
    } else {
        $found = true;
        break;
    }
}
Here's my entire script: http://lajm.eu/emil/dump/filosofi.phps
Great! I ran into the same problem while cleaning up Wikipedia plain-text content. Here is how you use my function:
cleanBraces("blah(asdf(foo)bar(lol)asdf)blah", "(", ")")
will return
blahblah
You can pass any type of braces. Like [ and ] or { and }
Here goes my source code.
function cleanBraces($source, $oB, $eB) {
$finalText = "";
if (preg_match("/\\$oB.*\\$eB/", $source) > 0) {
while (preg_match("/\\$oB.*\\$eB/", $source) > 0) {
$brace = getBracesPos($source, $oB, $eB);
$finalText .= substr($source, 0, $brace[0]);
$source = substr($source, $brace[1] + 1, strlen($source) - $brace[1]);
}
$finalText .= $source;
} else {
$finalText = $source;
}
return $finalText;
}
function getBracesPos($source, $oB, $eB) {
if (preg_match("/\\$oB.*\\$eB/", $source) > 0) {
$open = 0;
$length = strlen($source);
for ($i = 0; $i < $length; $i++) {
$currentChar = substr($source, $i, 1);
if ($currentChar == $oB) {
$open++;
if ($open == 1) { // First open brace
$firstOpenBrace = $i;
}
} else if ($currentChar == $eB) {
$open--;
if ($open == 0) { //time to wrap the roots
$lastCloseBrace = $i;
return array($firstOpenBrace, $lastCloseBrace);
}
}
} //for
} //if
}

Wrongly asked or am I stupid?

There's a blog post comment on codinghorror.com by Paul Jungwirth which includes a little programming task:
You have the numbers 123456789, in that order. Between each number, you must insert either nothing, a plus sign, or a multiplication sign, so that the resulting expression equals 2001. Write a program that prints all solutions. (There are two.)
Bored, I thought, I'd have a go, but I'll be damned if I can get a result for 2001. I think the code below is sound and I reckon that there are zero solutions that result in 2001. According to my code, there are two solutions for 2002. Am I right or am I wrong?
/**
* Take the numbers 123456789 and form expressions by inserting one of ''
* (empty string), '+' or '*' between each number.
* Find (2) solutions such that the expression evaluates to the number 2001
*/
$input = array(1,2,3,4,5,6,7,8,9);
// an array of strings representing 8 digit, base 3 numbers
$ops = array();
$numOps = sizeof($input)-1; // always 8
$mask = str_repeat('0', $numOps); // mask of 8 zeros for padding
// generate the ops array
$limit = pow(3, $numOps) -1;
for ($i = 0; $i <= $limit; $i++) {
$s = (string) $i;
$s = base_convert($s, 10, 3);
$ops[] = substr($mask, 0, $numOps - strlen($s)) . $s;
}
// for each element in the ops array, generate an expression by inserting
// '', '*' or '+' between the numbers in $input. e.g. element 11111111 will
// result in 1+2+3+4+5+6+7+8+9
$limit = sizeof($ops);
$stringResult = null;
$numericResult = null;
for ($i = 0; $i < $limit; $i++) {
$l = $numOps;
$stringResult = '';
$numericResult = 0;
for ($j = 0; $j <= $l; $j++) {
$stringResult .= (string) $input[$j];
switch (substr($ops[$i], $j, 1)) {
case '0':
break;
case '1':
$stringResult .= '+';
break;
case '2':
$stringResult .= '*';
break;
default :
}
}
// evaluate the expression
// split the expression into smaller ones to be added together
$temp = explode('+', $stringResult);
$additionElems = array();
foreach ($temp as $subExpressions)
{
// split each of those into ones to be multiplied together
$multplicationElems = explode('*', $subExpressions);
$working = 1;
foreach ($multplicationElems as $operand) {
$working *= $operand;
}
$additionElems[] = $working;
}
$numericResult = 0;
foreach($additionElems as $operand)
{
$numericResult += $operand;
}
if ($numericResult == 2001) {
echo "{$stringResult}\n";
}
}
Further down the same page you linked to.... =)
"Paul Jungwirth wrote:
You have the numbers 123456789, in that order. Between each number, you must insert either nothing, a plus sign, or a multiplication sign, so that the resulting expression equals 2001. Write a program that prints all solutions. (There are two.)
I think you meant 2002, not 2001. :)
(Just correcting for anyone else like me who obsessively tries to solve little "practice" problems like this one, and then hit Google when their result doesn't match the stated answer. ;) Damn, some of those Perl examples are ugly.)"
The number is 2002.
Recursive solution takes eleven lines of JavaScript (excluding string expression evaluation, which is a standard JavaScript function, however it would probably take another ten or so lines of code to roll your own for this specific scenario):
function combine (digit,exp) {
if (digit > 9) {
if (eval(exp) == 2002) alert(exp+'=2002');
return;
}
combine(digit+1,exp+'+'+digit);
combine(digit+1,exp+'*'+digit);
combine(digit+1,exp+digit);
return;
}
combine(2,'1');
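Since the question was written in PHP, here is roughly the same recursion in PHP. Treat it as a sketch: solve and evaluate are hypothetical names, and instead of eval() the expression is evaluated by splitting on '+' and then on '*', the same way the question's code does it.

```php
<?php
// Evaluate an expression made only of digits, '+' and '*' without eval():
// sum the '+'-separated terms, each term being a product of its '*'-separated factors.
function evaluate(string $expr): int
{
    $sum = 0;
    foreach (explode('+', $expr) as $term) {
        $sum += array_product(array_map('intval', explode('*', $term)));
    }
    return $sum;
}

// Append the next digit with '+', '*' or nothing in between, then recurse.
function solve(string $expr, int $digit, int $target, array &$found): void
{
    if ($digit > 9) {
        if (evaluate($expr) === $target) {
            $found[] = $expr;
        }
        return;
    }
    solve($expr . '+' . $digit, $digit + 1, $target, $found);
    solve($expr . '*' . $digit, $digit + 1, $target, $found);
    solve($expr . $digit, $digit + 1, $target, $found);
}

$found = [];
solve('1', 2, 2002, $found);
echo implode("\n", $found), "\n";
```

Two expressions that do add up to 2002 are 1*2+34*56+7+89 and 1*23+45*6*7+89, so both should appear in the output.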

Performance of tokenizing CSS in PHP

This is a noob question from someone who hasn't written a parser/lexer ever before.
I'm writing a tokenizer/parser for CSS in PHP (please don't repeat with 'OMG, why in PHP?'). The syntax is written down by the W3C neatly here (CSS2.1) and here (CSS3, draft).
It's a list of 21 possible tokens, that all (but two) cannot be represented as static strings.
My current approach is to loop through an array containing the 21 patterns over and over again, do an if (preg_match()) and reduce the source string match by match. In principle this works really well. However, for a 1000-line CSS string this takes somewhere between 2 and 8 seconds, which is too much for my project.
Now I'm racking my brain over how other parsers tokenize and parse CSS in fractions of a second. OK, C is always faster than PHP, but nonetheless, are there any obvious D'oh!s that I fell into?
I made some optimizations, like checking for '@', '#' or '"' as the first char of the remaining string and applying only the relevant regexp then, but this didn't bring any great performance boost.
My code (snippet) so far:
$TOKENS = array(
'IDENT' => '...regexp...',
'ATKEYWORD' => '@...regexp...',
'String' => '"...regexp..."|\'...regexp...\'',
//...
);
$string = '...CSS source string...';
$stream = array();
// we reduce $string token by token
while ($string != '') {
$string = ltrim($string, " \t\r\n\f"); // unconsumed whitespace at the
// start is insignificant but doing a trim reduces exec time by 25%
$matches = array();
// loop through all possible tokens
foreach ($TOKENS as $t => $p) {
// The '&' is used as delimiter, because it isn't used anywhere in
// the token regexps
if (preg_match('&^'.$p.'&Su', $string, $matches)) {
$stream[] = array($t, $matches[0]);
$string = substr($string, strlen($matches[0]));
// Yay! We found one that matches!
continue 2;
}
}
// if we come here, we have a syntax error and handle it somehow
}
// result: an array $stream consisting of arrays with
// 0 => type of token
// 1 => token content
Use a lexer generator.
The first thing I would do would be to get rid of the preg_match(). Basic string functions such as strpos() are much faster, but I don't think you even need that. It looks like you are looking for a specific token at the front of a string with preg_match(), and then simply taking the front length of that string as a substring. You could easily accomplish this with a simple substr() instead, like this:
foreach ($TOKENS as $t => $p)
{
    $len = strlen($p); // this could be pre-stored in $TOKENS
    $front = substr($string, 0, $len);
    if ($front === $p) {
        $stream[] = array($t, $front); // store the matched token, not the whole remaining string
        $string = substr($string, $len);
        // Yay! We found one that matches!
        continue 2;
    }
}
You could further optimize that by pre-calculating the length of all your tokens and storing them in the $TOKENS array, so that you don't have to call strlen() all the time. If you sorted $TOKENS into groups by length, you could reduce the number of substr() calls further as well, as you could take a substr($string) of the current string being analyzed just once for each token length, and run through all the tokens of that length before moving on to the next group of tokens.
The (probably) faster, but less memory-friendly, approach would be to tokenize the whole stream at once, using one big regexp with an alternative for each token, like
preg_match_all('/
(...string...)
|
(#ident)
|
(#ident)
...etc
/x', $stream, $tokens);
foreach($tokens as $token)...parse
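To make the one-big-regexp idea a bit more concrete, here is a sketch with named groups. The patterns are simplified placeholders, NOT the W3C token definitions, and tokenizeCss is a hypothetical name. With PREG_SET_ORDER each match comes back as its own array, and the first named group that is non-empty tells you the token type.

```php
<?php
// Hypothetical tokenizer: one alternation, one preg_match_all call.
function tokenizeCss(string $css): array
{
    // Simplified placeholder patterns (assumptions, not the real CSS grammar).
    $patterns = [
        'IDENT'  => '-?[A-Za-z_][A-Za-z0-9_-]*',
        'HASH'   => '\#[A-Za-z0-9_-]+',
        'STRING' => '"[^"\\\\]*(?:\\\\.[^"\\\\]*)*"',
        'S'      => '\s+',
        'DELIM'  => '.',   // any other single character
    ];

    // Build (?<IDENT>...)|(?<HASH>...)|... from the table above.
    $parts = [];
    foreach ($patterns as $name => $pattern) {
        $parts[] = "(?<$name>$pattern)";
    }
    $regex = '~' . implode('|', $parts) . '~s';

    preg_match_all($regex, $css, $matches, PREG_SET_ORDER);

    $stream = [];
    foreach ($matches as $match) {
        foreach (array_keys($patterns) as $name) {
            // Groups that did not take part in the match are empty or missing.
            if (isset($match[$name]) && $match[$name] !== '') {
                $stream[] = [$name, $match[$name]];
                break;
            }
        }
    }
    return $stream;
}

print_r(tokenizeCss('a#b "c"'));
```

The order of the alternatives matters: the catch-all DELIM has to come last, otherwise it would swallow the first character of every other token.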
Don't use regexp, scan character by character.
$tokens = array();
$string = "...code...";
$length = strlen($string);
$i = 0;
while ($i < $length) {
    $char = $string[$i];
    if (ctype_alpha($char) || $char === '_' || $char === '-') {
        // identifier: consume letters, underscores and hyphens
        $buf = '';
        while ($i < $length && (ctype_alpha($string[$i]) || $string[$i] === '_' || $string[$i] === '-')) {
            $buf .= $string[$i];
            $i++;
        }
        $tokens[] = array('IDENT', $buf);
    } else if (......) {
        // ......
    }
}
However, that makes the code unmaintainable, therefore, a parser generator is better.
It's an old post, but here are my 2 cents.
One thing that seriously slows down the original code in the question is the following line:
$string = substr($string, strlen($matches[0]));
Instead of working on the entire string, take just a part of it (say 50 chars), long enough for any of the regexes to match, and apply the same line of code to that. When this buffer shrinks below a preset length, load some more data into it.
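A related way to avoid the repeated substr() copy, which is a different technique from the chunking described above, is to leave the string alone and pass a byte offset to preg_match(). With the A (anchored) modifier the match has to start exactly at that offset. A sketch, with tokenizeWithOffset as a hypothetical name and the patterns supplied by the caller (without the leading ^):

```php
<?php
// Hypothetical tokenizer: advances an offset instead of shrinking the string.
function tokenizeWithOffset(string $css, array $patterns): array
{
    $stream = [];
    $offset = 0;
    $length = strlen($css);

    while ($offset < $length) {
        foreach ($patterns as $type => $pattern) {
            // 'A' anchors the match at $offset, so no '^' and no substr() are needed.
            if (preg_match('~' . $pattern . '~A', $css, $m, 0, $offset) && $m[0] !== '') {
                $stream[] = [$type, $m[0]];
                $offset += strlen($m[0]);
                continue 2;
            }
        }
        // No pattern matched at this position: syntax error.
        throw new RuntimeException("Unexpected input at byte $offset");
    }
    return $stream;
}

// Placeholder patterns for illustration only, not the real CSS token set.
$patterns = ['IDENT' => '[A-Za-z_-]+', 'S' => '\s+', 'DELIM' => '[{}:;]'];
print_r(tokenizeWithOffset('a {b:c}', $patterns));
```

Note that the sketch leaves out the u modifier from the question; with u, the offset has to fall on a character boundary, which holds here as long as each match consumes whole characters.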
