Basically, what I am trying to do is create a kind of BBCode system without using regex. The code below looks like it should work, but it doesn't. It is supposed to take a string, remove all the break tags from inside every [code][/code] block, and put the result back into the full string. Then it is supposed to turn the [code][/code] tags into "pre" tags for the SyntaxHighlighter script I'm using.
Unfortunately, the code doesn't work in every case: sometimes it still leaves break tags inside the [code][/code] blocks. My code is:
<?php
$string = "Hello\n[code]\nCode One\n[/code]\n[code]\nCode Two\n[/code]\n[code]\nCode Three\n[/code]";
$string = nl2br($string);
$openArray = array();
$closeArray = array();
$original = "";
$newString = "";
$i = 0;
if(strpos($string, "[code]") === 0) {
array_push($openArray, 0);
}
while($i = strpos($string, "[code]", $i + 1)) {
array_push($openArray, $i);
}
while($i = strpos($string, "[/code]", $i + 1)) {
array_push($closeArray, $i + 7);
}
for($j = 0; $j < count($openArray); $j++) {
$length = $closeArray[$j] - $openArray[$j];
$original = substr($string, $openArray[$j], $length);
$newString = strip_tags($original);
$string = str_replace($original, $newString, $string);
}
$string = str_replace("[code]", '<pre class="brush: plain">', $string);
$string = str_replace("[/code]", '</pre>', $string);
echo $string;
?>
All answers are greatly appreciated, as I have been wondering what is wrong with this for quite some time now and I've tried many different ways!
The major problem I see with your processing is that you store the positions of the opening and closing tags independently of each other. Later on you process them as if each opening tag belonged to the closing tag at the same index, but that is not guaranteed: you never validate that a closing code follows each opening code. Two opening or two closing codes in a row should give a parse error.
You could write yourself a little helper function that, like strpos, returns the next position of an opening and closing code pair:
function codepos($string, $code, $offset = 0) {
    // Find the next opening code at or after $offset.
    if (FALSE === $start = strpos($string, "[$code]", $offset)) {
        return FALSE;
    }
    if (FALSE === $stop = strpos($string, "[/$code]", $start)) {
        throw new Exception('Close code not found.');
    }
    // A second opening code before the closing code is an error.
    $next = strpos($string, "[$code]", $start + 1);
    if ($next !== FALSE && $next < $stop) {
        throw new Exception('Double opening detected.');
    }
    $pos = new stdClass;
    $pos->start = $start;
    $pos->stop = $stop;
    $pos->code = $code;
    return $pos;
}
It's then easier to process this later on, as you already know that things are in order. Instead of throwing exceptions you can just return FALSE and give notice some other way. Note that this routine does not yet check for a closing code before the first opening code.
$offset = 0;
while ($pos = codepos($string, 'code', $offset))
{
    // ... process each code-pair.
    // Continue the search after the closing code of this pair.
    $offset = $pos->stop + strlen('[/code]');
}
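One likely reason the code in the question misbehaves is that all offsets are collected up front, while every replacement shortens the string, so later offsets no longer point at the right block. A minimal sketch of a single left-to-right pass that avoids stale offsets follows. The function name convertCodeBlocks is hypothetical, and it assumes the break tags come from nl2br(), so only those are removed.

```php
<?php
// Hypothetical helper (not from the question): strips break tags inside each
// [code]...[/code] pair and converts the pair to a <pre> element in one pass.
function convertCodeBlocks($string) {
    $open = '[code]';
    $close = '[/code]';
    $out = '';
    $offset = 0;
    while (($start = strpos($string, $open, $offset)) !== false) {
        $bodyStart = $start + strlen($open);
        $stop = strpos($string, $close, $bodyStart);
        if ($stop === false) {
            break; // unmatched opening tag: leave the rest untouched
        }
        // Text before the block is copied unchanged.
        $out .= substr($string, $offset, $start - $offset);
        $body = substr($string, $bodyStart, $stop - $bodyStart);
        // Assumption: only the break tags inserted by nl2br() need to go.
        $body = str_replace(array('<br />', '<br/>', '<br>'), '', $body);
        $out .= '<pre class="brush: plain">' . $body . '</pre>';
        $offset = $stop + strlen($close);
    }
    // Append whatever follows the last processed block.
    return $out . substr($string, $offset);
}

echo convertCodeBlocks(nl2br("Hello\n[code]\nCode One\n[/code]\n[code]\nCode Two\n[/code]"));
```

Because the output is built in a separate string, the offsets into the input never go stale.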
For learning or for an intranet tool only; this should not even be considered for the public web.
You need to take the following into consideration:
Lines may be longer than the string buffer. You will have a maximum line size unless you code around it.
Code for possible close tags before open tags, and for possible missing close or open tags, unless you assume the input will always be correct.
Be able to handle the following cases:
State1 Looking for one or more open tags:
No open/close tags
Open tag only
Close tag first - parse fails
one or more matching open/close tags (in proper order)
one or more matching open/close tags (in proper order) ending with open tag
End of document - OK
State2 Looking for close tag:
close tag followed by one or more matching open/close tags (in proper order)
close tag followed by one or more matching open/close tags (in proper order) ending with open tag
no close tag
End of document - Parse fails
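The two states listed above can be sketched as a small validator. This is only an illustration under the assumption that [code] is the only tag of interest; the function name codeTagsBalanced is made up for the example.

```php
<?php
// Hypothetical validator: returns true when every [code] is followed by its
// own [/code] before the next [code], and no [/code] appears on its own.
function codeTagsBalanced($string) {
    $lookingForClose = false; // false = State1 (open tag), true = State2 (close tag)
    $offset = 0;
    while (true) {
        $open = strpos($string, '[code]', $offset);
        $close = strpos($string, '[/code]', $offset);
        if ($open === false && $close === false) {
            // End of document: OK in State1, parse failure in State2.
            return !$lookingForClose;
        }
        // Decide which kind of tag comes first.
        $openFirst = $open !== false && ($close === false || $open < $close);
        if ($lookingForClose) {
            if ($openFirst) {
                return false; // second open tag before the close tag
            }
            $lookingForClose = false;
            $offset = $close + strlen('[/code]');
        } else {
            if (!$openFirst) {
                return false; // close tag first
            }
            $lookingForClose = true;
            $offset = $open + strlen('[code]');
        }
    }
}

var_dump(codeTagsBalanced("a[code]b[/code]c"));
```

Running the validator before any replacement keeps the replacement code itself free of error handling.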
I'm trying to reorder BBCode tags, but I failed. So:
[b][i][u]foo[/b][/u][/i] - wrong order
I want it to be:
[b][i][u]foo[/u][/i][/b] - right order
I tried with
<?php
$string = '[b][i][u]foo[/b][/u][/i]';
$search = array('/\[b](.+?)\[\/b]/is', '/\[i](.+?)\[\/i]/is', '/\[u](.+?)\[\/u]/is');
$replace = array('[b]$1[/b]', '[i]$1[/i]', '[u]$1[/u]');
echo preg_replace($search, $replace, $string);
?>
OUTPUT: [b][i][u]foo[/b][/u][/i]
Any suggestions? Thanks!
Phew, I spent a while thinking of the logic to do this. (Feel free to put it in a function.)
This only works for the scenario given. As other users have commented, it's impossible in general. You shouldn't be doing this, not even on the server side; I'd use a client-side parser just to throw a syntax error.
It supports [b]a[i]b[u]foo[/b]baa[/u]too[/i]
and BBCode with custom values: [url=test][i][u]foo[/url][/u][/i]
It will break with
[b] bold [/b][u] underline[/u]
and [b] bold [u][/b] underline[/u]
//input string to be reorganized
$string = '[url=test][i][u]foo[/url][/u][/i]';
echo $string . "<br />";
//search for all opentags (including ones with values
$tagsearch = "/\[([A-Za-z]+)[A-Za-z=._%?&:\/-]*\]/";
preg_match_all($tagsearch, $string, $tags);
//search for all close tags to store them for later
$closetagsearch = "/(\[\/([A-Za-z]+)\])/is";
preg_match_all($closetagsearch, $string, $closetags);
//flip the open tags for reverse parsing (index one is just letters)
$tags[1] = array_reverse($tags[1]);
//create temp var to store new ordered string
$temp = "";
//this is the last known position in the original string after a match
$last = 0;
//iterate through each char of the input string
for ($i = 0, $len = strlen($string); $i < $len; $i++) {
//if we run out of tags to replace/find stop looping
if (empty($tags[1]) || empty($closetags[1]))
break;
//this is the part of the string that has no matches
$good = substr($string, $last, $i - $last);
//next closing tag to search for
$next = $closetags[1][0];
//how many chars ahead to compare against
$scope = substr($string, $i, strlen($next));
//if we have a match
if ($scope === "$next") {
//add to the temp variable with a modified
//version of an open tag letter to become a close tag
$temp .= $good . substr_replace("[" . $tags[1][0] . "]", "/", 1, 0);
//remove the first key/value in both arrays
array_shift($tags[1]);
array_shift($closetags[1]);
//update the last known unmatched char
$last += strlen($good . $scope);
}
}
//append whatever follows the last closing tag
$temp .= substr($string, $last);
echo $temp;
Please also note: it might be the user's intention to nest the tags out of order :X
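For comparison, here is a stack-based sketch of the same idea. It is not the code above, and it makes one loud assumption: each closing tag is meant to close the most recently opened tag, whatever name the author typed. The function name reorderCloseTags is hypothetical.

```php
<?php
// Hypothetical helper: rewrites every closing tag so that it closes the most
// recently opened tag. Assumes opening and closing tags come in equal numbers.
function reorderCloseTags($string) {
    $stack = array();
    $out = '';
    // Split into tag tokens and plain text, keeping the tags.
    $tokens = preg_split(
        '/(\[\/?[A-Za-z]+(?:=[^\]]*)?\])/',
        $string,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
    foreach ($tokens as $token) {
        if (preg_match('/^\[([A-Za-z]+)(?:=[^\]]*)?\]$/', $token, $m)) {
            // Opening tag, optionally with a value such as [url=test].
            $stack[] = $m[1];
            $out .= $token;
        } elseif (preg_match('/^\[\/[A-Za-z]+\]$/', $token)) {
            // Closing tag: ignore its name and close the innermost open tag.
            $name = array_pop($stack);
            if ($name !== null) {
                $out .= "[/$name]";
            }
        } else {
            $out .= $token; // plain text
        }
    }
    return $out;
}

echo reorderCloseTags('[b][i][u]foo[/b][/u][/i]');
```

Unlike the loop above, this keeps text after the last tag and handles sibling tags such as [b] bold [/b][u] underline[/u], but it still cannot know whether out-of-order nesting was intentional.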
The following is a code block that's working fine as regards getting the data I want.
Don't laugh, it's probably inefficient, but I'm learning :)
What I want, is to use the $totalLength variable, to stop gathering data when the $totalLength is, say 1500 bytes/characters (ideally, ending on a full word, but I'm not looking for miracles!). Anyway, the code:
$paraLength = 0;
$totalLength = 0;
for ($k = 0; $k < $descriptionValue->length; $k++) { //define integer k as 0, get every description using ($k = 0; $k < $descriptionValue->length; $k++), increment the k loop (to get only 14 elements, use ($k <= 13))
$totalLength = $totalLength + $paraLength;
echo $totalLength." Total<br />";
$descNode = $descriptionValue->item($k)->nodeValue; //find each description element
$descNode = trim($descNode); //trim any whitespace around the element
$descPara = strip_tags($descNode); //remove any HTML tags from the elements
$paraLength = (strlen($descPara)); //find the length of each element
//if (preg_match('/^([0-9 ]+)$/', $descPara)) { //if element starts with numbers followed by a space, define it as a telephone number
// $number = $descPara;
// fwrite ($fh, "\t\t".'<div id="tel">'.$number."</div>\n"); //write a div with id tel, containing the number
//}
//else
if (preg_match('/[A-Z]{4,}/', $descPara)) { //if element starts with at least 4 uppercase characters, define it as a heading
$heading = $descPara;
$heading=ucfirst(strtolower($heading)); //convert the uppercase string to proper
fwrite ($fh, "\t\t".'<div id="heading"><h4>'.$heading."</h4></div>\n"); //write a div with id heading, containing the heading in h4 tags
}
else if (preg_match('/\d*\.\d{1,}[m x]/', $descPara)) { //if the element contains any number of digits followed by a dot, at least one further digit and the letters m x, define it as a heading based on it containing room measurements (this pattern matches at least two number after the dot \d*{2,}}
$room = $descPara;
fwrite ($fh, "\t\t".'<div id="roomheading"><h4>'.$room."</h4></div>\n"); //write a div with id roomheading, containing the heading in h4 tags
}
else if (preg_match('/^Disclaimer/i', $descPara)) { //if the element contains the word Disclaimer, define it as such
$disclaimer = $descPara;
fwrite ($fh, "\t\t".'<div id="disclaimer"><h4>'.$disclaimer."</h4></div>\n"); //write a div with id disclaimer, containing the heading in h4 tags
}
else if (strlen($paraLength<14 && $paraLength>3)) { //when all else fails, if the element is less than 14 but more than 3 characters, also define it as a heading
$other = $descPara;
fwrite ($fh, "\t\t".'<div id="other"><h4>'.$other."</h4></div>\n"); //write a div with id other and the heading in h4 tags
}
else {
fwrite ($fh, "\t\t\t<p>".$descPara."</p>\n"); //anything else is considered content, so write it out inside p tags
}
}
$totalLength counts nicely, but when I tried to put a while statement in there, it just hung. I tried putting the while statement before and after the for, but no joy. What am I doing wrong, and how is this best solved?
FYI, $descriptionValue is data parsed from HTML using DOM and XPath; the while I tried was while($totalLength <= 1500)
Maybe this is what you want:
if ($totalLength > 1500) {
break;
}
Just put a condition inside your for loop. It will jump outside the loop as soon as the condition evaluates to true.
// for () { ...
if ($totalLength > 1500) {
break;
}
// }
Basically, break ends execution of the current for, foreach, while, do-while or switch structure. You can find more about PHP's control structures in the manual.
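You can also delete the for and add
$k = 0;
while ($totalLength <= 1500 && $k < $descriptionValue->length)
and inside the loop you increment the value of $k. Note that the two conditions must be joined with &&; with || the loop would keep running past the last element.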
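As a rough illustration of the break approach, here is a self-contained sketch that works on a plain array of strings instead of the DOM node list from the question. The function name collectUntil is hypothetical, and it checks the budget before adding a paragraph, so the output always ends on a complete paragraph.

```php
<?php
// Hypothetical helper: keeps paragraphs until adding the next one would
// exceed the character budget.
function collectUntil(array $paragraphs, $limit) {
    $total = 0;
    $kept = array();
    foreach ($paragraphs as $para) {
        $len = strlen($para);
        if ($total + $len > $limit) {
            break; // leave the loop as soon as the budget would be exceeded
        }
        $kept[] = $para;
        $total += $len;
    }
    return $kept;
}

print_r(collectUntil(array('aaaa', 'bbbb', 'cccc'), 9));
```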
When I read the alt (technically title) text of this XKCD comic, I became curious whether every article in Wikipedia eventually points to the Philosophy article. So I began to make a web application in PHP that displays which articles it is "pointing" to.
(PS: don't worry about traffic; I'll use it privately and will not send too many requests to the Wikipedia servers.)
To do this, I have to remove text between parentheses and in italics, and get the first link. The other things can be achieved using PHP Simple HTML DOM Parser, but removing text between parentheses is the problem.
If there were no parentheses inside parentheses, I could use this regex: \([^\)]+\). However, some articles, like the one about the German language, have nested parentheses (for example: German (Deutsch [ˈdɔʏtʃ] ( listen)) is..), and the regex above can't handle these cases, since [^\)]*\) finds the first closing parenthesis, not the matching one. (Actually, the case above isn't a problem, since there's no text between the two closing parentheses, but it becomes a big problem when there's a link between them.)
One dirty solution I can think of is this:
$s="content of a wikipedia article";$depth=0;$s2="";
for($i=0;$i<strlen($s);$i++){
$c=substr($s,$i,1);
if($c=='(')$depth++;
if($c==')'){if($depth>0)$depth--;continue;}
if($depth==0) $s2.=$c;
}
$s=$s2;
However, I don't like this solution, since it cuts the string into single characters, and that looks unnecessary...
Are there other ways to remove text in a pair of (matching) parentheses?
For example, I want to make this text:
blah(asdf(foo)bar(lol)asdf)blah
into this:
blahblah
but not like this:
blahbarasdf)blah
Edit: from a comment on Emil Vikström's answer, I realized that the approach above (removing text between parentheses) may remove a link containing parentheses. However, I still want an answer to the problem above, since I have met a similar problem before and I want to know the answer...
So my question is still: how do I remove text between matching parentheses?
You can check out recursive patterns, which should be able to solve the problem.
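A minimal sketch of such a recursive pattern, using PCRE's (?R) to recurse into the whole pattern. The wrapper name stripParens is made up for the example, and unbalanced parentheses are simply left in place.

```php
<?php
// Hypothetical wrapper: removes every balanced (...) group, including nested ones.
function stripParens($text) {
    // \(            an opening parenthesis
    // (?: ... )*    any number of:
    //   [^()]++       runs of non-parenthesis characters (possessive), or
    //   (?R)          a nested, balanced group (recursion into the whole pattern)
    // \)            the matching closing parenthesis
    return preg_replace('/\((?:[^()]++|(?R))*\)/', '', $text);
}

echo stripParens('blah(asdf(foo)bar(lol)asdf)blah'); // prints blahblah
```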
When I read the comic I didn't have the willpower to get my head around recursive patterns, so I simplified it: find a link and only then check whether it's in parentheses. Here's my solution:
//Fetch links
$matches = array();
preg_match_all('!<a [^>]*href="/wiki/([^:"#]+)["#].*>!Umsi', $text, $matches);
$links = $matches[1];
//Find first link not within parenthesis
$found = false;
foreach($links as $l) {
if(preg_match('!\([^)]+/wiki/'.preg_quote($l, '!').'.+\)!Umsi', $text)) {
continue;
}else{
$found = true;
break;
}
}
Here's my entire script: http://lajm.eu/emil/dump/filosofi.phps
Great! Here is someone with the same problem I ran into while cleaning up Wikipedia plain-text content. Here is how you use it.
cleanBraces("blah(asdf(foo)bar(lol)asdf)blah", "(", ")")
will return
blahblah
You can pass any type of braces. Like [ and ] or { and }
Here goes my source code.
function cleanBraces($source, $oB, $eB) {
$finalText = "";
if (preg_match("/\\$oB.*\\$eB/", $source) > 0) {
while (preg_match("/\\$oB.*\\$eB/", $source) > 0) {
$brace = getBracesPos($source, $oB, $eB);
$finalText .= substr($source, 0, $brace[0]);
$source = substr($source, $brace[1] + 1, strlen($source) - $brace[1]);
}
$finalText .= $source;
} else {
$finalText = $source;
}
return $finalText;
}
function getBracesPos($source, $oB, $eB) {
if (preg_match("/\\$oB.*\\$eB/", $source) > 0) {
$open = 0;
$length = strlen($source);
for ($i = 0; $i < $length; $i++) {
$currentChar = substr($source, $i, 1);
if ($currentChar == $oB) {
$open++;
if ($open == 1) { // First open brace
$firstOpenBrace = $i;
}
} else if ($currentChar == $eB && $open > 0) { // ignore a closing brace that has no opener
$open--;
if ($open == 0) { //time to wrap the roots
$lastCloseBrace = $i;
return array($firstOpenBrace, $lastCloseBrace);
}
}
} //for
} //if
}
I have my site search engine functioning exactly the way I want, save for one nagging issue: I want to be able to show the kind of search body that all search engines show where they highlight 1 to 3 sentences that contain your search term(s) in my results.
My Google-fu is not strong on this one, so I'm hoping someone can turn me on to an existing solution.
In case you don't want keyword highlighting as battal suggested, and instead want to snip the relevant paragraph/content, this is what I'd do:
$snippets = array();
$buffer = 30; // characters before and after the search term is found
foreach ($matches as $i => $match) {
    $pos = strpos($match, $searchTerm);
    if ($pos === false) {
        continue; // search term not present in this match
    }
    // start index - 0 or 30 characters before instance of search term
    $start = max(0, $pos - $buffer);
    // end index - 30 characters after instance of search term or the length of the match
    $end = min(strlen($match), $pos + strlen($searchTerm) + $buffer);
    // substr() takes a length, not an end index
    $snippets[$i] = substr($match, $start, $end - $start);
}
You mean search highlighting:
str_replace(
$searchTerm,
'<span class="searchHighlight">'.$searchTerm.'</span>',
$searchString);
You need to do this on plain text, otherwise you may come across some complications, as mentioned in the A List Apart article Enhance Usability by Highlighting Search Terms.
You may try a JavaScript-based approach, but a PHP/HTML-based one would be more accessible.
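A sketch of the plain-text variant, with two additions that are assumptions rather than part of the answer above: the text is HTML-escaped first, and the match is case-insensitive while the original casing is preserved. The function name highlightTerm is hypothetical.

```php
<?php
// Hypothetical helper: escapes plain text for HTML output, then wraps each
// case-insensitive occurrence of the search term in a highlight span.
function highlightTerm($plainText, $searchTerm) {
    $safeText = htmlspecialchars($plainText, ENT_QUOTES, 'UTF-8');
    $safeTerm = htmlspecialchars($searchTerm, ENT_QUOTES, 'UTF-8');
    if ($safeTerm === '') {
        return $safeText; // nothing to highlight
    }
    // preg_quote() keeps characters in the term from acting as regex syntax.
    $pattern = '/' . preg_quote($safeTerm, '/') . '/i';
    // $0 re-inserts the matched text, so the original casing is kept.
    return preg_replace($pattern, '<span class="searchHighlight">$0</span>', $safeText);
}

echo highlightTerm('Foo bar foo', 'foo');
```

One caveat of this sketch: a term such as "amp" could also match inside an escaped entity like &amp;amp;, so very short terms may need extra care.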
This is a noob question from someone who hasn't written a parser/lexer ever before.
I'm writing a tokenizer/parser for CSS in PHP (please don't reply with 'OMG, why in PHP?'). The syntax is written down neatly by the W3C here (CSS2.1) and here (CSS3, draft).
It's a list of 21 possible tokens, all but two of which cannot be represented as static strings.
My current approach is to loop through an array containing the 21 patterns over and over again, do an if (preg_match()) and reduce the source string match by match. In principle this works really well. However, for a 1000-line CSS string this takes somewhere between 2 and 8 seconds, which is too much for my project.
Now I'm banging my head over how other parsers tokenize and parse CSS in fractions of a second. OK, C is always faster than PHP, but nonetheless, are there any obvious D'oh!s that I fell into?
I made some optimizations, like checking for '@', '#' or '"' as the first char of the remaining string and applying only the relevant regexp then, but this didn't bring any great performance boost.
My code (snippet) so far:
$TOKENS = array(
'IDENT' => '...regexp...',
'ATKEYWORD' => '@...regexp...',
'String' => '"...regexp..."|\'...regexp...\'',
//...
);
$string = '...CSS source string...';
$stream = array();
// we reduce $string token by token
while ($string != '') {
$string = ltrim($string, " \t\r\n\f"); // unconsumed whitespace at the
// start is insignificant but doing a trim reduces exec time by 25%
$matches = array();
// loop through all possible tokens
foreach ($TOKENS as $t => $p) {
// The '&' is used as delimiter, because it isn't used anywhere in
// the token regexps
if (preg_match('&^'.$p.'&Su', $string, $matches)) {
$stream[] = array($t, $matches[0]);
$string = substr($string, strlen($matches[0]));
// Yay! We found one that matches!
continue 2;
}
}
// if we come here, we have a syntax error and handle it somehow
}
// result: an array $stream consisting of arrays with
// 0 => type of token
// 1 => token content
Use a lexer generator.
The first thing I would do would be to get rid of the preg_match(). Basic string functions such as strpos() are much faster, but I don't think you even need that. It looks like you are looking for a specific token at the front of a string with preg_match(), and then simply taking the front length of that string as a substring. You could easily accomplish this with a simple substr() instead, like this:
foreach ($TOKENS as $t => $p)
{
$front = substr($string,0,strlen($p));
$len = strlen($p); //this could be pre-stored in $TOKENS
if ($front == $p) {
$stream[] = array($t, $front); // store the matched token, not the whole remaining string
$string = substr($string, $len);
// Yay! We found one that matches!
continue 2;
}
}
You could further optimize that by pre-calculating the length of all your tokens and storing them in the $TOKENS array, so that you don't have to call strlen() all the time. If you sorted $TOKENS into groups by length, you could reduce the number of substr() calls further as well, as you could take a substr($string) of the current string being analyzed just once for each token length, and run through all the tokens of that length before moving on to the next group of tokens.
The (probably) faster (but less memory-friendly) approach would be to tokenize the whole stream at once, using one big regexp with an alternative for each token, like
preg_match_all('/
(...string...)
|
(#ident)
|
(#ident)
...etc
/x', $stream, $tokens);
foreach($tokens as $token)...parse
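To make the one-big-regexp idea concrete, here is a sketch that builds the alternation from named groups, so the token type can be read back from each match. The patterns are deliberately simplified placeholders and not the W3C grammar; the names tokenizeAll and $tokenPatterns are made up for the example.

```php
<?php
// Simplified, hypothetical patterns; a real CSS tokenizer needs the full grammar.
$tokenPatterns = array(
    'ATKEYWORD' => '@[a-zA-Z_][a-zA-Z0-9_-]*',
    'HASH'      => '#[a-zA-Z0-9_-]+',
    'IDENT'     => '-?[a-zA-Z_][a-zA-Z0-9_-]*',
    'STRING'    => '"[^"]*"',
    'S'         => '\s+',
    'DELIM'     => '.',
);

function tokenizeAll($source, array $tokenPatterns) {
    // Build one alternation with a named group per token type.
    $parts = array();
    foreach ($tokenPatterns as $name => $pattern) {
        $parts[] = "(?P<$name>$pattern)";
    }
    $regex = '~' . implode('|', $parts) . '~';
    preg_match_all($regex, $source, $matches, PREG_SET_ORDER);

    $stream = array();
    foreach ($matches as $m) {
        foreach ($tokenPatterns as $name => $unused) {
            // Only the group that actually matched is non-empty.
            if (isset($m[$name]) && $m[$name] !== '') {
                if ($name !== 'S') { // whitespace is dropped in this sketch
                    $stream[] = array($name, $m[$name]);
                }
                break;
            }
        }
    }
    return $stream;
}

print_r(tokenizeAll('@media #id a{b:"x"}', $tokenPatterns));
```

The order of the alternatives matters, because the first alternative that matches at a position wins.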
Don't use regexp, scan character by character.
$tokens = array();
$string = "...code...";
$length = strlen($string);
$i = 0;
while ($i < $length) {
    $buf = '';
    $char = $string[$i];
    if (ctype_alpha($char) || $char == '_' || $char == '-') {
        // identifier: consume characters while they can be part of one
        while ($i < $length && (ctype_alpha($string[$i]) || $string[$i] == '_' || $string[$i] == '-')) {
            $buf .= $string[$i];
            $i++;
        }
        $tokens[] = array('IDENT', $buf);
    } else if (......) {
        // ......
    }
}
However, that makes the code unmaintainable, therefore, a parser generator is better.
It's an old post, but here are my 2 cents.
One thing that seriously slows down the original code in the question is the following line:
$string = substr($string, strlen($matches[0]));
Instead of working on the entire string, take just a part of it (say 50 chars), long enough for all the possible regexes, and apply the same line of code to that. When this buffer shrinks below a preset length, load some more data into it.
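Another way to avoid copying the remaining string on every token, different from the buffering described above, is to leave the string alone and pass an offset to preg_match(). With the A (anchored) modifier the pattern has to match exactly at that offset. The sketch below uses placeholder patterns, and the function name tokenizeWithOffset is hypothetical.

```php
<?php
// Hypothetical tokenizer: advances an offset instead of shortening the string.
function tokenizeWithOffset($source, array $tokenPatterns) {
    $stream = array();
    $offset = 0;
    $length = strlen($source);
    while ($offset < $length) {
        foreach ($tokenPatterns as $name => $pattern) {
            // The A modifier anchors the match at $offset, so no substr() copy is needed.
            if (preg_match('~' . $pattern . '~A', $source, $m, 0, $offset) && $m[0] !== '') {
                $stream[] = array($name, $m[0]);
                $offset += strlen($m[0]);
                continue 2;
            }
        }
        // No pattern matched at this position: treat it as a syntax error.
        throw new Exception("Unexpected input at offset $offset");
    }
    return $stream;
}

// Placeholder patterns, far simpler than the real CSS token definitions.
$patterns = array(
    'IDENT' => '[a-zA-Z_][a-zA-Z0-9_-]*',
    'S'     => '\s+',
    'DELIM' => '[{}:;]',
);
print_r(tokenizeWithOffset('a { b: c }', $patterns));
```

Note that ^ does not behave like a start-of-remaining-input anchor when an offset is used, which is why the sketch relies on the A modifier instead.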