How to remove text between "matching" parentheses? - php

When I read the alt (technically title) text of this XKCD comic, I became curious whether every article in Wikipedia eventually points to the Philosophy article. So I began writing a PHP web application that displays which article each article "points" to.
(PS: don't worry about traffic; I'll use it privately and won't send too many requests to the Wikipedia servers.)
To do this, I have to remove the text between parentheses and in italics, and then get the first link. The rest can be done with PHP Simple HTML DOM Parser, but removing the text between parentheses is the problem.
If parentheses were never nested, I could use this regex: \([^\)]+\). However, some articles, such as the one about the German language, have nested parentheses (for example: German (Deutsch [ˈdɔʏtʃ] ( listen)) is..), and the regex above can't handle these cases, because [^\)]+\) stops at the first closing parenthesis, not the matching one. (In this particular case it isn't a problem, since there's no text between the two closing parentheses, but it becomes a big problem when there's a link between them.)
One dirty solution I can think of is this:
$s="content of a wikipedia article";$depth=0;$s2="";
for($i=0;$i<strlen($s);$i++){
$c=substr($s,$i,1);
if($c=='(')$depth++;
if($c==')'){if($depth>0)$depth--;continue;}
if($depth==0) $s2.=$c;
}
$s=$s2;
However, I don't like this solution, since it walks the string one character at a time, which seems unnecessary...
Is there another way to remove the text inside a pair of matching parentheses?
For example, I want to make this text:
blah(asdf(foo)bar(lol)asdf)blah
into this:
blahblah
but not like this:
blahbarasdf)blah
Edit: from a comment on Emil Vikström's answer, I realized that the approach above (removing text between parentheses) may remove a link that itself contains parentheses. However, I still want an answer to the problem above, since I've run into similar problems before and want to know the answer...
So my question is still: how do I remove the text between matching parentheses?

You can check out recursive patterns, which should be able to solve the problem.
When I read the comic I didn't have the willpower to get my head around recursive patterns, so I simplified the task: find a link first, and only then check whether it's inside parentheses. Here's my solution:
//Fetch links
$matches = array();
preg_match_all('!<a [^>]*href="/wiki/([^:"#]+)["#].*>!Umsi', $text, $matches);
$links = $matches[1];
//Find first link not within parentheses
$found = false;
foreach($links as $l) {
if(preg_match('!\([^)]+/wiki/'.preg_quote($l, '!').'.+\)!Umsi', $text)) {
continue;
}else{
$found = true;
break;
}
}
Here's my entire script: http://lajm.eu/emil/dump/filosofi.phps
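The recursive-pattern idea mentioned at the top of this answer can be sketched as follows. This is a minimal sketch, not code from the linked script: the function name removeParenthesized is made up for illustration, and it relies on PCRE's (?R) construct, which recurses into the whole pattern.

```php
<?php
// Hypothetical helper: removes every balanced (...) group, including nested ones.
function removeParenthesized(string $s): string
{
    // \(          an opening parenthesis
    // [^()]++     a run of non-parenthesis characters (possessive, no backtracking)
    // (?R)        or a nested balanced group, matched by recursing into the whole pattern
    // \)          the matching closing parenthesis
    $result = preg_replace('/\((?:[^()]++|(?R))*\)/', '', $s);

    // preg_replace returns null on failure (e.g. backtrack limit); fall back to the input.
    return $result ?? $s;
}

echo removeParenthesized('blah(asdf(foo)bar(lol)asdf)blah'), PHP_EOL; // blahblah
```

An opening parenthesis without a matching close is left untouched, because the pattern only matches balanced groups.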

Great! I ran into the same problem while cleaning up Wikipedia plain-text content. Here is how you use my solution:
cleanBraces("blah(asdf(foo)bar(lol)asdf)blah", "(", ")")
will return
blahblah
You can pass any type of brace, such as [ and ] or { and }.
Here is my source code:
function cleanBraces($source, $oB, $eB) {
$finalText = "";
if (preg_match("/\\$oB.*\\$eB/", $source) > 0) {
while (preg_match("/\\$oB.*\\$eB/", $source) > 0) {
$brace = getBracesPos($source, $oB, $eB);
$finalText .= substr($source, 0, $brace[0]);
$source = substr($source, $brace[1] + 1, strlen($source) - $brace[1]);
}
$finalText .= $source;
} else {
$finalText = $source;
}
return $finalText;
}
function getBracesPos($source, $oB, $eB) {
if (preg_match("/\\$oB.*\\$eB/", $source) > 0) {
$open = 0;
$length = strlen($source);
for ($i = 0; $i < $length; $i++) {
$currentChar = substr($source, $i, 1);
if ($currentChar == $oB) {
$open++;
if ($open == 1) { // First open brace
$firstOpenBrace = $i;
}
} else if ($currentChar == $eB) {
$open--;
if ($open == 0) { //time to wrap the roots
$lastCloseBrace = $i;
return array($firstOpenBrace, $lastCloseBrace);
}
}
} //for
} //if
}

Related

Stucked with array and explode - spaces

I have an application called mystique item where users have to guess the item that's behind the watermark. The watermark is revealed from time to time and everything works perfectly, but I have a "small" problem with guessing words. I have stored the words as a comma-separated list, and I explode that list into an array in my PHP, but for some reason only the first word is accepted as correct; everything else is treated as incorrect. Here's what I have done.
$gt = getVal('pics','gtext','online',1);
$won = getVal('pics','winner','online',1);
if($won=='no')
{
$counts = getGen(3);
$counts2 = getGen(4);
if($counts2==0)
{
$counts2 = 9999999999999;
}
$ccount = getCount2("$uid","$pid",date('Y-m-d H:i:s',$t1),date('Y-m-d H:i:s',$t2));
$ccount3 = getCount3("$uid","$pid");
if( $ccount>=$counts || $ccount3>=$counts2)
{
echo '4';
}
else
{
$sp = explode(",",$gt);
if(in_array($val, $sp)) // guess correct
{
echo '1';
}
else// guess wrong
{
echo '2';
}
}
}
gtext is the row where I store the words. My words have spaces in them, for example: new iphone,iphone 5s,apple ipad, etc.
And here's the code that checks the words:
$.post('guessit.php',{from:1,val:$('#ug').val(),uid:$('#uid').val(),pid:$('#pid').val(),t1:<?php echo $time1; ?>,t2:<?php echo $time3; ?>},function(d){
if(parseInt(d)==1){
$.post('guessit.php',{from:2,val:$('#ug').val(),uid:$('#uid').val(),pid:$('#pid').val(),t1:<?php echo $time1; ?>,t2:<?php echo $time3; ?>},function(d1){
advanced_example($('#uid').val(),'Congratulations!','You are the winner!!',1);
//setInterval($(location).attr('href','redirecttohome.php'),10000);
});
}else if(parseInt(d)==2){
$.post('guessit.php',{from:3,val:$('#ug').val(),uid:$('#uid').val(),pid:$('#pid').val(),t1:<?php echo $time1; ?>,t2:<?php echo $time3; ?>},function(d1){
advanced_example($('#uid').val(),'Wrong!','Please try again!',1);
//setInterval($(location).attr('href','redirecttohome.php'),10000);
});
}else if(parseInt(d)==3){
advanced_example($('#uid').val(),'Sorry!','Someone else was faster!',1);
//setInterval($(location).attr('href','redirecttohome.php'),8000);
}else if(parseInt(d)==4){
advanced_example($('#uid').val(),'Error!','You already attempted maximum times',1);
//setInterval($(location).attr('href','redirecttohome.php'),8000);
}
});
guessit.php contains the first code I showed you.
If you need anything else in order to help me, please let me know.
@AmalMurali What I need is this: I have in MySQL:
apple ipad,apple iphone4,apple ipod,iphone4,apple
I need them as strings as:
apple ipad
apple iphone4
apple ipod
iphone4
apple
You need to trim the whitespace for the if condition to work as you want:
$sp = explode(",",$gt);
$sp = array_map('trim', $sp); //trim all the elements in $sp
If an element has whitespace at the beginning or end, the following condition evaluates to FALSE, so the statements in the else block run:
if(in_array($val, $sp)) {
If the whitespace is removed, in_array should work and the code should work as expected.
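As a self-contained illustration of the fix above, here is a small sketch. The function name isCorrectGuess and the sample word list are made up for the example; trimming the guess as well is an extra precaution that the answer does not mention.

```php
<?php
// Hypothetical helper: checks a guess against a comma-separated word list.
function isCorrectGuess(string $guess, string $wordList): bool
{
    // Split on commas, then strip stray whitespace around every word.
    $words = array_map('trim', explode(',', $wordList));

    // Trim the guess too, so "iphone 5s " still matches "iphone 5s".
    return in_array(trim($guess), $words, true);
}

// Without the trim step, " iphone 5s " would not equal "iphone 5s".
var_dump(isCorrectGuess('iphone 5s', 'new iphone, iphone 5s ,apple ipad')); // bool(true)
var_dump(isCorrectGuess('iphone', 'new iphone, iphone 5s ,apple ipad'));    // bool(false)
```

Spaces inside a word ("iphone 5s") are preserved; only leading and trailing whitespace is removed.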

PHP Word guessing game (highlight letters in right and wrong position - like mastermind)

Sorry for the long title.
Wanted it to be as descriptive as possible.
Disclaimer : Could find some "find the differences" code here and elsewhere on Stackoverflow, but not quite the functionality I was looking for.
I'll be using this terminology later on:
'userguess' : a word that will be entered by the user
'solution' : the secret word that needs to be guessed.
What I need to create
A word guessing game where:
The user enters a word (I'll make sure through Javascript/jQuery that
the entered word contains the same number of letters as the word to
be guessed).
A PHP function then checks the 'userguess' and highlights (in green)
the letters in that word which are in the correct place, and
highlights (in red) the letters that are not yet in the right place,
but do show up somewhere else in the word. The letters that don't
show up in the 'solution' are left black.
Pitfall Scenario : - Let's say the 'solution' is 'aabbc' and the user guesses 'abaac'
In the above scenario this would result in : (green)a(/green)(red)b(/red)(red)a(/red)(black)a(/black)(green)c(/green)
Notice how the last "a" is black cause 'userguess' has 3 a's but 'solution' only has 2
What I have so far
Code is working more or less, but I've got a feeling it can be 10 times more lean and mean.
I'm filling up 2 new Arrays (one for solution and one for userguess) as I go along to prevent the pitfall (see above) from messing things up.
function checkWord($toCheck) {
global $solution; // $solution word is defined outside the scope of this function
$goodpos = array(); // array that saves the indexes of all the right letters in the RIGHT position
$badpos = array(); // array that saves the indexes of all the right letters in the WRONG position
$newToCheck = array(); // array that changes overtime to help us with the Pitfall (see above)
$newSolution = array();// array that changes overtime to help us with the Pitfall (see above)
// check for all the right letters in the RIGHT position in entire string first
for ($i = 0, $j = strlen($toCheck); $i < $j; $i++) {
if ($toCheck[$i] == $solution[$i]) {
$goodpos[] = $i;
$newSolution[$i] = "*"; // RIGHT letters in RIGHT position are 'deleted' from solution
} else {
$newToCheck[] = $toCheck[$i];
$newSolution[$i] = $solution[$i];
}
}
// go over the NEW word to check for letters that are not in the right position but show up elsewhere in the word
for ($i = 0, $j = count($newSolution); $i <= $j; $i++) {
if (!(in_array($newToCheck[$i], $newSolution))) {
$badpos[] = $i;
$newSolution[$i] = "*";
}
}
// use the two helper arrays above 'goodpos' and 'badpos' to color the characters
for ($i = 0, $j = strlen($toCheck), $k = 0; $i < $j; $i++) {
if (in_array($i,$goodpos)) {
$colored .= "<span class='green'>";
$colored .= $toCheck[$i];
$colored .= "</span>";
} else if (in_array($i,$badpos)) {
$colored .= "<span class='red'>";
$colored .= $toCheck[$i];
$colored .= "</span>";
} else {
$colored .= $toCheck[$i];
}
}
// output to user
$output = '<div id="feedbackHash">';
$output .= '<h2>Solution was : ' . $solution . '</h2>';
$output .= '<h2>Color corrected: ' . $colored . '</h2>';
$output .= 'Correct letters in the right position : ' . count($goodpos) . '<br>';
$output .= 'Correct letters in the wrong position : ' . count($badpos) . '<br>';
$output .= '</div>';
return $output;
} // checkWord
Nice question. I'd probably do it slightly differently to you :) (I guess that's what you were hoping for!)
You can find my complete solution function here http://ideone.com/8ojAG - but I'm going to break it down step by step too.
Firstly, please try and avoid using global. There's no reason why you can't define your function as:
function checkWord($toCheck, $solution) {
You can pass the solution in and avoid potential nasties later on.
I'd start by splitting both the user guess, and the solution into arrays, and have another array to store my output in.
$toCheck = str_split($toCheck, 1);
$solution = str_split($solution, 1);
$out = array();
At each stage of the process, I'd remove the characters that have been identified as correct or incorrect from the user's guess or the solution, so I don't need to flag them in any way, and the remaining stages of the function run more efficiently.
So to check for matches.
foreach ($toCheck as $pos => $char) {
if ($char == $solution[$pos]) {
$out[$pos] = "<span class=\"green\">$char</span>";
unset($toCheck[$pos], $solution[$pos]);
}
}
So for your example guess/solution, $out now contains a green 'a' at position 0, and a green c at position 4. Both the guess and the solution no longer have these indices, and will not be checked again.
A similar process for checking letters that are present, but in the wrong place.
foreach ($toCheck as $pos => $char) {
if (false !== $solPos = array_search($char, $solution)) {
$out[$pos] = "<span class=\"red\">$char</span>";
unset($toCheck[$pos], $solution[$solPos]);
}
}
In this case we are searching for the guessed letter in the solution, and removing it if it is found. We don't need to count the number of occurrences because the letters are removed as we go.
Finally, the only letters remaining in the user's guess are ones that are not present at all in the solution, and since we maintained the numbered indices throughout, we can simply merge the leftover letters back in.
$out += $toCheck;
Almost there. $out has everything we need, but it's not in the correct order. Even though the indices are numeric, they are not ordered. We finish up with:
ksort($out);
return implode($out);
The result from this is:
"<span class="green">a</span><span class="red">b</span><span class="red">a</span>a<span class="green">c</span>"
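Put together, the steps above give roughly the following function. This is a sketch assembled from the snippets in this answer, not the code behind the ideone link; it assumes both words have the same length, and the strict flag passed to array_search is an addition.

```php
<?php
// Sketch assembled from the steps above; assumes strlen($toCheck) === strlen($solution).
function checkWord(string $toCheck, string $solution): string
{
    $toCheck  = str_split($toCheck, 1);
    $solution = str_split($solution, 1);
    $out      = array();

    // Pass 1: right letter in the right position (green).
    foreach ($toCheck as $pos => $char) {
        if ($char === $solution[$pos]) {
            $out[$pos] = "<span class=\"green\">$char</span>";
            unset($toCheck[$pos], $solution[$pos]);
        }
    }

    // Pass 2: right letter in the wrong position (red).
    // Each matched solution letter is removed, so duplicates are not over-counted.
    foreach ($toCheck as $pos => $char) {
        if (false !== $solPos = array_search($char, $solution, true)) {
            $out[$pos] = "<span class=\"red\">$char</span>";
            unset($toCheck[$pos], $solution[$solPos]);
        }
    }

    // Leftover letters do not occur in the solution; keep them uncolored.
    $out += $toCheck;
    ksort($out);

    return implode('', $out);
}

echo checkWord('abaac', 'aabbc'), PHP_EOL;
```

For the pitfall scenario from the question (guess 'abaac', solution 'aabbc'), this produces the string shown above.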
Here, try this. The example calls come first, followed by the function:
<?php
echo checkWord('aabbc','abaac').PHP_EOL;
echo checkWord('funday','sunday').PHP_EOL;
echo checkWord('flipper','ripple').PHP_EOL;
echo checkWord('monkey','kenney').PHP_EOL;
function checkWord($guess, $solution){
$arr1 = str_split($solution);
$arr2 = str_split($guess);
$arr1_c = array_count_values($arr1);
$arr2_c = array_count_values($arr2);
$out = '';
foreach($arr2 as $key=>$value){
$arr1_c[$value]=(isset($arr1_c[$value])?$arr1_c[$value]-1:0);
$arr2_c[$value]=(isset($arr2_c[$value])?$arr2_c[$value]-1:0);
if(isset($arr2[$key]) && isset($arr1[$key]) && $arr1[$key] == $arr2[$key]){
$out .='<span style="color:green;">'.$arr2[$key].'</span>';
}elseif(in_array($value,$arr1) && $arr2_c[$value] >= 0 && $arr1_c[$value] >= 0){
$out .='<span style="color:red;">'.$arr2[$key].'</span>';
}else{
$out .='<span style="color:black;">'.$arr2[$key].'</span>';
}
}
return $out;
}
?>

PHP editing the middle of a string not working?

OK, basically what I am trying to do is create a kind of BB Code system without using regex. The code that I'm using below looks like it should work perfectly, but it doesn't. The code is supposed to take a string, remove all the break tags from inside all of the [code][/code] blocks, and put the result back into the entire string. Then it is supposed to turn the [code][/code] tags into "pre" tags for the SyntaxHighlighter script I'm using.
Unfortunately the code doesn't work in every case. Sometimes it still leaves the break tags inside the [code][/code] blocks. My code is:
<?php
$string = "Hello\n[code]\nCode One\n[/code]\n[code]\nCode Two\n[/code]\n[code]\nCode Three\n[/code]";
$string = nl2br($string);
$openArray = array();
$closeArray = array();
$original = "";
$newString = "";
$i = 0;
if(strpos($string, "[code]") === 0) {
array_push($openArray, 0);
}
while($i = strpos($string, "[code]", $i + 1)) {
array_push($openArray, $i);
}
while($i = strpos($string, "[/code]", $i + 1)) {
array_push($closeArray, $i + 7);
}
for($j = 0; $j < count($openArray); $j++) {
$length = $closeArray[$j] - $openArray[$j];
$original = substr($string, $openArray[$j], $length);
$newString = strip_tags($original);
$string = str_replace($original, $newString, $string);
}
$string = str_replace("[code]", '<pre class="brush: plain">', $string);
$string = str_replace("[/code]", '</pre>', $string);
echo $string;
?>
All answers are greatly appreciated, as I have been wondering what is wrong with this for quite some time now and I've tried many different ways!
The major problem I see with your processing is that you store the opening and closing tags independently of each other. You then process them as if the n-th opening tag always belonged to the n-th closing tag, but that isn't guaranteed, because you never validate that a closing code follows each opening code, or that two opening or two closing codes don't follow each other (which should be a parse error).
You could write yourself a little helper function that, like strpos, returns the next position of an opening and closing code pair:
function codepos($string, $code, $offset = 0) {
// find the next opening code at or after $offset
if (FALSE === $start = strpos($string, "[$code]", $offset)) {
return FALSE;
}
// find the closing code that follows it
if (FALSE === $stop = strpos($string, "[/$code]", $start)) {
throw new Exception('Close code not found.');
}
// the assignment needs its own parentheses, because && binds tighter than =
if (FALSE !== ($next = strpos($string, "[$code]", $start + 1)) && $next < $stop) {
throw new Exception('Double opening detected.');
}
$pos = new stdClass;
$pos->start = $start;
$pos->stop = $stop;
$pos->code = $code;
return $pos;
}
It's then easier to process this later on, as you already know that things are in order. Instead of throwing exceptions you can just return FALSE and report the problem some other way. Note that this routine does not yet check for a closing code before the first opening code.
$offset = 0;
while($pos = codepos($string, 'code', $offset))
{
// ... process each code-pair.
$offset = $pos->stop; // continue the search after this pair
}
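To make the "process each code-pair" step concrete, here is one possible sketch. The function name convertCodeBlocks is hypothetical, and it inlines the pair search instead of calling codepos so that the block stands on its own. It builds a new output string piece by piece rather than calling str_replace on the original; the question's code records offsets first and then shortens the string with str_replace, so later offsets no longer point at the right places, which is one likely reason break tags were left behind.

```php
<?php
// Hypothetical helper: strips tags inside [code]...[/code] and emits <pre> blocks.
function convertCodeBlocks(string $string, string $code = 'code'): string
{
    $open   = "[$code]";
    $close  = "[/$code]";
    $out    = '';
    $offset = 0;

    while (false !== ($start = strpos($string, $open, $offset))) {
        $bodyStart = $start + strlen($open);
        $stop      = strpos($string, $close, $bodyStart);
        if ($stop === false) {
            break; // unmatched opening code: leave the rest of the text as it is
        }

        // Copy the text before the block unchanged.
        $out .= substr($string, $offset, $start - $offset);

        // Strip tags (such as <br />) from the block body only.
        $body = substr($string, $bodyStart, $stop - $bodyStart);
        $out .= '<pre class="brush: plain">' . strip_tags($body) . '</pre>';

        // Continue after the closing code.
        $offset = $stop + strlen($close);
    }

    // Append whatever follows the last complete block.
    return $out . substr($string, $offset);
}

echo convertCodeBlocks(nl2br("Hello\n[code]\nCode One\n[/code]\n[code]\nCode Two\n[/code]"));
```

Because nothing is replaced in the input string itself, every offset stays valid for the whole loop. Nested or doubled opening codes are not detected here; the exceptions in codepos cover that case.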
For learning or for an intranet tool only, not to be even considered on the www:
You need to take into consideration:
Lines may be longer than the string buffer. Know you will have a max line size unless you code around it.
Code for possible close tags before open tags and possible missing close/open tags unless you assume the input will always be correct.
Be able to handle the following cases:
State1 Looking for one or more open tags:
No open/close tags
Open tag only
Close tag first - parse fails
one or more matching open/close tags (in proper order)
one or more matching open/close tags (in proper order) ending with open tag
End of document - OK
State2 Looking for close tag:
close tag followed by one or more matching open/close tags (in proper order)
close tag followed by one or more matching open/close tags (in proper order) ending with open tag
no close tag
End of document - Parse fails

Regex replace characters in data

I am trying to clean special characters out of some junked-up data (allowing a few), but some still get through. I found a regex snippet earlier, but it does not remove some characters, like asterisks.
$clean_body = $raw_text;
$clean_title = preg_replace("/[^!&\/A-Za-z0-9_ ]/","", $clean_body);
$clean_title = substr($clean_title, 0, 64);
$clean_body = nl2br($clean_body);
if ($nid) {
$node = node_load($nid);
unset($node->field_category);
} else {
$node = new stdClass();
$node->type = 'article';
node_object_prepare($node);
}
$split_title = str_split($clean_title);
foreach ($split_title as $key => $character) {
if ($key > 15) {
if ($character == ' ' && !preg_match("/[^!&\/,.-]/", $split_title[$key - 1])) {
$node->title = html_entity_decode(substr(strip_tags($clean_title), 0, $key - 1)) . '...';
}
}
}
The first part attempts to clean out anything in the raw text that isn't normal punctuation or alpha numeric. Then, I split the title into an array and look for a space. What I want to do is create a title that is at least 15 characters long, and truncates on a space (leaving whole words intact) without stopping on a punctuation character. This is the part I am having trouble with.
Some titles still come out as ***************** or ** HOW TO MAKE $$$$$$ BLOGGING **, when the first title should not even have *'s, and the section should be HOW TO MAKE..., for example.
What about "/[^!&\/\w\s]/ui" ?
Works fine on my machine
Your problem (or, one of them anyhow) is this logic:
if ($key > 15) {
if ($character == ' ' && !preg_match("/[^!&\/,.-]/", $split_title[$key - 1])) {
$node->title = html_entity_decode(substr(strip_tags($clean_title), 0, $key - 1)) . '...';
}
}
You're only setting $node->title if these conditions match when iterating the characters in the $split_title array.
What happens when they don't match? $node->title doesn't get set (or overwritten? You didn't give much context, so I can't tell).
Using this as a test:
$clean_body = '** HOW TO MAKE $$$$$$ BLOGGING **';
You can see that these conditions do not match, so $node->title does not get set (or overwritten).
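One way to make sure a title is always set is to compute it in a single function that has an explicit fallback. The following is only a sketch of that idea: the function name makeTitle is hypothetical, the cleaning regex is the one from the question, and the Drupal node handling is left out.

```php
<?php
// Hypothetical helper: cleans the raw text and truncates it at a word boundary.
function makeTitle(string $raw, int $minLength = 15): string
{
    // Same whitelist as in the question: drop everything except these characters.
    $clean = preg_replace('/[^!&\/A-Za-z0-9_ ]/', '', $raw);

    // Collapse the runs of spaces left behind by removed characters.
    $clean = trim(preg_replace('/ {2,}/', ' ', $clean));

    // Short titles are returned unchanged, so a title is always produced.
    if (strlen($clean) <= $minLength) {
        return $clean;
    }

    // First space at or after the minimum length; none means there is nothing to cut.
    $cut = strpos($clean, ' ', $minLength);
    if ($cut === false) {
        return $clean;
    }

    // Cut on the space and avoid ending on a punctuation character.
    return rtrim(substr($clean, 0, $cut), '!&/ ') . '...';
}

echo makeTitle('** HOW TO MAKE $$$$$$ BLOGGING **'), PHP_EOL; // HOW TO MAKE BLOGGING
echo makeTitle('The quick brown fox jumps over'), PHP_EOL;    // The quick brown...
```

With the test string from above, no space follows position 15 in the cleaned text, so the whole cleaned title is returned instead of leaving the title unset.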

Performance of tokenizing CSS in PHP

This is a noob question from someone who hasn't written a parser/lexer ever before.
I'm writing a tokenizer/parser for CSS in PHP (please don't repeat with 'OMG, why in PHP?'). The syntax is written down by the W3C neatly here (CSS2.1) and here (CSS3, draft).
It's a list of 21 possible tokens, that all (but two) cannot be represented as static strings.
My current approach is to loop through an array containing the 21 patterns over and over again, do an if (preg_match()) and reduce the source string match by match. In principle this works really well. However, for a 1000-line CSS string this takes somewhere between 2 and 8 seconds, which is too much for my project.
Now I'm racking my brain over how other parsers tokenize and parse CSS in fractions of a second. OK, C is always faster than PHP, but nonetheless, are there any obvious D'oh!s that I fell into?
I made some optimizations, like checking for '@', '#' or '"' as the first char of the remaining string and applying only the relevant regexp then, but this didn't bring any great performance boost.
My code (snippet) so far:
$TOKENS = array(
'IDENT' => '...regexp...',
'ATKEYWORD' => '@...regexp...',
'String' => '"...regexp..."|\'...regexp...\'',
//...
);
$string = '...CSS source string...';
$stream = array();
// we reduce $string token by token
while ($string != '') {
$string = ltrim($string, " \t\r\n\f"); // unconsumed whitespace at the
// start is insignificant but doing a trim reduces exec time by 25%
$matches = array();
// loop through all possible tokens
foreach ($TOKENS as $t => $p) {
// The '&' is used as delimiter, because it isn't used anywhere in
// the token regexps
if (preg_match('&^'.$p.'&Su', $string, $matches)) {
$stream[] = array($t, $matches[0]);
$string = substr($string, strlen($matches[0]));
// Yay! We found one that matches!
continue 2;
}
}
// if we come here, we have a syntax error and handle it somehow
}
// result: an array $stream consisting of arrays with
// 0 => type of token
// 1 => token content
Use a lexer generator.
The first thing I would do would be to get rid of the preg_match(). Basic string functions such as strpos() are much faster, but I don't think you even need that. It looks like you are looking for a specific token at the front of a string with preg_match(), and then simply taking the front length of that string as a substring. You could easily accomplish this with a simple substr() instead, like this:
foreach ($TOKENS as $t => $p)
{
$front = substr($string,0,strlen($p));
$len = strlen($p); //this could be pre-stored in $TOKENS
if ($front == $p) {
$stream[] = array($t, $front); // store the matched token, not the whole remaining string
$string = substr($string, $len);
// Yay! We found one that matches!
continue 2;
}
}
You could further optimize that by pre-calculating the length of all your tokens and storing them in the $TOKENS array, so that you don't have to call strlen() all the time. If you sorted $TOKENS into groups by length, you could reduce the number of substr() calls further as well, as you could take a substr($string) of the current string being analyzed just once for each token length, and run through all the tokens of that length before moving on to the next group of tokens.
The (probably) faster (but less memory-friendly) approach would be to tokenize the whole stream at once, using one big regexp with alternatives for each token, like
preg_match_all('/
(...string...)
|
(@ident)
|
(#ident)
...etc
/x', $stream, $tokens);
foreach($tokens as $token)...parse
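A runnable sketch of this single-regexp idea might look as follows. The token patterns are heavily simplified stand-ins, not the W3C grammar; the function name tokenize and the use of named groups with PREG_SET_ORDER are choices made for this example.

```php
<?php
// Sketch only: simplified token patterns, NOT the full CSS 2.1 token grammar.
function tokenize(string $css): array
{
    // One alternative per token type; named groups tell us which one matched.
    // With the x flag, a literal # must be escaped, or it starts a comment.
    $pattern = '~
          (?<WS>\s+)
        | (?<ATKEYWORD>@[a-zA-Z_-][a-zA-Z0-9_-]*)
        | (?<HASH>\#[a-zA-Z0-9_-]+)
        | (?<STRING>"[^"]*")
        | (?<IDENT>-?[a-zA-Z_][a-zA-Z0-9_-]*)
        | (?<NUMBER>[0-9]*\.?[0-9]+)
        | (?<DELIM>.)
    ~xs';

    // A single pass over the whole source; each $m is one match.
    preg_match_all($pattern, $css, $matches, PREG_SET_ORDER);

    $types  = array('ATKEYWORD', 'HASH', 'STRING', 'IDENT', 'NUMBER', 'DELIM');
    $stream = array();
    foreach ($matches as $m) {
        foreach ($types as $type) {
            // Groups that did not take part in the match are missing or empty.
            if (isset($m[$type]) && $m[$type] !== '') {
                $stream[] = array($type, $m[$type]);
                break;
            }
        }
        // Whitespace matches fall through and are dropped.
    }
    return $stream;
}

print_r(tokenize('@media a{color:#fff}'));
```

The regexp engine scans the source once, so there is no repeated substr() on the remaining string and no loop over 21 separate preg_match() calls per token.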
Don't use regexp, scan character by character.
$tokens = array();
$string = "...code...";
$length = strlen($string);
$i = 0;
// returns true for characters that may appear in an identifier
$isIdentChar = function ($c) {
return ($c >= 'A' && $c <= 'Z') || ($c >= 'a' && $c <= 'z') || $c == '_' || $c == '-';
};
while ($i < $length) {
$char = $string[$i];
if ($isIdentChar($char)) {
// identifier: consume characters until the first non-identifier character
$buf = '';
while ($i < $length && $isIdentChar($string[$i])) {
$buf .= $string[$i];
$i++;
}
$tokens[] = array('IDENT', $buf);
} else if (......) {
// ......
}
}
However, that makes the code unmaintainable, therefore, a parser generator is better.
It's an old post, but here are my 2 cents.
One thing that seriously slows down the original code in the question is the following line, because every call copies the whole remaining string:
$string = substr($string, strlen($matches[0]));
Instead of working on the entire string, take just a part of it (say 50 chars) that is long enough for all the possible regexes, and apply the same line of code to that part. When this string shrinks below a preset length, load some more data into it.
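A different way to avoid the repeated copying, shown here as an alternative to the windowing described above rather than as an implementation of it, is to leave the source string alone and pass a moving offset to preg_match(), anchoring each pattern with \G. The function name tokenizeWithOffset and the sample patterns are made up for this sketch.

```php
<?php
// Sketch: same loop as in the question, but with an offset instead of substr().
function tokenizeWithOffset(string $source, array $patterns): array
{
    $stream = array();
    $offset = 0;
    $length = strlen($source);

    while ($offset < $length) {
        foreach ($patterns as $type => $p) {
            // \G anchors the match at $offset, like ^ did for the shortened string.
            if (preg_match('&\G(?:' . $p . ')&S', $source, $m, 0, $offset) && $m[0] !== '') {
                $stream[] = array($type, $m[0]);
                $offset  += strlen($m[0]); // advance instead of copying the rest
                continue 2;
            }
        }
        // No pattern matched at this position.
        throw new RuntimeException("Syntax error at byte $offset");
    }
    return $stream;
}

// Made-up sample patterns, far simpler than the real CSS token definitions.
$patterns = array('WS' => '\s+', 'IDENT' => '[a-z-]+', 'DELIM' => '[{}:;]');
print_r(tokenizeWithOffset('a { b:c }', $patterns));
```

The source string is never modified, so the cost per token no longer depends on how much input is left.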
