match "//" comments with regex but not inside a quote - php

I need to match and replace some comments.
for example:
$test = "the url is http://www.google.com";// comment "<-- that quote needs to be matched
I want to match the comments outside of the quotes, and replace any "'s in the comments with &quot;'s.
I have tried a number of patterns and different ways of running them but with no luck.
The regex will be run with javascript to match php "//" comments
UPDATE:
I took the regex from borkweb below and modified it. used a function from http://ejohn.org/blog/search-and-dont-replace/ and came up with this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<script type="text/javascript">
function t_replace(data){
var q = {}, ret = "";
data.replace(/(?:((["'\/]*(("[^"]*")|('[^']*'))?[\s]*)?[\/\/|#][^"|^']*))/g, function(value){
q[key] = value;
});
for ( var key in q ){
ret = q[key];
}
var text = data.split(ret);
var out = ret + text[1];
out = out.replace(/"/g,"&quot;");
out = out.replace(/'/g,"&apos;");
return text[0] + out;
}
</script>
</head>
<body>
<script type="text/javascript">
document.write(t_replace("$test = \"the url is http://www.google.com\";// c'o\"mment \"\"\"<-- that quote needs to be matched")+"<br>");
document.write(t_replace("$test = 'the url is http://www.google.com';# c'o\"mment \"\"\"<-- that quote needs to be matched"));
</script>
</body>
</html>
It handles all the line comments outside of single or double quotes. Is there any way I could optimize this function?
UPDATE 2:
it does not handle this string
document.write(t_replace("$test //= \"the url is http://www.google.com\"; //c'o\"mment \"\"\"<-- that quote needs to be matched")+"<br>");

You can have a regexp to match all strings and comments at the same time. If it's a string, you can replace it with itself, unchanged, and then handle a special case for comments.
I came up with this regex:
"(\\[\s\S]|[^"])*"|'(\\[\s\S]|[^'])*'|(\/\/.*|\/\*[\s\S]*?\*\/)
There are 3 parts:
"(\\[\s\S]|[^"])*" for matching double quoted strings.
'(\\[\s\S]|[^'])*' for matching single quoted strings.
(\/\/.*|\/\*[\s\S]*?\*\/) for matching both single line and multiline comments.
The replace function checks whether the matched string is a comment. If it's not, it returns the match unchanged. If it is, it replaces " and '.
function t_replace(data){
var re = /"(\\[\s\S]|[^"])*"|'(\\[\s\S]|[^'])*'|(\/\/.*|\/\*[\s\S]*?\*\/)/g;
return data.replace(re, function(all, strDouble, strSingle, comment) {
if (comment) {
return all.replace(/"/g, '&quot;').replace(/'/g, '&apos;');
}
return all;
});
}
Test run:
Input: $test = "the url is http://www.google.com";// c'o"mment """<-- that quote needs to be matched
Output: $test = "the url is http://www.google.com";// c&apos;o&quot;mment &quot;&quot;&quot;<-- that quote needs to be matched
Input: $test = 'the url is http://www.google.com';# c'o"mment """<-- that quote needs to be matched
Output: $test = 'the url is http://www.google.com';# c'o"mment """<-- that quote needs to be matched
Input: $test //= "the url is http://www.google.com"; //c'o"mment """<-- that quote needs to be matched
Output: $test //= &quot;the url is http://www.google.com&quot;; //c&apos;o&quot;mment &quot;&quot;&quot;<-- that quote needs to be matched
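The test run above shows that `#` comments pass through unchanged, because the pattern only knows `//` and `/* */`. Here is a minimal sketch of one way to cover them, assuming it is acceptable to treat every unquoted `#` as the start of a line comment. The name `escapeCommentQuotes` and the extra `#.*` alternative are illustrative additions, not part of the answer above:

```javascript
// Sketch: match strings and comments in one pass, as in the answer above,
// with "#" line comments added as an extra alternative (an assumption).
function escapeCommentQuotes(source) {
  // The string alternatives are non-capturing, so the only capture group is the comment.
  const re = /"(?:\\[\s\S]|[^"\\])*"|'(?:\\[\s\S]|[^'\\])*'|(\/\/.*|#.*|\/\*[\s\S]*?\*\/)/g;
  return source.replace(re, (match, comment) =>
    comment === undefined
      ? match // a quoted string: put it back unchanged
      : match.replace(/"/g, "&quot;").replace(/'/g, "&apos;")
  );
}

console.log(escapeCommentQuotes("$test = 'http://www.google.com';# c'o\"mment"));
```

Using a callback instead of a replacement string also avoids any special handling of `$` in the PHP source being processed.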

Don't forget that PHP comments can also take the form of /* this is a comment */, which can span multiple lines.
This site may be of interest to you:
http://blog.stevenlevithan.com/archives/mimic-lookbehind-javascript
Older JavaScript engines (before ES2018) do not have native lookbehind support in their regular expression engine. What you may be able to do is start at the end of a line and look backward to capture any characters that follow a semicolon + optional whitespace + //. So something like:
;\s*\/\/(.+)$
This may not capture everything.
You also may want to look for a Javascript (or other languages) PHP syntax checker. I think Komodo Edit's PHP syntax checker may be written in Javascript. If so, it may give you insight on how to strip everything out but comments as the syntax checkers need to ensure the PHP code is valid, comments and all. The same can be said about syntax color changers. Here are two other links:
http://ecoder.quintalinda.com/
http://www.webdesignbooth.com/9-useful-javascript-syntax-highlighting-scripts/

I have to admit, this regex took me a while to generate...but I'm pretty sure this will do what you are looking for:
<script>
var str = "$test = \"the url is http://www.google.com\";// comment \"\"\"<-- that quote needs to be matched";
var reg = /^(?:(([^"'\/]*(("[^"]*")|('[^']*'))?[\s]*)?\/\/[^"]*))"/g;
while( str !== (str = str.replace( reg, "$1&quot;") ) );
console.log( str );
</script>
Here's what's going on in the regex:
^ # start with the beginning of the line
(?: # don't capture the following
(
([^"'\/]* # start the line with any character as long as it isn't a string or a comment
(
("[^"]*") # grab a double quoted string
| # OR
('[^']*') # grab a single quoted string
)? # but...we don't HAVE to match a string
[\s]* # allow for any amount of whitespace
)? # but...we don't HAVE to have any characters before the comment begins
\/\/ # match the start of a comment
[^"]* # match any number of characters that isn't a double quote
) # end un-caught grouping
) # end the non-capturing declaration
" # match your commented double quote
The while loop in javascript is just find/replacing until it can't find any additional matches in a given line.

To complement @Thai's answer, which I found very good, I would like to add a bit more:
In this example using the original regex, only the last character of each quoted string is captured: https://regex101.com/r/CoxFvJ/2
So I modified it a bit to capture the full quoted content, and built a more detailed and generic example: https://regex101.com/r/CoxFvJ/3
So the final regex is:
/"((?:\\"|[^"])*)"|'((?:\\'|[^'])*)'|(\/\/.*|\/\*[\s\S]*?\*\/)/g
Big thanks to Thai for unblocking me.

Related

PHP : load another php file source and replace all instances of string "<?php ... ?>"

I have a .php file which is actually an SVG file with some inline PHP code. Let's call it inner.php:
<?php
$uuid = uniqid();
?>
<svg class="__combo" id=<?php echo $uuid ?>
...
</svg>
I am writing another PHP file which should emit the content of inner.php, but with all instances of <?php ... ?> replaced by some string (for example "AA"). The file (let's call it outer.php) looks like this now:
<?php
$svg_body = file_get_contents("inner.php");
$replaced = preg_replace("??","AA" , $svg_body);
echo "$replaced";
?>
I marked with "??" the part where I would like to put a regular expression to contain any string starting with "<?php" and ending with first occurrence of "?>". And the output I expect to see is
"AA"
<svg class="__combo" id="AA"
...
</svg>
Basically, I can't find a way to escape a string containing <?php and ?> in PHP.
An alternative is using PHP's parser via the tokenizer Extension:
<?php
$tokens = token_get_all("<svg>\n<?php echo; ?></svg>");
$result='';
$in_php = false;
foreach ($tokens as $token) {
if ($token[0]==T_INLINE_HTML) {
$result .= $token[1];
$in_php = false;
} else if (!$in_php) {
$result .= "AAA";
$in_php=true;
}
}
echo $result;
https://3v4l.org/jEpYi
This has the benefit that it also handles other open tags, like <?=, and files without a closing tag. It also handles cases where the closing tag appears inside PHP code (for example, in a comment).
This expression seems to work fine...
preg_replace('/<\?(php\s|=).*\?>/siU', '"AA"', $svg_body);
Demo ~ https://3v4l.org/0Ci9d
To break it down...
<\?(php\s|=) - a literal match for "<?php" (question-mark escaped) followed by a whitespace character (could be newline) or the short-echo <?=
.* - zero or more characters
\?> - a literal match for "?>"
s - makes . match newlines as well, required for your first <?php ... ?> block
i - case-insensitive because why not
U - ungreedy. Means .* stops at the first following match of the rest of the pattern, not the last. This prevents .* from matching everything between the first "<?php" and the last "?>"
See here for more information on the modifiers ~ http://php.net/manual/reference.pcre.pattern.modifiers.php
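The difference the U modifier makes can be sketched in JavaScript, which has no such modifier and marks laziness on each quantifier (`*?`) instead. The sample input and variable names below are made up for illustration:

```javascript
// Sketch: lazy vs greedy matching of <?php ... ?> blocks.
// [\s\S] stands in for "." with the s modifier (match newlines too).
const source = "<?php $uuid = uniqid(); ?>\n<svg id=<?php echo $uuid ?>>\n</svg>";

const lazy = /<\?(?:php\s|=)[\s\S]*?\?>/gi;   // stops at the first "?>"
const greedy = /<\?(?:php\s|=)[\s\S]*\?>/gi;  // runs on to the last "?>"

const lazyResult = source.replace(lazy, "AA");
const greedyResult = source.replace(greedy, "AA");

console.log(JSON.stringify(lazyResult));   // both blocks replaced separately
console.log(JSON.stringify(greedyResult)); // everything from first "<?php" to last "?>" collapsed
```

With the greedy pattern the opening `<svg id=` text is swallowed along with both PHP blocks, which is the failure mode the ungreedy flag is there to prevent.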
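Try doing this without RegEx first. You may try with substr_replace. Note that its fourth argument is a length, not an end position, and that this replaces only the first block:
$start = strpos($svg_body, '<?php');
$replaced = substr_replace($svg_body, 'AA', $start, strpos($svg_body, '?>', $start) + 2 - $start);
If you really need RegEx, then try something like this:
$replaced = preg_replace('/(<\?php[\s\S]+?\?>)/', "AA", $svg_body);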

Regular Expression for finding HTML tags leaves empty tags behind in PHP

I'm trying to remove all hidden tags (and the ending tag) via regular expression and it seems to work but with one problem. It leaves behind "<>" for all the elements found.
I'm using this to replace my hidden fields with blank:
$saveContent = preg_replace('<input type="hidden" .*? />', "", $saveContent);
$saveContent = preg_replace('</form>', "", $saveContent);
It just brings back "<><><>" (2 Hidden fields and the ending form tag). I tried to string replace <> and that doesn't seem to work either
Am I missing something?
The problem (except that of trying to match HTML with regex) is that you don't properly quote the expression inside the string, which is usually done in PHP like "/regex/", but any character can be used instead of the slashes, eg "~regex~".
In your case < is the quoting character, which makes the end quote > (i.e "<regex>"), thus making it valid in preg_* and not giving you any errors.
For example:
preg_replace('</form>', "", $str)
is the same as
preg_replace('~/form~', "", $str)
and
preg_replace('/\/form/', "", $str)
All of which replace /form with an empty string.
While you wanted:
preg_replace('~</form>~', "", $str)
You need to escape slashes.. and add slashes for modifiers to work http://php.net/manual/en/reference.pcre.pattern.modifiers.php
$saveContent = preg_replace('/<input type="hidden" .*? \/>/i', "", $saveContent);
$saveContent = preg_replace('/<\/form>/i', "", $saveContent);
If I remember correctly, you can put the pattern between %'s to avoid all that escaping, which makes the pattern practically unreadable. For example:
if (preg_match('%</form>%', $subject)) {
# Successful match
} else {
# Match attempt failed
}
Turns out for some reason (that I wasn't aware of), the < and > symbols were being converted to entities, but only for a select few.
I just checked for those entities and string replaced them to the correct symbols and it worked.
Try this
$content = '<input type="hidden" name="abc" /> abc <input type="hidden" name="abc" />';
$content = preg_replace('#<input type="hidden"[^>]+>#', '', $content);

Jquery Not letting Me Save My FULL Text. Only 1 Word

I have the following code that checks whether the div text has changed and, if so, updates the text in the MySQL table. But when I add spaces to the text, it doesn't save everything I have written. Can somebody help me out, please?
Thank you.
<script>function save() {
var div_sN6VmIGq = $("#64").text();
var html_sN6VmIGq = $("#64").html();
var top_sN6VmIGq = $("input#sN6VmIGq_top").val();
var left_sN6VmIGq = $("input#sN6VmIGq_left").val();
if(div_sN6VmIGq == "Text Here"){
}else{
$('#saveupdate').load('modulus/empty/actions.php?act=save_text&pid=1&div_id=64&div_txt='+html_sN6VmIGq+'&randval='+ Math.random());
}
}
</script>
You need to URLEncode your text:
JavaScript:
var html_sN6VmIGq = escape($("#64").html());
//now use your function for saving the item
PHP:
$decodedVar = urldecode($yourEncodedVarHere);
You probably need to call encodeURI on your somewhat bizarrely named variables:
$('#saveupdate').load('modulus/empty/actions.php?act=save_text&pid=1&div_id=64&div_txt='+encodeURI(html_sN6VmIGq)+'&randval='+ Math.random());
Also, note that the id 64 is invalid, per this comment from W3Schools:
[ids must] begin with a letter A-Z or a-z followed by: letters (A-Za-z),
digits (0-9), hyphens ("-"),
underscores ("_"), colons (":"), and
periods (".").
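A likely reason only the first word is saved: jQuery's .load() treats everything after the first space in its URL argument as a selector, so an unencoded space cuts the query string short. Here is a minimal sketch of building the URL safely, assuming the same endpoint and parameter names as in the question. The helper name `buildSaveUrl` is made up, and `encodeURIComponent` is used rather than `encodeURI` because `encodeURI` leaves `&` untouched, which would still split the value:

```javascript
// Sketch: encode each query value separately so spaces, "&" and markup survive.
function buildSaveUrl(divId, html) {
  const params = [
    "act=save_text",
    "pid=1",
    "div_id=" + encodeURIComponent(divId),
    "div_txt=" + encodeURIComponent(html),
    "randval=" + Math.random(), // cache buster, as in the question
  ];
  return "modulus/empty/actions.php?" + params.join("&");
}

console.log(buildSaveUrl("64", "Hello world & <b>more</b>"));
```

On the PHP side the value should arrive already decoded in $_GET, so no extra urldecode() call should be needed there.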

php preg_match_all html dates with slashes error

I'm trying to preg_match_all a date with slashes in it sitting between 2 HTML tags; however, it's returning null.
here is the html:
> <td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>
Here is my preg_match_all() code
preg_match_all('/<td width=\'40%\' align=\'right\' class=\'SmallDimmedText\'>Last([a-zA-Z0-9\s\.\-\',]*)<\/td>/', $h, $table_content, PREG_PATTERN_ORDER);
where $h is the html above.
what am i doing wrong?
thanks in advance
It (from a quick glance) is because you are trying to match:
Last Login: 11/14/2009
With this regex:
Last([a-zA-Z0-9\s\.\-\',]*)
The regex doesn't contain the required characters of : and / which are included in the text string. Changing the required part of the regex to:
Last([a-zA-Z0-9\s\.\-\',:/]*)
Gives a match
Would it be better to simply use a DOM parser, and then perform the regex on the result of the DOM lookup? It makes for a nicer regex...
EDIT
The other issue is that your HTML is:
...40%' align='right'class='SmallDimmedText'>...
Where there is no space between align='right' and class='SmallDimmedText'
However your regex for that section is:
...40%\' align=\'right\' class=\'SmallDimmedText\'>...
Where it is indicated there is a space.
Use a DOM parser. It will save you more headaches caused by subtle bugs than you can count.
Just to give you an idea of how simple it is to parse using Simple HTML DOM:
$html = str_get_html(...);
// find() returns an array of matching elements
$elems = $html->find('.SmallDimmedText');
if ( count($elems) != 1 ){
throw new Exception('Too many/few elements found');
}
$text = $elems[0]->plaintext;
//parsing here is only an example, but you have removed all
//the html so that any regex used is really simple.
$date = substr($text, strlen('Last Login: '));
$unixTime = strtotime($date);
I see at least two problems :
in your HTML string, there is no space between 'right' and class=, and there is one space there in your regex
you must add at least these 3 characters to the list of matched characters, between the [] :
':' (there is one between "Login" and the date),
' ' (there are spaces between "Last" and "Login", and between ":" and the date),
and '/' (between the date parts)
With this code, it seems to work better :
$h = "<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>";
if (preg_match_all("#<td width='40%' align='right'class='SmallDimmedText'>Last([a-zA-Z0-9\s\.\-',: /]*)<\/td>#",
$h, $table_content, PREG_PATTERN_ORDER)) {
var_dump($table_content);
}
I get this output :
array
0 =>
array
0 => string '<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>' (length=80)
1 =>
array
0 => string ' Login: 11/14/2009' (length=18)
Note I have also used :
# as a regex delimiter, to avoid having to escape slashes
" as a string delimiter, to avoid having to escape single quotes
My first suggestion would be to minimize the amount of text you have in the preg_match_all, why not just do between a ">" and a "<"? Second, I'd end up writing the regex like this, not sure if it helps:
#>.*[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}<#
That will look for the end of one tag, then any character, then a date, then the beginning of another tag.
I agree with Yacoby.
At the very least, remove all reference to any of the HTML specific and simply make the regex
preg_match_all('#Last Login: ([\d+/?]+)#', ...
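The idea of dropping the HTML-specific parts and matching only the label and the date can be sketched as follows. This assumes the date is always written as month/day/four-digit year, which the single sample line suggests but does not guarantee. The function name `extractLastLogin` is hypothetical:

```javascript
// Sketch: ignore the surrounding markup and anchor on the "Last Login:" label.
function extractLastLogin(html) {
  const match = /Last Login:\s*(\d{1,2})\/(\d{1,2})\/(\d{4})/.exec(html);
  if (match === null) {
    return null; // label or date not present
  }
  const [, month, day, year] = match;
  return { month: Number(month), day: Number(day), year: Number(year) };
}

const html = "<td width='40%' align='right'class='SmallDimmedText'>Last Login: 11/14/2009</td>";
console.log(extractLastLogin(html));
```

Because the pattern no longer mentions attributes, the missing space between align='right' and class= stops mattering.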

PHP's json_encode does not escape all JSON control characters

Is there any reasons why PHP's json_encode function does not escape all JSON control characters in a string?
For example let's take a string which spans two rows and has control characters (\r \n " / \) in it:
<?php
$s = <<<END
First row.
Second row w/ "double quotes" and backslash: \.
END;
$s = json_encode($s);
echo $s;
// Will output: "First row.\r\nSecond row w\/ \"double quotes\" and backslash: \\."
?>
Note that carriage return and newline chars are unescaped. Why?
I'm using jQuery as my JS library and its $.getJSON() function will do fine when you fully, 100% trust incoming data. Otherwise I use JSON.org's library json2.js like everybody else.
But if you try to parse that encoded string it throws an error:
<script type="text/javascript">
JSON.parse(<?php echo $s ?>); // Will throw SyntaxError
</script>
And you can't get the data! If you remove or escape \r \n " and \ in that string then JSON.parse() will not throw error.
Is there any existing, good PHP function for escaping control characters? A simple str_replace with search and replace arrays will not work.
function escapeJsonString($value) {
# list from www.json.org: (\b backspace, \f formfeed)
$escapers = array("\\", "/", "\"", "\n", "\r", "\t", "\x08", "\x0c");
$replacements = array("\\\\", "\\/", "\\\"", "\\n", "\\r", "\\t", "\\b", "\\f");
$result = str_replace($escapers, $replacements, $value);
return $result;
}
I'm using the above function which escapes a backslash (must be first in the arrays) and should deal with formfeeds and backspaces (I don't think \f and \b are supported in PHP).
D'oh - you need to double-encode: JSON.parse is expecting a string of course:
<script type="text/javascript">
JSON.parse(<?php echo json_encode($s) ?>);
</script>
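Why the second encoding helps can be simulated entirely in JavaScript. In this sketch `JSON.stringify` stands in for PHP's json_encode, and `new Function` stands in for the browser evaluating the echoed text as script source. Both stand-ins and all names are assumptions made for illustration:

```javascript
// Sketch: single vs double encoding of a string echoed into a <script> block.
const original = 'First row.\r\nSecond row w/ "double quotes" and backslash: \\.';

const encodedOnce = JSON.stringify(original);     // like json_encode($s)
const encodedTwice = JSON.stringify(encodedOnce); // like json_encode(json_encode($s))

// Evaluate echoed text the way a script block would: as a JavaScript expression.
function evaluateAsScriptSource(sourceText) {
  return new Function("return " + sourceText + ";")();
}

function tryParse(sourceText) {
  try {
    return { ok: true, value: JSON.parse(evaluateAsScriptSource(sourceText)) };
  } catch (error) {
    return { ok: false, error: error.name };
  }
}

const singleEncoded = tryParse(encodedOnce);  // JSON.parse receives the raw text
const doubleEncoded = tryParse(encodedTwice); // JSON.parse receives valid JSON

console.log(singleEncoded, doubleEncoded);
```

With one encoding, the script already sees the decoded string, so JSON.parse is handed plain text that is not JSON. With two, the script sees the JSON text and JSON.parse recovers the original.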
I still haven't figured out any solution without str_replace..
Try this code.
$json_encoded_string = json_encode(...);
$json_encoded_string = str_replace("\r", '\r', $json_encoded_string);
$json_encoded_string = str_replace("\n", '\n', $json_encoded_string);
Hope that helps...
$search = array("\n", "\r", "\u", "\t", "\f", "\b", "/", '"');
$replace = array("\\n", "\\r", "\\u", "\\t", "\\f", "\\b", "\/", "\"");
$encoded_string = str_replace($search, $replace, $json);
This is the correct way
Converting to and fro from PHP should not be an issue.
PHP's json_encode does proper encoding, but reinterpreting that inside JavaScript can cause issues. For example:
1) original string - [string with nnn newline in it] (where nnn is actual newline character)
2) json_encode will convert this to
[string with "\\n" newline in it] (control character converted to "\\n" - Literal "\n"
3) However when you print this again in a literal string using php echo then "\\n" is interpreted as "\n" and that causes heartache. Because JSON.parse will understand a literal printed "\n" as newline - a control character (nnn)
so to work around this: -
A)
First encode the JSON object in PHP using json_encode and get a string. Then run it through a filter that makes it safe to be used inside HTML and JavaScript.
B)
use the JSON string coming from PHP as a "literal" and put it inside single quotes instead of double quotes.
<?php
function form_safe_json($json) {
$json = empty($json) ? '[]' : $json ;
$search = array('\\',"\n","\r","\f","\t","\b","'") ;
$replace = array('\\\\',"\\n", "\\r","\\f","\\t","\\b", "&#039");
$json = str_replace($search,$replace,$json);
return $json;
}
$title = "Tiger's /new \\found \/freedom " ;
$description = <<<END
Tiger was caged
in a Zoo
And now he is in jungle
with freedom
END;
$book = new \stdClass ;
$book->title = $title ;
$book->description = $description ;
$strBook = json_encode($book);
$strBook = form_safe_json($strBook);
?>
<!DOCTYPE html>
<html>
<head>
<title> title</title>
<meta charset="utf-8">
<script type="text/javascript" src="/3p/jquery/jquery-1.7.1.min.js"></script>
<script type="text/javascript">
$(document).ready(function(){
var strBookObj = '<?php echo $strBook; ?>' ;
try{
bookObj = JSON.parse(strBookObj) ;
console.log(bookObj.title);
console.log(bookObj.description);
$("#title").html(bookObj.title);
$("#description").html(bookObj.description);
} catch(ex) {
console.log("Error parsing book object json");
}
});
</script>
</head>
<body>
<h2> Json parsing test page </h2>
<div id="title"> </div>
<div id="description"> </div>
</body>
</html>
Put the string inside single quotes in JavaScript. Putting the JSON string inside double quotes would cause the parser to fail at attribute markers (something like { "id" : "value" } ). No other escaping should be required if you put the string as a "literal" and let the JSON parser do the work.
I don't fully understand how var_export works, so I will update if I run into trouble, but this seems to be working for me:
<script>
window.things = JSON.parse(<?php var_export(json_encode($s)); ?>);
</script>
Maybe I'm blind, but in your example they ARE escaped. What about
<script type="text/javascript">
JSON.parse("<?php echo $s ?>"); // Will throw SyntaxError
</script>
(note different quotes)
Just an addition to Greg's response: the output of json_encode() is already contained in double-quotes ("), so there is no need to surround them with quotes again:
<script type="text/javascript">
JSON.parse(<?php echo $s ?>);
</script>
Control characters have no special meaning in HTML except for new line in textarea.value . JSON_encode on PHP > 5.2 will do it like you expected.
If you just want to show text you don't need to go after JSON. JSON is for arrays and objects in JavaScript (and indexed and associative array for PHP).
If you need a line feed for the textarea tag:
$s=preg_replace('/\r */','',$s);
echo preg_replace('/ *\n */','
',$s);
This is what I use personally and it's never not worked. Had similar problems originally.
Source script (ajax) will take an array and json_encode it. Example:
$return['value'] = 'test';
$return['value2'] = 'derp';
echo json_encode($return);
My javascript will make an AJAX call and get the echoed "json_encode($return)" as its input, and in the script I'll use the following:
myVar = jQuery.parseJSON(msg.replace(/&quot;/ig,'"'));
with "msg" being the returned value. So, for you, something like...
var msg = '<?php echo $s ?>';
myVar = jQuery.parseJSON(msg.replace(/&quot;/ig,'"'));
...might work for you.
There are 2 solutions unless AJAX is used:
Write the data into a hidden input and read it in JS:
<input type="hidden" value="<?= htmlspecialchars(json_encode($data)) ?>"/>
Use addslashes
var json = '<?= addslashes(json_encode($data)) ?>';
When using any form of Ajax, detailed documentation for the format of responses received from the CGI server seems to be lacking on the Web. Some Notes here and entries at stackoverflow.com point out that newlines in returned text or json data must be escaped to prevent infinite loops (hangs) in JSON conversion (possibly created by throwing an uncaught exception), whether done automatically by jQuery or manually using Javascript system or library JSON parsing calls.
In each case where programmers post this problem, inadequate solutions are presented (most often replacing \n by \\n on the sending side) and the matter is dropped. Their inadequacy is revealed when passing string values that accidentally embed control escape sequences, such as Windows pathnames. An example is "C:\Chris\Roberts.php", which contains the control characters ^c and ^r, which can cause JSON conversion of the string {"file":"C:\Chris\Roberts.php"} to loop forever. One way of generating such values is deliberately to attempt to pass PHP warning and error messages from server to client, a reasonable idea.
By definition, Ajax uses HTTP connections behind the scenes. Such connections pass data using GET and POST, both of which require encoding sent data to avoid incorrect syntax, including control characters.
This gives enough of a hint to construct what seems to be a solution (it needs more testing): to use rawurlencode on the PHP (sending) side to encode the data, and unescape on the Javascript (receiving) side to decode the data. In some cases, you will apply these to entire text strings, in other cases you will apply them only to values inside JSON.
If this idea turns out to be correct, simple examples can be constructed to help programmers at all levels solve this problem once and for all.
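The receiving side of that idea can be sketched as follows, with two caveats: `decodeURIComponent` is used in place of the older `unescape`, and the encoded value is an assumed example of what rawurlencode would produce for the path above. The sketch also checks how a native JSON.parse reacts to the hand-built string with unescaped backslashes:

```javascript
// Sketch: decode a percent-encoded value, and compare hand-built vs encoded JSON.
const received = "C%3A%5CChris%5CRoberts.php"; // assumed rawurlencode output
const decoded = decodeURIComponent(received);

function parsesCleanly(text) {
  try {
    JSON.parse(text);
    return true;
  } catch (error) {
    return false; // a native JSON.parse throws SyntaxError here
  }
}

// The JS literal below holds single backslashes, i.e. the invalid escapes \C and \R.
const handBuilt = '{"file":"C:\\Chris\\Roberts.php"}';
// Letting an encoder build the JSON doubles the backslashes.
const properlyEncoded = JSON.stringify({ file: decoded });

console.log(decoded, parsesCleanly(handBuilt), parsesCleanly(properlyEncoded));
```

This suggests percent-encoding is only needed when the JSON is assembled by hand. A real JSON encoder on the sending side already escapes backslashes and control characters.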
