PHP Regular expression to capture code


I have been trying to capture code blocks in a similar fashion to wiki tags:
{{code:
code goes here
}}
Example code is shown below,
$strings = array('AbCd1zyZ9', 'foo!#$bar');
foreach ($strings as $testcase) {
if (ctype_alnum($testcase)) {
echo "It is The string $testcase consists of all letters or digits.\n";
} else {
echo "The string $testcase does not consist of all letters or digits.\n";
}
}
Essentially I want to capture anything between the {{..}}. There are multiple blocks like this embedded in an HTML page.
I would appreciate any help.

Well, to start off, regex is not a good way to solve this problem. The right approach is to write a parser that understands the language's syntax and can tease out the subtleties. Having said that, if you still want a quick and dirty regex-based approach that works in most cases but has a couple of acknowledged bugs (see the end of this answer), here you go:
You can use preg_match_all(). Here is a proof of concept:
$input = "
<html>
<head>
<title>{{code:echo 'Hello World';}}</title>
</head>
<body>
<h1>{{code:\$strings = array('AbCd1zyZ9', 'foo!#\$bar');
foreach (\$strings as \$testcase) {
if (ctype_alnum(\$testcase)) {
echo \"It is The string \$testcase consists of all letters or digits.\\n\";
} else {
echo \"The string \$testcase does not consist of all letters or digits.\\n\";
}
}
}}</h1>
</body>
</html>
";
$matches = array();
preg_match_all('/{{code:([^\x00]*?)}}/', $input, $matches);
print_r($matches[1]);
Outputs the following:
Array
(
[0] => echo 'Hello World';
[1] => $strings = array('AbCd1zyZ9', 'foo!#$bar');
foreach ($strings as $testcase) {
if (ctype_alnum($testcase)) {
echo "It is The string $testcase consists of all letters or digits.\n";
} else {
echo "The string $testcase does not consist of all letters or digits.\n";
}
}
)
Be careful. There are some edge case bugs involving early termination by encountering }} within a "code" block:
If }} appears in a quoted string, the regex matches too early
If } is the last character of your "code" block and it's immediately followed by }}, you'll lose the closing } from your code block.

As I've said in the comments, Asaph's answer is a good solid regex, but it breaks down when }} is contained within the code block. Hopefully this won't be a problem, but as there is a possibility of it, it would be best to make your regex a little more expansive. If we can assume that any }} appearing between two single quotes does not signify the end of the code, as in Asaph's example of <div>{{code:$myvar = '}}';}}</div>, we can expand our regex a bit:
{{code:((?:[^']*?'[^']*?')*?[^']*?)}}
[^']*?' looks for a set of non-' characters, followed by a single quote, and [^']*?'[^']*?' looks for two of them in succession. This "swallows" strings like '}}'. We lazily look for any number of these strings, then the rest of any non-string code with [^']*?, and finally our ending }}.
This allows us to match the entire string {{code:$myvar = '}}';}} rather than just {{code:$myvar = '}}.
There are still problems with this method, however. Escaping a quote within a string, such as in {{code:$myvar = '\'}}\'';}} will not work, as we will "swallow" '\' first, and end with the }} immediately following. It may be possible to determine these escaped single-quotes as well, or to add in support for double-quoted strings, but you need to ask yourself at what point using a code-parser is a better idea.
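To make the difference concrete, here is a minimal sketch that runs a simple lazy pattern and the quote-aware pattern from this answer against the same input. The variable names ($simple, $quoteAware) and the sample input are only for illustration.

```php
<?php
// Sample input: the code block contains }} inside a single-quoted string.
$input = "<div>{{code:\$myvar = '}}';}}</div>";

// Simple lazy pattern: stops at the first }} it sees.
$simple = '/{{code:(.*?)}}/s';

// Quote-aware pattern from the answer: swallows '...' pairs before looking for }}.
$quoteAware = "/{{code:((?:[^']*?'[^']*?')*?[^']*?)}}/";

preg_match($simple, $input, $a);
preg_match($quoteAware, $input, $b);

echo $a[1], "\n"; // $myvar = '
echo $b[1], "\n"; // $myvar = '}}';
```

The first capture is cut off inside the string literal, while the second one includes the whole statement.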
How can I use the result to, say, place it in a new <div>?
Use the replace function:
preg_replace($expression, "<div>$0</div>", $input)
$0 inserts the entire match, and will place it between a new <div> block. Alternatively, if you just want the actual source code, use $1, as we captured the source code in a separate capture group.
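As a rough illustration of the $0 / $1 difference, the sketch below uses a simple placeholder pattern for $expression (not the full quote-aware one) and a made-up input. The callback variant is an extra assumption on my part: it escapes the captured code with htmlspecialchars() so it displays as text instead of being interpreted as markup.

```php
<?php
// Placeholder pattern and input, chosen only for this illustration.
$expression = '/{{code:(.*?)}}/s';
$input = '<p>{{code:echo 1 < 2;}}</p>';

// $0 keeps the {{code: ... }} wrapper, $1 keeps only the captured source.
$whole = preg_replace($expression, '<div>$0</div>', $input);
$inner = preg_replace($expression, '<div>$1</div>', $input);

// Callback variant: escape the captured code before putting it in the page.
$escaped = preg_replace_callback(
    $expression,
    function (array $m): string {
        return '<div><code>' . htmlspecialchars($m[1]) . '</code></div>';
    },
    $input
);

echo $whole, "\n";   // <p><div>{{code:echo 1 < 2;}}</div></p>
echo $inner, "\n";   // <p><div>echo 1 < 2;</div></p>
echo $escaped, "\n"; // <p><div><code>echo 1 &lt; 2;</code></div></p>
```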
I went deeper down the rabbit hole...
{{code:((?:(?:[^']|\\')*?(?<!\\)'(?:[^']|\\')*?(?<!\\)')*?(?:[^']|\\')*?)}}
This won't break with escaped single-quotes, and correctly matches {{code:$myvar = '\'}}\'';}}.
Ta-da.

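Use
preg_match_all('/{{(.*?)}}/s', $text, $match);
where $text is the text that might contain code.
The lazy .*? stops at the first }}, and the s modifier lets . match newlines, so $match[1] holds everything between each {{ and }}.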

Related

Create a function to find a specific word in the title

I have the following title formation on my website:
It's no use going back to yesterday, because at that time I was... Lewis Carroll
The format is always: The phrase… (author).
I want to delete everything after the ellipsis (…), leaving only the sentence as the title. I thought of creating a function in php that would take the parts of the titles, throw them in an array and then I would work each part, identifying the only pattern I have in the title, which is the ellipsis… and then delete everything. But when I do that, in the X space of my array, it returns the following:
was...
In position 8 of the array comes the word and the ellipsis and I don't know how to find a pattern to delete the author of the title, my pattern was the ellipsis. Any idea?
<?php
$a = get_the_title(155571);
$search = '... ';
if(preg_match("/{$search}/i", $a)) {
echo 'true';
}
?>
I tried with the code above and found the ellipsis, but I needed to bring it into an array to delete the part I need. I tried something like this:
<?php
define('WP_USE_THEMES', false);
require('./wp-blog-header.php');
global $wpdb;
$title_array = explode(' ', get_the_title(155571));
$search = '... ';
if (array_key_exists("/{$search}/i",$title_array)) {
echo "true";
}
?>
I started doing it this way, but it doesn't work, any ideas?
Thanks,

If you use a regex, you need to escape the search string, for example with preg_quote(), because a dot is a metacharacter in the pattern.
But in your simple case, I would not use a regex and would just search for the three dots from the end of the string.
Note: when the ellipsis is added by the browser, there's no way to detect it in PHP.
$title = 'The phrase... (author).';
echo getPlainTitle($title);
function getPlainTitle(string $title) {
$rpos = strrpos($title, '...');
return ($rpos === false) ? $title : substr($title, 0, $rpos);
}
will output
The phrase

First of all, since you're working with regular expressions, you need to remember that . has a special meaning there: it means "any character". So /... / just means "any three characters followed by a space", which isn't what you want. To match a literal . you need to escape it as \.
Secondly, rather than searching or splitting, you could achieve what you want by replacing part of the string. For instance, you could find everything after the ellipsis, and replace it with an empty string. To do that you want a pattern of "dot dot dot followed by anything", where "anything" is spelled .*, so \.\.\..*
$title = preg_replace('/\.\.\..*/', '', $title);
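The two points above can be sketched together. The helper name stripAfterEllipsis is made up for this example, and handling the single-character ellipsis (U+2026) in addition to three dots is my own assumption, since the question shows both forms.

```php
<?php
// Unescaped, "... " means "any three characters and a space".
$search = '... ';
$loose  = preg_match("/{$search}/", 'no dots here');                     // 1 (false positive)
$strict = preg_match('/' . preg_quote($search, '/') . '/', 'no dots here'); // 0

// Hypothetical helper: drop the ellipsis and everything after it.
function stripAfterEllipsis(string $title): string
{
    // \.{3} is three literal dots; \x{2026} is the single "…" character.
    // The u modifier makes PCRE treat pattern and subject as UTF-8.
    $result = preg_replace('/\s*(?:\.{3}|\x{2026}).*$/su', '', $title);

    // preg_replace returns null on error (e.g. invalid UTF-8); keep the title then.
    return $result ?? $title;
}

echo stripAfterEllipsis("It's no use going back to yesterday, because at that time I was... Lewis Carroll"), "\n";
echo stripAfterEllipsis("The phrase\u{2026} (author)."), "\n"; // The phrase
```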

How to improve my algorithm? (searching and replacing words in formatted text)

I have a source of html, and an array of keywords. I'm trying to find all words which begin with any keyword in the keywords array and wrap it in a link tag.
For example, the keyword array has two values: [ABC, DEF]. It should match ABCDEF, DEFAD, etc. and wrap each word with hyperlink markup.
Here is the code I've got so far:
$_keys = array('ABC', 'DEF');
$text = 'Some ABCDD <strong>HTML</strong> text. DEF';
function search_and_replace(($key,$text)
{
$words = preg_split('/\s+/', trim($text)); //to seprate words in $_text
for($words as $word)
{
if(strpos($word,$key) !== false)
{
if($word.startswith($key))
{
str_replace($word,''.$word.',$_text);
}
}
}
return text;
}
for($_keys as $_key)
{
$text = search_and_replace($key,$text);
}
My questions:
Would this algorithm work?
How would I modify this to work with UTF-8?
How can I recognize hyperlinks in the html and ignore them (don't want to put a hyperlink in a hyperlink).
Is this algorithm safe?

Is the algorithm "true"? (I'm reading that as "accurate".)
No, it is not. According to the manual, str_replace returns
a string or an array with all occurrences of search in subject
replaced with the given replace value.
So the word you're matching is not the only thing being replaced. Using your example, if you ran this function against your data set, you'd end up wrapping each occurrence of ABC in multiple tags (just run your code to see it, but you'll have to fix the syntax errors first).
Work with UTF-8 alphabets?
Not sure, but as written, I don't think so. See Preg_Replace and UTF8. The preg_* functions handle UTF-8 input when the pattern uses the u modifier.
I want to ignore all words in each a tag for the search operation
That's awfully hard. You'll have to avoid <a ...>word</a>, which starts to make a big mess fast. Matching HTML reliably with regex is a fool's errand.
Probably the best approach would be to interpret the webpage as XML or HTML. Have you considered doing this in JavaScript? Why do it on the server side? The advantage of JS is twofold: one, it runs on the client side, so you're offloading and distributing the work, and two, since the DOM is already parsed, you can find all text nodes and replace them fairly easily. In fact, I was helping a friend working on a Chrome extension to do almost exactly what you're describing; you could modify it to do what you're looking for easily.
A better alternative method?
Definitely. What you're showing here is one of the worse methods of doing this. I'd push for you to use preg_replace (another answer has a good start for the regex you'd want, matching word breaks rather than whitespace), but since you want to avoid changing some elements, I'm thinking now that doing this in JS client-side is far better.
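If the work has to stay on the server, the "interpret the page as HTML" idea can also be sketched with PHP's DOM extension. This is only a sketch under assumptions: the wrapper div with id "root", the href="#" placeholder, and the sample HTML are made up for illustration, and real pages may need more careful encoding handling.

```php
<?php
$keys = ['ABC', 'DEF'];
$html = 'Some ABCDD <strong>HTML</strong> text. <a href="#">ABC link</a> DEF';

$doc = new DOMDocument();
// The XML declaration is a common trick to make libxml read the markup as UTF-8;
// the wrapper div (hypothetical id "root") gives us one element to read back.
$doc->loadHTML('<?xml encoding="UTF-8"><div id="root">' . $html . '</div>', LIBXML_NOERROR);
$xpath = new DOMXPath($doc);
$root = $xpath->query('//div[@id="root"]')->item(0);

// One capture group, so preg_split alternates plain text / matched word.
$quoted = array_map(function ($k) { return preg_quote($k, '/'); }, $keys);
$pattern = '/\b((?:' . implode('|', $quoted) . ')\w*)/u';

// Only text nodes that are not already inside a link.
$textNodes = iterator_to_array($xpath->query('.//text()[not(ancestor::a)]', $root));
foreach ($textNodes as $node) {
    $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    if (count($parts) < 2) {
        continue; // no keyword in this text node
    }
    $fragment = $doc->createDocumentFragment();
    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            // Odd positions hold the matched words: wrap them in a link.
            $a = $doc->createElement('a');
            $a->setAttribute('href', '#'); // placeholder target
            $a->appendChild($doc->createTextNode($part));
            $fragment->appendChild($a);
        } elseif ($part !== '') {
            $fragment->appendChild($doc->createTextNode($part));
        }
    }
    $node->parentNode->replaceChild($fragment, $node);
}

// Serialize the children of the wrapper back to an HTML string.
$result = '';
foreach ($root->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}
echo $result, "\n";
```

Because the existing link's text node is excluded by the XPath filter, "ABC link" is left alone while ABCDD and DEF get wrapped.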

In order to maximize your performance you should look into the Trie (also known as a retrieval tree or prefix tree) data structure (http://en.wikipedia.org/wiki/Trie). If I were you, I would first build a Trie containing the words in the HTML page. At this step you could also check whether a word is inside an <a> tag and, if it is, not add it to the Trie. You can easily do that with a regex match.
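For readers unfamiliar with tries, here is a small sketch of the data structure. Note that it swaps the roles compared to the answer: it stores the keywords in the trie and checks whether a word starts with one of them, which is the simpler variant to show. The function names and the 'end' marker key are made up for this example.

```php
<?php
// Build a trie as nested arrays, one level per byte of each keyword.
// The multi-character key 'end' marks a complete keyword; it cannot clash
// with the single-byte keys produced by str_split().
function buildTrie(array $keys): array
{
    $trie = [];
    foreach ($keys as $key) {
        $node = &$trie;
        foreach (str_split($key) as $ch) {
            if (!isset($node[$ch])) {
                $node[$ch] = [];
            }
            $node = &$node[$ch];
        }
        $node['end'] = true;
        unset($node); // break the reference before the next keyword
    }
    return $trie;
}

// Walk the trie along the word; succeed as soon as a full keyword was consumed.
function startsWithKeyword(array $trie, string $word): bool
{
    $node = $trie;
    foreach (str_split($word) as $ch) {
        if (isset($node['end'])) {
            return true;
        }
        if (!isset($node[$ch])) {
            return false;
        }
        $node = $node[$ch];
    }
    return isset($node['end']);
}

$trie = buildTrie(['ABC', 'DEF']);
var_dump(startsWithKeyword($trie, 'ABCDEF')); // bool(true)
var_dump(startsWithKeyword($trie, 'DEFAD'));  // bool(true)
var_dump(startsWithKeyword($trie, 'ABD'));    // bool(false)
```

Each lookup costs at most one step per character of the word, regardless of how many keywords are stored.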

How about regex?
preg_match_all('/\b' . preg_quote($word, '/') . '\w*\b/', $text, $matches);
foreach ($matches[0] as $each) {
    print($each);
}
(Sorry, my PHP is a bit rusty)

For a simple task like this PHP regular expressions will serve well. The idea is to find all hyperlinks ( and optionally some other HTML elements ) and replace them with unique tokens. After that we are free to seek and replace desired keywords, and in the end we will restore the removed HTML elements back.
$_keys = array( 'ABC', 'DEF', 'ABČ' );
$text =
'Some <a href="#" >ABC</a> ABCDđD <strong>ABCDEF</strong> text. DEF
<p class="test">
PHP is <em>the</em> most ABCwidely used
langČuage ABC for ABČogr ammDEFing on the webABC DEFABC.
</p>';
// array for holding html items replaced with tokens
$tokens = array();
$id = 0;
// we will replace all links and strong elements (a|strong)
$text = preg_replace_callback( '/<(a|strong)[^>]*>.*?<\/\1\s*>/s',
function( $matches ) use ( &$tokens, &$id )
{
// store matches into the tokens array
$tokens[ '#'.++$id.'#' ] = $matches[0];
// replace matches with the unique id
return '#'.$id.'#';
},
$text
);
echo htmlentities( $text );
/* - outputs: Some #1# ABCDđD #2# text. DEF <p class="test"> PHP is <em>the</em> most ABCwidely used langČuage ABC for ABČogr ammDEFing on the webABC DEFABC. </p>
- note the #1# #2# tokens
*/
// wrap the words that start with items in the $_keys array ( with the u (PCRE_UTF8) modifier ); href="#" is a placeholder
$text = preg_replace( '/\b('. implode( '|', $_keys ) . ')\w*\b/u', '<a href="#">$0</a>', $text );
// replace the tokens with values
$text = str_replace( array_keys($tokens), array_values($tokens), $text );
echo $text;

Regex to match double quoted strings without variables inside php tags

Basically I need a regex expression to match all double quoted strings inside PHP tags without a variable inside.
Here's what I have so far:
"([^\$\n\r]*?)"(?![\w ]*')
and replace with:
'$1'
However, this would match things outside PHP tags as well, e.g HTML attributes.
Example case:
Here's my "dog's website"
<?php
$somevar = "someval";
$somevar2 = "someval's got a quote inside";
?>
<?php
$somevar3 = "someval with a $var inside";
$somevar4 = "someval " . $var . 'with concatenated' . $variables . "inside";
$somevar5 = "this php tag doesn't close, as it's the end of the file...";
it should match and replace all places where the " should be replaced with a ', this means that html attributes should ideally be left alone.
Example output after replace:
Here's my "dog's website"
<?php
$somevar = 'someval';
$somevar2 = 'someval\'s got a quote inside';
?>
<?php
$somevar3 = "someval with a $var inside";
$somevar4 = 'someval ' . $var . 'with concatenated' . $variables . 'inside';
$somevar5 = 'this php tag doesn\'t close, as it\'s the end of the file...';
It would also be great to be able to match inside script tags too...but that might be pushing it for one regex replace.
I need a regex approach, not a PHP approach. Let's say I'm using regex-replace in a text editor or JavaScript to clean up the PHP source code.

tl;dr
This is really too complex to be done with regex, especially a simple regex. You might have better luck with nested regexes, but you really need to lex/parse to find your strings, and then you could operate on them with a regex.
Explanation
You can probably manage to do this.
You can probably even manage to do this well, maybe even perfectly.
But it's not going to be easy.
It's going to be very very difficult.
Consider this:
Welcome to my php file. We're not "in" yet.
<?php
/* Ok. now we're "in" php. */
echo "this is \"stringa\"";
$string = 'this is \"stringb\"';
echo "$string";
echo "\$string";
echo "this is still ?> php.";
/* This is also still ?> php. */
?> We're back <?="out"?> of php. <?php
// Here we are again, "in" php.
echo <<<STRING
How do "you" want to \""deal"\" with this STRING;
STRING;
echo <<<'STRING'
Apparently this is \\"Nowdoc\\". I've never used it.
STRING;
echo "And what about \\" . "this? Was that a tricky '\"' to catch?";
// etc...
Forget matching variable names in double quoted strings.
Can you even match all of the strings in this example?
It looks like a nightmare to me.
SO's syntax highlighting certainly won't know what to do with it.
Did you consider that variables may appear in heredoc strings as well?
I don't want to think about the regex to check if:
Inside <?php or <?= code
Not in a comment
Inside a quoted quote
What type of quoted quote?
Is it a quote of that type?
Is it preceded by \ (escaped)?
Is the \ escaped??
etc...
Summary
You can probably write a regex for this.
You can probably manage with some backreferences and lots of time and care.
It's going to be hard, and you're probably going to waste a lot of time; if you ever need to fix it, you aren't going to understand the regex you wrote.
See also
This answer. It's worth it.

Here's a function that utilizes the tokenizer extension to apply preg_replace to PHP strings only:
function preg_replace_php_string($pattern, $replacement, $source) {
$replaced = '';
foreach (token_get_all($source) as $token) {
if (is_string($token)){
$replaced .= $token;
continue;
}
list($id, $text) = $token;
if ($id === T_CONSTANT_ENCAPSED_STRING) {
$replaced .= preg_replace($pattern, $replacement, $text);
} else {
$replaced .= $text;
}
}
return $replaced;
}
In order to achieve what you want, you can call it like this:
<?php
$filepath = "script.php";
$file = file_get_contents($filepath);
$replaced = preg_replace_php_string('/^"([^$\{\n<>\']+?)"$/', '\'$1\'', $file);
echo $replaced;
The regular expression that's passed as the first argument is the key here. It tells the function to only transform strings to their single-quoted equivalents if they do not contain $ (embedded variable "$a"), { (embedded variable type 2 "{$a[0]}"), a new line, < or > (HTML tag end/open symbols). It also checks if the string contains a single-quote, and prevents the replacement to avoid situations where it would need to be escaped.
While this is a PHP solution, it's the most accurate one. The closest you can get with any other language would require you to build your own PHP parser in that language to some degree in order for your solution to be accurate.
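To see which strings the tokenizer approach touches, here is a compact, self-contained sketch of the same idea applied to an in-memory sample. The sample source is made up; the loop mirrors the function above but is inlined so the snippet runs on its own.

```php
<?php
// Hypothetical sample source held in a nowdoc so nothing is interpolated.
$source = <<<'PHP'
<?php
$a = "plain";
$b = "has $var inside";
$c = "it's quoted";
PHP;

$out = '';
foreach (token_get_all($source) as $token) {
    if (is_string($token)) {
        // Single-character tokens such as ; or = come back as plain strings.
        $out .= $token;
        continue;
    }
    [$id, $text] = $token;
    if ($id === T_CONSTANT_ENCAPSED_STRING) {
        // Only string literals without interpolation have this token type.
        $text = preg_replace('/^"([^$\{\n<>\']+?)"$/', "'$1'", $text);
    }
    $out .= $text;
}
echo $out, "\n";
```

Only $a is rewritten to single quotes: $b is not a constant string token because it contains a variable, and $c is skipped by the pattern because it contains a single quote.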

If a variable only contains one word

I would like to know how I could find out in PHP if a variable only contains 1 word. It should be able to recognise: "foo" "1326" ";394aa", etc.
It would be something like this:
$txt = "oneword";
if($txt == 1 word){ do.this; }else{ do.that; }
Thanks.

I'm assuming a word is defined as any string delimited by one space symbol
$txt = "multiple words";
if(strpos(trim($txt), ' ') !== false)
{
// multiple words
}
else
{
// one word
}

What defines one word? Are spaces allowed (perhaps for names)? Are hyphens allowed? Punctuation? Your question is not very clearly defined.
Going under the assumption that you just want to determine whether or not your value contains spaces, try using regular expressions:
http://php.net/manual/en/function.preg-match.php
<?php
$txt = "oneword";
if (preg_match("/ /", $txt)) {
echo "Multiple words.";
} else {
echo "One word.";
}
?>
Edit
The benefit to using regular expressions is that if you can become proficient in using them, they will solve a lot of your problems and make changing requirements in the future a lot easier. I would strongly recommend using regular expressions over a simple check for the position of a space, both for the complexity of the problem today (as again, perhaps spaces aren't the only way to delimit words in your requirements), as well as for the flexibility of changing requirements in the future.
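As an example of that flexibility, here is a sketch that treats any whitespace (tabs and newlines too) as a word separator. The helper name isOneWord is made up, and "one word" is assumed to mean "a non-empty run of non-whitespace characters".

```php
<?php
// Hypothetical helper: true when the trimmed value is one run of non-whitespace.
function isOneWord(string $txt): bool
{
    // \S+ between the anchors means: only non-whitespace characters, at least one.
    return preg_match('/^\S+$/', trim($txt)) === 1;
}

var_dump(isOneWord('foo'));             // bool(true)
var_dump(isOneWord(';394aa'));          // bool(true)
var_dump(isOneWord('two words'));       // bool(false)
var_dump(isOneWord("tab\tseparated"));  // bool(false)
var_dump(isOneWord(''));                // bool(false)
```

If the definition of a word changes later (say, only letters and digits), only the pattern needs to change.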

Utilize the strpos function included within PHP.
Returns the position as an integer. If needle is not found, strpos()
will return boolean FALSE.

Besides strpos, an alternative would be explode and count:
$txt = trim("oneword secondword");
$words = explode(" ", $txt); // $words[0] = "oneword", $words[1] = "secondword"
if (count($words) == 1) {
    // do this for one word
} else {
    // do that for more than one word (assuming at least one word was entered)
}

How to write regex to return only certain parts of this string?

So I'm working on a project that will allow users to enter poker hand histories from sites like PokerStars and then display the hand to them.
It seems that regex would be a great tool for this, however I rank my regex knowledge at "slim to none".
So I'm using PHP and looping through this block of text line by line and on lines like this:
Seat 1: fabulous29 (835 in chips)
Seat 2: Nioreh_21 (6465 in chips)
Seat 3: Big Loads (3465 in chips)
Seat 4: Sauchie (2060 in chips)
I want to extract seat number, name, & chip count so the format is
Seat [number]: [letters&numbers&characters] ([number] in chips)
I have NO IDEA where to start or what commands I should even be using to optimize this.
Any advice is greatly appreciated - even if it is just a link to a tutorial on PHP regex or the name of the command(s) I should be using.

I'm not entirely sure what exactly to use for that without trying it, but a great tool I use all the time to validate my RegEx is RegExr which gives a great flash interface for trying out your regex, including real time matching and a library of predefined snippets to use. Definitely a great time saver :)

Something like this might do the trick:
/Seat (\d+): ([^\(]+) \((\d+) in chips\)/
And some basic explanation on how Regex works:
\d = digit.
\<character> = escapes character, if not part of any character class or subexpression. for example:
\t
would render a tab, while \\t would render "\t" (since the backslash is escaped).
+ = one or more of the preceding element.
* = zero or more of the preceding element.
[ ] = bracket expression. Matches any of the characters within the bracket. Also works with ranges (ex. A-Z).
[^ ] = Matches any character that is NOT within the bracket.
( ) = Marked subexpression. The data matched within this can be recalled later.
Anyway, I chose to use
([^\(]+)
since the example provides a name containing spaces (Seat 3 in the example). This matches every character up to the point where it encounters an opening parenthesis.
Because the pattern also requires a space before that parenthesis, the captured name should not end in a blank. If you drop that space from the pattern, the capture keeps a trailing space, which can easily be stripped with trim() in PHP.
If you do not want to match spaces, only alphanumeric characters, you could do something like this:
([A-Za-z0-9_-]+)
which would match any letter (A-Z, both upper and lower case), digit, hyphen, or underscore.
Or the same variant, with spaces:
([A-Za-z0-9_\s-]+)
where "\s" matches any whitespace character, including a space.
Hope this helps :)
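Here is a short sketch of how that pattern could be used line by line with preg_match(). The $players array and its keys are names chosen for this example; the sample lines are the ones from the question.

```php
<?php
$lines = [
    'Seat 1: fabulous29 (835 in chips)',
    'Seat 2: Nioreh_21 (6465 in chips)',
    'Seat 3: Big Loads (3465 in chips)',
    'Seat 4: Sauchie (2060 in chips)',
];

// [^(]+ takes everything up to the parenthesis, so names may contain spaces.
$pattern = '/^Seat (\d+): ([^(]+) \((\d+) in chips\)/';

$players = [];
foreach ($lines as $line) {
    if (preg_match($pattern, $line, $m)) {
        // $m[0] is the whole match; groups start at index 1.
        $players[] = [
            'seat'  => (int) $m[1],
            'name'  => trim($m[2]),
            'chips' => (int) $m[3],
        ];
    }
}

foreach ($players as $p) {
    echo "Seat {$p['seat']}: {$p['name']} has {$p['chips']} chips\n";
}
```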

Look at the PCRE section in the PHP Manual. Also, http://www.regular-expressions.info/ is a great site for learning regex. Disclaimer: Regex is very addictive once you learn it.

I always use the preg_ set of functions for regex in PHP because the Perl-compatible expressions have much more capability. That extra capability doesn't necessarily come into play here, but they are also supposed to be faster, so why not use them anyway, right?
For an expression, try this:
/Seat (\d+): ([^ ]+) \((\d+)/
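You can use preg_match() on each line, storing the results in an array. You can then get at those results and manipulate them as you like. Note that [^ ]+ stops at the first space, so a name such as "Big Loads" will not match; use [^(]+ instead if names can contain spaces.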
EDIT:
Btw, you could also run preg_match_all on the entire block of text (instead of looping through line-by-line) and get the results that way, too.
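A sketch of that whole-block variant follows. The PREG_SET_ORDER flag and the named groups (seat, name, chips) are my own choices for readability, and $block stands in for the pasted hand history.

```php
<?php
// Stand-in for the full hand history text.
$block = "Seat 1: fabulous29 (835 in chips)\n"
       . "Seat 2: Nioreh_21 (6465 in chips)\n"
       . "Seat 3: Big Loads (3465 in chips)\n"
       . "Seat 4: Sauchie (2060 in chips)\n";

// The m modifier lets ^ match at the start of every line.
$pattern = '/^Seat (?<seat>\d+): (?<name>.+) \((?<chips>\d+) in chips\)/m';

// PREG_SET_ORDER groups the results per match instead of per capture group.
preg_match_all($pattern, $block, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    printf("%d | %s | %d\n", $m['seat'], $m['name'], $m['chips']);
}
```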

Check out preg_match.
Probably looking for something like...
<?php
$str = 'Seat 1: fabulous29 (835 in chips)';
preg_match('/Seat (?<seatNo>\d+): (?<name>\w+) \((?<chipCnt>\d+) in chips\)/', $str, $matches);
print_r($matches);
?>
*It's been a while since I did php, so this could be a little or a lot off.*

This may be a very late answer, but I was interested in answering:
Seat\s(\d):\s([\w\s]+)\s\((\d+).*\)
http://regex101.com/r/cU7yD7/1

Here's what I'm currently using:
preg_match("/(Seat \d+: [A-Za-z0-9 _-]+) \((\d+) in chips\)/",$line)

To process the whole input string at once, use preg_match_all():
preg_match_all('/Seat (\d+): \w+ \((\d+) in chips\)/', $input, $matches);
For your input string, var_dump of $matches will look like this (Seat 3 is missing because \w+ cannot match the space in "Big Loads"):
array
0 =>
array
0 => string 'Seat 1: fabulous29 (835 in chips)' (length=33)
1 => string 'Seat 2: Nioreh_21 (6465 in chips)' (length=33)
2 => string 'Seat 4: Sauchie (2060 in chips)' (length=31)
1 =>
array
0 => string '1' (length=1)
1 => string '2' (length=1)
2 => string '4' (length=1)
2 =>
array
0 => string '835' (length=3)
1 => string '6465' (length=4)
2 => string '2060' (length=4)
On learning regex: get Mastering Regular Expressions, 3rd Edition. Nothing else comes close to this book if you really want to learn regex. Despite being the definitive guide to regex, the book is very beginner friendly.

Try this code. It works for me.
Let's say that you have the lines below:
$string1 = "Seat 1: fabulous29 (835 in chips)";
$string2 = "Seat 2: Nioreh_21 (6465 in chips)";
$string3 = "Seat 3: Big Loads (3465 in chips)";
$string4 = "Seat 4: Sauchie (2060 in chips)";
Add to array
$lines = array($string1,$string2,$string3,$string4);
foreach($lines as $line )
{
$seatArray = explode(":", $line);
$seat = explode(" ",$seatArray[0]);
$seatNumber = $seat[1];
$usernameArray = explode("(",$seatArray[1]);
$username = trim($usernameArray[0]);
$chipArray = explode(" ",$usernameArray[1]);
$chipNumber = $chipArray[0];
echo "<br>"."Seat [".$seatNumber."]: [". $username."] ([".$chipNumber."] in chips)";
}

You'll have to split the file by line breaks,
then loop through each line and apply the following logic:
$seat = 1; // index 0 of $matches holds the whole match
$name = 2;
$chips = 3;
$lines = preg_split('/\R/', $file); // $file holds the hand history text
foreach ($lines as $string) {
    if (preg_match('/Seat ([0-9]+): ([A-Za-z_0-9 ]*) \(([0-9]*) in chips\)/', $string, $matches)) {
        echo "Seat: " . $matches[$seat] . "<br>";
        echo "Name: " . $matches[$name] . "<br>";
        echo "Chips: " . $matches[$chips] . "<br>";
    }
}
I haven't run this code, so you may have to fix some errors...

Seat [number]: [letters&numbers&characters] ([number] in chips)
Your Regex should look something like this
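Seat (\d+): ([a-zA-Z0-9_ ]+) \((\d+) in chips\)
The parentheses will let you capture the seat number, name and number of chips in groups. The underscore and space inside the character class are needed for names like Nioreh_21 and Big Loads.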
