PHP regex optimize

PHP regex optimize - php

I've got a regular expression that match everything between <anything> and I'm using this:
'#<([\w]+)>#'
today but I believe that there might be a better way to do it?
/ Tobias

\w doesn't match everything like you said, by the way, just [a-zA-Z0-9_]. Assuming you were using "everything" in a loose manner and \w is what you want, you don't need square brackets around the \w. Otherwise it's fine.

If "anything" is "anything except a > char", then you can:
#<([^>]+)>#
Testing will show if this performs better or worse.
Also, are you sure that you need to optimize? Does your original regex do what it should?

You better use PHP string functions for this task. It will be a lot faster and not too complex.
For example:
$string = "abcd<xyz>ab<c>d";
$curr_offset = 0;
$matches = array();
$opening_tag_pos = strpos($string, '<', $curr_offset);
while($opening_tag_pos !== false)
{
$curr_offset = $opening_tag_pos;
$closing_tag_pos = strpos($string, '>', $curr_offset);
$matches[] = substr($string, $opening_tag_pos+1, ($closing_tag_pos-$opening_tag_pos-1));
$curr_offset = $closing_tag_pos;
$opening_tag_pos = strpos($string, '<', $curr_offset);
}
/*
$matches = Array ( [0] => xyz [1] => c )
*/
Of course, if you are trying to parse HTML or XML, use a XHTML parser instead

That looks alright. What's not optimal about it?
You may also want to consider something other regex if you're trying to parse HTML:
RegEx match open tags except XHTML self-contained tags

Related

preg_match in PHP (XML extract)

152124687951<?xml version="1.0"><culo>Amazing</culo></Document>65464614
I have to extract only the XML code inside.
I could have more of XML code and I need to extract it one by one. Its starts always with </Document>. Someone could help me? Thanks...

You can use substr and strops to get all the matches you need.
It's true that regex performs worst than other solutions. So, if performance is important to you, consider other alternatives.
In other hand, performance may not be an issue (side project, background process, etc) so regex is a clean way to do the job.
From my understading you have something like:
152124687951<?xml version="1.0"><culo>Amazing</culo></Document>65464614
abc<?xml version="1.0"><culo>Amazing</culo></Document>abc
abc<?xml version="1.0"><culo>Amazing</culo></Document>abc
abc<?xml version="1.0"><culo>Amazing</culo></Document>abc
And you want to extract all the xml inside this.
So a perfect working regex will be:
#\<\?xml.+Document\>#
You can see a live result here: http://www.regexr.com/39p9q
Or you could test it online: https://www.functions-online.com/preg_match_all.html
At the end, the $matches variable will have something like (depends on the flaw you use in preg_match_all:
array (
0 =>
array (
0 => '<?xml version="1.0"><culo>Amazing</culo></Document>',
1 => '<?xml version="1.0"><culo>Amazing</culo></Document>',
),
)
So you could just iterate over it and that's all.
About performance, here is a quick test:
http://3v4l.org/B1t7h/perf#tabs

It strikes me that preg_match may not be the best approach here given the context you have described. Perhaps the following might serve your requirement more efficiently, with the supplied XML sample is held in $sXml prior to execution:
$sXml = substr( $sXml, strpos( $sXml, '<?xml' ));
$sXml = substr( $sXml, 0,
strpos( $sXml, '</Document>' ) + strlen( '</Document>' ));

If your string is large and contains many datas after and before the "XML" part, a good way (performant) consists to find the start and end offsets with strpos and to extract the substring after, example:
$start = strpos($str, '<?xml ');
$end = strpos(strrev($str), '>tnemucoD/<');
if ($start !== false && $end !== false)
$result = substr($str, $start, - $end);
If your string is not too big you can use preg_match:
if (preg_match('~\Q<?xml \E.+?</Document>~s', $str, $m))
$result = $m[0];
\Q....\E allows to write special characters (in a regex meaning) without to have to escape them. (useful to write a literal string without asking questions.). But note that in the present example, only ? needs to be escaped.

PHP - preg_replace not matching multiple occurrences

Trying to replace a string, but it seems to only match the first occurrence, and if I have another occurrence it doesn't match anything, so I think I need to add some sort of end delimiter?
My code:
$mappings = array(
'fname' => $prospect->forename,
'lname' => $prospect->surname,
'cname' => $prospect->company,
);
foreach($mappings as $key => $mapping) if(empty($mapping)) $mappings[$key] = '$2';
$match = '~{(.*)}(.*?){/.*}$~ise';
$source = 'Hello {fname}Default{/fname} {lname}Last{/lname}';
// $source = 'Hello {fname}Default{/fname}';
$text = preg_replace($match, '$mappings["$1"]', $source);
So if I use the $source that's commented, it matches fine, but if I use the one currently in the code above where there's 2 matches, it doesn't match anything and I get an error:
Message: Undefined index: fname}Default{/fname} {lname
Filename: schedule.php(62) : regexp code
So am I right in saying I need to provide an end delimiter or something?
Thanks,
Christian

Apparently your regexp matches fname}Default{/fname} {lname instead of Default.
As I mentioned here use {(.*?)} instead of {(.*)}.
{ has special meaning in regexps so you should escape it \\{.
And I recommend using preg_replace_callback instead of e modifier (you have more flow control and syntax higlighting and it's impossible to force your program to execute malicious code).
Last mistake you're making is not checking whether the requested index exists. :)
My solution would be:
<?php
class A { // Of course with better class name :)
public $mappings = array(
'fname' => 'Tested'
);
public function callback( $match)
{
if( isset( $this->mappings[$match[1]])){
return $this->mappings[$match[1]];
}
return $match[2];
}
}
$a = new A();
$match = '~\\{([^}]+)\\}(.*?)\\{/\\1\\}~is';
$source = 'Hello {fname}Default{/fname} {lname}Last{/lname}';
echo preg_replace_callback( $match, array($a, 'callback'), $source);
This results into:
[vyktor#grepfruit tmp]$ php stack.php
Hello Tested Last

Your regular expression is anchored to the end of the string so you closing {/whatever} must be the last thing in your string. Also, since your open and closing tags are simply .*, there's nothing in there to make sure they match up. What you want is to make sure that your closing tag matches your opening one - using a backreference like {(.+)}(.*?){/\1} will make sure they're the same.
I'm sure there's other gotchas in there - if you have control over the format of strings you're working with (IE - you're rolling your own templating language), I'd seriously consider moving to a simpler, easier to match format. Since you're not 'saving' the default values, having enclosing tags provides you with no added value but makes the parsing more complicated. Just using $VARNAME would work just as well and be easier to match (\$[A-Z]+), without involving back-references or having to explicitly state you're using non-greedy matching.

PHP Formatting Regex - BBCode

To be honest, I suck at regex so much, I would use RegexBuddy, but I'm working on my Mac and sometimes it doesn't help much (for me).
Well, for what I need to do is a function in php
function replaceTags($n)
{
$n = str_replace("[[", "<b>", $n);
$n = str_replace("]]", "</b>", $n);
}
Although this is a bad example in case someone didn't close the tag by using ]] or [[, anyway, could you help with regex of:
[[ ]] = Bold format
** ** = Italic format
(( )) = h2 heading
Those are all I need, thanks :)
P.S - Is there any software like RegexBuddy available for Mac (Snow Leopard)?

function replaceTags($n)
{
$n = preg_replace("/\[\[(.*?)\]\]/", "<strong>$1</strong>", $n);
$n = preg_replace("/\*\*(.*?)\*\*/", "<em>$1</em>", $n);
$n = preg_replace("/\(\((.*?)\)\)/", "<h2>$1</h2>", $n);
return $n;
}
I should probably provide a little explanation: Each special character is preceded by a backslash so it's not treated as regex instructions ("[", "(", etc.). The "(.*?)" captures all characters between your delimiters ("[[" and "]]", etc.). What's captured is then output in the replacements string in place of "$1".

The same reason you can't do this with str_replace() applies to preg_replace() as well. Tag-pair style parsing requires a lexer/parser if you want to yield 100% accuracy and cover for input errors.
Regular expressions can't handle unclosed tags, nested tags, that sort of thing.
That all being said, you can get 50% of the way there with very little effort.
$test = "this is [[some]] test [[content for **you** to try, ((does [[it]])) **work?";
echo convertTags( $test );
// only handles validly formatted, non-nested input
function convertTags( $content )
{
return preg_replace(
array(
"/\[\[(.*?)\]\]/"
, "/\*\*(.*?)\*\*/"
, "/\(\((.*?)\)\)/"
)
, array(
"<strong>$1</strong>"
, "<em>$1</em>"
, "<h2>$1</h2>"
)
, $content
);
}

Modifiers could help too :)
http://lv.php.net/manual/en/reference.pcre.pattern.modifiers.php
U (PCRE_UNGREEDY) This modifier
inverts the "greediness" of the
quantifiers so that they are not
greedy by default, but become greedy
if followed by ?. It is not compatible
with Perl. It can also be set by a
(?U) modifier setting within the
pattern or by a question mark behind a
quantifier (e.g. .*?).

Regular Expressions: how to do "option split" replaces

those reqular expressions drive me crazy. I'm stuck with this one:
test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not
Task:
Remove all [[ and ]] and if there is an option split choose the later one so output should be:
test1:link test2:silver test3:out1insideout2 test4:this|not
I came up with (PHP)
$text = preg_replace("/\\[\\[|\\]\\]/",'',$text); // remove [[ or ]]
this works for part1 of the task. but before that I think I should do the option split, my best solution:
$text = preg_replace("/\\[\\[(.*\|)(.*?)\\]\\]/",'$2',$text);
Result:
test1:silver test3:[[out1[[inside]]out2]] this|not
I'm stuck. may someone with some free minutes help me? Thanks!

I think the easiest way to do this would be multiple passes. Use a regular expression like:
\[\[(?:[^\[\]]*\|)?([^\[\]]+)\]\]
This will replace option strings to give you the last option from the group. If you run it repeatedly until it no longer matches, you should get the right result (the first pass will replace [[out1[[inside]]out2]] with [[out1insideout2]] and the second will ditch the brackets.
Edit 1: By way of explanation,
\[\[ # Opening [[
(?: # A non-matching group (we don't want this bit)
[^\[\]] # Non-bracket characters
* # Zero or more of anything but [
\| # A literal '|' character representing the end of the discarded options
)? # This group is optional: if there is only one option, it won't be present
( # The group we're actually interested in ($1)
[^\[\]] # All the non-bracket characters
+ # Must be at least one
) # End of $1
\]\] # End of the grouping.
Edit 2: Changed expression to ignore ']' as well as '[' (it works a bit better like that).
Edit 3: There is no need to know the number of nested brackets as you can do something like:
$oldtext = "";
$newtext = $text;
while ($newtext != $oldtext)
{
$oldtext = $newtext;
$newtext = preg_replace(regexp,replace,$oldtext);
}
$text = $newtext;
Basically, this keeps running the regular expression replace until the output is the same as the input.
Note that I don't know PHP, so there are probably syntax errors in the above.

This is impossible to do in one regular expression since you want to keep content in multiple "hierarchies" of the content. It would be possible otherwise, using a recursive regular expression.
Anyways, here's the simplest, most greedy regular expression I can think of. It should only replace if the content matches your exact requirements.
You will need to escape all backslashes when putting it into a string (\ becomes \\.)
\[\[((?:[^][|]+|(?!\[\[|]])[^|])++\|?)*]]
As others have already explained, you use this with multiple passes. Keep looping while there are matches, performing replacement (only keeping match group 1.)
Difference from other regular expressions here is that it will allow you to have single brackets in the content, without breaking:
test1:[[link]] test2:[[gold|si[lv]er]]
test3:[[out1[[in[si]de]]out2]] test4:this|not
becomes
test1:[[link]] test2:si[lv]er
test3:out1in[si]deout2 test4:this|not

Why try to do it all in one go. Remove the [[]] first and then deal with options, do it in two lines of code.
When trying to get something going favour clarity and simplicity.
Seems like you have all the pieces.

Why not just simply remove any brackets that are left?
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$str = preg_replace('/\\[\\[(?:[^|\\]]+\\|)+([^\\]]+)\\]\\]/', '$1', $str);
$str = str_replace(array('[', ']'), '', $str);

Well, I didn't stick to just regex, because I'm of a mind that trying to do stuff like this with one big regex leads you to the old joke about "Now you have two problems". However, give something like this a shot:
$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not'; $reg = '/(.*?):(.*?)( |$)/';
preg_match_all($reg, $str, $m);
foreach($m[2] as $pos => $match) {
if (strpos($match, '|') !== FALSE && strpos($match, '[[') !== FALSE ) {
$opt = explode('|', $match); $match = $opt[count($opt)-1];
}
$m[2][$pos] = str_replace(array('[', ']'),'', $match );
}
foreach($m[1] as $k=>$v) $result[$k] = $v.':'.$m[2][$k];

This is C# using only using non-escaped strings, hence you will have to double the backslashes in other languages.
String input = "test1:[[link]] " +
"test2:[[gold|silver]] " +
"test3:[[out1[[inside]]out2]] " +
"test4:this|not";
String step1 = Regex.Replace(input, #"\[\[([^|]+)\|([^\]]+)\]\]", #"[[$2]]");
String step2 = Regex.Replace(step1, #"\[\[|\]\]", String.Empty);
// Prints "test1:silver test3:out1insideout2 test4:this|not"
Console.WriteLine(step2);

$str = 'test1:[[link]] test2:[[gold|silver]] test3:[[out1[[inside]]out2]] test4:this|not';
$s = preg_split("/\s+/",$str);
foreach ($s as $k=>$v){
$v = preg_replace("/\[\[|\]\]/","",$v);
$j = explode(":",$v);
$j[1]=preg_replace("/.*\|/","",$j[1]);
print implode(":",$j)."\n";
}

PHP Split a string with start and stop value

I have fooled around with regex but can't seem to get it to work. I have a file called includes/header.php I am converting the file into one big string so that I can pull out a certain portion of the code to paste in the html of my document.
$str = file_get_contents('includes/header.php');
From here I am trying to get return only the string that starts with <ul class="home"> and ends with </ul>
try as I may to figure out an expression I am still confused.
Once I trim down the string I can just print that on my page but I can't figure out the trimming part

If you need something really hardcore, http://www.php.net/manual/en/book.xmlreader.php.
If you just want to rip out the text that fits that pattern try something like this.
$string = "stuff<ul class=\"home\">alsdkjflaskdvlsakmdf<another></another></ul>stuff";
if( preg_match( '/<ul class="home">(.*)<\/ul>/', $string, $match ) ) {
//do stuff with $match[0]
}

I'm assuming that the difficulty you're having has to do with escaping the regex special characters in the string(s) you're using as a delimiter. If so, try using the preg_quote() function:
$start = preg_quote('<ul class="home">');
$end = preg_quote('</ul>', '/');
preg_match("/" . $start. '.*' . $end . "/", $str, $matching_html_snippets);
The html you want should be in $matching_html_snippets[0]

You probably want an XML parser such as the built in one. Here is an example you might want to take a look at.
http://www.php.net/manual/en/function.xml-parse.php#90733
If you want to use regex then something along the lines of
$str = file_get_contents('includes/header.php');
$matchedstr = preg_match("<place your pattern here>", $str, $matches);
You probably want the pattern
'/<ul class="home">.*?<\/ul>/s'
Where $matches will contain an array of the matches it found so you can grab whatever element you want from the array with
$matchedstr[0];
which will return the first element. And then output that.
But I'd be a bit wary, regular expressions do tend to match to surprising edge cases and you need to feed them actual data to get reliable results as to when they are failing. However if you are just passing templates it should be ok, just do some tests and see if it all works. If not I'd still recommend using the PHP XML Parser.
Hope that helps.

If you feel like not using regexes you could use string finding, which I think the PHP manual implies is quicker:
function substrstr($orig, $startText, $endText) {
//get first occurrence of the start string
$start = strpos($orig, $startText);
//get last occurrence of the end string
$end = strrpos($orig, $endText);
if($start === FALSE || $end === FALSE)
return $orig;
$start++;
$length = $end - $start;
return substr($orig, $start, $length);
}
$substr = substrstr($string, '<ul class="home">', '</ul>');
You'll need to make some adjustments if you want to include the terminating strings in the output, but that should get you started!

Here's a novel way to do it; I make no guarantees about this technique's robustness or performance, other than it does work for the example given:
$prefix = '<ul class="home">';
$suffix = '</ul>';
$result = $prefix . array_shift(explode($suffix, array_pop(explode($prefix, $str)))) . $suffix;

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

PHP regex optimize - php

I've got a regular expression that match everything between <anything> and I'm using this: '#<([\w]+)>#' today but I believe that there might be a better way to do it? / Tobias

\w doesn't match everything like you said, by the way, just [a-zA-Z0-9_]. Assuming you were using "everything" in a loose manner and \w is what you want, you don't need square brackets around the \w. Otherwise it's fine.

If "anything" is "anything except a > char", then you can: #<([^>]+)># Testing will show if this performs better or worse. Also, are you sure that you need to optimize? Does your original regex do what it should?

That looks alright. What's not optimal about it? You may also want to consider something other regex if you're trying to parse HTML: RegEx match open tags except XHTML self-contained tags

Related

preg_match in PHP (XML extract)

PHP - preg_replace not matching multiple occurrences

PHP Formatting Regex - BBCode

Regular Expressions: how to do "option split" replaces

PHP Split a string with start and stop value

Categories

Resources