How can i have counter for php preg_match? - php

function getContent($xml,$tag,$id="") {
if ($id=="") {
$tag_regex = '/<'.$tag.'[^>]*>(.*?)<\/'.$tag.'>/si';
} else {
$tag_regex = '/<'.$tag.'[^>]*id=[\'"]'.$id.'[\'"]>(.*?)<\/'.$tag.'>/si';
}
preg_match($tag_regex,$xml,$matches);
return $matches[1];
}
$omg = file_get_contents("Generated/index.php");
$extract = getContent($omg,"div","lolz2");
echo $extract;
For example i have something like this. And html have something like this inside:
<div id="lolz">qwg1eqwe</div>
<div id="lolz1"><div id='lolz2'>qwdqw2cq</div>asd3qwe</div>
If we search for id lolz we get the correct answer, but if we search for lolz1 we stop at first </div> that's inner <div id="lolz2">. It's possible to keep something like counter for preg_match that's will keep how many <div>'s i pass till i find </div>?

HTML isn't a regular language, so building something like that would be overkill and is the job of an HTML parser. Please see: RegEx match open tags except XHTML self-contained tags.
The reason your code was failing however was because you were using both single and double quotes in your input but your regex didn't account for it. This works for me:
function getContent($xml,$tag,$id="") {
if ($id=="") {
$tag_regex = '/<'.$tag.'[^>]*>(.*?)<\/'.$tag.'>/si';
} else {
$tag_regex = '/<'.$tag.'[^>]*id=[\\\'"]'.$id.'[\\\'"]>(.*?)<\/'.$tag.'>/si';;
}
preg_match($tag_regex,$xml,$matches);
return $matches[1];
}
$omg = '<div id="lolz">qwg1eqwe</div>
<div id="lolz1"><div id="lolz2">qwdqw2cq</div>asd3qwe</div>';
$extract = getContent($omg,"div","lolz2");
var_dump($extract);
As long as you don't have nested elements this code will work and you won't need to use a DOM parser, though you really should for anything more complicated that might be nested (e.g. you don't have control over the input).

Related

How do I insert html code into a php regex?

So I am currently in the middle of making a forums software. Something that I wanted for that forums software was a custom template engine. For the most part I have created the template engine, but I am having a small issue with the regex that I use for my IF, ELSEIF, and FOREACH statements.
The issue that I am having is that when I put a chunk of html code in to my regex, nothing will work. Here is an example: https://regex101.com/r/jlawz3/1.
Here is the PHP code that checks for the regex.
$isMatchedAgain = preg_match_all('/{IF:(.*?)}[\s]*?(.*?)[\s]*?{ELSE}[\s]*?(.*?)[\s]*?{ENDIF}/', $this->template, $elseifmatches);
for ($i = 0; $i < count($elseifmatches[0]); $i++) {
$condition = $elseifmatches[1][$i];
$trueval = $elseifmatches[2][$i];
$falseval = (isset($elseifmatches[3][$i])) ? $elseifmatches[3][$i] : false;
$res = eval('return ('.$condition.');');
if ($res===true) {
$this->template = str_replace($elseifmatches[0][$i],$trueval,$this->template);
} else {
$this->template = str_replace($elseifmatches[0][$i],$falseval,$this->template);
}
}
You can do it like this:
function render($content) {
$match = preg_match_all('/{IF:\((.*?)\)}(.*?){ELSE}(.*?)({ENDIF})/s', $content, $matches, PREG_OFFSET_CAPTURE);
if (!$match) {
return $content;
}
$beforeIf = substr($content, 0, $matches[0][0][1]);
$afterIf = substr($content, $matches[4][0][1] + strlen('{ENDIF}'));
$evalCondition = eval('return (' . $matches[1][0][0] . ');');
if ($evalCondition) {
$ifResult = $matches[2][0][0];
} else {
$ifResult = $matches[3][0][0];
}
return
$beforeIf .
$ifResult .
render($afterIf);
}
Working example.
This is a first step. This wont work for example if you have an if within an if.
Talking about mentioned security risk. Since we are using eval (nickname EVIL - for a reason). You should never ever ever process user-input through eval - or use eval at all - there is always a better solution.
For me it looks like you want to give users the ability to write "code" in their posts. If this is the case you can have a look at something like bbcode.
Whatever you do be sure to provide the desired functionality. Taking your example:
!isset($_SESSION['loggedin'])
You could do something like this:
{IS_LOGGED_IN}
Output whatever you want :)
{/IS_LOGGED_IN}
Your renderer would look specificly for this tag and act accordingly.
So after a bit of research, I have figured out that the issue I was facing could be solved by adding /ims to the end of my regex statement. So now my regex statement looks like:
/{IF:(.*?)}[\s]*?(.*?)[\s]*?{ELSE}[\s]*?(.*?)[\s]*?{ENDIF}/ims

preg_replace with wildcards?

I have HTML markup bearing the form
<div id='abcd1234A'><p id='wxyz1234A'>Hello</p></div>
which I need to replace to bear the form
<div id='abcd1234AN'><p id='wxyz1234AN'>Hello</p></div>
where N may be 1,2.. .
The best I have been able to do is as follows
function cloneIt($a,$b)
{
return substr_replace($a,$b,-1);
}
$ndx = "1'";
$str = "<div id='abcd1234A'><p id='wxyz1234A'>Hello</p></div>";
preg_match_all("/id='[a-z]{4}[0-9]{4}A'/",$str,$matches);
$matches = $matches[0];
$reps = array_merge($matches);
$ndxs = array_fill(0,count($reps),$ndx);
$reps = array_map("cloneIt",$reps,$ndxs);
$str = str_replace($matches,$reps,$str);
echo htmlspecialchars($str);
which works just fine. However, my REGEX skills are not much to write home about so I suspect that there is probably a better way to do this. I'd be most obliged to anyone who might be able to suggest a neater/quicker way of accomplishing the same result.
You can optimize your regex like this:
/id='[a-z]{4}\d{4}A'/
Sample code
preg_match_all("/id='[a-z]{4}\\d{4}A'/",$str,$matches);
However an alternative would consist in using en HTML parser. Here I'll use simple html dom:
// Load the HTML from URL or file
$html = file_get_html('http://www.mysite.com/');
// You can also load $html from string: $html = str_get_html($my_string);
// Find div with id attribute
foreach($html->find('div[id]') as $div) {
if (preg_match("/id='([a-z]{4}\\d{4})A'/" , $div->id, $matches)) {
$div->id = $matches[1] + $ndx;
}
}
echo $html->save();
Did you notice how elegant, concise and clear the code becomes with an html parser ?
References
Simple Html Dom Documentation

get wrapping element using preg_match php

I want a preg_match code that will detect a given string and get its wrapping element.
I have a string and a html code like:
$string = "My text";
$html = "<div><p class='text'>My text</p><span>My text</span></div>";
So i need to create a function that will return the element wrapping the string like:
$element = get_wrapper($string, $html);
function get_wrapper($str, $code){
//code here that has preg_match and return the wrapper element
}
The returned value will be array since it has 2 possible returning values which are <p class='text'></p> and <span></span>
Anyone can give me a regex pattern on how to get the HTML element that wraps the given string?
Thanks! Answers are greatly appreciated.
It's bad idea use regex for this task. You can use DOMDocument
$oDom = new DOMDocument('1.0', 'UTF-8');
$oDom->loadXML("<div>" . $sHtml ."</div>");
get_wrapper($s, $oDom);
after recursively do
function get_wrapper($s, $oDom) {
foreach ($oDom->childNodes AS $oItem) {
if($oItem->nodeValue == $s) {
//needed tag - $oItem->nodeName
}
else {
get_wrapper($s, $oItem);
}
}
}
The simple pattern would be the following, but it assumes a lot of things. Regexes shouldn't be used with these. You should look at something like the Simple HTML DOM parser which is more intelligent.
Anyway, the regex that would match the wrapper tags and surrounding html elements is as follows.
/[A-Za-z'= <]*>My text<[A-Za-z\/>]*/g
Even if regex is never the correct answer in the domain of dom parsing, I came out with another (quite simple) solution
<[^>/]+?>My String</.+?>
if the html is good (ie it has closing tags, < is replaced with < & so on). This way you have in the first regex group the opening tag and in the second the closing one.

Parsing data with REGEX (PHP)

I want to parse data in between brackets. Below is the code where you can see what I'm doing. I would also like to avoid using XML.
$query = '
[page:1]
<html>
all the html
</html>
[/page:1]
[page:2]
<html>
all the html
</html>
[/page:2]
';
I want to create a loop script that will use regex to find all instances of [page:x]; which in the example above is 2. And then with a get function we can specify the page we want.
if(isset($_GET['page'])) {
$page = $_GET['page'];
$regex = '\\['page':(.*?)\\';
echo preg_match($regex, $query);
}
Any thoughts?
This should find all the matching blocks at once:
preg_match_all('/\[page:([0-9]+)\](.+?)\[\/page:$1\]/', $page, $matches)
I strongly doubt regex is the most suitable solution for what you're trying to accomplish though.

Inserting multiple links into text, ignoring matches that happen to be inserted

The site I'm working on has a database table filled with glossary terms. I am building a function that will take some HTML and replace the first instances of the glossary terms with tooltip links.
I am running into a problem though. Since it's not just one replace, the function is replacing text that has been inserted in previous iterations, so the HTML is getting mucked up.
I guess the bottom line is, I need to ignore text if it:
Appears within the < and > of any HTML tag, or
Appears within the text of an <a></a> tag.
Here's what I have so far. I was hoping someone out there would have a clever solution.
function insertGlossaryLinks($html)
{
// Get glossary terms from database, once per request
static $terms;
if (is_null($terms)) {
$query = Doctrine_Query::create()
->select('gt.title, gt.alternate_spellings, gt.description')
->from('GlossaryTerm gt');
$glossaryTerms = $query->rows();
// Create whole list in $terms, including alternate spellings
$terms = array();
foreach ($glossaryTerms as $glossaryTerm) {
// Initialize with title
$term = array(
'wordsHtml' => array(
h(trim($glossaryTerm['title']))
),
'descriptionHtml' => h($glossaryTerm['description'])
);
// Add alternate spellings
foreach (explode(',', $glossaryTerm['alternate_spellings']) as $alternateSpelling) {
$alternateSpelling = h(trim($alternateSpelling));
if (empty($alternateSpelling)) {
continue;
}
$term['wordsHtml'][] = $alternateSpelling;
}
$terms[] = $term;
}
}
// Do replacements on this HTML
$newHtml = $html;
foreach ($terms as $term) {
$callback = create_function('$m', 'return \'<span>\'.$m[0].\'</span>\';');
$term['wordsHtmlPreg'] = array_map('preg_quote', $term['wordsHtml']);
$pattern = '/\b('.implode('|', $term['wordsHtmlPreg']).')\b/i';
$newHtml = preg_replace_callback($pattern, $callback, $newHtml, 1);
}
return $newHtml;
}
Using Regexes to process HTML is always risky business. You will spend a long time fiddling with the greediness and laziness of your Regexes to only capture text that is not in a tag, and not in a tag name itself. My recommendation would be to ditch the method you are currently using and parse your HTML with an HTML parser, like this one: http://simplehtmldom.sourceforge.net/. I have used it before and have recommended it to others. It is a much simpler way of dealing with complex HTML.
I ended up using preg_replace_callback to replace all existing links with placeholders. Then I inserted the new glossary term links. Then I put back the links that I had replaced.
It's working great!

Categories