Regular expression for DIV elements - php

Say I had this piece of HTML for example:
<div id="gallery2" class="galleryElement">
<h2>My Photos</h2>
<div class = "imageElement">
<h3>#Embassy - VIP </h3>
<p><b>Image URL:</b>
http://photos-p.friendster.com/photos/78/86/77426887/1_119466535.jpg</p>
<img src = "http://photos-p.friendster.com/photos/78/86/77426887/1_119466535.jpg" class = "full"/>
<img src = "http://photos-p.friendster.com/photos/78/86/77426887/1_887303260m.jpg" class = "thumbnail"/>
</div>
<div class = "imageElement">
<h3>#Embassy - VIP </h3>
<p><b>Image URL:</b>
http://photos-p.friendster.com/photos/78/86/77426887/1_119466535.jpg</p>
<img src = "http://photos-p.friendster.com/photos/78/86/774534426887/1_119466535.jpg" class = "full"/>
<img src = "http://photos-p.friendster.com/photos/78/86/774534426887/1_887303260m.jpg" class = "thumbnail"/>
</div>
</div>
I need to build the proper regular expression to parse each div class'ed as imageElement and store the contents (as text) in an array starting from the opening <div class = "imageElement"> till its ending div pair </div>. Also, there really are spaces on class = "imageElement". So far I have the expression:
\<div class = "imageElement">[\s\S\d\D]*</div>
but it only gets the whole set of elements. Thanks in advance.

This is a pretty common question here ("How do I parse this XML/HTML with a regular expression?") and I'll give you the same answer: don't.
Regular expressions are notoriously bad at this kind of thing. HTML/XML is not "regular" in the regex sense.
PHP comes with at least 3 XML parsers (SimpleXML, DOMDocument and XMLReader spring to mind) that will do this reliably. Use one of those.
Take a look at Parse HTML With PHP And DOM as an example.
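As a rough sketch of the DOM approach (not the linked article's code), the following runs DOMDocument and DOMXPath over a trimmed copy of the question's markup. The variable names are illustrative, and the exact match on the class attribute assumes each div carries only that one class.

```php
<?php
// Trimmed copy of the markup from the question (spaces around "=" kept on purpose).
$html = <<<'HTML'
<div id="gallery2" class="galleryElement">
<h2>My Photos</h2>
<div class = "imageElement">
<h3>#Embassy - VIP </h3>
<img src = "http://photos-p.friendster.com/photos/78/86/77426887/1_119466535.jpg" class = "full"/>
</div>
<div class = "imageElement">
<h3>#Embassy - VIP </h3>
<img src = "http://photos-p.friendster.com/photos/78/86/774534426887/1_119466535.jpg" class = "full"/>
</div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$elements = array();
foreach ($xpath->query('//div[@class="imageElement"]') as $div) {
    // Serialize the children only, so the wrapping <div> itself is left out.
    $inner = '';
    foreach ($div->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    $elements[] = trim($inner);
}

print_r($elements);
```

Each array entry holds the inner markup of one imageElement div as text, which is what the question asks for. Nested divs inside an imageElement would not break this the way they break a regex.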

It sounds like the trouble you're having is that the * is greedy, i.e. it matches as much as possible, whereas you want it to match as little as possible.
If the data inside your divs does not contain "</div>" then you can keep the parsing pretty simple. If it can contain arbitrary HTML data (specifically nested divs), you'll need to parse it more.
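To illustrate the greedy versus lazy difference, here is a minimal sketch on made-up sample data. It assumes, as stated above, that no imageElement div contains a nested </div>.

```php
<?php
// Two sibling blocks, no nested divs (sample data, not the original page).
$body = '<div class = "imageElement"><h3>A</h3></div>'
      . '<div class = "imageElement"><h3>B</h3></div>';

// Greedy: .* runs to the LAST </div>, so both blocks end up in one match.
preg_match_all('~<div class = "imageElement">.*</div>~s', $body, $greedy);

// Lazy: .*? stops at the FIRST </div> after each opening tag.
preg_match_all('~<div class = "imageElement">(.*?)</div>~s', $body, $lazy);

echo count($greedy[0]), "\n"; // 1
print_r($lazy[1]);            // the contents of each block: <h3>A</h3> and <h3>B</h3>
```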
If it stays basic, you could do the whole thing without regex. It's a little hackish, but as long as your data stays simple and predictable, it should work really fast:
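$chunks = explode('<div class = "imageElement">', $body); // explode() takes the delimiter first
array_shift($chunks); // drop everything before the first imageElement div
$matches = array();
foreach ($chunks as $chunk) {
    $pos = strpos($chunk, '</div>'); // strpos() takes the haystack first
    if ($pos !== false) { // 0 is a valid position, so compare against false
        $matches[] = substr($chunk, 0, $pos);
    }
}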
If you need something more flexible, use a real html parser.

Related

How to extract HTML element from a source file

I need to replace a HTML section identified by a tag id in a source code, which is combination of HTML and PHP using PHP. In case it's pure HTML, DOM parser could be used; in case there is no DIV in DIV, I can imagine how to use preg_match. This is what I am trying to do - I have a code (loaded into a string) like:
<div>
<img >
</div>
<? include(); ?>
<div id="mydiv">
<div>
<div>
<img >
</div>
</div>
</div>
and my task is to replace content of "mydiv" DIV with a new one e.g.
<div id="newdiv">
some text
</div>
so the string will look like this after the change:
<div>
<img >
</div>
<? include(); ?>
<div id="mydiv">
<div id="newdiv">
some text
</div>
</div>
I have already tried:
1) parsing the code using DOMdocument's loadHTML => it produces a lot of errors in case PHP code is included.
2) I played around a bit with regexes like preg_match_all('/<div id="myid"([^<]*)<\/div>/', $src, $matches), which fails in case more child divs are included.
The best approach I have found so far is:
1) find id="mydiv" string
2) search for '<' and '>' chars and count them like '<'=1 and '>'=-1 (not exactly, but it gives the idea)
3) once I get sum == 0 I should be at the position of the closing tag, so I know which portion of the string I should exchange
This is quite a "heavy" solution, and it can stop working when the code is different (e.g. the on-page PHP code contains those characters as well, instead of just a simple "include"). So I am looking for a better solution.
You could try something like this:
$file = 'filename.php';
$content = file_get_contents($file);
$array_one = explode( '<div id="mydiv">' , $content );
$my_div_content = explode("</div>" , $array_one[1] )[0];
Or use preg_match like you said:
preg_match('/<div id="mydiv">(.*?)<\/div>/s', $content, $matches);
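Both snippets above stop at the first </div>, so they cut nested divs short. The counting idea from the question can be sketched roughly as follows. The helper name replaceDivContent is made up for illustration, and the sketch assumes the opening tag is written exactly as <div id="..."> and that no PHP code or comment in the file contains the text "<div" or "</div".

```php
<?php
/**
 * Hypothetical helper: replace the inner content of the first <div id="$id">
 * by counting nested <div> and </div> tags to find the matching close.
 */
function replaceDivContent($src, $id, $replacement)
{
    $open = '<div id="' . $id . '">';
    $start = strpos($src, $open);
    if ($start === false) {
        return $src; // target div not found, leave the source untouched
    }
    $innerStart = $start + strlen($open);

    // Collect every opening or closing div tag after the target's opening tag.
    // Group 1 is "/" for a closing tag and "" for an opening tag.
    preg_match_all('~<(/?)div\b~i', $src, $tags, PREG_SET_ORDER | PREG_OFFSET_CAPTURE, $innerStart);

    $depth = 1; // we are already inside the target div
    foreach ($tags as $tag) {
        $depth += ($tag[1][0] === '/') ? -1 : 1;
        if ($depth === 0) {
            $innerEnd = $tag[0][1]; // offset of the matching </div
            return substr($src, 0, $innerStart) . $replacement . substr($src, $innerEnd);
        }
    }
    return $src; // unbalanced markup, leave the source untouched
}

$src = "<div>\n<img >\n</div>\n<? include(); ?>\n<div id=\"mydiv\">\n<div>\n<div>\n<img >\n</div>\n</div>\n</div>";
$out = replaceDivContent($src, 'mydiv', "\n<div id=\"newdiv\">\nsome text\n</div>\n");
echo $out;
```

Because only div tags are counted, other tags and the inline PHP block are ignored, which is less fragile than counting every '<' and '>'.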
Yes, there is. First you need to use a function that will get the content of the file. Let's call the file homepage.php:
$homepageString = file_get_contents('homepage.php');
Now you have a string with all the content. The next thing you would do is use the preg_replace() function to take out the part of code that you want to take out:
$newHomepageString = preg_replace('/id="mydiv"/',"", $homepageString);
Now you overwrite the existing homepage.php file with the new source code:
file_put_contents("homepage.php", $newHomepageString);
Let me know if it worked for you! :)

Remove everything except image tag from string using regular expression

I have a string that contains all the HTML elements, and I have to remove everything except the images.
Currently I am using this code:
$e->outertext = "<p class='images'>".str_replace(' ', ' ', str_replace('Â','',preg_replace('/#.*?(<img.+?>).*?#is', '',$e)))."</p>";
It serves my purpose but is very slow to execute. Any other way to do the same would be appreciated.
The code you provided seems to not work as it should and even the regex is malformed. You should remove the initial slash / like this: #.*?(<img.+?>).*?#is.
Your approach is to remove everything and leave just the image tags, which is not a good way to do it. A better way is to capture all the image tags and then use the matches to construct the output. First let's capture the image tags. That can be done using this regex:
/<img.*>/Ug
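The U flag makes the quantifiers ungreedy (lazy) instead of greedy, so the match stops at the first > it finds. The g flag is regex-tester notation for a global search; PHP has no g modifier, and preg_match_all does that job instead.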
Now in order to construct the output let's use the method preg_match_all and put the results in a string. That can be done using the following code:
<?php
// defining the input
$e =
'<div class="topbar-links"><div class="gravatar-wrapper-24">
<img src="https://www.gravatar.com/avatar" alt="" width="24" height="24" class="avatar-me js-avatar-me">
</div>
</div> <img test2> <img test3> <img test4>';
// defining the regex
$re = "/<img.*>/U";
// put all matches into $matches
preg_match_all($re, $e, $matches);
// start creating the result
$result = "<p class='images'>";
// loop to get all the images
for($i=0; $i<count($matches[0]); $i++) {
$result .= $matches[0][$i];
}
// print the final result
echo $result."</p>";
A further way to improve that code is to use functional programming (array_reduce, for example). But I'll leave that as homework.
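For what it's worth, one possible shape of that loop-free version is sketched below on made-up sample input. implode is usually the simplest choice, and array_reduce is shown next to it only because it was mentioned.

```php
<?php
// Sample input (not the original page).
$e = '<div><img src="a.png"> text <img src="b.png"></div>';

preg_match_all('/<img.*>/U', $e, $matches);

// implode() joins all matched tags in one call.
$result = "<p class='images'>" . implode('', $matches[0]) . "</p>";

// The same concatenation expressed with array_reduce().
$reduced = array_reduce($matches[0], function ($carry, $img) {
    return $carry . $img;
}, '');

echo $result; // <p class='images'><img src="a.png"><img src="b.png"></p>
```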
Note: There is another way to accomplish this which is parsing the html document and using XPath to find the elements. Check out this answer for more information.
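A minimal sketch of that parser-based route, assuming PHP's bundled DOM extension and a shortened version of the input above (this is not the code from the linked answer):

```php
<?php
// Shortened sample input; the second image is made up for illustration.
$e = '<div class="topbar-links"><div class="gravatar-wrapper-24">
<img src="https://www.gravatar.com/avatar" alt="" width="24" height="24">
</div></div> <img src="second.png">';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate imperfect markup quietly
$doc->loadHTML($e);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$result = "<p class='images'>";
foreach ($xpath->query('//img') as $img) {
    $result .= $doc->saveHTML($img); // serialize just this <img> node
}
$result .= "</p>";

echo $result;
```

Unlike the regex, this does not get confused by a ">" inside an attribute value.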

simple html dom - space in class name

I'm using PHP Simple HTML DOM to get an element from the source code of a site (not mine), and when I look for a ul whose class is "board List", it is not found. I think it might be a problem with the space, but I don't know how to solve it.
this is a piece of php code:
$html = str_get_html($result['content']); //get the html of the site
$board = $html->find('.board List'); // find all elements with class="board List"; this doesn't work in my case, although other class names do
and this is a piece of html code of the site:
<!-- OTHER HTML CODE BEFORE THIS --><ul class="board List"><li id="c111131" class="skin_tbl">
<table class="mback" cellpadding="0" cellspacing="0" onclick="toggleCat('c111131')"><tr>
<td class="mback_left"><div class="plus"></div><td class="mback_center"><h2 class="mtitle">presentiamoci</h2><td class="mback_right"><span id="img_c111131"></span></table>
<div class="mainbg">
<div class="title top"><div class="aa"></div><div class="bb">Forum</div><div class="yy">Statistiche</div><div class="zz">Ultimo Messaggio</div></div>
<ul class="big_list"><!-- OTHER HTML AFTER THIS -->
I solved it by removing board from the find parameter, like this:
$board = $html->find('.List');
Now the parser seems to work correctly.
With Simple HTML DOM you would probably want to use:
$html->find('*[class="board List"]', 0);
If you really want to use:
$html->find('.board.List', 0);
Then use this one.
The answer is that class names cannot contain spaces; spaces are the separators between classes.
If you have <div class="container wrapper-something anothersomething"></div>, then you can use .container, .wrapper-something or .anothersomething as a selector and you always match that div.
In your code you have <ul class="board List">, so to get a match with a CSS selector ($html->find('{here_comes_the_css_selector}');) you can use either .board or .List as the selector.
Therefore your line $board = $html->find('.board List'); should look more like this:
$board = $html->find('.board.List');
// matches every element that has class 'board' AND 'List'
// Here it is really important that there are no spaces between those 2 selectors
// or
$board = $html->find('.List');
// matches every element that has class 'List'
// or
$board = $html->find('.board');
// matches every element that has class 'board'
$board = $html->find('[class="board List"]');
With this syntax Simple HTML DOM finds elements whose class attribute holds multiple classes.
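For comparison, the same "has both classes" test can be sketched with PHP's built-in DOMXPath instead of Simple HTML DOM. This swaps in a different library, and hasClass is a made-up helper name. The sample markup is a shortened stand-in for the page in the question.

```php
<?php
// Shortened stand-in for the page markup.
$html = '<ul class="board List"><li class="skin_tbl">x</li></ul><ul class="big_list"></ul>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Hypothetical helper: XPath 1.0 test for one whitespace-separated class token.
function hasClass($name)
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
}

// Equivalent in spirit to the CSS selector ul.board.List
$board = $xpath->query('//ul[' . hasClass('board') . ' and ' . hasClass('List') . ']');

echo $board->length; // 1
```

Padding the attribute with spaces keeps 'List' from matching a class such as 'big_list'.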

convert DIV to SPAN using str_replace

I have some data that is provided to me as $data, an example of some of the data is...
<div class="widget_output">
<div id="test1">
Some Content
</div>
<ul>
<li>
<p>
<div>768hh</div>
<div>2308d</div>
<div>237ds</div>
<div>23ljk</div>
</p>
</li>
<div id="temp3">
Some more content
</div>
<li>
<p>
<div>lkgh322</div>
<div>32khhg</div>
<div>987dhgk</div>
<div>23lkjh</div>
</p>
</li>
</div>
I am attempting to change the invalid HTML DIVs inside the paragraphs so I end up with this instead...
<div class="widget_output">
<div id="test1">
Some Content
</div>
<ul>
<li>
<p>
<span>768hh</span>
<span>2308d</span>
<span>237ds</span>
<span>23ljk</span>
</p>
</li>
<div id="temp3">
Some more content
</div>
<li>
<p>
<span>lkgh322</span>
<span>32khhg</span>
<span>987dhgk</span>
<span>23lkjh</span>
</p>
</li>
</div>
I am trying to do this using str_replace with something like...
$data = str_replace('<div>', '<span>', $data);
$data = str_replace('</div>', '</span>', $data);
Is there a way I can combine these two statements and also make it so that they only affect the 'This is a random item' and not the other occurrences?
$data = str_replace(array('<div>', '</div>'), array('<span>', '</span>'), $data);
As long as you didn't give any other details and only asked:
Is there a way I can combine these two statements and also make it so that they only affect the 'This is a random item' and not the other occurrences?
Here you go:
$data = str_replace('<div>This is a random item</div>', '<span>This is a random item</span>', $data);
You'll need to use a regular expression to do what you are looking to do, or actually parse the string as XML and modify it that way. The XML parsing is almost surely the "safest," since as long as the string is valid XML, it will work in a predictable way. Regexes can at times fall prey to strings not being in exactly the expected format, but if your input is predictable enough, they can be OK. To do what you want with regular expressions, you'd do something like
$parsed_string = preg_replace("~<div>(?=This is a random item)(.*?)</div>~", "<span>$1</span>", $input_string);
What's happening here is the regex is looking for a <div> tag which is followed by (using a lookahead assertion) This is a random item. It then captures any text between that tag and the next </div> tag. Finally, it replaces the match with <span>, followed by the captured text from inside the div tags, followed by </span>. This will work fine on the example you posted, but will have problems if, for example, the <div> tag has a class attribute. If you are expecting things like that, either a more complex regular expression would be needed, or full XML parsing might be the best way to go.
I'm a little surprised by the other answers, I thought someone would post a good one, but that hasn't happened. str_replace is not powerful enough in this case, and regular expressions are hit-and-miss, you need to write a parser.
You don't have to write a full HTML-parser, you can cheat a bit.
$in = '<div class="widget_output">
(..)
</div>';
$lines = explode("\n", $in);
$in_paragraph = false;
foreach ($lines as $nr => $line) {
if (strstr($line, "<p>")) {
$in_paragraph = true;
} else if (strstr($line, "</p>")) {
$in_paragraph = false;
} else {
if ($in_paragraph) {
$lines[$nr] = str_replace(array('<div>', '</div>'), array('<span>', '</span>'), $line);
}
}
}
echo implode("\n", $lines);
The critical part here is detecting whether you're in a paragraph or not. And only when you're in a paragraph, do the string replacement.
Note: I'm splitting on newlines (\n) which is not perfect, but works in this case. You might want to improve this part.
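One way to drop the dependence on line breaks, sketched here under the assumption that paragraphs are never nested and the divs inside them carry no attributes, is to let preg_replace_callback hand each <p>...</p> block to the same str_replace call:

```php
<?php
// Shortened sample of the data from the question.
$data = "<div id=\"test1\">\nSome Content\n</div>\n<p>\n<div>768hh</div>\n<div>2308d</div>\n</p>";

// Rewrite divs only inside each <p>...</p> block; text outside is untouched.
$data = preg_replace_callback('~<p>.*?</p>~s', function ($m) {
    return str_replace(array('<div>', '</div>'), array('<span>', '</span>'), $m[0]);
}, $data);

echo $data;
```

The div with id="test1" and its closing tag sit outside any paragraph, so they are left as they are.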

filter php variable for specific bbcode string - wrap matches inside of divs?

Hey guys,
my PHP variable $content holds HTML.
I want to filter this $content for
[q=SomeQuestion] and [a=SomeAnswer]
and wrap each match inside a div.question and a div.answer.
So whenever this [q=Some Question][a=Some Answer] structure is found in $content, I want to output this:
<div class="qanda">
<div class="question">
Some Question
</div>
<div class="answer">
Some Answer
</div>
</div>
Is that possible? It's important that the question text or the answer text can hold HTML tags as well, like <p> or <b>.
update:
$q_regex = '/\[q=([^"]+?)]/is';
$q_output = '<div class="qanda"><div class="question">$1</div>';
$content = preg_replace($q_regex, $q_output, $content);
$a_regex = '/\[a=([^"]+?)]/is';
$a_output = '<div class="answer">$1</div></div>';
$content = preg_replace($a_regex, $a_output, $content);
http://www.spotlesswebdesign.com/blog.php?id=12
A tutorial on using regex to do BBCode parsing. People would recommend using a BBCode parser module, however. It should be safe to use regex since you are not using nesting and whatnot.
EDIT
Possible but tricky, and it could be error prone. Something like this, maybe:
$result = preg_replace('/\[q=(.+?)].*?\[a=(.+?)]/is', '<div class="qanda"><div class="question">$1</div><div class="answer">$2</div></div>', $subject);
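A small runnable sketch of the single-pattern idea, with sample content made up for illustration. It assumes the question and answer text never contain a literal "]" and that only whitespace may sit between the two tags.

```php
<?php
// Sample content with HTML inside both the question and the answer.
$content = '<p>Intro</p>[q=What is <b>PHP</b>?][a=A <i>scripting</i> language.]';

// \s* allows the [a=...] tag to follow immediately or after whitespace.
$pattern = '/\[q=(.+?)\]\s*\[a=(.+?)\]/is';
$replacement = '<div class="qanda"><div class="question">$1</div>'
             . '<div class="answer">$2</div></div>';

$result = preg_replace($pattern, $replacement, $content);

echo $result;
```

Matching the pair in one pattern keeps the qanda wrapper balanced, whereas two separate replacements can leave an unclosed div when a question has no answer.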
