I have an HTML file with the following structure:
<body>
<p class="Title-P">Compiler</p>
<p class="Heading1-P">kdnkls:</p>
<p class="Normal-P">dsf</p>
<p class="ListParagraph-P">kjsksf</p>
<p class="ListParagraph-P">dsfsf</p>
<p class="ListParagraph-P">sfsfsf</p>
<p class="Heading2-P">fsfs:</p>
</body>
What is a suitable regex to replace these tags:
<p class="Title-P">foo</p> with <h1>foo</h1>
<p class="Heading1-P">kdnkls:</p> with <h2>kdnkls:</h2>
<p class="Normal-P">foo</p> with <p>foo</p>
etc.
I'm using PHP's preg_replace function, which takes a pattern and a replacement as arguments.
Try:
$html = preg_replace('/<p class="Title-P">(.*?)<\/p>/i', "<h1>$1</h1>", $html);
$html = preg_replace('/<p class="Normal-P">(.*?)<\/p>/i', "<p>$1</p>", $html);
That should work for simple markup like this. A more robust approach is to parse the document with DOM, make your changes, and then save the document back out.
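A minimal sketch of the DOM approach mentioned above, using PHP's built-in DOMDocument. The `$tagMap` array and the `retagParagraphs()` helper are hypothetical names introduced for illustration, and mapping Heading2-P to h3 is an assumption, since the question only spells out the first three cases.

```php
<?php
// Hypothetical mapping from the class names in the question to target tags.
$tagMap = [
    'Title-P'    => 'h1',
    'Heading1-P' => 'h2',
    'Heading2-P' => 'h3', // assumed; not stated in the question
    'Normal-P'   => 'p',
];

function retagParagraphs(string $html, array $tagMap): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Take a snapshot first: the node list is live and we replace nodes below.
    $paragraphs = iterator_to_array($doc->getElementsByTagName('p'));
    foreach ($paragraphs as $p) {
        $class = $p->getAttribute('class');
        if (!isset($tagMap[$class])) {
            continue; // leave unknown classes untouched
        }
        $new = $doc->createElement($tagMap[$class]);
        while ($p->firstChild) {
            $new->appendChild($p->firstChild); // move children across
        }
        $p->parentNode->replaceChild($new, $p);
    }

    // Serialize only the children of <body>, not the implied wrapper.
    $out = '';
    foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo retagParagraphs(
    '<p class="Title-P">Compiler</p><p class="Heading1-P">kdnkls:</p><p class="Normal-P">dsf</p>',
    $tagMap
);
```

Unlike the regex version, this does not depend on the exact attribute order or spacing inside the opening tag, and it copes with paragraphs that span several lines.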
<div style = "text-align:left;" class="ref"> Text </div>
I want to replace <div> with <p> without losing attributes.
Any help is appreciated.
Try this:
$str = '<div style = "text-align:left;" class="ref"> Text </div>';
$newstr = preg_replace('/<div [^<]*?class="([^<]*?ref.*?)">(.*?)<\/div>/','<p class="$1">$2</p>',$str);
echo $newstr;
Output: <p class="ref"> Text </p>
Note that this keeps only the class attribute; the style attribute is dropped.
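A sketch of a variant that copies the whole attribute string across, so that style survives as well as class. It assumes the div contains no nested div, and as written it rewrites every div in the string, not only those with the ref class.

```php
<?php
$str = '<div style = "text-align:left;" class="ref"> Text </div>';

// Group 1 is everything between "<div" and ">" (the attributes),
// group 2 is the element content. Both are copied onto the new <p>.
$newstr = preg_replace('/<div\b([^>]*)>(.*?)<\/div>/is', '<p$1>$2</p>', $str);

echo $newstr; // <p style = "text-align:left;" class="ref"> Text </p>
```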
I want to extract the content between the tags from an HTML file with DOM.
source file:
<html>
<body>
some html code
..........
<div id="text">
<p>some title</p> <br>
<p>some text</p> <br>
<img src="../images2/somegif.gif">
</div>
..........
</body>
</html>
my code:
$file = 'somefile.html';
include('simple_html_dom.php');
$html = new simple_html_dom();
$html->load_file($file);
$text = $html->getElementById('text');
echo $text;
the result is:
<div id="text">
<p>some title</p> <br>
<p>some text</p> <br>
<img src="../images2/somegif.gif">
</div>
What I want is just the data inside the div tags but keeping all the other HTML elements like:
<p>some title</p> <br>
<p>some text</p> <br>
<img src="../images2/somegif.gif">
How can I do that? I need to send that data to a MySQL database later on. Thank you.
$text1= $html->getElementById('text');
$text= $text1->innertext;
echo $text;
I only changed the variable names. The innertext property returns the element's inner HTML without the enclosing div tags.
I want to get all the text inside the <p> and <h3> tags in the following HTML:
<div class="bodyText">
<p>
<div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">
<div class="one">
<img src="url" alt="bar" class="img" width="80" height="60" />
</div>
<div class="two">
<h4 class="preTitle">QIEZ-Lieblinge</h4>
<h3 class="title"><a href="url" title="ABC" onclick="cmsTracking.trackClickOut({element:this, channel : 32333770, channelname : 'top_listen', content : 14832081, callTemplate : '_htmltagging.Text', action : 'click', mouseevent : event});">
Prominente Gastronomen </a></h3>
<span class="postTitle"></span>
<span class="district">Berlin</span> </div>
<div class="clear"></div>
</div>
I want this TEXT</p>
<h3>I want this TEXT</h3>
<p>I want this TEXT</p>
<p>
<div class="inlineImage alignLeft">
<div class="medium">
<img src="http://images03.qiez.de/Restaurant+%C3%96_QIEZ.jpg/280x210/0/167.231.886/167.231.798" width="280" height="210" alt="Schöne Lage: das Restaurant Ø. (c)QIEZ"/>
<span class="caption">
Schöne Lage: das Restaurant Ø. (c)QIEZ </span>
</div>
</div>I want this TEXT</p>
<p>I want this TEXT</p>
<p>I want this TEXT<br /> </p>
<blockquote><img src="url" alt="" width="68" height="68" />
"Eigentlich nur drei Worte: Ich komme wieder."<span class="author">Tina Gerstung</span></blockquote>
<div class="clear"></div>
</div>
I want every "I want this TEXT". I used this XPath query:
//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']
but it does not give me the text if the <p> tag is followed by another tag.
It looks like you have div elements inside your p elements, which is not valid HTML and is what breaks the query. If you var_dump each node in the loop, you can see that the query does pick up the node, but its nodeValue is empty.
A quick and dirty fix to your html would be to wrap the first div that is contained in the p element in a span.
<span><div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">...</div></span>
A better fix would be to put the div element outside the paragraph.
If you use the dirty workaround you will need to change your query like so:
$xpath->query("//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']/text()");
If you do not have control of the source HTML, you can make a copy of it and remove the offending divs:
$nodes = $xpath->query("//div[contains(@class,'articleBox')]");
$node = $nodes->item(0);
$node->parentNode->removeChild($node);
It might be easier to work with simple_html_dom. Maybe you can try this:
include('simple_html_dom.php');
$dom = new simple_html_dom();
$dom->load($html);
foreach ($dom->find("div[class=bodyText]") as $parent) {
    foreach ($parent->children() as $child) {
        if ($child->tag == 'p' || $child->tag == 'h3') {
            // blank the divs nested inside this element so their text is skipped
            foreach ($child->find('div') as $e) {
                $e->innertext = '';
            }
            echo $child->plaintext . '<br>';
        }
    }
}
This is mixed content. Depending on what defines the position of the element, you can use a number of factors. In this case, simply selecting all the text nodes will probably be sufficient:
//div[contains(@class, 'bodyText')]/(p | h3)/text()
If the union operator within a location path is not allowed in your processor (it is XPath 2.0 syntax), you can use your syntax as before. In XPath 2.0 it can also be written a little more simply, in my opinion:
//div[contains(@class, 'bodyText')]/*[local-name() = ('p', 'h3')]/text()
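PHP's DOMXPath implements XPath 1.0, so a sketch of the same idea there has to use a predicate instead of the union step. The example below loads a small, well-formed fragment with loadXML (an assumption made to sidestep how the HTML parser repairs a div nested in a p); the sample markup is made up for illustration.

```php
<?php
$xml = <<<XML
<div class="bodyText">
  <p><div class="box">skip me</div>I want this TEXT</p>
  <h3>I want this TEXT too</h3>
  <blockquote>not wanted</blockquote>
</div>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// text() selects only the direct text children of each p or h3,
// so text inside the nested div is not returned.
$query = "//div[contains(@class,'bodyText')]/*[self::p or self::h3]/text()";

$texts = [];
foreach ($xpath->query($query) as $textNode) {
    $value = trim($textNode->nodeValue);
    if ($value !== '') {
        $texts[] = $value; // skip whitespace-only nodes
    }
}
print_r($texts);
```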
How can I remove all the text between tags?
Input
<div>
<p>testing</p>
<div>my world</div>
</div>
Output
<div>
<p></p>
<div></div>
</div>
You can use either DOMDocument or PHP Simple HTML DOM Parser.
The following example uses the latter, although you may want to use what suits you best.
include("simple_html_dom.php");
$str = '
<div>
<p>testing</p>
<div>my world</div>
</div>
';
$html = str_get_html($str);
foreach ($html->find("text") as $ht) {
    $ht->innertext = "";
}
$html->save();
echo $html;
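For comparison, a sketch of the DOMDocument route with no external library. It removes only text nodes that contain something other than whitespace, which is an assumption made here so that the indentation between tags is left alone.

```php
<?php
$str = '
<div>
<p>testing</p>
<div>my world</div>
</div>
';

$doc = new DOMDocument();
// The two flags stop libxml from adding <html>/<body> and a doctype.
$doc->loadHTML(trim($str), LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($doc);

// normalize-space() is empty for whitespace-only nodes, so those are skipped.
$textNodes = iterator_to_array($xpath->query('//text()[normalize-space()]'));
foreach ($textNodes as $textNode) {
    $textNode->parentNode->removeChild($textNode);
}

$result = $doc->saveHTML();
echo $result;
```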
You could also use a regex with two capture groups and replace each match with just those two groups, which drops the characters between them:
(\<.+\>).*(\<\/.+\>)
working example: http://ideone.com/Oq14El
I need to find a way to replace all the <p> within all the <blockquote> before the <hr />.
Here's a sample html:
<p>2012/01/03</p>
<blockquote>
<h4>File name</h4>
<p>Good Game</p>
</blockquote>
<blockquote><p>Laurie Ipsumam</p></blockquote>
<h4>Some title</h4>
<hr />
<p>Lorem Ipsum</p>
<blockquote><p>Laurel Ipsucandescent</p></blockquote>
Here's what I got:
$pieces = explode("<hr", $theHTML, 2);
$blocks = preg_match_all('/<blockquote>(.*?)<\/blockquote>/s', $pieces[0], $blockmatch);
if ($blocks) {
    $t1 = $blockmatch[1];
    for ($j = 0; $j < $blocks; $j++) {
        $paragraphs = preg_match_all('/<p>/', $t1[$j], $paragraphmatch);
        if ($paragraphs) {
            $t2 = $paragraphmatch[0];
            for ($k = 0; $k < $paragraphs; $k++) {
                // no backslashes here: inside single quotes, \" would stay a literal backslash
                $t1[$j] = str_replace($t2[$k], '<p class="whatever">', $t1[$j]);
            }
        }
    }
}
I think I'm really close, but I don't know how to put back together the html that I just pieced out and modified.
You could try SimpleXML or, better, DOMDocument (http://www.php.net/manual/en/class.domdocument.php), once you have made sure the markup is valid HTML. Use it to find the nodes you are looking for and change them; XPath (http://w3schools.com/xpath/xpath_syntax.asp) is a good way to select them.
Edit 1:
Take a look at the answer of this question:
RegEx match open tags except XHTML self-contained tags
$string = explode('<hr', $string, 2);
$string[0] = preg_replace('/<blockquote>(.*)<p>(.*)<\/p>(.*)<\/blockquote>/sU', '<blockquote>\1<p class="whatever">\2</p>\3</blockquote>', $string[0]);
$string = $string[0] . '<hr' . $string[1];
output:
<p>2012/01/03</p>
<blockquote>
<h4>File name</h4>
<p class="whatever">Good Game</p>
</blockquote>
<blockquote><p class="whatever">Laurie Ipsumam</p></blockquote>
<h4>Some title</h4>
<hr />
<p>Lorem Ipsum</p>
<blockquote><p>Laurel Ipsucandescent</p></blockquote>
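A sketch of the DOMDocument and XPath route suggested earlier in this thread, which avoids splitting and reassembling the string. The XPath expression is an assumption about what "before the <hr />" means: it selects p elements inside any blockquote that has no hr earlier in the document.

```php
<?php
$theHTML = <<<HTML
<p>2012/01/03</p>
<blockquote>
<h4>File name</h4>
<p>Good Game</p>
</blockquote>
<blockquote><p>Laurie Ipsumam</p></blockquote>
<h4>Some title</h4>
<hr />
<p>Lorem Ipsum</p>
<blockquote><p>Laurel Ipsucandescent</p></blockquote>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($theHTML);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// preceding::hr is empty for every blockquote that comes before the first <hr>.
foreach ($xpath->query('//blockquote[not(preceding::hr)]//p') as $p) {
    $p->setAttribute('class', 'whatever');
}

// Serialize only the children of <body>, not the implied wrapper.
$out = '';
foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $out .= $doc->saveHTML($child);
}
echo $out;
```

Because the change is made on the parsed tree, every p in each matching blockquote gets the class, not just the first one.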