PHP preg_match_all + str_replace - php

I need to find a way to replace all the <p> within all the <blockquote> before the <hr />.
Here's a sample html:
<p>2012/01/03</p>
<blockquote>
<h4>File name</h4>
<p>Good Game</p>
</blockquote>
<blockquote><p>Laurie Ipsumam</p></blockquote>
<h4>Some title</h4>
<hr />
<p>Lorem Ipsum</p>
<blockquote><p>Laurel Ipsucandescent</p></blockquote>
Here's what I got:
$pieces = explode("<hr", $theHTML, 2);
$blocks = preg_match_all('/<blockquote>(.*?)<\/blockquote>/s', $pieces[0], $blockmatch);
if ($blocks) {
$t1=$blockmatch[1];
for ($j=0;$j<$blocks;$j++) {
$paragraphs = preg_match_all('/<p>/', $t1[$j], $paragraphmatch);
if ($paragraphs) {
$t2=$paragraphmatch[0];
for ($k=0;$k<$paragraphs;$k++) {
$t1[$j]=str_replace($t2[$k],'<p class="whatever">',$t1[$j]);
}
}
}
}
I think I'm really close, but I don't know how to put back together the html that I just pieced out and modified.

You could try SimpleXML or, better, DOMDocument (http://www.php.net/manual/en/class.domdocument.php). Make sure the markup is valid HTML first, then use XPath (http://w3schools.com/xpath/xpath_syntax.asp) to find the nodes you are looking for and replace them.
Edit 1:
Take a look at the answers to this question:
RegEx match open tags except XHTML self-contained tags

$string = explode('<hr', $string);
$string[0] = preg_replace('/<blockquote>(.*)<p>(.*)<\/p>(.*)<\/blockquote>/sU', '<blockquote>\1<p class="whatever">\2</p>\3</blockquote>', $string[0]);
$string = $string[0] . '<hr' . $string[1];
output:
<p>2012/01/03</p>
<blockquote>
<h4>File name</h4>
<p class="whatever">Good Game</p>
</blockquote>
<blockquote><p class="whatever">Laurie Ipsumam</p></blockquote>
<h4>Some title</h4>
<hr />
<p>Lorem Ipsum</p>
<blockquote><p>Laurel Ipsucandescent</p></blockquote>
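For comparison with the regex answer, here is a minimal sketch of the DOMDocument/XPath route suggested earlier. It assumes the fragment is plain ASCII (or has its encoding declared) and wraps it in a throwaway <div> so the parser sees a single root. The function name addClassBeforeHr() is made up for this example, and the serializer will write <hr /> back as <hr>.

```php
<?php
// Sketch: add a class to every <p> inside a <blockquote> that precedes the first <hr>.
// addClassBeforeHr() is a hypothetical helper, not part of any library.
function addClassBeforeHr(string $html, string $class): string
{
    $dom = new DOMDocument();
    // Suppress parser warnings for loose markup.
    libxml_use_internal_errors(true);
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Keep only blockquotes that have an <hr> after them and none before them.
    $query = '//blockquote[following::hr and not(preceding::hr)]//p';
    foreach ($xpath->query($query) as $p) {
        $p->setAttribute('class', $class);
    }

    // Serialize the wrapper's children only, dropping the wrapper itself.
    $out = '';
    foreach ($dom->documentElement->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

Calling addClassBeforeHr($theHTML, 'whatever') should tag every <p> in the first two blockquotes of the sample and leave the one after the <hr /> alone. Unlike the regex, it also handles blockquotes that contain several paragraphs.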

Related

How to create id attributes for <h2> and <h3> tags based on a portion of their respective innerHTML?

I have a customer who uses TinyMCE TOC, but who does not appreciate the random IDs that the plugin adds to the heading tags (<h2> and <h3>).
I want to create a script that parses an article, targets every h2 and h3 tag, then creates id attributes from the text they contain.
I thought I could do this with preg_replace_callback(), but when I use the function, I realize that there are certain situations where it does not work.
For example, it doesn't work if the h2/h3's text starts with a space, a number, etc.
Here is an early attempt that worked in some cases:
function function_to_makeItClear($string) {
$string = strtolower($string);
$string = str_ireplace(' ', '-', $string);
return preg_replace('/[^A-Za-z0-9\-]/', '', $string);
}
function betterId($match){
$escape = str_split(strip_tags($match[2]), 20);
$id = strlen($escape[0]) >=5 ? function_to_makeItClear($escape[0]) : str_shuffle('AnyWordsHere');
return '<h'.$match[1].' id="'.$id.'">'.$match[2].'</h'.$match[1].'>';
}
return preg_replace_callback('#<h([1-6]).*?>(.*?)<\/h[1-6]>#si', 'betterId', $texte);
Here is some sample text I want to transform:
<p>Paragraph one is okay </p>
<h2>This will work without problem</h2>
<p>Paragraph two is okay </p>
<h2>This heading has anchor</h2>
<p>Paragraph one is okay </p>
<h2> This heading start with space</h2>
<p>Paragraph two is okay </p>
<h3>1. This wont work</h3>
<p>Paragraph one is okay </p>
<h3>2. Not working</h3>
<p>Paragraph two is okay </p>
<h3>3. Neither this one</h3>
<h3>But this works again</h3>
I would like to have this result:
<p>Paragraph one is okay </p>
<h2 id="this-will-work">This will work without problem</h2>
<p>Paragraph two is okay </p>
<h2 id="this-heading-has">This heading has anchor</h2>
<p>Paragraph one is okay </p>
<h2 id="this-heading-start"> This heading start with space</h2>
<p>Paragraph two is okay </p>
<h3 id="this-wont-work">1. This wont work</h3>
<p>Paragraph one is okay </p>
<h3 id="not-working">2. Not working</h3>
<p>Paragraph two is okay </p>
<h3 id="neither-this-one">3. Neither this one</h3>
<h3 id="but-this-works">But this works again</h3>
UPDATE:
I have since implemented a different approach using a DOM parser with great results, but there are still some scenarios where it fails and I have to manually add ids myself.
Use DOMDocument and its good friend XPath to reliably extract the heading tags from your valid html.
Use the nodeValue property to get a tag-free string from the heading tag's contents. (Demonstration of what nodeValue generates)
Use preg_match() to skip any leading whitespace and numbering, then match the first one, two, or three words. (A slightly altered demonstration of the pattern)
If there is a match containing at least one word, lowercase it, replace spaces with hyphens, and add that string as the id attribute.
Code: (Demo)
$html = <<<HTML
<div>
<p>Paragraph one is okay </p>
<h2>This will work without problem</h2>
<p>Paragraph two is okay </p>
<h2>This heading has anchor</h2>
<p>Paragraph one is okay </p>
<h2> This heading start with space</h2>
<p>Paragraph two is okay </p>
<h3>1. This wont work</h3>
<p>Paragraph one is okay </p>
<h3>2. Not working</h3>
<p>Paragraph two is okay </p>
<h3>3. Neither this one</h3>
<h3>But this works again</h3>
</div>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//h2 | //h3") as $node) {
if (preg_match('~^\s*(?:\d+\.)?\s*\K\S+(?:\s+\S+){0,2}~', $node->nodeValue, $m)) {
$node->setAttribute('id', str_replace(' ', '-', strtolower($m[0])));
}
}
echo $dom->saveHTML();
Output:
<div>
<p>Paragraph one is okay </p>
<h2 id="this-will-work">This will work without problem</h2>
<p>Paragraph two is okay </p>
<h2 id="this-heading-has">This heading has anchor</h2>
<p>Paragraph one is okay </p>
<h2 id="this-heading-start"> This heading start with space</h2>
<p>Paragraph two is okay </p>
<h3 id="this-wont-work">1. This wont work</h3>
<p>Paragraph one is okay </p>
<h3 id="not-working">2. Not working</h3>
<p>Paragraph two is okay </p>
<h3 id="neither-this-one">3. Neither this one</h3>
<h3 id="but-this-works">But this works again</h3>
</div>

xpath not returning text if p tag is followed by any other tag

I want to get all the text inside the <p> and <h3> tags of the following HTML:
<div class="bodyText">
<p>
<div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">
<div class="one">
<img src="url" alt="bar" class="img" width="80" height="60" />
</div>
<div class="two">
<h4 class="preTitle">QIEZ-Lieblinge</h4>
<h3 class="title"><a href="url" title="ABC" onclick="cmsTracking.trackClickOut({element:this, channel : 32333770, channelname : 'top_listen', content : 14832081, callTemplate : '_htmltagging.Text', action : 'click', mouseevent : event});">
Prominente Gastronomen </a></h3>
<span class="postTitle"></span>
<span class="district">Berlin</span> </div>
<div class="clear"></div>
</div>
I want this TEXT</p>
<h3>I want this TEXT</h3>
<p>I want this TEXT</p>
<p>
<div class="inlineImage alignLeft">
<div class="medium">
<img src="http://images03.qiez.de/Restaurant+%C3%96_QIEZ.jpg/280x210/0/167.231.886/167.231.798" width="280" height="210" alt="Schöne Lage: das Restaurant Ø. (c)QIEZ"/>
<span class="caption">
Schöne Lage: das Restaurant Ø. (c)QIEZ </span>
</div>
</div>I want this TEXT</p>
<p>I want this TEXT</p>
<p>I want this TEXT<br /> </p>
<blockquote><img src="url" alt="" width="68" height="68" />
"Eigentlich nur drei Worte: Ich komme wieder."<span class="author">Tina Gerstung</span></blockquote>
<div class="clear"></div>
</div>
I want every "I want this TEXT" string. I used this XPath query:
//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']
but it does not give me the text if the <p> tag is followed by any other tag.
It looks like you have div elements contained within your p element, which is not valid HTML and is confusing the parser. If you use var_dump in the loop you can see that the query does actually pick up the node, but its nodeValue is empty.
A quick and dirty fix to your html would be to wrap the first div that is contained in the p element in a span.
<span><div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">...</div></span>
A better fix would be to put the div element outside the paragraph.
If you use the dirty workaround you will need to change your query like so:
$xpath->query("//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']/text()");
If you do not have control of the source html, you can make a copy of the html and remove the offending divs:
$nodes = $xpath->query("//div[contains(@class,'articleBox')]");
$node = $nodes->item(0);
$node->parentNode->removeChild($node);
It might be easier to work with simple_html_dom. Maybe you can try this:
include('simple_html_dom.php');
$dom = new simple_html_dom();
$dom->load($html);
foreach($dom->find("div[class=bodyText]") as $parent) {
foreach($parent->children() as $child) {
if ($child->tag == 'p' || $child->tag == 'h3') {
// blank out the inner text of any divs nested inside this p/h3 element
foreach($child->find('div') as $e)
$e->innertext = '';
echo $child->plaintext . '<br>';
}
}
}
This is mixed content. Depending on what defines the position of the element, you can use a number of factors. In this case, simply selecting all the text nodes will probably be sufficient:
//div[contains(@class, 'bodyText')]/(p | h3)/text()
If the union operator within a location path is not allowed in your processor, then you can use your syntax as before or, a little bit simpler in my opinion:
//div[contains(@class, 'bodyText')]/*[local-name() = ('p', 'h3')]/text()
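Both expressions above rely on XPath 2.0 syntax, while PHP's DOMXPath only implements XPath 1.0. A minimal sketch of an XPath 1.0 equivalent follows. The helper name directTexts() is invented for illustration, and the sketch assumes well-formed nesting: with divs inside a p, the HTML parser may move the trailing text out of the paragraph, so the invalid markup still needs fixing first.

```php
<?php
// Sketch: collect the direct text of <p> and <h3> children of div.bodyText (XPath 1.0 only).
// directTexts() is a hypothetical helper name.
function directTexts(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // silence warnings about loose markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // self::p or self::h3 replaces the XPath 2.0 "(p | h3)" step;
    // [normalize-space()] drops whitespace-only text nodes.
    $query = "//div[contains(@class,'bodyText')]/*[self::p or self::h3]/text()[normalize-space()]";

    $texts = [];
    foreach ($xpath->query($query) as $textNode) {
        $texts[] = trim($textNode->nodeValue);
    }
    return $texts;
}
```

Text inside nested elements such as a <span> caption is skipped because only text nodes that are direct children of the p or h3 are selected.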

PHP Regex - Remove text from HTML Tags

How can I remove all the text between tags?
Input
<div>
<p>testing</p>
<div>my world</div>
</div>
Output
<div>
<p></p>
<div></div>
</div>
You can use either DOMDocument or PHP Simple HTML DOM Parser.
The following example uses the latter, although you may want to use what suits you best.
include("simple_html_dom.php");
$str = '
<div>
<p>testing</p>
<div>my world</div>
</div>
';
$html = str_get_html($str);
foreach($html->find("text") as $ht) {
$ht->innertext = "";
}
$html->save();
echo $html;
You could use two capture groups and keep only those in the replacement, which drops the characters between them:
(\<.+\>).*(\<\/.+\>)
working example: http://ideone.com/Oq14El
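Since DOMDocument was mentioned as an alternative but not shown, here is a minimal sketch of that route. It assumes the input has a single root element, as in the question's sample, and the function name stripTextNodes() is made up for this example.

```php
<?php
// Sketch: remove every non-blank text node while keeping the element structure.
// stripTextNodes() is a hypothetical helper name.
function stripTextNodes(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // silence warnings about loose markup
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Copy the result into an array before modifying the tree.
    $textNodes = iterator_to_array($xpath->query('//text()[normalize-space()]'));
    foreach ($textNodes as $text) {
        $text->parentNode->removeChild($text);
    }
    return $dom->saveHTML();
}
```

Whitespace-only text nodes (the indentation between tags) are left in place, so the layout of the markup survives.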

Replace HTML entities between divs and only between divs

Given the following string:
asd &nbsp; <div> def &nbsp; foo &nbsp; </div> ghi <div> moo &nbsp; </div>
I want to remove all of the &nbsp;'s that are within <div>s, resulting in:
asd &nbsp; <div> def foo </div> ghi <div> moo </div>
I can use any standard PHP stuff, but I'm not sure how to approach the problem. I couldn't figure out how to keep the contents inside the <div>s while removing the &nbsp;'s.
The reason I need this is that WordPress's content filter adds &nbsp;'s under strange circumstances. I can't simply remove all &nbsp;'s because they might have been entered deliberately by the user, but I need to remove all of them within the element that's having display problems caused by them.
The following works in your case:
$str = "asd &nbsp; <div> def &nbsp; </div> ghi <div> moo &nbsp; </div>";
$res = preg_replace("%<div>(.*?)&nbsp;(.*?)</div>%", "<div>$1$2</div>", $str);
But beware of some caveats:
It won't work if the divs have attributes;
It won't work as expected if the divs are nested;
It replaces only one &nbsp; per div, so any further &nbsp;s inside the same div are left untouched.
So the replacement above is not a good solution at all. It's far better to first find the div elements with an (XML) parser function and then replace all the &nbsp;s inside them.
$text = "asd &nbsp; <div> def &nbsp; </div> ghi <div> moo &nbsp; </div>";
echo preg_replace_callback(
"#<div(.*?)>(.*?&nbsp;.*?)</div>#i",
"filter_nbsp",
$text);
function filter_nbsp($matches)
{
return "<div".$matches[1].">" . str_replace("&nbsp;","",$matches[2]) . "</div>";
}
That should work for entities inside div elements closed with </div>.
output
asd &nbsp; <div> def  </div> ghi <div> moo  </div>
simple_html_dom
$html = str_get_html('asd &nbsp; <div> def &nbsp; </div> ghi <div> moo &nbsp; </div>');
foreach($html->find('div') as $element) {
$a = $element->plaintext;
$element->innertext = preg_replace('{&nbsp;}','',$a);
}
}
echo $html;
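The parser-based approach recommended above can also be sketched with the built-in DOMDocument. This is only an illustration under a few assumptions: the fragment is wrapped in a temporary <div> so that it has one root, the function name stripNbspInDivs() is invented, and the &nbsp;s that are kept may come back either as the entity or as a raw U+00A0 character depending on how the PHP/libxml build serializes them.

```php
<?php
// Sketch: drop non-breaking spaces inside <div> elements only.
// stripNbspInDivs() is a hypothetical helper name.
function stripNbspInDivs(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // silence warnings about loose markup
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();

    $root = $dom->documentElement; // the temporary wrapper
    $xpath = new DOMXPath($dom);
    // ".//div" starts below the wrapper, so top-level text is never touched.
    foreach ($xpath->query('.//div//text()', $root) as $text) {
        // The parser has already decoded &nbsp; into the U+00A0 character.
        $text->nodeValue = str_replace("\u{00A0}", '', $text->nodeValue);
    }

    // Serialize the wrapper's children only, dropping the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

Because the replacement runs on text nodes rather than on markup, divs with attributes and nested divs are handled the same way as plain ones.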

REGEX for HTML in php

I have an html file...
this file has the following structure:
<body>
<p class="Title-P">Compiler</p>
<p class="Heading1-P">kdnkls:</p>
<p class="Normal-P">dsf</p>
<p class="ListParagraph-P">kjsksf</p>
<p class="ListParagraph-P">dsfsf</p>
<p class="ListParagraph-P">sfsfsf</p>
<p class="Heading2-P">fsfs:</p>
</body>
What is a suitable regex to replace the tags:
<p class="Title-P">foo</p> with <h1>foo</h1>
<p class="Heading1-P">kdnkls:</p> with <h2> kdnkls: </h2>
<p class="Normal-P">foo</p> with <p> foo </p>
etc...
I'm using the preg_replace function in PHP, which takes a pattern and a replacement as arguments...
Try:
$html = preg_replace('/<p class="Title-P">(.*?)<\/p>/i', "<h1>$1</h1>", $html);
$html = preg_replace('/<p class="Normal-P">(.*?)<\/p>/i', "<p>$1</p>", $html);
That should work, but a better bet is to parse the document using DOM, make your changes, and then save the document back out.
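The DOM route mentioned in the last sentence might look roughly like the sketch below. The function name retagByClass() and the shape of the class-to-tag map are assumptions for illustration. Only the three mappings given in the question are used in the usage note, and the remaining classes would need their own entries.

```php
<?php
// Sketch: replace <p class="X"> with a different element according to a class => tag map.
// retagByClass() is a hypothetical helper name.
function retagByClass(string $html, array $map): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // silence warnings about loose markup
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Copy the result into an array before replacing nodes in the tree.
    foreach (iterator_to_array($xpath->query('//p[@class]')) as $p) {
        $class = $p->getAttribute('class');
        if (!isset($map[$class])) {
            continue; // unknown class: leave the paragraph as it is
        }
        // Elements cannot be renamed in place, so build a new one and move the children over.
        $new = $dom->createElement($map[$class]);
        while ($p->firstChild) {
            $new->appendChild($p->firstChild);
        }
        $p->parentNode->replaceChild($new, $p);
    }
    return $dom->saveHTML();
}
```

For example, retagByClass($html, ['Title-P' => 'h1', 'Heading1-P' => 'h2', 'Normal-P' => 'p']) turns the title paragraph into an <h1> and drops the class attribute from normal paragraphs. Note that saveHTML() returns a full document here, including the doctype and <html> wrapper.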
