Extracting parts of an html code

Let's say I had the below HTML code:
<p>Test text</p>
<p><img src="test.jpg" /></p>
<div id="test"><p>test</p></div>
<div class="block">
<img src="test2.jpg">
</div>
<p>test</p>
Parameters:
There will exist a div block with class "block"
There can be any amount of HTML code above or below the div block with class "block"
There could even be two div blocks with class "block"
I am using PHP's DOM extension with XPath to inspect this HTML. I want to be able to return two things:
The div block with class "block"
All the rest of the code without the div element with class "block" in it
Something like:
Block Code:
<div class="block">
<img src="test2.jpg">
</div>
Original without block code:
<p>Test text</p>
<p><img src="test.jpg" /></p>
<div id="test"><p>test</p></div>
<p>test</p>

Using DOMDocument, you can do it like this:
$content = '<p>Test text</p>'.
'<p><img src="test.jpg" /></p>'.
'<div id="test"><p>test</p></div>'.
'<div class="block">'.
'<img src="test2.jpg">'.
'</div>'.
'<p>test</p>';
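$blocks = array();
$doc = new DOMDocument();
$doc->loadHTML($content);
// getElementsByTagName() returns a live list, so copy it to an array
// before removing nodes; otherwise elements are skipped during iteration
$elements = iterator_to_array($doc->getElementsByTagName("*"));
foreach ($elements as $element) {
    // getAttribute() returns an empty string when the attribute is missing
    if ($element->getAttribute('class') == 'block') {
        // add block HTML to block array
        $blocks[] = $doc->saveHTML($element);
        // remove block element
        $element->parentNode->removeChild($element);
    }
}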
echo '<pre>';
echo $blocks[0]; //iterate or print_r if multiple blocks
echo $doc->saveHTML();
echo '</pre>';
This outputs the "block code":
<div class="block"><img src="test2.jpg"></div>
and the "original without block code" (shown here without the doctype, html and body wrapper that saveHTML() adds):
<p>Test text</p><p><img src="test.jpg"></p><div id="test"><p>test</p></div><p>test</p>
If you can't accept that DOMDocument "enriches" the HTML with doctype, html and body, which is annoying when you want only the extracted markup rather than a complete document, you can use a small helper function, here called DOMinnerHTML (it is not built into PHP), and extract the body's inner HTML with:
echo DOMinnerHTML($doc->getElementsByTagName('body')->item(0));
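Since DOMinnerHTML is not defined anywhere above, here is a minimal sketch of what such a helper could look like. The function body is an assumption, not the original linked function, and the example also collects the blocks with XPath, as the question asked. Exact whitespace in the serialized output can vary between libxml2 versions.

```php
<?php
// Hypothetical helper: the name DOMinnerHTML comes from the answer above,
// but this body is an assumed implementation, not the linked original.
function DOMinnerHTML(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($child) serializes a single node, without doctype/html/body
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<p>Test text</p><div class="block"><img src="test2.jpg"></div><p>test</p>');

// An XPath query result is a static list, so removing nodes
// while looping over it does not skip any elements.
$xpath = new DOMXPath($doc);
$blocks = [];
foreach ($xpath->query('//div[@class="block"]') as $block) {
    $blocks[] = $doc->saveHTML($block);
    $block->parentNode->removeChild($block);
}

// Everything that is left, without the wrapper DOMDocument added
$rest = DOMinnerHTML($doc->getElementsByTagName('body')->item(0));
echo $blocks[0], "\n", $rest, "\n";
```

The XPath query here matches class="block" exactly; an element with several classes would need a contains()-style test instead.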

Related

Simple way to replace elements with certain attributes in php

How can I get an element that has a certain attribute?
Afterwards I want to replace this element, including its tags, within the HTML document using PHP.
This is what I tried:
$html = '<note>
<span data="getThisElement">New Text</span>
<div data="yes">More Text</div>
</note>';
echo $newTxt = str_replace('<? data="getThisElement"', "<div>New Div</div>", $html);
The output should be:
<note>
<div>New Div</div>
<div data="yes">More Text</div>
</note>

You can use preg_replace for this. Note that the pattern below replaces every <span> element, regardless of its attributes:
$html = '<note>
<span data="getThisElement">New Text</span>
<div data="yes">More Text</div>
</note>';
$replace = "<div>New Div</div>";
$output = preg_replace("/<span[^>]*>.*?<\/span>/is",$replace,$html);
echo $output;
or, if you want to replace only one specific element (the search string must match the source exactly, so no extra spaces inside the quotes):
$element = "getThisElement";
$output = str_replace('<span data="' . $element . '">New Text</span>', $replace, $html);

I would suggest a more advanced technique, such as Simple HTML DOM Parser. Here is your example:
<?php
include "simplehtmldom_1_9_1/simple_html_dom.php";
$html = '<note>
<span data="getThisElement">New Text</span>
<div data="yes">More Text</div>
</note>';
// Create DOM from string
$dom = str_get_html($html);
$dom->find('span[data=getThisElement]', 0)->outertext = '<div>New Div</div>';
echo $dom;
// <note>
// <div>New Div</div>
// <div data="yes">More Text</div>
// </note>
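The same replacement can also be sketched with PHP's built-in DOMDocument and DOMXPath, which avoids the third-party include. This is only an illustration under the assumption that the replacement is a plain <div> with text; the variable names are made up for the example.

```php
<?php
$html = '<note>
<span data="getThisElement">New Text</span>
<div data="yes">More Text</div>
</note>';

// <note> is not a known HTML tag, so keep parser warnings quiet
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Find every element whose data attribute has the wanted value
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//*[@data="getThisElement"]') as $old) {
    $new = $doc->createElement('div', 'New Div');
    // replaceChild() swaps the whole element, tags included
    $old->parentNode->replaceChild($new, $old);
}

// Serialize only the <note> element, not the added html/body wrapper
$note = $doc->getElementsByTagName('note')->item(0);
$output = $doc->saveHTML($note);
echo $output;
```

Unlike the regular-expression approach, this selects by attribute value rather than by tag name, so other <span> elements are left alone.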

You can have PHP echo some JavaScript along with the HTML and let the browser do the replacement, since the browser already parses HTML and PHP needs no extra functionality. The visible result is the same, although the replacement then happens on the client rather than in PHP. For more generality, you could wrap the JS in a function so that PHP can echo it wherever it is needed.
<?php
$html = '<note>
<span data="getThisElement">New Text</span>
<div data="yes">More Text</div>
</note>';
$repAttr = 'data="getThisElement"';// The attribute of those elements we want to replace
$repWith = '<div>New Div</div>'; // what we want to replace those elements with
?>
<div id="temp">
<script>
let el = document.createElement( 'div' );
let replaceWith = "<div>New Div</div>";
el.innerHTML = `<?php echo $html; ?>`; //note the use of backticks so the string can span many lines
let elsToReplace = el.querySelectorAll('[<?php echo $repAttr; ?>]'); // gets all the elements within $html that have the given attribute
elsToReplace.forEach(function (repEl) {
repEl.outerHTML = '<?php echo $repWith; ?>';
});
document.getElementById("temp").outerHTML = el.innerHTML; //this will overwrite all this setting-up JS so the DOM will have the content required and nothing else
</script>
</div>

preg_replace regex to remove stray end tag

I have a string containing different types of html tags and stuff, including some <img> elements. I am trying to wrap those <img> elements inside a <figure> tag. So far so good using a preg_replace like this:
preg_replace( '/(<img.*?>)/s','<figure>$1</figure>',$content);
However, if the <img> tag has a neighboring <figcaption> tag, the result is rather ugly and produces a stray end tag for the figure element:
<figure id="attachment_9615">
<img class="size-full" src="http://www.example.com/pic.png" alt="name" width="1699" height="354" />
<figcaption class="caption-text"></figure>Caption title here</figcaption>
</figure>
I've tried a whole bunch of preg_replace regex variations to wrap both the img-tag and figcaption-tag inside figure, but can't seem to make it work.
My latest try:
preg_replace( '/(<img.*?>)(<figcaption .*>*.<\/figcaption>)?/s',
'<figure">$1$2</figure>',
$content);

As others pointed out, it is better to use a parser such as DOMDocument instead of a regex. The following code wraps a <figure> tag around each img whose next sibling element is a <figcaption>:
<?php
$html = <<<EOF
<html>
<img class="size-full" src="http://www.example.com/pic.png" alt="name" width="1699" height="354" />
<figcaption class="caption-text">Caption title here</figcaption>
<img class="size-full" src="http://www.example.com/pic.png" alt="name" width="1699" height="354" />
<img class="size-full" src="http://www.example.com/pic.png" alt="name" width="1699" height="354" />
<figcaption class="caption-text">Caption title here</figcaption>
</html>
EOF;
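# some libxml2 versions raise warnings for HTML5 tags such as figcaption
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);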
$xpath = new DOMXPath($dom);
# get all images
$imgs = $xpath->query("//img");
foreach ($imgs as $img) {
    # next element sibling if it is a <figcaption>; unlike nextSibling,
    # this skips whitespace text nodes and is null when nothing matches
    $caption = $xpath->query('following-sibling::*[1][self::figcaption]', $img)->item(0);
    if ($caption !== null) {
        # create a new figure tag and append the cloned elements
        $figure = $dom->createElement('figure');
        $figure->appendChild($img->cloneNode(true));
        $figure->appendChild($caption->cloneNode(true));
        # insert the newly generated elements right before $img
        $img->parentNode->insertBefore($figure, $img);
        # and remove both the figcaption and the image from the DOM
        $caption->parentNode->removeChild($caption);
        $img->parentNode->removeChild($img);
    }
}
$dom->formatOutput=true;
echo $dom->saveHTML();
See a demo on ideone.com.
To have a <figure> tag around all your images, you might want to add an else branch:
} else {
$figure = $dom->createElement('figure');
$figure->appendChild($img->cloneNode(true));
$img->parentNode->insertBefore($figure, $img);
$img->parentNode->removeChild($img);
}

xpath not returning text if p tag is followed by any other tag

I want to get all the text inside the <p> and <h3> tags of the following HTML:
<div class="bodyText">
<p>
<div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">
<div class="one">
<img src="url" alt="bar" class="img" width="80" height="60" />
</div>
<div class="two">
<h4 class="preTitle">QIEZ-Lieblinge</h4>
<h3 class="title"><a href="url" title="ABC" onclick="cmsTracking.trackClickOut({element:this, channel : 32333770, channelname : 'top_listen', content : 14832081, callTemplate : '_htmltagging.Text', action : 'click', mouseevent : event});">
Prominente Gastronomen </a></h3>
<span class="postTitle"></span>
<span class="district">Berlin</span> </div>
<div class="clear"></div>
</div>
I want this TEXT</p>
<h3>I want this TEXT</h3>
<p>I want this TEXT</p>
<p>
<div class="inlineImage alignLeft">
<div class="medium">
<img src="http://images03.qiez.de/Restaurant+%C3%96_QIEZ.jpg/280x210/0/167.231.886/167.231.798" width="280" height="210" alt="Schöne Lage: das Restaurant Ø. (c)QIEZ"/>
<span class="caption">
Schöne Lage: das Restaurant Ø. (c)QIEZ </span>
</div>
</div>I want this TEXT</p>
<p>I want this TEXT</p>
<p>I want this TEXT<br /> </p>
<blockquote><img src="url" alt="" width="68" height="68" />
"Eigentlich nur drei Worte: Ich komme wieder."<span class="author">Tina Gerstung</span></blockquote>
<div class="clear"></div>
</div>
I want every "I want this TEXT". I used this XPath query:
//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']
but it does not give me the text if the <p> tag is followed by any other tag.

It looks like you have div elements contained within your p element, which is not valid HTML and is what is messing things up. If you use var_dump in the loop, you can see that the query does pick up the node but its nodeValue is empty.
A quick and dirty fix to your HTML would be to wrap the first div that is contained in the p element in a span:
<span><div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">...</div></span>
A better fix would be to put the div element outside the paragraph.
If you use the dirty workaround, you will need to change your query like so:
$xpath->query("//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']/text()");
If you do not have control of the source HTML, you can make a copy of it and remove the offending divs:
$nodes = $xpath->query("//div[contains(@class,'articleBox')]");
$node = $nodes->item(0);
$node->parentNode->removeChild($node);
It might be easier to work with simple_html_dom. Maybe you can try this:
include('simple_html_dom.php');
$dom = new simple_html_dom();
$dom->load($html);
foreach($dom->find("div[class=bodyText]") as $parent) {
foreach($parent->children() as $child) {
if ($child->tag == 'p' || $child->tag == 'h3') {
// remove the inner text of divs contained within a p element
foreach($child->find('div') as $e)
$e->innertext = '';
echo $child->plaintext . '<br>';
}
}
}

This is mixed content. Depending on what defines the position of the element, you can use a number of factors. In this case, simply selecting all the text nodes will probably be sufficient:
//div[contains(@class, 'bodyText')]/(p | h3)/text()
If the union operator within a path location is not allowed in your processor (it requires XPath 2.0), then you can use your syntax as before or, a little simpler in my opinion (the sequence comparison also needs XPath 2.0):
//div[contains(@class, 'bodyText')]/*[local-name() = ('p', 'h3')]/text()
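PHP's DOMXPath implements XPath 1.0 only, so neither of the two expressions above runs there as written. A minimal sketch of an XPath 1.0 equivalent follows; the markup is a simplified, well-formed stand-in for the question's HTML (an assumption), and the variable names are illustrative.

```php
<?php
// Simplified, well-formed stand-in for the markup in the question
$html = '<div class="bodyText">
<p>first</p>
<h3>second</h3>
<p><span class="caption">skip me</span>third</p>
<blockquote>not wanted</blockquote>
</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// XPath 1.0: "self::p or self::h3" stands in for the 2.0-style (p | h3) step
$query = "//div[contains(@class, 'bodyText')]/*[self::p or self::h3]/text()";

$texts = [];
foreach ($xpath->query($query) as $textNode) {
    // text() returns direct text children only; drop whitespace-only nodes
    $value = trim($textNode->nodeValue);
    if ($value !== '') {
        $texts[] = $value;
    }
}
print_r($texts);
```

Because text() selects only direct text children, the caption inside the nested span is not returned, which matches the "mixed content" reading above.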

PHP Regex - Remove text from HTML Tags

How can I remove all the text between tags while keeping the tags themselves?
Input
<div>
<p>testing</p>
<div>my world</div>
</div>
Output
<div>
<p></p>
<div></div>
</div>

You can use either DOMDocument or PHP Simple HTML DOM Parser.
The following example uses the latter, although you may want to use what suits you best.
include("simple_html_dom.php");
$str = '
<div>
<p>testing</p>
<div>my world</div>
</div>
';
$html = str_get_html($str);
foreach($html->find("text") as $ht) {
$ht->innertext = "";
}
$html->save();
echo $html;
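For comparison, here is a rough sketch of the DOMDocument route mentioned above, assuming the same input string. It removes every text node that contains non-whitespace characters and leaves the elements in place; how the remaining whitespace is serialized may differ between libxml2 versions.

```php
<?php
$str = '<div>
<p>testing</p>
<div>my world</div>
</div>';

$doc = new DOMDocument();
$doc->loadHTML($str);
$xpath = new DOMXPath($doc);

// Select text nodes that hold more than whitespace, so the
// line breaks between the tags are not touched
foreach ($xpath->query('//body//text()[normalize-space()]') as $text) {
    $text->parentNode->removeChild($text);
}

// The outer <div> is the first child of the implied <body>
$outer = $doc->getElementsByTagName('body')->item(0)->firstChild;
$result = $doc->saveHTML($outer);
echo $result;
```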

You could use two capture groups and keep only those in the replacement, which drops the characters between them:
(\<.+\>).*(\<\/.+\>)
working example: http://ideone.com/Oq14El

Retrieve a text node with Simple HTML DOM Parser

I'm quite new to Simple HTML DOM Parser. I want to get a child element from the following HTML:
<div class="article">
<div style="text-align:justify">
<img src="image.jpg" title="image">
<br>
<br>
"Text to grab"
<div>......</div>
<br></br>
................
................
</div>
</div>
I'm trying to get the text "Text to grab".
So far I've tried the following query:
$html->find('div[class=article] div')->children(3);
But it's not working. Any idea how to solve this?

You don't need simple_html_dom here. It can be done with DOMDocument and DOMXPath. Both are part of the PHP core.
Example:
// your sample data
$html = <<<EOF
<div class="article">
<div style="text-align:justify">
<img src="image.jpg" title="image">
<br>
<br>
"Text to grab"
<div>......</div>
<br></br>
................
................
</div>
</div>
EOF;
// create a document from the above snippet
// if you are loading from a remote url use:
// $doc->load($url);
$doc = new DOMDocument();
$doc->loadHTML($html);
// initialize a XPath selector
$selector = new DOMXPath($doc);
// get the text node (text in XML/HTML is also represented as nodes)
$query = '//div[@class="article"]/div/br[2]/following-sibling::text()[1]';
$textToGrab = $selector->query($query)->item(0);
// remove newlines on start and end using trim() and output the text
echo trim($textToGrab->nodeValue);
Output:
"Text to grab"

If it's always in the same place you can do:
$html->find('.article text', 4);
