Using Simple HTML DOM to get specific plain text - php

Problem:
Trying to extract specific text from HTML code that is available to me through PHP.
HTML code:
<a href="/debatt/s-vill-ha-tioarig-skolplikt-och-farre-elever-i-klassen">
<span class="number">2. </span>Skolplikt och färre elever i klassen
<br />
<span class="metadata">I går</span>
</a>
<a href="/sthlm/edholm-backar-om-skolornas-smorforbud">
<span class="number">3. </span>Edholm backar om skolornas smörförbud
<br />
<span class="metadata">16 okt</span>
</a>
Desired output:
2. Skolplikt och färre elever i klassen
3. Edholm backar om skolornas smörförbud
Both code examples have the same HTML structure. Is it possible through Simple HTML DOM to do this or should regular expressions be pursued?

Load the HTML into a DOMDocument. Its DOMElement nodes let you select children and extract their HTML/text into variables.
Docs: http://php.net/manual/en/class.domelement.php
Same answer as https://stackoverflow.com/a/12950525/711129
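A minimal sketch of that DOM approach, assuming the markup from the question is available as a string. The variable names and the `<?xml encoding="UTF-8">` prefix (a common workaround so that loadHTML() reads the input as UTF-8) are assumptions, not part of the original answer:

```php
<?php
// Sample markup copied from the question.
$html = <<<HTML
<a href="/debatt/s-vill-ha-tioarig-skolplikt-och-farre-elever-i-klassen">
<span class="number">2. </span>Skolplikt och färre elever i klassen
<br />
<span class="metadata">I går</span>
</a>
<a href="/sthlm/edholm-backar-om-skolornas-smorforbud">
<span class="number">3. </span>Edholm backar om skolornas smörförbud
<br />
<span class="metadata">16 okt</span>
</a>
HTML;

libxml_use_internal_errors(true); // silence warnings about the HTML fragment
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

$xpath = new DOMXPath($doc);
$titles = [];
foreach ($xpath->query('//a[span[@class="number"]]') as $link) {
    // The number lives in its own span.
    $number = $xpath->evaluate('string(span[@class="number"])', $link);
    // Only direct text-node children of <a>; this skips the metadata span.
    $text = '';
    foreach ($xpath->query('text()', $link) as $textNode) {
        $text .= $textNode->nodeValue;
    }
    $titles[] = trim($number) . ' ' . trim($text);
}

echo implode("\n", $titles), "\n";
```

Reading only the direct text nodes of each link is what keeps the date in span.metadata out of the result, so no regular expression is needed.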

If you have to frequently do this, you can use a very handy and easy class for parsing html dom.
http://simplehtmldom.sourceforge.net/

Related

xpath not returning text if p tag is followed by any other tag

I want to get all the text between the <p> and <h3> tags in the following HTML:
<div class="bodyText">
<p>
<div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">
<div class="one">
<img src="url" alt="bar" class="img" width="80" height="60" />
</div>
<div class="two">
<h4 class="preTitle">QIEZ-Lieblinge</h4>
<h3 class="title"><a href="url" title="ABC" onclick="cmsTracking.trackClickOut({element:this, channel : 32333770, channelname : 'top_listen', content : 14832081, callTemplate : '_htmltagging.Text', action : 'click', mouseevent : event});">
Prominente Gastronomen </a></h3>
<span class="postTitle"></span>
<span class="district">Berlin</span> </div>
<div class="clear"></div>
</div>
I want this TEXT</p>
<h3>I want this TEXT</h3>
<p>I want this TEXT</p>
<p>
<div class="inlineImage alignLeft">
<div class="medium">
<img src="http://images03.qiez.de/Restaurant+%C3%96_QIEZ.jpg/280x210/0/167.231.886/167.231.798" width="280" height="210" alt="Schöne Lage: das Restaurant Ø. (c)QIEZ"/>
<span class="caption">
Schöne Lage: das Restaurant Ø. (c)QIEZ </span>
</div>
</div>I want this TEXT</p>
<p>I want this TEXT</p>
<p>I want this TEXT<br /> </p>
<blockquote><img src="url" alt="" width="68" height="68" />
"Eigentlich nur drei Worte: Ich komme wieder."<span class="author">Tina Gerstung</span></blockquote>
<div class="clear"></div>
</div>
I want all of the "I want this TEXT" strings. I used this XPath query:
//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']
but it does not return the text when a <p> tag is followed by another tag.
It looks like you have div elements contained within your p element which is not valid and messing up things. If you use a var_dump in the loop you can see that it does actually pick up the node but the nodeValue is empty.
A quick and dirty fix to your html would be to wrap the first div that is contained in the p element in a span.
<span><div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">...</div></span>
A better fix would be to put the div element outside the paragraph.
If you use the dirty workaround you will need to change your query like so:
$xpath->query("//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']/text()");
If you do not have control of the source HTML, you can make a copy of it and remove the offending divs:
$nodes = $xpath->query("//div[contains(@class,'articleBox')]");
$node = $nodes->item(0);
$node->parentNode->removeChild($node);
It might be easier to work with simple_html_dom. Maybe you can try this:
include('simple_html_dom.php');
$dom = new simple_html_dom();
$dom->load($html);
foreach($dom->find("div[class=bodyText]") as $parent) {
foreach($parent->children() as $child) {
if ($child->tag == 'p' || $child->tag == 'h3') {
// remove the inner text of divs contained within a p element
foreach($dom->find('div') as $e)
$e->innertext = '';
echo $child->plaintext . '<br>';
}
}
}
This is mixed content. Depending on what defines the position of the element, you can use a number of factors. In this case, simply selecting all the text nodes will probably be sufficient:
//div[contains(@class, 'bodyText')]/(p | h3)/text()
If the union operator within a path location is not allowed in your processor, then you can use your syntax as before or, a little simpler in my opinion:
//div[contains(@class, 'bodyText')]/*[local-name() = ('p', 'h3')]/text()
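Both the parenthesised union inside a path and the comparison against a sequence are XPath 2.0 syntax, while PHP's DOMXPath implements XPath 1.0. A minimal sketch of an XPath 1.0 equivalent follows, using a small well-formed sample that is hypothetical and not the markup from the question:

```php
<?php
// Hypothetical, well-formed sample (no <div> nested inside <p>).
$html = '<div class="bodyText"><p>First</p><h3>Heading</h3>'
      . '<p><span>skipped</span>Second</p><ul><li>ignored</li></ul></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// XPath 1.0: filter the child elements with the self axis instead of a union.
$query = "//div[contains(@class, 'bodyText')]/*[self::p or self::h3]/text()";

$texts = [];
foreach ($xpath->query($query) as $node) {
    $value = trim($node->nodeValue);
    if ($value !== '') {
        $texts[] = $value; // keep only non-empty text nodes
    }
}

print_r($texts);
```

Because text() selects only direct text children, the content of the nested span is left out while the text next to it is kept.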

Retrieve a text node with Simple HTML DOM Parser

I'm quite new to Simple HTML DOM Parser. I want to get a child element from the following HTML:
<div class="article">
<div style="text-align:justify">
<img src="image.jpg" title="image">
<br>
<br>
"Text to grab"
<div>......</div>
<br></br>
................
................
</div>
</div>
I'm trying to get the text "Text to grab"
So far I've tried the following query:
$html->find('div[class=article] div')->children(3);
But it's not working. Any idea how to solve this ?
You don't need simple_html_dom here. It can be done with DOMDocument and DOMXPath. Both are part of the PHP core.
Example:
// your sample data
$html = <<<EOF
<div class="article">
<div style="text-align:justify">
<img src="image.jpg" title="image">
<br>
<br>
"Text to grab"
<div>......</div>
<br></br>
................
................
</div>
</div>
EOF;
// create a document from the above snippet
// if you are loading from a remote URL, use:
// $doc->loadHTMLFile($url);
$doc = new DOMDocument();
$doc->loadHTML($html);
// initialize a XPath selector
$selector = new DOMXPath($doc);
// get the text node (text in XML/HTML is also represented as nodes)
$query = '//div[@class="article"]/div/br[2]/following-sibling::text()[1]';
$textToGrab = $selector->query($query)->item(0);
// remove newlines on start and end using trim() and output the text
echo trim($textToGrab->nodeValue);
Output:
"Text to grab"
If it's always in the same place you can do:
$html->find('.article text', 4);

Transform complex and variable xml

I have a complex XML document that I want to transform into HTML. Some tags need to be replaced with HTML tags.
The XML is this:
<root>
<div>
<p>
<em>bol text</em>, some normale text
</p>
</div>
<list>
<listitem>
normal text inside list <em>bold inside list</em>
</listitem>
<listitem>
another text in list...
</listitem>
</list>
<p>
A sample paragraph
</p>
The text inside the element is variable, which means that the other xml that I parse can completely change.
The output I want is this (for this scenario):
<root>
<div>
<p>
<strong>bol text</strong>, some normale text
</p>
</div>
<ul>
<li>
normal text inside list <strong>bold inside list</strong>
</li>
<li>
another text in list...
</li>
</ul>
<p>
A sample paragraph
</p>
</root>
I wrote a recursive function to parse every node of the XML and replace it with an HTML tag, but it doesn't work:
$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->load('section.xml');
echo $doc->saveHTML();
function printHtml(DOMNode $node)
{
if ($node->hasChildNodes())
{
foreach ($node->childNodes as $child)
{
printHtml($child);
}
}
if ($node->nodeName == 'em')
{
$newNode = $node->ownerDocument->createElement('strong', $node->nodeValue);
$node->parentNode->replaceChild($newNode, $node);
}
if ($node->nodeName == 'listitem')
{
$newNode = $node->ownerDocument->createElement('li', $node->nodeValue);
$node->parentNode->replaceChild($newNode, $node);
}
}
Can anyone help me?
This is an example of a complete xml:
<root>
<div>
<p>
<em>bol text</em>, some normale text
</p>
</div>
<list>
<listitem>
normal text inside list <em>bold inside list</em>
</listitem>
<listitem>
another text in list...
</listitem>
</list>
<media>
<info isVisible="false">
<title>
<p>Image title <em>in bold</em> not in bold</p>
</title>
</info>
<file isVisible="true">
<href>
"path/to/file.jpg"
</href>
</file>
</media>
<p>
A sample paragraph
</p>
</root>
Which has to be transformed into:
<root>
<div>
<p>
<strong>bol text</strong>, some normale text
</p>
</div>
<ul>
<li>
normal text inside list <em>bold inside list</em>
</li>
<li>
another text in list...
</li>
</ul>
<!-- the media tag can be presented in two mode: with title visible, and title hidden -->
<!-- this is the case when the title is hidden -->
<img src="path/to/file.jpg" />
<!-- this is the case when the title is visible -->
<!-- the info tag (inside media tag) has an attribute isVisible="false" which means it doesn't have to be shown. -->
<!-- if the info tag has visible=true, the media tag must be translated into
<div>
<img src="path/to/file.jpg" />
<p>Image title <strong>in bold</strong> not in bold</p>
</div>
-->
<p>
A sample paragraph
</p>
</root>
There's a language specially designed for this task: it's called XSLT, and you can easily express your desired transformation in XSLT and invoke it from your PHP program. There's a learning curve, of course, but it's a much better solution than writing low-level DOM code.
In XSLT you write a set of template rules saying how individual elements should be handled. Many elements in your example are copied through unchanged, so you can start with a default rule that does this:
<xsl:template match="*">
<xsl:copy><xsl:apply-templates/></xsl:copy>
</xsl:template>
The "match" part says what part of the input you are matching; the body of the rule says what output to produce. The xsl:apply-templates does a recursive descent to process the children of the current element.
Some of your elements are simply renamed, for example
<xsl:template match="listitem">
<li><xsl:apply-templates/></li>
</xsl:template>
Some of the rules are a little bit more complex, but still easily expressed:
<xsl:template match="media/file[@isVisible='true']">
<img src="{href}"/>
</xsl:template>
I hope you agree that this declarative rule-based approach is much clearer than your procedural code; it's also much easier for someone else to change the rules in six months' time.
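As a rough sketch of how such rules could be invoked from PHP, assuming the xsl extension (XSLTProcessor) is enabled. The shortened input and the extra rules for list and em are illustrative assumptions that follow the same renaming pattern:

```php
<?php
// Shortened, hypothetical input based on the question's XML.
$xml = <<<XML
<root>
<list>
<listitem>normal text inside list <em>bold inside list</em></listitem>
<listitem>another text in list...</listitem>
</list>
<p>A sample paragraph</p>
</root>
XML;

$xsl = <<<XSL
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <!-- default rule: copy the element and process its children -->
  <xsl:template match="*">
    <xsl:copy><xsl:apply-templates/></xsl:copy>
  </xsl:template>
  <!-- renaming rules -->
  <xsl:template match="list">
    <ul><xsl:apply-templates/></ul>
  </xsl:template>
  <xsl:template match="listitem">
    <li><xsl:apply-templates/></li>
  </xsl:template>
  <xsl:template match="em">
    <strong><xsl:apply-templates/></strong>
  </xsl:template>
</xsl:stylesheet>
XSL;

$source = new DOMDocument();
$source->loadXML($xml);

$stylesheet = new DOMDocument();
$stylesheet->loadXML($xsl);

$processor = new XSLTProcessor();
$processor->importStylesheet($stylesheet);

$result = $processor->transformToXml($source);
echo $result;
```

Adding another element mapping then only means adding another template rule; the PHP part stays unchanged.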
Well, maybe it's not the most correct idea, but why not just use str_replace? That way you can clearly see the list of changes to apply and easily add or remove entries.
$file = file_get_contents('file.xml');
$file = str_replace("<em>", "<strong>", $file);
file_put_contents('file.html', $file);
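A slightly fuller sketch of the same idea, using the array form of str_replace() so that closing tags are handled too. The inline sample string and the mapping table are assumptions for illustration, and this only works while the tags carry no attributes:

```php
<?php
// Hypothetical inline sample instead of reading file.xml.
$xml = '<list><listitem>text <em>bold</em></listitem></list>';

// One entry per tag to rename; opening and closing tags are listed separately.
$map = [
    '<em>'        => '<strong>',
    '</em>'       => '</strong>',
    '<list>'      => '<ul>',
    '</list>'     => '</ul>',
    '<listitem>'  => '<li>',
    '</listitem>' => '</li>',
];

$converted = str_replace(array_keys($map), array_values($map), $xml);
echo $converted, "\n";
```

Conditional rules such as the media/isVisible case cannot be expressed as plain string replacement, which is where the XSLT or DOM approaches become the better fit.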
UPDATE (Some more ideas regarding the changes in the question)
Using PHP + DOM here seems a little tricky (at least for me right now). It may be more reasonable to use XSL/XSLT (Extensible Stylesheet Language Transformations). Something similar can be found here: How to replace a node-name with another in Xslt?
XSLT is designed specifically for this kind of transformation: http://en.wikipedia.org/wiki/XSLT

how to scrape the external url using php simple html dom parser

<div id="lyrics">
<img />
<span id="line_31" class="line line-s">En medio de este tropico mortal</span>
<br>
<span id="line_32" class="line line-s">Roots and creation, come again!</span>
<br>
<span id="line_33" class="line line-s">So mi guardian, mi guardian mi lift up di plan</span>
<span id="line_34" class="line line-s">Now everybody a go' do dis one</span>
<span id="line_35" class="line line-s">Like in down di Caribbean</span>
<span id="line_36" class="line line-s">San Andrés, Providence Island</span>
<br>
</div>
Here I have a div that contains multiple span tags with br tags between them. I want to scrape the span text and keep the br tags as they are. How can I do this with PHP Simple HTML DOM Parser?
Thanks for any help.
Let's say the html file you have above is called "index.html".
$html = file_get_html("index.html");
$element = $html->find('div#lyrics', 0); // index 0 returns the first match instead of an array
$result = $element->innertext;
You want to consult the manual: http://simplehtmldom.sourceforge.net/
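If Simple HTML DOM is not available, roughly the same result can be obtained with PHP's built-in DOM classes. This is a swapped-in technique rather than the library the question asks about, and the shortened sample markup is hypothetical:

```php
<?php
// Shortened, hypothetical version of the lyrics markup.
$html = '<div id="lyrics"><span class="line">First line</span><br>'
      . '<span class="line">Second line</span></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$lyrics = $xpath->query('//div[@id="lyrics"]')->item(0);

// Serialise each child node to rebuild the inner HTML, br tags included.
$inner = '';
foreach ($lyrics->childNodes as $child) {
    $inner .= $doc->saveHTML($child);
}

echo $inner, "\n";
```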

PHP allow img tags only

I need your assistance with PHP. I want to allow HTML <img> tags only. I tried PHP's built-in function strip_tags(), but it's not giving me the output I need. For instance, in the following code strip_tags() keeps the img tag but also keeps the text.
$img = "<img src='/img/fawaz.jpg' alt= ''> <br /> <p> This is a detailed paragraph about Fawaz and his mates.</p>";
echo strip_tags($img , "<img>");
What would be the proper way to keep only <img> (or any other single tag) from a function or variable?
Any help would be appreciated.
Thanks
This might be due to non closing img tag in your code. Try this
$img = "<img src='/img/fawaz.jpg' alt= '' /> <br /> <p> This is a detailed paragraph about Fawaz and his mates.</p>";
echo strip_tags($img , "<img>");
strip_tags() doesn't behave the way you want it to. If a second argument is supplied, the listed tags are kept in the resulting string and all other tags are removed. It never filters out the text inside the removed tags.
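A short sketch of that behaviour, using a shortened version of the string from the question:

```php
<?php
$input = "<img src='/img/fawaz.jpg' alt=''> <br /> <p>This is a detailed paragraph.</p>";

// Only <img> is allowed: <br /> and <p> are removed, but the paragraph text stays.
$kept = strip_tags($input, '<img>');

echo $kept, "\n";
```

The paragraph text survives because strip_tags() removes markup only, never the content between tags.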
If you want to extract <img/> elements only, don't even think about using a regex. Use a DOM parser for that:
libxml_use_internal_errors(true);
$doc=new DOMDocument;
$html=$doc->loadHTML('<img src="/img/fawaz.jpg" alt= ""> <br /> <p> This is a
detailed paragraph about Fawaz and his mates.</p>');
$path=new DOMXPath($doc);
foreach ($path->query('//img') as $found)
var_dump($doc->saveXML($found));
To remove all HTML tags except <img>, <a>, <br/>, <hr/> and so on:
$img = "
<img src='/img/fawaz.jpg' alt= '' />
<br /><br/>
<hr/>
<p> This is a detailed paragraph about Fawaz and his mates.</p>
<a href='cft'>123</a>
";
$img = strip_tags($img , "<img><a><br><hr>");
echo $img;
