How to scrape an external URL using PHP Simple HTML DOM Parser

<div id="lyrics">
<img />
<span id="line_31" class="line line-s">En medio de este tropico mortal</span>
<br>
<span id="line_32" class="line line-s">Roots and creation, come again!</span>
<br>
<span id="line_33" class="line line-s">So mi guardian, mi guardian mi lift up di plan</span>
<span id="line_34" class="line line-s">Now everybody a go' do dis one</span>
<span id="line_35" class="line line-s">Like in down di Caribbean</span>
<span id="line_36" class="line line-s">San Andrés, Providence Island</span>
<br>
</div>
Here I have a div that contains multiple span elements with br tags between them. I want to scrape the span text and the br tags exactly as they are. How can I do this with PHP Simple HTML DOM Parser?
Thanks for any help.

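Let's say the HTML above is saved in a file called "index.html".
include('simple_html_dom.php');
$html = file_get_html("index.html");
// find() returns an array unless an index is passed; 0 selects the first match
$element = $html->find('div#lyrics', 0);
$result = $element->innertext;
For more details, consult the manual: http://simplehtmldom.sourceforge.net/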
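If Simple HTML DOM is not available, roughly the same result can be sketched with PHP's built-in DOM extension. The helper name innerHtmlById and the sample markup are assumptions made for illustration, and the meta prefix is only a common way to hint UTF-8 input to loadHTML.

```php
<?php
// Hypothetical helper: returns the inner HTML of the element with the given id.
function innerHtmlById(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // silence warnings about imperfect markup
    // Assumption: the input is UTF-8; the meta prefix hints that to libxml.
    $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $node = $xpath->query(sprintf("//*[@id='%s']", $id))->item(0);
    if ($node === null) {
        return null;
    }

    // Serialize each child node so that the span and br tags are preserved.
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Shortened sample of the markup from the question.
$html = '<div id="lyrics"><span class="line">Roots and creation, come again!</span><br>'
      . '<span class="line">Like in down di Caribbean</span></div>';

$inner = innerHtmlById($html, 'lyrics');
echo $inner, PHP_EOL;
```

For a remote page, the string passed in could come from file_get_contents() on the URL, provided allow_url_fopen is enabled.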

Related

How to select the right div by class with Xpath?

$divs = $xpathsuj->query("//div[@class='txt-msg text-enrichi-forum ']");
$div = $divs[$i];
With this XPath command I'm able to select the div with the class "txt-msg text-enrichi-forum " :
<div class="bloc-contenu">
<div class="txt-msg text-enrichi-forum ">
<p>Tu pourrais écrire en FRANCAIS si ce n'est pas trop demandé?
<img src="http://image.jeuxvideo.com/smileys_img/54.gif" alt=":coeur:" data-code=":coeur:" title=":coeur:" width="21" height="20" />
</p>
</div>
</div>
But not this one :
<div class="bloc-contenu">
<div class="txt-msg text-enrichi-forum ">
<p>
<img src="http://image.jeuxvideo.com/smileys_img/42.gif" alt=":salut:" data-code=":salut:" title=":salut:" width="46" height="41" />
</p>
</div>
<div class="signature-msg text-enrichi-forum ">
<p>break;</p>
</div>
</div>
What am I doing wrong?
I've tried it with both segments of XML and it seems to work with both, but it is possible that there is some issue with spacing.
In your XPath query, you're looking for an exact match of 'txt-msg text-enrichi-forum ', which has two spaces after txt-msg and one after the last part. If any spaces are missing, this will not find the element.
If you change it to...
$divs = $xpathsuj->query("//div[contains(@class,'txt-msg') and contains(@class,'text-enrichi-forum')]");
foreach ( $divs as $div ) {
    echo $doc->saveXML($div).PHP_EOL;
}
It should be a bit more tolerant.
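One caveat with contains() is that it matches substrings, so 'txt-msg' would also match a class such as 'txt-msg-old'. A stricter variant, sketched below, pads the normalized class attribute with spaces and matches a whole token. The function name findDivsByClassToken and the sample markup are assumptions for illustration.

```php
<?php
// Hypothetical helper: select div elements whose class list contains $token
// as a whole word, regardless of extra or missing spaces in the attribute.
function findDivsByClassToken(DOMXPath $xpath, string $token): DOMNodeList
{
    $query = sprintf(
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $token
    );
    return $xpath->query($query);
}

// Sample markup with irregular spacing inside the class attributes.
$html = '<div class="bloc-contenu">'
      . '<div class="txt-msg  text-enrichi-forum "><p>message</p></div>'
      . '<div class="signature-msg text-enrichi-forum "><p>break;</p></div>'
      . '</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$matches = findDivsByClassToken($xpath, 'txt-msg');

foreach ($matches as $div) {
    echo trim($div->textContent), PHP_EOL;
}
```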

xpath not returning text if p tag is followed by any other tag

I want to get all the text between the <p> and <h3> tags for the following HTML:
<div class="bodyText">
<p>
<div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">
<div class="one">
<img src="url" alt="bar" class="img" width="80" height="60" />
</div>
<div class="two">
<h4 class="preTitle">QIEZ-Lieblinge</h4>
<h3 class="title"><a href="url" title="ABC" onclick="cmsTracking.trackClickOut({element:this, channel : 32333770, channelname : 'top_listen', content : 14832081, callTemplate : '_htmltagging.Text', action : 'click', mouseevent : event});">
Prominente Gastronomen </a></h3>
<span class="postTitle"></span>
<span class="district">Berlin</span> </div>
<div class="clear"></div>
</div>
I want this TEXT</p>
<h3>I want this TEXT</h3>
<p>I want this TEXT</p>
<p>
<div class="inlineImage alignLeft">
<div class="medium">
<img src="http://images03.qiez.de/Restaurant+%C3%96_QIEZ.jpg/280x210/0/167.231.886/167.231.798" width="280" height="210" alt="Schöne Lage: das Restaurant Ø. (c)QIEZ"/>
<span class="caption">
Schöne Lage: das Restaurant Ø. (c)QIEZ </span>
</div>
</div>I want this TEXT</p>
<p>I want this TEXT</p>
<p>I want this TEXT<br /> </p>
<blockquote><img src="url" alt="" width="68" height="68" />
"Eigentlich nur drei Worte: Ich komme wieder."<span class="author">Tina Gerstung</span></blockquote>
<div class="clear"></div>
</div>
I want every "I want this TEXT". I used this XPath query:
//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']
but it does not give me the text if the <p> tag is followed by any other tag.
It looks like you have div elements contained within your p element which is not valid and messing up things. If you use a var_dump in the loop you can see that it does actually pick up the node but the nodeValue is empty.
A quick and dirty fix to your html would be to wrap the first div that is contained in the p element in a span.
<span><div class="articleBox articleSmallHorizontal channel-32333770 articleBoxBordered alignRight">...</div></span>
A better fix would be to put the div element outside the paragraph.
If you use the dirty workaround you will need to change your query like so:
$xpath->query("//div[contains(@class,'bodyText')]/*[local-name()='p' or local-name()='h3']/text()");
If you do not have control of the source HTML, you can make a copy of the HTML and remove the offending divs:
$nodes = $xpath->query("//div[contains(@class,'articleBox')]");
$node = $nodes->item(0);
$node->parentNode->removeChild($node);
It might be easier to work with simple_html_dom. Maybe you can try this:
include('simple_html_dom.php');
$dom = new simple_html_dom();
$dom->load($html);
foreach ($dom->find("div[class=bodyText]") as $parent) {
    foreach ($parent->children() as $child) {
        if ($child->tag == 'p' || $child->tag == 'h3') {
            // remove the inner text of divs contained within this element
            foreach ($child->find('div') as $e) {
                $e->innertext = '';
            }
            echo $child->plaintext . '<br>';
        }
    }
}
This is mixed content. Depending on what defines the position of the element, you can use a number of factors. In this case, simply selecting all the text nodes will probably be sufficient:
//div[contains(@class, 'bodyText')]/(p | h3)/text()
If the union operator within a location path is not allowed in your processor, you can use your syntax as before or, a little simpler in my opinion:
//div[contains(@class, 'bodyText')]/*[local-name() = ('p', 'h3')]/text()
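PHP's DOMXPath implements XPath 1.0, which accepts neither a parenthesised union inside a path step nor a comparison against a sequence. A minimal XPath 1.0 sketch is to spell out the union as two complete paths. The sample markup below is a simplified, well-formed stand-in (no div inside p), not the original page.

```php
<?php
// Simplified, well-formed stand-in for the markup in the question.
$html = '<div class="bodyText">'
      . '<p><span class="caption">skip me</span>first</p>'
      . '<h3>second</h3>'
      . '<p>third</p>'
      . '<blockquote>ignored</blockquote>'
      . '</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// XPath 1.0 union: each operand of | must be a full path expression.
$base  = "//div[contains(@class, 'bodyText')]";
$nodes = $xpath->query("$base/p/text() | $base/h3/text()");

// Keep only the non-empty direct text nodes of p and h3.
$texts = [];
foreach ($nodes as $node) {
    $value = trim($node->nodeValue);
    if ($value !== '') {
        $texts[] = $value;
    }
}

print_r($texts);
```

Because only direct text() children are selected, text inside nested elements such as the span is left out.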

PHP XPATH parse HTML how to get inner text BEFORE another nested tag

I'm using XPath to parse HTML and had no problems until I came across the code below.
I usually use the "textContent" property once I have this td from an XPath query, BUT I need to get only the text BEFORE the <img tag.
<td class="rowdispari">
ZONA NON SERVITA QUOTIDIANAMENTE-PROSSIMA CONSEGNA
<img onmouseover="caricaTool()" src="template/img/infoTip.png" width="17">
<div class="bottom" id='tooool'>
<div class="contenuto">
<div class="top">
<font class="testobold"><font class='testoblubold'>ZONA NON SERVITA QUOTIDIANAMENTE - PROSSIMA CONSEGNA </font><br>La località di destinazione non è tra quelle servite quotidianamente da SDA. La consegna avverrà al più presto possibile, compatibilmente con le operazioni logistiche.</font>
<p> <br><u>Chiudi</u>
</div>
</div>
</div>
</td>
You can probably use:
//td[@class="rowdispari"][img[@src="template/img/infoTip.png"]]/text()[1]
or:
//td[@class="rowdispari"]/text()[following-sibling::img[@src="template/img/infoTip.png"]][1]
Assuming that you already have XPath to get the outer <td> element, you can simply append the XPath with /text()[1] to get the first text node that is direct child of current <td> element :
path_to_td_here/text()[1]
more concrete example :
//td[@class='rowdispari']/text()[1]
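In PHP this expression could be run as follows. This is a sketch under the assumption that the td sits inside a table; the markup is a shortened version of the one in the question, and normalize-space() is added only to trim the surrounding line breaks.

```php
<?php
// Shortened version of the markup from the question (tooltip content reduced).
$html = '<table><tr><td class="rowdispari">
ZONA NON SERVITA QUOTIDIANAMENTE-PROSSIMA CONSEGNA
<img src="template/img/infoTip.png" width="17">
<div class="bottom"><div class="contenuto">Tooltip text to ignore</div></div>
</td></tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// evaluate() returns a string here because normalize-space() is a string function.
$text = $xpath->evaluate("normalize-space(//td[@class='rowdispari']/text()[1])");

echo $text, PHP_EOL;
```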

Using Simple HTML DOM to get specific plain text

Problem:
Trying to extract specific text from HTML code that is available to me through PHP.
HTML code:
<a href="/debatt/s-vill-ha-tioarig-skolplikt-och-farre-elever-i-klassen">
<span class="number">2. </span>Skolplikt och färre elever i klassen
<br />
<span class="metadata">I går</span>
</a>
<a href="/sthlm/edholm-backar-om-skolornas-smorforbud">
<span class="number">3. </span>Edholm backar om skolornas smörförbud
<br />
<span class="metadata">16 okt</span>
</a>
Desired output:
2. Skolplikt och färre elever i klassen
3. Edholm backar om skolornas smörförbud
Both code examples have the same HTML structure. Is it possible through Simple HTML DOM to do this or should regular expressions be pursued?
Add the HTML into a DOMElement object. With it you can select children and extract their HTML/text into variables.
Docs: http://php.net/manual/en/class.domelement.php
Same answer as https://stackoverflow.com/a/12950525/711129
If you have to frequently do this, you can use a very handy and easy class for parsing html dom.
http://simplehtmldom.sourceforge.net/
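As a rough illustration of the DOM approach, the sketch below reads the number span and the first non-empty text node of each link. The function name extractLinkTitles is hypothetical, and the meta prefix is an assumption used to make loadHTML treat the input as UTF-8.

```php
<?php
// Hypothetical helper: returns "number title" for every link in the fragment.
function extractLinkTitles(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    // Assumption: the input is UTF-8; the meta prefix hints that to libxml.
    $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();

    $xpath  = new DOMXPath($doc);
    $titles = [];
    foreach ($xpath->query('//a') as $link) {
        // Text of the <span class="number">, e.g. "2."
        $number = $xpath->evaluate("normalize-space(span[@class='number'])", $link);
        // First non-empty text node directly inside the link (the title)
        $title = $xpath->evaluate("normalize-space(text()[normalize-space()][1])", $link);
        $titles[] = $number . ' ' . $title;
    }
    return $titles;
}

$html = '<a href="/debatt/s-vill-ha-tioarig-skolplikt-och-farre-elever-i-klassen">
<span class="number">2. </span>Skolplikt och färre elever i klassen
<br />
<span class="metadata">I går</span>
</a>
<a href="/sthlm/edholm-backar-om-skolornas-smorforbud">
<span class="number">3. </span>Edholm backar om skolornas smörförbud
<br />
<span class="metadata">16 okt</span>
</a>';

$titles = extractLinkTitles($html);
echo implode(PHP_EOL, $titles), PHP_EOL;
```

The metadata span is skipped because only the direct text nodes of each link are read.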

[php]remove html code from string

in this variable: $this->item->text I have this string:
<!-- JoomlaWorks "Disqus Comment System for Joomla!" Plugin (v2.2) starts here -->
<div class="itp-fshare-floating" id="itp-fshare" style="position:fixed; top:30px !important; left:50px !important;"></div><p>Lorem Ipsum è un testo segnaposto utilizzato nel settore della tipografia e della stampa. Lorem Ipsum è considerato il testo segnaposto standard sin dal sedicesimo secolo, quando un anonimo tipografo prese una cassetta di caratteri e li assemblòdei fogli di caratteri trasferibili “Letraset”,</p>
<p style="text-align: center;"><span class="easy_img_caption" style="display:inline-block;line-height:0.5;vertical-align:top;background-color:#F2F2F2;text-align:left;width:150px;float:left;margin:0px 10px;"><img src="/joomla/plugins/content/imagesresizecache/8428e9c26f1d8498ece730c0aa6aa023.jpeg" border="0" alt="1" title="1" style="width:150px; height:120px; ;margin:0;" /><span class="easy_img_caption_inner" style="display:inline-block;line-height:normal;color:#000000;font-size:8pt;font-weight:normal;font-style:normal;padding:4px 8px;margin:0px;">1</span></span></p>
<p>che contenevano passaggi del Lorem Ipsum, e più recentemente da software di impaginazione come Aldus PageMaker</p>
<!-- Disqus comments counter and anchor link -->
<a class="jwDisqusListingCounterLink" href="http://clouderize.it/joomla/index.php?option=com_content&view=article&id=8:recensione&catid=3:recensione-dei-servizi-di-cloud-computing&Itemid=4#disqus_thread" title="Add a comment">
Add a comment</a>
<!-- JoomlaWorks "Disqus Comment System for Joomla!" Plugin (v2.2) ends here -->
<div class="cp_tags">
<span class="cp_tag_label">Tags: </span><span class="cp_tag cp_tag_6">Recensioni
</span> </div>
So with this code I extract
<span class="easy_img_caption......></span>
Code (I am using this library called phpQuery http://goo.gl/rSu3k):
include_once('includes/phpQuery.php');
$doc = phpQuery::newDocument($this->item->text);
$extraction=pq('.easy_img_caption:eq(0)')->htmlOuter();
echo"<textarea>".$extraction."</textarea>";
So my question is:
How can I remove $extraction string from $this->item->text?
Thank you.
I'll assume phpQuery is some library aiding with dom-parsing in php?
Anyway, to accomplish this, you don't exactly need this external library. It can easily be accomplished with a regular expression replace:
$text = preg_replace('/<span.*?class="[^"]*?easy_img_caption[^"]*?".*?>.*?<\/span>/s', '', $this->item->text);
echo "<textarea>" . $text . "</textarea>";
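Note that with nested spans, as in the string above, the lazy .*?<\/span> stops at the first closing tag, which belongs to the inner span, so a stray </span> is left behind. A DOM-based alternative is sketched below using PHP's built-in DOMDocument rather than phpQuery. The function name removeFirstByClass, the wrapper id and the sample input are assumptions for illustration.

```php
<?php
// Hypothetical helper: removes the first element carrying the given class
// (including all nested children) and returns the remaining HTML.
function removeFirstByClass(string $html, string $class): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    // Assumption: UTF-8 input. The wrapper div lets us read the fragment back.
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '<div id="fragment-wrapper">' . $html . '</div>'
    );
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $query = sprintf(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );
    $node = $xpath->query($query)->item(0);
    if ($node !== null) {
        $node->parentNode->removeChild($node);
    }

    // Serialize the children of the wrapper to get the fragment back.
    $wrapper = $xpath->query("//div[@id='fragment-wrapper']")->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Shortened sample with the same nesting as the string in the question.
$input = '<p>before</p>'
       . '<p><span class="easy_img_caption"><img src="1.jpeg" alt="1">'
       . '<span class="easy_img_caption_inner">1</span></span></p>'
       . '<p>after</p>';

$text = removeFirstByClass($input, 'easy_img_caption');
echo $text, PHP_EOL;
```

The parser may normalize the markup slightly when serializing, so the output is not guaranteed to be byte-identical to the input outside the removed element.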
