How to remove a link from Simple HTML DOM data (PHP)

I have this code. It gets the info I need, but it also returns the link along with the data. For example:
require_once('simple_html_dom.php');
set_time_limit(0);
$url = 'http://www.domain.com';
$html = file_get_html($url);
// read the first div
foreach ($html->find('#content') as $element) {
    // read each paragraph inside it
    foreach ($element->find('p') as $phone) {
        echo $phone;
    }
}
This prints:
Mobile Pixel 2 -
google << this is the link
But I need to remove that link. The markup I am scraping looks like this:
<p>the info that i really need is here<p>
<p class="text-right"><a class="btn btn-default espbott aplus" role="button"
href="brand/google.html">Google</a></p>
I read this question:
Simple HTML Dom: How to remove elements?
But I can't find the answer there.
Update: if I use this:
foreach ($element->find('p[class="text-right"]') as $phone)
it selects the links, but I can't remove them from the scraped data.

You can use file_get_contents with str_get_html and strip the matched elements from the raw content:
include 'simple_html_dom.php';
$content=file_get_contents($url);
$html = str_get_html($content);
// i read the first div
foreach($html->find('#content') as $element){
// i read the second
foreach ($element->find('p[class="text-right"]') as $phone){
$content=str_replace($phone,'',$content);
}
}
print $content;
die;

Or here is a native version:
PHP-CODE
$sHtml = '<p>the info that i really need is here<p>
<p class="text-right"><a class="btn btn-default espbott aplus" role="button"
href="brand/google.html">Google</a></p>';
$sHtml = '<div id="wrapper">' . $sHtml . '</div>';
echo "org:\n";
echo $sHtml;
echo "\n\n";
$doc = new DOMDocument();
$doc->loadHtml($sHtml);
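// Copy the live node list first; removing nodes while iterating it directly can skip elements.
foreach (iterator_to_array($doc->getElementsByTagName('a')) as $element) {
    $element->parentNode->removeChild($element);
}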
echo "res:\n";
echo $doc->saveHTML($doc->getElementById('wrapper'));
Output
org:
<div id="wrapper"><p>the info that i really need is here<p>
<p class="text-right"><a class="btn btn-default espbott aplus" role="button"
href="brand/google.html">Google</a></p></div>
res:
<div id="wrapper">
<p>the info that i really need is here</p>
<p>
</p>
<p class="text-right"></p>
</div>
https://3v4l.org/RhuEU
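The native version above removes only the <a> element, so an empty <p class="text-right"> is left behind. A minimal sketch, assuming the whole paragraph should go, selects it with DOMXPath and removes it instead. The variable names are illustrative, and the class test is a common XPath idiom rather than anything specific to the original answer.

```php
<?php
// Sample markup from the question (the stray second <p> is kept as posted).
$sHtml = '<p>the info that i really need is here<p>
<p class="text-right"><a class="btn btn-default espbott aplus" role="button"
href="brand/google.html">Google</a></p>';

$doc = new DOMDocument();
// Suppress warnings about imperfect markup; the wrapper div gives one node to serialise.
@$doc->loadHTML('<div id="wrapper">' . $sHtml . '</div>');

$xpath = new DOMXPath($doc);
// Select every <p> whose class attribute contains the token "text-right".
$query = '//p[contains(concat(" ", normalize-space(@class), " "), " text-right ")]';

// Remove each matching paragraph together with the link inside it.
foreach ($xpath->query($query) as $p) {
    $p->parentNode->removeChild($p);
}

$result = $doc->saveHTML($doc->getElementById('wrapper'));
echo $result;
```

Matching on the paragraph rather than on every <a> keeps other links in the page untouched.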

Related

simple html dom traversal confusion when looping

I'm trying to use the php script simplehtmldom to loop over divs on a web page while scraping.
Right now I have this:
$url = "https://test.com/";
$html = new simple_html_dom();
$html->load_file($url);
$item_list = $html->find('div.main div[id]');
foreach ($item_list as $item)
{
echo $item->outertext . PHP_EOL;
}
This will give me many like this (from the echo in the loop above):
<div id=1>
<div>
stuff here
</div>
<div>
<span class="title">name</span>
</div>
</div>
<div id=2>
<div>
stuff here
</div>
<div>
<span class="title">name 2</span>
</div>
</div>
What I'm trying to do is loop over the span with class=title, but no matter what I can't seem to quite get the right selector. Could someone help me out?
You can get the spans by adding span[class=title] to the selector:
$item_list = $html->find('div.main div[id] span[class=title]');
foreach ($item_list as $item)
{
echo $item->outertext . PHP_EOL;
}

Extract html from url with DOM

I've already searched for this, but most of the topics use Java, and I need to do it with DOM in PHP. I want to extract this element from example.com:
<div id="download" class="large-12 medium-12 columns hide-for-small-only">
<a href="javascript:void(0)" link="https://mediamusic.com/media/mp3/mp3-256/Mas.mp3" target="_blank" class="mp3_download_link">
<i class="fa fa-cloud-download">Download Now</i>
</a>
</div>
How can I get the mp3_download_link class from this code using DOM in PHP? As I said, I have already searched for this, but I am really confused.
You can use a library to parse the DOM. For example: https://github.com/tburry/pquery
Usage:
$dom = pQuery::parseStr($html);
$class = $dom->query('#download a')->attr('class');
You can try file_get_html to parse the HTML:
$html=file_get_html('http://demo.com');
and use the loop below to get all the attributes of the anchor tag:
foreach($html->find('div[id=download] a') as $a){
var_dump($a->attr);
}
Let's assume you have this DOM as a string. Then you can use the built-in DOM extension to get the link you need. Here is an example:
$domstring = '<div id="download" class="large-12 medium-12 columns hide-for-small-only">
<a href="javascript:void(0)" link="https://mediamusic.com/media/mp3/mp3-256/Mas.mp3" target="_blank" class="mp3_download_link">
<i class="fa fa-cloud-download">Download Now</i>
</a>
</div>';
$links = array();
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($domstring);//here $domstring is a string containing html you posted in your question
$node_list = $dom->getElementsByTagName('a');
foreach ($node_list as $node) {
$links[] = $node->getAttribute('link');
}
print_r(array_shift($links));
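If the page contains other anchors, collecting every <a> may pick up the wrong one. As a rough sketch, assuming the target is the anchor inside #download that carries the mp3_download_link class, DOMXPath can narrow the match before reading the custom link attribute. The variable names here are illustrative only.

```php
<?php
// Markup copied from the question.
$domstring = '<div id="download" class="large-12 medium-12 columns hide-for-small-only">
<a href="javascript:void(0)" link="https://mediamusic.com/media/mp3/mp3-256/Mas.mp3" target="_blank" class="mp3_download_link">
<i class="fa fa-cloud-download">Download Now</i>
</a>
</div>';

$dom = new DOMDocument();
// Suppress parser warnings; real pages are rarely perfectly valid.
@$dom->loadHTML($domstring);

$xpath = new DOMXPath($dom);
// Anchors inside #download whose class list contains "mp3_download_link".
$nodes = $xpath->query(
    '//div[@id="download"]//a[contains(concat(" ", normalize-space(@class), " "), " mp3_download_link ")]'
);

// Read the non-standard "link" attribute of the first match, or null if nothing matched.
$link = $nodes->length > 0 ? $nodes->item(0)->getAttribute('link') : null;
echo $link, "\n";
```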

simple HTML DOM tag selecting issue

I have this HTML:
<div class="price" itemprop="offers" itemscope itemtype="http://schema.org/Offer">
<small class="old-price">Stara cena: 1.890 RSD</small>
<span>Ušteda: <strong>1.000 RSD</strong></span>
<h5>890 <em>RSD</em>
<div class="tooltip"><p>Cene sa popustom uz gotovinsko plaćanje za online porudžbine</p></div>
</h5>
<span style="display:none" itemprop="priceCurrency" content="RSD"></span>
<span itemprop="price" content="890.00"></span>
</div>
I'm collecting prices from tag like this:
foreach($html->find('span[itemprop=price]') as $element) {
$niz['price'][] = $element->content;
}
Now I need to collect the text from the small tag if it exists (if it does not exist, I need an empty string in the array):
<small class="old-price">Stara cena: 1.890 RSD</small>
So I need something like this:
if($html->find('small[class=old-price]',0))
{
$niz['oldprice'][] = $element->innertext;
}else{
$niz['oldprice'][] = '';
}
The problem is that I get only the elements with class=old-price in the array and not a single empty string.
Any advice would be appreciated.
Hi, can you please try this code:
foreach($html->find('small[class=old-price]') as $element) {
if($element->plaintext)
{
$niz['oldprice'][] = $element->plaintext;
}else{
$niz['oldprice'][] = '';
}
}
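Note that a loop over small[class=old-price] only visits elements that exist, so it never reaches the branch that appends an empty string. One possible approach, sketched here with the built-in DOM extension rather than simple_html_dom, is to loop over each product block and look for the old price inside it. The two-block sample is hypothetical, and the sketch assumes every product sits in its own div with class="price", as in the posted markup.

```php
<?php
// Hypothetical sample: two product blocks, only the first has an old price.
$markup = '<div class="price">
<small class="old-price">Stara cena: 1.890 RSD</small>
<span itemprop="price" content="890.00"></span>
</div>
<div class="price">
<span itemprop="price" content="450.00"></span>
</div>';

$dom = new DOMDocument();
@$dom->loadHTML($markup);
$xpath = new DOMXPath($dom);

$niz = ['price' => [], 'oldprice' => []];

// Walk each product block so prices and old prices stay aligned by index.
foreach ($xpath->query('//div[@class="price"]') as $block) {
    // The second argument restricts each query to the current block.
    $price = $xpath->query('.//span[@itemprop="price"]', $block)->item(0);
    $old   = $xpath->query('.//small[@class="old-price"]', $block)->item(0);

    // item(0) is null when nothing matched, so fall back to an empty string.
    $niz['price'][]    = $price ? $price->getAttribute('content') : '';
    $niz['oldprice'][] = $old ? trim($old->textContent) : '';
}

print_r($niz);
```

The same idea carries over to simple_html_dom: call find on each block element rather than on the whole document.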

Getting the title of post

I am trying to get the title of a post using simple_html_dom. The HTML structure can be seen below; the part I am trying to get is the link text, This is our title.
<div id="content">
<div id="section">
<div id="sectionleft">
<p>
Latest News
</p>
<ul class="cont news">
<li>
<div style="padding: 1px;">
<a href="http://www.example.com">
<img src="http://www.example.com/our-image.png" width="128" height="96" alt="">
</a>
</div>
<a href="http://www.example.com" class="name">
This is our title
</a>
<i class="info">added: Dec 16, 2015</i>
</li>
</ul>
</div>
</div>
</div>
Currently I have this
$page = (isset($_GET['p'])&&$_GET['p']!=0) ? (int) $_GET['p'] : '';
$html = file_get_html('http://www.example.com/'.$page);
foreach($html->find('div#section ul.cont li div a') as $element)
{
print '<br><br>';
echo $url = 'http://www.example.com/'.$element->href;
$html2 = file_get_html($url);
print '<br>';
$image = $html2->find('meta[property=og:image]',0);
print $image = $image->content;
print '<br>';
$title = $html2->find('#sectionleft ul.cont news li a.name',0);
print $title = $title->plaintext;
print '<br>';
}
The issue is here: $title = $html2->find('#sectionleft ul.cont news li a.name',0); I assume I am using the wrong selector, but I am not sure what I am doing wrong.
ul.cont news means "find <news> elements that are descendants of ul.cont".
You actually want:
#sectionleft ul.cont.news li a.name
EDIT: For some reason, it seems simple_html_dom doesn't like ul.cont.news even though it's a valid CSS selector.
You can try
#sectionleft ul[class="cont news"] li a.name
which should work as long as the classes are in that order.
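If the order of the classes cannot be relied on, one alternative is to test each class token separately. The sketch below does this with the built-in DOMXPath instead of simple_html_dom; hasClass is a hypothetical helper written for this example, and the markup is a trimmed copy of the question's.

```php
<?php
// Trimmed copy of the markup from the question.
$markup = '<div id="sectionleft">
<ul class="cont news">
<li>
<a href="http://www.example.com" class="name">
This is our title
</a>
</li>
</ul>
</div>';

$dom = new DOMDocument();
@$dom->loadHTML($markup);
$xpath = new DOMXPath($dom);

// Hypothetical helper: builds an XPath test for a single class token.
function hasClass(string $name): string
{
    return 'contains(concat(" ", normalize-space(@class), " "), " ' . $name . ' ")';
}

// Match <ul> elements carrying both classes, in any order, then the named link.
$query = '//*[@id="sectionleft"]//ul[' . hasClass('cont') . ' and ' . hasClass('news') . ']'
       . '//li//a[' . hasClass('name') . ']';

$node  = $xpath->query($query)->item(0);
$title = $node ? trim($node->textContent) : '';
echo $title, "\n";
```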
If this seems a little hacky, forgive me, but... you can always employ PHP to run a quick .js:
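<?php
// Emit a script that reads the title in the browser and reloads with it as a query parameter.
echo '<script>';
echo 'var postTitle = document.querySelector("ul.cont.news a.name").innerHTML;';
if (!isset($_GET['posttitle'])) {
echo 'window.location.href = window.location.href + "?posttitle=" + encodeURIComponent(postTitle);';
}
echo '</script>';
// Empty on the first request, before the redirect has added the parameter.
$postTitle = isset($_GET['posttitle']) ? $_GET['posttitle'] : '';
?>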

Use file_get_html and extract content from the first span in a div

The html content is:
<div id="sns-availability" class="a-section a-spacing-none">
<div class="a-section a-spacing-mini">
<span class="a-size-medium a-color-success">
In Stock.
</span>
<span class="a-size-base">
Ships Soon.
</span>
</div>
</div>
and from my code below, the output is :
In Stock. Ships soon.
I'm wondering how to extract only :
In Stock.
Can someone help?
include_once('simple_html_dom.php');
$url = "xxx";
$html = file_get_html($url);
$output = $html->find('div[id=sns-availability]');
$output = $output[0]->first_child();
echo $output;
That would simply be:
$html->find('#sns-availability span', 0);
You can probably add another first_child():
$output = $output[0]->first_child()->first_child();
You only navigate to the div that groups the two spans whose content is echoed. You need to get to the first of those two children, as illustrated in my simplification here:
<div>
<div> <-- you are here
<span>In stock</span> <-- need to get here
<span>Ships soon</span>
</div>
</div>
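According to the documentation:
// Find all <span> with class=gb1
$result = $dom->find('span.gb1');
So try selecting on one of the span's classes, and pass an index so that find returns a single element instead of an array:
$result = $dom->find('span.a-color-success', 0);
echo $result->plaintext;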
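For comparison, the same lookup can be sketched with PHP's built-in DOM extension. This is only an illustration, assuming the markup is already available as a string; with simple_html_dom, the selector-plus-index answer above is shorter.

```php
<?php
// Markup copied from the question.
$markup = '<div id="sns-availability" class="a-section a-spacing-none">
<div class="a-section a-spacing-mini">
<span class="a-size-medium a-color-success">
In Stock.
</span>
<span class="a-size-base">
Ships Soon.
</span>
</div>
</div>';

$dom = new DOMDocument();
@$dom->loadHTML($markup);
$xpath = new DOMXPath($dom);

// (...)[1] picks the first matching span in document order.
$span = $xpath->query('(//div[@id="sns-availability"]//span)[1]')->item(0);

// Trim the surrounding newlines that come from the source formatting.
$text = $span ? trim($span->textContent) : '';
echo $text, "\n";
```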
