Simple HTML DOM - Skip certain element - php

I want to ignore the contents of the <a> which is inside <h3> element and only get the text of the <h3>.
<h3>
144.000 TL
<a class="emlak-endeksi-link trackClick trackId_emlak-endeksi-link" id="emlakEndeksiLink">
Emlak Endeksi</a>
</h3>
Example: only want to get 144.000 TL and ignore the (Emlak Endeksi)
foreach ($html1->find('div.classifiedInfo h3') as $price) {
$ilanlar['price'] = $price->plaintext;
}

not very familiar with simple html dom, but ... selecting the text node via http://simplehtmldom.sourceforge.net/manual.htm#frag_find_textcomment should help?
$ilanlar['price'] = $price->find('text', 0)->plaintext;

Maybe removing the <a> tag helps:
$str = <<<str
<h3>
144.000 TL
<a class="emlak-endeksi-link trackClick trackId_emlak-endeksi-link" id="emlakEndeksiLink">
Emlak Endeksi</a>
</h3>
str;
$html = str_get_html($str);
// Find first <h3>
$h3 = $html->find('h3', 0);
// Find first <a> inside the <h3>, or use $h3->find('a') to find all of them
$a = $h3->find('a', 0);
// Remove <a> tag
$a->outertext = '';
// Output: "144.000 TL"
print trim($h3->innertext);

You can do it via regular expression.
preg_match_all('\<h3>([^\n]*\n+)+<a([^\n]*\n+)+<\/h3>\', $content, $output);
echo $output[1];
https://regex101.com/r/qM5Nlk/1

Related

How can I strip html tags except some of them?

I need to remove all html codes from a php string except:
<p>
<em>
<small>
You know, strip_tags() function is good, but it strips all html tags, how can I tell it remove all html except those tags above?
You should check out the manual: Example #1 strip_tags() example
Syntax: strip_tags ( Your-string, Allowable-Tags )
If you pass the second parameter, these tags will not be stripped.
strip_tags($string, '<p><em><small>');
According to your comment, you want to remove HTML elements only if they have some class or attribute. You'll need to build up a DOM then:
<?php
$data = <<<DATA
<div>
<p>These line shall stay</p>
<p class="myclass">Remove this one</p>
<p>I will be deleted as well</p>
<p>But keep this</p>
</div>
DATA;
$dom = new DOMDOcument();
$dom->loadHTML($data, LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($dom);
$elements_to_be_removed = $xpath->query("//*[count(#*)>0]");
foreach ($elements_to_be_removed as $element) {
$element->parentNode->removeChild($element);
}
// just to check
echo $dom->saveHTML();
?>
To change which elements shall be removed, you'll need to change the query, ie to remove all elements with the class myclass, it must read "//*[class='myclass']".

remove a child element in a element content with simple_html_dom

This is a sample html code:
<div class="leadContent">
<span> sentence 1 </span>
sentence 2
</div>
I just want to get sentence 2 (Not span tag and its content)
Is there any way to do this with simple_html_dom?
$html->find('div.leadContent', 0)->innertext;
You can remove span tag using outertext:
$html->find('div.leadContent span', 0)->outertext = '';
then you get the content
$html->find('div.leadContent', 0)->innertext;
edit:
You can also try this way:
$div = $html->find('div.leadContent', 0);
$div->find('span',0)->outertext = '';
$div will have your content.
You can try this way also,
echo $html->find('text',2)->plaintext;

How to select 2nd element with same tag using dom xpath?

I have layout like this:
<div class="fly">
<img src="a.png" class="badge">
<img class="aye" data-original="b.png" width="130" height="253" />
<div class="to">
<h4>Fly To The Moon</h4>
<div class="clearfix">
<div class="the">
<h4>**Wow**</h4>
</div>
<div class="moon">
<h4>**Great**</h4>
</div>
</div>
</div>
</div>
First I get query from xpath :
$a = $xpath->query("//div[#class='fly']""); //to get all elements in class fly
foreach ($a as $p) {
$t = $p->getElementsByTagName('img');
echo ($t->item(0)->getAttributes('data-original'));
}
When I run the code, it will produced 0 result. After I trace I found that <img class="badge"> is processed first. I want to ask, how can I get data-original value from <img class="aye">and also get the value "Wow" and "Great" from <h4> tag?
Thank you,
Alernatively, you could use another xpath query on that to add on your current code.
To get the attribute, use ->getAttribute():
$dom = new DOMDocument();
$dom->loadHTML($markup);
$xpath = new DOMXpath($dom);
$parent_div = $xpath->query("//div[#class='fly']"); //to get all elements in class fly
foreach($parent_div as $div) {
$aye = $xpath->query('./img[#class="aye"]', $div)->item(0)->getAttribute('data-original');
echo $aye . '<br/>'; // get the data-original
$others = $xpath->query('./div[#class="to"]/div[#class="clearfix"]', $div)->item(0);
foreach($xpath->query('./div/h4', $others) as $node) {
echo $node->nodeValue . '<br/>'; // echo the two h4 values
}
echo '<hr/>';
}
Sample Output
Thank you for your code!
I try the code but it fails, I don't know why. So, I change a bit of your code and it works!
$dom = new DOMDocument();
$dom->loadHTML($markup);
$xpath = new DOMXpath($dom);
$parent_div = $xpath->query("//div[#class='fly']"); //to get all elements in class fly
foreach($parent_div as $div) {
$aye = $xpath->query('**descendant::**img[#class="aye"]', $div)->item(0)->getAttribute('data-original');
echo $aye . '<br/>'; // get the data-original
$others = $xpath->query('**descendant::**div[#class="to"]/div[#class="clearfix"]', $div)->item(0);
foreach($xpath->query('.//div/h4', $others) as $node) {
echo $node->nodeValue . '<br/>'; // echo the two h4 values
}
echo '<hr/>';
}
I have no idea what is the difference between ./ and descendant but my code works fine using descendant.
given the following XML:
<div class="fly">
<img src="a.png" class="badge">
<img class="aye" data-original="b.png" width="130" height="253" />
<div class="to">
<h4>Fly To The Moon</h4>
<div class="clearfix">
<div class="the">
<h4>**Wow**</h4>
</div>
<div class="moon">
<h4>**Great**</h4>
</div>
</div>
</div>
</div>
you asked:
how can I get data-original value from <img class="aye">and also get the value "Wow" and "Great" from <h4> tag?
With XPath you can obtain the values as string directly:
string(//div[#class='fly']/img/#data-original)
This is the string from the first data-original attribute of an img tag within all divs with class="fly".
string(//div[#class='fly']//h4[not(following-sibling::*//h4)][1])
string(//div[#class='fly']//h4[not(following-sibling::*//h4)][2])
These are the string values of first and second <h4> tag that is not followed on it's own level by another <h4> tag within all divs class="fly".
This looks a bit like standing in the way right now, but with iteration, those parts in front will not be needed any longer soon because the xpath then will be relative:
//div[#class='fly']
string(./img/#data-original)
string(.//h4[not(following-sibling::*//h4)][1])
string(.//h4[not(following-sibling::*//h4)][2])
To use xpath string(...) expressions in PHP you must use DOMXPath::evaluate() instead of DOMXPath::query(). This would then look like the following:
$aye = $xpath->evaluate("string(//div[#class='fly']/img/#data-original)");
$h4_1 = $xpath->evaluate("string(//div[#class='fly']//h4[not(following-sibling::*//h4)][1])");
$h4_2 = $xpath->evaluate("string(//div[#class='fly']//h4[not(following-sibling::*//h4)][2])");
A full example with iteration and output:
// all <div> tags with class="fly"
$divs = $xpath->evaluate("//div[#class='fly']");
foreach ($divs as $div) {
// the first data-original attribute of an <img> inside $div
echo $xpath->evaluate("string(./img/#data-original)", $div), "<br/>\n";
// all <h4> tags anywhere inside the $div
$h4s = $xpath->evaluate('.//h4[not(following-sibling::*//h4)]', $div);
foreach ($h4s as $h4) {
echo $h4->nodeValue, "<br/>\n";
}
}
As the example shows, you can use evaluate as well for node-lists, too. Obtaining the values from all <h4> tags it not with string() any longer as there could be more than just two I assume.
Online Demo including special string output (just exemplary):
echo <<<HTML
{$xpath->evaluate("string(//div[#class='fly']/img/#data-original)")}<br/>
{$xpath->evaluate("string(//div[#class='fly']//h4[not(following-sibling::*//h4)][1])")}<br/>
{$xpath->evaluate("string(//div[#class='fly']//h4[not(following-sibling::*//h4)][2])")}<br/>
<hr/>
HTML;

Get the inner text using curl concept in php

This is html text in the website, i want to grab
1,000 Places To See Before You Die
<ul class="listings">
<li>
<a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
1,000 Places To See Before You Die
<span class="epnum">2009</span>
</a>
</li>
I used the code like this
foreach($html->find('ul.listings li a') as $e)
echo $e->innertext. '<br/>';
The output i am getting is like
999: Whats Your Emergency<span class="epnum">2012</span>
including the span pls help me this
Why not DOMDocument and get title attribute?:
$string = '<ul class="listings">
<li>
<a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
1,000 Places To See Before You Die
<span class="epnum">2009</span>
</a>
</li>';
$dom = new DOMDocument;
$dom->loadHTML($string);
$xpath = new DOMXPath($dom);
$text = $xpath->query('//ul[#class="listings"]/li/a/#title')->item(0)->nodeValue;
echo $text;
or
$text = explode("\n", trim($xpath->query('//ul[#class="listings"]/li/a')->item(0)->nodeValue));
echo $text[0];
Codepad Example
There are 2 ways that I could think of to solve this. One, is that you grab the title attribute from the anchor tag. Of course, not everyone set up a title attribute for the anchor tag and the value of the attribute could be different if they want to fill it that way. The other solution is that, you get the innertext attribute and then replace every child of the anchor tag with an empty value.
So, either do this
$e->title;
or this
$text = $e->innertext;
foreach ($e->children() as $child)
{
$text = str_replace($child, '', $text);
}
Though, it might be a good idea to use DOMDocument instead for this.
You can use strip_tags() for that
echo trim(strip_tags($e->innertext));
Or try to use preg_replace() to remove unwanted tag and its content
echo preg_replace('/<span[^>]*>([\s\S]*?)<\/span[^>]*>/', '', $e->innertext);
Use plaintext instead.
echo $e->plaintext;
But still the year will be present which you can trim off using regexp.
Example from the documentation here:
$html = str_get_html("<div>foo <b>bar</b></div>");
$e = $html->find("div", 0);
echo $e->tag; // Returns: " div"
echo $e->outertext; // Returns: " <div>foo <b>bar</b></div>"
echo $e->innertext; // Returns: " foo <b>bar</b>"
echo $e->plaintext; // Returns: " foo bar"
First of all check your html. Now it is like
$string = '<ul class="listings">
<li>
<a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
1,000 Places To See Before You Die
<span class="epnum">2009</span>
</a>
</li>';
There is no close tag for ul, perhaps you missed it.
$string = '<ul class="listings">
<li>
<a href="http://watchseries.eu/serie/1,000_places_to_see_before_you_die" title="1,000 Places To See Before You Die">
1,000 Places To See Before You Die
<span class="epnum">2009</span>
</a>
</li>
</ul>';
Try like this
$xml = simplexml_load_string($string);
echo $xml->li->a['title'];

create anchors in a page with the content of <h2></h2> in PHP

Well I have a html text string in a variable:
$html = "<h1>title</h1><h2>subtitle 1</h2> <h2>subtitle 2</h2>";
so I want to create anchors in each subtitle that has with the same name and then print the html code to browser and also get the subtitles as an array.
I think is using regex.. please help.
I think this will do the trick for you:
$pattern = "|<h2>(.*)</h2>|U";
preg_match_all($pattern,$html,$matches);
foreach($matches[1] as $match)
$html = str_replace($match, "<a name='".$match."' />".$match, $html);
$array_of_elements = $matches[1];
Just make sure that $html has the existing html before this code starts. Then it will have an <a name='foo' /> added after this completes, and $array_of_elements will have the array of matching text values.

Categories