Extract only first level paragraphs from html


I have the following html:
<div id="myID">
<p>I want this</p>
<p>and I want this</p>
<div>
<p>I don't want this</p>
</div>
</div>
I want to extract only the first level <p>...</p> elements.
I've tried using the excellent simple_html_dom library, e.g. $html->find('#myID p'), but in the case above this finds all three <p>...</p> elements.
Is there a better way to do this?

Instead of relying on an external library, why not use the built-in classes for handling the DOM?
First create a DOMDocument instance using your HTML:
$dom = new DOMDocument();
$dom->loadHTML($yourHtml);
After that use DOMXPath to select your elements:
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//*[@id='myID']/p");
var_dump($nodes->length); // outputs 2
This selects all p elements that are direct children of the element with the id myID.
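To make the difference between the child axis (/p) and the descendant axis (//p) concrete, here is a minimal sketch run against the markup from the question. The variable names $direct, $all and $texts are only illustrative.

```php
<?php
// Sample markup from the question.
$yourHtml = <<<HTML
<div id="myID">
<p>I want this</p>
<p>and I want this</p>
<div>
<p>I don't want this</p>
</div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($yourHtml);
$xpath = new DOMXPath($dom);

// "/p" matches direct children only; "//p" also matches the nested paragraph.
$direct = $xpath->query("//*[@id='myID']/p");
$all    = $xpath->query("//*[@id='myID']//p");

// Collect the text of the first-level paragraphs.
$texts = [];
foreach ($direct as $p) {
    $texts[] = $p->textContent;
}

echo $direct->length, " direct, ", $all->length, " in total\n";
echo implode("\n", $texts), "\n";
```

The same pattern should work for any container: only the predicate in front of /p changes.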

Related

Target element within specific element domdocument

I want to target a tags with class genre within parent div with id test:
<div id="test">
<a class="genre">hello</a>
<a class="genre">hello2</a>
</div>
So far, I can get all the genre a tags:
$xpath = new DOMXPath($doc);
$elements = $xpath->query('//a[@class="genre"]');
... but I want to adjust //a[@class="genre"] so I only target the ones within the test div.

I'm not sure why you didn't write it yourself, since your expression already uses every XPath feature you need. Or maybe I've misunderstood your question.
$elements = $xpath->query('//div[@id="test"]/a[@class="genre"]');
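An alternative that may read more clearly is to locate the container first and pass it as the context node, which is the second argument of DOMXPath::query(). This is only a sketch: the second div with id "other" is made-up markup, added to show that links outside the container are skipped.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML(
    '<div id="test"><a class="genre">hello</a><a class="genre">hello2</a></div>'
    . '<div id="other"><a class="genre">elsewhere</a></div>' // hypothetical extra markup
);
$xpath = new DOMXPath($doc);

// Step 1: find the container element.
$testDiv = $xpath->query('//div[@id="test"]')->item(0);

// Step 2: run a relative expression (no leading //) with the container as context.
$labels = [];
foreach ($xpath->query('a[@class="genre"]', $testDiv) as $a) {
    $labels[] = $a->textContent;
}

echo implode(', ', $labels), "\n";
```

This is handy when several queries have to run against the same container.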

how to remove parent element using php?

I want to get the HTML inside the parent element using php. For example, I have this structure:
<p>
<p>this is my first xml file </p>
</p>
and I want to get below text as a result.
<p>this is my first xml file </p>

Make use of a DOM Parser
<?php
$html='<p>
<p>this is my first xml file </p>
</p>';
$dom = new DOMDocument;
@$dom->loadHTML($html); // @ suppresses the parser warnings caused by the invalid nested <p>
foreach ($dom->getElementsByTagName('p') as $tag){
if(!empty($tag->nodeValue)){ echo $tag->nodeValue;}
}
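The loop above prints only the text. If the markup of the children is wanted as well, one possible approach is to serialize each child node of the wrapper with DOMDocument::saveHTML(). The sketch below assumes the wrapper is a div rather than a p, because an HTML parser will not keep a p nested inside another p.

```php
<?php
// Assumption: the outer element is a <div>, not the nested <p> from the question.
$html = '<div><p>this is my first xml file </p></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Take the wrapper element and serialize each of its children in turn.
$wrapper = $dom->getElementsByTagName('div')->item(0);
$inner = '';
foreach ($wrapper->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}

echo $inner, "\n";
```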

For each div tag, take its contents

I'm trying to loop through the code of an HTML page and reformat its contents. It has a few divs within divs, which I want to extract. I've tried various forms of explode, regex and DOM, but can't find exactly how to do this.
Example:
<div class="section1">
<div class="section2">number 1</div>
</div>
<div class="section1">
<div class="section2">number 2</div>
</div>
The result I'm looking for is basically, for each section 1, get contents from section 2, so the output would be:
number 1, number 2
Does anyone know how to do something like this?

Should be pretty easy with DOMXPath:
$doc = new DOMDocument;
$doc->loadHTML(/*...*/); // load the HTML here
$xpath = new DOMXPath($doc);
$result = $xpath->query("//div[@class='section1']/div[@class='section2']/text()");
foreach ($result as $item) {
echo "$item->wholeText\n";
}
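For the exact output format from the question (number 1, number 2), the matched values can be collected into an array and joined. A minimal sketch using the sample markup; $parts and $output are illustrative names.

```php
<?php
$html = <<<HTML
<div class="section1">
<div class="section2">number 1</div>
</div>
<div class="section1">
<div class="section2">number 2</div>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Select each section2 that is a direct child of a section1 and keep its text.
$parts = [];
foreach ($xpath->query("//div[@class='section1']/div[@class='section2']") as $div) {
    $parts[] = trim($div->textContent);
}

$output = implode(', ', $parts);
echo $output, "\n";
```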

This is a jQuery solution, not PHP:
$('.section1').each(function() {
return $(this).html();
});

DOMXPath union extract with PHP

I'm trying to get img and the div which is coming after the div which contains that img, all in one query.
So I did this:
$nodes = $xpath->query('//div[starts-with(@id, "someid")]/img |
//div[starts-with(@id, "someid")]/following-sibling::div[@class="spec_class"][1]/text()');
Now, I'm able to get the attributes of img tag, but I can't get the text of the following sibling. If I separate the query (two queries - first for the img and second query for the sibling) it works. But how can I do this with only one query? By the way, there is no error in the syntax. But somehow the union doesn't work or maybe I'm not extracting the sibling content right.
Here's the markup (which repeats many times with different text and id="someid_%randomNumber%"):
<div id="someid_1">
<img src="link_to_image.png" />
...some text...
</div>
<div>...another text...</div>
<div class="spec_class">
...Important text...
</div>
I want to get in one query both link_to_image.png and ...Important text...

Your query seems correct.
Example XML:
<div>
<div id="someid-1"><img src="foo"/></div>
<div class="spec_class">bar</div>
<div class="spec_class">baz</div>
</div>
Example PHP Code:
$dom = new DOMDocument;
$dom->loadXml($xhtml);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//div…') as $node) {
echo $dom->saveXML($node);
}
Outputs:
<img src="foo"/>bar
Note that you will have to iterate the DOMNodeList returned by the XPath query.
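Because the union returns a mixed list (an element node for the img and a text node for the sibling), the loop has to treat the two kinds differently. Here is a sketch using the query from the question on a well-formed copy of the asker's markup. The outer wrapper div and the $values array are assumptions added for the example.

```php
<?php
// Well-formed copy of the question's markup, wrapped in one root element.
$xhtml = <<<XML
<div>
<div id="someid_1"><img src="link_to_image.png"/>...some text...</div>
<div>...another text...</div>
<div class="spec_class">...Important text...</div>
</div>
XML;

$dom = new DOMDocument();
$dom->loadXML($xhtml);
$xpath = new DOMXPath($dom);

$query = '//div[starts-with(@id, "someid")]/img'
       . ' | //div[starts-with(@id, "someid")]'
       . '/following-sibling::div[@class="spec_class"][1]/text()';

$values = [];
foreach ($xpath->query($query) as $node) {
    if ($node instanceof DOMElement) {
        // The img element: read its src attribute.
        $values[] = $node->getAttribute('src');
    } else {
        // The text node of the following spec_class div.
        $values[] = trim($node->nodeValue);
    }
}

echo implode("\n", $values), "\n";
```

Reading only attributes from every result would skip the text node, which may be why the sibling text appeared to be missing.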

php: how can I work with html as xml ? how do i find specific nodes and get the text inside these nodes?

Let's say I have the following web page:
<html>
<body>
<div class="transform">
<span>1</span>
</div>
<div class="transform">
<span>2</span>
</div>
<div class="transform">
<span>3</span>
</div>
</body>
</html>
I would like to find all div elements that have the class transform and fetch the text in each of them.
I know I can do that easily with regular expressions, but I would like to know how to do it without them, by parsing the XML and finding the nodes I need.
Update:
I know that in this example I can just iterate through all the divs, but it is only an illustration of what I need.
Here I need to query for divs that have the attribute class=transform.
Thanks!

You could use SimpleXML; see the example below:
$string = "<?xml version='1.0'?>
<html>
<body>
<div class='transform'>
<span>1</span>
</div>
<div>
<span>2</span>
</div>
<div class='transform'>
<span>3</span>
</div>
</body>
</html>";
$xml = simplexml_load_string($string);
$result = $xml->xpath("//div[@class = 'transform']");
foreach($result as $node) {
echo "span " . $node->span . "<br />";
}
Updated it with xpath...

You can use xpath to address the items. For that particular query, you'd use:
div[contains(concat(" ",@class," "), concat(" ","transform"," "))]
Full PHP example:
<?php
$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);
// The leading // is needed here: without it the expression is evaluated relative to the document node and matches nothing.
foreach ($xpath->query('//div[contains(concat(" ",@class," "), concat(" ","transform"," "))]') as $div) {
var_dump($div);
}
If you know CSS, here's a handy CSS-selector to XPath-expression mapping: http://plasmasturm.org/log/444/ -- You can find the above example listed there, as well as other common queries.
If you use it a lot, you might find my csslib library handy. It offers a wrapper csslib_DomCssQuery, which is similar to DomXPath, but using CSS-selectors instead.
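To illustrate why the concat/contains form is worth the extra typing, this sketch compares it with a plain equality test. The three divs are made-up sample markup, and normalize-space() is an optional addition that guards against stray whitespace in the attribute.

```php
<?php
// Hypothetical markup: one exact match, one multi-class match, one near miss.
$html = '<div class="transform"><span>1</span></div>'
      . '<div class="box transform"><span>2</span></div>'
      . '<div class="transformed"><span>3</span></div>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Plain equality only matches when the attribute is exactly "transform".
$exact = $xpath->query('//div[@class="transform"]');

// Token match: pad the attribute with spaces and look for " transform ".
$token = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " transform ")]'
);

$texts = [];
foreach ($token as $div) {
    $texts[] = trim($div->textContent);
}

echo $exact->length, " exact, ", $token->length, " by token\n";
```

The token form behaves like the CSS selector div.transform: it accepts "box transform" but rejects "transformed".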

OK, what I wanted can be easily achieved using PHP's XPath support.
Example:
http://ditio.net/2008/12/01/php-xpath-tutorial-advanced-xml-part-1/
