xpath extract complete html


I am trying to extract a complete table including the HTML tags, with XPath, that I can store in a variable, do a bit of string replacement on, then echo directly to the screen. I have found numerous posts on getting the text out of the table but I want to retain the HTML formatting since I am just going to display it (after minor modification).
At present I am extracting the table using string functions stristr, substr etc. but I would prefer to use XPath.
I can display the contents of the table with the following but it just displays the table TD fields with no formatting. It also does not store it in a variable that I can manipulate.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$arr = $xpath->query('//table');
foreach($arr as $el) {
echo $el->textContent;
}
I tried this but got no output:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$arr = $xpath->query('//table');
echo $arr->saveHTML();

Use DOMNode::C14N(), which returns the node's markup as a string:
foreach($arr as $el) {
echo $el->C14N();
}
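For the original goal (keep the markup in a variable, tweak it, then print it), here is a minimal sketch that uses DOMDocument::saveHTML() with a node argument, which is available from PHP 5.3.6. The sample $html string and the 'Old' to 'New' replacement are made up for illustration.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$html = '<html><body><p>Intro</p><table><tr><td>Old</td></tr></table></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$tables = array();
foreach ($xpath->query('//table') as $table) {
    // saveHTML($node) serialises only that node, tags included.
    $tables[] = $dom->saveHTML($table);
}

// Illustrative string replacement on the first table's markup.
$output = str_replace('Old', 'New', $tables[0]);
echo $output;
```

The second attempt in the question printed nothing because query() returns a DOMNodeList, which has no saveHTML() method; the serialisation has to go through the document, one node at a time.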

Related

php - loadHTML() - every <p> until a certain class

I'm calling some wikipedia content two different way:
$html = file_get_contents('https://en.wikipedia.org/wiki/Sans-serif');
The first one is to call the first paragraph
$dom = new DomDocument();
@$dom->loadHTML($html);
$p = $dom->getElementsByTagName('p')->item(0)->nodeValue;
echo $p;
The second one is to call the first paragraph after a specific $id
$dom = new DOMDocument();
@$dom->loadHTML($html);
$p = $dom->getElementById($id)->getElementsByTagName('p')->item(0);
echo $p->nodeValue;
I'm looking for a third way: getting the whole introductory part.
So I was thinking about selecting every <p> that comes before the element with the id or class "toc", which is the id/class of the table of contents.
Any idea how to do that?

If you're just looking for the intro in plain text, you can simply use Wikipedia's API:
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles=Sans-serif
If you want HTML formatting as well (excluding inner images and the likes):
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&titles=Sans-serif

You could use DOMDocument and DOMXPath with, for example, an XPath expression like:
//div[@id="toc"]/preceding-sibling::p
$doc = new DOMDocument();
// loadHTMLFile() parses HTML; load() expects well-formed XML
@$doc->loadHTMLFile("https://en.wikipedia.org/wiki/Sans-serif");
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="toc"]/preceding-sibling::p');
foreach ($nodes as $node) {
echo $node->nodeValue;
}
That would give you the content of the paragraphs preceding the div with id = toc.
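To try the expression without fetching the live page, here is a small sketch against an inline string. The markup is invented for illustration and only mimics the assumed layout, in which the intro paragraphs and the div with id "toc" are siblings.

```php
<?php
// Hypothetical markup: two intro paragraphs, a table of contents, then body text.
$html = '<div id="content">'
      . '<p>First intro paragraph.</p>'
      . '<p>Second intro paragraph.</p>'
      . '<div id="toc">Contents</div>'
      . '<p>Body paragraph.</p>'
      . '</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$intro = array();
// preceding-sibling::p selects only <p> elements that share a parent with the toc div.
foreach ($xpath->query('//div[@id="toc"]/preceding-sibling::p') as $p) {
    $intro[] = $p->textContent;
}
print_r($intro);
```

If the real page nests the table of contents deeper than the paragraphs, the sibling axis will match nothing, so the expression may need adjusting to the actual structure.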

Get next 17 letters after keyword

I have this keyword: yt-lookup-title.
I want the next 17 letters after this in a variable. So I would have:
"<a href="/watch?v=HnlC81tWoY8"
How can I achieve this for every line that contains this keyword?

If you want to get the href content, you can rely on DOMDocument.
If I'm not mistaken, all the links (<a>) have the class yt-uix-tile-link, so you can do the following:
$dom = new DOMDocument;
// $html is a string containing the html of the page you're parsing
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$links = array();
$nodes = $xpath->query('//a[@class="yt-uix-tile-link"]/@href');
foreach ($nodes as $node) {
$links [] = $node->nodeValue;
}
var_dump ($links);
Hope that helps
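One caveat: @class="yt-uix-tile-link" only matches when that is the entire class attribute. If the links carry additional classes, a token-style match is a common workaround. A minimal sketch follows; the markup and the extra class name spf-link are hypothetical.

```php
<?php
// Hypothetical markup; the first link carries a second, made-up class.
$html = '<div>'
      . '<a class="yt-uix-tile-link spf-link" href="/watch?v=HnlC81tWoY8">One</a>'
      . '<a class="other" href="/about">Two</a>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Pad the class list with spaces so the name is matched as a whole token.
$query = '//a[contains(concat(" ", normalize-space(@class), " "), " yt-uix-tile-link ")]/@href';

$links = array();
foreach ($xpath->query($query) as $attr) {
    $links[] = $attr->nodeValue;
}
var_dump($links);
```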

PHP DOMDocument how to get that content of this tag?

I am using domDocument hoping to parse this little html code. I am looking for a specific span tag with a specific id.
<span id="CPHCenter_lblOperandName">Hello world</span>
My code:
$dom = new domDocument;
@$dom->loadHTML($html); // the @ is to silence errors and misconfigures of HTML
$dom->preserveWhiteSpace = false;
$nodes = $dom->getElementsByTagName('//span[@id="CPHCenter_lblOperandName"');
foreach($nodes as $node){
echo $node->nodeValue;
}
But for some reason I think something is wrong with either the code or the HTML (how can I tell?):
When I count nodes with echo count($nodes); the result is always 1
I get nothing outputted in the nodes loop
How can I learn the syntax of these complex queries?
What did I do wrong?

You can use simple getElementById:
$dom->getElementById('CPHCenter_lblOperandName')->nodeValue
or with an XPath query:
$selector = new DOMXPath($dom);
$list = $selector->query('/html/body//span[@id="CPHCenter_lblOperandName"]');
echo($list->item(0)->nodeValue);
//or
foreach($list as $span) {
$text = $span->nodeValue;
}

Your four part question gets an answer in three parts:
getElementsByTagName does not take an XPath expression, you need to give it a tag name;
Nothing is output because no tag would ever match the tagname you provided (see #1);
It looks like what you want is XPath, which means you need to create an XPath object - see the PHP docs for more;
Also, a better method of controlling the libxml errors is to use libxml_use_internal_errors(true) (rather than the '@' operator, which will also hide other, more legitimate errors). That would leave you with code that looks something like this:
<?php
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query("//span[@id='CPHCenter_lblOperandName']") as $node) {
echo $node->textContent;
}
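On the "how can I tell?" part: once internal error handling is on, the parser's complaints can be read back with libxml_get_errors(). A minimal sketch follows; the malformed sample markup (a stray closing </b>) is invented for illustration, and the exact messages depend on the libxml version.

```php
<?php
// Hypothetical markup with a stray closing tag, standing in for the real page.
$html = '<div><span id="CPHCenter_lblOperandName">Hello world</span></b></div>';

libxml_use_internal_errors(true);   // collect parser errors instead of emitting warnings
$dom = new DOMDocument();
$dom->loadHTML($html);

$errors = libxml_get_errors();      // array of LibXMLError objects
foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
libxml_clear_errors();              // empty the buffer so later parses start clean

$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//span[@id='CPHCenter_lblOperandName']");
echo $nodes->length, "\n";          // number of matching nodes
```

Note that the sketch reads $nodes->length rather than calling count(). On PHP versions where DOMNodeList is not countable, count() on the list returns 1 regardless of its contents, which likely explains the constant 1 seen in the question.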

Retrieve a href titles containing a specific string in url php

I have the following code, and I want to retrieve only the titles of the <a> links that have /movie/ in their URL.
function get_a_contentmovies(){
$h1count = preg_match_all("/(<a.*>)(\w.*)(<.*>)/ismU",$this->DataFromSite,$patterns);
return $patterns[2];
}

You can use DOMXpath like this:
$dom = new DomDocument();
$dom->loadHTML($string);
$xpath = new DOMXpath($dom);
$elements = $xpath->query("//a[contains(@href, '/movie/')]");
foreach($elements as $el) {
var_dump($el->getAttribute('title'));
}

Using regex to parse (X)HTML is a bad idea. You should use a DOM parser such as DOMDocument.

convert part of dom element to string with html tags inside of them

I need to convert part of a DOM element to a string, keeping the HTML tags inside it.
I tried the following, but it prints just the text, without the tags.
$dom = new DOMDocument();
$dom->loadHTMLFile('http://www.pixmania-pro.co.uk/gb/uk/08920684/art/packard-bell/easynote-tm89-gu-015uk.html');
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//table');
foreach($elements as $element)
echo $element->nodeValue;
I want all the tags as they are, along with the content inside the tables. Can someone help me? It would be a great help.
Thanks.

Current solution:
foreach($elements as $element){
echo $dom->saveHTML($element);
}
Old answer (PHP < 5.3.6):
1. Create a new instance of DOMDocument.
2. Clone the node (with all its sub-nodes) that you wish to save as HTML.
3. Import the cloned node into the new instance of DOMDocument and append it as a child.
4. Save the new instance as HTML.
So something like this:
foreach($elements as $element){
$newdoc = new DOMDocument();
$cloned = $element->cloneNode(TRUE);
$newdoc->appendChild($newdoc->importNode($cloned,TRUE));
echo $newdoc->saveHTML();
}

With PHP 5.3.6 or higher you can pass a node to DOMDocument::saveHTML:
foreach($elements as $element){
echo $dom->saveHTML($element);
}
