HTML:
<div class="something" data-a="abc">ddsf</d>
PHP:
foreach ($dom->find('.something[data-rel]') as $this) {
var_dump($this->attr());
}
I tried this but error. Couldn't find any info on its documentation. I want to get the data-a's value which is abc.
Why not just use the well-documented, built in, DOM extension?
Example:
$html = '<div class="something" data-a="abc">ddsf</div>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div[#class="something"]/#data-a');
foreach ($nodes as $node) {
var_dump($node->value);
}
Output:
string(3) "abc"
It looks like this:
$dom->find('div[data-a]',0)->{'data-a'}
use
foreach($html->find('button') as $element)
echo $element->{'data-coupon'};
Use xpath
Should be something like this:
foreach ($dom->xpath('/div[#data-a]') as $item) {
...
}
Related
I am trying to extract specific type of links from the webpage using php
links are like following..
http://www.example.com/pages/12345667/some-texts-available-here
I want to extract all links like in the above format.
maindomain.com/pages/somenumbers/sometexts
So far I can extract all the links from the webpage, but the above filter is not happening. How can i acheive this ?
Any suggestions ?
<?php
$html = file_get_contents('http://www.example.com');
//Create a new DOM document
$dom = new DOMDocument;
#$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
//Extract and show the "href" attribute.
echo $link->nodeValue;
echo $link->getAttribute('href'), '<br>';
}
?>
You can use DOMXPath and register a function with DOMXPath::registerPhpFunctions to use it after in an XPATH query:
function checkURL($url) {
$parts = parse_url($url);
unset($parts['scheme']);
if ( count($parts) == 2 &&
isset($parts['host']) &&
isset($parts['path']) &&
preg_match('~^/pages/[0-9]+/[^/]+$~', $parts['path']) ) {
return true;
}
return false;
}
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile($filename);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPhpFunctions('checkURL');
$links = $xp->query("//a[php:functionString('checkURL', #href)]");
foreach ($links as $link) {
echo $link->getAttribute('href'), PHP_EOL;
}
In this way you extract only the links you want.
This is a slight guess, but if I got it wrong you can still see the way to do it.
foreach ($links as $link){
//Extract and show the "href" attribute.
If(preg_match("/(?:http.*)maindomain\.com\/pages\/\d+\/.*/",$link->getAttribute('href')){
echo $link->nodeValue;
echo $link->getAttribute('href'), '<br>';
}
}
You already use a parser, so you might step forward and use an xpath query on the DOM. XPath queries offer functions like starts-with() as well, so this might work:
$xpath = new DOMXpath($dom);
$links = $xpath->query("//a[starts-with(#href, 'maindomain.com')]");
Loop over them afterwards:
foreach ($links as $link) {
// do sth. with it here
// after all, it is a DOMElement
}
<li data-docid="thisisthevaluetoget" class="search-results-item">
</li>
How to get the value of "data-docid"?
You can use DOMDocument to get at the attributes:
$html = '<li data-docid="thisisthevaluetoget" class="search-results-item"></li>';
$doc = new DOMDocument;
$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('li');
foreach ($nodes as $node) {
if ($node->hasAttributes()) {
foreach ($node->attributes as $a) {
echo $a->nodeName.': '.$a->nodeValue.'<br/>';
}
}
}
You may do this using JavaScript + jQuery. You may get the value and pass it into another php file using $_GET method.
an example is here
I have a var of a HTTP (craigslist) link $link, and put the contents into $linkhtml. In this var is the HTML code for a craigslist page, $link.
I need to extract the text between <h2> and </h2>. I could use a regexp, but how do I do this with PHP DOM? I have this so far:
$linkhtml= file_get_contents($link);
$dom = new DOMDocument;
#$dom->loadHTML($linkhtml);
What do I do next to put the contents of the element <h2> into a var $title?
if DOMDocument looks complicated to understand/use to you, then you may try PHP Simple HTML DOM Parser which provides the easiest ever way to parse html.
require 'simple_html_dom.php';
$html = '<h1>Header 1</h1><h2>Header 2</h2>';
$dom = new simple_html_dom();
$dom->load( $html );
$title = $dom->find('h2',0)->plaintext;
echo $title; // outputs: Header 2
You can use this code:
$linkhtml= file_get_contents($link);
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($linkhtml); // loads your html
$xpath = new DOMXPath($doc);
$h2text = $xpath->evaluate("string(//h2/text())");
// $h2text is your text between <h2> and </h2>
You can do this with XPath: untested, may contain errors
$linkhtml= file_get_contents($link);
$dom = new DOMDocument;
#$dom->loadHTML($linkhtml);
$xpath = new DOMXpath($dom);
$elements = $xpath->query("/html/body/h2");
if (!is_null($elements)) {
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
The div is like this
<div style="width:90%;margin:0 auto;color:#Black;" id="content">
this is text, severaltags
</div>
how should i get the div's content including the tags using dom in php?
Assuming your using PHP5 you can use DOMDocument -- take note that this doesn't provide simple means for retrieving inner html of an element. You can do something along the following:
function DOMinnerHTML($element)
{
$innerHTML = "";
$children = $element->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($child, true));
$innerHTML.=trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
$dom = new DOMDocument();
$dom->loadHTML($html);
$items = $dom->getElementsByTagName('div');
if ($items->length)
{
$innerHTML = DOMinnerHTML($items->item(0));
}
echo $innerHTML;
For something this simple, although I don't normally recommend it, I'd use regex:
preg_match('|<div[^>]+>(.*?)</div>|is', $html, $match);
if ($match)
{
echo 'html is: ' . $match[1][0];
}
Something like this?
$document = new DOMDocument();
$document->loadHTML($html);
$element = $document->getElementById('content');
To get the values, you can try something like this
$doc = new DOMDocument();
$doc->loadHTMLFile('link-t0-html-file.php');
$xpath = new DOMXPath($doc);
$element = $xpath->query("//*[#id='content']")->item(0);
echo $element->nodeValue;
if i am not wrong you want this
echo "< div style='width:90%;margin:0 auto;color:#000000;font-size:14px;line-height:24px;'
id='content'>";
echo "this is text, several `<br/>` tags";
echo "< /div>";
just mind it never use double quote (") within double quote ("). use single quote(') within double quote.
My PHP code
$dom = new DOMDocument();
#$dom->loadHTML($file);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//div[#class="text"]');
foreach ($tags as $tag) {
echo $tag->textContent;
}
What I'm trying to do here is to get the content of the div that has class 'text' but the problem when I loop and echo the results I only get the text I can't get the HTML code with images and all the HTML tags like p, br, img... etc i tried to use $tag->nodeValue; but also nothing worked out.
Personally, I like Simple HTML Dom Parser.
include "lib.simple_html_dom.php"
$html = str_get_html($file);
foreach($html->find('div.text') as $e){
echo $e->innertext;
}
Pretty simple, huh? It accommodates selectors like jQuery :)
What you need to do is create a temporary document, add the element to that and then use saveHTML():
foreach ($tags as $tag) {
$doc = new DOMDocument;
$doc->appendChild($doc->importNode($tag, true));
$html = $doc->saveHTML();
}
I found this snippet at http://www.php.net/manual/en/class.domelement.php:
<?php
function getInnerHTML($Node)
{
$Body = $Node->ownerDocument->documentElement->firstChild->firstChild;
$Document = new DOMDocument();
$Document->appendChild($Document->importNode($Body,true));
return $Document->saveHTML();
}
?>
Not sure if it works though.