This question already has answers here:
Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?
(2 answers)
Closed 8 years ago.
libxml_use_internal_errors(true);
$url = 'http://thepiratebay.is/browse/200/0/7';
$html = file_get_contents($url);
$dom = new \DOMDocument();
$dom->loadHTML($html);
$x = new \DOMXPath($dom);
$nodeList = $x->query('/html/body/div[2]/div[2]/table/tbody/tr');
foreach ($nodeList as $node) {
die(var_dump($node));
}
Gives me the error:
"Invalid argument supplied for foreach()"
Not sure why xpath doesn't work on that domain?
If I'm right, you'd like to get all the titles in that table. I'd suggest a simpler yet more specific XPath query, i.e.
$nodeList = $x->query('//div[@class="detName"]');
See it in action
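A likely reason the original path fails is the tbody step: browser inspectors show a tbody element that the browser itself inserts, while the markup PHP downloads often has none. The sketch below uses a made-up table (not the real page) to compare a path with and without tbody; the markup and variable names are assumptions for illustration only.

```php
<?php
// Hypothetical markup: a table served without <tbody>, as many pages are.
$html = '<html><body><table>'
      . '<tr><td>row 1</td></tr><tr><td>row 2</td></tr>'
      . '</table></body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Path as copied from a browser inspector, including the browser-added <tbody>.
$withTbody = $xpath->query('//table/tbody/tr');

// Path that does not depend on <tbody> being present.
$withoutTbody = $xpath->query('//table//tr');

echo 'with tbody: ', $withTbody->length, "\n";
echo 'without tbody: ', $withoutTbody->length, "\n";
```

Using the descendant axis (//tr) keeps the query working whether or not the served markup contains tbody.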
This question already has answers here:
How to extract a node attribute from XML using PHP's DOM Parser
(3 answers)
Closed 3 years ago.
I have the following html
<div class="logo">***® text.<sup>TM</sup></div>
I would like to get the value of href with php dom xpath, how would I accomplish that?
This is what I have tried:
$anchors = $domXpath->query("//div[@class='logo']/a");
foreach($anchors as $a)
{
print $a->nodeValue." - ".$a->getAttribute("href")."<br/>";
}
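Here is the solution. Note that query() returns a DOMNodeList, so take a node from the list before reading its attribute:
$xpath = new DOMXpath($dom);
$link = $xpath->query('//div[@class="logo"]/a');
echo $link->item(0)->getAttribute('href');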
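For a self-contained illustration, here is a minimal sketch of two ways to read the href. The anchor element and its URL are assumed, since the HTML shown in the question does not include one.

```php
<?php
// Hypothetical markup: the <a> element and its URL are assumptions for this example.
$html = '<div class="logo"><a href="https://example.com/">Example</a> text.</div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Option 1: select the element, then read the attribute from the first match.
$links = $xpath->query('//div[@class="logo"]/a');
$href = $links->length > 0 ? $links->item(0)->getAttribute('href') : null;

// Option 2: let XPath convert the attribute node to a string directly.
$hrefString = $xpath->evaluate('string(//div[@class="logo"]/a/@href)');

var_dump($href, $hrefString);
```

Checking length first avoids calling a method on null when nothing matches; the string() form returns an empty string in that case.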
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 5 years ago.
I am trying to get a specific div element (i.e. the one with attribute id="vung_doc") from a website, but I get almost every element. Do you have any idea what's wrong?
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = true;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://lightnovelgate.com/chapter/epoch_of_twilight/chapter_300');
$xpath = new DOMXPath($doc);
$query = "//*[@class='vung_doc']";
$entries = $xpath->query($query);
var_dump($entries->item(0)->textContent);
Actually, it appears that that one element, which has both id and class attributes with value vung_doc, has many paragraphs inside its text content. Perhaps you are thinking each paragraph should be in its own div element.
<div id="vung_doc" class="vung_doc" style="font-size: 18px;">
<p></p>
"Mayor song..."
In the screenshot at the bottom of this post, I added an outline style to that element, to show just how many paragraphs are within that element.
If you wanted to separate the paragraphs, you could use preg_split() to split on any new line characters:
$entries = $xpath->query($query);
foreach($entries as $entry) {
$paragraphs = preg_split("/[\r\n]+/s",$entry->textContent);
foreach($paragraphs as $paragraph) {
if (trim($paragraph)) {
echo '<b>paragraph:</b> '.$paragraph;
break;
}
}
}
See a demonstration of this in this playground example. Note that before loading the HTML file, libxml_use_internal_errors() is called, to suppress the XML errors:
libxml_use_internal_errors(true);
Screenshot of the target div element with outline added:
Change
$query = "//*[@class='vung_doc']";
to
$query = "//*[@id='vung_doc']";
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 6 years ago.
I've got something like this:
$string = '<some code before><div class="abc">Something written here</div><some other code after>'
What I want is to get what is within the div and output it:
Something written here
How can I do that in php? Thanks in advance!
You would use the DOMDocument class.
// HTML document stored in a string
$html = '<strong><div class="abc">Something written here</div></strong>';
// Load the HTML document
$dom = new DOMDocument();
$dom->loadHTML($html);
// Find div with class 'abc'
$xpath = new DOMXPath($dom);
$result = $xpath->query('//div[@class="abc"]');
// Echo the results...
if($result->length > 0) {
foreach($result as $node) {
echo $node->nodeValue,"\n";
}
} else {
echo "Empty result set\n";
}
Read up on the expression syntax for XPath to customize your DOM searches.
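One customization that often comes up: @class="abc" compares the whole attribute value, so it misses elements that carry additional classes. A common XPath 1.0 idiom is to match a single class token instead. The sketch below uses made-up markup to show the difference.

```php
<?php
// Hypothetical markup: one div has "abc" among several classes, the other only looks similar.
$html = '<div class="abc highlighted">First</div><div class="abcd">Second</div>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Exact comparison: the attribute must be exactly "abc".
$exact = $xpath->query('//div[@class="abc"]');

// Token comparison: pad the class list with spaces and look for " abc ".
$token = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " abc ")]'
);

echo 'exact matches: ', $exact->length, "\n";
echo 'token matches: ', $token->length, "\n";
```

The padding with spaces is what stops "abcd" from matching as a substring.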
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 7 years ago.
I need to extract all the links to news articles from the NY Times RSS feed to a MySQL database periodically. How do I go about doing this? Can I use some regular expression (in PHP) to match the links? Or is there some other alternative way? Thanks in advance.
UPDATE 2 I tested the code below and had to modify the
$links = $dom->getElementsByTagName('a');
and change it to:
$links = $dom->getElementsByTagName('link');
It successfully outputted the links. Good Luck
UPDATE Looks like there is a complete answer here: How do you parse and process HTML/XML in PHP.
I developed a solution so that I could recurse all the links in my website. I've removed the code which verified the domain was the same with each recursion (since the question didn't ask for this), but you can easily add one back in if you need it.
Using DOMDocument, you can parse an HTML or XML document to read its links. It is better than using regex. Try something like this:
<?php
//300 seconds = 5 minutes - or however long you need so php won't time out
ini_set('max_execution_time', 300);
// using a global to store the links in case there is recursion, it makes it easy.
// You could of course pass the array by reference for cleaner code.
$alinks = array();
// set the link to whatever you are reading
$link = "http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml";
// do the search
linksearch($link, $alinks);
// show results
var_dump($alinks);
function linksearch($url, & $alinks) {
// use $queue if you want this fn to be recursive
$queue = array();
echo "<br>Searching: $url";
$href = array();
//Load the HTML page
$html = file_get_contents($url);
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('link');
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
//Extract and show the "href" attribute.
$href[] = $link->getAttribute('href');
}
foreach (array_unique($href) as $link) {
// add to list of links found
$queue[] = $link;
}
// remove duplicates
$queue = array_unique($queue);
// get links that haven't yet been processed
$queue = array_diff($queue, $alinks);
// update array passed by reference with new links found
$alinks = array_merge($alinks, $queue);
if (count($queue) > 0) {
foreach ($queue as $link) {
// recursive search - uncomment out if you use this
// remember to check that the domain is the same as the one starting from
// linksearch($link, $alinks);
}
}
}
DOM+Xpath allows you to fetch nodes using expressions.
RSS Item Links
To fetch the RSS link elements (the link for each item):
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$expression = '//channel/item/link';
foreach ($xpath->evaluate($expression) as $link) {
var_dump($link->textContent);
}
Atom Links
The atom:link elements have different semantics: they are part of the Atom namespace and are used to describe relations. NYT uses the standout relation to mark featured stories. To fetch the Atom links you need to register a prefix for the namespace. Attributes are nodes too, so you can fetch them directly:
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$expression = '//channel/item/a:link[@rel="standout"]/@href';
foreach ($xpath->evaluate($expression) as $link) {
var_dump($link->value);
}
There are other relations, such as prev and next.
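To try both expressions without fetching the live feed, the sketch below runs them against a small inline document. The feed content and URLs are invented for illustration; only the Atom namespace URI is the real one.

```php
<?php
// Hypothetical feed: structure mimics RSS 2.0 with an Atom link, values are made up.
$xml = <<<'XML'
<?xml version="1.0"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Story</title>
      <link>https://example.com/story</link>
      <atom:link rel="standout" href="https://example.com/story?featured=1"/>
    </item>
  </channel>
</rss>
XML;

$document = new DOMDocument();
$document->loadXML($xml);

$xpath = new DOMXPath($document);
// The prefix "a" is local to these expressions; it need not match the feed's prefix.
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');

// Plain RSS links: un-namespaced <link> elements, URL in the text content.
$itemLinks = [];
foreach ($xpath->evaluate('//channel/item/link') as $link) {
    $itemLinks[] = $link->textContent;
}

// Atom links: namespaced elements, URL in the href attribute.
$standout = [];
foreach ($xpath->evaluate('//channel/item/a:link[@rel="standout"]/@href') as $attr) {
    $standout[] = $attr->value;
}

var_dump($itemLinks, $standout);
```

The un-prefixed link step only matches elements without a namespace, which is why the atom:link element does not show up in the first list.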
HTML Links (a elements)
The description elements contain HTML fragments. To extract the links from them you have to load the HTML into a separate DOM document.
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$expression = '//channel/item/description';
foreach ($xpath->evaluate($expression) as $description) {
$fragment = new DOMDocument();
$fragment->loadHtml($description->textContent);
$fragmentXpath = new DOMXpath($fragment);
foreach ($fragmentXpath->evaluate('//a[@href]/@href') as $link) {
var_dump($link->value);
}
}
This question already has answers here:
How to retrieve comments from within an XML Document in PHP
(4 answers)
Closed 8 years ago.
I am trying to retrieve content from a p element in this page. As you can see, in the source code there is a paragraph with the content I want:
<p id="qb"><!--
QBlastInfoBegin
Status=READY
QBlastInfoEnd
--></p>
Actually, I want to get the value of Status.
Here is my PHP code.
@$dom->loadHTML($ncbi->ncbi_request($params));
$XPath = new DOMXpath($dom);
$nodes = $XPath->query('//p[@id="qb"]');
$node = $nodes->item(0)->nodeValue;
var_dump($node);
that returns
["nodeValue"]=> string(0) ""
Any idea ?
Thanks!
Seems that to get comment values you need to use //comment()
I'm not too familiar with XPaths so am not too sure on the exact syntax
Sources: https://stackoverflow.com/a/7548089/723139 / https://stackoverflow.com/a/1987555/723139
Update: with working code
<?php
$data = file_get_contents('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?RID=UY5PPBRH014&CMD=Get');
$dom = new DOMDocument();
@$dom->loadHTML($data);
$XPath = new DOMXpath($dom);
$nodes = $XPath->query('//p[@id="qb"]/comment()');
foreach ($nodes as $comment)
{
var_dump($comment->textContent);
}
I checked the site, and it seems you are after the comment inside the paragraph, so you need to add comment() to your XPath query. Consider this example:
$contents = file_get_contents('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?RID=UY5PPBRH014&CMD=Get');
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($contents);
libxml_clear_errors();
$xpath = new DOMXpath($dom);
$comment = $xpath->query('//p[@id="qb"]/comment()')->item(0)->nodeValue;
echo '<pre>';
print_r($comment);
Outputs:
QBlastInfoBegin
Status=READY
QBlastInfoEnd
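Since the question asks for the value of Status rather than the whole comment, one way to finish is to run a regular expression over the comment text. This is a minimal sketch using the paragraph from the question as inline input; the pattern assumes the value contains no whitespace.

```php
<?php
// Inline copy of the paragraph from the question, so no network request is needed.
$html = '<p id="qb"><!--
QBlastInfoBegin
    Status=READY
QBlastInfoEnd
--></p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//p[@id="qb"]/comment()');

// Guard against a missing comment before reading its text.
$comment = $nodes->length > 0 ? $nodes->item(0)->nodeValue : '';

// Assumption: the status is a single token following "Status=" on its own line.
$status = preg_match('/^\s*Status=(\S+)/m', $comment, $m) ? $m[1] : null;

var_dump($status);
```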