How to parse html and create array from dt/dd - php

I have a select element for which I want to dynamically create the option values using information from another website. The information can be seen at http://www.dandwiki.com/wiki/SRD:Classes. It is the list of 'Base Classes'. I have tried using the DOMDocument class, but I can't see any way of using a url instead of an html file. I have tried using file_get_html and a foreach loop, but can't make it work with the format of the data on the website. It is in a dt/dd element, and the elements don't have id's. What would be the best way to pull the information off the website, and create an option value for each class in my select element?

I usually use http://www.bit-101.com/xpath/ to test the query. You would end up with something like this.
<?php
$url = 'http://www.dandwiki.com/wiki/SRD:Classes';
$html = file_get_contents($url);
$dom = new DOMDocument();
//from file
#$dom->loadHTML($html);
//Creating a new DOMPath
$Xpath = new DOMXPath($dom);
// Get links
// //big Selects all <big> elements in the document.
// [text()="Base Classes"] with text "base classes"
// /.. get parent class
// /.. get parent class
// //a select all <a> elements
// /text() get text
$query = '//big[text()="Base Classes"]/../..//a/text()';
$entries = $Xpath->query($query);
foreach ($entries as $entry) {
echo $entry->nodeValue . "<br>";
}
?>

Related

How do I extract all URL links from an RSS feed? [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 7 years ago.
I need to extract all the links to news articles from the NY Times RSS feed to a MySQL database periodically. How do I go about doing this? Can I use some regular expression (in PHP) to match the links? Or is there some other alternative way? Thanks in advance.
UPDATE 2 I tested the code below and had to modify the
$links = $dom->getElementsByTagName('a');
and change it to:
$links = $dom->getElementsByTagName('link');
It successfully outputted the links. Good Luck
UPDATE Looks like there is a complete answer here: How do you parse and process HTML/XML in PHP.
I developed a solution so that I could recurse all the links in my website. I've removed the code which verified the domain was the same with each recursion (since the question didn't ask for this), but you can easily add one back in if you need it.
Using html5 DOMDocument, you can parse HTML or XML document to read links. It is better than using regex. Try something like this
<?php
//300 seconds = 5 minutes - or however long you need so php won't time out
ini_set('max_execution_time', 300);
// using a global to store the links in case there is recursion, it makes it easy.
// You could of course pass the array by reference for cleaner code.
$alinks = array();
// set the link to whatever you are reading
$link = "http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml";
// do the search
linksearch($link, $alinks);
// show results
var_dump($alinks);
function linksearch($url, & $alinks) {
// use $queue if you want this fn to be recursive
$queue = array();
echo "<br>Searching: $url";
$href = array();
//Load the HTML page
$html = file_get_contents($url);
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The # is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
#$dom->loadHTML($html);
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('link');
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
//Extract and show the "href" attribute.
$href[] = $link->getAttribute('href');
}
foreach (array_unique($href) as $link) {
// add to list of links found
$queue[] = $link;
}
// remove duplicates
$queue = array_unique($queue);
// get links that haven't yet been processed
$queue = array_diff($queue, $alinks);
// update array passed by reference with new links found
$alinks = array_merge($alinks, $queue);
if (count($queue) > 0) {
foreach ($queue as $link) {
// recursive search - uncomment out if you use this
// remember to check that the domain is the same as the one starting from
// linksearch($link, $alinks);
}
}
}
DOM+Xpath allows you to fetch nodes using expressions.
RSS Item Links
To fetch the RSS link elements (the link for each item):
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$expression = '//channel/item/link';
foreach ($xpath->evaluate($expression) as $link) {
var_dump($link->textContent);
}
Atom Links
The atom:link have a different semantic, they are part of the Atom namespace and used to describe relations. NYT uses the standout relation to mark featured stories. To fetch the Atom links you need to register a prefix for the namespace. Attributes are nodes, too so you can fetch them directly:
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$expression = '//channel/item/a:link[#rel="standout"]/#href';
foreach ($xpath->evaluate($expression) as $link) {
var_dump($link->value);
}
Here are other relations like prev and next.
HTML Links (a elements)
The description elements contain HTML fragments. To extract the links from them you have to load the HTML into a separate DOM document.
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$expression = '//channel/item/description';
foreach ($xpath->evaluate($expression) as $description) {
$fragment = new DOMDocument();
$fragment->loadHtml($description->textContent);
$fragmentXpath = new DOMXpath($fragment);
foreach ($fragmentXpath->evaluate('//a[#href]/#href') as $link) {
var_dump($link->value);
}
}

Search for substrings of text in a node in a XML file

I have this PHP I found in a Q&A forum that queries an XML file:
$doc = new DOMDocument; // Create a new dom document
$doc->preserveWhiteSpace = false; // Set features
$doc->formatOutput = true; // Create indents on xml
$doc->Load('i.xml'); // Load the file
$xpath = new DOMXPath($doc);
$query = '//users/user/firstname[.= "'.$_POST["search"].'"]'; // The xpath (starts from root node)
$names = $xpath->query($query); // A list of matched elements
$Output="";
foreach ($names as $node) {
$Output.=$doc->saveXML($node->parentNode)."\n"; // We get the parent of "<firstname>" element (the entire "<user>" node and its children) (maybe get the parent node directly using xpath)
// and use the saveXML() to convert it to string
}
echo $Output."<br>\n\n"; // The result
echo "<hr><br><b>Below view the results as HTML content. (See also the page's HTML code):</b>\n<pre>".htmlspecialchars($Output)."</pre>";
The script will search the values of all the firstname nodes in the XML document from the input from the POST, and will return the parent node of the nodefirstname, if the POST input value matches any of the node values.
This script works well, but it only returns queries that contain the entire value of a firstname node, and will not work if I search for a substring of the node's text (e.g a query for Potato will return Potato, but a query for Pot, will not give me results for Potato).
So how do you get a result that only contains a substring of the node's text, instead of the entire value ?

Get just the first item with DOMDocument in PHP

I am using this below code to get the elements that are in special HTML element :
$dom = new DOMDocument();
#$dom->loadHTML($google_html);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//span[#class="st"]');
foreach ($tags as $tag) {
echo $node_value;
}
Now, the problem is that, the code gives all of the elements that are in one special class, but i just need to get the First item that has that class name.
So i don't need using foreach loops.
How to use that code to get JUST the FIRST item ?
The following will make sure you get just the first one in the DOMNodeList that is returned
$xpath->query('//span[#class="st"][1]');
The following gets the only item in the DOMNodeList
$tags = $xpath->query('//span[#class="st"][1]');
$first = $tags->item(0);
$text = $first->textContent;
See XPath: Select first element with a specific attribute

Change the value of a text node using SimpleXML

I am trying to write a code where it will find a specific element in my XML file and then change the value of the text node. The XML file has different namespaces. Till now, I have managed to register the namespaces and also echo the text node of the element, which I want to change.
<?php
$xml = simplexml_load_file('getobs.xml');
$xml->registerXPathNamespace('g','http://www.opengis.net/gml');
$result = $xml->xpath('//g:beginPosition');
foreach ($result as $title) {
echo $title . "\n";
}
?>
My question is: How can I change the value of this element using SimpleXML? I tried to use the nodeValue command but I am not able to make it work.
This is a part of the XML:
<sos:GetObservation xmlns:sos="http://www.opengis.net/sos/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" service="SOS" version="1.0.0" srsName="urn:ogc:def:crs:EPSG:4326">
<sos:offering>urn:gfz:cawa:def:offering:meteorology</sos:offering>
<sos:eventTime>
<ogc:TM_During xmlns:ogc="http://www.opengis.net/ogc" xsi:type="ogc:BinaryTemporalOpType">
<ogc:PropertyName>urn:ogc:data:time:iso8601</ogc:PropertyName>
<gml:TimePeriod xmlns:gml="http://www.opengis.net/gml">
<gml:beginPosition>2011-02-10T01:10:00.000</gml:beginPosition>
Thanks
Dimitris
In the end I managed to do it by using the PHP XML DOM.
Here is the code that I used in order to change the text node of a specific element:
<?php
// create new DOM document and load the data
$dom = new DOMDocument;
$dom->load('getobs.xml');
//var_dump($dom);
// Create new xpath and register the namespace
$xpath = new DOMXPath($dom);
$xpath->registerNamespace('g','http://www.opengis.net/gml');
// query the result amd change the value to the new date
$result = $xpath->query("//g:beginPosition");
$result->item(0)->nodeValue = 'sds';
// save the values in a new xml
file_put_contents('test.xml',$dom->saveXML());
?>
Not wanting to switch from the code I've already made for SimpleXML, I found this solution:
http://www.dotdragnet.com/forum/index.php?topic=3979.0
Specificially:
$numvotes = $xml->xpath('/gallery/image[path="'.$_GET["image"].'"]/numvotes');
...
$numvotes[0][0] = $votes;
Hope this helps!

confused with xpath

I've got this PHP code loading in some html.
$dom = new DOMDocument();
$dom->loadHTML($somehtml);
$xpath = new DOMXPath($dom);
$divContent = $xpath->query('//table[class="defURLP"]');
echo $divContent;
I'm too confused to understand quite what needs to go on here, however my desire would it to be able to populate the variable $divContent to have the html contents of the table with the classname defURLP
It's currently just returning
object(DOMNodeList)#3 (0) { }
You need to retrieve the first item from the DOMNodeList returned by your xpath query, since there may be more than one in the list.
// Queries for tables having class defURLP
$tables = $xpath->query('//table[class="defURLP"]');
// Reference the first one in $divContent
$divContent = $tables->item(0);
// Output its nodeValue
echo $divContent->nodeValue;
Or iterate over the node list with a foreach:
$tables = $xpath->query('//table[class="defURLP"]');
// Iterate over the whole node list in $tables (if it is multiple nodes)
foreach ($tables as $t) {
echo $t->nodeValue;
}

Categories