PHP DOMDocument how to get element? - php

I am trying to read a website's content, but I have a problem: I want to get elements such as images and links, and I want the elements themselves, not just their content. For instance, for a link I want the entire element, markup included.
How can I do this?
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.link.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
$dom = new DOMDocument;
@$dom->loadHTML($output); // @ suppresses parser warnings on malformed HTML
$items = $dom->getElementsByTagName('a');
for($i = 0; $i < $items->length; $i++) {
echo $items->item($i)->nodeValue . "<br />";
}
curl_close($ch);
?>

You appear to be asking for the serialized HTML of a DOMElement? E.g. you want a string containing <a href="...">link text</a>? (Please make your question clearer.)
$url = 'http://example.com';
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $a) {
// Best solution, but only works with PHP >= 5.3.6
$htmlstring = $dom->saveHTML($a);
// Otherwise you need to serialize to XML and then fix the self-closing elements
$htmlstring = saveHTMLFragment($a);
echo $htmlstring, "\n";
}
function saveHTMLFragment(DOMElement $e) {
$selfclosingelements = array('></area>', '></base>', '></basefont>',
'></br>', '></col>', '></frame>', '></hr>', '></img>', '></input>',
'></isindex>', '></link>', '></meta>', '></param>', '></source>',
);
// This is not 100% reliable because it may output namespace declarations.
// But otherwise it is extra-paranoid to work down to at least PHP 5.1
$html = $e->ownerDocument->saveXML($e, LIBXML_NOEMPTYTAG);
// in case any empty elements are expanded, collapse them again:
$html = str_ireplace($selfclosingelements, '>', $html);
return $html;
}
However, note that what you are doing is dangerous because it could potentially mix encodings. It is better to have your output as another DOMDocument and use importNode() to copy the nodes you want. Alternatively, use an XSL stylesheet.
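The importNode() approach mentioned above could look roughly like the following. This is only a sketch: the inline HTML stands in for the fetched page, and the wrapping div in the target document is an assumption made for illustration.

```php
<?php
// Source document: hypothetical markup standing in for the fetched page.
$source = new DOMDocument();
$source->loadHTML('<html><body><a href="/one">One</a><p>text</p><a href="/two">Two</a></body></html>');

// Target document that will hold only the copied anchors.
$target = new DOMDocument('1.0', 'UTF-8');
$list = $target->appendChild($target->createElement('div'));

foreach ($source->getElementsByTagName('a') as $a) {
    // importNode() returns a copy owned by $target; true requests a deep copy.
    $list->appendChild($target->importNode($a, true));
}

// Serialize just the container element of the target document.
$result = $target->saveHTML($list);
echo $result, "\n";
```

Because every node ends up in one document, the output is serialized with a single encoding instead of being glued together from string fragments.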

I'm assuming you just copy-pasted some example code and didn't bother trying to learn how it actually works...
Anyway, the ->nodeValue part takes the element and returns the text content (because the element has a single text node child - if it had anything else, I don't know what nodeValue would give).
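So, just remove the ->nodeValue and you have your element (a DOMElement object, which you cannot echo directly; pass it to $dom->saveHTML() to get its markup).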

Related

PHP - collect data from paginated API results (in XML)

I'm accessing data from an API supplying XML using PHP, with a retrieval format and URL like this:
$response = $oauth->get('https://example.com/Main/1');
In this case the "1" is the page number. It will return the first 100 results. (That I have all working.)
But if there are more results, I can't access them automatically currently. (I'd have to change the url manually.)
The returned XML will list a <Links><rel>last</rel><href>https://example.com/Main/3</href></Links> of how many pages there are. (In this case, 3 pages available.)
Here's a sample of the XML returned:
<?xml version="1.0" encoding="UTF-8"?>
<Fleet xmlns="http://standards.iso.org/iso/15143/-3" version="2" snapshotTime="2020-01-13T20:12:55.224Z">
<Links>
<rel>self</rel>
<href>https://example.com/Main/1</href>
</Links>
<Links>
<rel>last</rel>
<href>https://example.com/Main/3</href>
</Links>
<Equipment>
<EquipmentHeader>
<OEMName>CAT</OEMName>
<Model>D6</Model>
<EquipmentID>1111111</EquipmentID>
<SerialNumber>1111111</SerialNumber>
<PIN>1111111</PIN>
</EquipmentHeader>
<CumulativeOperatingHours datetime="2018-07-29T18:15:30.000Z">
<Hour>1111</Hour>
</CumulativeOperatingHours>
</Equipment>
// ... and so on - 100 results...
</Fleet>
Is there a simple way to check the value given of the last page, and then loop through retrieving data from each of the pages (from the first to the last)?
(Since I won't know how many pages there are until the first request results are returned.)
UPDATE
I've come up with this for finding the number of pages:
$total_pages = NULL;
$xml = simplexml_load_string($response);
// Get used name space, and use that
$namespaces = $xml->getDocNamespaces();
if(isset($namespaces[''])) {
$defaultNamespaceUrl = $namespaces[''];
$xml->registerXPathNamespace('default', $defaultNamespaceUrl);
$nsprefix = 'default:';
} else {$nsprefix = '';}
$nodes = $xml->xpath('//'.$nsprefix.'Links');
foreach($nodes as $node) {
if($node->rel == 'last'){
$last_page_url = $node->href;
$pos = strrpos($last_page_url, '/'); // position of last slash in url
$total_pages = $pos === false ? 0 : substr($last_page_url, $pos + 1); // if slash doesn't exist, then 0, otherwise the value after the last slash
} // end if
} // end foreach
echo $total_pages;
So now I need to figure out how to loop through the requests...
First, you could simplify the last page lookup, using DOMXPath:
$domDocument = new \DOMDocument();
$domDocument->loadXML($response);
$xpath = new \DOMXPath($domDocument);
$xpath->registerNamespace('d', 'http://standards.iso.org/iso/15143/-3');
$lastPageHref = $xpath->evaluate('string(//d:Links/d:rel[text()="last"]/following-sibling::d:href)');
$lastPage = (int)basename($lastPageHref);
This gets an href element that is a following sibling of a rel element whose text content is "last", which is itself a child of a Links element anywhere in the document.
It then uses basename to get the last part of that URL, and converts it to an integer.
Demo: https://3v4l.org/urfU3
From there, you can simply do something like this (where YourOAuthClass should be replaced by the actual class of $oauth):
function fetchPage(YourOAuthClass $oauth, int $page): \DOMDocument
{
$xml = $oauth->get("https://example.com/Main/$page");
$domDocument = new \DOMDocument();
$domDocument->loadXML($xml);
return $domDocument;
}
$domDocument = fetchPage($oauth, 1);
// Here, do the code above to grab $lastPage
// Also do stuff with $domDocument (handle page 1)
for ($page = 2; $page <= $lastPage; $page++) {
$domDocument = fetchPage($oauth, $page);
// Do stuff with $domDocument (handle current page)
}
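To try the whole loop without the real API, something like the sketch below can help. The fakeGet() function is a hypothetical stand-in for $oauth->get() that returns canned XML for three pages, and the EquipmentID values it produces are made up for the demonstration.

```php
<?php
// Hypothetical stand-in for $oauth->get(): returns canned XML for a page URL.
function fakeGet(string $url): string
{
    $page = (int) basename($url);
    return '<?xml version="1.0" encoding="UTF-8"?>'
        . '<Fleet xmlns="http://standards.iso.org/iso/15143/-3">'
        . '<Links><rel>self</rel><href>https://example.com/Main/' . $page . '</href></Links>'
        . '<Links><rel>last</rel><href>https://example.com/Main/3</href></Links>'
        . '<Equipment><EquipmentHeader><EquipmentID>' . $page . '001</EquipmentID></EquipmentHeader></Equipment>'
        . '</Fleet>';
}

// Parse a response and return an XPath object with the default namespace bound to "d".
function xpathFor(string $xml): DOMXPath
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('d', 'http://standards.iso.org/iso/15143/-3');
    return $xpath;
}

$ids = [];
$lastPage = 1; // unknown until the first response has been read
for ($page = 1; $page <= $lastPage; $page++) {
    $xpath = xpathFor(fakeGet("https://example.com/Main/$page"));
    if ($page === 1) {
        // The "last" link carries the final page number as its last path segment.
        $href = $xpath->evaluate('string(//d:Links[d:rel="last"]/d:href)');
        $lastPage = max(1, (int) basename($href));
    }
    foreach ($xpath->query('//d:Equipment/d:EquipmentHeader/d:EquipmentID') as $node) {
        $ids[] = $node->textContent;
    }
}
print_r($ids);
```

Starting $lastPage at 1 and raising it after the first response keeps page 1 inside the same loop, so the per-page handling only has to be written once.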

xml to json auto script

I have a couple of XML feeds which I need to convert to JSON. My service provider uploads the XML files to our server 2-3 times a day. At the moment I use codebeautify.org to convert the files to JSON and then upload them back to our server. Is there a way to have this conversion done automatically, by a PHP script or similar? I'd appreciate advice on how I should tackle it. Thanks in advance.
Here you go:
function removeNamespaceFromXML( $xml )
{
// Because I know all of the namespaces that can possibly appear
// in the XML string, I can just hard-code them and check for
// them in order to remove them
$toRemove = ['rap', 'turss', 'crim', 'cred', 'j', 'rap-code', 'evic'];
// This is part of a regex I will use to remove the namespace declaration from string
$nameSpaceDefRegEx = '(\S+)=["\']?((?:.(?!["\']?\s+(?:\S+)=|[>"\']))+.)["\']?';
// Cycle through each namespace and remove it from the XML string
foreach( $toRemove as $remove ) {
// First remove the namespace from the opening of the tag
$xml = str_replace('<' . $remove . ':', '<', $xml);
// Now remove the namespace from the closing of the tag
$xml = str_replace('</' . $remove . ':', '</', $xml);
// This XML uses the name space with CommentText, so remove that too
$xml = str_replace($remove . ':commentText', 'commentText', $xml);
// Complete the pattern for RegEx to remove this namespace declaration
$pattern = "/xmlns:{$remove}{$nameSpaceDefRegEx}/";
// Remove the actual namespace declaration using the Pattern
$xml = preg_replace($pattern, '', $xml, 1);
}
// Return sanitized and cleaned up XML with no namespaces
return $xml;
}
function namespacedXMLToArray($xml)
{
// One function to both clean the XML string and return an array
return json_decode(json_encode(simplexml_load_string(removeNamespaceFromXML($xml))), true);
}
print_r(namespacedXMLToArray($xml));
Source: https://laracasts.com/discuss/channels/general-discussion/converting-xml-to-jsonarray
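For the automation part of the question, a batch conversion could be sketched as below. It assumes the feeds have no namespaces (otherwise they would need cleaning first, as in the functions above), and the directory layout, the convertXmlDirToJson() name, and the sample feed are all hypothetical.

```php
<?php
// Convert every *.xml file in $dir to a *.json file next to it.
// Returns the list of JSON files written.
function convertXmlDirToJson(string $dir): array
{
    $written = [];
    foreach (glob($dir . '/*.xml') ?: [] as $xmlFile) {
        $xml = simplexml_load_file($xmlFile);
        if ($xml === false) {
            continue; // skip files that are not well-formed XML
        }
        $jsonFile = substr($xmlFile, 0, -4) . '.json';
        file_put_contents($jsonFile, json_encode($xml, JSON_PRETTY_PRINT));
        $written[] = $jsonFile;
    }
    return $written;
}

// Demo with a throwaway directory and a made-up feed.
$dir = sys_get_temp_dir() . '/xml2json_' . uniqid();
mkdir($dir);
file_put_contents(
    $dir . '/feed.xml',
    '<feed><item><title>A</title></item><item><title>B</title></item></feed>'
);

$written = convertXmlDirToJson($dir);
$data = json_decode(file_get_contents($written[0]), true);
print_r($data);
```

A script like this could then be run by a scheduler such as cron shortly after each upload, so no manual conversion step is needed.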

PHP creating multiple DOMDocuments in a loop issue

I have a list of items to be appended to a base URL, and I am trying to retrieve the HTML from each of the generated URLs in a loop. However, I am encountering an error and I've really been struggling to fix it!
current code:
($items is just an array of strings)
$output = "";
foreach($items as $item) {
$url = $baseUrl . $item;
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile($url);
$output = $output . json_encode($dom->saveHTML());
}
echo $output;
Can anyone tell me why I can't load multiple HTML documents like this?
Annoyingly, I'm not getting any PHP error logs, and the Ajax XHR text is not providing any useful info; it just returns a section of the first HTML page loaded as the 'error' (it seems to be able to load the first item in the array but then fails).
You were almost there. This way it should do the trick:
$output = "";
foreach($items as $item) {
$url = $baseUrl . $item;
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile($url);
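// JSON_ERROR_UTF8 is an error code, not an option flag; to tolerate invalid UTF-8 use JSON_INVALID_UTF8_SUBSTITUTE (PHP >= 7.2)
$output .= json_encode($dom->saveHTML(), JSON_INVALID_UTF8_SUBSTITUTE);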
}
echo $output;
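Note that concatenating several json_encode() results produces a string that is not itself valid JSON. One alternative, sketched here with inline HTML strings standing in for the remote pages (the $pages array and its keys are hypothetical), is to collect the documents in an array and encode once:

```php
<?php
libxml_use_internal_errors(true); // collect parser warnings instead of emitting them

// Hypothetical stand-ins for the pages that loadHTMLFile() would fetch.
$pages = [
    'a' => '<html><body><p>First</p></body></html>',
    'b' => '<html><body><p>Second</p></body></html>',
];

$result = [];
foreach ($pages as $item => $html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $result[$item] = $dom->saveHTML();
    libxml_clear_errors(); // discard warnings gathered for this document
}

// Encode everything once; substitute invalid UTF-8 instead of failing (PHP >= 7.2).
$json = json_encode($result, JSON_INVALID_UTF8_SUBSTITUTE);
if ($json === false) {
    $json = json_encode(['error' => json_last_error_msg()]);
}
echo $json, "\n";
```

The client then receives one JSON object keyed by item, and an encoding failure shows up as an explicit error message rather than a truncated response.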

php xml get a value from a xml over http and then store it as a string

Okay, I'm really having some issues with this XML/PHP script.
I have the following XML, which I want to load over HTTP:
<WowzaMediaServer>
<ConnectionsCurrent>3</ConnectionsCurrent>
<ConnectionsTotal>26</ConnectionsTotal>
<ConnectionsTotalAccepted>20</ConnectionsTotalAccepted>
<ConnectionsTotalRejected>6</ConnectionsTotalRejected>
<MessagesInBytesRate>3248.0</MessagesInBytesRate>
<MessagesOutBytesRate>1054.0</MessagesOutBytesRate>
<VHost>
<Name>_defaultVHost_</Name>
<TimeRunning>28752.989</TimeRunning>
<ConnectionsLimit>0</ConnectionsLimit>
<ConnectionsCurrent>3</ConnectionsCurrent>
<ConnectionsTotal>26</ConnectionsTotal>
<ConnectionsTotalAccepted>20</ConnectionsTotalAccepted>
<ConnectionsTotalRejected>6</ConnectionsTotalRejected>
<MessagesInBytesRate>3248.0</MessagesInBytesRate>
<MessagesOutBytesRate>1054.0</MessagesOutBytesRate>
<Application>
<Name>zahlio</Name>
<Status>loaded</Status>
<TimeRunning>3339.479</TimeRunning>
<ConnectionsCurrent>3</ConnectionsCurrent>
<ConnectionsTotal>14</ConnectionsTotal>
<ConnectionsTotalAccepted>14</ConnectionsTotalAccepted>
<ConnectionsTotalRejected>0</ConnectionsTotalRejected>
<MessagesInBytesRate>31595.0</MessagesInBytesRate>
<MessagesOutBytesRate>32045.0</MessagesOutBytesRate>
<ApplicationInstance>
<Name>_definst_</Name>
<TimeRunning>3339.478</TimeRunning>
<ConnectionsCurrent>3</ConnectionsCurrent>
<ConnectionsTotal>14</ConnectionsTotal>
<ConnectionsTotalAccepted>14</ConnectionsTotalAccepted>
<ConnectionsTotalRejected>0</ConnectionsTotalRejected>
<MessagesInBytesRate>31594.0</MessagesInBytesRate>
<MessagesOutBytesRate>32045.0</MessagesOutBytesRate>
<Stream>
<Name>zahlio</Name>
<SessionsFlash>2</SessionsFlash>
<SessionsCupertino>0</SessionsCupertino>
<SessionsSanJose>0</SessionsSanJose>
<SessionsSmooth>0</SessionsSmooth>
<SessionsRTSP>0</SessionsRTSP>
<SessionsTotal>2</SessionsTotal>
</Stream>
</ApplicationInstance>
</Application>
</VHost>
</WowzaMediaServer>
The data I want to load is the <SessionsFlash> value (the 2) from the <Stream> whose <Name> is x, where x is a variable; in this case it is zahlio.
I load it over HTTP from this URL: http://username:pwd@mydomian.com:8086/connectioncounts
This is my current PHP script:
$sxe = new SimpleXMLElement('http://username:pwd@mydomian.com:8086/connectioncounts');
$propNode = $sxe->xpath('/WowzaMediaServer/VHost/Application/ApplicationInstance/Stream');
$count = $propNode->getChildren("SessionsFlash");
It doesn't work, and I don't know how to select the data from the child with the name x.
If I understood correctly, you are trying to extract the data from the XML you posted. The following code gets the Name and SessionsFlash from provided XML:
$dom = new DOMDocument();
@$dom->loadXML($xml); // @ suppresses parser warnings
$xpath = new DOMXPath($dom);
$search_name = 'zahlio';
$items = $xpath->query('/WowzaMediaServer/VHost/Application[Name="' . $search_name . '"]/ApplicationInstance/Stream[Name="' . $search_name . '"]');
for ($i = 0; $i < $items->length; $i++)
{
$temp = $xpath->query('SessionsFlash', $items->item($i));
$SessionsFlash = $temp->item(0)->nodeValue;
echo $SessionsFlash;
}
Your server was using Digest authentication. Your browser handles that automatically, but PHP needs some help. Here's the code that works:
$ch = curl_init('http://user:pwd@website.net:8086/connectioncounts');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HTTPAUTH, CURLAUTH_DIGEST);
$xml = curl_exec($ch);
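Once the XML string is available, the same lookup can also be done with SimpleXML, which is what the question started with. This is a sketch using a trimmed-down copy of the XML; the second stream named "other" and the sessionsFlashFor() helper are made up for illustration, and stream names containing double quotes are not handled.

```php
<?php
// Trimmed-down sample; the "other" stream is hypothetical.
$xml = '<WowzaMediaServer><VHost><Application><Name>zahlio</Name>'
    . '<ApplicationInstance><Name>_definst_</Name>'
    . '<Stream><Name>zahlio</Name><SessionsFlash>2</SessionsFlash></Stream>'
    . '<Stream><Name>other</Name><SessionsFlash>5</SessionsFlash></Stream>'
    . '</ApplicationInstance></Application></VHost></WowzaMediaServer>';

// Return SessionsFlash for the stream with the given name, or null if none matches.
function sessionsFlashFor(SimpleXMLElement $sxe, string $streamName): ?int
{
    // xpath() returns an array of matching elements (empty when nothing matches).
    $matches = $sxe->xpath('//Stream[Name="' . $streamName . '"]/SessionsFlash');
    return $matches ? (int) $matches[0] : null;
}

$sxe = new SimpleXMLElement($xml);
$count = sessionsFlashFor($sxe, 'zahlio');
var_dump($count);
```

The original attempt failed partly because xpath() returns an array rather than a single element, so the first match has to be picked out before reading its value.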
I built a script for Wowza Media Server too. This is how I read the XML in PHP; maybe it can help:
$dom=new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->load('/home/vichea/wowza_visitor/wowza_serverinfo.xml');
$date=date("d/m/Y");
$date_file=date('mY');
$dataset0=$dom->getElementsByTagName("VHost");
foreach($dataset0 as $row){
$xmlC=$row->getElementsByTagName("ConnectionsTotal");
$xmlConn=$xmlC->item(0)->nodeValue;
$st[]=$xmlConn;
}

Prevent save() from overwriting dtd in an xml file

I'm writing a script that adds nodes to an XML file. In addition, I have an external DTD I made to define the structure of the file. However, the script I wrote keeps overwriting the DTD declaration in the otherwise empty XML file when it's done appending nodes. How can I stop this from happening?
Code:
<?php
/*Dom vars*/
$dom = new DOMDocument("1.0", "UTF-8");
$previous_value = libxml_use_internal_errors(TRUE);
$dom->load('post.xml');
libxml_clear_errors();
libxml_use_internal_errors($previous_value);
$dom->formatOutput = true;
$entry = $dom->getElementsByTagName('entry');
$date = $dom->getElementsByTagName('date');
$para = $dom->getElementsByTagname('para');
$link = $dom->getElementsByTagName('link');
/* Dem POST vars used by dat Ajax mah ziggen, yeah boi*/
if (isset($_POST['Text'])){
$text = trim($_POST['Text']);
}
/*
function post(){
global $dom, $entry, $date, $para, $link,
$home, $about, $contact, $text;
*/
$entryC = $dom->createElement('entry');
$dateC = $dom->createElement('date', date("m d, y H:i:s")) ;
$entryC->appendChild($dateC);
$tab = "\n";
$frags = explode($tab, $text);
$i = count($frags);
$b = 0;
while($b < $i){
$paraC = $dom->createElement('para', $frags[$b]);
$entryC->appendChild($paraC);
$b++;
}
$linkC = $dom->createElement('link', rand(100000, 999999));
$entryC->appendChild($linkC);
$dom->appendChild($entryC);
$dom->save('post.xml');
/*}
post();
*/echo 1;
?>
It looks like in order to do this, you'd have to create a DOMDocumentType using
DOMImplementation::createDocumentType
then create an empty document using the DOMImplementation, and pass in the DOMDocumentType you just created, then import the document you loaded. This post: http://pointbeing.net/weblog/2009/03/adding-a-doctype-declaration-to-a-domdocument-in-php.html and the comments looked useful.
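The steps described above might be sketched as follows. The root element name "posts" and the system identifier "post.dtd" are assumptions (the question does not show them), and the inline XML stands in for the loaded post.xml.

```php
<?php
// Stand-in for the document loaded from post.xml (root name "posts" is assumed).
$loaded = new DOMDocument();
$loaded->loadXML('<posts><entry><para>Hello</para></entry></posts>');

// Build a fresh document that carries the DOCTYPE declaration.
$impl = new DOMImplementation();
$doctype = $impl->createDocumentType('posts', '', 'post.dtd'); // hypothetical DTD file name
$doc = $impl->createDocument(null, 'posts', $doctype);
$doc->encoding = 'UTF-8';
$doc->formatOutput = true;

// Copy the existing entries into the new root element.
foreach ($loaded->documentElement->childNodes as $child) {
    $doc->documentElement->appendChild($doc->importNode($child, true));
}

$xml = $doc->saveXML();
echo $xml;
```

New entries would then be appended to $doc->documentElement before calling save(), so the DOCTYPE line is written out together with the content.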
I'm guessing this is happening because after parsing/validation, the DTD isn't part of the DOM anymore, and PHP therefore isn't able to include it when the document is serialized.
Do you have to use a DTD? XML Schemas can be linked via attributes (and the link is therefore part of the DOM). Or there's RelaxNG, which can be linked via a processing instruction. DTDs have all this baggage that comes with them as a holdover from SGML. There are better alternatives.
