I want to get some data from this website, but as you can see in their HTML there is some weird stuff going on, such as <TABLE BORDER=0 CELLSPACING=1 CELLPADDING=3 WIDTH=100%> with unquoted attribute values, so I get errors when I try to parse the table using SimpleXmlElement, which I have been using for a while and which works perfectly on some other websites.
I'm doing something like:
$html = file_get_html('https://secure.tibia.com/community/?subtopic=killstatistics&world=Menera');
$table = $html->find('table', 4);
$xml = new SimpleXmlElement($table);
I get a bunch of errors, so is there a way of cleaning the code before sending it to SimpleXmlElement, or perhaps another kind of DOM class I could use?
What do you guys recommend?
The problem with your HTML code is that the tag attributes are not wrapped by quotes: unquoted attributes are allowed in HTML, but not in XML.
If you don't care about attributes, you can continue using Simple HTML Dom; otherwise you have to switch to a different HTML parser.
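To make the difference concrete, here is a minimal sketch using only PHP's built-in extensions. The sample markup is made up for illustration (it is not the real page). It shows the XML parser rejecting unquoted attributes, while DOMDocument's HTML parser accepts them and can serialize the table as well-formed XML:

```php
<?php
// Hypothetical sample markup with unquoted attributes, modelled on the question.
$markup = '<TABLE BORDER=0 CELLSPACING=1 WIDTH=100%><TR><TD>Rats</TD><TD>12</TD></TR></TABLE>';

// Collect libxml errors internally instead of emitting warnings.
libxml_use_internal_errors(true);

// Parsed as XML, the string is rejected because the attribute values are unquoted.
$direct = simplexml_load_string($markup);
var_dump($direct === false);
libxml_clear_errors();

// Parsed as HTML, the same string loads; saveXML() then emits quoted attributes.
$dom = new DOMDocument();
$dom->loadHTML($markup);
$table = $dom->getElementsByTagName('table')->item(0);
$xml   = simplexml_load_string($dom->saveXML($table));

// Read the second cell through SimpleXML.
$cells = $xml->xpath('//td');
echo (string) $cells[1], "\n";
```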
Cleaning attributes with Simple HTML DOM:
Start by creating a function to clear all node attributes:
function clearAttributes( $node )
{
foreach( $node->getAllAttributes() as $key => $val )
{
$node->$key = Null;
}
}
Then apply the function to your <table>, <tr> and <td> nodes:
clearAttributes( $table );
foreach( $table->find('tr') as $tr )
{
clearAttributes( $tr );
foreach( $tr->find( 'td' ) as $td )
{
clearAttributes( $td );
}
}
Last but not least: the site's HTML contains a lot of encoded characters such as &nbsp;. If you don't want them to show up as stray characters in cells like <td>1 </td><td>0 </td> inside your XML, you have to decode the entities and prepend a UTF-8 declaration to your string before importing it into a SimpleXml object:
$xml = '<?xml version="1.0" encoding="utf-8" ?>'.html_entity_decode( $table );
$xml = new SimpleXmlElement( $xml );
Preserving attributes with DOMDocument:
The built-in DOMDocument class is more powerful and less memory-hungry than Simple HTML Dom. In this case, it will turn the original HTML into well-formed markup for you. Despite appearances, it is simple to use.
First, you have to init a DOMDocument object, setting libxml_use_internal_errors (to suppress a lot of warnings on malformed HTML) and load your url:
$dom = new DOMDocument();
libxml_use_internal_errors( 1 );
$dom->loadHTMLfile( 'https://secure.tibia.com/community/?subtopic=killstatistics&world=Menera' );
$dom->formatOutput = True;
Then, you retrieve the desired <table>:
$table = $dom->getElementsByTagName( 'table' )->item(4);
And, as in the Simple HTML Dom example, you have to prepend the UTF-8 declaration to avoid weird characters:
$xml = '<?xml version="1.0" encoding="utf-8" ?>'.$dom->saveHTML( $table );
$xml = new SimpleXmlElement( $xml );
As you can see, the DOMDocument syntax for retrieving a node as HTML differs from Simple HTML Dom: you always need to call the method on the main document object and pass the node to print as an argument:
echo $dom->saveHTML(); // print entire HTML document
echo $dom->saveHTML( $node ); // print node $node
Edit: removing &nbsp; with DOMDocument:
To remove unwanted &nbsp; from the HTML, you can pre-load the HTML and use str_replace.
Change this line:
$dom->loadHTMLfile( 'https://secure.tibia.com/community/?subtopic=killstatistics&world=Menera' );
with this:
$data = file_get_contents( 'https://secure.tibia.com/community/?subtopic=killstatistics&world=Menera' );
$data = str_replace( '&nbsp;', '', $data );
$dom->loadHTML( $data );
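As a small offline sketch of the same idea, the snippet below uses a made-up HTML string in place of the downloaded page. It strips the &nbsp; entities before parsing and then reads the cell values:

```php
<?php
// Hypothetical stand-in for the downloaded page.
$data = '<table><tr><td>1&nbsp;</td><td>0&nbsp;</td></tr></table>';

// Remove the entity from the raw markup before handing it to the parser.
$data = str_replace('&nbsp;', '', $data);

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($data);

// Collect the text of every cell.
$values = [];
foreach ($dom->getElementsByTagName('td') as $td) {
    $values[] = $td->textContent;
}
print_r($values);
```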
I have written this PHP code to build an XML file from parameters passed in the URL.
But when the XML file already exists, instead of adding the new data at the bottom of the file inside the root element, it overwrites the file and deletes all the old data.
Where is the problem?
I have seen some examples online, but I can't figure out how to fix it.
Do I need to check whether the file already exists and then add the element?
Or do I need to read it and then add the old elements again along with the new one?
I don't know DOM very well, so I can't figure it out.
<?php
$FOL = $_GET["FOL"];
$NUM = $_GET["NUM"];
$DAT = $_GET["DAT"];
$ZON = $_GET["ZON"];
$TIP = $_GET["TIP"];
$COM = $_GET["COM"];
$dom = new DOMDocument();
$dom->encoding = 'utf-8';
$dom->xmlVersion = '1.0';
$dom->formatOutput = true;
$xml_file_name = "$NUM.xml";
$xmlString = file_get_contents($xml_file_name);
$dom->loadXML($xmlString);
$loaded_xml = $dom->getElementsByTagName('Territorio');
$territorio_node = $dom->createElement('Territorio');
$child_node_NOM = $dom->createElement('NOM', "$NOM");
$territorio_node->appendChild($child_node_NOM);
$child_node_NUM = $dom->createElement('NUM', "$NUM");
$territorio_node->appendChild($child_node_NUM);
$child_node_DAT = $dom->createElement('DAT', "$DAT");
$territorio_node->appendChild($child_node_DAT);
$child_node_ZON = $dom->createElement('ZON', "$ZON");
$territorio_node->appendChild($child_node_ZON);
$dom->appendChild($territorio_node);
$child_node_TIP = $dom->createElement('TIP', "$TIP");
$territorio_node->appendChild($child_node_TIP);
$child_node_COM = $dom->createElement('COM', "$COM");
$territorio_node->appendChild($child_node_COM);
$dom->appendChild($territorio_node);
$dom->save($FOL.'/'.$xml_file_name);
echo "$xml_file_name creato correttamente";
?>
As per the comment: you should check that the root node of the XML file exists before calling createElement to generate a new one. To do that, you can call getElementsByTagName and test whether the first entry is empty,
i.e. empty( $dom->getElementsByTagName('Territorio')[0] ) or something along those lines.
If the root node exists we use it; otherwise we add a new root to the document and continue.
// check that the querystring has all the required parameters
if( isset(
$_GET['FOL'],
$_GET['NUM'],
$_GET['DAT'],
$_GET['ZON'],
$_GET['TIP'],
$_GET['COM']
)){
// the filename is generated from one of the querystring parameters
// - here the directory used is the same as the script running
$file=sprintf( '%s/%s.xml', __DIR__, $_GET['NUM'] );
// create an empty file if it does not exist
if( !file_exists( $file )){
file_put_contents( $file, '' );
}
clearstatcache();
// create the DOMDocument and set various options
libxml_use_internal_errors( true );
$dom=new DOMDocument('1.0','utf-8');
$dom->strictErrorChecking=false;
$dom->preserveWhiteSpace=true;
$dom->formatOutput=true;
$dom->recover=true;
// load the XML file directly rather than loading a string read from the original file.
$dom->load( $file );
// The ROOT node of the document ... does it exist? if not, create it and add to the DOM.
$root=$dom->getElementsByTagName('Territorio')[0];
if( empty( $root ) ){
$root=$dom->createElement('Territorio');
$dom->appendChild( $root );
}
// I added this part so that you can distinguish easily new elements
$record=$dom->createElement('record');
$root->appendChild( $record );
// add all the querystring parameters within the new record.
$record->appendChild( $dom->createElement('NUM', $_GET["NUM"] ) );
$record->appendChild( $dom->createElement('DAT', $_GET["DAT"] ) );
$record->appendChild( $dom->createElement('ZON', $_GET["ZON"] ) );
$record->appendChild( $dom->createElement('TIP', $_GET["TIP"] ) );
$record->appendChild( $dom->createElement('COM', $_GET["COM"] ) );
$record->appendChild( $dom->createElement('FOL', $_GET["FOL"] ) );
$dom->save( $file );
}
An example of the XML generated:
<?xml version="1.0" encoding="utf-8"?>
<Territorio>
<record>
<NUM>wibble</NUM>
<DAT>25/10/2022</DAT>
<ZON>europe</ZON>
<TIP>total</TIP>
<COM>94</COM>
<FOL>234</FOL>
</record>
<record>
<NUM>wibble</NUM>
<DAT>26/10/2022</DAT>
<ZON>europe</ZON>
<TIP>total</TIP>
<COM>96</COM>
<FOL>238</FOL>
</record>
</Territorio>
In the original code the file is saved to a location defined by another parameter in the querystring ( only just noticed that afterwards ) so rather than
$file=sprintf( '%s/%s.xml', __DIR__, $_GET['NUM'] );
you would likely want to do:
$file=sprintf( '%s/%s.xml', $_GET['FOL'], $_GET['NUM'] );
The root node of an XML document is called the document element, and there is a property for it ($documentElement), so you can just check whether it is null. However, a document can have only a single document element, so you will need to modify the structure of your XML, for example by adding a "Territori" document element.
Do not use the second argument of the createElement() method or the $nodeValue property. Their escaping is broken: try adding a value with an &. Use $textContent or add a text node.
In modern DOM (PHP 8.0 and later) you can even just append() a string.
$NUM = "NUM";
$DAT = "DAT";
$ZON = "ZON";
$document = new DOMDocument('1.0', 'UTF-8');
// let the parser ignore existing indents
$document->preserveWhiteSpace = false;
$document->loadXML(getXMLString());
// fetch or create document element
$territori = $document->documentElement
?? $document->appendChild($document->createElement('Territori'));
// create/append an item element
$territori
->appendChild(
$territorio = $document->createElement('Territorio')
);
// create/append an element and set its text content
$territorio
->appendChild($document->createElement('NUM'))
->textContent = $NUM;
// create/append an element with a text child node
$territorio
->appendChild($document->createElement('DAT'))
->appendChild($document->createTextNode($DAT));
// create/append an element and a string (append() from the DOM Living Standard, PHP 8.0+)
$territorio
->appendChild($document->createElement('ZON'))
->append((string)$ZON);
// enable output formatting
$document->formatOutput = true;
echo $document->saveXML();
function getXMLString() {
return <<<'XML'
<?xml version="1.0"?>
<Territori>
<Territorio>
<NUM>NUM</NUM>
<DAT>DAT</DAT>
<ZON>DAT</ZON>
</Territorio>
</Territori>
XML;
}
For a more flexible approach to fetching nodes, use XPath expressions. Here is an example that checks whether a Territorio with a specific NUM value exists:
$document = new DOMDocument('1.0', 'UTF-8');
$document->loadXML(getXMLString());
$xpath = new DOMXpath($document);
if ($xpath->evaluate('count(//Territorio[NUM="NUM"]) > 0')) {
echo "Node exists";
}
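Putting the advice above together for the original question, the following is only a sketch. The helper name appendTerritorio(), the temporary file path and the sample values are assumptions made for illustration. It loads the file if it already has content, reuses or creates the document element, and adds the values as text nodes so that characters like & are escaped properly:

```php
<?php
// Hypothetical helper: append one <Territorio> record to an XML file,
// creating the <Territori> document element on first use.
function appendTerritorio(string $file, array $fields): void
{
    $document = new DOMDocument('1.0', 'UTF-8');
    $document->preserveWhiteSpace = false;

    // Load existing content only if the file is there and not empty.
    clearstatcache(true, $file);
    if (is_file($file) && filesize($file) > 0) {
        $document->load($file);
    }

    // Fetch or create the document element.
    $root = $document->documentElement
        ?? $document->appendChild($document->createElement('Territori'));

    // Add the new record; text nodes take care of escaping.
    $record = $root->appendChild($document->createElement('Territorio'));
    foreach ($fields as $name => $value) {
        $record
            ->appendChild($document->createElement($name))
            ->appendChild($document->createTextNode((string) $value));
    }

    $document->formatOutput = true;
    $document->save($file);
}

// Usage with made-up values: two calls should leave two records in the file.
$file = sys_get_temp_dir() . '/territori_' . uniqid() . '.xml';
appendTerritorio($file, ['NUM' => '12', 'COM' => 'Rossi & Figli']);
appendTerritorio($file, ['NUM' => '13', 'COM' => 'second call']);

$check = new DOMDocument();
$check->load($file);
echo $check->getElementsByTagName('Territorio')->length, "\n"; // 2
unlink($file);
```

In a real script the field values would come from the query string after validation, and the path should not be built from unchecked user input.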
I have the following XML code which I want to read and get the value inside "content" tag.
"<?xml version='1.0' encoding='ISO-8859-1'?>
<ad modelVersion='0.9'>
<richmediaAd>
<content>
<![CDATA[<script src=\"mraid.js\"></script>
<div class=\"celtra-ad-v3\">
<img src=\"data: image/png, celtra\" style=\"display: none\"onerror=\"(function(img){ varparams={ 'channelId': '45f3f23c','clickUrl': 'http%3a%2f%2fexamplehost.com%3a53766%2fCloudMobRTBWeb%2fClickThroughHandler.ashx%3fadid%3de6983c95-9292-4e16-967d-149e2e77dece%26cid%3d352%26crid%3d850'};varreq=document.createElement('script');req.id=params.scriptId='celtra-script-'+(window.celtraScriptIndex=(window.celtraScriptIndex||0)+1);params.clientTimestamp=newDate/1000;req.src=(window.location.protocol=='https: '?'https': 'http')+': //ads.celtra.com/e7f5ce18/mraid-ad.js?';for(varkinparams){req.src+='&'+encodeURIComponent(k)+'='+encodeURIComponent(params[ k ]); }img.parentNode.insertBefore(req, img.nextSibling);})(this);\"/>
</div>]]>
</content>
<width>320</width>
<height>50</height>
</richmediaAd>
</ad>"
I tried 2 methods (SimpleXML and DOM). I managed to get the value but found the keyword "CDATA" missing. What I got inside "content" tag was:
<script src="mraid.js"></script>
<div class="celtra-ad-v3">
<img src="data: image/png, celtra" style="display: none"onerror="(function(img){ varparams={ 'channelId': '45f3f23c','clickUrl': 'http%3a%2f%2fexamplehost.com%3a53766%2fCloudMobRTBWeb%2fClickThroughHandler.ashx%3fadid%3de6983c95-9292-4e16-967d-149e2e77dece%26cid%3d352%26crid%3d850'};varreq=document.createElement('script');req.id=params.scriptId='celtra-script-'+(window.celtraScriptIndex=(window.celtraScriptIndex||0)+1);params.clientTimestamp=newDate/1000;req.src=(window.location.protocol=='https: '?'https': 'http')+': //ads.celtra.com/e7f5ce18/mraid-ad.js?';for(varkinparams){req.src+='&'+encodeURIComponent(k)+'='+encodeURIComponent(params[ k ]); }img.parentNode.insertBefore(req, img.nextSibling);})(this);"/>
</div>
I know the parser was trying to sort of "beautify" the XML by removing CDATA. But what I want is just the raw data with "CDATA" tag in it. Is there any way to achieve this?
Appreciate your help.
And below are my two methods for your reference:
Method 1:
$type = simplexml_load_string($response['adm']) or die("Error: Cannot create object");
$data = $type->richmediaAd[0]->content;
Yii::warning((string) $data);
Yii::warning(strpos($data, 'CDATA'));
Method 2:
$doc = new \DOMDocument();
$doc->loadXML($response['adm']);
$richmediaAds = ($doc->getElementsByTagName("richmediaAd"));
foreach($richmediaAds as $richmediaAd){
$contents = $richmediaAd->getElementsByTagName("content");
foreach($contents as $content){
Yii::warning($content->nodeValue);
}
}
I'll improve this if I can, but you can target explicitly the "CDATA Section" node of your content element and use $doc->saveXML( $node ) with the node as the parameter to get that exact XML element structure.
$doc = new \DOMDocument();
$doc->loadXML( $xml );
$xpath = new \DOMXPath( $doc );
$nodes = $xpath->query( '/ad/richmediaAd/content');
foreach( $nodes[0]->childNodes as $node )
{
if( $node->nodeType === XML_CDATA_SECTION_NODE )
{
echo $doc->saveXML( $node ); // string content
}
}
Edit: You may wish to add a fallback in case no CDATA section is found.
Without XPATH
$doc = new \DOMDocument();
$doc->loadXML( $xml );
$doc->normalize();
foreach( $doc->getElementsByTagName('content')->item(0)->childNodes as $node )
{
if( $node->nodeType === XML_CDATA_SECTION_NODE )
{
echo $doc->saveXML( $node ); // string content
}
}
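A minimal sketch of such a fallback, assuming a much shorter sample document than the one in the question (the helper name rawContent() is hypothetical): it returns the serialized CDATA section when one exists and otherwise falls back to the element's plain text.

```php
<?php
// Hypothetical helper: return the raw CDATA markup of an element if it has
// a CDATA section child, otherwise fall back to its text content.
function rawContent(DOMDocument $doc, DOMElement $element): string
{
    foreach ($element->childNodes as $node) {
        if ($node->nodeType === XML_CDATA_SECTION_NODE) {
            // saveXML() on the node keeps the <![CDATA[ ... ]]> wrapper.
            return $doc->saveXML($node);
        }
    }
    return $element->textContent;
}

// Shortened sample input (an assumption, not the original ad markup).
$xml = '<ad><richmediaAd><content> <![CDATA[<div class="x">hi</div>]]> </content>'
     . '<width>320</width></richmediaAd></ad>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$content = $doc->getElementsByTagName('content')->item(0);
$width   = $doc->getElementsByTagName('width')->item(0);

$raw   = rawContent($doc, $content); // CDATA wrapper preserved
$plain = rawContent($doc, $width);   // no CDATA, plain text fallback

echo $raw, "\n", $plain, "\n";
```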
When adding a string that might contain troublesome characters (eg &, <, >), DOMDocument throws a warning, rather than sanitizing the string.
I'm looking for a succinct way to make strings xml-safe - ideally something that leverages the DOMDocument library.
I'm looking for something better than preg_replace or htmlspecialchars. I see DOMDocument::createTextNode(), but the resulting DOMText object is cumbersome and can't be handed to DOMDocument::createElement().
To illustrate the problem, this code:
<?php
$dom = new DOMDocument;
$dom->formatOutput = true;
$parent = $dom->createElement('rootNode');
$parent->appendChild( $dom->createElement('name', 'this ampersand causes pain & sorrow ') );
$dom->appendChild( $parent );
echo $dom->saveXml();
produces this result (see eval.in):
Warning: DOMDocument::createElement(): unterminated entity reference sorrow in /tmp/execpad-41ee778d3376/source-41ee778d3376 on line 6
<?xml version="1.0"?>
<rootNode>
<name>this ampersand causes pain </name>
</rootNode>
You will have to create the text node and append it. I described the problem in this answer: https://stackoverflow.com/a/22957785/2265374
However, you can extend DOMDocument and override createElement*().
class MyDOMDocument extends DOMDocument {
public function createElement($name, $content = '') {
$node = parent::createElement($name);
if ((string)$content !== '') {
$node->appendChild($this->createTextNode($content));
}
return $node;
}
public function createElementNS($namespace, $name, $content = '') {
$node = parent::createElementNS($namespace, $name);
if ((string)$content !== '') {
$node->appendChild($this->createTextNode($content));
}
return $node;
}
}
$dom = new MyDOMDocument();
$root = $dom->appendChild($dom->createElement('foo'));
$root->appendChild($dom->createElement('bar', 'Company & Son'));
$root->appendChild($dom->createElementNS('urn:bar', 'bar', 'Company & Son'));
$dom->formatOutput = TRUE;
echo $dom->saveXml();
Output:
<?xml version="1.0"?>
<foo>
<bar>Company &amp; Son</bar>
<bar xmlns="urn:bar">Company &amp; Son</bar>
</foo>
This is the structure I use to build XML elements, the second part is usually wrapped in a function.
$parent = $document->documentElement; // pick the node we want to append to
$name = 'foo'; // new element name
$content = 'bar < not a tag > <![CDATA[" testing cdata "]]>'; // content
$element = ($parent->ownerDocument) ? $parent->ownerDocument->createElement($name) : $parent->createElement($name);
$parent->appendchild($element);
$element->appendChild($element->ownerDocument->createTextNode($content)); // $element always has an owner document, even when $parent is the document itself
my function will then return $element
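Wrapped in a function, that structure might look like the sketch below. The name appendTextElement() and the sample content are assumptions for illustration, not part of the original answer:

```php
<?php
// Hypothetical wrapper: create <$name>, append it to $parent and add
// $content as a text node so that &, < and > are escaped on output.
function appendTextElement(DOMNode $parent, string $name, string $content): DOMElement
{
    // A DOMDocument has no ownerDocument, so use the parent itself in that case.
    $document = $parent instanceof DOMDocument ? $parent : $parent->ownerDocument;

    $element = $parent->appendChild($document->createElement($name));
    $element->appendChild($document->createTextNode($content));

    return $element;
}

$document = new DOMDocument();
$document->loadXML('<root/>');

$element = appendTextElement($document->documentElement, 'foo', 'bar < not a tag > & more');

$serialized = $document->saveXML($element);
echo $serialized, "\n";
```

Reading the value back with textContent gives the original string, while the serialized form carries the escaped entities.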
I'm using the following script for a lightweight DOM editor. However, nodeValue in my for loop is converting my html tags to plain text. What is a PHP alternative to nodeValue that would maintain my innerHTML?
$page = $_POST['page'];
$json = $_POST['json'];
$doc = new DOMDocument();
$doc->loadHTMLFile($page); // call on the instance; the static form is an error in PHP 8
$xpath = new DOMXPath($doc);
$entries = $xpath->query('//*[@class="editable"]');
$edits = json_decode($json, true);
$num_edits = count($edits);
for($i=0; $i<$num_edits; $i++)
{
$entries->item($i)->nodeValue = $edits[$i]; // nodeValue strips html tags
}
$doc->saveHTMLFile($page);
Since $edits[$i] is a string, you need to parse it into a DOM structure and replace the original content with the new structure.
Update
The code fragment below works well even with HTML that is not XML-compliant (e.g. HTML 4/5).
for($i=0; $i<$num_edits; $i++)
{
$f = new DOMDocument();
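// convert UTF-8 input to entities so loadHTML() reads it correctly
// (note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2)
$edit = mb_convert_encoding($edits[$i], 'HTML-ENTITIES', "UTF-8");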
$f->loadHTML($edit);
$node = $f->documentElement->firstChild;
$entries->item($i)->nodeValue = "";
foreach($node->childNodes as $child) {
$entries->item($i)->appendChild($doc->importNode($child, true));
}
}
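The same idea can be sketched as a reusable helper. This is an illustration under assumptions, not the answer's exact method: the name setInnerHtml() and the wrapper div are made up, and the UTF-8 hint is passed with the common `<?xml encoding="UTF-8">` prefix instead of mb_convert_encoding.

```php
<?php
// Hypothetical helper: replace the children of $target with parsed HTML.
function setInnerHtml(DOMElement $target, string $html): void
{
    libxml_use_internal_errors(true);

    // Parse the fragment inside a known wrapper; the XML prefix hints UTF-8.
    $fragmentDoc = new DOMDocument();
    $fragmentDoc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    $wrapper = $fragmentDoc->getElementsByTagName('div')->item(0);

    // Remove the existing children of the target.
    while ($target->firstChild) {
        $target->removeChild($target->firstChild);
    }

    // Import copies of the parsed nodes into the target's document.
    foreach ($wrapper->childNodes as $child) {
        $target->appendChild($target->ownerDocument->importNode($child, true));
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<div class="editable">old text</div>');
$target = $doc->getElementsByTagName('div')->item(0);

setInnerHtml($target, 'Hello <b>world</b>');
echo $doc->saveHTML($target), "\n";
```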
I haven't worked with that library in PHP before, but from my other XPath experience I think that nodeValue on anything other than a text node does strip tags. If you're unsure about what's underneath that node, then I think you'll need to recursively descend $entries->item($i)->childNodes if you need to get the markup back.
Or... you may want textContent instead of nodeValue, although it too returns text without markup:
http://us.php.net/manual/en/class.domnode.php#domnode.props.textcontent
I am looking for something equivalent to this:
$e= xmlwriter_open_uri("test.xml");
....
print htmlentities(xmlwriter_output_memory($e));
This print lets me display what is in the XML as a table.
But with my SimpleXML version (combined with $dom for formatting) I have no idea how to display it. It writes the output I want into the XML file, but how do I display the XML below? Is there something similar to print?
The purpose is to display the values of the XML in a table.
$dom = new DOMDocument('1.0');
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$xml = new SimpleXMLElement('<test></test>');
$one= $xml->addChild('enemy', 'yes');
$two= $xml->addChild('friend', 'maybe');
$dom->loadXML($xml->asXML());
$dom->save('test.xml');
Regards
You don't need to stringify (technical term!) the SimpleXMLElement to load it into a DOMDocument, in fact that's a terrible idea (though, you're forgiven).
$xml = new SimpleXMLElement('<test></test>');
$one= $xml->addChild('enemy', 'yes');
$two= $xml->addChild('friend', 'maybe');
// Get the DOMDocument associated with this XML
$dom = dom_import_simplexml($xml)->ownerDocument;
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
echo $dom->saveXML(); // or echo htmlentities($dom->saveXML()) if you really must
More info about retrieving a DOMElement (and its DOMDocument) from a SimpleXMLElement can be found in the docs for dom_import_simplexml().
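If the goal is an HTML table of the values rather than the raw XML, one possible sketch is to iterate over the SimpleXMLElement children directly. The table layout here is an assumption about what "display in a table" means:

```php
<?php
$xml = new SimpleXMLElement('<test></test>');
$xml->addChild('enemy', 'yes');
$xml->addChild('friend', 'maybe');

// Build one table row per child element: name in the first cell, value in the second.
$rows = '';
foreach ($xml->children() as $child) {
    $rows .= sprintf(
        "<tr><td>%s</td><td>%s</td></tr>\n",
        htmlspecialchars($child->getName()),
        htmlspecialchars((string) $child)
    );
}

$table = "<table>\n" . $rows . "</table>\n";
echo $table;
```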