I need to automatically create GEXF (http://gexf.net) XML files from an array of nodes in PHP.
I've Googled the topic but was unable to find anything useful.
How would I do this?
Here is my take on this. After a number of hours I got it all working. The example below also supports the viz: namespace, so you can add more detailed node styling and placement.
// Construct DOM elements
$xml = new DomDocument('1.0', 'UTF-8');
$xml->formatOutput = true;
$gexf = $xml->createElementNS(null, 'gexf');
$gexf = $xml->appendChild($gexf);
// Declare namespaces for GEXF with viz support
$gexf->setAttributeNS('http://www.w3.org/2000/xmlns/', 'xmlns:viz', 'http://www.gexf.net/1.1draft/viz'); // Skip if you don't need viz!
$gexf->setAttributeNS('http://www.w3.org/2000/xmlns/', 'xmlns:xsi', 'http://www.w3.org/2001/XMLSchema-instance');
$gexf->setAttributeNS('http://www.w3.org/2001/XMLSchema-instance', 'xsi:schemaLocation', 'http://www.gexf.net/1.2draft http://www.gexf.net/1.2draft/gexf.xsd');
// Add Meta data
$meta = $gexf->appendChild($xml->createElement('meta'));
$meta->setAttribute('lastmodifieddate', date('Y-m-d'));
$meta->appendChild($xml->createElement('creator', 'PHP GEXF Generator v0.1'));
$meta->appendChild($xml->createElement('description', 'by me etc'));
// Add Graph data!
$graph = $gexf->appendChild($xml->createElement('graph'));
$nodes = $graph->appendChild($xml->createElement('nodes'));
$edges = $graph->appendChild($xml->createElement('edges'));
// Add Node!
$node = $xml->createElement('node');
$node->setAttribute('id', '1');
$node->setAttribute('label', 'Hello world!');
// Set color for node
$color = $xml->createElement('viz:color');
$color->setAttribute('r', '1');
$color->setAttribute('g', '1');
$color->setAttribute('b', '1');
$node->appendChild($color);
// Set position for node
$position = $xml->createElement('viz:position');
$position->setAttribute('x', '1');
$position->setAttribute('y', '1');
$position->setAttribute('z', '1');
$node->appendChild($position);
// Set size for node
$size = $xml->createElement('viz:size');
$size->setAttribute('value', '1');
$node->appendChild($size);
// Set shape for node
$shape = $xml->createElement('viz:shape');
$shape->setAttribute('value', 'disc');
$node->appendChild($shape);
// Add Edge (assuming there is a node with id 2 as well!)
$edge = $xml->createElement('edge');
$edge->setAttribute('source', '1');
$edge->setAttribute('target', '2');
// Commit node & edge changes to nodes!
$edges->appendChild($edge);
$nodes->appendChild($node);
// Serve file as XML (prompt for download, remove if unnecessary)
header('Content-Type: text/xml; charset=utf-8');
header('Content-disposition: attachment; filename="internet.gexf"');
// Show results!
echo $xml->saveXML();
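Since the original question starts from an array of nodes, the hard-coded node and edge above can be replaced with a loop. Here is a minimal sketch; the $myNodes / $myEdges arrays and their keys are made up for illustration, so adapt them to your own data:
// Hypothetical input data: one entry per node / edge
$myNodes = array(
    array('id' => '1', 'label' => 'Hello world!'),
    array('id' => '2', 'label' => 'Another node'),
);
$myEdges = array(
    array('id' => 'e1', 'source' => '1', 'target' => '2'),
);
foreach ($myNodes as $n) {
    $node = $xml->createElement('node');
    $node->setAttribute('id', $n['id']);
    $node->setAttribute('label', $n['label']);
    $nodes->appendChild($node); // $nodes is the <nodes> element created above
}
foreach ($myEdges as $e) {
    $edge = $xml->createElement('edge');
    $edge->setAttribute('id', $e['id']);
    $edge->setAttribute('source', $e['source']);
    $edge->setAttribute('target', $e['target']);
    $edges->appendChild($edge); // $edges is the <edges> element created above
}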
Eat your heart out! Feel free to email me your project results; I'm curious.
Thanks in advance. I'm trying to build a web scraper with PHP, and I'm using Visual Studio Code.
When I run the following code, this problem shows up:
Use of unknown class: 'Goutte\Client'
Does anyone know how to solve that issue?
I have googled all over the place, looked at SO, and asked the forbidden one, but after three days I still haven't made any progress. (I'm also a noob, so maybe it's not as difficult to solve as I think.)
Looking forward to your feedback and tips.
<?php
require 'vendor/autoload.php';

use Goutte\Client;

// Initialize the Goutte client
$client = new Client();

// Create a new array to store the scraped data
$data = array();

// Loop through the pages
for ($i = 0; $i < 3; $i++) {
    // Make a request to the website
    $crawler = $client->request('GET', 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page=' . $i);

    // Find all the initiatives on the page
    $crawler->filter('.initiative')->each(function ($node) use (&$data) {
        // Extract the information for each initiative
        $title = $node->filter('h3')->text();
        $link = $node->filter('a')->attr('href');
        $description = $node->filter('p')->text();
        $deadline = $node->filter('time')->attr('datetime');

        // Append the data for the initiative to the data array
        $data[] = array($title, $link, $description, $deadline);
    });

    // Sleep for a random amount of time between 5 and 10 seconds
    $sleep = rand(5, 10);
    sleep($sleep);
}

// Open the output file
$fp = fopen('initiatives.csv', 'w');

// Write the header row
fputcsv($fp, array('Title', 'Link', 'Description', 'Deadline'));

// Write one row per scraped initiative, then close the file
foreach ($data as $row) {
    fputcsv($fp, $row);
}
fclose($fp);
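As an aside on the "Use of unknown class" message: it usually means the class can't be resolved through the Composer autoloader, typically because the package isn't installed in vendor/. A minimal sanity check you could drop in at the top of the script (a sketch only; it assumes the standard package name fabpot/goutte):
require 'vendor/autoload.php';

// class_exists() triggers the Composer autoloader; if it returns false,
// the package is most likely missing from vendor/.
if (!class_exists('Goutte\Client')) {
    exit("Goutte not found - install the fabpot/goutte package with Composer and re-run.\n");
}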
I'm using PHP to access data from an API that supplies XML; the retrieval call and URL look like this:
$response = $oauth->get('https://example.com/Main/1');
In this case the "1" is the page number, and the request returns the first 100 results. (That part I have working.)
But if there are more results, I currently can't access them automatically; I'd have to change the URL manually.
The returned XML includes a <Links><rel>last</rel><href>https://example.com/Main/3</href></Links> entry that indicates how many pages there are. (In this case, 3 pages are available.)
Here's a sample of the XML returned:
<?xml version="1.0" encoding="UTF-8"?>
<Fleet xmlns="http://standards.iso.org/iso/15143/-3" version="2" snapshotTime="2020-01-13T20:12:55.224Z">
    <Links>
        <rel>self</rel>
        <href>https://example.com/Main/1</href>
    </Links>
    <Links>
        <rel>last</rel>
        <href>https://example.com/Main/3</href>
    </Links>
    <Equipment>
        <EquipmentHeader>
            <OEMName>CAT</OEMName>
            <Model>D6</Model>
            <EquipmentID>1111111</EquipmentID>
            <SerialNumber>1111111</SerialNumber>
            <PIN>1111111</PIN>
        </EquipmentHeader>
        <CumulativeOperatingHours datetime="2018-07-29T18:15:30.000Z">
            <Hour>1111</Hour>
        </CumulativeOperatingHours>
    </Equipment>
    <!-- ... and so on - 100 results ... -->
</Fleet>
Is there a simple way to check the value given for the last page, and then loop through retrieving data from each of the pages (from the first to the last)?
(I won't know how many pages there are until the results of the first request come back.)
UPDATE
I've come up with this for finding the number of pages:
$total_pages = NULL;
$xml = simplexml_load_string($response);

// Get the used namespace, and use that
$namespaces = $xml->getDocNamespaces();
if (isset($namespaces[''])) {
    $defaultNamespaceUrl = $namespaces[''];
    $xml->registerXPathNamespace('default', $defaultNamespaceUrl);
    $nsprefix = 'default:';
} else {
    $nsprefix = '';
}

$nodes = $xml->xpath('//' . $nsprefix . 'Links');
foreach ($nodes as $node) {
    if ($node->rel == 'last') {
        $last_page_url = $node->href;
        $pos = strrpos($last_page_url, '/'); // position of the last slash in the URL
        // if there is no slash, then 0; otherwise the value after the last slash
        $total_pages = $pos === false ? 0 : substr($last_page_url, $pos + 1);
    } // end if
} // end foreach
echo $total_pages;
So now I need to figure out how to loop through the requests...
First, you could simplify the last page lookup, using DOMXPath:
$domDocument = new \DOMDocument();
$domDocument->loadXML($response);
$xpath = new \DOMXPath($domDocument);
$xpath->registerNamespace('d', 'http://standards.iso.org/iso/15143/-3');
$lastPageHref = $xpath->evaluate('string(//d:Links/d:rel[text()="last"]/following-sibling::d:href)');
$lastPage = (int)basename($lastPageHref);
This gets the href element that is a following sibling of a rel element whose text content is "last", where that rel is itself a child of a Links element anywhere in the document.
It then uses basename to get the last part of that URL, and converts it to an integer.
Demo: https://3v4l.org/urfU3
From there, you can simply do something like this (where YourOAuthClass is a placeholder for whatever class $oauth actually is):
function fetchPage(YourOAuthClass $oauth, int $page): \DOMDocument
{
    $xml = $oauth->get("https://example.com/Main/$page");
    $domDocument = new \DOMDocument();
    $domDocument->loadXML($xml);
    return $domDocument;
}

$domDocument = fetchPage($oauth, 1);
// Here, do the code above to grab $lastPage
// Also do stuff with $domDocument (handle page 1)

for ($page = 2; $page <= $lastPage; $page++) {
    $domDocument = fetchPage($oauth, $page);
    // Do stuff with $domDocument (handle the current page)
}
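In case it helps, here is a rough sketch of what the "do stuff" part could look like, pulling the Equipment entries out of each page with the same namespace registration as above (the selected fields simply mirror the sample XML in the question):
function extractEquipment(\DOMDocument $domDocument): array
{
    $xpath = new \DOMXPath($domDocument);
    $xpath->registerNamespace('d', 'http://standards.iso.org/iso/15143/-3');

    $rows = array();
    foreach ($xpath->query('//d:Equipment') as $equipment) {
        $rows[] = array(
            'model'  => $xpath->evaluate('string(d:EquipmentHeader/d:Model)', $equipment),
            'serial' => $xpath->evaluate('string(d:EquipmentHeader/d:SerialNumber)', $equipment),
            'hours'  => $xpath->evaluate('string(d:CumulativeOperatingHours/d:Hour)', $equipment),
        );
    }
    return $rows;
}

// Usage: initialize $allRows = array(); before the loop, then at each "Do stuff" spot:
// $allRows = array_merge($allRows, extractEquipment($domDocument));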
I am trying to read a website's content, but I have a problem. I want to get images, links, and similar elements, but I want the elements themselves, not just their text content. For instance, for a link I want the entire element, not just its link text.
How can I do this?
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.link.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);

$dom = new DOMDocument;
@$dom->loadHTML($output); // suppress warnings from malformed HTML

$items = $dom->getElementsByTagName('a');
for ($i = 0; $i < $items->length; $i++) {
    echo $items->item($i)->nodeValue . "<br />";
}

curl_close($ch);
?>
You appear to be asking for the serialized HTML of a DOMElement? I.e. you want a string containing the whole <a> tag, not just the link text? (Please make your question clearer.)
$url = 'http://example.com';
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $a) {
    // Best solution, but only works with PHP >= 5.3.6
    $htmlstring = $dom->saveHTML($a);

    // Otherwise you need to serialize to XML and then fix the self-closing elements
    $htmlstring = saveHTMLFragment($a);
    echo $htmlstring, "\n";
}

function saveHTMLFragment(DOMElement $e) {
    $selfclosingelements = array('></area>', '></base>', '></basefont>',
        '></br>', '></col>', '></frame>', '></hr>', '></img>', '></input>',
        '></isindex>', '></link>', '></meta>', '></param>', '></source>',
    );
    // This is not 100% reliable because it may output namespace declarations.
    // But otherwise it is extra-paranoid to work down to at least PHP 5.1
    $html = $e->ownerDocument->saveXML($e, LIBXML_NOEMPTYTAG);
    // In case any empty elements are expanded, collapse them again:
    $html = str_ireplace($selfclosingelements, '>', $html);
    return $html;
}
However, note that what you are doing is dangerous because it could potentially mix encodings. It is better to have your output as another DOMDocument and use importNode() to copy the nodes you want. Alternatively, use an XSL stylesheet.
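A rough sketch of that importNode() approach (the output document and the <div> container are just for illustration):
// Build a separate output document and copy the anchors into it, so the
// markup is re-serialized consistently in the output document's encoding.
$out = new DOMDocument('1.0', 'UTF-8');
$container = $out->appendChild($out->createElement('div'));

foreach ($dom->getElementsByTagName('a') as $a) {
    // true = deep import: the element is copied together with all its children
    $container->appendChild($out->importNode($a, true));
}

echo $out->saveHTML();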
I'm assuming you just copy-pasted some example code and didn't bother trying to learn how it actually works...
Anyway, the ->nodeValue part takes the element and returns the text content (because the element has a single text node child - if it had anything else, I don't know what nodeValue would give).
So, just remove the ->nodeValue and you have your element.
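If the goal is to echo the markup of each link, the loop from the question then becomes something like this (a sketch; as the other answer notes, DOMDocument::saveHTML() only accepts a node argument on PHP >= 5.3.6):
$items = $dom->getElementsByTagName('a');
for ($i = 0; $i < $items->length; $i++) {
    // Serialize the whole <a> element instead of reading its text content
    echo $dom->saveHTML($items->item($i)) . "<br />";
}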
I'm writing a script that adds nodes to an XML file. In addition, I have an external DTD I made to handle the organization of the file. However, the script keeps stripping the DTD declaration from the XML file when it saves after appending nodes. How can I stop this from happening?
Code:
<?php
/* DOM vars */
$dom = new DOMDocument("1.0", "UTF-8");
$previous_value = libxml_use_internal_errors(TRUE);
$dom->load('post.xml');
libxml_clear_errors();
libxml_use_internal_errors($previous_value);
$dom->formatOutput = true;

$entry = $dom->getElementsByTagName('entry');
$date = $dom->getElementsByTagName('date');
$para = $dom->getElementsByTagName('para');
$link = $dom->getElementsByTagName('link');

/* Dem POST vars used by dat Ajax mah ziggen, yeah boi */
if (isset($_POST['Text'])) {
    $text = trim($_POST['Text']);
}

/*
function post(){
    global $dom, $entry, $date, $para, $link,
        $home, $about, $contact, $text;
*/
$entryC = $dom->createElement('entry');
$dateC = $dom->createElement('date', date("m d, y H:i:s"));
$entryC->appendChild($dateC);

$tab = "\n";
$frags = explode($tab, $text);
$i = count($frags);
$b = 0;
while ($b < $i) {
    $paraC = $dom->createElement('para', $frags[$b]);
    $entryC->appendChild($paraC);
    $b++;
}

$linkC = $dom->createElement('link', rand(100000, 999999));
$entryC->appendChild($linkC);

$dom->appendChild($entryC);
$dom->save('post.xml');
/*}
post();
*/
echo 1;
?>
It looks like in order to do this, you'd have to create a DOMDocumentType using
DOMImplementation::createDocumentType,
then create an empty document with the DOMImplementation (passing in the DOMDocumentType you just created), and finally import the document you loaded. This post and its comments looked useful: http://pointbeing.net/weblog/2009/03/adding-a-doctype-declaration-to-a-domdocument-in-php.html
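Something along these lines might work (an untested sketch; the root element name 'posts' and the DTD path 'post.dtd' are placeholders for whatever your file actually uses):
$imp = new DOMImplementation();

// Recreate the doctype that gets dropped, pointing at the external DTD
$doctype = $imp->createDocumentType('posts', '', 'post.dtd');

// Create a fresh, empty document that carries this doctype
$newDom = $imp->createDocument(null, '', $doctype);
$newDom->encoding = 'UTF-8';
$newDom->formatOutput = true;

// Load the existing file and copy its root element into the new document
$oldDom = new DOMDocument();
$oldDom->load('post.xml');
$root = $newDom->importNode($oldDom->documentElement, true);
$newDom->appendChild($root);

// ... append the new <entry> nodes to $root as in the script above ...

$newDom->save('post.xml');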
I'm guessing this is happening because after parsing/validation, the DTD isn't part of the DOM anymore, and PHP therefore isn't able to include it when the document is serialized.
Do you have to use a DTD? XML Schemas can be linked via attributes (and the link is therefore part of the DOM). Or there's RelaxNG, which can be linked via a processing instruction. DTDs have all this baggage that comes with them as a holdover from SGML. There are better alternatives.
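For instance, linking an XML Schema through an attribute on the root element (a sketch; 'post.xsd' is a placeholder) survives the load/save round trip because it is just an ordinary attribute in the DOM:
$root = $dom->documentElement;
// xsi:noNamespaceSchemaLocation points validating tools at the schema file
$root->setAttributeNS(
    'http://www.w3.org/2001/XMLSchema-instance',
    'xsi:noNamespaceSchemaLocation',
    'post.xsd'
);
$dom->save('post.xml');

// Validation can then be done explicitly with:
// $dom->schemaValidate('post.xsd');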
I have written this little class (below) which generates an XML sitemap, but when I try to submit it to Google Webmaster Tools I get this error:
Sitemap URL: http://www.moto-trek.co.uk/sitemap/xml
Unsupported file format
Your Sitemap does not appear to be in a supported format. Please ensure it meets our Sitemap guidelines and resubmit.
<?php
class Frontend_Sitemap_Xml extends Cms_Controller {

    /**
     * Intercept special function actions and dispatch them.
     */
    public function postDispatch() {
        $db = Cms_Db_Connections::getInstance()->getConnection();
        $oFront = $this->getFrontController();
        $oUrl = Cms_Url::getInstance();
        $oCore = Cms_Core::getInstance();
        $absoDomPath = $oFront->getDomain() . $oFront->getHome();

        $pDom = new DOMDocument();

        $pXML = $pDom->createElement('xml');
        $pXML->setAttribute('version', '1.0');
        $pXML->setAttribute('encoding', 'UTF-8');
        // Finally we append the element to the XML tree using appendChild
        $pDom->appendChild($pXML);

        $pUrlset = $pDom->createElement('urlset');
        $pUrlset->setAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
        $pXML->appendChild($pUrlset);

        // FETCH content and section items
        $array = $this->getDataset("sitemap")->toArray();
        foreach ($array["sitemap"]['rows'] as $row) {
            try {
                $content_id = $row['id']['fvalue'];
                $url = "http://" . $absoDomPath . $oUrl->forContent($content_id);

                $pUrl = $pDom->createElement('url');
                $pLoc = $pDom->createElement('loc', $url);
                $pLastmod = $pDom->createElement('lastmod', gmdate('Y-m-d\TH:i:s', strtotime($row['modified']['value'])));
                $pChangefreq = $pDom->createElement('changefreq', ($row['changefreq']['fvalue'] != "") ? $row['changefreq']['fvalue'] : 'monthly');
                $pPriority = $pDom->createElement('priority', ($row['priority']['fvalue']) ? $row['priority']['fvalue'] : '0.5');

                $pUrl->appendChild($pLoc);
                $pUrl->appendChild($pLastmod);
                $pUrl->appendChild($pChangefreq);
                $pUrl->appendChild($pPriority);
                $pUrlset->appendChild($pUrl);
            } catch (Exception $e) {
                throw($e);
            }
        }

        // Set the content type to XML, thus forcing the browser to render it as XML
        header('Content-type: text/xml');

        // Here we simply dump the XML tree to a string and output it to the browser.
        // We could use one of the other save methods to save the tree as an HTML string,
        // an XML file or an HTML file.
        echo $pDom->saveXML();
    }
}
?>
urlset should be the root element, but in your case it is xml. Appending urlset directly to the DOMDocument should solve your problem.
$pDom = new DOMDocument('1.0', 'UTF-8');

$pUrlset = $pDom->createElement('urlset');
$pUrlset->setAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');

foreach ($array["sitemap"]['rows'] as $row) {
    // ... build each <url> element and append it to $pUrlset, as in your loop ...
}

$pDom->appendChild($pUrlset);
echo $pDom->saveXML();
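Put back into the class from the question, postDispatch() would then build the tree roughly like this (a sketch only; the Cms_* helper calls and the dataset layout are taken from the question as-is):
public function postDispatch() {
    $oFront = $this->getFrontController();
    $oUrl = Cms_Url::getInstance();
    $absoDomPath = $oFront->getDomain() . $oFront->getHome();

    // The XML declaration comes from the DOMDocument constructor;
    // no <xml> element is needed (or allowed).
    $pDom = new DOMDocument('1.0', 'UTF-8');

    // <urlset> is the root element of a sitemap
    $pUrlset = $pDom->createElement('urlset');
    $pUrlset->setAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
    $pDom->appendChild($pUrlset);

    $array = $this->getDataset("sitemap")->toArray();
    foreach ($array["sitemap"]['rows'] as $row) {
        $url = "http://" . $absoDomPath . $oUrl->forContent($row['id']['fvalue']);

        $pUrl = $pDom->createElement('url');
        $pUrl->appendChild($pDom->createElement('loc', $url));
        $pUrl->appendChild($pDom->createElement('lastmod', gmdate('Y-m-d\TH:i:s', strtotime($row['modified']['value']))));
        $pUrl->appendChild($pDom->createElement('changefreq', ($row['changefreq']['fvalue'] != "") ? $row['changefreq']['fvalue'] : 'monthly'));
        $pUrl->appendChild($pDom->createElement('priority', ($row['priority']['fvalue']) ? $row['priority']['fvalue'] : '0.5'));
        $pUrlset->appendChild($pUrl);
    }

    header('Content-type: text/xml');
    echo $pDom->saveXML();
}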