Parsing xml data from website - php

I want to retrieve the data on a certain site that sits inside font elements with size="6".
I plan on doing this with the XML parser, but so far I have had no luck. This is my code; if anyone knows where my mistake is, it would be much appreciated.
Thanks
@$doc=new DOMDocument();
@$doc->loadHTML($html4);
$xml=simplexml_import_dom($doc); // just to make xpath more simple
$data=$xml->xpath('//font size=6');
$arr= array();
foreach ($data as $img) {
echo $img;
}

Something like:
$doc = new DOMDocument();
$doc->loadHTML($html4);
$xpath = new DOMXPath($doc);
// "//" matches font elements at any depth; "@size" selects the size attribute
$data = $xpath->query("//font[@size='6']");
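To see the attribute predicate in action, here is a minimal sketch that runs the query against a small inline page. The markup in $html4 is made up for illustration and simply stands in for whatever was downloaded from the site.

```php
<?php
// Sample markup standing in for the downloaded page (made up for illustration).
$html4 = '<html><body><font size="6">Big</font><p><font size="3">small</font>'
       . '<font size="6">Also big</font></p></body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html4);

$xpath = new DOMXPath($doc);
$texts = [];
// "//" searches the whole document; [@size='6'] filters on the attribute value.
foreach ($xpath->query("//font[@size='6']") as $font) {
    $texts[] = $font->textContent;
}
print_r($texts);
```

Each item in the result is a DOMElement, so textContent (or nodeValue) gives the text inside the matching font tag.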

Related

Q: How to find a section from a web page without using XPath

I need to extract a section from a web page using the DOM API, without XPath. This is my current version, which still uses XPath. I need to extract the "Latest Distributions" section and display the information in the browser.
<?php
$result = file_get_contents ('https://distrowatch.com/');
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($result);
$xpath = new DOMXPath($doc);
$node = $xpath->query('//table[@class="News"]')->item(0);
echo $node->textContent;
This is pretty straightforward without XPath, although doing it this way instead of with XPath is a waste of time:
<?php
$result = file_get_contents ('https://distrowatch.com/');
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($result);
foreach ($doc->getElementsByTagName("table") as $table) {
if ($table->getAttribute("class") === "News") {
echo $table->textContent;
break;
}
}

How to get links with mp3 as extension

I have this code, which extracts all links from a website. How do I edit it so that it only extracts links that end in .mp3?
Here is the code:
preg_match_all("/\<a.+?href=(\"|')(?!javascript:|#)(.+?)(\"|')/i", $html, $matches);
Update:
A nice solution would be to use DOM together with XPath, as @zerkms mentioned in the comments:
$doc = new DOMDocument();
$doc->loadHTML($yourHtml);
$xpath = new DOMXPath($doc);
// ends-with() is XPath 2.0 and DOMXPath only supports XPath 1.0, so compare
// the last four characters of href with substring() to select links ending in .mp3
$links = $xpath->query('//a[substring(@href, string-length(@href) - 3) = ".mp3"]/@href');
Original Answer:
I would use DOM for this:
$doc = new DOMDocument();
$doc->loadHTML($yourHtml);
$links = array();
foreach($doc->getElementsByTagName('a') as $elem) {
if($elem->hasAttribute('href')
&& preg_match('/.*\.mp3$/i', $elem->getAttribute('href'))) {
$links []= $elem->getAttribute('href');
}
}
var_dump($links);
I would prefer XPath, which is meant for querying XML/XHTML:
$DOM = new DOMDocument();
@$DOM->loadHTML($html); // use the @ to suppress warnings from invalid HTML
$XPath = new DOMXPath($DOM);
$links = array();
$link_nodes = $XPath->query('//a[contains(@href, ".mp3")]');
foreach($link_nodes as $link_node) {
$source = $link_node->getAttribute('href');
// contains() matches anywhere, so make sure .mp3 is at the end of the string
if (strtolower(substr($source, -4)) === '.mp3') {
$links[] = $source;
}
}
There is an ends-with() XPath function that could replace contains(), but it requires XPath 2.0, and PHP's DOMXPath only implements XPath 1.0. Without it, an extra check in PHP is needed to make sure the .mp3 is at the end of the string, though depending on your links it may not be necessary.

Parsing XML with multiple namespaces in PHP

I have XML in the following form that I want to parse with PHP (I can't change the format of the XML). Neither SimpleXML nor DOM seems to handle the different namespaces. Can anyone give me sample code? The code below gives no results.
<atom:feed>
<atom:entry>
<atom:id />
<otherns:othervalue />
</atom:entry>
<atom:entry>
<atom:id />
<otherns:othervalue />
</atom:entry>
</atom:feed>
$doc = new DOMDocument();
$doc->load($url);
$entries = $doc->getElementsByTagName("atom:entry");
foreach($entries as $entry) {
$id = $entry->getElementsByTagName("atom:id");
echo $id;
$othervalue = $entry->getElementsByTagName("otherns:othervalue");
echo $othervalue;
}
I just want to post an answer to this awful question. Sorry.
Namespaces are irrelevant with DOM here - I just wasn't getting the nodeValue from the element.
$doc = new DOMDocument();
$doc->load($url);
$feed = $doc->getElementsByTagName("entry");
foreach($feed as $entry) {
$id = $entry->getElementsByTagName("id")->item(0)->nodeValue;
echo $id;
$othervalue = $entry->getElementsByTagName("othervalue")->item(0)->nodeValue;
echo $othervalue;
}
You need to register your namespaces. Otherwise SimpleXML will ignore them.
I got this bit of code from the PHP manual and used it in my own project:
$xmlsimple = simplexml_load_string('YOUR XML');
$namespaces = $xmlsimple->getNamespaces(true);
$extensions = array_keys($namespaces);
foreach ($extensions as $extension )
{
$xmlsimple->registerXPathNamespace($extension,$namespaces[$extension]);
}
After that you can use XPath on $xmlsimple.
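For the DOM side, a namespace-aware query might look like the sketch below. It assumes the feed actually declares its prefixes: the Atom URI is the standard one, but the otherns URI and the element values are placeholders invented for this example.

```php
<?php
// Sample feed; the "otherns" URI and the values are placeholders for illustration.
$xml = <<<XML
<atom:feed xmlns:atom="http://www.w3.org/2005/Atom" xmlns:otherns="http://example.com/otherns">
  <atom:entry>
    <atom:id>first</atom:id>
    <otherns:othervalue>one</otherns:othervalue>
  </atom:entry>
  <atom:entry>
    <atom:id>second</atom:id>
    <otherns:othervalue>two</otherns:othervalue>
  </atom:entry>
</atom:feed>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
// Prefixes used in queries are bound here; they need not match the document's own.
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$xpath->registerNamespace('o', 'http://example.com/otherns');

$rows = [];
foreach ($xpath->query('//a:entry') as $entry) {
    // string(...) evaluated relative to $entry returns the text of the first match.
    $rows[] = [
        'id'         => $xpath->evaluate('string(a:id)', $entry),
        'othervalue' => $xpath->evaluate('string(o:othervalue)', $entry),
    ];
}
print_r($rows);
```

Matching on the namespace URI rather than the prefix means the query keeps working even if the feed later switches to a different prefix for the same namespace.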

How to load a xml file in php so that i can use xpath on it?

I have a problem with PHP:
when I run the code below, nothing happens.
$filename = "/opt/olat/olatdata/bcroot/course/85235053647606/runstructure.xml";
if (file_exists($filename)) {
$xml = simplexml_load_file($filename, 'SimpleXMLElement', LIBXML_NOCDATA);
// $xpath = new DOMXPath($filename);
}
$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXpath($doc);
$res = $xpath->query('/org.olat.course.Structure/rootNode/children/org.olat.course.nodes.STCourseNode/shortTitle');
foreach ($res as $entry) {
echo "{$entry->nodeValue}<br/>";
}
If I replace the contents of $xml with the literal contents of the file $filename,
$xml = '<org.olat.course.Structure><rootNode class="org.olat.course.nodes.STCourseNode"> ... ';
then it works, so I think there is something wrong with the way the XML file is loaded.
I've also tried to load the XML file as a DOMDocument, but that doesn't work either.
In both cases, it does work if I read the data directly from the SimpleXML object in $xml;
for example, this works:
echo $Course_name = $xml->rootNode->longTitle;
loadXML() takes a string as input, not the return value of simplexml_load_file(), which is a SimpleXMLElement object. Just use file_get_contents() to get the full contents of the file as a string.
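As a rough sketch of that fix, the snippet below writes a tiny stand-in file and then loads it as a string. The element names follow the query in the question, but the temporary file and the titles "Intro" and "Basics" are made up so the example runs on its own.

```php
<?php
// Stand-in for runstructure.xml: same element names as the question's query,
// with made-up titles, written to a temporary file for illustration.
$filename = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($filename,
    '<org.olat.course.Structure><rootNode><children>'
    . '<org.olat.course.nodes.STCourseNode><shortTitle>Intro</shortTitle></org.olat.course.nodes.STCourseNode>'
    . '<org.olat.course.nodes.STCourseNode><shortTitle>Basics</shortTitle></org.olat.course.nodes.STCourseNode>'
    . '</children></rootNode></org.olat.course.Structure>');

$doc = new DOMDocument();
// loadXML() wants the XML text itself, so read the file into a string first.
$doc->loadXML(file_get_contents($filename));

$xpath = new DOMXPath($doc);
$res = $xpath->query('/org.olat.course.Structure/rootNode/children/org.olat.course.nodes.STCourseNode/shortTitle');

$titles = [];
foreach ($res as $entry) {
    $titles[] = $entry->nodeValue;
}
unlink($filename); // clean up the temporary file
print_r($titles);
```

DOMDocument::load($filename) reads a file path directly, so it can replace the file_get_contents() plus loadXML() pair if the string is not needed elsewhere.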

Getting the List of Child Nodes from within a HTML Tag using PHP

I am currently using the PHP DOM to get the BODY tag from HTML.
$doc = new DOMDocument();
$doc->loadHTML($HTML);
$body = preg_replace("/.*<body[^>]*>|<\/body>.*/si", "", $HTML);
The above code gives me the complete HTML inside the body tag of a given HTML document.
Can I get the HTML tags within $body as an array?
If possible, I would use DOM - it will make your solution a lot more reliable and cleaner to use.
This should get you headed in the right direction (I'm not writing the solution for you, sorry):
$html = file_get_contents("http://google.com");
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$elements = $xpath->query("//*");
foreach ($elements as $element) {
echo "<h1>". $element->nodeName. "</h1>";
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo "<h2>".$node->nodeName. "</h2>";
echo $node->nodeValue. "\n";
}
}
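Building on that, one way to end up with an actual array of the body's direct child tags is sketched below. The inline $html string is a made-up sample, and serializing each child with saveHTML() is just one possible choice; you could store the DOMElement objects themselves instead.

```php
<?php
// Made-up sample document for illustration.
$html = '<html><body><h1>Title</h1><p>First</p><div><span>Nested</span></div></body></html>';

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);

$body = $dom->getElementsByTagName('body')->item(0);

$children = [];
foreach ($body->childNodes as $node) {
    // Skip text and comment nodes; keep only element nodes (tags).
    if ($node->nodeType === XML_ELEMENT_NODE) {
        // saveHTML($node) serializes just this node and its descendants.
        $children[] = $dom->saveHTML($node);
    }
}
print_r($children);
```

Only direct children of body are collected here; nested tags such as the span stay inside the serialized markup of their parent.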
