Extract content from MediaWiki API call (XML, cURL) - php

URL:
http://en.wikipedia.org/w/api.php?action=parse&prop=text&page=Lost_(TV_series)&format=xml
This outputs something like:
<api><parse><text xml:space="preserve">text...</text></parse></api>
How do I get just the content between <text xml:space="preserve"> and </text>?
I used cURL to fetch all the content from this URL, so this gives me:
$html = curl_exec($curl_handle);
What's the next step?

Use PHP DOM to parse it. Do it like this:
//you already have input text in $html
$html = '<api><parse><text xml:space="preserve">text...</text></parse></api>';
//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('text');
//display what you need:
echo $nodes->item(0)->nodeValue;
This outputs:
text...
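Since the MediaWiki response is well-formed XML rather than HTML, a slightly cleaner variant is to parse it as XML; here is a minimal sketch, assuming the same $html string returned by curl_exec:
//parse the response as XML instead of HTML
$doc = new DOMDocument();
$doc->loadXML($html); //loadXML() avoids the HTML parser's warnings
echo $doc->getElementsByTagName('text')->item(0)->nodeValue;
//or, with SimpleXML
$xml = new SimpleXMLElement($html);
echo (string) $xml->parse->text;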

Related

PHP getting a certain information from a website but from all pages

I want to extract the href attribute of links, specifically those whose href is a mailto: address, and I want to do this not just for one link but for all such links on the main webpage.
I tried this:
<?php
$url = "https://www.omurcanozcan.com";
$html = file_get_contents( $url);
libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML( $html);
$xpath = new DOMXpath( $doc);
$node = $xpath->query( "//a[@href='mailto:']")->item(0);
echo $node->textContent; // This will print **GET THIS TEXT**
?>
I expect for instance a code is
<a href='mailto:omurcan@omurcanozcan.com'>omurcan@omurcanozcan.com</a>
I want to echo
<p>omurcan@omurcanozcan.com</p>
The main problem is that in your XPath, you are checking for
//a[@href='mailto:']
This looks for an href attribute whose value is exactly mailto:; what you want is an href that starts with mailto:, which you can do using starts-with()...
$node = $xpath->query( "//a[starts-with(@href,'mailto:')]")->item(0);
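If you want every mailto link rather than just the first, you can loop over the whole query result; a small sketch along the same lines:
//iterate over all anchors whose href starts with mailto:
foreach ($xpath->query("//a[starts-with(@href,'mailto:')]") as $node) {
    echo $node->textContent . "<br>\n";
}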
The second thing is that I don't think your page is fully loaded when you get the content. A common test I do is to save the HTML once I've loaded it so I can check it out first...
$url = "https://www.omurcanozcan.com";
$html = file_get_contents( $url);
file_put_contents("a.html", $html);
If you then look in a.html you can see the HTML it is actually working with; in that content I cannot see any mailto: links.

Parse a HTML document and get a specific element in PHP and save its HTML

All I want to do is save the first div with attribute role="main" as a string from an external URL using PHP.
So far I have this:
$doc = new DOMDocument();
@$doc->loadHTMLFile("http://example.com/");
$xpath = new DOMXPath($doc);
$elements = $xpath->query('//div[@role="main"]');
$str = "";
if ($elements->length > 0) {
$str = $elements->item(0)->textContent;
}
echo htmlentities($str);
But unfortunately $str does not seem to include the HTML tags, just the text.
You can get the HTML via the saveHTML() method.
$str = $doc->saveHTML($elements->item(0));
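Putting it together, a minimal sketch of the whole flow (using the example.com URL from the question) would be:
$doc = new DOMDocument();
@$doc->loadHTMLFile("http://example.com/"); //@ suppresses warnings about imperfect HTML
$xpath = new DOMXPath($doc);
$elements = $xpath->query('//div[@role="main"]');
$str = "";
if ($elements->length > 0) {
    //saveHTML() with a node argument returns that node's markup, tags included
    $str = $doc->saveHTML($elements->item(0));
}
echo htmlentities($str);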

Reading comments in XML API response

I have the following response from an XML API; I want to display the text that is placed in the comment.
<OTA_AirDetailsRS PrimaryLangID="eng" Version="1.0" TransactionIdentifier=""><Errors><Error Type="ERR" FLSErrorCode="-10" FLSErrorName="Invalid Input"/></Errors><!-- Reason for error: The Date parameter is not valid [2014-05-16] --></OTA_AirDetailsRS>
I have used this :
...
$query = curl_exec($curl_handle);
curl_close($curl_handle);
$xml = new SimpleXmlElement($query);
if ($xml->Errors) {
    $doc = new DOMDocument;
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//comment()') as $comment) {
        var_dump($comment->textContent);
    }
}
It's not displaying anything in this case, but if instead of passing the XML response I pass it a simple XML string, it works. Please suggest if something is wrong.
You need to load the XML, not the object from your SimpleXmlElement call:
$doc = new DOMDocument;
$doc->loadXML($query);
Then your script outputs:
string(64) " Reason for error: The Date parameter is not valid [2014-05-16] "
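If you want to keep the SimpleXML check on <Errors>, you can also convert the SimpleXMLElement back to a string with asXML() before handing it to DOMDocument; a sketch:
$xml = new SimpleXmlElement($query);
if ($xml->Errors) {
    $doc = new DOMDocument;
    $doc->loadXML($xml->asXML()); //asXML() returns the document as an XML string
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//comment()') as $comment) {
        var_dump($comment->textContent);
    }
}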

How to load an XML file in PHP so that I can use XPath on it?

I have a problem with PHP.
If I run the code below, nothing happens.
$filename = "/opt/olat/olatdata/bcroot/course/85235053647606/runstructure.xml";
if (file_exists($filename)) {
$xml = simplexml_load_file($filename, 'SimpleXMLElement', LIBXML_NOCDATA);
// $xpath = new DOMXPath($filename);
}
$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXpath($doc);
$res = $xpath->query('/org.olat.course.Structure/rootNode/children/org.olat.course.nodes.STCourseNode/shortTitle');
foreach ($res as $entry) {
echo "{$entry->nodeValue}<br/>";
}
If I replace the contents of $xml with the literal content of $filename as a string
$xml = '<org.olat.course.Structure><rootNode class="org.olat.course.nodes.STCourseNode"> ... ';
then it works, so I think that there is something wrong with how the XML file is loaded.
I've also tried to load the XML file as a DOMDocument, but that doesn't work either.
And in both cases it does work if I read the XML data via SimpleXML; for example, this works:
echo $Course_name = $xml->rootNode->longTitle;
loadXML takes a string as input, not the return value of simplexml_load_file. Just use file_get_contents to get the (full) contents of the file as a string.
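For example, a minimal sketch using the same $filename and XPath query as above:
$doc = new DOMDocument();
$doc->loadXML(file_get_contents($filename)); //loadXML() expects the raw XML string
//alternatively, let DOMDocument read the file itself:
//$doc->load($filename);
$xpath = new DOMXpath($doc);
$res = $xpath->query('/org.olat.course.Structure/rootNode/children/org.olat.course.nodes.STCourseNode/shortTitle');
foreach ($res as $entry) {
    echo "{$entry->nodeValue}<br/>";
}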

PHP function to grab all links inside a <DIV> on remote site using scrape method

Does anyone have a PHP function that can grab all links inside a specific DIV on a remote site? So usage might be:
$links = grab_links($url,$divname);
And it would return an array I can use. Grabbing links I can figure out, but I'm not sure how to make it only do that within a specific div.
Thanks!
Scott
Check out PHP XPath. It will let you query a document for the contents of specific tags and so on. The example on the php site is pretty straightforward:
http://php.net/manual/en/simplexmlelement.xpath.php
The following example will actually grab all of the URLs in any DIVs in a doc:
$xml = new SimpleXMLElement($docAsString);
$result = $xml->xpath('//div//a');
You can use this on well-formed HTML files, not just XML.
Good XPath reference: http://msdn.microsoft.com/en-us/library/ms256086.aspx
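To limit the query to one particular DIV rather than any DIV, you can filter on its id (or class) in the XPath; a sketch, where 'divname' is just a placeholder for whatever id the div actually has:
$xml = new SimpleXMLElement($docAsString);
$links = array();
foreach ($xml->xpath('//div[@id="divname"]//a') as $a) {
    $links[] = (string) $a['href']; //attribute access on a SimpleXMLElement
}
print_r($links);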
In the past I have used the PHP Simple HTML DOM library with success:
http://simplehtmldom.sourceforge.net/
Samples:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
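Simple HTML DOM also accepts CSS-like selectors, so restricting the search to one div should look roughly like this (a sketch; 'divname' is a placeholder id, and the selector syntax should be checked against the library's docs):
// Find links only inside a specific div
$html = file_get_html('http://www.example.com/');
foreach($html->find('div[id=divname] a') as $element)
    echo $element->href . '<br>';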
I found something that seems to do what I wanted.
http://www.earthinfo.org/xpaths-with-php-by-example/
<?php
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.bbc.com');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@id='news_moreTopStories']//a/@href" );
foreach ($nodelist as $n){
    echo $n->nodeValue."\n";
}
// for images
echo "<br><br>";
$html = new DOMDocument();
@$html->loadHtmlFile('http://www.bbc.com');
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query( "//div[@id='promo_area']//img/@src" );
foreach ($nodelist as $n){
    echo $n->nodeValue."\n";
}
?>
I also tried the PHP DOM method and it seems faster...
http://w-shadow.com/blog/2009/10/20/how-to-extract-html-tags-and-their-attributes-with-php/
$html = file_get_contents('http://www.bbc.com');
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementById('news_moreTopStories')->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
//Extract and show the "href" attribute.
echo $link->getAttribute('href'), '<br>';
}
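Wrapped up as the grab_links($url, $divname) helper the question asked for, the DOM approach might look something like this sketch (the function name and signature are just the ones proposed in the question):
function grab_links($url, $divname) {
    $links = array();
    $dom = new DOMDocument;
    //@ suppresses warnings for markup that isn't valid XHTML
    @$dom->loadHTML(file_get_contents($url));
    $div = $dom->getElementById($divname);
    if ($div === null) {
        return $links; //div not found
    }
    foreach ($div->getElementsByTagName('a') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}
$links = grab_links('http://www.bbc.com', 'news_moreTopStories');
print_r($links);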
