xpath won't retrieve elements - php

Here is the URL of the xml source:
I'm trying to grab all the RichText elements using an XPath relative location path and then print the elementID attribute. It outputs nothing, though. Any ideas?
<?php
$url = "FXG";
$xml = simplexml_load_file($url);
//print_r($xml);
$textNode = $xml->xpath("//RichText");
$count = count($textNode);
$i = 0;
while($i < $count)
{
echo '<h1>'.$textNode[$i]['s7:elementID'].'</h1>';
$i++;
}
?>

You need to register the namespaces that are declared in the XML:
$url = "http://testvipd7.scene7.com/is/agm/papermusepress/HOL_12_F_green?&fmt=fxgraw";
$xml = simplexml_load_file($url);
$xml->registerXPathNamespace('default', 'http://ns.adobe.com/fxg/2008');
$xml->registerXPathNamespace('s7', 'http://ns.adobe.com/S7FXG/2008');
$textNode = $xml->xpath("//default:RichText/@s7:elementID");
foreach($textNode as $node) {
echo '<h1>'.$node.'</h1>'; // each result is the attribute node itself; string conversion yields its value
}
I hope this helps.
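To see why the unprefixed query comes back empty, here is a minimal sketch that runs without the remote file. The sample XML, its element layout, and the prefix name "fxg" are my own assumptions; only the two namespace URIs come from the answer above.

```php
<?php
// Hypothetical stand-in for the FXG document (structure assumed).
$sample = <<<XML
<Graphic xmlns="http://ns.adobe.com/fxg/2008" xmlns:s7="http://ns.adobe.com/S7FXG/2008">
  <RichText s7:elementID="title">Hello</RichText>
  <Group>
    <RichText s7:elementID="subtitle">World</RichText>
  </Group>
</Graphic>
XML;

$xml = simplexml_load_string($sample);

// XPath 1.0 reads an unprefixed name as "element in no namespace",
// so this finds nothing even though RichText elements exist.
$unprefixed = $xml->xpath('//RichText');

// The prefix names are arbitrary; only the URIs have to match the document.
$xml->registerXPathNamespace('fxg', 'http://ns.adobe.com/fxg/2008');
$xml->registerXPathNamespace('s7', 'http://ns.adobe.com/S7FXG/2008');

$ids = array();
foreach ($xml->xpath('//fxg:RichText/@s7:elementID') as $attr) {
    $ids[] = (string) $attr; // casting an attribute node gives its value
}

echo count($unprefixed), "\n";
echo implode(', ', $ids), "\n";
```

Note that the prefix you register does not have to be the one used in the document; the XPath engine compares namespace URIs, not prefixes.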

Strange. This, however, works.
$textNode = $xml->xpath("//*[name() = 'RichText']");
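A possible explanation, sketched on made-up XML: name() compares the qualified name as written in the document, so it matches RichText only while the element carries no prefix. local-name() ignores the prefix altogether. The sample document and the prefix "f" are assumptions for illustration.

```php
<?php
// Hypothetical document with the same element written with and without a prefix.
$sample = <<<XML
<root xmlns:f="http://ns.adobe.com/fxg/2008">
  <f:RichText>prefixed</f:RichText>
  <RichText xmlns="http://ns.adobe.com/fxg/2008">default namespace</RichText>
</root>
XML;

$xml = simplexml_load_string($sample);

// name() returns "f:RichText" for the first element, so only the second matches.
$byName = $xml->xpath("//*[name() = 'RichText']");

// local-name() drops the prefix, so both elements match.
$byLocalName = $xml->xpath("//*[local-name() = 'RichText']");

echo count($byName), ' ', count($byLocalName), "\n";
```

Both tricks sidestep namespaces rather than handle them, so they can also match same-named elements from an unrelated namespace.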

Related

Web scraping a website and filtering for divs with a certain class name. How to do that?

Currently I'm trying to scrape a site for football matches, and I need to find out how to filter for divs with a specific class name. Here is the code I already have. Thanks.
include('simple_html_dom.php');
$day = 1; // temporary
$html = file_get_html('https://sport.sky.de/bundesliga-spielplan-ergebnisse-'.$day);
$list = $html -> find('div[class="sdc-site-fixres__match-cell sdc-site-fixres__match-cell--score"]', 0);
$list_array = $list -> find('div');
for($i = 0; $i < sizeof($list_array); $i++){
echo $list_array[$i]->plaintext;
echo "<br>";
}
You can use xpath. Here is the full documentation.
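$day = 1; // temporary
$html = file_get_contents('https://sport.sky.de/bundesliga-spielplan-ergebnisse-'.$day);
$doc = new DOMDocument(); // loadHTML() must be called on an instance, not statically
libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect warnings instead of printing them
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$query = $xpath->query('//div[@class="sdc-site-fixres__match-cell sdc-site-fixres__match-cell--score"]/div/span[2]');
foreach ($query as $item) {
/** @var DOMElement $item */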
echo $item->nodeValue;
echo PHP_EOL;
}
Or you can use Symfony components for this purpose, such as DomCrawler or CssSelector.
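One caveat with @class="..." is that it compares the whole attribute string, so it breaks if the classes appear in a different order or another class is added. A common XPath 1.0 workaround is to match a single class token. Here is a rough sketch on made-up markup; the shortened class names are placeholders, not the real ones from the site.

```php
<?php
// Hypothetical markup; class names shortened for readability.
$html = <<<HTML
<html><body>
  <div class="match-cell match-cell--score"><span>2</span><span>1</span></div>
  <div class="match-cell--score match-cell"><span>0</span><span>3</span></div>
  <div class="match-cell"><span>Team</span></div>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Exact comparison: only the div whose attribute is this exact string.
$exact = $xpath->query('//div[@class="match-cell match-cell--score"]');

// Token comparison: pad both sides with spaces so the class is matched
// as a whole word, regardless of order or extra classes.
$token = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " match-cell--score ")]'
);

foreach ($token as $div) {
    echo trim($div->textContent), PHP_EOL;
}
```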

Getting link tag via DOMDocument

I convert an atom feed into RSS using atom2rss.xsl. Works fine.
Then, using DOMDocument, I try to get the post title and URL:
$feed = new DOMDocument();
$feed->loadHTML('<?xml encoding="utf-8" ?>' . $html);
if (!empty($feed) && is_object($feed) ) {
foreach ($feed->getElementsByTagName("item") as $item){
echo 'url: '. $item->getElementsByTagName("link")->item(0)->nodeValue;
echo 'title'. $item->getElementsByTagName("title")->item(0)->nodeValue;
}
return;
}
But the post URL is empty.
See this eval which contains HTML. What am I doing wrong? I suspect I am not getting the link tag properly via $item->getElementsByTagName("link")->item(0)->nodeValue.
I think the problem is that there are several <link> elements in each item, and the one (I think) you're interested in is the one with rel="self" as an attribute. The quickest way (without messing around with XPath) is to loop over each <link> element, check for the right rel value, and then take the href attribute from that:
if (!empty($feed) && is_object($feed) ) {
foreach ($feed->getElementsByTagName("item") as $item){
$url = "";
// Look for the 'right' link tag and extract URL from that
foreach ( $item->getElementsByTagName("link") as $link ) {
if ( $link->getAttribute("rel") == "self" ) {
$url = $link->getAttribute("href");
break;
}
}
echo 'url: '. $url;
echo 'title'. $item->getElementsByTagName("title")->item(0)->nodeValue;
}
return;
}
which gives...
url: https://www.blogger.com/feeds/2984353310628523257/posts/default/1947782625877709813titleExtraordinary Genius - Cp274
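If you don't mind XPath after all, the same lookup can be written as a single expression per item. This is only a sketch: the sample feed, its URLs, and the use of loadXML() on it are assumptions, not the asker's actual data.

```php
<?php
// Hypothetical converted feed with several <link> elements per item.
$sample = <<<XML
<rss><channel>
  <item>
    <title>First post</title>
    <link rel="replies" href="https://example.com/posts/1/comments"/>
    <link rel="self" href="https://example.com/posts/1"/>
  </item>
  <item>
    <title>Second post</title>
    <link rel="self" href="https://example.com/posts/2"/>
  </item>
</channel></rss>
XML;

$feed = new DOMDocument();
$feed->loadXML($sample);
$xpath = new DOMXPath($feed);

$results = array();
foreach ($xpath->query('//item') as $item) {
    // string(...) returns the first match's value, or "" when nothing matches.
    $url = $xpath->evaluate('string(link[@rel="self"]/@href)', $item);
    $title = $xpath->evaluate('string(title)', $item);
    $results[] = array('url' => $url, 'title' => $title);
}

foreach ($results as $row) {
    echo 'url: ', $row['url'], ' title: ', $row['title'], PHP_EOL;
}
```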
function get_links($link)
{
$ret = array();
$dom = new DOMDocument();
@$dom->loadHTML(file_get_contents($link)); // @ suppresses warnings from malformed HTML
$dom->preserveWhiteSpace = false;
$links = $dom->getElementsByTagName('a');
foreach ($links as $tag){
$ret[$tag->getAttribute('href')] = $tag->childNodes->item(0)->nodeValue;
}
return $ret;
}
print_r(get_links('http://www.google.com'));
Or you can use DOMXPath:
$html = file_get_contents('http://www.google.com');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// take all links
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
echo $url . PHP_EOL;
}

PHP Simple Dom HTML - Trouble parsing list of a hrefs

I'm trying to scrape all the a hrefs with an id starting with 'system' from this webpage: http://www.myfxbook.com/systems
Here is my code which I just can't seem to get to work. I've been fiddling around for hours now, looking at countless answered questions here.
include_once( 'simple_html_dom.php' );
$url2process = 'http://www.myfxbook.com/systems';
$html = file_get_html( $url2process );
$cnt = 0;
$parent_mark = $html->find('a[id^=system]');
$cntr = 0;
foreach( $parent_mark as $element) {
if( $cntr > 3 ) continue;
$cntr++;
$single_html = file_get_html( $element->href );
UPDATE1: Ok this is kind of working now, but it only seems to be using the very last a href on the page with the correct id. I need to process ALL these a hrefs with this ID, what am I missing here?
You could do it using DOMDocument like this:
$html = file_get_contents('http://www.myfxbook.com/systems');
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_use_internal_errors(false);
$links = $doc->getElementsByTagName('a');
$cnt = 0;
$cntr = 0;
foreach ($links as $link) {
if(preg_match('~^system~', $link->getAttribute('id'))) {
if( $cntr > 3 ) {
continue;
}
$cntr++;
$single_html = file_get_contents($link->getAttribute('href'));
if (empty($single_html)) {
echo 'EMPTY';
}
}
}
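The preg_match filter can also be expressed directly in the query with XPath's starts-with(), which is roughly what a[id^=system] means in the selector from the question. A small sketch on made-up markup (the ids and hrefs are placeholders):

```php
<?php
// Hypothetical page fragment; ids and hrefs are invented for the example.
$html = <<<HTML
<html><body>
  <a id="system1" href="/members/a/1">A</a>
  <a id="other" href="/about">About</a>
  <a id="system2" href="/members/b/2">B</a>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_use_internal_errors(false);

$xpath = new DOMXPath($doc);

// Collect the href of every <a> whose id begins with "system".
$hrefs = array();
foreach ($xpath->query('//a[starts-with(@id, "system")]') as $link) {
    $hrefs[] = $link->getAttribute('href');
}

print_r($hrefs);
```

Collecting all hrefs into an array first, and fetching the pages in a second loop, also makes it easier to check that every matching link is processed and not just the last one.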

Getting XML attributes in PHP

Looked at a few other SO posts on this but no joy.
I've got this code:
$url = "http://itunes.apple.com/us/rss/toppaidapplications/limit=10/genre=6014/xml";
$string = file_get_contents($url);
$string = preg_replace("/(<\/?)(\w+):([^>]*>)/", "$1$2$3", $string);
$xml = simplexml_load_string($string);
foreach ($xml->entry as $val) {
echo "RESULTS: " . $val->attributes() . "\n";
but I can't get any results.
I'm specifically interested in getting the ID value which would be 549592189 in this fragment:
<id im:id="549592189" im:bundleId="com.activision.wipeout">http://itunes.apple.com/us/app/wipeout/id549592189?mt=8&uo=2</id>
Any suggestions?
SimpleXML gives you an easy way to drill down into the XML structure and get the element(s) you want. There is no need for the regex, which only strips the namespace prefixes from element names.
<?php
// Load XML
$url = "http://itunes.apple.com/us/rss/toppaidapplications/limit=10/genre=6014/xml";
$string = file_get_contents($url);
$xml = new SimpleXMLElement($string);
// Get the entries
$entries = $xml->entry;
foreach($entries as $e){
// Get each entry's id
$id = $e->id;
// Get the attributes
// ID is in the "im" namespace
$attr = $id->attributes('im', TRUE);
// echo id
echo $attr['id'].'<br/>';
}
DEMO: http://codepad.viper-7.com/qNo7gs
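For anyone who wants to try this offline, here is a cut-down sketch. The sample feed and the namespace URI bound to the "im" prefix are assumptions made up for the example; check the real feed's root element for the actual URI.

```php
<?php
// Hypothetical feed fragment; the "im" namespace URI is a placeholder.
$sample = <<<XML
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:im="http://example.com/im">
  <entry>
    <id im:id="549592189" im:bundleId="com.activision.wipeout">http://example.com/app/wipeout</id>
  </entry>
</feed>
XML;

$xml = new SimpleXMLElement($sample);
$id = $xml->entry[0]->id;

// Look up namespaced attributes by prefix (second argument true) ...
$byPrefix = $id->attributes('im', true);
// ... or by namespace URI (the default mode).
$byUri = $id->attributes('http://example.com/im');

echo $byPrefix['id'], "\n";
echo $byUri['bundleId'], "\n";

// Plain access only sees attributes without a namespace, so this is empty.
$plain = (string) $id['id'];
```

That last line is also why the original attempt printed nothing: without a namespace argument, attributes() skips everything in the "im" namespace.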
Try with xpath:
$doc = new DOMDocument;
@$doc->loadHTML($string);
$xpath = new DOMXpath($doc);
$r = $xpath->query("//id/@im:id");
$id = $r->item(0)->value;
Try:
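$sxml = new SimpleXMLElement($url, 0, true); // third argument: $url is a URL, not XML data
$count = count($sxml->entry);
for($i = 0; $i < $count; $i++){
$appid = $sxml->entry[$i]->id->attributes("im", TRUE);
echo $appid['id'] . "\n";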
}

Get Twitter feed and display information using SimpleXML

Based on the thread With PHP preg_match_all, get value of href, I'm trying to get some information from a twitter feed.
Here is the feed url (for testing purpose): Twitter feed
Here is my code:
function parse_feed($process) {
$xml = @simplexml_load_string($process);
$findTweet = $xml['entry'];
return $findTweet;
}
$username = 'tweet';
$feed = "http://search.twitter.com/search.atom?q=from:" . $username . "&rpp=2";
$feed = file_get_contents($feed);
//echo $feed;
print_r(parse_feed($feed));
I never used SimpleXML before or worked with XML.
Can someone help me please?
This site might be a good start: http://www.php.net/manual/en/simplexml.examples-basic.php
Okay, found it! Here is the solution for anyone who is interested:
Thanks to m1k3y02 for the docs!
function parse_feed($process) {
$xml = new SimpleXMLElement($process);
$tweets = array(); // ensure an array is returned even when the feed has no entries
$n=0;
foreach($xml->entry as $entry) {
$tweets[$n] = array($entry->published,$entry->content);
$n++;
}
return $tweets;
}
$twitter_username = 'tweet';
$twitter_entries = 5;
$feed = "http://search.twitter.com/search.atom?q=from:" . $twitter_username . "&rpp=".$twitter_entries;
$feed = file_get_contents($feed);
$tweets = parse_feed($feed);
$n=0;
$n_t = count($tweets);
while($n < $n_t) {
echo "<div class=\"tweet\"><img src=\"img/tweet.png\" valign=\"absmiddle\" /> ";
echo $tweets[$n][1][0];
echo "</div>";
echo "<div class=\"date\">".$tweets[$n][0][0]."</div>";
$n++;
}
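A slightly tidier variant, in case it helps someone: casting each SimpleXML node to a string while parsing means the caller gets plain arrays and doesn't need the [0] index later. The function name parse_entries and the sample Atom data are made up for this sketch; the real feed has more fields.

```php
<?php
// Hypothetical helper: returns plain strings instead of SimpleXMLElement objects.
function parse_entries($atom)
{
    $xml = new SimpleXMLElement($atom);
    $tweets = array();
    foreach ($xml->entry as $entry) {
        $tweets[] = array(
            'published' => (string) $entry->published,
            'content'   => (string) $entry->content,
        );
    }
    return $tweets;
}

// Made-up Atom sample standing in for the downloaded feed.
$sample = <<<XML
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <published>2012-01-01T10:00:00Z</published>
    <content type="text">First tweet</content>
  </entry>
  <entry>
    <published>2012-01-02T11:30:00Z</published>
    <content type="text">Second tweet</content>
  </entry>
</feed>
XML;

$tweets = parse_entries($sample);
foreach ($tweets as $tweet) {
    echo $tweet['published'], ': ', $tweet['content'], "\n";
}
```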
