How to Parse XML from a URL using PHP

I'm learning how to parse XML elements into an HTML document.
The code below takes a URL pointing to an XML feed and reads its elements, but it isn't working. I also want to take it a bit further, but I haven't been able to: how can I read the XML from a URL and then use an XML element as the filename to create an HTML document from a template?
EDIT: this is what I tried!
I tried this just so I'd know what I'm doing (...apparently nothing, haha) and could echo whether the information was right:
<?php
$url = "http://your_blog.blogspot.com/feeds/posts/default?alt=rss";
$xml = simplexml_load_file($url);
print_r($xml);
?>
Thank you for your time!

Generally, "cross-domain" requests would be forbidden by web browsers, per the same origin security policy.
However, there is a mechanism called Cross-Origin Resource Sharing (CORS) that allows JavaScript on a web page to make XMLHttpRequests to another domain.
Read this about CORS:
http://en.wikipedia.org/wiki/Cross-origin_resource_sharing
Check this article out about RSS feeds:
http://www.w3schools.com/php/php_ajax_rss_reader.asp
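As for the second part of the question (reading the XML from a URL and using one of its elements as the filename for an HTML document built from a template), here is a rough, untested sketch. The template file name and its {{title}}/{{body}} placeholders are assumptions for illustration, and the feed is assumed to have the usual channel/item structure:
<?php
// Load an RSS feed from a URL and write an HTML file named after its channel title.
$url = "http://your_blog.blogspot.com/feeds/posts/default?alt=rss";

$xml = @simplexml_load_file($url);
if ($xml === false) {
    die("Could not load or parse the XML at $url");
}

// Use the channel title as the file name (reduced to safe characters).
$title    = (string) $xml->channel->title;
$filename = preg_replace('/[^A-Za-z0-9_-]+/', '-', $title) . '.html';

// Build a body from the feed items.
$body = '';
foreach ($xml->channel->item as $item) {
    $body .= '<h2><a href="' . htmlspecialchars((string) $item->link) . '">'
           . htmlspecialchars((string) $item->title) . "</a></h2>\n";
}

// Fill a very simple template and write the HTML document.
$template = file_get_contents('template.html'); // assumed to contain {{title}} and {{body}}
$html     = str_replace(array('{{title}}', '{{body}}'),
                        array(htmlspecialchars($title), $body),
                        $template);
file_put_contents($filename, $html);

echo "Wrote $filename";
?>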

Related

Scraping data from a website with Simple HTML Dom

I am working to finish an API for a website (https://rushwallet.com/) for GitHub.
I am using PHP and attempting to retrieve the wallet address from this URL: https://rushwallet.com/#n3GjsndjdCURphhsqJ4mQH7AjiXlGI.
Can anyone help me?
My code so far:
$url = "https://rushwallet.com/#n3GjsndjdCURphhsqJ4mQH7AjiXlGI";
$open_url = str_get_html(file_get_contents($url));
$content_url = $open_url->find('span[id=btcBalance]', 0)->innertext;
die(var_dump($content_url));
You cannot read the content you want this way. You are requesting the non-rendered page source, so you will always read an empty string: the balance is filled in by JavaScript after the page has loaded. The page source only contains:
฿<span id="btcBalance"></span>
If you want to scrape the data in this case, you need a rendering engine that can execute JavaScript. One option is PhantomJS, a headless browser that can scrape the data after the page has rendered.
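Untested sketch of how that could be driven from PHP, assuming the phantomjs binary is installed and on the PATH; the 3-second wait is an arbitrary choice:
<?php
// Drive PhantomJS from PHP to read a value that is only filled in by JavaScript.
$js = <<<'JS'
var page = require('webpage').create();
page.open('https://rushwallet.com/#n3GjsndjdCURphhsqJ4mQH7AjiXlGI', function (status) {
    if (status !== 'success') { console.log('FAILED'); phantom.exit(1); }
    // Give the page's own JavaScript a moment to populate the balance.
    window.setTimeout(function () {
        var balance = page.evaluate(function () {
            var el = document.getElementById('btcBalance');
            return el ? el.textContent : '';
        });
        console.log(balance);
        phantom.exit();
    }, 3000);
});
JS;

$script = tempnam(sys_get_temp_dir(), 'phantom');
file_put_contents($script, $js);
$balance = trim(shell_exec('phantomjs ' . escapeshellarg($script)));
unlink($script);

var_dump($balance);
?>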

Why is the error `String could not be parsed as XML` seen when trying to create a SimpleXMLElement object?

I have the functionality below and am getting the error `String could not be parsed as XML`:
$category_feed_url = "http://www.news4u.com/blogs/category/articles/feed/";
$file = file_get_contents($category_feed_url);
$xml = new SimpleXMLElement($file);
foreach ($xml->channel->item as $feed)
{
    echo $feed->link;
    echo $feed->title;
    ...
}
Why has this error occurred?
The URL points to an HTML document.
It is possible for a document to be both HTML and XML, but this one isn't.
It fails because you are trying to parse not-XML as if it was XML.
See How to parse and process HTML with PHP? for guidance in parsing HTML using PHP.
You seem to be expecting an RSS feed, though, and that document doesn't resemble one or reference one. The site looks rather spammy; possibly that URI used to point to an RSS feed, but the domain has now fallen to a link-farm spammer. If so, you should find an alternative source for the information you were collecting.
"String could not be parsed as XML", your link is an html page.

How can I make flexible XML data files for other websites?

I would like to know how I can give other websites' parsers an XML file or response based on the arguments they request.
For example, I have a show_data.php file that takes a range of params, applies them to a MySQL query, and then builds a valid XML 1.0 string.
So by this point I am done with the data fetching and the XML formatting based on the request params.
Now, how would I share that XML with other websites for their XML parsers?
Do I simply output my XML string in a PHP file with the appropriate headers, or is there some other way?
Example:
1) www.example.com requests www.mypage.com/show_data.php?show=10
2) www.mypage.com/show_data.php sends XML data back to www.example.com
It's really hard to explain since I have not worked with XML and such before. Hope it makes some sense.
Thanks.
Well, when example.com makes the initial request, your page will process it and return the XML as the result. There's nothing special that you'll need to do.
$xml = "";
// process the xml (build it - do what you need to do)
// returning the xml to the requester
header ("Content-Type:text/xml");
echo $xml;
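Putting the pieces together, an untested sketch of what show_data.php could look like; the DSN, credentials, table and column names are made up for illustration and should be adapted to your schema:
<?php
// Read the request parameter, query MySQL, and return the rows as XML.
$show = isset($_GET['show']) ? (int) $_GET['show'] : 10;

$pdo  = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8', 'user', 'password');
$rows = $pdo->query("SELECT id, title, created_at FROM items ORDER BY created_at DESC LIMIT $show")
            ->fetchAll(PDO::FETCH_ASSOC);

$xml = new SimpleXMLElement('<?xml version="1.0" encoding="UTF-8"?><items/>');
foreach ($rows as $row) {
    $item = $xml->addChild('item');
    $item->addChild('id', (string) $row['id']);
    // addChild() does not escape '&', so pre-escape values that may contain it
    $item->addChild('title', htmlspecialchars($row['title']));
    $item->addChild('created_at', $row['created_at']);
}

// Send it back to the requester with the right content type.
header('Content-Type: text/xml; charset=UTF-8');
echo $xml->asXML();
?>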

How to detect if a page is an RSS or ATOM feed

I'm currently building a new online feed reader in PHP. One of the features I'm working on is feed auto-discovery. If a user enters a website URL, the script will detect that it's not a feed and look for the real feed URL by parsing the HTML for the proper <link> tag.
The problem is, the way I'm currently detecting whether the URL is a feed or a website only works part of the time, and I know it can't be the best solution. Right now I'm taking the cURL response and running it through simplexml_load_string(); if it can't parse it, I treat it as a website. Here is the code:
$xml = @simplexml_load_string( $site_found['content'] );
if( !$xml ) // this is a website, not a feed
{
// handle website
}
else
{
// parse feed
}
Obviously, this isn't ideal. Also, when it runs into an HTML website that it can parse, it thinks it's a feed.
Any suggestions on a good way of detecting the difference between a feed or non-feed in PHP?
I would sniff for the various unique identifiers those formats have:
Atom: Source
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
RSS 0.90: Source
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://my.netscape.com/rdf/simple/0.9/">
Netscape RSS 0.91
<rss version="0.91">
etc. etc. (See the 2nd source link for a full overview).
As far as I can see, separating Atom and RSS should be pretty easy by looking for <feed> and <rss> tags, respectively. Plus you won't find those in a valid HTML document.
You could make an initial check to tell HTML and feeds apart by looking for <html> and <body> elements first. To avoid problems with invalid input, this may be a case where using regular expressions (over a parser) is finally justified for once :)
If it doesn't match the HTML test, run the Atom / RSS tests on it. If it is not recognized as a feed, or the XML parser chokes on invalid input, fall back to HTML again.
What that looks like in the wild - whether feed providers always conform to those rules - is a different question, but you should already be able to recognize a lot this way.
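A rough sketch of that kind of sniffing, using the markers above (the cut-off length and patterns are arbitrary choices, not a complete classifier):
<?php
// Classify a fetched document by sniffing its root element.
// $content is assumed to hold the body returned by your cURL request.
function detect_document_type($content)
{
    // Strip the XML declaration, comments and leading whitespace so the
    // first real tag sits at the front of the string.
    $head = substr(ltrim($content), 0, 1024);
    $head = preg_replace('/^<\?xml[^>]*\?>\s*/i', '', $head);
    $head = preg_replace('/<!--.*?-->\s*/s', '', $head);

    if (preg_match('/^<(!DOCTYPE\s+html|html)\b/i', $head)) {
        return 'html';
    }
    if (preg_match('/^<feed\b/i', $head)) {
        return 'atom';
    }
    if (preg_match('/^<(rss|rdf:RDF)\b/i', $head)) {
        return 'rss';
    }
    return 'unknown';
}

// Usage:
// $type = detect_document_type($site_found['content']);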
I think your best choice is checking the Content-Type header, as I assume that's the way Firefox (or any other browser) does it. Besides, if you think about it, the Content-Type is indeed how the server tells user agents how to process the response content. Almost any decent HTTP server sends a correct Content-Type header.
Nevertheless, you could try to identify RSS/Atom in the content as a second choice if the first one "fails" (that criterion is up to you).
An additional benefit is that you only need to request the header instead of the entire document, thus saving you bandwidth, time, etc. You can do this with curl like this:
<?php
$ch = curl_init("http://sample.com/feed");
curl_setopt($ch, CURLOPT_NOBODY, true); // sets the HTTP request method to HEAD instead of GET (the default), so the server only sends the headers (no content)
curl_exec($ch);
$conType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
if (is_rss($conType)){ // You need to implement is_rss($conType) function
// TODO
}elseif(is_html($conType)) { // You need to implement is_html($conType) function
// Search a rss in html
}else{
// Error : Page has no rss/atom feed
}
?>
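Those two helpers could be as simple as comparing the media-type part of the header (everything before any ";charset=...") against a list of types; a sketch, with the lists being a judgement call:
<?php
function is_rss($conType)
{
    $type = strtolower(trim(strtok((string) $conType, ';')));
    return in_array($type, array(
        'application/rss+xml',
        'application/atom+xml',
        'application/rdf+xml',
        'application/xml',
        'text/xml',
    ), true);
}

function is_html($conType)
{
    $type = strtolower(trim(strtok((string) $conType, ';')));
    return in_array($type, array('text/html', 'application/xhtml+xml'), true);
}
?>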
Why not try to parse your data with a component built specifically to parse RSS/ATOM Feed, like Zend_Feed_Reader ?
With that, if the parsing succeeds, you'll be pretty sure that the URL you used is indeed a valid RSS/ATOM feed.
And I should add that you could use such a component to parse feeds in order to extract their information, too: no need to re-invent the wheel, parsing the XML "by hand" and dealing with special cases yourself.
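A sketch of what that could look like, assuming Zend Framework 1 is on your include_path with its autoloader registered:
<?php
require_once 'Zend/Loader/Autoloader.php';
Zend_Loader_Autoloader::getInstance();

try {
    $feed = Zend_Feed_Reader::import('http://sample.com/feed');
} catch (Exception $e) {
    // Not a parseable RSS/Atom feed (or the request failed).
    die('Not a feed: ' . $e->getMessage());
}

echo $feed->getTitle(), "\n";
foreach ($feed as $entry) {
    echo $entry->getTitle(), ' - ', $entry->getLink(), "\n";
}
?>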
Use the Content-Type HTTP response header to dispatch to the right handler.

Basic web-crawling question: How to create a list of all pages on a website using php?

I would like to create a crawler using PHP that would give me a list of all the pages on a specific domain (starting from the homepage: www.example.com).
How can I do this in php?
I don't know how to recursively find all the pages on a website starting from a specific page and excluding external links.
For the general approach, check out the answers to these questions:
How to write a crawler?
How to best develop web crawlers
Is there a way to use PHP to crawl links?
In PHP, you should be able to simply fetch a remote URL with file_get_contents(). You could perform a naive parse of the HTML by using a regular expression with preg_match() to find <a href=""> tags and parse the URL out of them (See this question for some typical approaches).
Once you've extracted the raw href attribute, you could use parse_url() to break it into components and figure out whether it's a URL you want to fetch - remember also that the URLs may be relative to the page you've fetched. A naive version of that approach is sketched below.
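Untested sketch of the regex approach just described; the pattern is deliberately simple and will miss unquoted or single-quoted hrefs, and the host check is hard-coded for illustration:
<?php
$content = file_get_contents('http://www.example.com/');

preg_match_all('/<a\s[^>]*href="([^"]+)"/i', $content, $matches);

foreach ($matches[1] as $href) {
    $parts = parse_url($href);
    // Keep links that are relative or that stay on the same host.
    if (empty($parts['host']) || $parts['host'] === 'www.example.com') {
        echo $href, "\n";
    }
}
?>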
Though fast, a regex isn't the best way of parsing HTML - you could also try the DOM classes to parse the HTML you fetch, for example:
$dom = new DOMDocument();
@$dom->loadHTML($content); // suppress warnings from malformed HTML
$anchors = $dom->getElementsByTagName('a');
if ( $anchors->length > 0 ) {
    foreach ( $anchors as $anchor ) {
        if ( $anchor->hasAttribute('href') ) {
            $url = $anchor->getAttribute('href');
            // now figure out whether to process this
            // URL and add it to a list of URLs to be fetched
        }
    }
}
Finally, rather than write it yourself, see also this question for other resources you could use.
Is there a good web crawler library available for PHP or Ruby?
Overview
Here are some notes on the basics of the crawler.
It is a console app - it doesn't need a rich interface, so I figured a console application would do. The output is written as an HTML file and the input (what site to crawl) is taken from the app.config. Making a Windows app out of this seemed like overkill.
The crawler is designed to only crawl the site it originally targets. It would be easy to change that if you want to crawl more than just a single site, but that is the goal of this little application.
Originally the crawler was just written to find bad links. Just for fun I also had it collect information on page and ViewState sizes. It will also list all non-HTML files and external URLs, in case you care to see them.
The results are shown in a rather minimalistic html report. This report is automatically opened in Internet Explorer when the crawl is finished.
Getting the Text from an Html Page
The first crucial piece of building a crawler is the mechanism for going out and fetching the HTML off the web (or your local machine, if you have the site running locally). Like so much else, .NET has classes for doing this very thing built into the framework.
private static string GetWebText(string url)
{
    HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create(url);
    request.UserAgent = "A .NET Web Crawler";
    WebResponse response = request.GetResponse();
    Stream stream = response.GetResponseStream();
    StreamReader reader = new StreamReader(stream);
    string htmlText = reader.ReadToEnd();
    return htmlText;
}
The HttpWebRequest class can be used to request any page from the internet. The response (retrieved through a call to GetResponse()) holds the data you want. Get the response stream, throw it in a StreamReader, and read the text to get your html.
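Since the question asked about PHP, a rough sketch of the equivalent fetch step using cURL (assumed user-agent string; no error handling):
<?php
function get_web_text($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_USERAGENT, 'A PHP Web Crawler');
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}
?>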
