Page not being converted into XML format - PHP

I am grabbing a page and then converting it into XML. The function I'm using is below:
public function getXML($url){
    $ch = curl_init();
    //curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    //curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $response = curl_exec($ch);
    $xml = simplexml_load_string($response);
    return $xml;
}
print_r($curl->getXML("http://www.amazon.co.uk/gp/offer-listing/0292783760/ref=tmm_pap_new_olp_sr?ie=UTF8&condition=used"));
After trying different URLs, nothing is returned. The page itself loads fine, so the problem must be with the line $xml = simplexml_load_string($response);
What could be wrong with this code?

I don't fully understand what you're up to, but it looks like you're trying to scrape that Amazon web page. If I pull up the URL in my browser, it isn't declared as XHTML in the headers or in the document itself, and I suspect it isn't valid XHTML. I don't think simplexml can handle that.
(Does cURL do the conversion to XML for you? I don't think so, but I'm not a master of all things cURL. If so, it might be an incompatibility between cURL's output and what simplexml, which is fairly limited, will accept.)
You might try working with DOMDocument instead, although my PHP could be a bit out of date; there may be better utilities these days.
A quick Google search turned up this tutorial:
<?php
$doc = new DOMDocument();
$doc->strictErrorChecking = FALSE;
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);
?>
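Folded into your getXML() method, that might look roughly like this (an untested sketch; it keeps your cURL fetch as-is and only swaps simplexml_load_string() for a lenient DOM-based load):
public function getXML($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $response = curl_exec($ch);
    curl_close($ch);

    // Parse the (probably invalid) HTML leniently, then hand the DOM to SimpleXML.
    $doc = new DOMDocument();
    $doc->strictErrorChecking = FALSE;
    @$doc->loadHTML($response);
    return simplexml_import_dom($doc);
}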
I don't think this is a complete answer, but it was a bit much for a comment, so take it with a grain of salt and a healthy serving of doubt. I hope it inspires some ideas.

Related

Get Content from Web Pages with PHP

I am working on a small project to get information from several web pages based on the HTML markup of the page, and I do not know where to start at all.
The basic idea is to get the title from the <h1></h1> tags, the content from the <p></p> tags, and other important information that is required.
I would have to set up each case for each source for it to work the way it needs to. I believe the right method is using the $_GET method with PHP. The goal of the project is to build a database of information.
What is the best method to grab the information which I need?
First of all: PHP's $_GET is not a method. As you can see in the documentation, $_GET is simply an array initialized with the GET parameters your web server received during the current request. As such, it is not what you want to use for this kind of thing.
What you should look into is cURL, which allows you to compose even fairly complex requests, send them to the destination server, and retrieve the response. For example, for a POST request you could do something like:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.mysite.com/tester.phtml");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS,
    "postvar1=value1&postvar2=value2&postvar3=value3");
// in real life you should use something like:
// curl_setopt($ch, CURLOPT_POSTFIELDS,
//     http_build_query(array('postvar1' => 'value1')));

// receive server response ...
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$server_output = curl_exec($ch);
curl_close($ch);
Source
Of course, if you don't need to do any complex queries but only simple GET requests, you can go with the PHP function file_get_contents.
After you have received the web page content, you have to parse it. IMHO the best way to do this is with PHP's DOM functions. How to use them should really be another question, but you can find tons of examples without much effort.
<?php
$remote = file_get_contents('http://www.remote_website.html');
$doc = new DomDocument();
$file = @$doc->loadHTML($remote);
$cells = @$doc->getElementsByTagName('h1');
foreach ($cells as $cell) {
    $titles[] = $cell->nodeValue;
}
$cells = @$doc->getElementsByTagName('p');
foreach ($cells as $cell) {
    $content[] = $cell->nodeValue;
}
?>
You can get the HTML source of a page with:
<?php
$html= file_get_contents('http://www.example.com/');
echo $html;
?>
Then, once you have the structure of the page, you can pull out the tag you need with substr() and strpos(), as sketched below.
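For what it's worth, a rough (and fragile) sketch of that strpos()/substr() approach, using <h1> purely as an illustration:
$html = file_get_contents('http://www.example.com/');

// Find the first <h1>...</h1> pair and pull out the text between the tags.
$start = strpos($html, '<h1>');
$end = ($start !== false) ? strpos($html, '</h1>', $start) : false;
if ($start !== false && $end !== false) {
    $title = substr($html, $start + strlen('<h1>'), $end - $start - strlen('<h1>'));
    echo $title;
}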

PHP getElementById behaviour with elements sharing id

I'm using some simple PHP to scrape information from a website to allow reading it offline. The code seems to be working fine, but I am worried about undefined behaviour. The site is a bit poorly coded, and some of the elements I'm grabbing share the same id with another element. I'd imagine that getElementById traverses the DOM from top to bottom, and the reason I'm not having an issue is that the element I need is the first instance with that id. Is there any way to ensure this behaviour? The element has no other real way of distinguishing it, so selecting it by id seems to be the best option. I have included a stripped-back example of the code I'm using below.
Thanks.
<?php
$curl_referer = "http://example.com/";
$curl_url = "http://example.com/content.php";
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, 'Scraper/0.9');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_REFERER, "$curl_referer");
curl_setopt($ch, CURLOPT_URL, "$curl_url");
$output = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($output);
$content = $dom->getElementById('content');
echo $content->nodeValue;
?>
Try using an XPath expression to get the first element with that id.
Like this: //*[@id="content"][1]
The PHP code would look like this:
$xpath = new DOMXPath($dom);
echo $xpath->query('//*[@id="content"][1]')->item(0)->nodeValue;
And a tip: use libxml_use_internal_errors(true); you can fetch the errors later for logging, or try tidying up the document.
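For example, applied to your code, that tip might look like this (just a sketch):
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($output);

// Look at the collected parse errors later, e.g. for logging.
foreach (libxml_get_errors() as $error) {
    error_log(trim($error->message));
}
libxml_clear_errors();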
Edit
Hey, in your code you're setting the UA to "Scraper/0.9". Most people who run a badly written website don't look at that and don't log incoming requests on their pages, but I don't recommend using a UA like that. Just use a browser UA, such as Chrome's user agent, because if they are monitoring and see requests containing this user agent, they may blacklist you in the future.
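For example (the exact user-agent string below is only an illustration, not a specific recommendation):
curl_setopt($ch, CURLOPT_USERAGENT,
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36');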

Php curl incorrect download

I'm attempting to use YouTube's API to pull a list of videos and display them. To do this, I need to curl their API and get the XML file returned, which I will then parse.
When I run the following curl function
function get_url_contents($url){
    $crl = curl_init();
    $timeout = 5;
    curl_setopt($crl, CURLOPT_URL, $url);
    curl_setopt($crl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($crl, CURLOPT_CONNECTTIMEOUT, $timeout);
    $ret = curl_exec($crl);
    curl_close($crl);
    return $ret;
}
against the url
http://gdata.youtube.com/feeds/api/videos?q=Apple&orderby=relevance
The string that is saved is horribly screwed up. There are no < > tags, and half of the characters are missing in most of it. It looks 100% different than if I view it in a browser.
I tried print, echo, and var_dump, and they all show it as completely different, which makes parsing it impossible.
How do I get the file properly from the server?
It's working for me. I'm pretty sure that the file is returned without errors, but when you print it, the <> tags aren't shown. If you look at the page source you can see them.
Try this and you can see it working:
$content = get_url_contents('http://gdata.youtube.com/feeds/api/videos?q=Apple&orderby=relevance');
$xml = simplexml_load_string($content);
print_r($xml);
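From there, since the GData feed is a standard Atom document, you should be able to loop over the video entries; a minimal sketch:
foreach ($xml->entry as $entry) {
    echo $entry->title, "\n";
}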
Make use of the client library that Google provides; it'll make your life easier.
http://code.google.com/apis/youtube/2.0/developers_guide_php.html

Simplexml_load_file issues caused by proxy

I've built a script locally which works fine. I've now moved that script onto a server that's behind a proxy, and I've run into some issues.
Here's the code:
$yahooXML = simplexml_load_file('http://query.yahooapis.com/v1/public/yql?q=select+*%20from%20yahoo.finance.xchange%20where%20pair%20in%20(%22'.$from.''.$to.'%22)&diagnostics=true&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys');
print_r($yahooXML);
die();
I'm getting a "failed to open stream" and an "I/O warning : failed to load external entity" error using this.
I explored using cURL to load the data and then parse it with simplexml, but I'm not sure if this is possible?
Any ideas?
Edit:
I loaded the page with cURL, which failed as well, so I added the proxy option and that fixed it. Now I just need to parse this as XML?
function curl($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_PROXY, 'proxysg.uwe.ac.uk:8080');
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
$feed = 'http://query.yahooapis.com/v1/public/yql?q=select+*%20from%20yahoo.finance.xchange%20where%20pair%20in%20(%22'.$from.''.$to.'%22)&diagnostics=true&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys';
$data = curl($feed);
echo $data;
die();
Once you have the XML file and you've verified that it's proper XML, you can load it into PHP via simplexml_load_string() or simplexml_load_file(), depending on what you have.
If your $data var is a string with well-formed XML, then:
$xml = simplexml_load_string($data);
print_r($xml);
should work just fine. Of course, now you have a SimpleXML object, which will work with any of the normal SimpleXML functions.
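Putting the two pieces together with the curl() helper from your edit, a minimal sketch would be:
$data = curl($feed);
$xml = simplexml_load_string($data);
if ($xml === false) {
    die('The response was not well-formed XML');
}
print_r($xml);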

Parsing XML feed in PHP

<?php
$url='http://bart.gov/dev/eta/bart_eta.xml';
$c = curl_init($url);
curl_setopt($c, CURLOPT_MUTE, 1);
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
$rawXML = curl_exec($c);
curl_close($c);
$fixedupXML = htmlspecialchars($rawXML);
foreach($fixedupXML->eta as $eta) {
echo $eta->destination;
}
?>
As a way to get introduced to PHP, I've decided to parse BART's XML feed and display it on my webpage. I managed (also via this site) to fetch the data and preserve the XML tags. However, when I try to output the XML data using what I found to be the simplest method, nothing happens.
foreach($fixedupXML->eta as $eta){
echo $eta->destination;
}
Am I not getting the nested elements right in the foreach loop?
Here is the BART XML feed http://www.bart.gov/dev/eta/bart_eta.xml
Thanks!
You may want to look at simplexml, which is a fantastic and really simple way to work with XML.
Here's a great example:
$xml = simplexml_load_file('http://bart.gov/dev/eta/bart_eta.xml');
Then you can run a print_r on $xml to see its contents:
print_r($xml);
And you should be able to work with it from there :)
If you still need to use curl to get the feed data for some reason, you can feed the XML into simplexml like this:
$xml = simplexml_load_string($rawXML);
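From there, iterating the feed might look roughly like this (a sketch only; it assumes the feed nests <eta> elements, each with a <destination> and <estimate>, under <station> elements, so check the actual structure with print_r first):
$xml = simplexml_load_string($rawXML);
if ($xml === false) {
    die('Could not parse the feed as XML');
}
foreach ($xml->station as $station) {
    echo $station->name, "\n";
    foreach ($station->eta as $eta) {
        echo '  ', $eta->destination, ': ', $eta->estimate, "\n";
    }
}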
