Wasn't sure what to call this, so I will quickly elaborate.
I have a screen scraper I am trying to build using the YQL console, which offers a choice of XML or JSON output. I am targeting the YQL > data > html table of the console and chose XML as my output format.
My YQL Query:
SELECT * FROM html WHERE url="http://google.com"
This will provide you with a readout of the Google.com document tree in XML. Too much output to paste into this post, so just click the link.
My problem comes with traversing the XML tree with PHP to properly display the output from this request. I don't know how to write a foreach statement (or any other construct) that walks the XML output, collects the document tree, and re-displays it for my own needs.
My PHP:
$searchUrl = "google.com";
if (isset($_REQUEST['searchUrl'])) {
    $searchUrl = $_REQUEST['searchUrl'];
}
$query = "select * from html where url=\"http://" . $searchUrl . "\"";
$url = "http://query.yahooapis.com/v1/public/yql";
// build the POST body for the YQL request
$parameterData = "q=" . urlencode($query);
$parameterData .= "&diagnostics=true";
// set up cURL
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $parameterData);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_TIMEOUT, 20);
// send
$response = trim(urldecode(curl_exec($ch)));
// parse response (@ suppresses warnings from malformed XML)
$xmlObjects = @simplexml_load_string($response);
foreach ($xmlObjects->diagnostics as $diagnostics) {
    echo "<a href='" . $diagnostics->url . "' target='_blank'>" . $diagnostics->url . "</a>";
}
foreach ($xmlObjects->results as $result) {
    // here is where I would echo $result->body or something along those lines
}
I suppose I am a bit stumped at this point; I don't know enough to see where to turn next to navigate an XML tree with this type of format. After query > results > body in the XML, I am unsure where to go to collect the remaining objects and output them into my document in a pre tag or something of that nature.
I would like to provide an input field for users to enter their own domain; my PHP will submit the query, iterate over the response, and return the document tree to the user for HTML viewing and debugging.
I am familiar with PHP and XML in the context of iterating over a large number of parent elements with the same internal structure, like an RSS feed. In this case I am dealing with a dynamic XML tree: one large response object with a fluctuating internal structure.
The following code will display the result body as an HTML page:
<?php
// ... the code you posted in the question
// !without the diagnostics output!
// read comments of the answer to know why
?>
<html>
<head>
</head>
<?php
foreach ($xmlObjects->results as $result) {
    // asXML() returns the content of body as an XML string
    echo $result->body->asXML();
    break;
}
?>
</html>
Note that as you won't get the <head> element of the page via YQL, the output will in most cases look messy.
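If what you want is the raw tree for debugging rather than a rendered page, a minimal variation of the above (a sketch, reusing the $xmlObjects variable from the question's code) is to escape the markup and wrap it in a pre tag, which is what the question asked for:
<?php
// Sketch: print the scraped tree as escaped source inside a <pre> tag.
// Assumes $xmlObjects holds the parsed YQL response from the question's code.
foreach ($xmlObjects->results as $result) {
    echo '<pre>' . htmlspecialchars($result->body->asXML()) . '</pre>';
    break;
}
?>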
Related
Hi, I'm attempting to crawl Google search results, just for my own learning, but also to see whether I can speed up getting access to direct URLs (I'm aware of their API, but I just thought I'd try this for now).
It was working fine, but it seems to have stopped; it's simply returning nothing now. I'm unsure if it's something I did, but I can say that I had this in a for loop to let the start parameter increase, and I'm wondering whether that may have caused problems.
Is it possible Google can block an IP from crawling?
Thanks.
$url = "https://www.google.ie/search?q=adrian+de+cleir&start=1&ie=utf-8&oe=utf-8&rls=org.mozilla:en-US:official&client=firefox-a&channel=fflb&gws_rd=cr&ei=D730U7KgGfDT7AbNpoBY#channel=fflb&q=adrian+de+cleir&rls=org.mozilla:en-US:official";
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
# Create a DOM parser object
$dom = new DOMDocument();
# Parse the HTML from Google.
# The @ before the method call suppresses any warnings that
# loadHTML might throw because of invalid HTML in the page.
@$dom->loadHTML($html);
# Iterate over all the <h3> result headings
foreach ($dom->getElementsByTagName('h3') as $link) {
    $actual_link = $link->getElementsByTagName('a');
    foreach ($actual_link as $single_link) {
        # Show the <a href>
        echo '<pre>';
        print_r($single_link->getAttribute('href'));
        echo '</pre>';
    }
}
Given below is the program I have written in Python, though it is not fully complete. Right now it only gets the first page and prints all the href links found in the results.
We can use sets to remove redundant links from the result set.
import requests
from bs4 import BeautifulSoup

def search_spider(max_pages, search_string):
    page = 0
    search_string = search_string.replace(' ', '+')
    while page <= max_pages:
        url = ('https://www.google.com/search?num=10000&q=' + search_string
               + '#q=' + search_string + '&start=' + str(page))
        print("URL to search - " + url)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        count = 1
        for link in soup.findAll("a", {"class": ""}):
            href = link.get('href')
            if href is None:
                continue
            input_string = slice_string(href)
            print(input_string)
            count += 1
        page += 10

def slice_string(input_string):
    # remove the leading "/url?q=" prefix (lstrip would strip a character
    # set, not a prefix, and can eat the start of the real URL)
    if input_string.startswith("/url?q="):
        input_string = input_string[len("/url?q="):]
    # cut off Google's tracking parameters after the first '&'
    index_c = input_string.find('&')
    if index_c != -1:
        input_string = input_string[:index_c]
    return input_string

search_spider(1, "bangalore cabs")
This program will search for "bangalore cabs" in Google.
Thanks,
Karan
You can check whether Google has blocked you with the following simple curl command:
curl -sSLA Mozilla "http://www.google.com/search?q=linux" | html2text -width 80
You may need to install html2text to convert the HTML into plain text.
Normally you should use the Custom Search API provided by Google to avoid any limitations; that way you can retrieve search results more easily and have access to different formats (such as XML or JSON).
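For example, a rough sketch of a Custom Search request from PHP (YOUR_API_KEY and YOUR_CSE_ID are placeholders you would get from the Google developer console):
// Sketch only: query the Google Custom Search JSON API instead of scraping.
$apiUrl = 'https://www.googleapis.com/customsearch/v1?key=YOUR_API_KEY'
        . '&cx=YOUR_CSE_ID&q=' . urlencode('adrian de cleir');
$ch = curl_init($apiUrl);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$results = json_decode(curl_exec($ch));
curl_close($ch);
foreach ($results->items as $item) {
    echo $item->link . "\n"; // direct result URLs, no scraping required
}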
Is it bad practice, or will it be slower, if I use cURL within a foreach loop?
I'm planning on having an autocomplete input field, and the query in the input would be sent to an API call.
I'm getting an id from a certain link (i.e. http://api.linke1.com/names):
foreach ($json as $j) {
    $id = $j->id; // from http://api.linke1.com/names
    $url = "https://api.site/{$id}/photos";
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    $output = curl_exec($ch);
    curl_close($ch);
    $jsonDecode = json_decode($output);
    $results = $jsonDecode->results;
    foreach ($results as $result) {
        $photoURL = $result->photo->url; // from https://api.site/{$id}/photos
    }
}
So every time I type in a name, it will go into the foreach, searching for an id from http://api.linke1.com/names, and then it will look for the photo URL from the other link. I want to output an array, so eventually I'll have a list of data showing information such as name, photo, etc.
Will this slow down dramatically, given that each letter typed in the input field runs through this foreach loop? Would there be an easier way?
Thanks!
Initialize cURL, and set the options that don't change, before the loop, and close the handle afterwards.
That will speed things up a little.
You can also use curl_multi_*, which can fetch several URLs in parallel; see the sketch below.
http://se2.php.net/manual/en/ref.curl.php
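A rough sketch of the curl_multi_* pattern (here $urls stands for the photo endpoints built from the ids, e.g. "https://api.site/{$id}/photos"):
// Sketch: fetch several URLs in parallel with curl_multi_*.
$mh = curl_multi_init();
$handles = array();
foreach ($urls as $key => $u) {
    $ch = curl_init($u);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$key] = $ch;
}
// run all transfers until none are still active
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);
} while ($running > 0);
// collect the responses and clean up
foreach ($handles as $key => $ch) {
    $responses[$key] = json_decode(curl_multi_getcontent($ch));
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);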
I'm trying to get the image out of an RSS feed using a SimpleXML feed, parsing the data out via an array and back into the foreach loop...
In the source code the array entry for [description] is shown as blank, though I've managed to pull it out using another loop. However, I can't for the life of me work out how to pull in the next array, and subsequently the image for each post!
Help?
You can view my progress here: http://dev.thebarnagency.co.uk/tfolphp.php
Here's the original feed: feed://feeds.feedburner.com/TheFutureOfLuxury?format=xml
$xml_feed_url = 'http://feeds.feedburner.com/TheFutureOfLuxury?format=xml';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $xml_feed_url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$xml = curl_exec($ch);
curl_close($ch);
function produce_XML_object_tree($raw_XML) {
    libxml_use_internal_errors(true);
    try {
        $xmlTree = new SimpleXMLElement($raw_XML);
    } catch (Exception $e) {
        // Something went wrong.
        $error_message = 'SimpleXMLElement threw an exception.';
        foreach (libxml_get_errors() as $error_line) {
            $error_message .= "\t" . $error_line->message;
        }
        trigger_error($error_message);
        return false;
    }
    return $xmlTree;
}
$feed = produce_XML_object_tree($xml);
print_r($feed);
foreach ($feed->channel->item as $item) {
    // $desc = $item->description;
    echo 'link<br>';
    foreach ($item->description as $desc) {
        echo $desc;
    }
}
thanks
Can you use
wp_remote_get( $url, $args );
which I got from here: http://dynamicweblab.com/2012/09/10-useful-wordpress-functions-to-reduce-your-development-time
You can also get more details about this function here: http://codex.wordpress.org/Function_API/wp_remote_get
Hope this will help.
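A minimal usage sketch, assuming this runs inside WordPress (the feed URL is the one from the question):
// Sketch: fetch the feed with the WordPress HTTP API instead of raw cURL.
$response = wp_remote_get('http://feeds.feedburner.com/TheFutureOfLuxury?format=xml');
if (!is_wp_error($response)) {
    $xml  = wp_remote_retrieve_body($response);
    $feed = simplexml_load_string($xml);
}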
I'm not entirely clear what your problem is here - the code you provided appears to work fine.
You mention "the image for each post", but I can't see any images specifically labelled in the XML. What I can see is that inside the HTML in the content node of the XML, there is often an <img> tag. As far as the XML document is concerned, this entire blob of HTML is just one string delimited with the special tokens <![CDATA[ and ]]>. If you get this string into a PHP variable (using (string)$item->content you can then find a way of extracting the <img> tag from inside it - but note that the HTML is unlikely to be valid XML.
The other thing to mention is that SimpleXML is not, as you repeatedly refer to it, an array - it is an object, and a particularly magic one at that. Everything you do to the SimpleXML object - foreach ( $nodeList as $node ), isset($node), count($nodeList), $node->childNode, $node['attribute'], etc - is actually a function call, often returning another SimpleXML object. It's designed for convenience, so in many cases writing what seems natural will be more helpful than inspecting the object.
For instance, since each item has only one description you don't need the inner foreach loop - the following will all have the same effect:
foreach ($item->description as $desc) { echo $desc; } (loop over all child elements with tag name description)
echo $item->description[0]; (access the first description child node specifically)
echo $item->description; (access the first/only description child node implicitly; this is why you can write $feed->channel->item and it would still work if there were a second channel element, it would just be ignored)
I had an issue where simplexml_load_file was returning some array sections blank as well, even though they contained data when you viewed the source URL directly.
It turns out the data was there, but it was CDATA, so it was not being displayed properly.
Is this perhaps the same issue the OP was having?
Anyway, my solution was this:
So initially I used this:
$feed = simplexml_load_file($rss_url);
And I got an empty description back, like this:
[description] => SimpleXMLElement Object
(
)
But then I found this solution in the comments on the PHP.net site, saying I needed to use LIBXML_NOCDATA:
https://www.php.net/manual/en/function.simplexml-load-file.php
$feed = simplexml_load_file($rss_url, "SimpleXMLElement", LIBXML_NOCDATA);
After making this change, I got description like this:
[description] => My description text!
Hi, I am trying to identify key Twitter influencers for a client, and I have a list of 170 Twitter IDs that I need to learn more about.
I would like the script to loop through the list of Twitter IDs and save the output to a single XML file:
http://twitter.com/users/show/mattmuller.xml
http://twitter.com/users/show/welovecrowds.xml
http://twitter.com/users/show/jlyon.xml
etc.
In essence I need to write a script that fetches each URL and saves the output as a single XML file on the server. Any ideas on how to do this with PHP, and do I need to use cURL?
Thanks for any help.
Cheers
Jonathan
This is a simple example of how you could achieve this using cURL:
// array of twitter accounts
$ids = array('mattmuller', 'welovecrowds', 'jlyon' /* (...) */);
$ch = curl_init();
$url = 'http://twitter.com/users/show/';
// a single root element is needed so the combined file is valid XML
$xml = '<?xml version="1.0" encoding="utf-8"?><users>';
// make curl return the contents instead of outputting them
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
foreach ($ids as $id) {
    // set the url based on the account id
    curl_setopt($ch, CURLOPT_URL, "$url$id.xml");
    // fetch the url contents and remove the xml declaration
    $xml .= preg_replace('/\<\?xml .*\?\>/i', '', curl_exec($ch));
}
$xml .= '</users>';
// save the contents of $xml into a file
file_put_contents('users.xml', $xml);
How can I query a particular website with some fields and get the results back to my webpage using PHP?
Let's say website xyz.com will give you the name of the city if you give it the zip code. How can I achieve this easily in PHP? Any code snapshot would be great.
If I understand what you mean (you want to submit a query to a site and get the result back for processing and such?), you can use cURL.
Here is an example:
<?php
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "example.com");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
?>
You can grab the Lat/Long from this site with some regexp like this:
if (preg_match_all("#<td>\s+-?(\d+\.\d+)\s+</td>#", $output, $coords)) {
    list($lat, $long) = $coords[1];
    echo "Latitude: $lat\nLongitude: $long\n";
}
Just put that after the curl_close() function.
That will return something like this (numbers changed):
Latitude: 53.5100
Longitude: 60.2200
You can use file_get_contents (and other similar fopen-style functions) to do this:
$result = file_get_contents("http://other-site.com/query?variable=value");
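If you have several fields, http_build_query keeps the encoding right (the URL and parameter names here are made up for illustration):
// Sketch: build the query string safely, then fetch the page.
$params = array('zipcode' => '90210');
$result = file_get_contents('http://other-site.com/query?' . http_build_query($params));
if ($result === false) {
    // the request failed; handle the error here
}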
Do you mean something like:
include 'http://www.google.com?q=myquery';
? Or which fields do you want to get? Can you be a bit more specific, please? :)
If you want to import the HTML into your page and analyze it, you probably want to use cURL.
You have to have the extension loaded for your page (it's usually part of PHP; I think it has to be compiled in? The manual can answer that).
Here is a curl function. Set up your URL like this:
$param = 'fribby';
$param2 = 'snips';
$url = "http://www.example.com?data=$param&data2=$param2";
function curl_page($url)
{
    $response = false;
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $response = curl_exec($ch);
    curl_close($ch);
    return $response;
}
$page_data = curl_page($url);
Then, you can get data out of the page using DOM parsing or grep/sed/awk-type tools; for example:
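For the DOM route, something like this sketch (the XPath expression is only an illustration; adjust it to whatever page you fetch):
// Sketch: parse the fetched page and pull values out with XPath.
$dom = new DOMDocument();
@$dom->loadHTML($page_data); // suppress warnings from sloppy real-world HTML
$xpath = new DOMXPath($dom);
// illustrative query: grab the text of every table cell
foreach ($xpath->query('//td') as $cell) {
    echo trim($cell->textContent) . "\n";
}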