Screen scrape using php and fopen [duplicate] - php

This question already has an answer here:
Closed 10 years ago.
Possible Duplicate:
Screen scapingin in php using file_get_contents
Can anyone help me.. I am trying to scrape Hotel reviews from LateRooms.com dont tell me its a bad idea because I already have permission as an affiliate
My code:
<?php
header('content-type: text/plain');
$contents = file_get_contents('http://www.laterooms.com/en/hotel-reviews/238902_the-westfield-bb-sandown.aspx');
$contents = preg_replace('/\s(1,)/', ' ', $contents);
print $contents . "\n";
$records = preg_split('/<div id="review/', $contents);
for ($ix = 1; $ix < count($records); $ix++) {
$tmp = $records[$ix];
preg_match('/id="review"/', $tmp, $match_reviews);
print_r($match_reviews);
exit();
}
?>
This works really well the only problem is that It pulls in the whole page of code and doesnt match the div id 'review'
Thanks in advance

function file_get_contents_curl($url){
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
function DOMinnerHTML($element){
$innerHTML = "";
$children = $element->childNodes;
foreach ($children as $child)
{
$tmp_dom = new DOMDocument();
$tmp_dom->appendChild($tmp_dom->importNode($child, true));
$innerHTML.=trim($tmp_dom->saveHTML());
}
return $innerHTML;
}
$url = 'http://www.laterooms.com/en/hotel-reviews/238902_the-westfield-bb-sandown.aspx';
$html = file_get_contents_curl($url);
//parsing begins here:
$doc = new DOMDocument();
#$doc->loadHTML($html);
$div_elements = $doc->getElementsByTagName('div');
if ($div_elements->length <> 0){
foreach ($div_elements as $div_element) {
if ($div_element->getAttribute('class') == 'review newReview'){
$reviews[] = DOMinnerHTML($div_element);
}
}
}
print_r($reviews);
Try this, it will return all reviews. You can refine the content as per your requirement.

Related

Instagram Curl is Working on Localhost but not on live domain

i am reading the html source code of instagram post by using the CURL. I am able to do this on localhost but when i test the code on live domain then meta tags with og property like og:type is missing, it only showing at localhost.
This is the complete code.
<?php
function get_domain($url)
{
$pieces = parse_url($url);
$domain = isset($pieces['host']) ? $pieces['host'] : $pieces['path'];
if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i',
$domain, $regs)) {
return $regs['domain'];
}
return false;
}
//run curl here and get html code of instagram post page
function file_get_contents_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, true);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
//check instagram url
function checkinstaurl($urlhere) {
//remove white space
$urlhere = trim($urlhere);
$urlhere = htmlspecialchars($urlhere);
///remove white space
if (get_domain($urlhere) == "instagram.com") {
//getting the meta tag data
$html = file_get_contents_curl($urlhere);
//parsing begins here:
$doc = new DOMDocument();
#$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
//get and display what you need:
$title = $nodes->item(0)->nodeValue;
$metas = $doc->getElementsByTagName('meta');
$mediatype = null;
$description = null;
for ($i = 0; $i < $metas->length; $i++)
{
$meta = $metas->item($i);
if($meta->getAttribute('property') == 'og:type')
$mediatype = $meta -> getAttribute('content');
if($mediatype == 'video') {
if($meta->getAttribute('property') == 'og:video')
$description = $meta -> getAttribute('content');
} else {
if($meta->getAttribute('property') == 'og:image')
$description = $meta -> getAttribute('content');
$mediatype = 'photo';
}
} // for loop statement
$out['mediatype'] = $mediatype;
$out['descriptionc'] = $description;
return $out;
//getting the meta tag data
}
}
/*output*/
$igurl = 'https://www.instagram.com/p/COf0dN0M8pU/';
$output = checkinstaurl($igurl);
echo "<pre>";
print_r($output);
?>
This above code, At Localhost returns the complete html with meta tags but on live domain meta tags with og property is missing.

Get text from an element of a web page with PHP

I have this error: Object of class DOMDocument could not be converted to string
I'm trying to parse web page to get text inside a div
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
$dom = new DOMDocument();
$dom->loadHTML($html);
$table = $dom->getElementById('mostra')> textContent; //DOMElement
echo $table;
This is html element:
<div id="mostra">Hello<img src="file.png"></div>
I want to print Hello
How can i solve it ?
Thanks a lot and sorry for my english
function string_between_two_string($str, $starting_word, $ending_word) {
$subtring_start = strpos($str, $starting_word);
$subtring_start += strlen($starting_word);
$size = strpos($str, $ending_word, $subtring_start) - $subtring_start;
return substr($str, $subtring_start, $size);
}
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
$table = string_between_two_string($html, '<div id="mostra">', '<img src="file.png"></div>');
echo $table;
Try to use this function to find text between two element

Why is new SimpleXMLElement causing a 500 error?

I have a simple script that until yesterday had worked fine for 2 years. Im just taking a XML feed from a WP site and formatting it to be displayed on a different website. Here is the code:
<?php
function download_page($path){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$path);
curl_setopt($ch, CURLOPT_FAILONERROR,1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
$retValue = curl_exec($ch);
curl_close($ch);
return $retValue;
}
$sXML = download_page('https://example.com/tradeblog/feed/atom/');
$oXML = new SimpleXMLElement($sXML);
$items = $oXML->entry;
$i = 0;
foreach($items as $item) {
$title = $item->title;
$link = $item->link;
echo '<li>';
foreach($link as $links) {
$loc = $links['href'];
$href = str_replace("/feed/atom/", "", $loc);
echo "<a href=\"$href\" target=\"_blank\">";
}
echo $title;
echo "</a>";;
echo "</li>";
if(++$i == 3) break;
}
?>
I can echo out $sXML and it will display the entire XML contents as expected. When I try and echo $oXML I get the 500 error. Any use of $oXML causes the 500. What changed? Is there a different / better way to do this using PHP?
It seems your xml source is not exactly a xml. I tried to validate it using w3 scholl validator and it throws an error. Tried here too, and got the same error.
Not sure why, but this worked
<?php
$rss = new DOMDocument();
$rss->load('https://example.com/tradeblog/feed/rss2/');
$feed = array();
foreach ($rss->getElementsByTagName('item') as $node) {
$item = array (
'title' => $node->getElementsByTagName('title')->item(0)->nodeValue,
'link' => $node->getElementsByTagName('link')->item(0)->nodeValue,
);
array_push($feed, $item);
}
$limit = 3;
for($x=0;$x<$limit;$x++) {
$title = str_replace(' & ', ' & ', $feed[$x]['title']);
$link = $feed[$x]['link'];
echo '<li>'.$title.'</li>';
}
?>

Call a URL with PHP and get XML response [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 7 years ago.
So I am trying to get a XML response after calling a URL with params (GET request). I found this code below, which is working.
$url = "http://...";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true );
curl_setopt($ch, CURLOPT_ENCODING, "gzip,deflate");
$response = curl_exec($ch);
curl_close($ch);
echo $response;
But as response I am getting a huge string with no commas (so I cannot explode it). And this string has only values, no keys.
Is there a way to get an associative array instead?
The XML is like:
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<transaction>
<date>2011-02-10T16:13:41.000-03:00</date>
<code>9E884542-81B3-4419-9A75-BCC6FB495EF1</code>
<reference>REF1234</reference>
<type>1</type>
<status>3</status>
<paymentMethod>
<type>1</type>
<code>101</code>
</paymentMethod>
<grossAmount>49900.00</grossAmount>
<discountAmount>0.00<discountAmount>
(...)
SO I would like to have an array like:
date => ...
code => ...
reference => ...
(and so on)
Is that possible? If so, how?
EDIT: I donĀ“t agree with the "this questions is already answered" tag. No code found on the indicated topic solved my issue. But, anyhow, I found a way, with the code below.
$url = http://...;
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$transaction= curl_exec($curl);
curl_close($curl);
$transaction = simplexml_load_string($transaction);
var_dump($transaction); //retrieve a object(SimpleXMLElement)
I have had good luck using code like this:
$url = "http://feeds.bbci.co.uk/news/rss.xml";
$xml = file_get_contents($url);
if ($rss = new SimpleXmlElement($xml)) {
echo $rss->channel->title;
}
I use something like this (very universal solution):
http://www.akchauhan.com/convert-xml-to-array-using-dom-extension-in-php5/
Only thing is I exclude the attributes part as I don't need them for my cases
<?php
class xml2array {
function xml2array($xml) {
if (is_string($xml)) {
$this->dom = new DOMDocument;
$this->dom->loadXml($xml);
}
return FALSE;
}
function _process($node) {
$occurance = array();
foreach ($node->childNodes as $child) {
$occurance[$child->nodeName]++;
}
if ($node->nodeType == XML_TEXT_NODE) {
$result = html_entity_decode(htmlentities($node->nodeValue, ENT_COMPAT, 'UTF-8'),
ENT_COMPAT,'ISO-8859-15');
} else {
if($node->hasChildNodes()){
$children = $node->childNodes;
for ($i=0; $i < $children->length; $i++) {
$child = $children->item($i);
if ($child->nodeName != '#text') {
if($occurance[$child->nodeName] > 1) {
$result[$child->nodeName][] = $this->_process($child);
} else {
$result[$child->nodeName] = $this->_process($child);
}
} else if ($child->nodeName == '#text') {
$text = $this->_process($child);
if (trim($text) != '') {
$result[$child->nodeName] = $this->_process($child);
}
}
}
}
}
return $result;
}
function getResult() {
return $this->_process($this->dom);
}
}
?>
And call it from your script like this:
$obj = new xml2array($response);
$array = $obj->getResult();
The code is very self explanatory, Objective approach and it can easily be modified to exclude or include parts at desire.
simply load XML into DOM Object, then recursively check for children and fetch respective values.
Hope it helps

How to parse this XML using PHP?

I would like to do parse XML for the url:
http://www.opencellid.org/cell/get?key=myapikey&mcc=250&mnc=99&cellid=29513&lac=0
I am the beginner of PHP and tried the code as below:
$url = "http://www.opencellid.org/cell/get?key=myapikey&mcc=250&mnc=99&cellid=29513&lac=0";
$xml = simplexml_load_file($url) or die("feed not loading");
print_r($xml);
var_dump($xml);
I would like to do echo for each attribute e.g. lat, lon, range.. for this XML url.
I found many resource in stackoverflow which the XML is quite standard. I cannot find an example which is used for this format of XML:
Anyone could give me an idea? Thanks.
$url = "http://www.opencellid.org/cell/get?key=myapikey&mcc=250&mnc=99&cellid=29513&lac=0";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$data = curl_exec($ch);
$status = curl_getinfo($ch,CURLINFO_HTTP_CODE);
curl_close($ch);
if($status==200){
$x = new SimpleXMLElement($data,LIBXML_NOCDATA);
echo 'lat: '.$x->cell['lat'];
echo 'lon: '.$x->cell['lon'];
echo 'range: '.$x->cell['range'];
}
Update:
This is an alternative to curl. If you have curl installed, use curl, it is faster.
$url = "http://www.opencellid.org/cell/get?key=myapikey&mcc=250&mnc=99&cellid=29513&lac=0";
$data = file_get_contents($url);
$x = new SimpleXMLElement($data,LIBXML_NOCDATA);
$lat = $x->cell['lat'];
$lng = $x->cell['lon'];
$range = $x->cell['range'];
echo 'lat: '.$lat.'<br>';
echo 'lng: '.$lng.'<br>';
echo 'range: '.$range.'<br>';
Please try this
$content = file_get_contents('http://www.opencellid.org/cell/get?key=myapikey&mcc=250&mnc=99&cellid=29513&lac=0');
$x = new SimpleXmlElement($content);
foreach($x->cell as $entry) {
echo $entry['lat']."==".$entry['lon']; exit;
}
$url = "http://www.opencellid.org/cell/get?key=myapikey&mcc=250&mnc=99&cellid=29513&lac=0";
$xml = simplexml_load_file($url) or die("feed not loading");
$attr = array("lat", "lon"); //list attr you require
foreach($xml->cell as $cell){
$data = $cell->attributes();
foreach ($attr as $key) {
// echo "<br>key: $key, value: " . $data[$key];
//edit to insert
$columns[] = $key;
$values[] = $data[$key];
}
echo $query = "insert into table_name(".implode(',',$columns).") values (".implode(',',$values).")";
}

Categories