Using XPath with PHP to parse HTML from a website

Currently I'm trying to use XPath to parse an HTML page from a website.
I need to get a result in the format:
Time of the program : Program name
For example:
1.00PM : Ye Hai Mohabbatein
I am using the following code (as shown here) to obtain it, but it is not working.
<?php
libxml_use_internal_errors(true);
$dom = new DomDocument;
$dom->loadHTMLFile("www.starplus.in/schedule.aspx");
$xpath = new DomXPath($dom);
$nodes = $xpath->query("//table");
foreach ($nodes as $i => $node) {
echo "hy";
echo "Node($i): ", $node->nodeValue, "\n";
}
?>
I would be thankful if anybody could help me with this issue.

Basically, just target the div/table that holds the name of the show and the time slot.
Rough example:
// it seems it doesn't work when there is no user agent
$ch = curl_init('http://www.starplus.in/schedule.aspx');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($ch);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
$shows = array();
$tables = $xpath->query("//div[@class='sech_div_bg']/table"); // target that table
foreach ($tables as $table) {
$time_slot = $xpath->query('./tr[1]/td/span', $table)->item(0)->nodeValue;
$show_name = $xpath->query('./tr[3]/td/span', $table)->item(0)->nodeValue;
$shows[] = array('time_slot' => $time_slot, 'show_name' => $show_name);
echo "$time_slot - $show_name <br/>";
}
// echo '<pre>';
// print_r($shows);
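
To try the same idea without hitting the network, here is a minimal sketch that runs the relative queries against an inline string. The markup is a hypothetical stand-in for the schedule page (the real page may differ), and extract_shows() is just an illustrative helper name. It also checks for missing nodes so a layout change does not cause a fatal error on ->nodeValue.

```php
<?php
// Hypothetical markup standing in for the schedule page; the real page may differ.
$page = <<<'HTML'
<div class="sech_div_bg">
  <table>
    <tr><td><span>1.00PM</span></td></tr>
    <tr><td></td></tr>
    <tr><td><span>Ye Hai Mohabbatein</span></td></tr>
  </table>
</div>
HTML;

// Illustrative helper (not from the original answer): returns "time : name" strings.
function extract_shows(string $html): array
{
    libxml_use_internal_errors(true); // keep warnings about sloppy HTML quiet
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $shows = [];
    foreach ($xpath->query("//div[@class='sech_div_bg']/table") as $table) {
        // ".//" searches below the current table only, with or without a <tbody>
        $time = $xpath->query('.//tr[1]/td/span', $table)->item(0);
        $name = $xpath->query('.//tr[3]/td/span', $table)->item(0);
        if ($time === null || $name === null) {
            continue; // skip tables that do not have the expected rows
        }
        $shows[] = trim($time->nodeValue) . ' : ' . trim($name->nodeValue);
    }
    return $shows;
}

print_r(extract_shows($page));
```

The second argument to query() is the context node, which is what makes the "./" and ".//" paths relative to each table instead of the whole document.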

Related

Php web-scraping divs with class do not display images

I'm scraping an online newspaper (an external URL) with the plain PHP below, targeting the class block__item, which is composed of other classes and images.
<?php
error_reporting(E_ALL);
ini_set('display_errors', 1);
$url = "https://www.repubblica.it";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$content = curl_exec($ch);
curl_close($ch);
$doc = new DOMDocument('1.0', 'utf-8');
@$doc->loadHTML($content);
$classname="block__item";
$finder = new DomXPath($doc);
$spaner = $finder->query("//*[contains(@class, '$classname')]");
foreach ($spaner as $value) {
echo $doc->saveXML($value);
}
?>
Now the images are not displayed: the page reserves the space and the URL is there, but I can't see the image.
I tried with this
$images = $finder->query("(//img/@src)");
foreach ( $images as $image) {
$print = $doc->saveXML($image);
echo $print;
}
and this gave me all the images at the end of the page.
I also tried combining the two:
foreach (array_combine($spaner, $images) as $value => $image){
echo $value, $image;
}
but I receive an error.
I also tried querying the class and img at the same time, but got nothing.
Could anyone please help me?
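
A minimal sketch of one way to keep each image with its block: run the img query relative to every matched block instead of combining two separate node lists (array_combine() expects arrays, not DOMNodeList objects). The markup below is hypothetical, images_per_block() is an illustrative name, and the guess that root-relative src values are why the images do not show is an assumption, not something confirmed for this site.

```php
<?php
// Hypothetical markup; the real page's structure and attributes may differ.
$content = <<<'HTML'
<div class="block__item"><h2>First</h2><img src="/img/a.jpg"></div>
<div class="block__item"><h2>Second</h2><img src="https://cdn.example.com/b.jpg"></div>
HTML;

// Illustrative helper: returns one list of image URLs per matched block.
function images_per_block(string $html, string $classname, string $baseUrl): array
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument;
    $doc->loadHTML($html);
    libxml_clear_errors();

    $finder = new DOMXPath($doc);
    $result = [];
    foreach ($finder->query("//*[contains(@class, '$classname')]") as $block) {
        $urls = [];
        // The leading "." keeps the search inside the current block.
        foreach ($finder->query('.//img/@src', $block) as $src) {
            $url = $src->nodeValue;
            // Assumption: root-relative paths must be prefixed with the source site.
            if (strpos($url, '/') === 0 && strpos($url, '//') !== 0) {
                $url = rtrim($baseUrl, '/') . $url;
            }
            $urls[] = $url;
        }
        $result[] = $urls;
    }
    return $result;
}

print_r(images_per_block($content, 'block__item', 'https://www.repubblica.it'));
```

If the site loads images lazily through another attribute, the '@src' part of the query would need to change accordingly.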

Web scraping from Reuters using PHP: how do I correctly identify the elements I wish to select?

With the code below I can return the current price of AAPL (Apple). How do I modify it to return, for example, the previous close?
$ticker = "aapl";
$url = "http://reuters.com/finance/stocks/overview?symbol=";
$newURL = $url.$ticker;
$result = file_get_contents($newURL);
$nyArr1 = explode('font-size: 23px;">', $result);
if ($nyArr1[1]) {
$nyArr2 = explode("</span>", $nyArr1[1]);
if ($nyArr2[1]) {
$nyPrice = $nyArr2[0];
}
}
Site link: https://www.reuters.com/finance/stocks/overview/AAPL.O

I recommend using DOMDocument to parse an HTML document with PHP, like this:
$ticker = "aapl";
$baseUrl = "http://reuters.com/finance/stocks/overview?symbol=";
$url = $baseUrl.$ticker;
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTMLFile($url);
$finder = new DomXPath($dom);
echo "First value :". $finder->query('//*[@id="headerQuoteContainer"]/div[1]/div/span[2]')->item(0)->nodeValue."<br/>";
echo "Second value :". $finder->query('//*[@id="headerQuoteContainer"]/div[3]/div[1]/span[2]')->item(0)->nodeValue;
I have used DomXPath here, but it is not mandatory.

Try the following to get the required content. If you need another value, all you have to do is change the visible text in [contains(.,'Prev Close')] to suit your needs.
<?php
function get_content($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
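$htmlContent = curl_exec($ch); // run the request once and keep the response
curl_close($ch);
libxml_use_internal_errors(true); // suppress warnings from malformed HTML
$dom = new DOMDocument; // loadHTML() can no longer be called statically in PHP 8
$dom->loadHTML($htmlContent);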
$xp = new DOMXPath($dom);
$prevClose = $xp->query("//span[contains(.,'Prev Close')]/following-sibling::span")->item(0)->nodeValue;
$Open = $xp->query("//span[contains(.,'Open')]/following-sibling::span")->item(0)->nodeValue;
echo "PrevClose: $prevClose". '<br/>';
echo "Open: $Open";
}
$link = "https://www.reuters.com/finance/stocks/overview/AAPL.O";
get_content($link);
?>
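
The label-based lookup can be checked offline with a small sketch like this one. The markup and the numbers are made-up sample data (the real Reuters page may be structured differently), and value_after_label() is an illustrative helper that returns null when the label is not found instead of failing on ->nodeValue.

```php
<?php
// Made-up sample markup and numbers; not taken from the real page.
$html = <<<'HTML'
<div><span>Prev Close</span><span>172.50</span></div>
<div><span>Open</span><span>173.10</span></div>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();
$xp = new DOMXPath($dom);

// Illustrative helper: text of the first <span> that follows a <span> containing $label.
// Note: $label is placed inside single quotes, so it must not contain an apostrophe.
function value_after_label(DOMXPath $xp, string $label): ?string
{
    $node = $xp->query("//span[contains(., '$label')]/following-sibling::span")->item(0);
    return $node === null ? null : trim($node->nodeValue);
}

echo 'PrevClose: ', value_after_label($xp, 'Prev Close') ?? 'n/a', "\n";
echo 'Open: ', value_after_label($xp, 'Open') ?? 'n/a', "\n";
```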

Extract data from HTML tag

I have the following code and am trying to extract the value of the content attribute from an HTML page, but it does not give the result I expect; instead it gives only a blank page.
Any idea where the issue could be?
$url= "https://fr-ca.wordpress.org";
$html = file_get_contents($url);
# Create a DOM parser object
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('meta') as $key ) {
echo "<pre>";
$tab[] = $key->getAttribute('content');
}
$reg= '<meta name="generator" content="(.*?)"/>';
if (preg_match_all($reg, $html, $ar)) {
print_r($ar);
}
Page source has :
<meta name="generator" content="WP 4.5"/>

Try this:
$html = '<meta name="generator" content="WP 4.5"/>';
preg_match_all('/content="(.*)"/i', $html, $matches);
if (isset($matches[1])) {
print_r($matches[1]);
}

Here is a regex that looks for a meta tag and captures the value of its content attribute. It uses wildcards to allow for variations such as different names or extra spaces.
$html = '<meta name="generator" content="WP 4.5"/>';
preg_match_all( '#<meta.*?content=[\'"](.*?)[\'"]\s*/>#i', $html, $results ); // search $html, the string defined above
print_r( $results[1] ); // contains array of captures.
if( $results[1] ) {
// code here...
}
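
For comparison, the same value can be read with DOMDocument and XPath instead of a regex. This is only a sketch on an inline string; meta_content() is an illustrative helper name, and the surrounding html/head markup is added here just to make the sample a complete document.

```php
<?php
$html = '<html><head><meta charset="utf-8">'
      . '<meta name="generator" content="WP 4.5"/></head><body></body></html>';

// Illustrative helper: content attribute of the first <meta> with the given name.
function meta_content(string $html, string $name): ?string
{
    libxml_use_internal_errors(true); // real pages are rarely perfectly valid
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    // "/@content" selects the attribute node itself
    $attr = $xp->query("//meta[@name='$name']/@content")->item(0);
    return $attr === null ? null : $attr->nodeValue;
}

echo meta_content($html, 'generator'), "\n";
```

Unlike the regex, this does not depend on attribute order, quoting style, or whether the tag is self-closed.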

Please use it like this:
$html = file_get_contents( $url);
libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML( $html);
$xpath = new DOMXpath( $doc);
// A name attribute on a <div>???
$nodes = $xpath->query( '//div[@name="changeable_text"]')->item( 0);
echo $nodes->textContent; // DOM nodes expose textContent; there is no Content property
Or:
// Use Curl ...
function getHTML($url,$timeout)
{
$ch = curl_init($url); // initialize curl with given url
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // set useragent
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // write the response to a variable
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to execute
curl_setopt($ch, CURLOPT_FAILONERROR, 1); // stop when it encounters an error
return @curl_exec($ch);
}
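// find() belongs to the Simple HTML DOM library, so the string returned by getHTML()
// must first be parsed with str_get_html() (requires simple_html_dom.php to be included)
$html = str_get_html(getHTML("http://www.website.com", 10));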
// Find all images on webpage
foreach($html->find("img") as $element)
echo $element->src . '<br>';
// Find all links on webpage
foreach($html->find("a") as $element)
echo $element->href . '<br>';

How to get a specified row using cUrl PHP

Hey guys, I use cURL to communicate with an external web server, but the response is HTML. I was able to convert it to JSON (more than 4000 rows), but I have no idea how to get the specific row that contains my result. Any ideas?
Here is my cURL code:
require_once('getJson.php');
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.reputationauthority.org/domain_lookup.php?ip=website.com&Submit.x=9&Submit.y=5&Submit=Search');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);
$data = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
$data = '<<<EOF'.$data.'EOF';
$json = new GetJson();
header("Content-Type: text/plain");
$res = json_encode($json->html_to_obj($data), JSON_PRETTY_PRINT);
$myArray = json_decode($res,true);
For getJson.php
class GetJson{
function html_to_obj($html) {
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
return $this->element_to_obj($dom->documentElement);
}
function element_to_obj($element) {
if ($element->nodeType == XML_ELEMENT_NODE){
$obj = array( "tag" => $element->tagName );
foreach ($element->attributes as $attribute) {
$obj[$attribute->name] = $attribute->value;
}
foreach ($element->childNodes as $subElement) {
if ($subElement->nodeType == XML_TEXT_NODE) {
$obj["html"] = $subElement->wholeText;
}
else {
$obj["children"][] = $this->element_to_obj($subElement);
}
}
return $obj;
}
}
}
My idea is that, instead of walking through the rows to reach line 2175 (something like $data['children'][2]['children'][7]['children'][3]['children'][1]['children'][1]['children'][0]['children'][1]['children'][0]['children'][1]['children'][2]['children'][0]['children'][0]['html'] does not seem like a good idea to me), I want to go directly to it.

If the HTML being returned has a consistent structure every time, and you just want one particular value from one part of it, you may be able to use regular expressions to parse the HTML and find the part you need. This is an alternative to trying to put the whole thing into an array. I have used this technique before to parse an HTML document and find a specific item. Here's a simple example. You will need to adapt it to your needs, since you haven't specified the exact nature of the data you're seeking. You may need to go down several levels of parsing to find the right bit:
$data = curl_exec($ch);
//Split the output into an array that we can loop through line by line
$array = preg_split('/\n/',$data);
//For each line in the output
foreach ($array as $element)
{
//See if the line contains a hyperlink
if (preg_match("/<a href/", "$element"))
{
// ...[do something here, e.g. store the data retrieved, or do more matching to find something within it]...
}
}
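
Another way to "go directly" to a value is to skip the nested array and ask for the node with an XPath expression. The sketch below is an assumption-heavy illustration: the table markup, the 'Reputation Score' label, and the sample values are all hypothetical, since the real response is not shown, and cell_next_to() is an illustrative helper name.

```php
<?php
// Hypothetical response body; the real page layout and labels are unknown.
$data = <<<'HTML'
<table>
  <tr><td>Domain</td><td>website.com</td></tr>
  <tr><td>Reputation Score</td><td>50/100</td></tr>
</table>
HTML;

// Illustrative helper: text of the cell that immediately follows the cell labelled $label.
function cell_next_to(string $html, string $label): ?string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    // normalize-space() ignores leading/trailing whitespace around the label text
    $cell = $xp->query(
        "//td[normalize-space(.) = '$label']/following-sibling::td[1]"
    )->item(0);
    return $cell === null ? null : trim($cell->textContent);
}

echo cell_next_to($data, 'Reputation Score'), "\n";
```

The expression describes the target by what it sits next to, so it keeps working if unrelated parts of the page move around.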

How to pull data from HTML

I am trying to write a PHP script to pull snow and other data from http://www.snowbird.com/mountain-report to display on an LED array. I am having trouble getting the data I need and can't find a way to make it work. I've read that PHP may not be the best tool for this. Can I make this work, or do I have to use a different language? Here is the code I can't seem to get working.
<?php
include_once('simple_html_dom.php');
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "http://www.snowbird.com/mountain-report/");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
$output = ($output);
$html = new DOMDocument();
$html = loadhtml( $content);
$ret1 = $html->find('div[id=twelve-hour]');
print_r ($ret1);
$ret2 = $html->find('#twenty-four-hour');
print_r ($ret2);
$ret3 = $html->find('#forty-eight-hour');
print_r ($ret3);
$ret4 = $html->find('#current-depth');
print_r ($ret4);
$ret5 = $html->find('#year-to-date');
print_r ($ret5);
?>

This is an ancient question, but it's easy enough to provide an answer for it. Use an XPath query to get the correct node's text value. (This should be as easy as passing the URL directly to DOMDocument::loadHTMLFile(), but the server filters requests based on user agent, so we have to fake one.)
<?php
$ctx = stream_context_create(["http"=>[
"user_agent"=>"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0"
]]);
$html = file_get_contents("http://www.snowbird.com/mountain-report/", false, $ctx); // 2nd argument is use_include_path, not needed for a URL
libxml_use_internal_errors(true);
$doc = new DOMDocument;
$doc->loadHTML($html, LIBXML_NOWARNING|LIBXML_NOERROR);
$xp = new DomXpath($doc);
$root = $doc->getElementById("snowfall");
$snowfall = [
"12hour" => $xp->query("div[@id='twelve-hour']/div[@class='total-inches']/text()", $root)->item(0)->textContent,
"24hour" => $xp->query("div[@id='twenty-four-hour']/div[@class='total-inches']/text()", $root)->item(0)->textContent,
"48hour" => $xp->query("div[@id='forty-eight-hour']/div[@class='total-inches']/text()", $root)->item(0)->textContent,
"current" => $xp->query("div[@id='current-depth']/div[@class='total-inches']/text()", $root)->item(0)->textContent,
"ytd" => $xp->query("div[@id='year-to-date']/div[@class='total-inches']/text()", $root)->item(0)->textContent,
];
print_r($snowfall);
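
Since the five queries differ only in the id, they can be driven from a small map. This is a minimal sketch against hypothetical inline markup with made-up values (the live page may have changed), and read_totals() is an illustrative helper; it stores null for any block that is missing rather than failing on ->textContent.

```php
<?php
// Hypothetical markup and values; only two of the blocks are present on purpose.
$html = <<<'HTML'
<div id="snowfall">
  <div id="twelve-hour"><div class="total-inches">3</div></div>
  <div id="twenty-four-hour"><div class="total-inches">5</div></div>
</div>
HTML;

// Illustrative helper: maps each output key to the text of its total-inches block.
function read_totals(string $html, array $ids): array
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument;
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($doc);
    $totals = [];
    foreach ($ids as $key => $id) {
        $node = $xp->query(
            "//div[@id='snowfall']/div[@id='$id']/div[@class='total-inches']"
        )->item(0);
        // null marks a block that was not found in the page
        $totals[$key] = $node === null ? null : trim($node->textContent);
    }
    return $totals;
}

print_r(read_totals($html, [
    '12hour' => 'twelve-hour',
    '24hour' => 'twenty-four-hour',
    'ytd'    => 'year-to-date',
]));
```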
