Get data from URL based on the data inside span - php

I am trying to get data from a URL and retrieve only the data inside the spans that have a title attribute.
Each "row" of data has a span whose title is an incrementing number, for example
title="1", title="2"
so the data I want is inside this span:
<span title="x">DATA HERE</span>
where x is an incrementing number.
I am able to get all the data from the page using this code, but I am stuck on how to extract just what I need:
function file_get_contents_curl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$html = file_get_contents_curl("http://www.example.com");
//parsing all content:
$doc = new DOMDocument();
@$doc->loadHTML($html);
echo "$html";
The data is formatted like :
<span id="RANDOMINFO">
+
<span title="1">DATA I WANT HERE</span>
CLICK
RANDOM DATA
</span>
<span id="RANDOMINFO">
+
<span title="2">DATA I WANT HERE</span>
CLICK
RANDOM DATA
</span>

Solution:
Explanation is available as comments in the provided code
$doc = new DOMDocument();
@$doc->loadHTML($html); // @ suppresses warnings raised by malformed HTML
foreach($doc->getElementsByTagName('span') as $element ) { //Loops through all available span elements
if (empty($element->attributes->getNamedItem('id')->value) || $element->attributes->getNamedItem('id')->value != 'RANDOMINFO') { // Discards irrelevant span elements based on their `ID`. A similar sorting is achieved with `empty()` as the target `span` doesn't have any associated `ID`.
echo get_inner_html($element).PHP_EOL;
}
}
function get_inner_html( $node ) {
$innerHTML= '';
$children = $node->childNodes;
foreach ($children as $child) {
$innerHTML .= $child->ownerDocument->saveHTML( $child ); //fetches the text inside child elements of the targeted element
}
return $innerHTML;
}
Output:
DATA I WANT HERE
DATA I WANT HERE
References:
DOMDocument::getElementsByTagName
DOMNamedNodeMap::getNamedItem
DOMDocument::saveHTML
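As an alternative to filtering on the id, the spans can be selected directly by their title attribute with DOMXPath. The following is a minimal sketch, assuming the markup looks like the sample in the question; the inline `$html` string stands in for the value returned by `file_get_contents_curl()`, and the `$rows` array is a name introduced here for illustration.

```php
<?php
// Sample markup mirroring the structure shown in the question.
$html = <<<'HTML'
<span id="RANDOMINFO">
+
<span title="1">DATA I WANT HERE</span>
CLICK
RANDOM DATA
</span>
<span id="RANDOMINFO">
+
<span title="2">DATA I WANT HERE</span>
CLICK
RANDOM DATA
</span>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them
// (the duplicate ids in this markup would otherwise trigger some).
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$rows = [];
// Select every <span> that has a title attribute, wherever it is nested.
foreach ($xpath->query('//span[@title]') as $span) {
    // Key each value by its title so the incrementing number is preserved.
    $rows[$span->getAttribute('title')] = trim($span->textContent);
}
print_r($rows);
```

Keying the result by title keeps the row number available if the rows need to be processed in order later.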

Related

Get text from an element of a web page with PHP

I have this error: Object of class DOMDocument could not be converted to string
I'm trying to parse web page to get text inside a div
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
$dom = new DOMDocument();
$dom->loadHTML($html);
$table = $dom->getElementById('mostra')->textContent; //DOMElement
echo $table;
This is html element:
<div id="mostra">Hello<img src="file.png"></div>
I want to print Hello
How can I solve it?
Thanks a lot, and sorry for my English.
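function string_between_two_string($str, $starting_word, $ending_word) {
    // Locate the starting delimiter
    $substring_start = strpos($str, $starting_word);
    if ($substring_start === false) {
        return ''; // starting delimiter not found
    }
    // Move past the starting delimiter
    $substring_start += strlen($starting_word);
    // Length of the text up to the ending delimiter
    $size = strpos($str, $ending_word, $substring_start) - $substring_start;
    return substr($str, $substring_start, $size);
}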
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
$table = string_between_two_string($html, '<div id="mostra">', '<img src="file.png"></div>');
echo $table;
Try using this function to find the text between two delimiting strings.
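The same result can probably be reached without string searching by staying with DOMDocument. This is a minimal sketch, assuming the markup from the question; the inline `$html` string stands in for the cURL response, and `$text` is a name introduced here.

```php
<?php
// Markup from the question; in practice $html would be the cURL response.
$html = '<div id="mostra">Hello<img src="file.png"></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

$div = $dom->getElementById('mostra');
// textContent is a plain string, so it can be echoed safely;
// echoing the DOMDocument object itself raises the conversion error.
$text = $div !== null ? trim($div->textContent) : '';
echo $text, PHP_EOL;
```

Because the img element contains no text, textContent of the div yields only the text node next to it.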

Parsing HTML in PHP: get table onclick attribute value

I want to parse HTML page to get data from table (basically I want to loop through all tr tags).
I have the following questions:
How to skip tr in table head?
How to get onclick attribute value of td tag?
How to count td in each tr?
HTML structure:
<tr>
<td onclick="window.location='home.php?navi=148';">kkkk</td>
<td>demo</td>
<td>kkkk</td>
</tr>
I want to get window.location='home.php?navi=148';
Code that I am using:
$url = $html;
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$html = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
# Iterate over all the <td> tags
foreach($dom->getElementsByTagName('td') as $link) {
# Show the <td> node
print_r($link);
echo "<br />";
}
You are already using the DOM extension, but you missed DOMXPath. It allows you to use XPath expressions to fetch parts of the document. An expression can return a node list or a scalar.
Basic Syntax
$xpath = new DOMXPath($dom);
$result = $xpath->evaluate($expression, $optionalContext);
How to skip tr in table head?
This is possible, but most of the time it is easier to use positive matches (all tr inside the tbody). Keep in mind that a tr can also sit inside a tfoot.
All tr inside tbody: //table/tbody/tr
All tr directly in table: //table/tr
All tr where the parent is not a thead: //table//tr[name(parent::*) != 'thead']
How to get onclick attribute value of td tag?
The attribute is a node, so cast it to a string to get a scalar value:
string(//table/tbody/tr/td/@onclick)
How to count td in each tr
This will require a combination, first fetching the tr, then the count with the tr as context:
foreach ($xpath->evaluate('//table/tbody/tr') as $tr) {
var_dump($xpath->evaluate('count(td)', $tr));
}
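The three expressions above can be combined into one pass over the rows. This is a minimal sketch under assumptions: the thead and tbody wrappers in the sample markup are hypothetical (the question only shows a single tr), and `$rows` is a name introduced here.

```php
<?php
// Hypothetical table built around the row shown in the question.
$html = <<<'HTML'
<table>
<thead><tr><th>A</th><th>B</th><th>C</th></tr></thead>
<tbody>
<tr>
<td onclick="window.location='home.php?navi=148';">kkkk</td>
<td>demo</td>
<td>kkkk</td>
</tr>
</tbody>
</table>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings silently
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$rows = [];
// Positive match: only rows inside tbody, so the header row is skipped.
foreach ($xpath->evaluate('//table/tbody/tr') as $tr) {
    $rows[] = [
        // string() casts the first matching onclick attribute to a string.
        'onclick' => $xpath->evaluate('string(td/@onclick)', $tr),
        // count() returns a float, so cast it for display.
        'cells' => (int) $xpath->evaluate('count(td)', $tr),
    ];
}
print_r($rows);
```

If the real page has no tbody element, the `//table/tr` variant listed above would be the expression to use instead.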
Have you tried reading the nodeValue?
foreach($dom->getElementsByTagName('td') as $link) {
# Show the text inside the <td>
echo $link->nodeValue; //td value inside
echo "<br />";
}
Instead of using PHP, why not use JavaScript to achieve what you want?
The code for doing this is as follows:
var defaultData = []; // one array of cell values per row
var i = 0;
$('#tableId tr').each(function () {
    defaultData[i] = [];
    var j = 0;
    $(this).find('td').each(function () {
        defaultData[i][j] = $(this).html();
        // For long cell markup, store the value of the cell's select element instead
        if (defaultData[i][j].length > 150) {
            defaultData[i][j] = $(this).find('select').val();
        }
        j++;
    });
    i++;
});

XML to associative array and echo out specific values

I'm not sure I am going about this the right way, but I am trying to echo individual elements of data from an array without success. I only need to grab around 10 variables for average fuel consumption from an XML file here: https://www.fueleconomy.gov/ws/rest/ympg/shared/vehicles?make=honda&model=civic
I only need make, model, year, avgMpg (which is a child of yourMpgVehicle), etc., so I can place them in a table in the same way you can echo SQL data in PHP.
function download_page($path){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$path);
curl_setopt($ch, CURLOPT_FAILONERROR,1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
//curl_setopt($ch, CURLOPT_SSLVERSION,3);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
//curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
$retValue = curl_exec($ch);
curl_close($ch);
return $retValue;
}
$sXML = download_page('https://www.fueleconomy.gov/ws/rest/ympg/shared/vehicles?make=honda&model=civic');
$oXML = new SimpleXMLElement($sXML);
$dom = new DomDocument();
$dom->loadXml($sXML);
$dataElements = $dom->getElementsByTagName('vehicle');
$array = array();
foreach ($dataElements as $element) {
$subarray = array();
foreach ($element->childNodes as $node) {
if (!$node instanceof DomElement) {
continue;
}
$key = $node->tagName;
$value = $node->textContent;
$subarray[$key] = $value;
}
$array[] = $subarray;
// var_dump($array); // returns the array as expected
var_dump($array[0]["barrels08"]); //how can I get this and other variables?
}
The structure is like this: (Or you can click on the hyperlink above)
<vehicles>
<vehicle>
<atvType/>
<barrels08>10.283832</barrels08>
<barrelsA08>0.0</barrelsA08>
<charge120>0.0</charge120>
<charge240>0.0</charge240>
<city08>28</city08>
<city08U>28.0743</city08U>
<cityA08>0</cityA08>
<cityA08U>0.0</cityA08U>
<cityCD>0.0</cityCD>
<cityE>0.0</cityE>
<cityUF>0.0</cityUF>
<co2>279</co2>
<co2A>-1</co2A>
<co2TailpipeAGpm>0.0</co2TailpipeAGpm>
<co2TailpipeGpm>279.0</co2TailpipeGpm>
<comb08>32</comb08>
<comb08U>31.9768</comb08U>
<combA08>0</combA08>
<combA08U>0.0</combA08U>
<combE>0.0</combE>
<combinedCD>0.0</combinedCD>
<combinedUF>0.0</combinedUF>
<cylinders>4</cylinders>
<displ>1.8</displ>
<drive>Front-Wheel Drive</drive>
<engId>18</engId>
<eng_dscr/>
<evMotor/>
<feScore>8</feScore>
<fuelCost08>1550</fuelCost08>
<fuelCostA08>0</fuelCostA08>
<fuelType>Regular</fuelType>
<fuelType1/>
<fuelType2/>
<ghgScore>8</ghgScore>
<ghgScoreA>-1</ghgScoreA>
<guzzler/>
<highway08>39</highway08>
<highway08U>38.5216</highway08U>
<highwayA08>0</highwayA08>
<highwayA08U>0.0</highwayA08U>
<highwayCD>0.0</highwayCD>
<highwayE>0.0</highwayE>
<highwayUF>0.0</highwayUF>
<hlv>0</hlv>
<hpv>0</hpv>
<id>33504</id>
<lv2>12</lv2>
<lv4>12</lv4>
<make>Honda</make>
<mfrCode>HNX</mfrCode>
<model>Civic</model>
<mpgData>Y</mpgData>
<phevBlended>false</phevBlended>
<pv2>83</pv2>
<pv4>95</pv4>
<rangeA/>
<rangeCityA>0.0</rangeCityA>
<rangeHwyA>0.0</rangeHwyA>
<trans_dscr/>
<trany>Automatic 5-spd</trany>
<UCity>36.4794</UCity>
<UCityA>0.0</UCityA>
<UHighway>55.5375</UHighway>
<UHighwayA>0.0</UHighwayA>
<VClass>Compact Cars</VClass>
<year>2013</year>
<youSaveSpend>3000</youSaveSpend>
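<yourMpgVehicle>
<avgMpg>33.612226599</avgMpg>
<cityPercent>45</cityPercent>
<highwayPercent>55</highwayPercent>
<maxMpg>47</maxMpg>
<minMpg>28</minMpg>
<recordCount>16</recordCount>
<vehicleId>33504</vehicleId>
</yourMpgVehicle>
</vehicle>
</vehicles>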
You don't actually need to put everything into an array if you just want to display the data. SimpleXML makes it very simple to handle XML data. If I may suggest a maybe less complex solution:
<?php
function getFuelDataAsXml($make, $model)
{
// In most cases CURL is overkill, unless you need something more complex
$data = file_get_contents("https://www.fueleconomy.gov/ws/rest/ympg/shared/vehicles?make={$make}&model={$model}");
// If we got some data, return it as XML, otherwise return null
return $data ? simplexml_load_string($data) : null;
}
// get the data for a specific make and model
$data = getFuelDataAsXml('honda', 'civic');
// iterate over all vehicle-nodes
foreach($data->vehicle as $vehicleData)
{
echo $vehicleData->barrels08 . '<br />';
echo $vehicleData->yourMpgVehicle->avgMpg . '<br />';
echo '<hr />';
}
To fetch data from a DOM, use XPath:
$url = "https://www.fueleconomy.gov/ws/rest/ympg/shared/vehicles?make=honda&model=civic";
$dom = new DomDocument();
$dom->load($url);
$xpath = new DOMXpath($dom);
foreach ($xpath->evaluate('/*/vehicle') as $vehicle) {
var_dump(
array(
$xpath->evaluate('string(fuelType)', $vehicle),
$xpath->evaluate('number(fuelCost08)', $vehicle),
$xpath->evaluate('number(barrels08)', $vehicle)
)
);
}
Most XPath expressions return a list of nodes that can be iterated using foreach. Using number() or string() casts the content of the first node into a float or a string. If the list is empty, string() returns an empty string and number() returns NaN.
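The casting behaviour can be checked without a network call. This is a minimal sketch using a trimmed, hypothetical sample in the shape of the feed from the question; the second vehicle deliberately lacks barrels08 and yourMpgVehicle to show what an empty node list produces.

```php
<?php
// Trimmed, hypothetical sample in the shape of the feed from the question.
$xml = <<<'XML'
<vehicles>
  <vehicle>
    <barrels08>10.283832</barrels08>
    <fuelType>Regular</fuelType>
    <yourMpgVehicle><avgMpg>33.612226599</avgMpg></yourMpgVehicle>
  </vehicle>
  <vehicle>
    <fuelType>Premium</fuelType>
  </vehicle>
</vehicles>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$rows = [];
foreach ($xpath->evaluate('/*/vehicle') as $vehicle) {
    $rows[] = [
        // string() of a missing element is '', number() of it is NAN.
        'fuelType'  => $xpath->evaluate('string(fuelType)', $vehicle),
        'barrels08' => $xpath->evaluate('number(barrels08)', $vehicle),
        'avgMpg'    => $xpath->evaluate('string(yourMpgVehicle/avgMpg)', $vehicle),
    ];
}
var_dump($rows);
```

When a value may be absent, checking the number() result with is_nan() before printing avoids showing NAN in a table.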

Getting the src of an image in a curled html with dom

function getPage($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);
return $result;
}
$page = getPage(trim('http://localhost/test/test.html'));
$dom = new DOMDocument();
$dom->loadHTML($page);
$xp = new DOMXPath($dom);
$result = $xp->query("//img[@class='wallpaper']");
I'm trying to find all images with the class wallpaper, and I'm stuck at this point. I tried var_dump($result), but it gives me a weird object(DOMNodeList)[3]. How do I finally get the src of the image?
$result is a DOMNodeList object.
You can find out how many items it contains with
$count = $result->length;
You access items individually using DOMNodeList::item()
if ($result->length > 0) {
$first = $result->item(0);
$src = $first->getAttribute('src');
}
You can also iterate it like an array, eg
foreach ($result as $img) {
$src = $img->getAttribute('src');
}
In addition to @Phil's answer, you can also grab the src attribute directly in your XPath query instead of grabbing the img element:
$srcs = array();
$result = $xp->query("//img[@class='wallpaper']/@src");
foreach($result as $attr) {
$srcs[] = $attr->value;
}
You can access the images in the DOMNodeList with a foreach loop.
foreach($result as $img) {
echo $img->getAttribute('src');
}
You could get the first with echo $result->item(0)->getAttribute('src'). You may want to confirm the DOMNodeList has items by checking the length property of $result.
Try
echo $result->item(0)->getAttribute('src');

Applying a class to a parent element in the DOM?

<li id="weather" class="widget-container widget-dark-blue">
<h3 class="widget-title">Weather</h3>
<?php include (TEMPLATEPATH . '/widgets/weather.php'); ?>
</li>
weather.php does a curl request to a weather service and returns a table. With the DomDocument Class I read the values inside of the td's. I'm applying a classname of the current weather condition to a div.weather.
<?php
require_once('classes/SmartDOMDocument.class.php');
$url = 'some/domain...';
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, false);
$str = curl_exec($curl);
$dom = new SmartDOMDocument();
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);
$tds = $xpath->query('//div/table/tr/td');
foreach ($tds as $key => $cell) {
if ($key==1) {
$condition = $cell->textContent;
//$cell->parentNode->setAttribute('class', 'hello');
echo "<div class='weather " . strtolower($condition) ."'>";
...
?>
Everything works fine. My one and only question is: is there a PHP way of applying the class name $condition to the list item that holds the information?
So instead of having a div with the $condition class inside my li#weather, I'd like li#weather itself to have the class.
<li id="weather" class="widget-container widget-dark-blue $condition">
Is there any way I can apply the $condition class to the list item that holds everything? I could easily do it with JavaScript/jQuery, but I wonder if there is a server-side solution.
thank you
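Maybe you could try something like:
$parent = $cell->parentNode;
// Walk up until an <li> ancestor is found or the top of the document is reached
while ($parent !== null && $parent->nodeName !== 'li')
{
    $parent = $parent->parentNode;
}
// Only touch the class when an <li> was actually found in the parsed document
if ($parent instanceof DOMElement) {
    $class = $parent->getAttribute('class');
    $parent->setAttribute('class', $class . ' ' . strtolower($condition));
}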
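Since the li lives in the template rather than in the fetched weather markup, another option is to work out the condition before the li is printed. This is only a sketch under assumptions: `weather_condition_class()` is a hypothetical helper, the sample `$str` stands in for the cURL response (whose real markup is not shown in the question), and the plain DOMDocument class is used in place of SmartDOMDocument.

```php
<?php
// Hypothetical helper: extracts the condition from the weather table markup
// and returns it as a lowercase string usable as a CSS class name.
function weather_condition_class(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings silently
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $tds = $xpath->query('//div/table/tr/td');
    // The question reads the condition from the second cell (index 1).
    $cell = $tds->item(1);
    if ($cell === null) {
        return '';
    }
    // Replace anything that is not safe in a class name with a hyphen.
    return preg_replace('/[^a-z0-9_-]+/', '-', strtolower(trim($cell->textContent)));
}

// Stand-in for the cURL response; the real markup is an assumption.
$str = '<div><table><tr><td>Berlin</td><td>Sunny</td></tr></table></div>';
$condition = weather_condition_class($str);

// Build the opening tag with the condition already in place.
$li = sprintf(
    '<li id="weather" class="widget-container widget-dark-blue %s">',
    htmlspecialchars($condition)
);
echo $li, PHP_EOL;
```

The trade-off is that the request has to run before the li is output, so the fetching code would move out of weather.php or be called earlier in the template.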
