How can I get several values on a website using PHP (the value between div tags, value1, value2, value3 in the example below)?
I have been looking into DOMDocument, but I am getting confused.
Also, will it be possible to get the values without loading the website 3 times?
Example.
I need to get 3 values (or more) from a website:
<div class="SomeUniqueClassName">value1</div>
<div class="AnotherUniqueClassName">value2</div>
<div class="UniqueClassName">value3</div>
This is what I have now, but it looks stupid and I'm not 100% sure what I'm doing:
$doc = new DOMDocument;
$doc->loadHTMLFile($url);
$xpath = new DOMXPath($doc);
$query1 = "//div[@class='SomeUniqueClassName']";
$query2 = "//div[@class='AnotherUniqueClassName']";
$query3 = "//div[@class='UniqueClassName']";
$entry1 = $xpath->query($query1);
$value1 = $entry1->item(0)->textContent;
var_dump($value1);
$entry2 = $xpath->query($query2);
$value2 = $entry2->item(0)->textContent;
var_dump($value2);
$entry3 = $xpath->query($query3);
$value3 = $entry3->item(0)->textContent;
var_dump($value3);
You could use cURL for this:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL,'http://theurlhere.com');
// Optional for HTTPS targets: the next two lines disable certificate checks (insecure, use for testing only)
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
$parse = curl_exec($curl);
curl_close($curl);
// Capture the class name and the (non-greedy) text of every matching div
preg_match_all('/<div class="(\w*UniqueClassName)">(.*?)<\/div>/', $parse, $value);
print_r($value);
With XPath you could use the contains() function to match any class that contains the common part of the name, if the real page follows your example:
$dom = new DOMDocument;
$dom->loadHTMLFile( $url );
$xp = new DOMXPath( $dom );
$query="//div[ contains( @class,'UniqueClass' ) ]";
$col=$xp->query( $query );
if( $col && $col->length > 0 ){
foreach( $col as $node ){
echo $node->nodeValue; // each $node is already a single DOMNode
}
}
Or modify the XPath expression to search for multiple conditions, like:
$query="//div[@class='UniqueClass1'] | //div[@class='UniqueClass2'] | //div[@class='UniqueClass3']";
$col=$xp->query( $query );
if( $col && $col->length > 0 ){
foreach( $col as $node ){
echo $node->nodeValue; // each $node is already a single DOMNode
}
}
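On the question of loading the page only once: a single DOMDocument can serve any number of XPath queries, so one fetch is enough. Below is a minimal sketch, assuming the three class names from the question. The inline $html string stands in for the downloaded page, and the helper name extractByClass is made up for illustration.

```php
<?php
// Inline sample standing in for the downloaded page (assumption for illustration).
$html = '<html><body>'
      . '<div class="SomeUniqueClassName">value1</div>'
      . '<div class="AnotherUniqueClassName">value2</div>'
      . '<div class="UniqueClassName">value3</div>'
      . '</body></html>';

// Hypothetical helper: parse the markup once, then run one XPath query per class name.
function extractByClass(string $html, array $classNames): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate imperfect real-world HTML
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath  = new DOMXPath($doc);
    $values = [];
    foreach ($classNames as $name) {
        $nodes = $xpath->query("//div[@class='$name']");
        // Store null when the class is not present instead of failing.
        $values[$name] = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
    }
    return $values;
}

$values = extractByClass($html, ['SomeUniqueClassName', 'AnotherUniqueClassName', 'UniqueClassName']);
print_r($values);
```

For a real page, $html would come from a single file_get_contents() or curl_exec() call; everything after that works on the parsed copy.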
I'm a newbie trying to code a crawler to gather some stats from a forum.
Here is my code:
<?php
$ch = curl_init();
$timeout = 0; // set to zero for no timeout
curl_setopt ($ch, CURLOPT_URL, 'http://m.jeuxvideo.com/forums/42-51-61913988-1-0-1-0-je-code-un-bot-pour-le-forom-je-vous-le-montre-en-action.htm');
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$file_contents = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($file_contents);
$xpath = new DOMXPath($dom);
$posts = $xpath->query("//div[@class='who-post']/a");//$elements = $xpath->query("/html/body/div[@id='yourTagIdHere']");
$dates = $xpath->query("//div[@class='date-post']");//$elements = $xpath->query("/html/body/div[@id='yourTagIdHere']");
$contents = $xpath->query("//div[@class='message text-enrichi-fmobile text-crop-fmobile']/p");//$elements = $xpath->query("/html/body/div[@id='yourTagIdHere']");
$i = 0;
foreach ($posts as $post) {
$nodes = $post->childNodes;
foreach ($nodes as $node) {
$value = trim($node->nodeValue);
$tab[$i]['author'] = $value;
$i++;
}
}
$i = 0;
foreach ($dates as $date) {
$nodes = $date->childNodes;
foreach ($nodes as $node) {
$value = trim($node->nodeValue);
$tab[$i]['date'] = $value;
$i++;
}
}
$i = 0;
foreach ($contents as $content) {
$nodes = $content->childNodes;
foreach ($nodes as $node) {
$value = $node->nodeValue;
echo $value;
$tab[$i]['content'] = trim($value);
$i++;
}
}
?>
<h1>Participants</h1>
<pre>
<?php
print_r($tab);
?>
</pre>
As you can see, the code does not retrieve some of the content. For example, I'm trying to retrieve the content from: http://m.jeuxvideo.com/forums/42-51-61913988-1-0-1-0-je-code-un-bot-pour-le-forom-je-vous-le-montre-en-action.htm
The second post is a picture, and my code does not work for it.
Also, I suspect I made some mistakes; I find my code ugly.
Can you help me, please?
You could simply select the posts first, then grab each subdata separately using:
DOMXPath::evaluate combined with normalize-space to retrieve pure text,
DOMXPath::query combined with DOMDocument::saveHTML to retrieve message paragraphs.
Code:
$xpath = new DOMXPath($dom);
$postsElements = $xpath->query('//*[@class="post"]');
$posts = [];
foreach ($postsElements as $postElement) {
$author = $xpath->evaluate('normalize-space(.//*[@class="who-post"])', $postElement);
$date = $xpath->evaluate('normalize-space(.//*[@class="date-post"])', $postElement);
$message = '';
foreach ($xpath->query('.//*[contains(@class, "message")]/p', $postElement) as $messageParagraphElement) {
$message .= $dom->saveHTML($messageParagraphElement);
}
$posts[] = (object)compact('author', 'date', 'message');
}
print_r($posts);
Unrelated note: scraping a website's HTML is not illegal in itself, but you should refrain from displaying their data on your own app/website without their consent. Also, this might break just about anytime if they decide to alter their HTML structure/CSS class names.
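One caveat with class-based expressions like the ones above: @class="post" only matches when the attribute is exactly that string, while contains(@class, "message") also matches names such as "message-footer". A common XPath 1.0 idiom tests for a whole class token instead. Here is a small sketch on made-up markup; the class names and the helper hasClass are assumptions for illustration, not taken from the forum page.

```php
<?php
// Hypothetical helper: builds a predicate that matches one whole class token.
function hasClass(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

// Made-up markup: only the first div carries the class token "message".
$html = '<div class="message text-crop"><p>Hello</p></div>'
      . '<div class="message-footer"><p>Signature</p></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Substring match: picks up both divs.
$loose = $xpath->query('//div[contains(@class, "message")]/p');
// Token match: picks up only the div whose class list includes "message".
$exact = $xpath->query('//div[' . hasClass('message') . ']/p');

echo $loose->length, "\n";
echo $exact->length, "\n";
echo $exact->item(0)->textContent, "\n";
```

The token form is longer, but it tends to survive extra classes being added to the element, which the exact-string comparison does not.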
I am trying to scrape this url https://nrg91.gr/nrg-airplay-chart/ using simple-html-dom, but it does not seem to get the full html source code. This code:
include_once('simple_html_dom.php');
$html = file_get_html('https://nrg91.gr/nrg-airplay-chart');
echo $html->plaintext;
displays the content up to the h1, just before the content I am after. And from the simple-html-dom manual examples, this should display all links from that url:
foreach($html->find('a') as $e)
echo $e->href . '<br>';
but it only displays the links up to the main navigation menu, not from the main body or footer.
I also tried using prerender.com to fully render the URL before passing it to file_get_html, but the result was the same. What am I doing wrong?
That library looks like it hasn't been updated in 7 years. I'd always recommend using PHP's built-in functions:
$url = "https://nrg91.gr/nrg-airplay-chart/";
$dom = new DomDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile($url); // load() expects well-formed XML; loadHTMLFile() uses the HTML parser
foreach($dom->getElementsByTagName("a") as $e) {
echo $e->getAttribute("href") . "\n";
}
Here's my super dirty approach to fetching the rank/artist/title/youtube data using both DOMDocument and SimpleXML.
The concept is to locate each "row" of data via the XPath //ul[@id="chart_ul"]/li, then use dom_import_simplexml( $set )->getNodePath() to build a new XPath that selects the individual elements where the desired data is located.
$temp = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'nrg-airplay-chart.html';
if( file_exists( $temp ) === false or filemtime( $temp ) < time() - 3600 )
{
file_put_contents( $temp, $html = file_get_contents('https://nrg91.gr/nrg-airplay-chart/') );
}
else
{
$html = file_get_contents( $temp );
}
$dom = new DOMDocument();
$dom->loadHTML( $html );
$xml = simplexml_import_dom( $dom );
$array = array();
foreach( $xml->xpath('//ul[@id="chart_ul"]/li') as $index => $set )
{
$basexpath = dom_import_simplexml( $set )->getNodePath();
$array[] = array(
'ranking' => (string) $xml->xpath( $basexpath . '//span[@id="ranking"]' )[0],
'artist' => (string) $xml->xpath( $basexpath . '//p[@id="artist"]/b' )[0],
'title' => (string) $xml->xpath( $basexpath . '//p[@id="title"]' )[0],
'youtube' => (string) $xml->xpath( $basexpath . '//div[@id="media"]/a/@href' )[0],
);
}
print_r( $array );
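The getNodePath() round trip can probably be avoided: SimpleXMLElement::xpath() evaluates relative expressions against the element it is called on, so each row can be queried directly. The sketch below runs on invented markup that mimics the structure described above; the sample rows, names, and URLs are assumptions, not the real page.

```php
<?php
// Invented sample rows; the real chart page is not reproduced here.
$html = '<ul id="chart_ul">'
      . '<li><span id="ranking">1</span><p id="artist"><b>Artist A</b></p>'
      . '<p id="title">Song A</p><div id="media"><a href="https://example.com/a">video</a></div></li>'
      . '<li><span id="ranking">2</span><p id="artist"><b>Artist B</b></p>'
      . '<p id="title">Song B</p><div id="media"><a href="https://example.com/b">video</a></div></li>'
      . '</ul>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // repeated ids would otherwise raise warnings
$dom->loadHTML($html);
libxml_clear_errors();
$xml = simplexml_import_dom($dom);

$rows = [];
foreach ($xml->xpath('//ul[@id="chart_ul"]/li') as $li) {
    // The leading "." keeps each query inside the current <li>.
    $rows[] = [
        'ranking' => (string) $li->xpath('.//span[@id="ranking"]')[0],
        'artist'  => (string) $li->xpath('.//p[@id="artist"]/b')[0],
        'title'   => (string) $li->xpath('.//p[@id="title"]')[0],
        'youtube' => (string) $li->xpath('.//div[@id="media"]/a/@href')[0],
    ];
}
print_r($rows);
```

On a live page each [0] lookup should be guarded, since a row without one of these elements would yield an empty array.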
Another approach you might want to try:
<?php
function get_content($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$htmlContent = curl_exec($ch); // fetch once and keep the response body
curl_close($ch);
return $htmlContent;
}
$link = "https://nrg91.gr/nrg-airplay-chart/";
$xml = get_content($link);
$dom = new DOMDocument();
@$dom->loadHTML($xml); // suppress warnings caused by imperfect HTML
$xpath = new DOMXPath($dom);
foreach($xpath->query('//li[contains(@id,"wprs_chart-")]') as $items){
$artist = $xpath->query('.//p[@id="artist"]/b',$items)->item(0)->nodeValue;
$title = $xpath->query('.//p[@id="title"]',$items)->item(0)->nodeValue;
echo "{$artist} -- {$title}<br>";
}
?>
You should get output like:
PORTOGAL THE MAN -- Feel It Still
JAX JONEW Feat INA WROLDSEN -- Breathe
CAMILA CABELLO -- Havana
CARBI B, J BALVIN & BAD BUNNY -- I Like It
ZAYN Feat SIA -- Dusk Till Dawn
$newstring = substr_replace("http://ws.spotify.com/search/1/track?q=", $_COOKIE["word"], 39, 0);
/*$curl = curl_init($newstring);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($curl);*/
//echo $result;
$xml = simplexml_load_file($newstring);
//print_r($xml);
$xpath = new DOMXPath($xml);
$value = $xpath->query("//track/#href");
foreach ($value as $e) {
echo $e->nodevalue;
}
This is my code. I am using spotify to supply me with an xml document. I am then trying to get the href link from all of the track tags so I can use it. Right now the print_r($xml) I have commented out works, but if I try to query and print that out it returns nothing. The exact link I am trying to get my xml from is: http://ws.spotify.com/search/1/track?q=incredible
This may not be the answer you need, because I dropped DOMXPath and used getElementsByTagName() instead.
$url = "http://ws.spotify.com/search/1/track?q=incredible";
$xml = file_get_contents( $url );
$domDocument = new DOMDocument();
$domDocument->loadXML( $xml );
$value = $domDocument->getElementsByTagName( "track" );
foreach ( $value as $e ) {
echo $e->getAttribute( "href" )."<br>";
}
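For completeness, the original attempt most likely fails for two reasons: DOMXPath expects a DOMDocument rather than the SimpleXMLElement returned by simplexml_load_file(), and PHP property names are case-sensitive, so nodevalue is not nodeValue. If the feed also declares a default XML namespace, //track will match nothing until a prefix is registered. The sketch below keeps XPath and runs against an invented sample, because the real response is not shown in the question; the namespace URI and the child element names are placeholders.

```php
<?php
// Invented sample; the namespace URI is a placeholder, not the real feed's.
$xml = '<tracks xmlns="http://example.com/ns">'
     . '<track href="spotify:track:one"><name>First</name></track>'
     . '<track href="spotify:track:two"><name>Second</name></track>'
     . '</tracks>';

$dom = new DOMDocument();
$dom->loadXML($xml);

$xpath = new DOMXPath($dom);
// Elements in a default namespace only match when queried through a registered prefix.
$xpath->registerNamespace('s', 'http://example.com/ns');

$hrefs = [];
foreach ($xpath->query('//s:track/@href') as $attr) {
    $hrefs[] = $attr->nodeValue; // note the capital V
}
print_r($hrefs);
```

getElementsByTagName() sidesteps the namespace issue, which may be why the version above works without any prefix.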
I'm not sure I am going about this the right way. I am trying to echo individual elements of data from an array, but without success. I only need to grab around 10 values, such as average fuel consumption, from the XML file here: https://www.fueleconomy.gov/ws/rest/ympg/shared/vehicles?make=honda&model=civic
I only need make, model, year, and avgMpg (which is a child of yourMpgVehicle), etc., so I can place them in a table in the same way as you can echo SQL data in PHP.
function download_page($path){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$path);
curl_setopt($ch, CURLOPT_FAILONERROR,1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
//curl_setopt($ch, CURLOPT_SSLVERSION,3);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
//curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
$retValue = curl_exec($ch);
curl_close($ch);
return $retValue;
}
$sXML = download_page('https://www.fueleconomy.gov/ws/rest/ympg/shared/vehicles?make=honda&model=civic');
$oXML = new SimpleXMLElement($sXML);
$dom = new DomDocument();
$dom->loadXml($sXML);
$dataElements = $dom->getElementsByTagName('vehicle');
$array = array();
foreach ($dataElements as $element) {
$subarray = array();
foreach ($element->childNodes as $node) {
if (!$node instanceof DomElement) {
continue;
}
$key = $node->tagName;
$value = $node->textContent;
$subarray[$key] = $value;
}
$array[] = $subarray;
// var_dump($array); // returns the array as expected
var_dump($array[0]["barrels08"]); //how can I get this and other variables?
}
The structure is like this: (Or you can click on the hyperlink above)
-<vehicles>
-<vehicle>
<atvType/>
<barrels08>10.283832</barrels08>
<barrelsA08>0.0</barrelsA08>
<charge120>0.0</charge120>
<charge240>0.0</charge240>
<city08>28</city08>
<city08U>28.0743</city08U>
<cityA08>0</cityA08>
<cityA08U>0.0</cityA08U>
<cityCD>0.0</cityCD>
<cityE>0.0</cityE>
<cityUF>0.0</cityUF>
<co2>279</co2>
<co2A>-1</co2A>
<co2TailpipeAGpm>0.0</co2TailpipeAGpm>
<co2TailpipeGpm>279.0</co2TailpipeGpm>
<comb08>32</comb08>
<comb08U>31.9768</comb08U>
<combA08>0</combA08>
<combA08U>0.0</combA08U>
<combE>0.0</combE>
<combinedCD>0.0</combinedCD>
<combinedUF>0.0</combinedUF>
<cylinders>4</cylinders>
<displ>1.8</displ>
<drive>Front-Wheel Drive</drive>
<engId>18</engId>
<eng_dscr/>
<evMotor/>
<feScore>8</feScore>
<fuelCost08>1550</fuelCost08>
<fuelCostA08>0</fuelCostA08>
<fuelType>Regular</fuelType>
<fuelType1/>
<fuelType2/>
<ghgScore>8</ghgScore>
<ghgScoreA>-1</ghgScoreA>
<guzzler/>
<highway08>39</highway08>
<highway08U>38.5216</highway08U>
<highwayA08>0</highwayA08>
<highwayA08U>0.0</highwayA08U>
<highwayCD>0.0</highwayCD>
<highwayE>0.0</highwayE>
<highwayUF>0.0</highwayUF>
<hlv>0</hlv>
<hpv>0</hpv>
<id>33504</id>
<lv2>12</lv2>
<lv4>12</lv4>
<make>Honda</make>
<mfrCode>HNX</mfrCode>
<model>Civic</model>
<mpgData>Y</mpgData>
<phevBlended>false</phevBlended>
<pv2>83</pv2>
<pv4>95</pv4>
<rangeA/>
<rangeCityA>0.0</rangeCityA>
<rangeHwyA>0.0</rangeHwyA>
<trans_dscr/>
<trany>Automatic 5-spd</trany>
<UCity>36.4794</UCity>
<UCityA>0.0</UCityA>
<UHighway>55.5375</UHighway>
<UHighwayA>0.0</UHighwayA>
<VClass>Compact Cars</VClass>
<year>2013</year>
<youSaveSpend>3000</youSaveSpend>
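-<yourMpgVehicle>
<avgMpg>33.612226599</avgMpg>
<cityPercent>45</cityPercent>
<highwayPercent>55</highwayPercent>
<maxMpg>47</maxMpg>
<minMpg>28</minMpg>
<recordCount>16</recordCount>
<vehicleId>33504</vehicleId>
</yourMpgVehicle>
</vehicle>
</vehicles>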
You don't actually need to put everything into an array if you just want to display the data. SimpleXML makes XML data easy to handle. If I may suggest a less complex solution:
<?php
function getFuelDataAsXml($make, $model)
{
// In most cases CURL is overkill, unless you need something more complex
$data = file_get_contents("https://www.fueleconomy.gov/ws/rest/ympg/shared/vehicles?make={$make}&model={$model}");
// If we got some data, return it as XML, otherwise return null
return $data ? simplexml_load_string($data) : null;
}
// get the data for a specific make and model
$data = getFuelDataAsXml('honda', 'civic');
// iterate over all vehicle-nodes
foreach($data->vehicle as $vehicleData)
{
echo $vehicleData->barrels08 . '<br />';
echo $vehicleData->yourMpgVehicle->avgMpg . '<br />';
echo '<hr />';
}
To fetch data from a DOM, use XPath:
$url = "https://www.fueleconomy.gov/ws/rest/ympg/shared/vehicles?make=honda&model=civic";
$dom = new DomDocument();
$dom->load($url);
$xpath = new DOMXpath($dom);
foreach ($xpath->evaluate('/*/vehicle') as $vehicle) {
var_dump(
array(
$xpath->evaluate('string(fuelType)', $vehicle),
$xpath->evaluate('number(fuelCost08)', $vehicle),
$xpath->evaluate('number(barrels08)', $vehicle)
)
);
}
Most XPath expressions return a list of nodes that can be iterated with foreach. Wrapping an expression in number() or string() casts the content of the first matching node to a float or a string. If the list is empty, string() returns an empty string and number() returns NaN.
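The casting behaviour can be checked offline. This is a small sketch on an invented one-vehicle document; the element names follow the feed in the question, and noSuchTag is a deliberately missing element.

```php
<?php
// Invented one-vehicle sample using element names from the question.
$xml = '<vehicles>'
     . '<vehicle><make>Honda</make><barrels08>10.283832</barrels08><atvType/></vehicle>'
     . '</vehicles>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// Without a cast, evaluate() returns a node list; take the first vehicle as context.
$vehicle = $xpath->evaluate('/*/vehicle')->item(0);

$make    = $xpath->evaluate('string(make)', $vehicle);       // string content of <make>
$barrels = $xpath->evaluate('number(barrels08)', $vehicle);  // float
$missing = $xpath->evaluate('string(noSuchTag)', $vehicle);  // empty string, no error
$nan     = $xpath->evaluate('number(noSuchTag)', $vehicle);  // NaN, so test with is_nan()

var_dump($make, $barrels, $missing, is_nan($nan));
```

Because a missing element never raises an error here, values that must exist are worth checking explicitly before they go into a table.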
I'm looking for a good way to do this: my current method seems to not allow depths of searches beyond 30-40, even after editing the php.ini settings in hopes to increase default execution time as well as max memory usage. Basically, as soon as the depth of search exceeds this amount, the server crashes.
Here is my code (the body of private function _ParseHtml($html, $depth = nDepth)):
if ($depth === 0)
{
return;
}
@$this->_dom->loadHTML($html);
$this->nodes = $this->_dom->childNodes;
$html = array();
$iterCount = 0;
foreach($this->nodes as $node)
{
if($node->hasChildNodes())
{
$html[$iterCount++] = $node->C14N();
}
$this->_tagCount++;
if ( $this->_config['Debug'] ) _wrapBreak("Tag Count incremented");
}
if( count( $html ) > 0 )
{
$static_depth = $depth - 1;
foreach( $html as $parse )
{
$this->_ParseHtml( $parse, $static_depth );
if ( $this->_config['Debug'] ) _wrapBreak("ParseHtml did return");
}
}
_wrapBreak("<strong>Current Depth</strong> => <strong>{$depth}</strong>");
And here is the main code of the scraping function, _Invoke():
$handle = curl_init($this->_url);
curl_setopt($handle, CURLOPT_BUFFERSIZE, self::BUFSIZE); //BUFSIZE == 50000
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
$this->_data['html'] = curl_exec($handle);
curl_close($handle);
$this->_ParseHtml($this->_data['html']);
The number of HTML tags should be easy to obtain, though:
$this->_dom->getElementsByTagName("*")->length;
As found here: Count all HTML tags in page PHP
$dom = new DOMDocument;
$dom->loadHTML($HTML);
$allElements = $dom->getElementsByTagName('*');
echo $allElements->length;
Although the example in the link does not get even close to the number of nested levels that you have...
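If the nesting depth is needed as well as the tag count, the already parsed tree can be walked directly instead of serializing each subtree with C14N() and parsing it again. The sketch below uses an explicit stack, so PHP's call stack is not involved at all. The function name measureDom and the sample markup are assumptions for illustration.

```php
<?php
// Hypothetical helper: returns [element count, maximum element nesting depth].
function measureDom(DOMDocument $dom): array
{
    if ($dom->documentElement === null) {
        return [0, 0];
    }

    $count    = 0;
    $maxDepth = 0;
    // Each stack entry pairs an element with its depth; the root is depth 1.
    $stack = [[$dom->documentElement, 1]];

    while ($stack) {
        [$node, $depth] = array_pop($stack);
        $count++;
        $maxDepth = max($maxDepth, $depth);

        foreach ($node->childNodes as $child) {
            // Skip text nodes, comments, and other non-element children.
            if ($child instanceof DOMElement) {
                $stack[] = [$child, $depth + 1];
            }
        }
    }
    return [$count, $maxDepth];
}

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML('<html><body><div><p>a <b>b</b></p></div><p>c</p></body></html>');
libxml_clear_errors();

[$count, $depth] = measureDom($dom);
echo "elements: $count, max depth: $depth\n";
```

The document is parsed once and every element is visited once, so memory use should grow with the size of the page rather than with the depth limit.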