Getting the src of an image in a curled html with dom - php

function getPage($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);
return $result;
}
$page = getPage(trim('http://localhost/test/test.html'));
$dom = new DOMDocument();
$dom->loadHTML($page);
$xp = new DOMXPath($dom);
$result = $xp->query("//img[@class='wallpaper']");
I'm trying to find all images with the class wallpaper, and I'm stuck at this point. I tried var_dump($result), but it only prints object(DOMNodeList)[3]. How do I get the src of each image?

$result is a DOMNodeList object.
You can find out how many items it contains with
$count = $result->length;
You access items individually using DOMNodeList::item()
if ($result->length > 0) {
$first = $result->item(0);
$src = $first->getAttribute('src');
}
You can also iterate it like an array, eg
foreach ($result as $img) {
$src = $img->getAttribute('src');
}

In addition to @Phil's answer, you can also grab the src attribute directly in your XPath query instead of grabbing the img element:
$srcs = array();
$result = $xp->query("//img[@class='wallpaper']/@src");
foreach($result as $attr) {
$srcs[] = $attr->value;
}
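A self-contained sketch of the attribute query, run against an inline HTML string instead of a cURL response (the markup and file names are made up for illustration). It also shows a token-based class match, on the assumption that a real page may put several classes on one element, in which case @class='wallpaper' alone would not match:

```php
<?php
// Hypothetical markup standing in for the fetched page.
$page = <<<HTML
<html><body>
<img class="wallpaper" src="a.jpg">
<img class="wallpaper large" src="b.jpg">
<img class="thumb" src="c.jpg">
</body></html>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($page);
$xp = new DOMXPath($dom);

// Exact match: only elements whose class attribute is exactly "wallpaper".
$exact = [];
foreach ($xp->query("//img[@class='wallpaper']/@src") as $attr) {
    $exact[] = $attr->value;
}

// Token match: elements that list "wallpaper" among several classes.
$tokenQuery = "//img[contains(concat(' ', normalize-space(@class), ' '), ' wallpaper ')]/@src";
$byToken = [];
foreach ($xp->query($tokenQuery) as $attr) {
    $byToken[] = $attr->value;
}

print_r($exact);
print_r($byToken);
```

The exact query should pick up only a.jpg, while the token query should also pick up b.jpg.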

You can access the images in the DOMNodeList with a foreach loop.
foreach($result as $img) {
echo $img->getAttribute('src');
}
You could get the first with echo $result->item(0)->getAttribute('src'). You may want to confirm the DOMNodeList has items by checking the length property of $result.

Try
echo $result->item(0)->getAttribute('src');

Related

Add space between textContent data scraped from website using PHP DOM

I am trying to add a comma and a space between items I am scraping from a website. The data scrapes successfully, but the items run together, and the comma and space I am trying to add only appear after the last item. Here is the code I currently have:
$html = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$finder = new DomXPath($dom);
$class_ops = 'ipc-inline-list ';
$class_opp = 'ipc-inline ';
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']");
foreach ($node as $index => $t) {
if ($index == 3) {
$la = $t->textContent.", ";
}
}
echo $la;
Current Result
DoyleBrainDavid,
Expected Result
Doyle, Brain, David
I am using this code
$c1 = curl_init('https://stackoverflow.com/');
curl_setopt($c1, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c1);
if (curl_error($c1))
die(curl_error($c1));
// Get the status code
$status = curl_getinfo($c1, CURLINFO_HTTP_CODE);
curl_close($c1);
preg_match_all('/<span(.*?)<\/span>/s', $html, $matches1);
foreach($matches1[0] as $k=>$v){
$enc = mb_detect_encoding($v);
$v = mb_convert_encoding($v,$enc, "UTF-8");
$match1[$k] = strip_tags ($v);
//$match1[$k] = preg_replace('/^[^A-Za-z0-9]+/', '', $match1[$k]);
}
var_dump($match1);
In your case you can adapt the pattern like this:
preg_match_all('/<div class="ipc-inline-list">(.*?)<\/div>/s', $html, $matches1);
This returns an array of matches.
I hope this is helpful.
You want each li, not the ul as one block. Try:
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']/li");
Demo: https://3v4l.org/Mvfud
If that doesn't work the actual HTML content should be added to the question.
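As a rough sketch of the li approach, the items can be collected into an array and joined with implode(). The markup below is invented to match the expected result in the question, and the query uses normalize-space(@class) so the trailing space in the class values does not matter; the real page's structure may differ:

```php
<?php
// Hypothetical markup; the real page may be structured differently.
$html = '<div class="ipc-inline-list "><ul class="ipc-inline ">'
      . '<li>Doyle</li><li>Brain</li><li>David</li></ul></div>';

libxml_use_internal_errors(true); // collect parser warnings instead of suppressing with @
$dom = new DOMDocument();
$dom->loadHTML($html);
$finder = new DOMXPath($dom);

// Select each li separately so every name arrives as its own node.
$query = "//div[normalize-space(@class)='ipc-inline-list']"
       . "//ul[normalize-space(@class)='ipc-inline']/li";

$names = [];
foreach ($finder->query($query) as $li) {
    $names[] = trim($li->textContent);
}

// implode() puts the separator between items only, never after the last one.
$la = implode(', ', $names);
echo $la;
```

Building the list first and joining it afterwards avoids the problem in the question, where the variable was overwritten on each pass and the separator was appended to the whole block of text.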

XPath do not retrieve some content

I'm a newbie trying to code a crawler to gather some stats from a forum.
Here is my code :
<?php
$ch = curl_init();
$timeout = 0; // set to zero for no timeout
curl_setopt ($ch, CURLOPT_URL, 'http://m.jeuxvideo.com/forums/42-51-61913988-1-0-1-0-je-code-un-bot-pour-le-forom-je-vous-le-montre-en-action.htm');
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$file_contents = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($file_contents);
$xpath = new DOMXPath($dom);
$posts = $xpath->query("//div[@class='who-post']/a");//$elements = $xpath->query("/html/body/div[@id='yourTagIdHere']");
$dates = $xpath->query("//div[@class='date-post']");//$elements = $xpath->query("/html/body/div[@id='yourTagIdHere']");
$contents = $xpath->query("//div[@class='message text-enrichi-fmobile text-crop-fmobile']/p");//$elements = $xpath->query("/html/body/div[@id='yourTagIdHere']");
$i = 0;
foreach ($posts as $post) {
$nodes = $post->childNodes;
foreach ($nodes as $node) {
$value = trim($node->nodeValue);
$tab[$i]['author'] = $value;
$i++;
}
}
$i = 0;
foreach ($dates as $date) {
$nodes = $date->childNodes;
foreach ($nodes as $node) {
$value = trim($node->nodeValue);
$tab[$i]['date'] = $value;
$i++;
}
}
$i = 0;
foreach ($contents as $content) {
$nodes = $content->childNodes;
foreach ($nodes as $node) {
$value = $node->nodeValue;
echo $value;
$tab[$i]['content'] = trim($value);
$i++;
}
}
?>
<h1>Participants</h1>
<pre>
<?php
print_r($tab);
?>
</pre>
As you can see, the code does not retrieve some of the content. For example, I'm trying to retrieve the content from: http://m.jeuxvideo.com/forums/42-51-61913988-1-0-1-0-je-code-un-bot-pour-le-forom-je-vous-le-montre-en-action.htm
The second post is a picture, and my code does not work on it.
Also, I suspect I made some mistakes; I find my code ugly.
Can you help me, please?
You could simply select the posts first, then grab each piece of data separately using:
DOMXPath::evaluate combined with normalize-space to retrieve plain text,
DOMXPath::query combined with DOMDocument::saveHTML to retrieve message paragraphs.
Code:
$xpath = new DOMXPath($dom);
$postsElements = $xpath->query('//*[@class="post"]');
$posts = [];
foreach ($postsElements as $postElement) {
$author = $xpath->evaluate('normalize-space(.//*[@class="who-post"])', $postElement);
$date = $xpath->evaluate('normalize-space(.//*[@class="date-post"])', $postElement);
$message = '';
foreach ($xpath->query('.//*[contains(@class, "message")]/p', $postElement) as $messageParagraphElement) {
$message .= $dom->saveHTML($messageParagraphElement);
}
$posts[] = (object)compact('author', 'date', 'message');
}
print_r($posts);
Unrelated note: scraping a website's HTML is not illegal in itself, but you should refrain from displaying their data on your own app/website without their consent. Also, this might break just about anytime if they decide to alter their HTML structure/CSS class names.

PHP: XPath query returns nothing from large XML

$newstring = substr_replace("http://ws.spotify.com/search/1/track?q=", $_COOKIE["word"], 39, 0);
/*$curl = curl_init($newstring);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($curl);*/
//echo $result;
$xml = simplexml_load_file($newstring);
//print_r($xml);
$xpath = new DOMXPath($xml);
$value = $xpath->query("//track/@href");
foreach ($value as $e) {
echo $e->nodevalue;
}
This is my code. I am using spotify to supply me with an xml document. I am then trying to get the href link from all of the track tags so I can use it. Right now the print_r($xml) I have commented out works, but if I try to query and print that out it returns nothing. The exact link I am trying to get my xml from is: http://ws.spotify.com/search/1/track?q=incredible
This may not be the answer you need, because I dropped DOMXPath and used getElementsByTagName() instead.
$url = "http://ws.spotify.com/search/1/track?q=incredible";
$xml = file_get_contents( $url );
$domDocument = new DOMDocument();
$domDocument->loadXML( $xml );
$value = $domDocument->getElementsByTagName( "track" );
foreach ( $value as $e ) {
echo $e->getAttribute( "href" )."<br>";
}
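If you want to keep XPath, note that DOMXPath's constructor expects a DOMDocument, while simplexml_load_file() returns a SimpleXMLElement, so the question's code cannot work as written. The sketch below uses a made-up inline document; the namespace URI and the m prefix are assumptions for illustration, included because a feed that declares a default namespace would otherwise make //track match nothing:

```php
<?php
// Hypothetical sample; the namespace URI is an assumption, not the real feed's.
$xml = <<<XML
<tracks xmlns="http://example.com/ns/music">
  <track href="spotify:track:1"><name>One</name></track>
  <track href="spotify:track:2"><name>Two</name></track>
</tracks>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

// DOMXPath must be given a DOMDocument, not a SimpleXMLElement.
$xpath = new DOMXPath($doc);

// Elements in a default namespace are matched only through a registered prefix.
$xpath->registerNamespace('m', 'http://example.com/ns/music');

$hrefs = [];
foreach ($xpath->query('//m:track/@href') as $attr) {
    $hrefs[] = $attr->nodeValue; // property names are case-sensitive: nodeValue
}
print_r($hrefs);
```

If the document turns out to have no namespace, the registerNamespace() call can be dropped and the query becomes //track/@href.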

Extract content of specific div preserving only certain elements

I need to extract only textual part of the webpage preserving all and only the <p> <h2>, <h3>, <h4> and <blockquote>s.
Now, using DOMXPath and $div = $xpath->query('//div[@class="story-inner"]'); gives lots of unwanted page elements like pictures, ad blocks, other custom markup, etc. inside the text div.
On the other hand using the following code:
$items = $doc->getElementsByTagName('<p>');
for ($i = 0; $i < $items->length; $i++) {
echo $items->item($i)->nodeValue . "<p>";
}
gives a very nice and clean result, very close to what I wanted, but with <h2>, <h3>, <h4> and <blockquote> missing.
I wonder whether there is a DOM way of (1) specifying only the desired page elements and extracting a clean result, or (2) efficiently cleaning up the output obtained by using $div = $xpath->query('//div[@class="story-inner"]');?
You could use an or inside your XPath query in this case. Chain the tag names with it to select only the desired elements.
$url = "http://www.example.com/russian/international/2015/02/150218_ukraine_debaltseve_fighting";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$html = curl_exec($curl);
curl_close($curl);
$doc = new DOMDocument();
$html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$tags = array('p', 'h2');
$children_needed = implode(' or ', array_map(function($tag){ return sprintf('name()="%s"', $tag); }, $tags));
$query = "//div[@class='story-body__inner']//*[$children_needed]";
$div_children = $xpath->query($query);
if($div_children->length > 0) {
foreach($div_children as $child) {
echo $doc->saveHTML($child);
}
}
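A variant of the same idea, sketched against inline markup so it runs without a network request. It builds the predicate with the self:: axis instead of name() and lists all five tags from the question; the HTML and the class name are placeholders, not the real page:

```php
<?php
// Hypothetical article markup with an image and an ad block mixed in.
$html = '<div class="story-body__inner"><h2>Title</h2><p>First</p>'
      . '<img src="x.jpg"><blockquote>Quote</blockquote>'
      . '<div class="ad"><p>Nested</p></div></div>';

libxml_use_internal_errors(true); // collect parser warnings instead of using @
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Build "self::p or self::h2 or ..." from the list of wanted tags.
$tags = array('p', 'h2', 'h3', 'h4', 'blockquote');
$wanted = implode(' or ', array_map(function ($tag) {
    return 'self::' . $tag;
}, $tags));

$query = "//div[@class='story-body__inner']//*[$wanted]";

$found = [];
foreach ($xpath->query($query) as $child) {
    $found[] = $child->nodeName;       // record which element matched
    echo $doc->saveHTML($child), "\n"; // print the element with its markup
}
```

Because // selects descendants at any depth, the paragraph nested inside the ad block is matched as well; use /* instead of //* if only direct children are wanted.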
If I understood your question correctly, is this what you are asking for?
$output1=preg_match('/^.*<tagName>(.*)<\/tagName>/', $value,$match1);
Match on the tag names and capture the data between them using preg_match.

XML to associative array and echo out specific values

I'm not sure I am going about this the right way. I am trying to echo out individual elements of data from an array, but without success. I only need to grab around 10 variables for average fuel consumption from an XML file here: https://www.fueleconomy.gov/ws/rest/ympg/shared/vehicles?make=honda&model=civic
I only need make, model, year, avgMpg (which is a child of yourMpgVehicle), etc., so I can place them in a table in the same way as you can echo out SQL data within PHP.
function download_page($path){
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$path);
curl_setopt($ch, CURLOPT_FAILONERROR,1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT, 15);
//curl_setopt($ch, CURLOPT_SSLVERSION,3);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
//curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
$retValue = curl_exec($ch);
curl_close($ch);
return $retValue;
}
$sXML = download_page('https://www.fueleconomy.gov/ws/rest/ympg/shared/vehicles?make=honda&model=civic');
$oXML = new SimpleXMLElement($sXML);
$dom = new DomDocument();
$dom->loadXml($sXML);
$dataElements = $dom->getElementsByTagName('vehicle');
$array = array();
foreach ($dataElements as $element) {
$subarray = array();
foreach ($element->childNodes as $node) {
if (!$node instanceof DomElement) {
continue;
}
$key = $node->tagName;
$value = $node->textContent;
$subarray[$key] = $value;
}
$array[] = $subarray;
// var_dump($array); // returns the array as expected
var_dump($array[0]["barrels08"]); //how can I get this and other variables?
}
The structure is like this: (Or you can click on the hyperlink above)
-<vehicles>
-<vehicle>
<atvType/>
<barrels08>10.283832</barrels08>
<barrelsA08>0.0</barrelsA08>
<charge120>0.0</charge120>
<charge240>0.0</charge240>
<city08>28</city08>
<city08U>28.0743</city08U>
<cityA08>0</cityA08>
<cityA08U>0.0</cityA08U>
<cityCD>0.0</cityCD>
<cityE>0.0</cityE>
<cityUF>0.0</cityUF>
<co2>279</co2>
<co2A>-1</co2A>
<co2TailpipeAGpm>0.0</co2TailpipeAGpm>
<co2TailpipeGpm>279.0</co2TailpipeGpm>
<comb08>32</comb08>
<comb08U>31.9768</comb08U>
<combA08>0</combA08>
<combA08U>0.0</combA08U>
<combE>0.0</combE>
<combinedCD>0.0</combinedCD>
<combinedUF>0.0</combinedUF>
<cylinders>4</cylinders>
<displ>1.8</displ>
<drive>Front-Wheel Drive</drive>
<engId>18</engId>
<eng_dscr/>
<evMotor/>
<feScore>8</feScore>
<fuelCost08>1550</fuelCost08>
<fuelCostA08>0</fuelCostA08>
<fuelType>Regular</fuelType>
<fuelType1/>
<fuelType2/>
<ghgScore>8</ghgScore>
<ghgScoreA>-1</ghgScoreA>
<guzzler/>
<highway08>39</highway08>
<highway08U>38.5216</highway08U>
<highwayA08>0</highwayA08>
<highwayA08U>0.0</highwayA08U>
<highwayCD>0.0</highwayCD>
<highwayE>0.0</highwayE>
<highwayUF>0.0</highwayUF>
<hlv>0</hlv>
<hpv>0</hpv>
<id>33504</id>
<lv2>12</lv2>
<lv4>12</lv4>
<make>Honda</make>
<mfrCode>HNX</mfrCode>
<model>Civic</model>
<mpgData>Y</mpgData>
<phevBlended>false</phevBlended>
<pv2>83</pv2>
<pv4>95</pv4>
<rangeA/>
<rangeCityA>0.0</rangeCityA>
<rangeHwyA>0.0</rangeHwyA>
<trans_dscr/>
<trany>Automatic 5-spd</trany>
<UCity>36.4794</UCity>
<UCityA>0.0</UCityA>
<UHighway>55.5375</UHighway>
<UHighwayA>0.0</UHighwayA>
<VClass>Compact Cars</VClass>
<year>2013</year>
<youSaveSpend>3000</youSaveSpend>
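-<yourMpgVehicle>
<avgMpg>33.612226599</avgMpg>
<cityPercent>45</cityPercent>
<highwayPercent>55</highwayPercent>
<maxMpg>47</maxMpg>
<minMpg>28</minMpg>
<recordCount>16</recordCount>
<vehicleId>33504</vehicleId>
</yourMpgVehicle>
</vehicle>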
You don't actually need to put everything into an array if you just want to display the data. SimpleXML makes it very simple to handle XML data. If I may suggest a maybe less complex solution:
<?php
function getFuelDataAsXml($make, $model)
{
// In most cases CURL is overkill, unless you need something more complex
$data = file_get_contents("https://www.fueleconomy.gov/ws/rest/ympg/shared/vehicles?make={$make}&model={$model}");
// If we got some data, return it as XML, otherwise return null
return $data ? simplexml_load_string($data) : null;
}
// get the data for a specific make and model
$data = getFuelDataAsXml('honda', 'civic');
// iterate over all vehicle-nodes
foreach($data->vehicle as $vehicleData)
{
echo $vehicleData->barrels08 . '<br />';
echo $vehicleData->yourMpgVehicle->avgMpg . '<br />';
echo '<hr />';
}
To fetch data from a DOM, use XPath:
$url = "https://www.fueleconomy.gov/ws/rest/ympg/shared/vehicles?make=honda&model=civic";
$dom = new DomDocument();
$dom->load($url);
$xpath = new DOMXpath($dom);
foreach ($xpath->evaluate('/*/vehicle') as $vehicle) {
var_dump(
array(
$xpath->evaluate('string(fuelType)', $vehicle),
$xpath->evaluate('number(fuelCost08)', $vehicle),
$xpath->evaluate('number(barrels08)', $vehicle)
)
);
}
Most XPath expressions return a list of nodes that can be iterated using foreach. Wrapping an expression in number() or string() casts the content of the first node to a float or a string. If the list is empty, string() returns an empty string and number() returns NaN.
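A small sketch of how the casts behave when a node is missing, using a cut-down inline document (the second vehicle deliberately has no fuelCost08; the values are placeholders). The is_nan() check and the count() expression are just one possible way to detect the gap:

```php
<?php
// Hypothetical, shortened document; the second vehicle lacks fuelCost08.
$xml = '<vehicles>'
     . '<vehicle><fuelType>Regular</fuelType><fuelCost08>1550</fuelCost08></vehicle>'
     . '<vehicle><fuelType>Premium</fuelType></vehicle>'
     . '</vehicles>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$rows = [];
foreach ($xpath->evaluate('/*/vehicle') as $vehicle) {
    // number() on an empty node list yields NaN, so map that to null.
    $cost = $xpath->evaluate('number(fuelCost08)', $vehicle);
    $rows[] = [
        'fuelType'   => $xpath->evaluate('string(fuelType)', $vehicle),
        'fuelCost08' => is_nan($cost) ? null : $cost,
        // A boolean expression reports directly whether the node exists.
        'hasCost'    => $xpath->evaluate('count(fuelCost08) > 0', $vehicle),
    ];
}
var_dump($rows);
```

Note that number() always returns a float, so 1550 comes back as 1550.0.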
