Fetch data from site using php and put in an array - php

<div>A/C:front<span style="color:red;margin:8px">/
</span>Anti-Lock Brakes<span style="color:red;margin:8px">/
</span>Passenger Airbag<span style="color:red;margin:8px">/
</span>Power Mirrors<span style="color:red;margin:8px">/
</span>Power Steering<span style="color:red;margin:8px">/
</span>Power Windows<span style="color:red;margin:8px">/
</span>Driver Airbag<span style="color:red;margin:8px">/
</span>No Accidents<span style="color:red;margin:8px">/
</span>Power Door Locks<span style="color:red;margin:8px">/</span>
</div>
It appears like this on the website:
A/C:front/Anti-Lock Brakes/Passenger Airbag/Power Mirrors/Power Steering/Power Windows/Driver Airbag/No Accidents/Power Door Locks/
I used $content = file_get_contents('url'); and now I need to sift through the data.
I need to fetch each one of the options above and put them in an array, something like:
$option = array("A/C:front","Anti-Lock Brakes","Passenger Airbag",....);
Any idea how to do this using PHP?

With the source code everything is easier:
<?php
$dom = new DOMDocument;
@$dom->loadHTMLFile('http://www.sayuri.co.jp/used-cars/B37659-Nissan-Tiida%20Latio-japanese-used-cars');
$xpath = new DOMXPath($dom);
$nodes = iterator_to_array($xpath->query('//h4/following-sibling::div')->item(0)->childNodes);
$items = array_map(function ($node) {
return $node->nodeValue;
}, array_filter($nodes, function ($node) {
return $node->nodeValue != '/';
}));
var_dump($items);
This gave me the following:
array(9) {
[0]=>
string(9) "A/C:front"
[2]=>
string(16) "Anti-Lock Brakes"
[4]=>
string(16) "Passenger Airbag"
[6]=>
string(13) "Power Mirrors"
[8]=>
string(14) "Power Steering"
[10]=>
string(13) "Power Windows"
[12]=>
string(13) "Driver Airbag"
[14]=>
string(12) "No Accidents"
[16]=>
string(16) "Power Door Locks"
}
You might want to use array_values() on $items to reset the indexes. That's all!
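A variation on the same idea can be sketched by selecting only the text nodes of the div, which leaves the separator spans out and yields sequential keys without array_values(). The inline $html string below is a shortened copy of the markup from the question, and the '//div/text()' path is an assumption that only holds for this sample; on the real page the div would need a more specific selector.

```php
<?php
// Shortened copy of the markup from the question (three options only).
$html = '<div>A/C:front<span style="color:red;margin:8px">/
</span>Anti-Lock Brakes<span style="color:red;margin:8px">/
</span>Passenger Airbag<span style="color:red;margin:8px">/</span>
</div>';

$dom = new DOMDocument;
@$dom->loadHTML($html); // @ hides warnings about imperfect markup
$xpath = new DOMXPath($dom);

$options = array();
// Visit only the text nodes that are direct children of the div.
foreach ($xpath->query('//div/text()') as $text) {
    $value = trim($text->nodeValue);
    if ($value !== '') {
        $options[] = $value; // appending keeps the keys sequential
    }
}
print_r($options);
```

Because the "/" separators live inside the span elements, they never show up among the div's own text nodes, so no filtering on '/' is needed.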

Sounds like you need DOMDocument. Specifically, the getElementsByTagName function. So using your example, I suggest this. Please adjust to suit your needs:
// Get the contents of the URL.
$content = file_get_contents('url');
// Parse the HTML using `DOMDocument`
$dom = new DOMDocument();
@$dom->loadHTML($content);
// Search the parsed DOM structure for `span` elements.
$option = array();
foreach($dom->getElementsByTagName('span') as $span){
$option[] = $span->nodeValue;
}
// Dumps the values in `option` for review.
echo '<pre>';
print_r($option);
echo '</pre>';
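Note that in the markup from the question the span elements hold only the "/" separators, so a loop over span collects those rather than the option names. A rough DOM-free sketch is shown below, assuming the div has already been isolated into a string (the $fragment variable is hypothetical): it splits on whole span elements so that the "/" inside "A/C:front" is not treated as a separator.

```php
<?php
// Hypothetical fragment: the div from the question, shortened to three options.
$fragment = '<div>A/C:front<span style="color:red;margin:8px">/
</span>Anti-Lock Brakes<span style="color:red;margin:8px">/
</span>Power Door Locks<span style="color:red;margin:8px">/</span>
</div>';

// Split on complete <span>...</span> elements (the s flag lets . match newlines).
$parts = preg_split('~<span\b[^>]*>.*?</span>~s', $fragment);

// Drop the remaining <div> tags and whitespace, then discard empty pieces.
$option = array();
foreach ($parts as $part) {
    $part = trim(strip_tags($part));
    if ($part !== '') {
        $option[] = $part;
    }
}
print_r($option);
```

A regex split like this is fragile against markup changes; the DOM-based approaches are usually the safer choice for a full page.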

Related

Removing link from Unordered list as string in array

Using PHP, I would like to remove all the links in an unordered list and put their text in an array. So the output would be: array[0]='Benefits', array[1]='Cost Savings', etc.
<ul>
<li><a href="#">Benefits</a></li>
<li><a href="#">Cost Savings</a></li>
<li><a href="#">Member listing</a></li>
</ul>
Using: preg_match_all('/<a href=\"(.*?)\"[.*]?>(.*?)<\/a>/i', $content, $matches);
I get:
array(3) { [0]=> array(3) { [0]=> string(24) "<a href="#">Benefits</a>" [1]=> string(28) "<a href="#">Cost Savings</a>" [2]=> string(30) "<a href="#">Member listing</a>" } [1]=> array(3) { [0]=> string(1) "#" [1]=> string(1) "#" [2]=> string(1) "#" } [2]=> array(3) { [0]=> string(8) "Benefits" [1]=> string(12) "Cost Savings" [2]=> string(14) "Member listing" } }
But I need to put it into one array.
To fetch the links you can use DOMDocument and DOMXPath:
$html = '<html><body><ul>
<li><a href="#">Benefits</a></li>
<li><a href="#">Cost Savings</a></li>
<li><a href="#">Member listing</a></li>
</ul></body></html>';
$dom = new DOMDocument();
$dom->loadHTML( $html ); // loads the html into the class
$xpath = new DOMXPath( $dom );
$items = $xpath->query('//ul/li/a'); // matches every a inside an li inside a ul
$array = array();
foreach( $items as $item )
{
$array[] = $item->nodeValue; // link text only; $dom->saveHTML( $item ) would return the whole <a> element
}
// Array
// (
// [0] => Benefits
// [1] => Cost Savings
// [2] => Member listing
// )
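If the regex approach from the question is kept, the second capture group already is the flat array being asked for. The sketch below assumes the list markup looks like the sample in the question; it also swaps the question's [.*]? for [^>]* so that extra attributes on the a tag are tolerated, which is a change from the original pattern.

```php
<?php
// Sample list, assumed to match the structure in the question.
$content = '<ul>
<li><a href="#">Benefits</a></li>
<li><a href="#">Cost Savings</a></li>
<li><a href="#">Member listing</a></li>
</ul>';

// Group 1 captures the href value, group 2 the link text.
preg_match_all('/<a href="(.*?)"[^>]*>(.*?)<\/a>/i', $content, $matches);

$array = $matches[2]; // flat list of link texts
print_r($array);
```

$matches[0] holds the full matches and $matches[1] the href values, so picking index 2 avoids any further flattening.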

Cannot get html attribute using PHP Simple Html DOM

I am trying to get the "sold" info from an eBay listing: https://www.ebay.co.uk/itm/Box-With-Tail-Pipe-Rear-Back-Silencer-Fits-Citroen-C2-C3-I-C3-Pluriel-GCN499/254292997729?hash=item3b350b3661:g:clEAAOSwnhldLB4J.
Here is the screenshot:
As you can see, I want to get the "1 sold" text in the upper right corner of the screen. I am using the class "vi-txt-underline" to get it, but it is not working. Does anyone know how this can be done, using another attribute or a different approach? Here is the code:
$sold = $html->find(".vi-text-underline", 0);
if($sold != null){
$item['sold'] = $sold->find("a", 0)->plaintext;
}else{
$item['sold'] = '';
}
["tag"]=>
string(4) "text"
["attr"]=>
array(0) {
}
["children"]=>
array(0) {
}
["nodes"]=>
array(0) {
}
["parent"]=>
*RECURSION*
["_"]=>
array(1) {
[4]=>
string(6) "1 sold"
The above is part of the debugged $sold variable.
I am using an array $item[] because I am also searching for more info before this part of the code.
Get the page contents:
$url = "https://www.ebay.co.uk/itm/Box-With-Tail-Pipe-Rear-Back-Silencer-Fits-Citroen-C2-C3-I-C3-Pluriel-GCN499/254292997729?hash=item3b350b3661:g:clEAAOSwnhldLB4J";
$content = file_get_contents($url);
Find what you want:
echo strpos($content,'1 sold');
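strpos() only reports where the text starts. To capture the text itself without hard-coding the number, a regular expression over the raw page is one option. The sketch below is only an illustration: the $content string is a made-up stand-in, since the real markup around the "sold" text is not shown in the question, and $soldCount is a hypothetical variable name.

```php
<?php
// Made-up stand-in for the downloaded page source.
$content = '<span class="example"><a href="#">1 sold</a></span>';

$item = array();
// Match a number (optionally with thousands separators) followed by "sold".
if (preg_match('/(\d[\d,]*)\s+sold/i', $content, $m)) {
    $item['sold'] = $m[0];                           // the matched text
    $soldCount = (int) str_replace(',', '', $m[1]);  // the number on its own
} else {
    $item['sold'] = '';
    $soldCount = 0;
}
echo $item['sold'];
```

This works on whatever file_get_contents() returned, so it does not depend on the class name being right, but it will also match any other "N sold" text that appears earlier in the page.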

Getting data from XML

I am struggling with reading an XML file using PHP.
The XML I want to use is here:
http://www.gdacs.org/xml/rss.xml
Now, the data I am interested in are the "item" nodes.
I created the following function, which gets the data:
$rawData = simplexml_load_string($response_xml_data);
foreach($rawData->channel->item as $value) {
$title = $value->title;
....
this works fine.
The nodes with the "gdcs:xxxx" were slightly more problematic, but I used the following code, which also works:
$subject = $value->children('dc', true)->subject;
Now the problem I have is with the "resources" node,
Basically the stripped down version of it would look like this:
<channel>
<item>
<gdacs:resources>
<gdacs:resource id="xx" version="0" source="xx" url="xx" type="xx">
<gdacs:title>xxx</gdacs:title>
</gdacs:resource>
<gdacs:resource id="xx" version="0" source="xx" url="xx" type="xx">
<gdacs:title>xxx</gdacs:title>
</gdacs:resource>
<gdacs:resource id="xx" version="0" source="xx" url="xx" type="xx">
<gdacs:title>xxx</gdacs:title>
</gdacs:resource>
</gdacs:resources>
</item>
</channel>
How would I get the resources in this case? I was only ever able to get the first resource, and only its title. What I would like to do is get all the resource items that have a "type" of a particular value and get their URLs.
Walking through the XML node by node is, in my experience, slow and painful.
Have a look at XPath: it is a way to extract data from XML through selectors (similar to CSS selectors).
http://php.net/manual/en/simplexmlelement.xpath.php
You can select elements by their attributes, much as you would in CSS:
<?php
$xmlStr = file_get_contents('some_xml.xml');
$xml = new SimpleXMLElement($xmlStr);
$items = $xml->xpath("//channel/item");
$urls_by_item = array();
foreach($items as $x) {
$urls_by_item[] = $x->xpath("gdacs:resources/gdacs:resource[@type='image']/@url"); // relative to the current item; a leading // would search the whole document
}
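The per-item lookup can be tried without network access using a small inline feed. In the sketch below the XML string, its ids, URLs and titles are all made up for illustration; only the gdacs namespace URI and the element and attribute names come from the thread. The namespace is registered explicitly on each item before querying, and each attribute result is cast to a string.

```php
<?php
// Made-up stand-in for the feed; only the structure mirrors the question.
$xmlStr = '<rss xmlns:gdacs="http://www.gdacs.org"><channel>
<item><gdacs:resources>
<gdacs:resource id="a" type="image" url="http://example.com/a.png"><gdacs:title>Map A</gdacs:title></gdacs:resource>
<gdacs:resource id="b" type="report" url="http://example.com/b.pdf"><gdacs:title>Report B</gdacs:title></gdacs:resource>
</gdacs:resources></item>
<item><gdacs:resources>
<gdacs:resource id="c" type="image" url="http://example.com/c.png"><gdacs:title>Map C</gdacs:title></gdacs:resource>
</gdacs:resources></item>
</channel></rss>';

$xml = new SimpleXMLElement($xmlStr);
$urls_by_item = array();
foreach ($xml->xpath('//channel/item') as $item) {
    $item->registerXPathNamespace('gdacs', 'http://www.gdacs.org');
    $urls = array();
    // No leading "//": the path is evaluated relative to the current item.
    foreach ($item->xpath("gdacs:resources/gdacs:resource[@type='image']/@url") as $url) {
        $urls[] = (string) $url; // cast the attribute node to its value
    }
    $urls_by_item[] = $urls;
}
print_r($urls_by_item);
```

Changing 'image' in the predicate to another type value selects a different set of resources without touching the loop.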
Consider using XPath's positional predicate in square brackets [] to align URLs with their corresponding titles. As a more involved modification of @Daniel Batkilin's answer, you can combine both pieces of data in a multidimensional associative array, which requires nested loops.
$xml = simplexml_load_file('http://www.gdacs.org/xml/rss.xml');
$xml->registerXPathNamespace('gdacs', 'http://www.gdacs.org');
$items = $xml->xpath("//channel/item");
$i = 1;
$out = array();
foreach($items as $x) {
$titles = $xml->xpath("//channel/item[".$i."]/gdacs:resources/gdacs:resource[@type='image']/gdacs:title");
$urls = $xml->xpath("//channel/item[".$i."]/gdacs:resources/gdacs:resource[@type='image']/@url");
for($j=0; $j<count($urls); $j++) {
$out[$j.$i]['title'] = (string)$titles[$j];
$out[$j.$i]['url'] = (string)$urls[$j];
}
$i++;
}
$out = array_values($out);
var_dump($out);
ARRAY DUMP
array(40) {
[0]=>
array(2) {
["title"]=>
string(21) "Storm surge animation"
["url"]=>
string(92) "http://webcritech.jrc.ec.europa.eu/ModellingCyclone/cyclonesurgeVM/1000226/final/outres1.gif"
}
[1]=>
array(2) {
["title"]=>
string(26) "Storm surge maximum height"
["url"]=>
string(101) "http://webcritech.jrc.ec.europa.eu/ModellingCyclone/cyclonesurgeVM/1000226/final/P1_MAXHEIGHT_END.jpg"
}
[2]=>
array(2) {
["title"]=>
string(12) "Overview map"
["url"]=>
string(64) "http://dma.gdacs.org/saved/gdacs/tc/1000226/clouds_1000226_2.png"
}
[3]=>
array(2) {
["title"]=>
string(41) "Map of rainfall accummulation in past 24h"
["url"]=>
string(70) "http://dma.gdacs.org/saved/gdacs/tc/1000226/current_rain_1000226_2.png"
}
[4]=>
array(2) {
["title"]=>
string(23) "Map of extreme rainfall"
["url"]=>
string(62) "http://dma.gdacs.org/saved/gdacs/tc/1000226/rain_1000226_2.png"
}
[5]=>
array(2) {
["title"]=>
string(34) "Map of extreme rainfall (original)"
["url"]=>
string(97) "http://www.ssd.noaa.gov/PS/TROP/DATA/ETRAP/2015/NorthIndian/THREE/2015THREE.pmqpf.10100000.00.GIF"
}
...

How to fetch html string of XPath results?

Considering this code:
<div class="a">foo</div>
<div class="a"><div id="1">bar</div></div>
If I want to fetch all the values of divs with class a, I'll do the following query:
$q = $xpath->query('//div[@class="a"]');
However, I'll get this result:
foo
bar
But I want to get the actual value including the children tags. So it'll look like that:
foo
<div id="1">bar</div>
How can I accomplish that with XPath and DOMDocument only?
Solved by the function provided here.
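Note that PHP DOM's nodeValue property is not the equivalent of .innerHTML in a browser: it returns only the text of the node and its descendants, so the child tags are lost. Once you've used XPath to get the node you want, serialize it (or its child nodes) with saveHTML() or saveXML() to get the markup.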
You can try to use
$xml = '<?xml version=\'1.0\' encoding=\'UTF-8\' ?>
<root>
<div class="a">foo</div>
<div class="a"><div id="1">bar</div></div>
</root>';
$xml = simplexml_load_string($xml);
var_dump($xml->xpath('//div[@class="a"]'));
But in this case you will have to iterate objects.
Output:
array(2) {
[0]=>
object(SimpleXMLElement)#2 (2) {
["@attributes"]=>
array(1) {
["class"]=>
string(1) "a"
}
[0]=>
string(3) "foo"
}
[1]=>
object(SimpleXMLElement)#3 (2) {
["@attributes"]=>
array(1) {
["class"]=>
string(1) "a"
}
["div"]=>
string(3) "bar"
}
}
Try something like:
$doc = new DOMDocument;
$doc->loadHTML('<div>Your HTML here.</div>');
$xpath = new DOMXpath($doc);
$node = $xpath->query('//div[@class="a"]')->item(0);
$html = $node->ownerDocument->saveHTML($node); // Get HTML of DOMElement.
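saveHTML($node) returns the outer HTML, including the matched div itself. To get only the inner markup that the question asks for, one possible sketch is to serialize each child node and concatenate the results. The markup below is the sample from the question; the $inner array is a name introduced here for illustration.

```php
<?php
$doc = new DOMDocument;
$doc->loadHTML('<div class="a">foo</div><div class="a"><div id="1">bar</div></div>');
$xpath = new DOMXPath($doc);

$inner = array();
foreach ($xpath->query('//div[@class="a"]') as $node) {
    $html = '';
    // Serialize the children, not the node itself, to leave out the outer <div class="a">.
    foreach ($node->childNodes as $child) {
        $html .= $doc->saveHTML($child);
    }
    $inner[] = $html;
}
print_r($inner);
```

Passing a node to saveHTML() needs PHP 5.3.6 or later; on older versions saveXML($child) is the usual substitute.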

xpath not return values

I am able to pull the necessary information using XPath when I use var_dump with the following code. When I try to add a foreach loop to return all ["href"] values, I get a blank page. Any ideas where I am messing up?
$dom = new DOMDocument();
@$dom->loadHTML($source);
$xml = simplexml_import_dom($dom);
$rss = $xml->xpath("/html/body//a[@class='highzoom1']");
$links = $rss->href;
foreach ($links as $link){
echo $link;
}
Here is the array of information.
array(96) {
[0]=>
object(SimpleXMLElement)#3 (2) {
["@attributes"]=>
array(2) {
["href"]=>
string(49) "/p/18351/test1.html"
["class"]=>
string(10) "highzoom1"
}
[0]=>
string(36) ""test1"
}
[1]=>
object(SimpleXMLElement)#4 (2) {
["@attributes"]=>
array(2) {
["href"]=>
string(43) "/p/18351/test2.html"
["class"]=>
string(10) "highzoom1"
}
[0]=>
string(30) ""test2"
}
[2]=>
object(SimpleXMLElement)#5 (2) {
["@attributes"]=>
array(2) {
["href"]=>
string(48) "/p/18351/test3.html"
["class"]=>
string(10) "highzoom1"
}
[0]=>
string(35) ""test3"
}
Instead of:
$rss = $xml->xpath("/html/body//a[@class='highzoom1']");
use:
$hrefs = $xml->xpath("/html/body//a[@class='highzoom1']/@href");
The original XPath expression (the first one above) selects every a element whose class attribute has the value 'highzoom1' and that is a descendant of a body element that is a child of the top element (named html) of the document.
However, you want to select the href attributes of these a elements, not the a elements themselves.
The second XPath expression above selects exactly the href attributes of these a elements.
$links = $rss->href;
will never work, because $rss is a plain array of SimpleXMLElement objects and an array has no href property. Instead, you'd want to do this:
$rss = $xml->xpath("/html/body//a[@class='highzoom1']");
foreach($rss as $link) {
echo $link['href']; // attributes are read with array syntax in SimpleXML
}
Or you can address $rss as an array directly:
echo $rss[5]['href']; // echo out the href of the 6th link found.
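Putting the pieces together, a self-contained sketch of the loop might look as follows. The $source string is a made-up stand-in for the real page, and the $links array is introduced here only to collect the results.

```php
<?php
// Made-up stand-in for the page; one link deliberately has a different class.
$source = '<html><body>
<a class="highzoom1" href="/p/18351/test1.html">test1</a>
<a class="other" href="/skip.html">skip</a>
<a class="highzoom1" href="/p/18351/test2.html">test2</a>
</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($source); // @ hides warnings about imperfect markup
$xml = simplexml_import_dom($dom);

$links = array();
foreach ($xml->xpath("/html/body//a[@class='highzoom1']") as $a) {
    $links[] = (string) $a['href']; // attributes use array syntax, not ->
}
print_r($links);
```

Casting to string matters when the values are stored or compared later, because $a['href'] is itself a SimpleXMLElement rather than a plain string.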
