PHP Web Scraping And JSON or Array Output - php

I'm experimenting scraping Amazon with PHP but I don't know what I am doing wrong. The problem is that I can't access all the data I scraped. Here is my code:
<?php
$url = 'https://www.amazon.com/s/ref=nb_sb_ss_c_1_9?url=search-alias%3Daps&field-keywords=most+sold+items+on+amazon&sprefix=most+sold%2Caps%2C435&crid=348CE8G406XVG&rh=i%3Aaps%2Ck%3Amost+sold+items+on+amazon';
$html = file_get_html($url);
foreach ($html->find('h2[class=a-size-medium]') as $element) {
echo "<li>" .$element->plaintext."</li><br>";
}
?>
The foreach statement loops through and output the plain text but I want to be able to pass the plain text to a variable or array. The problem is that if I do that and output the result, I only get the last string of the plain text array. I have done lots of research to find what I'm doing wrong but I can't find it. Please any help will be appreciated. Here is what I'm trying to achieve:
<?php
$url = 'https://www.amazon.com/s/ref=nb_sb_ss_c_1_9?url=search-alias%3Daps&field-keywords=most+sold+items+on+amazon&sprefix=most+sold%2Caps%2C435&crid=348CE8G406XVG&rh=i%3Aaps%2Ck%3Amost+sold+items+on+amazon';
$hold = array();
$html = file_get_html($url);
foreach ($html->find('h2[class=a-size-medium]') as $element) {
$hold = $element->plaintext;
}
print_r($hold);
?>
The second code will output the last string of the plain text which is: "Rubbermaid LunchBlox Side Container Kit, 2-Pack, 1806176". I also tried achieving this by encoding and decoding the plain text but nothing changed. What am I doing wrong?

Instead of setting the array hold to a string...add new elements to the array:
$hold[] = $element->plaintext;

Related

PHP foreach statement issue with accessing XML data

First, I am pretty clueless with PHP, so be kind! I am working on a site for an SPCA (I'm a Vet and a part time geek). The PHP accesses an xml file from a portal used to administer the shelter and store images, info. The file writes that xml data to JSON and then I use the JSON data in a handlebars template, etc. I am having a problem getting some data from the xml file to outprint to JSON.
The xml file is like this:
</DataFeedAnimal>
<AdditionalPhotoUrls>
<string>doc_73737.jpg</string>
<string>doc_74483.jpg</string>
<string>doc_74484.jpg</string>
</AdditionalPhotoUrls>
<PrimaryPhotoUrl>19427.jpg</PrimaryPhotoUrl>
<Sex>Male</Sex>
<Type>Cat</Type>
<YouTubeVideoUrls>
<string>http://www.youtube.com/watch?v=6EMT2s4n6Xc</string>
</YouTubeVideoUrls>
</DataFeedAnimal>
In the PHP file, written by a friend, the code is below, (just part of it), to access that XML data and write it to JSON:
<?php
$url = "http://eastbayspcapets.shelterbuddy.com/DataFeeds/AnimalsForAdoption.aspx";
if ($_GET["type"] == "found") {
$url = "http://eastbayspcapets.shelterbuddy.com/DataFeeds/foundanimals.aspx";
} else if ($_GET["type"] == "lost") {
$url = "http://eastbayspcapets.shelterbuddy.com/DataFeeds/lostanimals.aspx";
}
$response_xml_data = file_get_contents($url);
$xml = simplexml_load_string($response_xml_data);
$data = array();
foreach($xml->DataFeedAnimal as $animal)
{
$item = array();
$item['sex'] = (string)$animal->Sex;
$item['photo'] = (string)$animal->PrimaryPhotoUrl;
$item['videos'][] = (string)$animal->YouTubeVideoUrls;
$item['photos'][] = (string)$animal->PrimaryPhotoUrl;
foreach($animal->AdditionalPhotoUrls->string as $photo) {
$item['photos'][] = (string)$photo;
}
$item['videos'] = array();
$data[] = $item;
}
echo file_put_contents('../adopt.json', json_encode($data));
echo json_encode($data);
?>
The JSON output works well but I am unable to get 'videos' to write out to the JSON file as the 'photos' do. I just get '/n'!
Since the friend who helped with this is no longer around, I am stuck. I have tried similar code to the foreach statement for photos but am getting nowhere. Any help would be appreciated and the pets would appreciate it as well!
The trick with such implementations is to always look what you have got by dumping data structures to a log file or command line. Then to take a look at the documentation of the data you see. That way you know exactly what data you are working with and how to work with it ;-)
Here it turns out that the video URLs you are interested in are placed inside an object of type SimpleXMLElement with public properties, which is not really surprising if you look at the xml structure. The documentation of class SimpleXMLElement shows the method children() which iterates through all children. Just what we are looking for...
That means a clean implementation to access those sets should go along these lines:
foreach($animal->AdditionalPhotoUrls->children() as $photo) {
$item['photos'][] = (string)$photo;
}
foreach($animal->YouTubeVideoUrls->children() as $video) {
$item['videos'][] = (string)$video;
}
Take a look at this full and working example:
<?php
$response_xml_data = <<< EOT
<DataFeedAnimal>
<AdditionalPhotoUrls>
<string>doc_73737.jpg</string>
<string>doc_74483.jpg</string>
<string>doc_74484.jpg</string>
</AdditionalPhotoUrls>
<PrimaryPhotoUrl>19427.jpg</PrimaryPhotoUrl>
<Sex>Male</Sex>
<Type>Cat</Type>
<YouTubeVideoUrls>
<string>http://www.youtube.com/watch?v=6EMT2s4n6Xc</string>
<string>http://www.youtube.com/watch?v=hgfg83mKFnd</string>
</YouTubeVideoUrls>
</DataFeedAnimal>
EOT;
$animal = simplexml_load_string($response_xml_data);
$item = [];
$item['sex'] = (string)$animal->Sex;
$item['photo'] = (string)$animal->PrimaryPhotoUrl;
$item['photos'][] = (string)$animal->PrimaryPhotoUrl;
foreach($animal->AdditionalPhotoUrls->children() as $photo) {
$item['photos'][] = (string)$photo;
}
$item['videos'] = [];
foreach($animal->YouTubeVideoUrls->children() as $video) {
$item['videos'][] = (string)$video;
}
echo json_encode($item);
The obvious output of this is:
{
"sex":"Male",
"photo":"19427.jpg",
"photos" ["19427.jpg","doc_73737.jpg","doc_74483.jpg","doc_74484.jpg"],
"videos":["http:\/\/www.youtube.com\/watch?v=6EMT2s4n6Xc","http:\/\/www.youtube.com\/watch?v=hgfg83mKFnd"]
}
I would however like to add a short hint:
In m eyes it is questionable to convert such structured information into an associative array. Why? Why not a simple json_encode($animal)? The structure is perfectly fine and should be easy to work with! The output of that would be:
{
"AdditionalPhotoUrls":{
"string":[
"doc_73737.jpg",
"doc_74483.jpg",
"doc_74484.jpg"
]
},
"PrimaryPhotoUrl":"19427.jpg",
"Sex":"Male",
"Type":"Cat",
"YouTubeVideoUrls":{
"string":[
"http:\/\/www.youtube.com\/watch?v=6EMT2s4n6Xc",
"http:\/\/www.youtube.com\/watch?v=hgfg83mKFnd"
]
}
}
That structure describes objects (items with an inner structure, enclosed in json by {...}), not just arbitrary arrays (sets without a structure, enclosed in json by a [...]). Arrays are only used for the two unstructured sets of strings in there: photos and videos. This is much more logical, once you think about it...
Assuming the XML Data and the JSON data are intended to have the same structure. I would take a look at this: PHP convert XML to JSON
You may not need for loops at all.

PHP Simplexml traversal

I am trying to traverse this xml feed:
http://www.goonersworld.co.uk/forum/feed.php
using this code:
$data = file_get_contents('http://www.goonersworld.co.uk/forum/feed.php');
//$data = str_replace("content:encoded>","content>",$data);
$xml = simplexml_load_string($data,'SimpleXMLElement', LIBXML_NOCDATA);
foreach ($xml->feed->entry as $stories) {
echo $stories->published."<br>";
}
But I'm getting nothing back. To me it looks like $xml->feed->entry->published but am I missing something?
Ok, so I've figured this out. In case it helps anyone else it was a feed generated by PHPBB (Bulletin Board).
The actual structure I was looking for was simply:
foreach ($xml->entry as $stories) {
echo $stories->published."<br>";
}
feed and entry didn't look to be at the same level in the XML formatting but they were :-(

Json response with php

I am trying to get JSON response using PHP. I want to have Json array not the HTML tags. But the output shows HTML tags as well.I want to remove this HTML output! PHP code is as follows: I don't know how to do this ? Please help.
Thanks in advance :)
<?php
function getFixture(){
$db = new DbConnect();
// array for json response of full fixture
$response = array();
$response["fixture"] = array();
$result = mysql_query("SELECT * FROM fixture"); // Select all rows from fixture table
while($row = mysql_fetch_array($result)){
$tmp = array(); // temporary array to create single match information
$tmp["matchId"] = $row["matchId"];
$tmp["teamA"] = $row["teamA"];
$tmp["teamB"] = $row["teamB"];
array_push($response["fixture"], $tmp);
}
header('Content-Type: application/json');
echo json_encode($response);
}
getFixture();
?>
It's difficult to tell without seeing what your output is, but there is nothing in your code which would add HTML to your response.
It sounds like the HTML is in the database, so you're getting the data as expected, and your browser is the displaying whatever html elements might be there.
You could ensure none of the rows from the database have HTML in them by using strip_tags as follows:
$tmp["teamA"] = strip_tags($row["teamA"]);
Do this for all rows which may contain html.
Sorry if this is not formatted right, I'm new to StackOverflow!
http://php.net/strip-tags

PHP XPath query returns nothing

I've been recently playing with DOMXpath in PHP and had success with it, trying to get more experience with it I've been playing grabbing certain elements of different sites. I am having trouble getting the weather marker off of http://www.theweathernetwork.com/weather/cape0005 this website.
Specifically I want
//*[#id='theTemperature']
Here is what I have
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
#$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[#id='theTemperature']");
foreach ($tags as $tag){
echo $tag->nodeValue;
}
Is there something I am doing wrong here? I am able to produce actual results on other tags on the page but specifically not this one.
Thanks in advance.
You might want to improve your DOMDocument debugging skills, here some hints (Demo):
<?php
header('Content-Type: text/plain;');
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
#$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[#id='theTemperature']");
foreach ($tags as $i => $tag){
echo $i, ': ', var_dump($tag->nodeValue), ' HTML: ', $dom->saveHTML($tag), "\n";
}
Output the number of the found node, I do it here with $i in the foreach.
var_dump the ->nodeValue, it helps to show what exactly it is.
Output the HTML by making use of the saveHTML function which shows a better picture.
The actual output:
0: string(0) ""
HTML: <p id="theTemperature"></p>
You can easily spot that the element is empty, so the temperature must go in from somewhere else, e.g. via javascript. Check the Network tools of your browser.
what happens is straightforward, the page contains an empty id="theTemperature" element which is a placeholder to be populated with javascript. file_get_contents() will just download the page, not executing javascript, so the element remains empty. Try to load the page in the browser with javascript disabled to see it yourself
The element you're trying to select is indeed empty. The page loads the temperature into that id through ajax. Specifically this script:
http://www.theweathernetwork.com/common/js/master/citypage_ajax.js?cb=201301231338
but when you do a file_get_contents those scripts obviously don't get resolved. I'd go with guido's solution of using the RSS

Parsing XML with PHP (simplexml)

Firstly, may I point out that I am a newcomer to all things PHP so apologies if anything here is unclear and I'm afraid the more layman the response the better. I've been having real trouble parsing an xml file in to php to then populate an HTML table for my website. At the moment, I have been able to get the full xml feed in to a string which I can then echo and view and all seems well. I then thought I would be able to use simplexml to pick out specific elements and print their content but have been unable to do this.
The xml feed will be constantly changing (structure remaining the same) and is in compressed format. From various sources I've identified the following commands to get my feed in to the right format within a string although I am still unable to print specific elements. I've tried every combination without any luck and suspect I may be barking up the wrong tree. Could someone please point me in the right direction?!
$file = fopen("compress.zlib://$url", 'r');
$xmlstr = file_get_contents($url);
$xml = new SimpleXMLElement($url,null,true);
foreach($xml as $name) {
echo "{$name->awCat}\r\n";
}
Many, many thanks in advance,
Chris
PS The actual feed
Since no one followed my closevote, I think I can just as well put my own comments as an answer:
First of all, SimpleXml can load URIs directly and it can do so with stream wrappers, so your three calls in the beginning can be shortened to (note that you are not using $file at all)
$merchantProductFeed = new SimpleXMLElement("compress.zlib://$url", null, TRUE);
To get the values you can either use the implicit SimpleXml API and drill down to the wanted elements (like shown multiple times elsewhere on the site):
foreach ($merchantProductFeed->merchant->prod as $prod) {
echo $prod->cat->awCat , PHP_EOL;
}
or you can use an XPath query to get at the wanted elements directly
$xml = new SimpleXMLElement("compress.zlib://$url", null, TRUE);
foreach ($xml->xpath('/merchantProductFeed/merchant/prod/cat/awCat') as $awCat) {
echo $awCat, PHP_EOL;
}
Live Demo
Note that fetching all $awCat elements from the source XML is rather pointless though, because all of them have "Bodycare & Fitness" for value. Of course you can also mix XPath and the implict API and just fetch the prod elements and then drill down to the various children of them.
Using XPath should be somewhat faster than iterating over the SimpleXmlElement object graph. Though it should be noted that the difference is in an neglectable area (read 0.000x vs 0.000y) for your feed. Still, if you plan to do more XML work, it pays off to familiarize yourself with XPath, because it's quite powerful. Think of it as SQL for XML.
For additional examples see
A simple program to CRUD node and node values of xml file and
PHP Manual - SimpleXml Basic Examples
Try this...
$url = "http://datafeed.api.productserve.com/datafeed/download/apikey/58bc4442611e03a13eca07d83607f851/cid/97,98,142,144,146,129,595,539,147,149,613,626,135,163,168,159,169,161,167,170,137,171,548,174,183,178,179,175,172,623,139,614,189,194,141,205,198,206,203,208,199,204,201,61,62,72,73,71,74,75,76,77,78,79,63,80,82,64,83,84,85,65,86,87,88,90,89,91,67,92,94,33,54,53,57,58,52,603,60,56,66,128,130,133,212,207,209,210,211,68,69,213,216,217,218,219,220,221,223,70,224,225,226,227,228,229,4,5,10,11,537,13,19,15,14,18,6,551,20,21,22,23,24,25,26,7,30,29,32,619,34,8,35,618,40,38,42,43,9,45,46,651,47,49,50,634,230,231,538,235,550,240,239,241,556,245,244,242,521,576,575,577,579,281,283,554,285,555,303,304,286,282,287,288,173,193,637,639,640,642,643,644,641,650,177,379,648,181,645,384,387,646,598,611,391,393,647,395,631,602,570,600,405,187,411,412,413,414,415,416,649,418,419,420,99,100,101,107,110,111,113,114,115,116,118,121,122,127,581,624,123,594,125,421,604,599,422,530,434,532,428,474,475,476,477,423,608,437,438,440,441,442,444,446,447,607,424,451,448,453,449,452,450,425,455,457,459,460,456,458,426,616,463,464,465,466,467,427,625,597,473,469,617,470,429,430,615,483,484,485,487,488,529,596,431,432,489,490,361,633,362,366,367,368,371,369,363,372,373,374,377,375,536,535,364,378,380,381,365,383,385,386,390,392,394,396,397,399,402,404,406,407,540,542,544,546,547,246,558,247,252,559,255,248,256,265,259,632,260,261,262,557,249,266,267,268,269,612,251,277,250,272,270,271,273,561,560,347,348,354,350,352,349,355,356,357,358,359,360,586,590,592,588,591,589,328,629,330,338,493,635,495,507,563,564,567,569,568/mid/2891/columns/merchant_id,merchant_name,aw_product_id,merchant_product_id,product_name,description,category_id,category_name,merchant_category,aw_deep_link,aw_image_url,search_price,delivery_cost,merchant_deep_link,merchant_image_url/format/xml/compression/gzip/";
$zd = gzopen($url, "r");
$data = gzread($zd, 1000000);
gzclose($zd);
if ($data !== false) {
$xml = simplexml_load_string($data);
foreach ($xml->merchant->prod as $pr) {
echo $pr->cat->awCat . "<br>";
}
}
<?php
$xmlstr = file_get_contents("compress.zlib://$url");
$xml = simplexml_load_string($xmlstr);
// you can transverse the xml tree however you want
foreach ($xml->merchant->prod as $line) {
// $line->cat->awCat -> you can use this
}
more information here
Use print_r($xml) to see the structure of the parsed XML feed.
Then it becomes obvious how you would traverse it:
foreach ($xml->merchant->prod as $prod) {
print $prod->pId;
print $prod->text->name;
print $prod->cat->awCat; # <-- which is what you wanted
print $prod->price->buynow;
}
$url = 'you url here';
$f = gzopen ($url, 'r');
$xml = new SimpleXMLElement (fread ($f, 1000000));
foreach($xml->xpath ('//prod') as $name)
{
echo (string) $name->cat->awCatId, "\r\n";
}

Categories