Use Simple HTML DOM Parser to JSON? - php

I'm trying to group the elements scraped from a website and convert them into JSON, but it doesn't seem to be working.
<?php
// Include the php dom parser
include_once 'simple_html_dom.php';
header('Content-type: application/json');
// Create DOM from URL or file
$html = file_get_html('urlhere');
foreach($html->find('hr ul') as $ul)
{
foreach($ul->find('div.product') as $li)
$data[$count]['products'][]['li']= $li->innertext;
$count++;
}
echo json_encode($data);
?>
This returns
{"":{"products":[{"li":" <a class=\"th\" href=\"\/products\/56942-haters-crewneck-sweatshirt\"> <div style=\"background-image:url('http:\/\/s0.merchdirect.com\/images\/15814\/v600_B_AltApparel_Crew.png');\"> <img src=\"http:\/\/s0.com\/images\/6398\/product-image-placeholder-600.png\"> <\/div> <\/a> <div class=\"panel panel-info\" style=\"display: none;\"> <div class=\"name\"> <a href=\"\/products\/56942-haters-crewneck-sweatshirt\"> Haters Crewneck Sweatshirt <\/a> <\/div> <div class=\"subtitle\"> $60.00 <\/div> <\/div> "}
When I'm actually hoping to achieve:
{"products":[{
"link":"/products/56942-haters-crewneck-sweatshirt",
"image":"http://s0.com/images/15814/v600_B_AltApparel_Crew.png",
"name":"Haters Crewneck Sweatshirt",
"subtitle":"60.00"}
]}
How do I get rid of all of the redundant information and name each element in the reformatted JSON?
Thanks!

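You need to extend the logic within the inner loop and build one array per product. Note also that $count is never initialised, which is why your output is keyed by an empty string; drop it and append to $data['products'] directly:
$data = array('products' => array());
foreach ($html->find('hr ul') as $ul) {
    foreach ($ul->find('div.product') as $li) {
        $product = array();
        $product['link'] = $li->find('a.th', 0)->href;
        $product['name'] = trim($li->find('div.name a', 0)->innertext);
        // Strip the leading "$" so the value matches the desired "60.00"
        $product['subtitle'] = ltrim(trim($li->find('div.subtitle', 0)->innertext), '$');
        // The image URL sits inside the style attribute: background-image:url('...')
        $product['image'] = explode("'", $li->find('div', 0)->style)[1];
        $data['products'][] = $product;
    }
}
echo json_encode($data);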
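For comparison, here is a minimal sketch of the same extraction using PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM. The markup in $sample is an assumption reconstructed from the innertext shown in the question (wrapped in a div.product), and the helper name extractProducts is hypothetical.

```php
<?php
// Hypothetical helper: turn every div.product in an HTML string into an
// associative array with link, image, name and subtitle keys.
function extractProducts(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $products = [];
    // Match divs whose class list contains the token "product".
    $query = '//div[contains(concat(" ", normalize-space(@class), " "), " product ")]';
    foreach ($xpath->query($query) as $node) {
        $style = $xpath->evaluate('string(.//a[@class="th"]/div/@style)', $node);
        // Pull the URL out of background-image:url('...').
        preg_match("/url\\('([^']+)'\\)/", $style, $m);
        $products[] = [
            'link' => $xpath->evaluate('string(.//a[@class="th"]/@href)', $node),
            'image' => $m[1] ?? null,
            'name' => trim($xpath->evaluate('string(.//div[@class="name"]/a)', $node)),
            'subtitle' => ltrim(trim($xpath->evaluate('string(.//div[@class="subtitle"])', $node)), '$'),
        ];
    }
    return ['products' => $products];
}

// Assumed sample markup, based on the innertext printed in the question.
$sample = <<<'HTML'
<div class="product">
  <a class="th" href="/products/56942-haters-crewneck-sweatshirt">
    <div style="background-image:url('http://s0.merchdirect.com/images/15814/v600_B_AltApparel_Crew.png');"></div>
  </a>
  <div class="panel panel-info">
    <div class="name"><a href="/products/56942-haters-crewneck-sweatshirt"> Haters Crewneck Sweatshirt </a></div>
    <div class="subtitle"> $60.00 </div>
  </div>
</div>
HTML;

$result = extractProducts($sample);
echo json_encode($result, JSON_UNESCAPED_SLASHES | JSON_PRETTY_PRINT), PHP_EOL;
```

JSON_UNESCAPED_SLASHES keeps the URLs readable (no `\/`), which is closer to the output the question asks for; the escaped form is equally valid JSON.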

Related

Fetch content of all div with same class using PHP Simple HTML DOM Parser

I am new to HTML DOM parsing with PHP. There is a page with several divs that have different content but the same class. When I try to fetch the content, I only get the content of the last div. Is it possible to get the content of all the divs that have the same class? Please have a look at my code:
<?php
include(__DIR__."/simple_html_dom.php");
$html = file_get_html('http://campaignstudio.in/');
echo $x = $html->find('h2[class="section-heading"]',1)->outertext;
?>
In your example code, you have
echo $x = $html->find('h2[class="section-heading"]',1)->outertext;
Because you call find() with a second parameter of 1, it returns only a single element: the one at index 1 (indexes start at 0, so that is the second match). If you omit the index, find() returns all of the matches and you can do whatever you need with them:
$list = $html->find('h2[class="section-heading"]');
foreach ( $list as $item ) {
echo $item->outertext . PHP_EOL;
}
The full code I've just tested is...
include(__DIR__."/simple_html_dom.php");
$html = file_get_html('http://campaignstudio.in/');
$list = $html->find('h2[class="section-heading"]');
foreach ( $list as $item ) {
echo $item->outertext . PHP_EOL;
}
which gives the output...
<h2 class="section-heading text-white">We've got what you need!</h2>
<h2 class="section-heading">At Your Service</h2>
<h2 class="section-heading">Let's Get In Touch!</h2>
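The difference between "all matches" and "the match at index 1" can also be sketched with PHP's built-in DOMDocument and DOMXPath. This is only an illustration: the helper name headingsByClass is hypothetical, and the sample markup is taken from the output shown above rather than fetched from the live site.

```php
<?php
// Hypothetical helper: return the text of every h2 whose class list
// contains the given class name.
function headingsByClass(string $html, string $class): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Token match on the class attribute, so "section-heading text-white" also matches.
    // The class name is interpolated as-is, so only pass trusted values.
    $query = sprintf('//h2[contains(concat(" ", normalize-space(@class), " "), " %s ")]', $class);

    $texts = [];
    foreach ($xpath->query($query) as $h2) {
        $texts[] = trim($h2->textContent);
    }
    return $texts;
}

$sample = '<h2 class="section-heading text-white">We\'ve got what you need!</h2>'
        . '<h2 class="section-heading">At Your Service</h2>'
        . '<h2 class="section-heading">Let\'s Get In Touch!</h2>';

$all = headingsByClass($sample, 'section-heading');
print_r($all);          // every matching heading
echo $all[1], PHP_EOL;  // index 1 is the second match only
```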

php DOMDocument - List child elements to array

For the following HTML:
<html>
<body>
<div whatever></div>
<div id="archive-wrapper">
<ul class="archive-list">
<li><a href="#1">A</a></li>
<li><a href="#2">B</a></li>
<li><a href="#3">C</a></li>
</ul>
</div>
</body>
How could I retrieve, with PHP DOMDocument (http://php.net/manual/es/class.domdocument.php), an array containing (#1, #2, #3) in the most effective way? It's not that I haven't tried anything or that I want ready-made code; I just need some guidelines so I can do it and understand it on my own. Thanks :)
A simple example using php DOMDocument -
<?php
$html = <<<HTML
<html>
<body>
<div whatever></div>
<div id="archive-wrapper">
<ul class="archive-list">
<li><a href="#1">A</a></li>
<li><a href="#2">B</a></li>
<li><a href="#3">C</a></li>
</ul>
</div>
</body>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
//get all links
$links = $dom->getElementsByTagName('a');
$linkArray = array();
//loop through each link
foreach ($links as $link){
$linkArray[] = $link->getAttribute('href');
}
Edit:
To get only the links inside ul > li, you could do something like:
$dom = new DOMDocument();
$dom->loadHTML($html);
$linkArray = array();
foreach ($dom->getElementsByTagName('ul') as $ul) {
    foreach ($ul->getElementsByTagName('li') as $li) {
        foreach ($li->getElementsByTagName('a') as $link) {
            $linkArray[] = $link->getAttribute('href');
        }
    }
}
Or, if you just want the first ul, you could simplify to:
//get 1st ul using ->item(0)
$ul = $dom->getElementsByTagName('ul')->item(0);
foreach ($ul->getElementsByTagName('li') as $li){
foreach ($li->getElementsByTagName('a') as $a){
$linkArray[] = $a->getAttribute('href');
}
}
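The nested loops above can also be collapsed into a single XPath query. The following is a minimal sketch, assuming the list items contain links such as `<a href="#1">A</a>` (which the expected result of #1, #2, #3 implies); the helper name archiveLinks is hypothetical.

```php
<?php
// Hypothetical helper: collect the href of every link inside the archive list.
function archiveLinks(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Select the href attribute nodes directly instead of looping level by level.
    $query = '//div[@id="archive-wrapper"]/ul[@class="archive-list"]/li/a/@href';

    $hrefs = [];
    foreach ($xpath->query($query) as $attr) {
        $hrefs[] = $attr->nodeValue;
    }
    return $hrefs;
}

$sample = <<<'HTML'
<html>
<body>
<div whatever></div>
<div id="archive-wrapper">
<ul class="archive-list">
<li><a href="#1">A</a></li>
<li><a href="#2">B</a></li>
<li><a href="#3">C</a></li>
</ul>
</div>
</body>
</html>
HTML;

$hrefs = archiveLinks($sample);
print_r($hrefs);
```

Anchoring the query on the wrapper id and list class means links elsewhere on the page are ignored, which the tag-name loops only achieve by nesting.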
What do you mean by PHP DOM? Do you mean PHP and jQuery? There are a few ways to set this up:
You can put all of that in a form and post it to a script.
You can also wrap it in a select, which will only store the selected data.
A better idea would be to use jQuery to post the items as an array to the same page, using PHP as the processor for the server-side manipulation. This is better in the long run, as it is the most up-to-date way of interacting with HTML and server-side scripts.
For example, you can try:
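$("#form").submit(function(e){ // "form" being the form's id
    e.preventDefault(); // stop the normal submit so the AJAX post can run
    var items = [];
    // archive-list is a class in the markup, so select it with a dot
    $(".archive-list li").each(function(n){
        items[n] = $(this).html();
    });
    $.post(
        "munipilate-data.php",
        {items: items},
        function(data){
            $("#result").html(data);
        });
});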
I suggest using a regex to parse it.
$html = '<html>
<body>
<div whatever></div>
<div id="archive-wrapper">
<ul class="archive-list">
<li><a href="#1">A</a></li>
<li><a href="#2">B</a></li>
<li><a href="#3">C</a></li>
</ul>
</div>
</body>';
$reg = '/a href=["\']?([^"\' ]*)["\' ]/';
preg_match_all($reg, $html, $m);
$arr = array_map(function($v){
return trim(str_replace('a href=', '', $v), '"');
}, $m[0]);
print '<pre>';
print_r($arr);
print '</pre>';
Output:
Array
(
[0] => #1
[1] => #2
[2] => #3
)

PHP - Get links from within an element after element has been found

I have the following code....
<div class="outer">
<div>
<h1>Christmas</h1>
<ul>
<li>Holiday</li>
<li>Fun</li>
<li>Joy</li>
</ul>
<h1>4th July</h1>
<ul>
<li>Fireworks</li>
<li>Happy</li>
<li>Spectral</li>
</ul>
</div>
</div>
<div class="outer">
<div>
<h1>Christmas2</h1>
<ul>
<li>Holiday</li>
<li>Fun</li>
<li>Joy</li>
</ul>
<h1>4th July</h1>
<ul>
<li>Fireworks2</li>
<li>Happy</li>
<li>Spectral</li>
</ul>
</div>
</div>
I already know that I can find the DIV and then look inside it for the elements by doing:
$doc = new DOMDocument();
$doc->loadHTML($output); // $output being the HTML above
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//div[@class="outer"]'); // check outer
I know these lines will get the elements within each DIV, but what I really want is to get the text of each H1 and then display the LI values next to it.
The output I'm looking for is:
Christmas - Holiday, Fun, Joy
4th July - Fireworks, Happy, Spectral
Christmas2 - Holiday, Fun, Joy
4th July2 - Fireworks, Happy, Spectral
Yes, you can continue to use XPath: select each header, then get its first following sibling list. Example:
$doc = new DOMDocument();
$doc->loadHTML($output);
$xpath = new DOMXpath($doc);
$elements = $xpath->query('//div[@class="outer"]/div');
if ($elements->length > 0) {
    foreach ($elements as $div) {
        foreach ($xpath->query('./h1', $div) as $e) {
            $header = $e->nodeValue;
            $list = array();
            // ul[1] limits the match to the list directly after this header;
            // without it, the first header would also pick up the later lists
            foreach ($xpath->query('./following-sibling::ul[1]/li', $e) as $li) {
                $list[] = $li->nodeValue;
            }
            echo $header . ' - ' . implode(', ', $list) . '<br/>';
        }
        echo '<hr/>';
    }
}
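The same header-to-list pairing can be wrapped in a function that returns the lines instead of echoing them, which makes the behaviour easy to check. This is a sketch under assumptions: the helper name listsByHeading is hypothetical, and the sample is a shortened copy of the question's markup.

```php
<?php
// Hypothetical helper: pair each h1 with the items of the ul that directly follows it.
function listsByHeading(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $lines = [];
    foreach ($xpath->query('//div[@class="outer"]/div/h1') as $h1) {
        $items = [];
        // following-sibling::ul[1] is the nearest ul after this h1 only.
        foreach ($xpath->query('following-sibling::ul[1]/li', $h1) as $li) {
            $items[] = trim($li->nodeValue);
        }
        $lines[] = trim($h1->nodeValue) . ' - ' . implode(', ', $items);
    }
    return $lines;
}

$sample = <<<'HTML'
<div class="outer">
<div>
<h1>Christmas</h1>
<ul><li>Holiday</li><li>Fun</li><li>Joy</li></ul>
<h1>4th July</h1>
<ul><li>Fireworks</li><li>Happy</li><li>Spectral</li></ul>
</div>
</div>
HTML;

$lines = listsByHeading($sample);
echo implode(PHP_EOL, $lines), PHP_EOL;
```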
I've used phpQuery for this type of issue in the past:
// include phpquery
require('phpQuery/phpQuery.php');
// initialize
$doc = phpQuery::newDocumentHTML($markup);
// get the text from the various elements
$h1Value = $doc['h1:first']->text(); // Christmas
// ... etc.
(untested)

Get img src with PHP Simple HTML DOM

I need to get the image src from the following code
HTML
<div class="avatar profile_CF48B2B4A31B43EC96F0561F498CE6BF ">
<a onclick="">
<img id="lazyload_-247847544_0" height="74" width="74" class="avatar potentialFacebookAvatar avatarGUID:CF48B2B4A31B43EC96F0561F498CE6BF" src="http://media-cdn.tripadvisor.com/media/photo-l/05/f3/67/c3/lilrazzy.jpg" />
</a>
</div>
I tried this PHP:
foreach($html->find('div[class=profile_CF48B2B4A31B43EC96F0561F498CE6BF] a img') as $element) {
$img = $element->getAttribute('src');
echo $img;
}
But it reports that the src key doesn't exist. How can I scrape the review avatar images?
UPDATE:
The image URL is not present when I look at the page source, but Firebug shows it:
<img id='lazyload_1953171323_17' height='24' alt='4 helpful votes' width='25' class='icon lazy'/>
Here is my page's source code:
<div class="col1of2">
<div class="member_info">
<div id="UID_3E0FAF58557D3375508A9E5D9A7BD42F-SRC_175428572" class="memberOverlayLink" onmouseover="ta.trackEventOnPage('Reviews','show_reviewer_info_window','user_name_photo'); ta.call('ta.overlays.Factory.memberOverlayWOffset', event, this, 's3 dg rgba_gry update2012', 0, (new Element(this)).getElement('.avatar')&&(new Element(this)).getElement('.avatar').getStyle('border-radius')=='100%'?-10:0);">
<div class="avatar profile_3E0FAF58557D3375508A9E5D9A7BD42F ">
<a onclick=>
<img id='lazyload_1953171323_15' height='74' width='74' class='avatar potentialFacebookAvatar avatarGUID:3E0FAF58557D3375508A9E5D9A7BD42F'/>
</a>
</div>
<div class="username mo">
<span class="expand_inline scrname hvrIE6 mbrName_3E0FAF58557D3375508A9E5D9A7BD42F" onclick="ta.trackEventOnPage('Reviews', 'show_reviewer_info_window', 'user_name_name_click')">Prataspeles</span>
</div>
</div>
<div class="location">
Latvia
</div>
</div>
<div class="memberBadging">
<div id="UID_3E0FAF58557D3375508A9E5D9A7BD42F-CONT" class="totalReviewBadge badge no_cpu" onclick="ta.trackEventOnPage('Reviews','show_reviewer_info_window','review_count'); ta.util.cookie.setPIDCookie('15984'); ta.call('ta.overlays.Factory.memberOverlayWOffset', event, this, 's3 dg rgba_gry update2012', -10, -50);">
<div class="reviewerTitle">Reviewer</div>
<img id='lazyload_1953171323_16' height='24' alt='4 reviews' width='25' class='icon lazy'/>
<span class="badgeText">4 reviews</span>
</div>
<div id="UID_3E0FAF58557D3375508A9E5D9A7BD42F-HV" class="helpfulVotesBadge badge no_cpu" onclick="ta.trackEventOnPage('Reviews','show_reviewer_info_window','helpful_count'); ta.util.cookie.setPIDCookie('15983'); ta.call('ta.overlays.Factory.memberOverlayWOffset', event, this, 's3 dg rgba_gry update2012', -22, -50);">
<img id='lazyload_1953171323_17' height='24' alt='4 helpful votes' width='25' class='icon lazy'/>
<span class="badgeText">4 helpful votes</span>
</div>
</div>
</div>
Is there any problem because of using lazyload?
UPDATE 2
Because of lazyload, the images are loaded only after the page has loaded. I tried getting the image ids and comparing them with the lazyload JS array, but the ids don't match those in the lazyload array.
Question:
How do I get the image URLs out of this JS array, which is JSON?
Example:
{"id":"lazyload_-205858383_0","tagType":"img","scroll":true,"priority":100,"data":"http://media-cdn.tripadvisor.com/media/photo-l/05/f3/67/c3/lilrazzy.jpg"}
, {"id":"lazyload_-205858383_1","tagType":"img","scroll":true,"priority":100,"data":"http://c1.tacdn.com/img2/icons/gray_flag.png"}
, {"id":"lazyload_-205858383_2","tagType":"img","scroll":true,"priority":100,"data":"http://media-cdn.tripadvisor.com/media/photo-l/01/2a/fd/98/avatar.jpg"}
, {"id":"lazyload_-205858383_3","tagType":"img","scroll":true,"priority":100,"data":"http://c1.tacdn.com/img2/icons/gray_flag.png"}
, {"id":"lazyload_-205858383_4","tagType":"img","scroll":true,"priority":100,"data":"http://media-cdn.tripadvisor.com/media/photo-l/01/2e/70/5e/avatar036.jpg"}
, {"id":"lazyload_-205858383_5","tagType":"img","scroll":false,"priority":100,"data":"http://c1.tacdn.com/img2/badges/badge_helpful.png"}
You are having difficulty because JavaScript is used to lazy-load the images once the page has loaded. Use the DOM parser to find the id of the element, and then use a regular expression to find the relevant image based on this id.
To achieve this, try something like:
// Pass true so json_decode returns associative arrays rather than objects
$json = json_decode("<JSONSTRING HERE>", true);
foreach ($html->find('div[class=profile_CF48B2B4A31B43EC96F0561F498CE6BF] a img') as $element) {
    $imgId = $element->getAttribute('id');
    foreach ($json as $lazy) {
        if ($lazy["id"] == $imgId) echo $lazy["data"];
    }
}
The above is untested, so you will need to iron out the kinks. The key is to extract the relevant JavaScript and convert it to JSON.
Alternatively, you can use string search functions to get the row which contains the information about the img, and extract the required value.
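The "extract the JavaScript and convert it to JSON" step could look roughly like the sketch below. It assumes the page embeds the list as an array literal assigned to a variable, here called lazyImgs, and that the statement ends with `];`; the helper name lazyImageMap and the sample page are hypothetical, with the entries copied from the question's example.

```php
<?php
// Hypothetical helper: build an id => URL map from a "var lazyImgs = [...];" script.
function lazyImageMap(string $pageSource, string $varName = 'lazyImgs'): array
{
    // Capture the array literal assigned to the variable (assumed to end with "];").
    $pattern = '/var\s+' . preg_quote($varName, '/') . '\s*=\s*(\[.*?\])\s*;/s';
    if (!preg_match($pattern, $pageSource, $m)) {
        return [];
    }
    // true => associative arrays, so entries can be read with ['id'] and ['data'].
    $entries = json_decode($m[1], true);
    if (!is_array($entries)) {
        return [];
    }
    $map = [];
    foreach ($entries as $entry) {
        if (isset($entry['id'], $entry['data'])) {
            $map[$entry['id']] = $entry['data'];
        }
    }
    return $map;
}

// Assumed page fragment; the two entries come from the question's example.
$page = <<<'HTML'
<img id='lazyload_-205858383_0' height='74' width='74' class='avatar'/>
<script>
var lazyImgs = [
{"id":"lazyload_-205858383_0","tagType":"img","scroll":true,"priority":100,"data":"http://media-cdn.tripadvisor.com/media/photo-l/05/f3/67/c3/lilrazzy.jpg"}
, {"id":"lazyload_-205858383_1","tagType":"img","scroll":true,"priority":100,"data":"http://c1.tacdn.com/img2/icons/gray_flag.png"}
];
</script>
HTML;

$map = lazyImageMap($page);
// Look up the URL for an img id found with the DOM parser.
echo $map['lazyload_-205858383_0'], PHP_EOL;
```

Keying the map by id turns each lookup into a single array access, instead of looping over the whole list for every image.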
If you're looking for all IDs that contain the substring, "lazyload", you might try the wildcard selector and upon a hit look at the 'src' property of the element found. See the jsfiddle below. Good luck!
$(document.body).find('img[id*=lazyload]').each(function() {
console.log($(this).prop('src'));
});
Jsfiddle
Try this -
foreach($html->find('div[class=profile_CF48B2B4A31B43EC96F0561F498CE6BF ] a img') as $element) {
$img = $element->getAttribute('src');
echo $img;
}
There is a space after the class name in the markup, so you have to add a space at the end of the class name in the selector.
Or use the full class name:
$html->find('div[class=avatar profile_CF48B2B4A31B43EC96F0561F498CE6BF ] a img');
Use jQuery selectors, e.g. $('#lazyload_-247847544_0'), and you can get the image source like this:
var src = $('#lazyload_-247847544_0').attr('src');
Or more specifically
$('.profile_CF48B2B4A31B43EC96F0561F498CE6BF #lazyload_-247847544_0').attr('src');
Thanks
function getReviews() {
    $url = 'http://www.tripadvisor.com/Hotel_Review-g274965-d952833-Reviews-Ezera_Maja-Liepaja_Kurzeme_Region.html';
    $html = file_get_html($url);
    $array = array();
    // IMG ID: collect the id of every avatar image
    foreach ($html->find('div[class=avatar] a img') as $element) {
        $array[] = array('id' => $element->getAttribute('id'));
    }
    // IMG SRC: cut the JSON array out of the "var lazyImgs = [...]" script
    $source = (string) $html;
    $p1 = strpos($source, 'var lazyImgs =') + 14;
    $p2 = strpos($source, ']', $p1);
    $raw = substr($source, $p1, $p2 - $p1) . ']';
    $images = json_decode($raw);
    // Attach each image URL to the element with the matching id
    foreach ($images as $image) {
        foreach ($array as $key => $element) {
            if ($element['id'] == $image->id) {
                $array[$key]['image'] = $image->data;
            }
        }
    }
    $html->clear();
    unset($html);
    return $array;
}
Get the IMG ids into an array. Then extract the lazyImgs variable from the page source and decode it as JSON. Then compare the two arrays and, where the ids match, add the data (the image URL) to the array.
Thanks to everybody!

how to get href from within element using php and simple html dom

I have an html page that looks a bit like this
xxxx
google!
<div class="big-div">
<a href="http://www.url.com/123" title="123">
<div class="little-div">xxx</div></a>
<a href="http://www.url.com/456" title="456">
<div class="little-div">xxx</div></a>
</div>
xxxx
I am trying to pull all of the hrefs out of the big-div. I can get all the hrefs out of the whole page by using code like the below.
$links = $html->find ('a');
foreach ($links as $link)
{
echo $link->href.'<br>';
}
But how do I get only the hrefs within the div "big-div"?
Edit:
I think I got it. For those that care:
foreach ($html->find('div[class=big-div]') as $element) {
$links = $element->find('a');
foreach ($links as $link) {
echo $link->href.'<br>';
}
}
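For readers without Simple HTML DOM, the same "links inside one div" idea can be sketched with the built-in DOMDocument and DOMXPath. The helper name hrefsInsideClass is hypothetical, and the google.com link in the sample is an assumption standing in for the "google!" text in the question, added only to show that links outside the div are skipped.

```php
<?php
// Hypothetical helper: return the href of every link nested inside a div
// whose class list contains the given class name.
function hrefsInsideClass(string $html, string $class): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // The class name is interpolated as-is, so only pass trusted values.
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]//a/@href',
        $class
    );

    $hrefs = [];
    foreach ($xpath->query($query) as $attr) {
        $hrefs[] = $attr->nodeValue;
    }
    return $hrefs;
}

$sample = <<<'HTML'
<a href="http://www.google.com">google!</a>
<div class="big-div">
<a href="http://www.url.com/123" title="123">
<div class="little-div">xxx</div></a>
<a href="http://www.url.com/456" title="456">
<div class="little-div">xxx</div></a>
</div>
HTML;

$links = hrefsInsideClass($sample, 'big-div');
print_r($links);
```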
The documentation is useful. The general idea is to find the div first and then the links inside it:
$html->find('.big-div', 0)->find('a');
And then proceed to get the href and whatever other attributes you are interested in.
Edit: The above is the general idea. I've never used Simple HTML DOM, so you may need to tweak the syntax somewhat. Note that find() without an index returns an array, so loop over the result:
foreach ($html->find('.big-div') as $bigDiv) {
    foreach ($bigDiv->find('a') as $link) {
        echo $link->href . '<br>';
    }
}
or perhaps, with a single descendant selector:
foreach ($html->find('.big-div a') as $link) {
    echo $link->href . '<br>';
}
Quick fix: select the links inside the div directly. find() with an index of 0 returns the first match:
$href = $html->find('.big-div a', 0)->href;
