Scrape Facebook page and parse with php

I am trying to get the list of names and profiles of the people who shared a particular post on a Facebook page.
I thought I could use Simple HTML DOM to parse the page with PHP, but I have had no success so far. This is my code:
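<?php
include_once('simple_html_dom.php');

// Set the user agent before any request is made.
ini_set('user_agent', 'My-Application/2.5');

function scraping_shares() {
    $ret = array();
    $html = file_get_html('https://m.facebook.com/shares/view?id=10156833628051729');
    if (!$html) {
        // file_get_html() returns false when the page cannot be loaded.
        return $ret;
    }
    foreach ($html->find('div.bn') as $data) {
        // find() with an index returns null when nothing matches.
        $name = $data->find('h3.bo', 0);
        if ($name) {
            $ret[] = array('name' => trim($name->plaintext));
        }
    }
    $html->clear();
    unset($html);
    return $ret;
}

$ret = scraping_shares();
foreach ($ret as $v) {
    echo $v['name'].' <br>';
}
?>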
Any help please?

Related

PHP Simple HTML DOM - method find retrieve empty array

I am trying to write a PHP parser and I don't know why my code returns an empty array. I am using PHP Simple HTML DOM. I know my code isn't perfect, but it's only for testing.
I would appreciate any help.
public function getData() {
    // Read each URL from urls.txt
    foreach ($this->list_url as $i => $url) {
        // Create a DOM object from the listing page
        $this->html = file_get_html($url);
        if (!$this->html) {
            continue; // the listing page could not be loaded
        }
        // Find all elements with class="name", because every product has a name
        $products = $this->html->find(".name");
        foreach ($products as $number => $product) {
            // Get the href attribute of the product link
            $href = $this->html->find("div.name a", $number)->attr['href'];
            // Create a DOM object from the product page
            $html = file_get_html($href);
            if ($html && is_object($html) && isset($html->nodes)) {
                echo "TRUE - all good";
            } else {
                echo "FALSE - all bad";
                continue; // nothing to search in
            }
            // Get all elements with class="description".
            // $nodes is empty. Why? The web page has content and a div.description.
            $nodes = $html->find('.description');
            if (count($nodes) > 0) {
                $needle = "Производитель:"; // "Manufacturer:" in Russian
                foreach ($nodes as $short_description) {
                    if (stripos($short_description->plaintext, $needle) !== FALSE) {
                        echo "TRUE";
                        $this->data[] = $short_description->plaintext;
                    } else {
                        echo "FALSE";
                    }
                }
            } else {
                $this->data[] = '';
            }
            $html->clear();
            unset($html);
        }
        // Release the listing page as well
        $this->html->clear();
        $this->html = null;
    }
    return $this->data;
}

Hi, you should inspect the element in your browser, choose Copy > Copy selector, and pass that selector to the find method to get the object.

Php Simple Html Dom Parser can't get content on pagination

Hi, I'm a beginner with simple_html_dom. I'm trying to fetch the list of hrefs from the list of posts on this sample website, which has pagination, using the code below.
<?php
include('simple_html_dom.php');
$html = file_get_html('http://www.themelock.com/wordpress/elegantthemes/');
function getArticles($page) {
global $articles;
$html = new simple_html_dom();
$html->load_file($page);
$items = $html->find('h2[class=post-title]');
foreach($items as $post) {
$articles[] = array($post->children(0)->href);
}
foreach($articles as $item) {
echo "<div class='item'>";
echo $item[0];
echo "</div>";
}
}
if($next = $html->find('div[class=navigation]', 0)->last_child() ) {
$URL = $next->href;
$html->clear();
unset($html);
getArticles($URL);
}
?>
As a result I'm getting:
http://www.themelock.com/wordpress/908-minimal-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/892-event-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/882-askit-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/853-lightbright-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/850-inreview-elegantthemes-review-wordpress-theme.html
http://www.themelock.com/wordpress/807-boutique-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/804-elist-elegantthemes-directory-wordpress-theme.html
http://www.themelock.com/wordpress/798-webly-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/795-elegantestate-real-estate-elegantthemes-wordpress-theme.html
http://www.themelock.com/wordpress/786-notebook-elegantthemes-wordpress-theme.html
The above code fetches only the contents of the next (second) page. I'm wondering how to get the post URLs of the first page, followed by those of the next pages.
Does anyone know how to do this?

Thanks for your support, guys. I got this to work using the code below:
<?php
include('simple_html_dom.php');
$url = "http://www.themelock.com/wordpress/yootheme-wordpress/";
// Start from the main page
$nextLink = $url;
// Loop over each next link as long as one exists
while ($nextLink) {
    echo "<hr>nextLink: $nextLink<br>";
    // Create a DOM object
    $html = new simple_html_dom();
    // Load HTML from a URL
    $html->load_file($nextLink);
    $posts = $html->find('h2[class=post-title]');
    foreach ($posts as $post) {
        // Get the link
        $article = $post->children(0)->href;
        echo $article.'<br>';
    }
    // Extract the next link; use NULL when there is no navigation block or no link
    $nav = $html->find('div[class=navigation]', 0);
    $last = $nav ? $nav->last_child() : NULL;
    $nextLink = ($last && $last->href) ? $last->href : NULL;
    // Clear the DOM object
    $html->clear();
    unset($html);
}
?>
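A side note on the loop above: if the last page's navigation ever links back to an earlier page, a plain "follow the next link" loop never ends. The following is only a minimal sketch of the same loop with a visited list and a page cap. It does not download anything: $fakeSite, collectPosts() and the $maxPages parameter are hypothetical stand-ins, and in real code the marked line is where the page would be loaded and parsed.

```php
<?php
// Hypothetical stand-in for the real site: each "page" lists its post links
// and the URL of the next page. The last page deliberately links back to
// the first one.
$fakeSite = array(
    '/page/1' => array('posts' => array('/post-a', '/post-b'), 'next' => '/page/2'),
    '/page/2' => array('posts' => array('/post-c'), 'next' => '/page/3'),
    '/page/3' => array('posts' => array('/post-d'), 'next' => '/page/1'),
);

// Collect the post links of every page, starting with the first page.
function collectPosts(array $site, $startUrl, $maxPages = 50) {
    $posts = array();
    $visited = array();
    $nextLink = $startUrl;

    while ($nextLink !== null && count($visited) < $maxPages) {
        // Stop if the page is unknown or has already been processed.
        if (!isset($site[$nextLink]) || isset($visited[$nextLink])) {
            break;
        }
        $visited[$nextLink] = true;

        // In real code this is where the page would be downloaded and parsed.
        $page = $site[$nextLink];
        foreach ($page['posts'] as $href) {
            $posts[] = $href;
        }
        $nextLink = $page['next'];
    }
    return $posts;
}

$allPosts = collectPosts($fakeSite, '/page/1');
echo implode("\n", $allPosts), "\n";
```

Returning the links instead of echoing them inside the loop also makes it easier to cache or de-duplicate them later.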

Saving objects in array back to JSON

I'm trying to filter out some unwanted Instagram images based on a review score that is a part of the photo caption.
I'm using the Instagram PHP API by cosenary for it and I have the data, and I have been able to filter out the entries with the correct syntax for a review score (*/10).
The problem is getting the objects back into JSON form so I can cache them and don't have to keep making a million requests to Instagram.
I am trying to store the objects in an array and then encode the array back to JSON, but I keep getting this error:
Trying to get property of non-object in /Users/***/Sites/instagram/index.php on line 16
Here is what I have so far.
<?php
require_once('instagram.class.php');
$app_key = '<removed>';
$instagram = new Instagram($app_key);
$media = $instagram->getTagMedia('kittens');
$pattern = '/\d+\/+10/';
$fulldata = array();
if($media) {
do {
foreach ($media->data as $entry) {
if(is_object($entry) && preg_match($pattern, $entry->caption->text)) {
$fulldata[] = $entry;
}
}
} while ($media = $instagram->pagination($media));
}
?>
<pre>
<? echo json_encode($fulldata); ?>
</pre>
I'm not sure if that is_object($entry) does much.

This should work:
<?php
require_once('instagram.php');
$app_key = '';
$instagram = new Instagram($app_key);
$media = $instagram->getTagMedia('cats');
$pattern = '/\d+\/+10/';
$fulldata = array();
if ($media) {
    do {
        foreach ($media->data as $entry) {
            // An entry may have no caption; skip it instead of reading ->text on a non-object
            if (!isset($entry->caption->text)) {
                continue;
            }
            if (preg_match($pattern, $entry->caption->text)) {
                $fulldata[] = $entry;
            }
        }
    } while ($media = $instagram->pagination($media));
}
?>
<pre>
<?php echo json_encode($fulldata); ?>
</pre>
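For what it's worth, "Trying to get property of non-object" is what PHP reports when a property is read from something that is not an object, for example when $entry->caption is null and the code reads ->text from it. Here is a minimal sketch of guarding against that, runnable without the Instagram class. The JSON sample is made up for illustration and only assumes the same data / caption / text shape used in the question.

```php
<?php
// Hypothetical sample shaped like the response in the question;
// the second entry has no caption at all.
$json = '{"data":[
    {"id":"1","caption":{"text":"Lovely kitten, 9/10"}},
    {"id":"2","caption":null},
    {"id":"3","caption":{"text":"No score here"}}
]}';
$media = json_decode($json);

$pattern = '/\d+\/+10/';
$fulldata = array();

foreach ($media->data as $entry) {
    // isset() on a property chain is false, and raises no notice,
    // when any link in the chain is missing or null.
    if (!isset($entry->caption->text)) {
        continue;
    }
    if (preg_match($pattern, $entry->caption->text)) {
        $fulldata[] = $entry;
    }
}

// Encode the filtered objects so they can be written to a cache file.
$cached = json_encode($fulldata);
echo $cached, "\n";
```

The encoded string can then be stored with file_put_contents() and read back with json_decode() on later requests.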

Error due to variable scope in PHP

I have the following code:
function scraping_digg() {
// create HTML DOM
$html = file_get_html('http://digg.com/');
// get news block
foreach($html->find('div.news-summary') as $article) {
// get title
$item['title'] = trim($article->find('h3', 0)->plaintext);
// get details
$item['details'] = trim($article->find('p', 0)->plaintext);
// get intro
$item['diggs'] = trim($article->find('li a strong', 0)->plaintext);
$ret[] = $item;
}
// clean up memory
$html->clear();
unset($html);
return $ret;
}
When I run it I get the following error.
Undefined variable: ret in /opt/lampp/htdocs/web_scrapper/example/scraping/example_scraping_digg.php on line
I can't find the fix for the scope of $ret. Please help.

At the beginning of the scraping_digg function, declare the variable:
$ret = array();

The line number would be the most important piece of information!
$ret[] = $item;
This line is likely what triggers the notice. At the start of the function, add something like:
$ret = array();

It's because $ret is undefined.
Try declaring $ret before your loop:
function scraping_digg() {
// create HTML DOM
$html = file_get_html('http://digg.com/');
$ret = array();
// get news block
foreach($html->find('div.news-summary') as $article) {
// get title
$item['title'] = trim($article->find('h3', 0)->plaintext);
// get details
$item['details'] = trim($article->find('p', 0)->plaintext);
// get intro
$item['diggs'] = trim($article->find('li a strong', 0)->plaintext);
$ret[] = $item;
}
// clean up memory
$html->clear();
unset($html);
return $ret;
}

I think the problem is that neither $ret nor $item is initialized within the function scope.
function scraping_digg() {
// create HTML DOM
$html = file_get_html('http://digg.com/');
$ret = array();
// get news block
foreach($html->find('div.news-summary') as $article) {
$item = array();
// get title
$item['title'] = trim($article->find('h3', 0)->plaintext);
// get details
$item['details'] = trim($article->find('p', 0)->plaintext);
// get intro
$item['diggs'] = trim($article->find('li a strong', 0)->plaintext);
$ret[] = $item;
}
// clean up memory
$html->clear();
unset($html);
return $ret;
}

I couldn't even find a div with the class news-summary on Digg's homepage. That foreach loop probably never executes because no matching div is found, so $ret is never defined.
However, you can add $ret = array(); at the top of the function, as hsz mentioned in his answer, to make the error message go away.
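To illustrate the point made in these answers, here is a small sketch with no scraping involved. The function names collectUninitialized() and collectInitialized() are made up for the example. When the loop body never runs, a variable that is only assigned inside the loop does not exist at the return statement.

```php
<?php
// $ret only comes into existence inside the loop body.
function collectUninitialized(array $rows) {
    foreach ($rows as $row) {
        $ret[] = $row;
    }
    // With an empty $rows, $ret was never created; "return $ret;" here
    // would trigger the "Undefined variable" message from the question.
    return isset($ret) ? $ret : null;
}

// $ret exists before the loop, so the function always returns an array.
function collectInitialized(array $rows) {
    $ret = array();
    foreach ($rows as $row) {
        $ret[] = $row;
    }
    return $ret;
}

var_dump(collectUninitialized(array())); // NULL
var_dump(collectInitialized(array()));   // an empty array
```

So the message is less about scope and more about the selector matching nothing: the initialization hides the notice, but an empty result still means the page structure should be checked.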

PHP Simple DOM Parser to Scrape From Multiple URLs

Is it possible to use a foreach loop to scrape multiple URLs from an array? I've been trying, but for some reason it only pulls from the first URL in the array and then shows the results.
include_once('../../simple_html_dom.php');
$link = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($link as $links) {
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value; }
// get title
$ret['ASIN'] = end($values);
// get rating
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] =$html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$ret = scraping_IMDB($links);
foreach($ret as $k=>$v)
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
Here is the code, since the comment formatting didn't work. :) It's very dirty because I just edited one of the examples to see if I could get it to do what I wanted.

include_once('../../simple_html_dom.php');
function scraping_IMDB($links) {
// create HTML DOM
$html = file_get_html($links);
// What is this spaghetti code good for?
/*
$values = array();
foreach($html->find('input') as $element) {
$values[$element->id=='ASIN'] = $element->value;
}
// get title
$ret['ASIN'] = end($values);
*/
foreach($html->find('input') as $element) {
if($element->id == 'ASIN') {
$ret['ASIN'] = $element->value;
}
}
// Or you could use the following instead of the whole foreach loop above:
//
// $ret['ASIN'] = $html->find('input[id="ASIN"]', 0)->value;
//
// The 0 means "return the first match". I just had a look at
// Amazon's source code, and it contains 2 HTML tags with id='ASIN'.
// If they were following the HTML rules, there would be only ONE
// element with a specific id.
// get name and retail price
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
// clean up memory
//$html->clear();
// unset($html);
return $ret;
}
// -----------------------------------------------------------------------------
// test it!
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
This should do the trick.
I have renamed the array to 'links' instead of 'link'. It is an array of links, containing individual links, so foreach($link as $links) seemed wrong, and I changed it to foreach($links as $link).
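One more detail worth spelling out: the original loop declared scraping_IMDB() inside the foreach body. PHP defines such a function when that statement is executed, so the second iteration tries to declare it again and stops with a "Cannot redeclare" fatal error, which would explain why only the first URL was processed. The sketch below shows the general pattern of declaring the function once and calling it per link. scrapeProduct() is a hypothetical stand-in that only takes the product code from the URL instead of downloading the page.

```php
<?php
// Hypothetical stand-in for the scraping function, declared once,
// outside of any loop. It derives the product code from the URL only.
function scrapeProduct($link) {
    $parts = explode('/', trim($link, '/'));
    return array(
        'ASIN'   => end($parts), // last path segment of the URL
        'Source' => $link,
    );
}

$links = array(
    'http://www.amazon.com/dp/B0038JDEOO/',
    'http://www.amazon.com/dp/B0038JDEM6/',
    'http://www.amazon.com/dp/B004CYX17O/',
);

// Call the function once per link and keep every result.
$results = array();
foreach ($links as $link) {
    $results[] = scrapeProduct($link);
}

foreach ($results as $ret) {
    foreach ($ret as $k => $v) {
        echo '<strong>'.$k.'</strong> '.$v.'<br />';
    }
}
```

Collecting the per-link arrays in $results, rather than printing inside the first loop, keeps the scraped data available for later use.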

I really need to ask this question, as the answer will help many other people who read this thread. What if you collected the results as articles, the way the examples on the Simple HTML DOM site do?
$ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
$ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
return $ret;
}
$links = array (
'http://www.amazon.com/dp/B0038JDEOO/',
'http://www.amazon.com/dp/B0038JDEM6/',
'http://www.amazon.com/dp/B004CYX17O/'
);
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
What if it's $articles?
$articles[] = $item;
}
//print_r($articles);
$links = array (
'http://link1.com',
'http://link2.com',
'http://link3.com'
);
What would this part look like?
foreach ($links as $link) {
$ret = scraping_IMDB($link);
foreach($ret as $k=>$v) {
echo '<strong>'.$k.'</strong>'.$v.'<br />';
}
}
I've seen this multiple-links question all over Stack Overflow for the past two years, and I still cannot figure it out. It would be great to get a basic handle on it, in the style of the Simple HTML DOM examples.
Thanks.
This is my first time posting, so I'm sure I broke a bunch of rules and didn't format the code section right. I just badly needed to ask this question.
