Can't parse the titles of some links using function - php

I've written a script to parse the title of each page using the links collected from this URL. To be clearer: my script below is supposed to parse all the links from the landing page and then reuse those links to go one layer deep and parse the titles of the posts there.
As this is my first ever attempt to write anything in PHP, I can't figure out where I'm going wrong.
This is my attempt so far:
<?php
include("simple_html_dom.php");

$baseurl = "https://stackoverflow.com";

function get_links($baseurl)
{
    $weburl = "https://stackoverflow.com/questions/tagged/web-scraping";
    $html = file_get_html($weburl);
    $processed_links = array();
    foreach ($html->find(".summary h3 a") as $a) {
        $links = $a->href . '<br>';
        $processed_links[] = $baseurl . $links;
    }
    return implode("\n", $processed_links);
}

function reuse_links($processed_links)
{
    $ihtml = file_get_html($processed_links);
    foreach ($ihtml->find("h1 a") as $item) {
        echo $item->innertext;
    }
}

$pro_links = get_links($baseurl);
reuse_links($pro_links);
?>
When I execute the script, it produces the following error:
Warning: file_get_contents(https://stackoverflow.com/questions/52347029/getting-all-the-image-urls-from-a-given-instagram-user<br> https://stackoverflow.com/questions/52346719/unable-to-print-links-in-another-function<br> https://stackoverflow.com/questions/52346308/bypassing-technical-limitations-of-instagram-bulk-scraping<br> https://stackoverflow.com/questions/52346159/pulling-the-href-from-a-link-when-web-scraping-using-python<br> https://stackoverflow.com/questions/52346062/in-url-is-indicated-as-query-or-parameter-in-an-attempt-to-scrap-data-using<br> https://stackoverflow.com/questions/52345850/not-able-to-print-link-from-beautifulsoup-for-web-scrapping<br> https://stackoverflow.com/questions/52344564/web-scraping-data-that-was-shown-previously<br> https://stackoverflow.com/questions/52344305/trying-to-encode-decode-locations-when-scraping-a-website<br> https://stackoverflow.com/questions/52343297/cant-parse-the-titles-of-some-links-using-function<br> https: in C:\xampp\htdocs\differenttuts\simple_html_dom.php on line 75
Fatal error: Uncaught Error: Call to a member function find() on boolean in C:\xampp\htdocs\differenttuts\testfile.php:18 Stack trace: #0 C:\xampp\htdocs\differenttuts\testfile.php(23): reuse_links('https://stackov...') #1 {main} thrown in C:\xampp\htdocs\differenttuts\testfile.php on line 18
Once again: I expect my script to track the links from the landing page and parse the titles from their target pages.

I'm not very familiar with simple_html_dom, but I'll try to answer the question. This library uses file_get_contents to perform HTTP requests, but in PHP 7 file_get_contents doesn't accept a negative offset (which is the default for this library) when retrieving network resources.
If you're using PHP 7, you'll have to set the offset to 0.
$html = file_get_html($url, false, null, 0);
In your get_links function you join your links into a string. I think it's best to return an array, since you'll need those links for new HTTP requests in the next function. For the same reason you shouldn't add break tags to the links; you can add the break when you print.
function get_links($url)
{
    $processed_links = array();
    $base_url = implode("/", array_slice(explode("/", $url), 0, 3));
    $html = file_get_html($url, false, null, 0);
    foreach ($html->find(".summary h3 a") as $a) {
        $link = $base_url . $a->href;
        $processed_links[] = $link;
        echo $link . "<br>\n";
    }
    return $processed_links;
}
function reuse_links($processed_links)
{
    foreach ($processed_links as $link) {
        $ihtml = file_get_html($link, false, null, 0);
        foreach ($ihtml->find("h1 a") as $item) {
            echo $item->innertext . "<br>\n";
        }
    }
}
$url = "https://stackoverflow.com/questions/tagged/web-scraping";
$pro_links = get_links($url);
reuse_links($pro_links);
I think it makes more sense to pass the main URL as a parameter to get_links; we can get the base URL from it. I've used array functions for the base URL, but you could use parse_url, which is the appropriate function.
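For reference, here's a minimal sketch of the parse_url approach (the helper name get_base_url is just illustrative, not part of the original script):
function get_base_url($url)
{
    // parse_url splits the URL into its components ('scheme', 'host', etc.)
    $parts = parse_url($url);
    return $parts['scheme'] . '://' . $parts['host'];
}

echo get_base_url("https://stackoverflow.com/questions/tagged/web-scraping");
// prints: https://stackoverflow.com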

Related

Why is this PHP code not working?

This is the code I am using to scrape specific data from http://www.partyhousedecorations.com,
however I keep getting this error (Fatal error: Call to a member function children() on a non-object in C:\wamp\www\webScraping\PartyHouseDecorations.php on line 8) and I am stuck and can't seem to fix it.
This is my code:
<?php
include_once("simple_html_dom.php");

$serv = $_GET['search'];
$url = 'http://www.partyhousedecorations.com/category-adult-birthday-party-themes' . $serv;
$output = file_get_html($url);
$arrOfStuff = $output->find('div[class=product-grid]', 0)->children();
foreach ($arrOfStuff as $item) {
    echo "Party House Decorations" . '<br>';
    echo $item->find('div[class=name]', 0)->find('a', 0)->innertext . '<br>';
    echo '<img src="http://www.partyhousedecorations.com' . $item->find('div[class=image]', 0)->find('img', 0)->src . '"><br>';
    echo str_replace('KWD', 'AED', $item->find('div[class=price]', 0)->innertext . '<br>');
}
?>
Looks like $output->find('div[class=product-grid]', 0) doesn't return an object with a method called children(). Maybe it's returning null or something else that isn't an object. Put it in a separate variable and look at what the value of that variable is.
$what_is_this = $output->find('div[class=product-grid]', 0);
var_dump($what_is_this);
Update:
I debugged your program, and apart from the simple html dom parser seemingly expecting classes to be given as 'div.product-grid' instead of 'div[class=x]', it also turns out that the webpage responds with a product list instead of a product grid. I've included a working copy below.
<?php
include_once("simple_html_dom.php");

$serv = $_GET['search'];
$url = 'http://www.partyhousedecorations.com/category-adult-birthday-party-themes';
$output = file_get_html($url);
$arrOfStuff = $output->find('div.product-list', 0)->children();
foreach ($arrOfStuff as $item) {
    echo "Party House Decorations" . '<br>';
    echo $item->find('div.name', 0)->find('a', 0)->innertext . '<br>';
    echo '<img src="http://www.partyhousedecorations.com' . $item->find('div.image', 0)->find('img', 0)->src . '"><br>';
    echo str_replace('KWD', 'AED', $item->find('div.price', 0)->innertext . '<br>');
}
?>

Get table value using PHP Simple HTML DOM

I'm writing this PHP to read the data from the following website, and then write it into a database.
Here's the code:
<?php
require('simple_html_dom.php');

$html = file_get_html('http://backpack.tf/pricelist/spreadsheet');
$data = $html->find('.table tr td[1]');
foreach ($data as $result) {
    echo $result->plaintext . '<br />';
}
?>
I intended to get all the data in the tds and even the attributes inside the trs.
So, I tried getting them as plain text first.
So far the code returns:
Fatal error: Call to a member function find() on a non-object
How can I solve and improve the code?
The following code works for your example.
It could be the memory limit of your executing script that's causing the trouble.
ini_set('memory_limit', '160M');

require('simple_html_dom.php');

$url = 'http://backpack.tf/pricelist/spreadsheet';
$html = new simple_html_dom();
$html->load_file($url);
$data = $html->find('.table tr td[1]');
foreach ($data as $result) {
    echo $result->plaintext . '<br />';
}

Trying to scrape images from reddit, having trouble cleaning up strings

So I'm not asking you to fix my script; if you know the answer, I would appreciate it if you just pointed me in the right direction. This is a script I found and I'm trying to edit it for a project.
I believe that what's going on is that the formatting of $reddit is causing problems when I insert that string into $url. I am not sure how to filter the string.
Right after I posted this I had the idea of using concatenation on $reddit to get the desired result instead of filtering the string. Not sure.
Thanks!
picgrabber.php
include("RIS.php");
$reddit = "pics/top/?sort=top&t=all";
$pages = 5;
$t = new RIS($reddit, $pages);
$t->getImagesOnPage();
$t->saveImage();
RIS.php
class RIS {
    var $after = "";
    var $reddit = "";

    public function __construct($reddit, $pages) {
        $this->reddit = preg_replace('/[^A-Za-z0-9\-]/', '', $reddit);
        if (!file_exists($this->reddit)) {
            mkdir($this->reddit, 0755);
        }
        $pCounter = 1;
        while ($pCounter <= $pages) {
            $url = "http://reddit.com/r/$reddit/.json?limit=100&after=$this->after";
            $this->getImagesOnPage($url);
            $pCounter++;
        }
    }

    private function getImagesOnPage($url) {
        $json = file_get_contents($url);
        $js = json_decode($json);
        foreach ($js->data->children as $n) {
            if (preg_match('(jpg$|gif$|png$)', $n->data->url, $match)) {
                echo $n->data->url . "\n";
                $this->saveImage($n->data->url);
            }
            $this->after = $js->data->after;
        }
    }

    private function saveImage($url) {
        $imgName = explode("/", $url);
        $img = file_get_contents($url);
        // if the file doesn't already exist...
        if (!file_exists($this->reddit . "/" . $imgName[count($imgName) - 1])) {
            file_put_contents($this->reddit . "/" . $imgName[count($imgName) - 1], $img);
        }
    }
}
Notice: Trying to get property of non-object in C:\Program Files (x86)\EasyPHP-DevServer-13.1VC9\data\localweb\RIS.php on line 33
Warning: Invalid argument supplied for foreach() in C:\Program Files (x86)\EasyPHP-DevServer-13.1VC9\data\localweb\RIS.php on line 33
Fatal error: Call to private method RIS::getImagesOnPage() from context '' in C:\Program Files (x86)\EasyPHP-DevServer-13.1VC9\data\localweb\vollyeballgrabber.php on line 23
line 33:
foreach($js->data->children as $n) {
var_dump($url);
returns:
string(78) "http://reddit.com/r/pics/top/?sort=top&t=all/.json?limit=100&after=" NULL
$reddit in picgrabber.php has GET parameters
In the class RIS, you're embedding that value into a string that has another set of GET parameters in it, with the ".json" token in between.
The resulting url is:
http://reddit.com/r/pics/top/?sort=top&t=all/.json?limit=100&after=
The ".json" token needs to come after the end of the location portion of the url and before the GET sets. I would also change any addition "?" tokens to "&" (ampersands) so any additional sets of GET parameters you decide to concatenate to the URL string become additional parameters.
Like this:
http://reddit.com/r/pics/top/.json?sort=top&t=all&limit=100&after=
The difference is, your URL returns HTML because the reddit server doesn't understand how to parse what you're sending, and you're trying to parse that HTML with a JSON decoder. My URL returns actual JSON data, which should get your JSON decoder returning an actual JSON object array.
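As a rough sketch of that construction (assuming you keep the subreddit path and its GET parameters in separate variables; $subreddit and $params are illustrative names, not from the original script):
// Build the URL as path + ".json" + all GET parameters joined with "&".
$subreddit = "pics/top";
$params    = "sort=top&t=all";
$after     = "";

$url = "http://reddit.com/r/" . $subreddit . "/.json?" . $params . "&limit=100&after=" . $after;
echo $url . "\n";
// http://reddit.com/r/pics/top/.json?sort=top&t=all&limit=100&after=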

PHP script that counts the number of outgoing links on a page and ignores the rel="nofollow" ones

I am a PHP newb, but I am pretty sure this will be hard to accomplish and very server-consuming. But I want to ask and get the opinion of much smarter users than myself.
Here is what I am trying to do:
I have a list of URL's, an array of URL's actually.
For each URL, I want to count the outgoing links - which do NOT have the rel="nofollow" attribute - on that page.
So, in a way, I'm afraid I'll have to make PHP load the page and preg_match all the links using regular expressions?
Would this work if I had, let's say, 1000 links?
Here is what I am thinking, putting it in code:
$homepage = file_get_contents('http://www.site.com/');
$homepage = htmlentities($homepage);
// Do a preg_match for http:// and count the number of appearances:
$urls = preg_match();
// Do a preg_match for rel="nofollow" and count the nr of appearances:
$nofollow = preg_match();
// Do a preg_match for the number of "domain.com" appearances so we can subtract the website's internal links:
$internal_links = preg_match();
// Subtract and get the final result:
$result = $urls - $nofollow - $internal_links;
Hope you can help, and if the idea is right maybe you can help me with the preg_match functions.
You can use PHP's DOMDocument class to parse the HTML and parse_url to parse the URLs:
$url = 'http://stackoverflow.com/';
$pUrl = parse_url($url);

// Load the HTML into a DOMDocument
$doc = new DOMDocument;
@$doc->loadHTMLFile($url);

// Look for all the 'a' elements
$links = $doc->getElementsByTagName('a');
$numLinks = 0;
foreach ($links as $link) {
    // Exclude if not a link or has 'nofollow'
    preg_match_all('/\S+/', strtolower($link->getAttribute('rel')), $rel);
    if (!$link->hasAttribute('href') || in_array('nofollow', $rel[0])) {
        continue;
    }

    // Exclude if internal link
    $href = $link->getAttribute('href');
    if (substr($href, 0, 2) === '//') {
        // Deal with protocol relative URLs as found on Wikipedia
        $href = $pUrl['scheme'] . ':' . $href;
    }

    $pHref = @parse_url($href);
    if (!$pHref || !isset($pHref['host']) ||
        strtolower($pHref['host']) === strtolower($pUrl['host'])
    ) {
        continue;
    }

    // Increment counter otherwise
    echo 'URL: ' . $link->getAttribute('href') . "\n";
    $numLinks++;
}
echo "Count: $numLinks\n";
You can use SimpleHTMLDOM:
// Create DOM from URL or file
$html = file_get_html('http://www.site.com/');

// Find all links
foreach ($html->find('a[href][rel!=nofollow]') as $element) {
    echo $element->href . '<br>';
}
As I'm not sure that SimpleHTMLDOM supports a :not selector, and [rel!=nofollow] might only return a tags that have a rel attribute present (and not ones where it's missing), you may have to:
foreach($html->find('a[href][!rel][rel!=nofollow]') as $element)
Note the added [!rel]. Or, do it manually instead of with a CSS attribute selector:
// Find all links
foreach ($html->find('a[href]') as $element) {
    if (strtolower($element->rel) != 'nofollow') {
        echo $element->href . '<br>';
    }
}

simplexml load on google weather api problem

Hi, I have been having problems with the google weather api, getting errors like Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 2: parser error ....
I tried to use the script of the main author (thinking the problem was in my edited script), but I am still getting these errors. I tried these two:
//komunitasweb.com/2009/09/showing-the-weather-with-php-and-google-weather-api/
and
//tips4php.net/2010/07/local-weather-with-php-and-google-weather/
The weird part is that sometimes it fixes itself, then goes back to the error. I have been using it for months now without any problem; this just happened yesterday. Also, the demo pages of the authors are working, but I have the exact same code. Any help please.
This is my site: http://j2sdesign.com/weather/widgetlive1.php
@Mike, I added your code:
<?php
$xml = file_get_contents('http://www.google.com/ig/api?weather=jakarta');
if (!simplexml_load_string($xml)) {
    file_put_contents('malformed.xml', $xml);
}

$xml = simplexml_load_file('http://www.google.com/ig/api?weather=jakarta');
$information = $xml->xpath("/xml_api_reply/weather/forecast_information");
$current = $xml->xpath("/xml_api_reply/weather/current_conditions");
$forecast_list = $xml->xpath("/xml_api_reply/weather/forecast_conditions");
?>
and made a list of the errors, but I can't seem to pin down the error because it keeps fixing itself and then, after some time, goes back to the error.
here is the content of the file
<?php
include_once('simple_html_dom.php');

// create doctype
$dom = new DOMDocument("1.0");

// display document in browser as plain text
// for readability purposes
//header("Content-Type: text/plain");

// create root element
$xmlProducts = $dom->createElement("products");
$dom->appendChild($xmlProducts);

$pages = array(
    'http://myshop.com/small_houses.html',
    'http://myshop.com/medium_houses.html',
    'http://myshop.com/large_houses.html'
);

foreach ($pages as $page) {
    $product = array();
    $source = file_get_html($page);
    foreach ($source->find('img') as $src) {
        if (strpos($src->src, "http://myshop.com") === false) {
            $product['image'] = "http://myshop.com/$src->src";
        }
    }
    foreach ($source->find('p[class*=imAlign_left]') as $description) {
        $product['description'] = $description->innertext;
    }
    foreach ($source->find('span[class*=fc3]') as $title) {
        $product['title'] = $title->innertext;
    }
    // debug purposes!
    echo "Current Page: " . $page . "\n";
    print_r($product);
    echo "\n\n\n"; // Clear separator
}
?>
When simplexml_load_string() fails you need to store the data you're trying to load somewhere for review. Examining the data is the first step in diagnosing what is causing the error.
$xml = file_get_contents('http://example.com/file.xml');
if (!simplexml_load_string($xml)) {
    file_put_contents('malformed.xml', $xml);
}
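If the saved document turns out to be almost-valid XML, libxml's error functions can also help pinpoint the parser error; a minimal sketch (simplexml_load_string returns false on failure when warnings are suppressed this way):
// Collect parser errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$doc = simplexml_load_string($xml);
if ($doc === false) {
    foreach (libxml_get_errors() as $error) {
        // Each error records the line in the source XML and a message.
        echo "Line {$error->line}: {$error->message}";
    }
    libxml_clear_errors();
}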
