I've written some PHP code to scrape some of the links I'm interested in from the main page of Wikipedia. When I execute my script, the links come through as expected.
However, at this point I've defined two functions within my script in order to learn how to pass links from one function to another. My goal is to print the links in the latter function, but it only prints the first link and nothing else.
If I use only fetch_wiki_links(), I can get several links, but when I try to print the same thing from get_links_in_ano_func(), it prints the first link only.
How can I get them all even when I use the second function?
This is what I've written so far:
include("simple_html_dom.php");
$prefix = "https://en.wikipedia.org";
function fetch_wiki_links($prefix)
{
$weblink = "https://en.wikipedia.org/wiki/Main_Page";
$htmldoc = file_get_html($weblink);
foreach ($htmldoc->find("a[href^='/wiki/']") as $a) {
$links = $a->href . '<br>';
$absolute_links = $prefix . $links;
return $absolute_links;
}
}
function get_links_in_ano_func($absolute_links)
{
echo $absolute_links;
}
$items = fetch_wiki_links($prefix);
get_links_in_ano_func($items);
Your function returns on the very first iteration of the loop, so only one link ever comes back. You will need something like this:
function fetch_wiki_links($prefix)
{
    $weblink = "https://en.wikipedia.org/wiki/Main_Page";
    $htmldoc = file_get_html($weblink);
    $absolute_links = array();
    foreach ($htmldoc->find("a[href^='/wiki/']") as $a) {
        $links = $a->href . '<br>';
        $absolute_links[] = $prefix . $links;
    }
    return implode("\n", $absolute_links);
}
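If you would rather keep the two-function setup from the question, a minimal sketch (assuming fetch_wiki_links() is changed to return the $absolute_links array itself rather than the imploded string) could be:

function get_links_in_ano_func($absolute_links)
{
    // Print each collected link on its own line.
    foreach ($absolute_links as $link) {
        echo $link . "<br>\n";
    }
}

$items = fetch_wiki_links($prefix);   // now returns an array of links
get_links_in_ano_func($items);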
I'm trying to code a website that takes the comments of a Reddit page and shows them. However, the comments have replies, and I want to show those too, but a comment can have none, one, or more replies, and those replies can have replies of their own. Is there a way to repeat the same code for all of the replies with minor differences (indentation)? I'm using the Reddit JSON feature and am getting the JSON from something like this: https://www.reddit.com/r/pcmasterrace/comments/dln0o3/foldinghome_and_pcmr_team_up_use_your_pc_to_help/.json.
I have:
$url = 'https://www.reddit.com/r/pcmasterrace/comments/dln0o3/foldinghome_and_pcmr_team_up_use_your_pc_to_help/.json';
$json = file_get_contents($url);
$obj = json_decode($json, true);
$comment_array = array_slice($obj[1]['data']['children'], 0, 50);

echo '<div class="comments">';
foreach ($comment_array as $c) {
    echo "<p>(" . $c['data']['author'] . ") " . $c['data']['score'] . " Points<br>" . $c['data']['body'] . "</p>";
    if (!($c['data']['replies'] == "")) {
        $r1_array = $c['data']['replies']['data']['children'];
        foreach ($r1_array as $r1) {
            echo "<p> (" . $r1['data']['author'] . ") " . $r1['data']['score'] . " Points<br> " . $r1['data']['body'] . "</p>";
            if (!($r1['data']['replies'] == "")) {
                $r2_array = $r1['data']['replies']['data']['children'];
                foreach ($r2_array as $r2) {
                    echo "<p> (" . $r2['data']['author'] . ") " . $r2['data']['score'] . " Points<br> " . $r2['data']['body'] . "</p>";
                }
            }
        }
    }
}
echo '</div>';
This produces the desired result, with replies to replies and such. However, it's a bit messy, and if there is a really long reply chain, it won't catch it. Is there a way to make it repeat somehow or should I just copy and paste it a bunch of times?
Thanks very much!
I think you're looking for a concept called recursion.
The basic idea is that a function will call itself as many times as needed (as opposed to using a fixed number of loops).
Something like this:
<?php
function output($data, $level = 0) {
    $spaces = str_repeat(' ', $level);
    foreach ($data as $post) {
        echo "<p>(" . $spaces . $post['data']['author'] . ") " . $post['data']['score'] . " Points<br>" . $post['data']['body'] . "</p>\r\n";
        if ($post['data']['replies']) {
            // Notice that we are calling the function again, this time increasing the level.
            // This is the "recursive" part of the function.
            output($post['data']['replies']['data']['children'], $level + 1);
        }
    }
}

$url = 'https://www.reddit.com/r/pcmasterrace/comments/dln0o3/foldinghome_and_pcmr_team_up_use_your_pc_to_help/.json';
$json = file_get_contents($url);
$data = json_decode($json, true);
$comment_array = array_slice($data[1]['data']['children'], 0, 50);

output($comment_array);
?>
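One small note on the indentation: browsers collapse runs of plain spaces, so the str_repeat(' ', $level) prefix will not be visible in rendered HTML. A minimal sketch of the same recursion using a left margin per nesting level instead (the 20 pixels per level is just an assumption, adjust to taste):

<?php
function output($data, $level = 0) {
    $indent = $level * 20; // pixels of left margin per nesting level
    foreach ($data as $post) {
        echo "<p style='margin-left: {$indent}px'>(" . $post['data']['author'] . ") "
           . $post['data']['score'] . " Points<br>" . $post['data']['body'] . "</p>\r\n";
        if ($post['data']['replies']) {
            // Same recursive call as above, one level deeper.
            output($post['data']['replies']['data']['children'], $level + 1);
        }
    }
}
?>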
This is my entire code:
// include the scraper
include('simple_html_dom.php');

// connect to the page for scraping
$html = file_get_html('http://www.niagarafallsreview.ca/news/local');

// make empty arrays
$headlines = array();
$links = array();

// look for 'h1' headings on the page
foreach ($html->find('h1') as $header) {
    $headlines[] = $header->plaintext;
}

// look for 'a' links whose href starts with 'http://www.niagarafallsreview.ca/2016/04/'
foreach ($html->find('a[href^="http://www.niagarafallsreview.ca/2016/04/"]') as $link) {
    $links[] = $link->href;
}

// trim the headlines because the first and last ones are not needed
$output = array_slice($headlines, 1, -1);

// for each headline, output a nice list entry
foreach ($output as $headers) {
    echo "<a href='#'>$headers</a>" . "<br />";
}

// make sure the links are unique and no duplicates remain
$result = array_unique($links);

// for each link, output it in a nice list
foreach ($result as $linkk) {
    echo "<a href='$linkk'>$linkk</a>" . "<br />";
}
This code produces the headings in a nice list, and it also produces a nice list of the links.
My problem is that I need to combine them: I would like $headers to be the text of the link, and $linkk to be the URL in the href,
like this:
<a href='$linkk'>$headers</a>
I don't know how to do this, as I have two foreach statements. I tried to combine them but was unsuccessful.
Any help will be greatly appreciated.
Thanks.
Try this:
// include the scraper
include('simple_html_dom.php');

// connect to the page for scraping
$html = file_get_html('http://www.niagarafallsreview.ca/news/local');

// make empty arrays
$headlines = array();
$links = array();

// look for 'h1' headings on the page
foreach ($html->find('h1') as $header) {
    $headlines[] = $header->plaintext;
}

// look for 'a' links whose href starts with 'http://www.niagarafallsreview.ca/2016/04/'
foreach ($html->find('a[href^="http://www.niagarafallsreview.ca/2016/04/"]') as $link) {
    $links[] = $link->href;
}

// trim the headlines because the first and last ones are not needed
$output = array_slice($headlines, 1, -1);

// make sure the links are unique and no duplicates remain
$result = array_unique($links);

// output each link with its matching headline
foreach ($result as $i => $linkk) {
    $headline = isset($output[$i]) ? $output[$i] : '(empty)';
    echo "<a href='$linkk'>$headline</a>" . "<br />";
}
Here is the foreach you are looking for:
foreach ($output as $i => $headers) {
    $linkk = $result[$i];
    echo "<a href='$linkk'>$headers</a>" . "<br />";
}
This assumes the arrays have the same length and also the correct order.
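One caveat: array_unique() preserves the original keys, so the numeric keys of $result can have gaps and drift out of line with $output. A minimal sketch that re-indexes first and guards against the two arrays differing in length:

// Re-index the de-duplicated links so the numeric keys match $output again.
$result = array_values(array_unique($links));
$count = min(count($output), count($result));
for ($i = 0; $i < $count; $i++) {
    echo "<a href='{$result[$i]}'>{$output[$i]}</a>" . "<br />";
}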
I am trying to read a link from one page, print the URL, go to that page, read the link on the next page in the same location, print that URL, go to that page, and so on.
All I'm doing is reading the URL and passing it as an argument to the get_links() function until there are no more links.
This is my code, but it throws:
Fatal error: Call to a member function find() on a non-object.
Does anyone know how to fix this?
<?php
$mainPage = 'https://www.bu.edu/link/bin/uiscgi_studentlink.pl/1346752597?ModuleName=univschr.pl&SearchOptionDesc=Class+Subject&SearchOptionCd=C&KeySem=20133&ViewSem=Fall+2012&Subject=&MtgDay=&MtgTime=';
get_links($mainPage);

function get_links($url) {
    $data = new simple_html_dom();
    $data = file_get_html($url);
    $nodes = $data->find("input[type=hidden]");
    $fURL = $data->find("/html/body/form");
    $firstPart = $fURL[0]->action . '<br>';
    foreach ($nodes as $node) {
        $val = $node->value;
        $name = $node->name;
        $name . '<br />';
        $val . "<br />";
        $str1 = $str1 . "&" . $name . "=" . $val;
    }
    $fixStr1 = str_replace('&College', '?College', $str1);
    $fixStr2 = str_replace('Fall 2012', 'Fall+2012', $fixStr1);
    $fixStr3 = str_replace('Class Subject', 'Class+Subject', $fixStr2);
    $fixStr4 = $firstPart . $fixStr3;
    echo $nextPageURL = chop($fixStr4);
    get_links($nextPageURL);
}
?>
Alright, so I was using the load->file() function somewhere in my code and did not see it until I really combed through it. I finally have a running script :) The key is to use file_get_html() instead of loading the webpage as an object with the load->file() function.
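For reference, a minimal sketch of what the corrected function could look like (assuming, as in the original script, that the page contains one form whose action is an absolute URL and whose hidden inputs carry the next page's parameters):

<?php
// Sketch of the fix described above: fetch each page with file_get_html(),
// rebuild the next-page URL from the hidden form fields, and recurse
// until no form is found.
include('simple_html_dom.php');

function get_links($url) {
    $data = file_get_html($url);            // use file_get_html() instead of load->file()
    if (!$data) {
        return;                             // stop if the page could not be fetched
    }
    $form = $data->find('form', 0);
    if (!$form) {
        return;                             // stop when there is no next-page form
    }
    $query = array();
    foreach ($data->find('input[type=hidden]') as $node) {
        // urlencode() handles spaces, so no manual str_replace() calls are needed
        $query[] = $node->name . '=' . urlencode($node->value);
    }
    $nextPageURL = $form->action . '?' . implode('&', $query);
    echo $nextPageURL . "<br />\n";
    get_links($nextPageURL);                // recurse into the next page
}

get_links('https://www.bu.edu/link/bin/uiscgi_studentlink.pl/1346752597?ModuleName=univschr.pl&SearchOptionDesc=Class+Subject&SearchOptionCd=C&KeySem=20133&ViewSem=Fall+2012&Subject=&MtgDay=&MtgTime=');
?>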
Is it possible to use a foreach loop to scrape multiple URLs from an array? I've been trying, but for some reason it will only pull from the first URL in the array and then show the results.
include_once('../../simple_html_dom.php');

$link = array(
    'http://www.amazon.com/dp/B0038JDEOO/',
    'http://www.amazon.com/dp/B0038JDEM6/',
    'http://www.amazon.com/dp/B004CYX17O/'
);

foreach ($link as $links) {
    function scraping_IMDB($links) {
        // create HTML DOM
        $html = file_get_html($links);
        $values = array();
        foreach ($html->find('input') as $element) {
            $values[$element->id == 'ASIN'] = $element->value;
        }
        // get title
        $ret['ASIN'] = end($values);
        // get rating
        $ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
        $ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
        // clean up memory
        //$html->clear();
        //unset($html);
        return $ret;
    }
    // -----------------------------------------------------------------------------
    // test it!
    $ret = scraping_IMDB($links);
    foreach ($ret as $k => $v)
        echo '<strong>' . $k . '</strong>' . $v . '<br />';
}
Here is the code, since the comment section didn't work. :) It's very dirty because I just edited one of the examples to play with it and see if I could get it to do what I wanted.
include_once('../../simple_html_dom.php');

function scraping_IMDB($links) {
    // create HTML DOM
    $html = file_get_html($links);

    // What is this spaghetti code good for?
    /*
    $values = array();
    foreach ($html->find('input') as $element) {
        $values[$element->id == 'ASIN'] = $element->value;
    }
    // get title
    $ret['ASIN'] = end($values);
    */

    foreach ($html->find('input') as $element) {
        if ($element->id == 'ASIN') {
            $ret['ASIN'] = $element->value;
        }
    }

    // Or you could use the following instead of the whole foreach loop above:
    //
    // $ret['ASIN'] = $html->find('input[id="ASIN"]', 0)->value;
    //
    // if the 0 means "return the first found" or something similar.
    // I just had a look at Amazon's source code, and it contains
    // 2 HTML tags with id='ASIN'. If they were following the HTML rules,
    // there would only be ONE element with a specific id.

    // get rating
    $ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
    $ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;

    // clean up memory
    //$html->clear();
    //unset($html);

    return $ret;
}

// -----------------------------------------------------------------------------
// test it!
$links = array(
    'http://www.amazon.com/dp/B0038JDEOO/',
    'http://www.amazon.com/dp/B0038JDEM6/',
    'http://www.amazon.com/dp/B004CYX17O/'
);

foreach ($links as $link) {
    $ret = scraping_IMDB($link);
    foreach ($ret as $k => $v) {
        echo '<strong>' . $k . '</strong>' . $v . '<br />';
    }
}
This should do the trick.
I have renamed the array to 'links' instead of 'link': it's an array of links, containing link(s), so foreach ($link as $links) read backwards, and I changed it to foreach ($links as $link). I also moved the function definition out of the loop; declaring a function inside a foreach triggers a "Cannot redeclare scraping_IMDB()" fatal error on the second iteration, which is why only the first URL was ever processed.
I really need to ask this question, as it will answer many more questions once the world reads this thread. What if ... you used articles, like on the simple html dom site?
    $ret['Name'] = $html->find('h1[class="parseasinTitle"]', 0)->innertext;
    $ret['Retail'] = $html->find('b[class="priceLarge"]', 0)->innertext;
    return $ret;
}

$links = array(
    'http://www.amazon.com/dp/B0038JDEOO/',
    'http://www.amazon.com/dp/B0038JDEM6/',
    'http://www.amazon.com/dp/B004CYX17O/'
);

foreach ($links as $link) {
    $ret = scraping_IMDB($link);
    foreach ($ret as $k => $v) {
        echo '<strong>' . $k . '</strong>' . $v . '<br />';
    }
}
What if it's $articles?
$articles[] = $item;
}
//print_r($articles);
$links = array(
    'http://link1.com',
    'http://link2.com',
    'http://link3.com'
);
What would this area look like?
foreach ($links as $link) {
    $ret = scraping_IMDB($link);
    foreach ($ret as $k => $v) {
        echo '<strong>' . $k . '</strong>' . $v . '<br />';
    }
}
I've seen links like these all over Stack Overflow for the past 2 years, and I still cannot figure it out. It would be great to get a basic handle on this, similar to how the simple html dom examples are laid out.
Thanks.
First time posting; I'm sure I broke a bunch of rules and didn't do the code section right. I just badly needed to ask this question.
When we add a param to the URL:
$redirectURL = $printPageURL . "?mode=1";
it works if $printPageURL is "http://www.somesite.com/print.php", but if $printPageURL is later changed in the global file to "http://www.somesite.com/print.php?newUser=1", the URL becomes badly formed. If the project has 300 files and 30 of them append params this way, we need to change all 30 files.
The same happens if we append using "&mode=1" and $printPageURL changes from "http://www.somesite.com/print.php?new=1" to "http://www.somesite.com/print.php"; the URL is again badly formed.
Is there a library in PHP that automatically handles the "?" and "&", and even checks whether a param already exists and removes it, since it will be replaced by the later one and it is not good if the URL keeps growing longer?
Update: judging from the several helpful answers, there seems to be no pre-existing addParam($url, $newParam) function, so we need to write it ourselves?
Use a combination of parse_url() to explode the URL, parse_str() to explode the query string, and http_build_query() to rebuild the query string. After that you can rebuild the whole URL from the original fragments you got from parse_url() and the new query string you built with http_build_query(). Since the query string gets exploded into an associative array (key-value pairs), modifying the query is as easy as modifying an array in PHP.
EDIT
$query = parse_url('http://www.somesite.com/print.php?mode=1&newUser=1', PHP_URL_QUERY);
// $query = "mode=1&newUser=1"
$params = array();
parse_str($query, $params);
/*
* $params = array(
* 'mode' => '1'
* 'newUser' => '1'
* )
*/
unset($params['newUser']);
$params['mode'] = 2;
$params['done'] = 1;
$query = http_build_query($params);
// $query = "mode=2&done=1"
Use this:
http://hu.php.net/manual/en/function.http-build-query.php
http://www.addedbytes.com/php/querystring-functions/
is a good place to start
EDIT: There's also http://www.php.net/manual/en/class.httpquerystring.php
for example:
$http = new HttpQueryString();
$http->set(array('page' => 1, 'sort' => 'asc'));
$url = "yourfile.php" . $http->toString();
None of these solutions works when the URL is of the form:
xyz.co.uk?param1=2&replace_this_param=2
param1 gets dropped every time, which means it never works!
If you look at the code given above:
function addParam($url, $s) {
    return adjustParam($url, $s);
}

function delParam($url, $s) {
    return adjustParam($url, $s);
}
These functions are IDENTICAL - so how can one add and one delete?!
Using WishCow's and sgehrig's suggestions, here is a test
(assuming no anchor in the URL):
<?php
echo "<pre>\n";

function adjustParam($url, $s) {
    if (preg_match('/(.*?)\?/', $url, $matches)) {
        $urlWithoutParams = $matches[1];
    } else {
        $urlWithoutParams = $url;
    }
    parse_str(parse_url($url, PHP_URL_QUERY), $params);
    if (strpos($s, '=') !== false) {
        list($var, $value) = explode('=', $s, 2); // split() is deprecated; use explode()
        $params[$var] = urldecode($value);
        return $urlWithoutParams . '?' . http_build_query($params);
    } else {
        unset($params[$s]);
        $newQueryString = http_build_query($params);
        if ($newQueryString) {
            return $urlWithoutParams . '?' . $newQueryString;
        } else {
            return $urlWithoutParams;
        }
    }
}

function addParam($url, $s) {
    return adjustParam($url, $s);
}

function delParam($url, $s) {
    return adjustParam($url, $s);
}

echo "trying add:\n";
echo addParam("http://www.somesite.com/print.php", "mode=3"), "\n";
echo addParam("http://www.somesite.com/print.php?", "mode=3"), "\n";
echo addParam("http://www.somesite.com/print.php?newUser=1", "mode=3"), "\n";
echo addParam("http://www.somesite.com/print.php?newUser=1&fee=0", "mode=3"), "\n";
echo addParam("http://www.somesite.com/print.php?newUser=1&fee=0&", "mode=3"), "\n";
echo addParam("http://www.somesite.com/print.php?mode=1", "mode=3"), "\n";

echo "\n", "now trying delete:\n";
echo delParam("http://www.somesite.com/print.php?mode=1", "mode"), "\n";
echo delParam("http://www.somesite.com/print.php?mode=1&newUser=1", "mode"), "\n";
echo delParam("http://www.somesite.com/print.php?mode=1&newUser=1", "newUser"), "\n";
?>
and the output is:
trying add:
http://www.somesite.com/print.php?mode=3
http://www.somesite.com/print.php?mode=3
http://www.somesite.com/print.php?newUser=1&mode=3
http://www.somesite.com/print.php?newUser=1&fee=0&mode=3
http://www.somesite.com/print.php?newUser=1&fee=0&mode=3
http://www.somesite.com/print.php?mode=3
now trying delete:
http://www.somesite.com/print.php
http://www.somesite.com/print.php?newUser=1
http://www.somesite.com/print.php?mode=1
You can try this:
function removeParamFromUrl($query, $paramToRemove)
{
    $params = parse_url($query);
    if (isset($params['query'])) {
        $queryParams = array();
        parse_str($params['query'], $queryParams);
        if (isset($queryParams[$paramToRemove])) {
            unset($queryParams[$paramToRemove]);
        }
        $params['query'] = http_build_query($queryParams);
    }
    $ret = $params['scheme'] . '://' . $params['host'] . $params['path'];
    if (isset($params['query']) && $params['query'] != '') {
        $ret .= '?' . $params['query'];
    }
    return $ret;
}
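For example, a quick usage check of the helper above:

// Hypothetical usage of removeParamFromUrl() from above.
echo removeParamFromUrl('http://www.somesite.com/print.php?mode=1&newUser=1', 'newUser');
// http://www.somesite.com/print.php?mode=1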