Trying to scrape images from reddit, having trouble cleaning up strings - php

So I'm not asking you to fix my script; if you know the answer, I would appreciate it if you just pointed me in the right direction. This is a script I found and I'm trying to adapt it for a project.
I believe that what's going on is that the formatting of $reddit is causing problems when I insert that string into $url, and I'm not sure how to filter the string.
Right after I posted this, I had the idea of using concatenation on $reddit to get the desired result instead of filtering the string. Not sure.
Thanks!
picgrabber.php
include("RIS.php");
$reddit = "pics/top/?sort=top&t=all";
$pages = 5;
$t = new RIS($reddit, $pages);
$t->getImagesOnPage();
$t->saveImage();
RIS.php
class RIS {
    var $after = "";
    var $reddit = "";

    public function __construct($reddit, $pages) {
        $this->reddit = preg_replace('/[^A-Za-z0-9\-]/', '', $reddit);
        if(!file_exists($this->reddit)) {
            mkdir($this->reddit, 0755);
        }
        $pCounter = 1;
        while($pCounter <= $pages) {
            $url = "http://reddit.com/r/$reddit/.json?limit=100&after=$this->after";
            $this->getImagesOnPage($url);
            $pCounter++;
        }
    }

    private function getImagesOnPage($url) {
        $json = file_get_contents($url);
        $js = json_decode($json);
        foreach($js->data->children as $n) {
            if(preg_match('(jpg$|gif$|png$)', $n->data->url, $match)) {
                echo $n->data->url."\n";
                $this->saveImage($n->data->url);
            }
            $this->after = $js->data->after;
        }
    }

    private function saveImage($url) {
        $imgName = explode("/", $url);
        $img = file_get_contents($url);
        // if the file doesn't already exist...
        if(!file_exists($this->reddit."/".$imgName[(count($imgName)-1)])) {
            file_put_contents($this->reddit."/".$imgName[(count($imgName)-1)], $img);
        }
    }
}
Notice: Trying to get property of non-object in C:\Program Files (x86)\EasyPHP-DevServer-13.1VC9\data\localweb\RIS.php on line 33
Warning: Invalid argument supplied for foreach() in C:\Program Files (x86)\EasyPHP-DevServer-13.1VC9\data\localweb\RIS.php on line 33
Fatal error: Call to private method RIS::getImagesOnPage() from context '' in C:\Program Files (x86)\EasyPHP-DevServer-13.1VC9\data\localweb\vollyeballgrabber.php on line 23
line 33:
foreach($js->data->children as $n) {
var_dump($url);
returns:
string(78) "http://reddit.com/r/pics/top/?sort=top&t=all/.json?limit=100&after=" NULL

$reddit in picgrabber.php has GET parameters.
In the class RIS, you're embedding that value into a string that already has another set of GET parameters in it, with the ".json" token in between.
The resulting URL is:
http://reddit.com/r/pics/top/?sort=top&t=all/.json?limit=100&after=
The ".json" token needs to come after the path portion of the URL and before the GET parameters. I would also change any additional "?" tokens to "&" (ampersands), so any further sets of GET parameters you concatenate onto the URL string become additional parameters.
Like this:
http://reddit.com/r/pics/top/.json?sort=top&t=all&limit=100&after=
The difference is that your URL returns HTML, because the reddit server doesn't understand how to parse what you're sending, and you're then trying to parse that HTML with a JSON decoder. My URL returns actual JSON data, so your JSON decoder should start returning an actual object.
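A minimal sketch of that rewrite (build_reddit_url is a hypothetical helper, not part of the original script; it splits the path from its query string before splicing in ".json"):

```php
<?php
// Hypothetical helper: move ".json" between the path and the query string,
// and merge the query strings with "&" instead of a second "?".
function build_reddit_url($reddit, $after)
{
    $parts = explode('?', $reddit, 2);          // e.g. "pics/top/" and "sort=top&t=all"
    $path  = rtrim($parts[0], '/');             // drop any trailing slash
    $query = isset($parts[1]) ? $parts[1] . '&' : '';
    return "http://reddit.com/r/$path/.json?{$query}limit=100&after=$after";
}

echo build_reddit_url("pics/top/?sort=top&t=all", "");
// http://reddit.com/r/pics/top/.json?sort=top&t=all&limit=100&after=
```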

Related

Prevent a link from becoming double encoded in PHP

I have the following URL in a MySQL database for a PHP application. Part of our system allows a user to edit a previous post containing these links and save it; however, the URL gets encoded again when the user edits, which breaks it, as displayed below.
Is there an easy way, or an existing PHP function, to determine whether the string has already been encoded and to remove the unwanted characters, so that it stays in the expected output below?
Expected output
url:https://r5uy4lmtdqka6a1rzyexlusfl-902rjcrzfe6k93co7a644-tom.s3.eu-west-2.amazonaws.com/Carbon%20Monoxide/Summer%20CO%20Campaign/CO%20Summer%202022/CO%20Summer%20you%20can%20smell%20the%20BBQ%20-%20600x600.jpg
Actual output
url:https://r5uy4lmtdqka6a1rzyexlusfl-902rjcrzfe6k93co7a644-tom.s3.eu-west-2.amazonaws.com/Carbon%2520Monoxide/Summer%2520CO%2520Campaign/CO%2520Summer%25202022/CO%2520Summer%2520you%2520can%2520smell%2520the%2520BBQ%2520-%2520600x600.jpg
As suggested in comments, decode repeatedly until the string stops changing, then encode (only the path part) once.
<?php
// The stored, double-encoded URL (%2520 is "%20" percent-encoded a second time)
$str = "https://r5uy4lmtdqka6a1rzyexlusfl-902rjcrzfe6k93co7a644-tom.s3.eu-west-2.amazonaws.com/Carbon%2520Monoxide/Summer%2520CO%2520Campaign/CO%2520Summer%25202022/CO%2520Summer%2520you%2520can%2520smell%2520the%2520BBQ%2520-%2520600x600.jpg";

function fix_url($str)
{
    // Split into scheme ("https:"), empty segment, host, and path (at most 4 parts)
    $arr = explode('/', $str, 4);
    $path = isset($arr[3]) ? $arr[3] : '';
    // Decode until the string stops changing, however many times it was encoded
    while (true) {
        $decoded = urldecode($path);
        if ($decoded == $path) {
            break;
        }
        $path = $decoded;
    }
    // Re-encode each path segment exactly once; rawurlencode keeps spaces as %20
    $encoded = implode('/', array_map('rawurlencode', explode('/', $path)));
    return $arr[0] . '//' . $arr[2] . '/' . $encoded;
}

echo fix_url($str);
// https://r5uy4lmtdqka6a1rzyexlusfl-902rjcrzfe6k93co7a644-tom.s3.eu-west-2.amazonaws.com/Carbon%20Monoxide/Summer%20CO%20Campaign/CO%20Summer%202022/CO%20Summer%20you%20can%20smell%20the%20BBQ%20-%20600x600.jpg

Avoiding equal and ampersand conversion in PHP

I have a php variable containing some special characters, inside a Codeigniter 3 controller:
$page_url = 'search=' . $expression . '&page';
In a template, I use this variable.
In the browser, I see the characters mentioned above in this form:
posts/search?search%3Dharum%26page=2
The = sign turns into %3D, and the "&" into %26.
I tried $page_url = urldecode($page_url); but it does not work.
How do I keep the original characters?
Please use utf8_decode and try again:
echo utf8_decode(urldecode("search%3Dharum%26page=2"));
Try this decode function.
function decode($url)
{
    // Map of encoded tokens to their characters; extend as needed.
    $special = array(
        '%21' => '!', '%5C' => '\\', // and so on, for each character you need.
    );
    foreach ($special as $key => $value) {
        $url = str_replace($key, $value, $url);
    }
    return $url;
}
echo decode("search=%21");
Your problem is not that easy to reproduce. The following rextester demo produces the text in the form you request, using basically your code: http://rextester.com/YTY13099
<?php
$page=2;
$expression='harum';
$page_url = 'posts/search?search=' . $expression . '&page=' . $page;
?>
resulting in
posts/search?search=harum&page=2
Could it be that the problem is caused by the fact that the script is part of a CodeIgniter controller? I do not know anything about CodeIgniter, but I can imagine that further processing takes place there.
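A sketch of the safer way to build such a URL in the first place (assuming the goal is the plain query string shown in the question; http_build_query is standard PHP, not from either answer):

```php
<?php
$expression = 'harum';
$page = 2;

// http_build_query percent-encodes the values only; the "=" and "&"
// separators it emits are literal, so nothing needs decoding afterwards.
$page_url = 'posts/search?' . http_build_query(['search' => $expression, 'page' => $page]);

echo $page_url; // posts/search?search=harum&page=2
```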

Can't parse the titles of some links using function

I've written a script to parse the title of each page using the links gathered from this url. To be clearer: my script below is supposed to parse all the links from the landing page and then reuse those links to go one layer deep and parse the titles of the posts there.
As this is my first ever attempt to write anything in PHP, I can't figure out where I'm going wrong.
This is my try so far:
<?php
include("simple_html_dom.php");

$baseurl = "https://stackoverflow.com";

function get_links($baseurl)
{
    $weburl = "https://stackoverflow.com/questions/tagged/web-scraping";
    $html = file_get_html($weburl);
    $processed_links = array();
    foreach ($html->find(".summary h3 a") as $a) {
        $links = $a->href . '<br>';
        $processed_links[] = $baseurl . $links;
    }
    return implode("\n", $processed_links);
}

function reuse_links($processed_links){
    $ihtml = file_get_html($processed_links);
    foreach ($ihtml->find("h1 a") as $item) {
        echo $item->innertext;
    }
}

$pro_links = get_links($baseurl);
reuse_links($pro_links);
?>
When I execute the script, it produces the following error:
Warning: file_get_contents(https://stackoverflow.com/questions/52347029/getting-all-the-image-urls-from-a-given-instagram-user<br> https://stackoverflow.com/questions/52346719/unable-to-print-links-in-another-function<br> https://stackoverflow.com/questions/52346308/bypassing-technical-limitations-of-instagram-bulk-scraping<br> https://stackoverflow.com/questions/52346159/pulling-the-href-from-a-link-when-web-scraping-using-python<br> https://stackoverflow.com/questions/52346062/in-url-is-indicated-as-query-or-parameter-in-an-attempt-to-scrap-data-using<br> https://stackoverflow.com/questions/52345850/not-able-to-print-link-from-beautifulsoup-for-web-scrapping<br> https://stackoverflow.com/questions/52344564/web-scraping-data-that-was-shown-previously<br> https://stackoverflow.com/questions/52344305/trying-to-encode-decode-locations-when-scraping-a-website<br> https://stackoverflow.com/questions/52343297/cant-parse-the-titles-of-some-links-using-function<br> https: in C:\xampp\htdocs\differenttuts\simple_html_dom.php on line 75
Fatal error: Uncaught Error: Call to a member function find() on boolean in C:\xampp\htdocs\differenttuts\testfile.php:18 Stack trace: #0 C:\xampp\htdocs\differenttuts\testfile.php(23): reuse_links('https://stackov...') #1 {main} thrown in C:\xampp\htdocs\differenttuts\testfile.php on line 18
Once again: I expect my script to track the links from the landing page and parse the titles from their target pages.
I'm not very familiar with simple_html_dom, but I'll try to answer the question. This library uses file_get_contents to perform HTTP requests, but in PHP 7 file_get_contents doesn't accept a negative offset (which is the default for this library) when retrieving network resources.
If you're using PHP 7, you'll have to set the offset to 0:
$html = file_get_html($url, false, null, 0);
In your get_links function you join your links into a string. I think it's best to return an array, since you'll need those links for new HTTP requests in the next function. For the same reason you shouldn't append break tags to the links; you can add them when you print.
function get_links($url)
{
    $processed_links = array();
    $base_url = implode("/", array_slice(explode("/", $url), 0, 3));
    $html = file_get_html($url, false, null, 0);
    foreach ($html->find(".summary h3 a") as $a) {
        $link = $base_url . $a->href;
        $processed_links[] = $link;
        echo $link . "<br>\n";
    }
    return $processed_links;
}

function reuse_links($processed_links)
{
    foreach ($processed_links as $link) {
        $ihtml = file_get_html($link, false, null, 0);
        foreach ($ihtml->find("h1 a") as $item) {
            echo $item->innertext . "<br>\n";
        }
    }
}

$url = "https://stackoverflow.com/questions/tagged/web-scraping";
$pro_links = get_links($url);
reuse_links($pro_links);
I think it makes more sense to pass the main URL as a parameter to get_links; we can derive the base URL from it. I've used array functions for the base URL, but you could use parse_url, which is the appropriate function.
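A small illustration of the parse_url variant mentioned above (a sketch, not from the original answer; base_url is a made-up helper name):

```php
<?php
// Derive "scheme://host" from a full URL with parse_url
// instead of explode/array_slice.
function base_url($url)
{
    $p = parse_url($url);
    return $p['scheme'] . '://' . $p['host'];
}

echo base_url("https://stackoverflow.com/questions/tagged/web-scraping");
// https://stackoverflow.com
```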

Call to a member function getAttribute() on a non-object [duplicate]

This question already has answers here:
Reference - What does this error mean in PHP?
(38 answers)
Closed 8 years ago.
This is the error I got :
Fatal error: Call to a member function getAttribute() on a non-object
in /home/a4688869/public_html/random/index.php
on line 45
Here is a picture: http://prntscr.com/5rum9z
The function works for the first few, but breaks after a few pictures are displayed. Essentially the code generates a random prntscr URL; I use cURL to get the HTML, parse it with a function, select an image from the page, and repeat the process until I get 5 images.
My code
$original_string = '123456789abh';
$random_string = get_random_string($original_string, 6);

// Generates a random string of characters and returns the string
function get_random_string($valid_chars, $length)
{
    $random_string = ""; // start with an empty random string
    $num_valid_chars = strlen($valid_chars); // count the valid chars so we know how many choices we have
    // repeat the steps until we've created a string of the right length
    for ($i = 0; $i < $length; $i++)
    {
        $random_pick = mt_rand(1, $num_valid_chars); // pick a random number from 1 up to the number of valid chars
        // take the random character out of the string of valid chars
        // subtract 1 from $random_pick because strings are indexed starting at 0, and we started picking at 1
        $random_char = $valid_chars[$random_pick-1];
        $random_string .= $random_char; // add the randomly-chosen char onto the end of our string so far
    }
    return $random_string;
}

// Parses the random website and returns the image source
function websearch($url){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $html = curl_exec($ch);
    curl_close($ch);
    $dom = new DOMDocument;
    @$dom->loadHTML($html); // Had to suppress errors
    $img = $dom->getElementsByTagName('img')->item(1); // Get just the second picture
    $src = $img->getAttribute('src'); // Get the source of the picture
    return $src;
}
The part of the code that displays the html
for ($i = 0; $i < $perpage; $i++){
    $random_string = get_random_string($original_string, 6);
    $src = websearch('http://prntscr.com/' . $random_string);
    while( $src == "http://i.imgur.com/8tdUI8N.png"){
        $random_string = get_random_string($original_string, 6);
        $src = websearch('http://prntscr.com/' . $random_string);
    }
?>
    <img src="<?php echo $src; ?>">
    <p><?php echo $src; ?></p>
<?php if ($i != $perpage - 1){ // Only display the hr if there is another picture after it ?>
    <hr>
<?php }}?>
Whenever you get "non-object" errors, it is because you are calling a method on a variable that is not an object.
This can be remedied by always checking your return values. It may not be very elegant, but computers are stupid: if you want your code to work, you have to make sure it actually does what you want it to do.
$img = $dom->getElementsByTagName('img')->item(1);
if ($img === null) {
    die("The image was not found!");
}
You should also get in the habit of reading the documentation for stuff that you are using (in this case the return values).
DOMDocument
DOMDocument::getElementsByTagName
DOMNodeList
DOMNodelist::item
As you can see on the DOMNodelist::item page the return value, if the method failed, is null:
Return Values
The node at the indexth position in the DOMNodeList, or NULL if that is not a valid index.
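Applying that to the question's situation, a runnable sketch (the HTML literal is a stand-in for what curl would return; it deliberately has only one image, so item(1) is null):

```php
<?php
// Stand-in for the fetched page: only ONE <img>, so asking for the
// second one with item(1) returns null instead of a DOMElement.
$html = '<html><body><img src="a.png"></body></html>';

$dom = new DOMDocument;
@$dom->loadHTML($html);

$img = $dom->getElementsByTagName('img')->item(1);
if ($img === null) {
    $src = null;                      // skip or retry instead of a fatal error
} else {
    $src = $img->getAttribute('src');
}

var_dump($src); // NULL
```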

PHP DomDocument failing to handle quotes in a url

When I try to open a URL like this one in the browser:
http://api.anghami.com/rest/v1/GETsearch.view?sid=11754134061397734622103190992&query=Can't Remember to Forget You Shakira&searchtype=SONG&ook&songCount=1
(containing a quote), everything works fine and the output is valid XML.
But when I try to call it from a PHP file:
$url = "http:/api.anghami.com/rest/v1/GETsearch.view?sid=11754134061397734622103190992&query=Can't Remember to Forget You Shakira&searchtype=SONG&ook&songCount=1";
// using DOMDocument for parsing.
$data = new DOMDocument();
// loading the xml from Anghami API.
if($data->load("$url")){
    // Getting the tag song.
    foreach ($data->getElementsByTagName('song') as $searchNode)
    {
        $count++;
        $n++;
        // Getting the information of the Anghami song from the XML file.
        $valueID = $searchNode->getAttribute('id');
        $titleAnghami = $searchNode->getAttribute('title');
        $album = $searchNode->getAttribute('album');
        $albumID = $searchNode->getAttribute('albumID');
        $artistAnghami = $searchNode->getAttribute('artist');
        $track = $searchNode->getAttribute('track');
        $year = $searchNode->getAttribute('year');
        $coverArt = $searchNode->getAttribute('coverArt');
        $ArtistArt = $searchNode->getAttribute('ArtistArt');
        $size = $searchNode->getAttribute('size');
    }
}
I get this error:
'Warning: DOMDocument::load(): I/O warning : failed to load external entity /var/www/html/http:/api.anghami.com/rest/v1/GETsearch.view?sid=11754134061397734622103190992&query=Can't Remember to Forget You Shakira&searchtype=SONG&ook&songCount=1" in /var/www/html/search.php on line 93'
Can anyone help please?
@Fracsi is correct: the URL needs to start with http://, not http:/.
The other problem is that the XML has a default namespace (defined with the xmlns attribute on the root element), so you need to use
$data->getElementsByTagNameNS('http://api.anghami.com/rest/v1', 'song')
to select all the "song" elements.
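A runnable illustration of the namespace point (the XML literal here is a made-up stand-in for the API response, not actual Anghami output):

```php
<?php
// The root element declares a default namespace, so every "song"
// element lives in that namespace and getElementsByTagNameNS selects it.
$xml = '<searchResult xmlns="http://api.anghami.com/rest/v1">'
     . '<song id="1" title="Example" artist="Someone"/>'
     . '</searchResult>';

$data = new DOMDocument();
$data->loadXML($xml);

foreach ($data->getElementsByTagNameNS('http://api.anghami.com/rest/v1', 'song') as $song) {
    echo $song->getAttribute('title'), "\n";
}
```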