Hi, I'm trying to write a PHP script that downloads all images and videos from a subreddit and stores them locally.
My plan is to get all the links from a URL, then decide for each one whether it's an image or a video, and then download it.
If someone can guide me or give me an idea of how to proceed, that would be appreciated.
My idea would be to download the page with cURL so that you get the HTML version of it, then take a look at this topic. Having this, you can extract all the tags you need, for example the "img" tags and their src attributes.
Then just load them into an array and iterate over it with cURL to download them and store them locally.
Another approach would be to download the HTML and collect all links based on a filter (e.g. beginning with "http:// in double quotes and ending with a quote; also make another filter for single quotes, in case the HTML uses single quotes).
Then just iterate over all the links and whitelist them by extension, if that's the kind of file you are interested in. Then download them with cURL and store them.
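A minimal sketch of the first approach; the page URL and download directory are placeholders, and relative image URLs are not resolved here:

<?php
// Sketch: fetch a page with cURL, pull the src of every <img> tag,
// whitelist by extension and download. $pageUrl and $saveDir are placeholders.
$pageUrl = 'http://www.reddit.com/r/pics/';
$saveDir = __DIR__ . '/downloads';

function fetchUrl($url) {
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_USERAGENT      => 'my-downloader/0.1',
    ));
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

$html = fetchUrl($pageUrl);
if ($html === false) {
    die('Could not fetch ' . $pageUrl);
}

$doc = new DOMDocument();
@$doc->loadHTML($html);                      // silence warnings from sloppy HTML
$links = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    $links[] = $img->getAttribute('src');    // note: img uses src, not href
}

if (!is_dir($saveDir)) {
    mkdir($saveDir, 0755, true);
}
foreach (array_unique($links) as $link) {
    $path = parse_url($link, PHP_URL_PATH);
    $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if (!in_array($ext, array('jpg', 'jpeg', 'png', 'gif'))) {
        continue;                             // whitelist by extension
    }
    file_put_contents($saveDir . '/' . basename($path), fetchUrl($link));
}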
EDIT:
I forgot - also do not forget to fix the links in the .html, .css and .js (and probably more) files. Also, an off-topic side note: watch out for images with PHP embedded in them.
I see that some websites let users download their source: the HTML files, CSS files, JS files and so on. I don't want this to be possible for my website. What should I do? Thanks for watching!
P.S.: If you can, please show me this approach with Zend. I'm using Zend 1.9.6.
You cannot restrict the download of those resources. The browser needs to download them in order to process them; if you prevent them from being downloaded, the browser won't be able to access them either.
That is impossible. The browser needs the source code to display your site, in the same way that you can't prevent a user from downloading an image if you show it to them. The best you can achieve is to obfuscate your CSS and JavaScript into hard-to-read, scrambled code, using YUICompressor for example. But someone determined will always be able to decipher your code logic...
As said, it's impossible to hide JS or CSS files, but what you can do is minify (compress) them, which makes them harder for a user to interpret and makes your site load faster at the same time.
Check this implementation of the Minify library with ZF; it provides CSS/JS view helpers to automate the compression:
http://hobodave.com/2010/01/17/bundle-phu-compress-your-js-css-in-zend-framework/
If their browser can't read your HTML, how can it display your page?
They won't be able to read your PHP (assuming that your server is set up to parse PHP correctly), but they will always be able to read the HTML output.
Does anybody know how, instead of providing users a link to download a doc file, I can embed PART of the file in an iframe on the same page? I want to give users a teaser in the iframe but not access to the entire document. Thanks!
Browsers can't open native MS Word files.
For '.doc' or '.docx' files, you'll need to read the excerpt on the server side and convert it into HTML. For '.txt', most browsers will read those natively, but if you want to show only an excerpt, you will still need to read into the file, probably server-side.
See Convert .doc to html in php. Once you have HTML on the server, you can trim it down to make your excerpt before displaying it.
The bad news is this is probably more complicated than you thought. The good news is that you won't need iframes.
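For the trimming step, here is a rough sketch; it assumes you already have the converted HTML (or the raw text of a .txt file) in a string, and the file name and length limit are placeholders:

<?php
// Sketch: build a plain-text teaser from HTML that is already on the server.
// 'converted.html' is a hypothetical file produced by your .doc converter.
$fullHtml = file_get_contents('converted.html');

function makeExcerpt($html, $maxChars = 600) {
    $text = trim(strip_tags($html));              // drop markup, keep the text
    if (strlen($text) <= $maxChars) {
        return htmlspecialchars($text);
    }
    $cut   = substr($text, 0, $maxChars);
    $space = strrpos($cut, ' ');
    if ($space !== false) {
        $cut = substr($cut, 0, $space);           // avoid cutting mid-word
    }
    return htmlspecialchars($cut) . '&hellip;';
}

echo '<div class="teaser">' . makeExcerpt($fullHtml) . '</div>';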
Well, what I'm actually trying to do is figure out how BEEMP3.COM works.
Because of the site's speed, I doubt they scrape other sites/sources on the spot.
They probably use some sort of database (PostgreSQL or MySQL) to store the "results" and then just query the search terms.
My question is: how do you guys think they crawl/spider or actually get the MP3 files/content?
They must have some algorithm to spider the internet, OR they use Google's index-of-mp3 trick to find hosts serving raw MP3 files.
Any comments and tips or ideas are appreciated :)
QueryPath is a great tool for building a web spider.
I'm guessing they find MP3s using a combined approach - they have a list of "seed sites" (gathered from Google, Usenet or inserted manually) that they use as starting points for the search, and then they set spiders running against them.
You need to write a script that will:
Take a webpage as a starting point
Fetch the webpage data (use cURL)
Use a regular expression to extract (a) any links and (b) any links to MP3 files
Place any MP3 links into a database
Add the list of links to other webpages to a queue for processing through the above method
You'll also need to re-check your MP3 links regularly to weed out any that have gone dead. A rough sketch of the crawl loop is below.
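This sketch uses a single seed URL and a flat queue, all placeholders; a real crawler would also need to honour robots.txt and resolve relative links:

<?php
// Naive crawler sketch: breadth-first over a queue of pages, collecting
// .mp3 links. The seed URL is a placeholder; relative links are skipped.
$queue    = array('http://example.com/seed-page.html');
$seen     = array();
$mp3s     = array();
$maxPages = 50;                            // safety limit for the sketch

while ($queue && count($seen) < $maxPages) {
    $page = array_shift($queue);
    if (isset($seen[$page])) {
        continue;                          // already crawled
    }
    $seen[$page] = true;

    $ch = curl_init($page);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html === false) {
        continue;
    }

    // (a) all href values, (b) the ones pointing at .mp3 files
    preg_match_all('/href=["\']([^"\']+)["\']/i', $html, $m);
    foreach ($m[1] as $link) {
        if (preg_match('/\.mp3(\?|$)/i', $link)) {
            $mp3s[] = $link;               // in practice: INSERT into your database
        } elseif (preg_match('/^https?:\/\//i', $link)) {
            $queue[] = $link;              // another page to process later
        }
    }
    sleep(1);                              // be polite to the remote servers
}

print_r(array_unique($mp3s));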
Alternatively you can crawl MP3 spiders like beemp3.com themselves, extract all the direct download links and save them to your database. You only need two things:
I. Simple HTML DOM.
II. An application that writes the extracted links into your database.
Check what I did at http://kenyaforums.com/bongomp3_external_link_search_engine_at_kenyaforums_com.php
Keep asking if anything seems contradictory.
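To illustrate those two pieces, here is a rough sketch using the Simple HTML DOM library's file_get_html()/find() calls plus PDO; the listing URL, database credentials and mp3_links table are all made up:

<?php
// Sketch: pull direct .mp3 links from a listing page with Simple HTML DOM
// and store them via PDO. The table and URL below are hypothetical.
include 'simple_html_dom.php';

$pdo    = new PDO('mysql:host=localhost;dbname=mp3index', 'user', 'pass');
$insert = $pdo->prepare('INSERT INTO mp3_links (url) VALUES (?)');

$page = file_get_html('http://example.com/listing.html');   // placeholder URL
if (!$page) {
    die('Could not load the listing page');
}
foreach ($page->find('a') as $anchor) {
    $href = $anchor->href;
    if (preg_match('/\.mp3(\?|$)/i', $href)) {
        $insert->execute(array($href));                      // save the direct link
    }
}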
Using PHP, how do I go to a website and get a list of all the XML files on the site? (It would also be nice to get the date each file was last changed.)
There are HTML files linking to HTML pages with an XML counterpart - does that help?
You might want to use SimpleXML or the file_get_contents() function, depending on your requirements.
Depends on what you mean by "list of all xml files". There's no way to find unlinked files without a brute force search (which is impossible in practice at this scale). If you only care about linked files, you'll either have to crawl manually and match links in the content of the pages, or do a Google search with site:example and inurl:.xml then parse the results.
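If the XML files are linked from pages you already know, a small sketch along these lines would collect them and ask each server for a Last-Modified header; the start URL is a placeholder and relative links are not resolved:

<?php
// Sketch: list the .xml files linked from one page and print each file's
// Last-Modified header (if the server sends one). $startPage is hypothetical.
$startPage = 'http://example.com/index.html';

$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents($startPage));   // silence sloppy-HTML warnings

foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if (!preg_match('/\.xml$/i', $href)) {
        continue;                                  // not an .xml link
    }
    // get_headers() needs an absolute URL; with 1 as the second argument it
    // returns an associative array of response headers
    $headers  = get_headers($href, 1);
    $modified = isset($headers['Last-Modified']) ? $headers['Last-Modified'] : 'unknown';
    echo $href . ' (last changed: ' . $modified . ")\n";
}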
You have a forum (vBulletin) that has a bunch of images - how easy would it be to have a page that visits a thread, steps through each page and forwards the images to the user (via AJAX or whatever)? I'm not asking about filtering (that's easy, of course).
Doable in a day? :)
I have a site that uses CodeIgniter as well - would it be even simpler using that?
Assuming this is to be carried out on the server, cURL + regexes are your friends... and yes, doable in a day.
There are also some open-source HTML parsers that might make this cleaner.
It depends on where your scraping script runs.
If it runs on the same server as the forum software, you might want to access the database directly and check for image links there. I'm not familiar with vBulletin, but it probably offers a plugin API that allows high-level database access. That would simplify querying all posts in a thread.
If, however, your script runs on a different machine (or, in other words, is unrelated to the forum software), it would have to act as an HTTP client. It could fetch all pages of a thread (either automatically, by searching for a NEXT link in each page, or manually, by having all pages specified as parameters) and search the HTML source code for image tags (<img .../>).
Then a regular expression could be used to extract the image urls. Finally, the script could use these image urls to construct another page displaying all these images, or it could download them and create a package.
In the second case the script actually acts as a "spider", so it should respect things like robots.txt or meta tags.
When doing this, make sure to rate-limit your fetching. You don't want to overload the forum server by requesting many pages per second. Simplest way to do this is probably just to sleep for X seconds between each fetch.
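A rough sketch of that HTTP-client scenario, following a "next page" link and pausing between requests; the URLs are placeholders and the rel="next" pattern is only an assumption about the forum's markup:

<?php
// Sketch: walk a thread by following its "next page" link, collect image
// URLs with a regex, and pause between requests. The start URL is made up
// and the rel="next" pattern depends entirely on the forum's markup.
$pageUrl   = 'http://forum.example.com/showthread.php?t=123';
$imageUrls = array();

while ($pageUrl) {
    $html = @file_get_contents($pageUrl);
    if ($html === false) {
        break;
    }

    // collect <img src="..."> values from this page
    preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $m);
    $imageUrls = array_merge($imageUrls, $m[1]);

    // look for a link marked as the next page (attribute order may differ!)
    if (preg_match('/<a[^>]+rel=["\']next["\'][^>]+href=["\']([^"\']+)["\']/i', $html, $next)) {
        $pageUrl = $next[1];
    } else {
        $pageUrl = null;                   // no next page: stop
    }

    sleep(2);                              // rate-limit between page fetches
}

print_r(array_unique($imageUrls));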
Yes, doable in a day.
Since you already have a working CI setup, I would use it.
I would use the following approach:
1) Make a model in CI capable of:
logging in to vBulletin (images are often added as attachments and you need to be logged in before you can download them). Use something like Snoopy.
collecting the URL of the "last" button using preg_match(), parsing that URL with parse_url() and parse_str(), and generating links from page 1 to the last page (see the sketch after this list)
collecting the HTML from all generated links. Still using Snoopy.
finding all images in the HTML using preg_match_all()
downloading all images. Still using Snoopy.
moving each downloaded image from a tmp directory into another directory, renaming it imagename_01, imagename_02, etc. if the same image name already exists.
saving the image name and exact byte size in a DB table. That way you can avoid downloading the same image more than once.
2) Make a method in a controller that collects all images
3) Set up a cronjob that collects the images at regular intervals. wget -O /tmp/useless.html http://localhost/imageminer/collect should do nicely
4) Write the code that outputs pretty HTML for the end user, using the DB table to get the images.
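As an illustration of the page-link step in 1), here is a rough, framework-agnostic sketch; it assumes the "last" button's href looks something like showthread.php?t=123&page=17, which may not match your vBulletin's exact markup:

<?php
// Sketch: given the href of the "last page" button, read the page count
// and generate one URL per page. The sample href below is hypothetical.
$lastHref = 'http://forum.example.com/showthread.php?t=123&page=17';

$parts = parse_url($lastHref);            // scheme, host, path, query
parse_str($parts['query'], $query);       // array('t' => '123', 'page' => '17')

$pageUrls = array();
for ($p = 1; $p <= (int) $query['page']; $p++) {
    $query['page'] = $p;
    $pageUrls[] = $parts['scheme'] . '://' . $parts['host'] . $parts['path']
                . '?' . http_build_query($query);
}

// $pageUrls now holds page 1 .. last page, ready to be fetched one by one.
print_r($pageUrls);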