Hi,
I download a large number of files for data mining. I have been using PHP for this, but I'm finding it too slow, and I only need a small part of each web page. I want to achieve two things:
1) cURL should be able to use all of my download bandwidth.
2) Is there a way to download only the part of the web page where my data resides?
I am not tied to PHP. If curl works better from the terminal, I'd use that.
Yes, you can download only part of the page by using the CURLOPT_RANGE option (provided the server honors Range requests), and you can also provide a write callback function that simply returns an error once you've received "enough" data, so the transfer stops and you can move on.
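A minimal sketch of both ideas combined (the URL, the byte range, and the "enough" marker are placeholders; the range is only honored if the server supports it):

```php
<?php
// Fetch at most ~50 KB of a page and abort as soon as a marker string
// shows up in the body. URL and marker are placeholders.
$url    = 'http://example.com/page.html';
$marker = '</table>';   // hypothetical "I have enough" marker
$buffer = '';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RANGE, '0-51200');   // only honored if the server supports Range requests
curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $chunk) use (&$buffer, $marker) {
    $buffer .= $chunk;
    if (strpos($buffer, $marker) !== false) {
        return -1;   // returning anything other than strlen($chunk) aborts the transfer
    }
    return strlen($chunk);
});
curl_exec($ch);    // returns false with a "write error" when aborted early
curl_close($ch);

// $buffer now holds everything received up to (and including) the marker.
```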
Are you downloading HTML? Your comment leads me to believe that you are. If that's the case, simply load the HTML with PHP Simple HTML DOM and extract only the part you want. That said, I find it hard to believe that grabbing just the HTML is what's slowing you down. Are you downloading any files or media as well?
Link: http://simplehtmldom.sourceforge.net/
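A minimal sketch of that approach, assuming the data sits in a hypothetical table with id "data" (the URL and selector are placeholders):

```php
<?php
// Sketch using PHP Simple HTML DOM (simplehtmldom.sourceforge.net).
require_once 'simple_html_dom.php';

$html = file_get_html('http://example.com/page.html');
if ($html) {
    // Grab only the rows of the table you care about.
    foreach ($html->find('table#data tr') as $row) {
        echo $row->plaintext, "\n";
    }
    $html->clear();   // free the DOM to keep memory usage down
}
```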
There is no way to download only part of a page. When you request a URL, the server's response is what it is.
Make better use of your bandwidth by using cURL's ability to run multiple transfers at once (curl_multi).
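A rough sketch of parallel downloads with curl_multi (the URL list is a placeholder):

```php
<?php
// Download several pages in parallel with curl_multi.
$urls = array(
    'http://example.com/a.html',
    'http://example.com/b.html',
    'http://example.com/c.html',
);

$mh      = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Drive all transfers until they are finished.
do {
    curl_multi_exec($mh, $running);
    if (curl_multi_select($mh) === -1) {
        usleep(100000);   // avoid busy-waiting when select() isn't available
    }
} while ($running > 0);

foreach ($handles as $url => $ch) {
    $body = curl_multi_getcontent($ch);   // the downloaded page
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
    echo $url, ': ', strlen($body), " bytes\n";
}
curl_multi_close($mh);
```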
Related
Is it possible to save all the resources downloaded from an HTTP request with PHP?
For example: using curl, wget, or similar to get all the files necessary to load the page in a browser, instead of only getting the HTML content of the page.
I don't want to extract all the links and then download each one with a separate curl call. I would like a way to do it in one go. I assume it's possible, since in a browser I also make only one URL request to get all the resources.
Edit:
The point here is to simulate browser behavior. How can I save an entire page with PHP? If it must be done in several steps, what logic should I follow?
I have huge problems getting all the files from a page even after extracting the links, since I find it very hard to store session data and reuse it for subsequent curl requests.
You could use Ajax for that, save the responses as objects, and show them when needed. Does that work for you?
I see that some websites allow users to download their source, including the HTML, CSS, and JS files, and so on. I don't want this for my website. What should I do? Thanks for reading!
P.S.: If you can, please show me how to do this with Zend. I'm using Zend Framework 1.9.6.
You cannot restrict the download of these resources. The browser needs to download them in order to process them; if you prevent them from being downloaded, the browser won't be able to access them either.
That is impossible. The browser needs the source code to display your site, in the same way that you can't prevent users from downloading an image if you show it to them. The best you can do is obfuscate your CSS and JavaScript into hard-to-read, scrambled code, using YUI Compressor, for example. But someone determined will always be able to decipher your code's logic...
As said, it's impossible to hide JS or CSS files, but what you can do is minify (compress) them, which makes them harder for a user to interpret and makes your site load faster at the same time.
Check out this integration of the Minify library with ZF; it provides CSS/JS view helpers to automate the compression.
http://hobodave.com/2010/01/17/bundle-phu-compress-your-js-css-in-zend-framework/
If their browser can't read your HTML, how can it display your page?
They won't be able to read your PHP (assuming that your server is set up to parse PHP correctly), but they will always be able to read the HTML output.
This is a very broad question, so I'm just looking for the best way of doing this.
I want to periodically monitor certain pages on my website.
I am looking to write a PHP script that will load the page as if it were being loaded in a browser. That means it loads all the CSS, JavaScript, images, videos, etc.
I just want to get the load time of these pages and then email the results to myself from a crontab. For this I was going to use microtime() and PHPMailer.
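Something like this is what I have in mind so far, a rough sketch that only times the raw HTML fetch (the URL and recipient address are placeholders):

```php
<?php
// Time a single page fetch with microtime() and cURL, then mail the result.
// This only measures the HTML download, not CSS/JS/images.
$url = 'http://www.example.com/page-to-monitor';

$start = microtime(true);

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
curl_close($ch);

$elapsed = microtime(true) - $start;

// mail() used here for brevity; the real script would use PHPMailer.
mail('me@example.com', 'Page load time', sprintf('%s loaded in %.3f seconds', $url, $elapsed));
```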
Does anyone know of a script to load a complete page, or have any suggestions on how to go about this?
Thanks.
What if the page has dynamic content? You would also need to execute all the JavaScript and fetch all the CSS and images to get the final load time. I believe that is impossible using only PHP.
A PHP script you run from the same server that hosts your site will give you abnormally low readings, since it's essentially loading on the first hop. What you really want to do is run the script from various servers outside of your own. There are also limitations on what PHP can see, i.e. JS, jQuery, etc.
The simplest approach is to check from your home PC using JMeter. You set your home browser to use it as a proxy and browse to whichever pages you want. JMeter will record statistics, and when you are happy you can save the stats.
This avoids the problem of handling JS and jQuery through a script.
This could get very complicated. You'd basically have to parse the HTML, and then there are tons of edge cases, like JS loading additional resources, etc. I would definitely recommend using something like the Network tab of Chrome's dev tools instead.
I'm working on a Drupal site that will reside at http://sub.clientdomain.com and pulls in a static HTML fragment from a server at http://clientdomain.com/path/to/fragment.html, which I need to display in a page to provide a parent menu for the site.
I'm working on a VPS, so I have access to curl, wget and pretty much anything I want to install.
What's the optimal way to do this?
My first approach here would be to use curl on a cron job every n minutes to pull in the HTML fragment and store it somewhere like /var/www/html/site, to avoid making the HTTP request on every page load.
I'd then use file_get_contents() to pull the content into the page, with a fallback in case the file isn't there.
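Roughly what I have in mind (the paths, the fragment URL, and the fallback markup are placeholders):

```php
<?php
// --- fetch_fragment.php, run from cron every n minutes ---
$ch = curl_init('http://clientdomain.com/path/to/fragment.html');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);
if ($html !== false && $html !== '') {
    // Only overwrite the cached copy on a successful fetch.
    file_put_contents('/var/www/html/site/fragment.html', $html);
}

// --- in the page template ---
$fragment = @file_get_contents('/var/www/html/site/fragment.html');
if ($fragment === false) {
    $fragment = '<!-- fallback menu markup goes here -->';
}
echo $fragment;
```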
I'm adding this here mainly as a sanity check - would love to see other approaches to it.
Your first approach sounds like a good idea. By using a cronjob, the pulling of the content is separated from your site, so the user doesn't have to wait.
If you have access to the page that generates the fragment, you could reduce net traffic by actively sending the new content instead of requesting it every couple of minutes.
I'd store the fragment in a database, though, as I believe database-queries are faster than access to the filesystem.
If the main domain is on the same server, you probably have direct access to the file, so just use a simple include and point to the static HTML file by its absolute path.
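For example (the path is hypothetical; adjust it to wherever the parent site keeps the fragment on disk):

```php
<?php
include '/var/www/clientdomain.com/path/to/fragment.html';
```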
You have a forum (vBulletin) that has a bunch of images. How easy would it be to have a page that visits a thread, steps through each page, and forwards the images to the user (via Ajax or whatever)? I'm not asking about filtering (that's easy, of course).
Doable in a day? :)
I have a site that uses CodeIgniter as well. Would it be even simpler using that?
Assuming this is to be carried out server-side, curl + regular expressions are your friends... and yes, doable in a day.
There are also some open-source HTML parsers that might make this cleaner.
It depends on where your scraping script runs.
If it runs on the same server as the forum software, you might want to access the database directly and check for image links there. I'm not familiar with vBulletin, but it probably offers a plugin API that allows for high-level database access. That would simplify querying all the posts in a thread.
If, however, your script runs on a different machine (or, in other words, is unrelated to the forum software), it would have to act as an HTTP client. It could fetch all the pages of a thread (either automatically, by following the "next" link on each page, or manually, by passing all the page URLs as parameters) and search the HTML source code for image tags (<img ... />).
Then a regular expression could be used to extract the image URLs. Finally, the script could use these image URLs to construct another page displaying all the images, or it could download them and create a package.
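A rough sketch of that last step for a single thread page (the URL is a placeholder, and a real HTML parser would be more robust than a regex):

```php
<?php
// Pull the image URLs out of one thread page and emit a simple gallery page.
$page = file_get_contents('http://forum.example.com/showthread.php?t=12345&page=1');

preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $page, $matches);

echo "<html><body>\n";
foreach (array_unique($matches[1]) as $src) {
    echo '<img src="', htmlspecialchars($src), '" />', "\n";
}
echo "</body></html>\n";
```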
In the second case the script actually acts as a "spider", so it should respect things like robots.txt or meta tags.
When doing this, make sure to rate-limit your fetching. You don't want to overload the forum server by requesting many pages per second. Simplest way to do this is probably just to sleep for X seconds between each fetch.
Yes, doable in a day.
Since you already have a working CI setup, I would use it.
I would use the following approach (a rough code sketch of the scraping steps follows the list):
1) Make a model in CI capable of:
- logging in to vBulletin (images are often added as attachments, and you need to be logged in before you can download them). Use something like Snoopy.
- collecting the URL of the "last page" button using preg_match(), parsing the URL with parse_url() and parse_str(), and generating links from page 1 to the last page
- collecting the HTML from all the generated links, still using Snoopy
- finding all the images in the HTML using preg_match_all()
- downloading all the images, still using Snoopy
- moving each downloaded image from a tmp directory into another directory, renaming it imagename_01, imagename_02, etc. if the same image name already exists
- saving the image name and exact byte size in a DB table, so you can avoid downloading the same image more than once
2) Make a method in a controller that collects all the images.
3) Set up a cron job that collects images at regular intervals. wget -O /tmp/useless.html http://localhost/imageminer/collect should do nicely.
4) Write the code that outputs pretty HTML for the end user, using the DB table to get the images.
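Here is the rough sketch of the fetch-and-download part of step 1, using plain cURL in place of Snoopy (the thread URL, cookie jar, and target directory are placeholders, and the login step is assumed to have filled the cookie jar already):

```php
<?php
// Fetch a thread page with the logged-in session, find the images and
// download them, sleeping between requests to avoid hammering the forum.
function fetch($url, $cookieJar)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieJar);    // reuse the logged-in session
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieJar);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}

$cookieJar = '/tmp/imageminer_cookies.txt';
$threadUrl = 'http://forum.example.com/showthread.php?t=12345';   // hypothetical thread
$saveDir   = '/tmp/imageminer';

$html = fetch($threadUrl, $cookieJar);
preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $matches);

foreach (array_unique($matches[1]) as $imgUrl) {
    $data = fetch($imgUrl, $cookieJar);
    $name = basename(parse_url($imgUrl, PHP_URL_PATH));

    // Rename on collision: imagename_01, imagename_02, ...
    $target = $saveDir . '/' . $name;
    for ($i = 1; file_exists($target); $i++) {
        $target = sprintf('%s/%s_%02d', $saveDir, $name, $i);
    }
    file_put_contents($target, $data);

    sleep(2);   // rate-limit so the forum server isn't overloaded
}
```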