PHP - file_get_contents skip style and script

I am trying to pull the HTML of a site using file_get_contents($url).
When I run file_get_contents it takes too much time to pull the HTML of the host site.
Can I skip the styles, scripts and images? I think it would then take less time to pull the HTML of that site.

Try:
$file = file_get_contents($url);
$only_body = preg_replace("/.*<body[^>]*>|<\/body>.*/si", "", $file);
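If you also want to drop the inline scripts and styles, a similar sketch could work (note that file_get_contents still downloads the whole document either way; stripping only reduces what you process afterwards, not the transfer time):
$file = file_get_contents($url);
// Keep only the <body> contents, then remove <script> and <style> blocks.
$only_body = preg_replace("/.*<body[^>]*>|<\/body>.*/si", "", $file);
$stripped = preg_replace('/<(script|style)[^>]*>.*?<\/\1>/si', '', $only_body);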

Related

How to grab flash video and download video data with file_get_contents using PHP Simple HTML DOM Parser

I am simply looking to screen-scrape a web page that contains Flash videos. While scraping the page using PHP Simple HTML DOM Parser, I'd like to grab the embed snippet and download the video data. Can anyone help?
Reference to possibly help:
How to find object tag with param and embed tag inside html using simple html dom
I don't think there's a universal solution; otherwise youtube-dl's special-casing for many sites wouldn't be necessary.
If the site you have in mind is on that list, I would recommend simply using youtube-dl.
If you're using a Linux server, you can use this small PHP wrapper for youtube-dl. Just pass the URL of the page with the video to the function.
function video_download($video_url) {
    $downloaderUrl = 'https://yt-dl.org/downloads/2015.01.23.4/youtube-dl';
    $downloaderPath = '/tmp/youtube-dl';
    $videoPath = '/tmp/videos';
    // Create the target directory for downloaded videos if it does not exist yet.
    if (!file_exists($videoPath.'/')) {
        mkdir($videoPath, 0777);
    }
    // Fetch the youtube-dl binary once and make it executable.
    if (!file_exists($downloaderPath)) {
        file_put_contents($downloaderPath, file_get_contents($downloaderUrl));
        chmod($downloaderPath, 0777);
    }
    // Run youtube-dl; escapeshellarg() keeps the URL from breaking the shell command.
    exec($downloaderPath.' -o "'.$videoPath.'/%(title)s.%(ext)s" '.escapeshellarg($video_url));
    return true;
}
Usage example:
video_download('http://www.youtube.com/watch?v=9bZkp7q19f0');
If you're interested in some explanation: it simply grabs the youtube-dl downloader from the website, saves it to your tmp folder and then runs it with the URL you passed. The result is saved to the other folder in tmp.

Taking long time to load random images

I have a random image generator for my site. The problem is, it takes a really long time. I was wondering if anybody could help speed it up in any way. The site is http://viralaftermath.com/, and this is the script:
header('Content-type: image/jpeg;');
$images = glob("images/" . '*.{jpg,jpeg,png,gif}', GLOB_BRACE);
echo file_get_contents($images[array_rand($images)]);
This is a pretty resource-intensive way to do this, as you are passing the image data through PHP and not specifying any caching headers, so the image has to be reloaded every single time you open the page.
A much better approach would be to have glob() list the files within the HTML page that you're using to embed the image. Then randomize that list, and emit an <img> tag pointing to the actual file name that you determined randomly.
When you are linking to a static image instead of the PHP script, you also likely benefit from the web server's caching defaults for static resources. (You could use PHP to send caching headers as well, but in this scenario it really makes the most sense to randomly point to static images.)
$images = glob("images/" . '*.{jpg,jpeg,png,gif}', GLOB_BRACE);
# Randomize order
shuffle ($images);
# Create URL
$url = "images/".basename($images[0]);
echo "<img src='$url'>";
Profile your code and find the bottlenecks. I can only make guesses.
echo file_get_contents($file);
This first reads the complete file into memory and then sends it to the output buffer. It would be much nicer if the file went straight to the output buffer: readfile() is your friend. It would be even better to avoid output buffering completely; ob_end_flush() will help you there.
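A minimal sketch of that readfile()/ob_end_flush() version, keeping the same images directory as the question:
header('Content-type: image/jpeg');
$images = glob("images/" . '*.{jpg,jpeg,png,gif}', GLOB_BRACE);
// Flush and disable PHP's output buffering so bytes go straight to the client.
while (ob_get_level() > 0) {
    ob_end_flush();
}
// readfile() streams the file to the output without building a PHP string first.
readfile($images[array_rand($images)]);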
A next candidate is the image directory. If searching for one image takes a significant time, you have to optimize that. This can be achieved by an index (e.g. with a database).

How can i download and parse portion of web page?

I don't want to download the whole web page; it takes time and needs a lot of memory.
How can I download just a portion of that web page and then parse it?
Suppose I need to download only the <div id="entryPageContent" class="cssBaseOne">...</div>. How can I do that?
You can't request "only this piece of HTML" when downloading a URL. HTTP only supports byte ranges for partial downloads and has no concept of HTML/XML document trees.
So you'll have to download the entire page, load it into a DOM parser, and then extract only the portion(s) you need.
e.g.
$html = file_get_contents('http://example.com/somepage.html');
$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings from imperfect real-world markup
$div = $dom->getElementById('entryPageContent');
$content = $dom->saveHTML($div); // serialize just that element back to HTML
Using this:
curl_setopt($ch, CURLOPT_RANGE, "0-10000");
will make cURL download only the first 10,000 bytes of the webpage. It also only works if the server side supports this; many pages generated by interpreted scripts (CGI, PHP, ...) ignore it.
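For context, a complete request using that option might look like this (a sketch; whether the fragment you need falls within the first 10,000 bytes, and whether the server honors Range at all, depends on the site):
$ch = curl_init('http://example.com/somepage.html');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
curl_setopt($ch, CURLOPT_RANGE, "0-10000");     // request only the first ~10 KB
$partial = curl_exec($ch);
curl_close($ch);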

Remote uploading MULTIPLE images

Okay, I have a question, guys. I want to remote upload (copy an image from a site to my server) MULTIPLE images by putting links into a TEXTAREA and hitting submit. I just don't know how to make this possible with multiple images.
I am able to do it with a single image using the copy() function, but not for multiple entries in a TEXTAREA.
I also want to limit the remote uploading feature to 30 remote links, and one image should not exceed 10MB. But I don't know how to start. I heard cURL can do this, and I also heard that file_get_contents() with file_put_contents() can do a similar thing, but I still cannot figure out how to do it myself.
Help, anyone? :)
You can use the same procedure as you do now with a single image, but do it in a loop.
$lines = explode("\n", $_POST['textarea']);
if (count($lines) > 30) {
    die('Too many files');
}
foreach ($lines as $line) {
    $srcfile = trim($line);
    // copy $srcfile here
    // check the size of the file with filesize()
}
You need to parse the URLs out of the textarea. You could do this on the PHP side with a regular expression.
You could then examine the parsed URLs and take the first 30 with array_slice(), or return an error if there are more than 30.
You'd then need to copy the files from the remote server. You could inspect the Content-Length header to ensure each file is under 10 MB, and you can get just the headers by using HEAD instead of GET.
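A sketch of that HEAD/Content-Length check with cURL (remote_size() is a hypothetical helper name, and the 10 MB limit is the one from the question):
function remote_size($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD request: headers only, no body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_exec($ch);
    $length = curl_getinfo($ch, CURLINFO_CONTENT_LENGTH_DOWNLOAD); // -1 if not reported
    curl_close($ch);
    return $length;
}
// Example: skip a file the server reports as unknown size or larger than 10 MB.
$size = remote_size('http://example.com/path/to/image.jpg');
if ($size < 0 || $size > 10 * 1024 * 1024) {
    // too large or unknown: skip this one
}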
I am not familiar with PHP but I suggest the following:
Solving the multiple-files upload issue:
Split the content of the textarea by the line break (carriage return),
then iterate over the lines to get each image,
and keep the size of each file in a variable. But how do you get the size?
You can make an exec (system) call to find out the file size (this requires a full image download, but it is the most convenient way), or you can use the Content-Length header value; if the content length is more than 10 MB, skip it and move on to the next item.
How to download the image?
Use file_put_contents(), but make sure to treat the content as binary so the image data is preserved.
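Put into PHP, those steps could look roughly like this (a sketch; the uploads/ target directory and the 10 MB limit are assumptions, and this variant checks the size after the full download rather than via Content-Length):
$urls = array_slice(array_filter(array_map('trim', explode("\n", $_POST['textarea']))), 0, 30);
foreach ($urls as $url) {
    $data = file_get_contents($url); // returns the raw bytes, so binary image data is preserved
    if ($data === false || strlen($data) > 10 * 1024 * 1024) {
        continue; // failed download or larger than 10 MB: move on to the next item
    }
    $dest = 'uploads/' . basename(parse_url($url, PHP_URL_PATH)); // assumed target directory
    file_put_contents($dest, $data);
}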

How to improve Image Scraping (using PHP and JS) to Imitate Facebook Previewer

I've developed an image-scraping mechanism in PHP+JS that allows a user to share URLs and get a rendered preview (very much like Facebook's previewer when you share links). However, the whole process sometimes gets slow or sometimes fetches wrong images, so in general, I'd like to know how to improve it, especially its speed and accuracy. Stuff like parsing the DOM faster or getting image sizes faster. Here's the process I'm using, for those who want to know more:
A. Get the HTML of the page using PHP (I actually use one of CakePHP's classes, which in turn use fwrite and fread to fetch the HTML. I wonder if cURL would be significantly better).
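For reference, the cURL variant of step A would only be a few lines (a sketch; whether it is noticeably faster than the fwrite/fread approach is something you would have to benchmark):
$ch = curl_init($url); // $url is the link shared, as in step B below
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the HTML as a string
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // avoid hanging on slow hosts
$html = curl_exec($ch);
curl_close($ch);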
B. Parse the HTML using DOMDocument to get the img tags, while also filtering out any "image" that is not a png, jpg, or gif (you know, sometimes people place tracking scripts inside img tags).
$DOM = new DOMDocument();
@$DOM->loadHTML($html); //$html here is a string returned from step A
$images = $DOM->getElementsByTagName('img');
$imagesSRCs = array();
foreach ($images as $image) {
    $src = trim($image->getAttribute('src'));
    if (!preg_match('/\.(jpeg|jpg|png|gif)/', $src)) {
        continue;
    }
    $src = urldecode($src);
    $src = url_to_absolute($url, $src); //custom function; $url is the link shared
    $imagesSRCs[] = $src;
}
$imagesSRCs = array_unique($imagesSRCs); // eliminates copies of the same image
C. Send an array with all those image tags to a page that processes them using JavaScript (specifically, jQuery). This processing consists mostly of discarding images that are smaller than 80 pixels (so I don't get blank gifs, hundreds of tiny icons, etc.). Because it must calculate each image's size, I decided to use JS instead of PHP's getimagesize(), which was insanely slow. Thus, as the images get loaded by the browser, it does the following:
$('.fetchedThumb').load(function() {
    $smallestDim = Math.min(this.width, this.height);
    if ($smallestDim < 80) {
        $(this).parent().parent().remove(); //removes container divs and below
    }
});
Rather than downloading the content like this, why not create a server-side component that uses something like wkhtmltoimage or PhantomJS to render an image of the page, and then just scale that image down to a preview size?
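A rough sketch of that idea, assuming wkhtmltoimage is installed on the server and using PHP's GD extension for the downscaling (the binary name, the output paths and the preview width are all placeholders to adapt):
function render_preview($url, $out = '/tmp/preview.png', $width = 200) {
    // Render the page to a full-size screenshot with wkhtmltoimage.
    $full = '/tmp/full.png';
    exec('wkhtmltoimage ' . escapeshellarg($url) . ' ' . escapeshellarg($full));
    // Scale it down to preview size with GD; the height is derived to keep the aspect ratio.
    $src = imagecreatefrompng($full);
    $thumb = imagescale($src, $width);
    imagepng($thumb, $out);
    return $out;
}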
This is exactly why I made jQueryScrape
It's a very lightweight jQuery plugin + PHP proxy that lets you scrape remote pages asynchronously, and it's blazing fast. That demo I linked above goes to around 8 different sites and pulls in tons of content, usually in less than 2 seconds.
The biggest bottleneck when scraping with PHP is that PHP will try to download all referenced content (meaning images) as soon as you try to parse anything server side. To avoid this, the proxy in jQueryScrape actually breaks image tags on the server before sending it to the client (by changing all img tags to span tags.)
The jQuery plugin then provides a span2img method that converts those span tags back to images, so the downloading of images is left to the browser and happens as the content is rendered. You can at that point use the result as a normal jQuery object for parsing and rendering selections of the remote content. See the github page for basic usage.
