How can I download and parse a portion of a web page? - PHP

I don't want to download the whole web page. It takes time and needs a lot of memory.
How can I download just a portion of that web page and then parse it?
Suppose I only need to download the <div id="entryPageContent" class="cssBaseOne">...</div>. How can I do that?

You can't download a portion of a URL by asking for "only this piece of HTML". HTTP supports byte ranges for partial downloads, but it has no concept of HTML/XML document trees.
So you'll have to download the entire page, load it into a DOM parser, and then extract only the portion(s) you need.
e.g.
$html = file_get_contents('http://example.com/somepage.html');
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings from malformed real-world HTML
$div = $dom->getElementById('entryPageContent');
$content = $dom->saveHTML($div); // serialize just that element (PHP >= 5.3.6)
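If you prefer XPath (which also lets you match on the class attribute from the question), an equivalent sketch using the same placeholder URL:
$dom = new DOMDocument();
@$dom->loadHTML(file_get_contents('http://example.com/somepage.html'));
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div[@id="entryPageContent"]');
if ($nodes->length > 0) {
    $content = $dom->saveHTML($nodes->item(0)); // outer HTML of just that div (PHP >= 5.3.6)
}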

Using this:
curl_setopt($ch, CURLOPT_RANGE, "0-10000");
will make cURL download only the first 10,000 bytes of the web page. It will also only work if the server side supports this; many interpreted scripts (CGI, PHP, ...) ignore it.
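For example, a minimal sketch of a range-limited fetch (example.com is a placeholder; check the HTTP status to see whether the range was honoured):
$ch = curl_init('http://example.com/somepage.html'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_RANGE, '0-10000');          // ask for bytes 0..10000 only
$partial = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);     // 206 = range honoured, 200 = full page anyway
curl_close($ch);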

Related

How to grab flash video and download video data with file_get_contents using PHP Simple HTML DOM Parser

I am simply looking to screen scrape a web page that contains Flash videos. While scraping the page with PHP Simple HTML DOM Parser, I'd like to grab the embed snippet and download the video data. Can anyone help?
Reference to possibly help:
How to find object tag with param and embed tag inside html using simple html d
I don't think there's a universal solution; otherwise youtube-dl's special-casing for many sites wouldn't be necessary.
If the site you have in mind is on that list, I would recommend simply using youtube-dl.
If you're using a Linux server, you can use this small PHP wrapper for youtube-dl. Just pass the URL of the page with the video to the function.
function video_download($video_url) {
    $downloaderUrl = 'https://yt-dl.org/downloads/2015.01.23.4/youtube-dl';
    $downloaderPath = '/tmp/youtube-dl';
    $videoPath = '/tmp/videos';
    if (!file_exists($videoPath.'/')) {
        mkdir($videoPath, 0777);
    }
    if (!file_exists($downloaderPath)) {
        // download the youtube-dl binary once and make it executable
        file_put_contents($downloaderPath, file_get_contents($downloaderUrl));
        chmod($downloaderPath, 0777);
    }
    exec($downloaderPath.' -o "'.$videoPath.'/%(title)s.%(ext)s" '.escapeshellarg($video_url));
    return true;
}
Usage example:
video_download('http://www.youtube.com/watch?v=9bZkp7q19f0');
If you're interested in an explanation: it simply grabs the youtube-dl downloader from its website, saves it to your tmp folder, and then runs it with the URL you passed. The result is saved to a separate folder in tmp.

PHP - file_get_contents skip style and script

I am trying to pull the HTML of a site using file_get_contents($url).
When I run file_get_contents it takes too much time to pull the HTML of the host site.
Can I skip styles, scripts and images?
I think it would then take less time to pull the HTML of that site.
Try:
$file = file_get_contents($url);
$only_body = preg_replace("/.*<body[^>]*>|<\/body>.*/si", "", $file);
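Note that file_get_contents only downloads the HTML document itself, not the styles, scripts or images it references, so this won't speed up the fetch. If the goal is just to drop inline <script> and <style> blocks from the result, a rough sketch:
$file = file_get_contents($url);
// strip inline scripts and styles from the downloaded HTML
$file = preg_replace('/<script\b[^>]*>.*?<\/script>/si', '', $file);
$file = preg_replace('/<style\b[^>]*>.*?<\/style>/si', '', $file);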

saving unknown files with curl w/ PHP 5.3.x

I'm trying to archive a web-based forum that has attachments that users have posted. So far, I made use of the PHP cURL library to get the individual topics and have been able to save the raw pages. However, I now need to figure out a way to archive the attachments that are located on the site.
Here is the problem: Since the file type is not consistent, I need to find a way to save the files with the correct extension. Note that I plan to rename the file when I save it so that it's organized in a way that it can be easily found later.
The link to the attached files in a page is in the format:
some file.txt
I've already used preg_match() to get the URLs of the attached files. My biggest problem now is just making sure the fetched file is saved in the correct format.
My question: Is there any way to get the file type efficiently? I'd rather not have to use a regular expression, but I'm not seeing any other way.
Does the server send the correct Content-Type header field when serving the files? If so, you can intercept it by setting CURLOPT_HEADER, or with file_get_contents + $http_response_header.
http://www.php.net/manual/en/reserved.variables.httpresponseheader.php
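A rough sketch of both approaches (assuming $fileUrl holds one of the attachment URLs you extracted):
// Option 1: cURL -- fetch only the headers and read the Content-Type.
$ch = curl_init($fileUrl);
curl_setopt($ch, CURLOPT_NOBODY, true);           // HEAD-style request, no body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
$type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE); // e.g. "image/png" or "application/pdf"
curl_close($ch);

// Option 2: file_get_contents -- the response headers land in $http_response_header.
$data = file_get_contents($fileUrl);
foreach ($http_response_header as $header) {
    if (stripos($header, 'Content-Type:') === 0) {
        $type = trim(substr($header, 13));
    }
}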
I would look into
http://www.php.net/manual/en/book.fileinfo.php
to see if you can automatically detect the file type once you have the data.
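For example, a small sketch using the fileinfo extension to map the downloaded bytes to an extension (the mapping array and target path below are just illustrative assumptions):
$finfo = new finfo(FILEINFO_MIME_TYPE);
$mime = $finfo->buffer($data);                    // e.g. "image/jpeg"
// Minimal, illustrative mapping -- extend it for the types your forum actually hosts.
$extensions = array(
    'image/jpeg'      => 'jpg',
    'image/png'       => 'png',
    'image/gif'       => 'gif',
    'application/pdf' => 'pdf',
    'text/plain'      => 'txt',
);
$ext = isset($extensions[$mime]) ? $extensions[$mime] : 'bin';
file_put_contents('/path/to/archive/attachment.'.$ext, $data); // hypothetical target path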
You can use DOMDocument and DOMXPath to extract URLs and filenames safely.
$doc = new DOMDocument();
@$doc->loadHTML($content); // $content is the fetched page; @ silences warnings on sloppy HTML
$xpath = new DOMXpath($doc);
// query examples:
foreach ($xpath->query('//a') as $node) {
    echo $node->nodeValue; // link text
}
foreach ($xpath->query('//a/@href') as $node) {
    echo $node->nodeValue; // link URL
}

How to improve Image Scraping (using PHP and JS) to Imitate Facebook Previewer

I've developed an image-scraping mechanism in PHP+JS that allows a user to share URLs and get a rendered preview (very much like Facebook's previewer when you share links). However, the whole process sometimes gets slow or sometimes fetches wrong images, so in general, I'd like to know how to improve it, especially its speed and accuracy. Stuff like parsing the DOM faster or getting image sizes faster. Here's the process I'm using, for those who want to know more:
A. Get the HTML of the page using PHP (I actually use one of CakePHP's classes, which in turn uses fwrite and fread to fetch the HTML. I wonder if cURL would be significantly better).
B. Parse the HTML using DOMDocument to get the img tags, while also filtering out any "image" that is not a png, jpg, or gif (you know, sometimes people place tracking scripts inside img tags).
$DOM = new DOMDocument();
@$DOM->loadHTML($html); // $html here is a string returned from step A; @ silences parser warnings
$images = $DOM->getElementsByTagName('img');
$imagesSRCs = array();
foreach ($images as $image) {
    $src = trim($image->getAttribute('src'));
    if (!preg_match('/\.(jpeg|jpg|png|gif)/', $src)) {
        continue; // skip anything that isn't a png, jpg, or gif
    }
    $src = urldecode($src);
    $src = url_to_absolute($url, $src); // custom function; $url is the link shared
    $imagesSRCs[] = $src;
}
$imagesSRCs = array_unique($imagesSRCs); // eliminates copies of the same image
C. Send an array with all those image tags to a page which processes them using Javascript (specifically, jQuery). This processing consists mostly of discarding images smaller than 80 pixels (so I don't get blank gifs, hundreds of tiny icons, etc.). Because each image's size must be calculated, I decided to use JS instead of PHP's getimagesize(), which was insanely slow. Thus, as the images get loaded by the browser, it does the following:
$('.fetchedThumb').load(function() {
    var $smallestDim = Math.min(this.width, this.height);
    if ($smallestDim < 80) {
        $(this).parent().parent().remove(); // removes container divs and below
    }
});
Rather than downloading the content like this, why not create a server-side component that uses something like wkhtmltoimage or PhantomJS to render an image of the page, and then just scale the image down to a preview size?
This is exactly why I made jQueryScrape
It's a very lightweight jQuery plugin + PHP proxy that lets you scrape remote pages asynchronously, and it's blazing fast. That demo I linked above goes to around 8 different sites and pulls in tons of content, usually in less than 2 seconds.
The biggest bottleneck when scraping with PHP is that PHP will try to download all referenced content (meaning images) as soon as you try to parse anything server side. To avoid this, the proxy in jQueryScrape actually breaks image tags on the server before sending it to the client (by changing all img tags to span tags.)
The jQuery plugin then provides a span2img method that converts those span tags back to images, so the downloading of images is left to the browser and happens as the content is rendered. You can at that point use the result as a normal jQuery object for parsing and rendering selections of the remote content. See the github page for basic usage.
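The general server-side idea, independent of the plugin (a sketch, not jQueryScrape's actual code), is roughly:
$html = file_get_contents($remoteUrl); // $remoteUrl: the page being scraped
// Rewrite <img ... src="..."> to <span ... data-src="..."> so nothing is fetched
// until the client deliberately turns the spans back into images.
$html = preg_replace(
    '/<img\b([^>]*?)\bsrc\s*=\s*(["\'])(.*?)\2([^>]*?)\/?>/si',
    '<span${1}data-src=${2}${3}${2}${4}></span>',
    $html
);
echo $html; // send the rewritten markup to the browser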

saving and reading an XML file coming from another URL

Maybe I am going to ask a stupid question, but I don't have any idea about PHP.
I have never worked in PHP and now I have to, so please give me some useful tips.
I have an XML file that comes from a different URL. I want to save it on the server, then read it and render it on a page in the proper format, with some modification of the data.
You can use DOM
$dom = new DOMDocument();
$dom->load('http://www.example.com');
This would load the XML from the remote URL. You can then process it as needed. See my previous answers on various topics using DOM. To save the file to your server after you've processed it, you use
$dom->save('filename.xml');
Loading the file with $dom->load() will only work if you have allow_url_fopen enabled in your php.ini. If not, you have to use cURL to download the remote file first.
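For example, a minimal sketch of the cURL route (placeholder URL):
$ch = curl_init('http://www.example.com/feed.xml'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$xmlString = curl_exec($ch);
curl_close($ch);

$dom = new DOMDocument();
$dom->loadXML($xmlString);  // parse the downloaded string instead of loading the URL directly
// ... read or modify the document here ...
$dom->save('filename.xml'); // write the result to your server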
Maybe this will be helpful to you: http://www.php.net/manual/en/function.simplexml-load-file.php
If you have difficulty getting the XML file from the remote host, you can combine file_get_contents with simplexml_load_string:
$path_to_xml = 'http://some.com/file.xml';
$xml = simplexml_load_string( file_get_contents($path_to_xml) );
