How to crawl and download all PDF files from an HTML link? - php

This is my code to crawl all PDF links, but it doesn't work. How can I download the files from those links and save them to a folder on my computer?
<?php
set_time_limit(0);
include 'simple_html_dom.php';
$url = 'http://example.com';
$html = file_get_html($url) or die ('invalid url');
// extract PDF links
foreach($html->find('a[href=[^"]*\.pdf]') as $element)
    echo $element->href.'<br>';
?>

foreach($htnl->find('a[href=[^"]*\.pdf]') as element)
           ^---typo: should be an 'm'        ^---typo: need a $ here
How does your code "not work", other than because of the above typos?

Have you looked into phpQuery?
http://code.google.com/p/phpquery/

A simpler solution would be:
foreach ($html->find('a[href$=pdf]') as $element)
From the manual (https://simplehtmldom.sourceforge.io/manual.htm):
[attribute$=value] Matches elements that have the specified attribute
and it ends with a certain value.
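
The question also asks how to save the files, which none of the answers cover. Here is a minimal sketch of one way to do it, not a definitive implementation. It uses PHP's built-in DOM extension rather than Simple HTML DOM so that it runs without extra files. The function names extractPdfLinks and downloadAll, the target folder, and the very naive handling of relative links are all assumptions made for illustration.

```php
<?php
// Sketch only: built-in DOMDocument is used here instead of Simple HTML DOM.

/**
 * Return the href of every <a> whose URL path ends in ".pdf" (case-insensitive).
 * Non-absolute links are simply appended to $baseUrl (a naive assumption).
 */
function extractPdfLinks(string $html, string $baseUrl): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate malformed HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        // Look only at the path so that "file.pdf?x=1" still counts as a PDF.
        $path = (string) parse_url($href, PHP_URL_PATH);
        if (strtolower(substr($path, -4)) !== '.pdf') {
            continue;
        }
        if (!preg_match('#^https?://#i', $href)) {
            $href = rtrim($baseUrl, '/') . '/' . ltrim($href, '/');
        }
        $links[] = $href;
    }
    return array_values(array_unique($links));
}

/** Download each URL into $dir, naming the file after the last path segment. */
function downloadAll(array $urls, string $dir): void
{
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }
    foreach ($urls as $url) {
        $data = @file_get_contents($url); // needs allow_url_fopen for remote URLs
        if ($data === false) {
            echo "Failed: $url\n";
            continue;
        }
        $name = basename((string) parse_url($url, PHP_URL_PATH));
        file_put_contents($dir . '/' . $name, $data);
    }
}

// Usage (hypothetical URL and folder):
// $html = file_get_contents('http://example.com');
// downloadAll(extractPdfLinks($html, 'http://example.com'), __DIR__ . '/pdfs');
```

Checking the URL path instead of the raw href is a design choice: it keeps links that carry a query string after the .pdf extension.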

Related

Getting content from external div

I need to get content from external page.
For example:
Let's use this site: https://en.wikipedia.org/wiki/Main_Page I need to get
only content of "On this day..." so it means div with id="mp-otd"
How can I do that with PHP?

You can do this by using the PHP Simple HTML DOM Parser:
include_once('simple_html_dom.php');
$html = file_get_html('https://en.wikipedia.org/wiki/Main_Page');
$div_content = $html->find('div[id=mp-otd]', 0);

You need to download the library from http://simplehtmldom.sourceforge.net/
For example:
// Create DOM from URL or file
$html = file_get_html('https://en.wikipedia.org/wiki/Main_Page');
// Find the div with id="mp-otd"
foreach($html->find('div#mp-otd') as $element)
    echo $element->innertext . '<br>';
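
If installing the library is not an option, the same lookup can be sketched with PHP's built-in DOM extension. This is an alternative technique, not the Simple HTML DOM API used in the answers, and the helper name innerHtmlById is hypothetical. The sample markup below is made up so the snippet runs without a network request.

```php
<?php
// Sketch only: built-in DOMDocument/DOMXPath instead of Simple HTML DOM.

/** Return the inner HTML of the first element with the given id, or null if none exists. */
function innerHtmlById(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate malformed HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Compare ids in PHP rather than interpolating $id into the XPath string.
    foreach ($xpath->query('//*[@id]') as $node) {
        if ($node->getAttribute('id') !== $id) {
            continue;
        }
        $inner = '';
        foreach ($node->childNodes as $child) {
            $inner .= $doc->saveHTML($child);
        }
        return $inner;
    }
    return null;
}

// Made-up sample markup standing in for the downloaded page.
$page = '<div id="mp-otd"><p>On this day</p></div><div id="other">x</div>';
echo innerHtmlById($page, 'mp-otd');
```

For a live page, the $page string would come from something like file_get_contents() on the target URL.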

Issue with php simple html DOM parser in Joomla

I am trying to insert a stock chart module from another site into my own website.
When I use:
jimport('simplehtmldom.simple_html_dom');
// get DOM from URL or file
$html = file_get_html('http://www.raiffeisen.com/');
foreach($html->find('div#agrarfenster') as $element)
    echo $element->innertext;
the output works. But I need this code for the required output:
jimport('simplehtmldom.simple_html_dom');
// get DOM from URL or file
$html = file_get_html('http://www.raiffeisen.com/');
foreach($html->find('div#boersenfenster_bf_4562') as $element)
    echo $element->innertext;
This code doesn't work. But why?
My guess is that the underscores in "boersenfenster_bf_4562" are the problem.
Can somebody help me?

PHP extract specific DIV from target URL

I'm using Simple HTML DOM to try to extract a div and all of its contents from a target URL. Here is my code:
<?php
require 'simple_html_dom.php';
$html = file_get_html('http://mozilla.org');
foreach($html->find('.accordion') as $element)
echo $element . '<br>';
?>
The problem I have is that the above code only extracts the plain text of the div. There are also images in the div that I need to extract. If I use the following code, all images are extracted, but so is everything else on the page.
<?php
require 'simple_html_dom.php';
$html = file_get_html('http://mozilla.org');
echo $html;
?>
So my question is, how can I use the first bit of code to extract the contents + images from .accordion?
Thanks

You could always try:
$imgs = array();
foreach($html->find('.accordion',0)->find('img') as $img){
    $imgs[] = $img->src;
}
print_r($imgs);
This should populate the $imgs variable with all of the image links from the .accordion div.
:)
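
The same idea can be sketched without the library, using PHP's built-in DOM extension. This is an alternative technique, not the Simple HTML DOM API from the answer, and the function name imageSourcesInClass is hypothetical. Unlike the answer, which looks only at the first .accordion element, this version collects images from every element carrying the class.

```php
<?php
// Sketch only: built-in DOMDocument/DOMXPath instead of Simple HTML DOM.

/** Return the src of every <img> nested inside an element that has the given class. */
function imageSourcesInClass(string $html, string $class): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate malformed HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $sources = [];
    foreach ($xpath->query('//*[@class]') as $node) {
        // A class attribute holds a whitespace-separated list of names.
        $classes = preg_split('/\s+/', trim($node->getAttribute('class')));
        if (!in_array($class, $classes, true)) {
            continue;
        }
        // Search only below the matching element.
        foreach ($xpath->query('.//img[@src]', $node) as $img) {
            $sources[] = $img->getAttribute('src');
        }
    }
    // Nested matching elements would otherwise report the same image twice.
    return array_values(array_unique($sources));
}

// Made-up sample markup standing in for the downloaded page.
$page = '<div class="accordion open"><p>Text <img src="a.png"></p><img src="b.png"></div><img src="c.png">';
print_r(imageSourcesInClass($page, 'accordion'));
```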

How to find specific query in the URL and display whole link

My current code looks like this:
include 'simple_html_dom.php';
// Create DOM from URL or file
$html = file_get_html('http://www.AnyLinkAlsoCan.com');
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
It crawls the page and finds tags like this:
<a href="http://news.example.com/node">
and outputs every link it finds on the website.
Example:
http://news.example.com.my/node/321072
http://news.example.com.my/taxonomy/term/2
http://news.example.com.my/node/321060?tid=2
I want to find only the URLs that contain ?tid=, such as the third URL in the example:
http://news.example.com.my/node/321060?tid=2
I tried replacing the echo line with echo $element->href="*?tid but that just returns an error. Can someone help me with this?

You can use preg_match, or you can check whether each extracted URL contains ?tid:
<?php
include 'simple_html_dom.php';
// Create DOM from URL or file
$html = file_get_html('http://www.AnyLinkAlsoCan.com');
// Find all links
$search = '?tid';
foreach($html->find('a') as $element) {
    // strpos() returns false when nothing matches and 0 for a match at
    // position 0, so compare strictly against false
    if(strpos((string) $element->href, $search) !== false) {
        echo $element->href . '<br>';
    }
}
?>

Use parse_url() to parse each URL, and then select only the ones you want based on PHP_URL_QUERY.
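
A minimal sketch of the parse_url() approach follows. It assumes the goal is to keep URLs whose query string has a tid parameter. The helper name hasQueryParam is hypothetical, and the URL list is copied from the question.

```php
<?php
/** True if the URL's query string contains the given parameter name. */
function hasQueryParam(string $url, string $param): bool
{
    $query = parse_url($url, PHP_URL_QUERY);
    if (!is_string($query)) {
        return false; // null when there is no query, false when the URL cannot be parsed
    }
    parse_str($query, $params); // decode "a=1&b=2" into an associative array
    return array_key_exists($param, $params);
}

$urls = [
    'http://news.example.com.my/node/321072',
    'http://news.example.com.my/taxonomy/term/2',
    'http://news.example.com.my/node/321060?tid=2',
];

// Keep only the URLs that carry a tid parameter.
$matches = array_values(array_filter($urls, fn($u) => hasQueryParam($u, 'tid')));
print_r($matches);
```

Compared with a substring search, this matches the parameter name exactly, so a URL containing only ?stid=2 is not selected.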

Get SRC from div contents

I have code that gets a div contents:
include_once('simple_html_dom.php');
$html = file_get_html("link");
$ret = $html->find('div');
echo $ret[0];
preg_match_all('/(src)=("[^"]*")/i',$ret[0], $link);
echo $link[0];
It returns the full div contents including all the CSS. However I just wanted it to echo the information after src= basically just echoing the image link and nothing else. I've tried to use preg_match with no success.
Any ideas?

Your HTML parser will help you there. Once you select the img element inside the div, its src attribute is available as a property:
echo $ret[0]->find('img', 0)->src;

You don't need a regexp for that, since you already use a DOM parser. Loop over the images inside the div:
foreach($ret[0]->find('img') as $element)
    echo $element->src,'<br/>';
