I'm using the following code to get remote content with PHP cURL:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://example.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);
echo $output;
This code returns the whole content, but I just want to print all stylesheets in the following format:
<link rel="stylesheet" href="http://www.example.com/css/style1.css">
<link rel="stylesheet" href="http://www.example.com/css/style2.css">
How do I filter the content (e.g. with str_replace()) so that I only get the stylesheets from the cURL output?
If you only want to leave the <link> elements intact then you can use PHP's strip_tags() function.
strip_tags — Strip HTML and PHP tags from a string
It accepts an additional parameter that defines allowed tags, so all you have to do is set the only allowed tag to be the <link> tag.
$output = curl_exec($ch);
$linksOnly = strip_tags($output, '<link>');
The main problem here is that you don't really know what content you are going to get, and trying to parse HTML with anything other than a tool designed for that task may leave you with grey hair and a nervous twitch ;)
References -
strip_tags()
Using the Simple HTML DOM library:
include('simple_html_dom.php');
// get DOM from URL or file
$html = file_get_html('http://www.example.com/');
// or you can get the $html string through your cURL request and use
// $html = str_get_html($html);
// find all "link"
foreach($html->find('link') as $e) {
if($e->type == "text/css" && strpos($e->href, "://") !== false) // you don't want relative CSS hrefs, right?
echo $e->href."<br>";
}
A better approach would be to use PHP DOM to parse the HTML tree and retrieve the required nodes - <link> in your case - and filter them appropriately.
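For example, a minimal sketch of that approach, assuming $output already holds the HTML fetched with cURL above:
$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely perfectly valid
$doc->loadHTML($output);
libxml_clear_errors();
foreach ($doc->getElementsByTagName('link') as $link) {
    if (strtolower($link->getAttribute('rel')) === 'stylesheet') {
        echo '<link rel="stylesheet" href="' . $link->getAttribute('href') . '">' . "\n";
    }
}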
Using a regex:
preg_match_all('/rel="stylesheet" href="(.*?)">/', $output, $matches);
if (isset($matches[1]) && count($matches[1]))
{
foreach ($matches[1] as $value)
{
echo '<link rel="stylesheet" href="'.$value.'">';
}
}
Related
There are 6 images at this URL whose src attributes I want to get. My goal is to get all the image src values with PHP, but only one image src is coming through.
<?php
require_once ('simple_html_dom/simple_html_dom.php');
$html = file_get_html('https://www.zara.com/tr/en/flatform-derby-shoes-with-reversible-fringe-p15318201.html?v1=5276035&v2=734142');
foreach($html->find('img') as $element){
echo $element->src . '<br>';
}
?>
After looking at the Simple HTML DOM bug tracker, it seems like they are having some issues fetching values that aren't real URLs.
Looking at the source of the page you're trying to fetch, only one image actually has a URL. The rest have inline images: src="data:image/png;base64,...".
I would suggest using PHP's own DOMDocument for this.
Here's a working solution (with comments):
<?php
// Get the HTML from the URL
$data = file_get_contents("https://www.zara.com/tr/en/flatform-derby-shoes-with-reversible-fringe-p15318201.html?v1=5276035&v2=734142");
$doc = new DOMDocument;
// DOMDocument throws a bunch of errors since the HTML isn't 100% valid
// (and for all HTML5-tags) but it will sort them out.
// Let's just tell it to fix it in silence.
libxml_use_internal_errors(true);
$doc->loadHTML($data);
libxml_clear_errors();
// Fetch all img-tags and get the 'src' attributes.
foreach ($doc->getElementsByTagName('img') as $img) {
echo $img->getAttribute('src') . '<br />';
}
Demo: https://www.tehplayground.com/sh4yJ8CqIwypwkCa
Actually, those base64 strings are the images themselves, base64-encoded. As for the page you want to parse: although the images are base64-encoded, the <a> tags that are the parents of the images actually contain the image URLs.
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch,CURLOPT_URL,"https://www.zara.com/tr/en/flatform-derby-shoes-with-reversible-fringe-p15318201.html?v1=5276035&v2=734142");
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
$data = curl_exec($ch);
curl_close($ch);
and now the data manipulation
libxml_use_internal_errors(true);
$siteData = new DOMDocument();
$siteData->loadHTML($data);
$a = $siteData->getElementsByTagName("a"); //get the a tags
for($i=0;$i<$a->length;$i++){
if($a->item($i)->getAttribute("class")=="_seoImg"){ //_seoImg class is the image class
echo $a->item($i)->getAttribute("href").'<br/>';
}
}
and the result is
//static.zara.net/photos///2017/I/1/1/p/5318/201/040/3/w/560/5318201040_2_1_1.jpg?ts=1508311623896
//static.zara.net/photos///2017/I/1/1/p/5318/201/040/3/w/560/5318201040_1_1_1.jpg?ts=1508311816920
//static.zara.net/photos///2017/I/1/1/p/5318/201/040/3/w/560/5318201040_2_3_1.jpg?ts=1508311715728
//static.zara.net/photos///2017/I/1/1/p/5318/201/040/3/w/560/5318201040_2_10_1.jpg?ts=1508315639664
//static.zara.net/photos///2017/I/1/1/p/5318/201/040/3/w/560/5318201040_2_2_1.jpg?ts=1508311682567
I want to get the whole <article> element, which represents one listing (containing the image + title + its link + description), but it doesn't work. Can someone help me please?
<?php
$url = 'http://www.polkmugshot.com/';
$content = file_get_contents($url);
$first_step = explode( '<article>' , $content );
$second_step = explode("</article>" , $first_step[3] );
echo $second_step[0];
?>
You should definitely be using cURL for this type of request.
function curl_download($url){
// is cURL installed?
if (!function_exists('curl_init')){
die('cURL is not installed!');
}
$ch = curl_init();
// URL to download
curl_setopt($ch, CURLOPT_URL, $url);
// User agent
curl_setopt($ch, CURLOPT_USERAGENT, "Set your user agent here...");
// Include header in result? (0 = no, 1 = yes)
curl_setopt($ch, CURLOPT_HEADER, 0);
// Should cURL return or print out the data? (true = return, false = print)
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Timeout in seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
// Download the given URL, and return output
$output = curl_exec($ch);
// Close the cURL resource, and free system resources
curl_close($ch);
return $output;
}
For best results, combine it with the Simple HTML DOM parser and use it like this:
$output = curl_download($url);
$html = str_get_html($output);
// Find all images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
    echo $element->href . '<br>';
Good Luck!
I'm not sure I get you right, but I guess you need a PHP DOM parser. I suggest this one (it's a great PHP library for parsing HTML).
You can also get the whole HTML code like this:
$url = 'http://www.polkmugshot.com/';
$html = file_get_html($url);
echo $html;
Probably a better way would be to parse the document and run some xpath queries over it afterwards, like so:
$url = 'http://www.polkmugshot.com/';
$xml = simplexml_load_file($url);
$articles = $xml->xpath("//article");
foreach ($articles as $article) {
// do sth. useful here
}
Read about SimpleXML here.
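Note that simplexml_load_file() only works if the page happens to be well-formed XML; a rough sketch of the same XPath idea that tolerates ordinary HTML (assuming the page has already been fetched into $content) would use DOMDocument with DOMXPath:
$doc = new DOMDocument();
libxml_use_internal_errors(true); // tolerate the usual HTML validation noise
$doc->loadHTML($content);
libxml_clear_errors();
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//article') as $article) {
    // do sth. useful here, e.g. echo $doc->saveHTML($article);
}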
extract the articles with DOMDocument. working example:
<?php
$url = 'http://www.polkmugshot.com/';
$content = file_get_contents($url);
$domd = new DOMDocument();
@$domd->loadHTML($content);
foreach($domd->getElementsByTagName("article") as $article){
var_dump($domd->saveHTML($article));
}
and as pointed out by @Guns, you'd better use cURL, for several reasons:
1: file_get_contents will fail if allow_url_fopen is not set to true in php.ini
2: until around PHP 5.5.0, file_get_contents kept reading from the connection until it was actually closed, which for many servers can be many seconds after all the content has been sent, while cURL only reads up to the Content-Length announced in the HTTP headers, which makes for much faster transfers (luckily this has since been fixed)
3: cURL supports gzip and deflate compressed transfers, which again makes for much faster transfers (when the content is compressible, such as HTML), while file_get_contents will always transfer plain, uncompressed data
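Note that cURL only uses compressed transfers if you ask for them; a minimal sketch (reusing the same URL as above):
$ch = curl_init("http://www.polkmugshot.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// an empty string makes cURL advertise all encodings it supports (gzip, deflate, ...)
// and transparently decompress the response
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
curl_close($ch);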
Here is a function that validates the .edu TLD and checks that the URL does not point to a .pdf or .doc document.
public function validateEduDomain($url) {
if( preg_match('/^https?:\/\/[A-Za-z]+[A-Za-z0-9\.-]+\.edu/i', $url) && !preg_match('/\.(pdf)|(doc)$/i', $url) ) {
return TRUE;
}
return FALSE;
}
Now I am encountering links that point to jpg, rtf and other formats that simple_html_dom tries to parse and return the content of. I want to avoid this by skipping all such links. The problem is that the list of extensions is non-exhaustive, and I want the code to skip all of them. How am I supposed to do that?
Trying to filter URLs by guessing what's behind them will always fail in a number of cases. Assuming you are using cURL to download, you should check whether the response Content-Type header is among the acceptable ones:
<?php
require "simple_html_dom.php";
$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); //default is to output it
$urls = array(
"google.com",
"https://www.google.com/logos/2012/newyearsday-2012-hp.jpg",
"http://cran.r-project.org/doc/manuals/R-intro.pdf",
);
$acceptable_types = array("text/html", "application/xhtml+xml");
foreach ($urls as $url) {
curl_setopt($curl, CURLOPT_URL, $url);
$contents = curl_exec($curl);
//we need to handle content-types like "text/html; charset=utf-8"
list($response_type) = explode(";", curl_getinfo($curl, CURLINFO_CONTENT_TYPE));
if (in_array($response_type, $acceptable_types)) {
echo "accepting {$url}\n";
// create a simple_html_dom object from string
$obj = str_get_html($contents);
} else {
echo "rejecting {$url} ({$response_type})\n";
}
}
running the above results in:
accepting google.com
rejecting https://www.google.com/logos/2012/newyearsday-2012-hp.jpg (image/jpeg)
rejecting http://cran.r-project.org/doc/manuals/R-intro.pdf (application/pdf)
Update the last regex to something like this:
!preg_match('/\.(pdf|doc|jpg|rtf)$/i', $url) )
This will filter out the jpg and rtf documents. Note that the extensions need to be grouped inside a single pair of parentheses, otherwise the $ anchor only applies to the last alternative.
You have to add any further extensions you want to omit to the regex above.
Update
I don't think it's possible to block every sort of extension, and I personally do not recommend it for scraping either. You will have to skip some extensions to keep crawling. Why don't you change your regex filter to match only the ones you would like to accept, like:
preg_match('/\.(html|php|aspx)$/i', $url) )
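Applied before fetching, that whitelist check might look like this (a rough sketch; $url stands for whatever link your crawler is about to follow, and the extension list is just the example above):
$path = (string) parse_url($url, PHP_URL_PATH);
// fetch only when the path has no extension at all or ends in a whitelisted one
if ($path === '' || !preg_match('/\.[a-z0-9]+$/i', $path) || preg_match('/\.(html|php|aspx)$/i', $path)) {
    $html = file_get_html($url); // safe to hand over to simple_html_dom
}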
I have been told that the best way to parse HTML is through DOM, like this:
<?
$html = "<span>Text</span>";
$doc = new DOMDocument();
$doc->loadHTML( $html);
$elements = $doc->getElementsByTagName("span");
foreach( $elements as $el)
{
echo $el->nodeValue . "\n";
}
?>
But in the above, the variable $html can't be a URL, or can it?
Wouldn't I have to use the function file_get_contents() to get the HTML of a page?
You have to use DOMDocument::loadHTMLFile to load HTML from a URL.
$doc = new DOMDocument();
$doc->loadHTMLFile($path);
DOMDocument::loadHTML parses a string of HTML.
$doc = new DOMDocument();
$doc->loadHTML(file_get_contents($path));
It can be, but it depends on allow_url_fopen being enabled in your PHP install. Basically all of the PHP file-based functions can accept a URL as a source (or destination). Whether such a URL makes sense is up to what you're trying to do.
e.g. doing file_put_contents('http://google.com') is not going to work, as you'd be attempting to do an HTTP upload to Google, and they're not going to allow you to replace their homepage...
but doing $dom->loadHTMLFile('http://google.com'); would work, and would pull Google's homepage into the DOM for processing.
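For example, a minimal sketch (assuming allow_url_fopen is enabled so DOMDocument can fetch the URL for you):
$dom = new DOMDocument();
libxml_use_internal_errors(true); // real-world pages rarely validate cleanly
$dom->loadHTMLFile('http://google.com');
libxml_clear_errors();
$title = $dom->getElementsByTagName('title')->item(0);
echo $title ? $title->nodeValue : 'no <title> found';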
If you're having trouble using DOM, you could use cURL and a regular expression instead. For example:
$url = "http://www.davesdaily.com/";
$curl = curl_init();
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_URL, $url);
$input = curl_exec($curl);
$regexp = "<span class=comment>([^<]*)<\/span>";
if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
    foreach($matches as $match) {
        echo $match[1]; // the captured text between the span tags
    }
}
The script should grab the text between <span class=comment> and </span> and store each match inside the array $match. For the first comment on that page this should echo Entertainment.
I am looking to display the HTML of another webpage inside my website.
Take this scenario:
I have a website that checks the availability of a hotel. But instead of hosting that hotel's images on my server, I simply cURL a specific page on the hotel's website that contains their images.
Can I grab anything from the HTML and display it on my website, using their HTML code but only the div(s) or images that I want to display?
I'm using this code, sourced from:
http://davidwalsh.name/download-urls-content-php-curl
For practice and argument's sake, let's try to display Google's logo from their homepage.
function get_data($url)
{
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$returned_content = get_data('http://www.google.com');
echo '<base href="http://www.google.com/" />';
echo $returned_content;
Thanks to @alex, I have started to play with DOMDocument from PHP's standard library. However, I have hit a snag.
function get_data($url)
{
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$url = "www.abc.net.au";
$html = get_data($url);
$dom = new DOMDocument;
@$dom->loadHTML($html);
$logo = $dom->getElementById("abcLogo");
var_dump($logo);
Returns: object(DOMElement)[2]
How do I parse this further? Or simply just print/echo the contents of the DIV with that ID?
Yes, run the resulting HTML through something like DOMDocument to extract the portions you require.
Once you have found a DOM element, it can be a bit tricky to get the HTML of the element itself (rather than just its contents).
You can get the XML value of a single element very easily with DOMDocument::saveXML:
echo $dom->saveXML($logo);
This may be good enough for you. I believe there is a change coming that will add this functionality to saveHTML as well.
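If your PHP version is 5.3.6 or newer, DOMDocument::saveHTML() already accepts an optional node argument, so this should work too:
// PHP >= 5.3.6: saveHTML() can serialize just one node instead of the whole document
echo $dom->saveHTML($logo);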
echo $logo->nodeValue should work because you can only have 1 element by id!