As the title suggests, I am trying to find all CSS files on a website (for later use I will find all image URLs in each of the CSS files on the server).
Now I've tried the following:
$url_to_test = $_GET['url'];
$file = file_get_contents($url_to_test);
$doc = new DOMDocument();
$doc->loadHTML($file);
$domcss = $doc->getElementsByTagName('css');
However, $domcss turned out empty (for a site I know has a lot of CSS files).
So my question is: how do I find all CSS files loaded on a given page?
You should look for link elements, not css; change:
$domcss = $doc->getElementsByTagName('css');
to
$domcss = $doc->getElementsByTagName('link');
foreach ($domcss as $links) {
    if (strtolower($links->getAttribute('rel')) == "stylesheet") {
        echo "This is: " . $links->getAttribute('href') . "<br />";
    }
}
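Since the stated goal is to later pull image URLs out of each stylesheet, here is a minimal, untested sketch of that next step, assuming the href values are absolute URLs (relative ones would first need resolving against $url_to_test):
<?php
// fetch each stylesheet and pull out the url(...) references inside it
foreach ($domcss as $link) {
    if (strtolower($link->getAttribute('rel')) == "stylesheet") {
        $css = file_get_contents($link->getAttribute('href'));
        // match url(...) values, with or without surrounding quotes
        if (preg_match_all('/url\(\s*[\'"]?(.*?)[\'"]?\s*\)/i', $css, $matches)) {
            foreach ($matches[1] as $image_url) {
                echo $image_url . "<br />";
            }
        }
    }
}
?>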
Try this:
preg_match_all('/<link[^>]*rel="stylesheet"[^>]*href="(.*?)"[^>]*>/', $data, $output_array);
(Note that a pattern like this is brittle: it breaks as soon as the attributes change order. The DOM approach above is more robust.)
I'm setting up a XAMPP PHP website that auto-creates CSS for each site (if a site named xyz.php/html is created, then a CSS file is created too). Unfortunately, the CSS won't include in the website using PHP echoes and HTML tags. No error.
In style.php:
$arr = explode("/", $_SERVER['PHP_SELF']);
$style = "";
foreach ($arr as $key) {
    if (strpos($key, ".php")) {
        $style = str_replace(".php", "style.css", $key);
    }
}
if ($fp = fopen($_SERVER['DOCUMENT_ROOT'] . "/TestPHP/" . $addr, "wb+")) {
    fwrite($fp, "body{background-color:#666;}");
    fclose($fp);
}
echo $addr = "lib/require/styles/" . $style;
echo '<link href="' . $addr . '" rel="stylesheet">';
In index.php:
require_once 'lib/require/php/styles.php';
That's because the only HTML (as far as I can see in your code) doesn't contain anything other than the style: no head, no body... Why don't you paste the HTML directly into the file instead of making PHP echo it?
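For instance, a minimal skeleton (reusing the include path from the question) that gives the echoed <link> a proper head to live in:
<!DOCTYPE html>
<html>
<head>
<?php require_once 'lib/require/php/styles.php'; // echoes the <link> tag into <head> ?>
</head>
<body>
<!-- page content -->
</body>
</html>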
I'm trying to proxy a request to a domain different from my own and make some changes to the code before outputting the HTML. All works well except that my CSS file doesn't seem to take effect.
<?php
if (isset($_GET['url']))
{
    $html = file_get_contents($_GET['url']);
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ suppresses warnings from malformed markup
    $a = array();
    foreach ($dom->getElementsByTagName('link') as $href)
    {
        $a[] = $href->getAttribute('href');
    }
    echo str_replace($a[0], $url . "/" . $a[0], $html);
}
?>
The result is an HTML document, but without CSS styling. If I check the source code in my browser, the link to the CSS file looks okay and clicking on it takes me to that CSS file, but it's not taking effect in styling the output.
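One direction worth sketching: rewrite every stylesheet href to an absolute URL before echoing, not just the first one. A minimal, untested sketch, assuming $_GET['url'] holds the base address of the proxied page:
<?php
$base = rtrim($_GET['url'], '/');
$html = file_get_contents($base);
$dom = new DOMDocument();
@$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('link') as $link) {
    $href = $link->getAttribute('href');
    // only rewrite relative hrefs; absolute ones already resolve
    // (protocol-relative //... URLs would need extra handling)
    if ($href !== '' && strpos($href, 'http') !== 0) {
        $html = str_replace($href, $base . '/' . ltrim($href, '/'), $html);
    }
}
echo $html;
?>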
I'm scraping some HTML from a website using PHP Simple HTML DOM, and it includes several images. However, the images are not pointing correctly to the website. Below is an example of one of the images, where you can see it is not pointing to the website. Is it possible to dynamically change the URLs to point to the website, for instance
http://www.url.com/bilder/flags_long/United States.gif
HTML example:
<img src="/bilder/flags_long/United States.gif" align="absmiddle" title="United States" alt="United States" border="0">
Sample code:
include('simple_html_dom.php');
$sum_gosu = file_get_html("http://www.gosugamers.net/counterstrike/news/30995-starladder-is-back-with-the-thirteenth-edition-of-starseries");
$gosu_full = $sum_gosu->find("//div[#class='content light']/div[#class='text clearfix']/div", 0);
How about concatenating the actual URL you fetched the document from with the relative image paths? Just to give an idea (this is not tested, and you should definitely check whether the image src attribute is relative or maybe absolute in some cases):
<?php
$url = 'http://www.url.com/';
$html = file_get_html($url);
$images = array();
foreach ($html->find('img') as $img) {
    // Option 1: Fill your images array (in case you only need the images)
    $images[] = rtrim($url, '/') . '/' . ltrim($img->src, '/');
    // Option 2: Update $img->src inside your $html document
    $img->src = rtrim($url, '/') . '/' . ltrim($img->src, '/');
}
?>
UPDATE: According to your sample code, my example could look as follows:
<?php
include('simple_html_dom.php');
$sum_gosu_url = "http://www.gosugamers.net/counterstrike/news/30995-starladder-is-back-with-the-thirteenth-edition-of-starseries";
$sum_gosu = file_get_html($sum_gosu_url);
$gosu_full = $sum_gosu->find("//div[#class='content light']/div[#class='text clearfix']/div", 0);
foreach ($gosu_full->find('img') as $img) {
    $img->src = $sum_gosu_url . $img->src;
}
?>
After that, the img src attributes inside your $gosu_full document should be fixed and resolvable (downloadable by a client). Hope that helps and that I'm actually understanding your problem :)
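One caveat: for a root-relative path like /bilder/flags_long/United States.gif, prepending the full article URL yields a broken address. A small, untested sketch that instead builds the prefix from just the scheme and host via parse_url:
<?php
$parts = parse_url($sum_gosu_url);
$base = $parts['scheme'] . '://' . $parts['host'];
foreach ($gosu_full->find('img') as $img) {
    // only prefix root-relative paths; leave absolute URLs untouched
    if (strpos($img->src, 'http') !== 0) {
        $img->src = $base . $img->src;
    }
}
?>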
$url="http://www.url.com";
$Chtml = file_get_html($url);
$imgurl=Chtml->find("img",0)->src;
echo $url.$imgurl;
My scraping code works for just about every site I've come across while testing... except for nytimes.com articles. I use AJAX with the following PHP code (I've left out some details to focus on my specific problem):
$link = "http://www.nytimes.com/2014/02/07/us/huge-leak-of-coal-ash-slows-at-north-carolina-power-plant.html?hp";
$article = new DOMDocument;
$article->loadHTMLFile($link);
//generate image array
$images = $article->getElementsByTagName("img");
foreach ($images as $image) {
    $source = $image->getAttribute("src");
    echo '<img src="' . $source . '" alt="alt"><br><br>';
}
My problem is that the main images on nytimes pages don't even seem to get picked up by getElementsByTagName. Pinterest manages to scrape the main images from this site, for example: http://www.nytimes.com/2014/02/07/us/huge-leak-of-coal-ash-slows-at-north-carolina-power-plant.html?hp whereas I cannot. Any suggestions?
OK, so this is what I tried so far, as I found your question interesting.
When I do this in the browser console using jQuery, I do get results for the images. My query was:
var a = new Array();
$('img[src]').each(function() { a.push($(this).attr('src')); });
console.log(a);
Note that console.log(arrayName) works in the Chrome console.
So ideally your code should work. Please consider adding an is_null check like I've done.
Below is the code where I try loading the URL using a different approach (perhaps a better one, too) to get at the root cause of why you only get the single NYT logo image.
<?php
$html = file_get_contents("http://www.nytimes.com/2014/02/07/us/huge-leak-of-coal-ash-slows-at-north-carolina-power-plant.html?hp");
echo $html;
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->recover = true;
@$doc->loadHTML("<html><body>" . $html . "</body></html>"); // @ suppresses parse warnings
$xpath = new DOMXpath($doc);
$images = $xpath->query("//*/img");
if (!is_null($images)) {
    echo $images->length; // DOMNodeList exposes a length property
    foreach ($images as $image) {
        $source = $image->getAttribute('src');
        echo '<img src="' . $source . '" alt="alt"><br><br>';
    }
}
?>
You can't get the content via feed unless you are authenticated. You can try:
- using the context parameter in the file_get_contents method (a sketch follows below);
- consuming the RSS/ATOM feeds of the article;
- downloading the page as HTML and then loading it with file_get_contents. PS: This works.
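Here is a minimal, untested sketch of the context-parameter option, reusing $link from the question and assuming the site rejects PHP's default user agent; the header value is only illustrative:
<?php
// send a browser-like User-Agent via a stream context
$context = stream_context_create(array(
    'http' => array(
        'header' => "User-Agent: Mozilla/5.0 (compatible; MyScraper/1.0)\r\n",
    ),
));
$html = file_get_contents($link, false, $context);
?>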
I have been working with this PHP code, which should modify the Google Calendar layout. But when I put the code on a page, it makes everything below it disappear. What's wrong with it?
<?php
$your_google_calendar=" PAGE ";
$url= parse_url($your_google_calendar);
$google_domain = $url['scheme'].'://'.$url['host'].dirname($url['path']).'/';
// Load and parse Google's raw calendar
$dom = new DOMDocument;
$dom->loadHTMLfile($your_google_calendar);
// Change Google's CSS file to use absolute URLs (assumes there's only one element)
$css = $dom->getElementByTagName('link')->item(0);
$css_href = $css->getAttributes('href');
$css->setAttributes('href', $google_domain . $css_href);
// Change Google's JS file to use absolute URLs
$scripts = $dom->getElementByTagName('script')->item(0);
foreach ($scripts as $script) {
    $js_src = $script->getAttributes('src');
    if ($js_src) { $script->setAttributes('src', $google_domain . $js_src); }
}
// Create a link to a new CSS file called custom_calendar.css
$element = $dom->createElement('link');
$element->setAttribute('type', 'text/css');
$element->setAttribute('rel', 'stylesheet');
$element->setAttribute('href', 'custom_calendar.css');
// Append this link at the end of the element
$head = $dom->getElementByTagName('head')->item(0);
$head->appendChild($element);
// Export the HTML
echo $dom->saveHTML();
?>
When testing your code, I get some errors because of wrong method calls:
->getElementByTagName should be ->getElementsByTagName, with an s after Element,
and
->setAttributes and ->getAttributes should be ->setAttribute and ->getAttribute, without the s at the end.
I'm guessing that you don't have error_reporting turned on, and because of that didn't notice anything went wrong?
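A small sketch with the corrected names applied to the CSS part of the code above (reusing $dom and $google_domain from the question), plus error reporting enabled so fatal calls like these surface during development:
<?php
// surface errors while debugging
error_reporting(E_ALL);
ini_set('display_errors', '1');

$css = $dom->getElementsByTagName('link')->item(0); // getElementsByTagName, with an s
$css_href = $css->getAttribute('href');             // getAttribute, no trailing s
$css->setAttribute('href', $google_domain . $css_href);
?>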