I'm trying to proxy a request to a different domain and make some changes to the HTML before outputting it. Everything works well, except that my CSS file doesn't seem to take effect.
<?php
if (isset($_GET['url']))
{
    $url = $_GET['url'];
    $html = file_get_contents($url);
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $a = array();
    foreach ($dom->getElementsByTagName('link') as $href)
    {
        $a[] = $href->getAttribute('href');
    }
    echo str_replace($a[0], $url."/".$a[0], $html);
}
?>
The result is an HTML document without CSS styling. If I check the source code in my browser, the link to the CSS file looks okay and clicking on it takes me to that CSS file, but it's not taking effect in styling the output.
As the title suggests, I am trying to find all CSS files on a website (later I will find all image URLs in each of those CSS files).
Here is what I've tried:
$url_to_test = $_GET['url'];
$file = file_get_contents($url_to_test);
$doc = new DOMDocument();
$doc->loadHTML($file);
$domcss = $doc->getElementsByTagName('css');
However, $domcss turned out empty (for a site I know has a lot of CSS files).
So my question is: how do I find all CSS files loaded on a given page?
You should look for link elements, not css. Change:
$domcss = $doc->getElementsByTagName('css');
to
$domcss = $doc->getElementsByTagName('link');
foreach($domcss as $links) {
if( strtolower($links->getAttribute('rel')) == "stylesheet" ) {
echo "This is:". $links->getAttribute('href') ."<br />";
}
}
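The same idea can be wrapped in a reusable function. This is only a sketch: find_stylesheet_hrefs() and the sample markup are names I made up for illustration, and it collects libxml warnings instead of printing them, since real-world pages are rarely valid HTML.

```php
<?php
// Hypothetical helper: returns the href of every <link rel="stylesheet"> in an HTML string.
function find_stylesheet_hrefs(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($doc->getElementsByTagName('link') as $link) {
        // rel may hold several space-separated tokens, e.g. "alternate stylesheet".
        $rel = preg_split('/\s+/', strtolower(trim($link->getAttribute('rel'))));
        if (in_array('stylesheet', $rel, true) && $link->getAttribute('href') !== '') {
            $hrefs[] = $link->getAttribute('href');
        }
    }
    return $hrefs;
}

// Sample markup standing in for the fetched page (an assumption, not a real site).
$sample = '<html><head>'
    . '<link rel="stylesheet" href="/css/main.css">'
    . '<link rel="icon" href="/favicon.ico">'
    . '<link rel="Stylesheet" href="print.css" media="print">'
    . '</head><body></body></html>';

print_r(find_stylesheet_hrefs($sample));
```

For the sample markup this should list /css/main.css and print.css and skip the favicon. Note that the hrefs come back exactly as written in the page, so relative ones still need to be resolved against the page URL before you fetch them.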
Try this:
preg_match('/<link rel="stylesheet" href="(.*?)" type="text\/css">/',$data,$output_array);
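If you do want to stay with regular expressions, a somewhat more tolerant variant might look like the sketch below. It is not the pattern above: match_stylesheet_hrefs() is a hypothetical helper that uses preg_match_all() to get every link tag rather than only the first, and it does not depend on the order of the attributes. It is still a regex over HTML, so treat it as a rough approximation.

```php
<?php
// Hypothetical helper: regex-only extraction of stylesheet URLs.
function match_stylesheet_hrefs(string $data): array
{
    $hrefs = [];
    // Step 1: grab every <link ...> tag, whatever its attributes are.
    preg_match_all('/<link\b[^>]*>/i', $data, $tags);
    foreach ($tags[0] as $tag) {
        // Step 2: keep only tags whose quoted rel value contains "stylesheet".
        if (!preg_match('/\brel\s*=\s*["\'][^"\']*\bstylesheet\b[^"\']*["\']/i', $tag)) {
            continue;
        }
        // Step 3: pull out the quoted href value.
        if (preg_match('/\bhref\s*=\s*(["\'])(.*?)\1/i', $tag, $m)) {
            $hrefs[] = $m[2];
        }
    }
    return $hrefs;
}

// Sample input (made up for illustration).
$data = '<link href="a.css" rel="stylesheet">'
      . "<link rel='stylesheet' type='text/css' href='b.css' />"
      . '<link rel="icon" href="f.ico">';

print_r(match_stylesheet_hrefs($data));
```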
I have made this:
<html>
<head>
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
<script>
$(document).ready(
function()
{
$("body").html($("#HomePageTabs_cont_3").html());
}
);
</script>
</head>
<body>
<?php
echo file_get_contents("http://www.bankasya.com.tr/index.jsp");
?>
</body>
</html>
When I check my page with Firebug, it reports countless "missing file" errors (images, CSS files, JS files, etc.). I want just one part of the page, not all of it. This code does what I want, but I am wondering if there is a better way.
EDIT:
The page does what I need. I do not need all the contents, so an iframe is useless to me. I just want the raw data of the div #HomePageTabs_cont_3.
Your best bet is PHP server-side parsing. I have written a small snippet to show you how to do this using DOMDocument (and possibly tidy, if your server has it, to barf out all the malformed XHTML foos).
Caveat: outputs UTF-8. You can change this in the constructor of DOMDocument.
Caveat 2: WILL barf out if its input is neither utf-8 nor iso-8859-9. The current page's charset is iso-8859-9 and I see no reason why they would change this.
header("content-type: text/html; charset=utf-8");
$data = file_get_contents("http://www.bankasya.com.tr/index.jsp");
// Clean it up
if (class_exists("tidy")) {
$dataTidy = new tidy();
$dataTidy->parseString($data,
array(
"input-encoding" => "iso-8859-9",
"output-encoding" => "iso-8859-9",
"clean" => 1,
"input-xml" => true,
"output-xml" => true,
"wrap" => 0,
"anchor-as-name" => false
)
);
$dataTidy->cleanRepair();
$data = (string)$dataTidy;
}
else {
$do = true;
while ($do) {
$start = stripos($data,'<script');
$stop = stripos($data,'</script>');
if ((is_numeric($start))&&(is_numeric($stop))) {
$s = substr($data,$start,$stop-$start);
$data = substr($data,0,$start).substr($data,($stop+strlen('</script>')));
} else {
$do = false;
}
}
// nbsp breaks it?
$data = str_replace("&nbsp;"," ",$data);
// Fixes for any element that requires a self-closing tag
if (preg_match_all("/<(link|img)([^>]+)>/is",$data,$mt,PREG_SET_ORDER)) {
foreach ($mt as $v) {
if (substr($v[2],-1) != "/") {
$data = str_replace($v[0],"<".$v[1].$v[2]."/>",$data);
}
}
}
// Barf out the inline JS
$data = preg_replace("/javascript:[^;]+/is","#",$data);
// Barf out the noscripts
$data = preg_replace("#<noscript>(.+?)</noscript>#is","",$data);
// Muppets. Malformed comment = one more regexp when they could just learn to write proper HTML...
$data = preg_replace("#<!--(.*?)--!?>#is","",$data);
}
$DOM = new \DOMDocument("1.0","utf-8");
$DOM->recover = true;
function error_callback_xmlfunction($errno, $errstr) { throw new Exception($errstr); }
$old = set_error_handler("error_callback_xmlfunction");
// Throw out all the XML namespaces (if any)
$data = preg_replace("#xmlns=[\"\']?([^\"\']+)[\"\']?#is","",(string)$data);
try {
$DOM->loadXML(((substr($data, 0, 5) !== "<?xml") ? '<?xml version="1.0" encoding="utf-8"?>' : "").$data);
} catch (Exception $e) {
$DOM->loadXML(((substr($data, 0, 5) !== "<?xml") ? '<?xml version="1.0" encoding="iso-8859-9"?>' : "").$data);
}
restore_error_handler();
error_reporting(E_ALL);
$DOM->substituteEntities = true;
$xpath = new \DOMXPath($DOM);
echo $DOM->saveXML($xpath->query("//div[@id=\"HomePageTabs_cont_3\"]")->item(0));
In order of appearance:
Fetch the data
If we have tidy, sanitize HTML with it
Create a new DOMDocument and load our document ((string)$dataTidy is a short-hand tidy getter)
Create an XPath request path
Use XPath to request all divs with id set as what we want, get the first item of the collection (->item(0), which will be a DOMElement) and request for the DOM to output its XML content (including the tag itself)
Hope it is what you're looking for... Though you might want to wrap it in a function.
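As a rough idea of what wrapping it in a function could look like, here is a minimal sketch. It deliberately swaps techniques: instead of the cleanup plus loadXML() path above, it uses DOMDocument::loadHTML(), which tolerates broken markup, with warnings collected through libxml_use_internal_errors(). The function name extract_element_by_id() and the sample page are made up, and character-set handling (iso-8859-9 vs UTF-8) is left out entirely.

```php
<?php
// Hypothetical wrapper: returns the outer markup of the element with the given id, or null.
function extract_element_by_id(string $html, string $id): ?string
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Assumption: $id contains no double quote, so it can be embedded directly.
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // saveHTML() with a node argument serializes just that element and its children.
    return $dom->saveHTML($nodes->item(0));
}

// Sample page standing in for the remote document (an assumption).
$page = '<html><body><div id="header">Menu</div>'
      . '<div id="HomePageTabs_cont_3"><table><tr><td>USD</td></tr></table></div>'
      . '</body></html>';

echo extract_element_by_id($page, 'HomePageTabs_cont_3');
```

Returning null when the id is missing lets the caller decide what to do if the remote site changes its markup.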
Edit
Forgot to mention: http://rescrape.it/rs.php for the actual script output!
Edit 2
Correction, that site is not W3C-valid, and therefore, you'll either need to tidy it up or apply a set of regular expressions to the input before processing. I'm going to see if I can formulate a set to barf out the inconsistencies.
Edit 3
Added a fix for all those of us who do not have tidy.
Edit 4
Couldn't resist. If you'd actually like the values rather than the table, use this instead of the echo:
$d = new stdClass();
$rows = $xpath->query("//div[@id=\"HomePageTabs_cont_3\"]//tr");
$rc = $rows->length;
for ($i = 1; $i < $rc-1; $i++) {
$cols = $xpath->query($rows->item($i)->getNodePath()."/td");
$d->{$cols->item(0)->textContent} = array(
((float)$cols->item(1)->textContent),
((float)$cols->item(2)->textContent)
);
}
I don't know about you, but for me, data works better than malformed tables.
(Welp, that one took a while to write)
I'd get in touch with the remote site's owner and ask if there was a data feed I could use that would just return the content I wanted.
Sébastien's answer is the best solution, but if you want to use jQuery you can add a base tag to the head section of your page to avoid "not found" errors on images.
<base href="http://www.bankasya.com.tr/">
You will also need to change your sources to absolute paths.
But use DOMDocument
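For the "absolute paths" part, something along these lines could be applied to each src or href you pull out with DOMDocument. It is a simplified sketch: to_absolute_url() is a hypothetical helper, and it does not collapse "../" segments or handle query-only or fragment-only references.

```php
<?php
// Hypothetical helper: turns a possibly relative URL into an absolute one.
// Simplified: "./" and "../" segments are not collapsed.
function to_absolute_url(string $url, string $base): string
{
    // Already absolute (has a scheme such as http:, https:, data:).
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $url)) {
        return $url;
    }
    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '')
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative: //cdn.example.com/x.js
    if (strncmp($url, '//', 2) === 0) {
        return $scheme . ':' . $url;
    }
    // Root-relative: /css/site.css
    if (strncmp($url, '/', 1) === 0) {
        return $origin . $url;
    }
    // Path-relative: resolve against the directory part of the base path.
    $path = $parts['path'] ?? '/';
    $pos  = strrpos($path, '/');
    $dir  = ($pos === false) ? '/' : substr($path, 0, $pos + 1);
    return $origin . $dir . $url;
}

$base = 'http://www.bankasya.com.tr/index.jsp';
echo to_absolute_url('images/logo.gif', $base), "\n";
echo to_absolute_url('/css/site.css', $base), "\n";
echo to_absolute_url('//ajax.googleapis.com/x.js', $base), "\n";
```

The file names in the usage lines are invented; only the base URL comes from the question.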
I am using the PHP Simple DOM parser to extract all of the image sources on a given page like so:
// Include the library
include('simple_html_dom.php');
// Retrieve the DOM from a given URL
$html = file_get_html('http://google.com/');
// Retrieve all images and print their SRCs
foreach($html->find('img') as $e)
echo $e->src . '<br>';
Instead of using Google.com, I wish to use a page in WordPress's admin (backend) area. These pages are PHP pages, not HTML (but the page has standard HTML throughout). How would I use the current page as the $html variable? PHP newbie over here.
Use the dxtool library, which provides the WebGet class used below.
Login
require 'WebGet.php';
$w = new WebGet();
// using cache to prevent repetitive download
$w->useCache = true;
$w->cacheLocation = '/tmp';
$w->cacheMaxAge = 3600;
$w->cookieFile = '/tmp/cookie.txt';
// $login_get_data and $login_post_data is associative array
$login = $w->requestContent($login_url, $login_get_data, $login_post_data);
Visiting Image containing page
// $image_page_url is the url of the page where your images exist.
$image_page = $w->requestContent($image_page_url);
Parse images and display
$dom = new DOMDocument();
$dom->loadHTML($image_page);
$imgs = $dom->getElementsByTagName("img");
foreach($imgs as $img){
echo $img->getAttribute("src");
}
Disclaimer: I am the author of this class
I have the following code that replaces all <img> tags on a page and adds the nCode image resizer to them. The code is as follows:
function ncode_the_content($content) {
    return preg_replace("/<img([^`|>]*)>/im", "<img onload=\"NcodeImageResizer.createOn(this);\"$1>", $content);
}
What I need is for the replacement to be skipped for any image that has the class "noresize".
So far I have only managed to get it so that if the "noresize" class appears anywhere on the page, it stops resizing all images instead of just the ones with that class.
Any suggestions?
UPDATE:
Am I even remotely in the right ballpark with this?
function ncode_the_content($content) {
//Load the HTML page
$html = file_get_contents($content);
//Parse it. Here we use loadHTML as a static method
//to parse the HTML and create the DOM object in one go.
@$dom = DOMDocument::loadHTML($html);
//Init the XPath object
$xpath = new DOMXpath($dom);
//Query the DOM
$linksnoresize = $xpath->query( 'img[@class = "noresize"]' );
$links = $xpath->query( 'img[]' );
//Display the results as in the previous example
foreach($links as $link){
echo $link->getAttribute('onload'), 'NcodeImageResizer.createOn(this);';
}
foreach($linksnoresize as $link){
echo $link->getAttribute('onload'), '';
}
}
Here's some untested code:
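$dom = new DOMDocument();
@$dom->loadHTML($content); // calling loadHTML statically no longer works in PHP 8; @ hides markup warnings
$images = $dom->getElementsByTagName("img");
foreach ($images as $image) {
    if (!strstr($image->getAttribute("class"), "noresize")) {
        $image->setAttribute("onload", "NcodeImageResizer.createOn(this);");
    }
}
$content = $dom->saveHTML(); // serialize the modified document back to a string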
But, if it were me, I would eschew any such inline event handler and instead just find the appropriate elements with JavaScript.
I ended up just using pure CSS and adding a <div> around the images I didn't want to be resized. I forced the width and height of that div back to auto and then removed the warning message that was displayed above them. Seems to work fine. Thanks for your help :)