PHP - display remote page's content in full - php

I need to fetch a remote page, modify some elements (using 'PHP Simple HTML DOM Parser' library for that) and output modified content.
There's a problem with remote pages that don't have full URLs in their source, so CSS elements and images are not loaded. Sure, it doesn't stop me from modifying elements, but the output looks bad.
For example, https://www.raspberrypi.org/downloads/ looks fine when you open it directly.
However, if you fetch it and echo it with this code
$html = file_get_html('http://www.raspberrypi.org/downloads');
echo $html;
it looks bad, because the stylesheets and images are missing. I tried a simple hack, but it helps only a little:
$html = file_get_html('http://www.raspberrypi.org/downloads');
$html=str_ireplace("</head>", "<base href='http://www.raspberrypi.org'></head>", $html);
echo $html;
Is there any way to tell the script to resolve all links in the $html variable against 'http://www.raspberrypi.org'? In other words, how do I make raspberrypi.org the "main" source of all the images and CSS files that get fetched?
I don't know how to explain it better, but I hope you get the idea.

I have just tried this locally, and I noticed (in the source code) that the link tags in the HTML look like this:
<link rel='stylesheet' href='/wp-content/themes/mind-control/js/qtip/jquery.qtip.min.css' />
That href is root-relative, so the browser looks for the file on my local server (localhost/wp-content/etc.../), where it does not exist.
The href of the link tags must instead be something like
<link rel='stylesheet' href='https://www.raspberrypi.org/wp-content/themes/mind-control/js/qtip/jquery.qtip.min.css' />
So what you probably want to do is find all link tags and put "https://www.raspberrypi.org" in front of their href attribute.
EDIT: I've actually got the styles working. Try this code:
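$html = file_get_html('http://www.raspberrypi.org/downloads');
foreach ($html->find('link') as $element)
{
// Prefix only root-relative hrefs ("/path"), not "//host/path" or full URLs
if (strpos($element->href, '/') === 0 && strpos($element->href, '//') !== 0)
{
$element->href = 'http://www.raspberrypi.org' . $element->href;
}
}
echo $html;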
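Prefixing the site root covers root-relative paths, but pages also contain protocol-relative (//host/path), path-relative (img/logo.png) and already absolute URLs. The following is a minimal sketch of a resolver for those cases. The function name absolutize_url is hypothetical (it is not part of Simple HTML DOM), it ignores ports, and it does not normalise ../ segments.

```php
<?php
// Hypothetical helper: resolve $url against the page address $base.
// Assumption: $base is an http(s) URL without a port; "../" is not normalised.
function absolutize_url(string $base, string $url): string
{
    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '');
    $path   = $parts['path'] ?? '/';

    // Empty, fragment-only, or already absolute (http:, https:, data:, mailto:, ...)
    if ($url === '' || $url[0] === '#' || preg_match('#^[a-z][a-z0-9+.\-]*:#i', $url)) {
        return $url;
    }
    // Protocol-relative: reuse the scheme of the base page
    if (substr($url, 0, 2) === '//') {
        return $scheme . ':' . $url;
    }
    // Root-relative: attach to scheme and host
    if ($url[0] === '/') {
        return $origin . $url;
    }
    // Path-relative: attach to the directory part of the base path
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $url;
}
```

Inside the loop above this could be used as $element->href = absolutize_url('http://www.raspberrypi.org/downloads', $element->href);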

Since only Nikolay Ganovski offered a solution, I wrote code that converts partial pages into full ones by looking for incomplete css/img/form URLs and completing them. In case someone needs it, the code is below:
//finalizes remote page by completing incomplete css/img/form URLs (path/file.css becomes http://somedomain.com/path/file.css, etc.)
function finalize_remote_page($content, $root_url)
{
$root_url_without_scheme=preg_replace('/^(?:https?:\/\/)?(?:www\.)?(.*?)\/?$/i', '$1', $root_url); //ignore schemes (and a trailing slash), in case URL provided by user was http://domain.com while URL in source is https://domain.com (or vice-versa)
$content_object=str_get_html($content);
if (is_object($content_object))
{
foreach ($content_object->find('link[rel=stylesheet]') as $entry) //find css
{
if (substr($entry->href, 0, 2)!="//" && stristr($entry->href, $root_url_without_scheme)===FALSE) //ignore "invalid" URLs like //domain.com
{
$entry->href=$root_url.$entry->href;
}
}
foreach ($content_object->find('img') as $entry) //find img
{
if (substr($entry->src, 0, 2)!="//" && stristr($entry->src, $root_url_without_scheme)===FALSE) //ignore "invalid" URLs like //domain.com
{
$entry->src=$root_url.$entry->src;
}
}
foreach ($content_object->find('form') as $entry) //find form
{
if (substr($entry->action, 0, 2)!="//" && stristr($entry->action, $root_url_without_scheme)===FALSE) //ignore "invalid" URLs like //domain.com
{
$entry->action=$root_url.$entry->action;
}
}
}
return $content_object;
}
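The same idea can be sketched without the Simple HTML DOM library, using PHP's built-in DOM extension. This is only an illustration under some assumptions: the function name prefix_root_relative is made up, only root-relative values are rewritten, and every element carrying href, src or action is touched (so <a> and <script> tags too, not just link/img/form).

```php
<?php
// Sketch: prefix root-relative href/src/action values with $origin.
// Assumes ext-dom is available; DOMDocument normalises the markup it outputs.
function prefix_root_relative(string $html, string $origin): string
{
    // Real-world markup is rarely valid, so collect libxml warnings silently.
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    foreach (['href', 'src', 'action'] as $attr) {
        foreach ($xpath->query("//*[@$attr]") as $node) {
            $value = $node->getAttribute($attr);
            // Only touch "/path" values; skip "//host/path" and absolute URLs.
            if (strpos($value, '/') === 0 && strpos($value, '//') !== 0) {
                $node->setAttribute($attr, rtrim($origin, '/') . $value);
            }
        }
    }
    return $dom->saveHTML();
}
```

A call such as echo prefix_root_relative($content, 'https://www.raspberrypi.org'); would then print the rewritten page.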

Related

Change src attribute from img, using Simple HTML Dom php library

I'm totally new to php, and I'm having a hard time changing the src attribute of img tags.
I have a website that pulls a part of a page using Simple Html Dom php, here is the code:
<?php
include_once('simple_html_dom.php');
$html = file_get_html('http://www.tabuademares.com/br/bahia/morro-de-sao-paulo');
foreach($html ->find('img') as $item) {
$item->outertext = '';
}
$html->save();
$elem = $html->find('table[id=tabla_mareas]', 0);
echo $elem;
?>
This code correctly returns the part of the page I want, but the img tags come with the src of the original page: /assets/svg/icon_name.svg
What I want is to change the original src so that it looks like this: http://www.mywebsite.com/wp-content/themes/mytheme/assets/svg/icon_name.svg
That is, I want to put the URL of my site in front of /assets/svg/icon_name.svg
I have already tried some tutorials, but I could not make any of them work.
Could someone please help a PHP beginner?
I got it working. In case someone has the same question, here is how I did it:
<?php
// Download simple_html_dom.php from
// https://sourceforge.net/projects/simplehtmldom/files/
// and then include it
include_once('simple_html_dom.php');
// target the website
$html = file_get_html('http://the_target_website.com');
// loop through all images in the HTML DOM
foreach($html->find('img') as $item) {
// Get an attribute (for a non-value attribute such as checked or selected, this returns true or false)
$value = $item->src;
// Set an attribute
$item->src = 'http://yourwebsite.com/'.$value;
}
// save() returns the document as a string; the return value is not used here
$html->save();
// find the div whose content you want
$elem = $html->find('div[id=container]', 0);
// output it using echo
echo $elem;
?>
That's it!
Did you read the documentation on reading and modifying attributes?
According to it:
// Get an attribute (for a non-value attribute such as checked or selected, this returns true or false)
$value = $e->href;
// Set an attribute
$e->href = 'ursitename'.$value;

How do I call an iframe or something similar in PHP?

How do I call an iframe or something similar in PHP?
I found some code, but I might be setting it up wrong. This is the code I found:
<iframe id="frame" src="load.php?sinput="<?php echo $_GET["sinput"]; ?> > </iframe>
Does anybody know of any iframe code or something similar for PHP?
Some people say not to use iframes; what does PHP offer instead?
There is no function to generate an iframe in PHP.
What you're doing is fine, but allow me to make a suggestion:
<?php
$input = "";
if (isset($_GET['sinput'])) {
// URL-encode for the query string, then HTML-escape for the attribute
$input = htmlspecialchars(urlencode($_GET['sinput']));
}
?>
<iframe id="frame" src="load.php?sinput=<?php echo $input; ?>">Your browser does not support iframes</iframe>
EDIT: actually
<?php
$url = "load.php";
// Query-building logic
$queries = array();
if (isset($_GET['sinput'])) {
$queries[] = "sinput=" . urlencode($_GET['sinput']);
}
// Generate the full URL
if (count($queries) > 0) {
$url .= "?" . implode("&", $queries);
}
?>
<iframe id="frame" src="<?php echo htmlspecialchars($url); ?>">Your browser does not support iframes</iframe>
I think this version is better quality overall, but I'll leave that to my peers to judge. It is just another suggestion: generate the full, usable URL in one block of logic, rather than relying on the information being present and usable in the template (if the element ['sinput'] in the $_GET array is not set for whatever reason, the page will break on you).
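The query-building step can also be sketched with PHP's built-in http_build_query(), which URL-encodes each value, followed by htmlspecialchars() for the HTML attribute. The helper name iframe_src is hypothetical, and only the sinput parameter from the question is forwarded.

```php
<?php
// Hypothetical helper: build the iframe URL from an optional query parameter.
function iframe_src(string $page, array $get): string
{
    $params = [];
    if (isset($get['sinput']) && is_string($get['sinput'])) {
        $params['sinput'] = $get['sinput'];
    }
    // http_build_query() URL-encodes the values and joins them with "&".
    $url = $params ? $page . '?' . http_build_query($params) : $page;
    // htmlspecialchars() makes the URL safe inside a quoted HTML attribute.
    return htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
}

echo '<iframe id="frame" src="' . iframe_src('load.php', $_GET) . '">Your browser does not support iframes</iframe>';
```

When sinput is missing, the src is simply load.php, so the template never reads an unset array element.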

Save the contents of manipulated div to a variable and pass to php file

I have tried to use AJAX, but nothing I come up with seems to work correctly. I am creating a menu editor. I echo part of a file using php and manipulate it using javascript/jquery/ajax (I found the code for that here: http://www.prodevtips.com/2010/03/07/jquery-drag-and-drop-to-sort-tree/). Now I need to get the edited contents of the div (which has an unordered list in it) I am echoing and save it to a variable so I can write it to the file again. I couldn't get that resource's code to work so I'm trying to come up with another solution.
If there is a code I can put into the $("#save").click(function(){ }); part of the javascript file, that would work, but the .post doesn't seem to want to work for me. If there is a way to initiate a php preg_match in an onclick, that would be the easiest.
Any help would be greatly appreciated.
The code to get the file contents.
<button id="save">Save</button>
<div id="printOut"></div>
<?php
$header = file_get_contents('../../../yardworks/content_pages/header.html');
preg_match('/<div id="nav">(.*?)<\/div>/si', $header, $list);
$tree = $list[0];
echo $tree;
?>
The code to process the new div and send to php file.
$("#save").click(function(){
$.post('edit-menu-process.php',
{tree: $('#nav').html()},
function(data){$("#printOut").html(data);}
);
});
Everything is working EXCEPT something about my encoding of the passed data is making it not read as html and just plaintext. How do I turn this back into html?
EDIT: I was able to get this to work correctly. I'll make an attempt to switch this over to DOMDocument.
<?php
$path = '../../../yardworks/content_pages/header.html';
$menu = htmlentities(stripslashes(utf8_encode($_POST['tree'])), ENT_QUOTES);
/* Turn the escaped angle brackets back into real tags */
$menu = str_replace("&lt;", "<", $menu);
$menu = str_replace("&gt;", ">", $menu);
$divmenu = '<div id="nav">'.$menu.'</div>';
/* Search for div contents in $menu and save to variable */
preg_match('/<div id="nav">(.*?)<\/div>/si', $divmenu, $newmenu);
$savemenu = $newmenu[0];
/* Get file contents */
$header = file_get_contents($path);
/* Find placeholder div in user content and insert slider contents */
$final = preg_replace('/<div id="nav">(.*?)<\/div>/si', $savemenu, $header);
/* Save content to original file */
file_put_contents($path, $final);
?>
Menu has been saved.
To post the contents of a div with ajax:
$.post('/path/to/php', {
my_html: $('#my_div').html()
}, function(data) {
console.log(data);
});
If that's not what you need, then please post some code with your question. It is very vague.
Also, you mention preg_match and html in the same question. I see where this is going and I don't like it. You can't parse [X]HTML with regex. Use a parser instead. Like this: http://php.net/manual/en/class.domdocument.php
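As a rough illustration of the parser approach, the sketch below swaps the contents of <div id="nav"> using DOMDocument instead of preg_replace. The function name replace_nav is made up, and the example assumes ASCII markup (loadHTML() needs extra care with other encodings) and accepts that saveHTML() returns a normalised document.

```php
<?php
// Sketch: replace the children of <div id="nav"> in $page with $newInnerHtml.
function replace_nav(string $page, string $newInnerHtml): string
{
    libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($page);

    // Parse the submitted markup in a scratch document, inside a known wrapper.
    $scratch = new DOMDocument();
    $scratch->loadHTML('<div id="wrapper">' . $newInnerHtml . '</div>');
    libxml_clear_errors();

    $nav     = (new DOMXPath($doc))->query('//div[@id="nav"]')->item(0);
    $wrapper = (new DOMXPath($scratch))->query('//div[@id="wrapper"]')->item(0);
    if ($nav === null || $wrapper === null) {
        return $page; // nothing to replace
    }

    // Remove the old children, then import the new ones into $doc.
    while ($nav->firstChild) {
        $nav->removeChild($nav->firstChild);
    }
    foreach ($wrapper->childNodes as $child) {
        $nav->appendChild($doc->importNode($child, true));
    }
    return $doc->saveHTML();
}
```

Unlike the (.*?)<\/div> pattern, this does not stop at the first closing </div> nested inside the menu.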

Replace CSS file if is specific subdomain, using preg_match

Regular Expressions are still a stone in my boot. Can you help me, guys?
I have this piece of code for a hook in a CMS. It is actually the whole body of the function that the main code executes.
if (preg_match('#^/member/helpdesk/index.*#i', $_SERVER['REQUEST_URI'])) //do it only for specific url
{
$event->replace('#(<h1>Tickets.*</h1>)#i', '$1<div>Some content</div>');
}
But what I really want is to check whether the page belongs to the subdomain member.site.com, find the <link rel="stylesheet" href="http://site.com/orange.css"/> and replace orange.css with blue.css
Thank you :)
I mean, at the core I think you're trying to do this:
$str = '<html><head><link rel="stylesheet" href="http://site.com/style.css"/></head></html>';
if (preg_match('#member\.site\.com#i', $_SERVER['HTTP_HOST'])){
$str = preg_replace('#http://site\.com/style\.css#', 'http://site.com/style-member.css', $str);
}
But perhaps you should consider how whatever it is you're trying to replace is being generated in the first place? Perhaps this is a check that could be placed at that location? Additionally, if you're going to be modifying an html document, I highly suggest using a parser of some kind. If you're going to do the first, maybe something like this:
$head = '<head><link rel="stylesheet" href="http://site.com/style';
if (preg_match('#member\.site\.com#i', $_SERVER['HTTP_HOST'])){
$head .= '-member';
}
$head .= '.css"></head>';
But if you insist on parsing an html document:
$str = '<html><head><link rel="stylesheet" href="http://site.com/style.css"/></head></html>';
$dom = new DOMDocument();
$dom->loadHTML($str);
if (preg_match('#member\.site\.com#i', $_SERVER['HTTP_HOST'])){
$links = $dom->getElementsByTagName('link');
foreach ($links as $link){
$attr = $link->attributes;
if ($attr
&& $attr->getNamedItem('rel')->nodeValue == 'stylesheet'
&& $attr->getNamedItem('href')->nodeValue == 'http://site.com/style.css'){
$attr->getNamedItem('href')->nodeValue = 'http://site.com/style-member.css';
}
}
}
$str = $dom->saveHTML();
If you want to check the full domain name, use
if( strtolower($_SERVER['HTTP_HOST'])=='member.site.com' ){
// other stuff
}
If you need to check it against REQUEST_URI, then
if( preg_match('#^/member#i',$_SERVER['REQUEST_URI']) ){
// other stuff
}
To check the hostname from a full URL:
if( preg_match('#^(?:http[s]*://)?([^/]+)#i',$url) ){
// other stuff
}
Note: as long as the subject is a single line, ^ matches only at its beginning, so the following matches just the leading /member/:
preg_match('#^/member/#i','/member/blahstuftuff/member/member/member/me?user=amigo&dir=mber/member')
You can test regular expressions with an online regex tester.
EDIT
If you want to change the CSS when the user is on the member site and is logged in to a session, then just set:
$_SESSION['member']=true; when the user logs in,
and do this in the part of the page (the header, or wherever you plan to write the CSS link):
Using a request URI that starts with '/member':
echo '<link rel="stylesheet" href="http://site.com/'.(preg_match('#^/member#i',$_SERVER['REQUEST_URI'])==true&&$_SESSION['member']==true?'blue.css':'orange.css').'"/>';
Using the member domain name 'member.site.com':
echo '<link rel="stylesheet" href="http://site.com/'.(strtolower($_SERVER['HTTP_HOST'])=='member.site.com'&&$_SESSION['member']==true?'blue.css':'orange.css').'"/>';
If you want blue.css to be served even to guest users who are not logged in, then just remove the session variable comparison!
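The host check can be pulled out of the echo statement into a small function, which makes it easier to read and to test. This is a sketch only: the name stylesheet_for_host is hypothetical, the session check is left out, and the file names come from the question.

```php
<?php
// Hypothetical helper: pick the stylesheet from the request's host name.
function stylesheet_for_host(string $host): string
{
    // Drop an optional ":port" suffix and compare case-insensitively.
    $host = strtolower(preg_replace('/:\d+$/', '', $host));
    return $host === 'member.site.com' ? 'blue.css' : 'orange.css';
}

$host = $_SERVER['HTTP_HOST'] ?? '';
$href = 'http://site.com/' . stylesheet_for_host($host);
echo '<link rel="stylesheet" href="' . htmlspecialchars($href, ENT_QUOTES, 'UTF-8') . '"/>';
```

Comparing the whole host name, rather than searching for member.site.com anywhere in it, avoids matching hosts such as notmember.site.com.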

How do I programmatically add rel="external" to external links in a string of HTML?

How can I check if links from a string variable are external? This string is the site content (like comments, articles etc).
And if they are, how do I append a external value to their rel attribute? And if they don't have this attribute, append rel="external" ?
A HTML parser is appropriate for input filtering, but for modifying output you'll need the performance of a simpleminded regex solution. In this case a callback regex would do:
$html = preg_replace_callback('#<a\s[^>]*href="(http://[^"]+)"[^>]*>#',
"cb_ext_url", $html);
function cb_ext_url($match) {
list ($orig, $url) = $match;
if (strstr($url, "http://localhost/")) {
return $orig;
}
elseif (strstr($orig, "rel=")) {
return $orig;
}
else {
return rtrim($orig, ">") . ' rel="external">';
}
}
You'll probably need more fine-grained checks. But that's the general approach.
Use an XML parser, like SimpleXML. Regex isn't made to do XML/HTML parsing, and here's a perfect explanation of what happens when you do: RegEx match open tags except XHTML self-contained tags.
Parse the input as XML, use the parser to select the required elements, edit their properties using the parser, and spit them back out.
It'll save you a headache, as regex makes me cry...
Here's my way of doing this (didn't test it):
<?php
$xmlString = "This is where the HTML of your site should go. Make sure it's valid!";
$xml = new SimpleXMLElement($xmlString);
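foreach($xml->xpath('//a') as $a) // SimpleXMLElement has no getElementsByTagName(); use XPath
{
$attributes = $a->attributes();
if (isThisExternal((string) $attributes['href'])) // isThisExternal() is a placeholder you must write yourself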
{
$a['rel'] = 'external';
}
}
echo $xml->asXml();
?>
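Because site content is often not valid XML, the same idea can be sketched with DOMDocument's more forgiving HTML parser. Both function names below are hypothetical. The external test here is simply "the URL names a host other than my own", and the markup is assumed to be ASCII.

```php
<?php
// Hypothetical helper: a link is external when it names a host other than $ownHost.
function is_external(string $href, string $ownHost): bool
{
    $host = parse_url($href, PHP_URL_HOST);
    // Relative URLs have no host component and count as internal.
    return is_string($host) && strcasecmp($host, $ownHost) !== 0;
}

// Sketch: append "external" to the rel attribute of every external link.
function mark_external_links(string $html, string $ownHost): string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // Wrap the fragment so it can be pulled back out afterwards.
    $dom->loadHTML('<div id="fragment">' . $html . '</div>');
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('a') as $a) {
        if (!is_external($a->getAttribute('href'), $ownHost)) {
            continue;
        }
        // Keep existing rel values such as "nofollow" and add "external" once.
        $rel = preg_split('/\s+/', $a->getAttribute('rel'), -1, PREG_SPLIT_NO_EMPTY);
        if (!in_array('external', $rel, true)) {
            $rel[] = 'external';
        }
        $a->setAttribute('rel', implode(' ', $rel));
    }

    // Serialise only the children of the wrapper, not the whole document.
    $out = '';
    $fragment = (new DOMXPath($dom))->query('//div[@id="fragment"]')->item(0);
    foreach ($fragment->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

For example, mark_external_links($comment, 'yourdomain.com') leaves relative links and links to yourdomain.com untouched.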
It might be easier to do something like this on the client side, using jQuery:
<script type="text/javascript">
$(document).ready(function()
{
$.each($('a'), function(idx, tag)
{
// you might make this smarter and throw out URLS like
// http://www.otherdomain.com/yourdomain.com
if ($(tag).attr('href').indexOf('yourdomain.com') < 0)
{
$(tag).attr('rel', 'external');
}
});
});
</script>
As Craig White points out though, this doesn't do anything SEO-wise and won't help users who have JavaScript disabled.
