I have used AJAX to successfully change the content of a web page. I can include another web page from my domain, but the problem I have is making the hyperlinks work. If the hyperlinks use relative addressing, they will not resolve relative to the page I am including them in, so I was investigating PHP to parse the HTML tags as I read the page in.
I am using the following RegExp /href[\s]?=[\s\"\']+(.*?)[\"\']/im to find the href data but would like a pointer on how I can prefix a relative address.
I would like to replace a link href="./test1/page1.html"
on page http://foo.bar.com/folder1/info1/data.html
with href="http://foo.bar.com/folder1/info1/./test1/page1.html". Then, if I include the page content of /folder1/info1/data.html in http://foo.bar.com/folder2/faraway/another.html, the links on the embedded page will function correctly.
I was looking at using the PHP preg_replace function to do that, but have very quickly come unstuck. If I am barking up the wrong tree and there is a more appropriate tool or approach, can someone please point me in the right direction ;-). Maybe it can all be done in JavaScript?
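For what it's worth, here is a minimal sketch of the preg_replace_callback approach described above; the base URL and the check that skips already-absolute links are assumptions, not a definitive implementation:

<?php
// Sketch: prefix relative href values in fetched HTML with the source page's base URL.
function prefix_relative_hrefs($html, $base) {
    return preg_replace_callback(
        '/href\s*=\s*["\'](.*?)["\']/i',
        function ($m) use ($base) {
            $href = $m[1];
            // Leave absolute and protocol-relative links alone (assumption).
            if (preg_match('#^(https?:)?//#i', $href)) {
                return $m[0];
            }
            return 'href="' . $base . $href . '"';
        },
        $html
    );
}

// Example using the URLs from the question:
echo prefix_relative_hrefs(
    '<a href="./test1/page1.html">page 1</a>',
    'http://foo.bar.com/folder1/info1/'
);
// -> <a href="http://foo.bar.com/folder1/info1/./test1/page1.html">page 1</a>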
If you're planning to do much more JavaScript on the page, you could use jQuery.
function make_absolute(base_path){
    // Inside .each(), "this" is a plain DOM element, so wrap it in $() before calling .attr()
    $("#embedded a").each(function(){
        $(this).attr("href",
            base_path + $(this).attr("href")
        );
    });
}
Replace "#embedded" with the id of your embedded page.
This is almost certainly overkill if you're not going to use JavaScript for anything else, but if you're planning to make a shiny dynamic AJAX-y page, you might look into it.
Bonus:
Doing AJAX page loading with jQuery:
$("#embedded").load(page_you_want_to_load)
Taking ABentSpoon's response a step further, your jQuery selector can search for all anchor tags whose href starts with a slash.
$('#embedded a[href^="/"]').each(function() {
    $(this).attr('href', baseUrl + $(this).attr('href'));
});
For more help, see the jQuery selectors documentation.
Why don't you just use absolute paths?
You guys have certainly helped me out here, many thanks. I think the regular expression I need would be /href[\s]?=[\s\"\']\./is since, as ABentSpoon pointed out, "If it starts with a slash, that's absolute enough for most purposes". However, I guess it would be a good exercise to enable reading pages from other sites. Luckily, any of the pages I may wish to do this with are on the same site, and on the same server.
To pick up on Jeroen's comment about just using absolute paths: that is not really an option, as there are many pages on this site. Also, each page gets addressed differently (DNS) depending on where it is accessed from, internally or externally. If you give your links an absolute path, you tie ALL of them to that site's DNS name. That is a problem when you find the name changing all too regularly, or when departments feel the need to change their subdirectory names, but that's another story. I wish to design this feature to be a little more flexible.
I will certainly read up on jQuery. It looks interesting; it's not something I've played with yet... more learning coming up ;-)
Thanks again for taking the time guys.
What I want to do: Scrape all the links from a page using Simple HTML DOM, while taking care to get full links (i.e. from http:// all the way to the end of the address).
My Problem: I get links like /wiki/Cell_wall instead of http://www.wikipedia.com/wiki/Cell_wall.
More examples: If I scrape the URL: http://en.wikipedia.org/wiki/Leaf, I get links like /wiki/Cataphyll, and //en.wikipedia.org/. Or if I'm scraping http://php.net/manual/en/function.strpos.php, I get links like function.strripos.php.
I've tried so many different techniques of building the actual full URL, but there are so many possible cases that I am completely at a loss as to how I can possibly cover all the bases.
However, I'm sure there are many people who've had this problem before - which is why I turn to you!
P.S. I suppose this question could almost be reduced to just handling local hrefs, but as mentioned above, I've come across //en.wikipedia.org/, which is not a full URL and yet is not local.
I think this is what you're looking for. It worked for me on an old project.
http://www.electrictoolbox.com/php-resolve-relative-urls-absolute/
You need a library that converts relative urls to absolute. URL To Absolute seems popular. Then you just:
require('url_to_absolute.php');
require('simple_html_dom.php'); // Simple HTML DOM, as mentioned in the question

// Load the page, then resolve every href against the page's own URL
$doc = file_get_html('http://en.wikipedia.org/wiki/Leaf');
foreach($doc->find('a[href]') as $a){
    echo url_to_absolute('http://en.wikipedia.org/wiki/Leaf', $a->href) . "\n";
}
See PHP: How to resolve a relative url for a list of libraries.
I don't know if this is what you are looking for, but this will give you the full URL of the page it is executed from:
window.location.href
Hope it helps.
Okay, thanks everyone for your comments.
I think the solution is to use regex to find the webroot of any particular URL, then simply append the local address to this.
Tricky part:
Designing a regex statement that works for all domains, including their subdomains...
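As a rough illustration of that idea (not a definitive implementation), here is a PHP sketch that treats scheme plus host as the webroot and also patches up the protocol-relative //en.wikipedia.org/ case mentioned above; the example URLs are taken from the question:

<?php
// Sketch: find the webroot (scheme + host) of the page being scraped,
// then resolve the kinds of links mentioned in the question against it.
function make_full_url($page_url, $href) {
    // Webroot of any http(s) URL, subdomains included, e.g. "http://en.wikipedia.org"
    preg_match('#^(https?://[^/]+)#i', $page_url, $m);
    $webroot = $m[1];

    if (preg_match('#^https?://#i', $href)) {
        return $href;                      // already a full URL
    }
    if (substr($href, 0, 2) === '//') {
        return 'http:' . $href;            // protocol-relative, e.g. //en.wikipedia.org/ (scheme assumed)
    }
    if (substr($href, 0, 1) === '/') {
        return $webroot . $href;           // root-relative, e.g. /wiki/Cell_wall
    }
    // Document-relative, e.g. function.strripos.php: append to the page's directory.
    return preg_replace('#[^/]*$#', '', $page_url) . $href;
}

echo make_full_url('http://en.wikipedia.org/wiki/Leaf', '/wiki/Cataphyll') . "\n";
// -> http://en.wikipedia.org/wiki/Cataphyll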
I have a question about iframes, but I really don't know how to start with it. I think it's best to give the URL immediately: http://www.nielsjansen.be/project/saved.php
When you click on the body of an article, the article opens in the same window. That's good, but I want to keep my menu etc. How is this possible?
Thank you
It depends on your level of expertise in PHP and HTML, but I would not use IFRAMEs as they tend to be deprecated.
As @Aziz said, too, IFRAMEs are in the HTML domain, not the PHP domain.
If you can edit your code and are able to program in PHP, a very basic technique would be to write a function that outputs your menu and use it on every page, including the article pages. That's the caveman solution; to get more sophisticated, one should think about layout, content management and so on.
If you cannot program in PHP, things get a lot more difficult.
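To make the "caveman solution" concrete, here is a minimal sketch; the file name, function name and menu entries are made up for illustration:

<?php
// menu.php (hypothetical file): the one place that knows about the menu.
function output_menu() {
    echo '<ul class="menu">';
    echo '<li><a href="index.php">Home</a></li>';
    echo '<li><a href="saved.php">Saved articles</a></li>';
    echo '</ul>';
}

Then every page, including the article pages, pulls it in:

<?php
require('menu.php');
output_menu();   // the shared menu appears here
// ... article content below ...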
You actually are asking an HTML question here. Since I don't have a sample of your code, I'll just take a shot at it:
<a href="mylink.html" target="myframe">
<iframe name="myframe" src="mypage.html"></iframe>
That should get you started. All you need to do is give a name attribute to your iframe, and a target to your link.
NOTE: I would strongly recommend against using iframes, as they have been deprecated as of HTML5, but have always (IMHO) been bad practice. There has only been one case where I used them in a project, which, if I had really set my mind to it, could have been avoided completely.
Hope that helps.
In a project I'm trying to fetch data within the <body> tag. So I can't echo anything in the <title> 'cause I haven't fetched anything yet. I want to change the title tag after the page has been loaded with jQuery.
Will crawlers understand this and when they index the page will they use the title I have provided with jQuery?
Nope... search engine crawlers see what is rendered by the server.
But if you are building an AJAX website, you can read the Google-provided guide Making AJAX Applications Crawlable.
Quoting the guide:
If you're running an AJAX application with content that you'd like to appear in search results, we have a new process that, when implemented, can help Google (and potentially other search engines) crawl and index your content.
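Very roughly, the scheme in that guide boils down to serving an HTML snapshot when the crawler rewrites a #! URL into an _escaped_fragment_ query parameter. A minimal sketch of the server side; render_html_snapshot() is a made-up placeholder:

<?php
// The crawler turns "page.php#!state" into "page.php?_escaped_fragment_=state".
if (isset($_GET['_escaped_fragment_'])) {
    $state = $_GET['_escaped_fragment_'];
    // render_html_snapshot() is a hypothetical helper that returns the fully
    // rendered HTML for this application state, built server-side.
    echo render_html_snapshot($state);
    exit;
}
// Otherwise fall through and serve the normal JavaScript-driven page.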
No, crawlers are highly unlikely to execute any of the JavaScript on the page. Some may inspect the JavaScript and make some assumptions based on it, but one should not assume that this is the case.
Google's spider can run JavaScript on pages that it processes, but I don't think there's any advice anywhere on what it can and can't do. Of course other crawlers won't be as sophisticated and will probably ignore dynamic content.
It's an interesting test, actually. I'll try this on one of my sites and post back. I know Googlebot does understand some JavaScript, but I think this is more about catching dark SEO tactics, i.e. $('.spammystuff').hide(); type things.
I've just been messing around with file_get_contents() at school and have noticed that it allows me to open websites at school that are blacklisted.
Only a few issues:
No images load
Clicking a link on the website just takes me back to the original blocked page.
I think I know a way of fixing the linking issue, but haven't really thought it through...
I could do a str_replace on the content from file_get_contents() to replace any link with another file_get_contents() call on that link... right?
Would it make things easier if I used cURL instead?
Is what I'm trying to do even possible, or am I just wasting my valuable time?
I know this isn't a good way to go about something like this, but it's just a thought that's made me curious.
This is not a trivial task. It is possible, but you would need to parse the returned document(s) and replace everything that refers to external content so that they are also relayed through your proxy, and that is the hard part.
Keep in mind that you would need to be able to deal with (for a start, this is not a complete list):
Relative and absolute paths that may or may not fetch external content
Anchors, forms, images and any number of other HTML elements that can refer to external content, and may or may not explicitly specify the content they refer to.
CSS and JS code that refers to external content, including JS that modifies the DOM to create elements with click events that act as links, to name but one challenge.
This is a fairly mammoth task. Personally I would suggest that you don't bother - you probably are wasting your valuable time.
Especially since some nice people have already done the bulk of the work for you:
http://sourceforge.net/projects/php-proxy/
http://sourceforge.net/projects/knproxy/
;-)
Your "problem" comes from the fact that HTTP is a stateless protocol and different resources like css, js, images, etc have their own URL, so you need a request for each. If you want to do it yourself, and not use php-proxy or similar, it's "quite trivial": you have to clean up the html and normalize it with tidy to xml (xhtml), then process it with DOMDocument and XPath.
You could learn a lot of things from this - it's not overly complicated, but it involves a few interesting "technologies".
What you'll end up with is called a crawler or screen scraper.
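For illustration only, here is a rough sketch of that DOMDocument/XPath step; it assumes the proxy script is called proxy.php, takes its target in a url query parameter, and that resolve_against_base() is a placeholder for whatever URL-resolving helper you use:

<?php
// Sketch: rewrite links in the fetched page so they route back through the proxy.
$base = 'http://example.com/some/page.html';   // page being proxied (example)
$html = file_get_contents($base);

$doc = new DOMDocument();
libxml_use_internal_errors(true);              // tolerate messy real-world HTML
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//a[@href] | //img[@src]') as $node) {
    $attr = $node->hasAttribute('href') ? 'href' : 'src';
    // resolve_against_base() is a hypothetical helper that makes the value
    // absolute against $base; a library such as url_to_absolute() could do this.
    $absolute = resolve_against_base($base, $node->getAttribute($attr));
    $node->setAttribute($attr, 'proxy.php?url=' . urlencode($absolute));
}

echo $doc->saveHTML();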
I have webpage1.html which has a hyperlink whose href="some/javascript/function/outputLink()"
Now, using cURL (or any other method in PHP), how do I deduce the hyperlink (in http:// format) from the JavaScript function so that I can go to the next page?
Thanks
You'd have to scrape the JavaScript. Figure out where the function is and see what URL it's using.
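As a rough sketch of that, under the assumption that the URL appears as a string literal inside the outputLink() function (which may not hold):

<?php
// Fetch the page that contains the JavaScript (error handling omitted).
$ch = curl_init('http://example.com/webpage1.html');   // assumed page URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

// Grab the body of outputLink() and look for a quoted URL or path inside it.
if (preg_match('#function\s+outputLink\s*\([^)]*\)\s*\{(.*?)\}#s', $html, $fn) &&
    preg_match('#[\'"]((?:https?://|/)[^\'"]+)[\'"]#', $fn[1], $m)) {
    echo "Next page: " . $m[1] . "\n";
}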
Sometimes http:// is omitted for links that are on the same page, so that won't be a good search reference.
At this point the only valuable thing to do is to try and understand the JavaScript code yourself, and once you find the link you could use regex to filter the result programmatically with PHP.
preg_match("/url + \'\/apples.html/g", "blah blah var javaScriptVar= url + '/apples.html';", $matches);
There is no straightforward way. There are very few to zero libraries which can perfectly do what you require. I think http://www.dapper.net/ is something close to what you want. I am not sure if it's the ideal solution. Dapper.net will help you parse text and links, and would probably also handle JavaScript.