What I want to do: Scrape all the links from a page using Simple HTML DOM while taking care to get full links (i.e. from http:// all the way to the end of the address).
My Problem: I get links like /wiki/Cell_wall instead of http://www.wikipedia.com/wiki/Cell_wall.
More examples: If I scrape the URL: http://en.wikipedia.org/wiki/Leaf, I get links like /wiki/Cataphyll, and //en.wikipedia.org/. Or if I'm scraping http://php.net/manual/en/function.strpos.php, I get links like function.strripos.php.
I've tried so many different techniques of building the actual full URL, but there are so many possible cases that I am completely at a loss as to how I can possibly cover all the bases.
However, I'm sure there are many people who've had this problem before - which is why I turn to you!
P.S. I suppose this question could almost be reduced to just handling local hrefs, but as mentioned above, I've come across //en.wikipedia.org/ which is not a full URL and yet is not local.
I think this is what you're looking for. It worked for me on an old project.
http://www.electrictoolbox.com/php-resolve-relative-urls-absolute/
You need a library that converts relative URLs to absolute ones. URL To Absolute seems popular. Then you just:
require('simple_html_dom.php'); // Simple HTML DOM, which the question is already using
require('url_to_absolute.php');
$doc = file_get_html('http://en.wikipedia.org/wiki/Leaf'); // the page being scraped
foreach ($doc->find('a[href]') as $a) {
    echo url_to_absolute('http://en.wikipedia.org/wiki/Leaf', $a->href) . "\n";
}
See PHP: How to resolve a relative url for a list of libraries.
I don't know if this is what you are looking for, but this will give you the full URL of the page it is executed from:
window.location.href
Hope it helps.
Okay, thanks everyone for your comments.
I think the solution is to use regex to find the webroot of any particular URL, then simply append the local address to this.
Tricky part:
Designing a regex statement that works for all domains, including their subdomains...
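For what it's worth, parse_url() can take the place of that regex. Here is a rough sketch (my own, not from the answers above) of the case handling involved, including the protocol-relative //host links mentioned in the question:

<?php
// Rough sketch: use parse_url() to find the scheme and host (the
// "webroot") of the page being scraped, then handle each kind of href.
function resolve_link($pageUrl, $href) {
    if ($href === '') {
        return $pageUrl;
    }
    $parts  = parse_url($pageUrl);
    $scheme = $parts['scheme'];                  // e.g. "http"
    $root   = $scheme . '://' . $parts['host'];  // e.g. "http://en.wikipedia.org"

    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
        return $href;                            // already absolute
    }
    if (substr($href, 0, 2) === '//') {
        return $scheme . ':' . $href;            // protocol-relative, e.g. //en.wikipedia.org/
    }
    if ($href[0] === '/') {
        return $root . $href;                    // root-relative, e.g. /wiki/Cell_wall
    }
    // Plain relative, e.g. function.strripos.php: resolve against the
    // directory of the page being scraped.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir  = preg_replace('#/[^/]*$#', '/', $path);
    return $root . $dir . $href;
}

This still doesn't collapse ../ segments or handle query-only links, which is exactly why a ready-made library such as URL To Absolute is the safer choice.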
Related
This is my first scraper https://scraperwiki.com/scrapers/my_first_scraper_1/
I managed to scrape google.com but not this page.
http://subeta.net/pet_extra.php?act=read&petid=1014561
Any reasons why?
I have followed the documentation from here.
https://scraperwiki.com/docs/php/php_intro_tutorial/
And there is no reason why the code should not work.
It looks like you are telling it to find one specific element. Elements change depending on the site you are scraping, so if the element you are looking for isn't found, you get nothing back. I would also look into building your own scraping/spidering tool with cURL; not only will you learn a lot, you will find out a great deal about how sites need to be scraped (a bare-bones fetch sketch follows below).
Also, as a side note, you might want to consider abiding by the robots.txt file on the website you are scraping, or asking for permission before scraping, as scraping without it is considered impolite.
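If you do go down the cURL route, the bare-bones fetch mentioned above might look something like this (a sketch only; the URL is the one from the question and the user-agent string is arbitrary):

<?php
// Minimal cURL fetch sketch: grab the raw HTML of one page.
$ch = curl_init('http://subeta.net/pet_extra.php?act=read&petid=1014561');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);       // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);       // follow redirects
curl_setopt($ch, CURLOPT_USERAGENT, 'MyScraper/0.1'); // some sites reject an empty user agent
$html = curl_exec($ch);

if ($html === false) {
    echo 'cURL error: ' . curl_error($ch) . "\n";
}
curl_close($ch);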
I am writing PHP code that uses a regex to get all the links from a page, and I need to extend it to get the links from an entire website.
I guess the extracted URLs should be crawled in turn, and so on, so that the script eventually visits every URL of the site, not just the one given page.
I know that anything is possible, but how would I go about this? Thank you for your guidance.
Hmm, to ensure that you get all the pages that Google has found, what about crawling Google instead? Just search for "site:domain.com", and then retrieve anything that follows this pattern:
<h3 class="r"><a href="http://domain.com/.*?" class=l
(You'll have to escape the right characters as well, and the '.*?' is the regex part that captures the URLs Google finds.)
Anyways, that's just a suggestion for an alternative approach.
So, your regex grabs all the links. You cycle through a loop of those links, grab each with cURL, run that through your regex, wash, rinse, repeat.
Might want to make sure to put some sort of URL depth counter in there, lest you end up parsing The Internet.
Might also want to make sure you don't re-check links you've already followed, lest you end up at the end of Infinite Recursion Street.
Might also want to look at threading, lest it take 100,000 years.
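A rough sketch of that loop, with a visited list and a depth cap (fetch_page() and extract_links() are placeholders for whatever cURL fetch and regex extraction you already have):

<?php
// Sketch of a breadth-first crawl with a depth cap and a visited list.
function crawl($startUrl, $maxDepth = 2) {
    $visited = array();
    $queue   = array(array($startUrl, 0));     // pairs of (url, depth)

    while ($queue) {
        list($url, $depth) = array_shift($queue);

        if (isset($visited[$url]) || $depth > $maxDepth) {
            continue;                           // already followed, or too deep
        }
        $visited[$url] = true;

        $html = fetch_page($url);               // your cURL call goes here
        foreach (extract_links($html, $url) as $link) {
            $queue[] = array($link, $depth + 1);
        }
    }
    return array_keys($visited);
}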
This will get URLs from url() (CSS) and from href and src attributes (links, imgs, scripts):
#(?:href|src)="([^"]+)|url\(["']?(.*?)["']?\)#i
They will be captured in groups 1 and 2.
Be aware that some URLs can be relative, so you have to make them absolute before requesting them.
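Used from PHP, that might look roughly like this (a sketch; $html is assumed to hold the page source you already fetched):

<?php
// Pull href/src/url() values out of $html with the pattern above.
preg_match_all('#(?:href|src)="([^"]+)|url\(["\']?(.*?)["\']?\)#i', $html, $matches);

// Group 1 holds the href/src values, group 2 the url(...) values;
// merge them and drop the empty slots left by the alternation.
$urls = array_filter(array_merge($matches[1], $matches[2]));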
Normally, you do not have the kind of access to the underlying server that would let you retrieve a list of all pages on the site.
So you just need to do what Google does: Get all links from the page and then scan those links for additional links.
I'm developing a PHP-based web-application in which you have a form with textarea inputs that can accept links via anchor tags. But when I tested it after adding a hyperlink as follows, it pointed to a non-existent local subdirectory:
link
I realized that this was because I had not appended http:// before the link.
There might be cases where a user inputs a link just as I did above, and in such cases I don't want the link pointing to a non-existent local path. Is there any solution, such as automatically prepending http:// to the link when it is missing? How do I do that?
Edit:
Please consider that the anchor tags are amidst other plaintext and this is making things harder to work with.
I'd go for something like this:
if (!parse_url($url, PHP_URL_SCHEME)) {
    $url = 'http://' . $url;
}
This is an easy and stable way to check for the presence of a scheme in a URL, and it still allows other schemes (e.g. ftp, https) that may be entered.
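Since the edit says the anchor tags sit amidst other plain text, here is a rough sketch (my own addition, with a deliberately simple pattern) of applying the same check to every quoted href found in a block of text:

<?php
// Sketch: prepend http:// to every quoted href that has no scheme.
$text = preg_replace_callback(
    '/href\s*=\s*(["\'])(.*?)\1/i',
    function ($m) {
        $url = $m[2];
        if (!parse_url($url, PHP_URL_SCHEME)) {
            $url = 'http://' . $url;        // no scheme given, assume http
        }
        return 'href=' . $m[1] . $url . $m[1];
    },
    $text
);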
What you're talking about involves two steps, URL detection and URL normalization. First you'll have to detect all the URLs in the string being parsed and store them in a data structure for further processing, such as an array. Then you need to iterate over the array and normalize each URL in turn, before attempting to store them.
Unfortunately, both detection and normalization can be problematic, as a URL has a quite complicated structure. http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/ makes some suggestions, but as the page itself says, no regex URL detection is ever perfect.
There are examples of regular expressions that can detect URLs available from various sites, but in my experience none of them are completely reliable.
As for normalization, Wikipedia has an article on the subject which may be a good starting point. http://en.wikipedia.org/wiki/URL_normalization
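As a small illustration, here is a sketch covering just a couple of the normalizations that article lists (lower-casing the scheme and host, dropping an explicit :80 on http, and discarding the fragment):

<?php
// Sketch of a few simple URL normalizations; anything unparseable is left alone.
function normalize_url($url) {
    $p = parse_url($url);
    if ($p === false || !isset($p['scheme'], $p['host'])) {
        return $url;
    }
    $scheme = strtolower($p['scheme']);
    $host   = strtolower($p['host']);
    $port   = (isset($p['port']) && !($scheme === 'http' && $p['port'] == 80))
            ? ':' . $p['port'] : '';
    $path   = isset($p['path']) ? $p['path'] : '/';
    $query  = isset($p['query']) ? '?' . $p['query'] : '';
    return $scheme . '://' . $host . $port . $path . $query;   // fragment is dropped
}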
I have posted a similar question here. However, this was more about getting advice on what to do. Now that I know what to do, I am looking for a little help on how to do it!
Basically I have a website that is pretty much 100% dynamic. All the website links are generated using PHP and all the pages are made up of php includes/code. I am trying to improve the SEO of the site by improving the URLs (as stated in the other question) and I am struggling a little.
I am using mod_rewrite to rewrite the nice URLs to the ugly URLs on the server. So what I now need is to convert the ugly URLs (which are generated by the PHP code in the pages) into the nicer URLs.
Here are the URLs I need to parse (these are in the other question as well):
/index.php?m=ModuleType
/index.php?m=ModuleType&categoryID=id
/index.php?m=ModuleType&categoryID=id&productID=id
/index.php?page=PageType
/index.php?page=PageType&detail=yes
Here is what I want the above URLs to be parsed to:
/ModuleType
/ModuleType/CategoryName
/ModuleType/CategoryName/ProductName
/PageType
/PageType/Detail
There is an example on the other question, posted by Gumbo, but I felt it was a bit messy and unclear about exactly what it was doing.
Could someone help me solve this problem?
Thanks in advance.
I think I see what you're after... You've done all the URL rewriting, but all the links between your pages are using the old URL syntax.
The only way I can see around this is to do some kind of regex search and replace on the links so they use the new syntax. This will be a bit more complicated if all the links are dynamically generated, but hopefully there won't be too much of this to do.
Without seeing how your links are generated at the moment, it's difficult to say how to change the code. I imagine it works something like this though:
<?php echo "<a href='/index.php?m=$ModuleType&categoryID=$id'>"; ?>
So you'd change this to:
<?php echo "<a href='$ModuleType/$id'>"; ?>
Sorry if I've made errors in the syntax, just off the top of my head...
Unless I misunderstood your question, you don't parse the "ugly" URLs: your PHP script is called with them, so you $_GET[] your parameters (m, categoryID, productID) and combine them to make your nice URLs. That shouldn't be too hard, just a bit of logic to see which parameters are present and concatenate the strings.
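The "bit of logic" might look roughly like this (a sketch; lookup_category() and lookup_product() are hypothetical helpers that turn the ids into the names used in the nice URLs):

<?php
// Sketch: build the nice URL out of whichever parameters are present.
function nice_url(array $params) {
    if (isset($params['m'])) {
        $url = '/' . $params['m'];
        if (isset($params['categoryID'])) {
            $url .= '/' . lookup_category($params['categoryID']);   // hypothetical helper
            if (isset($params['productID'])) {
                $url .= '/' . lookup_product($params['productID']); // hypothetical helper
            }
        }
        return $url;
    }
    if (isset($params['page'])) {
        $url = '/' . $params['page'];
        if (isset($params['detail']) && $params['detail'] === 'yes') {
            $url .= '/Detail';
        }
        return $url;
    }
    return '/';
}

Called as nice_url($_GET) wherever a link is printed.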
You will need a front controller, which will dispatch the URL to the correct page.
Apache will rewrite the URL using rules in .htaccess, so that anything written will be redirected to index.php?q=. For example, typing http://example.com/i/am/here will result in a call to index.php?q=/i/am/here
index.php will then parse the path from $_GET["q"] and decide what to do. For example, it may include a page, or go to the database, look the path up, fetch the appropriate content and print it out.
If you want a working example of a .htaccess which will do exactly that (redirect to index.php with ?q=path) take a look at how drupal does it:
http://cvs.drupal.org/viewvc.py/drupal/drupal/.htaccess?revision=1.104
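A minimal sketch of the index.php side of this (show_home() and show_module() are hypothetical handlers standing in for whatever dispatch you choose):

<?php
// index.php - sketch of the dispatch step described above. Apache has
// already rewritten e.g. /ModuleType/CategoryName to index.php?q=/ModuleType/CategoryName.
$path  = isset($_GET['q']) ? trim($_GET['q'], '/') : '';
$parts = ($path === '') ? array() : explode('/', $path);

if (count($parts) === 0) {
    show_home();                                     // front page
} else {
    // First segment picks the module/page, the rest are its arguments.
    show_module($parts[0], array_slice($parts, 1));
}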
As Palantir wrote, this is done using mod_rewrite and .htaccess. To get the correct rewrite conditions into your .htaccess you might want to take a look at a Mod Rewrite Generator (e.g. http://www.generateit.net/mod-rewrite/). Makes it a lot easier.
I have used AJAX to successfully change the content of a web page. I can include another web page from my domain, but the problem I have is making the hyperlinks work. If the hyperlinks use relative addressing, they will not resolve relative to the page I am including them in, so I was investigating PHP to parse the anchor tags as I read the page in.
I am using the following regexp, /href[\s]?=[\s\"\']+(.*?)[\"\']/im, to find the href data, but would like a pointer on how I can prefix a relative address.
I would like to replace a link href="./test1/page1.html"
on the page http://foo.bar.com/folder1/info1/data.html
with href="http://foo.bar.com/folder1/info1/./test1/page1.html". Then, if I include the page content of /folder1/info1/data.html in http://foo.bar.com/folder2/faraway/another.html, the links on the embedded page will function correctly.
I was looking at using the PHP preg_replace function to do that, but have very quickly come unstuck. If I am barking up the wrong tree and there is a more appropriate tool or approach, can someone please point me in the right direction? ;-) Maybe it can all be done in JavaScript?
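For reference, here is a rough sketch of the preg_replace approach the question mentions (my own sketch, using preg_replace_callback and a deliberately simple pattern; $base would be something like "http://foo.bar.com/folder1/info1/"):

<?php
// Sketch: prefix every relative href in $html with the directory of the page it came from.
function prefix_relative_hrefs($html, $base) {
    return preg_replace_callback(
        '/href\s*=\s*(["\'])(.*?)\1/i',
        function ($m) use ($base) {
            $href = $m[2];
            // Leave absolute (scheme:) and protocol-relative (//) links alone.
            if (preg_match('#^(?:[a-z][a-z0-9+.-]*:|//)#i', $href)) {
                return $m[0];
            }
            return 'href=' . $m[1] . $base . $href . $m[1];
        },
        $html
    );
}

Root-relative links (a single leading /) would want the host prepended rather than the page directory, but for pages on the same site that is a small extra case.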
If you're planning to do much more JavaScript on the page, you could use jQuery.
function make_absolute(base_path){
    $("#embedded a").each(function(){
        $(this).attr("href",
            base_path + $(this).attr("href")
        );
    });
}
Replace "#embedded" with the id of your embedded page.
This is nearly certainly overkill if you're not going to use javascript for anything else, but if you're planning to make a shiny dynamic ajaxy page, you might look into it.
Bonus:
Doing AJAX page loading with jQuery:
$("#embedded").load(page_you_want_to_load)
Taking ABentSpoon's response a step further, your jQuery selector can search for all anchor tags whose href starts with a slash.
$('#embedded a[href^="/"]').each(function() {
    $(this).attr('href', baseUrl + $(this).attr('href'));
});
For more help with jQuery selectors, go here.
Why don't you just use absolute paths?
You guys have certainly helped me out here, many thanks. I think the regular expression I need would be /href[\s]?=[\s\"\']\./is since, as ABentSpoon pointed out, "If it starts with a slash, that's absolute enough for most purposes". However, I guess it would be a good exercise to enable reading pages from other sites. Luckily, any of the pages I may wish to do this with are on the same site, and on the same server.
To pick up on Jeroen's comment about just using absolute paths: that is not really an option, as there are many pages on this site. Also, each page would get addressed differently (DNS) depending on where it is accessed from, internally or externally. If you give your links an absolute path, you tie ALL of them to that site's DNS name, which is a problem when you find it changing all too regularly, or when departments feel the need to change their subdirectory names, but that's another story. I wish to design this feature to be a little more flexible.
I will certainly read up on jQuery. Looks interesting, it's not something I've played with yet... more learning coming up ;-)
Thanks again for taking the time guys.