I have webpage1.html which has a hyperlink whose href="some/javascript/function/outputLink()"
Now, using curl (or any other method in PHP), how do I deduce the hyperlink (in http:// format) from the JavaScript function so that I can go to the next page?
Thanks
You'd have to scrape the JavaScript: figure out where the function is defined and see what URL it builds.
Sometimes http:// is omitted for links on the same site, so searching for it isn't a reliable reference.
At this point the only valuable thing to do is to read and understand the JavaScript code yourself; once you find the link, you can use a regex to extract it programmatically with PHP.
preg_match("/url + \'\/apples.html/g", "blah blah var javaScriptVar= url + '/apples.html';", $matches);
There is no straightforward way, and few, if any, libraries can do exactly what you require. I think http://www.dapper.net/ is something close to what you want, though I am not sure it is the ideal solution. Dapper.net will help you parse text and links and would probably also handle JavaScript.
I'm modifying a simple PHP crawler script.
One of the modules it uses converts relative URLs into absolute URLs.
For this, I need a way to determine the base href of a given URL; otherwise I end up with a bunch of wrongly converted links.
I need a simple function that checks whether a URL's page has a base href tag and, if yes, returns it.
Thanks
parse_url() splits up a URL into its parts. You can get what you need from that.
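For instance, a quick sketch (the URL is made up):

// parse_url() breaks a URL into its components.
print_r(parse_url('http://example.com/some/dir/page.html?x=1'));
// Array ( [scheme] => http [host] => example.com
//         [path] => /some/dir/page.html [query] => x=1 )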
"I need a simple function to check if an url has a base href tag, and if yes, return it."
A URL cannot have a base href tag, since that is an HTML tag. It might, however, be defined in the HTML that you retrieve from that URL. How to read that can be found at this question.
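A rough sketch of that check, assuming the HTML has already been fetched into $html (the function name is made up):

// Return the base href from an HTML document, or null if there is none.
function get_base_href($html) {
    if (preg_match('/<base[^>]+href=["\']([^"\']+)["\']/i', $html, $m)) {
        return $m[1];
    }
    return null;
}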
I don't know exactly what you mean, but parse_url will give you a lot of information, such as the hostname, the query string, etc.
If I understand you correctly, you want to know whether there is an http in your URL. The scheme part of the information parse_url returns is your friend here: if scheme is empty or something other than http, you know that there was no http in your URL.
Inside the crawler you start crawling a specific page and you parse its HTML, if I understand your question correctly. Simply construct the base URL (without paths) from the information parse_url gives you and I don't see any problems.
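Putting that together, a short sketch (the example URL is made up):

// Rebuild the base URL (scheme + host, no path) from parse_url()'s parts.
$parts = parse_url('http://example.com/some/dir/page.html');
$base  = $parts['scheme'] . '://' . $parts['host']; // "http://example.com"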
How do I use: http://graph.facebook.com/?ids=http://www.sitename.com
to return the current page's number of likes on a dynamic page? I'd like to pull the URL from somewhere, output the number on the page, and be able to use the variable for other things too.
You can use a JavaScript framework like jQuery to make an AJAX call to this URL. The response will be parsed by jQuery as a JSON object holding the values you see when you open the URL in a browser. This way you have the "shares" variable in your JavaScript code to use for whatever you like.
This should give you some keywords you can Google for. The best thing would be to go to www.jquery.com, read some tutorials, and look at the AJAX examples. The usage is very straightforward.
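If you would rather have the value server-side as well, here is a minimal PHP sketch; the "shares" key is the field this endpoint exposed at the time, so check the actual JSON you get back:

// Ask the Graph API about one URL and pull out the count.
$pageUrl = 'http://www.sitename.com'; // pull this from wherever you store it
$json = file_get_contents('http://graph.facebook.com/?ids=' . urlencode($pageUrl));
$data = json_decode($json, true);
$count = isset($data[$pageUrl]['shares']) ? $data[$pageUrl]['shares'] : 0;
echo $count;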
One more thing: please show some research effort in your next question. People are here to help, but not to do all the work for you.
Good day everyone!
I am trying to append a script to a remote page (not mine, and it is a form page) that would hide some of its content (certain elements in particular) before showing it. I am using curl, but the only thing I could do is retrieve its HTML code.
Is there any way of doing what I want to happen?
I'm assuming that the user asks your server for content, and your server needs to fetch that content from another server and process it before sending it back to the user.
Query the other server using cURL, then run your script on that HTML to remove the pieces you don't want to keep (I hope for your sake that they are reasonably easy to find and eliminate), and finally output the resulting HTML to the user.
To remove some part of the HTML, you could preg_replace() it using regular expressions (sketched below).
Googling for an online regexp tester might be of some help if you have no experience with regular expressions.
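Putting the two steps together, a minimal sketch; the URL and the hide-me class are placeholders for whatever you actually need to strip:

// Fetch the remote form page.
$ch = curl_init('http://example.com/form-page.html');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

// Strip, say, every <div class="hide-me">...</div> block, then serve it.
$html = preg_replace('/<div class="hide-me">.*?<\/div>/s', '', $html);
echo $html;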
I have used AJAX to successfully change the content of a web page. I can include another web page from my domain, but the problem I have is making the hyperlinks work. If the hyperlinks use relative addressing, they will not resolve relative to the page I am including them in, so I was investigating PHP to rewrite the tags as I read them in.
I am using the RegExp /href[\s]?=[\s\"\']+(.*?)[\"\']/im to find the href data, but would like a pointer on how I can prefix a relative address.
I would like to replace a link href="./test1/page1.html"
on page http://foo.bar.com/folder1/info1/data.html
with href="http://foo.bar.com/folder1/info1/./test1/page1.html". Then, if I include the page content of /folder1/info1/data.html in http://foo.bar.com/folder2/faraway/another.html, the links on the embedded page will function correctly.
I was looking at using the PHP preg_replace function to do that but have very quickly come unstuck. If I am barking up the wrong tree and there is a more appropriate tool or approach, can someone please point me in the right direction ;-). Maybe it can all be done in JavaScript?
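For the PHP route, here is a rough sketch using preg_replace_callback; the function name absolutify and the $base argument are made up, and it assumes quoted href values:

// Prefix every relative href in $html with $base, leaving absolute URLs alone.
function absolutify($html, $base) {
    return preg_replace_callback(
        '/href\s*=\s*["\']([^"\']+)["\']/i',
        function ($m) use ($base) {
            if (preg_match('/^(https?:)?\/\//i', $m[1])) {
                return $m[0]; // already absolute
            }
            return 'href="' . $base . $m[1] . '"';
        },
        $html
    );
}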
If you're planning to do much more javascript on the page, you could use JQuery.
function make_absolute(base_path){
    $("#embedded a").each(function(){
        // "this" is a raw DOM element inside each(), so wrap it in $()
        $(this).attr("href",
            base_path + $(this).attr("href")
        );
    });
}
Replace "#embedded" with the id of your embedded page.
This is nearly certainly overkill if you're not going to use javascript for anything else, but if you're planning to make a shiny dynamic ajaxy page, you might look into it.
Bonus:
Doing ajax page loading with JQuery:
$("#embedded").load(page_you_want_to_load)
Taking ABentSpoon's response a step further, your jQuery selector can match only the anchor tags whose href starts with a slash:
$('#embedded a[href^="/"]').each(function() {
    $(this).attr('href', baseUrl + $(this).attr('href'));
});
For more help with jQuery selectors, go here.
Why donĀ“t you just use absolute paths?
You guys have certainly helped me out here, many thanks. I think the regular expression I need would be /href[\s]?=[\s\"\']\./is since, as ABentSpoon pointed out, "If it starts with a slash, that's absolute enough for most purposes". However, I guess it would be a good exercise to enable reading pages from other sites. Luckily, any of the pages I may wish to do this with are on the same site, and on the same server.
To pick up on Jeroen's comment about just using absolute paths: that is not really an option, as there are many pages on this site. Also, each page gets addressed differently (DNS) depending on where it is accessed from, internally or externally. If you give your links an absolute path, you tie ALL of them to that site's DNS name; a problem when you find it changing all too regularly, or for that matter when departments feel the need to change their subdirectory names, but that's another story. I wish to design this feature to be a little more flexible.
I will certainly read up on jQuery. It looks interesting; it's not something I've played with yet... more learning coming up ;-)
Thanks again for taking the time guys.
So I'm looking for ideas on how best to replicate the functionality seen on Digg. Essentially, you submit the URL of a page of interest; Digg then crawls the DOM to find all of the IMG tags (likely only selecting a few that are above a certain height/width), creates a thumbnail from each, and asks you which one should represent your submission.
While there's a lot going on there, I'm mainly interested in the best method to retrieve the images from the submitted page.
While you could try to parse the web page properly, HTML can be such a mess that you are best off with something close but imperfect:
Extract everything that looks like an image tag reference.
Try to fetch each URL.
Check whether you actually got an image back.
Just looking for and capturing the content of src="..." would get you there. Some basic manipulation to deal with relative vs. absolute image references and you're done (see the sketch below).
Obviously anytime you fetch a web asset on demand from a third party you need to take care you aren't being abused.
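A minimal sketch of those three steps; example.com and the naive relative-URL joining are placeholders:

// 1. Extract everything that looks like an image reference.
$page = 'http://example.com/submitted-page.html';
$html = file_get_contents($page);
preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $m);

foreach ($m[1] as $src) {
    // Crude relative-to-absolute handling.
    if (!preg_match('/^https?:\/\//i', $src)) {
        $src = 'http://example.com/' . ltrim($src, '/');
    }
    // 2. Try to fetch the URL. 3. Check that an image came back.
    $ch = curl_init($src);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_exec($ch);
    $type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    curl_close($ch);
    if (is_string($type) && strpos($type, 'image/') === 0) {
        echo "Got image: $src\n";
    }
}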
I suggest cURL + regexp.
You can also use the PHP Simple HTML DOM Parser, which will help you find all the image tags.
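For example, a short sketch with that library (it assumes simple_html_dom.php has been downloaded and is on your include path):

include 'simple_html_dom.php';
$html = file_get_html('http://example.com/');
foreach ($html->find('img') as $img) {
    echo $img->src . "\n"; // the src attribute of each <img> tag
}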