Crawl Website using PHP - php

I've tried several techniques to crawl this URL (see below), and for some reason the title comes back incorrect. If I inspect the page with Firebug I can see the correct title tag; however, if I view the page source, it's different.
Every PHP technique I've tried gives the same result. Digg is able to crawl the page and parse the correct title.
Here's the link: http://lifehacker.com/#!5772420/how-to-make-ios-more-like-android
The correct title is "How to Make Your iPhone (or Other iOS Device) More Like Android"
The parsed title is "Lifehacker, tips and downloads for getting things done"
Is this normal? How are they doing this? Is there a way to get the correct title?

That's because when you request it using PHP (without any JS support) you're getting the main page of Lifehacker, which is lifehacker.com.
Lifehacker switched their CMS recently so that all requests go to an initial page, and everything after the hashbang (#!) is read by a JS script in the main page to figure out which page needs to be served. The part of a URL after the # is never sent to the server, so you need to modify your program to take this into account.
EDIT
Have a gander at these links
http://code.google.com/web/ajaxcrawling/docs/getting-started.html
http://www.tbray.org/ongoing/When/201x/2011/02/09/Hash-Blecch

Found the answer:
http://lifehacker.com/#!5772420/how-to-make-ios-more-like-android
becomes:
http://lifehacker.com/?_escaped_fragment_=5772420/how-to-make-ios-more-like-android
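
The rewrite above can be sketched as a small PHP helper. This is a minimal sketch, not a full implementation of the AJAX crawling scheme: the function name hashbangToEscapedFragment is made up for illustration, and it copies the fragment unchanged as in the example above, so a fragment containing characters such as & or # would presumably need percent-encoding first.

```php
<?php
/**
 * Rewrite a "#!" (hashbang) URL into its _escaped_fragment_ form.
 * Returns the URL unchanged when it contains no hashbang.
 * (Hypothetical helper name; the fragment is copied as-is, unescaped.)
 */
function hashbangToEscapedFragment(string $url): string
{
    $pos = strpos($url, '#!');
    if ($pos === false) {
        return $url;
    }
    $base = substr($url, 0, $pos);
    $fragment = substr($url, $pos + 2);
    // Use "&" if the base URL already carries a query string.
    $separator = (strpos($base, '?') === false) ? '?' : '&';
    return $base . $separator . '_escaped_fragment_=' . $fragment;
}

echo hashbangToEscapedFragment('http://lifehacker.com/#!5772420/how-to-make-ios-more-like-android'), "\n";
```

The resulting URL can then be fetched with file_get_contents() or cURL like any other page, provided the site supports the scheme.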

Related

Load entire html file to parse with php?

I am trying to parse a website's HTML with PHP's fopen(). That works very well so far, but the problem is that several posts on the site aren't present in the HTML, because you have to scroll far down before they load.
As an example, I am trying to count the total number of comments on my own Facebook page. (This is just an example; if the count is shown somewhere on Facebook, that doesn't help me.)
How can I make the html file load completely?
Thank you
You cannot, directly. What you are doing is called scraping. You have to inspect the requests the browser makes (in your developer tools) when you view that page yourself, and reproduce those requests in PHP through fopen() or any other means (cURL, etc.).
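
As a rough illustration of reproducing those requests, the sketch below keeps asking for the next page until an empty one comes back. Everything specific in it is an assumption: the example.com endpoint, the page parameter, and the JSON array response are placeholders for whatever the browser's network tab actually shows, and the function names are invented for this example.

```php
<?php
/**
 * Collect items from a paginated endpoint by requesting page after page
 * until an empty page comes back (or $maxPages is reached).
 *
 * $fetchPage is any callable that takes a page number and returns an
 * array of items; it is injected so the loop can be tried without a network.
 */
function collectAllPages(callable $fetchPage, int $maxPages = 50): array
{
    $items = [];
    for ($page = 1; $page <= $maxPages; $page++) {
        $batch = $fetchPage($page);
        if (empty($batch)) {
            break; // nothing more to load
        }
        $items = array_merge($items, $batch);
    }
    return $items;
}

/**
 * Hypothetical fetcher: the URL and the JSON shape are assumptions;
 * replace them with the request seen in the browser's developer tools.
 */
function fetchCommentsPage(int $page): array
{
    $url = 'https://example.com/ajax/comments?page=' . $page; // assumed endpoint
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $body = curl_exec($ch);
    if ($body === false) {
        return [];
    }
    $decoded = json_decode($body, true);
    return is_array($decoded) ? $decoded : [];
}

// Usage: $comments = collectAllPages('fetchCommentsPage'); echo count($comments);
```

Many sites require cookies or tokens on these requests, so the fetcher usually has to send the same headers the browser did.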

Pass php results to another website.

So what I am trying to do is this:
On my server, users can enter their YouTube channel name. My PHP file will then parse the channel and output HTML code with the results. What I want is for users to be able to put a snippet on their website that will call my website, let's say youtubevideos.com/videos.php?channel=channelname; my code will take that name and output the videos back to their site, much like Google ads, I guess.
Any idea how that is done, other than with an iframe? I figured that would be my last resort.
I think what I'm looking for is for them to put a JavaScript snippet on their site that renders the HTML code I'm pushing from my PHP file.
Thank you!
The code on the server you are targeting needs to send a header like this:
"Access-Control-Allow-Origin: *"
So if you provide the service yourself and control both the server and its code, it is possible. If you can't edit the target code and the header is not set, it will be impossible.
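
A minimal sketch of what sending that header could look like in a script such as videos.php follows. The renderVideoList helper, the shape of the $videos array, and the placeholder data are assumptions made for illustration; only the header itself comes from the answer above.

```php
<?php
/**
 * Build the HTML fragment for a list of videos.
 * Each video is an array with 'title' and 'url' keys (an assumed shape).
 */
function renderVideoList(array $videos): string
{
    $html = "<ul>\n";
    foreach ($videos as $video) {
        $html .= sprintf(
            "  <li><a href=\"%s\">%s</a></li>\n",
            htmlspecialchars($video['url'], ENT_QUOTES, 'UTF-8'),
            htmlspecialchars($video['title'], ENT_QUOTES, 'UTF-8')
        );
    }
    return $html . "</ul>\n";
}

// Only emit headers and output when served over HTTP, not from the command line.
if (PHP_SAPI !== 'cli') {
    // Allow scripts on any origin to read this response.
    header('Access-Control-Allow-Origin: *');
    header('Content-Type: text/html; charset=utf-8');

    // Placeholder data; a real script would look up the channel named in $_GET['channel'].
    $videos = [['title' => 'Example video', 'url' => 'https://www.youtube.com/watch?v=VIDEO_ID']];
    echo renderVideoList($videos);
}
```

With the header in place, JavaScript running on the user's site can request this URL and insert the returned HTML into the page.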
There would be two parts of this solution.
In the videos.php file on your server, you would implement the logic to scrape the data from the original site and format it in the way you want to show on the final website.
For the end user, you would provide code similar to this, which they would paste into their PHP pages to display the content from your site.
$your_website_url="http://youtubevideos.com/videos.php?channel=channelname";
//Don't forget the http:// at the start.
echo file_get_contents($your_website_url);
If file_get_contents() gives a security error (for example because allow_url_fopen is disabled on their host), you can use cURL instead.
I hope that helps.

how to open google links inside iframe?

I am trying to open a Google search inside an iframe. It was working until recently, but something changed.
This can be tested here: http://jsfiddle.net/patrioticcow/xTjyX/
I also added &output=embed at the end of the link, but it doesn't seem to help.
In Chrome I get: Refused to display document because display forbidden by X-Frame-Options.
It doesn't work in Mozilla either.
Any ideas?
Thanks
X-Frame-Options is a header sent by the web server of the page you are trying to embed in the iframe. It tells the browser not to allow embedding the page in an iframe. Have a look at https://developer.mozilla.org/en/The_X-FRAME-OPTIONS_response_header for a more detailed description.
Evidently Google does not want you to embed its search results in an iframe.
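
To check ahead of time whether a page sends this header, something along these lines could work. It is only a sketch: xFrameOptions is a made-up helper name, and it looks at X-Frame-Options alone, while a site may also restrict framing by other means (such as a Content-Security-Policy frame-ancestors directive).

```php
<?php
/**
 * Look for an X-Frame-Options header in a list of raw response header lines.
 * Returns its value in upper case (e.g. "DENY", "SAMEORIGIN"),
 * or null when the header is absent.
 */
function xFrameOptions(array $headerLines): ?string
{
    $name = 'X-Frame-Options:';
    foreach ($headerLines as $line) {
        // Header names are case-insensitive, so compare without regard to case.
        if (stripos($line, $name) === 0) {
            return strtoupper(trim(substr($line, strlen($name))));
        }
    }
    return null;
}

// Usage against a live URL (get_headers() performs a real request and
// returns false on failure):
// $headers = get_headers('https://www.google.com/');
// var_dump(xFrameOptions($headers ?: []));
```

A non-null result means the browser will most likely refuse to show the page in a frame on another site.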
We are seeing the same problem, this time with Google in a standard frame. It was fine a couple of months ago; now it's not working. I think that Google just changed the rules... Not a very open thing to do.
I'd suggest that you run what was in the iframe as a separate child window or new tab - not sure if this will give you the result you wanted.

Facebook Feed (RSS using PHP)

Can someone tell me what happens when I enter a link into the Facebook status update form and it loads a small preview of the website? (I'm guessing it's RSS or something?)
How do I implement this on my site using PHP?
What do I need to learn to be able to implement that?
It scrapes the page you are linking to. It doesn't have anything to do with RSS.
By looking at the HTML of the page it can get the page title for you and find all the images that can be used as a thumbnail.
Take a look at HTTP or cURL in the PHP manual for methods to get webpage content.
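
Once the page content has been fetched, pulling out the title and candidate thumbnails could look roughly like this. It is a sketch using PHP's DOM extension; the extractPreview name and the returned array shape are choices made for this example, and it says nothing about how Facebook itself does it.

```php
<?php
/**
 * Extract the <title> text and all <img> src values from an HTML string.
 * Returns ['title' => string|null, 'images' => string[]].
 */
function extractPreview(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titleNodes = $doc->getElementsByTagName('title');
    $title = $titleNodes->length > 0 ? trim($titleNodes->item(0)->textContent) : null;

    $images = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            $images[] = $src;
        }
    }
    return ['title' => $title, 'images' => $images];
}

$sample = '<html><head><title>Example page</title></head>'
        . '<body><img src="/a.png"><img src="http://example.com/b.jpg"></body></html>';
print_r(extractPreview($sample));
```

Relative image paths such as /a.png still have to be resolved against the page URL before they can be used as thumbnails.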

How to hide an iframe url in HTML source code

How can I hide an iframe URL in the HTML source code? I have two applications; one loads the other's URL into an iframe, so that URL appears in its source code. I don't want the other application's URL to be visible in the source code.
I think you would need to set the iframe URL via JavaScript. The JavaScript could then be obfuscated, so that the URL would not be in plain text... Please see the following link for the obfuscator:
http://www.javascriptobfuscator.com/Default.aspx
i.e. if it was jQuery...
$("#myiFrame").attr('src','http://www.google.com');
becomes:
var _0xc1cb=["\x73\x72\x63","\x68\x74\x74\x70\x3A\x2F\x2F\x77\x77\x77\x2E\x67\x6F\x6F\x67\x6C\x65\x2E\x63\x6F\x6D","\x61\x74\x74\x72","\x23\x6D\x79\x69\x46\x72\x61\x6D\x65"];$(_0xc1cb[3])[_0xc1cb[2]](_0xc1cb[0],_0xc1cb[1]);
You can't hide it per se, but you can run it through something like TinyURL so that anyone interested would need to go an extra step. Anyway, that's the only thing I can think of. However, if you are displaying that page in a frame, what's the harm in having the URL in the source code? There really isn't a good, foolproof way to prevent someone determined from finding out the location of that iframe page.
You can create a PHP script which uses cURL to call the URL through localhost, then use this script as your iframe source.
If you have an issue with relative links and sub-directories, you can put your cURL script inside the sub-directory.
