Make a JavaScript-aware Crawler - php

I want to write a script that crawls a website and returns the locations of all the banners shown on each page.
The banners are usually served from known domains, but they rarely appear in the HTML as a plain image or .swf file. Most of the time JavaScript is used to show the banner.
So if a .swf file or image file is loaded from a banner domain, the script should return that URL.
Is that possible? And roughly how could I do it?
Ideally it would also return the landing page of the ad. How could I solve that?

You could use Selenium to open the pages in a real browser and then access the DOM.
PhantomJS might also be worth a look - it's a headless browser built on WebKit (the engine behind Safari, and behind Chrome until it moved to the Blink fork).
However, none of those solutions are pure PHP - if that's a requirement, you'll probably have to write your own JavaScript engine in PHP (which is nothing I'd ask my worst enemy to do ;))

In order to get the output of the JavaScript you will need a JavaScript engine (such as Google's V8 engine). V8 is written in C++, but there are some resources that explain how to embed it in PHP.
With that said, you have to study the output "by hand" and determine exactly what can be scraped and how to identify it. Once you've identified some common syntax for the advertisement banners, then you can write a script to extract the banner and the landing page which is referenced.
None of this is easy work, but if you have an example of an ad you'd like to collect then I can give you more advice.
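Once a real browser or headless engine has handed you the list of URLs the rendered page actually requested, the filtering step from the question can stay in plain PHP. The following is only a rough sketch under assumptions: the domain list and the findBannerUrls name are made up for illustration, and collecting the requested URLs is left to whatever browser tool you end up using.

```php
<?php
// Hypothetical list of ad-serving hosts; replace with the domains you actually track.
const BANNER_DOMAINS = ['ads.example.com', 'banners.example.net'];

/**
 * Return the URLs that look like banner creatives: served from a known
 * banner domain (or a subdomain of one) and ending in an image or .swf extension.
 *
 * @param string[] $requestedUrls every URL the rendered page loaded
 * @param string[] $bannerDomains hosts considered to be banner servers
 * @return string[]
 */
function findBannerUrls(array $requestedUrls, array $bannerDomains = BANNER_DOMAINS): array
{
    $banners = [];
    foreach ($requestedUrls as $url) {
        $host = strtolower((string) parse_url($url, PHP_URL_HOST));
        $path = strtolower((string) parse_url($url, PHP_URL_PATH));
        // Skip anything without a host or without a creative-like extension.
        if ($host === '' || !preg_match('/\.(swf|gif|jpe?g|png)$/', $path)) {
            continue;
        }
        foreach ($bannerDomains as $domain) {
            // Match the domain itself or any subdomain of it.
            if ($host === $domain || substr($host, -strlen($domain) - 1) === '.' . $domain) {
                $banners[] = $url;
                break;
            }
        }
    }
    return $banners;
}
```

This only covers the banner URL. The landing page is usually behind a click-through link, so it would have to be read from the DOM or from the redirect chain separately.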

Related

Language for web scraping JAVASCRIPT content

I think the title asks the question. I usually use PHP for parsing and web scraping, but I have a really hard time scraping JavaScript-generated content; in most cases I can't do it.
Example: parsing a div that only appears after some JavaScript has executed.
I read that Ruby has a parser library for JavaScript, so my question is: which language is effective for writing a scraper that handles JavaScript-generated content? Is there a library for PHP, like the one for Ruby, that can parse JavaScript content?
There are a handful of strategies for this. Depending on your needs, consider programmatically instantiating a browser instance that you can hook into and read the page from.
The idea is, let the browser do the work, as the page is made for a browser and not your bot. You can then tap in and scrape away using a browser plugin that feeds data to your primary application running things.
This may be way overkill for what you need though. I'll leave it up to you to decide.
You should look at some GUI-less (headless) browsers. There are some written for Java. I didn't find one for PHP.
Look at:
HtmlUnit
Golf
You can try using something like Selenium, which allows you to automate browser tasks.
On the other hand, you can go into details on what happens when the js code is executed. For example, if the js code is requesting something from the server by POSTing some data, you could emulate that in the regular fashion.
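As a rough illustration of emulating such a request from PHP, here is a minimal sketch using a stream context. The endpoint and field names in the commented example are hypothetical; you would copy the real ones from the network tab of your browser's developer tools.

```php
<?php
/**
 * Build the stream-context options for a form-encoded POST,
 * mimicking what the page's JavaScript sends.
 */
function buildPostOptions(array $fields): array
{
    return [
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content' => http_build_query($fields),
            'timeout' => 10,
        ],
    ];
}

/** Send the POST and return the raw response body, or false on failure. */
function postLikeTheScript(string $endpoint, array $fields)
{
    $context = stream_context_create(buildPostOptions($fields));
    return file_get_contents($endpoint, false, $context);
}

// Example with a hypothetical endpoint and hypothetical field names:
// $body = postLikeTheScript('https://example.com/ajax/load-div', ['section' => 'news', 'page' => 2]);
```

The response is often JSON or an HTML fragment, which is usually easier to parse than the fully rendered page.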
You should look at PhantomJS and CasperJS (headless browsers).
In the Ruby world, the gem for driving PhantomJS is Poltergeist.
There is another article about some of the options you have in Ruby here too (however, they are not all JS-capable).

How to show HTML pages instead of Flash to search engines

Let's say I have a plain HTML website. More than 80% of my visitors are usually from search engines like Google, Yahoo, etc. What I want to do is to make my whole website in Flash.
However, search engines can't read information from Flash or JavaScript. That means my web page would lose more than half of the visitors.
So how do I show HTML pages instead of Flash to the search engines?
Note: you could reach a specific page/category/etc. in Flash by using a PHP GET parameter. For example, you can surf through all the web pages from the homepage, or link to a specific web page by typing page?id=1234.
Short answer: don't make your whole site in Flash.
Longer answer: If you show humans one view and the googlebot another, you are potentially guilty of "cloaking". If the Google Gods find you guilty, you will be banned to the Supplemental Index, never to be heard from again.
Also, doing an entire site in Flash breaks the basic contract of the web, namely that you can link to specific content from other sites or in emails. If your site has just one URL and everything else is handled inside of Flash ... well, I don't know what you have, but it isn't a website anymore. Adobe may like you, but many people will not. Oh, and Flash is very unfriendly to people with handicaps.
I recommend using Flash where it is needed (videos, animations, etc.), but make it part of an honest-to-God website.
"What I want to do is to make my whole website in Flash"
"So how to accomplish this: show HTML pages instead of Flash?"
These two seem a bit contradictory.
It is important to understand the reasoning behind choosing Flash to build your entire website.
"More than 80 percent of my visitors are usually from search engines"
You did some analysis, but did you look at how many visitors access your website via a mobile device? Because apart from SEO, Flash won't run on the majority of these devices.
Have you considered HTML5 as an alternative for anything you want to do with Flash?
Facebook requires you to build applications in Flash, among other options, rather than plain HTML. Why? I do not know, but that is their policy and there has got to be a reason.
I have recently been developing simple social applications in Flash (*.swf), and my latest app is a website in Flash that will display in a tab of my company's page on Facebook; at the same time, I also want to use that website as a regular webpage on the internet for my company. The only way I could find to display HTML text within a Flash file is to change the text properties, wherever I can, under CHARACTER to "Render text as HTML" (look for the "<>" symbol). I think that way the search engines will be able to read your content and process your website accordingly. Good luck.
As you say, you can reach each Flash page through a GET variable such as a page ID, so that's good. I assume you will embed the Flash in each HTML page. Besides this, you can add all the other HTML content in hidden form, so the crawlers can reach the content while your site still looks like Flash. Wouldn't that work?
Since no one actually gave you a straight answer (probably because your question is absolutely face-palm-esque), I'll try:
Consider using the web-development approach called progressive enhancement. Now, it's fair to say that it probably wasn't intended for Flashification of a website, but you can make use of its principles.
1. Start with your standard HTML version of your website
2. Introduce SWFObject to dynamically (important bit) swap out the HTML content for its Flash equivalent
3. Introduce SWFAddress to allow for deep linking into your Flash movies (pseudo-URLs)
Granted, steps 2 and 3 are a little more advanced than how I've described them, and your site size/structure/design may not suit this approach, but at least it's an answer.
All that being said, I agree with the other answers/comments questioning the need to use Flash to display your entire site - there are very, very few reasons anyone would do that, and there are more reasons not to than have already been mentioned (iOS devices, etc.)...

Scraping IMDB's Top 250 List gives some results in foreign languages?

I'm having my server grab this page to download the full list for a movie analysis I'm doing:
http://www.imdb.com/chart/top
But when it does, a lot of the movie titles appear in another language. For example, instead of The Shawshank Redemption it gives me: Побег из Шоушенка
A simple file_get_contents in PHP is the fastest way to reproduce it, though I'm using curl.
Does anyone have any idea what's going on and how to fix it?
UPDATE: IMDB might be interpreting my server as being in another country for some strange reason. Is there any way to enforce it as being in the US?
Use a user account and set the title display language at https://secure.imdb.com/register-imdb/siteprefs
Then automate the login process within your scraper and follow your normal process.
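Before automating a login, it may be worth trying something lighter that this answer does not cover: sending an Accept-Language header with the request. Whether IMDb honours that header or still localises by IP address is an assumption I have not verified, so treat this as a sketch to experiment with. The function names are invented for the example.

```php
<?php
/** Build stream-context options that ask the server for US English content. */
function buildEnglishRequestOptions(string $language = 'en-US,en;q=0.8'): array
{
    return [
        'http' => [
            'method' => 'GET',
            'header' => "Accept-Language: {$language}\r\n",
        ],
    ];
}

/** Fetch a page with the language preference attached; returns false on failure. */
function fetchInEnglish(string $url)
{
    $context = stream_context_create(buildEnglishRequestOptions());
    return file_get_contents($url, false, $context);
}

// Usage (performs a real HTTP request):
// $html = fetchInEnglish('http://www.imdb.com/chart/top');
```

With curl, the same header could be passed through the CURLOPT_HTTPHEADER option.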
I know how to deal with this in a Windows environment. You may be able to borrow the same idea for your server OS.
In Windows with a WebBrowser control, you can use the menu View -> Encoding to select whichever encoding shows the text properly; then when you grab the page source from the browser control, it will be in the correct encoding.
You may find the IRobotSoft web scraper easy to use for your movie analysis; it runs on Windows only.

What is the easiest way to convert existing PHP web application to mobile application?

Suppose I have developed a web portal in PHP/MySQL. I want to make it work on mobile as well. What is the easiest way to do this?
Can we use PHP with mobile markup languages like WML or XHTML, the same way we use PHP with HTML in web applications viewed in normal web browsers?
PHP has nothing to do directly with the platform you want to display your app on. PHP is just the tool to deliver the kind of markup you need for your page to be displayed on whatever platform you want. It's up to your own knowledge and creativity to render markup which suits your needs. So in other words, yes of course you can send WML, XML, XHTML, you name it to the client!
The client doesn't know anything about PHP anyways (PHP 'exists' only on the server side), the client doesn't understand PHP and doesn't need to. It understands XHTML or any other markup and that's what you have to deliver! What tool you use to do that is completely up to you. PHP is one option.
So all you need to know is for what platform/client you want to render your content and what kind of markup this platform understands and then deliver the right markup to the right platform/client including the respective CSS, js, etc.
What your app does:
detect what client is requesting your site
see if you're able to send the appropriate markup
send this markup or if not available some default or similar markup
Pseudo-code for each page (or just the template page, if you have one):
<?php if(mobile()): ?>
Mobile HTML and PHP
<?php else: ?>
Desktop HTML and PHP
<?php endif; ?>
I use this.
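The mobile() call in that pseudo-code is left undefined. A minimal sketch of what it could look like, assuming a crude User-Agent keyword check is good enough (the keyword list is illustrative, not exhaustive, and a maintained detection library would be more reliable):

```php
<?php
/**
 * Crude mobile check based on the User-Agent header.
 * The keyword list is an illustrative assumption, not an exhaustive one.
 * Pass a string to test a specific User-Agent; by default the current request is used.
 */
function mobile(?string $userAgent = null): bool
{
    $userAgent = $userAgent ?? ($_SERVER['HTTP_USER_AGENT'] ?? '');
    return (bool) preg_match('/Mobile|Android|iPhone|iPod|BlackBerry|Opera Mini/i', $userAgent);
}
```

Define it once in a shared include so every template can call it.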
It depends on what you mean by "mobile". Basically it would just mean adapting your portal's displayed data and CSS to smaller display sizes and, as ZOMFG said, using an if statement to output your source accordingly. If you want to enable WAP browsing, you have to output your data in the Wireless Markup Language (WML).
PHP is just a tool which generates some markup language (or anything else, actually, which might not be markup-oriented at all) that is understood by the client -- the browser.
Your PHP code will have to be able to generate two different kinds of output:
a "full output" (likely HTML), which you already have, for computer web browsers
a "light output" (maybe WML, maybe HTML too but without some heavy stuff), for mobile-based browsers.
The task you'll have to deal with is to differentiate between mobile and non-mobile users; this might be done by user-agent sniffing, for instance, or detecting what the client requested.
A nice thing to do could be to use a special domain name for users of mobile platforms, something like mobile.example.com, so they can bookmark it and directly access the "mobile version" of your site -- can be useful if your detection doesn't work well ^^
If you are targeting advanced mobile machines (like the iPhone) which have a not-too-bad browser, you might want to send them "rich" HTML pages; just test your pages to verify they fit on the small screens of these machines; and maybe you'll want to send less data (like not displaying some sidebars, or having a smaller menu, ...)
BTW, what kind of platform do you mean by "mobile"? Those old phones with small screens, or more power-user-oriented phones, like iPhone / Android / stuff like that?
This could make quite a difference, actually, as the more recent ones have nice functionality that the oldest didn't have ^^
In any case, one important thing to remember :
you will spend some time making the site work on these devices
you will have to spend more time maintaining it !
Which means : do something simple, and, as much as possible, use the same PHP code for both mobile and non-mobile version of the site : the less code you duplicate, the less code you'll have to maintain !
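To make that concrete, here is a small sketch of keeping the logic in one place and letting only the presentation differ. The function names and the placeholder article data are invented for the example; a real portal would load the data from MySQL.

```php
<?php
/** Shared logic: both versions of the site call this, so it is written once. */
function loadArticle(int $id): array
{
    // Placeholder data; a real portal would query the database here.
    return [
        'title'   => "Article {$id}",
        'body'    => 'Full text of the article.',
        'sidebar' => 'Related links',
    ];
}

/** Only the presentation differs between the full and the light output. */
function renderArticle(array $article, bool $isMobile): string
{
    $html = '<h1>' . htmlspecialchars($article['title']) . '</h1>'
          . '<p>' . htmlspecialchars($article['body']) . '</p>';
    if (!$isMobile) {
        // The sidebar is the "heavy stuff" that the light output leaves out.
        $html .= '<aside>' . htmlspecialchars($article['sidebar']) . '</aside>';
    }
    return $html;
}
```

Any fix to loadArticle then reaches both versions at once, which is the maintenance saving described above.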
Hope these few ideas (not really well sorted, I admit) will help...
Have fun !
Mobile browsers already support almost full XHTML, JavaScript, and Flash.
My recommendations are:
have a light CSS for the mobile version
restrict some heavy functionality
validate your code
optimize, optimize, optimize, although this applies to the full version as well.

Can I programmatically open, search and return results in a browser from multiple websites?

Currently I have a bookmarked list of websites that I look for content on, and open them in a new window, in tabs, and search for the string that I want to find.
Is there any way to do this programmatically, ending up with a browser window containing the results pages of the sites? There are around 20 of them.
I'm running Safari 3 on Mac OS X Leopard. I have experience using PHP, but even something as simple as using AppleScript would be OK too. I just want to speed up my workflow. :-)
Well if you know ruby, there is ScRUBYt!
http://www.softwaredeveloper.com/features/scrubyt-ruby-web-scraping-tool-051007/
http://scrubyt.org/
Here are a couple of articles about writing a web crawler in Perl:
http://www.linuxjournal.com/article/2200
http://www.devshed.com/c/a/Perl/Web-Mining-with-Perl/
Not entirely programming related, but you could create something like this using Google's Custom Search Engine facility (assuming all the sites you plan to search are publicly accessible).
You can then insert the provided snippets into an HTML file of your choice, and even set this as your "Home Page" in Firefox.
You are going to have to specify which browser you are using, because I am sure that each one has different automation capabilities.
Try Yahoo Pipes; there is probably one that's already halfway to where you want it.
I searched it for "search" and here are the results.
http://pipes.yahoo.com/pipes/search?q=search&x=0&y=0
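Since the asker mentions PHP and Mac OS X, a low-tech option that none of the answers raise is a small command-line PHP script that builds each site's search URL and hands it to the macOS open command. This is only a sketch: the URL templates and the {query} placeholder are hypothetical, and whether the pages land in tabs or separate windows depends on Safari's own preferences.

```php
<?php
/**
 * Fill the search term into each site's URL template.
 * "{query}" marks where the URL-encoded search string goes.
 */
function buildSearchUrls(array $templates, string $query): array
{
    $urls = [];
    foreach ($templates as $template) {
        $urls[] = str_replace('{query}', urlencode($query), $template);
    }
    return $urls;
}

/** Open each URL in Safari using the macOS "open" command. */
function openInSafari(array $urls): void
{
    foreach ($urls as $url) {
        exec('open -a Safari ' . escapeshellarg($url));
    }
}

// Hypothetical templates; copy the real ones from each site's search results URL.
// $sites = ['http://example.com/search?q={query}', 'http://example.org/find?term={query}'];
// openInSafari(buildSearchUrls($sites, $argv[1]));
```

Run it from Terminal with the search string as the first argument.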
