Screen scraping technique using PHP

How do I screen scrape a particular website? I need to log in to the site and then scrape the information behind the login.
How could this be done?
Please guide me.
Duplicate: How to implement a web scraper in PHP?

Zend_Http_Client and Zend_Dom_Query
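A rough sketch of how those two fit together, assuming Zend Framework 1 is on your include path (the URLs, form field names, and CSS selector are placeholders):

require_once 'Zend/Http/Client.php';
require_once 'Zend/Dom/Query.php';

// Log in with a POST and keep the session cookie for later requests
$client = new Zend_Http_Client('https://example.com/login');
$client->setCookieJar();
$client->setParameterPost(array('username' => 'me', 'password' => 'secret'));
$client->request(Zend_Http_Client::POST);

// Fetch a page behind the login, reusing the same cookie jar
$client->setUri('https://example.com/members/profile');
$html = $client->request(Zend_Http_Client::GET)->getBody();

// Query the markup with CSS selectors
$dom = new Zend_Dom_Query($html);
foreach ($dom->query('div.profile span.name') as $node) {
    echo $node->nodeValue, "\n";
}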

You want to look at the cURL functions - they will let you get a page from another website. You can use cookies or HTTP authentication to log in first and then get the page you want, depending on the site you're logging in to.
Once you have the page, you're probably best off using regular expressions to scrape the data you want.
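A minimal sketch of the cookie-and-cURL part (the URLs and form field names are placeholders for whatever the target site's login form actually uses):

$cookieJar = tempnam(sys_get_temp_dir(), 'cookies');

// 1) POST the login form, saving the session cookie to a jar
$ch = curl_init('https://example.com/login');
curl_setopt_array($ch, array(
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query(array('username' => 'me', 'password' => 'secret')),
    CURLOPT_COOKIEJAR      => $cookieJar,   // write cookies here
    CURLOPT_COOKIEFILE     => $cookieJar,   // ...and send them back on later requests
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_RETURNTRANSFER => true,
));
curl_exec($ch);

// 2) Fetch the protected page with the same handle (and cookie jar)
curl_setopt($ch, CURLOPT_URL, 'https://example.com/members');
curl_setopt($ch, CURLOPT_HTTPGET, true);
$html = curl_exec($ch);
curl_close($ch);

// $html now holds the markup to pick apart (regex, DOM, etc.)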

You should look at cURL.

You might also want to take a look at BeautifulSoup, a Python library that is supposed to be very good at making bad HTML parseable. It is aimed at things like screen scraping.
How easy it would be to call from PHP, though, I don't know.

You could also check out http://php.net/dom

Use cURL, and once you're in, use the QueryPath PHP library (querypath.org).
You can access DOM elements just like in jQuery, via CSS selectors, and there's method chaining...
Way better than just using PHP's native XML functions.
It also works as a Drupal extension, but I suppose you could use it in any PHP project.
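A small sketch of what that looks like, assuming QueryPath is installed and $html holds a page you already fetched (the selector is a placeholder):

require_once 'QueryPath/QueryPath.php';

// htmlqp() is tolerant of messy real-world HTML
foreach (htmlqp($html, 'div.article h2 a') as $link) {
    echo $link->text(), ' => ', $link->attr('href'), "\n";
}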

Related

Data scraping based on search engines

Is it possible to scrape the web based on keywords using search engines in PHP?
For example, when someone enters a keyword, the script searches Google, fetches the result pages, and then extracts the lines that include the matched keyword.
Any idea or library to refer to?
You can do that using the Google Custom Search API https://developers.google.com/custom-search/json-api/v1/overview and its PHP client https://github.com/google/google-api-php-client.
Later on you need to write a web scraper to download the result pages (curl) and parse the HTML (e.g. https://github.com/paquettg/php-html-parser).
I would, however, not recommend PHP for the latter task. There are much more sophisticated scraping tools available for Python (e.g. BeautifulSoup or Scrapy) that will make your life much, MUCH easier than using PHP.
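If you'd rather not pull in the full API client, a bare-bones sketch against the Custom Search JSON API looks roughly like this (the API key and search engine ID are placeholders you create in the Google developer console):

$apiKey = 'YOUR_API_KEY';
$cx     = 'YOUR_SEARCH_ENGINE_ID';
$query  = 'screen scraping php';

$url = 'https://www.googleapis.com/customsearch/v1?' . http_build_query(array(
    'key' => $apiKey,
    'cx'  => $cx,
    'q'   => $query,
));

$results = json_decode(file_get_contents($url), true);

foreach ($results['items'] as $item) {
    // Each result item carries a title, link and snippet
    echo $item['title'], ' -> ', $item['link'], "\n";
    // You could now fetch $item['link'] with cURL and extract the
    // lines containing the keyword.
}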
You can use the PHP function file_get_contents():
file_get_contents('web url goes here');
For example: file_get_contents('http://www.google.com');
That function returns the HTML from the URL; you can then use XPath to extract the HTML elements that hold the data you want.
You can see an example and more explanation at the URL below.
https://gist.github.com/anchetaWern/6150297
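A minimal sketch of the file_get_contents() + XPath combination (the URL and XPath expression are placeholders):

$html = file_get_contents('http://example.com/');

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid XML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
foreach ($xpath->query('//div[@id="content"]//a') as $link) {
    echo $link->getAttribute('href'), ' => ', trim($link->textContent), "\n";
}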
I personally have done something similar to what you're asking, but in Ruby on Rails; you can explore the project here:
https://github.com/dvarun/gextract
The XPath that I used is here:
https://github.com/dvarun/gextract/blob/master/app/jobs/fetch_keyword_job.rb

Scrape using html dom parser

Is it the right way to scrape other websites' content into my website using simple_html_dom? If it is wrong, please suggest a method to display news on my website.
simple_html_dom is a third-party library, I am guessing. If you are looking for something in core PHP (a bundled extension), use DOMDocument.
Basically, by scraping you are taking the site's content. If you are doing that with the site team's consent then it's okay; otherwise it may not be legal (it depends on their T&C). Also, sites have mechanisms to block such acts.
Better to ask the site team for the content; they might be able to provide the data in a much better and simpler way, like an API, RSS, or direct database access.
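For the RSS route, a tiny sketch with SimpleXML (the feed URL is a placeholder; most news sites publish one):

$feed = simplexml_load_file('https://example.com/news/rss.xml');

foreach ($feed->channel->item as $item) {
    echo $item->title, "\n";
    echo $item->link, "\n";
    echo $item->pubDate, "\n\n";
}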

How can I manipulate DOM using PHP?

I saw an article here: http://code.lancepollard.com/automatically-publish-posts-to-stumbleupon-with-ruby
I don't know Ruby, but the following lines are pretty self-explanatory:
page = agent.get("http://www.stumbleupon.com/submit?url=#{url}&title=#{title}")
form = page.forms.first
form.radiobuttons_with(:name => "sfw").first.check
page = agent.submit(form)
I'm guessing Ruby can fetch the webpage, select a radio button, then submit the form. Is that possible using PHP?
The Ruby code you referenced actually uses a third-party library called Mechanize.
Something similar for PHP is the SimpleTest scriptable browser. It's not as feature-rich as Mechanize, but it can get the job done, and it can be used independently of the SimpleTest framework.
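A rough sketch with SimpleTest's browser (the field name and button label are guesses; check the actual form):

require_once 'simpletest/browser.php';

$browser = new SimpleBrowser();
$browser->get('http://www.stumbleupon.com/submit?url=' . urlencode($url) . '&title=' . urlencode($title));

// setField() covers radio buttons and checkboxes as well as text inputs
$browser->setField('sfw', '1');
$browser->clickSubmit('Submit');

$result = $browser->getContent();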
You would probably want something like:
http://simplehtmldom.sourceforge.net/
PHP's built-in DOM support is sufficient, but it is more cumbersome to use than a third-party library.
Not out of the box. Possibly there is a third-party library that can do it for you. One that might help is phpQuery, to loop over a fetched page and select the form and its values. The submit would then have to be done using cURL or the like; a rough sketch follows the links below.
More info:
fetching a page: http://php.net/manual/en/function.file-get-contents.php
jQuery for PHP: http://code.google.com/p/phpquery/wiki/Basics for a basic intro
Submitting a form with Curl: http://davidwalsh.name/execute-http-post-php-curl
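Putting those pieces together, a rough sketch (the URL, form selector, and field names are placeholders):

require_once 'phpQuery/phpQuery.php';

// Fetch the page that contains the form
$html = file_get_contents('https://example.com/submit');
phpQuery::newDocumentHTML($html);

// Read the form's hidden fields so the POST carries them along
$fields = array();
foreach (pq('form input[type=hidden]') as $input) {
    $fields[pq($input)->attr('name')] = pq($input)->attr('value');
}
$fields['sfw'] = '1';   // the value we actually want to set

// Submit the form with cURL
$ch = curl_init('https://example.com/submit');
curl_setopt_array($ch, array(
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query($fields),
    CURLOPT_RETURNTRANSFER => true,
));
$response = curl_exec($ch);
curl_close($ch);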

How to extract content from other websites automatically?

I want to extract specific data from a website's pages...
I don't want to get all the contents of a specific page; I need only some portion (maybe only the data inside a table or content_div), and I want to do it repeatedly across all the pages of the website.
How can I do that?
Use cURL to retrieve the content and XPath to select the individual elements.
Be aware of copyright, though.
"extracting content from other websites" is called screen scraping or web scraping.
simple html dom parser is the easiest way(I know) of doing it.
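A small sketch with simple_html_dom (the URL and selector are placeholders):

require_once 'simple_html_dom.php';

$html = file_get_html('http://example.com/news');

foreach ($html->find('div.article h2 a') as $link) {
    echo $link->plaintext, ' => ', $link->href, "\n";
}

$html->clear();   // free memory; large pages add up quickly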
You need a PHP crawler. The key is to use string manipulation functions such as strstr, strpos, and substr.
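For very simple pages, that string-based approach might look like this (the markers are placeholders, and it ignores nested tags, so it only works when the markup is predictable):

$page = file_get_contents('http://example.com/');

$start = strpos($page, '<div id="content">');
if ($start !== false) {
    $end = strpos($page, '</div>', $start);
    if ($end !== false) {
        echo strip_tags(substr($page, $start, $end - $start));
    }
}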
There are ways to do this. Just for fun I created a Windows app that went through my account on a well-known social network, looked in the right places, and logged the information into an XML file. That information would then be imported elsewhere. However, this sort of application can be used for motives I don't agree with, so I never uploaded it.
I would recommend using RSS feeds to extract content.
I think you need to implement something like a spider. You can make an XMLHTTP request, get the content, and then parse it.

Any PHP -> jQuery libraries out there?

Have any bridge libraries been developed for PHP that provide access to the jQuery framework? Ideally it would be nice to have something fairly extensible, so that creating jQuery-based content using PHP code would be fairly easy and customizable. Does such a thing exist yet?
pquery
jqpie
jquery-php
There's a warm-up list.
So far I've found one that seems to fit the description. I haven't tried it out yet, so if anyone has any feedback or experience with this or other ones don't hesitate to post!
PQuery
jQPie might be what you're after.
What can jQPie do?
Easily request and process data from PHP using $.getJSON
Inject PHP-generated HTML into elements using $(element).load
Call PHP functions directly from your web pages using $.jqpie
Call jQuery from PHP in response to $.jqpie calls
Advanced autocomplete using jqpie_complete
QueryPath (http://querypath.org) is a full implementation of the DOM/XML/HTML part of jQuery. QueryPath has full CSS 3 selector support (including the stuff jQuery doesn't have, like XML namespace support). It also comes with DB tools, where you can run queries and have the results inserted into the query object. And it has a template engine, too. Like jQuery, you can write custom extensions very easily.
But it definitely takes advantage of its server-side status.
The main project page is at https://fedorahosted.org/querypath. You can download it there (and see lots of examples, including RSS and SVG manipulation).
Integrating with jQuery, then, can be done easily by sending XML data of many sorts down to jQuery. (You could probably send JSON, too... never tried.) And since the server-side code and the client-side code both look the same, there's less of a need to learn two totally different toolkits.
