Data scraping based on search engines - PHP

Is it possible to scrape the web based on keywords using search engines in PHP?
Like when someone enters a keyword, the script searches Google, renders the results, then fetches each result page and scrapes/extracts the lines that include the matched keyword?
Any ideas or libraries I could refer to?

You can do that using the Google Custom Search API https://developers.google.com/custom-search/json-api/v1/overview and the related PHP client https://github.com/google/google-api-php-client.
After that you need to write a web scraper that downloads the websites (curl) and parses the HTML with a parser (e.g. https://github.com/paquettg/php-html-parser).
I would, however, not recommend PHP for the latter task. There are much more sophisticated scraping tools available for Python (e.g. BeautifulSoup or Scrapy) that will make your life much, MUCH easier than using PHP.
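For illustration, here is a minimal sketch of that pipeline in PHP, calling the Custom Search JSON API directly and then scanning each result page for matching lines. The API key and search engine ID are placeholders you would create in the Google developer console:

    <?php
    // Minimal sketch: query the Google Custom Search JSON API, then scan
    // each result page for lines containing the keyword.
    $apiKey  = 'YOUR_API_KEY';            // placeholder
    $cx      = 'YOUR_SEARCH_ENGINE_ID';   // placeholder
    $keyword = 'example keyword';

    $url = 'https://www.googleapis.com/customsearch/v1?'
         . http_build_query(['key' => $apiKey, 'cx' => $cx, 'q' => $keyword]);
    $results = json_decode(file_get_contents($url), true);

    foreach ($results['items'] ?? [] as $item) {
        $html = @file_get_contents($item['link']);
        if ($html === false) {
            continue; // skip pages that fail to download
        }
        // Strip the tags, then keep only lines that mention the keyword.
        foreach (explode("\n", strip_tags($html)) as $line) {
            if (stripos($line, $keyword) !== false) {
                echo trim($line), "\n";
            }
        }
    }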

You can use the PHP function file_get_contents():
file_get_contents('web url goes here');
For example: file_get_contents('http://www.google.com');
That function returns the HTML from the URL; you can then use XPath to extract the HTML elements that hold the data you want.
You can see an example and more explanation at the URL below.
https://gist.github.com/anchetaWern/6150297
I personally have done something similar to your question, but in Ruby on Rails; you can explore the project here.
https://github.com/dvarun/gextract
The XPath that I used is here:
https://github.com/dvarun/gextract/blob/master/app/jobs/fetch_keyword_job.rb
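To make the file_get_contents + XPath route concrete, here is a rough sketch using PHP's built-in DOMDocument and DOMXPath; the URL and the XPath expression are assumptions, so adjust them to the page you target:

    <?php
    $html = file_get_contents('http://www.example.com'); // placeholder URL

    $doc = new DOMDocument();
    @$doc->loadHTML($html);  // @ silences warnings from sloppy real-world HTML
    $xpath = new DOMXPath($doc);

    // Grab the text of every <h3> element, for instance.
    foreach ($xpath->query('//h3') as $node) {
        echo $node->textContent, "\n";
    }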


How to extract content from other websites automatically?

I want to extract specific data from a website, across its pages...
I don't want to get all the contents of a specific page; I need only some portion (maybe the data inside a table or a content_div), and I want to do it repeatedly across all the pages of the website.
How can I do that?
Use curl to retrieve the content and XPath to select the individual elements.
Be aware of copyright though.
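A small sketch of that curl + XPath suggestion, selecting only the portion the asker mentions (the id content_div comes from the question; everything else is a placeholder):

    <?php
    $ch = curl_init('http://www.example.com/page.html'); // placeholder URL
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);

    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // Only the table rows inside <div id="content_div">.
    foreach ($xpath->query("//div[@id='content_div']//table//tr") as $row) {
        echo trim($row->textContent), "\n";
    }

To cover the whole site, loop this over your list of page URLs.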
"extracting content from other websites" is called screen scraping or web scraping.
simple html dom parser is the easiest way(I know) of doing it.
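A quick, hedged example of what Simple HTML DOM usage looks like; the selector is an assumption about the target page:

    <?php
    include 'simple_html_dom.php'; // the library's single include file

    $html = file_get_html('http://www.example.com'); // placeholder URL
    foreach ($html->find('div#content_div td') as $cell) {
        echo $cell->plaintext, "\n";
    }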
You need a PHP crawler. The key is to use string manipulation functions such as strstr, strpos and substr.
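For example, a sketch of that string-function approach, cutting out the text between two known markers (the marker strings are assumptions about the target page):

    <?php
    function extract_between($html, $start, $end) {
        $from = strpos($html, $start);
        if ($from === false) {
            return null;
        }
        $from += strlen($start);
        $to = strpos($html, $end, $from);
        return $to === false ? null : substr($html, $from, $to - $from);
    }

    $html = file_get_contents('http://www.example.com'); // placeholder URL
    echo extract_between($html, '<div id="content_div">', '</div>');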
There are ways to do this. Just for fun I created a Windows app that went through my account on a well-known social network, looked in the right places, and logged the information into an XML file. This information would then be imported elsewhere. However, this sort of application can be used for motives I don't agree with, so I never uploaded it.
I would recommend using RSS feeds to extract content.
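Where a feed exists, that really is the most stable option. A tiny SimpleXML sketch, assuming an RSS 2.0 feed at a placeholder URL:

    <?php
    $rss = simplexml_load_file('http://www.example.com/feed.rss'); // placeholder
    foreach ($rss->channel->item as $item) {
        echo $item->title, ' - ', $item->link, "\n";
    }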
I think you need to implement something like a spider. You can make an HTTP request to get the content and then parse it.

advice on php parsing

I am using phpTumblr, a wrapper around the Tumblr blog API that allows you to access posts via PHP.
I want the site to display new posts dynamically, so I am using PHP to write the HTML. I find myself writing things like print(blablabla); or print(); ... and so on, and setting the header of the document to text/html so that the browser reads it as HTML.
This just seems to me like a kind of ugly hack, and I was wondering whether most dynamic pages are set up this way, or whether there are different ways to convert PHP objects (say, arrays) automatically into HTML tags. So far it doesn't seem like there are any. Maybe I have to be using some CMS software?
Any advice would be great.
Thanks
I believe what you're describing is known as a template engine. It essentially separates the logic from the UI, and allows you to write dynamic pages without an excessive number of print or echo statements.
For PHP, I would recommend Smarty, but Google can also help you find alternatives if you decide you don't like it.
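A minimal Smarty sketch, assuming the library path and the template syntax of Smarty 3 (check smarty.net for the real setup):

    <?php
    require 'libs/Smarty.class.php'; // path is an assumption

    $smarty = new Smarty();
    $smarty->assign('posts', ['First post', 'Second post']);
    $smarty->display('posts.tpl');

    // posts.tpl would then contain something like:
    //   <ul>{foreach $posts as $post}<li>{$post}</li>{/foreach}</ul>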
PHP is a language in which you can do a lot of different things, and one of them is sending output to browsers. So if you want to print an array as HTML code, write a PHP function for it. PHP has NOTHING to do with HTML tags directly.
As the above post mentions, you can use the Smarty templating engine... BUT then you will need to learn the Smarty language to print the array :)
All scripting languages work this way. Let's say some xyz language supports a function called print_array_as_html($array)... observe that it is still just a function. That's the idea of functions/methods in a language: extend the functionality to get what you need.
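The print_array_as_html($array) above is hypothetical; here is one way you could write such a function yourself:

    <?php
    // Turn a PHP array into an HTML list; escaping is handled explicitly.
    function print_array_as_html(array $items) {
        echo "<ul>\n";
        foreach ($items as $item) {
            echo '<li>', htmlspecialchars($item), "</li>\n";
        }
        echo "</ul>\n";
    }

    header('Content-Type: text/html');
    print_array_as_html(['first post', 'second post']);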

How best to search a website and retrieve data in PHP?

Trying to learn some more PHP. Here is what I'm after.
Essentially, I would like to search a website and return data to my own website.
Add a few keywords to a form.
Use those keywords to query a website such as monster.com for results that match the keywords entered.
Grab that data and return it to my own website.
How hard is something like this? I acknowledge the above outline is oversimplified but any tips you can offer are much appreciated.
If you're querying a site that has an API designated for this kind of functionality, you're on easy street. Just call the API's appropriate search function and you're all set.
If the site you're querying doesn't have an API, you still might be able to search the site with an HTTP GET using the right parameters. Then you just need to scrape through the response for the search results with your script and a few regex functions.
Here's a little tutorial on screen scraping with PHP. Hopefully that will be of some help to you. The trouble with this is that in general if the site hasn't made it easy to access their data, they might not want you to do this.
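A rough sketch of that no-API route; both the URL and the regex pattern are assumptions, so inspect the real site's markup first:

    <?php
    $query = http_build_query(['q' => 'php developer']); // your keywords
    $html  = file_get_contents('http://www.example.com/search?' . $query);

    // Suppose each result title sits in <h2 class="title">...</h2>.
    if (preg_match_all('#<h2 class="title">(.*?)</h2>#s', $html, $matches)) {
        foreach ($matches[1] as $title) {
            echo strip_tags($title), "\n";
        }
    }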
Enter Yahoo Query Language (YQL). It's a service that lets you use things like XPath to get data from websites and put it into an easy-to-use XML or JSON format. The language is structured similarly to SQL (hence the name).
I've used it to build RSS feeds for sites that didn't have them, and it was pretty easy to learn.
http://developer.yahoo.com/yql/

screen scraping technique using php

How do I screen scrape a particular website? I need to log in to the website and then scrape the inner information.
How could this be done?
Please guide me.
Duplicate: How to implement a web scraper in PHP?
Zend_Http_Client and Zend_Dom_Query
You want to look at the curl functions - they will let you fetch a page from another website. You can use cookies or HTTP authentication to log in first and then get the page you want, depending on the site you're logging in to.
Once you have the page, you're probably best off using regular expressions to scrape the data you want.
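A sketch of that log-in-then-scrape flow with curl; the login URL and form field names are assumptions, so check the site's actual login form:

    <?php
    $cookieFile = tempnam(sys_get_temp_dir(), 'cookies');

    // Step 1: POST the login form and store the session cookie.
    $ch = curl_init('http://www.example.com/login'); // placeholder URL
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query([
            'username' => 'me',      // field names are assumptions
            'password' => 'secret',
        ]),
        CURLOPT_COOKIEJAR      => $cookieFile, // write cookies here
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
    ]);
    curl_exec($ch);
    curl_close($ch);

    // Step 2: request the protected page, sending the stored cookie.
    $ch = curl_init('http://www.example.com/members/data'); // placeholder URL
    curl_setopt_array($ch, [
        CURLOPT_COOKIEFILE     => $cookieFile, // read cookies from here
        CURLOPT_RETURNTRANSFER => true,
    ]);
    $page = curl_exec($ch);
    curl_close($ch);

    echo $page; // now scrape $page with your parser of choice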
You should look at curl.
You might also want to take a look at BeautifulSoup, a Python library which is supposed to be very good at making bad HTML parseable. It is aimed at things like screen scraping.
How easy it would be to call from PHP, I don't know though.
You could also check out http://php.net/dom
Curl, and once you're in, use the QueryPath PHP library (querypath.org).
You can access DOM elements just like in jQuery, via CSS selectors;
there's method chaining...
Way better than just using PHP's native XML functions.
It also works as a Drupal extension, but I suppose you could use it in any PHP project.
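A hedged taste of QueryPath's chaining (the include path and the selectors are assumptions; see querypath.org for the real docs):

    <?php
    require 'QueryPath/QueryPath.php'; // path is an assumption

    // Page title:
    echo qp('http://www.example.com', 'title')->text(), "\n";

    // The href of the first link inside the content area:
    echo qp('http://www.example.com', 'div#content a')->attr('href'), "\n";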

Any PHP -> jQuery libraries out there?

Have any bridge libraries been developed for PHP that provide access to the jQuery framework? Ideally it would be nice to have something fairly extensible, so that creating jQuery-based content from PHP code would be easy and customizable. Does such a thing exist yet?
pquery
jqpie
jquery-php
That's a list to get you started.
So far I've found one that seems to fit the description. I haven't tried it out yet, so if anyone has any feedback or experience with this or other ones don't hesitate to post!
PQuery
jQPie might be what you're after.
What can jQPie do?
Easily request and process data from php using $.getJSON
Inject php generated html into elements using $(element).load
Call php functions directly from your web pages using $.jqpie
Call jQuery from php in respond to $.jqpie calls
Advanced autocomplete using jqpie_complete
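The $.getJSON half of that list doesn't need a special bridge at all; a plain PHP endpoint that emits JSON will do. A sketch (the file name and data are made up):

    <?php
    // search.php - a hypothetical endpoint for $.getJSON to call.
    header('Content-Type: application/json');

    $results = ['first match', 'second match']; // whatever your PHP computed
    echo json_encode($results);

    // On the page, jQuery would fetch it with something like:
    //   $.getJSON('search.php', function (data) { console.log(data); });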
QueryPath (http://querypath.org) is a full implementation of the DOM/XML/HTML part of jQuery. QueryPath has full CSS 3 selector support (including the stuff jQuery doesn't have, like XML namespace support). It also comes with DB tools, where you can run queries and have the results inserted into the query object. And it has a template engine, too. Like jQuery, you can write custom extensions very easily.
But it definitely takes advantage of its server-side status.
The main project page is at https://fedorahosted.org/querypath. You can download it there (and see lots of examples, including RSS and SVG manipulation).
Integrating with jQuery, then, can be done easily by sending XML data of many sorts down to jQuery. (You could probably send JSON, too... never tried.) And since the server-side code and the client-side code both look the same, there's less of a need to learn two totally different toolkits.
