Importing/scraping page content from other sites? - php

I've been playing with PHP, http://www.alchemyapi.com/, and embed.ly,
but I was wondering if there are other options out there to import and parse a webpage, any page, whether it is a news site or a blog...
Thanks

To fetch the data: cURL, file_get_contents (there may be others, but those are the two common ones)
To parse the data: PHP's DOM, SimpleXML, preg_match**
Since the question was tagged with PHP, I only gave working information for PHP. There are tons of ways to do this; if you can narrow your question down to what you are trying to do, it would help. The better way to parse any site is through its RSS feed, if it has one, or through its API, assuming it offers the content you want via RSS/API.
** preg_match is not a great alternative. It does "work", but it is better to use the DOM / SimpleXML functions if possible.
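A minimal sketch of the fetch-then-parse approach described above, assuming the cURL and DOM extensions are enabled. The function names fetch_html and extract_links and the example.com URL are illustrative and not part of the original answer.

```php
<?php
// Fetch a page with cURL; returns the body as a string, or null on failure.
function fetch_html(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds
    ]);
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Parse HTML with DOMDocument and collect the href of every <a> element.
function extract_links(string $html): array
{
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $links[] = $a->getAttribute('href');
        }
    }
    return $links;
}

// Hypothetical usage; example.com stands in for the site you want to read.
// $html = fetch_html('http://www.example.com/');
// if ($html !== null) { print_r(extract_links($html)); }
```

Keeping the fetch and the parse in separate functions means the parser can be tried on a saved HTML string without touching the network.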

I wrote a crawler at work using cURL and preg_match.
Before I chose to do it that way, I had looked at DOM parsers: http://php.net/manual/en/book.dom.php

Related

Data scraping based on search engines

Is it possible to scrape the web based on keywords using search engines in PHP?
For example, when someone enters a keyword, the script would search Google, fetch the result pages, and scrape/extract the lines that include the matched keywords?
Any idea or library to refer to?
You can do that using the Google API https://developers.google.com/custom-search/json-api/v1/overview and a related PHP client https://github.com/google/google-api-php-client.
Later on you need to write a web scraper to download the websites (curl) and parse the HTML with a parser (e.g. https://github.com/paquettg/php-html-parser).
I would, however, not recommend PHP for the latter task. There are much more sophisticated scraping tools available for Python (e.g. BeautifulSoup or Scrapy) that will make your life much MUCH easier than using PHP.
You can use the PHP function
file_get_contents('web url goes here');
For example: file_get_contents('http://www.google.com');
That function will get the HTML returned from the URL; then you can use XPath to extract the HTML elements that hold the data you want.
You can see an example and more explanation at the URL below.
https://gist.github.com/anchetaWern/6150297
I personally have done something similar to what you describe, but in Ruby on Rails; you can explore the project here.
https://github.com/dvarun/gextract
The XPath that I used is here:
https://github.com/dvarun/gextract/blob/master/app/jobs/fetch_keyword_job.rb
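As a rough PHP counterpart to the XPath step described in this answer, the sketch below returns the text of every node matching an expression. The helper name xpath_texts, the URL, and the sample expression are assumptions for illustration only.

```php
<?php
// Return the trimmed text of every node matching an XPath expression.
function xpath_texts(string $html, string $expression): array
{
    $previous = libxml_use_internal_errors(true); // tolerate invalid markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $texts = [];
    foreach ($xpath->query($expression) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Hypothetical usage: keep only the paragraphs that mention the keyword.
// $html = file_get_contents('http://www.example.com/');
// $hits = xpath_texts($html, '//p[contains(., "keyword")]');
```

Note that XPath's contains() is case-sensitive, so a real keyword search would probably need to normalise case first.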

How to Parse a Web page using PHP?

I am learning PHP. I have learned some basics. Now I am eager to learn web page parsing.
I want to parse this page http://www.icc-cricket.com/rankings/team-rankings/test
I want to parse just this:
Rank Team Matches Points Rating
1 South Africa 24 3240 135
I would recommend the Symfony2 DomCrawler component http://symfony.com/doc/current/components/dom_crawler.html
If you know basic PHP, I would recommend using this library: http://simplehtmldom.sourceforge.net/
It's simple to use.
You could have a look at http://simplehtmldom.sourceforge.net/ which allows you to parse HTML pages rather easily.
That said, one should always first check whether the service offers feeds, because parsing them is less error-prone and more efficient, and feeds (usually) don't change much. HTML markup can change over time, causing your DOM query to become invalid.
It seems that those scores are added to the page via AJAX, so you cannot parse this link directly to get your rankings. It seems that the request is sent to
http://cma.icc-cricket.com/api/getRankings?callback=onRankings&_1375776810417=
So you need to make a similar request and then process the data.
Result from the URL:
onRankings([{"matchType":"TEST","rankings":[{"position":"1","team":{"fullName":"South Africa","abbreviation":"SA"},"qfyMatches":"0","played":"24","points":"3240","rating":"135"},{"position":"2","team":{"fullName":"India","abbreviation":"IND"},"qfyMatches":"0","played":"30","points":"3473","rating":"116"},{"position":"3","team":{"fullName":"England","abbreviation":"ENG"},"qfyMatches":"0","played":"32","points":"3577","rating":"112"},{"position":"4","team":{"fullName":"Australia","abbreviation":"AUS"},"qfyMatches":"0","played":"27","points":"2846","rating":"105"},{"position":"5","team":{"fullName":"Pakistan","abbreviation":"PAK"},"qfyMatches":"0","played":"19","points":"1947","rating":"102"},{"position":"6","team":{"fullName":"West Indies","abbreviation":"WI"},"qfyMatches":"0","played":"22","points":"2168","rating":"99"},{"position":"7","team":{"fullName":"Sri Lanka","abbreviation":"SL"},"qfyMatches":"0","played":"26","points":"2295","rating":"88"},{"position":"8","team":{"fullName":"New Zealand","abbreviation":"NZ"},"qfyMatches":"0","played":"27","points":"2126","rating":"79"},{"position":"9","team":{"fullName":"Bangladesh","abbreviation":"BAN"},"qfyMatches":"0","played":"13","points":"135","rating":"10"}]},{"matchType":"ODI","rankings":[{"position":"1","team":{"fullName":"India","abbreviation":"IND"},"qfyMatches":"0","played":"48","points":"5906","rating":"123"},{"position":"2","team":{"fullName":"Australia","abbreviation":"AUS"},"qfyMatches":"0","played":"34","points":"3861","rating":"114"},{"position":"3","team":{"fullName":"England","abbreviation":"ENG"},"qfyMatches":"0","played":"38","points":"4257","rating":"112"},{"position":"4","team":{"fullName":"Sri Lanka","abbreviation":"SL"},"qfyMatches":"0","played":"49","points":"5435","rating":"111"},{"position":"5","team":{"fullName":"South 
Africa","abbreviation":"SA"},"qfyMatches":"0","played":"34","points":"3584","rating":"105"},{"position":"6","team":{"fullName":"Pakistan","abbreviation":"PAK"},"qfyMatches":"0","played":"42","points":"4294","rating":"102"},{"position":"7","team":{"fullName":"New Zealand","abbreviation":"NZ"},"qfyMatches":"0","played":"29","points":"2593","rating":"89"},{"position":"8","team":{"fullName":"West Indies","abbreviation":"WI"},"qfyMatches":"0","played":"41","points":"3639","rating":"89"},{"position":"9","team":{"fullName":"Bangladesh","abbreviation":"BAN"},"qfyMatches":"0","played":"23","points":"1754","rating":"76"},{"position":"10","team":{"fullName":"Zimbabwe","abbreviation":"ZIM"},"qfyMatches":"0","played":"23","points":"1205","rating":"52"},{"position":"11","team":{"fullName":"Ireland","abbreviation":"IRE"},"qfyMatches":"0","played":"10","points":"394","rating":"39"},{"position":"12","team":{"fullName":"Netherlands","abbreviation":"NL"},"qfyMatches":"0","played":"7","points":"88","rating":"13"},{"position":"13","team":{"fullName":"Kenya","abbreviation":"KEN"},"qfyMatches":"0","played":"4","points":"40","rating":"10"}]},{"matchType":"T20I","rankings":[{"position":"1","team":{"fullName":"Sri Lanka","abbreviation":"SL"},"qfyMatches":"20","played":"16","points":"2003","rating":"125"},{"position":"2","team":{"fullName":"Pakistan","abbreviation":"PAK"},"qfyMatches":"31","played":"21","points":"2599","rating":"124"},{"position":"3","team":{"fullName":"India","abbreviation":"IND"},"qfyMatches":"18","played":"14","points":"1689","rating":"121"},{"position":"5","team":{"fullName":"South Africa","abbreviation":"SA"},"qfyMatches":"24","played":"18","points":"2158","rating":"120"},{"position":"4","team":{"fullName":"West 
Indies","abbreviation":"WI"},"qfyMatches":"22","played":"17","points":"2041","rating":"120"},{"position":"6","team":{"fullName":"England","abbreviation":"ENG"},"qfyMatches":"26","played":"19","points":"2148","rating":"113"},{"position":"7","team":{"fullName":"Australia","abbreviation":"AUS"},"qfyMatches":"23","played":"17","points":"1753","rating":"103"},{"position":"8","team":{"fullName":"New Zealand","abbreviation":"NZ"},"qfyMatches":"25","played":"19","points":"1937","rating":"102"},{"position":"unranked","team":{"fullName":"Afghanistan","abbreviation":"AFG"},"qfyMatches":"7","played":"6","points":"525","rating":"88"},{"position":"9","team":{"fullName":"Ireland","abbreviation":"IRE"},"qfyMatches":"12","played":"7","points":"568","rating":"81"},{"position":"10","team":{"fullName":"Bangladesh","abbreviation":"BAN"},"qfyMatches":"14","played":"10","points":"739","rating":"74"},{"position":"11","team":{"fullName":"Scotland","abbreviation":"Sco"},"qfyMatches":"9","played":"7","points":"435","rating":"62"},{"position":"12","team":{"fullName":"Zimbabwe","abbreviation":"ZIM"},"qfyMatches":"14","played":"10","points":"478","rating":"48"},{"position":"13","team":{"fullName":"Netherlands","abbreviation":"NL"},"qfyMatches":"8","played":"5","points":"181","rating":"36"},{"position":"14","team":{"fullName":"Kenya","abbreviation":"KEN"},"qfyMatches":"11","played":"9","points":"309","rating":"34"},{"position":"unranked","team":{"fullName":"Canada","abbreviation":"CAN"},"qfyMatches":"6","played":"4","points":"24","rating":"6"}]}]);
But if you just want to learn HTML parsing, then you can also use Ganon.
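For the JSONP response quoted above, one possible way to process it is to strip the onRankings(...) wrapper and pass the rest to json_decode. This is only a sketch: the helper names decode_jsonp and rankings_for are made up here, and the sample string is a shortened copy of the quoted response rather than a live request.

```php
<?php
// Strip a JSONP wrapper such as onRankings(...); and decode the JSON inside.
function decode_jsonp(string $jsonp): ?array
{
    $start = strpos($jsonp, '(');
    $end   = strrpos($jsonp, ')');
    if ($start === false || $end === false || $end < $start) {
        return null; // not a JSONP string
    }
    $json = substr($jsonp, $start + 1, $end - $start - 1);
    $data = json_decode($json, true);
    return is_array($data) ? $data : null;
}

// Pick the ranking rows for one match type ("TEST", "ODI" or "T20I").
function rankings_for(array $data, string $matchType): array
{
    foreach ($data as $block) {
        if (($block['matchType'] ?? null) === $matchType) {
            return $block['rankings'] ?? [];
        }
    }
    return [];
}

// Shortened sample in the same shape as the response quoted above.
$sample = 'onRankings([{"matchType":"TEST","rankings":[{"position":"1",'
        . '"team":{"fullName":"South Africa","abbreviation":"SA"},'
        . '"qfyMatches":"0","played":"24","points":"3240","rating":"135"}]}]);';

foreach (rankings_for(decode_jsonp($sample) ?? [], 'TEST') as $row) {
    // Prints: 1 South Africa 24 3240 135
    echo $row['position'], ' ', $row['team']['fullName'], ' ', $row['played'],
         ' ', $row['points'], ' ', $row['rating'], PHP_EOL;
}
```

In a real script the sample string would be replaced by the body fetched from the rankings URL, for example with file_get_contents or cURL.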
In my view it is not possible to parse the page directly, because that table is appended through AJAX calls.
We can see an empty tag like this:
<section class="standings"></section>
If I have this all wrong, please correct me.
Thanks

Using PHP to retrieve information from a different site

I was wondering if there's a way to use PHP (or any other server-side or even client-side [if possible] language) to obtain certain pieces of information from a different website (NOT a local file like the include 'nav.php').
What I mean is this: say I have a blog at www.blog.com and I have another website at www.mysite.com.
Is there a way to gather ALL of the h2 links from www.blog.com and put them in a div in www.mysite.com?
Also, is there a way I could grab all of the content inside a DIV (with an ID, of course) from blog.com and insert it in mysite.com?
Thanks,
Amit
First of all, if you want to retrieve content from a blog, check whether the blog platform (e.g. Blogger, WordPress) has an API, so that you won't have to reinvent the wheel. Usually, good APIs come with good documentation (meaning that probably 5% of all APIs are good APIs), and this documentation should come with code examples for top languages such as PHP, JavaScript, Java, etc. Once again, if the goal is to retrieve content from a blog, there should be tons of frameworks available for you.
Check out the PHP Simple HTML DOM library
Can be as easy as:
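include 'simple_html_dom.php'; // the library file that provides file_get_html()
// Create DOM from URL or file
$html = file_get_html('http://www.otherwebsite.com/');
// Find all h2 headings and print their text
foreach ($html->find('h2') as $element)
    echo $element->plaintext, "\n";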
This can be done by opening the remote website as a file, then taking the HTML and using the DOM parser to manipulate it.
$site_html = file_get_contents('http://www.example.com/');
$document = new DOMDocument();
$document->loadHTML($site_html);
$all_of_the_h2_tags = $document->getElementsByTagName('h2');
Read more about PHP's DOM functions for what to do from here, such as grabbing other tags, creating new HTML out of bits and pieces of the DOM, and displaying that on your own site.
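To make the two requests in the question concrete (all links inside h2 headings, and the contents of a DIV with a given ID), here is a sketch built on DOMDocument and DOMXPath. The function names and the id 'post' are hypothetical, and the code assumes the remote HTML has already been fetched into a string.

```php
<?php
// Load possibly invalid HTML into a DOMDocument without emitting warnings.
function load_html(string $html): DOMDocument
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

// Collect the text and href of every link found inside an <h2>.
function h2_links(DOMDocument $doc): array
{
    $links = [];
    foreach ((new DOMXPath($doc))->query('//h2//a[@href]') as $a) {
        $links[] = [
            'text' => trim($a->textContent),
            'href' => $a->getAttribute('href'),
        ];
    }
    return $links;
}

// Return the inner HTML of the element with the given id, or null if absent.
function inner_html_by_id(DOMDocument $doc, string $id): ?string
{
    foreach ((new DOMXPath($doc))->query('//*[@id]') as $node) {
        if ($node->getAttribute('id') === $id) {
            $html = '';
            foreach ($node->childNodes as $child) {
                $html .= $doc->saveHTML($child); // serialise each child node
            }
            return $html;
        }
    }
    return null;
}

// Hypothetical usage on mysite.com:
// $doc = load_html(file_get_contents('http://www.blog.com/'));
// foreach (h2_links($doc) as $link) {
//     echo '<a href="', htmlspecialchars($link['href']), '">',
//          htmlspecialchars($link['text']), '</a><br>';
// }
// echo inner_html_by_id($doc, 'post');
```

Relative hrefs copied from the other site would still need to be resolved against that site's base URL before they work on your own pages.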
Your first step would be to use cURL to make a request to the other site and bring down the HTML of the page you want to access. Then comes the part of parsing the HTML to find all the content you're looking for. One could use a bunch of regular expressions, and you could probably get the job done, but the Stack Overflow crew might frown at you. You could also take the resulting HTML and use the DOMDocument object and its loadHTML method to parse the HTML and load the content you want.
Also, if you control both sites, you can set up a special page on the first site (www.blog.com) with exactly the information you need, properly formatted either in HTML you can output directly, or XML that you can manipulate more easily from www.mysite.com.

Multiple REST Feeds to MySQL Database - Using PHP

I've got a number of REST feeds I'd like to store in a MySQL database. Can anyone suggest a solution for this? Something PHP-related would be appreciated.
It's not PHP-related, but Perl has both a REST interface and a DBI interface (for interfacing with MySQL).
http://metacpan.org/pod/WWW::REST
There are many other REST interfaces for Google, Twitter, etc. Just search CPAN modules at search.cpan.org
To my knowledge there is no such thing as a REST feed. There are RSS feeds and Atom feeds, so I will assume you are talking about one of those.
Both are based on XML, so I suggest you find an XML parser for PHP, do an HTTP request to get the feed contents, parse the XML into a DOM, and then copy the DOM data into MySQL!
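The steps above could be sketched roughly as follows, assuming an RSS 2.0 feed and the SimpleXML and PDO extensions. The feed_items table, its columns, the DSN, the credentials, and the feed URL are all placeholders invented for this example.

```php
<?php
// Turn an RSS 2.0 document into a list of plain arrays.
function parse_rss_items(string $xml): array
{
    $feed = simplexml_load_string($xml);
    if ($feed === false || !isset($feed->channel)) {
        return []; // not parseable as RSS 2.0
    }
    $items = [];
    foreach ($feed->channel->item as $item) {
        $items[] = [
            'title'     => (string) $item->title,
            'link'      => (string) $item->link,
            'published' => (string) $item->pubDate, // kept as the raw date string
        ];
    }
    return $items;
}

// Insert the items through PDO. The feed_items table and its columns are
// assumptions; adapt them to your own schema.
function store_items(PDO $pdo, array $items): void
{
    $stmt = $pdo->prepare(
        'INSERT INTO feed_items (title, link, published)
         VALUES (:title, :link, :published)'
    );
    foreach ($items as $item) {
        $stmt->execute($item); // array keys match the named placeholders
    }
}

// Hypothetical usage; every value below is a placeholder.
// $pdo = new PDO('mysql:host=localhost;dbname=feeds;charset=utf8mb4', 'user', 'secret');
// $xml = file_get_contents('http://www.example.com/feed.xml');
// store_items($pdo, parse_rss_items($xml));
```

An Atom feed has a different layout (entry elements directly under feed), so it would need its own parsing function.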
I'm not sure how to be more precise.
Are you looking for someone to write the code?
OK, I'm assuming you are talking about "RSS" feeds. Here's a great open-source library that makes it easy: http://simplepie.org/ . Point it at an RSS or Atom feed, and it will give you back PHP arrays and objects. From there you can interpret them and save them any way you want.
Depending on what you actually want to do with the database, you could store the RSS itself as an XML CLOB. Not fast, but easy. Again, it totally depends on what you want to do with the database.

screen scraping technique using php

How do I screen scrape a particular website? I need to log in to a website and then scrape the inner information.
How could this be done?
Please guide me.
Duplicate: How to implement a web scraper in PHP?
Zend_Http_Client and Zend_Dom_Query
You want to look at the curl functions - they will let you get a page from another website. You can use cookies or HTTP authentication to log in first then get the page you want, depending on the site you're logging in to.
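A sketch of the cookie-based variant described above, assuming the site uses an ordinary HTML login form. The URLs and form field names are hypothetical and must be taken from the real login form; sites that add CSRF tokens or JavaScript checks would need extra steps.

```php
<?php
// Options for the login request: POST the form fields to the login URL.
function login_options(string $loginUrl, array $credentials): array
{
    return [
        CURLOPT_URL        => $loginUrl,
        CURLOPT_POST       => true,
        CURLOPT_POSTFIELDS => http_build_query($credentials),
    ];
}

// Options for fetching a protected page with a plain GET.
function page_options(string $url): array
{
    return [
        CURLOPT_URL     => $url,
        CURLOPT_HTTPGET => true, // switch the handle back from POST to GET
    ];
}

// Log in, then fetch the target page on the same handle so that the session
// cookie set by the login response is sent with the second request.
function login_and_fetch(string $loginUrl, array $credentials, string $targetUrl): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE     => '', // empty string enables in-memory cookies
    ]);

    curl_setopt_array($ch, login_options($loginUrl, $credentials));
    if (curl_exec($ch) === false) {
        return null; // login request failed at the transport level
    }

    curl_setopt_array($ch, page_options($targetUrl));
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Hypothetical usage; field names depend on the site's login form.
// $html = login_and_fetch(
//     'http://www.example.com/login',
//     ['username' => 'me', 'password' => 'secret'],
//     'http://www.example.com/members/report'
// );
```

A transport-level success does not prove the login worked, so it is worth checking the returned HTML for something only logged-in users see.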
Once you have the page, you're probably best off using regular expressions to scrape the data you want.
You should look at curl.
You might also want to take a look at BeautifulSoup, a Python library that is supposed to be very good at making bad HTML parseable. It is aimed at things like screen scraping.
How easy it would be to call from PHP, I don't know.
You could also check out http://php.net/dom
cURL, and once you're in, use the QueryPath PHP library (querypath.org).
You can access DOM elements just like in jQuery, via CSS selectors,
there's method chaining...
Way better than just using PHP's native XML functions.
It also works as a Drupal extension, but I suppose you could use it in any PHP project.
