How to parse a web page using PHP? - php

I am learning PHP. I have learned some basics and am now eager to learn web page parsing.
I want to parse this page: http://www.icc-cricket.com/rankings/team-rankings/test
I want to extract only this part:
Rank Team Matches Points Rating
1 South Africa 24 3240 135

I would recommend the Symfony2 DomCrawler component: http://symfony.com/doc/current/components/dom_crawler.html

If you know basic PHP, I would recommend using this library: http://simplehtmldom.sourceforge.net/
It's simple to use.

You could have a look at http://simplehtmldom.sourceforge.net/ which allows you to parse HTML pages rather easily.
That said, you should always first check whether the service offers feeds, because parsing those is less error-prone, more efficient, and their format (usually) doesn't change much. HTML markup can change over time, causing your DOM query to become invalid.

It seems that those scores are loaded into the page via AJAX, so you cannot parse this link directly to get your rankings. The request appears to be sent to
http://cma.icc-cricket.com/api/getRankings?callback=onRankings&_1375776810417=
So you need to make a similar request and then process the data.
Result from url:
onRankings([{"matchType":"TEST","rankings":[{"position":"1","team":{"fullName":"South Africa","abbreviation":"SA"},"qfyMatches":"0","played":"24","points":"3240","rating":"135"},{"position":"2","team":{"fullName":"India","abbreviation":"IND"},"qfyMatches":"0","played":"30","points":"3473","rating":"116"},{"position":"3","team":{"fullName":"England","abbreviation":"ENG"},"qfyMatches":"0","played":"32","points":"3577","rating":"112"},{"position":"4","team":{"fullName":"Australia","abbreviation":"AUS"},"qfyMatches":"0","played":"27","points":"2846","rating":"105"},{"position":"5","team":{"fullName":"Pakistan","abbreviation":"PAK"},"qfyMatches":"0","played":"19","points":"1947","rating":"102"},{"position":"6","team":{"fullName":"West Indies","abbreviation":"WI"},"qfyMatches":"0","played":"22","points":"2168","rating":"99"},{"position":"7","team":{"fullName":"Sri Lanka","abbreviation":"SL"},"qfyMatches":"0","played":"26","points":"2295","rating":"88"},{"position":"8","team":{"fullName":"New Zealand","abbreviation":"NZ"},"qfyMatches":"0","played":"27","points":"2126","rating":"79"},{"position":"9","team":{"fullName":"Bangladesh","abbreviation":"BAN"},"qfyMatches":"0","played":"13","points":"135","rating":"10"}]},{"matchType":"ODI","rankings":[{"position":"1","team":{"fullName":"India","abbreviation":"IND"},"qfyMatches":"0","played":"48","points":"5906","rating":"123"},{"position":"2","team":{"fullName":"Australia","abbreviation":"AUS"},"qfyMatches":"0","played":"34","points":"3861","rating":"114"},{"position":"3","team":{"fullName":"England","abbreviation":"ENG"},"qfyMatches":"0","played":"38","points":"4257","rating":"112"},{"position":"4","team":{"fullName":"Sri Lanka","abbreviation":"SL"},"qfyMatches":"0","played":"49","points":"5435","rating":"111"},{"position":"5","team":{"fullName":"South 
Africa","abbreviation":"SA"},"qfyMatches":"0","played":"34","points":"3584","rating":"105"},{"position":"6","team":{"fullName":"Pakistan","abbreviation":"PAK"},"qfyMatches":"0","played":"42","points":"4294","rating":"102"},{"position":"7","team":{"fullName":"New Zealand","abbreviation":"NZ"},"qfyMatches":"0","played":"29","points":"2593","rating":"89"},{"position":"8","team":{"fullName":"West Indies","abbreviation":"WI"},"qfyMatches":"0","played":"41","points":"3639","rating":"89"},{"position":"9","team":{"fullName":"Bangladesh","abbreviation":"BAN"},"qfyMatches":"0","played":"23","points":"1754","rating":"76"},{"position":"10","team":{"fullName":"Zimbabwe","abbreviation":"ZIM"},"qfyMatches":"0","played":"23","points":"1205","rating":"52"},{"position":"11","team":{"fullName":"Ireland","abbreviation":"IRE"},"qfyMatches":"0","played":"10","points":"394","rating":"39"},{"position":"12","team":{"fullName":"Netherlands","abbreviation":"NL"},"qfyMatches":"0","played":"7","points":"88","rating":"13"},{"position":"13","team":{"fullName":"Kenya","abbreviation":"KEN"},"qfyMatches":"0","played":"4","points":"40","rating":"10"}]},{"matchType":"T20I","rankings":[{"position":"1","team":{"fullName":"Sri Lanka","abbreviation":"SL"},"qfyMatches":"20","played":"16","points":"2003","rating":"125"},{"position":"2","team":{"fullName":"Pakistan","abbreviation":"PAK"},"qfyMatches":"31","played":"21","points":"2599","rating":"124"},{"position":"3","team":{"fullName":"India","abbreviation":"IND"},"qfyMatches":"18","played":"14","points":"1689","rating":"121"},{"position":"5","team":{"fullName":"South Africa","abbreviation":"SA"},"qfyMatches":"24","played":"18","points":"2158","rating":"120"},{"position":"4","team":{"fullName":"West 
Indies","abbreviation":"WI"},"qfyMatches":"22","played":"17","points":"2041","rating":"120"},{"position":"6","team":{"fullName":"England","abbreviation":"ENG"},"qfyMatches":"26","played":"19","points":"2148","rating":"113"},{"position":"7","team":{"fullName":"Australia","abbreviation":"AUS"},"qfyMatches":"23","played":"17","points":"1753","rating":"103"},{"position":"8","team":{"fullName":"New Zealand","abbreviation":"NZ"},"qfyMatches":"25","played":"19","points":"1937","rating":"102"},{"position":"unranked","team":{"fullName":"Afghanistan","abbreviation":"AFG"},"qfyMatches":"7","played":"6","points":"525","rating":"88"},{"position":"9","team":{"fullName":"Ireland","abbreviation":"IRE"},"qfyMatches":"12","played":"7","points":"568","rating":"81"},{"position":"10","team":{"fullName":"Bangladesh","abbreviation":"BAN"},"qfyMatches":"14","played":"10","points":"739","rating":"74"},{"position":"11","team":{"fullName":"Scotland","abbreviation":"Sco"},"qfyMatches":"9","played":"7","points":"435","rating":"62"},{"position":"12","team":{"fullName":"Zimbabwe","abbreviation":"ZIM"},"qfyMatches":"14","played":"10","points":"478","rating":"48"},{"position":"13","team":{"fullName":"Netherlands","abbreviation":"NL"},"qfyMatches":"8","played":"5","points":"181","rating":"36"},{"position":"14","team":{"fullName":"Kenya","abbreviation":"KEN"},"qfyMatches":"11","played":"9","points":"309","rating":"34"},{"position":"unranked","team":{"fullName":"Canada","abbreviation":"CAN"},"qfyMatches":"6","played":"4","points":"24","rating":"6"}]}]);
But if you just want to learn HTML parsing, you can also use Ganon.
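As a minimal sketch of the "process data" step, assuming the endpoint still answers with JSONP in the shape shown above (it may have changed since): strip the onRankings( ... ); wrapper, decode the JSON, and pick out the TEST rankings. The helper names decode_jsonp and rankings_for are hypothetical, and the sample string is a shortened copy of the response above; a real script would download it first with cURL or file_get_contents.

```php
<?php
// Hypothetical helper: strip a JSONP wrapper such as onRankings([...]);
// and decode the JSON payload inside it into PHP arrays.
function decode_jsonp(string $jsonp): array
{
    $start = strpos($jsonp, '(');
    $end = strrpos($jsonp, ')');
    if ($start === false || $end === false || $end < $start) {
        throw new InvalidArgumentException('Not a JSONP response');
    }
    $json = substr($jsonp, $start + 1, $end - $start - 1);
    $data = json_decode($json, true);
    if (!is_array($data)) {
        throw new InvalidArgumentException('Could not decode JSON payload');
    }
    return $data;
}

// Hypothetical helper: collect [rank, team, matches, points, rating]
// rows for one match type ("TEST", "ODI" or "T20I").
function rankings_for(array $data, string $matchType): array
{
    $rows = [];
    foreach ($data as $group) {
        if (($group['matchType'] ?? null) !== $matchType) {
            continue;
        }
        foreach ($group['rankings'] as $r) {
            $rows[] = [
                $r['position'],
                $r['team']['fullName'],
                $r['played'],
                $r['points'],
                $r['rating'],
            ];
        }
    }
    return $rows;
}

// Shortened sample copied from the response shown above.
$sample = 'onRankings([{"matchType":"TEST","rankings":['
    . '{"position":"1","team":{"fullName":"South Africa","abbreviation":"SA"},'
    . '"qfyMatches":"0","played":"24","points":"3240","rating":"135"},'
    . '{"position":"2","team":{"fullName":"India","abbreviation":"IND"},'
    . '"qfyMatches":"0","played":"30","points":"3473","rating":"116"}'
    . ']}]);';

$rows = rankings_for(decode_jsonp($sample), 'TEST');
foreach ($rows as $row) {
    echo implode(' ', $row), PHP_EOL;
}
```

With the sample above this prints one line per team in the Rank Team Matches Points Rating order asked for in the question.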

In my view it is not possible to parse the page directly, because that table is added through AJAX calls.
We can see an empty tag like this:
<section class="standings"></section>
If I have this all wrong, please correct me
Thanks

Related

Fetching information from another website?

I'd like to fetch all latest news from this site (at the center board):
http://web.hanu.vn/en/
My latest approach was parsing the HTML with Simple HTML DOM Parser in PHP, but I think it's too slow. My idea is to fetch news from almost 20 sites similar to this one. They are all built with Moodle, so they have the same HTML format. However, a single site takes several seconds to fetch, so 20 sites require a lot of time.
Is there any better approach than parsing HTML? Or should I store the result in the database and update it after a period of time rather than fetching it for each user request? Is this what is called "crawling"?
"Or should I store the result in the database and after a period of time updating it rather than fetching it for each user request?"
Yes, you should. And stick to parsing HTML, do not use regular expressions for parsing HTML.
And what you are trying to do is web scraping, not yet crawling (unless you really crawl the pages).
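The "store the result and refresh it periodically" idea can be sketched roughly as follows. This is only an illustration under assumptions: it uses a file in the temp directory instead of a database table, cached_fetch is a hypothetical helper name, and the closure stands in for the slow download-and-parse step.

```php
<?php
// Hypothetical helper: return the cached content if it is younger than
// $ttl seconds; otherwise call $fetch, store its result and return it.
function cached_fetch(string $cacheFile, int $ttl, callable $fetch): string
{
    if (is_file($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
        return file_get_contents($cacheFile);
    }
    $content = $fetch();
    file_put_contents($cacheFile, $content, LOCK_EX);
    return $content;
}

// Demo cache location (an assumption; any writable path works).
$cacheFile = sys_get_temp_dir() . '/news_cache_example.html';
if (is_file($cacheFile)) {
    unlink($cacheFile); // start from a clean state for the demo
}

$calls = 0;
$fetch = function () use (&$calls): string {
    $calls++; // stands in for downloading and parsing the remote page
    return '<li>Sample headline</li>';
};

$first = cached_fetch($cacheFile, 600, $fetch);  // cache miss: runs $fetch
$second = cached_fetch($cacheFile, 600, $fetch); // cache hit: reads the file
echo 'fetch was called ', $calls, ' time(s)', PHP_EOL;
```

The same shape works with a database row and a timestamp column; a cron job can also refresh the cache in the background so that no user request ever pays for the fetch.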
I recommend downloading the page with cURL and processing it without regex; try to use substr, strpos, strip_tags and so on. Also store the latest news items in a database and update them with a cron job.
I'd recommend using regular expressions (Wikipedia).
Also, it is a very good idea to strip some parts of the HTML data using the strpos and substr functions, which are faster than regular expressions.
And here is a nice regular expression tester.

Importing/scraping page content from other sites?

I've been playing with PHP and also http://www.alchemyapi.com/ and embed.ly,
but I was wondering if there are other options out there to import and parse a web page, any page, whether it is a news site or a blog...
Thanks
To fetch the data: curl, file_get_contents (there may be others; those are the two common ones)
To parse the data: PHP DOM, SimpleXML, preg_match**
Since the question was tagged PHP, I only gave working information for PHP. There are tons of ways to do this; if you can narrow your question down to what you are trying to do, it would help. The better way to parse any site is through its RSS feed if it has one, or through its API, assuming the content you want is offered via RSS/API.
** preg_match is not a great alternative; it does "work", but it is better to use the DOM / SimpleXML functions if possible.
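For the DOM route, a minimal sketch using PHP's built-in DOMDocument and DOMXPath classes might look like this. The helper name extract_links and the sample markup are made up for illustration; in practice the HTML string would come from curl or file_get_contents.

```php
<?php
// Hypothetical helper: return the text and href of every link in an HTML string.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings
    // internally instead of letting them be printed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = [
            'text' => trim($a->textContent),
            'href' => $a->getAttribute('href'),
        ];
    }
    return $links;
}

// Made-up sample page standing in for a downloaded news site or blog.
$html = '<html><body><h1>News</h1><ul>'
    . '<li><a href="/post/1">First post</a></li>'
    . '<li><a href="/post/2"> Second post </a></li>'
    . '<li>No link here</li>'
    . '</ul></body></html>';

$links = extract_links($html);
foreach ($links as $link) {
    echo $link['text'], ' -> ', $link['href'], PHP_EOL;
}
```

The XPath expression is the part to adapt per site, which is also the part that breaks when the markup changes.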
I wrote a crawler at work using cURL and preg_match
Before I chose to do it that way, I had looked at DOM Parsers http://php.net/manual/en/book.dom.php

Real time RSS display on web page (best practices and source codes)

I have a PHP script that parses an RSS feed and gives me the data in a known pattern. I'm very new to ASP, JavaScript and jQuery, so I have no idea how to auto-update the script and display the new data with a smooth animation (see this example, that's exactly what I want). Thanks for the support, and if you know a good script for this I would appreciate it.
Seems like you're looking for this:
http://leftlogic.com/lounge/articles/jquery_spy2/
It's PHP (not ASP), so that might be an issue, though the code is SUPER easy to implement (I've written my own implementation on three separate occasions).
The site itself has some decent documentation on getting things up and running, but if you need some extra help, comment and I'll point you in the right direction :)
Good luck!
The resources people have linked here are helpful and merely mentioning jQuery means you're probably headed in the right direction. But if you're new to this it might still be worth mentioning some of the concepts you'll be looking to play with here.
First of all, you'll probably want to stick with one language on the client side and one on the server side. This means choosing either PHP or ASP -- this isn't clear from your question but I'll assume you're dealing with PHP since that's the language I use for this kind of thing. JavaScript + jQuery is the right choice for the browser (client) side of things.
Like Luca points out, you'll have to set up some JavaScript code that goes live on page load and "polls" the server at a set interval. In JavaScript you do this using something called XMLHttpRequest (or "XHR"), and it's pretty complicated. You could use a combination of jQuery and a library like the one Matt points to in his answer, or just jQuery -- sample code abounds, but it's basically a loop with a function call and a sleep timer.
That function call is going to be one of the more difficult parts if you're trying to emulate the Twitter World Cup site. But here's the basic idea: You need to populate a list using jQuery and a data standard like JSON. Since the RSS feed you'll be parsing is written in XML, you'll have to write a server side (PHP/ASP) script that fetches, parses and converts the feed to JSON. In PHP, this is best done through cURL (file_get_contents() if you're lazy), SimpleXML and json_encode(), respectively.
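The server-side conversion described above could be sketched like this, assuming a plain RSS 2.0 feed with channel/item elements. The helper name rss_to_json and the inline sample feed are made up for illustration; in a real script the XML string would be downloaded with cURL or file_get_contents() first.

```php
<?php
// Hypothetical helper: convert an RSS 2.0 document (given as a string)
// into a JSON array of {title, link, pubDate} objects.
function rss_to_json(string $xml): string
{
    $feed = simplexml_load_string($xml);
    if ($feed === false) {
        throw new InvalidArgumentException('Invalid RSS XML');
    }
    $items = [];
    foreach ($feed->channel->item as $item) {
        $items[] = [
            'title' => (string) $item->title,
            'link' => (string) $item->link,
            'pubDate' => (string) $item->pubDate,
        ];
    }
    return json_encode($items);
}

// Made-up sample feed standing in for the downloaded RSS document.
$xml = '<?xml version="1.0"?>'
    . '<rss version="2.0"><channel><title>Example feed</title>'
    . '<item><title>First item</title><link>http://example.com/1</link>'
    . '<pubDate>Mon, 05 Aug 2013 10:00:00 GMT</pubDate></item>'
    . '<item><title>Second item</title><link>http://example.com/2</link>'
    . '<pubDate>Tue, 06 Aug 2013 10:00:00 GMT</pubDate></item>'
    . '</channel></rss>';

$json = rss_to_json($xml);
echo $json, PHP_EOL;
```

A script like this would typically send a Content-Type: application/json header before echoing, so the browser-side code can consume the response directly.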
Your JavaScript should load the list based on JSON. To do this, and display any new items, what you'll do is load the JSON from the client (browser) side using a jQuery method like getJSON(). Then you spin through the array object and add any new items to the list by adding new <li> elements to the "DOM." The same jQuery code that does this can easily also do the cross dissolve with something like fadeIn().
It looks like the script on that example page has an Ajax request running every N seconds.
You could simply have your PHP script return the RSS data (in JSON format, say) and let JavaScript parse it and generate some HTML with it.
If all of this doesn't make sense to you, I advise reading a little about JavaScript and PHP... there are plenty of good books.

screen scraping technique using php

How can I screen scrape a particular website? I need to log in to a website and then scrape the inner information.
How could this be done?
Please guide me.
Duplicate: How to implement a web scraper in PHP?
Zend_Http_Client and Zend_Dom_Query
You want to look at the curl functions - they will let you get a page from another website. You can use cookies or HTTP authentication to log in first then get the page you want, depending on the site you're logging in to.
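A rough sketch of the cookie-based variant, assuming the site uses an ordinary HTML login form and a session cookie. The function name fetch_after_login, the URLs and the form field names in the usage comment are all hypothetical; sites with CSRF tokens or JavaScript-driven logins need extra steps.

```php
<?php
// Hypothetical helper: submit a login form, then fetch a protected page
// on the same cURL handle so the session cookie is sent along.
function fetch_after_login(
    string $loginUrl,
    array $credentials,
    string $targetUrl,
    string $cookieFile
): string {
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,      // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,      // follow the redirect after login
        CURLOPT_COOKIEJAR => $cookieFile,    // write cookies here when the handle is freed
        CURLOPT_COOKIEFILE => $cookieFile,   // read cookies from here and enable the cookie engine
    ]);

    // Step 1: POST the credentials to the login form.
    curl_setopt($ch, CURLOPT_URL, $loginUrl);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($credentials));
    if (curl_exec($ch) === false) {
        throw new RuntimeException('Login request failed: ' . curl_error($ch));
    }

    // Step 2: GET the page that requires the session.
    curl_setopt($ch, CURLOPT_URL, $targetUrl);
    curl_setopt($ch, CURLOPT_HTTPGET, true);
    $page = curl_exec($ch);
    if ($page === false) {
        throw new RuntimeException('Page request failed: ' . curl_error($ch));
    }
    return $page;
}

// Usage (hypothetical URLs and form field names):
// $html = fetch_after_login(
//     'https://example.com/login',
//     ['username' => 'me', 'password' => 'secret'],
//     'https://example.com/members/report',
//     sys_get_temp_dir() . '/cookies.txt'
// );
```

For sites that use HTTP authentication instead of a form, CURLOPT_USERPWD on a single request replaces the whole login step.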
Once you have the page, you're probably best off using regular expressions to scrape the data you want.
You should look at curl.
You might also want to take a look at BeautifulSoup which is a Python library which is supposed to be very good at making bad HTML parseable. It is aimed at things like screen scraping.
How easy it would be to call from PHP I don't know though.
You could also check out http://php.net/dom
Use curl, and once you're in, use the QueryPath PHP library (querypath.org).
You can access DOM elements just like in jQuery, via CSS selectors, and there's method chaining...
Way better than just using php's native xml functions.
It also works as a Drupal extension, but I suppose you could use it in any PHP project.

Any PHP -> jQuery libraries out there?

Have any bridge libraries been developed for PHP that provide access to the jQuery framework? Ideally it would be nice to have something fairly extensible so that creating jQuery-based content using PHP code would be fairly easy and customizable. Does such a thing exist yet?
pquery
jqpie
jquery-php
There's a warmup list.
So far I've found one that seems to fit the description. I haven't tried it out yet, so if anyone has any feedback or experience with this or other ones don't hesitate to post!
PQuery
jQPie might be what you're after.
What can jQPie do?
Easily request and process data from php using $.getJSON
Inject PHP-generated HTML into elements using $(element).load
Call PHP functions directly from your web pages using $.jqpie
Call jQuery from PHP in response to $.jqpie calls
Advanced autocomplete using jqpie_complete
QueryPath (http://querypath.org) is a full implementation of the DOM/XML/HTML part of jQuery. QueryPath has full CSS 3 selector support (including the stuff jQuery doesn't have, like XML namespace support). It also comes with DB tools, where you can run queries and have the results inserted into the query object. And it has a template engine, too. Like jQuery, you can write custom extensions very easily.
But it definitely takes advantage of its server-side status.
The main project page is at https://fedorahosted.org/querypath. You can download it there (and see lots of examples, including RSS and SVG manipulation).
Integrating with jQuery, then, can be done easily by sending XML data of many sorts down to jQuery. (You could probably send JSON, too... never tried.) And since the server side code and the client side code both look the same, there's less of a need to learn two totally different toolkits.
