This website shows forex rates for different countries, and I want to crawl all of the stored data, which can be displayed by selecting different dates. Please help me: how can I write a cURL or fopen crawler?
www.forex.pk/open_market_rates.asp
Thanks
You can crawl the page with curl and then parse the content with a simple HTML parser.
Here is one: http://simplehtmldom.sourceforge.net/
It's as simple as that.
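A minimal sketch of that approach, using PHP's built-in cURL and DOMDocument extensions in place of Simple HTML DOM so nothing has to be installed. The helper names (fetch_page, extract_table_rows) are made up for illustration, and it assumes the rates sit in an ordinary HTML table; how the site passes the selected date (query string or form post) has to be checked in the browser first.

```php
<?php
// Hypothetical helper: download one page with cURL and return its HTML.
function fetch_page(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('Download failed: ' . curl_error($ch));
    }
    return $html;
}

// Collect the cell text of every table row that has at least $minCells cells.
// Assumption: each rate is one <tr> with the currency and prices in <td>/<th> cells.
function extract_table_rows(string $html, int $minCells = 3): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();

    $rows = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->childNodes as $node) {
            if ($node instanceof DOMElement && in_array($node->nodeName, ['td', 'th'], true)) {
                $cells[] = trim($node->textContent);
            }
        }
        if (count($cells) >= $minCells) {
            $rows[] = $cells;
        }
    }
    return $rows;
}
```

You would call fetch_page() once per date and pass the result to extract_table_rows(); layout-only rows with fewer cells are skipped by the $minCells filter.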
My idea is to create a website that aggregates content from other sources and displays it on one page.
Say I have a list of 10 to 15 websites that deal with entertainment news.
I have to crawl the websites, save the data into a database, and output the contents on a web page sorted by date/time.
For each article I have to crawl the heading, the full content or the first 10 to 15 lines, the images, and a link to the original source.
The site must be updated every 5 to 10 minutes.
On every update, it should check for new articles and display them with heading, text, image, and original source link on a web page with infinite scroll.
My experience is with PHP.
Are there any PHP frameworks, services, or classes to start with?
Any help will be greatly appreciated.
Thanks
Instead of crawling the pages and screen scraping, could you gather the same information by consuming the RSS feeds from the sites? You should avoid screen scraping if at all possible.
If you have to scrape, try using a DOM parser, instead of a regex.
http://simplehtmldom.sourceforge.net/
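If the sites do offer RSS, the aggregation step might look roughly like this. It is only a sketch: it assumes plain RSS 2.0 feeds that have already been downloaded as strings, uses the built-in SimpleXML extension, and the function names are invented for the example. Storing the result in a database and running it every few minutes (for example from cron) are left out.

```php
<?php
// Parse one RSS 2.0 document (already downloaded) into plain arrays.
function parse_rss_items(string $xml): array
{
    libxml_use_internal_errors(true); // do not emit warnings for broken feeds
    $feed = simplexml_load_string($xml);
    libxml_clear_errors();
    if ($feed === false) {
        return [];
    }
    $items = [];
    foreach ($feed->xpath('//item') ?: [] as $item) {
        $items[] = [
            'title'     => trim((string) $item->title),
            'link'      => trim((string) $item->link),
            'summary'   => trim((string) $item->description),
            'timestamp' => strtotime((string) $item->pubDate) ?: 0, // 0 if the date is missing
        ];
    }
    return $items;
}

// Merge several feeds, drop duplicates, and sort newest first.
function merge_feeds(array $feedXmlStrings): array
{
    $byLink = [];
    foreach ($feedXmlStrings as $xml) {
        foreach (parse_rss_items($xml) as $item) {
            $byLink[$item['link']] = $item; // the link doubles as a de-duplication key
        }
    }
    $items = array_values($byLink);
    usort($items, fn ($a, $b) => $b['timestamp'] <=> $a['timestamp']);
    return $items;
}
```

Keying on the link is also a cheap way to detect new articles on each update: anything whose link is not yet in the database is new.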
I want to get the complete content of a news item or post from a website via its feed, but many websites only present part of each item in their feed.
I know there is a script called SimplePie that is built for getting website content via feeds, but it does not retrieve the full content of a news item.
I also found a script called Full-Text Feeds that does this, but it is not free, and I want a free script.
Do you know of a similar script, or another way to do what I need?
The code behind Five Filters' content extraction is actually open source, and is based on Readability's original JavaScript (before they became a service).
You should be able to use it like this:
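// $item_url is the URL of one feed item; the Readability class file must be included first.
$page = file_get_contents($item_url);
$readability = new Readability($page);
// init() returns false when no article content could be found.
if ($result = $readability->init()) {
    $content = $readability->getContent()->innerHTML;
}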
Not entirely sure what you're trying to do here but this might help you:
$full_page_content = file_get_contents('http://www.example.com/');
Edit: Ok, if I understand you correctly you'll need to do something like this:
1. Get the RSS feed.
2. Use SimplePie or something like it to go through each feed item.
3. For each item in the RSS feed:
   - Get the item's URL.
   - Get the content from that URL.
   - Strip out the HTML/extract only the text you need.
4. Combine all of these into a new RSS feed and send that to the user.
Note: This isn't a simple thing to do. There is a reason that Full-Text RSS can charge for their product.
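The steps above can be sketched as follows. This is an illustration under several assumptions, not a replacement for a real extractor: the feed is assumed to be already parsed into title/link pairs (for example by SimplePie), the page download is passed in as a callable so it can be file_get_contents or anything else, and "extract only the text you need" is reduced to keeping the longer paragraphs. All function names are invented for the example.

```php
<?php
// Naive stand-in for real content extraction: keep the text of every <p>
// element that is at least $minLength bytes long.
function extract_paragraph_text(string $html, int $minLength = 40): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html); // prefix makes libxml read UTF-8
    libxml_clear_errors();

    $parts = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        $text = trim(preg_replace('/\s+/', ' ', $p->textContent));
        if (strlen($text) >= $minLength) {
            $parts[] = $text;
        }
    }
    return implode("\n\n", $parts);
}

// Build a new RSS 2.0 feed whose descriptions hold the extracted full text.
// $items: list of ['title' => ..., 'link' => ...]; $fetchHtml: callable(string $url): string.
function build_full_text_feed(array $items, callable $fetchHtml, string $feedTitle, string $feedLink): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;

    // Small helper: append <name>text</name> to $parent, with proper escaping.
    $addText = function (DOMNode $parent, string $name, string $text) use ($doc): void {
        $parent->appendChild($doc->createElement($name))
               ->appendChild($doc->createTextNode($text));
    };

    $rss = $doc->createElement('rss');
    $rss->setAttribute('version', '2.0');
    $doc->appendChild($rss);
    $channel = $rss->appendChild($doc->createElement('channel'));
    $addText($channel, 'title', $feedTitle);
    $addText($channel, 'link', $feedLink);
    $addText($channel, 'description', $feedTitle);

    foreach ($items as $item) {
        $node = $channel->appendChild($doc->createElement('item'));
        $addText($node, 'title', $item['title']);
        $addText($node, 'link', $item['link']);
        $addText($node, 'description', extract_paragraph_text($fetchHtml($item['link'])));
    }
    return $doc->saveXML();
}
```

In real use the callable could simply be 'file_get_contents', and extract_paragraph_text() is the part you would swap for a proper extractor, since picking the article body out of arbitrary pages is the hard part.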
You could use http://magpierss.sourceforge.net/cookbook.shtml (free)
It retrieves RSS feeds. There are many PHP scripts on the web that do that. Google is your friend! :)
I have read a lot of articles explaining how to parse an HTML file with PHP, but Twitter uses an iframe in which the text is hidden. How can I parse the Twitter HTML?
I know it is very easy to use the API, the .rss page, or JSON to get the tweets as strings, but I want to be able to work with the Twitter HTML page directly. Is there any way I could find the tweets using their HTML page?
The best way would be to use something like Simple HTML DOM. With this you can use CSS selectors, as in jQuery, to find the elements on the page you are looking for. However, Twitter pages use a lot of JavaScript and Ajax, so you may be stuck with either using an API or trying the mobile site.
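For comparison, here is roughly what a class-based lookup looks like with PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. The function name is made up, and the class name you pass in (say, a hypothetical "tweet-text") has to be read off the actual markup; it only helps if the text is present in the downloaded HTML rather than added later by JavaScript.

```php
<?php
// Return the trimmed text of every element carrying the given CSS class.
function find_text_by_class(string $html, string $class): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate invalid markup
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html); // prefix makes libxml read UTF-8
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // XPath equivalent of the CSS selector ".classname"; assumes the class
    // name contains no quote characters.
    $query = sprintf(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}
```

The concat/normalize-space wrapping matches whole class names only, so an element with class "tweet-text-extra" is not picked up by a search for "tweet-text".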
I am working on a PHP/MySQL project. I have a page called Live Information, and the client needs this page to display information from different blogs related to a specific topic.
Any direction on how this can be done?
If the blogs give out an RSS feed you can use an RSS library like Magpie to get at the data.
If they don't, you'll need to get their HTML and parse. You'll most probably have to write a parser for each site. Have a look at web scraping.
On a website I maintain for a radio station, there is a page that displays news articles. Right now the news is posted in an HTML page, which is then read by a PHP page that includes all the navigation. I have been asked to make this into an RSS feed. How do I do this? I know how to make the XML file, but the person who edits the news file is not technical and needs a WYSIWYG editor. Is there a WYSIWYG editor for XML? Once I have the feed, how do I display it on my site? I'm working with PHP on this site, so a PHP solution would be preferred.
Use Yahoo Pipes: you don't need programming knowledge, and the load on your site will be lower. Once you've got your feed, display it on your site using a simple "anchor" with "image" in HTML. You could consider piping your feed through Feedburner too.
And for the freebie: if you want to track your feed awareness data in RSS, use my service here.
Do you mean that someone will insert the feed content by hand?
Usually feeds are generated from the site's news content, which you should already have in your database. You just need a PHP script that extracts it and writes the XML.
Edit: no database is used.
OK, now you have just two options:
Use PHP regular expressions (or maybe phpQuery) to get the content you need from the HTML page.
As you said, write the XML by hand and then upload it. I haven't tried any WYSIWYG XML editor, sorry, but there are many on Google.
Does that PHP site have a database back end? If so, the WYSIWYG editor posts into there then a special PHP file generates an RSS feed.
I've used the following IBM page as a guide and it worked wonderfully:
http://www.ibm.com/developerworks/library/x-phprss/
I decided that instead of trying to find a WYSIWYG editor for XML, I would let the news editor continue to upload the news as HTML. I ended up writing a PHP program that finds the <p> and </p> tags and creates an XML file out of them.
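The original program is not shown, so the following is only a guess at the general shape of such a script. It assumes every non-empty <p> element in the news page is one news item, reads them with DOMDocument rather than by searching for the tags as strings, and writes the feed with XMLWriter; the function name and the "first eight words as title" rule are invented for the example.

```php
<?php
// Turn each non-empty <p> of an HTML news page into one RSS 2.0 <item>.
function paragraphs_to_rss(string $html, string $channelTitle, string $channelLink): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // hand-edited HTML is often not valid
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html); // prefix makes libxml read UTF-8
    libxml_clear_errors();

    $w = new XMLWriter();
    $w->openMemory();
    $w->setIndent(true);
    $w->startDocument('1.0', 'UTF-8');
    $w->startElement('rss');
    $w->writeAttribute('version', '2.0');
    $w->startElement('channel');
    $w->writeElement('title', $channelTitle);
    $w->writeElement('link', $channelLink);
    $w->writeElement('description', $channelTitle);

    foreach ($doc->getElementsByTagName('p') as $p) {
        $text = trim(preg_replace('/\s+/', ' ', $p->textContent));
        if ($text === '') {
            continue; // skip empty spacer paragraphs
        }
        // Assumption: use the first eight words of the paragraph as the item title.
        $title = implode(' ', array_slice(explode(' ', $text), 0, 8));
        $w->startElement('item');
        $w->writeElement('title', $title);
        $w->writeElement('link', $channelLink);
        $w->writeElement('description', $text); // XMLWriter escapes &, < and >
        $w->endElement(); // item
    }

    $w->endElement(); // channel
    $w->endElement(); // rss
    $w->endDocument();
    return $w->outputMemory();
}
```

Serving the result with a Content-Type of application/rss+xml, or saving it with file_put_contents() whenever the news page changes, would complete the feed.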
You could use rssa.at: just put in your URL and it'll create an RSS feed for you. You can then let people sign up for alerts (hourly/daily/weekly/monthly) for free, and access stats.
If the HTML is consistent, you could just have them publish as normal and then scrape a feed. There are programmatic ways to do this for sure, but http://www.dapper.net/dapp-factory.jsp is a nice point-and-click feed scraping service. Then, use either MagpieRSS, SimplePie or Feed.informer.com to display the feed.