My idea is to create a website that aggregates content from other sources and displays it on a single page.
Say I have a list of 10-15 websites that deal with entertainment news.
I have to crawl those websites, save the data into a database, and output the contents on a web page sorted by date/time.
For each article I need the heading, the full content (or the first 10-15 lines), images, and a link to the original source.
The site must be updated every 5-10 minutes.
On every update it should check for new articles and display them with heading, text, image, and original source link on a page with infinite scroll.
My experience is with PHP.
Are there any PHP frameworks, services, or classes to start with?
Any help will be greatly appreciated.
Thanks
Instead of crawling the pages and screen scraping, could you gather the same information by consuming the RSS feeds from the sites? You should avoid screen scraping if at all possible.
If you have to scrape, try using a DOM parser instead of a regex.
http://simplehtmldom.sourceforge.net/
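For example, here is a minimal sketch of pulling the heading, summary, link and date out of one feed with PHP's built-in SimpleXML; the feed URL is a placeholder, and images usually live in a media:content or enclosure element that this sketch skips:

<?php
// Minimal sketch: read one site's RSS 2.0 feed and collect the fields the
// aggregator needs. The URL is a placeholder for a real entertainment feed.
$feed = simplexml_load_file('http://entertainment-site.example.com/rss');

foreach ($feed->channel->item as $item) {
    $article = array(
        'heading' => (string) $item->title,
        'text'    => (string) $item->description,  // usually a summary / first lines
        'link'    => (string) $item->link,          // link back to the original source
        'date'    => strtotime((string) $item->pubDate),
    );
    // Save $article to the database here, then render the page sorted by 'date'.
}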
I have created an AJAX-driven website which can load any page when given the correct parameters. For instance, www.mysite.com/?page=blog&id=7 opens a blog post.
If I create a sitemap with links to all pages within the website, will this be indexed?
Many thanks.
If you provide a URL for each page that will actually display the full page, then yes. If those requests just respond with JSON, or only part of a page, then no. In reality this is probably a poor design SEO-wise. Each page should have its own URL, e.g. www.mysite.com/unicorns instead of www.mysite.com/?page=blog&id=1, and the links on the page should point to those. Then you should use JavaScript to capture the click events for the AJAX links and update the page however you like. Or better yet, try out PJAX, which loads just the content of a page instead of doing a full page refresh, speeding things up a little without really any changes to your normal site setup.
You do realize that with that sitemap all your search engine links will be ugly.
As Google has said, a page can still be crawled with a nice URL if you use a fragment identifier:
<meta name="fragment" content="!"> // for meta fragment
and when you generate your page by AJAX, append the fragment to the URL:
www.mysite.com/#!page=blog-7 //(and split them)
The page should load content directly in PHP by using $_GET['_escaped_fragment_'].
From what I've read, Bing and Yahoo have started crawling with the same process.
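As a rough illustration (not a full implementation), the server side of that scheme could look something like this; render_full_page() is a hypothetical helper standing in for whatever normally builds your page:

<?php
// Minimal sketch, assuming the #!page=blog-7 scheme above. Googlebot rewrites
// www.mysite.com/#!page=blog-7 to www.mysite.com/?_escaped_fragment_=page=blog-7
// before requesting it, so serve a full HTML snapshot for that request.
if (isset($_GET['_escaped_fragment_'])) {
    parse_str($_GET['_escaped_fragment_'], $params);   // "page=blog-7" -> array('page' => 'blog-7')
    list($page, $id) = explode('-', $params['page']);  // split into "blog" and "7"

    // render_full_page() is a hypothetical helper that echoes the complete HTML
    // for the page, so the crawler sees real content rather than an AJAX shell.
    render_full_page($page, $id);
    exit;
}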
This website shows forex rates for different countries, and I want to crawl all of the stored data that can be shown by selecting different dates. Please help me: how can I write a cURL or fopen-based crawler?
www.forex.pk/open_market_rates.asp
Thanks
You can crawl the page with cURL and then parse the content with a simple HTML parser.
Here is one: http://simplehtmldom.sourceforge.net/
Simple as that.
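Roughly like this, as a sketch; the 'table tr' selector and column positions below are assumptions you will need to check against the actual markup of the page, and fetching a specific date will mean POSTing whatever date field the page's form uses:

<?php
// Minimal sketch: fetch the rates page with cURL and parse it with
// Simple HTML DOM. Selector and cell positions are assumptions.
require_once 'simple_html_dom.php';

$ch = curl_init('http://www.forex.pk/open_market_rates.asp');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$page = curl_exec($ch);
curl_close($ch);

$html = str_get_html($page);                // build a DOM from the raw HTML
foreach ($html->find('table tr') as $row) {
    $cells = $row->find('td');
    if (count($cells) >= 3) {
        // e.g. currency name, buying rate, selling rate
        echo trim($cells[0]->plaintext), ' ',
             trim($cells[1]->plaintext), ' ',
             trim($cells[2]->plaintext), "\n";
    }
}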
I am working on a PHP/MySQL project. I have a page called Live Information, and the client needs this page to display all the information from different blogs related to one specific topic.
So, any direction on how this can be done?
If the blogs give out an RSS feed, you can use an RSS library like Magpie to get at the data.
If they don't, you'll need to fetch their HTML and parse it. You'll most probably have to write a parser for each site. Have a look at web scraping.
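A minimal MagpieRSS sketch, assuming each blog exposes a feed (the URLs below are placeholders):

<?php
// Minimal sketch using MagpieRSS; the feed URLs are placeholders for the real blogs.
require_once 'magpierss/rss_fetch.inc';

$feeds = array(
    'http://blog-one.example.com/feed',
    'http://blog-two.example.com/feed',
);

$posts = array();
foreach ($feeds as $url) {
    $rss = fetch_rss($url);          // MagpieRSS fetches, caches and parses the feed
    if (!$rss) {
        continue;                    // skip feeds that fail to load
    }
    foreach ($rss->items as $item) {
        $posts[] = array(
            'title' => $item['title'],
            'link'  => $item['link'],
            'text'  => isset($item['description']) ? $item['description'] : '',
        );
    }
}

// $posts can now be filtered by topic, sorted, and rendered on the Live Information page.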
I'm currently working on a website whose main function is to search a relatively large database of companies. After submitting their search parameters, people are taken to a list of companies that match what they searched for, and from there they can see the details of any specific company by clicking on a result.
This website's splash page is a simple form to execute this search and nothing else.
As far as I know, crawlers will not try to submit forms with random text, so I'm afraid everything behind my form will never be indexed.
I want to avoid, obviously, all kinds of black hat techniques.
That said, how can I make sure my inner content gets indexed?
Also, if you know, is there any good reading material around the web about indexing websites?
Thanks in advance.
My suggestions would be:
1) Generate a sitemap XML file from your list, with each element being an individual company page, as well as a dynamic HTML-based sitemap with a tiny link to it in the footer (see the sketch after this list).
Manually Creating Sitemaps: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=34657
or (my personal preference)
2) Create an RSS feed + icon so that Google can peruse the list of companies, and so that you or your clients can subscribe and see companies as they sign up. Submit the RSS feed to Google and voila. Make sure you have an icon or RSS link on the front page, though.
Submit an RSS Feed: http://www.google.com/submityourcontent/content_type/rss.html
Create an RSS Feed: http://www.wikihow.com/Create-an-RSS-Feed
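For option 1, a minimal sketch of generating the sitemap from a companies table; the table/column names, URL pattern and database credentials are assumptions to adapt to your own schema and routing:

<?php
// Minimal sketch: build sitemap.xml from the companies table.
// "companies", "slug" and the /company/ URL pattern are assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

$xml = new SimpleXMLElement(
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"/>'
);

foreach ($pdo->query('SELECT slug FROM companies') as $row) {
    $url = $xml->addChild('url');
    $url->addChild('loc', 'http://www.example.com/company/' . $row['slug']);
    $url->addChild('changefreq', 'monthly');
}

// Drop the file in the web root, then submit it through Google Webmaster Tools.
$xml->asXML('sitemap.xml');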
Create a page with an index of all the companies and link to it from the search page. If you have a ton of companies, separating them by letter will be helpful. This allows search engines to see all of your content and has the added benefit of helping users as well.
On a website I am maintaining for a radio station they have a page that displays news articles. Right now the news is posted in an HTML page which is then read by a PHP page which includes all the navigation. I have been asked to turn this into an RSS feed. How do I do this? I know how to make the XML file, but the person who edits the news file is not technical and needs a WYSIWYG editor. Is there a WYSIWYG editor for XML? Once I have the feed, how do I display it on my site? I'm working with PHP on this site, so a PHP solution would be preferred.
Use Yahoo Pipes: you don't need programming knowledge, plus the load on your site will be lower. Once you've got your feed, display it on your site using a simple anchor with an image in HTML. You could consider piping your feed through FeedBurner too.
And for the freebie: if you want to track your feed awareness data in RSS, use my service here.
Do you mean that someone will insert the feed content by hand?
Usually feeds are generated from the site's news content, which you should already have in your database; you just need a PHP script that extracts it and writes the XML.
Edit: no database is used.
OK, now you have just two options:
Use PHP regular expressions to get the content you need from the HTML page (or maybe phpQuery).
As you said, write the XML by hand and then upload it. I haven't tried any WYSIWYG XML editor, sorry; there are many on Google.
Does that PHP site have a database back end? If so, the WYSIWYG editor posts into it, and then a special PHP file generates an RSS feed.
I've used the following IBM page as a guide and it worked wonderfully:
http://www.ibm.com/developerworks/library/x-phprss/
I decided that instead of trying to find a WYSIWYG editor for XML, I would let the news editor continue to upload the news as HTML. I ended up writing a PHP program that finds the <p> and </p> tags and creates an XML file out of them.
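For anyone curious, a rough sketch of that approach (not the exact program); the file names, channel details and the "one paragraph = one item" assumption are placeholders:

<?php
// Minimal sketch: pull <p>...</p> blocks out of the hand-edited news HTML
// and write them into an RSS 2.0 file. File names and channel details are placeholders.
$html = file_get_contents('news.html');

preg_match_all('#<p[^>]*>(.*?)</p>#si', $html, $matches);

$rss  = new SimpleXMLElement('<rss version="2.0"><channel/></rss>');
$chan = $rss->channel;
$chan->addChild('title', 'Station News');
$chan->addChild('link', 'http://www.example.com/news');
$chan->addChild('description', 'Latest news from the station');

foreach ($matches[1] as $paragraph) {
    $text = trim(strip_tags($paragraph));
    if ($text === '') {
        continue;                                   // skip empty paragraphs
    }
    $item = $chan->addChild('item');
    // addChild() does not escape ampersands, so do it ourselves.
    $item->addChild('title', str_replace('&', '&amp;', substr($text, 0, 60)));
    $item->addChild('description', str_replace('&', '&amp;', $text));
}

$rss->asXML('news.xml');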
You could use rssa.at: just put in your URL and it'll create an RSS feed for you. You can then let people sign up for alerts (hourly/daily/weekly/monthly) for free, and access stats.
If the HTML is consistent, you could just have them publish as normal and then scrape a feed. There are programmatic ways to do this for sure, but http://www.dapper.net/dapp-factory.jsp is a nice point-and-click feed scraping service. Then use either MagpieRSS, SimplePie or Feed.informer.com to display the feed.
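The display step with SimplePie might look roughly like this; the feed URL is a placeholder for wherever the scraped feed ends up, and the include path depends on your SimplePie version:

<?php
// Minimal sketch: render a feed with SimplePie. The feed URL is a placeholder.
require_once 'simplepie.inc';   // or SimplePie's autoloader, depending on the version

$feed = new SimplePie();
$feed->set_feed_url('http://www.example.com/scraped-news-feed');
$feed->init();

foreach ($feed->get_items() as $item) {
    echo '<h3><a href="', $item->get_permalink(), '">', $item->get_title(), '</a></h3>';
    echo '<p>', $item->get_description(), '</p>';
}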