Generate keywords for content through Solr - PHP

I'm integrating Solr into my new PHP application.
As I'm a Solr newbie, I want to know whether it's possible to generate useful tags for every content page through Solr, something like an auto-tagging mechanism.
Thanks in advance...
P.S. My content is available in both Persian and English.

something like an auto-tagging mechanism
Yes, you can build something like that.
There are two different ways to realize it:
1. Use Solr's Clustering Component to build groups of documents and let Solr label them. The labels are something like the tags you are looking for.
2. Realize tagging by using the MoreLikeThis (MLT) feature.
I started an auto-tagging project with method 1, with medium success. Finding labels for a cluster of documents is a hard process.
Fortunately, I had some already tagged documents. If you also have some documents with valid tags, then you can use method 2 with those documents as a base to start learning:
Take a document without tags and perform an MLT search against the docs with tags. Take the tags from the docs you find and count them. Depending on the count, apply one or more tags to the untagged document. In my case, that works very well. Method 2 is a cheap implementation of machine learning, but you get 95% of the success for only 5% of the work.
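A minimal sketch of method 2, assuming a Solr core reachable over HTTP with an MLT request handler enabled at /mlt in solrconfig.xml; the field names ("text", "tags") and the tag threshold are illustrative, not part of any fixed schema:

<?php
// Ask Solr for documents similar to the untagged one, restricted to
// documents that already carry tags.
$solr   = 'http://localhost:8983/solr/mycore';
$params = http_build_query(array(
    'q'      => 'id:' . $untaggedDocId, // the document to tag
    'mlt.fl' => 'text',                 // field to compute similarity on
    'fq'     => 'tags:[* TO *]',        // only docs that already have tags
    'fl'     => 'tags',
    'rows'   => 10,
    'wt'     => 'json',
));
$result = json_decode(file_get_contents($solr . '/mlt?' . $params), true);

// Count how often each tag occurs among the similar documents.
$counts = array();
foreach ($result['response']['docs'] as $doc) {
    foreach ((array) $doc['tags'] as $tag) {
        $counts[$tag] = isset($counts[$tag]) ? $counts[$tag] + 1 : 1;
    }
}
arsort($counts);

// Apply every tag shared by at least 3 of the 10 similar documents.
$suggested = array_keys(array_filter($counts, function ($n) {
    return $n >= 3;
}));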

As it's a PHP application, if it's OK for you to generate tags in PHP and then insert/update them into Solr, here are a few options:
If using a web service is OK, check Yahoo's Term Extractor.
If you can/want to host a term extraction service yourself (maybe on a local server), check FiveFilters.
A PHP function that extracts valuable words from a text block may also work for you; surely not as efficient as Yahoo's Term Extractor, but it may be enough.
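A rough sketch of what such a function can look like, using stopword removal plus frequency counting; the tiny stopword list is only a placeholder (a real one, especially for Persian, would be much larger):

<?php
// Return the most frequent non-trivial words of a text block.
// Requires the mbstring extension; the /u modifier keeps it working
// for Persian as well as English text.
function extract_keywords($text, $limit = 10)
{
    $stopwords = array('the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is', 'that');
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $counts = array();
    foreach ($words as $word) {
        if (mb_strlen($word) < 3 || in_array($word, $stopwords)) {
            continue; // skip short tokens and stopwords
        }
        $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
    }
    arsort($counts);
    return array_slice(array_keys($counts), 0, $limit);
}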

Related

How do I build a search engine in PHP to search live content of multiple sites?

I am a relatively novice programmer with a good understanding of PHP, but more of the read-understand-and-copy-the-bits-I-need type than someone who develops from scratch.
I have a list of over 1000 URLs I would like to search. I would like to search those pages for content on demand and return only the results containing the text query I provide. I have looked at Google Custom Search Engine as an easy option, and it works well but limits the number of pages I can add.
I've looked into cURL, but it doesn't seem to offer what I'm looking for, unless I'm missing something?
Or are there other options like Google CSE that are free and easy to use?
You can write a crawler for the pages you need and use the Sphinx engine (http://sphinxsearch.com/) to search them. In my opinion, writing the crawler with the PECL HTTP extension is better than using the pure cURL library.
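A bare-bones sketch of such a crawler using plain cURL (a PECL HTTP version would look similar); save_page() is a hypothetical storage function, e.g. an INSERT into the table that Sphinx indexes:

<?php
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body as string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html !== false) {
        // Keep only the visible text for indexing.
        $text = trim(preg_replace('/\s+/', ' ', strip_tags($html)));
        save_page($url, $text); // hypothetical: store for Sphinx to index
    }
}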

Search Algorithm for tags and contents

I'm designing a tag system and I'm looking for a good search algorithm. It must consider both tags and text content, ideally with the possibility of giving more weight to tags or to content according to my needs. Is there anything similar in the literature? It's my first time working on such a system, so easy and popular solutions could fit too.
Thank you for your time.
It would be possible to implement this within MySQL, but I think it would be worth looking at dedicated full-text search applications for what you're trying to achieve. Most of them handle tags (usually referred to as attributes), as this is a common use case, and let you boost matches in one field over another (see the sketch after the list below).
I'd recommend looking at the following:
Sphinx Search
Elastic Search
Solr
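To make the weighting idea concrete with Solr, here is a sketch using the eDisMax query parser: its qf parameter boosts matches in one field over another. The field names and the factor 3 are assumptions, not a fixed recipe:

<?php
$params = http_build_query(array(
    'q'       => $userQuery,
    'defType' => 'edismax',
    'qf'      => 'tags^3 content', // a tag match counts 3x a content match
    'wt'      => 'json',
));
$results = json_decode(
    file_get_contents('http://localhost:8983/solr/mycore/select?' . $params),
    true
);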

What's the most efficient way to set up a multi-lingual website

I'm developing a website that will be available in different languages. It is a LAMP (Linux, Apache, MySQL, PHP) setup, and it uses Smarty, mostly as the template engine.
The way we currently translate is with a self-written Smarty plugin that recognizes certain tags in the HTML files and looks up the corresponding tag in a previously defined language file.
The HTML could look as follows:
<p>Hi, welcome to $#gamedesc;!</p>
And the language file could look like this:
gamedesc:Poing 2009$;
welcome:this is another tag$;
Which would then output
<p>Hi, welcome to Poing 2009!</p>
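For illustration only, a replacement pass along those lines might reduce to a lookup table plus a regular expression (a guess at the mechanics, not the actual plugin):

<?php
// Parse the language file into a key => value map. Entries are
// separated by "$;" and keys from values by the first ":".
$translations = array();
foreach (explode('$;', file_get_contents('lang/en.txt')) as $entry) {
    if (strpos($entry, ':') !== false) {
        list($key, $value) = explode(':', trim($entry), 2);
        $translations[$key] = $value;
    }
}

// Replace every $#key; token in the rendered template output.
$html = preg_replace_callback('/\$#(\w+);/', function ($m) use ($translations) {
    return isset($translations[$m[1]]) ? $translations[$m[1]] : $m[0];
}, $html);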
This system is very basic, but it is pretty hard to control if, for example, I would like to keep track of what has been translated so far, or give certain users the right to translate only certain tags.
I've been looking at some alternative approaches: either replacing the text file with XML files that could store some extra metadata, or perhaps storing all the texts in the database and retrieving them from there.
My question is: what would be the best way to make this system both maintainable and able to perform well under high user traffic? Are there perhaps any (lightweight) plugins I could take a look at?
You could give gettext a shot. It is the way it is done in most C/C++ Linux applications, and there is an extension for PHP too. The idea is not very different from what you're already doing, but there are tools that ease the maintenance of translations (e.g. Poedit).
For user rights to translations, gettext won't be of much help; I think you'll need to do that on your own, or look at whether some frameworks have smarter solutions.
Taking a look at the gettext library may give you some hints: http://php.net/manual/en/book.gettext.php. Hope it helps!
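A minimal gettext setup in PHP looks roughly like this, assuming the gettext extension is installed and a compiled messages.mo file exists under locale/nl_NL/LC_MESSAGES/ (the locale name is just an example):

<?php
putenv('LC_ALL=nl_NL');
setlocale(LC_ALL, 'nl_NL');
bindtextdomain('messages', './locale'); // where the .mo files live
textdomain('messages');

// _() is an alias for gettext(); the English string is the lookup key.
printf(_('Hi, welcome to %s!'), 'Poing 2009');

Translators then work on the .po source files in a tool like Poedit, and you compile them to .mo without touching the templates.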
You will need a table in your database where you can store strings of text, each with a composite ID made up of a language ID and a text node ID.
You will need to give the user a chance to select a preferred language. You should make sure that you either have a default "this has not been translated" string for every language you use, or a default language in which your entire site can be viewed.
For every bit of text within your website, rather than storing the text within the page, you just assign it an ID.
When serving the page, look up the text node ID and the preferred language ID and load that string of text, or the string for the default language.
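A sketch of that lookup using PDO; the table layout and all names are illustrative:

<?php
// CREATE TABLE translations (
//     lang_id INT NOT NULL,
//     node_id INT NOT NULL,
//     body    TEXT NOT NULL,
//     PRIMARY KEY (lang_id, node_id)
// );
function translate(PDO $db, $nodeId, $langId, $defaultLangId = 1)
{
    $stmt = $db->prepare(
        'SELECT body FROM translations
          WHERE node_id = ? AND lang_id IN (?, ?)
          ORDER BY lang_id = ? DESC -- prefer the requested language
          LIMIT 1'
    );
    $stmt->execute(array($nodeId, $langId, $defaultLangId, $langId));
    $body = $stmt->fetchColumn();
    return $body !== false ? $body : '[this has not been translated]';
}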
In our project, http://pkp.sfu.ca/ojs, we use XML files to store translation key-value pairs. Browse our code: http://github.com/pkp/pkp-lib/blob/master/classes/i18n/PKPLocale.inc.php
We use that class to read the XML files for each locale, and in our code we call Locale::translate('locale.key.name'). It's similar to gettext, but uses an XML file for easier updating.
Looking around at web stuff today I came across this website: http://translateth.is/
It looks simple to use: copy and paste in some JavaScript.

PHP script to find synonyms

I'm writing a PHP script to compare the similarity of two strings. It works pretty well at the moment, but what I would like to do is also match words when one is a synonym of the other.
Any thoughts?
You might want to look for a thesaurus service that allows you to query the synonyms for a word and returns them as an XML list.
Here is something to look at: http://nbii-thesaurus.ornl.gov/thesaurus/
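As a sketch of how the synonym step could plug into your comparison, with a tiny hard-coded synonym map standing in for whatever the thesaurus service returns:

<?php
// Hypothetical map; in practice you would fill it from the service.
$synonyms = array(
    'big'  => array('large', 'huge'),
    'fast' => array('quick', 'rapid'),
);

// Two words match if they are equal or if either lists the other
// as a synonym.
function words_match($a, $b, array $synonyms)
{
    $a = strtolower($a);
    $b = strtolower($b);
    if ($a === $b) {
        return true;
    }
    return (isset($synonyms[$a]) && in_array($b, $synonyms[$a]))
        || (isset($synonyms[$b]) && in_array($a, $synonyms[$b]));
}

var_dump(words_match('big', 'huge', $synonyms)); // bool(true)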
I don't know if this will be helpful for you, but some time ago I worked on a PHP (CodeIgniter) library for Google Search that gets related terms by using the ~ operator in searches.
Maybe you can dig into the source code: codeigniter-googlesearch-api
Formally these aren't synonyms, but depending on the application you have in mind it could be useful (for example for SEO purposes).
As a side note, if you put ~term into Google, it will bold the terms that are related. Try it with ~investment, for example.

How to build an in-site search engine with PHP?

I want to build an in-site search engine with PHP. Users must log in to see the information, so I can't use the Google or Yahoo search engine code.
I want the engine to search the text and the pages, not the tables in the MySQL database, for now.
Has anyone ever done this? Could you give me some pointers to help me get started?
You'll need a spider that harvests pages from your site (in a cron job, for example), strips the HTML and saves them in a database.
You might want to have a look at Sphinx (http://sphinxsearch.com/); it is a search engine that can easily be accessed from PHP scripts.
You can cheat a little bit, the way the much-hated Experts-Exchange website does. They are a for-profit programmers' Q&A site, much like StackOverflow. In order to see answers you have to pay, but sometimes the answers come up in Google search results. It is rather clear that E-E presents a different page to web crawlers than to humans. You could use the same trick, then add Google Custom Search to your site. Users who are logged in would then see the results; otherwise they'd be bounced to the login screen.
Do you have control over your server? Then I would recommend installing Solr/Lucene for indexing and SolPHP for interacting with it from PHP. That way you can have facets and other nice full-text search features.
I would not spider the actual pages; instead I would spider versions of the pages without navigation and other elements that are not content-related.
Note that Solr requires Java on the server.
I ended up using Sphider, which is a free tool, and it works well with PHP.
Thanks all.
If the content and the titles of your pages are already managed by a database, you will just need to write your search engine in PHP. There are plenty of solutions for querying your database, for example:
http://www.webreference.com/programming/php/search/
If the content is only contained in HTML files and not in the DB, you might want to write a spider.
You may be interested in caching the results to improve performance, too.
I would say that everything depends on the size and the complexity of your website/web application.
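If you go the database route, here is a minimal sketch using MySQL full-text matching through PDO; it assumes a pages table with a FULLTEXT index on (title, content), and all names are illustrative:

<?php
$stmt = $pdo->prepare(
    'SELECT url, title,
            MATCH (title, content) AGAINST (?) AS score
       FROM pages
      WHERE MATCH (title, content) AGAINST (?)
      ORDER BY score DESC
      LIMIT 20'
);
$stmt->execute(array($query, $query));
$results = $stmt->fetchAll(PDO::FETCH_ASSOC);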
