Finding images in Wikipedia that are being used across various articles - php

I'm trying to query Wikipedia using the MediaWiki API with PHP (and cURL), in order to search for images used in articles matching a specific search term. For example: search for 'panda', but get only images that are actually used somewhere, and be able to go to the articles that use them.
I am able to search for images generally using:
https://en.wikipedia.org/w/api.php?action=query&list=allimages&ailimit=100&aifrom=Panda&aiprop=url&format=xmlfm
and I know that basically this should show the usage:
https://commons.wikimedia.org/w/api.php?action=query&prop=images&list=imageusage&iutitle=File:MY_IMAGE_NAME&format=xmlfm
Trying the above does not give me the result I need: I can see a list of images, but I cannot tell if or where they are being used.
Can anyone assist?

list=imageusage does not show cross-wiki usage; you'll need prop=globalusage for that. Conveniently, it is a prop module, so it can be folded into the first query by using allimages as a generator:
action=query&generator=allimages&gailimit=100&gaifrom=Panda&prop=globalusage
(I omitted prop=images since it does not seem to serve any useful purpose here.)
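A minimal PHP/cURL sketch of that query (using format=json instead of xmlfm so it is easy to parse; the field names follow the standard API JSON layout):
$url = 'https://en.wikipedia.org/w/api.php?action=query'
     . '&generator=allimages&gailimit=100&gaifrom=Panda'
     . '&prop=globalusage&format=json';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'MyImageSearch/1.0'); // the API asks for a descriptive User-Agent
$response = curl_exec($ch);
curl_close($ch);

$data = json_decode($response, true);
foreach ($data['query']['pages'] as $page) {
    if (empty($page['globalusage'])) {
        continue; // image is not used anywhere
    }
    echo $page['title'], "\n";
    foreach ($page['globalusage'] as $usage) {
        echo '  used in: ', $usage['title'], ' (', $usage['wiki'], ")\n";
    }
}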


How to get related content WITHOUT using tags?

I need a way to get related content without using tags, because in my case there are too many tags and those tags are inserted by users (so in most cases they forget to use them).
YouTube does the same thing: if, for example, you are watching a funny video, YouTube shows you other funny videos in the related content.
For instance, if the article's title is "Barack Obama, president of the USA, goes to Miami", I need to get other articles that contain "Barack Obama", "USA", "president" or "Miami" in the title and, if possible, other articles on the same topic.
This can be very complex to do, so I'm asking for advice.
A possible solution is to use Zend Lucene.
http://framework.zend.com/manual/1.12/en/zend.search.lucene.html
It's a search engine that runs entirely in PHP. You can use it as a component separate from the Zend Framework, and it's fairly easy to implement.
Index all your content. Use the (for some reason undocumented) boost feature to make parts of the content more relevant (i.e. title, user tags).
Example here: http://davedash.com/2007/05/29/boosting-terms-in-zend-search-lucene/
Then use the title as a keyword query and display the x highest-scoring results to your users (making sure to filter out the content the user is currently looking at).
For optimization you could cache the search results per page.
You can tweak the outcome in two places:
- While indexing: decide which fields best describe the content, and boost those
- While searching: decide what to use as the query (title, user tags, or a combination)
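A rough sketch of what that indexing and search flow might look like with Zend_Search_Lucene (the index path, field names, and boost values are only illustrative):
require_once 'Zend/Search/Lucene.php';

// Build the index once, e.g. from a cron job (use open() afterwards).
$index = Zend_Search_Lucene::create('/path/to/index');
foreach ($articles as $article) {
    $doc = new Zend_Search_Lucene_Document();
    $title = Zend_Search_Lucene_Field::Text('title', $article['title']);
    $title->boost = 3; // the boost feature mentioned above: title counts more
    $doc->addField($title);
    $doc->addField(Zend_Search_Lucene_Field::UnStored('body', $article['body']));
    $doc->addField(Zend_Search_Lucene_Field::Keyword('url', $article['url']));
    $index->addDocument($doc);
}

// Later: use the current article's title as the query.
$index = Zend_Search_Lucene::open('/path/to/index');
$hits  = $index->find('Barack Obama president USA Miami');
foreach (array_slice($hits, 0, 10) as $hit) {
    echo $hit->title, ' (score ', $hit->score, ")\n"; // filter out the current article here
}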

PHP library for word clustering/NLP?

What I am trying to implement is a rather trivial "take search results (as in title & short description), cluster them into meaningful named groups" program in PHP.
After hours of googling and countless searches on SO (yielding interesting results as always, albeit nothing really useful) I'm still unable to find any PHP library that would help me handle clustering.
Is there such a PHP library out there that I might have missed?
If not, is there any FOSS that handles clustering and has a decent API?
Like this:
Use a list of stopwords, get all words or phrases not in the stopwords, count occurrences of each, and sort in descending order.
The stopwords need to be a list of all common English terms. They should also include punctuation, and you will need to preg_replace all the punctuation into separate words first, e.g. "Something, like this." -> "Something , like this ." Or you can just remove all punctuation, as the code below does.
$content = strtolower($content);                      // the regex below only keeps a-z
$content = preg_replace('/[^a-z\s]/', '', $content);  // remove punctuation
$words   = preg_split('/\s+/', trim($content));       // split into individual words
$stopwords = 'the|and|is|your|me|for|where|etc...';
$stopwords = array_flip(explode('|', $stopwords));    // flip for O(1) lookup
$result = array();
$temp   = array();
foreach ($words as $s) {
    if (isset($stopwords[$s]) || strlen($s) < 3) {
        // a stopword (or very short word) ends the current phrase
        if (count($temp) > 0) {
            $result[] = implode(' ', $temp);
            $temp = array();
        }
    } else {
        $temp[] = $s;
    }
}
if (count($temp) > 0) {
    $result[] = implode(' ', $temp);
}
$phrases = array_count_values($result); // phrase => number of occurrences
arsort($phrases);                       // most frequent first
Now you have an associative array of phrases, ordered by how frequently they occur in your input data.
How you want to do the matches depends upon you, and it depends largely on the length of the strings in the input data.
I would check whether any of the top 3 array keys for one result match any of the top 3 for another; matching results then form your groups.
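A naive sketch of that matching step, assuming $docs holds one $phrases array (phrase => count, already sorted) per search result:
$groups = array();
foreach ($docs as $i => $phrasesA) {
    $topA = array_slice(array_keys($phrasesA), 0, 3);
    foreach ($docs as $j => $phrasesB) {
        if ($j <= $i) continue; // consider each pair only once
        $topB = array_slice(array_keys($phrasesB), 0, 3);
        if (count(array_intersect($topA, $topB)) > 0) {
            $groups[] = array($i, $j); // results $i and $j share a top phrase
        }
    }
}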
Let me know if you have any trouble with this.
"... cluster them into meaningful groups" is a bit to vague, you'll need to be more specific.
For starters you could look into K-Means clustering.
Have a look at this page and website:
PHP/ir: Information Retrieval and other interesting topics
EDIT: You could try some data mining yourself by cross-referencing search results with something like the Open Directory (dmoz) RDF data dump and then enumerating the matching categories.
EDIT2: And here is a dmoz/category question that also mentions "Faceted Search"!
Dmoz/Monster algorithme to calculate count of each category and sub category?
If you're doing this for English only, you could use WordNet: http://wordnet.princeton.edu/. It's a lexicon widely used in research which provides, among other things, sets of synonyms for English words. The shortest distance between two words could then serve as a similarity metric to do clustering yourself as zaf proposed.
Apparently there is a PHP interface to WordNet here: http://www.foxsurfer.com/wordnet/. It came up in this question: How to use word Net with php, but I have not tried it. However, interfacing with a command line tool from PHP yourself is feasible as well.
You could also have a look at Programming Collective Intelligence (Chapter 3 : Discovering Groups) by Toby Segaran which goes through just this use case using Python. However, you should be able to implement things in PHP once you understand how it works.
Even though it is not PHP, the Carrot2 project offers several clustering engines and can be integrated with Solr.
This may be way off, but check out OpenCalais. They have a web service which allows you to pass in a block of text, and it will pass you back a parseable response of things it found in the text, such as places, people, facts, etc. You could use these categories to build your "clouds" and to choose which results to display.
I've used this library a few times in php and it's always been quite easy to work with.
Again, this might not be relevant to what you're trying to do. Maybe you could post an example of what you're trying to accomplish?
If you can pre-define the filters for your faceted search (the named groups) then it will be much easier.
Rather than relying on an algorithm that uses the current searcher's input and their particular results to generate the filter list, you would use an aggregate of the most commonly performed searches by all users and then tag results with them if they match.
You would end up with a table (or something) of URLs in a many-to-many join to a table of tags, so each result url could have several appropriate tags.
When the user searches, you simply match their search against the full index. But for the filters, you take the top results from among the current resultset.
I'll work on query examples if you want.
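As a rough sketch (all table and column names here are hypothetical), the filter list for the current resultset could be pulled like this:
$pdo = new PDO('sqlite:facets.db');
// Schema: urls(id, url), tags(id, name), url_tags(url_id, tag_id)

// $resultIds: ids of the urls in the current resultset.
$ids = implode(',', array_map('intval', $resultIds));
$sql = "SELECT t.name, COUNT(*) AS cnt
        FROM url_tags ut
        JOIN tags t ON t.id = ut.tag_id
        WHERE ut.url_id IN ($ids)
        GROUP BY t.name
        ORDER BY cnt DESC
        LIMIT 10";
foreach ($pdo->query($sql) as $row) {
    echo $row['name'], ' (', $row['cnt'], ")\n"; // tag and how many current results carry it
}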

Most used words on website using Solr etc

I want to generate a list of the most-used words on a website. The application should crawl the site's content.
Does anyone know if this can be done by Solr or any other technique?
The list can be php objects/array or an xml file.
You might want to check http://wiki.apache.org/solr/TermsComponent
Example -
http://host:port/solr/core/terms?terms.fl=title&terms.sort=count
This will give you all the terms for the field title, ordered by count (the default).
terms.fl - Field you want to check the terms on
terms.sort={count|index} - If count, sorts the terms by the term frequency (highest count first). If index, returns the terms in index order. Default is to sort by count.
This gives the indexed terms, which have been through the tokenizer and filters, so if you need the terms as-is you can vary the field analysis (probably by using field type string).
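From PHP this is just an HTTP GET; with wt=json the terms usually come back as a flat [term, count, term, count, ...] list (the exact layout depends on the json.nl parameter; host, port, and core name below are placeholders):
$url = 'http://localhost:8983/solr/core/terms'
     . '?terms.fl=title&terms.sort=count&terms.limit=50&wt=json';
$data = json_decode(file_get_contents($url), true);

// Convert the flat [term, count, ...] list into word => frequency.
$flat  = $data['terms']['title'];
$words = array();
for ($i = 0; $i < count($flat); $i += 2) {
    $words[$flat[$i]] = $flat[$i + 1];
}
print_r($words); // most frequent first, as returned by terms.sort=count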
Solr is a search engine; it doesn't crawl websites. You need to make a simple website crawler using Scrapy (http://scrapy.org/) or some similar tool. Design a Solr schema to record the data, crawl the websites, and send record updates to Solr. Your specific question would probably be answered by the SCHEMA BROWSER choice in the Solr web admin interface. Click on DYNAMIC FIELDS, select the field you are interested in, and see the top 10. Change the number to 50, press ENTER, and get the top 50.

PHP Detect Pages Genre/Category

I was wondering if there was any way to detect a page's genre/category.
Possibly there is a way to find keywords or something?
Unfortunately I don't have any idea so far, so I don't have any code to show you.
But if anybody has any ideas at all, let me know.
Thanks!
EDIT #Nican
Perhaps there is a way to set, let's say, 10 categories (Entertainment, Funny, Tech).
Then create keywords for these categories (Funny = Laughter, Funny, Joke, etc.).
Then search through a webpage (maybe using cURL) for these keywords and assign it to the right category.
Hope that makes sense.
What you are talking about is basically what Google Adsense and similar services do, and it's based on analyzing the content of a page and matching it to topics. Generally, this kind of stuff is beyond what you would call simple programming / development and would require significant resources to be invested to get it to work "right".
A basic system might work along the following lines:
Get page content
Get X most commonly used words (omitting stuff like "and" "or" etc.)
Get words used in headings
Assign weights to different words according to a set of factors (is used in heading, is used in more than one paragraph, is used in link anchors)
Match the filtered words against a database of words related to a specific "category"
If cumulative score > threshold, classify site as belonging to category
Rinse and repeat
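A toy version of steps 5 and 6, assuming $wordWeights (word => weight) was produced by the earlier steps; the category lists and threshold are made up:
// Hand-made keyword lists per category, as in the question's edit.
$categories = array(
    'Funny' => array('laughter', 'funny', 'joke'),
    'Tech'  => array('software', 'computer', 'php'),
);
$threshold = 5; // arbitrary cutoff

$scores = array();
foreach ($categories as $name => $keywords) {
    $scores[$name] = 0;
    foreach ($keywords as $kw) {
        if (isset($wordWeights[$kw])) {
            $scores[$name] += $wordWeights[$kw]; // cumulative score per category
        }
    }
}
arsort($scores);
if (reset($scores) >= $threshold) {
    echo 'Classified as: ', key($scores), "\n";
}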
Folksonomy may be a way of accomplishing what you're looking for:
http://en.wikipedia.org/wiki/Folksonomy
For instance, in Drupal they have a Folksonomy module:
http://drupal.org/node/19697 (Note this module appears to be dead, see http://drupal.org/taxonomy/term/71)
Couple that with a tag cloud generator, and you may get somewhere:
http://drupal.org/project/searchcloud
Plus, a little more complexity may be able to derive mapped relationships to other terms, especially if you control the structure of the tagging options.
http://intranetblog.blogware.com/blog/_archives/2008/5/22/3707044.html
EDIT
In general, the type of system you're trying to build relies on unique word values on a page. So you would need to...
Get unique word values from your content (index values or create a bot to crawl your site)
Remove all words and symbols you can't use (at, the, or, and, etc...)
Count the number of times the unique words appear on the page
Add them to some type of datastore so you can call them based on the relationships you're mapping
If you have a root label system in place, associate those values with the word counts on the page (such as a query or derived table)
This is very general, and there are a number of ways this can be implemented/interpreted. Folksonomies are meant to "crowdsource" much of the effort for you, in a "natural way", as long as you have a user base that will contribute.

How do I design a web interface for browsing text man pages?

I would like to design a web app that allows me to sort, browse, and display various attributes (e.g. title, tag, description) for a collection of man pages.
Specifically, these are R documentation files within an R package that houses a collection of data sets, maintained by several people in an SVN repository. The format of these files is .Rd, which is LaTeX-like, but different.
R has functions for converting these man pages to html or pdf, but I'd like to be able to have a web interface that allows users to click on a particular keyword, and bring up a list (and brief excerpts) for those man pages that have that keyword within the \keyword{} tag.
Also, the generated html is somewhat ugly and I'd like to be able to provide my own CSS.
One obvious option is to load all the metadata I desire into a database like MySQL and design my site to run queries and fetch the appropriate data.
I'd like to avoid that to minimize upkeep for future maintainers. The number of files is small (<500) and the amount of data is small (only a couple of hundred lines per file).
My current leaning is to have a script that pulls the desired metadata from each file into a summary JSON file and load this summary.json file in PHP, decode it, and loop through the array looking for those items that have attributes that match the current query (e.g. all docs with keyword1 AND keyword2).
I was starting in that direction with the following...
$contents = file_get_contents("summary.json");
$c = json_decode($contents, true);
foreach ($c as $ind => $val) {
    // e.g. keep docs whose (hypothetical) 'keywords' array contains every keyword in $queryKeywords
    if (count(array_diff($queryKeywords, $val['keywords'])) == 0) { $results[] = $val; }
}
Another idea was to write a script that would convert these .Rd files to XML. In that case, are there any lightweight frameworks that make it easy to sort and search a small collection of XML files?
I'm not sure if XQuery is overkill or if I have time to dig into it...
I think I'm suffering from too-many-options-syndrome with all the AJAX temptations. Any help is greatly appreciated.
I'm looking for a super simple solution. How might some of you out there approach this?
My approach would be to parse the keywords from the files (from your description I assume they have a special notation to distinguish them from normal words/text) and store this data as a search index somewhere. It does not have to be MySQL; SQLite would surely be enough for your project.
A search would then be very simple.
Parsing the files could be automated as a post-commit hook in your Subversion repository.
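A minimal sketch of that parsing step, pulling \keyword{} values out of the .Rd files and into SQLite (the path and table layout are made up):
$pdo = new PDO('sqlite:manpages.db');
$pdo->exec('CREATE TABLE IF NOT EXISTS keywords (file TEXT, keyword TEXT)');
$insert = $pdo->prepare('INSERT INTO keywords (file, keyword) VALUES (?, ?)');

foreach (glob('man/*.Rd') as $file) {
    $text = file_get_contents($file);
    // collect every \keyword{...} tag in the file
    if (preg_match_all('/\\\\keyword\{([^}]+)\}/', $text, $m)) {
        foreach ($m[1] as $keyword) {
            $insert->execute(array(basename($file), trim($keyword)));
        }
    }
}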
Why don't you create a table SUMMARIES with a column for each of the summary's fields?
Then you could index it with a full-text index, assigning a different weight to each field.
You don't need MySQL; you can use SQLite, which has Google's full-text indexing (FTS3) built in.
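A small sketch of the FTS3 route from PHP (column names are hypothetical):
$pdo = new PDO('sqlite:manpages.db');

// Every column of an FTS3 virtual table is full-text indexed.
$pdo->exec('CREATE VIRTUAL TABLE IF NOT EXISTS summaries
            USING fts3(title, description, keywords)');

// MATCH searches all columns; multiple terms are an implicit AND,
// and "keywords:foo" would restrict a term to one column.
$stmt = $pdo->prepare('SELECT title FROM summaries WHERE summaries MATCH ?');
$stmt->execute(array('keyword1 keyword2'));
foreach ($stmt as $row) {
    echo $row['title'], "\n";
}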
