How do I grab just the parsed Infobox of a wikipedia article? - php

I'm still stuck on my problem of trying to parse articles from Wikipedia. Specifically, I want to parse the infobox section of Wikipedia articles: my application has references to countries, and on each country page I would like to show the infobox from the corresponding Wikipedia article for that country. I'm using PHP here - I would greatly appreciate any code snippets or advice on what I should be doing.
Thanks again.
EDIT
Well, I have a db table with the names of countries, and I have a script that takes a country and shows its details. I would like to grab the infobox - the blue box with all the country details, images, etc. - exactly as it appears on Wikipedia, and show it on my page. I would like to know a really simple and easy way to do that, or to have a script that just downloads the infobox information to a local or remote system which I could access myself later on. I mean, I'm open to ideas here - except that the end result I want is to see the infobox on my page - of course with a little "Content by Wikipedia" link at the bottom :)
EDIT
I think I found what I was looking for on http://infochimps.org - they have loads of datasets, I think in the YAML format. I can use this information straight up as it is, but I would need a way to update it from Wikipedia every now and then, although I believe infoboxes rarely change, especially for countries, unless some nation decides to change its capital city or so.

I'd use the Wikipedia (MediaWiki) API. You can get data back in JSON, XML, PHP native format, and others. You'll then still need to parse the returned information to extract and format what you want, but the infobox start, stop, and information types are clear.
Run your query for just rvsection=0, as this first section gets you the material before the first section break, including the infobox. Then you'll need to parse the infobox content, which shouldn't be too hard. See en.wikipedia.org/w/api.php for the formal Wikipedia API documentation, and www.mediawiki.org/wiki/API for the manual.
Run, for example, the query: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xmlfm&titles=fortran&rvsection=0
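For instance, a minimal PHP sketch along those lines (the title is just an example and the brace matching is deliberately naive; real infoboxes can contain nested templates, so a robust parser needs more care):
$title = "Germany"; // hypothetical: whatever country name your table holds
$url = "https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content"
     . "&format=json&rvsection=0&titles=" . urlencode($title);
// A descriptive User-Agent is polite when hitting the Wikimedia API
$context = stream_context_create(array('http' => array('header' => "User-Agent: MyCountryApp/1.0\r\n")));
$data = json_decode(file_get_contents($url, false, $context), true);
$page = current($data['query']['pages']);
$wikitext = $page['revisions'][0]['*'];
// Naively grab the infobox template: from "{{Infobox" up to the first "}}" on its own line
if (preg_match('/\{\{Infobox.*?\n\}\}/s', $wikitext, $m)) {
    echo $m[0]; // raw wikitext; still needs rendering or per-field parsing
}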

I suggest you use DBpedia instead, which has already done the work of turning the data in Wikipedia into usable, linkable, open forms.

It depends which route you want to take. Here are some possibilities:
Install MediaWiki with appropriate modifications. It is, after all, a PHP app designed precisely to parse wikitext...
Download the static HTML version, and parse out the parts you want.
Use the Wikipedia API with appropriate caching (a minimal caching sketch follows this list).
DO NOT just hit the latest version of the live page and redo the parsing every time your app wants the box. This is a huge waste of resources for both you and Wikimedia.
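For example, a minimal file-based caching sketch, assuming the rvparse approach shown further down (the cache location and one-day lifetime are arbitrary choices):
function get_infobox_section_html($title) {
    $cacheFile = sys_get_temp_dir() . '/wiki_' . md5($title) . '.html';
    // Serve the cached copy if it is less than a day old
    if (is_file($cacheFile) && filemtime($cacheFile) > time() - 86400) {
        return file_get_contents($cacheFile);
    }
    $url = "https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content"
         . "&format=json&rvsection=0&rvparse&titles=" . urlencode($title);
    $data = json_decode(file_get_contents($url), true);
    $page = current($data['query']['pages']);
    $html = $page['revisions'][0]['*'];
    file_put_contents($cacheFile, $html);
    return $html;
}

echo get_infobox_section_html("Niger"); // rendered HTML of section 0; extract the first <table> from it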

There are a number of semantic data providers from which you can extract structured data instead of trying to parse it manually:
DBpedia - as already mentioned, provides a SPARQL endpoint which can be used for data queries. There are a number of libraries available for multiple platforms, including PHP (a minimal sketch follows this list).
Freebase - another Creative Commons data provider. The initial dataset is based on parsed Wikipedia data, but some information is taken from other sources. The data set can be edited by anyone and, in contrast to Wikipedia, you can add your own data into your own namespace using a custom-defined schema. It uses its own query language called MQL, which is based on JSON. Data has WebID links back to the corresponding Wikipedia articles. Freebase also provides a number of downloadable data dumps and has client libraries for several languages, including PHP.
Geonames - a database of geographical locations. Has an API which provides country and region information for given coordinates, as well as nearby locations (e.g. city, railway station, etc.).
OpenStreetMap - a community-built map of the world. Has an API that allows querying for objects by location and type.
Wikimapia API - another location service.
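A minimal sketch of a DBpedia SPARQL query from PHP (dbo:capital and dbo:populationTotal are DBpedia ontology properties, but check the ontology for the exact fields you need; error handling is omitted):
// Ask the public DBpedia endpoint for the capital and population of a country
$sparql = 'PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?capital ?population WHERE {
  <http://dbpedia.org/resource/Germany> dbo:capital ?capital ;
                                        dbo:populationTotal ?population .
}';
$url = "https://dbpedia.org/sparql?" . http_build_query(array(
    'query'  => $sparql,
    'format' => 'application/sparql-results+json',
));
$result = json_decode(file_get_contents($url), true);
foreach ($result['results']['bindings'] as $row) {
    echo $row['capital']['value'] . " / " . $row['population']['value'] . "\n";
}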

To load the parsed first section, simply add this parameter to the end of the API URL:
rvparse
Like this:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xmlfm&titles=fortran&rvsection=0&rvparse
Then parse the HTML to get the infobox table (using a regex):
$url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Niger&rvsection=0&rvparse";
$data = json_decode(file_get_contents($url), true);
$data = current($data['query']['pages']);
$regex = '#<\s*?table\b[^>]*>(.*)</table\b[^>]*>#s';
$code = preg_match($regex, $data["revisions"][0]['*'], $matches);
echo($matches[0]);

If you want to parse all of the articles in one go, Wikipedia makes complete database dumps available in XML format:
http://en.wikipedia.org/wiki/Wikipedia_database
Otherwise you can screen scrape individual articles.

To update this a bit: a lot of the data in Wikipedia infoboxes is now taken from Wikidata, which is a free database of structured information. See the data page for Germany for example, and https://www.wikidata.org/wiki/Wikidata:Data_access for information on how to access the data programmatically.
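For example, a minimal PHP sketch against the Special:EntityData endpoint (Q183 is the Wikidata item for Germany and P36 its "capital" property; treat the exact JSON paths as something to verify against the live response):
// Fetch the full Wikidata item for Germany and read its capital claim
$json = json_decode(file_get_contents("https://www.wikidata.org/wiki/Special:EntityData/Q183.json"), true);
$entity = $json['entities']['Q183'];
echo $entity['labels']['en']['value'], "\n";   // "Germany"
$capital = $entity['claims']['P36'][0]['mainsnak']['datavalue']['value']['id'];
echo $capital, "\n";                            // another item ID (Berlin) you can resolve the same way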

import re

import requests
from bs4 import BeautifulSoup


def extract_infobox(term):
    """Scrape the infobox of an English Wikipedia article into a dict of row label -> text."""
    url = "https://en.wikipedia.org/wiki/" + term
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    tbl = soup.find("table", {"class": "infobox"})
    if not tbl:
        return {}
    info = {}
    for tr in tbl.findAll('tr'):
        th = tr.find("th")
        td = tr.find("td")
        if th is not None and td is not None:
            innerText = ''
            for elem in td.recursiveChildGenerator():
                if isinstance(elem, str):
                    # remove reference markers such as [1]
                    clean = re.sub(r"([\[]).*?([\]])", r"\g<1>\g<2>", elem.strip())
                    # add a simple space after removing references for word separation
                    innerText += clean.replace('[]', '') + ' '
                elif elem.name == 'br':
                    innerText += '\n'
            info[th.text] = innerText
    return info

I suggest performing a web request against Wikipedia. From there you will have the page, and you can simply parse or query out the data that you need using a regex, a character crawl, or some other form that you are familiar with. Essentially a screen scrape!
EDIT - I would add to this answer that you can use HtmlAgilityPack for those in C# land. For PHP, SimpleHtmlDom looks like the equivalent. Having said that, it looks like Wikipedia has a more than adequate API. This question probably answers your needs best:
Is there a Wikipedia API?

Related

Getting data from website and policy to get it (legal or not)

Is it possible to get data from airline websites (e.g. the ticket details: origin, destination, date and time, duration, price) in PHP? And doesn't that amount to stealing the data, since there are GDS companies which sell the data for all possible flights? So is parsing an airline's site legal, or not?
I was trying to get the page source from a URL, like file_get_contents("URL of a searched flight from e.g. wizzair.com"), but I got different code from what I can see in the browser (Ctrl + Shift + I or F12):
$url = "url of any searched flight";
$file = file_get_contents($url);
$pattern = '#<form id=" one of from, to, date, price* .+?</form>#s';
preg_match($pattern, $file, $matches);
In short: I want to get flight ticket data (origin, destination, date and time, duration, price) from any airline website, and to know the policy on getting such data (legal or not?).
What you want is harder and harder these days.
You used to be able to simply get the HTML (be it dynamically generated or not) via file_get_contents() and the like, and find the parts you care about. While this CAN be hard and error prone, it could be done. (When the webpage changed, you needed to adjust your scraping routine.)
Nowadays your average webpage is more like a JavaScript Christmas tree: all the parts are fetched from different places, and the page is built in the browser, so the actual source HTML is much harder to interpret.
You could have a deeper look at the HTML in the page and hope for an easy-to-use JavaScript call to some service that fetches the data based on some query. The creators of the page could have built it that sloppily.
Whether it is legal for your research is, of course, your own call.

Autocompleteplus how to detect user language(Real language not browser)?

I'm looking for a way to get a user's default language based on their country. For example, I have Windows in English, but I would still like to get my country's two-letter language code ("cs").
You can see an example of what I want in the source code of http://search.conduit.com/ (which uses Autocompleteplus as well). This is what I see:
window.language = "en-us";
window.countryCode = "cz";
window.suggestBaseUrl = "http://api.autocompleteplus.com/?q=UCM_SEARCH_TERM&l=cs&c=cz&callback=acp_new";
You can see the API URL contains "l=cs&c=cz". How did they get this information? I would like to have the same thing; I use the same Autocompleteplus method, I just need a way to generate the l=(user's true language)&c=(country code) values. Performance is important as well, since these are autosuggestions for my website search.
This is Ed from AutoComplete+. Getting the user country is typically done when using our API through a server-side implementation. There are, however, some open APIs that can assist you. Regardless, you can use our autocomplete feed without the user country. Feel free to contact us directly for further info at http://www.autocompleteplus.com
Thanks,
--ed
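As a rough illustration of the server-side route, here is a sketch combining a free geo-IP lookup with the browser's Accept-Language header; ip-api.com and its response fields are an assumption, so verify the service and its usage terms before relying on it:
// Country from the visitor's IP via a third-party geo-IP service (an assumed example service)
$geo = json_decode(file_get_contents("http://ip-api.com/json/" . $_SERVER['REMOTE_ADDR']), true);
$country = isset($geo['countryCode']) ? strtolower($geo['countryCode']) : 'us';
// Language from the browser's Accept-Language header, e.g. "cs,en-US;q=0.8"
$accept = isset($_SERVER['HTTP_ACCEPT_LANGUAGE']) ? $_SERVER['HTTP_ACCEPT_LANGUAGE'] : 'en';
$lang = strtolower(substr($accept, 0, 2));
echo "l={$lang}&c={$country}";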

Reading XML into PHP

I'm trying to determine the best course of action for the display of data for a project I'm working on. My client is currently using a proprietary CMS geared towards managing real estate. It easily lets you add properties, square footage, price, location, etc. The company that runs this CMS provides the data in a pretty straightforward XML file that they say offers access to all of the data my client enters.
I've read up on PHP5's SimpleXML feature and I grasp the basic concepts well enough, but my question is: can I access the XML data in a similar fashion as if I were querying a MySQL database?
For instance, assuming each entry has a unique ID, will I be able to set up a view and display just that record using a URL variable like: http://example.com/apartment.php?id=14
Can you also display results based on values within strings? I'm thinking of a form submit that returns only two-bedroom properties in this case.
Sorry in advance if this is a noob question. I'd rather not build a custom CMS for my client, if for no other reason than that they'd then only have to log in to one location and update accordingly.
Some short answers to your questions:
a. Yes, you can access XML data with queries, but using XPath instead of SQL. XPath is for XML what SQL is for databases, though it works quite differently.
b. Yes, you can build a PHP program that receives an id as a parameter and uses it for an XPath search on a given XML file.
c. All data in an XML file is a string, so it is no problem to search for or display strings. Even your example id=14 is handled as a string.
You might be interested in this further information:
http://www.ibm.com/developerworks/library/x-simplexml.html?S_TACT=105AGX06&S_CMP=LP
http://www.ibm.com/developerworks/library/x-xmlphp1.html?S_TACT=105AGX06&S_CMP=LP
PHP can access XML not only via SimpleXML but also with DOM. SimpleXML accesses elements like PHP arrays; DOM provides a W3C-DOM-compatible API.
See php.net for other ways to access XML, but they do not seem appropriate for your case.
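A minimal SimpleXML/XPath sketch covering both cases, assuming the feed looks roughly like <properties><property id="14"><title>Loft</title><bedrooms>2</bedrooms></property></properties> (the element and attribute names are guesses based on the description):
$xml = simplexml_load_file("properties.xml");   // the CMS feed

// apartment.php?id=14 - look up a single record by its unique id attribute
$id = (int) $_GET['id'];
$match = $xml->xpath("//property[@id='$id']");
if ($match) {
    echo (string) $match[0]->title;
}

// Form submit - list only the two-bedroom properties
foreach ($xml->xpath("//property[bedrooms='2']") as $property) {
    echo (string) $property->title, "\n";
}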

How to do search with out using the scripting language and by using jquery?

I have an HTML site with around 100 HTML files, and I want to develop a search feature for it. If the user types any word and hits search, I want to display the related content containing that keyword. Is it possible to do this without using any server-side scripting? And is it possible to implement it using jQuery or JavaScript? Please let me know if you have any ideas!
Thanks in advance.
Possible? Yes. You can download all the files via AJAX, save their contents in an array of strings, and search the array.
The performance however would be dreadful. If you need full text search, then for any decent performance you will need a database and a special fulltext search engine.
Three approaches:
Series of AJAX indexing requests: very slow, not recommended.
Use a DB to store key terms/page references and perform a fulltext search.
Utilise off-the-shelf functionality, such as that offered by Google.
The only way this can work is if you have a list of all the pages on the page you are searching from. So you could do this:
pages = new Array("page1.htm","page2.htm"...)
and so on. The problem with that is that to search for the results, the browser would need to do a GET request for every page:
for (var i in pages)
$.get(pages[i], function (result) { searchThisPage(result) });
Doing that 100 times would mean a long wait for the user. Another way I can think of is to have all the content served in an array:
pages = {
"index" : "Some content for the index",
"second_page" : "Some content for the second page",
etc...
}
Then each page could reference this one script to get all the content, include the content for itself in its own content section, and use the rest for searching. If you have a lot of data, this would be a lot to load in one go when the user first arrives at your site.
The final option I can think of is to use the Google search API: http://code.google.com/apis/customsearch/v1/overview.html
Quite simply - no.
Client-side javascript runs in the client's browser. The client does not have any way to know about the contents of the documents within your domain. If you want to do a search, you'll need to do it server-side and then return the appropriate HTML to the client.
The only way to technically do this client-side would be to send the client all the data about all of the documents, and then get them to do the searching via some JS function. And that's ridiculously inefficient, such that there is no excuse for getting them to do so when it's easier, lighter-weight and more efficient to simply maintain a search database on the server (likely through some nicely-packaged third party library) and use that.
some useful resources
http://johnmc.co/llum/how-to-build-search-into-your-site-with-jquery-and-yahoo/
http://tutorialzine.com/2010/09/google-powered-site-search-ajax-jquery/
http://plugins.jquery.com/project/gss
If your site is allowing search engine indexing, then fcalderan's approach is definitely the simplest approach.
If not, it is possible to generate a text file that serves as an index of the HTML files. This would probably be only rudimentarily successful, but it is possible. You could use something like the keywording approach in Toby Segaran's book to build a JSON text file. Then use jQuery to load the text file, find the instances of the keywords, de-duplicate the resulting filenames, and display the results.

How do I design a web interface for browsing text man pages?

I would like to design a web app that allows me to sort, browse, and display various attributes (e.g. title, tag, description) for a collection of man pages.
Specifically, these are R documentation files within an R package that houses a collection of data sets, maintained by several people in an SVN repository. The format of these files is .Rd, which is LaTeX-like, but different.
R has functions for converting these man pages to html or pdf, but I'd like to be able to have a web interface that allows users to click on a particular keyword, and bring up a list (and brief excerpts) for those man pages that have that keyword within the \keyword{} tag.
Also, the generated html is somewhat ugly and I'd like to be able to provide my own CSS.
One obvious option is to load all the metadata I desire into a database like MySQL and design my site to run queries and fetch the appropriate data.
I'd like to avoid that to minimize upkeep for future maintainers. The number of files is small (<500) and the amount of data is small (only a couple of hundred lines per file).
My current leaning is to have a script that pulls the desired metadata from each file into a summary JSON file and load this summary.json file in PHP, decode it, and loop through the array looking for those items that have attributes that match the current query (e.g. all docs with keyword1 AND keyword2).
I was starting in that direction with the following...
$contents=file_get_contents("summary.json");
$c=json_decode($contents,true);
foreach ($c as $ind=>$val ) { .... etc
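Continuing in that direction, a short sketch of the AND-keyword filter, assuming each entry in summary.json has "title" and "keywords" fields (both names are assumptions):
$docs = json_decode(file_get_contents("summary.json"), true);
$wanted = array("keyword1", "keyword2");   // e.g. taken from $_GET
$hits = array_filter($docs, function ($doc) use ($wanted) {
    // keep only docs whose keyword list contains every requested keyword
    return count(array_intersect($wanted, $doc['keywords'])) === count($wanted);
});
foreach ($hits as $doc) {
    echo $doc['title'], "\n";   // or print an excerpt, link to the rendered man page, etc.
}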
Another idea was to write a script that would convert these .Rd files to XML. In that case, are there any lightweight frameworks that make it easy to sort and search a small collection of XML files?
I'm not sure if XQuery is overkill or if I have time to dig into it...
I think I'm suffering from too-many-options-syndrome with all the AJAX temptations. Any help is greatly appreciated.
I'm looking for a super simple solution. How might some of you out there approach this?
My approach would be to parse the keywords from the files (from your description I assume they have a special notation to distinguish them from normal words/text) and store this data as a search index somewhere. It does not have to be MySQL; SQLite would surely be enough for your project.
A search would then be very simple.
Parsing the files could be automated as a post-commit hook on your Subversion repository.
Why don't you create a table SUMMARIES with a column for each of the summary's fields?
Then you could index it with a full-text index, assigning a different weight to each field.
You don't need MySQL; you can use SQLite, which has Google's full-text indexing (FTS3) built in.
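A minimal sketch of that with PDO and SQLite FTS3 (this assumes your PHP build has the pdo_sqlite extension and an SQLite library compiled with FTS enabled):
$db = new PDO('sqlite:summaries.db');
// Create the full-text table (first run only)
$db->exec("CREATE VIRTUAL TABLE summaries USING fts3(title, keywords, description)");

// Index one man page (e.g. from the post-commit hook / import script)
$ins = $db->prepare("INSERT INTO summaries (title, keywords, description) VALUES (?, ?, ?)");
$ins->execute(array("mydataset", "survey regression", "A sample data set of household surveys"));

// Full-text search across all indexed fields
$q = $db->prepare("SELECT title FROM summaries WHERE summaries MATCH ?");
$q->execute(array("keyword1 keyword2"));
foreach ($q->fetchAll(PDO::FETCH_COLUMN) as $title) {
    echo $title, "\n";
}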
