Parsing Wiki API content - php

I have this wiki from the API http://fr.wikipedia.org/w/api.php?action=query&titles=%C9rythropo%EF%E9tine&prop=revisions&rvprop=content&format=xmlfm
from which I would like to retrieve the main content, starting at:
L''''érythropoïétine''' ('''EPO''') est une [[hormone]] ......etc
As a start, I tried to preg_replace everything from the opening "{{Chimiebox..." at the top down to the closing "}}", using this:
preg_replace( '/^{{(.*)}}$/sim', '', $value[0]['*'] );
But it doesn't quite work. Does anyone know of a good way to determine where the content starts? Thanks for any advice.

Well, AFAIK most projects use the Wikipedia parser directly, e.g. the Wikipedia Offline Client project at my university. Since you seem to be using PHP, this may be the easiest way for you.
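If you would rather stay with the API, here is a minimal sketch (my own, not the project's code) that lets MediaWiki's own parser do the work by requesting rendered HTML via action=parse:
// Sketch: ask the API for the article already parsed to HTML (action=parse),
// then strip the markup. Assumes allow_url_fopen is enabled.
$url = 'https://fr.wikipedia.org/w/api.php?action=parse'
     . '&page=' . urlencode('Érythropoïétine')
     . '&prop=text&format=json';
$data = json_decode(file_get_contents($url), true);
$html = $data['parse']['text']['*']; // rendered article HTML
echo strip_tags($html);              // crude plain-text version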

Related

CakePHP: How can I make excerpt safer?

I use $this->Text->excerpt() in TextHelper to excerpt my post description. But I realize that isn't safe: it may break up my web layout, e.g.
<p>Advanced SystemCare 7 PRO provides automated and all-in-one PC care service with Malware Removal</p>
<p>It also creates...
The excerpt may be cut off without the closing </p> tag and break up my layout.
So if you have any solution, please help me. Thanks.
The way I got around it was to strip the tags beforehand.
$searchDisplayText = strip_tags($modelItemDetails[$model->alias][$fieldName]);
$searchDisplayTextTruncated = String::excerpt($searchDisplayText, $options['keyword'], $settings['excerptLength'], '...');
You would need to do this anyway: you probably don't want images or other unsafe items in your results, only text.

How to get Wikipedia "clean" content?

I'm using Mediawiki api in order to get content from Wikipedia pages.
I've written code which generates the following query (for example):
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=hawaii
This retrieves only the leading paragraph from the Wikipedia page about Hawaii.
The problem is that, as you might notice, there are a lot of irrelevant substrings, such as:
"[[Molokai|Moloka{{okina}}i]], [[Lanai|Lāna{{okina}}i]], [[Kahoolawe|Kaho{{okina}}olawe]], [[Maui]] and the [[Hawaii (island)|".
All those brackets [[ ]] are not relevant, and I wonder whether there is an elegant method to pull only 'clean' content from such pages?
Thanks in advance.
You can get a clean HTML text from Wikipedia with this query:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=hawaii
If you want just a plain text, without HTML, try this:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=hawaii&explaintext
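In PHP that might look like this (a sketch; it adds format=json so the response can be decoded, and the field names follow the documented query/pages response shape):
// Fetch the plain-text extract for "hawaii" via the extracts query above.
$url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts'
     . '&explaintext&titles=hawaii&format=json';
$data = json_decode(file_get_contents($url), true);
foreach ($data['query']['pages'] as $page) {
    echo $page['extract']; // the clean article text
}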
Please try this (the brackets have to be escaped, otherwise [[.*?]] is read as a character class):
$relevant = preg_replace('/\[\[.*?\]\]/', '', $string);
EDIT: just found this - hope it is helpful
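If you would rather keep the link labels than drop the links entirely, here is a variant sketch (mine, not from the answer above) that replaces [[target|label]] with the label:
// Keep the visible label of each wiki link instead of deleting it:
// [[Hawaii (island)|Hawaii]] becomes "Hawaii", [[Maui]] becomes "Maui".
// A sketch only; templates like {{okina}} still need separate handling.
$relevant = preg_replace('/\[\[(?:[^|\]]*\|)?([^\]]*)\]\]/', '$1', $string);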

Changing/deleting HTML from file_get_contents

I'm currently using this code:
$blog= file_get_contents("http://powback.tumblr.com/post/" . $post);
echo $blog;
And it works. But Tumblr has added a script that activates each time you enter a password field. So my question is:
Can I remove certain parts with file_get_contents? Or just remove everything above the <html> tag? Could I possibly kill a whole div so it won't load at all? And if so, how?
edit:
I managed to do it the simple way, by skipping the first 766 characters. The script now works as intended!
$blog = file_get_contents("http://powback.tumblr.com/post/" . $post, false, null, 766);
After file_get_contents returns, you have in your hands a string. You can do anything you want to it, including cutting out parts of it.
There are two ways to actually do the cutting:
Using string functions like str_replace, preg_replace and others; the exact recipe depends on what you need to do. This approach is somewhat frowned upon because you are working at the wrong level of abstraction, but in some cases it has an unmatched performance-to-effort ratio.
Parsing the HTML into a DOM tree, modifying it appropriately (this time working at the appropriate level of abstraction), then turning it back into a string and echoing it. This is more convenient if your requirements are not dead simple, and it is easier to maintain, but it typically requires more code.
If you want to do something that's most naturally expressed in HTML document terms ("cutting out this <div>"), don't be tempted by string functions; go with the second approach.
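For the second approach, a minimal sketch using PHP's built-in DOM extension (the div id "unwanted" is just an example):
// Load the fetched HTML, remove a div by id, and output the rest.
$doc = new DOMDocument();
@$doc->loadHTML($blog); // @ silences warnings from imperfect real-world HTML
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//div[@id="unwanted"]') as $node) {
    $node->parentNode->removeChild($node); // drop the whole div
}
echo $doc->saveHTML();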
At that point, $blog is just a string, so you can use normal PHP functions to alter it. Look into these two:
http://php.net/manual/en/function.str-replace.php
http://us2.php.net/manual/en/function.preg-replace.php
You can parse your output using Simple HTML DOM Parser and display only the contents that you really want to display.

PHP Summarize any URL

How can I, in PHP, get a summary of any URL? By summary, I mean something similar to the URL descriptions in Google web search results.
Is this possible? Is there already some kind of tool I can plug in to so I don't have to generate my own summaries?
I don't want to use metadata descriptions if possible.
-Dylan
What displays in Google is (generally) the META description tag. If you don't want to use that, you could use the page title instead though.
If you don't want to use metadata descriptions (btw, this is exactly what they are for), you have a lot of research and work to do. Essentially, you have to guess which part of the page is content and which is just navigation/fluff. Indeed, Google does exactly that; note, however, that extracting valuable information from useless fluff is their #1 competency, and they've been researching and improving it for a decade.
You can, of course, make an educated guess (e.g. "look for an element with ID or class maincontent" and get the first paragraph from it) and maybe it will be OK. The real question is, how good do you want the results to be? (Facebook has something similar for linking to websites; sometimes the summary just insists that an ad is the main content.)
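To make that educated guess concrete, here is a rough sketch (the id/class name "maincontent" and the URL are assumptions; real pages vary wildly):
// Guess the main content: first paragraph inside an element whose
// id or class looks like "maincontent".
$doc = new DOMDocument();
@$doc->loadHTML(file_get_contents('http://www.example.com'));
$xpath = new DOMXPath($doc);
$paras = $xpath->query('//*[@id="maincontent" or contains(@class, "maincontent")]//p');
if ($paras->length > 0) {
    echo trim($paras->item(0)->textContent); // our best guess at a summary
}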
The following will allow you to parse the contents of a page's title tag. Note: PHP must be configured to allow file_get_contents to retrieve URLs (allow_url_fopen). Otherwise you'll have to use cURL to retrieve the page HTML.
$title_open = '<title>';
$title_close = '</title>';
$page = file_get_contents( 'http://www.domain.com' );
$n = stripos( $page, $title_open ) + strlen( $title_open );
$m = stripos( $page, $title_close);
$title = substr( $page, $n, $m - $n );
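If allow_url_fopen is disabled, a cURL fetch could stand in for file_get_contents (a sketch, reusing the same example URL):
// Retrieve the page HTML with cURL instead of file_get_contents.
$ch = curl_init('http://www.domain.com');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
$page = curl_exec($ch);
curl_close($ch);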
While I hate promoting a service, I have found this:
embed.ly
It has an API that returns JSON with all the data you need.
But I am still searching for a free/open-source library to do the same thing.

Using PHP PCRE to fetch div content

I'm trying to fetch data from a div (based on its id) using PHP's PCRE. The goal is to fetch the div's contents based on its id, using recursion / depth to get everything inside it. The main problem is getting at other divs inside the "main div", because the regex would stop at the first </div> it finds after the initial <div id="test">.
I've tried so many different approaches to this, and none of them worked. The best solution, in my opinion, is to use the (?R) construct (recursion), but I never got it to work properly.
Any ideas?
Thanks in advance :D
You'd be much better off using some form of DOM parser; regex really isn't suited to this problem. If all you want is basic HTML DOM parsing, something like simplehtmldom would be right up your alley. It's trivial to install (just include a single PHP file) and trivial to use (2-3 lines will do what you need).
include('simple-html-dom.php');
$dom = str_get_html($bunchofhtmlcode);
$testdiv = $dom->find('div#test',0); // 0 for the first occurrence
$testdiv_contents = $testdiv->innertext;
