How to get Wikipedia "clean" content?

I'm using the MediaWiki API to get content from Wikipedia pages.
My code generates a query like the following (for example):
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=hawaii
This retrieves only the lead section of the Wikipedia page about Hawaii.
The problem is that, as you might notice, the result contains a lot of irrelevant wiki markup, such as:
"[[Molokai|Moloka{{okina}}i]], [[Lanai|Lāna{{okina}}i]], [[Kahoolawe|Kaho{{okina}}olawe]], [[Maui]] and the [[Hawaii (island)|".
All those [[ ]] brackets are not relevant, and I wonder whether there is an elegant way to pull only 'clean' content from such pages.
Thanks in advance.

You can get clean HTML text from Wikipedia with this query:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=hawaii
If you want plain text without HTML, try this:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=hawaii&explaintext
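A minimal PHP sketch of using that query from code. The helper names are hypothetical, and the extra parameters format=json and exintro (lead section only) are assumptions added here, not part of the answer above. The HTTP request itself is left out, so only URL building and response parsing are shown.

```php
<?php
// Build the extracts query URL for a page title.
// format=json and exintro are assumptions added for this sketch.
function wiki_extract_url(string $title): string
{
    $params = [
        'action'      => 'query',
        'prop'        => 'extracts',
        'titles'      => $title,
        'explaintext' => 1,
        'exintro'     => 1,
        'format'      => 'json',
    ];
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query($params, '', '&');
}

// Pull the first "extract" field out of a JSON response body.
// Returns null when the response does not have the expected shape.
function wiki_extract_from_json(string $json): ?string
{
    $data = json_decode($json, true);
    if (!is_array($data) || empty($data['query']['pages'])) {
        return null;
    }
    foreach ($data['query']['pages'] as $page) {
        if (isset($page['extract'])) {
            return $page['extract'];
        }
    }
    return null;
}
```

The response body could then be fetched with something like file_get_contents(wiki_extract_url('hawaii')) and passed to wiki_extract_from_json().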

Please try this (the brackets must be escaped, otherwise they are read as a character class):
$relevant = preg_replace('/\[\[.*?\]\]/', '', $string);
EDIT: just found this - hope it is helpful
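Removing every [[...]] span also removes the visible link text. If the labels should stay, a rough sketch could keep the part after the pipe instead. The function name is hypothetical, and the handling of {{okina}} and other templates is an assumption based only on the sample string in the question; real wikitext has nested templates and other constructs this does not cover.

```php
<?php
// Reduce simple wiki markup to readable text (sketch, not a full parser).
function strip_wiki_markup(string $wikitext): string
{
    // Assumption: {{okina}} stands for the ʻokina character.
    $text = str_replace('{{okina}}', 'ʻ', $wikitext);
    // Drop any other non-nested {{template}} calls.
    $text = preg_replace('/\{\{[^{}]*\}\}/u', '', $text);
    // [[Target|Label]] becomes Label, [[Target]] becomes Target.
    $text = preg_replace('/\[\[(?:[^\]|]*\|)?([^\]]*)\]\]/u', '$1', $text);
    return $text;
}
```

For the string from the question, this keeps "Molokaʻi" and "Maui" as plain words rather than deleting them.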

Related

Convert HTML code to doc using PHP and PHPWord

I am using PHPWord to load a docx template and replace tags like {test}. This is working perfectly fine.
But I want to replace a value with HTML code. Directly replacing it into the template is not possible. There is no way to do this using PHPWord, as far as I know.
I looked at htmltodocx, but it seems it will not work either. Is it possible to transform a piece of code like <p>Test<b>test</b><br>test</p> into working doc markup? I only need the basic code, no styling, but line breaks have to work.

Here is the link to the GitHub project, Html-Docx-js. It works fine.
A demo is also available here.
Another option is this link:
$toOpenXML = HTMLtoOpenXML::getInstance()->fromHTML("<p>te<b>s</b>t</p>");
$templateProcessor->setValue('test', $toOpenXML);

The other answers propose H2OXML, which only supports:
Bold, italic and underlined text
Bulleted lists
This is described in its docs, and its last update was in 2012.
I did some research and found a pretty nice solution:
$var = 'Some text';
$xml = "<w:p><w:r><w:rPr><w:strike/></w:rPr><w:t>". $var."</w:t></w:r></w:p>";
$templateProcessor->setValue('param_1', $xml);
The example above shows how to produce struck-through text. Instead of "w:strike" you can use "w:i" for italic, "w:b" for bold, and so on. I'm not sure whether it works with all tags.
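One thing the snippet above does not do is escape the inserted text, so a value containing & or < would produce invalid XML. A small sketch of a helper that builds the same paragraph markup with escaping; the function name and the whitelist of tags are hypothetical, and the result would be passed to $templateProcessor->setValue() as in the answer.

```php
<?php
// Build a WordprocessingML paragraph with one styled run (sketch).
// $styleTag is limited to the tags mentioned in the answer.
function word_paragraph_xml(string $text, string $styleTag = 'strike'): string
{
    $allowed = ['b', 'i', 'strike'];
    if (!in_array($styleTag, $allowed, true)) {
        throw new InvalidArgumentException('Unsupported style tag: ' . $styleTag);
    }
    // Escape &, < and > so the text cannot break the surrounding XML.
    $escaped = htmlspecialchars($text, ENT_XML1 | ENT_NOQUOTES, 'UTF-8');
    return '<w:p><w:r><w:rPr><w:' . $styleTag . '/></w:rPr><w:t>'
        . $escaped . '</w:t></w:r></w:p>';
}
```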

Thanks for your answer, Varun.
The simple PHP library H2OXML works for me: https://h2openxml.codeplex.com/
$toOpenXML = HTMLtoOpenXML::getInstance()->fromHTML("<p>te<b>s</b>t</p>");
$templateProcessor->setValue('test', $toOpenXML);
I can now convert html code to insert it using PHPWord.

$content = '<p>Test<b>test</b><br>test</p>';
// Call this before IOFactory::createWriter():
\PhpOffice\PhpWord\Shared\Html::addHtml($section, $content);

preg_replace limit issue, handling array values

I've been working with the Sphider search engine for an internal website; we need to be able to quickly search for contact details in exported .htm(l) files.
$fulltxt = ereg_replace("[_A-Za-z0-9-]+(\.[_A-Za-z0-9-]+)*#[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*(\.[A-Za-z]{2,3})", "<a href=\"mailto:\\0\">\\0</a>", $fulltxt);
I am replacing e-mail addresses with a convenient mailto: link so users can open Outlook straight from the search results.
However,
while (preg_match("/[^\>](".$change.")[^\<]/i", " ".$fulltxt." ", $regs)) {
$fulltxt = preg_replace("/".$regs[1]."/i", "<b>".$regs[1]."</b>", $fulltxt);
}
This wraps every match in the search results in bold tags, which results in the tags being included in Outlook's 'To...' field. It looks something like this in HTML (thanks Yuriy):
<b>name</b>.surname#domain
I have tried adding a value to the 'limit' parameter:
while (preg_match("/[^\>](".$change.")[^\<]/i", " ".$fulltxt." ", $regs)) {
$fulltxt = preg_replace("/".$regs[1]."/i", "<b>".$regs[1]."</b>", $fulltxt, 1);
}
Supposedly this should solve my problem by replacing only the first occurrence (the name, since the pattern is name, phone number, email and we always search by name). Instead, it makes the search incredibly slow, to the point where I get a timeout message from the server. I've been trying various solutions but have been out of luck.
Any ideas? Am I doing something wrong?
Thanks.
(*Original heavily edited).

Did I understand you right that something like this happens?
<b>email#domain</b>
Why don't you put the tags into the search results first, and only then apply the "mailto:" anchors to the emails? The added <b> tags would be easy to filter out in the pattern in that second step.
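A rough sketch of that two-step order, under some assumptions: the function name is hypothetical, addresses are written with "@" (the samples above show "#"), and the email pattern is deliberately simple. Step 1 bolds the search term; step 2 matches addresses that may contain <b> tags and strips the tags only for the href.

```php
<?php
// Highlight a search term first, then turn emails into mailto links (sketch).
function highlight_and_link(string $text, string $term): string
{
    // Step 1: wrap every occurrence of the term in <b> tags.
    $bolded = preg_replace('/' . preg_quote($term, '/') . '/i', '<b>$0</b>', $text);

    // Step 2: match emails, allowing <b> or </b> inside them.
    $email = '/(?:<\/?b>|[\w.-])+@(?:<\/?b>|[\w-])+(?:\.(?:<\/?b>|[\w-])+)+/i';
    return preg_replace_callback($email, function (array $m): string {
        $address = strip_tags($m[0]); // clean address for the href
        return '<a href="mailto:' . $address . '">' . $m[0] . '</a>';
    }, $bolded);
}
```

Because each step is a single pass over the text, this also avoids the while loop that re-scans the string after every replacement.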

Find and replace problem

My website has two database tables. One of them is posts_table, which holds the posts, and the other one holds the videos.
At the moment I am getting the text, images, etc. normally from the post_table table.
In my CMS, when we add a video, a shortcode is added:
[media id=487 width=660 height=440]
This shortcode automatically gets the link of a video from the vid_table where the id is the same as the one in the shortcode.
So what I want is:
I need to do the same thing that the shortcode does. When a video is added in the CMS, the shortcode is shown in the post; I need to remove the shortcode and, in its place, play the video whose link is in the vid_table.
I have some problems with my English, so if you still don't understand, please tell me.
Any kind of help would be great.
Thank you.
EDITED: So I want to replace the whole media tag with a Flash player that plays the URL belonging to the ID in the media tag.
BUMP!! Can anyone help, please?

This is quite a sophisticated problem actually. I was bored and made a basic tag parser. Right now it has some problems:
HTML rendering should be implemented in a separate class (and a template engine such as Twig should do the rendering);
Tag parsing is way too naive and will probably give you unexpected results if a tag's syntax is incorrect;
[media] tag does not support IE. You would have to change the source itself (method TagParser::renderMedia())
Some features to note:
extra parameters will be rendered as attributes for [link] tag, e.g [link id=25 class=foo] will output example.
parameters may contain spaces if you quote them: [link id=25 class="foo bar"] will output example
If DataProvider::findById() does not return a 'content' in its array, the parser will output http://example.com
The code is too long to paste here, you can find it on gist. Just put each file in the directory specified by the first commented line and you should be set. Run example.php to see it in action. You can find out some more details about using this script by looking at the unit test.

What exactly do you want? You can get the media id out of the text using:
$text = 'some stuff [media id=468 width=660 height=440] more stuff';
// Capture only the digits that follow "media id=".
if (preg_match('/\[media id=(\d+)/', $text, $results)) {
    $id = $results[1];
}
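To go one step further and replace the whole tag, preg_replace_callback can look up each id and emit player markup. This is only a sketch: the function name is hypothetical, the lookup is passed in as a callable standing in for a vid_table query, and an HTML5 video element is used as a placeholder for whatever player markup (Flash or otherwise) the site actually needs.

```php
<?php
// Replace [media id=.. width=.. height=..] shortcodes with player markup (sketch).
// $findUrl receives the numeric id and returns the video URL, or null if unknown.
function render_media_shortcodes(string $text, callable $findUrl): string
{
    $pattern = '/\[media\s+id=(\d+)\s+width=(\d+)\s+height=(\d+)\]/';
    return preg_replace_callback($pattern, function (array $m) use ($findUrl): string {
        $url = $findUrl((int) $m[1]);
        if ($url === null) {
            return $m[0]; // leave the shortcode untouched when no video is found
        }
        return '<video src="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8')
            . '" width="' . $m[2] . '" height="' . $m[3] . '" controls></video>';
    }, $text);
}
```

In real use the callable would run something like a SELECT on vid_table for the given id and return the stored link.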

Using PHP PCRE to fetch div content

I'm trying to fetch data from a div (based on its id) using PHP's PCRE. The goal is to fetch the div's contents based on its id, using recursion / depth to get everything inside it. The main problem is getting the other divs inside the "main div", because the regex stops at the first </div> it finds after the initial <div id="test">.
I've tried many different approaches, and none of them worked. The best solution, in my opinion, is to use the R parameter (recursion), but I never got it to work properly.
Any ideas?
Thanks in advance :D

You'd be much better off using some form of DOM parser - regex really isn't suited to this problem. If all you want is basic HTML dom parsing, something like simplehtmldom would be right up your alley. It's trivial to install (just include a single PHP file) and trivial to use (2-3 lines will do what you need).
include('simple_html_dom.php'); // the single file shipped with the library
$dom = str_get_html($bunchofhtmlcode);
$testdiv = $dom->find('div#test', 0); // 0 for the first occurrence
$testdiv_contents = $testdiv->innertext;
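The same idea also works with PHP's built-in DOM extension, without an extra library. A minimal sketch, assuming the DOM extension is enabled and the input is roughly valid HTML; the function name is hypothetical, and non-ASCII input may need extra encoding handling that is not shown.

```php
<?php
// Return the inner HTML of the first <div> with the given id, or null (sketch).
function div_inner_html(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy real-world markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('div') as $div) {
        if ($div->getAttribute('id') !== $id) {
            continue;
        }
        // Serialize each child node; nested divs are included intact.
        $inner = '';
        foreach ($div->childNodes as $child) {
            $inner .= $doc->saveHTML($child);
        }
        return $inner;
    }
    return null;
}
```

Since the parser tracks nesting, inner </div> tags no longer end the match early, which is the problem the regex approach runs into.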

Text Display in PHP

I have stored data in a DB which contains URLs (for example: Go through this link http://www.google.com).
When I display that data in the browser, I want it to appear as "Go through this link http://www.google.com", but with the URL shown as an anchor link.
If you didn't get this, open Google chat and send someone a message containing http://google.com. Even if you send it as plain text, it is shown as a hyperlink to that URL.
I want this functionality in PHP. How can we implement this?
Thanks in advance.

So, you want to convert the URLs to links in PHP? See the first result, or the answers to the same question on Stack Overflow.

If I understood this correctly, you want to transform URLs in a text into links automatically. Without going further into details, a crude (very crude) regexp should do it for now:
$textWithLinks = preg_replace('#(http|ftp)s?://[^\s]+#i', '<a href="$0">$0</a>', $textWithUrls);

function add_href ($text) {
    return preg_replace('/((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:#=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])/', '<a href="$0">$0</a>', $text);
}
Expression taken from http://rickyrosario.com/blog/converting-a-url-into-a-link-in-csharp-using-regular-expressions/
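Both patterns above insert the matched text into HTML as-is. A slightly more defensive sketch escapes the text first and keeps trailing punctuation outside the link. The function name is hypothetical, and the pattern only covers http, https and ftp URLs, which is an assumption rather than a complete URL grammar.

```php
<?php
// Escape text for HTML, then wrap URLs in anchor tags (sketch).
function linkify(string $text): string
{
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    return preg_replace_callback('#\b(?:https?|ftp)://[^\s<]+#i', function (array $m): string {
        // Keep sentence punctuation at the end of a URL out of the link.
        $url  = rtrim($m[0], '.,;:!?)');
        $rest = substr($m[0], strlen($url));
        return '<a href="' . $url . '">' . $url . '</a>' . $rest;
    }, $safe);
}
```

Escaping before matching means any & in a query string is already written as &amp; inside the href, and stray < or > in the stored text cannot inject markup.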
