MediaWiki to Markdown conversion with PHP preg_replace

Since I discovered BoostNote (an open-source note-taking app made for programmers), I decided to migrate from a local MediaWiki (on a NAS) to this new app.
But that also means migrating from MediaWiki syntax to Markdown!
I tried several online conversion tools, but they didn't work very well. The most interesting one is Pandoc, but it also has a few problems (e.g. I couldn't force it to use fenced blocks instead of indentation to delimit code blocks).
SOLUTION:
I ended up writing a tiny PHP script as a quick working solution (although it may not be very elegant or polished ;) ). It is based on PHP's preg_replace() function.
(The script will be available in the accepted answer below... if you have any suggestions, please post a comment.)

Usage:
Save the MediaWiki markup in a text file ('mediawiki.txt') in the same directory as the PHP script.
Run the script.
A new text file ('markdown.txt') will be created using Markdown syntax.
Enjoy!
<?php
//Before running this script, manually fix all mixed lists (ordered + unordered). It's the only missing feature.
$string = file_get_contents("mediawiki.txt");
$pattern = array(
"/\*\*\*(?![^<]*>|[^<>]*<\/)/", //List: replace '***' with ' * '
"/\*\*(?![^<]*>|[^<>]*<\/)/", //List: replace '**' with ' * '
"/\*(?!\s+|[^<]*>|[^<>]*<\/)/", //List: replace '*' with '* '
"/\#\#\#(?![^<]*>|[^<>]*<\/)/", //List: replace '###' with ' 1. '
"/\#\#(?![^<]*>|[^<>]*<\/)/", //List: replace '##' with ' 1. '
"/\#(?![^<]*>|[^<>]*<\/)/", //List: replace '#' with '1. '
"/\'\'\'\'\'(.*?)\'\'\'\'\'(?![^<]*>|[^<>]*<\/)/", //Bold & Italic: replace "'''''(text)'''''" with "**_(text)_**"
"/\'\'\'([\s\S]*?)\'\'\'(?![^<]*>|[^<>]*<\/)/", //Bold: replace "'''(text)'''" with "**(text)**"
"/\'\'([\s\S]*?)\'\'(?![^<]*>|[^<>]*<\/)/", //Italic: replace "''(text)''" with "_(text)_"
"/\=\=\=\=([\s\S]*?)\=\=\=\=(?![^<]*>|[^<>]*<\/)/", //Headings: replace "====(text)====" with "#### (text)"
"/\=\=\=([\s\S]*?)\=\=\=(?![^<]*>|[^<>]*<\/)/", //Headings: replace "===(text)===" with "### (text)"
"/\=\=([\s\S]*?)\=\=(?![^<]*>|[^<>]*<\/)/", //Headings: replace "==(text)==" with "## (text)"
"/\=([\s\S]*?)\=(?![^<]*>|[^<>]*<\/)/", //Headings: replace "=(text)=" with "# (text)"
"/\{\{[\s\S]*?\|([\s\S]*?)\}\}(?![^<]*>|[^<>]*<\/)/", //Notes: replace "{{...|(text)}}" with "(text)"
"/\<pre[\s\S]*?\>([\s\S]*?)\<\/pre\>/", //Code block: replace "<pre ...>(code)</pre>" with "```java```" !!CHANGE THIS (php, javascript, etc)!!
"/\<code\>([\s\S]*?)\<\/code\>/", //Inline code: replace "<code>(code)</code>" with "`(code)`"
);
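$replacement = array(
"        * ", //third-level bullet (8-space indent assumed so that Markdown nests it)
"    * ", //second-level bullet (4-space indent assumed)
"* ",
"        1. ", //third-level numbered item (8-space indent assumed)
"    1. ", //second-level numbered item (4-space indent assumed)
"1. ",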
"**_$1_**",
"**$1**",
"_$1_",
"#### $1",
"### $1",
"## $1",
"# $1",
"$1",
"```java
$1
```",
"`$1`",
);
$parsedString = preg_replace($pattern, $replacement, $string);
file_put_contents("markdown.txt", $parsedString);
?>
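The lookahead (?![^<]*>|[^<>]*<\/) that ends most of the patterns above is what keeps text inside tags such as <pre> from being converted. Here is a minimal sketch of that behaviour, reusing the bold pattern from the script. The sample string and variable names are made up for illustration.

```php
<?php
// Bold pattern copied from the script above.
$boldPattern = "/\'\'\'([\s\S]*?)\'\'\'(?![^<]*>|[^<>]*<\/)/";

// Made-up sample: one bold run in plain text, one inside a <pre> block.
$sample = "'''bold''' and <pre>'''raw'''</pre>";

// The run directly followed by "</pre>" is rejected by the lookahead and left alone.
$converted = preg_replace($boldPattern, '**$1**', $sample);
echo $converted, "\n";
```

Keep in mind that this is a heuristic: it only inspects what follows a match, so unusual inputs (for example wiki markup placed after a closing tag) may still be converted in unexpected ways.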

Related

How to preg_replace links so that the hash is kept

I have a huge set of online PHP files that is generated dynamically.
It contains links, including some invalid ones with quotes (made with FrontPage):
index2.php?page=xd
index2.php?page=xj asdfa
index2.php?page=xj%20aas
index2.php?page=xj#jumpword
index2.php?page=gj#jumpword with spaces that arenot%20
index2.php?page=afdsdj#jumpword%20with
index2.php?page=xj#jumpword with "quotes" iknow
$input_lines=preg_replace("/(index2.php?page\=.*)(#[a-zA-Z0-9_ \\"]*)(\"\>)/U", "$0 --> $2", $input_lines);
I want all of those links to contain just the # part, without the index2.php?page=* part.
I could not get this to work all evening, so please help.

In some cases, you can use parse_url to get components of the URL (e.g. what comes after the #), like so:
$urls = array(
'index2.php?page=xd',
'index2.php?page=xj asdfa',
'index2.php?page=xj%20aas',
'index2.php?page=xj#jumpword',
'index2.php?page=gj#jumpword with spaces that arenot%20',
'index2.php?page=afdsdj#jumpword%20with',
'index2.php?page=xj#jumpword with "quotes" iknow',
);
foreach($urls as $url){
echo 'For "' . $url . '": ';
$parsed = parse_url($url);
echo isset($parsed['fragment']) ? $parsed['fragment'] : 'DID NOT WORK';
echo '<br>';
}
Output:
For "index2.php?page=xd": DID NOT WORK
For "index2.php?page=xj asdfa": DID NOT WORK
For "index2.php?page=xj%20aas": DID NOT WORK
For "index2.php?page=xj#jumpword": jumpword
For "index2.php?page=gj#jumpword with spaces that arenot%20": jumpword with spaces that arenot%20
For "index2.php?page=afdsdj#jumpword%20with": jumpword%20with
For "index2.php?page=xj#jumpword with "quotes" iknow": jumpword with "quotes" iknow
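If the links have to be rewritten in place rather than parsed one by one, a preg_replace with a lookahead may be enough. This is only a sketch: the function name keep_fragment_only and the sample HTML are assumptions, and it presumes the page value itself never contains a # or a double quote.

```php
<?php
// Remove "index2.php?page=<value>" only when a "#" fragment follows it.
// Links without a fragment are left untouched.
function keep_fragment_only(string $html): string
{
    // [^#"]* consumes the page value; (?=#) requires a fragment without consuming it.
    return preg_replace('/index2\.php\?page=[^#"]*(?=#)/', '', $html);
}

// Made-up sample markup for illustration.
$html = '<a href="index2.php?page=xj#jumpword">a</a> <a href="index2.php?page=xd">b</a>';
echo keep_fragment_only($html), "\n";
```

Because everything from the # onwards is left as it was, fragments with spaces or stray quotes pass through unchanged.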

Preg Replace: only replace instances of text with a trailing space

I have a preg_replace question. I am using preg_replace to generate contextual links in text blocks using the following code:
$contextualLinkStr = 'mytext';
$content = 'My text string which includes MYTEXT in various cases such as Mytext and mytext. However it also includes image tags such as <img src="http://www.myurl.com/mytext-1.jpg">';
$content = preg_replace('/' . $contextualLinkStr . '/i', '\\0', $content);
The preg_replace works well on the text, generating the relevant links while retaining case, but it's also generating a link within the URL of the image tag. I was thinking that if I simply added a trailing space to the expression in the preg_replace call it would fix it, since all text instances have a trailing space whereas image URLs do not, as follows:
$content = preg_replace('/' . $contextualLinkStr . '/i' . ' ', '\\0' . ' ', $content);
But this doesn't work. Can anybody tell me how I make the trailing space a condition of the match?
Thanks in advance.
Jason.

I've just worked it out guys. I was being daft. For reference the relevant code is:
$content = preg_replace('/' . $contextualLinkStr . ' /i', '\\0 ', $content);
Thanks.
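An alternative that does not depend on a trailing space is to skip matches that sit inside a tag, using a negative lookahead. This is a rough sketch rather than a full HTML-aware solution. The link target /hypothetical-page is a placeholder, not something from the original post.

```php
<?php
$contextualLinkStr = 'mytext';
$content = 'My text string which includes MYTEXT in various cases such as Mytext and mytext. '
    . 'However it also includes image tags such as <img src="http://www.myurl.com/mytext-1.jpg">';

// (?![^<]*>) rejects a match when a ">" comes before the next "<",
// which is the situation inside a tag such as <img ...>.
$pattern = '/' . preg_quote($contextualLinkStr, '/') . '(?![^<]*>)/i';

// Placeholder link target, chosen only for illustration.
$linked = preg_replace($pattern, '<a href="/hypothetical-page">$0</a>', $content);
echo $linked, "\n";
```

Here the three occurrences in the running text are wrapped in links, including the one followed by a full stop, while the image URL stays intact.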

How do I split a WordPress title at the – in PHP?

I am working on my WordPress blog and need to get the title of a post and split it at the "-". The thing is, it's not working, because in the source it's &ndash and when I look at the result on the website, it's a "long minus" (–). Copying and pasting this long minus into some editor makes it a normal minus (-). I can't split at "-" nor at &ndash, but somehow it must be possible. When I created the article, I just typed "-" (minus), but somewhere it gets converted to – automatically.
Any ideas?
Thanks!

I think I found it. I remember running into a similar problem: when I pasted code into a post, the quote marks were transformed into curly quotes when displayed to readers.
I found that /wp-includes/formatting.php, line 56 (WordPress version 3.3.1), defines the characters to replace:
$static_characters = array_merge( array('---', ' -- ', '--', ' - ', 'xn&#8211;', '...', '``', '\'\'', ' (tm)'), $cockney );
$static_replacements = array_merge( array($em_dash, ' ' . $em_dash . ' ', $en_dash, ' ' . $en_dash . ' ', 'xn--', '&#8230;', $opening_quote, $closing_quote, ' &#8482;'), $cockneyreplace );
and on line 85 it makes the replacement:
// This is not a tag, nor is the texturization disabled static strings
$curl = str_replace($static_characters, $static_replacements, $curl);

If you want to split a string at the "-" character, basically you must replace "-" with a space.
Try this:
$string_to_be_stripped = "my-word-test";
$chars = array('-');
$new_string = str_replace($chars, ' ', $string_to_be_stripped);
echo $new_string;
These lines replace every "-" in the string with a space. For example, if you have my-word-test, it will echo "my word test". I hope it helps.
See the PHP manual for more information about the str_replace function.
If you want to do this the WordPress way, try using filters. I suggest placing these lines in your functions.php file:
add_filter('the_title', function($title) {
$string_to_be_stripped = $title;
$chars = array('-');
$new_string = str_replace($chars, ' ', $string_to_be_stripped);
return $new_string;
});
Now, every time you use the_title in a loop, the hyphens in the title will be replaced with spaces.
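If the goal is to actually split the title into two parts, one option is to normalise the different dash forms first and then use explode. This is a sketch under the assumption that the title reaches your code containing "-", the entity &#8211; or &ndash;, or a literal en dash. The function name split_title_at_dash is made up.

```php
<?php
// Split a title at the first dash, whether it is "-", an en dash, or its entity.
function split_title_at_dash(string $title): array
{
    // Dash forms the texturized title may contain (assumed list).
    $dashes = array('&#8211;', '&ndash;', "\u{2013}");
    $normalized = str_replace($dashes, '-', $title);

    // Limit of 2: only the first dash splits; trim removes the surrounding spaces.
    return array_map('trim', explode('-', $normalized, 2));
}

print_r(split_title_at_dash('Artist &#8211; Song'));
```

A title without any dash comes back as a single-element array, so check the count before reading the second part.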

Switch gettext translated language with original language

I started my PHP application with all text in German, then used gettext to extract all strings and translate them to English.
So, now I have a .po file with all msgids in German and msgstrs in English. I want to switch them, so that my source code contains the English as msgids for two main reasons:
More translators will know English, so it is only appropriate to serve them up a file with msgids in English. I could always switch the file before I give it out and after I receive it, but naaah.
It would help me to write English object & function names and comments if the content text was also English. I'd like to do that, so the project is more open to other Open Source collaborators (more likely to know English than German).
I could do this manually and this is the sort of task where I anticipate it will take me more time to write an automated routine for it (because I'm very bad with shell scripts) than do it by hand. But I also anticipate despising every minute of manual computer labour (feels like an oxymoron, right?) like I always do.
Has someone done this before? I figured this would be a common problem, but couldn't find anything. Many thanks ahead.
Sample Problem:
<title><?=_('Routinen')?></title>
#: /users/ruben/sites/v/routinen.php:43
msgid "Routinen"
msgstr "Routines"
I thought I'd narrow the problem down. The switch in the .po-file is no issue of course, it is as simple as
preg_replace('/msgid "(.+)"\nmsgstr "(.+)"/', "msgid \"\$2\"\nmsgstr \"\$1\"", $str);
The problem for me is the routine that searches my project folder files for _('$msgid') and substitutes _('msgstr') while parsing the .po-file (which is probably not even the most elegant way, after all the .po-file contains comments which contain all file paths where the msgid occurs).
After fooling around with akirk's answer a little, I ran into some more problems.
Because I have a mixture of _('xxx') and _("xxx") calls, I have to be careful about (un)escaping.
Double quotes " in msgids and msgstrs have to be unescaped, but the slashes can't be stripped, because it may be that the double quote was also escaped in PHP
Single quotes have to be escaped when they're replaced into PHP, but then they also have to be changed in the .po-file. Luckily for me, single quotes only appear in English text.
msgids and msgstrs can have multiple lines, then they look like this
msgid ""
"line 1\n"
"line 2\n"
msgstr ""
"line 1\n"
"line 2\n"
plural forms are of course skipped at the moment, but in my case that's not an issue
poedit wants to remove strings as obsolete that seem successfully switched and I have no idea why this happens in (many) cases.
I'll have to stop working on this for tonight. Still it seems using the parser instead of RegExps wouldn't be overkill.
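For the .po side alone, the multi-line form can be covered with a slightly longer pattern. The following is a sketch, not a full PO parser: the function name swap_po_entries is made up, it assumes Unix line endings, and it leaves the header entry, untranslated entries and plural forms untouched.

```php
<?php
// Swap msgid and msgstr in PO text, including multi-line entries.
function swap_po_entries(string $po): string
{
    $string = '"(?:[^"\\\\]|\\\\.)*"';            // one quoted PO string, escapes allowed
    $block  = $string . '(?:\n' . $string . ')*'; // plus any continuation lines
    $pattern = '/^msgid (' . $block . ')\nmsgstr (' . $block . ')$/m';

    return preg_replace_callback($pattern, function (array $m): string {
        // Skip the header (empty msgid) and untranslated entries (empty msgstr).
        if ($m[1] === '""' || $m[2] === '""') {
            return $m[0];
        }
        return "msgid {$m[2]}\nmsgstr {$m[1]}";
    }, $po);
}

echo swap_po_entries("msgid \"Routinen\"\nmsgstr \"Routines\"\n");
```

Entries with msgid_plural do not match the pattern and are passed through as they are, so they still need separate handling.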

I built on akirk's answer and wanted to preserve what I came up with as an answer here, in case somebody has the same problem.
This is not recursive, but that could easily change of course. Feel free to comment with improvements, I will be watching and editing this post.
$po = file_get_contents("locale/en_GB/LC_MESSAGES/messages.po");
$translations = array(); // german => english
$rawmsgids = array(); // find later
$msgidhits = array(); // record success
$msgstrs = array(); // find later
preg_match_all('/msgid "(.+)"\nmsgstr "(.+)"/', $po, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
$german = str_replace('\"','"',$match[1]); // unescape double quotes (could misfire if you escaped double quotes in PHP _("bla") but in my case that was one case versus many)
$english = str_replace('\"','"',$match[2]);
$en_sq_e = str_replace("'","\'",$english); // escape single quotes
$translations['_(\''. $german . '\''] = '_(\'' . $en_sq_e . '\'';
$rawmsgids['_(\''. $german . '\''] = $match[1]; // find raw msgid with searchstr as key
$translations['_("'. $match[1] . '"'] = '_("' . $match[2] . '"';
$rawmsgids['_("'. $match[1] . '"'] = $match[1];
$translations['__(\''. $german . '\''] = '__(\'' . $en_sq_e . '\'';
$rawmsgids['__(\''. $german . '\''] = $match[1];
$translations['__("'. $match[1] . '"'] = '__("' . $match[2] . '"';
$rawmsgids['__("'. $match[1] . '"'] = $match[1];
$msgstrs[$match[1]] = $match[2]; // msgid => msgstr
}
foreach (glob("*.php") as $file) {
$code = file_get_contents($file);
$filehits = 0; // how many replacements per file
foreach($translations AS $msgid => $msgstr) {
$hits = 0;
$code = str_replace($msgid,$msgstr,$code,$hits);
$filehits += $hits;
if($hits!=0) $msgidhits[$rawmsgids[$msgid]] = 1; // this serves to record if the msgid was found in at least one incarnation
elseif(!isset($msgidhits[$rawmsgids[$msgid]])) $msgidhits[$rawmsgids[$msgid]] = 0;
}
// file_put_contents($file, $code); // be careful to test this first before doing the actual replace (and do use a version control system!)
echo "$file : $filehits <br>";
echo $code;
}
/* debug */
$found = array_keys($msgidhits, 1, true);
foreach($found AS $mid) echo $mid . " => " . $msgstrs[$mid] . "\n\n";
echo "Not Found: <br>";
$notfound = array_keys($msgidhits, 0, true);
foreach($notfound AS $mid) echo $mid . " => " . $msgstrs[$mid] . "\n\n";
/*
following steps are still needed:
* convert plurals (ngettext)
* convert multi-line msgids and msgstrs (format mentioned in question)
* resolve uniqueness conflict (msgids are unique, msgstrs are not), so you may have duplicate msgids (poedit finds these)
*/

See http://code.activestate.com/recipes/475109-regular-expression-for-python-string-literals/ for a good Python-based regular expression for finding string literals, taking escapes into account. Although it's Python, it might work well for multiline strings and other corner cases.
See http://docs.translatehouse.org/projects/translate-toolkit/en/latest/commands/poswap.html for a ready, out-of-the-box base language swapper for .po files.
For instance, the following command line will convert a German-based Spanish translation to an English-based Spanish translation. You just have to ensure that your new base language (English) is 100% translated before starting the conversion:
poswap -i de-en.po -t de-es.po -o en-es.po
And finally, to swap an English .po file to a German .po file, use swappo:
http://manpages.ubuntu.com/manpages/hardy/man1/swappo.1.html
After swapping files, some manual polishing of the resulting files might be required. For instance, headers might be broken and some duplicate texts might occur.

So if I understand you correctly you'd like to replace all German gettext calls with English ones. To replace the contents in the directory, something like this could work.
$po = file_get_contents("translation.pot");
$translations = array(); // german => english
preg_match_all('/msgid "(.+)"\nmsgstr "(.+)"/', $po, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
$translations['_("'. $match[1] . '")'] = '_("' . $match[2] . '")';
$translations['_(\''. $match[1] . '\')'] = '_(\'' . $match[2] . '\')';
}
foreach (glob("*.php") as $file) {
$code = file_get_contents($file);
$code = str_replace(array_keys($translations), array_values($translations), $code);
//file_put_contents($file, $code);
echo $code; // be careful to test this first before doing the actual replace (and do use a version control system!)
}
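One thing to watch with str_replace and arrays: the pairs are applied one after another, so a string produced by an earlier pair can be changed again by a later one. strtr with an array works in a single pass and does not touch text it has already replaced. Here is a small sketch with made-up translation pairs that shows the difference.

```php
<?php
// Hypothetical German => English pairs; "Gift" is German for "poison".
$translations = array(
    '_("Geschenk")' => '_("Gift")',
    '_("Gift")'     => '_("Poison")',
);
$code = 'echo _("Geschenk"), _("Gift");';

// Sequential: "Geschenk" becomes "Gift", which the second pair then turns into "Poison".
$sequential = str_replace(array_keys($translations), array_values($translations), $code);

// Single pass: each original call is translated exactly once.
$singlePass = strtr($code, $translations);

echo $sequential, "\n";
echo $singlePass, "\n";
```

Whether this matters depends on the catalogue: it only bites when a translated string is identical to another source string.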

preg_replace() in PHP - Replacing n-occurrences of a character with n-occurrences of another, when following a new line

I'm looking to replace all occurrences of space characters that follow a new line (or occur at the beginning of the input string). I know that I can achieve this using preg_replace_callback() with a callback that uses str_repeat and strlen, or similarly with the /e switch; but was wondering if it could be done more simply.
Currently I have the following:
$pValue = " Hello\n to everybody\n in the world";
echo preg_replace('/^|\n( )+/', ' ', $pValue);
which gives:
" Hello to everybody in the world"
What I'm really after is:
" Hello\n to everybody\n in the world"

I should have searched harder before asking: I found an answer (for a Java solution) that seems to work perfectly. I'll leave the solution here for the sake of anybody else who has the same problem.
$pValue = " Hello\n to everybody\n in the world";
echo preg_replace('/(?m)(?:^|\\G) /', ' ', $pValue);
Now I just need to identify whether older versions of PHP support this.
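To make the effect of \G easier to see, here is the same pattern wrapped in a small function and run with a visible replacement. The function name replace_leading_spaces, the sample text with its varying indentation, and the choice of &nbsp; as the replacement are assumptions for illustration only.

```php
<?php
// Replace every space at the start of a line with $replacement, one for one.
function replace_leading_spaces(string $text, string $replacement): string
{
    // ^ (multiline mode) matches the first space of a line; \G matches exactly where
    // the previous match ended, so only an unbroken run of leading spaces is consumed.
    return preg_replace('/(?m)(?:^|\G) /', $replacement, $text);
}

$out = replace_leading_spaces("  Hello\n    to everybody\n   in the world", '&nbsp;');
echo $out, "\n";
```

Spaces between words are left alone, because \G no longer matches once a non-space character has interrupted the run.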

You can use recursion:
$pValue = " Hello\n to everybody\n in the world";
echo myReplace($pValue);
function myReplace($value)
{
$value = preg_replace('/(^|\\n)(( )*) /', '\1\2 ', $value);
if (preg_match('/(^|\\n)(( )*) /', $value))
{
$value = myReplace($value);
}
return $value;
}
