Template extraction in Python/PHP

Are there existing template extraction libraries in either Python or PHP? Perl has Template::Extract, but I haven't been able to find a similar implementation in either language.
The only thing close in Python that I could find is TemplateMaker (http://code.google.com/p/templatemaker/), but that's not really a template extraction library.

After digging around some more, I found exactly what I was looking for. filippo posted a list of Python solutions for screen scraping in this post: Options for HTML scraping?, among which is a package called scrapemark (http://arshaw.com/scrapemark/).
Hope this helps anyone else who is looking for the same solution.

TemplateMaker does seem to do what you need, at least according to its documentation. Instead of receiving a template as input, it infers ("learns") it from a few documents. Then it has an extract method to pull the data out of other documents that were created with this template.
The example shows:
# Now that we have a template, let's extract some data.
>>> t.extract('<b>red and green</b>')
('red', 'green')
>>> t.extract('<b>django and stephane</b>')
('django', 'stephane')
# The extract() method is very literal. It doesn't magically trim
# whitespace, nor does it have any knowledge of markup languages such as
# HTML.
>>> t.extract('<b> spacy and <u>underlined</u></b>')
(' spacy ', '<u>underlined</u>')
# The extract() method will raise the NoMatch exception if the data
# doesn't match the template. In this example, the data doesn't have the
# leading and trailing "<b>" tags.
>>> t.extract('this and that')
Traceback (most recent call last):
...
So, to achieve the task you require, I think you should:
Give it a few documents rendered from your template - it will have no trouble inferring the template from them.
Use the inferred template to extract data from new documents.
Come to think of it, it's even more useful than Perl's Template::Extract, as it doesn't expect you to provide a clean template - it learns one on its own from sample text.

Here is an interesting discussion from Adrian, the author of TemplateMaker: http://www.holovaty.com/writing/templatemaker/
It seems to be a lot like what I would call a wrapper induction library.
If you're looking for something more configurable (less scraping-oriented), take a look at lxml.html and BeautifulSoup, also for Python.

Related

Perf. issue / Too many calls to string manipulation functions

This question is about optimizing part of a program that I add to many projects as a common tool.
This 'template parser' is designed to take a text pattern containing HTML (or anything else) with several specific tags, and to replace these with developer-supplied values when rendered.
The few classes involved do a great job and work as expected; they make it possible to isolate design elements and easily adapt or replace design blocks when needed.
The patterns I use look like this (nothing exceptional, I admit):
<table class="{class}" id="{id}">
  <block_row>
    <tr>
      <block_cell>
        <td>{content}</td>
      </block_cell>
    </tr>
  </block_row>
</table>
(The example code below consists of adapted extracts.)
The parsing does things like this:
// Variables are sorted by position in the pattern string.
// Positions are read once and stored in a cache to avoid
// multiple calls to strpos or str_replace.
foreach ($this->aVars as $oVar) {
    $sString = substr($sString, 0, $oVar->start) .
               $oVar->value .
               substr($sString, $oVar->end);
}

// Once the pattern is loaded, blocks look like --¤(<block_name>)¤--
foreach ($this->aBlocks as $sName => $oBlock) {
    $sBlockData = $oBlock->parse();
    $sString = str_replace('--¤(' . $sName . ')¤--', $sBlockData, $sString);
}
On the class instance I call methods like 'addBlock' or 'setVar' to fill my pattern with data.
This system has several disadvantages, among them the multiple objects in memory (one for each block instance) and the many calls to string manipulation functions during the parsing process (preg_replace in the past, now just a bunch of substr and pals).
The program I'm working on makes heavy use of these templates, and they are just about to show their limits.
My question is the following (no need for code, just ideas or a lead to follow):
Should I conclude that I've overused this approach, and try to arrange things so that I don't need to make so many calls to these templates (for instance, improving caching, or using only simple view scripts)?
Do you know of a technical solution for feeding a structure with data that would not be the mad resource consumer I wrote? While writing this I'm thinking about XSLT: would it be suitable, and if so, could it improve performance?
Thanks in advance for your advice.
Use the XDebug extension to profile your code and find out exactly which parts of the code are taking the most time.
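If you haven't profiled before, here is a minimal sketch of the Xdebug 2-era profiler settings (the output directory is just an example path); the generated cachegrind files can then be opened in KCachegrind or Webgrind:
; php.ini - enable the Xdebug profiler (Xdebug 2 settings)
xdebug.profiler_enable = 1
; cachegrind.out.* files will be written here (example path)
xdebug.profiler_output_dir = /tmp/xdebug
Once you know which template calls actually dominate, you can decide whether better caching or a restructuring is worth the effort.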

Retrieving variables (document metadata) from a MultiMarkdown document with PHP

How can I retrieve MultiMarkdown document metadata (as defined here) using php?
I was rather surprised that I couldn't find a MultiMarkdown php parser, PHP Markdown Extra doesn't do MultiMarkdown.
I'm afraid the scripts that MultiMarkdown comes packaged with have all the answers for somebody who knows how to define and use a custom XSLT, but sadly that's not my case.
MultiMarkdown Document Metadata goes like this:
Title:  A New MultiMarkdown Document
Author: Fletcher T. Penney
        John Doe
Date:   July 25, 2005
I would like to use my own properties and control where they will be displayed in the output. By default, mmd2XHTML outputs the above (pre-defined) variables in <meta> tags, but I need to display them somewhere in the HTML body.
Thanks.
I am not an expert in PHP, but the easiest way would probably be to call out to the multimarkdown binary as a shell command, e.g.
multimarkdown -e title foo.txt
This command would output the value of the title metadata for foo.txt.
This is basically the approach I use in perl, Objective-C, and shell scripts, and is the reason I added the -e flag to MultiMarkdown to begin with.
The XSLT approach is great if you are using MMD to actually generate the HTML result, but probably not as useful in this circumstance.
Your other option would be to write a custom regular expression to grab what you need, but why reinvent the wheel?
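If you go the shell route from PHP, a minimal sketch might look like this (the helper name is hypothetical, and it assumes the multimarkdown binary is on the PATH):
// Hypothetical helper: read one metadata value via the multimarkdown binary.
function mmd_metadata($file, $key) {
    $cmd = 'multimarkdown -e ' . escapeshellarg($key) . ' ' . escapeshellarg($file);
    return trim(shell_exec($cmd));
}

echo mmd_metadata('foo.txt', 'title'); // e.g. "A New MultiMarkdown Document"
escapeshellarg() keeps unusual file names and keys from breaking the command.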
For parsing Markdown files with metadata you can use FrontYAML or Kurenai.
I am not sure about exact compatibility with MultiMarkdown, though.
FrontYAML
$parser = new Mni\FrontYAML\Parser();
$document = $parser->parse($str);
$yaml = $document->getYAML();
$html = $document->getContent();
Kurenai
Kurenai can parse different metadata content types, such as YAML and JSON.
$kurenai = new \Kurenai\Parser(
    new \Kurenai\Parsers\Metadata\JsonParser,
    new \Kurenai\Parsers\Content\MarkdownParser
);
$document = $kurenai->parse('path/to/document.md');
$document->getRaw();
$document->getMetadata();
$document->getContent();

Mediawiki tag extension - chained tags do not get processed

I'm trying to develop a simple Tag Extension for MediaWiki. So far I'm basically outputting the input as it comes. The problem arises when there are chained tags. For instance, for this example:
function efSampleParserInit( Parser &$parser ) {
    $parser->setHook( 'sample', 'efSampleRender' );
    return true;
}

function efSampleRender( $input, array $args, Parser $parser, PPFrame $frame ) {
    return "hello ->" . $input . "<- hello";
}
If I write this in an article:
This is the text <sample type="1">hello my <sample type="2">brother</sample> John</sample>
Only the first sample tag gets processed; the inner one isn't. I guess I should work with the $parser object I receive so that I return the parsed input, but I don't know how to do it.
Furthermore, MediaWiki's reference documentation is pretty much nonexistent; it would be great to have something like a Doxygen reference.
Use $parser->recursiveTagParse(), as shown at Manual:Tag_extensions#How do I render wikitext in my extension?.
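Applied to the callback from the question, a minimal sketch looks like this:
function efSampleRender( $input, array $args, Parser $parser, PPFrame $frame ) {
    // Recursively parse the tag contents as wikitext, so nested
    // <sample> tags (and any other markup) get processed too.
    $output = $parser->recursiveTagParse( $input, $frame );
    return "hello ->" . $output . "<- hello";
}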
It is kind of a clunky interface, and not very well documented. The underlying reason why such a seemingly natural thing to do is so tricky to accomplish is that it sort of goes against the original design intent of tag extensions — they were originally conceived as low-level filters that take in raw unparsed text and spit out HTML, completely bypassing normal parsing. So, for example, if you wanted to include some content written in Markdown (such as a StackOverflow post) on a wiki page, the idea was that you could install a suitable extension and then write
<markdown>
**Look,** here's some Markdown text!
</markdown>
on the page, and the MediaWiki parser would leave everything between the <markdown> tags alone and just hand it over to the extension for parsing.
Of course, it turned out that most people who wrote MediaWiki tag extensions didn't really want to replace the parser, but just to apply some tweaks to it. But the way the tag extension interface was set up, the only way to do that was to call the parser recursively. I've sometimes thought it would be nice to add a new parser extension type to MediaWiki, something that looked like tag extensions but didn't interrupt normal parsing in such a drastic manner. Alas, my motivation and copious free time haven't so far been sufficient to actually do something about it.

Dealing with XML in PHP

I'm currently working on a project that has me working with XML a lot. I have to take an XML response, decrypt each text node, and then do various tasks with the data. The problem I'm having is taking the response and processing each text node. Originally I was using the XMLToArray library, and that worked fine: I would change the XML into an array and then loop through the array and decrypt the values. However, some of the XML responses I'm dealing with have repeated tags, and the XMLToArray library will only return the last values.
Is there a good way to take an XML response, process all the text nodes, and easily put the values into an array that has a structure similar to the response?
Thanks in advance.
I would use SimpleXML.
Here's a small example of using it. It loads and parses XML from http://www.w3schools.com/xml/plant_catalog.xml and then outputs the values of the "COMMON" and "PRICE" tags of each "PLANT" tag.
$xml = simplexml_load_file('http://www.w3schools.com/xml/plant_catalog.xml');
foreach ( $xml->PLANT as $plantNode ) {
    echo $plantNode->COMMON, ' - ', $plantNode->PRICE, "\n";
}
If you have any problems with adapting it to your needs, just give an example of your XML so that we can help with it.
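Regarding the repeated tags that tripped up XMLToArray: SimpleXML keeps every occurrence, so you can collect them all yourself. A minimal sketch (the tag names are made up, since you haven't posted your XML):
$xml = simplexml_load_string('<response><item>a</item><item>b</item></response>');
$values = array();
foreach ( $xml->item as $item ) {
    $values[] = (string) $item; // cast each text node to a string; decrypt here
}
// $values now contains both 'a' and 'b', not just the last occurrence.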
All those XML-to-array libraries are a remnant of the times when PHP 4 would force you to write your own XML parser almost from scratch. Recent PHP versions give you a good set of XML libraries that do the hard job. I particularly recommend SimpleXML (for small files) and XMLReader (for large files). If you still find them complicated, you can try phpQuery.
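For the large-file case, a minimal XMLReader sketch (the file name and element name are placeholders); it streams, so the whole document never has to fit in memory:
$reader = new XMLReader();
$reader->open('large.xml');
while ( $reader->read() ) {
    if ( $reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item' ) {
        echo $reader->readString(), "\n"; // text content of the current element
    }
}
$reader->close();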
You might want to give SimpleXML a try. Plus, it comes with PHP by default, so you don't need to install anything.
Check out SimpleXML, it may offer a bit more for what you are looking for.

Why use an XML parser?

I'm a somewhat experienced PHP scripter; however, I just dove into parsing XML and all that good stuff.
I just can't seem to wrap my head around why one would use a separate XML parser instead of just using the explode function, which seems to be just as simple. Here's what I've been doing (assuming there is a valid XML file at the path xml.php):
$contents = file_get_contents("xml.php");
$array1 = explode("<a_tag>", $contents);
$array2 = explode("</a_tag>", $array1[1]);
$data = $array2[0];
So my question is, what is the practical use for an XML parser if you can just separate the values into arrays and extract the data from that point?
Thanks in advance! :)
Excuse me for not going into details, but for starters try parsing
$contents = '<a xmlns="urn:something">
  <a_tag>
    <b>..</b>
    <related>
      <a_tag>...</a_tag>
    </related>
  </a_tag>
  <foo:a_tag xmlns:foo="urn:something">
    <![CDATA[This is another <a_tag> element]]>
  </foo:a_tag>
</a>';
with your explode approach. When you're done, we can continue with some trickier things ;-)
In a nutshell, it's consistency. Before XML came into wide use there were numerous undocumented formats for keeping information in files. One of the motivations behind XML was to create a well-defined, standard document format. With this well-defined format in place, a general set of parsing tools could be developed that would work consistently on documents, so long as the documents adhered to the aforementioned format.
In some specific cases, your example code will work. However, if the document changes
...
<!-- adding an attribute -->
<a_tag foo="bar">Contents of the Tag</a_tag>
...
...
<!-- adding a comment to the contents -->
<a_tag>Contents <!-- foobar --> of the Tag</a_tag>
...
Your parsing code will probably break. Code written using a correctly defined XML parser will not.
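For comparison, a parser addresses elements by name, so neither of those changes breaks extraction. A minimal SimpleXML sketch built from the snippets above:
$xml = simplexml_load_string(
    '<root><a_tag foo="bar">Contents <!-- foobar --> of the Tag</a_tag></root>'
);
echo (string) $xml->a_tag; // the attribute and the comment are simply ignored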
XML parsers:
Handle encoding
May have xpath support
Allow you to easily modify and save the XML; append/remove child nodes, add/remove attributes, etc.
Don't need to load the whole file into memory (except for DOM parsers)
Know about namespaces
...
How would you explode the same file if a_tag had an attribute?
explode("<a_tag>" ... will work differently than explode("<a_tag attr='value'>" ..., after all.
XML Parsers understand the XML specification. Explode can only handle the simplest of cases, and will most likely fail in a lot of instances of that case.
Using a proven XML parsing method will make the code more maintainable and easier to read. It will also make it more easily adaptable should the schema change, and it can make it easier to detect error conditions. XPath and XSLT exist for a reason: they are proven ways to deal with XML data in a sensible, legible manner. I'd suggest you use whichever is applicable in your situation. Remember, even if you think you're only writing code for one specific purpose, you never know what a piece of well-written code could evolve into.
