Yaml::dump gives different output [duplicate] - php

This question tries to collect information spread over questions about different languages and YAML implementations in a mostly language-agnostic manner.
Suppose I have a YAML file like this:
first:
  - foo: {a: "b"}
  - "bar": [1, 2, 3]
second: | # some comment
  some long block scalar value
I want to load this file into a native data structure, possibly change or add some values, and dump it again. However, when I dump it, the original formatting is not preserved:
The scalars are formatted differently, e.g. "b" loses its quotation marks, the value of second is not a literal block scalar anymore, etc.
The collections are formatted differently, e.g. the mapping value of foo is written in block style instead of the given flow style; similarly, the sequence value of "bar" is written in block style
The order of mapping keys (e.g. first/second) changes
The comment is gone
The indentation level differs, e.g. the items in first are not indented anymore.
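For example, in PHP a plain round trip through native data already shows this. A minimal sketch using the symfony/yaml component implied by Yaml::dump in the title (the file name is illustrative; any load/dump API that goes through native data behaves the same way):

<?php
require 'vendor/autoload.php';

use Symfony\Component\Yaml\Yaml;

// Parse to a native PHP array, change one value, dump again.
$data = Yaml::parseFile('input.yaml');
$data['second'] = 'some long block scalar value';

// The dumper re-chooses quoting, flow/block styles and block scalars
// from the native data alone, so the original formatting is lost.
echo Yaml::dump($data);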
How can I preserve the formatting of the original file?

Preface: Throughout this answer, I mention some popular YAML implementations. Those mentions are never exhaustive since I do not know all YAML implementations out there.
I will use YAML terms for data structures: Atomic text content (even numbers) is a scalar. Item sequences, known elsewhere as arrays or lists, are sequences. A collection of key-value pairs, known elsewhere as dictionary or hash, is a mapping.
If you are using Python, ruamel will help you preserve a good deal of formatting since it implements round-tripping up to native structures. However, it isn't perfect and cannot preserve all formatting.
Background
The process of loading YAML is also a process of losing information. Let's have a look at the process of loading/dumping YAML, as given in the spec (the diagram there is reproduced here as text):

Native (Data Structure) ← Representation (Node Graph) ← Serialization (Event Tree) ← Presentation (Character Stream)

The arrows show the Load direction; Dump runs the opposite way.
When you are loading a YAML file, you are executing some or all of the steps in the Load direction, starting at the Presentation (Character Stream). YAML implementations usually promote their most high-level APIs, which load the YAML file all the way to Native (Data Structure). This is true for most common YAML implementations, e.g. PyYAML/ruamel, SnakeYAML, go-yaml, and Ruby's YAML module. Other implementations, such as libyaml and yaml-cpp, only provide deserialization up to the Representation (Node Graph), possibly due to restrictions of their implementation languages (loading into native data structures requires either compile-time or runtime reflection on types).
The important information for us is what is contained in those boxes. Each box mentions information which is not available anymore in the box left to it. So this means that styles and comments, according to the YAML specification, are only present in the actual YAML file content, but are discarded as soon as the YAML file is parsed. For you, this means that once you have loaded a YAML file to a native data structure, all information about how it originally looked in the input file is gone. Which means that when you dump the data, the YAML implementation chooses a representation it deems useful for your data. Some implementations let you give general hints/options, e.g. that all scalars should be quoted, but that doesn't help you restore the original formatting.
Thankfully, this diagram only describes the logical process of loading YAML; a conforming YAML implementation does not need to slavishly conform to it. Most implementations actually preserve data longer than they need to. This is true for PyYAML/ruamel, SnakeYAML, go-yaml, yaml-cpp, libyaml and others. In all these implementations, the style of scalars, sequences and mappings is remembered up until the Representation (Node Graph) level.
On the other hand, comments are discarded rather early since they do not belong to an event or node (the exceptions here are ruamel, which links comments to the following event, and go-yaml, which remembers comments before, at and after the line that created a node). Some YAML implementations (libyaml, SnakeYAML) provide access to a token stream which is even more low-level than the Event Tree. This token stream does contain comments; however, it is only usable for doing things like syntax highlighting, since the APIs do not contain methods for consuming the token stream again.
So what to do?
Loading & Dumping
If you need to only load your YAML file and then dump it again, use one of the lower-level APIs of your implementation to only load the YAML up until the Representation (Node Graph) or Serialization (Event Tree) level. The API functions to search for are compose/parse and serialize/present respectively.
It is preferable to use the Event Tree instead of the Node Graph as some implementations already forget the original order of mapping keys (due to internally using hashmaps) when composing. This question, for example, details loading / dumping events with SnakeYAML.
Information that is already lost in the event stream of your implementation, for example comments in most implementations, is impossible to preserve. Also impossible to preserve is scalar layout, like in this example:
"1 \x2B 1"
This loads as string "1 + 1" after resolving the escape sequence. Even in the event stream, the information about the escape sequence has already been lost in all implementations I know. The event only remembers that it was a double-quoted scalar, so writing it back will result in:
"1 + 1"
Similarly, a folded block scalar (starting with >) will usually not remember where line breaks in the original input have been folded into space characters.
To sum up, loading to the Event Tree and dumping again will usually preserve:
Style: unquoted/quoted/block scalars, flow/block collections (sequences & mappings)
Order of keys in mappings
YAML tags and anchors
You will usually lose:
Information about escape sequences and line breaks in flow scalars
Indentation and non-content spacing
Comments – unless the implementation specifically supports putting them in events and/or nodes
If you use the Node Graph instead of the Event Tree, you will likely lose anchor representations (i.e. that &foo may be written out as &a later with all aliases referring to it using *a instead of *foo). You might also lose key order in mappings. Some APIs, like go-yaml, don't provide access to the Event Tree, so you have no choice but to use the Node Graph instead.
Modifying Data
If you want to modify data and still preserve what you can of the original formatting, you need to manipulate your data without loading it to a native structure. This usually means that you operate on YAML scalars, sequences and mappings, instead of strings, numbers, lists or whatever structures the target programming language provides.
You have the option to either process the Event Tree or the Node Graph (assuming your API gives you access to it). Which one is better usually depends on what you want to do:
The Event Tree is usually provided as a stream of events. It may be better for large data since you do not need to load the complete data in memory; instead you inspect each event, track your position in the input structure, and place your modifications accordingly. The answer to this question shows how to append items, given a path and a value, to a given YAML file with PyYAML's event API.
The Node Graph is better for highly structured data. If you use anchors and aliases, they will be resolved there but you will probably lose information about their names (as explained above). Unlike with events, where you need to track the current position yourself, the data is presented as complete graph here, and you can just descend into the relevant sections.
In any case, you need to know a bit about YAML type resolution to work with the given data correctly. When you load a YAML file into a declared native structure (typical in languages with a static type system, e.g. Java or Go), the YAML processor will map the YAML structure to the target type if that's possible. However, if no target type is given (typical in scripting languages like Python or Ruby, but also possible in Java), types are deduced from node content and style.
Since we are not working with native loading because we need to preserve formatting information, this type resolution will not be executed. However, you need to know how it works in two cases:
When you need to decide on the type of a scalar node or event, e.g. you have a scalar with content 42 and need to know whether that is a string or integer.
When you need to create a new event or node that should later be loaded as a specific type. E.g. if you create a scalar containing 42, you might want to control whether it is loaded as integer 42 or string "42" later.
I won't discuss all the details here; in most cases, it suffices to know that if a string is encoded as a scalar but looks like something else (e.g. a number), you should use a quoted scalar.
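For example, given the two scalars below, type resolution without a target type will load the first value as an integer and the second, because it is quoted, as a string:

count: 42
id: "42"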
Depending on your implementation, you may come in touch with YAML tags. Seldom used in YAML files (they look like e.g. !!str, !!map, !!int and so on), they contain type information about a node which can be used in collections with heterogeneous data. More importantly, YAML defines that all nodes without an explicit tag will be assigned one as part of type resolution. This may or may not have already happened at the Node Graph level. So in your node data, you may see a node's tag even when the original node does not have one.
Tags starting with two exclamation marks are actually shorthands, e.g. !!str is a shorthand for tag:yaml.org,2002:str. You may see either in your data, since implementations handle them quite differently.
Important for you is that when you create a node or event, you may be able to, and may also need to, assign a tag. If you don't want the output to contain an explicit tag, use the non-specific tags: ! for non-plain scalars and ? for everything else on the event level. On the node level, consult your implementation's documentation about whether you need to supply resolved tags. If not, the same rule for the non-specific tags applies. If the documentation does not mention it (few do), try it out.
So to sum up: You modify data by loading either the Event Tree or the Node Graph, you add, delete or modify events or nodes in the data you get, and then you present the modified data as YAML again. Depending on what you want to do, it may help you to create the data you want to add to your YAML file as native structure, serialize it to YAML and then load it again as Node Graph or Event Tree. From there, you can include it in the structure of the YAML file you want to modify.
Conclusion / TL;DR
YAML has not been designed for this task. In fact, it has been defined as a serialization language, assuming that your data is authored as native data structures in some programming language and from there dumped to YAML. However, in reality, YAML is used a lot for configuration, meaning that you typically write YAML by hand and then load it into native data structures.
This contrast is the reason why it is so difficult to modify YAML files while preserving formatting: The YAML format has been designed as a transient data format, to be written by one application and then loaded by another (or the same) application. In that process, preserving formatting does not matter. It does matter, however, for data that is checked in to version control (you want your diff to contain only the line(s) with data you actually changed), and in other situations where you write your YAML by hand, because you want to keep the style consistent.
There is no perfect solution for changing exactly one data item in a given YAML file and leaving everything else intact. Loading a YAML file does not give you a view of the YAML file, it gives you the content it describes. Therefore, everything that is not part of the described content – most importantly, comments and whitespace – is extremely hard to preserve.
If format preservation is important to you and you can't live with the compromises made by the suggestions in this answer, YAML is not the right tool for you.

I would like to challenge the accepted answer. Whether you can preserve comments, the order of map keys, or other features depends on the YAML parsing library that you use. For starters, the library needs to give you access to the parsed YAML as a YAML document, which is a collection of YAML nodes. These nodes can contain metadata besides the actual key/value pairs. The kinds of metadata that your library chooses to store will determine how much of the initial YAML document you can preserve. I will not speak for all languages and all libraries, but Golang's most popular YAML parsing library, go-yaml, supports parsing YAML into a document tree and serializing it back, and preserves:
comments
the order of keys
anchors and aliases
scalar blocks
However, it does not preserve indentation, insignificant whitespace, and some other minor things. On the plus side, it allows modifying the YAML document, and there's another library, yaml-jsonpath, that simplifies browsing the YAML node tree. Example:
import (
    "testing"

    "github.com/stretchr/testify/assert"
    "gopkg.in/yaml.v3"
)

func Test1(t *testing.T) {
    var n yaml.Node
    y := []byte(`# Comment
t: &t
    - x: 1 # anchor
a:
    b: *t # alias
b: |
    cccc
    dddd
`)
    err := yaml.Unmarshal(y, &n)
    assert.NoError(t, err)
    y2, _ := yaml.Marshal(&n)
    assert.Equal(t, y, y2)
}

Related

Convert xml to MAP in php using simplexml, NOT json

I'm trying to figure out how to take a simple custom XML file (it's actually an EML file, but simpleXML works with it anyway) and take tag names and the text that follows (I think simpleXML calls them children) and put them into a MAP, with key/value pairs. I've looked at some examples on this site and others about converting to arrays and such, but they all seem extremely complicated for my needs. I should note that my custom XML does not contain ANY attributes and this conversion only needs to work with MY custom XML file and not any others ever.
So a simple example of my eml file is here
<lesson>
    <unit>4</unit>
</lesson>
So then basically what I would want is a MAP, or whatever a key/value collection is called in php that would give me:
Map[0](lesson,null)
Map[1](unit,4)
It's important that I get the null values (or an empty string is OK too), so I can verify that the EML file is valid. I need to validate it with PHP, not using a namespace validator or a DTD file or however that is done. So the first key/value pair, or the root tag, HAS to be lesson, and then I'll also verify that there is a unit tag, then a title tag, then at least one other type of tag, etc... I can do that easily if I can get everything into a key/value collection. Also, there are many tag names that are the same, so keys should be non-unique. However, the values should be unique, but only per tag name. So unit can only have one "4", but another tag, let's say imageID, could also have "4". This is not a requirement but a "nice to have" and I can probably figure that out if it's not simple. But if it's REALLY hard then I will skip it altogether.
I hope this makes sense.
And no, I don't think I'm allowed to use JSON. I'm sure it can be done in simpleXML, but if it's impossible, then please provide a method to do it in JSON (assuming that JSON is included with PHP and not an extension that has to be loaded).
This is university homework, so I can't use extensions or anything else that would require anything beyond what comes with the XAMPP basic package (php, mysql, apache etc...).
Really surprised I got no votes or views or answers or anything on this. I did figure this out in the end. Oh yeah...got the tumbleweed badge for this too!
Anyway, the answer was actually quite simple. I used the simplexml_load_file function in PHP, which actually works with any XML-style file. So after running this,
$eml = simplexml_load_file("unit.eml");
I then did things like this
foreach ($eml->children() as $child) {
    $tag = $child->getName();
    $tagInfo = $child;
}
And used $tag and $tagInfo to iterate through my eml and get everything I needed.
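To round this out, here is a minimal sketch of turning that iteration into the requested key/value list; the recursive helper and its name are my own illustration, not part of the original answer:

<?php
// Recursively collect (tagName, value) pairs from a SimpleXML tree.
// Elements that only contain children get a null value.
function collectTags(SimpleXMLElement $node, array &$map)
{
    $text = trim((string) $node);
    $hasChildren = count($node->children()) > 0;
    $map[] = array($node->getName(), ($hasChildren || $text === '') ? null : $text);
    foreach ($node->children() as $child) {
        collectTags($child, $map);
    }
}

$map = array();
collectTags(simplexml_load_file('unit.eml'), $map);
print_r($map); // for the example file: [lesson, null], [unit, 4]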

Parse 88 GB rdf with PHP

How can I parse an 88 GB RDF file with PHP?
This RDF is filled with entities and facts about each entity.
I'm trying to iterate through each entity and check for certain facts per each entity. Then write those facts to an XML document I created earlier in the script.
So as I am navigating the RDF, for each entity I create a <card></card> element and give it a child called <facts>. I run through all the facts on the entity, take the ones I need, and write them as <fact></fact> element children inside the <facts></facts>.
How can I parse the rdf, extract the data, and write it to XML?
First, use an RDF parser. Googling for a PHP RDF parser turned up lots of results; I don't use PHP personally, but I'm sure one of them will do the job of parsing RDF. But make sure it's a streaming parser; you're not going to hold 88 GB of RDF in memory on your workstation.
Second, you said you need to 'iterate through each entity'; that might be tricky if either they're not sorted by subject in the original file, or the parser does not report them in the same order.
Assuming that is not a problem, then you can just keep the triples for each subject in a local data structure, and when you get a triple w/ a subject different than the ones you've queued locally, do whatever business logic you need and write out the XML. Might want to make sure you can't queue up so many statements locally that you'll OOM.
Lastly, I'm going to assume you have a good reason to take RDF and turn it into an XML format that is not RDF/XML. But you might want to reconsider your design just in case.
Or you could put the data in an RDF database and write SPARQL queries against it, transforming query results into whatever XML or anything else you need.
I think your best option would be:
use some external tool (probably something like rapper?) to convert the source file from Turtle into the n-triples format
iterate over the file one line at a time via fopen+fgets, since n-triples defines a strict one-statement-per-line constraint, which is perfect in this case; a sketch follows below
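A minimal sketch of that line-by-line approach, matching the card/facts structure from the question (the n-triples splitting here is deliberately naive, and the file names are illustrative):

<?php
// Stream an n-triples file, group consecutive triples by subject,
// and write one <card><facts>...</facts></card> element per subject.
// Assumes the input is sorted (or at least grouped) by subject.
$in = fopen('data.nt', 'r');

$out = new XMLWriter();
$out->openUri('cards.xml');
$out->startDocument('1.0', 'UTF-8');
$out->startElement('cards');

$current = null;
$facts = array();

$writeCard = function () use ($out, &$current, &$facts) {
    if ($current === null || empty($facts)) {
        return;
    }
    $out->startElement('card');
    $out->startElement('facts');
    foreach ($facts as $fact) {
        $out->writeElement('fact', $fact);
    }
    $out->endElement(); // facts
    $out->endElement(); // card
};

while (($line = fgets($in)) !== false) {
    $line = trim($line);
    if ($line === '' || $line[0] === '#') {
        continue; // skip blank lines and comments
    }
    // Naive split: subject, predicate, then the rest of the line as object.
    if (!preg_match('/^(\S+)\s+(\S+)\s+(.+?)\s*\.$/', $line, $m)) {
        continue;
    }
    if ($m[1] !== $current) {
        $writeCard(); // flush the previous entity
        $current = $m[1];
        $facts = array();
    }
    // Business logic goes here: keep only the facts you care about.
    $facts[] = $m[2] . ' ' . $m[3];
}
$writeCard(); // flush the last entity

$out->endElement(); // cards
$out->endDocument();
$out->flush();
fclose($in);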

Is it always bad practice to start an ID with a number? (CSS)

In my project I have submissions and comments, each with an ID. Currently the ID's are just numeric and correspond to their database ID's. Everything is working fine but when I run it through the W3 validator I get the error:
value of attribute "id" invalid: "1" cannot start a name
I suppose instead that I could just precede all ids with some sort of string, but then whenever I was using or manipulating the id in jQuery or PHP I would have to do an id.replace('string', '') before using it. This seems rather cumbersome. Any advice?
Yes, using numbers as HTML element IDs is bad practice.
It violates W3C specification.
You noted this in your question, and it is true for every HTML specification except HTML5
It is detrimental to SEO.
Search-engine optimized HTML element IDs SHOULD reflect the content of the identified element. Check out How To Compose HTML ID and Class Names like a Rockstar by Meitar Moscovitz. It provides a good overview of this concept.
There can be server-side scripting issues.
Back when I first started programming in classic ASP, I had to access submitted form fields with syntax like Request.Form("some_id"). However, if I did Request.Form(1), it would return the value of the second field in the form collection, instead of the element with an ID equal to 1. This is pretty standard behavior when working with collections. It's also similar in JavaScript, and could make your client-side scripting more complicated to maintain as well.
I suggest you use prefixes like "comment-ID" or "post-ID".
If you need the numeric part in JavaScript, you just have to call id.substring(8) (for the 8-character prefix "comment-").
The HTML 5 Specification lifts this restriction. If you're worried about validity you might simply consider changing the DTD to HTML5's.
http://www.w3.org/TR/html5/elements.html#the-id-attribute
If you're manipulating the element then you can just use $(this).jQueryOperation() - therefore you can have a prefix without having to replace anything!
The best way for your need is to use a prefix plus the number, something like item-x, where x is the number you need.
But from my personal experience, it is better to use classes for your elements; and you know that you must use classes if the item is not unique in the page.

What is the best way to put a translation system in php website?

I'm developing a website in PHP and I'd like to give the user to switch from German to English easily.
So, a translation policy must be considered:
Should I store the data and its translation in a database table ((1, "Hello", "hallo"), (2, "Good morning", "Guten Tag") etc .. ?
Or should I use the ".mo" Files to store it?
Which way is the best?
What are the pros and the cons?
After having just tackled this myself recently (12 languages and counting) on a production system, and having run into some major performance issues along the way, I would suggest a hybrid system.
1) Store the language strings and translations in a database--this will make it easy to interact with/update/remove items plus will be part of your normal backup routines.
2) Cache the languages into flat files on the server and draw those out as necessary to display on the page.
The benefits here are many--mostly it is fast! I am not dealing with connection overhead for MySQL or any traffic slowdowns during the transfer. (especially important if your DB server is not localhost).
This will also make it very easy to use. Store the data from your database in the file as a php serialized array and GZIP the contents of the file to shrink storage overhead (this also makes it faster in my benchmarking).
Example:
$lang = array(
    'hello'          => 'Hallo',
    'good_morning'   => 'Guten Tag',
    'logout_message' => 'We are sorry to see you go, come again!'
);
$storage_lang = gzcompress( serialize( $lang ) );
// WRITE THIS INTO A FILE SUCH AS 'my_page.de'
When a user loads your system for the first time do a file_exists('/files/languages/my_page.de'). If the file exists then load the content, un-gzip, and un-serialize and it is ready to go.
Example
$file_contents = file_get_contents( 'my_page.de' );
$lang = unserialize( gzuncompress( $file_contents ) );
As you can see you can make the caching specific to each page in the system keeping the overhead even smaller and use the file extension to denote language... (my_page.en, my_page.de, my_page.fr)
If the file DOESN'T exist then query the DB, build your array, serialize it, gzip it and write the missing file--at the same time you have just constructed the array that the page needed so continue on to display the page and everyone is happy.
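A minimal sketch of that load-or-rebuild flow. The PDO connection is an assumption, and the translations table with its string/page/global columns (described under Extension below) plus language/translation columns is my own illustration:

<?php
// Return the language array for one page, rebuilding the cache file
// from the database only when it is missing.
function loadLanguage(PDO $db, string $page, string $lang): array
{
    $cacheFile = "/files/languages/{$page}.{$lang}";

    if (file_exists($cacheFile)) {
        return unserialize(gzuncompress(file_get_contents($cacheFile)));
    }

    // Cache miss: query page-level and global strings, then write the file.
    $stmt = $db->prepare(
        'SELECT string, translation FROM translations
         WHERE language = ? AND (page = ? OR global = 1)'
    );
    $stmt->execute(array($lang, $page));
    $texts = $stmt->fetchAll(PDO::FETCH_KEY_PAIR);

    file_put_contents($cacheFile, gzcompress(serialize($texts)));
    return $texts;
}

$lang = loadLanguage($db, 'my_page', 'de');
echo $lang['hello']; // Hallo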
Finally, this allows you to build in update pages accessible to non-programmers but you also control when changes appear by deciding when to remove cache files so they can be rebuilt by the system.
Warnings and Pitfalls
When I kept everything in the database directly we hit some MAJOR slowdowns when our traffic spiked.
Trying to keep them in flat-file arrays only was so much trouble because updates were painful and prone to errors.
Not GZIP compressing the contents of the cache files made the language system about 20% slower in my benchmarks.
Make sure all of your database fields containing languages are set to utf8_general_ci (or at least one of the UTF8 collations; I find general_ci best for my use). If you don't, you will not be able to store non-Latin character sets (like Chinese, Japanese, etc.) in your database.
Extension:
In response to a comment below, be sure to set your database tables up with page level language strings in mind.
id    string          page           global
1     hello           NULL           1
2     good_morning    my_page.php    0
Anything that shows up in headers or footers can have a global flag that will be queried in every cache file created, otherwise query them by page to keep your system responsive.
PHP arrays are indeed the fastest way to load translations. However, you really don't want to update these files by hand in an editor. This might work in the beginning, and for one or two languages, but when your site grows this gets really hard to maintain.
I advise you to setup a few simple tables in a database where you keep the translations, and build a simple app that lets you update the translations (some forms to add and update texts). As for the database: use one table to store translation variables; use another to link translations to these variables.
Example:
`text`
id    variable
1     hello
2     bye

`text_translations`
id    textId    language    translation
1     1         en          hello
2     1         de          hallo
3     2         en          bye
4     2         de          tschüss
So what you do is:
create the variable in the first table
add translations for it in the second table (in whatever language you want)
After you've updated the translations, create/update a language file for each language that you're using:
select the variables you need and its translation (tip: use English if there's no translation)
create a big array with all this stuff, e.g.:
$texts = array('hello' => 'hallo', 'bye' => 'tschüss');
write the array to a file, e.g.:
file_put_contents('de.php', serialize($texts));
in your PHP/HTML create the array from the file (based on the language selected by the user), e.g.:
$texts = unserialize(file_get_contents('de.php'));
in your PHP/HTML use the variables, e.g.:
<h1><?php echo $texts['hello']; ?></h1>
or if you like/enabled PHP short tags:
<p><?=$texts['bye'];?></p>
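Putting the create/update steps together, a build script for one language could look like this. PDO and the exact SQL are my own sketch; the tables match the example above, with English as the fallback:

<?php
// Rebuild the cached language file for one language from the
// `text` and `text_translations` tables above.
function buildLanguageFile(PDO $db, string $language): void
{
    // Fall back to the English translation when one is missing.
    $stmt = $db->prepare(
        'SELECT t.variable,
                COALESCE(tr.translation, en.translation) AS translation
         FROM text t
         LEFT JOIN text_translations tr
                ON tr.textId = t.id AND tr.language = ?
         LEFT JOIN text_translations en
                ON en.textId = t.id AND en.language = "en"'
    );
    $stmt->execute(array($language));
    $texts = $stmt->fetchAll(PDO::FETCH_KEY_PAIR);

    file_put_contents($language . '.php', serialize($texts));
}

buildLanguageFile($db, 'de'); // writes de.php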
This setup is very flexible, and with a few forms to update the translations it's easy to keep your site up to date in multiple languages.
I'd also suggest the Zend Framework's Zend_Translate package.
The manual gives a good overview on How to decide which translation adapter to use. Even when not using ZF, this will give you some ideas about what is out there and what the pros and cons are.
Adapters for Zend_Translate
Array: Use PHP arrays. Small pages; simplest usage; only for programmers.
Csv: Use comma separated (*.csv/*.txt) files. Simple text file format; fast; possible problems with unicode characters.
Gettext: Use binary gettext (*.mo) files. GNU standard for Linux; thread-safe; needs tools for translation.
Ini: Use simple ini (*.ini) files. Simple text file format; fast; possible problems with unicode characters.
Tbx: Use termbase exchange (*.tbx/*.xml) files. Industry standard for inter-application terminology strings; XML format.
Tmx: Use tmx (*.tmx/*.xml) files. Industry standard for inter-application translation; XML format; human readable.
Qt: Use qt linguist (*.ts) files. Cross-platform application framework; XML format; human readable.
Xliff: Use xliff (*.xliff/*.xml) files. A simpler format than TMX but related to it; XML format; human readable.
XmlTm: Use xmltm (*.xml) files. Industry standard for XML document translation memory; XML format; human readable.
There are some factors you should consider.
Will the website be updated frequently? If yes, by whom? You or the owner? How much data/information are you dealing with? And also: are you doing this frequently (for many clients)?
I can hardly think that using a relational database can cause any serious speed impact unless you are having VERY high traffic (several hundreds of thousands of pageviews per day).
Should you be doing this frequently (for lots of clients), think no further: build up a CMS (or use an existing one). If you really need to consider speed impact, you can customize it so that when you are done with the website you can export static HTML pages where possible.
If you are updating frequently, the same as above applies.
If the client has to update (and not you), again, you need a CMS.
If you are dealing with lots of information (big and many articles), you need a CMS.
All in all, a CMS will help you build up your website structure fast, add content fast and not worry that much about code since it will be reusable.
Now, if you just need to create a small website fast, you can easily do this with hardcoded arrays and datafiles.
If you need to provide web interface for adding/editting translations, then database is a good idea.
If, however, your translations are static, I would use gettext or even plain PHP array.
Either way you can take advantage of Zend_Translate.
A small comparison, the first two taken from the Zend tutorial:
Plain PHP arrays: Small pages; simplest usage; only for programmers.
Gettext: GNU standard for linux; thread-safe; needs tools for translation.
Database: Dynamic; Worst performance.
I would recommend PHP arrays, they can be built around a GUI for easy access.
Realize that everybody in the world dealing with computers usually knows some common English used in computing or on the internet, like About Us, Home, Send, Delete, Read More, etc. Question: do they really need to be translated?
OK, honestly, translating those words is actually not about being 'required'; it's all about 'style'.
Now, if it's really wanted: for the common words that never need to change, it's better to use a PHP file which outputs the language array for only the local language and English. And for content such as blogs, news and descriptions, use a database and save as many language translations as required. You must do it manually.
Using and relying on Google Translate? I think you have to think 1000 times. At least for this decade.

How do I design a web interface for browsing text man pages?

I would like to design a web app that allows me to sort, browse, and display various attributes (e.g. title, tag, description) for a collection of man pages.
Specifically, these are R documentation files within an R package that houses a collection of data sets, maintained by several people in an SVN repository. The format of these files is .Rd, which is LaTeX-like, but different.
R has functions for converting these man pages to html or pdf, but I'd like to be able to have a web interface that allows users to click on a particular keyword, and bring up a list (and brief excerpts) for those man pages that have that keyword within the \keyword{} tag.
Also, the generated html is somewhat ugly and I'd like to be able to provide my own CSS.
One obvious option is to load all the metadata I desire into a database like MySQL and design my site to run queries and fetch the appropriate data.
I'd like to avoid that to minimize upkeep for future maintainers. The number of files is small (<500) and the amount of data is small (only a couple of hundred lines per file).
My current leaning is to have a script that pulls the desired metadata from each file into a summary JSON file, then load this summary.json file in PHP, decode it, and loop through the array looking for those items that have attributes matching the current query (e.g. all docs with keyword1 AND keyword2).
I was starting in that direction with the following...
$contents = file_get_contents("summary.json");
$c = json_decode($contents, true);
foreach ($c as $ind => $val) {
    // ... check each item's attributes against the current query
}
Another idea was to write a script that would convert these .Rd files to xml. In that case, are there any lightweight frameworks that make it easy to sort and search a small collection of xml files?
I'm not sure if XQuery is overkill or if I have time to dig into it...
I think I'm suffering from too-many-options-syndrome with all the AJAX temptations. Any help is greatly appreciated.
I'm looking for a super simple solution. How might some of you out there approach this?
My approach would be to parse the keywords (from your description I assume they have a special notation to distinguish them from normal words/text) from the files and store this data as a search index somewhere. It does not have to be MySQL; SQLite would surely be enough for your project.
A search would then be very simple.
Parsing the files could be automated as a post-commit hook in your Subversion repository.
Why don't you create a table SUMMARIES with a column for each of the summary's fields?
Then you could index that with a full-text index, assigning a different weight to each field.
You don't need MySQL; you can use SQLite, which has Google's full-text indexing (FTS3) built in.
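A minimal sketch of that idea, assuming PHP's pdo_sqlite driver and an SQLite build with FTS3 enabled; the table, column and file names are illustrative:

<?php
// Build a small full-text index over the man-page summaries, then query it.
$db = new PDO('sqlite:manpages.db');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$db->exec('CREATE VIRTUAL TABLE summaries USING fts3(title, keywords, description)');

// Index one row per .Rd file (filled in by your metadata-extraction script).
$insert = $db->prepare('INSERT INTO summaries (title, keywords, description)
                        VALUES (?, ?, ?)');
$insert->execute(array('mydata', 'keyword1 keyword2', 'A brief excerpt...'));

// All docs whose keywords contain keyword1 AND keyword2.
$query = $db->prepare('SELECT title, description FROM summaries WHERE keywords MATCH ?');
$query->execute(array('keyword1 keyword2'));

foreach ($query->fetchAll(PDO::FETCH_ASSOC) as $row) {
    echo $row['title'], ': ', $row['description'], "\n";
}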
