6. Input and output files

At the moment there are two binary file formats extensively used by MDANSE: Network Common Data Form (NetCDF) and Hierarchical Data Format (HDF5). The former is used for storing trajectories, and the latter for the analysis results.

Note

The plan for future releases of MDANSE is to phase out the NetCDF format and focus exclusively on HDF5. This does not affect the way you use MDANSE at the moment, but please be aware that you may need to convert your trajectories into HDF5 when switching to MDANSE 2.0.

However, in certain circumstances MDANSE can use or produce other types of files. This section first explains the NetCDF file format in detail and then introduces the other file formats used by MDANSE.

6.1. NetCDF file format

NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. The project homepage is hosted by the Unidata program at the University Corporation for Atmospheric Research. [Ref6] Unidata is also the chief source of NetCDF-based software, standards development, updates, etc. The format is an open standard.

The data format is self-describing. This means that there is a header which describes the layout of the rest of the file, in particular the data arrays, as well as arbitrary file metadata in the form of name/value attributes. The format is platform independent, with issues such as endianness being addressed in the software libraries. The data arrays are rectangular, not ragged, and stored in a simple and regular fashion that allows efficient subsetting.

MDANSE expects trajectories to be in NetCDF format and to follow the conventions of the Molecular Modelling ToolKit (MMTK). Trajectories that have not been produced with MMTK or MMTK-based programs must be converted to MMTK format before they can be analysed with MDANSE. This conversion is necessary because no other common trajectory format permits efficient access both to conformations at a given time and to one-atom trajectories for all times. In addition to providing such access, the NetCDF format has several advantages that make it particularly suitable for archiving trajectories:

  • compact files (binary storage)

  • machine-independent format

  • fully self-contained files (complete information about the system is stored in the trajectory file)
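The two access patterns mentioned above (a conformation at a given time, and a one-atom trajectory over all times) can be illustrated with a plain rectangular array. This is only a sketch: it uses NumPy rather than a NetCDF library, and the array layout (frame, atom, coordinate), the sizes, and the variable names are assumptions made for illustration, not the actual MMTK layout.

```python
import numpy as np

# Hypothetical sizes; a real trajectory would be read from a file.
n_frames, n_atoms = 5, 4

# Stand-in for a trajectory variable: one (x, y, z) triple per atom per frame.
coords = np.arange(n_frames * n_atoms * 3, dtype=float).reshape(n_frames, n_atoms, 3)

# Conformation at a given time: all atoms in frame 2.
conformation = coords[2]      # shape (n_atoms, 3)

# One-atom trajectory: atom 1 across all frames.
atom_path = coords[:, 1]      # shape (n_frames, 3)

print(conformation.shape, atom_path.shape)
```

Because the array is rectangular, both selections are simple slices along one axis, which is what makes this kind of subsetting cheap.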

Trajectories in other formats can be converted to the MMTK format directly in the MDANSE GUI, using the Trajectory Converters.

MMTK NetCDF files are, however, more than just input files; they are at the centre of MDANSE. The result of an Analysis is, by default, written into an MMTK NetCDF file, which can then be used again as an input file. The 2D/3D Plotter, the inbuilt tool for graph visualisation, only works with MMTK NetCDF files.

6.2. HDF5 file format

HDF is a set of file formats designed to store and organise large amounts of data. The project is maintained by The HDF Group [Ref7], a non-profit corporation which ensures its continued development and accessibility. The associated libraries and tools are available under a liberal license for general use.

HDF5 is the latest version, and its use is widespread; even version 4 of the NetCDF format is built on top of HDF5. It is organised hierarchically, like a file system, and uses POSIX-like path syntax. The data is stored in datasets (n-dimensional arrays), which are collected into groups (folder-like containers). Either can then be annotated with metadata by adding attributes.
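The hierarchy described above can be sketched with the third-party h5py package (assumed to be installed together with NumPy). The file name, group name, dataset name, and attribute values below are all hypothetical and do not reflect the layout MDANSE actually writes.

```python
import h5py
import numpy as np

# An in-memory file (nothing is written to disk); all names are hypothetical.
with h5py.File("example.h5", "w", driver="core", backing_store=False) as f:
    # A group acts as a container, much like a directory.
    group = f.create_group("results")

    # A dataset is an n-dimensional array stored inside a group.
    dataset = group.create_dataset("msd", data=np.linspace(0.0, 1.0, 11))

    # Attributes attach metadata to either kind of object.
    dataset.attrs["units"] = "nm2"
    group.attrs["analysis"] = "demo"

    # Objects are addressed with POSIX-like paths.
    same = f["/results/msd"]
    path = same.name
    units = same.attrs["units"]
    n_points = same.shape[0]

print(path, units, n_points)
```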

It is a goal to replace NetCDF with HDF5 as the main storage format, and therefore MDANSE supports HDF5 output for Analysis and input for plotters.

6.3. DAT file format

When performing an Analysis, a binary output format is selected by default, but it is possible to change it to ASCII, which indicates a text file output. If the ASCII option is selected, a tarball is generated. Inside are multiple files which together contain the results of the analysis. Firstly, there is a text file, jobinfo.txt, which contains the options that were selected when performing the analysis.

Secondly, there is a DAT file for each variable generated by the analysis. Each file is named after the variable it contains, and this name is identical to the name that would appear in the 2D/3D Plotter if the equivalent NetCDF file were loaded. Each file begins with a few commented lines describing the variable:

  • variable name

  • type of plot (this represents the dimensions of plot)

  • which variable is on the x-axis if the variable in this DAT file were to be plotted on the y-axis

  • units in which the data is written

  • the length of the trajectory (indicated as slice:[length])

After that is a list of numbers representing the variable as described.
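Reading such an archive can be sketched with the Python standard library alone. The example below builds a small stand-in tarball in memory and then reads it back. The comment marker (#), the .dat extension, the member names other than jobinfo.txt, and the header wording are assumptions made for illustration; check them against a real MDANSE output before relying on this.

```python
import io
import tarfile


def parse_dat(text, comment="#"):
    """Split DAT text into header lines (marker removed) and numeric rows."""
    header, rows = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith(comment):
            header.append(stripped[len(comment):].strip())
        else:
            rows.append([float(token) for token in stripped.split()])
    return header, rows


def add_text(tar, name, text):
    """Add a text member to an open tar archive."""
    data = text.encode("utf-8")
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


# Build a stand-in archive in memory; the contents are hypothetical.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    add_text(tar, "jobinfo.txt", "hypothetical_option = value\n")
    add_text(tar, "msd.dat", "# variable name: msd\n# units: nm2\n0.0\n0.5\n1.0\n")

# Read it back: the job options first, then every DAT member.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as tar:
    job_info = tar.extractfile("jobinfo.txt").read().decode("utf-8")
    results = {}
    for name in tar.getnames():
        if name.endswith(".dat"):
            text = tar.extractfile(name).read().decode("utf-8")
            results[name] = parse_dat(text)

header, rows = results["msd.dat"]
print(job_info.strip())
print(header)
print(rows)
```

For a tarball on disk, the in-memory buffer would be replaced by passing the file path to tarfile.open.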

6.4. MDANSE scripts

These files are Python scripts that, when run, perform a given analysis with all the options set the way they were when the script was created. They can be run like any other script; you only have to make sure you use the Python interpreter that comes with MDANSE. For more information about MDANSE Python, read Using MDANSE Command Line Interface.