THE GETDATA PROJECT
===================

The GetData Project is the reference implementation of the Dirfile Standards.
The Dirfile database format is designed to provide a fast, simple format for
storing and reading binary time-ordered data.  The Dirfile Standards are
described in detail in three Unix manual pages distributed with this package:
dirfile(5), dirfile-format(5) and dirfile-encoding(5), which may be read before
installation by running:

  $ man man/dirfile.5
  $ man man/dirfile-format.5
  $ man man/dirfile-encoding.5

from the top GetData Project directory (the directory containing this README
file).  After installation, they can be read with the standard man command:

  $ man dirfile
  $ man dirfile-format
  $ man dirfile-encoding

More information on the GetData Project and the Dirfile database format may be
found on the World Wide Web:

  http://getdata.sourceforge.net/


WARRANTY AND REDISTRIBUTION
===========================

GetData is free software; you can redistribute it and/or modify it under the
terms of the GNU Lesser General Public License as published by the Free Software
Foundation: either version 2.1 of the License, or (at your option) any later
version.

GetData is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.  See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along
with GetData in a file called `COPYING'; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA.


CONTENTS
========

This package provides:
  * the Dirfile Standards documents (three Unix manual pages)
  * the C GetData library (libgetdata) including Unix manual pages
  * The handy checkdirfile utility, included also as an example
  * several bindings to the library from other languages:
    - C++ (libgetdata++)
    - Fortran 77 (libfgetdata)
    - Fortran 95 (libf95getdata)
    - Python (pygetdata)
    - the Interactive Data Language (IDL; idl_getdata)

Documentation for the various bindings, if present, can be found in files
named `README.<language>' in the doc/ directory.  The C interface is described
in this document and the associated man pages.

A full list of features new to this release of GetData may be found in the
file called `NEWS'.


DIRFILE STANDARDS VERSION 7
===========================

The 0.6.0 release of the GetData Library (October 2009), was released in
conjunction with a new version of the Dirfile Standards, known as Standards
Version 7.

Standards Version 7 introduces the following:
  * Support for single and double precision floating-point complex data.
    This includes the support for complex literals in the format file, as well
    as operators ("representations") to compute norms of complex data.
  * Two new field types: SBIT, a (two's complement) signed bitfield; and
    POLYNOM, a polynomial function of a given input field.
  * The number of fields parameter for LINCOM is now optional.

This is the first update to the Dirfile Standards since Standards Version 6
(October 2008).

Standards Version 6 introduced the following:
  * Support for two scalar types: CONST (numerical) and STRING (a character
    string)
  * The ability to attach subdata to fields using the META directive.
  * Support for different "encodings", which change how binary data is stored
    on disk.
  * Significantly more permissive and explicit rules on allowed characters in
    format file tokens.
  * Removal of "FILEFRAM" as an alias for the implicit "INDEX" field.  It is
    encouraged that made use of FILEFRAM be updated to use INDEX instead.
    However, a work-around is to add:

      FILEFRAM LINCOM 1 INDEX 1 0

    to the dirfile's format file to recreate the old behaviour.
  * The ability to flag portions of the dirfile as protected.

Standards Version 6 follows closely on the heels of Standards Version 5,
released in August of 2006 with GetData 0.3.0.

Standards Version 5 introduced the following:
  * Three new supported data type: 8-bit signed integers, and 64-bit signed
    and unsigned integers
  * A new, uniform way of referring to data types.  Previous versions of the
    Standards used the single character data specifiers: c, u, s, U, S, i, f,
    and d.  The new standard defines the specifiers UINT8, INT8, UINT16, INT16,
    UINT32, INT32, UINT64, INT64, FLOAT32, and FLOAT64.  (FLOAT is also accepted
    as an alias for FLOAT32 and DOUBLE for FLOAT64.)  Accordingly, a RAW field
    declaration, which in previous versions of the Dirfile Standards would
    have been:

      field RAW u 20

    would be preferred by Standards Version 5 to be:

      field RAW UINT16 20

    For backwards compatibility, the single character specifiers are still
    fully acceptable in format file declarations, although no single character
    specifiers exist for the new data types.
  * Unlimited field name lengths.  Previous versions of the Standards limited
    fields to 50 characters or less.  Field name length is still effectively
    limited by the maximum length of a filename and the maximum length of a
    format file line, either of which will typically effectively limit field
    names to a few thousand characters.
  * Forbidding the implicit fields INDEX and FILEFRAM to be defined in the
    format file.  Previous Standards Versions allowed fields with these names to
    be defined, but old GetData implementations would never let you access them,
    returning the implicit frame-count field instead.
  * 64-bit BIT fields.  Previous Standards Versions limited BITs to 32-bits.
    The size of any particular BIT field is still limited by the underlying RAW
    type on which it depends.
  * The ENDIAN directive for format files, which can be used to specify the
    byte-sex of the underlying raw data.  Previous Standards Versions indicated
    that the byte-sex of raw data in a dirfile was always little-endian, but
    old GetData implementations assumed data was native endian, by never doing
    endianness conversion.  Dirfile Standards Version 5 adopts the behaviour
    of GetData rather than that of previous Standards Versions: dirfiles lacking
    an ENDIAN directive will be assumed to have native endianness.  Facilities
    exist in the new GetData API to override this behaviour.
  * The VERSION directive for format files, which specifies the Standards
    Version to which the dirfile conforms.  The Standards indicate that this
    directive is optional, and should not be needed to properly parse a format
    file conforming to Standards Version 5 or any earlier version.  It is used
    by GetData to ease forward-compatibility: if GetData encounters a Dirfile
    conforming to a newer Standards Version than it is prepared to parse,
    GetData will silently ignore any lines it doesn't understand.  This allows
    GetData to make a best-effort attempt at reading Standards newer than it
    knows.

A full history of the Dirfile Standards can be found in the dirfile-format(5)
man page.


BUILDING THE LIBRARY
====================

This package may be configured and built using the GNU autotools.  Generic
installation instructions are provided in the file called `INSTALL'.  A
brief summary follows.  A C99-compliant compiler is required to build (but not
to use) the package.

Most users should be able to build the package by simply executing:

  $ ./configure
  $ make

from the top GetData Project directory (the directory containing this README
file).  After the project has been built, you may (optionally) test the build by
executing:

  $ make check

which will run a series of self-tests.  Finally,

  $ make install

will install the utilities, libraries, bindings, and documentation.

The package configuration can be changed, if the default configuration is
insufficient, before building it by passing options to ./configure.  Running

  $ ./configure --help

will display a brief help message summarising available options.


USING THE LIBRARY
=================

To use the library in C programs, the header file getdata.h should be included.
This file declares both the new and legacy APIs.  Programs linking to static
versions of the GetData library need to be linked against the C Standard Math
Library, in addition to GetData itself, plus any other libraries required for
encodings supported by the library, e.g. using `-lgetdata -lm
<other-libraries>'.

The `checkdirfile' program is provided as an example of library use in the
`util' subdirectory of the package.  This program will be installed in
${prefix}/bin, unless disabled by passing `--disable-checkdirfile' to
./configure.

This utility was designed to report syntax error in the format file(s) of the
large, complex dirfiles used in the analysis of the BLAST experiment data.
This utility will report all syntax errors it find in the supplied dirfile,
plus any problems in the metadata itself.

Bindings exist for using the GetData library in languages other than C.  If
language bindings exist for your particular library, a README.<language> file
explaining its use should be present in the `doc' subdirectory.

If no bindings exist for your language of choice, you will have to write your
own.  If you are willing to have these bindings redistributed under identical
terms as the GetData Project, we encourage you to submit them for inclusion in
the package.


USING MODULES
=============

Starting with GetData 0.5.0, encoding schemes which rely on optional external
libraries (slim, gzip, bzip2) may be built as stand-alone library modules which
will be dynamically loaded, as needed, at runtime by the core GetData library.
This architecture is provided to permit packaging the core GetData library
separately from the parts of the library requiring the optional external
libraries without having to give up the functionality these extra libraries
provide.  To enable this behaviour, pass `--enable-modules' to ./configure.

The modules are dynamically loaded via GNU libtool's portable dlopen wrapper
library, libltdl.  The libltdl library permits dynamic loading of modules on at
least Solaris, Linux, BSD, HP-UX, Win16, Win32, BeOS, Darwin, Mac OS X.  The
libltdl library is no longer distributed with GetData.  It can usually be found
as part of the GNU libtool package, on any modern system.

GetData will fail gracefully if a library module is not found or cannot be
opened at runtime.  In this case, the call which triggered the attempt to load
the module will fail with the GD_E_UNSUPPORTED error.

The dlopen library (and, by extension, libltdl) is notoriously not thread safe.
As a result, if a POSIX thread implementation can be found, calls to the dynamic
loader will be wrapped in a mutex.


THE GETDATA HEADER FILE
=======================

The GetData header file, `getdata.h', installed in ${prefix}/include, declares
the new API.  It also includes `getdata_legacy.h' (also installed) which
declares the legacy API.  The legacy header should never be included directly.
Defining the preprocessor symbol `NO_GETDATA_LEGACY_API' before including
getdata.h will prevent the legacy API from being declared.  In cases when the
legacy API is declared, getdata.h will define the symbol GETDATA_LEGACY_API,
which can be used by callers to determine whether the legacy API is present at
compile time.

If the legacy API is not built (as a result of passing `--disable-legacy-api'
to ./configure), getdata_legacy.h will not be installed, and the legacy API will
never be declared.


LARGEFILE SUPPORT
=================

When built on a platform using the GNU C Library, or another compatible
C Library, the new getdata API will respect the feature test macros
_LARGEFILE64_SOURCE and _FILE_OFFSET_BITS affecting largefile (> 2 GB)
support.  If one or the other of these are to be used, they must be defined
before including getdata.h or any Standard C Library header file.

The first of these, _LARGEFILE64_SOURCE, if defined before including getdata.h,
will enable the obsolete, transitional largefile extensions defined by the LFS.
This will enable explicit support for large files through the definition of the
64-bit explicit type `off64_t', and result in getdata defining the explicitly
64-bit interfaces `getdata64', `putdata64', `get_nframes64', &c.  This macro is
largely obsolete, and using _FILE_OFFSET_BITS is preferred, if supported.

The second macro, _FILE_OFFSET_BITS, determines the size of `off_t'.  If not
defined, or defined to `32', `off_t' will be the old 32-bit type.  If, instead,
this macro is defined to `64', `off_t' will be the largefile supporting 64-bit
type, and calls to getdata, putdata, &c. will intrinsically have largefile
support.  On 64-bit systems this macro has no effect, since a 64-bit `off_t'
is used all the time.

If your system uses the GNU C Library, the feature_test_macros(7) man page
will provide further explanation.  On systems where these macros are
unsupported, the getdata64, &c. interfaces will never be defined, and the size
of `off_t' will be system dependent.  In this case, GetData will likely lack
largefile support.

If you build GetData against a C Library that lacks largefile support, the
getdata library will not support large files either, no matter what you do with
these macros.


THE API CHANGE
==============

Users of older versions of GetData will notice a fairly substantial change in
the API starting with version 0.3.0 as compared with older version.  Older
versions of this library (hereafter referred to as the "legacy API") suffered
from thread safety issues, and lacked LFS (large file) support.

The new API separates the opening of a Dirfile from reading or writing to it.
Where in the old API one would use:

  int n_read = GetData("dirfile", "field", 1, 0, 1, 0, 'S', data, &error_code);

the corresponding new API would have:

  DIRFILE* D = dirfile_open("dirfile", GD_RDONLY);
  size_t n_read = getdata(D, "field", 1, 0, 1, 0, GD_INT32, data);

Here, D is a pointer to a DIRFILE object.  This object is modelled after
the FILE object of the C Library's stdio interface.

Where the old API was passed an integer pointer to store the error code, this is
now accessed via the DIRFILE object directly: D->error, which is of type int.
This is the only public member of the DIRFILE object.  As a side-effect of this
change, the descriptive error string which can be generated by the library is
now local to a particular instance of a particular dirfile, rather than being
global across the library.

Once a DIRFILE object has been created by a call to dirfile_open, all subsequent
operations on the dirfile operate on this object.  Once the program is finished
with the dirfile, the object can be destroyed, and all open file handles closed,
with the call:

  dirfile_close(D);

Partially in order to fully support large files (>2 GB) as defined by the LFS,
a consistent data type structure is used in the new API:

  * database offsets and database sizes are of type `off_t'
  * object sizes and counts of items read are of type `size_t'
  * samples-per-frame are of type `unsigned int'
  * data type specifiers are of type `gd_type_t', which is defined in getdata.h.
    Any time this type is specified, one of the following symbols, also defined
    in getdata.h, should be used: GD_NULL, GD_UINT8, GD_INT8, GD_UINT16,
    GD_INT16, GD_UINT32, GD_INT32, GD_UINT64, GD_INT64, GD_FLOAT32, GD_FLOAT64,
    or GD_UNKNOWN.  Using a legacy API single character type specifier for this
    type will result in the library error GD_E_BAD_TYPE.

The legacy API continues to use `int' for offsets, sizes, and counts, which
prevents it from supporting large files.

The new API is fully documented in the included man pages.  A translation
example from the legacy API to the new API is present as an Appendix at the end
of this document.


AUTHOR
======

The Dirfile Standards and the GetData library were conceived and written by
C. B. Netterfield <netterfield@astro.utoronto.ca>.

Since Standards Version 3 (January 2006), the Dirfile Standards and GetData
have been maintained by D. V. Wiebe <dvw@ketiltrout.net>.

A full list of contributors is given in the file called `AUTHORS'.


APPENDIX:  API TRANSLATION EXAMPLE
==================================

The following example programs demonstrate how to convert from the legacy to
the new API:

  /* Legacy API */
  #include <getdata.h>
  #include <stdlib.h>
  #include <stdio.h>

  int main(void)
  {
    const char* dirfile_name = "/var/dirfile"; /* dirfile name */
    const char* field_name = "datafield";      /* field code */
    char error_buffer[1024];
    int first_frame = 1000;
    int error_code;

    /* Get size of the database -- third argument is ignored */
    int nf = GetNFrames(dirfile_name, &error_code, NULL);
    if (error_code) {
      printf("GetData error: %s\n", GetDataErrorString(error_buffer, 1024));
      exit(1);
    }

    /* Get samples-per-frame */
    int spf = GetSamplesPerFrame(dirfile_name, field_name, &error_code);
    if (error_code) {
      printf("GetData error: %s\n", GetDataErrorString(error_buffer, 1024));
      exit(1);
    }

    /* Allocate a buffer */
    double* data_buffer = malloc(sizeof(double) * spf * (nf - first_frame));

    /* Retrieve all but the first 1000 frames */
    int n_read = GetData(dirfile_name, field_name, first_frame, 0,
        nf - first_frame, 0, 'd', data_buffer, &error_code);
    if (error_code) {
      printf("GetData error: %s\n", GetDataErrorString(error_buffer, 1024));
      exit(1);
    }

    /* Clean up */
    free(data_buffer);

    return 0;
  }

  /* New API -- same header file */
  #include <getdata.h>
  #include <stdlib.h>
  #include <stdio.h>

  int main(void)
  {
    const char* dirfile_name = "/var/dirfile"; /* dirfile name */
    const char* field_name = "datafield";      /* field code */
    char error_buffer[1024];
    off_t first_frame = 1000; /* off_t is for dirfile offsets and lengths */

    /* Open the dirfile */
    DIRFILE* dirfile = dirfile_open(dirfile_name, GD_RDONLY);
    if (get_error(dirfile)) {
      /* dirfile_open() returns a pointer to a newly allocated DIRFILE
         object even if the open failed. This DIRFILE object should still
         be freed by calling dirfile_close() after checking the error state */

      printf("GetData error: %s\n",
          get_error_string(dirfile, error_buffer, 1024));
      dirfile_close(dirfile);
      exit(1);
      }

      /* Get size of the database */
      off_t nf = get_nframes(dirfile); /* again off_t */
      if (get_error(dirfile)) {
        printf("GetData error: %s\n",
            get_error_string(dirfile, error_buffer, 1024));
        dirfile_close(dirfile);
        exit(1);
      }

      /* Get samples-per-frame */
      unsigned int spf = get_spf(dirfile, field_name);
      if (get_error(dirfile)) {
        printf("GetData error: %s\n",
            get_error_string(dirfile, error_buffer, 1024));
        dirfile_close(dirfile);
        exit(1);
      }

      /* Allocate a buffer */
      double* data_buffer = malloc(sizeof(double) * spf * (nf - first_frame));

      /* Retrieve all but the first 1000 frames -- size_t is for counts of
         objects read */
      size_t n_read = getdata(dirfile, field_name, first_frame, 0,
          nf - first_frame, 0, GD_FLOAT64, data_buffer);
      if (get_error(dirfile)) {
        printf("GetData error: %s\n",
            get_error_string(dirfile, error_buffer, 1024));
        dirfile_close(dirfile);
        exit(1);
      }

      /* Clean up */
      free(data_buffer);

      /* not strictly necessary at the end of the program for a dirfile opened
         read-only */
      dirfile_close(dirfile);

      return 0;
    }
