Using Binary Files In ProMod3

A few features in ProMod3 (and potentially your next addition) require binary files to be loaded and stored. Here, we provide guidelines and describe helper tools to perform tasks related to loading and storing binary files.

Generally, each binary file consists of a short header and binary data. The header ensures consistency between the storing and the loading of data, while the “binary data” is some binary representation of the data of interest.

The main issue we try to address is that in C++, the binary representation of objects can be machine- and compiler-dependent. The standard does guarantee that sizeof(char) == 1 and that std::vector is contiguous in memory. Everything else (e.g. sizeof(int), endianness, padding of structs) can vary. Two approaches can be used:

  1. Raw binary data files which are very fast to load, but assume a certain memory-layout for the internal representation of data

  2. Portable binary data files which are slow to load, but do not assume a given memory-layout for the internal representation of data
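To make the layout dependence concrete, here is a minimal sketch that is not part of ProMod3. The struct Record and the helpers PackedSize and IsLittleEndian are hypothetical names chosen for illustration. It shows two of the properties a raw binary file silently depends on: struct padding and byte order.

```cpp
#include <cassert>   // assert() for quick self-checks by the caller
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical record (not a ProMod3 type). The compiler may insert
// padding bytes between c and i, so sizeof(Record) is not guaranteed
// to equal the sum of the member sizes.
struct Record {
  char c;
  std::int32_t i;
};

// Sum of the member sizes, i.e. the size without any padding.
inline std::size_t PackedSize() {
  return sizeof(char) + sizeof(std::int32_t);
}

// Returns true if this machine stores the least significant byte first.
inline bool IsLittleEndian() {
  const std::uint32_t probe = 1;
  unsigned char first_byte;
  std::memcpy(&first_byte, &probe, 1);
  return first_byte == 1;
}
```

A raw dump of a Record written on one machine can only be read back by a program where sizeof(Record), the padding and the result of IsLittleEndian() all agree. This is what the file header described below is meant to detect.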

Portable I/O should always be provided for binary files. If this is too slow for your needs, you can provide functionality for raw binary files. In that case you should still distribute only the portable file and provide a converter which loads the portable file and stores a raw binary file for further use. Storing and loading of raw binary files on the same machine with the same compiler should never be an issue.

For instance, the classes TorsionSampler, FragDB, StructureDB, BBDepRotamerLib and RotamerLib use this approach and the conversion is automatically done in the make process. Code examples are given in the unit tests in test_check_io.cc and test_portable_binary.cc and in the C++ code of the classes listed above (see methods Load, Save, LoadPortable and SavePortable).

File Header

The header is written/read with functions provided in the header file promod3/core/check_io.hh. The header is written/read before the data itself and is structured as follows:

  • a “magic number” (ensures that we can read uint32_t which is needed for the following fields)

  • a version number (allows for backwards-compatibility)

  • sizes for all types which are treated as raw memory (i.e. cast to a byte (char) array and written either to memory or to a stream)

  • example values for the used base-types (ensures we can e.g. read an int)

For portable I/O (see below), we only write/read fixed-width fundamental data-types (e.g. int32_t, float). Hence, we only check if we can read/write those types. When data is converted from a non-fixed fundamental type T (e.g. uint, short, Real), we furthermore ensure that the size of the used fixed-width type (as written to file) is <= sizeof(T).

All write functions (when saving a binary) should be mirrored by the corresponding check (or get) function in the exact same order when loading.

All functions are templatized to work with any OST-like data sink or source and overloaded to work with std::ofstream and std::ifstream.
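The idea of mirrored write and check calls can be sketched with plain standard streams. This is a simplified illustration and not the code in promod3/core/check_io.hh: the names WriteRaw, ReadRaw, WriteHeader, CheckHeader and the constants kMagic and kExampleInt are assumptions made for this example, and only one type (int) is covered.

```cpp
#include <cassert>   // assert() for quick self-checks by the caller
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>

// Hypothetical constants (ProMod3's actual values are not shown here).
const std::uint32_t kMagic = 0x12345678u;
const int kExampleInt = -1234;

// Write the native in-memory bytes of value.
template <typename T>
void WriteRaw(std::ostream& out, const T& value) {
  out.write(reinterpret_cast<const char*>(&value), sizeof(T));
}

// Read sizeof(T) bytes into a T, failing loudly on short reads.
template <typename T>
T ReadRaw(std::istream& in) {
  T value;
  in.read(reinterpret_cast<char*>(&value), sizeof(T));
  if (!in) throw std::runtime_error("Unexpected end of file.");
  return value;
}

// Header layout: magic number, version, sizeof(int), example int.
void WriteHeader(std::ostream& out, std::uint32_t version) {
  WriteRaw(out, kMagic);
  WriteRaw(out, version);
  WriteRaw(out, static_cast<std::uint32_t>(sizeof(int)));
  WriteRaw(out, kExampleInt);
}

// Mirrors WriteHeader field by field, in the same order.
// Returns the version number so the caller can decide how to proceed.
std::uint32_t CheckHeader(std::istream& in) {
  if (ReadRaw<std::uint32_t>(in) != kMagic)
    throw std::runtime_error("Magic number mismatch.");
  const std::uint32_t version = ReadRaw<std::uint32_t>(in);
  if (ReadRaw<std::uint32_t>(in) != sizeof(int))
    throw std::runtime_error("Size of int differs from the writer.");
  if (ReadRaw<int>(in) != kExampleInt)
    throw std::runtime_error("Example int could not be read back.");
  return version;
}
```

If the reader swaps two checks or skips one, every later field is read at the wrong offset, which is why the order must match exactly.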

Portable Binary Data

Portable files are written/read with functions and classes provided in the header file promod3/core/portable_binary_serializer.hh. Generally, we store any data-structure value-by-value as fixed-width types!

Writing and reading is performed by the following classes:

  • PortableBinaryDataSink to write files (opened as std::ofstream)

  • PortableBinaryDataSource to read files (opened as std::ifstream)

Each serializable class must define a Serialize function that accepts sinks and sources, such as:

template <typename DS>
void Serialize(DS& ds) {
  // serialize element-by-element
}

Or if this is not possible for an object of type T, we need to define global functions such as:

inline void Serialize(core::PortableBinaryDataSource& ds, T& t) { }
inline void Serialize(core::PortableBinaryDataSink& ds, T t) { }
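The reason a single templated Serialize member can serve both directions is that sinks and sources share the same operator. The following toy sketch illustrates this with hypothetical classes ToySink and ToySource and a hypothetical struct Point. They are stand-ins written for this example, are not the ProMod3 classes, and do no endianness handling.

```cpp
#include <cassert>   // assert() for quick self-checks by the caller
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

// Toy sink: operator& writes the value to the stream.
class ToySink {
public:
  explicit ToySink(std::ostream& s): stream_(s) { }
  bool IsSource() const { return false; }
  ToySink& operator&(std::int32_t& v) {
    stream_.write(reinterpret_cast<const char*>(&v), sizeof(v));
    return *this;
  }
private:
  std::ostream& stream_;
};

// Toy source: the same operator& reads the value from the stream.
class ToySource {
public:
  explicit ToySource(std::istream& s): stream_(s) { }
  bool IsSource() const { return true; }
  ToySource& operator&(std::int32_t& v) {
    stream_.read(reinterpret_cast<char*>(&v), sizeof(v));
    return *this;
  }
private:
  std::istream& stream_;
};

// One Serialize definition covers saving and loading, so the field
// order cannot get out of sync between the two.
struct Point {
  std::int32_t x;
  std::int32_t y;

  template <typename DS>
  void Serialize(DS& ds) {
    ds & x;
    ds & y;
  }
};
```

Calling p.Serialize(sink) writes x and then y, while p.Serialize(source) fills them in the same order.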

Given a sink or source object ds, we read/write an object v as:

  • ds & v, if v is an instance of a class, a bool or any fixed-width type (e.g. char, int32_t, float)

  • core::ConvertBaseType<T>(ds, v), where T is a fixed-width type. v will then be converted to/from T. This is needed for any non-fixed fundamental type (e.g. uint, short, Real).
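As a rough illustration of what converting to/from a fixed-width type involves, here is a sketch built on toy classes. The helper ConvertViaFixed and the structs ToySink and ToySource are hypothetical names for this example. The actual core::ConvertBaseType may differ in its details, and the toy classes do not handle endianness.

```cpp
#include <cassert>   // assert() for quick self-checks by the caller
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

// Toy sink and source (stand-ins for illustration only).
struct ToySink {
  std::ostream& stream;
  bool IsSource() const { return false; }
  template <typename T> void operator&(T& v) {
    stream.write(reinterpret_cast<const char*>(&v), sizeof(T));
  }
};

struct ToySource {
  std::istream& stream;
  bool IsSource() const { return true; }
  template <typename T> void operator&(T& v) {
    stream.read(reinterpret_cast<char*>(&v), sizeof(T));
  }
};

// Hypothetical helper: the file always holds a value of type Fixed,
// while the program works with a value of the non-fixed type T.
template <typename Fixed, typename DS, typename T>
void ConvertViaFixed(DS& ds, T& value) {
  Fixed tmp = Fixed();
  if (ds.IsSource()) {
    ds & tmp;                         // read fixed-width value
    value = static_cast<T>(tmp);      // convert to in-memory type
  } else {
    tmp = static_cast<Fixed>(value);  // convert to on-disk type
    ds & tmp;                         // write fixed-width value
  }
}
```

With this, a short is always stored in 2 bytes via ConvertViaFixed<std::int16_t>(ds, s), independent of sizeof(short) on the machine that reads or writes the file.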

Implementation notes:

  • the Serialize function for fundamental types takes care of endianness (all written as little endian and converted from/to native endianness)

  • custom Serialize functions exist for String (= std::string), std::vector<T> and std::pair<T,T2>. They will raise an error if the used type T or T2 is a fundamental type. In that case, you have to serialize the values manually and convert each element appropriately.

  • you can use ds.IsSource() to distinguish sources and sinks.
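Writing values as little endian regardless of the native byte order can be sketched as follows. This is one common technique (assembling bytes with shifts) and not necessarily how ProMod3 implements it. EncodeLittleEndian and DecodeLittleEndian are hypothetical names.

```cpp
#include <cassert>   // assert() for quick self-checks by the caller
#include <cstddef>
#include <cstdint>

// Store value in out[0..3], least significant byte first.
// Shifts operate on the numeric value, so the result is the same
// on little- and big-endian machines.
inline void EncodeLittleEndian(std::uint32_t value, unsigned char* out) {
  for (std::size_t i = 0; i < 4; ++i) {
    out[i] = static_cast<unsigned char>((value >> (8 * i)) & 0xFFu);
  }
}

// Rebuild the value from 4 bytes stored least significant byte first.
inline std::uint32_t DecodeLittleEndian(const unsigned char* in) {
  std::uint32_t value = 0;
  for (std::size_t i = 0; i < 4; ++i) {
    value |= static_cast<std::uint32_t>(in[i]) << (8 * i);
  }
  return value;
}
```

For example, encoding 0x01020304 yields the bytes 04 03 02 01 on every platform, and decoding those bytes returns the original value.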

Code Example

Here is an example of a class which provides functionality for portable and non-portable I/O:

// includes for this class
#include <boost/shared_ptr.hpp>
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>

// includes for I/O
#include <promod3/core/message.hh>
#include <promod3/core/portable_binary_serializer.hh>
#include <promod3/core/check_io.hh>

using namespace promod3;

// define some data-structure
struct SomeData {
  short s;
  int i;
  Real r;

  // portable serialization
  // (cleanly element by element with fixed-width base-types)
  template <typename DS>
  void Serialize(DS& ds) {
    core::ConvertBaseType<int16_t>(ds, s);
    core::ConvertBaseType<int32_t>(ds, i);
    core::ConvertBaseType<float>(ds, r);
  }
};

// define pointer type
class MyClass;
typedef boost::shared_ptr<MyClass> MyClassPtr;

// define class
class MyClass {
public:
  MyClass(const String& id): id_(id) { }

  // raw binary save
  void Save(const String& filename) {
    // open file
    std::ofstream out_stream(filename.c_str(), std::ios::binary);
    if (!out_stream) {
      std::stringstream ss;
      ss << "The file '" << filename << "' cannot be opened.";
      throw promod3::Error(ss.str());
    }

    // header for consistency checks
    core::WriteMagicNumber(out_stream);
    core::WriteVersionNumber(out_stream, 1);
    // required base types: short, int, Real (for SomeData).
    //                      uint (for sizes)
    // required structs: SomeData
    core::WriteTypeSize<uint>(out_stream);
    core::WriteTypeSize<short>(out_stream);
    core::WriteTypeSize<int>(out_stream);
    core::WriteTypeSize<Real>(out_stream);
    core::WriteTypeSize<SomeData>(out_stream);
    // check values for base types
    core::WriteBaseType<uint>(out_stream);
    core::WriteBaseType<short>(out_stream);
    core::WriteBaseType<int>(out_stream);
    core::WriteBaseType<Real>(out_stream);

    // write string
    uint str_len = id_.length();
    out_stream.write(reinterpret_cast<char*>(&str_len), sizeof(uint));
    out_stream.write(id_.c_str(), str_len);
    // write vector of SomeData
    // (taking &data_[0] of an empty vector is undefined, so guard it)
    uint v_size = data_.size();
    out_stream.write(reinterpret_cast<char*>(&v_size), sizeof(uint));
    if (v_size > 0) {
      out_stream.write(reinterpret_cast<char*>(&data_[0]),
                       sizeof(SomeData)*v_size);
    }
  }

  // raw binary load
  static MyClassPtr Load(const String& filename) {
    // open file
    std::ifstream in_stream(filename.c_str(), std::ios::binary);
    if (!in_stream) {
      std::stringstream ss;
      ss << "The file '" << filename << "' does not exist.";
      throw promod3::Error(ss.str());
    }

    // header for consistency checks
    core::CheckMagicNumber(in_stream);
    uint32_t version = core::GetVersionNumber(in_stream);
    if (version > 1) {
      std::stringstream ss;
      ss << "Unsupported file version '" << version
         << "' in '" << filename << "'.";
      throw promod3::Error(ss.str());
    }
    // check for exact sizes as used in Save
    core::CheckTypeSize<uint>(in_stream);
    core::CheckTypeSize<short>(in_stream);
    core::CheckTypeSize<int>(in_stream);
    core::CheckTypeSize<Real>(in_stream);
    core::CheckTypeSize<SomeData>(in_stream);
    // check values for base types used in Save
    core::CheckBaseType<uint>(in_stream);
    core::CheckBaseType<short>(in_stream);
    core::CheckBaseType<int>(in_stream);
    core::CheckBaseType<Real>(in_stream);

    // read string (needed for constructor)
    uint str_len;
    in_stream.read(reinterpret_cast<char*>(&str_len), sizeof(uint));
    std::vector<char> tmp_buf(str_len);
    // (taking &tmp_buf[0] of an empty vector is undefined, so guard it)
    if (str_len > 0) in_stream.read(&tmp_buf[0], str_len);

    // construct
    MyClassPtr p(new MyClass(String(tmp_buf.begin(), tmp_buf.end())));

    // read vector of SomeData
    uint v_size;
    in_stream.read(reinterpret_cast<char*>(&v_size), sizeof(uint));
    p->data_.resize(v_size);
    if (v_size > 0) {
      in_stream.read(reinterpret_cast<char*>(&p->data_[0]),
                     sizeof(SomeData)*v_size);
    }

    return p;
  }

  // portable binary save
  void SavePortable(const String& filename) {
    // open file
    std::ofstream out_stream_(filename.c_str(), std::ios::binary);
    if (!out_stream_) {
      std::stringstream ss;
      ss << "The file '" << filename << "' cannot be opened.";
      throw promod3::Error(ss.str());
    }
    core::PortableBinaryDataSink out_stream(out_stream_);

    // header for consistency checks
    core::WriteMagicNumber(out_stream);
    core::WriteVersionNumber(out_stream, 1);
    // required base types: short, int, Real
    // -> converted to int16_t, int32_t, float
    core::WriteTypeSize<int16_t>(out_stream);
    core::WriteTypeSize<int32_t>(out_stream);
    core::WriteTypeSize<float>(out_stream);
    // check values for base types
    core::WriteBaseType<int16_t>(out_stream);
    core::WriteBaseType<int32_t>(out_stream);
    core::WriteBaseType<float>(out_stream);

    // write string (provided in portable_binary_serializer.hh)
    out_stream & id_;
    // write vector (provided in portable_binary_serializer.hh)
    // -> only ok like this if vector of custom type
    // -> will call Serialize-function for each element
    out_stream & data_;
  }

  // portable binary load
  static MyClassPtr LoadPortable(const String& filename) {
    // open file
    std::ifstream in_stream_(filename.c_str(), std::ios::binary);
    if (!in_stream_) {
      std::stringstream ss;
      ss << "The file '" << filename << "' does not exist.";
      throw promod3::Error(ss.str());
    }
    core::PortableBinaryDataSource in_stream(in_stream_);

    // header for consistency checks
    core::CheckMagicNumber(in_stream);
    uint32_t version = core::GetVersionNumber(in_stream);
    if (version > 1) {
      std::stringstream ss;
      ss << "Unsupported file version '" << version
         << "' in '" << filename << "'.";
      throw promod3::Error(ss.str());
    }
    // check if required base types (see SavePortable)
    // are big enough
    core::CheckTypeSize<short>(in_stream, true);
    core::CheckTypeSize<int>(in_stream, true);
    core::CheckTypeSize<Real>(in_stream, true);
    // check values for base types used in SavePortable
    core::CheckBaseType<int16_t>(in_stream);
    core::CheckBaseType<int32_t>(in_stream);
    core::CheckBaseType<float>(in_stream);

    // read string (needed for constructor)
    String s;
    in_stream & s;
    // construct
    MyClassPtr p(new MyClass(s));
    // read vector of SomeData
    in_stream & p->data_;

    return p;
  }

private:
  std::vector<SomeData> data_;
  String id_;
};

int main() {
  // generate raw file
  MyClassPtr p(new MyClass("HELLO"));
  p->Save("test.dat");
  // load raw file
  p = MyClass::Load("test.dat");

  // generate portable file
  p->SavePortable("test.dat");
  // load portable file
  p = MyClass::LoadPortable("test.dat");

  return 0;
}

Existing Binary Files

The following binary files are currently in ProMod3:

During the make process, portable versions of the files (stored in the <MODULE>/data folder) are converted and corresponding raw binary files are stored in the stage/share/promod3/<MODULE>_data folder.

If the stage folder is moved after compilation (e.g. make install), the location of the share/promod3 folder must be stored in an environment variable called PROMOD3_SHARED_DATA_PATH. This variable is set automatically if you load any Python module from promod3, use the pm script, or use a well-setup module on a cluster.
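For C++ code that needs to locate the data folder itself, a lookup of that environment variable could look like the sketch below. GetSharedDataPath is a hypothetical helper written for this example and is not a ProMod3 function. The fallback argument is an assumption about what a caller might want when the variable is unset.

```cpp
#include <cassert>   // assert() for quick self-checks by the caller
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the value of PROMOD3_SHARED_DATA_PATH
// if it is set and non-empty, and the given fallback otherwise.
inline std::string GetSharedDataPath(const std::string& fallback) {
  const char* env = std::getenv("PROMOD3_SHARED_DATA_PATH");
  if (env != NULL && env[0] != '\0') return std::string(env);
  return fallback;
}
```

A caller could, for instance, pass "stage/share/promod3" as fallback when running from the build directory.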

Code for the generation of the binary files and their portable versions is in the extras/data_generation folder (provided as-is).
