mmCIF File Format

The mmCIF file format is a container for structural entities provided by the PDB. Saving and loading happen through dedicated convenience functions (ost.io.LoadMMCIF()/ost.io.SaveMMCIF()). Here we provide more in-depth information on mmCIF IO and describe how to deal with information that goes beyond the legacy PDB format (MMCifInfo, MMCifInfoCitation, MMCifInfoTransOp, MMCifInfoBioUnit, MMCifInfoStructDetails, MMCifInfoObsolete, MMCifInfoStructRef, MMCifInfoStructRefSeq, MMCifInfoStructRefSeqDif, MMCifInfoRevisions, MMCifInfoEntityBranchLink).

Reading mmCIF files

Categories Available

The following categories of a mmCIF file are considered by the reader:

Notes:

  • Structures in mmCIF format can have two chain names. The “new” chain name extracted from atom_site.label_asym_id is used to name the chains in the EntityHandle. The “old” (author provided) chain name is extracted from _atom_site.auth_asym_id for the first atom of the chain. It is added as string property named “pdb_auth_chain_name” to the ChainHandle. The mapping is also stored in MMCifInfo as GetMMCifPDBChainTr() and GetPDBMMCifChainTr() if a non-empty SEQRES record exists for that chain (this should exclude ligands and water).

  • Molecular entities in mmCIF are identified by an entity.id, which is extracted from atom_site.label_entity_id for the first atom of the chain. It is added as string property named “entity_id” to the ChainHandle. Each chain is mapped to an ID in MMCifInfo as GetMMCifEntityIdTr().

  • For more complex mappings, such as ligands which may be in the same “old” chain as the protein chain but are represented in a separate “new” chain in mmCIF, we also store string properties on a per-residue level. For mmCIF files from the PDB, there is a unique mapping between (label_asym_id, label_seq_id) and (auth_asym_id, auth_seq_id, pdbx_PDB_ins_code). The following data items are available:

    • atom_site.label_asym_id: residue.chain.name

    • _atom_site.label_seq_id: residue.GetStringProp("resnum") (this is the same as residue.number for residues in polymer chains; for ligands, the value is unset in mmCIF, but residue.number is set to 1 by OpenStructure)

    • atom_site.label_entity_id: residue.GetStringProp("entity_id")

    • _atom_site.auth_asym_id: residue.GetStringProp("pdb_auth_chain_name")

    • atom_site.auth_seq_id: residue.GetStringProp("pdb_auth_resnum")

    • atom_site.pdbx_PDB_ins_code: residue.GetStringProp("pdb_auth_ins_code")

    The last two items might be missing (not empty) if the atom_site.auth_seq_id or atom_site.pdbx_PDB_ins_code are not present in the mmCIF file.

  • Missing values in the aforementioned data items will be denoted as . or ?.

  • Author residue numbers (atom_site.auth_seq_id) and insertion codes (atom_site.pdbx_PDB_ins_code) are optional according to the mmCIF dictionary. The data items (whole columns) can be omitted in structures where the “new” residue numbers (_atom_site.label_seq_id) are defined (to valid values). This is usually the case for polymer chains. However, non-polymer and water chains do not have valid “new” residue numbers. In structures containing such missing data, OST requires the presence of both “old” residue numbers and insertion codes in order to identify and build residues properly. It is a known limitation of the mmCIF format that it allows ambiguous identifiers for waters (and ligands to some extent), so we have to require these additional identifiers.
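The per-residue properties listed above can be collected along the following lines. This is only a minimal sketch: it assumes that a residue offers the generic-property calls HasProp() and GetStringProp(), and the names residue_ids and FakeResidue are hypothetical. FakeResidue is a stand-in so that the snippet runs without OpenStructure; with a loaded structure you would pass a real residue instead.

```python
# Property names as listed in the notes above, keyed by atom_site data item.
PROP_NAMES = {
    "label_seq_id": "resnum",
    "label_entity_id": "entity_id",
    "auth_asym_id": "pdb_auth_chain_name",
    "auth_seq_id": "pdb_auth_resnum",
    "pdbx_PDB_ins_code": "pdb_auth_ins_code",
}


def residue_ids(residue):
    """Collect the mmCIF identifiers stored as string properties.

    Properties that were never set (e.g. optional author columns that are
    absent from the file) are reported as None.
    """
    ids = {}
    for item, prop in PROP_NAMES.items():
        ids[item] = residue.GetStringProp(prop) if residue.HasProp(prop) else None
    return ids


class FakeResidue:
    """Hypothetical stand-in that mimics the two calls used above."""

    def __init__(self, props):
        self._props = dict(props)

    def HasProp(self, key):
        return key in self._props

    def GetStringProp(self, key):
        return self._props[key]


res = FakeResidue({"resnum": "1", "entity_id": "1",
                   "pdb_auth_chain_name": "P", "pdb_auth_resnum": "71"})
ids = residue_ids(res)
print(ids)
```

Checking for the property first reflects the note that the author items may be missing rather than empty.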

Info Classes

Information from mmCIF files that goes beyond structural data is kept in a special container, the MMCifInfo class. Here is a detailed description of the available annotation.

class MMCifInfo

This is the container for all bits of non-molecular data pulled from a mmCIF file.

citations

Stores a list of citations (MMCifInfoCitation).

Also available as GetCitations().

biounits

Stores a list of biounits (MMCifInfoBioUnit).

Also available as GetBioUnits().

method

Stores the experimental method used to create the structure (_exptl.method).

Also available as GetMethod(). May also be modified by SetMethod().

Some PDB entries have multiple experimental methods. Only a single one of them is stored here.

resolution

Stores the resolution of the crystal structure, obtained from the refine.ls_d_res_high data item. Set to 0 if no value in loaded mmCIF file.

Also available as GetResolution(). May also be modified by SetResolution().

em_resolution

Stores the resolution of the EM reconstruction, obtained from the em_3d_reconstruction.resolution data item. Set to 0 if no value in loaded mmCIF file.

Also available as GetEMResolution(). May also be modified by SetEMResolution().

r_free

Stores the R-free value of the crystal structure. Set to 0 if no value in loaded mmCIF file.

Also available as GetRFree(). May also be modified by SetRFree().

r_work

Stores the R-work value of the crystal structure. Set to 0 if no value in loaded mmCIF file.

Also available as GetRWork(). May also be modified by SetRWork().

operations

Stores the operations needed to transform a crystal structure into a bio unit.

Also available as GetOperations(). May also be modified by AddOperation().

struct_details

Stores details about the structure in a MMCifInfoStructDetails object.

Also available as GetStructDetails(). May also be modified by SetStructDetails().

struct_refs

Lists all links to external databases in the mmCIF file.

revisions

Stores a simple history of a PDB entry.

Also available as GetRevisions(). May be extended by AddRevision().

Type:

MMCifInfoRevisions

obsolete

Stores information about obsoleted / superseded entries.

Also available as GetObsoleteInfo(). May also be modified by SetObsoleteInfo().

Type:

MMCifInfoObsolete

AddCitation(citation)

Add a citation to the citation list of an info object.

Parameters:

citation (MMCifInfoCitation) – Citation to be added.

AddAuthorsToCitation(id, authors, fault_tolerant=False)

Adds a list of authors to a specific citation.

Parameters:
  • id (str) – Identifier of the citation.

  • authors (StringList) – List of authors.

  • fault_tolerant (bool) – Logs a warning if id is not found and proceeds without setting anything if set to True. Raises otherwise.

GetCitations()

See citations

AddBioUnit(biounit)

Add a bio unit to the bio unit list of an info object. If the id of biounit already exists in the set of assemblies, both will be merged. This means that chain and operations lists will be concatenated and the interval lists (operationsintervalls, chainintervalls) will be updated.

Parameters:

biounit (MMCifInfoBioUnit) – Bio unit to be added.

GetBioUnits()

See biounits

SetMethod(method)

See method

GetMethod()

See method

SetResolution(resolution)

See resolution

GetResolution()

See resolution

AddOperation(operation)

See operations

GetOperations()

See operations

SetStructDetails(details)

See struct_details

GetStructDetails()

See struct_details

AddMMCifPDBChainTr(cif_chain_id, pdb_chain_id)

Set up a translation for a certain mmCIF chain name to the traditional PDB chain name.

Parameters:
  • cif_chain_id (str) – atom_site.label_asym_id

  • pdb_chain_id (str) – _atom_site.auth_asym_id

GetMMCifPDBChainTr(cif_chain_id)

Get the translation of a certain mmCIF chain name to the traditional PDB chain name.

Parameters:

cif_chain_id (str) – atom_site.label_asym_id

Returns:

_atom_site.auth_asym_id as str (empty if no mapping)

AddPDBMMCifChainTr(pdb_chain_id, cif_chain_id)

Set up a translation for a certain PDB chain name to the mmCIF chain name.

Parameters:
  • pdb_chain_id (str) – _atom_site.auth_asym_id

  • cif_chain_id (str) – atom_site.label_asym_id

GetPDBMMCifChainTr(pdb_chain_id)

Get the translation of a certain PDB chain name to the mmCIF chain name.

Parameters:

pdb_chain_id (str) – _atom_site.auth_asym_id

Returns:

atom_site.label_asym_id as str (empty if no mapping)

AddMMCifEntityIdTr(cif_chain_id, entity_id)

Set up a translation for a certain mmCIF chain name to the mmCIF entity ID.

Parameters:
  • cif_chain_id (str) – atom_site.label_asym_id

  • entity_id (str) – atom_site.label_entity_id

GetMMCifEntityIdTr(cif_chain_id)

Get the translation of a certain mmCIF chain name to the mmCIF entity ID.

Parameters:

cif_chain_id (str) – atom_site.label_asym_id

Returns:

atom_site.label_entity_id as str (empty if no mapping)

GetEntityIdsOfType(type)

Get list of entity ids for which MMCifEntityDesc.entity_type equals type

Parameters:

type (str) – Selection criteria of returned entity ids

Returns:

list of str representing selected entity ids

AddRevision(num, date, status, major=-1, minor=-1)

Add a new iteration to the revision history. See MMCifInfoRevisions.AddRevision().

GetRevisions()

See revisions

SetRevisionsDateOriginal(date)

Set the date, when this entry first entered the PDB. Ignored if it was set in the past. See MMCifInfoRevisions.SetDateOriginal().

GetObsoleteInfo()

See obsolete

SetObsoleteInfo()

See obsolete

Get bond information for branched entities. Returns all MMCifInfoEntityBranchLink objects in one list. Chain and residue information is available by the stored AtomHandles of each entry.

Returns:

list of MMCifInfoEntityBranchLink

GetEntityBranchByChain(chain_name)

Get bond information for chains with branched entities. Returns all MMCifInfoEntityBranchLink objects in one list if chain is a branched entity, an empty list otherwise.

Parameters:

chain_name (str) – Chain name to check for branch links

Returns:

list of MMCifInfoEntityBranchLink

Add bond information for a branched entity.

Parameters:
  • chain_name (str) – Chain the bond belongs to

  • atom1 (AtomHandle) – First atom of the bond

  • atom2 (AtomHandle) – Second atom of the bond

  • bond_order (int) – Bond order (e.g. 1=single, 2=double, 3=triple)

Returns:

Nothing

GetEntityBranchChainNames()

Get a list of chain names which contain branched entities.

Returns:

list of str

GetEntityBranchChains()

Get a list of chains which contain branched entities.

Returns:

list of ChainHandle

Establish all bonds stored for branched entities.

GetEntityDesc(entity_id)

Get info of type MMCifEntityDesc for specified entity_id. The entity id for a chain can be fetched with GetMMCifEntityIdTr().

Parameters:

entity_id (str) – ID of entity

class MMCifInfoCitation

This stores citation information from an input file.

id

Stores an internal identifier for a citation. If not provided, it is an empty string.

Also available as GetID(). May also be modified by SetID().

cas

Stores a Chemical Abstract Service identifier if available. If not provided, it is an empty string.

Also available as GetCAS(). May also be modified by SetCAS().

isbn

Stores the ISBN code, presumably for cited books. If not provided, it is an empty string.

Also available as GetISBN(). May also be modified by SetISBN().

published_in

Stores the book or journal title of a publication. Should take the full title, no abbreviations. If not provided, it is an empty string.

Also available as GetPublishedIn(). May also be modified by SetPublishedIn().

volume

Stores volume information for journals. Since the volume number is not always a simple integer, it is stored as a string. If not provided, it is an empty string.

Also available as GetVolume(). May also be modified by SetVolume().

page_first

Stores the first page of a publication. Since page numbers are not always simple integers, it is stored as a string. If not provided, it is an empty string.

Also available as GetPageFirst(). May also be modified by SetPageFirst().

page_last

Stores the last page of a publication. Since page numbers are not always simple integers, it is stored as a string. If not provided, it is an empty string.

Also available as GetPageLast(). May also be modified by SetPageLast().

doi

Stores the Document Object Identifier as used by doi.org for a cited document. If not provided, it is an empty string.

Also available as GetDOI(). May also be modified by SetDOI().

pubmed

Stores the PubMed accession number. If not provided, is set to 0.

Also available as GetPubMed(). May also be modified by SetPubMed().

year

Stores the publication year. If not provided, is set to 0.

Also available as GetYear(). May also be modified by SetYear().

title

Stores a title. If not provided, is set to an empty string.

Also available as GetTitle(). May also be modified by SetTitle().

book_publisher

Name of publisher of the citation, relevant for books and book chapters.

Also available as GetBookPublisher() and SetBookPublisher().

book_publisher_city

City of the publisher of the citation, relevant for books and book chapters.

Also available as GetBookPublisherCity() and SetBookPublisherCity().

citation_type

Defines where a citation was published. Either journal, book or unknown.

Also available as GetCitationType(). May also be modified by SetCitationType() with values from MMCifInfoCType. For convenience, the setters SetCitationTypeJournal(), SetCitationTypeBook() and SetCitationTypeUnknown() exist.

For checking the type of a citation, IsCitationTypeJournal(), IsCitationTypeBook() and IsCitationTypeUnknown() can be used.

authors

Stores a StringList of authors.

Also available as GetAuthorList(). May also be modified by SetAuthorList().

GetCAS()

See cas

SetCAS(cas)

See cas

GetISBN()

See isbn

SetISBN(isbn)

See isbn

GetPublishedIn()

See published_in

SetPublishedIn(title)

See published_in

GetVolume()

See volume

SetVolume(volume)

See volume

GetPageFirst()

See page_first

SetPageFirst(first)

See page_first

GetPageLast()

See page_last

SetPageLast(last)

See page_last

GetDOI()

See doi

SetDOI(doi)

See doi

GetPubMed()

See pubmed

SetPubMed(no)

See pubmed

GetYear()

See year

SetYear(year)

See year

GetTitle()

See title

SetTitle(title)

See title

GetBookPublisher()

See book_publisher

SetBookPublisher()

See book_publisher

GetBookPublisherCity()

See book_publisher_city

SetBookPublisherCity()

See book_publisher_city

GetCitationType()

See citation_type

SetCitationType(publication_type)

See citation_type

SetCitationTypeJournal()

See citation_type

SetCitationTypeBook()

See citation_type

SetCitationTypeUnknown()

See citation_type

IsCitationTypeJournal()

See citation_type

IsCitationTypeBook()

See citation_type

IsCitationTypeUnknown()

See citation_type

GetAuthorList()

See authors

SetAuthorList(list)

See authors

class MMCifInfoTransOp

This stores operations needed to transform an EntityHandle into a bio unit.

id

A unique identifier. If not provided, it is an empty string.

Also available as GetID(). May also be modified by SetID().

type

Describes the operation. If not provided, it is an empty string.

Also available as GetType(). May also be modified by SetType().

translation

The translational vector.

Also available as GetVector(). May also be modified by SetVector().

rotation

The rotational matrix.

Also available as GetMatrix(). May also be modified by SetMatrix().

GetID()

See id

SetID(id)

See id

GetType()

See type

SetType(type)

See type

GetVector()

See translation

SetVector(x, y, z)

See translation

GetMatrix()

See rotation

SetMatrix(i00, i01, i02, i10, i11, i12, i20, i21, i22)

See rotation
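To illustrate what a single operation does to a coordinate, here is a minimal sketch in plain Python. It assumes that the rotation is applied first and the translation is added afterwards, and it uses nested lists and tuples instead of OpenStructure geometry types. The function name apply_transop is hypothetical.

```python
def apply_transop(rotation, translation, point):
    """Return rotation * point + translation for a 3D coordinate.

    rotation is a 3x3 matrix given as three rows, translation and point
    are sequences of three numbers.
    """
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
rot_z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]  # 90 degrees around the z axis

shifted = apply_transop(identity, (1.0, 2.0, 3.0), (0.0, 0.0, 0.0))
rotated = apply_transop(rot_z_90, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
print(shifted, rotated)
```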

class MMCifInfoBioUnit

This stores information on how a structure is to be assembled to form the bio unit.

id

The id of a bio unit as given by the original mmCIF file.

Also available as GetID(). May also be modified by SetID().

Type:

str

details

Special aspects of the biological assembly. If not provided, it is an empty string.

Also available as GetDetails(). May also be modified by SetDetails().

method_details

Details about the method used to determine this biological assembly.

Also available as GetMethodDetails(). May also be modified by SetMethodDetails().

chains

Chains involved in this bio unit. If not provided, it is an empty list.

Also available as GetChainList(). May also be modified by AddChain() or SetChainList().

chainintervals

List of intervals on the chain list. Needed if there are several sets of chains and transformations to create the bio unit. Comes as a list of tuples. The first component is the start, the second is the right border of the interval.

Also available as GetChainIntervalList(). Is automatically modified by AddChain(), SetChainList() and MMCifInfo.AddBioUnit().

operations

Translations and rotations needed to create the bio unit. Filled with objects of class MMCifInfoTransOp.

Also available as GetOperations(). May be modified by AddOperations().

operationsintervalls

List of intervals on the operations list. Needed if there are several sets of chains and transformations to create the bio unit. Comes as a list of tuples. The first component is the start, the second is the right border of the interval.

Also available as GetOperationsIntervalList(). Is automatically modified by AddOperations() and MMCifInfo.AddBioUnit().
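The interval lists can be pictured as slices over the flat chain and operations lists. The sketch below is an illustration in plain Python, not the OpenStructure implementation. It assumes that the right border is exclusive (which would match a single interval enclosing the whole list after SetChainList()) and that merging two bio units shifts the intervals of the second one by the length of the first list. The helper names are hypothetical.

```python
def split_by_intervals(items, intervals):
    """Slice items into groups, one per (start, right_border) tuple."""
    return [items[start:end] for start, end in intervals]


def merge_lists(items_a, intervals_a, items_b, intervals_b):
    """Concatenate two lists and shift the intervals of the second one."""
    offset = len(items_a)
    shifted = [(start + offset, end + offset) for start, end in intervals_b]
    return items_a + items_b, intervals_a + shifted


chains = ["A", "B", "C", "D", "E"]
chain_intervals = [(0, 2), (2, 5)]
groups = split_by_intervals(chains, chain_intervals)

merged_chains, merged_intervals = merge_lists(["A", "B"], [(0, 2)],
                                              ["C"], [(0, 1)])
print(groups, merged_chains, merged_intervals)
```

Under these assumptions, each group of chains belongs to the group of operations at the same position in the operations interval list.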

GetID()

See id

SetID(id)

See id

GetDetails()

See details

SetDetails(details)

See details

GetMethodDetails()

See method_details

SetMethodDetails(details)

See method_details

GetChainList()

See chains

SetChainList(chains)

See chains, also resets chainintervalls to contain only one interval enclosing the whole chain list.

Parameters:

chains (StringList) – List of chain names.

AddChain(chain name)

See chains, also extends the right border of the last entry in chainintervalls.

GetChainIntervalList()

See chainintervals

GetOperations()

See operations

AddOperations(list of operations)

See operations, also extends the right border of the last entry in operationsintervalls.

GetOperationsIntervalList()

See operationsintervalls

PDBize(asu, seqres=None, min_polymer_size=None, transformation=False, peptide_min_size=10, nucleicacid_min_size=10, saccharide_min_size=10)

Returns the biological assembly (bio unit) for an entity. The new entity created is well suited to be saved as a PDB file. Therefore the function tries to meet the requirements of single-character chain names. The following measures are taken.

  • All ligands are put into one chain (_)

  • Water is put into one chain (-)

  • Each polymer gets its own chain, named A-Z 0-9 a-z.

  • The description of non-polymer chains will be put into a generic string property called description on the residue level.

  • Ligands that resemble a polymer but have fewer than min_polymer_size / peptide_min_size / nucleicacid_min_size / saccharide_min_size residues are assigned the same numeric residue number. The residues are distinguished by insertion code.

  • Sometimes bio units exceed the coordinate system storable in a PDB file. In that case, the box around the entity will be aligned to the lower left corner of the coordinate system.

Since this function is at the moment mainly used to create biounits from mmCIF files to be saved as PDBs, the function assumes that the ChainType properties are set correctly. For a more mmCIF-style of doing things read this: Biounits

Parameters:
  • asu (EntityHandle) – Asymmetric unit to work on. Should be created from a mmCIF file.

  • seqres (SequenceList) – If set to a valid sequence list, the length of the seqres records will be used to determine if a certain chain has the minimally required length.

  • min_polymer_size (int) – The minimal number of residues a polymer needs to get its own chain. Everything below that number will be sorted into the ligand chain. Overrides peptide_min_size, nucleicacid_min_size and saccharide_min_size if set to a value different than None.

  • transformation (bool) – If set, return the transformation matrix used to move the bounding box of the bio unit to the lower left corner.

  • peptide_min_size (int) – Minimal size to get an individual chain for a polypeptide. Is overridden by min_polymer_size.

  • nucleicacid_min_size (int) – Minimal size to get an individual chain for a polynucleotide. Is overridden by min_polymer_size.

  • saccharide_min_size (int) – Minimal size to get an individual chain for an oligosaccharide or polysaccharide. Is overridden by min_polymer_size.
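The chain naming scheme and the way min_polymer_size overrides the per-type thresholds can be sketched in plain Python. This is an illustration of the rules described above, not the PDBize() implementation. The kind labels ("peptide", "nucleicacid", "saccharide") and the helper names are hypothetical, and the sketch does not handle more polymers than there are single-character names.

```python
import string

# Order stated above: A-Z, then 0-9, then a-z (62 names in total).
POLYMER_CHAIN_NAMES = string.ascii_uppercase + string.digits + string.ascii_lowercase
LIGAND_CHAIN = "_"
WATER_CHAIN = "-"


def effective_min_size(kind, min_polymer_size=None, peptide_min_size=10,
                       nucleicacid_min_size=10, saccharide_min_size=10):
    """Size a chain of the given kind must reach to get its own chain."""
    if min_polymer_size is not None:
        # A value other than None overrides all per-type thresholds.
        return min_polymer_size
    return {"peptide": peptide_min_size,
            "nucleicacid": nucleicacid_min_size,
            "saccharide": saccharide_min_size}[kind]


def assign_chain_names(sizes_and_kinds, **thresholds):
    """Map (n_residues, kind) tuples to single-character chain names."""
    names = []
    next_polymer = 0
    for n_residues, kind in sizes_and_kinds:
        if n_residues >= effective_min_size(kind, **thresholds):
            names.append(POLYMER_CHAIN_NAMES[next_polymer])
            next_polymer += 1
        else:
            # Too short to count as a polymer: goes into the ligand chain.
            names.append(LIGAND_CHAIN)
    return names


names = assign_chain_names([(120, "peptide"), (4, "saccharide"),
                            (30, "nucleicacid")])
print(names)
```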

class MMCifInfoStructDetails

Holds details about the structure.

entry_id

Identifier for a certain data block. If not provided, it is an empty string.

Also available as GetEntryID(). May also be modified by SetEntryID().

title

Set a title for the structure.

Also available as GetTitle(). May also be modified by SetTitle().

casp_flag

Tells whether this structure was a target in some competition.

Also available as GetCASPFlag(). May also be modified by SetCASPFlag().

descriptor

Descriptor for an NDB structure or the unstructured content of a PDB COMPND record.

Also available as GetDescriptor(). May also be modified by SetDescriptor().

mass

Molecular mass of a molecule.

Also available as GetMass(). May also be modified by SetMass().

mass_method

Method used to determine the molecular weight.

Also available as GetMassMethod(). May also be modified by SetMassMethod().

model_details

Details about how the structure was determined.

Also available as GetModelDetails(). May also be modified by SetModelDetails().

model_type_details

Details about how the type of the structure was determined.

Also available as GetModelTypeDetails(). May also be modified by SetModelTypeDetails().

GetEntryID()

See entry_id

SetEntryID(id)

See entry_id

GetTitle()

See title

SetTitle(title)

See title

GetCASPFlag()

See casp_flag

SetCASPFlag(flag)

See casp_flag

GetDescriptor()

See descriptor

SetDescriptor(descriptor)

See descriptor

GetMass()

See mass

SetMass(mass)

See mass

GetMassMethod()

See mass_method

SetMassMethod(method)

See mass_method

GetModelDetails()

See model_details

SetModelDetails(details)

See model_details

GetModelTypeDetails()

See model_type_details

SetModelTypeDetails(details)

See model_type_details

class MMCifInfoObsolete

Holds details on obsolete / superseded structures. The data is available both in the obsolete and in the replacement entries.

date

When was the entry replaced?

Also available as GetDate(). May also be modified by SetDate().

id

Type of change. Either Obsolete or Supersede. Returns a string starting upper case. Has to be set via OBSLTE or SPRSDE.

Also available as GetID(). May also be modified by SetID().

pdb_id

ID of the replacing entry.

Also available as GetPDBID(). May also be modified by SetPDBID().

replace_pdb_id

ID of the replaced entry.

Also available as GetReplacedPDBID(). May also be modified by SetReplacedPDBID().

GetDate()

See date

SetDate(date)

See date

GetID()

See id

SetID(id)

See id

GetPDBID()

See pdb_id

SetPDBID(flag)

See pdb_id

GetReplacedPDBID()

See replace_pdb_id

SetReplacedPDBID(descriptor)

See replace_pdb_id

class MMCifInfoStructRef

Holds the information of the struct_ref category. The category describes the link of polymers in the mmCIF file to sequences stored in external databases such as UniProt. The related categories struct_ref_seq and struct_ref_seq_dif also list differences between the sequences of the deposited structure and the sequences in the database. Two prominent examples of such differences are point mutations and/or expression tags.

db_name

Name of the external database, for example UNP for UniProt.

Type:

str

db_id

Name of the reference sequence in the database pointed to by db_name.

Type:

str

db_access

Alternative accession code for the sequence in the database pointed to by db_name.

Type:

str

GetAlignedSeq(name)

Returns the aligned sequence for the given name, None if the sequence does not exist.

aligned_seqs

List of aligned sequences (all entries of the struct_ref_seq category mapping to this struct_ref).

class MMCifInfoStructRefSeq

An aligned range of residues between a sequence in a reference database and the deposited sequence.

align_id

Uniquely identifies every struct_ref_seq item in the mmCIF file.

Type:

str

seq_begin
seq_end

The starting point (1-based) and end point of the aligned range in the deposited sequence, respectively.

Type:

int

db_begin
db_end

The starting point (1-based) and end point of the aligned range in the database sequence, respectively.

Type:

int

difs

List of differences between the deposited sequence and the sequence in the database.

chain_name

Chain name of the polymer in the mmCIF file.

class MMCifInfoStructRefSeqDif

A particular difference between the deposited sequence and the sequence in the database.

rnum

The residue number (1-based) of the residue in the deposited sequence

Type:

int

details

A textual description of the difference, e.g. point mutation, expression tag, purification artifact.

Type:

str

class MMCifInfoRevisions

Revision history of a PDB entry. If you find a ‘?’ somewhere, this means ‘not set’.

date_original

The date when this entry was seen in PDB for the very first time. This is not necessarily the release date. Expected format ‘yyyy-mm-dd’.

Type:

str

first_release

Index + 1 of the revision releasing this entry. A value of 0 means it has not been set yet. It is set the first time we encounter a GetStatus() value of “full release” (mmCIF versions < 5) or “Initial release” (current mmCIF).

Type:

int
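The rule for first_release can be written down as a short sketch in plain Python. It assumes that the status strings are compared exactly as quoted above, and the function name first_release_index is hypothetical.

```python
RELEASE_STATUSES = ("full release", "Initial release")


def first_release_index(statuses):
    """Return index + 1 of the first releasing revision, 0 if there is none."""
    for i, status in enumerate(statuses):
        if status in RELEASE_STATUSES:
            return i + 1
    return 0


released = first_release_index(["?", "Initial release", "?"])
unreleased = first_release_index(["?", "?"])
print(released, unreleased)
```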

AddRevision(num, date, status, major=-1, minor=-1)

Add a new iteration to the history.

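Parameters:
  • num (int) – Unique identifier of the revision

  • date (str) – Date of the revision

  • status (str) – Status of the revision

  • major (int) – Major version of the revision

  • minor (int) – Minor version of the revision

Raises: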

Exception if num is <= the last added iteration.

GetSize()
Returns:

Number of revisions (valid revision indices are in [0, number-1]).

Return type:

int

GetDate(i)
Parameters:

i (int) – Index of revision

Returns:

Date the PDB revision took place. Expected format ‘yyyy-mm-dd’.

Return type:

str

Raises:

Exception if i out of bounds.

GetNum(i)
Parameters:

i (int) – Index of revision

Returns:

Unique identifier of revision (assigned in increasing order)

Return type:

int

Raises:

Exception if i out of bounds.

GetStatus(i)
Parameters:

i (int) – Index of revision

Returns:

The status of this revision.

Return type:

str

Raises:

Exception if i out of bounds.

GetMajor(i)
Parameters:

i (int) – Index of revision

Returns:

The major version of this revision (-1 if not set).

Return type:

int

Raises:

Exception if i out of bounds.

GetMinor(i)
Parameters:

i (int) – Index of revision

Returns:

The minor version of this revision (-1 if not set).

Return type:

int

Raises:

Exception if i out of bounds.

GetLastDate()
Returns:

Date of the latest revision (‘?’ if no revision set).

Return type:

str

GetLastMajor()
Returns:

Major version of the latest revision (-1 if not set).

Return type:

int

GetLastMinor()
Returns:

Minor version of the latest revision (-1 if not set).

Return type:

int

SetDateOriginal(date)
GetDateOriginal()

See date_original

GetFirstRelease()

See first_release

class MMCifInfoEntityBranchLink

Data from pdbx_entity_branch, most specifically pdbx_entity_branch_link. This is connectivity information for branched entities, e.g. carbohydrates/oligosaccharides. Conop processors cannot easily connect them, so LoadMMCIF() uses this information to do that.

atom1

The first atom of the bond. Corresponds to entity_branch_link.atom_id_1, entity_branch_link.comp_id_1 and entity_branch_link.entity_branch_list_num_1. Also available via GetAtom1() and SetAtom1().

Type:

AtomHandle

atom2

The second atom of the bond. Corresponds to entity_branch_link.atom_id_2, entity_branch_link.comp_id_2 and entity_branch_link.entity_branch_list_num_2. Also available via GetAtom2() and SetAtom2().

Type:

AtomHandle

bond_order

Order of a bond (e.g. 1=single, 2=double, 3=triple). Corresponds to entity_branch_link.value_order. Also available via GetBondOrder() and SetBondOrder().

Type:

int

Establish a bond between atom1 and atom2 of a MMCifInfoEntityBranchLink.

Parameters:

editor (XCSEditor) – The editor instance to call for connecting the atoms.

Returns:

Nothing

GetAtom1()

See atom1

GetAtom2()

See atom2

GetBondOrder()

See bond_order

SetAtom1()

See atom1

SetAtom2()

See atom2

SetBondOrder()

See bond_order

class MMCifEntityDesc

Data collected for a certain mmCIF entity.

type

The ost chain type which can be assigned to ost.mol.ChainHandle

Type:

ost.mol.ChainType

entity_type

Value of the _entity.type token.

Type:

str

entity_poly_type

Value of the _entity_poly.type token - empty string if entity is not of type “polymer”.

Type:

str

branched_type

value of _pdbx_entity_branch.type token - empty string if entity is not of type “branched”

Type:

str

details

Value of the _entity.pdbx_description token.

Type:

str

seqres

SEQRES with gentle preprocessing - empty string if entity is not of type “polymer”. By default, the ost.io.MMCifReader reads the value of the _entity_poly.pdbx_seq_one_letter_code token. Copies all letters but searches a ost.conop.CompoundLib for compound names in brackets. seqres gets an ‘X’ if no compound is found or the respective compound has one letter code ‘?’. Uses the one letter code of the found compound otherwise. So it’s basically a canonical SEQRES with exactly one character per residue.

Type:

str
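The preprocessing described for seqres can be sketched in plain Python. The dictionary below is a hypothetical stand-in for a compound library lookup, removing whitespace before the substitution is an assumption, and the function name canonical_seqres is not part of OpenStructure.

```python
import re


def canonical_seqres(raw, one_letter_codes):
    """Collapse bracketed compound names to exactly one character each."""
    def replace(match):
        olc = one_letter_codes.get(match.group(1))
        # Unknown compounds and compounds with code '?' become 'X'.
        return "X" if olc is None or olc == "?" else olc

    # Drop whitespace and line breaks, then substitute every "(NAME)" group.
    compact = "".join(raw.split())
    return re.sub(r"\(([^)]+)\)", replace, compact)


lookup = {"MSE": "M", "HOH": "?"}  # hypothetical compound name -> one letter code
seq = canonical_seqres("MK(MSE)L\n(UNK)G(HOH)", lookup)
print(seq)
```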

mon_ids

Monomer ids of all residues in a polymer - empty if entity is not of type “polymer”. Read from _entity_poly_seq category. If a residue is heterogeneous, this list contains the monomer id that comes first in the CIF file. The other variants end up in hetero_num / hetero_ids.

Type:

ost.base.StringList

hetero_num

Residue numbers of heterogeneous compounds - empty if entity is not of type “polymer”. Read from _entity_poly_seq category. If a residue is heterogeneous, the monomer id that comes first in the CIF file ends up in mon_ids. The remnant is listed here. This list specifies the residue numbers for the respective monomer ids in hetero_ids.

hetero_ids

Monomer ids of heterogeneous compounds - empty if entity is not of type “polymer”. Read from _entity_poly_seq category. If a residue is heterogeneous, the monomer id that comes first in the CIF file ends up in mon_ids. The remnant is listed here. This list specifies the monomer ids for the respective locations in hetero_num.
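How mon_ids, hetero_num and hetero_ids relate to each other can be shown with a minimal sketch in plain Python. It assumes the _entity_poly_seq rows are available as (num, mon_id) tuples in file order. The function name split_poly_seq is hypothetical.

```python
def split_poly_seq(rows):
    """Split (num, mon_id) rows into mon_ids, hetero_num and hetero_ids."""
    mon_ids, hetero_num, hetero_ids = [], [], []
    seen = set()
    for num, mon_id in rows:
        if num in seen:
            # Further variants of an already seen residue number.
            hetero_num.append(num)
            hetero_ids.append(mon_id)
        else:
            # The first monomer id for a residue number goes into mon_ids.
            seen.add(num)
            mon_ids.append(mon_id)
    return mon_ids, hetero_num, hetero_ids


rows = [(1, "MET"), (2, "SER"), (2, "THR"), (3, "GLY")]
mon_ids, hetero_num, hetero_ids = split_poly_seq(rows)
print(mon_ids, hetero_num, hetero_ids)
```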

Writing mmCIF files

Star Writer

The syntax of mmCIF is a subset of the CIF file syntax, which is itself a subset of the STAR file syntax. OpenStructure implements a simple StarWriter that is able to write data in two ways:

  • key-value: A category name and an attribute name that is linked to a value. Example:

    _citation.year 2024
    

    _citation.year is called a mmCIF token. It consists of a data category (_citation) and a data item (year), delimited by a “.”.

  • tabular: Represents several values for a mmCIF token. The tokens are written in a header which is followed by the respective values. Example:

    loop_
    _atom_site.group_PDB
    _atom_site.type_symbol
    _atom_site.label_atom_id
    _atom_site.label_comp_id
    _atom_site.label_asym_id
    _atom_site.label_entity_id
    _atom_site.label_seq_id
    _atom_site.label_alt_id
    _atom_site.Cartn_x
    _atom_site.Cartn_y
    _atom_site.Cartn_z
    _atom_site.occupancy
    _atom_site.B_iso_or_equiv
    _atom_site.auth_seq_id
    _atom_site.auth_asym_id
    _atom_site.id
    _atom_site.pdbx_PDB_ins_code
    ATOM N N  SER A 0 1 . -47.333 0.941 8.834 1.00 52.56 71 P 0 ?
    ATOM C CA SER A 0 1 . -45.849 0.731 8.796 1.00 53.56 71 P 1 ?
    ATOM C C  SER A 0 1 . -45.191 1.608 7.714 1.00 51.61 71 P 2 ?
    ...
    

What follows is an example of how to use the StarWriter and its associated objects. In principle, that's enough to write a full mmCIF file, but you will want to check out the MMCifWriter, which extends StarWriter and extracts the relevant data from an OpenStructure ost.mol.EntityHandle.

from ost import io
import math

writer = io.StarWriter()

# Add key value pair
value = io.StarWriterValue.FromInt(42)
data_item = io.StarWriterDataItem("_the", "answer", value)
writer.Push(data_item)

# Add tabular data
loop_desc = io.StarWriterLoopDesc("_math_oper")
loop_desc.Add("num")
loop_desc.Add("sqrt")
loop_desc.Add("square")
loop = io.StarWriterLoop(loop_desc)
for i in range(10):
  data = list()
  data.append(io.StarWriterValue.FromInt(i))
  data.append(io.StarWriterValue.FromFloat(math.sqrt(i), 3))
  data.append(io.StarWriterValue.FromInt(i*i))
  loop.AddData(data)
writer.Push(loop)

# Write this groundbreaking data into a file named numbers.gz
# and yes, it's directly gzipped
writer.Write("numbers", "numbers.gz")

The content of the file written:

data_numbers
_the.answer 42
#
loop_
_math_oper.num
_math_oper.sqrt
_math_oper.square
0 0.000 0
1 1.000 1
2 1.414 4
3 1.732 9
4 2.000 16
5 2.236 25
6 2.449 36
7 2.646 49
8 2.828 64
9 3.000 81
#
class StarWriterValue

A value which is stored as a string. It must be constructed with one of the static constructor functions.

FromInt(int_val)

Static constructor from an integer value

Parameters:

int_val (int) – The value

Returns:

StarWriterValue

FromFloat(float_val, decimals)

Static constructor from a float value

Parameters:
  • float_val (float) – The value

  • decimals – Number of decimals that get stored in the internal value

Returns:

StarWriterValue

FromString(string_val)

Static constructor from a string value, stores input as is with the exception of the following processing:

  • set to “?” if string_val is an empty string (in mmCIF, “?” marks “unknown” values)

  • encapsulate string in quotes if string_val contains space character

  • encapsulate string in quotes if string_val starts with any of the following special characters: _, #, $, ', ", [, ], ;

  • encapsulate string in quotes if string_val starts with any of the following special strings: “data_” (case insensitive), “save_” (case insensitive)

  • encapsulate string in quotes if string_val is equal to any of the following reserved words (case insensitive): “loop_”, “stop_”, “global_”

Parameters:

string_val (str) – The value

Returns:

StarWriterValue
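
The processing rules listed for FromString() can be sketched in plain Python as follows. This is an illustration of the rules as documented here, not the OpenStructure implementation: the function names are hypothetical, and the choice of quote character (double quotes, or single quotes if the value itself contains a double quote) is an assumption.

```python
SPECIAL_START_CHARS = ("_", "#", "$", "'", '"', "[", "]", ";")
SPECIAL_START_STRINGS = ("data_", "save_")
RESERVED_WORDS = ("loop_", "stop_", "global_")


def needs_quotes(s):
    """Apply the documented rules that trigger quoting."""
    lower = s.lower()
    return (" " in s
            or s.startswith(SPECIAL_START_CHARS)
            or lower.startswith(SPECIAL_START_STRINGS)
            or lower in RESERVED_WORDS)


def star_string(s):
    """Return the string as it would be stored according to the rules above."""
    if s == "":
        return "?"  # "?" marks unknown values in mmCIF
    if needs_quotes(s):
        # Assumption: which quote character is used is not stated in the docs
        quote = "'" if '"' in s else '"'
        return f"{quote}{s}{quote}"
    return s


for example in ["", "SER", "hello world", "_the", "Data_block", "LOOP_"]:
    print(repr(example), "->", star_string(example))
```

Note that the reserved words only trigger quoting on an exact (case insensitive) match, so a value such as loop_x is stored unchanged under these rules.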

GetValue()

Returns the internal string representation

class StarWriterDataItem(category, attribute, value)

key-value data representation

Parameters:
  • category (str) – The category name of the data item

  • attribute (str) – The attribute name of the data item

  • value (StarWriterValue) – The value of the data item

GetCategory()

Returns category

GetAttribute()

Returns attribute

GetValue()

Returns value

class StarWriterLoopDesc(category)

Defines header for tabular data representation for the specified category

Parameters:

category (str) – The category

GetCategory()

Returns category

GetSize()

Returns number of added attributes

Add(attribute)

Adds an attribute

Parameters:

attribute (str) – The attribute

GetIndex(attribute)

Returns index for specified attribute, -1 if not found

Parameters:

attribute (str) – The attribute for which the index should be returned

class StarWriterLoop(desc)

Populates a StarWriterLoopDesc with data to get a full tabular data representation

Parameters:

desc (StarWriterLoopDesc) – The header

GetDesc()

Returns desc

GetN()

Returns number of added data lists

AddData(data_list)

Add data for each attribute in desc.

Parameters:

data_list (list of StarWriterValue) – Data to be added, length must match attributes in desc

class StarWriter

Can be populated with data which can then be written to a file.

Push(star_writer_object)

Push data to be written

Parameters:

star_writer_object (StarWriterDataItem/StarWriterLoop) – Data

Write(data_name, filename)

Writes pushed data to the specified file.

Parameters:
  • data_name (str) – Name of data block, i.e. the written file starts with data_<data_name>.

  • filename (str) – Name of generated file - applies gzip compression in case of .gz suffix.

mmCIF Writer

Data categories considered by the OpenStructure mmCIF writer are described in the following. The listed attributes are written to fulfill all dependencies in a mmCIF file according to mmcif_pdbx_v50.

The writer is designed to only require an OpenStructure ost.mol.EntityHandle/ ost.mol.EntityView as input but optionally performs preprocessing in order to separate residues of chains into valid mmCIF entities. This is controlled by the mmcif_conform flag which has significant impact on how chains are assigned to mmCIF entities, chain names and residue numbers. Ideally, the input is mmcif_conform which is the case when loading a structure from a valid mmCIF file with ost.io.LoadMMCIF().

Behaviour when mmcif_conform is True

Expected properties when mmcif_conform is enabled:

  • The residues in a chain all belong to the same mmCIF molecular entity. That is, for example, a polypeptide chain with all residues being peptide linking. In mmCIF lingo: an entity of type “polymer” which is of _entity_poly type “polypeptide(L)” with all residues being “L-PEPTIDE LINKING”, although some glycines might be “PEPTIDE LINKING”. Another example might be a ligand where the chain refers to an entity of type “non-polymer” and only contains that particular ligand.

  • Each chain must have a chain type assigned (available as ost.mol.ChainHandle.GetType()) which refers to the entity type. For entity types “polymer” and “branched”, the chain type also encodes the subtype. For a polymer chain, for example, the general CHAINTYPE_POLY is not sufficient; the more fine-grained polymer-specific type is expected, e.g. CHAINTYPE_POLY_PEPTIDE_D. The same is true for entities of type “branched”, where a subtype such as CHAINTYPE_OLIGOSACCHARIDE is expected.

  • The residue numbers in “polymer” chains must match the SEQRES of the underlying entity with 1-based indexing. Insertion codes are not allowed and raise an error.

  • Each residue must be named according to the entries in the ost.conop.CompoundLib which is provided when calling MMCifWriter.SetStructure(). This is relevant for the _chem_comp category. If the respective compound cannot be found, the type for that compound is set to “OTHER”.

One quirk remains: the assignment of the underlying mmCIF entities, which is a challenge primarily for polymers. The current logic starts with an empty internal entity list and processes chains one after the other, trying to match each chain against the entities collected so far. If no match is found, a new entity is generated and its SEQRES is set to what we observe in the chain residues given their residue numbers (i.e. the ATOMSEQ). If the first residue has residue number 10, the SEQRES gets prefixed by 9 elements using a default value (e.g. UNK for a chain of type CHAINTYPE_POLY_PEPTIDE_D). The same is done for gaps. A chain is considered to match an mmCIF entity if all of its residue names are an exact match at the respective location in the SEQRES. Locations are determined from residue numbers, which follow a 1-based indexing scheme. However, one chain may resolve more residues than another, so a chain may have residues at locations that are undefined in the current SEQRES. If the fraction of matches with undefined locations does not exceed 5%, we still assume an overall match and fill in the previously undefined locations in the SEQRES with the newly gained information. This is a heuristic that works in most cases but potentially introduces errors in entity assignment. If you want to avoid that, you must set your entities manually and pass a MMCifWriterEntityList when calling MMCifWriter.SetStructure(). There is a dedicated section on that below.
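
The matching heuristic can be illustrated with a plain-Python sketch. It is not the OpenStructure implementation and rests on several assumptions: all function names are hypothetical, undefined SEQRES locations are modelled as None (the writer uses a default such as UNK when writing them), and the 5% fraction is computed relative to the number of residues in the chain being matched.

```python
def seqres_from_chain(residues):
    """residues: dict {1-based residue number: residue name}.
    Locations without an observed residue stay None (undefined)."""
    seqres = [None] * max(residues)
    for num, name in residues.items():
        seqres[num - 1] = name
    return seqres


def match_and_update(seqres, residues, max_undefined=0.05):
    """Return True and fill in seqres if the chain matches, else False."""
    undefined = []
    for num, name in residues.items():
        known = seqres[num - 1] if num <= len(seqres) else None
        if known is None:
            undefined.append(num)
        elif known != name:
            return False  # mismatch at a defined location
    if len(undefined) / len(residues) > max_undefined:
        return False
    # Fill in previously undefined locations with the new information
    for num in undefined:
        if num > len(seqres):
            seqres.extend([None] * (num - len(seqres)))
        seqres[num - 1] = residues[num]
    return True


def assign_entities(chains):
    """chains: dict {chain name: residues}.
    Returns a list of (seqres, [chain names]) tuples."""
    entities = []
    for chain_name, residues in chains.items():
        for seqres, members in entities:
            if match_and_update(seqres, residues):
                members.append(chain_name)
                break
        else:
            entities.append((seqres_from_chain(residues), [chain_name]))
    return entities


chain_a = {i: "ALA" for i in range(10, 50)}  # residues 10..49
chain_b = {i: "ALA" for i in range(9, 50)}   # additionally resolves residue 9
chain_c = {i: "GLY" for i in range(10, 50)}  # different sequence
entities = assign_entities({"A": chain_a, "B": chain_b, "C": chain_c})
for seqres, members in entities:
    print(members, len(seqres))
```

In this toy input, chain B has one residue at an undefined location out of 41 (about 2.4%), so it is assigned to the entity created for chain A and residue 9 is filled in. Chain C mismatches at defined locations and gets its own entity.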

If mmcif_conform is enabled, pretty much everything is in place and the previously listed mmCIF categories/attributes are written with a few special cases:

  • _atom_site.auth_asym_id: Honours the residue string property “pdb_auth_chain_name” if set, uses the actual chain name otherwise. The string property is set in the mmCIF reader.

  • _pdbx_poly_seq_scheme.pdb_strand_id: Same behaviour as _atom_site.auth_asym_id

  • _atom_site.auth_seq_id: Honours the residue string property “pdb_auth_resnum” if set, uses the actual residue number otherwise. The string property is set in the mmCIF reader.

  • _pdbx_poly_seq_scheme.pdb_seq_num: Same behaviour as _atom_site.auth_seq_id

  • _atom_site.pdbx_PDB_ins_code: Honours the residue string property “pdb_auth_ins_code” if set, uses the actual residue insertion code otherwise. The string property is set in the mmCIF reader. If mmcif_conform is enabled, the actual residue insertion code can be expected to be empty though.

  • _pdbx_poly_seq_scheme.pdb_ins_code: Same behaviour as _atom_site.pdbx_PDB_ins_code
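
The fallback logic shared by these special cases can be sketched without OpenStructure. The function auth_identifiers is hypothetical and string properties are modelled as a plain dict; writing an empty insertion code as “?” follows the FromString() rule for empty strings.

```python
def auth_identifiers(chain_name, resnum, ins_code, props):
    """props: dict of string properties set on the residue (may be empty).
    Returns (auth_asym_id, auth_seq_id, pdbx_PDB_ins_code)."""
    asym_id = props.get("pdb_auth_chain_name", chain_name)
    seq_id = props.get("pdb_auth_resnum", str(resnum))
    ins = props.get("pdb_auth_ins_code", ins_code)
    return asym_id, seq_id, ins or "?"


# No properties set: fall back to the actual chain name and residue number
print(auth_identifiers("A", 1, "", {}))

# Properties as they could be set by the mmCIF reader
print(auth_identifiers("A", 1, "", {"pdb_auth_chain_name": "P",
                                    "pdb_auth_resnum": "71"}))
```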

Behaviour when mmcif_conform is False

If mmcif_conform is not enabled, the only expectation is that residues are named according to the ost.conop.CompoundLib which is provided when calling MMCifWriter.SetStructure(). The ost.conop.CompoundLib is used to extract the respective chem classes (see ost.mol.ChemClass). Residues with no entry in the ost.conop.CompoundLib are set to UNKNOWN. There will be significant preprocessing involving the split of chains which is purely based on these chem classes. Each chain gets split with the following rules:

  • separate chain of _entity.type “non-polymer” for each residue with chem class NON_POLYMER/ UNKNOWN

  • if any residue has chem class WATER, all of them are collected into one separate chain with _entity.type “water”

  • if any residue is a saccharide, i.e. has chem class SACCHARIDE/ L_SACCHARIDE/ D_SACCHARIDE, all of them are gathered into a single separate chain of _entity.type “branched” and _pdbx_entity_branch.type “oligosaccharide”.

  • if any residue has chem class RNA_LINKING, all of them are collected into one separate chain of _entity.type “polymer” and _entity_poly.type “polyribonucleotide”.

  • if any residue has chem class DNA_LINKING, all of them are collected into one separate chain of _entity.type “polymer” and _entity_poly.type “polydeoxyribonucleotide”.

  • if any residue is peptide linking, all of them are collected into one separate chain of _entity.type “polymer” and _entity_poly.type “polypeptide(L)”/”polypeptide(D)”. We only allow the following combinations of chem classes. Either L_PEPTIDE_LINKING/ PEPTIDE_LINKING or D_PEPTIDE_LINKING/ PEPTIDE_LINKING. Mixing L_PEPTIDE_LINKING and D_PEPTIDE_LINKING raises an error.
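
The splitting rules above can be sketched in plain Python, with chem classes given as strings. This is only an illustration under assumptions: the function split_chain is hypothetical, the order of the returned chains is arbitrary, and a peptide chain without any D_PEPTIDE_LINKING residue is assumed to be written as “polypeptide(L)”.

```python
PEPTIDE = {"L_PEPTIDE_LINKING", "D_PEPTIDE_LINKING", "PEPTIDE_LINKING"}
SACCHARIDE = {"SACCHARIDE", "L_SACCHARIDE", "D_SACCHARIDE"}


def split_chain(residues):
    """residues: list of (residue name, chem class) tuples of one chain.
    Returns a list of (entity type, subtype, [residue names]) tuples."""
    peptide, rna, dna, sugars, waters, singles = [], [], [], [], [], []
    peptide_classes = set()
    for name, chem_class in residues:
        if chem_class in PEPTIDE:
            peptide.append(name)
            peptide_classes.add(chem_class)
        elif chem_class == "RNA_LINKING":
            rna.append(name)
        elif chem_class == "DNA_LINKING":
            dna.append(name)
        elif chem_class in SACCHARIDE:
            sugars.append(name)
        elif chem_class == "WATER":
            waters.append(name)
        elif chem_class in ("NON_POLYMER", "UNKNOWN"):
            singles.append(name)
        else:
            raise ValueError(f"chem class not covered here: {chem_class}")
    if {"L_PEPTIDE_LINKING", "D_PEPTIDE_LINKING"} <= peptide_classes:
        raise ValueError("mixing L_PEPTIDE_LINKING and D_PEPTIDE_LINKING")
    # Assumption: without D residues the chain is treated as polypeptide(L)
    poly_type = ("polypeptide(D)" if "D_PEPTIDE_LINKING" in peptide_classes
                 else "polypeptide(L)")
    result = []
    if peptide:
        result.append(("polymer", poly_type, peptide))
    if rna:
        result.append(("polymer", "polyribonucleotide", rna))
    if dna:
        result.append(("polymer", "polydeoxyribonucleotide", dna))
    if sugars:
        result.append(("branched", "oligosaccharide", sugars))
    # One separate chain per NON_POLYMER/UNKNOWN residue
    result += [("non-polymer", "", [name]) for name in singles]
    if waters:
        result.append(("water", "", waters))
    return result


chain = [("MET", "L_PEPTIDE_LINKING"), ("GLY", "PEPTIDE_LINKING"),
         ("HEM", "NON_POLYMER"), ("HOH", "WATER"),
         ("HOH", "WATER"), ("ZN", "NON_POLYMER")]
for part in split_chain(chain):
    print(part)
```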

Chain names are generated by iterating over “ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz”, continuing with AA, AB, AC etc. once the single-character names are used up. There can therefore be as many chains as needed. The mmCIF entities are built the same way as for mmcif_conform with two differences: 1) the extracted SEQRES of a chain is the ATOMSEQ, i.e. the exact sequence of its residues, and 2) entity matching happens through exact matches of SEQRES and is independent of residue numbers. As a consequence, the residue numbers written as _atom_site.label_seq_id no longer correspond to the actual residue numbers but refer to the location in ATOMSEQ.
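
A naming scheme of this kind can be sketched with itertools. The generator chain_names is hypothetical; that names beyond AC continue in alphabet order (AZ, A0, ...) and that three-character names follow once all two-character names are used are assumptions, as only the first few names are stated above.

```python
import itertools

CHAIN_ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                  "0123456789"
                  "abcdefghijklmnopqrstuvwxyz")


def chain_names():
    """Yield 'A', 'B', ..., 'z', then 'AA', 'AB', ... without end."""
    for length in itertools.count(1):
        for combo in itertools.product(CHAIN_ALPHABET, repeat=length):
            yield "".join(combo)


first = list(itertools.islice(chain_names(), 65))
print(first[:3], first[60:])
```

The alphabet holds 62 characters, so the 63rd generated name is the first one with two characters.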

Once chains are split and new chain names are assigned, the rest is straightforward. The special cases listed above (_atom_site.auth_asym_id, _pdbx_poly_seq_scheme.pdb_strand_id, _atom_site.auth_seq_id etc.) are treated the same as if mmcif_conform were True.

To see it all in action:

from ost import io
from ost import conop

ent = io.LoadMMCIF("1a0s", remote=True)

writer = io.MMCifWriter()

# The MMCifWriter is still an object of type StarWriter,
# so the mmCIF file can be decorated with any additional data
val = io.StarWriterValue.FromInt(42)
data_item = io.StarWriterDataItem("_the", "answer", val)
writer.Push(data_item)

# The actual relevant part... mmcif_conform can be set to
# True, as we loaded from mmCIF file
lib = conop.GetDefaultLib()
writer.SetStructure(ent, lib, mmcif_conform = True)

# And write...
writer.Write("1a0s", "1a0s.cif.gz")

Define mmCIF entities

The writer provides a way to pre-define mmCIF entities. This only works if mmcif_conform is enabled and only for polymer entities. The problem is that with only a structure as input, there is no guarantee that we ever see the full SEQRES (written in the entity_poly_seq category). As an example: gaps, i.e. missing residues based on residue numbers, are filled with UNK in case of an L_PEPTIDE_LINKING chain. In order to retain the full SEQRES information, we provide a way to define these polymer entities in the form of MMCifWriterEntity objects. The provided entities must fulfill:

  • They must be of _entity.type “polymer”

  • All chains in input structure that are of _entity.type “polymer” must be assigned to exactly one of these MMCifWriterEntity objects and must match the SEQRES (MMCifWriterEntity.mon_ids)

  • All chain names that are assigned to any of the MMCifWriterEntity objects must be present in input structure
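
The chain assignment requirements can be checked up front with a small plain-Python sketch. The function check_entity_assignment is hypothetical, it only looks at polymer chain names, and it does not verify that chains match the SEQRES.

```python
def check_entity_assignment(polymer_chains, entity_asym_ids):
    """polymer_chains: names of the polymer chains in the structure.
    entity_asym_ids: one list of chain names per pre-defined entity.
    Returns a list of problems, empty if the assignment is consistent."""
    problems = []
    counts = {}
    for asym_ids in entity_asym_ids:
        for name in asym_ids:
            counts[name] = counts.get(name, 0) + 1
    # Every polymer chain must be assigned to exactly one entity
    for name in polymer_chains:
        n = counts.get(name, 0)
        if n != 1:
            problems.append(f"chain {name} assigned to {n} entities")
    # Every assigned chain name must exist in the structure
    for name in counts:
        if name not in polymer_chains:
            problems.append(f"chain {name} not in structure")
    return problems


print(check_entity_assignment(["A", "B"], [["A", "B"]]))
print(check_entity_assignment(["A", "B"], [["A"], ["A", "C"]]))
```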

Here is an example with pre-defined mmCIF entities:

from ost import io
from ost import conop

# Read the structure and also seqres and meta information
ent, seqres, info = io.LoadMMCIF("1a0s", remote=True,
                                 seqres=True, info=True)

# we need the compound library at several places
lib = conop.GetDefaultLib()

# pre-define mmCIF entities
entity_info = io.MMCifWriterEntityList()
for entity_id in info.GetEntityIdsOfType("polymer"):

  # Get entity description from info object
  entity_desc = info.GetEntityDesc(entity_id)

  # interface of entity_desc is similar to MMCifWriterEntity
  entity_poly_type = entity_desc.entity_poly_type
  mon_ids = entity_desc.mon_ids
  e = io.MMCifWriterEntity.FromPolymer(entity_poly_type,
                                       mon_ids, lib)
  entity_info.append(e)

  # search all chains assigned to the entity we just added
  for ch in ent.chains:
    if info.GetMMCifEntityIdTr(ch.name) == entity_id:
      entity_info[-1].asym_ids.append(ch.name)

  # deal with heterogeneities
  for a,b in zip(entity_desc.hetero_num, entity_desc.hetero_ids):
    entity_info[-1].AddHet(a,b)

# write mmcif file with pre-defined mmCIF entities
writer = io.MMCifWriter()
writer.SetStructure(ent, lib,
                    entity_info=entity_info)
writer.Write("1a0s", "1a0s.cif.gz")
class MMCifWriterEntity

Defines an mmCIF entity that will be written by MMCifWriter. Must be created with a static constructor function.

FromPolymer(entity_poly_type, mon_ids, compound_lib)

Static constructor function for entities of type “polymer”

Parameters:
  • entity_poly_type (str) – Entity poly type from restricted vocabulary for _entity_poly.type

  • mon_ids (list of str) – Full names of all compounds defining the SEQRES of that entity

  • compound_lib (ost.conop.CompoundLib) – Components dictionary from which chem classes are fetched

type

(str) The _entity.type

poly_type

(str) The _entity_poly.type - empty string if type is not “polymer”

branch_type

(str) The _pdbx_entity_branch.type - empty string if type is not “branched”

mon_ids

(ost.StringList) The compound names making up this entity

seq_olcs

(ost.StringList) The one letter codes for mon_ids which will be written to _pdbx_seq_one_letter_code - invalid if type is not “polymer”

seq_can_olcs

(ost.StringList) The one letter codes for mon_ids which will be written to _pdbx_seq_one_letter_code_can - invalid if type is not “polymer”

asym_ids

(ost.StringList) Asym chain names that are assigned to this entity

class MMCifWriterEntityList

A list for MMCifWriterEntity

class MMCifWriter

Inherits all functionality from StarWriter and provides functionality to extract relevant mmCIF information from ost.mol.EntityHandle/ ost.mol.EntityView

SetStructure(ent, compound_lib, mmcif_conform=True, entity_info=list())

Extracts mmCIF categories/attributes based on the description above. An object of type MMCifWriter can only be associated with one structure. Calling this function more than once raises an error.

Parameters:
  • ent (ost.mol.EntityHandle/ ost.mol.EntityView) – The structure to write

  • compound_lib (ost.conop.CompoundLib) – The compound library

  • mmcif_conform (bool) – Determines data extraction strategy as described above

  • entity_info (MMCifWriterEntityList) – Predefine mmCIF entities - useful to define complete SEQRES. If given, the provided list serves as a starting point, i.e. chains in ent are matched to entities in entity_info. In case of no match, this list gets extended. Starts from empty list if not given.

GetEntities()

Returns MMCifWriterEntityList. Useful for inspecting the entities after SetStructure() has been called. The order in this list defines the entity ids in the written mmCIF file, using zero-based indexing.

Biounits

Biological assemblies, i.e. biounits, are an integral part of mmCIF files and their construction is fully defined in MMCifInfoBioUnit. MMCifInfoBioUnit.PDBize() provides one possibility to construct such biounits with compatibility with the PDB format in mind. That means single-character chain names, dumping all ligands into one chain, etc. For a more mmCIF-style way of constructing biounits, check out ost.mol.alg.CreateBU() in the ost.mol.alg module.
