Modelling Algorithms¶

A collection of algorithms that can be useful in modelling

Rigid Blocks¶

RMSD is a typical measure for similarity of two structures. Given an atom atom mapping between two structures, the minimum RMSD and the according superposition can efficiently be calculated using an approach based on singular value decomposition. This approach is problematic if there are very dissimilar regions or when domain movement events occur. We can therefore implement an iterative superposition. The two structures undergo an initial superposition. For every iteration we then select a subset of atoms that are within a certain distance threshold that serve as input for the next superposition. This iterative superpostion typically converges to the largest common subpart but is non-deterministic since it depends on the initial superposition. The RigidBlocks algorithm is based on only the CA positions and performs this iterative superposition multiple times by using a sliding window to select the initial subset and gathers all unique results. These results can be very similar and only differ by single positions. The algorithm therefore reduces the amount of solutions by merging them based on a threshold of similarity. The similarity is defined by the fraction of positions in solution A that are also present in solution B. As a final result, the algorithm therefore detects common rigid subsets of positions.

promod3.modelling.RigidBlocks(bb_list_one, bb_list_two[, window_length = 12, max_iterations=20, distance_thresh=3.0, cluster_thresh=0.9])¶

Performs the RigidBlock algorithm on given input

Parameters:

bb_list_one (promod3.loop.BackboneList) – First piece structural information from which CA positions will be extracted
bb_list_two (promod3.loop.BackboneList) – Second piece of structural information from which CA positions will be extracted
window_length (int) – Length of sliding window to generate initial subsets
max_iterations (int) – Maximal numbers of iterations for every single iterative superposition
distance_thresh (float) – Maximal distance two CA positions can have to be considered in the same rigid block and to select the common subset for the next iteration of the iterative superposition
cluster_thresh (float) – Threshold of similarity to perform the final merging of the solutions

Returns:

tuple with the first element being a list of list defining the indices of the common subsets (rigid blocks) relative to the input promod3.loop.BackboneList objects and the second element being a list of ost.geom.Mat4 defining the transformations to superpose the according positions in bb_list_one onto bb_list_two

promod3.modelling.RigidBlocks(aln[, seq_one_idx=0, seq_two_idx=1, window_length = 12, max_iterations=20, distance_thresh=3.0, cluster_thresh=0.9])¶

Performs the RigidBlock algorithm on given input

Parameters:

aln (ost.seq.AlignmentHandle) – An alignment with attached ost.mol.EntityView objects from which the positions are extracted
seq_idx_one (int) – The idx of the first sequence from which the CA positions will be extracted
seq_idx_two (int) – The idx of the second sequence from which the CA positions will be extracted
window_length (int) – Length of sliding window to generate initial subsets
max_iterations (int) – Maximal numbers of iterations for every single iterative superposition
distance_thresh (float) – Maximal distance two CA positions can have to be considered in the same rigid block and to select the common subset for the next iteration of the iterative superposition
cluster_thresh (float) – Threshold of similarity to perform the final merging of the solutions

Returns:

tuple with the first element being a list of list defining the column indices of the common subsets (rigid blocks) relative to the input ost.seq.AlignmentHandle and the second element being a list of ost.geom.Mat4 defining the transformations to superpose the according positions from the first sequence onto the second sequence.

promod3.modelling.RigidBlocks(pos_one, pos_two[, window_length = 12, max_iterations=20, distance_thresh=3.0, cluster_thresh=0.9])¶

Performs the RigidBlock algorithm on given input

Parameters:

pos_one (ost.geom.Vec3List) – First piece position information
pos_two (ost.geom.Vec3List) – Second piece of position information
window_length (int) – Length of sliding window to generate initial subsets
max_iterations (int) – Maximal numbers of iterations for every single iterative superposition
distance_thresh (float) – Maximal distance two CA positions can have to be considered in the same rigid block and to select the common subset for the next iteration of the iterative superposition
cluster_thresh (float) – Threshold of similarity to perform the final merging of the solutions

Returns:

tuple with the first element being a list of list defining the indices of the common subsets (rigid blocks) relative to the input ost.geom.Vec3List objects and the second element being a list of ost.geom.Mat4 defining the transformations to superpose the according positions in pos_one onto pos_two

De Novo Modelling¶

ProMod3 provides algorithms for sampling and fragment detection. Here we provide an object, that facilitates fragment detection and caching, as well as a convenient function to combine the functionalities into an example pipeline.

Motif Finder¶

Distinct spatial arrangements of atoms or functional groups are key for protein function. For their detection, ProMod3 implements the MotifFinder algorithm which is based on geometric hashing as described by Nussinov and Wolfson [nussinov1991]. The algorithm consists of a learning stage, a detection stage and a refinement stage.

Learning Stage: A motif (query) is represented by a set of coordinates. Triplets (p1, p2, p3) of coordinates are selected that define triangles. For each triangle one can define an orthogonal vector basis (in our case v1 = norm(p2-p1), v3 = norm(cross(v1,p3-p1), v2 = norm(cross(v1,v3)))). For each coordinate not in [p1,p2,p3], we add the identity of the query/triangle as value to a hash map. The corresponding key consists of discretized values describing the edge lengths of the triangle, as well as the coordinate transformed into the triangle specific orthogonal vector basis. That’s 6 numbers in total.

Detection Stage: The goal is to identify one or several subsets of target coordinates that resemble an input query. We first setup an accumulator containing a counter for each triangle observed in the input query. We then iterate over each possible triangle with vertices p1, p2 and p3 in the target coordinates. At the beginning of each iteration, all counters in the accumulator are set to zero. Again, we build a vector basis given that triangle and transform all coordinates not in [p1,p2,p3] into that vector space. For each transformed coordinate we obtain a key for the query hash map. If there is one or several values at that location in the hash map, we increment the corresponding locations in the accumulator. Once all coordinates are processed, we search for high counts in the accumulator. Given N query coordinates, we keep a solution for further refinement if count/(N-3) >= hash_tresh. This is repeated until all triangles in the target are processed. One key problem with this approach is the discretization of floating point numbers that give raise to the hash map keys. Two extremely close values might end up in different bins just because they are close to the bin boundaries. For each of the 6 relevant numbers we estimate the actual bin as well as the closest neighbouring bin. Processing all possible combinations results in 64 hash map lookups instead of only one.

Refinement Stage: Every potential solution identified in the detection stage is further refined based on the distance_thresh and refine_thresh parameters. A potential solution found in the detection stage is a pair of triangles, one in the query and one in the target, for which we find many matching coordinates in their respective vector space. We start with a coordinate mapping based on the triangle vertices from the query and the target (3 pairs). This coordinate mapping is iteratively updated by estimating the minimum RMSD superposition of the mapped query coordinates onto the target, apply that superposition on the query, find the closest target coordinate for each coordinate in the query and redo the mapping by including all pairs with minimum distance < distance_thresh. Iteration stops if nothing changes anymore. The solution is returned to the user if the final fraction of mapped query coordinates is larger or equal refine_thresh. The larger the mapping, the more accurate the superposition. As we start with only the three triangle vertices, distance_thresh is doubled for the initial iteration.

# Example script that loads protein structures that contain ATP and
# generates motif queries describing their binding pockets.
# In a second step, those pockets are matched against a protein 
# structure that only contains an ATP analog. The ATP from every 
# match is transformed and stored to disk for further processing.

from ost import io, geom, mol
from promod3 import modelling

files = ['data/1E2Q.pdb', 'data/1KO5.pdb', 'data/2IYW.pdb']

atp_list = list()
query_list = list()

for f in files:

    prot = io.LoadPDB(f)
    peptide_sel = prot.Select("peptide=true")
    atp_sel = prot.Select("rname=ATP")

    # generate a single query for each ATP pocket
    for atp_idx, atp_r in enumerate(atp_sel.residues):
        pocket_view = peptide_sel.CreateEmptyView()
        for atp_at in atp_r.atoms:
            close_at = peptide_sel.FindWithin(atp_at.GetPos(), 4.5)
            for at in close_at:
                r = at.handle.GetResidue()
                add_flag = mol.INCLUDE_ATOMS | mol.CHECK_DUPLICATES
                pocket_view.AddResidue(r, add_flag)

        ca_positions = geom.Vec3List()
        for res in pocket_view.residues:
            ca_positions.append(res.FindAtom("CA").GetPos())
        i = "%s_%i"%(f, atp_idx)
        query = modelling.MotifQuery(ca_positions, i, 4.0, 9.0, 1.0)
        query_list.append(query)
     
        # create an entity from atp for later use
        atp_view = prot.CreateEmptyView()
        atp_view.AddResidue(atp_r, mol.INCLUDE_ATOMS)
        atp_list.append(mol.CreateEntityFromView(atp_view, True))

# That's it, let's combine the single queries
full_query = modelling.MotifQuery(query_list)

prot = io.LoadPDB("data/1AKE.pdb")
peptide_sel = prot.Select("peptide=true")
ca_positions = geom.Vec3List()
for r in peptide_sel.residues:
    ca_positions.append(r.FindAtom("CA").GetPos())

# search all matches, fetch the corresponding atps,
# transform them onto the 1ake binding site and dump them to disk
matches = modelling.FindMotifs(full_query, ca_positions)
for m_idx, m in enumerate(matches):
    atp = atp_list[m.query_idx].Copy()
    atp.EditXCS().ApplyTransform(m.mat)
    io.SavePDB(atp, "m_%i.pdb"%(m_idx))

class promod3.modelling.MotifQuery(positions, identifier, min_triangle_edge_length, max_triangle_edge_length, bin_size)¶

class promod3.modelling.MotifQuery(positions, identifier, min_triangle_edge_length, max_triangle_edge_length, bin_size, flags)

class promod3.modelling.MotifQuery(query_list)

A single query or a container of queries. The constructor performs the learning stage of a single query or combines several queries, so they can be searched at once.

Parameters:

positions (ost.geom.Vec3List) – Coordinates of the query
identifier (str) – Descriptor of the query
min_triangle_edge_length (float) – To avoid the full O(n^3) hell, triangles with any edge length below min_triangle_edge_length are skipped
max_triangle_edge_length (float) – Same as min_triangle_edge_length but upper bound
bin_size (float) – Bin size in A, relevant to generate hash map keys
flags (list of int) – Flag in range [0,63] for every coordinate in positions. They’re also added to the hash map keys (default: 0). This means that additionally to having a matching relative position, the coordinates must also have a matching flag in the detection/refinement stage. If not provided (in the query and in the search), only coordinates matter.
query_list (list of MotifQuery) – E pluribus unum

Save(filename)¶

Saves the query down to disk

Parameters:: filename (str) – filename

static Load(filename)¶

Load query from disk

Parameters:: filename (str) – filename

GetPositions(query_idx)¶

Returns coordinates of specified query

Parameters:: query_idx (int) – Query from which you want the positions

GetIdentifiers()¶: Returns a list of all query identifiers.

GetN()¶: Returns the number of queries

class promod3.modelling.MotifMatch¶

Object that holds information about a match found in FindMotifs()

query_idx¶: Index of matching query

mat¶: Transformation matrix to superpose matching query onto target

alignment¶: List of tuples which define matching pairs of query/target coordinates

promod3.modelling.FindMotifs(query, target_positions, hash_tresh=0.4, distance_thresh=1.0, refine_thresh=0.7, flags=list(), swap_thresh=False)¶

Performs the detection and refinement stages of the geometric hashing algorithm.

Parameters:

query – Query to be searched
target_positions – Coordinates of the target
hash_thresh – Parameter relevant for detection stage
distance_thresh – Parameter relevant for refinement stage
refine_thresh – Parameter relevant for refinement stage
flags – Equivalent to flags in MotifQuery constructor. If you didn’t provide anything there, this can be ignored. Only the actual coordinates matter in this case.
swap_thresh – hash_thresh and refine_thresh refer to fraction of covered positions in query. When setting this to True, they refer to the fraction of covered positions in target_positions.

Returns:

All found matches

Return type:

list of MotifMatch

AFDB Modelling¶

Template based modelling using AFDB as template library comes with two challenges for which ProMod3 provides solutions:

efficient structure storage for the whole AFDB: FSStructureServer
fast sequence searches with limited sensitivity: PentaMatch

The creation of these two object requires extensive preprocessing. The required scripts and documentation are available in <GIT_ROOT>/extras/data_generation/afdb_modelling.

Basic modelling functionality is available in the following two functions:

AFDBTPLSearch()
AFDBModel()

Modelling Algorithms¶

Rigid Blocks¶

De Novo Modelling¶

Motif Finder¶

AFDB Modelling¶

Search

Contents