You are reading the documentation for version 1.5 of OpenStructure. You may also want to read the documentation for: 1.1 1.2 1.3 1.4 1.6 1.7 1.7.1 1.8 1.9 1.10 1.11 2.0 2.1 2.2 devel

hhblits - Search related sequences in databases

Introduction

HHblits is a sequence search tool like BLAST, able to find more distant homologs This is achieved by performing profile-profile searches. In BLAST, a query sequence is searched against a sequence database. That is a sequence-sequence search. HHblits works on a profile database, usually that one is provided, queried with a sequence profile. The latter one needs to be calculated before the actual search. In very simple words, HHblits is using per-sequence scoring functions to be more sensitive, in this particular case Hidden Markov models. The software suite needed for HHblits can be found here.

Examples

A typical search: Get an instance of the bindings, build the search profile out of the query sequence, run the search and iterate results. Since HHblits works with a SequenceHandle or a FastA file, both ways are shown.

First query by sequence:

from ost.bindings import hhblits

# get a SequenceHandle
query_seq = seq.CreateSequence('Query',
                               'MRIILLGAPGAGKGTQAQFIMEKYGIPQISTGDMLRAAVKSGS'+
                               'ELGKQAKDIMDAGKLVTDELVIALVKERIAQEDCRNGFLLDGF'+
                               'PRTIPQADAMKEAGINVDYVLEFDVPDELIVDRIVGRRVHAPS'+
                               'GRVYHVKFNPPKVEGKDDVTGEELTTRKDDQEETVRKRLVEYH'+
                               'QMTAPLIGYYYYSKEAEAGNTKYAKVDGTKPVAEVRADLEKILG')

# set up the search environment
# $EBROOTHHMINSUITE points to the root of a HHsuite installation
hh = hhblits.HHblits(query_seq, os.getenv('EBROOTHHMINSUITE'))

# now create a search profile for the query sequence against the NR20 db
# provided on the HHblits web page, nr20_12Aug11 is just the prefix common to
# all db files, so `ls hhtools/nr20_12Aug11*` would list all of them
a3m_file = hh.BuildQueryMSA(nrdb='hhtools/nr20_12Aug11')

# search time! we just search against NR20 again, but every HHblits db is
# working here, e.g. one build from all the sequences in PDB
hit_file = hh.Search(a3m_file, 'hhtools/nr20_12Aug11')

# lets have a look at the resuls
with open(hit_file) as hit_fh:
    header, hits = hhblits.ParseHHblitsOutput(hit_fh)
for hit in hits:
    print hit.aln

Very similar going by file:

from ost.bindings import hhblits

# set up the search environment
#  $EBROOTHHMINSUITE points to the root of a HHsuite installation
hh = hhblits.HHblits('query.fas', os.getenv('EBROOTHHMINSUITE'))

# now create a search profile for the query sequence against the NR20 db
# provided on the HHblits web page, nr20_12Aug11 is just the prefix common to
# all db files, so `ls hhtools/nr20_12Aug11*` would list all of them
a3m_file = hh.BuildQueryMSA(nrdb='hhtools/nr20_12Aug11')

# search time! we just search against NR20 again, but every HHblits db is
# working here, e.g. one build from all the sequences in PDB
hit_file = hh.Search(a3m_file, 'hhtools/nr20_12Aug11')

# lets have a look at the resuls
with open(hit_file) as hit_fh:
    header, hits = hhblits.ParseHHblitsOutput(hit_fh)
for hit in hits:
    print hit.aln

The alignments produced by HHblits are sometimes slightly better than by BLAST, so one may want to extract them:

from ost.bindings import hhblits

# set up the search environment
#  $EBROOTHHMINSUITE points to the root of a HHsuite installation
hh = hhblits.HHblits('query.fas', os.getenv('EBROOTHHMINSUITE'))

# now create a search profile for the query sequence against the NR20 db
# provided on the HHblits web page, nr20_12Aug11 is just the prefix common to
# all db files, so `ls hhtools/nr20_12Aug11*` would list all of them
a3m_file = hh.BuildQueryMSA(nrdb='hhtools/nr20_12Aug11')

# note that ParseA3M is not a class method but a module function
output = hhblits.ParseA3M(open(a3m_file))

print output['msa']

Binding API

class HHblits(query, hhsuite_root, hhblits_bin=None, working_dir=None)

Initialise a new HHblits “search” for the given query. Query may either be a SequenceHandle or a string. In the former case, the query is the actual query sequence, in the latter case, the query is the filename to the file containing the query.

Parameters:
  • query (SequenceHandle or str) – Query sequence as file or sequence.
  • hhsuite_root (str) – Path to the top-level directory of your hhsuite installation.
  • hhblits_bin (str) – Name of the hhblits binary. Will only be used if hhsuite_root/bin/hhblits does not exist.
  • working_dir (str) – Directory for temporary files. Will be created if not present but not automatically deleted.
A3MToCS(a3m_file, cs_file=None, options={})

Converts the A3M alignment file to a column state sequence file. If cs_file is not given, the output file will be set to <a3m_file-basename>.seq219.

Parameters:
  • a3m_file (str) – A3M file to be converted
  • cs_file (str) – output file name (may be omitted)
  • options (dict) – dictionary of options to cstranslate, must come with the right amount of ‘-‘ in front.
Returns:

the path to the column state sequence file

Return type:

str

A3MToProfile(a3m_file, hhm_file=None)

Converts the A3M alignment file to a hhm profile. If hhm_file is not given, the output file will be set to <a3m_file-basename>.hhm.

Parameters:
  • a3m_file (str) – input MSA
  • hhm_file (str) – output file name
Returns:

the path to the profile

Return type:

str

BuildQueryMSA(nrdb, iterations=1, mact=None, cpu=1)

Builds the MSA for the query sequence

This function directly uses hhblits of hhtools. While in theory it would be possible to do this by PSI-blasting on our own, hhblits is supposed to be faster. Also it is supposed to prevent alignment corruption. The alignment corruption is caused by low-scoring terminal alignments that draw the sequences found by PSI-blast away from the optimum. By removing these low scoring ends, part of the alignment corruption can be suppressed. hhblits does not call PSIPRED on the MSA to predict the secondary structure of the query sequence. This is done by addss.pl of hhtools. The predicted secondary structure is stored together with the sequences identified by hhblits.

Parameters:
  • nrdb (str) – Database to be align against; has to be an hhblits database
  • iterations (int) – Number of hhblits iterations
  • mact (float) – -mact of hhblits
  • cpu (int) – -cpu of hhblits
Returns:

the path to the MSA file

Return type:

str

Cleanup()

Delete temporary data.

Delete temporary data if no working dir was given. Controlled by needs_cleanup.

CleanupFailed()

In case something went wrong, call to make sure everything is clean.

Search(a3m_file, database, options={}, prefix='')

Searches for templates in the given database. Before running the search, the hhm file is copied. This makes it possible to launch several hhblits instances at once. Upon success, the filename of the result file is returned. This file may be parsed with ParseHHblitsOutput().

Parameters:
  • a3m_file (str) – input MSA file
  • database (str) – search database, needs to be the common prefix of the database files
  • options (dict) – dictionary of options, must come with the right amount of ‘-‘ in front.
  • prefix (str) – prefix to the result file
Returns:

the path to the result file

Return type:

str

class HHblitsHit(hit_id, aln, score, ss_score, evalue, pvalue, prob)

A hit found by HHblits

hit_id

String identifying the hit

Type:str
aln

Pairwise alignment containing the aligned part between the query and the target. First sequence is the query, the second sequence the target.

Type:AlignmentHandle
score

The alignment score

Type:float
ss_score

The secondary structure score

Type:float
evalue

The E-value of the alignment

Type:float
pvalue

The P-value of the alignment

Type:float
prob

The probability of the alignment (between 0 and 100)

Type:float
class HHblitsHeader

Stats from the beginning of search output.

query

The name of the query sequence

Type:str
match_columns

Total of aligned Match columns

Type:int
n_eff

Value of the -neff option

Type:float
searched_hmms

Number of profiles searched

Type:int
date

Execution date

Type:datetime.datetime
command

Command used to run

Type:str
ParseHHblitsOutput(output)

Parses the HHblits output and returns a tuple of HHblitsHeader and a list of HHblitsHit instances.

Parameters:output (file/ iteratable) – output of a HHblits.Search(), needs to be iteratable, e.g. an open file handle
Returns:a tuple of the header of the search results and the hits
Return type:(HHblitsHeader, HHblitsHit)
ParseA3M(a3m_file)

Parse secondary structure information and the multiple sequence alignment out of an A3M file.

Parameters:a3m_file (iteratable, e.g. an opened file) – Iteratable containing the lines of the A3M file
Returns:Dictionary containing “ss_pred” (list), “ss_conf” (list) and “msa” (AlignmentHandle).
ParseHeaderLine(line)

Fetch header content.

First, we seek the start of the identifier, that is, the first whitespace after the hit number + 1. Since the identifier may contain whitespaces itself, we cannot split the whole line

Parameters:line (str) – Line from the output header.
Returns:Hit information
Return type:HHblitsHit
ParseHHM(profile)

Parse secondary structure information and the MSA out of an HHM profile.

Parameters:profile (file) – Opened file handle holding the profile.
Returns:Dictionary containing “ss_pred” (list), “ss_conf” (list), “msa” (AlignmentHandle) and “consensus” (~ost.seq.SequenceHandle).
EstimateMemConsumption()

Estimate the memory needed by HHblits. By default it uses not more than 3G. Also for small sequences it already uses quite some memnmory (46AA, 1.48G). And since the memory consumption could depend on the iterative search runs, how many hits are found in each step, we just go with 4G, here.

Returns:Assumed memory consumtion
Return type:(float, str)

Contents

Search

Enter search terms or a module, class or function name.

Previous topic

blast - Search related sequences in databases

Next topic

msms – Calculating Molecular Surfaces

You are here