Computational Systems Biology and Bioinformatics Lab


 

SMAP: Software for ligand binding site comparison


Please follow the instructions below to install and run this software.


*****************************************************************************
******************************************************************************

A) System requirements

Windows XP or Linux operating system: SMAP can be executed on both Windows and Linux. For Windows users, installing Cygwin is strongly recommended.

1 GB memory: For most PDB chains, at least 1 GB of RAM should be allocated to the software. Some comparisons need more than 1 GB.

Java 1.6: Java 1.6 is required to run SMAP.

B) Installation

1) Download SMAP v2.0 and uncompress it.

For SMAP_v2_0.zip, use the command:
>unzip SMAP_v2_0.zip

For SMAP_v2_0.tar.gz, use the command:
>gunzip -c SMAP_v2_0.tar.gz | tar -xvf -

2) A directory named smap_v2_0 will be created in the installation directory. It contains the following directories and files:

README: this file
License.txt: the academic license of the SMAP software
classes: the directory for the Java class files of the software
lib: the directory for the external Java libraries imported by the software
external: the directory containing binaries of external software (PSI-Blast and qhull) for both Windows and Linux
conformerUnit: the directory in which the serialized object files of conformer units needed to run SMAP are saved. It is empty when downloaded.
*.bat, *.csh, and *.sh: shell scripts that facilitate running the software

C) How to start

1) Go to the smap_v2_0 directory and set up the environment variables in the shell script (.csh, .sh, or .bat).


a. Set SMAPROOT to the path of the directory in which SMAP is installed.

e.g. in smap_comp.bat
set SMAPROOT=C:\user\smap_v2_0

in smap_comp.sh
export SMAPROOT=/home/user/smap_v2_0

b. Add the paths of the required libraries to CLASSPATH.

e.g. in smap_comp.bat
set CLASSPATH=%CLASSPATH%;%SMAPROOT%\classes;%SMAPROOT%\lib\biojava.jar;%SMAPROOT%\lib\pdblibs.jar;%SMAPROOT%\lib\siteormapping.jar;%SMAPROOT%\lib\pdbormapping.jar;%SMAPROOT%\lib\mbt.jar

in smap_comp.sh
export CLASSPATH=${CLASSPATH}:${SMAPROOT}/classes:${SMAPROOT}/lib/biojava.jar:${SMAPROOT}/lib/pdblibs.jar:${SMAPROOT}/lib/siteormapping.jar:${SMAPROOT}/lib/pdbormapping.jar:${SMAPROOT}/lib/mbt.jar

c. Set the maximum Java heap size (-Xmx) on the command line. If the program runs out of memory, it will end prematurely.

e.g. in smap_comp.bat
java -Xmx1200M -classpath %CLASSPATH% org.interactome.siteengine.sitesearch.SMAP -templateChain %templChain% -queryChain %queryChain% -output %output%

in smap_comp.sh
java -Xmx1200M -cp ${CLASSPATH} org.interactome.siteengine.sitesearch.SMAP -templateChain $templChain -queryChain $queryChain -output $output
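
The command line above can also be assembled programmatically, for example when many comparisons are launched from a driver script. The following is a minimal Python sketch and is not part of SMAP: the function name build_smap_command and the example paths are assumptions made for illustration, while the main class, the jar names, and the options are taken from the scripts above. Unlike the scripts, it does not prepend an existing CLASSPATH.

```python
import os

# Main class and jar names as they appear in smap_comp.bat / smap_comp.sh.
SMAP_MAIN_CLASS = "org.interactome.siteengine.sitesearch.SMAP"
SMAP_JARS = ["biojava.jar", "pdblibs.jar", "siteormapping.jar",
             "pdbormapping.jar", "mbt.jar"]


def build_smap_command(smap_root, template_chain, query_chain, output,
                       max_heap_mb=1200):
    """Return the java command as an argument list (hypothetical helper)."""
    # classes directory first, then every jar under lib.
    entries = [os.path.join(smap_root, "classes")]
    entries += [os.path.join(smap_root, "lib", jar) for jar in SMAP_JARS]
    # os.pathsep is ';' on Windows and ':' on Linux, as in the two scripts.
    classpath = os.pathsep.join(entries)
    return ["java", f"-Xmx{max_heap_mb}M", "-cp", classpath, SMAP_MAIN_CLASS,
            "-templateChain", template_chain,
            "-queryChain", query_chain,
            "-output", output]
```

The returned list can be passed to subprocess.run(cmd, check=True). Raising max_heap_mb is the equivalent of editing the -Xmx value in the shell script.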

2) Modify pdbdefault.properties in the classes directory.

a. Set QUERY_CONFORMER_UNIT_DIR and TEMPLATE_CONFORMER_UNIT_DIR to the directories in which the serialized conformer-unit objects of the query structures and template structures are saved.

e.g. in Windows
QUERY_CONFORMER_UNIT_DIR=C\:/home/user/conformerUnit

in Linux
QUERY_CONFORMER_UNIT_DIR=/home/user/conformerUnit

b. Set SMAP_INSTALLED_DIR to the directory in which SMAP is installed.

e.g. in Windows
SMAP_INSTALLED_DIR=C\:/work/smap

in Linux
SMAP_INSTALLED_DIR=/work/smap

c. Set LOCAL_PDB_DIR to the directory in which PDB files are stored.

e.g. in Windows
LOCAL_PDB_DIR=C\:/ExternalData/pdb

in Linux
LOCAL_PDB_DIR=/ExternalData/pdb
If this directory does not exist, SMAP will retrieve the file from the PDB online.
 
d. Set the PDB file format for loading structure files.

STRUCT_FILE_FORMAT=PDB
Three file types are supported: PDB format (PDB), PDBML format (XML), and biological unit (BLU).
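
The Windows examples in steps a to c write the drive-letter colon as "\:" and use forward slashes as separators. The following Python sketch shows one way to produce that form from an ordinary path. The function name to_properties_path is an assumption made for illustration and is not part of SMAP.

```python
def to_properties_path(path):
    """Rewrite a filesystem path in the style used in pdbdefault.properties.

    Hypothetical helper: backslashes become forward slashes and the
    drive-letter colon is escaped as '\\:', matching the Windows examples.
    Linux paths contain neither character and are returned unchanged.
    """
    return path.replace("\\", "/").replace(":", "\\:")


# Example: print a line that could be pasted into pdbdefault.properties.
print("LOCAL_PDB_DIR=" + to_properties_path("C:\\ExternalData\\pdb"))
```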
 
3) The ligand binding pocket similarity between two structures can now be computed by running the shell script with the default parameter settings.

in Windows, run
smap_comp.bat template_chain query_chain output

in Linux, run
smap_comp.sh template_chain query_chain output

where template_chain and query_chain are the protein chains to compare, given in the format [PDB ID]_[CHAIN ID]. If no chain ID is specified, all chains in the structure are compared. Multiple chains are specified as [chain id 1]-[chain id 2]-[chain id 3]-.....
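
The chain argument format described above can be sketched in Python as follows. The helper name chain_spec and the PDB ID "1abc" are placeholders made up for illustration. Passing the bare PDB ID when no chain is given is an assumption, since the text does not show that case explicitly.

```python
def chain_spec(pdb_id, chain_ids=()):
    """Build a template_chain / query_chain argument (hypothetical helper)."""
    if not chain_ids:
        # Assumption: with no chain ID, only the PDB ID is passed,
        # so that all chains in the structure are compared.
        return pdb_id
    # One chain gives [PDB ID]_[CHAIN ID]; several chains are joined with '-'.
    return f"{pdb_id}_{'-'.join(chain_ids)}"


print(chain_spec("1abc", ["A"]))
print(chain_spec("1abc", ["A", "B", "C"]))
```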

The result is written to the file specified by "output".

4) Depending on the purpose for running SMAP, a set of parameters can be adjusted. To change the parameters, copy the file smapdefault.properties in the classes directory to a file named smap.properties in the directory where you will run the program, then change the parameters in smap.properties as needed. The SMAP parameters are described in section E) Parameter settings.
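
The copy-and-edit step can be scripted. The following Python sketch is not part of SMAP; the function name override_properties is an assumption, and it handles only simple KEY=VALUE lines like those shown in section E.

```python
def override_properties(default_text, overrides):
    """Return properties text with selected KEY=VALUE lines replaced.

    Hypothetical helper: lines whose key is in `overrides` are rewritten,
    all other lines are kept as they are, and keys that do not occur in
    the defaults are appended at the end.
    """
    remaining = dict(overrides)
    lines = []
    for line in default_text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_setting = "=" in line and not line.lstrip().startswith("#")
        if is_setting and key in remaining:
            lines.append(f"{key}={remaining.pop(key)}")
        else:
            lines.append(line)
    lines.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(lines) + "\n"
```

A typical use would read classes/smapdefault.properties, call override_properties(text, {"SCORE_MATRIX": "BLOSUM45"}), and write the result to smap.properties in the working directory.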

D) Understanding the results

SMAP reports the local structural alignment between the detected ligand-binding sites of the query and template proteins, together with the raw score, p-value, volume coverage of the template and query pockets, Tanimoto coefficient, and RMSD. It also provides the transformation matrices for the query and template proteins that are used to superimpose the two structures.

The raw score is the profile-profile alignment score between the binding pockets of the two proteins. It measures the evolutionary and geometric similarity of the two binding pockets.

The p-value estimates the statistical significance of the raw score against the background probability distribution of binding site alignment scores.

Template coverage and query coverage are the ratios of the overlapped pocket volume to the volume of the template pocket and of the query pocket, respectively.

The Tanimoto coefficient is a similarity measure: the ratio of the overlapped pocket volume to the volume of the union of the two pockets.
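
As a worked illustration of the coverage and Tanimoto definitions, the following Python sketch computes them from three volumes. It is not SMAP code: the function name and the input values are made up, and how SMAP measures pocket volumes is not described here. The union volume is obtained as template + query - overlap.

```python
def pocket_overlap_metrics(template_volume, query_volume, overlap_volume):
    """Coverage and Tanimoto coefficient from pocket volumes (illustrative)."""
    # Volume of the union of the two pockets (inclusion-exclusion).
    union_volume = template_volume + query_volume - overlap_volume
    return {
        "template_coverage": overlap_volume / template_volume,
        "query_coverage": overlap_volume / query_volume,
        "tanimoto": overlap_volume / union_volume,
    }


# Made-up volumes: template 100, query 50, overlap 25.
print(pocket_overlap_metrics(100.0, 50.0, 25.0))
```

With these made-up values the template coverage is 25/100, the query coverage is 25/50, and the Tanimoto coefficient is 25/125.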

RMSD is the root-mean-square deviation between the binding sites of the two proteins.
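
For reference, the generic RMSD formula can be written as the short Python sketch below. It assumes that the two coordinate lists are already superimposed and paired by the alignment. Which atoms SMAP pairs is not stated here, so the function is only an illustration of the formula.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired 3D coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    # Sum of squared distances between corresponding points.
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))
```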

True and false positive matches are best distinguished by the p-value and the Tanimoto coefficient. A lower p-value (< 1.0e-4) and a larger Tanimoto coefficient (> 0.5) indicate a better chance of biologically meaningful similarity.
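
The two thresholds can be applied as a simple filter when screening many results. The Python sketch below is an illustration only: the function name is made up, and requiring both conditions at once is an assumption. The text above gives the thresholds as guidance rather than as a strict rule.

```python
def is_likely_match(p_value, tanimoto, max_p_value=1.0e-4, min_tanimoto=0.5):
    """Return True when both guideline thresholds are met (illustrative)."""
    return p_value < max_p_value and tanimoto > min_tanimoto


# Made-up (p-value, Tanimoto) pairs for illustration.
hits = [(2.0e-6, 0.71), (3.0e-3, 0.80), (5.0e-5, 0.30)]
print([hit for hit in hits if is_likely_match(*hit)])
```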

E) Parameter settings

a. Parameters for segmentation of structure

MIN_PL_ATOM_SPHERE_SIZE=20: This parameter represents the minimum number of virtual atoms involved in one virtual ligand. The default value is 20.
 
MAX_PL_ATOM_SPHERE_SIZE=300: This parameter represents the maximum number of virtual atoms involved in one virtual ligand. The default value is 300.
 
MIN_ATOM_SPHERE_DISTANCE=3.0: This parameter represents the minimum distance between two virtual ligands. If the distance between any pair of atoms, one from ligand i and one from ligand j, is smaller than MIN_ATOM_SPHERE_DISTANCE, the two virtual ligands are considered to overlap and are merged into a single virtual ligand. The default value is 3.0 angstroms.
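
The merge rule for MIN_ATOM_SPHERE_DISTANCE can be sketched in Python as follows. This is not SMAP's implementation: the function name is made up, each virtual ligand is represented simply as a list of (x, y, z) points, and merging is applied transitively with a small union-find, which is an assumption.

```python
import math
from itertools import combinations


def merge_virtual_ligands(ligands, min_distance=3.0):
    """Merge ligands that have any cross-ligand atom pair closer than min_distance."""
    parent = list(range(len(ligands)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(ligands)), 2):
        if any(math.dist(a, b) < min_distance
               for a in ligands[i] for b in ligands[j]):
            parent[find(j)] = find(i)

    # Collect the atoms of each group under its representative.
    merged = {}
    for i, atoms in enumerate(ligands):
        merged.setdefault(find(i), []).extend(atoms)
    return list(merged.values())
```

The same sketch applies to MIN_CA_SPHERE_DISTANCE by passing CA coordinates and min_distance=5.0.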

MAX_ATOM_SPHERE_RADIUS=5.0: This parameter represents the maximum radius of the circumscribed spheres that lie outside the protein boundary but inside the environmental boundary. Any sphere with a radius larger than MAX_ATOM_SPHERE_RADIUS is not considered. The default value is 5.0 angstroms when the protein is represented by all atoms.

MIN_PL_CA_SPHERE_SIZE=5: This parameter represents the minimum number of CA atoms involved in one virtual ligand. The default value is 5.

MIN_CA_SPHERE_DISTANCE=5.0: This parameter represents the minimum distance between two virtual ligands when the protein is represented only by CA atoms. If the distance between any pair of CA atoms, one from ligand i and one from ligand j, is smaller than MIN_CA_SPHERE_DISTANCE, the two virtual ligands are considered to overlap and are merged into a single virtual ligand. The default value is 5.0 angstroms.

MAX_CA_SPHERE_RADIUS=7.5: This parameter represents the maximum radius of the circumscribed spheres that lie outside the protein boundary but inside the environmental boundary. Any sphere with a radius larger than MAX_CA_SPHERE_RADIUS is not considered. The default value is 7.5 angstroms when the protein is represented only by CA atoms.

MAX_NUM_PL=5: This parameter represents the maximum number of virtual ligands in each protein.

b. Parameters for determination of ligand binding sites

LIGAND_CONTACT_DISTANCE_CUTOFF=10.0: If the distance between a protein atom and a ligand atom is less than LIGAND_CONTACT_DISTANCE_CUTOFF (in angstroms), and the two atoms are not obstructed by other atoms, the protein atom and its associated residue are considered part of the ligand binding site. The default value is 5.0.
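
The distance part of this rule can be sketched in Python as below. This is an illustration, not SMAP's code: the function name, the residue labels, and the data layout are made up, and the obstruction test mentioned above is deliberately left out.

```python
import math


def binding_site_residues(protein_atoms, ligand_atoms, cutoff=5.0):
    """Return residue IDs with at least one atom closer than cutoff to a ligand atom.

    protein_atoms: iterable of (residue_id, (x, y, z)) pairs.
    ligand_atoms: iterable of (x, y, z) points.
    Simplification: no check that the two atoms are unobstructed.
    """
    ligand_atoms = list(ligand_atoms)
    site = []
    for residue_id, coord in protein_atoms:
        if residue_id in site:
            continue  # residue already selected through another atom
        if any(math.dist(coord, lig) < cutoff for lig in ligand_atoms):
            site.append(residue_id)
    return site
```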

c. Parameters for comparison of two pockets

LOCAL_SCORE=true: If this parameter is set to true, SMAP compares the local structural similarity of the query and template structures.

MATCH_SECONDARY_STRUCTURE=true: If this parameter is set to true, secondary structure is matched first during alignment.

TEMPLATE_LIGAND_SITE_ONLY=true: If this parameter is set to true, for a template with multiple binding pockets, only the pockets with a ligand present in the structure are compared.

TEMPLATE_LIGAND_ID=B65,NE6: If template ligand IDs are specified, only the pockets with the specified ligands are compared.

QUERY_LIGAND_SITE_ONLY=true: If this parameter is set to true, for a query with multiple binding pockets, only the pockets with a ligand present in the structure are compared.

QUERY_LIGAND_ID=HEM,ATP: If query ligand IDs are specified, only the pockets with the specified ligands are compared.

ASSOCIATE_GRAPH_NODE_FILTER=0.5: This parameter controls how many nodes in the associated graph are removed. When the associated graph is built, each node is given a score according to the similarity of the residue pair in that node. To save time in the subsequent alignment, some of the nodes are removed according to their scores. A value of 0 keeps all nodes, which makes the alignment slower but more accurate. A value of 0.5 removes almost half of the nodes. A value of 1.0 would remove all nodes, which cannot happen during calculation.
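
One plausible reading of this parameter is that the lowest-scoring fraction of nodes is dropped. The Python sketch below illustrates that reading only. The function name and the node labels are made up, and SMAP's actual filtering rule may differ.

```python
def filter_nodes(scored_nodes, filter_fraction=0.5):
    """Drop the lowest-scoring fraction of (node, score) pairs (illustrative)."""
    if not 0.0 <= filter_fraction < 1.0:
        raise ValueError("filter_fraction must be in [0, 1)")
    # Rounding down means that at most the requested fraction is removed.
    n_remove = int(len(scored_nodes) * filter_fraction)
    ranked = sorted(scored_nodes, key=lambda item: item[1], reverse=True)
    return ranked[:len(ranked) - n_remove]
```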

TIMES_RANDOM_SHUFFLE=0: This parameter sets the number of random shuffles performed to determine the statistical significance of a similarity score between two structures. If it is set to 0, a background distribution is used to estimate the significance of the score.
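
The idea of estimating significance by random shuffling can be illustrated with a generic permutation test. The Python sketch below is not SMAP's procedure. What SMAP shuffles and how it rescores are not described here, so score_fn and items are placeholders supplied by the caller, and the add-one correction is a common convention rather than something taken from SMAP.

```python
import random


def empirical_p_value(observed_score, score_fn, items, n_shuffles, seed=0):
    """Fraction of shuffled scores at least as high as the observed score."""
    if n_shuffles <= 0:
        raise ValueError("n_shuffles must be positive")
    rng = random.Random(seed)  # fixed seed for reproducibility
    shuffled = list(items)
    at_least = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if score_fn(shuffled) >= observed_score:
            at_least += 1
    # Add-one correction avoids reporting a p-value of exactly zero.
    return (at_least + 1) / (n_shuffles + 1)
```

More shuffles give a finer p-value resolution at the cost of proportionally more scoring calls.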

SCORE_MATRIX=McLACHLAN: This parameter selects the scoring matrix used during alignment. Available matrices include McLACHLAN, AAGroup, BLOSUM45, and MIYATA. The default value is McLACHLAN.

d. Parameters for output of superimposed structures

PRINT_PDB=false: If this parameter is set to false, SMAP will not write the coordinate file of the query structure, superposed on the template structure, in PDB format.

SUPER_PDB_OUTPUT_DIR=/home/user/smap_v2_0/pdb: This parameter sets the directory to which the superposed structure is written.

PRINT_TEMPLATE_LIGAND=true: If this parameter is set to true, the coordinates of the ligand located in the aligned pocket of the template structure are output.

PRINT_QUERY_LIGAND=true: If this parameter is set to true, the coordinates of the ligand located in the aligned pocket of the query structure are output.

PVALUE_CUTOFF=0.001: The superimposed query structure is written only if the SMAP p-value is less than the specified cutoff.