bmDCA

Here is the 'working version' code for bmDCA, as described in

M. Figliuzzi, P.Barrat-Charlaix, and M.Weigt, How pairwise coevolutionary models capture the collective residue variability in proteins. Molecular Biology and Evolution, Under Review

The main routine bmDCA_v2.1.sh contains different scripts (c/c++/awk) and can be executed in command line in macOS/Linux based operating systems.

Steps to use the code

1/ Compile the code by executing the shell script:

$./bmDCA_compile_v2.1.sh

2/ Pre-process the MSA (fasta format):

$./bmDCA_preprocessing.sh [-rw] input_alignment.fasta

If option -r is used, the input alignment in fasta format will be converted to a numerical format used by the learning procedure. If option -w is used, reweighting coefficients will be computed for each sequence in the alignment. Note that this step may take a long time, as it is quadratic in the number of sequences of the alignment. Processed data are then found in the "Processed" folder.

3/ Run bmDCA to learn the model parameters:

$./bmDCA_v2.1.sh Processed/msa_numerical.txt Processed/weights.txt OutputFolder

The three inputs of bmDCA.sh are:

Processed/msa_numerical.txt: this is the target MSA in a numeric format (see the EXAMPLE folder, the file is generated in the preprocessing step). The first line contains three integers specifying the number of sequences, the sequence length and the alphabet size;
Processed/weights.txt: this is the file containing a single column with the statistical weights of the MSA sequences, it is generated in the pre-processing step;
OutputFolder: this is the folder where all outputs will be saved.

Inside the script bmDCA_v2.1.sh there are hyperparameters that can be set, modifying the learning, such as values of regularization or number of iterations. Inferred parameters are present in the OutputFolder, in files parameters_learnt_%d.txt No stopping procedure has been implemented to stop the learning. The default number of iterations of the Boltzmann machine is 2000.

The mapping from amino acids to integers is defined in the following way. Amino acids are ordered as in the following string "-ACDEFGHIKLMNPQRSTVWY". They are then mapped to the integer corresponding to their position in the string, minus one. The gap symbol is mapped to 0, A is mapped to 1, etc ... The output directory contains learned parameters saved every 3 iterations (default) in files called parameters_learnt_[it].txt. Indices of sites in the sequence go from 0 to L-1 in the output format. The file error.txt contained in the output directory contains information about the fitting quality at different iterations of the boltzmann machine, and can be used to decide when to stop calculations.

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
EXAMPLE		EXAMPLE
Processed_EXAMPLE		Processed_EXAMPLE
msa		msa
sources		sources
LICENSE		LICENSE
README.md		README.md
bmDCA_compile_v2.1.sh		bmDCA_compile_v2.1.sh
bmDCA_preprocessing.sh		bmDCA_preprocessing.sh
bmDCA_v2.1.sh		bmDCA_v2.1.sh
example.txt		example.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EXAMPLE

EXAMPLE

Processed_EXAMPLE

Processed_EXAMPLE

msa

msa

sources

sources

LICENSE

LICENSE

README.md

README.md

bmDCA_compile_v2.1.sh

bmDCA_compile_v2.1.sh

bmDCA_preprocessing.sh

bmDCA_preprocessing.sh

bmDCA_v2.1.sh

bmDCA_v2.1.sh

example.txt

example.txt

Repository files navigation

bmDCA

About

Releases

Packages

Languages

License

matteofigliuzzi/bmDCA

Folders and files

Latest commit

History

Repository files navigation

bmDCA

About

Resources

License

Stars

Watchers

Forks

Languages