The amino acid composition method (1) calculates the frequency of each natural amino acid in the sequence.
$$AAC(t) = \frac{N_t}{N}, \quad t \in \mathcal{A}$$
Where $\mathcal{A}$ is the group of the 20 natural amino acids, $N_t$ is the number of times amino acid $t$ appears in the sequence, and $N$ is the sequence length.
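As an illustration of the amino acid composition described above, the following minimal Python sketch counts each residue and divides by the sequence length. The function name `aac` and the alphabetical one-letter ordering are assumptions made for this example, and the sequence is assumed to be non-empty.

```python
from collections import Counter

# Assumed feature order: the 20 natural amino acids, alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Return the frequency of each natural amino acid in a non-empty sequence."""
    counts = Counter(sequence)
    n = len(sequence)
    return [counts[aa] / n for aa in AMINO_ACIDS]


features = aac("ACCA")  # A and C each make up half of this toy sequence
```

The result always has 20 entries, one per amino acid, regardless of the sequence length.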
The dipeptide composition method (1) calculates the frequency of each consecutive amino acid pair in the sequence.
$$DPC(r,s) = \frac{N_{rs}}{N-1}, \quad r,s \in \mathcal{A}$$
Where $\mathcal{A}$ is the group of the 20 natural amino acids, $N_{rs}$ is the number of times amino acid pair $rs$ appears in the sequence, and $N$ is the sequence length.
The tripeptide composition method (1) calculates the frequency of each consecutive amino acid triplet in the sequence.
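$$TPC(r,s,t) = \frac{N_{rst}}{N-2}, \quad r,s,t \in \mathcal{A}$$
Where $\mathcal{A}$ is the group of the 20 natural amino acids, $N_{rst}$ is the number of times amino acid triplet $rst$ appears in the sequence, and $N$ is the sequence length.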
The composition of k-spaced amino acid pairs method (2) calculates the frequency of amino acid pairs separated by $k$ characters.
$$CKSAAP_k(r,s) = \frac{N^{(k)}_{rs}}{N-k-1}, \quad r,s \in \mathcal{A}, \quad k = 0, 1, \ldots, k_{max}$$
Where $\mathcal{A}$ is the group of the 20 natural amino acids, $N^{(k)}_{rs}$ is the number of times amino acid pair $rs$, separated by $k$ characters, appears in the sequence, $N$ is the sequence length and $k_{max}$ is the maximum value of $k$. Two consecutive characters are separated by $k = 0$. If $k_{max} = 5$, then each possible pair would be calculated for $k = 0, 1, 2, 3, 4$ and 5.
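The k-spaced pair counting described above can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name `cksaap`, the dictionary layout keyed by `(pair, k)`, and the normalization by the number of available positions (sequence length minus k minus 1) are assumptions of this example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def cksaap(sequence, k_max=5):
    """Frequencies of residue pairs separated by k = 0..k_max residues."""
    features = {}
    n = len(sequence)
    for k in range(k_max + 1):
        counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
        n_pairs = n - k - 1  # positions where a pair with k residues in between fits
        for i in range(max(n_pairs, 0)):
            pair = sequence[i] + sequence[i + k + 1]
            if pair in counts:  # skip pairs containing non-standard residues
                counts[pair] += 1
        for pair, count in counts.items():
            # Assumed normalization: divide by the number of available positions.
            features[(pair, k)] = count / n_pairs if n_pairs > 0 else 0.0
    return features


feats = cksaap("ACAC", k_max=1)
```

With `k_max=0` the same function reduces to counting consecutive pairs only, which corresponds to the dipeptide case.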
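The dipeptide deviation from expected mean method (3) calculates the dipeptide composition ($D_c$), theoretical mean ($T_m$) and theoretical variance ($T_v$) and applies the following formulas:
$$D_c(r,s) = \frac{N_{rs}}{N-1}, \quad T_m(r,s) = \frac{N_r}{N} \cdot \frac{N_s}{N}, \quad T_v(r,s) = \frac{T_m(r,s)\,\bigl(1 - T_m(r,s)\bigr)}{N-1}$$
$$DDE(r,s) = \frac{D_c(r,s) - T_m(r,s)}{\sqrt{T_v(r,s)}}, \quad r,s \in \mathcal{A}$$
Where $\mathcal{A}$ is the group of the 20 natural amino acids, $N_{rs}$ is the number of times the amino acid pair $rs$ appears in the sequence, $N_r$ and $N_s$ are the number of times the amino acid $r$ or $s$ appears in the sequence, and $N$ is the sequence length.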
The amino acid pair antigenicity scale method (4) counts, for each amino acid pair, the number of times the pair appears consecutively in the sequence and multiplies that count by the pair's normalized antigenicity scale value, which can be understood as the chance that the pair is associated with an epitope.
$$AAPAS(r,s) = N_{rs} \cdot R_{rs}, \quad r,s \in \mathcal{A}$$
Where $\mathcal{A}$ is the group of the 20 natural amino acids, $N_{rs}$ is the number of times the amino acid pair $rs$ appears in the sequence and $R_{rs}$ is the normalized amino acid pair antigenicity scale for the pair $rs$.
The values of $R_{rs}$ are in Supplementary Material 2. They were calculated as follows:
Where $f^{+}_{rs}$ and $f^{-}_{rs}$ are the frequencies of the amino acid pair $rs$ in the epitopes (obtained from the Bcipep database) (5) and non-epitopes (obtained from the Swiss-Prot database) (6), respectively.
The composition moment vector method (7) incorporates the position of each occurrence of each amino acid in the sequence into its calculation.
Where $\mathcal{A}$ is the group of the 20 natural amino acids, $s_j$ is the $j$-th residue in the sequence and $N$ is the sequence length.
The enhanced amino acid composition method (8) calculates the frequency of each natural amino acid in a sliding window across the whole sequence.
$$EAAC(t, w) = \frac{N_{t,w}}{L}, \quad t \in \mathcal{A}$$
Where $\mathcal{A}$ is the group of the 20 natural amino acids, $N_{t,w}$ is the number of times amino acid $t$ appears in the sliding window $w$ and $L$ is the size of the sliding window. For example, the first sliding window would go from the first amino acid $s_1$ to the amino acid $s_L$, while the second sliding window would go from the second amino acid $s_2$ to the amino acid $s_{L+1}$.
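The sliding-window idea above can be sketched in a few lines of Python. The function name `eaac` and the default window size of 5 are assumptions chosen for this example; the window advances one residue at a time, as in the description.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def eaac(sequence, window=5):
    """Amino acid frequencies inside each sliding window of the given size."""
    features = []
    # The first window covers residues 1..window, the second 2..window+1, and so on.
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        features.append([segment.count(aa) / window for aa in AMINO_ACIDS])
    return features


windows = eaac("AACCA", window=2)  # windows: AA, AC, CC, CA
```

A sequence of length N yields N - window + 1 windows, so the final vector length depends on the sequence length.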
The grouped amino acid composition method (8) finds the proportion of each of the five groups of amino acids in the sequence. These five groups are based on physicochemical properties: aliphatic (AGILMV), aromatic (FWY), positive (HKR), negative (DE) and uncharged (CNPQST) (9).
$$GAAC(g) = \frac{N_g}{N}, \quad g \in G$$
Where $G$ are the 5 groups based on the amino acids' physicochemical properties, $N_g$ is the number of times an amino acid belonging to the group $g$ appears and $N$ is the sequence length.
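The grouped composition can be sketched as below, using the five groups listed in the text. The function name `gaac` and the group labels used as dictionary keys are assumptions of this example.

```python
# Group membership as listed in the text; the key names are illustrative labels.
GROUPS = {
    "aliphatic": "AGILMV",
    "aromatic": "FWY",
    "positive": "HKR",
    "negative": "DE",
    "uncharged": "CNPQST",
}


def gaac(sequence):
    """Proportion of residues of a non-empty sequence that fall in each group."""
    n = len(sequence)
    return {
        name: sum(sequence.count(aa) for aa in members) / n
        for name, members in GROUPS.items()
    }


result = gaac("AGFK")  # two aliphatic, one aromatic, one positive residue
```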
The enhanced grouped amino acid composition method (8) finds the proportion of each of the five groups of amino acids in a sliding window across the peptide sequence. These five groups are based on physicochemical properties: aliphatic (AGILMV), aromatic (FWY), positive (HKR), negative (DE) and uncharged (CNPQST) (9).
$$EGAAC(g, w) = \frac{N_{g,w}}{L}, \quad g \in G$$
Where $G$ are the 5 groups based on the amino acids' physicochemical properties, $N_{g,w}$ is the number of times an amino acid belonging to the group $g$ appears in the sliding window $w$ and $L$ is the window size.
The composition of k-spaced amino acid group pairs method (8) calculates the frequency of amino acid pairs, grouped by their physicochemical properties as in GAAC, separated by $k$ characters.
$$CKSAAGP_k(g_1,g_2) = \frac{N^{(k)}_{g_1 g_2}}{N-k-1}, \quad g_1,g_2 \in G, \quad k = 0, 1, \ldots, k_{max}$$
Where $G$ are the 5 groups based on the amino acids' physicochemical properties, $N^{(k)}_{g_1 g_2}$ is the number of times amino acids from the groups $g_1$ and $g_2$, separated by $k$ characters, are paired in the sequence, $N$ is the sequence length and $k_{max}$ is the maximum value of $k$. Two consecutive characters are separated by $k = 0$. If $k_{max} = 5$, then each possible pair would be calculated for $k = 0, 1, 2, 3, 4$ and 5.
The grouped dipeptide composition method (8) calculates the frequency of each consecutive amino acid group pair in the sequence.
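$$GDPC(g_1,g_2) = \frac{N_{g_1 g_2}}{N-1}, \quad g_1,g_2 \in G$$
Where $G$ are the 5 groups based on the amino acids' physicochemical properties, $N_{g_1 g_2}$ is the number of times amino acids from the groups $g_1$ and $g_2$ appear consecutively in the sequence, and $N$ is the sequence length.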
The grouped tripeptide composition method (8) calculates the frequency of each consecutive amino acid group triplet in the sequence.
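$$GTPC(g_1,g_2,g_3) = \frac{N_{g_1 g_2 g_3}}{N-2}, \quad g_1,g_2,g_3 \in G$$
Where $G$ are the 5 groups based on the amino acids' physicochemical properties, $N_{g_1 g_2 g_3}$ is the number of times amino acids from the groups $g_1$, $g_2$ and $g_3$ appear consecutively in the sequence, and $N$ is the sequence length.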
For the encoding based on grouped weight method (10), the amino acids are split in 4 groups, based on their hydrophobicity and charge:
And then each amino acid would have an associated value for each group in the following way:
So, if an amino acid belongs to, for example, group $C_2$, then it would pre-encode as 1 for the first characteristic sequence $H_1$, and 0 for the other ones. If it belongs to group $C_1$, it would pre-encode as 1 for every characteristic sequence. This results in three binary sequences $H_i$, one per grouping, each of length $N$, where $N$ is the sequence length and $i$ is a number between 1 and 3.
The full table associating each amino acid with its group values can be found in Supplementary Material 2.
The normalized weight of a characteristic sequence $H_i$ is the frequency of 1 appearing in it.
Given a number $J$, the characteristic sequence $H_i$ can be split into $J$ subsequences. This way, $X_j$ represents a subsequence, where $j = 1, 2, \ldots, J$, and $\lfloor \cdot \rfloor$ is the largest integer not exceeding the result of the division inside.
Joining all $X_j$ values would yield the whole characteristic sequence. Hence, $w(X_j)$ is the normalized weight of the subsequence $X_j$. This results in the following weight characteristic sequence:
Finally, all three vectors (one per $H_i$) are concatenated.
Both quasi-sequence-order and sequence-order-coupling number encodings (11) use the Grantham (12) and the Schneider-Wrede (13) distance matrices.
The $d$-th rank sequence-order-coupling number is a sum of squares of the distance (according to the distance matrices) between two amino acids that are separated by $d$ characters in the sequence.
$$\tau_d = \sum_{i=1}^{N-d} \left(d_{i,i+d}\right)^2, \quad d = 1, 2, \ldots, d_{max}$$
Where $d_{i,i+d}$ is the value in a distance matrix between two amino acids at positions $i$ and $i+d$, $d_{max}$ is the maximum value of the lag value $d$, and $N$ is the sequence length.
This encoding is the concatenation of all $\tau_d$ per distance matrix.
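The sum of squared distances described above can be sketched as follows. The function name `socn`, the nested-dictionary layout of the distance table, and the toy two-letter table are assumptions for this example; the toy values are not taken from the Grantham or Schneider-Wrede matrices. Some implementations additionally divide each sum by the number of terms, which this sketch does not do.

```python
def socn(sequence, distance, max_lag):
    """Sum of squared pairwise distances for each lag d = 1..max_lag.

    `distance[a][b]` is the distance between residues a and b.
    Assumes max_lag < len(sequence).
    """
    n = len(sequence)
    return [
        sum(distance[sequence[i]][sequence[i + d]] ** 2 for i in range(n - d))
        for d in range(1, max_lag + 1)
    ]


# Toy symmetric distance table for a two-letter alphabet (illustrative values only).
toy = {"A": {"A": 0.0, "C": 2.0}, "C": {"A": 2.0, "C": 0.0}}
taus = socn("ACA", toy, max_lag=2)
```

Running the function once per distance matrix and concatenating the results gives one block of values per matrix.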
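First, the quasi-sequence-order numbers for the amino acids must be calculated as follows:
$$X_r = \frac{f_r}{\sum_{t \in \mathcal{A}} f_t + w \sum_{d=1}^{d_{max}} \tau_d}, \quad r \in \mathcal{A}$$
Where $\mathcal{A}$ is the group of the 20 natural amino acids, $f_r$ is the frequency of each amino acid in the sequence (just as in AAC encoding), $\tau_d$ is the $d$-th rank sequence-order-coupling number, and $w$ is a weight factor.
Then, the quasi-sequence-order numbers for the lag values must be calculated as follows:
$$X_d = \frac{w\,\tau_d}{\sum_{t \in \mathcal{A}} f_t + w \sum_{d'=1}^{d_{max}} \tau_{d'}}, \quad d = 1, 2, \ldots, d_{max}$$
Where $\mathcal{A}$ is the group of the 20 natural amino acids, $f_t$ is the frequency of each amino acid in the sequence (just as in AAC encoding), $d$ is the lag value, and $w$ is a weight factor.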
The autocorrelation descriptors use the amino acid properties from the AAindex Database (14), found in the data/AAidx.txt file. The default indices used (CIDH920105, BHAR880101, CHAM820101, CHAM820102, CHOC760101, BIGC670101, CHAM810101, DAYM780201) were taken from the work by Xiao et al. (15). All the values in the indices are centralized and standardized for the autocorrelation encodings as follows:
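$$P'_r = \frac{P_r - \bar{P}}{\sigma}, \quad r \in \mathcal{A}$$
Where $\mathcal{A}$ is the group of the 20 natural amino acids, $P_r$ is the value of the property for the amino acid $r$, and $\bar{P}$ and $\sigma$ are the average and standard deviation of all the 20 amino acids in the index, respectively.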
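The centralization and standardization step can be sketched as below. The function name `standardize` is hypothetical, the three-entry index is toy data rather than an AAindex entry, and the use of the population standard deviation (dividing by the number of amino acids, not by one less) is an assumption of this example.

```python
import math


def standardize(index):
    """Centre and scale a property index given as {amino_acid: value}."""
    values = list(index.values())
    mean = sum(values) / len(values)
    # Assumption: population standard deviation (divide by the number of entries).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return {aa: (v - mean) / std for aa, v in index.items()}


toy_index = {"A": 1.0, "C": 2.0, "D": 3.0}  # toy values, not a real AAindex entry
z = standardize(toy_index)
```

After this step the values have mean 0 and unit variance across the amino acids in the index.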
The Geary autocorrelation (16) is calculated as:
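$$C(d) = \frac{\frac{1}{2(N-d)} \sum_{i=1}^{N-d} \left(P_i - P_{i+d}\right)^2}{\frac{1}{N-1} \sum_{i=1}^{N} \left(P_i - \bar{P}\right)^2}, \quad d = 1, 2, \ldots, d_{max}$$
Where $d$ is the lag value, $d_{max}$ is the maximum lag value, $P_i$ and $P_{i+d}$ are the centralized and standardized values for the amino acids at positions $i$ and $i+d$, $\bar{P}$ is the average property value across all amino acids in the sequence, and $N$ is the sequence length.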
The Moran autocorrelation (17) is calculated as:
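$$I(d) = \frac{\frac{1}{N-d} \sum_{i=1}^{N-d} \left(P_i - \bar{P}\right)\left(P_{i+d} - \bar{P}\right)}{\frac{1}{N} \sum_{i=1}^{N} \left(P_i - \bar{P}\right)^2}, \quad d = 1, 2, \ldots, d_{max}$$
Where $d$ is the lag value, $d_{max}$ is the maximum lag value, $P_i$ and $P_{i+d}$ are the centralized and standardized values for the amino acids at positions $i$ and $i+d$, $\bar{P}$ is the average property value across all amino acids in the sequence, and $N$ is the sequence length.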
The Normalized Moreau-Broto autocorrelation (18) is calculated as:
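$$ATS(d) = \frac{1}{N-d} \sum_{i=1}^{N-d} P_i\,P_{i+d}, \quad d = 1, 2, \ldots, d_{max}$$
Where $d$ is the lag value, $d_{max}$ is the maximum lag value, $P_i$ and $P_{i+d}$ are the centralized and standardized values for the amino acids at positions $i$ and $i+d$, and $N$ is the sequence length.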
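A normalized Moreau-Broto style autocorrelation over a list of per-residue property values can be sketched as below. The function name `moreau_broto` is hypothetical, the input is assumed to already hold the centralized and standardized property value of each residue, and the normalization by the number of pairs at each lag is an assumption of this example.

```python
def moreau_broto(values, max_lag):
    """Average product of property values at positions i and i + d, for d = 1..max_lag.

    `values` holds one (already standardized) property value per residue.
    Assumes max_lag < len(values).
    """
    n = len(values)
    return [
        sum(values[i] * values[i + d] for i in range(n - d)) / (n - d)
        for d in range(1, max_lag + 1)
    ]


ac = moreau_broto([1.0, -1.0, 1.0, -1.0], max_lag=2)
```

In this toy input the values alternate in sign, so neighbours at lag 1 are anti-correlated and residues at lag 2 are correlated.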
The Composition/Transition/Distribution encodings (19), (20) are based on a categorical division of the 20 natural amino acids according to their structural and physicochemical properties. 13 properties were chosen on iFeature (8), and 1 (surface tension) was added (21), as listed in Supplementary Material 2.
The composition descriptor calculates the frequency of each division per property.
$$C(r) = \frac{N_r}{N}$$
Where $N_r$ is the number of amino acids in the division $r$ found in the sequence, and $N$ is the sequence length.
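The transition descriptor calculates the frequency with which an amino acid from one division is immediately followed by an amino acid from a different division.
$$T(r,s) = \frac{N_{rs} + N_{sr}}{N-1}$$
Where $N_{rs}$ and $N_{sr}$ are the numbers of consecutive amino acids from divisions $r$ and $s$ in both orders ($rs$ and $sr$), and $N$ is the sequence length.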
The distribution descriptor calculates where the first, 25%, 50%, 75% and 100% of the amino acids in a division occur in a sequence. First, all the amino acids in the sequence that belong to a certain division are marked. The position of the first occurrence is then divided by $N$ (the sequence length). Next, the position at which the first 25% (rounded down) of the amino acids in that division has been reached is divided by $N$. The same is done for the other percentages (Figure 1).
[Figure 1]
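The distribution calculation described above can be sketched as follows. The function name `ctd_distribution` is hypothetical, the division is passed in as a set of one-letter codes, and two choices are assumptions of this example: a rounded-down rank of 0 is raised to 1, and a division that is absent from the sequence yields five zeros.

```python
import math


def ctd_distribution(sequence, division):
    """Relative positions of the first, 25%, 50%, 75% and 100% members of a division."""
    n = len(sequence)
    # 1-based positions of the residues that belong to the division.
    positions = [i + 1 for i, aa in enumerate(sequence) if aa in division]
    if not positions:
        return [0.0] * 5  # assumption: absent divisions contribute zeros
    total = len(positions)
    result = [positions[0] / n]
    for fraction in (0.25, 0.5, 0.75, 1.0):
        rank = max(1, math.floor(fraction * total))  # rounded down, at least 1
        result.append(positions[rank - 1] / n)
    return result


dist = ctd_distribution("AKAAKAKK", set("K"))  # K occurs at positions 2, 5, 7 and 8
```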
For the conjoint triad encodings, the amino acids were classified in 7 classes based on the dipoles and volumes of the side chains (22): $\{A, G, V\}$, $\{I, L, F, P\}$, $\{Y, M, T, S\}$, $\{H, N, Q, W\}$, $\{R, K\}$, $\{D, E\}$, and $\{C\}$.
The conjoint triad method (22) is calculated as:
Where $N_{g_1 g_2 g_3}$ is the number of times three consecutive amino acids belonging to groups $g_1$, $g_2$ and $g_3$ are seen in the sequence.
The k-Spaced conjoint triad method (8) is based on the conjoint triad method, but instead of only evaluating consecutive amino acids, it evaluates triads whose members are separated by 0 to $k$ characters. The original CT method is the same as KSCT with $k = 0$.
Where $N^{(k')}_{g_1 g_2 g_3}$ is the number of times three amino acids belonging to groups $g_1$, $g_2$ and $g_3$, each separated from the next by $k'$ characters, are seen in the sequence. This should be evaluated for $k' = 0, 1, \ldots, k$, so it is calculated $k + 1$ times. For example, for $k' = 0$, each triad is formed by the amino acids at positions $i$, $i+1$ and $i+2$ for $i = 1, \ldots, N-2$. For $k' = 1$, each triad is formed by the amino acids at positions $i$, $i+2$, $i+4$ for $i = 1, \ldots, N-4$. For $k' = 2$, each triad is formed by the amino acids at positions $i$, $i+3$, $i+6$ for $i = 1, \ldots, N-6$.
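The positions used for each spacing can be made concrete with a short sketch that returns raw triad counts. Several things here are assumptions of this example: the function name `ksct_counts`, the seven-class grouping (the one commonly used for conjoint triads: AGV, ILFP, YMTS, HNQW, RK, DE, C), the digit labels for classes, and the absence of any normalization. The input is assumed to contain only the 20 natural amino acids.

```python
from itertools import product

# Assumed seven-class grouping; class labels "1".."7" are illustrative.
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: str(idx + 1) for idx, members in enumerate(CLASSES) for aa in members}


def ksct_counts(sequence, k):
    """Raw counts of class triads for every spacing 0..k between triad members."""
    features = {}
    for gap in range(k + 1):
        counts = {"".join(t): 0 for t in product("1234567", repeat=3)}
        step = gap + 1  # distance between consecutive triad members
        for i in range(len(sequence) - 2 * step):
            triad = "".join(CLASS_OF[sequence[j]] for j in (i, i + step, i + 2 * step))
            counts[triad] += 1
        for triad, count in counts.items():
            features[(triad, gap)] = count
    return features


counts = ksct_counts("AKDCA", k=1)
```

Each spacing contributes 343 values (7 x 7 x 7 class triads), so the vector length grows linearly with k.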
The pseudo-amino acid composition encodings use the hydrophobicity values proposed by Tanford (23), the hydrophilicity values proposed by Hopp and Woods (24) and the standard side chain mass values. Their initial values are represented by $H_1^0(i)$, $H_2^0(i)$ and $M^0(i)$, where $i$ is each of the 20 natural amino acids. These values are centralized and standardized as follows:
$$H(i) = \frac{H^0(i) - \frac{1}{20}\sum_{j \in \mathcal{A}} H^0(j)}{\sqrt{\frac{1}{20}\sum_{j \in \mathcal{A}} \left(H^0(j) - \frac{1}{20}\sum_{k \in \mathcal{A}} H^0(k)\right)^2}}$$
Where $\mathcal{A}$ is the group of the 20 natural amino acids, and $H(i)$ represents the centralized and standardized value of any of the three properties ($H_1^0$, $H_2^0$, $M^0$) of the amino acid $i$, so in the end we would have $H_1(i)$, $H_2(i)$ and $M(i)$.
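For the pseudo-amino acid composition method (11), a correlation function is calculated as:
$$\Theta(R_i, R_j) = \frac{1}{3}\left\{\left[H_1(R_i) - H_1(R_j)\right]^2 + \left[H_2(R_i) - H_2(R_j)\right]^2 + \left[M(R_i) - M(R_j)\right]^2\right\}, \quad R_i, R_j \in \mathcal{A}$$
Where $\mathcal{A}$ is the group of the 20 natural amino acids. Then, the sequence-order-correlated factors are computed as follows:
$$\theta_\lambda = \frac{1}{N-\lambda}\sum_{i=1}^{N-\lambda} \Theta(R_i, R_{i+\lambda}), \quad \lambda = 1, 2, \ldots, \lambda_{max}$$
Where $\lambda_{max}$ is the maximum lag value and $N$ is the sequence length. Now, the first 20 features (one per amino acid in $\mathcal{A}$) are computed.
$$X_t = \frac{N_t}{\sum_{r \in \mathcal{A}} N_r + w \sum_{j=1}^{\lambda_{max}} \theta_j}, \quad t \in \mathcal{A}$$
Where $N_t$ is the number of times the amino acid $t$ appears in the sequence, and $w$ is a weighting factor set by default as 0.05, as suggested by Chou et al. (11).
Finally, the last set of features is added to the vector.
$$X_{20+\lambda} = \frac{w\,\theta_\lambda}{\sum_{r \in \mathcal{A}} N_r + w \sum_{j=1}^{\lambda_{max}} \theta_j}, \quad \lambda = 1, 2, \ldots, \lambda_{max}$$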
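The amphiphilic pseudo-amino acid composition method (25) only uses the hydrophilicity ($H_2$) and hydrophobicity ($H_1$) values. These values are used to define their correlation functions as:
$$H^1_{i,j} = H_1(R_i)\,H_1(R_j), \quad H^2_{i,j} = H_2(R_i)\,H_2(R_j), \quad R_i, R_j \in \mathcal{A}$$
Where $\mathcal{A}$ is the group of the 20 natural amino acids. Now, the sequence-order factors can be found with the following formula:
$$\tau_{2\lambda-1} = \frac{1}{N-\lambda}\sum_{i=1}^{N-\lambda} H^1_{i,i+\lambda}, \quad \tau_{2\lambda} = \frac{1}{N-\lambda}\sum_{i=1}^{N-\lambda} H^2_{i,i+\lambda}, \quad \lambda = 1, 2, \ldots, \lambda_{max}$$
Where $\lambda_{max}$ is the maximum lag value and $N$ is the sequence length. Now, the first 20 features (one per amino acid in $\mathcal{A}$) are computed.
$$X_t = \frac{N_t}{\sum_{r \in \mathcal{A}} N_r + w \sum_{j=1}^{2\lambda_{max}} \tau_j}, \quad t \in \mathcal{A}$$
Where $N_t$ is the number of times the amino acid $t$ appears in the sequence, and $w$ is a weighting factor set by default as 0.05, as suggested by Chou et al. (11).
Finally, the last set of features is added to the vector.
$$X_{20+j} = \frac{w\,\tau_j}{\sum_{r \in \mathcal{A}} N_r + w \sum_{j'=1}^{2\lambda_{max}} \tau_{j'}}, \quad j = 1, 2, \ldots, 2\lambda_{max}$$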
The binary encoding (26) represents each amino acid in the sequence as a binary string of 20 numbers. For example, amino acid A is "10000000000000000000", C is "01000000000000000000", etc., following the order of "ACDEFGHIKLMNPQRSTVWY".
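The binary encoding above maps directly to code. In this minimal sketch the function name `binary_encode` is an assumption, and residues outside the 20-letter order are assumed to encode as 20 zeros.

```python
ORDER = "ACDEFGHIKLMNPQRSTVWY"  # position of the 1 follows this order


def binary_encode(sequence):
    """Concatenate one 20-digit one-hot vector per residue."""
    vector = []
    for aa in sequence:
        # Assumption: a residue not in ORDER yields 20 zeros.
        vector.extend(1 if aa == ref else 0 for ref in ORDER)
    return vector


encoded = binary_encode("AC")
```

The output length is 20 times the sequence length, so sequences must share a length for their vectors to be comparable.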
Taylor's Venn diagram method (27) is based on 10 physicochemical groups (hydrophobic, positive, negative, polar, charged, small, tiny, aliphatic, aromatic, proline) to which the 20 natural amino acids may belong. Each amino acid is encoded as a binary vector of length 10 (one position per property), with a 1 in every position whose group contains the amino acid and a 0 elsewhere. For example, an amino acid that belongs to the hydrophobic group gets a 1 in the hydrophobic position; otherwise it gets a 0.
$$TVD_p(a) = \begin{cases} 1 & \text{if } a \text{ belongs to the group with property } p \\ 0 & \text{otherwise} \end{cases}, \quad a \in \mathcal{A}$$
Where $p$ is a property and $\mathcal{A}$ is the set of the 20 natural amino acids. The full table with the binary values is found in Supplementary Material 2.
The pseudo k-tuple reduced amino acid composition (28) represents proteins as vectors that contain information based on $K$-tuples of reduced amino acid cluster (RAAC) components. These components can depend on a $g$-gap or a $\lambda$-correlation, a type of reduced amino acid alphabet and a number of clusters (or mode).
These types and modes, as well as the groups, are found in Supplementary Material 2.
For the $g$-gap type of calculation, it represents the sequence-order information of subsequences of length $K$ separated by $g$ residues. Thus, it counts the number of times a combination of groups in the selected RAAC appears (Figure 2).
[Figure 2]
For the $\lambda$-correlation type of calculation, it represents the sequence-order information of groups of amino acids separated by $\lambda$ residues between amino acids. Thus, it counts the number of times a combination of groups in the selected RAAC appears (Figure 3).
[Figure 3]
These encodings use the generated .ss2 files from PSIPRED (29) or the .spxOut files from SPINE-X (30). There must be one file per input sequence.
The secondary structure elements binary method (8) represents each amino acid as a vector of 3 binary digits, according to the type of secondary structure element it was assigned to. The elements are helix (001), sheet (010) and coil (100).
The secondary structure elements content method (8) calculates the frequency of each element type (helix, sheet, coil) found in the peptide sequence.
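$$SSEC(e) = \frac{N_e}{N}, \quad e \in \{\text{helix}, \text{sheet}, \text{coil}\}$$
Where $N_e$ is the number of times the element $e$ appears in the sequence, and $N$ is the sequence length.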
Each amino acid in the sequence gets a probability of it having one of the three structural elements (helix, coil, sheet). The secondary structure probabilities bigram (31) sums the multiplication of the probabilities for each of the combinations between structural elements among the pairs of amino acids separated by $n$ residues. The parameter $n$ was added by us; originally it was 1.
Where $P_{i,e_1}$ and $P_{i+n,e_2}$ are the probabilities of the amino acids at positions $i$ and $i+n$ in the sequence having the elements $e_1$ and $e_2$, respectively, and $N$ is the sequence length.
Each amino acid in the sequence gets a probability of it having one of the three structural elements (helix, coil, sheet). The secondary structure probabilities auto-covariance method (31) sums the multiplication of the probabilities for each structural element among the pairs of amino acids separated by $k$ residues, where $k$ ranges from 1 to $k_{max}$.
Where $P_{i,e}$ and $P_{i+k,e}$ are the probabilities of the amino acids at positions $i$ and $i+k$ in the sequence having the element $e$, $k_{max}$ is the maximum value for the separation between residues, and $N$ is the sequence length.
These encodings use the generated .spxOut files from SPINE-X (30). There must be one file per input sequence.
The torsion angles method (8) adds the $\phi$ and $\psi$ values per amino acid to the vector.
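The torsional angles composition (31) converts the $\phi$ and $\psi$ values per amino acid from degrees to radians, calculates the sine and cosine of these two angles, sums each of them over the sequence and divides by the length of the sequence, and adds the 4 final values to the vector.
$$TAC_{\sin}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \sin(\theta_i), \quad TAC_{\cos}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \cos(\theta_i), \quad \theta \in \{\phi, \psi\}$$
Where $\theta_i$ is the $\phi$ or $\psi$ value for the amino acid at position $i$ in the sequence, and $N$ is the sequence length.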
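The torsional angles composition can be sketched as below. The function name `tac`, the two-list input layout (one angle in degrees per residue) and the output order (sine and cosine of the first angle, then of the second) are assumptions of this example; reading the angles from a SPINE-X output file is not shown.

```python
import math


def tac(phi_deg, psi_deg):
    """Average sine and cosine of two per-residue torsion angles given in degrees."""
    n = len(phi_deg)
    phi = [math.radians(a) for a in phi_deg]
    psi = [math.radians(a) for a in psi_deg]
    # Assumed output order: sin(phi), cos(phi), sin(psi), cos(psi).
    return [
        sum(math.sin(a) for a in phi) / n,
        sum(math.cos(a) for a in phi) / n,
        sum(math.sin(a) for a in psi) / n,
        sum(math.cos(a) for a in psi) / n,
    ]


features = tac([90.0, 90.0], [0.0, 0.0])
```

Whatever the sequence length, the result has exactly four values, which makes this encoding length-independent.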
The torsional angles bigram (31) converts the $\phi$ and $\psi$ values per amino acid from degrees to radians, and calculates the sine and cosine of these two angles, so each amino acid has 4 associated values. Then, each type of value is multiplied as pairs in the sequence separated by $n$ residues, and finally divided by the sequence length. The parameter $n$ was added by us; originally it was 1.
Where $\theta_i$ and $\theta_{i+n}$ are the $\phi$ or $\psi$ values for the amino acids at positions $i$ and $i+n$ in the sequence, and $N$ is the sequence length.
The torsional angles auto-covariance method (31) converts the $\phi$ and $\psi$ values per amino acid from degrees to radians, and calculates the sine and cosine of these two angles, so each amino acid has 4 associated values. Then, it sums the multiplication of each type of value among the pairs of amino acids separated by $k$ residues, where $k$ ranges from 1 to $k_{max}$.
Where $\theta_i$ and $\theta_{i+k}$ are the $\phi$ or $\psi$ values for the amino acids at positions $i$ and $i+k$ in the sequence, $k_{max}$ is the maximum value for the separation between residues, and $N$ is the sequence length.
The accessible surface area method (8) reads the ASA values per amino acid and adds them to the vector.
The disorder-based methods use the .dis files generated by VSL2 (32). There must be one file per input sequence.
The disorder method (33) reads the probability values per amino acid and adds them to the vector.
The disorder content method (8) calculates the frequency of ordered and disordered residues in the sequence.
$$DisorderC(c) = \frac{N_c}{N}, \quad c \in \{\text{ordered}, \text{disordered}\}$$
Where $N_c$ is the number of ordered or disordered residues in the sequence, and $N$ is the sequence length.
The disorder binary method (8) encodes each amino acid as a binary vector of length 2. If the residue is ordered, then it is encoded as [1, 0], and if it is disordered, it is encoded as [0, 1].
The k-nearest neighbors (KNN) methods require two additional files: a training file in FASTA format that will contain a training set, and a label file, which will contain the class each sequence corresponds to. The KNN method uses the similarity score between every two sequences in the training file as distance.
The $k$ values depend on the total number of samples provided in the training file: each value is the number of sequences contained in $k$% of the training file. If the training file has 10 sequences, then from 1% to 10% the value will be 1, from 11% to 20% the value will be 2, and so on.
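The worked example above (10 training sequences) is consistent with rounding the percentage of the training set up to the next whole sequence. The sketch below assumes exactly that rule; the function name `neighbours_for_percentage` is hypothetical.

```python
def neighbours_for_percentage(n_training, percent):
    """Number of training sequences covered by `percent` % of the set, rounded up."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-n_training * percent // 100)


values = [neighbours_for_percentage(10, p) for p in (1, 10, 11, 20, 21)]
```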
The k-nearest neighbor for peptides method (34) indicates how many of the sequences per each class in the neighboring $k$% from the training file are close to the input sequence according to the similarity score $Sim(s_1, s_2)$, which is calculated as:
Where $\mathcal{A}$ is the set of the 20 natural amino acids, $B(a, b)$ is the value for the amino acid pair $(a, b)$ in the BLOSUM62 matrix, and $s_1[i]$ and $s_2[i]$ are the amino acids at position $i$ in the sequences $s_1$ and $s_2$.
The k-nearest neighbor for proteins (8) indicates how many of the sequences per each class in the neighboring $k$% from the training file are close to the input sequence, according to the similarity score $Sim(s_1, s_2)$, which is calculated as:
Where $I(s_1, s_2)$ is the number of equal characters in the resulting Needleman-Wunsch alignment (35) between sequences $s_1$ and $s_2$, $T$ is the number of training sequences, and $N$ is the sequence length.
The position-specific scoring matrix-based methods use the .pssm files generated by blastpgp in legacy BLAST (36) or psiblast in BLAST+ (37) against the uniref50 database (38).
The PSSM method (33) inserts all 20 values for each amino acid in the sequence into the vector.
The PSSM amino acid composition method (39) calculates the average score for each of the 20 natural amino acids along the whole sequence.
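$$PSSM\text{-}AAC(t) = \frac{1}{N}\sum_{i=1}^{N} P_{i,t}, \quad t \in \mathcal{A}$$
Where $\mathcal{A}$ is the set of the 20 natural amino acids, $P_{i,t}$ is the score in the PSSM matrix for the amino acid $t$ at position $i$ in the sequence, and $N$ is the sequence length.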
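Averaging each PSSM column over the sequence can be sketched as follows. The function name `pssm_aac` and the list-of-rows layout (one row per residue, one column per amino acid type) are assumptions of this example; parsing a .pssm file is not shown, and the toy matrix has only three columns for brevity.

```python
def pssm_aac(pssm):
    """Column-wise average of a PSSM given as a list of rows (one row per residue)."""
    n = len(pssm)
    n_columns = len(pssm[0])
    return [sum(row[col] for row in pssm) / n for col in range(n_columns)]


# Toy matrix: 2 residues x 3 columns (a real PSSM has 20 columns per residue).
toy = [[1, 2, 3], [3, 2, 1]]
avg = pssm_aac(toy)
```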
The bigram PSSM method (31) sums the product between the PSSM values of two residues in the sequence separated by $n$ characters for two amino acid types and divides that sum by the sequence length. The parameter $n$ was added by us; originally it was 1.
$$B_n(r,s) = \frac{1}{N}\sum_{i=1}^{N-n} P_{i,r}\,P_{i+n,s}, \quad r,s \in \mathcal{A}$$
Where $\mathcal{A}$ is the set of the 20 natural amino acids, $P_{i,r}$ and $P_{i+n,s}$ are the scores in the PSSM matrix for the amino acids $r$ and $s$ at positions $i$ and $i+n$ respectively, and $N$ is the sequence length.
The PSSM autocovariance method (40) calculates the autocovariance between two residues separated by $n$ characters for a specific amino acid type.
Where $\mathcal{A}$ is the set of the 20 natural amino acids, $P_{i,t}$ and $P_{i+n,t}$ are the scores in the PSSM matrix for the amino acid $t$ at positions $i$ and $i+n$, and $N$ is the sequence length.
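The pseudo-PSSM method (41) finds the average for every amino acid type in the PSSM matrix, and calculates the correlation between residues separated by $\xi$ characters per each amino acid type. First, all values in the PSSM matrix must be standardized by using the following formula:
$$P'_{i,t} = \frac{P_{i,t} - \frac{1}{20}\sum_{k \in \mathcal{A}} P_{i,k}}{\sqrt{\frac{1}{20}\sum_{u \in \mathcal{A}} \left(P_{i,u} - \frac{1}{20}\sum_{k \in \mathcal{A}} P_{i,k}\right)^2}}$$
Where $\mathcal{A}$ is the set of the 20 natural amino acids, $P_{i,t}$ is the initial score in the PSSM matrix for the amino acid $t$ at the row $i$, and $P_{i,k}$ and $P_{i,u}$ are the initial scores in the PSSM matrix for the row $i$, columns $k$ and $u$.
$$\bar{P}_t = \frac{1}{N}\sum_{i=1}^{N} P'_{i,t}, \quad \theta^{\xi}_t = \frac{1}{N-\xi}\sum_{i=1}^{N-\xi} \left(P'_{i,t} - P'_{i+\xi,t}\right)^2, \quad t \in \mathcal{A}$$
Where $P'_{i,t}$ and $P'_{i+\xi,t}$ are the standardized scores in the PSSM matrix for the amino acid $t$ at rows $i$ and $i+\xi$, and $N$ is the sequence length.
The PPSSM vector is the concatenation of the 20 values for $\bar{P}_t$ and the 20 values of $\theta^{\xi}_t$.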
The amino acid index method (42) uses the amino acid properties from the AAindex Database (14). This database has 544 different indices, where 531 have no "NA" values for any of the 20 amino acids. The features are the values for each amino acid in the sequence found in each one of the indices.
The BLOSUM62 method (43) uses the BLOSUM62 matrix to get the features, which are all the values for the 20 amino acids in each respective row. This means that for every amino acid in the sequence there will be 20 features.
The z-scale method (44) uses the z-scale table (45), where each amino acid type has 5 z-values. This means that for every amino acid in the sequence there will be 5 features.