The amino acid composition method (1) calculates the frequency of each natural amino acid in the sequence.
\[ f(a) = \frac{N(a)}{N}, \quad a \in A \]
Where $A$ is the set of the 20 natural amino acids, $N(a)$ is the number of times amino acid $a$ appears in the sequence, and $N$ is the sequence length.
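The frequency calculation described above can be sketched in Python as follows. This is a minimal illustration, not the package's implementation; the function name `aac` and the constant `AMINO_ACIDS` are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 natural amino acids


def aac(sequence: str) -> dict[str, float]:
    """Return the frequency of each natural amino acid in `sequence`."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    n = len(sequence)  # sequence length
    return {aa: counts[aa] / n for aa in AMINO_ACIDS}


features = aac("ACCA")
print(features["A"], features["C"], features["D"])
```

The result always has 20 entries, so sequences of different lengths map to vectors of the same size.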
The dipeptide composition method (1) calculates the frequency of each consecutive amino acid pair in the sequence.
\[ f(r, s) = \frac{N_{rs}}{N - 1}, \quad r, s \in A \]
Where $A$ is the set of the 20 natural amino acids, $N_{rs}$ is the number of times amino acid pair $rs$ appears in the sequence, and $N$ is the sequence length.
The tripeptide composition method (1) calculates the frequency of each consecutive amino acid triplet in the sequence.
\[ f(r, s, t) = \frac{N_{rst}}{N - 2}, \quad r, s, t \in A \]
Where $A$ is the set of the 20 natural amino acids, $N_{rst}$ is the number of times amino acid triplet $rst$ appears in the sequence, and $N$ is the sequence length.
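The dipeptide and tripeptide compositions differ only in the length of the consecutive fragment, so both can be illustrated with one hypothetical helper, `kmer_composition`. The sketch assumes that counts are normalized by the number of overlapping windows (sequence length minus 1 for pairs, minus 2 for triplets) and that windows containing non-standard residues are skipped.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(sequence: str, k: int) -> dict[str, float]:
    """Frequency of every consecutive k-mer (k=2: dipeptides, k=3: tripeptides)."""
    total = len(sequence) - k + 1  # number of overlapping windows
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=k)}
    for i in range(total):
        kmer = sequence[i:i + k]
        if kmer in counts:  # assumption: skip windows with non-standard residues
            counts[kmer] += 1
    return {kmer: c / total for kmer, c in counts.items()}


dpc = kmer_composition("ACACA", 2)  # 400 features
tpc = kmer_composition("ACACA", 3)  # 8000 features
```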
The composition of k-spaced amino acid pairs method (2) calculates the frequency of amino acid pairs separated by $k$ characters.
\[ f_k(r, s) = \frac{N^{k}_{rs}}{N - k - 1}, \quad r, s \in A, \quad k = 0, 1, \dots, k_{max} \]
Where $A$ is the set of the 20 natural amino acids, $N^{k}_{rs}$ is the number of times amino acid pair $rs$, separated by $k$ characters, appears in the sequence, $N$ is the sequence length and $k_{max}$ is the maximum value of $k$. Two consecutive characters are separated by $k = 0$. If $k_{max} = 5$, then each possible pair would be calculated for $k = 0, 1, 2, 3, 4$ and 5.
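A minimal sketch of the k-spaced pair counting is shown below. The function name `cksaap` and its `k_max` argument are hypothetical, and the sketch assumes that each gap is normalized by the number of pairs available at that gap and that a gap too long for the sequence contributes zeros.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def cksaap(sequence: str, k_max: int = 5) -> list[float]:
    """Concatenate the k-spaced pair frequencies for k = 0..k_max."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features: list[float] = []
    for k in range(k_max + 1):
        total = len(sequence) - k - 1  # number of pairs with k residues in between
        counts = dict.fromkeys(pairs, 0)
        for i in range(max(total, 0)):
            pair = sequence[i] + sequence[i + k + 1]
            if pair in counts:
                counts[pair] += 1
        # Assumption: a gap longer than the sequence allows yields all zeros.
        features.extend(counts[p] / total if total > 0 else 0.0 for p in pairs)
    return features


vec = cksaap("ACDA", k_max=2)  # 3 gaps x 400 pairs = 1200 features
```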
The dipeptide deviation from expected mean method (3) calculates the dipeptide composition ($D_c$), theoretical mean ($T_m$) and theoretical variance ($T_v$) and applies the following formulas:
Where $A$ is the set of the 20 natural amino acids, $N_{rs}$ is the number of times the amino acid pair $rs$ appears in the sequence, $N_r$ and $N_s$ are the number of times the amino acid $r$ or $s$ appears in the sequence, and $N$ is the sequence length.
The amino acid pair antigenicity scale method (4) counts, for each amino acid pair, the number of times the pair appears consecutively in the sequence and multiplies that count by the pair's normalized antigenicity scale, which can be understood as the chance that the pair is associated with an epitope.
Where is the group of the 20 natural amino acids, is the number of times the amino acid pair appears in the sequence and is the normalized amino acid pair antigenicity scale for the pair .
The values of are in Supplementary Material 2. They were calculated as follows:
Where and are the frequencies of the amino acid pair in the epitopes (obtained from the Bcipep database) (5) and non-epitopes (obtained from the Swiss-Prot database) (6), respectively.
The composition moment vector method (7) incorporates the position of each occurrence of each amino acid in the sequence into its calculation.
Where is the group of the 20 natural amino acids, is the residue in the sequence and is the sequence length.
The enhanced amino acid composition method (8) calculates the frequency of each natural amino acid in a sliding window across the whole sequence.
\[ f(a, w) = \frac{N(a, w)}{N(w)}, \quad a \in A \]
Where $A$ is the set of the 20 natural amino acids, $N(a, w)$ is the number of times amino acid $a$ appears in the sliding window $w$ and $N(w)$ is the size of the sliding window. For example, the first sliding window would go from the first amino acid to the amino acid $N(w)$, while the second sliding window would go from the second amino acid to the amino acid $N(w) + 1$.
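The sliding-window behaviour can be illustrated with the short sketch below. The name `eaac` and the default window size of 5 are assumptions made for the example, not values taken from this document.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def eaac(sequence: str, window: int = 5) -> list[list[float]]:
    """Amino acid frequencies for every sliding window of size `window`."""
    if window <= 0 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    rows = []
    # The window starts at each position that leaves room for `window` residues.
    for start in range(len(sequence) - window + 1):
        chunk = sequence[start:start + window]
        rows.append([chunk.count(aa) / window for aa in AMINO_ACIDS])
    return rows


rows = eaac("AACD", window=2)  # windows: "AA", "AC", "CD"
```

The number of features grows with the sequence length, so this encoding is best suited to sequences of equal length.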
The grouped amino acid composition method (8) finds the proportion of each of five amino acid groups in the sequence. The groups are defined by physicochemical properties: aliphatic (AGILMV), aromatic (FWY), positive (HKR), negative (DE) and uncharged (CNPQST) (9).
\[ f(g) = \frac{N(g)}{N}, \quad g \in G \]
Where $G$ are the 5 groups based on the amino acids' physicochemical properties, $N(g)$ is the number of times an amino acid belonging to the group $g$ appears and $N$ is the sequence length.
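A minimal sketch of the grouped composition, using the five groups listed above. The names `GROUPS` and `gaac` are hypothetical.

```python
GROUPS = {
    "aliphatic": "AGILMV",
    "aromatic": "FWY",
    "positive": "HKR",
    "negative": "DE",
    "uncharged": "CNPQST",
}


def gaac(sequence: str) -> dict[str, float]:
    """Proportion of residues in `sequence` that fall in each group."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    n = len(sequence)
    return {
        name: sum(sequence.count(aa) for aa in members) / n
        for name, members in GROUPS.items()
    }


print(gaac("AFKDC"))  # one residue from each group
```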
The enhanced grouped amino acid composition method (8) finds the proportion of each of five amino acid groups in a sliding window across the peptide sequence. The groups are defined by physicochemical properties: aliphatic (AGILMV), aromatic (FWY), positive (HKR), negative (DE) and uncharged (CNPQST) (9).
\[ f(g, w) = \frac{N(g, w)}{N(w)}, \quad g \in G \]
Where $G$ are the 5 groups based on the amino acids' physicochemical properties, $N(g, w)$ is the number of times an amino acid belonging to the group $g$ appears in the sliding window $w$ and $N(w)$ is the window size.
The composition of k-spaced amino acid group pairs method (8) calculates the frequency of amino acid pairs, grouped by their physicochemical properties as in GAAC, separated by $k$ characters.
\[ f_k(g_r, g_s) = \frac{N^{k}_{g_r g_s}}{N - k - 1}, \quad g_r, g_s \in G, \quad k = 0, 1, \dots, k_{max} \]
Where $G$ are the 5 groups based on the amino acids' physicochemical properties, $N^{k}_{g_r g_s}$ is the number of times amino acids from the groups $g_r$ and $g_s$, separated by $k$ characters, are paired in the sequence, $N$ is the sequence length and $k_{max}$ is the maximum value of $k$. Two consecutive characters are separated by $k = 0$. If $k_{max} = 5$, then each possible pair would be calculated for $k = 0, 1, 2, 3, 4$ and 5.
The grouped dipeptide composition method (8) calculates the frequency of each consecutive amino acid group pair in the sequence.
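\[ f(g_r, g_s) = \frac{N_{g_r g_s}}{N - 1}, \quad g_r, g_s \in G \]
Where $G$ are the 5 groups based on the amino acids' physicochemical properties, $N_{g_r g_s}$ is the number of times amino acids from the groups $g_r$ and $g_s$ appear consecutively in the sequence, and $N$ is the sequence length.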
The grouped tripeptide composition method (8) calculates the frequency of each consecutive amino acid group triplet in the sequence.
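\[ f(g_r, g_s, g_t) = \frac{N_{g_r g_s g_t}}{N - 2}, \quad g_r, g_s, g_t \in G \]
Where $G$ are the 5 groups based on the amino acids' physicochemical properties, $N_{g_r g_s g_t}$ is the number of times amino acids from the groups $g_r$, $g_s$ and $g_t$ appear consecutively in the sequence, and $N$ is the sequence length.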
For the encoding based on grouped weight method (10), the amino acids are split in 4 groups, based on their hydrophobicity and charge:
And then each amino acid would have an associated value for each group in the following way:
So, if an amino acid belongs to, for example, group , then it would pre-encode as 1 for the first group , and 0 for the other ones. If it belongs to group , it would pre-encode as 1 for every group. This results in three binary sequences , one per group, with length, being the sequence length, and is a number between 1 and 3.
The full table associating each amino acid with its group values can be found in Supplementary Material 2.
The normalized weight of a characteristic sequence is the frequency of 1 appearing in it.
Given a number , the characteristic sequence can be split into subsequences. This way, represents a subsequence, where , and is the largest integer below the result of the division inside. Joining all values would yield the whole characteristic sequence. Hence, is the normalized weight of the subsequence . This results in the following weight characteristic sequence:
Finally, all three vectors (one per ) are concatenated.
Both quasi-sequence-order and sequence-order-coupling number encodings (11) use the Grantham (12) and the Schneider-Wrede (13) distance matrices.
The $d$-th rank sequence-order-coupling number is a sum of squares of the distance (according to the distance matrices) between two amino acids that are separated by $d$ characters in the sequence.
\[ \tau_d = \sum_{i=1}^{N-d} \left( d_{i,i+d} \right)^2, \quad d = 1, 2, \dots, d_{max} \]
Where $d_{i,i+d}$ is the value in a distance matrix between the two amino acids at positions $i$ and $i+d$, $d_{max}$ is the maximum value of the lag value $d$, and $N$ is the sequence length.
This encoding is the concatenation of all $\tau_d$ per distance matrix.
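The sum of squared distances at each lag can be sketched as follows. The matrix `TOY_DISTANCE` is a made-up three-letter example used only to keep the block self-contained; it does not contain Grantham or Schneider-Wrede values, and the function name `socn` is hypothetical.

```python
def socn(sequence: str, distance: dict[str, dict[str, float]], max_lag: int) -> list[float]:
    """Sequence-order-coupling numbers for lags 1..max_lag and one distance matrix."""
    n = len(sequence)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    return [
        # Squared distance between every pair of residues `lag` positions apart.
        sum(distance[sequence[i]][sequence[i + lag]] ** 2 for i in range(n - lag))
        for lag in range(1, max_lag + 1)
    ]


# Hypothetical symmetric distance matrix over the alphabet {A, C, D}.
TOY_DISTANCE = {
    "A": {"A": 0.0, "C": 1.0, "D": 2.0},
    "C": {"A": 1.0, "C": 0.0, "D": 3.0},
    "D": {"A": 2.0, "C": 3.0, "D": 0.0},
}

print(socn("ACDA", TOY_DISTANCE, max_lag=2))
```

Calling the function once per distance matrix and concatenating the results gives the full encoding.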
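First, the quasi-sequence-order numbers for the amino acids must be calculated as follows:
\[ X_a = \frac{f_a}{\sum_{a \in A} f_a + w \sum_{d=1}^{d_{max}} \tau_d}, \quad a \in A \]
Where $A$ is the set of the 20 natural amino acids, $f_a$ is the frequency of each amino acid in the sequence (just as in AAC encoding), $\tau_d$ is the $d$-th rank sequence-order-coupling number, and $w$ is a weight factor.
Then, the quasi-sequence-order numbers for the lag values must be calculated as follows:
\[ X_d = \frac{w \, \tau_d}{\sum_{a \in A} f_a + w \sum_{d=1}^{d_{max}} \tau_d}, \quad d = 1, 2, \dots, d_{max} \]
Where $A$ is the set of the 20 natural amino acids, $f_a$ is the frequency of each amino acid in the sequence (just as in AAC encoding), $d$ is the lag value, and $w$ is a weight factor.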
The autocorrelation descriptors use the amino acid properties from the AAindex Database (14), found in the data/AAidx.txt file. The default indices used (CIDH920105, BHAR880101, CHAM820101, CHAM820102, CHOC760101, BIGC670101, CHAM810101, DAYM780201) were taken from the work by Xiao et al. (15). All the values in the indices are centralized and standardized for the autocorrelation encodings as follows:
\[ P'(a) = \frac{P(a) - \bar{P}}{\sigma}, \quad a \in A \]
Where $A$ is the set of the 20 natural amino acids, $P(a)$ is the value of the property for the amino acid $a$, and $\bar{P}$ and $\sigma$ are the average and standard deviation of all the 20 amino acids in the index, respectively.
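The centralization and standardization step can be sketched as a z-score over the values of one index. The sketch assumes the population standard deviation (dividing by the number of amino acids); whether the sample standard deviation is used instead is not stated here. The function name `standardize` and the three toy values are hypothetical and are not taken from AAindex.

```python
from statistics import fmean, pstdev


def standardize(index: dict[str, float]) -> dict[str, float]:
    """Centralize and standardize the property values of one index."""
    values = list(index.values())
    mean = fmean(values)
    sd = pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        raise ValueError("all property values are identical")
    return {aa: (value - mean) / sd for aa, value in index.items()}


toy_index = {"A": 1.0, "C": 2.0, "D": 3.0}  # made-up values for illustration
scaled = standardize(toy_index)
```

After this step the values of every index have mean 0 and standard deviation 1, which makes the autocorrelation descriptors comparable across properties.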
The Geary autocorrelation (16) is calculated as:
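\[ C(d) = \frac{\frac{1}{2(N-d)} \sum_{i=1}^{N-d} (P_i - P_{i+d})^2}{\frac{1}{N-1} \sum_{i=1}^{N} (P_i - \bar{P})^2}, \quad d = 1, 2, \dots, d_{max} \]
Where $d$ is the lag value, $d_{max}$ is the maximum lag value, $P_i$ and $P_{i+d}$ are the centralized and standardized values for the amino acids at positions $i$ and $i+d$, $\bar{P}$ is the average property value between all amino acids in the sequence, and $N$ is the sequence length.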
The Moran autocorrelation (17) is calculated as:
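\[ I(d) = \frac{\frac{1}{N-d} \sum_{i=1}^{N-d} (P_i - \bar{P})(P_{i+d} - \bar{P})}{\frac{1}{N} \sum_{i=1}^{N} (P_i - \bar{P})^2}, \quad d = 1, 2, \dots, d_{max} \]
Where $d$ is the lag value, $d_{max}$ is the maximum lag value, $P_i$ and $P_{i+d}$ are the centralized and standardized values for the amino acids at positions $i$ and $i+d$, $\bar{P}$ is the average property value between all amino acids in the sequence, and $N$ is the sequence length.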
The Normalized Moreau-Broto autocorrelation (18) is calculated as:
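\[ ATS(d) = \frac{1}{N-d} \sum_{i=1}^{N-d} P_i \, P_{i+d}, \quad d = 1, 2, \dots, d_{max} \]
Where $d$ is the lag value, $d_{max}$ is the maximum lag value, $P_i$ and $P_{i+d}$ are the centralized and standardized values for the amino acids at positions $i$ and $i+d$, and $N$ is the sequence length.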
The Composition/Transition/Distribution encodings (19), (20) are based on a categorical division of the 20 natural amino acids according to their structural and physicochemical properties. Thirteen properties were taken from iFeature (8), and one (surface tension) was added (21), as listed in Supplementary Material 2.
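The composition descriptor calculates the frequency of each division per property.
\[ C(r) = \frac{N(r)}{N} \]
Where $N(r)$ is the number of amino acids in the division $r$ found in the sequence, and $N$ is the sequence length.
The transition descriptor calculates how often an amino acid from one division is immediately followed by an amino acid from a different division.
\[ T(r, s) = \frac{N(r, s) + N(s, r)}{N - 1} \]
Where $N(r, s)$ and $N(s, r)$ are the numbers of consecutive amino acids from divisions $r$ and $s$ in both orders ($rs$ and $sr$), and $N$ is the sequence length.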
The distribution descriptor calculates where the first, 25%, 50%, 75% and 100% of the amino acids in a division occur in a sequence. To compute it, mark all the amino acids in the sequence that belong to a given division. Find the position of the first occurrence and divide it by $N$ (the sequence length). Then, find the position at which the first 25% (rounded down) of the amino acids in that division have occurred, and divide this position by $N$. After that, do the same with the other percentages (Figure 1).
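The distribution steps above can be sketched for a single division as follows. The function name `ctd_distribution` is hypothetical. Two behaviours are assumptions made for the sketch: when rounding down gives rank 0 the first occurrence is used, and a division absent from the sequence yields five zeros.

```python
def ctd_distribution(sequence: str, division: set[str]) -> list[float]:
    """Relative positions of the first, 25%, 50%, 75% and 100% occurrences."""
    n = len(sequence)
    # 1-based positions of the residues that belong to the division.
    positions = [i + 1 for i, aa in enumerate(sequence) if aa in division]
    if not positions:
        return [0.0] * 5  # assumption: division not present in the sequence
    total = len(positions)
    features = [positions[0] / n]
    for percent in (25, 50, 75, 100):
        rank = max(1, total * percent // 100)  # rounded down, at least 1
        features.append(positions[rank - 1] / n)
    return features


print(ctd_distribution("KAKAKAKA", {"K"}))
```

In the example, the four K residues sit at positions 1, 3, 5 and 7 of a sequence of length 8, so the 50% mark is the second K at position 3.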
For the conjoint triad encodings, the amino acids were classified into 7 classes based on the dipoles and volumes of the side chains (22): {A, G, V}, {I, L, F, P}, {Y, M, T, S}, {H, N, Q, W}, {R, K}, {D, E}, and {C}.
The conjoint triad method (22) is calculated as:
Where is the number of times three consecutive amino acids belonging to groups , and are seen in the sequence.
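The triad counting can be sketched as below, assuming the seven classes of reference (22) are {A, G, V}, {I, L, F, P}, {Y, M, T, S}, {H, N, Q, W}, {R, K}, {D, E} and {C}. The sketch returns raw counts only; any normalization applied by the method's formula is not reproduced, and the name `conjoint_triad_counts` is hypothetical.

```python
from itertools import product

CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: idx for idx, members in enumerate(CLASSES) for aa in members}


def conjoint_triad_counts(sequence: str) -> dict[tuple[int, int, int], int]:
    """Count every triad of classes formed by three consecutive residues."""
    counts = {triad: 0 for triad in product(range(len(CLASSES)), repeat=3)}
    for i in range(len(sequence) - 2):
        window = sequence[i:i + 3]
        if all(aa in CLASS_OF for aa in window):  # skip non-standard residues
            counts[tuple(CLASS_OF[aa] for aa in window)] += 1
    return counts


counts = conjoint_triad_counts("AGVC")  # triads: AGV and GVC
```

With 7 classes there are 7 x 7 x 7 = 343 possible triads, so the vector length does not depend on the sequence length.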
The k-spaced conjoint triad method (8) is based on the conjoint triad method, but instead of only evaluating consecutive amino acids, it evaluates triads whose members are separated by 0 to $k$ characters. The original CT method is the same as KSCT with $k = 0$.
Where $N^{g}_{abc}$ is the number of times three amino acids belonging to groups $a$, $b$ and $c$, separated by $g$ characters, are seen in the sequence. This should be evaluated for $g = 0, 1, \dots, k$, so it is calculated $k + 1$ times. For example, for $g = 0$, each triad is formed by the amino acids at positions $i$, $i+1$ and $i+2$ for $i = 1, \dots, N-2$. For $g = 1$, each triad is formed by the amino acids at positions $i$, $i+2$, $i+4$ for $i = 1, \dots, N-4$. For $g = 2$, each triad is formed by the amino acids at positions $i$, $i+3$, $i+6$ for $i = 1, \dots, N-6$.
The pseudo-amino acid composition encodings use the hydrophobicity values proposed by Tanford (23), the hydrophilicity values proposed by Hopp and Woods (24) and the standard side chain mass values. Their initial values are represented by $H_1^0(a)$, $H_2^0(a)$ and $M^0(a)$, where $a$ is each of the 20 natural amino acids. These values are centralized and standardized as follows:
Where is the group of the 20 natural amino acids, and represents the centralized and standardized value of any of the three properties (, , ) of the amino acid , so in the end we would have , and .
For the pseudo-amino acid composition method (11), a correlation function is calculated as:
Where is the group of the 20 natural amino acids. Then, the sequence-order-correlated factors are computed as follows:
Where is the maximum lag value. Now, the first 20 features (one per amino acid in ) are computed.
Where is the number of times the amino acid appears in the sequence, and is a weighting factor set by default as 0.05, as suggested by Chou et al. (11). Finally, the last set of features are added to the vector.
The amphiphilic pseudo-amino acid composition method (25) only uses the hydrophilicity () and hydrophobicity () values. These values are used to define their correlation functions as:
Where is the group of the 20 natural amino acids. Now, the sequence-order can be found with the following formula:
Where is the maximum lag value. Now, the first 20 features (one per amino acid in ) are computed.
Where is the number of times the amino acid appears in the sequence, and is a weighting factor set by default as 0.05, as suggested by Chou et al. (11). Finally, the last set of features are added to the vector.
The binary encoding (26) represents each amino acid in the sequence as a binary string of 20 numbers. For example, amino acid A is "10000000000000000000", C is "01000000000000000000", etc., following the order of "ACDEFGHIKLMNPQRSTVWY".
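The binary encoding above is a one-hot encoding and can be sketched directly from the stated amino acid order. The function name `binary_encoding` is hypothetical, and the sketch assumes the sequence contains only the 20 natural amino acids.

```python
ORDER = "ACDEFGHIKLMNPQRSTVWY"


def binary_encoding(sequence: str) -> list[int]:
    """Concatenate a 20-digit one-hot vector for every residue."""
    vector: list[int] = []
    for aa in sequence:
        one_hot = [0] * len(ORDER)
        one_hot[ORDER.index(aa)] = 1  # raises ValueError for non-standard residues
        vector.extend(one_hot)
    return vector


encoded = binary_encoding("AC")
print("".join(map(str, encoded[:20])))
print("".join(map(str, encoded[20:])))
```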
Taylor's Venn diagram method (27) is based on 10 physicochemical groups (hydrophobic, positive, negative, polar, charged, small, tiny, aliphatic, aromatic, proline) to which the 20 natural amino acids may belong. Each amino acid is encoded as a binary vector of length 10 (one position per property), with a 1 at every position whose group contains the amino acid. For example, an amino acid that belongs to the hydrophobic group gets a 1 at that position; otherwise, it gets a 0.
Where is a property and is the set of the 20 natural amino acids. The full table with the binary values is found in Supplementary Material 2.
The pseudo k-tuple reduced amino acid composition (28) represents proteins as vectors that contain information based on K-tuples of reduced amino acid cluster (RAAC) components. These components can depend on a $g$-gap or a $\lambda$-correlation, a type of reduced amino acid alphabet and a number of clusters (or mode). These types and modes, as well as the groups, are found in Supplementary Material 2.
For the $g$-gap type of calculation, it represents the sequence-order information of subsequences of length $K$ separated by $g$ residues. Thus, it counts the number of times a combination of groups in the selected RAAC appears (Figure 2).
For the $\lambda$-correlation type of calculation, it represents the sequence-order information of groups of $K$ amino acids separated by $\lambda$ residues between amino acids. Thus, it counts the number of times a combination of groups in the selected RAAC appears (Figure 3).
These encodings use the .ss2 files generated by PSIPRED (29) or the .spxOut files generated by SPINE-X (30). There must be one file per input sequence.
The secondary structure elements binary method (8) represents each amino acid as a vector of 3 binary digits, according to the type of secondary structure element assigned to it. The elements are helix (001), sheet (010) and coil (100).
The secondary structure elements content method (8) calculates the frequency of each element type (helix, sheet, coil) found in the peptide sequence.
\[ f(e) = \frac{N(e)}{N} \]
Where $N(e)$ is the number of times the element $e$ appears in the sequence, and $N$ is the sequence length.
Each amino acid in the sequence gets a probability of having each of the three structural elements (helix, coil, sheet). The secondary structure probabilities bigram (31) sums the product of the probabilities for each combination of structural elements over the pairs of amino acids separated by a fixed number of residues. This separation parameter was added by us; originally it was 1.
Where and are the probabilities of the amino acids at positions and in the sequence having the elements and , respectively, and is the sequence length.
Each amino acid in the sequence gets a probability of it having one of the three structural elements (helix, coil, sheet). The secondary structure probabilities auto-covariance method (31) sums the multiplication of the probabilities for each structural element among the pairs of amino acids separated by residues, where ranges from 1 to .
Where and are the probabilities of the amino acids at positions and in the sequence having the element , is the maximum value for the separation between residues, and is the sequence length.
These encodings use the .spxOut files generated by SPINE-X (30). There must be one file per input sequence.
The torsion angles method (8) adds the $\phi$ and $\psi$ values per amino acid to the vector.
The torsional angles composition (31) converts the $\phi$ and $\psi$ values per amino acid from degrees to radians, calculates the sine and cosine of these two angles, divides these values by the length of the sequence, and adds the 4 final values to the vector.
Where is the or value for the amino acid at position in the sequence, and is the sequence length.
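A minimal sketch of the torsional angles composition is given below. It assumes that the two angles per residue (phi and psi) have already been read from the SPINE-X output into two lists of degrees, and that the four features are the per-sequence sums of the sines and cosines divided by the sequence length. The function name and the order of the four features are assumptions.

```python
import math


def torsional_angles_composition(phi_deg: list[float], psi_deg: list[float]) -> list[float]:
    """Return [mean sin(phi), mean cos(phi), mean sin(psi), mean cos(psi)]."""
    n = len(phi_deg)
    if n == 0 or n != len(psi_deg):
        raise ValueError("phi and psi must be non-empty and of equal length")
    phi = [math.radians(angle) for angle in phi_deg]  # degrees to radians
    psi = [math.radians(angle) for angle in psi_deg]
    return [
        sum(math.sin(a) for a in phi) / n,
        sum(math.cos(a) for a in phi) / n,
        sum(math.sin(a) for a in psi) / n,
        sum(math.cos(a) for a in psi) / n,
    ]


tac = torsional_angles_composition([90.0, 90.0], [0.0, 180.0])
```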
The torsional angles bigram (31) converts the $\phi$ and $\psi$ values per amino acid from degrees to radians and calculates the sine and cosine of these two angles, so each amino acid has 4 associated values. Then, each type of value is multiplied between the pairs of residues separated by a fixed number of positions, and the result is divided by the sequence length. This separation parameter was added by us; originally it was 1.
Where and are the or values for the amino acid at position and in the sequence, and is the sequence length.
The torsional angles auto-covariance method (31) converts the $\phi$ and $\psi$ values per amino acid from degrees to radians and calculates the sine and cosine of these two angles, so each amino acid has 4 associated values. Then, it sums the product of each type of value over the pairs of amino acids separated by $k$ residues, where $k$ ranges from 1 to $k_{max}$.
Where and are the or values for the amino acid at position and in the sequence, is the maximum value for the separation between residues, and is the sequence length.
The accessible surface area method (8) reads the ASA values per amino acid and adds them to the vector.
The disorder-based methods use the .dis files generated by VSL2 (32). There must be one file per input sequence.
The disorder method (33) reads the probability values per amino acid and adds them to the vector.
The disorder content method (8) calculates the frequency of ordered and disordered residues in the sequence.
\[ f(s) = \frac{N(s)}{N} \]
Where $N(s)$ is the number of residues in state $s$ (ordered or disordered) in the sequence, and $N$ is the sequence length.
The disorder binary method (8) encodes each amino acid as a binary vector of length 2. If the residue is ordered, then it is encoded as [1, 0], and if it is disordered, it is encoded as [0, 1].
The k-nearest neighbors (KNN) methods require two additional files: a training file in FASTA format that will contain a training set, and a label file, which will contain the class each sequence corresponds to. The KNN method uses the similarity score between every two sequences in the training file as distance.
The $k$ values depend on the total number of samples provided in the training file: each $k$ is the number of sequences contained in a given percentage of the training file. If the training file has 10 sequences, then from 1% to 10% the value of $k$ will be 1, from 11% to 20% the value will be 2, and so on.
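The mapping from a percentage to a number of neighbors in the example above behaves like a ceiling operation, which can be sketched as follows. The function name `neighbors_for_percentage` is hypothetical, and the sketch assumes integer percentages.

```python
def neighbors_for_percentage(n_training: int, percent: int) -> int:
    """Number of neighbors that corresponds to `percent` of the training set."""
    if n_training <= 0 or not 1 <= percent <= 100:
        raise ValueError("need a non-empty training set and 1 <= percent <= 100")
    # Integer ceiling of n_training * percent / 100, avoiding float rounding.
    return (n_training * percent + 99) // 100


print([neighbors_for_percentage(10, p) for p in (1, 10, 11, 20, 21, 100)])
```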
The k-nearest neighbor for peptides method (34) indicates how many of the sequences of each class in the neighboring $k$% of the training file are close to the input sequence according to the similarity score, which is calculated as:
Where is the set of the 20 natural amino acids, is the value for the amino acid pair in the BLOSUM62 matrix, and and are the amino acids at position in the sequences and .
The k-nearest neighbor for proteins method (8) indicates how many of the sequences of each class in the neighboring $k$% of the training file are close to the input sequence according to the similarity score, which is calculated as:
Where is the number of equal characters in the resulting Needleman-Wunsch alignment (35) between sequences and , is the number of training sequences, and is the sequence length.
The position-specific scoring matrix-based methods use the .pssm files generated by blastpgp in legacy BLAST (36) or psiblast in BLAST+ (37) against the uniref50 database (38).
The PSSM method (33) appends the 20 PSSM values of each amino acid in the sequence to the vector.
The PSSM amino acid composition method (39) calculates the average score for each of the 20 natural amino acids along the whole sequence.
\[ \bar{P}_a = \frac{1}{N} \sum_{i=1}^{N} P_{i,a}, \quad a \in A \]
Where $A$ is the set of the 20 natural amino acids, $P_{i,a}$ is the score in the PSSM matrix for the amino acid $a$ at position $i$ in the sequence, and $N$ is the sequence length.
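The column averaging can be sketched as below, assuming the PSSM has already been parsed into a list of rows (one per residue) with 20 scores each. Parsing of the .pssm file is not shown, and the name `pssm_aac` is hypothetical.

```python
def pssm_aac(pssm: list[list[float]]) -> list[float]:
    """Average each of the 20 PSSM columns over all sequence positions."""
    n = len(pssm)
    if n == 0 or any(len(row) != 20 for row in pssm):
        raise ValueError("expected a non-empty matrix with 20 columns per row")
    return [sum(row[col] for row in pssm) / n for col in range(20)]


# Toy two-residue matrix with made-up scores.
toy_pssm = [list(range(20)), [2] * 20]
averages = pssm_aac(toy_pssm)
```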
The bigram PSSM method (31) sums the product of the PSSM values of two residues in the sequence separated by a fixed number of characters for two amino acid types and divides that sum by the sequence length. This separation parameter was added by us; originally it was 1.
Where is the set of the 20 natural amino acids, and are the scores in the PSSM matrix for the amino acids and at positions and respectively, and is the sequence length.
The PSSM autocovariance method (40) calculates the autocovariance between two residues separated by characters for a specific amino acid type.
Where is the set of the 20 natural amino acids, and are the scores in the PSSM matrix for the amino acid at positions and , and is the sequence length.
The pseudo-PSSM method (41) finds the average for every amino acid type in the PSSM matrix, and calculates the correlation between residues separated by characters per each amino acid type. First, all values in the PSSM matrix must be standardized by using the following formula:
Where is the set of the 20 natural amino acids, is the initial score in the PSSM matrix for the amino acid at the row , and and are the initial scores in the PSSM matrix for the row , columns and .
Where and are the standardized scores in the PSSM matrix for the amino acid at rows and , and is the sequence length.
The PPSSM vector is the concatenation of the 20 values for and the 20 values of .
The amino acid index method (42) uses the amino acid properties from the AAindex Database (14). This database has 544 different indices, where 531 have no "NA" values for any of the 20 amino acids. The features are the values for each amino acid in the sequence found in each one of the indices.
The BLOSUM62 method (43) uses the BLOSUM62 matrix to get the features, which are all the values for the 20 amino acids in each respective row. This means that for every amino acid in the sequence there will be 20 features.
The z-scale method (44) uses the z-scale table (45), where each amino acid type has 5 z-values. This means that for every amino acid in the sequence there will be 5 features.