ligpred

Data Submission
The tool accepts a set of FASTA-formatted entries containing amino acid sequences. Each submitted sequence must contain 50 or more residues. For some descriptors (amino acid composition, dipeptide composition, and split amino acid composition), entries containing small amounts of ambiguous amino acids can still be predicted. Other descriptors cannot make proper calculations with ambiguous amino acids, and will return a value of N/A if any are encountered during classification.
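The submission rules above can be checked locally before uploading. The following is a minimal sketch, not LigPred's own validation code: the function names `parse_fasta` and `check_entry` are hypothetical, and "ambiguous" is assumed to mean any character outside the 20 standard one-letter codes.

```python
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
MIN_LENGTH = 50  # minimum residue count stated on this page


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    entries, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                entries.append((header, "".join(chunks).upper()))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        entries.append((header, "".join(chunks).upper()))
    return entries


def check_entry(sequence):
    """Report whether a sequence is long enough and count its ambiguous residues."""
    ambiguous = sum(1 for aa in sequence if aa not in STANDARD)
    return {"long_enough": len(sequence) >= MIN_LENGTH, "ambiguous": ambiguous}
```

An entry with `long_enough` set to False would be rejected, and one with a non-zero `ambiguous` count may receive N/A from some descriptors.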


Descriptor Choice
LigPred currently offers 7 types of descriptors. Dipeptide composition is the default because it had the highest MCC (Matthews correlation coefficient). Users may choose any combination of descriptors for classification. Note, however, that the selected descriptors are not merged into a composite descriptor: each is used as a single descriptor that produces its own prediction. The predictions from each additional descriptor are combined with the others to make a consensus prediction, giving higher accuracy in ambiguous cases. Below are descriptions of the descriptors used by LigPred.
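This page does not say how the per-descriptor predictions are combined. Purely as an illustration, the sketch below assumes a simple majority vote that ignores N/A results; the function name `consensus_prediction` and the voting rule are assumptions, not LigPred's documented method.

```python
from collections import Counter


def consensus_prediction(predictions):
    """Return the most common class label, ignoring 'N/A' entries.

    Assumed rule: plain majority vote. Ties go to the label seen first.
    """
    votes = [p for p in predictions if p != "N/A"]
    if not votes:
        return "N/A"
    return Counter(votes).most_common(1)[0][0]
```

For example, three descriptors returning `"PAL"`, `"PAL"`, and `"CAD"` would give `"PAL"` under this assumed rule.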

Amino Acid Composition
The fraction of each of the twenty standard amino acids in a protein sequence.
$$f(r) = \frac{N_r}{N}$$
where:
$N$: Length of the amino acid sequence.
$r = 1, 2, \ldots, 20$: One of the 20 standard amino acids.
$N_r$: Number of amino acids of type $r$.
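As a rough sketch of the amino acid composition defined above (the function name `amino_acid_composition` is hypothetical, and ambiguous residues are assumed to count toward the length N but not toward any of the 20 fractions):

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    counts = Counter(sequence)
    n = len(sequence)
    return {aa: counts[aa] / n for aa in AMINO_ACIDS}
```

The result always has 20 entries, one per standard amino acid.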
Dipeptide Composition
The fraction of each combination of two of the twenty standard amino acids in a protein sequence.
$$f(r, s) = \frac{N_{rs}}{N - 1}$$
where:
$N$: Length of the amino acid sequence.
$r, s = 1, 2, \ldots, 20$: Each one of the 20 standard amino acids.
$N_{rs}$: Number of dipeptides of type $rs$.
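A minimal sketch of the dipeptide composition follows. The function name `dipeptide_composition` is hypothetical; dipeptides containing a non-standard character are assumed to be skipped when counting, and the sequence is assumed to have at least two residues.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def dipeptide_composition(sequence):
    """Return the fraction of each of the 400 possible dipeptides."""
    counts = {"".join(pair): 0 for pair in product(AMINO_ACIDS, repeat=2)}
    # Slide a window of width 2 along the sequence (N - 1 windows in total).
    for i in range(len(sequence) - 1):
        dipeptide = sequence[i:i + 2]
        if dipeptide in counts:
            counts[dipeptide] += 1
    total = len(sequence) - 1
    return {dipeptide: c / total for dipeptide, c in counts.items()}
```

The result has 20 x 20 = 400 entries.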
Split Amino Acid Composition
The protein sequence is split into n parts and the amino acid composition is calculated for each part.
$$f(r, A_1, A_2, \ldots, A_n) = \begin{cases} \dfrac{N_r}{A_1}, & \text{for amino acids } 1 \text{ to } A_1 \\ \dfrac{N_r}{A_2 - A_1}, & \text{for amino acids } (A_1 + 1) \text{ to } A_2 \\ \quad \vdots \\ \dfrac{N_r}{A_n - A_{n-1}}, & \text{for amino acids } (A_{n-1} + 1) \text{ to } A_n \end{cases}$$
where:
$A_1, \ldots, A_n$: Positions in the amino acid sequence at which each part ends.
$r = 1, 2, \ldots, 20$: One of the 20 standard amino acids.
$N_r$: Number of amino acids of type $r$ in the corresponding part.
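The split composition can be sketched as below. The function name `split_composition` and the `boundaries` argument (the end position of each part, in increasing order) are hypothetical; this page does not state where LigPred places the split points.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def split_composition(sequence, boundaries):
    """Return the amino acid composition of each part, concatenated.

    boundaries lists the 1-based end position of each part, so the
    last value should equal len(sequence).
    """
    features = []
    start = 0
    for end in boundaries:
        part = sequence[start:end]
        # 20 fractions per part, each relative to that part's length.
        features.extend(part.count(aa) / len(part) for aa in AMINO_ACIDS)
        start = end
    return features
```

With n parts the resulting vector has 20 x n elements.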
Physicochemical Properties
A 13-element vector in which each element holds a physical or chemical property of the protein, based on the physico-chemical property composition given on COPid.
Where:
element 1 = Molecular weight of the protein
element 2 = Number of amino acids in the protein sequence
element 3 = % composition of charged residues (DEKHR)
element 4 = % composition of aliphatic residues (ILV)
element 5 = % composition of aromatic residues (FHWY)
element 6 = % composition of polar residues (DERKQN)
element 7 = % composition of neutral residues (AGHPSTY)
element 8 = % composition of hydrophobic residues (CVLIMFW)
element 9 = % composition of positively charged residues (HKR)
element 10 = % composition of negatively charged residues (DE)
element 11 = % composition of tiny residues (ACDGST)
element 12 = % composition of small residues (EHILKMNPQV)
element 13 = % composition of large residues (FRWY)
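Elements 3 to 13 are percentage compositions over the residue groups listed above, which can be sketched as follows. The names `GROUPS` and `group_percentages` are hypothetical. Molecular weight (element 1) is omitted because this page does not give the residue masses used, and element 2 is simply the sequence length.

```python
# Residue groups exactly as listed for elements 3 to 13.
GROUPS = {
    "charged": "DEKHR",
    "aliphatic": "ILV",
    "aromatic": "FHWY",
    "polar": "DERKQN",
    "neutral": "AGHPSTY",
    "hydrophobic": "CVLIMFW",
    "positive": "HKR",
    "negative": "DE",
    "tiny": "ACDGST",
    "small": "EHILKMNPQV",
    "large": "FRWY",
}


def group_percentages(sequence):
    """Return the percentage of residues that fall in each group."""
    n = len(sequence)
    return {
        name: 100.0 * sum(sequence.count(aa) for aa in members) / n
        for name, members in GROUPS.items()
    }
```

The groups overlap (for example, D is charged, polar, negative, and tiny), so the percentages are not expected to sum to 100.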

Geary Autocorrelation
One of a set of topological descriptors, the Geary autocorrelation describes the level of correlation between a given variable and itself through space. The properties used in this classification system are hydrophobicity scale, average flexibility index, polarizability parameter, free energy of solution in water, accessible surface areas, residue volume, steric parameters and relative mutability. Each of these 8 properties contributes 30 elements, giving a vector of 240 elements.

Geary Autocorrelation is defined as:
$$f(d) = \frac{\dfrac{1}{2(N-d)} \sum_{i=1}^{N-d} (P_i - P_{i+d})^2}{\dfrac{1}{N-1} \sum_{i=1}^{N} (P_i - \bar{P})^2}$$
where:
$\bar{P}$: The average value of a property.
$N$: Length of the amino acid sequence.
$d = 1, 2, \ldots, 30$: Lag of the autocorrelation.
$P_i$: Property of amino acid at position $i$.
$P_{i+d}$: Property of amino acid at position $i+d$.
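A minimal sketch of the Geary autocorrelation for a single lag is shown below. The function name `geary_autocorrelation` is hypothetical, and `values` is assumed to be the list of per-residue property values already looked up from one of the property scales (the scale values themselves are not given on this page).

```python
def geary_autocorrelation(values, lag):
    """Geary autocorrelation of a list of property values at a given lag."""
    n = len(values)
    mean = sum(values) / n
    # Numerator: mean squared difference between residues `lag` apart, halved.
    numerator = sum(
        (values[i] - values[i + lag]) ** 2 for i in range(n - lag)
    ) / (2 * (n - lag))
    # Denominator: sample variance of the property over the whole sequence.
    denominator = sum((v - mean) ** 2 for v in values) / (n - 1)
    return numerator / denominator
```

Calling this for lags 1 to 30 gives the 30 elements contributed by one property. A sequence whose property values are all identical has zero variance, so the division would fail for it.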

Moreau Broto Autocorrelation Descriptor
One of a set of topological descriptors that uses the sets of property values as the basis for measurement. The properties used in this classification system are hydrophobicity scale, average flexibility index, polarizability parameter, free energy of solution in water, accessible surface areas, residue volume, steric parameters and relative mutability. Each of these 8 properties contributes 30 elements, giving a vector of 240 elements.

Moreau Broto Autocorrelation is defined as:
$$f(d) = \sum_{i=1}^{N-d} P_i \, P_{i+d}$$
where:
$N$: Length of the amino acid sequence.
$d = 1, 2, \ldots, 30$: Lag of the autocorrelation.
$P_i$: Property of amino acid at position $i$.
$P_{i+d}$: Property of amino acid at position $i+d$.
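The Moreau Broto sum and the 30-lag vector for one property can be sketched as follows. The function names are hypothetical, and `values` is assumed to be the per-residue property values for one scale, supplied by the caller. Any normalization LigPred may apply to the property values is not described here and is not included.

```python
def moreau_broto_autocorrelation(values, lag):
    """Sum of products of property values for residues `lag` positions apart."""
    return sum(values[i] * values[i + lag] for i in range(len(values) - lag))


def autocorrelation_vector(values, max_lag=30):
    """Return the descriptor elements for lags 1..max_lag for one property."""
    return [moreau_broto_autocorrelation(values, d) for d in range(1, max_lag + 1)]
```

Repeating `autocorrelation_vector` for each of the 8 properties and concatenating the results would give the 240-element vector.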

Sequence Order Coupling Number Total
A 60-element vector derived from the Schneider-Wrede physicochemical distance matrix and the Grantham chemical distance matrix between each pair of the standard amino acids (30 coupling numbers per matrix).

Sequence order coupling number total is defined as:
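$$f(d) = \sum_{i=1}^{N-d} \left( d_{i,i+d} \right)^2$$
where:
$f(d)$: The $d$th sequence order coupling number.
$d$: The integers 1 to 30.
$N$: Length of the amino acid sequence.
$d_{i,i+d}$: Distance, taken from the distance matrix, between the amino acids at positions $i$ and $i+d$.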
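As a rough sketch of the sequence order coupling number, the function below takes the distance matrix as a dictionary keyed by residue pairs. The function name and the dictionary layout are assumptions, and the real Schneider-Wrede and Grantham values are not reproduced here, so the caller has to supply them.

```python
def sequence_order_coupling_number(sequence, distance, lag):
    """Sum of squared distances between residues `lag` positions apart.

    distance maps (residue_a, residue_b) tuples to a numeric distance.
    """
    return sum(
        distance[(sequence[i], sequence[i + lag])] ** 2
        for i in range(len(sequence) - lag)
    )
```

Evaluating this for lags 1 to 30 with each of the two matrices would give the 60 elements.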

Interpreting Data
Sample Output
  1. Click this link to automatically bookmark the page.
  2. This will open a text file of the data submitted for classification.
  3. This will open a text file containing a log of operations performed during classification.
  4. This will open a text file containing the full classification data in CSV (comma-separated values) format, suitable for use in spreadsheets.
  5. This will open a text file containing classification values only, in CSV format.
  6. The order in which the FASTA entries were submitted.
  7. The first 20 characters of the FASTA entry names.
  8. The prediction returned by the descriptor listed in the title.
  9. The composite prediction.

Below is a table of the 37 classes of lignin-related enzymes that can be predicted by LigPred. The classes returned by the prediction system correspond to the classes in this table.

| # | Gene | Protein Name | EC # | Role | PTH | PTHP | KEGG |
|---|------|--------------|------|------|-----|------|------|
| 1 | 4CL | 4-coumarate:CoA ligase | 6.2.1.12 | Synthesis | Uniprot | Uniprot | Kegg |
| 2 | C4H | cinnamate-4-hydroxylase | 1.14.13.11 | Synthesis | Uniprot | Uniprot | Kegg |
| 3 | CAD | cinnamyl-alcohol dehydrogenase | 1.1.1.195 | Synthesis | Uniprot | Uniprot | Kegg |
| 4 | CcoAOMT | caffeoyl-CoA O-methyltransferase | 2.1.1.104 | Synthesis | Uniprot | Uniprot | Kegg |
| 5 | CCR | cinnamoyl-CoA reductase | 1.2.1.44 | Synthesis | Uniprot | Uniprot | Kegg |
| 6 | COMT | caffeic acid 3-O-methyltransferase | 2.1.1.68 | Synthesis | Uniprot | Uniprot | Kegg |
| 7 | CS | chorismate synthase | 4.2.3.5 | Synthesis | Uniprot | Uniprot | Kegg |
| 8 | DAHPS | 3-deoxyarabinoheptulosonate-7-phosphate synthase | 2.5.1.54 | Synthesis | Uniprot | Uniprot | Kegg |
| 9 | DYP | Peroxidase DypB | 1.11.1.19 | Degradation | Uniprot | Uniprot | Kegg |
| 10 | EST | cinnamoyl esterase, glucuronoyl esterase | 3.1.1.73 | Degradation | Uniprot | Uniprot | Kegg |
| 11 | F5H | ferulate-5-hydroxylase | 1.14.-.- | Synthesis | Uniprot | Uniprot | Kegg |
| 12 | FerA | Feruloyl-CoA synthetase | 6.2.1.34 | Synthesis | Uniprot | Uniprot | Kegg |
| 13 | FerB | Feruloyl-CoA hydratase/lyase | 4.2.1.101 | Synthesis | Uniprot | Uniprot | Kegg |
| 14 | HCT | shikimate O-hydroxycinnamoyltransferase | 2.3.1.133 | Synthesis | Uniprot | Uniprot | Kegg |
| 15 | LDA1 | Aryl-alcohol oxidase | 1.1.3.7 | Degradation | Uniprot | Uniprot | Kegg |
| 16 | LDA2 | Vanillyl-alcohol oxidase | 1.1.3.38 | Degradation | Uniprot | Uniprot | Kegg |
| 17 | LDA3 | Glyoxal oxidase | 1.1.3.- | Degradation | Uniprot | Uniprot | NA |
| 18 | LDA4 | Pyranose oxidase | 1.1.3.10 | Degradation | Uniprot | Uniprot | Kegg |
| 19 | LDA5 | Galactose oxidase | 1.1.3.9 | Degradation | Uniprot | Uniprot | Kegg |
| 20 | LDA6 | Glucose oxidase | 1.1.3.4 | Degradation | Uniprot | Uniprot | Kegg |
| 21 | LDA7 | Benzoquinone reductase | 1.6.5.(5/6/7) | Degradation | Uniprot | Uniprot | Kegg |
| 22 | LDA8 | alcohol oxidase | 1.1.3.13 | Degradation | Uniprot | Uniprot | Kegg |
| 23 | LigAB | Protocatechuate 4,5-dioxygenase | 1.13.11.8 | Degradation | Uniprot | Uniprot | Kegg |
| 24 | LigBenDiO | Lignostilbene dioxygenase | 1.13.11.43 | Degradation | Uniprot | Uniprot | Kegg |
| 25 | LigD | C alpha-dehydrogenase | 1.-.-.- | Degradation | Uniprot | Uniprot | Kegg |
| 26 | LigEFG | Beta-etherase | 2.5.1.18 | Degradation | Uniprot | Uniprot | Kegg |
| 27 | LigH | formate-tetrahydrofolate ligase | 6.3.4.3 | Degradation | Uniprot | Uniprot | Kegg |
| 28 | LigI | 2-pyrone-4,6-dicarboxylic acid hydrolase | 3.1.1.57 | Degradation | Uniprot | Uniprot | Kegg |
| 29 | LigJ | 4-oxalomesaconate hydratase | 4.2.1.83 | Degradation | Uniprot | Uniprot | Kegg |
| 30 | LigM | Vanillate/3-O-methylgallate O-demethylase | 1.14.13.82 | Degradation | Uniprot | Uniprot | Kegg |
| 31 | LO1 | Laccases | 1.10.3.2 | Degradation | Uniprot | Uniprot | Kegg |
| 32 | LO2 | Chloroperoxidase | 1.11.1.10 | Degradation | Uniprot | Uniprot | Kegg |
| 32 | LO2 | Lignin peroxidase | 1.11.1.14 | Degradation | Uniprot | Uniprot | Kegg |
| 32 | LO2 | Manganese peroxidase | 1.11.1.13 | Degradation | Uniprot | Uniprot | Kegg |
| 32 | LO2 | Versatile peroxidase | 1.11.1.16 | Degradation | Uniprot | Uniprot | Kegg |
| 33 | LO3 | Cellobiose dehydrogenase | 1.1.99.18 | Degradation | Uniprot | Uniprot | Kegg |
| 34 | MetF | methylenetetrahydrofolate reductase | 1.5.1.20 | Degradation | Uniprot | Uniprot | Kegg |
| 35 | PAL | phenylalanine ammonia-lyase | 4.3.1.24 | Synthesis | Uniprot | Uniprot | Kegg |
| 36 | SHDH | shikimate dehydrogenase | 1.1.1.25 | Synthesis | Uniprot | Uniprot | Kegg |
| 37 | VDH | Vanillin dehydrogenase | 1.2.1.67 | Degradation | Uniprot | Uniprot | Kegg |


