Pseudo amino acid composition
Pseudo amino acid composition, or PseAA composition, was originally introduced by Kuo-Chen Chou in 2001 to represent protein samples for improving protein subcellular localization prediction and membrane protein type prediction.
Background
To predict the subcellular localization of proteins and other attributes based on their sequence, two kinds of models are generally used to represent protein samples: (1) the sequential model, and (2) the non-sequential model or discrete model.
The most typical sequential representation for a protein sample is its entire amino acid (AA) sequence, which contains the most complete information about the protein. This is an obvious advantage of the sequential model. To obtain the desired results, tools based on sequence-similarity search are usually used to conduct the prediction. However, this kind of approach fails when a query protein has no significant homology to any protein whose attributes are known. Thus, various discrete models were proposed.
The simplest discrete model uses the amino acid composition (AAC) to represent protein samples, formulated as follows. Given a protein sequence P with $L$ amino acid residues, i.e.,
$$P = R_1 R_2 R_3 \cdots R_L \tag{1}$$
where $R_1$ represents the 1st residue of the protein P, $R_2$ the 2nd residue, and so forth, the protein P of Eq.1 can be expressed according to the amino acid composition (AAC) model by
$$P = [f_1 \; f_2 \; \cdots \; f_{20}]^{\mathrm{T}} \tag{2}$$
where $f_u$ ($u = 1, 2, \ldots, 20$) are the normalized occurrence frequencies of the 20 native amino acids in P, and T is the transposing operator. Accordingly, the amino acid composition of a protein can be easily derived once the protein sequencing information is known.
Owing to its simplicity, the amino acid composition (AAC) model was widely used in many earlier statistical methods for predicting protein attributes. However, all the sequence-order information would be lost by using the AA composition to represent a protein. This is its main shortcoming.
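The loss of sequence-order information can be seen in a minimal Python sketch of the AAC model. The function name `amino_acid_composition` and the alphabetical ordering of the 20 one-letter codes are illustrative assumptions, not part of the original formulation.

```python
from collections import Counter

# The 20 native amino acids as one-letter codes (ordering is an arbitrary choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the 20 normalized occurrence frequencies of a protein sequence."""
    counts = Counter(sequence.upper())
    # Normalize over the standard residues only, ignoring any other symbols.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


aac_forward = amino_acid_composition("MKVLA")
aac_reversed = amino_acid_composition("ALVKM")  # same residues, opposite order
print(aac_forward == aac_reversed)  # True: the composition ignores residue order
```

Two different sequences built from the same residues map to the same 20-component vector, which is the shortcoming described above.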
Concept
To avoid completely losing the sequence-order information, the concept of PseAA (pseudo amino acid) composition was proposed. The conventional amino acid composition contains 20 components, each reflecting the occurrence frequency of one of the 20 native amino acids in a protein. In contrast, the PseAA composition contains more than 20 discrete factors: the first 20 represent the components of the conventional AA composition, while the additional factors incorporate some sequence-order information via various modes. The additional factors are a series of rank-different correlation factors along a protein chain, but they can also be any combination of other factors, so long as they reflect some sort of sequence-order effect. The essence of PseAA composition is therefore that it covers the AA composition while also containing information beyond it, and hence can better reflect the features of a protein sequence through a discrete model.
Meanwhile, various modes to formulate the PseAA composition have also been developed, as summarized in a review.
Algorithm
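According to the PseAA composition model, the protein P of Eq.1 can be formulated as
$$P = [p_1, p_2, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+\lambda}]^{\mathrm{T}}, \quad (\lambda < L) \tag{3}$$
where the ($20 + \lambda$) components are given by
$$p_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & (1 \le u \le 20) \\[2ex] \dfrac{w \, \tau_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & (20 + 1 \le u \le 20 + \lambda) \end{cases} \tag{4}$$
where $w$ is the weight factor, and $\tau_k$ is the $k$-th tier correlation factor that reflects the sequence-order correlation between all the $k$-th most contiguous residues, as formulated by
$$\tau_k = \frac{1}{L-k} \sum_{i=1}^{L-k} J_{i,i+k}, \quad (k < L) \tag{5}$$
with
$$J_{i,i+k} = \frac{1}{\Gamma} \sum_{q=1}^{\Gamma} \left[ \Phi_q(R_{i+k}) - \Phi_q(R_i) \right]^2 \tag{6}$$
where $\Phi_q(R_i)$ is the $q$-th function of the amino acid $R_i$, and $\Gamma$ is the total number of functions considered. For example, in the original paper by Chou, $\Phi_1(R_i)$, $\Phi_2(R_i)$ and $\Phi_3(R_i)$ are respectively the hydrophobicity value, hydrophilicity value, and side chain mass of amino acid $R_i$, while $\Phi_1(R_{i+k})$, $\Phi_2(R_{i+k})$ and $\Phi_3(R_{i+k})$ are the corresponding values for the amino acid $R_{i+k}$. Therefore, the total number of functions considered there is $\Gamma = 3$. It can be seen from Eq.3 that the first 20 components, i.e. $p_1, p_2, \ldots, p_{20}$, are associated with the conventional AA composition of protein P, while the remaining components $p_{20+1}, \ldots, p_{20+\lambda}$ are the correlation factors that reflect the 1st tier, 2nd tier, …, and the $\lambda$-th tier sequence-order correlation patterns. It is through these additional $\lambda$ factors that some important sequence-order effects are incorporated.
$\lambda$ in Eq.3 is an integer parameter; choosing a different integer for $\lambda$ leads to a PseAA composition of a different dimension.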
Using Eq.6 is just one of the modes for deriving the correlation factors or PseAA components. The others, such as the physicochemical distance mode and amphiphilic pattern mode, can also be used to derive different types of PseAA composition, as summarized in a review paper.
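The rank-different correlation mode can be sketched in Python as follows. Here the k-th tier correlation factor is the average, over all residue pairs k positions apart, of the mean squared difference of their property values. The 20 frequencies and the weighted correlation factors are then divided by a common denominator so that all components sum to 1. The function names, the default weight `w=0.05`, and the `toy_scale` table are illustrative assumptions. In particular, `toy_scale` is an arbitrary placeholder and not a published hydrophobicity, hydrophilicity, or mass scale. In practice each property scale is usually standardized across the 20 amino acids first, a step this sketch leaves to the caller.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 native amino acids


def correlation_factor(sequence, k, properties):
    """k-th tier correlation factor of `sequence`.

    `properties` is a list of dicts, each mapping a one-letter amino acid
    code to a numeric property value (caller-supplied).
    """
    n_pairs = len(sequence) - k
    total = 0.0
    for i in range(n_pairs):
        a, b = sequence[i], sequence[i + k]
        # Mean squared property difference between residues i and i + k.
        total += sum((p[b] - p[a]) ** 2 for p in properties) / len(properties)
    return total / n_pairs


def pseaa_composition(sequence, properties, lam, w=0.05):
    """Return the (20 + lam)-component PseAA vector; w is an assumed default."""
    if any(aa not in AMINO_ACIDS for aa in sequence):
        raise ValueError("sequence must contain only the 20 native amino acids")
    if not 0 <= lam < len(sequence):
        raise ValueError("lam must satisfy 0 <= lam < len(sequence)")
    freqs = [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]
    taus = [correlation_factor(sequence, k, properties) for k in range(1, lam + 1)]
    denom = sum(freqs) + w * sum(taus)
    # First 20 components: composition part; last lam: sequence-order part.
    return [f / denom for f in freqs] + [w * t / denom for t in taus]


# Placeholder scale for demonstration only: NOT a real physicochemical scale.
toy_scale = {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)}

vec_a = pseaa_composition("MKVLA", [toy_scale], lam=2)
vec_b = pseaa_composition("MVKLA", [toy_scale], lam=2)  # same residues, new order
print(len(vec_a))      # 22 components: 20 + lam
print(vec_a == vec_b)  # False: the extra components depend on residue order
```

The two example sequences have identical amino acid compositions, yet their PseAA vectors differ because the correlation factors change with residue order.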
Applications
Since PseAA composition was introduced, it has been widely used to predict various attributes of proteins, such as structural classes of proteins, enzyme family classes and subfamily classes, GABA(A) receptor proteins, protein folding rates, cyclin proteins, supersecondary structure, subcellular location of proteins, subnuclear location of proteins, apoptosis protein subcellular localization, submitochondria localization, protein quaternary structure, bacterial secreted proteins, conotoxin superfamily and family classification, protease types, GPCR types, human papillomaviruses, outer membrane proteins, transmembrane regions in protein, protein secondary structural contents, subcellular localization of mycobacterial proteins, lipase types, DNA-binding proteins, cell wall lytic enzymes, cofactors of oxidoreductases, among many other protein attributes and protein-related features (see, e.g., the review paper by Gonzalez-Diaz et al. as well as the relevant references cited therein).
PseAA composition has also been used to incorporate protein domain or FunD (functional domain) information and GO (Gene Ontology) information to improve the prediction quality for the subcellular localization of proteins as well as their other attributes.
Meanwhile, the concept of PseAA composition has also stimulated the generation of pseudo-folding topological indices and pseudo-folding lattice network.