Bioinformatics In Biotechnology Education

1. Introduction
Biotechnology is the buzzword of the current times. If biotechnology is hot, bioinformatics is its hottest arm, shrouded in hype that conveniently conceals the ground realities. Bioinformatics must be viewed in the proper perspective if the rich benefits that accrue from it are to be reaped. In fact, serious efforts should be made to place even biotechnology in a rational perspective. Awareness is the key to the successful deployment of both bioinformatics and biotechnology in enhancing the well-being of people, animals and the environment. This effort should begin with biotechnology education.

The term ‘bioinformatics’ is the short form of ‘biological informatics’, just as biotechnology is the short form of ‘biological technology’. There are several definitions of bioinformatics, as there are of biotechnology, often depending upon whom you are talking to. Anthony Kerlavage, of Celera Genomics, defined bioinformatics as ‘any application of computation to the field of biology, including data management, algorithm development, and data mining’. Clearly, a number of divergent areas, many of them outside biotechnology, come under bioinformatics.

The concept of a computer database came into practice by 1948, through a US defence initiative. A database is meant to store voluminous information in an orderly fashion, to facilitate the addition and deletion of information, and to allow its retrieval in whatever combination of criteria the user requires. Biologists took advantage of this facility from the very early stages and used it in different contexts. What is considered bioinformatics today, by general consent (or silence), is a much later development of the database concept.
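
The three database operations just described (orderly storage, addition and deletion, and retrieval by different combinations of criteria) can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the table layout, the accession codes and the record values are hypothetical and are not taken from any real sequence database.

```python
import sqlite3

# In-memory database; the table layout and all records are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences (accession TEXT PRIMARY KEY, organism TEXT, "
    "molecule TEXT, length INTEGER)"
)

# Addition of records (hypothetical accession codes and lengths).
records = [
    ("DEMO001", "Escherichia coli", "DNA", 1200),
    ("DEMO002", "Oryza sativa", "DNA", 950),
    ("DEMO003", "Oryza sativa", "protein", 310),
    ("DEMO004", "Homo sapiens", "protein", 110),
]
conn.executemany("INSERT INTO sequences VALUES (?, ?, ?, ?)", records)

# Deletion of a record.
conn.execute("DELETE FROM sequences WHERE accession = ?", ("DEMO001",))

# Retrieval by one criterion, then by a combination of two criteria.
rice_entries = conn.execute(
    "SELECT accession FROM sequences WHERE organism = ? ORDER BY accession",
    ("Oryza sativa",),
).fetchall()
short_proteins = conn.execute(
    "SELECT accession FROM sequences WHERE molecule = ? AND length < ? "
    "ORDER BY accession",
    ("protein", 200),
).fetchall()

print(rice_entries)
print(short_proteins)
```

The same stored records answer both queries; only the selection criteria change, which is the flexibility that made databases attractive to biologists.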

2. What is Bioinformatics?

Bioinformatics has emerged from the inputs of specialists in several different areas such as biology, biochemistry, biophysics, molecular biology, biostatistics and computer science. Specially designed algorithms and organised computer databases are at the core of all bioinformatic operations. These algorithms, which are necessarily complex, make voluminous data easy to handle for defined purposes in an amazingly short time, a task that is humanly impossible. Such an activity makes heavy, high-level demands on both the hardware and the software capabilities of computers.

With several divergent claimants, it is rather difficult to decide which areas of knowledge and information genuinely constitute bioinformatics.   It may be helpful to identify areas that are not normally considered as bioinformatics, as for example,

A) structure determination by crystallography and NMR,
B) ecological modelling of populations of organisms,
C) genome sequencing methods (genetic mapping),
D) radiological image processing (human structure scans),
E) artificial life simulation such as artificial immunology and life security,
F) organism phylogenies based on non-molecular data,
G) computerised diagnosis based on genetic analysis (pedigrees), and
a few others, though all of these involve computer processing of biological data.

By a convention that no one explains, only genomics (the study of the complete molecular sequence of one full set of an organism's genes) and proteomics (the study of amino acid sequences and the three-dimensional structure of proteins in relation to their function) constitute bioinformatics. Thus, bioinformatics is concerned with compounds of high molecular weight (HMW), particularly the nucleic acids and proteins. In recent times, cheminformatics (or chemoinformatics; the study of low molecular weight, LMW, compounds), glycomics (the study of carbohydrates), metabolomics (the study of metabolic pathways in organisms) and drug design through bioinformatics are also being projected as legitimate areas of bioinformatics.

3. The Great Hijack

Organisms can be interpreted as variously ordered and organised packages of chemical compounds. Biological processes, susceptibility or resistance to pests and diseases, and all other aspects of life are interactions of chemical compounds. Big or small, all biological molecules (biomolecules) have biological activity (bioactivity), either stimulatory or inhibitory, be it nutritional, enzymatic or therapeutic, in the organisms in which they occur and/or on other organisms. The biology of all molecules of biological origin should constitute molecular biology. However, the platform was hijacked a long time ago, so that molecular biology came to mean only the study of the chemistry and biology of the nucleic acids. Because of the importance of the enzymes in particular, proteins also came to be regarded as a legitimate area of molecular biology. The second hijack, which regards only genomics and proteomics as bioinformatics, was far less perceptible.

Biomolecules other than nucleic acids and proteins are equally complex and important in the metabolism of organisms, have much wider applications and have a much longer history. For example, it took 150 years to synthesise the stereo-conformational molecule of quinine, an LMW plant product, which has been the most important drug in the control of malaria for over one and a half centuries. Early attempts to synthesise quinine opened up a whole new industry, the manufacture of synthetic dyes.

A lot of the bias in favour of nucleic acids and proteins arises from the distinction between micro- (LMW) and macro- (HMW) compounds. The distinction, usually based on the molecular weight (MW) of a compound, is quite arbitrarily placed at 10 kDa (kilodaltons). Several polysaccharides have MWs well over 10 kDa (galactomannan, 310 kDa), while some proteins have low MWs: the lectin (a protein) of the stinging nettle is 8.5 kDa and the melanin-controlling peptide hormone is only 2.39 kDa.
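
The effect of the 10 kDa cut-off can be checked with a few lines of Python. The sketch below uses only the three molecular weights quoted above; the function name and the choice to count a weight of exactly 10 kDa as HMW are assumptions made for illustration.

```python
# Conventional (and arbitrary) cut-off between low and high molecular weight.
CUTOFF_KDA = 10.0

# (name, chemical class, molecular weight in kDa), values as quoted in the text.
compounds = [
    ("galactomannan", "polysaccharide", 310.0),
    ("stinging nettle lectin", "protein", 8.5),
    ("melanin-controlling peptide hormone", "peptide", 2.39),
]


def weight_class(mw_kda: float, cutoff_kda: float = CUTOFF_KDA) -> str:
    """Return 'HMW' for weights at or above the cut-off, otherwise 'LMW'."""
    return "HMW" if mw_kda >= cutoff_kda else "LMW"


classified = {name: weight_class(mw) for name, _, mw in compounds}
for name, chem_class, mw in compounds:
    print(f"{name} ({chem_class}, {mw} kDa): {classified[name]}")
```

The polysaccharide falls in the HMW class while the protein and the peptide fall in the LMW class, so the weight classes do not line up with the chemical classes.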

In certain situations, the distinction between peptides and proteins also seems to be ambiguous. A structure with fewer than a dozen amino acids is usually considered a peptide. Peptides too have a tertiary structure. However, structures with even 40 amino acids are sometimes regarded as (poly)peptides. Most proteins have MWs between 30 kDa and 120 kDa, though some like the cytochrome p can be over 900 kDa. Cytochrome c, the metabolic enzyme, is only 12.9 kDa. Insulin, whose two chains together come to about 5.8 kDa, is often considered a peptide.

4. Genomics

Genomics is an important area of modern biology, where the nucleotide sequences of all the chromosomes of an organism are mapped and thereby the location of different genes and their sequences are determined.   Genomics involves extensive analysis of nucleic acids through molecular biological techniques, before the data are ready for processing by computers.

Entire genomes of several organisms such as Escherichia coli, yeast, the malarial parasite, the nematode Caenorhabditis elegans, the angiosperm Arabidopsis thaliana, etc., have now been unravelled. The most significant recent advancement in modern biology is the mapping of the entire genomes of man and the rice plant.

Estimating the number of genes in an organism from the number of nucleotide base pairs was not reliable, owing to the presence of large numbers of redundant copies of many genes. Genomics has corrected this situation. It is now known that a human being has about 30,000 genes and not 100,000, as estimated earlier. The rice plant contains about 50,000 genes, many thousands more than the human being. It is also clear that several thousand genes are common to different organisms, irrespective of their taxonomic closeness or otherwise. Information derived from genome analysis not only tells us on which chromosome specific genes reside but also helps in determining their function. Such knowledge is necessary to improve the economic potential of organisms, reduce susceptibility to parasites and diseases, transfer genes from one organism to a totally unrelated organism to produce improved varieties, etc. Useful genes can be selected from a gene library thus constructed and inserted into other organisms for improvement, or harmful genes can be silenced. Genomics is an area full of promise to greatly enhance the well-being of humans, animals and the environment in many different ways.

4a. Structural genomics is the area concerned with the identification of genes, their location, their nucleotide sequences and associated features, for which examples are already given.   Many more of such studies to unravel entire genomes, of important crop plants, pathogens and others, are underway.

4b. Functional genomics aims at determining the function of different genes. The insertion of the crystal protein genes for pest resistance from Bacillus thuringiensis into the genomes of several crop plants was an outcome of functional genomics. Concerted efforts are under way to understand the function of human genes, genes of the rice plant and those of other organisms. Functional genomics also helps us to identify genes responsible for the production of specific antibodies and to produce vaccines for mass inoculation purposes. It is now possible to identify the genes responsible for pathogenesis in the genomes of parasites and to produce DNA vaccines based on this information. The area that is concerned with genes responsible for the production of pharmaceutically important compounds is sometimes distinguished as pharmacogenomics.

4c. Nutritional genomics, a rapidly emerging area, is the study and manipulation of genes responsible for the synthesis of nutritionally important enzymes or other molecules, often involving entire biosynthetic pathways. This will pave the way for the insertion of these genes into crop plants to enrich them in special ways. The first example of such a biofortified crop plant is golden rice, where the biosynthetic machinery for β-carotene (pro-vitamin A) is introduced into the rice genome to express in the rice grain, a feature that was not present there earlier. The genomes of the gene donors for golden rice, daffodil (Narcissus pseudonarcissus) and the bacterium Erwinia uredovora, have not been worked out. Nor was the genome of the rice plant available until after the first successful product was generated.

5. Proteomics

Proteomics involves the sequencing of amino acids in a protein, determining its three-dimensional structure and relating it to the function of the protein. Before computer processing comes into the picture, extensive data, particularly from crystallography and NMR, are required for this kind of study. With such data on known proteins, the structure of newly discovered proteins and its relationship to function can be understood in a very short time. In such areas, bioinformatics has an enormous analytical and predictive potential. Metabolic proteins such as haemoglobin and insulin have been subjected to intensive proteomic investigation.

6. Cheminformatics and drug design

Drug design through bioinformatics is one of the most actively pursued areas of research.   Since a great majority of drugs are lmw compounds and since many of them are primarily derived from biological sources, there has always been a great interest in the study of lmw compounds of biological origin.   Cheminformatics (or chemoinformatics) deals with such compounds, the products of secondary metabolism, often called natural products.   Over one million products of secondary metabolism are known.   The physico-chemical properties and chemical structures for over 100,000 natural products are available in different databases.   For most of them, the biological role in the organisms in which they are synthesised is not known, but they have some kind of bioactivity against others.   This bioactivity can be turned to advantage for therapeutic purposes.   Here the expertise of a pharmacologist is required.

Several therapeutically active compounds are synthetic. Over a period of time, synthetic organic chemists have realised that it is no longer easy, or even possible, to keep conceptualising new structures. The alternative is to take natural products with a desired and known activity and either use them directly or modify them structurally for improved performance and fewer side effects. In this context, natural products are of great importance to the field of drug design.

Whether it starts from synthetic compounds or structurally modified natural products, drug development is a time-consuming and expensive process. It can take anything from 10 to 15 years and 100 to 150 million US dollars to develop a successful drug. At the end of this effort there is no guarantee that the drug will be as important as when it was conceived, or that the market forces will accept it. It is now possible, through computer algorithm based bioinformatic procedures, to identify and structurally modify a natural product, to design a drug with the desired properties and to assess its therapeutic effects, all theoretically. Such procedures, similar to an architect’s plan on the drawing board before construction, are described as in silico (in the computer, based on silicon chip technology), as opposed to the earlier in vitro (in experimental models) and in vivo (in clinical trials) methods. In silico procedures take a surprisingly short time and provide drug designers with the information they need before actually synthesising the drug.

Cheminformatics involves organisation of chemical data in a logical form to facilitate the process of understanding chemical properties, their relationship to structures and making inferences.   Chemical structures are the input to identify similar compounds for screening for biological activity.   It also helps to assess the properties of new compounds, by comparison with the known compounds.
The risk involved in the earlier, largely random methods of drug discovery is greatly reduced by bioinformatics.
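
The idea of using chemical structures as the input for finding similar compounds can be illustrated with the Tanimoto coefficient, a widely used similarity measure in cheminformatics. The sketch below is a simplification under stated assumptions: each compound is reduced to a set of hand-written structural features, whereas real systems derive fingerprints from the structure itself, and the compound names and features are hypothetical.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient: number of shared features / number of all features."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical feature sets standing in for structural fingerprints.
library = {
    "compound_A": {"aromatic_ring", "hydroxyl", "tertiary_amine", "methoxy"},
    "compound_B": {"aromatic_ring", "hydroxyl", "tertiary_amine"},
    "compound_C": {"carboxyl", "primary_amine"},
}
query = {"aromatic_ring", "hydroxyl", "tertiary_amine", "vinyl"}

# Rank the library from most to least similar to the query structure.
ranked = sorted(
    ((tanimoto(query, features), name) for name, features in library.items()),
    reverse=True,
)
for score, name in ranked:
    print(f"{name}: {score:.2f}")
```

Compounds at the top of such a ranking would be the first candidates to screen for the biological activity already known for the query compound.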

7. Glycomics

Glycobiology is the study of carbohydrates of biological origin. Monosaccharides, the building blocks of complex polysaccharides, are LMW compounds like the nucleotides and amino acids. Polysaccharides are HMW compounds like the nucleic acids and proteins. There are homo- and heteropolymers of carbohydrates. Polysaccharides serve as storage products (starch, glycans, arabans, galactans, mannans), structural components (cellulose, hemicellulose, pectin, chitin) and functional compounds (metabolic and nutritional). The structures of the monosaccharides, and their number and sequence in polysaccharides, are all genetically determined, as for nucleic acids and proteins. While four nucleotides offer only 64 triplet codes, the carbohydrates offer 34,625 combinations. As new biological roles of carbohydrates are continually being discovered, glycobiology is a rapidly expanding area of biological research. Glycomics, the application of bioinformatic procedures to carbohydrate research, is the future field of bioinformatics.
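
The figure of 64 triplet codes follows from simple counting: with four bases and three positions there are 4 x 4 x 4 ordered triplets. The short sketch below enumerates them. It checks only the nucleotide figure; the carbohydrate figure depends on assumptions about sugar types and linkages that are not stated here, so it is not reproduced.

```python
from itertools import product

NUCLEOTIDES = "ACGU"  # the four RNA bases

# Every ordered triplet of bases, e.g. "AUG".
codons = ["".join(triplet) for triplet in product(NUCLEOTIDES, repeat=3)]

print(len(codons))  # equals 4 ** 3
print(codons[:4])
```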

8. Molecular phylogenies

Phylogeny is the origin and evolution of organisms.   With an estimated four million organisms, though not even a quarter of them are currently known to science, it is necessary that they are properly classified and named.   It will be of great advantage to understand the genetic and evolutionary relationships of organisms, in order to use them in a profitable manner, in biotechnology and elsewhere.   Biologists have constructed very elegant systems of classifications for the known organisms, though problems persist.   All this commendable work, with over three centuries of history, was done using externally visible, structural, chemical or functional attributes of organisms.   This constitutes the field of taxonomy, which is called systematics when the theory of organic evolution is applied to it.

With the advances in molecular biology, biologists have used data from the genetic material to characterise organisms and to verify their classification and relationships, which had been inferred on the basis of other evidence. Since it is impractical to use entire genomes for this purpose, nucleotide sequences of genes from the mitochondrial and chloroplast genomes are used. These nucleotide sequences are compared using complex computer software. Extensive work has been carried out in this way, comparing a very large number of species of plants and animals. A number of systematists would benefit if bioinformatists provided them with computer-based services to analyse their systematic data.
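
At its simplest, comparing nucleotide sequences means counting the positions at which two aligned sequences agree. The sketch below computes pairwise percent identity for three short fragments. It assumes the sequences are already aligned to equal length, and the taxon names and sequences are hypothetical; real phylogenetic software adds alignment, substitution models and tree building on top of this.

```python
from itertools import combinations


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions at which the two sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
    return 100.0 * matches / len(seq_a)


# Hypothetical, already aligned gene fragments (not real data).
fragments = {
    "taxon_1": "ATGGCTAAGCTTGACC",
    "taxon_2": "ATGGCTAAGCTAGACC",
    "taxon_3": "ATGACTTAGCAAGTCC",
}

# Pairwise identity for every pair of taxa.
identity = {
    (a, b): percent_identity(fragments[a], fragments[b])
    for a, b in combinations(fragments, 2)
}
for (a, b), value in identity.items():
    print(f"{a} vs {b}: {value:.2f}% identical")
```

In this toy data set taxon_1 and taxon_2 are the most similar pair, which is the kind of evidence used to group them more closely than either is grouped with taxon_3.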

Amino acid sequences and characteristics of proteins are also used in systematics. The metabolic protein enzyme cytochrome c, with 100 to 112 amino acids and a MW of about 12.59 kDa, was used to unravel the phylogenetic relationships of a wide range of organisms. The protein is identified on the basis of its function, which is a reliable guide to its nature, and then the sequence comparisons are made.

The study of the amino acid sequences of insulin, the peptide/protein hormone involved in mammalian carbohydrate metabolism, is another example. Such a study has also helped in choosing the non-human insulin closest to human insulin for use in the management of diabetes.

9. L- and D-amino acids

It is a much debated and still unsolved puzzle that in nature almost all the carbohydrates are of the D-configuration and almost all the amino acids are of the L-configuration, although carbohydrates and amino acids of the alternative configuration do occur.

The ‘dermorphin gene associated peptide’, which mimics morphine activity, is composed of 11 D-amino acids. An all-L-amino acid structure has no activity. It is hard to explain why. It is also not understood why peptides formed of D-amino acids are more susceptible to protein degradation.

It would be very helpful if a search were made for peptide drugs composed partially or wholly of D-amino acids. There are several bioactive but toxic L-amino acid peptides, which could be modified to contain some or all D-amino acids to reduce toxicity and even to improve bioactivity. There is a D-amino acid hexapeptide combinatorial library with structures of over 52,28,400 peptides, which is a very rich source of information for such research in the very promising area of drug design.
10. Drug modification

Certain metabolic deficiencies such as diabetes require an exogenous supply of the active compound, insulin in this case, to maintain health. The insulins of cattle, sheep, horses, pigs and humans differ from one another in one or two amino acids at positions 8, 9 and 10, the rest of the sequence being identical. Clinical insulin used in the management of human diabetes is usually extracted from pig pancreas. If the differing amino acid in a non-human insulin is appropriately substituted, the product becomes human insulin. That genetically modified bacteria now produce human insulin is a different matter.
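
Finding the positions at which two insulins differ is a simple position-by-position comparison. The sketch below lists the substitutions between two A-chain sequences written in one-letter amino acid code. The two strings are the sequences commonly reported for the human and bovine A chains; they are included here as an assumption for illustration and should be checked against a curated protein database before being relied upon. The function name is likewise hypothetical.

```python
def substitutions(reference: str, other: str):
    """List (position, reference residue, other residue); positions are 1-based."""
    if len(reference) != len(other):
        raise ValueError("sequences must have the same length")
    return [
        (i, r, o)
        for i, (r, o) in enumerate(zip(reference, other), start=1)
        if r != o
    ]


# A-chain sequences in one-letter code, as commonly reported for human and
# bovine insulin; treat them as assumptions and verify before reuse.
HUMAN_A = "GIVEQCCTSICSLYQLENYCN"
BOVINE_A = "GIVEQCCASVCSLYQLENYCN"

diffs = substitutions(HUMAN_A, BOVINE_A)
print(diffs)
```

Each tuple names a position and the residue that would have to be substituted to convert one chain into the other.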

Several synthetic products are quite useful but cannot be used by everyone because of side effects in some people. For example, aspartame (marketed under different trade names) is a dipeptide of aspartic acid and phenylalanine, and is 300 times sweeter than cane sugar. Aspartame is widely used as an alternative sweetener by diabetics and others who cannot take sweeteners loaded with calories. Unfortunately, pregnant women and people suffering from phenylketonuria, a disorder due to an impaired metabolism of phenylalanine, should not use aspartame. It would be useful if phenylalanine could be substituted by some other amino acid without affecting the sweetness, to remove the restriction on its use.

Cyclosporin A, an 11-amino acid cyclic peptide, is the most popular immunosuppressant, widely used in tissue and organ transplantation to prevent tissue rejection. However, cyclosporin A has certain side effects and some antibiotic activity, which complicate post-transplant monitoring. It would be a great help if the side effects and antibiotic activity could be removed through amino acid substitution, while retaining the immunosuppressant activity, to make the drug more reliable and safer.

11. Enlarge the scope of bioinformatics

From the foregoing it should be clear that bioinformatics has a far wider scope than is now projected. Bioinformatics is a versatile, vibrant, futuristic and important field, rich in applications. Enlarging the scope of bioinformatics will only be to the advantage of bioinformatics and the bioinformatists.

12. Partnership in bioinformatics

Bioinformatics operates under a three-partner system.

A) Data gatherers: the first party. Enormous amounts of basic data from biomolecular chemistry and related areas, painstakingly gathered over long years by experimental and analytical scientists, are the body and substance of bioinformatics.

B) Data processors: the second party. They use their skills with complex software to serve the needs of the first and third parties, and should understand both the subject area of the first party and the needs of the third.

C) Process product users: the third party, the end users of the products.

The first and third parties need not have the skills of the second.

The teaching of bioinformatics should be comprehensive, covering all three partnering areas.

13. Curriculum for bioinformatics

It is not easy to get everyone to agree to any particular curriculum for any course of study. However, there is always the possibility of a generally acceptable curriculum, which can be suitably modified for special needs. The following curriculum for bioinformatics was drawn up in consultation with specialists in different areas of bioinformatics and biotechnology, and with reference to the curricula of some reputed international institutions. Whether conducted independently or as a part of a biotechnology course, instruction in bioinformatics should include a comprehensive background in biology and related areas, and the core and advanced areas of bioinformatics. The details of the syllabus can be worked out depending upon the level of the course and its needs.

Foundation courses:
Cell biology, genetics
Biochemistry, biophysics
Microbiology, immunology
Molecular biology
Microbial biotechnology, genetic engineering
Protein engineering, immunotechnology
Computer courses

Level-one courses:
Information theory and biology
Internet use
Databases: structure of databases, sequence databases, relational databases
Sequence analysis, software resources
Sequence alignment and database searches
Phylogenetic analysis
Predictive methods
Informatics and automation in genome mapping
Genome mapping
Genome analysis

Level-two courses:
Genomics, proteomics, cheminformatics, glycomics
Advanced bioinformatics
Neural network and genetic algorithms
Molecular modelling in drug design

14. Conclusion

Bioinformatics should be an important component of biotechnology education and it should be taught from a broad based platform. Bioinformatics is an essential component of modern biology and not independent of it.   Bioinformatics is not an area of information technology and cannot be restricted to biotechnology alone.   The whole area of biology can immensely benefit from the bioinformatic approach.

We need large numbers of competent biotechnologists and bioinformatists, not holders of mere degrees in these areas. Incentives are required to attract talent, but inducements such as assured job placement and high salaries, as in information technology, are not conducive to the long-term interests of any subject. Once the balloon of hype is pricked, in the face of unkept promises, the resultant disillusionment will be detrimental to both biotechnology and bioinformatics. A rational assessment and projection of the scope and benefits of these two areas of biology is the need of the hour.