
Testing a hypothesis with the methods of conceptual biology

A literature-based approach

Bachelor's thesis, 2006, 37 pages

Biology - Microbiology, Molecular Biology


Table Of Contents

1. Abstract

2. Introduction

3. The scientific problem

4. Materials and Methods
4.1 MeSH
4.2 Using databases
4.3 Using Literature Discovery Tools
4.3.1 Arrowsmith
4.3.2 BioRAT
4.3.3 BITOLA
4.3.4 Manjal
4.3.5 LitLinker

5. Evaluation and Interpretation

6. Results

7. Discussion

8. References

9. Table of Figures

10. Abbreviations

11. Acknowledgements

1. Abstract

Undoubtedly, the development of Conceptual Biology, which uses text-mining applications specific to biology, is the only way to cope with the increasing amount of free-text data produced in this field. Three forces motivate the development of Conceptual Biology tools such as text-mining applications in molecular biology: the growing interest of users in efficiently retrieving and extracting relevant information, the need to keep up with new discoveries described in the literature or in biological databases, and the demands posed by the analysis of high-throughput experiments. The methods of Conceptual Biology were therefore used in this study to test the hypothesis that genetically modified foods have no impact on public health. We studied records retrieved from databases and from Literature Based Discovery tools. After a binary scoring of the records with respect to their usefulness, they were classified by their positive, neutral or negative conclusions regarding the effect of genetically modified food on public health. In conclusion, we have to reject the hypothesis and therefore state that genetically modified foods do have an impact on public health. Further studies in conceptual biology may focus on what kind of impact genetically modified food has on public health.

2. Introduction

“Knowledge can be created by drawing inference from what is already known.”

-Davies, R. 1989-

The abundance of electronically accessible texts has risen exponentially over the last decade. This vast amount of digital information, especially in molecular biology and genetics, is a promising resource for conceptual biology. Yet librarians and information specialists are often at a loss when biomedical or biochemical investigators ask for sources or references.

The growing number of scientific journals, each with an ever greater number of articles, keeps expanding bibliographic databases that are already enormous. The rapid and persistent growth in the number of biological, biomedical and genetics publications means that researchers can no longer read more than a minute proportion of the literature in their field. Dealing with this quantity of information has led to a fragmentation of the scientific literature (Ganiz et al. 2005) along four lines:

1. specialities: advances in the research field e.g. modern comforts in biophysics or mathematical physics
2. sub-specialities: subordinate fields within the research field, e.g. proteomics or aquatic toxicology
3. structure: structure that occurs in the research field e.g. blood, cell or lipid
4. technique: special techniques that can be found in the research field e.g. mass spectrometry or gel electrophoresis

This specialisation or fragmentation of the scientific literature creates a barrier that is hard to overcome and poses a growing problem in science, particularly in biomedicine (Swanson 2001).

Scientists tend to communicate more within their own fragment than with the wider community of their field, which deepens the lack of communication between specialities (Swanson et al. 1997). This is evident in citation patterns: authors heavily cite work from their own narrow specialities.

Thus, scientists may never become aware of the published data and results of others' relevant work. As a consequence, useful and important connections between fragmented data remain implicit and unnoticed.


Figure 1. Swanson's discovery. The relationships AB and BC are known and reported in the literature. The implicit relationship AC is a putative new discovery (Weeber et al. 2001)[1]. Swanson's serendipitous literature-based discovery concerned the treatment of Raynaud's disease with dietary fish oil. The literatures on the two topics were disjoint. Had the two scientific fields been aware of each other, the connection would have been found much earlier than Swanson's discovery.

For this reason, conceptual biology allows one to encompass as many fields of science as necessary, without limitation. This is important because no one can work experimentally in every scientific field. It is predicted that the most important changes in cellular and molecular biology will be conceptual. Conceptual biology will still depend on data collection and phenomenological publications, but what matters most is how these collections and publications are connected and related.

Information Retrieval and conventional computer-aided literature searching are therefore insufficient techniques for recognizing useful connections. This leads to Literature Based Discovery (LBD), which directly addresses the limits of knowledge discovery. LBD is an approach that brings to light novel connections that have not yet been published.

The principle of the concept rests on the hypothesis that the “wealth of recorded knowledge is greater than the sum of its parts” (Davies 1989). As a pioneer of LBD, Don R. Swanson introduced in 1986 his concept of discovering new associations within a bibliographic database. He has since put forward hypotheses that have been published in various articles (Swanson 1986; 1988; 1990; Smalheiser & Swanson 1996, 1998). According to him, LBD is a process of finding complementary structures in apparently disjoint literatures.

These complementary structures derive from two discrete arguments that yield novel and important inferences and insights when combined. Discrete arguments that do not mention, cite or co-cite each other are defined as “disjoint” arguments (Fig. 1).

To specify the main purpose of LBD, I will focus on its relevant traits:

1. LBD draws on existing knowledge from the published scientific literature (e.g. Medline)
2. LBD is a process that strives to find relations between two disjoint arguments (e.g. “high blood viscosity” and “platelet aggregation” are arguments mentioned in the separate literatures on fish oil and Raynaud's disease)
3. The combination of these arguments may yield a new, non-obvious insight
4. Any connection made should be novel and previously unpublished (e.g. no publication had ever mentioned fish oil and Raynaud's disease together)
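
The four traits above can be illustrated with a minimal sketch of the A-B-C pattern from Figure 1. It is an illustration only, not the method of any of the tools named in this thesis: the toy records are invented, and the helper names `cooccurring` and `open_discovery` are hypothetical. Only the concept names come from the fish oil example.

```python
from collections import defaultdict

# Toy corpus: each record is the set of concepts one article mentions.
# The records are invented for illustration (an assumption, not real data).
records = [
    {"raynaud's disease", "high blood viscosity"},
    {"raynaud's disease", "platelet aggregation"},
    {"fish oil", "high blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"fish oil", "triglycerides"},
]


def cooccurring(term, records):
    """Return every concept that shares at least one record with `term`."""
    found = set()
    for concepts in records:
        if term in concepts:
            found |= concepts - {term}
    return found


def open_discovery(a_term, records):
    """Map each candidate C term to the B terms that link it to A."""
    b_terms = cooccurring(a_term, records)
    candidates = defaultdict(set)
    for b in b_terms:
        for c in cooccurring(b, records):
            # Keep C only if it is disjoint from A, i.e. it never appears
            # in the same record as A (trait 4: the link is unpublished).
            if c != a_term and c not in b_terms:
                candidates[c].add(b)
    return dict(candidates)


links = open_discovery("raynaud's disease", records)
print(links)
```

In this toy corpus the only candidate is "fish oil", linked to the starting term through both intermediate arguments. A real system would additionally have to rank a very large number of such candidates.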

LBD is intricate because it comprises two types of entities: concepts and literatures.


Figure 2. The Connection Explosion.

To emphasize the problem, consider the number of potential connections between units of specialized literature, which grows much faster than the number of units themselves. For n units, the number of pairwise connections is x = n(n-1)/2.

Hence, for 10^7 Medline records we obtain about 50,000 billion (5 x 10^13) possible two-way connections between individual articles.
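
The figure quoted above can be checked with a short calculation. This is a minimal sketch that reads the record count as 10^7; the function name `pairwise_connections` is chosen here for illustration.

```python
def pairwise_connections(n: int) -> int:
    """Number of unordered pairs among n units: n(n-1)/2."""
    return n * (n - 1) // 2


# The number of connections grows roughly with the square of the number of units.
for n in (10, 1_000, 100_000, 10**7):
    print(f"{n:>12,} units -> {pairwise_connections(n):>22,} connections")

medline_pairs = pairwise_connections(10**7)
```

For 10^7 records the exact value is 49,999,995,000,000, which rounds to the 50,000 billion stated in the text.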


But there are three more serious obstacles for LBD. First, the information space seems unmanageable, with many potential relations arising from the vast amount of data and text. Second, natural language is an unstructured format with its own grammar and semantics, and the texts are moreover written in different languages.

Third, the lack of a standardized vocabulary makes it difficult to define the various LBD techniques formally. Swanson's example shows that we can discover new knowledge from existing text if we assemble the pieces of existing knowledge in the right way.

Conceptual Biology is not a common term in biochemical and biomedical research. Yet it leads to important information buried in the overwhelming mass of literature. Researchers may therefore ask for a comprehensible definition of what conceptual biology is really about. Mikhail Blagosklonny hit the bull's eye when he wrote that a conceptual biologist “…can generate a hypothesis in which predictions are formulated in testable terms, and then search for relevant information among published reports of experiments that may have had a different purpose altogether.” (Blagosklonny et al. 2002).

In biochemical research generally, “one is not licensed to theorize without providing new data” (Blagosklonny et al. 2002), but according to D. Bray (Bray 2001) this “is a sociological problem and not a scientific one.” Furthermore, conceptual biology is an important and irreplaceable complement to accepted empirical biology, in part because researchers struggle to maintain expertise in their own fields, and even more to understand the connections between different research fields that could reveal fundamentally new facts hidden in the overabundance of data.

Hypothesis testing is central to the process of scientific discovery, and experimental design is a common methodology in its evidence-gathering part. However, methods of the kind now commonly known as ‘data mining’ can be used, and always have been used, for gathering evidence as well. Somewhat ironically, data mining has recently been constrained by the vast expansion of knowledge that is increasingly stored in specialized databases and formats, as will be explained below. To overcome these limitations, new developments have emerged in the form of what has been called Conceptual Biology.

With specialized tools and methodologies, which are also explained below, Conceptual Biology allows seemingly unlikely hypotheses to be tested; examples are given below as well.

3. The scientific problem

This research project was carried out within the Critical Genome Project. The purpose of my work was to conduct research on the scientific basis of genetic engineering and related aspects of biotechnology with the methodology embodied in Conceptual Biology. The special area of interest was the impact of genetically modified food on public health, and the attendant scientific problem was to test the hypothesis: Genetically modified food has no impact on public health!

To confirm or reject the hypothesis, we had to test it with formulated, testable predictions against the published literature by recognising, establishing and exploiting the links between seemingly unrelated topics, using databases and the tools of Literature Based Discovery.

4. Materials and Methods

Materials and methods are presented in the order in which they were used during the hypothesis testing.


Figure 3. Web of Science. The homepage presents three different search options. The General Search finds records by topic, author name, source title, and author address. With the Cited Reference Search the user can search for articles that cite works selected from the citation index. With the Advanced Search the user can create complex searches using field tags and set combinations. For this thesis we chose the general and the advanced search modes.

4.1 MeSH

PubMed uses a controlled vocabulary to index the articles in the database. This controlled vocabulary is called Medical Subject Headings (MeSH). Each citation in PubMed is assigned a series of subjects, or MeSH Headings, to identify the topics covered in an article.

MeSH provides a consistent way to retrieve information that may use different terminology for the same concepts. A plain keyword search can miss key articles: if an article does not contain the exact keyword, PubMed may not retrieve it. Not every concept has a corresponding MeSH term, but it is always a good idea to search MeSH before doing a basic keyword search.

We started the MeSH term database search with the words “Genetically modified food” and the terms it suggested. We then searched for the term “Public Health” and finally combined the two terms with the Boolean operator AND.

These terms were then sent to the PubMed database. In addition, all the offered MeSH terms and their combinations were used for the searches in the following databases.
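
The combination of two terms with AND can be written out as a PubMed query string. The following is a minimal sketch: the helper `mesh_query` is hypothetical, and the two phrases are the search words quoted above, which are not necessarily the exact headings stored in the MeSH database. `[MeSH Terms]` is the PubMed field tag that restricts a term to the MeSH index.

```python
def mesh_query(*headings: str, operator: str = "AND") -> str:
    """Join search phrases into one PubMed query restricted to MeSH terms.

    The phrases are passed through unchanged; whether each one matches an
    official MeSH heading has to be checked in the MeSH database itself.
    """
    tagged = [f'"{heading}"[MeSH Terms]' for heading in headings]
    return f" {operator} ".join(tagged)


# Search words as given in the text (assumed, not verified, to be headings).
query = mesh_query("Genetically modified food", "Public Health")
print(query)
```

The resulting string can be pasted into the PubMed search box. Passing `operator="OR"` would broaden the search instead of narrowing it.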

4.2 Using databases

With full access to all databases listed by the Levy Library of the Mount Sinai School, we decided in advance to use only eight of them, in order to avoid an overload of digital information and paper. The databases used were:

1. Medline
2. Web of Science
3. Faculty of 1000 Biology
4. Faculty of 1000 Medicine
5. ProQuest Digital Dissertations
6. Wiley Interscience (Online Books)
7. BioOne
8. Google Scholar

Medline (Medical Literature Analysis and Retrieval System Online) is an international literature database of life sciences and biomedical information, covering the fields of medicine, nursing, pharmacy, dentistry, veterinary medicine, and health care. In addition, Medline covers most of the literature in biology and biochemistry and contains more than 13 million records from nearly 4,800 selected publications covering biomedicine and health from 1966 to the present. The database is freely accessible via the PubMed interface, and new citations are added Tuesday through Saturday.[2]


[1] In different works disease can be defined as A and substance can be defined as C. Thus, the search can either start from disease or substance.

[2] , received 08-16-2006


Institution / University
Westfälische Hochschule Gelsenkirchen, Bocholt, Recklinghausen
Keywords: Molecular Biology, Genetically modified food, Public Health, Literature-based Discovery



