Word prediction and word probability exemplified in searches over a pharmaceutical database


Bachelor Thesis, 2009

63 Pages, Grade: 1,0


Excerpt


Contents

Abstract

Acknowledgements

List of Figures

List of Tables

1 Introduction to CREAM
1.1 The Corpus Research for Exploitation of Annotated Metadata project (CREAM)
1.1.1 The main topic
1.1.2 The Corpus - eNova Database
1.1.2.1 eNova Application and Sample Walk Through
1.1.3 The main goal

2 State of the Art
2.1 Contributions to Word Prediction and Word Probability
2.1.1 Current Contributions to n-Gram Analysis
2.2 Applied n-Gram Analysis
2.2.1 National Security
2.2.2 Spelling Correction
2.2.2.1 Other Areas Related to Spelling Correction

3 n-Grams - Word Count and n-Gram Modeling
3.1 Introduction to Word Count in Corpora
3.2 Tokenization - Word Segmentation in Corpora
3.2.1 Word Types vs. Word Tokens
3.2.2 Stemming and Lemmatization
3.2.3 Non-word Characters
3.3 Parameters of Tokenization
3.3.1 Compounding and Words Separated by Whitespace
3.3.2 Hyphens
3.3.3 Case-(In-)Sensitivity
3.3.4 Other Cases in English
3.3.5 Stop words
3.3.6 Other Languages
3.4 Introduction to Word Probability and Word Prediction
3.4.1 Markov Assumption and n-Gram Modeling
3.4.2 n-Grams - n-Token Sequences of Words
3.4.3 Simple n-Gram Analysis - Maximum Likelihood Estimation
3.5 n-Gram Analysis over Sample Corpus

4 Waterfall Model
4.1 Introduction
4.2 Introduction to the Waterfall Model of Software Development
4.2.1 Requirement Analysis
4.2.2 Specification Phase
4.2.3 Design
4.2.4 Implementation and Testing - Monogram Code
4.2.5 Implementation and Testing - Bigram Code
4.2.6 Integration

5 Interpretation of n-Gram Retrieval
5.1 Introduction
5.2 Presentation of Monograms
5.2.1 Interpretation of Monograms
5.2.2 Spelling Correction
5.3 Bigrams
5.3.1 Relative Frequency of Sample Bigrams
5.4 Trigrams
5.5 Main interpretation and forecast

A Expert Search Language of eNova - Glossary of Suffixes for Keywords

B Corpora - External and Internal Expert Queries

C n-Gram Codes in Ruby

D Results of n-Gram Analyses

Bibliography

Abstract

Faculty of Linguistics and Literary Studies, Department of Anglistik - British and American Studies

Bachelor of Arts by Marc Bohnes

This Bachelor of Arts thesis contributes to the CREAM project between Novartis Pharma AG and Bielefeld University. Throughout the thesis a method called n-gram modeling will be explored which supplies its user with information about the frequency with which words are used. This information is needed in order to improve a database the CREAM project works on. The improvement involves the calculation of probabilities in search queries sent to the database. The thesis consists of five chapters. The first chapter introduces the CREAM project and the database. The second chapter provides the reader with information about the current state of n-gram modeling and where it can be found in contemporary literature. The third chapter deals extensively with how corpora have to be prepared in order to be analyzed accordingly and how n-gram modeling can be computed in terms of the frequency distribution of words. In chapter four a computer program will be introduced that uses a corpus to obtain certain n-grams. Finally, in chapter five, the information retrieved by the program(s) will be evaluated and a forecast of future work will be given.

Acknowledgements

I would like to thank Dr. Thorsten Trippel and Prof. Dr. Dafydd Gibbon for the supervision of this thesis as well as for their advice and guidance not only throughout the CREAM project and this Bachelor of Arts thesis, but also throughout my studies at Bielefeld University, especially in many issues related to English and linguistics.

List of Figures

1.1 Novartis Drug Specification

1.2 Novartis Drug: Primary Search

1.3 Novartis Drug: Secondary Search

1.4 Novartis Drug: Refinement

1.5 Novartis Drug: Final Summary

4.1 Waterfall Model by Royce

5.1 Comparison of Monograms - Diagram

List of Tables

3.1 Word Type vs. Word Token

3.2 Tokenization

5.1 Monograms - External vs. Internal Queries

5.2 Monograms - External vs. Internal Queries, Normalized to Percent

5.3 Bigrams - Occurrence Internal Queries

5.4 Bigrams - Occurrence External Queries

5.5 Trigrams - Occurrence External Queries

5.6 Trigrams - Occurrence Internal Queries

1 Introduction to CREAM

1.1 The Corpus Research for Exploitation of Annotated Metadata project (CREAM)

This Bachelor of Arts thesis is based on work that has been done during the last couple of months. This work contributes to the project called Corpus Research for Exploitation of Annotated Metadata (henceforth CREAM). CREAM is a joint project between Medical Information & Communication (MIC): Knowledge Analysis (KA), Novartis Pharma AG in Basel and the Computational Linguistics and Spoken Language Working Group at Bielefeld University, Faculty of Linguistics and Literary Studies. The CREAM project is supervised by Dr. Thorsten Trippel (project leader, finances, and project management) and Prof. Dr. Dafydd Gibbon (co-applicant and design issues) as well as Uwe Knüttel (project manager).

1.1.1 The main topic

The project CREAM deals with the problem of accessing large and complex language resources that are available for scientific use based on complex annotations. The CREAM project works in the area of the guided search, developing principles allowing naive users to exploit the language resources fully using the available metadata.

This Bachelor of Arts thesis will focus on a method that is known as n-gram analysis or n-gram modeling. That is, we will be looking at analyses of word combinations exemplified by searches over a pharmaceutical database. This database is the corpus the n-gram analyses are based on.

1.1.2 The Corpus - eNova Database

As the main goal of the CREAM project is to exploit large databases, we will have to take a look at the corpus in advance. The corpus we are going to work with consists of search queries of the Novartis Corporate Drug Literature Database, also called eNova. By using eNova, Novartis created a system with which they supply customers with any information about their own products (drugs). Hence, Novartis provides scientists, doctors, internal experts, and the like with a thorough and large database of pharmaceutical keywords and articles written by experts about a certain drug and related issues, such as contributing authors, respective journals, drug side-effects, and so on. The unique features of the drug literature database include, for instance: "1) a comprehensive coverage of products (drugs) by Novartis, 2) customized Novartis drug specific abstracts, [...], and 4) direct access to the full text of most articles" [Mas, 2008]. Therefore, eNova is a drug literature database of excellent quality and quantity. Further, every search over eNova executed by the user is saved and stored on hard disk for further analysis. The analysis of this corpus is part of this thesis and will be dealt with throughout the following chapters. What now follows are some sample lines taken from the corpus of executed searches, in order to get a first impression of what the, say, "raw" corpus looks like:

illustration not visible in this excerpt

As we will focus on the occurring keywords, we will refer to the actual term (such as xolair) as the stem and to its specification (such as .prn) as its suffix. A detailed list of what the suffixes stand for is given in appendix A. It is important to note that we will be working with two different corpora. One consists of internal expert search queries based on searches by experts within Novartis's department of Knowledge Analysis, the other consists of searches that were executed by experts outside Knowledge Analysis. These differences do not alter the upcoming calculations in any way, for we will calculate n-gram analyses over each corpus individually.
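As a first, very small illustration of how such keywords could be handled programmatically, the following Ruby sketch splits a keyword into its stem and suffix. The keyword format (a stem, a dot, then a suffix such as .prn) is assumed from the example above; actual corpus lines may require additional cleaning, and this sketch is not the code from appendix C.

```ruby
# Minimal sketch: split a query keyword such as "xolair.prn"
# into its stem ("xolair") and its suffix (".prn").
def split_keyword(keyword)
  stem, _, rest = keyword.partition(".")
  { stem: stem, suffix: rest.empty? ? nil : ".#{rest}" }
end

p split_keyword("xolair.prn")   # stem "xolair", suffix ".prn"
p split_keyword("xolair")       # stem "xolair", no suffix
```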

1.1.2.1 eNova Application and Sample Walk Through

Having introduced the corpus and discussed what it is used for in the eNova database, let us now have a look at how a sample search query might be executed. It is important to note, however, that the applications shown are from 2008, which means that there have already been changes and there are improvements yet to be released. Yet to get a good overview of what a user can do with eNova, these applications suffice. The procedure to be executed by the user is divided into four main steps:

- primary choice for a data type: drug
- secondary choice for another data type: disease
- tertiary choice: additionally involved drug
- additional search criteria

It is important to note that the first two steps are mandatory, for they function as the basis for any additional input. The other steps are optional and can be chosen flexibly by the respective user according to their needs.

Let us now look at a sample pass through eNova. First, the user has to specify and submit a drug they are looking for. In our case, the user chose the Novartis drug Sandimmun.

illustration not visible in this excerpt

Figure 1.1: Novartis Drug Specification.

Second, the user will have to choose between different parameters (such as the context of disease, the disease itself, and other additional parameters) that are applied to the very drug they chose in the first step:

illustration not visible in this excerpt

Figure 1.2: Novartis Drug: Primary Search.

In a third step, the user may add additional drugs which interact with the already stated one and, again, check new parameters related to this drug.

illustration not visible in this excerpt

Figure 1.3: Novartis Drug: Secondary Search.

In a fourth step, the interface enables the user to refine their searches in order to obtain the best possible results for the query being searched for:

illustration not visible in this excerpt

Figure 1.4: Novartis Drug: Refinement.

Finally, the user is given a final summary of their results. Further, they can then access the relevant literature for their queries right away:

illustration not visible in this excerpt

Figure 1.5: Novartis Drug: Final Summary.

1.1.3 The main goal

Although eNova is a very powerful tool for all users, we always strive for improvement. The best way to improve eNova is to actually "know" in advance what the user is likely to look at and which combinations of terms and related issues they are likely to access. That is why corpus analysis is indispensable in order to improve the guided search. Based on the corpus analysis, we will try to infer certain strategies by which users search for pharmaceutical keywords and combinatorial terms. The more we know about the users' input, the more we are able to guide them through their search, thus making eNova even more powerful and more intuitive to use.

There are quite a number of web search engines whose performance and usability differ. Google and Yahoo!, for example, simply differ in their search algorithms and the way searches are computed, which, in turn, leads to different results[1]. Accordingly, what the CREAM project is interested in is finding a way to build an interface based on eNova that is, on the one hand, highly performant in terms of the users' input. That means that we want the interface (or the search engine) to be as fast, thorough, and correct as possible while taking the least amount of time. On the other hand, it has to be highly transparent and usable at the same time, for we cannot expect every user to be a computational expert. That being said, we have to try to build an interface that works highly efficiently while being as user-friendly as possible. We have to ask ourselves how to build a professional search engine without having people report "Not being able to return to a page [they] once visited", as happens a lot on the Internet [Teevan, J. et al.]. Therefore we need to work on issues in the area of the guided search. That means that we will develop ways which eventually enable the user to exploit the language resources (eNova) without requiring them to learn a specific query language or anything else which does not come intuitively to the user [Mas, 2008].

One possible way of doing this is predicting what a user might look for. For that we will need to know which keywords co-occur with other keywords. This prediction is known as n-gram modeling, which will be dealt with throughout the following chapters. By analyzing n-grams we will be able to infer which categories the user is interested in and which contents occur most frequently in a search query. By using probabilistic calculations we will be able to have the user work with pick lists, for instance, based on the frequency distributions of keywords (those that occur in the corpora). This would thus enable us to strongly support the user during their search and increase the performance of eNova. The more we know about the user's intuition in using eNova, the better the improvement and, hence, the results. In sum, predicting a user's intuition while using eNova for their needs (based on n-gram analysis) will result in a highly powerful database. That is why we will describe and exemplify a starting point of n-gram modeling in this thesis.
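To make the idea of frequency-based pick lists a bit more concrete, the following Ruby sketch counts keyword bigrams over a corpus of queries and ranks the keywords most likely to follow a given one, using simple relative frequencies (maximum likelihood estimates). The file name and the corpus format (one query per line, keywords separated by whitespace) are assumptions made purely for illustration; the actual eNova logs and the code in appendix C may differ.

```ruby
# Count unigrams and bigrams over a query corpus (assumed format:
# one query per line, keywords separated by whitespace).
bigram_counts  = Hash.new(0)
unigram_counts = Hash.new(0)

File.foreach("queries.txt") do |line|          # hypothetical file name
  tokens = line.downcase.split
  tokens.each { |t| unigram_counts[t] += 1 }
  tokens.each_cons(2) { |a, b| bigram_counts[[a, b]] += 1 }
end

# Maximum likelihood estimate: P(next | previous) = C(previous, next) / C(previous).
def suggestions(previous, bigram_counts, unigram_counts, n = 5)
  bigram_counts.select { |(a, _), _| a == previous }
               .map { |(_, b), c| [b, c.to_f / unigram_counts[previous]] }
               .sort_by { |_, p| -p }
               .first(n)
end

# e.g. suggestions("xolair.prn", bigram_counts, unigram_counts)
# might return the five keywords most likely to follow "xolair.prn",
# which could then populate a pick list in the interface.
```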

2 State of the Art

2.1 Contributions to Word Prediction and Word Probability

The very first contributions to what is today known as n-gram modeling and its mathematical background were proposed by Markov in 1913. He used what are now called Markov chains "to predict whether an upcoming letter in Pushkin's Eugene Onegin would be a vowel or a consonant" [Jurafsky, D. and James H. Martin, 2008e]. At that time Markov looked at bigrams and trigrams, which were defined as sequences of letters of length n (2 and 3, respectively). Claude E. Shannon, in contrast, looked at sequences of entire words in the middle of the twentieth century. He was interested in the entropy of natural language texts. Shannon looked at the ability of human beings, in this case native speakers of English, to predict words from already existing ones [Mahoney, M.]. The procedure Shannon introduced is now known as the Shannon Game[2]. The way the Shannon Game works is as follows: Shannon had his subjects guess consecutive characters in a string of text. He, for example, selected a short passage that was unfamiliar to his subjects and had them guess the first letter in the passage. If the guess was correct, the subject was informed and was then to proceed to guess the second letter, and so forth[3]. If the subject was not correct, however, he was told the correct first letter and the procedure went on as indicated. It is noteworthy that the subjects had to choose one out of twenty-seven letters, for spaces were also included as an additional letter. The following briefly shows what a sample text looked like. The first line displays the original text, whereas the second line represents the subject's guesses, a miss indicated by a dash:

illustration not visible in this excerpt

Interestingly, in this example 69% of the letters were guessed correctly [Shannon, Claude E., 1950].
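For illustration, the following Ruby sketch mimics the guessing procedure with a simple bigram letter model standing in for the human subject: at each position the model guesses the most frequent successor of the previous letter, and the proportion of correct guesses is reported. The training and test texts, as well as the restriction to a 27-symbol alphabet (a-z plus space), are assumptions made only to keep the toy example small; Shannon of course worked with human subjects, not a statistical model.

```ruby
ALPHABET = ("a".."z").to_a << " "

# Reduce a text to the assumed 27-symbol alphabet.
def normalize(text)
  text.downcase.gsub(/[^a-z ]/, " ").squeeze(" ")
end

# For every letter, count which letters follow it.
def train(text)
  counts = Hash.new { |h, k| h[k] = Hash.new(0) }
  normalize(text).chars.each_cons(2) { |a, b| counts[a][b] += 1 }
  counts
end

# Play the guessing game: correct guess -> move on; wrong guess ->
# the "subject" is told the answer and the game continues.
def play_shannon_game(model, test_text)
  chars   = normalize(test_text).chars
  correct = 0
  chars.each_cons(2) do |prev, actual|
    guess = (model[prev].max_by { |_, c| c } || [ALPHABET.sample]).first
    correct += 1 if guess == actual
  end
  correct.to_f / (chars.size - 1)
end

model = train("the quick brown fox jumps over the lazy dog " * 50)
puts play_shannon_game(model, "the lazy dog jumps over the quick brown fox")
```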

2.1.1 Current Contributions to n-Gram Analysis

As the Shannon Game was just the beginning of work on word prediction and word frequency, there have been quite a lot of contributions to this issue since. Of further interest might also be Rosenfeld's "A Maximum Entropy Approach to Adaptive Statistical Language Modeling" and Cover and King's "A Convergent Gambling Estimate of the Entropy of English" [Cover, T.M. and R.C. King, 1978; Rosenfeld, Ronald, 1996]. There are, of course, different interests among people, be it linguists who are interested in Natural Language Processing, be it scholars working on text technology for spelling correction and text message improvement, or, among others, federal agents who are interested in analyzing e-mails to filter out confidential contents (hidden hints for building bombs, for instance). Jurafsky and Martin provide a very thorough and contemporary overview of Natural Language Processing and n-gram analysis in Speech and Language Processing [Jurafsky, D. and James H. Martin, 2008f]. Further contributions can also be found in Manning and Schütze's An Introduction to Information Retrieval [Manning, C.D. and H. Schütze, 2009d], in Foundations of Statistical Natural Language Processing [Manning, C.D. and H. Schütze, 2000a], and in Programming Collective Intelligence by Toby Segaran [Segaran, T., 2007].

2.2 Applied n-Gram Analysis

In contemporary Natural Language Processing there are also other areas of applied linguistics to which methods of n-gram analysis are applied. Let us briefly take a look at some of the areas in which n-gram analysis plays a major role and how it is applied to various contexts and different content in contemporary computational linguistic research.

2.2.1 National Security

In addition to checking users' input against statistically covered items, in the contemporary political situation it has become more and more important for national security agencies to be present online. Since terrorist threats are not only present in the media, including TV, for instance, but are also sent through cyberspace via e-mail messages or the like, it has become tremendously important to check these messages and identify specific patterns that, on the whole, provide the agencies with a thorough picture of the content.

2.2.2 Spelling Correction

Spell checkers and spelling correction are very important when it comes to a) yielding correct search queries and b) maintaining correct semantic as well as syntactic features of sentences. As an example, let us take the following sentences into consideration:

They are leaving in about fifteen minuets to go to their house.

The design an construction of the system will take more than a year.

Due to the fact that the errors actually are real words, we cannot just match them against non-entries of a given dictionary or vocabulary. A spell checker, however, can compute the probability of occurring n-grams. Thus it will find that the tetragram in about fifteen minutes yields a higher probability than in about fifteen minuets [Jurafsky, D. and James H. Martin, 2008b].
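A minimal Ruby sketch of this comparison is given below: two candidate continuations of fifteen are scored with bigram relative frequencies, and the likelier one wins. The counts are made-up toy values used purely for illustration, not figures from any real corpus.

```ruby
# Toy bigram and unigram counts (invented for this example only).
BIGRAM_COUNTS = {
  ["fifteen", "minutes"] => 120,
  ["fifteen", "minuets"] => 1
}
UNIGRAM_COUNTS = { "fifteen" => 150 }

# Relative frequency P(w2 | w1) = C(w1, w2) / C(w1).
def bigram_probability(w1, w2)
  BIGRAM_COUNTS.fetch([w1, w2], 0).to_f / UNIGRAM_COUNTS.fetch(w1, 1)
end

candidates = ["minutes", "minuets"]
best = candidates.max_by { |w| bigram_probability("fifteen", w) }
puts best  # => "minutes", since "fifteen minutes" is far more frequent
```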

Google, for instance, works on a similar basis. The only difference is that Google's main idea is to use the correction that is based on actual user queries which have been searched for. "The idea here is that if grunt is typed as a query more often than grant, then it is more likely that the user who typed grnt intended to type the query grunt" [Manning, C.D. and H. Schütze, 2009c].
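A hedged Ruby sketch of this query-log idea: among candidate corrections within a small edit distance of the typed string, the one queried most often by real users is preferred. The query frequencies and the distance threshold are invented for illustration only.

```ruby
QUERY_FREQUENCIES = { "grunt" => 900, "grant" => 400 }   # made-up log counts

# Plain Levenshtein edit distance between two strings.
def edit_distance(a, b)
  d = Array.new(a.length + 1) do |i|
    Array.new(b.length + 1) { |j| i.zero? ? j : (j.zero? ? i : 0) }
  end
  (1..a.length).each do |i|
    (1..b.length).each do |j|
      cost = a[i - 1] == b[j - 1] ? 0 : 1
      d[i][j] = [d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost].min
    end
  end
  d[a.length][b.length]
end

# Prefer the nearby query that users have actually searched for most often.
def correct(typed, max_distance = 2)
  QUERY_FREQUENCIES.keys
                   .select { |q| edit_distance(typed, q) <= max_distance }
                   .max_by { |q| QUERY_FREQUENCIES[q] }
end

puts correct("grnt")  # => "grunt", because it is queried more often than "grant"
```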

2.2.2.1 Other Areas Related to Spelling Correction

There are also other areas that are related to the principles used in spelling correction, such as speech recognition and handwriting recognition (used in PDAs, for example). In both cases n-gram probabilities are computed too, so that the most likely cases are considered true for the very context. Further, n-gram models are also crucial for machine translation, for there are also probabilities within syntactic contexts which make it likely or unlikely for certain words to precede other words. Besides these areas, n-gram modeling is further crucial for Natural Language Processing tasks such as part-of-speech tagging and word similarity, as well as for predictive text input systems for cell phones [Jurafsky, D. and James H. Martin, 2008b].

[...]


[1] The query for Bielefeld University, for example, took 0.32 seconds and yielded approximately 2,670,000 results on google, whereas the same query took 0.41 seconds but, in turn, yielded 3,100,000 results using yahoo’s search engine [both websites were last visited August 3, 2009 11:51 p.m.]

[2] An online version of the actual game can be found at http://www.resourcekt.co.uk/shannon/manualWeb/manual.htm, whereas a detailed description of the procedure can be found at http://cs.fit.edu/~mmahoney/dissertation/entropy1.html.

[3] Please note that ”he” in this context refers to the generic third person pronoun.
