
Comprehensive Reanalysis of Genomic Storm (Transcriptomic) Data, Integrating Clinical Variables and Utilizing New and Old Approaches

Bachelor Thesis, 2014, 51 pages

Computer Science - Bioinformatics

Excerpt

Contents

1. Theory
1.1 Normalization
1.2 Comparison of two groups of samples
1.3 Signal Log Ratio Algorithm
1.4 Correlation (r)
1.5 Log2-transformation
1.6 Intensity ratio
1.7 Hypothesis pair
1.8 Threshold for p-value
1.9 Fold change
1.10 Time series
1.11 Microarray preparation
1.12 Probe preparation, hybridization and imaging
1.13 Low level information analysis

2. Introduction
2.1 SIRS, Sepsis and Septic Shock
2.2 Related Background
2.3 .CEL File Description
2.4 Gene Expression Omnibus (GEO)
2.5 KEGG

3. Materials and Methods
3.1 Data
3.2 Data Analysis
3.3 Clustering
3.4 Enrichment tests
3.5 Lagged Correlation
3.6 Additional Information

4. Results
4.1 Differentially Expressed Genes
4.2 Clustering
4.3 Regulation of Some Important Genes
4.3.1 HLA-DMB and LCN2
4.3.2 Correlation of LCN2 and LTF
4.3.3 SLC4A1 and IL5RA
4.4 Gender-Linked Genes
4.5 Gene Set Enrichment Analysis (GSEA)
4.5.1 KEGG Mapper
4.5.2 Glycolysis / Gluconeogenesis
4.5.3 Ribosome
4.6 Toll-Like Receptor Signalling Pathway and Heatmap

5. Discussion

6. References

7. Supplementary

LIST of FIGURES

1. The process of fluorescently labeled RNA probe production (From Affymetrix website)

2. Gene expression data. Each spot represents the expression level of a gene in two different experiments. Yellow or red spots indicate that the gene is expressed in one of the two experiments; green spots show that the gene is expressed at the same level in both experiments

3. Relationship of Infection, SIRS, Sepsis, Severe Sepsis and Septic Shock

4. Histograms of p-values with and without multiple-testing adjustment, in parametric and non-parametric versions

5. Histograms of Log2 Fold Change

6. Hierarchical clustering of all samples

7. Box plots of LCN2 and HLA-DMB

8. Antigen Processing and Presentation Pathway

9. Pearson's product-moment correlation of LCN2 and LTF (r = 0.9441)

10. LTF and LCN2 expression

11. Scatter plot showing correlation of IL5RA with eosinophils (r = 0.6136)

12. Plots of IL5RA and SLC4A1

13. Sex linked genes (outliers identified)

14. Top up- and down-regulated KEGG pathways

15. Box plot of highly up- and down-regulated genes of the Glycolysis pathway

16. Glycolysis / Gluconeogenesis pathway with gene regulation

17. Box plot of highly up- and down-regulated genes of the Ribosome pathway

18. Ribosome pathway with gene regulation

19. TLR signalling pathway with gene regulation

20. TLR genes heatmap

21. Pathogenic Escherichia coli Infection (hsa05130)

22. Aminoacyl-tRNA Biosynthesis (hsa00970)

23. Galactose Metabolism (hsa00052)

LIST of TABLES

1. No. of Differentially Expressed Genes

2. Top Enriched KEGG Pathways

1. Theory

1.1 Normalization [1]

Normalization is the attempt to compensate for systematic technical differences between chips, to see more clearly the systematic biological differences between samples. Differences in treatment of two samples, especially in labelling and in hybridization, bias the relative measures on any two chips.

Systematic non-biological differences between chips are evident in several ways:

- Total brightness differs between chips
- One dye seems stronger than the other (in 2-color systems) on one chip, but not on another
- Typical background is higher on one chip than on another

There are also many non-obvious systematic differences between chips in an experiment, and even between the two channels on a single array. Some causes of systematic measurement variation include:

- Different amounts of RNA
- One dye is more readily incorporated than the other (in 2-color systems)
- The hybridization reaction may proceed more fully to equilibrium on one array than on the other
- Hybridization conditions may vary across an array
- Scanner settings are often different
- And, of course, Murphy’s Law predicts even more variation than can be simply explained
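The excerpt does not single out one correction method, but a minimal sketch of one widely used approach, quantile normalization, illustrates the basic idea: force every chip to share the same overall intensity distribution so that only relative differences between genes remain. The intensities below are hypothetical.

```python
import numpy as np

def quantile_normalize(X):
    """Quantile normalization: make every chip (column) share the same
    intensity distribution, removing chip-wide technical effects."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)   # rank of each probe within its chip
    mean_quantiles = np.sort(X, axis=0).mean(axis=1)    # average distribution over chips
    return mean_quantiles[ranks]                        # map ranks back onto values

# Three hypothetical chips measuring four probes; chip 2 is roughly twice as bright.
X = np.array([[100., 200., 105.],
              [ 50., 110.,  60.],
              [300., 590., 290.],
              [ 20.,  45.,  25.]])
print(quantile_normalize(X))   # all columns now share one distribution
```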

1.2 Comparison of two groups of samples [2]

The simplest and most common experimental set-up is to compare two groups: for example, Treatment vs. Control, or Mutant vs. Wild type. The issues arising in simple comparisons arise also in more complex settings; it is easier to explain these in the simpler context. The long-time standard test statistic for comparing two groups is the t-statistic:

$$t_i = \frac{\bar{x}_{i,1} - \bar{x}_{i,2}}{s_i}$$

where $\bar{x}_{i,1}$ is the mean value of gene $i$ in group 1, $\bar{x}_{i,2}$ is the mean in group 2, and $s_i$ is the (non-pooled) within-groups standard error (SE) for gene $i$.
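As an illustration, per-gene t-statistics can be computed over a whole expression matrix at once; Welch's version (equal_var=False) corresponds to the non-pooled standard error above. The matrix here is simulated, not the thesis data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated expression matrix: 1000 genes x 10 samples (5 per group)
expr = rng.normal(8.0, 1.0, size=(1000, 10))
group1, group2 = expr[:, :5], expr[:, 5:]

# Welch's t-test per gene; non-pooled SE matches the definition above
t, p = stats.ttest_ind(group1, group2, axis=1, equal_var=False)
print(t[:3], p[:3])
```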

1.3 Signal Log Ratio Algorithm [3]

The Signal Log Ratio algorithm estimates the magnitude and direction of change of a gene/transcript when two arrays are compared. Each probe pair on the experiment array is compared to the corresponding probe pair on the baseline array; this cancels differences due to different probe-binding coefficients. The Signal Log Ratio is computed with a one-step Tukey's biweight method, a robust mean of the log ratios of probe-pair intensities across the two arrays. The base-2 log scale is used, so a Signal Log Ratio of 1.0 corresponds to a 2-fold increase in expression level and -1.0 to a 2-fold decrease; no change in expression level is indicated by a Signal Log Ratio of 0. Tukey's biweight method also gives an estimate of the amount of variation in the data, from which confidence intervals are generated. A 95% confidence interval gives a range of values that will include the true value 95% of the time. A small confidence interval implies that the expression measure is more exact, while a large confidence interval reflects more noise and uncertainty in estimating the true level. Since the confidence intervals attached to Signal Log Ratios are computed from variation between probes, they may not reflect the full extent of experimental variation.
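A minimal sketch of the computation described above: a one-step Tukey biweight taken over per-probe-pair log2 ratios. The intensities are hypothetical, and the tuning constants (c = 5, epsilon = 0.0001) are the commonly cited Affymetrix defaults rather than values stated in this excerpt.

```python
import numpy as np

def one_step_tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight: a robust mean that down-weights outliers."""
    m = np.median(values)
    mad = np.median(np.abs(values - m))              # median absolute deviation
    u = (values - m) / (c * mad + epsilon)           # scaled distance from the median
    w = np.where(np.abs(u) < 1, (1 - u**2)**2, 0.0)  # biweight function
    return np.sum(w * values) / np.sum(w)

# Signal Log Ratio: biweight mean of per-probe-pair log2 ratios
experiment = np.array([1200., 950., 1100., 4000., 1050.])  # hypothetical probe intensities
baseline   = np.array([ 600., 500.,  540.,  520.,  510.])
slr = one_step_tukey_biweight(np.log2(experiment / baseline))
print(slr)   # ~1.0 => about a 2-fold increase; the outlier pair is down-weighted
```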

1.4 Correlation (r) [3]

The correlation of two variables represents the degree to which they are related. When two variables are perfectly linearly related, the points in a scatter plot fall on a straight line; correlation measures only this linear relationship. Two summary measures, or correlation coefficients, are most commonly used: Pearson's correlation and Spearman's rho. Both range from -1 (perfectly negative linear relationship) to 1 (perfectly positive linear relationship). It is not wrong to calculate the correlation between variables that are not linearly related, but it does not make much sense: if the variables are not linearly related, the correlation does not describe their relationship effectively, and no conclusions should be based on the correlation coefficient alone. Correlation and the scatter plot are a good example of how numerical and graphical tools complement each other.
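Both coefficients are available in SciPy; a toy example with a near-linear pair of variables:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.1, 2.9, 4.2, 5.1])
y = np.array([2.0, 4.1, 6.2, 7.9, 10.3])

r, p_r = stats.pearsonr(x, y)        # linear relationship
rho, p_rho = stats.spearmanr(x, y)   # monotonic (rank-based) relationship
print(r, rho)                        # both close to 1 for this near-linear pair
```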

1.5 Log2-transformation [3]

Log2-transformation is often used with DNA microarray experiments. Usually the intensity ratio is log2-transformed; the resulting new variable is called the log ratio. An increase of one in the log ratio means that the actual intensity, or expression, has doubled.

1.6 Intensity ratio [3]

The simplest approach is to divide the intensity of a gene in the sample by the intensity level of the same gene in the control.
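Sections 1.5 and 1.6 can be illustrated together in a few lines; the intensities below are invented for illustration.

```python
import numpy as np

sample  = np.array([1500., 400., 800.])   # hypothetical gene intensities, treated sample
control = np.array([ 750., 800., 800.])   # same genes in the control

ratio     = sample / control     # intensity ratio (Section 1.6)
log_ratio = np.log2(ratio)       # log ratio (Section 1.5)
print(ratio)      # [2.0, 0.5, 1.0]
print(log_ratio)  # [1.0, -1.0, 0.0]: +1 = doubled, -1 = halved, 0 = unchanged
```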

1.7 Hypothesis pair [3]

Before applying a test to the data, a hypothesis pair should be formed, consisting of a null hypothesis (H0) and an alternative hypothesis (H1). The hypotheses are always formulated as follows:

H0: There is no difference in means between the compared groups.
H1: There is a difference in means between the compared groups.

1.8 Threshold for p-value [3]

The p-value is usually associated with a statistical test: it is the risk that we reject the null hypothesis when it is actually true. Before testing, a threshold for the p-value should be decided. This is a cut-off below which the results are statistically significant, and above which they are not. Often a threshold of 0.05 is used, which means that, on average, one in every 20 times we conclude by chance alone that the difference between groups is statistically significant when it actually isn't. If the compared groups are large enough, even the tiniest difference can get a significant p-value. In such cases it needs to be carefully weighed whether the statistical significance is just that, statistical significance, or whether a real biological phenomenon is acting in the background.
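Figure 4 refers to multiple-testing adjustment. This excerpt does not say which procedure the thesis uses, but the Benjamini-Hochberg false discovery rate adjustment is a common choice for gene-wise tests and is easy to sketch:

```python
import numpy as np

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg FDR adjustment of raw p-values."""
    p = np.asarray(pvals)
    n = len(p)
    order = np.argsort(p)
    adjusted = p[order] * n / np.arange(1, n + 1)            # p * n / rank
    adjusted = np.minimum.accumulate(adjusted[::-1])[::-1]   # enforce monotonicity
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0, 1)                     # back to original order
    return out

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.27]))
# [0.005, 0.02, 0.05125, 0.05125, 0.27]
```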

1.9 Fold change [3]

Another means to make the distribution of intensity ratios more symmetrical is to calculate the fold change. The fold change is equal to the intensity ratio when the ratio is greater than one; below one, the fold change is equal to the inverse of the intensity ratio.

$$\mathrm{FC} = \begin{cases} r & \text{if } r \ge 1 \\ 1/r & \text{if } r < 1 \end{cases} \qquad \text{where } r \text{ is the intensity ratio}$$

The fold change makes the distribution of the expression values more symmetric, and both under- and over-expressed genes can take values between zero and infinity. Note that the fold change makes the expression values additive in a similar fashion to the log-transformation.
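A direct transcription of the piecewise definition above (the direction of change, up or down, is tracked separately):

```python
def fold_change(ratio):
    """Fold change as defined above: the intensity ratio itself when >= 1,
    otherwise its inverse."""
    return ratio if ratio >= 1 else 1.0 / ratio

print(fold_change(4.0))   # 4.0  (4-fold up)
print(fold_change(0.25))  # 4.0  (4-fold down)
```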

1.10 Time series [3]

In a time series experiment, expression changes are monitored with samples taken at certain time intervals. Although several replicates can be made for every time point, it should be considered whether these replicate chips could be put to better use if they were added to the time series as additional sampling points. That is, it should be weighed whether high precision at every time point is more valuable than the additional information about expression changes that new sampling points (time points) would provide.

1.11 Microarray preparation [4]

Microarrays are commonly prepared on a glass, nylon or quartz substrate. Critical steps in this process include the selection and nature of the DNA sequences that will be placed on the array, and the technique for fixing the sequences on the substrate. Affymetrix, a leading manufacturer of gene chips, uses a method adapted from the semiconductor industry combining photolithography and combinatorial chemistry. The density of oligonucleotides in their GeneChips is reported as about half a million sequences per 1.28 cm² (Affymetrix website).

1.12 Probe preparation, hybridization and imaging [4]

To prepare RNA probes for reacting with the microarray, the first step is isolation of the RNA population from the experimental and control samples. cDNA copies of the mRNAs are synthesized using reverse transcriptase, and the cDNA is then converted to cRNA by in vitro transcription and fluorescently labeled. This probe mixture is cast onto the microarray, where RNAs that are complementary to the molecules on the array hybridize with the immobilized strands. After hybridization and probe washing, the microarray substrate is visualized using a method appropriate to the nature of the substrate; with high-density chips this generally requires very sensitive microscopic scanning of the chip. Oligonucleotide spots that hybridize with the RNA show a signal based on the level of the labeled RNA that hybridized to the specific sequence, whereas dark spots with little or no signal mark sequences that are not represented in the population of expressed mRNAs.

illustration not visible in this excerpt

FIG. 1: The process of fluorescently labeled RNA probe production (From Affymetrix website).

1.13 Low level information analysis [4]

Microarrays measure the target quantity (i.e. relative or absolute mRNA abundance) indirectly, by measuring another physical quantity: the intensity of the fluorescence of the spots on the array for each fluorescent dye. These images must then be transformed into the gene expression matrix. This task is not a trivial one because:

1. The spots corresponding to genes must be identified.
2. The boundaries of the spots must be determined.
3. The fluorescence intensity must be determined relative to the background intensity (a conceptual sketch of steps 2 and 3 follows Fig. 2).

illustration not visible in this excerpt

FIG. 2: Gene expression data. Each spot represents the expression level of a gene in two different experiments. Yellow or red spots indicate that the gene is expressed in one of the two experiments; green spots show that the gene is expressed at the same level in both experiments.
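A conceptual sketch of steps 2 and 3 from the list above: given a segmented spot and its local background, the reported intensity is the background-corrected foreground signal. Real scanner software does considerably more (grid alignment, segmentation, quality flagging); the masks and pixel values here are toy inputs.

```python
import numpy as np

def spot_intensity(image, spot_mask, background_mask):
    """Conceptual spot quantification: mean foreground signal minus
    the median of the local background (steps 2-3 above)."""
    foreground = image[spot_mask].mean()
    background = np.median(image[background_mask])
    return max(foreground - background, 0.0)

# Toy 5x5 image patch with a bright 'spot' in the centre
img = np.full((5, 5), 100.0)            # background level
img[1:4, 1:4] = 900.0                   # spot pixels
spot = np.zeros((5, 5), bool)
spot[1:4, 1:4] = True
print(spot_intensity(img, spot, ~spot))  # 800.0
```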

In conclusion, microarray-based gene expression measurements are still far from giving estimates of mRNA counts per cell in the sample; the measurements are relative by nature. In addition, appropriate normalization should be applied to enable comparisons between genes or samples. It is important to note that even if we had the most precise tools to measure mRNA abundance in the cell, we still would not get a full and exact picture of cell activity, because of post-translational changes.

2. Introduction

Despite continuing advances in intensive care medicine, severe sepsis and septic shock are currently among the most common causes of morbidity and mortality in intensive care. Moreover, the incidence of severe sepsis and septic shock has increased with the ageing of the population over the past decade [5,6,7]. According to the [...]

2.1 SIRS, Sepsis and Septic Shock [8,9,10,11]

For many years, doctors attending intensive care units used a variety of terms to describe illnesses associated with infection, or illnesses that looked like infection. These terms included sepsis, septicaemia, bacteraemia, infection, septic shock, toxic shock, etc. Unfortunately, there were two problems with these terms: first, there were no strict definitions, and the words or phrases were often used incorrectly; second, an emerging body of evidence led us to believe that systemic inflammation, rather than infection, was responsible for multi-organ failure. In the early 1990s a consensus conference between the ACCP and the SCCM laid out a new series of definitions of what is inflammation and what is sepsis. The terminology has come into common usage, albeit with some reservations, and I recommend that you learn and use these definitions.

illustration not visible in this excerpt

FIG.3 : Relationship of Infection, SIRS, Sepsis, Severe Sepsis and Septic Shock [11]

2.2 Related Background

Trauma represents a frequent clinical syndrome characterized by the patient's systemic inflammatory response to infection, and carries a very high mortality rate. Trauma injuries frequently lead to infections, sepsis, and multiple organ failure (MOF) [12,13], which contribute to 51%-61% of late trauma mortality [14]. Traumatic injury, with its potential for infection, was likely a common cause of death for our human ancestors; even today, massive injury remains the most common cause of death for those under the age of 45 years in developed countries [15,16]. Systematic screening approaches are necessary to better diagnose and treat trauma, because it is a complex disease state with time-dependent intra-patient variability [17]. A number of clinical trials for treating late trauma complications have failed, which is believed to be partly due to the inability to identify a proper patient population, as well as to the limited understanding of the interplay of biological processes underlying post-injury inflammatory complications [18,19]. High-throughput transcriptomic data enable researchers to monitor molecular dynamics on a broad scale and to identify promising diagnostic as well as interventional targets. A more comprehensive characterisation of the genomic response to trauma is therefore required in order to increase our understanding of the molecular basis of clinical outcomes, leading to improvements in diagnosis and treatment.

Furthermore, potential influential factors in sepsis, including treatment, age, sex and organ failure, as well as interactions among these factors, are assumed to play a major role in disease progression and are potentially reflected in molecular markers. Only recently has the human injury response been studied systematically at the genomic level, and only now is it beginning to become better understood. Prior work has focused on the role of individual mediators [20,21,22] or processes such as apoptosis and cellular death in nosocomial infections and organ injury after trauma [23]. Circulating blood leukocytes have the capacity to seek out, recognise, and mount an appropriate inflammatory response at the earliest sign of injury. Blood neutrophils, monocytes, and Natural Killer cells are implicated as primary effectors during the initial inflammation and activation of innate immunity. Severe trauma has also been characterised by immunosuppression, seen primarily in the adaptive immune system, with T lymphocyte populations being the most markedly affected cell population [24,25].

2.3 .CEL File Description [26]

The CEL file stores the results of the intensity calculations on the pixel values of the DAT file (which contains the pixel intensity values collected from an Affymetrix scanner). For each feature on the probe array, it stores an intensity value, the standard deviation of the intensity, the number of pixels used to calculate the intensity value, a flag indicating an outlier as calculated by the algorithm, and a user-defined flag indicating that the feature should be excluded from future analysis.
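For illustration, Biopython ships a CEL parser (Bio.Affy.CelFile) exposing exactly these per-feature values. The file name below is hypothetical, and API details vary slightly between Biopython versions (text-mode handles for version-3 CEL files, binary for version 4).

```python
# A minimal sketch using Biopython's CEL parser; 'example.CEL' is a
# hypothetical version-3 (text-format) CEL file.
from Bio.Affy import CelFile

with open("example.CEL") as handle:
    record = CelFile.read(handle)

print(record.intensities.shape)  # per-feature intensity grid
print(record.stdevs[0, 0])       # standard deviation of the first feature
print(record.npix[0, 0])         # number of pixels used for that intensity
```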

2.4 Gene Expression Omnibus (GEO)

GEO is an international public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community [27].
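As a sketch, the third-party GEOparse package can download and parse a series by accession; the accession shown here is illustrative, not a statement of which dataset the thesis used.

```python
# Hypothetical example with the third-party GEOparse package.
import GEOparse

gse = GEOparse.get_GEO(geo="GSE36809", destdir="./geo_cache")  # illustrative accession
print(gse.metadata["title"])
for gsm_name, gsm in list(gse.gsms.items())[:3]:
    # per-sample clinical annotations, e.g. age or sex, live in the metadata
    print(gsm_name, gsm.metadata.get("characteristics_ch1"))
```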

2.5 KEGG

The Kyoto Encyclopaedia of Genes and Genomes, or KEGG as it is commonly called, is a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals. The PATHWAY database records networks of molecular interactions in cells and their variants (specific to particular organisms). KEGG switched its FTP access to a subscription model in July 2011. KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from genomic and molecular-level information.

The Kyoto Encyclopaedia of Genes and Genomes was initiated by the Japanese human genome program in 1995 [28]. According to the developers, KEGG is a "computer representation" of the biological system [29]. The KEGG database can be utilized for modelling and simulation, and for browsing and retrieval of data. It is part of the systems biology approach.

KEGG is best known for its display of biochemical pathways, but many other functions are now available. KEGG is a collection of about 20 databases, which can be divided into three groups covering different biological spaces:

- Genes
  - KEGG Genes - manually curated from completely sequenced genomes
  - DGENES - draft genomes
  - EGENES - genes from EST contigs
  - KEGG Orthology - manually defined ortholog groups based on KEGG pathways and BRITE functional hierarchies
  - KEGG SSDB - sequence similarity scores

- Chemicals and Ligands
  - Ligand

- Systems
  - KEGG Pathway
  - KEGG Brite
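Although FTP access became subscription-based, KEGG's public REST interface (https://rest.kegg.jp) can be used freely for lookups of the kind made later in this thesis; a minimal sketch, where hsa05130 is the pathway shown in Fig. 21:

```python
# Minimal sketch against the public KEGG REST API.
import urllib.request

def kegg_get(path):
    """Fetch a KEGG REST endpoint and return the response as text."""
    with urllib.request.urlopen(f"https://rest.kegg.jp/{path}") as resp:
        return resp.read().decode()

print(kegg_get("list/pathway/hsa")[:300])   # human pathway IDs and names
print(kegg_get("get/hsa05130")[:300])       # flat-file entry for one pathway
```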

[...]
