Bioinformatics Visualization in R - Part 2

Published

July 12, 2024

Lecture 2: Introduction to Visualization

Objective:
- Explore advanced visualization techniques and create more complex bioinformatics plots using biological data: density plot, violin plot, dendrogram, heatmap
- Overview of syntax with simulated data
- Explore plots using the iris dataset
- Exercise 2

Load necessary libraries

if (!require("ggplot2")) install.packages("ggplot2")

Loading required package: ggplot2

if (!require("GGally")) install.packages("GGally")

Loading required package: GGally

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

if (!require("pheatmap")) install.packages("pheatmap")

Loading required package: pheatmap

if (!require("dendextend")) install.packages("dendextend")

Loading required package: dendextend



Attaching package: 'dendextend'

The following object is masked from 'package:stats':

    cutree

if (!require("ggridges")) install.packages("ggridges")

Loading required package: ggridges

library(dendextend)
library(ggplot2)
library(GGally)
library(pheatmap)
library(reshape2)
library(dplyr)

Warning: package 'dplyr' was built under R version 4.3.2


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
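
The repeated `if (!require(...))` checks above can be collapsed into a single availability check. This is a minimal sketch rather than part of the original lecture: the names `required_packages`, `is_installed`, and `missing_packages` are illustrative, and the install call is left commented out so nothing is installed unintentionally.

```r
# Packages attached by the library() calls above
required_packages <- c("ggplot2", "GGally", "pheatmap", "dendextend",
                       "reshape2", "dplyr")

# requireNamespace() reports whether a package is available without attaching it
is_installed <- vapply(required_packages, requireNamespace,
                       FUN.VALUE = logical(1), quietly = TRUE)
missing_packages <- required_packages[!is_installed]

# Uncomment to install whatever is missing:
# if (length(missing_packages) > 0) install.packages(missing_packages)

print(missing_packages)
```

Note that reshape2 and dplyr are loaded with `library()` above but have no install check of their own, so a check like this one also covers them.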

Generate the Mock Gene Expression Dataset

# Set random seed for reproducibility
set.seed(42)

# Number of genes and samples
num_genes <- 10
num_samples <- 5

# Generate gene names and sample names
genes <- paste0("Gene_", 1:num_genes)
samples <- paste0("Sample_", 1:num_samples)

# Generate random expression levels
expression_data <- matrix(rnorm(num_genes * num_samples), nrow = num_genes, ncol = num_samples)
colnames(expression_data) <- samples
rownames(expression_data) <- genes

# Convert to data frame
df_expression <- as.data.frame(expression_data)

# Display the first few rows of the dataset
head(df_expression)

         Sample_1   Sample_2   Sample_3   Sample_4   Sample_5
Gene_1  1.3709584  1.3048697 -0.3066386  0.4554501  0.2059986
Gene_2 -0.5646982  2.2866454 -1.7813084  0.7048373 -0.3610573
Gene_3  0.3631284 -1.3888607 -0.1719174  1.0351035  0.7581632
Gene_4  0.6328626 -0.2787888  1.2146747 -0.6089264 -0.7267048
Gene_5  0.4042683 -0.1333213  1.8951935  0.5049551 -1.3682810
Gene_6 -0.1061245  0.6359504 -0.4304691 -1.7170087  0.4328180

Visualization Part 2: Advanced Visualization Techniques

Density Plot: A density plot shows how the data are distributed as a smoothed curve; peaks mark the values where observations are concentrated.

# Melt the data frame to long format for ggplot2 (one row per expression value)
df_melted <- melt(df_expression, variable.name = "Sample", value.name = "Expression")

No id variables; using all as measure variables

# Create the density plot
ggplot(df_melted, aes(x = Expression)) +
  geom_density(fill = "blue", alpha = 0.5) + # transparency
  labs(title = "Density Plot of Gene Expression Levels",
       x = "Expression Level",
       y = "Density") +
  theme_minimal()
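
To see what the smoothed curve represents, the estimate can be computed directly with base R's `density()`. This is a sketch under the assumption that `geom_density()` relies on the same kind of Gaussian kernel estimate. The variable names are illustrative, and the 50 values are regenerated with the same seed as the mock dataset.

```r
set.seed(42)
expression_values <- rnorm(50)  # the same 50 values as the 10 x 5 mock matrix

# Gaussian kernel density estimate with the default bandwidth
kde <- density(expression_values)

# Trapezoid rule: the area under a density curve should be close to 1
area <- sum(diff(kde$x) * (head(kde$y, -1) + tail(kde$y, -1)) / 2)

# x position of the highest peak, i.e. where values are most concentrated
peak_location <- kde$x[which.max(kde$y)]

print(round(area, 2))
print(peak_location)
```

Because the curve integrates to roughly 1, the y-axis shows density rather than counts. A taller peak means a larger share of values falls near that expression level.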

Violin Plot: A violin plot is a data visualization that combines a box plot and a kernel density plot to show the distribution, probability density, and variability of data across different categories.

# Melt the data frame for ggplot2
df_melted <- melt(df_expression, variable.name = "Sample", value.name = "Expression")

No id variables; using all as measure variables

# Create the violin plot with colors
ggplot(df_melted, aes(x = Sample, y = Expression, fill = Sample)) +
  geom_violin(trim = FALSE) +
  labs(title = "Violin Plot of Gene Expression Levels",
       x = "Sample",
       y = "Expression Level") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")
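
The two ingredients a violin plot combines can be computed separately in base R: a five-number summary (the box plot part) and one kernel density estimate per sample (the mirrored outline). The sketch below rebuilds the mock matrix with the same seed; `five_num` and `kdes` are illustrative names, not objects from the lecture.

```r
set.seed(42)
expression_data <- matrix(rnorm(50), nrow = 10,
                          dimnames = list(paste0("Gene_", 1:10),
                                          paste0("Sample_", 1:5)))

# Box plot ingredient: five-number summary for each sample (column)
five_num <- apply(expression_data, 2, fivenum)
rownames(five_num) <- c("min", "lower_hinge", "median", "upper_hinge", "max")

# Density ingredient: one kernel density estimate per sample
kdes <- lapply(as.data.frame(expression_data), density)

print(round(five_num, 2))

# The estimated curve extends beyond the observed data range
print(range(kdes$Sample_1$x))
print(range(expression_data[, "Sample_1"]))
```

The last two lines illustrate why the violins drawn with `trim = FALSE` have tails that reach past the smallest and largest observed values.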

Dendrogram: A dendrogram is a tree-like diagram that displays the arrangement of clusters formed by hierarchical clustering, showing the relationships and distances between data points.

# Generate the hierarchical clustering
hc <- hclust(dist(df_expression), method = "ward.D2")

# Create the dendrogram
dend <- as.dendrogram(hc)
plot(dend, main = "Dendrogram", xlab = "Genes", ylab = "Distance")
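
A dendrogram can also be turned into explicit cluster labels by cutting the tree. The sketch below assumes three clusters, which is an arbitrary choice for illustration. It calls `stats::cutree()` explicitly because dendextend masks `cutree` when attached.

```r
set.seed(42)
expression_data <- matrix(rnorm(50), nrow = 10,
                          dimnames = list(paste0("Gene_", 1:10),
                                          paste0("Sample_", 1:5)))

# Same clustering as above: Euclidean distance between genes, Ward linkage
hc <- hclust(dist(expression_data), method = "ward.D2")

# Cut the tree into k groups (k = 3 is a hypothetical choice)
clusters <- stats::cutree(hc, k = 3)

# Number of genes per cluster, and the heights at which branches merge
print(table(clusters))
print(round(hc$height, 2))
```

The merge heights are the values shown on the dendrogram's distance axis. Genes that join at a low height are more similar to each other than genes that join near the top.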

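Heatmap + Dendrogram: Heatmaps are useful for visualizing matrix-like data, such as gene expression data. By default, pheatmap also clusters the rows and columns and draws the resulting dendrograms along the margins.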

# Create the heatmap
pheatmap(df_expression, scale = "row", main = "Heatmap of Gene Expression")
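
The `scale = "row"` argument standardizes each gene before coloring. To my understanding this is equivalent to computing a z-score per row, which can be sketched in base R as follows (`row_z` is an illustrative name):

```r
set.seed(42)
expression_data <- matrix(rnorm(50), nrow = 10,
                          dimnames = list(paste0("Gene_", 1:10),
                                          paste0("Sample_", 1:5)))

# scale() works on columns, so transpose, scale, and transpose back
row_z <- t(scale(t(expression_data)))

# After scaling, every gene has mean ~0 and standard deviation 1
print(round(rowMeans(row_z), 10))
print(apply(row_z, 1, sd))
```

Row scaling makes the colors comparable across genes with different baseline expression: each cell shows how far a sample lies above or below that gene's own mean.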

Load the iris dataset and prepare it for visualization:

# Load the iris dataset
data("iris")

# Display the first few rows of the dataset
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

summary(iris)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50

Density Plot

# Melt the data frame for ggplot2
df_melted <- melt(iris, id.vars = "Species", variable.name = "Measurement", value.name = "Value")

# Create the density plot with facets for each species
ggplot(df_melted, aes(x = Value, fill = Measurement)) +
  geom_density(alpha = 0.5) + # transparency
  labs(title = "Density Plot of Iris Measurements by Species",
       x = "Value",
       y = "Density") +
  theme_minimal() +
  facet_wrap(~ Species)

Violin Plot

# Melt the data frame for ggplot2
df_melted <- melt(iris, id.vars = "Species", variable.name = "Measurement", value.name = "Value")

# Create the violin plot with colors
ggplot(df_melted, aes(x = Measurement, y = Value, fill = Species)) +
  geom_violin(trim = FALSE) +
  labs(title = "Violin Plot of Iris Measurements",
       x = "Measurement",
       y = "Value") +
  theme_minimal() +
  facet_wrap(~ Measurement, scales = "free")

Dendrogram

# Remove the Species column for clustering
iris_no_species <- iris[, -5]

# Compute the correlation matrix
cor_matrix <- cor(iris_no_species)
cor_matrix

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

# Perform hierarchical clustering on the correlation matrix
hc <- hclust(as.dist(1 - cor_matrix), method = "ward.D2")

# Create the dendrogram
dend <- as.dendrogram(hc)

# Plot the dendrogram
plot(dend, main = "Dendrogram of Iris Attributes", xlab = "Attributes", ylab = "Distance")
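
The clustering above uses 1 minus the correlation as a distance, so strongly correlated attributes end up close together. The sketch below inspects that distance directly; `attribute_dist` and `first_merge` are illustrative names.

```r
# Correlation between the four numeric iris attributes
cor_matrix <- cor(iris[, -5])

# Correlation distance: 0 for perfect positive correlation, up to 2 for perfect negative
attribute_dist <- as.dist(1 - cor_matrix)

hc <- hclust(attribute_dist, method = "ward.D2")

# Negative entries in hc$merge index the original attributes;
# the first row is the closest pair, which is merged first
first_merge <- hc$labels[-hc$merge[1, ]]

print(round(attribute_dist, 4))
print(first_merge)
```

With the correlation matrix shown above, the smallest distance is between Petal.Length and Petal.Width (1 - 0.9629 = 0.0371), so these two attributes form the lowest branch of the dendrogram. Distances above 1 correspond to the negative correlations involving Sepal.Width.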

Heatmap

# Remove the Species column for heatmap plotting
iris_data <- iris[, -5]

# Create the heatmap with a simplified color scale
pheatmap(as.matrix(iris_data), 
         main = "Heatmap of Iris Measurements", 
         cluster_rows = TRUE, cluster_cols = TRUE, 
         show_rownames = FALSE,
         color = colorRampPalette(c("blue", "white", "red"))(50))
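
The `color` argument above receives a plain vector of hex color codes. The short sketch below shows what `colorRampPalette()` returns (`palette_fn`, `colors_50`, and `colors_51` are illustrative names):

```r
# colorRampPalette() returns a function that interpolates between the given colors
palette_fn <- colorRampPalette(c("blue", "white", "red"))

colors_50 <- palette_fn(50)  # the 50 shades passed to pheatmap above
colors_51 <- palette_fn(51)  # an odd count includes the exact midpoint

print(head(colors_50, 3))    # starts at pure blue, "#0000FF"
print(tail(colors_50, 1))    # ends at pure red, "#FF0000"
print(colors_51[26])         # midpoint of the ramp, "#FFFFFF" (white)
```

As far as I know, pheatmap spreads these colors evenly between the smallest and largest value in the matrix unless `breaks` is supplied. In the unscaled iris heatmap, white therefore marks the middle of the observed range, not zero.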

Assignment 2: Part 2

Dr. Smith is studying the famous iris dataset to understand the differences in flower measurements across three species: setosa, versicolor, and virginica. Help Dr. Smith visualize and interpret the data to identify distinguishing features of each species.

Density Plot Analysis: Create a density plot to visualize the distribution of sepal length measurements across the three iris species. Identify which species have similar or distinct sepal length distributions.

# Solution a


# Hint: Look for peaks in the density plots to see which species have similar or distinct sepal length distributions.

Violin Plot Analysis: Use a violin plot to compare the distribution and density of petal widths across the three iris species. Highlight which species have the widest and narrowest petals.

# Solution b


# Hint: Examine the width and shape of the violins to understand the distribution and density of petal widths across species.