# Statistics and Data Science

## Overview

• Credit value: 30 credits at Level 7
• Convenor: Dr Mark Williams
• Assessment: online tests (40%) and two sets of long-form computing problems (60%)

## Module description

In this module we introduce you to the essential statistics and data analysis techniques that underpin modern bioinformatics. Most practical sessions focus on programming in R, which has become the key statistical analysis tool for bioinformatics. Building on this statistical foundation, we then introduce unsupervised and supervised machine learning approaches to data analysis. In later sessions, which assume some prior familiarity with Python, we also introduce other aspects of practical data science: data retrieval, cleaning, construction of relational databases and data visualisation.

### Indicative syllabus

• Descriptive statistics: measures of central tendency and variation
• Discrete probabilities
• Probability density functions
• Common probability distributions: Binomial, Poisson, Normal (Gaussian), Uniform
• Hypothesis testing: formulation of hypotheses for research questions (null hypothesis, research/alternative hypothesis), power of tests, the problems of multiple testing
• Samples and distributions: distribution of the sample mean, standard deviation of the sample mean
• Central limit theorem
• Parametric tests for differences of mean and variance
• Non-parametric tests for differences of median and differences of distribution
• Description of accuracy and precision: standard errors and confidence intervals
• Bootstrapping
• One-sample and two-sample tests for categorical (count) data
• One-way analysis of variance for differences of mean or median (ANOVA and Kruskal-Wallis)
• Correlation of numerical variables
• Linear regression: fitting models to data
• Exploratory data analysis
• Unsupervised and supervised machine learning
• Major data sources in bioinformatics and programmatic API access
• Approaches to data cleaning
• Data modelling and relational database design
• Database queries in SQL
• Data visualisation tools
• Features of the R and Python programming languages required to implement these methods

## Learning objectives

By the end of this module, you will be able to:

• show fluency in the application of a range of statistical ideas and recognise their utility in data analysis and decision making
• choose statistical methods appropriate for the analysis of a problem
• formulate hypotheses about data that can be tested using statistical tools
• show awareness of the assumptions of statistical methods and be able to identify situations where the tools presented in the module are insufficient to analyse data
• identify relevant bioinformatic data sources and programmatically extract data from them
• clean data and impute missing data points with defined values
• model data, and design and query a relational database populated with that data
• use R to generate correct statistical analyses in a variety of commonly encountered scenarios involving biological data
• use R and Python libraries to manipulate and visualise biological data
• select and use appropriate R or Python libraries to apply unsupervised and supervised machine learning methodologies to biological data.