Big Data Analytics

Overview

Module description

This module builds on the material taught in related modules such as Data Analytics using R, Principles of Programming and Programming in Java.

We focus on big data analytics: the algorithms, methods and techniques used for big data storage, processing and analysis. Python is the main programming language used in this module, although Scala or Java can also be used.

The module is highly practical and includes tutorials on modern big data tools for structured and unstructured data, such as Apache Spark, Hadoop MapReduce and Cassandra. You are therefore expected to have excellent knowledge of programming.

Indicative syllabus

    • Introducing big data systems, algorithms and applications
    • Exploring computational complexity for algorithmic design
    • Building algorithms for big data analysis, e.g. aggregations, transformations and actions
    • Introducing big data algorithms for machine learning using distributed data processing pipelines
    • Introducing big data storage using distributed file systems (DFS) and NoSQL systems such as Apache Cassandra
    • Building distributed database clusters for big data storage
    • Comparing greedy and divide-and-conquer algorithms, and exploring advanced data processing structures (Distributed Hash Tables, Resilient Distributed Datasets)
    • Introducing the MapReduce programming model and the Hadoop MapReduce ecosystem
    • Introducing in-memory storage and processing using Redis and Apache Spark
    • Learning programming with Spark, including Spark programming APIs, e.g. PySpark
    • Deploying big data systems on the cloud and using containerised environments for big data analysis
    • Exploring big data libraries for data analysis (TensorFlow, Keras and Spark MLlib)

Learning objectives

By the end of this module, you will be able to:

    • demonstrate sufficient knowledge of Python programming for big data analysis
    • work with different big data formats and data sources, and deploy big data systems on the cloud for data analysis
    • understand the organisation of data for big data storage, processing and analysis
    • understand how modern industry frameworks are used for big data analysis