COMP09017 2019 Programming for Big Data

General Details

Full Title
Programming for Big Data
Transcript Title
Programming for Big Data
Code
COMP09017
Attendance
N/A %
Subject Area
COMP - Computing
Department
COEL - Computing & Electronic Eng
Level
09 - NFQ Level 9
Credit
05 - 05 Credits
Duration
Semester
Fee
Start Term
2019 - Full Academic Year 2019-20
End Term
9999 - The End of Time
Author(s)
Diane O'Brien, Therese Hume, Donny Hurley, Mary Loftus
Programme Membership
SG_KDATA_M09 201900 Master of Science in Data Science
Description

This module introduces students to the architectures and tools underpinning the management and processing of large-scale datasets that are too big for conventional approaches. Students will learn how these architectures and tools work and will be able to use them to code solutions, query data from structured, unstructured, and streamed sources, and analyse that data using appropriate algorithms.

Students will also be able to evaluate a variety of Big Data cloud platform providers (e.g. Amazon AWS, Microsoft Azure) in order to deploy and host data solutions.

Learning Outcomes

On completion of this module the learner will/should be able to:

1. Discuss the problem of managing data at scale and why traditional data management systems are insufficient

2. Evaluate state-of-the-art architectures, tools & frameworks for working with Big Data

3. Implement Big Data solutions using a synthesis of different data paradigms e.g. distributed data and streaming data, structured and unstructured data

4. Compare a variety of Big Data query languages and identify optimum query approaches for a variety of scenarios

5. Outline some well-known Big Data problem scenarios from a variety of domains, and from students' own experience, and evaluate some standard, state-of-the-art approaches to solving them with appropriate architectures, tools, & frameworks

6. Evaluate some of the human and organisational issues involved in integrating Big Data solutions across the enterprise, and in current research questions from the domain e.g. ethics, privacy, bias, and cybersecurity

Teaching and Learning Strategies

Each week, a lecture will introduce concepts and technologies. 

Weekly Labs will provide students with an opportunity to test-drive these technologies, with hands-on exercises.

Each week’s lab will build into a semester-long project.

Module Assessment Strategies

Problem-based learning will be used in Weekly Labs, which will build week-on-week to form a semester-long project.

An end-of-semester project will challenge students to integrate and synthesise module knowledge into a cohesive, fully formed piece of assessable work.

Repeat Assessments

Repeat Project

Indicative Syllabus

Understanding the Big Data Context 

Outline the challenges that come with Big Data, and how they break traditional paradigms. 

 

Analysing a typical Big Data Technology Stack 

Outline a typical Big Data technology stack and examine the technologies at each layer: 

Describe the role of distributed file systems e.g. HDFS (the Hadoop Distributed File System)

Describe the role of a distributed processing system e.g. MapReduce, Apache Spark

Describe some querying approaches for different types of data stores
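
As an illustration of the distributed processing layer, the following is a minimal single-machine sketch of the MapReduce pattern (map, shuffle, reduce) applied to a word count. It is plain Python, not Hadoop code; the function names are hypothetical and chosen only for this example, and a real framework would run each phase in parallel across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `word_count(["big data", "Big tools"])` counts "big" twice because the map phase lower-cases each word before emitting it.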
 

Installing and/or Configuring a Distributed File System e.g. Hadoop HDFS

- Downloading and installing a Distributed File System

- Downloading and installing Apache Spark

- Running HDFS on Amazon AWS

- Running HDFS on Azure
 

Working with Data - Query Languages & Environments  

- Relational Data e.g. Hive, MySQL 

- Non-Relational Data e.g. HBase & Cassandra

- Streaming Data e.g. Apache Spark Streaming
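
Queries over streaming data are commonly expressed as aggregations over time windows. The sketch below shows one such query, a count per key in fixed-size (tumbling) windows, in plain Python. It assumes events arrive as `(timestamp_seconds, key)` tuples; that event shape and the function name are assumptions made for illustration and do not come from any particular streaming API.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count keys per fixed-size, non-overlapping time window.

    events: iterable of (timestamp_seconds, key) tuples (assumed shape).
    Returns a dict mapping each window start time to a Counter of keys.
    """
    windows = {}
    for timestamp, key in events:
        # Align the timestamp to the start of the window that contains it.
        window_start = (timestamp // window_seconds) * window_seconds
        windows.setdefault(window_start, Counter())[key] += 1
    return windows
```

A streaming engine would emit each window's result as the window closes rather than holding every window in memory, which this sketch does for simplicity.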
 

Machine Learning with Spark and Python

- Implementing Machine Learning Algorithms e.g. Linear Regression, Logistic Regression using Spark & Python

- Collaborative Filtering for Recommender Systems
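
To make the linear regression item concrete, here is a minimal sketch of ordinary least squares for a single feature in plain Python. It is not Spark code: the function name is hypothetical, and a distributed implementation would compute the same sums across partitions of the data.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    xs, ys: equal-length sequences of numbers with at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

For points lying exactly on a line, such as `xs = [0, 1, 2, 3]` and `ys = [1, 3, 5, 7]`, the fit recovers that line's slope and intercept up to floating-point rounding.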

 

Graph Analytics with GraphX

- Define & Describe a Graph

- Identify scenarios where Graph Databases suit your data

- Analyse data using GraphX 
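
A graph can be described as a set of vertices plus a set of edges between them. The sketch below represents an undirected graph as an adjacency map in plain Python and computes two basic analytics, vertex degree and connected components. It is an illustration only and does not use the GraphX API; all function names are hypothetical.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (source, destination) pairs."""
    adjacency = defaultdict(set)
    for src, dst in edges:
        adjacency[src].add(dst)
        adjacency[dst].add(src)
    return adjacency


def degrees(adjacency):
    """Return the number of neighbours of each vertex."""
    return {vertex: len(neighbours) for vertex, neighbours in adjacency.items()}


def connected_components(adjacency):
    """Return a list of vertex sets, one per connected component (breadth-first search)."""
    seen = set()
    components = []
    for start in list(adjacency):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            vertex = queue.popleft()
            component.add(vertex)
            for neighbour in adjacency[vertex]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components
```

With edges `[("a", "b"), ("b", "c"), ("d", "e")]`, vertex "b" has two neighbours and the graph splits into two components.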

 

Building Real World Applications

- Design and implement systems using Big Data architectures, tools & frameworks across a variety of industry domains e.g. business intelligence, recommender systems, Internet of Things, industrial and manufacturing sensors, health informatics

- Consider the ethical implications and risks of your proposed solution within the design process

Coursework & Assessment Breakdown

Coursework & Continuous Assessment
100 %

Coursework Assessment

No. | Title | Type | Form | Percent | Week | Learning Outcomes Assessed
1 | Project | Project | Assignment | 50 % | End of Semester | 1,2,3,4,5,6
2 | Big Data Implementation I | Continuous Assessment | Assignment | 10 % | Week 4 | 2,3,4,5,6
3 | Big Data Implementation II | Continuous Assessment | Assignment | 20 % | Week 8 | 2,3,4,5
4 | Big Data Implementation III | Continuous Assessment | Assignment | 20 % | Week 11 | 3,4,5,6

Full Time Mode Workload


Type | Location | Description | Hours | Frequency | Avg Workload
Lecture | Computer Laboratory | Lecture & Computer Lab | 3 | Weekly | 3.00
Independent Learning | Not Specified | Independent Research & Reading | 4 | Weekly | 4.00
Total Full Time Average Weekly Learner Contact Time: 3.00 Hours

Online Learning Mode Workload


Type | Location | Description | Hours | Frequency | Avg Workload
Lecture | Online | Lecture | 1.5 | Weekly | 1.50
Directed Learning | Online | Virtual Lab | 1.5 | Weekly | 1.50
Independent Learning | Online | Independent Research & Reading | 4 | Weekly | 4.00
Total Online Learning Average Weekly Learner Contact Time: 3.00 Hours

Required & Recommended Book List

Required Reading
2016-08-26 Big Data Analytics with Spark and Hadoop
ISBN 1785884697 ISBN-13 9781785884696

A handy reference guide for data analysts and data scientists to fetch "Value" out of big data analytics using Spark on Hadoop Clusters

About This Book
* Practical tutorial with real-world examples that explores Spark on Hadoop clusters
* This book is based on the latest version of Apache Spark and Hadoop integrated with the most commonly used tools
* Learn about all the Spark stack components including the latest topics such as DataFrames, DataSets, and SparkR

Who This Book Is For
Though this book is primarily aimed at data analysts and data scientists, it will also help architects, programmers, and practitioners. Knowledge of either Spark or Hadoop would be beneficial. It is assumed that you have basic programming background in Scala, Python, SQL, or R programming with basic Linux experience. Working experience within big data environments is not mandatory.

What You Will Learn
* Find out about and implement the tools and techniques of big data analytics using Spark on Hadoop clusters
* Understand all the Hadoop and Spark ecosystem components and how Spark replaced MapReduce
* Get to know all the Spark components: Spark Core, Spark SQL, DataFrames, DataSets, Streaming, MLlib, and GraphX
* See batch and real-time data analytics using Spark Core, Spark SQL, and Spark Streaming
* Get to grips with data science and machine learning using MLlib, H2O, Hivemall, GraphX, and SparkR
* Get an introduction to all the new tools (based on Notebooks, Data Flow, and Spark as a Service) and their integrations with Spark and Hadoop

In Detail
This book explains the fundamentals of Apache Spark and Hadoop, and how they are easily integrated together with the most commonly used tools and techniques. All the Spark components (Spark Core, Spark SQL, DataFrames, DataSets, Streaming, MLlib, GraphX) and Hadoop core components (HDFS, MapReduce, and YARN) are explored in greater depth with implementation examples on Spark and Hadoop clusters.

The big data analytics industry is moving away from MapReduce to Spark. In this book, the advantages of Spark over MapReduce are explained at great depth so you can reap the benefits of in-memory speeds. The DataFrames API, Data Sources API, and new DataSets API are explained so you can build big data analytical applications.

We'll explore real-time data analytics using Spark Streaming with Apache Kafka and HBase to help you build streaming applications. You'll get to know the machine learning techniques using MLlib and SparkR, and Graph Analytics with the GraphX component of Spark.

You will also get the opportunity to start working with web-based notebooks such as Jupyter, Apache Zeppelin, and the data flow tool Apache NiFi to analyze and visualize data.

Required Reading
2014-01-08 Large-Scale Data Analytics Springer Science & Business Media
ISBN 1461492424 ISBN-13 9781461492429

This edited book collects state-of-the-art research related to large-scale data analytics that has been accomplished over the last few years. This is among the first books devoted to this important area based on contributions from diverse scientific areas such as databases, data mining, supercomputing, hardware architecture, data visualization, statistics, and privacy. There is increasing need for new approaches and technologies that can analyze and synthesize very large amounts of data, in the order of petabytes, that are generated by massively distributed data sources. This requires new distributed architectures for data analysis. Additionally, the heterogeneity of such sources imposes significant challenges for the efficient analysis of the data under numerous constraints, including consistent data integration, data homogenization and scaling, privacy and security preservation. The authors also broaden reader understanding of emerging real-world applications in domains such as customer behavior modeling, graph mining, telecommunications, cyber-security, and social network analysis, all of which impose extra requirements for large-scale data analysis. Large-Scale Data Analytics is organized in 8 chapters, each providing a survey of an important direction of large-scale data analytics or individual results of the emerging research in the field. The book presents key recent research that will help shape the future of large-scale data analytics, leading the way to the design of new approaches and technologies that can analyze and synthesize very large amounts of heterogeneous data. Students, researchers, professionals and practitioners will find this book an authoritative and comprehensive resource.

Required Reading
2013-08-23 Big Data Analytics Elsevier
ISBN 0124186645 ISBN-13 9780124186644

Big Data Analytics will assist managers in providing an overview of the drivers for introducing big data technology into the organization and for understanding the types of business problems best suited to big data analytics solutions, understanding the value drivers and benefits, strategic planning, developing a pilot, and eventually planning to integrate back into production within the enterprise.
* Guides the reader in assessing the opportunities and value proposition
* Overview of big data hardware and software architectures
* Presents a variety of technologies and how they fit into the big data ecosystem

Module Resources