COMP09017 2019 Programming for Big Data
This module introduces students to the architectures and tools underpinning the management and processing of large-scale datasets that are too big for conventional approaches. Students will understand these architectures and tools and be able to use them to code solutions, query data from structured, unstructured, and streamed sources, and analyse that data using appropriate algorithms.
Students will also be able to evaluate a variety of Big Data cloud platform providers, e.g. Amazon Web Services (AWS) and Microsoft Azure, in order to deploy and host data solutions.
On completion of this module the learner will/should be able to:
Discuss the problem of managing data at scale and why traditional data management systems are insufficient
Evaluate state-of-the-art architectures, tools & frameworks for working with Big Data
Implement Big Data solutions using a synthesis of different data paradigms e.g. distributed data and streaming data, structured and unstructured data
Compare a variety of Big Data query languages and identify optimum query approaches for a variety of scenarios
Outline some well-known Big Data problem scenarios from a variety of domains, and from students' own experience, and evaluate some standard, state-of-the-art approaches to solving them with appropriate architectures, tools, & frameworks
Evaluate some of the human and organisational issues involved in integrating Big Data solutions across the enterprise, and in current research questions from the domain e.g. ethics, privacy, bias, and cybersecurity
Teaching and Learning Strategies
Each week, a lecture will introduce concepts and technologies.
Weekly Labs will provide students with an opportunity to test-drive these technologies, with hands-on exercises.
Each week’s lab will build into a semester-long project.
Module Assessment Strategies
Problem-based learning will be used in weekly labs, which will build week-on-week to form a semester-long project.
An end-of-semester project will challenge students to integrate and synthesise module knowledge into a cohesive, fully formed piece of assessable work.
Understanding the Big Data Context
Outline the challenges that come with Big Data, and how they break traditional paradigms.
Analysing a typical Big Data Technology Stack
Outline a typical Big Data technology stack and examine the technologies at each layer:
Describe the role of distributed file systems e.g. the Hadoop Distributed File System (HDFS)
Describe the role of a distributed processing system e.g. MapReduce, Apache Spark
Describe some querying approaches for different types of data stores
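As an illustration of the distributed processing layer named above, the following is a minimal single-machine sketch of the MapReduce pattern (map, shuffle, reduce) applied to a word count. It uses only the Python standard library and does not call Hadoop or Spark; the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are hypothetical and chosen for this example only.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, for illustration only.
counts = word_count(["big data big tools", "data at scale"])
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on different nodes and the shuffle would move data across the network; here all three phases run in one process so that the data flow is easy to follow.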
Installing and/or Configuring a Distributed File System e.g. Hadoop
- Downloading and installing a Distributed File System
- Downloading and installing Apache Spark
- Running HDFS on Amazon AWS
- Running HDFS on Azure
Working with Data - Query Languages & Environments
- Relational Data e.g. Hive, MySQL
- Non-Relational Data e.g. HBase & Cassandra
- Streaming Data e.g. Apache Spark
Machine Learning with Spark and Python
- Implementing Machine Learning Algorithms e.g. Linear Regression, Logistic Regression using Spark & Python
- Collaborative Filtering for Recommender Systems
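To give a flavour of the first algorithm listed above, the sketch below fits a one-feature linear regression by ordinary least squares in plain Python. It is not Spark's MLlib implementation, which would distribute the computation across a cluster; the function name `fit_line` and the sample data are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that lies exactly on the line y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

The closed-form solution shown here works when the data fits in memory on one machine; at Big Data scale the same sums would typically be computed in parallel over partitions of the dataset.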
Graph Analytics with GraphX
- Define & Describe a Graph
- Identify scenarios where Graph Databases suit your data
- Analyse data using GraphX
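The idea of defining and analysing a graph can be sketched in a few lines of plain Python. The example below represents a small directed graph as an edge list and computes in-degree and out-degree per vertex, a basic measure that graph libraries such as GraphX also provide. It does not use GraphX itself; the vertex names and the "follows" interpretation of the edges are hypothetical.

```python
from collections import Counter

# A directed graph as a list of (source, destination) edges.
# Read each edge as "source follows destination" (hypothetical data).
edges = [
    ("alice", "bob"),
    ("bob", "carol"),
    ("carol", "alice"),
    ("dave", "alice"),
]

# The vertex set is every name that appears at either end of an edge.
vertices = sorted({vertex for edge in edges for vertex in edge})

# Out-degree: edges leaving a vertex. In-degree: edges arriving at it.
out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)

# The vertex with the highest in-degree is the most "followed".
most_followed = max(vertices, key=lambda v: in_degree[v])
print(vertices, dict(in_degree), most_followed)
```

Data of this kind, where the relationships between records matter as much as the records themselves, is the typical scenario in which a graph model suits the data better than a purely tabular one.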
Building Real World Applications
- Design and implement systems using Big Data architectures, tools & frameworks across a variety of industry domains e.g. business intelligence, recommender systems, Internet of Things, industrial and manufacturing sensors, health informatics
- Consider the ethical implications and risks of your proposed solution within the design process
Coursework & Assessment Breakdown
|No.||Title||Type||Form||Percent||Week||Learning Outcomes Assessed|
|1||Project||Project||Assignment||50 %||End of Semester||1,2,3,4,5,6|
|2||Big Data Implementation I||Continuous Assessment||Assignment||10 %||Week 4||2,3,4,5,6|
|3||Big Data Implementation II||Continuous Assessment||Assignment||20 %||Week 8||2,3,4,5|
|4||Big Data Implementation III||Continuous Assessment||Assignment||20 %||Week 11||3,4,5,6|
Full Time Mode Workload
|Type||Location||Description||Hours||Frequency||Avg Weekly Workload|
|Lecture||Computer Laboratory||Lecture & Computer Lab||3||Weekly||3.00|
|Independent Learning||Not Specified||Independent Research & Reading||4||Weekly||4.00|
Online Learning Mode Workload
|Type||Location||Description||Hours||Frequency||Avg Weekly Workload|
|Directed Learning||Online||Virtual Lab||1.5||Weekly||1.50|
|Independent Learning||Online||Independent Research & Reading||4||Weekly||4.00|
Required & Recommended Book List
2016-08-26 Big Data Analytics with Spark and Hadoop
ISBN 1785884697 ISBN-13 9781785884696
A handy reference guide for data analysts and data scientists to fetch "Value" out of big data analytics using Spark on Hadoop clusters.
About This Book
- Practical tutorial with real-world examples that explores Spark on Hadoop clusters
- This book is based on the latest version of Apache Spark and Hadoop integrated with the most commonly used tools
- Learn about all the Spark stack components including the latest topics such as DataFrames, DataSets, and SparkR
Who This Book Is For
Though this book is primarily aimed at data analysts and data scientists, it will also help architects, programmers, and practitioners. Knowledge of either Spark or Hadoop would be beneficial. It is assumed that you have basic programming background in Scala, Python, SQL, or R programming with basic Linux experience. Working experience within big data environments is not mandatory.
What You Will Learn
- Find out about and implement the tools and techniques of big data analytics using Spark on Hadoop clusters
- Understand all the Hadoop and Spark ecosystem components and how Spark replaced MapReduce
- Get to know all the Spark components: Spark Core, Spark SQL, DataFrames, DataSets, Streaming, MLlib, and GraphX
- See batch and real-time data analytics using Spark Core, Spark SQL, and Spark Streaming
- Get to grips with data science and machine learning using MLlib, H2O, Hivemall, GraphX, and SparkR
- Get an introduction to all the new tools (based on Notebooks, Data Flow, and Spark as a Service) and their integrations with Spark and Hadoop
In Detail
This book explains the fundamentals of Apache Spark and Hadoop, and how they are easily integrated together with the most commonly used tools and techniques. All the Spark components (Spark Core, Spark SQL, DataFrames, DataSets, Streaming, MLlib, and GraphX) and the Hadoop core components (HDFS, MapReduce, and YARN) are explored in greater depth with implementation examples on Spark and Hadoop clusters.
The big data analytics industry is moving away from MapReduce to Spark. In this book, the advantages of Spark over MapReduce are explained at great depth so you can reap the benefits of in-memory speeds. The DataFrames API, Data Sources API, and new DataSets API are explained so you can build big data analytical applications.
We'll explore real-time data analytics using Spark Streaming with Apache Kafka and HBase to help you build streaming applications. You'll get to know the machine learning techniques using MLlib and SparkR, and graph analytics with the GraphX component of Spark.
You will also get the opportunity to start working with web-based notebooks such as Jupyter, Apache Zeppelin, and the data flow tool Apache NiFi to analyze and visualize data.
2014-01-08 Large-Scale Data Analytics Springer Science & Business Media
ISBN 1461492424 ISBN-13 9781461492429
This edited book collects state-of-the-art research related to large-scale data analytics that has been accomplished over the last few years. This is among the first books devoted to this important area based on contributions from diverse scientific areas such as databases, data mining, supercomputing, hardware architecture, data visualization, statistics, and privacy. There is increasing need for new approaches and technologies that can analyze and synthesize very large amounts of data, in the order of petabytes, that are generated by massively distributed data sources. This requires new distributed architectures for data analysis. Additionally, the heterogeneity of such sources imposes significant challenges for the efficient analysis of the data under numerous constraints, including consistent data integration, data homogenization and scaling, privacy and security preservation. The authors also broaden reader understanding of emerging real-world applications in domains such as customer behavior modeling, graph mining, telecommunications, cyber-security, and social network analysis, all of which impose extra requirements for large-scale data analysis. Large-Scale Data Analytics is organized in 8 chapters, each providing a survey of an important direction of large-scale data analytics or individual results of the emerging research in the field. The book presents key recent research that will help shape the future of large-scale data analytics, leading the way to the design of new approaches and technologies that can analyze and synthesize very large amounts of heterogeneous data. Students, researchers, professionals and practitioners will find this book an authoritative and comprehensive resource.
2013-08-23 Big Data Analytics Elsevier
ISBN 0124186645 ISBN-13 9780124186644
Big Data Analytics assists managers by providing an overview of the drivers for introducing big data technology into the organization and by helping them understand the types of business problems best suited to big data analytics solutions, the value drivers and benefits, strategic planning, developing a pilot, and eventually planning to integrate back into production within the enterprise.
- Guides the reader in assessing the opportunities and value proposition
- Overview of big data hardware and software architectures
- Presents a variety of technologies and how they fit into the big data ecosystem