COMP08142 2018 Big Data
This module introduces students to the concept of Big Data. By applying Big Data techniques, students will learn how to approach problems in this field.
Learning Outcomes
On completion of this module the learner will/should be able to:
Discuss the problem of managing data at scale and why traditional data management systems are insufficient.
Describe Big Data programming models such as MapReduce and how to use them on real examples.
Utilise distributed file systems and learn how to manage a cluster.
Query large data sets in near real time and explain the importance of proper query languages for Big Data.
Teaching and Learning Strategies
A practical approach to teaching and learning will be used, with problem-based learning where possible. The one-hour weekly lecture will introduce the core concepts of Big Data analytics. The lab practicals will apply the concepts covered in the lectures and show them working on continuous data collections.
Module Assessment Strategies
Students will be assessed by a final exam contributing 60% of their final grade. An ongoing project, contributing the remaining 40%, will be submitted before the end of term and will consist of implementing and querying a big data cluster. The project will be developed iteratively throughout the semester, with milestones along the way.
Repeat Assessments
Repeat exam and/or project
Indicative Syllabus
Discuss the problem of managing data at scale and why traditional data management systems are insufficient.
- Examining the scale of the problem.
- The possibilities and ethics of big data collection.
Describe Big Data programming models such as MapReduce and how to use them on real examples.
- Discuss the various data management tools in the context of big data (e.g. relational, NoSQL).
- Implement a big data programming model such as MapReduce.
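As an illustration of the MapReduce model named above, the following is a minimal single-machine sketch in Python, not a Hadoop job. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions chosen for this example; on a real cluster the framework performs the shuffle step and distributes the map and reduce calls across nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework on a cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "data at scale"]  # hypothetical input
    print(word_count(sample))
```

The map and reduce functions never share state, which is what allows a framework to run them in parallel on separate partitions of the data.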
Utilise distributed file systems and learn how to manage a cluster.
- Hadoop.
- HDFS.
- Amazon S3.
Query large data sets in near real time and explain the importance of proper query languages for Big Data.
- Utilise a big data query language such as Hive.
- Compare with SQL.
- Use built-in and user-defined functions (UDFs) to manipulate dates, strings, and other data.
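The idea of combining built-in functions with a user-defined function in a query can be sketched as follows. This example uses Python's standard `sqlite3` module as a stand-in for Hive, purely so that it runs without a cluster; Hive's own function names and its mechanism for registering UDFs differ. The table `events`, its columns, the sample rows, and the function `domain_of` are all hypothetical.

```python
import sqlite3


def domain_of(email):
    """Hypothetical UDF: return the lower-cased part of an e-mail address after '@'."""
    if email and "@" in email:
        return email.split("@", 1)[1].lower()
    return None


def run_demo():
    conn = sqlite3.connect(":memory:")
    # Register the Python function so SQL queries can call it by name.
    conn.create_function("domain_of", 1, domain_of)
    conn.execute("CREATE TABLE events (user_email TEXT, created_at TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [
            ("ann@Example.com", "2018-01-15 09:30:00"),
            ("bob@example.com", "2018-01-20 17:05:00"),
            ("cat@test.org", "2018-02-02 11:00:00"),
        ],
    )
    # strftime is a built-in date function; domain_of is the user-defined one.
    rows = conn.execute(
        """
        SELECT strftime('%Y-%m', created_at) AS month,
               domain_of(user_email)          AS domain,
               COUNT(*)                       AS events
        FROM events
        GROUP BY month, domain
        ORDER BY month, domain
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_demo():
        print(row)
```

The query shape (select, group, aggregate) is the part that carries over to a big data query language; the difference lies in how the engine executes it over distributed storage.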
Coursework & Assessment Breakdown
Coursework Assessment
No. | Title | Type | Form | Percent | Week | Learning Outcomes Assessed |
---|---|---|---|---|---|---|
1 | Big Data Project | Project | Project | 40% | Ongoing | 2,3,4 |
End of Semester / Year Assessment
No. | Title | Type | Form | Percent | Week | Learning Outcomes Assessed |
---|---|---|---|---|---|---|
1 | Final Exam | Final Exam | Closed Book Exam | 60% | End of Semester | 1,2,3,4 |
Full Time Mode Workload
Type | Location | Description | Hours | Frequency | Avg Workload |
---|---|---|---|---|---|
Lecture | Not Specified | Lecture | 1 | Weekly | 1.00 |
Laboratory Practical | Computer Laboratory | Practical | 2 | Weekly | 2.00 |
Independent Learning | Not Specified | Independent Learning | 4 | Weekly | 4.00 |
Online Learning Mode Workload
Type | Location | Description | Hours | Frequency | Avg Workload |
---|---|---|---|---|---|
Online Lecture | Distance Learning Suite | Lecture | 1 | Weekly | 1.00 |
Directed Learning | Not Specified | Directed Learning | 1 | Weekly | 1.00 |
Independent Learning | Not Specified | Independent Learning | 5 | Weekly | 5.00 |
Required & Recommended Book List

2015-04-11 Hadoop: The Definitive Guide O'Reilly Media
ISBN 1491901632 ISBN-13 9781491901632
Ready to unlock the power of your data? With the fourth edition of this comprehensive guide, you'll learn how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. This book is ideal for programmers looking to analyze datasets of any size, and for administrators who want to set up and run Hadoop clusters. You'll find illuminating case studies that demonstrate how Hadoop is used to solve specific problems. This edition includes new case studies, updates on Hadoop 2, a refreshed HBase chapter, and new chapters on Crunch and Flume. Author Tom White also suggests learning paths for the book.
- Store large datasets with the Hadoop Distributed File System (HDFS)
- Run distributed computations with MapReduce
- Use Hadoop's data and I/O building blocks for compression, data integrity, serialization (including Avro), and persistence
- Discover common pitfalls and advanced features for writing real-world MapReduce programs
- Design, build, and administer a dedicated Hadoop cluster - or run Hadoop in the cloud
- Load data from relational databases into HDFS, using Sqoop
- Perform large-scale data processing with the Pig query language
- Analyze datasets with Hive, Hadoop's data warehousing system
- Take advantage of HBase for structured and semi-structured data, and ZooKeeper for building distributed systems

2017-04-21 Big-Data Analytics for Cloud, IoT and Cognitive Computing Wiley-Blackwell
ISBN 1119247020 ISBN-13 9781119247029

2012-12-22 MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems O'Reilly Media
ISBN 1449327176 ISBN-13 9781449327170
Design patterns for the MapReduce framework, until now, have been scattered among various research papers, blogs, and books. This handy guide brings together a unique collection of valuable MapReduce patterns that will save you time and effort regardless of the domain, language, or development framework you're using. Each pattern is explained in context, with pitfalls and caveats clearly identified - so you can avoid some of the common design mistakes when modeling your Big Data architecture. This book also provides a complete overview of MapReduce that explains its origins and implementations, and why design patterns are so important. Hadoop MapReduce code is provided to help you learn how to apply the design patterns by example. Topics include:
- Basic patterns, including map-only filter, group by, aggregation, distinct, and limit
- Joins: traditional reduce-side join, reduce-side join with Bloom filter, replicated join with distributed cache, merge join, Cartesian products, and intersections
- Binning, sharding for other systems, sorting, sampling, unions, and other patterns for organizing data
- Job optimization patterns, including multi-job map-only job folding, and overloading the key grouping to perform two jobs at once