
Big Data

We are a training, research and development organisation based in the UK.

Big data refers to the techniques and technologies used to gather, organise, process, and draw insights from very large datasets. In this course you will explore cleaning datasets and integrating data from different sources, learn how to group data into clusters using data-mining techniques, and get an introduction to machine learning with R or Python.

We offer a comprehensive set of skills, from the initial concept down to the final outcome.


Course Content

Unit 1 : Big Data


The introduction to Hadoop chapter includes topics such as the high availability of data and scaling, the advantages of Hadoop, and the challenges of using Hadoop. It also covers an introduction to the Hadoop Distributed File System (HDFS) and a comparison between Hadoop and SQL, and mentions the industries that currently use Hadoop. The chapter also briefly covers topics such as data locality, Hadoop architecture, MapReduce and HDFS, and using a Hadoop single-node image (clone).
The introductory chapter on big data covers what big data is, then discusses the various opportunities and challenges that big data offers, and finishes with the main characteristics that big data possesses.
The Hadoop Distributed File System (HDFS) chapter briefly explains HDFS design, data nodes, HDFS federation, Hadoop DFS, the basic file system operations, the anatomy of a file read and write, the block placement policy and modes, configuration files, metadata, the FSCK utility, and the ZooKeeper leader election algorithm, finishing with a small exercise on HDFS.
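To give a rough feel for the block placement idea mentioned above, the sketch below splits a file into fixed-size blocks and assigns each block to several data nodes. This is only a toy illustration of the concept, not Hadoop's actual placement algorithm, and the function names are invented for this example:

```python
# Toy model of HDFS-style block splitting and replica placement.
# Illustrative only -- not Hadoop's real block placement policy.

def split_into_blocks(data: bytes, block_size: int):
    """Split a byte string into fixed-size blocks (HDFS defaults to 128 MB)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes: list, replication: int = 3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"x" * 1000, block_size=256)
print(len(blocks))  # 4 blocks: three of 256 bytes, one of 232
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

Real HDFS is rack-aware when choosing replica nodes; the round-robin choice here only shows that each block ends up on several distinct nodes.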
The chapter on MapReduce covers topics such as the basics of functional programming, Map and Reduce basics and their anatomy, the legacy architecture, job completion and failures, an introduction to shuffling and sorting, splits, partitions and combiners, and optimization techniques. The chapter also covers the types of schedulers and counters, getting data from an RDBMS into HDFS, the distributed cache, YARN, map-side joins with the distributed cache, types of I/O formats, and handling files using CombineFileInputFormat.
Here MapReduce programming is explained in Java. The topics covered in this chapter include word count in MapReduce, sorting files with the Hadoop Configuration API, emulating 'grep', DBInputFormat, job dependency, a discussion of the InputFormat API, and creating a custom data type in Hadoop.
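While the course itself implements word count as a Java MapReduce job, the data flow through the map, shuffle, and reduce phases can be sketched in plain Python. This mimics what the framework does; it is not the Hadoop API:

```python
from collections import defaultdict

# Conceptual sketch of MapReduce word count. Real Hadoop jobs implement
# Mapper/Reducer classes in Java; this just mimics the three phases.

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle/sort: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["Hadoop and Spark", "Spark and Scala"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'hadoop': 1, 'and': 2, 'spark': 2, 'scala': 1}
```

In a real cluster the mappers and reducers run on different nodes and the shuffle moves data over the network; the logic per phase, however, is exactly this simple.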
The chapter on NoSQL briefly discusses ACID in RDBMS, NoSQL BASE, consistency types, and the CAP theorem. The basics of the types of NoSQL databases, columnar databases, TTL, Bloom filters, and compaction are also covered briefly in this chapter.
The HBase chapter first introduces HBase and then discusses its basic concepts, installation, the HBase data model, master and region servers, the basic operations on HBase, catalog tables, block cache and sharding, splits, data modelling, the Java APIs, the buffering process, HBase counters, an introduction to HBase raw scans, HBase filters, and bulk loading, along with some useful case studies on the topics discussed.
The Hive chapter begins with an overview of how to install Hive, then introduces Hive and explains its basic architecture. The chapter also covers Hive services, the shell, the server and HWI, the metastore, HiveQL, a comparison between OLTP and OLAP, working with tables, primitive and complex data types, partitions, functions, bucketed tables and sampling, the RC file format, views, indexes, map-side joins, compression, dynamic substitution, log analysis, and accessing HBase.
The Pig chapter starts with installation, execution types, Pig Latin, data processing, the Grunt shell, and schema-on-read. It then gives an overview of data types, schemas, loading and storing, and grouping and joining. The chapter also briefly explains debugging commands, type casting and validation, splits and multi-query execution, error handling, parameter substitution, Piggy Bank, and writing and loading JSON files using Pig.
The chapter on Sqoop briefly covers Sqoop installation and how to import data in different formats, such as tabular data and CSV files. Topics such as incremental import and free-form query import are also covered here. The chapter also guides you through exporting data to an RDBMS, Hive, and HBase.
The topics covered in the supervised learning chapter are regression and classification models such as logistic regression, support vector machines, and k-nearest neighbours (KNN). The practical implementation of these models is demonstrated using Python.
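To give a flavour of the Python implementations mentioned above, here is a minimal k-nearest-neighbours classifier written from scratch (the course material may well use a library instead; the function names here are invented for this sketch):

```python
import math
from collections import Counter

# Minimal k-nearest-neighbours classifier, written from scratch as a
# sketch of the kind of classification model covered in this chapter.

def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: euclidean(train_points[i], query),
    )
    nearest = [train_labels[i] for i in order[:k]]
    return Counter(nearest).most_common(1)[0][0]

X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
y = ["low", "low", "low", "high", "high", "high"]
print(knn_predict(X, y, (2, 2)))  # "low"
print(knn_predict(X, y, (7, 8)))  # "high"
```

KNN has no training step at all: every prediction scans the stored training set, which is why it serves well as a first classification model before moving on to logistic regression or SVMs.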
This chapter covers Flume installation, an introduction to Flume, and Flume agents such as sources, channels, and sinks. The chapter also briefly discusses logging user information using Java and some basic but very useful Flume commands.
This chapter on Oozie briefly discusses the Oozie workflow, real-world use cases, a brief introduction to ZooKeeper, HBase integration with Hive and Pig, Phoenix, and a proof of concept (POC).
The Spark chapter covers key topics such as what Spark is, how to link against and initialize Spark, and using the shell. It also gives an overview of RDDs, parallelized collections, working with external datasets, Spark functions, key-value pairs, and transformations and actions. Alongside these topics, storage levels, removing data, shared variables, and accumulators are also discussed.
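The distinction between transformations and actions mentioned above is that transformations are lazy: they only describe a computation, and nothing runs until an action forces evaluation. Python generators give a rough analogy of this behaviour (this is an analogy only, not the PySpark API):

```python
# Spark transformations (map, filter, ...) are lazy: nothing executes
# until an action (collect, count, ...) forces evaluation. Python
# generators behave similarly, so they make a handy analogy.

data = range(1, 6)

# "Transformations": build a lazy pipeline; no computation happens yet.
squared = (x * x for x in data)
evens = (x for x in squared if x % 2 == 0)

# "Action": materialising the pipeline finally triggers the computation.
result = list(evens)
print(result)  # [4, 16]
```

In Spark this laziness is what lets the scheduler fuse a chain of transformations into a single pass over the data instead of materialising each intermediate RDD.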

Unit 2 : Apache Spark


Getting started with the first chapter of Apache Spark, this covers an introduction to Spark, the background of Scala, Scala vs Java, interactive Scala, running programs with the Scala compiler, exploring basic types and type inference, defining methods, pattern matching, and setting up the Scala environment on Linux.
The chapter on functional programming covers the basics of functional programming, with an introduction to the differences between OOP and FP.
This chapter covers an overview of topics such as iterating, mapping, filtering, counting, regular expressions, maps, sets, groupBy, flatten, word count, I/O operations, file access, and flatMap.
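Although the chapter teaches these operations in Scala, most of them have close Python analogues, sketched below to make the list above concrete (Python, not Scala, and with invented sample data):

```python
import re
from itertools import chain, groupby

# Python analogues of the Scala collection operations listed in this
# chapter: map, filter, flatten/flatMap, groupBy, and a regex word count.

words = ["Spark", "scala", "spark", "Hadoop", "spark"]

mapped = [w.lower() for w in words]                  # map
filtered = [w for w in mapped if w.startswith("s")]  # filter
flat = list(chain.from_iterable([[1, 2], [3]]))      # flatten / flatMap

# groupBy: group words by first letter. Unlike Scala's groupBy,
# itertools.groupby requires the input to be sorted by the key first.
grouped = {k: list(g) for k, g in groupby(sorted(mapped), key=lambda w: w[0])}

# Word count using a regular-expression tokenizer.
text = "to be or not to be"
counts = {}
for w in re.findall(r"\w+", text):
    counts[w] = counts.get(w, 0) + 1

print(filtered)  # ['spark', 'scala', 'spark', 'spark']
print(counts)    # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The same mental model carries directly over to Scala's `map`, `filter`, `flatten`, and `groupBy` on `List` and friends.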
The object-oriented programming chapter covers the basics of topics such as classes, properties, objects, packaging, traits, imports, inheritance, lists, and the apply function.
The integration chapter briefly explains SBT and the integration of Scala into the Eclipse IDE, and then introduces the integration of SBT with Eclipse.
The Spark Core chapter discusses topics such as a comparison between batch processing and real-time data processing, an introduction to Spark and its architecture, coding Spark jobs in Scala, creating the Spark context, RDDs and operations on them, RDD programming, and broadcast variables.
The Persistence chapter covers configuring and running Spark clusters, exploring a multi-node Spark cluster, cluster management, submitting Spark jobs and running them in cluster mode, developing Spark applications in Eclipse, and tuning and debugging Spark.
This chapter covers a basic introduction to Cassandra, followed by its basic architecture and installation. The chapter then covers communicating with Cassandra, creating a database and a table, inserting, updating, modelling, and deleting data, and creating a web application.
The chapter mainly covers an introduction to Spark and the Cassandra connector, setting up Spark with Cassandra, creating a Spark context to connect to Cassandra, creating Spark RDDs on a Cassandra database, transformations and actions on Cassandra RDDs, an introduction to Amazon Web Services, building a four-node Spark multi-node cluster in AWS, and deploying in production with Mesos and YARN.
The chapter on Spark Streaming covers a basic introduction to Spark Streaming and its architecture, processing distributed log files in real time, discretized streams (DStreams), transformations and actions on streaming data, integration with Flume and Kafka, integration with Cassandra, and monitoring streaming jobs.
In this section we will cover an introduction to Apache Spark SQL, processing text files, JSON and Parquet files, DataFrames, user-defined functions, using Hive, and the local Hive metastore server.
In the Spark MLlib chapter we will learn different machine learning models for big data: regression, classification with decision trees, SVM, and Naïve Bayes, clustering with k-means, and building models with Spark server.
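As a taste of the clustering material, a bare-bones k-means loop can be sketched in pure Python; MLlib runs the same assign-then-update iteration, just distributed across the cluster. The data and function name below are invented for this illustration:

```python
import math

# Bare-bones k-means: repeat two steps -- assign each point to its
# nearest centroid, then move each centroid to the mean of its cluster.
# Spark MLlib distributes exactly this iteration over an RDD/DataFrame.

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
final = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(final)  # [(1.25, 1.5), (8.5, 8.75)] -- one centroid per group
```

The result depends on the initial centroids; production implementations such as MLlib's use smarter initialisation (k-means||) and a convergence test instead of a fixed iteration count.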

Classroom
Australia

Coming soon

Classroom
UAE

Coming soon

Classroom
India

Coming soon

Fee structure for
Australia

“Kindly send your queries to info@academyofdatascience.com”

Fee structure for
UAE

“Kindly send your queries to info@academyofdatascience.com”

Fee structure for
India

“Kindly send your queries to info@academyofdatascience.com”
