Phone: +44 7459 302492 | Email: info@uplatz.com

Apache Spark and Scala

25 Hours
Self-paced Training (pre-recorded videos)
USD 17 (USD 140)
Save 88%. Offer ends on 30-Jun-2024
Apache Spark and Scala course and certification
58 Learners

About this Course

Apache Spark is an open-source, distributed processing system that helps organisations manage big data workloads and perform fast computation. The fundamental data structure of Spark is the RDD, a logical collection of data partitioned across machines. Spark provides in-memory cluster computing and uses Hadoop in two ways, for storage and for processing, which ultimately speeds up processing. Apache Spark is fast, supports multiple languages, and handles advanced analytics such as SQL queries, machine learning and graph algorithms.

Scala, short for "scalable language", is a modern multi-paradigm programming language designed to express common programming patterns concisely, integrating the features of object-oriented and functional languages. As a functional language, Scala supports anonymous functions and higher-order functions.

In this Apache Spark and Scala course by Uplatz, you will make sense of how Spark and Scala are used in an organisation, especially in web applications. You will learn about RDDs, along with Spark installation, configuration and programming. You will then proceed to pattern matching, traits and class concepts in Scala.

------------------------------------------------------------------------------------------------------------------------------------------

Apache Spark and Scala

Course Details & Curriculum

Scala - course syllabus

Introduction to Scala

Introducing Scala, deployment of Scala for Big Data applications and Apache Spark analytics, Scala REPL, Lazy Values, Control Structures in Scala, Directed Acyclic Graph (DAG), First Spark Application Using SBT/Eclipse, Spark Web UI and Spark in Hadoop Ecosystem.

Pattern Matching

The importance of Scala, the concept of REPL (Read Evaluate Print Loop), deep dive into Scala pattern matching, type inference, higher-order functions, currying, traits, application space and Scala for data analysis

Executing the Scala Code

Learning about the Scala Interpreter, static object timer in Scala and testing string equality in Scala, implicit classes in Scala, the concept of currying in Scala and various classes in Scala
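
For a hedged taste of these topics, here is a minimal Scala sketch of currying, an implicit class and string equality; the names ScalaFeatures, IntOps and addTen are invented for the example:

object ScalaFeatures extends App {
  def add(a: Int)(b: Int): Int = a + b      // curried function
  val addTen = add(10) _                    // partial application of the first parameter list
  println(addTen(5))                        // 15

  implicit class IntOps(n: Int) {
    def squared: Int = n * n                // extension method added via an implicit class
  }
  println(3.squared)                        // 9

  println("scala" == "sca" + "la")          // true: == compares strings by value
}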

Classes Concept in Scala

Learning about the Classes concept, understanding the constructor overloading, various abstract classes, the hierarchy types in Scala, the concept of object equality and the val and var methods in Scala
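
A minimal, illustrative sketch of these ideas (the Person class is invented for the example): a primary constructor, an overloaded auxiliary constructor, and the difference between val and var members.

class Person(val name: String, var age: Int) {
  def this(name: String) = this(name, 0)    // constructor overloading via an auxiliary constructor
}

object ClassDemo extends App {
  val p = new Person("Asha", 30)
  p.age = 31                                // a var member can be reassigned
  // p.name = "Priya"                       // would not compile: name is a val
  println(s"${p.name}, ${p.age}")           // Asha, 31
}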

Case Classes and Pattern Matching

Understanding sealed traits and the wildcard, constructor, tuple, variable and constant patterns
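
The sketch below, with invented names (Shape, Circle, Rectangle), illustrates a sealed trait matched with constructor, tuple, constant, variable and wildcard-style patterns:

sealed trait Shape
case class Circle(radius: Double) extends Shape
case class Rectangle(width: Double, height: Double) extends Shape

object MatchDemo extends App {
  def describe(x: Any): String = x match {
    case Circle(r)       => s"circle of radius $r"        // constructor pattern
    case Rectangle(w, h) => s"rectangle $w x $h"           // constructor pattern
    case (a, b)          => s"tuple of $a and $b"          // tuple pattern
    case 42              => "the constant 42"              // constant pattern
    case other           => s"something else: $other"      // variable pattern (catch-all)
  }

  println(describe(Circle(2.0)))        // circle of radius 2.0
  println(describe((1, "two")))         // tuple of 1 and two
}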

Concepts of Traits with Example

Understanding traits in Scala, the advantages of traits, linearization of traits, the Java equivalent and avoiding boilerplate code
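
A minimal sketch of stackable traits and their linearization, using invented Logger names; the rightmost mixed-in trait runs first, and each super call moves left along the linearization:

trait Logger { def log(msg: String): Unit = println(msg) }
trait TimestampLogger extends Logger {
  override def log(msg: String): Unit = super.log(s"${java.time.Instant.now}: $msg")
}
trait ShortLogger extends Logger {
  override def log(msg: String): Unit = super.log(msg.take(20))   // truncate, then pass along
}

object TraitDemo extends App {
  val logger = new Logger with TimestampLogger with ShortLogger {}
  logger.log("a rather long message that gets truncated and timestamped")
}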

Scala–Java Interoperability

Implementation of traits in Scala and Java and handling the extension of multiple traits

Scala Collections

Introduction to Scala collections, classification of collections, the difference between Iterator and Iterable in Scala and example of list sequence in Scala

Mutable Collections Vs. Immutable Collections

The two types of collections in Scala, mutable and immutable collections; understanding lists and arrays in Scala, the list buffer and array buffer, queues and double-ended queues (Deque) in Scala, Stacks, Sets, Maps and Tuples in Scala
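
The sketch below contrasts immutable and mutable collections in a minimal, illustrative way:

import scala.collection.mutable

object CollectionsDemo extends App {
  // Immutable collections: every "update" returns a new collection.
  val xs = List(1, 2, 3)
  val ys = 0 :: xs                           // xs itself is unchanged
  val m  = Map("a" -> 1) + ("b" -> 2)

  // Mutable collections: updated in place.
  val buf = mutable.ListBuffer(1, 2, 3)
  buf += 4
  val arr = mutable.ArrayBuffer("x", "y")
  arr += "z"
  val q = mutable.Queue(1, 2); q.enqueue(3)

  println(ys)             // List(0, 1, 2, 3)
  println(m)              // Map(a -> 1, b -> 2)
  println(buf)            // ListBuffer(1, 2, 3, 4)
  println(arr)            // ArrayBuffer(x, y, z)
  println(q.dequeue())    // 1
}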

Use Case Bobsrockets Package

Introduction to Scala packages and imports, the selective imports, the Scala test classes, introduction to JUnit test class, JUnit interface via JUnit 3 suite for Scala test, packaging of Scala applications in Directory Structure and examples of Spark Split and Spark Scala

--------------------------------------------------------------------------------------------------------------------------------------------------------

Spark - course syllabus

Introduction to Spark

Introduction to Spark, how Spark overcomes the drawbacks of working on MapReduce, understanding in-memory MapReduce, interactive operations on MapReduce, the Spark stack, fine- vs. coarse-grained updates, Spark Hadoop YARN, HDFS revision, YARN revision, an overview of Spark and how it is better than Hadoop, deploying Spark without Hadoop, the Spark history server and the Cloudera distribution

Spark Basics

Spark installation guide, Spark configuration, memory management, executor memory vs. driver memory, working with Spark Shell, the concept of resilient distributed datasets (RDD), learning to do functional programming in Spark and the architecture of Spark

Working with RDDs in Spark

Spark RDDs, creating RDDs, RDD partitioning, operations and transformations on RDDs, deep dive into Spark RDDs, general RDD operations, a read-only partitioned collection of records, using RDDs for faster and more efficient data processing, RDD actions such as collect, count, collectAsMap and saveAsTextFile, and pair RDD functions
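
A minimal, self-contained sketch (assuming a local Spark installation; the app name and data are illustrative) of creating an RDD, applying a transformation and triggering it with actions:

import org.apache.spark.sql.SparkSession

object RddDemo extends App {
  val spark = SparkSession.builder().appName("RddDemo").master("local[*]").getOrCreate()
  val sc = spark.sparkContext

  val nums = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 2)   // an RDD with 2 partitions
  val squares = nums.map(n => n * n)                             // transformation: lazy
  println(squares.collect().mkString(", "))                      // action: 1, 4, 9, 16, 25
  println(nums.count())                                          // action: 5

  spark.stop()
}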

Aggregating Data with Pair RDDs

Understanding the concept of Key–Value pair in RDDs, learning how Spark makes MapReduce operations faster, various operations of RDD, MapReduce interactive operations, fine and coarse-grained update and Spark stack
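
A hedged word-count style sketch of key-value aggregation, intended to be pasted into spark-shell (where sc is provided automatically):

val words  = sc.parallelize(Seq("spark", "scala", "spark", "rdd", "spark"))
val pairs  = words.map(w => (w, 1))            // build key-value pairs
val counts = pairs.reduceByKey(_ + _)          // aggregate values per key
counts.collect().foreach(println)              // (scala,1) (rdd,1) (spark,3), in some order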

Writing and Deploying Spark Applications

Comparing Spark applications with the Spark Shell, creating a Spark application using Scala or Java, deploying a Spark application, a Scala-built application, creation of mutable lists, sets and set operations, lists, tuples, concatenating lists, creating an application using SBT, deploying an application using Maven, the web user interface of a Spark application, a real-world example of Spark and configuring Spark
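
For orientation only, here is a hedged build.sbt sketch for packaging a Spark application with SBT and submitting it with spark-submit; the project name, versions and main class are illustrative, not prescribed by the course:

// build.sbt (illustrative versions)
name := "spark-demo"
version := "0.1"
scalaVersion := "2.12.18"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.2" % "provided"

// Package with `sbt package`, then deploy, for example:
//   spark-submit --class com.example.Main --master yarn target/scala-2.12/spark-demo_2.12-0.1.jar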

Parallel Processing

Learning about Spark parallel processing, deploying on a cluster, introduction to Spark partitions, file-based partitioning of RDDs, understanding of HDFS and data locality, mastering the technique of parallel operations, comparing repartition and coalesce and RDD actions
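
An illustrative spark-shell fragment contrasting repartition and coalesce (the HDFS path is a placeholder):

val rdd = sc.textFile("hdfs:///data/big.txt", minPartitions = 8)
println(rdd.getNumPartitions)
val fewer = rdd.coalesce(4)        // narrow dependency: merges partitions, avoids a full shuffle
val more  = rdd.repartition(16)    // wide dependency: shuffles data to rebalance partitions
println(s"${fewer.getNumPartitions} -> ${more.getNumPartitions}")   // 4 -> 16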

Spark RDD Persistence

The execution flow in Spark, an overview of RDD persistence, Spark execution flow and Spark terminology, distributed shared memory vs. RDDs, RDD limitations, Spark shell arguments, distributed persistence, RDD lineage, and Key–Value pair operations and implicit conversions such as countByKey, reduceByKey, sortByKey and aggregateByKey
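
A hedged spark-shell sketch of RDD persistence; the path is a placeholder:

import org.apache.spark.storage.StorageLevel

val logs   = sc.textFile("hdfs:///data/app-logs")
val errors = logs.filter(_.contains("ERROR")).persist(StorageLevel.MEMORY_ONLY)
println(errors.count())               // first action computes and caches the partitions
errors.take(5).foreach(println)       // later actions reuse the cached data instead of re-reading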

Spark MLlib

Introduction to Machine Learning, types of Machine Learning, introduction to MLlib, various ML algorithms supported by MLlib, Linear Regression, Logistic Regression, Decision Tree, Random Forest, K-means clustering techniques and building a Recommendation Engine

Hands-on Exercise: Building a Recommendation Engine
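
As a hedged, toy-scale sketch of this exercise (tiny in-memory ratings instead of a real dataset; column names and parameters are assumptions), MLlib's ALS can be used roughly like this:

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object RecommenderSketch extends App {
  val spark = SparkSession.builder().appName("RecommenderSketch").master("local[*]").getOrCreate()
  import spark.implicits._

  // Toy ratings: (userId, itemId, rating)
  val ratings = Seq((0, 10, 4.0f), (0, 11, 1.0f), (1, 10, 5.0f), (1, 12, 2.0f), (2, 11, 3.0f))
    .toDF("userId", "itemId", "rating")

  val als = new ALS()
    .setUserCol("userId").setItemCol("itemId").setRatingCol("rating")
    .setRank(5).setMaxIter(5).setRegParam(0.1)

  val model = als.fit(ratings)
  model.recommendForAllUsers(2).show(false)   // top-2 item recommendations per user

  spark.stop()
}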

Integrating Apache Flume and Apache Kafka

Why Kafka, what is Kafka, Kafka architecture, Kafka workflow, configuring Kafka cluster, basic operations, Kafka monitoring tools and integrating Apache Flume and Apache Kafka

Hands-on Exercise: Configuring Single Node Single Broker Cluster, Configuring Single Node Multi Broker Cluster, Producing and consuming messages and integrating Apache Flume and Apache Kafka

Spark Streaming

Introduction to Spark Streaming, features of Spark Streaming, Spark Streaming workflow, initializing StreamingContext, Discretized Streams (DStreams), Input DStreams and Receivers, transformations on DStreams, Output Operations on DStreams, Windowed Operators and why it is useful, important Windowed Operators and Stateful Operators

Hands-on Exercise:  Twitter Sentiment Analysis, streaming using netcat server, Kafka–Spark Streaming and Spark–Flume Streaming
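
A minimal sketch of the netcat exercise above (run `nc -lk 9999` in another terminal); the object and app names are illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount extends App {
  val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
  val ssc  = new StreamingContext(conf, Seconds(5))          // 5-second micro-batches

  val lines  = ssc.socketTextStream("localhost", 9999)       // input DStream with a receiver
  val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  counts.print()                                             // output operation on the DStream

  ssc.start()
  ssc.awaitTermination()
}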

Improving Spark Performance

Introduction to various variables in Spark like shared variables and broadcast variables, learning about accumulators, the common performance issues and troubleshooting the performance problems
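
A hedged spark-shell sketch of a broadcast variable and an accumulator (the lookup table and counter name are invented):

val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))   // read-only value shipped once to each executor
val misses = sc.longAccumulator("misses")            // counter that executors add to and the driver reads

val data   = sc.parallelize(Seq("a", "b", "c"))
val mapped = data.map { k =>
  lookup.value.getOrElse(k, { misses.add(1); 0 })    // count keys missing from the broadcast table
}
println(mapped.collect().toList)                     // List(1, 2, 0)
println(misses.value)                                // 1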

Spark SQL and Data Frames

Learning about Spark SQL, the context of SQL in Spark for providing structured data processing, JSON support in Spark SQL, working with XML data, parquet files, creating Hive context, writing Data Frame to Hive, reading JDBC files, understanding the Data Frames in Spark, creating Data Frames, manual inferring of schema, working with CSV files, reading JDBC tables, Data Frame to JDBC, user-defined functions in Spark SQL, shared variables and accumulators, learning to query and transform data in Data Frames, how Data Frame provides the benefit of both Spark RDD and Spark SQL and deploying Hive on Spark as the execution engine
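
A minimal sketch of Data Frames and SQL queries side by side (the data and names are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SqlDemo extends App {
  val spark = SparkSession.builder().appName("SqlDemo").master("local[*]").getOrCreate()
  import spark.implicits._

  val people = Seq(("Alice", 34), ("Bob", 29), ("Cara", 41)).toDF("name", "age")
  people.createOrReplaceTempView("people")

  spark.sql("SELECT name FROM people WHERE age > 30").show()   // SQL on a temp view
  people.filter($"age" > 30).select(upper($"name")).show()     // the equivalent DataFrame API

  spark.stop()
}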

Scheduling/Partitioning

Learning about the scheduling and partitioning in Spark, hash partition, range partition, scheduling within and around applications, static partitioning, dynamic sharing, fair scheduling, Map partition with index, the Zip, GroupByKey, Spark master high availability, standby masters with ZooKeeper, Single-node Recovery with Local File System and High Order Functions

----------------------------------------------------------------------------------------------------------------------------------------------------

Job Prospects

------------------------------------------------------------------------------------------------------------------------------

Apache Spark and Scala Interview Questions

------------------------------------------------------------------------------------------------------------------------------

1. Compare MapReduce with Spark.

Criteria                         MapReduce      Spark
Processing speed                 Good           Excellent (up to 100 times faster)
Data caching                     Hard disk      In-memory
Performing iterative jobs        Average        Excellent
Dependency on Hadoop             Yes            No
Machine Learning applications    Average        Excellent

 

2. What is Apache Spark?

Spark is a fast, easy-to-use, and flexible data processing framework. It has an advanced execution engine supporting a cyclic data flow and in-memory computing. Apache Spark can run standalone, on Hadoop, or in the cloud and is capable of accessing diverse data sources including HDFS, HBase, and Cassandra, among others.

 

3. Explain the key features of Spark.

• Apache Spark allows integrating with Hadoop.

• It has an interactive language shell, Scala (the language in which Spark is written).

• Spark consists of RDDs (Resilient Distributed Datasets), which can be cached across the computing nodes in a cluster.

• Apache Spark supports multiple analytic tools that are used for interactive query analysis, real-time analysis, and graph processing

 

4. Define RDD.

RDD is the acronym for Resilient Distributed Datasets, a fault-tolerant collection of operational elements that run in parallel. The partitioned data in an RDD is immutable and distributed. There are primarily two types of RDDs:

• Parallelized collections: created by parallelizing an existing collection in the driver program so that it can be operated on in parallel

• Hadoop datasets: created from files in HDFS or any other Hadoop-supported storage system

 

5. What does a Spark Engine do?

A Spark engine is responsible for scheduling, distributing, and monitoring the data application across the cluster.

 

6. Define Partitions.

As the name suggests, a partition is a smaller, logical division of data, similar to a ‘split’ in MapReduce. Partitioning is the process of deriving logical units of data to speed up data processing. Every RDD in Spark is partitioned.

 

7. What operations does an RDD support?

• Transformations

• Actions

 

8. What do you understand by Transformations in Spark?

Transformations are functions applied to RDDs that result in another RDD. They do not execute until an action occurs. Functions such as map() and filter() are examples of transformations: map() applies the supplied function to every element of the RDD and produces a new RDD, while filter() creates a new RDD by selecting the elements of the current RDD that pass the function argument.
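
An illustrative spark-shell fragment showing that map() and filter() build new RDDs lazily:

val lines     = sc.parallelize(Seq("spark is fast", "scala is concise"))
val upper     = lines.map(_.toUpperCase)            // transformation: returns a new RDD, runs nothing yet
val onlySpark = upper.filter(_.contains("SPARK"))   // transformation: still lazy
println(onlySpark.collect().mkString(" | "))        // SPARK IS FAST (the action triggers execution)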

 

9. Define Actions in Spark.

In Spark, an action helps in bringing back data from an RDD to the local machine. Actions are RDD operations that give non-RDD values. The reduce() function is an action that combines elements repeatedly until only one value is left. The take(n) action brings the first n elements of the RDD to the local node.
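
An illustrative spark-shell fragment of the actions mentioned above:

val nums = sc.parallelize(1 to 10)
println(nums.reduce(_ + _))            // 55: combines elements until a single value is left
println(nums.take(3).mkString(","))    // 1,2,3: brings the first n elements to the driver
println(nums.count())                  // 10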

 

10. Define the functions of Spark Core.

Serving as the base engine, Spark Core performs various important functions like memory management, monitoring jobs, providing fault-tolerance, job scheduling, and interaction with storage systems.

 

11. What is RDD Lineage?

Spark does not support data replication in memory; thus, if any data is lost, it is rebuilt using RDD lineage. RDD lineage is the process of reconstructing lost data partitions. The best thing about it is that RDDs always remember how they were built from other datasets.

 

12. What is Spark Driver?

The Spark driver is the program that runs on the master node and declares transformations and actions on data RDDs. In simple terms, the driver creates the SparkContext, which connects to the given Spark master. It also delivers the RDD graphs to the master, where the standalone cluster manager runs.

 

13. What is Hive on Spark?

Hive contains significant support for Apache Spark, wherein Hive execution can be configured to use Spark as its execution engine:

hive> set spark.home=/location/to/sparkHome;

hive> set hive.execution.engine=spark;

Hive supports Spark on YARN mode by default.

 

14. Name the commonly used Spark Ecosystems.

• Spark SQL (Shark) for developers

• Spark Streaming for processing live data streams

• GraphX for generating and computing graphs

• MLlib (Machine Learning Algorithms)

• SparkR to promote R programming in the Spark engine

 

15. Define Spark Streaming.

Spark supports stream processing through Spark Streaming, an extension of the Spark API that allows processing of live data streams. Data from sources such as Kafka, Flume and Kinesis is processed and then pushed to file systems, live dashboards and databases. It is similar to batch processing in that the input data is divided into micro-batches, which are processed like batches in batch processing.

 

16. What is GraphX?

Spark uses GraphX for graph processing to build and transform interactive graphs. The GraphX component enables programmers to reason about structured data at scale.

 

17. What does MLlib do?

MLlib is a scalable Machine Learning library provided by Spark. It aims at making Machine Learning easy and scalable, with common learning algorithms and use cases such as clustering, regression, collaborative filtering, dimensionality reduction and the like.

 

18. What is Spark SQL?

Spark SQL, which evolved from the earlier Shark project, is a module in Spark for structured data processing. Through this module, Spark executes relational SQL queries on data. The core of this component supports a special RDD called SchemaRDD (known as a DataFrame since Spark 1.3), composed of row objects and schema objects that define the data type of each column in a row. It is similar to a table in a relational database.

 

19. What is a Parquet file?

Parquet is a columnar file format supported by many other data processing systems. Spark SQL performs both read and write operations on Parquet files and considers it to be one of the best Big Data analytics formats so far.

 

20. What file systems does Apache Spark support?

• Hadoop Distributed File System (HDFS)

• Local file system

• Amazon S3

 

21. What is YARN?

YARN (Yet Another Resource Negotiator) is Hadoop’s central resource management platform, and Spark can run on it to deliver scalable operations across the cluster. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.

 

22. List the functions of Spark SQL.

Spark SQL is capable of:

• Loading data from a variety of structured sources

• Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC), e.g., using Business Intelligence tools like Tableau

• Providing rich integration between SQL and the regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more.

 

23. What are the benefits of Spark over MapReduce?

• Due to the availability of in-memory processing, Spark implements data processing 10–100x faster than Hadoop MapReduce, which, on the other hand, makes use of persistent storage for its data processing tasks.

• Unlike Hadoop, Spark provides in-built libraries to perform multiple tasks using batch processing, streaming, Machine Learning, and interactive SQL queries. Hadoop, however, only supports batch processing.

• Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.

• Spark is capable of performing computations multiple times on the same dataset, which is called iterative computation, whereas Hadoop does not implement iterative computing.

 

24. Is there any benefit of learning MapReduce?

Yes, MapReduce is a paradigm used by many Big Data tools, including Apache Spark. It becomes extremely relevant to use MapReduce when data grows bigger and bigger. Most tools like Pig and Hive convert their queries into MapReduce phases to optimize them better.

 

25. What is Spark Executor?

When the SparkContext connects to a Cluster Manager, it acquires executors on the nodes in the cluster. Executors are Spark processes that run computations and store data on worker nodes. The final tasks are transferred by the SparkContext to the executors for execution.

 

26. Name the types of Cluster Managers in Spark.

The Spark framework supports three major types of Cluster Managers.

• Standalone: A basic Cluster Manager to set up a cluster

• Apache Mesos: A generalized/commonly-used Cluster Manager, running Hadoop MapReduce and other applications

• YARN: A Cluster Manager responsible for resource management in Hadoop

 

27. What do you understand by a Worker node?

A worker node refers to any node that can run the application code in a cluster.

 

28. What is PageRank?

PageRank, a key algorithm in GraphX, measures the importance of each vertex in a graph. For instance, an edge from u to v represents an endorsement of v‘s importance by u. In simple terms, if a user on Instagram is followed by many people, that user will be ranked highly on the platform.

 

29. Do you need to install Spark on all the nodes of the YARN cluster while running Spark on YARN?

No, because Spark runs on top of YARN.

 

30. Illustrate some demerits of using Spark.

Since Spark utilizes more memory than Hadoop MapReduce, certain problems can arise. Developers need to be careful while running their applications on Spark. To resolve the issue, they can distribute the workload over multiple clusters instead of running everything on a single node.

 

31. How to create an RDD?

Spark provides two methods to create an RDD:

• By parallelizing a collection in the driver program, using SparkContext’s parallelize method:

val IntellipaatData = Array(2, 4, 6, 8, 10)
val distIntellipaatData = sc.parallelize(IntellipaatData)

• By loading an external dataset from external storage such as HDFS or a shared file system

 

32. What are Spark DataFrames?

When a dataset is organized into SQL-like named columns, it is known as a DataFrame. This is conceptually equivalent to a table in a relational database or a data frame in R or Python. The difference is that Spark DataFrames are distributed and optimized for Big Data workloads.

 

33. What are Spark Datasets?

Datasets are data structures in Spark (added in Spark 1.6) that provide the JVM-object benefits of RDDs (the ability to manipulate data with lambda functions) together with the optimized execution engine of Spark SQL.

 

34. Which languages can Spark be integrated with?

Spark can be integrated with the following languages:

• Python, using the Spark Python API

• R, using the SparkR API

• Java, using the Spark Java API

• Scala, using the Spark Scala API

 

35. What do you mean by in-memory processing?

In-memory processing refers to the instant access of data from physical memory whenever the operation is called for. This methodology significantly reduces the delay caused by the transfer of data. Spark uses this method to access large chunks of data for querying or processing.

 

36. What is lazy evaluation?

Spark implements a functionality wherein, if you create an RDD from an existing RDD or from a data source, the RDD is not materialized until it is acted upon, i.e., until an action is called. This avoids unnecessary memory and CPU usage, which matters especially in Big Data analytics.
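
A hedged spark-shell illustration (the path is a placeholder): no job runs until the final action.

val events   = sc.textFile("hdfs:///data/events.csv")   // no Spark job yet
val parsed   = events.map(_.split(","))                 // still no job
val nonEmpty = parsed.filter(_.nonEmpty)                // still no job
println(nonEmpty.count())                               // the action materializes the whole lineage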

 

------------------------------------------------------------------------------------------------------------------------------


Didn't find what you are looking for?  Contact Us