Lab: Exploring the Spark UI

WARNING: Do not reload or close this page - your answers will not be saved.

1) Notebook: Exploring the Spark UI

Complete the following questions by reviewing the output of the provided notebook.
How many jobs were executed by Cmd 1? (#)
How many jobs were executed by Cmd 2? (#)
In Cmd 3, how many stages did Job #2 consist of? (#)
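
The lab's notebook is not reproduced here, but as a rough sketch of why a single command can launch more than one job (the path and commands below are illustrative, not the lab's actual code; spark is the SparkSession a Databricks notebook provides):

    df = spark.read.parquet("/illustrative/path")  # even reading can launch a
                                                   # small job (schema inference)
    df.count()    # an action: launches at least one job
    df.take(5)    # another action: typically a separate, smaller job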

2) Cluster

Complete the following questions by reviewing the cluster configuration.
What is the name of the pool VMs are pulled from?
Which DBR is this cluster configured to use? (#.#)
Which version of Python is this cluster configured to use? (#)
Which version of Scala is this cluster configured to use? (#.##)
Which version of Apache Spark is this cluster configured to use? (#.#.#)
Which VM type is this cluster configured to use?
How many cores does each VM make available? (#)
How many cores total does this cluster have available? (#)
How many VMs does this cluster consist of? (#)
Is autoscaling enabled?
Is the cluster configured to auto-terminate?
Select the Libraries tab.
Specify the Python library attached to this cluster:
Select the Event Log tab.
What is the name of the most recent cluster event (aka Event Type)?
Select the Driver Logs tab.
What is the size, in bytes, of the log file log4j-active.log? (###,###)
On which tab can you find the Ganglia UI?
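
If you want to cross-check some of these answers from a notebook rather than the cluster UI, a minimal sketch (assuming the spark session a Databricks notebook provides):

    import sys

    print(spark.version)                          # Apache Spark version
    print(sys.version)                            # Python version on the driver
    print(spark.sparkContext.defaultParallelism)  # total task slots, usually
                                                  # equal to the total worker cores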

3) Spark UI - Jobs

Complete the following questions by reviewing the output of the Jobs tab in the Spark UI.
What operation triggered job #0?
What action triggered job #1?
What action triggered job #2?
What action triggered job #3?
How many actions were executed? (#)
Job #0 shows parquet as the key operation that triggered it.
But parquet is not an action, so what exactly was this job reading during those 7 seconds?
How many tasks were executed for job #2? (###)
Open the details page for Job #1.
How many Mebibytes of data were read in? (##.#)
How many seconds did Stage #1 take? (##)
Open the details page for Stage #1 of Job #1
How many records were read in? (#,###,###)
Review the Event Timeline. At a quick glance, does Stage #1 look healthy?
How many times did Apache Spark attempt to execute this stage? (#)
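
A hint for the parquet question: spark.read.parquet is not an action, but Spark still runs a short job to list the files and read the Parquet footers so it can infer the schema. A minimal sketch, with an illustrative path:

    # Triggers a small "parquet" job: file listing plus schema inference
    # from the Parquet footers, even though no action has run yet.
    df = spark.read.parquet("/illustrative/path")

    # An actual action then triggers the job(s) that process the data:
    df.count()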

4) Spark UI - SQL

Complete the following questions by reviewing the output of the SQL tab in the Spark UI.
What job triggered Step E? (#)
What query triggered Step C? (##)
Which query took the most amount of time to execute? (##)
Open the details page for Query #44
How many rows were output by the first filter operation found in WholeStageCodegen (1)? (#,###,###)
How many rows were output by the second filter operation found in WholeStageCodegen (2)? (##)
How many rows were estimated to be output by the second filter operation found in WholeStageCodegen (2)? (#,###,###)
By what factor was the estimate off for the second filter operation found in WholeStageCodegen (2)? (###,###)
How many records were finally returned by this query? (##)
Which Spark action resulted in the execution of this query?
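
For orientation, a query shaped roughly like the sketch below yields a plan with two filter operators in separate WholeStageCodegen blocks, since the shuffle introduced by the aggregation breaks code generation into stages. The column names and path are illustrative, not the lab's actual query:

    from pyspark.sql import functions as F

    df = spark.read.parquet("/illustrative/path")
    result = (df
        .filter(F.col("value") > 0)     # first filter: WholeStageCodegen (1)
        .groupBy("key")
        .count()
        .filter(F.col("count") > 10))   # second filter, after the shuffle:
                                        # WholeStageCodegen (2)
    result.collect()                    # the action that executes the query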

5) Spark UI - Stages

Complete the following questions by reviewing the output of the Stages tab in the Spark UI.
How many stages were executed in all? (##)
How many tasks were executed for stage #3? (###)
What Scheduling Mode is being employed by this application (specifically the acronym)?
What does that acronym stand for?
Open the details page for Stage #6
In the Event Timeline, what is represented by the blue bars?
In the Event Timeline, what is represented by the red bars?
In the Event Timeline, what is represented by the orange bars?
In the Event Timeline, what is represented by the green bars?
Of the colors represented in the Event Timeline, which color do we want to see the most of?
Open the details page for Stage #5
Which job does Stage #5 belong to?
At a quick glance of the Event Timeline, does Stage #5 look healthy?
Contrast the Event Timeline of Stage #5 to Stage #6. Was Stage #6 healthy?
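
The scheduling mode can also be read from the application's configuration; Spark's default is FIFO, with FAIR as the alternative. A minimal check:

    # Returns the configured scheduling mode, defaulting to FIFO if unset.
    print(spark.conf.get("spark.scheduler.mode", "FIFO"))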

6) Spark UI - Storage

Complete the following questions by reviewing the output of the Storage tab in the Spark UI.
How many partitions were cached for RDD #5? (#)
How many Mebibytes of data are in the Parquet IO Cache (aka the Delta Cache) across all executors? (##.#)
What is the maximum capacity, in Gibibytes, of the Parquet IO Cache for the entire cluster? (###.#)
Given that the Delta Cache has already moved the data to the executors, should we also be using the cluster's RAM for caching?
Open the storage details page for RDD #5
Is the cached data for RDD #5 evenly distributed?
How many Mebibytes of data are cached in each partition for RDD #5? (#.#)
How many Mebibytes of the cached data were spilled to disk for RDD #5? (#.#)
What is the final size, in Mebibytes, of the cached data for RDD #5? (##.#)
What percentage of RDD #5 is cached? (###)
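
For context, an entry appears on the Storage tab once a persisted DataFrame or RDD has been materialized by an action. A minimal sketch with an illustrative path; MEMORY_AND_DISK lets partitions that do not fit in RAM spill to disk, which is where the "spilled to disk" figures above come from:

    from pyspark import StorageLevel

    df = spark.read.parquet("/illustrative/path")
    df.persist(StorageLevel.MEMORY_AND_DISK)  # mark for caching
    df.count()                                # action: materializes the cache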

7) Spark UI - Executors

Complete the following questions by reviewing the output of the Executors tab in the Spark UI.
How many cores do you have for parallel processing? (#)
What is the IP Address of the driver? (##.###.###.###)
How many tasks is your cluster able to simultaneously execute? (#)
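
These parallelism figures can be cross-checked from a notebook; Spark runs one task per core, so the number of simultaneously executable tasks normally equals the total executor cores:

    sc = spark.sparkContext
    print(sc.defaultParallelism)  # usually the total executor cores, i.e. the
                                  # maximum number of tasks running at once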

8) Spark UI - Environment

Open the Spark UI and navigate to the Environment tab.
What is the path to your Java Home? (see Runtime Information)
Is speculative execution of tasks enabled (see Spark Properties)?
(Reference: https://spark.apache.org/docs/latest/configuration.html)
Which file encoding is Spark configured to use (see System Properties)?
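
These values can also be read programmatically. spark.speculation is a standard Spark property (disabled by default); java.home and file.encoding are JVM system properties, reachable here through the driver's JVM handle (an internal PySpark attribute, used only as an illustration):

    print(spark.conf.get("spark.speculation", "false"))  # speculative execution

    jvm = spark.sparkContext._jvm     # internal handle; illustrative only
    print(jvm.java.lang.System.getProperty("java.home"))
    print(jvm.java.lang.System.getProperty("file.encoding"))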

9) Structured Streaming

Open the Spark UI and navigate to the Structured Streaming tab.
Investigate Job #2 and answer these questions:
How many active streaming queries are there? (#)
How many completed streaming queries are there? (#)
For the one stream, how many seconds was it running? (##)
On average, how many records were received into the stream each second? (##,###.##)
On average, how many records were processed by the stream each second? (##,###.##)
On average, were records processed faster than they were received?
Navigate to the details for Run c7a226f7-889a-44f6-a266-14890236e4b0 and answer these questions:
How many batches were completed? (##)
What is the name of the stream/query?
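
The same statistics are available from the StreamingQuery objects themselves; recentProgress returns per-batch metrics, including the input and processing rates asked about above:

    for q in spark.streams.active:        # currently active queries
        print(q.name, q.id)
        if q.recentProgress:              # list of per-batch progress dicts
            p = q.recentProgress[-1]
            print(p["inputRowsPerSecond"], p["processedRowsPerSecond"])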

10) Spark UI - Bonus

Complete the following questions by navigating the Spark UI as necessary.
Investigate Job #2 and answer these questions:
Which operation is triggering the shuffle at the end of Stage #2?
Which operation is triggering the shuffle at the end of Stage #3?
Which operation is triggering the shuffle at the end of Stage #4?
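
For reference, the shuffles at these stage boundaries come from wide transformations; any of the operations in the sketch below would produce one (the column names and path are illustrative):

    df = spark.read.parquet("/illustrative/path")
    (df.groupBy("key").count()   # shuffle: aggregation by key
       .orderBy("count")         # shuffle: global sort
       .repartition(8)           # shuffle: explicit repartitioning
       .collect())
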
Investigate Job #2, Stage #3 and answer these questions:
How many records were read in as a result of the previous shuffle operation? (#,###)
How many records were written out as a result of this shuffle operation? (#,###)
Investigate Job #4, Stage #13 and answer these questions:
According to the Summary Metrics, at least one task spent more time in garbage collection than any other.
How many seconds were spent in GC by that task? (#)
According to the Summary Metrics what was the least amount of time spent scheduling tasks? (#.#)
Did this stage run out of RAM while processing its tasks?
What is the average size in GiB that was spilled from memory? (#.#)
What is the average size in MiB that was spilled to disk? (###.#)
What percentage of tasks spilled to disk? (###)
Which task had the longest Scheduler Delay? (#)