Lab: Exploring the Spark UI
WARNING: Do not reload or close this page - your answers will not be saved.

1) Notebook: Exploring the Spark UI
Complete the following questions by reviewing the output of the provided notebook.
How many jobs were executed by Cmd 1? | (#) |
How many jobs were executed by Cmd 2? | (#) |
In Cmd 3, how many stages did Job #2 consist of? | (#) |

2) Cluster
Complete the following questions by reviewing the cluster configuration.
What is the name of the pool VMs are pulled from? | |
Which DBR is this cluster configured to use? | (#.#) |
Which version of Python is this cluster configured to use? | (#) |
Which version of Scala is this cluster configured to use? | (#.##) |
Which version of Apache Spark is this cluster configured to use? | (#.#.#) |
Which VM type is this cluster configured to use? | |
How many cores does each VM make available? | (#) |
How many cores total does this cluster have available? | (#) |
How many VMs does this cluster consist of? | (#) |
Is autoscaling enabled? | |
Is the cluster configured to auto-terminate? | |
Select the Libraries tab. Specify the Python library attached to this cluster: | |
Select the Event Log tab. What is the name of the most recent cluster event (aka Event Type)? | |
Select the Driver Logs tab. What is the size, in bytes, of the log file log4j-active.log? | (###,###) |
On which tab can you find the Ganglia UI? | |
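
Note: if you want to sanity-check the version answers above without leaving the notebook, a minimal sketch (assuming the spark session that Databricks notebooks predefine) is:

    # Quick cross-check of the cluster's versions from a notebook cell.
    # Assumes the Databricks-provided spark session.
    import sys

    print(spark.version)           # Apache Spark version, e.g. #.#.#
    print(sys.version.split()[0])  # Python version on the driver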

3) Spark UI - Jobs
Complete the following questions by reviewing the output of the Jobs tab in the Spark UI.
What operation triggered Job #0? | |
What action triggered Job #1? | |
What action triggered Job #2? | |
What action triggered Job #3? | |
How many actions were executed? | (#) |
Job #0 shows parquet as the key operation that triggered this job. But parquet is not an action. What exactly was this job reading in during those 7 seconds? | |
How many tasks were executed for job #2? | (###) |
Open the details page for Job #1.
How many Mebibytes of data were read in? | (##.#) |
How many seconds did Stage #1 take? | (##) |
Open the details page for Stage #1 of Job #1.
How many records were read in? | (#,###,###) |
Review the Event Timeline. At a quick glance, does Stage #1 look healthy? | |
How many times did Apache Spark attempt to execute this stage? | (#) |
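
As background for the parquet question above: reading a Parquet source can launch a small job up front, just to list files and read footer/schema metadata, before any action runs. A sketch, using a hypothetical path rather than the lab's dataset:

    # Hypothetical path, for illustration only.
    # The read itself may launch a small metadata/schema job.
    df = spark.read.parquet("dbfs:/example/data.parquet")

    df.count()   # an action: launches a full job over the data
    df.take(5)   # another action: typically a much smaller job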

4) Spark UI - SQL
Complete the following questions by reviewing the output of the SQL tab in the Spark UI.
What job triggered Step E? | (#) |
What query triggered Step C? | (##) |
Which query took the most time to execute? | (##) |
Open the details page for Query #44.
How many rows were output by the first filter operation found in WholeStageCodegen (1)? | (#,###,###) |
How many rows were output by the second filter operation found in WholeStageCodegen (2)? | (##) |
How many rows were estimated to be output by the second filter operation found in WholeStageCodegen (2)? | (#,###,###) |
By what factor was the estimate off for the second filter operation found in WholeStageCodegen (2)? | (###,###) |
How many records were finally returned by this query? | (##) |
Which Spark action resulted in the execution of this query? | |
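
For orientation, here is a sketch (hypothetical source and columns, not the lab's) of the kind of query whose filters show up inside WholeStageCodegen blocks on the SQL tab:

    # Hypothetical source and columns, for illustration only.
    q = (spark.read.parquet("dbfs:/example/data.parquet")
              .filter("amount > 0")   # appears as a Filter inside WholeStageCodegen
              .groupBy("city")
              .count())

    q.explain()   # prints the physical plan; codegen stages are marked with (*)
    q.collect()   # the action; the completed query then appears on the SQL tab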

5) Spark UI - Stages
Complete the following questions by reviewing the output of the Stages tab in the Spark UI.
How many stages were executed in all? | (##) |
How many tasks were executed for stage #3? | (###) |
What Scheduling Mode is being employed by this application (specifically the acronym)? | |
What does that acronym stand for? | |
Open the details page for Stage #6.
In the Event Timeline, what is represented by the blue bars? | |
In the Event Timeline, what is represented by the red bars? | |
In the Event Timeline, what is represented by the orange bars? | |
In the Event Timeline, what is represented by the green bars? | |
Of the colors represented in the Event Timeline, which color do we want to see the most of? | |
Open the details page for Stage #5.
Which job does Stage #5 belong to? | |
At a quick glance of the Event Timeline, does Stage #5 look healthy? | |
Contrast the Event Timeline of Stage #5 to Stage #6. Was Stage #6 healthy? | |

6) Spark UI - Storage
Complete the following questions by reviewing the output of the Storage tab in the Spark UI.
How many partitions were cached for RDD #5? | (#) |
How many Mebibytes of data are in the Parquet IO Cache (aka the Delta Cache) across all executors? | (##.#) |
What is the maximum capacity, in Gibibytes, of the Parquet IO Cache for the entire cluster? | (###.#) |
Because the Delta Cache has moved the data to the executors, should we be using the cluster's RAM for caching? | |
Open the storage details page for RDD #5.
Is the cached data for RDD #5 evenly distributed? | |
How many Mebibytes of data are cached in each partition for RDD #5? | (#.#) |
How many Mebibytes of the cached data were spilled to disk for RDD #5? | (#.#) |
What is the final size, in Mebibytes, of the cached data for RDD #5? | (##.#) |
What percentage of RDD #5 is cached? | (###) |
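
Note: the Storage tab only shows an entry once a cached dataset has been materialized by an action. A minimal caching sketch, with a hypothetical path:

    from pyspark import StorageLevel

    # Hypothetical path, for illustration only.
    df = spark.read.parquet("dbfs:/example/data.parquet")

    # Partitions that don't fit in RAM spill to disk instead of being dropped.
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()   # the action that materializes the cache, partition by partition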

7) Spark UI - Executors
Complete the following questions by reviewing the output of the Executors tab in the Spark UI.
How many cores do you have for parallel processing? | (#) |
What is the IP Address of the driver? | (##.###.###.###) |
How many tasks is your cluster able to simultaneously execute? | (#) |
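
The simultaneous-task question follows from the core count: one task runs per core at a time. Assuming the notebook-provided SparkContext sc, a quick check is:

    # Typically equals the total worker cores, i.e. the number of tasks
    # the cluster can execute at once.
    print(sc.defaultParallelism)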

8) Spark UI - Environment
Open the Spark UI and navigate to the Environment tab.
What is the path to your Java Home? (see Runtime Information) | |
Is speculative execution of tasks enabled (see Spark Properties)? https://spark.apache.org/docs/latest/configuration.html | |
Which file encoding is Spark configured to use (see System Properties)? | |
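
These Environment values can also be read programmatically. A hedged sketch, assuming the driver's environment matches what the tab reports:

    import os

    # Speculative execution flag (a Spark property; defaults to "false").
    print(spark.conf.get("spark.speculation", "false"))

    # JAVA_HOME as seen by the driver process.
    print(os.environ.get("JAVA_HOME"))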

9) Structured Streaming
Open the Spark UI and navigate to the Structured Streaming tab.
Investigate Job #2 and answer these questions:
How many active streaming queries are there? | (#) |
How many completed streaming queries are there? | (#) |
For how many seconds was the one stream running? | (##) |
On average, how many records were received into the stream each second? | (##,###.##) |
On average, how many records were processed by the stream each second? | (##,###.##) |
On average, were records processed faster than they were received? | |
Navigate to the details for Run c7a226f7-889a-44f6-a266-14890236e4b0 and answer these questions:
How many batches were completed? | (##) |
What is the name of the stream/query? | |
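
To see where the stream/query name and the rate metrics come from, here is a minimal sketch using a synthetic rate source and an in-memory sink (the names are hypothetical, not the lab's):

    # Synthetic source: emits rowsPerSecond rows each second.
    stream = (spark.readStream.format("rate")
                   .option("rowsPerSecond", 100)
                   .load()
                   .writeStream
                   .queryName("example_stream")   # the name shown in the UI
                   .format("memory")              # in-memory sink, for demos only
                   .start())

    stream.recentProgress   # per-batch metrics, including inputRowsPerSecond
                            # and processedRowsPerSecond
    stream.stop()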

10) Spark UI - Bonus
Complete the following questions by navigating the Spark UI as necessary.
Investigate Job #2 and answer these questions:
Which operation is triggering the shuffle at the end of Stage #2? | |
Which operation is triggering the shuffle at the end of Stage #3? | |
Which operation is triggering the shuffle at the end of Stage #4? | |
Investigate Job #2, Stage #3 and answer these questions:
How many records were read in as a result of the previous shuffle operation? | (#,###) |
How many records were written out as a result of this shuffle operation? | (#,###) |
Investigate Job #4, Stage #13 and answer these questions:
According to the Summary Metrics, at least one task spent more time in garbage collection than any other. How many seconds did that task spend in garbage collection? | (#) |
According to the Summary Metrics, what was the least amount of time spent scheduling tasks? | (#.#) |
Did this stage run out of RAM while processing its tasks? | |
What is the average size in GiB that was spilled from memory? | (#.#) |
What is the average size in MiB that was spilled to disk? | (###.#) |
What percentage of tasks spilled to disk? | (###) |
Which task had the longest Scheduler Delay? | (#) |
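
As background for the shuffle questions above: wide transformations end a stage by writing shuffle files, which the next stage reads back. A hedged sketch, with a hypothetical path and columns:

    # Hypothetical path and columns, for illustration only.
    df = spark.read.parquet("dbfs:/example/data.parquet")

    (df.repartition(8)            # shuffle: exchange at the end of the scan stage
       .groupBy("city").count()   # shuffle: hash exchange for the aggregation
       .orderBy("count")          # shuffle: range exchange for the global sort
       .collect())                # the action that runs the whole multi-stage job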