Skip to main content

Analyzing ETL Jobs with SnappyFlow

Analytics processing usually requires user to run a sequence of database reads/write jobs and transformation operations. In many cases these jobs are complex and may take a long time to complete. It is a very common need of the users to monitor individual jobs runs, summarize performance of different jobs, compare two runs of a specific job type.

SnappyFlow provides an easy yet powerful tool to analyze ETL Jobs. The approach is explained in the following video

Step 1: Drop logs from ETL Jobs

SnappyFlow allows for a job to have up to a 3-level hierarchy- Job, Stage, Task. Logs in JSON format have to be dropped whenever a job/stage/task is started, completed or terminated. This log can be parsed using SnappyFlow’s ETL parser.

Log format for a Job:
"jobName": <Job-name>,
"jobId": <Unique JobId>,
"time": <Time in epoch milliseconds format>
"type": "job",
"status": <status: started, success, failed, aborted>
Log format for a Stage:
"jobName": <Job-name>,
"jobId": <Unique JobId>,
"stageId": <stageId>,
"stageName": <stageName>
"time": <Time in epoch milliseconds format>
"type": "stage",
"status": <status can be started, success, failed, aborted>
Log format for a Task:
"jobName": <Job-name>,
"jobId": <Unique JobId>,
“stageId”: <staged>,
“stageName”: <stageName>
"time": <Time in epoch milliseconds format>
"type": "task",
"status": <status can be started, success, failed, aborted>

Step 2: Forward logs to SnappyFlow

Add the following log parser to logging section of sfAgent’s config.yaml:

- name: etlRaw
enabled: true
log_path: <log file path>

Restart sfAgent with the new configuration.

service sfagent restart

Check if documents have been received in SnappyFlow. You will find 3 documents under metrics with plugin name as “etlRaw” and documentType as “job”, “stage” and “task” depending on your hierarchy.

Step 3: Generate an access URL for use by summarization module

Logs shipped to SnappyFlow are in a raw form and they cannot be directly used for reporting and analysis. Therefore user has to export this raw data to a summarization script that transforms the data and sends it back to SnappyFlow into a new document.

Import a ETL template into your dashboard.

Go to “Scratchpad” pane

Click on ‘Export API Endpoint’ option in the component and create component URL for all 3 components for interval, say Last 5 mins.

Click on the ‘API Endpoints’ option for the project to view the API List. Copy the URL’s for the 3 components and the Authentication token. These need to be provided in Step 4

Step 4: Run summarization script as a cronjob

Install the pip utility from the below link. Refer to the link for Installation and Usage instructions.

sfapmetl · PyPI

  • The python script takes a config file path as input

  • Set values for key, appName, projectName, Name.

  • Provide the component Url’s for Job, stage and Task and authKey (from Step 3)

  • The data will be available in the dashboard under the plugin ‘etlReport’ and documentType - job, stage and task.

Step 5: Review ETL Dashboards

You will now see the summarized data in dashboard under etlReport for job, stage and tasks. Select a particular job and choose a timeline to see job duration trends over the selected time period. Clicking on a particular job id provides a drilled down view of stages and tasks within that job.