Multiplier currently pays out salaries to various members under our payroll. Payouts to members are recorded in a raw ingestion file. This is loaded into a Google Sheet for your reference. The Product department needs you to build a pipeline to transform this raw data into a clean, query-able format for their analytics.
Notes on Data:
This is raw ingestion data. You may encounter inconsistent date formats, nested JSON strings, or mixed currencies.
amounts: This contains a JSON-like string representing various components of the payout (Salary, Tax, Bonus).
Part 1: Architecture & Modeling
Before writing code, verbalise a strategy for this pipeline:
Target Schema: Design a Star Schema (or appropriate data model) that this data should be transformed into to best answer the business questions below.
Ingestion Strategy: How would you handle this file if it arrived daily? (Consider duplicates, partitioning, etc.)
Part 2: Implementation
Choose ONE of the following options based on your preferred stack:
Option A: Python/DataFrame (Pandas, Spark, Polars)
Implement a transformation script that reads the raw CSV and outputs the answers.
Option B: SQL / ELT (Postgres, Snowflake, BigQuery)
Assume the raw CSV data has already been loaded into a staging table (raw_payments) where all columns are currently TEXT/VARCHAR type. Write the SQL query to transform this raw table into your target schema and answer the business questions.
Business Questions to Answer:
Total Payouts: What is the total amount disbursed per currency in May 2023?
Currency Analysis: What is the average salary per currency by Department?
Note: You do not need an FX table; treat each currency as a separate group.
Data Cleaning: Flatten the amounts column so that salary, tax, and bonus are distinct columns (or rows, depending on your modeling choice in Part 1).
Constraints:
You do not need to connect to a real database.
Output the final results to the console or a clean CSV.