Making GitHub Workflows a Little Joyful

Recently at work, I came up with a project called Seeds – it's a curated/prepared dataset which we send through our pipeline, and at the end of it we verify the expected vs actual results.

Expected, you say? Since the dataset is custom-prepared by us, we know its nature and hence know what to expect. Example: prepare a dataset of 100 orders for customer 'Adam' and push it through your pipeline. At the end of the pipeline, I expect select count(*) from orders where customer = 'Adam' to come back as 100.
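
A check like that is small enough to script. Below is a minimal sketch of what one checkpoint could look like in Python – the sqlite3 connection, the warehouse.db path and the orders table are stand-ins for illustration, not the actual Harvestor code.

import sqlite3  # stand-in for the real warehouse connection (assumption)

EXPECTED = 100  # we planted exactly 100 orders for customer 'Adam'

def verify_harvest(conn):
    # expected vs actual: the seed dataset tells us exactly what to look for
    (actual,) = conn.execute(
        "select count(*) from orders where customer = 'Adam'"
    ).fetchone()
    print(f"expected={EXPECTED} actual={actual}")
    return actual == EXPECTED

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")  # placeholder path
    raise SystemExit(0 if verify_harvest(conn) else 1)

A non-zero exit code is enough to fail the corresponding workflow step, which is all a checkpoint needs.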

So I came up with Harvestor: it's a GitHub workflow which does the pipeline setup, plants the seeds (sends the data), and runs verification steps at various checkpoints in the pipeline. At the end, it verifies the Harvest, i.e. the final result.

The normal GitHub pipelines looked đŸĨą☚ī¸, so I sprinkled in a few emojis to brighten up the place 🤗

The project naming resonated well with my teammates too!

That, combined with nice icons for the Slack channel, made the cake complete with cherries.

I call this DATA Art or DartA.

Let me know what you think. Hope this inspires you.

Lambda to Dedup Less Data – Spark vs Own Solution

Hi guys,

Data pipelines typically have a dedup phase to remove duplicate records.

Well, I came across a scenario where:

  • the data to dedup was < 100 MB
  • our company goes with a serverless theme
  • plus, we are a startup, so fast development is a given

So naturally, I thought of:

  • AWS Step Functions to serve as our pipeline – as the input data is < 100 MB (see the sketch after this list)
  • a Lambda for each phase of the pipeline:
    • Dedup
    • Quality check
    • Transform
    • Load to warehouse
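
To make the shape of the pipeline concrete, here is a rough sketch of the state machine definition created via boto3 – the state names mirror the phases above, while the Lambda ARNs, role ARN and state machine name are placeholders rather than our actual setup.

import json
import boto3

# One Task state per phase of the pipeline; all ARNs below are placeholders.
definition = {
    "StartAt": "Dedup",
    "States": {
        "Dedup": {"Type": "Task",
                  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:dedup",
                  "Next": "QualityCheck"},
        "QualityCheck": {"Type": "Task",
                         "Resource": "arn:aws:lambda:us-east-1:123456789012:function:quality-check",
                         "Next": "Transform"},
        "Transform": {"Type": "Task",
                      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform",
                      "Next": "LoadToWarehouse"},
        "LoadToWarehouse": {"Type": "Task",
                            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:load-to-warehouse",
                            "End": True},
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="dedup-pipeline",                                        # placeholder name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-pipeline-role",   # placeholder role
)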

Now for the dedup Lambda (since I have used Spark before), I thought it would be about 3 lines of code and hence easy to implement.

// Spark running in local mode - local[1]
SparkSession sparkSession = SparkSession.builder().appName("dedup").master("local[1]").getOrCreate();
Dataset<Row> inputData = sparkSession.read().json(inputPath);
Dataset<Row> deduped = inputData.dropDuplicates(columnName);
deduped.write().json(outputPath);

I packaged it and ran the Lambda … it took > 20 seconds for 10 MB of data ☚ī¸

This got me thinking 🤔 … this sounds like overkill.

So I decided to write my own code in Python (boto3 for the S3 interaction + a simple hashmap to dedup).
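
The whole Lambda is roughly the sketch below: download the object from S3, keep a set (the "hashmap") of keys already seen, and upload the deduped records. The bucket, keys, dedup column and the JSON-lines layout are assumptions for illustration, not the exact production code.

import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Placeholders – in reality these come from the Step Functions input.
    bucket = event.get("bucket", "my-pipeline-bucket")
    in_key = event.get("input_key", "raw/orders.jsonl")
    out_key = event.get("output_key", "deduped/orders.jsonl")
    dedup_col = event.get("dedup_column", "order_id")

    # Assumes the input object is small enough (< 100 MB) to hold in memory.
    body = s3.get_object(Bucket=bucket, Key=in_key)["Body"].read().decode("utf-8")

    seen = set()       # keys we have already emitted
    deduped = []
    for line in body.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        key = record[dedup_col]
        if key not in seen:          # first occurrence wins, later duplicates dropped
            seen.add(key)
            deduped.append(line)

    s3.put_object(Bucket=bucket, Key=out_key,
                  Body=("\n".join(deduped) + "\n").encode("utf-8"))
    return {"input_records": len(body.splitlines()), "output_records": len(deduped)}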

The Python version handled the same data in about 1.7 seconds instead of 23.7 – that is a 92.8% decrease 😮 in running time lol

So let's summarize:

|                    | Apache Spark (local mode)        | Own solution (Python boto3 + hashmap)                   |
|--------------------|----------------------------------|---------------------------------------------------------|
| Code effort        | very low (Spark does everything) | medium (write code to download + dedup + upload to S3)  |
| Memory requirement | >= 512 MB (Spark needs min 512)  | < 128 MB                                                |
| Memory used        | 600 MB                           | 78 MB                                                   |
| Running time (sec) | 23.7 ❌                          | 1.7 ✅                                                  |

Moral of the Story

  • When you think data pipelines, don't always reach for Apache Spark
  • Keep in mind the amount of data being processed
  • Less code is good for code maintenance but might not be performant
    • i.e. ease of development is a priority, but cost comes first
  • Keep an eye on cost $ 💰 (quick math below)
    • Here we decreased cost by ~92%, as Lambda bills by running time & memory used
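
As a quick back-of-the-envelope check on that figure, using the numbers from the table above:

# Running-time saving from the summary table.
spark_seconds, own_seconds = 23.7, 1.7
saving = (spark_seconds - own_seconds) / spark_seconds
print(f"running-time saving: {saving:.1%}")   # -> 92.8%
# Lambda actually bills in GB-seconds (allocated memory x duration), so moving
# from a 512+ MB allocation down to 128 MB shrinks the bill even further.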