Effective Bash for effortless insights


Mar 29, 2025 - 09:52

Problem: In my company, I was given the following problem.
Get the users who are inactive, i.e. those whose user_activity data isn't present but who have paid. It's a simple problem, but the way I solved it used some clever tricks.

Context: The user activity data first lands in an S3 bucket; later it gets processed and inserted into a Mongo collection, user_activity.

Stages of solving the problem:

  • Get all the unique users from S3 using Athena.
  • Get all the users from the Mongo collection.
  • Get the difference between them.

Getting the data from s3

-- simplified for illustration; the actual statement was an Athena
-- CREATE EXTERNAL TABLE pointing at the S3 location
create table user_data from "s3:user_data";

select distinct(user_id) from user_data;

I downloaded the result as a CSV and saved it at the following path: /tmp/user_ids_in_s3.csv

Getting the data from mongo

This was the tricky part because of the way the data is stored and the way Mongo exports it.
We have the following collections:

  • user_activity stores all the user activity.
  • user_details stores the details of each user, including whether they have paid.

Pipelines to the rescue. I used a MongoDB aggregation pipeline to get the activity of paid users.
The pipeline looks something like this:

[
  {
    $lookup:
      {
        from: "user_details",
        localField: "user_id",
        foreignField: "_id",
        as: "result"
      }
  },
  {
    $match:
      {
        "result.status": "paid"
      }
  },
  {
    $project: 
      {
        _id: 1
      }
  }
]

I exported this data as JSON into /tmp/db.user_activity.json.
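To see what the pipeline computes without a running Mongo, its logic ($lookup on user_id, $match on the paid status, $project down to _id) can be sketched in plain Python. The sample documents below are made up for illustration:

```python
# In-memory sketch of the aggregation pipeline above.
# Sample documents are made up for illustration.
user_details = [
    {"_id": "u1", "status": "paid"},
    {"_id": "u2", "status": "free"},
]
user_activity = [
    {"_id": "a1", "user_id": "u1"},
    {"_id": "a2", "user_id": "u2"},
]

# $lookup: join each activity document to its user_details document,
# $match:  keep only documents whose joined user has status "paid",
# $project: keep only the _id field.
details_by_id = {d["_id"]: d for d in user_details}
result = [
    {"_id": a["_id"]}
    for a in user_activity
    if details_by_id.get(a["user_id"], {}).get("status") == "paid"
]
print(result)  # [{'_id': 'a1'}]
```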

Where's Bash?

Good question! The data I got was in the following format.

[{
  "_id": {
    "$oid": "667..."
  }
 },
 {
  "_id": {
    "$oid": "867..."
  }
 }
...
]

Now, we are going to compare this with the CSV.
Therefore it'll be easier if the file contained just the user IDs, one per line.

Bash to the rescue

Getting the user IDs from the data into a processable format:

cat /tmp/db.user_activity.json | 
jq '.[]._id."$oid"' | 
sed 's/"//g' > /tmp/paid_users.txt

Explanation:

  • cat /tmp/db.user_activity.json writes the file to stdout, which feeds the pipe.
  • jq '.[]._id."$oid"' prints the $oid field from each object in the JSON array.
  • sed 's/"//g' strips the quotation marks from the output.
  • > /tmp/paid_users.txt redirects the whole thing into a file.

If you don't understand any part feel free to read the docs of each command:

  • man cat
  • man jq
  • man sed
  • output redirection
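If jq isn't installed, the same extraction can be sketched with Python's standard json module. The inline two-document sample string below is made up:

```python
import json

# Parse the exported array and print just the $oid strings,
# one per line -- the same result the jq | sed pipeline produces.
raw = '[{"_id": {"$oid": "667a"}}, {"_id": {"$oid": "867b"}}]'
docs = json.loads(raw)
ids = [doc["_id"]["$oid"] for doc in docs]
print("\n".join(ids))
```

In practice you would read /tmp/db.user_activity.json with open() instead of the inline string.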

I highly encourage you to work out this command using the following data.
Use this script to generate data:

✨ This script is generated by AI.

import json
import random
import time

def generate_object_id():
    timestamp = int(time.time()) 
    random_bytes = random.getrandbits(8 * 8).to_bytes(8, 'big').hex()  
    return f"{timestamp:08x}{random_bytes}"[:24]

def generate_data(count=5):
    return [{"_id": {"$oid": generate_object_id()}} for _ in range(count)]

if __name__ == "__main__":
    data = generate_data(10)
    print(json.dumps(data, indent=2))

Comparing "paid users" with "active users"

comm -23 <(sort /tmp/user_ids_in_s3.csv) <(sort /tmp/paid_users.txt) | 
sed -E 's/[[:space:]]+//g' | 
sort | uniq

Explanation:

  • comm -23 <(sort /tmp/user_ids_in_s3.csv) <(sort /tmp/paid_users.txt) prints the lines present in the first file but not the second, i.e. users in S3 that are absent from the paid-users list.
  • sed -E 's/[[:space:]]+//g' strips whitespace, as the data has spaces at the start and end of lines.
  • sort | uniq sorts the data and keeps only the unique IDs.

If you don't understand any part feel free to read the docs of each command:

  • man comm
  • man sort
  • man sed
  • man uniq
  • output redirection
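The comm step is really just a set difference, which is easy to sanity-check in Python. The sample IDs below are made up:

```python
# Mirror of comm -23: keep IDs present in the first set but not the second.
s3_ids = {"u1", "u2", "u3"}      # stand-in for /tmp/user_ids_in_s3.csv
paid_active_ids = {"u2"}         # stand-in for /tmp/paid_users.txt

inactive = sorted(s3_ids - paid_active_ids)
print(inactive)  # ['u1', 'u3']
```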

Worthy mention for getting the user ids from the mongo data

My colleague had to do something similar. What he did completely blew my mind.
I had never thought of this approach, so I'm sharing it as well.

He opened the file in VS Code, used the regex "$oid": "(.*)" to find the IDs, selected every match using ctrl+f2, and copied them to the clipboard.
He pasted this into a new file, used alt+shift+end (which selects from the cursor's location to the end of the file),
and deleted the quotations along with the pesky $oid.

It blew my mind; I had never thought of it, although it's inefficient and error-prone. Hats off.
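For the curious, the colleague's regex idea also works non-interactively. A sketch with Python's re module — note the sample input is made up, the $ is escaped, and .*? is used instead of .* so each match stops at its own closing quote:

```python
import re

# Apply the VS Code find regex programmatically and capture the IDs.
text = '{"_id": {"$oid": "667a"}}, {"_id": {"$oid": "867b"}}'
ids = re.findall(r'"\$oid": "(.*?)"', text)
print(ids)  # ['667a', '867b']
```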