Effective Bash for effortless insights

Problem: In my company, I was given the following problem: get the users who are inactive, i.e. whose user_activity data isn't present, and who have paid. This is a simple problem, but I solved it using some clever tricks.
Context: The user activity data first lands in an S3 bucket; later it gets processed and written into a Mongo collection, user_activity.
Stages of solving the problem:
- Get all the unique users from S3 using Athena.
- Get all the users from the Mongo collection.
- Get the difference between them.
Getting the data from S3
-- illustrative Athena DDL; a real table also needs a schema and row format matching the data in the bucket
CREATE EXTERNAL TABLE user_data (user_id string)
LOCATION 's3://user_data/';
SELECT DISTINCT user_id FROM user_data;
I downloaded the result as CSV and saved it at the following path: /tmp/user_ids_in_s3.csv
Getting the data from Mongo
This was the tricky part because of the way the data is stored and the way Mongo exports it.
We have the following collections:
- user_activity stores all the user activity.
- user_details stores the details of each user, including whether they have paid.
Pipelines to the rescue. I used a MongoDB aggregation pipeline to get the activity of paid users. The pipeline looks something like this:
[
  {
    $lookup: {
      from: "user_details",
      localField: "user_id",
      foreignField: "_id",
      as: "result"
    }
  },
  {
    $match: {
      "result.status": "paid"
    }
  },
  {
    $project: {
      _id: "$user_id" // keep only the user id, so we can compare against the S3 list
    }
  }
]
I exported this data as JSON into /tmp/db.user_activity.json.
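If you want to reproduce the export from the shell, here's a minimal sketch of one way to do it, assuming mongosh is installed; the connection string and database name are made up, and the exact export mechanism I used may differ:
# sketch: run the pipeline in mongosh and write extended JSON to a file
# ("mongodb://localhost:27017/mydb" is an assumption -- adjust for your setup)
mongosh "mongodb://localhost:27017/mydb" --quiet --eval '
  print(EJSON.stringify(db.user_activity.aggregate([
    { $lookup: { from: "user_details", localField: "user_id",
                 foreignField: "_id", as: "result" } },
    { $match: { "result.status": "paid" } },
    { $project: { _id: "$user_id" } }
  ]).toArray(), null, 2))
' > /tmp/db.user_activity.json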
Where's Bash?
Good question! The data I got was in the following format:
[
  {
    "_id": {
      "$oid": "667..."
    }
  },
  {
    "_id": {
      "$oid": "867..."
    }
  },
  ...
]
Now, we are going to compare this with the CSV, so it'll be easier if the file contained nothing but the user IDs, one per line.
Bash to the rescue
Getting the user IDs from the data in a processable format:
cat /tmp/db.user_activity.json |
jq '.[]._id."$oid"' |
sed 's/"//g' > /tmp/paid_users.txt
Explanation:
- cat /tmp/db.user_activity.json writes the file to stdout, which the pipe feeds into the next command.
- jq '.[]._id."$oid"' prints the $oid field of every object in the JSON array, as quoted JSON strings.
- sed 's/"//g' strips the quotation marks from the data.
- > /tmp/paid_users.txt redirects the whole thing into a file.
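Side note: jq's -r flag prints raw strings without the quotation marks, which would make the sed step unnecessary; the longer pipeline above just shows each tool doing one job. The shorter equivalent would be:
jq -r '.[]._id."$oid"' /tmp/db.user_activity.json > /tmp/paid_users.txt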
If you don't understand any part, feel free to read the docs of each command:
- man cat
- man jq
- man sed
- output redirection
I highly encourage you to work out this command using the following data. Use this script to generate it:
✨ This script is generated by AI.
import json
import random
import time

def generate_object_id():
    timestamp = int(time.time())
    random_bytes = random.getrandbits(8 * 8).to_bytes(8, 'big').hex()
    return f"{timestamp:08x}{random_bytes}"[:24]

def generate_data(count=5):
    return [{"_id": {"$oid": generate_object_id()}} for _ in range(count)]

if __name__ == "__main__":
    data = generate_data(10)
    print(json.dumps(data, indent=2))
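To try the jq pipeline end to end, save the script (I'm calling it generate_data.py, a name of my choosing) and feed its output through the same commands:
python3 generate_data.py > /tmp/db.user_activity.json
cat /tmp/db.user_activity.json | jq '.[]._id."$oid"' | sed 's/"//g' > /tmp/paid_users.txt
cat /tmp/paid_users.txt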
Comparing "paid users" with "active users"
comm -23 <(sort /tmp/user_ids_in_s3.csv) <(sort /tmp/paid_users.txt) |
sed -E 's/[[:space:]]+//g' |
sort | uniq
Explanation:
- comm -23 <(sort /tmp/user_ids_in_s3.csv) <(sort /tmp/paid_users.txt) prints the lines that are present in the first file but not in the second.
- sed -E 's/[[:space:]]+//g' strips the whitespace, as the data has spaces at the start and end.
- sort | uniq sorts the data and keeps only the unique IDs.
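If comm is new to you, here's a toy run with made-up IDs. comm outputs three columns (lines unique to the first file, lines unique to the second, lines common to both); -23 suppresses the second and third, leaving only lines unique to the first file. Note that both inputs must be sorted:
printf 'user_a\nuser_b\nuser_c\n' > /tmp/all.txt
printf 'user_b\n' > /tmp/paid.txt
comm -23 /tmp/all.txt /tmp/paid.txt   # prints: user_a and user_c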
If you don't understand any part, feel free to read the docs of each command:
- man comm
- man sort
- man sed
- man uniq
- output redirection
Worthy mention for getting the user IDs from the Mongo data
My colleague had to do something similar, and what he did completely blew my mind. I had never thought of this approach, so I'm sharing it as well.
He opened the file in VS Code and searched with the regex "$oid": "(.*)", selected every match using Ctrl+F2, and copied them to the clipboard. He pasted this into a new file, used Alt+Shift+End, which selects from the cursor's location to the end of the file, and deleted the quotations along with the pesky $oid. It blew my mind; I had never thought of it, although it is inefficient and error-prone. Hats off.
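For what it's worth, the same extraction can be scripted so it stays repeatable; a rough grep equivalent of his regex (assuming the IDs are lowercase hex, as ObjectIds are) would be:
# match each "$oid": "..." pair, then trim everything but the id itself
grep -o '"\$oid": "[0-9a-f]*"' /tmp/db.user_activity.json | sed 's/.*: "//; s/"$//'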