Gesture Control with ElectronJS, MediaPipe and Nut.js - Creative Coding fun

DEMO:
Code: GitHub
A while back, I attended a creative coding jam and wanted to build something fun. Ever since college, I had wanted to build an app that uses gesture control to navigate PPT presentations (cuz we kept losing our pointers ;P). So I decided to build something along those lines.
To start, I knew I needed a desktop app to control the PC, and being familiar with Python and JS, the obvious options were PyQt or Electron. After researching a little, I found out about MediaPipe from Google:
an open-source framework for real-time multimedia tasks like hand tracking, gesture recognition, and pose estimation. It offers efficient, cross-platform machine learning solutions for developers.
I had seen many Python projects using computer vision to do such things, but I had recently been playing with JS, so I thought it would be a fun challenge to do it in Electron. So far I had Electron for the app and MediaPipe for the gesture detection.
Next, I needed something to control the computer programmatically; that's when I found Robot.js and nut.js. I went with nut.js, as it had more documentation and I found it easy to use.
Now I had these tasks:
- Start the app and keep it running in the background
- Launch the camera, get the feed and detect gestures
- Map the detected gestures to actions that control the computer
1. Start the app and keep it running in the background
Start by installing the dependencies and setting up the Electron app:
npm install @mediapipe/camera_utils @mediapipe/hands @mediapipe/tasks-vision @nut-tree-fork/nut-js @tensorflow-models/hand-pose-detection @tensorflow/tfjs electron
Electron has a simple way to run an app in the background: create a BrowserWindow in index.js and set show: false. This hidden window loaded a background.html with roughly the content below, just a hidden video element for the camera feed and a hidden gesture output element. Nothing fancy.
id="gesture_output" style="display: none;">
2. Launch the camera, get the feed and detect gestures
The MediaPipe documentation is very clear on how to initialize the recognizer; it's pretty straightforward.
Source: gestureWorker.js
async function initialize() {
  try {
    // Load the MediaPipe vision WASM bundle and the gesture recognizer model
    const vision = await FilesetResolver.forVisionTasks(
      "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@0.10.3/wasm"
    );
    gestureRecognizer = await GestureRecognizer.createFromOptions(vision, {
      baseOptions: {
        modelAssetPath: "https://storage.googleapis.com/mediapipe-models/gesture_recognizer/gesture_recognizer/float16/1/gesture_recognizer.task",
        delegate: "GPU"
      },
      runningMode: "VIDEO"
    });

    // Start webcam
    const constraints = {
      video: {
        width: videoWidthNumber,
        height: videoHeightNumber
      }
    };
    const stream = await navigator.mediaDevices.getUserMedia(constraints);
    video.srcObject = stream;
    webcamRunning = true;
    // Kick off the prediction loop once the first frame is available
    video.addEventListener("loadeddata", predictWebcam);
  } catch (error) {
    console.error('Initialization error:', error);
    // Retry initialization after 5 seconds if something failed (e.g. camera busy)
    setTimeout(initialize, 5000);
  }
}
3. Map the detected gestures to actions that control the computer
Once I had the feed, all I had to do was run the recognizer on each video frame and read out the top detected gesture:
Source: gestureWorker.js
results = gestureRecognizer.recognizeForVideo(video, Date.now());
const gesture = results.gestures[0][0].categoryName;
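For context, this call lives inside the predictWebcam loop registered during initialization. A minimal sketch of that loop, assuming requestAnimationFrame pacing and a guard for frames with no detected hand (handleGesture is a hypothetical wrapper around the if/else mapping shown next), could look like this:

async function predictWebcam() {
  // Run the recognizer on the current video frame
  const results = gestureRecognizer.recognizeForVideo(video, Date.now());

  // Only act when a hand (and therefore a gesture) was actually detected
  if (results.gestures.length > 0) {
    const gesture = results.gestures[0][0].categoryName;
    await handleGesture(gesture); // hypothetical wrapper for the mapping below
  }

  // Keep the loop running as long as the webcam is active
  if (webcamRunning) {
    window.requestAnimationFrame(predictWebcam);
  }
}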
MediaPipe ships with some predefined gestures, like Thumb_Up, Thumb_Down and Open_Palm. I mapped them as below:
if (gesture === "Thumb_Up") {
await mouse.scrollUp(10);
} else if (gesture === "Thumb_Down") {
await mouse.scrollDown(10);
} else if (gesture === "Open_Palm") {
await keyboard.pressKey(Key.LeftAlt, Key.LeftCmd, Key.M);
await keyboard.releaseKey(Key.LeftAlt, Key.LeftCmd, Key.M);
} else if (gesture === "Pointing_Up") {
await mouse.rightClick();
} else if (gesture === "Victory") {
await keyboard.pressKey(Key.LeftCmd, Key.Tab);
await keyboard.releaseKey(Key.LeftCmd, Key.Tab);
}
The mouse and keyboard objects come from the nut.js package.
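For completeness, a minimal import (assuming CommonJS; the project may use ES module imports instead) looks something like this:

// Shared mouse and keyboard controls plus the Key enum used for the shortcuts above
const { mouse, keyboard, Key } = require("@nut-tree-fork/nut-js");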
And finally I had it working. There were many aaa, aahh and wutt moments along the way, but I learned a lot. As you can see in the demo, the last gesture is a bit buggy, but it works!