Designing a Configuration Management System


Mar 27, 2025 - 14:16

Starting small

I want to build this system from the bottom up.

In the past, I’ve fallen into the trap of trying to design for every possible use case from day one, aiming for the perfect architecture and eventually getting stuck before real progress is made.

This time: stick to the fundamentals, move upward step by step, stay pragmatic, and let the system grow based on real needs.

Identifying Core Challenges

The overall goals of system configuration management are well-known and well-documented. I’m not going to repeat what you can already find online. Instead, I’ll focus on a few specific areas that I care about and want to emphasize.

Take something as simple as configuring NGINX on a Linux server.

We’ve probably all been in a situation where we pushed a broken configuration to a server. Maybe it was a syntax error that nginx -t would have caught. But sometimes the syntax is perfect and the configuration is still logically flawed, perhaps pointing to a non-existent backend, so the errors only show up after the service restarts. These functional issues are harder to catch upfront.

If the system is set up properly, your configuration management tool should ideally handle both types of failures gracefully. Let’s look at how we might build robust handling for this in Ansible, including a basic health check after the restart:

- name: Deploy, validate and health-check nginx configuration
  hosts: webservers
  vars:
    nginx_conf_path: /etc/nginx/nginx.conf
    nginx_conf_backup: /etc/nginx/nginx.conf.bak
    new_nginx_template: templates/nginx.conf.j2
    nginx_health_check_url: http://localhost/
  tasks:
    - name: Backup current nginx config
      ansible.builtin.copy:
        src: "{{ nginx_conf_path }}"
        dest: "{{ nginx_conf_backup }}"
        remote_src: yes
        mode: preserve
    - block:
        - name: Deploy new nginx config
          ansible.builtin.template:
            src: "{{ new_nginx_template }}"
            dest: "{{ nginx_conf_path }}"
        - name: Validate nginx configuration syntax BEFORE restart
          ansible.builtin.command: nginx -t
          changed_when: false
        - name: Restart nginx to apply configuration
          ansible.builtin.service:
            name: nginx
            state: restarted
        - name: Wait briefly for nginx to stabilise
          ansible.builtin.pause:
            seconds: 5
        - name: Perform health check AFTER restart
          ansible.builtin.uri:
            url: "{{ nginx_health_check_url }}"
            status_code: 200
          register: health_check_result
      rescue:
        - name: Roll back nginx config due to failure
          ansible.builtin.copy:
            src: "{{ nginx_conf_backup }}"
            dest: "{{ nginx_conf_path }}"
            remote_src: yes
            mode: preserve
        - name: Restart nginx after rollback
          ansible.builtin.service:
            name: nginx
            state: restarted
        - name: Fail the playbook after rollback
          ansible.builtin.fail:
            msg: "Nginx deployment failed (syntax or health check) and rollback was triggered."

While technically valid, it’s a lot of boilerplate for a fairly basic operation.

How can we tackle those essential challenges? Do we even need to express them in the code, and if so, how do we do it without bloating the code and adding complexity?
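To make the question concrete, here is a minimal sketch of what the same backup, check and rollback flow could look like if the tool handled it generically. It is written in Python purely for illustration. The function name apply_with_rollback and the idea of passing checks as callables are my assumptions, not part of any existing tool, and restarting the service is deliberately left out.

```python
import shutil
from pathlib import Path


def apply_with_rollback(path, new_content, checks):
    """Write new_content to path and restore the old file if a check fails.

    `checks` is a list of zero-argument callables that return True on
    success, for example a syntax check followed by a health check.
    Assumes the target file already exists.
    """
    target = Path(path)
    backup = target.with_name(target.name + ".bak")
    shutil.copy2(target, backup)      # keep the last known-good version
    target.write_text(new_content)
    try:
        ok = all(check() for check in checks)  # stops at the first failure
    except Exception:
        ok = False                    # a crashing check counts as a failure
    if not ok:
        shutil.copy2(backup, target)  # roll back to the backup
    return ok
```

The caller only supplies the desired content and the checks. The backup and rollback steps, which made up most of the playbook, live in one place.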

Code for declaring resources has to be written. So we need to think about what kind of format makes sense, something like YAML, JSON or maybe a DSL?

Each option has its trade-offs.

YAML is easy to write, but can get messy. JSON is stricter, but not great for humans. A DSL gives you more power and flexibility, but isn’t as user-friendly for non-programmers.

No matter which format we choose, humans make mistakes: typos, wrong types, missing fields. We need to assist the user by validating the code before it does anything.
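As a rough illustration of the three mistake classes just mentioned, the following Python sketch checks a resource definition against a tiny schema. The FILE_SCHEMA layout and the field names are hypothetical. A real implementation might use JSON Schema or a similar standard instead.

```python
# Hypothetical schema for a "file" resource:
# field name -> (expected type, required?)
FILE_SCHEMA = {
    "path": (str, True),
    "content": (str, True),
    "mode": (str, False),
}


def validate_resource(definition: dict, schema: dict) -> list[str]:
    """Return a list of readable problems; an empty list means valid."""
    errors = []
    for name, (expected_type, required) in schema.items():
        if name not in definition:
            if required:
                errors.append(f"missing required field '{name}'")
            continue
        if not isinstance(definition[name], expected_type):
            errors.append(
                f"field '{name}' should be {expected_type.__name__}, "
                f"got {type(definition[name]).__name__}"
            )
    # Unknown keys are usually typos, so report them instead of ignoring them.
    for name in definition:
        if name not in schema:
            errors.append(f"unknown field '{name}'")
    return errors
```

Returning all problems at once, rather than stopping at the first, lets the user fix a definition in one pass.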

If we can declare the desired state of resources, we usually don’t want to just hand it off and hope for the best. Before anything runs, we want to understand what’s about to happen.

What’s going to change? What stays the same? Are there any surprises?

Being able to plan a change first, and to see a diff between the current state and the desired one, gives the user a chance to stay in control.
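A plan of this kind can be sketched as a comparison of two property maps. The Python below is only an illustration under that assumption. The function names and the "+", "~", "-" markers are made up for this example.

```python
def plan_changes(current: dict, desired: dict) -> dict:
    """Compare two property maps and report what an apply would do."""
    plan = {"add": {}, "change": {}, "remove": {}, "unchanged": []}
    for key, want in desired.items():
        if key not in current:
            plan["add"][key] = want
        elif current[key] != want:
            plan["change"][key] = (current[key], want)
        else:
            plan["unchanged"].append(key)
    for key, have in current.items():
        if key not in desired:
            plan["remove"][key] = have
    return plan


def render_plan(plan: dict) -> list[str]:
    """Turn a plan into diff-style lines; unchanged keys are omitted."""
    lines = [f"+ {key}: {value!r}" for key, value in plan["add"].items()]
    lines += [
        f"~ {key}: {old!r} -> {new!r}"
        for key, (old, new) in plan["change"].items()
    ]
    lines += [f"- {key}: {value!r}" for key, value in plan["remove"].items()]
    return lines


current = {"mode": "0644", "owner": "root", "content": "old"}
desired = {"mode": "0600", "owner": "root", "group": "www-data"}
for line in render_plan(plan_changes(current, desired)):
    print(line)
# + group: 'www-data'
# ~ mode: '0644' -> '0600'
# - content: 'old'
```

Nothing is applied here. The plan is plain data, so it can be shown to the user, stored, or handed to an executor later.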

A Pragmatic Architecture

To build this system pragmatically, let's think about a client-server architecture. But instead of a monolithic server doing everything, we'll split the responsibilities.

The client will be the brain, responsible for the orchestration of the overall process, deciding what resources to manage and in what order.

The server will be the hands, performing specific actions on resources via a simple REST API.

Server

The server exposes a granular, idempotent API for managing single resources. It doesn't need to understand the bigger picture, just how to handle its specific tasks reliably.

Key Responsibilities

- Exposing a REST API focused on CRUD-like idempotent operations for individual resources
  - GET: Retrieve the current state and properties of a resource
  - PUT: Create or update properties for a resource
  - DELETE: Delete an existing resource
- Validating the schema of the incoming resource definition
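The semantics of these three operations can be sketched without any HTTP layer at all. The ResourceStore class below is a hypothetical in-memory stand-in written in Python. The status codes and the "changed" flag are my assumptions about a reasonable contract, not a fixed design.

```python
class ResourceStore:
    """In-memory stand-in for the server's per-resource handlers."""

    def __init__(self):
        self._resources = {}

    def get(self, name):
        """GET: return (status, properties); 404 if the resource is absent."""
        if name not in self._resources:
            return 404, None
        return 200, dict(self._resources[name])

    def put(self, name, properties):
        """PUT: create or update; repeating the same call changes nothing."""
        if not isinstance(properties, dict):
            return 400, {"error": "properties must be an object"}
        existed = name in self._resources
        changed = self._resources.get(name) != properties
        self._resources[name] = dict(properties)
        return (200 if existed else 201), {"changed": changed}

    def delete(self, name):
        """DELETE: removing an absent resource is not an error."""
        existed = self._resources.pop(name, None) is not None
        return 200, {"changed": existed}


store = ResourceStore()
print(store.put("nginx.conf", {"mode": "0644"}))  # (201, {'changed': True})
print(store.put("nginx.conf", {"mode": "0644"}))  # (200, {'changed': False})
```

The second PUT leaves the store exactly as it was, which is what makes it safe for the client to retry a request.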

Client

The client is responsible for the end-to-end configuration process, acting as the orchestrator.

Key Responsibilities

- Provide the user interface
- Load, parse, and validate the overall declarative configuration
- Understand dependencies
- Determine the execution plan
- Execute the plan
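The "understand dependencies" and "determine the execution plan" steps could be sketched with Python's standard graphlib module (Python 3.9+). The depends_on key and the resource names are hypothetical. The only assumption is that each resource lists the resources it needs first.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(resources: dict) -> list[str]:
    """Return resource names ordered so that dependencies come first.

    `resources` maps a name to a spec with an optional "depends_on" list.
    """
    graph = {
        name: set(spec.get("depends_on", []))
        for name, spec in resources.items()
    }
    # A dependency that was never declared is most likely a typo.
    unknown = {dep for deps in graph.values() for dep in deps} - graph.keys()
    if unknown:
        raise ValueError(f"unknown dependencies: {sorted(unknown)}")
    # static_order() raises CycleError if the graph contains a cycle.
    return list(TopologicalSorter(graph).static_order())


resources = {
    "nginx-service": {"depends_on": ["nginx-config"]},
    "nginx-config": {"depends_on": ["nginx-package"]},
    "nginx-package": {},
}
print(execution_order(resources))
# ['nginx-package', 'nginx-config', 'nginx-service']
```

Because the client owns this ordering, the server never needs to know that one resource depends on another.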

Features

This architecture gives me a solid foundation to build on: one that's simple, flexible and designed to grow with the project.

Looking ahead, there are a few features I’m especially interested in exploring:

- Execution planning – visualize and control what happens before it happens
- Clear diffs – make changes visible and traceable before they’re applied
- Safe failure handling – support recovery and rollback without complex boilerplate
- Backup functionality – local and remote backups built into the workflow
- Graph-based orchestration – use a dependency graph with topological sorting
- Parallel execution – speed things up by running independent tasks in parallel
- Hybrid configuration – support both human-friendly formats and a programmable layer for more advanced use cases
- Strong security support – mTLS, certificate-based auth
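To give a feel for how graph-based orchestration and parallel execution could fit together, here is a small Python sketch that runs each batch of independent resources on a thread pool. The function name, the batch-by-batch strategy and the example resource names are assumptions for illustration. A real scheduler could start a task as soon as its own dependencies finish instead of waiting for the whole batch.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def apply_in_parallel(dependencies: dict, apply, max_workers: int = 4) -> list:
    """Apply resources batch by batch and return the batches that ran.

    `dependencies` maps a resource name to the set of names it depends on.
    `apply` is called once per resource name, possibly from several threads.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    batches = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Everything returned here has all of its dependencies done.
            ready = sorted(sorter.get_ready())
            # list() waits for the batch and re-raises the first failure.
            list(pool.map(apply, ready))
            sorter.done(*ready)
            batches.append(ready)
    return batches


dependencies = {
    "config": {"package"},
    "cert": set(),
    "service": {"config", "cert"},
}
print(apply_in_parallel(dependencies, lambda name: None))
# [['cert', 'package'], ['config'], ['service']]
```

Here cert and package have no dependencies on each other, so they run in the same batch, while service waits until both config and cert are done.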

That’s the plan for now. We’ll see where it goes.