Unixd MultiResolver Support

Up until July 2024 the purpose and motivation of the Kanidm Unixd component (unix_integration in the source tree) was to allow Unix-like platforms to authenticate and resolve users against a Kanidm instance.

However, throughout 2023 and 2024 this project has expanded in scope - from the addition of TPM support to protect cached credentials (the first pam module to do so!), to use of the framework by himmelblau to enable Azure AD authentication.

We also have new features we want to add including LDAP backends (as an alternative to SSSD), the ability to resolve local system users, as well as support for PIV and CTAP2 for desktop login.

This has pushed the current design of the resolver to it's limits, and it's increasingly challenging to improve it as a result. This will necesitate a major rework of the project.

Current Architecture

                                                    ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐

                ┌───────┐  ┌───────┐  ┌───────┐     │ ┌───────────────────┐     │
                │       │  │       │  │       │       │                   │
                │  NSS  │  │  PAM  │  │  CLI  │     │ │   Tasks Daemon    │     │
                │       │  │       │  │       │       │                   │
                └───────┘  └───────┘  └───────┘     │ └───────────────────┘     │
                    ▲          ▲          ▲                     ▲
            ─ ─ ─ ─ ┼ ─ ─ ─ ─ ─│─ ─ ─ ─ ─ ┼ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ┼ ─ ─ ─ ─ ─ ─ ─ ┤
            │       ▼          ▼          ▼                     │
                ┌─────────────────────────────┐           ┌───────────┐         │
            │   │                             │           │           │
                │         ClientCodec         │           │   Tasks   │         │
            │   │                             │           │           │
                └─────────────────────────────┘           └───────────┘         │
┌ ─ ─ ─ ─ ─ ┘                  ▲                                ▲
                               │                                │               │
│                              ▼                                │
  ┌───────────────┐      ┌────────────────────────────────┐     │               │
│ │               │      │                                │     │
  │  Kani Client  │◀────▶│      Daemon / Event Loop       │─────┘               │
│ │               │      │                                │
  └───────────────┘      └────────────────────────────────┘                     │
│                                   ▲                 ▲
                                    │                 │                         │
│                                   ▼                 ▼
                          ┌──────────────────┐   ┌────────┐                     │
│                         │                  │   │        │
                          │    DB / Cache    │   │  TPM   │                     │
│                         │                  │   │        │
                          └──────────────────┘   └────────┘                     │
│
 ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ── ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴

The current design treated the client as a trivial communication layer. The daemon/event loop contained all state including if the resolver was online or offline. Additionally the TPM and password caching operations primarily occurred in the daemon layer, which limited the access of these features to the client backend itself.

Future Features

Files Resolver

The ability to resolve and authenticate local users from /etc/{passwd,group,shadow}. The classic mechanisms to resolve this are considered "slow" since they require a full-file-parse each operation.

In addition, these files are limited by their formats and can not be extended with future authentication mechanisms like CTAP2 or PIV.

Unixd already needs to parse these files to understand and prevent conflicts between local items and remote ones. Extending this functionality will allow us to resolve local users from memory.

Not only this, we need to store information permanently that goes beyore what /etc/passwd and similar can store. It would be damaging to users if their CTAP2 (passkeys) were deleted randomly on a cache clear!

Local Group Extension

An often requested feature is the ability to extend a local group with the members from a remote group. Often people attempt to achieve this by "overloading" a group remotely such as creating a group called "wheel" in Kanidm and then attempting to resolve it on their systems. This can introduce issues as different distributions may have the same groups but with different gidNumbers which can break systems, or it can cause locally configured items to be masked.

Instead, we should allow group extension. A local group can be nominated for extension, and paired to a remote group. For example this could be configured as:

[group."wheel"]
extend_from = "opensuse_wheel"

This allows the local group "wheel" to be resolved and extended with the members from the remote group opensuse_wheel.

Multiple Resolvers

We would like to support multiple backends simultaneously and in our source tree. This is a major motivator of this rework as the himmelblau project wishes to contribute their client layer into our source tree, while maintaining the bulk of their authentication code in a separate libhimmelblau project.

We also want to support LDAP and other resolvers too.

The major challenge here is that this shift the cache state from the daemon to the client. This requires each client to track it's own online/offline state and to self-handle that state machine correctly. Since we won't allow dynamic modules this mitigates the risk as we can audit all the source of interfaces committed to the project for correct behaviour here.

Resolvers That Can't Resolve Without Authentication Attempts

Some resolvers are unable to resolve accounts without actually attempting an authentication attempt such as Himmelblau. This isn't a limitation of Himmelblau, but of Azure AD itself.

This has consequences on how we performance authentication flows generally.

Domain Joining of Resolvers

Some Clients (and in the future Kanidm) need to be able to persist some state related to Domain Joining, where the client registers to the authentication provider. This provides extra functionality beyond the scope of this document, but a domain join work flow needs to be possible for the providers in some manner.

Encrypted Caches

To protect caches from offline modification content should be optionally encrypted / signed in the future.

CTAP2 / TPM-PIN

We want to allow local authentication with CTAP2 or a TPM with PIN. Both provide stronger assurances of both who the user is, and that they are in possession of a specific cryptographic device. The nice part of this is that they both implement hardware bruteforce protections. For soft-tpm we can emulate this with a strict bruteforce lockout prevention mechanism.

The weakness is that PIN's which are used on both CTAP2 and TPM, tend to be shorter, ranging from 4 to 8 characters, generally numeric. This makes them unsuitable for remote auth.

This means for SSH without keys, we must use a passphrase or similar instead. We must not allow SSH auth with PIN to a TPM as this can easily become a remote DOS via the bruteforce prevention mechanism.

This introduces it's own challenge - we are now juggling multiple potential credentials and need to account for their addition and removal, as well as changing.

Another significant challenge is that linux is heavily embedded in "passwords as the only factor" meaning that many systems are underdeveloped like gnome keyring - this expects stacked pam modules to unlock the keyring as it proceeds.

Local Users

Local Users will expect on login equivalent functionality that /etc/passwd provides today, meaning that local wallets and keyrings are unlocked at login. This necesitates that any CTAP2 or TPM unlock need to be able to unlock the keyring.

This also has implications for how users expect to interact with the feature. A user will expect that changing their PIN will continue to allow unlock of their system. And a change of the users password should not invalidate their existing PIN's or CTAP devices. To achieve this we will need some methods to cryptographically protect credentials and allow these updates.

To achieve this, we need to make the compromise that the users password must be stored in a reversible form on the system. Without this, the various wallets/keyrings won't work. This trade is acceptable since pam_kanidm is already a module that handles password material in plaintext, so having a mechanism to securely retrieve this while the user is entering equivalent security material is reasonable.

The primary shift is that rather than storing a kdf/hash of the users output, we will be storing an authenticated encrypted object where valid decryption of that object is proof that the password matches.

For the soft-tpm, due to PIN's short length, we will need to aggressively increase the KDF rounds and consider HMAC of the output.

                                                  HMAC-Secret
      Password                  PIN                 output
          │                      │                     │
          │                      │                     │
          │                      │                     │
          ▼                      ▼                     ▼
┌──────────────────┐    ┌─────────────────┐   ┌─────────────────┐
│                  │    │                 │   │                 │
│       KDF        │    │   PIN Object    │   │   CTAP Object   │
│                  │    │                 │   │                 │
└──────────────────┘    └─────────────────┘   └─────────────────┘
          │                      │     ▲               │   ▲
          │                      │     │               │   │
          │        Releases      │                     │
          ├───────KDF value──────┴─────┼───────────────┘   │
          │
          │                            │                   │
          ▼
┌──────────────────┐                   │                   │
│                  │
│  Sealed Object   │                   │                   │
│                  │─ ─ ─ ─Unlocks─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
│                  │
└──────────────────┘
          │
       Release
      Password
          │
          ▼
┌──────────────────┐
│                  │
│pam_gnome_keyring │
│   pam_kwallet    │
│                  │
└──────────────────┘

Remote Users (such as Kanidm)

After a lot of thinking, the conclusion we arrived at is that trying to handle password stacking for later pam modules is out of scope at this time.

Initially, remote users will be expected to have a password they can use to access the system. In the future we may derive a way to distribute TPM PIN objects securely to domain joined machines.

We may allow PINs to be set on a per-machine basis, rather than syncing them via the remote source.

This would require that a change of the password remotely invalidates set PINs unless we think of some way around this.

We also think that in the case of things like password managers such as desktop wallets, these should have passwords that are the concern of the user, not the remote auth source so that our IDM has no knowledge of the material to unlock these.

Challenges

  • The order of backend resolvers needs to be stable.
  • Accounts/Groups should not move/flip-flop between resolvers.
  • Resolvers need to uniquely identify entries in some manner.
  • The ability to store persistent/permananent accounts in the DB that can not be purged in a cache clear.
  • Simplicity of the interfaces for providers so that they don't need to duplicate effort.
  • Ability to clear single items from the cache rather than a full clear.
  • Resolvers that can't pre-resolve items

New Architecture

                                                    ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐

                ┌───────┐  ┌───────┐  ┌───────┐     │ ┌───────────────────┐     │
                │       │  │       │  │       │       │                   │
                │  NSS  │  │  PAM  │  │  CLI  │     │ │   Tasks Daemon    │     │
                │       │  │       │  │       │       │                   │
                └───────┘  └───────┘  └───────┘     │ └───────────────────┘     │
                    ▲          ▲          ▲                     ▲
┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┼ ─ ─ ─ ─ ─│─ ─ ─ ─ ─ ┼ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ─ ┼ ─ ─ ─ ─ ─ ─ ─ ┤
                    ▼          ▼          ▼                     │
│               ┌─────────────────────────────┐           ┌───────────┐         │
                │                             │           │           │
│               │         ClientCodec         │           │   Tasks   │         │
  ┌──────────┐  │                             │           │           │
│ │          │  └─────────────────────────────┘           └───────────┘         │
  │  Files   │◀────┐           ▲                                ▲
│ │          │     │           │                                │               │
  └──────────┘     │           ▼                                │
│ ┌───────────────┐│     ┌────────────────────────────────┐     │               │
  │               │└─────┤                                │     │
│ │  Kani Client  │◀─┬──▶│      Daemon / Event Loop       │─────┘               │
  │               │  │   │                                │
│ └───────────────┘◀─│┐  └────────────────────────────────┘                     │
  ┌───────────────┐  │                           ▲
│ │               │  ││                          │                              │
  │  LDAP Client  │◀─┤                           ▼
│ │               │  ││    ┌────────┐  ┌──────────────────┐                     │
  └───────────────┘◀ ┼     │        │  │                  │
│ ┌───────────────┐  │└ ─ ─│  TPM   │  │    DB / Cache    │                     │
  │  Himmleblau   │  │     │        │  │                  │
│ │    Client     │◀─┘     └────────┘  └──────────────────┘                     │
  │               │
└ ┴───────────────┴ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┴

Online/Offline State Machine

The major change that that this diagram may not clearly show is that the online/offline state machine moves into each of the named clients (excluding files). This also has some future impacts on things like pre-emptive item reloading and other task scheduling. This will require the backends to "upcall" into the daemon, as the TPM transaction needs to be passed from the daemon back down to the provider. Alternately, the provider needs to be able to register scheduled tasks into the daemon with some generic interface.

Resolution Flow

The most important change is that with multiple resolvers we need to change how accounts resolve. In pseudo code the "online" flow (ignoring caches) is:

if files.contains(item_id):
    if item_id.is_extensible:
        # This only seeks items from the providers, not files for extensibility.
        item += resolver.get(item_id.extended_from)
    return item

# Providers are sorted by priority.
for provider in providers:
    if provider.contains(item_id)
        return item

return None

Key points here:

  • One provider is marked as "default".
  • Providers are sorted by priority from highest to lowest.
  • Default always sorts as "highest".
  • The default provider returns items with Name OR SPN.
  • Non-default providers always return by SPN.

Once at item is located it is then added to the cache. The provider marks the item with a cache timeout that the cache respects. The item is also marked to which provider is the origin of the item.

Once an item-id exists in the cache, it may only be serviced by the corresponding origin provider. This prevents an earlier stacked provider from "hijacking" an item from another provider. Only if the provider indicates the item no longer exists OR the cache is cleared of that item (either by single item or full clear) can the item change provider as the item goes through the general resolution path.

If we consider these behaviours now with the cache, the flow becomes:

def resolve:
  if files.contains(item_id):
      if item_id.is_extensible:
          # This only seeks items from the providers, not files for extensibility.
          item += resolver.get(item_id.extended_from)
      return item

  resolver.get(item_id)


def resolver.get:
  # Hot Path
  if cache.contains(item):
      if item.expired:
          provider = item.provider
          # refresh if possible
          let refreshed_item = provider.refresh(item)
          match refreshed_item {
             Missing => break; # Bail and let other providers have at it.
             Offline => Return the cached item
             Updated => item = refreshed_item
          };

          return item

  # Cold Path
  #
  # Providers are sorted by priority. Default provider is first.
  #
  # Providers are looped *only* if an item isn't already in
  # the cache in some manner.
  let item = {
    for provider in providers:
        if provider.contains(item_id)
            if provider.is_default():
                item.id = name || spn
            else:
                item.id = spn
            break item
  }

  cache.add(item)

  return None

Cache and Database Persistence

The existing cache has always been considered ephemeral and able to be deleted at any time. With a move to Domain Join and other needs for long term persistence our cache must now also include elements that are permanent.

The existing cache of items also is highly limited by the fact that we "rolled our own" db schema and rely on sqlite heavily.

We should migrate to a primarily in-memory cache, where sqlite is used only for persistence. The sqlite content should be optionally able to be encrypted by a TPM bound key.

To obfuscate details, the sqlite db should be a single table of key:value where keys are uuids associated to the item. The uuid is a local detail, not related to the provider.

The cache should move to a concread based concurrent tree which will also allow us to multi-thread the resolver for high performance deployments. Mail servers is an often requested use case for Kanidm in this space.

Extensible Entries

Currently UserToken and GroupTokens are limited and are unable to contain provider specific keys. We should allow a generic BTreeMap of Key:Values. This will allow providers to extend entries as required

Offline Password/Credential Caching

The caching of credentials should move to be a provider specific mechanism supported by the presence of extensible UserToken entries. This also allows other types of credentials to be stored that can be consumed by the User.

Alternate Credential Caching

A usecase is that for application passwords a mail server may wish to cache and locally store the application password. Only domain joined systems should be capable of this, and need to protect the application password appropriately.